Predibase MCP for AI. Manage LLM Inference, Tuning, and Metrics from your Agent.

Q: How do I check if my LLM endpoint is working with gethealth?

You call the gethealth tool on your deployment ID. It returns a simple status code and message, telling you instantly if the endpoint is up for traffic or if it's throwing errors.

Q: Can I use generatetext with my custom fine-tuned model?

Yes. The generatetext tool allows you to target specific deployments and even dynamically apply LoRA adapters via parameters, ensuring you run the precise version of the model you intend.

Q: What is the difference between completion and chatcompletion?

chatcompletion uses a message structure (system, user, assistant) designed for conversational flow. completion provides a simpler, traditional text completion format.

Q: What kind of metadata can I retrieve using getinfo about my inference endpoint?

getinfo retrieves details like the deployed model name, its version number, and general operational parameters. It's useful for confirming your setup before running complex tasks.

Claude

ChatGPT

Cursor

Gemini

Windsurf

VS Code

JetBrains

Vercel

See Vinkius in Action

Works with every AI agent you already use

…and any MCP-compatible client

How this MCP server connects to your AI agent

Predibase handles high-performance LLM serving and fine-tuning right through your AI agent. It lets you run inference, classify batches of text, and monitor deployment health using tools like `generate_text` and `get_metrics`.

You query deployed models—whether they're for chat, standard completion, or structured JSON output—all without leaving the conversation window.

What AI agents can do with Predibase (LLM Serving & Finetuning) Automation

Chat completion

Generates conversational responses compatible with OpenAI's chat message format.

Classify

Runs batch classification tasks, assigning structured labels to one or more input texts.

Completion

Creates standard text completions using a deployed LLM model.

+ 4 more capabilities included

Run Text Generation (Inference)

You generate text or chat responses using generate_text, chat_completion, or completion tools against a specific deployed model.

Classify Batches of Data

The agent runs the classify tool to assign structured labels (like sentiment or category) to multiple pieces of input text at once.

Check Endpoint Health and Status

You call get_health to confirm if the LLM endpoint is operational, which is critical before running any heavy jobs.

Retrieve Performance Metrics

The agent uses get_metrics to pull live Prometheus data on throughput and resource usage for the deployment.

Get Deployment Metadata

You run get_info to check details about a specific model endpoint, like its version or configuration.

Ask an AI about this

Included with Plan

Waiting for input…

AI Agent

What AI agents can do with Predibase (LLM Serving & Finetuning) MCP Server: 7 Tools for AI Ops

These tools allow your agent to manage the full lifecycle of LLMs—from checking endpoint health and retrieving metrics to running structured classification and generating text.

Make your AI actually useful.

Add this MCP to Claude, Cursor, or Windsurf and your AI stops guessing. It gets real tools to look things up, take action, and handle the stuff you keep doing by hand.

Start using Predibase (LLM Serving & Finetuning) on Vinkius

Chat Completion

Generates conversational responses compatible with OpenAI's chat message format.

Classify

Runs batch classification tasks, assigning structured labels to one or more input...

Completion

Creates standard text completions using a deployed LLM model.

Generate Text

Generates plain text content by calling an active, deployed Large Language Model...

Get Health

Checks the operational status of a specific LLM inference endpoint to confirm it is...

Get Info

Retrieves metadata about an LLM deployment, such as its current version or configuration details.

Get Metrics

Pulls Prometheus metrics for the deployment, detailing performance data like request counts and latency.

Security and governance baked right in.

Pick your AI client below to get set up. Just create a Vinkius account, subscribe, and you're instantly up and running. We handle the entire backend infrastructure, delivering out-of-the-box support for HTTPS Streamable, SSE, and OAuth2—zero messy routing required.

Claude AI

Open Claude Settings

Go to claude.ai, click your profile icon, then navigate to Customize → Connectors.

Add Custom Connector

Click the "+" button and select Add custom connector. Paste your Vinkius endpoint URL:

https://edge.vinkius.com/[YOUR_TOKEN_HERE]/mcp

Replace [YOUR_TOKEN_HERE] with your token from cloud.vinkius.com. For OAuth-protected servers, expand Advanced settings to add credentials.

Start a conversation

Open a new chat. The Predibase integration is available immediately — no restart needed.

Antigravity

Configure Agent Environment

Open your Antigravity agent's workspace configuration or mcp-servers.json file.

Bind the Endpoint

Add the Vinkius endpoint URL to your agent's MCP connections list:

"mcp_servers": {
  "predibase-llm-serving-finetuning": {
    "serverUrl": "https://edge.vinkius.com/[YOUR_TOKEN_HERE]/mcp"
  }
}

Provide your secure token in place of [YOUR_TOKEN_HERE] to ensure your agent requests are authenticated.

Execute

Start your Antigravity session. The agent will autonomously discover and utilize the Predibase tools with full Vinkius guardrails applied.

VS Code Copilot

⚡

One-Click Install (Recommended)

In your Vinkius Dashboard, simply click the Add to VS Code button for this server. We'll automatically configure your local workspace.

Or configure manually

Open MCP Settings

Open VS Code, press Ctrl/Cmd + Shift + P, and search for GitHub Copilot: MCP Servers.

Add Server Config

Add the Vinkius endpoint configuration to your mcp-servers.json file:

"predibase-llm-serving-finetuning": {
  "url": "https://edge.vinkius.com/[YOUR_TOKEN_HERE]/mcp"
}

Ensure you replace [YOUR_TOKEN_HERE] with your token from cloud.vinkius.com.

LangChain

Install Dependencies

Install the LangChain MCP adapters for your environment:

pip install langchain-mcp-adapters

Connect the Server

Use the SSEClient in LangChain to connect to the Vinkius managed endpoint:

from langchain_mcp_adapters.client import SSEClient

# Connect to Vinkius
client = SSEClient(url="https://edge.vinkius.com/[YOUR_TOKEN_HERE]/mcp")
tools = client.get_tools()

CrewAI

Define the Tool

Load the Vinkius MCP tools into your CrewAI agents:

from crewai import Agent
from mcp_crewai import MCPTool

# Connect securely to Vinkius
vinkius_tools = MCPTool(url="https://edge.vinkius.com/[YOUR_TOKEN_HERE]/mcp")

# Assign to Agent
researcher = Agent(
    role='Data Researcher',
    tools=vinkius_tools.get_all()
)

Execute Task

Run your CrewAI process. The agent will autonomously route tasks to the Vinkius managed server.

Choose How to Get Started

Build a custom MCP for your own tools, or connect a ready-made integration from our catalog.

Build Your Own

Turn any API into an MCP. Import a spec, define Agent Skills, or deploy with MCPFusion.

Import from OpenAPI, Swagger, or YAML specs
Create Agent Skills with progressive disclosure
Deploy to edge with MCPFusion framework
Built in DLP, auth, and compliance on every call
Real time usage dashboard and cost metering
Publish to catalog or keep private

Start building

Make Your AI Do More

Start with Predibase (LLM Serving & Finetuning), then connect any of our 5,100+ other servers whenever your AI needs more. One click, no limits.

Use this MCP plus 5,100+ others, all in one place
Add new capabilities to your AI anytime you want
Every connection is secured and compliant automatically
Track usage and costs across all your servers
Works with Claude, ChatGPT, Cursor, and more
New servers added to the catalog every week

Independent Platform Disclaimer: Vinkius is an independent platform and is not affiliated with, endorsed by, sponsored by, verified by, or otherwise authorized by Predibase. All third-party trademarks, logos, and brand names are the property of their respective owners. Their use on this website is strictly for informational purposes to identify service compatibility and interoperability.

VINKIUS INFRASTRUCTURE

Cloud Hosted

Managed infra

V8 Isolated

Sandboxed per request

Zero-Trust Proxy

No stored credentials

DLP Enforced

Policy on every call

GDPR Compliant

EU data residency

Token Compression

~60% cost reduction

Your data is protected. See how we built it.

Built on the Model Context Protocol (MCP) for Claude, ChatGPT, Cursor, and more

The Model Context Protocol standardizes how applications expose capabilities to LLMs. Instead of operating in isolation, your AI gains direct access to external platforms, live data, and real-world actions through secure, standardized connections.

This connection provides 7 powerful capabilities that interface natively with Claude, ChatGPT, Cursor, and other compatible AI platforms. No middleware. No custom integration required.

Running custom ML models shouldn't feel like logging into a dashboard just to check status., Solved with Vinkius AI Gateway

Today, if your agent needs to run an inference job, you usually have to jump through hoops. You send the prompt, wait for it to process, and then, if the result is critical, you have to leave the chat, navigate to a separate ML Ops dashboard, and manually check the endpoint's health or resource utilization via another system. It’s tedious, slow, and prone to context switching.

With this MCP server, all that visibility stays in one place. You run `get_health` directly through your agent. If it fails, you know immediately. If it succeeds, you can proceed with the actual task using `generate_text`. The whole process runs seamlessly within your chat interface.

Use the `classify` tool for structured data extraction.

Manually running classification tasks means taking a list of texts, pasting them into a separate UI, and waiting for results. If you have hundreds or thousands of items, this process is non-linear and requires manual API calls that are hard to track.

The `classify` tool changes this. You provide the inputs, specify your model, and get structured, actionable labels back in bulk—all from one command. It makes batch processing reliable.

Support 24/7 support@vinkius.com ↗

Security Vinkius Trust Center ↗

SLA Service Level Agreement ↗

Report Listing Send Report ↗

llm-serving

fine-tuning

inference

machine-learning

ai-ops

What your AI can actually do with this

Look, you don't wanna juggle API keys for every model you run. This MCP server connects your agent right to a managed endpoint, so you can use deployed LLMs and monitoring tools without leaving your conversation window or managing external credentials.

When it comes to generating text, you’ve got three specific ways to call the model depending on what you need. If you want plain, raw content—like drafting a simple paragraph or extracting a block of code—you use generate_text. This calls an active LLM endpoint and spits out pure text. If you're doing standard, general-purpose completions, completion is your tool; it takes a prompt and gives back the next sequence of tokens.

But if you’re handling conversation—like building a chatbot or simulating a dialogue—you gotta use chat_completion. This one formats responses to match OpenAI’s specific chat message structure, which handles roles like 'user' and 'assistant' correctly.

Beyond simple generation, the server handles structured data tasks. You can run batch classification jobs using the classify tool. Instead of processing text piece by piece, this runs a single command that assigns structured labels—like sentiment or category codes—to multiple inputs at once.

For keeping tabs on your deployments and making sure everything's running smooth, you use dedicated monitoring tools. Before you run any heavy job, you check the operational status with get_health. This confirms if a specific LLM endpoint is actually up and ready to take requests. If you need background details, get_info pulls metadata about the deployment—you can grab stuff like its current version or core configuration settings.

When performance matters, you pull live data using get_metrics. This tool grabs Prometheus metrics directly from the deployment, giving you hard numbers on throughput and latency. You'll see request counts over time and exactly how many milliseconds a typical query takes to process.

Predibase gives your agent a single source for everything: running inference against fine-tuned deployments, forcing model responses into reliable JSON schemas for downstream automation, classifying large batches of text, and pulling real operational data on performance. You're not managing external keys; you’re just calling tools that work directly within the framework.

Built · Hosted · Managed by Vinkius Predibase MCP Server - LLM Inference & Tuning Tools

Server ID 019e5d4b-0c59-7386-9a75-7419cdeadf2f

Vinkius Inspector

Compliance Grade D

Score 55/100

Report View Report ↗

Here's how it actually works

The bottom line is you use your AI client to call specific tools that interact with your managed LLM endpoints.

First, you subscribe this server and provide your Predibase API Token and Tenant ID.

When the task is ready, you prompt your AI client to perform an action (e.g., 'Classify these reviews').

The agent invokes the correct tool (classify, generate_text, etc.), executes the call against your deployed models, and returns structured results directly in the chat.

Who is this actually for?

This server targets people who actually deploy and maintain models. Think of the ML Engineer whose job it is to prove a model works under load, or the Data Scientist trying to build an internal feature that needs structured output every time. You're here because you can't afford guesswork when your LLM deployment fails.

Machine Learning Engineer

You use get_metrics and get_health to monitor production endpoints and test fine-tuning updates before going live.

Data Scientist

You run the classify tool in batch mode, feeding it thousands of records from a CSV or list to generate structured insights for reporting.

Application Developer

You integrate text generation into an app using generate_text, forcing JSON output so your backend can reliably parse the results.

What Changes When You Connect

Structured Output: Instead of getting unstructured text, you force model responses into specific JSON schemas. This makes the output reliable for immediate downstream automation logic. No more parsing headaches.

Live Monitoring: You don't have to check a separate dashboard. The get_metrics tool pulls performance data (like requests/minute or latency) directly so your agent can report on model load instantly.

Fine-Tuning Integration: Apply specific LoRA adapters during inference using the parameters in generation tasks. This lets you use specialized, custom versions of models without redeploying everything.

Batch Classification: Handling multiple inputs is simple with the classify tool. You feed it a list of reviews or documents and get structured labels back for every single item.

Reliability Checks: Before running anything expensive, you call get_health. This confirms the endpoint isn't down or having resource issues—a must-do before production use.

See it in action

01 01

Analyzing Customer Feedback Sentiment

A data scientist needs to process 500 new support tickets. Instead of writing a Python script and running it locally, they tell their agent: 'Run classify on these 500 inputs using the sentiment model.' The server runs the tool and returns structured JSON with Positive/Negative labels for every ticket.

02 02

Prototyping in Chat

An AI engineer wants to see if their new llama-3-70b fine-tune works before committing to a deployment. They prompt the agent: 'Summarize this article using the generate_text tool on my llama-3-70b deployment.' The server handles the inference and returns the summary right there in the chat.

03 03

Pre-Flight Deployment Check

Before a major business process starts, an operations team member checks the system. They prompt: 'What is the current status of the fraud detection model?' The agent calls get_health and immediately reports if the endpoint is green or red.

04 04

Building Complex Agents

An application developer needs a multi-step process. First, they use get_info to verify the model version. Then, they use chat_completion with the correct parameters and finally enforce JSON output to build a structured database record.

The honest tradeoffs

Treating LLMs like local scripts

Anti-pattern

Trying to run a complex, memory-intensive model inference job directly inside the chat window without checking resource limits first. This often leads to cryptic failure messages and wasted time.

The Fix

Always check operational status first. Run get_health before attempting any generation task. If you're concerned about load, review get_metrics to confirm throughput capacity.

Assuming unstructured output is okay

Anti-pattern

Requesting the agent to summarize data and accepting a plain text block. This forces manual parsing later, which breaks automation pipelines.

The Fix

Always enforce JSON schemas when possible. Use generate_text with structured output parameters or rely on the model's native JSON generation capability.

Ignoring specific deployment needs

Anti-pattern

Asking for general text completion (completion) when your task requires a specialized, fine-tuned version of the model. The generic tool won't have access to your custom weights.

The Fix

Use generate_text and explicitly specify your deployed endpoint ID or adapter parameters in the request payload.

When It Fits, When It Doesn't

You should use this server if your workflow requires controlled, observable calls to deployed LLMs. Specifically:

✅ Use this if: You need to run inference (generate_text, chat_completion) on a model you've already fine-tuned or deployed via Predibase; you need guaranteed structured output (JSON); or you must monitor the model's operational status using get_health and get_metrics. This is for production integration.

❌ Don't use this if: You just need to test a simple prompt with an open-source model without monitoring or deployment management. If your task only involves basic data lookups that don't require generative AI, skip the LLM tools entirely and stick to direct database connectors instead.

Questions you might have

How do I check if my LLM endpoint is working with `get_health`? +

You call the get_health tool on your deployment ID. It returns a simple status code and message, telling you instantly if the endpoint is up for traffic or if it's throwing errors.

Can I use `generate_text` with my custom fine-tuned model? +

Yes. The generate_text tool allows you to target specific deployments and even dynamically apply LoRA adapters via parameters, ensuring you run the precise version of the model you intend.

Is there a way to force JSON output using this server? +

Absolutely. You can enforce strict JSON schemas when calling generation tools like generate_text. This makes the output predictable and easy for your application code to consume without parsing errors.

What is the difference between `completion` and `chat_completion`? +

chat_completion uses a message structure (system, user, assistant) designed for conversational flow. completion provides a simpler, traditional text completion format.

How do I use `get_metrics` to check the real-time performance of my LLM deployment? +

It returns Prometheus metrics that cover things like request count, throughput, and latency. You can analyze this data to understand how your model performs under load.

When I call `classify`, what input structure do I need if I'm processing a large batch of items? +

You must provide an array or list of inputs. The tool processes these in bulk, returning corresponding results for every single item you submit.

If I want to switch between different fine-tuned versions, how do I use `generate_text` with a specific LoRA adapter? +

You specify the desired model version by passing the adapter ID in the generation parameters. This tells the endpoint exactly which weights to apply for that request.

What kind of metadata can I retrieve using `get_info` about my inference endpoint? +

get_info retrieves details like the deployed model name, its version number, and general operational parameters. It's useful for confirming your setup before running complex tasks.

Can I use my fine-tuned adapters with this server? +

Yes. When using the generate_text tool, you can provide an adapter_id to apply your specific fine-tuned LoRA adapter to the base model deployment.

How do I monitor the performance of my Predibase deployment? +

Use the get_metrics tool to scrape Prometheus-formatted metrics or get_info to retrieve metadata like model ID and device type.

Does this support structured JSON responses? +

Absolutely. The generate_text tool includes a schema parameter that allows you to pass a JSON schema to ensure the model output follows a specific structure.

How this MCP server connects to your AI agent

What AI agents can do with Predibase (LLM Serving & Finetuning) Automation

Chat completion

Classify

Completion

What AI agents can do with Predibase (LLM Serving & Finetuning) MCP Server: 7 Tools for AI Ops

Chat Completion

Generates conversational responses compatible with OpenAI's chat message format.

Classify

Runs batch classification tasks, assigning structured labels to one or more input...

Completion

Creates standard text completions using a deployed LLM model.

Generate Text

Generates plain text content by calling an active, deployed Large Language Model...

Get Health

Checks the operational status of a specific LLM inference endpoint to confirm it is...

Get Info

Retrieves metadata about an LLM deployment, such as its current version or configuration details.

Get Metrics

Pulls Prometheus metrics for the deployment, detailing performance data like request counts and latency.

Security and governance baked right in.

Claude AI

Open Claude Settings

Add Custom Connector

Start a conversation

Claude Code

Open your terminal

Add the MCP Server

Start coding

Cursor

One-Click Install (Recommended)

Open Cursor Settings

Add New Server

Use in Composer

Antigravity

Configure Agent Environment

Bind the Endpoint

Execute

VS Code Copilot

One-Click Install (Recommended)

Open MCP Settings

Add Server Config

Windsurf

One-Click Install (Recommended)

Open Windsurf Settings

Add Server Endpoint

LangChain

Install Dependencies

Connect the Server

CrewAI

Define the Tool

Execute Task

Choose How to Get Started

Build Your Own

Make Your AI Do More

Built on the Model Context Protocol (MCP) for Claude, ChatGPT, Cursor, and more

Running custom ML models shouldn't feel like logging into a dashboard just to check status., Solved with Vinkius AI Gateway

Use the `classify` tool for structured data extraction.

llm-serving

fine-tuning

inference

machine-learning

ai-ops

What your AI can actually do with this

Here's how it actually works

Who is this actually for?

What Changes When You Connect

Analyzing Customer Feedback Sentiment

Prototyping in Chat

Pre-Flight Deployment Check

Building Complex Agents

The honest tradeoffs

Treating LLMs like local scripts

Assuming unstructured output is okay

Ignoring specific deployment needs

When It Fits, When It Doesn't

Questions you might have