# Predibase MCP

> Predibase handles high-performance LLM serving and fine-tuning right through your AI agent. It lets you run inference, classify batches of text, and monitor deployment health using tools like `generate_text` and `get_metrics`. You query deployed models—whether they're for chat, standard completion, or structured JSON output—all without leaving the conversation window.

## Overview
- **Category:** developer-tools
- **Price:** Free
- **Tags:** llm-serving, fine-tuning, inference, machine-learning, ai-ops

## Description

Look, you don't wanna juggle API keys for every model you run. This MCP server connects your agent right to a managed endpoint, so you can use deployed LLMs and monitoring tools without leaving your conversation window or managing external credentials.

When it comes to generating text, you’ve got three specific ways to call the model depending on what you need. If you want plain, raw content—like drafting a simple paragraph or extracting a block of code—you use `generate_text`. This calls an active LLM endpoint and spits out pure text. If you're doing standard, general-purpose completions, `completion` is your tool; it takes a prompt and gives back the next sequence of tokens. But if you’re handling conversation—like building a chatbot or simulating a dialogue—you gotta use `chat_completion`. This one formats responses to match OpenAI’s specific chat message structure, which handles roles like 'user' and 'assistant' correctly.

Beyond simple generation, the server handles structured data tasks. You can run batch classification jobs using the `classify` tool. Instead of processing text piece by piece, this runs a single command that assigns structured labels—like sentiment or category codes—to multiple inputs at once. 

For keeping tabs on your deployments and making sure everything's running smooth, you use dedicated monitoring tools. Before you run any heavy job, you check the operational status with `get_health`. This confirms if a specific LLM endpoint is actually up and ready to take requests. If you need background details, `get_info` pulls metadata about the deployment—you can grab stuff like its current version or core configuration settings.

When performance matters, you pull live data using `get_metrics`. This tool grabs Prometheus metrics directly from the deployment, giving you hard numbers on throughput and latency. You'll see request counts over time and exactly how many milliseconds a typical query takes to process.

Predibase gives your agent a single source for everything: running inference against fine-tuned deployments, forcing model responses into reliable JSON schemas for downstream automation, classifying large batches of text, and pulling real operational data on performance. You're not managing external keys; you’re just calling tools that work directly within the framework.

## Tools

### chat_completion
Generates conversational responses compatible with OpenAI's chat message format.

### classify
Runs batch classification tasks, assigning structured labels to one or more input texts.

### completion
Creates standard text completions using a deployed LLM model.

### generate_text
Generates plain text content by calling an active, deployed Large Language Model endpoint.

### get_health
Checks the operational status of a specific LLM inference endpoint to confirm it is running correctly.

### get_info
Retrieves metadata about an LLM deployment, such as its current version or configuration details.

### get_metrics
Pulls Prometheus metrics for the deployment, detailing performance data like request counts and latency.

## Prompt Examples

**Prompt:** 
```
Generate a summary of this text using the 'llama-3-70b' deployment.
```

**Response:** 
```
I'll use the `generate_text` tool on your 'llama-3-70b' deployment. Processing the input prompt now...
```

**Prompt:** 
```
Check the health and metrics for my 'customer-support-llm' deployment.
```

**Response:** 
```
I am calling `get_health` and `get_metrics` for 'customer-support-llm'. The endpoint is currently healthy and processing 12 requests per minute.
```

**Prompt:** 
```
Classify these three reviews using our sentiment model deployment.
```

**Response:** 
```
Running the `classify` tool for your inputs. Results: Review 1 (Positive), Review 2 (Negative), Review 3 (Neutral).
```

## Capabilities

### Run Text Generation (Inference)
You generate text or chat responses using `generate_text`, `chat_completion`, or `completion` tools against a specific deployed model.

### Classify Batches of Data
The agent runs the `classify` tool to assign structured labels (like sentiment or category) to multiple pieces of input text at once.

### Check Endpoint Health and Status
You call `get_health` to confirm if the LLM endpoint is operational, which is critical before running any heavy jobs.

### Retrieve Performance Metrics
The agent uses `get_metrics` to pull live Prometheus data on throughput and resource usage for the deployment.

### Get Deployment Metadata
You run `get_info` to check details about a specific model endpoint, like its version or configuration.

## Use Cases

### Analyzing Customer Feedback Sentiment
A data scientist needs to process 500 new support tickets. Instead of writing a Python script and running it locally, they tell their agent: 'Run `classify` on these 500 inputs using the sentiment model.' The server runs the tool and returns structured JSON with Positive/Negative labels for every ticket.

### Prototyping in Chat
An AI engineer wants to see if their new `llama-3-70b` fine-tune works before committing to a deployment. They prompt the agent: 'Summarize this article using the `generate_text` tool on my llama-3-70b deployment.' The server handles the inference and returns the summary right there in the chat.

### Pre-Flight Deployment Check
Before a major business process starts, an operations team member checks the system. They prompt: 'What is the current status of the fraud detection model?' The agent calls `get_health` and immediately reports if the endpoint is green or red.

### Building Complex Agents
An application developer needs a multi-step process. First, they use `get_info` to verify the model version. Then, they use `chat_completion` with the correct parameters and finally enforce JSON output to build a structured database record.

## Benefits

- **Structured Output:** Instead of getting unstructured text, you force model responses into specific JSON schemas. This makes the output reliable for immediate downstream automation logic. No more parsing headaches.
- **Live Monitoring:** You don't have to check a separate dashboard. The `get_metrics` tool pulls performance data (like requests/minute or latency) directly so your agent can report on model load instantly.
- **Fine-Tuning Integration:** Apply specific LoRA adapters during inference using the parameters in generation tasks. This lets you use specialized, custom versions of models without redeploying everything.
- **Batch Classification:** Handling multiple inputs is simple with the `classify` tool. You feed it a list of reviews or documents and get structured labels back for every single item.
- **Reliability Checks:** Before running anything expensive, you call `get_health`. This confirms the endpoint isn't down or having resource issues—a must-do before production use.

## How It Works

The bottom line is you use your AI client to call specific tools that interact with your managed LLM endpoints.

1. First, you subscribe this server and provide your Predibase API Token and Tenant ID.
2. When the task is ready, you prompt your AI client to perform an action (e.g., 'Classify these reviews').
3. The agent invokes the correct tool (`classify`, `generate_text`, etc.), executes the call against your deployed models, and returns structured results directly in the chat.

## Frequently Asked Questions

**How do I check if my LLM endpoint is working with `get_health`?**
You call the `get_health` tool on your deployment ID. It returns a simple status code and message, telling you instantly if the endpoint is up for traffic or if it's throwing errors.

**Can I use `generate_text` with my custom fine-tuned model?**
Yes. The `generate_text` tool allows you to target specific deployments and even dynamically apply LoRA adapters via parameters, ensuring you run the precise version of the model you intend.

**Is there a way to force JSON output using this server?**
Absolutely. You can enforce strict JSON schemas when calling generation tools like `generate_text`. This makes the output predictable and easy for your application code to consume without parsing errors.

**What is the difference between `completion` and `chat_completion`?**
`chat_completion` uses a message structure (system, user, assistant) designed for conversational flow. `completion` provides a simpler, traditional text completion format.

**How do I use `get_metrics` to check the real-time performance of my LLM deployment?**
It returns Prometheus metrics that cover things like request count, throughput, and latency. You can analyze this data to understand how your model performs under load.

**When I call `classify`, what input structure do I need if I'm processing a large batch of items?**
You must provide an array or list of inputs. The tool processes these in bulk, returning corresponding results for every single item you submit.

**If I want to switch between different fine-tuned versions, how do I use `generate_text` with a specific LoRA adapter?**
You specify the desired model version by passing the adapter ID in the generation parameters. This tells the endpoint exactly which weights to apply for that request.

**What kind of metadata can I retrieve using `get_info` about my inference endpoint?**
`get_info` retrieves details like the deployed model name, its version number, and general operational parameters. It's useful for confirming your setup before running complex tasks.

**Can I use my fine-tuned adapters with this server?**
Yes. When using the `generate_text` tool, you can provide an `adapter_id` to apply your specific fine-tuned LoRA adapter to the base model deployment.

**How do I monitor the performance of my Predibase deployment?**
Use the `get_metrics` tool to scrape Prometheus-formatted metrics or `get_info` to retrieve metadata like model ID and device type.

**Does this support structured JSON responses?**
Absolutely. The `generate_text` tool includes a `schema` parameter that allows you to pass a JSON schema to ensure the model output follows a specific structure.