# Braintrust MCP

> Braintrust helps developers systematically test and validate LLMs. You manage projects, track prompt versions, run complex benchmark experiments, and query structured 'Ground Truth' data—all within one place. Stop guessing if your model works; prove it.

## Overview
- **Category:** brain-trust
- **Price:** Free
- **Tags:** ai-evaluation, llm-benchmarking, prompt-engineering, model-testing, ai-observability, data-analytics

## Description

Building reliable AI models means more than just writing a single good prompt. It demands rigorous testing across multiple variables. This MCP lets you set up formal evaluation pipelines right from your agent, giving you full visibility into exactly how the model behaves under pressure. You can track specific variable distributions and compare outputs against historical benchmarks without ever leaving your chat window. Need to check if a new feature broke an old response pattern? Use this MCP. If you're building anything complex for production, connecting it through the Vinkius catalog is the right move. It lets you turn vague model performance anxiety into concrete data points.

## Tools

### create_experiment
Records a new experiment so that LLM pipeline test runs can be tracked over time.

### create_project
Sets up a new project environment for tracking AI evaluations and data sets.

### list_datasets
Lists available 'Ground Truth' text banks used for automated evaluation scoring.

### list_env_vars
Lists the Braintrust AI Gateway configuration, including the environment variables that hold model API keys.

### list_experiments
Retrieves all recorded evaluation experiments, mapping out model test scores and metrics.

### get_dataset
Retrieves a specific dataset, including the structured records that LLM outputs are checked against.

### get_prompt
Retrieves a prompt's literal text template and the variables it uses.

### insert_dataset_row
Adds a new test case as a row in an existing dataset.

### list_projects
Lists all existing AI evaluation projects configured in Braintrust.

### list_prompts
Retrieves a list of system prompts that are explicitly version-controlled inside Braintrust.
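
For orientation, an MCP client invokes any of the tools above with a JSON-RPC 2.0 `tools/call` request. The sketch below only builds that request body and sends nothing. The `project_name` argument is a hypothetical placeholder, since this listing does not document each tool's input schema.

```python
import json
from itertools import count
from typing import Optional

# JSON-RPC request ids should be unique within a session.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: Optional[dict] = None) -> dict:
    """Build the body of an MCP `tools/call` request for the given tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments or {}},
    }


# `project_name` is an assumed argument name, not taken from the tool's schema.
request = build_tool_call("list_datasets", {"project_name": "Onboarding V2"})
print(json.dumps(request, indent=2))
```

In practice your AI client builds and sends these requests for you; the sketch is only meant to show what a tool invocation looks like on the wire.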

## Prompt Examples

**Prompt:** 
```
List all active test datasets configured under Braintrust.
```

**Response:** 
```
I've fetched your Ground Truth repositories. There's 1 dataset active under ID 4a83b9c named 'Support-Responses-Testing'. Should I list the rows nested there?
```

**Prompt:** 
```
Look up prompt template using specific ID XYZ.
```

**Response:** 
```
Prompt XYZ was retrieved successfully. It uses the {{user}} variable and contains strict instructions enforcing a professional tone. The prompt version is 1.0.4. Do you need further metadata?
```

**Prompt:** 
```
Analyze recent experiments across multiple models testing behavior.
```

**Response:** 
```
I pulled the recent experiment history. Experiment run V3 reached a 94% alignment score. Compared with the previously logged V2 baseline, the main differences are in false positives.
```

## Capabilities

### Track Model Performance
Run formal experiments that record and compare LLM outputs against historical runs.

### Manage Test Data Sets
Query accurate, structured 'Ground Truth' data sets to score model responses automatically.

### Version Prompt Templates
Securely grab and compare specific versions of system prompts without touching the core code base.

### Organize Evaluation Scope
Create isolated projects to keep different model test runs separate and clean.

## Use Cases

### Validating a new onboarding flow
A product manager needs to know if the LLM handles edge-case user input correctly. They use `create_project` to isolate 'Onboarding V2'. Then they run multiple tests against data retrieved with `get_dataset` and record the results with `create_experiment`. This shows whether the new flow regresses on previously fixed bugs.

### A/B testing prompt variations
An ML engineer needs to compare two versions of a summarization prompt. They use `list_prompts` to find both versions and `get_prompt` to retrieve each template, then call `create_experiment` twice, feeding the same inputs into both setups to measure which one reaches the better alignment score.
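
The comparison step can be sketched as follows. This is a minimal illustration, not Braintrust's scoring logic: `exact_match` is a stand-in scorer, and the outputs are assumed to have been collected already by running both prompt versions over the same inputs.

```python
from statistics import mean


def exact_match(output: str, expected: str) -> float:
    """Return 1.0 when the texts match, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def compare_variants(outputs_a, outputs_b, expected):
    """Score two prompt variants against the same expected answers."""
    if not (len(outputs_a) == len(outputs_b) == len(expected)):
        raise ValueError("both variants must be scored on the same inputs")
    score_a = mean(exact_match(o, e) for o, e in zip(outputs_a, expected))
    score_b = mean(exact_match(o, e) for o, e in zip(outputs_b, expected))
    if score_a > score_b:
        winner = "a"
    elif score_b > score_a:
        winner = "b"
    else:
        winner = "tie"
    return {"a": score_a, "b": score_b, "winner": winner}
```

Holding the inputs fixed is the point of the design: any score difference can then be attributed to the prompt change rather than to the test data.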

### Debugging production failures
A developer notices a drop in performance. They use `list_env_vars` and `get_dataset` to retrieve the exact model configuration and the specific ground truth data that failed, which narrows down where the issue lies.

### Archiving model versions
A team is retiring an old version of their chatbot. They use `list_projects` to find its project and `list_experiments` to see every evaluation run tied to it, ensuring no historical test data or metrics are lost before decommissioning the model.

## Benefits

- Track prompt changes without digging through code. Use the `list_prompts` tool to find version-controlled prompts and test them without touching your core system.
- Keep test coverage current. Use `insert_dataset_row` to append new test cases to an existing dataset, so later evaluations include them.
- See performance changes clearly. Running `list_experiments` gives you all historical runs, letting you compare model scores and metrics side by side.
- Maintain clean separation of concerns. Use `create_project` to isolate different types of testing—say, one project for user onboarding flows and another for admin commands.
- Know exactly what data you're using. By calling `list_datasets`, you see all your 'Ground Truth' repositories before writing a single test case.
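
Comparing the scores of two experiment runs can be sketched as below. The dictionaries of metric names and values are an assumed, simplified shape for illustration; the actual structure returned by `list_experiments` is not documented here, and the numbers in the usage lines are made up.

```python
def score_deltas(baseline: dict, candidate: dict) -> dict:
    """Return candidate-minus-baseline for every metric both runs report."""
    shared = baseline.keys() & candidate.keys()
    return {
        metric: round(candidate[metric] - baseline[metric], 4)
        for metric in sorted(shared)
    }


# Illustrative values only, not real experiment results.
older_run = {"alignment": 0.90, "latency_s": 1.5}
newer_run = {"alignment": 0.94}
print(score_deltas(older_run, newer_run))
```

Metrics reported by only one of the runs are skipped, so a newly added metric does not show up as a spurious regression.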

## How It Works

Instead of scrolling through massive spreadsheets, your agent queries Braintrust directly for evaluation data and results.

1. Add this MCP to your AI client, then configure your Braintrust API credentials.
2. Tell the agent what you want to benchmark, such as a new prompt version or an existing project scope.
3. The agent calls the matching tools and reports the results, such as experiment scores and regressions, directly in the chat.

## Frequently Asked Questions

**How do I start testing my model with Braintrust using `create_project`?**
You first call `create_project` to establish the boundaries for your tests. This gives you a clean, isolated environment that prevents new test runs from contaminating existing project data.

**What is the difference between `get_dataset` and `list_datasets`?**
`list_datasets` shows you all available 'Ground Truth' text banks. You then use `get_dataset` to pull a specific, structured dataset for active testing.

**How do I track changes to my prompt templates with Braintrust?**
Use the `list_prompts` tool to see all version-controlled prompts. You can then call `get_prompt` to retrieve a specific template ID, ensuring you test against an exact version.
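
Once a template has been retrieved, filling in its variables can be sketched as below. This assumes the template uses double-brace placeholders like the `{{user}}` tag shown in the prompt example earlier; the `render` helper is hypothetical and is not part of this MCP.

```python
import re

# Matches placeholders such as {{user}} or {{ user }}.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with its value from `variables`."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing template variable: {name}")
        return str(variables[name])

    return _PLACEHOLDER.sub(substitute, template)


print(render("Reply to {{user}} in a professional tone.", {"user": "Ada"}))
```

Failing loudly on a missing variable keeps a test from silently running against a half-filled prompt.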

**Can I add custom failed tests using Braintrust?**
Yes. After running a batch of tests, you use `insert_dataset_row` to manually append new failure cases or specific edge-case inputs into your dataset matrix for future runs.
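
Turning a failed test into a new row can be sketched as below. The `input`, `expected`, and `metadata` field names are assumptions about the row shape, not this tool's documented schema, and `failure_to_row` is a hypothetical helper.

```python
def failure_to_row(user_input: str, expected_output: str, observed_output: str) -> dict:
    """Package a failed test case as a row to pass to `insert_dataset_row`."""
    if observed_output == expected_output:
        raise ValueError("not a failure: observed output matches expected output")
    return {
        "input": user_input,
        "expected": expected_output,
        # Keep the wrong answer so later runs can check for the same regression.
        "metadata": {"source": "failed-test", "observed": observed_output},
    }


row = failure_to_row("Reset my password", "Send reset link", "I cannot help with that")
print(row)
```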

**How do I check which API keys are configured for Braintrust using `list_env_vars`?**
`list_env_vars` shows you all the current gateway configuration variables. This is how your agent confirms which model API keys are available without manual setup.

**If I want to review previous test runs, what does `list_experiments` retrieve?**
`list_experiments` retrieves a comprehensive map of all past evaluation attempts. This lets you check historical metrics and model scores across various run IDs.

**Can I use `insert_dataset_row` to append just a single test case into my matrix?**
Yes, that's exactly what it does. You can target a specific dataset and inject new evaluation data row by row without having to build an entire master sheet first.

**Before starting a new project, how do I use `list_projects` to see current evaluations?**
`list_projects` gives you the list of all existing AI evaluation containers. This helps you confirm your scope and choose the right environment for your next test.

**Can I insert new test data dynamically?**
Yes. With `insert_dataset_row`, you can add structured JSON test cases directly to a dataset so that they are included in later evaluations.

**Does it retrieve the original prompt definitions stored in Braintrust?**
Yes. The `get_prompt` tool returns the version-controlled template and its parameters exactly as they are stored in Braintrust.

**How deeply can it inspect test regressions or scores?**
With `list_experiments`, you can compare how different LLM versions behaved across many runs and spot performance anomalies between them.