Braintrust MCP. Automate AI model evaluation and data tracking.
Works with every AI agent you already use
…and any MCP-compatible client
Just plug in your AI agents and start using Vinkius.
Braintrust. This server lets your AI agent run deep logic checks on model outputs. You can manage projects, create test experiments, and query Ground Truth datasets.
It's for developers who need to systematically test and track AI model performance against specific, versioned data sets and prompts.
It's an observability layer for building reliable AI.
What your AI agents can do
You can establish new historical traces to record and compare different versions of your LLM pipeline tests.
The server lets you create isolated project environments specifically for tracking and organizing AI evaluation datasets.
Query specific datasets that contain exact schemas needed to bound and test LLM outputs against known correct answers.
Append new test cases directly into a dataset matrix, targeting specific model evaluations for comparison.
Retrieve exact variable contexts and literal text templates for a prompt, ensuring you test against the intended text.
Access lists of all available datasets, projects, experiments, and prompts for auditing.
Braintrust MCP Server: 10 Tools for AI Evaluation
These tools let you manage the entire lifecycle of model testing: creating projects, querying ground truth data, and tracking historical performance traces.
create_experiment
Creates a new historical experiment trace to record and track LLM pipeline tests.
create_project
Sets up a new project environment for tracking and organizing AI evaluations and datasets.
get_dataset
Retrieves a specific dataset that contains defined schemas for bounding LLM outputs.
get_prompt
Retrieves the exact variable contexts and literal text templates used in a prompt.
insert_dataset_row
Appends a new test case row into a dataset matrix for specific evaluation scoring.
list_datasets
Lists all isolated Ground Truth text banks used for automated evaluation scoring.
list_env_vars
Checks the Braintrust AI Gateway configurations, listing managed model API keys.
list_experiments
Gets a list of all evaluation experiments, mapping model test scores and metrics.
list_projects
Retrieves the list of all active AI evaluation projects within Braintrust.
list_prompts
Gets a list of all explicitly version-controlled system prompts stored in Braintrust.
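For orientation, here is a rough sketch of what calling one of these tools looks like on the wire. MCP clients invoke tools with a JSON-RPC 2.0 `tools/call` request; the helper name `tool_call` and the `name` argument passed to create_project are assumptions for illustration, so check the tool's published input schema before relying on them.

```python
import json
from itertools import count
from typing import Optional

# Simple incrementing request ids, as JSON-RPC requires a unique id per request.
_request_ids = count(1)


def tool_call(name: str, arguments: Optional[dict] = None) -> dict:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


# The "name" argument is hypothetical; the real input schema may differ.
request = tool_call("create_project", {"name": "support-bot-evals"})
print(json.dumps(request, indent=2))
```

In practice your MCP client builds and sends this message for you; the sketch only shows the shape of the request.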
Choose How to Get Started
Build a custom MCP for your own tools, or connect a ready-made integration from our catalog.
Build Your Own
Turn any API into an MCP. Import a spec, define Agent Skills, or deploy with MCPFusion.
- Import from OpenAPI, Swagger, or YAML specs
- Create Agent Skills with progressive disclosure
- Deploy to edge with MCPFusion framework
- Built in DLP, auth, and compliance on every call
- Real time usage dashboard and cost metering
- Publish to catalog or keep private
Make Your AI Do More
Start with Braintrust, then connect any of our 4,700+ other servers whenever your AI needs more. One click, no limits.
- Use this MCP plus 4,700+ others, all in one place
- Add new capabilities to your AI anytime you want
- Every connection is secured and compliant automatically
- Track usage and costs across all your servers
- Works with Claude, ChatGPT, Cursor, and more
- New servers added to the catalog every week
What you can do with this MCP connector
This server lets your AI agent run deep logic checks on model outputs. You'll use it to systematically test and track AI model performance against specific, versioned data sets and prompts. You'll manage projects, create test experiments, and query Ground Truth datasets. You'll use create_project to set up isolated project environments for tracking and organizing AI evaluation datasets.
You'll use list_projects to see every active AI evaluation project in Braintrust. You'll use create_experiment to establish new historical traces, recording and comparing different versions of your LLM pipeline tests. You'll use list_experiments to get a list of all evaluation experiments, mapping model test scores and metrics. You'll use get_dataset to retrieve specific datasets, providing defined schemas for bounding LLM outputs.
You'll use list_datasets to see all isolated Ground Truth text banks available for automated evaluation scoring. You'll use insert_dataset_row to append new test cases directly into a dataset matrix, targeting specific model evaluations for comparison. You'll use get_prompt to retrieve the exact variable contexts and literal text templates used in a prompt, making sure you test against the intended text.
You'll use list_prompts to get a list of all explicitly version-controlled system prompts stored in Braintrust. You'll use list_env_vars to check the Braintrust AI Gateway configurations, listing managed model API keys.
How Braintrust MCP Works
1. Add the Braintrust server to your AI cluster.
2. Bind your personal Braintrust API ID variables.
3. Your agent runs complex model tuning pipelines, querying native AI logic regressions right in the chat.
The bottom line: your agent handles the semantic checking and data logging through Braintrust's infrastructure, so you don't have to review tables manually.
Who Is Braintrust MCP For?
This is for the ML Engineer tired of manually compiling regression reports. It's for the Data Scientist who needs to build massive test matrices without running dozens of scripts. If you're testing AI outputs against strict rules, this is your layer. It moves testing from a CLI script to a chat command.
Tracks specific variable distributions and checks accurate regressions remotely across different model versions.
Pushes Ground Truth evaluation text datasets on the fly to test how changing prompts affects model output.
Constructs massive test matrices and evaluates multiple test runs without writing custom script queries.
Observes exact string prompts dynamically, pushing features and validating response styles before deployment.
What Changes When You Connect
- Track model drift and regressions using `create_experiment`. Instead of manually comparing output logs, you generate a traceable historical record of how the model's performance changes over time.
- Keep your test data clean with `list_datasets` and `get_dataset`. You query isolated Ground Truth text banks, ensuring your evaluation is always run against the correct, audited standard.
- Manage prompt changes safely with `list_prompts` and `get_prompt`. You grab perfectly frozen semantic prompts, so you never accidentally test against a partially edited or unstable version of the instructions.
- Organize large tests using `create_project`. You keep all evaluation assets (datasets, experiments, prompts) in one dedicated, isolated project environment, preventing data bleed between testing efforts.
- Validate test data immediately with `insert_dataset_row`. You don't just run the test; you can append new, specific test cases to the matrix right from your agent's response.
- Audit your setup with `list_env_vars`. You check the Braintrust AI Gateway to confirm which model API keys are actively managed and accessible for the current run.
Real-World Use Cases
Validating a new prompt style
A Product Team needs to validate if a new feature description maintains a professional tone. They use list_prompts to find the base template, then use get_prompt to grab the exact text. They run the test, and the agent reports the adherence score, letting them know the prompt worked without manual checks.
Tracking model degradation
An ML Engineer suspects the model is drifting. They use list_experiments to pull up the last three runs. They then run create_experiment with the new data, comparing the resulting trace against the old metrics to pinpoint exactly where the performance dropped.
Building a comprehensive data test set
A Data Scientist needs a massive matrix. They first use list_datasets to find the source, then use get_dataset to pull the schema. Finally, they use insert_dataset_row to add 50 new, specific test cases to the matrix before running the full evaluation.
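As a sketch of that last step, the snippet below turns 50 (input, expected) pairs into one insert_dataset_row call each. The argument names `dataset_id`, `input`, and `expected`, the placeholder id `ds_placeholder`, and the helper `build_row_inserts` are all assumptions for illustration; the actual tool schema may use different fields.

```python
def build_row_inserts(dataset_id: str, cases: list[tuple[str, str]]) -> list[dict]:
    """Turn (input, expected) pairs into one insert_dataset_row call each."""
    calls = []
    for question, answer in cases:
        calls.append({
            "name": "insert_dataset_row",
            # Hypothetical argument names; confirm against the tool's schema.
            "arguments": {
                "dataset_id": dataset_id,
                "input": question,
                "expected": answer,
            },
        })
    return calls


# Fifty toy test cases, generated so the example stays self-contained.
cases = [(f"What is {n} + {n}?", str(n + n)) for n in range(1, 51)]
calls = build_row_inserts("ds_placeholder", cases)
print(len(calls), calls[0]["arguments"])
```

An agent would normally issue these calls for you; building them explicitly is mainly useful when you want to review the test cases before they land in the dataset.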
Comparing different model versions
A developer wants to compare Model A vs. Model B on the same data. They use create_project to isolate the test, list_datasets to confirm the data source, and then run the comparison, logging the results in a new experiment trace.
The Tradeoffs
Running tests manually in a script
The developer writes a Python script that calls three different APIs (one for data, one for project, one for prompt) and has to manually handle state passing and error logging for every single call.
→
Instead, let your agent manage the flow. Use create_project first, then use list_datasets to identify the source, and finally call create_experiment to wrap the entire process in one traceable, single command.
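The flow above can be sketched as a small function that chains the three tools and passes the project id from the first call into the last. Everything about the response shapes (`id`, `datasets`, `name`) and the argument names is assumed for illustration, and `fake_call_tool` is a stand-in for a real MCP client so the example runs on its own.

```python
from typing import Callable

# Any function that takes (tool_name, arguments) and returns the parsed result.
ToolCaller = Callable[[str, dict], dict]


def run_evaluation_flow(call_tool: ToolCaller, project_name: str, experiment_name: str) -> dict:
    """Chain the three tools, passing the project id from step 1 into step 3."""
    project = call_tool("create_project", {"name": project_name})
    datasets = call_tool("list_datasets", {})
    experiment = call_tool(
        "create_experiment",
        {"project_id": project["id"], "name": experiment_name},
    )
    return {
        "project_id": project["id"],
        "dataset_names": [d["name"] for d in datasets["datasets"]],
        "experiment_id": experiment["id"],
    }


def fake_call_tool(name: str, arguments: dict) -> dict:
    """Stand-in for a real MCP client; returns canned, made-up responses."""
    canned = {
        "create_project": {"id": "proj_1"},
        "list_datasets": {"datasets": [{"name": "golden-answers"}]},
        "create_experiment": {"id": "exp_1"},
    }
    return canned[name]


summary = run_evaluation_flow(fake_call_tool, "tone-checks", "prompt-v2")
print(summary)
```

Injecting the caller keeps the state passing in one place and makes the flow easy to test without network access.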
Forgetting the data source
The team assumes the test data is in the main database, but the schema has changed. They run the test, but the results are garbage because the data wasn't versioned or isolated.
→
Always start by running list_datasets to find the isolated Ground Truth bank. Then use get_dataset to ensure you're pulling the exact, correct schema required for the test.
Modifying prompts directly
A developer tweaks the core prompt text in the codebase, but doesn't realize the change broke the required professional tone, and the test fails silently.
→
Use list_prompts to view all version-controlled prompts. When you need a template, use get_prompt to pull the exact, frozen version, guaranteeing the test runs against the intended instructions.
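To illustrate why pulling the exact template matters, here is a minimal sketch that fills a prompt template and refuses to run if a variable is missing. It assumes mustache-style `{{variable}}` placeholders and a plain-text template; the format get_prompt actually returns may differ, and the helper names are hypothetical.

```python
import re

# Assumed placeholder syntax: {{variable_name}}
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def template_variables(template: str) -> set[str]:
    """Return the set of variable names the template expects."""
    return set(PLACEHOLDER.findall(template))


def render(template: str, values: dict[str, str]) -> str:
    """Fill every placeholder, failing loudly if any variable is missing."""
    missing = template_variables(template) - set(values)
    if missing:
        raise KeyError(f"missing prompt variables: {sorted(missing)}")
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], template)


template = "Describe {{feature}} in a {{tone}} tone."
rendered = render(template, {"feature": "offline mode", "tone": "professional"})
print(rendered)
```

Failing on a missing variable is a deliberate choice here: a silently half-filled prompt is exactly the kind of unstable instruction text the frozen version is meant to prevent.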
When It Fits, When It Doesn't
Use Braintrust if your core job is measuring and tracking model performance against defined, stable truth. You need to know why the model changed, not just that it changed. Use this if you need to:
1. Isolate a test run (create_project).
2. Query specific, versioned data (get_dataset).
3. Compare results over time (create_experiment).
Don't use this if your goal is simple data retrieval or basic CRUD operations. If you just need to list all available projects or prompts, list_projects and list_prompts are simpler starting points. If you only need to run a test once and don't care about history, a simple script might suffice, but you lose the auditing capability.
Independent Platform Disclaimer: Vinkius is an independent platform and is not affiliated with, endorsed by, sponsored by, verified by, or otherwise authorized by Braintrust. All third-party trademarks, logos, and brand names are the property of their respective owners. Their use on this website is strictly for informational purposes to identify service compatibility and interoperability.
VINKIUS INFRASTRUCTURE
- Cloud Hosted: Managed infra
- V8 Isolated: Sandboxed per request
- Zero-Trust Proxy: No stored credentials
- DLP Enforced: Policy on every call
- GDPR Compliant: EU data residency
- Token Compression: ~60% cost reduction
Works with Claude, ChatGPT, Cursor, and more
The Model Context Protocol standardizes how applications expose capabilities to LLMs. Instead of operating in isolation, your AI gains direct access to external platforms, live data, and real-world actions through secure, standardized connections.
This server provides 10 capabilities that interface natively with Claude, ChatGPT, Cursor, and any MCP client. No middleware. No custom integration required.
Debugging AI outputs shouldn't mean juggling a dozen tabs and APIs.
Today, testing a model's logic is a manual mess. You copy the prompt template into a notebook, run the code, then copy the output into a spreadsheet. Then you check the version control to make sure you're using the right dataset schema, all while hoping you didn't accidentally modify the core prompt files.
With the Braintrust MCP Server, you tell your agent to check the logic. It uses `list_datasets` and `get_dataset` to pull the right data. It uses `get_prompt` to grab the exact instructions. The agent runs the test and spits out the full, auditable result, no copy-pasting required.
Braintrust MCP Server: Pinpoint model regressions instantly.
You no longer have to run a test, wait for the results, and then try to manually compare the new output to the previous week's result. You simply call `create_experiment` with the new data. The agent handles the comparison, showing you the exact percentage shift and the fields that broke.
The difference is history. You gain a dedicated, auditable record of every test run, making model debugging a repeatable, single-command process.
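To make "percentage shift" concrete, the sketch below compares two sets of scores and flags the metrics that dropped. It is only an illustration of the arithmetic, not Braintrust's comparison logic; the metric names, the score values, and the 5% tolerance are made up.

```python
def score_shifts(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Percentage change per metric, for metrics present in both runs."""
    shifts = {}
    for metric, old in baseline.items():
        # Skip metrics missing from the new run, and avoid dividing by zero.
        if metric in candidate and old != 0:
            shifts[metric] = round((candidate[metric] - old) / old * 100, 1)
    return shifts


def regressions(shifts: dict[str, float], tolerance_pct: float = 5.0) -> list[str]:
    """Metrics that dropped by more than the tolerance, sorted by name."""
    return sorted(m for m, s in shifts.items() if s < -tolerance_pct)


# Made-up scores for two runs of the same evaluation.
last_week = {"accuracy": 0.80, "tone": 0.90, "format": 0.50}
this_week = {"accuracy": 0.60, "tone": 0.90, "format": 0.55}

shifts = score_shifts(last_week, this_week)
print(shifts)
print(regressions(shifts))
```

A tolerance keeps small run-to-run noise from being reported as a regression; pick a value that matches how stable your scores normally are.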
Common Questions About Braintrust MCP
How do I use Braintrust to check if my model output is accurate?
You use get_dataset to retrieve the specific dataset containing the required Ground Truth schemas. This ensures your model output is scored against the correct, audited standard.
Can I track multiple model versions with Braintrust MCP Server?
Yes. You use create_project to isolate the environment, and then list_experiments and create_experiment to track and compare multiple model runs over time.
What is the best way to test a new prompt template using Braintrust MCP Server?
First, use list_prompts to see available templates. Then, use get_prompt to pull the exact text. This guarantees your test runs against the version you intended.
How do I add new test cases to my dataset?
You run insert_dataset_row to append new test cases directly into the dataset matrix without modifying the underlying source data.
How do I manage my API credentials using the list_env_vars tool in Braintrust?
Use the list_env_vars tool to probe the Braintrust AI Gateway configurations. This confirms that your model API keys are managed securely within the system.
What happens if I try to run an experiment with an invalid dataset ID using Braintrust?
The system returns an explicit error detailing the invalid ID and the necessary format. This prevents the execution of flawed historical trace boundaries.
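At the MCP protocol level, a failed tool call is generally reported as a result with `isError` set to true and the message carried in a text content item. The sketch below shows one way a client could surface that; the helper name `unwrap_tool_result` and the error text are placeholders, since the exact wording this server returns is not documented here.

```python
class ToolCallError(Exception):
    """Raised when an MCP tool result is flagged as an error."""


def unwrap_tool_result(result: dict) -> str:
    """Return the text of a tool result, or raise if it is marked as an error."""
    text = "\n".join(
        part.get("text", "")
        for part in result.get("content", [])
        if part.get("type") == "text"
    )
    if result.get("isError"):
        raise ToolCallError(text or "tool call failed without a message")
    return text


# Placeholder result; the real error message will differ.
failed = {
    "isError": True,
    "content": [{"type": "text", "text": "invalid dataset id (placeholder message)"}],
}

try:
    unwrap_tool_result(failed)
except ToolCallError as err:
    print(f"experiment not created: {err}")
```

Raising on `isError` stops the flow before an experiment is recorded against a dataset that does not exist.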
Can Braintrust handle large-scale dataset matrices for evaluation?
Yes, Braintrust supports constructing massive matrices. You can append test cases using insert_dataset_row to evaluate large datasets without running script queries.
How can I retrieve version-controlled prompts using the list_prompts tool in Braintrust?
The list_prompts tool retrieves all explicitly version-controlled system prompts. You'll get access to the exact variable contexts and literal text templates.
Can I insert new test data dynamically?
Yes. With insert_dataset_row, you can add a new test case, supplied as a JSON payload, directly to the dataset used for evaluation.
Does it retrieve the original prompt definitions stored in Braintrust?
Yes. get_prompt returns the version-controlled prompt, including its variable contexts and literal template text, as stored in Braintrust.
How deeply can it inspect test regressions or scores?
With list_experiments, you can list evaluation experiments along with their test scores and metrics, then compare model versions across runs to spot performance anomalies.
Use it with your favorite AI tools
Connect this server to Cursor, Claude, VS Code, and more.