Braintrust MCP. Stop guessing if your model broke after an update.
Works with every AI agent you already use
…and any MCP-compatible client
Just plug in your AI agents and start using Vinkius.
Braintrust lets you stop guessing if your model broke. Connect this MCP to run structured tests, track prompt changes, and benchmark AI logic against specific ground truth datasets.
It's for developers who need proof that their LLM output meets strict quality standards every single time.
What your AI agents can do
Create experiment
Sets up a new historical experiment trace to record specific LLM pipeline tests and metrics.
Create project
Initializes an isolated project environment for tracking AI evaluations and related datasets.
Get dataset
Fetches a specific dataset containing predefined schemas for bounding LLM outputs.
Create new containers to organize and track multiple related AI testing efforts.
Execute isolated test runs, appending unique scores and metrics for every model iteration.
Query or append specific datasets that define the perfect, expected output for your models to measure against.
Save and track exact prompt text templates so you can compare older versions without changing core code.
Retrieve comprehensive lists of past experiments, showing which metrics were tracked across different model runs.
Supported MCP Clients
OAuth 2.0 Compatible
Braintrust: 10 Tools for AI Evaluation
These tools let you manage the entire lifecycle of a model test, from setting up a project to running benchmarks and scoring results.
Make your AI actually useful.
Add this MCP to Claude, Cursor, or Windsurf and your AI stops guessing. It gets real tools to look things up, take action, and handle the stuff you keep doing by hand.
Start using Braintrust on Vinkius
`create_experiment`
Sets up a new historical experiment trace to record specific LLM pipeline tests and metrics.
`create_project`
Initializes an isolated project environment for tracking AI evaluations and related datasets.
`get_dataset`
Fetches a specific dataset containing predefined schemas for bounding LLM outputs.
`get_prompt`
Retrieves the exact variable contexts and literal text templates used in a given prompt version.
`insert_dataset_row`
Adds new test cases into an existing dataset matrix targeting specific evaluations for scoring.
`list_datasets`
Lists all isolated Ground Truth text banks used specifically for automated evaluation scoring.
`list_env_vars`
Probes the Braintrust AI Gateway configurations, showing model API keys and setup variables securely.
`list_experiments`
Retrieves a list of all evaluation experiments, detailing historical model test scores and metrics.
`list_projects`
Gets the complete list of all AI evaluation projects currently running in Braintrust.
`list_prompts`
Retrieves a record of explicitly version-controlled system prompts isolated within Braintrust.
Choose How to Get Started
Build a custom MCP for your own tools, or connect a ready-made integration from our catalog.
Build Your Own
Turn any API into an MCP. Import a spec, define Agent Skills, or deploy with MCPFusion.
- Import from OpenAPI, Swagger, or YAML specs
- Create Agent Skills with progressive disclosure
- Deploy to edge with MCPFusion framework
- Built in DLP, auth, and compliance on every call
- Real time usage dashboard and cost metering
- Publish to catalog or keep private
Make Your AI Do More
Start with Braintrust, then connect any of our 4,800+ other servers whenever your AI needs more. One click, no limits.
- Use this MCP plus 4,800+ others, all in one place
- Add new capabilities to your AI anytime you want
- Every connection is secured and compliant automatically
- Track usage and costs across all your servers
- Works with Claude, ChatGPT, Cursor, and more
- New servers added to the catalog every week
Independent Platform Disclaimer: Vinkius is an independent platform and is not affiliated with, endorsed by, sponsored by, verified by, or otherwise authorized by Braintrust. All third-party trademarks, logos, and brand names are the property of their respective owners. Their use on this website is strictly for informational purposes to identify service compatibility and interoperability.
VINKIUS INFRASTRUCTURE
- Cloud Hosted: Managed infra
- V8 Isolated: Sandboxed per request
- Zero-Trust Proxy: No stored credentials
- DLP Enforced: Policy on every call
- GDPR Compliant: EU data residency
- Token Compression: ~60% cost reduction
The Model Context Protocol standardizes how applications expose capabilities to LLMs. Instead of operating in isolation, your AI gains direct access to external platforms, live data, and real-world actions through secure, standardized connections.
This server provides 10 capabilities that interface natively with Claude, ChatGPT, Cursor, and any MCP client. No middleware. No custom integration required.
Manually validating model logic is a nightmare of tabs and spreadsheets.
Right now, checking your model's behavior feels like forensic accounting. You copy the prompt into an agent, run it, and check the output. Then you manually update a spreadsheet with the expected result, the Ground Truth. If you change one variable, you repeat the whole process 50+ times, copying everything and hoping you didn't miss a corner case.
With this MCP, that manual grind disappears. You create an isolated project with `create_project` and define your entire test suite inside it. Your agent handles the repetition: it runs dozens of varied tests automatically and gives you one clean report showing where each model failed to meet the standard.
Get Exact Model Benchmarks with Braintrust
You eliminate the need for manual data entry by using `list_datasets` to find existing test banks and then appending new failure points via `insert_dataset_row`. You never have to guess what inputs matter; you just add them.
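As a rough sketch of what "appending new failure points" could look like, the snippet below turns a list of observed failures into argument payloads that an agent might pass to `insert_dataset_row`. The argument names (`dataset_id`, `input`, `expected`, `metadata`) and the dataset ID `ds_example` are assumptions made for illustration; the tool's actual schema is not documented on this page.

```python
from typing import Any


def failures_to_rows(dataset_id: str, failures: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Build one hypothetical insert_dataset_row payload per unique failing input.

    The field names used here are assumptions, not the tool's documented schema.
    """
    rows: list[dict[str, Any]] = []
    seen_inputs: set[str] = set()
    for failure in failures:
        if failure["input"] in seen_inputs:
            # Skip duplicate inputs so the same case is not appended twice.
            continue
        seen_inputs.add(failure["input"])
        rows.append(
            {
                "dataset_id": dataset_id,
                "input": failure["input"],
                "expected": failure["expected"],
                "metadata": {"source": "regression"},
            }
        )
    return rows


# Illustrative failures; the second entry repeats the first on purpose.
failures = [
    {"input": "Refund for order 1001?", "expected": "Refunds take 5 business days."},
    {"input": "Refund for order 1001?", "expected": "Refunds take 5 business days."},
    {"input": "Cancel my plan", "expected": "Your plan is cancelled."},
]
rows = failures_to_rows("ds_example", failures)
print(len(rows))  # 2
```

Deduplicating before inserting keeps the ground truth dataset from weighting one failure more heavily than the others.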
What's different now is certainty. Instead of having a gut feeling about model quality, you have verifiable scores across every single dimension you care about.
What you can do with this MCP connector
This connector gives you a platform to observe, test, and debug your AI models in isolation. Instead of just running prompts and hoping they work, you establish formal projects where you define the inputs (the prompt templates) and the expected outputs (the ground truth dataset). You run structured experiments that execute model variations against this known standard, generating detailed performance traces.
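To make the idea of running a model against a ground truth dataset concrete, here is a minimal, self-contained sketch of an evaluation loop. It does not call Braintrust or this MCP; `toy_model`, `exact_match`, and `run_experiment` are hypothetical stand-ins that only illustrate the input, expected output, and score relationship described above.

```python
from typing import Callable


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the output equals the expected text, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_experiment(
    model: Callable[[str], str],
    dataset: list[dict[str, str]],
    scorer: Callable[[str, str], float],
) -> dict:
    """Run the model on every row and record a per-row score plus the mean."""
    results = []
    for row in dataset:
        output = model(row["input"])
        results.append(
            {"input": row["input"], "output": output, "score": scorer(output, row["expected"])}
        )
    mean = sum(r["score"] for r in results) / len(results) if results else 0.0
    return {"mean_score": mean, "results": results}


def toy_model(prompt: str) -> str:
    """Stand-in for a real LLM call; it only knows two answers."""
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


dataset = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "paris"},
    {"input": "capital of Peru", "expected": "Lima"},
]
report = run_experiment(toy_model, dataset, exact_match)
print(round(report["mean_score"], 2))  # 0.67
```

Exact match is the simplest possible scorer; a real evaluation would likely swap in task-specific scoring while keeping the same loop.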
This capability is critical for catching subtle regressions, such as when a minor update to your code causes a massive drop in quality. Whether you're testing how different versions of a prompt affect the tone, or checking if two models handle edge cases differently, you get hard metrics such as alignment scores. You can also build up and version those core datasets over time.
When you run these evaluations through Vinkius, your AI agent doesn't just send data; every tool call is recorded in a cryptographically signed audit trail. This means that when you review the results, you know exactly which inputs caused which failures, giving you full visibility into what happened across the entire test run.
How Braintrust MCP Works
1. First, establish a project to containerize your evaluation scope using the `create_project` tool.
2. Next, define or retrieve your specific dataset using `get_dataset` and then initiate a test run with `create_experiment`.
3. The system executes the model against your data, returning detailed results that you can review via `list_experiments`.
The bottom line is: it takes complex, multi-step QA processes and turns them into simple commands for your agent.
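For readers curious what these steps look like on the wire, the sketch below builds the JSON-RPC requests an MCP client would send, using the `tools/call` method from the Model Context Protocol. Your agent normally issues these for you. The tool names come from this page; the argument names and values (`name`, `dataset_id`, `project_name`) are assumptions for illustration only.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need unique ids


def tool_call(name: str, arguments: dict) -> dict:
    """Wrap a tool invocation in an MCP `tools/call` JSON-RPC request."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Argument names and values below are hypothetical placeholders.
workflow = [
    tool_call("create_project", {"name": "support-bot-evals"}),
    tool_call("get_dataset", {"dataset_id": "ds_example"}),
    tool_call("create_experiment", {"project_name": "support-bot-evals", "name": "prompt-v3"}),
    tool_call("list_experiments", {"project_name": "support-bot-evals"}),
]

for request in workflow:
    print(json.dumps(request))
```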
Who Is Braintrust MCP For?
This MCP is built for the ML Engineer who spends all day looking at charts trying to figure out why a model suddenly started hallucinating. It's also for the Data Scientist who needs verifiable proof that their new prompt template actually works before it hits production.
They use this MCP to track variable distributions and check for regressions remotely, so model drift is caught early.
They construct massive matrices of test data and evaluate full run outcomes without having to write complex script queries every time.
They push Ground Truth evaluation text datasets on the fly, testing subtle differences between prompt versions live in a chat interface.
What Changes When You Connect
- You gain full version control over prompts using `list_prompts` and `get_prompt`. This means you can prove that a change to the system instructions wasn't responsible for performance dips. You keep a perfect record of what was used when.
- Tracking model changes gets easier with `create_experiment` and `list_experiments`. Instead of looking at vague chat logs, you get structured historical traces showing exactly how scores changed between runs (e.g., V2 vs V3).
- The MCP lets you manage your data sources using `list_datasets` and `get_dataset`. This establishes a single source of truth for what the model should be doing when it encounters common scenarios.
- Need to test a new edge case? You can use `insert_dataset_row` to append a few targeted test cases into an existing dataset matrix, instantly running the evaluation against that specific gap.
- You always know your setup is clean. The MCP allows you to check configurations and credentials using `list_env_vars`, verifying everything needed for model execution without exposing sensitive keys.
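The "V2 vs V3" comparison mentioned above boils down to diffing two sets of scores. Here is a minimal sketch, assuming you have already pulled per-metric scores for two runs (for example from `list_experiments`) into plain dictionaries. The metric names, values, and the `tolerance` threshold are made up for illustration.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return metrics whose score dropped by more than `tolerance`, with the delta."""
    regressions: dict[str, float] = {}
    for metric, old_score in baseline.items():
        new_score = candidate.get(metric)
        if new_score is None:
            # Metric was not tracked in the candidate run; nothing to compare.
            continue
        delta = new_score - old_score
        if delta < -tolerance:
            regressions[metric] = round(delta, 4)
    return regressions


# Hypothetical scores for two experiment runs.
v2 = {"accuracy": 0.91, "tone": 0.88, "safety": 0.99}
v3 = {"accuracy": 0.84, "tone": 0.89, "safety": 0.98}
print(find_regressions(v2, v3))  # {'accuracy': -0.07}
```

The tolerance keeps small run-to-run noise (like the 0.01 dip in `safety`) from being flagged as a regression.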
Real-World Use Cases
Validating Tone Shifts After a Rewrite
A Product Team wrote new marketing copy. Instead of manually testing the prompt with 20 different scenarios, they use `list_datasets` to pull five key examples and run an experiment via `create_experiment`. The agent reports back on alignment scores, confirming the tone remains professional across all tests.
Debugging Model Drift in Production
The ML Engineer noticed a drop in accuracy. They use `list_experiments` to compare the current run against the previous stable benchmark and use `get_prompt` to confirm that the prompt template hasn't drifted, quickly isolating the source of the failure.
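One simple way to confirm that a template "hasn't drifted" is to fingerprint the text and compare hashes. The sketch below assumes you have the stable and current template text as strings (for instance, as returned by `get_prompt`). The `fingerprint` and `has_drifted` helpers and the sample templates are hypothetical and not part of Braintrust or this MCP.

```python
import hashlib


def fingerprint(template: str) -> str:
    """Hash a prompt template, ignoring trailing whitespace on each line and at the ends."""
    normalized = "\n".join(line.rstrip() for line in template.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_drifted(stable: str, current: str) -> bool:
    """True when the two templates differ in anything other than trailing whitespace."""
    return fingerprint(stable) != fingerprint(current)


stable = "You are a support agent.\nAnswer in a professional tone."
whitespace_only = "You are a support agent.  \nAnswer in a professional tone.\n"
reworded = "You are a support agent.\nAnswer casually."

print(has_drifted(stable, whitespace_only))  # False
print(has_drifted(stable, reworded))         # True
```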
Testing New Compliance Rules
A Data Scientist needs to test how a model handles new data privacy rules. They first use `create_project` to scope the work, then use `insert_dataset_row` to add five examples of non-compliant inputs into their ground truth dataset.
Comparing Prompt Strategies
An AI Developer wants to compare two different system instructions. They retrieve both templates using `get_prompt`, set up a dedicated environment via `create_project`, and run separate experiments to see which prompt yields the highest consistency score.
The Tradeoffs
Treating it like simple chat prompts
Just pasting a random question into the agent hoping for good results, without defining what 'good' means.
→
Don’t rely on ad-hoc testing. Use `create_project` to structure your work first, and then use `get_dataset` to pull a fixed set of inputs so that every run covers the same cases.
Ignoring prompt versioning
Rerunning the same test today and assuming the results match yesterday's run because someone 'tweaked' the instructions.
→
Always use `get_prompt` to pull the exact template ID and ensure you are testing against a defined, frozen version. Use `list_prompts` to see all historical versions.
Mixing up project scope
Running tests for billing features in the same container as user support responses.
→
Keep your work separated. Start by calling `create_project`, giving every distinct functional area its own isolated evaluation environment.
When It Fits, When It Doesn't
Use this MCP if your core job involves proving that an AI model's output is consistently correct and predictable against a set standard. You need to measure performance, not just chat with it. If the question you keep asking is 'How do I prove X?', the answer is Braintrust. Don’t use this if you simply want creative brainstorming or general ideas; those are better handled by simple conversation tools. Only use this when you absolutely require a measurable score against defined Ground Truth data, like what `get_dataset` provides.
Common Questions About Braintrust MCP
How do I start tracking new evaluations with the Braintrust MCP?
Start by calling `create_project` to establish a dedicated scope for your work. This container keeps all related tests and datasets isolated from other projects.
What is Ground Truth in the context of `list_datasets`?
Ground Truth refers to the definitive, correct answers or expected outputs used as a benchmark. The `list_datasets` tool helps you find these core repositories for your model testing.
Can I compare two different prompts using Braintrust MCP?
Yes. You use `get_prompt` to retrieve both templates, then create separate experiments using `create_experiment` to run them side-by-side against the same dataset.
Do I need to manually manage API keys for this MCP?
The platform handles credential management via a zero-trust proxy. You only need to confirm your environment variables using `list_env_vars`, and Vinkius manages the secure transit of those credentials.
How do I check my Braintrust Gateway settings using `list_env_vars`?
It probes the secure configuration variables for your gateway. This tool lets you confirm which model API keys and parameters are active without exposing sensitive credentials, giving you confidence in the setup.
If I find a gap in my test data, how do I add new examples using `insert_dataset_row`?
You append new records directly into your dataset matrix. This tool lets you feed specific, high-value test cases to the evaluation system without needing to manually update the source file.
I'm starting a totally different product line; how do I use `create_project`?
It establishes a completely isolated workspace for your new efforts. This ensures that testing, datasets, and metrics for Project A won't accidentally mix with or affect Project B.
What if I need to see all saved versions of a prompt? Can `list_prompts` help?
It retrieves an explicit list of every version-controlled system prompt. This is crucial for auditing and tracking how your core instructions have evolved over time.
Use it with your favorite AI tools
Connect this server to Cursor, Claude, VS Code, and more.