Braintrust MCP for AI. Prove model quality with systematic evaluation.
Works with every AI agent you already use
…and any MCP-compatible client








Connect to your AI in seconds.
Braintrust helps developers systematically test and validate LLMs. You manage projects, track prompt versions, run complex benchmark experiments, and query structured 'Ground Truth' data—all within one place.
Stop guessing if your model works; prove it.
What your AI can do
Run formal experiments that record and compare LLM outputs against historical runs.
Query accurate, structured 'Ground Truth' data sets to score model responses automatically.
Securely grab and compare specific versions of system prompts without touching the core code base.
Create isolated projects to keep different model test runs separate and clean.
Braintrust: 10 Tools for Evaluation
These tools let you build a complete testing pipeline, allowing you to define projects, retrieve data sets, version prompts, and track every single test run result.
Make your AI actually useful.
Add this MCP to Claude, Cursor, or Windsurf and your AI stops guessing. It gets real tools to look things up, take action, and handle the stuff you keep doing by hand.
Start using Braintrust on Vinkius
Create Experiment
Records a new historical experiment trace to track LLM pipeline tests.
Create Project
Sets up a new project environment for tracking AI evaluations and data sets.
List Datasets
Lists available 'Ground Truth' text banks used for automated evaluation scoring.
List Env Vars
Checks the Braintrust AI Gateway configurations, showing model API keys securely.
List Experiments
Retrieves all recorded evaluation experiments, mapping out model test scores and...
Get Dataset
Retrieves a specific dataset containing structured schemas that bound LLM outputs.
Get Prompt
Grabs the exact variable contexts and literal text templates used in a prompt.
Insert Dataset Row
Adds new test cases into an existing dataset matrix for specific evaluations.
List Projects
Lists all existing AI evaluation projects configured in Braintrust.
List Prompts
Retrieves a list of system prompts that are explicitly version-controlled inside...
Security and governance baked right in.
Pick your AI client below to get set up. Just create a Vinkius account, subscribe, and you're instantly up and running. We handle the entire backend infrastructure, with out-of-the-box support for Streamable HTTP, SSE, and OAuth2. Zero messy routing required.
Choose How to Get Started
Build a custom MCP for your own tools, or connect a ready-made integration from our catalog.
Build Your Own
Turn any API into an MCP. Import a spec, define Agent Skills, or deploy with MCPFusion.
- Import from OpenAPI, Swagger, or YAML specs
- Create Agent Skills with progressive disclosure
- Deploy to edge with MCPFusion framework
- Built in DLP, auth, and compliance on every call
- Real time usage dashboard and cost metering
- Publish to catalog or keep private
Make Your AI Do More
Start with Braintrust, then connect any of our 5,100+ other servers whenever your AI needs more. One click, no limits.
- Use this MCP plus 5,100+ others, all in one place
- Add new capabilities to your AI anytime you want
- Every connection is secured and compliant automatically
- Track usage and costs across all your servers
- Works with Claude, ChatGPT, Cursor, and more
- New servers added to the catalog every week
Independent Platform Disclaimer: Vinkius is an independent platform and is not affiliated with, endorsed by, sponsored by, verified by, or otherwise authorized by Braintrust. All third-party trademarks, logos, and brand names are the property of their respective owners. Their use on this website is strictly for informational purposes to identify service compatibility and interoperability.
VINKIUS INFRASTRUCTURE
Cloud Hosted
Managed infra
V8 Isolated
Sandboxed per request
Zero-Trust Proxy
No stored credentials
DLP Enforced
Policy on every call
GDPR Compliant
EU data residency
Token Compression
~60% cost reduction
Works with Claude, ChatGPT, Cursor, and more
The Model Context Protocol standardizes how applications expose capabilities to LLMs. Instead of operating in isolation, your AI gains direct access to external platforms, live data, and real-world actions through secure, standardized connections.
This connection provides 10 powerful capabilities that interface natively with Claude, ChatGPT, Cursor, and other compatible AI platforms. No middleware. No custom integration required.
Testing AI outputs feels like guesswork right now.
Today, when your model fails, you usually end up in a messy cycle: checking the input logs, opening another tab to look at the prompt template, cross-referencing it with a separate spreadsheet of 'known good' answers. You spend half your time just trying to collect enough data points to figure out *why* it went wrong.
With this MCP, you stop guessing. The platform handles that messy process. You define the scope using projects and datasets; then, when the model outputs a response, the system scores it against your 'Ground Truth' immediately. What you get is clean, measurable data about performance.
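To make "scoring against Ground Truth" concrete, here is a minimal sketch that assumes the simplest possible scorer: a case-insensitive exact match. The function names and sample data are hypothetical and only illustrate the idea; the scorers Braintrust actually applies are not described on this page.

```python
def exact_match_score(output: str, expected: str) -> float:
    """Return 1.0 when the normalized output equals the expected answer, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def score_run(outputs: list, ground_truth: list) -> float:
    """Average the per-case scores for one test run."""
    if len(outputs) != len(ground_truth):
        raise ValueError("outputs and ground_truth must have the same length")
    if not outputs:
        return 0.0
    scores = [exact_match_score(o, e) for o, e in zip(outputs, ground_truth)]
    return sum(scores) / len(scores)


# Hypothetical example data, not taken from any real dataset.
ground_truth = ["Paris", "4", "blue"]
model_outputs = ["paris", "5", "Blue "]

accuracy = score_run(model_outputs, ground_truth)
print(f"accuracy: {accuracy:.2f}")
```

Two of the three sample answers match after normalization, so the run scores two thirds. Real evaluations usually need fuzzier scorers, but the shape stays the same: one score per case, then an aggregate per run.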
Braintrust gives you full control over model evaluation.
You no longer have to rely on vague metrics or manual spot-checks. You can use `list_projects` to see every test environment, and then run specific comparisons by retrieving prompt templates using `get_prompt`. This gives you an audit trail of everything.
The difference is control. You move from 'I hope this works' to 'Here are the metrics proving it works.' It’s a fundamental shift in how you build reliable AI.
What your AI can actually do with this
Building reliable AI models means more than just writing a single good prompt. It demands rigorous testing across multiple variables. This MCP lets you set up formal evaluation pipelines right from your agent, giving you full visibility into exactly how the model behaves under pressure. You can track specific variable distributions and compare outputs against historical benchmarks without ever leaving your chat window.
Need to check if a new feature broke an old response pattern? Use this MCP. If you're building anything complex for production, connecting it through the Vinkius catalog is the right move. It lets you turn vague model performance anxiety into concrete data points.
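Checking whether a new feature broke an old response pattern comes down to comparing per-case scores between two runs. Here is a minimal sketch, assuming you already have scores keyed by test-case ID. The function name, case IDs, and numbers are hypothetical and are not part of Braintrust or this MCP.

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.0) -> list:
    """Return the IDs of cases whose score dropped by more than `tolerance`."""
    regressions = []
    for case_id, old_score in baseline.items():
        new_score = candidate.get(case_id)
        if new_score is None:
            # The case was not run in the candidate, so there is nothing to compare.
            continue
        if old_score - new_score > tolerance:
            regressions.append(case_id)
    return sorted(regressions)


# Hypothetical per-case scores from two runs.
baseline = {"case-1": 1.0, "case-2": 0.5, "case-3": 1.0}
candidate = {"case-1": 1.0, "case-2": 0.75, "case-3": 0.0}

regressed = find_regressions(baseline, candidate)
print(regressed)
```

Only the case whose score fell is reported; the case that improved and the case that held steady are left out.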
Here's how it actually works
The bottom line: instead of scrolling through massive spreadsheets, your agent runs the checks through Braintrust.
Add this MCP to your AI client, then supply your Braintrust API credentials.
Tell the agent what you want to benchmark, such as a new prompt version or an existing project scope.
The agent calls the Braintrust tools to record the experiment and reports scores and regressions right in the chat.
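Under the hood, an MCP client invokes a tool by sending a JSON-RPC 2.0 request with the method tools/call. The sketch below only builds that request for list_projects; it does not contact any server, and it assumes the tool takes no arguments, which this page does not confirm. The helper name build_tool_call is hypothetical.

```python
import json
from itertools import count
from typing import Optional

# Simple incrementing request IDs, as JSON-RPC requires a unique id per request.
_request_ids = count(1)


def build_tool_call(name: str, arguments: Optional[dict] = None) -> dict:
    """Build an MCP `tools/call` JSON-RPC 2.0 request for the named tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


# Assumption: list_projects needs no arguments.
request = build_tool_call("list_projects")
print(json.dumps(request, indent=2))
```

In practice your AI client builds and sends these requests for you; seeing the shape just makes it easier to reason about what the agent is doing when it "uses a tool".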
Who is this actually for?
AI developers who have seen their model fail in production and now need to prove exactly why it failed. Prompt engineers tired of guessing if a minor prompt change breaks the whole thing. Data scientists needing structured, historical data for validation.
Tracking specific variable distributions or running remote checks on accurate regressions across multiple models.
Setting up isolated test sets and pushing 'Ground Truth' evaluation text to validate prompt differences rapidly.
Constructing massive matrices of test runs and evaluating model performance without writing complex script queries.
What Changes When You Connect
You track prompt changes instantly. Instead of manually checking code, use the list_prompts tool to pull version-controlled prompts and test them without touching your core system.
Never lose a data point again. Use insert_dataset_row to append new test cases into an existing dataset matrix, ensuring every evaluation has fresh coverage.
Visualize performance changes with certainty. Running list_experiments gives you all historical runs, letting you compare model scores and metrics side by side.
Maintain clean separation of concerns. Use create_project to isolate different types of testing—say, one project for user onboarding flows and another for admin commands.
Know exactly what data you're using. By calling list_datasets, you see all your 'Ground Truth' repositories before writing a single test case.
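As a rough illustration of the insert_dataset_row step above, here is a sketch of assembling one test case before handing it to the tool. The field names (input, expected, metadata), the dataset_id argument, and the placeholder ID are all assumptions made for this example; check the tool's actual input schema in your client before relying on them.

```python
import json
from typing import Optional


def make_dataset_row(input_text: str, expected: str, tags: Optional[list] = None) -> dict:
    """Assemble one test-case row; the field names are assumed, not confirmed."""
    if not input_text.strip():
        raise ValueError("input_text must not be empty")
    return {
        "input": input_text,
        "expected": expected,
        "metadata": {"tags": list(tags or [])},
    }


row = make_dataset_row(
    "Reset my password",
    "Send the password reset link",
    tags=["onboarding", "edge-case"],
)

# Hypothetical argument layout for the insert_dataset_row tool.
arguments = {"dataset_id": "<your-dataset-id>", "row": row}
print(json.dumps(arguments, indent=2))
```

Validating the row locally first means a malformed test case fails fast instead of landing in your dataset.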
See it in action
Validating a new onboarding flow
A product manager needs to know if the LLM handles edge-case user input correctly. They use create_project to isolate 'Onboarding V2'. Then, they run multiple tests using get_dataset data and track all results with create_experiment. This proves that the new flow doesn't regress on old bugs.
A/B testing prompt variations
An ML engineer needs to compare two versions of a summarization prompt. They use list_prompts to grab both templates, then call create_experiment twice, feeding the same inputs into both setups to measure which one hits better alignment scores.
Debugging production failures
A developer notices a drop in performance. They immediately use list_env_vars and get_dataset to retrieve the exact model configuration and the specific ground truth data that failed, pinpointing the issue quickly.
Archiving model versions
A team is retiring an old version of their chatbot. They use list_projects first to see every evaluation run tied to it, ensuring no historical test data or metrics are lost before decommissioning the model.
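The A/B testing scenario above boils down to scoring the same inputs under both prompts and comparing the averages. A minimal sketch, assuming you already have one score per test case for each variant; the function name and the numbers are hypothetical.

```python
from statistics import mean


def compare_variants(scores_a: list, scores_b: list) -> dict:
    """Compare two prompt variants scored on the same inputs, in the same order."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both variants must be scored on the same inputs")
    mean_a, mean_b = mean(scores_a), mean(scores_b)
    if mean_a == mean_b:
        winner = "tie"
    else:
        winner = "a" if mean_a > mean_b else "b"
    return {"mean_a": mean_a, "mean_b": mean_b, "delta": mean_b - mean_a, "winner": winner}


# Hypothetical per-case scores for two summarization prompts.
scores_prompt_a = [0.5, 0.75, 1.0, 0.25]
scores_prompt_b = [0.75, 0.75, 1.0, 0.5]

result = compare_variants(scores_prompt_a, scores_prompt_b)
print(result)
```

A difference in averages on a handful of cases is only a hint. With more test cases in the dataset, the comparison becomes something you can actually act on.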
The honest tradeoffs
Treating the LLM like a simple chat window
Just pasting 10 questions into your agent and hoping the answers are good. You get random, unorganized output with no way to measure if it's wrong or just different.
You have to build structure. Use create_project first, then populate test cases using insert_dataset_row. This forces systematic testing so you can actually score the results.
Manually tracking prompt changes in spreadsheets
Copying and pasting old prompts into a sheet to compare them. It's slow, prone to formatting errors, and doesn't track version history or dependencies.
Use list_prompts and get_prompt. This captures the exact prompt template and its version ID, giving you an immutable record.
Running tests without context
Just running a test against the live model build. You get results, but if it fails, you don't know if the failure was due to bad data or bad code.
You must run controlled experiments. Use create_experiment combined with list_datasets. This ties your failing output directly back to a specific test case and project.
When It Fits, When It Doesn't
Use this MCP if validating model quality is mission-critical. If you need to compare Model A's response against Model B's, or if you must prove that a prompt change didn't break existing functionality, this toolset is necessary. It gives you the structure of formal testing. Don't use it if your goal is simple: 'Just chat with the model and see what it says.' For those basic conversational checks, a standard agent connection works fine. But for production systems—the kind that handle money or critical data—you need the rigor provided by list_datasets and create_experiment. If you only care about one thing (e.g., just viewing old results), you might be able to get away with just running list_experiments, but if you're building a verifiable product, use this MCP.
Questions you might have
How do I start testing my model with Braintrust using `create_project`?
You first call create_project to establish the boundaries for your tests. This gives you a clean, isolated environment that prevents new test runs from contaminating existing project data.
What is the difference between `get_dataset` and `list_datasets`?
list_datasets shows you all available 'Ground Truth' text banks. You then use get_dataset to pull a specific, structured dataset for active testing.
How do I track changes to my prompt templates with Braintrust?
Use the list_prompts tool to see all version-controlled prompts. You can then call get_prompt to retrieve a specific template by ID, ensuring you test against an exact version.
Can I add custom failed tests using Braintrust?
Yes. After running a batch of tests, you use insert_dataset_row to manually append new failure cases or specific edge-case inputs into your dataset matrix for future runs.
How do I check which API keys are configured for Braintrust using `list_env_vars`?
list_env_vars shows you all the current gateway configuration variables. This is how your agent accesses the necessary model API keys securely without needing manual setup.
If I want to review previous test runs, what does `list_experiments` retrieve?
list_experiments retrieves a comprehensive map of all past evaluation attempts. This lets you check historical metrics and model scores across various run IDs.
Can I use `insert_dataset_row` to append just a single test case into my matrix?
Yes, that's exactly what it does. You can target a specific dataset and inject new evaluation data row by row without having to build an entire master sheet first.
Before starting a new project, how do I use `list_projects` to see current evaluations?
list_projects gives you the list of all existing AI evaluation containers. This helps you confirm your scope and choose the right environment for your next test.
Can I insert new test data dynamically?
Yes. With insert_dataset_row, you can add structured JSON test cases directly to a dataset so they are included in later evaluations.
Can it retrieve the original prompt definitions stored in Braintrust?
Yes. get_prompt returns the version-controlled prompt, including its variables and literal template text, as stored in Braintrust.
How deeply can it inspect test regressions or scoring limits?
With list_experiments, you can pull the full set of recorded experiments and compare scores across model and prompt versions to spot regressions and performance anomalies.
We've already built the connector for Braintrust. Just plug in your AI agents and start using Vinkius.
No hosting. No infrastructure. No complex setup.
All 10 tools are live and waiting.
You're up and running in seconds.
Vinkius gives your AI agents access to the full catalog of app connectors, all fully managed, secure, and enterprise-ready. One subscription, every tool you need.
Built, hosted, and secured by Vinkius. You just connect and go.