4,500+ servers built on MCP Fusion
Vinkius

Braintrust MCP. Automate AI model evaluation and data tracking.

Claude
ChatGPT
Cursor
Gemini
Windsurf
VS Code
JetBrains
Vercel

Works with every AI agent you already use

…and any MCP-compatible client

Just plug in your AI agents and start using Vinkius.

Braintrust. This server lets your AI agent run deep logic checks on model outputs. You can manage projects, create test experiments, and query Ground Truth datasets.

It's for developers who need to systematically test and track AI model performance against specific, versioned data sets and prompts.

It's an observability layer for building reliable AI.

What your AI agents can do

Create experiment

Creates a new historical experiment trace to record and track LLM pipeline tests.

Create project

Sets up a new project environment for tracking and organizing AI evaluations and datasets.

Get dataset

Retrieves a specific dataset that contains defined schemas for bounding LLM outputs.

+ 7 more capabilities included

Track AI Experiments

You can establish new historical traces to record and compare different versions of your LLM pipeline tests.

Manage Project Environments

The server lets you create isolated project environments specifically for tracking and organizing AI evaluation datasets.

Retrieve Ground Truth Data

Query specific datasets that contain exact schemas needed to bound and test LLM outputs against known correct answers.

Add Test Data

Append new test cases directly into a dataset matrix, targeting specific model evaluations for comparison.

View Prompt Templates

Retrieve exact variable contexts and literal text templates for a prompt, ensuring you test against the intended text.

List Evaluation Assets

Access lists of all available datasets, projects, experiments, and prompts for auditing.

Supported MCP Clients

Claude
ChatGPT
Cursor
Gemini
Windsurf
VS Code
JetBrains
Vercel
+ other MCP clients
Free for Subscribers

Braintrust MCP Server: 10 Tools for AI Evaluation

These tools let you manage the entire lifecycle of model testing: creating projects, querying ground truth data, and tracking historical performance traces.

create_experiment

Creates a new historical experiment trace to record and track LLM pipeline tests.

create_project

Sets up a new project environment for tracking and organizing AI evaluations and datasets.

get_dataset

Retrieves a specific dataset that contains defined schemas for bounding LLM outputs.

get_prompt

Retrieves the exact variable contexts and literal text templates used in a prompt.

insert_dataset_row

Appends a new test case row into a dataset matrix for specific evaluation scoring.

list_datasets

Lists all isolated Ground Truth text banks used for automated evaluation scoring.

list_env_vars

Checks the Braintrust AI Gateway configurations, listing managed model API keys.

list_experiments

Gets a list of all evaluation experiments, mapping model test scores and metrics.

list_projects

Retrieves the list of all active AI evaluation projects within Braintrust.

list_prompts

Gets a list of all explicitly version-controlled system prompts stored in Braintrust.
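
If you want to see what a call to one of these tools looks like on the wire, here is a minimal sketch. MCP clients invoke tools with a JSON-RPC 2.0 `tools/call` request that carries the tool name and an `arguments` object. The tool names come from the list above; the `dataset_id` argument and its value are assumptions for illustration, since this page doesn't document each tool's input schema.

```python
import json
from itertools import count
from typing import Optional

# Tool names as listed on this page.
BRAINTRUST_TOOLS = {
    "create_experiment", "create_project", "get_dataset", "get_prompt",
    "insert_dataset_row", "list_datasets", "list_env_vars",
    "list_experiments", "list_projects", "list_prompts",
}

_request_ids = count(1)  # simple incrementing JSON-RPC request ids


def build_tool_call(name: str, arguments: Optional[dict] = None) -> dict:
    """Build an MCP `tools/call` JSON-RPC 2.0 request for one tool."""
    if name not in BRAINTRUST_TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


# `dataset_id` is a hypothetical argument name, not a documented one.
request = build_tool_call("get_dataset", {"dataset_id": "example-dataset-id"})
print(json.dumps(request, indent=2))
```

In practice your MCP client builds and sends this request for you; the sketch only shows the shape of the message.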

Choose How to Get Started

Build a custom MCP for your own tools, or connect a ready-made integration from our catalog.

Build Your Own

Turn any API into an MCP. Import a spec, define Agent Skills, or deploy with MCPFusion.

  • Import from OpenAPI, Swagger, or YAML specs
  • Create Agent Skills with progressive disclosure
  • Deploy to edge with MCPFusion framework
  • Built in DLP, auth, and compliance on every call
  • Real time usage dashboard and cost metering
  • Publish to catalog or keep private
Start building

Make Your AI Do More

Start with Braintrust, then connect any of our 4,700+ other servers whenever your AI needs more. One click, no limits.

  • Use this MCP plus 4,700+ others, all in one place
  • Add new capabilities to your AI anytime you want
  • Every connection is secured and compliant automatically
  • Track usage and costs across all your servers
  • Works with Claude, ChatGPT, Cursor, and more
  • New servers added to the catalog every week

What you can do with this MCP connector

This server lets your AI agent run deep logic checks on model outputs. You'll use it to systematically test and track AI model performance against specific, versioned data sets and prompts. You'll manage projects, create test experiments, and query Ground Truth datasets. You'll use create_project to set up isolated project environments for tracking and organizing AI evaluation datasets.

You'll use list_projects to see every active AI evaluation project in Braintrust. You'll use create_experiment to establish new historical traces, recording and comparing different versions of your LLM pipeline tests. You'll use list_experiments to get a list of all evaluation experiments, mapping model test scores and metrics. You'll use get_dataset to retrieve specific datasets, providing defined schemas for bounding LLM outputs.

You'll use list_datasets to see all isolated Ground Truth text banks available for automated evaluation scoring. You'll use insert_dataset_row to append new test cases directly into a dataset matrix, targeting specific model evaluations for comparison. You'll use get_prompt to retrieve the exact variable contexts and literal text templates used in a prompt, making sure you test against the intended text.

You'll use list_prompts to get a list of all explicitly version-controlled system prompts stored in Braintrust. You'll use list_env_vars to check the Braintrust AI Gateway configurations, listing managed model API keys.

How Braintrust MCP Works

  1. Add the Braintrust server to your AI agent.
  2. Bind your Braintrust API credentials.
  3. Your agent queries experiments, datasets, and prompts, and checks for regressions, right in the chat.

The bottom line: your agent handles the evaluation checks and data logging through Braintrust, so you don't have to review tables by hand.

Who Is Braintrust MCP For?

This is for the ML Engineer tired of manually compiling regression reports. It's for the Data Scientist who needs to build massive test matrices without running dozens of scripts. If you're testing AI outputs against strict rules, this is your layer. It moves testing from a CLI script to a chat command.

Machine Learning Engineer

Tracks metric distributions and checks for regressions across different model versions, without leaving the chat.

AI Developer

Pushes Ground Truth evaluation text datasets on the fly to test how changing prompts affects model output.

Data Scientist

Constructs massive test matrices and evaluates multiple test runs without writing custom script queries.

Product Team Lead

Reviews the exact prompt text and validates response styles before a feature ships.

What Changes When You Connect

  • Track model drift and regressions using create_experiment. Instead of manually comparing output logs, you generate a traceable historical record of how the model's performance changes over time.
  • Keep your test data clean with list_datasets and get_dataset. You query isolated Ground Truth text banks, ensuring your evaluation is always run against the correct, audited standard.
  • Manage prompt changes safely with list_prompts and get_prompt. You pull the exact, version-controlled prompt, so you never accidentally test against a partially edited or unstable version of the instructions.
  • Organize large tests using create_project. You keep all evaluation assets—datasets, experiments, prompts—in one dedicated, isolated project environment, preventing data bleed between testing efforts.
  • Validate test data immediately with insert_dataset_row. You don't just run the test; you can append new, specific test cases to the matrix right from your agent's response.
  • Audit your setup with list_env_vars. You check the Braintrust AI Gateway to confirm which model API keys are actively managed and accessible for the current run.

Real-World Use Cases

01

Validating a new prompt style

A Product Team needs to validate if a new feature description maintains a professional tone. They use list_prompts to find the base template, then use get_prompt to grab the exact text. They run the test, and the agent reports the adherence score, letting them know the prompt worked without manual checks.

02

Tracking model degradation

An ML Engineer suspects the model is drifting. They use list_experiments to pull up the last three runs. They then run create_experiment with the new data, comparing the resulting trace against the old metrics to pinpoint exactly where the performance dropped.
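
A minimal sketch of the comparison in this use case, assuming each experiment exposes a single numeric score. The scores and the tolerance below are made up for illustration; they are not Braintrust output.

```python
from statistics import mean


def regression_report(previous_scores, new_score, tolerance=0.02):
    """Compare a new run's score with the mean of earlier runs."""
    baseline = mean(previous_scores)
    if baseline == 0:
        raise ValueError("baseline score is zero; percent shift is undefined")
    delta = new_score - baseline
    return {
        "baseline": baseline,
        "delta": delta,
        "percent_shift": 100.0 * delta / baseline,
        # Flag a regression only when the drop exceeds the tolerance.
        "regressed": delta < -tolerance,
    }


# Hypothetical scores for the last three runs and the new run.
report = regression_report([0.91, 0.90, 0.92], 0.84)
print(report)
```

The tolerance keeps small run-to-run noise from being flagged as drift; pick a value that matches how noisy your metric is.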

03

Building a comprehensive data test set

A Data Scientist needs a massive matrix. They first use list_datasets to find the source, then use get_dataset to pull the schema. Finally, they use insert_dataset_row to add 50 new, specific test cases to the matrix before running the full evaluation.
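
As a rough sketch of that last step, the 50 inserts can be prepared as a batch of tool calls. The argument names (`dataset_id`, `input`, `expected`) and the dataset ID are assumptions for illustration; check the tool's actual input schema in your MCP client before relying on them.

```python
def build_row_inserts(dataset_id, cases):
    """Turn test cases into `insert_dataset_row` call parameters, one per row."""
    return [
        {
            "name": "insert_dataset_row",
            # Argument names are hypothetical, not taken from the tool schema.
            "arguments": {
                "dataset_id": dataset_id,
                "input": case["input"],
                "expected": case["expected"],
            },
        }
        for case in cases
    ]


# 50 placeholder test cases, generated only to show the batch shape.
cases = [{"input": f"question {i}", "expected": f"answer {i}"} for i in range(50)]
calls = build_row_inserts("example-dataset-id", cases)
print(len(calls), calls[0])
```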

04

Comparing different model versions

A developer wants to compare Model A vs. Model B on the same data. They use create_project to isolate the test, list_datasets to confirm the data source, and then run the comparison, logging the results in a new experiment trace.

The Tradeoffs

Running tests manually in a script

The developer writes a Python script that calls three different APIs (one for data, one for project, one for prompt) and has to manually handle state passing and error logging for every single call.

Instead, let your agent manage the flow. Use create_project first, then list_datasets to identify the source, and finally create_experiment to record the whole run as one traceable experiment.
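
The three-call flow above can be sketched as one small function. This is an illustration under assumptions: `call_tool` stands in for whatever your MCP client uses to invoke a tool, and the argument names and return shapes are hypothetical. The stub below only records the call order so the sketch runs without a server.

```python
from typing import Any, Callable


def run_evaluation_flow(call_tool: Callable[[str, dict], Any],
                        project_name: str, experiment_name: str) -> dict:
    """Create a project, look up datasets, then open an experiment in it."""
    project = call_tool("create_project", {"name": project_name})
    datasets = call_tool("list_datasets", {})
    experiment = call_tool(
        "create_experiment",
        {"project_id": project["id"], "name": experiment_name},
    )
    return {"project": project, "datasets": datasets, "experiment": experiment}


# Stub standing in for a real MCP client; it returns made-up data.
calls = []


def fake_call_tool(name: str, arguments: dict) -> Any:
    calls.append(name)
    if name == "list_datasets":
        return [{"id": "ds-1", "name": "ground-truth"}]
    return {"id": f"{name}-1", **arguments}


result = run_evaluation_flow(fake_call_tool, "tone-check", "run-1")
print(calls)
```

Passing the project ID from the first call into the last one is the state passing the script version handles by hand.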

Forgetting the data source

The team assumes the test data is in the main database, but the schema has changed. They run the test, but the results are garbage because the data wasn't versioned or isolated.

Always start by running list_datasets to find the isolated Ground Truth bank. Then use get_dataset to ensure you're pulling the exact, correct schema required for the test.

Modifying prompts directly

A developer tweaks the core prompt text in the codebase, but doesn't realize the change broke the required professional tone, and the test fails silently.

Use list_prompts to view all version-controlled prompts. When you need a template, use get_prompt to pull the exact, frozen version, guaranteeing the test runs against the intended instructions.
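
Once you have the frozen template, a small check like the one below can catch a missing variable before the test runs. It is a sketch that assumes the template uses `{{name}}` placeholders; the template text and variable names are invented for illustration and are not taken from get_prompt output.

```python
import re

# Matches {{ name }} style placeholders (an assumed template syntax).
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, variables: dict) -> str:
    """Fill a template, failing loudly if any placeholder has no value."""
    needed = set(PLACEHOLDER.findall(template))
    missing = needed - set(variables)
    if missing:
        raise KeyError(f"missing prompt variables: {sorted(missing)}")
    return PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), template)


# Hypothetical template and values.
template = "Describe {{feature}} in a {{ tone }} tone."
rendered = render_prompt(template, {"feature": "dark mode", "tone": "professional"})
print(rendered)
```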

When It Fits, When It Doesn't

Use Braintrust if your core job is measuring and tracking model performance against defined, stable truth. You need to know why the model changed, not just that it changed. Use this if you need to:

  • Isolate a test run (create_project).
  • Query specific, versioned data (get_dataset).
  • Compare results over time (create_experiment).

Don't use this if your goal is simple data retrieval or basic CRUD operations. If you just want to see what already exists, list_projects and list_prompts are simpler starting points. If you only need to run a test once and don't care about history, a simple script might suffice, but you lose the auditing capability.

Independent Platform Disclaimer: Vinkius is an independent platform and is not affiliated with, endorsed by, sponsored by, verified by, or otherwise authorized by Braintrust. All third-party trademarks, logos, and brand names are the property of their respective owners. Their use on this website is strictly for informational purposes to identify service compatibility and interoperability.

VINKIUS INFRASTRUCTURE

Cloud Hosted

Managed infra

V8 Isolated

Sandboxed per request

Zero-Trust Proxy

No stored credentials

DLP Enforced

Policy on every call

GDPR Compliant

EU data residency

Token Compression

~60% cost reduction

Works with Claude, ChatGPT, Cursor, and more

The Model Context Protocol standardizes how applications expose capabilities to LLMs. Instead of operating in isolation, your AI gains direct access to external platforms, live data, and real-world actions through secure, standardized connections.

This server provides 10 capabilities that interface natively with Claude, ChatGPT, Cursor, and any MCP client. No middleware. No custom integration required.

Available Capabilities

create_experiment create_project get_dataset get_prompt insert_dataset_row list_datasets list_env_vars list_experiments list_projects list_prompts
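
To confirm that your client actually sees all ten capabilities, you can compare the names above with what the server advertises. MCP servers describe their tools in the result of a `tools/list` request, where each entry has a `name`, a `description`, and an `inputSchema`. The sample result below is a truncated, made-up response used only to exercise the check.

```python
# The ten capability names listed on this page.
EXPECTED_TOOLS = set(
    "create_experiment create_project get_dataset get_prompt insert_dataset_row "
    "list_datasets list_env_vars list_experiments list_projects list_prompts".split()
)


def missing_tools(tools_list_result: dict) -> set:
    """Return expected tool names that a `tools/list` result does not include."""
    advertised = {tool["name"] for tool in tools_list_result.get("tools", [])}
    return EXPECTED_TOOLS - advertised


# Hypothetical, truncated `tools/list` result for illustration.
sample_result = {
    "tools": [
        {"name": "list_projects", "description": "...", "inputSchema": {"type": "object"}},
        {"name": "get_dataset", "description": "...", "inputSchema": {"type": "object"}},
    ]
}
print(sorted(missing_tools(sample_result)))
```

The `inputSchema` of each advertised tool is also where you would find the real argument names for a call.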

Debugging AI outputs shouldn't mean juggling a dozen tabs and APIs.

Today, testing a model's logic is a manual mess. You copy the prompt template into a notebook, run the code, then copy the output into a spreadsheet. Then you check the version control to make sure you're using the right dataset schema, all while hoping you didn't accidentally modify the core prompt files.

With the Braintrust MCP Server, you tell your agent to check the logic. It uses `list_datasets` and `get_dataset` to pull the right data. It uses `get_prompt` to grab the exact instructions. The agent runs the test and spits out the full, auditable result, no copy-pasting required.

Braintrust MCP Server: Pinpoint model regressions instantly.

You no longer have to run a test, wait for the results, and then try to manually compare the new output to the previous week's result. You simply call `create_experiment` with the new data. The agent handles the comparison, showing you the exact percentage shift and the fields that broke.

The difference is history. You gain a dedicated, auditable record of every test run, making model debugging a repeatable, single-command process.

Common Questions About Braintrust MCP

How do I use Braintrust to check if my model output is accurate?

You use get_dataset to retrieve the specific dataset containing the required Ground Truth schemas. This ensures your model output is scored against the correct, audited standard.

Can I track multiple model versions with Braintrust MCP Server?

Yes. You use create_project to isolate the environment, and then list_experiments and create_experiment to track and compare multiple model runs over time.

What is the best way to test a new prompt template using Braintrust MCP Server?

First, use list_prompts to see available templates. Then, use get_prompt to pull the exact text. This guarantees your test runs against the version you intended.

How do I add new test cases to my dataset?

You run insert_dataset_row to append new test cases directly into the dataset matrix without modifying the underlying source data.

How do I manage my API credentials using the list_env_vars tool in Braintrust?

Use the list_env_vars tool to probe the Braintrust AI Gateway configurations. This confirms that your model API keys are managed securely within the system.

What happens if I try to run an experiment with an invalid dataset ID using Braintrust?

The system returns an explicit error detailing the invalid ID and the necessary format. This stops the experiment from being recorded against a dataset that doesn't exist.
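
A minimal sketch of how a client might surface that kind of error. In MCP, a tool result carries a `content` list and can set `isError` to true when the call failed. Whether this server reports an invalid dataset ID exactly this way is an assumption, and the error text below is invented.

```python
class ToolCallError(Exception):
    """Raised when an MCP tool result is flagged as an error."""


def unwrap_tool_result(result: dict) -> str:
    """Return the text content of a tool result, or raise if it is an error."""
    text = "\n".join(
        item.get("text", "")
        for item in result.get("content", [])
        if item.get("type") == "text"
    )
    if result.get("isError"):
        raise ToolCallError(text or "tool call failed")
    return text


# Hypothetical error result; the message wording is made up.
bad_result = {
    "content": [{"type": "text", "text": "Invalid dataset ID: 'abc'"}],
    "isError": True,
}
try:
    unwrap_tool_result(bad_result)
except ToolCallError as err:
    print(f"experiment not created: {err}")
```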

Can Braintrust handle large-scale dataset matrices for evaluation?

Yes, Braintrust supports constructing massive matrices. You can append test cases using insert_dataset_row to evaluate large datasets without running script queries.

How can I retrieve version-controlled prompts using the list_prompts tool in Braintrust?

The list_prompts tool retrieves all explicitly version-controlled system prompts. You'll get access to the exact variable contexts and literal text templates.

Can I add new test data on the fly?

Yes. insert_dataset_row appends a new test case row to a dataset, so the new case is included the next time you evaluate against it.

Can it retrieve the original prompt definitions stored in Braintrust?

Yes. get_prompt returns the exact variable contexts and literal text templates of a version-controlled prompt stored in Braintrust.

How much detail can it give on test regressions and scores?

list_experiments returns all evaluation experiments with their test scores and metrics, so you can compare how different model versions behave across runs and spot performance anomalies.

Built & Managed by Vinkius. 30s setup. 10 tools.

We've already built the connector for Braintrust. Just plug in your AI agents and start using Vinkius.

No hosting. No infrastructure. No complex setup.
All 10 tools are live and waiting. You're up and running in seconds.

Vinkius gives your AI agents access to the full catalog of app connectors, all fully managed, secure, and enterprise-ready. One subscription, every tool you need.

  • Zero hosting required
  • Full MCP catalog included
  • Enterprise-grade security
  • Auto-updated by Vinkius

Built, hosted, and secured by Vinkius. You just connect and go.