Braintrust MCP for AI. Prove model quality with systematic evaluation.

Works with every AI agent you already use

Claude, ChatGPT, Cursor, Gemini, Windsurf, VS Code, JetBrains, Vercel, and any MCP-compatible client

Connect to your AI in seconds.

Braintrust helps developers systematically test and validate LLMs. You manage projects, track prompt versions, run complex benchmark experiments, and query structured 'Ground Truth' data—all within one place.

Stop guessing if your model works; prove it.

Track Model Performance

Run formal experiments that record and compare LLM outputs against historical runs.

Manage Test Data Sets

Query accurate, structured 'Ground Truth' data sets to score model responses automatically.

Version Prompt Templates

Securely grab and compare specific versions of system prompts without touching the core code base.

Organize Evaluation Scope

Create isolated projects to keep different model test runs separate and clean.

Braintrust: 10 Tools for Evaluation

These tools let you build a complete testing pipeline, allowing you to define projects, retrieve data sets, version prompts, and track every single test run result.

Make your AI actually useful.

Add this MCP to Claude, Cursor, or Windsurf and your AI stops guessing. It gets real tools to look things up, take action, and handle the stuff you keep doing by hand.

Start using Braintrust on Vinkius

Create Experiment

Records a new historical experiment trace to track LLM pipeline tests.

Create Project

Sets up a new project environment for tracking AI evaluations and data sets.

List Datasets

Lists available 'Ground Truth' text banks used for automated evaluation scoring.

List Env Vars

Checks the Braintrust AI Gateway configuration, securely showing which model API keys are configured.

List Experiments

Retrieves all recorded evaluation experiments, mapping out model test scores and...

Get Dataset

Retrieves a specific dataset containing structured schemas that bound LLM outputs.

Get Prompt

Grabs the exact variable contexts and literal text templates used in a prompt.

Insert Dataset Row

Adds new test cases into an existing dataset matrix for specific evaluations.

List Projects

Lists all existing AI evaluation projects configured in Braintrust.

List Prompts

Retrieves a list of system prompts that are explicitly version-controlled inside...

Security and governance baked right in.

Pick your AI client below to get set up. Create a Vinkius account, subscribe, and you're up and running. We handle the backend infrastructure, with out-of-the-box support for Streamable HTTP, SSE, and OAuth2, so no custom routing is required.

Claude AI

Open Claude Settings

Go to claude.ai, click your profile icon, then navigate to Customize → Connectors.

Add Custom Connector

Click the "+" button and select Add custom connector. Paste your Vinkius endpoint URL:

https://edge.vinkius.com/[YOUR_TOKEN_HERE]/mcp

Replace [YOUR_TOKEN_HERE] with your token from cloud.vinkius.com. For OAuth-protected servers, expand Advanced settings to add credentials.

Start a conversation

Open a new chat. The Braintrust integration is available immediately — no restart needed.

Antigravity

Configure Agent Environment

Open your Antigravity agent's workspace configuration or mcp-servers.json file.

Bind the Endpoint

Add the Vinkius endpoint URL to your agent's MCP connections list:

"mcp_servers": {
  "braintrust": {
    "serverUrl": "https://edge.vinkius.com/[YOUR_TOKEN_HERE]/mcp"
  }
}

Provide your secure token in place of [YOUR_TOKEN_HERE] to ensure your agent requests are authenticated.

Execute

Start your Antigravity session. The agent will autonomously discover and utilize the Braintrust tools with full Vinkius guardrails applied.

VS Code Copilot

One-Click Install (Recommended)

In your Vinkius Dashboard, simply click the Add to VS Code button for this server. We'll automatically configure your local workspace.

Or configure manually

Open MCP Settings

Open VS Code, press Ctrl/Cmd + Shift + P, and search for GitHub Copilot: MCP Servers.

Add Server Config

Add the Vinkius endpoint configuration to your mcp-servers.json file:

"braintrust": {
  "url": "https://edge.vinkius.com/[YOUR_TOKEN_HERE]/mcp"
}

Ensure you replace [YOUR_TOKEN_HERE] with your token from cloud.vinkius.com.

LangChain

Install Dependencies

Install the LangChain MCP adapters for your environment:

pip install langchain-mcp-adapters

Connect the Server

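Use MultiServerMCPClient from langchain-mcp-adapters to connect to the Vinkius managed endpoint:

import asyncio
from langchain_mcp_adapters.client import MultiServerMCPClient

# Connect to Vinkius (the transport value is an assumption; use "sse" if your endpoint requires it)
client = MultiServerMCPClient({
    "braintrust": {"url": "https://edge.vinkius.com/[YOUR_TOKEN_HERE]/mcp", "transport": "streamable_http"},
})
tools = asyncio.run(client.get_tools())  # get_tools() is a coroutine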

CrewAI

Define the Tool

Load the Vinkius MCP tools into your CrewAI agents:

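from crewai import Agent
from crewai_tools import MCPServerAdapter  # requires: pip install 'crewai-tools[mcp]'

# Connect securely to Vinkius (the transport value is an assumption; use "sse" if your endpoint requires it)
server_params = {"url": "https://edge.vinkius.com/[YOUR_TOKEN_HERE]/mcp", "transport": "streamable-http"}

with MCPServerAdapter(server_params) as vinkius_tools:
    # Assign to Agent (CrewAI agents need a role, goal, and backstory; these strings are illustrative)
    researcher = Agent(
        role='Data Researcher',
        goal='Evaluate model outputs with the Braintrust tools',
        backstory='An analyst who validates LLM quality against ground truth data.',
        tools=vinkius_tools,
    )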

Execute Task

Run your CrewAI process. The agent will autonomously route tasks to the Vinkius managed server.

Choose How to Get Started

Build a custom MCP for your own tools, or connect a ready-made integration from our catalog.

Build Your Own

Turn any API into an MCP. Import a spec, define Agent Skills, or deploy with MCPFusion.

Import from OpenAPI, Swagger, or YAML specs
Create Agent Skills with progressive disclosure
Deploy to edge with MCPFusion framework
Built in DLP, auth, and compliance on every call
Real time usage dashboard and cost metering
Publish to catalog or keep private

Start building

Make Your AI Do More

Start with Braintrust, then connect any of our 5,100+ other servers whenever your AI needs more. One click, no limits.

Use this MCP plus 5,100+ others, all in one place
Add new capabilities to your AI anytime you want
Every connection is secured and compliant automatically
Track usage and costs across all your servers
Works with Claude, ChatGPT, Cursor, and more
New servers added to the catalog every week

Independent Platform Disclaimer: Vinkius is an independent platform and is not affiliated with, endorsed by, sponsored by, verified by, or otherwise authorized by Braintrust. All third-party trademarks, logos, and brand names are the property of their respective owners. Their use on this website is strictly for informational purposes to identify service compatibility and interoperability.

VINKIUS INFRASTRUCTURE

Cloud Hosted: Managed infra
V8 Isolated: Sandboxed per request
Zero-Trust Proxy: No stored credentials
DLP Enforced: Policy on every call
GDPR Compliant: EU data residency
Token Compression: ~60% cost reduction

Your data is protected. See how we built it.

Works with Claude, ChatGPT, Cursor, and more

The Model Context Protocol standardizes how applications expose capabilities to LLMs. Instead of operating in isolation, your AI gains direct access to external platforms, live data, and real-world actions through secure, standardized connections.

This connection provides 10 powerful capabilities that interface natively with Claude, ChatGPT, Cursor, and other compatible AI platforms. No middleware. No custom integration required.

Testing AI outputs feels like guesswork right now.

Today, when your model fails, you usually end up in a messy cycle: checking the input logs, opening another tab to look at the prompt template, cross-referencing it with a separate spreadsheet of 'known good' answers. You spend half your time just trying to collect enough data points to figure out *why* it went wrong.

With this MCP, you stop guessing. The platform handles that messy process. You define the scope using projects and datasets; then, when the model outputs a response, the system scores it against your 'Ground Truth' immediately. What you get is clean, measurable data about performance.
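To make "scoring against Ground Truth" concrete, here is a minimal sketch of the idea using a simple exact-match scorer. It is an illustration only, not Braintrust's scoring logic; the names `Case`, `exact_match`, and `score_run` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """One ground-truth test case: an input and the answer we expect."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Return 1.0 when the output equals the expected answer, ignoring case and surrounding whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def score_run(cases: list[Case], outputs: list[str]) -> float:
    """Average exact-match score of model outputs against the ground-truth cases."""
    if len(cases) != len(outputs):
        raise ValueError("each case needs exactly one model output")
    if not cases:
        return 0.0
    scores = [exact_match(out, case.expected) for case, out in zip(cases, outputs)]
    return sum(scores) / len(scores)


cases = [Case("2 + 2", "4"), Case("Capital of France?", "Paris")]
print(score_run(cases, ["4", " paris "]))
print(score_run(cases, ["5", "Paris"]))
```

Real evaluations usually need softer scorers (similarity, rubric-based grading), but the shape stays the same: every output is tied to a known expected value and reduced to a number you can compare across runs.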

Braintrust gives you full control over model evaluation.

You no longer have to rely on vague metrics or manual spot-checks. You can use `list_projects` to see every test environment, and then run specific comparisons by retrieving prompt templates using `get_prompt`. This gives you an audit trail of everything.

The difference is control. You move from 'I hope this works' to 'Here are the metrics proving it works.' It’s a fundamental shift in how you build reliable AI.

Support (24/7): support@vinkius.com
Security: Vinkius Trust Center
SLA: Service Level Agreement

What your AI can actually do with this

Building reliable AI models means more than just writing a single good prompt. It demands rigorous testing across multiple variables. This MCP lets you set up formal evaluation pipelines right from your agent, giving you full visibility into exactly how the model behaves under pressure. You can track specific variable distributions and compare outputs against historical benchmarks without ever leaving your chat window.

Need to check if a new feature broke an old response pattern? Use this MCP. If you're building anything complex for production, connecting it through the Vinkius catalog is the right move. It lets you turn vague model performance anxiety into concrete data points.

Built · Hosted · Managed by Vinkius Braintrust MCP - Systematic AI Model Benchmarking

Server ID 019d7562-3c18-72ce-8b34-c4fc9e9f37ad

Vinkius Inspector

Compliance Grade A+

Score 100/100

Report View Report ↗

What Changes When You Connect

You track prompt changes instantly. Instead of manually checking code, use list_prompts to see your version-controlled prompts and get_prompt to pull an exact template, then test it without touching your core system.

Never lose a data point again. Use insert_dataset_row to append new test cases into an existing dataset matrix, ensuring every evaluation has fresh coverage.

Visualize performance changes with certainty. Running list_experiments gives you all historical runs, letting you compare model scores and metrics side by side.

Maintain clean separation of concerns. Use create_project to isolate different types of testing—say, one project for user onboarding flows and another for admin commands.

Know exactly what data you're using. By calling list_datasets, you see all your 'Ground Truth' repositories before writing a single test case.
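Your AI client normally issues these tool calls for you, but it can help to see what one looks like on the wire. In the Model Context Protocol, a tool invocation is a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a request for create_project; the argument key `name` is an assumption, since the tool's actual input schema is not shown on this page, and `tool_call` is a hypothetical helper.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry a unique id


def tool_call(tool_name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request as a plain dict."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# The "name" argument is assumed for illustration; check the tool's schema in your client.
request = tool_call("create_project", {"name": "Onboarding V2"})
print(json.dumps(request, indent=2))
```

A client would discover the real argument names through the server's tool listing before sending a request like this.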

See it in action

01

Validating a new onboarding flow

A product manager needs to know if the LLM handles edge-case user input correctly. They use create_project to isolate 'Onboarding V2'. Then, they run multiple tests using get_dataset data and track all results with create_experiment. This proves that the new flow doesn't regress on old bugs.

02

A/B testing prompt variations

An ML engineer needs to compare two versions of a summarization prompt. They use list_prompts to find both templates and get_prompt to retrieve each one, then call create_experiment twice, feeding the same inputs into both setups to measure which one earns better alignment scores.

03

Debugging production failures

A developer notices a drop in performance. They use the List Env Vars tool and get_dataset to retrieve the gateway configuration and the specific ground truth data involved in the failing cases, which narrows down where the issue lies.

04

Archiving model versions

A team is retiring an old version of their chatbot. They use list_projects to find its project and list_experiments to review every evaluation run tied to it, ensuring no historical test data or metrics are lost before decommissioning the model.
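For the A/B scenario above, the comparison step comes down to scoring both prompt versions on the same inputs and comparing the results. A minimal sketch, assuming you already have one score per input for each variant; `compare_variants` and the sample scores are hypothetical and not produced by Braintrust.

```python
from statistics import mean


def compare_variants(scores_a: list[float], scores_b: list[float]) -> dict:
    """Compare two prompt variants that were scored on the same inputs, in the same order."""
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("both variants must be scored on the same non-empty set of inputs")
    mean_a, mean_b = mean(scores_a), mean(scores_b)
    if mean_a > mean_b:
        winner = "A"
    elif mean_b > mean_a:
        winner = "B"
    else:
        winner = "tie"
    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        # Per-input wins show whether the average is driven by a few outliers.
        "wins_a": sum(a > b for a, b in zip(scores_a, scores_b)),
        "wins_b": sum(b > a for a, b in zip(scores_a, scores_b)),
        "winner": winner,
    }


result = compare_variants([0.8, 0.6, 0.9], [0.7, 0.7, 0.5])
print(result)
```

Keeping the inputs identical across both experiments is what makes the comparison fair; a higher mean on a different dataset says little about the prompt itself.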

The honest tradeoffs

Treating the LLM like a simple chat window

Anti-pattern

Just pasting 10 questions into your agent and hoping the answers are good. You get random, unorganized output with no way to measure if it's wrong or just different.

The Fix

You have to build structure. Use create_project first, then populate test cases using insert_dataset_row. This forces systematic testing so you can actually score the results.

Manually tracking prompt changes in spreadsheets

Anti-pattern

Copying and pasting old prompts into a sheet to compare them. It's slow, prone to formatting errors, and doesn't track version history or dependencies.

The Fix

Use list_prompts and get_prompt. This captures the exact, frozen semantic prompt template and its version ID, giving you an immutable record.

Running tests without context

Anti-pattern

Just running a test against the live model build. You get results, but if it fails, you don't know if the failure was due to bad data or bad code.

The Fix

You must run controlled experiments. Use create_experiment combined with list_datasets. This ties your failing output directly back to a specific test case and project.

When It Fits, When It Doesn't

Use this MCP if validating model quality is mission-critical. If you need to compare Model A's responses against Model B's, or prove that a prompt change didn't break existing functionality, this toolset gives you the structure of formal testing.

Don't use it if your goal is simply to chat with the model and see what it says. For those basic conversational checks, a standard agent connection works fine. For production systems, the kind that handle money or critical data, you need the rigor provided by list_datasets and create_experiment. If you only need to view old results, list_experiments alone may be enough; if you're building a verifiable product, use the full toolset.

Questions you might have

How do I start testing my model with Braintrust using `create_project`?

You first call create_project to establish the boundaries for your tests. This gives you a clean, isolated environment that prevents new test runs from contaminating existing project data.

What is the difference between `get_dataset` and `list_datasets`?

list_datasets shows you all available 'Ground Truth' text banks. You then use get_dataset to pull a specific, structured dataset for active testing.

How do I track changes to my prompt templates with Braintrust?

Use the list_prompts tool to see all version-controlled prompts. You can then call get_prompt to retrieve a specific template by its ID, ensuring you test against an exact version.

Can I add custom failed tests using Braintrust?

Yes. After running a batch of tests, you use insert_dataset_row to manually append new failure cases or specific edge-case inputs into your dataset matrix for future runs.

How do I check which API keys are configured for Braintrust using `list_environments_vars`?

It shows you all the current gateway configuration variables. This is how your agent accesses the necessary model API keys securely without needing manual setup.

If I want to review previous test runs, what does `list_experiments` retrieve?

list_experiments retrieves a comprehensive map of all past evaluation attempts. This lets you check historical metrics and model scores across various run IDs.

Can I use `insert_dataset_row` to append just a single test case into my matrix?

Yes, that's exactly what it does. You can target a specific dataset and inject new evaluation data row by row without having to build an entire master sheet first.

Before starting a new project, how do I use `list_projects` to see current evaluations?

list_projects gives you the list of all existing AI evaluation containers. This helps you confirm your scope and choose the right environment for your next test.

Can I insert new test data programmatically?

Yes. With insert_dataset_row you can add a JSON payload as a new row in an existing dataset, so later evaluations are scored against it.

Does it retrieve the original prompt definitions stored in Braintrust?

Yes. get_prompt returns the version-controlled prompt, including its literal template text and variables, as stored in Braintrust.

How deeply can it inspect test regressions or scores?

list_experiments returns your recorded experiments, so you can compare how different model or prompt versions behaved across many runs and spot performance anomalies.
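Once you have historical scores from list_experiments, spotting a regression is a matter of comparing each run with the one before it. The sketch below assumes each experiment has been reduced to a name and a single score, ordered oldest first; that record shape, the threshold, and `find_regressions` are hypothetical and not the tool's actual output format.

```python
def find_regressions(experiments: list[dict], threshold: float = 0.05) -> list[tuple]:
    """Return (previous, current, drop) for each run whose score fell by more than `threshold`."""
    regressions = []
    for prev, curr in zip(experiments, experiments[1:]):
        drop = prev["score"] - curr["score"]
        if drop > threshold:
            regressions.append((prev["name"], curr["name"], round(drop, 4)))
    return regressions


# Illustrative history, oldest run first.
history = [
    {"name": "run-1", "score": 0.82},
    {"name": "run-2", "score": 0.85},
    {"name": "run-3", "score": 0.71},
]
print(find_regressions(history))
```

The threshold keeps small run-to-run noise from being flagged; pick a value that matches how much your scores normally vary.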
