Ragas MCP for AI. Run RAG evaluations and track metrics from your chat.

Works with every AI agent you already use

Claude, ChatGPT, Cursor, Gemini, Windsurf, VS Code, JetBrains, Vercel, and any MCP-compatible client

Connect to your AI in seconds.

Ragas lets your AI client manage professional RAG evaluation and tracking directly inside your chat or IDE. It provides specialized tools to list datasets, run evaluations against LLM pipelines, fetch detailed metrics like faithfulness, and track experiment versions without needing a separate dashboard.

List available datasets

The agent calls list_datasets to retrieve the names and IDs of all evaluation datasets configured in your Ragas project.

Get specific dataset details

You use get_dataset to pull metadata for a single dataset, checking its schema or required parameters before an evaluation run.

Start a new RAG pipeline evaluation

The agent executes run_evaluation, kicking off the scoring process against a specified dataset and model configuration.

Find experiment history

You ask the client to run list_experiments to see all past evaluation runs associated with a given dataset ID.

Retrieve final test scores

The agent calls get_results to pull the summarized, aggregate performance score for a completed experiment.

List all measurable metrics

You use list_metrics to check which scoring dimensions (e.g., faithfulness, answer relevancy) are available for reporting.
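Under the hood, an MCP client invokes each of these tools with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such payloads with the standard library only. The tool names come from this page; the `dataset_id` argument name, the `ds_123` value, and the `build_tool_call` helper are assumptions for illustration, so check the input schemas the server reports before relying on them.

```python
import itertools
import json

# Monotonic request ids, as JSON-RPC expects each request to carry its own id.
_request_ids = itertools.count(1)

def build_tool_call(name, arguments=None):
    """Return a JSON-RPC 2.0 request dict for an MCP tools/call invocation."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }

# "dataset_id" is a hypothetical argument name, not confirmed by the server schema.
list_call = build_tool_call("list_datasets")
detail_call = build_tool_call("get_dataset", {"dataset_id": "ds_123"})

print(json.dumps(detail_call, indent=2))
```

In practice your AI client builds and sends these requests for you; the sketch only makes the shape of a tool call visible.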

Ragas MCP Server: 7 Tools for RAG Evaluation

These tools let your agent handle the full lifecycle of RAG testing: listing data, running tests, and retrieving verifiable performance metrics.

Make your AI actually useful.

Add this MCP to Claude, Cursor, or Windsurf and your AI stops guessing. It gets real tools to look things up, take action, and handle the stuff you keep doing by hand.

Start using Ragas on Vinkius

List Datasets

Lists all available datasets used for RAG testing in your project.

Get Dataset

Retrieves specific metadata for one evaluation dataset ID.

List Experiments

Shows a list of past experiments tied to a specific dataset ID.

Get Experiment

Gets detailed information about a single, recorded experiment run.

Run Evaluation

Initiates a new Ragas evaluation run on the specified dataset ID.

List Metrics

Outputs every scoring dimension available for RAG evaluation (e.g., faithfulness, relevancy).

Get Results

Retrieves the final scoring metrics and outcomes from a completed evaluation run.

Security and governance baked right in.

Pick your AI client below to get set up. Create a Vinkius account, subscribe, and you are ready to go. We host the entire backend infrastructure, with built-in support for Streamable HTTP, SSE, and OAuth2, so there is no routing for you to configure.

Claude AI

Open Claude Settings

Go to claude.ai, click your profile icon, then navigate to Customize → Connectors.

Add Custom Connector

Click the "+" button and select Add custom connector. Paste your Vinkius endpoint URL:

https://edge.vinkius.com/[YOUR_TOKEN_HERE]/mcp

Replace [YOUR_TOKEN_HERE] with your token from cloud.vinkius.com. For OAuth-protected servers, expand Advanced settings to add credentials.

Start a conversation

Open a new chat. The Ragas integration is available immediately — no restart needed.

Antigravity

Configure Agent Environment

Open your Antigravity agent's workspace configuration or mcp-servers.json file.

Bind the Endpoint

Add the Vinkius endpoint URL to your agent's MCP connections list:

"mcp_servers": {
  "ragas": {
    "serverUrl": "https://edge.vinkius.com/[YOUR_TOKEN_HERE]/mcp"
  }
}

Provide your secure token in place of [YOUR_TOKEN_HERE] to ensure your agent requests are authenticated.

Execute

Start your Antigravity session. The agent will autonomously discover and utilize the Ragas tools with full Vinkius guardrails applied.

VS Code Copilot

One-Click Install (Recommended)

In your Vinkius Dashboard, simply click the Add to VS Code button for this server. We'll automatically configure your local workspace.

Or configure manually

Open MCP Settings

Open VS Code, press Ctrl/Cmd + Shift + P, and search for GitHub Copilot: MCP Servers.

Add Server Config

Add the Vinkius endpoint configuration under the servers key of your mcp.json file:

"servers": {
  "ragas": { "type": "http", "url": "https://edge.vinkius.com/[YOUR_TOKEN_HERE]/mcp" }
}

Ensure you replace [YOUR_TOKEN_HERE] with your token from cloud.vinkius.com.

LangChain

Install Dependencies

Install the LangChain MCP adapters for your environment:

pip install langchain-mcp-adapters

Connect the Server

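Use MultiServerMCPClient from langchain-mcp-adapters to connect to the Vinkius managed endpoint:

import asyncio
from langchain_mcp_adapters.client import MultiServerMCPClient

# Connect to Vinkius; the /mcp endpoint is assumed to speak streamable HTTP
client = MultiServerMCPClient(
    {"ragas": {"url": "https://edge.vinkius.com/[YOUR_TOKEN_HERE]/mcp", "transport": "streamable_http"}}
)
# get_tools() is a coroutine, so run it inside an event loop
tools = asyncio.run(client.get_tools())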

CrewAI

Define the Tool

Load the Vinkius MCP tools into your CrewAI agents:

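from crewai import Agent
from crewai_tools import MCPServerAdapter  # pip install "crewai-tools[mcp]"

# Connect securely to Vinkius over streamable HTTP
server_params = {
    "url": "https://edge.vinkius.com/[YOUR_TOKEN_HERE]/mcp",
    "transport": "streamable-http",
}

with MCPServerAdapter(server_params) as vinkius_tools:
    # Assign to Agent (role, goal, and backstory are required fields)
    researcher = Agent(
        role="Data Researcher",
        goal="Evaluate RAG pipelines with the Ragas tools",
        backstory="An analyst who benchmarks retrieval-augmented generation systems.",
        tools=vinkius_tools,
    )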

Execute Task

Run your CrewAI process. The agent will autonomously route tasks to the Vinkius managed server.

Choose How to Get Started

Build a custom MCP for your own tools, or connect a ready-made integration from our catalog.

Build Your Own

Turn any API into an MCP. Import a spec, define Agent Skills, or deploy with MCPFusion.

Import from OpenAPI, Swagger, or YAML specs
Create Agent Skills with progressive disclosure
Deploy to edge with MCPFusion framework
Built in DLP, auth, and compliance on every call
Real time usage dashboard and cost metering
Publish to catalog or keep private

Start building

Make Your AI Do More

Start with Ragas, then connect any of our 5,100+ other servers whenever your AI needs more. One click, no limits.

Use this MCP plus 5,100+ others, all in one place
Add new capabilities to your AI anytime you want
Every connection is secured and compliant automatically
Track usage and costs across all your servers
Works with Claude, ChatGPT, Cursor, and more
New servers added to the catalog every week

Independent Platform Disclaimer: Vinkius is an independent platform and is not affiliated with, endorsed by, sponsored by, verified by, or otherwise authorized by Ragas. All third-party trademarks, logos, and brand names are the property of their respective owners. Their use on this website is strictly for informational purposes to identify service compatibility and interoperability.

VINKIUS INFRASTRUCTURE

Cloud Hosted

Managed infra

V8 Isolated

Sandboxed per request

Zero-Trust Proxy

No stored credentials

DLP Enforced

Policy on every call

GDPR Compliant

EU data residency

Token Compression

~60% cost reduction

Your data is protected. See how we built it.

Works with Claude, ChatGPT, Cursor, and more

The Model Context Protocol standardizes how applications expose capabilities to LLMs. Instead of operating in isolation, your AI gains direct access to external platforms, live data, and real-world actions through secure, standardized connections.

This connection provides 7 powerful capabilities that interface natively with Claude, ChatGPT, Cursor, and other compatible AI platforms. No middleware. No custom integration required.

Testing LLMs shouldn't require context switching or boilerplate code.

Today, running a proper RAG evaluation means navigating to a separate dashboard. You upload the dataset there, click 'Run Test,' wait for it to process, then download a CSV of scores. If you want to compare two models, you repeat that entire cycle—copying IDs, remembering which score is faithfulness, and pasting everything into a spreadsheet.

With this MCP server, your agent handles all the tedious parts. You simply talk to your client: 'Run Model B against the Legal Q1 Test.' The agent executes `run_evaluation`, pulls back the detailed metrics via tools like `get_results`, and shows you the clean numbers right in your chat window. No dashboard hops required.

Ragas MCP Server: Get structured RAG evaluation results.

Manual testing means running scripts locally, then manually updating a central tracking sheet with the final score and model version. It's slow, prone to human error, and makes comparing multiple runs nearly impossible without deep manual effort.

Now, your client controls this entire process. You use `list_datasets` for discovery, trigger tests with `run_evaluation`, and finally retrieve structured data using `get_results`. The whole measurement chain is automated, verifiable, and right where you work.

Support 24/7 support@vinkius.com ↗

Security Vinkius Trust Center ↗

SLA Service Level Agreement ↗

Report Listing Send Report ↗

What your AI can actually do with this

Ragas gives your AI client professional-grade Retrieval-Augmented Generation (RAG) evaluation and tracking right inside your chat or IDE. It's built to let you manage datasets and measure how well your LLM pipelines actually perform, all without a separate dashboard. You don't have to leave your workflow just to check scores.

If you need to get started, the first thing your agent calls is list_datasets. This action shows you every dataset ID configured for RAG testing in your project. Once you know which data pool you're working with, you can use get_dataset to pull specific metadata for a single ID; this lets you check things like schema details or required parameters before you kick off any evaluation run.

Before the test itself, you need to know which metrics you are measuring. Call list_metrics to get every scoring dimension that Ragas can report on, such as faithfulness and answer relevancy. After that, your agent executes run_evaluation, which starts the full scoring process against a specific dataset ID and model setup.

Once the evaluation finishes, you use get_results to pull the summary: it gives you the final, aggregate performance score for that entire run. But if you need to track how your models change over time, you can ask the client to look at experiment history. By calling list_experiments, you see a record of every past evaluation run tied back to a specific dataset ID.

If you're digging into the specifics of one of those old tests, get_experiment pulls all the detailed information about that single recorded run.

Basically, if you're checking up on your RAG process, you'll use these tools in sequence: List what datasets exist; check a dataset's parameters; list the metrics available for scoring; initiate the evaluation run; and then grab the final scores or dive deep into the history of past runs.

Built · Hosted · Managed by Vinkius Ragas MCP Server - Evaluate RAG Models & Metrics

Server ID 019d75fc-3898-7169-9831-0da3f7c25d5a

Vinkius Inspector

Compliance Grade A+

Score 100/100

Report View Report ↗

Here's how it actually works

The bottom line is: You talk to your AI client in plain English, and it translates that into a sequence of calls (like list_datasets -> run_evaluation -> get_results) to get the final data.

First, enable the server integration and provide your Ragas Application URL and API Token.

Then, instruct your AI client to list datasets using list_datasets or run an evaluation with run_evaluation.

Finally, the client uses the returned IDs to call tools like get_results and display actionable performance metrics.
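The three steps above can be sketched as a small orchestration function. This is a minimal illustration, not the server's actual contract: the argument names (`dataset_id`, `experiment_id`), the response shapes, and the `fake_call_tool` stand-in are all assumptions, and a real client would route `call_tool` through its MCP session instead.

```python
from typing import Any, Callable, Dict

ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]

def evaluate_dataset(call_tool: ToolCaller, dataset_name: str) -> Dict[str, Any]:
    """Chain list_datasets -> run_evaluation -> get_results for one dataset."""
    datasets = call_tool("list_datasets", {})["datasets"]
    match = next((d for d in datasets if d["name"] == dataset_name), None)
    if match is None:
        raise LookupError(f"dataset {dataset_name!r} not found")
    run = call_tool("run_evaluation", {"dataset_id": match["id"]})
    return call_tool("get_results", {"experiment_id": run["experiment_id"]})

def fake_call_tool(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for a real MCP session so the sketch runs offline."""
    canned = {
        "list_datasets": {"datasets": [{"id": "ds_1", "name": "Legal Q1 Test"}]},
        "run_evaluation": {"experiment_id": "exp_1"},
        "get_results": {"faithfulness": 0.91, "answer_relevancy": 0.88},
    }
    return canned[name]

scores = evaluate_dataset(fake_call_tool, "Legal Q1 Test")
print(scores)
```

Passing the tool caller in as a parameter keeps the chaining logic testable without network access; the canned scores are made up for the example.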

Who is this actually for?

This is for ML Engineers who are tired of context switching between IDEs, dashboards, and local scripts; QA Specialists who run constant model tests and need rapid benchmarking; and Data Scientists who need to compare multiple RAG configuration changes side by side.

ML Engineer

You run pipeline evaluations without leaving your chat interface, using run_evaluation and then fetching the results with get_results.

QA Specialist for LLMs

You rapidly examine datasets and benchmark scores to verify that hallucination rates stay below a set threshold, using list_metrics to confirm which metrics are reported and get_results to read the scores.

Data Scientist

You compare the performance of several RAG configuration experiments side-by-side using unified data pulled from multiple calls to get_experiment.

What Changes When You Connect

Automated Scoring: Instead of writing boilerplate Python scripts, simply ask the agent to run_evaluation when a model changes. You get detailed scores without leaving your workflow.

Full Traceability: Need to compare Model V1 against Model V2? Use list_datasets and then track every test run using get_experiment. It keeps everything linked by project ID.

Deep Metric Visibility: Don't just look at the score. Call list_metrics to see exactly what is being measured (like faithfulness or answer relevancy) before you evaluate it.

Rapid Iteration Cycle: If a run scores poorly, immediately use get_results to pull the final scores and diagnose whether the issue was poor context retrieval or bad generation.

Project Organization: The server associates every metric set with a project ID. This prevents data sprawl and makes comparing results across different business units simple.

See it in action

01

QA needs to check for hallucination after an update

The QA specialist uploads the latest knowledge base, runs a new test set via run_evaluation, and then immediately uses get_results to pull the faithfulness score. If the score drops below 0.85, they know exactly where the model failed without checking a separate dashboard.
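A threshold gate like the one in this scenario could look as follows. The 0.85 cutoff comes from the text above; the flat metric-name-to-score mapping is an assumed shape for what get_results returns, and the sample scores are illustrative.

```python
FAITHFULNESS_THRESHOLD = 0.85  # cutoff taken from the scenario above

def check_faithfulness(results, threshold=FAITHFULNESS_THRESHOLD):
    """Return (passed, message) for a mapping of metric name -> score."""
    score = results.get("faithfulness")
    if score is None:
        return False, "faithfulness missing from results"
    if score < threshold:
        return False, f"faithfulness {score:.2f} is below {threshold:.2f}"
    return True, f"faithfulness {score:.2f} meets {threshold:.2f}"

# Illustrative scores, not real evaluation output.
passed, message = check_faithfulness({"faithfulness": 0.81, "answer_relevancy": 0.90})
print(passed, message)
```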

02

ML team needs to compare two models quickly

The ML engineer uses list_datasets to find 'Legal Q3 Test'. They then run Model A's evaluation, capture the metrics, and repeat the process for Model B. The agent structures the results so they can be compared side by side.
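A side-by-side view of two runs can be produced with a few lines of Python once both result sets are in hand. The sketch assumes each get_results response reduces to a mapping of metric name to score; the numbers for the two models are invented for the example.

```python
def compare_runs(results_a, results_b):
    """Return rows of (metric, score_a, score_b, delta) for metrics in both runs."""
    shared = sorted(set(results_a) & set(results_b))
    return [
        (m, results_a[m], results_b[m], round(results_b[m] - results_a[m], 4))
        for m in shared
    ]

# Illustrative scores only.
model_a = {"faithfulness": 0.88, "answer_relevancy": 0.91}
model_b = {"faithfulness": 0.93, "answer_relevancy": 0.89}

rows = compare_runs(model_a, model_b)
for metric, a, b, delta in rows:
    print(f"{metric:<18} {a:>6.2f} {b:>6.2f} {delta:>+7.2f}")
```

Restricting the table to metrics present in both runs avoids comparing a score against a missing value.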

03

Data Scientist wants a comprehensive metric audit

A data scientist calls list_metrics first to confirm all available scoring dimensions, then uses get_dataset to verify the input schema. This pre-flight check ensures no metrics are missed before running run_evaluation.

04

Debugging an old model run

A junior analyst knows a test ran last week but can't find the score. They use list_experiments with the dataset ID to pull up the specific experiment record, then call get_results for that exact run.
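Narrowing an experiment list to last week's runs is a simple date filter. The record shape here (an `id` plus an ISO 8601 `created_at` field) is a guess at what list_experiments might return, so adjust the field names to the real response.

```python
from datetime import datetime, timedelta, timezone

def recent_runs(experiments, now, days=7):
    """Return experiments created within the last `days` days, newest first."""
    cutoff = now - timedelta(days=days)
    dated = [(datetime.fromisoformat(e["created_at"]), e) for e in experiments]
    ordered = sorted(dated, key=lambda pair: pair[0], reverse=True)
    return [e for ts, e in ordered if ts >= cutoff]

# Hypothetical list_experiments output.
experiments = [
    {"id": "exp_old", "created_at": "2025-02-01T10:00:00+00:00"},
    {"id": "exp_new", "created_at": "2025-03-11T14:30:00+00:00"},
]
now = datetime(2025, 3, 15, tzinfo=timezone.utc)
print([e["id"] for e in recent_runs(experiments, now)])
```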

The honest tradeoffs

Relying on a single summary view

Anti-pattern

Thinking that just reading a high-level dashboard score is enough. You might see 'Score: 0.92' but have no idea if that number accounts for answer relevance or faithfulness.

The Fix

Always confirm the underlying metrics first. Use list_metrics to see all scoring dimensions, then use get_results to pull specific scores like 'faithfulness' and 'answer relevancy'.
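To see why a single number can mislead, consider a summary computed as an unweighted mean of two metrics. That averaging rule is an assumption made for this sketch (the page does not say how an aggregate score is derived), and the scores are invented.

```python
from statistics import mean

def summarize(results):
    """Return the unweighted mean plus the weakest metric in a results mapping."""
    weakest = min(results, key=results.get)
    return {
        "aggregate": round(mean(results.values()), 2),
        "weakest": weakest,
        "weakest_score": results[weakest],
    }

# Illustrative scores: a 0.92 aggregate that hides a weaker answer_relevancy.
summary = summarize({"faithfulness": 0.99, "answer_relevancy": 0.85})
print(summary)
```

Reporting the weakest metric next to the aggregate makes the gap visible without opening the full breakdown.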

Skipping dataset listing

Anti-pattern

Telling your agent to run an evaluation without first confirming the correct, up-to-date dataset ID. You might accidentally test against old or incomplete data.

The Fix

Always start with list_datasets. This gives you a definitive list of available resources and prevents running evaluations on stale IDs.
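A guard like the following checks a dataset ID against a fresh listing before any evaluation starts. The list-of-dicts shape with `id` and `name` keys, and the `ds_legal_q3` identifier, are assumptions standing in for the real list_datasets output.

```python
def resolve_dataset_id(datasets, requested_id):
    """Return requested_id if the server still lists it, else raise ValueError."""
    known_ids = {d["id"] for d in datasets}
    if requested_id not in known_ids:
        raise ValueError(
            f"unknown dataset id {requested_id!r}; known: {sorted(known_ids)}"
        )
    return requested_id

# Hypothetical list_datasets output.
listing = [{"id": "ds_legal_q3", "name": "Legal Q3 Test"}]
print(resolve_dataset_id(listing, "ds_legal_q3"))
```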

Assuming the last run is current

Anti-pattern

Calling get_results immediately after an edit, assuming the score reflects that change. The cached data might be outdated.

The Fix

If you modified the source material or model, you must explicitly trigger a new test using run_evaluation. Then, check for results with get_results.
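One way to catch stale scores is to compare when a run completed with when the pipeline last changed. Both timestamps are hypothetical inputs here: the page does not state that results carry a completion time, so treat this as a sketch of the check rather than a description of the server's fields.

```python
from datetime import datetime, timezone

def is_stale(result_completed_at, last_change_at):
    """True when the results were produced before the latest pipeline change."""
    return result_completed_at < last_change_at

# Illustrative timestamps.
completed = datetime(2025, 1, 10, 9, 0, tzinfo=timezone.utc)
changed = datetime(2025, 1, 12, 15, 30, tzinfo=timezone.utc)

needs_rerun = is_stale(completed, changed)
print("run_evaluation needed" if needs_rerun else "results are current")
```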

Questions you might have

How do I check if my dataset list is up to date using list_datasets?

You call the list_datasets tool. This command retrieves all current datasets associated with your project ID, letting you confirm which versions are available for testing.

I need to compare two models, do I use get_results or list_experiments?

Use list_datasets first. Then, run both models separately using run_evaluation. Finally, use get_experiment for each model's ID to pull detailed results and compare them.

What is the difference between get_results and list_experiments?

list_experiments shows you a history of runs (the metadata). get_results pulls the actual, final calculated scores for one specific run ID.

Can I see what metrics are available before I run an evaluation with list_metrics?

Yes. Running list_metrics shows every scoring dimension (like faithfulness) that Ragas can calculate, helping you know exactly what numbers to look for in the final report.

How do I authenticate my AI agent before using `list_datasets`?

You must provide your Ragas Application URL and a generated token. The client uses these credentials to validate access immediately, ensuring the agent has proper permissions for any read operation like listing datasets.

If I run an evaluation with `run_evaluation` and it fails, how do I debug the error?

The system response includes a detailed stack trace or specific error code. Check this output first; it points directly to input data issues or configuration problems within your Ragas setup that need correcting.

When using `get_dataset`, are there specific document formats required for optimal performance?

The system handles standard text inputs, but structured data performs best. Make sure your source documents include clear metadata fields (like 'source' or 'date') so Ragas can accurately attribute scores when you later use the results.

Is there a rate limit for how many evaluations I can run using `run_evaluation`?

While specific limits vary by subscription tier, running multiple evaluations is generally fine. If you hit an API call threshold error, check the server logs; they will flag whether you've exceeded usage quotas.

How do I secure an App Token for Ragas?

Log into your provided Ragas dashboard. In your project's settings or dedicated security section, you will find the ability to generate a new Application Token. Copy it immediately, as it may only appear once.

What format is required to upload a dataset?

The MCP wrapper accepts datasets as arrays of records. When passing data, the AI maps each record to question, ground_truth, and contexts fields, which match the base requirements of Ragas.
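A quick pre-flight check on record shape can catch problems before data reaches the server. The three field names come from the answer above; the expected Python types (strings for question and ground_truth, a list for contexts) and the sample records are assumptions for this sketch.

```python
# Field names from the text above; the types are assumed for illustration.
REQUIRED_FIELDS = {"question": str, "ground_truth": str, "contexts": list}

def validate_record(record):
    """Return a list of problems found in one evaluation record (empty if valid)."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    return problems

good = {
    "question": "What is the notice period?",
    "ground_truth": "30 days",
    "contexts": ["Clause 4: notice is 30 days."],
}
bad = {"question": "What is the notice period?", "contexts": "Clause 4"}

print(validate_record(good), validate_record(bad))
```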

Does the server evaluate prompts automatically during testing?

Yes. When triggering evaluations, Ragas uses its own sophisticated metrics (like Faithfulness, Answer Relevance) running internally. The MCP server simply pipes these generated reports back to your chat.
