Ragas MCP. Run RAG evaluations and track metrics from your chat.

Works with every AI agent you already use

Claude, ChatGPT, Cursor, Gemini, Windsurf, VS Code, JetBrains, Vercel, and any MCP-compatible client.

Just plug in your AI agents and start using Vinkius.

Ragas lets your AI client manage professional RAG evaluation and tracking directly inside your chat or IDE. It provides specialized tools to list datasets, run evaluations against LLM pipelines, fetch detailed metrics like faithfulness, and track experiment versions without needing a separate dashboard.

What your AI agents can do

List available datasets

The agent calls list_datasets to retrieve the names and IDs of all evaluation datasets configured in your Ragas project.

Get specific dataset details

You use get_dataset to pull metadata for a single dataset, checking its schema or required parameters before an evaluation run.

Start a new RAG pipeline evaluation

The agent executes run_evaluation, kicking off the scoring process against a specified dataset and model configuration.

Find experiment history

You ask the client to run list_experiments to see all past evaluation runs associated with a given dataset ID.

Retrieve final test scores

The agent calls get_results to pull the summarized, aggregate performance score for a completed experiment.

List all measurable metrics

You use list_metrics to check which scoring dimensions (e.g., faithfulness, answer relevancy) are available for reporting.

Free for Subscribers

Ragas MCP Server: 7 Tools for RAG Evaluation

These tools let your agent handle the full lifecycle of RAG testing: listing data, running tests, and retrieving verifiable performance metrics.

get_dataset

Retrieves specific metadata for one evaluation dataset ID.

get_experiment

Gets detailed information about a single, recorded experiment run.

get_results

Retrieves the final scoring metrics and outcomes from a completed evaluation run.

list_datasets

Lists all available datasets used for RAG testing in your project.

list_experiments

Shows a list of past experiments tied to a specific dataset ID.

list_metrics

Outputs every scoring dimension available for RAG evaluation (e.g., faithfulness, relevancy).

run_evaluation

Initiates a new Ragas evaluation run on the specified dataset ID.

Choose How to Get Started

Build a custom MCP for your own tools, or connect a ready-made integration from our catalog.

Build Your Own

Turn any API into an MCP. Import a spec, define Agent Skills, or deploy with MCPFusion.

Import from OpenAPI, Swagger, or YAML specs
Create Agent Skills with progressive disclosure
Deploy to edge with MCPFusion framework
Built-in DLP, auth, and compliance on every call
Real-time usage dashboard and cost metering
Publish to catalog or keep private

Start building

Make Your AI Do More

Start with Ragas, then connect any of our 4,700+ other servers whenever your AI needs more. One click, no limits.

Use this MCP plus 4,700+ others, all in one place
Add new capabilities to your AI anytime you want
Every connection is secured and compliant automatically
Track usage and costs across all your servers
Works with Claude, ChatGPT, Cursor, and more
New servers added to the catalog every week

What you can do with this MCP connector

Ragas gives your AI client professional-grade Retrieval-Augmented Generation (RAG) evaluation and tracking right inside your chat or IDE. It's built to let you manage datasets and measure how well your LLM pipelines actually perform, all without a separate dashboard. You don't have to leave your workflow just to check scores.

If you need to get started, the first thing your agent calls is list_datasets. This action shows you every dataset ID configured for RAG testing in your project. Once you know which data pool you're working with, you can use get_dataset to pull specific metadata for a single ID; this lets you check things like schema details or required parameters before you kick off any evaluation run.

When it comes time for the test itself, you first need to know which metrics you're measuring. Call list_metrics and you get every scoring dimension Ragas can report on, such as faithfulness and answer relevancy. After that, your agent executes run_evaluation, which kicks off the full scoring process against a specific dataset ID and model setup.

Once the evaluation finishes, you use get_results to pull the summary: it gives you the final, aggregate performance score for that entire run. But if you need to track how your models change over time, you can ask the client to look at experiment history. By calling list_experiments, you see a record of every past evaluation run tied back to a specific dataset ID.

If you're digging into the specifics of one of those old tests, get_experiment pulls all the detailed information about that single recorded run.

Basically, if you're checking up on your RAG process, you'll use these tools in sequence: list the datasets that exist, check a dataset's parameters, list the metrics available for scoring, start the evaluation run, and then grab the final scores or dig into the history of past runs.
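
The sequence above can be sketched in Python. This is a minimal illustration, not the server's actual client API: `call_tool` is a hypothetical callable standing in for whatever your MCP client uses to invoke a tool, and the argument names (`dataset_id`, `metrics`, `experiment_id`) and response shapes are assumptions, since this page does not list the tools' input schemas.

```python
from typing import Any, Callable, Dict

# Hypothetical signature: call_tool(tool_name, arguments) -> parsed response.
ToolCaller = Callable[[str, Dict[str, Any]], Any]


def evaluate_first_dataset(call_tool: ToolCaller) -> Dict[str, Any]:
    """Walk the list -> inspect -> run -> fetch sequence."""
    datasets = call_tool("list_datasets", {})
    dataset_id = datasets[0]["id"]  # assumed response shape
    call_tool("get_dataset", {"dataset_id": dataset_id})  # check schema first
    metrics = call_tool("list_metrics", {})
    run = call_tool(
        "run_evaluation", {"dataset_id": dataset_id, "metrics": metrics}
    )
    return call_tool("get_results", {"experiment_id": run["experiment_id"]})


def fake_call_tool(name: str, arguments: Dict[str, Any]) -> Any:
    """In-memory stand-in so the sketch runs without a server."""
    canned = {
        "list_datasets": [{"id": "ds-1", "name": "demo"}],
        "get_dataset": {"id": "ds-1", "columns": ["question", "contexts"]},
        "list_metrics": ["faithfulness", "answer_relevancy"],
        "run_evaluation": {"experiment_id": "exp-1"},
        "get_results": {"faithfulness": 0.9, "answer_relevancy": 0.8},
    }
    return canned[name]


results = evaluate_first_dataset(fake_call_tool)
```

In practice the AI client performs these calls for you; the sketch only makes the ordering and the ID hand-off between steps explicit.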

How Ragas MCP Works

1. First, enable the server integration and provide your Ragas Application URL and API Token.
2. Then, instruct your AI client to list datasets using list_datasets or run an evaluation with run_evaluation.
3. Finally, the client uses the returned IDs to call tools like get_results and display actionable performance metrics.

The bottom line is: You talk to your AI client in plain English, and it translates that into a sequence of calls (like list_datasets -> run_evaluation -> get_results) to get the final data.
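
Under the hood, an MCP client sends each tool invocation as a JSON-RPC 2.0 `tools/call` request. Here is a minimal sketch of building one such message; the `dataset_id` argument name and the `ds-123` value are placeholders, because the tools' input schemas are not shown on this page.

```python
import json
from typing import Any, Dict


def build_tool_call(request_id: int, tool: str, arguments: Dict[str, Any]) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(payload)


# "ds-123" and the `dataset_id` key are illustrative placeholders.
message = build_tool_call(1, "run_evaluation", {"dataset_id": "ds-123"})
```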

Who Is Ragas MCP For?

ML Engineers who are tired of context switching between IDEs, dashboards, and local scripts can run evaluations from one place. QA Specialists running constant model tests can use it for rapid benchmarking. Data Scientists who need to compare multiple RAG configuration changes side by side will find it essential.

ML Engineer

You run pipeline evaluations without leaving your chat interface, using run_evaluation and then fetching the results with get_results.

QA Specialist for LLMs

You rapidly examine datasets and benchmark scores to show that hallucination rates stay below a set threshold, confirming the available metrics via list_metrics and pulling the scores with get_results.

Data Scientist

You compare the performance of several RAG configuration experiments side-by-side using unified data pulled from multiple calls to get_experiment.

What Changes When You Connect

Automated Scoring: Instead of writing boilerplate Python scripts, simply ask the agent to run_evaluation when a model changes. You get detailed scores without leaving your workflow.
Full Traceability: Need to compare Model V1 against Model V2? Use list_experiments to find both runs for the dataset, then pull each run's details with get_experiment. Everything stays linked by project ID.
Deep Metric Visibility: Don't just look at the score. Call list_metrics to see exactly what is being measured (like faithfulness or answer relevancy) before you evaluate it.
Rapid Iteration Cycle: If a run scores poorly, immediately use get_results to pull the final scores and diagnose whether the issue was poor context retrieval or bad generation.
Project Organization: The server associates every metric set with a project ID. This prevents data sprawl and makes comparing results across different business units simple.

Real-World Use Cases

QA needs to check for hallucination after an update

The QA specialist uploads the latest knowledge base, runs a new test set via run_evaluation, and then immediately uses get_results to pull the faithfulness score. If the score drops below 0.85, they know exactly where the model failed without checking a separate dashboard.
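
The threshold check in this scenario is simple enough to sketch. The example assumes `get_results` yields a mapping of metric names to scores with a `faithfulness` key; that response shape is an assumption, and 0.85 is just the figure used above.

```python
from typing import Dict


def faithfulness_regressed(results: Dict[str, float], threshold: float = 0.85) -> bool:
    """Return True when the faithfulness score falls below the threshold."""
    return results["faithfulness"] < threshold


# Illustrative scores, not real evaluation output.
after_update = {"faithfulness": 0.81, "answer_relevancy": 0.90}
needs_attention = faithfulness_regressed(after_update)
```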

ML team needs to compare two models quickly

The ML engineer uses list_datasets to find 'Legal Q3 Test'. They then run Model A's evaluation, capture the metrics, and repeat the process for Model B. The agent structures the results so they can compare them side by side.
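
A side-by-side comparison like this one boils down to lining up two score mappings. A minimal sketch, assuming each run's results arrive as a flat metric-to-score dictionary (the shape and the sample numbers are illustrative, not real output):

```python
from typing import Dict, List, Tuple

Row = Tuple[str, float, float, float]


def compare_runs(a: Dict[str, float], b: Dict[str, float]) -> List[Row]:
    """Return (metric, score_a, score_b, delta) rows for metrics in both runs."""
    rows: List[Row] = []
    for metric in sorted(a.keys() & b.keys()):
        delta = round(b[metric] - a[metric], 4)
        rows.append((metric, a[metric], b[metric], delta))
    return rows


model_a = {"faithfulness": 0.88, "answer_relevancy": 0.91}
model_b = {"faithfulness": 0.93, "answer_relevancy": 0.89}
rows = compare_runs(model_a, model_b)
```

A positive delta means Model B scored higher on that metric; metrics present in only one run are skipped rather than guessed.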

Data Scientist wants a comprehensive metric audit

A data scientist calls list_metrics first to confirm all available scoring dimensions, then uses get_dataset to verify the input schema. This pre-flight check ensures no metrics are missed before running run_evaluation.
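
This pre-flight check amounts to two set differences: requested metrics against what list_metrics reports, and required columns against the dataset schema. A small sketch, with the metric and column names chosen purely for illustration:

```python
from typing import Iterable, List


def missing_items(required: Iterable[str], available: Iterable[str]) -> List[str]:
    """Return the required names that are absent from the available ones."""
    return sorted(set(required) - set(available))


# Illustrative values standing in for list_metrics and get_dataset output.
available_metrics = ["faithfulness", "answer_relevancy"]
dataset_columns = ["question", "contexts"]

unknown_metrics = missing_items(["faithfulness", "context_recall"], available_metrics)
missing_columns = missing_items(["question", "ground_truth", "contexts"], dataset_columns)
```

If either list is non-empty, fix the request or the dataset before starting the run.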

Debugging an old model run

A junior analyst knows a test ran last week but can't find the score. They use list_experiments with the dataset ID to pull up the specific experiment record, then call get_results for that exact run.
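
Narrowing an experiment list down to "last week" can be sketched as a date filter. The `id` and `created_at` fields (ISO 8601 strings) are assumptions about what list_experiments returns.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def runs_since(experiments: List[Dict[str, str]], now: datetime, days: int = 7) -> List[str]:
    """Return IDs of experiments created within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return [
        exp["id"]
        for exp in experiments
        if datetime.fromisoformat(exp["created_at"]) >= cutoff
    ]


# Illustrative records, not real server output.
history = [
    {"id": "exp-1", "created_at": "2024-05-01T09:00:00"},
    {"id": "exp-2", "created_at": "2024-05-10T09:00:00"},
]
recent = runs_since(history, now=datetime(2024, 5, 14))
```

Each ID that survives the filter can then be passed to get_results.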

The Tradeoffs

Relying on a single summary view

Thinking that just reading a high-level dashboard score is enough. You might see 'Score: 0.92' but have no idea if that number accounts for answer relevancy or faithfulness.

→ Always confirm the underlying metrics first. Use list_metrics to see all scoring dimensions, then use get_results to pull specific scores like 'faithfulness' and 'answer relevancy'.

Skipping dataset listing

Telling your agent to run an evaluation without first confirming the correct, up-to-date dataset ID. You might accidentally test against old or incomplete data.

→ Always start with list_datasets. This gives you a definitive list of available resources and prevents running evaluations on stale IDs.

Assuming the last run is current

Calling get_results immediately after an edit, assuming the score reflects that change. The cached data might be outdated.

→ If you modified the source material or model, you must explicitly trigger a new test using run_evaluation. Then, check for results with get_results.

When It Fits, When It Doesn't

Use Ragas if your core pain point is managing and measuring LLM performance within your coding environment. You need an agent to handle the workflow of 'list data' -> 'run test' -> 'show score.'

Don't use it if you just need a quick, single-point check on one metric without tracking history. For that, maybe a simple API wrapper is enough. Don't use it if your testing process requires complex data transformations or pre-processing outside of Ragas' scope (like advanced image analysis). In those cases, stick to dedicated ETL pipelines.

The key decision point: If you need the AI client to act as a continuous QA layer that reads and writes test metadata, this server is built for that. If you just want raw data access, use standard database connectors instead.

Independent Platform Disclaimer: Vinkius is an independent platform and is not affiliated with, endorsed by, sponsored by, verified by, or otherwise authorized by Ragas. All third-party trademarks, logos, and brand names are the property of their respective owners. Their use on this website is strictly for informational purposes to identify service compatibility and interoperability.

VINKIUS INFRASTRUCTURE

Cloud Hosted: Managed infra
V8 Isolated: Sandboxed per request
Zero-Trust Proxy: No stored credentials
DLP Enforced: Policy on every call
GDPR Compliant: EU data residency
Token Compression: ~60% cost reduction

Works with Claude, ChatGPT, Cursor, and more

The Model Context Protocol standardizes how applications expose capabilities to LLMs. Instead of operating in isolation, your AI gains direct access to external platforms, live data, and real-world actions through secure, standardized connections.

This server provides 7 capabilities that interface natively with Claude, ChatGPT, Cursor, and any MCP client. No middleware. No custom integration required.

Available Capabilities

get_dataset, get_experiment, get_results, list_datasets, list_experiments, list_metrics, run_evaluation

Testing LLMs shouldn't require context switching or boilerplate code.

Today, running a proper RAG evaluation means navigating to a separate dashboard. You upload the dataset there, click 'Run Test,' wait for it to process, then download a CSV of scores. If you want to compare two models, you repeat that entire cycle—copying IDs, remembering which score is faithfulness, and pasting everything into a spreadsheet.

With this MCP server, your agent handles all the tedious parts. You simply talk to your client: 'Run Model B against the Legal Q1 Test.' The agent executes `run_evaluation`, pulls back the detailed metrics via tools like `get_results`, and shows you the clean numbers right in your chat window. No dashboard hops required.

Ragas MCP Server: Get structured RAG evaluation results.

Manual testing means running scripts locally, then manually updating a central tracking sheet with the final score and model version. It's slow, prone to human error, and makes comparing multiple runs nearly impossible without deep manual effort.

Now, your client controls this entire process. You use `list_datasets` for discovery, trigger tests with `run_evaluation`, and finally retrieve structured data using `get_results`. The whole measurement chain is automated, verifiable, and right where you work.

Common Questions About Ragas MCP

How do I check if my dataset list is up to date using list_datasets?

You call the list_datasets tool. This command retrieves all current datasets associated with your project ID, letting you confirm which versions are available for testing.

I need to compare two models, do I use get_results or list_experiments?

Both play a part. Use list_datasets first to confirm the dataset ID, then run each model separately using run_evaluation. Afterwards, use list_experiments to find the two runs and call get_results (or get_experiment for full run details) on each run ID to compare the scores.

What is the difference between get_results and list_experiments?

list_experiments shows you a history of runs (the metadata). get_results pulls the actual, final calculated scores for one specific run ID.

Can I see what metrics are available before I run an evaluation with list_metrics?

Yes. Running list_metrics shows every scoring dimension (like faithfulness) that Ragas can calculate, helping you know exactly what numbers to look for in the final report.

How do I authenticate my AI agent before using `list_datasets`?

You must provide your Ragas Application URL and a generated token. The client uses these credentials to validate access immediately, ensuring the agent has proper permissions for any read operation like listing datasets.
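
If you script the setup yourself, one common pattern is to keep both values in environment variables and fail early when either is missing. A minimal sketch; the variable names `RAGAS_APP_URL` and `RAGAS_API_TOKEN` are hypothetical, not names the server defines.

```python
import os
from typing import Dict, Mapping


def load_credentials(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read the application URL and token, raising if either is absent."""
    # Both variable names are illustrative assumptions.
    missing = [key for key in ("RAGAS_APP_URL", "RAGAS_API_TOKEN") if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing credentials: {', '.join(missing)}")
    return {"app_url": env["RAGAS_APP_URL"], "api_token": env["RAGAS_API_TOKEN"]}
```

Passing a plain dictionary as `env` makes the function easy to exercise without touching the real environment.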

If I run an evaluation with `run_evaluation` and it fails, how do I debug the error?

The system response includes a detailed stack trace or specific error code. Check this output first; it points directly to input data issues or configuration problems within your Ragas setup that need correcting.

When using `get_dataset`, are there specific document formats required for optimal performance?

The system handles standard text inputs, but structured data performs best. Make sure your source documents include clear metadata fields (like 'source' or 'date') so Ragas can accurately attribute scores when you later use the results.

Is there a rate limit for how many evaluations I can run using `run_evaluation`?

While specific limits vary by subscription tier, running multiple evaluations is generally fine. If you hit an API call threshold error, check the server logs; they will flag whether you've exceeded usage quotas.
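
When you drive evaluations from your own script, a generic way to cope with quota errors is to retry with exponential backoff. This is a common client-side pattern rather than documented server behavior, and `QuotaExceeded` is a hypothetical exception standing in for however your client reports the threshold error.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class QuotaExceeded(Exception):
    """Hypothetical error raised when the server reports a usage-quota hit."""


def with_backoff(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`, doubling the wait after each quota error."""
    for attempt in range(attempts):
        try:
            return action()
        except QuotaExceeded:
            if attempt == attempts - 1:
                raise  # out of retries, surface the error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

The injectable `sleep` parameter exists so the retry logic can be tested without real delays.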

How do I get an App Token for Ragas?

Log in to your Ragas dashboard. In your project's settings or dedicated security section, you will find the option to generate a new Application Token. Copy it immediately, as it may only appear once.

What format is required to upload a dataset?

Datasets are passed as arrays of records through the MCP wrapper. Each record carries the question, ground_truth, and contexts fields that Ragas expects, and the AI client maps your data onto them.
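
A quick way to catch malformed records before sending them is to check each one for the three fields named above. A minimal sketch; treating `contexts` as a list of strings is an assumption based on how Ragas datasets are commonly structured.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = ("question", "ground_truth", "contexts")


def record_problems(record: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in one dataset record."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    contexts = record.get("contexts")
    if contexts is not None:
        is_string_list = isinstance(contexts, list) and all(
            isinstance(item, str) for item in contexts
        )
        if not is_string_list:
            problems.append("contexts must be a list of strings")
    return problems


# Illustrative record, not taken from a real dataset.
good = {
    "question": "What is the notice period?",
    "ground_truth": "30 days",
    "contexts": ["Either party may terminate with 30 days' notice."],
}
```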

Does the server evaluate prompts automatically during testing?

Yes. When you trigger an evaluation, Ragas computes its own metrics (like faithfulness and answer relevancy) internally. The MCP server simply pipes the generated reports back to your chat.

Use it with your favorite AI tools

Connect this server to Cursor, Claude, VS Code, and more.

OpenAI Agents SDK (Python)

Google ADK (Python)

Pydantic AI (Python)

Vercel AI SDK (TypeScript)