# Ragas MCP

> Ragas lets your AI client manage professional RAG evaluation and tracking directly inside your chat or IDE. It provides specialized tools to list datasets, run evaluations against LLM pipelines, fetch detailed metrics like faithfulness, and track experiment versions without needing a separate dashboard.

## Overview
- **Category:** ai-frontier
- **Price:** Free
- **Tags:** rag, llm-evaluation, metrics, dataset-management, model-performance, experiment-tracking

## Description

Ragas gives your AI client professional-grade Retrieval-Augmented Generation (RAG) evaluation and tracking right inside your chat or IDE. It's built to let you manage datasets and measure how well your LLM pipelines actually perform, all without needing a separate dashboard. You don't have to leave your workflow just to check scores.

To get started, the first thing your agent calls is `list_datasets`. This action shows you every dataset ID configured for RAG testing in your project. Once you know which data pool you're working with, you can use `get_dataset` to pull specific metadata for a single ID; this lets you check things like schema details or required parameters before you kick off any evaluation run.

When it comes time for the test itself, you first need to know which metrics you can measure. Call `list_metrics` and you get every scoring dimension that Ragas can report on, such as faithfulness and answer relevancy. After that, your agent executes `run_evaluation`, which kicks off the full scoring process against a specific dataset ID and model setup.

Once the evaluation finishes, you use `get_results` to pull the summary: it gives you the final, aggregate performance score for that entire run. But if you need to track how your models change over time, you can ask the client to look at experiment history. By calling `list_experiments`, you see a record of every past evaluation run tied back to a specific dataset ID. If you're digging into the specifics of one of those old tests, `get_experiment` pulls all the detailed information about that single recorded run.

Basically, if you're checking up on your RAG process, you'll use these tools in sequence: list the datasets that exist, check a dataset's parameters, list the metrics available for scoring, start the evaluation run, and then grab the final scores or dig into the history of past runs.
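
The sequence above can be sketched in Python. This is a minimal illustration, not the server's actual client code: `call_tool` is a stand-in for whatever your MCP client uses to invoke a tool, and the argument names (`dataset_id`, `metrics`, `run_id`) and response shapes are assumptions, since the tool schemas aren't documented here.

```python
from typing import Any, Callable

# Stand-in type for an MCP client's "invoke tool" function (hypothetical).
ToolCaller = Callable[[str, dict[str, Any]], Any]


def evaluate_dataset(call_tool: ToolCaller, dataset_id: str) -> Any:
    """Walk the tools in the order described above and return the results."""
    datasets = call_tool("list_datasets", {})
    if dataset_id not in {d["id"] for d in datasets}:
        raise ValueError(f"unknown dataset: {dataset_id}")

    # Pre-flight: inspect the dataset and the available metrics before scoring.
    call_tool("get_dataset", {"dataset_id": dataset_id})
    metrics = call_tool("list_metrics", {})

    # Argument and field names below are assumptions, not the documented schema.
    run = call_tool("run_evaluation", {"dataset_id": dataset_id, "metrics": metrics})
    return call_tool("get_results", {"run_id": run["run_id"]})


def fake_call_tool(name: str, arguments: dict[str, Any]) -> Any:
    """In-memory fake so the sketch runs without a server (made-up data)."""
    responses = {
        "list_datasets": [{"id": "01"}, {"id": "02"}],
        "get_dataset": {"id": arguments.get("dataset_id"), "schema": []},
        "list_metrics": ["faithfulness", "answer_relevancy"],
        "run_evaluation": {"run_id": "run-1"},
        "get_results": {"run_id": arguments.get("run_id"), "scores": {}},
    }
    return responses[name]
```

Calling `evaluate_dataset(fake_call_tool, "01")` runs the whole chain against the fake; in practice your AI client makes these calls for you from a plain-English request.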

## Tools

### list_datasets
Lists all available datasets used for RAG testing in your project.

### get_dataset
Retrieves specific metadata for one evaluation dataset ID.

### list_experiments
Shows a list of past experiments tied to a specific dataset ID.

### get_experiment
Gets detailed information about a single, recorded experiment run.

### run_evaluation
Initiates a new Ragas evaluation run on the specified dataset ID.

### list_metrics
Outputs every scoring dimension available for RAG evaluation (e.g., faithfulness, relevancy).

### get_results
Retrieves the final scoring metrics and outcomes from a completed evaluation run.
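
Under MCP, a client invokes each of these tools with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a request in Python. The `dataset_id` argument name is an assumption, since the tools' input schemas aren't listed here.

```python
import json
from typing import Any


def build_tool_call(request_id: int, tool: str, arguments: dict[str, Any]) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical argument name; check the server's tool schema for the real one.
request = build_tool_call(1, "get_dataset", {"dataset_id": "01"})
```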

## Prompt Examples

**Prompt:** 
```
List all Ragas datasets available in my project.
```

**Response:** 
```
Using the `list_datasets` command, I found 3 datasets: 'Legal_Q1_Test' (ID: 01), 'Medical_V2_Base' (ID: 02), and 'General_FAQ_Validation' (ID: 03).
```

**Prompt:** 
```
Fetch the metrics and results for the recent experiment 'Support Bot V3'.
```

**Response:** 
```
Looking up experiments... For 'Support Bot V3', the evaluation scored an aggregate 0.89. Faithfulness scored 0.92, while Answer Relevance was slightly lower at 0.85.
```

**Prompt:** 
```
Create a new Ragas project named 'Financial_RAG_Testing'.
```

**Response:** 
```
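I can't create projects through this server: none of its tools (`list_datasets`, `get_dataset`, `list_experiments`, `get_experiment`, `run_evaluation`, `list_metrics`, `get_results`) creates a project. I can list the datasets in your existing project with `list_datasets` instead.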
```

## Capabilities

### List available datasets
The agent calls `list_datasets` to retrieve the names and IDs of all evaluation datasets configured in your Ragas project.

### Get specific dataset details
You use `get_dataset` to pull metadata for a single dataset, checking its schema or required parameters before an evaluation run.

### Start a new RAG pipeline evaluation
The agent executes `run_evaluation`, kicking off the scoring process against a specified dataset and model configuration.

### Find experiment history
You ask the client to run `list_experiments` to see all past evaluation runs associated with a given dataset ID.

### Retrieve final test scores
The agent calls `get_results` to pull the summarized, aggregate performance score for a completed experiment.

### List all measurable metrics
You use `list_metrics` to check which scoring dimensions (e.g., faithfulness, answer relevancy) are available for reporting.

## Use Cases

### QA needs to check for hallucination after an update
The QA specialist uploads the latest knowledge base, runs a new test set via `run_evaluation`, and then immediately uses `get_results` to pull the faithfulness score. If the score drops below 0.85, they know the update made answers less grounded in the retrieved context, all without checking a separate dashboard.
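
A threshold check like the 0.85 cutoff above could look like the following sketch. The shape of the scores dictionary and the metric key names are assumptions about what `get_results` returns, and the numbers are made up.

```python
def metrics_below_threshold(
    scores: dict[str, float], thresholds: dict[str, float]
) -> dict[str, float]:
    """Return the metrics whose score is below its minimum threshold."""
    return {
        name: scores[name]
        for name, minimum in thresholds.items()
        if name in scores and scores[name] < minimum
    }


# Made-up scores for illustration only.
failing = metrics_below_threshold(
    {"faithfulness": 0.81, "answer_relevancy": 0.90},
    {"faithfulness": 0.85},
)
```

Here `failing` contains only the faithfulness entry, which is the signal the QA specialist is looking for.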

### ML team needs to compare two models quickly
The ML engineer uses `list_datasets` to find 'Legal Q3 Test'. They then run Model A's evaluation, capture the metrics, and repeat the process for Model B. The agent structures the results so they can be compared side by side.
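
One way to structure that comparison is sketched below, assuming each run's results reduce to a dictionary of metric names and scores. That shape and the numbers are illustrative, not the server's documented output.

```python
def compare_runs(
    scores_a: dict[str, float], scores_b: dict[str, float]
) -> list[tuple[str, float, float, float]]:
    """Return (metric, score_a, score_b, delta) rows for metrics both runs share."""
    shared = sorted(scores_a.keys() & scores_b.keys())
    return [
        (name, scores_a[name], scores_b[name], round(scores_b[name] - scores_a[name], 4))
        for name in shared
    ]


# Made-up scores for two hypothetical runs.
rows = compare_runs(
    {"faithfulness": 0.92, "answer_relevancy": 0.85},
    {"faithfulness": 0.88, "answer_relevancy": 0.90},
)
for name, a, b, delta in rows:
    print(f"{name:<18} A={a:.2f}  B={b:.2f}  delta={delta:+.2f}")
```

A positive delta means Model B scored higher on that metric; metrics reported by only one run are left out rather than guessed.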

### Data Scientist wants a comprehensive metric audit
A data scientist calls `list_metrics` first to confirm all available scoring dimensions, then uses `get_dataset` to verify the input schema. This pre-flight check ensures no metrics are missed before running `run_evaluation`.

### Debugging an old model run
A junior analyst knows a test ran last week but can't find the score. They use `list_experiments` with the dataset ID to pull up the specific experiment record, then call `get_results` for that exact run.
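
Narrowing a run history down to last week's entries might look like this sketch. The `id` and `created_at` fields are assumed names for what `list_experiments` returns, and the records are made up.

```python
from datetime import date, timedelta
from typing import Any


def experiments_since(
    experiments: list[dict[str, Any]], cutoff: date
) -> list[dict[str, Any]]:
    """Keep experiments created on or after `cutoff`, newest first."""
    recent = [e for e in experiments if date.fromisoformat(e["created_at"]) >= cutoff]
    # ISO 8601 date strings sort chronologically, so a string sort is enough.
    return sorted(recent, key=lambda e: e["created_at"], reverse=True)


# Made-up records standing in for a `list_experiments` response.
history = [
    {"id": "exp-1", "created_at": "2024-05-01"},
    {"id": "exp-2", "created_at": "2024-05-20"},
    {"id": "exp-3", "created_at": "2024-05-23"},
]
last_week = experiments_since(history, date(2024, 5, 24) - timedelta(days=7))
```

The ID of the matching record is what the analyst would then pass to `get_results`.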

## Benefits

- **Automated Scoring:** Instead of writing boilerplate Python scripts, simply ask the agent to `run_evaluation` when a model changes. You get detailed scores without leaving your workflow.
- **Full Traceability:** Need to compare Model V1 against Model V2? Use `list_experiments` to find each run for a dataset, then pull the details of any single run with `get_experiment`. Everything stays linked by project ID.
- **Deep Metric Visibility:** Don't just look at the score. Call `list_metrics` to see exactly *what* is being measured (like faithfulness or answer relevancy) before you evaluate it.
- **Rapid Iteration Cycle:** If a run scores poorly, immediately use `get_results` to pull the final scores and diagnose whether the issue was poor context retrieval or bad generation.
- **Project Organization:** The server associates every metric set with a project ID. This prevents data sprawl and makes comparing results across different business units simple.

## How It Works

In short, you talk to your AI client in plain English, and it translates that into a sequence of calls (like `list_datasets` -> `run_evaluation` -> `get_results`) to get the final data.

1. First, enable the server integration and provide your Ragas Application URL and API Token.
2. Then, instruct your AI client to list datasets using `list_datasets` or run an evaluation with `run_evaluation`.
3. Finally, the client uses the returned IDs to call tools like `get_results` and display actionable performance metrics.
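
For step 1, a setup script might sanity-check the two credentials before handing them to the integration. The sketch below assumes they are kept in environment variables named `RAGAS_APP_URL` and `RAGAS_API_TOKEN`; both names are hypothetical, so use whatever your client's configuration actually expects.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse


def load_ragas_config(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Read the URL and token from the environment and sanity-check them."""
    env = os.environ if env is None else env
    # Hypothetical variable names, chosen for this example only.
    url = env.get("RAGAS_APP_URL", "").strip()
    token = env.get("RAGAS_API_TOKEN", "").strip()

    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("RAGAS_APP_URL must be an http(s) URL")
    if not token:
        raise ValueError("RAGAS_API_TOKEN is not set")
    return {"url": url, "token": token}
```

Failing early on a malformed URL or an empty token is easier to debug than an authentication error surfacing on the first tool call.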

## Frequently Asked Questions

**How do I check if my dataset list is up to date using list_datasets?**
You call the `list_datasets` tool. This command retrieves all current datasets associated with your project ID, letting you confirm which versions are available for testing.

**I need to compare two models. Do I use get_results or list_experiments?**
Use `list_datasets` first to pick the dataset. Then run both models separately using `run_evaluation`. Finally, use `get_experiment` with each run's ID to pull detailed results and compare them.

**What is the difference between get_results and list_experiments?**
`list_experiments` shows you a history of runs (the metadata). `get_results` pulls the actual, final calculated scores for one specific run ID.

**Can I use list_metrics to see what metrics are available before I run an evaluation?**
Yes. Running `list_metrics` shows every scoring dimension (like faithfulness) that Ragas can calculate, helping you know exactly what numbers to look for in the final report.

**How do I authenticate my AI agent before using `list_datasets`?**
You must provide your Ragas Application URL and a generated token. The client uses these credentials to validate access immediately, ensuring the agent has proper permissions for any read operation like listing datasets.

**If I run an evaluation with `run_evaluation` and it fails, how do I debug the error?**
The system response includes a detailed stack trace or specific error code. Check this output first; it points directly to input data issues or configuration problems within your Ragas setup that need correcting.
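
In MCP, a failed tool call is typically reported by setting `isError` on the tool result and carrying the details as text content. Whether this server reports `run_evaluation` failures in exactly that shape is an assumption; the sketch below shows how a client could extract the message if it does.

```python
from typing import Any, Optional


def tool_error_text(result: dict[str, Any]) -> Optional[str]:
    """Return the joined text of an error result, or None if the call succeeded."""
    if not result.get("isError", False):
        return None
    parts = [
        item["text"]
        for item in result.get("content", [])
        if item.get("type") == "text"
    ]
    return "\n".join(parts) or "tool call failed without a message"


# Made-up error payload for illustration.
failed = {"isError": True, "content": [{"type": "text", "text": "dataset 99 not found"}]}
```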

**When using `get_dataset`, are there specific document formats required for optimal performance?**
The system handles standard text inputs, but structured data performs best. Make sure your source documents include clear metadata fields (like 'source' or 'date') so Ragas can accurately attribute scores when you later use the results.

**Is there a rate limit for how many evaluations I can run using `run_evaluation`?**
While specific limits vary by subscription tier, running multiple evaluations is generally fine. If you hit an API call threshold error, check the server logs; they will flag whether you've exceeded usage quotas.

**How do I get an Application Token for Ragas?**
Log into your Ragas dashboard. In your project's settings or dedicated security section, you will find the option to generate a new Application Token. Copy it immediately, as it may only appear once.

**What format is required to upload a dataset?**
Data is passed as arrays of records through the MCP wrapper. The AI maps each record to the `question`, `ground_truth`, and `contexts` fields, matching the base Ragas requirements.
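
A record with those three fields, plus a small validator, could look like the sketch below. Only the field names come from the answer above; the type rules (for example, `contexts` being a list of strings) are assumptions, and the sample record is made up.

```python
from typing import Any

REQUIRED_FIELDS = ("question", "ground_truth", "contexts")


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems with one dataset record (empty if none found)."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in record]
    contexts = record.get("contexts")
    # Assumed rule: contexts is a list of retrieved text passages.
    if contexts is not None and not (
        isinstance(contexts, list) and all(isinstance(c, str) for c in contexts)
    ):
        problems.append("contexts must be a list of strings")
    return problems


# Made-up record for illustration.
record = {
    "question": "What is the notice period?",
    "ground_truth": "Thirty days.",
    "contexts": ["Either party may terminate with thirty days' notice."],
}
```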

**Does the server evaluate prompts automatically during testing?**
Yes. When you trigger an evaluation, Ragas computes its own metrics (like faithfulness and answer relevancy) internally. The MCP server simply passes the resulting reports back to your chat.