# Arize AI MCP

> Arize AI connects your agent to ML observability. You monitor LLM performance, track model metrics, and check data drift right from your terminal or IDE. It lets you ingest raw inference logs and run automated evaluations against static datasets without opening a dashboard. This is for engineers who need real-time visibility into their models.

## Overview
- **Category:** ai-frontier
- **Price:** Free
- **Tags:** ml-observability, llm-evaluation, model-monitoring, telemetry, data-drift, ai-alignment

## Description

You can connect this MCP to any agent client, giving it full access to your ML observability platform. Forget switching context into heavy graphical dashboards just to see if an LLM prompt hallucinated or if performance dipped. Now, your AI acts like a dedicated MLOps engineer talking to you in plain English.

Need to know what models are running? You can ask the agent to list all tracked ML models. Want to check data quality? It fetches real-time metrics and shows prediction drift flags. The system also lets you push raw logs, predictions, and inferences directly into Arize for immediate tracking using `ingest_log`. For governance, you can browse organizational spaces and deployment environments via `list_environments`, keeping track of Production versus Training data.

Beyond monitoring, the agent handles testing. You can list automated evaluation runs with `list_evals` or trigger a custom check with `run_eval` against static datasets. The goal is to make your ML telemetry workflow conversational.

## Tools

### list_datasets
Returns a list of all available static evaluation datasets for testing.

### list_environments
Lists configured deployment environments (like Production or Training) used to segment model data.

### list_evals
Shows a list of automated evaluation runs that have been executed against models.

### get_dataset
Retrieves details for a specific static dataset used in evaluations.

### get_model
Gets metadata, inputs, and outputs for a specific tracked machine learning model.

### ingest_log
Accepts raw telemetry data (payload_json) and sends it into the Arize logging system.
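
As a rough illustration of what a `payload_json` value could look like, the sketch below serializes one inference record to a JSON string. Only the `payload_json` parameter name comes from this page; the field names inside the record (`model_name`, `environment`, `prompt`, `response`, `timestamp`) and the helper `build_payload_json` are assumptions made for the example, not a documented Arize schema.

```python
import json
from datetime import datetime, timezone


def build_payload_json(model_name, environment, prompt, response):
    """Serialize one inference record into a JSON string.

    All keys below are hypothetical; check the schema your Arize
    space actually expects before sending real data.
    """
    record = {
        "model_name": model_name,    # assumed field name
        "environment": environment,  # assumed field name
        "prompt": prompt,
        "response": response,
        # ISO 8601 timestamp in UTC, generated at build time
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record)


payload_json = build_payload_json(
    "OpenAI-Customer-Service-Bot",
    "Production",
    "Where is my order?",
    "Your order shipped yesterday.",
)
print(payload_json)
```

In practice the agent would typically build this string for you; the sketch only shows the kind of structure it might hand to `ingest_log`.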

### get_metrics
Fetches real-time observability metrics and performance scores for an ML model.

### list_models
Lists all ML models or LLMs currently being tracked within the platform space.

### run_eval
Triggers an automated evaluation run for LLM checks using configured ground truth baselines.

### list_spaces
Returns a list of accessible workspaces, which separate different model telemetry datasets.

## Prompt Examples

**Prompt:** 
```
List all active Machine Learning models monitored in my workspace.
```

**Response:** 
```
I've fetched your models from the Arize Space. You currently have 3 connected models: 'Fraud-Detection-v2' (Classification), 'Customer-Churn-XGB' (Score), and 'OpenAI-Customer-Service-Bot' (LLM). Would you like to see the recent drift metrics for any of them?
```

**Prompt:** 
```
Get the evaluation baseline datasets available for our LLM checks.
```

**Response:** 
```
You have two static evaluation datasets loaded in Arize: 'Support-Tickets-Q2-GroundTruth' (1400 rows) and 'Toxicity-Benchmark' (250 rows). I can trigger an automated `run_eval` check targeting these datasets against your active LLM logs if needed.
```

**Prompt:** 
```
Push these 3 mocked prompt responses as telemetry logs to the 'OpenAI-Customer-Service-Bot' model.
```

**Response:** 
```
I successfully structured your 3 prompts into valid ingestion payloads and pushed them via the `ingest_log` tool. They should now be available for analysis and drift observation in the Arize telemetry dashboard.
```

## Capabilities

### Check Model Status
List all active ML models and retrieve their detailed configuration schemas.

### Monitor Performance Metrics
Fetch current observability metrics, including performance scores and data quality reports for any tracked model.

### Manage Data Inputs
List available static evaluation datasets or retrieve specific dataset metadata for testing purposes.

### Track Live Data Streams
Push raw logs, predictions, and inferences into the platform for immediate visualization and drift analysis.

### Control Environments
List configured deployment environments, such as Production or Verification, to ensure data segregation.

## Use Cases

### Debugging a Production Drift Spike
A user notices model accuracy dropped in production. Instead of diving into the UI, they ask their agent to check `get_metrics` for the specific model and then use `list_environments` to confirm if the issue is isolated to the active deployment space.

### Setting up a New Evaluation Benchmark
A data scientist needs to test an LLM against new toxicity rules. They first run `list_datasets` to find available benchmarks, then use `get_dataset` to confirm the schema, and finally trigger the check with `run_eval`.
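
The three-step flow above can be sketched as a small function that takes any "call a tool" callable. This is a minimal sketch, not the server's real interface: the argument names (`dataset_id`, `model_name`), the response shapes, and the `fake_call_tool` stand-in are all assumptions introduced so the example runs without a live connection.

```python
from typing import Any, Callable, Dict

# Signature of a function that invokes an MCP tool by name with arguments.
ToolCaller = Callable[[str, Dict[str, Any]], Any]


def run_benchmark(call_tool: ToolCaller, dataset_name: str, model_name: str):
    """Find a dataset by name, inspect it, then trigger an evaluation."""
    datasets = call_tool("list_datasets", {})
    match = next((d for d in datasets if d["name"] == dataset_name), None)
    if match is None:
        raise LookupError(f"dataset not found: {dataset_name}")
    # Argument names below are hypothetical, not a documented schema.
    details = call_tool("get_dataset", {"dataset_id": match["id"]})
    result = call_tool(
        "run_eval", {"dataset_id": match["id"], "model_name": model_name}
    )
    return details, result


calls = []  # records the order in which tools were invoked


def fake_call_tool(name, arguments):
    """Stand-in for a real MCP client; returns canned, made-up responses."""
    calls.append(name)
    if name == "list_datasets":
        return [{"id": "ds-1", "name": "Toxicity-Benchmark"}]
    if name == "get_dataset":
        return {"id": arguments["dataset_id"], "rows": 250}
    return {"status": "queued"}


details, result = run_benchmark(
    fake_call_tool, "Toxicity-Benchmark", "OpenAI-Customer-Service-Bot"
)
print(calls)
```

Passing the caller in as a parameter keeps the workflow testable: swap `fake_call_tool` for whatever your MCP client exposes.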

### Capturing Live Inference Data
A developer writes a new feature that makes many model calls. Rather than recording each one by hand, they use `ingest_log` to push the payload stream so that every prediction is sent to Arize.

### Auditing Model Readiness
A product manager needs proof that a model is stable before release. They ask the agent to list all active models (`list_models`), check the target model's current performance metrics using `get_metrics`, and confirm it's running in a verified environment.

## Benefits

- Stop context-switching. You don't have to leave your terminal or IDE just because you need to check `get_metrics` for prediction drift. Your agent does the heavy lifting, keeping your focus on coding.
- Better governance means knowing where your data comes from. Use `list_environments` and `list_spaces` to separate Production telemetry from Training runs, which is critical for clean audits.
- `ingest_log` allows you to push raw inference payloads programmatically, so the behavior you observe is tracked in Arize for later analysis.
- When you need assurance on model output quality, the agent can list automated evaluation runs (`list_evals`) or even kick off a new check using `run_eval` against ground truth data.
- The system provides deep visibility into your entire ML stack. You get to see everything from the initial schema definition via `get_model` all the way through live performance tracking.

## How It Works

The bottom line is you don't need a GUI; your AI client handles the API calls and reports back what it finds.

1. Subscribe to this MCP and provide your Arize API Key and Space ID.
2. Reference a model by name (e.g., 'Fraud-Detection-v2') so the agent knows where to look for metrics.
3. Ask the agent to perform an action, like fetching drift metrics or listing active models.
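
Under the hood, an MCP client turns a request like the one in step 3 into a JSON-RPC 2.0 `tools/call` message. The sketch below builds such a message for `get_metrics`. The envelope (`jsonrpc`, `id`, `method`, `params.name`, `params.arguments`) follows the MCP specification; the `model_name` argument key and the `build_tool_call` helper are assumptions, since this page does not list each tool's input schema.

```python
import json


def build_tool_call(tool_name, arguments, request_id=1):
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# `model_name` is a hypothetical argument key used for illustration.
request = build_tool_call("get_metrics", {"model_name": "Fraud-Detection-v2"})
print(json.dumps(request, indent=2))
```

You normally never write this by hand; your agent client generates it when you ask a question in plain English.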

## Frequently Asked Questions

**How do I use the ingest_log tool with Arize AI?**
You pass a payload JSON structure to `ingest_log`. The agent handles structuring your raw telemetry logs into the valid format and pushing them directly to Arize for analysis.

**Can I list all monitored ML models with list_models?**
Yes, running `list_models` retrieves a complete list of every tracked ML or LLM model in your current workspace, helping you narrow down where the issue is occurring.

**What's the difference between getting metrics and listing environments?**
`get_metrics` gives quantitative data (performance scores, drift rates) for a specific model. `list_environments` just shows you the names of available deployment contexts like 'Production' or 'Training'.

**Do I need to use run_eval if I want to test my LLM?**
No, not always. If you have a specific dataset and just need metrics, `get_metrics` might suffice. However, using `run_eval` triggers the formal evaluation process against ground truth baselines.

**How do I use list_spaces to see all my available workspaces?**
Ask your agent to call `list_spaces`. It lists every organizational space you have access to in Arize, which lets your agent pinpoint exactly which model or telemetry dataset needs monitoring and keeps your work properly segmented.

**What information does get_model need about my tracked ML model?**
The tool requires the name and ID of the model you are tracking. It returns the model's metadata, including its inputs, outputs, and features, so your agent knows exactly what to monitor.

**What does list_environments show me about my deployment stages?**
It shows defined contexts like Production, Training, or Verification. You can use this to restrict monitoring to a specific lifecycle stage, which is critical for accurate reporting before going live.

**If I list_datasets, how do I get the details on a particular dataset using get_dataset?**
Pick a dataset from the `list_datasets` output and pass it to `get_dataset`. The tool retrieves all metadata for the specified dataset, including row counts, column names, and schema information.

**Can my AI automatically trigger a hallucination evaluation on a new dataset?**
Yes! You can ask your agent to retrieve the specific Ground Truth dataset ID, formulate a testing payload, and invoke the `run_eval` tool natively. Arize will process the asynchronous scoring internally and log the evaluation securely.

**How can I quickly check if a production model is experiencing data drift?**
Just tell your agent: 'Fetch the primary metrics for model X'. The AI uses the `get_metrics` tool to surface latency degradation, prediction drift flags, and incoming data quality indexes without opening the browser.

**Is it possible to track telemetry simultaneously for both local development and production environments?**
Absolutely. Arize enforces strict separation using Spaces and Environments. You can instruct your AI agent to query the `list_environments` tool, figure out the sandbox ID, and push manual test logs strictly to the sandbox scope during debugging sessions, keeping production metrics clean.
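
One way to picture the sandbox scoping described above: look up the environment by name in the `list_environments` result, then tag each test log with its ID before ingestion. This is only a sketch under assumptions. The response shape (`id` and `name` keys), the `environment_id` field, the sample IDs, and the `scope_to_environment` helper are all hypothetical.

```python
def scope_to_environment(environments, name, payload):
    """Return a copy of `payload` tagged with the named environment's id."""
    # Case-insensitive lookup so 'sandbox' and 'Sandbox' both match.
    by_name = {env["name"].lower(): env for env in environments}
    env = by_name.get(name.lower())
    if env is None:
        raise LookupError(f"environment not found: {name}")
    # `environment_id` is an assumed field name, not a documented one.
    return {**payload, "environment_id": env["id"]}


# Made-up example of what `list_environments` might return.
environments = [
    {"id": "env-prod", "name": "Production"},
    {"id": "env-sandbox", "name": "Sandbox"},
]

test_log = {"prompt": "ping", "response": "pong"}
scoped = scope_to_environment(environments, "sandbox", test_log)
print(scoped)
```

Because the helper returns a new dictionary, the original test log stays untouched and can be reused against another environment.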