# Datadog AI LLM Observability MCP for AI Agents

> Datadog AI (LLM Observability) MCP lets you monitor, audit, and track performance metrics for your LLMs in real time. Your agent can pull data on token usage, latency spikes, prompt content, and overall infrastructure health directly from your existing Datadog setup.

## Overview
- **Category:** ai-frontier
- **Price:** Free
- **Tags:** llm-observability, token-usage, prompt-monitoring, ai-performance, telemetry, model-auditing

## Description

Running models is complex; tracking their cost and performance shouldn't be. This MCP connects your AI client to your Datadog account so you can manage LLM observability through natural conversation. Instead of hopping between dashboards and logs, your agent handles the deep dive. You can query metrics for specific things like token counts or latency timeseries, pull full prompt logs, and even check active outages that might be blocking multi-agent workflows. It also lets you view widgets graphing global AI expenses across providers like OpenAI and Anthropic.

When you connect this MCP via Vinkius, your agent gets visibility into your model stack, from simple usage tracking to incident reporting. You can see when a model was swapped out or when performance starts to drop below established thresholds.

## Tools

### create_event
Creates an event in Datadog, for example to mark a deployment or a model change.

### create_monitor
Creates a Datadog monitor that alerts automatically when a metric crosses a threshold you define.

### list_dashboards
Lists the dashboards in your Datadog account, including those whose widgets graph AI spend across providers.

### list_events
Lists events recorded in your Datadog account.

### list_incidents
Lists incidents, covering both active and historical service outages.

### search_llm_spans
Searches LLM Observability spans and returns their JSON payloads, including prompt and response content and usage context.

### list_ai_monitors
Lists the monitors that track your AI models, such as latency and resource-limit alerts.

### query_metrics
Queries core LLM observability metrics, such as token count and latency, from the platform.
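
As a rough illustration of what an MCP client sends when it invokes this tool, the sketch below builds a JSON-RPC 2.0 `tools/call` request. The `tools/call` method and the `name`/`arguments` envelope come from the MCP protocol; the argument names (`query`, `from`, `to`) and the metric name are assumptions made for illustration and may not match this server's actual schema.

```python
import json
import time


def build_tool_call(name, arguments, request_id=1):
    """Wrap a tool invocation in an MCP JSON-RPC 2.0 `tools/call` request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def last_hours_window(hours, now=None):
    """Return (start, end) in epoch seconds covering the last `hours` hours."""
    end = int(now if now is not None else time.time())
    return end - hours * 3600, end


# Hypothetical arguments: the real query_metrics schema may differ.
start, end = last_hours_window(4, now=1_700_000_000)
request = build_tool_call(
    "query_metrics",
    {
        # Illustrative metric name and tag, not a documented Datadog metric.
        "query": "avg:llm.tokens.total{service:main-chatbot}",
        "from": start,
        "to": end,
    },
)
print(json.dumps(request, indent=2))
```

In practice your AI client builds this request for you; the sketch only shows the shape of the exchange.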

### submit_series
Submits metric timeseries data points to Datadog.

### list_service_accounts
Lists the service accounts in your Datadog organization for service access management.

## Prompt Examples

**Prompt:** 
```
Can you show me the token usage and latency for our main chatbot over the last four hours?
```

**Response:** 
```
**LLM Performance Metrics (Last 4 Hours)**

| Metric | Average Value |
| :--- | :---: |
| Tokens/Request | 185 tokens |
| Latency | 920 ms |
| Max Usage Spike | 3.1k tokens (1:30 PM) |

The average latency is stable, but usage spiked near 1:30 PM. This suggests a high-complexity query ran during that time.
```

**Prompt:** 
```
I suspect an outage happened yesterday; list any active service disruptions.
```

**Response:** 
```
**Incident Report**

**Status:** 1 incident found.
*   **Service:** Primary Gateway Auth
*   **Impact:** Intermittent failure (10/26, 4:00 PM - 4:30 PM EST)
*   **Details:** Authentication errors blocked multi-agent orchestration. The service was restored automatically.
```

**Prompt:** 
```
List all the current AI performance monitors we have set up.
```

**Response:** 
```
**Active Monitoring Status**

You currently have 5 monitors running:
*   [LLM-Latency-High]: **ALERT**. Average latency exceeded threshold (1200ms).
*   [Token-Quota-Reached]: OK.
*   [Model-Drift]: OK. Needs review.
*   [GPU-Utilization]: OK.
*   [ErrorRate-Threshold]: OK.
```

## Capabilities

### Querying Token and Latency Metrics
Find the average token usage, peak consumption times, and overall latency for your models over specific periods.
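
To make the idea of "average" and "peak" concrete, here is a minimal sketch of how a returned timeseries could be summarized. It assumes the series arrives as `(epoch_seconds, value)` pairs with `None` for gaps; that shape, the function name, and the sample data are all hypothetical.

```python
from datetime import datetime, timezone


def summarize_series(points):
    """Return the mean value and the peak point of a timeseries.

    `points` is a list of (epoch_seconds, value) pairs; None values
    (gaps in the series) are skipped. Returns None for an empty series.
    """
    valid = [(ts, v) for ts, v in points if v is not None]
    if not valid:
        return None
    average = sum(v for _, v in valid) / len(valid)
    peak_ts, peak_value = max(valid, key=lambda p: p[1])
    return {
        "average": average,
        "peak_value": peak_value,
        "peak_time": datetime.fromtimestamp(peak_ts, tz=timezone.utc),
    }


# Made-up sample data for illustration only.
sample = [
    (1_700_000_000, 150.0),
    (1_700_003_600, None),  # gap in the series
    (1_700_007_200, 3100.0),
    (1_700_010_800, 200.0),
]
summary = summarize_series(sample)
print(summary["average"], summary["peak_value"], summary["peak_time"].isoformat())
```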

### Auditing Prompt Content and Model Spans
Retrieve detailed records of literal prompts and response traces, helping you debug exactly what inputs caused performance issues.

### Checking for Active Service Outages
Monitor your infrastructure to detect real-time service disruptions or active outages blocking agent workflows.

### Creating Performance Alerts
Set up monitors that alert you when AI responses drop below expected performance levels or hit resource limits.
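
For a sense of what such a monitor looks like, the sketch below assembles a Datadog-style metric monitor definition as a plain dictionary. The overall shape (`type`, `query`, `options.thresholds`) follows Datadog's metric monitor format; the metric name `llm.request.latency`, the tag, and the helper itself are assumptions for illustration, and the arguments `create_monitor` actually accepts may differ.

```python
def build_latency_monitor(metric, scope, critical_ms, warning_ms=None, window="last_5m"):
    """Build a Datadog-style metric monitor definition as a plain dict."""
    if warning_ms is not None and warning_ms >= critical_ms:
        raise ValueError("warning threshold must be below the critical threshold")
    thresholds = {"critical": critical_ms}
    if warning_ms is not None:
        thresholds["warning"] = warning_ms
    return {
        "name": f"LLM latency high on {scope}",
        "type": "metric alert",
        # The threshold in the query matches the critical threshold below.
        "query": f"avg({window}):avg:{metric}{{{scope}}} > {critical_ms}",
        "message": f"Average LLM latency exceeded {critical_ms} ms.",
        "options": {"thresholds": thresholds},
    }


# Hypothetical metric name and tag, chosen only for this example.
monitor = build_latency_monitor(
    "llm.request.latency", "service:main-chatbot", critical_ms=1200, warning_ms=1000
)
print(monitor["query"])
```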

### Analyzing Global AI Infrastructure Spending
List dashboard widgets that graph total spending and usage across LLM providers to support budget planning.

## Use Cases

### Debugging a sudden spike in costs
A developer noticed their monthly LLM bill was spiking. They asked their agent to check the logs, which used `search_llm_spans` to retrieve specific payloads and pinpointed that a single unoptimized prompt loop was causing excessive token usage.
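
The kind of analysis described in this scenario can be sketched as a simple aggregation over span records. The record fields (`prompt`, `total_tokens`) and the sample data are hypothetical; real `search_llm_spans` payloads will likely use different field names.

```python
from collections import defaultdict


def top_token_consumers(spans, limit=3):
    """Group spans by prompt text and rank prompts by total tokens used."""
    totals = defaultdict(lambda: {"calls": 0, "tokens": 0})
    for span in spans:
        entry = totals[span["prompt"]]
        entry["calls"] += 1
        entry["tokens"] += span["total_tokens"]
    ranked = sorted(totals.items(), key=lambda item: item[1]["tokens"], reverse=True)
    return ranked[:limit]


# Made-up span records for illustration only.
spans = [
    {"prompt": "Summarize ticket", "total_tokens": 200},
    {"prompt": "Retry: expand context", "total_tokens": 3000},
    {"prompt": "Retry: expand context", "total_tokens": 3100},
    {"prompt": "Greet user", "total_tokens": 50},
]
ranked = top_token_consumers(spans, limit=2)
for prompt, stats in ranked:
    print(f"{prompt}: {stats['calls']} calls, {stats['tokens']} tokens")
```

A prompt that appears many times with a high token total is a likely candidate for the loop described above.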

### Verifying model stability after an update
The MLOps team just rolled out Model v2. They used `list_ai_monitors` to check if all their existing performance monitors were still tracking correctly and confirmed that the new version maintained low latency metrics.

### Diagnosing agent failure during peak hours
When an automated workflow failed, the SRE used `list_incidents` to check for active service disruptions. The report showed a temporary gateway authentication failure that was blocking multi-agent orchestration.

### Optimizing cloud spending across multiple services
The FinOps team needed an overall view of AI spend. They used the MCP to enumerate global expenses, allowing them to compare usage patterns between OpenAI and Anthropic in one place.

## Benefits

- Track actual resource consumption by querying specific metrics, like average tokens per request or latency spikes, using `query_metrics`.
- Never miss an outage. Use `list_incidents` to get real-time updates on service disruptions that could halt your agentic workflows.
- Manage performance automatically by calling `create_monitor`, setting alerts for when model responses fall below acceptable thresholds.
- Keep a clean audit trail of every interaction. Utilize `search_llm_spans` to retrieve the exact prompt and response payload contents needed for debugging.
- Control your costs proactively. You can view global spending patterns by using `list_dashboards`, giving you financial oversight across all model providers.

## How It Works

You get natural-language access to technical performance and cost data without leaving your chat window. Setup takes three steps:

1. Subscribe to this MCP in Vinkius and provide your Datadog API Key, APP Key, and Site details.
2. Your AI client authenticates with these credentials to access your observability data.
3. You simply ask your agent a question—like 'What was my token usage last quarter?'—and it fetches the precise metrics from your infrastructure.
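
The three credentials in step 1 map onto how Datadog's HTTP API is addressed: the site selects the API host, and the two keys are sent as the `DD-API-KEY` and `DD-APPLICATION-KEY` headers. The sketch below shows that mapping under stated assumptions. How Vinkius stores and forwards your credentials is not described here, so the environment variable names and the helper function are illustrative only.

```python
import os


def datadog_connection(env=None):
    """Derive the API base URL and auth headers from Datadog credentials.

    Reads DD_API_KEY, DD_APP_KEY, and DD_SITE from `env` (defaults to the
    process environment). These variable names are an assumption here.
    """
    env = os.environ if env is None else env
    missing = [key for key in ("DD_API_KEY", "DD_APP_KEY") if not env.get(key)]
    if missing:
        raise KeyError(f"missing credentials: {', '.join(missing)}")
    site = env.get("DD_SITE", "datadoghq.com")
    return {
        "base_url": f"https://api.{site}",
        "headers": {
            "DD-API-KEY": env["DD_API_KEY"],
            "DD-APPLICATION-KEY": env["DD_APP_KEY"],
        },
    }


# Placeholder values; never hard-code real keys.
conn = datadog_connection(
    {"DD_API_KEY": "example-api-key", "DD_APP_KEY": "example-app-key", "DD_SITE": "datadoghq.eu"}
)
print(conn["base_url"])
```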

## Frequently Asked Questions

**How does the Datadog AI LLM Observability MCP help me track costs?**
It provides a unified view of your spending. Instead of checking separate billing portals for every provider, you can ask the agent to graph global expenses and see exactly which models are driving your highest costs.

**I need to debug a failed LLM workflow; what should I use with this MCP?**
Use `search_llm_spans`. It lets you pull the full prompt payload and response traces, showing you exactly which input caused the failure or poor output.

**Can this MCP tell me if my AI services are currently down?**
Yes. With `list_incidents`, your agent checks for active outages and service disruptions across your infrastructure, so you can tell whether a failing workflow is caused by an underlying outage.

**How do I set up alerts for poor model performance using the Datadog AI LLM Observability MCP?**
Use `create_monitor`. Tell the agent which threshold you care about, and it sets up a monitor that alerts you when latency or token usage gets too high.

**Is this Datadog AI LLM Observability MCP better than just checking raw logs?**
For most day-to-day questions, yes. Instead of sifting through raw, unstructured data, you get actionable metrics, such as average usage or specific failure points, presented in plain language.