# Datadog MCP

> Datadog provides unified observability for your entire tech stack. Use this MCP to run deep metric queries, search error logs across all sources, inspect active incidents, and validate Service Level Objectives (SLOs) without opening a dashboard. It gives you full-stack visibility, right from your chat client.

## Overview
- **Category:** loved-by-devs
- **Price:** Free
- **Tags:** full-stack-monitoring, infrastructure-metrics, log-analysis, incident-management, cloud-monitoring, alerting

## Description

When an outage hits, you don't have time to click through five different tabs. You need answers. This connector lets your agent treat Datadog like a command-line interface: check the current health of monitors, list active incidents with their severity and timeline, or query specific metrics using Datadog's query syntax. To find out why something broke, run log searches across every indexed source and narrow hundreds of errors down to the failing service in minutes. You can also see which SLOs are dipping below target and cut alert noise by muting monitors during maintenance windows.

The power comes when your agent chains these actions together. Vinkius AI Analytics gives you visibility into every step: which metrics were queried, what data flowed through, and how much of your budget was used. You run complex diagnostic workflows against production systems without leaving your chat window.

## Tools

### check_datadog_status
Verifies basic connectivity to the Datadog API.

### create_event
Allows you to manually generate a custom event record in the platform.

### get_dashboard
Retrieves detailed layout information for a specific dashboard ID.

### get_incident
Fetches full details about a single, active incident.

### get_monitor
Retrieves the configuration and current status of an individual monitor.

### list_dashboards
Lists all available dashboards in your account for quick reference.

### list_events
Retrieves a list of recent platform events, including tags and priority levels.

### list_hosts
Lists all reporting hosts along with their metadata and agent versions.

### list_incidents
Gets a summary list of active service incidents, showing severity and status.

### list_metrics
Returns a comprehensive list of all available metrics in the system.

### list_monitors
Provides an overview of every configured monitor, helping you see what's alerting.

### list_slos
Lists all defined Service Level Objectives (SLOs) and their compliance status.

### mute_monitor
Temporarily silences an alert monitor to prevent notification spam during testing or maintenance.

### query_metrics
Executes detailed time-series queries using Datadog syntax for specific metric data points.

### search_logs
Searches through historical log events across all indexed sources based on query criteria.

### search_monitors
Finds monitors matching specific keywords or configuration filters.

## Prompt Examples

**Prompt:** 
```
Show all monitors that are currently alerting and mute the noisiest one.
```

**Response:** 
```
Found 4 monitors in Alert state. 1) 'High CPU - web-prod-01' (ID: 58291, triggered 47 times today). 2) 'Disk Space Low - db-replica-03' (ID: 58344, triggered 12 times). 3) 'API Latency P99 > 500ms' (ID: 58402, triggered 8 times). 4) 'Error Rate > 5%' (ID: 58415, triggered 3 times). Muting the noisiest one... Monitor 58291 'High CPU - web-prod-01' has been muted. No further notifications will be sent until it is unmuted.
```

**Prompt:** 
```
Search for error logs in production from the last hour.
```

**Response:** 
```
Searching logs with query 'status:error env:production'... Found 234 error events in the last hour. Top sources: api-gateway (89 errors), payment-service (67 errors), user-auth (45 errors), notification-service (33 errors). The most frequent error: 'TimeoutException: upstream connection timed out' (124 occurrences). Would you like to see the full stack traces or check related incidents?
```

**Prompt:** 
```
List all SLOs and tell me which ones are at risk of breaching their error budget.
```

**Response:** 
```
Retrieved 8 SLOs. 2 are at risk: 1) 'API Availability (99.9%)' - current status: 99.912%, error budget remaining: 12% (7-day window). At current burn rate, budget will be exhausted in 2.3 days. 2) 'Checkout Latency P95 < 800ms' - current status: 99.1%, error budget remaining: 28%. The other 6 SLOs are healthy with >60% budget remaining. Would you like to inspect the monitors tied to the at-risk SLOs?
```

## Capabilities

### Review service health targets
List Service Level Objectives (SLOs) to check current error budgets and compliance status.

### Diagnose metric performance
Run time-series queries using Datadog syntax to analyze specific system metrics over custom ranges.

### Triage active service disruptions
List and get details on current incidents, showing severity, responders, and the full timeline.

### Pinpoint error sources in logs
Search log events using Datadog query syntax across all indexed log sources to find root causes.

### Manage alert noise
List, search, and mute individual monitors when the system generates too many false alarms.

## Use Cases

### The Production Outage Triage
An alert fires for high latency. Instead of guessing, your agent first runs `list_incidents` to confirm the scope. Then it uses `search_logs` with a 'timeout' query and finally executes `query_metrics` on the P95 metric to prove where the bottleneck is.
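
The sequence above can be sketched in Python as three tool calls made in order. This is a minimal illustration, not the connector's actual client code: `call_tool`, the argument names, and the metric name `my_app.request.latency.p95` are all hypothetical, and a stub stands in for a real MCP client so the sketch runs on its own.

```python
from typing import Any, Callable

# Hypothetical signature: call_tool(tool_name, arguments) -> result
ToolCaller = Callable[[str, dict], Any]


def triage_latency(call_tool: ToolCaller) -> dict:
    """Run the three triage steps in order and collect their results."""
    # Step 1: confirm the scope of the problem.
    incidents = call_tool("list_incidents", {})
    # Step 2: look for timeout errors (argument names are assumptions).
    logs = call_tool("search_logs", {"query": "timeout", "from": "now-1h"})
    # Step 3: check the latency metric (metric name is a placeholder).
    latency = call_tool(
        "query_metrics",
        {"query": "avg:my_app.request.latency.p95{env:production}"},
    )
    return {"incidents": incidents, "logs": logs, "latency": latency}


# Stand-in for a real MCP client: records each call instead of contacting Datadog.
calls = []


def fake_call_tool(name: str, arguments: dict) -> str:
    calls.append(name)
    return f"<result of {name}>"


report = triage_latency(fake_call_tool)
print(calls)
```

In practice the agent decides this ordering itself; the sketch only makes the dependency between the steps explicit.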

### Pre-Deployment Readiness Check
Before rolling out code, an engineer runs `list_monitors` to verify all necessary health checks are active. They then use `get_monitor` on the critical monitors to confirm baseline performance is met.

### Compliance Reporting Audit
A product manager needs proof of uptime. The agent uses `list_slos` and provides a summary report detailing error budget consumption across key services, proving adherence to SLAs.

### Post-Mortem Deep Dive
After an incident, you need data on the failing component. You run `list_hosts` to identify the exact machine that failed, then use `search_logs` targeting that host for a complete error trace.

## Benefits

- Stop clicking through dashboards. You can run complex diagnostic queries using `query_metrics` or `search_logs` directly against your live data set.
- Control alert fatigue immediately. Use the MCP to list all monitors and then mute specific ones with `mute_monitor` during maintenance windows, keeping your focus on critical alerts.
- Validate service health at a glance. Run `list_slos` to see which services are nearing their error budget limits without running dedicated reports.
- Get immediate incident context. Instead of browsing the dashboard for an active issue, use `get_incident` to pull severity, status, and responder details instantly.
- Accelerate root cause analysis. When a problem surfaces, run `search_logs` with specific error queries to pinpoint exactly which microservice failed.

## How It Works

You get operational visibility into your infrastructure stack through plain conversation, without the manual dashboard drill-down.

1. Connect your Datadog API Key and site URL to this MCP.
2. Tell your agent exactly what you need—for example, 'List all SLOs and find the error budget for checkout latency.'
3. The agent runs the necessary queries and returns a consolidated report on service health and potential failure points.

## Frequently Asked Questions

**How do I use `search_logs` with Datadog?**
You ask your agent to run `search_logs` and provide the necessary query syntax, like 'status:error env:production'. The MCP handles the complex API calls so you don't have to worry about formatting.
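
To make the query syntax concrete, here is a minimal Python sketch that assembles a query string like the one above and wraps it in a request body. The body shape follows Datadog's public Logs Search API (v2). The helper names `build_log_query` and `build_search_body` are hypothetical, and the arguments the `search_logs` tool itself accepts may differ.

```python
def build_log_query(**facets: str) -> str:
    """Join facet filters into a log query, e.g. 'status:error env:production'."""
    parts = []
    for key, value in facets.items():
        # Quote values containing spaces so they stay a single search term.
        term = f'"{value}"' if " " in value else value
        parts.append(f"{key}:{term}")
    return " ".join(parts)


def build_search_body(query: str, time_from: str = "now-1h",
                      time_to: str = "now", limit: int = 25) -> dict:
    """Request body in the shape of the Logs Search API v2 (assumed for the tool)."""
    return {
        "filter": {"query": query, "from": time_from, "to": time_to},
        "sort": "-timestamp",  # newest events first
        "page": {"limit": limit},
    }


query = build_log_query(status="error", env="production")
body = build_search_body(query)
print(body["filter"]["query"])
```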

**Can I mute a monitor using `list_monitors`?**
No. First, use `list_monitors` or `search_monitors` to find the correct ID, then ask your agent to execute the `mute_monitor` tool with that specific ID.
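
The two-step flow can be sketched as follows, assuming the listing step has already returned a set of monitor records. The record fields (`state`, `triggers_today`) and the sample values are hypothetical; the fields the tools actually return may differ.

```python
from typing import Optional

# Hypothetical monitor records, as a listing step might return them.
monitors = [
    {"id": 58291, "name": "High CPU - web-prod-01", "state": "Alert", "triggers_today": 47},
    {"id": 58344, "name": "Disk Space Low - db-replica-03", "state": "Alert", "triggers_today": 12},
    {"id": 58402, "name": "API Latency P99 > 500ms", "state": "Alert", "triggers_today": 8},
    {"id": 58500, "name": "Queue Depth", "state": "OK", "triggers_today": 0},
]


def noisiest_alerting_monitor(records: list) -> Optional[int]:
    """Return the ID of the alerting monitor that triggered most often, if any."""
    alerting = [m for m in records if m["state"] == "Alert"]
    if not alerting:
        return None
    return max(alerting, key=lambda m: m["triggers_today"])["id"]


monitor_id = noisiest_alerting_monitor(monitors)
# This ID is what the agent would then pass to `mute_monitor`.
print(monitor_id)
```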

**What is the difference between `list_metrics` and `query_metrics`?**
Use `list_metrics` when you just want a catalog of the metrics that exist. Use `query_metrics` when you know the metric name and need to pull actual time-series data for it.
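
For reference, a Datadog metric query generally takes the form `<aggregator>:<metric>{<scope>} by {<groups>}`. The sketch below assembles such a string in Python. The helper name `build_metric_query` is hypothetical, and how `query_metrics` expects the query and time range to be passed is not specified here.

```python
from typing import Optional


def build_metric_query(aggregator: str, metric: str,
                       scope: Optional[dict] = None,
                       group_by: Optional[list] = None) -> str:
    """Assemble '<aggregator>:<metric>{<scope>} by {<groups>}'."""
    # '*' selects every source when no tag filter is given.
    scope_str = ",".join(f"{k}:{v}" for k, v in scope.items()) if scope else "*"
    query = f"{aggregator}:{metric}{{{scope_str}}}"
    if group_by:
        query += " by {" + ",".join(group_by) + "}"
    return query


print(build_metric_query("avg", "system.cpu.user", {"env": "production"}, ["host"]))
```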

**Does `get_incident` provide enough detail for a post-mortem?**
It provides core incident details, like status and responders. For a full root cause analysis, you'll want to follow up by running `search_logs` targeting the time frame provided in the incident record.

**How does the `check_datadog_status` tool verify connectivity for my agent?**
It runs a basic API call using your credentials to confirm access. If the status check succeeds, you know the key is valid and the network path is open. This confirms everything works before running complex metric queries.
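
One plausible way to perform such a check is a call to Datadog's key-validation endpoint (`GET /api/v1/validate`). Whether `check_datadog_status` uses this exact endpoint is an assumption. The sketch below only prepares the request, so it can be inspected without sending anything; `build_validate_request` is a hypothetical helper, and it takes the site as a domain such as `datadoghq.eu`.

```python
import urllib.request


def build_validate_request(api_key: str, site: str = "datadoghq.com") -> urllib.request.Request:
    """Prepare (but do not send) a GET to Datadog's key-validation endpoint."""
    url = f"https://api.{site}/api/v1/validate"
    return urllib.request.Request(url, headers={"DD-API-KEY": api_key}, method="GET")


req = build_validate_request("<your-api-key>", site="datadoghq.eu")
print(req.get_method(), req.full_url)
# To actually run the check: urllib.request.urlopen(req, timeout=10)
```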

**What's the difference between `list_events` and using the `create_event` tool?**
They do different things. `list_events` pulls existing platform events for review by your agent. You use `create_event` when you need your AI client to actively inject a new, custom event with specific tags or a priority level.
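
As an illustration of what a custom event carries, the sketch below builds an event body in the shape of Datadog's v1 Events API, where `priority` is either `normal` or `low`. The helper `build_event` and the sample values are hypothetical, and the argument names `create_event` expects may differ.

```python
def build_event(title: str, text: str, tags: list, priority: str = "normal") -> dict:
    """Event body in the shape of Datadog's v1 Events API (assumed for the tool)."""
    if priority not in ("normal", "low"):
        raise ValueError("priority must be 'normal' or 'low'")
    return {"title": title, "text": text, "tags": tags, "priority": priority}


event = build_event(
    title="Deploy finished",
    text="Rolled out a new build of payment-service.",
    tags=["env:production", "service:payment-service"],
)
print(event)
```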

**If I need to find an alert that isn't currently firing, should I just list monitors or use `search_monitors`?**
You should use `search_monitors`. This tool lets your agent filter the view far beyond just active alerts. You can narrow down monitor results by tags, owner, or status when you're troubleshooting a specific component.

**What detailed information does running `list_slos` give me about service health?**
It provides the full picture of Service Level Objectives. For every SLO, your agent retrieves the current success rate, the remaining error budget percentage, and how close you are to exhausting your target.
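
The relationship between these numbers is simple arithmetic, sketched below in Python. The error budget is the share of failures the target allows, and the remaining budget is whatever part of that allowance has not been used. The input values (a 99.9% target, a 99.912% measured rate, and a burn of 5.2% of the budget per day) are hypothetical, and the constant burn rate is a simplifying assumption.

```python
def error_budget_remaining(target_pct: float, current_pct: float) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    allowed = 100.0 - target_pct      # failures the target permits, in percent
    consumed = 100.0 - current_pct    # failures observed so far, in percent
    return 1.0 - consumed / allowed


def days_until_exhausted(remaining: float, burn_per_day: float) -> float:
    """Days left if the budget keeps burning at a constant daily fraction."""
    return remaining / burn_per_day


remaining = error_budget_remaining(target_pct=99.9, current_pct=99.912)
print(f"{remaining:.0%} of the budget remains")
print(f"{days_until_exhausted(remaining, burn_per_day=0.052):.1f} days until exhausted")
```

Note that a measured rate below the target yields a negative result, meaning the budget is already overspent.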