# Datadog MCP for AI Agents

> Datadog connects your AI client to full-stack observability data. You get conversational control over metrics, infrastructure health, and logs in real time. Instead of clicking through complex dashboards, you talk to your system to list active incidents, query specific performance metrics, or search error logs across every service.

## Overview
- **Category:** loved-by-devs
- **Price:** Free
- **Tags:** full-stack-monitoring, infrastructure-metrics, log-analysis, incident-management, cloud-monitoring, alerting

## Description

Monitoring a large application stack shouldn't require deep knowledge of the Datadog UI. This MCP connects any AI client to your entire observability setup, letting you manage infrastructure health using natural conversation. You can ask your agent to find out why latency spiked yesterday or check if a specific host is running low on disk space without opening a single dashboard tab. It lets you run time-series queries with precise Datadog syntax and search across all indexed logs immediately. Need to control noise? You can even list and mute monitors during maintenance windows, keeping your team focused on actual issues. Connect this MCP through Vinkius to gain unified visibility into everything from Service Level Objectives (SLOs) down to individual host metadata.

## Tools

### check_datadog_status
Verifies that your AI client can successfully connect to Datadog.

### create_event
Allows you to programmatically create a new system event with specific tags and priority levels.

### get_dashboard
Fetches the complete layout, widgets, and template variables for a specified dashboard.

### get_incident
Retrieves all details about an active incident, including responders and timeline information.

### get_monitor
Gets the full configuration and status of a single alert monitor.

### list_dashboards
Shows all available dashboards within your account.

### list_events
Retrieves a list of recent platform events and custom system activity.

### list_hosts
Lists all reporting hosts, providing metadata, tags, and agent version details for inventory checks.

### list_incidents
Shows a comprehensive list of currently open incidents with their severity and status.

### list_metrics
Lists the available metric types that can be queried.

### list_monitors
Retrieves a list of all defined alert monitors for review.

### list_slos
Shows a summary of Service Level Objectives, including their targets and current compliance status.

### mute_monitor
Temporarily silences an alert monitor to prevent notification noise during maintenance periods.
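A mute usually needs an expiry so the monitor resumes notifying once maintenance ends. The sketch below computes that expiry as a POSIX timestamp, the format Datadog's API has used for mute end times. The argument names `monitor_id` and `end` are assumptions for illustration; check the tool's actual input schema before relying on them.

```python
from datetime import datetime, timedelta, timezone


def mute_end_timestamp(start, duration_minutes):
    """Return the POSIX timestamp at which a mute should expire."""
    if start.tzinfo is None:
        # Naive datetimes are ambiguous across time zones, so reject them.
        raise ValueError("start must be timezone-aware")
    return int((start + timedelta(minutes=duration_minutes)).timestamp())


# Example maintenance window: 90 minutes starting at 22:00 UTC.
window_start = datetime(2024, 1, 15, 22, 0, tzinfo=timezone.utc)

# Assumed argument names ("monitor_id", "end"); the real tool schema may differ.
mute_args = {"monitor_id": 58291, "end": mute_end_timestamp(window_start, 90)}
print(mute_args)
```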

### query_metrics
Executes specific time-series queries using Datadog syntax on metric data.
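To illustrate the syntax, Datadog metric queries generally follow the shape `aggregator:metric{scope} by {grouping}`. The sketch below assembles such a string. `build_metric_query` is a hypothetical helper rather than part of this MCP, and the metric and tag names are examples only.

```python
def build_metric_query(aggregator, metric, filters=None, group_by=None):
    """Assemble a Datadog-style metric query string.

    Hypothetical helper for illustration; not part of the MCP itself.
    """
    # An empty scope is written as {*}, meaning "all sources".
    scope = ",".join(f"{k}:{v}" for k, v in (filters or {}).items()) or "*"
    query = f"{aggregator}:{metric}{{{scope}}}"
    if group_by:
        query += " by {" + ",".join(group_by) + "}"
    return query


cpu_query = build_metric_query(
    "avg", "system.cpu.user", {"env": "production"}, ["host"]
)
print(cpu_query)
```

In practice the agent writes this string for you; seeing its shape helps when you want to check or refine what it ran.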

### search_logs
Searches through indexed logs across all sources to find specific error or warning messages.

### search_monitors
Allows you to search for monitors using keywords or filters.

## Prompt Examples

**Prompt:** 
```
Show me all monitors that are currently alerting and mute the noisiest one.
```

**Response:** 
```
**Alert Summary**

Found 4 monitors in Alert state:
1. 'High CPU - web-prod-01' (ID: 58291, triggered 47 times today).
2. 'Disk Space Low - db-replica-03' (ID: 58344, triggered 12 times).
... 

I muted the noisiest one, Monitor 58291. No further CPU alerts will be sent for it until it is unmuted.
```

**Prompt:** 
```
Search for error logs in production from the last hour.
```

**Response:** 
```
**Log Search Results**

Query executed: `status:error env:production`...

Found **234** error events in the last hour. Top sources:
*   api-gateway (89 errors)
*   payment-service (67 errors)
The most common issue is 'TimeoutException: upstream connection timed out' (124 occurrences). Do you want to check related incidents?
```

**Prompt:** 
```
List all SLOs and tell me which ones are at risk of breaching their error budget.
```

**Response:** 
```
**SLO Health Check**

Reviewed 8 Service Level Objectives. Two are flagged as 'At Risk':
1. **API Availability (99.9%):** Current status: 99.912%. Budget remaining: 12% (7-day window). Expect exhaustion in 2.3 days.
2. **Checkout Latency P95 < 800ms:** Status: 99.1%. Budget remaining: 28%. 
The other six are healthy (>60% budget).
Which SLO should I inspect first?
```

## Capabilities

### Check system connectivity
Verify the connection status between your AI client and Datadog.

### List, search, or inspect monitors
Review all defined alerts to see what's firing or mute noisy ones during planned maintenance periods.

### Get dashboard layouts and variables
Retrieve the full structure of any operational dashboard, including widget details and template variables.

### Run custom metric queries
Execute specific time-series queries using Datadog syntax to analyze performance data across custom time ranges.

### Search detailed log events
Find specific error or warning events by querying logs using standard Datadog query language.

### Manage platform and custom events
List existing system events, check host inventory details, or create new events with specific tags and priority levels.

### Track active incidents and SLOs
See a list of currently open incidents with their severity, including who is responding and the timeline. You can also review Service Level Objectives for error budget compliance status.

## Use Cases

### Investigating a sudden spike in checkout latency
The agent can run `query_metrics` for P95 latency over the last four hours. It then uses `search_logs` to correlate the exact time window of high latency with error events found in the payment service logs, identifying 'TimeoutException' as the root cause.
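The correlation step can be sketched as follows: find the time span in which latency exceeded a threshold, then reuse that span as the range for the log search. The data points and the 800 ms threshold are made up, and `breach_window` is a hypothetical helper, not something the MCP exposes.

```python
def breach_window(points, threshold_ms):
    """Return (start, end) timestamps spanning all points above threshold, or None."""
    breached = [ts for ts, value in points if value > threshold_ms]
    if not breached:
        return None
    return min(breached), max(breached)


# Made-up sample: (epoch seconds, P95 latency in ms) at 5-minute intervals.
p95_points = [
    (1700000000, 420.0),
    (1700000300, 455.0),
    (1700000600, 1310.0),
    (1700000900, 1275.0),
    (1700001200, 480.0),
]

# The resulting span would become the time range of the follow-up log search.
window = breach_window(p95_points, threshold_ms=800)
print(window)
```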

### Preparing for a major system update
Before deploying new code, the agent can use `list_hosts` to generate a current inventory list of all reporting hosts and their tags. It can then run `get_monitor` on key services to ensure alerts are configured correctly before the change.

### Handling an active outage incident
A user asks, 'What's going wrong right now?' The agent uses `list_incidents` for a summary, then checks `get_incident` details to see who is responding and the current status of the service.

### Auditing system reliability targets
The team needs an overview. They ask the agent to check all SLOs via `list_slos`. The agent identifies which objectives are nearing their error budget limit, flagging services that require immediate attention.
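For reference, the remaining error budget follows from the target and the measured value: the budget is the allowed failure rate (100% minus the target), and the share consumed is the observed failure rate divided by it. A minimal sketch, using illustrative numbers rather than real SLO data:

```python
def error_budget_remaining(target_pct, current_pct):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed = 100.0 - target_pct  # e.g. 0.1 for a 99.9% target
    if allowed <= 0:
        raise ValueError("target must be below 100%")
    consumed = 100.0 - current_pct
    return 1.0 - consumed / allowed


# Illustrative values: a 99.9% target measured at 99.95% over the window.
remaining = error_budget_remaining(target_pct=99.9, current_pct=99.95)
print(f"{remaining:.0%} of the error budget remains")
```

A negative result means the objective has already spent more than its budget for the window.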

## Benefits

- Instantly triage alerts. Instead of checking every monitor by hand, use the `list_monitors` tool to quickly see which alerts are firing and check their status.
- Deep dive into performance data. Run time-series analysis through `query_metrics`: the agent writes the Datadog query syntax, so you get granular results without hand-crafting the query yourself.
- Reduce alert fatigue. Use the `mute_monitor` tool to silence noisy alerts for planned maintenance periods, ensuring your team only gets notified of real issues.
- Pinpoint root causes fast. Describe what you are looking for in plain language, and the `search_logs` tool searches across all indexed log sources, correlating errors with specific hosts or services.
- Full visibility into service commitments. Review Service Level Objectives and check error budget compliance via the SLO tools to ensure your application meets its goals.

## How It Works

Instead of navigating complex UIs, your AI client talks directly to your monitoring data through structured tools.

1. Subscribe to this MCP and provide your Datadog API Key along with the site URL that matches your Datadog region (e.g., `https://api.datadoghq.com` for US1 or `https://api.datadoghq.eu` for EU).
2. Your AI client authenticates with Vinkius, allowing it to send structured commands directly to the monitoring platform.
3. You simply ask your agent a question—like 'Why did the API latency spike last night?'—and it runs the necessary metric queries or log searches for you.
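Under the hood, MCP clients invoke tools with JSON-RPC 2.0 `tools/call` requests. The sketch below builds one such message for `search_logs`. The argument names (`query`, `from`, `to`) are assumptions for illustration; the tool's published input schema is the authority.

```python
import json


def tool_call_request(request_id, tool_name, arguments):
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = tool_call_request(
    1,
    "search_logs",
    # Assumed argument names; check the tool's published input schema.
    {"query": "status:error env:production", "from": "now-1h", "to": "now"},
)
print(json.dumps(request, indent=2))
```

Your AI client normally builds and sends these messages for you, so this is only meant to show what a structured command looks like.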

## Frequently Asked Questions

**How does the Datadog MCP help me query performance metrics?**
It lets you run time-series queries in Datadog query syntax without building them by hand. You just ask for the data point, such as 'P95 latency over 4 hours', and the agent runs the query and returns the results.

**Can I use this Datadog MCP to manage my alerts and monitors?**
Yes, you can list all defined monitors and even mute them. This is useful for reducing alert noise when your team knows maintenance or testing is happening across the infrastructure.

**What if I need to check logs from a specific host?**
Use `list_hosts` to pull the host inventory, including metadata and tags. You can then use that context, such as a host name or tag, when searching for error logs via `search_logs`.
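As a sketch of that hand-off, a host name taken from the inventory can be combined with a status filter into a Datadog-style log search string. `build_log_query` is a hypothetical helper and the host name is an example, not real data.

```python
def build_log_query(**facets):
    """Join facet filters into a Datadog-style log search string."""
    parts = []
    for name, value in facets.items():
        # Quote values containing spaces so they stay a single term.
        text = f'"{value}"' if " " in str(value) else str(value)
        parts.append(f"{name}:{text}")
    return " ".join(parts)


# Example: errors from a single host found via the inventory.
host_errors = build_log_query(host="web-prod-01", status="error")
print(host_errors)
```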

**Does connecting Datadog MCP improve my incident response time?**
Yes, because it aggregates all critical information—incidents, SLOs, and logs—into a single conversational flow. You spend less time jumping between tabs and more time fixing the problem.

**Is this MCP only for viewing data or can I perform actions?**
It does both. You can read detailed reports on SLOs, but you can also take action, like muting a monitor or creating a new system event directly through your AI agent.