# Datadog MCP

> The Datadog MCP gives your AI agent conversational access to monitoring data for complex cloud infrastructure. It lets you pull historical performance data, search application logs for specific errors, and check the status of all active alerts in natural language.

## Overview
- **Category:** industry-titans
- **Price:** Free
- **Tags:** infrastructure-monitoring, log-analysis, performance-metrics, cloud-observability, alerting, real-time-monitoring

## Description

Monitoring modern apps means juggling metrics dashboards, log aggregators, and alert systems, which is a massive headache. This MCP lets your agent treat Datadog like a single chat window. Instead of clicking through three different dashboards to figure out why the checkout service is slow, you just ask. The agent pulls performance data points, searches logs for error patterns, and even checks whether scheduled maintenance caused an outage.

If you're building complex automations (say, checking a deployment status via `list_events` and then querying metrics with `query_metrics` to see the impact), you can chain this MCP with others. For instance, by connecting multiple services through Vinkius, your agent can build automated incident response workflows spanning logging, monitoring, and ticketing systems, all from one chat session.
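The chaining described above could look roughly like the sketch below. The `call_tool` callable, the argument names (`tags`, `query`, `from`, `to`), and the metric name are assumptions for illustration, not the documented schema of this MCP; check the tool schemas your client reports before relying on them.

```python
import time
from typing import Any, Callable, Optional

# Hypothetical signature: (tool_name, arguments) -> parsed tool result.
ToolCaller = Callable[[str, dict[str, Any]], Any]


def deployment_impact(call_tool: ToolCaller, service: str,
                      window_s: int = 3600,
                      now: Optional[int] = None) -> dict[str, Any]:
    """Fetch recent events for a service, then CPU metrics for the same window."""
    end = int(time.time()) if now is None else now
    start = end - window_s

    # Step 1: look for deployment events (argument names are assumed).
    events = call_tool("list_events", {
        "from": start,
        "to": end,
        "tags": f"service:{service}",
    })

    # Step 2: query metrics over the same window to see the impact.
    metrics = call_tool("query_metrics", {
        "query": f"avg:system.cpu.user{{service:{service}}}",
        "from": start,
        "to": end,
    })
    return {"events": events, "metrics": metrics, "window": (start, end)}


# A stub caller that records calls, standing in for a real MCP client.
calls: list[tuple[str, dict[str, Any]]] = []


def fake_call_tool(name: str, arguments: dict[str, Any]) -> list[Any]:
    calls.append((name, arguments))
    return []


report = deployment_impact(fake_call_tool, "checkout", now=1_700_000_000)
print([name for name, _ in calls])
```

Passing the caller in as a parameter keeps the workflow independent of any particular MCP client library, so the same function can be exercised with a stub.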

This means you get real-time visibility into what's actually happening under the hood without switching tabs or copying and pasting timestamps. You ask a question and get an answer drawn directly from your Datadog data.

## Tools

### get_dashboard
Retrieves details about a specific dashboard's layout structure and widget configurations.

### get_monitor
Gets the full status, alert thresholds, and historical changes for one monitor ID.

### list_dashboards
Returns a list of all available dashboards, including their titles and access URLs.

### list_downtimes
Lists planned maintenance periods, showing scope tags and recurring schedules for outages.

### list_events
Retrieves a collection of events, detailing titles, priority levels, and the source that generated them.

### list_hosts
Returns metadata for all infrastructure hosts, including agent versions and cloud provider tags.

### list_monitors
Finds monitors by their current state (e.g., Alert, OK) and returns key details about the alert type.

### list_slos
Returns a list of Service Level Objectives, showing target percentages and compliance status for services.

### mute_monitor
Silences a monitor's alerts temporarily, until a set end time or expiration date.

### query_metrics
Pulls historical time-series data for metrics, including scope tags and units, within defined time ranges.
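At the protocol level, an MCP client invokes a tool by sending a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a request for `query_metrics`. The argument names (`query`, `from`, `to`) and the use of epoch seconds are assumptions; the actual input schema is whatever this MCP advertises to your client.

```python
import json
from typing import Any


def build_tool_call(request_id: int, tool: str, arguments: dict[str, Any]) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


end = 1_700_000_000      # example timestamp in epoch seconds
start = end - 30 * 60    # 30 minutes earlier

request = build_tool_call(1, "query_metrics", {
    # Argument names below are assumptions, not the documented schema.
    "query": "avg:system.cpu.user{host:web-server}",
    "from": start,
    "to": end,
})
print(request)
```

In practice your AI client builds this message for you; the sketch is only meant to show what "within defined time ranges" translates to on the wire.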

### search_logs
Searches through application logs using specific query syntax to find entries with timestamps and status levels.
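As an illustration of the query syntax, the sketch below assembles a query string in the style of Datadog's log search, where space-separated terms are combined with an implicit AND. The helper `build_log_query` is hypothetical, and whether `search_logs` accepts the string in exactly this form is an assumption.

```python
from typing import Optional


def build_log_query(text: Optional[str] = None,
                    service: Optional[str] = None,
                    status: Optional[str] = None) -> str:
    """Combine filters into one space-separated (implicit AND) query string."""
    parts = []
    if service:
        parts.append(f"service:{service}")
    if status:
        parts.append(f"status:{status}")
    if text:
        # Quote free text so a multi-word phrase is matched as a phrase.
        escaped = text.replace('"', '\\"')
        parts.append(f'"{escaped}"')
    return " ".join(parts)


query = build_log_query(text="500 Internal Server Error",
                        service="auth-service", status="error")
print(query)
# prints: service:auth-service status:error "500 Internal Server Error"
```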

## Prompt Examples

**Prompt:** 
```
Show me the CPU usage for 'web-server' over the last 30 minutes
```

**Response:** 
```
Querying Datadog metrics... Average CPU usage for 'web-server' is 45%. Usage peaked at 78% about 10 minutes ago. Performance appears stable.
```

**Prompt:** 
```
Find logs with '500 Internal Server Error' from the last hour
```

**Response:** 
```
I found 12 logs with '500 Internal Server Error'. Most occurrences are in the 'auth-service'. I can provide the detailed stack traces for these errors if you need.
```

**Prompt:** 
```
Are there any active monitors in 'Alert' state?
```

**Response:** 
```
Checking Datadog monitors... I found 2 monitors in 'Alert' state: [Free Disk Space Low] on 'db-node-1' and [High Request Latency] on 'api-gateway'.
```

## Capabilities

### Analyze performance trends
Pull time-series data for any metric, allowing you to see how usage changes over specific time windows.

### Search application logs
Find specific error patterns or status codes across massive volumes of collected log entries.

### Check system health alerts
List and check the current status of all configured monitors, identifying what's alerting right now.

### Review infrastructure assets
Get metadata on every host connected to your account, including agent versions and tags.

### Identify planned outages
List scheduled maintenance periods or known service downtime windows.

## Use Cases

### The deployment rollback check
A developer needs to confirm the impact of a recent release. They ask their agent to run `query_metrics` for CPU usage over the last hour, then cross-reference `list_events` to see if any warnings triggered right after the deployment started.

### The mysterious user report
A customer reports intermittent failures. Instead of guessing, you ask your agent to run `search_logs` for 'HTTP 500' errors across all apps from the last four hours to pinpoint the failing service and time window.

### The capacity planning audit
You need proof that a specific database has hit its usage limits. You ask your agent to retrieve historical data using `query_metrics` for disk utilization over the last quarter, validating capacity needs.

### Auditing alert fatigue
The team is overwhelmed by alerts. You use `list_monitors` to see what is alerting, then check `list_slos` to verify whether services still meet their service level objectives despite the high alert volume, before escalating.
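The alert-fatigue audit can be sketched as a small aggregation over the two tool results. The sample data is made up, and the field names (`state`, `target`, `current`) are assumptions about what `list_monitors` and `list_slos` return.

```python
from collections import Counter
from typing import Any


def summarize_alert_load(monitors: list[dict[str, Any]],
                         slos: list[dict[str, Any]]) -> dict[str, Any]:
    """Count monitors per state and list SLOs currently below their target."""
    by_state = Counter(m.get("state", "Unknown") for m in monitors)
    breached = [s["name"] for s in slos if s["current"] < s["target"]]
    return {"by_state": dict(by_state), "breached_slos": breached}


# Made-up sample data; field names are assumptions about the tool output.
monitors = [
    {"name": "Free Disk Space Low", "state": "Alert"},
    {"name": "High Request Latency", "state": "Alert"},
    {"name": "Queue Depth", "state": "OK"},
]
slos = [
    {"name": "checkout availability", "target": 99.9, "current": 99.95},
    {"name": "api latency", "target": 99.0, "current": 98.2},
]

summary = summarize_alert_load(monitors, slos)
print(summary)
```

If many monitors are alerting but no SLO is breached, that is a hint the alerts may be too sensitive rather than the services being unhealthy.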

## Benefits

- Stop jumping between tabs. You can check service health, review planned maintenance (`list_downtimes`), and view active alerts (`list_monitors`) all in one conversation thread.
- Analyze performance trends over time by using `query_metrics` to pull historical data for specific services, giving you concrete evidence of degradation.
- Pinpoint the exact moment a failure occurred. Using `search_logs` with ISO-formatted time boundaries helps filter logs and identify error timelines quickly.
- Manage alerts without switching tools. You can use `list_slos` to check compliance or `mute_monitor` to silence an alert that is a false positive, all through chat.
- Understand your whole infrastructure at once. Use `list_hosts` to see every agent version running across cloud providers.
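The list above mentions filtering logs by ISO boundaries. Assuming this refers to ISO 8601 start and end timestamps (an interpretation, not something the tool description confirms), a time window could be computed like this; the helper name `iso_window` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def iso_window(end: datetime, hours: float) -> tuple[str, str]:
    """Return (start, end) as ISO 8601 strings in UTC."""
    end_utc = end.astimezone(timezone.utc)
    start_utc = end_utc - timedelta(hours=hours)
    return start_utc.isoformat(), end_utc.isoformat()


start, end = iso_window(datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc), hours=4)
print(start, end)
# prints: 2024-05-01T08:00:00+00:00 2024-05-01T12:00:00+00:00
```

Normalizing to UTC before formatting avoids ambiguity when the person asking and the logged services are in different time zones.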

## How It Works

You get immediate, actionable operational status updates without touching the web UI. The flow has three steps:

1. Connect the Datadog MCP to your AI client using your API keys and application credentials.
2. Tell your agent what you need, for example: 'Show me all monitors that are currently in an Alert state.'
3. The agent executes the required calls, pulls the structured data (like a list of active alerts), and gives you a plain-language answer.
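Step 3 can be sketched as follows. An MCP `tools/call` result carries a `content` list of typed parts plus an `isError` flag. The JSON payload inside the text part here is made up, since the output format of this MCP's tools is not documented above.

```python
import json
from typing import Any


def extract_text(response: dict[str, Any]) -> str:
    """Join the text parts of an MCP `tools/call` result; raise on tool errors."""
    result = response["result"]
    text = "\n".join(part["text"] for part in result.get("content", [])
                     if part.get("type") == "text")
    if result.get("isError"):
        raise RuntimeError(f"tool call failed: {text}")
    return text


# Hand-written sample response; the payload inside `text` is illustrative only.
raw = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [{
            "type": "text",
            "text": json.dumps([{"name": "Free Disk Space Low", "state": "Alert"}]),
        }],
        "isError": False,
    },
})

monitors = json.loads(extract_text(json.loads(raw)))
print(f"{len(monitors)} monitor(s) alerting: {monitors[0]['name']}")
# prints: 1 monitor(s) alerting: Free Disk Space Low
```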

## Frequently Asked Questions

**How does `query_metrics` help with performance analysis?**
`query_metrics` pulls specific time-series data points for any metric. You can set a start and end timestamp to analyze how performance behaved during a critical window.

**Can I find errors using `search_logs` with the Datadog MCP?**
Yes. `search_logs` lets you search massive log volumes using query syntax, helping you locate specific error patterns and status codes (like 500).

**How do I check multiple service alerts at once with `list_monitors`?**
`list_monitors` filters results by operational state. You can ask it specifically for all monitors in an 'Alert' state, giving you a quick overview of system health.

**Can I check if a service is down due to planned work with `list_downtimes`?**
`list_downtimes` lists scheduled maintenance periods. This confirms whether the current issue is an unexpected failure or a known outage window.
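The check described in the answer above can be sketched as a simple window test. The sample data is made up, and the field names (`scope`, `start`, `end` as epoch seconds, with a missing `end` meaning open-ended) are assumptions about what `list_downtimes` returns.

```python
from typing import Any, Optional


def active_downtimes(downtimes: list[dict[str, Any]], at: int) -> list[dict[str, Any]]:
    """Return downtimes whose [start, end) window contains the epoch second `at`."""
    def covers(d: dict[str, Any]) -> bool:
        end: Optional[int] = d.get("end")  # None is treated as open-ended
        return d["start"] <= at and (end is None or at < end)

    return [d for d in downtimes if covers(d)]


# Made-up sample data; field names are assumed.
downtimes = [
    {"scope": "env:prod", "start": 1_700_000_000, "end": 1_700_003_600},
    {"scope": "env:staging", "start": 1_700_010_000, "end": None},
]

print(active_downtimes(downtimes, at=1_700_001_800))
```

An empty result for the time of the incident suggests the outage was not covered by planned maintenance.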

**How do I use `list_dashboards` to see all available monitoring views?**
It returns a list of dashboard IDs, titles, and direct access URLs. This is useful because it lets you audit every reporting view in your account without manually clicking through them.

**What kind of data does `list_hosts` provide about my infrastructure?**
The tool provides host metadata, including agent versions and active tags. You can use this to quickly audit which systems are connected or whether a specific group of hosts needs an update across your cloud providers.

**I'm performing maintenance; how do I temporarily silence alerts using `mute_monitor`?**
It silences the monitor's alerts for a set period. This prevents false alarms from triggering during planned changes, keeping your team focused on actual issues.

**How does `list_slos` help me verify service compliance status?**
It shows Service Level Objective definitions, target percentages, and current compliance status. You can quickly confirm if a monitored application is actually meeting its required uptime promises.