# Datadog MCP

> Datadog MCP connects your AI agent directly to infrastructure monitoring and log management data. Query performance metrics, search application logs for specific errors, and manage alert monitors without leaving your chat window or IDE. Monitor everything from service level objectives to host health using natural language commands.

## Overview
- **Category:** industry-titans
- **Price:** Free
- **Tags:** infrastructure-monitoring, log-analysis, performance-metrics, cloud-observability, alerting, real-time-monitoring

## Description

Connecting Datadog through this MCP lets you monitor complex cloud infrastructure through a simple conversation with your AI agent. Instead of jumping between dashboards, sifting through raw logs, and cross-referencing alert status pages, you just ask. You can query time-series metrics to track performance trends over a specific time period, or search application logs for structured entries and traces that match known errors. Need to know if a service is healthy? Check monitors by state, or list your configured Service Level Objectives (SLOs) to see compliance status at a glance. Everything from checking host metadata to identifying planned downtime is available through natural language, which makes troubleshooting faster and less painful. Since this MCP lives on Vinkius, you connect once from any AI-compatible client and gain immediate access to observability tools like this one.

## Tools

### list_dashboards
Lists all available monitoring dashboards and provides their titles, layout types, and direct URLs.

### query_metrics
Retrieves time-series data points for infrastructure or application metrics within a specific Unix timestamp range.
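
Because `query_metrics` works on a Unix timestamp range, a relative request such as "the last 30 minutes" has to be converted into two epoch values first. The sketch below shows that conversion. The argument names `query`, `from`, and `to` are assumptions modeled on Datadog's metrics query conventions, not a confirmed schema for this tool, and the metric query string is only an illustration.

```python
import time


def last_minutes_window(minutes, now=None):
    """Return (start, end) as integer Unix timestamps in seconds."""
    end = int(now if now is not None else time.time())
    start = end - minutes * 60
    return start, end


def build_query_metrics_args(query, minutes, now=None):
    """Build an argument dict for a query_metrics call.

    The keys "query", "from", and "to" are assumed names; check the
    tool's actual input schema before relying on them.
    """
    start, end = last_minutes_window(minutes, now)
    return {"query": query, "from": start, "to": end}


# Illustrative metric query in Datadog's query syntax.
args = build_query_metrics_args("avg:system.cpu.user{host:web-server}", 30)
print(args)
```

Passing `now` explicitly keeps the window reproducible, which helps when comparing results across repeated queries.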

### list_downtimes
Identifies planned maintenance windows by listing scheduled downtime periods and their current status.

### list_slos
Retrieves all Service Level Objective definitions, showing target percentages and compliance status for monitored services.

### search_logs
Searches the log storage to find entries matching a query syntax, including timestamps and structured attributes.
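
Datadog log queries combine quoted free text with `key:value` filters. The agent normally writes this string for you, but seeing one assembled can make the results easier to interpret. This is a minimal sketch: the helper `build_log_query` is hypothetical, and the service name is taken from the prompt examples in this document.

```python
def build_log_query(text=None, **filters):
    """Assemble a Datadog-style log search string.

    Free text is wrapped in double quotes so it matches as a phrase;
    each keyword argument becomes a key:value filter.
    """
    parts = []
    if text:
        escaped = text.replace('"', '\\"')
        parts.append(f'"{escaped}"')
    for key, value in filters.items():
        parts.append(f"{key}:{value}")
    return " ".join(parts)


query = build_log_query(
    "500 Internal Server Error",
    service="auth-service",
    status="error",
)
print(query)
# prints: "500 Internal Server Error" service:auth-service status:error
```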

### list_monitors
Filters and returns metadata for all configured monitors, allowing you to check their type, current status (alert, ok), or query definition.
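
If you post-process the monitor list yourself, grouping by state is usually the first step. The sketch below assumes each monitor comes back as a dictionary with `name` and `overall_state` keys, as in Datadog's monitor API; whether this tool returns the same field names is an assumption. The sample data is invented for illustration.

```python
from collections import defaultdict


def group_by_state(monitors):
    """Group monitor names by their state.

    Assumes each monitor dict has "name" and "overall_state" keys;
    monitors without a state are filed under "Unknown".
    """
    groups = defaultdict(list)
    for monitor in monitors:
        state = monitor.get("overall_state", "Unknown")
        groups[state].append(monitor["name"])
    return dict(groups)


# Invented sample data shaped like a list_monitors result.
sample = [
    {"name": "Free Disk Space Low", "overall_state": "Alert"},
    {"name": "High Request Latency", "overall_state": "Alert"},
    {"name": "CPU Load", "overall_state": "OK"},
]

print(group_by_state(sample).get("Alert", []))
```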

### get_monitor
Fetches detailed information about a specific monitor, including its thresholds, notification settings, and historical status changes.

### mute_monitor
Silences an active alert monitor for a set period of time to prevent unnecessary notifications.

### list_events
Provides a collection of system events, such as alerts or deployment actions, including their titles and priority levels.

### get_dashboard
Retrieves the full configuration details for a specific dashboard, including widget definitions and template variables.

### list_hosts
Lists all connected infrastructure hosts, showing their agent version, associated tags, and cloud provider attributes.

## Prompt Examples

**Prompt:** 
```
Show me the CPU usage for 'web-server' over the last 30 minutes
```

**Response:** 
```
Querying Datadog metrics... Average CPU usage for 'web-server' is 45%. Usage peaked at 78% about 10 minutes ago. Performance appears stable.
```

**Prompt:** 
```
Find logs with '500 Internal Server Error' from the last hour
```

**Response:** 
```
I found 12 logs with '500 Internal Server Error'. Most occurrences are in the 'auth-service'. I can provide the detailed stack traces for these errors if you need.
```

**Prompt:** 
```
Are there any active monitors in 'Alert' state?
```

**Response:** 
```
Checking Datadog monitors... I found 2 monitors in 'Alert' state: [Free Disk Space Low] on 'db-node-1' and [High Request Latency] on 'api-gateway'.
```

## Capabilities

### Querying Performance Metrics
Get time-series data for specific infrastructure or application metrics within a defined time range, expressed as Unix timestamps.

### Searching Application Logs
Pull structured log entries to find traces and status codes related to errors or bottlenecks across services.

### Managing Alerts and Monitors
List monitors, inspect their configuration and current alert status, and temporarily mute active alerts.

### Inspecting Service Level Objectives (SLOs)
Retrieve Service Level Objective definitions, including target percentages and current compliance status for monitored services.
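
A target percentage and a current compliance figure are easier to act on when expressed as remaining error budget. The sketch below shows the standard calculation. It assumes you have already extracted the two percentages from the SLO data; the function name and inputs are illustrative and not part of the tool's output.

```python
def error_budget_remaining(target_pct, sli_pct):
    """Return the unspent fraction of the error budget.

    1.0 means the budget is untouched, 0.0 means it is exactly used up,
    and a negative value means the SLO is being violated.
    """
    if not 0 < target_pct < 100:
        raise ValueError("target must be between 0 and 100 (exclusive)")
    allowed = 100.0 - target_pct   # failure percentage the SLO permits
    spent = 100.0 - sli_pct        # failure percentage actually observed
    return 1.0 - spent / allowed


# Example: a 99.9% target with 99.95% measured compliance.
remaining = error_budget_remaining(99.9, 99.95)
print(f"{remaining:.0%} of the error budget remains")
```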

### Reviewing Infrastructure Assets
List all connected hosts, view dashboard layouts, or identify scheduled maintenance periods to plan around.

## Use Cases

### Investigating a sudden spike in latency
A developer asks the agent: 'Show me performance metrics for API latency over the last hour.' The agent uses `query_metrics` to fetch the time-series data, runs `search_logs` around the peak time and finds 503 errors, then calls `list_monitors` to check whether an alert was triggered.
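
The step from metrics to logs in this scenario depends on turning the latency peak into a search window. The sketch below shows one way to do that, assuming the metric data is available as `(unix_timestamp, value)` pairs; that shape, the helper name, and the five-minute padding are all illustrative choices, and the data points are invented.

```python
def peak_window(points, padding_seconds=300):
    """Return a (start, end) Unix timestamp window centered on the peak value.

    `points` is a list of (unix_timestamp, value) pairs.
    """
    if not points:
        raise ValueError("no data points")
    peak_ts, _ = max(points, key=lambda point: point[1])
    return peak_ts - padding_seconds, peak_ts + padding_seconds


# Invented latency samples in milliseconds.
latency = [(1000, 120.0), (1060, 950.0), (1120, 140.0)]
start, end = peak_window(latency)
print(start, end)
```

The resulting window can then be used as the time range for the follow-up log search.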

### Auditing a flaky service
The SRE needs assurance the system is stable. They ask the agent to list SLOs (`list_slos`). If compliance looks good, they check the dashboard details using `get_dashboard` and then use `query_metrics` on key resource usage to verify stability.

### Handling planned downtime
A team member needs to schedule maintenance. They ask the agent to list scheduled downtimes (`list_downtimes`). This confirms if the window is clear, and they can use `list_hosts` afterward to ensure all target infrastructure nodes are accounted for.

### Onboarding a new team member
A junior engineer needs to understand the system boundaries. They ask to list all dashboards (`list_dashboards`) and check which hosts are connected (`list_hosts`), giving them a clear map of the operational scope.

## Benefits

- Stop context switching. Instead of jumping between the dashboard, log viewer, and alert list, your agent handles all three steps in one chat interaction.
- Get immediate visibility into service health by querying Service Level Objectives (SLOs), which shows exactly how close or far a metric is from its compliance target.
- Save time during incidents. Use `list_monitors` to quickly find every active alert and then use `get_monitor` to check if it needs muting before calling the team.
- Pinpoint failures fast. You can `search_logs` for specific error codes across massive log volumes, instantly narrowing down bottlenecks without writing complex regex filters.
- Understand your infrastructure deeply. Use `list_hosts` or `get_dashboard` to get metadata on every asset connected, including agent versions and cloud provider details.

## How It Works

You get operational visibility into your cloud infrastructure without switching applications or writing queries by hand.

1. Connect the Datadog MCP to your AI agent and authorize it using your API keys.
2. Ask your agent a question like 'Show me the CPU usage for the web-server over the last week' or 'Find all 500 errors from yesterday.'
3. The MCP executes the necessary query against the monitoring backend and returns structured data directly to your chat interface.
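
Under the hood, step 3 is a Model Context Protocol tool call: the client sends a JSON-RPC 2.0 request with the method `tools/call`, the tool name, and its arguments. The sketch below builds such a request. The envelope follows the MCP specification, but the argument name `query` passed to `search_logs` is an assumption, since this document does not list the tool's input schema.

```python
import json


def tools_call_request(request_id, tool_name, arguments):
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "query" is an assumed argument name for search_logs.
request = tools_call_request(
    1,
    "search_logs",
    {"query": '"500 Internal Server Error"'},
)
print(json.dumps(request, indent=2))
```

In practice your AI client builds and sends this message for you; it is shown here only to make the flow concrete.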

## Frequently Asked Questions

**How do I find specific errors using Datadog MCP?**
You use the `search_logs` tool. You just tell your agent what you're looking for, like '500 Internal Server Error from yesterday,' and it pulls structured data directly.

**Can I check if my service meets its goals with Datadog MCP?**
Yes. You run `list_slos` to see all defined Service Level Objectives, which instantly tells you the target percentage and your current compliance status for any monitored metric.

**What is the purpose of the `query_metrics` tool?**
`query_metrics` retrieves time-series data. This lets you visualize performance trends, like CPU usage or request count, over a specific period to spot gradual degradation.

**Does Datadog MCP help with scheduled maintenance?**
Yes, the `list_downtimes` tool checks for planned maintenance periods. This prevents you from wasting time troubleshooting an outage that was simply expected downtime.

**How do I see all available monitors quickly?**
Use the `list_monitors` tool to list your configured monitors and check each one's type, query definition, and current alert status.