# Datadog MCP for AI Agents

> Datadog provides full observability over your entire infrastructure, applications, and logs through natural conversation. Your AI client can query raw metrics, search structured error logs, track incidents, and audit service health without you ever having to open a dashboard.

## Overview
- **Category:** loved-by-devs
- **Price:** Free
- **Tags:** apm, infrastructure-monitoring, incident-response, metrics-querying, alerting, cloud-ops

## Description

Managing complex systems means jumping between dashboards, log viewers, and metric graphs—it's exhausting. This MCP connects your existing Datadog account directly to your AI agent, giving it the power to act as a dedicated Site Reliability Engineer (SRE). Instead of clicking through tabs, you just talk to your client. You can ask about resource bottlenecks across specific hosts or check if a recent deployment broke an endpoint. The tool lets you search logs using complex filters, audit service level objectives (SLOs), and even list all active alerts so you know exactly what needs attention. If you're already managing observability tools in the Vinkius catalog, adding this MCP means consolidating your entire operational knowledge base into one conversation with your AI agent.

## Tools

### create_monitor
Creates a new alert monitor based on specified criteria like metric thresholds, anomaly detection, or service checks.

### list_dashboards
Retrieves a list of all available dashboards so you can identify the right view for your investigation.

### get_dashboard
Fetches specific details about one dashboard using its unique ID.

### get_monitor
Gets detailed information for a single, specified monitor by its numeric ID.

### list_hosts
Lists all monitored servers and hosts with a summary of key metrics, optionally filtered by tag.

### list_incidents
Retrieves a record of current or resolved incidents, showing severity, responder assignments, and postmortem status.

### list_monitors
Lists every active monitor, allowing you to audit your overall alerting coverage across different types (metric, log, etc.).

### mute_monitor
Temporarily silences a specified alert monitor during planned maintenance or known issues.

### query_metrics
Runs detailed queries against time-series metric data, analyzing trends for specific metrics within defined time ranges.

### search_logs
Searches through large volumes of log events using advanced filters like service name or status code to find contextual errors.

### list_slos
Retrieves all defined Service Level Objectives, which track the target availability percentage for your services.

### list_synthetics_tests
Lists automated synthetic tests to verify that key endpoints and user journeys are actively monitored.

### list_teams
Shows the organizational structure by listing all teams responsible for specific monitors, SLOs, or dashboards.

### unmute_monitor
Reactivates a previously muted alert monitor.

### update_monitor
Modifies an existing monitor's details, such as changing the query string or updating the notification message.

### list_users
Provides a directory of user accounts and their access permissions within the platform for security audits.

## Prompt Examples

**Prompt:** 
```
Show me all monitors that are currently in an alert state.
```

**Response:** 
```
**Active Alerts Found:**

*   ⚠️ **Web01 CPU Spike** (Metric Alert) - Critical. Triggered 12 minutes ago. Check `query_metrics` for context.
*   ❌ **Payment Gateway Down** (Service Check) - Error. Last checked 5 minutes ago. Requires immediate action.
*   ⚙️ **Cache Miss Rate High** (Query Alert) - Warning. Triggered 4 hours ago. Review recent logs.
```

**Prompt:** 
```
I need to check the error rate for the user profile service in the last hour.
```

**Response:** 
```
**Error Log Summary: User Profile Service (Last Hour)**

We found 45 entries. The primary issue is `DatabaseConnectionFailed` originating from host `user-db-02`. 

**Top Error Codes:**
*   `DatabaseConnectionFailed`: 31 occurrences
*   `Timeout`: 14 occurrences

Would you like me to run a metric query on the connection pool utilization for that host?
```

**Prompt:** 
```
What are our Service Level Objectives and who owns them?
```

**Response:** 
```
**Service Level Objective Audit:**

| SLO Name | Target | Period | Current Status |
| :--- | :--- | :--- | :--- |
| Checkout Availability | 99.9% | 30d | **Yellow** (99.8%) |
| API Latency P95 | < 200ms | 7d | **Yellow** (210ms) |

*Ownership:* The `Payments Team` is responsible for the Checkout SLO, and the `Platform Core` team owns the API latency target.
```

## Capabilities

### Audit Infrastructure Inventory
See a list of every host monitored by Datadog, along with its current CPU, memory usage, and custom tags.

### Query Metric Trends Over Time
Analyze raw time-series data for any metric type—like system CPU or custom business metrics—using specific query syntax to understand performance trends.

### Search Detailed Log Events
Filter through structured and unstructured log entries using advanced queries, narrowing results by service, host, or status code.
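
As a rough illustration, the filters described above can be composed into a single Datadog log search string. The helper name `build_log_query` and its parameters are hypothetical, and `@http.status_code` assumes your logs use Datadog's standard HTTP attribute; the exact arguments `search_logs` accepts may differ.

```python
def build_log_query(service=None, host=None, status=None, status_code=None):
    """Compose a log search string from optional filters (hypothetical helper)."""
    parts = []
    if service:
        parts.append(f"service:{service}")
    if host:
        parts.append(f"host:{host}")
    if status:
        # Log level, e.g. "error" or "warn".
        parts.append(f"status:{status}")
    if status_code is not None:
        # Assumes the standard HTTP status code attribute is present on the logs.
        parts.append(f"@http.status_code:{status_code}")
    # Space-separated terms are combined with an implicit AND.
    return " ".join(parts)


print(build_log_query(service="user-profile", host="user-db-02", status="error"))
```

Keeping each filter optional lets the agent start broad (service only) and narrow down as the investigation progresses.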

### Manage Alerting Monitors
Create new alerts or modify existing ones (like changing a threshold or setting the notification message) to ensure your systems are properly monitored.
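
For orientation, here is a minimal sketch of what a metric monitor definition can look like. The field names (`name`, `type`, `query`, `message`, `options.thresholds`) follow Datadog's monitor format, but the helper `metric_monitor` is hypothetical, and whether `create_monitor` takes these fields in exactly this shape is an assumption.

```python
def metric_monitor(name, metric, scope, threshold, window="last_5m", message=""):
    """Build a metric alert definition (hypothetical helper, not the tool's schema)."""
    # Monitor queries prefix the metric query with an evaluation window
    # and end with a comparison against the threshold.
    query = f"avg({window}):avg:{metric}{{{scope}}} > {threshold}"
    return {
        "name": name,
        "type": "metric alert",
        "query": query,
        "message": message,
        # The critical threshold should match the value used in the query.
        "options": {"thresholds": {"critical": threshold}},
    }


monitor = metric_monitor(
    name="Web01 CPU high",
    metric="system.cpu.user",
    scope="host:web01",
    threshold=90,
    message="CPU above 90% on web01.",
)
print(monitor["query"])
```

Changing a threshold with `update_monitor` would then mean updating both the query string and the matching `critical` value.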

### Review Service Health Objectives
List and audit all defined SLOs, letting you check compliance rates for critical services over specific time periods.
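
To make "compliance" concrete, the arithmetic behind an availability SLO can be sketched as follows. The function names are illustrative and are not part of the MCP; the sketch only assumes a target percentage and a rolling window in days.

```python
def error_budget_minutes(target_pct, window_days):
    """Minutes of allowed unavailability for an availability target over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - target_pct) / 100


def is_meeting_target(current_pct, target_pct):
    """An SLO is compliant only when the measured value reaches the target."""
    return current_pct >= target_pct


# A 99.9% target over 30 days leaves roughly 43 minutes of error budget.
print(round(error_budget_minutes(99.9, 30), 1))
print(is_meeting_target(99.95, 99.9))
```

Thinking in error budget minutes rather than percentages often makes it easier to judge how much room a service has left in the current period.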

## Use Cases

### Finding a P99 Latency Spike
An alert fires on API latency. You ask, 'What was the P99 response time for the payment service last night?' The agent runs `query_metrics` and returns the time series, showing the minute the latency spiked and by how much.

### Investigating a Service Outage
A user reports an outage. You ask, 'Search for errors related to payment failure.' The agent uses `search_logs` and returns 20 matching entries, pointing immediately to the failing host and providing the stack trace.

### Auditing Alerting Coverage
Before a major release, you ask, 'List all service monitors.' The agent uses `list_monitors`, allowing you to quickly spot any critical services that lack an alert definition. You can then use `create_monitor` to fix it.
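
The coverage check in this use case can be sketched as a set difference. This assumes each monitor returned by `list_monitors` carries a `tags` list containing entries such as `service:checkout`; the response shape and the helper name are assumptions, not a documented contract.

```python
def services_without_monitors(services, monitors):
    """Return the services that no monitor is tagged with (hypothetical helper)."""
    covered = set()
    for monitor in monitors:
        for tag in monitor.get("tags", []):
            key, _, value = tag.partition(":")
            if key == "service" and value:
                covered.add(value)
    return sorted(set(services) - covered)


monitors = [
    {"name": "Checkout errors", "tags": ["service:checkout", "env:prod"]},
    {"name": "Host disk usage", "tags": ["team:platform"]},
]
print(services_without_monitors(["checkout", "inventory"], monitors))
```

This only works if monitors are tagged consistently, so untagged monitors show up as gaps even when they do cover a service.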

### Onboarding New Team Members
The Engineering Manager asks, 'Who owns the inventory monitoring?' The agent runs `list_teams`, showing team membership and ownership for both monitors and SLOs, streamlining knowledge transfer.

## Benefits

- Instantly triage alerts: Use the `list_monitors` tool to see all active monitors without opening the dashboard. You'll know exactly what needs attention right now.
- Deep log analysis on demand: Instead of manually clicking through Log Explorer filters, use `search_logs` to find specific error patterns across services in seconds.
- Analyze historical performance trends: The `query_metrics` tool lets you pull raw metric timeseries data for deep dives, so you can spot a degrading trend before it becomes a major incident.
- Audit service reliability easily: Check compliance by calling `list_slos` or review your automated coverage using `list_synthetics_tests`. No more guessing on SLA adherence.
- Control alert lifecycle: Use `mute_monitor` during planned maintenance windows, and then call `unmute_monitor` when the work is done. It keeps alerts from becoming noise.

## How It Works

This MCP turns dashboard navigation into direct conversations with your AI client. Setup takes three steps:

1. First, subscribe to this MCP on Vinkius and provide your Datadog API Key and Application Key.
2. Your AI client authenticates with the keys, establishing a secure connection to your live monitoring data.
3. You then ask conversational questions—like 'What was the average CPU usage for the staging environment last week?'—and receive real-time answers from the platform.
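
Behind step 3, an MCP client typically translates the question into a `tools/call` request. The sketch below shows that envelope for the staging CPU example. The `tools/call` method and the `name`/`arguments` fields come from the Model Context Protocol; the `query` argument name and the `env:staging` tag are assumptions, and your AI client builds this request for you.

```python
import json
import time


def tool_call_request(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request that invokes an MCP tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


now = int(time.time())
request = tool_call_request(
    1,
    "query_metrics",
    {
        "query": "avg:system.cpu.user{env:staging}",  # argument name is assumed
        "from": now - 7 * 24 * 3600,  # one week ago, Unix seconds
        "to": now,
    },
)
print(json.dumps(request, indent=2))
```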

## Frequently Asked Questions

**What's the difference between Datadog API Key and Application Key?**
The **API Key** identifies your Datadog organization and is required on every request. The **Application Key** is tied to the user or service account that created it; it inherits that account's permissions (or any narrower scopes you assign), which determines what your integration can read and change. You create each one in Organization Settings, under API Keys and Application Keys respectively. Most Datadog API endpoints require both keys.
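
For context, this is how the two keys are sent on a direct Datadog API request, using the `DD-API-KEY` and `DD-APPLICATION-KEY` headers. The environment variable names `DD_API_KEY` and `DD_APP_KEY` are a common convention and an assumption here; when you use this MCP, you enter the keys during setup rather than building headers yourself.

```python
import os


def datadog_headers(env=None):
    """Build authentication headers from a mapping of credentials."""
    env = os.environ if env is None else env
    api_key = env.get("DD_API_KEY")
    app_key = env.get("DD_APP_KEY")
    missing = [
        name
        for name, value in (("DD_API_KEY", api_key), ("DD_APP_KEY", app_key))
        if not value
    ]
    if missing:
        # Fail early instead of sending a request that will be rejected.
        raise KeyError(f"missing credentials: {', '.join(missing)}")
    return {"DD-API-KEY": api_key, "DD-APPLICATION-KEY": app_key}


headers = datadog_headers({"DD_API_KEY": "example-api", "DD_APP_KEY": "example-app"})
print(sorted(headers))
```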

**Can I mute a monitor during a maintenance window?**
Yes! Use the `mute_monitor` action with the monitor ID. You can optionally set an `end` timestamp (ISO 8601) for the mute to automatically expire, or specify a `scope` to mute only certain sub-alerts (e.g. 'env:staging'). Use `unmute_monitor` to re-enable notifications.
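
A minimal sketch of the arguments for a time-boxed mute, assuming the tool accepts the `end` and `scope` fields described above. The `monitor_id` argument name and the helper `mute_arguments` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def mute_arguments(monitor_id, duration_minutes=None, scope=None, now=None):
    """Build arguments for a mute call (argument names partly assumed)."""
    args = {"monitor_id": monitor_id}
    if duration_minutes is not None:
        start = now or datetime.now(timezone.utc)
        end = start + timedelta(minutes=duration_minutes)
        # ISO 8601 with an explicit UTC offset, so the expiry is unambiguous.
        args["end"] = end.isoformat(timespec="seconds")
    if scope:
        args["scope"] = scope
    return args


window_start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(mute_arguments(12345, duration_minutes=90, scope="env:staging", now=window_start))
```

Setting `end` up front means a forgotten `unmute_monitor` call does not leave the monitor silenced indefinitely.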

**What query syntax does the metrics endpoint use?**
Datadog metric queries follow the format `[function]:[metric]{[tags]}`. For example, `avg:system.cpu.user{host:web01}` returns the average user-space CPU usage for host `web01`. Common functions include `avg`, `sum`, `max`, `min`, and `count`. The time window for a metric query is passed to the tool as `from`/`to` Unix timestamps. The `avg(last_5m):...` prefix is monitor query syntax: it sets the evaluation window when you define a monitor's query, not when you query metrics.
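
The query format and the `from`/`to` window can be sketched with two small helpers. Both function names are illustrative, and using `{*}` for "all sources" when no tags are given is standard Datadog scope syntax.

```python
import time


def metric_query(function, metric, tags=None):
    """Render a query in the [function]:[metric]{[tags]} format."""
    scope = ",".join(tags) if tags else "*"
    return f"{function}:{metric}{{{scope}}}"


def time_window(seconds_back, now=None):
    """Return from/to Unix timestamps (seconds) ending at `now`."""
    to_ts = int(time.time() if now is None else now)
    return {"from": to_ts - seconds_back, "to": to_ts}


print(metric_query("avg", "system.cpu.user", ["host:web01"]))
print(time_window(3600))
```

Multiple tags are comma-separated inside the braces, which narrows the query to sources that match all of them.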