# Incident Postmortem Prover MCP

> Incident Postmortem Prover brings SRE-grade rigor to your incident reports. It reviews draft postmortems and flags common failures: incomplete timelines, surface-level root causes, root causes conflated with contributing factors, and action items with no clear owner or deadline. This MCP checks that every write-up is a complete investigation, not just a narrative summary.

## Overview
- **Category:** productivity
- **Price:** Free
- **Tags:** incident-postmortem, root-cause-analysis, sre, five-whys, blameless-postmortem, action-items, pattern-detection, timeline-reconstruction

## Description

Writing up an incident review often feels worse than the incident itself, and the result is usually a vague story: 'The site went down for about two hours.' A report like that accomplishes nothing. This MCP makes sure your analysis stands up to scrutiny. It pushes you to reconstruct the outage minute by minute, requiring timestamps for every alert, action taken, and decision made during the chaos. Crucially, it pushes past the obvious symptom, such as a service crashing, to find the systemic or process failure that allowed it. The tool enforces accountability by requiring an owner, a specific deadline, and a measurable outcome for every action item. If you need to prove your incident was more than just bad luck, this MCP is how you do it. You connect through Vinkius, which makes the rigorous postmortem process an automated part of your workflow.

## Tools

### validate_incident_postmortem
Checks your draft postmortem against five failure modes: incomplete timelines, shallow root causes, root causes mixed with contributing factors, unaccountable action items, and ignored historical patterns.
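
For orientation, here is a rough sketch of what a call to this tool could look like on the wire. MCP clients invoke tools with a JSON-RPC `tools/call` request. The argument name `postmortem` is an assumption for illustration, as is the helper `build_tool_call`; the tool's published input schema is the authority on real parameter names.

```python
import json


def build_tool_call(postmortem_text: str, request_id: int = 1) -> str:
    """Serialize an MCP `tools/call` request for validate_incident_postmortem."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "validate_incident_postmortem",
            # Hypothetical argument name; check the tool's input schema.
            "arguments": {"postmortem": postmortem_text},
        },
    }
    return json.dumps(request)


payload = build_tool_call("Our API went down for 2 hours last Tuesday.")
```

In practice your AI client builds this request for you; the sketch only shows the shape of the exchange.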

## Prompt Examples

**Prompt:** 
```
Our API went down for 2 hours last Tuesday. The database connection pool was exhausted. We added monitoring.
```

**Response:** 
```
Verdict: ACTIONS_UNACCOUNTABLE + ROOT_CAUSE_SHALLOW. 'Added monitoring' has no owner, no deadline, no success metric. 'Connection pool exhausted' is the symptom. WHY was it exhausted? Leaked connections? Missing pool size tuning? N+1 queries from a new deployment?
```

**Prompt:** 
```
The deploy at 14:00 caused a 500 error spike. John rolled back at 14:45. Problem solved.
```

**Response:** 
```
Verdict: TIMELINE_INCOMPLETE. 45 minutes between deploy and rollback — what happened at 14:01? 14:15? When was the alert fired? Who was paged? What was the blast radius? A timeline needs minute-by-minute entries, not just start and end.
```

**Prompt:** 
```
Root cause: the engineer forgot to update the config file before deploying.
```

**Response:** 
```
Verdict: ROOT_CAUSE_SHALLOW. 'Forgot to update config' is blame, not root cause. Systemic question: WHY is config managed manually? WHY does the deploy pipeline not validate config? WHY is there no pre-deploy diff check? The root cause is the absence of automated config validation.
```

## Capabilities

### Reconstruct minute-by-minute timelines
Analyzes raw text to ensure every time gap during an incident is accounted for by a specific alert, action, or decision.
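
One way to picture the gap check is the sketch below. It assumes same-day `HH:MM` timestamps and a hypothetical `max_gap_minutes` threshold; how the tool itself parses timelines and what threshold it applies are not documented here.

```python
from datetime import datetime, timedelta


def find_gaps(times: list[str], max_gap_minutes: int = 5) -> list[tuple[str, str]]:
    """Return (start, end) pairs where consecutive entries are too far apart."""
    parsed = sorted(datetime.strptime(t, "%H:%M") for t in times)
    limit = timedelta(minutes=max_gap_minutes)
    return [
        (a.strftime("%H:%M"), b.strftime("%H:%M"))
        for a, b in zip(parsed, parsed[1:])
        if b - a > limit
    ]


# The deploy/rollback example from this page: nothing recorded for 45 minutes.
gaps = find_gaps(["14:00", "14:45"])
```

Each returned pair is a stretch of the incident that still needs an alert, action, or decision recorded against it.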

### Identify systemic root causes (5 Whys)
Drives the analysis past symptoms ('The server crashed') until it isolates a true organizational or process failure that enabled the outage.

### Separate cause from contributing factors
Forces the distinction between the single enabling root cause and secondary conditions that only amplified the impact.

### Draft accountable action items (ODAS)
Generates remediation tasks requiring a specific Owner, Deadline, Action, and Success Metric to prevent them from being ignored.

### Cross-reference historical patterns
Compares the current incident against past records to flag recurring service issues or shared failure modes.
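
As a rough illustration of pattern matching against history, the sketch below flags past incidents that share a service or a cause type with the current one. The `Incident` fields, the `find_repeats` helper, and the sample records are all made up for the example; the tool's actual history format is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    incident_id: str
    service: str
    cause_type: str


def find_repeats(current: Incident, history: list[Incident]) -> list[Incident]:
    """Return past incidents that share a service or cause type with `current`."""
    return [
        past
        for past in history
        if past.service == current.service or past.cause_type == current.cause_type
    ]


# Illustrative records only.
history = [
    Incident("INC-1", "orders-api", "config-drift"),
    Incident("INC-2", "billing", "connection-pool-exhaustion"),
    Incident("INC-3", "search", "dns-failure"),
]
current = Incident("INC-4", "orders-api", "connection-pool-exhaustion")
repeats = find_repeats(current, history)
```

Any non-empty result means the write-up has to explain why this incident is not simply a recurrence.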

## Use Cases

### The team needs to pass a compliance audit.
A major outage happened last quarter, and the CTO wants proof that the root cause was addressed. Your agent runs the incident postmortem through `validate_incident_postmortem`, which flags action items that lack an owner, a deadline, or a success metric. Once those gaps are filled, you have an auditable report that demonstrates accountability.

### The bug keeps coming back in the same service.
You've seen this issue three times this year. Instead of treating it as another isolated postmortem, you run `validate_incident_postmortem` to cross-reference history. The tool flags that the root cause type and deployment window match previous incidents, pointing to a systemic pattern rather than a one-off.

### The initial report is too vague.
An engineer writes: 'Connection pool was exhausted.' That stops at the symptom. You run `validate_incident_postmortem`. The tool rejects it immediately, demanding the next Why: 'Why was the connection pool exhausted?' That forces you to find a leak or bad query instead.

### The incident timeline is fuzzy.
The on-call notes say: 'Around 3 PM things got messy.' You run `validate_incident_postmortem` and it rejects the summary, demanding minute-by-minute timestamps for every alert fired, person paged, and decision made.

## Benefits

- Stops vague timelines dead. Instead of 'The site was down for an hour,' it forces you to account for every minute, pinpointing the exact process gap that caused a delay.
- Goes deeper than symptoms. It doesn't just say 'the server crashed.' It demands five Whys until it finds the organizational flaw—like flawed incentives or poor testing culture—that truly enabled the failure.
- Ensures every action item is solid. You can't write 'add monitoring.' The tool requires an Owner, a specific Deadline, an Action plan, and a measurable Success Metric (ODAS).
- Prevents finger-pointing. By separating the single root cause from contributing factors, you focus fixes on what actually *enabled* the failure, not on what merely made it worse.
- Builds institutional memory. It automatically cross-references current outages against your incident history, so any claim of 'first time' has to be backed by hard evidence.

## How It Works

The tool turns your messy narrative into an actionable, auditable investigation that points to process fixes, not finger-pointing. It works in three steps:

1. Feed the MCP your raw draft postmortem, including timeline notes and proposed fixes.
2. The tool analyzes the text, running it through 5-Whys chains to find systemic gaps and checking timelines for missing minutes.
3. You get back a verdict detailing exactly which failures occurred (e.g., 'ACTIONS_UNACCOUNTABLE') and specific questions you must answer before the report is considered complete.
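
If you want to act on the verdict from step 3 in a script, something like the following sketch could work. It assumes responses begin with `Verdict:` followed by one or more codes joined by `+`, as in the prompt examples on this page. `parse_verdict` is a hypothetical helper, not part of the MCP.

```python
import re

# Matches "Verdict: CODE" or "Verdict: CODE_A + CODE_B" at the start of a response.
VERDICT_RE = re.compile(r"Verdict:\s*([A-Z_]+(?:\s*\+\s*[A-Z_]+)*)")


def parse_verdict(response: str) -> list[str]:
    """Extract the verdict codes from the start of a tool response."""
    match = VERDICT_RE.match(response)
    if match is None:
        return []
    return [code.strip() for code in match.group(1).split("+")]


codes = parse_verdict(
    "Verdict: ACTIONS_UNACCOUNTABLE + ROOT_CAUSE_SHALLOW. "
    "'Added monitoring' has no owner, no deadline, no success metric."
)
```

A simple gate is then possible: treat the postmortem as incomplete while the returned list still contains any failure code.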

## Frequently Asked Questions

**How does the Incident Postmortem Prover work with existing incident history?**
The tool cross-references your current analysis against historical records. It flags when a service or cause type repeats, forcing you to prove whether this is truly a 'first time' event.

**Can I use validate_incident_postmortem if the incident was very complex?**
Yes. The tool handles complexity by breaking it down into discrete checks: timeline gaps, root cause isolation, and ODAS verification. It treats a large, complex incident as a series of required proofs.

**Does this MCP rewrite my full postmortem for me?**
No, it doesn't write the report. It acts as an auditor; you provide the draft, and it gives you a verdict listing exactly what needs to be fixed—like adding owners or refining timelines.

**What is the difference between root cause and contributing factor in this MCP?**
The tool forces separation. The root cause is the single, enabling failure (e.g., no integration test). Contributing factors are secondary conditions that made it worse (e.g., deploying on Friday).
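
The separation can be pictured as a small data structure that insists on exactly one root cause. This is only a sketch under that reading; `CausalAnalysis` and its `problems` method are hypothetical names, not the tool's schema.

```python
from dataclasses import dataclass, field


@dataclass
class CausalAnalysis:
    root_causes: list[str]
    contributing_factors: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """List structural issues with how causes and factors are separated."""
        issues = []
        if len(self.root_causes) != 1:
            issues.append(
                f"expected exactly 1 root cause, found {len(self.root_causes)}"
            )
        overlap = set(self.root_causes) & set(self.contributing_factors)
        if overlap:
            issues.append("listed as both cause and factor: " + ", ".join(sorted(overlap)))
        return issues


# The example from the answer above.
analysis = CausalAnalysis(
    root_causes=["no integration test"],
    contributing_factors=["deploying on Friday"],
)
```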

**How do I connect my AI client to use the validate_incident_postmortem tool?**
You simply enable access through your preferred AI client within Vinkius. Once connected, you'll get instant access to this MCP and all other available tools in the catalog.

**What kind of source material should I feed into validate_incident_postmortem?**
The tool handles both detailed narrative summaries and raw incident logs. The deeper and more granular your input is, the more precise its timeline reconstruction and root cause analysis will be.

**If my postmortem is vague or incomplete, how does validate_incident_postmortem guide me?**
It forces rigor by pointing out gaps. If you skip minutes in your timeline or stop at symptoms instead of systemic causes, the tool will flag it and demand specific evidence.

**Are there restrictions on how many times I can run validate_incident_postmortem?**
Vinkius handles resource management across all MCPs. While we recommend using this for deep analysis, you won't encounter typical API rate limits when working within the platform.

**What makes a timeline 'complete'?**
Minute-by-minute entries from first alert to full resolution. Each entry: [HH:MM] [Actor] [Action] [Outcome]. 'Around 3 PM' is rejected. '15:03 — PagerDuty alert fired for p95 > 2s on /api/orders' is accepted.
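
A minimal sketch of the timestamp part of this rule is shown below, assuming each entry must start with a 24-hour `HH:MM` time. The helper name `has_precise_timestamp` is made up, and the tool's real acceptance logic may check more than this (actor, action, outcome).

```python
import re

# 24-hour HH:MM at the very start of the entry.
TIMESTAMP_RE = re.compile(r"^(?:[01]\d|2[0-3]):[0-5]\d\b")


def has_precise_timestamp(entry: str) -> bool:
    """True when the entry begins with an exact HH:MM time."""
    return TIMESTAMP_RE.match(entry.strip()) is not None


accepted = has_precise_timestamp("15:03 PagerDuty alert fired for p95 > 2s on /api/orders")
rejected = has_precise_timestamp("Around 3 PM things got messy")
```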

**How deep should the 5 Whys analysis go?**
Until you reach a SYSTEMIC root — not a human error. 'Bob forgot to restart' is blame. 'The deployment pipeline has no post-deploy health check' is a system fix. The 5th Why should expose a process, architecture, or policy gap.
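
To make the stopping rule concrete, here is a sketch in which each Why is tagged by hand with a kind, and the chain counts as finished only when the last answer is a process, architecture, or policy gap. The `Why` class, the manual tagging, and `reaches_systemic_root` are assumptions for illustration; the tool's own classification is not described here.

```python
from dataclasses import dataclass

SYSTEMIC_KINDS = {"process", "architecture", "policy"}


@dataclass
class Why:
    answer: str
    kind: str  # "human", "process", "architecture", or "policy" (tagged by the author)


def reaches_systemic_root(chain: list[Why]) -> bool:
    """True when the final Why in the chain names a systemic gap."""
    return bool(chain) and chain[-1].kind in SYSTEMIC_KINDS


blame_only = [Why("Bob forgot to restart", "human")]
systemic = blame_only + [
    Why("The deployment pipeline has no post-deploy health check", "process"),
]
```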

**What makes an action item 'accountable'?**
Beyond the action itself (the A in ODAS), three requirements: (1) Named owner, not 'the team'. (2) Deadline, not 'soon'. (3) Success metric, not 'improved'. Example: 'Owner: @maria, Deadline: 2024-02-15, Metric: p95 latency < 500ms for 7 consecutive days.'
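
These requirements translate naturally into a pre-check you could run before submitting a draft. The sketch below is an approximation: the `ActionItem` fields, the list of vague owners, and the "metric must contain a number" heuristic are assumptions chosen for illustration, not the tool's actual rules.

```python
from dataclasses import dataclass
from datetime import date

# Assumed examples of non-owners; extend to taste.
VAGUE_OWNERS = {"", "the team", "team", "everyone", "tbd"}


@dataclass
class ActionItem:
    owner: str
    deadline: str  # expected as YYYY-MM-DD
    action: str
    metric: str


def problems(item: ActionItem) -> list[str]:
    """Return the accountability gaps found in one action item."""
    issues = []
    if item.owner.strip().lower() in VAGUE_OWNERS:
        issues.append("owner must be a named person")
    try:
        date.fromisoformat(item.deadline)
    except ValueError:
        issues.append("deadline must be a calendar date (YYYY-MM-DD)")
    if not any(ch.isdigit() for ch in item.metric):
        issues.append("success metric must include a measurable threshold")
    return issues


good = ActionItem(
    owner="@maria",
    deadline="2024-02-15",
    action="Add a post-deploy health check",  # illustrative action text
    metric="p95 latency < 500ms for 7 consecutive days",
)
vague = ActionItem(owner="the team", deadline="soon", action="Add monitoring", metric="improved")
```

Here `problems(good)` comes back empty, while `problems(vague)` reports all three gaps.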