# QA Arbiter MCP

> QA Arbiter resolves test failures by forcing deterministic root cause analysis in one call. Stop guessing why a test failed. This server uses the `diagnose_test_failure` tool to force your agent to trace engine execution step-by-step, compare inputs, and assign a precise verdict: TEST_ERROR, ENGINE_DEFECT, or BOTH_WRONG.

## Overview
- **Category:** productivity
- **Price:** Free
- **Tags:** qa-automation, test-diagnostics, fault-localization, multi-agent, structured-reasoning, decision-pivots, agentic-pipeline, test-verification

## Description

When an AI agent runs tests, failing results create ambiguity. Is the code broken? Or did the test author encode a wrong expectation?

QA Arbiter eliminates this guesswork. It forces your agent to use the `diagnose_test_failure` tool before it proposes a fix. This isn't just logging; it's structured fault diagnosis.

### How It Works

The process is rigid: for every failing test, the agent must call `diagnose_test_failure`. This forces five steps:

1.  **Trace:** The agent traces the engine function with the exact inputs, showing *every* intermediate calculation and value produced.
2.  **Compare Received:** It compares the actual Vitest `Received` value against its own trace. If they match, the engine is working as designed.
3.  **Compare Expected:** It compares the test's static `Expected` value against its own trace. This checks if the original assertion was flawed.
4.  **Commit Pivots:** The agent commits to two boolean flags: `receivedMatchesTrace` and `expectedMatchesTrace`.
5.  **Verdict:** The tool calculates a deterministic verdict from those pivots (e.g., Received=Trace AND Expected≠Trace means TEST_ERROR).

The best part? The tool validates the logic. If your agent tries to declare an `ENGINE_DEFECT` but marked `receivedMatchesTrace: true`, the tool rejects the diagnosis immediately, forcing re-analysis.
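
The pivot-to-verdict mapping and the consistency check described above can be sketched roughly as follows. This is a minimal TypeScript illustration, not the server's actual implementation: the names `deriveVerdict` and `checkDiagnosis` are hypothetical, the `BOTH_WRONG` row is inferred from the verdict's name, and `FALSE_ALARM` is assumed to be the verdict when both values match the trace, as the FAQ further down suggests.

```typescript
// Hypothetical sketch: names and shapes are assumptions, not the server's real API.
type Verdict = "TEST_ERROR" | "ENGINE_DEFECT" | "BOTH_WRONG" | "FALSE_ALARM";

interface Pivots {
  receivedMatchesTrace: boolean; // does the Received value equal the traced value?
  expectedMatchesTrace: boolean; // does the test's Expected value equal the traced value?
}

// Derive the verdict from the two pivots alone.
function deriveVerdict(p: Pivots): Verdict {
  if (p.receivedMatchesTrace && p.expectedMatchesTrace) return "FALSE_ALARM";
  if (p.receivedMatchesTrace) return "TEST_ERROR"; // engine agrees with the trace, assertion does not
  if (p.expectedMatchesTrace) return "ENGINE_DEFECT"; // assertion agrees with the trace, engine does not
  return "BOTH_WRONG"; // assumed: neither value agrees with the trace
}

// Reject a claimed verdict that contradicts the pivots.
function checkDiagnosis(p: Pivots, claimed: Verdict): { ok: boolean; reason?: string } {
  const derived = deriveVerdict(p);
  return derived === claimed
    ? { ok: true }
    : { ok: false, reason: `pivots imply ${derived}, not ${claimed}` };
}
```

Under these assumptions, claiming `ENGINE_DEFECT` with `receivedMatchesTrace: true` can never pass `checkDiagnosis`, which is the rejection behaviour described above.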

## Tools

### diagnose_test_failure
Forces structured diagnostic for failing tests by requiring step-by-step tracing and comparing three values (Received, Expected, Trace) to assign a deterministic verdict.

## Prompt Examples

**Prompt:** 
```
I have a failing test: expected '05:25' but received '04:45'. The engine computes bedtime as wake_time minus (cycles × 90 + 15) minutes. For 2 cycles from 08:00, diagnose this failure.
```

**Response:** 
```
Verdict: TEST_ERROR. Trace: 2×90+15=195min, 480-195=285min → 04:45. The engine correctly returns '04:45'. Your expected value '05:25' is wrong — fix it to '04:45'.
```
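
The trace in this response can be reproduced mechanically. The sketch below is only an assumption about how such an engine might look: the 90-minute cycle and 15-minute offset come from the prompt, while the function name `traceBedtime` and its return shape are hypothetical. It records each intermediate value, the way the Trace step asks the agent to.

```typescript
// Hypothetical engine trace for: bedtime = wake_time - (cycles * 90 + 15) minutes.
function traceBedtime(wake: string, cycles: number): { steps: string[]; result: string } {
  const [h, m] = wake.split(":").map(Number);
  const wakeMin = h * 60 + m;
  const sleepMin = cycles * 90 + 15;
  const bedMin = wakeMin - sleepMin; // can go negative across midnight; not handled in this sketch
  const hh = String(Math.floor(bedMin / 60)).padStart(2, "0");
  const mm = String(bedMin % 60).padStart(2, "0");
  const result = `${hh}:${mm}`;
  const steps = [
    `wake = ${wakeMin}min`,
    `sleep = ${cycles}*90+15 = ${sleepMin}min`,
    `bed = ${wakeMin}-${sleepMin} = ${bedMin}min`,
    `format -> ${result}`,
  ];
  return { steps, result };
}

const trace = traceBedtime("08:00", 2);
console.log(trace.steps.join("\n"));
// Compare the traced result with the Received and Expected values from the prompt.
console.log({
  receivedMatchesTrace: trace.result === "04:45",
  expectedMatchesTrace: trace.result === "05:25",
});
```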

**Prompt:** 
```
My test expects '22:45' for 6 cycles from 08:00, but the engine returns '-02:-15'. This looks like a midnight crossover bug.
```

**Response:** 
```
Verdict: ENGINE_DEFECT. Trace: 6×90+15=555min, 480-555=-75min → JS modulo gives -2h-15m = '-02:-15' (incorrect). Domain-correct: (-75+1440)%1440=1365min → 22:45. Root cause: JavaScript % operator preserves sign on negative numbers. Keep this test — it documents the bug.
```
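
The root cause named in this response is standard JavaScript behaviour: the `%` operator returns a remainder with the sign of the dividend, so `-75 % 60` is `-15` rather than `45`. A minimal sketch of naive formatting next to a wrapped version follows. The function names are hypothetical, and the engine's real code is not shown in this document, so the exact malformed string the naive version produces may differ from the one in the prompt.

```typescript
// Hypothetical helpers; the real engine source is not part of this document.
const MINUTES_PER_DAY = 1440;

// Naive formatting: yields negative hour and minute parts for negative input.
function formatNaive(totalMin: number): string {
  const hh = String(Math.floor(totalMin / 60)).padStart(2, "0");
  const mm = String(totalMin % 60).padStart(2, "0");
  return `${hh}:${mm}`;
}

// Wrap into [0, 1440) first, so a time before midnight lands on the previous day.
function formatWrapped(totalMin: number): string {
  const wrapped = ((totalMin % MINUTES_PER_DAY) + MINUTES_PER_DAY) % MINUTES_PER_DAY;
  return formatNaive(wrapped);
}

const bedMin = 480 - (6 * 90 + 15); // 6 cycles from 08:00 gives -75 minutes
console.log(formatNaive(bedMin)); // malformed: contains negative parts
console.log(formatWrapped(bedMin)); // "22:45"
```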

**Prompt:** 
```
After diagnosing 6 test failures, what is the aggregate recommendation for my QA pipeline?
```

**Response:** 
```
Session Summary — 6 analyzed: 4 TEST_ERROR, 1 ENGINE_DEFECT, 1 BOTH_WRONG. Recommendation: MIXED. Fix your 4 test assertions first, then report the engine defect with traced proof to the developer agent.
```
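
A session summary like this one amounts to tallying verdicts. The sketch below is illustrative only: the `summarize` function and its rule for choosing a recommendation are assumptions, and `MIXED` is the only recommendation label that appears in this document (`NONE` and the single-verdict fallback are hypothetical placeholders).

```typescript
// Hypothetical aggregation; the server's real summary logic is not documented here.
type Verdict = "TEST_ERROR" | "ENGINE_DEFECT" | "BOTH_WRONG";

function summarize(verdicts: Verdict[]): {
  analyzed: number;
  counts: Record<Verdict, number>;
  recommendation: string;
} {
  const counts: Record<Verdict, number> = { TEST_ERROR: 0, ENGINE_DEFECT: 0, BOTH_WRONG: 0 };
  for (const v of verdicts) counts[v] += 1;

  // Assumed rule: more than one kind of verdict present means MIXED.
  const kinds = (Object.keys(counts) as Verdict[]).filter((k) => counts[k] > 0);
  const recommendation =
    kinds.length === 0 ? "NONE" : kinds.length === 1 ? kinds[0] : "MIXED";

  return { analyzed: verdicts.length, counts, recommendation };
}

const session: Verdict[] = [
  "TEST_ERROR", "TEST_ERROR", "TEST_ERROR", "TEST_ERROR", "ENGINE_DEFECT", "BOTH_WRONG",
];
console.log(summarize(session));
```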

## Capabilities

### Diagnose Test Failure
Forces a deterministic root cause analysis by comparing observed test results against a manually traced engine execution path.

### Trace Engine Function Steps
Requires the agent to show every intermediate calculation, branch taken, and value produced during a failing test run.

### Identify Test Assertion Errors
Determines if the failure is due to an incorrect expected value set by the test author, independent of code behavior.

### Pinpoint Code Defects
Flags failures where the engine's actual output contradicts the required trace, proving a genuine bug exists in the underlying logic.

### Validate Logical Consistency
Rejects diagnoses that are internally contradictory (e.g., claiming an engine defect when the received value matches the trace).

## Use Cases

### A test fails because the expected value was wrong.
The QA engineer runs a failing test through `diagnose_test_failure`. The agent's trace shows Received=Trace and Expected≠Trace, so the verdict is TEST_ERROR. The agent immediately tells the developer, 'It's not the code; fix the expectation here.' No time is wasted debugging the engine.

### The core function has a bug (e.g., midnight crossover).
A test fails because of an overflow or domain logic issue. Running it through `diagnose_test_failure` confirms that Received≠Trace and Expected=Trace. The verdict is ENGINE_DEFECT. This proves the code needs fixing, not the test.

### The pipeline gets stuck guessing which component failed.
Instead of having three separate agents argue over a failure, one agent calls `diagnose_test_failure`. The resulting structured verdict (e.g., BOTH_WRONG) gives the entire team an immediate, single point of focus: fix both components.

### Debugging intermittent 'flaky' test failures.
A test only fails sometimes. Running it through `diagnose_test_failure` forces the agent to examine the intermediate calculations and compare them to a stable trace, revealing whether the failure comes from shared state (Environment Pollution) rather than from the logic itself.

## Benefits

- Stops deadlocks in multi-agent pipelines. Instead of agents guessing or retrying blindly, `diagnose_test_failure` forces a structured analysis that yields one of three clear verdicts: TEST_ERROR, ENGINE_DEFECT, or BOTH_WRONG.
- Eliminates the 'Fix introduces regression' problem. The verdict states whether the test assertion is wrong (TEST_ERROR) or the engine is genuinely broken (ENGINE_DEFECT), which stops developers from editing tests when the code is at fault, or editing code to satisfy a bad assertion.
- Forces detailed reasoning. Your agent can't skip steps: it must trace the function step-by-step, compare the Received and Expected values against that trace, and commit to two pivots. This prevents superficial debugging and keeps the verdict consistent with the pivots.
- Handles complex failure modes. It provides specific guidance on issues like Environment Pollution (shared state) or Flaky Test Tolerance, forcing you to quarantine intermittent failures instead of ignoring them.
- Provides verifiable evidence. The tool validates the logical consistency of the diagnosis. If the verdict your agent claims contradicts the pivots it committed to, the server rejects it, and the recorded trace leaves an auditable trail for anything subtler.

## How It Works

In short, the tool forces structured reasoning that separates test definition errors from actual engine bugs before any code changes happen.

1. The agent receives a test failure and must call `diagnose_test_failure`. It starts by tracing the function step-by-step with the inputs.
2. Next, the agent fills in two boolean pivots: whether the Vitest 'Received' value matches its trace, and whether the test's 'Expected' value matches the trace.
3. The tool then computes the final verdict (TEST_ERROR, ENGINE_DEFECT, or BOTH_WRONG) from those pivots. If the diagnosis the agent submitted contradicts them, the tool rejects it.

## Frequently Asked Questions

**Does QA Arbiter run my tests or compute expected values?**
No. QA Arbiter performs zero computation and zero side effects. It forces the AI agent to structure its own reasoning into verifiable steps, then validates that the reasoning is logically consistent. Think of it as a reasoning enforcer — like Sequential Thinking, but specialized for test failure diagnosis.

**What are Decision Pivots?**
Decision Pivots are minimal, verifiable checkpoints that all correct reasoning paths must pass through, a concept from the ROMA research framework. In QA Arbiter, the two pivots are boolean fields: `receivedMatchesTrace` (does the engine's output match the hand-traced computation?) and `expectedMatchesTrace` (does the test's expected value match the hand-traced computation?). The verdict is derived deterministically from these two booleans, so the agent cannot submit a verdict that disagrees with its own pivots without being rejected.

**How does it prevent pipeline deadlocks in multi-agent systems?**
In a typical QA→Developer pipeline, when tests fail, the system routes back to the developer. But if the tests themselves are wrong (QA's fault), the developer can't fix them — creating an infinite retry loop. QA Arbiter forces the QA agent to determine fault attribution BEFORE the pipeline routes: if it's TEST_ERROR, the QA agent fixes its own tests; if it's ENGINE_DEFECT, it routes to the developer with traced proof. The aggregate summary tells the orchestrator exactly what to do.
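
The routing rule described in this answer can be sketched as a simple lookup. This is a hedged illustration rather than part of QA Arbiter itself: the agent names and the `routeFailure` function are hypothetical, and sending `BOTH_WRONG` to both agents is an assumption based on the "fix both components" use case above.

```typescript
// Hypothetical orchestrator-side routing; QA Arbiter itself only returns the verdict.
type Verdict = "TEST_ERROR" | "ENGINE_DEFECT" | "BOTH_WRONG";
type Agent = "qa-agent" | "developer-agent";

const ROUTES: Record<Verdict, Agent[]> = {
  TEST_ERROR: ["qa-agent"], // QA fixes its own assertions
  ENGINE_DEFECT: ["developer-agent"], // developer receives the traced proof
  BOTH_WRONG: ["qa-agent", "developer-agent"], // assumed: both sides have work to do
};

// Decide who handles a failure before the pipeline routes it anywhere.
function routeFailure(verdict: Verdict): Agent[] {
  return ROUTES[verdict];
}

console.log(routeFailure("TEST_ERROR"));
```

Because a `TEST_ERROR` never reaches the developer under this rule, the retry loop described above has no path to start.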

**What happens if the agent lies about the boolean pivots?**
The consistency validation catches direct contradictions — e.g., if the agent says both values match the trace but chose TEST_ERROR instead of FALSE_ALARM, the tool rejects it. For subtler misrepresentations, the `engineTrace` field creates an auditable trail: post-hoc analysis can cross-reference the trace against the actual engine source code. The structured format makes deception mechanically harder than with free-form text.

**How does QA Arbiter handle complex data when using diagnose_test_failure?**
The tool requires you to provide a full, step-by-step trace of the engine function execution. You must include every intermediate calculation and value produced by the code logic itself. Simply stating that 'the data processes correctly' is insufficient; the diagnosis depends on arithmetic proof.

**What input format does QA Arbiter need for diagnose_test_failure?**
You must provide three specific components: the original failing test assertion (Expected value), the live output from Vitest (Received value), and the detailed trace of the engine's internal steps. All three are required to determine the two boolean pivots accurately.
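
As a rough picture of what such a payload could look like, here is a TypeScript sketch. Only `receivedMatchesTrace`, `expectedMatchesTrace`, and `engineTrace` are named in this document. The `expected` and `received` field names, the array-of-steps shape of `engineTrace`, and the `missingFields` helper are assumptions made for illustration, so check the tool's published schema before relying on them.

```typescript
// Hypothetical input shape; consult the tool's actual schema for the real field names.
interface DiagnoseTestFailureInput {
  expected: string; // assumed name: value asserted by the test
  received: string; // assumed name: value reported by Vitest
  engineTrace: string[]; // named in this document; array-of-steps shape is assumed
  receivedMatchesTrace: boolean; // named in this document
  expectedMatchesTrace: boolean; // named in this document
}

const REQUIRED = [
  "expected",
  "received",
  "engineTrace",
  "receivedMatchesTrace",
  "expectedMatchesTrace",
] as const;

// List any required component that is absent from a draft payload.
function missingFields(input: Partial<DiagnoseTestFailureInput>): string[] {
  return REQUIRED.filter((key) => input[key] === undefined);
}

const example: DiagnoseTestFailureInput = {
  expected: "05:25",
  received: "04:45",
  engineTrace: ["2*90+15 = 195min", "480-195 = 285min", "285min -> 04:45"],
  receivedMatchesTrace: true,
  expectedMatchesTrace: false,
};

console.log(missingFields(example)); // nothing missing
console.log(missingFields({ expected: "05:25" })); // lists the absent components
```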

**If I get rejected by QA Arbiter, what does that mean?**
A rejection means your proposed diagnosis is logically inconsistent. The tool catches contradictions—for instance, if you claim an engine defect but marked 'receivedMatchesTrace: true'. You must re-examine the intermediate calculations until the reasoning holds up.

**Is QA Arbiter limited only to software testing?**
While built for test diagnostics, its core function is structured fault diagnosis. It forces a systematic separation of conflicting data points—a pattern applicable to any domain requiring verifiable root cause analysis beyond simple guesswork.