# Data Pipeline Prover MCP

> Data Pipeline Prover forces your AI agent to validate data architecture before it runs. It audits for common, silent failures: schema drift, non-idempotent writes, stale data reporting, and untraceable data lineage. Don't let bad pipelines corrupt your warehouse; get an architectural proof.

## Overview
- **Category:** productivity
- **Price:** Free
- **Tags:** data-pipeline, data-contract, idempotency, freshness-sla, data-lineage, data-quality, etl-validation, data-mesh

## Description

Data pipelines are supposed to move clean data from Point A to Point B, and they usually work until they don't. The problem is that failure rarely looks like a crash; it quietly introduces wrong numbers into production, and you may not notice for months. This MCP forces your agent to prove the architecture is sound. It doesn't run the ETL job itself; it audits the blueprint for flaws.

When you use this, your AI client must define explicit rules. What happens if the input data changes shape (the schema contract)? How does the system handle retries so it never double-counts revenue? Is there a measurable alert when data gets older than a stated limit, such as 15 minutes? And most importantly, can every number in a final report be traced back to its raw source record through every transformation? Passing these checks means the design accounts for each of these failure modes. This MCP is available in the Vinkius catalog alongside other governance tools.

It moves data quality assurance from reactive debugging, where you spend weeks tracing errors, to proactive, mandatory architectural validation.

## Tools

### validate_data_pipeline
Audits a pipeline design by forcing definitions for schema contracts, idempotency mechanisms, freshness SLAs, and data lineage traceability.

## Prompt Examples

**Prompt:** 
```
Let's load the user events from CSV to our MySQL database. We'll run it every hour.
```

**Response:** 
```
Verdict: SCHEMA_ABSENT. You are running a load task without schema validation. Define input types, configure validation limits, and implement a dead-letter queue (DLQ) for corrupt lines to prevent parser errors from halting the run.
```

**Prompt:** 
```
Pipeline: CSV-to-MySQL. Schema: Input validated with Zod ({ event_id: string, user_id: string, page: string, timestamp: ISO8601 }). Rejects invalid lines to DLQ. Idempotency: Upsert keyed on event_id using INSERT ... ON DUPLICATE KEY UPDATE event_id = event_id (a no-op on replay). Freshness: SLA is maximum 1 hour latency, monitored via event_timestamp vs load_timestamp, alerts Slack on breach. Lineage: Raw S3 CSV bucket -> Lambda Parser -> RDS MySQL database.
```

**Response:** 
```
Verdict: PIPELINE_PROVEN. Schemas are validated with Zod, idempotency is guaranteed via database upserts, freshness alerts are in place, and the raw-to-db data flow is traceable.
```

**Prompt:** 
```
Pipeline: CSV-to-MySQL. Schema: Validated with Zod ({ event_id: string }). Idempotency: none, we append all rows. Freshness: monitored every hour. Lineage: S3 bucket to DB.
```

**Response:** 
```
Verdict: NON_IDEMPOTENT. Appending rows without deduplication causes duplication if the run fails mid-process and restarts. Enforce a composite primary key or perform upserts to ensure safety.
```

## Capabilities

### Validate Input Contracts
The MCP verifies that input and output schemas are strictly defined at every stage, preventing the system from accepting unexpected data shapes.
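
As a rough illustration of what a strictly defined input contract can look like in practice, here is a minimal Python sketch that checks each row against a field-to-type contract and routes failures to a dead-letter list. The `EVENT_CONTRACT` fields mirror the prompt example above; the function names and the in-memory dead-letter list are assumptions made for this sketch, not part of the tool.

```python
from datetime import datetime

# Hypothetical contract for this sketch: field name -> required Python type.
EVENT_CONTRACT = {"event_id": str, "user_id": str, "page": str, "timestamp": str}


def validate_row(row: dict) -> list:
    """Return a list of contract violations; an empty list means the row is valid."""
    errors = []
    for name, expected in EVENT_CONTRACT.items():
        if name not in row:
            errors.append(f"missing field: {name}")
        elif not isinstance(row[name], expected):
            errors.append(f"wrong type for {name}: {type(row[name]).__name__}")
    unexpected = set(row) - set(EVENT_CONTRACT)
    if unexpected:
        errors.append(f"unexpected fields: {sorted(unexpected)}")
    if isinstance(row.get("timestamp"), str):
        try:
            # fromisoformat accepts a subset of ISO 8601 on older Python versions.
            datetime.fromisoformat(row["timestamp"])
        except ValueError:
            errors.append("timestamp is not ISO 8601")
    return errors


def split_valid_and_dead_letter(rows):
    """Route valid rows onward and invalid rows, with reasons, to a dead-letter list."""
    valid, dead_letter = [], []
    for row in rows:
        errors = validate_row(row)
        if errors:
            dead_letter.append({"row": row, "errors": errors})
        else:
            valid.append(row)
    return valid, dead_letter


rows = [
    {"event_id": "e1", "user_id": "u1", "page": "/home", "timestamp": "2024-01-01T10:00:00"},
    {"event_id": "e2", "user_id": 42, "page": "/cart", "timestamp": "2024-01-01T10:05:00"},
    {"event_id": "e3", "user_id": "u3", "page": "/home", "timestamp": "yesterday"},
]
valid, dead_letter = split_valid_and_dead_letter(rows)
```

Only `e1` passes; `e2` and `e3` land in the dead-letter list with the reason attached, so a corrupt line never halts the run and never reaches the destination unnoticed.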

### Guarantee Safe Data Replay
It requires the design to name a replay-safe mechanism, such as upserts or deduplication keys, so that running a job multiple times won't duplicate or corrupt your records.
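
To make replay safety concrete, the following sketch loads the same batch twice into an in-memory SQLite table and shows that the row count does not change. SQLite is used only because it ships with Python; the table name and columns are assumptions for illustration. The `ON CONFLICT ... DO NOTHING` clause needs SQLite 3.24 or newer, and MySQL spells the equivalent as `INSERT IGNORE` or `INSERT ... ON DUPLICATE KEY UPDATE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The primary key on event_id is what makes the write replay-safe.
conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, user_id TEXT, page TEXT)")

batch = [("e1", "u1", "/home"), ("e2", "u2", "/cart")]


def load(conn, rows):
    """Insert rows; rows whose event_id already exists are skipped, not duplicated."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO events (event_id, user_id, page) VALUES (?, ?, ?) "
            "ON CONFLICT (event_id) DO NOTHING",
            rows,
        )


load(conn, batch)
load(conn, batch)  # simulated retry of the same batch
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

After both loads, `row_count` is still 2. An append-only insert without the key would have produced 4 rows.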

### Monitor Data Freshness
The tool requires a measurable freshness Service Level Agreement (SLA) and an alert that fires when data exceeds that age limit.

### Track End-to-End Lineage
You define the source, every transformation step, and the owner of the data to trace any number back to its origin point.
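
One way to picture end-to-end lineage is a value that carries its own history. The sketch below is a toy model, not how the tool stores lineage: `TracedValue`, the helper functions, the `s3://raw/journal.csv` path, and the `JE-*` record IDs are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TracedValue:
    """A value that carries the ordered list of steps that produced it."""
    value: float
    lineage: list = field(default_factory=list)


def from_source(value, source, record_id, owner):
    """Create a value tagged with its raw source, record ID, and owner."""
    step = {"step": "ingest", "source": source, "record_id": record_id, "owner": owner}
    return TracedValue(value, [step])


def transform(traced, name, fn):
    """Apply fn and append the transformation name to the lineage."""
    return TracedValue(fn(traced.value), traced.lineage + [{"step": name}])


def aggregate(name, traced_values):
    """Sum values and keep every input's lineage so the total stays traceable."""
    total = sum(t.value for t in traced_values)
    inputs = [t.lineage for t in traced_values]
    return TracedValue(total, [{"step": name, "inputs": inputs}])


a = transform(from_source(1000, "s3://raw/journal.csv", "JE-1", "finance"),
              "cents_to_dollars", lambda v: v / 100)
b = transform(from_source(2550, "s3://raw/journal.csv", "JE-2", "finance"),
              "cents_to_dollars", lambda v: v / 100)
report = aggregate("monthly_revenue", [a, b])

# Walk back from the report total to the raw record IDs that fed it.
source_ids = [inp[0]["record_id"] for inp in report.lineage[0]["inputs"]]
```

Here `report.value` is 35.5 and `source_ids` is `["JE-1", "JE-2"]`, so the final number can be traced through the aggregation and the unit conversion back to the raw records and their owner.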

## Use Cases

### Finance Audit Trail
A finance analyst needs monthly revenue reports to be traceable at all times. They run `validate_data_pipeline` against the design, defining source-to-report lineage so that every transaction record ties back to an original journal entry ID.

### Real-Time Operations Dashboard
An ops engineer needs their dashboard to show only data generated in the last hour. They use `validate_data_pipeline` to confirm the design states a strict freshness SLA and an alert that fires if pipeline latency exceeds 60 minutes.

### User Behavior Event Processing
A marketing team processes millions of user click events hourly. To prevent duplicate customer profiles when the job retries, they use `validate_data_pipeline` to mandate an upsert strategy using a unique event ID.

## Benefits

- Eliminate silent failures. Instead of finding out 3 months later that a source change corrupted your data, the `validate_data_pipeline` tool forces you to define schema contracts upfront.
- Stop double-counting revenue. The MCP requires a replay-safe write mechanism, such as upserts or composite keys, so jobs that fail and restart don't create duplicate records.
- Never build on old numbers again. By requiring a measurable freshness SLA and an alert, the tool makes sure stale data gets flagged before a dashboard relies on it.
- Trace every number back to the source. The tool requires full lineage tracking, so if the CFO asks 'why is this wrong,' you can point to the exact raw record and transformation step.
- Move beyond vague claims. This MCP rejects generic answers like 'data quality is handled.' It demands specific keys, types, and monitoring triggers.

## How It Works

The tool turns data quality from an assumption into a mandatory, auditable engineering requirement. The workflow has three steps:

1. Start by defining the pipeline's full scope: what sources feed it, and what final tables receive the output.
2. The MCP requires you to detail four architectural contracts: schema validation rules, retry safety mechanisms, a measurable freshness SLA, and a complete transformation log (lineage).
3. You get an immediate verdict: either `PIPELINE_PROVEN` or a specific failure verdict that tells you exactly which contract is missing.
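
The steps above can be sketched as a toy checker. This is not the tool's implementation or its real input schema: `PipelineDesign`, `audit`, and the order of the checks are assumptions. The verdict names `SCHEMA_ABSENT`, `NON_IDEMPOTENT`, `LINEAGE_BLIND`, and `PIPELINE_PROVEN` come from this document; `FRESHNESS_UNDEFINED` is a placeholder invented for the sketch, because the document does not name the freshness verdict.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PipelineDesign:
    """Hypothetical shape of a pipeline design; not the tool's real input schema."""
    schema_fields: dict = field(default_factory=dict)  # field name -> type
    idempotency: Optional[str] = None                   # e.g. "upsert on event_id"
    freshness_sla_minutes: Optional[int] = None
    lineage: list = field(default_factory=list)         # ordered stages, source first


def audit(design: PipelineDesign) -> str:
    """Return the first missing contract, or PIPELINE_PROVEN if all four are present."""
    if not design.schema_fields:
        return "SCHEMA_ABSENT"
    if not design.idempotency:
        return "NON_IDEMPOTENT"
    if design.freshness_sla_minutes is None:
        return "FRESHNESS_UNDEFINED"  # placeholder name, not taken from the tool
    if len(design.lineage) < 2:
        return "LINEAGE_BLIND"
    return "PIPELINE_PROVEN"


append_only = PipelineDesign(
    schema_fields={"event_id": "string"},
    freshness_sla_minutes=60,
    lineage=["S3 bucket", "DB"],
)
proven = PipelineDesign(
    schema_fields={"event_id": "string", "user_id": "string"},
    idempotency="upsert on event_id",
    freshness_sla_minutes=60,
    lineage=["S3 CSV bucket", "Lambda parser", "RDS MySQL"],
)
verdicts = [audit(PipelineDesign()), audit(append_only), audit(proven)]
```

The empty design fails on the schema contract first, the append-only design fails on idempotency, and only the design that states all four contracts comes back as `PIPELINE_PROVEN`.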

## Frequently Asked Questions

**Can `validate_data_pipeline` run the actual ETL job?**
No. It doesn't execute code. Instead, it validates the *design* of your data pipeline architecture, forcing you to define all necessary contracts and safeguards.

**What is schema drift validation with Data Pipeline Prover?**
It prevents pipelines from accepting unexpected input shapes or types. You must specify the exact fields, data types, and failure behavior for every boundary in your pipeline.

**Does validate_data_pipeline help with duplicate records?**
Yes. It forces you to define an idempotency mechanism (like upserts or deduplication keys) so that if a job retries, it won't create multiple copies of the same record.

**Is lineage tracking necessary for data pipelines?**
It is critical. The tool forces you to map every data point back to its raw source and through every single transformation step, eliminating 'black box' numbers.

**How does `validate_data_pipeline` report architectural failures?**
The MCP returns a structured verdict that identifies which of the four contracts is missing. It won't just say 'bad data'; it returns a specific flag such as `SCHEMA_ABSENT`, `NON_IDEMPOTENT`, or `LINEAGE_BLIND`, which points you directly to the architectural flaw that needs fixing.

**What kinds of schema contracts can `validate_data_pipeline` check?**
It accepts contracts written with widely used schema tools and formats, including Zod, Protobuf, Avro, and JSON Schema. You can't just claim a contract exists; the tool forces you to define the specific fields, data types, and exactly how corrupt or invalid lines get handled (like sending them to a dead-letter queue).

**Is there a limit on complexity when running validate_data_pipeline?**
No. The tool analyzes your pipeline's *architecture* (the logical flow, the transformations, and the ownership boundaries) rather than running the data load itself. This means you can review massive, multi-stage ETL designs without hitting runtime limits.

**What is required to set up a Freshness SLA using Data Pipeline Prover?**
You must define a concrete Service Level Agreement with a measurable number, like 'data must be under 15 minutes old.' This requires monitoring a specific timestamp (like `last_updated_at`) and triggering automated alerts when that defined window passes.
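
A freshness check of this kind reduces to comparing a timestamp against a fixed window. The sketch below assumes a 15-minute SLA and a `last_updated_at` value read from the destination table; `check_freshness` is a hypothetical helper, and the alerting hook is left out.

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(minutes=15)


def check_freshness(last_updated_at: datetime, now: datetime,
                    sla: timedelta = FRESHNESS_SLA) -> dict:
    """Compare data age against the SLA and report whether an alert should fire."""
    age = now - last_updated_at
    return {"age_minutes": age.total_seconds() / 60, "breached": age > sla}


# Fixed clock so the example is reproducible; a real check would use
# datetime.now(timezone.utc) and send an alert when "breached" is True.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = check_freshness(datetime(2024, 1, 1, 11, 50, tzinfo=timezone.utc), now)
stale = check_freshness(datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc), now)
```

Data that is 10 minutes old passes, while data that is 30 minutes old breaches the SLA. Using timezone-aware timestamps on both sides avoids false alerts caused by mixing local time and UTC.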

**How do you achieve idempotency in write jobs?**
Use unique keys and database constraints (e.g. `INSERT INTO ... ON CONFLICT DO UPDATE` in PostgreSQL or SQLite, or `INSERT ... ON DUPLICATE KEY UPDATE` in MySQL), match against unique business transaction IDs, or write to partition targets that are cleared before each load.
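
The partition approach is worth a short sketch because it works even when rows have no natural unique key. The example assumes a `daily_revenue` table partitioned logically by `load_date`; the table, columns, and `overwrite_partition` helper are hypothetical, and SQLite stands in for the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_revenue (load_date TEXT, order_id TEXT, amount REAL)")


def overwrite_partition(conn, load_date, rows):
    """Replace one day's rows atomically, so a rerun leaves the same final state."""
    with conn:  # one transaction: the delete and the inserts commit or roll back together
        conn.execute("DELETE FROM daily_revenue WHERE load_date = ?", (load_date,))
        conn.executemany(
            "INSERT INTO daily_revenue (load_date, order_id, amount) VALUES (?, ?, ?)",
            [(load_date, order_id, amount) for order_id, amount in rows],
        )


rows = [("o1", 19.99), ("o2", 5.00)]
overwrite_partition(conn, "2024-01-01", rows)
overwrite_partition(conn, "2024-01-01", rows)  # rerun after a simulated failure
count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM daily_revenue"
).fetchone()
```

The rerun leaves two rows and the same revenue total. Keeping the delete and the inserts in one transaction matters: if the load fails halfway, the previous contents of the partition are restored rather than left empty.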

**What is data lineage and why is it important?**
Data lineage represents the complete lifecycle of a data point: from raw ingestion, through transformations and aggregations, to the final report. It is critical for root-cause analysis when data is wrong.

**Where should pipeline schemas be enforced?**
Schemas should be validated at the boundaries of each processing stage: immediately upon ingestion, after cleaning transformations, and prior to writing to the destination data warehouse.
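
A minimal sketch of boundary enforcement, assuming a two-stage pipeline with one contract for raw input and one for cleaned output. The contracts, stage names, and the `require_fields` helper are made up for the example.

```python
def require_fields(stage, rows, contract):
    """Raise if any row at this boundary is missing a field or has the wrong type."""
    for i, row in enumerate(rows):
        for name, expected in contract.items():
            if not isinstance(row.get(name), expected):
                raise ValueError(f"{stage}: row {i} violates contract on '{name}'")
    return rows


RAW = {"event_id": str, "amount_cents": str}    # as ingested from text
CLEAN = {"event_id": str, "amount_cents": int}  # after the cleaning step


def clean(rows):
    """Cleaning transformation: parse the amount into an integer."""
    return [{**r, "amount_cents": int(r["amount_cents"])} for r in rows]


def run(raw_rows):
    """Validate at ingestion, after cleaning, and again just before the write."""
    ingested = require_fields("ingest", raw_rows, RAW)
    cleaned = require_fields("post-clean", clean(ingested), CLEAN)
    # In a real pipeline more steps would sit between these two checks.
    return require_fields("pre-write", cleaned, CLEAN)


result = run([{"event_id": "e1", "amount_cents": "1999"}])
```

Because every error message names the stage, a contract violation points at the boundary where the shape changed instead of surfacing later as a wrong number in the warehouse.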