# Extracta MCP

> Extracta uses AI to automate data extraction and document classification from PDFs, images, and other files. It lets you define exactly what data you need (like dates, amounts, or vendor names) and then processes entire batches of documents into clean, structured JSON through your agent.

## Overview
- **Category:** artificial-intelligence
- **Price:** Free
- **Tags:** ocr, data-extraction, document-classification, json-parsing, automated-data-entry, unstructured-data

## Description

Imagine getting mountains of invoices, receipts, and contracts that all need to be logged into a database. Doing this manually is a nightmare. Extracta connects directly to your AI client, so you can handle complex data extraction through natural conversation. You don't just read documents; you build the extraction process itself. You define custom JSON schemas that tell the system exactly which fields matter (like invoice dates or total amounts). Then you give it a URL for any document, and it handles the rest. It doesn't just pull text: once you set up classification rules, it can also tell you whether a file is an 'Invoice' or a 'Receipt' and extract the necessary data into structured JSON. If you're building out your toolset on Vinkius, this MCP gives you document processing without needing to write custom API calls every time.

## Tools

### create_classification
Sets up a new document classification model by defining the categories you want to sort documents into (e.g., invoice, receipt).

### view_classification
Shows the specific details and settings of an existing document classification process.

### get_batch_results
Retrieves historical results for a large number of documents processed through an extraction workflow.

### get_classification_results
Provides the AI's predicted category and confidence score for a specific document.

### create_extraction
Initializes an entire data extraction process, allowing you to specify required fields like dates or total amounts.

### delete_extraction
Removes an existing document extraction configuration; this stops all future processing for that setup ID.

### get_results
Checks the current status of a document's extraction job, indicating whether it is still running or complete, and returns the structured JSON once the job has finished.

### update_extraction
Modifies mapping rules and field definitions for an already created extraction process without starting over.

### upload_file_url
Submits a link to a document file, kicking off the background job necessary for data processing.

### view_extraction
Displays all settings and current parameters of an existing extraction process configuration.

## Prompt Examples

**Prompt:** 
```
Create an extraction process for invoices with fields: date, vendor, total
```

**Response:** 
```
Extraction setup created! I've initialized a new process (ID: ext_abc) configured to extract 'date', 'vendor', and 'total' fields. You can now start uploading invoice URLs to this process.
```

**Prompt:** 
```
Extract data from this receipt URL: https://example.com/receipt.pdf
```

**Response:** 
```
Processing document... I've submitted the receipt URL for extraction. Document ID: doc_789. I'll poll for the results now... [Agent retrieves structured JSON with date, total, and items].
```

**Prompt:** 
```
What type of document is doc_789 according to my classification rules?
```

**Response:** 
```
Checking classification for doc_789... The AI has classified this document as 'Invoice' with a 98% confidence score. It fits within your defined document categories flawlessly.
```

## Capabilities

### Define Extraction Schemas
You create and configure data extraction processes by defining precise JSON schemas for the fields you need from documents.
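
As a rough illustration, the field definitions passed to `create_extraction` could be assembled like this. The payload layout (`name`, `fields`, `key`, `description`) and the helper `build_extraction_config` are assumptions made for this sketch; the actual argument names expected by the tool may differ.

```python
import json


def build_extraction_config(name, fields):
    """Build a hypothetical argument payload for `create_extraction`.

    `fields` maps a field key to a short description. The payload layout
    is assumed for illustration and is not taken from the Extracta docs.
    """
    if not fields:
        raise ValueError("at least one field is required")
    return {
        "name": name,
        "fields": [
            {"key": key, "description": description}
            for key, description in fields.items()
        ],
    }


# Same three fields as in the prompt example: date, vendor, total.
config = build_extraction_config(
    "invoices",
    {
        "date": "Invoice issue date",
        "vendor": "Name of the vendor",
        "total": "Total amount due",
    },
)
print(json.dumps(config, indent=2))
```

In practice your agent builds these arguments from the conversation, so a sketch like this is mainly useful for checking which fields you intend to request.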

### Process File URLs
Submit publicly accessible file links (PDF, JPG, PNG) to trigger a background workflow that returns structured JSON data later.

### Classify Document Type
Set up rules that automatically sort incoming documents into predefined types, like invoices or contracts, based on AI analysis.

### Audit Historical Results
Retrieve status and structured data for specific documents, including confidence scores and predicted categories.

### Manage Configurations
Update existing extraction settings or view the full configuration of an active document process without creating a new one.

## Use Cases

### Processing Vendor Payments
A finance manager needs to pay vendors using scanned invoices. They ask their agent to use `create_extraction` first, defining fields like 'vendor name' and 'total amount.' Then they submit 50 URLs via `upload_file_url` and, once the background jobs finish, get back structured JSON data ready for payment processing.

### Building a Document Library
A legal team receives thousands of client agreements. They use the MCP to define document types using `create_classification`. The agent processes them, automatically identifying and grouping everything as 'Contract' or 'NDA,' allowing quick auditing.
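
The grouping step in this use case might look like the sketch below. It assumes each classification result carries a `document_id`, a `category`, and a `confidence` between 0 and 1; those key names, the 0.9 threshold, and the sample records are illustrative, not the documented output of `get_classification_results`.

```python
from collections import defaultdict


def group_by_category(classifications, min_confidence=0.9):
    """Group document IDs by predicted category.

    Results below `min_confidence` are routed to a "needs_review" bucket
    instead of being trusted. The result keys are assumed for this sketch.
    """
    groups = defaultdict(list)
    for item in classifications:
        if item["confidence"] >= min_confidence:
            label = item["category"]
        else:
            label = "needs_review"
        groups[label].append(item["document_id"])
    return dict(groups)


# Made-up sample results for illustration only.
sample = [
    {"document_id": "doc_1", "category": "Contract", "confidence": 0.97},
    {"document_id": "doc_2", "category": "NDA", "confidence": 0.95},
    {"document_id": "doc_3", "category": "Contract", "confidence": 0.62},
]
grouped = group_by_category(sample)
print(grouped)
```

Routing low-confidence results to a review bucket keeps a doubtful prediction from being filed as if it were certain.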

### Tracking Data Changes Over Time
An operations team needs to monitor how many receipts they process each month. They use the `get_batch_results` tool to fetch a paginated list of all processed documents and associated data payloads for historical review.
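
A monthly count over a paginated result list could be sketched as follows. The `page` and `page_size` parameters, the `processed_at` field, and the stand-in `fake_fetch_page` function are all assumptions for this example; the real `get_batch_results` arguments and response shape may differ.

```python
from collections import Counter


def iter_batch_results(fetch_page, page_size=100):
    """Yield every record from a paginated source, one page at a time.

    `fetch_page(page=..., page_size=...)` stands in for a call to
    `get_batch_results`; its parameters are assumed for this sketch.
    """
    page = 1
    while True:
        items = fetch_page(page=page, page_size=page_size)
        if not items:
            return
        yield from items
        if len(items) < page_size:
            return  # a short page means there is nothing left to fetch
        page += 1


# Made-up records standing in for processed receipts.
RECORDS = [
    {"document_id": "doc_1", "processed_at": "2024-01-05"},
    {"document_id": "doc_2", "processed_at": "2024-01-19"},
    {"document_id": "doc_3", "processed_at": "2024-02-02"},
    {"document_id": "doc_4", "processed_at": "2024-02-11"},
    {"document_id": "doc_5", "processed_at": "2024-02-27"},
]


def fake_fetch_page(page, page_size):
    """Return one slice of RECORDS, mimicking a paginated response."""
    start = (page - 1) * page_size
    return RECORDS[start:start + page_size]


# Count documents per month using the YYYY-MM prefix of the date.
per_month = Counter(
    record["processed_at"][:7]
    for record in iter_batch_results(fake_fetch_page, page_size=2)
)
print(dict(per_month))
```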

### Validating New Data Pipelines
A developer needs to test if their new extraction schema works on live files. They use `view_extraction` to check the configuration, then submit a single URL using `upload_file_url`, and poll with `get_results` until they get structured JSON.

## Benefits

- Stop copying data out of documents by hand. You tell the system exactly what fields you need (like invoice dates or product totals) through the `create_extraction` tool, and it handles the rest.
- You don't wait for manual file uploads. Just give it a URL using `upload_file_url`, and the background process does the heavy lifting, giving you structured JSON later on.
- Classification is built-in. Before extracting data, the system uses document type rules (via `create_classification`) to ensure you know if the file is an invoice or a contract.
- You never lose history. Use `get_batch_results` to pull records from hundreds of processed documents at once for audit purposes.
- Need a quick change? You can use `update_extraction` to tweak mapping rules on a live process instead of having to build an entirely new setup.

## How It Works

Your agent handles the entire pipeline, from schema definition to final data output, so you get clean JSON ready for analysis.

1. First, you define your data needs by setting up a specific extraction process and detailing the required JSON schemas.
2. Next, you submit one or more publicly accessible document URLs to kick off an asynchronous processing job.
3. Finally, you poll for results, receiving structured JSON containing the extracted data, its confidence score, and classification details.
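
The polling in step 3 can be sketched as a simple loop. This is a minimal illustration, not Extracta's own client: the `status` values (`"processing"`, `"complete"`), the response shape, and the stand-in `fake_get_results` function are assumptions, and in normal use your agent runs this loop for you by calling `get_results`.

```python
import time


def poll_until_done(fetch_status, document_id, interval=2.0,
                    max_attempts=30, sleep=time.sleep):
    """Call `fetch_status(document_id)` until the job stops processing.

    `fetch_status` stands in for a `get_results` call. The "processing"
    status value is an assumption made for this sketch.
    """
    for _ in range(max_attempts):
        result = fetch_status(document_id)
        if result.get("status") != "processing":
            return result
        sleep(interval)
    raise TimeoutError(
        f"{document_id} still processing after {max_attempts} attempts"
    )


def make_fake_get_results(ready_after):
    """Build a stub that reports 'processing' until the Nth call."""
    calls = {"count": 0}

    def fake_get_results(document_id):
        calls["count"] += 1
        if calls["count"] < ready_after:
            return {"status": "processing"}
        # Made-up sample payload for illustration only.
        return {
            "status": "complete",
            "data": {"date": "2024-01-31", "total": "120.00"},
        }

    return fake_get_results


result = poll_until_done(
    make_fake_get_results(ready_after=3),
    "doc_789",
    sleep=lambda _seconds: None,  # skip real waiting in this demo
)
print(result["status"])
```

Capping the number of attempts keeps a stuck job from blocking the caller indefinitely; the interval and the cap are arbitrary defaults here.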

## Frequently Asked Questions

**How do I start using Extracta with my documents?**
You first need to run `create_extraction` to define what data you want. Then, use the `upload_file_url` tool to submit your files for processing.

**Can Extracta tell me if a document is an invoice or something else?**
Yes. You set up rules using `create_classification`, and then you can use `get_classification_results` to check the predicted type of any uploaded document.

**What happens if I change my extraction requirements after setting it up?**
You don't need to start over. Use the `update_extraction` tool to modify your existing configuration and mapping rules on the fly.

**Does Extracta handle large batches of documents?**
Yes. Use the `get_batch_results` tool to retrieve historical data from multiple processed files in bulk.

**What is the difference between `create_extraction` and `view_extraction`?**
`create_extraction` sets up a brand new process with defined schemas. `view_extraction` just shows you all the current settings for an extraction process that already exists.