# Sensible MCP

> Sensible handles structured data extraction from any document type—PDFs, images, Word files, etc. It turns messy, unstructured documents into clean, predictable JSON records using a robust parsing engine. You classify documents first, then extract specific fields (like invoice numbers or tax IDs) whether the file is local, remote via URL, or part of a portfolio batch.

## Overview
- **Category:** artificial-intelligence
- **Price:** Free
- **Tags:** document-extraction, pdf-parsing, ocr, data-extraction, structured-data

## Description

You're done copy-pasting data out of PDFs and invoices. This server takes messy, unstructured documents—whether they’re PDFs, images, or Word files—and converts them into clean, predictable JSON records using a robust parsing engine.

The system starts by letting you **define** what kind of document you're dealing with. You establish an entirely new category or type of document using `create_document_type`, and then you guide the specific data extraction for that type by creating a rule set with `create_configuration`. If you need to standardize your process, you can create a 'Golden' reference document via `create_golden`, which serves as the primary source for defining both data structure and quality control. You manage these rules using functions like `update_configuration` or `get_configuration`, and if something goes wrong, you can delete configurations with `delete_configuration` or remove entire types with `delete_document_type`.

When you receive a document, the first step is figuring out what it is. Use `classify_sync` to get the type immediately, or start a background job with `classify_async` if classification takes time. Once the document is classified, you have several ways to extract data. If you have a local file encoded as Base64, `extract_sync` returns structured data in the same call. For files already hosted online, `extract_from_url` starts an asynchronous job; for files you need to upload first, `generate_upload_url` creates a pre-signed upload URL for the asynchronous process. You can also process multiple related documents at a single URL with `extract_portfolio_from_url`. If the job must follow a specific rule set, use `generate_upload_url_with_config` or `extract_from_url_with_config` for the asynchronous paths, or `extract_sync_with_config` for an immediate extraction from a local file using a configuration ID.
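As a rough illustration of the synchronous path, the sketch below Base64-encodes a file's bytes and assembles arguments for an `extract_sync` call. The helper name and the argument keys (`document_type`, `document`) are assumptions made for illustration, not the server's confirmed schema; check the tool definition your MCP client reports.

```python
import base64


def build_extract_sync_args(file_bytes: bytes, document_type: str) -> dict:
    """Assemble arguments for a hypothetical `extract_sync` tool call."""
    # Base64 turns raw bytes into ASCII text that is safe to embed in JSON.
    encoded = base64.b64encode(file_bytes).decode("ascii")
    # ASSUMPTION: these key names are illustrative, not the confirmed schema.
    return {"document_type": document_type, "document": encoded}


# Placeholder bytes stand in for a real PDF read from disk.
args = build_extract_sync_args(b"%PDF-1.7 sample bytes", "invoice")
print(args["document_type"], len(args["document"]))
```

Because Base64 inflates the payload by roughly a third, large files are usually a better fit for the asynchronous URL-based tools.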

To keep track of all this background work, you can list past jobs with `list_extractions`. You'll also need to manage your reference materials; `get_golden` retrieves metadata on the current standard document, and you can inspect every text line and its coordinates from that master source using `extract_text_from_golden`. When you’ve pulled a batch of data, you don't want JSON blobs. You compile multiple results into usable formats: `generate_csv` creates a standard CSV spreadsheet, while `generate_excel` builds a formatted Microsoft Excel workbook.
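To make the JSON-to-spreadsheet step concrete, here is a minimal local sketch of the kind of flattening a tool like `generate_csv` performs. It is not the server's implementation: the function name is hypothetical, the record shape (a flat mapping of field name to value) is an assumption, and the sample values are made up.

```python
import csv
import io


def records_to_csv(records: list[dict]) -> str:
    """Flatten a list of flat extraction records into CSV text."""
    # Collect column names in first-seen order so every record fits the header.
    columns: list[str] = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    # restval="" leaves a blank cell when a record lacks one of the columns.
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up sample records; real extraction output may be nested differently.
sample = [
    {"invoice_number": "INV-2023-001", "total_amount": "1250.00"},
    {"invoice_number": "INV-2023-002", "due_date": "2023-12-15"},
]
print(records_to_csv(sample))
```

Records that are missing a field simply produce an empty cell, which keeps every row aligned with the header.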

For large-scale operations, the server helps you manage access and history. Use `get_auth_tokens` to create temporary credentials so external reviewers can access data securely. You can list all available configurations with `list_configurations`, or check historical versions and drafts of any rule set using `list_configuration_versions`. If you're auditing your process, `get_extraction_statistics` returns metrics on how much data has been extracted over recent days, and `list_document_types` gives you an overview of every defined document type. You can also review all available reference documents for a specific type using `list_goldens`, or view the details of a particular saved configuration version with `get_configuration_version`.

The server handles everything from setup to final delivery. It allows you to pull data instantly on local files, run complex background jobs against remote URLs, and aggregate those results into spreadsheets ready for your team to use.

## Tools

### classify_async
Classifies a document asynchronously by determining its type (e.g., invoice, W-2).

### classify_sync
Classifies a document synchronously and returns the document type immediately.

### create_configuration
Creates a new rule set (configuration) used to guide data extraction for a specific document type.

### create_document_type
Establishes an entirely new category or type of document within the system.

### create_golden
Creates a 'Golden' reference document, which serves as the primary source for defining data structure and quality control.

### delete_configuration
Removes an existing data extraction configuration rule set.

### delete_configuration_version
Deletes a draft or unpublished version of a saved configuration.

### delete_document_type
Removes an entire document type definition from the system.

### delete_golden
Deletes a reference document used for setting standards or templates.

### extract_from_url_with_config
Asynchronously extracts data from a document at a remote URL, using the rules defined by a specific configuration ID.

### extract_from_url
Extracts data from any document hosted online, starting an asynchronous job.

### extract_portfolio_from_url
Asynchronously extracts data from multiple related documents (a portfolio) located at a single URL.

### extract_sync_with_config
Performs synchronous extraction on a local file (Base64) using an explicitly defined configuration ID.

### extract_sync
Extracts structured data instantly when you provide the document as a Base64 encoded string.

### extract_text_from_golden
Pulls all text lines and their exact coordinates from the master reference document for inspection.

### generate_csv
Compiles multiple JSON extraction results into a standard CSV spreadsheet file format.

### generate_excel
Compiles multiple JSON extraction results into a formatted Microsoft Excel workbook.

### generate_portfolio_upload_url
Generates a secure, temporary URL for uploading an entire portfolio of documents for batch processing.

### generate_upload_url_with_config
Generates an upload URL specifically for asynchronous processing that must use a defined configuration set.

### generate_upload_url
Creates a pre-signed URL for uploading a document to start an asynchronous extraction.

### get_auth_tokens
Creates temporary authorization credentials, allowing external reviewers to access the data securely.

### get_configuration
Retrieves details for a specific document extraction configuration rule set by its ID.

### get_configuration_version
Gets data about a particular saved version of an existing configuration.

### get_document_type
Retrieves all metadata and details about a specific document type definition.

### get_document
Fetches the final extraction results for a document using its unique ID.

### get_extraction_statistics
Returns metrics showing how much data has been extracted over recent days.

### get_golden
Retrieves metadata about the current reference document used for standardization.

### list_configuration_versions
Shows all historical versions and drafts of a configuration rule set.

### list_configurations
Lists all available configurations that apply to a specific document type.

### list_document_types
Provides an overview of every defined document type in the server system.

### list_extractions
Retrieves a paginated list of past extraction jobs, allowing you to track history and status.

### list_goldens
Shows all available reference documents defined for a specific document type.

### publish_configuration
Makes a specific version of a configuration active and usable by the agent in production environments.

### unassociate_golden
Removes a reference document from its current functional link to a specific configuration.

### update_configuration
Modifies an existing data extraction configuration rule set, adjusting the parsing logic.

### update_document_type
Changes the general metadata or rules for a document type definition.

### update_golden
Updates the metadata associated with a reference document without changing its core content.

## Prompt Examples

**Prompt:** 
```
Extract data synchronously from this Base64 PDF using the 'invoice' document type.
```

**Response:** 
```
I've processed the document using `extract_sync`. Here are the extracted fields: Invoice Number: INV-2023-001, Total Amount: $1,250.00, Due Date: December 15, 2023.
```

**Prompt:** 
```
Extract data from the document at 'https://example.com/tax_form.pdf' using the 'tax_1099' document type.
```

**Response:** 
```
I have initiated the asynchronous extraction using `extract_from_url`. The document has been submitted to Sensible for processing.
```

**Prompt:** 
```
Generate a pre-signed upload URL for a PDF document of type 'bank_statement'.
```

**Response:** 
```
I've generated the upload URL using `generate_upload_url`. You can upload your PDF directly to this secure endpoint to start the extraction process.
```

## Capabilities

### Classify Documents
Determines what kind of document you have (e.g., invoice, tax form) using synchronous or asynchronous classification tools.

### Extract Data from Local Files
Runs an extraction job instantly on a file provided as a Base64 string (`extract_sync`).

### Process Documents via URL
Starts background processing for documents hosted online, which is necessary for large volumes of files or external sources.

### Handle Document Portfolios
Extracts data from a group (portfolio) of related documents at a specific URL using `extract_portfolio_from_url`.

### Manage Data Schemas and Types
Allows you to create, update, and manage the rules (`create_configuration`) that dictate exactly what data points should be extracted from a given document type.

## Use Cases

### Processing Incoming Vendor Invoices
The Ops Engineer gets a batch of 50 PDF invoices attached to an email. Instead of downloading and opening each file, they have their agent call `generate_upload_url_with_config` and upload the files. Sensible runs the asynchronous extractions with the chosen configuration, and the agent collects structured data for all 50 invoices.

### Cleaning Historical Scanned Records
The Data Analyst has a folder of old, scanned tax forms (images). They run the agent to classify them first using `classify_async` to confirm the document type. Then they process the batch via an upload URL and use `generate_csv` to convert all records into a single CSV file for BI tools.

### Building a Contract Reviewer Tool
The Developer builds a workflow that first checks if a document is a contract using `classify_sync`. If it matches, the agent proceeds to call `extract_sync_with_config` to pull out specific clauses like 'Termination Date' and 'Governing Law', making the data ready for immediate use.
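The classify-then-extract gate described above might be sketched as follows. The classifier and extractor are passed in as plain callables so the example runs without a server; in a real agent they would route to `classify_sync` and `extract_sync_with_config`. The function name, the `"contract"` type label, and the returned field values are all placeholders.

```python
from typing import Callable, Optional


def review_if_contract(
    document_b64: str,
    classify: Callable[[str], str],
    extract: Callable[[str], dict],
    expected_type: str = "contract",
) -> Optional[dict]:
    """Run extraction only when classification matches the expected type."""
    detected = classify(document_b64)
    if detected != expected_type:
        return None  # not a contract, so skip extraction entirely
    return extract(document_b64)


# Stand-in callables with made-up results, used only to exercise the flow.
def fake_classify(document_b64: str) -> str:
    return "contract"


def fake_extract(document_b64: str) -> dict:
    return {"termination_date": "2025-01-31", "governing_law": "Delaware"}


print(review_if_contract("ZmFrZQ==", fake_classify, fake_extract))
```

Keeping the two calls behind injected functions also makes the gate easy to unit test before wiring it to live tools.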

### Standardizing Financial Data Feeds
A finance team needs to ensure all vendor invoices conform to one standard. They set up a reference document with `create_golden`, define the extraction rules with `create_configuration`, and then pass that configuration ID to their extraction calls so the output structure stays consistent.

## Benefits

- Stop manual data entry. Whether you process a single invoice or 10,000 tax forms, Sensible handles the extraction into predictable JSON records using `extract_sync` or `extract_from_url`.
- Build reliable pipelines by managing your schemas first. Use tools like `create_configuration` and establish 'Golden' reference documents via `create_golden` to ensure consistent data mapping every time.
- Handle scale with ease. Instead of running a job repeatedly, use the asynchronous URL methods (`extract_from_url`, `generate_upload_url`) for massive batch processing without timing out your agent.
- Turn raw data into usable assets instantly. After extraction, call `generate_excel` or `generate_csv`. Your JSON output immediately becomes a spreadsheet ready for analysis in Excel or Google Sheets.
- Manage complex document groups. If you have multiple related documents (like a Statement and an Appendix), use the portfolio tools—specifically `extract_portfolio_from_url`—to process them as one unit.

## How It Works

The bottom line: you set up the rules once, point your AI agent at the messy file, and get clean, structured data back.

1. First, use `list_document_types` or `get_document_type` to verify the required schema. If needed, you'll use tools like `create_configuration` and establish a 'Golden' reference document via `create_golden`.
2. Next, your agent calls an extraction tool—like `extract_from_url_with_config` for remote files or `extract_sync_with_config` for local data—passing the file and the target configuration ID.
3. Finally, Sensible returns JSON data. If you need a spreadsheet, call `generate_csv` or `generate_excel` to compile the results.
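Step 2 comes down to picking one of four extraction tools. A minimal sketch of that decision is shown below; the tool names come from the list above, while the helper name and the rule "anything starting with `http://` or `https://` is remote" are simplifying assumptions.

```python
from typing import Optional


def choose_extraction_tool(source: str, config: Optional[str] = None) -> str:
    """Pick an extraction tool name from the source kind and optional config."""
    # ASSUMPTION: a source is remote if it looks like an HTTP(S) URL;
    # anything else is treated as a Base64-encoded local document.
    is_remote = source.startswith(("http://", "https://"))
    base = "extract_from_url" if is_remote else "extract_sync"
    return f"{base}_with_config" if config else base


print(choose_extraction_tool("https://example.com/tax_form.pdf"))
print(choose_extraction_tool("JVBERi0xLjc=", config="my_config"))
```

The first call selects `extract_from_url` and the second selects `extract_sync_with_config`.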

## Frequently Asked Questions

**How do I process a PDF file that's already attached to an email?**
You should use `extract_sync` if the file is small enough to pass as Base64. If it's part of a large batch, generating an upload URL with `generate_upload_url` and having your agent process the attachment through that secure endpoint works better.

**What's the difference between `classify_async` and `classify_sync`?**
`classify_sync` returns the document type immediately, which is great for quick validation checks. `classify_async` is better if you are dealing with a massive batch of files and want to run classification in the background without blocking your workflow.

**Can I extract data from multiple different types of documents at once?**
You can use portfolio tools like `extract_portfolio_from_url`. This lets you process related files together, ensuring all necessary structured fields are extracted in one go.

**Which tool should I use to turn my JSON output into an Excel sheet?**
After the extraction is complete and you have the resulting JSON data, call `generate_excel`. It compiles your records directly into a usable spreadsheet format that's ready for sharing.

**Why do I need to use `create_golden` before extracting?**
The Golden record establishes the single source of truth and the optimal schema for your data points. Using it guarantees consistency, so even if a vendor changes their invoice layout slightly, your extraction rules stay accurate.

**How can I check the status or retrieve results using `get_document` after an asynchronous extraction job?**
You call `get_document(id)` to pull specific extraction results. This is critical for confirming successful processing, especially if an async job was slow or failed initially.
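One way to wait on an asynchronous job is a simple polling loop, sketched below. The fetch function is injected so the example runs offline. The `status` field and its values (`WAITING`, `PROCESSING`, `COMPLETE`) are assumptions about the response shape, and the helper name and document ID are hypothetical; adjust them to whatever `get_document` actually returns in your setup.

```python
import time
from typing import Callable


def wait_for_document(
    get_document: Callable[[str], dict],
    document_id: str,
    attempts: int = 5,
    delay_seconds: float = 1.0,
) -> dict:
    """Poll until the job leaves the assumed in-progress states."""
    for attempt in range(attempts):
        result = get_document(document_id)
        # ASSUMPTION: "WAITING" and "PROCESSING" mean the job is not done yet.
        if result.get("status") not in ("WAITING", "PROCESSING"):
            return result
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # pause before the next check
    raise TimeoutError(
        f"document {document_id} still processing after {attempts} attempts"
    )


# Stand-in responses simulate one in-progress check followed by completion.
responses = iter([{"status": "PROCESSING"}, {"status": "COMPLETE"}])
result = wait_for_document(lambda _id: next(responses), "doc-123", delay_seconds=0)
print(result["status"])
```

A failed job also ends the loop, so callers should inspect the returned status rather than assume success.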

**If I need temporary read-only access for external reviewers, what does the `get_auth_tokens` tool provide?**
It generates temporary authorization tokens. You can use these to give reviewers limited viewing access without handing over your main API key credentials.

**I refined my extraction rules; how do I apply those changes using `update_configuration`?**
You send the parameters via `update_configuration(id)`. This lets you revise an existing rule set without having to rebuild the entire configuration from scratch.

**Can I extract data from a document instantly if I have its Base64 representation?**
Yes! Use the `extract_sync` tool. Provide the document type and the Base64-encoded document bytes, and your agent will return the structured extraction results synchronously.

**How do I extract data from a document hosted at a public URL?**
You can use the `extract_from_url` tool. Simply provide the document type, the document URL, and the content type (e.g., application/pdf) to trigger an asynchronous extraction.

**Can I specify a custom configuration layout when extracting?**
Yes, you can target specific configurations by using the `extract_sync_with_config` or `extract_from_url_with_config` tools, which allow you to define the exact configuration name to use for parsing.