# Unstructured MCP

> Unstructured MCP Server puts the raw data lifecycle in your AI client's hands. Connect it to see where documents come from (sources like S3 or SharePoint), inspect the processing rules, and run the workflows that send clean outputs to Vector DBs or SQL tables. It lets you operate document ingestion pipelines without opening a dashboard.

## Overview
- **Category:** ai-frontier
- **Price:** Free
- **Tags:** rag, data-ingestion, document-processing, etl, unstructured-data, pipeline-automation

## Description

This server covers your raw data lifecycle: the pipelines it manages take messy documents (PDFs, reports, whatever) and turn them into clean, structured data your AI client can actually use. Connect it to your agent and you can run document ingestion pipelines without opening some clunky dashboard and messing around with settings.

The workflows pull docs from sources like AWS S3 or Google Cloud Storage, apply the processing rules you've configured, and spit out clean records straight into Vector DBs or SQL tables. Your AI agent becomes a command center for inspecting and running Retrieval-Augmented Generation (RAG) pipelines using real data.

Here's what it does:

**Listing Data Sources and Targets:** You can check where your documents sit and where the clean output needs to go. The `list_data_sources` tool shows you every configured remote connector, whether that’s an AWS S3 bucket or a Google Cloud Storage location. Similarly, if you need to know what kind of databases are waiting for data, the `list_data_destinations` tool displays all target locations—that means specific Vector DB endpoints and SQL table definitions.

**Managing Pipelines:** To see what's possible, you first gotta list every defined end-to-end pipeline. The `list_processing_workflows` tool gives you a rundown of every existing document processing workflow. Once you know which pipelines exist, the `get_workflow_details` tool pulls up the configuration for any specific one and shows exactly how that data transformation is supposed to happen.

**Running and Tracking Jobs:** You don't wanna wait around watching things happen. The server lets you manually kick off a whole workflow run immediately using `trigger_workflow_execution`, and it returns a job ID so you know what started. To keep tabs on that, the `list_workflow_jobs` tool shows you both a history of completed tasks and any jobs that are still running. This log tells you if the sync finished successfully or if there was an error in the processing task.

This whole setup covers data ingestion end to end. Use `list_data_sources` to find your files on S3 or GCS, `get_workflow_details` to review the processing rules, `trigger_workflow_execution` to start the job, and `list_workflow_jobs` to confirm it finished. `list_data_destinations` shows where the clean output lands. It's a full loop: find it, check how it gets processed, run it, check its status.

## Tools

### get_workflow_details
Gets configuration details for a specific document processing pipeline workflow.

### list_data_destinations
Lists all configured target locations where processed data can be stored (Vector DBs, SQL).

### list_data_sources
Lists all configured remote connectors to find where documents are currently located (S3, GCS).

### list_processing_workflows
Lists every defined end-to-end pipeline that processes raw documents.

### list_workflow_jobs
Shows a history of all active and completed document processing tasks, including success/fail status.

### trigger_workflow_execution
Manually starts an immediate run of a defined processing workflow and returns a job ID.
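
Under the hood, an MCP client invokes any of these tools with a JSON-RPC 2.0 `tools/call` request. Here's a minimal Python sketch of what that message looks like. The `build_tool_call` helper and the `workflow_id` argument name are assumptions for illustration; the real argument names come from each tool's input schema as reported by the server.

```python
import json
from itertools import count
from typing import Any, Dict, Optional

_request_ids = count(1)  # each JSON-RPC request carries its own id


def build_tool_call(tool_name: str, arguments: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments or {}},
    }


# `workflow_id` is a hypothetical argument name, not confirmed by this page.
request = build_tool_call("trigger_workflow_execution", {"workflow_id": "wf_eng_onboarding"})
print(json.dumps(request, indent=2))
```

In practice your AI client builds and sends this message for you; the sketch only shows what travels over the wire.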

## Prompt Examples

**Prompt:** 
```
Show me all our active destination connectors.
```

**Response:** 
```
You have 3 active destinations configured in Unstructured. 
1. Pinecone Index (Production Knowledge Base).
2. MongoDB Atlas Vector Search.
3. AWS S3 (Raw JSON Output). 
Would you like me to check which workflows are currently sending data to the Pinecone index?
```

**Prompt:** 
```
List the historical processing jobs from today.
```

**Response:** 
```
I found 2 workflow jobs executed today:
- Job ID `wf_92jdfk`: Completed successfully at 08:30 AM (Ingested 450 PDFs from SharePoint).
- Job ID `wf_44klqp`: Failed at 11:15 AM (Error: timed out connecting to the Pinecone destination).
Would you like me to share more log details about the failed job?
```

**Prompt:** 
```
Trigger the engineering onboarding workflow.
```

**Response:** 
```
I have successfully triggered the workflow `wf_eng_onboarding`. The execution has started with Job ID `job_12bxc6`. It is currently processing files from your Google Drive source. Do you want me to monitor it and let you know when it's populated into the Vector DB?
```

## Capabilities

### List Available Data Sources
Retrieves a list of all configured external connectors, such as AWS S3 buckets or Google Cloud Storage locations.

### List Target Databases
Displays every destination where processed data can be sent, including specific Vector DBs and SQL endpoints.

### View Workflow Definitions
Retrieves the precise configuration details for any defined document processing pipeline.

### Start Data Ingestion Job
Immediately triggers a full workflow run to ingest and process documents from your specified sources.

### Track Job Status
Lists active and historical jobs, letting you monitor progress or check failure logs for document processing tasks.

## Use Cases

### Debugging an Intermittent Sync Failure
The MLOps team notices the vector store is incomplete. They ask their agent to run `list_workflow_jobs`. The agent finds a recent job that failed and reports it couldn't connect to Pinecone. This tells the engineer where the failure happened, so they can go straight to the destination connection (credentials, endpoint, network) instead of combing through the whole pipeline.
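
The triage step in this scenario boils down to filtering the job list for failures. A minimal Python sketch, assuming `list_workflow_jobs` returns a list of records with `id`, `status`, and `error` fields (those field names are hypothetical; the sample values mirror the prompt example above):

```python
from typing import Any, Dict, List

# Hypothetical shape of a `list_workflow_jobs` result; field names are assumptions.
jobs: List[Dict[str, Any]] = [
    {"id": "wf_92jdfk", "status": "completed", "error": None},
    {"id": "wf_44klqp", "status": "failed", "error": "timed out connecting to Pinecone"},
]


def failed_jobs(job_list: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the jobs whose status is 'failed'."""
    return [job for job in job_list if job.get("status") == "failed"]


for job in failed_jobs(jobs):
    print(f"{job['id']}: {job['error']}")
```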

### Setting up New Data Sources
A Product Manager needs to start indexing documents from a new department SharePoint site. They ask their agent to run `list_data_sources` to see whether a SharePoint connector is already configured, then use `get_workflow_details` to check whether an existing workflow reads from that source. If the connector is missing, it has to be created in the Unstructured dashboard.

### On-Demand Knowledge Base Update
A company releases a major policy update. Instead of waiting for the scheduled ETL run, the developer asks their agent to use `trigger_workflow_execution`. The job starts right away, pulling data from GCS and populating the vector DB as soon as processing finishes.

### Schema Validation Check
Before deployment, a developer uses `list_data_destinations` to confirm that the target SQL destination is configured. They then use `get_workflow_details` to check that the workflow's output matches what the destination table expects.
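
The comparison itself is a simple set difference. Here's a rough Python sketch, assuming you can pull a list of output field names from the workflow details and a list of column names for the destination table. The `output_fields` and `columns` keys are hypothetical, and the actual responses may not expose them in this form.

```python
from typing import Dict, Iterable, List


def schema_mismatches(output_fields: Iterable[str], table_columns: Iterable[str]) -> Dict[str, List[str]]:
    """Report fields the table lacks and columns the workflow never fills."""
    produced, expected = set(output_fields), set(table_columns)
    return {
        "not_in_table": sorted(produced - expected),
        "not_produced": sorted(expected - produced),
    }


# Hypothetical excerpts from `get_workflow_details` and the destination's table definition.
workflow_details = {"output_fields": ["id", "text", "embedding", "source_uri"]}
destination = {"name": "postgres-docs", "columns": ["id", "text", "embedding", "created_at"]}

report = schema_mismatches(workflow_details["output_fields"], destination["columns"])
print(report)
```

Two empty lists mean the workflow output and the table line up.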

## Benefits

- **Audit Pipelines:** Use `list_processing_workflows` and `get_workflow_details` to audit every step of your data flow without logging into the main dashboard. You see exactly how raw documents map to clean JSON records.
- **Debug Data Flow:** If a vector store is missing data, run `list_workflow_jobs`. This shows you if the job failed and gives you the ID needed to investigate the specific failure point.
- **Know Your Inputs/Outputs:** Run `list_data_sources` and `list_data_destinations` before building anything. They confirm which remote buckets (S3, GCS) are connected and ready to feed documents into your system, and which targets are set up to receive the results.
- **Test Immediately:** Don't wait for a cron job. Use `trigger_workflow_execution` to manually run the pipeline on demand, proving the whole stack works with one command.
- **Centralized View:** You get a unified view of data movement—from raw file storage (Source) → Processing rules (Workflow) → Final database (Destination)—all in your chat.

## How It Works

The bottom line is: you manage complex data flow from source identification through execution monitoring using only commands in your chat interface.

1. First, your agent uses `list_data_sources` to confirm the raw documents are available (e.g., a specific S3 bucket).
2. Next, it calls `get_workflow_details` to ensure the defined workflow correctly maps those sources to the desired destination (e.g., Pinecone or PostgreSQL).
3. Finally, the agent executes `trigger_workflow_execution`, starting the job and receiving a unique Job ID to track its completion status.
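
The three steps above can be sketched in Python. This is an illustration, not the server's actual client code: `call_tool` stands in for whatever your MCP client uses to invoke a tool, and the `workflow_id`, `destination`, and `job_id` names are assumptions about the payloads. A fake caller with canned responses keeps the sketch runnable offline.

```python
from typing import Any, Callable, Dict

ToolCaller = Callable[[str, Dict[str, Any]], Any]


def run_ingestion(call_tool: ToolCaller, workflow_id: str) -> str:
    """Confirm sources, inspect the workflow, trigger it, and return the job ID."""
    sources = call_tool("list_data_sources", {})
    if not sources:
        raise RuntimeError("no data sources are configured")

    details = call_tool("get_workflow_details", {"workflow_id": workflow_id})
    if not details.get("destination"):
        raise RuntimeError(f"workflow {workflow_id} has no destination")

    result = call_tool("trigger_workflow_execution", {"workflow_id": workflow_id})
    return result["job_id"]


def fake_call_tool(name: str, arguments: Dict[str, Any]) -> Any:
    """Stand-in for a real MCP client; every response shape here is hypothetical."""
    canned = {
        "list_data_sources": [{"name": "s3-raw-docs", "type": "s3"}],
        "get_workflow_details": {"id": arguments.get("workflow_id"), "destination": "pinecone-prod"},
        "trigger_workflow_execution": {"job_id": "job_12bxc6"},
    }
    return canned[name]


job_id = run_ingestion(fake_call_tool, "wf_eng_onboarding")
print(job_id)
```

Failing fast on a missing source or destination keeps you from triggering a job that can't finish.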

## Frequently Asked Questions

**How do I check if my S3 bucket is connected using list_data_sources?**
Run `list_data_sources`. It lists every configured remote connector, so if your S3 bucket shows up in the output, the connector is set up and recognized by the system.

**What does trigger_workflow_execution return when I run it?**
It returns a unique job ID. You must capture this ID to track the execution status using `list_workflow_jobs` later on.

**Can I see which destination databases are available with list_data_destinations?**
Yes, running `list_data_destinations` lists all configured target locations. You can confirm if Pinecone or MongoDB Atlas is set up to receive the processed data.

**How do I check if a past ingestion job failed using list_workflow_jobs?**
Run `list_workflow_jobs`. The output gives you a history of all runs, including success/fail status and the exact timestamp for quick debugging.

**What specific configuration data does `get_workflow_details` provide?**
It returns the full blueprint for a single workflow. You get details like required input sources, expected output destinations, and any custom steps or transformations needed before execution.

**I need to see all available pipelines; how does `list_processing_workflows` help?**
The function lists every end-to-end processing pipeline configured on your account. It gives you a quick overview of workflow names and their high-level purpose so you can choose the right one.

**How do I monitor the real-time status of an ongoing job using `list_workflow_jobs`?**
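Call `list_workflow_jobs` and look for entries that haven't finished yet. Active jobs show a status such as 'in_progress', while finished ones show 'completed' or 'failed'. Each call returns a snapshot, so call it again to refresh the picture.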
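
If you'd rather script the waiting, a polling loop does the job. A minimal Python sketch, assuming each job record has `id` and `status` fields (hypothetical names) and that 'completed' and 'failed' are the terminal statuses. `list_jobs` stands in for a call to `list_workflow_jobs`; the fake version below replays canned statuses so the example runs offline.

```python
import time
from typing import Any, Callable, Dict, List

TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(
    list_jobs: Callable[[], List[Dict[str, Any]]],
    job_id: str,
    interval_s: float = 5.0,
    max_polls: int = 60,
) -> str:
    """Poll the job list until the given job reaches a terminal status."""
    for _ in range(max_polls):
        status = next((j["status"] for j in list_jobs() if j["id"] == job_id), None)
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


# Fake job list: reports 'in_progress' twice, then 'completed'.
_statuses = iter(["in_progress", "in_progress", "completed"])


def fake_list_jobs() -> List[Dict[str, Any]]:
    return [{"id": "job_12bxc6", "status": next(_statuses)}]


final_status = wait_for_job(fake_list_jobs, "job_12bxc6", interval_s=0)
print(final_status)
```

The `max_polls` cap keeps a stuck job from blocking your script forever.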

**Can I pass specific parameters when calling `trigger_workflow_execution`?**
Yes. When triggering the workflow, include any input parameters it needs in the payload. This lets you target a specific directory path or file list for immediate processing.

**Can my AI agent trigger an immediate document processing job?**
Yes! If you have a workflow configured to pull files from an S3 bucket and load them into a Pinecone index, you can ask your agent to `trigger workflow XYZ`. It will start the execution and return the new Job ID, which you can use to track the progress.

**How can I verify if my RAG pipelines are failing or succeeding?**
Ask your agent to list your workflow jobs. It will securely connect to Unstructured's engine and return historical and active executions, displaying statuses such as 'completed', 'failed', or 'in_progress'. This is extremely useful for MLOps engineers diagnosing ingestion alerts directly in their terminal.

**Can I edit the destination database directly through the agent?**
This server is focused on auditing and executing your existing pipelines. Currently, you can list all connections (sources and destinations) and obtain their details, but creating or destructively modifying vector database connectors must be done inside the Unstructured dashboard for security.