# Octoparse MCP

> Octoparse MCP Server lets your AI client manage web scraping tasks directly in chat. It connects to Octoparse's API, giving you control over complex data extraction workflows with no manual exporting required. Your agent can list all task groups, check the current status of scrapers, start new extractions on demand, and pull records that haven't been marked as exported yet, all from the conversation. This tool turns your AI client into a dedicated data researcher.

## Overview
- **Category:** industry-titans
- **Price:** Free
- **Tags:** data-extraction, no-code, web-automation, cloud-scraping, structured-data, data-ingestion

## Description

**Octoparse MCP Server** lets your AI client run complex web scraping jobs right in chat. You don't have to click around in the Octoparse interface; you just tell your agent what data you need, and it handles every API call behind the scenes. You keep full control over extraction workflows without ever leaving your conversation window.

**Managing Tasks and Groups**
You can ask your agent to pull up a list of all defined task groups in your account using `list_task_groups`. To see the specific scrapers within those groups, use `list_tasks`; you can even filter that list down by a designated group ID. For any task, you can check its current operational status (for example, Running or Completed) by calling `get_task_status`.

**Controlling the Scraping Job**
When you need data, your agent can start the extraction process using `start_task`, firing up a cloud job for any specified task. If something goes sideways, don't sweat it; you can halt an active job immediately with `stop_task`. Once the scraping is done and the data sits waiting in the system, use `update_data_status` to manually mark specific records as exported, changing their status within Octoparse.

**Retrieving the Data**
The real power is getting the raw intel. To pull only the records that have been extracted but haven't been marked as exported yet, you call `get_new_data`. If you need a huge dataset and it comes back in chunks, you can use `get_task_data` to fetch structured data batches using an offset value, letting your agent paginate through massive result sets. This gives you granular control over exactly what data gets pulled directly into the chat.
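The offset-based paging described above can be sketched roughly as follows. This is a minimal illustration, not the server's documented interface: `call_tool` is a hypothetical stand-in for however your MCP client invokes a tool, and the argument names (`task_id`, `offset`, `size`) and the `records` key in the result are assumptions. It also assumes the offset counts records to skip; if the server returns a "next offset" value instead, pass that value along rather than incrementing.

```python
from typing import Any, Callable, Dict, Iterator, List

# Hypothetical stand-in for your MCP client's tool invocation:
# takes a tool name and an arguments dict, returns the parsed result.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def iter_task_data(
    call_tool: ToolCaller, task_id: str, size: int = 100
) -> Iterator[Dict[str, Any]]:
    """Yield every record of a task by paging through get_task_data."""
    offset = 0
    while True:
        # Argument names and the "records" key are assumptions for this sketch.
        result = call_tool(
            "get_task_data",
            {"task_id": task_id, "offset": offset, "size": size},
        )
        records: List[Dict[str, Any]] = result.get("records", [])
        if not records:
            break  # nothing left to fetch
        yield from records
        if len(records) < size:
            break  # a short batch means this was the last page
        offset += len(records)
```

Because this is a generator, the agent (or a script) can stop after the first batch and only fetch further pages when they are actually needed.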

## Tools

### get_new_data
Fetches all extracted records that have not been marked as exported for a specific task.

### get_task_data
Retrieves structured data from a specified scraping task using an offset value to handle large result sets.

### get_task_status
Returns the current operational status of any defined web scraping task.

### list_task_groups
Lists all existing, managed groups of scraping tasks within your Octoparse account.

### list_tasks
Retrieves a list of specific scraping tasks, optionally filtered by an associated group ID.

### start_task
Initiates the execution of a specified web scraping task in the cloud environment.

### stop_task
Halts an actively running web scraping task immediately.

### update_data_status
Manually marks a given set of data records as exported, changing their status within the system.

## Prompt Examples

**Prompt:** 
```
List all my scraping tasks in Octoparse.
```

**Response:** 
```
I've retrieved your tasks. You have 5 active scrapers including 'Amazon Monitor' and 'Real Estate Leads'. Which one would you like to start or retrieve data for?
```

**Prompt:** 
```
Start running my Amazon product scraping task and check its current status.
```

**Response:** 
```
Task "Amazon Electronics Scraper" (ID: tsk_8921) has been started successfully. Current status: Running. It is processing page 12 of an estimated 85 pages. 264 product records have been extracted so far. Estimated completion: approximately 45 minutes based on current crawl speed.
```

**Prompt:** 
```
Get the extracted data from my latest completed scraping task.
```

**Response:** 
```
Fetching results from task "Competitor Pricing Monitor" (ID: tsk_8905), completed 3 hours ago. Retrieved 1,247 records with fields: Product Name, Price, Rating, Review Count, and URL. The first batch of 100 records is ready. Shall I export the full dataset or retrieve the next page of results?
```

## Capabilities

### List All Task Groups
It lists every defined task group in your Octoparse account.

### Manage and List Tasks
You can list specific scraping tasks, optionally filtering them by a designated task group ID.

### Start/Stop Scraping Jobs
It initiates or halts cloud-based data extraction jobs on any specified task.

### Get Task Status Updates
You receive the current operational status (Running, Completed, etc.) of a scraping job.

### Retrieve New Data Records
It pulls records that have been extracted but haven't been marked as exported yet.

### Get Specific Task Data Batches
You fetch structured data from a task using an offset, allowing for pagination of results.

## Use Cases

### Monitoring Competitor Pricing Shifts
A market researcher needs to know if a competitor changed their pricing page. They prompt their agent: 'Start the Amazon Monitor task and check its status.' The agent runs `start_task`, reports progress using `get_task_status`, and then, once the task is complete, pulls all new records using `get_new_data`. Problem solved without opening a browser.

### Debugging Data Pipelines
A developer runs a scraper but suspects some records still carry the wrong export flag. They use the agent to run `list_tasks` first, verify the task ID, and then call `update_data_status` to mark a batch of records as exported, so that subsequent pulls via `get_new_data` return only records that haven't been processed yet.

### Comprehensive Data Audit
A data analyst needs an overview of all scraping projects. They ask the agent to run `list_task_groups`, getting a full map of available scrapers. Then, they can individually check each group using `list_tasks` before deciding which one to kick off via `start_task`.
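As a rough sketch of that audit flow, the snippet below builds a map from each task group to the tasks inside it. `call_tool` is a hypothetical stand-in for your MCP client's tool invocation, and the `group_id` argument plus the `groups`, `tasks`, `id`, and `name` fields are assumed response shapes, not documented ones.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for your MCP client's tool invocation.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def map_groups_to_tasks(call_tool: ToolCaller) -> Dict[str, List[str]]:
    """Return {group name: [task names]} for every task group."""
    overview: Dict[str, List[str]] = {}
    # The "groups" and "tasks" keys are assumptions for this sketch.
    groups = call_tool("list_task_groups", {}).get("groups", [])
    for group in groups:
        tasks = call_tool("list_tasks", {"group_id": group["id"]}).get("tasks", [])
        overview[group["name"]] = [task["name"] for task in tasks]
    return overview
```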

### Handling Large Datasets
The agent fetches results from the 'Competitor Monitor' task. Instead of receiving a massive data dump, it uses `get_task_data` with an offset parameter to pull the first 100 records, keeping the conversation manageable and actionable.

## Benefits

- You control task flow without leaving your agent. Instead of opening Octoparse, you simply ask to `list_tasks` or check status with `get_task_status`. This keeps your entire research process centralized in one window.
- Data retrieval is smarter and faster. Don't manually export CSVs; use `get_new_data` to pull only the records that are ready for review, filtering out already processed data points.
- Full automation of complex jobs. Need to monitor a competitor? Use your AI client to execute `start_task` on demand and then check progress with `get_task_status`, all in one continuous conversation thread.
- Granular control over the dataset lifecycle. If you need to manually update records, use `update_data_status`. This capability lets you manage data flags right where your AI agent is working.
- Streamlined data access for analysts. When a task is done and you need results, don't just download everything. Use `get_task_data` with offsets to pull specific batches of records directly into the chat context.

## How It Works

The bottom line is you manage complex web scraping processes by talking to your AI client instead of navigating multiple dashboards.

1. Subscribe to the server and provide your Octoparse OpenAPI Access Token in the settings.
2. You send your AI client a conversational prompt (e.g., 'Start the pricing monitor task').
3. The agent routes that request through the matching tool (here, `start_task`) and returns the results, status updates, or data directly to the chat.
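Under the hood, step 3 is a Model Context Protocol `tools/call` request, which is a JSON-RPC 2.0 message. The sketch below shows what such a message looks like; the `task_id` argument name is an assumption, since the server's actual input schema is not listed here.

```python
import json
from typing import Any, Dict


def build_tool_call(request_id: int, tool: str, arguments: Dict[str, Any]) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# "task_id" is an assumed argument name, used for illustration only.
print(build_tool_call(1, "start_task", {"task_id": "tsk_8921"}))
```

In practice your AI client builds and sends this message for you; the sketch is only meant to show what "routing a request through a tool" amounts to.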

## Frequently Asked Questions

**How do I get the status of my scraping task using Octoparse MCP Server?**
Use `get_task_status`. This tool returns the current operational state (Running, Completed, Stopped) of your specified task ID. It's the first check you should always run.

**Can I only get new data using Octoparse MCP Server?**
No. While `get_new_data` pulls non-exported records, you also use `get_task_data` if you need to paginate through large datasets or retrieve specific batches by offset.

**What happens when I use start_task? Does it run forever?**
No. `start_task` only initiates the job; it then runs in the cloud until it finishes or you stop it. Check progress by calling `get_task_status` repeatedly, and call `stop_task` to halt the job if it goes off track.
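A start-then-poll loop along these lines might look like the sketch below. `call_tool` is a hypothetical stand-in for your MCP client's tool invocation, the `task_id` argument and the `status` field are assumed names, and the timeout that triggers `stop_task` is a design choice of this example rather than server behavior.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical stand-in for your MCP client's tool invocation.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]

# Status names taken from this page; the server may report others.
TERMINAL_STATUSES = {"Completed", "Stopped"}


def run_and_wait(
    call_tool: ToolCaller,
    task_id: str,
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Start a task, poll its status, and stop it if the timeout passes."""
    call_tool("start_task", {"task_id": task_id})
    deadline = clock() + timeout_seconds
    while True:
        # The "status" key is an assumption for this sketch.
        status = call_tool("get_task_status", {"task_id": task_id}).get("status", "")
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            call_tool("stop_task", {"task_id": task_id})
            return "Stopped"
        sleep(poll_seconds)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting; the defaults use the standard library.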

**Do I need to manually export data after scraping with Octoparse MCP Server?**
No. The whole point is that your AI agent interacts directly with the API. You can pull and filter results in chat using `get_new_data` or `get_task_data`, bypassing manual exports.

**How do I authenticate my connection to the Octoparse MCP Server?**
You connect by entering your OpenAPI Access Token. You need this token from your Octoparse profile settings to manage web scrapers through your AI client.

**How do I get all historical data using the `get_task_data` tool?**
Call `get_task_data` repeatedly, advancing the offset parameter each time. This lets you page through every record in a task, not just the first batch.

**What is the purpose of using the `update_data_status` tool?**
This tool marks data records as exported within Octoparse. Once records are marked, `get_new_data` no longer returns them, so you don't retrieve the same data repeatedly.
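One way to combine `get_new_data` and `update_data_status` is sketched below: fetch the unexported records, hand them to your own processing step, and only then mark them as exported. `call_tool` is a hypothetical stand-in for your MCP client's tool invocation, and the `task_id` argument and `records` key are assumptions; the real tool may require additional arguments to select which records to mark.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for your MCP client's tool invocation.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def drain_new_data(
    call_tool: ToolCaller,
    task_id: str,
    handle: Callable[[List[Dict[str, Any]]], Any],
) -> int:
    """Process unexported records once, then mark them as exported."""
    # The "records" key is an assumption for this sketch.
    records = call_tool("get_new_data", {"task_id": task_id}).get("records", [])
    if not records:
        return 0
    handle(records)
    # Mark only after handling succeeded, so a failure above leaves
    # the records available for the next get_new_data call.
    call_tool("update_data_status", {"task_id": task_id})
    return len(records)
```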

**How do I see all available scraping setups using `list_task_groups`?**
`list_task_groups` retrieves a comprehensive list of all managed task groups. You use these IDs to filter and locate specific sets of tasks when you need them.

**Can my AI automatically find the latest extracted data for a specific task?**
Yes! Use the `get_new_data` tool with the Task ID. Your agent will respond with the records that have been extracted but haven't been marked as exported yet.

**How do I find my Octoparse OpenAPI Access Token?**
Log in to Octoparse, navigate to the **OpenAPI** section in your profile or developer portal, and follow the instructions to generate a Bearer token using your account credentials.

**Can I start a scraper via the AI?**
Absolutely. Use the `start_task` tool with your Task ID. The AI will command Octoparse to begin the extraction in the cloud immediately.