# Apify MCP

> Apify MCP connects your AI agent directly to a full-stack web scraping platform. You can list available scrapers, run bots asynchronously or synchronously, and pull structured data records in raw JSON format, all through natural conversation.

## Overview
- **Category:** ship-it
- **Price:** Free
- **Tags:** web-automation, data-extraction, proxy-services, headless-browser, datasets, api-integration

## Description

This connector lets you direct complex data extraction workflows entirely through chat. Instead of setting up API keys and running scripts locally, your agent talks to the Apify platform: it finds the right scraper bot, runs it, monitors its progress, and pulls the resulting structured datasets into your context window. You can even tell a running job to crawl new pages you discover along the way. Because Vinkius hosts this MCP in the catalog, you connect once from any compatible client and can run these web tasks without touching a command line.

## Tools

### abort_run
Stops a running Apify scraper job immediately if the scrape is going off track or if enough data has already been collected.

### get_account_limits
Checks your current consumption and subscription limits to make sure you don't hit an overage charge.

### get_dataset_items
Exports the structured JSON data from a completed Apify dataset, with a page limit so that large bulk downloads can be fetched in chunks.

### get_key_value_store
Retrieves miscellaneous files related to a run, like screenshots or configuration settings used during scraping.

### get_run
Checks the status and metadata of an active scrape job so you know if it's still running or if it failed.

### list_actors
Shows all scraper bots available in your account, including their IDs and default settings for triggering a run.

### list_webhooks
Lists the external systems that get notified when an actor run succeeds or fails.

### push_to_queue
Tells a currently running scraper to add new URLs it discovers to its list of pages to crawl next.

### run_actor_sync
Runs a short-lived scraper and makes your agent wait until it finishes before giving you the results.

### run_actor
Starts an Apify scraper bot in the background using custom input settings, returning immediately with a job ID you can track later.

## Prompt Examples

**Prompt:** 
```
List all the Apify actors available on my account.
```

**Response:** 
```
I've scanned your Apify actor catalog. You possess 3 main actors: 'ecommerce-spider-v1' and two public templates 'google-search-scraper' and 'instagram-scraper'. Shall I print their base configurable inputs?
```

**Prompt:** 
```
Verify the status of run 'qKpwH9LgC3r0Xm' and show me its final dataset if finished.
```

**Response:** 
```
I checked run `qKpwH9LgC3r0Xm`. Its current status is `SUCCEEDED`. The run consumed 0.045 Compute Units. I have successfully downloaded the associated dataset (`ds_8aBxP_...`) which yielded 302 product profiles. I'm injecting the top 5 parsed items below.
```

**Prompt:** 
```
How are our compute usage limits tracking this current month on Apify?
```

**Response:** 
```
I pulled your overall compute records. Right now on the 'Scale' plan, your account has used 82 out of 100 Compute Units (CU). You've also consumed significant proxy bandwidth (1.8TB/2TB), so both are past the 80% usage threshold. I recommend holding off on further large scraping runs until renewal happens on the 10th.
```

## Capabilities

### Discover available scrapers
Lists every scraper bot (Actor) configured in your Apify account so you know what data is accessible.

### Initiate scraping jobs
Starts a web scrape, either blocking until a short job finishes or running it in the background for long-term monitoring.

### Pull structured data
Retrieves the full dataset of scraped records as JSON objects after a job completes.

### Control ongoing jobs
Allows you to stop runaway scrapes or tell an active scraper to crawl new URLs it finds.

### Check system limits
Gives you a status report on your account's usage, including compute unit consumption and proxy bandwidth.

## Use Cases

### Monitoring competitor pricing changes
A market researcher needs to track product prices daily. They use `list_actors` to find the correct price scraper, then use `run_actor` to start each day's job in the background. When the results arrive, they ask the agent to format the latest data into a summary table.

### Crawling deeply linked product catalogs
An AI developer needs to scrape an entire website section that involves clicking 'next page' buttons. They use `run_actor` and then follow up by calling `push_to_queue`, telling the running process exactly which newly found URLs it must crawl.

### Checking for data completeness
A data engineer runs a scrape and is worried about missing metadata. They use `get_key_value_store` to pull down any attached screenshots or configuration files from the job, ensuring they have all the necessary audit details.

### Verifying service health after failure
A client runs a large scrape and it fails. They use `get_run` first to see the exact error status, then check `list_webhooks` to confirm if external systems were supposed to get notified about the failure.

## Benefits

- You skip writing boilerplate Python. Instead of scripting the start-and-poll loop yourself, you ask your agent to check the status; it calls `run_actor` and polls `get_run` for you, making the whole process conversational.
- Data retrieval is simple: Once the scrape finishes, use `get_dataset_items` to pull all structured data directly into the chat context. It handles massive JSON exports for you.
- You maintain control over expensive jobs. If a scraper starts going wild or runs past its usefulness, you can hit 'stop' using `abort_run`, saving compute units and time.
- The system is resilient because you don't have to manually manage state. The agent tracks the job ID from `run_actor` and knows when to query for results using `get_dataset_items`.
- It helps with governance too. Before you start anything huge, check `get_account_limits`. You don't want a runaway scrape wiping out your budget because you forgot about it.

## How It Works

The bottom line is that you treat complex web scraping like a conversation; the AI handles all the underlying API calls and state management for you.

1. First, connect your AI client to the Apify MCP and tell it which scraper bot (Actor) you need for the job.
2. Next, trigger the scrape. If it's a big job, run it asynchronously; if it's quick, let the agent wait until it confirms completion.
3. Finally, ask your agent to pull the data. It will retrieve the structured records and inject them directly into the chat context for you to read or process.
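
The steps above can be sketched as a small start, poll, and collect loop. This is only an illustration of what the agent does on your behalf, not the connector's actual implementation: `call_tool` is a hypothetical callable standing in for an MCP client, and the argument names (`actor_id`, `input`, `run_id`, `dataset_id`) and response fields (`id`, `status`, `dataset_id`) are assumptions. The status strings are the terminal run statuses Apify reports.

```python
import time
from typing import Any, Callable, Dict, List

# Hypothetical signature: call_tool(tool_name, arguments) -> parsed result.
ToolCaller = Callable[[str, Dict[str, Any]], Any]

# Apify run statuses that mean the run has stopped.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED-OUT", "ABORTED"}


def run_and_collect(
    call_tool: ToolCaller,
    actor_id: str,
    run_input: Dict[str, Any],
    poll_seconds: float = 5.0,
    max_polls: int = 120,
) -> List[Dict[str, Any]]:
    """Start an Actor, wait for it to stop, and return its dataset items."""
    # Step 2: start the Actor in the background and remember its run ID.
    run = call_tool("run_actor", {"actor_id": actor_id, "input": run_input})
    run_id = run["id"]  # assumed response field

    # Poll get_run until the run reaches a terminal status.
    for _ in range(max_polls):
        info = call_tool("get_run", {"run_id": run_id})
        if info["status"] in TERMINAL_STATUSES:
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"run {run_id} still active after {max_polls} polls")

    if info["status"] != "SUCCEEDED":
        raise RuntimeError(f"run {run_id} ended with status {info['status']}")

    # Step 3: pull the structured records.
    return call_tool("get_dataset_items", {"dataset_id": info["dataset_id"]})
```

Passing the tool caller in as a parameter keeps the sketch independent of any particular MCP client library and makes it easy to exercise with a stub.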

## Frequently Asked Questions

**How do I list all available scrapers using Apify MCP?**
You use the `list_actors` tool. This shows you every scraper bot (Actor) you have access to, giving you their IDs so you know exactly what job they're built for.

**What is the difference between run_actor and run_actor_sync?**
The difference is timing. `run_actor` starts a background process, which is best for long jobs because it returns immediately. `run_actor_sync` blocks your agent until the job finishes; use this only for very short tasks (under five minutes).
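
As a rough illustration of that rule of thumb, the choice can be written as a one-line decision. The helper name and the idea of estimating runtime up front are hypothetical; only the five-minute cutoff comes from the answer above.

```python
# "Under five minutes" cutoff from the answer above, in seconds.
SYNC_LIMIT_SECONDS = 5 * 60


def choose_run_tool(estimated_seconds: float) -> str:
    """Return the tool name suited to a job of the estimated length."""
    if estimated_seconds < SYNC_LIMIT_SECONDS:
        return "run_actor_sync"  # short job: block until it finishes
    return "run_actor"  # long job: start in the background and poll later
```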

**How do I get the final data from an Apify run?**
After a scrape is complete, use `get_dataset_items`. This pulls all the structured records and provides them to your agent in usable JSON format.
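
For large datasets, the records usually have to be fetched page by page. The sketch below shows one way to do that, assuming `get_dataset_items` accepts offset and limit style arguments; the exact parameter names of the tool are not documented here, so `fetch_page` is a hypothetical wrapper you would write around the tool call.

```python
from typing import Any, Callable, Dict, Iterator, List

# Hypothetical wrapper: fetch_page(offset, limit) -> list of records.
PageFetcher = Callable[[int, int], List[Dict[str, Any]]]


def iter_dataset_items(
    fetch_page: PageFetcher, page_size: int = 100
) -> Iterator[Dict[str, Any]]:
    """Yield every record, requesting at most page_size records per call."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return  # nothing left to read
        yield from page
        if len(page) < page_size:
            return  # a short page means this was the last one
        offset += len(page)
```

Yielding records lazily lets the caller stop early, for example after the first few items, without downloading the whole dataset.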

**Can I stop a running scrape with abort_run?**
Yep. You can call `abort_run` anytime you need to halt a job, which is critical if the scraper starts pulling junk data or exceeds your budget.
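
A minimal sketch of the stopping rule an agent might apply before calling `abort_run`. The function name, its parameters, and the thresholds are all hypothetical; in practice the numbers would come from `get_run` and your own targets.

```python
def should_abort(
    items_collected: int,
    target_items: int,
    compute_used: float,
    compute_budget: float,
) -> bool:
    """Return True when the run has enough data or has used up its budget."""
    enough_data = items_collected >= target_items
    over_budget = compute_used >= compute_budget
    return enough_data or over_budget
```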

**How do I check my compute unit usage using get_account_limits?**
It immediately reports your current consumption against your subscription cap. This tool monitors both compute units and proxy bandwidth, helping you prevent unexpected overage charges on large scraping jobs.
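
One way to turn that report into a pre-flight check is shown below. The function names, the 90% threshold, and the idea of estimating a job's cost in advance are assumptions made for illustration; the fields actually returned by `get_account_limits` may differ.

```python
def usage_ratio(used: float, limit: float) -> float:
    """Fraction of the subscription cap already consumed."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return used / limit


def safe_to_start(
    used: float, limit: float, estimated_cost: float, threshold: float = 0.9
) -> bool:
    """True if running the job keeps projected usage at or below the threshold."""
    return usage_ratio(used + estimated_cost, limit) <= threshold
```

With 82 of 100 Compute Units already used, a job estimated at 5 units stays under a 90% threshold, while one estimated at 10 units does not.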

**What information does get_run provide about an active scraping job?**
This tool provides the run's current status, metadata, and consumption details. You can poll it to track whether a long-running scrape is still running or has completed successfully.

**When should I use get_key_value_store instead of getting dataset items?**
Use it for non-structured files like screenshots, configuration inputs, or raw HTML snapshots. The key-value store holds arbitrary binary and text data linked to a specific run ID.

**How does list_webhooks help with automated workflows?**
This tool lists all configured webhooks, which enable external systems to react when an actor run succeeds or fails. Reviewing them helps you confirm that an event-driven workflow is wired up before you rely on it.

**How can the AI agent run a scrape on a list of product URLs?**
First, find your specific scraping Actor ID via `list_actors`. Then, prompt your agent to execute `run_actor`, providing the target URLs formatted as a structured JSON input payload. The call returns a run ID. You can poll this run via `get_run`, and once it succeeds, the agent calls `get_dataset_items` to pull all acquired data into your context window.
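
A sketch of what such an input payload could look like. Each Actor defines its own input schema, so the `startUrls` key used here is an assumption (it is a common convention among crawling Actors, not a guarantee), and `build_run_input` is a hypothetical helper.

```python
from typing import Dict, Iterable, List


def build_run_input(urls: Iterable[str]) -> Dict[str, List[Dict[str, str]]]:
    """Turn a list of product URLs into a JSON-serializable input payload."""
    start_urls = []
    for raw in urls:
        url = raw.strip()
        if not url:
            continue  # skip blank entries
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"not an absolute HTTP(S) URL: {url!r}")
        start_urls.append({"url": url})
    # Assumed key: check the chosen Actor's input schema before relying on it.
    return {"startUrls": start_urls}
```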

**Can the agent interact with run configurations mid-way during crawling?**
Yes. If an Apify crawler is currently executing and utilizes a Request Queue, you can instruct your agent to call `push_to_queue`. Doing so dynamically ships new URLs to the active queue instance, extending the current web crawl without needing to stop or restart the Actor.
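
The sketch below shows one way to avoid queuing the same page twice when feeding newly discovered URLs to a running crawl. `call_tool` is a hypothetical stand-in for an MCP client, and the `run_id` and `url` argument names for `push_to_queue` are assumptions.

```python
from typing import Any, Callable, Dict, Iterable, List, Set
from urllib.parse import urldefrag

# Hypothetical signature: call_tool(tool_name, arguments) -> parsed result.
ToolCaller = Callable[[str, Dict[str, Any]], Any]


def push_new_urls(
    call_tool: ToolCaller,
    run_id: str,
    discovered: Iterable[str],
    already_queued: Set[str],
) -> List[str]:
    """Push only URLs that have not been queued yet; return the ones pushed."""
    pushed = []
    for raw in discovered:
        # Drop the #fragment so the same page is not queued under two names.
        url, _fragment = urldefrag(raw.strip())
        if not url or url in already_queued:
            continue
        call_tool("push_to_queue", {"run_id": run_id, "url": url})
        already_queued.add(url)
        pushed.append(url)
    return pushed
```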

**Can my AI automatically detect scraping timeouts and debug the failure?**
Absolutely. Because your agent can track a run with `get_run`, it knows when the run transitions to a `TIMED-OUT` or `FAILED` state. You can then ask the agent to examine the log outputs in the key-value store to identify the underlying issue, such as a captcha block or a blocked proxy.
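
That decision can be sketched as a simple mapping from run status to the next tool to call. The function name is hypothetical and the mapping is one possible policy, not the connector's built-in behavior; the status strings are the ones Apify reports for runs.

```python
# Run statuses after which there is something to debug.
FAILURE_STATUSES = {"FAILED", "TIMED-OUT", "ABORTED"}


def next_tool_for(status: str) -> str:
    """Pick the tool an agent might call next for a given run status."""
    if status == "SUCCEEDED":
        return "get_dataset_items"  # results are ready to pull
    if status in FAILURE_STATUSES:
        return "get_key_value_store"  # inspect stored outputs for the cause
    return "get_run"  # still in progress: keep polling
```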