# Crawlbase MCP

> Crawlbase gives your AI agent full control over web data extraction. It handles complex sites, including JavaScript-rendered pages and major platforms like Amazon, LinkedIn, and Facebook. You can bypass security measures and capture structured data from almost any public website.

## Overview
- **Category:** friends-mcp
- **Price:** Free
- **Tags:** proxy, captcha-solving, html-extraction, headless-browser, data-collection, web-crawling

## Description

Need to get data off the web? This MCP connects Crawlbase directly to your AI client, letting you hand off tricky scraping jobs through natural conversation. Forget writing complex code or spending hours debugging anti-bot walls. You just ask for the data (the price list from a competitor's site, the profiles of key people on LinkedIn, or the specs of an Amazon product) and it handles the rest. It also reads content rendered by JavaScript and handles search results that change constantly. Because Vinkius hosts this MCP in its catalog, you connect once and your agent gets access to all these web scraping capabilities. You'll get clean JSON outputs, screenshots for validation, or full site crawls, all without writing a single line of code.

## Tools

### scrape_html
Performs basic web scraping, fetching a page's raw HTML content through datacenter proxies.

### scrape_js_rendered
Accesses and pulls data from modern websites that load their content dynamically using JavaScript.

### scrape_json_format
Converts complex, messy web data into clean, structured JSON objects.

### get_screenshot_link
Runs automated checks and returns a permanent URL to a visual snapshot of any target page.

### scrape_amazon
Extracts specific product details and data points from Amazon e-commerce listings.

### scrape_linkedin
Retrieves detailed professional profile information while working within LinkedIn's structural constraints.

### scrape_facebook
Retrieves structured information from active Facebook pages.

### scrape_google_serp
Collects structured data points from Google search results, bypassing CAPTCHAs.

### scrape_twitter
Fetches structured data points from Twitter (X) profiles and timelines.

### custom_scrape
Generates custom proxy endpoints that can be used for highly reliable, targeted data collection runs.

## Prompt Examples

**Prompt:** 
```
Scrape the price and features from this Amazon product: [Amazon URL]
```

**Response:** 
```
Amazon scraping complete! I've extracted the following JSON data: Title: 'Eco-Smart Watch', Price: '$199.00', Rating: '4.5 stars', Features: ['Waterproof', 'Sleep tracking', '10-day battery'].
```

**Prompt:** 
```
Get Google search results for 'best machine learning platforms 2024'
```

**Response:** 
```
I've extracted the Google SERP results for your query. Top organic links include 'Top 10 ML Platforms (Site A)', 'The Future of AI (Site B)', and 'Enterprise ML Guide (Site C)'. Would you like the full meta descriptions for these?
```

**Prompt:** 
```
Take a screenshot of https://example.com
```

**Response:** 
```
Screenshot requested! Crawlbase is generating the snapshot. You can access the rendered image at this link: [Crawlbase Screenshot URL].
```

## Capabilities

### Capture Web Screenshots
Run automated checks that generate permanent links to visual snapshots of any web page.

### Extract Structured JSON Data
Force raw website outputs into precise, structured JSON formats for immediate use by your agent.

### Scrape JavaScript Pages
Retrieve content from modern websites that load data dynamically using JavaScript.

### Target Social Networks
Specialized extraction tools for social networks like LinkedIn, Facebook, and Twitter (X), plus Amazon product listings.

### Analyze Search Results
Collect data from Google search results pages (SERPs) while bypassing CAPTCHAs.

### Create Custom Proxies
Generate and provision custom proxy endpoints with specific headers and crawling logic for high-availability requests.

## Use Cases

### Competitor Price Monitoring
A growth team needs daily price updates for five key products on Amazon. Instead of manually visiting each listing and entering data into a spreadsheet, they prompt their agent: 'Run `scrape_amazon` on these URLs.' They get a clean JSON file with all prices and ratings.
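The spreadsheet step can be sketched in a few lines. This is a minimal illustration, assuming the agent hands back a list of JSON objects with `Title`, `Price`, and `Rating` keys as in the example response earlier on this page. The helper name `records_to_csv` and the second sample row are hypothetical, and the real `scrape_amazon` output may use different field names.

```python
import csv
import io


def records_to_csv(records: list[dict], fields: list[str]) -> str:
    """Flatten scraped product records into CSV text, one row per product."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for record in records:
        # Missing fields become empty cells; extra fields are dropped.
        writer.writerow({field: record.get(field, "") for field in fields})
    return buffer.getvalue()


# First row mirrors the example response; the second is a made-up placeholder.
products = [
    {"Title": "Eco-Smart Watch", "Price": "$199.00", "Rating": "4.5 stars"},
    {"Title": "Placeholder Product", "Price": "$0.00"},
]

print(records_to_csv(products, ["Title", "Price", "Rating"]))
```

Filling missing fields with empty cells keeps every row the same width, so a listing that returns no rating does not break the daily import.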

### Talent Scouting
A recruiter needs to identify all professionals with specific titles from a list of companies. Instead of navigating dozens of LinkedIn profiles, they use the agent with `scrape_linkedin` to build a structured database of names and roles in minutes.

### Deep Web Research
A researcher needs data from an old university site that doesn't display content until its scripts run in the browser. They use the agent, which activates `scrape_js_rendered`, ensuring no hidden or dynamically loaded data points are missed.

### Search Engine Intelligence
A marketing professional needs to track how search results change over time. Instead of manually running Google searches and copying titles, they use the agent with `scrape_google_serp` for structured, repeatable data collection.
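Tracking change over time comes down to comparing two snapshots. The sketch below assumes you have already pulled an ordered list of result URLs out of two `scrape_google_serp` runs; the function name `diff_serp` and the example URLs are hypothetical, and the tool's actual output shape is not specified here.

```python
def diff_serp(previous: list[str], current: list[str]) -> dict:
    """Compare two ranked lists of result URLs (position 1 = top result)."""
    prev_rank = {url: pos for pos, url in enumerate(previous, start=1)}
    curr_rank = {url: pos for pos, url in enumerate(current, start=1)}
    return {
        # URLs that appear only in the newer snapshot.
        "new": [url for url in current if url not in prev_rank],
        # URLs that fell out of the results entirely.
        "dropped": [url for url in previous if url not in curr_rank],
        # URLs present in both snapshots whose position changed: (old, new).
        "moved": {
            url: (prev_rank[url], curr_rank[url])
            for url in current
            if url in prev_rank and prev_rank[url] != curr_rank[url]
        },
    }


# Placeholder URLs standing in for two days of results.
monday = ["https://a.example", "https://b.example", "https://c.example"]
tuesday = ["https://b.example", "https://a.example", "https://d.example"]

changes = diff_serp(monday, tuesday)
print(changes["new"])      # URLs first seen on Tuesday
print(changes["dropped"])  # URLs that disappeared
print(changes["moved"])    # URLs that changed position
```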

## Benefits

- Get structured data without scripting. Instead of writing complex Python code to handle different site structures, you just ask your agent for the JSON format using `scrape_json_format`.
- Handle dynamic sites easily. If a website loads its content via JavaScript—the kind of thing that breaks simple scrapers—this MCP uses specialized tools like `scrape_js_rendered` to get it anyway.
- Bypass security challenges. Stop hitting CAPTCHAs or rate limits; the system handles search engine discovery and proxy management, even giving you custom endpoints with `custom_scrape`.
- Target major platforms efficiently. Instead of manual copy-pasting from LinkedIn pages or Amazon listings, dedicated tools like `scrape_linkedin` and `scrape_amazon` pull out clean, specific data points.
- Validate your work instantly. Need proof the page was scraped correctly? Use `get_screenshot_link` to capture a visual snapshot of exactly what your agent saw on the target site.

## How It Works

You tell your AI client what web data you need, and it manages the infrastructure required to get it for you.

1. Subscribe to the Crawlbase MCP on Vinkius, then provide your Crawlbase Normal Token and, if you need JavaScript rendering, your JavaScript Token.
2. Ask your AI client to perform a task—for example, 'Get me the feature list for this Amazon product' or 'Scrape all users from this LinkedIn page'.
3. Your agent uses the necessary tool within the MCP to access the site, process the data (handling rendering and anti-bot measures), and return clean JSON or an image link.
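Under the hood, step 3 is an MCP tool call, which the protocol carries as a JSON-RPC 2.0 `tools/call` request. Your AI client builds this message for you, so the sketch below is only meant to show its shape. The `url` argument name and the sample Amazon address are assumptions; the real input schema of each tool is defined by the MCP server.

```python
import json
from itertools import count

# Each JSON-RPC request needs its own id so replies can be matched to it.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request, indent=2)


# Hypothetical argument name `url`; check the tool's schema for the real one.
message = build_tool_call(
    "scrape_amazon", {"url": "https://www.amazon.com/dp/EXAMPLE"}
)
print(message)
```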

## Frequently Asked Questions

**How does Crawlbase MCP handle JavaScript rendered content?**
It uses specialized tools like `scrape_js_rendered`. Rather than reading only the initial HTML, the tool waits for the page's JavaScript to finish loading data before extracting the information.

**Can I use scrape_google_serp with my AI agent?**
Yes. `scrape_google_serp` allows your agent to pull structured results from Google search pages, which makes SEO research repeatable without manual searching.

**Which tool should I use if the data is messy?**
If you get raw or inconsistent web output from any scraping attempt, run `scrape_json_format`. This forces the complex content into a predictable JSON structure your agent can work with.

**Is scrape_linkedin good for professional data collection?**
Yes. It’s designed to retrieve detailed profile information while respecting LinkedIn's structural constraints, making it reliable for building contact lists or talent databases.

**Before using `custom_scrape`, what credentials do I need to set up a custom proxy endpoint?**
You'll need your Crawlbase Normal Token. This token authenticates your connection and allows the agent to provision highly available custom proxy endpoints for your web crawling tasks.

**If my AI agent hits rate limits while using `scrape_html`, how does Crawlbase handle it?**
The MCP manages this by utilizing its specialized proxy list and dedicated algorithms. It handles IP rotation and includes CAPTCHA solving, keeping your data collection flowing even when sites try to block you.

**When I use `get_screenshot_link`, what is the purpose of capturing a web snapshot?**
The screenshot link generates a visual record of the page exactly as it appeared. This lets you validate the content extracted by other tools, confirming precisely what the headless engine saw before processing it into structured data.

**Does `scrape_facebook` handle complex or nested social page structures?**
Yes. This tool applies extraction rules specific to Facebook pages. It returns content from active pages while working around the constraints typical of large-scale social media scraping.

**When should I use the JavaScript (JS) Token versus the Normal Token?**
Use the Normal Token for fast, static HTML extraction. Switch to the JavaScript Token when the target site uses frameworks like React or Angular, where content is rendered dynamically in the browser. The `scrape_js_rendered` tool requires the JS Token to function.
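The rule above can be written as a small lookup. This is a sketch of the decision only, not of how the MCP is implemented; the names `CrawlbaseTokens`, `JS_TOKEN_TOOLS`, and `token_for` are hypothetical, and only `scrape_js_rendered` is confirmed here as needing the JS Token.

```python
from dataclasses import dataclass
from typing import Optional

# Tools assumed to need browser rendering. Only scrape_js_rendered is
# stated in this FAQ; add others if the server documents them.
JS_TOKEN_TOOLS = {"scrape_js_rendered"}


@dataclass(frozen=True)
class CrawlbaseTokens:
    normal: str
    javascript: Optional[str] = None  # only needed for JS-rendered pages


def token_for(tool_name: str, tokens: CrawlbaseTokens) -> str:
    """Return the token a given tool should authenticate with."""
    if tool_name in JS_TOKEN_TOOLS:
        if tokens.javascript is None:
            raise ValueError(f"{tool_name} needs a JavaScript Token")
        return tokens.javascript
    return tokens.normal


# Placeholder values, not real credentials.
tokens = CrawlbaseTokens(normal="NORMAL_TOKEN", javascript="JS_TOKEN")
print(token_for("scrape_html", tokens))
print(token_for("scrape_js_rendered", tokens))
```

Failing loudly when the JS Token is missing is a deliberate choice in this sketch: silently falling back to the Normal Token would return the unrendered page, which is the exact problem the JS Token exists to solve.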

**Can my agent bypass CAPTCHAs while scraping Google or LinkedIn?**
Yes. Crawlbase is built to handle CAPTCHAs and blocks natively. When you use specialized tools like `scrape_google_serp` or `scrape_linkedin`, the agent routes your requests through Crawlbase's advanced proxy infrastructure to ensure successful data extraction.

**How do I get a structured JSON response instead of raw HTML?**
Use the `scrape_json_format` tool or the specialized scraper tools (Amazon, LinkedIn, etc.). These trigger Crawlbase's auto-extraction pipelines, which analyze the page structure and return specific data fields in a clean JSON format.
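Once you have JSON back, it is worth checking that the fields you care about were actually filled in. The sketch below assumes the response is a single JSON object with keys like those in the Amazon example on this page; the helper name `missing_fields` is hypothetical, and real field names depend on the tool and the page.

```python
import json


def missing_fields(raw: str, required: list[str]) -> list[str]:
    """Return the required keys that are absent or empty in a JSON object."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    # Treat None, empty strings, and empty lists as "not extracted".
    return [name for name in required if data.get(name) in (None, "", [])]


# Sample payload modelled on the example response; Rating is left out on purpose.
raw_response = json.dumps(
    {"Title": "Eco-Smart Watch", "Price": "$199.00", "Features": ["Waterproof"]}
)

print(missing_fields(raw_response, ["Title", "Price", "Rating", "Features"]))
```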