# Web Scraper MCP

> Web Scraper is an MCP that gives your AI agent direct read access to live web pages. It lets your agent pull clean, usable text from any URL, stripping away ads and site clutter. You can also extract structured metadata like titles and links, or crawl a documentation site, up to ten pages per crawl.

## Overview
- **Category:** developer-tools
- **Price:** Free
- **Tags:** web-crawling, markdown-conversion, data-extraction, reader-view, content-parsing, url-fetching

## Description

Stop letting your agent guess facts. This MCP connects your AI client directly to the public internet, giving it a real-time source of information. Instead of hallucinating, your agent reads live articles, parses complex technical documentation, and pulls clean text from any link you provide. You can convert cluttered webpages into pristine Markdown using Mozilla Readability logic. Need to compare sources? Use the batch reading feature to pull up to ten different URLs simultaneously. For developers, this means pointing your agent at a new library's API docs and having it write code based on the absolute latest syntax. The whole catalog of tools is hosted and managed by Vinkius, so you connect once and get access to all web data capabilities.


## Tools

### read
Pulls any public webpage into clean Markdown format, stripping away ads and clutter for readable content.
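
To give a feel for what "reader view" conversion involves, here is a deliberately crude sketch using only Python's standard library. It is not Mozilla Readability and not this MCP's implementation: it simply drops a fixed set of clutter tags and emits headings, paragraphs, and list items as Markdown. The names `ReaderParser` and `to_markdown` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as clutter. Real Readability uses scoring heuristics.
SKIP = {"script", "style", "nav", "header", "footer", "aside"}
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
BLOCKS = set(HEADINGS) | {"p", "li"}


class ReaderParser(HTMLParser):
    """Collects text from headings, paragraphs, and list items outside clutter tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0   # > 0 while inside a clutter tag
        self._prefix = None    # Markdown prefix of the block being read, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif tag in BLOCKS and self._skip_depth == 0:
            self._prefix = HEADINGS.get(tag, "- " if tag == "li" else "")
            self._buf = []

    def handle_endtag(self, tag):
        if tag in SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCKS and self._prefix is not None:
            # Collapse runs of whitespace left over from the HTML source.
            text = " ".join("".join(self._buf).split())
            if text:
                self.blocks.append(self._prefix + text)
            self._prefix = None

    def handle_data(self, data):
        if self._skip_depth == 0 and self._prefix is not None:
            self._buf.append(data)


def to_markdown(html: str) -> str:
    parser = ReaderParser()
    parser.feed(html)
    return "\n\n".join(parser.blocks)
```

A production reader would also handle links, code blocks, and nested structure, which this sketch ignores.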

### extract
Gathers structured metadata from a page, pulling out the title, description, OG tags, and link count.
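
As a rough illustration of the kind of parsing this involves, the sketch below collects a title, description, Open Graph tags, and a link count from an HTML string using Python's standard library. It is an assumption about the general technique, not the tool's actual code; `MetadataParser` and `extract_metadata` are hypothetical names.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects title, meta description, Open Graph tags, and a link count."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.og = {}
        self.link_count = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            content = attr.get("content") or ""
            name = (attr.get("name") or "").lower()
            prop = attr.get("property") or ""
            if name == "description":
                self.description = content
            elif prop.startswith("og:"):
                self.og[prop] = content
        elif tag == "a" and attr.get("href"):
            # Only anchors that actually carry an href count as links.
            self.link_count += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html: str) -> dict:
    parser = MetadataParser()
    parser.feed(html)
    return {
        "title": parser.title.strip(),
        "description": parser.description,
        "og": parser.og,
        "link_count": parser.link_count,
    }
```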

### list_links
Pulls every single outbound hyperlink found across an entire web page's source code.
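
Link listing generally means collecting every `href`, resolving it against the page URL, and separating same-site links from external ones. The sketch below shows that idea with the standard library; the function name `list_links`, its return shape, and the internal/external split are assumptions for illustration, not the tool's documented output.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects raw href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def list_links(html: str, base_url: str) -> dict:
    parser = LinkParser()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    internal, external, seen = [], [], set()
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parts = urlparse(absolute)
        # Skip non-web schemes (mailto:, javascript:) and duplicates.
        if parts.scheme not in ("http", "https") or absolute in seen:
            continue
        seen.add(absolute)
        if parts.netloc == base_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return {"internal": internal, "external": external}
```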

### batch_read
Fetches and processes content from up to ten different URLs simultaneously for comparison or summary.
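
One common way to fetch a small batch in parallel is a thread pool with a hard cap on batch size. The sketch below assumes that approach; `batch_fetch`, `MAX_BATCH`, and the injected `fetch` callable are hypothetical, and the hosted tool may work differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

MAX_BATCH = 10  # mirrors the ten-URL limit described above


def batch_fetch(urls: list[str], fetch: Callable[[str], str]) -> dict[str, str]:
    """Fetch every URL concurrently; one failure does not abort the batch."""
    if len(urls) > MAX_BATCH:
        raise ValueError(f"at most {MAX_BATCH} URLs per batch, got {len(urls)}")
    if not urls:
        return {}

    def safe_fetch(url: str) -> str:
        try:
            return fetch(url)
        except Exception as exc:  # report the error in place of the content
            return f"ERROR: {exc}"

    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        results = list(pool.map(safe_fetch, urls))  # map preserves input order
    return dict(zip(urls, results))
```

Here `fetch` is passed in so the sketch stays testable offline; in practice it could be a small wrapper around `urllib.request.urlopen`.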

### crawl
Automatically crawls a website starting at a given URL, mapping out the content of subsequent pages (up to 10).
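
A page-capped crawl is typically a breadth-first traversal that stops once the limit is reached. The sketch below assumes breadth-first order and a same-host restriction, neither of which the description above confirms; `crawl` and the injected `get_links` callable are hypothetical names.

```python
from collections import deque
from typing import Callable
from urllib.parse import urlparse


def crawl(start_url: str,
          get_links: Callable[[str], list[str]],
          max_pages: int = 10) -> list[str]:
    """Return the pages visited, in breadth-first order, capped at max_pages."""
    host = urlparse(start_url).netloc
    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            # Assumption: stay on the starting host and never revisit a page.
            if link not in seen and urlparse(link).netloc == host:
                seen.add(link)
                queue.append(link)
    return visited
```

Because `get_links` is injected, the traversal logic can be exercised against an in-memory link graph before pointing it at a real site.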

## Prompt Examples

**Prompt:** 
```
Read https://en.wikipedia.org/wiki/Artificial_intelligence and summarize its history.
```

**Response:** 
```
I've fetched the Wikipedia page. The history of AI spans back to antiquity with myths of artificial beings, but the formal field was founded in 1956 at Dartmouth College. It experienced cycles of immense optimism followed by disappointment ('AI winters'), eventually leading to the modern deep learning revolution fueled by huge datasets and compute power.
```

**Prompt:** 
```
Extract the links from https://news.ycombinator.com/
```

**Response:** 
```
I've extracted the outbound links. The site currently links out to 30 primary article sources including domains like github.com, wired.com, and nytimes.com, along with many internal navigational links to user profiles and comment threads.
```

**Prompt:** 
```
Compare these two links: url1.com and url2.com
```

**Response:** 
```
Using the batch reading tool, I've loaded both URLs simultaneously. URL 1 discusses a 'React-first' architecture and uses component styling. URL 2 advocates for 'HTML-first', server-rendered patterns. While both aim to increase web performance, they take fundamentally opposite approaches to client-side hydration.
```

## Capabilities

### Clean Article Reading
Your agent strips away ads and site navigation from any webpage, returning only the main article content as clean Markdown.

### Metadata Collection
The tool extracts structured data like SEO titles, descriptions, canonical links, and outbound hyperlinks, returning only the metadata rather than the full page body.

### Site Deep Crawling
Your agent automatically navigates from a starting URL, crawling up to ten pages to map out a documentation site or wiki.

### Bulk Data Fetching
You can process multiple web sources at once, fetching and comparing content from up to ten different URLs in parallel.

## Use Cases

### Comparing two product architectures
A developer needs to know if 'React-first' or 'HTML-first' is better for their client. They ask their agent to run `batch_read` on both competing articles, allowing the AI to compare them side-by-side and give a definitive recommendation.

### Auditing competitor websites
An SEO specialist uses `extract` on five competitor sites. The agent quickly pulls all metadata—titles, descriptions, canonical tags—enabling the specialist to identify weak spots in their own site's optimization.

### Researching a niche topic
A researcher drops 15 links related to quantum computing. They ask the agent to use `read` on each, and then summarize the entire collection of clean Markdown text into one coherent report.

### Mapping out an old wiki
An internal team uses `crawl` on their company's legacy documentation hub. The agent maps up to ten related pages per crawl, giving the team a structure map before migrating the content.

## Benefits

- Instead of relying on stale training data, you let the agent read real-time articles. Grounding answers in the source text reduces factual hallucinations.
- The `read` tool converts any messy website into pristine Markdown. You get readable content instantly, perfect for documentation or blog posts.
- Need to audit a site? Use the `extract` tool to pull only the metadata (titles, descriptions, and OG tags) without returning the whole page body.
- Comparing sources is easy with `batch_read`. You can feed up to ten URLs at once, allowing your agent to compare concepts or summarize multiple articles in one go.
- For deep research, use the `crawl` tool. Give it a single documentation hub link and let your agent map out every related page automatically.

## How It Works

The bottom line is you get real-time web data delivered into your workflow without any setup or keys.

1. Subscribe to this MCP on Vinkius. No API keys or authentication are required.
2. Simply paste a web link into your chat and tell your agent what to do, such as 'read this URL' or 'crawl this documentation'.
3. Your AI client executes the request, fetching the data and returning it directly to your conversation.
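
Under the hood, an MCP client turns a request like this into a JSON-RPC `tools/call` message. The sketch below builds such a message for the `read` tool. The `tools/call` method and the `name`/`arguments` fields come from the MCP specification; the argument name `url` and the helper `build_tool_call` are assumptions, since this page does not document the tool's input schema.

```python
import json


def build_tool_call(tool: str, arguments: dict, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 request that asks an MCP server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


# Assumption: the read tool accepts a single "url" argument.
request = build_tool_call("read", {"url": "https://example.com/docs"})
print(json.dumps(request, indent=2))
```

Your AI client normally sends this for you, so you would only build it by hand when debugging a connection.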

## Frequently Asked Questions

**How does the Web Scraper MCP handle complex documentation sites?**
It uses the `crawl` tool to map out documentation hubs. You give it the starting URL, and your agent automatically navigates up to ten related pages so you don't miss linked content.

**Can I compare articles from different websites at once?**
Yes, use `batch_read`. This tool fetches multiple URLs in parallel, allowing your agent to process and compare the content of up to ten sources simultaneously. It's ideal for comparative analysis.

**Do I need any special keys or authentication to use Web Scraper?**
No. You don't need API keys or any specific credentials. Once you subscribe to this MCP, you just paste the link into your chat and tell your agent what task it needs to perform.

**Is the content from the Web Scraper always clean?**
Usually. The primary reading tool converts messy webpages using Mozilla Readability logic, which strips out boilerplate, ads, and navigation bars so you get clean text. Results are best on article-style pages with a clear main content area.

**How do I find all links on a page?**
Use the `list_links` tool. It pulls every outbound hyperlink from the web page and returns just the list, not the full body content.