Diffbot MCP. Turn any website into clean, structured JSON data.

Diffbot automates web data extraction. It lets you turn any website—articles, product pages, job listings, or forums—into clean, structured JSON data.

You connect it to your AI client and use natural conversation to run specialized APIs, avoiding complex scraping code. It handles everything from e-commerce SKUs and pricing to full article text and sentiment scoring across web pages.

Classify and extract page type

The analyze_page tool determines if a given URL is an article, product, list, or job, and returns the appropriate structured JSON data.

Extract clean article content

The extract_article tool pulls news, blog, or article content, identifying authors and dates while leaving out the noise.

Scrape e-commerce product details

The extract_product tool captures structured product information, including pricing, SKU, brand, and full specifications from retail websites.

Analyze forum discussions and reviews

The extract_discussion tool pulls user-generated content, capturing forum threads and allowing for automated sentiment analysis.

Capture list or search result arrays

The extract_list tool finds bounded directory pages or search results and extracts structured arrays of titles and links.

Extract job postings

The extract_job tool pulls specific details like job titles, company names, and salary information from recruitment sites.

Extract media assets

The extract_image and extract_video tools pull primary images and video metadata from a page.

Supported MCP Clients

Claude

ChatGPT

Cursor

Gemini

Windsurf

VS Code

JetBrains

Vercel

+ other MCP clients

Free for Subscribers

Diffbot MCP Server: 10 Tools for Web Data Extraction

Use these 10 tools to extract structured content, product details, and data types from any URL without writing a single line of scraping code.

analyze_page

Uses machine learning to classify a page and extract generalized structured data (article, product, list, etc.) from a URL.

extract_article

Extracts clean text, HTML, and metadata from news, blog, or general article content.

extract_custom_api

Runs data extraction based on a specific, pre-defined set of rules you configured in your Diffbot dashboard.

extract_discussion

Pulls forum threads, user reviews, or comments, including automated sentiment scoring.

extract_event

Extracts schedules and details from event and conference pages.

extract_image

Pulls the main images and metadata from a web page.

extract_job

Extracts structured job postings, including titles, employers, and salary information.

extract_list

Scrapes structured arrays of items, like search results or directory links, from a URL.

extract_product

Extracts detailed e-commerce information, such as pricing, SKU, and specifications, from a product page.

extract_video

Extracts video content and metadata from a page.

Choose How to Get Started

Build a custom MCP for your own tools, or connect a ready-made integration from our catalog.

Build Your Own

Turn any API into an MCP. Import a spec, define Agent Skills, or deploy with MCPFusion.

Import from OpenAPI, Swagger, or YAML specs
Create Agent Skills with progressive disclosure
Deploy to edge with MCPFusion framework
Built in DLP, auth, and compliance on every call
Real time usage dashboard and cost metering
Publish to catalog or keep private

Make Your AI Do More

Start with Diffbot, then connect any of our 4,700+ other servers whenever your AI needs more. One click, no limits.

Use this MCP plus 4,700+ others, all in one place
Add new capabilities to your AI anytime you want
Every connection is secured and compliant automatically
Track usage and costs across all your servers
Works with Claude, ChatGPT, Cursor, and more
New servers added to the catalog every week

What you can do with this MCP connector

You connect your AI client to this server and tell it what data you need from any website. It handles everything from pulling clean article text to structured product specs and forum sentiment analysis, all through natural conversation. You don't write complex scraping code; you just talk to your agent.

analyze_page determines if a URL is an article, product, list, or job, and returns the core structured JSON data for you.
extract_article pulls clean text, HTML, and metadata from news sites, blogs, or general articles, identifying authors and dates while ditching the noise.
extract_product captures structured product info (pricing, SKU, brand, and full specs) straight from retail sites.
extract_discussion pulls user-generated content, grabbing forum threads and running automated sentiment analysis on reviews.
extract_job pulls specific job details like titles, company names, and salary information from career sites.
extract_list finds directory pages or search results and extracts structured arrays of titles and links.
extract_event pulls schedules and details from event or conference pages.
extract_image and extract_video grab primary images and video metadata from a page.
extract_custom_api runs data extraction based on specific rules you set up in your own Diffbot dashboard.

How Diffbot MCP Works

1. First, you give your agent a URL and tell it what kind of data you need (e.g., 'Give me the product details for this page').
2. Your agent calls the appropriate tool (like extract_product) and passes the URL to the Diffbot server.
3. The server runs the extraction logic and returns the clean, structured data in JSON format for your agent to use.
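
You never write this yourself, but it can help to see roughly what step 2 looks like on the wire. MCP clients invoke tools with a JSON-RPC 2.0 request whose method is tools/call. The sketch below builds such a request for extract_product. The argument name "url" and the example address are assumptions for illustration; the actual input schema comes from the server's tool listing.

```python
import json


def build_tool_call(tool_name: str, url: str, request_id: int = 1) -> dict:
    """Build an MCP `tools/call` JSON-RPC request for a single-URL tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": tool_name,
            # Assumption: the tool takes the target page as a "url" argument.
            "arguments": {"url": url},
        },
    }


# Hypothetical product page, used only to show the request shape.
request = build_tool_call("extract_product", "https://example.com/products/widget")
print(json.dumps(request, indent=2))
```

Your AI client builds and sends this request for you; the conversation is the only interface you touch.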

The bottom line is you skip writing scrapers. You just point your AI client at a website and ask for the data you want.

Who Is Diffbot MCP For?

This is for researchers and data teams that spend too much time manually copying data from the web. If you're constantly scraping competitor prices, summarizing news articles, or compiling job market reports, this saves you hours of painful, brittle scripting. It makes the web an API.

Market Researcher

Monitors competitor pricing and product updates across multiple e-commerce sites using natural language queries.

Content Marketer

Summarizes articles from various sources and monitors brand mentions across forum discussions in real time.

Data Analyst

Extracts structured data from thousands of varied websites—from news sites to directories—without writing complex selectors.

Software Developer

Tests and debugs web extraction pipelines and custom data rules directly within natural conversation.

What Changes When You Connect

Stop writing brittle scrapers. Use extract_product to pull precise pricing and SKUs from e-commerce sites without maintaining selectors that break whenever the site's markup changes.
Get full context without multiple passes. The analyze_page tool classifies the page type (article, job, etc.) first, giving your agent a structured starting point.
Analyze conversation, not just text. Use extract_discussion to pull forum reviews and automatically score the sentiment, giving you immediate market feedback.
Build market intelligence fast. Run extract_job to monitor job titles and salary information across dozens of career pages in one go.
Handle messy data sources. extract_article cleans up the main content of a page, while extract_list pulls organized sets of links and titles.
Go beyond the obvious. With extract_custom_api, you can bridge raw URLs to your own specific, trained extraction rulesets.

Real-World Use Cases

Monitoring competitor pricing

A market researcher needs to know the latest price and SKU for a rival's product. They pass the URL to their agent and ask, 'What is the price and SKU for this product?' The agent calls extract_product and gets back a clean JSON object with the details they need.

Summarizing industry news

A content marketer needs to track AI trends from five different blogs. They pass the URLs and ask the agent to 'Extract the article content from these five links.' The agent runs extract_article, returning clean text bodies and identifying authors and dates for all five.

Analyzing customer sentiment

A product team wants to gauge public feeling about a new feature. They pass the URL of a review section and ask the agent to 'Extract the discussion and analyze the sentiment.' The agent calls extract_discussion, giving them structured reviews and sentiment scores.

Building a job market report

A recruiter needs to track salary trends for 'DevOps Engineer.' They pass a list of job board URLs and ask the agent to 'Get job postings for this list.' The agent calls extract_job, returning clean data points like job titles and salary ranges.

The Tradeoffs

Running multiple scrapers manually

The developer runs a scraper for articles, then a separate script for images, then another for product data. They spend hours manually stitching together JSON files and dealing with missing keys.

→ Instead, ask your agent to run analyze_page first. Then, in a single conversation, follow up with extract_article and extract_product to pull all necessary data types at once.
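
The "analyze first, then specialize" flow boils down to a lookup from page type to tool. Here is a minimal sketch of that routing. The page types and tool names come from this page; the "type" key on the analyze_page result is an assumption about the response shape, so check the real payload before relying on it.

```python
from typing import Optional

# Page types named in the text, mapped to the specialized tool for each.
TOOL_FOR_PAGE_TYPE = {
    "article": "extract_article",
    "product": "extract_product",
    "list": "extract_list",
    "job": "extract_job",
}


def next_tool(analysis: dict) -> Optional[str]:
    """Pick the follow-up tool for an analyze_page result, or None if unknown."""
    # Assumption: the classification is exposed under a "type" key.
    page_type = str(analysis.get("type", "")).lower()
    return TOOL_FOR_PAGE_TYPE.get(page_type)


print(next_tool({"type": "product"}))
```

In practice your agent makes this choice on its own; the table just makes the decision explicit.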

Treating the web like a simple database

Trying to use generic tools to pull structured data from a complex e-commerce site, resulting in missing SKUs or incorrect pricing fields.

→ Use the specialized extract_product tool. This API is built specifically for e-commerce, ensuring it captures precise pricing, SKU, and brand mapping.

Ignoring page context

Only scraping a list of links (extract_list) and missing the context—like the image or the primary article that links to those items.

→ Start with analyze_page to understand the overall page type. Then, use extract_image or extract_article to enrich the data you pull from the list.

When It Fits, When It Doesn't

Use this server if your job requires reading data from the web and turning it into structured JSON without writing custom Python or JavaScript scrapers. It's perfect for market research, content aggregation, and competitive intelligence.

Don't use it if:

1. You need to scrape behind a login wall or a paywall. This tool works on publicly visible content.
2. You need to perform complex, multi-step logic that requires external API calls (e.g., checking inventory levels on a separate system). For that, you'll need a dedicated backend service.

If you just need to know what kind of data is on the page, run analyze_page. If you need to pull structured data for a specific type (e.g., extract_product for e-commerce), use the specialized tool. Don't try to use a general tool for a specialized job.

Independent Platform Disclaimer: Vinkius is an independent platform and is not affiliated with, endorsed by, sponsored by, verified by, or otherwise authorized by Diffbot. All third-party trademarks, logos, and brand names are the property of their respective owners. Their use on this website is strictly for informational purposes to identify service compatibility and interoperability.

VINKIUS INFRASTRUCTURE

Cloud Hosted: Managed infra

V8 Isolated: Sandboxed per request

Zero-Trust Proxy: No stored credentials

DLP Enforced: Policy on every call

GDPR Compliant: EU data residency

Token Compression: ~60% cost reduction

Works with Claude, ChatGPT, Cursor, and more

The Model Context Protocol standardizes how applications expose capabilities to LLMs. Instead of operating in isolation, your AI gains direct access to external platforms, live data, and real-world actions through secure, standardized connections.

This server provides 10 capabilities that interface natively with Claude, ChatGPT, Cursor, and any MCP client. No middleware. No custom integration required.

Available Capabilities

analyze_page extract_article extract_custom_api extract_discussion extract_event extract_image extract_job extract_list extract_product extract_video

Web data extraction shouldn't require a PhD in web scraping.

Today, getting clean data from a website is a mess. You find a page, then you have to manually run a scraper for the article body, then another one for the product info, and a third one just to pull the images. You spend hours dealing with selectors that break when the site owner updates a single CSS class. It's tedious, and it's always incomplete.

With Diffbot, you just point your agent at the URL and tell it what you need. You can ask for the article text and the product SKU in the same chat. The agent handles the complex extraction and gives you clean, structured JSON, period.

Use Diffbot MCP Server: Structured Data from Web Pages

You don't have to copy-paste prices, titles, and descriptions from multiple pages. You just run `extract_product` and get the whole dataset mapped out. Similarly, `extract_job` pulls out all the required fields (salary, employer, title) in one go.

The difference is that the data you get back is structured, not just a dump of text. You get usable fields—SKU, date, title, price—ready for your database. That's the only way to move fast.
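
To make "ready for your database" concrete, here is a small sketch that loads a product result into a typed record. The sample JSON and its keys (title, sku, price) are made up for illustration and are not the actual response schema; map them to whatever fields your extraction returns.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProductRecord:
    title: str
    sku: Optional[str]
    price: Optional[float]


def parse_product(raw_json: str) -> ProductRecord:
    """Turn one JSON product object into a typed record."""
    data = json.loads(raw_json)
    price = data.get("price")
    return ProductRecord(
        title=data["title"],
        sku=data.get("sku"),
        # Prices may arrive as strings; normalize to float when present.
        price=float(price) if price is not None else None,
    )


# Hypothetical payload, not real extraction output.
sample = '{"title": "Example Widget", "sku": "WID-001", "price": "19.99"}'
record = parse_product(sample)
print(record)
```

Missing fields come back as None instead of raising, which keeps a batch import from failing on one incomplete page.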

Common Questions About Diffbot MCP

How do I use the analyze_page tool with Diffbot?

The analyze_page tool gives you a high-level view of the page. You pass it a URL, and it tells you if it's an article, a list, or something else, which helps you decide which specialized tool to use next.

Can I use extract_product to scrape any e-commerce site?

Yes. extract_product pulls structured e-commerce details like pricing, SKU, and brand mappings from almost any website, making it reliable for market monitoring.

What is the difference between extract_article and analyze_page?

The analyze_page tool gives a general classification of the entire page. extract_article focuses specifically on pulling clean, long-form content, identifying authors and dates.

Does extract_custom_api require coding?

No. You don't write code. You define your rules in the Diffbot dashboard, and then your agent runs those rules for you via the extract_custom_api tool.

Can I pull job listings using the extract_job tool?

Yes. The extract_job tool pulls explicit job titles, employer names, and salary information from recruitment pages, which is exactly what you need for market reports.

How do I handle complex data structures with the extract_product tool?

The extract_product tool captures structured details like SKU, precise pricing, and brand mappings. You can get detailed specifications for over a dozen fields, making it suitable for complex product data sets.

Does the extract_list tool handle search results or directory pages?

Yes, the extract_list tool is designed specifically for this. It identifies bounded directories and search results, allowing you to extract clean arrays of item titles and links.

What kind of data can the extract_discussion tool pull from forum threads?

The extract_discussion tool pulls structured data from user-generated content. You can extract forum threads, reviews, or comments, and it even offers automated sentiment scoring.

Can my agent automatically identify what kind of page a URL points to?

Yes. Use the 'analyze_page' tool. Diffbot uses ML to classify the URL as an article, product, image, video, or list, and returns the appropriate structured JSON payload automatically.

How do I extract only the main text from a blog post without comments?

Use the 'extract_article' tool and set the 'discussion' parameter to 'false'. The agent will retrieve the clean text and HTML body while explicitly ignoring any forum threads or review blocks on the page.
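
As a rough sketch, the tool arguments for that request could look like the following. The 'discussion' parameter name comes from the answer above; the "url" key and passing the flag as a JSON boolean (rather than the string 'false') are assumptions.

```python
import json


def article_arguments(url: str, include_comments: bool = False) -> dict:
    """Arguments for extract_article; comments are skipped by default."""
    # Assumption: the page is passed as "url" and the flag as a boolean.
    return {"url": url, "discussion": include_comments}


# Hypothetical blog post address.
args = article_arguments("https://example.com/blog/post")
print(json.dumps(args))
```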

Can I use custom extraction rules I've defined in my Diffbot dashboard?

Absolutely. Use the 'extract_custom_api' tool. Provide your trained 'api_name' and the target URL. Diffbot will extract the data according to your specific structural ruleset natively.
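
If you are wiring this up in a script rather than a chat, a small guard on the two inputs catches the most common mistakes before the call goes out. This is a sketch only: 'api_name' is the parameter named above, while the "url" key, the validation rules, and the example values are assumptions.

```python
def custom_api_arguments(api_name: str, url: str) -> dict:
    """Validate and assemble arguments for extract_custom_api."""
    if not api_name.strip():
        raise ValueError("api_name must not be empty")
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be an absolute http(s) URL")
    # Assumption: the target page is passed under a "url" key.
    return {"api_name": api_name, "url": url}


# "my_rules" is a placeholder for a ruleset defined in your own dashboard.
print(custom_api_arguments("my_rules", "https://example.com/page"))
```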

Use it with your favorite AI tools

Connect this server to Cursor, Claude, VS Code, and more.

OpenAI Agents SDK (Python)

Google ADK (Python)

Pydantic AI (Python)

Vercel AI SDK (TypeScript)