Diffbot MCP. Turn any website into clean, structured JSON data.
Works with every AI agent you already use
…and any MCP-compatible client
Just plug in your AI agents and start using Vinkius.
Diffbot automates web data extraction. It lets you turn any website—articles, product pages, job listings, or forums—into clean, structured JSON data.
You connect it to your AI client and use natural conversation to run specialized APIs, avoiding complex scraping code. It handles everything from e-commerce SKUs and pricing to full article text and sentiment scoring across web pages.
What your AI agents can do
The analyze_page tool determines if a given URL is an article, product, list, or job, and returns the appropriate structured JSON data.
The extract_article tool pulls news, blog, or article content, identifying authors and dates while leaving out the noise.
The extract_product tool captures structured product information, including pricing, SKU, brand, and full specifications from retail websites.
The extract_discussion tool pulls user-generated content, capturing forum threads and allowing for automated sentiment analysis.
The extract_list tool finds bounded directory pages or search results and extracts structured arrays of titles and links.
The extract_job tool pulls specific details like job titles, company names, and salary vectors from recruitment sites.
The extract_image and extract_video tools pull primary images and video metadata from a page.
Diffbot MCP Server: 10 Tools for Web Data Extraction
Use these 10 tools to extract structured content, product details, and data types from any URL without writing a single line of scraping code.
analyze_page
Uses machine learning to classify a page and extract generalized structured data (article, product, list, etc.) from a URL.
extract_article
Extracts clean text, HTML, and metadata from news, blog, or general article content.
extract_custom_api
Runs data extraction based on a specific, pre-defined set of rules you configured in your Diffbot dashboard.
extract_discussion
Pulls forum threads, user reviews, or comments, including automated sentiment scoring.
extract_event
Extracts schedules and details from event and conference pages.
extract_image
Pulls the main images and metadata from a web page.
extract_job
Extracts structured job postings, including titles, employers, and salary information.
extract_list
Scrapes structured arrays of items, like search results or directory links, from a URL.
extract_product
Extracts detailed e-commerce information, such as pricing, SKU, and specifications, from a product page.
extract_video
Extracts video content and metadata from a page.
Choose How to Get Started
Build a custom MCP for your own tools, or connect a ready-made integration from our catalog.
Build Your Own
Turn any API into an MCP. Import a spec, define Agent Skills, or deploy with MCPFusion.
- Import from OpenAPI, Swagger, or YAML specs
- Create Agent Skills with progressive disclosure
- Deploy to edge with MCPFusion framework
- Built-in DLP, auth, and compliance on every call
- Real-time usage dashboard and cost metering
- Publish to catalog or keep private
Make Your AI Do More
Start with Diffbot, then connect any of our 4,700+ other servers whenever your AI needs more. One click, no limits.
- Use this MCP plus 4,700+ others, all in one place
- Add new capabilities to your AI anytime you want
- Every connection is secured and compliant automatically
- Track usage and costs across all your servers
- Works with Claude, ChatGPT, Cursor, and more
- New servers added to the catalog every week
What you can do with this MCP connector
You connect your AI client to this server and tell it what data you need from any website. It handles everything from pulling clean article text to structured product specs and forum sentiment analysis, all through natural conversation. You don't write complex scraping code; you just talk to your agent.
- analyze_page determines if a URL is an article, product, list, or job, and returns the core structured JSON data for you.
- extract_article pulls clean text, HTML, and metadata from news sites, blogs, or general articles, identifying authors and dates while leaving out the noise.
- extract_product captures structured product info (pricing, SKU, brand, and full specs) straight from retail sites.
- extract_discussion pulls user-generated content, grabbing forum threads and running automated sentiment analysis on reviews.
- extract_job pulls specific job details like titles, company names, and salary vectors from career sites.
- extract_list finds directory pages or search results and extracts structured arrays of titles and links.
- extract_event pulls schedules and details from event or conference pages.
- extract_image and extract_video grab primary images and video metadata from a page.
- extract_custom_api runs data extraction based on specific rules you set up in your own Diffbot dashboard.
How Diffbot MCP Works
- 1 First, you give your agent a URL and tell it what kind of data you need (e.g., 'Give me the product details for this page').
- 2 Your agent calls the appropriate tool (like extract_product) and passes the URL to the Diffbot server.
- 3 The server runs the extraction logic and returns the clean, structured data in JSON format for your agent to use.
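Step 2 can be sketched at the protocol level. MCP clients invoke tools with a JSON-RPC 2.0 `tools/call` request; the snippet below only builds that request body and does not send it. The helper name `build_tool_call` and the argument name `url` are assumptions made for illustration, so check the server's published tool schema for the real parameter names.

```python
import json
from itertools import count

# Monotonic request ids; JSON-RPC needs an id on every request that expects a reply.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request body (JSON-RPC 2.0) without sending it."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


if __name__ == "__main__":
    # "url" is an assumed argument name, not confirmed by the tool schema.
    request = build_tool_call("extract_product", {"url": "https://example.com/item/123"})
    print(json.dumps(request, indent=2))
```

In practice your AI client builds and sends this request for you; the sketch is only meant to show what "calls the appropriate tool" amounts to on the wire.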
The bottom line is you skip writing scrapers. You just point your AI client at a website and ask for the data you want.
Who Is Diffbot MCP For?
This is for researchers and data teams that spend too much time manually copying data from the web. If you're constantly scraping competitor prices, summarizing news articles, or compiling job market reports, this saves you hours of painful, brittle scripting. It makes the web an API.
Monitors competitor pricing and product updates across multiple e-commerce sites using natural language queries.
Summarizes articles from various sources and monitors brand mentions across forum discussions in real time.
Extracts structured data from thousands of varied websites—from news sites to directories—without writing complex selectors.
Tests and debugs web extraction pipelines and custom data rules directly within natural conversation.
What Changes When You Connect
- Stop writing brittle scrapers. Use extract_product to pull precise pricing and SKUs from any e-commerce site, no matter how the site's code changes.
- Get full context without multiple passes. The analyze_page tool classifies the page type (article, job, etc.) first, giving your agent a structured starting point.
- Analyze conversation, not just text. Use extract_discussion to pull forum reviews and automatically score the sentiment, giving you immediate market feedback.
- Build market intelligence fast. Run extract_job to monitor job titles and salary vectors across dozens of career pages in one go.
- Handle messy data sources. Need to know what a page is? extract_article cleans up the main content, while extract_list pulls organized sets of links and titles.
- Go beyond the obvious. With extract_custom_api, you can bridge raw URLs to your own specific, trained extraction rulesets.
Real-World Use Cases
Monitoring competitor pricing
A market researcher needs to know the latest price and SKU for a rival's product. They pass the URL to their agent and ask, 'What is the price and SKU for this product?' The agent calls extract_product and gets back a clean JSON object with the details they need.
Summarizing industry news
A content marketer needs to track AI trends from five different blogs. They pass the URLs and ask the agent to 'Extract the article content from these five links.' The agent runs extract_article, returning clean text bodies and identifying authors and dates for all five.
Analyzing customer sentiment
A product team wants to gauge public feeling about a new feature. They pass the URL of a review section and ask the agent to 'Extract the discussion and analyze the sentiment.' The agent calls extract_discussion, giving them structured reviews and sentiment scores.
Building a job market report
A recruiter needs to track salary trends for 'DevOps Engineer.' They pass a list of job board URLs and ask the agent to 'Get job postings for this list.' The agent calls extract_job, returning clean data points like job titles and salary ranges.
The Tradeoffs
Running multiple scrapers manually
The developer runs a scraper for articles, then a separate script for images, then another for product data. They spend hours manually stitching together JSON files and dealing with missing keys.
→
Instead, ask your agent to run analyze_page first. Then, in a single conversation, follow up with extract_article and extract_product to pull all necessary data types at once.
Treating the web like a simple database
Trying to use generic tools to pull structured data from a complex e-commerce site, resulting in missing SKUs or incorrect pricing fields.
→
Use the specialized extract_product tool. This API is built specifically for e-commerce, ensuring it captures precise pricing, SKU, and brand mapping.
Ignoring page context
Only scraping a list of links (extract_list) and missing the context—like the image or the primary article that links to those items.
→
Start with analyze_page to understand the overall page type. Then, use extract_image or extract_article to enrich the data you pull from the list.
When It Fits, When It Doesn't
Use this server if your job requires reading data from the web and turning it into structured JSON without writing custom Python or JavaScript scrapers. It's perfect for market research, content aggregation, and competitive intelligence.
Don't use it if:
1. You need to scrape behind a login wall or a paywall. This tool works on publicly visible content.
2. You need to perform complex, multi-step logic that requires external API calls (e.g., checking inventory levels on a separate system). For that, you'll need a dedicated backend service.
If you just need to know what kind of data is on the page, run analyze_page. If you need to pull structured data for a specific type (e.g., extract_product for e-commerce), use the specialized tool. Don't try to use a general tool for a specialized job.
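The rule of thumb above (classify first, then use the specialized tool) can be sketched as a small lookup. The tool names come from this page; the page-type labels used as keys are assumptions for illustration and may not match the labels the server actually returns.

```python
from typing import Optional

# Keys are assumed page-type labels; values are the tool names listed on this page.
TOOL_BY_PAGE_TYPE = {
    "article": "extract_article",
    "product": "extract_product",
    "discussion": "extract_discussion",
    "list": "extract_list",
    "job": "extract_job",
    "event": "extract_event",
    "image": "extract_image",
    "video": "extract_video",
}


def choose_tool(page_type: Optional[str]) -> str:
    """Return the specialized tool for a known page type, else fall back to analyze_page."""
    if page_type is None:
        return "analyze_page"
    return TOOL_BY_PAGE_TYPE.get(page_type.strip().lower(), "analyze_page")
```

Falling back to analyze_page for unknown or missing types keeps the general tool as the entry point and reserves the specialized tools for pages whose type is already known.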
Independent Platform Disclaimer: Vinkius is an independent platform and is not affiliated with, endorsed by, sponsored by, verified by, or otherwise authorized by Diffbot. All third-party trademarks, logos, and brand names are the property of their respective owners. Their use on this website is strictly for informational purposes to identify service compatibility and interoperability.
VINKIUS INFRASTRUCTURE
- Cloud Hosted: Managed infra
- V8 Isolated: Sandboxed per request
- Zero-Trust Proxy: No stored credentials
- DLP Enforced: Policy on every call
- GDPR Compliant: EU data residency
- Token Compression: ~60% cost reduction
Works with Claude, ChatGPT, Cursor, and more
The Model Context Protocol standardizes how applications expose capabilities to LLMs. Instead of operating in isolation, your AI gains direct access to external platforms, live data, and real-world actions through secure, standardized connections.
This server provides 10 capabilities that interface natively with Claude, ChatGPT, Cursor, and any MCP client. No middleware. No custom integration required.
Web data extraction shouldn't require a PhD in web scraping.
Today, getting clean data from a website is a mess. You find a page, then you have to manually run a scraper for the article body, then another one for the product info, and a third one just to pull the images. You spend hours dealing with selectors that break when the site owner updates a single CSS class. It's tedious, and it's always incomplete.
With Diffbot, you just point your agent at the URL and tell it what you need. You can ask for the article text and the product SKU in the same chat. The agent handles the complex extraction and gives you clean, structured JSON, period.
Use Diffbot MCP Server: Structured Data from Web Pages
You don't have to copy-paste prices, titles, and descriptions from multiple pages. You just run `extract_product` and get the whole dataset mapped out. Similarly, `extract_job` pulls out all the required fields (salary, employer, title) in one go.
The difference is that the data you get back is structured, not just a dump of text. You get usable fields—SKU, date, title, price—ready for your database. That's the only way to move fast.
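As a rough illustration of "usable fields ready for your database", the sketch below maps a product payload onto a typed record and tolerates missing keys. The key names `title`, `sku`, and `price` and the `ProductRecord` type are assumptions for illustration; the actual payload shape may differ.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class ProductRecord:
    title: str
    sku: Optional[str]
    price: Optional[Decimal]


def to_product_record(payload: Mapping[str, Any]) -> ProductRecord:
    """Map a product payload onto typed fields, tolerating missing keys."""
    # "title", "sku", and "price" are assumed key names, not a confirmed schema.
    raw_price = payload.get("price")
    try:
        price = Decimal(str(raw_price)) if raw_price is not None else None
    except InvalidOperation:
        price = None  # An unparseable price is stored as missing rather than guessed.
    return ProductRecord(
        title=str(payload.get("title", "")).strip(),
        sku=payload.get("sku"),
        price=price,
    )
```

Using Decimal instead of float avoids rounding surprises when prices are later compared or summed.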
Common Questions About Diffbot MCP
How do I use the analyze_page tool with Diffbot?
The analyze_page tool gives you a high-level view of the page. You pass it a URL, and it tells you if it's an article, a list, or something else, which helps you decide which specialized tool to use next.
Can I use extract_product to scrape any e-commerce site?
It works on most publicly visible product pages. extract_product pulls structured e-commerce details like pricing, SKU, and brand mappings, which makes it a good fit for market monitoring. Pages behind a login or paywall are out of scope.
What is the difference between extract_article and analyze_page?
The analyze_page tool gives a general classification of the entire page. extract_article focuses specifically on pulling clean, long-form content, identifying authors and dates.
Does extract_custom_api require coding?
No. You don't write code. You define your rules in the Diffbot dashboard, and then your agent runs those rules for you via the extract_custom_api tool.
Can I pull job listings using the extract_job tool?
Yes. The extract_job tool pulls explicit job titles, employer names, and salary vectors from recruitment pages, which is exactly what you need for market reports.
How do I handle complex data structures with the extract_product tool?
The extract_product tool captures structured details like SKU, precise pricing, and brand mappings. You can get detailed specifications for over a dozen fields, making it suitable for complex product data sets.
Does the extract_list tool handle search results or directory pages?
Yes, the extract_list tool is designed specifically for this. It identifies bounded directories and search results, allowing you to extract clean arrays of item titles and links.
What kind of data can the extract_discussion tool pull from forum threads?
The extract_discussion tool pulls structured data from user-generated content. You can extract forum threads, reviews, or comments, and it even offers automated sentiment scoring.
Can my agent automatically identify what kind of page a URL points to?
Yes. Use the 'analyze_page' tool. Diffbot uses ML to classify the URL as an article, product, image, video, or list, and returns the appropriate structured JSON payload automatically.
How do I extract only the main text from a blog post without comments?
Use the 'extract_article' tool and set the 'discussion' parameter to 'false'. The agent will retrieve the clean text and HTML body while explicitly ignoring any forum threads or review blocks on the page.
Can I use custom extraction rules I've defined in my Diffbot dashboard?
Yes. Use the 'extract_custom_api' tool and provide your trained 'api_name' and the target URL. Diffbot extracts the data according to your ruleset.
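The last two answers name two tool parameters, 'discussion' and 'api_name'. A minimal sketch of the argument objects an agent might pass is shown below. The parameter names `discussion` and `api_name` come from the answers above; the `url` key, the helper function names, and sending `discussion` as a boolean rather than the string 'false' are assumptions.

```python
from typing import Any, Dict


def article_arguments(url: str, include_discussion: bool = False) -> Dict[str, Any]:
    """Arguments for extract_article with comment extraction switched off by default."""
    # "url" is an assumed key; the server may expect the string "false" instead of a boolean.
    return {"url": url, "discussion": include_discussion}


def custom_api_arguments(url: str, api_name: str) -> Dict[str, Any]:
    """Arguments for extract_custom_api; api_name is the ruleset trained in the dashboard."""
    if not api_name.strip():
        raise ValueError("api_name must not be empty")
    return {"url": url, "api_name": api_name}
```

When talking to an agent you would normally express these as plain requests ("skip the comments", "use my ruleset named X") and let the client fill in the arguments.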
Use it with your favorite AI tools
Connect this server to Cursor, Claude, VS Code, and more.