# Diffbot MCP

> Diffbot takes unstructured web content and turns it into usable data. Connect your AI client to extract metadata from articles, product pages, or forum threads with simple instructions. It also lets you query a massive knowledge graph to find specific company details and people profiles using natural language.

## Overview
- **Category:** artificial-intelligence
- **Price:** Free
- **Tags:** knowledge-graph, data-extraction, machine-learning, structured-data, web-crawling, metadata-parsing

## Description

You don't have to build complex scrapers or spend hours copy-pasting data into spreadsheets anymore. This MCP acts as your dedicated research analyst, letting your AI agent pull structured information directly from any web page URL. You can ask it to find the main text of a news article, gather all product specs from an e-commerce listing, or summarize key points from a discussion board.

If you need more than raw text, you can query Diffbot's Knowledge Graph to look up company details or employee information using simple queries. It's like having a data engineer on call. Vinkius hosts this connector, so your AI client gets access to these data tools alongside everything else you're building.

It's about going from 'here' (a messy webpage) to 'there' (clean, structured data points) with just a prompt.

## Tools

### analyze_page_type
Automatically detects if a given web page is an article, product listing, or something else entirely.

### enhance_company_profile
Adds professional details and background information to a company using its name or domain.

### enhance_person_profile
Enriches personal profiles by adding professional history, social links, and contact data.

### extract_article_data
Pulls the clean text, author, and metadata from news or blog posts.

### extract_forum_thread
Gathers key comments and content from online discussion threads or message boards.

### extract_images
Identifies the main visual assets and primary images on a webpage.

### extract_product_data
Extracts specific details like SKU, price, and descriptions from e-commerce product pages.

### extract_video_metadata
Identifies embedded videos on a page and retrieves their associated metadata.

### search_knowledge_graph
Queries the Diffbot Knowledge Graph for specific organization or industry data points using DQL (Diffbot Query Language).

### verify_api_credentials
Verifies that your Diffbot API credentials are valid.

### list_active_crawls
Lists running crawl jobs along with their operational status.

## Prompt Examples

**Prompt:** 
```
Extract the main content from the article at 'https://vinkius.com/blog/mcp-standards'.
```

**Response:** 
```
Article extraction triggered! I've retrieved the clean text, author 'John Doe', and a positive sentiment score. Would you like me to extract any discussion threads from the page as well?
```

**Prompt:** 
```
Search for companies in 'San Francisco' with more than 1000 employees using DQL.
```

**Response:** 
```
Querying Knowledge Graph... I found 15 organizations matching your criteria. Highlights include 'Salesforce' and 'Uber'. Would you like the detailed firmographics for these entities?
```

**Prompt:** 
```
Enhance the company profile for 'Vinkius' using domain 'vinkius.com'.
```

**Response:** 
```
Profile enhanced! Vinkius is identified as a 'Technology' organization based in Portugal. I've retrieved their social links, employee count, and latest funding metadata. Shall I search for their key people?
```

## Capabilities

### Extracting Web Data
Pull specific information, such as article content or product details, from any given web address.

### Enhancing Profiles
Add professional details, funding history, and company background to existing names or domains.

### Querying Knowledge Graphs
Search a massive database of billions of entities to find specific market signals or industry data points.

### Identifying Page Types
Automatically detect if a URL is an article, product page, or forum thread before extracting the data.
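
The detect-then-extract flow can be sketched as a simple lookup from page type to tool name. This is a minimal illustration, not the connector's actual logic: the page-type labels (`"article"`, `"product"`, `"discussion"`, and so on) are assumed values, since the strings `analyze_page_type` really returns are not documented here. Only the tool names come from this page.

```python
from typing import Optional

# Hypothetical page-type labels mapped to the tools listed on this page.
# The real labels returned by analyze_page_type may differ.
EXTRACTOR_BY_PAGE_TYPE = {
    "article": "extract_article_data",
    "product": "extract_product_data",
    "discussion": "extract_forum_thread",
    "image": "extract_images",
    "video": "extract_video_metadata",
}


def choose_extractor(page_type: str) -> Optional[str]:
    """Return the extraction tool for a detected page type, or None if unknown."""
    return EXTRACTOR_BY_PAGE_TYPE.get(page_type.strip().lower())


print(choose_extractor("Article"))
print(choose_extractor("recipe"))
```

Returning `None` for an unrecognized type lets the agent skip extraction or ask for clarification rather than guess at a tool.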

## Use Cases

### Competitive Intelligence Gathering
A growth marketer needs to compare product features across five rival websites. Instead of visiting each site and manually gathering specs, they instruct their agent to use `extract_product_data` on all five URLs in a batch, getting clean data points for direct comparison.

### Deep Market Sizing
A market researcher needs to find all companies in the 'AI' sector located in London with over 50 employees. They use `search_knowledge_graph` and define the parameters, receiving a filtered list of detailed firmographics instantly.

### Content Aggregation
A content curator wants to build an internal summary of industry news. They feed their agent 10 links and ask it to use `extract_article_data` on each, getting ten clean summaries with authors attached for immediate review.

### Contact List Cleanup
A sales team member has a list of old client contacts. They feed the agent each contact's name and domain, asking it to run `enhance_person_profile` to verify current job titles, social links, and company affiliation.

## Benefits

- Stop manual copy-pasting. With `extract_article_data`, you simply point your agent at a URL and get the core text and author details instantly, without cleanup.
- Go beyond basic scraping. Use `search_knowledge_graph` to query specific industry signals or firmographics from billions of world entities in one step.
- Boost your lead database quality. Running `enhance_company_profile` on a domain gives you structured data like employee count and funding metadata, not just a name.
- Handle diverse content types. Running `analyze_page_type` first tells your agent which extractor fits the page, for example `extract_product_data` for an e-commerce listing or `extract_forum_thread` for a discussion forum.
- Maintain operational visibility. You can monitor crawl jobs with `list_active_crawls` and check your API credentials with `verify_api_credentials`.
- Contextualize your data gathering. Need to know if the page is even an article? Running `analyze_page_type` confirms the source type before you waste time trying to extract content.

## How It Works

You talk to this MCP through your AI client, and it handles the data extraction and structuring behind the scenes.

1. Subscribe to this MCP and retrieve your API token from the Diffbot dashboard.
2. Your AI client connects using that token. You then instruct your agent with a URL or query, telling it exactly what kind of data you need extracted.
3. The tool processes the request and returns clean, structured results, whether that's a list of company bios or a specific product SKU.
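
For readers curious about what happens in step 2, here is a rough sketch of the request an MCP client sends when the agent invokes a tool. MCP uses JSON-RPC 2.0 with the `tools/call` method. The argument name `url` and the example address are assumptions for illustration; the actual argument schema is whatever the server advertises for each tool.

```python
import json
from itertools import count

# Simple incrementing request ids, as JSON-RPC requires a unique id per request.
_request_ids = count(1)


def make_tool_call(tool_name: str, arguments: dict) -> dict:
    """Wrap a tool invocation in a JSON-RPC 2.0 `tools/call` request."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = make_tool_call(
    "extract_product_data",
    {"url": "https://example.com/product/123"},  # "url" is an assumed argument name
)
print(json.dumps(request, indent=2))
```

In practice your AI client builds and sends this message for you, so you only write the prompt.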

## Frequently Asked Questions

**How do I use Diffbot MCP to extract product data?**
You tell your agent to use `extract_product_data` on the URL. It will pull out structured details like SKUs, prices, and specifications found on e-commerce listing pages.

**Can I find company information using search_knowledge_graph?**
Yes, you can use `search_knowledge_graph` to query Diffbot's Knowledge Graph. Define parameters such as industry or location, and it returns structured results.

**What is the difference between extract_article_data and extract_forum_thread?**
They are for different types of content. Use `extract_article_data` for clean blog posts or news articles, and use `extract_forum_thread` when you need to pull key points from a discussion board.

**Does Diffbot MCP handle data enrichment?**
Yes. You can run `enhance_company_profile` or `enhance_person_profile` with just a name or domain, and the tool adds professional background details to that entity.

**How do I know what kind of page I'm looking at?**
Before extracting anything, you can run `analyze_page_type`. This tells your agent if the URL is an article, a product listing, or something else so it uses the right extraction method.

**How do I find my Diffbot API Token?**
Log in to your Diffbot account and navigate to the **Dashboard** or **Manage Tokens** section to copy your unique access token.

**What is DQL and how can I use it?**
DQL (Diffbot Query Language) allows you to filter the Knowledge Graph. Use the `search_knowledge_graph` tool with queries like `type:Organization industries:"AI"`.
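
If you assemble queries programmatically, a small helper can keep them consistent. The sketch below only reproduces the `field:"value"` pattern from the example above; `build_dql` is a hypothetical helper, not part of Diffbot or this MCP, and any field name other than `industries` should be checked against Diffbot's DQL documentation before use.

```python
def build_dql(entity_type: str, **filters: str) -> str:
    """Compose a DQL string in the form: type:<Type> field:"value" ..."""
    clauses = [f"type:{entity_type}"]
    for field, value in filters.items():
        # Escaping rules are not covered here, so reject embedded quotes
        # instead of guessing how DQL expects them to be escaped.
        if '"' in value:
            raise ValueError(f"double quotes are not supported in {field!r}")
        clauses.append(f'{field}:"{value}"')
    return " ".join(clauses)


query = build_dql("Organization", industries="AI")
print(query)  # type:Organization industries:"AI"
```

The resulting string is what you would hand to `search_knowledge_graph`, either directly or by asking your agent to run it.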

**Can I extract comments from articles?**
Yes! The `extract_article_data` tool has an optional `discussion` parameter. Set it to `true` to retrieve structured comment threads if available.