# HTML DOM Query Engine MCP

> HTML DOM Query Engine provides precise data extraction from messy web pages. Stop feeding massive HTML payloads into your AI agent and risking token limits or hallucination. This MCP lets you pass a raw webpage string and a CSS selector, instantly pulling out exactly the text or attributes (like image URLs or prices) you need. It's fast, memory-efficient parsing for reliable scraping.

## Overview
- **Category:** loved-by-devs
- **Price:** Free
- **Tags:** html-parsing, css-selectors, data-extraction, web-automation, dom-manipulation

## Description

When you run into a huge e-commerce page with thousands of lines of HTML and you only care about a few things, like the product price and the gallery images, passing the whole raw page to your agent is bad news. It wastes tokens and often confuses the AI.

This MCP fixes that. You feed it the messy HTML alongside a specific CSS selector. The engine handles the heavy lifting of parsing the page structure and isolates only the data you asked for. You get back clean text or attributes directly, without any surrounding junk markup. Parsing runs on a native runtime rather than inside the model, which keeps scraping predictable.

Connecting this MCP through Vinkius gives your agent a dedicated tool to handle web data extraction cleanly. It means your workflow doesn't crash when it hits complex, poorly structured websites; it just gets the numbers or links you need.

## Tools

### query_dom
Takes a raw HTML string and a CSS selector, and returns the text content or attributes of the matching elements.
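
As a rough illustration of what a call to this tool could look like on the wire, here is a minimal sketch that builds an MCP `tools/call` request in Python. The `tools/call` method and the `name`/`arguments` envelope come from the MCP specification; the argument names `html`, `selector`, and `attribute` are assumptions made for this example, since the listing does not document the tool's actual input schema.

```python
import json


def build_query_dom_request(html, selector, attribute=None, request_id=1):
    """Build a JSON-RPC 2.0 request that asks an MCP server to run query_dom.

    The argument names below are hypothetical; check the tool's published
    input schema before relying on them.
    """
    arguments = {"html": html, "selector": selector}
    if attribute is not None:
        # Ask for an attribute value (e.g. "src") instead of text content.
        arguments["attribute"] = attribute
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "query_dom", "arguments": arguments},
    }


request = build_query_dom_request(
    '<div class="gallery"><img src="/a.jpg"></div>',
    ".gallery img",
    attribute="src",
)
print(json.dumps(request, indent=2))
```

In practice your MCP client builds this envelope for you; the sketch only shows which pieces of information the agent has to supply.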

## Prompt Examples

**Prompt:** 
```
Extract the text from `.product-price` from this 5,000 line HTML file.
```

**Response:** 
```
✅ **Matches Found:**
1. `$149.99`
```

**Prompt:** 
```
Extract all image source URLs (`src`) from the `.gallery img` selector.
```

**Response:** 
```
✅ **Matches:** Extracted 12 `src` attributes successfully.
```

**Prompt:** 
```
Get the text inside the `<h1>` tag.
```

**Response:** 
```
✅ **Matched Text:** 'Welcome to our API documentation.'
```

## Capabilities

### Extracting text content
It pulls out visible text from a web element identified by its CSS selector.

### Retrieving attributes
You can grab specific data points associated with an element, like the 'src' of an image or the 'href' of a link.

### Parsing complex selectors
The tool supports advanced CSS queries (e.g., targeting elements only inside another container) for pinpoint accuracy.

## Use Cases

### Collecting gallery image URLs
An SEO analyst needs all the image URLs for a gallery. Instead of reading through thousands of lines just to find the `src` attributes, they run their agent with this MCP and specify `.gallery img`. The agent instantly gets a clean list of every single source URL.

### Extracting pricing data
A researcher is compiling price comparisons across several competitor websites. They pass the raw HTML for each page to their agent, use this MCP with the selector `.price-display`, and consistently retrieve only the accurate dollar amounts.

### Auditing documentation structure
A developer needs to find all internal links on a help page. They feed the HTML into the MCP and query for `a[href*='/help/']`. The agent returns only the relevant link texts and URLs, perfect for building an index.

### Extracting headers or titles
A content curator needs to pull just the main title of several articles from a directory listing. They use the MCP with `h1` as the selector, and their agent gets back only the clean text for every matching article headline.

## Benefits

- Saves tokens. Instead of dumping an entire page of raw HTML into your agent, this MCP does the parsing outside the LLM, keeping your context window clean and efficient.
- Improves precision. By requiring a CSS selector, you tell the system exactly where to look (e.g., `.product-title`), minimizing the chance of irrelevant data being pulled in.
- Handles attributes easily. Need all image sources? You don't have to parse them manually; this tool lets your agent grab every `src` or `href` attribute from a specified selector group.
- Cuts hallucination. Because the extraction runs in native code rather than in the model, the same HTML and selector always produce the same result, unlike when an LLM tries to guess data from raw HTML.
- Supports complex targeting. You can use advanced selectors like `#main .price:nth-child(2)` to hit a `.price` element only when it is the second child of its parent inside `#main`.
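
To put a rough number on the token savings, the sketch below compares a synthetic 5,000-line page against the single value you actually want. The page contents are made up for the example, and the four-characters-per-token ratio is a crude rule of thumb, not how any particular model tokenizes.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; real tokenizers will differ."""
    return max(1, len(text) // chars_per_token)


# Synthetic page: 5,000 filler rows plus the one element we care about.
filler = ['<div class="row">filler item %d</div>' % i for i in range(5000)]
page = "\n".join(filler + ['<span class="product-price">$149.99</span>'])

# What a selector query for `.product-price` would hand back to the agent.
extracted = "$149.99"

full_tokens = estimate_tokens(page)
extracted_tokens = estimate_tokens(extracted)
print(f"whole page: ~{full_tokens} tokens, extracted value: ~{extracted_tokens} token(s)")
```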

## How It Works

The bottom line is you get structured data out of unstructured HTML without overloading your AI client's context window.

1. You pass the raw HTML content of a webpage and specify exactly what you're looking for using a standard CSS selector string.
2. The MCP engine processes the entire payload, running the query against the DOM structure to locate all matches.
3. Your agent receives only the clean data—either the requested text or list of attributes—ready for immediate use.
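
The three steps above can be sketched with nothing but Python's standard library. This is an illustrative toy, not the engine behind this MCP: the name `query_dom_sketch` and its `attribute` parameter are invented for the example, and it only understands tag, `.class`, and `#id` selectors joined by the descendant (space) combinator.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not kept on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def parse_selector(selector):
    """Turn 'div.a #b img' into a list of (tag, id, classes) tuples."""
    compounds = []
    for token in selector.split():
        tag, elem_id, classes = None, None, set()
        for prefix, name in re.findall(r"([.#]?)([\w-]+)", token):
            if prefix == ".":
                classes.add(name)
            elif prefix == "#":
                elem_id = name
            else:
                tag = name.lower()
        compounds.append((tag, elem_id, classes))
    return compounds


def matches(node, compound):
    """Check one element against one compound selector."""
    tag, elem_id, classes = compound
    attrs = node["attrs"]
    return ((tag is None or node["tag"] == tag)
            and (elem_id is None or attrs.get("id") == elem_id)
            and classes <= set((attrs.get("class") or "").split()))


class _Matcher(HTMLParser):
    def __init__(self, compounds):
        super().__init__(convert_charrefs=True)
        self.compounds = compounds
        self.stack = []    # currently open elements, outermost first
        self.results = []  # matched elements in document order

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "text": [], "matched": False}
        if self._selected(node):
            node["matched"] = True
            self.results.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Text belongs to every matched element that is still open.
        for node in self.stack:
            if node["matched"]:
                node["text"].append(data)

    def _selected(self, node):
        *ancestor_parts, last = self.compounds
        if not matches(node, last):
            return False
        # Each earlier compound must match some ancestor, in order.
        remaining = iter(self.stack)
        return all(any(matches(a, c) for a in remaining) for c in ancestor_parts)


def query_dom_sketch(html, selector, attribute=None):
    """Step 1: take HTML and a selector. Step 2: parse and match. Step 3: return clean data."""
    compounds = parse_selector(selector)
    if not compounds:
        raise ValueError("empty selector")
    parser = _Matcher(compounds)
    parser.feed(html)
    parser.close()
    if attribute is not None:
        return [n["attrs"][attribute] for n in parser.results if attribute in n["attrs"]]
    # Collapse runs of whitespace so the caller gets tidy text.
    return [" ".join("".join(n["text"]).split()) for n in parser.results]


PAGE = """
<html><body>
  <div id="main">
    <h1>Acme  Widget</h1>
    <span class="product-price sale">$149.99</span>
    <div class="gallery">
      <img src="/img/front.jpg" alt="Front">
      <img src="/img/back.jpg" alt="Back">
    </div>
    <img src="/img/banner.jpg">
  </div>
</body></html>
"""

print(query_dom_sketch(PAGE, ".product-price"))
print(query_dom_sketch(PAGE, ".gallery img", attribute="src"))
```

The agent only ever sees the short lists that come back, never the page itself. A production engine would also handle attribute selectors, pseudo-classes such as `:nth-child`, and malformed markup far more robustly than this sketch does.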

## Frequently Asked Questions

**How do I use the HTML DOM Query Engine MCP for image URLs?**
You pass the raw HTML and use `query_dom` with a selector like `.gallery img`. The tool will then return all the source (`src`) attributes found on those specific image elements.

**Is the HTML DOM Query Engine MCP faster than just sending the whole page?**
Yes. Parsing runs in a native runtime outside the model, so your agent never has to read the junk markup that would otherwise fill its context window and slow down its responses.

**What if I want to extract text from an ID selector?**
You simply use `#your-specific-id` as the CSS query. The engine will target that element directly and return its clean, visible text content.

**Can this MCP handle very long HTML pages?**
Absolutely. It's designed to parse large payloads efficiently, making it ideal for scraping entire documentation sections or massive e-commerce product listings.

**Does the HTML DOM Query Engine MCP only support text extraction?**
No, it supports attributes too. You can query not just the text inside an element, but also its associated attributes like `href` or `data-id`.