# N-Gram Frequency Engine MCP

> The N-Gram Frequency Engine counts word phrases exactly. It extracts unigrams (single words), bigrams (two words), and trigrams (three words) from large documents using native V8 JavaScript. Instead of relying on an LLM to approximate phrase counts, this server returns deterministic frequency numbers every time.

## Overview
- **Category:** developer-tools
- **Price:** Free
- **Tags:** nlp, text-processing, frequency-analysis, bigram, trigram, linguistic-analysis

## Description

**N-Gram Frequency Engine - Count Word Phrases**

You need to know exactly how often specific word combinations, like "core business strategy" or "Q3 revenue forecast", show up in massive reports. Standard language models handle this poorly: they approximate the count, or they run into token limits and miss entire phrases. An exact count should not depend on guesswork.

The N-Gram Frequency Engine addresses that problem. It counts phrases directly in native V8 JavaScript, giving you exact counts for bigrams (two words), trigrams (three words), and any custom word group size (N). Instead of an estimate, you get a deterministic count of word patterns across large bodies of text.

### The `extract_ngram_frequencies` Tool

The primary tool, `extract_ngram_frequencies`, deterministically calculates the most frequent N-Grams in any source text. You feed it your documents, and it processes them in full rather than skimming the surface.

When you run this engine, you get three core capabilities. First, you can count word phrases by specifying whether you want bigrams or trigrams, knowing that each sequence is counted precisely. Second, because it runs on V8 JavaScript, the tool handles huge documents without tripping over token limits, so you don't lose data just because a document is too long for a typical AI client. Third, you can specify exactly how large a word group (the N value) you want to count, letting you pull out only those specific patterns and ignore everything else.

This isn't about general text analysis; it's surgical counting. You're not asking your agent for a summary—you're demanding precise data points showing exactly how many times 'supply chain management' or 'regulatory compliance risk' appears across thousands of pages of transcripts. The engine delivers that structured list detailing the top N-Grams and their exact counts.

Think of it this way: you hand over a massive corpus—say, all the meeting minutes from the last year—and your agent doesn't waste time trying to summarize the vibe. Instead, it uses `extract_ngram_frequencies` to generate a list that tells you, definitively, which three-word phrases dominated the conversation and how many times each one appeared. You get these numbers back immediately.

The ability to specify N means you control the scope of the count. Need only two-word pairs? Set N=2. Only looking for key concepts spread over three words? Set N=3. The tool handles all those parameters using native JS power, guaranteeing that every instance of your target phrase gets tallied correctly, no exceptions.
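
To make the role of N concrete, here is a minimal sketch of the sliding-window idea behind N-Grams. It is an illustration only: `ngrams` is a hypothetical helper written for this example, not the server's actual code.

```javascript
// Illustrative only: slide a window of size n across an array of tokens.
// `ngrams` is a hypothetical helper, not part of the server's API.
function ngrams(tokens, n) {
  const out = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    out.push(tokens.slice(i, i + n).join(" "));
  }
  return out;
}

const tokens = ["supply", "chain", "management", "risk"];
console.log(ngrams(tokens, 2)); // ["supply chain", "chain management", "management risk"]
console.log(ngrams(tokens, 3)); // ["supply chain management", "chain management risk"]
```

A text with fewer than N tokens produces no windows at all, so the result is an empty list rather than an error.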

## Tools

### extract_ngram_frequencies
This tool pulls the most frequent word groups (N-Grams) from text using deterministic counting.

## Prompt Examples

**Prompt:** 
```
Here is a 50-page PDF text. Find the top 10 most frequent trigrams (n=3) to help me understand the core topics.
```

**Response:** 
```
The computation has been executed with mathematical precision. All results are exact and ready for review.
```

**Prompt:** 
```
Combine all these customer reviews into one string and extract the top 5 bigrams (n=2).
```

**Response:** 
```
The computation has been executed with mathematical precision. All results are exact and ready for review.
```

**Prompt:** 
```
Extract the exact frequencies of 4-grams from this competitor's article to map their SEO keyword strategy.
```

**Response:** 
```
The computation has been executed with mathematical precision. All results are exact and ready for review.
```

## Capabilities

### Count Word Phrases
It calculates how many times specific sequences of words (bigrams, trigrams) appear in your text.

### Handle Large Texts
The engine processes large documents without hitting the token limits that trip up standard language models.

### Extract Specific N-Grams
You specify the size of the word group (N) and the tool pulls out only those specific patterns.

## Use Cases

### Analyzing Competitor Content
An SEO analyst needs to map a competitor's keyword strategy from 10 linked articles. Running the text through `extract_ngram_frequencies` finds the exact top 10 most frequent trigrams, showing where the competitor is focusing its content efforts. An LLM prompt alone cannot do this reliably.

### Mining Customer Feedback
A product manager collects thousands of user reviews. They use the engine to extract bigram frequencies, identifying phrases like 'slow loading' or 'login error,' which pinpoints exactly where users are struggling across the entire dataset.

### Academic Corpus Review
A linguist is studying a niche field. They feed the engine an entire corpus of historical documents and use `extract_ngram_frequencies` to get deterministic counts on specific academic terminology, verifying patterns that standard summarization tools would miss.

### Identifying Core Themes in Legal Docs
A compliance officer needs to check thousands of meeting transcripts for recurring legal phrases. They use the engine to calculate trigram frequencies, providing a verifiable count of key terms like 'non-disclosure agreement' or 'liability waiver'.

## Benefits

- Stop guessing counts. The engine provides deterministic frequency numbers, eliminating the approximations standard LLMs make on large texts.
- Speed matters. It runs native V8 JavaScript in milliseconds, giving you results fast enough to keep your workflow moving.
- Control the scope. You specify N—whether it's bigrams (2 words), trigrams (3 words), or a custom size—so you only count what you need.
- Handles bulk data. It processes huge documents that would immediately blow up an LLM’s context window, giving you reliable results on every page.
- Verifiable metrics. You get raw counts and structured output, perfect for feeding directly into spreadsheets or other databases.
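
As a rough illustration of the spreadsheet hand-off, the sketch below turns a list of phrase counts into CSV. The `[{ phrase, count }]` shape is an assumption made for this example (the server's real output format is not documented here), and the sample numbers are invented.

```javascript
// Assumed result shape: [{ phrase, count }]. The server's real output may differ.
function toCsv(rows) {
  // Quote each phrase and double any embedded quotes, per common CSV practice.
  const escape = (value) => `"${String(value).replace(/"/g, '""')}"`;
  const lines = rows.map(({ phrase, count }) => `${escape(phrase)},${count}`);
  return ["phrase,count", ...lines].join("\n");
}

// Invented sample values, for illustration only.
const rows = [
  { phrase: "slow loading", count: 42 },
  { phrase: "login error", count: 17 },
];
console.log(toCsv(rows));
// phrase,count
// "slow loading",42
// "login error",17
```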

## How It Works

The bottom line is you get reliable, mathematically perfect phrase counts without relying on an LLM's memory or approximation.

1. Feed the engine a large body of text. This can be everything from transcripts to full articles.
2. The server runs `extract_ngram_frequencies` using V8 JavaScript, which splits the text into words and counts every N-Gram of the requested size exactly.
3. You get back a list that shows the top phrases and their precise frequency count.
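
The three steps above can be sketched in a few lines of JavaScript. This is a minimal sketch under stated assumptions, not the server's implementation: the function name `topNgrams`, the tokenizer regex, and the tie-breaking rule are all choices made for this example.

```javascript
// Sketch of the three steps; names, tokenizer, and tie-breaking are assumptions.
function topNgrams(text, n, topK) {
  // Step 1: lowercase, then split into word tokens (letters, digits, apostrophes).
  const tokens = text.toLowerCase().match(/[\p{L}\p{N}']+/gu) ?? [];

  // Step 2: count every window of n adjacent tokens.
  const counts = new Map();
  for (let i = 0; i + n <= tokens.length; i++) {
    const phrase = tokens.slice(i, i + n).join(" ");
    counts.set(phrase, (counts.get(phrase) ?? 0) + 1);
  }

  // Step 3: sort by count (descending), break ties alphabetically, keep the top K.
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : 1))
    .slice(0, topK)
    .map(([phrase, count]) => ({ phrase, count }));
}

const sample = "The supply chain is slow. The supply chain is fragile.";
console.log(topNgrams(sample, 2, 3));
// [ { phrase: "chain is", count: 2 },
//   { phrase: "supply chain", count: 2 },
//   { phrase: "the supply", count: 2 } ]
```

Note that this sketch lets windows cross sentence boundaries (it would count "slow the" once in the sample). Whether the real engine does the same is not stated in this document.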

## Frequently Asked Questions

**How does N-Gram Frequency Engine MCP Server count phrases?**
It uses native V8 JavaScript to perform deterministic counting on the source text, guaranteeing accurate counts for unigrams, bigrams, and trigrams. This process bypasses LLM token limits entirely.

**Can I use extract_ngram_frequencies to count phrases in PDFs?**
Yes, as long as the PDF content is first extracted into a plain text string, the `extract_ngram_frequencies` tool can process it. The engine works on raw text data.

**Is this better than just asking my agent to summarize the document?**
Yes, because summarizing describes concepts; counting is factual. This server gives you hard metrics (the frequency count), while a summary only provides qualitative takeaways. They solve different problems.

**How do I change the N-Gram size using extract_ngram_frequencies?**
You set the desired 'N' value in your prompt or function call. For example, setting N=2 counts bigrams (two words), and N=3 counts trigrams (three words).
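
For a direct function call, an MCP client sends a `tools/call` request. The sketch below builds one; the envelope follows the Model Context Protocol, but the argument names `text`, `n`, and `top_k` are hypothetical, so check the tool's published input schema for the real ones.

```javascript
// Hypothetical argument names (`text`, `n`, `top_k`): consult the tool's
// actual input schema before relying on them.
function buildToolCall(text, n, topK, id = 1) {
  if (!Number.isInteger(n) || n < 1) {
    throw new RangeError("n must be a positive integer");
  }
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: {
      name: "extract_ngram_frequencies",
      arguments: { text, n, top_k: topK },
    },
  };
}

// N=2 requests bigrams; change the second argument to 3 for trigrams.
console.log(JSON.stringify(buildToolCall("slow loading, login error", 2, 5), null, 2));
```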

**When I use `extract_ngram_frequencies`, what is the maximum size of text it can process?**
The engine handles extremely large texts, limited primarily by available memory. You don't need to worry about typical token limits or length restrictions. Since it uses native V8 JavaScript, processing speed remains high even with massive inputs.

**Can `extract_ngram_frequencies` handle text that has complex formatting or mixed characters?**
It requires raw, clean plain text input for the most accurate results. If your source material includes HTML tags or unusual symbols, it’s best practice to strip those out first. This ensures the engine focuses only on meaningful word sequences.
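
One possible pre-cleaning step is sketched below. It is deliberately naive: a regex is not a full HTML parser, only two common entities are decoded, and `toPlainText` is a hypothetical helper for this example. For messy real-world markup, a proper HTML parser is the safer choice.

```javascript
// Naive pre-cleaning sketch; a regex is not a full HTML parser.
function toPlainText(html) {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ") // drop script/style bodies
    .replace(/<[^>]+>/g, " ")  // drop remaining tags
    .replace(/&nbsp;/g, " ")   // decode two common entities only
    .replace(/&amp;/g, "&")
    .replace(/\s+/g, " ")      // collapse runs of whitespace
    .trim();
}

console.log(toPlainText("<p>Login <b>error</b> on&nbsp;checkout</p>"));
// "Login error on checkout"
```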

**What security measures govern the data used by `extract_ngram_frequencies`?**
Your text input is processed securely within the Vinkius infrastructure for computation. We do not retain your source documents or use them to train our models; you only receive the calculated frequency output.

**If I run `extract_ngram_frequencies` with an empty string, what error response should I expect?**
It handles null or empty inputs gracefully. Instead of throwing an error, it returns a zero count for all N-Grams. This makes the tool reliable for conditional logic within your agent workflows.
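
The sketch below shows how that kind of graceful handling might look, and how an agent-side branch could use it. It is a local illustration, not the server's code: `countNgrams`, `describe`, and the whitespace tokenizer are assumptions for this example.

```javascript
// Local sketch, not the server's code: empty or null input yields no N-Grams.
function countNgrams(text, n) {
  const tokens = (text ?? "").toLowerCase().match(/\S+/g) ?? [];
  const counts = new Map();
  for (let i = 0; i + n <= tokens.length; i++) {
    const phrase = tokens.slice(i, i + n).join(" ");
    counts.set(phrase, (counts.get(phrase) ?? 0) + 1);
  }
  return counts;
}

// A hypothetical agent-side branch on the empty case.
function describe(text, n) {
  const counts = countNgrams(text, n);
  return counts.size === 0 ? "no phrases found" : `${counts.size} distinct phrases`;
}

console.log(describe("", 2));            // "no phrases found"
console.log(describe(null, 3));          // "no phrases found"
console.log(describe("login error", 2)); // "1 distinct phrases"
```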

**What are Bigrams and Trigrams?**
A bigram is a sequence of two adjacent words (e.g., 'machine learning'). A trigram is three (e.g., 'natural language processing').

**Does it lowercase the text automatically?**
Yes, all text is automatically lowercased and tokenized natively to ensure accurate aggregation of phrases.
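
To see why lowercasing matters for aggregation, the following sketch counts bigrams with and without it. The tokenizer regex and the `bigramCounts` helper are assumptions for this example; the engine's actual tokenization rules are not specified here.

```javascript
// Assumed tokenizer: runs of letters, digits, and apostrophes.
function bigramCounts(text, lowercase) {
  const source = lowercase ? text.toLowerCase() : text;
  const tokens = source.match(/[\p{L}\p{N}']+/gu) ?? [];
  const counts = {};
  for (let i = 0; i + 2 <= tokens.length; i++) {
    const key = `${tokens[i]} ${tokens[i + 1]}`;
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}

const text = "Machine learning works. machine Learning scales.";
console.log(bigramCounts(text, false)["machine learning"]); // undefined (the two case variants stay separate)
console.log(bigramCounts(text, true)["machine learning"]);  // 2
```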

**Is this faster than asking Claude?**
Yes. It is significantly faster, and its counts are exact, whereas LLMs cannot reliably count occurrences across thousands of tokens.