# TF-IDF Vectorizer Engine MCP

> TF-IDF Vectorizer Engine calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for your text data. Feed it a collection of documents and a list of keywords; it returns a weight for each term in each document, showing how distinctive that term is relative to the rest of your corpus, so you don't have to guess which keywords matter.

## Overview
- **Category:** developer-tools
- **Price:** Free
- **Tags:** nlp, text-analysis, statistical-modeling, keyword-extraction, data-processing, deterministic-math

## Description

**`calculate_tf_idf`** calculates Term Frequency-Inverse Document Frequency scores for your data set. You feed it an array of terms and an array of documents; in return, it gives you a weight for each term in each document that tells you how relevant that term is relative to your entire body of text.

Forget about keyword guessing games. Your agent doesn't have to guess what's important; this engine computes a relevance score for every term you ask about. It's deterministic scoring based on measured term frequency, not some vague 'gut feeling.' It takes two inputs, an array of terms and an array of documents, and returns scores that quantify how often each term appears in a document relative to how widely it appears across the corpus.

Here's the deal: the tool looks at every term you give it and measures its frequency within each document, then weights that frequency by how rare or common the term is across *all* documents in your collection. A high score means the word pops up a lot in one specific document but isn't everywhere else; a low score suggests the word is just background noise that shows up in nearly every document.
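The weighting described above is usually written as the formula below. This is the common textbook variant, shown as an assumption: the listing does not say which normalization or smoothing the engine applies, so its numbers may differ. Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, and $n_t$ is the number of documents containing $t$.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \ln \frac{N}{n_t}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

Under this variant, a term that appears in every document has $n_t = N$, so $\mathrm{idf}(t) = \ln 1 = 0$ and its score is zero everywhere.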

You use this mechanism when you need to rank importance objectively. Rankings based on simple counts reward words that are common everywhere; TF-IDF corrects for that by discounting terms that appear in most documents. The tool takes your list of terms and measures each term's weight in each document, so you can compare term significance document by document.

It's built to handle large collections of text data efficiently. Think about scoring thousands of articles or chat logs; very large collections may need to be split into batches. Instead of wading through qualitative analysis, you give it the inputs (the document array and the target terms) and you get back a set of weighted scores. These weights tell your AI client which terms are most distinctive in a given document relative to everything else in the data.

When your agent needs to score documents mathematically, this is what you use. It's not magic; it's math. The tool computes a TF-IDF value for every term in your provided set against every document in your corpus, which gives you a reproducible measure of relevance. If you need to know which terms set one document apart from the rest of a group, this is where you start.

You feed it two arrays: one listing all the terms you care about, and another containing your full set of documents. The two arrays are independent, so they don't need to be the same length. The tool scores every term against every document and returns the TF-IDF weights in a structured format, which you can sort to see which terms are most indicative of each document's topic.

It's useful for any use case that needs weighted term relevance rather than basic keyword matching. Note that TF-IDF is a lexical statistic: it measures how words are distributed, not what they mean. Whether you're building a search engine, running document similarity checks, or preparing features for models trained on specialized text corpora, the output from `calculate_tf_idf` gives you a measurable indicator of term importance across multiple documents: evidence that certain terms are disproportionately frequent in specific pieces of content within your overall collection. It's reliable, deterministic scoring, period.

## Tools

### calculate_tf_idf
Calculates the exact TF-IDF scores for an array of terms across an array of documents.

## Prompt Examples

**Prompt:** 
```
Here are 5 article texts and the terms ['crypto', 'regulation']. Give me the exact TF-IDF scores to rank these articles.
```

**Response:** 
```
The computation has been executed with mathematical precision. All results are exact and ready for review.
```

**Prompt:** 
```
I have a dataset of customer reviews. Run TF-IDF on the words 'slow' and 'expensive' to see which reviews focus on them.
```

**Response:** 
```
The computation has been executed with mathematical precision. All results are exact and ready for review.
```

**Prompt:** 
```
Calculate the exact TF-IDF scores for these 10 support tickets using these 3 technical keywords.
```

**Response:** 
```
The computation has been executed with mathematical precision. All results are exact and ready for review.
```

## Capabilities

### Score Term Relevance Across Documents
The `calculate_tf_idf` tool computes TF-IDF scores for a given set of terms across an array of documents.

## Use Cases

### Analyzing Technical Support Tickets
A support manager wants to know whether 'API endpoint' or 'login failure' is the more prominent topic in a batch of 500 tickets. Instead of reading them all, they use `calculate_tf_idf` with the keywords ['API endpoint', 'login failure']. The agent returns per-ticket scores, allowing the manager to see which tickets concentrate on each topic and which topic carries more weight overall.

### Benchmarking Academic Papers
A researcher has 10 articles on climate change and wants to show that 'carbon capture' is the most distinctive term. They run `calculate_tf_idf` across all texts using only ['renewable', 'solar', 'carbon capture']. The resulting scores provide objective evidence for their thesis.

### Complaint Keyword Scoring on Product Reviews
A product team wants to see if the words 'slow' or 'expensive' are driving complaints in a batch of 200 reviews. They use `calculate_tf_idf` and get precise scores, immediately identifying which issue (speed vs. cost) is statistically more relevant across the corpus.

### Identifying Key Concepts in Legal Documents
A paralegal needs to quickly compare 15 legal contracts for specific phrases like 'indemnification' or 'termination clause'. Running `calculate_tf_idf` provides a numerical ranking, letting them prioritize the documents in which those phrases are most concentrated.

## Benefits

- **Objective Ranking:** Instead of relying on vague text summaries, you get a hard score for every term in every document. This means a document ranking built on `calculate_tf_idf` is reproducible and can be checked by hand.
- **Deterministic Results:** The calculation runs as ordinary code on the Node.js V8 engine rather than being generated by a language model, so the same inputs always produce the same scores. You'll never get fluctuating scores based on prompt wording; it's always the same math.
- **Scalability for Corpus Analysis:** Feed thousands of documents into a single request. The engine scores every term against every document, and larger collections can be split into batches.
- **Direct NLP Integration:** Adds native statistical text analysis, which LLMs cannot do reliably on their own because they estimate counts instead of computing them. You get computed keyword weights, suitable for building search features or topic models.
- **Reliable Keyword Weighting:** Use `calculate_tf_idf` to determine which technical terms actually drive the unique meaning of a document compared to general vocabulary.

## How It Works

The bottom line is you get computed weights for your keywords, allowing reliable ranking where an LLM on its own would be guessing.

1. Provide the engine with two data sets: an array representing the documents (the corpus) and another array listing the specific terms you want to score.
2. The server uses the V8 engine to run a deterministic calculation, multiplying each term's frequency within a document by its inverse document frequency across all provided texts.
3. You receive objective scores that rank how important each keyword is to the collection of documents.
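The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the engine's implementation: the tokenizer, the function names, and the specific variant (term count divided by document length, times the natural log of N over document frequency) are all assumptions, and it only handles single-word terms.

```python
import math
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and split on runs of letters/digits (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(terms: list[str], documents: list[str]) -> list[dict[str, float]]:
    """Return one {term: score} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    scores = []
    for tokens in tokenized:
        row = {}
        for term in terms:
            key = term.lower()
            # Document frequency: how many documents contain the term.
            df = sum(1 for other in tokenized if key in other)
            tf = tokens.count(key) / len(tokens) if tokens else 0.0
            idf = math.log(n_docs / df) if df else 0.0
            row[term] = tf * idf
        scores.append(row)
    return scores


# Step 1: the corpus and the terms to score.
docs = [
    "the market fell as crypto prices dropped",
    "the committee met and the vote passed",
    "the weather was mild",
]
# Step 2: run the calculation.
result = tf_idf(["crypto", "the"], docs)
# Step 3: rank documents by their score for one term.
ranked = sorted(range(len(docs)), key=lambda i: result[i]["crypto"], reverse=True)
```

In this toy corpus, 'crypto' gets a positive score only in the first document, while 'the' scores zero everywhere because it appears in all three.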

## Frequently Asked Questions

**Why is TF-IDF better than simple word counting?**
Word counting overvalues common words like 'the' or 'and'. TF-IDF lowers the weight of words that appear in many documents, highlighting terms that are uniquely relevant to a specific text.
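A small sketch makes the difference concrete. It uses made-up review snippets and assumes the plain `ln(N / df)` form of inverse document frequency, which may not match the engine's exact formula.

```python
import math
from collections import Counter

# Made-up example documents, split on whitespace for simplicity.
docs = [
    "the refund took the longest time",
    "the app is fast",
    "the price is fair",
]
tokenized = [doc.split() for doc in docs]

# Raw counts across the corpus: "the" dominates.
counts = Counter(token for tokens in tokenized for token in tokens)


def idf(term: str) -> float:
    """Inverse document frequency, ln(N / df); 0.0 if the term never occurs."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log(len(tokenized) / df) if df else 0.0


the_count, refund_count = counts["the"], counts["refund"]
the_idf, refund_idf = idf("the"), idf("refund")
```

By raw count 'the' outranks 'refund', but its inverse document frequency is zero because it occurs in every document, so any TF-IDF score built on it is zero too.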

**Can it process JSON document arrays?**
Yes, just provide a stringified JSON array of text documents and a target array of terms. The engine handles the corpus building and tokenization.
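As a rough sketch, the arguments could be prepared like this. The key names `documents` and `terms` are hypothetical; check the tool's actual input schema before relying on them.

```python
import json

documents = ["First cleaned document.", "Second cleaned document."]
terms = ["document", "cleaned"]

# Hypothetical argument names; the real schema may differ.
arguments = {
    "documents": json.dumps(documents),  # stringified JSON array of texts
    "terms": terms,
}

# Sanity check: the string decodes back to the original list.
round_tripped = json.loads(arguments["documents"])
```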

**Does it work in languages other than English?**
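Yes, in principle. TF-IDF depends only on token frequencies, so it needs no translation logic and works on multi-language corpora. Results do depend on how the text is split into tokens: languages written without spaces between words, such as Chinese or Japanese, may need word segmentation before you pass them in.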

**What are the performance limits when running `calculate_tf_idf` on massive document corpora?**
The engine handles large batches efficiently by processing documents deterministically in memory. For optimal speed, keep your total corpus size under 50,000 documents per single request; exceeding this limit may require chunking the input data.
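A minimal chunking helper might look like the following; the function name is illustrative and the default size simply mirrors the limit quoted above.

```python
from typing import Iterator

MAX_DOCS_PER_REQUEST = 50_000  # the per-request guideline quoted above


def chunk_documents(
    documents: list[str], size: int = MAX_DOCS_PER_REQUEST
) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` documents."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(documents), size):
        yield documents[start:start + size]


# Small demonstration with a tiny chunk size.
batches = list(chunk_documents([f"doc {i}" for i in range(7)], size=3))
```

One caveat, assuming each request is scored as its own corpus: inverse document frequency is then computed per chunk, so scores from different chunks are not directly comparable.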

**Does `calculate_tf_idf` automatically clean non-text content like HTML tags or Markdown formatting?**
No, you must pre-clean your text inputs. The tool expects plain-text strings; if you feed it raw HTML or structured Markdown, the markup is counted as if it were ordinary 'terms', which skews the frequencies and makes the scores unreliable.
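One way to pre-clean HTML, sketched with Python's standard-library `html.parser`. The helper names are illustrative, and this simple version also keeps the contents of `<script>` and `<style>` elements, so real pages may need extra filtering.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags and attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html(raw: str) -> str:
    """Return the text content of `raw` with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


cleaned = strip_html("<p>Login <b>failure</b> on the API endpoint</p>")
```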

**If I pass empty documents or null values to `calculate_tf_idf`, how does the system respond?**
The tool handles these edge cases gracefully. It simply skips any entries in the document array that are blank or null, preventing calculation errors and allowing you to process only valid texts.
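If you need to map scores back to your original records, it may be safer to drop blank entries yourself and keep track of where each surviving document came from. The helper below is a sketch under that assumption; the listing does not say how skipped entries are reflected in the output.

```python
from typing import Optional


def drop_blank_documents(
    documents: list[Optional[str]],
) -> tuple[list[str], list[int]]:
    """Return the non-blank documents and their original positions."""
    kept: list[str] = []
    original_index: list[int] = []
    for i, doc in enumerate(documents):
        # Keep only real strings that contain non-whitespace characters.
        if isinstance(doc, str) and doc.strip():
            kept.append(doc)
            original_index.append(i)
    return kept, original_index


kept, original_index = drop_blank_documents(["a", "", None, "  ", "b"])
```

With the index list in hand, the score for `kept[k]` belongs to the record at `original_index[k]` in your source data.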

**Is the data used by `calculate_tf_idf` secure when running it through your agent?**
Yes. All input data remains confined within the Vinkius sandbox environment during processing. We do not store or share proprietary text corpora outside of the active computation session.

**What is the ideal format for the document array when calling `calculate_tf_idf`?**
The best practice is an array of simple string values, where each string represents a complete, cleaned document. Avoid nested objects or complex data types in the documents list.