# Groq MCP

> Groq MCP Server. Connect your AI agent to Groq's LPU-accelerated endpoints for low-latency LLM inference. Run chat completions with Llama 3 or Mixtral, transcribe audio files, translate non-English audio to English text, and enforce structured JSON output.

## Overview
- **Category:** superpower
- **Price:** Free
- **Tags:** llm-inference, lpu-acceleration, ai-latency, audio-transcription, generative-ai, high-performance-computing

## Description

**Groq MCP Server - Ultra-fast LLM Inference**

Connect your AI agent to Groq's LPU-accelerated endpoints for low-latency LLM inference and full control over your generative AI workflows. Use `chat_completion` to run text generation with Llama, Mixtral, or Gemma models, and `create_embedding` to turn text into numerical embeddings for vector storage and retrieval. To see what is available, `list_models` returns all model IDs and versions, and `get_model` returns the details of a specific Groq model. `moderate_content` flags unsafe or prohibited material, and `structured_output` makes the AI return data that matches a defined JSON format. For audio, `transcribe_audio` converts files to plain text transcripts, and `translate_audio` turns non-English audio into readable English text.

## Tools

### chat_completion
Generates a chat completion using Llama, Mixtral, or Gemma models at ultra-fast inference speeds.

### create_embedding
Creates numerical embeddings from text input for vector storage and retrieval.

### get_model
Retrieves specific details and metadata about an available Groq model.

### list_models
Lists all model IDs and versions currently available for inference.

### moderate_content
Checks a given piece of content for safety violations or policy breaches.

### structured_output
Forces the AI to output data that strictly matches a defined JSON format.

### transcribe_audio
Converts audio files into a readable text transcript.

### translate_audio
Converts non-English audio files into written English text.

## Prompt Examples

**Prompt:** 
```
Ask llama3-70b: 'Write a python function to scrape a website.'
```

**Response:** 
```
Inference complete! Llama 3 response: 'Here is a simple python function using BeautifulSoup and requests to scrape data...' [Blazing-fast response delivered via Groq LPU].
```

**Prompt:** 
```
Transcribe this audio meeting: https://example.com/meeting.mp3
```

**Response:** 
```
Transcription started! I'm using the Groq-optimized Whisper large-v3 model to parse your meeting audio. I'll provide the full timestamped text for you in just a few seconds.
```

**Prompt:** 
```
Get model info for 'mixtral-8x7b-32768'
```

**Response:** 
```
Retrieving model metadata... Mixtral-8x7b-32768 is a high-performance LLM with a context window of 32,768 tokens. It supports chat completions and tool-calling on Groq's LPU architecture.
```

## Capabilities

### Generate Chat Completions
Runs text generation using Llama, Mixtral, or Gemma models at ultra-fast speeds.

### Create Text Embeddings
Generates numerical vectors for text chunks to power semantic search and RAG systems.

### Retrieve Model Metadata
Pulls details about specific Groq models, like context window size or supported features.

### List Available Models
Returns a list of all high-speed models currently available on the Groq platform.

### Check Content Safety
Runs text or content through a moderation check to flag unsafe or prohibited material.

### Enforce JSON Output
Forces the AI to generate text that adheres to a valid JSON schema, which suits database writes and other automated pipelines.

### Transcribe Audio Files
Converts an audio file into a plain text transcript using optimized Whisper models.

### Translate Audio Files
Takes non-English audio and outputs a readable English text translation.

## Use Cases

### Building a Customer Support Bot
A support agent needs to handle incoming audio calls. They ask their agent to run `transcribe_audio` on the recording. The agent feeds the resulting text into `chat_completion` to summarize the issue and then uses `structured_output` to log the ticket details into a structured format. The problem is solved in one conversational flow.
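
The flow above can be sketched as three chained tool calls. This is a minimal illustration, not the server's actual client code: `call_tool` stands in for whatever your MCP client exposes, the argument names (`url`, `model`, `prompt`, `schema`) are assumptions, and `fake_call_tool` returns canned values so the sketch runs offline.

```python
from typing import Any, Callable

# Hypothetical signature: (tool_name, arguments) -> tool result.
ToolCaller = Callable[[str, dict[str, Any]], Any]

# Assumed ticket shape; adapt it to your own ticketing system.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "priority": {"type": "string"},
    },
    "required": ["summary", "priority"],
}


def handle_support_call(call_tool: ToolCaller, audio_url: str) -> dict:
    """Transcribe a recording, summarize it, then log it as structured data."""
    transcript = call_tool("transcribe_audio", {"url": audio_url})
    summary = call_tool(
        "chat_completion",
        {
            "model": "llama3-70b",  # placeholder ID; confirm with list_models
            "prompt": f"Summarize the customer's issue:\n{transcript}",
        },
    )
    return call_tool(
        "structured_output",
        {"prompt": summary, "schema": TICKET_SCHEMA},
    )


calls_made: list[str] = []


def fake_call_tool(name: str, arguments: dict[str, Any]) -> Any:
    """Offline stand-in for a real MCP client; returns canned results."""
    calls_made.append(name)
    canned = {
        "transcribe_audio": "My order arrived damaged.",
        "chat_completion": "Customer reports a damaged order.",
        "structured_output": {
            "summary": "Customer reports a damaged order.",
            "priority": "high",
        },
    }
    return canned[name]


ticket = handle_support_call(fake_call_tool, "https://example.com/call.mp3")
print(ticket)
```

Passing the caller in as a parameter keeps the pipeline independent of any particular MCP client library.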

### Analyzing Foreign Market Interviews
A market researcher records interviews in Mandarin. Instead of manually transcribing and translating, they ask their agent to run `translate_audio` on the file. They get immediate, readable English text, which they can then feed into `create_embedding` to build a knowledge base.
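
Once translated text has been embedded, a knowledge base lookup usually comes down to ranking stored vectors by similarity to a query vector. The sketch below shows that step with cosine similarity. The three-dimensional vectors are toy placeholders rather than real `create_embedding` output, and the chunk labels are invented for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(
    query_vec: list[float], indexed: list[tuple[str, list[float]]]
) -> list[tuple[str, list[float]]]:
    """Return (text, vector) pairs ordered from most to least similar."""
    return sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )


# Toy vectors standing in for real embeddings of interview excerpts.
knowledge_base = [
    ("shipping complaints", [0.0, 1.0, 0.0]),
    ("pricing feedback", [1.0, 0.0, 0.0]),
]
query = [0.9, 0.1, 0.0]

ranked = rank_chunks(query, knowledge_base)
print(ranked[0][0])
```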

### Streaming Real-Time Code Assistance
A software engineer is coding and needs fast context. They use their agent to run `chat_completion` with Llama 3 on a large code block, getting near-instant responses. This lets them debug or write code without the typical API lag.

### Automating Form Submission from Chat
A product manager wants their agent to capture user requirements. They prompt the agent to run `structured_output` and specify a JSON schema for 'Feature Request'. The agent outputs the data, and the PM can pipe that JSON directly into a ticketing system.
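
A 'Feature Request' schema and a check on the returned JSON might look like the sketch below. The field names and the priority values are assumptions chosen for illustration, and the check uses only the standard library, so it covers required keys, string types, and enum values rather than full JSON Schema validation.

```python
import json

# Hypothetical schema; the source does not define the actual fields.
FEATURE_REQUEST_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "description": {"type": "string"},
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
    },
    "required": ["title", "description", "priority"],
}


def check_feature_request(raw: str) -> dict:
    """Parse tool output and verify required keys, types, and enum values."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [k for k in FEATURE_REQUEST_SCHEMA["required"] if k not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for key, spec in FEATURE_REQUEST_SCHEMA["properties"].items():
        value = data[key]
        if not isinstance(value, str):
            raise ValueError(f"{key} must be a string")
        if "enum" in spec and value not in spec["enum"]:
            raise ValueError(f"{key} must be one of {spec['enum']}")
    return data


sample = '{"title": "Dark mode", "description": "Add a dark theme.", "priority": "medium"}'
feature = check_feature_request(sample)
print(feature["title"])
```

Running a check like this before the ticketing system receives the payload catches schema drift early, whatever the model returned.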

## Benefits

- **Speed:** Chat completions using Llama 3 or Mixtral run with LPU acceleration, so your agent often gets responses in fractions of a second. That matters most in interactive use, where users notice every delay.
- **Multimodal Workflow:** Handle complex inputs easily. You can transcribe audio with `transcribe_audio` and immediately pass that text to `chat_completion` for summarization.
- **Data Reliability:** Never trust raw LLM output. Use `structured_output` to constrain the AI to valid JSON that matches your format, so results are ready for database writes.
- **Global Reach:** Process non-English audio. Run `translate_audio` to get readable English text without a separate translation API.
- **System Control:** Monitor your setup with `get_model` and `list_models`. You always know exactly which model and version your agent is using.
- **Safety & Compliance:** Use `moderate_content` to screen input and output data, helping keep your application safe and compliant.

## How It Works

You get sub-second, hardware-accelerated AI results directly in your chat or IDE.

1. Subscribe to the Groq server and provide your Groq API Key in the client settings.
2. Your AI client sends a request to the server (e.g., 'Transcribe this file').
3. The server runs the matching tool call against Groq's LPU-backed endpoints and returns the result.
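
Step 2 can be sketched at the wire level. MCP clients invoke tools with a JSON-RPC 2.0 message whose method is `tools/call`, and the snippet below builds one. The tool name comes from this page, but the argument names (`url`, `model`) are assumptions; the real names are defined by the server's tool schema.

```python
import json
from itertools import count

# Simple incrementing request IDs, as JSON-RPC requires a unique id per call.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical argument names; check the server's published tool schema.
request = build_tool_call(
    "transcribe_audio",
    {"url": "https://example.com/meeting.mp3", "model": "whisper-large-v3"},
)
print(request)
```

In practice your AI client builds and sends this message for you; the sketch only shows what travels between client and server.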

## Frequently Asked Questions

**How does the Groq MCP Server improve my LLM speed?**
It uses Groq's LPU-accelerated endpoints, which deliver chat completions at very low latency, so your agent and the application built on it respond noticeably faster.

**Can I use the Groq MCP Server for both transcription and translation?**
Yes. Use `transcribe_audio` to get plain text, or use `translate_audio` to get an English text version of non-English audio.

**Is the structured_output tool reliable?**
Yes, the `structured_output` tool constrains the AI's generation to a strict JSON format. This greatly reduces the risk of the model adding explanatory text or stray characters.

**What models can I use with the chat_completion tool?**
You can use Llama 3, Mixtral, and Gemma models for chat completions. You can check model availability using `list_models`.

**Does the Groq MCP Server handle model discovery?**
Yes, the `get_model` and `list_models` tools let your agent check available models and retrieve their specific metadata before making a call.

**How do I manage model availability using the list_models tool?**
The `list_models` tool shows all available models. Use it to check model IDs and versions before calling other tools, so your agent targets a model that is currently available.
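
One way to apply this is to compare a preference list against the IDs that `list_models` returns and take the first match. The sketch assumes the tool's output has already been reduced to a list of ID strings; the IDs shown are the ones mentioned on this page and may not match what the server currently reports.

```python
def pick_model(available: list[str], preferred: list[str]) -> str:
    """Return the first preferred model ID that is currently available."""
    for model_id in preferred:
        if model_id in available:
            return model_id
    raise LookupError(f"none of {preferred} found in available models")


# Sample data standing in for a real list_models result.
available_models = ["mixtral-8x7b-32768", "whisper-large-v3"]
preferences = ["llama3-70b", "mixtral-8x7b-32768"]

chosen = pick_model(available_models, preferences)
print(chosen)
```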

**What is the purpose of the structured_output tool?**
It forces the AI to generate output in rigid JSON format. This is critical for automating data entry and integrating the results into downstream systems reliably.

**Can the chat_completion tool handle complex tool-calling logic?**
Yes, the `chat_completion` tool supports tool calling. You can pass external tool definitions described as JSON and let the model request calls to those specialized tools.
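
As a rough illustration, a tool definition in the OpenAI-compatible function format, plus a local dispatcher for the call the model requests, could look like this. Whether this server's `chat_completion` tool accepts definitions in exactly this shape is an assumption, and `get_order_status` is a hypothetical tool invented for the example.

```python
import json

# Hypothetical tool definition in the OpenAI-compatible function format.
GET_ORDER_STATUS_TOOL = {
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of an order by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}


def get_order_status(order_id: str) -> str:
    """Stub handler; a real one would query your order system."""
    return f"Order {order_id}: status unknown (stub)"


LOCAL_TOOLS = {"get_order_status": get_order_status}


def dispatch_tool_call(tool_call: dict) -> str:
    """Run the local handler named in a model-issued tool call."""
    function = tool_call["function"]
    handler = LOCAL_TOOLS[function["name"]]
    # In this format the arguments arrive as a JSON-encoded string.
    arguments = json.loads(function["arguments"])
    return handler(**arguments)


# Example of the call shape a model might return (constructed by hand here).
example_call = {
    "function": {
        "name": "get_order_status",
        "arguments": '{"order_id": "A123"}',
    }
}
result = dispatch_tool_call(example_call)
print(result)
```

Keeping handlers in a dictionary keyed by name means an unknown tool name fails fast with a `KeyError` instead of running arbitrary code.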

**How fast are Groq's chat completions compared to standard GPUs?**
Groq's LPU architecture is designed for very low-latency inference, often delivering hundreds of tokens per second. Your agent uses the `chat_completion` tool to execute these requests, returning AI responses almost instantly.

**Can my agent transcribe long audio files using Groq Whisper?**
Yes. Use the `transcribe_audio` tool. Provide the public URL of your audio file and select a Whisper model (e.g., 'whisper-large-v3'). The agent will process the audio and return the full text transcript.

**How do I ensure the AI response is formatted as valid JSON via chat?**
Use the `structured_output` tool. This activates Groq's JSON mode, which constrains the model's output to valid JSON, making it well suited to direct system integrations.