# SambaNova (AI Inference) MCP

> SambaNova (AI Inference) MCP Server provides high-speed access to state-of-the-art open models like Llama 3 and DeepSeek. It runs inference using SambaNova's SN40L chips, giving you record token speeds that standard cloud APIs can't match. Use it for chat completions, generating vector embeddings, or forcing model output into structured formats.

## Overview
- **Category:** developer-tools
- **Price:** Free
- **Tags:** llm-inference, llama3, deepseek, embeddings, high-performance-computing

## Description

Listen up. This ain't your grandma's API connection. You're hooking your agent up to SambaNova Cloud through this MCP Server, giving you access to some of the fastest open models out there—stuff like Llama 3.3 and DeepSeek. It runs inference using SN40L chips, meaning you get token speeds that standard cloud APIs just can't touch. You're building real-time apps here; latency is everything.

When your agent needs to chat, you use `create_chat_completion`. This tool lets you generate high-quality conversational text using state-of-the-art models. You feed it a chat history, and it spits out the next part of the conversation, keeping that OpenAI Chat Completions API format intact so your client doesn't sweat the details.

Need to turn unstructured text into something usable? Call `create_embedding`. This tool generates high-dimensional numerical vectors for any piece of text you throw at it, using the specialized SambaStack service. These vectors are exactly what you need when you build a Retrieval-Augmented Generation (RAG) system or integrate with a vector database. You get the numbers, period.

And here’s the critical part for reliable agents: structured output. When you use `create_response`, you force the model to spit out items that are strictly typed. This isn't just getting text; this is guaranteeing a predictable data structure—like a JSON object with specific keys and values—which your agent needs to actually perform an action, not just talk about it.

The whole thing works by having your client call one of these tools. You subscribe to the server, give your AI agent permission, and bam—you get the result instantly because of that SN40L infrastructure under the hood. It's pure speed for complex tasks.

You're building anything where throughput matters, right? If you can't afford lag or low processing power when handling a steady stream of requests, this is your ticket. You bypass standard LLM provider bottlenecks and get a faster, more robust way to run open-source models directly from your agent.

## Tools

### create_chat_completion
Uses SambaNova models to generate conversational text based on a chat history, compatible with OpenAI's Chat Completions API format.

### create_embedding
Generates high-dimensional vector embeddings for given text inputs using the specialized SambaStack service.

### create_response
Processes a request and returns model output items that are strictly typed, ensuring reliable data structure for agentic workflows.

## Prompt Examples

**Prompt:** 
```
Use create_chat_completion with Meta-Llama-3.3-70B-Instruct to explain how SN40L chips work.
```

**Response:** 
```
I'll generate that explanation for you using Llama 3.3 on SambaNova. [The model explains the Reconfigurable Dataflow Architecture of SN40L...]
```

**Prompt:** 
```
Generate an embedding for the sentence 'SambaNova is the fastest inference platform' using E5-Mistral-7B-Instruct.
```

**Response:** 
```
I've generated the embedding vector for your text. It contains 4096 dimensions (example) ready for your vector database.
```

**Prompt:** 
```
Use create_response with MiniMax-M2.7 to process this conversation history.
```

**Response:** 
```
Processing the agentic workflow with MiniMax... I've returned the structured response items based on the input history provided.
```

## Capabilities

### Generate Chat Completions
Use `create_chat_completion` to get high-quality conversational text from state-of-the-art models like Llama 3.3 and DeepSeek.

### Create Text Embeddings
Call `create_embedding` to generate numerical vectors for any piece of text, making it ready for vector databases or RAG systems.

### Force Structured Output
Use `create_response` to make sure the model output is always in a predictable, typed JSON format, which reliable agents need.

## Use Cases

### Building a Real-Time Q&A Chatbot
The user uploads documents and needs a chatbot. Instead of just using chat completions, the agent first calls `create_embedding` to index the docs. When asked a question, it runs the query through `create_embedding` again, finds the most relevant chunks, then passes that context to `create_chat_completion` for an accurate, grounded answer.

### Developing Autonomous Workflow Agents
An agent needs to process user input and output three specific fields: 'User ID,' 'Task Type,' and 'Priority.' Instead of asking the LLM to just write text (and hoping it's formatted correctly), the agent calls `create_response`, forcing the model to return a clean, typed JSON object every single time.

### Comparing Model Performance
A developer needs to see how Llama 3 handles complex reasoning compared to DeepSeek. They can use `create_chat_completion` multiple times with the same prompt and only swap out the model name, letting them benchmark performance side-by-side in a single workflow.

### Large-Scale Knowledge Base Indexing
A data team has 10,000 articles. They can't manually process every one. They use `create_embedding` repeatedly across all documents to generate a massive index of vectors for later retrieval.

## Benefits

- Speed: Achieve rock-bottom latency. SambaNova's SN40L chips deliver tokens per second speeds that make standard cloud providers look slow.
- Reliability: Use `create_response` to guarantee your model output is always structured and typed, which stops agents from breaking over unexpected text formats.
- Scale: Generate millions of vectors quickly. The `create_embedding` tool handles large-scale data indexing for RAG at high throughput.
- Flexibility: Access multiple top models (Llama 3, DeepSeek) through a unified interface via `create_chat_completion`, letting you pick the best model for the job.
- Cost Control: It offers developers a faster and often more cost-effective alternative to standard LLM APIs without sacrificing performance.

## How It Works

The bottom line is you connect your client directly to high-performance compute for instant model results.

1. Subscribe to the SambaNova (AI Inference) MCP Server and input your API key.
2. Your AI agent calls one of the defined tools (e.g., `create_chat_completion`) with specific parameters.
3. The server uses SambaNova's SN40L chips to process the request and sends back the result, whether it’s text, a vector, or structured data.

## Frequently Asked Questions

**How does create_chat_completion differ from a standard API call?**
It runs inference on SambaNova's specialized SN40L chips, giving you much higher tokens-per-second speeds. This means less waiting time and better performance for large models like Llama 3.

**Which tool should I use if my goal is Retrieval-Augmented Generation (RAG)?**
You need both `create_embedding` and `create_chat_completion`. First, use `create_embedding` to turn your documents into vectors. Then, pass the retrieved context via `create_chat_completion` for the final answer.

**Can I use create_response even if I don't need JSON?**
No. The whole point of `create_response` is to enforce structure. It forces type safety, which means you get reliable, structured data—that’s what it does best.

**Is this server compatible with my existing Python agent setup?**
Yes. The tools are designed to be called by any MCP-compatible client (Claude, Cursor, etc.), meaning you can integrate the functions into your existing Python or JavaScript agents.

**How do I authenticate when using create_chat_completion?**
You use your SambaNova Cloud API Key for authentication. Your AI client passes this key as a secure header, giving you direct access to run the high-performance models. This keeps your connection isolated and authenticated.

**What is the maximum input text size when calling create_embedding?**
The embedding tool accepts up to a specified token limit per request, which varies by model (e.g., E5-Mistral-7B-Instruct). Always check the documentation for that specific model's exact token count before sending data.

**Does using create_chat_completion have strict rate limits?**
SambaNova’s infrastructure is designed to maintain low latency and high throughput under heavy load. While standard API usage limits apply, the SN40L chips minimize bottlenecks when running complex inference jobs.

**What happens if I call create_response with incorrect data types?**
The system handles type validation errors by returning a specific HTTP status code and an error message. Your agent can catch this failure and correct the input payload before retrying the structured output request.

**Which models are available for chat completions?**
You can use `create_chat_completion` with models like Meta-Llama-3.3-70B-Instruct, DeepSeek-V3.1, and MiniMax-M2.5 for high-speed text generation.

**Can I generate embeddings for my RAG pipeline?**
Yes! Use the `create_embedding` tool with models like E5-Mistral-7B-Instruct to create vectorized representations of your text data.

**What is the difference between create_chat_completion and create_response?**
`create_chat_completion` follows the standard OpenAI chat format, while `create_response` is a stateless API designed specifically for agentic workflows, returning typed output items.