SambaNova (AI Inference) MCP for AI. Run Llama 3 and DeepSeek models at record-breaking speed.

Q: Which tool should I use if my goal is Retrieval-Augmented Generation (RAG)?

You need both createembedding and createchatcompletion. First, use createembedding to turn your documents into vectors. Then, pass the retrieved context via createchatcompletion for the final answer.

Q: Can I use createresponse even if I don't need JSON?

No. The whole point of createresponse is to enforce structure. It forces type safety, which means you get reliable, structured data—that’s what it does best.

Claude

ChatGPT

Cursor

Gemini

Windsurf

VS Code

JetBrains

Vercel

See Vinkius in Action

Works with every AI agent you already use

…and any MCP-compatible client

How this MCP server connects to your AI agent

SambaNova (AI Inference) MCP Server provides high-speed access to state-of-the-art open models like Llama 3 and DeepSeek. It runs inference using SambaNova's SN40L chips, giving you record token speeds that standard cloud APIs can't match.

Use it for chat completions, generating vector embeddings, or forcing model output into structured formats.

What AI agents can do with SambaNova (AI Inference) Automation

Create chat completion

Uses SambaNova models to generate conversational text based on a chat history, compatible with OpenAI's Chat Completions API format.

Create embedding

Generates high-dimensional vector embeddings for given text inputs using the specialized SambaStack service.

Create response

Processes a request and returns model output items that are strictly typed, ensuring reliable data structure for agentic workflows.

Generate Chat Completions

Use create_chat_completion to get high-quality conversational text from state-of-the-art models like Llama 3.3 and DeepSeek.

Create Text Embeddings

Call create_embedding to generate numerical vectors for any piece of text, making it ready for vector databases or RAG systems.

Force Structured Output

Use create_response to make sure the model output is always in a predictable, typed JSON format, which reliable agents need.

Ask an AI about this

Included with Plan

Waiting for input…

AI Agent

What AI agents can do with SambaNova (AI Inference) MCP Server: 3 Model Inference Tools

Use these three core tools to manage the entire LLM workflow—from generating raw text responses to creating reliable structured data and vector embeddings.

Make your AI actually useful.

Add this MCP to Claude, Cursor, or Windsurf and your AI stops guessing. It gets real tools to look things up, take action, and handle the stuff you keep doing by hand.

Start using SambaNova (AI Inference) on Vinkius

Create Chat Completion

Uses SambaNova models to generate conversational text based on a chat history, compatible with OpenAI's Chat Completions API format.

Create Embedding

Generates high-dimensional vector embeddings for given text inputs using the...

Create Response

Processes a request and returns model output items that are strictly typed, ensuring...

Security and governance baked right in.

Pick your AI client below to get set up. Just create a Vinkius account, subscribe, and you're instantly up and running. We handle the entire backend infrastructure, delivering out-of-the-box support for HTTPS Streamable, SSE, and OAuth2—zero messy routing required.

Claude AI

Open Claude Settings

Go to claude.ai, click your profile icon, then navigate to Customize → Connectors.

Add Custom Connector

Click the "+" button and select Add custom connector. Paste your Vinkius endpoint URL:

https://edge.vinkius.com/[YOUR_TOKEN_HERE]/mcp

Replace [YOUR_TOKEN_HERE] with your token from cloud.vinkius.com. For OAuth-protected servers, expand Advanced settings to add credentials.

Start a conversation

Open a new chat. The SambaNova (AI Inference) integration is available immediately — no restart needed.

Antigravity

Configure Agent Environment

Open your Antigravity agent's workspace configuration or mcp-servers.json file.

Bind the Endpoint

Add the Vinkius endpoint URL to your agent's MCP connections list:

"mcp_servers": {
  "sambanova-ai-inference": {
    "serverUrl": "https://edge.vinkius.com/[YOUR_TOKEN_HERE]/mcp"
  }
}

Provide your secure token in place of [YOUR_TOKEN_HERE] to ensure your agent requests are authenticated.

Execute

Start your Antigravity session. The agent will autonomously discover and utilize the SambaNova (AI Inference) tools with full Vinkius guardrails applied.

VS Code Copilot

⚡

One-Click Install (Recommended)

In your Vinkius Dashboard, simply click the Add to VS Code button for this server. We'll automatically configure your local workspace.

Or configure manually

Open MCP Settings

Open VS Code, press Ctrl/Cmd + Shift + P, and search for GitHub Copilot: MCP Servers.

Add Server Config

Add the Vinkius endpoint configuration to your mcp-servers.json file:

"sambanova-ai-inference": {
  "url": "https://edge.vinkius.com/[YOUR_TOKEN_HERE]/mcp"
}

Ensure you replace [YOUR_TOKEN_HERE] with your token from cloud.vinkius.com.

LangChain

Install Dependencies

Install the LangChain MCP adapters for your environment:

pip install langchain-mcp-adapters

Connect the Server

Use the SSEClient in LangChain to connect to the Vinkius managed endpoint:

from langchain_mcp_adapters.client import SSEClient

# Connect to Vinkius
client = SSEClient(url="https://edge.vinkius.com/[YOUR_TOKEN_HERE]/mcp")
tools = client.get_tools()

CrewAI

Define the Tool

Load the Vinkius MCP tools into your CrewAI agents:

from crewai import Agent
from mcp_crewai import MCPTool

# Connect securely to Vinkius
vinkius_tools = MCPTool(url="https://edge.vinkius.com/[YOUR_TOKEN_HERE]/mcp")

# Assign to Agent
researcher = Agent(
    role='Data Researcher',
    tools=vinkius_tools.get_all()
)

Execute Task

Run your CrewAI process. The agent will autonomously route tasks to the Vinkius managed server.

Choose How to Get Started

Build a custom MCP for your own tools, or connect a ready-made integration from our catalog.

Build Your Own

Turn any API into an MCP. Import a spec, define Agent Skills, or deploy with MCPFusion.

Import from OpenAPI, Swagger, or YAML specs
Create Agent Skills with progressive disclosure
Deploy to edge with MCPFusion framework
Built in DLP, auth, and compliance on every call
Real time usage dashboard and cost metering
Publish to catalog or keep private

Start building

Make Your AI Do More

Start with SambaNova (AI Inference), then connect any of our 5,100+ other servers whenever your AI needs more. One click, no limits.

Use this MCP plus 5,100+ others, all in one place
Add new capabilities to your AI anytime you want
Every connection is secured and compliant automatically
Track usage and costs across all your servers
Works with Claude, ChatGPT, Cursor, and more
New servers added to the catalog every week

SambaNova (AI Inference) MCP server cover

Independent Platform Disclaimer: Vinkius is an independent platform and is not affiliated with, endorsed by, sponsored by, verified by, or otherwise authorized by SambaNova. All third-party trademarks, logos, and brand names are the property of their respective owners. Their use on this website is strictly for informational purposes to identify service compatibility and interoperability.

VINKIUS INFRASTRUCTURE

Cloud Hosted

Managed infra

V8 Isolated

Sandboxed per request

Zero-Trust Proxy

No stored credentials

DLP Enforced

Policy on every call

GDPR Compliant

EU data residency

Token Compression

~60% cost reduction

Your data is protected. See how we built it.

Built on the Model Context Protocol (MCP) for Claude, ChatGPT, Cursor, and more

The Model Context Protocol standardizes how applications expose capabilities to LLMs. Instead of operating in isolation, your AI gains direct access to external platforms, live data, and real-world actions through secure, standardized connections.

This connection provides 3 powerful capabilities that interface natively with Claude, ChatGPT, Cursor, and other compatible AI platforms. No middleware. No custom integration required.

Waiting on AI responses shouldn't feel like dial-up internet., Solved with Vinkius AI Gateway

Today, making an LLM work means hitting a bunch of endpoints. You send a prompt for text, then maybe another call to generate embeddings, and if you need the output to be JSON, you hope the model is nice enough to format it right. It's slow, and every step introduces potential failure points.

With this MCP server, your agent talks to SambaNova directly. You make one call—say, `create_chat_completion`—and we handle the blazing-fast processing on SN40L chips. The result is back instantly, giving you a single source of truth for model output.

The `create_response` tool lets your agent stop guessing and start building.

Before this server, if an agent needed to extract three facts from text, it would call a general chat completion. The output might be: 'The car was red. It cost $50k. John bought it.' Your code then has to use regex or complex parsing to reliably pull out the color, price, and name.

Now, your agent calls `create_response` and specifies the schema: {color: string, price: number, buyer: string}. The model is forced to fit that structure. You get clean, predictable data every time.

Support 24/7 support@vinkius.com ↗

Security Vinkius Trust Center ↗

SLA Service Level Agreement ↗

Report Listing Send Report ↗

llm-inference

llama3

deepseek

embeddings

high-performance-computing

What your AI can actually do with this

Listen up. This ain't your grandma's API connection. You're hooking your agent up to SambaNova Cloud through this MCP Server, giving you access to some of the fastest open models out there—stuff like Llama 3.3 and DeepSeek. It runs inference using SN40L chips, meaning you get token speeds that standard cloud APIs just can't touch.

You're building real-time apps here; latency is everything.

When your agent needs to chat, you use create_chat_completion. This tool lets you generate high-quality conversational text using state-of-the-art models. You feed it a chat history, and it spits out the next part of the conversation, keeping that OpenAI Chat Completions API format intact so your client doesn't sweat the details.

Need to turn unstructured text into something usable? Call create_embedding. This tool generates high-dimensional numerical vectors for any piece of text you throw at it, using the specialized SambaStack service. These vectors are exactly what you need when you build a Retrieval-Augmented Generation (RAG) system or integrate with a vector database.

You get the numbers, period.

And here’s the critical part for reliable agents: structured output. When you use create_response, you force the model to spit out items that are strictly typed. This isn't just getting text; this is guaranteeing a predictable data structure—like a JSON object with specific keys and values—which your agent needs to actually perform an action, not just talk about it.

The whole thing works by having your client call one of these tools. You subscribe to the server, give your AI agent permission, and bam—you get the result instantly because of that SN40L infrastructure under the hood. It's pure speed for complex tasks.

You're building anything where throughput matters, right? If you can't afford lag or low processing power when handling a steady stream of requests, this is your ticket. You bypass standard LLM provider bottlenecks and get a faster, more robust way to run open-source models directly from your agent.

Built · Hosted · Managed by Vinkius SambaNova AI Inference - High-Speed LLM & Embedding Tools

Server ID 019e5d52-bb3f-71fb-aa54-2e3a615c11b4

Vinkius Inspector

Compliance Grade A+

Score 100/100

Report View Report ↗

Here's how it actually works

The bottom line is you connect your client directly to high-performance compute for instant model results.

Subscribe to the SambaNova (AI Inference) MCP Server and input your API key.

Your AI agent calls one of the defined tools (e.g., create_chat_completion) with specific parameters.

The server uses SambaNova's SN40L chips to process the request and sends back the result, whether it’s text, a vector, or structured data.

What Changes When You Connect

Speed: Achieve rock-bottom latency. SambaNova's SN40L chips deliver tokens per second speeds that make standard cloud providers look slow.

Reliability: Use create_response to guarantee your model output is always structured and typed, which stops agents from breaking over unexpected text formats.

Scale: Generate millions of vectors quickly. The create_embedding tool handles large-scale data indexing for RAG at high throughput.

Flexibility: Access multiple top models (Llama 3, DeepSeek) through a unified interface via create_chat_completion, letting you pick the best model for the job.

Cost Control: It offers developers a faster and often more cost-effective alternative to standard LLM APIs without sacrificing performance.

See it in action

01 01

Building a Real-Time Q&A Chatbot

The user uploads documents and needs a chatbot. Instead of just using chat completions, the agent first calls create_embedding to index the docs. When asked a question, it runs the query through create_embedding again, finds the most relevant chunks, then passes that context to create_chat_completion for an accurate, grounded answer.

02 02

Developing Autonomous Workflow Agents

An agent needs to process user input and output three specific fields: 'User ID,' 'Task Type,' and 'Priority.' Instead of asking the LLM to just write text (and hoping it's formatted correctly), the agent calls create_response, forcing the model to return a clean, typed JSON object every single time.

03 03

Comparing Model Performance

A developer needs to see how Llama 3 handles complex reasoning compared to DeepSeek. They can use create_chat_completion multiple times with the same prompt and only swap out the model name, letting them benchmark performance side-by-side in a single workflow.

04 04

Large-Scale Knowledge Base Indexing

A data team has 10,000 articles. They can't manually process every one. They use create_embedding repeatedly across all documents to generate a massive index of vectors for later retrieval.

The honest tradeoffs

Treating LLM output as reliable data

Anti-pattern

The agent calls create_chat_completion and expects the result to always be JSON, but if the model hallucinates a stray comma or changes its format slightly, the downstream code breaks.

The Fix

Don't rely on raw chat completions for structured input. Always use create_response. This tool forces the output into specific data types, so your agent won't crash when the LLM gets creative.

Running vector searches manually

Anti-pattern

The developer copies a user query and tries to pass it directly into create_chat_completion hoping the model will 'search' for context, which it can't do reliably.

The Fix

You must convert text to numbers first. Before any search or RAG step, call create_embedding. This turns your plain text query into a vector that your database actually understands.

Using general APIs for specialized tasks

Anti-pattern

Trying to run computationally heavy model inference on a basic endpoint when you need the speed of SN40L.

The Fix

For maximum performance, especially with large models like Llama 3.3-70B, use this server. It's built specifically for high-throughput, low-latency inference.

When It Fits, When It Doesn't

Use SambaNova (AI Inference) MCP Server if your application needs ultra-low latency and very high throughput when calling LLMs or generating embeddings. Specifically: 1) If you need the model output to be a guaranteed data structure, use create_response. 2) If your app relies on searching documents against a knowledge base, start by calling create_embedding for all inputs. 3) If you just need general chat responses at record speed, stick with create_chat_completion. DON'T use this server if: You are building simple local scripts that run occasionally and latency is zero-percent of your concern. In those cases, a basic API call might suffice. But for any production system handling real users, the performance edge here is critical.

Questions you might have

How does create_chat_completion differ from a standard API call? +

It runs inference on SambaNova's specialized SN40L chips, giving you much higher tokens-per-second speeds. This means less waiting time and better performance for large models like Llama 3.

Which tool should I use if my goal is Retrieval-Augmented Generation (RAG)? +

You need both create_embedding and create_chat_completion. First, use create_embedding to turn your documents into vectors. Then, pass the retrieved context via create_chat_completion for the final answer.

Can I use create_response even if I don't need JSON? +

No. The whole point of create_response is to enforce structure. It forces type safety, which means you get reliable, structured data—that’s what it does best.

Is this server compatible with my existing Python agent setup? +

Yes. The tools are designed to be called by any MCP-compatible client (Claude, Cursor, etc.), meaning you can integrate the functions into your existing Python or JavaScript agents.

How do I authenticate when using create_chat_completion? +

You use your SambaNova Cloud API Key for authentication. Your AI client passes this key as a secure header, giving you direct access to run the high-performance models. This keeps your connection isolated and authenticated.

What is the maximum input text size when calling create_embedding? +

The embedding tool accepts up to a specified token limit per request, which varies by model (e.g., E5-Mistral-7B-Instruct). Always check the documentation for that specific model's exact token count before sending data.

Does using create_chat_completion have strict rate limits? +

SambaNova’s infrastructure is designed to maintain low latency and high throughput under heavy load. While standard API usage limits apply, the SN40L chips minimize bottlenecks when running complex inference jobs.

What happens if I call create_response with incorrect data types? +

The system handles type validation errors by returning a specific HTTP status code and an error message. Your agent can catch this failure and correct the input payload before retrying the structured output request.

Which models are available for chat completions? +

You can use create_chat_completion with models like Meta-Llama-3.3-70B-Instruct, DeepSeek-V3.1, and MiniMax-M2.5 for high-speed text generation.

Can I generate embeddings for my RAG pipeline? +

Yes! Use the create_embedding tool with models like E5-Mistral-7B-Instruct to create vectorized representations of your text data.

What is the difference between create_chat_completion and create_response? +

create_chat_completion follows the standard OpenAI chat format, while create_response is a stateless API designed specifically for agentic workflows, returning typed output items.

How this MCP server connects to your AI agent

What AI agents can do with SambaNova (AI Inference) Automation

Create chat completion

Create embedding

Create response

What AI agents can do with SambaNova (AI Inference) MCP Server: 3 Model Inference Tools

Create Chat Completion

Uses SambaNova models to generate conversational text based on a chat history, compatible with OpenAI's Chat Completions API format.

Create Embedding

Generates high-dimensional vector embeddings for given text inputs using the...

Create Response

Processes a request and returns model output items that are strictly typed, ensuring...

Security and governance baked right in.

Claude AI

Open Claude Settings

Add Custom Connector

Start a conversation

Claude Code

Open your terminal

Add the MCP Server

Start coding

Cursor

One-Click Install (Recommended)

Open Cursor Settings

Add New Server

Use in Composer

Antigravity

Configure Agent Environment

Bind the Endpoint

Execute

VS Code Copilot

One-Click Install (Recommended)

Open MCP Settings

Add Server Config

Windsurf

One-Click Install (Recommended)

Open Windsurf Settings

Add Server Endpoint

LangChain

Install Dependencies

Connect the Server

CrewAI

Define the Tool

Execute Task

Choose How to Get Started

Build Your Own

Make Your AI Do More

Built on the Model Context Protocol (MCP) for Claude, ChatGPT, Cursor, and more

Waiting on AI responses shouldn't feel like dial-up internet., Solved with Vinkius AI Gateway

The `create_response` tool lets your agent stop guessing and start building.

llm-inference

llama3

deepseek

embeddings

high-performance-computing

What your AI can actually do with this

Here's how it actually works

Who is this actually for?

What Changes When You Connect

Building a Real-Time Q&A Chatbot

Developing Autonomous Workflow Agents

Comparing Model Performance

Large-Scale Knowledge Base Indexing

The honest tradeoffs

Treating LLM output as reliable data

Running vector searches manually

Using general APIs for specialized tasks

When It Fits, When It Doesn't

Questions you might have