Together AI MCP. Run open-source LLMs and ML pipelines directly in your agent.

Q: How do I check which open-source LLMs are available?

You run the listavailablemodels tool. This gives you a list of every model ID and its capabilities right now, letting you pick the best engine for your job.

Q: Is chatcompletion better than textcompletion?

chatcompletion is almost always what you want. It's built to handle message history (the whole conversation), while textcompletion is only for single, stateless prompts.

Q: How do I start training my own LLM?

Use the createfinetunejob tool. You must provide a base model ID and point to your specific dataset file for it to begin.

Q: How do I check the status of a fine-tuning job after running createfinetunejob?

You use the listfinetunejobs tool to query all jobs. This returns a list that includes both active and completed runs, showing you the current state (e.g., 'PENDING', 'RUNNING', or 'FAILED') for easy monitoring.

Claude

ChatGPT

Cursor

Gemini

Windsurf

VS Code

JetBrains

Vercel

See Vinkius in Action

Works with every AI agent you already use

…and any MCP-compatible client

Just plug in your AI agents and start using Vinkius.

Together AI connects your local agent to dozens of open-source models and ML services. You can instantly generate chat completions, create vector embeddings for RAG pipelines, or fine-tune custom LLMs—all through one API endpoint.

It lets you query Llama, Mixtral, and more from a single place without leaving your IDE.

What your AI agents can do

Chat completion

Runs a multi-turn conversation using an open-source model, accepting a model ID and message history array.

Create finetune job

Starts the training process for a custom LLM by specifying a base model and the dataset to train on.

Generate embeddings

Converts a list of input strings into numerical vector embeddings using a specified embedding model ID.

+ 4 more capabilities included

List available models

Checks the Together AI network to find all currently supported open-source LLMs and diffusion models.

Run chat completions

Executes multi-turn conversational cycles using advanced, specified open-source models (e.g., Llama 3).

Generate text embeddings

Converts input texts into numerical vectors that capture semantic meaning for database indexing.

Create images from prompts

Uses external diffusion models to generate visual media based on a detailed text description.

Start fine-tuning jobs

Initiates a custom training run by pointing the system to a base model and your specific dataset file.

Check job statuses

Retrieves the current status of any existing or previously submitted model fine-tuning jobs.

Ask AI about this MCP

Ask ChatGPT

Ask Claude

Ask Perplexity

Supported MCP Clients

Claude

ChatGPT

Cursor

Gemini

Windsurf

VS Code

JetBrains

Vercel

+ other MCP clients

Free for Subscribers

Waiting for input…

AI Agent

Together AI MCP Server: 7 Tools for Model Operations

Master model execution, embedding generation, and custom training by accessing seven specialized tools within your agent.

chat019d7613

chat completion

Runs a multi-turn conversation using an open-source model, accepting a model ID and message history array.

create019d7613

create finetune job

Starts the training process for a custom LLM by specifying a base model and the dataset to train on.

generate019d7613

generate embeddings

Converts a list of input strings into numerical vector embeddings using a specified embedding model ID.

generate019d7613

generate image

Creates an image file by sending a detailed descriptive text prompt to the external diffusion model.

list019d7613

list available models

Returns a list of all LLMs and open-source models currently supported on the Together AI platform.

list019d7613

list finetune jobs

Retrieves a list of all fine-tuning jobs, allowing you to check their current status.

text019d7613

text completion

Executes a single text generation request using an open-source model based on a provided prompt and model ID.

Choose How to Get Started

Build a custom MCP for your own tools, or connect a ready-made integration from our catalog.

Build Your Own

Turn any API into an MCP. Import a spec, define Agent Skills, or deploy with MCPFusion.

Import from OpenAPI, Swagger, or YAML specs
Create Agent Skills with progressive disclosure
Deploy to edge with MCPFusion framework
Built in DLP, auth, and compliance on every call
Real time usage dashboard and cost metering
Publish to catalog or keep private

Start building

Make Your AI Do More

Start with Together AI, then connect any of our 4,700+ other servers whenever your AI needs more. One click, no limits.

Use this MCP plus 4,700+ others, all in one place
Add new capabilities to your AI anytime you want
Every connection is secured and compliant automatically
Track usage and costs across all your servers
Works with Claude, ChatGPT, Cursor, and more
New servers added to the catalog every week

What you can do with this MCP connector

Look, you've got an agent running locally, and it needs muscle that doesn't cost a fortune or tie you down to some closed system. This MCP server connects your setup directly to dozens of open-source models and ML services from the Together AI network. It gives you high-speed inference for big language models like Llama 3 and Mixtral.

You can run everything—from simple prompts to full custom model training runs—all through one API endpoint, right inside your IDE.

When you need to figure out what's available, start with the list_available_models tool. It checks the entire Together AI network and spits back a comprehensive list of every open-source LLM and diffusion model they support. This lets you know exactly which engine—whether it's for natural language processing or image generation—you need to tackle your current task.

For basic text tasks, you've got two ways to go. If you just need a quick answer based on a single prompt, use text_completion. You just send over the specific model ID and the prompt, and it spits out the requested text. But if you’re building a chat interface or running a complex dialogue that requires remembering context, you'll want to run a multi-turn conversation using chat_completion.

This tool handles the entire message history—you pass in the model ID along with an array of previous messages—so your agent doesn't forget what was said two turns ago.

If your goal is building a Retrieval Augmented Generation (RAG) pipeline, you gotta deal with embeddings. Use generate_embeddings to convert any list of raw input strings into numerical vector embeddings. You just specify the embedding model ID, and it handles turning that plain text into vectors ready for database indexing.

This is how you make your documents searchable.

Need some visual flair? If you're working on anything graphical, generate_image uses external diffusion models to create image files. All you gotta do is send over a detailed descriptive text prompt—the more specific you are about what you want the picture to look like, the better it turns out.

For custom AI development, you have two tools managing the entire lifecycle of fine-tuning. First, when your open-source model isn't quite hitting the mark for your niche use case, you kick off a new training run using create_finetune_job. This tool takes two key inputs: the base model ID and the specific dataset you want it to train on.

That starts the whole process.

Once that job is running in the background—and it will take time—you need to know if it's stuck or done. Use list_finetune_jobs to retrieve a list of all your submitted fine-tuning jobs. This lets you check the current status of every single job, giving you visibility into whether they're queued, running, or finished.

It covers everything from checking existing runs to listing them for an audit.

How Together AI MCP Works

1 Sign up for the Together AI integration and grab a developer API Key from their control panel.
2 Plug that API key into your agent's configuration, specifying which models you need to access.
3 Your AI client uses the server tools (like chat_completion or generate_embeddings) to run inference or start jobs directly.

The bottom line is: it lets your local code talk to dozens of powerful open-source LLMs without you needing separate keys or endpoints for each one.

Who Is Together AI MCP For?

This stack is built for the ML Engineer and Software Developer who's sick of juggling multiple cloud provider dashboards. If your job involves building complex, multi-stage AI pipelines—like RAG or specialized classification—you need this. It gives you model diversity without vendor lock-in.

Machine Learning Engineer

Uses generate_embeddings to bulk-vectorize raw log data and then feeds those vectors into a Retrieval Augmented Generation (RAG) pipeline using chat_completion.

Software Developer

Integrates alternative open-source LLMs (like Llama 3) directly into the application's codebase to test against proprietary models before deployment.

AI Researcher

Orchestrates specialized model fine-tuning jobs using create_finetune_job and monitors progress with list_finetune_jobs, all from the same chat environment.

What Changes When You Connect

Model Diversity: You don't get locked into one vendor. Use list_available_models to see dozens of open-source alternatives (Llama, Mixtral) and test them all within the same chat session.
Vector Prep on Demand: Need embeddings for a knowledge base? Call generate_embeddings with raw text logs; you get vectors ready to load into your analytical database immediately.
Zero Context Switching for Tuning: Instead of jumping between CLI tools, use create_finetune_job and list_finetune_jobs right inside your chat environment. It keeps the whole workflow together.
Full Media Pipeline: Need a visual element? Use generate_image. You can generate code from an LLM (chat_completion) and then use that output to describe what image you need next.
Flexible Inference: Whether you're doing simple, single-prompt text generation with text_completion or complex multi-turn dialogue with chat_completion, the server handles it all.

Real-World Use Cases

Building a Custom FAQ Bot (RAG)

The ML Engineer has 10,000 pages of PDFs. They feed these into an indexing service to get embeddings using generate_embeddings. When a user asks a question later, the agent uses those vectors to retrieve context and then passes that context plus the query into chat_completion for a precise answer.

Creating Marketing Assets from Chat Output

The developer asks their agent to write three product descriptions for a new gadget using text_completion. They copy one of those descriptions, and immediately use it as the detailed prompt in the generate_image tool to create accompanying marketing art.

Validating Model Choices Before Commit

The Software Engineer is debating between Llama 3 and Mixtral. Instead of writing two separate scripts, they use list_available_models first. Then, they run the same prompt through both models using their respective model IDs in a single chat session to compare performance.

Archiving Custom Data Models

The Research Scientist has identified a niche domain for an LLM. They use create_finetune_job with their specialized dataset and monitor the job progress using list_finetune_jobs, all without ever leaving their main agent interface.

The Tradeoffs

Using simple text completion for dialogue

Trying to simulate a conversation by running five separate calls using text_completion with slightly modified prompts. This loses context and is brittle.

→ For any multi-turn interaction, always use chat_completion. This tool manages the entire message history array for you, keeping the agent's memory intact across all turns.

Assuming model availability

Writing code that immediately calls a specific LLM (e.g., Mistral) without knowing if it's currently available or if a better alternative exists.

→ Always start by calling list_available_models. This gives you the definitive, current list of all supported engines, letting your agent decide on the best tool for the job.

Over-relying on local models

Thinking that running a model locally will be faster or cheaper than using an optimized service.

→ If you need high performance and access to bleeding-edge, open-source weights (like Llama 3), use the server. It provides managed, high-speed inference without complex local setup.

When It Fits, When It Doesn't

Use this MCP Server if your core problem involves connecting multiple, specialized AI functions—text generation, image creation, and data vectorization—into one coherent pipeline. You need model diversity (accessing Llama, Mixtral, etc.) without the operational overhead of managing ten different API keys.

Don't use this server if you only need to run a single, simple task, like just basic classification on static input files; a dedicated function call or a simpler cloud SDK might be cleaner. Also, don't rely on it for guaranteed uptime SLAs—it’s designed for rapid prototyping and experimentation where the complexity of connecting tools outweighs minor latency concerns.

However, if your workflow requires generating embeddings and then using those embeddings to inform an LLM chat response, this server is the right choice. It provides all the necessary components (generate_embeddings, chat_completion) in one place.

Independent Platform Disclaimer: Vinkius is an independent platform and is not affiliated with, endorsed by, sponsored by, verified by, or otherwise authorized by Together AI. All third-party trademarks, logos, and brand names are the property of their respective owners. Their use on this website is strictly for informational purposes to identify service compatibility and interoperability.

VINKIUS INFRASTRUCTURE

Cloud Hosted

Managed infra

V8 Isolated

Sandboxed per request

Zero-Trust Proxy

No stored credentials

DLP Enforced

Policy on every call

GDPR Compliant

EU data residency

Token Compression

~60% cost reduction

How we secure it →

Works with Claude, ChatGPT, Cursor, and more

The Model Context Protocol standardizes how applications expose capabilities to LLMs. Instead of operating in isolation, your AI gains direct access to external platforms, live data, and real-world actions through secure, standardized connections.

This server provides 7 capabilities that interface natively with Claude, ChatGPT, Cursor, and any MCP client. No middleware. No custom integration required.

Available Capabilities

chat_completion create_finetune_job generate_embeddings generate_image list_available_models list_finetune_jobs text_completion

Manually setting up complex AI pipelines takes too many steps.

Right now, if you want to build an advanced system—say, something that needs to read a document and then chat about it—you're dealing with chaos. You have to set up a data pipeline in one service, get the API key for embeddings from another, and then call the LLM model using yet a third provider's credentials. It’s copy-pasting keys everywhere just to make two things talk.

With this MCP server, you keep it local. Your agent handles the whole sequence. You send the text in, the tool generates embeddings with `generate_embeddings`, and then your chat completion runs using those vectors—all within one conversation flow. It's clean.

Together AI lets you run specialized model jobs instantly.

Before this, if you wanted to train a custom LLM on your company's data, the process was huge. You had to provision compute clusters, upload massive datasets manually, and wait hours for status updates in a separate web panel. It was slow and siloed.

Now, you just point to the base model ID and the dataset file using `create_finetune_job`. The job starts, and you track it right there with `list_finetune_jobs`. It's that simple.

Common Questions About Together AI MCP

How do I check which open-source LLMs are available? +

You run the list_available_models tool. This gives you a list of every model ID and its capabilities right now, letting you pick the best engine for your job.

Is `chat_completion` better than `text_completion`? +

chat_completion is almost always what you want. It's built to handle message history (the whole conversation), while text_completion is only for single, stateless prompts.

What models can I use for image generation? +

The server uses external diffusion models for this. You just need a detailed text description in the prompt provided to the generate_image tool; you don't specify the model ID.

How do I start training my own LLM? +

Use the create_finetune_job tool. You must provide a base model ID and point to your specific dataset file for it to begin.

If I have a massive dataset, how do I efficiently run `generate_embeddings`? +

You process them in batches. While the tool handles large arrays of strings, we recommend grouping texts into manageable chunks (e.g., 100-500 items) to prevent timeouts and optimize throughput. This method helps you monitor progress and ensures reliable data transfer for your vector database.

How do I check the status of a fine-tuning job after running `create_finetune_job`? +

You use the list_finetune_jobs tool to query all jobs. This returns a list that includes both active and completed runs, showing you the current state (e.g., 'PENDING', 'RUNNING', or 'FAILED') for easy monitoring.

Can `chat_completion` force the output into JSON format? +

Yes, you can guide the model to output structured data. When providing the prompt and message history, include specific instructions requesting a JSON schema. This ensures your AI client receives predictable, machine-readable results for reliable parsing.

What parameters should I control when using `generate_image`? +

Beyond the descriptive prompt, you can often specify dimensions or aspect ratios in the tool call. Checking the model's documentation will show supported size constraints (e.g., 1:1 square, 16:9 landscape) to get exactly the format your application requires.

Where do I obtain my Together AI API Key? +

Log in to the developer portal via api.together.xyz/settings/api-keys. If you do not have an existing key, click Create API Key. This token enables the execution of remote inferences spanning their hosted clusters securely.

Do I have to pay to use Together models through the agent? +

Yes. This connector simply routes your instructions to Together AI. Any tokens consumed during chat completion, embeddings, images generation, or fine-tuning workloads are billed directly to your registered Together AI account balance according to their official compute pricing models.

Can I access free models on Together AI? +

Yes! Together AI frequently offers free tiers for certain open-source models intended for experimentation and research. You can query these directly from your agent without depleting your account balance, though specific free-tier rate limits will apply.

View all recipes →

Fine-Tune AI Models Using MCP Servers

GPT-4 costs $30 per 1M tokens for your classification task , fine-tune a $0.20/M model on Together AI that scores 96% accuracy, track every experiment in W&B, and save $29.80 per million tokens

Together Ai Weights Biases Google Sheets

View all recipes

Use it with your favorite AI tools

Connect this server to Cursor, Claude, VS Code, and more.

OpenAI Agents SDK sdk-python

Google ADK sdk-python