# Cartesia (Voice AI) MCP for AI Agents

> Cartesia (Voice AI) brings state-of-the-art voice synthesis and speech recognition to your AI client. Clone voices from as little as five seconds of audio, generate high-fidelity text-to-speech streams, or transcribe audio files with low latency. It is designed for natural, human-sounding conversational experiences.

## Overview
- **Category:** ai-frontier
- **Price:** Free
- **Tags:** text-to-speech, speech-to-text, voice-synthesis, low-latency, ai-voice, audio-streaming

## Description

This MCP connects voice processing to whatever your agent runs on. You can build applications where the AI speaks and understands like a person, not like a robot reading text.

Need to generate natural audio? Use high-fidelity models to synthesize speech, or stream it out in real time via SSE for low latency. Want to keep your brand voice consistent? Clone a voice from a short audio sample, then adapt that voice to different languages and dialects. Need the AI to understand spoken input? Transcribe audio files into text using models that support multiple languages.

It also helps keep pronunciation consistent. You can manage custom pronunciation dictionaries so the AI says specialized or technical terms correctly every time, even across complex agent orchestration flows. If you're building a sophisticated application, Vinkius makes connecting this voice intelligence to your existing workflows simple and reliable.

## Tools

### get_voice
Retrieves specific metadata for a known voice model.

### list_agent_calls
Shows a record of past calls and transcripts handled by a particular agent.

### update_voice
Changes general information or metadata associated with an existing voice model.

### clone_voice
Creates a custom, unique voice profile from a small audio clip of five seconds or longer.

### create_pronunciation_dict
Establishes a new list of specific word pronunciations for the AI to follow.

### delete_pronunciation_dict
Removes an existing custom pronunciation dictionary entirely.

### delete_voice
Permanently removes a voice model from the system.

### generate_access_token
Creates a temporary token needed for running client-side requests securely.

### get_agent
Fetches detailed information about a specific configured voice agent.

### get_usage_credits
Retrieves current statistics on the account's remaining usage credits and billing history.

### infill_bytes
Generates audio content to smoothly bridge a gap between two existing audio segments.

### list_agents
Provides an overview of all configured voice agents within the account.

### list_pronunciation_dicts
Lists all custom pronunciation dictionaries that have been created.

### list_voices
Returns a comprehensive list of every available voice model in the system.

### localize_voice
Adapts an existing voice profile to sound natural in a new language or regional dialect.

### stt_batch
Transcribes multiple audio files into text format efficiently, suitable for bulk processing.

### tts_bytes
Generates and returns the full audio data bytes from a given text input.

### tts_sse
Streams generated speech audio in real time using Server-Sent Events for immediate playback.

### update_pronunciation_dict
Modifies or corrects specific word pronunciations within an existing dictionary.

### voice_changer_bytes
Alters the voice of a provided audio clip while carefully preserving its original emotion and intonation.

## Prompt Examples

**Prompt:** 
```
I just finished a call with a customer. Can I get a summary transcript and also check how many calls we've handled this month?
```

**Response:** 
```
**Call Summary & Usage Report**

*   **Date:** 2026-06-19
*   **Duration:** 8:45
*   **Key Topics:** Billing dispute, feature request (dark mode), next steps.
*   **Action Items:** Follow up with user on billing by end of day.

--- 

📊 **Account Usage**
You have used 12,300 credits this month. Last cycle refreshed: June 1st.
```

**Prompt:** 
```
I need to generate a sample audio file saying 'The next quarter's focus is on cloud migration.' What voice should I use?
```

**Response:** 
```
**Audio Generation Report**

✅ **Voice Selected:** 'Baritone' (ID: 79a045e3...)

*   **Model Used:** Sonic-3.5
*   **Output Format:** WAV audio stream (ready for download)
*   **Latency:** Sub-second generation confirmed.

The file is attached, ready to play.
```

**Prompt:** 
```
List all the custom voices we have and check if 'Spanish' is an option.
```

**Response:** 
```
**Available Voice Models**

Here are your active models:
*   ✅ Baritone (ID: 79a045e3...)
*   ✅ British Lady (ID: 123x-456y...)
*   👤 Custom Clone - User A (ID: [C L O N E D]...)

I found a model labeled 'Latin American Female' which supports Spanish. You can use `get_voice` for its specific ID.
```

## Capabilities

### Generate realistic speech audio
Convert text into high-quality audio bytes or stream the output instantly using advanced TTS models.

### Transcribe spoken word to text
Convert audio files in any supported language into written text.

### Create custom voice profiles
Build entirely new, personalized voices using short samples of existing human speech.

### Modify and manage voices
Get details about available voices, update their metadata, or even delete them when they're no longer needed.

### Control specific pronunciations
Create and maintain custom dictionaries to ensure the AI pronounces technical names or foreign words exactly right.
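
To illustrate the idea behind a pronunciation dictionary, here is a minimal local sketch in Python. It is a conceptual model only: the rule format (a plain mapping from a written term to a spoken alias) and the `apply_pronunciations` helper are assumptions made for this example, not the schema that `create_pronunciation_dict` actually accepts, and the real substitution happens on the service side.

```python
import re


def apply_pronunciations(text: str, rules: dict[str, str]) -> str:
    """Replace each whole-word term in `text` with its spoken alias.

    `rules` is a hypothetical term -> alias mapping used only to
    illustrate the concept of a pronunciation dictionary.
    """
    if not rules:
        return text
    # Try longer terms first so a long term wins over a shorter one
    # that starts at the same position.
    terms = sorted(rules, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda match: rules[match.group(1)], text)


# Hypothetical rules for two technical terms.
rules = {"SQL": "sequel", "nginx": "engine x"}
spoken = apply_pronunciations("Restart nginx, then rerun the SQL migration.", rules)
print(spoken)
```

Matching on word boundaries keeps a rule for `SQL` from rewriting part of a longer word such as `MySQLdump`.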

## Use Cases

### Building a multilingual customer service bot
A support company needs their agent to handle calls in Spanish, German, and French. They use `localize_voice` on one core voice model, ensuring the tone remains consistent while adapting the audio output for each language.

### Automating video podcast production
A content creator has many interview scripts to turn into episodes. Instead of hiring a voice actor, they use `clone_voice` on their own voice and then run `tts_bytes` to generate the audio track for each full script.

### Analyzing recorded user feedback
A product team records hundreds of video calls with users. Instead of listening manually, they feed all the audio into `stt_batch`, getting clean text transcripts that can be analyzed for key pain points.

### Creating dynamic narrative audiobooks
An audiobook developer needs a narrator who sounds consistent but also needs to speak specialized scientific terms correctly. They use `create_pronunciation_dict` and then generate the entire book's narration using high-quality TTS.

## Benefits

- Achieve true conversational depth. Use `tts_sse` to stream audio in real time, making your agent feel responsive instead of delayed.
- Maintain brand consistency globally. Clone a voice using just five seconds of audio via `clone_voice`, then adapt it across regions using `localize_voice`.
- Eliminate mispronunciation errors. Use `create_pronunciation_dict` to lock down how your AI agent speaks specialized terminology, ensuring technical accuracy every time.
- Process large amounts of data easily. Run bulk transcriptions on hours of audio files using `stt_batch`, saving manual effort across content teams.
- Build sophisticated call tracking. Use `list_agent_calls` to review the past calls and transcripts handled by an agent, and `get_usage_credits` to check how many credits remain.

## How It Works

You tell your AI agent what you need (a voice, a transcription, or a spoken message) and it handles the generation process.

1. Subscribe to this MCP and provide your Cartesia API Key.
2. Your agent calls a function, specifying the action (e.g., generating audio) and providing the necessary input data like text or an audio file.
3. The MCP processes the request using its voice models and returns the resulting audio stream or transcribed text to your client.
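
Step 2 above can be sketched as the message an MCP client sends on the agent's behalf. MCP tool invocations are JSON-RPC 2.0 requests with the method `tools/call`. The tool name `tts_bytes` comes from the Tools list, but the argument names (`transcript`, `voice_id`) and the placeholder value are assumptions for illustration, since this page does not document each tool's input schema.

```python
import json
from itertools import count

# Monotonic request ids, as JSON-RPC expects each request to carry one.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Argument names are hypothetical; check the tool's published input
# schema (for example via a `tools/list` request) before relying on them.
request = build_tool_call(
    "tts_bytes",
    {
        "transcript": "The next quarter's focus is on cloud migration.",
        "voice_id": "YOUR_VOICE_ID",
    },
)
print(json.dumps(request, indent=2))
```

In practice an MCP client library builds and sends this message for you; the sketch only shows what travels over the wire.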

## Frequently Asked Questions

**How do I make my AI agent sound like me, even if I only record myself briefly?**
You clone your voice using a short audio clip. This creates a unique digital model of your speaking patterns and tone that the AI can use across all its outputs, maintaining brand consistency.

**Does Cartesia (Voice AI) support transcribing different languages?**
Yes. The system handles multi-language transcription, meaning you don't have to worry about language switching when processing audio files into text for your agents.

**Is the generated speech low latency enough for a real-time chat agent?**
Absolutely. By streaming audio via Server-Sent Events, the system delivers synthesized sound almost instantly, making the conversation flow naturally and feel highly responsive to the user.
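
As a rough sketch of what consuming such a stream involves, the Python below parses Server-Sent Events and reassembles audio from them. The `data:` line framing follows the SSE standard, but the event payload shape (a JSON object with a `type` field and base64-encoded audio in `data`) is an assumption for this example, not the documented output of `tts_sse`.

```python
import base64
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete Server-Sent Event."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[5:]
            # The standard strips one optional leading space.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Comments (":") and other fields are ignored in this sketch.
    # An event not terminated by a blank line is discarded.


def collect_audio(lines: Iterable[str]) -> bytes:
    """Concatenate audio from hypothetical `chunk` events."""
    audio = bytearray()
    for payload in iter_sse_data(lines):
        event = json.loads(payload)
        if event.get("type") == "chunk":
            audio.extend(base64.b64decode(event["data"]))
    return bytes(audio)


def _chunk_event(pcm: bytes) -> str:
    body = {"type": "chunk", "data": base64.b64encode(pcm).decode("ascii")}
    return "data: " + json.dumps(body)


# A fake stream standing in for a live HTTP response body.
sample_stream = [
    _chunk_event(b"\x00\x01"), "",
    ": keep-alive comment",
    _chunk_event(b"\x02\x03"), "",
    'data: {"type": "done"}', "",
]
audio = collect_audio(sample_stream)
print(len(audio))
```

A real-time player would hand each decoded chunk to the audio device as it arrives rather than collecting everything first, which is where the latency benefit of streaming comes from.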

**What if my company has specialized terminology that sounds wrong when spoken by the AI?**
You solve this with pronunciation dictionaries. You define exactly how a specific word or acronym should sound, and the MCP forces the agent to say it correctly every time.

**Can I update my voice models if they need new metadata or changes?**
Yes, you can manage existing voices by calling `update_voice`. This lets you modify general information or metadata, such as a voice's description, without changing the actual sound profile.