# Speech Synthesis MCP

> Volcengine Speech Synthesis handles high-fidelity, multi-lingual text-to-speech conversion. Use this MCP to generate natural narration, including signature TikTok voice styles, from simple text or complex markup languages like SSML. It’s built for content creators and developers needing professional audio output across English, Chinese, Japanese, and more.

## Overview
- **Category:** industry-titans
- **Price:** Free
- **Tags:** volcengine, tiktok-voice, tts, speech, text-to-speech, ai-audio

## Description

This connector turns written text into broadcast-quality audio. It generates speech with ByteDance's voice models, the same ones behind TikTok's viral voice effects, for anything from short social media clips to entire audiobooks. Synthesis is available in English, Chinese, Japanese, and more, so you can produce content for a global audience without a recording studio. When timing matters, SSML tags let you specify exactly where the speaker pauses and which words are emphasized. Long documents that exceed the limit of a standard synthesis call go through a dedicated long-text process.

Because this MCP handles sensitive keys and high-volume audio generation, your credentials pass through Vinkius's zero-trust proxy and never sit on disk, which keeps the connection trustworthy when you build automations across multiple platforms.

## Tools

### get_audio_formats
Lists the available output formats for the generated audio (like MP3 or WAV).

### list_voices
Retrieves every available TTS voice model to help you select the right sound for your project.

### synthesize_long_text
Generates audio from texts that are too long for standard synthesis calls, like full articles or reports.

### synthesize_ssml
Uses specialized tags to control the exact timing, pauses, and emotional delivery of the generated audio.

### synthesize_speech
Converts text into speech using various voice styles and supports multiple languages with adjustable speed and volume.

## Prompt Examples

**Prompt:** 
```
Generate speech with the TikTok trendy female voice: 'Welcome to my video!'
```

**Response:** 
```
🔊 Speech synthesized successfully! Using BV033_streaming (TikTok Trendy Female). Audio generated in MP3 format at 24kHz.
```

**Prompt:** 
```
List all available voices and show me English options.
```

**Response:** 
```
🎙️ Available voices: BV001 (Generic Female, zh), BV002 (Generic Male, zh), BV033 (TikTok Trendy Female, zh), BV113 (English Female, en), BV115 (English Male, en). English options: BV113 (Female), BV115 (Male).
```

**Prompt:** 
```
Synthesize this article into speech: [long article text...]
```

**Response:** 
```
📖 Long-text synthesis started! Article split into 5 chunks. Using BV001_streaming voice. Processing will take ~30 seconds for full narration.
```

## Capabilities

### Generate standard speech
Convert any block of text into natural, spoken audio using general voice styles.

### Create unique voices
Train a custom voice model from your own high-quality audio recordings to give the AI a personalized sound.

### Synthesize massive documents
Convert entire articles or long manuals into speech without hitting character limits.

### Control tone and pacing
Use markup language to dictate precise timing, pauses, and emphasis in the generated audio.

### Manage voice selection
List all available voice models—including specialized styles—before beginning any synthesis job.

## Use Cases

### Building a multi-lingual training module
An e-learning developer needs to create course material in English and Japanese. They use `list_voices` to confirm language support, then call `synthesize_speech` multiple times with different voice IDs for each language.
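
As a rough sketch of the "confirm language support" step, the snippet below filters a voice list by language code. The data is copied from the sample `list_voices` response in Prompt Examples; the `Voice` class and its field names are hypothetical, since the real response schema is not documented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voice:
    # Field names are assumptions made for this sketch.
    voice_id: str
    style: str
    language: str


# Copied from the sample `list_voices` response in Prompt Examples.
VOICES = [
    Voice("BV001", "Generic Female", "zh"),
    Voice("BV002", "Generic Male", "zh"),
    Voice("BV033", "TikTok Trendy Female", "zh"),
    Voice("BV113", "English Female", "en"),
    Voice("BV115", "English Male", "en"),
]


def voices_for_language(voices, language):
    """Return the voices whose language code matches `language`."""
    return [v for v in voices if v.language == language]


english = voices_for_language(VOICES, "en")
print([v.voice_id for v in english])  # ['BV113', 'BV115']

# The sample data contains no Japanese voice, so this list is empty.
japanese = voices_for_language(VOICES, "ja")
print(japanese)  # []
```

An empty result, as for `ja` in this sample data, is the signal to check the live `list_voices` output before scripting the synthesis calls.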

### Creating an automated podcast chapter
A podcaster writes a 5,000-word transcript. Instead of manually segmenting it, they use `synthesize_long_text`, which handles the chunking and synthesis process automatically, giving them full narration.
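
To make "chunking" concrete, here is a minimal sketch of one way to split text at sentence boundaries under a character limit. It is an illustration only: how `synthesize_long_text` actually splits text is not documented here, and the function name and the 1024-character default are assumptions based on the limit quoted in the FAQ.

```python
import re


def split_into_chunks(text, limit=1024):
    """Greedily pack whole sentences into chunks of at most `limit` characters."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut into hard slices.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("One. Two. Three.", limit=10))  # ['One. Two.', 'Three.']
```

Splitting on sentence boundaries rather than at fixed offsets avoids cutting a word or phrase in half, which would otherwise produce an audible glitch where two audio segments are joined.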

### Improving accessibility in an app
A developer is building a medical guide app. They need complex terms read with precise emphasis and pacing, so they call `synthesize_ssml` and wrap the critical phrases in SSML tags.

### Launching branded content
A marketing team wants their video ads to feature a specific brand voice. They first train it with `create_custom_voice`, and once the voice is approved, they use it in all subsequent calls to `synthesize_speech`.

## Benefits

- Need a specific sound? You can use `list_voices` to check every available model, from general narrators to the famous TikTok voices, before you write a single line of code.
- Dealing with massive documents? Forget manual splitting. Use `synthesize_long_text` for articles and manuals that exceed standard character limits, keeping your workflow continuous.
- Want maximum control over the narrative flow? Instead of just sending text, use `synthesize_ssml` to programmatically insert pauses, changes in tone, or specific emphasis points.
- Need a unique brand sound? Train a custom voice using `create_custom_voice`. You feed it 10-50 recordings, and you get a proprietary voice model for your product.
- Speed is key. The ability to adjust speech rate and volume with `synthesize_speech` means you can tweak the delivery of any piece without re-recording anything.

## How It Works

In short: the MCP takes plain text as input and returns synthesized audio in the voice and language you choose.

1. First, you connect your access key and secret key to this MCP.
2. Next, you call a tool with the text and parameters (e.g., voice model, language).
3. Finally, you receive the generated audio data or a link to the completed file.
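
Step 2 can be sketched as the request an MCP client sends. MCP tool invocations are JSON-RPC 2.0 messages using the `tools/call` method; the argument names `text` and `voice_type` below are hypothetical, so check the tool's published input schema for the real ones. In practice your MCP client builds this message for you.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Return a JSON-RPC 2.0 request for MCP's `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Argument names are assumptions for illustration, not the tool's confirmed schema.
request = build_tool_call(
    1,
    "synthesize_speech",
    {"text": "Welcome to my video!", "voice_type": "BV033_streaming"},
)
print(json.dumps(request, indent=2))
```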

## Frequently Asked Questions

**Does `synthesize_speech` support TikTok voices?**
Yes. `synthesize_speech` supports specific voice styles, including the TikTok voices. Select one by its voice ID; the sample response in Prompt Examples uses `BV033_streaming` (TikTok Trendy Female).

**How do I make my own brand voice?**
Use `create_custom_voice`. You need 10-50 high-quality recordings of a single speaker. Training takes 1 to 3 days and gives you an exclusive voice for your brand.

**`synthesize_long_text` vs `synthesize_speech`: which should I use?**
If your text is short (up to 1024 characters), use `synthesize_speech`. If you're working with full articles, reports, or documentation, use the dedicated `synthesize_long_text` tool.
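
The rule above fits in a few lines. This is a sketch under two assumptions: that the 1024-character figure is inclusive, and that the service counts characters the way Python's `len` does. `pick_tool` is a hypothetical helper, not part of the MCP.

```python
STANDARD_LIMIT = 1024  # characters per standard request, as quoted in this FAQ


def pick_tool(text):
    """Choose the synthesis tool based on text length."""
    if len(text) <= STANDARD_LIMIT:
        return "synthesize_speech"
    return "synthesize_long_text"


print(pick_tool("Welcome to my video!"))  # synthesize_speech
print(pick_tool("word " * 5000))          # synthesize_long_text
```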

**What if I need to control pauses in my audio?**
Use `synthesize_ssml`. It lets you embed SSML tags such as `<break>` and `<emphasis>`, which gives granular control over timing, pitch, and intonation that plain-text synthesis can't provide.
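
A minimal sketch of assembling such markup is shown below. `<speak>`, `<break time="...">`, and `<emphasis level="...">` are standard W3C SSML elements; which subset of SSML this service accepts is an assumption to verify, and the helper functions are hypothetical.

```python
from xml.sax.saxutils import escape


def emphasize(text, level="strong"):
    """Wrap text in an SSML <emphasis> element, escaping XML special characters."""
    return f'<emphasis level="{level}">{escape(text)}</emphasis>'


def pause(milliseconds):
    """Return an SSML <break> element for a pause of the given length."""
    return f'<break time="{milliseconds}ms"/>'


def speak(*parts):
    """Wrap already-built SSML fragments in the root <speak> element."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    escape("Take "),
    emphasize("two tablets"),
    pause(500),
    escape(" with food & water."),
)
print(ssml)
# <speak>Take <emphasis level="strong">two tablets</emphasis><break time="500ms"/> with food &amp; water.</speak>
```

Escaping plain text before embedding it matters because characters such as `&` and `<` would otherwise make the markup invalid XML.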

**Can I see what voices are available first?**
Yes. Run `list_voices` before you synthesize. It returns all current voice models (male, female, child, and style-specific options), so you can build your script around known capabilities.

**When I use `get_audio_formats`, what's the difference between MP3 and WAV for my project?**
MP3 is best for delivery. It compresses audio, making it small enough for web streaming or apps without losing too much quality. If you need to edit the file later, stick with WAV; it keeps the raw, uncompressed data.
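
The size difference can be estimated with simple arithmetic. The 24 kHz sample rate comes from the sample response in Prompt Examples; 16-bit mono audio and a 64 kbps MP3 bitrate are assumptions chosen for illustration, and the estimates ignore file headers.

```python
def wav_bytes(seconds, sample_rate=24_000, bit_depth=16, channels=1):
    """Approximate size of uncompressed PCM audio, ignoring the WAV header."""
    return int(seconds * sample_rate * (bit_depth // 8) * channels)


def mp3_bytes(seconds, bitrate_kbps=64):
    """Approximate size of constant-bitrate MP3 audio."""
    return int(seconds * bitrate_kbps * 1000 / 8)


# One minute of narration under the assumptions above.
print(wav_bytes(60))  # 2880000
print(mp3_bytes(60))  # 480000
```

Under these assumptions, one minute of WAV audio is about six times larger than the same minute as MP3.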

**If I run a long synthesis job using `synthesize_long_text`, how do I check its progress with `get_task_status`?**
Pass the unique task ID returned by the initial request to `get_task_status`. Poll this tool to see whether the job is pending, running, or has failed.
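
A polling loop for this might look like the sketch below. `get_status` stands in for whatever function your client uses to call `get_task_status`; the status strings `"pending"` and `"running"` follow the wording above, while `"success"`, the interval, and the timeout are assumptions.

```python
import time


def wait_for_task(task_id, get_status, interval=2.0, timeout=120.0, sleep=time.sleep):
    """Poll `get_status(task_id)` until the task leaves the pending/running states."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(task_id)
        if status not in ("pending", "running"):
            return status  # a terminal state, e.g. finished or failed
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still {status} after {timeout}s")
        sleep(interval)


# Demo with a fake status source in place of a real `get_task_status` call.
fake_statuses = iter(["pending", "running", "success"])
final = wait_for_task("task-1", lambda _task_id: next(fake_statuses), sleep=lambda _s: None)
print(final)  # success
```

Passing the status function and `sleep` in as parameters keeps the loop testable without a network connection or real delays.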

**Does `synthesize_speech` let me control the reading speed or volume of the generated audio?**
Yes, you can adjust both. The synthesis call accepts parameters for rate and volume. This lets your agent dynamically modify how fast or loud the final narration sounds.

**What makes Volcengine TTS different from other TTS services?**
Volcengine powers the iconic TikTok TTS effects used in billions of videos. It offers industry-leading Chinese speech quality, trendy social media voices, and ByteDance's proprietary neural voice technology.

**Which languages are supported?**
Chinese (Mandarin), English, Japanese, and more. Set the language parameter to `zh` for Chinese, `en` for English, or `ja` for Japanese. Each language has multiple voice styles.

**What's the max text length?**
Standard synthesis supports up to 1024 characters per request. For longer texts such as articles and audiobooks, use the `synthesize_long_text` tool, which automatically chunks the text and combines the results.