Volcengine Speech Synthesis MCP. Generate any voice, from TikTok trends to custom studio quality.

Volcengine Speech Synthesis provides a direct connection to ByteDance’s TTS platform, letting your AI client generate highly natural speech. It supports iconic TikTok voice styles and handles multi-language synthesis across Chinese, English, and Japanese.

Users can also train custom voices using their own audio samples or gain fine control over output timing with SSML tags.

What your AI agents can do

Synthesize Speech from Text

Converts standard text input into audio, supporting multiple languages and adjustable parameters like speed and volume.

Handle Long-Form Documents

Processes articles or books that exceed the 1024-character limit by splitting and synthesizing them in chunks.

Apply Advanced Timing Control (SSML)

Uses SSML tags to insert specific pauses, emphasize words, or adjust intonation for highly natural output timing.

Create Personalized Voice Models

Trains a unique voice model by uploading multiple audio samples from one speaker.

Discover Available Voices

Lists all available TTS models, including specific regional or style voices like the TikTok library.

Supported MCP Clients

Claude

ChatGPT

Cursor

Gemini

Windsurf

VS Code

JetBrains

Vercel

+ other MCP clients

Free for Subscribers

Volcengine Speech Synthesis MCP Server: 7 Tools for Audio Control

These seven tools let you manage every part of the TTS process—from listing available voices to synthesizing ultra-long documents and creating custom voice models.

create_custom_voice

Trains and registers a custom voice model using 10-50 high-quality audio recordings of a single speaker.

get_audio_formats

Lists the supported output file types, including MP3, WAV, OGG Opus, and PCM.

get_task_status

Checks if an asynchronous TTS job is still processing, completed, or failed.

list_voices

Retrieves a list of all available voice models and their attributes for selection.

synthesize_long_text

Generates speech from text longer than 1024 characters, ideal for full articles or manuals.

synthesize_speech

Converts standard text to audio, supporting multiple languages and adjusting volume/speed. This is the general-purpose synthesis tool.

synthesize_ssml

Generates speech from SSML markup, allowing precise control over timing, pacing, and emphasis using tags like <break>.

Choose How to Get Started

Build a custom MCP for your own tools, or connect a ready-made integration from our catalog.

Build Your Own

Turn any API into an MCP. Import a spec, define Agent Skills, or deploy with MCPFusion.

Import from OpenAPI, Swagger, or YAML specs
Create Agent Skills with progressive disclosure
Deploy to edge with MCPFusion framework
Built-in DLP, auth, and compliance on every call
Real-time usage dashboard and cost metering
Publish to catalog or keep private

Start building

Make Your AI Do More

Start with Volcengine Speech Synthesis, then connect any of our 4,700+ other servers whenever your AI needs more. One click, no limits.

Use this MCP plus 4,700+ others, all in one place
Add new capabilities to your AI anytime you want
Every connection is secured and compliant automatically
Track usage and costs across all your servers
Works with Claude, ChatGPT, Cursor, and more
New servers added to the catalog every week

What you can do with this MCP connector

You're connecting your AI client to Volcengine Speech Synthesis, which gives you a direct link to ByteDance’s TTS platform. It generates natural-sounding speech in multiple languages, including Chinese, English, and Japanese. This isn't basic text-to-speech: you control the voice, speed, volume, and timing of the audio output.

Synthesizing Speech from Text: You can run standard text through synthesize_speech to turn written words into audio. This general-purpose tool supports multiple languages and lets you adjust parameters such as speed and volume on each call. For maximum precision, use synthesize_ssml. It takes SSML markup, so you can insert pauses with tags like <break>, emphasize certain words, or adjust intonation, which gives you granular timing control that plain text conversion can't match.
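As a rough illustration of the SSML approach, here is a minimal sketch that joins sentences with a fixed pause. The <speak> root and the time attribute on <break> come from the W3C SSML standard; which subset of SSML synthesize_ssml accepts is not stated on this page, so treat the exact markup and the build_ssml helper name as assumptions.

```python
from xml.sax.saxutils import escape


def build_ssml(sentences, pause_ms=500):
    """Join sentences into one SSML document with a fixed pause between them."""
    if pause_ms < 0:
        raise ValueError("pause_ms must be non-negative")
    # Escape &, < and > so plain text cannot break the markup.
    parts = [escape(s.strip()) for s in sentences if s.strip()]
    # <break time="..."/> is standard SSML; server support is an assumption.
    pause = f'<break time="{pause_ms}ms"/>'
    return "<speak>" + pause.join(parts) + "</speak>"


ssml = build_ssml(["Welcome back.", "Today we cover three topics."], pause_ms=700)
print(ssml)
# <speak>Welcome back.<break time="700ms"/>Today we cover three topics.</speak>
```

The resulting string is what you would hand to synthesize_ssml as the markup input.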

Handling Long Documents: If your article or manual is longer than 1024 characters, you don't feed it all at once. Use synthesize_long_text. This tool processes massive documents by splitting them up and synthesizing the chunks sequentially, ensuring you get full audio from even the longest-form content.
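The length rule above is easy to encode in a small helper. This is a sketch, not part of the server: pick_synthesis_tool is a hypothetical name, and it assumes the 1024-character limit is counted the way Python's len() counts characters (the page does not say whether the service counts characters or bytes).

```python
STANDARD_LIMIT = 1024  # characters per standard request, as stated on this page


def pick_synthesis_tool(text: str) -> str:
    """Return the tool name to call for a given piece of text."""
    if len(text) > STANDARD_LIMIT:
        return "synthesize_long_text"
    return "synthesize_speech"


print(pick_synthesis_tool("Short greeting."))  # synthesize_speech
print(pick_synthesis_tool("x" * 5000))         # synthesize_long_text
```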

Building Custom Voices: Don't just settle for stock voices. You can train your own unique voice model using create_custom_voice. Just upload between 10 and 50 high-quality audio recordings of a single speaker, and the server registers that custom voice for you to use later.

Discovering Voices and Formats: Need to know what's available? Run list_voices to pull up every current TTS model. This list includes specific regional or style voices, like those famous ones from TikTok. To know what file type your audio will be in, check get_audio_formats, which lists supported outputs such as MP3, WAV, OGG Opus, and PCM.

Managing Jobs: Since some synthesis jobs take time, you'll need to track them. Use get_task_status to check if an asynchronous TTS job is still running, finished, or if it failed. This keeps your whole workflow from stalling out waiting for audio.
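A polling loop for this might look like the sketch below. The three state strings ('processing', 'completed', 'failed') are the ones this page's FAQ lists; everything else is an assumption, including the wait_for_task name and the get_status callable, which stands in for however your client actually invokes get_task_status.

```python
import time


def wait_for_task(get_status, task_id, interval_s=2.0, timeout_s=300.0, sleep=time.sleep):
    """Poll until the task leaves 'processing'; return 'completed' or raise."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_status(task_id)
        if state == "completed":
            return state
        if state == "failed":
            raise RuntimeError(f"task {task_id} failed")
        if state != "processing":
            raise ValueError(f"unexpected state: {state!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still processing after {timeout_s}s")
        sleep(interval_s)


# Stand-in for a real get_task_status call: reports 'processing' once, then 'completed'.
fake_states = iter(["processing", "completed"])
print(wait_for_task(lambda task_id: next(fake_states), "task-123", sleep=lambda s: None))
# completed
```

Passing sleep in as a parameter keeps the loop testable without real delays.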

It’s that simple—you feed the text, tell it how fast and loud you want it, and it spits out highly natural audio files.

How Volcengine Speech Synthesis MCP Works

1. Subscribe to this server and provide your Volcengine Access Key and Secret Key credentials.
2. Your AI client calls a synthesis tool (e.g., synthesize_speech) with the text, desired voice model ID, and language parameters.
3. The server processes the request, or hands it off for long-form processing, and returns the audio data or a URL to the finished file.
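Under the hood, step 2 is an MCP tools/call request. The sketch below builds that JSON-RPC message; the envelope (jsonrpc, id, method, params.name, params.arguments) follows the Model Context Protocol, while the argument names text, voice_id, and language are assumptions, since this page does not list the tool's input schema.

```python
import json


def tools_call_request(request_id, tool_name, arguments):
    """Build an MCP tools/call request as a plain dict."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = tools_call_request(
    1,
    "synthesize_speech",
    # Hypothetical argument names; check the tool's schema in your client.
    {"text": "Hello from the agent.", "voice_id": "BV113", "language": "en"},
)
print(json.dumps(request, indent=2))
```

In practice your MCP client builds and sends this message for you; the sketch only shows what the agent is asking the server to do.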

The bottom line is: you point your agent at this server, and it handles all the complex TTS logic so you just get the audio output.

Who Is Volcengine Speech Synthesis MCP For?

Content teams who need high-volume video voiceovers. Audiobook producers dealing with massive text files. Developers building accessibility features or custom apps that require perfect, reliable speech output.

Video Content Creator

Generates voice tracks for Reels or TikToks using specific, trendy voices without hiring a voice actor.

Developer / Integrator

Builds proof-of-concept features into apps that require multi-language speech output (e.g., an educational app).

Accessibility Engineer

Adds reliable screen reader or spoken word functionality to a website or internal tool.

What Changes When You Connect

The synthesize_ssml tool gives you timing control that standard APIs lack. You can dictate pauses or emphasize words by using tags like <break> and <emphasis>, making the final audio sound less robotic.
If your text is an article, don't use synthesize_speech. Use synthesize_long_text instead; it handles content over 1024 characters without breaking the flow or losing fidelity.
Need a voice that sounds exactly like a specific person? Run create_custom_voice. It builds a personalized model from just audio samples, giving you brand consistency.
You can't build anything if you don't know what voices exist. Use list_voices first to get the full catalog and find the perfect regional or style voice ID for your project.
Need to ensure your output works everywhere? Before synthesizing, call get_audio_formats to confirm if MP3 is best for web delivery or WAV is needed for editing.
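The MP3-for-web, WAV-for-editing choice can be sketched as a small lookup over whatever get_audio_formats reports. The lowercase identifiers (mp3, ogg_opus, wav, pcm), the list-of-strings result shape, and the choose_format helper are all assumptions made for illustration.

```python
# Preference order per purpose; identifiers are hypothetical.
PREFERRED = {
    "web": ["mp3", "ogg_opus"],  # compressed formats keep downloads small
    "editing": ["wav", "pcm"],   # uncompressed formats suit further processing
}


def choose_format(supported, purpose):
    """Pick the first preferred format that the server reports as supported."""
    available = {name.lower() for name in supported}
    for candidate in PREFERRED[purpose]:
        if candidate in available:
            return candidate
    raise LookupError(f"no suitable format for {purpose!r} in {sorted(available)}")


supported = ["MP3", "WAV", "OGG_OPUS", "PCM"]  # illustrative get_audio_formats result
print(choose_format(supported, "web"))      # mp3
print(choose_format(supported, "editing"))  # wav
```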

Real-World Use Cases

Creating a YouTube explainer video.

The scriptwriter finishes the text and sends it to their agent. The agent first runs list_voices to pick a deep male voice, then uses synthesize_speech with that ID. Finally, they use synthesize_ssml to add dramatic pauses at key points, ensuring the video narration sounds professionally produced.

Converting an entire company handbook.

The technical writer uploads the 50-page PDF and extracts the text. Instead of calling a basic synthesis function repeatedly, they pass the whole document to synthesize_long_text. This tool handles the massive word count automatically, giving one continuous audio file.

Building an interactive character dialogue system.

The developer needs multiple distinct voices. They use create_custom_voice for a main character's voice. Then they call synthesize_speech repeatedly, switching the custom voice ID based on which character is speaking in the script.

Making multilingual educational content.

The curriculum manager needs to translate and narrate a lesson into Japanese and Chinese. They use list_voices to find appropriate language-specific IDs, then call synthesize_speech twice—once for each language—to ensure the correct linguistic model is applied.
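For the multilingual case, a sketch of the per-language calls could look like this. The language codes ('zh', 'en', 'ja') are the ones this page's FAQ gives; the argument names, the plan_narration helper, and the voice IDs are placeholders, and in a real run the IDs would come from list_voices.

```python
LANGUAGE_CODES = {"Chinese": "zh", "English": "en", "Japanese": "ja"}


def plan_narration(lessons, voice_ids):
    """Return one synthesize_speech call description per language."""
    calls = []
    for language, text in lessons.items():
        code = LANGUAGE_CODES[language]
        calls.append({
            "tool": "synthesize_speech",
            # Hypothetical argument names; voice_ids maps language code -> voice ID.
            "arguments": {"text": text, "voice_id": voice_ids[code], "language": code},
        })
    return calls


lessons = {"Japanese": "こんにちは。", "Chinese": "你好。"}
voice_ids = {"ja": "ja_voice_example", "zh": "zh_voice_example"}  # placeholder IDs
for call in plan_narration(lessons, voice_ids):
    print(call["arguments"]["language"], call["arguments"]["voice_id"])
```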

The Tradeoffs

Using basic synthesis for large documents.

Calling synthesize_speech on a 5000-word article. The API will fail or truncate the output, giving you broken audio segments and forcing multiple manual calls.

→ Always check text length first. For bulk content, use synthesize_long_text. This tool manages chunking automatically for reliable, continuous synthesis.

Ignoring required formatting control.

Simply sending raw text to the API and expecting perfect pacing or dramatic effect. The result sounds flat, mechanical, and utterly lifeless.

→ For critical moments in the script, use synthesize_ssml. It lets you mark up specific words or phrases with SSML tags such as <break> to control rhythm.

Relying on default voice settings.

Just calling synthesize_speech without specifying a voice ID, resulting in the generic 'default' tone that sounds suspiciously AI-generated.

→ Always run list_voices first. Pick an explicit Voice Model ID (e.g., BV113) to ensure you hit the desired accent or persona.
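Picking an explicit voice usually means filtering the catalog. The sketch below assumes list_voices returns a list of records with id, style, and language fields; that shape, the find_voices helper, and the sample entries are invented for illustration and are not real catalog data.

```python
def find_voices(catalog, language=None, style_contains=None):
    """Return the IDs of voices matching a language code and/or a style keyword."""
    matches = []
    for voice in catalog:
        if language and voice["language"] != language:
            continue
        if style_contains and style_contains.lower() not in voice["style"].lower():
            continue
        matches.append(voice["id"])
    return matches


# Made-up catalog entries; real IDs and styles come from list_voices.
catalog = [
    {"id": "voice_a", "style": "Trendy Female", "language": "en"},
    {"id": "voice_b", "style": "Deep Male", "language": "en"},
    {"id": "voice_c", "style": "Trendy Female", "language": "zh"},
]
print(find_voices(catalog, language="en", style_contains="trendy"))  # ['voice_a']
```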

When It Fits, When It Doesn't

You should use this server if your primary requirement is high-fidelity audio output with deep control over voice and timing. Use it when: 1) You need specific, trendy voices (TikTok models). 2) Your text exceeds standard API character limits (synthesize_long_text). 3) You require precise pacing or dramatic pauses (synthesize_ssml).

Don't use this if: 1) You only need a quick 'placeholder' read that doesn't matter; in that case, a simple free-tier API might suffice. 2) Your workflow is purely text processing and never involves generating audio files.

If you are unsure what voices or formats are available, run list_voices and get_audio_formats first. These tools give you the data you need to build a reliable workflow.

Independent Platform Disclaimer: Vinkius is an independent platform and is not affiliated with, endorsed by, sponsored by, verified by, or otherwise authorized by Volcengine Speech. All third-party trademarks, logos, and brand names are the property of their respective owners. Their use on this website is strictly for informational purposes to identify service compatibility and interoperability.

VINKIUS INFRASTRUCTURE

Cloud Hosted

Managed infra

V8 Isolated

Sandboxed per request

Zero-Trust Proxy

No stored credentials

DLP Enforced

Policy on every call

GDPR Compliant

EU data residency

Token Compression

~60% cost reduction

How we secure it →

Works with Claude, ChatGPT, Cursor, and more

The Model Context Protocol standardizes how applications expose capabilities to LLMs. Instead of operating in isolation, your AI gains direct access to external platforms, live data, and real-world actions through secure, standardized connections.

This server provides 7 capabilities that interface natively with Claude, ChatGPT, Cursor, and any MCP client. No middleware. No custom integration required.

Available Capabilities

create_custom_voice, get_audio_formats, get_task_status, list_voices, synthesize_long_text, synthesize_speech, synthesize_ssml

Voiceovers shouldn't sound like they came from a robot reading an instruction manual.

Today, generating professional audio involves messy handoffs: writing the script in Google Docs, exporting it to Word, taking screenshots of voice notes, and then manually calling out specific timing requirements—all while constantly worrying if the default voice sounds too generic. You end up with fragmented files that sound amateur.

With this MCP server, you simply pass your clean text and a set of parameters (voice ID, language) to an agent. The agent uses `synthesize_speech` or `synthesize_ssml`, automatically stitching together the audio into one polished file. What you get is studio-grade narration, every time.

Synthesizing full books and long documents with synthesize_long_text.

Before this, synthesizing a technical manual meant copy-pasting the text into multiple API calls because most services had strict character limits. You'd end up paying for dozens of small jobs just to get one finished audiobook chapter.

Now, you hand the entire document off to `synthesize_long_text`. It handles the chunking and processing in the background. The result is a single, cohesive audio file that sounds like it was recorded by one person, not assembled by an API.
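To make the chunking idea concrete, here is one way such splitting could work. This is not the server's actual algorithm, which the page does not describe; it is a naive greedy sketch that packs whole sentences into chunks of at most 1024 characters and hard-splits any single sentence longer than that. The split_into_chunks name and the sentence-boundary pattern are assumptions.

```python
import re

# Split after ., ! or ? when whitespace follows, or after CJK sentence punctuation.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])(?=\s)|(?<=[。！？])")


def split_into_chunks(text, limit=1024):
    """Greedily pack whole sentences into chunks no longer than `limit` characters."""
    pieces = [p for p in SENTENCE_BOUNDARY.split(text) if p]
    chunks, current = [], ""
    for piece in pieces:
        # A single sentence over the limit is cut at the limit as a last resort.
        while len(piece) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece[:limit])
            piece = piece[limit:]
        if len(current) + len(piece) <= limit:
            current += piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("One. Two. Three.", limit=10))
# ['One. Two.', ' Three.']
```

Because the pieces are concatenated without adding or dropping characters, joining the chunks reproduces the original text exactly.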

Common Questions About Volcengine Speech Synthesis MCP

How do I find out which voice IDs are available using list_voices?

Run the list_voices tool. It returns a comprehensive catalog of all active TTS models, including their style (e.g., Trendy Female) and language codes.

Do I use synthesize_speech or synthesize_long_text for my article?

If your text is over 1024 characters, you must use synthesize_long_text. Using the standard synthesize_speech function will either fail or cut off the content prematurely.

How do I make sure my speech has natural pauses?

Don't rely on default pacing. Use the synthesize_ssml tool. It lets you embed SSML tags like <break> to dictate exactly when and how long a pause should be.

Can I use my own voice in synthesize_speech?

Yes, first call create_custom_voice using 10-50 audio samples of your speaker. Once the custom voice model is trained, you reference its ID when calling synthesize_speech.

When calling `synthesize_speech`, what specific credentials do I need for proper authentication?

You must provide your Volcengine Access Key and Secret Key in the API call. These keys authenticate your agent against the TTS platform, allowing it to generate audio data.

After sending a large text block using `synthesize_long_text`, how do I check if the process completed successfully with `get_task_status`?

The get_task_status tool accepts a unique task ID. It returns one of three states: 'processing,' 'completed,' or 'failed.' You poll this endpoint until you confirm completion.

Before using any synthesis function, how do I use `get_audio_formats` to ensure the output data matches my client's needs?

Running get_audio_formats lists all supported output codecs. You can then specify the exact format (like MP3 or WAV) when calling a synthesis tool like synthesize_speech.

If I use `synthesize_long_text`, how does the system manage documents that exceed the standard character limit?

The synthesize_long_text function automatically splits large inputs into manageable chunks. It then processes these segments sequentially to generate one cohesive audio file.

What makes Volcengine TTS different from other TTS services?

Volcengine powers the iconic TikTok TTS effects used in billions of videos. It offers industry-leading Chinese speech quality, trendy social media voices, and ByteDance's proprietary neural voice technology.

Which languages are supported?

Chinese (Mandarin), English, Japanese, and more. Set the language parameter to 'zh' for Chinese, 'en' for English, or 'ja' for Japanese. Each language has multiple voice styles.

What's the max text length?

Standard synthesis supports up to 1024 characters per request. For longer texts, use the synthesize_long_text tool, which automatically handles chunking and combines the results for articles and audiobooks.

Use it with your favorite AI tools

Connect this server to Cursor, Claude, VS Code, and more.

OpenAI Agents SDK (Python)

Google ADK (Python)

Pydantic AI (Python)

Vercel AI SDK (TypeScript)