NVIDIA Audio MCP. Turn messy audio into structured, multilingual data.

Works with every AI agent you already use

…and any MCP-compatible client

Just plug in your AI agents and start using Vinkius.

NVIDIA Audio provides professional APIs for handling complex audio data. Transcribe speech, translate languages instantly, generate natural voices, and clone speaker identities using a single connection point.

It powers multi-stage audio pipelines—from cleaning noisy field recordings to generating localized voiceovers.

What your AI agents can do

Transcribe multi-lingual audio

Converts spoken words from various languages in an audio file into written text.

Identify different speakers and timestamps

Separates a single audio track to determine how many people spoke, who they were, and what time segment they covered.

Clean background noise from recordings

Removes static, hums, or other background interference so the speech remains clear for transcription.

Synthesize voiceovers using cloned identities

Uses a reference audio sample to clone a specific voice and then generates brand-new speech from provided text.

Translate spoken audio in real-time

Captures audio spoken in one language and outputs the translated version into another target language.

Supported MCP Clients

Claude

ChatGPT

Cursor

Gemini

Windsurf

VS Code

JetBrains

Vercel

+ other MCP clients

Free for Subscribers

NVIDIA Audio MCP Server: 10 Tools for Audio Processing

These tools give your agent granular control over every step of the audio pipeline, from initial noise cancellation to final voice synthesis.

audio_translation

Translates spoken audio into another language based on your specified target language.

cancel_noise

Removes background noise and static from an uploaded audio file, cleaning up the speech signal.

classify_audio

Analyzes an audio file to determine what type of sound it contains (e.g., speech, music, siren) and provides a confidence score.

clone_voice

Creates a synthetic voice model from a reference audio clip that you provide.

list_audio_models

Retrieves a list of all available audio models supported by the NVIDIA API Catalog for use in your workflow.

punctuate_text

Adds correct punctuation and capitalization to raw text output that lacks proper formatting.

speaker_diarization

Identifies and separates the voices of different speakers within a single audio recording, tracking who talks when.

speech_to_text

Transcribes spoken content from an audio file into text format, supporting multiple languages.

summarize_audio

Generates a condensed summary of the key points covered in a long audio transcript.

text_to_speech

Converts plain text input into natural-sounding speech data, with options for selecting specific voices.
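
For a rough idea of what a tool invocation looks like on the wire: MCP clients call tools with a JSON-RPC 2.0 request whose method is `tools/call` and whose params carry the tool name and an `arguments` object. The sketch below builds such a request by hand. The argument names (`audio_url`, `target_language`) and the example URL are hypothetical; the real input schema is whatever the server advertises for each tool.

```python
import json
from itertools import count

# Simple incrementing request ids, as JSON-RPC expects a unique id per request.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical argument names; check the server's tool schema for the real ones.
request = build_tool_call(
    "audio_translation",
    {"audio_url": "https://example.com/call.mp3", "target_language": "en"},
)
print(json.dumps(request, indent=2))
```

In practice your MCP client builds and sends this message for you; the sketch only shows the shape of the call.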

Choose How to Get Started

Build a custom MCP for your own tools, or connect a ready-made integration from our catalog.

Build Your Own

Turn any API into an MCP. Import a spec, define Agent Skills, or deploy with MCPFusion.

Import from OpenAPI, Swagger, or YAML specs
Create Agent Skills with progressive disclosure
Deploy to edge with MCPFusion framework
Built-in DLP, auth, and compliance on every call
Real-time usage dashboard and cost metering
Publish to catalog or keep private

Make Your AI Do More

Start with NVIDIA Audio, then connect any of our 4,700+ other servers whenever your AI needs more. One click, no limits.

Use this MCP plus 4,700+ others, all in one place
Add new capabilities to your AI anytime you want
Every connection is secured and compliant automatically
Track usage and costs across all your servers
Works with Claude, ChatGPT, Cursor, and more
New servers added to the catalog every week

What you can do with this MCP connector

You connect this server to your agent when you need professional-grade audio processing. It handles everything from cleaning up bad recordings to generating full localized voiceovers using a single connection point. You'll use it for multi-stage pipelines, whether you're transcribing field interviews or localizing corporate training videos.

Before you start processing, find out what you're working with. Use list_audio_models to grab a full rundown of every audio model the NVIDIA API Catalog supports for your workflow. Then, if you're unsure about the recording's content, run classify_audio. This analyzes the track and tells you what kind of sound it contains (speech, music, sirens), along with a confidence score so you know how solid that analysis is.

If the original file is noisy or static-filled, run cancel_noise first. That tool scrubs out background interference and hums so the speech signal stays clear for transcription, and every downstream tool benefits from the cleaner input.

When you're ready to get text from audio, start with speech_to_text. It transcribes spoken content into written format and supports multiple languages. If your recording is long, like a committee meeting or podcast, you don't want to read every word. Feed the transcript into summarize_audio to pull out the key points as a concise summary.

Meetings are rough because more than one person talks. For those scenarios, run speaker_diarization. It identifies the individual speakers within a single recording, so you know who talked when and which time segments each one covered. Once you have the raw transcript, remember to pass it through punctuate_text.

That tool cleans up all the messy output by adding proper capitalization and punctuation, making it ready for actual use.

Now, let's talk about polishing the content or changing its language. If the original speech was in Spanish but you need English, you run audio_translation. It translates the spoken audio directly into your specified target language. You can take that process further: if you only have text and need it to sound like a real person talking, use text_to_speech.

This converts plain text into natural-sounding speech data, and you have options for selecting specific voices.

But the best part is creating new speech using an old voice. You use clone_voice first. You provide a reference audio clip, and this tool builds a synthetic voice model based on that sample. After cloning, you feed your desired text into text_to_speech, but this time, the system generates brand-new speech using the cloned profile.

This is perfect for creating localized content or making characters sound like specific people without ever recording them again.

This whole sequence lets you go from a noisy, multi-lingual audio file to polished, translated, and professionally voiced assets with minimal friction.
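
The walk-through above amounts to a conditional ordering of tool calls. Here is a minimal sketch of that ordering, assuming the agent already knows a few facts about the recording. The function name `plan_pipeline` and its flags are illustrative and not part of the server; only the tool names come from this page.

```python
from typing import List, Optional


def plan_pipeline(
    noisy: bool,
    multi_speaker: bool,
    want_summary: bool = False,
    target_language: Optional[str] = None,
) -> List[str]:
    """Return the tool names to call, in the order the walk-through suggests."""
    steps: List[str] = []
    if noisy:
        steps.append("cancel_noise")  # clean the audio before anything else
    if multi_speaker:
        steps.append("speaker_diarization")  # who spoke when
    steps.append("speech_to_text")  # raw transcript
    steps.append("punctuate_text")  # readable transcript
    if want_summary:
        steps.append("summarize_audio")  # condense a long transcript
    if target_language is not None:
        # audio_translation takes the (cleaned) audio as input, not the transcript
        steps.append("audio_translation")
    return steps


print(plan_pipeline(noisy=True, multi_speaker=True, want_summary=True))
```

An agent would normally make these decisions itself from the tool descriptions; writing the order down is mainly useful for batch jobs where every file should follow the same path.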

How NVIDIA Audio MCP Works

1. Subscribe to the server and provide your NVIDIA API Key. You'll give this key to your AI client.
2. Your agent sends the raw audio file (e.g., a noisy meeting recording) to an initial tool like cancel_noise or speaker_diarization.
3. The server processes the data and hands the result, either a transcript (speech_to_text) or a translated audio stream (audio_translation), back to your agent for final use.

The bottom line is that you pipe messy, multi-source audio through this server. It cleans it up, figures out who said what, and then converts it into structured text or new audio in a language of your choice.

Who Is NVIDIA Audio MCP For?

Content producers, legal teams, and call center managers need this. If you deal with large volumes of raw audio—like meeting recordings, international customer calls, or podcast interviews—and need to turn that sound into actionable, structured data, this is for you. It solves the 'audio-to-text' bottleneck.

Content Creator

Uses clone_voice and text_to_speech to generate entire seasons of multilingual podcast episodes without hiring voice actors.

Legal Analyst

Runs call recordings through speaker_diarization and speech_to_text to pinpoint exactly which party said what during a deposition, saving hours of manual review.

Global Support Manager

Uses audio_translation to analyze call recordings from different international markets instantly, allowing the support agent to understand context without human intervention.

What Changes When You Connect

Stop wasting time cleaning up recordings. cancel_noise runs first, removing background static and hums before the transcript even starts, ensuring cleaner input for all downstream tools.
Go from meeting recording to actionable notes instantly. The combination of speech_to_text, followed by speaker_diarization and punctuate_text, gives you a clean transcript that identifies every single speaker shift in minutes.
Create content for global markets with one click. Use clone_voice on an internal sample, feed it text, and then use text_to_speech to generate the finished audio file—all while maintaining brand consistency.
Never struggle with language barriers again. The audio_translation tool handles the complex process of translating spoken words, not just written text, preserving tone and context across languages.
Understand raw data better. Instead of a single block of text, you get classifications using classify_audio, telling you if the audio was mostly speech, music, or ambient noise—vital for filtering garbage input.

Real-World Use Cases

Analyzing international customer support calls

A global support team receives a call recording in Mandarin. Instead of waiting for human transcription and translation, the agent runs audio_translation on the raw MP3 with English as the target language and adds speaker_diarization for speaker identification. The result is an English translation with speaker attribution, allowing immediate case logging and root cause analysis.

Preparing podcast content from interviews

A podcaster has a 90-minute interview recording that includes lots of background chatter. They run the file through cancel_noise first, then use speaker_diarization to separate the host's voice from the guest's and speech_to_text to transcribe the conversation. Finally, they feed the resulting transcript into summarize_audio for show notes.

Creating localized training materials

A company needs a product demo video in Spanish and German. They use their CEO’s voice sample with clone_voice, input English scripts, and generate two new audio tracks using text_to_speech for both languages—all without the CEO having to re-record anything.

Reviewing legal deposition footage

A lawyer needs to know exactly when a key witness spoke vs. when another person interrupted them. They run the audio through speaker_diarization, which maps out distinct time segments for each participant, providing an undeniable timeline of conversation.

The Tradeoffs

Trying to clean noise manually

Downloading a noisy recording and running it through a generic audio editor just to cut out the background static. It's slow, imprecise, and often fails on complex industrial noise.

→ Use cancel_noise first. This tool automatically processes the file and removes background interference, giving you clean speech data ready for speech_to_text. You don't have to touch an editor.

Ignoring speaker identity

Running a multi-person meeting recording through simple transcription. The output is one block of text, and you have no idea who said what or when the topic changed hands.

→ Always run speaker_diarization on multi-person audio. It maps out timestamps for every unique voice, giving your agent precise speaker attribution for every line.

Forgetting punctuation in transcripts

Getting a raw transcript that looks like: 'welcome everyone to the q2 review our revenue grew.' This is useless and requires manual cleanup just to read.

→ Pipe the output of speech_to_text through punctuate_text. It automatically adds capitalization, commas, and periods, making the text immediately readable and ready for publication.

When It Fits, When It Doesn't

Use this server if your primary input is audio (recorded speech) and your desired output is structured data or new synthesized audio. Specifically, use it when you need to perform a multi-stage pipeline: Clean → Transcribe → Analyze → Output.

Don't use this if all you have is perfectly clean, single-speaker audio that needs only basic transcription and punctuation; a simpler toolset might handle that. Also, don't try to analyze video footage for gesture recognition—this server deals with pure sound waves. When in doubt about the complexity of your data pipeline (e.g., noisy, multilingual, multiple speakers), this suite provides the necessary building blocks.

Independent Platform Disclaimer: Vinkius is an independent platform and is not affiliated with, endorsed by, sponsored by, verified by, or otherwise authorized by NVIDIA. All third-party trademarks, logos, and brand names are the property of their respective owners. Their use on this website is strictly for informational purposes to identify service compatibility and interoperability.

VINKIUS INFRASTRUCTURE

Cloud Hosted: Managed infra
V8 Isolated: Sandboxed per request
Zero-Trust Proxy: No stored credentials
DLP Enforced: Policy on every call
GDPR Compliant: EU data residency
Token Compression: ~60% cost reduction

Works with Claude, ChatGPT, Cursor, and more

The Model Context Protocol standardizes how applications expose capabilities to LLMs. Instead of operating in isolation, your AI gains direct access to external platforms, live data, and real-world actions through secure, standardized connections.

This server provides 10 capabilities that interface natively with Claude, ChatGPT, Cursor, and any MCP client. No middleware. No custom integration required.

Available Capabilities

audio_translation, cancel_noise, classify_audio, clone_voice, list_audio_models, punctuate_text, speaker_diarization, speech_to_text, summarize_audio, text_to_speech

Turning hours of raw audio into usable text shouldn't feel like a forensic investigation.

Right now, if you get a recording—say, an international client call—you have to send it out. First, someone has to manually clean the background noise. Then, a human transcribes the speech (and maybe struggles with accents). If that's not English, they pay for and wait on a translator. You end up with three separate files: noisy audio, raw text, and finally, an expensive translation.

With this MCP server, your agent runs it all in one go. It first uses `cancel_noise` to clean the file. Then, `speech_to_text` transcribes it. If you need another language, you route the cleaned audio through `audio_translation`, which works on the spoken audio itself. The whole process is automated and gives you a single, structured output with no manual handoffs.

NVIDIA Audio MCP Server: Use speaker diarization to know who said what.

The old way of analyzing meetings meant reading through transcripts that just listed blocks of text. You'd get a summary, sure, but you wouldn't know if the key decision came from the CFO or the VP of Marketing—you just knew *that* it was said.

Now, by running `speaker_diarization`, your agent separates those voices and timestamps them accurately. It’s not just transcription; it’s authorship tracking. You get an undeniable record of who contributed what.
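
To make "who said what" concrete, here is a minimal sketch of joining diarization output with a timestamped transcript by picking, for each transcript line, the speaker turn that overlaps it most. The data shapes are assumptions for illustration: this page does not document the response format of `speaker_diarization` or `speech_to_text`, and the helper names are hypothetical.

```python
from typing import List, Tuple

Turn = Tuple[float, float, str]  # (start_seconds, end_seconds, speaker label)
Line = Tuple[float, float, str]  # (start_seconds, end_seconds, transcript text)


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def attribute_speakers(turns: List[Turn], lines: List[Line]) -> List[Tuple[str, str]]:
    """Label each transcript line with the speaker whose turn overlaps it most."""
    labelled = []
    for start, end, text in lines:
        best = max(turns, key=lambda t: overlap(start, end, t[0], t[1]), default=None)
        if best is None or overlap(start, end, best[0], best[1]) == 0.0:
            speaker = "unknown"  # no diarization turn covers this line
        else:
            speaker = best[2]
        labelled.append((speaker, text))
    return labelled


# Made-up sample data in the assumed shapes.
turns = [(0.0, 4.0, "speaker_1"), (4.0, 9.0, "speaker_2")]
lines = [
    (0.2, 3.8, "Welcome everyone to the Q2 review."),
    (4.1, 8.5, "Our revenue grew."),
]
for speaker, text in attribute_speakers(turns, lines):
    print(f"{speaker}: {text}")
```

Mapping labels like `speaker_1` to real names (the CFO, the VP of Marketing) is a separate step that this sketch leaves out.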

Common Questions About NVIDIA Audio MCP

What languages are supported for transcription?

Parakeet models support 50+ languages including English, Portuguese, Spanish, French, German, Mandarin, Japanese, and many more. Specify the language for best results.

Can I clone a specific voice?

Yes! Use the clone_voice tool with a reference audio sample (a few seconds is enough) and the text you want the cloned voice to speak.

What is speaker diarization?

Speaker diarization identifies 'who spoke when' in an audio recording. It segments the audio by speaker and returns timestamps for each speaker's turns.

What audio formats are supported?

The API supports WAV, MP3, FLAC, OGG, and most common audio formats. For best transcription accuracy, use high-quality WAV or FLAC files at 16kHz or higher sample rate.
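
If you want to check a WAV file against that 16 kHz recommendation before uploading, Python's standard `wave` module can read the sample rate from the file header. This is a local pre-check sketch, not part of the server; the helper names are illustrative, and `make_silent_wav` exists only to produce test input.

```python
import io
import wave

MIN_SAMPLE_RATE_HZ = 16_000  # the recommendation quoted in the answer above


def wav_sample_rate(source) -> int:
    """Return the sample rate of a WAV file, given a path or a binary file object."""
    with wave.open(source, "rb") as wav_file:
        return wav_file.getframerate()


def meets_recommended_rate(source) -> bool:
    """True if the WAV file is sampled at 16 kHz or higher."""
    return wav_sample_rate(source) >= MIN_SAMPLE_RATE_HZ


def make_silent_wav(sample_rate: int, seconds: float = 0.1) -> io.BytesIO:
    """Build a short mono 16-bit silent WAV in memory (for demonstration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    buffer.seek(0)
    return buffer


print(meets_recommended_rate(make_silent_wav(16_000)))
print(meets_recommended_rate(make_silent_wav(8_000)))
```

The `wave` module only reads WAV; checking MP3, FLAC, or OGG files would need a third-party library.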

How do I authenticate when listing models using `list_audio_models`?

You must provide an active NVIDIA API key. You get this unique key from build.nvidia.com and supply it during the server setup process.

After running `speech_to_text`, what is the best way to clean up raw transcript output?

Use the punctuate_text tool to fix the text. It automatically adds proper capitalization and punctuation, turning rough transcripts into ready-to-publish content.

When using `classify_audio`, what do the confidence scores mean?

The score shows the probability of the identified sound type. If a classification has a low confidence score, you should treat that data point with caution because the result might be unreliable.
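
One way to act on that advice is to split classifications at a cutoff and send the low-confidence ones for review. The sketch below assumes each result is a dict with `label` and `confidence` keys, which is a hypothetical shape since the actual response format is not shown on this page. The 0.8 threshold is an arbitrary starting point, not a documented value.

```python
from typing import Dict, List, Tuple


def split_by_confidence(
    results: List[Dict], threshold: float = 0.8
) -> Tuple[List[Dict], List[Dict]]:
    """Separate classifications into (trusted, uncertain) around a cutoff."""
    trusted = [r for r in results if r["confidence"] >= threshold]
    uncertain = [r for r in results if r["confidence"] < threshold]
    return trusted, uncertain


# Made-up sample results in the assumed shape.
sample = [
    {"label": "speech", "confidence": 0.93},
    {"label": "music", "confidence": 0.41},
]
trusted, uncertain = split_by_confidence(sample)
print("trusted:", [r["label"] for r in trusted])
print("needs review:", [r["label"] for r in uncertain])
```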

What steps should I take if `audio_translation` fails on my recording?

First, verify your input audio format and ensure you specify a valid target language. Most failures stem from unsupported file types or incorrect language codes; check the NVIDIA documentation for specific rules.
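
Both checks can run locally before the call is made. The sketch below is a hypothetical pre-flight helper: the extension set covers only the four formats named in the FAQ (the API is said to accept other common formats too), and the language-code pattern assumes codes shaped like `en` or `pt-BR`, which may not match what NVIDIA actually expects.

```python
import re
from pathlib import Path
from typing import List

# Only the formats named explicitly in the FAQ; the real list may be longer.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg"}

# Assumption: two-letter codes with an optional region, e.g. "en" or "pt-BR".
LANGUAGE_CODE = re.compile(r"^[a-z]{2}(-[A-Z]{2})?$")


def preflight(audio_path: str, target_language: str) -> List[str]:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    extension = Path(audio_path).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    if not LANGUAGE_CODE.match(target_language):
        problems.append(f"malformed language code: {target_language!r}")
    return problems


print(preflight("call.mp3", "en"))
print(preflight("call.mov", "english"))
```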

Use it with your favorite AI tools

Connect this server to Cursor, Claude, VS Code, and more.

OpenAI Agents SDK (Python)
Google ADK (Python)
Pydantic AI (Python)
Vercel AI SDK (TypeScript)