# NVIDIA Audio MCP

> NVIDIA Audio provides professional-grade tools for handling complex audio files. You can transcribe spoken words, generate realistic voices from text, translate entire conversations across languages, and isolate different speakers in recordings. This MCP lets your AI client handle everything from raw meeting transcripts to polished, multilingual content.

## Overview
- **Category:** industry-titans
- **Price:** Free
- **Tags:** speech-to-text, text-to-speech, audio-processing, speaker-diarization, voice-cloning, transcription

## Description

This MCP connects audio processing directly into your agent's workflow. Instead of manually feeding long audio files through multiple services (one for transcription, another for noise removal, and a third for translation), you pass the file once. Your AI client handles the whole chain: it removes background noise, transcribes speech to text, identifies who spoke when, and can then summarize the conversation into actionable bullet points.

If you need content for multiple regions or languages, you can convert written text into natural speech using a choice of voices, or clone a voice from a short sample to generate new audio segments. Together with audio classification and punctuation restoration, these tools turn raw recordings into structured, usable text and audio.

You'll find this MCP in the Vinkius catalog alongside other connectors.

## Tools

### list_audio_models
Shows you a list of all available audio models the API can use.

### classify_audio
Determines what type of sound is in an audio file and gives confidence scores for that classification.

### clone_voice
Creates a digital replica of a voice using a small sample recording, allowing you to generate new speech later.

### cancel_noise
Removes unwanted background sounds and static from the recorded audio file.

### speaker_diarization
Analyzes an audio file to pinpoint and separate different speakers, noting when each person started and stopped talking.

### punctuate_text
Adds correct punctuation and capitalization to raw text transcripts that might be missing these elements.

### speech_to_text
Transcribes audio in multiple languages, taking a public URL to an MP3 or WAV file as input.

### summarize_audio
Takes an existing audio transcript and boils it down to a concise summary.

### text_to_speech
Converts written text into natural-sounding speech, letting you select different voices for the output.

### audio_translation
Translates spoken audio directly from one language to a specified target language.
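
Your AI client normally issues these tool calls for you. For readers curious about what goes over the wire, here is a minimal sketch of the JSON-RPC 2.0 request that the Model Context Protocol uses for `tools/call`. The argument name `audio_url` is an assumption made for illustration; the real parameter names come from each tool's input schema, which a client can discover with `tools/list`.

```python
import json
from itertools import count

# Simple incrementing request ids, as JSON-RPC expects a unique id per request.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# `audio_url` is a hypothetical argument name; check the tool's input schema.
request = build_tool_call(
    "speech_to_text",
    {"audio_url": "https://example.com/meeting.mp3"},
)
print(json.dumps(request, indent=2))
```

The same helper works for any tool listed above; only the tool name and the arguments change.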

## Prompt Examples

**Prompt:** 
```
Transcribe this meeting recording: https://example.com/meeting.mp3
```

**Response:** 
```
Transcription: 'Welcome everyone to the Q2 review. Our revenue grew 15% compared to last quarter...'
```

**Prompt:** 
```
Convert this text to speech: 'Welcome to our presentation today.'
```

**Response:** 
```
Speech generated successfully! Audio data available for playback.
```

**Prompt:** 
```
Identify different speakers in this call: https://example.com/call.wav
```

**Response:** 
```
Detected 2 speakers across 3 segments: Speaker 1 (0:00-2:30), Speaker 2 (2:31-5:45), Speaker 1 (5:46-8:20).
```
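
If you want to post-process a diarization reply like the one above, a small parser can turn it into structured segments. The sketch below assumes the reply uses the `Speaker N (m:ss-m:ss)` text layout shown in the example; the actual tool output may be structured differently, so treat the regular expression as a starting point only.

```python
import re
from collections import defaultdict

# Matches segments written as "Speaker 1 (0:00-2:30)". This layout is taken
# from the example response and is an assumption about the real output.
SEGMENT_RE = re.compile(r"(Speaker \d+) \((\d+):(\d{2})-(\d+):(\d{2})\)")


def parse_segments(response: str) -> list[tuple[str, int, int]]:
    """Return (speaker, start_seconds, end_seconds) for each segment found."""
    segments = []
    for label, start_min, start_sec, end_min, end_sec in SEGMENT_RE.findall(response):
        start = int(start_min) * 60 + int(start_sec)
        end = int(end_min) * 60 + int(end_sec)
        segments.append((label, start, end))
    return segments


def talk_time(segments: list[tuple[str, int, int]]) -> dict[str, int]:
    """Sum the seconds each speaker talked across all of their segments."""
    totals = defaultdict(int)
    for label, start, end in segments:
        totals[label] += end - start
    return dict(totals)


sample = "Speaker 1 (0:00-2:30), Speaker 2 (2:31-5:45), Speaker 1 (5:46-8:20)."
segments = parse_segments(sample)
print(segments)
print(talk_time(segments))  # {'Speaker 1': 304, 'Speaker 2': 194}
```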

## Capabilities

### Transcribe speech to text
Turns recorded audio (an MP3 or WAV file supplied as a public URL) into written text for immediate use.

### Identify different speakers
Separates and labels every voice in a recording so you know exactly who said what and when.

### Translate spoken audio
Converts spoken words from one language into another, maintaining natural flow.

### Generate realistic speech
Creates high-quality audio files from any text input, using customizable voices.

### Clean and improve recordings
Removes distracting background noise from recordings and adds proper punctuation to raw transcripts.

## Use Cases

### Analyzing multi-party calls
A customer support manager provides 20 hours of call recordings. The agent uses speaker_diarization and speech_to_text to separate every conversation segment, creating a searchable record of who said what across all support staff.

### Creating global podcast episodes
A content creator records an interview in English. They pass the audio through audio_translation and then use text_to_speech to generate fully polished, localized voice tracks for Spanish and French audiences.

### Meeting summary automation
After a 90-minute product planning meeting, the team passes the recording to the agent. The agent transcribes it with speech_to_text, then uses summarize_audio on the transcript to pull out the key action items and responsible parties, saving hours of manual note-taking.

### Cleaning up old field recordings
A researcher has raw audio from a remote location full of wind noise. They first run cancel_noise to clean the file, then use speech_to_text and punctuate_text to get a highly readable transcript.
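
The order of operations in this use case can be sketched as a small function. This is an illustration under stated assumptions, not the MCP's actual interface: `call_tool` stands in for whatever your MCP client exposes, and the argument and result keys (`audio_url`, `text`) are hypothetical. The fake client exists only so the example runs offline.

```python
from typing import Callable

# A tool caller takes a tool name and an arguments dict and returns a result dict.
ToolCaller = Callable[[str, dict], dict]


def transcribe_noisy_recording(call_tool: ToolCaller, audio_url: str) -> str:
    """Clean the audio first, then transcribe it, then restore punctuation."""
    cleaned = call_tool("cancel_noise", {"audio_url": audio_url})
    raw = call_tool("speech_to_text", {"audio_url": cleaned["audio_url"]})
    polished = call_tool("punctuate_text", {"text": raw["text"]})
    return polished["text"]


def fake_call_tool(name: str, arguments: dict) -> dict:
    """Stand-in for a real MCP client; all keys and values here are invented."""
    if name == "cancel_noise":
        return {"audio_url": arguments["audio_url"] + "?denoised=1"}
    if name == "speech_to_text":
        return {"text": "wind speed was high near the ridge"}
    if name == "punctuate_text":
        return {"text": arguments["text"].capitalize() + "."}
    raise ValueError(f"unknown tool: {name}")


print(transcribe_noisy_recording(fake_call_tool, "https://example.com/field.wav"))
```

Passing the caller in as a parameter keeps the ordering logic separate from any particular client, so the same function can be exercised with a fake client or a real one.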

## Benefits

- Stop cleaning audio in multiple steps. Use cancel_noise to remove background buzz or traffic noise instantly, giving you clean source material right away.
- Don't just transcribe; understand the speakers. speaker_diarization identifies who spoke when across a long call, so meeting minutes can attribute each statement to a speaker instead of presenting one undifferentiated block of text.
- Scale content globally without hiring translators. Feed recorded speech into audio_translation, then use text_to_speech to generate polished voiceovers in other languages from within your agent's workflow.
- Turn notes into media. If you have raw transcripts that lack commas or periods, run them through punctuate_text to make the writing look professionally edited before publishing.
- Create endless content variations. Use clone_voice to replicate a speaker’s tone and pitch, letting your agent generate new material without needing the original person in the studio.

## How It Works

The MCP gives you a single pipeline that turns raw audio into polished transcripts, translations, and speech.

1. First, subscribe to this MCP and provide your NVIDIA API Key within your agent's configuration.
2. Next, pass the audio file (or the text you want spoken) from your AI client. The agent decides which sequence of tools is needed, such as noise removal followed by transcription, or translation.
3. Finally, the MCP returns the processed output: clean transcripts, translated audio files, or new voice recordings ready for the next step in your workflow.

## Frequently Asked Questions

**Does NVIDIA Audio MCP support multiple languages?**
Yes, it supports multiple languages for both transcription and translation. speech_to_text transcribes audio in the language spoken, and audio_translation translates spoken audio into the target language you specify.

**Can I clean noise from a recording before transcribing it with NVIDIA Audio?**
Yes. Before passing the recording to speech_to_text, run cancel_noise on the audio file to remove background static or hum, which should give a cleaner transcript.

**How does speaker_diarization work with NVIDIA Audio?**
speaker_diarization analyzes an audio recording and outputs a time-stamped log that identifies different speakers by assigning them unique labels throughout the file's duration.

**What is the difference between summarize_audio and transcribing with NVIDIA Audio?**
Transcribing (speech_to_text) gives you every word spoken. Summarizing (summarize_audio) takes that full transcript and condenses it into key takeaways, saving you reading time.

**Is voice cloning in NVIDIA Audio restricted to one language?**
No, the clone_voice tool allows you to establish a unique audio fingerprint. You can then generate new speech using that cloned voice across multiple languages for consistent branding.