# Monster API MCP

> Monster API provides access to high-performance AI models for image generation, text-to-speech, and transcription via serverless GPU infrastructure. Use your agent to run advanced tools like SDXL or Whisper without managing any local hardware or complex deployments.

## Overview
- **Category:** image-video
- **Price:** Free
- **Tags:** sdxl, whisper, text-to-speech, image-generation, serverless-gpu

## Description

Yo, listen up. This MCP server isn't some fancy marketing gimmick; it's straight GPU power wrapped up in an endpoint. You hook your agent into this thing, and you get access to top-tier AI models (SDXL for visuals, Whisper for audio, and Suno Bark for voices) without having to manage a single line of infrastructure code or spin up local hardware. It's just the tools, pure and simple.

**Image Generation.** You want images? First, you can generate one from scratch using `generate_sdxl`. Just hand it a text prompt, and the model spits out high-resolution visuals. If you got an existing photo you wanna tweak—maybe you need to change the background or fix up some details—you use `generate_image_to_image` for that. Both of these tools take your instructions and return a process ID; remember, they don't give you the final picture right away.

**Audio Processing.** Dealing with sound? You got two main options here. If you write something down but need it to sound like a person talking, `generate_sunno_bark` takes that text script and converts it into natural-sounding voiceover audio. Conversely, if you've recorded some actual speech—maybe an interview or a podcast clip—you upload the file, and `generate_whisper` runs Whisper on it to transcribe all that talk into clean text; it even handles formatting like SRT/VTT files.

**Job Status Tracking.** Since generating these things takes time—it's not instant magic—you gotta track them. That's where `get_job_status` comes in. You feed it the process ID you got back from any of the other tools (image, audio, or transcription), and it checks the progress until the job is done. When it's finished, that tool hands you the final output URL so your agent can download the finished asset.
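
The polling described above can be sketched roughly like this. It's a minimal sketch, not the server's actual client code: `get_status` is a stand-in for however your agent calls `get_job_status`, and the response keys (`status`, `output_url`) plus the `FAILED` state are assumptions for illustration. Only `COMPLETED` comes from this page.

```python
import time
from typing import Callable


def wait_for_job(
    get_status: Callable[[str], dict],
    process_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
) -> str:
    """Poll a job until it completes and return its output URL.

    `get_status` stands in for a call to the `get_job_status` tool.
    The "status" / "output_url" keys and the "FAILED" state are
    assumptions, not the documented response schema.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = get_status(process_id)
        status = result.get("status")
        if status == "COMPLETED":
            return result["output_url"]
        if status == "FAILED":  # assumed failure state
            raise RuntimeError(f"job {process_id} failed: {result}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {process_id} not done after {timeout_s}s")
        time.sleep(interval_s)
```

The timeout is there so a stuck job can't keep your agent looping forever; tune the interval to whatever the rate limits tolerate.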

In short: If you need to make an image, `generate_sdxl` builds it; if you wanna edit one, `generate_image_to_image` messes with it. If you got text and want sound, use `generate_sunno_bark`. If you got audio and need text, run `generate_whisper`. And no matter what job you start, always check the status using `get_job_status` until that process ID pops out a download link.

## Tools

### generate_image_to_image
Modifies an existing image using a text prompt, returning a process ID to poll for status.

### generate_sdxl
Generates a new image from scratch using SDXL and returns a process ID to poll for status.

### generate_sunno_bark
Converts input text into natural-sounding speech (TTS) and returns a process ID to poll for status.

### generate_whisper
Transcribes an uploaded audio file into text using Whisper, returning a process ID to poll for status.

### get_job_status
Checks the progress of any asynchronous generation job (image, audio, or transcription) and returns the final output URL when complete.

## Prompt Examples

**Prompt:** 
```
Generate a high-quality image of a cyberpunk city at night using SDXL in landscape mode.
```

**Response:** 
```
I've submitted the SDXL generation job. Your process ID is `abc-123`. I'll poll the status for you to retrieve the image URL.
```

**Prompt:** 
```
Transcribe this audio file into SRT format: https://example.com/audio.mp3
```

**Response:** 
```
Whisper transcription job started. Process ID: `trans-789`. I will let you know when the SRT file is ready.
```

**Prompt:** 
```
Check the status of my generation job with process ID 'job-xyz-456'.
```

**Response:** 
```
The job `job-xyz-456` is COMPLETED. You can access your generated asset here: [URL]
```

## Capabilities

### Generate images from text
Uses SDXL to create high-resolution visuals based on a simple text prompt.

### Modify existing images
Takes an existing photo and modifies it according to a new text prompt, for example to change the background or fix up details.

### Create natural-sounding audio
Converts written script into realistic voiceovers using advanced TTS models.

### Transcribe and translate speech
Takes an audio file and converts it to plain text or to subtitle formats (SRT/VTT).

### Check job status
Polls the API using a process ID until asynchronous media generation is finished, providing the final asset URL.

## Use Cases

### Building a multi-modal marketing asset pipeline
A content team needs 50 images and corresponding voiceovers. They ask their agent to: 1) Run `generate_sdxl` for the base visuals, 2) Use `generate_image_to_image` to add character variations, and finally, 3) Run `generate_sunno_bark` on the script to create narration audio. The whole process is managed via a single API sequence.
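
Here's one way that three-step sequence could be ordered for a single asset. Treat it as a sketch under assumptions: `call_tool` and `wait` are hypothetical callables (one invokes a tool by name, the other polls `get_job_status` until a URL comes back), and the argument keys `prompt` and `init_image_url` are made up for illustration. Check the tool schemas in your MCP client for the real names.

```python
from typing import Callable

ToolCall = Callable[[str, dict], dict]  # (tool name, arguments) -> tool result
Waiter = Callable[[str], str]           # process ID -> final output URL


def build_asset(
    call_tool: ToolCall,
    wait: Waiter,
    prompt: str,
    variation_prompt: str,
    script: str,
) -> dict:
    """Run the image, variation, and narration jobs for one asset."""
    # Narration doesn't depend on the images, so submit it first
    # and let it run while the visuals are generated.
    voice_id = call_tool("generate_sunno_bark", {"prompt": script})["process_id"]

    # Step 1: base visual. We need its URL before the edit can start.
    base_id = call_tool("generate_sdxl", {"prompt": prompt})["process_id"]
    base_url = wait(base_id)

    # Step 2: variation built on top of the finished base image.
    # "init_image_url" is an assumed argument name.
    variation_id = call_tool(
        "generate_image_to_image",
        {"init_image_url": base_url, "prompt": variation_prompt},
    )["process_id"]

    return {
        "base": base_url,
        "variation": wait(variation_id),
        "voiceover": wait(voice_id),
    }
```

Every job still goes through the same submit-then-poll cycle; the only real ordering constraint is that the edit has to wait for the base image.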

### Cleaning up user-recorded interviews
A product manager gets an hour-long raw audio file. They ask their agent to run `generate_whisper`. This tool transcribes the content and provides the data in SRT format, which they can immediately feed into a summary LLM call for actionable insights.
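
If the summary step only wants the spoken words, the SRT cue numbers and timestamps can be stripped first. This helper isn't part of the server; it's a small sketch that assumes standard SRT layout (a cue number, a `-->` timestamp line, then caption text).

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,000".
_TIMESTAMP = re.compile(
    r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def srt_to_text(srt: str) -> str:
    """Drop cue numbers and timestamps, keeping only the caption text."""
    kept = []
    for line in srt.splitlines():
        line = line.strip()
        # Note: a caption made up only of digits would also be dropped here.
        if not line or line.isdigit() or _TIMESTAMP.match(line):
            continue
        kept.append(line)
    return " ".join(kept)
```

Requesting plain text from the transcription tool directly avoids this step; the helper is only useful when you want to keep the SRT file around for subtitles as well.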

### Rapid prototyping of media features
A developer wants to test an 'edit photo' feature. Instead of setting up local models, they use `generate_image_to_image`. They provide a starting image and a prompt, get the process ID, and check the status until the edited asset is available for preview.

### Automating podcast episode prep
The team records an interview. The agent uses `generate_whisper` to transcribe the raw audio into a text document. They then use that text in a separate tool call to generate structured show notes, saving hours of manual cleanup.

## Benefits

- You get high-res visuals using `generate_sdxl` and `generate_image_to_image`, bypassing the need to manage local GPU memory or complex model dependencies. Just send a prompt.
- Stop juggling multiple services for audio content. Use `generate_sunno_bark` for text-to-speech, then pass that output directly into an LLM workflow—all within your agent's context.
- Need to analyze user feedback? Send the audio file once and use `generate_whisper`. It handles transcription and format conversion (SRT/VTT) so you get clean data immediately.
- The `get_job_status` tool means you don't have to build complex polling logic. You submit a job, track the ID, and wait for the final URL when it’s ready.
- This setup keeps your core application code clean. Instead of writing image generation boilerplate or audio processing SDK calls, you just call the appropriate tool name.

## How It Works

The bottom line: the server manages the entire lifecycle of demanding media tasks, from the initial request through GPU processing to final result retrieval.

1. Subscribe to this server and input your Monster API Key into your MCP client.
2. Your agent calls an initiation tool (e.g., `generate_sdxl`) with the necessary parameters.
3. The tool returns a process ID; you then call `get_job_status` with that ID until the status is COMPLETED to get the final output URL.
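
Under the hood, steps 2 and 3 are MCP `tools/call` requests, which your client normally builds for you. The sketch below just shows roughly what those JSON-RPC messages look like. The `aspect_ratio` parameter is mentioned in the FAQ; the `prompt` key and passing `process_id` as the status argument are assumptions about the tool schemas.

```python
import json


def tool_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


# Step 2: start the job ("prompt" is an assumed argument name).
submit = tool_call_request(
    1,
    "generate_sdxl",
    {"prompt": "a cyberpunk city at night", "aspect_ratio": "landscape"},
)

# Step 3: ask about the job using the ID from the first response.
poll = tool_call_request(2, "get_job_status", {"process_id": "abc-123"})
```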

## Frequently Asked Questions

**How do I transcribe an audio file using generate_whisper?**
You pass the audio file URL or data directly to `generate_whisper`. The server returns a process ID, and you must then use `get_job_status` repeatedly until it confirms the transcription is ready for download.

**Is generate_sdxl better than other image generation APIs?**
It provides access to SDXL directly without needing local setup. It's designed as a managed service, so you don't worry about versioning or resource allocation when generating visuals.

**What is the difference between generate_sdxl and generate_image_to_image?**
`generate_sdxl` creates an image from a text prompt only. `generate_image_to_image` requires you to provide both a starting image *and* a text prompt, modifying the original picture instead.

**How do I know when my job is done using get_job_status?**
After any generation call (like `generate_sunno_bark`), you must track the process ID using `get_job_status`. The response tells you exactly when the asset URL becomes available.

**What credentials do I need to run image generation with generate_sdxl?**
You must provide a valid Monster API key. This key authenticates your requests and manages billing for all generation tasks, including those using SDXL. Always secure this key.

**Are there rate limits when processing audio with generate_whisper?**
Yes, the service enforces rate limits to ensure stability across all users. If you exceed them, your AI client will receive a 429 error; wait and retry later.
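
"Wait and retry later" usually means exponential backoff. A minimal sketch, assuming your client surfaces the 429 as an exception: `RateLimited` here is a hypothetical exception class, not something this server or any SDK defines.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised by your client when it sees a 429."""


def with_backoff(
    action: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action`, doubling the wait after each rate-limit error."""
    for attempt in range(max_attempts):
        try:
            return action()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            sleep(base_delay_s * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

Wrap the submit call in it, for example `with_backoff(lambda: submit_whisper_job(url))`, where `submit_whisper_job` is whatever function you use to invoke the tool.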

**If my image job fails with generate_image_to_image, how do I get an error reason?**
The process status response includes an explicit error code. You must check the full job details to see if the failure was due to input constraints or a service issue.

**What file formats are supported for text-to-speech using generate_sunno_bark?**
This tool accepts plain text strings as primary input. The system handles conversion internally, so you don't need to worry about sending specific audio source files.

**How do I get the final result of an image generation job?**
Since generation is asynchronous, the tool returns a `process_id`. You must use the `get_job_status` tool with that ID to check if the status is 'COMPLETED' and retrieve the output URL.

**Can I specify the dimensions of the generated images?**
Yes, when using `generate_sdxl`, you can provide an `aspect_ratio` parameter such as 'square', 'landscape', or 'portrait' to control the output shape.

**What transcription formats does the Whisper tool support?**
The `generate_whisper` tool allows you to choose between 'text', 'srt', and 'vtt' formats via the `transcription_format` parameter.
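
The parameter values from the last two answers can be checked before a job is submitted, so a typo fails fast instead of burning a request. This is a sketch with assumptions: the `prompt` and `file` keys and the defaults are made up for illustration, and the aspect-ratio list is treated as complete even though the answer above only says "such as".

```python
# Values named in the FAQ; the aspect-ratio list may not be exhaustive.
ASPECT_RATIOS = {"square", "landscape", "portrait"}
TRANSCRIPTION_FORMATS = {"text", "srt", "vtt"}


def sdxl_args(prompt: str, aspect_ratio: str = "square") -> dict:
    """Build arguments for `generate_sdxl` ("prompt" key is assumed)."""
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"unsupported aspect_ratio: {aspect_ratio!r}")
    return {"prompt": prompt, "aspect_ratio": aspect_ratio}


def whisper_args(file_url: str, transcription_format: str = "text") -> dict:
    """Build arguments for `generate_whisper` ("file" key is assumed)."""
    if transcription_format not in TRANSCRIPTION_FORMATS:
        raise ValueError(
            f"unsupported transcription_format: {transcription_format!r}"
        )
    return {"file": file_url, "transcription_format": transcription_format}
```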