# NVIDIA NIM MCP

> NVIDIA NIM MCP connects your AI agent directly to physical hardware metrics, giving you deep visibility into GPU usage and LLM performance. You can check container health, track memory limits, pull real-time resource statistics via Prometheus endpoints, and manage model scaling, all without logging into a dashboard. It gives the ops engineer direct command over their ML infrastructure.

## Overview
- **Category:** industry-titans
- **Price:** Free
- **Tags:** mlops, gpu-telemetry, container-management, hardware-profiling, resource-monitoring, infrastructure-limits

## Description

This MCP lets your agent talk directly to the physical hardware running your AI workloads. Instead of relying on high-level dashboards that can mask the actual bottlenecks, you get direct access to monitoring and resource management for NVIDIA containers. You can ask your agent to check whether a model has finished loading or to pull raw performance numbers from Prometheus endpoints. You can also see exactly which models are loaded onto the GPU and scale replicas up or down with simple commands. It's like giving your AI client root access to the machine's core stats. If managing this complexity feels overwhelming, remember that Vinkius hosts this MCP, so your agent can connect once and get access to all of these hardware tools.

## Tools

### nim_check_health_live
Runs a liveness check to see if the physical host container orchestrator is running and responsive.

### nim_check_health_ready
Confirms that the GPU inference layers have finished loading all necessary model artifacts for use.

### nim_get_container_logs
Retrieves execution parameters and standard output logs from the container orchestrator layer.

### nim_get_gpu_status
Reports the GPU's currently allocated memory alongside its topological limits.

### nim_get_metadata
Returns metadata about the inference engine, including the foundational configuration bounds currently loaded.

### nim_get_metrics
Extracts comprehensive hardware scaling and performance metrics directly from Prometheus endpoints attached to NIM.

### nim_list_models
Dumps a list of all active LLMs that are allocated as inference targets on the backend array.

### nim_scale_replicas
Automatically adjusts the number of hardware replicas, scaling the execution layers up or down dynamically.

## Prompt Examples

**Prompt:** 
```
Run a liveness probe against the NIM container on the physical server and tell me whether it is responding.
```

**Response:** 
```
Ran the liveness probe (`nim_check_health_live`). The inference container responded with HTTP 200, so it is live.
```

**Prompt:** 
```
List the LLMs currently loaded as inference targets.
```

**Response:** 
```
Listed the active model targets (`nim_list_models`). Loaded model: 'meta/llama3-8b-instruct'.
```

**Prompt:** 
```
Pull GPU telemetry for the NIM Docker container and report its memory bounds.
```

**Response:** 
```
Pulled hardware telemetry with `nim_get_metrics`. The metrics parsed successfully, including the GPU memory bounds.
```

## Capabilities

### Check container health status
Determines if the physical host container orchestrator is running and responsive using liveness probes.

### Verify model readiness
Confirms whether the GPU inference layers have successfully loaded all required model artifacts for use.

### Extract hardware resource usage
Gathers specific details on allocated memory and topological limits mapped onto the NIM proxy.

### Pull performance metrics data
Fetches raw, actionable scaling metrics directly from Prometheus endpoints attached to the orchestrator.

### Audit active models deployed
Lists all currently loaded large language models (LLMs) that are available as inference targets on the backend array.

### Adjust resource scaling
Changes the number of hardware replicas assigned to the proxy, allowing you to scale execution layers up or down automatically.

## Use Cases

### Diagnosing a sudden performance drop
The agent detects high latency and runs `nim_get_metrics`. The output shows that GPU utilization is maxed out, pointing the engineer immediately to insufficient resources. They then use `nim_scale_replicas` to allocate more capacity.
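
The decision in this scenario can be sketched as a small rule. This is a minimal illustration, not the MCP's own logic: the function name `recommend_replicas`, the 90% threshold, and the replica cap are all assumptions chosen for the example.

```python
def recommend_replicas(
    gpu_utilization_pct: float,
    current_replicas: int,
    high_water_pct: float = 90.0,  # assumed threshold, tune for your workload
    max_replicas: int = 8,         # assumed cap, not a NIM limit
) -> int:
    """Suggest a replica count to pass to nim_scale_replicas (hypothetical helper)."""
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    if gpu_utilization_pct >= high_water_pct:
        # GPU is saturated: add one replica, but never exceed the cap.
        return min(current_replicas + 1, max_replicas)
    # Utilization is acceptable: leave the deployment unchanged.
    return current_replicas
```

Scaling by one replica at a time keeps the change easy to observe in the next `nim_get_metrics` reading before scaling further.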

### Validating model deployment
Before launching a new feature, an admin uses `nim_get_metadata` to verify that the foundational configuration bounds are correctly set. They then run `nim_check_health_ready` to ensure all required artifacts loaded properly.
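
Model artifacts can take a while to load, so a readiness check is often repeated until it passes. The sketch below shows one way to poll; `check_ready` stands in for whatever call wraps `nim_check_health_ready` in your client, and the attempt count and delay are assumptions.

```python
import time
from typing import Callable


def wait_until_ready(
    check_ready: Callable[[], bool],  # hypothetical wrapper around nim_check_health_ready
    attempts: int = 10,               # assumed retry budget
    delay_s: float = 2.0,             # assumed pause between checks
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll a readiness check until it passes or the attempts run out."""
    for attempt in range(attempts):
        if check_ready():
            return True
        # Skip the pause after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

Passing `sleep` as a parameter lets you test the loop without real delays.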

### Troubleshooting container failures
The agent fails to connect, so the engineer runs `nim_get_container_logs` and `nim_list_models` together. The logs reveal a permission error, while the model list confirms which models the deployment was supposed to be running.
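
Once the logs are retrieved, a quick filter can surface the permission-related lines. This is a rough sketch: the output format of `nim_get_container_logs` is not documented here, so the code assumes plain text with one entry per line, and the search patterns are illustrative guesses.

```python
import re

# Assumed phrases that commonly indicate permission problems; extend as needed.
_PERMISSION_HINTS = re.compile(
    r"permission denied|EACCES|unauthorized|forbidden", re.IGNORECASE
)


def find_permission_errors(log_text: str) -> list[str]:
    """Return the log lines that look like permission failures."""
    return [
        line for line in log_text.splitlines()
        if _PERMISSION_HINTS.search(line)
    ]
```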

## Benefits

- Instant Model Inventory: Use `nim_list_models` to get an immediate, clean dump of every LLM target running on your system. You don't have to guess what models are active.
- Deep Health Checks: Quickly verify the stack with dedicated calls: `nim_check_health_live` for liveness and `nim_check_health_ready` for readiness. This is faster than waiting for a dashboard widget to load.
- Performance Benchmarking: Access raw, structured data by running `nim_get_metrics`. This lets you pull Prometheus hardware scaling metrics needed for true performance analysis.
- Resource Visibility: Know exactly what's consuming memory. `nim_get_gpu_status` provides a clear breakdown of GPU topological limits and allocated memory variables.
- Operational Stability: When traffic spikes, don't panic. Use `nim_scale_replicas` to dynamically adjust resources, ensuring your models stay online without manual intervention.

## How It Works

The bottom line is that your agent gets a direct data stream into the physical performance layer of your AI infrastructure.

1. Your agent targets the local instance by specifying the `NVIDIA_NIM_URL` in the prompt.
2. The MCP queries that instance, including its Prometheus endpoints, for health, hardware, and latency data.
3. The MCP runs the requested operation and returns a status report or a diagnostic error code.
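
The steps above can be sketched as the JSON-RPC message an MCP client sends when the agent invokes a tool. The `tools/call` method and its `name` and `arguments` parameters come from the MCP specification; the helper name `build_tool_call` is hypothetical, and the empty argument set is an assumption, since the tools' input schemas are not documented here.

```python
import json
from itertools import count
from typing import Optional

_request_ids = count(1)  # each JSON-RPC request carries its own id


def build_tool_call(tool_name: str, arguments: Optional[dict] = None) -> str:
    """Serialize an MCP `tools/call` request for the given tool (hypothetical helper)."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments or {}},
    }
    return json.dumps(request)
```

In practice your MCP client builds and sends this message for you; the sketch only shows what travels over the wire when the agent picks a tool such as `nim_check_health_live`.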

## Frequently Asked Questions

**How do I check if my NIM container is alive using nim_check_health_live?**
You invoke `nim_check_health_live` to run a liveness probe. This checks the physical host orchestrator's status, telling you immediately if the core service layer is responsive or down.

**Does nim_get_gpu_status show total memory or used memory?**
It shows both the topological limits and the currently allocated memory parameters. This allows you to calculate available headroom, which is crucial for capacity planning.
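
The headroom calculation is simple arithmetic. The sketch below assumes you have already extracted the total and allocated memory in bytes; the parameter names are illustrative, because the exact fields returned by `nim_get_gpu_status` are not documented here.

```python
def memory_headroom(total_bytes: int, allocated_bytes: int) -> tuple[int, float]:
    """Return (free bytes, fraction of memory in use)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    if not 0 <= allocated_bytes <= total_bytes:
        raise ValueError("allocated_bytes must be between 0 and total_bytes")
    free_bytes = total_bytes - allocated_bytes
    return free_bytes, allocated_bytes / total_bytes
```

For example, a GPU with 80 GiB total and 60 GiB allocated has 20 GiB of headroom and is 75% utilized.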

**What should I use if I need detailed performance data? Is nim_get_metrics correct?**
Yes, `nim_get_metrics` is the right tool. It pulls Prometheus-formatted hardware scaling metrics directly from the orchestrator, giving you raw, quantitative data points.
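
If the tool returns metrics in the Prometheus text exposition format, a small parser can turn them into structured samples. This is a simplified sketch: it skips comment lines, ignores timestamps, and does not unescape label values. Whether `nim_get_metrics` returns raw text or pre-parsed data is an assumption to verify against your own output.

```python
import re

# `name{labels} value` or `name value`; any trailing timestamp is ignored.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_prometheus_text(text: str) -> list[tuple[str, dict, float]]:
    """Parse Prometheus text format into (metric name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and `# HELP` / `# TYPE` comments.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

For production use, an established Prometheus client library is a safer choice than a hand-rolled parser.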

**If I increase traffic, how do I manage capacity with nim_scale_replicas?**
You call `nim_scale_replicas` and provide the desired replica count. The MCP handles the dynamic orchestration of scaling the execution layers up or down safely.

**What is the difference between nim_list_models and nim_get_metadata?**
Use `nim_list_models` for a simple, clean dump of which LLMs are loaded. Use `nim_get_metadata` to pull deeper information about the foundational configuration bounds themselves.