# LangSmith MCP

> LangSmith (LLM Observability & Hub) gives you visibility into your LLM pipelines. It lets your agent inspect traced model calls, audit prompt templates, and track performance metrics. You get detailed logs for debugging complex multi-step AI workflows directly through natural conversation with any MCP-compatible client.

## Overview
- **Category:** superpower
- **Price:** Free
- **Tags:** llm-observability, tracing, prompt-management, evaluation, ai-debugging, llm-ops

## Description

Debugging large language models can be a nightmare. When an agent fails, you need to know exactly *why*. This MCP connects your LLM application to LangSmith, giving you deep observability over every run. Instead of digging through massive UI dashboards and filtering logs manually, you talk to your agent, and it retrieves the necessary data for you. You can ask what happened in a specific pipeline, pull precise metrics on token usage or latency, or check the full history of prompt templates used across projects. It's like having a dedicated diagnostic console built into your workflow. Because Vinkius hosts this MCP, you connect once from any client and get access to robust LLM governance for debugging and auditing.

## Tools

### list_projects
Maps out the boundaries of distinct AI pipelines, allowing you to see all active tracing projects.

### list_runs
Lists specific LLM invocation runs, showing the prompts sent and responses received within a project.

### get_run
Gets detailed performance metrics for a single, specific LLM invocation run.

### list_datasets
Retrieves a list of all evaluation and fine-tuning datasets tracked in LangSmith.

### list_prompts
Extracts a directory listing of all available prompt templates hosted in the LangChain Hub.

### list_annotation_queues
Lists all active human-in-the-loop queues where people are reviewing generated model traces.
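
Under the Model Context Protocol, a client invokes a tool by sending a JSON-RPC 2.0 `tools/call` request that carries the tool name and its arguments. The sketch below only builds that request body; it does not open a connection. The `run_id` argument name is an assumption for illustration, since this page does not document each tool's input schema.

```python
import json
from itertools import count
from typing import Optional

# JSON-RPC expects each request to carry an id; a simple counter is enough here.
_request_ids = count(1)


def build_tool_call(name: str, arguments: Optional[dict] = None) -> str:
    """Serialize an MCP `tools/call` request for the named tool."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }
    return json.dumps(request)


# `list_projects` is called here without arguments.
print(build_tool_call("list_projects"))

# `run_id` is an assumed argument name; check the tool's real input schema.
print(build_tool_call("get_run", {"run_id": "run-98765"}))
```

In practice your MCP client builds and sends these requests for you; the sketch is only meant to show what a tool invocation looks like on the wire.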

## Prompt Examples

**Prompt:** 
```
List all active tracing projects in LangSmith
```

**Response:** 
```
I've found 4 active tracing projects: 'Production-Bot-V2', 'Staging-RAG-Explorer', 'Customer-Support-Agent', and 'Internal-Testing-Suite'. Which one would you like to see the recent runs for?
```

**Prompt:** 
```
Show me the telemetry for the last run in the 'Production-Bot-V2' project
```

**Response:** 
```
Retrieving data for run ID 'run-98765'… The execution was successful. Total tokens: 1,420 (Prompt: 850, Completion: 570). Latency: 2.4s. The agent successfully used the 'Web-Search' tool before generating the answer. No errors detected.
```

**Prompt:** 
```
List all prompts hosted in our Hub repository
```

**Response:** 
```
I've identified 3 prompts in your Hub: 'summarization-agent-v1', 'customer-service-v3', and 'data-extraction-helper'. I can retrieve the full instruction text and version history for any of these.
```

## Capabilities

### Trace entire agent workflows
See the step-by-step execution path of multi-turn agents, including every tool call and internal reasoning decision.

### Analyze model performance metrics
Extract precise data points like token count, prompt latency, and error strings from any completed LLM run.
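
As a rough illustration of working with these data points, the sketch below parses a run payload into a small record and derives the total token count. The field names (`id`, `prompt_tokens`, `completion_tokens`, `latency_seconds`, `error`) are hypothetical, not the documented output of `get_run`; the sample values mirror the telemetry response in the Prompt Examples section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunTelemetry:
    run_id: str
    prompt_tokens: int
    completion_tokens: int
    latency_seconds: float
    error: Optional[str] = None

    @property
    def total_tokens(self) -> int:
        # Total usage is the sum of prompt and completion tokens.
        return self.prompt_tokens + self.completion_tokens


def parse_run(payload: dict) -> RunTelemetry:
    """Build a RunTelemetry from a payload; all keys are assumed names."""
    return RunTelemetry(
        run_id=payload["id"],
        prompt_tokens=payload.get("prompt_tokens", 0),
        completion_tokens=payload.get("completion_tokens", 0),
        latency_seconds=payload.get("latency_seconds", 0.0),
        error=payload.get("error"),
    )


sample = {
    "id": "run-98765",
    "prompt_tokens": 850,
    "completion_tokens": 570,
    "latency_seconds": 2.4,
}
telemetry = parse_run(sample)
print(telemetry.total_tokens)  # 1420
print(telemetry.error is None)  # True
```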

### Manage prompt versions
Access the central hub to view, retrieve, and audit all managed prompt templates and their version history.

### Audit human feedback queues
List active annotation queues where human reviewers assess model safety, alignment, or accuracy in generated traces.

### Track evaluation datasets
View the curated 'golden' datasets used for automatically testing prompt logic and few-shot models.

## Use Cases

### The agent hallucinated a key fact.
An ML Engineer notices an agent giving incorrect data. They first use `list_projects` to find the correct pipeline, then call `list_runs` for that project. Finally, they use `get_run` on the failing run ID to get the exact token usage and error strings needed to fix the prompt.
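
The same drill-down can be sketched in code. This is a minimal illustration, not the server's actual interface: `fake_call_tool` is a hypothetical stand-in for an MCP client's tool-call method, and the argument names (`project`, `run_id`) and response fields are assumptions.

```python
from typing import Any, Callable, Optional

# Canned responses standing in for the real tools. Every field name here
# ("name", "id", "error", "total_tokens") is an assumption for illustration.
FAKE_RESPONSES = {
    "list_projects": [
        {"name": "Production-Bot-V2"},
        {"name": "Staging-RAG-Explorer"},
    ],
    "list_runs": [
        {"id": "run-98764", "error": None},
        {"id": "run-98765", "error": "RateLimitError: too many requests"},
    ],
    "get_run": {
        "id": "run-98765",
        "total_tokens": 1420,
        "error": "RateLimitError: too many requests",
    },
}

ToolCaller = Callable[[str, dict], Any]


def fake_call_tool(name: str, arguments: dict) -> Any:
    """Hypothetical stand-in for a real MCP client call."""
    return FAKE_RESPONSES[name]


def first_failed_run(call_tool: ToolCaller, project: str) -> Optional[dict]:
    """Walk list_projects -> list_runs -> get_run for the first run with an error."""
    names = {p["name"] for p in call_tool("list_projects", {})}
    if project not in names:
        raise ValueError(f"unknown project: {project}")
    for run in call_tool("list_runs", {"project": project}):
        if run.get("error"):
            return call_tool("get_run", {"run_id": run["id"]})
    return None  # no run in this project reported an error


details = first_failed_run(fake_call_tool, "Production-Bot-V2")
print(details["id"], details["error"])
```

A hallucinated answer will not always carry an error string, so in a real investigation you would likely inspect the prompts and responses from `list_runs` as well.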

### We need a new feature-specific prompt.
An AI Developer needs a better data extraction template. They start by running `list_prompts` to see what's available in the Hub, verify existing templates, and then retrieve the full instruction text for versioning.

### Our model seems unsafe on edge cases.
An LLM Analyst suspects alignment issues. They use `list_annotation_queues` to pull up the live queue where human reviewers are assessing safety, allowing them to report on overall model grounding immediately.

### We need to test a new dataset against an old prompt.
A data scientist wants to benchmark. They run `list_datasets` to confirm the available evaluation sets and then use these identifiers when checking performance metrics via `get_run`.

## Benefits

- Stop guessing why an agent failed. By calling `get_run`, you instantly pull precise metrics like token consumption and latency, pinpointing the exact moment of failure.
- Manage your prompt logic centrally. Use `list_prompts` to see every template in the LangChain Hub and check its full version history without navigating a separate UI.
- Track model safety with human oversight. The `list_annotation_queues` tool lets you audit where human reviewers are assessing accuracy, helping you ground your model's behavior.
- Map out your entire infrastructure quickly. Running `list_projects` shows all active AI pipelines, letting you focus only on the systems that matter right now.
- Verify testing assets with one call. Use `list_datasets` to enumerate 'golden' datasets, confirming the structure used for automated evaluation before deployment.

## How It Works

You get access to your LLM infrastructure metrics without leaving your chat interface.

1. Subscribe to this MCP and provide your LangSmith API Key and Endpoint credentials.
2. Your agent connects using the Vinkius framework, establishing a secure link to the monitoring platform.
3. You query the system, for instance 'Show me the performance for last week's runs', and your agent returns the data in the chat.

## Frequently Asked Questions

**How do I check the performance metrics for a single LLM invocation run using get_run?**
Call `get_run` with the specific run ID. It returns telemetry for that run, including total tokens consumed and latency in seconds.

**What is list_projects for in LangSmith?**
`list_projects` maps out all distinct AI pipelines you are currently monitoring. This tool helps scope your investigation by showing which projects have recent activity or need auditing.

**Can I see what prompt templates my agent is using with list_prompts?**
Yes, `list_prompts` extracts all available templates from the LangChain Hub. This lets you audit which instructions are active and check their version histories.

**What should I do if I need to see a list of evaluation datasets?**
To view your curated 'golden' datasets for testing, use `list_datasets`. This confirms the data structure you should be using when measuring model performance.

**If I want to see all raw interactions in a project, should I use list_runs?**
Yes. `list_runs` lists every run within a specific project. You get the full history of prompts sent and responses received from the LLM, which is critical for debugging complex failure paths.

**What does list_annotation_queues do regarding human oversight?**
This tool lists active queues where human reviewers are assessing generated LLM traces. You can check if your model's outputs meet alignment or safety standards before you deploy them.

**How can I use list_projects to understand my monitoring scope?**
It maps out the boundaries of every distinct AI pipeline currently running in your environment. This helps you know exactly where all your tracing data is segmented across the platform.

**When using get_run, how do I find specific error messages from a failed run?**
The telemetry returned by `get_run` includes exact error strings. This lets you pinpoint failure modes, such as API rate limits or invalid inputs, without having to guess the cause of the crash.

**Can I see the token usage for a specific LLM run through my agent?**
Yes. Use the `get_run` tool with a specific run ID. Your agent will retrieve the exact token count (prompt + completion) and latency metrics calculated by LangSmith for that interaction.

**How do I fetch a prompt template from the LangChain Hub using natural language?**
The `list_prompts` tool allows your agent to navigate your hosted Hub repository. You can ask your agent to find a specific prompt by name to inspect its instruction text, variables, and version history.

**Can my agent check the status of human annotation queues?**
Absolutely. Use the `list_annotation_queues` tool to retrieve all active queues where human feedback is being collected. Your agent can report on the number of pending traces and general alignment scores established by your reviewers.