# Langfuse MCP

> Langfuse connects your AI agent directly to deep LLM observability and evaluation data. You track API session traces, inspect token usage, manage prompt versions, and audit model accuracy metrics without leaving your chat window.

## Overview
- **Category:** friends-mcp
- **Price:** Free
- **Tags:** llm-tracing, prompt-management, evaluation-metrics, token-tracking, ai-analytics, llm-ops

## Description

Every time you build an application using large language models, the actual execution details get buried in logs. This MCP lets your agent connect to Langfuse, giving you full visibility into what the model is doing and why it might fail. You can ask about specific API calls and retrieve the exact payload that caused a latency spike. It's not just logging; it's structured monitoring for performance and quality control. If you need to track costs or review prompt quality, this MCP gives your agent direct access to that data. It integrates into your existing stack via Vinkius, letting you pull insights from complex systems by asking questions in natural language.

## Tools

### get_trace
Fetches all telemetry and the nested graph for one complete LLM API session.

### get_daily_metrics
Generates rolled-up reports showing total USD cost and aggregated latency for the day.

### create_observation
Adds a detailed event, span, or generation record into an active LLM trace.

### get_observation
Retrieves context from a single specific span or generation event within a trace.

### list_observations
Lists raw observation objects across multiple traces.

### list_prompts
Lists all active prompt templates and their versions.

### create_score
Attaches human feedback or automated quality metrics to a specific model run.

### list_scores
Lists all stored evaluation scores, including the quality or cost metrics recorded against model runs.

### list_sessions
Retrieves high-level groups of user interactions that contain multiple related traces.

### list_traces
Lists all recorded LLM API sessions for quick review.

## Prompt Examples

**Prompt:** 
```
List the last 5 traces in my Langfuse project
```

**Response:** 
```
I've retrieved the latest 5 traces. Highlights include 'Email-Summarization' (Latency: 1.2s, 450 tokens), 'Chat-Interaction-987' (Latency: 2.4s, 1,200 tokens), and 'Data-Extraction-Fail' (Status: Error). Would you like to inspect the error payload for the failed trace?
```

**Prompt:** 
```
Show me the instructions for the 'customer-support-v3' prompt
```

**Response:** 
```
Retrieving 'customer-support-v3'… The system instruction is: 'You are a helpful support agent for TechCorp. Answer queries based on the provided documentation…'. It expects variables: 'customer_name' and 'query_text'. Would you like to see the previous versions?
```

**Prompt:** 
```
What was our total LLM spending for today?
```

**Response:** 
```
Your total LLM spending for today is $12.45 across 45,200 total tokens. The average latency is 1.8s. The most expensive model remains 'gpt-4-turbo' with $8.20 in costs. I can provide a breakdown by provider if needed.
```

## Capabilities

### Audit full interaction chains
Retrieve the complete history of an AI session, including all steps, timings, and token counts.

### Pinpoint performance bottlenecks
Drill down into specific moments within a trace to find out exactly where latency or failures occurred.

### Manage system instructions
View and query the active versions of prompt templates used by the model, checking for expected inputs.

### Measure quality and cost
Attach human feedback or automated metrics to specific runs, and generate daily reports on total USD spending and average latency.

### Analyze user context flow
Group together related conversations to understand multi-turn interaction boundaries over time.

## Use Cases

### Debugging an intermittent API error
An engineer notices a chat feature fails sometimes. They tell their agent, 'Show me the last three failed traces.' The agent uses `list_traces` and then pulls the specific context with `get_observation`, revealing that the failure only happens when a certain variable is null.
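
The filtering step in this scenario can be sketched in a few lines of Python. This is only an illustration of the idea: the trace records and their field names (`status`, `timestamp`) are hypothetical, and the real `list_traces` output may be shaped differently.

```python
from datetime import datetime

# Hypothetical trace summaries; field names are assumptions, not the tool's schema.
traces = [
    {"id": "t1", "status": "ok", "timestamp": "2024-05-01T10:00:00+00:00"},
    {"id": "t2", "status": "error", "timestamp": "2024-05-01T10:05:00+00:00"},
    {"id": "t3", "status": "error", "timestamp": "2024-05-01T09:00:00+00:00"},
    {"id": "t4", "status": "error", "timestamp": "2024-05-01T11:00:00+00:00"},
    {"id": "t5", "status": "error", "timestamp": "2024-05-01T08:00:00+00:00"},
]


def last_failed(records, limit=3):
    """Return the most recent failed traces, newest first."""
    failed = [r for r in records if r["status"] == "error"]
    failed.sort(key=lambda r: datetime.fromisoformat(r["timestamp"]), reverse=True)
    return failed[:limit]


recent_failures = last_failed(traces)
```

Each returned ID could then be passed to `get_observation` to inspect the failing step.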

### Auditing prompt compliance
A Product Owner needs to check if developers are using the latest version of the internal 'customer support' guide. They ask their agent, and it uses `list_prompts` to display the system instructions and expected variables for review.

### Calculating operational cost
The CTO needs an end-of-month report on AI spending. The agent queries `get_daily_metrics` and rolls the daily figures up into a total USD cost, total tokens consumed, and average latency for the month.
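
A monthly figure built from daily reports is a simple rollup, sketched below. The record layout (`cost_usd`, `tokens`, `avg_latency_s`, `traces`) is an assumption made for illustration; the actual `get_daily_metrics` output may differ. Note that average latency is weighted by the number of traces per day, since a plain mean of daily averages would overweight quiet days.

```python
# Hypothetical daily reports; field names are assumptions, not the tool's schema.
daily = [
    {"date": "2024-05-01", "cost_usd": 12.45, "tokens": 45200, "avg_latency_s": 1.8, "traces": 100},
    {"date": "2024-05-02", "cost_usd": 7.55, "tokens": 30000, "avg_latency_s": 1.2, "traces": 50},
]


def monthly_rollup(days):
    """Sum cost and tokens, and compute a trace-weighted average latency."""
    total_traces = sum(d["traces"] for d in days)
    weighted_latency = sum(d["avg_latency_s"] * d["traces"] for d in days)
    return {
        "cost_usd": round(sum(d["cost_usd"] for d in days), 2),
        "tokens": sum(d["tokens"] for d in days),
        "avg_latency_s": weighted_latency / total_traces if total_traces else 0.0,
    }


report = monthly_rollup(daily)
```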

### Analyzing multi-user behavior
A data scientist wants to know if users who interact with Feature A also tend to use Feature B. The agent uses `list_sessions` to group correlated user activity, allowing them to pinpoint usage patterns across different features.

## Benefits

- You instantly see the cost breakdown. Instead of guessing, use `get_daily_metrics` to get aggregated reports on total USD spending and average latency for today's runs.
- Debugging complex chains is faster. You can retrieve a full session graph using `get_trace`, letting you see every single payload that passed through the system.
- Never lose track of a conversation. By calling `list_sessions`, your agent groups together all related user interactions, making it easier to improve long-term workflows.
- Manage prompt drift easily. Use `list_prompts` to inspect active templates and see exactly what system instructions are currently running in production.
- Validate model output quality using structured feedback. You can assign scores via `create_score`, attaching human judgment or automated metrics to specific runs.
- Deep dive into failures. If a call breaks, you don't have to search logs; just ask your agent and use `get_observation` to get the context of that failure.

## How It Works

The bottom line is: you talk to your agent, and it talks directly to your live LLM data store.

1. Subscribe to the MCP and provide your Langfuse API URL, Public Key, and Secret Key.
2. Your agent connects using the credentials. This initializes monitoring for all LLM activity.
3. You ask a question like, 'What were the top three most expensive calls today?' and get an immediate, structured answer.
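
How this MCP uses the credentials internally is not documented here. As a rough sketch of what a key pair like this is typically for: Langfuse's public API accepts HTTP Basic authentication with the Public Key as the username and the Secret Key as the password. The key values below are placeholders.

```python
import base64


def basic_auth_header(public_key: str, secret_key: str) -> dict:
    """Build an HTTP Basic auth header from a public/secret key pair."""
    raw = f"{public_key}:{secret_key}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder keys for illustration only.
headers = basic_auth_header("pk-lf-example", "sk-lf-example")
```

The resulting header would be sent with each request to the Langfuse API URL supplied in step 1.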

## Frequently Asked Questions

**How do I check the total spending with Langfuse MCP?**
Run `get_daily_metrics`. This tool provides an aggregated report on your total USD costs and average latency across all runs for the day.

**What does `get_trace` do in Langfuse MCP?**
It retrieves the complete, detailed telemetry graph for a single LLM session. This shows every internal step (span) that occurred during the API call.

**How do I see which prompts my agent uses with Langfuse MCP?**
Use `list_prompts`. This tool extracts and displays all actively managed prompt templates, letting you inspect their system instructions and expected input variables.

**How do I track multiple conversations in Langfuse MCP?**
Call `list_sessions` to get high-level user session entities. This groups together related multi-turn interactions, helping you understand the full context.

**How can I use `list_observations` to find a specific performance bottleneck within an LLM trace?**
You get raw data points by listing observations, which lets you examine individual spans or generations. This pinpoints exactly where latency spikes or errors occurred in the chain, helping you diagnose bottlenecks without reviewing the entire session graph.
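
As a minimal sketch of that analysis, the snippet below finds the slowest observation in a list by comparing start and end times. The records and their field names (`start`, `end`, `type`) are hypothetical stand-ins for whatever `list_observations` actually returns.

```python
from datetime import datetime

# Hypothetical observations from one trace; field names are assumptions.
observations = [
    {"id": "obs-1", "type": "SPAN", "name": "retrieve-docs",
     "start": "2024-05-01T10:00:00+00:00", "end": "2024-05-01T10:00:01+00:00"},
    {"id": "obs-2", "type": "GENERATION", "name": "llm-call",
     "start": "2024-05-01T10:00:01+00:00", "end": "2024-05-01T10:00:04+00:00"},
    {"id": "obs-3", "type": "SPAN", "name": "format-output",
     "start": "2024-05-01T10:00:04+00:00", "end": "2024-05-01T10:00:05+00:00"},
]


def duration_s(obs):
    """Wall-clock duration of one observation in seconds."""
    start = datetime.fromisoformat(obs["start"])
    end = datetime.fromisoformat(obs["end"])
    return (end - start).total_seconds()


slowest = max(observations, key=duration_s)
```

Here the `llm-call` generation dominates the trace, so that is the step to inspect first.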

**Should I use `create_score` when evaluating model grounding and accuracy?**
Yes, using `create_score` lets you attach structured feedback or evaluation metrics to a specific trace or observation. This is critical for monitoring model performance against defined human standards or automated quality checks.

**What's the difference between `get_trace` and `get_observation` when troubleshooting?**
`get_trace` retrieves the complete, nested graph of an entire LLM API session. If you only need to check a single event or span within that trace, use `get_observation` for faster, more targeted context retrieval.

**How do I analyze which parts of my application are consuming the most tokens using `list_traces`?**
You can list traces to review metadata attached to each API session. This raw data allows you to quickly sort and identify sessions with unusually high token counts or excessive latencies across your various pipelines.
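
One way to turn a trace listing into a per-pipeline view is to group by trace name and sum the tokens, as sketched below. The `name` and `total_tokens` fields are assumptions for illustration; the metadata actually returned by `list_traces` may be named differently.

```python
from collections import defaultdict

# Hypothetical trace metadata; field names are assumptions, not the tool's schema.
traces = [
    {"name": "Email-Summarization", "total_tokens": 450},
    {"name": "Chat-Interaction", "total_tokens": 1200},
    {"name": "Email-Summarization", "total_tokens": 600},
]


def tokens_by_name(records):
    """Sum tokens per trace name and return pairs sorted from highest to lowest."""
    totals = defaultdict(int)
    for record in records:
        totals[record["name"]] += record["total_tokens"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = tokens_by_name(traces)
```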

**Can I see the exact system instruction for a specific prompt version?**
Yes. Use the `list_prompts` tool to browse your managed templates. Your agent can retrieve the exact text and variables for any deployed prompt version, making it easy to audit AI logic through natural conversation.

**How do I log human feedback for a specific trace?**
Use the `create_score` tool by providing the Trace ID and a JSON payload defining the score name (e.g. 'user-satisfaction') and value. Your agent will attach this structured data directly to the Langfuse record.
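
A payload of that kind might look like the sketch below. The field names (`traceId`, `name`, `value`, `comment`) and the helper `build_score_payload` are assumptions chosen for illustration; check the tool's own schema before relying on them.

```python
import json


def build_score_payload(trace_id, name, value, comment=None):
    """Assemble a score payload; field names are illustrative assumptions."""
    if not trace_id:
        raise ValueError("trace_id is required")
    payload = {"traceId": trace_id, "name": name, "value": value}
    if comment:
        payload["comment"] = comment
    return payload


# Example: record a thumbs-up from a user on a hypothetical trace ID.
payload = build_score_payload("trace-123", "user-satisfaction", 1, comment="Helpful answer")
print(json.dumps(payload, indent=2))
```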

**Can my agent report on my LLM spending for the current day?**
Absolutely. The `get_daily_metrics` tool retrieves aggregated USD costs and average latency metrics from Langfuse. Your agent can summarize these statistics to help you monitor your infrastructure budget in real time.