Langfuse MCP for AI. See exactly how your AI calls work.
Works with every AI agent you already use
…and any MCP-compatible client
Connect to your AI in seconds.
Langfuse connects your AI agent directly to deep LLM observability and evaluation data. You can track API session traces, inspect token usage, manage prompt versions, and audit model accuracy metrics without leaving your chat window.
What your AI can do
Retrieve the complete history of an AI session, including all steps, timings, and token counts.
Drill down into specific moments within a trace to find out exactly where latency or failures occurred.
View and query the active versions of prompt templates used by the model, checking for expected inputs.
Attach human feedback or automated metrics to specific runs, and generate daily reports on total USD spending and average latency.
Group together related conversations to understand multi-turn interaction boundaries over time.
Langfuse (LLM Tracing & Evals) - 10 Tools
These tools let you query every part of your LLM application—from full session traces to specific cost metrics and prompt versions.
Make your AI actually useful.
Add this MCP to Claude, Cursor, or Windsurf and your AI stops guessing. It gets real tools to look things up, take action, and handle the stuff you keep doing by hand.
Start using Langfuse (LLM Tracing & Evals) on Vinkius
Get Trace
Fetches all telemetry and the nested graph for one complete LLM API session.
Get Daily Metrics
Generates rolled-up reports showing total USD cost and aggregated latency for the day.
Create Observation
Adds a detailed event, span, or generation record into an active LLM trace.
Get Observation
Retrieves context from a single specific span or generation event within a trace.
List Observations
Lists raw observation objects across multiple different traces.
List Prompts
Extracts and views all active prompt templates and their versions.
Create Score
Attaches human feedback or automated quality metrics to a specific model run.
List Scores
Lists all stored evaluation scores, mapping quality or cost algorithms used on model...
List Sessions
Retrieves high-level groups of user interactions that contain multiple related...
List Traces
Lists all recorded LLM API sessions for quick review.
Security and governance baked right in.
Create a Vinkius account, subscribe, and connect your AI client to get set up. We handle the entire backend infrastructure, with out-of-the-box support for Streamable HTTP over HTTPS, SSE, and OAuth2, so there is no routing to configure.
Choose How to Get Started
Build a custom MCP for your own tools, or connect a ready-made integration from our catalog.
Build Your Own
Turn any API into an MCP. Import a spec, define Agent Skills, or deploy with MCPFusion.
- Import from OpenAPI, Swagger, or YAML specs
- Create Agent Skills with progressive disclosure
- Deploy to edge with MCPFusion framework
- Built-in DLP, auth, and compliance on every call
- Real-time usage dashboard and cost metering
- Publish to catalog or keep private
Make Your AI Do More
Start with Langfuse (LLM Tracing & Evals), then connect any of our 5,100+ other servers whenever your AI needs more. One click, no limits.
- Use this MCP plus 5,100+ others, all in one place
- Add new capabilities to your AI anytime you want
- Every connection is secured and compliant automatically
- Track usage and costs across all your servers
- Works with Claude, ChatGPT, Cursor, and more
- New servers added to the catalog every week
Independent Platform Disclaimer: Vinkius is an independent platform and is not affiliated with, endorsed by, sponsored by, verified by, or otherwise authorized by Langfuse. All third-party trademarks, logos, and brand names are the property of their respective owners. Their use on this website is strictly for informational purposes to identify service compatibility and interoperability.
VINKIUS INFRASTRUCTURE
- Cloud Hosted: managed infra
- V8 Isolated: sandboxed per request
- Zero-Trust Proxy: no stored credentials
- DLP Enforced: policy on every call
- GDPR Compliant: EU data residency
- Token Compression: ~60% cost reduction
Works with Claude, ChatGPT, Cursor, and more
The Model Context Protocol standardizes how applications expose capabilities to LLMs. Instead of operating in isolation, your AI gains direct access to external platforms, live data, and real-world actions through secure, standardized connections.
This connection provides 10 powerful capabilities that interface natively with Claude, ChatGPT, Cursor, and other compatible AI platforms. No middleware. No custom integration required.
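Under the hood, an MCP client invokes a tool by sending a JSON-RPC 2.0 `tools/call` request whose `params` carry the tool name and its arguments. The sketch below builds such a message for `get_trace`. The `trace_id` argument name is an assumption: the real input schema is whatever the server reports from `tools/list`.

```python
import json
from itertools import count

# Monotonically increasing request IDs for JSON-RPC 2.0 messages.
_ids = count(1)


def tools_call(name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request)


# `trace_id` is a hypothetical argument name; check the tool's input schema.
message = tools_call("get_trace", {"trace_id": "abc-123"})
print(message)
```

Your AI client builds and sends these messages itself; the sketch only shows what travels over the wire.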
The hardest part isn't building the AI; it's knowing what happened when it failed.
Today, if your agent breaks or performs slowly, you end up in a mess. You copy IDs from one dashboard, hop to another to check token counts, and maybe jump to a third system just to see the payload. It's manual, it's slow, and it makes correlating events across multiple services nearly impossible.
With this MCP, you talk to your agent about the failure. You don't copy anything; you just ask. Your agent then pulls all that cross-referenced data—the timings, the payloads, the entire execution graph—and gives it back in a readable format. It cuts out the dashboard hopping.
Langfuse MCP: Get Quality Scores and Usage Metrics
You stop relying on guesswork for model quality. Instead of hoping the AI is good enough, you can attach structured human feedback or automated metrics directly to specific runs using `create_score`. You also get day-by-day cost visibility by running `get_daily_metrics`.
This means your development cycle shifts from 'Did it work?' to 'How well did it work, and what did it cost us?' It’s a fundamental shift in how you treat AI functionality.
What your AI can actually do with this
Every time you build an application using large language models, the actual execution details get buried in logs. This MCP lets your agent connect to Langfuse, giving you full visibility into what the model is doing—and why it might fail. You can ask about specific API calls and retrieve the exact payload that caused a latency spike.
It's not just logging; it’s structured monitoring for performance and quality control. If you need to track costs or check how good the prompts are, this MCP gives your agent direct access. It integrates into your existing stack via Vinkius, letting you pull insights from complex systems simply by asking questions in natural language.
Here's how it actually works
The bottom line is: you talk to your agent, and it talks directly to your live LLM data store.
Subscribe to the MCP and provide your Langfuse API URL, Public Key, and Secret Key.
Your agent connects to your Langfuse project with those credentials and can then read and annotate the traces, metrics, prompts, and scores your application records in Langfuse.
You ask a question like, 'What were the top three most expensive calls today?' and get an immediate, structured answer.
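Step 1 asks for a Public Key and a Secret Key because Langfuse's public REST API, as we understand it, authenticates with HTTP Basic auth: the public key is the username and the secret key is the password. The sketch below shows how such a header is formed under that assumption. The hosted connector presumably does the equivalent for you, and `langfuse_auth_header` is a hypothetical helper name.

```python
import base64


def langfuse_auth_header(public_key: str, secret_key: str) -> dict:
    """Build an HTTP Basic Authorization header from a Langfuse key pair.

    Assumes the public key is the username and the secret key the password.
    """
    raw = f"{public_key}:{secret_key}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder keys for illustration only; never hard-code real secrets.
headers = langfuse_auth_header("pk-lf-example", "sk-lf-example")
```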
Who is this actually for?
The LLM engineer who hates sifting through raw log files. The product owner who needs to justify the cost of every API call. Any developer who treats AI functionality as a production service, not a prototype.
Debugging complex chains and measuring exact token latencies across multiple microservices.
Auditing evaluation metrics and managing prompt templates to improve model grounding without manual dashboard searches.
Monitoring daily AI costs and user satisfaction scores across multiple production environments for business reporting.
What Changes When You Connect
You instantly see the cost breakdown. Instead of guessing, use get_daily_metrics to get aggregated reports on total USD spending and average latency for today's runs.
Debugging complex chains is faster. You can retrieve a full session graph using get_trace, letting you see every single payload that passed through the system.
Never lose track of a conversation. By calling list_sessions, your agent groups together all related user interactions, making it easier to improve long-term workflows.
Manage prompt drift easily. Use list_prompts to inspect active templates and see exactly what system instructions are currently running in production.
Validate model output quality using structured feedback. You can assign scores via create_score, attaching human judgment or automated metrics to specific runs.
Deep dive into failures. If a call breaks, you don't have to search logs; just ask your agent and use get_observation to get the context of that failure.
See it in action
Debugging an intermittent API error
An engineer notices a chat feature fails sometimes. They tell their agent, 'Show me the last three failed traces.' The agent uses list_traces and then pulls the specific context with get_observation, revealing that the failure only happens when a certain variable is null.
Auditing prompt compliance
A Product Owner needs to check if developers are using the latest version of the internal 'customer support' guide. They ask their agent, and it uses list_prompts to display the system instructions and expected variables for review.
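Checking for stale prompts comes down to comparing the newest version of each template with the version each run recorded. The sketch assumes hypothetical records: prompt entries with `name` and an integer `version` (a stand-in for list_prompts output), and per-trace usage entries with `prompt_name` and `prompt_version`. It simply treats the highest version number as current and ignores any deployment labels.

```python
def outdated_prompt_usage(prompts: list[dict], usages: list[dict]) -> list[dict]:
    """Return usages that reference an older version than the newest one."""
    # Highest known version per prompt name.
    latest: dict[str, int] = {}
    for p in prompts:
        latest[p["name"]] = max(latest.get(p["name"], 0), p["version"])

    # Usages of unknown prompts default to their own version, so they are not flagged.
    return [
        u for u in usages
        if u["prompt_version"] < latest.get(u["prompt_name"], u["prompt_version"])
    ]
```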
Calculating operational cost
The CTO needs an end-of-month report on AI spending. The agent calls get_daily_metrics for the month and rolls the daily reports up into total USD cost, total tokens consumed, and average latency.
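Building a month-end figure from daily reports involves one subtlety. Costs and token counts can simply be summed, but average latency has to be weighted by how many calls each day handled. The sketch assumes a hypothetical per-day record with `total_cost_usd`, `total_tokens`, `call_count`, and `avg_latency_s`; the real `get_daily_metrics` output may be shaped differently.

```python
from dataclasses import dataclass


@dataclass
class DailyMetrics:
    # Hypothetical shape for one day of get_daily_metrics output.
    date: str
    total_cost_usd: float
    total_tokens: int
    call_count: int
    avg_latency_s: float


def monthly_rollup(days: list[DailyMetrics]) -> dict:
    """Combine daily rows into one summary, weighting latency by call count."""
    calls = sum(d.call_count for d in days)
    weighted_latency = sum(d.avg_latency_s * d.call_count for d in days)
    return {
        "total_cost_usd": round(sum(d.total_cost_usd for d in days), 2),
        "total_tokens": sum(d.total_tokens for d in days),
        "avg_latency_s": weighted_latency / calls if calls else 0.0,
    }


# Illustrative numbers only, not real data.
days = [
    DailyMetrics("2024-05-01", 10.0, 1000, 100, 1.0),
    DailyMetrics("2024-05-02", 2.5, 200, 300, 2.0),
]
print(monthly_rollup(days))
```

Here the weighted average is 1.75 s. A plain mean of the two daily averages would report 1.5 s and understate the latency most calls actually saw.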
Analyzing multi-user behavior
A data scientist wants to know if users who interact with Feature A also tend to use Feature B. The agent uses list_sessions to group correlated user activity, allowing them to pinpoint usage patterns across different features.
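Once traces are grouped by session, the co-usage question reduces to set membership per session. Here is a minimal sketch, assuming each trace carries a session identifier and a feature label under the hypothetical keys `session_id` and `feature`:

```python
from collections import defaultdict


def co_usage_rate(traces: list[dict], feature_a: str, feature_b: str) -> float:
    """Share of sessions that used feature_a and also used feature_b."""
    # Collect the set of features touched in each session.
    features_by_session: dict[str, set[str]] = defaultdict(set)
    for trace in traces:
        features_by_session[trace["session_id"]].add(trace["feature"])

    with_a = [f for f in features_by_session.values() if feature_a in f]
    if not with_a:
        return 0.0
    return sum(feature_b in f for f in with_a) / len(with_a)
```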
The honest tradeoffs
Searching for single log lines
A dev manually copies a trace ID and then navigates through three separate logging dashboards (latency, tokens, payload) to piece together what happened.
Tell your agent to run get_trace with the specific ID. It pulls all the nested graph data—latencies, token counts, payloads—in one go.
Assuming cost is constant
A PM assumes a new feature will only cost $50/day, based on initial estimates, without tracking actual usage.
Run get_daily_metrics to see actual usage instead of the estimate. It reports the total USD cost, tokens consumed, and average latency for each day.
Ignoring prompt changes
A team updates a core prompt template but forgets to check if older versions are still in use, leading to unpredictable behavior.
Use list_prompts to see every version and the current system instructions. This ensures you know exactly what context the model is operating under.
When It Fits, When It Doesn't
Use this MCP if your primary pain point is observability—you need to understand why an LLM call succeeded or failed, who used which prompt version, and how much it cost. You're dealing with production AI services that require deep auditing.
Don't use this if you simply need a basic chat interface; the complexity of tracing adds overhead. Also skip it if your only concern is plain data storage: this MCP tracks performance and usage, not general-purpose records. If all you need is to read user profiles, stick with a standard database tool instead.
Questions you might have
How do I check the total spending with Langfuse MCP?
Run get_daily_metrics. This tool provides an aggregated report on your total USD costs and average latency across all runs for the day.
What does get_trace do in Langfuse MCP?
It retrieves the complete, detailed telemetry graph for a single LLM session. This shows every internal step (spans, generations, and events) that occurred during the API call.
How do I see which prompts my agent uses with Langfuse MCP?
Use list_prompts. This tool extracts and displays all actively managed prompt templates, letting you inspect their system instructions and expected input variables.
How do I track multiple conversations in Langfuse MCP?
Call list_sessions to get high-level user session entities. This groups together related multi-turn interactions, helping you understand the full context.
How can I use list_observations to find a specific performance bottleneck within an LLM trace?
You get raw data points by listing observations, which lets you examine individual spans or generations. This pinpoints exactly where latency spikes or errors occurred in the chain, helping you diagnose bottlenecks without reviewing the entire session graph.
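Once observations are listed, finding the bottleneck means computing each one's duration and taking the maximum. The sketch below assumes hypothetical `id`, `start_time`, and `end_time` fields holding ISO 8601 timestamps. Adapt the keys to whatever `list_observations` actually returns.

```python
from datetime import datetime
from typing import Optional


def slowest_observation(observations: list[dict]) -> Optional[dict]:
    """Return the observation with the longest duration, or None if none finished."""

    def duration(obs: dict) -> float:
        start = datetime.fromisoformat(obs["start_time"])
        end = datetime.fromisoformat(obs["end_time"])
        return (end - start).total_seconds()

    # Observations still running (no end time) cannot be measured yet.
    finished = [o for o in observations if o.get("end_time")]
    return max(finished, key=duration, default=None)
```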
Should I use create_score when evaluating model grounding and accuracy?
Yes, using create_score lets you attach structured feedback or evaluation metrics to a specific trace or observation. This is critical for monitoring model performance against defined human standards or automated quality checks.
What's the difference between get_trace and get_observation when troubleshooting?
get_trace retrieves the complete, nested graph of an entire LLM API session. If you only need to check a single event or span within that trace, use get_observation for faster, more targeted context retrieval.
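The difference is easiest to see in terms of the nested graph. As we understand Langfuse's data model, a trace is a tree of observations, each pointing at its parent, while `get_observation` returns a single node. Here is a minimal sketch of rebuilding and walking that tree from a flat list, assuming hypothetical `id`, `parent_id`, and `name` fields:

```python
from collections import defaultdict
from typing import Optional


def build_tree(observations: list[dict]) -> dict[Optional[str], list[dict]]:
    """Index observations by parent ID; root observations sit under None."""
    children: dict[Optional[str], list[dict]] = defaultdict(list)
    for obs in observations:
        children[obs.get("parent_id")].append(obs)
    return children


def render(children, parent_id=None, depth=0) -> list[str]:
    """Depth-first walk returning observation names indented by nesting level."""
    lines = []
    for obs in children.get(parent_id, []):
        lines.append("  " * depth + obs["name"])
        lines.extend(render(children, obs["id"], depth + 1))
    return lines
```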
How do I analyze which parts of my application are consuming the most tokens using list_traces?
You can list traces to review metadata attached to each API session. This raw data allows you to quickly sort and identify sessions with unusually high token counts or excessive latencies across your various pipelines.
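Sorting the listed traces is ordinary client-side work once the data comes back. This sketch surfaces the heaviest sessions, assuming each trace exposes a hypothetical `total_tokens` field (traces without it count as zero). `heapq.nlargest` avoids a full sort when only the top few are needed.

```python
import heapq


def top_traces_by_tokens(traces: list[dict], n: int = 3) -> list[dict]:
    """Return the n traces with the highest total token count."""
    return heapq.nlargest(n, traces, key=lambda t: t.get("total_tokens", 0))
```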
Can I see the exact system instruction for a specific prompt version?
Yes. Use the list_prompts tool to browse your managed templates. Your agent can retrieve the exact text and variables for any deployed prompt version, making it easy to audit AI logic through natural conversation.
How do I log human feedback for a specific trace?
Use the create_score tool by providing the Trace ID and a JSON payload defining the score name (e.g. 'user-satisfaction') and value. Your agent will attach this structured data directly to the Langfuse record.
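The payload described above can be sketched as a small builder. The field names (`traceId`, `name`, `value`, plus optional `observationId` and `comment`) are modeled on Langfuse's score API as we understand it. They are an assumption for the `create_score` tool, whose own argument names may differ.

```python
import json
from typing import Optional, Union


def score_payload(
    trace_id: str,
    name: str,
    value: Union[float, str],
    observation_id: Optional[str] = None,
    comment: Optional[str] = None,
) -> str:
    """Build a JSON score payload for a trace, optionally narrowed to one observation."""
    payload = {"traceId": trace_id, "name": name, "value": value}
    # Optional fields are omitted entirely when unset.
    if observation_id is not None:
        payload["observationId"] = observation_id
    if comment is not None:
        payload["comment"] = comment
    return json.dumps(payload)
```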
Can my agent report on my LLM spending for the current day?
Absolutely. The get_daily_metrics tool retrieves aggregated USD costs and average latency metrics from Langfuse. Your agent can summarize these statistics to help you monitor your infrastructure budget day by day.
Powerful workflows you can unlock today
MCP Recipe for AI Inference Monitoring
Your GPT-4 API takes 4 seconds per response, while Groq returns the same quality answer in 180 milliseconds. Langfuse traces every call, and Sheets shows the latency-cost comparison that makes your product feel instant.
Monitor AI Agent Performance Using MCP Servers
Your agents run in production, but you cannot explain why one failed at 3 a.m. Fix that.
Route AI Requests to the Fastest Model via MCP
You run everything on GPT-4o because choosing a model per task is hard. Your agent benchmarks Groq and Mistral against your actual workloads.
Track LLM Cost vs Quality Using MCP Servers
Your OpenAI bill grew from $200 to $2,400 in 2 months and you have no idea which feature caused it, because you track API spend at the account level, not at the prompt level.
We've already built the connector for Langfuse. Just plug in your AI agents and start using Vinkius.
No hosting. No infrastructure. No complex setup.
All 10 tools are live and waiting.
You're up and running in seconds.
Vinkius gives your AI agents access to the full catalog of app connectors, all fully managed, secure, and enterprise-ready. One subscription, every tool you need.
Built, hosted, and secured by Vinkius. You just connect and go.