# CERN Open Data MCP

> CERN Open Data connects your agent directly to over 66,000 particle physics datasets and research documents from the Large Hadron Collider and legacy experiments. You can query by experiment, collision energy, or a theoretical concept such as Dark Matter, and it retrieves full metadata, file listings, and technical glossary entries.

## Overview
- **Category:** the-unthinkable
- **Price:** Free
- **Tags:** particle-physics, open-data, research-datasets, large-hadron-collider, scientific-data

## Description

Need access to high-energy physics data? This MCP gives your agent direct read access to the CERN Open Data Portal, a large repository of scientific research data. Forget navigating complex web forms just to check an event count or find a specific analysis framework. You query for 'Higgs boson' or 'ATLAS experiment' and get metadata right back. It’s designed for those who need raw data details (full abstracts, author ORCID identifiers, file URIs) without the clicks. Vinkius hosts this connection, making it available to any MCP-compatible client. Your agent becomes a particle physics research assistant with immediate access to datasets and documentation spanning decades of collision history.

## Tools

### check_cern_opendata_status
Verifies that the connection to the CERN Open Data Portal is active and operational.

### get_glossary
Searches the official particle physics glossary for definitions of technical terms, components, or phenomena.

### get_portal_statistics
Retrieves high-level statistics on the entire data portal's scope, including record counts and available file formats.

### get_record_by_doi
Finds the corresponding open data record when you provide a digital object identifier (DOI).

### get_record
Fetches comprehensive metadata for a specific dataset ID, detailing authors, experiments, and collision parameters.

### list_categories
Lists all available physics research categories and their associated dataset counts.

### list_experiments
Provides an inventory of the experiments with published data, such as CMS or ATLAS as well as legacy experiments like DELPHI, along with the number of datasets each has published.

### list_record_files
Lists every file associated with a specific dataset record, providing the size, checksum, and direct URI for retrieval.

### search_by_category
Searches the entire repository using physics research categories to narrow down the data pool.

### search_by_collision_energy
Filters datasets based on the specific collision energy used during the experiment run (e.g., 13 TeV).

### search_by_collision_type
Narrows results by the particle interaction type, such as proton-proton (pp) or electron-positron (e+e-).

### search_by_experiment
Focuses searches exclusively on data generated by one specific collaboration, like ALICE.

### search_datasets
Performs a broad search across all available fields using keywords plus multiple filters for maximum precision.
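
As a rough sketch of what a keyword-plus-filters call might look like on the wire: MCP clients invoke tools through a JSON-RPC 2.0 `tools/call` request. The helper name `build_tool_call` and the argument names `query`, `experiment`, and `collision_energy` are assumptions for illustration; check the tool's published input schema for the real parameter names.

```python
import itertools
import json

# Simple incrementing request IDs, as JSON-RPC requires one per request.
_request_ids = itertools.count(1)


def build_tool_call(name, arguments):
    """Return a JSON-RPC 2.0 request dict for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Hypothetical argument names; the real schema may differ.
request = build_tool_call(
    "search_datasets",
    {"query": "dark matter", "experiment": "CMS", "collision_energy": "13TeV"},
)
print(json.dumps(request, indent=2))
```

In practice your MCP client builds this request for you; the sketch only shows how a keyword and several filters travel together in one call.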

### search_documentation
Locates user guides, policies, and technical documentation related to the data or experiments.

### search_software
Finds analysis frameworks, reconstruction tools, and specialized code used in physics research.

### search_supplementaries
Retrieves technical context documents essential for reproducing published scientific analyses.

## Prompt Examples

**Prompt:** 
```
Show me the available experiments and how many datasets each one has on CERN Open Data.
```

**Response:** 
```
I've retrieved the experiment list from CERN Open Data. There are 9 experiments with published datasets: CMS leads with 51,955 datasets, followed by DELPHI (12,727), OPERA (904), ATLAS (159), ALICE (153), LHCb (108), TOTEM (1), JADE (1), and PHENIX (1). The total repository contains 66,009 datasets. Would you like me to search within a specific experiment?
```
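
The totals quoted in the example response above can be checked with a few lines of arithmetic. This sketch only re-adds the per-experiment counts from that response; it does not query the portal, and the live numbers may have changed since.

```python
# Dataset counts per experiment, copied from the example response.
counts = {
    "CMS": 51955,
    "DELPHI": 12727,
    "OPERA": 904,
    "ATLAS": 159,
    "ALICE": 153,
    "LHCb": 108,
    "TOTEM": 1,
    "JADE": 1,
    "PHENIX": 1,
}

total = sum(counts.values())
cms_share = counts["CMS"] / total

print(len(counts))          # 9
print(total)                # 66009
print(f"{cms_share:.1%}")   # 78.7%
```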

**Prompt:** 
```
Search for Dark Matter datasets from the CMS experiment at 13 TeV.
```

**Response:** 
```
I've found Dark Matter datasets from CMS at 13 TeV. The search returned simulated datasets in NanoAODSIM format containing dark matter signal samples. Each record includes event counts ranging from 50,000 to 500,000 events, with files in ROOT format. These datasets are commonly used for mono-jet, mono-photon, and missing transverse energy analyses. Would you like the full file listing for any specific record?
```

**Prompt:** 
```
What does 'luminosity' mean in particle physics? Check the CERN glossary.
```

**Response:** 
```
According to the CERN physics glossary, luminosity is a measure of the number of collisions that can be produced in a particle accelerator. It is one of the most important parameters of an accelerator, directly related to the number of events observed. Integrated luminosity (measured in inverse femtobarns, fb⁻¹) quantifies the total data collected during a run period. Higher luminosity means more collision events and greater chances of observing rare processes like Higgs boson production.
```

## Capabilities

### Search by scientific parameters
Locate specific records using filters like collision energy (e.g., 13 TeV) or particle collision type (e+e-).

### Retrieve detailed record metadata
Fetch complete details for any dataset, including authors' ORCID IDs and the DOI.

### List all available experiments
Get a full count and list of major CERN collaborations like CMS, ATLAS, and ALICE.

### Browse physics categories
Filter the data pool by broad research topics, such as Exotica or B physics.

### Look up technical jargon definitions
Access a specialized glossary to define terms like pseudorapidity or luminosity for reports or presentations.

## Use Cases

### Tracking historical data gaps
The user needs to compare LEP-era results with modern LHC runs. They first use `list_experiments` to confirm that DELPHI and CMS are present, then combine `search_by_collision_type` (e+e- for DELPHI; pp for CMS) with `get_portal_statistics` to gauge the historical scope of available data.

### Recreating a complex analysis
A researcher finds an abstract but needs the underlying files. They use `get_record_by_doi` first, then run `list_record_files` to get file URIs and checksums, finally checking `search_supplementaries` for the specific analysis configuration needed.
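
The three-step chain in this use case can be sketched as a small function. Everything about the interface here is an assumption: `call_tool` is a hypothetical callable supplied by your MCP client, the argument names (`doi`, `record_id`, `query`) and the response fields (`id`, `title`) are illustrative, and the stub returns placeholder data rather than real portal records.

```python
def reproduce_analysis(call_tool, doi):
    """Resolve a DOI, list the record's files, then look for supplementaries.

    `call_tool(name, arguments)` is a hypothetical client function that is
    assumed to return already-parsed Python data.
    """
    record = call_tool("get_record_by_doi", {"doi": doi})
    files = call_tool("list_record_files", {"record_id": record["id"]})
    extras = call_tool("search_supplementaries", {"query": record["title"]})
    return {"record": record, "files": files, "supplementaries": extras}


calls = []


def fake_call_tool(name, arguments):
    """Stand-in for a real client: logs each call and returns placeholders."""
    calls.append((name, arguments))
    canned = {
        "get_record_by_doi": {"id": 0, "title": "placeholder title"},
        "list_record_files": [{"filename": "placeholder.root", "size": 0}],
        "search_supplementaries": [],
    }
    return canned[name]


result = reproduce_analysis(fake_call_tool, "10.0000/placeholder")
print([name for name, _ in calls])
# ['get_record_by_doi', 'list_record_files', 'search_supplementaries']
```

Passing the client in as a parameter keeps the chain testable without network access; swap `fake_call_tool` for whatever call function your MCP client exposes.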

### Understanding a niche term
The user encounters 'pseudorapidity' in an article. They immediately use the `get_glossary` tool to get a precise definition, ensuring their report is technically accurate before proceeding with dataset queries.

### Finding analysis code for a specific topic
A student wants to build a model for Dark Matter. They use `search_by_category` and filter by 'Exotica,' then run `search_software` to find the appropriate reconstruction frameworks before they even touch the raw data.

## Benefits

- Precision filtering saves time. Instead of browsing general results, you can narrow the search immediately by collision energy using `search_by_collision_energy` or particle type with `search_by_collision_type`.
- Reproducibility is built in. Need to understand how a result was achieved? Use `get_record` for full metadata or run `list_record_files` to see the exact files available for analysis.
- No jargon left unexplained. The dedicated `get_glossary` tool lets you define obscure physics terms instantly, which is critical when writing technical reports.
- The scope is visible upfront. Before diving deep, use `list_experiments` to understand the volume and variety of data contributed by major collaborations like CMS (about 52,000 datasets).
- Full traceability means confidence. If you have a publication DOI, run `get_record_by_doi`. It resolves that reference directly into an open dataset record, skipping manual searches.
- It goes beyond numbers. Use `search_supplementaries` to find the technical configuration details and guides necessary to actually replicate published research.

## How It Works

Your agent treats the entire CERN Open Data Portal as an immediately queryable knowledge base.

1. Subscribe to this MCP. Since the CERN portal is public, no API key is required; if your client shows an API key field, any placeholder value works.
2. Instruct your AI client to perform a query; specify if you need datasets filtered by collision type or records resolved via a DOI.
3. The agent returns structured data containing metadata, file links, and abstract summaries for review.
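
For step 3, here is a minimal sketch of unpacking a tool result on the client side. MCP tool results carry a `content` list and an optional `isError` flag; that the text part holds JSON, and the `status` field inside it, are assumptions made for this example. `extract_text` is a hypothetical helper name.

```python
import json


def extract_text(response):
    """Join the text parts of an MCP `tools/call` response, raising on errors."""
    if "error" in response:
        raise RuntimeError(f"JSON-RPC error: {response['error']}")
    result = response["result"]
    text = "\n".join(
        part["text"]
        for part in result.get("content", [])
        if part.get("type") == "text"
    )
    if result.get("isError"):
        raise RuntimeError(f"Tool reported an error: {text}")
    return text


# Hand-written sample response; the JSON body is a placeholder.
sample = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [{"type": "text", "text": json.dumps({"status": "ok"})}],
        "isError": False,
    },
}

payload = json.loads(extract_text(sample))
print(payload["status"])  # ok
```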

## Frequently Asked Questions

**How do I search for a specific experiment like ALICE using search_datasets?**
You combine `search_datasets` with the 'experiment' filter. This lets you scope your full-text query specifically to data from that collaboration, giving you highly targeted results.

**I found a publication DOI; how do I get the data record using get_record_by_doi?**
You pass the DOI directly to `get_record_by_doi`. This tool resolves the reference ID and returns the dataset's title, type, and direct link if one exists.
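
DOIs are often copied as full URLs or with a `doi:` prefix. Whether `get_record_by_doi` accepts those forms is not documented here, so a cautious client might strip them first. This is a sketch under that assumption; `normalize_doi` is a hypothetical helper and the DOI shown is a placeholder, not a real record.

```python
import re

# Loose shape check: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Strip common URL or `doi:` prefixes and check the basic DOI shape."""
    doi = raw.strip()
    for prefix in PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"Not a DOI: {raw!r}")
    return doi


print(normalize_doi("https://doi.org/10.0000/PLACEHOLDER.0000"))
# 10.0000/PLACEHOLDER.0000
```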

**What is the best way to find all available physics research topics?**
Run `list_categories` first. It provides a master list of every major topic, like Exotica or B physics, along with an immediate count of datasets for each.

**Can I check if the data portal connection is working before querying?**
Yes, run `check_cern_opendata_status`. This simple tool verifies the API connectivity and overall status of the entire CERN Open Data system.

**I want to know what specific files are inside a record; how do I use list_record_files?**
Pass the record ID to `list_record_files`. It returns the filename, size in bytes, checksum, and direct data URI for every file linked to that dataset. This tool is essential because it lets you verify exactly what you'll download before pulling large datasets into your analysis.
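
The size and checksum fields make it possible to estimate a download and verify it afterwards. The sketch below assumes checksums arrive as `adler32:<8 hex digits>` strings and that entries use the keys `filename`, `size`, `checksum`, and `uri`; both the format and the key names are assumptions, and the file entry is made up for illustration.

```python
import zlib


def adler32_of_chunks(chunks):
    """Compute an Adler-32 checksum incrementally over an iterable of bytes."""
    value = 1  # Adler-32 starting value
    for chunk in chunks:
        value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def matches_checksum(chunks, checksum):
    """Compare streamed data against an 'adler32:<hex>' checksum string."""
    algorithm, _, expected = checksum.partition(":")
    if algorithm != "adler32":
        raise ValueError(f"Unsupported checksum algorithm: {algorithm!r}")
    return adler32_of_chunks(chunks) == expected.lower()


# Hypothetical file listing; the URI is a placeholder.
files = [
    {
        "filename": "example.root",
        "size": 5,
        "checksum": "adler32:062c0215",
        "uri": "root://example.invalid/example.root",
    },
]

total_bytes = sum(entry["size"] for entry in files)
print(total_bytes)  # 5
print(matches_checksum([b"he", b"llo"], files[0]["checksum"]))  # True
```

Feeding the data in chunks avoids loading a multi-gigabyte file into memory; in real use the chunks would come from reading the downloaded file in blocks.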

**How do I find out the overall scope of all available physics data using get_portal_statistics?**
It provides comprehensive statistics across every facet: record types, years, keywords, and event count distributions. This is the best way to gauge the total volume and composition of the entire CERN dataset repository.

**I need instructions on how to use a specific dataset or understand detector setups; should I use search_documentation?**
Yes, it searches for guides, policies, and documentation. You'll find titles and abstracts that point you toward usage instructions, detector configurations, or data processing workflows needed for reproduction.

**I know the specific collision energy I need; how does search_by_collision_energy help me scope my results?**
It filters datasets based on established collision energies (like 13TeV or 7TeV). This lets you quickly narrow down tens of thousands of records to only those matching your precise experimental conditions.
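
Energies are written both with a space ("13 TeV") and without ("13TeV") in this document. If the tool expects the compact form, which is an assumption based on the examples in this answer, a small normalizer can accept either. `normalize_energy` is a hypothetical helper.

```python
import re

ENERGY_PATTERN = re.compile(r"\s*(\d+(?:\.\d+)?)\s*([GT]eV)\s*", re.IGNORECASE)


def normalize_energy(text):
    """Turn inputs like '13 TeV' or '7tev' into the compact form '13TeV'."""
    match = ENERGY_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"Unrecognized collision energy: {text!r}")
    value, unit = match.groups()
    # Canonical casing: upper-case prefix letter followed by 'eV'.
    return f"{value}{unit[0].upper()}eV"


print(normalize_energy("13 TeV"))    # 13TeV
print(normalize_energy("7tev"))      # 7TeV
print(normalize_energy("900 GeV"))   # 900GeV
```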

**Do I need an API key to use this server?**
No. The CERN Open Data Portal API is completely public and requires no authentication. Simply subscribe to this server and enter any placeholder value in the API key field to start querying particle physics datasets immediately.

**What kind of data can I access from CERN?**
You can access over 66,000 datasets from major LHC experiments (CMS, ATLAS, ALICE, LHCb) and legacy experiments (DELPHI, OPERA). This includes real collision data, Monte Carlo simulations, derived datasets, analysis software, physics glossary entries, and detailed documentation. Data covers Higgs boson searches, Dark Matter studies, exotic particle searches, heavy-ion physics, and more.

**Can I use CERN data for machine learning projects?**
Absolutely. CERN provides labeled datasets specifically designed for ML applications, including particle identification, jet classification, event reconstruction, and anomaly detection. Use the search tools with queries like 'machine learning' or filter by file type 'csv' or 'nanoaodsim' to find ML-ready formats. The CMS experiment alone has published thousands of simulated datasets with known physics labels.