# Kubernetes Architecture Prover MCP

> Kubernetes Architecture Prover is an MCP that validates your entire Kubernetes workload against production-grade standards. It forces strict governance across resource allocation, security hardening, reliability design, observability, and network restriction before deployment. This tool catches architectural flaws—like running as root or having no memory limits—that cause outages in real clusters.

## Overview
- **Category:** infrastructure
- **Price:** Free
- **Tags:** kubernetes, k8s, container-orchestration, resource-limits, pod-security, network-policies, observability, sre, platform-engineering

## Description

AI agents write perfect Kubernetes manifests until they hit production. The problem is that most generated code skips critical governance steps: resource requests, security standards, or proper network policies. This MCP forces your agent to think like a seasoned SRE. It doesn't just deploy containers; it validates the entire architecture. You get assurance that your payment service won't crash at 3 AM because an unrelated logging sidecar without memory limits exhausted the node's memory. The tool makes sure you enforce security hardening, build redundancy into the design, and implement full observability, things people often treat as 'nice-to-have.' When you connect this MCP via Vinkius, your agent gains a rigorous architectural layer that catches dangerous assumptions about 'the cluster handling it.'

## Tools

### validate_kubernetes_architecture
Runs a structured check against an architectural plan to identify gaps in five critical areas: resource governance, security, reliability, observability, and networking.

## Prompt Examples

**Prompt:** 
```
Deploy a Node.js API to Kubernetes. Use the default settings, one replica should be enough. We will add monitoring later.
```

**Response:** 
```
RESOURCES_UNGOVERNED — 'Default settings' means no CPU/memory requests or limits. The first noisy neighbor OOM-kills your API. Define: [API container: CPU 100m/500m, memory 128Mi/512Mi] → [LimitRange: default 100m CPU, 128Mi memory] → [ResourceQuota: 2 CPU, 2Gi memory ceiling]. Also: single replica + no monitoring = blind single point of failure.
```

**Prompt:** 
```
API container: CPU 200m/1000m, memory 256Mi/1Gi. LimitRange: default 100m/128Mi. ResourceQuota: 8 CPU, 16Gi. Security: runAsNonRoot=true, runAsUser=1001, PodSecurityStandard=restricted, drop ALL capabilities, readOnlyRootFilesystem=true. Reliability: 3 replicas, PDB minAvailable=2, podAntiAffinity on topology.kubernetes.io/zone, HPA targeting 70% CPU, min=3 max=10. Observability: liveness GET /healthz initialDelay=15s period=10s, readiness GET /ready failureThreshold=3, startup GET /healthz failureThreshold=30 period=10s, structured JSON logging to Loki, Prometheus /metrics endpoint with p99 latency alerting at 500ms SLO. Networking: deny-all default NetworkPolicy, allow ingress from nginx-ingress namespace on port 3000, allow egress to postgres on port 5432 and to external API on port 443, TLS via cert-manager with Let's Encrypt, Linkerd mTLS between services.
```

**Response:** 
```
ARCHITECTURE_PROVEN — Production-grade Kubernetes architecture validated. Resources: 200m/1Gi with namespace governance. Security: restricted PSS, non-root, capabilities dropped. Reliability: 3 replicas across zones with PDB and HPA. Observability: full probe chain, Loki logging, Prometheus with p99 SLO alerting. Networking: deny-all with explicit allow, cert-manager TLS, Linkerd mTLS. This workload survives node failure, maintenance drain, and lateral movement attempts.
```

**Prompt:** 
```
Resources: CPU 100m/500m per container, LimitRange set, ResourceQuota 4CPU/8Gi. Security: runAsNonRoot=true, restricted PSS. Reliability: 2 replicas, PDB minAvailable=1, anti-affinity on hostname. Observability: liveness and readiness probes configured, logs to stdout. But we keep the default flat network because all our services need to communicate with each other and NetworkPolicies add complexity.
```

**Response:** 
```
NETWORKING_EXPOSED — Resources, security, reliability, and observability pass — but networking is flat. 'All services need to communicate' is false — your API does not need to reach your monitoring stack's database. Define deny-all default policies, then allow only the specific pod-to-pod flows you need. 'NetworkPolicies add complexity' — so does a compromised pod scanning your entire cluster. Also: 'logs to stdout' without structured format or aggregation target is insufficient observability.
```

## Capabilities

### Validate resource governance
It checks if every container defines required CPU/memory requests and limits, preventing noisy neighbors from causing outages.
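
As a rough illustration of what a container that satisfies this check might look like, the sketch below sets both requests and limits. The CPU and memory values are borrowed from the passing prompt example above; the Deployment name, labels, and image are hypothetical placeholders, and the exact fields the tool inspects are not documented here.

```yaml
# Illustrative Deployment; name, labels, and image are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: node-api
  template:
    metadata:
      labels:
        app: node-api
    spec:
      containers:
        - name: api
          image: registry.example.com/node-api:1.0.0  # hypothetical image
          resources:
            requests:          # amount the scheduler reserves on a node
              cpu: 200m
              memory: 256Mi
            limits:            # ceiling enforced at runtime
              cpu: "1"         # equivalent to 1000m
              memory: 1Gi      # exceeding this gets the container OOM-killed
```

Requests drive scheduling decisions, while limits cap what a container can consume once it is running, which is why both are typically set together.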

### Enforce security hardening
The MCP verifies that containers run without root privileges and drop all unnecessary capabilities, minimizing the attack surface.
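
A minimal sketch of a hardened pod spec follows, assuming the settings listed in the passing prompt example (non-root user 1001, all capabilities dropped, read-only root filesystem). The `allowPrivilegeEscalation` and `seccompProfile` fields are not mentioned in this document; they are included because the Kubernetes restricted Pod Security Standard requires them. The pod name and image are hypothetical.

```yaml
# Illustrative Pod; name and image are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: node-api
spec:
  securityContext:
    runAsNonRoot: true           # kubelet refuses to start containers running as UID 0
    runAsUser: 1001
    seccompProfile:
      type: RuntimeDefault       # assumption: added to satisfy the restricted standard
  containers:
    - name: api
      image: registry.example.com/node-api:1.0.0  # hypothetical image
      securityContext:
        allowPrivilegeEscalation: false  # assumption: added to satisfy the restricted standard
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
```

With a read-only root filesystem, any path the application writes to (for example a temp directory) would need its own volume mount.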

### Design for reliability
It ensures services have multiple replicas across zones and utilize disruption budgets to survive node maintenance.
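
One way this could look in practice is sketched below: a Deployment with three replicas that prefers to spread pods across zones, plus a PodDisruptionBudget that keeps two pods available during voluntary disruptions such as a node drain. The replica count, `minAvailable`, and topology key come from the passing prompt example; the names, labels, and image are hypothetical, and the use of a preferred (rather than required) anti-affinity rule is an assumption made so the pods still schedule in clusters with fewer zones than replicas.

```yaml
# Illustrative manifests; names, labels, and image are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: node-api
  template:
    metadata:
      labels:
        app: node-api
    spec:
      affinity:
        podAntiAffinity:
          # "preferred" is a soft rule; "required..." would be stricter.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: node-api
                topologyKey: topology.kubernetes.io/zone
      containers:
        - name: api
          image: registry.example.com/node-api:1.0.0  # hypothetical image
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: node-api
spec:
  minAvailable: 2              # evictions are blocked if they would drop below 2 ready pods
  selector:
    matchLabels:
      app: node-api
```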

### Instrument observability
This feature mandates liveness and readiness probes plus structured logging, so operations teams can see what's happening in the cluster.
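
The probe settings from the passing prompt example could be expressed roughly as follows. The paths, delays, and thresholds are taken from that example; the pod name, image, and named port are hypothetical. Structured logging is an application concern and is not shown here.

```yaml
# Illustrative Pod; name, image, and port name are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: node-api
spec:
  containers:
    - name: api
      image: registry.example.com/node-api:1.0.0  # hypothetical image
      ports:
        - name: http
          containerPort: 3000
      startupProbe:              # gives a slow-starting app up to 30 x 10s to come up
        httpGet:
          path: /healthz
          port: http
        periodSeconds: 10
        failureThreshold: 30
      livenessProbe:             # a failing liveness probe restarts the container
        httpGet:
          path: /healthz
          port: http
        initialDelaySeconds: 15
        periodSeconds: 10
      readinessProbe:            # a failing readiness probe removes the pod from Service endpoints
        httpGet:
          path: /ready
          port: http
        failureThreshold: 3
```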

### Restrict networking access
It enforces a deny-all network policy structure, ensuring that only explicitly allowed pods can talk to each other.
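
A minimal sketch of the deny-all-then-allow pattern is shown below, using the ingress rule from the passing prompt example (traffic from the nginx-ingress namespace on port 3000). The `my-app` namespace and the `app: node-api` label are hypothetical. NetworkPolicies only take effect when the cluster's network plugin enforces them.

```yaml
# Illustrative policies; the namespace and pod labels are hypothetical.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: my-app
spec:
  podSelector: {}                # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-controller-to-api
  namespace: my-app
spec:
  podSelector:
    matchLabels:
      app: node-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # label set automatically on every namespace by Kubernetes
              kubernetes.io/metadata.name: nginx-ingress
      ports:
        - protocol: TCP
          port: 3000
```

Because the default policy also denies egress, each workload would additionally need explicit egress rules (including DNS) for the destinations it is supposed to reach.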

## Use Cases

### Deploying an API with minimal effort
An agent generates a simple Node.js deployment using default settings. The MCP immediately rejects it, pointing out that missing resource limits will cause the first memory spike to crash the service.

### Handling maintenance downtime
A team updates a core payment service and forgets redundancy. The MCP flags the gap and requires multiple replicas with a PodDisruptionBudget, so the service stays available while nodes are drained.

### Preventing data leakage from compromised pods
A developer connects an internal frontend to a database pod over a flat network. The MCP fails validation because it enforces explicit 'deny-all' NetworkPolicies and requires service mesh mTLS connections.

## Benefits

- It forces resource governance by defining CPU/memory requests and limits, preventing the 'noisy neighbor' problem where one pod starves others of resources.
- The MCP enforces security hardening rules like running as non-root and dropping capabilities, mitigating the risk of container escape leading to node compromise.
- You gain reliability design checks, ensuring production workloads have at least two replicas spread across zones with anti-affinity and protected by PodDisruptionBudgets (PDBs).
- Observability is mandated with liveness and readiness probes plus structured logging. You stop guessing about failure modes and start getting actionable signals.
- Networking restrictions are enforced using default deny-all NetworkPolicies, stopping a compromised pod from scanning or exfiltrating data across the entire cluster.

## How It Works

The bottom line is your agent doesn't just write code; it validates the entire operational stability and security model of that code.

1. You provide your desired Kubernetes deployment manifest or architectural plan to the MCP.
2. The tool analyzes the workload against five mandatory production standards: resource governance, security hardening, reliability design, observability, and network restriction.
3. It returns a detailed verdict, pinpointing specific gaps (like missing anti-affinity rules or lack of memory limits) that must be fixed before deployment.

## Frequently Asked Questions

**Does validate_kubernetes_architecture check for network policies?**
Yes. It validates your network configuration by demanding default deny-all NetworkPolicies with explicit allow rules defined for every service interaction.

**How do I use the validate_kubernetes_architecture tool?**
You provide the MCP with your manifest or architectural plan. The tool returns a detailed, actionable list of gaps across resource governance, security, reliability, observability, and networking.

**What if my service is already deployed? Does validate_kubernetes_architecture still help?**
Yes. You use it to audit the *design* of your existing architecture by providing its configuration parameters. It identifies weaknesses that are currently running live in production.

**Can I skip setting resource limits with validate_kubernetes_architecture?**
No. The tool enforces resource governance, meaning it will reject any plan lacking defined CPU/memory requests and limits for every container to prevent node overcommitment.

**If validate_kubernetes_architecture rejects my architecture, what does the output tell me?**
It provides specific, actionable failure reports. Instead of just failing, it names the gap (like RESOURCES_UNGOVERNED) and explains exactly why that lack of governance—such as missing CPU limits—creates a production risk. This tells you precisely where to fix your manifests.

**How does validate_kubernetes_architecture assess scaling and redundancy?**
It checks for mechanisms that keep the service running when things go wrong, such as setting PodDisruptionBudgets (PDBs) and implementing anti-affinity across nodes or zones. It also validates if you've set up Horizontal/Vertical Pod Autoscalers (HPA/VPA). Single replicas fail this test.
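
For reference, an autoscaler matching the passing prompt example (70% CPU target, 3 to 10 replicas) might be declared like this. The HPA name and the target Deployment name are hypothetical.

```yaml
# Illustrative HorizontalPodAutoscaler; names are hypothetical.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: node-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: node-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percentage of the containers' CPU requests
```

CPU utilization is measured relative to the containers' CPU requests, so this kind of autoscaling only works when requests are defined.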

**What is the primary focus of security checks in validate_kubernetes_architecture?**
The tool focuses on enforcing architectural hardening, not just network rules. It requires containers to run as non-root users (runAsNonRoot=true), drop all capabilities, and use readOnlyRootFilesystem. This limits what an attacker can do inside a compromised container and makes escaping to the node harder.

**What kind of Kubernetes manifest structure should I provide to validate_kubernetes_architecture?**
You must provide full deployment definitions that include resource requests, limits, and security context settings. The tool doesn't just check for missing fields; it validates the principles (e.g., if you define a limit, is it appropriate) across all your services.

**Does it generate Kubernetes manifests?**
No. It validates that your architecture addresses the five production-critical pillars — resource governance, security hardening, reliability design, observability instrumentation, and network restriction. It does not generate YAML. It forces you to prove your YAML is production-ready.

**What counts as proper resource governance?**
Every container must have CPU and memory requests AND limits. Every namespace must have a LimitRange (defaults for containers that don't specify) and a ResourceQuota (ceiling for the namespace). 'The cluster handles it' is not governance — it is the absence of governance.
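
The namespace-level half of this could be sketched as below. The default requests (100m CPU, 128Mi memory) and the quota ceiling (8 CPU, 16Gi) come from the prompt examples above; the default limits, the choice to apply the quota to limits rather than requests, and the `my-app` namespace are assumptions made for illustration.

```yaml
# Illustrative namespace governance; namespace and object names are hypothetical.
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: my-app
spec:
  limits:
    - type: Container
      defaultRequest:            # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:                   # applied when a container omits limits (assumed values)
        cpu: 500m
        memory: 512Mi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-ceiling
  namespace: my-app
spec:
  hard:
    limits.cpu: "8"              # sum of CPU limits across all pods in the namespace
    limits.memory: 16Gi
```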

**Is it useful for managed Kubernetes (EKS, GKE, AKS)?**
Yes. Managed Kubernetes handles the control plane — it does NOT handle your workload architecture. Resource limits, security context, PDBs, probes, and NetworkPolicies are YOUR responsibility on every provider. The cloud provider manages etcd. You manage everything that runs on the nodes.