# CTO Architect Prover MCP

> CTO Architect Prover forces any proposed system design to pass five mandatory architectural stress tests: Stack Fitness, Failure Tolerance, Security Posture, Migration Safety, and Observability. It rejects designs that are over-engineered or that omit basic redundancy, so the stack you ship is production-ready.

## Overview
- **Category:** architecture
- **Price:** Free
- **Tags:** cto, architecture, security, scalability, migration, observability, production

## Description

When you plan a major feature rollout or a new service, it's easy to get excited about the technology (Kubernetes, Kafka, microservices) and write down an architecture that sounds good on paper. That rarely means it will survive production. This MCP forces your AI agent to think like a Principal Engineer who gets paged at 3 AM when things break. Rather than simply approving a design, it challenges it on five axes: matching tech complexity to team size, guaranteeing failure recovery paths, enforcing specific security controls, planning zero-downtime migrations, and proving monitoring capabilities. If the proposed system can't back these points with hard numbers and named mechanisms, the MCP flags it immediately. The validation layer is available through the Vinkius catalog and connects to any MCP-compatible client, so architecture decisions no longer rest on gut feeling.

## Tools

### validate_cto_architect
Forces the agent to validate an architecture by justifying stack fitness against constraints, proving failure tolerance with specific redundancy mechanisms, hardening security with named controls and thresholds, defining zero-downtime migration strategies, and proving observability with health probes and RTO/RPO targets.

## Prompt Examples

**Prompt:** 
```
We will build our SaaS with Kubernetes, Kafka for event streaming, and 8 microservices. We have a team of 3 developers and launch in 6 weeks. Security: we use HTTPS.
```

**Response:** 
```
OVER_ENGINEERED + SECURITY_VULNERABLE — Two fatal gaps. First: Kubernetes, Kafka, and 8 microservices for a 3-person team launching in 6 weeks is Resume-Driven Development. A Node.js monolith with PostgreSQL ships in 6 weeks. Kubernetes adds 3 months of operational overhead. Second: 'use HTTPS' is not a security strategy — name rate limiting thresholds, auth token type, parameterized queries, and encryption algorithm.
```

**Prompt:** 
```
Node.js monolith, 4-person team, $800/mo infra. PostgreSQL streaming replica (async, <1s lag). Redis Sentinel for cache. ALB with 10s health checks. Circuit breaker on Stripe (5s timeout, 3 retries). Rate limiting: 100/min per IP. JWT RS256, 15min access, 7-day refresh. Parameterized via Prisma. CORS whitelist. AES-256 on PII. Expand/contract migrations: dual-write, backfill, switch, drop. API v2 alongside v1. Rollback tested in staging. Health: /health/live, /health/ready. Prometheus + Grafana, p99 < 200ms alert. RTO < 4h, RPO < 1h, daily S3 backups, tested quarterly.
```

**Response:** 
```
ARCHITECTURE_PROVEN — CTO-level architecture validated. Stack: Node.js monolith right-sized for 4-person team at $800/mo. Failure tolerance: PostgreSQL replica, Redis Sentinel, ALB, circuit breaker on Stripe. Security: rate limiting 100/min, JWT RS256 with rotation, parameterized Prisma, CORS whitelist, AES-256. Migration: expand/contract with tested rollback. Observability: health probes, Prometheus with p99 alerting, RTO < 4h, RPO < 1h. Production-ready.
```

**Prompt:** 
```
We need to migrate our user table to add a new column for subscription tier. We will take the system offline for 2 hours on Saturday night, run ALTER TABLE, and deploy the new code. We will check the logs manually after to make sure everything works.
```

**Response:** 
```
MIGRATION_BLOCKED + OBSERVABILITY_VOID — Two fatal gaps. First: 'take the system offline for 2 hours' is a maintenance window — blocked. Use expand/contract: add the column with a default value (no lock), dual-write from the application, backfill existing rows in batches, switch reads, drop the old logic. Each step handles live traffic. Second: 'check the logs manually' is not monitoring — define automated health probes and alerting thresholds.
```

## Capabilities

### Validate Architectural Stack Fitness
It forces you to justify technology choices against real constraints like team size and budget, preventing over-engineering.

### Prove Failure Tolerance
The tool demands specific redundancy mechanisms and failover paths for every component (database, cache, queue) so the system won't crash when parts break.
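
As a rough illustration of what "a redundancy mechanism for every component" means in practice, the sketch below flags components with no named mechanism. The function name `find_spofs` and the dictionary shape are assumptions made for this example; the MCP's actual input format is not documented here.

```python
from typing import Dict, List, Optional


def find_spofs(redundancy: Dict[str, Optional[str]]) -> List[str]:
    """Return the components that have no named redundancy mechanism."""
    return sorted(
        name
        for name, mechanism in redundancy.items()
        # A missing or blank entry counts as "no mechanism named".
        if mechanism is None or not mechanism.strip()
    )


# Hypothetical design: database and cache are covered, the queue is not.
design = {
    "database": "PostgreSQL streaming replica",
    "cache": "Redis Sentinel",
    "queue": None,
}
unprotected = find_spofs(design)  # components still lacking a failover path
```

A design passes this kind of check only when the returned list is empty.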

### Harden Security Posture
It requires naming specific security controls with measurable thresholds—like rate limits or encryption algorithms—instead of vague statements like 'use HTTPS'.

### Guarantee Migration Safety
It validates that schema changes are zero-downtime, demanding techniques such as expand/contract and dual writes.

### Prove Observability Targets
You must define health probes, specific metric dashboards, and measurable Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO).

## Use Cases

### Designing a new core microservice
The team proposes building three interconnected services with a Kafka queue. Your agent runs the MCP, which immediately flags that using Kafka adds too much operational overhead for your small team and forces you to consider a simpler message bus solution instead.

### Updating the primary user database schema
You need to add an indexed column. Instead of planning a simple 'ALTER TABLE' during a maintenance window, the MCP demands a zero-downtime migration plan using dual-writing and phased rollouts, so the change never holds a long lock on a live table.

### Reviewing a vendor integration point
A new third-party API needs to be called. The agent uses the MCP's failure tolerance check, forcing you to implement circuit breakers and define exact timeout limits so that an external service outage doesn't take down your entire system.
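
A minimal circuit breaker can make this requirement concrete. The sketch below is one possible shape, not the MCP's own implementation: the class name, the three-failure limit, and the 30-second cool-down are illustrative assumptions. The wrapped function is expected to enforce its own timeout (for example, a 5-second limit on the HTTP client).

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_after_s = reset_after_s  # cool-down before a trial call
        self._clock = clock                 # injectable so tests can control time
        self._failures = 0
        self._opened_at = None              # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("vendor circuit is open, failing fast")
            # Cool-down elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures in a row, opens the circuit.
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get `CircuitOpenError` immediately instead of waiting on a vendor that is already down, which is what keeps an external outage from exhausting your own workers.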

### Preparing for a major compliance audit
You must prove data integrity. The MCP requires defining specific encryption algorithms, key rotation schedules, and access control lists, turning abstract security goals into concrete engineering requirements.

## Benefits

- Eliminates 'Resume-Driven Development.' You'll stop designing complex, overblown stacks just because the tech sounds cool. This MCP forces you to match technology complexity directly to your actual team size and budget.
- Finds Single Points of Failure (SPOF). The tool demands specific redundancy mechanisms for every component—cache, database, load balancer—so that when one thing crashes, the whole system doesn't go down.
- Enforces Real Security. Forget vague statements like 'use HTTPS.' This MCP forces you to name precise security thresholds, like rate limiting quotas or token rotation schedules.
- Guarantees Zero-Downtime Migrations. It validates migration plans using industry best practices like the expand/contract pattern and dual writes, blocking any plan that requires scheduled downtime.
- Proves Operational Readiness. You'll define concrete health probes, metric alerting thresholds (p99 latency), and measurable RTO/RPO targets—metrics, not hopes.

## How It Works

In short, the MCP turns 'looks fine' into proven operational capability, in three steps:

1. Give your AI agent a full system design or architectural proposal.
2. The MCP analyzes the plan against five core engineering axes, forcing reflection on constraints like team size and failure mechanisms.
3. You receive an immediate verdict that shows exactly which axis failed (e.g., SPOF_DETECTED), pinpointing the exact gap in your architecture.
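
The verdict format in the prompt examples (failure codes joined with ` + `, or `ARCHITECTURE_PROVEN` when nothing fails) can be sketched as follows. The codes are the ones quoted in this document. The mapping of each code to an axis, and the names `AxisResult` and `combine_verdict`, are assumptions made for illustration, not the MCP's actual internals.

```python
from dataclasses import dataclass
from typing import List

# Codes taken from the verdicts quoted in this document. Which axis emits
# which code is an assumption made for this sketch.
FAILURE_CODES = {
    "stack_fitness": "OVER_ENGINEERED",
    "failure_tolerance": "SPOF_DETECTED",
    "security_posture": "SECURITY_VULNERABLE",
    "migration_safety": "MIGRATION_BLOCKED",
    "observability": "OBSERVABILITY_VOID",
}


@dataclass
class AxisResult:
    axis: str         # one of the keys in FAILURE_CODES
    passed: bool
    reason: str = ""  # free-text explanation of the gap, if any


def combine_verdict(results: List[AxisResult]) -> str:
    """Join the codes of all failed axes, or report success if none failed."""
    missing = set(FAILURE_CODES) - {r.axis for r in results}
    if missing:
        # Refuse to issue a verdict unless all five axes were evaluated.
        raise ValueError(f"axes not evaluated: {sorted(missing)}")
    failed = [FAILURE_CODES[r.axis] for r in results if not r.passed]
    return " + ".join(failed) if failed else "ARCHITECTURE_PROVEN"
```

Requiring all five axes before issuing a verdict reflects the "mandatory" framing: a design cannot pass by leaving an axis unaddressed.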

## Frequently Asked Questions

**What exactly does the validate_cto_architect tool check for?**
It checks five core areas: Stack Fitness (is the tech right?), Failure Tolerance (what happens when things break?), Security Posture (are the controls specific enough?), Migration Safety (can we change data without downtime?), and Observability (do we have metrics and alerts?).

**Can I use validate_cto_architect for simple API integrations?**
Yes, absolutely. Use it to ensure that the integration point has defined circuit breakers, rate limiting thresholds, and a clear failover path if the external service goes down.

**Is validate_cto_architect only for microservices?**
No. While it handles complex services well, you can use it on any system design—even monolithic applications—to ensure you've planned for zero-downtime changes and proper redundancy.

**Does validate_cto_architect require me to know specific metrics?**
It requires you to define them. You must name the metric (like p99 latency) and set a measurable alert threshold for your agent to prove observability.
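
To show what "a metric with a measurable threshold" looks like, here is a small sketch that computes p99 latency from raw samples using the nearest-rank method and compares it with a 200 ms threshold, the figure used in the prompt example. The function names are hypothetical, and a Prometheus setup would normally derive the percentile from a histogram query rather than raw samples.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p99_alert(latencies_ms, threshold_ms=200.0):
    """Return (alert_fired, p99). The alert fires when p99 reaches the threshold."""
    p99 = percentile(latencies_ms, 99)
    return p99 >= threshold_ms, p99
```

A single slow request among a hundred does not move p99 under this method, but two do, which is why the threshold is stated on a percentile and not on the average.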

**How does validate_cto_architect enforce specific security controls?**
It demands deep technical specificity. Instead of saying 'use HTTPS,' you must name the control, like rate limits (X req/min per IP), and specify cryptographic details such as JWT RS256 or AES-256 on PII.

**Does validate_cto_architect help with disaster recovery planning?**
Yes, it forces you to define concrete RTO (Recovery Time Objective) and RPO (Recovery Point Objective) targets. You must detail specific failover paths for every core component in your stack.

**When is the best time to run validate_cto_architect during development?**
You should run it before committing to any architecture decision or entering a design review. It catches expensive flaws early, before they turn into rework.

**How does validate_cto_architect ensure my tech stack fits my resources?**
The tool validates Stack Fitness by matching technology complexity to your actual constraints like team size and budget. It rejects solutions that are technically impressive but operationally impossible for a small team.

**What is 'Resume-Driven Development'?**
It is when engineers choose technologies because they look impressive on a resume, not because they solve the problem. Kubernetes for a 3-person seed team with 50 users is Resume-Driven Development. A monolith with PostgreSQL would ship in 8 weeks. Kubernetes adds 3 months of operational overhead for no user benefit.

**Why does it reject 'use HTTPS' as a security strategy?**
Because HTTPS is the bare minimum, not a strategy. A hardened security posture requires rate limiting with specific thresholds (100 req/min per IP), parameterized queries to prevent SQL injection, JWT with RS256 and rotation policy, CORS whitelisting, and data-at-rest encryption with a named algorithm (AES-256). Saying 'use HTTPS' is like saying 'lock the door' — it does not address the windows.
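
A threshold such as 100 req/min per IP is specific enough to implement directly. The sketch below is a sliding-window limiter held in process memory. The class name is hypothetical, and a multi-instance deployment would need a shared store, which is outside the scope of this example.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    def __init__(self, limit=100, window_s=60.0, clock=time.monotonic):
        self.limit = limit        # maximum accepted requests per window
        self.window_s = window_s  # window length in seconds
        self._clock = clock       # injectable so tests can control time
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str) -> bool:
        now = self._clock()
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: the caller should respond with HTTP 429
        hits.append(now)
        return True
```

Each IP address is tracked separately, so one noisy client hitting the limit does not affect requests from any other address.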

**Why is 'maintenance window' blocked?**
Because zero-downtime is the production standard. A maintenance window is an admission that your migration strategy cannot handle live traffic. Use the expand/contract pattern: add the new column, dual-write, backfill, switch reads, then drop the old column. Every step runs against live traffic, and every step before the final drop can be rolled back. If your migration requires downtime, your architecture is not production-ready.
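
The expand/contract steps can be sketched with Python's built-in `sqlite3` module. This is an illustration under stated assumptions, not a production migration: the `users` table, the legacy `plan` column, and the new `subscription_tier` column are hypothetical names, `plan` is assumed to be NOT NULL, and locking behavior on a production database differs from SQLite.

```python
import sqlite3


def expand(conn):
    # Step 1 (expand): add the new column as nullable so existing writers keep working.
    conn.execute("ALTER TABLE users ADD COLUMN subscription_tier TEXT")


def create_user(conn, email, tier):
    # Step 2 (dual-write): new code writes both the legacy and the new column.
    conn.execute(
        "INSERT INTO users (email, plan, subscription_tier) VALUES (?, ?, ?)",
        (email, tier, tier),
    )


def backfill(conn, batch_size=1000):
    # Step 3 (backfill): copy old rows in small batches so each transaction stays short.
    # Assumes plan is NOT NULL; otherwise rows would never leave the NULL set.
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE users SET subscription_tier = plan WHERE id IN ("
            "SELECT id FROM users WHERE subscription_tier IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


def safe_to_switch_reads(conn):
    # Step 4 gate: switch reads only once no row is missing the new column.
    (missing,) = conn.execute(
        "SELECT COUNT(*) FROM users WHERE subscription_tier IS NULL"
    ).fetchone()
    return missing == 0

# Step 5 (contract): drop the legacy column in a later release, after reads
# have run against the new column long enough to trust it.
```

Running `backfill` a second time is harmless because it only touches rows where the new column is still NULL, so an interrupted run can simply be restarted.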