# Incident Handoff: mTLS Certificate Outage (INC-2026-0412)

> **Status:** Resolved
> **Date:** 2026-04-12
> **Incident Commander:** Aria Vasquez
> **Author:** On-call handoff (Priya Nair)

## 📋 Summary

At 13:38 UTC a scheduled certificate rotation pushed a cert bundle containing an expired intermediate CA to the Kong API gateway, breaking mTLS between the gateway and three downstream services. **payments-api went hard down**; inventory-service and notifications-service degraded. Checkout failures ran from the first 503 alert at 13:42 until payments-api recovered at 14:06 UTC (24-minute customer impact window), with customer reports starting at 13:44. The on-call SRE confirmed the root cause at 13:58, 16 minutes after the alert, and initiated a rollback of the gateway config at 14:02. All services recovered by 14:11; no data was lost.

**Blast radius:** ~1,400 failed checkouts, ~320 affected users, ~8,500 queued notifications (all delivered by 14:15).

## 📊 Outage Propagation

```mermaid
sequenceDiagram
    accTitle: mTLS Outage Propagation
    accDescr: Shows how the expired intermediate CA in the Kong gateway cert bundle caused cascading mTLS handshake failures to payments-api, inventory-service, and notifications-service, resulting in user-visible checkout errors.

    participant deploy as Deploy Pipeline
    participant kong as Kong API Gateway
    participant payments as payments-api
    participant inventory as inventory-service
    participant notifs as notifications-service
    participant storefront as Storefront
    participant user as User

    deploy->>kong: Push new cert bundle (expired intermediate CA)
    Note over kong: 13:38 — Config applied

    rect rgb(254, 226, 226)
        Note over kong,notifs: mTLS handshake failures begin
        storefront->>kong: POST /checkout
        kong-x payments: mTLS rejected (expired CA)
        Note over payments: HARD DOWN — all requests 503
        kong-x inventory: mTLS write failure
        Note over inventory: Degraded — reads cached, writes fail
        kong-x notifs: mTLS failure
        Note over notifs: Degraded — queue backing up
    end

    kong-->>storefront: 503 Service Unavailable
    storefront-->>user: "Payment failed" error
    Note over user: 13:44 — Customer reports begin
```

## 🕐 Incident Timeline

```mermaid
timeline
    accTitle: INC-2026-0412 Timeline
    accDescr: Chronological breakdown of the mTLS certificate outage from the start of the cert rotation at 13:32 through incident resolution at 14:30 UTC.

    title INC-2026-0412 — mTLS Certificate Outage
    section 13:32 -- 13:40 Trigger
        Cert rotation starts : Platform Eng begins scheduled rotation via deploy pipeline
        Bad config pushed    : Kong gateway receives cert bundle with expired intermediate CA (13:38)
    section 13:40 -- 13:50 Detection
        PagerDuty fires         : payments-api 503 rate exceeds 5% (13:42)
        Customer reports        : Storefront checkout failing (13:44)
        On-call acknowledges    : Priya Nair begins investigation (13:47)
    section 13:50 -- 14:00 Investigation
        mTLS failures found     : Kong logs show handshake errors (13:52)
        Root cause confirmed    : Expired intermediate CA in new bundle (13:58)
    section 14:00 -- 14:15 Recovery
        Rollback initiated      : Revert Kong to previous cert bundle (14:02)
        payments-api recovers   : Service starts returning 200s (14:06)
        All services healthy    : PagerDuty resolved (14:11)
        Backlog drained         : 8,500 queued notifications delivered (14:15)
    section 14:15 -- 14:30 Closeout
        Incident resolved       : No data loss confirmed (14:30)
```
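
The durations quoted in this handoff can be rechecked directly from the timeline timestamps. The following is a minimal sketch, assuming every event falls on the same UTC day; the dictionary keys and the `minutes_between` helper are illustrative labels, not names from any incident tooling.

```python
from datetime import datetime

# Timestamps (UTC) copied from the timeline above; key names are illustrative.
EVENTS = {
    "config_pushed": "13:38",
    "alert_fired": "13:42",
    "customer_reports": "13:44",
    "acknowledged": "13:47",
    "root_cause_confirmed": "13:58",
    "rollback_initiated": "14:02",
    "payments_recovered": "14:06",
    "all_healthy": "14:11",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day events in EVENTS."""
    fmt = "%H:%M"
    delta = datetime.strptime(EVENTS[end], fmt) - datetime.strptime(EVENTS[start], fmt)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    intervals = [
        ("push to alert", "config_pushed", "alert_fired"),
        ("alert to acknowledgement", "alert_fired", "acknowledged"),
        ("alert to root cause", "alert_fired", "root_cause_confirmed"),
        ("rollback to payments-api recovery", "rollback_initiated", "payments_recovered"),
        ("alert to payments-api recovery", "alert_fired", "payments_recovered"),
    ]
    for label, start, end in intervals:
        print(f"{label}: {minutes_between(start, end)} min")
```

With these timestamps, alert to root cause works out to 16 minutes and alert to payments-api recovery to 24 minutes.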

## 🔍 Root Cause and Contributing Factors

During a scheduled cert rotation, the TLS bundle was assembled **manually** and included an **expired intermediate CA**. Staging uses a different CA chain, so pre-deploy validation there did not exercise the production intermediate and the bundle passed. Once the bundle was pushed to Kong, every mTLS handshake to the three downstream services failed.

```mermaid
flowchart TD
    accTitle: Root Cause Chain
    accDescr: Shows the four contributing factors that allowed the expired intermediate CA to reach production and cause the outage.

    manual_assembly["Manual cert bundle assembly<br/>(no automated chain validation)"]
    staging_mismatch["Staging uses different CA chain<br/>(bad cert passed staging)"]
    no_health_check["Kong health checks skip<br/>mTLS handshake verification"]
    no_canary["No canary rollout for<br/>gateway config changes"]
    bad_cert(("Expired<br/>intermediate CA<br/>in production"))
    outage["mTLS failures across<br/>3 downstream services"]

    manual_assembly --> bad_cert
    staging_mismatch --> bad_cert
    bad_cert --> outage
    no_health_check -.->|undetected| outage
    no_canary -.->|full blast radius| outage

    classDef critical fill:#fee2e2,stroke:#dc2626,stroke-width:2px,color:#991b1b
    classDef warning fill:#fef9c3,stroke:#ca8a04,stroke-width:2px,color:#713f12
    classDef neutral fill:#f3f4f6,stroke:#6b7280,stroke-width:1px,color:#374151

    class bad_cert,outage critical
    class manual_assembly,staging_mismatch warning
    class no_health_check,no_canary neutral
```
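
The first contributing factor (no automated chain validation) suggests a pre-deploy gate that rejects a bundle containing an expired certificate or a broken issuer link. The following is a minimal sketch, assuming the certificates have already been parsed into plain records; `CertInfo`, `validate_chain`, and the sample names and dates are hypothetical, and real PEM parsing and signature checks are out of scope here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class CertInfo:
    """Hypothetical parsed view of one certificate in the bundle."""
    subject: str
    issuer: str
    not_before: datetime
    not_after: datetime


def validate_chain(chain: List[CertInfo], now: datetime) -> List[str]:
    """Return a list of problems; an empty list means no issue was found.

    `chain` is ordered leaf first, root last.
    """
    problems: List[str] = []
    for cert in chain:
        if now < cert.not_before:
            problems.append(f"{cert.subject}: not yet valid")
        if now > cert.not_after:
            problems.append(f"{cert.subject}: expired on {cert.not_after:%Y-%m-%d}")
    # Each certificate should name the next one up the chain as its issuer.
    for child, parent in zip(chain, chain[1:]):
        if child.issuer != parent.subject:
            problems.append(
                f"{child.subject}: issuer {child.issuer!r} does not match {parent.subject!r}"
            )
    return problems


if __name__ == "__main__":
    utc = timezone.utc
    # Illustrative dates only, not the real certificates from this incident.
    bundle = [
        CertInfo("gateway-leaf", "intermediate-ca",
                 datetime(2026, 1, 1, tzinfo=utc), datetime(2027, 1, 1, tzinfo=utc)),
        CertInfo("intermediate-ca", "root-ca",
                 datetime(2021, 3, 1, tzinfo=utc), datetime(2026, 3, 1, tzinfo=utc)),
        CertInfo("root-ca", "root-ca",
                 datetime(2016, 1, 1, tzinfo=utc), datetime(2036, 1, 1, tzinfo=utc)),
    ]
    for problem in validate_chain(bundle, datetime(2026, 4, 12, 13, 38, tzinfo=utc)):
        print(problem)
```

A pipeline step could fail the deploy whenever the returned list is non-empty. Running the same check against the production chain, not the staging one, is what would close the staging mismatch.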

### Impact

| Area | Impact | Severity |
| :--- | :----- | :------: |
| Checkout | ~1,400 failed attempts over 24 min | High |
| Users | ~320 saw "payment failed" errors | High |
| Notifications | ~8,500 messages queued (all delivered by 14:15) | Medium |
| Inventory | Writes failed; reads served from cache | Medium |
| Data loss | None — failed writes were idempotent | Low |
| Revenue | No permanent loss — users retried post-recovery | Low |

## 🔧 Recovery Procedure

```mermaid
flowchart TD
    accTitle: Recovery Steps Executed
    accDescr: Step-by-step flowchart of the actions taken by the on-call SRE to diagnose and recover from the mTLS outage.

    alert(["PagerDuty alert received<br/>13:42 UTC"])
    ack["On-call SRE acknowledges<br/>(Priya Nair, 13:47)"]
    check_logs["Check Kong gateway logs"]
    identify{{"mTLS handshake<br/>errors found?"}}
    check_cert["Inspect new cert bundle"]
    confirm_ca{{"Expired intermediate<br/>CA confirmed?"}}
    rollback["Revert Kong config to<br/>previous cert bundle (14:02)"]
    verify_payments["Verify payments-api<br/>returning 200s (14:06)"]
    verify_all["Confirm all 3 services<br/>healthy (14:11)"]
    drain_queue["Monitor notification<br/>backlog drain (14:15)"]
    resolved(["Incident resolved<br/>14:30 UTC"])

    alert --> ack --> check_logs --> identify
    identify -->|Yes| check_cert --> confirm_ca
    identify -->|No| check_logs
    confirm_ca -->|Yes| rollback --> verify_payments --> verify_all --> drain_queue --> resolved
    confirm_ca -->|No| check_logs

    classDef start fill:#dbeafe,stroke:#2563eb,stroke-width:2px,color:#1e3a5f
    classDef action fill:#f3f4f6,stroke:#6b7280,stroke-width:1px,color:#374151
    classDef decision fill:#fef9c3,stroke:#ca8a04,stroke-width:2px,color:#713f12
    classDef recovery fill:#dcfce7,stroke:#16a34a,stroke-width:2px,color:#14532d

    class alert,ack start
    class check_logs,check_cert,rollback action
    class identify,confirm_ca decision
    class verify_payments,verify_all,drain_queue,resolved recovery
```
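
The verification steps in this flow, and the health-check gap noted under contributing factors, could be backed by an active mTLS handshake probe instead of HTTP status alone. The following is a minimal sketch using Python's standard `ssl` module; the function names, file paths, host, and port are placeholders, not values from this incident.

```python
import socket
import ssl


def build_client_context(ca_file: str, cert_file: str, key_file: str) -> ssl.SSLContext:
    """Build a client-side context that verifies the server and presents a client cert."""
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


def probe_handshake(host: str, port: int, context: ssl.SSLContext, timeout: float = 3.0) -> bool:
    """Return True only if a TLS handshake with host:port completes."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            # wrap_socket performs the handshake; verification failures raise ssl.SSLError.
            with context.wrap_socket(raw, server_hostname=host):
                return True
    except OSError:
        # ssl.SSLError, timeouts, and refused connections are all OSError subclasses.
        return False


if __name__ == "__main__":
    # Placeholder paths and endpoint; substitute real values before use.
    ctx = build_client_context("ca-bundle.pem", "client.pem", "client.key")
    print(probe_handshake("payments-api.internal.example", 8443, ctx))
```

One caveat: under TLS 1.3 a server may reject the client certificate only after the client side of the handshake has returned, so a production probe would likely also send a request and read the response before reporting success.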

## ✅ Action Items

| # | Action | Owner | Priority | Status |
| ---: | :----- | :---- | :------: | :----: |
| 1 | Add cert chain validation to deploy pipeline (block on expired or incomplete chain) | Marcus Chen | P0 | [ ] |
| 2 | Align staging CA chain with production | Marcus Chen | P0 | [ ] |
| 3 | Add mTLS handshake success metric to Kong health checks | Platform Eng | P1 | [ ] |
| 4 | Implement canary rollout for gateway config (10% for 5 min before full push) | Platform Eng | P1 | [ ] |
| 5 | Create runbook for cert-related gateway failures | Priya Nair | P2 | [ ] |
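
Action items 3 and 4 fit together: a canary stage could promote a gateway config only if the mTLS handshake success rate holds over the observation window. The following is a minimal sketch of that decision. The 10% / 5-minute figures come from the table above, while `CanarySample`, `canary_decision`, the 99% threshold, and the minimum sample count are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CanarySample:
    """Hypothetical per-interval handshake counts from the canary slice."""
    handshakes_ok: int
    handshakes_failed: int


def canary_decision(
    samples: List[CanarySample],
    min_success_rate: float = 0.99,  # assumed threshold, not from the incident
    min_handshakes: int = 100,       # assumed minimum traffic to judge
) -> str:
    """Return "promote", "rollback", or "extend" for one canary window."""
    ok = sum(s.handshakes_ok for s in samples)
    failed = sum(s.handshakes_failed for s in samples)
    total = ok + failed
    if total < min_handshakes:
        return "extend"  # too little traffic to decide either way
    if ok / total >= min_success_rate:
        return "promote"
    return "rollback"


if __name__ == "__main__":
    # Five one-minute samples in which every handshake fails, as in this outage.
    window = [CanarySample(handshakes_ok=0, handshakes_failed=120) for _ in range(5)]
    print(canary_decision(window))
```

Returning "extend" on low traffic avoids promoting a config that was never exercised, which is the same failure mode as the staging mismatch.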

## 👥 People

| Name | Role |
| :--- | :--- |
| Priya Nair | On-call SRE — led investigation and rollback |
| Marcus Chen | Platform Eng — performed original cert rotation |
| Aria Vasquez | Incident Commander — coordinated comms |
| Jamie Okafor | Payments team lead — confirmed payments-api recovery |

## 📎 References

- PagerDuty alert: payments-api 503 rate threshold
- Slack channel: #incidents (2026-04-12 thread)
- Kong gateway logs: mTLS handshake failures 13:38--14:06 UTC
- Deploy pipeline run: cert rotation job
