# Architecture: rg-doany-prod-api

> **Subscription:** doany-production (a3e1f7d2-8b4c-4a6e-9f12-3c5d7e8b9a01)  
> **Region:** East US 2  
> **Generated:** 2026-04-13  
> **Owner:** Rohan Patel (Platform Lead)

## Summary

`rg-doany-prod-api` is the resource group behind the customer-facing API at `api.doany.ai`. Azure API Management is the single entry point for external traffic: it routes synchronous requests to an App Service (Node.js 20) and asynchronous partner webhook traffic to a Function App. Both compute resources use Azure SQL Database for persistent customer data and Azure Cache for Redis for session management and rate limiting. Both data stores are reached through private endpoints.

Supporting services include a Key Vault for centralized secrets management (connection strings, API keys), a Storage Account for file uploads and data exports, and Application Insights for telemetry and diagnostics. A system-assigned Managed Identity on the App Service provides passwordless access to Key Vault and other resources. The App Service is integrated into a VNet with dedicated subnets for VNet integration and private endpoints, so its SQL and Redis traffic stays off the public internet.

**Incident context (2026-04-12):** A partner webhook retry storm (Acme Corp, ~15k retries in 3 minutes) overwhelmed the Function App → SQL path, spiking DTU to 98% and causing p95 latency > 8s for 47 minutes. This diagram was an action item from that SEV-2 to give responders a clear view of how APIM, Function App, SQL, and Redis are connected.
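
For a sense of scale, the figures above imply an average arrival rate that can be worked out directly. The sketch below uses only the approximate numbers quoted in the incident summary; a real retry storm is usually burstier than a flat average suggests.

```python
# Back-of-the-envelope arrival rate from the incident figures quoted above.
# Both inputs are the approximate values from the incident summary.
retries = 15_000          # ~15k webhook retries
window_seconds = 3 * 60   # over roughly 3 minutes

avg_rate_per_second = retries / window_seconds
print(f"average arrival rate: {avg_rate_per_second:.1f} requests/s")  # 83.3 requests/s
```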

## Resource Inventory

| Resource Name | Type | SKU/Tier | Location | Key Properties |
|---------------|------|----------|----------|----------------|
| App Service Plan | Microsoft.Web/serverfarms | *(see Azure portal)* | East US 2 | Hosts main API App Service |
| App Service (API) | Microsoft.Web/sites | *(set by App Service Plan)* | East US 2 | Node.js 20 runtime; main customer API at `api.doany.ai`; system-assigned Managed Identity enabled |
| Function App (Webhooks) | Microsoft.Web/sites (functionapp) | *(see Azure portal)* | East US 2 | Node.js runtime; async partner webhook processing |
| Azure SQL Database | Microsoft.Sql/servers/databases | *(DTU-based)* | East US 2 | Customer data; private endpoint; DTU alert at 95% (80% proposed) |
| Azure Cache for Redis | Microsoft.Cache/Redis | *(see Azure portal)* | East US 2 | Session store + rate limiting; private endpoint |
| Key Vault | Microsoft.KeyVault/vaults | Standard | East US 2 | Secrets, connection strings, partner API keys |
| Storage Account | Microsoft.Storage/storageAccounts | *(Standard)* | East US 2 | File uploads, data exports (Blob) |
| Application Insights | Microsoft.Insights/components | — | East US 2 | Telemetry, performance monitoring, alerting |
| API Management | Microsoft.ApiManagement/service | *(see Azure portal)* | East US 2 | Partner API key management, throttling, rate limiting |
| Virtual Network | Microsoft.Network/virtualNetworks | — | East US 2 | App Service integration + private endpoint subnets |
| Private Endpoint (SQL) | Microsoft.Network/privateEndpoints | — | East US 2 | Private connectivity to Azure SQL |
| Private Endpoint (Redis) | Microsoft.Network/privateEndpoints | — | East US 2 | Private connectivity to Redis Cache |
| Managed Identity | System-assigned (on App Service) | — | East US 2 | Passwordless access to Key Vault and other resources |

## Architecture Diagram

```mermaid
graph TB
    subgraph "External"
        CLIENTS["Customers / Apps"]
        PARTNERS["Partner Systems<br/>(e.g. Acme Corp)"]
    end

    subgraph "rg-doany-prod-api — East US 2"

        subgraph "API Gateway"
            APIM["API Management<br/>Throttling + API Keys"]
        end

        subgraph "Compute Layer"
            APP["App Service<br/>Node.js 20 — Main API"]
            FUNC["Function App<br/>Async Webhook Processing"]
        end

        subgraph "Network Layer"
            VNET["Virtual Network"]
            SUBNET_APP["Subnet: app-integration"]
            SUBNET_PE["Subnet: private-endpoints"]
            PE_SQL["Private Endpoint<br/>→ SQL"]
            PE_REDIS["Private Endpoint<br/>→ Redis"]
        end

        subgraph "Data Layer"
            SQL[("Azure SQL Database<br/>Customer Data — DTU-based")]
            REDIS[("Azure Cache for Redis<br/>Sessions + Rate Limiting")]
            STORAGE["Storage Account<br/>File Uploads / Exports"]
        end

        subgraph "Security & Identity"
            KV["Key Vault<br/>Secrets, Conn Strings, API Keys"]
            MI["Managed Identity<br/>(system-assigned)"]
        end

        subgraph "Observability"
            APPINS["Application Insights<br/>Telemetry + Alerts"]
        end

    end

    %% External traffic
    CLIENTS ==>|"HTTPS"| APIM
    PARTNERS ==>|"Webhook calls"| APIM

    %% APIM routing
    APIM ==>|"Sync API requests"| APP
    APIM ==>|"Async webhook delivery"| FUNC

    %% Compute → Data (primary paths)
    APP ==>|"Customer queries"| SQL
    APP -->|"Session lookup,<br/>rate limit check"| REDIS
    FUNC ==>|"Webhook SQL writes<br/>(complex joins)"| SQL
    FUNC -.->|"Cache check<br/>(unique keys = miss)"| REDIS
    APP -->|"Blob read/write"| STORAGE
    FUNC -->|"Export output"| STORAGE

    %% Identity & Secrets
    APP -->|"Uses"| MI
    MI -->|"Retrieve secrets"| KV
    FUNC -.->|"Read secrets"| KV

    %% Network topology
    VNET --- SUBNET_APP
    VNET --- SUBNET_PE
    APP ---|"VNet Integration"| SUBNET_APP
    SUBNET_PE --- PE_SQL
    SUBNET_PE --- PE_REDIS
    PE_SQL ---|"Private link"| SQL
    PE_REDIS ---|"Private link"| REDIS

    %% Observability
    APP -.->|"Telemetry"| APPINS
    FUNC -.->|"Telemetry"| APPINS

    %% Styling
    classDef critical fill:#ff6b6b,stroke:#c0392b,color:#fff
    class SQL critical
```

## Relationships & Data Flow

### Primary Data Paths

1. **Customer API requests:** Customers → APIM (auth + throttle) → App Service → Azure SQL + Redis
2. **Partner webhook flow:** Partners → APIM (API key validation + rate limiting) → Function App → Azure SQL (this is the path that caused the 2026-04-12 incident)
3. **File operations:** App Service / Function App → Storage Account (blob uploads and data exports)
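
The paths above can also be read as a small directed graph, which makes shared dependencies easy to query during an incident. The following is a minimal sketch: the edge list is transcribed by hand from this document (not from a live Azure query), and `upstream_of` is an illustrative helper, not part of any tooling used here.

```python
from collections import defaultdict

# Edges transcribed from the "Primary Data Paths" list and the diagram.
EDGES = [
    ("Customers", "APIM"),
    ("Partners", "APIM"),
    ("APIM", "App Service"),
    ("APIM", "Function App"),
    ("App Service", "Azure SQL"),
    ("App Service", "Redis"),
    ("Function App", "Azure SQL"),
    ("Function App", "Redis"),
    ("App Service", "Storage Account"),
    ("Function App", "Storage Account"),
]


def upstream_of(target, edges):
    """Return every node that can reach `target` by following edges."""
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)
    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(upstream_of("Azure SQL", EDGES)))
# ['APIM', 'App Service', 'Customers', 'Function App', 'Partners']
```

Both the customer path and the partner path appear upstream of Azure SQL, which is why load on one can degrade the other.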

### Network Topology

- A single **VNet** with at least two subnets:
  - **app-integration subnet** — delegated for App Service VNet integration, allowing the API to reach private endpoints
  - **private-endpoints subnet** — hosts private endpoints for Azure SQL and Redis Cache
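- Traffic from compute to Azure SQL and Redis flows through private endpoints, so neither data store is exposed to the public internet. The inventory lists no private endpoint for the Storage Account, and this document does not show VNet integration for the Function App; verify both in the Azure portal.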
- APIM sits at the edge and handles external ingress

### Identity & Access

- The **App Service** has a **system-assigned Managed Identity** that authenticates to Key Vault for secrets retrieval (connection strings, API keys) — no credentials stored in app settings
- The **Function App** also reads secrets from Key Vault. How it authenticates is not confirmed here; it is likely its own managed identity or a shared access policy.
- **APIM** manages partner-specific API keys for webhook authentication and throttling
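
As an illustration of the passwordless flow described above, the sketch below reads one secret through a managed identity using the Azure SDK for Python (`azure-identity` and `azure-keyvault-secrets`). The production services run Node.js, so this only shows the shape of the flow. The vault and secret names in the usage comment are hypothetical, and the URL format assumes the Azure public cloud.

```python
from typing import Optional


def vault_url(vault_name: str) -> str:
    """Build the Key Vault URL (Azure public cloud format)."""
    return f"https://{vault_name}.vault.azure.net"


def read_secret(vault_name: str, secret_name: str) -> Optional[str]:
    """Fetch one secret value using the host's managed identity."""
    # Imported lazily so this module loads without the Azure SDK installed.
    from azure.identity import ManagedIdentityCredential
    from azure.keyvault.secrets import SecretClient

    # With no arguments, the credential uses the system-assigned identity.
    credential = ManagedIdentityCredential()
    client = SecretClient(vault_url=vault_url(vault_name), credential=credential)
    return client.get_secret(secret_name).value


# Example call (hypothetical vault and secret names, only works on an
# Azure host whose identity has been granted access to the vault):
# conn = read_secret("kv-example", "sql-connection-string")
```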

## Incident-Relevant Notes (2026-04-12 SEV-2)

- **Unbounded concurrency:** The path `APIM → Function App → SQL` had insufficient concurrency controls. When Acme Corp retried ~15,000 webhooks in 3 minutes, the Function App processed all of them concurrently, each executing an expensive SQL join.
- **Redis miss:** Webhook payloads had unique cache keys, so the Redis cache provided no relief during the storm.
- **DTU saturation:** SQL DTU hit 98%. Because the App Service queries the same database, the latency cascaded onto the customer request path.
- **Key gap:** APIM rate limiting was not strict enough for partner webhook endpoints specifically.
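
The concurrency problem noted above can be illustrated in-process with a semaphore that caps how many handlers run at once. This is a minimal sketch of the idea only: `fake_sql_write`, the payload count, and the limit of 8 are hypothetical, and the real control would more likely live in the Functions host or trigger configuration, which this document does not specify.

```python
import asyncio


async def process_all(payloads, handler, max_concurrent):
    """Run `handler` on every payload with at most `max_concurrent` in flight."""
    gate = asyncio.Semaphore(max_concurrent)

    async def run_one(payload):
        async with gate:  # waits here once the cap is reached
            return await handler(payload)

    return await asyncio.gather(*(run_one(p) for p in payloads))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_sql_write(payload):
        # Hypothetical stand-in for the expensive SQL call.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.001)
        in_flight -= 1
        return payload

    results = await process_all(range(200), fake_sql_write, max_concurrent=8)
    return len(results), peak


processed, peak_in_flight = asyncio.run(demo())
print(f"processed={processed}, peak in flight={peak_in_flight}")
```

All 200 payloads are still processed, but the number of simultaneous calls to the downstream dependency never exceeds the cap.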

## Action Items (from postmortem draft)

- [ ] Add concurrency throttling on the webhook Function App
- [ ] Move expensive SQL query to a read replica (pending budget)
- [ ] Review and tighten APIM rate limiting rules for partner webhook endpoints
- [ ] Add DTU % alert threshold at 80% (currently fires at 95%)
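
To reason about how strict a partner rate limit would need to be, it can help to replay the incident load against a candidate limit. The sketch below uses a token bucket, one common rate-limiting scheme and not necessarily what APIM applies internally. The rate and burst capacity are hypothetical values, and the load is modeled as evenly spaced calls using the approximate incident figures.

```python
class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Replay ~15,000 calls spread evenly over 180 seconds.
TOTAL_CALLS = 15_000
WINDOW_SECONDS = 180
bucket = TokenBucket(rate=10, capacity=20)  # hypothetical per-partner limit

accepted = sum(
    bucket.allow(now=i * WINDOW_SECONDS / TOTAL_CALLS) for i in range(TOTAL_CALLS)
)
print(f"accepted {accepted} of {TOTAL_CALLS}; rejected {TOTAL_CALLS - accepted}")
```

With these assumed values, only a small fraction of the storm would have reached the Function App; the rest would have been rejected at the edge.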

## Caveats

- **Resource inventory is based on `infra-notes.md` (last updated 2026-03-18) and incident data — not a live Azure query.** Azure CLI was not available in this environment. SKU tiers and exact configuration details should be verified in the Azure portal before the postmortem.
- Cross-resource-group dependencies exist: DNS and monitoring live in `rg-doany-shared`; the frontend in `rg-doany-prod-web`.

---

*Generated by Azure Resource Visualizer*
