# Retry Redesign Spec

**Date:** 2026-04-09
**Author:** Alex Chen
**Status:** Proposed
**Triggered by:** Webhook failure spike following PR #247 release (2026-04-08)

---

## 1. Problem Statement

After the release of PR #247 ("perf: speed up webhook retries for faster delivery"), webhook delivery failure rates spiked. The PR replaced the exponential backoff retry strategy with a fixed 5-second interval and simultaneously increased concurrency limits, creating a compounding failure cascade.

### What Changed in PR #247

| Parameter | Pre-#247 | Post-#247 |
|---|---|---|
| Retry delay | Exponential: `60s * 2^attempt`, jitter +/-20% | Fixed 5s, no jitter |
| Max retry attempts | 5 | 8 |
| Request timeout | 30s | 10s |
| Connection pool size | 20 | 50 |
| Retry batch size | 100 | 500 |

### Why It Broke

1. **No backoff**: Failing endpoints are retried every 5 seconds instead of backing off over ~31 minutes. Endpoints experiencing transient issues never get recovery time.
2. **No jitter**: All retries for a failing endpoint land at the exact same interval, causing thundering herd effects when multiple deliveries fail simultaneously.
3. **Aggressive timeout**: Reducing the timeout from 30s to 10s causes slow-but-healthy endpoints to time out, converting successes into failures and generating additional retry load.
4. **Amplified concurrency**: A 50-connection pool draining batches of 500 means up to 500 retries dispatched per cron tick (every 30s), with up to 50 in flight at once, up from 100 per tick over 20 connections. This overwhelms both our outbound capacity and customer endpoints.
5. **More total attempts**: 8 attempts at 5s intervals = 40 seconds of hammering vs. 5 attempts over 31 minutes. Endpoints that would have recovered between exponential retries now get dead-lettered before they can.
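
To make the contrast concrete, here is a quick sketch of the cumulative retry timelines under each policy (jitter ignored):

```python
from itertools import accumulate

# Seconds between attempts under each policy (jitter ignored).
old = [60 * 2 ** n for n in range(5)]  # exponential backoff, 5 attempts
new = [5] * 8                          # fixed 5s interval, 8 attempts

print(list(accumulate(old)))  # [60, 180, 420, 900, 1860] -> spread over ~31 minutes
print(list(accumulate(new)))  # [5, 10, 15, ..., 40]      -> all attempts within 40 seconds
```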

### Broken Tests

The existing `TestBackoff` test suite still asserts exponential behavior (e.g., first retry between 48-72s). These tests now fail against the current code, meaning PR #247 shipped without passing the backoff test suite.

---

## 2. Goals

1. **Restore safe retry behavior**: Re-establish exponential backoff with jitter to give failing endpoints time to recover and prevent thundering herd.
2. **Align code with architecture doc**: The documented retry policy (60s base, exponential, jitter) is the intended contract with customers. Code must match.
3. **Right-size concurrency**: Revert pool and batch sizes to levels that don't overwhelm downstream endpoints.
4. **Fix broken tests**: Tests must pass and actually validate the backoff strategy.
5. **Add guardrails**: Make it harder for future "perf" changes to silently remove backoff.

### Non-Goals

- Changing the fundamental retry architecture (e.g., switching to a queue-based retry system).
- Adding new retry features (e.g., per-endpoint retry policies, customer-configurable backoff).
- Modifying the circuit breaker logic (it is functioning correctly, just overwhelmed by volume).

---

## 3. Design

### 3.1 Restore Exponential Backoff with Jitter

Revert `_calculate_backoff()` in `src/webhook_sender.py` to exponential backoff:

```python
import random

BASE_RETRY_DELAY = 60  # seconds
JITTER_FACTOR = 0.2

def _calculate_backoff(attempt_number: int) -> float:
    """Exponential backoff: 60s, 120s, 240s, 480s, 960s with +/-20% jitter.

    `attempt_number` is zero-based: the first retry passes 0.
    """
    delay = BASE_RETRY_DELAY * (2 ** attempt_number)
    jitter = delay * JITTER_FACTOR * (2 * random.random() - 1)  # uniform in +/-20% of delay
    return max(delay + jitter, 1.0)
```
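
As a quick smoke check, one sampled schedule should land every value inside the jittered bounds (hypothetical snippet, assuming `_calculate_backoff` is importable):

```python
# Print one sampled retry schedule; exact values vary run to run.
for attempt in range(5):
    print(f"retry {attempt + 1}: {_calculate_backoff(attempt):.1f}s")
# Expected ranges: retry 1 in [48, 72], retry 2 in [96, 144], ..., retry 5 in [768, 1152].
```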

**Resulting schedule** (without jitter):

| Attempt | Delay | Cumulative |
|---|---|---|
| 1 | 60s | 1 min |
| 2 | 120s | 3 min |
| 3 | 240s | 7 min |
| 4 | 480s | 15 min |
| 5 | 960s | 31 min |

This matches the architecture doc at `docs/architecture.md:28-29`.

### 3.2 Revert Config Defaults

In `src/config.py`:

```python
import os

# Pre-#247 defaults, restored. Env vars allow per-deployment overrides.
WEBHOOK_TIMEOUT_SECONDS = int(os.getenv("WEBHOOK_TIMEOUT_SECONDS", "30"))  # per-request timeout
MAX_RETRY_ATTEMPTS = int(os.getenv("MAX_RETRY_ATTEMPTS", "5"))
RETRY_BATCH_SIZE = int(os.getenv("RETRY_BATCH_SIZE", "100"))  # retries per cron tick
HTTP_POOL_SIZE = int(os.getenv("HTTP_POOL_SIZE", "20"))       # outbound connection pool
```

**Rationale for each revert:**

- **Timeout 30s**: Customer endpoints have varied response times. 10s is too aggressive for legitimate slow responses (e.g., endpoints that do synchronous processing). The 30s default was chosen based on p99 latency of healthy endpoints.
- **Max attempts 5**: 5 attempts over 31 minutes gives endpoints adequate recovery time. 8 attempts with the restored backoff would extend the window to over 4 hours, which is excessive.
- **Batch size 100**: Limits the blast radius of each cron tick. 500 simultaneous retries can saturate outbound connections and trigger rate limits on customer infrastructure.
- **Pool size 20**: Proportioned to the batch size of 100, which it served without saturation pre-#247. A pool of 50 at that batch size leaves connections idle, consuming resources without improving throughput.
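
To keep these defaults from drifting again, a minimal regression test could pin them (hypothetical test; assumes the env vars are unset in the test environment):

```python
from src import config

def test_pre_247_defaults_restored():
    """Pin the reverted defaults so any future change is an explicit, reviewed decision."""
    assert config.WEBHOOK_TIMEOUT_SECONDS == 30
    assert config.MAX_RETRY_ATTEMPTS == 5
    assert config.RETRY_BATCH_SIZE == 100
    assert config.HTTP_POOL_SIZE == 20
```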

### 3.3 Fix Test Suite

Update `tests/test_webhook_sender.py` to validate the restored backoff:

1. **`test_first_retry_around_60s`**: Assert `48 <= delay <= 72` (already correct, currently failing).
2. **`test_second_retry_around_120s`**: Assert `96 <= delay <= 144` (already correct, currently failing).
3. **`test_backoff_is_exponential`**: Assert each successive delay is ~2x the previous (already correct, currently failing; sketched after this list).
4. **`test_delay_never_negative`**: Assert all delays > 0 (passes but should also assert delay >= 1.0 per the `max()` guard).
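
For reference, `test_backoff_is_exponential` might look like the following sketch (the existing test's exact form may differ):

```python
def test_backoff_is_exponential(self):
    """Successive nominal delays double; jittered samples stay within ratio bounds."""
    for attempt in range(1, 5):
        ratios = [
            _calculate_backoff(attempt) / _calculate_backoff(attempt - 1)
            for _ in range(100)
        ]
        # With +/-20% jitter, the ratio of consecutive samples lies in
        # [2 * 0.8 / 1.2, 2 * 1.2 / 0.8] ~= [1.33, 3.0].
        assert all(1.33 <= r <= 3.0 for r in ratios)
```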

Add new test:

```python
def test_jitter_produces_variance(self):
    """Verify jitter prevents thundering herd."""
    delays = [_calculate_backoff(0) for _ in range(100)]
    assert len(set(delays)) > 1, "Jitter should produce varying delays"
    assert all(48 <= d <= 72 for d in delays)
```

### 3.4 Add Backoff Validation Guard

Add a startup check in `src/config.py` or `src/webhook_sender.py` to prevent future regressions. An explicit `raise` is preferable to a bare `assert`, since asserts are stripped under `python -O`:

```python
# In webhook_sender.py, at module level. An explicit raise (rather than
# assert) survives `python -O`, which strips assert statements.
if BASE_RETRY_DELAY < 30:
    raise RuntimeError(
        f"BASE_RETRY_DELAY={BASE_RETRY_DELAY}s is dangerously low. "
        "Retry delays under 30s risk overwhelming customer endpoints. "
        "See docs/architecture.md for the retry policy contract."
    )
```

This makes it impossible to silently reduce backoff without deliberately removing the guard.

---

## 4. Rollout Plan

### Phase 1: Immediate Revert (Today)

1. Revert the backoff logic in `src/webhook_sender.py` to exponential with jitter.
2. Revert config defaults in `src/config.py`.
3. Verify all tests in `TestBackoff` pass.
4. Add jitter variance test and startup assertion.
5. Deploy to staging; verify the retry schedule via delivery logs (a verification sketch follows this list).
6. Deploy to production.
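
For step 5, a rough verification helper (the log format and field handling are assumptions; adapt to the actual delivery logs):

```python
from datetime import datetime

def check_exponential_schedule(attempt_times: list[datetime]) -> None:
    """Given attempt timestamps for one failing delivery, check that the
    gaps between attempts roughly double (within +/-20% jitter bounds)."""
    gaps = [(b - a).total_seconds() for a, b in zip(attempt_times, attempt_times[1:])]
    for prev, cur in zip(gaps, gaps[1:]):
        ratio = cur / prev
        assert 1.3 <= ratio <= 3.1, f"gap ratio {ratio:.2f} not consistent with doubling"
```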

### Phase 2: Verification (Today + 1 day)

1. Monitor Datadog `webhook.retry.queue_depth` — should decrease as retries spread out.
2. Monitor `webhook.delivery.latency_p95` — should drop as endpoint hammering stops.
3. Monitor the dead-letter rate — should drop below the 5% PagerDuty threshold.
4. Confirm no customer escalations about duplicate deliveries or missed webhooks.

### Phase 3: Post-Incident Improvements (This week)

1. Add a CI check that runs the backoff tests and fails if any are skipped or removed (one possible shape is sketched after this list).
2. Document the retry policy contract in `README.md` so future contributors understand the commitment.
3. Consider adding a Datadog monitor for retry interval distribution anomalies.
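
One possible shape for the CI check in item 1, as a small helper script (the script path and expected test count are assumptions):

```python
# scripts/check_backoff_tests.py (hypothetical): fail CI if backoff tests
# are removed or skipped.
import subprocess
import sys

EXPECTED_MIN = 5  # four existing TestBackoff tests plus the new jitter test

# 1. Collect the backoff tests and fail if any were deleted.
collect = subprocess.run(
    ["pytest", "tests/test_webhook_sender.py", "-k", "TestBackoff",
     "--collect-only", "-q"],
    capture_output=True, text=True,
)
count = sum(1 for line in collect.stdout.splitlines() if "::" in line)
if count < EXPECTED_MIN:
    sys.exit(f"Expected >= {EXPECTED_MIN} backoff tests, collected {count}")

# 2. Run them, treating any skip as a failure.
run = subprocess.run(
    ["pytest", "tests/test_webhook_sender.py", "-k", "TestBackoff", "-rs"],
    capture_output=True, text=True,
)
if run.returncode != 0 or " skipped" in run.stdout:
    sys.exit("Backoff tests failed or were skipped:\n" + run.stdout)
```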

---

## 5. Risk Assessment

| Risk | Likelihood | Mitigation |
|---|---|---|
| Revert introduces new bug | Low | Changes are a direct revert to proven pre-#247 logic |
| Slow endpoints still time out at 30s | Low | 30s was the stable default; no customer complaints pre-#247 |
| Backlog of queued retries causes burst on deploy | Medium | The cron job processes in batches of 100, naturally rate-limiting. Monitor queue depth at deploy time. |
| Future PRs re-introduce aggressive retry | Medium | Startup assertion + jitter test make this harder to do accidentally |

---

## 6. Success Criteria

- [ ] Dead-letter rate returns to pre-#247 baseline (< 2%)
- [ ] Retry queue depth stabilizes (no unbounded growth)
- [ ] All `TestBackoff` tests pass
- [ ] No regression in delivery success rate for healthy endpoints
- [ ] Retry intervals in delivery logs show exponential distribution with jitter
