--- a/draft-email.txt
+++ b/draft-email.txt
@@ -1,22 +1,21 @@
-Subject: Service Disruption Resolution and Next Steps
+Subject: April 8 outage — what happened and what we're fixing
 
-Dear Valued Customer,
+Hi,
 
-We are writing to provide you with a comprehensive update regarding the service disruption that impacted our platform on April 8, 2026. We sincerely apologize for the inconvenience this may have caused.
+On April 8, a database migration we deployed at 14:22 UTC locked a table and exhausted our API connection pool. By 14:25, the API was returning 503s, the dashboard was down, and webhooks stopped going out. We rolled back the migration at 15:09 and confirmed full recovery at 16:35 — about 2 hours and 13 minutes of downtime total.
 
-Our dedicated engineering team worked tirelessly to identify and resolve the root cause of the outage. The issue stemmed from a critical database migration that inadvertently caused connection pool exhaustion across our primary API cluster. This pivotal incident underscores our commitment to maintaining the highest standards of reliability for our customers.
+We're sorry. That's not the reliability you signed up for.
 
-We want to highlight the key steps we took to address this challenge:
+Webhooks that queued during the window were all delivered by 17:00 UTC — no data was lost, but we know delayed webhooks can cause real problems on your end.
 
-- Rolled back the problematic migration within 47 minutes of detection
-- Implemented enhanced monitoring to ensure early detection of similar issues
-- Conducted a thorough post-incident review, fostering a culture of continuous improvement
+Here's what we've changed since:
 
-The total downtime was approximately 2 hours and 13 minutes, affecting API responses, webhook deliveries, and dashboard access. We understand the crucial role our platform plays in your daily operations, and we deeply regret any disruption to your workflows.
+- Added connection pool headroom alerts (already live)
+- Large migrations now run in off-peak maintenance windows only
+- Any migration over 10 seconds requires infra team review before deploy
 
-Looking ahead, we are committed to bolstering our infrastructure to prevent similar occurrences. Our team is actively enhancing our deployment pipeline and implementing additional safeguards.
+We should have caught this before it hit production. We didn't, and we've changed our process so it doesn't happen again.
 
-We truly value your partnership and appreciate your patience and understanding during this time. Should you have any questions or concerns, please do not hesitate to reach out to our support team at support@stackrelay.com.
+Questions? Hit us at support@stackrelay.com.
 
-Warm regards,
-The StackRelay Team
+— The StackRelay Team