--- a/eng-update-draft.md
+++ b/eng-update-draft.md
@@ -1,46 +1,32 @@
-# Engineering Status Update — Week of April 7
+# eng update — week of april 7
 
-## Platform Infrastructure
+## infra
 
-This week marked a pivotal moment in our infrastructure modernization journey. The team successfully completed the migration of our primary database cluster from PostgreSQL 14 to PostgreSQL 16, ensuring enhanced performance and long-term maintainability. This initiative underscores our commitment to maintaining a robust and scalable foundation for our growing platform.
+postgres 16 migration is done. 47 schema definitions updated, zero downtime — honestly didn't expect that to go as smoothly as it did. shoutout to Dana for figuring out the parallel batch approach for schema diffs, that's what made the zero-downtime thing possible.
 
-The migration encompassed a comprehensive update of 47 schema definitions, highlighting the depth and complexity of this undertaking. Zero downtime was achieved during the transition, reflecting the team's meticulous planning and execution.
+## api performance
 
-## API Performance
+- /search p99 latency: 340ms → ~180ms after the pgbouncer rollout. not bad for a config change
+- connection overhead down ~40% across the board
+- redis caching layer is live for product catalog
 
-Our API team has made significant strides in optimizing endpoint response times. Key achievements include:
+## auth rewrite
 
-- Reduced p99 latency from 340ms to 180ms on the /search endpoint, showcasing the team's dedication to delivering a seamless user experience
-- Implemented connection pooling via PgBouncer, contributing to a 40% reduction in database connection overhead
-- Deployed Redis caching layer for the product catalog service, enhancing its ability to handle increased traffic volumes
+94% test coverage, token refresh works, session management is way cleaner than the old spaghetti. rate limiting on login is in. on track for staged rollout april 14 — 10% first, then 25%, then 100% over two weeks assuming nothing catches fire.
 
-These improvements represent a crucial step forward in ensuring our platform can scale effectively to meet growing demand.
+## mobile
 
-## Auth Service Rewrite
+react native upgrade to 0.76 is done (was on 0.73). startup times are better and android crash rate dropped from 2.1% to 0.8%.
 
-The authentication service rewrite stands as a testament to our engineering team's technical expertise. The new service, built on OAuth 2.1 with PKCE, boasts a comprehensive test suite encompassing 94% code coverage. This groundbreaking effort sets the stage for future security enhancements and ensures compliance with industry best practices.
+v4.2 going to testflight by wednesday.
 
-Key milestones achieved:
-- Token refresh flow fully implemented and validated
-- Session management refactored, fostering a more maintainable codebase
-- Rate limiting added to login endpoints, symbolizing our proactive approach to security
+## what's next
 
-The auth rewrite is on track for staged rollout starting April 14, marking a key turning point in our platform's security posture.
+- auth rollout starts april 14
+- search team kicking off semantic search prototype with embeddings
+- hiring: two backend roles close friday, three strong candidates in final round
 
-## Mobile App
+## risks
 
-The mobile team has been instrumental in driving forward our cross-platform strategy. React Native upgrade from 0.73 to 0.76 is complete, nestled within a broader effort to modernize our mobile stack. The upgrade contributes to improved startup times and enhanced developer experience.
-
-Crash rate has been reduced from 2.1% to 0.8% on Android, underscoring the team's commitment to quality and reliability.
-
-## What's Next
-
-- Auth staged rollout begins April 14 (10% → 25% → 100% over two weeks)
-- Search team starting work on semantic search prototype using embeddings
-- Mobile releasing v4.2 to TestFlight by Wednesday
-- Hiring: two backend roles close Friday, three strong candidates in final round
-
-## Risks
-
-- The evolving landscape of third-party API dependencies presents ongoing challenges. Stripe webhook reliability issues persist — we're implementing retry logic as a vital safeguard
-- CI pipeline is experiencing intermittent failures on the arm64 runners, reflecting broader infrastructure complexity
+- stripe webhooks are still flaky. adding retry logic but it's not a real fix, just a band-aid until stripe gets their act together
+- arm64 CI runners keep randomly dying. I have a workaround (pin to x64) but it's slow and I hate it