the deploy that wasn't zero-downtime
We have zero-downtime deploys. They work great for code changes, new UI components, package updates, config changes. Friday we pushed a release and the app went down.
This release included destructive schema changes to core tables. We needed to modify conversations, chunks, and library to better track processing states. Not the usual non-destructive ALTER TABLE ADD COLUMN. These changes required PostgreSQL to take exclusive locks on the tables being altered and hold them for the duration of the work.
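For context, here's roughly what that difference looks like in Postgres. The table names are ours; the column names and specific statements are illustrative, not our actual migration:

```sql
-- Additive: ADD COLUMN also takes an ACCESS EXCLUSIVE lock, but only for an
-- instant; there is no table rewrite, so contention is rarely noticeable.
ALTER TABLE conversations ADD COLUMN processing_state text;

-- Destructive: a type change rewrites the table while holding ACCESS
-- EXCLUSIVE, blocking every read and write until the rewrite finishes.
ALTER TABLE chunks ALTER COLUMN processing_state TYPE jsonb
  USING processing_state::jsonb;

-- SET NOT NULL scans the whole table under the same lock.
ALTER TABLE library ALTER COLUMN item_status SET NOT NULL;
```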
We weren’t careless about it. Our practice includes restoring a recent backup of the production schema and running the migration on top of it to verify it applies cleanly. Did that. Worked perfectly in testing.
What we didn’t account for: active users. The test environment had nobody hitting the database. Production had people actively using the app. PostgreSQL couldn’t get the exclusive access it needed because active transactions held locks on the same tables, and while the ALTER statements sat waiting at the head of the lock queue, every new query against those tables piled up behind them. Migration timed out. App entered an inconsistent state. Downtime.
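Two standard Postgres tools would have at least contained the blast radius: a short lock_timeout on the migration session so it fails fast instead of queuing indefinitely, and a quick look at who holds the conflicting locks. Table names aside, nothing here is specific to our setup:

```sql
-- Give up quickly if the lock can't be acquired, instead of sitting in the
-- lock queue and blocking every query that arrives behind the ALTER.
SET lock_timeout = '5s';
SET statement_timeout = '60s';

-- If the migration does block, find the sessions holding it up.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       wait_event_type,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```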
Fix was simple. Scheduled a brief maintenance window, ran the migration, brought everything back up. Fifteen minutes total. But for a product serving government organizations, unplanned downtime is unacceptable.
Additive schema changes are safe for zero-downtime deploys. Destructive schema changes are not. Hard rule now: any release with destructive schema changes gets a scheduled maintenance window. No exceptions.
There’s a deeper lesson here too. Our testing validated the correctness of the migration (do the SQL statements execute without errors?) but not the operational context (can they execute while the database is under load?). Different questions, different test environments needed. If we’d run the migration against a database with even one simulated concurrent connection, we’d have caught it.
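A rough way to reproduce the failure mode without real traffic: hold an open transaction on one of the affected tables in one session and run the migration in another. The added column is just a stand-in:

```sql
-- Session A: stand in for an active user. Even a plain SELECT inside an
-- open transaction holds an ACCESS SHARE lock until COMMIT or ROLLBACK.
BEGIN;
SELECT count(*) FROM conversations;
-- ...leave the transaction open...

-- Session B: the migration now has to wait for session A, and with a
-- lock_timeout set it fails the same way it did in production.
SET lock_timeout = '5s';
ALTER TABLE conversations ADD COLUMN processing_state text;
-- ERROR:  canceling statement due to lock timeout
```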
For now the simple rule works. Check if the release has destructive schema changes. If yes, schedule downtime.