a deprecated API took down our chat for 3 hours
A customer named Eve reported that chat wasn’t working around 9 AM. By noon, we had it fixed. The actual bug fix took 30 minutes; the remaining 2.5 hours were testing and validation.
That ratio tells you everything about why we needed semi-automated QA.
what happened
Our chat feature uses LLM providers for conversation analysis. Part of that pipeline is token counting to manage context windows, and the token-counting code was using an internal Anthropic API that had been quietly deprecated.
Worked fine for months. Then it didn’t.
Debugging was fast: trace the error, find the failing API call, see the deprecation notice. Migration to the current API was straightforward. But our build system then flagged previously-passing code as broken, which required additional changes before we could deploy.
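For reference, the supported path we migrated to looks roughly like this. A minimal sketch assuming the Anthropic Python SDK’s token-counting endpoint; the wrapper name and model id are illustrative, and the deprecated internal call it replaced isn’t shown.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def count_input_tokens(messages: list[dict], model: str = "claude-3-5-sonnet-latest") -> int:
    """Count input tokens via the supported Messages token-counting endpoint."""
    result = client.messages.count_tokens(model=model, messages=messages)
    return result.input_tokens

# e.g. before assembling a context window:
# count_input_tokens([{"role": "user", "content": "latest user message..."}])
```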
hotfix process
Our deployment workflow uses a hotfix branch pattern: branch from production, fix, deploy through CI, merge back.
main <- hotfix/chat-anthropic-migration <- production

CI ran the full pipeline, deployed to production, and chat came back. The process itself was solid. We’d invested in making hotfixes reliable, and it paid off.
what came out of the postmortem
We need automated monitoring for API deprecations across all our providers. We’re using Anthropic, Google (Vertex AI), OpenAI, and potentially open-source models. Each has its own deprecation timeline, its own internal APIs that can disappear without warning. Manual tracking doesn’t scale.
Our linting and version locking should catch dependencies on experimental or internal APIs. If we’re importing from an endpoint that isn’t part of a stable SDK, the build should at minimum warn us.
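A rough sketch of what that check could look like in CI, assuming a Python codebase; the “unstable” module prefixes and the src/ layout are illustrative, not a list of actual offenders.

```python
# ci/check_api_stability.py -- warn when source imports from unstable API namespaces.
import ast
import pathlib
import sys

# Module prefixes we treat as beta/internal/experimental (illustrative examples).
UNSTABLE_PREFIXES = ("anthropic.beta", "vertexai.preview", "openai.beta")

def unstable_imports(path: pathlib.Path) -> list[str]:
    """Return the unstable module names imported by one source file."""
    tree = ast.parse(path.read_text(), filename=str(path))
    hits: list[str] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            hits += [a.name for a in node.names if a.name.startswith(UNSTABLE_PREFIXES)]
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.startswith(UNSTABLE_PREFIXES):
                hits.append(node.module)
    return hits

if __name__ == "__main__":
    flagged = {
        str(p): mods
        for p in pathlib.Path("src").rglob("*.py")
        if (mods := unstable_imports(p))
    }
    for path, mods in flagged.items():
        print(f"WARNING: {path} imports unstable API modules: {', '.join(mods)}")
    sys.exit(1 if flagged else 0)
```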
The 30-minute fix / 2.5-hour validation split makes the case for semi-automated QA. We were already building a QA workflow, but this incident moved it from “nice to have” to “needed yesterday.” When your fix takes 5x longer to verify than to implement, the verification process is the bottleneck.
the pattern
Silent dependency on an unstable external API that works until it doesn’t. Almost impossible to catch through normal development. The code passed tests. It worked in staging. It worked in production for months.
The only proactive defense is maintaining an inventory of all external API dependencies with their stability guarantees and deprecation timelines. It’s overhead you’d rather not have, but the alternative is incidents timed perfectly for when a customer is trying to use your product.
We now keep a list of every external API endpoint we hit, tagged with stability status (stable SDK, beta, internal, experimental). When we onboard a new provider, adding their endpoints to this list is part of the integration checklist. Not foolproof, but it would have caught this one.
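The shape of that list, roughly (the entries and the 90-day review window are illustrative):

```python
# api_inventory.py -- registry of external API endpoints and their stability status.
from dataclasses import dataclass
from datetime import date
from enum import Enum

class Stability(Enum):
    STABLE_SDK = "stable_sdk"
    BETA = "beta"
    INTERNAL = "internal"
    EXPERIMENTAL = "experimental"

@dataclass
class Endpoint:
    provider: str
    name: str
    stability: Stability
    deprecation_date: date | None = None  # None = no announced deprecation

INVENTORY = [
    Endpoint("anthropic", "messages.count_tokens", Stability.STABLE_SDK),
    Endpoint("vertex_ai", "generate_content", Stability.STABLE_SDK),
    Endpoint("openai", "chat.completions", Stability.STABLE_SDK),
]

def due_for_review(today: date) -> list[Endpoint]:
    """Endpoints that are unstable or have a deprecation date within 90 days."""
    flagged = []
    for ep in INVENTORY:
        if ep.stability is not Stability.STABLE_SDK:
            flagged.append(ep)
        elif ep.deprecation_date and (ep.deprecation_date - today).days <= 90:
            flagged.append(ep)
    return flagged
```

A CI job that fails whenever due_for_review() returns anything gives us the automated deprecation monitoring from the postmortem without provider-specific tooling.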