why we built a multi-model LLM router

Updated January 22, 2026

ECHO processes conversations during live stakeholder events. Hundreds of people talking at once, all getting transcribed and analyzed through LLM pipelines. The worst time for your AI provider to return 429s is when a municipality has 200 citizens in a room and your platform goes quiet.

That’s exactly what kept happening.

the DSQ problem

We run on Google Cloud’s Vertex AI, primarily in europe-west4 (EU data residency requirement). The 429 errors mentioned DSQ, Dynamic Shared Quota: not a hard quota on your account, but Google managing capacity across a shared pool. When instantaneous demand outstrips supply, everyone in the pool gets throttled.

Our usage is inherently spiky. We don’t run steady-state inference. We run events. A session starts, 50 conversations begin recording simultaneously, and suddenly we need to process all of them through Gemini at once. DSQ hates this pattern.

I spent time writing to Google Cloud support, explaining the issue, asking about priority allocation. Applied for Google Cloud for Startups. Tried to get Claude models working through Model Garden. Hit walls everywhere. Meanwhile, users were getting intermittent failures during actual paid sessions.

the fix

Instead of waiting for Google to solve our capacity problem, we built a multi-model router using LiteLLM:

from litellm import Router

# Both deployments share the model_name "gemini-pro", so the router
# load-balances requests across the two regions and can retry on the
# other deployment when one region throttles us.
router = Router(
    model_list=[
        {
            "model_name": "gemini-pro",
            "litellm_params": {
                "model": "vertex_ai/gemini-1.5-pro",
                "vertex_project": "dembrane-echo",
                "vertex_location": "europe-west4",
                "vertex_credentials": "path/to/sa.json",
            },
        },
        {
            "model_name": "gemini-pro",
            "litellm_params": {
                "model": "vertex_ai/gemini-1.5-pro",
                "vertex_project": "dembrane-echo",
                "vertex_location": "europe-west1",
                "vertex_credentials": "path/to/sa.json",
            },
        },
    ],
    routing_strategy="simple-shuffle",  # randomly pick among healthy deployments
    num_retries=3,
    timeout=600,
)

Vertex AI quotas are per-region. Same service account, same project, but different regions have independent capacity pools. By spreading requests across europe-west4, europe-west1, and even us-central1 as a last resort, we dramatically reduced our exposure to any single region’s DSQ contention.
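
In practice that’s just more entries in the same "gemini-pro" group. Here’s a minimal sketch of the three-region spread, assuming simple-shuffle respects a per-deployment weight key to keep us-central1 as a low-traffic last resort; the helper function, weights, and credential path are illustrative, not our exact config:

from litellm import Router

def vertex_deployment(location: str, weight: int) -> dict:
    # Every deployment shares the "gemini-pro" group name; only the region differs.
    return {
        "model_name": "gemini-pro",
        "litellm_params": {
            "model": "vertex_ai/gemini-1.5-pro",
            "vertex_project": "dembrane-echo",
            "vertex_location": location,
            "vertex_credentials": "path/to/sa.json",
            "weight": weight,  # assumption: simple-shuffle biases traffic by weight
        },
    }

router = Router(
    model_list=[
        vertex_deployment("europe-west4", weight=5),  # primary: EU data residency
        vertex_deployment("europe-west1", weight=5),  # second EU capacity pool
        vertex_deployment("us-central1", weight=1),   # last resort, rarely picked
    ],
    routing_strategy="simple-shuffle",
    num_retries=3,
    timeout=600,
)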

We also added cross-provider fallbacks. If Vertex is completely unavailable, we fall back to OpenAI’s or Anthropic’s direct APIs. The router handles this transparently: our application code just calls router.completion() and doesn’t care which provider actually served the response.
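
Wiring that up looks roughly like the following sketch, assuming the API keys live in environment variables; the backup group names and exact model ids are illustrative:

import os

from litellm import Router

router = Router(
    model_list=[
        # One of the regional Vertex AI deployments from above.
        {
            "model_name": "gemini-pro",
            "litellm_params": {
                "model": "vertex_ai/gemini-1.5-pro",
                "vertex_project": "dembrane-echo",
                "vertex_location": "europe-west4",
                "vertex_credentials": "path/to/sa.json",
            },
        },
        # Cross-provider backups, only used when the fallback chain kicks in.
        {
            "model_name": "openai-backup",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ.get("OPENAI_API_KEY"),
            },
        },
        {
            "model_name": "anthropic-backup",
            "litellm_params": {
                "model": "anthropic/claude-3-5-sonnet-20240620",
                "api_key": os.environ.get("ANTHROPIC_API_KEY"),
            },
        },
    ],
    # If every "gemini-pro" deployment fails, retry on OpenAI, then Anthropic.
    fallbacks=[{"gemini-pro": ["openai-backup", "anthropic-backup"]}],
    num_retries=3,
    timeout=600,
)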

what we told users

The changelog read: “Improved AI reliability with smarter infrastructure routing.”

No mention of 429s, DSQ, or fighting Google Cloud’s capacity management. Users just saw fewer failures.

what I’d take away from this

You can’t depend on a single LLM provider for production workloads. Even if they’re your primary provider, even if you have a good relationship with support, even if you’ve applied for every startup program they offer. Capacity is shared and your spiky workload will always lose to someone else’s steady-state usage.

Build the fallback layer before you need it. LiteLLM makes this straightforward: abstract the provider behind a router, configure your fallback chain, and your application code stays clean.
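
On the application side, that boils down to one call against the model group name. A sketch of the call-site pattern; the module path and helper function are hypothetical:

from myapp.llm import router  # the configured Router instance (hypothetical module path)

def analyze_transcript(transcript: str) -> str:
    # The app asks for the "gemini-pro" group; the router decides which
    # region or provider actually serves the request.
    response = router.completion(
        model="gemini-pro",
        messages=[
            {"role": "system", "content": "You analyze stakeholder conversations."},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content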

Also: when you’re a small startup hitting DSQ limits, you’re not important enough to get priority allocation. That’s not cynicism, it’s resource planning. Build your system assuming you won’t get special treatment.