building a real-time conversation capture system
ECHO’s core job: a host generates a QR code, participants scan it on their phones, they have a conversation, and we turn that into structured insights. No app download. No account creation. Scan, record, done.
The engineering behind “scan, record, done” is less simple.
participant flow
Participant scans a QR code, lands on a web-based recording portal. Two clicks to start recording. Audio is captured in the browser using the MediaRecorder API and encoded to WebM. Every 30 seconds, a chunk gets uploaded to our backend via a pre-signed URL.
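The backend side of that upload is small. A minimal sketch, assuming S3-compatible object storage via boto3; the bucket name and key layout are placeholders, not our actual naming:

```python
# Hand the browser a short-lived pre-signed PUT URL for each 30-second chunk.
# Assumes S3-compatible object storage via boto3; names are illustrative.
import boto3

s3 = boto3.client("s3")
BUCKET = "echo-audio-chunks"  # hypothetical bucket name


def presign_chunk_upload(session_id: str, chunk_index: int, expires: int = 300) -> str:
    """Return a pre-signed URL the browser can PUT one WebM chunk to."""
    key = f"{session_id}/chunk-{chunk_index:05d}.webm"
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": BUCKET, "Key": key, "ContentType": "audio/webm"},
        ExpiresIn=expires,
    )
```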
We don’t wait for the conversation to end before processing starts. Each chunk is independently uploaded, processed, and queued for transcription. The transcription pipeline is already working on the first 30 seconds while minute 10 is still being recorded.
The chunking has a nice side effect: if someone’s phone loses connection temporarily, they only lose the current chunk, not the entire recording. When connection resumes, uploading picks up from the next chunk.
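On the backend, that resilience mostly falls out of naming chunks by sequence index. A rough sketch of reassembly that tolerates gaps, following the key layout from the upload sketch above (the real pipeline may track chunks differently):

```python
# Order a session's chunks by index and report any gaps left by dropped uploads.
import re

CHUNK_RE = re.compile(r"chunk-(\d+)\.webm$")


def ordered_chunks(keys: list[str]) -> tuple[list[str], list[int]]:
    """Return chunk keys sorted by index, plus any missing (lost) indices."""
    indexed = sorted(
        (int(m.group(1)), key) for key in keys if (m := CHUNK_RE.search(key))
    )
    present = [i for i, _ in indexed]
    missing = []
    if present:
        missing = [i for i in range(present[0], present[-1] + 1) if i not in set(present)]
    return [key for _, key in indexed], missing
```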
processing pipeline
Once a chunk lands in storage, an event kicks off the processing chain:
- Audio validation: is this actually audio? Is the file corrupt? Is there content, or is it 30 seconds of silence?
- Quality assessment: background noise level, audio quality score, crosstalk detection
- Transcription: WhisperX for speech-to-text, PyAnnote for speaker diarization
- Structured output: segments with speaker labels, inline uncertainty markers, quality flags
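Roughly, the per-chunk chain looks like this. The function bodies are stand-ins; in production each stage is its own queue step rather than a single in-process call:

```python
# Shape of the per-chunk processing chain, triggered when a chunk lands in storage.
import os
from dataclasses import dataclass, field


@dataclass
class ChunkResult:
    session_id: str
    chunk_index: int
    segments: list = field(default_factory=list)        # speaker-labelled transcript segments
    quality_flags: list = field(default_factory=list)   # e.g. "noisy", "crosstalk", "silent"


def validate_audio(path: str) -> bool:
    # Stand-in: the real validator decodes the file and checks for silence/corruption.
    return os.path.exists(path) and os.path.getsize(path) > 0


def assess_quality(path: str) -> list[str]:
    # Stand-in: the real step scores background noise and detects crosstalk.
    return []


def enqueue_transcription(path: str) -> list[dict]:
    # Stand-in: the real step queues the chunk for a GPU worker (next section).
    return []


def process_chunk(session_id: str, chunk_index: int, audio_path: str) -> ChunkResult:
    """Per-chunk chain: validate -> assess quality -> queue for transcription."""
    result = ChunkResult(session_id, chunk_index)
    if not validate_audio(audio_path):
        result.quality_flags.append("invalid_audio")
        return result
    result.quality_flags.extend(assess_quality(audio_path))
    result.segments = enqueue_transcription(audio_path)
    return result
```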
Transcription runs on GPU workers (RunPod) because WhisperX needs CUDA. We use serverless workers that spin up on demand. During a large event with 200 participants, we might need significant GPU capacity for 30 minutes, then nothing for the rest of the day. Paying for always-on GPU instances doesn’t make sense for our usage pattern.
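A worker skeleton, assuming RunPod's serverless Python SDK. The event payload shape and transcribe_chunk() are illustrative; the point is that the model loads once per cold start and each invocation just processes a chunk:

```python
# Serverless GPU worker skeleton using the runpod SDK. The WhisperX model would
# be loaded at module import so each cold start pays that cost only once.
import runpod


def transcribe_chunk(audio_url: str) -> dict:
    # Placeholder for the WhisperX + PyAnnote pipeline (sketched below).
    return {"segments": [], "language": None}


def handler(event):
    audio_url = event["input"]["audio_url"]   # hypothetical input field
    return transcribe_chunk(audio_url)


runpod.serverless.start({"handler": handler})
```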
multilingual without configuration
ECHO handles 50+ languages natively. We don’t ask participants what language they’re speaking. WhisperX detects the language from the audio and transcribes accordingly. A single event can have conversations in Dutch, English, Turkish, and Arabic, all through the same pipeline without any configuration.
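Per the WhisperX README, the per-chunk transcription looks roughly like this; the exact API varies a bit across versions, and the path and HF token are placeholders. No language is passed in, so detection comes from the audio itself:

```python
# Transcribe one chunk with WhisperX: auto language detection, word alignment,
# then speaker labels from PyAnnote diarization via WhisperX's wrapper.
import whisperx

device = "cuda"
model = whisperx.load_model("large-v2", device, compute_type="float16")

audio = whisperx.load_audio("chunk-00001.webm")
result = model.transcribe(audio, batch_size=16)     # result["language"] is auto-detected

align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

diarizer = whisperx.DiarizationPipeline(use_auth_token="HF_TOKEN", device=device)
result = whisperx.assign_word_speakers(diarizer(audio), result)
```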
The tricky edge case is code-switching: a speaker alternating between languages mid-sentence. It happens constantly in diverse communities. WhisperX handles it reasonably well for dominant language pairs, but less common combinations produce messy output. We flag these segments with uncertainty markers rather than pretending the transcription is clean.
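One way to attach those markers, assuming aligned WhisperX output where each word carries a confidence score; the threshold and marker format below are illustrative, not our exact convention:

```python
# Wrap low-confidence words in an inline uncertainty marker so downstream
# analysis (and reviewers) can see where the transcript is shaky.
def mark_uncertain(segment: dict, threshold: float = 0.5) -> str:
    words = []
    for w in segment.get("words", []):
        token = w["word"]
        if w.get("score", 1.0) < threshold:
            token = f"[?{token}]"          # flag low-confidence words inline
        words.append(token)
    return " ".join(words)
```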
scale constraints
Load testing showed the system handles 100-250 concurrent recording sessions comfortably. For our typical deployment (municipal stakeholder events, citizen assemblies), that’s more than enough. Most events have 50-200 participants in simultaneous conversations.
The bottleneck isn’t the web server or chunk upload. It’s audio processing and transcription. GPU workers have finite throughput, and WhisperX processing time is proportional to audio length. During peak events, queue depth grows and transcription latency increases. Acceptable because the analysis (which users actually care about) happens after the event, not during recording.
what we’re still working on
Real-time transcription with live feedback to participants, showing them a rough transcript as they speak. This requires streaming audio to the transcription model rather than batching it in 30-second chunks. The quality-latency trade-off is brutal: fast approximate text, or slow accurate text. We’re experimenting with showing real-time approximate transcripts with inline uncertainty markers that get cleaned up in a second pass.
The other open problem is evaluation. How do you measure transcription quality systematically? We’re building an eval system using human reference transcripts as ground truth. Every model change or pipeline tweak gets measured against these references before shipping to production.
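The core of that check is unglamorous: score every change against the reference set and refuse to ship regressions. A sketch using jiwer for word error rate; jiwer is one common choice, and the threshold here is illustrative rather than our actual bar:

```python
# Regression gate: compare pipeline output against human reference transcripts
# and fail the check if average WER across the eval set gets worse than a cap.
import jiwer


def regression_check(references: dict[str, str], hypotheses: dict[str, str],
                     max_wer: float = 0.25) -> bool:
    """Return True if average WER over the eval set stays within max_wer."""
    scores = [jiwer.wer(references[k], hypotheses[k]) for k in references]
    return sum(scores) / len(scores) <= max_wer
```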
The goal isn’t perfect transcription. It’s transcription reliable enough that the downstream LLM analysis produces trustworthy insights. Optimize for the end-to-end outcome, not word error rate in isolation.