PII redaction can't be a progressive enhancement

Jorim flagged this in Slack and it immediately became Urgent:

PII redaction is a progressive enhancement, so the first time you see “redacted” transcript it contains names that get redacted after a little while. Redaction ideally needs to be something you activate project wide and happens at the source, blocking access to the audio and waiting for transcripts to be redacted before showing them to anyone.

He was right. Not a bug patch. Architecture change.

What we had: transcription comes in, gets displayed, then PII redaction runs as a post-processing step and updates the transcript. For a few seconds (sometimes longer under load), users see unredacted text with real names. Then it gets replaced. The use_pii_redaction flag only worked during re-transcription, not on first pass.

“Progressive enhancement” is great for CSS features and UI polish. Terrible for privacy. You can’t progressively enhance someone’s personal information being exposed.

What we designed instead:

New pipeline: dembrane_26_01_redaction. Based on our existing dembrane_25_09 transcription pipeline but with redaction baked in. Critical rule: don’t set the transcript field to any intermediary result when PII redaction is enabled. No partial results. The transcript stays empty until it’s been through redaction.
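
To make the critical rule concrete, here is a minimal sketch. The Conversation record, its field names, and the redact callable are all hypothetical and not taken from our codebase. The raw transcription only ever lives in a local variable, and the transcript field is written exactly once, with text that is safe to show.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Conversation:
    # Hypothetical record; field names are illustrative only.
    id: str
    anonymize: bool
    transcript: Optional[str] = None  # stays None until it is safe to show


def store_transcript(
    conversation: Conversation,
    raw_text: str,
    redact: Callable[[str], str],
) -> None:
    """Write the transcript field exactly once, with the final text."""
    if conversation.anonymize:
        # The raw text is never assigned to the record, not even briefly.
        conversation.transcript = redact(raw_text)
    else:
        conversation.transcript = raw_text
```

The point of the shape is that there is no code path where an anonymized conversation holds unredacted text, so a reader polling the record sees either nothing or the redacted result.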

Project-level toggle. “Anonymize transcripts” as a project setting, not per-conversation. A host creates a project for a municipality town hall, toggles anonymization once, and every conversation in that project gets the treatment. Eventually the project setting will inherit from workspace-level settings.

Block audio access. When anonymization is on, audio files are gated until redaction completes. Otherwise anyone could just listen to the recording and hear the names that were redacted from the text.
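
A rough sketch of how the project toggle and the audio gate could fit together. Project, Recording, RedactionStatus, and can_access_audio are hypothetical names for illustration, and what counts as "complete" for a recording is left open here.

```python
from dataclasses import dataclass
from enum import Enum


class RedactionStatus(Enum):
    PENDING = "pending"
    DONE = "done"


@dataclass
class Project:
    # Hypothetical project-level setting ("Anonymize transcripts").
    anonymize_transcripts: bool


@dataclass
class Recording:
    project: Project
    redaction_status: RedactionStatus = RedactionStatus.PENDING


def can_access_audio(recording: Recording) -> bool:
    """Gate audio on the project setting plus the redaction status."""
    if not recording.project.anonymize_transcripts:
        return True  # anonymization is off, nothing to wait for
    return recording.redaction_status is RedactionStatus.DONE
```

Because the check reads the setting from the project rather than the conversation, flipping the toggle once covers every recording in that project.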

Redaction approach. We started with regex patterns for emails, phone numbers, and addresses. For names (the hardest part) we’re researching hybrid approaches. Pure regex can’t catch “Jan mentioned that Pieter said…” but LLM-based approaches add latency, and in a blocking pipeline that latency is time the user spends waiting for a transcript.
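
For a sense of what the regex layer covers and what it misses, here is a small sketch. The patterns and the redact_structured_pii name are illustrative assumptions, not our production rules, and the phone pattern in particular will both miss formats and flag some long digit runs that are not phone numbers.

```python
import re

# Illustrative patterns only; real-world formats vary far more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Optional "+", then at least nine digits with spaces, dots, dashes or
# parentheses allowed in between.
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")


def redact_structured_pii(text: str) -> str:
    """Replace emails and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


print(redact_structured_pii("Mail jan@example.com or call +31 6 12345678."))
print(redact_structured_pii("Jan mentioned that Pieter said hi"))
```

The first line comes back with both placeholders in place. The second comes back untouched, which is exactly the gap with names: nothing in the text's shape tells a regex that "Jan" and "Pieter" are people.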

Privacy needs to be a mode, not a filter. Either the system is collecting and displaying PII, or it’s not. The in-between state where it “eventually” gets redacted is the worst of both worlds.

We version our pipelines by date (dembrane_25_09, dembrane_26_01_redaction), which has been useful for exactly this kind of change. Old conversations keep their pipeline. New ones get the upgraded version. No migration needed.
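
Roughly, the idea is that each conversation pins a pipeline name when it is created and processing looks that name up in a registry. The sketch below uses the two real pipeline names, but the registry, the placeholder functions, and the Conversation fields are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


def transcribe_25_09(audio_ref: str) -> str:
    # Placeholder standing in for the older pipeline.
    return f"transcript of {audio_ref}"


def transcribe_26_01_redaction(audio_ref: str) -> str:
    # Placeholder standing in for the pipeline with redaction built in.
    return f"redacted transcript of {audio_ref}"


PIPELINES: Dict[str, Callable[[str], str]] = {
    "dembrane_25_09": transcribe_25_09,
    "dembrane_26_01_redaction": transcribe_26_01_redaction,
}

CURRENT_PIPELINE = "dembrane_26_01_redaction"


@dataclass
class Conversation:
    audio_ref: str
    pipeline: str = CURRENT_PIPELINE  # pinned when the conversation is created


def process(conversation: Conversation) -> str:
    """Run whichever pipeline the conversation was created with."""
    return PIPELINES[conversation.pipeline](conversation.audio_ref)
```

An existing conversation that stored "dembrane_25_09" keeps resolving to the old function, while anything created after the default changes picks up the new one, which is why no migration is involved.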

ECHO-664, still in progress.