# TTS pipeline (v0.2 plan)

Generation lands in v0.2; narration follows. This doc captures the shape so
the v0.1 schema decisions don't drift away from the v0.2 implementation.

## Render path

```
chapter row (chapters.body_md)
        │
        ▼
1. pronunciation pre-pass
        │   pull pronunciation_overrides for (story_id, *) UNION global
        │   substitute proper nouns with phoneme markers F5 understands
        ▼
2. F5-TTS render
        │   reference: voices.reference_path + voices.reference_text
        │   output: .wav, 24 kHz, mono
        ▼
3. narration_runs row
        │   chapter_id, voice_id, engine="f5-tts", engine_version, seed,
        │   output_path, duration_seconds, status="succeeded"
        ▼
4. POST-RENDER AUDIT CHAIN  ◄── this is the cwho-shape audit we use to catch
                                "spoken word that would wake you up at 2am"
```

## Audit chain

### Tier 1: Whisper-large-v3 STT + word-diff (text-to-text)

```
chapter.wav
   │
   ▼
ffmpeg -i chapter.wav -af "silenceremove=..." -f segment -segment_time 30 -reset_timestamps 1 chunk%04d.wav
   │
   ▼
for each chunk:
   whisper-large-v3 → transcript
   word-level diff vs (source_text + pronunciation_overrides applied)
   │
   ▼
substitutions / drops / inserts
   │
   ▼
narration_findings rows:
   kind = pronunciation | skip | insert
   timestamp_start/end (chunk-relative + chunk offset)
   expected_text, heard_text
   severity:
      pronunciation: warn  (one wrong word: annoyance)
      skip:          crit  (line silently dropped: listener loses thread)
      insert:        warn  (extra word, usually a glitch)
   detector = "whisper-large-v3"
```

Whisper runs locally on Lucy's 8GB GPU (~3GB VRAM). Free, fast. Captures
mispronunciations that come out as a different real word.

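
The word-level diff step in Tier 1 could look roughly like the sketch below. It is only an illustration, not the planned implementation: `Finding`, `normalize`, and `diff_words` are hypothetical names, the diff uses Python's standard `difflib`, and it assumes the expected text already has pronunciation overrides applied. Mapping every dropped span to `skip` (and therefore `crit`) is also an assumption; the real detector may want to distinguish a single dropped word from a dropped line.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher

# Severity per finding kind, copied from the Tier 1 diagram.
SEVERITY = {"pronunciation": "warn", "skip": "crit", "insert": "warn"}

# difflib opcode tag -> narration_findings.kind (this mapping is an assumption).
TAG_TO_KIND = {"replace": "pronunciation", "delete": "skip", "insert": "insert"}


@dataclass
class Finding:
    kind: str
    severity: str
    expected_text: str
    heard_text: str


def normalize(text: str) -> list:
    """Lowercase and split into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def diff_words(expected: str, heard: str) -> list:
    """Compare source words against the transcript, one Finding per differing span."""
    exp, got = normalize(expected), normalize(heard)
    matcher = SequenceMatcher(a=exp, b=got, autojunk=False)
    findings = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        kind = TAG_TO_KIND[tag]
        findings.append(
            Finding(kind, SEVERITY[kind], " ".join(exp[i1:i2]), " ".join(got[j1:j2]))
        )
    return findings


if __name__ == "__main__":
    for f in diff_words(
        "Kaelith rode north to the old gate.",
        "Kayla rode to the old old gate",
    ):
        print(f)
```

Timestamps are left out on purpose: they would come from the Whisper word timings plus the chunk offset, which this sketch does not model.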
### Tier 2: audio-native LLM review (subjective)

Optional, more expensive, covers what Whisper can't:

```
chapter.wav + source_text
   │
   ▼
clawdforge.run({
   model: "gemini-flash-audio" | "gpt-4o-audio",
   files: [chapter.wav],
   prompt: "...listen to this rendered audiobook chapter...flag spans
            where (a) inflection breaks meaning, (b) pacing makes the
            listener lose the thread, (c) any audio glitch or dropout,
            (d) emotional tone is wrong..."
})
   │
   ▼
narration_findings rows:
   kind = prosody | tone | glitch
   severity per the model's call
   detector = "gemini-flash-audio" (etc)
```

Claude (the model behind clawdforge's default) doesn't have audio modality
(as of mid-2026). Audio review calls route through clawdforge to Gemini
Flash / GPT-4o-audio when needed. Cost: ~$0.05/chapter for Gemini Flash.

### Reroll loop

```
SELECT count(*) FROM narration_findings
 WHERE run_id = $1 AND severity = 'crit' AND NOT resolved
   │
   ▼
if > 0:
   update narration_runs.status = 'rerolled'
   queue a new render with seed = $old_seed + 1
   │
   ▼
narration_findings rows referencing the old run_id stay (audit trail);
the new run gets its own findings set
```

After two reroll attempts with crit findings still present, the chapter is
marked for operator review (out-of-band, e.g., re-edit the source text, add
more pronunciation overrides, hand-patch the audio).

## Why Whisper for text-to-text verification

Whisper's word-error-rate on clean audiobook audio is ~3-5%. That means
*most* of what it transcribes back will match the source exactly. The
*deltas* are mostly the words the TTS got wrong. The false-positive rate
from genuine Whisper transcription errors (the audio was right, but Whisper
wrote down something different) is small enough to treat as noise; those
findings get autoresolved if they don't reproduce on a reroll.

Whisper's proper-noun WER is HIGHER than overall WER, which is actually
what we want here.
The harder Whisper finds it to transcribe a name back from F5's output, the
more likely F5 got the name wrong in the first place.

## What this doc is NOT

- Not a prompt template for the audio-LLM call (TBD in v0.2)
- Not a Whisper config file (TBD: which Whisper, language, beam size, VAD)
- Not the TTS sidecar container shape (separate spec)

## When this gets wired

After the generation pipeline (gen → cleanup → canon-audit) is working
end-to-end. TTS is downstream of "we have a story to render."

Order of v0.2 work:

1. forge.rs prompt templates (gen/cleanup/audit): fill in stubs
2. context.rs: assemble blob from DB for a given story_id
3. web UI for queueing + status
4. TTS sidecar container: F5 + ffmpeg + Whisper + this audit chain
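
For the sidecar work in item 4, the decision rule from the Reroll loop section can be written as a small pure function. This is a minimal sketch under assumptions: `Decision`, `decide`, and `MAX_REROLLS` are hypothetical names, and the caller is assumed to supply the unresolved crit count (from the `narration_findings` query) and the number of rerolls already attempted for the chapter.

```python
from dataclasses import dataclass
from typing import Optional

# The doc escalates to operator review after two reroll attempts.
MAX_REROLLS = 2


@dataclass
class Decision:
    action: str  # "accept" | "reroll" | "operator_review"
    next_seed: Optional[int] = None


def decide(unresolved_crit: int, seed: int, rerolls_so_far: int) -> Decision:
    """Pick the next step for a finished narration run."""
    if unresolved_crit == 0:
        # No unresolved crit findings: the run stands.
        return Decision("accept")
    if rerolls_so_far >= MAX_REROLLS:
        # Rerolls exhausted with crit findings still present.
        return Decision("operator_review")
    # Mark the old run 'rerolled' and queue a new render with seed + 1.
    return Decision("reroll", next_seed=seed + 1)


if __name__ == "__main__":
    print(decide(unresolved_crit=1, seed=41, rerolls_so_far=0))
    print(decide(unresolved_crit=1, seed=43, rerolls_so_far=2))
```

Keeping the rule free of database access makes it easy to unit-test apart from the render queue; the status update and the queue insert would sit in the caller.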