skald/docs/tts-pipeline.md

# TTS pipeline (v0.2 plan)

Generation lands in v0.2; narration follows. This doc captures the
shape so the v0.1 schema decisions don't drift away from the v0.2
implementation.

## Render path

```
chapter row (chapters.body_md)
    │
    ▼
1. pronunciation pre-pass
    │   pull pronunciation_overrides for (story_id, *) UNION global
    │   substitute proper nouns with phoneme markers F5 understands
    ▼
2. F5-TTS render
    │   reference: voices.reference_path + voices.reference_text
    │   output: .wav, 24 kHz, mono
    ▼
3. narration_runs row
    │   chapter_id, voice_id, engine="f5-tts", engine_version, seed,
    │   output_path, duration_seconds, status="succeeded"
    ▼
4. POST-RENDER AUDIT CHAIN  ◄── this is the cwho-shape audit we use
                                 to catch "spoken word that would
                                 wake you up at 2am"
```
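
Step 1 can be sketched as a plain text substitution. This is a minimal sketch, not the planned implementation: the function name `apply_overrides`, the dict shape of the overrides, and the respelling strings are all assumptions for illustration, and the real marker format depends on what F5 accepts. It shows the one rule stated above, which is that story-scoped overrides are combined with global ones (here the story-scoped entry wins on a conflict, which is also an assumption).

```python
import re


def apply_overrides(text, story_overrides, global_overrides):
    """Replace each overridden term with its marker string.

    Both override arguments are hypothetical {term: marker} dicts.
    Story-scoped entries take precedence over global ones.
    """
    merged = {**global_overrides, **story_overrides}
    if not merged:
        return text
    # Longest terms first so "Ana Maria" is tried before "Ana".
    terms = sorted(merged, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    )
    # A function replacement avoids backslash handling in marker strings.
    return pattern.sub(lambda m: merged[m.group(0)], text)


if __name__ == "__main__":
    global_overrides = {"Siobhan": "shiv-AWN", "Ana": "AH-nah"}
    story_overrides = {"Siobhan": "shuh-VAWN", "Ana Maria": "AH-nah mah-REE-ah"}
    print(apply_overrides("Siobhan met Ana Maria and Ana.",
                          story_overrides, global_overrides))
```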

## Audit chain

### Tier 1 — Whisper-large-v3 STT + word-diff (text-to-text)

```
chapter.wav
    │
    ▼
ffmpeg -i chapter.wav -af "silenceremove=..." \
    -f segment -segment_time 30 -reset_timestamps 1 chunk%04d.wav
    │
    ▼
for each chunk:
    whisper-large-v3 → transcript
    word-level diff vs (source_text + pronunciation_overrides applied)
    │
    ▼
substitutions / drops / inserts
    │
    ▼
narration_findings rows:
    kind = pronunciation | skip | insert
    timestamp_start/end (chunk-relative + chunk offset)
    expected_text, heard_text
    severity:
        pronunciation: warn  (one wrong word — annoyance)
        skip:          crit  (line silently dropped — listener loses thread)
        insert:        warn  (extra word, usually a glitch)
    detector = "whisper-large-v3"
```
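
The "chunk-relative + chunk offset" arithmetic for `timestamp_start/end` might look like the sketch below. It assumes fixed-length chunks numbered from zero, and the helper name `to_chapter_time` is hypothetical. Because the segmenting step runs after `silenceremove`, the resulting times are relative to the filtered audio, not necessarily to the original `chapter.wav`.

```python
def to_chapter_time(chunk_index, start_in_chunk, end_in_chunk,
                    segment_seconds=30.0):
    """Convert chunk-relative seconds to chapter-relative seconds.

    Assumes every chunk before this one is exactly segment_seconds long.
    """
    offset = chunk_index * segment_seconds
    return (offset + start_in_chunk, offset + end_in_chunk)


if __name__ == "__main__":
    # A finding 1.5s into the third chunk (index 2).
    print(to_chapter_time(2, 1.5, 2.0))
```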

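Whisper runs locally on Lucy's 8GB GPU. The ~3GB VRAM figure assumes a
quantized runtime (for example int8 via faster-whisper); the reference
PyTorch implementation of the large model needs roughly 10GB. It is free
and fast, and it catches mispronunciations that come out as a different
real word.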
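
The word-level diff in Tier 1 could be prototyped with the standard library's `difflib.SequenceMatcher`. This is a sketch under assumptions: the function names, the normalization rule, and the dict shape of a finding are illustrative, and timestamps are left out because they depend on how word timings are obtained from Whisper. The mapping of diff operations to `kind` and `severity` follows the diagram above.

```python
import difflib
import re

# Severity per finding kind, as listed in the Tier 1 diagram.
SEVERITY = {"pronunciation": "warn", "skip": "crit", "insert": "warn"}

# difflib opcode tag -> narration_findings.kind
KIND_FOR_TAG = {"replace": "pronunciation", "delete": "skip", "insert": "insert"}


def normalize(text):
    """Lowercase and drop punctuation so only word identity is compared."""
    return re.findall(r"[a-z0-9']+", text.lower())


def diff_findings(expected_text, heard_text, detector="whisper-large-v3"):
    """Return one finding dict per non-matching span between the texts."""
    expected = normalize(expected_text)
    heard = normalize(heard_text)
    matcher = difflib.SequenceMatcher(a=expected, b=heard, autojunk=False)
    findings = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        kind = KIND_FOR_TAG[tag]
        findings.append({
            "kind": kind,
            "expected_text": " ".join(expected[i1:i2]),
            "heard_text": " ".join(heard[j1:j2]),
            "severity": SEVERITY[kind],
            "detector": detector,
        })
    return findings


if __name__ == "__main__":
    for finding in diff_findings("He opened the door and left.",
                                 "he opened the door"):
        print(finding)
```

Note that a substitution next to a dropped word comes back from `difflib` as a single `replace` span of unequal length, so a real detector may want to split such spans before assigning severity.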

### Tier 2 — audio-native LLM review (subjective)

Optional, more expensive, covers what Whisper can't:

```
chapter.wav + source_text
    │
    ▼
clawdforge.run({
    model: "gemini-flash-audio" | "gpt-4o-audio",
    files: [chapter.wav],
    prompt: "...listen to this rendered audiobook chapter...flag
             spans where (a) inflection breaks meaning, (b) pacing
             makes the listener lose the thread, (c) any audio
             glitch or dropout, (d) emotional tone is wrong..."
})
    │
    ▼
narration_findings rows:
    kind = prosody | tone | glitch
    severity per the model's call
    detector = "gemini-flash-audio" (etc)
```

Claude (the model behind clawdforge's default) has no audio input
modality as of mid-2026, so audio review requests route through
clawdforge to Gemini Flash or GPT-4o-audio when needed. Cost:
~$0.05/chapter for Gemini Flash.

### Reroll loop

```
SELECT count(*) FROM narration_findings
 WHERE run_id = $1 AND severity = 'crit' AND NOT resolved
    │
    ▼ if > 0:
update narration_runs.status = 'rerolled'
queue a new render with seed = $old_seed + 1
    │
    ▼
narration_findings rows referencing the old run_id stay
(audit trail) — the new run gets its own findings set
```

After two reroll attempts with crit findings still present, the
chapter is marked for operator review. Resolution is out-of-band:
re-edit the source text, add more pronunciation overrides, or
hand-patch the audio.
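
The decision rule of the reroll loop can be written as a small pure function. This is a sketch only: `next_step`, `MAX_REROLLS`, and the action names are assumptions for illustration, and the unresolved crit count is taken as an input instead of being queried from `narration_findings`.

```python
MAX_REROLLS = 2  # after this many rerolls, stop and hand off to an operator


def next_step(crit_unresolved, rerolls_so_far, seed):
    """Decide what happens to a run once its audit findings are in.

    crit_unresolved: result of the count(*) query for this run
    rerolls_so_far:  how many rerolls this chapter has already had
    seed:            seed used by the run that was just audited
    """
    if crit_unresolved == 0:
        return {"action": "accept"}
    if rerolls_so_far >= MAX_REROLLS:
        return {"action": "operator_review"}
    # Old run is marked 'rerolled'; the new render uses the next seed.
    return {"action": "reroll", "seed": seed + 1}


if __name__ == "__main__":
    # First render has crit findings, as do both rerolls.
    print(next_step(3, 0, 41))
    print(next_step(1, 1, 42))
    print(next_step(1, 2, 43))
```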

## Why Whisper for text-to-text verification

Whisper's word-error-rate on clean audiobook audio is ~3-5%, so
*most* of what it transcribes back will match the source exactly.
The *deltas* that remain are the candidate words the TTS got wrong.

The false-positive rate from genuine Whisper transcription errors (F5
said it right, Whisper transcribed something different) is small enough
to treat as noise: those findings get autoresolved if they don't
reproduce on a reroll.
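
One way the autoresolve rule could work is to compare the old run's findings with the new run's findings and resolve whatever did not recur. The sketch below assumes findings are dicts and matches them on `(kind, expected_text)`; that matching key and the function name `autoresolve` are assumptions, and a real implementation would probably also compare position, since the same word can appear many times in a chapter.

```python
def autoresolve(old_findings, new_findings):
    """Return copies of old_findings with 'resolved' set.

    A finding is resolved if it was already resolved, or if no finding
    with the same (kind, expected_text) shows up in the new run.
    """
    recurring = {(f["kind"], f["expected_text"]) for f in new_findings}
    updated = []
    for finding in old_findings:
        key = (finding["kind"], finding["expected_text"])
        resolved = finding.get("resolved", False) or key not in recurring
        updated.append({**finding, "resolved": resolved})
    return updated


if __name__ == "__main__":
    old = [
        {"kind": "pronunciation", "expected_text": "caer", "heard_text": "care"},
        {"kind": "skip", "expected_text": "and left", "heard_text": ""},
    ]
    new = [{"kind": "skip", "expected_text": "and left", "heard_text": ""}]
    for finding in autoresolve(old, new):
        print(finding["kind"], finding["resolved"])
```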

Whisper's proper-noun WER is HIGHER than overall WER — which is
actually what we want here. The harder Whisper finds it to
transcribe a name back from F5's output, the more likely F5 got
the name wrong in the first place.

## What this doc is NOT

- Not a prompt template for the audio-LLM call (TBD in v0.2)
- Not a Whisper config file (TBD: which Whisper, language, beam size, VAD)
- Not the TTS sidecar container shape (separate spec)

## When this gets wired

After the generation pipeline (gen → cleanup → canon-audit) is
working end-to-end. TTS is downstream of "we have a story to
render."

Order of v0.2 work:

1. forge.rs prompt templates (gen/cleanup/audit) — fill in stubs
2. context.rs — assemble blob from DB for a given story_id
3. web UI for queueing + status
4. TTS sidecar container — F5 + ffmpeg + Whisper + this audit chain