skald/docs/tts-pipeline.md
Kayos f71b533e52 v0.2 scaffold: vendor clawdforge SDK + forge module + Whisper plan
The Rust SDK already existed at Sulkta-Coop/clawdforge clients/rust/ — async,
reqwest-based, bearer-auth, exposes Client::run() + Session for multi-turn.
Vendoring it into vendor/clawdforge so skald is self-contained: no
git-submodule + no needing the clawdforge repo cloned next to skald.
Trade-off accepted: updates require manual re-copy until both sides
stabilize and we publish to a private cargo registry.

What landed:

- vendor/clawdforge/ — full SDK source from Sulkta-Coop/clawdforge HEAD.
  Pinned in skald-core/Cargo.toml as a path dep.
- skald-core/src/forge.rs — three-pass orchestration shell. Forge wraps
  clawdforge::Client; generate() / cleanup() / audit() each build a
  RunRequest with the right system prompt + model alias (always opus),
  call client.run(), return a PassOutput.
  Prompt templates are TODO stubs (SYSTEM_GEN_TODO etc) — filling in the
  actual prose-craft prompts is its own deep session.
- skald-core/src/config.rs — ForgeConfig { base_url, app_token, model }.
  Resolved by the binary from env (CLAWDFORGE_URL + CLAWDFORGE_TOKEN);
  lib stays env-agnostic.
- skald-core::AuditFinding + AuditResponse — parse shape for what the
  third-Opus canon audit returns, ready to map onto audit_findings rows.
- docs/tts-pipeline.md — full plan for v0.2 narration + post-TTS audit
  chain. Whisper-large-v3 STT does text-to-text verification on every
  render; an optional Gemini Flash audio pass catches subjective issues
  (prosody, tone) Whisper can't see. Reroll loop on crit findings.

What's still stubbed:

- Prompt templates in forge.rs (gen / cleanup / audit) — placeholders
  that describe the role but don't constrain output shape yet.
- context.rs (assemble the LLM context blob from DB rows) — entire module
  TBD.
- No CLI subcommand yet for invoking forge — that comes after context.rs.

Naming note: in Rust 2024 'gen' is a reserved keyword (for generators),
so the method is Forge::generate(), not Forge::gen().
2026-05-13 10:18:56 -07:00

TTS pipeline (v0.2 plan)

Generation lands in v0.2; narration follows. This doc captures the shape so the v0.1 schema decisions don't drift away from the v0.2 implementation.

Render path

chapter row (chapters.body_md)
    │
    ▼
1. pronunciation pre-pass
    │   pull pronunciation_overrides for (story_id, *) UNION global
    │   substitute proper nouns with phoneme markers F5 understands
    ▼
2. F5-TTS render
    │   reference: voices.reference_path + voices.reference_text
    │   output: .wav, 24 kHz, mono
    ▼
3. narration_runs row
    │   chapter_id, voice_id, engine="f5-tts", engine_version, seed,
    │   output_path, duration_seconds, status="succeeded"
    ▼
4. POST-RENDER AUDIT CHAIN  ◄── this is the cwho-shape audit we use
                                 to catch "spoken word that would
                                 wake you up at 2am"
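Step 1 could be sketched roughly as below. This is an illustration under stated assumptions, not the planned implementation: the `Override` struct, both function names, the rule that story-scoped overrides win over global ones, and whole-word single-token matching are all hypothetical. The replacement strings are placeholders, since the phoneme marker syntax F5 accepts is not specified here.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory form of a pronunciation_overrides row.
struct Override {
    story_id: Option<i64>, // None = global override
    term: String,
    replacement: String, // placeholder for whatever marker F5 understands
}

/// Merge global and story-scoped overrides for one story.
/// Assumption: story-scoped rows win when both define the same term.
fn effective_overrides(rows: &[Override], story_id: i64) -> HashMap<&str, &str> {
    let mut map = HashMap::new();
    // Insert globals first so story-scoped rows overwrite them.
    for row in rows.iter().filter(|r| r.story_id.is_none()) {
        map.insert(row.term.as_str(), row.replacement.as_str());
    }
    for row in rows.iter().filter(|r| r.story_id == Some(story_id)) {
        map.insert(row.term.as_str(), row.replacement.as_str());
    }
    map
}

/// Replace whole alphanumeric words, leaving punctuation and whitespace intact.
/// Matching is case-sensitive and limited to single-word terms.
fn apply_overrides(text: &str, overrides: &HashMap<&str, &str>) -> String {
    let mut out = String::with_capacity(text.len());
    let mut word = String::new();
    for ch in text.chars() {
        if ch.is_alphanumeric() {
            word.push(ch);
        } else {
            flush_word(&mut word, &mut out, overrides);
            out.push(ch);
        }
    }
    flush_word(&mut word, &mut out, overrides);
    out
}

/// Emit the pending word (substituted if an override exists) and clear it.
fn flush_word(word: &mut String, out: &mut String, overrides: &HashMap<&str, &str>) {
    if word.is_empty() {
        return;
    }
    match overrides.get(word.as_str()) {
        Some(replacement) => out.push_str(replacement),
        None => out.push_str(word.as_str()),
    }
    word.clear();
}
```

Splitting on non-alphanumeric characters means a possessive such as "Eira's" still matches the term "Eira". Multi-word names would need a different matcher.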

Audit chain

Tier 1 — Whisper-large-v3 STT + word-diff (text-to-text)

chapter.wav
    │
    ▼
ffmpeg -i chapter.wav -af "silenceremove=..."
    -f segment -segment_time 30 -reset_timestamps 1 chunk%04d.wav
    │
    ▼
for each chunk:
    whisper-large-v3 → transcript
    word-level diff vs (source_text + pronunciation_overrides applied)
    │
    ▼
substitutions / drops / inserts
    │
    ▼
narration_findings rows:
    kind = pronunciation | skip | insert
    timestamp_start/end (chunk-relative + chunk offset)
    expected_text, heard_text
    severity:
        pronunciation: warn  (one wrong word — annoyance)
        skip:          crit  (line silently dropped — listener loses thread)
        insert:        warn  (extra word, usually a glitch)
    detector = "whisper-large-v3"

Whisper runs locally on Lucy's 8 GB GPU, using roughly 3 GB of VRAM, so Tier 1 is free and fast. It catches mispronunciations that come out as a different real word.
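The word-level diff in Tier 1 could look something like the sketch below: a standard edit-distance alignment over normalized words, classifying each mismatch as a substitution, drop, or insert. The `Delta` enum, `normalize`, and `word_diff` are hypothetical names. The normalization rule (lowercase, strip punctuation) is an assumption, and the real comparison would run against the source text with pronunciation overrides applied.

```rust
#[derive(Debug, PartialEq)]
enum Delta {
    Substitution { expected: String, heard: String }, // maps to kind = pronunciation
    Drop { expected: String },                        // maps to kind = skip
    Insert { heard: String },                         // maps to kind = insert
}

/// Lowercase and strip punctuation so "Dawn," and "dawn" compare equal.
fn normalize(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|w| {
            w.chars()
                .filter(|c| c.is_alphanumeric())
                .collect::<String>()
                .to_lowercase()
        })
        .filter(|w| !w.is_empty())
        .collect()
}

/// Align expected vs heard words with a Levenshtein table, then walk the
/// table from the start and record every non-matching step.
fn word_diff(expected: &str, heard: &str) -> Vec<Delta> {
    let (e, h) = (normalize(expected), normalize(heard));
    let (n, m) = (e.len(), h.len());

    // cost[i][j] = edit distance between e[i..] and h[j..]
    let mut cost = vec![vec![0usize; m + 1]; n + 1];
    for i in 0..=n {
        cost[i][m] = n - i; // only drops remain
    }
    for j in 0..=m {
        cost[n][j] = m - j; // only inserts remain
    }
    for i in (0..n).rev() {
        for j in (0..m).rev() {
            cost[i][j] = if e[i] == h[j] {
                cost[i + 1][j + 1]
            } else {
                1 + cost[i + 1][j + 1].min(cost[i + 1][j]).min(cost[i][j + 1])
            };
        }
    }

    let (mut i, mut j) = (0, 0);
    let mut deltas = Vec::new();
    while i < n || j < m {
        if i < n && j < m && e[i] == h[j] {
            i += 1;
            j += 1;
        } else if i < n && j < m && cost[i][j] == 1 + cost[i + 1][j + 1] {
            // On ties this prefers a substitution over a drop plus an insert.
            deltas.push(Delta::Substitution { expected: e[i].clone(), heard: h[j].clone() });
            i += 1;
            j += 1;
        } else if i < n && cost[i][j] == 1 + cost[i + 1][j] {
            deltas.push(Delta::Drop { expected: e[i].clone() });
            i += 1;
        } else {
            deltas.push(Delta::Insert { heard: h[j].clone() });
            j += 1;
        }
    }
    deltas
}
```

The table is quadratic in words per chunk, which should be acceptable for 30-second chunks. Because ties resolve toward substitutions, an ambiguous alignment is reported as warn rather than crit. That choice may need revisiting once real transcripts are available.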

Tier 2 — audio-native LLM review (subjective)

Optional, more expensive, covers what Whisper can't:

chapter.wav + source_text
    │
    ▼
clawdforge.run({
    model: "gemini-flash-audio" | "gpt-4o-audio",
    files: [chapter.wav],
    prompt: "...listen to this rendered audiobook chapter...flag
             spans where (a) inflection breaks meaning, (b) pacing
             makes the listener lose the thread, (c) any audio
             glitch or dropout, (d) emotional tone is wrong..."
})
    │
    ▼
narration_findings rows:
    kind = prosody | tone | glitch
    severity per the model's call
    detector = "gemini-flash-audio" (etc)

Claude (the model behind clawdforge's default) has no audio modality as of mid-2026, so audio review routes through clawdforge to Gemini Flash or GPT-4o-audio when needed. Cost: ~$0.05/chapter for Gemini Flash.
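Both tiers write narration_findings rows, so one shared Rust type might cover them. The sketch below is a guess at that shape, not the schema. The type and function names are hypothetical, severity is assumed to have only the two levels mentioned above, and the chunk offset is passed in by the caller rather than derived from the ffmpeg segment settings.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Pronunciation,
    Skip,
    Insert,
    Prosody,
    Tone,
    Glitch,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Severity {
    Warn,
    Crit,
}

/// Hypothetical in-memory form of a narration_findings row.
#[derive(Debug)]
struct NarrationFinding {
    kind: Kind,
    severity: Severity,
    timestamp_start: f64, // seconds from the start of the chapter audio
    timestamp_end: f64,
    expected_text: Option<String>,
    heard_text: Option<String>,
    detector: String,
}

/// Tier 1 severities are fixed per kind. Tier 2 kinds return None because
/// their severity comes from the audio model's own call.
fn tier1_severity(kind: Kind) -> Option<Severity> {
    match kind {
        Kind::Pronunciation | Kind::Insert => Some(Severity::Warn),
        Kind::Skip => Some(Severity::Crit),
        Kind::Prosody | Kind::Tone | Kind::Glitch => None,
    }
}

/// Build a Tier 1 finding, converting chunk-relative times to chapter times.
fn tier1_finding(
    kind: Kind,
    chunk_offset_secs: f64,
    rel_start: f64,
    rel_end: f64,
    expected: Option<&str>,
    heard: Option<&str>,
) -> Option<NarrationFinding> {
    let severity = tier1_severity(kind)?;
    Some(NarrationFinding {
        kind,
        severity,
        timestamp_start: chunk_offset_secs + rel_start,
        timestamp_end: chunk_offset_secs + rel_end,
        expected_text: expected.map(str::to_owned),
        heard_text: heard.map(str::to_owned),
        detector: "whisper-large-v3".to_string(),
    })
}
```

Keeping the offset as an explicit argument avoids assuming every chunk is exactly 30 seconds long, which may not hold once silence removal and segment boundaries are involved.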

Reroll loop

SELECT count(*) FROM narration_findings
 WHERE run_id = $1 AND severity = 'crit' AND NOT resolved
    │
    ▼ if > 0:
update narration_runs.status = 'rerolled'
queue a new render with seed = $old_seed + 1
    │
    ▼
narration_findings rows referencing the old run_id stay
(audit trail) — the new run gets its own findings set

After two reroll attempts with crit findings still present, the chapter is marked for operator review (out-of-band: e.g., re-edit the source text, add more pronunciation overrides, or hand-patch the audio).
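The decision at the end of each audited run can be sketched as a small pure function. `NextStep`, `MAX_REROLLS`, and `next_step` are hypothetical names. The sketch reads "two reroll attempts" as the original render plus at most two rerolls before escalation.

```rust
#[derive(Debug, PartialEq)]
enum NextStep {
    /// No unresolved crit findings: keep this render.
    Accept,
    /// Mark the run 'rerolled' and queue a new render with this seed.
    Reroll { seed: u64 },
    /// Reroll budget exhausted: hand the chapter to an operator.
    OperatorReview,
}

/// Assumption: the original render plus two rerolls, then escalate.
const MAX_REROLLS: u32 = 2;

/// `open_crit_findings` is the result of the count(*) query for this run.
/// `rerolls_so_far` is how many rerolls preceded it.
fn next_step(open_crit_findings: u64, rerolls_so_far: u32, seed: u64) -> NextStep {
    if open_crit_findings == 0 {
        NextStep::Accept
    } else if rerolls_so_far >= MAX_REROLLS {
        NextStep::OperatorReview
    } else {
        NextStep::Reroll { seed: seed.wrapping_add(1) }
    }
}
```

Keeping the decision free of database access makes the escalation rule easy to unit-test. The caller would still own the status update and the queue insert.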

Why Whisper for text-to-text verification

Whisper's word-error rate on clean audiobook audio is ~3-5%, so most of what it transcribes back matches the source exactly. The remaining deltas are mostly the words the TTS got wrong.

False positives from genuine Whisper transcription errors (F5 said the word correctly, but Whisper transcribed something different) are rare enough to treat as noise. Those findings get auto-resolved if they don't reproduce on a reroll.

Whisper's proper-noun WER is HIGHER than overall WER — which is actually what we want here. The harder Whisper finds it to transcribe a name back from F5's output, the more likely F5 got the name wrong in the first place.
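The auto-resolve rule for findings that fail to reproduce could be sketched as follows. What counts as "reproduce" is not defined in this plan, so the matching key used here (same kind and same expected text) is an assumption, as are the `Finding` struct and the `autoresolve` name.

```rust
/// Hypothetical trimmed-down finding: only the fields the matcher needs.
#[derive(Debug, Clone, PartialEq)]
struct Finding {
    kind: String,
    expected_text: String,
    resolved: bool,
}

/// Mark unresolved findings from the old run as resolved when the rerolled
/// run produced no finding with the same (kind, expected_text).
/// Returns how many findings were auto-resolved.
fn autoresolve(old: &mut [Finding], rerolled: &[Finding]) -> usize {
    let mut resolved_count = 0;
    for finding in old.iter_mut().filter(|f| !f.resolved) {
        let reproduced = rerolled
            .iter()
            .any(|r| r.kind == finding.kind && r.expected_text == finding.expected_text);
        if !reproduced {
            finding.resolved = true;
            resolved_count += 1;
        }
    }
    resolved_count
}
```

Matching on text alone is coarse when the same word appears many times in a chapter. A stricter key might also compare approximate timestamps.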

What this doc is NOT

  • Not a prompt template for the audio-LLM call (TBD in v0.2)
  • Not a Whisper config file (TBD: which Whisper, language, beam size, VAD)
  • Not the TTS sidecar container shape (separate spec)

When this gets wired

After the generation pipeline (gen → cleanup → canon-audit) is working end-to-end. TTS is downstream of "we have a story to render."

Order of v0.2 work:

  1. forge.rs prompt templates (gen/cleanup/audit) — fill in stubs
  2. context.rs — assemble blob from DB for a given story_id
  3. web UI for queueing + status
  4. TTS sidecar container — F5 + ffmpeg + Whisper + this audit chain