The Rust SDK already existed at Sulkta-Coop/clawdforge clients/rust/ — async,
reqwest-based, bearer-auth, exposes Client::run() + Session for multi-turn.
Vendored into vendor/clawdforge so skald is self-contained: no git
submodule, and no need to clone the clawdforge repo next to skald.
Trade-off accepted: updates require a manual re-copy until both sides
stabilize and we publish to a private cargo registry.
What landed:
- vendor/clawdforge/: full SDK source from Sulkta-Coop/clawdforge HEAD.
  Pinned in skald-core/Cargo.toml as a path dep.
- skald-core/src/forge.rs: three-pass orchestration shell. Forge wraps
  clawdforge::Client; generate() / cleanup() / audit() each build a
  RunRequest with the right system prompt + model alias (always opus),
  call client.run(), return a PassOutput.
  Prompt templates are TODO stubs (SYSTEM_GEN_TODO etc); filling in the
  actual prose-craft prompts is its own deep session.
- skald-core/src/config.rs: ForgeConfig { base_url, app_token, model }.
  Resolved by the binary from env (CLAWDFORGE_URL + CLAWDFORGE_TOKEN);
  lib stays env-agnostic.
- skald-core::AuditFinding + AuditResponse: parse shape for what the
  third-Opus canon audit returns, ready to map onto audit_findings rows.
- docs/tts-pipeline.md: full plan for v0.2 narration + post-TTS audit
  chain. Whisper-large-v3 STT does text-to-text verification on every
  render; an optional Gemini Flash audio pass catches subjective issues
  (prosody, tone) Whisper can't see. Reroll loop on crit findings.
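
A minimal sketch of the config.rs split described above, not the code that landed. The field names and env var names come from this message; the String field types, the `resolve_forge_config` helper, and its lookup-closure parameter are assumptions made for illustration. The library function takes any key lookup, so only the binary touches the process environment.

```rust
/// Field names are from the commit text; the String types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct ForgeConfig {
    pub base_url: String,
    pub app_token: String,
    pub model: String,
}

/// Hypothetical helper: builds a ForgeConfig from an arbitrary key lookup,
/// so the library itself never reads the environment.
pub fn resolve_forge_config<F>(lookup: F) -> Result<ForgeConfig, String>
where
    F: Fn(&str) -> Option<String>,
{
    // Treat unset and blank values the same way (an assumption).
    let require = |key: &str| {
        lookup(key)
            .filter(|value| !value.trim().is_empty())
            .ok_or_else(|| format!("missing required setting {}", key))
    };
    Ok(ForgeConfig {
        base_url: require("CLAWDFORGE_URL")?,
        app_token: require("CLAWDFORGE_TOKEN")?,
        // The commit says the model alias is always opus.
        model: "opus".to_string(),
    })
}
```

In the binary this would be called as `resolve_forge_config(|key| std::env::var(key).ok())`; tests can pass a closure over a fixed map instead.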
What's still stubbed:
- Prompt templates in forge.rs (gen / cleanup / audit): placeholders
  that describe the role but don't constrain output shape yet.
- context.rs (assemble the LLM context blob from DB rows): entire module
  TBD.
- No CLI subcommand yet for invoking forge; that comes after context.rs.
Naming note: in Rust 2024 'gen' is a reserved keyword (for generators),
so the method is Forge::generate(), not Forge::gen().
146 lines
4.5 KiB
Markdown
# TTS pipeline (v0.2 plan)

Generation lands in v0.2; narration follows. This doc captures the
shape so the v0.1 schema decisions don't drift away from the v0.2
implementation.

## Render path

```
chapter row (chapters.body_md)
    │
    ▼
1. pronunciation pre-pass
    │   pull pronunciation_overrides for (story_id, *) UNION global
    │   substitute proper nouns with phoneme markers F5 understands
    ▼
2. F5-TTS render
    │   reference: voices.reference_path + voices.reference_text
    │   output: .wav, 24 kHz, mono
    ▼
3. narration_runs row
    │   chapter_id, voice_id, engine="f5-tts", engine_version, seed,
    │   output_path, duration_seconds, status="succeeded"
    ▼
4. POST-RENDER AUDIT CHAIN  ◄── this is the cwho-shape audit we use
                                to catch "spoken word that would
                                wake you up at 2am"
```

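Step 1 of the render path can be sketched roughly as follows. This is an illustration under assumptions, not the planned implementation: the `PronunciationOverride` row shape, the rule that a story-scoped override beats a global one, and the respelling-style replacements are all hypothetical (the actual phoneme marker format F5 expects is not specified here). Matching is whole-word and case-sensitive, and multi-word names are not handled.

```rust
use std::collections::HashMap;

/// Hypothetical row shape; the real pronunciation_overrides columns may differ.
pub struct PronunciationOverride {
    pub story_id: Option<i64>, // None marks a global override
    pub term: String,
    pub replacement: String,
}

/// Merge global rows with rows for one story. Story-scoped rows are inserted
/// last so they win on conflict (an assumption, not stated in the plan).
pub fn overrides_for_story(
    rows: &[PronunciationOverride],
    story_id: i64,
) -> HashMap<String, String> {
    let mut merged = HashMap::new();
    for row in rows.iter().filter(|r| r.story_id.is_none()) {
        merged.insert(row.term.clone(), row.replacement.clone());
    }
    for row in rows.iter().filter(|r| r.story_id == Some(story_id)) {
        merged.insert(row.term.clone(), row.replacement.clone());
    }
    merged
}

/// Replace whole words only, leaving punctuation and whitespace untouched.
pub fn apply_overrides(text: &str, overrides: &HashMap<String, String>) -> String {
    let mut out = String::with_capacity(text.len());
    let mut word = String::new();
    for ch in text.chars() {
        if ch.is_alphanumeric() {
            word.push(ch);
        } else {
            flush_word(&mut word, overrides, &mut out);
            out.push(ch);
        }
    }
    flush_word(&mut word, overrides, &mut out);
    out
}

/// Emit the pending word (substituted if an override exists) and reset it.
fn flush_word(word: &mut String, overrides: &HashMap<String, String>, out: &mut String) {
    if word.is_empty() {
        return;
    }
    match overrides.get(word.as_str()) {
        Some(replacement) => out.push_str(replacement.as_str()),
        None => out.push_str(word.as_str()),
    }
    word.clear();
}
```

The same substituted text would also serve as the reference side of the Tier 1 word-diff, since that diff compares against the source text with overrides applied.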
## Audit chain

### Tier 1: Whisper-large-v3 STT + word-diff (text-to-text)

```
chapter.wav
    │
    ▼
ffmpeg -i chapter.wav -af "silenceremove=..."
       -f segment -segment_time 30 -reset_timestamps 1 chunk%04d.wav
    │
    ▼
for each chunk:
    whisper-large-v3 → transcript
    word-level diff vs (source_text + pronunciation_overrides applied)
    │
    ▼
substitutions / drops / inserts
    │
    ▼
narration_findings rows:
    kind = pronunciation | skip | insert
    timestamp_start/end (chunk-relative + chunk offset)
    expected_text, heard_text
    severity:
        pronunciation: warn  (one wrong word, an annoyance)
        skip:          crit  (line silently dropped, listener loses thread)
        insert:        warn  (extra word, usually a glitch)
    detector = "whisper-large-v3"
```

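The word-level diff and severity mapping in the diagram could look something like the sketch below. It uses a plain edit-distance alignment over normalized words, which is one reasonable choice and not necessarily what the sidecar will use. The `Finding` struct mirrors the `narration_findings` columns named above, but its exact shape, the `normalize` rules, and the function names are assumptions. It also flags every dropped word as a `skip`, whereas the plan talks about dropped lines, so a real detector would likely group consecutive drops.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Kind { Pronunciation, Skip, Insert }

#[derive(Debug, Clone, PartialEq)]
pub enum Severity { Warn, Crit }

/// Hypothetical in-memory form of a narration_findings row.
#[derive(Debug, Clone, PartialEq)]
pub struct Finding {
    pub kind: Kind,
    pub severity: Severity,
    pub expected_text: String,
    pub heard_text: String,
}

/// Lowercase and strip punctuation so "Dawn." and "dawn" compare equal.
fn normalize(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|w| w.chars().filter(|c| c.is_alphanumeric()).collect::<String>().to_lowercase())
        .filter(|w| !w.is_empty())
        .collect()
}

/// Severity table from the plan: only skips are crit.
fn finding(kind: Kind, expected: &str, heard: &str) -> Finding {
    let severity = match kind {
        Kind::Skip => Severity::Crit,
        _ => Severity::Warn,
    };
    Finding { kind, severity, expected_text: expected.to_string(), heard_text: heard.to_string() }
}

pub fn diff_words(expected: &str, heard: &str) -> Vec<Finding> {
    let (e, h) = (normalize(expected), normalize(heard));
    let (n, m) = (e.len(), h.len());
    // cost[i][j] = edit distance between e[i..] and h[j..]
    let mut cost = vec![vec![0usize; m + 1]; n + 1];
    for i in (0..=n).rev() {
        for j in (0..=m).rev() {
            cost[i][j] = if i == n {
                m - j
            } else if j == m {
                n - i
            } else if e[i] == h[j] {
                cost[i + 1][j + 1]
            } else {
                1 + cost[i + 1][j + 1].min(cost[i + 1][j]).min(cost[i][j + 1])
            };
        }
    }
    // Walk the table from the start, preferring substitution, then drop, then insert.
    let (mut i, mut j) = (0, 0);
    let mut findings = Vec::new();
    while i < n || j < m {
        if i < n && j < m && e[i] == h[j] {
            i += 1;
            j += 1;
        } else if i < n && j < m && cost[i][j] == 1 + cost[i + 1][j + 1] {
            findings.push(finding(Kind::Pronunciation, &e[i], &h[j]));
            i += 1;
            j += 1;
        } else if i < n && (j == m || cost[i][j] == 1 + cost[i + 1][j]) {
            findings.push(finding(Kind::Skip, &e[i], ""));
            i += 1;
        } else {
            findings.push(finding(Kind::Insert, "", &h[j]));
            j += 1;
        }
    }
    findings
}
```

For example, diffing the expected text "Kael drew his sword at dawn" against a transcript "kale drew sword at the dawn" yields one pronunciation finding (kael heard as kale), one skip (his), and one insert (the).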
Whisper runs locally on Lucy's 8GB GPU (~3GB VRAM), so it is free and
fast. It catches mispronunciations that come out as a different real
word.

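The "chunk-relative + chunk offset" timestamp step from the Tier 1 diagram might be handled as below. This sketch assumes the offset of each chunk is computed from measured chunk durations and not from a fixed 30 s multiple; both function names are hypothetical. Because `silenceremove` runs before segmenting, the resulting times would be relative to the silence-stripped audio, not to the original chapter.wav.

```rust
/// Start offset of each chunk, given the measured duration of every chunk
/// in order (seconds). Offsets are running sums of the earlier durations.
pub fn chunk_offsets(durations: &[f64]) -> Vec<f64> {
    let mut offsets = Vec::with_capacity(durations.len());
    let mut elapsed = 0.0_f64;
    for duration in durations {
        offsets.push(elapsed);
        elapsed += *duration;
    }
    offsets
}

/// Convert a chunk-relative span to chapter-level timestamps.
/// Returns None when the chunk index is out of range.
pub fn to_absolute(
    offsets: &[f64],
    chunk_index: usize,
    rel_start: f64,
    rel_end: f64,
) -> Option<(f64, f64)> {
    offsets
        .get(chunk_index)
        .map(|offset| (*offset + rel_start, *offset + rel_end))
}
```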
### Tier 2: audio-native LLM review (subjective)

Optional, more expensive, covers what Whisper can't:

```
chapter.wav + source_text
    │
    ▼
clawdforge.run({
    model:  "gemini-flash-audio" | "gpt-4o-audio",
    files:  [chapter.wav],
    prompt: "...listen to this rendered audiobook chapter...flag
             spans where (a) inflection breaks meaning, (b) pacing
             makes the listener lose the thread, (c) any audio
             glitch or dropout, (d) emotional tone is wrong..."
})
    │
    ▼
narration_findings rows:
    kind = prosody | tone | glitch
    severity per the model's call
    detector = "gemini-flash-audio" (etc)
```

Claude (clawdforge's default model) has no audio input modality as of
mid-2026, so audio review requests route through clawdforge to Gemini
Flash or GPT-4o-audio when needed. Estimated cost: ~$0.05/chapter for
Gemini Flash.

### Reroll loop

```
SELECT count(*) FROM narration_findings
 WHERE run_id = $1 AND severity = 'crit' AND NOT resolved
    │
    ▼  if > 0:
update narration_runs.status = 'rerolled'
queue a new render with seed = $old_seed + 1
    │
    ▼
narration_findings rows referencing the old run_id stay
(audit trail); the new run gets its own findings set
```

If crit findings are still present after two reroll attempts, the
chapter is marked for operator review. That review happens out-of-band:
e.g., re-edit the source text, add more pronunciation overrides, or
hand-patch the audio.

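The decision at the end of each audit pass could be expressed as a small pure function, sketched here under assumptions: `NextStep`, `next_step`, and `MAX_REROLLS` are hypothetical names, and the caller is assumed to supply the unresolved crit count (the query above) plus the number of rerolls already attempted for the chapter.

```rust
#[derive(Debug, PartialEq)]
pub enum NextStep {
    /// No unresolved crit findings: keep this render.
    Accept,
    /// Mark the run 'rerolled' and queue a new render with this seed.
    Reroll { seed: u64 },
    /// Reroll budget exhausted: hand the chapter to a human.
    OperatorReview,
}

/// Two reroll attempts, per the paragraph above.
pub const MAX_REROLLS: u32 = 2;

pub fn next_step(unresolved_crit: u64, rerolls_so_far: u32, seed: u64) -> NextStep {
    if unresolved_crit == 0 {
        NextStep::Accept
    } else if rerolls_so_far < MAX_REROLLS {
        // seed = old_seed + 1; wrapping avoids a panic at u64::MAX.
        NextStep::Reroll { seed: seed.wrapping_add(1) }
    } else {
        NextStep::OperatorReview
    }
}
```

Keeping this free of database access makes the reroll policy easy to unit-test separately from the queueing code.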
## Why Whisper for text-to-text verification

Whisper's word-error-rate on clean audiobook audio is ~3-5%. That
means *most* of what it transcribes back will match the source
exactly. The *deltas* are, for the most part, the words the TTS got
wrong.

False positives, where F5 rendered the word correctly but Whisper
transcribed it as something else, are rare enough to treat as noise:
those findings get autoresolved if they don't reproduce on a reroll.

Whisper's proper-noun WER is HIGHER than its overall WER, which is
actually what we want here. The harder Whisper finds it to
transcribe a name back from F5's output, the more likely F5 got
the name wrong in the first place.

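The autoresolve rule mentioned above might be sketched like this. The matching key is an assumption: a finding from the earlier run counts as reproduced only if the reroll produced a finding with the same kind and expected text. The `Finding` struct here is a trimmed, hypothetical stand-in for a `narration_findings` row.

```rust
use std::collections::HashSet;

/// Trimmed, hypothetical stand-in for a narration_findings row.
#[derive(Debug, Clone, PartialEq)]
pub struct Finding {
    pub kind: String,
    pub expected_text: String,
    pub resolved: bool,
}

/// Mark findings from the earlier run as resolved when the reroll did not
/// reproduce them. Returns how many findings were resolved by this call.
pub fn autoresolve(previous: &mut [Finding], reroll: &[Finding]) -> usize {
    // Assumed match key: (kind, expected_text).
    let reproduced: HashSet<(&str, &str)> = reroll
        .iter()
        .map(|f| (f.kind.as_str(), f.expected_text.as_str()))
        .collect();

    let mut resolved_now = 0;
    for finding in previous.iter_mut().filter(|f| !f.resolved) {
        let key = (finding.kind.as_str(), finding.expected_text.as_str());
        if !reproduced.contains(&key) {
            finding.resolved = true;
            resolved_now += 1;
        }
    }
    resolved_now
}
```

Rows are updated in place and never deleted, which keeps the audit trail described in the reroll loop intact.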
## What this doc is NOT

- Not a prompt template for the audio-LLM call (TBD in v0.2)
- Not a Whisper config file (TBD: which Whisper, language, beam size, VAD)
- Not the TTS sidecar container shape (separate spec)

## When this gets wired

After the generation pipeline (gen → cleanup → canon-audit) is
working end-to-end. TTS is downstream of "we have a story to
render."

Order of v0.2 work:

1. forge.rs prompt templates (gen/cleanup/audit): fill in stubs
2. context.rs: assemble blob from DB for a given story_id
3. web UI for queueing + status
4. TTS sidecar container: F5 + ffmpeg + Whisper + this audit chain