The Rust SDK already existed at Sulkta-Coop/clawdforge clients/rust/ — async,
reqwest-based, bearer-auth, exposes Client::run() + Session for multi-turn.
Vendoring it into vendor/clawdforge so skald is self-contained: no
git submodule, and no need to clone the clawdforge repo next to skald.
Trade-off accepted: updates require a manual re-copy until both sides
stabilize and we publish to a private cargo registry.
What landed:
- vendor/clawdforge/ — full SDK source from Sulkta-Coop/clawdforge HEAD.
Pinned in skald-core/Cargo.toml as a path dep.
- skald-core/src/forge.rs — three-pass orchestration shell. Forge wraps
clawdforge::Client; generate() / cleanup() / audit() each build a
RunRequest with the right system prompt + model alias (always opus),
call client.run(), return a PassOutput.
Prompt templates are TODO stubs (SYSTEM_GEN_TODO etc) — filling in the
actual prose-craft prompts is its own deep session.
- skald-core/src/config.rs — ForgeConfig { base_url, app_token, model }.
Resolved by the binary from env (CLAWDFORGE_URL + CLAWDFORGE_TOKEN);
lib stays env-agnostic.
- skald-core::AuditFinding + AuditResponse — parse shape for what the
third-Opus canon audit returns, ready to map onto audit_findings rows.
- docs/tts-pipeline.md — full plan for v0.2 narration + post-TTS audit
chain. Whisper-large-v3 STT does text-to-text verification on every
render; an optional Gemini Flash audio pass catches subjective issues
(prosody, tone) Whisper can't see. Reroll loop on crit findings.
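A minimal sketch of the config split described in the list above: the library takes a lookup function, and only the binary touches the process environment. The field names and the two variable names come from the list; `from_lookup`, `MissingVar`, and `config_from_env` are hypothetical names for illustration, and defaulting `model` to "opus" is an assumption based on the "always opus" note.

```rust
/// Connection settings for the forge client (fields as listed above).
#[derive(Debug, Clone, PartialEq)]
pub struct ForgeConfig {
    pub base_url: String,
    pub app_token: String,
    pub model: String,
}

/// Hypothetical error type: names the variable that was not set.
#[derive(Debug, PartialEq)]
pub struct MissingVar(pub &'static str);

impl ForgeConfig {
    /// Library side: build from any key lookup, never from the
    /// process environment directly.
    pub fn from_lookup<F>(lookup: F) -> Result<Self, MissingVar>
    where
        F: Fn(&str) -> Option<String>,
    {
        let base_url = lookup("CLAWDFORGE_URL").ok_or(MissingVar("CLAWDFORGE_URL"))?;
        let app_token = lookup("CLAWDFORGE_TOKEN").ok_or(MissingVar("CLAWDFORGE_TOKEN"))?;
        Ok(Self {
            base_url,
            app_token,
            // Assumption: the model alias defaults to "opus".
            model: "opus".to_string(),
        })
    }
}

/// Binary side: the only place that reads the real environment.
pub fn config_from_env() -> Result<ForgeConfig, MissingVar> {
    ForgeConfig::from_lookup(|key| std::env::var(key).ok())
}
```

Passing a closure keeps the library testable without mutating process state; a test can hand in a fixed map instead of real environment variables.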
What's still stubbed:
- Prompt templates in forge.rs (gen / cleanup / audit) — placeholders
that describe the role but don't constrain output shape yet.
- context.rs (assemble the LLM context blob from DB rows) — entire module
TBD.
- No CLI subcommand yet for invoking forge — that comes after context.rs.
Naming note: in Rust 2024 'gen' is a reserved keyword (for generators),
so the method is Forge::generate(), not Forge::gen().
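A rough sketch of the three-pass shell and the naming choice, under loud assumptions: the real clawdforge::Client is async and its RunRequest fields are not shown here, so `Runner`, the `RunRequest` fields, `EchoRunner`, and the `SYSTEM_CLEANUP_TODO` / `SYSTEM_AUDIT_TODO` names are hypothetical stand-ins, and the trait is synchronous to keep the example std-only.

```rust
/// Hypothetical request shape; the real SDK type may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct RunRequest {
    pub model: String,
    pub system: String,
    pub prompt: String,
}

/// Output of one pass.
#[derive(Debug, Clone, PartialEq)]
pub struct PassOutput {
    pub text: String,
}

/// Synchronous stand-in for the async clawdforge client.
pub trait Runner {
    fn run(&self, req: &RunRequest) -> Result<String, String>;
}

const MODEL: &str = "opus";
const SYSTEM_GEN_TODO: &str = "TODO: generation system prompt";
const SYSTEM_CLEANUP_TODO: &str = "TODO: cleanup system prompt";
const SYSTEM_AUDIT_TODO: &str = "TODO: canon-audit system prompt";

pub struct Forge<R: Runner> {
    client: R,
}

impl<R: Runner> Forge<R> {
    pub fn new(client: R) -> Self {
        Self { client }
    }

    /// Shared path: build the request, call the client, wrap the output.
    fn pass(&self, system: &str, prompt: &str) -> Result<PassOutput, String> {
        let req = RunRequest {
            model: MODEL.to_string(),
            system: system.to_string(),
            prompt: prompt.to_string(),
        };
        self.client.run(&req).map(|text| PassOutput { text })
    }

    /// Named `generate`, not `gen`: `gen` is reserved in the 2024 edition.
    pub fn generate(&self, context: &str) -> Result<PassOutput, String> {
        self.pass(SYSTEM_GEN_TODO, context)
    }

    pub fn cleanup(&self, draft: &str) -> Result<PassOutput, String> {
        self.pass(SYSTEM_CLEANUP_TODO, draft)
    }

    pub fn audit(&self, chapter: &str) -> Result<PassOutput, String> {
        self.pass(SYSTEM_AUDIT_TODO, chapter)
    }
}

/// Test double: echoes the model alias and prompt back.
pub struct EchoRunner;

impl Runner for EchoRunner {
    fn run(&self, req: &RunRequest) -> Result<String, String> {
        Ok(format!("[{}] {}", req.model, req.prompt))
    }
}
```

A raw identifier (`r#gen`) would also compile, but it leaks the escape into every call site, which is one reason to prefer the longer name.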
TTS pipeline (v0.2 plan)
Generation lands in v0.2; narration follows. This doc captures the shape so the v0.1 schema decisions don't drift away from the v0.2 implementation.
Render path
chapter row (chapters.body_md)
│
▼
1. pronunciation pre-pass
│ pull pronunciation_overrides for (story_id, *) UNION global
│ substitute proper nouns with phoneme markers F5 understands
▼
2. F5-TTS render
│ reference: voices.reference_path + voices.reference_text
│ output: .wav, 24 kHz, mono
▼
3. narration_runs row
│ chapter_id, voice_id, engine="f5-tts", engine_version, seed,
│ output_path, duration_seconds, status="succeeded"
▼
4. POST-RENDER AUDIT CHAIN ◄── this is the cwho-shape audit we use
to catch "spoken word that would
wake you up at 2am"
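Step 1 of the render path (the pronunciation pre-pass) might look roughly like the sketch below. It assumes overrides are plain (word, spoken form) pairs, that a story-level override wins over a global one for the same word, and that matching is case-insensitive on whole words; none of that is fixed by this doc. The respelling strings stand in for whatever phoneme markers F5 actually accepts, and `merge_overrides` / `apply_overrides` are hypothetical names.

```rust
use std::collections::HashMap;

/// Merge global and story-level overrides into one lookup table.
/// Assumption: story-level entries take precedence over global ones.
pub fn merge_overrides(
    global: &[(&str, &str)],
    story: &[(&str, &str)],
) -> HashMap<String, String> {
    let mut merged = HashMap::new();
    // Story entries are inserted last, so they overwrite global ones.
    for &(word, spoken) in global.iter().chain(story.iter()) {
        merged.insert(word.to_lowercase(), spoken.to_string());
    }
    merged
}

/// Replace every whole word that has an override; punctuation and
/// whitespace are copied through unchanged.
pub fn apply_overrides(text: &str, overrides: &HashMap<String, String>) -> String {
    let mut out = String::with_capacity(text.len());
    let mut word = String::new();
    for ch in text.chars() {
        if ch.is_alphanumeric() {
            word.push(ch);
        } else {
            flush_word(&mut word, &mut out, overrides);
            out.push(ch);
        }
    }
    flush_word(&mut word, &mut out, overrides);
    out
}

/// Emit the pending word (substituted if an override exists) and reset it.
fn flush_word(word: &mut String, out: &mut String, overrides: &HashMap<String, String>) {
    if word.is_empty() {
        return;
    }
    match overrides.get(&word.to_lowercase()) {
        Some(spoken) => out.push_str(spoken),
        None => out.push_str(word.as_str()),
    }
    word.clear();
}
```

The same function can be applied to the source text before the Tier 1 diff, so the expected text and the rendered text go through identical substitutions.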
Audit chain
Tier 1 — Whisper-large-v3 STT + word-diff (text-to-text)
chapter.wav
│
▼
ffmpeg -i chapter.wav -af "silenceremove=..."
-f segment -segment_time 30 -reset_timestamps 1 chunk%04d.wav
│
▼
for each chunk:
whisper-large-v3 → transcript
word-level diff vs (source_text + pronunciation_overrides applied)
│
▼
substitutions / drops / inserts
│
▼
narration_findings rows:
kind = pronunciation | skip | insert
timestamp_start/end (chunk-relative + chunk offset)
expected_text, heard_text
severity:
pronunciation: warn (one wrong word — annoyance)
skip: crit (line silently dropped — listener loses thread)
insert: warn (extra word, usually a glitch)
detector = "whisper-large-v3"
Whisper runs locally on Lucy's 8GB GPU (~3GB VRAM). Free, fast. Captures mispronunciations that come out as a different real word.
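The word-level diff in Tier 1 can be sketched as a word-wise edit-distance alignment whose operations map onto the three finding kinds. This is one possible approach, not a settled design: the normalization (lowercase, strip punctuation), the type names, and the use of plain Levenshtein alignment are all assumptions, and timestamps are left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind { Pronunciation, Skip, Insert }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Severity { Warn, Crit }

#[derive(Debug, Clone, PartialEq)]
pub struct Finding {
    pub kind: Kind,
    pub severity: Severity,
    pub expected_text: String,
    pub heard_text: String,
}

/// Lowercase and strip surrounding punctuation so "Door." matches "door".
fn normalize(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .filter(|w| !w.is_empty())
        .collect()
}

fn finding(kind: Kind, severity: Severity, expected: &str, heard: &str) -> Finding {
    Finding { kind, severity, expected_text: expected.to_string(), heard_text: heard.to_string() }
}

/// Align expected vs heard words and report every non-matching operation.
pub fn diff_words(expected: &str, heard: &str) -> Vec<Finding> {
    let (e, h) = (normalize(expected), normalize(heard));
    let (n, m) = (e.len(), h.len());
    // dist[i][j] = edit distance between e[i..] and h[j..].
    let mut dist = vec![vec![0usize; m + 1]; n + 1];
    for i in 0..=n { dist[i][m] = n - i; }
    for j in 0..=m { dist[n][j] = m - j; }
    for i in (0..n).rev() {
        for j in (0..m).rev() {
            dist[i][j] = if e[i] == h[j] {
                dist[i + 1][j + 1]
            } else {
                1 + dist[i + 1][j + 1].min(dist[i + 1][j]).min(dist[i][j + 1])
            };
        }
    }
    // Walk the table front to back, emitting one finding per edit.
    let mut out = Vec::new();
    let (mut i, mut j) = (0, 0);
    while i < n || j < m {
        if i < n && j < m && e[i] == h[j] {
            i += 1;
            j += 1;
        } else if i < n && j < m && dist[i][j] == 1 + dist[i + 1][j + 1] {
            // Substitution: a different word came back.
            out.push(finding(Kind::Pronunciation, Severity::Warn, &e[i], &h[j]));
            i += 1;
            j += 1;
        } else if i < n && (j == m || dist[i][j] == 1 + dist[i + 1][j]) {
            // Drop: an expected word never came back.
            out.push(finding(Kind::Skip, Severity::Crit, &e[i], ""));
            i += 1;
        } else {
            // Insert: an extra word was heard.
            out.push(finding(Kind::Insert, Severity::Warn, "", &h[j]));
            j += 1;
        }
    }
    out
}
```

For example, comparing "She closed the door." against a transcript of "she the door" yields a single crit skip finding for "closed". Marking every single dropped word as crit follows the severity table literally; a real implementation may want to require a run of consecutive drops before escalating.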
Tier 2 — audio-native LLM review (subjective)
Optional, more expensive, covers what Whisper can't:
chapter.wav + source_text
│
▼
clawdforge.run({
model: "gemini-flash-audio" | "gpt-4o-audio",
files: [chapter.wav],
prompt: "...listen to this rendered audiobook chapter...flag
spans where (a) inflection breaks meaning, (b) pacing
makes the listener lose the thread, (c) any audio
glitch or dropout, (d) emotional tone is wrong..."
})
│
▼
narration_findings rows:
kind = prosody | tone | glitch
severity per the model's call
detector = "gemini-flash-audio" (etc)
Claude (the model behind clawdforge's default) doesn't have an audio modality (as of mid-2026). Audio review requests are routed through clawdforge to Gemini Flash or GPT-4o-audio when needed. Cost: ~$0.05/chapter for Gemini Flash.
Reroll loop
SELECT count(*) FROM narration_findings
WHERE run_id = $1 AND severity = 'crit' AND NOT resolved
│
▼ if > 0:
update narration_runs.status = 'rerolled'
queue a new render with seed = $old_seed + 1
│
▼
narration_findings rows referencing the old run_id stay
(audit trail) — the new run gets its own findings set
After two reroll attempts with crit findings still present, the chapter is marked for operator review (out-of-band: e.g., re-edit the source text, add more pronunciation overrides, or hand-patch the audio).
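The reroll decision can be sketched as a small pure function. It assumes the caller has already run the count query above and tracks how many rerolls the chapter has had; `NextStep`, `next_step`, and `MAX_REROLLS` are hypothetical names, and the limit of two comes from the paragraph above.

```rust
/// Rerolls allowed before a chapter is escalated to an operator.
pub const MAX_REROLLS: u32 = 2;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NextStep {
    /// No unresolved crit findings: keep this render.
    Accept,
    /// Queue a new render with the given seed.
    Reroll { seed: u64 },
    /// Reroll budget exhausted: hand off for out-of-band review.
    OperatorReview,
}

/// Decide what to do with a finished run.
/// `unresolved_crit` is the result of the count query for this run,
/// `rerolls_so_far` counts rerolls already attempted for the chapter,
/// `seed` is the seed of the run that was just audited.
pub fn next_step(unresolved_crit: u64, rerolls_so_far: u32, seed: u64) -> NextStep {
    if unresolved_crit == 0 {
        NextStep::Accept
    } else if rerolls_so_far >= MAX_REROLLS {
        NextStep::OperatorReview
    } else {
        // seed + 1, wrapping so a maximal seed cannot panic on overflow.
        NextStep::Reroll { seed: seed.wrapping_add(1) }
    }
}
```

Keeping the decision free of database access makes it easy to unit-test; the caller would then update narration_runs.status and queue the render based on the returned variant.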
Why Whisper for text-to-text verification
Whisper's word-error-rate on clean audiobook audio is ~3-5%. That means most of what it transcribes back will match the source exactly. The deltas are precisely the words the TTS got wrong.
False positives from genuine Whisper transcription errors (the TTS said the word correctly, but Whisper transcribed something different) are rare enough to treat as noise. Those findings get autoresolved if they don't reproduce on a reroll.
Whisper's proper-noun WER is HIGHER than overall WER — which is actually what we want here. The harder Whisper finds it to transcribe a name back from F5's output, the more likely F5 got the name wrong in the first place.
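The autoresolve rule mentioned above could be sketched as follows. The matching criterion (same kind and same expected text in the rerolled run) is an assumption, as are the `FindingRow` shape and the `autoresolve` name; the real narration_findings columns may call for a different key, such as a position in the source text.

```rust
/// Hypothetical in-memory view of a narration_findings row.
#[derive(Debug, Clone, PartialEq)]
pub struct FindingRow {
    pub kind: String,
    pub expected_text: String,
    pub heard_text: String,
    pub resolved: bool,
}

/// Mark findings from the old run as resolved when the rerolled run
/// shows no finding of the same kind for the same expected text.
/// Returns how many findings were newly resolved.
pub fn autoresolve(old: &mut [FindingRow], rerolled: &[FindingRow]) -> usize {
    let mut newly_resolved = 0;
    for finding in old.iter_mut().filter(|f| !f.resolved) {
        let reproduced = rerolled
            .iter()
            .any(|r| r.kind == finding.kind && r.expected_text == finding.expected_text);
        if !reproduced {
            finding.resolved = true;
            newly_resolved += 1;
        }
    }
    newly_resolved
}
```

The old rows are only flagged, never deleted, which keeps the audit trail described in the reroll loop intact.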
What this doc is NOT
- Not a prompt template for the audio-LLM call (TBD in v0.2)
- Not a Whisper config file (TBD: which Whisper, language, beam size, VAD)
- Not the TTS sidecar container shape (separate spec)
When this gets wired
After the generation pipeline (gen → cleanup → canon-audit) is working end-to-end. TTS is downstream of "we have a story to render."
Order of v0.2 work:
- forge.rs prompt templates (gen/cleanup/audit) — fill in stubs
- context.rs — assemble blob from DB for a given story_id
- web UI for queueing + status
- TTS sidecar container — F5 + ffmpeg + Whisper + this audit chain