cobb/skald

Kayos 9df378f799 engine/tortoise: sentence chunking + device fix + pitch/rate modulation		2026-05-14 19:08:43 -07:00

Catches up engines/tortoise/server.py with what's been deployed on Lucy through tonight's smoke iterations:

0.2: _chunk_for_tortoise splits text nodes at sentence boundaries (max 220 chars) before each tts_with_preset call. Fixes the end-of-prompt gibberish past tortoise's ~20s reliable horizon.
0.3: _get_voice now calls .to(DEVICE) on cached samples + latents. Without this, non-lj voices crash with 'Expected all tensors to be on the same device, but found cpu and cuda:0'.
0.4: [voice:NAME pitch=N rate=R][/voice] tag syntax. librosa pitch_shift + time_stretch applied per-chunk for single-voice multi-character renders.

The strategy survived the design table, but the librosa phase-vocoder artifacts at ±5 semitones ate the quality on the 2070 Super. Parked here for the GPU rebuild; modulation works architecturally, it just needs a better stretching algorithm (rubberband) + more headroom. Production stayed Kokoro. Coast-Down preferred_voice_id reverted to kokoro_af_heart in the live DB after this experiment.
f5-tts	engines: import f5-tts + kokoro + tortoise sidecars into the tree	2026-05-14 09:40:01 -07:00
kokoro	engines: import f5-tts + kokoro + tortoise sidecars into the tree	2026-05-14 09:40:01 -07:00
tortoise	engine/tortoise: sentence chunking + device fix + pitch/rate modulation	2026-05-14 19:08:43 -07:00
README.md	engines: import f5-tts + kokoro + tortoise sidecars into the tree	2026-05-14 09:40:01 -07:00

README.md

Skald TTS engines

This subtree holds the per-engine sidecars that skald's narrate path talks to over HTTP. Each engine has the same contract:

POST /synthesize — same JSON shape across engines so skald's one Rust client (skald-core::narrate::Narrator) deserializes all of them. See engines/<name>/server.py for the per-engine implementation.
GET /healthz — boot probe + model-loaded flag.

Skald routes per-request by voices.source: a kokoro_* source goes to $KOKORO_URL, a tortoise_* source goes to $TORTOISE_URL, anything else (lj_speech, generic) goes to $F5_TTS_URL.
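The routing rule above can be sketched in a few lines. This is a Python illustration only: the real routing lives in skald's Rust client, and the function name engine_url_for is hypothetical. It assumes kokoro_* and tortoise_* are plain prefix matches on voices.source.

```python
import os


def engine_url_for(source: str, env=os.environ) -> str:
    """Pick the sidecar base URL for a voices.source value (illustrative only)."""
    if source.startswith("kokoro_"):
        return env["KOKORO_URL"]
    if source.startswith("tortoise_"):
        return env["TORTOISE_URL"]
    # Everything else (lj_speech, generic, ...) falls through to F5-TTS.
    return env["F5_TTS_URL"]
```

Passing the environment as a parameter keeps the rule easy to exercise without touching the real process environment.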

Engines

Dir	Engine	License (code/weights)	VRAM	Speed	Voices
`f5-tts/`	SWivid F5-TTS v1	MIT / CC-BY-NC	~5GB	fast (~2x real-time on 2070S)	voice cloning (LJ Speech reference shipped)
`kokoro/`	hexgrad Kokoro-82M	Apache 2.0 / Apache 2.0	~1GB	very fast (~50x real-time)	50+ named presets (af_, am_, bf_, bm_)
`tortoise/`	neonbjb Tortoise-TTS	Apache 2.0 / Apache 2.0	~5GB	slow (~0.014x real-time, i.e. ~74s of compute per second of audio on 2070S, standard preset)	26 named built-ins (lj, freeman, daniel, weaver, jlaw, etc.)
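To make the Speed column concrete, here is a rough sketch that turns the approximate real-time factors from the table into wall-clock estimates. The numbers are the table's ballpark figures for the 2070S; the names REALTIME_FACTOR and estimated_render_seconds are made up for this example.

```python
# Approximate real-time factors from the table:
# seconds of audio produced per second of compute on the 2070S.
REALTIME_FACTOR = {
    "f5-tts": 2.0,
    "kokoro": 50.0,
    "tortoise": 1 / 74,  # ~0.014x, standard preset
}


def estimated_render_seconds(engine: str, audio_seconds: float) -> float:
    """Rough wall-clock estimate for rendering audio_seconds of speech."""
    return audio_seconds / REALTIME_FACTOR[engine]
```

By these figures, one minute of narration takes roughly 30 s on F5-TTS, a little over 1 s on Kokoro, and about 74 minutes on Tortoise.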

Branch model

main carries the vanilla version of each engine: what you'd get from a clean pip install <engine> plus the FastAPI sidecar and control-tag splitter. No engine-specific kludges. Safe to look at without context.

engine/<name> branches hold engine-tuned tweaks that don't generalise. Examples:

engine/kokoro: doubled-?? prosody hack for the 82M's weak question intonation, paragraph/scene/breath gap durations tuned for af_heart's pacing, notes on how respellings need to be all-lowercase to avoid letter-by-letter spell-out by misaki.
engine/tortoise: GPU exclusivity coordinator (stops F5 + Kokoro before a Tortoise run since the 2070 Super can't host all three at once), preset choice ergonomics, character→tortoise-voice seed assignments.

When deploying an engine to Lucy, the build dir at /mnt/cache/appdata/<engine>/build/ tracks the engine's branch:

cd /mnt/cache/appdata/kokoro/build
git fetch && git checkout engine/kokoro
docker compose -p <name> up -d --build

GPU coordination (2070 Super)

The 8GB card is the bottleneck. F5 + Kokoro can co-reside (~5GB + ~1GB). Tortoise pushes the budget over and needs the GPU largely to itself — the engine/tortoise branch will carry the script that stops kokoro + f5 before a tortoise run and restarts them after. Replace with proper coordination once we have more VRAM.
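The stop-then-restart pattern described above could look something like the following. This is a minimal sketch, not the script the engine/tortoise branch will carry: the container names in OTHER_ENGINES are hypothetical, and it assumes the sidecars run as Docker containers that docker stop / docker start can address by name.

```python
import subprocess
from contextlib import contextmanager

# Hypothetical container names; substitute whatever the compose projects create.
OTHER_ENGINES = ("kokoro", "f5-tts")


@contextmanager
def tortoise_exclusive(others=OTHER_ENGINES, run=subprocess.run):
    """Stop the co-resident engines, yield for the Tortoise run, then restart them."""
    for name in others:
        run(["docker", "stop", name], check=True)
    try:
        yield
    finally:
        # Restart even if the Tortoise run raised, so the other engines come back.
        for name in others:
            run(["docker", "start", name], check=True)
```

A render would then be wrapped as `with tortoise_exclusive(): ...`. The run parameter exists so the sequence can be dry-run with a stub instead of touching real containers.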