Adds the Tortoise-specific tooling that main intentionally omits:
- engines/tortoise/exclusive-gpu.sh wraps any command, stops F5 + Kokoro on the GPU, restarts Tortoise to clear stale CUDA contexts, waits for healthz, runs the command, restarts the engines on EXIT trap. Solves the 8GB OOM that took down the first smoke.
- engines/tortoise/hacks.md captures the speed reality (~74x real-time slowdown on the 2070 Super at standard preset) and the pronunciation-overrides cross-engine compatibility note.
Deploy from this branch when you want Tortoise's tuning. Main's vanilla Tortoise is for the cross-engine reference + future 'we have more VRAM now' cleanup.
Tortoise engine — kludges branched off main
This branch carries the engine-specific tweaks that don't generalise to F5 / Kokoro. Tortoise is the audiobook-quality engine, but it has two real trade-offs that need explicit handling: speed and GPU memory.
1. GPU exclusivity
File: exclusive-gpu.sh.
The 2070 Super has 8GB. F5 (~4.5GB) + Kokoro (~2.7GB) + Tortoise
(~5GB peak) sums to ~12GB — over budget. First Tortoise smoke
caught it: torch.OutOfMemoryError: ... 9.31 MiB is free.
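The budget arithmetic above can be checked with a few lines. This is a minimal sketch that uses only the approximate per-engine figures quoted here; the dictionary keys and the function name are illustrative, not part of the repo.

```python
# Approximate peak VRAM per engine in GB, taken from the figures above.
PEAK_VRAM_GB = {"f5-tts": 4.5, "kokoro": 2.7, "tortoise": 5.0}
GPU_BUDGET_GB = 8.0  # 2070 Super


def fits_on_gpu(engines, budget_gb=GPU_BUDGET_GB):
    """Return (total_gb, fits) for engines that would be resident together."""
    total = sum(PEAK_VRAM_GB[name] for name in engines)
    return total, total <= budget_gb


if __name__ == "__main__":
    for combo in (["f5-tts", "kokoro", "tortoise"], ["tortoise"]):
        total, fits = fits_on_gpu(combo)
        print(f"{' + '.join(combo)}: ~{total:.1f} GB, fits={fits}")
```

All three together come to about 12.2 GB against an 8 GB card, while Tortoise alone fits with room to spare.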
Solution: stop the other two engines for the duration of a Tortoise
run. The script wraps any command: it stops f5-tts + kokoro,
restarts tortoise to clear its CUDA context, waits for healthz,
runs the wrapped command, then restarts the stopped engines from an
EXIT trap (on success or failure).
./exclusive-gpu.sh docker exec skald skald narrate --chapter <uuid>
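For readers who want the control flow spelled out, here is a Python sketch of the same sequence. It is not the shell script itself: the function name and the injected callables (stop, start, restart, wait_healthy, run) are hypothetical, so the sketch makes no claim about how the real script talks to the engines or what the healthz endpoint looks like. The try/finally stands in for the EXIT trap.

```python
def run_exclusive(command, stop, start, restart, wait_healthy, run):
    """Run `command` while Tortoise has the GPU to itself.

    Every callable is injected (hypothetical interface), so nothing here
    assumes docker, systemd, or a particular healthz URL.
    """
    others = ["f5-tts", "kokoro"]
    stopped = []
    try:
        for name in others:
            stop(name)
            stopped.append(name)   # remember only what was actually stopped
        restart("tortoise")        # drop any stale CUDA context
        wait_healthy("tortoise")   # block until healthz reports ready
        return run(command)
    finally:
        # Stand-in for the EXIT trap: runs on success and on failure.
        for name in stopped:
            start(name)
```

Tracking the stopped engines in a list means a failure halfway through the stop phase restarts only the engines that were really taken down.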
Remove when: GPU upgrade (P40 24GB / 3090 24GB / etc) lets all three engines co-reside.
2. Speed — slow, batch-only
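Tortoise at the standard preset ran ~74x slower than real time on
the 2070 Super (smoke: 6.5s of audio took 478s wall clock). The
estimate for a 33-min Chapter 2 render is ~8 hours; that is well
below a straight 74x extrapolation (33 min x 74 is roughly 40 hours),
so confirm it against a full-chapter timing. Either way, Tortoise is
acceptable for overnight batched runs but NOT interactive rendering.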
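The slowdown figure comes straight from the smoke numbers. The sketch below reproduces it and shows a purely linear extrapolation; linear scaling is an assumption, and a short smoke run may include fixed startup cost that inflates the factor, so long renders may come in lower. The function names are illustrative.

```python
def slowdown_factor(audio_seconds, wall_seconds):
    """How many seconds of wall clock each second of audio costs."""
    return wall_seconds / audio_seconds


def linear_estimate_hours(audio_minutes, factor):
    """Wall-clock hours if the slowdown factor held for the whole render."""
    return audio_minutes * factor / 60.0


if __name__ == "__main__":
    factor = slowdown_factor(6.5, 478.0)  # smoke: 6.5s audio, 478s wall clock
    print(f"slowdown: ~{factor:.0f}x")
    print(f"33 min of audio, linear: ~{linear_estimate_hours(33, factor):.0f} h")
```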
Quality presets and their approx wall-clock for a 3000-word chapter:
- ultra_fast: ~1h, noticeable quality drop
- fast: ~2h
- standard: ~6-8h, the recommended bar
- high_quality: ~24h, marginally better than standard
For most use, standard is right. Reserve high_quality for
short prologues or named samples.
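One way to apply these numbers when scheduling a batch is to pick the highest-quality preset that fits the available wall-clock window. This is a sketch under two assumptions: the hours are the approximate figures listed above (upper bound for standard), and render time scales linearly with word count. The function name is hypothetical.

```python
# Approximate wall-clock hours for a 3000-word chapter, lowest quality first.
PRESET_HOURS_PER_3000_WORDS = [
    ("ultra_fast", 1.0),
    ("fast", 2.0),
    ("standard", 8.0),       # upper end of the ~6-8h range
    ("high_quality", 24.0),
]


def pick_preset(word_count, budget_hours):
    """Return the highest-quality preset whose estimate fits the budget."""
    scale = word_count / 3000.0  # assumption: time is linear in word count
    best = None
    for name, hours in PRESET_HOURS_PER_3000_WORDS:
        if hours * scale <= budget_hours:
            best = name  # list is ordered by quality, so the last fit wins
    return best


if __name__ == "__main__":
    print(pick_preset(3000, 10))  # a full chapter in an overnight window
    print(pick_preset(500, 8))    # a short prologue
```

With these figures a full chapter in a 10-hour window lands on standard, and only short pieces reach high_quality, which matches the guidance above.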
3. Voice mapping format
Tortoise's voice roster (lj, freeman, daniel, etc.) lives
behind source='tortoise_tts' in the voices table. Character
slug → Tortoise voice mapping is independent of the Kokoro mapping
— a story can have BOTH a Kokoro and Tortoise mapping live in
parallel, picked at render time via story.preferred_voice_id or
the --voice flag.
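The render-time choice can be sketched as a small lookup. The precedence shown here (the --voice flag wins over story.preferred_voice_id) is an assumption, as are the row shape, the voice ids, and the 'kokoro' source value; only source='tortoise_tts' and the voice names lj and freeman come from the text above.

```python
# Hypothetical rows standing in for the voices table.
VOICES = {
    1: {"source": "kokoro", "name": "af_heart"},       # assumed source value
    2: {"source": "tortoise_tts", "name": "freeman"},
    3: {"source": "tortoise_tts", "name": "lj"},
}


def resolve_voice(cli_voice_id, preferred_voice_id, voices=VOICES):
    """Pick the voice row for a render.

    Assumption: an explicit --voice value overrides the story's stored
    preference. Returns the row, whose 'source' identifies the engine.
    """
    voice_id = cli_voice_id if cli_voice_id is not None else preferred_voice_id
    if voice_id not in voices:
        raise KeyError(f"unknown voice id: {voice_id!r}")
    return voices[voice_id]


if __name__ == "__main__":
    print(resolve_voice(None, 1))  # falls back to the story preference
    print(resolve_voice(2, 1))     # flag wins, so the render uses Tortoise
```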
Tortoise voices can warble or stutter at chunk boundaries because
the tortoise.api.TextToSpeech.tts_with_preset call is per-chunk
and re-conditions the voice each time. Acceptable for v0.1; future
work could feed conditioning_latents directly for tighter cohesion.
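A sketch of that future work: compute the conditioning latents once and reuse them for every chunk. The helper name render_chunks is hypothetical, and the tts argument is expected to be a tortoise.api.TextToSpeech instance (or anything with the same two methods). Whether this actually reduces the boundary warble is untested here.

```python
def render_chunks(tts, chunks, voice_samples, preset="standard"):
    """Render each text chunk with one shared set of conditioning latents.

    `tts` is assumed to expose get_conditioning_latents(voice_samples) and
    tts_with_preset(text, preset=..., conditioning_latents=...), as
    tortoise.api.TextToSpeech does.
    """
    # Condition the voice once instead of once per chunk.
    latents = tts.get_conditioning_latents(voice_samples)
    return [
        tts.tts_with_preset(chunk, preset=preset, conditioning_latents=latents)
        for chunk in chunks
    ]
```

Passing conditioning_latents without voice_samples lets every chunk start from the same voice conditioning rather than re-deriving it from the reference clips each time.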
4. No respelling overrides for Tortoise (yet)
The pronunciation_overrides rows in the DB are seeded with
lowercase-syllable respellings tuned for Kokoro's misaki tokenizer.
Tortoise uses a different phonemizer (g2p_en), which handles many
of those proper nouns better natively but still mangles some.
For now, narrate's substitution applies the same overrides regardless
of engine, so Tortoise sees prip-yat for "Pripyat": the same
respelled input, interpreted by a different phonemizer. The result is
usually OK, but audit after each batch.
Future: per-engine override sets, OR an engine column on
pronunciation_overrides.
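The engine-column idea could look roughly like this. It is a sketch, not the narrate implementation: the row shape, the field names, and the convention that engine=None means "applies to every engine" are assumptions, and the second row is a made-up example. Only the Pripyat respelling comes from the text above.

```python
import re

# Hypothetical rows; 'engine' is the proposed new column.
OVERRIDES = [
    {"term": "Pripyat", "respelling": "prip-yat", "engine": "kokoro"},
    # Made-up example of a row shared by all engines (engine=None).
    {"term": "Chernobyl", "respelling": "cher-no-bill", "engine": None},
]


def apply_overrides(text, engine, rows=OVERRIDES):
    """Substitute respellings that are global or scoped to `engine`."""
    for row in rows:
        if row["engine"] not in (None, engine):
            continue  # override is tuned for a different engine
        pattern = r"\b" + re.escape(row["term"]) + r"\b"
        text = re.sub(pattern, row["respelling"], text)
    return text


if __name__ == "__main__":
    line = "The road from Chernobyl to Pripyat."
    print(apply_overrides(line, "kokoro"))
    print(apply_overrides(line, "tortoise"))
```

Leaving every existing row at engine=None would preserve today's behaviour, so rows could be scoped to Kokoro one at a time as each batch is audited.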