skald/engines/tortoise/hacks.md
Kayos 7a96031aa6 engine/tortoise: GPU exclusivity wrapper + kludges notes
Adds the Tortoise-specific tooling that main intentionally omits:

- engines/tortoise/exclusive-gpu.sh wraps any command, stops F5 +
  Kokoro on the GPU, restarts Tortoise to clear stale CUDA contexts,
  waits for healthz, runs the command, restarts the engines on EXIT
  trap. Solves the 8GB OOM that took down the first smoke.

- engines/tortoise/hacks.md captures the speed reality (~74x real-
  time slowdown on the 2070 Super at standard preset) and the
  pronunciation-overrides cross-engine compatibility note.

Deploy from this branch when you want Tortoise's tuning. Main's
vanilla Tortoise is for the cross-engine reference + future
'we have more VRAM now' cleanup.
2026-05-14 09:42:09 -07:00

2.8 KiB

Tortoise engine — kludges branched off main

This branch carries the engine-specific tweaks that don't generalise to F5 / Kokoro. Tortoise is the audiobook-quality engine, but its two trade-offs, speed and GPU memory, are real and need explicit handling.

1. GPU exclusivity

File: exclusive-gpu.sh.

The 2070 Super has 8GB. F5 (~4.5GB) + Kokoro (~2.7GB) + Tortoise (~5GB peak) sums to ~12GB — over budget. First Tortoise smoke caught it: torch.OutOfMemoryError: ... 9.31 MiB is free.
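The budget arithmetic is worth keeping next to the numbers so it can be rerun after a GPU swap. A minimal sketch using the approximate footprints quoted above; the names VRAM_GB and fits are illustrative and not part of skald:

```python
# Approximate peak VRAM per engine in GB, as quoted in the paragraph above.
VRAM_GB = {"f5-tts": 4.5, "kokoro": 2.7, "tortoise": 5.0}
CARD_GB = 8.0  # 2070 Super


def fits(engines, card_gb=CARD_GB):
    """Return True if the listed engines' peak footprints fit on the card together."""
    return sum(VRAM_GB[name] for name in engines) <= card_gb


total = round(sum(VRAM_GB.values()), 1)
print(f"all three: {total} GB needed, {CARD_GB} GB available")
print("all three fit:", fits(VRAM_GB))
print("tortoise alone fits:", fits(["tortoise"]))
```

Passing card_gb=24 models a 24GB card, where the same check passes for all three engines.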

Solution: stop the other two engines for the duration of a Tortoise run. The script wraps any command: it stops f5-tts + kokoro, restarts tortoise to clear stale CUDA contexts, waits for healthz, runs the wrapped command, then restarts the two stopped engines from an EXIT trap, so they come back whether the command succeeds or fails.

./exclusive-gpu.sh docker exec skald skald narrate --chapter <uuid>
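The real wrapper is a shell script; the Python sketch below only illustrates its control flow, with try/finally standing in for the EXIT trap. It assumes the engines are Docker containers named f5-tts, kokoro and tortoise, and the healthz URL is hypothetical. The run and wait parameters exist only so the flow can be exercised without Docker.

```python
import subprocess
import time
import urllib.request

OTHER_ENGINES = ["f5-tts", "kokoro"]           # engines parked during the run
TORTOISE = "tortoise"                          # assumed container name
HEALTHZ_URL = "http://localhost:8080/healthz"  # hypothetical URL


def wait_for_healthz(url=HEALTHZ_URL, timeout_s=300.0, interval_s=2.0):
    """Poll the health endpoint until it answers 200 or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return
        except OSError:
            pass  # not up yet: connection refused, timeout, or HTTP error
        time.sleep(interval_s)
    raise TimeoutError(f"{url} not healthy after {timeout_s}s")


def run_exclusive(command, run=subprocess.run, wait=wait_for_healthz):
    """Run `command` while Tortoise is the only engine holding the GPU."""
    try:
        run(["docker", "stop", *OTHER_ENGINES], check=True)
        run(["docker", "restart", TORTOISE], check=True)  # drop stale CUDA contexts
        wait()
        return run(command, check=False).returncode
    finally:
        # Plays the role of the shell EXIT trap: runs on success and on failure.
        run(["docker", "start", *OTHER_ENGINES], check=False)
```

Calling run_exclusive(["docker", "exec", "skald", "skald", "narrate", ...]) would mirror the shell invocation above and return the wrapped command's exit code.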

Remove when: GPU upgrade (P40 24GB / 3090 24GB / etc) lets all three engines co-reside.

2. Speed — slow, batch-only

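Tortoise at standard preset is ~74x slower than real-time on the 2070 Super (smoke: 6.5s of audio took 478s wall clock). A 33-min Chapter 2 render is estimated at ~8 hours, though a straight 74x extrapolation gives ~40 hours (33 min × 74 ≈ 2,440 min), so treat the lower figure as optimistic until a full chapter has been timed. Either way, Tortoise is acceptable for overnight batched runs but NOT interactive rendering.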
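The ratio is easy to recompute as new smoke timings come in. A minimal sketch, assuming wall-clock time scales linearly with audio length (an assumption: fixed start-up cost may make a 6.5s smoke look worse than a long run); the function names are illustrative:

```python
def realtime_factor(audio_s, wall_s):
    """Wall-clock seconds spent per second of rendered audio."""
    return wall_s / audio_s


def estimate_wall_hours(audio_min, rtf):
    """Linear extrapolation: minutes of audio times the factor, in hours."""
    return audio_min * rtf / 60.0


rtf = realtime_factor(audio_s=6.5, wall_s=478.0)
print(f"real-time factor: {rtf:.1f}x")                           # 73.5x
print(f"33 min of audio: {estimate_wall_hours(33, rtf):.1f} h")  # 40.4 h
```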

Quality presets and their approx wall-clock for a 3000-word chapter:

  • ultra_fast — ~1h, noticeable quality drop
  • fast — ~2h
  • standard — ~6-8h, the recommended bar
  • high_quality — ~24h, marginally better than standard

For most use, standard is right. Reserve high_quality for short prologues or named samples.
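For batch planning, the list above can be turned into a small lookup. This is a sketch only: the hours are the rough figures from the list (upper bound where a range is given), the linear scaling by word count is an assumption, and pick_preset is a hypothetical helper, not a skald command.

```python
# Approximate hours per 3000-word chapter, copied from the list above
# (upper bound used where a range is given), ordered from fastest to best.
PRESET_HOURS = [
    ("ultra_fast", 1.0),
    ("fast", 2.0),
    ("standard", 8.0),
    ("high_quality", 24.0),
]
REFERENCE_WORDS = 3000


def pick_preset(words, budget_hours):
    """Return the best preset whose scaled estimate fits the budget, else None."""
    scale = words / REFERENCE_WORDS  # assumes time grows linearly with word count
    best = None
    for name, hours in PRESET_HOURS:
        if hours * scale <= budget_hours:
            best = name  # later entries are higher quality, so keep the last fit
    return best


print(pick_preset(3000, budget_hours=10))  # full chapter, overnight window
print(pick_preset(750, budget_hours=8))    # short prologue, same window
```

With these figures a 10-hour window selects standard for a full chapter, while a 750-word prologue fits high_quality in 8 hours, which matches the recommendation above.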

3. Voice mapping format

Tortoise's voice roster (lj, freeman, daniel, etc.) lives behind source='tortoise_tts' in the voices table. Character slug → Tortoise voice mapping is independent of the Kokoro mapping — a story can have BOTH a Kokoro and Tortoise mapping live in parallel, picked at render time via story.preferred_voice_id or the --voice flag.

Tortoise voices may sometimes warble or stutter at chunk boundaries — the tortoise.api.TextToSpeech.tts_with_preset call is per-chunk and re-conditions the voice each time. Acceptable for v0.1; future work could feed conditioning_latents directly for tighter cohesion.
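The future-work idea could look roughly like the sketch below: compute the conditioning latents once per voice and pass them to every chunk. The tts argument is expected to behave like tortoise.api.TextToSpeech (get_conditioning_latents and tts_with_preset), and voice_samples would come from something like tortoise.utils.audio.load_voice. render_chunks itself is a hypothetical helper, and whether this actually reduces the warble is untested, since sampling is still stochastic per chunk.

```python
def render_chunks(tts, chunks, voice_samples, preset="standard"):
    """Render every chunk against one shared set of conditioning latents.

    `tts` is expected to behave like tortoise.api.TextToSpeech; it is passed in
    so the function can be exercised without loading the models.
    """
    # Condition on the reference clips once instead of once per chunk.
    latents = tts.get_conditioning_latents(voice_samples)
    return [
        tts.tts_with_preset(chunk, preset=preset, conditioning_latents=latents)
        for chunk in chunks
    ]
```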

4. No respelling overrides for Tortoise (yet)

The pronunciation_overrides rows in the DB are seeded with lowercase-syllable respellings tuned for Kokoro's misaki tokenizer. Tortoise uses a different phonemizer (g2p_en) which handles many of those proper nouns better natively — but some still mangle.

For now, narrate's substitution applies the same overrides regardless of engine, so Tortoise receives prip-yat for "Pripyat": the same respelled input, interpreted by a different phonemizer. The result is usually fine, but audit the audio after each batch.

Future: per-engine override sets, OR an engine column on pronunciation_overrides.
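The engine-column option could be sketched as follows. Everything beyond the table name is an assumption: the word and respelling column names, the convention that a NULL engine means "applies to every engine", and the helper names. The Tortoise row is purely illustrative (it passes the original spelling through).

```python
import re
import sqlite3


def load_overrides(conn, engine):
    """Return {word: respelling}, letting engine-specific rows beat generic ones."""
    rows = conn.execute(
        "SELECT word, respelling FROM pronunciation_overrides "
        "WHERE engine IS NULL OR engine = ? "
        "ORDER BY engine IS NOT NULL",  # generic rows first, specific rows last
        (engine,),
    ).fetchall()
    return {word: respelling for word, respelling in rows}  # last row per word wins


def apply_overrides(text, overrides):
    """Replace whole-word matches with their respelling."""
    for word, respelling in overrides.items():
        pattern = rf"\b{re.escape(word)}\b"
        text = re.sub(pattern, lambda _m, r=respelling: r, text)
    return text


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pronunciation_overrides "
    "(word TEXT NOT NULL, respelling TEXT NOT NULL, engine TEXT)"
)
conn.executemany(
    "INSERT INTO pronunciation_overrides VALUES (?, ?, ?)",
    [
        ("Pripyat", "prip-yat", None),       # existing Kokoro-tuned respelling
        ("Pripyat", "Pripyat", "tortoise"),  # illustrative Tortoise-only row
    ],
)

for engine in ("kokoro", "tortoise"):
    print(engine, "->", apply_overrides("Welcome to Pripyat.", load_overrides(conn, engine)))
```

Existing rows keep working unchanged under this scheme, because a NULL engine still matches every engine until a specific row is added.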