skald/engines/README.md
Cobb Hayes 346cea515d Public-flip audit: env-driven paths, scrub audit-ticket prefixes, terser README
Lucy bind paths + LAN host pins replaced with env defaults. Repository URLs
→ git.sulkta.com. Audit-changelog scaffolding stripped from inline comments
(technical reasoning preserved). README sheds marketing scaffolding. AI-speak
in load-bearing prompts/SOULs left alone — that IS the product.
2026-05-27 11:42:58 -07:00

57 lines
2.5 KiB
Markdown

# Skald TTS engines
This subtree holds the per-engine sidecars that skald's narrate path
talks to over HTTP. Each engine has the same contract:
- `POST /synthesize` — same JSON shape across engines so skald's
one Rust client (`skald-core::narrate::Narrator`) deserializes
all of them. See `engines/<name>/server.py` for the per-engine
implementation.
- `GET /healthz` — boot probe + model-loaded flag.
Skald routes per-request by `voices.source`: a `kokoro_*` source
goes to `$KOKORO_URL`, a `tortoise_*` source goes to `$TORTOISE_URL`,
anything else (`lj_speech`, generic) goes to `$F5_TTS_URL`.
## Engines
| Dir | Engine | License (code/weights) | VRAM | Speed | Voices |
|---|---|---|---|---|---|
| `f5-tts/` | SWivid F5-TTS v1 | MIT / **CC-BY-NC** | ~5GB | fast (~2x real-time on 2070S) | voice cloning (LJ Speech reference shipped) |
| `kokoro/` | hexgrad Kokoro-82M | Apache 2.0 / Apache 2.0 | ~1GB | very fast (~50x real-time) | 50+ named presets (af_*, am_*, bf_*, bm_*) |
| `tortoise/` | neonbjb Tortoise-TTS | Apache 2.0 / Apache 2.0 | ~5GB | **slow** (~0.014x real-time, ~74s/s of audio on 2070S, standard preset) | 26 named built-ins (lj, freeman, daniel, weaver, jlaw, etc.) |
## Branch model
`main` carries the **vanilla** version of each engine — what you'd
get from a clean `pip install <engine>` plus the FastAPI sidecar
+ control-tag splitter. No engine-specific kludges. Safe to look
at without context.
`engine/<name>` branches hold engine-tuned tweaks that don't
generalise. Examples:
- `engine/kokoro` — doubled-`??` prosody hack for the 82M's weak
question intonation, paragraph/scene/breath gap durations tuned
for af_heart's pacing, notes on how respellings need to be all-
lowercase to avoid letter-by-letter spell-out by misaki.
- `engine/tortoise` — GPU exclusivity coordinator (stops F5 +
Kokoro before a Tortoise run since the 2070 Super can't host
all three at once), preset choice ergonomics, character→tortoise-
voice seed assignments.
To deploy a tuned engine, check out the engine's branch in the build
dir and `docker compose up -d --build`:
```bash
git fetch && git checkout engine/kokoro
docker compose up -d --build
```
## GPU coordination
On an 8GB card F5 + Kokoro can co-reside (~5GB + ~1GB). Tortoise
pushes the budget over and needs the GPU largely to itself — the
`engine/tortoise` branch carries a script to stop kokoro + f5
before a Tortoise run and restart them after. Replace with proper
coordination once more VRAM is available.