Compare commits


2 commits

Author SHA1 Message Date
9df378f799 engine/tortoise: sentence chunking + device fix + pitch/rate modulation
Catches up engines/tortoise/server.py with what's been deployed on
Lucy through tonight's smoke iterations:

0.2 — _chunk_for_tortoise splits text nodes at sentence boundaries
      (max 220 chars) before each tts_with_preset call. Fixes the
      end-of-prompt gibberish past tortoise's ~20s reliable horizon.

0.3 — _get_voice now .to(DEVICE) cached samples + latents. Without
      this, non-lj voices crash with 'Expected all tensors to be on
      the same device, but found cpu and cuda:0'.

0.4 — [voice:NAME pitch=N rate=R][/voice] tag syntax. librosa
      pitch_shift + time_stretch applied per-chunk for single-voice
      multi-character renders. The strategy survived the design
      table, but librosa's phase-vocoder artifacts at ±5 semitones
      ate the quality on the 2070 Super. Parked here for the GPU
      rebuild; modulation works architecturally, it just needs a
      better stretching algorithm (rubberband) and more headroom.

Production stayed Kokoro. Coast-Down preferred_voice_id reverted
to kokoro_af_heart in the live DB after this experiment.
2026-05-14 19:08:43 -07:00
7a96031aa6 engine/tortoise: GPU exclusivity wrapper + kludges notes
Adds the Tortoise-specific tooling that main intentionally omits:

- engines/tortoise/exclusive-gpu.sh wraps any command, stops F5 +
  Kokoro on the GPU, restarts Tortoise to clear stale CUDA contexts,
  waits for healthz, runs the command, restarts the engines on EXIT
  trap. Solves the 8GB OOM that took down the first smoke.

- engines/tortoise/hacks.md captures the speed reality (~74x real-
  time slowdown on the 2070 Super at standard preset) and the
  pronunciation-overrides cross-engine compatibility note.

Deploy from this branch when you want Tortoise's tuning. Main's
vanilla Tortoise is for the cross-engine reference + future
'we have more VRAM now' cleanup.
2026-05-14 09:42:09 -07:00
3 changed files with 268 additions and 22 deletions

engines/tortoise/exclusive-gpu.sh

@@ -0,0 +1,47 @@
#!/bin/bash
# Tortoise GPU exclusivity wrapper. The 2070 Super (8GB) can't host
# F5 (~4.5GB) + Kokoro (~2.7GB) + Tortoise (~5GB peak) simultaneously,
# so we stop the other two engines for the duration of a Tortoise run
# and restart them after.
#
# Usage:
# exclusive-gpu.sh <command...>
#
# Example:
# exclusive-gpu.sh docker exec skald skald narrate --chapter <uuid>
#
# Exits with the wrapped command's status. Restarts the engines
# regardless of success/failure (trap on EXIT).
set -euo pipefail
STOP_ENGINES=(f5-tts kokoro)
cleanup() {
  local rc=$?
  echo "[exclusive-gpu] restarting engines"
  for engine in "${STOP_ENGINES[@]}"; do
    docker start "$engine" >/dev/null 2>&1 || \
      echo "[exclusive-gpu] failed to restart $engine — investigate"
  done
  return "$rc"
}
trap cleanup EXIT
echo "[exclusive-gpu] stopping engines: ${STOP_ENGINES[*]}"
for engine in "${STOP_ENGINES[@]}"; do
  docker stop "$engine" >/dev/null 2>&1 || true
done
# Restart Tortoise to clean up any cached GPU allocations from the
# now-stopped engines (their CUDA contexts can linger briefly).
docker restart tortoise >/dev/null
echo "[exclusive-gpu] waiting for tortoise healthz..."
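ready=0
for _ in {1..30}; do
  if curl -sf http://192.168.0.5:7795/healthz | grep -q '"loaded":true'; then
    ready=1
    break
  fi
  sleep 2
done
# Without this check a Tortoise that never loads would still get the job.
if [[ "$ready" -ne 1 ]]; then
  echo "[exclusive-gpu] tortoise not healthy after 30 attempts; aborting" >&2
  exit 1
fi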
echo "[exclusive-gpu] running: $*"
"$@"

engines/tortoise/hacks.md Normal file

@@ -0,0 +1,71 @@
# Tortoise engine — kludges branched off main
This branch carries the engine-specific tweaks that don't generalise
to F5 / Kokoro. Tortoise is the audiobook-quality engine but the
trade-offs are real and need explicit handling — speed and GPU.
## 1. GPU exclusivity
**File:** `exclusive-gpu.sh`.
The 2070 Super has 8GB. F5 (~4.5GB) + Kokoro (~2.7GB) + Tortoise
(~5GB peak) sums to ~12GB — over budget. First Tortoise smoke
caught it: `torch.OutOfMemoryError: ... 9.31 MiB is free`.
Solution: stop the other two engines for the duration of a Tortoise
run. The script wraps any command, stops `f5-tts` + `kokoro`,
restarts `tortoise` to clean its CUDA context, waits for healthz,
runs the wrapped command, then restarts the engines on EXIT trap
(success or failure).
```bash
./exclusive-gpu.sh docker exec skald skald narrate --chapter <uuid>
```
Remove when: GPU upgrade (P40 24GB / 3090 24GB / etc) lets all three
engines co-reside.
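A minimal sketch of the budget arithmetic above, assuming the approximate per-engine figures quoted in this section (they are estimates, not measurements taken by this snippet). The `fits` helper and the dictionary are illustrative names, not part of the repo.

```python
# Approximate per-engine VRAM figures from the text above, in GB.
GPU_BUDGET_GB = 8.0
ENGINE_VRAM_GB = {"f5-tts": 4.5, "kokoro": 2.7, "tortoise": 5.0}


def fits(engines, budget_gb=GPU_BUDGET_GB):
    """Return (total_gb, ok) for a set of engines resident at the same time."""
    total = sum(ENGINE_VRAM_GB[name] for name in engines)
    return total, total <= budget_gb


all_three_total, all_three_ok = fits(ENGINE_VRAM_GB)
tortoise_total, tortoise_ok = fits(["tortoise"])

print(f"all three: {all_three_total:.1f} GB, fits={all_three_ok}")
print(f"tortoise alone: {tortoise_total:.1f} GB, fits={tortoise_ok}")
```

Re-running the same check with a 24GB budget is a quick way to confirm when this wrapper can be dropped.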
## 2. Speed — slow, batch-only
Tortoise at `standard` preset is **~74x slower than real-time** on
the 2070 Super (smoke: 6.5s of audio took 478s wall clock). A 33-min
Chapter 2 render would take ~8 hours. Tortoise is acceptable for
overnight batched runs but NOT interactive rendering.
Quality presets and their approx wall-clock for a 3000-word chapter:
- `ultra_fast` — ~1h, noticeable quality drop
- `fast` — ~2h
- `standard` — ~6-8h, the recommended bar
- `high_quality` — ~24h, marginally better than standard
For most use, `standard` is right. Reserve `high_quality` for
short prologues or named samples.
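One way to turn the list above into a scheduling decision is sketched below. It assumes the upper end of each quoted range as the per-chapter estimate; `pick_preset` is a hypothetical helper, not something the engine or `skald` provides.

```python
# Upper-bound wall-clock hours per 3000-word chapter, copied from the
# list above (2070 Super). Ordered from lowest to highest quality.
PRESET_HOURS = [
    ("ultra_fast", 1.0),
    ("fast", 2.0),
    ("standard", 8.0),
    ("high_quality", 24.0),
]


def pick_preset(budget_hours, chapters=1):
    """Return the highest-quality preset whose estimate fits the budget.

    Returns None when even ultra_fast would overrun.
    """
    best = None
    for name, hours_per_chapter in PRESET_HOURS:
        if hours_per_chapter * chapters <= budget_hours:
            best = name
    return best


print(pick_preset(10))              # one chapter overnight
print(pick_preset(10, chapters=2))  # two chapters in the same window
```

The estimates are coarse, so treat the result as a starting point and check the first chunk timings in the log before committing to a long batch.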
## 3. Voice mapping format
Tortoise's voice roster (`lj`, `freeman`, `daniel`, etc.) lives
behind `source='tortoise_tts'` in the `voices` table. Character
slug → Tortoise voice mapping is independent of the Kokoro mapping
— a story can have BOTH a Kokoro and Tortoise mapping live in
parallel, picked at render time via `story.preferred_voice_id` or
the `--voice` flag.
Tortoise voices may sometimes warble or stutter at chunk boundaries
— the `tortoise.api.TextToSpeech.tts_with_preset` call is per-chunk
and re-conditions the voice each time. Acceptable for v0.1; future
work could feed `conditioning_latents` directly for tighter cohesion.
## 4. No respelling overrides for Tortoise (yet)
The `pronunciation_overrides` rows in the DB are seeded with
lowercase-syllable respellings tuned for Kokoro's misaki tokenizer.
Tortoise uses a different phonemizer (`g2p_en`) which handles many
of those proper nouns better natively — but some still mangle.
For now, narrate's substitution applies the same overrides regardless
of engine, which means Tortoise sees `prip-yat` for "Pripyat" — same
input, different phonemizer interprets differently. Usually OK but
audit after each batch.
Future: per-engine override sets, OR an `engine` column on
`pronunciation_overrides`.
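A rough sketch of how the `engine` column option could behave, under the assumption that a NULL engine means "applies to every engine" and an engine-specific row wins over a generic one. The `Override` dataclass and `apply_overrides` function are hypothetical; the current narrate code does not work this way.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Override:
    word: str
    respelling: str
    engine: Optional[str] = None  # None = applies to every engine (assumed)


def apply_overrides(text, overrides, engine):
    """Substitute whole words; engine-specific rows beat generic rows."""
    chosen = {}
    for o in overrides:
        if o.engine not in (None, engine):
            continue  # row belongs to a different engine
        if o.word not in chosen or o.engine is not None:
            chosen[o.word] = o
    for o in chosen.values():
        pattern = rf"\b{re.escape(o.word)}\b"
        # A lambda avoids backslash processing in the replacement string.
        text = re.sub(pattern, lambda _m, r=o.respelling: r, text)
    return text


rows = [
    Override("Pripyat", "prip-yat"),                    # Kokoro-tuned default
    Override("Pripyat", "Pripyat", engine="tortoise"),  # leave it to g2p_en
]
print(apply_overrides("Pripyat was empty.", rows, "kokoro"))
print(apply_overrides("Pripyat was empty.", rows, "tortoise"))
```

With rows like these, Kokoro keeps its respelling while Tortoise receives the original spelling, which would remove the need for a post-batch audit of words its phonemizer already handles.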

engines/tortoise/server.py

@@ -23,6 +23,7 @@ import time
 import uuid
 from pathlib import Path
+import librosa
 import numpy as np
 import soundfile as sf
 import torch
@@ -62,12 +63,31 @@ def _get_tts() -> TextToSpeech:
     return _tts
+def _move_to_device(obj):
+    """Recursively .to(DEVICE) tensors inside the structure tortoise
+    returns from load_voice. voice_samples is a list of tensors;
+    conditioning_latents is a tuple of tensors. Anything else
+    passes through unchanged (e.g. None, ints)."""
+    if obj is None:
+        return obj
+    if isinstance(obj, torch.Tensor):
+        return obj.to(DEVICE)
+    if isinstance(obj, list):
+        return [_move_to_device(x) for x in obj]
+    if isinstance(obj, tuple):
+        return tuple(_move_to_device(x) for x in obj)
+    return obj
 def _get_voice(name: str) -> tuple:
     """Cache voice latents to avoid re-loading reference clips on
     every synthesis call. Tortoise's load_voice returns
-    (voice_samples, conditioning_latents)."""
+    (voice_samples, conditioning_latents) but they're created on
+    CPU; we move them to DEVICE so the autoregressive model (on
+    CUDA) doesn't fail with cpu/cuda tensor-device mismatch."""
     if name not in _voice_cache:
-        _voice_cache[name] = load_voice(name)
+        samples, latents = load_voice(name)
+        _voice_cache[name] = (_move_to_device(samples), _move_to_device(latents))
     return _voice_cache[name]
@@ -75,15 +95,38 @@ def _get_voice(name: str) -> tuple:
 class Node:
-    __slots__ = ("kind", "value", "voice")
+    __slots__ = ("kind", "value", "voice", "pitch", "rate")
-    def __init__(self, kind: str, value, voice: str | None = None):
+    def __init__(
+        self,
+        kind: str,
+        value,
+        voice: str | None = None,
+        pitch: float = 0.0,
+        rate: float = 1.0,
+    ):
+        # kind ∈ {"text", "silence"}; value is str for text, float
+        # seconds for silence. voice/pitch/rate are character-voicing
+        # modifiers from [voice:NAME pitch=N rate=R] tags. Default:
+        # request voice, 0 semitones, 1x rate.
         self.kind = kind
         self.value = value
         self.voice = voice
+        self.pitch = pitch
+        self.rate = rate
-_VOICE_OPEN_RE = re.compile(r"\[voice:([A-Za-z0-9_-]+)\]")
+# Voice open tag — name + optional pitch (semitones) + optional rate:
+# [voice:dyatlov] → voice swap only
+# [voice:lj pitch=-3] → same voice, 3 semitones lower
+# [voice:lj pitch=2 rate=1.1] → higher + slightly faster (fairy)
+# [voice:lj pitch=-4 rate=0.9] → lower + slower (troll)
+_VOICE_OPEN_RE = re.compile(
+    r"\[voice:([A-Za-z0-9_-]+)"
+    r"(?:\s+pitch=(-?[0-9]+(?:\.[0-9]+)?))?"
+    r"(?:\s+rate=([0-9]+(?:\.[0-9]+)?))?"
+    r"\]"
+)
 _VOICE_CLOSE = "[/voice]"
 _TAG_RE = re.compile(
     r"\[(pause:(?P<dur>[0-9]+(?:\.[0-9]+)?)(?P<unit>s|ms)?|breath|scene)\]",
@@ -102,7 +145,70 @@ def _parse_tag(match: re.Match) -> float:
     return dur / 1000.0 if unit == "ms" else dur
-def _expand_inline(text: str, voice: str | None) -> list[Node]:
+# Tortoise's autoregressive head loses coherence past ~20s of generated
+# audio per inference call. lj's pace is roughly 14 chars/s, so anything
+# past ~280 chars per call risks gibberish at the end. synthesize()
+# splits each text node at sentence boundaries so every tts_with_preset
+# call stays inside the model's reliable horizon; 220 leaves headroom.
+TORTOISE_MAX_CHUNK_CHARS = 220
+# Sentence boundary regex: splits after `.`/`?`/`!` when followed by
+# whitespace and a capital letter, `"` or `(`, OR after any newline.
+# "U.S." stays whole; "Mr. Smith" does split, but the loop below re-joins it.
+_SENTENCE_BOUNDARY = re.compile(r"(?<=[\.!?])\s+(?=[A-Z\"\(])|(?<=\n)\s*")
+def _chunk_for_tortoise(text: str, max_chars: int = TORTOISE_MAX_CHUNK_CHARS) -> list[str]:
+    """Split text into chunks <= max_chars at sentence boundaries.
+    If a single sentence exceeds max_chars (rare for prose), fall
+    back to splitting that sentence at commas or just hard-cutting.
+    """
+    sentences = [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s and s.strip()]
+    chunks: list[str] = []
+    current = ""
+    for sent in sentences:
+        # Long sentence: emit alone, but try sub-splitting at commas.
+        if len(sent) > max_chars:
+            if current:
+                chunks.append(current.strip())
+                current = ""
+            # Split on commas
+            parts = [p.strip() for p in sent.split(",") if p.strip()]
+            sub = ""
+            for p in parts:
+                add = (sub + ", " if sub else "") + p
+                if len(add) <= max_chars:
+                    sub = add
+                else:
+                    if sub:
+                        chunks.append(sub)
+                    # If even the part alone exceeds, hard-cut at max_chars
+                    while len(p) > max_chars:
+                        chunks.append(p[:max_chars])
+                        p = p[max_chars:]
+                    sub = p
+            if sub:
+                chunks.append(sub)
+            continue
+        # Sentence fits: accumulate.
+        candidate = (current + " " if current else "") + sent
+        if len(candidate) <= max_chars:
+            current = candidate
+        else:
+            if current:
+                chunks.append(current.strip())
+            current = sent
+    if current:
+        chunks.append(current.strip())
+    return chunks
+def _expand_inline(
+    text: str,
+    voice: str | None,
+    pitch: float = 0.0,
+    rate: float = 1.0,
+) -> list[Node]:
     out: list[Node] = []
     text = text.strip()
     if not text:
@@ -111,12 +217,12 @@ def _expand_inline(text: str, voice: str | None) -> list[Node]:
     for m in _TAG_RE.finditer(text):
         pre = text[cursor : m.start()].strip()
         if pre:
-            out.append(Node("text", pre, voice))
+            out.append(Node("text", pre, voice, pitch, rate))
         out.append(Node("silence", _parse_tag(m)))
         cursor = m.end()
     tail = text[cursor:].strip()
     if tail:
-        out.append(Node("text", tail, voice))
+        out.append(Node("text", tail, voice, pitch, rate))
     return out
@@ -130,12 +236,14 @@ def _split_paragraph_voices(para: str) -> list[Node]:
             break
         out.extend(_expand_inline(para[cursor : m.start()], None))
         voice = m.group(1)
+        pitch = float(m.group(2)) if m.group(2) else 0.0
+        rate = float(m.group(3)) if m.group(3) else 1.0
         body_start = m.end()
         close_idx = para.find(_VOICE_CLOSE, body_start)
         if close_idx < 0:
-            out.extend(_expand_inline(para[body_start:], voice))
+            out.extend(_expand_inline(para[body_start:], voice, pitch, rate))
             break
-        out.extend(_expand_inline(para[body_start:close_idx], voice))
+        out.extend(_expand_inline(para[body_start:close_idx], voice, pitch, rate))
         cursor = close_idx + len(_VOICE_CLOSE)
     return out
@@ -253,6 +361,7 @@ def synthesize(req: SynthesizeRequest) -> SynthesizeResponse:
     started = time.monotonic()
     pieces: list[np.ndarray] = []
     voices_used: set[str] = set()
+    tortoise_chunks_rendered = 0
     for node in nodes:
         if node.kind == "silence":
             pieces.append(_silence_samples(node.value))
@@ -264,18 +373,37 @@ def synthesize(req: SynthesizeRequest) -> SynthesizeResponse:
         except Exception as e:
             log.warning("voice %s failed to load (%s); falling back to default", seg_voice, e)
             samples, latents = _get_voice(voice)
-        # Tortoise's tts_with_preset returns a torch.Tensor on the
-        # configured device.
-        audio_tensor = tts.tts_with_preset(
-            text=node.value,
-            voice_samples=samples,
-            conditioning_latents=latents,
-            preset=preset,
-        )
-        if isinstance(audio_tensor, list):
-            audio_tensor = audio_tensor[0]
-        arr = audio_tensor.squeeze().cpu().numpy().astype(np.float32)
-        pieces.append(arr)
+        # Each text node may exceed Tortoise's reliable ~20s horizon —
+        # split at sentence boundaries before feeding the model.
+        sub_chunks = _chunk_for_tortoise(node.value)
+        for sub_idx, sub in enumerate(sub_chunks):
+            audio_tensor = tts.tts_with_preset(
+                text=sub,
+                voice_samples=samples,
+                conditioning_latents=latents,
+                preset=preset,
+            )
+            if isinstance(audio_tensor, list):
+                audio_tensor = audio_tensor[0]
+            arr = audio_tensor.squeeze().cpu().numpy().astype(np.float32)
+            # Per-character voice modulation via librosa. Apply
+            # pitch first (preserves duration), then rate (preserves
+            # pitch). Default pitch=0, rate=1.0 = no-op fast path.
+            if abs(node.pitch) > 1e-3:
+                arr = librosa.effects.pitch_shift(
+                    arr, sr=SAMPLE_RATE, n_steps=node.pitch
+                )
+            if abs(node.rate - 1.0) > 1e-3:
+                arr = librosa.effects.time_stretch(arr, rate=node.rate)
+            arr = arr.astype(np.float32)
+            pieces.append(arr)
+            tortoise_chunks_rendered += 1
+            log.info(
+                "chunk %d/%d done (%d chars, pitch=%+.1f rate=%.2f, %.1fs audio so far)",
+                sub_idx + 1, len(sub_chunks), len(sub),
+                node.pitch, node.rate,
+                sum(len(p) for p in pieces) / SAMPLE_RATE,
+            )
     elapsed_ms = int((time.monotonic() - started) * 1000)
     if not pieces: