engine/tortoise: sentence chunking + device fix + pitch/rate modulation

Catches up engines/tortoise/server.py with what's been deployed on Lucy through tonight's smoke iterations:

0.2: _chunk_for_tortoise splits text nodes at sentence boundaries (max 220 chars) before each tts_with_preset call. Fixes the end-of-prompt gibberish past Tortoise's ~20s reliable horizon.

0.3: _get_voice now moves cached samples + latents to DEVICE. Without this, non-lj voices crash with 'Expected all tensors to be on the same device, but found cpu and cuda:0'.

0.4: [voice:NAME pitch=N rate=R][/voice] tag syntax. librosa pitch_shift + time_stretch applied per-chunk for single-voice multi-character renders.

The strategy survived the design table, but the librosa phase-vocoder artifacts at ±5 semitones ate the quality on the 2070 Super. Parked here for the GPU rebuild; modulation works architecturally, it just needs a better stretching algorithm (rubberband) + more headroom.

Production stayed Kokoro. Coast-Down preferred_voice_id reverted to kokoro_af_heart in the live DB after this experiment.
engine/tortoise: GPU exclusivity wrapper + kludges notes
2026-05-14 19:08:43 -07:00 · 2026-05-14 09:42:09 -07:00
3 changed files with 268 additions and 22 deletions
--- a/engines/tortoise/exclusive-gpu.sh
+++ b/engines/tortoise/exclusive-gpu.sh
@@ -0,0 +1,47 @@
+#!/bin/bash
+# Tortoise GPU exclusivity wrapper. The 2070 Super (8GB) can't host
+# F5 (~4.5GB) + Kokoro (~2.7GB) + Tortoise (~5GB peak) simultaneously,
+# so we stop the other two engines for the duration of a Tortoise run
+# and restart them after.
+#
+# Usage:
+#   exclusive-gpu.sh <command...>
+#
+# Example:
+#   exclusive-gpu.sh docker exec skald skald narrate --chapter <uuid>
+#
+# Exits with the wrapped command's status. Restarts the engines
+# regardless of success/failure (trap on EXIT).
+set -euo pipefail
+
+STOP_ENGINES=(f5-tts kokoro)
+
+cleanup() {
+    local rc=$?
+    echo "[exclusive-gpu] restarting engines"
+    for engine in "${STOP_ENGINES[@]}"; do
+        docker start "$engine" >/dev/null 2>&1 || \
+            echo "[exclusive-gpu] failed to restart $engine — investigate"
+    done
+    return "$rc"
+}
+trap cleanup EXIT
+
+echo "[exclusive-gpu] stopping engines: ${STOP_ENGINES[*]}"
+for engine in "${STOP_ENGINES[@]}"; do
+    docker stop "$engine" >/dev/null 2>&1 || true
+done
+
+# Restart Tortoise so it starts from a clean CUDA context once the
+# now-stopped engines have released their VRAM (this can lag briefly).
+docker restart tortoise >/dev/null
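+echo "[exclusive-gpu] waiting for tortoise healthz..."
+ready=0  # stays 0 if tortoise never reports loaded within ~60s
+for i in {1..30}; do
+    curl -sf http://192.168.0.5:7795/healthz | grep -q '"loaded":true' && { ready=1; break; }
+    sleep 2
+done
+[ "$ready" -eq 1 ] || { echo "[exclusive-gpu] tortoise never reported loaded; aborting" >&2; exit 1; }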
+
+echo "[exclusive-gpu] running: $*"
+"$@"
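The VRAM figures in the script's header comment are what make exclusivity necessary. A minimal Python sketch of that budget check, assuming the approximate per-engine peaks quoted there are accurate; `ENGINE_VRAM_GB`, `GPU_VRAM_GB` and `fits` are hypothetical names used only for illustration and are not part of the wrapper:

```python
# Approximate peak VRAM per engine in GB, copied from the script comments.
ENGINE_VRAM_GB = {"f5-tts": 4.5, "kokoro": 2.7, "tortoise": 5.0}
GPU_VRAM_GB = 8.0  # 2070 Super


def fits(engines, budget_gb=GPU_VRAM_GB):
    """Return (fits_in_budget, total_gb) for a set of co-resident engines."""
    total = round(sum(ENGINE_VRAM_GB[e] for e in engines), 1)
    return total <= budget_gb, total


all_three = fits(["f5-tts", "kokoro", "tortoise"])
tortoise_only = fits(["tortoise"])
print(all_three)      # over budget
print(tortoise_only)  # fits with room to spare
```

On these nominal figures Kokoro plus Tortoise would total 7.7 GB, only about 0.3 GB under the card's 8 GB. That leaves little room for CUDA context and allocator overhead, which may be why the wrapper stops both engines rather than F5 alone.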
--- a/engines/tortoise/hacks.md
+++ b/engines/tortoise/hacks.md
@@ -0,0 +1,71 @@
+# Tortoise engine — kludges branched off main
+
+This branch carries the engine-specific tweaks that don't generalise
+to F5 / Kokoro. Tortoise is the audiobook-quality engine but the
+trade-offs are real and need explicit handling — speed and GPU.
+
+## 1. GPU exclusivity
+
+**File:** `exclusive-gpu.sh`.
+
+The 2070 Super has 8GB. F5 (~4.5GB) + Kokoro (~2.7GB) + Tortoise
+(~5GB peak) sums to ~12GB — over budget. First Tortoise smoke
+caught it: `torch.OutOfMemoryError: ... 9.31 MiB is free`.
+
+Solution: stop the other two engines for the duration of a Tortoise
+run. The script wraps any command, stops `f5-tts` + `kokoro`,
+restarts `tortoise` to clean its CUDA context, waits for healthz,
+runs the wrapped command, then restarts the engines on EXIT trap
+(success or failure).
+
+```bash
+./exclusive-gpu.sh docker exec skald skald narrate --chapter <uuid>
+```
+
+Remove when: GPU upgrade (P40 24GB / 3090 24GB / etc) lets all three
+engines co-reside.
+
+## 2. Speed — slow, batch-only
+
+Tortoise at `standard` preset is **~74x slower than real-time** on
+the 2070 Super (smoke: 6.5s of audio took 478s wall clock). A 33-min
+Chapter 2 is estimated at ~8 hours; that ratio scaled linearly gives
+~40. Either way: overnight batched runs only, NOT interactive rendering.
+
+Quality presets and their approx wall-clock for a 3000-word chapter:
+- `ultra_fast` — ~1h, noticeable quality drop
+- `fast` — ~2h
+- `standard` — ~6-8h, the recommended bar
+- `high_quality` — ~24h, marginally better than standard
+
+For most use, `standard` is right. Reserve `high_quality` for
+short prologues or named samples.
+
+## 3. Voice mapping format
+
+Tortoise's voice roster (`lj`, `freeman`, `daniel`, etc.) lives
+behind `source='tortoise_tts'` in the `voices` table. Character
+slug → Tortoise voice mapping is independent of the Kokoro mapping
+— a story can have BOTH a Kokoro and Tortoise mapping live in
+parallel, picked at render time via story.preferred_voice_id or
+the --voice flag.
+
+Tortoise voices may sometimes warble or stutter at chunk boundaries
+— the `tortoise.api.TextToSpeech.tts_with_preset` call is per-chunk
+and re-conditions the voice each time. Acceptable for v0.1; future
+work could feed `conditioning_latents` directly for tighter cohesion.
+
+## 4. No respelling overrides for Tortoise (yet)
+
+The `pronunciation_overrides` rows in the DB are seeded with
+lowercase-syllable respellings tuned for Kokoro's misaki tokenizer.
+Upstream Tortoise uses a BPE text tokenizer rather than a phonemizer
+and handles many of those nouns better natively, but some still mangle.
+
+For now, narrate's substitution applies the same overrides regardless
+of engine, which means Tortoise sees `prip-yat` for "Pripyat": same
+input, different text front end, different result. Usually OK but
+audit after each batch.
+
+Future: per-engine override sets, OR an `engine` column on
+pronunciation_overrides.
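The smoke-test numbers in section 2 of hacks.md can be turned into a real-time factor and extrapolated. A minimal sketch, assuming render cost scales linearly with audio length (which a single short smoke clip cannot confirm); `realtime_factor` and `estimate_wall_hours` are hypothetical helper names:

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Wall-clock seconds spent per second of audio produced."""
    return wall_seconds / audio_seconds


def estimate_wall_hours(audio_minutes: float, rtf: float) -> float:
    """Linear extrapolation: assumes cost depends only on audio length."""
    return audio_minutes * rtf / 60.0


# Smoke figures from hacks.md: 6.5s of audio took 478s wall clock.
rtf = realtime_factor(audio_seconds=6.5, wall_seconds=478.0)
print(f"real-time factor: {rtf:.1f}x")
print(f"33 min of audio: {estimate_wall_hours(33, rtf):.1f} h")
```

The factor comes out at roughly 73.5x, and 33 minutes of audio extrapolates to a little over 40 hours. That is well above the ~8 hour figure quoted for Chapter 2, so the short smoke clip may be dominated by fixed per-call overhead; timing a full chapter would settle which number to plan around.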
--- a/engines/tortoise/server.py
+++ b/engines/tortoise/server.py
@@ -23,6 +23,7 @@ import time
 import uuid
 from pathlib import Path

+import librosa
 import numpy as np
 import soundfile as sf
 import torch
@@ -62,12 +63,31 @@ def _get_tts() -> TextToSpeech:
    return _tts


+def _move_to_device(obj):
+    """Recursively .to(DEVICE) tensors inside the structure tortoise
+    returns from load_voice. voice_samples is a list of tensors;
+    conditioning_latents is a tuple of tensors. Anything else
+    passes through unchanged (e.g. None, ints)."""
+    if obj is None:
+        return obj
+    if isinstance(obj, torch.Tensor):
+        return obj.to(DEVICE)
+    if isinstance(obj, list):
+        return [_move_to_device(x) for x in obj]
+    if isinstance(obj, tuple):
+        return tuple(_move_to_device(x) for x in obj)
+    return obj
+
+
 def _get_voice(name: str) -> tuple:
    """Cache voice latents to avoid re-loading reference clips on
    every synthesis call. Tortoise's load_voice returns
-    (voice_samples, conditioning_latents)."""
+    (voice_samples, conditioning_latents) — but they're created on
+    CPU; we move them to DEVICE so the autoregressive model (on
+    CUDA) doesn't fail with cpu/cuda tensor-device mismatch."""
    if name not in _voice_cache:
-        _voice_cache[name] = load_voice(name)
+        samples, latents = load_voice(name)
+        _voice_cache[name] = (_move_to_device(samples), _move_to_device(latents))
    return _voice_cache[name]


@@ -75,15 +95,38 @@ def _get_voice(name: str) -> tuple:


 class Node:
-    __slots__ = ("kind", "value", "voice")
+    __slots__ = ("kind", "value", "voice", "pitch", "rate")

-    def __init__(self, kind: str, value, voice: str | None = None):
+    def __init__(
+        self,
+        kind: str,
+        value,
+        voice: str | None = None,
+        pitch: float = 0.0,
+        rate: float = 1.0,
+    ):
+        # kind ∈ {"text", "silence"}; value is str for text, float
+        # seconds for silence. voice/pitch/rate are character-voicing
+        # modifiers from [voice:NAME pitch=N rate=R] tags. Default:
+        # request voice, 0 semitones, 1x rate.
        self.kind = kind
        self.value = value
        self.voice = voice
+        self.pitch = pitch
+        self.rate = rate


-_VOICE_OPEN_RE = re.compile(r"\[voice:([A-Za-z0-9_-]+)\]")
+# Voice open tag: name, optional pitch (semitones), optional rate, in that order:
+#   [voice:dyatlov]               → voice swap only
+#   [voice:lj pitch=-3]           → same voice, 3 semitones lower
+#   [voice:lj pitch=2 rate=1.1]   → higher + slightly faster (fairy)
+#   [voice:lj pitch=-4 rate=0.9]  → lower + slower (troll)
+_VOICE_OPEN_RE = re.compile(
+    r"\[voice:([A-Za-z0-9_-]+)"
+    r"(?:\s+pitch=(-?[0-9]+(?:\.[0-9]+)?))?"
+    r"(?:\s+rate=([0-9]+(?:\.[0-9]+)?))?"
+    r"\]"
+)
 _VOICE_CLOSE = "[/voice]"
 _TAG_RE = re.compile(
    r"\[(pause:(?P<dur>[0-9]+(?:\.[0-9]+)?)(?P<unit>s|ms)?|breath|scene)\]",
@@ -102,7 +145,70 @@ def _parse_tag(match: re.Match) -> float:
    return dur / 1000.0 if unit == "ms" else dur


-def _expand_inline(text: str, voice: str | None) -> list[Node]:
+# Tortoise's autoregressive head loses coherence past ~20s of generated
+# audio per inference call. lj's pace is roughly 14 chars/s, so anything
+# past ~280 chars per call risks gibberish at the end. synthesize() splits
+# each text node at sentence boundaries, with headroom below that limit,
+# to keep each tts_with_preset call inside the model's reliable horizon.
+TORTOISE_MAX_CHUNK_CHARS = 220
+
+# Sentence boundary regex: splits on `.`/`?`/`!` followed by whitespace
+# and a capital letter, `"` or `(` (so "e.g. this" stays together, but
+# "Mr. Smith" and "U.S. Army" still split) OR right after any newline.
+_SENTENCE_BOUNDARY = re.compile(r"(?<=[\.!?])\s+(?=[A-Z\"\(])|(?<=\n)\s*")
+
+
+def _chunk_for_tortoise(text: str, max_chars: int = TORTOISE_MAX_CHUNK_CHARS) -> list[str]:
+    """Split text into chunks <= max_chars at sentence boundaries.
+    If a single sentence exceeds max_chars (rare for prose), fall
+    back to splitting that sentence at commas or just hard-cutting.
+    """
+    sentences = [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s and s.strip()]
+    chunks: list[str] = []
+    current = ""
+    for sent in sentences:
+        # Long sentence: emit alone, but try sub-splitting at commas.
+        if len(sent) > max_chars:
+            if current:
+                chunks.append(current.strip())
+                current = ""
+            # Split on commas
+            parts = [p.strip() for p in sent.split(",") if p.strip()]
+            sub = ""
+            for p in parts:
+                add = (sub + ", " if sub else "") + p
+                if len(add) <= max_chars:
+                    sub = add
+                else:
+                    if sub:
+                        chunks.append(sub)
+                    # If even the part alone exceeds, hard-cut at max_chars
+                    while len(p) > max_chars:
+                        chunks.append(p[:max_chars])
+                        p = p[max_chars:]
+                    sub = p
+            if sub:
+                chunks.append(sub)
+            continue
+        # Sentence fits — accumulate.
+        candidate = (current + " " if current else "") + sent
+        if len(candidate) <= max_chars:
+            current = candidate
+        else:
+            if current:
+                chunks.append(current.strip())
+            current = sent
+    if current:
+        chunks.append(current.strip())
+    return chunks
+
+
+def _expand_inline(
+    text: str,
+    voice: str | None,
+    pitch: float = 0.0,
+    rate: float = 1.0,
+) -> list[Node]:
    out: list[Node] = []
    text = text.strip()
    if not text:
@@ -111,12 +217,12 @@ def _expand_inline(text: str, voice: str | None) -> list[Node]:
    for m in _TAG_RE.finditer(text):
        pre = text[cursor : m.start()].strip()
        if pre:
-            out.append(Node("text", pre, voice))
+            out.append(Node("text", pre, voice, pitch, rate))
        out.append(Node("silence", _parse_tag(m)))
        cursor = m.end()
    tail = text[cursor:].strip()
    if tail:
-        out.append(Node("text", tail, voice))
+        out.append(Node("text", tail, voice, pitch, rate))
    return out


@@ -130,12 +236,14 @@ def _split_paragraph_voices(para: str) -> list[Node]:
            break
        out.extend(_expand_inline(para[cursor : m.start()], None))
        voice = m.group(1)
+        pitch = float(m.group(2)) if m.group(2) else 0.0
+        rate = (float(m.group(3)) or 1.0) if m.group(3) else 1.0  # rate=0 would make time_stretch raise; treat as 1x
        body_start = m.end()
        close_idx = para.find(_VOICE_CLOSE, body_start)
        if close_idx < 0:
-            out.extend(_expand_inline(para[body_start:], voice))
+            out.extend(_expand_inline(para[body_start:], voice, pitch, rate))
            break
-        out.extend(_expand_inline(para[body_start:close_idx], voice))
+        out.extend(_expand_inline(para[body_start:close_idx], voice, pitch, rate))
        cursor = close_idx + len(_VOICE_CLOSE)
    return out
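
To make the tag grammar concrete, here is a small standalone sketch that reuses the `_VOICE_OPEN_RE` pattern from this diff and applies the same defaults (0 semitones, 1x rate). The `parse_voice_tag` helper is hypothetical and exists only for illustration; the server does this work inside `_split_paragraph_voices`:

```python
import re

# Pattern copied from the diff: name, then optional pitch, then optional rate.
VOICE_OPEN_RE = re.compile(
    r"\[voice:([A-Za-z0-9_-]+)"
    r"(?:\s+pitch=(-?[0-9]+(?:\.[0-9]+)?))?"
    r"(?:\s+rate=([0-9]+(?:\.[0-9]+)?))?"
    r"\]"
)


def parse_voice_tag(tag: str):
    """Return (voice, pitch, rate), or None if the tag does not match."""
    m = VOICE_OPEN_RE.fullmatch(tag)
    if m is None:
        return None
    pitch = float(m.group(2)) if m.group(2) else 0.0
    rate = float(m.group(3)) if m.group(3) else 1.0
    return m.group(1), pitch, rate


print(parse_voice_tag("[voice:dyatlov]"))               # voice swap only
print(parse_voice_tag("[voice:lj pitch=-4 rate=0.9]"))  # lower + slower
print(parse_voice_tag("[voice:lj rate=1.1 pitch=2]"))   # wrong order: no match
```

Because the two optional groups are positional, a tag that puts `rate` before `pitch` does not match at all, and the text is then treated as ordinary prose rather than a voice switch.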

@@ -253,6 +361,7 @@ def synthesize(req: SynthesizeRequest) -> SynthesizeResponse:
    started = time.monotonic()
    pieces: list[np.ndarray] = []
    voices_used: set[str] = set()
+    tortoise_chunks_rendered = 0
    for node in nodes:
        if node.kind == "silence":
            pieces.append(_silence_samples(node.value))
@@ -264,18 +373,37 @@ def synthesize(req: SynthesizeRequest) -> SynthesizeResponse:
        except Exception as e:
            log.warning("voice %s failed to load (%s); falling back to default", seg_voice, e)
            samples, latents = _get_voice(voice)
-        # Tortoise's tts_with_preset returns a torch.Tensor on the
-        # configured device.
-        audio_tensor = tts.tts_with_preset(
-            text=node.value,
-            voice_samples=samples,
-            conditioning_latents=latents,
-            preset=preset,
-        )
-        if isinstance(audio_tensor, list):
-            audio_tensor = audio_tensor[0]
-        arr = audio_tensor.squeeze().cpu().numpy().astype(np.float32)
-        pieces.append(arr)
+        # Each text node may exceed Tortoise's reliable ~20s horizon —
+        # split at sentence boundaries before feeding the model.
+        sub_chunks = _chunk_for_tortoise(node.value)
+        for sub_idx, sub in enumerate(sub_chunks):
+            audio_tensor = tts.tts_with_preset(
+                text=sub,
+                voice_samples=samples,
+                conditioning_latents=latents,
+                preset=preset,
+            )
+            if isinstance(audio_tensor, list):
+                audio_tensor = audio_tensor[0]
+            arr = audio_tensor.squeeze().cpu().numpy().astype(np.float32)
+            # Per-character voice modulation via librosa. Apply
+            # pitch first (preserves duration), then rate (preserves
+            # pitch). Default pitch=0, rate=1.0 = no-op fast path.
+            if abs(node.pitch) > 1e-3:
+                arr = librosa.effects.pitch_shift(
+                    arr, sr=SAMPLE_RATE, n_steps=node.pitch
+                )
+            if abs(node.rate - 1.0) > 1e-3:
+                arr = librosa.effects.time_stretch(arr, rate=node.rate)
+            arr = arr.astype(np.float32)
+            pieces.append(arr)
+            tortoise_chunks_rendered += 1
+            log.info(
+                "chunk %d/%d done (%d chars, pitch=%+.1f rate=%.2f, %.1fs audio so far)",
+                sub_idx + 1, len(sub_chunks), len(sub),
+                node.pitch, node.rate,
+                sum(len(p) for p in pieces) / SAMPLE_RATE,
+            )
    elapsed_ms = int((time.monotonic() - started) * 1000)

    if not pieces: