From 7a96031aa67bc2c67d753e83f383a7ace9a9c8c6 Mon Sep 17 00:00:00 2001
From: Kayos <kayos@sulkta.com>
Date: Thu, 14 May 2026 09:42:09 -0700
Subject: [PATCH 1/2] engine/tortoise: GPU exclusivity wrapper + kludges notes

Adds the Tortoise-specific tooling that main intentionally omits:

- engines/tortoise/exclusive-gpu.sh wraps any command, stops F5 +
  Kokoro on the GPU, restarts Tortoise to clear stale CUDA contexts,
  waits for healthz, runs the command, restarts the engines on EXIT
  trap. Solves the 8GB OOM that took down the first smoke.

- engines/tortoise/hacks.md captures the speed reality (~74x real-
  time slowdown on the 2070 Super at standard preset) and the
  pronunciation-overrides cross-engine compatibility note.

Deploy from this branch when you want Tortoise's tuning. Main's
vanilla Tortoise is for the cross-engine reference + future
'we have more VRAM now' cleanup.
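
For reference, a minimal sketch of how the ~74x figure falls out of
the smoke numbers quoted in hacks.md (6.5s of audio, 478s wall
clock). The helper names are illustrative and do not exist in the
repo, and the extrapolation is linear, so it ignores fixed costs such
as model load that a 6.5s sample may overweight.

```python
def real_time_factor(wall_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute spent per second of rendered audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return wall_seconds / audio_seconds


def estimated_wall_hours(audio_minutes: float, rtf: float) -> float:
    """Linear extrapolation from a measured real-time factor."""
    return audio_minutes * rtf / 60.0


# Smoke numbers from hacks.md: 6.5s of audio in 478s wall clock.
rtf = real_time_factor(478.0, 6.5)
print(f"real-time factor: {rtf:.1f}x")
print(f"1 min of audio:   {estimated_wall_hours(1.0, rtf):.2f} h")
```

Treat the result as a rough bound, not a schedule: one short smoke
run is a thin basis for chapter-length estimates.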
---
 engines/tortoise/exclusive-gpu.sh | 47 ++++++++++++++++++++
 engines/tortoise/hacks.md         | 71 +++++++++++++++++++++++++++++++
 2 files changed, 118 insertions(+)
 create mode 100755 engines/tortoise/exclusive-gpu.sh
 create mode 100644 engines/tortoise/hacks.md
diff --git a/engines/tortoise/exclusive-gpu.sh b/engines/tortoise/exclusive-gpu.sh
new file mode 100755
index 0000000..fb2b0a4
--- /dev/null
+++ b/engines/tortoise/exclusive-gpu.sh
@@ -0,0 +1,47 @@
+#!/bin/bash
+# Tortoise GPU exclusivity wrapper. The 2070 Super (8GB) can't host
+# F5 (~4.5GB) + Kokoro (~2.7GB) + Tortoise (~5GB peak) simultaneously,
+# so we stop the other two engines for the duration of a Tortoise run
+# and restart them after.
+#
+# Usage:
+#   exclusive-gpu.sh <command...>
+#
+# Example:
+#   exclusive-gpu.sh docker exec skald skald narrate --chapter <uuid>
+#
+# Exits with the wrapped command's status. Restarts the engines
+# regardless of success/failure (trap on EXIT).
+set -euo pipefail
+
+STOP_ENGINES=(f5-tts kokoro)
+
+cleanup() {
+    local rc=$?
+    echo "[exclusive-gpu] restarting engines"
+    for engine in "${STOP_ENGINES[@]}"; do
+        docker start "$engine" >/dev/null 2>&1 || \
+            echo "[exclusive-gpu] failed to restart $engine — investigate"
+    done
+    return "$rc"
+}
+trap cleanup EXIT
+
+echo "[exclusive-gpu] stopping engines: ${STOP_ENGINES[*]}"
+for engine in "${STOP_ENGINES[@]}"; do
+    docker stop "$engine" >/dev/null 2>&1 || true
+done
+
+# Restart Tortoise so it comes up with a fresh CUDA context once the
+# stopped engines have released their memory (contexts can linger briefly).
+docker restart tortoise >/dev/null
+echo "[exclusive-gpu] waiting for tortoise healthz..."
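+for i in {1..30}; do
+    curl -sf http://192.168.0.5:7795/healthz | grep -q '"loaded":true' && break
+    # Give up after the last attempt instead of running against an unloaded model.
+    if [ "$i" -eq 30 ]; then echo "[exclusive-gpu] tortoise never became healthy" >&2; exit 1; fi
+    sleep 2
+done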
+
+echo "[exclusive-gpu] running: $*"
+"$@"
diff --git a/engines/tortoise/hacks.md b/engines/tortoise/hacks.md
new file mode 100644
index 0000000..ee335b4
--- /dev/null
+++ b/engines/tortoise/hacks.md
@@ -0,0 +1,71 @@
+# Tortoise engine — kludges branched off main
+
+This branch carries the engine-specific tweaks that don't generalise
+to F5 / Kokoro. Tortoise is the audiobook-quality engine but the
+trade-offs are real and need explicit handling — speed and GPU.
+
+## 1. GPU exclusivity
+
+**File:** `exclusive-gpu.sh`.
+
+The 2070 Super has 8GB. F5 (~4.5GB) + Kokoro (~2.7GB) + Tortoise
+(~5GB peak) sums to ~12GB — over budget. First Tortoise smoke
+caught it: `torch.OutOfMemoryError: ... 9.31 MiB is free`.
+
+Solution: stop the other two engines for the duration of a Tortoise
+run. The script wraps any command, stops `f5-tts` + `kokoro`,
+restarts `tortoise` to clean its CUDA context, waits for healthz,
+runs the wrapped command, then restarts the engines on EXIT trap
+(success or failure).
+
+```bash
+./exclusive-gpu.sh docker exec skald skald narrate --chapter <uuid>
+```
+
+Remove when: GPU upgrade (P40 24GB / 3090 24GB / etc) lets all three
+engines co-reside.
+
+## 2. Speed — slow, batch-only
+
+Tortoise at `standard` preset is **~74x slower than real-time** on
+the 2070 Super (smoke: 6.5s of audio took 478s wall clock). A 33-min
+Chapter 2 render is estimated at ~8 hours; the raw smoke ratio extrapolates to ~40 hours, so treat both as rough.
+Tortoise is acceptable for overnight batched runs but NOT interactive rendering.
+
+Quality presets and their approx wall-clock for a 3000-word chapter:
+- `ultra_fast` — ~1h, noticeable quality drop
+- `fast` — ~2h
+- `standard` — ~6-8h, the recommended bar
+- `high_quality` — ~24h, marginally better than standard
+
+For most use, `standard` is right. Reserve `high_quality` for
+short prologues or named samples.
+
+## 3. Voice mapping format
+
+Tortoise's voice roster (`lj`, `freeman`, `daniel`, etc.) lives
+behind `source='tortoise_tts'` in the `voices` table. Character
+slug → Tortoise voice mapping is independent of the Kokoro mapping
+— a story can have BOTH a Kokoro and Tortoise mapping live in
+parallel, picked at render time via story.preferred_voice_id or
+the --voice flag.
+
+Tortoise voices may sometimes warble or stutter at chunk boundaries
+— the `tortoise.api.TextToSpeech.tts_with_preset` call is per-chunk
+and re-conditions the voice each time. Acceptable for v0.1; future
+work could feed `conditioning_latents` directly for tighter cohesion.
+
+## 4. No Tortoise-specific respelling overrides (yet)
+
+The `pronunciation_overrides` rows in the DB are seeded with
+lowercase-syllable respellings tuned for Kokoro's misaki tokenizer.
+Tortoise uses a different phonemizer (`g2p_en`) which handles many
+of those proper nouns better natively — but some still mangle.
+
+For now, narrate's substitution applies the same overrides regardless
+of engine, which means Tortoise sees `prip-yat` for "Pripyat": the
+input is the same, but a different phonemizer may read it differently.
+Usually OK, but audit after each batch.
+
+Future: per-engine override sets, OR an `engine` column on
+pronunciation_overrides.

From 9df378f799f48db7e83635e48dbc3f6bd7cb16ef Mon Sep 17 00:00:00 2001
From: Kayos <kayos@sulkta.com>
Date: Thu, 14 May 2026 19:08:43 -0700
Subject: [PATCH 2/2] engine/tortoise: sentence chunking + device fix +
 pitch/rate modulation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Catches up engines/tortoise/server.py with what's been deployed on
Lucy through tonight's smoke iterations:

0.2 — _chunk_for_tortoise splits text nodes at sentence boundaries
      (max 220 chars) before each tts_with_preset call. Fixes the
      end-of-prompt gibberish past tortoise's ~20s reliable horizon.

0.3 — _get_voice now .to(DEVICE) cached samples + latents. Without
      this, non-lj voices crash with 'Expected all tensors to be on
      the same device, but found cpu and cuda:0'.

0.4 — [voice:NAME pitch=N rate=R][/voice] tag syntax. librosa
      pitch_shift + time_stretch applied per-chunk for single-voice
      multi-character renders. The strategy survived the design
      table — but the librosa phase-vocoder artifacts at ±5 semitones
      ate the quality on the 2070 Super. Parked here for the GPU
      rebuild; modulation works architecturally, it just needs a better
      stretching algorithm (rubberband) + more headroom.

Production stayed Kokoro. Coast-Down preferred_voice_id reverted
to kokoro_af_heart in the live DB after this experiment.
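
The packing idea behind 0.2, as a simplified sketch. It uses a
plainer boundary regex than _SENTENCE_BOUNDARY and omits the
comma / hard-cut fallback for oversized sentences, so it is an
illustration of the approach, not the code in server.py. The names
pack_sentences, BOUNDARY and MAX_CHARS are assumptions.

```python
import re

MAX_CHARS = 220  # same value as TORTOISE_MAX_CHUNK_CHARS in the patch

# Simplified boundary: sentence punctuation followed by whitespace.
BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def pack_sentences(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.
    A single sentence longer than max_chars is emitted unsplit here."""
    chunks: list[str] = []
    current = ""
    for sent in (s.strip() for s in BOUNDARY.split(text)):
        if not sent:
            continue
        candidate = f"{current} {sent}" if current else sent
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sent
    if current:
        chunks.append(current)
    return chunks


demo = "One short sentence. " * 20
for chunk in pack_sentences(demo):
    print(len(chunk), repr(chunk[:30]))
```

Greedy packing keeps the number of tts_with_preset calls low while
each call stays under the character budget.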
---
 engines/tortoise/server.py | 172 ++++++++++++++++++++++++++++++++-----
 1 file changed, 150 insertions(+), 22 deletions(-)

diff --git a/engines/tortoise/server.py b/engines/tortoise/server.py
index c39eafe..ef602e5 100644
--- a/engines/tortoise/server.py
+++ b/engines/tortoise/server.py
@@ -23,6 +23,7 @@ import time
 import uuid
 from pathlib import Path
 
+import librosa
 import numpy as np
 import soundfile as sf
 import torch
@@ -62,12 +63,31 @@ def _get_tts() -> TextToSpeech:
     return _tts
 
 
+def _move_to_device(obj):
+    """Recursively .to(DEVICE) tensors inside the structure tortoise
+    returns from load_voice. voice_samples is a list of tensors;
+    conditioning_latents is a tuple of tensors. Anything else
+    passes through unchanged (e.g. None, ints)."""
+    if obj is None:
+        return obj
+    if isinstance(obj, torch.Tensor):
+        return obj.to(DEVICE)
+    if isinstance(obj, list):
+        return [_move_to_device(x) for x in obj]
+    if isinstance(obj, tuple):
+        return tuple(_move_to_device(x) for x in obj)
+    return obj
+
+
 def _get_voice(name: str) -> tuple:
     """Cache voice latents to avoid re-loading reference clips on
     every synthesis call. Tortoise's load_voice returns
-    (voice_samples, conditioning_latents)."""
+    (voice_samples, conditioning_latents) — but they're created on
+    CPU; we move them to DEVICE so the autoregressive model (on
+    CUDA) doesn't fail with cpu/cuda tensor-device mismatch."""
     if name not in _voice_cache:
-        _voice_cache[name] = load_voice(name)
+        samples, latents = load_voice(name)
+        _voice_cache[name] = (_move_to_device(samples), _move_to_device(latents))
     return _voice_cache[name]
 
 
@@ -75,15 +95,38 @@ def _get_voice(name: str) -> tuple:
 
 
 class Node:
-    __slots__ = ("kind", "value", "voice")
+    __slots__ = ("kind", "value", "voice", "pitch", "rate")
 
-    def __init__(self, kind: str, value, voice: str | None = None):
+    def __init__(
+        self,
+        kind: str,
+        value,
+        voice: str | None = None,
+        pitch: float = 0.0,
+        rate: float = 1.0,
+    ):
+        # kind ∈ {"text", "silence"}; value is str for text, float
+        # seconds for silence. voice/pitch/rate are character-voicing
+        # modifiers from [voice:NAME pitch=N rate=R] tags. Default:
+        # request voice, 0 semitones, 1x rate.
         self.kind = kind
         self.value = value
         self.voice = voice
+        self.pitch = pitch
+        self.rate = rate
 
 
-_VOICE_OPEN_RE = re.compile(r"\[voice:([A-Za-z0-9_-]+)\]")
+# Voice open tag — name + optional pitch (semitones) + optional rate:
+#   [voice:dyatlov]               → voice swap only
+#   [voice:lj pitch=-3]           → same voice, 3 semitones lower
+#   [voice:lj pitch=2 rate=1.1]   → higher + slightly faster (fairy)
+#   [voice:lj pitch=-4 rate=0.9]  → lower + slower (troll)
+_VOICE_OPEN_RE = re.compile(
+    r"\[voice:([A-Za-z0-9_-]+)"
+    r"(?:\s+pitch=(-?[0-9]+(?:\.[0-9]+)?))?"
+    r"(?:\s+rate=([0-9]+(?:\.[0-9]+)?))?"
+    r"\]"
+)
 _VOICE_CLOSE = "[/voice]"
 _TAG_RE = re.compile(
     r"\[(pause:(?P<dur>[0-9]+(?:\.[0-9]+)?)(?P<unit>s|ms)?|breath|scene)\]",
@@ -102,7 +145,70 @@ def _parse_tag(match: re.Match) -> float:
     return dur / 1000.0 if unit == "ms" else dur
 
 
-def _expand_inline(text: str, voice: str | None) -> list[Node]:
+# Tortoise's autoregressive head loses coherence past ~20s of generated
+# audio per inference call. lj's pace is roughly 14 chars/s, so anything
+# past ~280 chars per call risks gibberish at the end. We split inside
+# _expand_inline at sentence boundaries to keep each tts_with_preset
+# call inside the model's reliable horizon.
+TORTOISE_MAX_CHUNK_CHARS = 220
+
+# Sentence boundary regex: splits on `.`/`?`/`!` followed by whitespace
+# and a capital letter, quote, or open paren, OR at any newline. "U.S." stays whole,
+# but "Mr. Smith" does split here; the pieces are usually re-joined when chunks are packed.
+_SENTENCE_BOUNDARY = re.compile(r"(?<=[\.!?])\s+(?=[A-Z\"\(])|(?<=\n)\s*")
+
+
+def _chunk_for_tortoise(text: str, max_chars: int = TORTOISE_MAX_CHUNK_CHARS) -> list[str]:
+    """Split text into chunks <= max_chars at sentence boundaries.
+    If a single sentence exceeds max_chars (rare for prose), fall
+    back to splitting that sentence at commas or just hard-cutting.
+    """
+    sentences = [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s and s.strip()]
+    chunks: list[str] = []
+    current = ""
+    for sent in sentences:
+        # Long sentence: emit alone, but try sub-splitting at commas.
+        if len(sent) > max_chars:
+            if current:
+                chunks.append(current.strip())
+                current = ""
+            # Split on commas
+            parts = [p.strip() for p in sent.split(",") if p.strip()]
+            sub = ""
+            for p in parts:
+                add = (sub + ", " if sub else "") + p
+                if len(add) <= max_chars:
+                    sub = add
+                else:
+                    if sub:
+                        chunks.append(sub)
+                    # If even the part alone exceeds, hard-cut at max_chars
+                    while len(p) > max_chars:
+                        chunks.append(p[:max_chars])
+                        p = p[max_chars:]
+                    sub = p
+            if sub:
+                chunks.append(sub)
+            continue
+        # Sentence fits — accumulate.
+        candidate = (current + " " if current else "") + sent
+        if len(candidate) <= max_chars:
+            current = candidate
+        else:
+            if current:
+                chunks.append(current.strip())
+            current = sent
+    if current:
+        chunks.append(current.strip())
+    return chunks
+
+
+def _expand_inline(
+    text: str,
+    voice: str | None,
+    pitch: float = 0.0,
+    rate: float = 1.0,
+) -> list[Node]:
     out: list[Node] = []
     text = text.strip()
     if not text:
@@ -111,12 +217,12 @@ def _expand_inline(text: str, voice: str | None) -> list[Node]:
     for m in _TAG_RE.finditer(text):
         pre = text[cursor : m.start()].strip()
         if pre:
-            out.append(Node("text", pre, voice))
+            out.append(Node("text", pre, voice, pitch, rate))
         out.append(Node("silence", _parse_tag(m)))
         cursor = m.end()
     tail = text[cursor:].strip()
     if tail:
-        out.append(Node("text", tail, voice))
+        out.append(Node("text", tail, voice, pitch, rate))
     return out
 
 
@@ -130,12 +236,14 @@ def _split_paragraph_voices(para: str) -> list[Node]:
             break
         out.extend(_expand_inline(para[cursor : m.start()], None))
         voice = m.group(1)
+        pitch = float(m.group(2)) if m.group(2) else 0.0
+        rate = float(m.group(3)) if m.group(3) else 1.0
         body_start = m.end()
         close_idx = para.find(_VOICE_CLOSE, body_start)
         if close_idx < 0:
-            out.extend(_expand_inline(para[body_start:], voice))
+            out.extend(_expand_inline(para[body_start:], voice, pitch, rate))
             break
-        out.extend(_expand_inline(para[body_start:close_idx], voice))
+        out.extend(_expand_inline(para[body_start:close_idx], voice, pitch, rate))
         cursor = close_idx + len(_VOICE_CLOSE)
     return out
 
@@ -253,6 +361,7 @@ def synthesize(req: SynthesizeRequest) -> SynthesizeResponse:
     started = time.monotonic()
     pieces: list[np.ndarray] = []
     voices_used: set[str] = set()
+    tortoise_chunks_rendered = 0
     for node in nodes:
         if node.kind == "silence":
             pieces.append(_silence_samples(node.value))
@@ -264,18 +373,37 @@ def synthesize(req: SynthesizeRequest) -> SynthesizeResponse:
         except Exception as e:
             log.warning("voice %s failed to load (%s); falling back to default", seg_voice, e)
             samples, latents = _get_voice(voice)
-        # Tortoise's tts_with_preset returns a torch.Tensor on the
-        # configured device.
-        audio_tensor = tts.tts_with_preset(
-            text=node.value,
-            voice_samples=samples,
-            conditioning_latents=latents,
-            preset=preset,
-        )
-        if isinstance(audio_tensor, list):
-            audio_tensor = audio_tensor[0]
-        arr = audio_tensor.squeeze().cpu().numpy().astype(np.float32)
-        pieces.append(arr)
+        # Each text node may exceed Tortoise's reliable ~20s horizon —
+        # split at sentence boundaries before feeding the model.
+        sub_chunks = _chunk_for_tortoise(node.value)
+        for sub_idx, sub in enumerate(sub_chunks):
+            audio_tensor = tts.tts_with_preset(
+                text=sub,
+                voice_samples=samples,
+                conditioning_latents=latents,
+                preset=preset,
+            )
+            if isinstance(audio_tensor, list):
+                audio_tensor = audio_tensor[0]
+            arr = audio_tensor.squeeze().cpu().numpy().astype(np.float32)
+            # Per-character voice modulation via librosa. Apply
+            # pitch first (preserves duration), then rate (preserves
+            # pitch). Default pitch=0, rate=1.0 = no-op fast path.
+            if abs(node.pitch) > 1e-3:
+                arr = librosa.effects.pitch_shift(
+                    arr, sr=SAMPLE_RATE, n_steps=node.pitch
+                )
+            if abs(node.rate - 1.0) > 1e-3:
+                arr = librosa.effects.time_stretch(arr, rate=node.rate)
+            arr = arr.astype(np.float32)
+            pieces.append(arr)
+            tortoise_chunks_rendered += 1
+            log.info(
+                "chunk %d/%d done (%d chars, pitch=%+.1f rate=%.2f, %.1fs audio so far)",
+                sub_idx + 1, len(sub_chunks), len(sub),
+                node.pitch, node.rate,
+                sum(len(p) for p in pieces) / SAMPLE_RATE,
+            )
     elapsed_ms = int((time.monotonic() - started) * 1000)
 
     if not pieces: