Compare commits: main...engine/tor (2 commits)
| Author | SHA1 | Date | |
|---|---|---|---|
| | 9df378f799 | | |
| | 7a96031aa6 | | |
3 changed files with 268 additions and 22 deletions
47
engines/tortoise/exclusive-gpu.sh
Executable file
@@ -0,0 +1,47 @@
#!/bin/bash
# Tortoise GPU exclusivity wrapper. The 2070 Super (8GB) can't host
# F5 (~4.5GB) + Kokoro (~2.7GB) + Tortoise (~5GB peak) simultaneously,
# so we stop the other two engines for the duration of a Tortoise run
# and restart them after.
#
# Usage:
# exclusive-gpu.sh <command...>
#
# Example:
# exclusive-gpu.sh docker exec skald skald narrate --chapter <uuid>
#
# Exits with the wrapped command's status. Restarts the engines
# regardless of success/failure (trap on EXIT).
set -euo pipefail

STOP_ENGINES=(f5-tts kokoro)

cleanup() {
    local rc=$?
    echo "[exclusive-gpu] restarting engines"
    for engine in "${STOP_ENGINES[@]}"; do
        docker start "$engine" >/dev/null 2>&1 || \
            echo "[exclusive-gpu] failed to restart $engine — investigate"
    done
    return "$rc"
}
trap cleanup EXIT

echo "[exclusive-gpu] stopping engines: ${STOP_ENGINES[*]}"
for engine in "${STOP_ENGINES[@]}"; do
    docker stop "$engine" >/dev/null 2>&1 || true
done

# Restart Tortoise to clean up any cached GPU allocations from the
# now-stopped engines (their CUDA contexts can linger briefly).
docker restart tortoise >/dev/null
echo "[exclusive-gpu] waiting for tortoise healthz..."
for i in {1..30}; do
    if curl -sf http://192.168.0.5:7795/healthz | grep -q '"loaded":true'; then
        break
    fi
    sleep 2
done

echo "[exclusive-gpu] running: $*"
"$@"
71
engines/tortoise/hacks.md
Normal file
@@ -0,0 +1,71 @@
# Tortoise engine — kludges branched off main

This branch carries the engine-specific tweaks that don't generalise
to F5 / Kokoro. Tortoise is the audiobook-quality engine but the
trade-offs are real and need explicit handling — speed and GPU.

## 1. GPU exclusivity

**File:** `exclusive-gpu.sh`.

The 2070 Super has 8GB. F5 (~4.5GB) + Kokoro (~2.7GB) + Tortoise
(~5GB peak) sums to ~12GB, well over budget. The first Tortoise smoke
test caught it: `torch.OutOfMemoryError: ... 9.31 MiB is free`.

Solution: stop the other two engines for the duration of a Tortoise
run. The script wraps any command, stops `f5-tts` + `kokoro`,
restarts `tortoise` to clean its CUDA context, waits for healthz,
runs the wrapped command, then restarts the engines from an EXIT trap
(success or failure).

```bash
./exclusive-gpu.sh docker exec skald skald narrate --chapter <uuid>
```

Remove when: GPU upgrade (P40 24GB / 3090 24GB / etc) lets all three
engines co-reside.
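
The budget arithmetic behind this section can be checked in a few
lines. This is a sketch only: the footprints are the approximate
figures quoted above, `fits` is a made-up helper, and the check
ignores fragmentation and per-process CUDA context overhead.

```python
# Approximate peak VRAM per engine in GB, taken from the figures above.
FOOTPRINT_GB = {"f5-tts": 4.5, "kokoro": 2.7, "tortoise": 5.0}


def fits(engines, capacity_gb):
    """Return True if the summed peak footprints fit in capacity_gb.

    Hypothetical helper: ignores fragmentation and CUDA context overhead.
    """
    return sum(FOOTPRINT_GB[e] for e in engines) <= capacity_gb


all_three = list(FOOTPRINT_GB)
print("total demand (GB):", round(sum(FOOTPRINT_GB.values()), 1))
print("all three on 8GB: ", fits(all_three, 8))
print("tortoise alone:   ", fits(["tortoise"], 8))
print("all three on 24GB:", fits(all_three, 24))
```

The same check is a cheap way to decide when the wrapper can be
retired after a hardware change.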

## 2. Speed — slow, batch-only

Tortoise at `standard` preset is **~74x slower than real-time** on
the 2070 Super (smoke: 6.5s of audio took 478s wall clock). A 33-min
Chapter 2 render would take ~8 hours. Tortoise is acceptable for
overnight batched runs but NOT interactive rendering.
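
For reference, the real-time factor follows directly from the smoke
numbers. The sketch below is plain arithmetic; `estimate_hours` is a
hypothetical helper that extrapolates linearly, so it does not
separate out any one-off startup cost the smoke run may have included
(an assumption, since the smoke log is not shown here). A straight
linear extrapolation for 33 minutes lands well above the ~8 hour
figure, so treat both as rough until a longer run has been timed.

```python
SMOKE_AUDIO_S = 6.5    # seconds of audio produced in the smoke test
SMOKE_WALL_S = 478.0   # wall-clock seconds it took


def realtime_factor(audio_s, wall_s):
    """Wall-clock seconds spent per second of generated audio."""
    return wall_s / audio_s


def estimate_hours(audio_minutes, factor):
    """Linear extrapolation: wall-clock hours for audio_minutes of audio."""
    return audio_minutes * 60 * factor / 3600


rtf = realtime_factor(SMOKE_AUDIO_S, SMOKE_WALL_S)
print(f"real-time factor: {rtf:.1f}x")
print(f"33 min of audio:  {estimate_hours(33, rtf):.1f} h (linear)")
```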

Quality presets and their approx wall-clock for a 3000-word chapter:
- `ultra_fast` — ~1h, noticeable quality drop
- `fast` — ~2h
- `standard` — ~6-8h, the recommended bar
- `high_quality` — ~24h, marginally better than standard

For most use, `standard` is right. Reserve `high_quality` for
short prologues or named samples.

## 3. Voice mapping format

Tortoise's voice roster (`lj`, `freeman`, `daniel`, etc.) lives
behind `source='tortoise_tts'` in the `voices` table. Character
slug → Tortoise voice mapping is independent of the Kokoro mapping
— a story can have BOTH a Kokoro and Tortoise mapping live in
parallel, picked at render time via `story.preferred_voice_id` or
the `--voice` flag.

Tortoise voices can warble or stutter at chunk boundaries because
the `tortoise.api.TextToSpeech.tts_with_preset` call is per-chunk
and re-conditions the voice each time. Acceptable for v0.1; future
work could feed `conditioning_latents` directly for tighter cohesion.

## 4. No respelling overrides for Tortoise (yet)

The `pronunciation_overrides` rows in the DB are seeded with
lowercase-syllable respellings tuned for Kokoro's misaki tokenizer.
Tortoise uses a different phonemizer (`g2p_en`) which handles many
of those proper nouns better natively — but some still mangle.

For now, narrate's substitution applies the same overrides regardless
of engine, which means Tortoise sees `prip-yat` for "Pripyat": the same
input, interpreted differently by a different phonemizer. Usually OK,
but audit after each batch.

Future: per-engine override sets, OR an `engine` column on
`pronunciation_overrides`.
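
One way the `engine` column idea could look, as a sketch only: the
column names (`word`, `respelling`, `engine`), the sample rows and
the `lookup_override` helper are all assumptions, since the real
`pronunciation_overrides` schema is not shown here. A row with
`engine` set wins over a generic row where `engine` is NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: engine IS NULL means "applies to every engine".
conn.execute(
    "CREATE TABLE pronunciation_overrides ("
    " word TEXT NOT NULL, respelling TEXT NOT NULL, engine TEXT)"
)
conn.executemany(
    "INSERT INTO pronunciation_overrides VALUES (?, ?, ?)",
    [
        ("Pripyat", "prip-yat", None),       # generic (Kokoro-tuned) row
        ("Pripyat", "Pripyat", "tortoise"),  # made-up engine-specific row
    ],
)


def lookup_override(word, engine):
    """Return the engine-specific respelling if present, else the generic one."""
    row = conn.execute(
        "SELECT respelling FROM pronunciation_overrides"
        " WHERE word = ? AND (engine = ? OR engine IS NULL)"
        " ORDER BY engine IS NULL LIMIT 1",  # engine-specific rows sort first
        (word, engine),
    ).fetchone()
    return row[0] if row else None


print(lookup_override("Pripyat", "kokoro"))
print(lookup_override("Pripyat", "tortoise"))
```

Keeping the generic rows as the fallback means existing Kokoro
overrides keep working unchanged while Tortoise-specific rows are
added one word at a time.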
@@ -23,6 +23,7 @@ import time
 import uuid
 from pathlib import Path

+import librosa
 import numpy as np
 import soundfile as sf
 import torch
@@ -62,12 +63,31 @@ def _get_tts() -> TextToSpeech:
     return _tts


+def _move_to_device(obj):
+    """Recursively .to(DEVICE) tensors inside the structure tortoise
+    returns from load_voice. voice_samples is a list of tensors;
+    conditioning_latents is a tuple of tensors. Anything else
+    passes through unchanged (e.g. None, ints)."""
+    if obj is None:
+        return obj
+    if isinstance(obj, torch.Tensor):
+        return obj.to(DEVICE)
+    if isinstance(obj, list):
+        return [_move_to_device(x) for x in obj]
+    if isinstance(obj, tuple):
+        return tuple(_move_to_device(x) for x in obj)
+    return obj
+
+
 def _get_voice(name: str) -> tuple:
     """Cache voice latents to avoid re-loading reference clips on
     every synthesis call. Tortoise's load_voice returns
-    (voice_samples, conditioning_latents)."""
+    (voice_samples, conditioning_latents) — but they're created on
+    CPU; we move them to DEVICE so the autoregressive model (on
+    CUDA) doesn't fail with cpu/cuda tensor-device mismatch."""
     if name not in _voice_cache:
-        _voice_cache[name] = load_voice(name)
+        samples, latents = load_voice(name)
+        _voice_cache[name] = (_move_to_device(samples), _move_to_device(latents))
     return _voice_cache[name]


@@ -75,15 +95,38 @@ def _get_voice(name: str) -> tuple:


 class Node:
-    __slots__ = ("kind", "value", "voice")
+    __slots__ = ("kind", "value", "voice", "pitch", "rate")

-    def __init__(self, kind: str, value, voice: str | None = None):
+    def __init__(
+        self,
+        kind: str,
+        value,
+        voice: str | None = None,
+        pitch: float = 0.0,
+        rate: float = 1.0,
+    ):
+        # kind ∈ {"text", "silence"}; value is str for text, float
+        # seconds for silence. voice/pitch/rate are character-voicing
+        # modifiers from [voice:NAME pitch=N rate=R] tags. Default:
+        # request voice, 0 semitones, 1x rate.
         self.kind = kind
         self.value = value
         self.voice = voice
+        self.pitch = pitch
+        self.rate = rate


-_VOICE_OPEN_RE = re.compile(r"\[voice:([A-Za-z0-9_-]+)\]")
+# Voice open tag — name + optional pitch (semitones) + optional rate:
+# [voice:dyatlov] → voice swap only
+# [voice:lj pitch=-3] → same voice, 3 semitones lower
+# [voice:lj pitch=2 rate=1.1] → higher + slightly faster (fairy)
+# [voice:lj pitch=-4 rate=0.9] → lower + slower (troll)
+_VOICE_OPEN_RE = re.compile(
+    r"\[voice:([A-Za-z0-9_-]+)"
+    r"(?:\s+pitch=(-?[0-9]+(?:\.[0-9]+)?))?"
+    r"(?:\s+rate=([0-9]+(?:\.[0-9]+)?))?"
+    r"\]"
+)
 _VOICE_CLOSE = "[/voice]"
 _TAG_RE = re.compile(
     r"\[(pause:(?P<dur>[0-9]+(?:\.[0-9]+)?)(?P<unit>s|ms)?|breath|scene)\]",
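
As a quick check of the extended voice tag grammar, the snippet below copies `_VOICE_OPEN_RE` from the hunk above and applies the same defaulting that `_split_paragraph_voices` uses (pitch 0.0 and rate 1.0 when omitted). `parse_voice_tag` is a hypothetical helper for illustration, not part of the commit. One consequence of the pattern worth knowing: `pitch` has to come before `rate`, so a tag with the attributes swapped does not match at all.

```python
import re

# Copied from the diff above.
_VOICE_OPEN_RE = re.compile(
    r"\[voice:([A-Za-z0-9_-]+)"
    r"(?:\s+pitch=(-?[0-9]+(?:\.[0-9]+)?))?"
    r"(?:\s+rate=([0-9]+(?:\.[0-9]+)?))?"
    r"\]"
)


def parse_voice_tag(tag):
    """Hypothetical helper: return (voice, pitch, rate), or None if no match."""
    m = _VOICE_OPEN_RE.fullmatch(tag)
    if m is None:
        return None
    pitch = float(m.group(2)) if m.group(2) else 0.0
    rate = float(m.group(3)) if m.group(3) else 1.0
    return m.group(1), pitch, rate


for tag in (
    "[voice:dyatlov]",
    "[voice:lj pitch=-3]",
    "[voice:lj pitch=2 rate=1.1]",
    "[voice:lj rate=1.1 pitch=2]",  # attributes swapped: not accepted
):
    print(tag, "->", parse_voice_tag(tag))
```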
@@ -102,7 +145,70 @@ def _parse_tag(match: re.Match) -> float:
     return dur / 1000.0 if unit == "ms" else dur


-def _expand_inline(text: str, voice: str | None) -> list[Node]:
+# Tortoise's autoregressive head loses coherence past ~20s of generated
+# audio per inference call. lj's pace is roughly 14 chars/s, so anything
+# past ~280 chars per call risks gibberish at the end. synthesize() splits
+# each text node at sentence boundaries via _chunk_for_tortoise, capped
+# at 220 chars to keep every tts_with_preset call well inside that horizon.
+TORTOISE_MAX_CHUNK_CHARS = 220
+
+# Sentence boundary regex: splits on `.`/`?`/`!` followed by whitespace
+# and a capital letter, `"` or `(`, OR after any newline. "e.g. this" stays
+# together; "Mr. Smith" still splits, but chunks re-join adjacent pieces.
+_SENTENCE_BOUNDARY = re.compile(r"(?<=[\.!?])\s+(?=[A-Z\"\(])|(?<=\n)\s*")
+
+
+def _chunk_for_tortoise(text: str, max_chars: int = TORTOISE_MAX_CHUNK_CHARS) -> list[str]:
+    """Split text into chunks <= max_chars at sentence boundaries.
+    If a single sentence exceeds max_chars (rare for prose), fall
+    back to splitting that sentence at commas or just hard-cutting.
+    """
+    sentences = [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s and s.strip()]
+    chunks: list[str] = []
+    current = ""
+    for sent in sentences:
+        # Long sentence: emit alone, but try sub-splitting at commas.
+        if len(sent) > max_chars:
+            if current:
+                chunks.append(current.strip())
+                current = ""
+            # Split on commas
+            parts = [p.strip() for p in sent.split(",") if p.strip()]
+            sub = ""
+            for p in parts:
+                add = (sub + ", " if sub else "") + p
+                if len(add) <= max_chars:
+                    sub = add
+                else:
+                    if sub:
+                        chunks.append(sub)
+                    # If even the part alone exceeds, hard-cut at max_chars
+                    while len(p) > max_chars:
+                        chunks.append(p[:max_chars])
+                        p = p[max_chars:]
+                    sub = p
+            if sub:
+                chunks.append(sub)
+            continue
+        # Sentence fits — accumulate.
+        candidate = (current + " " if current else "") + sent
+        if len(candidate) <= max_chars:
+            current = candidate
+        else:
+            if current:
+                chunks.append(current.strip())
+            current = sent
+    if current:
+        chunks.append(current.strip())
+    return chunks
+
+
+def _expand_inline(
+    text: str,
+    voice: str | None,
+    pitch: float = 0.0,
+    rate: float = 1.0,
+) -> list[Node]:
     out: list[Node] = []
     text = text.strip()
     if not text:
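
A small check of the sentence-boundary regex, copied from the hunk above. `split_sentences` is a hypothetical wrapper around the same split-and-strip step that `_chunk_for_tortoise` starts with. It shows that a lowercase continuation after an abbreviation ("e.g. by car") stays in one piece, while a title followed by a capital ("Mr. Smith") is split. Because `_chunk_for_tortoise` re-joins adjacent pieces with a space up to `max_chars`, such a split only matters when it happens to coincide with a chunk boundary.

```python
import re

# Copied from the diff above.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[\.!?])\s+(?=[A-Z\"\(])|(?<=\n)\s*")


def split_sentences(text):
    """Hypothetical wrapper: the split-and-strip step of _chunk_for_tortoise."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s and s.strip()]


print(split_sentences("It was late. Mr. Smith left, e.g. by car. Nobody saw."))
print(split_sentences("First line\nsecond line"))
```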
@@ -111,12 +217,12 @@ def _expand_inline(text: str, voice: str | None) -> list[Node]:
     for m in _TAG_RE.finditer(text):
         pre = text[cursor : m.start()].strip()
         if pre:
-            out.append(Node("text", pre, voice))
+            out.append(Node("text", pre, voice, pitch, rate))
         out.append(Node("silence", _parse_tag(m)))
         cursor = m.end()
     tail = text[cursor:].strip()
     if tail:
-        out.append(Node("text", tail, voice))
+        out.append(Node("text", tail, voice, pitch, rate))
     return out


@@ -130,12 +236,14 @@ def _split_paragraph_voices(para: str) -> list[Node]:
             break
         out.extend(_expand_inline(para[cursor : m.start()], None))
         voice = m.group(1)
+        pitch = float(m.group(2)) if m.group(2) else 0.0
+        rate = float(m.group(3)) if m.group(3) else 1.0
         body_start = m.end()
         close_idx = para.find(_VOICE_CLOSE, body_start)
         if close_idx < 0:
-            out.extend(_expand_inline(para[body_start:], voice))
+            out.extend(_expand_inline(para[body_start:], voice, pitch, rate))
             break
-        out.extend(_expand_inline(para[body_start:close_idx], voice))
+        out.extend(_expand_inline(para[body_start:close_idx], voice, pitch, rate))
         cursor = close_idx + len(_VOICE_CLOSE)
     return out

@@ -253,6 +361,7 @@ def synthesize(req: SynthesizeRequest) -> SynthesizeResponse:
     started = time.monotonic()
     pieces: list[np.ndarray] = []
     voices_used: set[str] = set()
+    tortoise_chunks_rendered = 0
     for node in nodes:
         if node.kind == "silence":
             pieces.append(_silence_samples(node.value))
@@ -264,18 +373,37 @@ def synthesize(req: SynthesizeRequest) -> SynthesizeResponse:
         except Exception as e:
             log.warning("voice %s failed to load (%s); falling back to default", seg_voice, e)
             samples, latents = _get_voice(voice)
-        # Tortoise's tts_with_preset returns a torch.Tensor on the
-        # configured device.
-        audio_tensor = tts.tts_with_preset(
-            text=node.value,
-            voice_samples=samples,
-            conditioning_latents=latents,
-            preset=preset,
-        )
-        if isinstance(audio_tensor, list):
-            audio_tensor = audio_tensor[0]
-        arr = audio_tensor.squeeze().cpu().numpy().astype(np.float32)
-        pieces.append(arr)
+        # Each text node may exceed Tortoise's reliable ~20s horizon —
+        # split at sentence boundaries before feeding the model.
+        sub_chunks = _chunk_for_tortoise(node.value)
+        for sub_idx, sub in enumerate(sub_chunks):
+            audio_tensor = tts.tts_with_preset(
+                text=sub,
+                voice_samples=samples,
+                conditioning_latents=latents,
+                preset=preset,
+            )
+            if isinstance(audio_tensor, list):
+                audio_tensor = audio_tensor[0]
+            arr = audio_tensor.squeeze().cpu().numpy().astype(np.float32)
+            # Per-character voice modulation via librosa. Apply
+            # pitch first (preserves duration), then rate (preserves
+            # pitch). Default pitch=0, rate=1.0 = no-op fast path.
+            if abs(node.pitch) > 1e-3:
+                arr = librosa.effects.pitch_shift(
+                    arr, sr=SAMPLE_RATE, n_steps=node.pitch
+                )
+            if abs(node.rate - 1.0) > 1e-3:
+                arr = librosa.effects.time_stretch(arr, rate=node.rate)
+            arr = arr.astype(np.float32)
+            pieces.append(arr)
+            tortoise_chunks_rendered += 1
+            log.info(
+                "chunk %d/%d done (%d chars, pitch=%+.1f rate=%.2f, %.1fs audio so far)",
+                sub_idx + 1, len(sub_chunks), len(sub),
+                node.pitch, node.rate,
+                sum(len(p) for p in pieces) / SAMPLE_RATE,
+            )
     elapsed_ms = int((time.monotonic() - started) * 1000)

     if not pieces: