engines: import f5-tts + kokoro + tortoise sidecars into the tree

The Python FastAPI sidecars have lived ad-hoc at
/mnt/cache/appdata/<engine>/build/ on Lucy without version control.
Bringing them into the skald repo so the engine code travels with the
cross-engine routing it depends on.

This commit lands the VANILLA version of each engine on main:

  engines/f5-tts/    SWivid F5-TTS (CC-BY-NC weights flagged)
  engines/kokoro/    hexgrad Kokoro-82M (Apache 2.0 top to bottom)
  engines/tortoise/  neonbjb Tortoise-TTS (Apache 2.0 top to bottom)

Engine-specific kludges (question doubling, GPU coordination,
pause-duration tuning) get layered on engine/* branches per the README.
Main stays the safe-to-read baseline.

parent 1c3fc11484
commit d1631ddffe
10 changed files with 1115 additions and 0 deletions

engines/README.md
Normal file
@@ -0,0 +1,58 @@
# Skald TTS engines

This subtree holds the per-engine sidecars that skald's narrate path
talks to over HTTP. Each engine has the same contract:

- `POST /synthesize` — same JSON shape across engines so skald's
  one Rust client (`skald-core::narrate::Narrator`) deserializes
  all of them. See `engines/<name>/server.py` for the per-engine
  implementation.
- `GET /healthz` — boot probe + model-loaded flag.

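As a rough illustration of the shared contract, the sketch below builds (but does not send) a `POST /synthesize` request with the standard library only. The field names `gen_text`, `ref_audio_path`, `output_filename`, and `speed` are taken from the `SynthesizeRequest` models in the per-engine `server.py` files. The helper name `build_synthesize_request` is hypothetical, and the base URL simply reuses the loopback port from the Kokoro compose file.

```python
import json
import urllib.request


def build_synthesize_request(base_url, gen_text, ref_audio_path,
                             output_filename=None, speed=1.0):
    """Build a POST /synthesize request; the caller decides when to send it.

    Hypothetical helper: skald's real client is the Rust Narrator, not this.
    """
    payload = {
        "gen_text": gen_text,
        # For F5 this is a path inside the container; for Kokoro it is a
        # voice name such as "af_heart".
        "ref_audio_path": ref_audio_path,
        "speed": speed,
    }
    if output_filename is not None:
        payload["output_filename"] = output_filename
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/synthesize",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_synthesize_request("http://127.0.0.1:7794", "Hello there.", "af_heart")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it. The response is JSON carrying `output_path`, `sample_rate_hz`, `duration_seconds`, `elapsed_ms`, and `chars_in` rather than the WAV bytes.
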
Skald routes per-request by `voices.source`: a `kokoro_*` source
goes to `$KOKORO_URL`, a `tortoise_*` source goes to `$TORTOISE_URL`,
anything else (`lj_speech`, generic) goes to `$F5_TTS_URL`.

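The routing rule can be sketched in a few lines. This is a Python rendering of the rule as stated, not skald's actual Rust code (which this commit does not show). The function name `engine_url_for` is hypothetical, and the prefix matching is an assumption about how a `kokoro_*` pattern is interpreted.

```python
import os


def engine_url_for(source, env=None):
    """Pick the sidecar base URL for a voices.source value.

    Assumes `kokoro_*` / `tortoise_*` mean simple prefix matches.
    """
    env = os.environ if env is None else env
    if source.startswith("kokoro_"):
        return env["KOKORO_URL"]
    if source.startswith("tortoise_"):
        return env["TORTOISE_URL"]
    # Everything else (lj_speech, generic) falls through to F5-TTS.
    return env["F5_TTS_URL"]


# Example with an explicit mapping instead of the process environment.
demo_env = {
    "KOKORO_URL": "http://127.0.0.1:7794",
    "TORTOISE_URL": "http://tortoise.example",  # placeholder value
    "F5_TTS_URL": "http://127.0.0.1:7792",
}
print(engine_url_for("kokoro_af_heart", demo_env))
print(engine_url_for("lj_speech", demo_env))
```
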
## Engines

| Dir | Engine | License (code/weights) | VRAM | Speed | Voices |
|---|---|---|---|---|---|
| `f5-tts/` | SWivid F5-TTS v1 | MIT / **CC-BY-NC** | ~5GB | fast (~2x real-time on 2070S) | voice cloning (LJ Speech reference shipped) |
| `kokoro/` | hexgrad Kokoro-82M | Apache 2.0 / Apache 2.0 | ~1GB | very fast (~50x real-time) | 50+ named presets (af_*, am_*, bf_*, bm_*) |
| `tortoise/` | neonbjb Tortoise-TTS | Apache 2.0 / Apache 2.0 | ~5GB | **slow** (~0.014x real-time, ~74s/s of audio on 2070S, standard preset) | 26 named built-ins (lj, freeman, daniel, weaver, jlaw, etc.) |

## Branch model

`main` carries the **vanilla** version of each engine — what you'd
get from a clean `pip install <engine>` plus the FastAPI sidecar
and control-tag splitter. No engine-specific kludges. Safe to look
at without context.

`engine/<name>` branches hold engine-tuned tweaks that don't
generalise. Examples:

- `engine/kokoro` — doubled-`??` prosody hack for the 82M's weak
  question intonation, paragraph/scene/breath gap durations tuned
  for af_heart's pacing, notes on how respellings need to be
  all-lowercase to avoid letter-by-letter spell-out by misaki.
- `engine/tortoise` — GPU exclusivity coordinator (stops F5 +
  Kokoro before a Tortoise run since the 2070 Super can't host
  all three at once), preset choice ergonomics,
  character→tortoise-voice seed assignments.

When deploying an engine to Lucy, the build dir at
`/mnt/cache/appdata/<engine>/build/` tracks the engine's branch:

```bash
cd /mnt/cache/appdata/kokoro/build
git fetch && git checkout engine/kokoro
docker compose -p <name> up -d --build
```

## GPU coordination (2070 Super)

The 8GB card is the bottleneck. F5 + Kokoro can co-reside (~5GB +
~1GB). Tortoise pushes the budget over and needs the GPU largely
to itself — the `engine/tortoise` branch will carry the script
that stops kokoro + f5 before a tortoise run and restarts them
after. Replace with proper coordination once we have more VRAM.

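The budget arithmetic behind that paragraph can be checked with a small sketch. The per-engine figures are the approximate VRAM numbers from the Engines table. The names `VRAM_GB`, `CARD_GB`, and `fits` are hypothetical, and a plain sum ignores peak usage and allocator overhead, which is part of why Tortoise is described as needing the card largely to itself.

```python
# Approximate steady-state VRAM per engine, in GB (from the Engines table).
VRAM_GB = {"f5-tts": 5.0, "kokoro": 1.0, "tortoise": 5.0}
CARD_GB = 8.0  # RTX 2070 Super


def fits(engines, card_gb=CARD_GB, vram=VRAM_GB):
    """Return True if the naive sum of per-engine VRAM fits on the card."""
    return sum(vram[name] for name in engines) <= card_gb


print(fits(["f5-tts", "kokoro"]))              # True: ~6GB of 8GB
print(fits(["f5-tts", "kokoro", "tortoise"]))  # False: ~11GB of 8GB
```
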

engines/f5-tts/Dockerfile
Normal file
@@ -0,0 +1,41 @@
# Sulkta build of F5-TTS — upstream ghcr.io/swivid/f5-tts:main was
# shipped with torch 2.11/torchaudio 2.4 ABI mismatch on 2026-05-13,
# breaking import torchaudio at boot. We rebuild on a known-good
# pytorch base + pip install f5-tts.
#
# Image tag in lucy-registry: lucy-registry:5000/f5-tts:<ver>
#
# License: MIT (code) / CC-BY-NC (Emilia-trained weights).
# Personal use OK; redistribution gray-area — flagged.

FROM pytorch/pytorch:2.6.0-cuda12.4-cudnn9-runtime

ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=1 \
    HF_HOME=/cache/hf \
    HF_HUB_DISABLE_TELEMETRY=1

RUN apt-get update && apt-get install -y --no-install-recommends \
        ffmpeg \
        git \
        ca-certificates \
        curl \
    && rm -rf /var/lib/apt/lists/*

# Base torch 2.6.0 + torchaudio 2.6.0; f5-tts pulls a recent
# transformers (5.x) which needs torch >=2.5's modern
# torch.library.custom_op type signatures.
RUN pip install --no-cache-dir 'f5-tts>=1.0.0'

# Create the HF cache, audio output, and voices mount points.
RUN mkdir -p /cache/hf /audio /voices

COPY server.py /app/f5_server.py
WORKDIR /app

EXPOSE 7860

# Skald talks to our purpose-built FastAPI server, not Gradio.
# Models load at startup (first request would otherwise pay the
# cold-start cost). uvicorn on :7860 to keep the port stable.
CMD ["uvicorn", "f5_server:app", "--host", "0.0.0.0", "--port", "7860"]

engines/f5-tts/compose.yml
Normal file
@@ -0,0 +1,43 @@
# F5-TTS standalone stack on Lucy.
#
# License posture (acknowledged 2026-05-13): code is MIT, but
# the pretrained model weights are CC-BY-NC (Emilia training data).
# Personal listen is fine; public sharing is a flagged gray area.
# Cobb's call: ship anyway.
#
# Runtime: 8GB GPU is plenty (F5 inference ~4-6GB peak).
#
# First-run cost: ~2GB model download from HuggingFace into hf-cache,
# happens at first container start (the server loads models at startup).
name: f5-tts

services:
  f5-tts:
    image: lucy-registry:5000/f5-tts:0.3
    container_name: f5-tts
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "192.168.0.5:7792:7860"
      - "127.0.0.1:7792:7860"
    volumes:
      # HF model weights cache — persists ~2GB after first download.
      - /mnt/cache/appdata/f5-tts/hf-cache:/cache/hf
      # Reference voice clips (lj_speech.wav, etc).
      - /mnt/cache/appdata/f5-tts/voices:/voices:ro
      # Rendered audio output: the sidecar writes narrations here for skald to serve.
      - /mnt/cache/appdata/f5-tts/audio:/audio
    environment:
      HF_HOME: /cache/hf
      HF_HUB_DISABLE_TELEMETRY: "1"
    labels:
      org.sulkta.domain: "sulkta"
      org.sulkta.owner: "cobb"
      org.sulkta.managed-by: "compose"
      org.sulkta.role: "f5-tts"

engines/f5-tts/server.py
Normal file
@@ -0,0 +1,184 @@
"""Thin FastAPI server inside the F5-TTS container.

Loads model + vocoder ONCE at startup (heavy: ~5s, ~5GB VRAM).
POST /synthesize runs inference and writes the WAV to a shared
volume; the response is JSON with the output path and metadata —
not the WAV bytes, since chapter-length renders are 20-30MB and
both skald and the f5 container share /audio anyway.

Why not Gradio's API: Gradio's /gradio_api/call/* shape is
event-stream + polling; this is a single POST + immediate response.
Right for skald's "render one chapter, then move on" loop.
"""
import logging
import time
import uuid
from pathlib import Path

import soundfile as sf
import torch
from fastapi import FastAPI, HTTPException
from omegaconf import OmegaConf
from pydantic import BaseModel, Field
from cached_path import cached_path
from importlib.resources import files
from hydra.utils import get_class

from f5_tts.infer.utils_infer import (
    infer_process,
    load_model,
    load_vocoder,
    preprocess_ref_audio_text,
)


log = logging.getLogger("f5-server")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s %(message)s")


# ─── model state ─────────────────────────────────────────────────

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_NAME = "F5TTS_v1_Base"
VOCODER_NAME = "vocos"
AUDIO_ROOT = Path("/audio")
VOICES_ROOT = Path("/voices")

_model = None
_vocoder = None


def _load_models() -> None:
    """One-time model + vocoder load. ~5-8s wall-clock on first call."""
    global _model, _vocoder
    if _model is not None:
        return
    log.info("loading vocoder=%s device=%s", VOCODER_NAME, DEVICE)
    _vocoder = load_vocoder(vocoder_name=VOCODER_NAME, device=DEVICE)

    cfg_path = str(files("f5_tts").joinpath(f"configs/{MODEL_NAME}.yaml"))
    model_cfg = OmegaConf.load(cfg_path)
    model_cls = get_class(f"f5_tts.model.{model_cfg.model.backbone}")
    model_arc = model_cfg.model.arch

    # F5TTS_v1_Base ships as a HuggingFace artifact; cached_path
    # handles the resolution + downloads to HF_HOME.
    ckpt_file = str(
        cached_path("hf://SWivid/F5-TTS/F5TTS_v1_Base/model_1250000.safetensors")
    )
    vocab_file = str(files("f5_tts").joinpath("infer/examples/vocab.txt"))

    log.info("loading model=%s ckpt=%s", MODEL_NAME, ckpt_file)
    _model = load_model(
        model_cls, model_arc, ckpt_file,
        mel_spec_type=VOCODER_NAME, vocab_file=vocab_file, device=DEVICE,
    )
    log.info("model + vocoder loaded; ready")


# ─── FastAPI app ─────────────────────────────────────────────────


class SynthesizeRequest(BaseModel):
    # The text we want to synthesize. Long-form OK — F5-TTS chunks
    # internally via infer_process.
    gen_text: str = Field(min_length=1)
    # Reference audio path (inside the f5-tts container). Defaults
    # to the staged lj_speech clip.
    ref_audio_path: str = "/voices/lj_speech.wav"
    # Reference transcript. If None: sidecar .txt next to the ref audio, else auto-ASR.
    ref_text: str | None = None
    # Output filename, relative to /audio (the shared output dir).
    # If omitted, a UUID-based name is assigned.
    output_filename: str | None = None
    # Speech speed (0.3-2.0). Default 1.0 = natural pace.
    speed: float = Field(default=1.0, ge=0.3, le=2.0)
    # Cross-fade between chunks; F5 default is 0.15s. Bigger smooths
    # chunk boundaries on long-form prose at the cost of pacing.
    cross_fade_duration: float = Field(default=0.15, ge=0.0, le=1.0)


class SynthesizeResponse(BaseModel):
    ok: bool
    output_path: str
    sample_rate_hz: int
    duration_seconds: float
    elapsed_ms: int
    chars_in: int


app = FastAPI(title="f5-tts-server", version="0.1.0")


@app.on_event("startup")
def _startup() -> None:
    _load_models()


@app.get("/healthz")
def healthz() -> dict:
    return {
        "ok": True,
        "device": DEVICE,
        "model": MODEL_NAME,
        "vocoder": VOCODER_NAME,
        "loaded": _model is not None,
    }


@app.post("/synthesize", response_model=SynthesizeResponse)
def synthesize(req: SynthesizeRequest) -> SynthesizeResponse:
    if _model is None:
        raise HTTPException(503, "model not loaded yet — retry shortly")

    ref_audio_path = Path(req.ref_audio_path)
    if not ref_audio_path.is_file():
        raise HTTPException(400, f"ref_audio_path not found: {ref_audio_path}")

    # If no explicit ref_text, try sidecar .txt then fall back to ""
    # (which triggers F5's auto-ASR).
    ref_text = req.ref_text
    if ref_text is None:
        sidecar = ref_audio_path.with_suffix(".txt")
        if sidecar.is_file():
            ref_text = sidecar.read_text().strip()
        else:
            ref_text = ""

    output_filename = req.output_filename or f"{uuid.uuid4().hex}.wav"
    if "/" in output_filename or ".." in output_filename:
        raise HTTPException(400, "output_filename must be a bare name, no path parts")
    output_path = AUDIO_ROOT / output_filename
    output_path.parent.mkdir(parents=True, exist_ok=True)

    started = time.monotonic()
    ref_audio_processed, ref_text_processed = preprocess_ref_audio_text(
        str(ref_audio_path), ref_text
    )
    audio_segment, final_sample_rate, _ = infer_process(
        ref_audio_processed,
        ref_text_processed,
        req.gen_text,
        _model,
        _vocoder,
        mel_spec_type=VOCODER_NAME,
        speed=req.speed,
        cross_fade_duration=req.cross_fade_duration,
        device=DEVICE,
    )
    elapsed_ms = int((time.monotonic() - started) * 1000)

    sf.write(str(output_path), audio_segment, final_sample_rate, subtype="PCM_16")
    duration_s = float(len(audio_segment)) / float(final_sample_rate)
    log.info(
        "synthesized chars=%d -> %s (sr=%d, dur=%.2fs, elapsed=%dms)",
        len(req.gen_text), output_path, final_sample_rate, duration_s, elapsed_ms,
    )
    return SynthesizeResponse(
        ok=True,
        output_path=str(output_path),
        sample_rate_hz=final_sample_rate,
        duration_seconds=duration_s,
        elapsed_ms=elapsed_ms,
        chars_in=len(req.gen_text),
    )
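
The reference-transcript fallback in `synthesize` (explicit `ref_text`, then a sidecar `.txt` next to the reference audio, then an empty string that hands off to F5's auto-ASR) can be exercised on its own. The sketch below is a standalone restatement of that rule, not code from the server. The function name `resolve_ref_text` is hypothetical, and the explicit UTF-8 encoding is an assumption the original does not make.

```python
import tempfile
from pathlib import Path


def resolve_ref_text(ref_audio_path, explicit=None):
    """Explicit transcript wins; else a sidecar .txt; else "" (auto-ASR)."""
    if explicit is not None:
        return explicit
    sidecar = Path(ref_audio_path).with_suffix(".txt")
    if sidecar.is_file():
        return sidecar.read_text(encoding="utf-8").strip()
    return ""


with tempfile.TemporaryDirectory() as tmp:
    wav = Path(tmp) / "lj_speech.wav"
    wav.write_bytes(b"")  # placeholder file; only its path matters here
    print(repr(resolve_ref_text(wav)))               # '' (no sidecar yet)
    wav.with_suffix(".txt").write_text("Hello world.\n", encoding="utf-8")
    print(repr(resolve_ref_text(wav)))               # 'Hello world.'
    print(repr(resolve_ref_text(wav, "Override")))   # 'Override'
```
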

engines/kokoro/Dockerfile
Normal file
@@ -0,0 +1,35 @@
# Sulkta build of Kokoro-82M TTS.
#
# License: Apache 2.0 (code AND model weights). Clean stack — no
# CC-BY-NC asterisk like F5-TTS's Emilia weights. This is the
# narrator engine for sleep-quality audiobook reads; F5-TTS stays
# around for voice-cloning cases.
#
# Kokoro is small enough to run on CPU but we use the cuda base
# anyway to stay consistent with f5-tts and so it'll pick up the
# GPU when no other tenant has it.
FROM pytorch/pytorch:2.6.0-cuda12.4-cudnn9-runtime

ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=1 \
    HF_HOME=/cache/hf \
    HF_HUB_DISABLE_TELEMETRY=1

RUN apt-get update && apt-get install -y --no-install-recommends \
        ffmpeg \
        espeak-ng \
        ca-certificates \
        curl \
    && rm -rf /var/lib/apt/lists/*

# kokoro pulls phonemizer + soundfile + espeakng transitively.
RUN pip install --no-cache-dir 'kokoro>=0.9.0' 'fastapi>=0.115.0' 'uvicorn>=0.32.0' 'soundfile>=0.13.0'

RUN mkdir -p /cache/hf /audio

COPY server.py /app/kokoro_server.py
WORKDIR /app

EXPOSE 7860

CMD ["uvicorn", "kokoro_server:app", "--host", "0.0.0.0", "--port", "7860"]

engines/kokoro/compose.yml
Normal file
@@ -0,0 +1,37 @@
# Kokoro-82M TTS stack on Lucy.
#
# Audiobook-quality narrator engine (Apache 2.0 code + weights —
# clean stack vs F5-TTS's CC-BY-NC asterisk). Sibling to f5-tts;
# both share /mnt/cache/appdata/f5-tts/audio so skald's audio
# route serves outputs from either engine through the same path.
#
# License: Apache 2.0 top to bottom. Right for share/publish.
name: kokoro

services:
  kokoro:
    image: lucy-registry:5000/kokoro:0.5
    container_name: kokoro
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "192.168.0.5:7794:7860"
      - "127.0.0.1:7794:7860"
    volumes:
      - /mnt/cache/appdata/kokoro/hf-cache:/cache/hf
      # Shared with f5-tts so skald's /audio route covers both.
      - /mnt/cache/appdata/f5-tts/audio:/audio
    environment:
      HF_HOME: /cache/hf
      HF_HUB_DISABLE_TELEMETRY: "1"
    labels:
      org.sulkta.domain: "sulkta"
      org.sulkta.owner: "cobb"
      org.sulkta.managed-by: "compose"
      org.sulkta.role: "kokoro"

engines/kokoro/server.py
Normal file
@@ -0,0 +1,324 @@
"""Kokoro-82M FastAPI server, sibling to f5_server.

Same /synthesize contract as F5 so skald can route between engines
just by which URL it points at. The semantic difference: Kokoro
voices are NAMED (af_heart, af_bella, am_michael, etc.) — there's no
reference audio. We repurpose the `ref_audio_path` field to carry
the voice name; if it starts with '/' we treat it as an F5-style
path and return an error.

Render-and-stitch:
  The naive "feed the whole chapter to Kokoro" path produces audio
  that runs paragraphs together — no breath between scenes, no beat
  on a hard line break. So this server splits the input on paragraph
  and scene boundaries, renders each chunk, and concatenates with
  explicit silence inserts between chunks.

Control tags the splitter recognizes (case-insensitive):
  [pause:1.5s]   — silence of N seconds at this point
  [pause:500ms]  — silence of N milliseconds at this point
  [breath]       — short breath beat (~400ms)
  [scene]        — major scene break (~1500ms)

Implicit breaks the splitter inserts:
  Blank line between paragraphs → 700ms
  A line of just `---`          → 1500ms (scene break)

Sentence-internal pacing (commas, periods, em-dashes, ellipses)
is left to Kokoro's own phonemizer — it handles that well.

License: Apache 2.0 (code + model weights). Clean stack for the
sleep-quality narrator use case.
"""
import logging
import re
import time
import uuid
from pathlib import Path

import numpy as np
import soundfile as sf
import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

from kokoro import KPipeline


log = logging.getLogger("kokoro-server")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s %(message)s")


# ─── pipeline state ──────────────────────────────────────────────

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DEFAULT_LANG = "a"
DEFAULT_VOICE = "af_heart"
AUDIO_ROOT = Path("/audio")
SAMPLE_RATE = 24000

# Default silence durations for implicit breaks. Tags override.
PARAGRAPH_GAP_S = 0.7
SCENE_GAP_S = 1.5
BREATH_GAP_S = 0.4

_pipelines: dict[str, KPipeline] = {}


def _get_pipeline(lang_code: str) -> KPipeline:
    if lang_code not in _pipelines:
        log.info("loading kokoro pipeline lang=%s device=%s", lang_code, DEVICE)
        _pipelines[lang_code] = KPipeline(lang_code=lang_code, device=DEVICE)
        log.info("kokoro pipeline loaded lang=%s", lang_code)
    return _pipelines[lang_code]


# ─── split + render pipeline ─────────────────────────────────────

# A "node" is one of two kinds; the renderer walks the list,
# calls Kokoro on each text node with its (possibly per-segment)
# voice, and emits zeros for each silence node.
class Node:
    __slots__ = ("kind", "value", "voice")

    def __init__(self, kind: str, value, voice: str | None = None):
        # kind ∈ {"text", "silence"}; value is str for text and
        # float seconds for silence. voice override is only used
        # on text nodes from [voice:X]...[/voice] blocks; outside
        # those blocks the request's default voice is used.
        self.kind = kind
        self.value = value
        self.voice = voice


# Voice-block delimiters are parsed at a higher level than other
# tags so dialogue can contain its own [breath]/[pause] beats.
_VOICE_OPEN_RE = re.compile(r"\[voice:([A-Za-z0-9_-]+)\]")
_VOICE_CLOSE = "[/voice]"

_TAG_RE = re.compile(
    r"\[(pause:(?P<dur>[0-9]+(?:\.[0-9]+)?)(?P<unit>s|ms)?|breath|scene)\]",
    re.IGNORECASE,
)


def _parse_tag(match: re.Match) -> float:
    body = match.group(0).lower().strip("[]")
    if body == "breath":
        return BREATH_GAP_S
    if body == "scene":
        return SCENE_GAP_S
    dur = float(match.group("dur"))
    unit = (match.group("unit") or "s").lower()
    return dur / 1000.0 if unit == "ms" else dur


def _expand_inline(text: str, voice: str | None) -> list[Node]:
    """Expand inline [breath]/[pause]/[scene] tags inside a chunk
    of text that already has a single voice attribution. Voice
    blocks themselves are handled one level up in _split_paragraph_voices."""
    out: list[Node] = []
    text = text.replace(_VOICE_CLOSE, "").strip()  # drop orphan [/voice] markers
    if not text:
        return out
    cursor = 0
    for m in _TAG_RE.finditer(text):
        pre = text[cursor : m.start()].strip()
        if pre:
            out.append(Node("text", pre, voice))
        out.append(Node("silence", _parse_tag(m)))
        cursor = m.end()
    tail = text[cursor:].strip()
    if tail:
        out.append(Node("text", tail, voice))
    return out


def split_to_nodes(text: str) -> list[Node]:
    """Walk the source text and split it into text+silence nodes.

    Order of operations:
      1. Split on `---` lines (scene breaks).
      2. Within each scene, split on blank lines (paragraph breaks).
      3. Within each paragraph, split on [voice:X]...[/voice] blocks
         so each dialogue line carries its own voice attribution.
      4. Within each (paragraph, voice-region) chunk, expand inline
         [breath]/[pause:Xs]/[scene] tags.
    """
    nodes: list[Node] = []
    scenes = re.split(r"(?m)^\s*---\s*$", text)
    for s_idx, scene in enumerate(scenes):
        if s_idx > 0:
            nodes.append(Node("silence", SCENE_GAP_S))
        paragraphs = re.split(r"\n\s*\n", scene)
        first_para = True
        for para in paragraphs:
            para = para.strip()
            if not para:
                continue
            if not first_para:
                nodes.append(Node("silence", PARAGRAPH_GAP_S))
            first_para = False
            nodes.extend(_split_paragraph_voices(para))
    return nodes


def _split_paragraph_voices(para: str) -> list[Node]:
    """Split a single paragraph on [voice:X]...[/voice] blocks.
    Outside those blocks the voice is None (request default).
    Unmatched/orphan [/voice] markers are silently stripped.
    """
    out: list[Node] = []
    cursor = 0
    while cursor < len(para):
        m = _VOICE_OPEN_RE.search(para, cursor)
        if not m:
            out.extend(_expand_inline(para[cursor:], None))
            break
        # Text BEFORE the voice block uses default voice.
        out.extend(_expand_inline(para[cursor : m.start()], None))
        voice = m.group(1)
        body_start = m.end()
        close_idx = para.find(_VOICE_CLOSE, body_start)
        if close_idx < 0:
            # Unclosed voice block; treat rest of paragraph as that
            # voice. Defensive — should be rare.
            out.extend(_expand_inline(para[body_start:], voice))
            break
        out.extend(_expand_inline(para[body_start:close_idx], voice))
        cursor = close_idx + len(_VOICE_CLOSE)
    return out


def _silence_samples(seconds: float) -> np.ndarray:
    n = int(round(seconds * SAMPLE_RATE))
    return np.zeros(n, dtype=np.float32)


# ─── FastAPI app ─────────────────────────────────────────────────


class SynthesizeRequest(BaseModel):
    gen_text: str = Field(min_length=1)
    ref_audio_path: str = DEFAULT_VOICE
    ref_text: str | None = None
    output_filename: str | None = None
    speed: float = Field(default=1.0, ge=0.3, le=2.0)
    lang_code: str = DEFAULT_LANG


class SynthesizeResponse(BaseModel):
    ok: bool
    output_path: str
    sample_rate_hz: int
    duration_seconds: float
    elapsed_ms: int
    chars_in: int
    engine: str
    voice: str
    text_nodes: int
    silence_nodes: int
    # Every distinct Kokoro voice id that actually got rendered.
    # Single-element when no [voice:X] tags were in the input;
    # multiple when multi-voice dialogue was attributed.
    voices_used: list[str]


app = FastAPI(title="kokoro-server", version="0.2.0")


@app.on_event("startup")
def _startup() -> None:
    _get_pipeline(DEFAULT_LANG)


@app.get("/healthz")
def healthz() -> dict:
    # Shape matches f5_server's so the same Rust HealthResponse
    # struct deserializes both: model/vocoder/loaded fields are
    # required by skald-core::narrate::HealthResponse.
    return {
        "ok": True,
        "device": DEVICE,
        "model": "kokoro-82m",
        "vocoder": "kokoro-internal",
        "loaded": bool(_pipelines),
        "engine": "kokoro-82m",
        "default_voice": DEFAULT_VOICE,
        "default_lang": DEFAULT_LANG,
        "loaded_langs": list(_pipelines.keys()),
        "version": "0.2.0",
    }


|
||||||
|
@app.post("/synthesize", response_model=SynthesizeResponse)
|
||||||
|
def synthesize(req: SynthesizeRequest) -> SynthesizeResponse:
|
||||||
|
if req.ref_audio_path.startswith("/"):
|
||||||
|
raise HTTPException(
|
||||||
|
400,
|
||||||
|
"ref_audio_path looks like a filesystem path; Kokoro takes a voice "
|
||||||
|
"name like 'af_heart' or 'am_michael'. Did you mean to route to the "
|
||||||
|
"f5-tts engine?",
|
||||||
|
)
|
||||||
|
voice = req.ref_audio_path
|
||||||
|
|
||||||
|
output_filename = req.output_filename or f"{uuid.uuid4().hex}.wav"
|
||||||
|
if "/" in output_filename or ".." in output_filename:
|
||||||
|
raise HTTPException(400, "output_filename must be a bare name, no path parts")
|
||||||
|
output_path = AUDIO_ROOT / output_filename
|
||||||
|
output_path.parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
pipeline = _get_pipeline(req.lang_code)
|
||||||
|
|
||||||
|
# Split the text into a node list. Empty nodes get filtered.
|
||||||
|
nodes = [n for n in split_to_nodes(req.gen_text) if n.kind == "silence" or n.value]
|
||||||
|
text_count = sum(1 for n in nodes if n.kind == "text")
|
||||||
|
silence_count = sum(1 for n in nodes if n.kind == "silence")
|
||||||
    if text_count == 0:
        raise HTTPException(400, "gen_text expanded to zero text nodes")

    started = time.monotonic()
    pieces: list[np.ndarray] = []
    voices_used: set[str] = set()
    for node in nodes:
        if node.kind == "silence":
            pieces.append(_silence_samples(node.value))
            continue
        # text: hand to Kokoro. The node's voice override (set by
        # [voice:X]...[/voice] blocks) wins; otherwise the request's
        # default narrator voice.
        seg_voice = node.voice or voice
        voices_used.add(seg_voice)
        chunk_audio: list[np.ndarray] = []
        for _, _, audio in pipeline(node.value, voice=seg_voice, speed=req.speed):
            arr = audio.cpu().numpy() if hasattr(audio, "cpu") else np.asarray(audio)
            chunk_audio.append(arr.astype(np.float32))
        if chunk_audio:
            pieces.append(np.concatenate(chunk_audio))
    elapsed_ms = int((time.monotonic() - started) * 1000)

    if not pieces:
        raise HTTPException(500, "kokoro returned no audio")
    full_audio = np.concatenate(pieces)
    sf.write(str(output_path), full_audio, SAMPLE_RATE, subtype="PCM_16")
    duration_s = float(len(full_audio)) / float(SAMPLE_RATE)

    log.info(
        "synthesized chars=%d voice=%s text_nodes=%d silence_nodes=%d "
        "voices_used=%s -> %s (dur=%.2fs, elapsed=%dms)",
        len(req.gen_text), voice, text_count, silence_count,
        sorted(voices_used), output_path, duration_s, elapsed_ms,
    )
    return SynthesizeResponse(
        ok=True,
        output_path=str(output_path),
        sample_rate_hz=SAMPLE_RATE,
        duration_seconds=duration_s,
        elapsed_ms=elapsed_ms,
        chars_in=len(req.gen_text),
        engine="kokoro-82m",
        voice=voice,
        text_nodes=text_count,
        silence_nodes=silence_count,
        voices_used=sorted(voices_used),
    )
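Review note: the stitch step at the end of the handler (silence nodes as zero samples, concatenated with rendered audio, written as 16-bit PCM) can be exercised without the model. The following is a minimal stdlib-only sketch, not the sidecar's code. It assumes mono float samples in [-1, 1] and a 24 kHz sample rate (the value the tortoise sidecar declares); `silence_samples` and `stitch_to_wav` are hypothetical names, and `wave` stands in for soundfile.

```python
import io
import struct
import wave

SAMPLE_RATE = 24000  # assumption: same value the tortoise sidecar declares


def silence_samples(seconds: float) -> list[float]:
    """Zero-valued samples for a pause of the given length."""
    return [0.0] * int(round(seconds * SAMPLE_RATE))


def stitch_to_wav(pieces: list[list[float]]) -> tuple[bytes, float]:
    """Concatenate mono float pieces and encode them as a 16-bit PCM WAV.

    Returns (wav_bytes, duration_seconds).
    """
    samples = [s for piece in pieces for s in piece]
    # Clamp to [-1, 1] before scaling so out-of-range floats cannot overflow int16.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = PCM_16
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buf.getvalue(), len(samples) / SAMPLE_RATE


# One second of stand-in "speech", a half-second pause, one more second.
speech = [0.25] * SAMPLE_RATE
wav_bytes, duration_s = stitch_to_wav([speech, silence_samples(0.5), speech])
```

Duration is derived from the sample count rather than from wall-clock time, which is why the handler can report `duration_seconds` and `elapsed_ms` independently.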
45
engines/tortoise/Dockerfile
Normal file
@@ -0,0 +1,45 @@
# Sulkta build of Tortoise-TTS.
#
# Voice roster (built-in, no cloning needed): angie, daniel, deniro,
# emma, freeman, geralt, halle, jlaw, lj, mol, myself, pat, pat2,
# rainbow, snakes, tim_reynolds, tom, train_atkins, train_dotrice,
# train_dreams, train_grace, train_kennard, train_lescault,
# train_mouse, weaver, william. 26 voices baked in.
#
# License: Apache 2.0 (code) + Apache 2.0 (model weights). Clean
# stack for share/publish.
#
# Speed: slow. Trade for quality. Standard preset is ~10x slower
# than Kokoro; high_quality is ~30x slower. Worth it for the
# audiobook-quality bar.

FROM pytorch/pytorch:2.6.0-cuda12.4-cudnn9-runtime

ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=1 \
    HF_HOME=/cache/hf \
    HF_HUB_DISABLE_TELEMETRY=1 \
    TORTOISE_MODELS_DIR=/cache/tortoise-models

RUN apt-get update && apt-get install -y --no-install-recommends \
        ffmpeg \
        git \
        ca-certificates \
        curl \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir \
        'tortoise-tts>=3.0.0' \
        'fastapi>=0.115.0' \
        'uvicorn>=0.32.0' \
        'soundfile>=0.13.0' \
        'numpy<2'

RUN mkdir -p /cache/hf /cache/tortoise-models /audio

COPY tortoise_server.py /app/tortoise_server.py
WORKDIR /app

EXPOSE 7860

CMD ["uvicorn", "tortoise_server:app", "--host", "0.0.0.0", "--port", "7860"]
43
engines/tortoise/compose.yml
Normal file
@@ -0,0 +1,43 @@
# Tortoise-TTS stack on Lucy. Audiobook-quality engine with 26
# named voices (no cloning). Apache 2.0 top to bottom.
#
# Slow: ~10x kokoro wall clock at 'standard' preset. Worth it for
# the quality bar. Cobb's call 2026-05-14: "use higgs (now tortoise)
# and we will only let it use the full gpu for runs"; translated:
# runs are batched, slow is acceptable.
#
# Co-resides with kokoro on the 2070 Super since tortoise is ~5GB
# and kokoro is ~1GB (~6GB of the card's 8GB). If OOM hits during a
# render, we'll add a coordination layer to pause kokoro first.
name: tortoise

services:
  tortoise:
    image: lucy-registry:5000/tortoise:0.1
    container_name: tortoise
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "192.168.0.5:7795:7860"
      - "127.0.0.1:7795:7860"
    volumes:
      - /mnt/cache/appdata/tortoise/hf-cache:/cache/hf
      - /mnt/cache/appdata/tortoise/models:/cache/tortoise-models
      # Shared audio dir with f5/kokoro so skald serves all engines'
      # outputs through the same /audio route.
      - /mnt/cache/appdata/f5-tts/audio:/audio
    environment:
      HF_HOME: /cache/hf
      HF_HUB_DISABLE_TELEMETRY: "1"
      TORTOISE_MODELS_DIR: /cache/tortoise-models
    labels:
      org.sulkta.domain: "sulkta"
      org.sulkta.owner: "cobb"
      org.sulkta.managed-by: "compose"
      org.sulkta.role: "tortoise-tts"
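Review note: with the port mapping in this compose file, the sidecar answers on host port 7795. Below is a minimal sketch of a readiness probe against `/healthz`. It assumes the loopback binding and the `ok` / `loaded` fields that the tortoise health handler in this commit returns; `is_ready` and `probe` are hypothetical helper names, not part of skald.

```python
import json
import urllib.request

HEALTH_URL = "http://127.0.0.1:7795/healthz"  # loopback mapping from the compose file


def is_ready(payload: object) -> bool:
    """True only when the sidecar reports ok and the model is loaded."""
    return (
        isinstance(payload, dict)
        and bool(payload.get("ok"))
        and bool(payload.get("loaded"))
    )


def probe(url: str = HEALTH_URL, timeout_s: float = 5.0) -> bool:
    """Fetch /healthz; any network or JSON error counts as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return is_ready(json.load(resp))
    except (OSError, ValueError):
        # URLError/HTTPError subclass OSError; JSONDecodeError subclasses ValueError.
        return False
```

Treating every failure as "not ready" keeps the caller's retry loop simple while the model is still loading at container start.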
305
engines/tortoise/server.py
Normal file
@@ -0,0 +1,305 @@
"""Tortoise-TTS FastAPI server. Sibling to kokoro_server.

Same /synthesize contract as the kokoro server so skald only has to
route by voice.source. Differences:
  - Tortoise voices are NAMED PRESETS shipped with the library
    (angie, daniel, freeman, jlaw, lj, weaver, etc.). No cloning.
  - Tortoise is slow. Standard preset is ~10x kokoro's wall clock.
    Caller should expect minutes per chunk, not seconds.
  - Render-and-stitch and the tag splitter are lifted from
    kokoro_server rather than re-implemented for v0.1: tortoise's
    quality is the win, not multi-voice. Untagged text renders
    with the request's default voice throughout.
  - The [voice:X]...[/voice] tags ARE parsed though: each block
    renders with its named voice. This is the audiobook win.

Quality presets: ultra_fast / fast / standard / high_quality. The
trade-off is real: high_quality on a 2070 Super is ~30x slower
than kokoro. Default to 'standard' for the bar.
"""
import logging
import re
import time
import uuid
from pathlib import Path

import numpy as np
import soundfile as sf
import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice


log = logging.getLogger("tortoise-server")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s %(message)s")


DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DEFAULT_VOICE = "lj"
DEFAULT_PRESET = "standard"
AUDIO_ROOT = Path("/audio")
SAMPLE_RATE = 24000  # Tortoise outputs 24kHz

# Silence durations for between-chunk stitching (matches kokoro
# server's conventions so audio from both engines feels similar).
PARAGRAPH_GAP_S = 0.7
SCENE_GAP_S = 1.5
BREATH_GAP_S = 0.4

_tts: TextToSpeech | None = None
_voice_cache: dict[str, tuple] = {}


def _get_tts() -> TextToSpeech:
    global _tts
    if _tts is None:
        log.info("loading tortoise device=%s", DEVICE)
        _tts = TextToSpeech(use_deepspeed=False, kv_cache=True, half=(DEVICE == "cuda"))
        log.info("tortoise loaded")
    return _tts


def _get_voice(name: str) -> tuple:
    """Cache voice latents to avoid re-loading reference clips on
    every synthesis call. Tortoise's load_voice returns
    (voice_samples, conditioning_latents)."""
    if name not in _voice_cache:
        _voice_cache[name] = load_voice(name)
    return _voice_cache[name]


# ─── tag splitter (lifted from kokoro_server) ───────────────────


class Node:
    __slots__ = ("kind", "value", "voice")

    def __init__(self, kind: str, value, voice: str | None = None):
        self.kind = kind
        self.value = value
        self.voice = voice


_VOICE_OPEN_RE = re.compile(r"\[voice:([A-Za-z0-9_-]+)\]")
_VOICE_CLOSE = "[/voice]"
_TAG_RE = re.compile(
    r"\[(pause:(?P<dur>[0-9]+(?:\.[0-9]+)?)(?P<unit>s|ms)?|breath|scene)\]",
    re.IGNORECASE,
)


def _parse_tag(match: re.Match) -> float:
    body = match.group(0).lower().strip("[]")
    if body == "breath":
        return BREATH_GAP_S
    if body == "scene":
        return SCENE_GAP_S
    dur = float(match.group("dur"))
    unit = (match.group("unit") or "s").lower()
    return dur / 1000.0 if unit == "ms" else dur


def _expand_inline(text: str, voice: str | None) -> list[Node]:
    out: list[Node] = []
    text = text.strip()
    if not text:
        return out
    cursor = 0
    for m in _TAG_RE.finditer(text):
        pre = text[cursor : m.start()].strip()
        if pre:
            out.append(Node("text", pre, voice))
        out.append(Node("silence", _parse_tag(m)))
        cursor = m.end()
    tail = text[cursor:].strip()
    if tail:
        out.append(Node("text", tail, voice))
    return out


def _split_paragraph_voices(para: str) -> list[Node]:
    out: list[Node] = []
    cursor = 0
    while cursor < len(para):
        m = _VOICE_OPEN_RE.search(para, cursor)
        if not m:
            out.extend(_expand_inline(para[cursor:], None))
            break
        out.extend(_expand_inline(para[cursor : m.start()], None))
        voice = m.group(1)
        body_start = m.end()
        close_idx = para.find(_VOICE_CLOSE, body_start)
        if close_idx < 0:
            out.extend(_expand_inline(para[body_start:], voice))
            break
        out.extend(_expand_inline(para[body_start:close_idx], voice))
        cursor = close_idx + len(_VOICE_CLOSE)
    return out


def split_to_nodes(text: str) -> list[Node]:
    nodes: list[Node] = []
    scenes = re.split(r"(?m)^\s*---\s*$", text)
    for s_idx, scene in enumerate(scenes):
        if s_idx > 0:
            nodes.append(Node("silence", SCENE_GAP_S))
        paragraphs = re.split(r"\n\s*\n", scene)
        first_para = True
        for para in paragraphs:
            para = para.strip()
            if not para:
                continue
            if not first_para:
                nodes.append(Node("silence", PARAGRAPH_GAP_S))
            first_para = False
            nodes.extend(_split_paragraph_voices(para))
    return nodes


def _silence_samples(seconds: float) -> np.ndarray:
    n = int(round(seconds * SAMPLE_RATE))
    return np.zeros(n, dtype=np.float32)


# ─── FastAPI ─────────────────────────────────────────────────────


class SynthesizeRequest(BaseModel):
    gen_text: str = Field(min_length=1)
    # Tortoise voice name (lj, freeman, daniel, etc.). API-compat
    # field carries the voice id as a "path", same shape as kokoro.
    ref_audio_path: str = DEFAULT_VOICE
    ref_text: str | None = None
    output_filename: str | None = None
    speed: float = Field(default=1.0, ge=0.3, le=2.0)  # API-compat; not applied by this engine
    # Tortoise-specific: quality preset. Slower = better.
    preset: str = Field(default=DEFAULT_PRESET)


class SynthesizeResponse(BaseModel):
    ok: bool
    output_path: str
    sample_rate_hz: int
    duration_seconds: float
    elapsed_ms: int
    chars_in: int
    engine: str
    voice: str
    text_nodes: int
    silence_nodes: int
    voices_used: list[str]


app = FastAPI(title="tortoise-server", version="0.1.0")


@app.on_event("startup")
def _startup() -> None:
    _get_tts()
    # Pre-load the default voice so the first synth doesn't pay
    # the latent-extraction cost.
    try:
        _get_voice(DEFAULT_VOICE)
    except Exception as e:
        log.warning("could not preload default voice %s: %s", DEFAULT_VOICE, e)


@app.get("/healthz")
def healthz() -> dict:
    # Shape matches f5_server/kokoro_server so skald's HealthResponse
    # struct deserializes all three.
    return {
        "ok": True,
        "device": DEVICE,
        "model": "tortoise-tts",
        "vocoder": "tortoise-builtin",
        "loaded": _tts is not None,
        "engine": "tortoise-tts",
        "default_voice": DEFAULT_VOICE,
        "default_preset": DEFAULT_PRESET,
        "cached_voices": list(_voice_cache.keys()),
        "version": "0.1.0",
    }


@app.post("/synthesize", response_model=SynthesizeResponse)
def synthesize(req: SynthesizeRequest) -> SynthesizeResponse:
    if req.ref_audio_path.startswith("/"):
        raise HTTPException(
            400,
            "ref_audio_path looks like a filesystem path; tortoise takes a voice "
            "name like 'lj' or 'freeman'.",
        )
    voice = req.ref_audio_path
    preset = req.preset

    output_filename = req.output_filename or f"{uuid.uuid4().hex}.wav"
    if "/" in output_filename or ".." in output_filename:
        raise HTTPException(400, "output_filename must be a bare name, no path parts")
    output_path = AUDIO_ROOT / output_filename
    output_path.parent.mkdir(parents=True, exist_ok=True)

    tts = _get_tts()

    nodes = [n for n in split_to_nodes(req.gen_text) if n.kind == "silence" or n.value]
    text_count = sum(1 for n in nodes if n.kind == "text")
    silence_count = sum(1 for n in nodes if n.kind == "silence")
    if text_count == 0:
        raise HTTPException(400, "gen_text expanded to zero text nodes")

    started = time.monotonic()
    pieces: list[np.ndarray] = []
    voices_used: set[str] = set()
    for node in nodes:
        if node.kind == "silence":
            pieces.append(_silence_samples(node.value))
            continue
        seg_voice = node.voice or voice
        try:
            samples, latents = _get_voice(seg_voice)
        except Exception as e:
            log.warning("voice %s failed to load (%s); falling back to default", seg_voice, e)
            seg_voice, (samples, latents) = voice, _get_voice(voice)
        voices_used.add(seg_voice)  # record the voice actually rendered, after any fallback
        # Tortoise's tts_with_preset returns a torch.Tensor on the
        # configured device.
        audio_tensor = tts.tts_with_preset(
            text=node.value,
            voice_samples=samples,
            conditioning_latents=latents,
            preset=preset,
        )
        if isinstance(audio_tensor, list):
            audio_tensor = audio_tensor[0]
        arr = audio_tensor.squeeze().cpu().numpy().astype(np.float32)
        pieces.append(arr)
    elapsed_ms = int((time.monotonic() - started) * 1000)

    if not pieces:
        raise HTTPException(500, "tortoise returned no audio")
    full_audio = np.concatenate(pieces)
    sf.write(str(output_path), full_audio, SAMPLE_RATE, subtype="PCM_16")
    duration_s = float(len(full_audio)) / float(SAMPLE_RATE)

    log.info(
        "synthesized chars=%d voice=%s preset=%s text_nodes=%d silence_nodes=%d "
        "voices_used=%s -> %s (dur=%.2fs, elapsed=%dms)",
        len(req.gen_text), voice, preset, text_count, silence_count,
        sorted(voices_used), output_path, duration_s, elapsed_ms,
    )
    return SynthesizeResponse(
        ok=True,
        output_path=str(output_path),
        sample_rate_hz=SAMPLE_RATE,
        duration_seconds=duration_s,
        elapsed_ms=elapsed_ms,
        chars_in=len(req.gen_text),
        engine="tortoise-tts",
        voice=voice,
        text_nodes=text_count,
        silence_nodes=silence_count,
        voices_used=sorted(voices_used),
    )
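Review note: the inline tag grammar is easy to misread from the regex alone. The sketch below shows how pause, breath, and scene tags map to seconds, reusing the tag pattern and gap constants from this file. `tag_seconds` and `gaps_in` are hypothetical names used only for illustration; the sidecar's own helpers are `_parse_tag` and `_expand_inline`.

```python
import re

# Constants and pattern copied from the sidecar above.
BREATH_GAP_S = 0.4
SCENE_GAP_S = 1.5
TAG_RE = re.compile(
    r"\[(pause:(?P<dur>[0-9]+(?:\.[0-9]+)?)(?P<unit>s|ms)?|breath|scene)\]",
    re.IGNORECASE,
)


def tag_seconds(match: re.Match) -> float:
    """Translate one matched tag into a silence length in seconds."""
    body = match.group(0).lower().strip("[]")
    if body == "breath":
        return BREATH_GAP_S
    if body == "scene":
        return SCENE_GAP_S
    dur = float(match.group("dur"))
    unit = (match.group("unit") or "s").lower()  # a bare number means seconds
    return dur / 1000.0 if unit == "ms" else dur


def gaps_in(text: str) -> list[float]:
    """Silence lengths for every tag in the text, in order of appearance."""
    return [tag_seconds(m) for m in TAG_RE.finditer(text)]


gaps = gaps_in("One.[pause:500ms] Two.[breath] Three.[pause:2] Four.[SCENE]")
# gaps == [0.5, 0.4, 2.0, 1.5]
```

Tags are matched case-insensitively, and a `pause` with no unit is read as seconds, so `[pause:2]` is a two-second gap rather than two milliseconds.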