Compare commits: main...engine/kok (1 commit)

| Author | SHA1 | Date |
|---|---|---|
| | 36922706a2 | |

2 changed files with 60 additions and 1 deletion

engines/kokoro/hacks.md (Normal file, 48 lines)

@@ -0,0 +1,48 @@
# Kokoro engine — kludges branched off main

This branch carries the engine-specific tweaks that don't generalise
to F5 / Tortoise. Each one is a real workaround for a real Kokoro-82M
limitation, not a stylistic choice. When we move to a bigger model
these should disappear.

## 1. Doubled `??` for question prosody

**File:** `server.py` (`_emphasize_questions` + `_QUESTION_RE`).

Kokoro-82M's prosody on a single `?` is flat: interrogatives read like
declaratives. This is where the 82M-parameter size shows. Doubling the
mark to `??` triggers a noticeably stronger rising-pitch contour.

Tried + works: `??`. Tried + worse: `?!` (sounds shouty), trailing
spaces (no effect).

Remove when: upgrading to a bigger model OR a Kokoro version with
better prosody control.

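A minimal sketch of the substitution, useful for checking edge cases by hand. The pattern here is an assumption: it leaves a `?` alone when it is already doubled or sits next to a `!`, and the pattern in `server.py` may differ in detail. The function name is illustrative.

```python
import re

# Assumed pattern: a lone `?` that is not adjacent to another `?` or to `!`.
_QUESTION_RE = re.compile(r"(?<![?!])\?(?![?!])")


def emphasize_questions(text: str) -> str:
    """Double each lone question mark; leave `??`, `?!` and `!?` untouched."""
    return _QUESTION_RE.sub("??", text)


if __name__ == "__main__":
    for sample in ("Is it ready?", "Really??", "What?!", "No way!?", "Fine."):
        print(f"{sample!r} -> {emphasize_questions(sample)!r}")
```

Because the lookarounds reject any `?` that already has a neighbouring `?`, running the function twice on the same text gives the same result as running it once.
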
## 2. Paragraph / scene / breath gap durations

**File:** `server.py` (`PARAGRAPH_GAP_S=0.7`, `SCENE_GAP_S=1.5`,
`BREATH_GAP_S=0.4`).

These were tuned by ear against af_heart's natural pacing for long-form
prose. Other voices (e.g. am_michael's slower delivery) may want
shorter gaps; a per-voice override map would be more correct but
isn't worth the complexity yet.

The 2026-05-14 feedback was "some pauses are a tad too long". If that
is confirmed, the paragraph/scene/breath gaps of 0.7/1.5/0.4 s may
want to drop to 0.5/1.2/0.3 s.

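If the per-voice override map ever becomes worth it, it could look roughly like the sketch below. Everything here is hypothetical: the `Gaps` class, `VOICE_GAPS`, and `gaps_for` do not exist in `server.py`, and the am_michael numbers are placeholders rather than tuned values. Only the defaults come from the constants above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Gaps:
    """Silence durations in seconds; defaults mirror the current constants."""
    paragraph_s: float = 0.7
    scene_s: float = 1.5
    breath_s: float = 0.4


DEFAULT_GAPS = Gaps()

# Hypothetical overrides keyed by voice name (placeholder numbers, not tuned).
VOICE_GAPS: dict[str, Gaps] = {
    "am_michael": Gaps(paragraph_s=0.5, scene_s=1.2, breath_s=0.3),
}


def gaps_for(voice: str | None) -> Gaps:
    """Return the gap set for a voice, falling back to the defaults."""
    if voice is None:
        return DEFAULT_GAPS
    return VOICE_GAPS.get(voice, DEFAULT_GAPS)
```

Voices without an entry keep today's behaviour, so the map can be filled in one voice at a time.
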
## 3. Pronunciation respellings as ALL-LOWERCASE

**File:** *(data, not code: the `pronunciation_overrides` DB table)*

Kokoro's misaki phonemizer treats consecutive uppercase letters as
initialisms ("PRIP-yat" → "P-R-I-P yat"). The seeded respellings in
`pronunciation_overrides WHERE phoneme_format='respelling'` must
therefore use lowercase syllabification: `prip-yat`, `dyat-loff`,
`bryu-hah-noff`. The cost is that stress marking (the capitalised
syllable) is lost.

For Tortoise this constraint may not hold (different phonemizer);
the respelling format is currently Kokoro-tuned. Future: per-engine
`phoneme_format` buckets, or have skald narrate pass the engine name
when selecting overrides.

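A small check along these lines could catch a respelling that would trip the initialism behaviour before it is seeded. This is only a sketch: both helper names are hypothetical, and treating "two or more consecutive uppercase ASCII letters" as the trigger is an assumption based on the description above, not on misaki's actual rules.

```python
import re

# Assumed trigger: a run of two or more uppercase ASCII letters.
_UPPER_RUN_RE = re.compile(r"[A-Z]{2,}")


def has_uppercase_run(respelling: str) -> bool:
    """True if the respelling contains a run that may be read as an initialism."""
    return _UPPER_RUN_RE.search(respelling) is not None


def to_kokoro_respelling(respelling: str) -> str:
    """Lowercase a respelling; capitalised-syllable stress marks are dropped."""
    return respelling.lower()


if __name__ == "__main__":
    for word in ("PRIP-yat", "prip-yat", "DYAT-loff"):
        print(word, has_uppercase_run(word), to_kokoro_respelling(word))
```
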
@@ -113,12 +113,23 @@ def _parse_tag(match: re.Match) -> float:
     return dur / 1000.0 if unit == "ms" else dur
 
 
+# [HACK — engine/kokoro] Kokoro-82M has weak question prosody on a
+# single `?`. Doubling the question mark to `??` reliably triggers a
+# more interrogative rising-pitch contour without changing semantics.
+# Skip if already doubled or part of an interrobang. See hacks.md.
+_QUESTION_RE = re.compile(r"(?<![?!])\?(?![?!])")
+
+
+def _emphasize_questions(text: str) -> str:
+    return _QUESTION_RE.sub("??", text)
+
+
 def _expand_inline(text: str, voice: str | None) -> list[Node]:
     """Expand inline [breath]/[pause]/[scene] tags inside a chunk
     of text that already has a single voice attribution. Voice
     blocks themselves are handled one level up in split_to_nodes."""
     out: list[Node] = []
-    text = text.strip()
+    text = _emphasize_questions(text.strip())
     if not text:
         return out
     cursor = 0