sterilize Phase 2: pass Mealie's food catalog into Sonnet's prompt

Sonnet was picking food names blindly. We then tried to match those
back to Mealie's catalog post-hoc. When Sonnet's natural pick didn't
match Mealie's exact convention, we'd create a duplicate row instead
of reusing the existing one. Lucky alignment with the seed kept the
dupe rate low, but the architecture had no real "Mealie is source of
truth" guarantee.

This change makes that guarantee explicit:

1. Sterilizer now lazy-loads Mealie's full food catalog on first
   _parse_batch call (one fetch per Sterilizer instance, so a bulk
   job pulls 2895 rows once and reuses across all 226 recipe parses).
   Uses the underlying mealie._get with per_page=2000 + page-walk for
   defensive coverage of really large catalogs.

2. STERILIZE_SYSTEM is now STERILIZE_SYSTEM_TEMPLATE — a string with
   a {foods} placeholder. _system_prompt() splices in a bullet list of
   every Mealie food (name, plural, aliases) at runtime.

3. New CATALOG RULES in the prompt instruct Sonnet to:
   - Match against name / pluralName / aliases first, return canonical
     name verbatim with is_new_food=false
   - Strip prep modifiers into note
   - Singularize plurals when canonical is singular
   - Treat brand variations as canonical+note ("Heinz ketchup" →
     food: "ketchup", note: "Heinz")
   - Set is_new_food=true ONLY when no reasonable catalog match exists,
     since adding aliases to fix mismatches later is way easier than
     cleaning duplicate food rows after the fact

4. New is_new_food field on IngredientParse and per-item schema. Will
   eventually drive an "alias suggestion" UI, but for v1 just gives
   us telemetry on how often Sonnet falls back to inventing names.

Net effect for the family-internal goal: zero duplicate food creates
from convention mismatches, smarter parses that respect the catalog
Cobb spent time curating, foundation laid for the Step 2 re-key
migration where cauldron_food_metadata gets keyed by Mealie food.id.
This commit is contained in:
Kayos 2026-04-30 11:48:40 -07:00
parent d359bed450
commit 9334d161e4

View file

@ -22,7 +22,7 @@ from .forge import Forge, ForgeError
from .mealie import Mealie, MealieError from .mealie import Mealie, MealieError
STERILIZE_SYSTEM = """You are a precise recipe ingredient parser. You ONLY output valid JSON. STERILIZE_SYSTEM_TEMPLATE = """You are a precise recipe ingredient parser. You ONLY output valid JSON.
You receive a list of free-form ingredient strings and return a parallel You receive a list of free-form ingredient strings and return a parallel
list of LISTS one inner list per input. Most inputs map 11 (single item list of LISTS one inner list per input. Most inputs map 11 (single item
inside the list). Compound lines that name multiple distinct foods MUST inside the list). Compound lines that name multiple distinct foods MUST
@ -30,43 +30,65 @@ fan out into multiple items so each food gets its own row on the shopping
list. list.
Per-item schema: Per-item schema:
{ {{
"quantity": <number or null>, # numeric amount; fractions → decimals (1/2 → 0.5) "quantity": <number or null>, # numeric amount; fractions → decimals (1/2 → 0.5)
"unit": <string or null>, # singular canonical: "cup", "tbsp", "tsp", "oz", "lb", "g", "kg", "ml", "l", "clove", "slice", "can", "package", "piece", "pinch", "dash", "handful". null if no unit (e.g. "1 onion"). "unit": <string or null>, # singular canonical: "cup", "tbsp", "tsp", "oz", "lb", "g", "kg", "ml", "l", "clove", "slice", "can", "package", "piece", "pinch", "dash", "handful". null if no unit (e.g. "1 onion").
"food": <string or null>, # core food noun, singular canonical lowercase: "onion", "garlic", "rice", "olive oil". Strip prep state ("chopped", "diced") into note. "food": <string or null>, # core food noun. PREFER an exact match from the canonical catalog below; only invent a new name if nothing matches.
"note": <string or null>, # prep state, brand, color, modifier: "chopped", "extra virgin", "yellow", "to taste" "note": <string or null>, # prep state, brand, color, modifier: "chopped", "extra virgin", "yellow", "to taste"
"approx": <bool> # true if input said "about", "a pinch", "to taste", or otherwise vague "approx": <bool>, # true if input said "about", "a pinch", "to taste", or otherwise vague
} "is_new_food": <bool> # true if `food` is NOT in the canonical catalog and you're proposing a new name. false if `food` matches an existing canonical entry verbatim.
}}
CANONICAL FOOD CATALOG (use these names verbatim exact case + spelling
when the ingredient maps to one of them). Rows shown as:
<name> [aliases: a, b, c] or <name> (plural: <plural>)
{foods}
CATALOG RULES (most important):
- If the ingredient string matches an entry by name OR pluralName OR alias,
return the canonical NAME (the first column above) verbatim, exact case.
Set is_new_food=false.
- Strip prep modifiers ("chopped", "diced", "minced", "halved") into note;
the catalog name should be the bare food.
- Plural in input singular in output if the canonical is singular.
e.g. "2 onions" food: "onion" (when "onion" is in the catalog).
- Branded variations (e.g. "Heinz ketchup") use canonical "ketchup" (or
whatever's in the catalog), put brand in note.
- Only set is_new_food=true when there's truly no reasonable match. Prefer
matching even if imperfect (e.g. "kosher salt" "salt" with note
"kosher", is_new_food=false), since adding aliases later is easier than
cleaning duplicate food rows.
FAN-OUT RULES return MULTIPLE items for one input when: FAN-OUT RULES return MULTIPLE items for one input when:
- "salt and pepper" / "salt and ground black pepper to taste" split into 2 items, each - "salt and pepper" / "salt and ground black pepper to taste" split into 2 items, each
{quantity: 1, unit: "dash", food: "salt"|"black pepper", note: "to taste", approx: true} {{quantity: 1, unit: "dash", food: <catalog match>, note: "to taste", approx: true}}
- "Toppings (cinnamon butter, marshmallows, ground cinnamon, butter, etc)" or - "Toppings (cinnamon butter, marshmallows, ground cinnamon, butter, etc)" or
"Optional: cilantro, lime, queso fresco" one item per food in the comma list. "Optional: cilantro, lime, queso fresco" one item per food in the comma list.
Drop the wrapper word ("Toppings", "Optional"); leave it OUT of food/note. Skip Drop the wrapper word ("Toppings", "Optional"); leave it OUT of food/note. Skip
filler words like "etc". Each item: quantity=null, unit=null, food=<name>, note=null, approx=true. filler words like "etc". Each item: quantity=null, unit=null, food=<catalog match>, note=null, approx=true.
- "1 lemon, juice and zest" 2 items: {qty:1, unit:null, food:"lemon juice"} and {qty:1, unit:null, food:"lemon zest"} - "1 lemon, juice and zest" 2 items: {{qty:1, unit:null, food:<catalog match for "lemon juice">}} and {{qty:1, unit:null, food:<catalog match for "lemon zest">}}.
- DO NOT split "salt and vinegar chips" or "macaroni and cheese" those are - DO NOT split "salt and vinegar chips" or "macaroni and cheese" those are
compound food names, not multi-food lines. Heuristic: if the words on either compound food names, not multi-food lines. Heuristic: if the words on either
side of "and" are a recognized standalone food, split; otherwise keep as one. side of "and" are a recognized standalone food, split; otherwise keep as one.
PARSE RULES (for the common 11 case): PARSE RULES (for the common 11 case):
- Convert fractions: "1/2" 0.5, "1 1/4" 1.25 - Convert fractions: "1/2" 0.5, "1 1/4" 1.25
- "a pinch" / "a dash" alone {quantity: 1, unit: "pinch"|"dash", approx: true} - "a pinch" / "a dash" alone {{quantity: 1, unit: "pinch"|"dash", approx: true}}
- "to taste" alone {quantity: null, unit: null, food: <food>, note: "to taste", approx: true} - "to taste" alone {{quantity: null, unit: null, food: <food>, note: "to taste", approx: true}}
- "1 small onion" {quantity: 1, unit: null, food: "onion", note: "small"} - "1 small onion" {{quantity: 1, unit: null, food: "onion", note: "small"}}
- "2 cloves garlic, minced" {quantity: 2, unit: "clove", food: "garlic", note: "minced"} - "2 cloves garlic, minced" {{quantity: 2, unit: "clove", food: "garlic", note: "minced"}}
- "1.5 cups broccoli (coarsely chopped florets)" {quantity: 1.5, unit: "cup", food: "broccoli", note: "coarsely chopped florets"} - "1.5 cups broccoli (coarsely chopped florets)" {{quantity: 1.5, unit: "cup", food: "broccoli", note: "coarsely chopped florets"}}
- Section headers like "For the sauce:" 1 item with all fields null EXCEPT - Section headers like "For the sauce:" 1 item with all fields null EXCEPT
note: "<header text>" (so Mealie can preserve the header row) note: "<header text>", is_new_food: false (so Mealie can preserve the header row)
- If you genuinely cannot parse (junk input), return 1 item with all fields null - If you genuinely cannot parse (junk input), return 1 item with all fields null
and the original string in note. and the original string in note, is_new_food: false.
- DO NOT add fields not in the schema. - DO NOT add fields not in the schema.
- DO NOT wrap output in markdown fences. - DO NOT wrap output in markdown fences.
- DO NOT include any prose before or after the JSON. - DO NOT include any prose before or after the JSON.
Input shape: {"ingredients": ["str", "str", ...]} Input shape: {{"ingredients": ["str", "str", ...]}}
Output shape: {"parses": [[{...}, {...}], [{...}], [{...}, {...}, {...}], ...]} Output shape: {{"parses": [[{{...}}, {{...}}], [{{...}}], [{{...}}, {{...}}, {{...}}], ...]}}
The outer list MUST have the same length as the input list. Each inner list The outer list MUST have the same length as the input list. Each inner list
MUST contain at least 1 item (use the all-null junk-fallback if needed). MUST contain at least 1 item (use the all-null junk-fallback if needed).
""" """
@ -79,6 +101,7 @@ class IngredientParse:
food: str | None food: str | None
note: str | None note: str | None
approx: bool approx: bool
is_new_food: bool = False # true when Sonnet proposes a new canonical name
@dataclass @dataclass
@ -100,6 +123,11 @@ class Sterilizer:
self.mealie = mealie self.mealie = mealie
self.forge = forge self.forge = forge
self.model = model self.model = model
# Lazy-loaded canonical food catalog from Mealie. Fetched once
# per Sterilizer instance (so a bulk sterilize job pulls it
# once and reuses across all 226 recipe parses).
self._catalog_cache: list[dict] | None = None
self._catalog_prompt: str | None = None
# --- public ------------------------------------------------------------- # --- public -------------------------------------------------------------
@ -343,6 +371,73 @@ class Sterilizer:
# --- private ------------------------------------------------------------ # --- private ------------------------------------------------------------
# --- canonical food catalog (Mealie is source of truth) ----------------
def _load_catalog(self) -> list[dict]:
"""Pull every food row from Mealie in one big request. The user's
session token scopes to their group, so this spans every household
the user can see fine, we want Sonnet to know all canonical
names. Cached on the instance after first call.
We use the underlying _get directly (not list_foods) so we can
also pass a page param if a per_page=5000 doesn't return everything
in one shot."""
if self._catalog_cache is not None:
return self._catalog_cache
out: list[dict] = []
page = 1
while page <= 20: # defensive ceiling
resp = self.mealie._get(
"/api/foods", search="", perPage=2000, page=page
)
items = resp.get("items") or resp.get("data") or []
for item in items:
out.append(item)
tp = resp.get("total_pages") or resp.get("totalPages") or 1
if not items or page >= tp:
break
page += 1
self._catalog_cache = out
return out
def _catalog_for_prompt(self) -> str:
"""Render the catalog as a bullet list for the system prompt.
Cached on the instance so we don't rebuild this per batch."""
if self._catalog_prompt is not None:
return self._catalog_prompt
items = self._load_catalog()
lines: list[str] = []
for it in items:
name = (it.get("name") or "").strip()
if not name:
continue
plural = (it.get("pluralName") or "").strip()
aliases = it.get("aliases") or []
# Aliases on Mealie can be a list of strings or a list of
# {name, foodId} dicts depending on version. Normalize.
alias_names: list[str] = []
for a in aliases:
if isinstance(a, str) and a.strip():
alias_names.append(a.strip())
elif isinstance(a, dict):
n = (a.get("name") or "").strip()
if n:
alias_names.append(n)
line = f"{name}"
if plural and plural.lower() != name.lower():
line += f" (plural: {plural})"
if alias_names:
line += f" [aliases: {', '.join(alias_names)}]"
lines.append(line)
self._catalog_prompt = "\n".join(lines)
return self._catalog_prompt
def _system_prompt(self) -> str:
"""Build the full STERILIZE_SYSTEM prompt with the catalog spliced in."""
return STERILIZE_SYSTEM_TEMPLATE.format(foods=self._catalog_for_prompt())
# --- per-batch Sonnet call ---------------------------------------------
def _parse_batch(self, strings: list[str]) -> list[list[IngredientParse]]: def _parse_batch(self, strings: list[str]) -> list[list[IngredientParse]]:
"""Returns list-of-lists matching the input length. Each inner list """Returns list-of-lists matching the input length. Each inner list
is the parses derived from one input string (1 in normal case, N is the parses derived from one input string (1 in normal case, N
@ -352,8 +447,8 @@ class Sterilizer:
resp = self.forge.run( resp = self.forge.run(
prompt=prompt, prompt=prompt,
model=self.model, model=self.model,
system=STERILIZE_SYSTEM, system=self._system_prompt(),
timeout_secs=120, timeout_secs=180,
) )
except ForgeError as e: except ForgeError as e:
raise RuntimeError(f"clawdforge failed: {e}") from e raise RuntimeError(f"clawdforge failed: {e}") from e
@ -389,6 +484,7 @@ class Sterilizer:
food=_clean_str(it.get("food")), food=_clean_str(it.get("food")),
note=_clean_str(it.get("note")), note=_clean_str(it.get("note")),
approx=bool(it.get("approx")), approx=bool(it.get("approx")),
is_new_food=bool(it.get("is_new_food")),
) )
) )
if not inner: if not inner: