v0.3 step 1: foods schema + USDA SR Legacy density seed
Phase A foundation. Cobb 2026-04-29: 'go big or go home' on density-table
aggregator — this commit lands the schema + seed data so the aggregator
engine has something to look up against in step 2.
DB:
- migration 010: cauldron_foods (id PK, unique canonical_name, density_g_per_ml,
default_unit_class enum mass/volume/count/mixed, common_size_g,
category, usda_fdc_id, source enum)
- migration 011: cauldron_food_mapping (per-household Mealie food_id →
cauldron canonical food_id, used by aggregator + foods-dedupe later)
Seed data:
- scripts/build_foods_seed.py — extractor that walks USDA SR Legacy
foodPortions, derives density g/ml from cup/tbsp/tsp/fl-oz/ml/etc
measurements (handles SR Legacy's quirk of putting unit in 'modifier'
with measureUnit.name='undetermined'), filters out babyfood / branded
/ fast-food / alcoholic-beverage clutter, normalizes names, categorizes
via longest-keyword-wins
- cauldron/data/foods_seed_usda.json — 2,462 foods with density values
derived from USDA. 636KB, ships in the image.
- cauldron/data/README.md — regen instructions + known issues / iteration
plan (next pass: claude-curated cleanup → ~500-800 high-relevance entries
+ count-based foods like egg/onion that USDA doesn't cover)
Loader (cauldron/foods.py):
- load_seed_if_empty(db) called on app startup right after migrate().
Idempotent — won't reload if table is non-empty.
- reload_seed(db) for forced reloads (INSERT IGNORE).
- search_food(db, name) helper for the aggregator + UI.
Categories present in seed:
produce-vegetable: 300, spice: 256, dairy: 207, condiment: 197,
legume: 189, meat: 166, beverage: 153, baking: 129, produce-fruit: 128,
oil-fat: 126, nut-seed: 115, grain: 89, other: 407
The 407 'other' bucket and the verbose USDA names ('mayonnaise, reduced
fat, with olive oil') will get cleaned up via clawdforge in step 3.
The seed already carries what the aggregator needs for the math; the
unit-conversion engine is the next commit.
This commit is contained in:
parent
c7ee84d70a
commit
edf679504d
6 changed files with 20223 additions and 0 deletions
46
cauldron/data/README.md
Normal file
@@ -0,0 +1,46 @@
# cauldron/data: seed data shipped with the app

## foods_seed_usda.json

Canonical foods catalog seeded from USDA SR Legacy (2018-04 release).
Each entry has a derived `density_g_per_ml` from USDA's `foodPortions`
data: the quantity-of-grams reported for one cup / tablespoon / etc.
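
As a rough illustration of that derivation, the sketch below divides a portion's gram weight by the volume of its unit. `ML_PER_UNIT` and `density_from_portion` are names invented for this example, and the 185 g figure is a made-up sample, not a USDA value; the real logic lives in `scripts/build_foods_seed.py`.

```python
# Illustrative only: these names are hypothetical, not part of cauldron.
ML_PER_UNIT = {"cup": 236.588, "tablespoon": 14.787, "teaspoon": 4.929}


def density_from_portion(gram_weight: float, unit: str, amount: float = 1.0) -> float:
    """Return grams per millilitre for one USDA-style portion."""
    ml = ML_PER_UNIT[unit] * amount
    return round(gram_weight / ml, 3)


# "1 cup" weighing 185 g (sample number chosen for the example).
sample_density = density_from_portion(185.0, "cup")
print(sample_density)
```

Rounding to three decimals matches the precision the extractor emits and the `DECIMAL(6,3)` column it is loaded into.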

The aggregator engine (Phase A step 2) uses these density values to
combine "2 cups rice + 1.25 lb rice" into a single shopping-list line.
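
The aggregator is not part of this commit, so the following is only a sketch of the arithmetic it would need. The helper names, the unit table, and the 0.78 g/ml density for rice are assumptions made for the example.

```python
# Hypothetical helpers; the real aggregator lands in Phase A step 2.
ML_PER_CUP = 236.588
G_PER_LB = 453.592


def to_grams(quantity: float, unit: str, density_g_per_ml: float) -> float:
    """Convert one ingredient line to grams, using density for volume units."""
    if unit == "cup":
        return quantity * ML_PER_CUP * density_g_per_ml
    if unit == "lb":
        return quantity * G_PER_LB
    if unit == "g":
        return quantity
    raise ValueError(f"unsupported unit: {unit}")


def combine(lines: list[tuple[float, str]], density_g_per_ml: float) -> float:
    """Sum several (quantity, unit) lines for the same food into grams."""
    return sum(to_grams(q, u, density_g_per_ml) for q, u in lines)


# "2 cups rice + 1.25 lb rice", with an assumed density of 0.78 g/ml.
total_g = combine([(2, "cup"), (1.25, "lb")], density_g_per_ml=0.78)
print(f"{total_g:.0f} g rice")
```

Converting everything to grams first keeps the sum independent of which unit each recipe happened to use.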

### Regenerate

If USDA ships a new SR Legacy dataset:

```sh
# 1. Download the new dataset from https://fdc.nal.usda.gov/download-datasets
#    Pick the "SR Legacy / Full Download - JSON" link.
# 2. Unzip somewhere local, e.g. /tmp/usda-sr.json
# 3. Re-run the extractor:
python3 scripts/build_foods_seed.py /tmp/usda-sr.json > cauldron/data/foods_seed_usda.json
# 4. Commit the resulting JSON
```
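
Before committing a regenerated file it may be worth a quick sanity check. The sketch below is not part of the repo: `summarize_seed` is a hypothetical helper that counts entries per category and flags densities outside the 0.1 to 3.0 g/ml window the extractor keeps. It runs on inline sample entries so it works without the real file.

```python
from collections import Counter


def summarize_seed(entries: list[dict]) -> tuple[Counter, list[str]]:
    """Return (category counts, names whose density is missing or out of range)."""
    counts = Counter(e.get("category") or "other" for e in entries)
    suspect = [
        e["canonical_name"]
        for e in entries
        if not isinstance(e.get("density_g_per_ml"), (int, float))
        or not 0.1 < e["density_g_per_ml"] < 3.0
    ]
    return counts, suspect


# Sample entries shaped like the seed JSON (values invented for the example).
sample = [
    {"canonical_name": "olive oil", "category": "oil-fat", "density_g_per_ml": 0.913},
    {"canonical_name": "mystery", "category": "other", "density_g_per_ml": 7.5},
]
counts, suspect = summarize_seed(sample)
print(dict(counts), suspect)
```

To check the real seed, pass the result of `json.load` on `cauldron/data/foods_seed_usda.json` instead of `sample`.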

### Loading

`cauldron/foods.py` runs `load_seed_if_empty(db)` on app boot and only loads
when the `cauldron_foods` table is empty, so redeploys do not re-load the seed.
For a manual reload (e.g. after updating the seed without dropping the
table), call `foods.reload_seed(db)`, which uses INSERT IGNORE on the
canonical_name unique key. Rows whose canonical_name already exists are left
untouched, so a reload adds new foods but does not update existing ones.
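
The reload behaviour can be tried out locally with the sketch below. It is an approximation under stated assumptions: SQLite's `INSERT OR IGNORE` stands in for MySQL's `INSERT IGNORE`, and the two-column `foods` table and the `load` helper are stripped-down stand-ins, not the real schema or loader.

```python
import sqlite3


def row_count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM foods").fetchone()[0]


def load(conn: sqlite3.Connection, entries: list[dict]) -> int:
    """Insert entries, skipping names that already exist. Returns rows added."""
    before = row_count(conn)
    conn.executemany(
        "INSERT OR IGNORE INTO foods (canonical_name, density_g_per_ml) VALUES (?, ?)",
        [(e["canonical_name"], e["density_g_per_ml"]) for e in entries],
    )
    conn.commit()
    return row_count(conn) - before


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foods (canonical_name TEXT UNIQUE, density_g_per_ml REAL)")

seed = [{"canonical_name": "rice", "density_g_per_ml": 0.78}]
first = load(conn, seed)

# "Updated" seed: rice changes density, butter is new.
seed = [
    {"canonical_name": "rice", "density_g_per_ml": 0.85},
    {"canonical_name": "butter", "density_g_per_ml": 0.96},
]
second = load(conn, seed)
rice = conn.execute(
    "SELECT density_g_per_ml FROM foods WHERE canonical_name = 'rice'"
).fetchone()[0]
print(first, second, rice)
```

The second load adds only the new name and leaves the existing row's density alone, which is why a changed density needs an explicit update or a table reset.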

### Known issues / iteration

The USDA SR Legacy descriptions are verbose and brand-laden ("Apples,
sulfured, stewed, with added sugar"). Our normalization is a heuristic;
expect ~15% of entries to have suboptimal canonical names. The Phase A
step 3 plan is to feed the seed through clawdforge → Sonnet to:

1. Drop entries that aren't useful cooking foods
2. Normalize names (drop ", raw", merge brand variants)
3. Add count-based foods USDA doesn't cover (e.g. "egg", "onion" in count form)
4. Curate down to ~500-800 high-relevance foods

Until that step lands, expect to manually search the canonical_name field
to find what you want; the aggregator's fuzzy matching covers most of it.
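
For a feel of how name lookups are ordered, here is a pure-Python sketch that mirrors the ranking in `search_food`'s SQL in `cauldron/foods.py`: exact match first, then prefix match, then any substring match, with shorter names winning inside each tier. `rank_matches` is a hypothetical helper, and lowercasing both sides is an assumption standing in for a case-insensitive database collation.

```python
def rank_matches(names: list[str], query: str, limit: int = 5) -> list[str]:
    """Order substring matches: exact, then prefix, then the rest; shorter first."""
    q = query.lower()
    hits = [n for n in names if q in n.lower()]

    def sort_key(name: str) -> tuple[int, int]:
        low = name.lower()
        tier = 0 if low == q else 1 if low.startswith(q) else 2
        return (tier, len(name))

    return sorted(hits, key=sort_key)[:limit]


names = ["brown rice", "rice", "rice flour", "licorice", "butter"]
ranked = rank_matches(names, "rice")
print(ranked)
```

Note that plain substring matching also surfaces "licorice" for "rice", which is the kind of noise to expect when searching the uncurated seed.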
19698
cauldron/data/foods_seed_usda.json
Normal file
File diff suppressed because it is too large
@@ -146,6 +146,46 @@ MIGRATIONS = [
        FOREIGN KEY (household_id) REFERENCES cauldron_households(id) ON DELETE CASCADE
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
    """,
    # 010: canonical foods table for the unit-aware aggregator. Each row is
    # ONE food (e.g. "rice", "butter", "onion") with density + unit class.
    # Seeded from USDA SR Legacy via scripts/build_foods_seed.py; will be
    # extended with claude-curated entries in v0.3 step 3.
    """
    CREATE TABLE IF NOT EXISTS cauldron_foods (
        id BIGINT PRIMARY KEY AUTO_INCREMENT,
        canonical_name VARCHAR(255) NOT NULL,
        plural_name VARCHAR(255),
        category VARCHAR(64),
        density_g_per_ml DECIMAL(6,3),
        common_size_g DECIMAL(8,2),
        default_unit_class ENUM('mass','volume','count','mixed') NOT NULL DEFAULT 'mass',
        usda_fdc_id INT,
        usda_description VARCHAR(500),
        notes JSON,
        source ENUM('usda','claude','manual') NOT NULL DEFAULT 'usda',
        last_updated DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
        UNIQUE KEY uk_canonical (canonical_name),
        INDEX idx_category (category),
        INDEX idx_usda (usda_fdc_id)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
    """,
    # 011: Mealie food_id → cauldron food_id mapping per household. The
    # foods dedupe step (v0.3 A2) populates this. Aggregator joins through
    # this to group ingredients across recipes by canonical food.
    """
    CREATE TABLE IF NOT EXISTS cauldron_food_mapping (
        household_id BIGINT NOT NULL,
        mealie_food_id VARCHAR(64) NOT NULL,
        cauldron_food_id BIGINT NOT NULL,
        confidence DECIMAL(4,2) NOT NULL DEFAULT 1.00,
        mapped_by ENUM('exact','fuzzy','claude','manual') NOT NULL DEFAULT 'fuzzy',
        mapped_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (household_id, mealie_food_id),
        INDEX idx_canonical (cauldron_food_id),
        FOREIGN KEY (household_id) REFERENCES cauldron_households(id) ON DELETE CASCADE,
        FOREIGN KEY (cauldron_food_id) REFERENCES cauldron_foods(id) ON DELETE CASCADE
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
    """,
]
99
cauldron/foods.py
Normal file
@@ -0,0 +1,99 @@
"""Foods catalog: canonical food rows + the seed loader.

Phase A step 1 (v0.3): seed cauldron_foods from USDA SR Legacy via the
JSON file at cauldron/data/foods_seed_usda.json. Idempotent: running
multiple times is fine, INSERT IGNORE on the unique canonical_name key.

Phase A step 2 (next commit): aggregator engine reads these rows + the
per-household cauldron_food_mapping to group recipe ingredients.

Phase A step 3 (later): claude-curated cleanup of the USDA seed (better
names, missing common foods, count-based foods like 'egg' / 'onion').
"""
import json
from pathlib import Path


SEED_PATH = Path(__file__).parent / "data" / "foods_seed_usda.json"


def seed_count(db) -> int:
    with db.conn() as c, c.cursor() as cur:
        cur.execute("SELECT COUNT(*) AS n FROM cauldron_foods")
        return cur.fetchone()["n"]


def load_seed_if_empty(db) -> int:
    """If cauldron_foods is empty, load the USDA seed JSON. Returns rows
    inserted (0 if already populated). Called by app startup after migrate."""
    if not SEED_PATH.exists():
        return 0
    if seed_count(db) > 0:
        return 0
    return _load_seed_file(db, SEED_PATH)


def reload_seed(db) -> int:
    """Force-reload the seed file (used by /api/foods/reload-seed). Won't
    overwrite existing rows: INSERT IGNORE on canonical_name. Returns
    rows inserted on this run."""
    if not SEED_PATH.exists():
        return 0
    return _load_seed_file(db, SEED_PATH)


def _load_seed_file(db, path: Path) -> int:
    # The seed is written with ensure_ascii=False, so read it as UTF-8
    # explicitly instead of relying on the platform default encoding.
    with path.open(encoding="utf-8") as f:
        data = json.load(f)
    inserted = 0
    with db.conn() as c, c.cursor() as cur:
        for entry in data:
            try:
                cur.execute(
                    """
                    INSERT IGNORE INTO cauldron_foods
                        (canonical_name, category, density_g_per_ml,
                         default_unit_class, usda_fdc_id, usda_description, source)
                    VALUES (%s, %s, %s, %s, %s, %s, 'usda')
                    """,
                    (
                        entry["canonical_name"][:255],
                        entry.get("category"),
                        entry.get("density_g_per_ml"),
                        entry.get("default_unit_class") or "mass",
                        entry.get("usda_fdc_id"),
                        (entry.get("usda_description") or "")[:500],
                    ),
                )
                # rowcount is 0 when INSERT IGNORE skipped an existing name
                inserted += cur.rowcount
            except Exception:
                # Skip malformed rows; seed cleanup is iterative
                continue
    return inserted


def search_food(db, name: str, *, limit: int = 5) -> list[dict]:
    """Best-effort canonical food lookup by name (used by aggregator + UI)."""
    with db.conn() as c, c.cursor() as cur:
        cur.execute(
            """
            SELECT id, canonical_name, category, density_g_per_ml,
                   default_unit_class, common_size_g
            FROM cauldron_foods
            WHERE canonical_name LIKE %s OR canonical_name LIKE %s
            ORDER BY
                (CASE WHEN canonical_name = %s THEN 0
                      WHEN canonical_name LIKE %s THEN 1
                      ELSE 2 END),
                CHAR_LENGTH(canonical_name)
            LIMIT %s
            """,
            (f"{name}%", f"%{name}%", name, f"{name}%", limit),
        )
        return [dict(r) for r in cur.fetchall()]


def get_food(db, food_id: int) -> dict | None:
    with db.conn() as c, c.cursor() as cur:
        cur.execute("SELECT * FROM cauldron_foods WHERE id=%s", (food_id,))
        return cur.fetchone()
@@ -30,6 +30,7 @@ from .config import load
from .crypto import TokenCrypto
from .db import DB
from .forge import Forge
from . import foods
from .mealie import Mealie, MealieError
from .oidc import init_oauth
from .recipe_index import flatten_recipe, refresh_household_index, search_index
@@ -65,6 +66,14 @@ def create_app() -> Flask:
    if applied:
        app.logger.info("applied migrations: %s", applied)

    # Seed cauldron_foods from the USDA snapshot if empty
    try:
        loaded = foods.load_seed_if_empty(db)
        if loaded:
            app.logger.info("loaded %d foods from USDA seed", loaded)
    except Exception as e:
        app.logger.warning("foods seed load failed: %s", e)

    oauth = init_oauth(
        app,
        issuer=cfg.oidc_issuer,
331
scripts/build_foods_seed.py
Normal file
@@ -0,0 +1,331 @@
#!/usr/bin/env python3
"""Build cauldron's foods_seed_usda.json from USDA SR Legacy.

Usage:
    python scripts/build_foods_seed.py <usda-sr-legacy.json> > cauldron/data/foods_seed_usda.json

Steps:
  1. Load SR Legacy JSON dump
  2. For each food, extract foodPortions and derive density g/ml from
     volume measurements (cup/tbsp/tsp/fl oz/ml/etc)
  3. Average densities across multiple portions of the same food
  4. Filter out non-cooking junk (branded items, weird stuff)
  5. Normalize description into a canonical_name (strip ", raw" suffixes,
     parenthetical brand names, etc.)
  6. Categorize using simple keyword heuristics
  7. Emit JSON ready for the cauldron_foods loader
"""
import json
import re
import sys
from collections import defaultdict


VOL_TO_ML = {
    'cup': 236.588,
    'tablespoon': 14.787, 'tbsp': 14.787,
    'teaspoon': 4.929, 'tsp': 4.929,
    'fl oz': 29.574, 'fluid ounce': 29.574, 'fluid ounces': 29.574,
    'ml': 1.0, 'milliliter': 1.0,
    'liter': 1000.0, 'l': 1000.0,
    'pint': 473.176, 'quart': 946.353, 'gallon': 3785.41,
}

# Description starts-with prefixes we drop entirely
DROP_PREFIXES = (
    "babyfood",
    "infant formula",
    "alcoholic beverage",
    "snacks,",
    "fast food",
    "restaurant",
    "school lunch",
    "puddings,",
    "frostings,",
    "candies,",
    "leavening agents,",  # these are baking but USDA names are weird
)

# Substrings anywhere → drop
DROP_KEYWORDS = [
    "fast food", "restaurant", "school lunch",
    "MCDONALDS", "BURGER KING", "KFC", "PIZZA HUT", "STARBUCKS",
    "SUBWAY", "TACO BELL", "WENDY'S", "DOMINOS", "PAPA JOHN",
    "CHICK-FIL-A", "POPEYES", "CHIPOTLE", "DENNY'S",
    "supplement", "weight control", "ready-to-drink", "ready to drink",
    "ready-to-eat", "muscle milk", "ENSURE", "BOOST",
    "nutrition bar", "meal replacement", "fortified",
    "sulfured", "dry mix", "frozen meal", "frozen dinner",
    "baby formula", "GERBER", "PILLSBURY", "KELLOGG",
    "QUAKER", "GENERAL MILLS", "POST", "BETTY CROCKER",
    "instant", "junior", "strained", "toddler",
    "(yield from", "from raw",
]

# Brand-like ALL-CAPS tokens to strip from description
BRAND_PATTERN = re.compile(r'\b[A-Z]{3,}(\s+[A-Z]{3,}|\s+[A-Z]{2,})*\b')

CATEGORY_MAP = [
    # (keyword in description.lower(), cauldron category)
    ("oil", "oil-fat"),
    ("butter", "oil-fat"),
    ("lard", "oil-fat"),
    ("shortening", "oil-fat"),
    ("flour", "baking"),
    ("sugar", "baking"),
    ("yeast", "baking"),
    ("baking powder", "baking"),
    ("baking soda", "baking"),
    ("vanilla", "baking"),
    ("cocoa", "baking"),
    ("chocolate", "baking"),
    ("salt", "spice"),
    ("pepper", "spice"),
    ("cinnamon", "spice"),
    ("paprika", "spice"),
    ("oregano", "spice"),
    ("basil", "spice"),
    ("thyme", "spice"),
    ("rosemary", "spice"),
    ("garlic powder", "spice"),
    ("onion powder", "spice"),
    ("cumin", "spice"),
    ("turmeric", "spice"),
    ("ginger", "spice"),
    ("milk", "dairy"),
    ("cream", "dairy"),
    ("yogurt", "dairy"),
    ("cheese", "dairy"),
    ("rice", "grain"),
    ("pasta", "grain"),
    ("noodle", "grain"),
    ("bread", "grain"),
    ("oats", "grain"),
    ("oatmeal", "grain"),
    ("quinoa", "grain"),
    ("barley", "grain"),
    ("couscous", "grain"),
    ("beans", "legume"),
    ("lentil", "legume"),
    ("chickpea", "legume"),
    ("garbanzo", "legume"),
    ("tofu", "legume"),
    ("tempeh", "legume"),
    ("almond", "nut-seed"),
    ("walnut", "nut-seed"),
    ("pecan", "nut-seed"),
    ("cashew", "nut-seed"),
    ("peanut", "nut-seed"),
    ("pistachio", "nut-seed"),
    ("hazelnut", "nut-seed"),
    ("seed", "nut-seed"),
    ("nut", "nut-seed"),
    ("beef", "meat"),
    ("pork", "meat"),
    ("chicken", "meat"),
    ("turkey", "meat"),
    ("lamb", "meat"),
    ("ham", "meat"),
    ("bacon", "meat"),
    ("sausage", "meat"),
    ("fish", "meat"),
    ("salmon", "meat"),
    ("tuna", "meat"),
    ("shrimp", "meat"),
    ("egg", "dairy"),  # close enough
    ("juice", "beverage"),
    ("water", "beverage"),
    ("tea", "beverage"),
    ("coffee", "beverage"),
    ("beer", "beverage"),
    ("wine", "beverage"),
    ("alcoholic", "beverage"),
    ("soda", "beverage"),
    ("vinegar", "condiment"),
    ("sauce", "condiment"),
    ("ketchup", "condiment"),
    ("mustard", "condiment"),
    ("mayonnaise", "condiment"),
    ("soy sauce", "condiment"),
    ("dressing", "condiment"),
    ("syrup", "condiment"),
    ("honey", "condiment"),
    ("jam", "condiment"),
    ("jelly", "condiment"),
    ("apple", "produce-fruit"),
    ("banana", "produce-fruit"),
    ("orange", "produce-fruit"),
    ("strawberry", "produce-fruit"),
    ("blueberry", "produce-fruit"),
    ("raspberry", "produce-fruit"),
    ("grape", "produce-fruit"),
    ("lemon", "produce-fruit"),
    ("lime", "produce-fruit"),
    ("pineapple", "produce-fruit"),
    ("mango", "produce-fruit"),
    ("watermelon", "produce-fruit"),
    ("cherry", "produce-fruit"),
    ("peach", "produce-fruit"),
    ("pear", "produce-fruit"),
    ("avocado", "produce-fruit"),
    ("tomato", "produce-vegetable"),  # we know
    ("onion", "produce-vegetable"),
    ("garlic", "produce-vegetable"),
    ("carrot", "produce-vegetable"),
    ("potato", "produce-vegetable"),
    ("spinach", "produce-vegetable"),
    ("lettuce", "produce-vegetable"),
    ("kale", "produce-vegetable"),
    ("broccoli", "produce-vegetable"),
    ("cauliflower", "produce-vegetable"),
    ("celery", "produce-vegetable"),
    ("cucumber", "produce-vegetable"),
    ("zucchini", "produce-vegetable"),
    ("pepper, sweet", "produce-vegetable"),
    ("pepper, bell", "produce-vegetable"),
    ("mushroom", "produce-vegetable"),
    ("squash", "produce-vegetable"),
    ("pumpkin", "produce-vegetable"),
    ("cabbage", "produce-vegetable"),
]


def categorize(name: str) -> str:
    """Match against the longest keyword first so 'soy sauce' beats 'sauce'
    and 'pepper, bell' beats 'pepper'. Score by keyword length."""
    n = name.lower()
    best = (None, 0)
    for kw, cat in CATEGORY_MAP:
        # Plain substring test: the longest matching keyword wins
        if kw in n and len(kw) > best[1]:
            best = (cat, len(kw))
    return best[0] or "other"


def normalize_name(desc: str) -> str:
    """Pull a canonical name out of the verbose USDA description."""
    s = desc
    # Strip everything after the first comma in many cases ("Salt, table" -> "Salt")
    # but keep useful descriptors ("Pepper, black, ground" -> "black pepper" via reorder)
    # First: drop preparation suffixes that don't matter for shopping
    s = re.sub(r',\s*(raw|cooked, boiled|cooked, drained|prepared|whole|ground|fresh|dried|granulated|all)(\s*,|$)', '', s, flags=re.I)
    # Drop branded all-caps tokens
    s = BRAND_PATTERN.sub('', s)
    # Drop parentheticals
    s = re.sub(r'\([^)]*\)', '', s)
    # Tidy whitespace
    s = re.sub(r'\s+', ' ', s).strip(', ').strip()
    # Reorder "X, Y" → "Y X" for spices/seasonings ("Pepper, black" → "black pepper")
    if ',' in s and not any(s.lower().startswith(p) for p in ('alcoholic', 'beverage', 'soup', 'sauce')):
        parts = [p.strip() for p in s.split(',') if p.strip()]
        if len(parts) == 2 and len(parts[1]) <= 25:
            s = f"{parts[1]} {parts[0]}"
    return s.lower().strip()


_MODIFIER_VOL = re.compile(
    r'^(?:[\d./\s]*\s*)?(cup|tablespoon|tbsp|teaspoon|tsp|fl oz|fluid ounce|fluid ounces|ml|milliliter|liter|pint|quart|gallon)\b',
    re.I,
)
_MODIFIER_NORMALIZE = {
    'tbsp': 'tablespoon',
    'tsp': 'teaspoon',
    'fluid ounce': 'fl oz',
    'fluid ounces': 'fl oz',
    'milliliter': 'ml',
    'liter': 'liter',
}


def _modifier_to_unit(modifier: str) -> str | None:
    """Pull a known volume unit out of a USDA modifier string. Handles
    'cup', 'cup (8 fl oz)', 'cup, chopped', 'tablespoon', etc."""
    m = _MODIFIER_VOL.match((modifier or '').strip().lower())
    if not m:
        return None
    raw = m.group(1).lower()
    return _MODIFIER_NORMALIZE.get(raw, raw)


def derive_densities(food: dict) -> list[float]:
    """Return list of derived g/ml density values from this food's portions.

    SR Legacy puts the actual unit in `modifier` (not measureUnit.name,
    which is almost always 'undetermined'). We parse the modifier with a
    regex tolerant of garnish phrases ('cup, chopped', 'cup (8 fl oz)')."""
    out = []
    for p in (food.get('foodPortions') or []):
        gw = p.get('gramWeight')
        if not gw or gw <= 0:
            continue
        amount = p.get('amount') or 1
        unit_name = ((p.get('measureUnit') or {}).get('name') or '').lower().strip()
        modifier = p.get('modifier') or ''
        unit = unit_name if unit_name in VOL_TO_ML else _modifier_to_unit(modifier)
        if unit not in VOL_TO_ML:
            continue
        ml = VOL_TO_ML[unit] * amount
        if ml > 0:
            density = gw / ml
            if 0.1 < density < 3.0:
                out.append(density)
    return out


def main():
    src = sys.argv[1]
    # USDA ships the dump as UTF-8; don't depend on the locale default
    with open(src, encoding='utf-8') as f:
        data = json.load(f)
    foods = data.get('SRLegacyFoods') or data.get('FoundationFoods') or []

    out = []
    seen_canonical = {}
    for f in foods:
        desc = f.get('description') or ''
        if not desc:
            continue
        # Drop junk by prefix
        d_low = desc.lower()
        if any(d_low.startswith(p) for p in DROP_PREFIXES):
            continue
        # Drop junk by substring
        if any(kw.lower() in d_low for kw in DROP_KEYWORDS):
            continue
        densities = derive_densities(f)
        if not densities:
            continue
        avg = round(sum(densities) / len(densities), 3)

        canonical = normalize_name(desc)
        if not canonical or len(canonical) > 80:
            continue
        # If we've already seen this canonical name, fold this food's average
        # into the existing entry's running mean instead of adding a duplicate
        if canonical in seen_canonical:
            existing = seen_canonical[canonical]
            existing['density_samples'].append(avg)
            existing['density_g_per_ml'] = round(
                sum(existing['density_samples']) / len(existing['density_samples']), 3
            )
            continue
        seen_canonical[canonical] = {
            'canonical_name': canonical,
            'category': categorize(canonical),
            'density_g_per_ml': avg,
            'default_unit_class': 'volume' if avg < 1.05 else 'mass',
            'usda_fdc_id': f.get('fdcId'),
            'usda_description': desc,
            'density_samples': [avg],
        }

    # Drop the working sample list before serializing
    final = []
    for v in seen_canonical.values():
        v.pop('density_samples', None)
        final.append(v)

    final.sort(key=lambda x: x['canonical_name'])
    json.dump(final, sys.stdout, indent=2, ensure_ascii=False)
    print(f'\n# {len(final)} foods', file=sys.stderr)


if __name__ == '__main__':
    main()