Phase A foundation. Cobb 2026-04-29: 'go big or go home' on density-table
aggregator — this commit lands the schema + seed data so the aggregator
engine has something to look up against in step 2.
DB:
- migration 010: cauldron_foods (canonical_name PK, density_g_per_ml,
default_unit_class enum mass/volume/count/mixed, common_size_g,
category, usda_fdc_id, source enum)
- migration 011: cauldron_food_mapping (per-household Mealie food_id →
cauldron canonical food_id, used by aggregator + foods-dedupe later)
Seed data:
- scripts/build_foods_seed.py — extractor that walks USDA SR Legacy
foodPortions, derives density g/ml from cup/tbsp/tsp/fl-oz/ml/etc
measurements (handles SR Legacy's quirk of putting unit in 'modifier'
with measureUnit.name='undetermined'), filters out babyfood / branded
/ fast-food / alcoholic-beverage clutter, normalizes names, categorizes
via longest-keyword-wins
- cauldron/data/foods_seed_usda.json — 2,462 foods with density values
derived from USDA. 636KB, ships in the image.
- cauldron/data/README.md — regen instructions + known issues / iteration
plan (next pass: claude-curated cleanup → ~500-800 high-relevance entries
+ count-based foods like egg/onion that USDA doesn't cover)
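A rough sketch of the two extractor steps described above: deriving density from a single portion, and longest-keyword-wins categorization. This is not the code in scripts/build_foods_seed.py. The portion field names `amount` and `gramWeight`, the unit table, and the keyword map are assumptions made for illustration; only the `modifier` / `measureUnit.name='undetermined'` quirk comes from the notes above.

```python
from typing import Optional

# Millilitres per US volume unit; keys are lowercase unit labels (assumed set).
ML_PER_UNIT = {
    "cup": 236.588, "tbsp": 14.787, "tablespoon": 14.787,
    "tsp": 4.929, "teaspoon": 4.929, "fl oz": 29.574, "ml": 1.0,
}

def portion_unit(portion: dict) -> str:
    """Unit label for a portion; falls back to 'modifier' when
    measureUnit.name is 'undetermined' (the SR Legacy quirk)."""
    name = (portion.get("measureUnit") or {}).get("name", "")
    if name and name != "undetermined":
        return name.strip().lower()
    modifier = portion.get("modifier") or ""
    # "cup, chopped" -> "cup"
    return modifier.split(",")[0].strip().lower()

def density_g_per_ml(portion: dict) -> Optional[float]:
    """Grams per millilitre, or None when the portion is not a known volume unit."""
    ml = ML_PER_UNIT.get(portion_unit(portion))
    amount = portion.get("amount") or 0      # assumed field name
    grams = portion.get("gramWeight") or 0   # assumed field name
    if ml is None or amount <= 0 or grams <= 0:
        return None
    return grams / (amount * ml)

# Hypothetical keyword map; the real extractor's keywords are not shown here.
CATEGORY_KEYWORDS = {
    "butter": "dairy",
    "peanut butter": "nut-seed",
    "oil": "oil-fat",
}

def categorize(name: str) -> str:
    """Pick the category of the longest keyword found in the name."""
    lowered = name.lower()
    hits = [kw for kw in CATEGORY_KEYWORDS if kw in lowered]
    if not hits:
        return "other"
    return CATEGORY_KEYWORDS[max(hits, key=len)]
```

With this map, "Peanut butter, smooth" matches both "butter" and "peanut butter"; the longer keyword wins, so it lands in nut-seed rather than dairy.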
Loader (cauldron/foods.py):
- load_seed_if_empty(db) called on app startup right after migrate().
Idempotent — won't reload if table is non-empty.
- reload_seed(db) for forced reloads (INSERT IGNORE).
- search_food(db, name) helper for the aggregator + UI.
Categories present in seed:
produce-vegetable: 300, spice: 256, dairy: 207, condiment: 197,
legume: 189, meat: 166, beverage: 153, baking: 129, produce-fruit: 128,
oil-fat: 126, nut-seed: 115, grain: 89, other: 407
The 407 'other' bucket and the verbose USDA names ('mayonnaise, reduced
fat, with olive oil') will get cleaned up via clawdforge in step 3.
For now the aggregator can already do the math against this seed; the
unit-conversion engine is the next commit.
cauldron/data — seed data shipped with the app
foods_seed_usda.json
Canonical foods catalog seeded from USDA SR Legacy (2018-04 release).
Each entry has a density_g_per_ml derived from USDA's foodPortions
data, i.e. the gram weight USDA reports for one cup, tablespoon, or similar volume measure.
The aggregator engine (Phase A step 2) uses these density values to combine "2 cups rice + 1.25 lb rice" into a single shopping-list line.
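A minimal sketch of the arithmetic behind that example, not the aggregator engine itself. The density used for rice (0.78 g/ml) is an illustrative assumption rather than a value read from the seed, and `to_grams` is a hypothetical helper name.

```python
ML_PER_CUP = 236.588   # US customary cup
G_PER_LB = 453.592     # avoirdupois pound

def to_grams(quantity: float, unit: str, density_g_per_ml: float) -> float:
    """Convert one ingredient line to grams.

    Volume units go through the food's density; mass units convert directly.
    Only the two units needed for the example are handled.
    """
    if unit == "cup":
        return quantity * ML_PER_CUP * density_g_per_ml
    if unit == "lb":
        return quantity * G_PER_LB
    raise ValueError(f"unsupported unit: {unit}")

RICE_DENSITY = 0.78  # assumed, for illustration only

lines = [(2.0, "cup"), (1.25, "lb")]
total_g = sum(to_grams(qty, unit, RICE_DENSITY) for qty, unit in lines)
print(f"rice: {total_g:.0f} g")
```

Under the assumed density, the two lines collapse to a single mass of roughly 936 g, which can then be shown in whichever unit the shopping list prefers.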
Regenerate
If USDA ships a new SR Legacy dataset:
# 1. Download the new dataset from https://fdc.nal.usda.gov/download-datasets
# Pick the "SR Legacy / Full Download — JSON" link.
# 2. Unzip somewhere local, e.g. /tmp/usda-sr.json
# 3. Re-run the extractor:
python3 scripts/build_foods_seed.py /tmp/usda-sr.json > cauldron/data/foods_seed_usda.json
# 4. Commit the resulting JSON
Loading
cauldron/foods.py runs load_seed_if_empty(db) on app boot — only loads
when the cauldron_foods table is empty. Safe to redeploy without re-loading.
For a manual reload (e.g. after updating the seed without dropping the
table), call foods.reload_seed(db), which uses INSERT IGNORE keyed on
canonical_name. Rows whose canonical_name already exists are skipped, not updated.
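The load-once and forced-reload behaviour can be sketched as follows. This is an illustration, not the contents of cauldron/foods.py: it uses Python's built-in sqlite3 with `INSERT OR IGNORE` (SQLite's spelling of the same idea) so it runs standalone, a two-column stand-in for the cauldron_foods table, and an assumed seed shape of a list of dicts.

```python
import sqlite3

def reload_seed(conn: sqlite3.Connection, rows: list) -> None:
    """Insert seed rows, skipping any canonical_name that already exists."""
    conn.executemany(
        "INSERT OR IGNORE INTO cauldron_foods (canonical_name, density_g_per_ml) "
        "VALUES (?, ?)",
        [(r["canonical_name"], r["density_g_per_ml"]) for r in rows],
    )
    conn.commit()

def load_seed_if_empty(conn: sqlite3.Connection, rows: list) -> bool:
    """Load the seed only when the table has no rows. Returns True if it loaded."""
    (count,) = conn.execute("SELECT COUNT(*) FROM cauldron_foods").fetchone()
    if count:
        return False
    reload_seed(conn, rows)
    return True

def make_db() -> sqlite3.Connection:
    """In-memory database with a simplified stand-in schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cauldron_foods ("
        "canonical_name TEXT PRIMARY KEY, density_g_per_ml REAL)"
    )
    return conn
```

Because conflicting rows are ignored, a reload only adds names that are new; changing the density of an existing entry in the seed would still require deleting that row (or the table) first.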
Known issues / iteration
The USDA SR Legacy descriptions are verbose and sometimes brand-laden (a typical verbose example: "Apples, sulfured, stewed, with added sugar"). Our normalization is a heuristic; expect roughly 15% of entries to have suboptimal canonical names. The Phase A step 3 plan is to feed the seed through clawdforge → Sonnet to:
- Drop entries that aren't useful cooking foods
- Normalize names (drop ", raw", merge brand variants)
- Add count-based foods USDA doesn't cover (e.g. "egg", "onion" in count form)
- Curate down to ~500-800 high-relevance foods
Until that step lands, expect to search the canonical_name field manually to find the entry you want; the aggregator's fuzzy matching covers most cases.
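For manual lookups in the meantime, a small helper along these lines may be handy. It is a sketch built on the standard library's difflib and makes no claim about how the aggregator's own fuzzy matching or search_food works; `suggest_canonical_names` is a hypothetical name.

```python
import difflib

def suggest_canonical_names(query: str, names: list, limit: int = 5) -> list:
    """Return up to `limit` candidate names: substring hits first, then
    close spellings found by difflib.get_close_matches."""
    q = query.strip().lower()
    substring_hits = [n for n in names if q in n.lower()]
    fuzzy_hits = difflib.get_close_matches(q, names, n=limit, cutoff=0.6)
    seen = set()
    ordered = []
    for name in substring_hits + fuzzy_hits:
        if name not in seen:
            seen.add(name)
            ordered.append(name)
    return ordered[:limit]
```

Substring hits come first because verbose USDA-style names usually contain the plain ingredient word somewhere; the difflib pass then catches simple misspellings of short names.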