cauldron/cauldron/data/README.md
Kayos edf679504d v0.3 step 1: foods schema + USDA SR Legacy density seed
Phase A foundation. Cobb 2026-04-29: 'go big or go home' on density-table
aggregator — this commit lands the schema + seed data so the aggregator
engine has something to look up against in step 2.

DB:
- migration 010: cauldron_foods (canonical_name PK, density_g_per_ml,
  default_unit_class enum mass/volume/count/mixed, common_size_g,
  category, usda_fdc_id, source enum)
- migration 011: cauldron_food_mapping (per-household Mealie food_id →
  cauldron canonical food_id, used by aggregator + foods-dedupe later)

Seed data:
- scripts/build_foods_seed.py — extractor that walks USDA SR Legacy
  foodPortions, derives density g/ml from cup/tbsp/tsp/fl-oz/ml/etc
  measurements (handles SR Legacy's quirk of putting unit in 'modifier'
  with measureUnit.name='undetermined'), filters out babyfood / branded
  / fast-food / alcoholic-beverage clutter, normalizes names, categorizes
  via longest-keyword-wins
- cauldron/data/foods_seed_usda.json — 2,462 foods with density values
  derived from USDA. 636KB, ships in the image.
- cauldron/data/README.md — regen instructions + known issues / iteration
  plan (next pass: claude-curated cleanup → ~500-800 high-relevance entries
  + count-based foods like egg/onion that USDA doesn't cover)

Loader (cauldron/foods.py):
- load_seed_if_empty(db) called on app startup right after migrate().
  Idempotent — won't reload if table is non-empty.
- reload_seed(db) for forced reloads (INSERT IGNORE).
- search_food(db, name) helper for the aggregator + UI.

Categories present in seed:
  produce-vegetable: 300, spice: 256, dairy: 207, condiment: 197,
  legume: 189, meat: 166, beverage: 153, baking: 129, produce-fruit: 128,
  oil-fat: 126, nut-seed: 115, grain: 89, other: 407

The 407 'other' bucket and the verbose USDA names ('mayonnaise, reduced
fat, with olive oil') will get cleaned up via clawdforge in step 3.
For now the aggregator can already do the math against this seed; the
unit-conversion engine is the next commit.
2026-04-28 22:03:17 -07:00

cauldron/data — seed data shipped with the app

foods_seed_usda.json

Canonical foods catalog seeded from USDA SR Legacy (2018-04 release). Each entry has a density_g_per_ml value derived from USDA's foodPortions data: the gram weight reported for one cup, tablespoon, etc., divided by that measure's volume in millilitres.
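
As a rough illustration of that derivation, here is a minimal sketch of turning one foodPortions entry into a g/ml density. It is not the code in scripts/build_foods_seed.py. The function name, the conversion table, and the `amount` / `gramWeight` field names are assumptions about the USDA JSON layout, and the sample portion is made up.

```python
# Millilitres per measure (US customary). Assumed table for this sketch only.
ML_PER_UNIT = {
    "cup": 236.588,
    "tbsp": 14.787,
    "tsp": 4.929,
    "fl oz": 29.574,
    "ml": 1.0,
}

def density_from_portion(portion):
    """Return g/ml for one portion dict, or None if it is not a usable volume measure."""
    unit = (portion.get("measureUnit") or {}).get("name", "")
    # SR Legacy quirk: the unit name is 'undetermined' and the real unit sits in 'modifier'.
    if unit == "undetermined":
        unit = portion.get("modifier", "")
    ml_per_unit = ML_PER_UNIT.get(unit.strip().lower())
    amount = portion.get("amount") or 0       # assumed field name
    grams = portion.get("gramWeight") or 0    # assumed field name
    if ml_per_unit is None or amount <= 0 or grams <= 0:
        return None
    return grams / (amount * ml_per_unit)

# Hypothetical portion: 1 cup weighing 185 g.
example = {
    "amount": 1.0,
    "modifier": "cup",
    "gramWeight": 185.0,
    "measureUnit": {"name": "undetermined"},
}
print(round(density_from_portion(example), 3))
```

Portions whose unit is not a known volume (a "slice", a "piece") return None, which is why count-based foods need a separate source.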

The aggregator engine (Phase A step 2) uses these density values to combine "2 cups rice + 1.25 lb rice" into a single shopping-list line.
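
The arithmetic behind that example can be sketched as follows. This is only an illustration of the idea, not the aggregator itself: `to_grams` is a hypothetical helper, and the rice density is a made-up value rather than one taken from the seed.

```python
ML_PER_CUP = 236.588   # US customary cup, in millilitres
G_PER_LB = 453.59237   # avoirdupois pound, in grams

def to_grams(quantity, unit, density_g_per_ml):
    """Convert a quantity to grams; volume units need the food's density."""
    if unit == "cup":
        return quantity * ML_PER_CUP * density_g_per_ml
    if unit == "lb":
        return quantity * G_PER_LB
    if unit == "g":
        return quantity
    raise ValueError(f"unsupported unit: {unit}")

RICE_DENSITY = 0.78  # hypothetical g/ml, for illustration only

# "2 cups rice + 1.25 lb rice" collapses to one mass once both are in grams.
lines = [(2, "cup"), (1.25, "lb")]
total_g = sum(to_grams(q, u, RICE_DENSITY) for q, u in lines)
print(f"rice: {total_g:.0f} g")  # rice: 936 g
```

Normalizing to grams first keeps the sum independent of which unit each recipe happened to use; the result can be converted back to a display unit afterwards.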

Regenerate

If USDA ships a new SR Legacy dataset:

# 1. Download the new dataset from https://fdc.nal.usda.gov/download-datasets
#    Pick the "SR Legacy / Full Download — JSON" link.
# 2. Unzip somewhere local, e.g. /tmp/usda-sr.json
# 3. Re-run the extractor:
python3 scripts/build_foods_seed.py /tmp/usda-sr.json > cauldron/data/foods_seed_usda.json
# 4. Commit the resulting JSON

Loading

cauldron/foods.py runs load_seed_if_empty(db) on app boot, right after migrations. It loads the seed only when the cauldron_foods table is empty, so redeploying does not re-load it. For a manual reload (e.g. after updating the seed without dropping the table), call foods.reload_seed(db), which uses INSERT IGNORE keyed on canonical_name. Because INSERT IGNORE skips rows whose key already exists, a reload adds new entries but leaves existing rows unchanged.

Known issues / iteration

The USDA SR Legacy descriptions are verbose and laden with qualifiers and brand variants (e.g. "Apples, sulfured, stewed, with added sugar"). Our normalization is a heuristic, so expect roughly 15% of entries to have suboptimal canonical names. The Phase A step 3 plan is to feed the seed through clawdforge → Sonnet to:

  1. Drop entries that aren't useful cooking foods
  2. Normalize names (drop ", raw", merge brand variants)
  3. Add count-based foods USDA doesn't cover (e.g. "egg", "onion" in count form)
  4. Curate down to ~500-800 high-relevance foods

Until that step lands, you may need to search the canonical_name field manually to find the entry you want, although the aggregator's fuzzy matching covers most cases.
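
For manual lookups against verbose names, one possible approach is sketched below. It is not the aggregator's actual matching logic, which this README does not describe. `search_names` is a hypothetical helper that ranks substring hits first and otherwise compares the query with the part of the name before the first comma, using the standard library's difflib.

```python
from difflib import SequenceMatcher

def search_names(query, names, limit=3, cutoff=0.6):
    """Return up to `limit` canonical names that plausibly match `query`."""
    q = query.strip().lower()
    scored = []
    for name in names:
        lowered = name.lower()
        head = lowered.split(",")[0].strip()  # "mayonnaise" from "mayonnaise, reduced fat, ..."
        if q in lowered:
            score = 1.0  # substring hit outranks any fuzzy score
        else:
            score = SequenceMatcher(None, q, head).ratio()
        if score >= cutoff:
            scored.append((score, name))
    # Best score first; among ties, prefer the shorter (less qualified) name.
    scored.sort(key=lambda pair: (-pair[0], len(pair[1])))
    return [name for _, name in scored[:limit]]

names = [
    "rice, white, long-grain, regular",
    "rice, brown, long-grain",
    "mayonnaise, reduced fat, with olive oil",
    "olive oil",
]
print(search_names("mayonaise", names))  # misspelled query still finds the mayonnaise entry
print(search_names("olive oil", names))  # the plain "olive oil" entry ranks first
```

Comparing against the leading segment keeps a short query from being penalized by the long tail of qualifiers in the USDA-style names.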