discover: normalize source_url trailing slash before insert

Same recipe URL with vs without trailing slash was producing duplicate
discover corpus rows because UNIQUE(source_url) is byte-exact:
  https://www.tasteofhome.com/recipes/falafel  → id 7
  https://www.tasteofhome.com/recipes/falafel/ → id 3 (manually pasted)

Caught 2026-05-02 when Cobb pasted his first 4 with trailing slashes,
then a follow-up listing-page extractor stripped them, producing 1:1
dupes. rstrip('/') in insert_discovered_recipe normalizes at the
persistence layer so all callers get the dedup for free.

Existing data manually fixed: deleted dupes 7,8,5; stripped trailing
slashes off rows 3,4,6 to canonical form. Corpus now clean (4 rows).
This commit is contained in:
Kayos 2026-05-02 14:12:36 -07:00
parent 2a357b2acd
commit ed0894ddca

View file

@ -2251,7 +2251,18 @@ class DB:
) -> int | None:
"""INSERT a freshly-scraped recipe in 'raw' state. Returns the new
row id, or None if the source_url was already present (UNIQUE
violation = duplicate scrape, treat as skip)."""
violation = duplicate scrape, treat as skip).
Normalizes source_url by stripping trailing slashes so that
`.../recipes/falafel` and `.../recipes/falafel/` map to the same
UNIQUE key. 2026-05-02: caught when manual `/discover` paste
included trailing slash but listing-page extractor stripped it,
producing 1:1 duplicates."""
# URL canonicalization — single rstrip is safe for recipe paths
# (they always have a non-slash terminal segment; `https://host/`
# alone wouldn't be a valid recipe URL anyway).
if source_url.endswith("/"):
source_url = source_url.rstrip("/")
with self.conn() as c, c.cursor() as cur:
cur.execute(
"""INSERT IGNORE INTO cauldron_discovered_recipes