discover: normalize source_url trailing slash before insert
Same recipe URL with vs without trailing slash was producing duplicate
discover corpus rows because UNIQUE(source_url) is byte-exact:

  https://www.tasteofhome.com/recipes/falafel  → id 7
  https://www.tasteofhome.com/recipes/falafel/ → id 3 (manually pasted)

Caught 2026-05-02 when Cobb pasted his first 4 with trailing slashes,
then a follow-up listing-page extractor stripped them, producing 1:1
dupes.

rstrip('/') in insert_discovered_recipe normalizes at the persistence
layer so all callers get the dedup for free.

Existing data manually fixed: deleted dupes 7,8,5; stripped trailing
slashes off rows 3,4,6 to canonical form. Corpus now clean (4 rows).
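As a rough illustration of what `rstrip('/')` does to the URLs above, here is a minimal sketch. The helper name `normalize_source_url` and the query-string URL are hypothetical, introduced only for this example; the commit itself does the strip inline in `insert_discovered_recipe`.

```python
def normalize_source_url(source_url: str) -> str:
    """Hypothetical helper: strip trailing slashes, mirroring the inline rstrip."""
    return source_url.rstrip("/")


with_slash = "https://www.tasteofhome.com/recipes/falafel/"
without_slash = "https://www.tasteofhome.com/recipes/falafel"

# Both spellings collapse to the same string, so they share one UNIQUE key.
assert normalize_source_url(with_slash) == without_slash
assert normalize_source_url(without_slash) == without_slash

# rstrip removes every trailing slash, not just one.
assert normalize_source_url(without_slash + "//") == without_slash

# A slash in front of a query string is not trailing, so it is left alone
# (illustrative URL, not taken from the corpus).
query_url = "https://www.tasteofhome.com/recipes/falafel/?page=2"
assert normalize_source_url(query_url) == query_url
```

The last case suggests that URLs differing only by a slash before a query string or fragment would presumably still be stored as separate rows under this normalization.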
parent 2a357b2acd
commit ed0894ddca
1 changed file with 12 additions and 1 deletion
@@ -2251,7 +2251,18 @@ class DB:
     ) -> int | None:
         """INSERT a freshly-scraped recipe in 'raw' state. Returns the new
         row id, or None if the source_url was already present (UNIQUE
-        violation = duplicate scrape, treat as skip)."""
+        violation = duplicate scrape, treat as skip).
+
+        Normalizes source_url by stripping trailing slashes so that
+        `.../recipes/falafel` and `.../recipes/falafel/` map to the same
+        UNIQUE key. 2026-05-02: caught when manual `/discover` paste
+        included trailing slash but listing-page extractor stripped it,
+        producing 1:1 duplicates."""
+        # URL canonicalization — single rstrip is safe for recipe paths
+        # (they always have a non-slash terminal segment; `https://host/`
+        # alone wouldn't be a valid recipe URL anyway).
+        if source_url.endswith("/"):
+            source_url = source_url.rstrip("/")
         with self.conn() as c, c.cursor() as cur:
             cur.execute(
                 """INSERT IGNORE INTO cauldron_discovered_recipes
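The dedup behavior described in the docstring can be sketched end to end with an in-memory database. This is only an approximation under stated assumptions: it swaps in the standard-library `sqlite3` module and a caught `IntegrityError` in place of the MySQL-style `INSERT IGNORE` shown in the diff, and the table name `discovered`, its columns, and the `normalize` flag are hypothetical.

```python
import sqlite3


def insert_discovered(conn, source_url, normalize=True):
    """Insert a URL; return the new row id, or None if it is already present.

    Stand-in for insert_discovered_recipe: sqlite3 plus IntegrityError is
    used here instead of the original INSERT IGNORE (assumption).
    """
    if normalize:
        source_url = source_url.rstrip("/")
    try:
        cur = conn.execute(
            "INSERT INTO discovered (source_url) VALUES (?)", (source_url,)
        )
    except sqlite3.IntegrityError:
        # UNIQUE violation = duplicate scrape, treat as skip.
        return None
    conn.commit()
    return cur.lastrowid


def new_db():
    """Create a fresh in-memory database with a hypothetical table layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE discovered (id INTEGER PRIMARY KEY, source_url TEXT UNIQUE)"
    )
    return conn


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM discovered").fetchone()[0]


urls = [
    "https://www.tasteofhome.com/recipes/falafel/",
    "https://www.tasteofhome.com/recipes/falafel",
]

# Without normalization the byte-exact UNIQUE key admits both spellings.
raw_db = new_db()
raw_ids = [insert_discovered(raw_db, u, normalize=False) for u in urls]
raw_rows = row_count(raw_db)

# With normalization the second insert is skipped as a duplicate.
clean_db = new_db()
clean_ids = [insert_discovered(clean_db, u) for u in urls]
clean_rows = row_count(clean_db)
```

Under these assumptions, `raw_rows` ends up as 2 while `clean_rows` is 1 and the second entry of `clean_ids` is `None`, which matches the "treat as skip" contract in the docstring.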