19 commits

Author SHA1 Message Date
75bc7dc6bf Replace hand-rolled urlencoded_decode with url::form_urlencoded::parse
The previous decoder treated each %XX as an isolated code point via
`out.push(v as char)`. For UTF-8 multi-byte sequences (e.g. %E2%9C%93
for ✓) that produced three garbage chars at U+00E2 / U+009C / U+0093
instead of the proper U+2713. YT cipher strings are typically ASCII-
only so this was latent, but the function was named generically and
nothing in the type system prevented a non-ASCII input from reaching it.

`url::form_urlencoded::parse` is the canonical &-separated query-pair
parser — handles %-decode as UTF-8, handles + → space, and the url
crate is already a transitive dep. parse_cipher_string collapses to
one line; the bespoke 20-line decoder goes.
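
The failure mode described above can be reproduced with a minimal std-only sketch. The helper names below are hypothetical and this is not the `url` crate's implementation; it only contrasts widening each decoded byte to a `char` with decoding the collected bytes as UTF-8. The `+` to space handling that form decoding also needs is left out.

```rust
/// Decode %XX escapes into raw bytes; malformed escapes pass through unchanged.
/// Hypothetical helper for illustration only.
fn percent_decode_bytes(input: &str) -> Vec<u8> {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            let hi = (bytes[i + 1] as char).to_digit(16);
            let lo = (bytes[i + 2] as char).to_digit(16);
            if let (Some(hi), Some(lo)) = (hi, lo) {
                out.push((hi * 16 + lo) as u8);
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    out
}

/// The latent bug: every decoded byte becomes its own code point.
fn decode_per_byte(input: &str) -> String {
    percent_decode_bytes(input)
        .into_iter()
        .map(|b| b as char)
        .collect()
}

/// The fix in spirit: collect all bytes first, then decode them as UTF-8.
fn decode_utf8(input: &str) -> String {
    String::from_utf8_lossy(&percent_decode_bytes(input)).into_owned()
}
```

With these definitions, `decode_per_byte("%E2%9C%93")` yields the three code points U+00E2, U+009C, U+0093, while `decode_utf8("%E2%9C%93")` yields the single code point U+2713.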
2026-05-26 22:52:27 -07:00
1292688827 Fixup: restore ContentCountry import in parsing.rs
android_user_agent / ios_user_agent still take &ContentCountry —
the previous commit over-pruned the imports when it dropped
bootstrap_visitor_data.
2026-05-26 22:33:47 -07:00
56afa423fb Drop dead Method variants, Downloader default fns, parsing/stream_helper unused, suppress_unused leftovers, stale comments
Second pass through the cruft inventory. All deletes — no behavior change.

Method enum + downloader trait:
  * Method::Head / Put / Delete dropped (no caller). Match arms in
    request.rs::as_str() and default_impl.rs::execute() collapse to
    just Get / Post.
  * Request::head builder dropped.
  * Downloader trait's get / get_localized / head / post default
    methods dropped. Every caller went through execute() directly
    anyway; the convenience wrappers carried 4 dead arms each.

Parsing module:
  * bootstrap_visitor_data — pub fn, no caller.
  * discover_web_client_version + CACHED_WEB_CLIENT_VERSION +
    reset_web_client_version_cache — entire sw.js live-version
    discovery pipeline, never wired up by any caller. The cache was
    never populated, so web_client_version() always returned the
    hardcoded constant. Collapsed to just returning the constant.
  * Drops once_cell::Lazy, parking_lot::RwLock around the version
    cache (consent flag still uses RwLock), Regex import, serde_json::Value
    import, downloader/exceptions/Request/InnertubeClientRequestInfo
    imports — all only kept alive by the deleted code.

stream_helper:
  * get_web_embedded_player_response — pub fn, no caller.

js/player_manager + extractor:
  * player_manager::player_hash — pub fn, no caller. Was only kept
    alive by its own definition.
  * extractor::extract_player_hash — pub fn, only called by the now-
    dead player_hash. Test removed alongside.

Stale comments:
  * itag.rs:1 header claimed 53 entries; ITAG_TABLE has 57 and the
    test at line 179 already asserts it.
  * js/mod.rs:12-13 claimed the submodules were 'crate-private
    plumbing' but they're declared pub mod. Tightened the comment to
    explain the integration-test dependency that keeps them public.

Net delete: ~170 LOC of dead surface across 9 files.
2026-05-26 22:33:00 -07:00
59d2ee07be Gate test-only MediaFormat import behind cfg(test)
Release builds were emitting unused_imports for MediaFormat, which is
now only referenced inside a #[cfg(test)] block (after dropping the
_suppress_unused stub in the previous commit).
2026-05-26 22:18:59 -07:00
d4000a9f9a Cleanup: drop playlist + suggestion + dead client constants + suppress_unused stubs
Round-2 cruft audit punch list — mechanical deletes, no behavior change.

Whole modules deleted (no wrapper consumer):
  * youtube/playlist_extractor.rs (297 LOC) — full playlist extraction
  * youtube/linkhandler/playlist.rs (81 LOC) — playlist URL parser
  * youtube/suggestion_extractor.rs (91 LOC) — search-as-you-type
  * tests/stream_phase4_offline.rs (186 LOC) — tautological test

Dead pub fns + enum variants + constants:
  * WEB_REMIX_* constants (3) + WEB_MUSIC_ANALYTICS_* constants (3)
  * InnertubeClientRequestInfo::of_web_music_analytics_charts_client
    factory + its charts_client_omits_platform_and_screen test
  * SearchFilter::Music{Songs,Videos,Albums,Playlists,Artists} variants
    (5 of 9 cases) + uses_music_endpoint helper + the search_extractor
    'music search not implemented' reject branch
  * Two #[allow(dead_code)] _suppress_unused stub fns and the imports
    they were keeping alive (std::sync::Arc in js/extractor.rs,
    NetworkError in stream_extractor.rs)

Renamed:
  * search_extractor::test_helpers -> renderer_helpers. Mis-named:
    it's production code called from channel.rs, not a test fixture.

potoken/ kept and documented as the designed Phase-5 extension point
for YouTube bot-detection — wrapper's Android side hasn't registered
a real provider yet, but the trait + global slot stay so when YT
forces po_token universally the integration is one Kotlin patch away,
not a Rust-side rewrite.

~580 LOC removed from production. Wrapper does not need to change.
2026-05-26 22:16:11 -07:00
bfd06d1ef3 Cap attribution_link recursion + Downloader response body size
Two adversarial bugs surfaced by the round-2 audit on this crate.

extract_video_id recursion (linkhandler/stream.rs)
  /attribution_link?u=<inner> recursed on the inner URL with no depth
  guard. The comment claimed 'only one level deep' but the call was
  plain recursion — a pasted URL whose u= param decodes to another
  /attribution_link would recurse until the JVM stack blew. Wrap the
  recursion in extract_video_id_inner with an explicit depth counter
  capped at MAX_ATTRIBUTION_DEPTH = 1.
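
  A rough sketch of the depth guard, assuming a much simplified URL parser. Only the names extract_video_id, extract_video_id_inner and MAX_ATTRIBUTION_DEPTH come from this commit; percent_decode, attribution_target and watch_id are hypothetical stand-ins, and the video ID used in the note below is a placeholder.

```rust
/// Cap named in the commit message: one level of attribution_link unwrapping.
const MAX_ATTRIBUTION_DEPTH: u8 = 1;

/// Hypothetical helper: decode %XX escapes, leaving malformed ones untouched.
fn percent_decode(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            let hi = (bytes[i + 1] as char).to_digit(16);
            let lo = (bytes[i + 2] as char).to_digit(16);
            if let (Some(hi), Some(lo)) = (hi, lo) {
                out.push((hi * 16 + lo) as u8);
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    String::from_utf8_lossy(&out).into_owned()
}

/// Hypothetical helper: the decoded `u=` target of an /attribution_link URL.
fn attribution_target(url: &str) -> Option<String> {
    let (path, query) = url.split_once('?')?;
    if !path.ends_with("/attribution_link") {
        return None;
    }
    query
        .split('&')
        .find_map(|pair| pair.strip_prefix("u="))
        .map(percent_decode)
}

/// Hypothetical helper: an 11-character ID following `watch?v=`.
fn watch_id(url: &str) -> Option<String> {
    let (_, rest) = url.split_once("watch?v=")?;
    let id: String = rest
        .chars()
        .take_while(|c| c.is_ascii_alphanumeric() || *c == '_' || *c == '-')
        .collect();
    if id.len() == 11 { Some(id) } else { None }
}

/// Recursion with an explicit depth counter instead of plain recursion.
fn extract_video_id_inner(url: &str, depth: u8) -> Option<String> {
    if let Some(inner) = attribution_target(url) {
        if depth >= MAX_ATTRIBUTION_DEPTH {
            return None; // refuse to unwrap a second attribution_link
        }
        return extract_video_id_inner(&inner, depth + 1);
    }
    watch_id(url)
}

pub fn extract_video_id(url: &str) -> Option<String> {
    extract_video_id_inner(url, 0)
}
```

  In this sketch a single /attribution_link?u=%2Fwatch%3Fv%3Dabcdefghijk resolves to the inner ID, while a u= parameter that decodes to another /attribution_link returns None instead of recursing further.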

ReqwestDownloader body cap (downloader/default_impl.rs)
  resp.text() read the entire response body into a String with no
  upper bound. Player.js is ~1.5 MB, watch HTML ~3 MB, channel
  responses well under 1 MB. A hostile redirect target (or compromised
  host) could blast multi-GB and OOM-kill the Android process — there
  is no headroom on a 1 GB JVM heap ceiling.

  Cap at 32 MB. Two-stage check: bail fast on a known Content-Length
  that exceeds the cap, and use Read::take(MAX+1) on the stream so we
  detect overrun rather than silently truncate. Switched the final
  decode to from_utf8_lossy so a single mojibake byte doesn't drop the
  whole response (same fix shape as the wrapper's read_capped_body).
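
  The two-stage check can be sketched against any std::io::Read source. The function name read_capped_body is borrowed from the wrapper mentioned above, but its signature here, the MAX_BODY_BYTES constant name and the error messages are assumptions for illustration; the real downloader reads from a reqwest response rather than a generic reader.

```rust
use std::io::{self, Read};

/// 32 MB cap, as described in the commit message (constant name is hypothetical).
const MAX_BODY_BYTES: u64 = 32 * 1024 * 1024;

/// Read a body into a String, refusing anything larger than `max_bytes`.
fn read_capped_body<R: Read>(
    body: R,
    content_length: Option<u64>,
    max_bytes: u64,
) -> io::Result<String> {
    // Stage 1: bail fast when the declared length already exceeds the cap.
    if let Some(len) = content_length {
        if len > max_bytes {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "declared Content-Length exceeds body cap",
            ));
        }
    }

    // Stage 2: read at most max_bytes + 1 so an overrun is detected
    // instead of being silently truncated at the cap.
    let mut buf = Vec::new();
    body.take(max_bytes.saturating_add(1)).read_to_end(&mut buf)?;
    if buf.len() as u64 > max_bytes {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "response body exceeds body cap",
        ));
    }

    // Lossy decode: one invalid byte becomes U+FFFD rather than an error.
    Ok(String::from_utf8_lossy(&buf).into_owned())
}
```

  Reading one byte past the cap is what distinguishes "exactly at the limit" from "over the limit"; a plain take(MAX) would return a truncated body with no signal. A call such as read_capped_body(reader, declared_len, MAX_BODY_BYTES) applies the 32 MB limit.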
2026-05-26 22:02:40 -07:00
7d2e4c51b8 Drop unused Phase-1 scaffolding: page, metainfo, service
These three modules were ported from NewPipeExtractor in Phase 1 as
part of the spine. Nothing in the YT extractor (channel/search/stream/
playlist/linkhandler) imports them, and the strawcore wrapper crate
that consumes us doesn't re-export them either. Per the round-2
audit's cruft inventory, this is ~195 LOC of dead surface shipping
to every Android APK.

  * page.rs — Page continuation carrier; continuation tokens flow
    through the codebase as plain Strings.
  * metainfo.rs — NPE MetaInfo info-card struct; no extractor
    populates it.
  * service.rs — StreamingService trait + ServiceInfo + LinkType;
    zero impls exist anywhere.

Wrapper does not need to change — none of the pub use re-exports
crossed the crate boundary.
2026-05-26 22:01:08 -07:00
7c7151186e channel: extract avatar from pageHeaderRenderer + metadata fallback
Channels on the newer pageHeaderRenderer layout (most channels with a
2024+ refreshed header — WTYP, etc.) were getting empty avatars and
banners since parse_channel_browse only extracted those from the
older c4TabbedHeaderRenderer branch.

Two fixes layered:

1. parse_page_header_avatar() — walks the deep ViewModel nest:
     header.content.pageHeaderViewModel.image
       .decoratedAvatarViewModel.avatar.avatarViewModel.image.sources[]
   Falls back to a couple of shallower nestings YT has used on this
   path historically. Returns ImageSet sorted by height ascending so
   .last() still picks the largest source.

2. metadata.channelMetadataRenderer.avatar.thumbnails[] backfill.
   Present whether the header is c4Tabbed or pageHeader, and the most
   reliable single avatar source. Used only when both header branches
   came back empty so we don't override a higher-quality header avatar.

Description-from-metadata extraction folded into the same metadata
walk to avoid traversing the JSON tree twice.
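
The selection order can be sketched as follows. The Image struct, the function names and the preference of the c4Tabbed branch over the pageHeader branch are assumptions for illustration; the commit only states that sources are sorted by height ascending and that the metadata thumbnails are used when both header branches came back empty.

```rust
/// Hypothetical, simplified stand-in for one entry of an ImageSet.
#[derive(Debug, Clone, PartialEq)]
struct Image {
    url: String,
    height: u32,
}

/// Sort ascending by height so that `.last()` is the largest source.
fn sorted_by_height(mut images: Vec<Image>) -> Vec<Image> {
    images.sort_by_key(|img| img.height);
    images
}

/// Prefer header avatars; fall back to metadata thumbnails only when
/// both header branches produced nothing.
fn pick_avatars(
    c4_header: Vec<Image>,
    page_header: Vec<Image>,
    metadata: Vec<Image>,
) -> Vec<Image> {
    let chosen = if !c4_header.is_empty() {
        c4_header
    } else if !page_header.is_empty() {
        page_header
    } else {
        metadata
    };
    sorted_by_height(chosen)
}
```

Keeping the metadata thumbnails as the last resort means a header avatar, when present, is never replaced by the backfill.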
2026-05-25 19:47:46 +00:00
e6fbbb79b4 channel: second-browse to Videos tab + parse lockupViewModel
Found via an emulator smoke test that channelInfo was returning an empty
recent_videos list, breaking the subscriptions feed.

Two root causes:
1. First browse of a channel by browseId lands on the HOME tab in
   2026 YT, not Videos. Home uses sectionListRenderer, not the
   richGridRenderer my parser expected. The Videos tab in the
   response carries an empty content block (you need a SECOND
   browse with the params token to populate it).
2. Channel video items on the Videos tab migrated from
   videoRenderer to lockupViewModel (YT made the switch ~2024).
   My old parser only handled videoRenderer.

Fix:
* fetch_channel_browse now does TWO browses — first for Home
  (header + metadata), second with params='EgZ2aWRlb3PyBgQKAjoA'
  for the Videos tab. Same magic constant NPE uses (audit Track
  A §2.4).
* parse_videos_tab handles BOTH videoRenderer (legacy/fallback)
  AND lockupViewModel (current). lockupViewModel parse extracts:
    - contentId → video ID
    - metadata.lockupMetadataViewModel.title.content → title
    - metadataRows[].metadataParts[].text.content → view-count
      ('1.1m views') + relative-age ('2 years ago') + uploader
    - contentImage.thumbnailViewModel.overlays[]
      .thumbnailBottomOverlayViewModel.badges[]
      .thumbnailBadgeViewModel.text → duration ('3:14:08')
    - contentImage.thumbnailViewModel.image.sources[] → thumbnails
* parse_videos_continuation pulls the continuation token from the
  Videos tab grid for pagination.

Second browse is best-effort: if it fails, recent_videos stays
empty and the channel header still populates from the first.

Verified the YT response shape by probing live channel
UCwwtUfy0-CqN50HfaFDzL0w (NCS Spektrem) — got 30+ lockup-style
video items with the expected fields.
2026-05-24 20:06:43 -07:00
aa07984631 Drop cdylib + staticlib from strawcore-core crate-type
Caught during the cargo-ndk cross-compile — strawcore-core was
emitting its own libstrawcore_core.so (~306 KB per ABI) into Straw's
jniLibs. That .so is never loaded by Android; the wrapper crate's
libstrawcore.so is the only entry point.

rlib only is what consumer crates need.
2026-05-24 17:40:37 -07:00
56089ffa3e Rename package to strawcore-core
Straw's wrapper crate already owns the name 'strawcore' (and that name
is baked into the Android .so file + Kotlin's System.loadLibrary call).
Renaming this extractor crate to 'strawcore-core' resolves the cargo
package-name collision so both can live in the same workspace dep tree.

Repo name on Gitea stays Sulkta-Coop/strawcore.
2026-05-24 17:28:38 -07:00
f79d8fb109 Phase 6 — Search + Channel + Playlist + LinkHandler
Pulls in the read-side extractor surfaces Straw needs at app open
(search bar) + on detail screens (channel + playlist).

src/youtube/linkhandler/
  * mod.rs       — ACCEPTED_HOSTS allowlist (youtube.com /
                   youtube-nocookie.com / youtu.be / m.youtube.com /
                   music.youtube.com); 27 Invidious mirror hosts
                   intentionally dropped (SPEC §6.6).
  * stream.rs    — extract_video_id() handles /watch?v= / youtu.be/ /
                   /embed/ / /shorts/ / /v/ / /live/ / attribution_link;
                   strict 11-char [A-Za-z0-9_-] validation.
  * channel.rs   — ChannelIdentifier enum (DirectId / Handle / Custom /
                   LegacyUser). Resolution to UC… id lands in
                   youtube/channel.rs.
  * playlist.rs  — extracts ?list=<PLid> from /playlist and /watch URLs.
  * search.rs    — SearchFilter enum + params() opaque base64 strings +
                   uses_music_endpoint() routing flag.

src/youtube/search_extractor.rs
  * search(query, filter) → SearchInfo { query, corrected_query,
                                          videos, continuation_token }
  * Walks twoColumnSearchResultsRenderer → sectionListRenderer →
    itemSectionRenderer → videoRenderer (+ shelfRenderer recursion).
  * Parses YT duration strings, view-count abbreviations ('1.5M views'),
    publishedTimeText, ownerBadges verified flag, badge LIVE flag.
  * Music-search filters route to WEB_REMIX — flagged as not-yet-impl.
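
  Duration strings such as '3:14:08' reduce to a base-60 fold. A minimal sketch, assuming the text holds one to three colon-separated numeric fields; the function name parse_duration_secs is hypothetical and the real parser may accept more shapes.

```rust
/// Convert "SS", "MM:SS" or "HH:MM:SS" into a number of seconds.
/// Returns None for anything else (for example a "LIVE" badge text).
fn parse_duration_secs(text: &str) -> Option<u64> {
    let parts: Vec<&str> = text.trim().split(':').collect();
    if parts.len() > 3 {
        return None;
    }
    let mut total: u64 = 0;
    for part in parts {
        let field: u64 = part.parse().ok()?;
        // Shift the running total up one base-60 place, then add the field.
        total = total.checked_mul(60)?.checked_add(field)?;
    }
    Some(total)
}
```

  For '3:14:08' the fold computes ((3 * 60) + 14) * 60 + 8 = 11648 seconds.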

src/youtube/suggestion_extractor.rs
  * suggestions(query) → Vec<String> via the suggestqueries-clients6
    endpoint; handles both XSSI-prefixed and bare JSON responses.

src/youtube/channel.rs
  * resolve_handle_to_channel_id() via /youtubei/v1/navigation/resolve_url
  * channel_info(ChannelIdentifier) → ChannelInfo {
      name, description, avatars, banners, subscriber_count, verified,
      recent_videos, videos_continuation
    }
  * Parses both c4TabbedHeaderRenderer (most common) and the newer
    pageHeaderRenderer flavor.
  * subscriber_count parser handles K/M/B suffixes.
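
  Suffix handling of this kind can be sketched with a small lookup on the last character. The function name parse_abbreviated_count, the case-insensitive match and the rounding step are assumptions for illustration, not the crate's actual parser.

```rust
/// Parse strings like "1.5M subscribers", "12K" or "987" into a count.
/// Only the first whitespace-separated token is inspected.
fn parse_abbreviated_count(text: &str) -> Option<u64> {
    let token = text.split_whitespace().next()?;
    let (number, multiplier) = match token.chars().last()? {
        'k' | 'K' => (&token[..token.len() - 1], 1_000.0),
        'm' | 'M' => (&token[..token.len() - 1], 1_000_000.0),
        'b' | 'B' => (&token[..token.len() - 1], 1_000_000_000.0),
        _ => (token, 1.0),
    };
    let value: f64 = number.replace(',', "").parse().ok()?;
    if !value.is_finite() || value < 0.0 {
        return None;
    }
    // Round so that inputs like 1.1 * 1e6 do not land one below the
    // intended integer because of floating-point representation.
    Some((value * multiplier).round() as u64)
}
```

  The same shape covers view-count text, since only the leading token carries the number.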

src/youtube/playlist_extractor.rs
  * playlist_info(playlist_id) → PlaylistInfo with first-page video
    list + continuation_token. Browses with browseId='VL<id>'.
  * Walks playlistMetadataRenderer + playlistSidebarRenderer + the
    playlistVideoListRenderer.contents[] for video items.

Tests: 121 lib unit pass (+44 since Phase 5). All previous phase smoke
tests still green.

What's left:
* Phase 6 kiosks (Trending etc) — minor, deferred
* Phase 7 — UniFFI surface swap into Straw (Straw repo work)
* Phase 8 — delete rustypipe (Straw repo work)
2026-05-24 17:16:14 -07:00
b4286b8236 Phase 5 — PoTokenProvider trait + stream_extractor wiring
Mirrors NPE PoTokenProvider.java + PoTokenResult.java; defines the
host-injection surface for BotGuard attestation. The Rust crate stays
out of the BotGuard business — embedders (Straw on Android, future
Sulkta CLI via Browserless, etc.) supply their own impl.

src/youtube/potoken/mod.rs
  * PoTokenResult { player_request_po_token, streaming_data_po_token,
                    visitor_data }  + ::new + ::single constructors
  * PoTokenError (Unavailable, MintFailed) — FIX vs NPE: split 'declined'
    (Ok(None)) from 'errored' (Err) so callers can react differently
  * trait PoTokenProvider with 4 client-scoped methods; default impl
    returns Ok(None) so embedders can override just what they support
  * set_po_token_provider / clear_po_token_provider / po_token_provider
    static registration via RwLock<Option<Arc<dyn PoTokenProvider>>>
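
  The contract can be sketched with std types only. The method names, the field types and the use of std::sync::RwLock are assumptions (the names are loosely modeled on NPE's provider); FixedAndroidProvider is a hypothetical embedder implementation that overrides a single method.

```rust
use std::sync::{Arc, RwLock};

#[derive(Debug, Clone, PartialEq)]
pub struct PoTokenResult {
    pub player_request_po_token: String,
    pub streaming_data_po_token: Option<String>,
    pub visitor_data: Option<String>,
}

#[derive(Debug, Clone, PartialEq)]
pub enum PoTokenError {
    Unavailable,
    MintFailed(String),
}

/// Ok(None) means "declined"; Err(..) means "tried and failed".
pub type PoTokenOutcome = Result<Option<PoTokenResult>, PoTokenError>;

/// Four client-scoped methods, each defaulting to Ok(None).
/// Method names are hypothetical.
pub trait PoTokenProvider: Send + Sync {
    fn web_client_po_token(&self, _video_id: &str) -> PoTokenOutcome { Ok(None) }
    fn web_embed_client_po_token(&self, _video_id: &str) -> PoTokenOutcome { Ok(None) }
    fn android_client_po_token(&self, _video_id: &str) -> PoTokenOutcome { Ok(None) }
    fn ios_client_po_token(&self, _video_id: &str) -> PoTokenOutcome { Ok(None) }
}

/// Safe default: declines every request.
pub struct NoopPoTokenProvider;
impl PoTokenProvider for NoopPoTokenProvider {}

/// Hypothetical embedder that only supports the Android client.
pub struct FixedAndroidProvider;
impl PoTokenProvider for FixedAndroidProvider {
    fn android_client_po_token(&self, _video_id: &str) -> PoTokenOutcome {
        Ok(Some(PoTokenResult {
            player_request_po_token: "placeholder-token".to_string(),
            streaming_data_po_token: None,
            visitor_data: None,
        }))
    }
}

/// Process-global registration slot.
static PROVIDER: RwLock<Option<Arc<dyn PoTokenProvider>>> = RwLock::new(None);

pub fn set_po_token_provider(provider: Arc<dyn PoTokenProvider>) {
    *PROVIDER.write().unwrap() = Some(provider);
}

pub fn clear_po_token_provider() {
    *PROVIDER.write().unwrap() = None;
}

pub fn po_token_provider() -> Option<Arc<dyn PoTokenProvider>> {
    PROVIDER.read().unwrap().clone()
}
```

  Because every trait method has a default body, an embedder overrides only the clients it can mint tokens for, and callers can tell a declined request (Ok(None)) from a failed one (Err).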

src/youtube/potoken/noop.rs
  * NoopPoTokenProvider — safe default

src/youtube/stream_extractor.rs
  * resolve_po_token via options-first-then-provider helper
    (options_or_provider)
  * Android branch: pulls player_request_po_token + visitor_data into
    /player body, streams streaming_data_po_token through to URL &pot=
  * iOS branch: same shape, gated on fetch_ios_client AND non-empty
    provider result

Kotlin side (PoTokenWebView lift into Straw via UniFFI's foreign-trait
bridge) is separate work — strawcore just owns the contract.

Tests: 77 lib unit pass (+4 since Phase 4) + 7 Phase 2 offline + 7
Phase 4 offline = 91 green.
2026-05-24 17:10:13 -07:00
a47e142ab7 Phase 4 (complete) — stream_extractor orchestrator
Wire the Android-primary fetch path + JSON-walking + URL post-processing
into a single stream_info(video_id) entry point. Mirrors NPE
YoutubeStreamExtractor.onFetchPage() per audit Track C §1.2.

src/youtube/stream_extractor.rs
  * stream_info(video_id) + stream_info_with(video_id, options)
  * fetch_android — reel endpoint (anonymous) OR /player (with po_token)
  * check_playability_status — maps to ContentUnavailable variants
    (AgeRestricted, GeoRestricted, Paid, Private, YoutubeMusicPremium,
    AccountTerminated, Other)
  * is_player_response_not_valid — decoy-video detection
  * populate_video_details + populate_microformat + populate_streams +
    populate_manifests + populate_captions
  * process_url — sig deobf path (signatureCipher → JS function call)
    + unconditional nsig deobf + cpn append + pot append
  * build_video_progressive / build_video_only / build_audio +
    push_*_dedup helpers (FIX: NPE bug — dedup by itag id, not by
    mediaFormat.id which collides 140/141)
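
  The dedup key choice can be illustrated with a minimal sketch. The AudioStream fields, the function name push_audio_dedup and the media_format_id value used in the note are hypothetical; the point is only that two streams sharing a media format but carrying different itags must both survive.

```rust
/// Hypothetical, trimmed-down audio stream record.
#[derive(Debug, Clone, PartialEq)]
struct AudioStream {
    itag: u32,
    media_format_id: u32,
    url: String,
}

/// Push `candidate` unless a stream with the same itag is already present.
/// Returns true when the stream was added.
fn push_audio_dedup(streams: &mut Vec<AudioStream>, candidate: AudioStream) -> bool {
    if streams.iter().any(|s| s.itag == candidate.itag) {
        return false;
    }
    streams.push(candidate);
    true
}
```

  Keyed this way, itag 140 and itag 141 are both kept even though they share one media format, whereas keying on media_format_id would drop the second.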

Consolidated stream_helper's local ExtractionError into the crate-wide
exceptions::ExtractionError with a new DownloaderMissing variant.

Tests: 73 lib unit pass (+9 since the partial Phase 4 commit) + 7 new
Phase 4 offline integration tests = 80 green. Live YT end-to-end smoke deferred
to Straw integration; the code path is in place.
2026-05-24 17:08:04 -07:00
cd98673684 Phase 4 (partial) — stream value types + InnerTube /player helpers
Lands the data shapes + the HTTP layer for stream extraction. The
extractor orchestrator + DASH manifest creator are deferred to the
next session — the parsing logic is dense enough to want a focused
pass.

src/stream/
  * mod.rs       — StreamInfo + StreamInfoItem (full + 'card' shapes)
                   mirroring NPE StreamInfo.java + StreamInfoItem.java
  * delivery.rs  — DeliveryMethod (Progressive/Dash/Hls/Torrent)
  * audio.rs     — AudioStream (itag, format, url, bitrate, codec,
                   audio_track_id, content_length, etc.)
  * video.rs     — VideoStream (itag, format, url, resolution, fps,
                   bandwidth, codec, video_only flag)
  * subtitles.rs — SubtitlesStream (url, lang, auto_generated, mime)

src/youtube/stream_helper.rs
  * generate_content_playback_nonce() — 16-char LCG-shuffled cpn
  * get_web_metadata_player_response       (microformat + thumbnails only)
  * get_web_embedded_player_response       (embed-url + signatureTimestamp)
  * get_android_player_response            (full Android /player + poToken)
  * get_android_reel_player_response       (no-poToken fallback)
  * get_ios_player_response                (iOS — flagged with 917 KiB cap
                                            warning in the doc comment)

Per-helper headers + URL shapes match audit Track C §2.7 verbatim:
Android/iOS hit gapis endpoint with mobile UA; WEB family hits
www.youtube.com with the WEB headers.

Tests: 64 lib unit pass (up from 62 in Phase 3).

Next session: full stream_extractor.rs orchestrator + dash_manifest/
creator + Phase 4 done-when smoke (extract NCS Spektrem).
2026-05-24 17:01:03 -07:00
3014410cba Phase 3 — InnerTube + itag
Port the YT client matrix + request envelope + itag lookup table.

src/youtube/
  * constants.rs       — ClientsConstants.java verbatim. All six live
                         clients (WEB, WEB_EMBEDDED_PLAYER,
                         WEB_MUSIC_ANALYTICS, ANDROID, IOS, plus the
                         WEB_REMIX values for completeness). Base URLs
                         + prettyPrint=false suffix.
  * client_request.rs  — ClientInfo / DeviceInfo / InnertubeClientRequestInfo
                         + the 5 factory constructors NPE exposes
                         (ofWebClient, ofWebEmbeddedPlayer, ofCharts,
                         ofAndroid, ofIos). build_envelope() emits the
                         InnerTube JSON in NPE's exact insertion order;
                         build_desktop_envelope() is the WEB-fast-path
                         used by search/browse/next/resolve_url/comments.
  * itag.rs            — 57-entry itag table (14 progressive + 10 audio +
                         33 video-only). MediaFormat enum + ItagType
                         enum + ItagItem struct + lookup().
  * parsing.rs         — consent toggle + cookie generator (SOCS=CAE= /
                         SOCS=CAISAiAD), WEB client-version cache + sw.js
                         scrape, WEB/mobile header builders (mobile
                         deliberately strips X-YouTube-Client-Name +
                         Origin/Referer + Cookie per audit Track A §6.2),
                         android/ios UA templates, visitor_data bootstrap
                         POST to /youtubei/v1/visitor_id.

PARITY notes flagged in code:
  * androidSdkVersion=36 + osVersion=16 but Android-15 in UA — NPE-intentional
  * mobile clients send NO X-YouTube-Client-* headers
  * audit doc says "53 entries" but tallies + NPE source = 57 ItagItems

Tests: 62 lib unit pass (up from 43 in Phase 2). All Phase 1 + Phase 2
smoke still green. Live InnerTube POSTs (visitor_data bootstrap +
/player) deferred to Phase 4 integration.
2026-05-24 16:57:47 -07:00
91639f26d1 Phase 2 — JS deobfuscator (rquickjs + ress)
Port NewPipeExtractor's JS pipeline: player.js fetch + cache, sig and
nsig function extraction, deobfuscation, sticky-error caching.

src/youtube/js/
  * runtime.rs        — rquickjs wrapper (mirrors utils/JavaScript.java)
                        compile_or_throw + run(snippet, name, parameter)
  * lexer.rs          — match_to_closing_brace via the `ress` JS scanner
                        (NPE's lexer is derived from the same crate
                        upstream)
  * extractor.rs      — iframe_api → embed page fallback for player.js
                        URL, regex-driven hash extraction, clean-and-fetch
  * signature.rs      — 6 sig fn name regexes (front-most-recent),
                        deobf-function-body via lexer w/ regex fallback,
                        helper-object + global-string-array extraction,
                        signatureTimestamp, snippet assembler
  * nsig.rs           — 8 nsig fn name regexes (incl. array-indirection),
                        body via lexer w/ regex fallback, fixupFunction
                        early-return strip
  * player_manager.rs — orchestrator + sticky-error cache mirroring
                        YoutubeJavaScriptPlayerManager

PORT DEVIATIONS from NPE (each flagged in code):
  * dropped the 6th sig fn name regex (used Java backref \2; Rust's
    `regex` crate is backtracking-free, so we substitute a loose form
    that NPE itself half-broke per audit Track B §2.1)
  * dropped the Java atomic group `(?>...)` from helper-object regex —
    Rust's NFA is already linear-time
  * nsig fixup substitutes `(?:"undefined"|'undefined')` for the
    \1 backref; harmless loosening
  * sig and nsig assembled snippets prepend `var` — QuickJS rejects
    bare-assignment to undeclared identifiers; NPE relied on Rhino's
    non-strict mode

Tests:
  * 43 lib unit tests (up from 7 in Phase 1)
  * 7 Phase 2 offline integration tests against a hand-crafted
    minified synthetic player.js — exercises the full sig pipeline
    (build_deobfuscator → runtime::run) and nsig fixup_function
  * 7 Phase 1 live smoke tests still green

57/57 total green.
2026-05-24 16:53:19 -07:00
46201c731f Phase 1 — Foundation
Mirror NPE's dependency-free spine in Rust:

* exceptions   — NetworkError + ParsingError + ContentUnavailable
                 + ExtractionError tree, with reqwest/serde_json conversions
* localization — Localization + ContentCountry, default (en, GB)
* downloader/  — Downloader trait, Request builder, Response,
                 reqwest blocking default impl
* page         — continuation-token carrier
* image        — Image + ImageSet + ResolutionLevel
                 (HEIGHT_UNKNOWN/WIDTH_UNKNOWN = -1)
* metainfo     — title/content/url/url_text grab-bag
* service      — StreamingService trait + LinkType + ServiceInfo
* newpipe      — process-global Downloader / Localization /
                 ContentCountry singleton

Foundational invariants nailed down (per SPEC §3):
* HTTP non-2xx returns Ok(Response); only 429 returns Err(NetworkError::Recaptcha)
* Response header keys lowercase-normalized
* Request.add_header PARITY with NPE bug (silent overwrite);
  append_header is our clean addition
* default Localization is en-GB
* No cookie jar in the default downloader
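
The difference between the two header methods can be sketched with a plain map. The Headers type and its storage layout are hypothetical, and lowercasing request header keys is an assumption carried over from the response-side invariant; only the overwrite versus append behavior is taken from the list above.

```rust
use std::collections::HashMap;

/// Hypothetical header store keyed by lowercase header name.
#[derive(Debug, Default)]
struct Headers {
    inner: HashMap<String, Vec<String>>,
}

impl Headers {
    /// Parity behavior: a second add for the same name silently
    /// replaces whatever was stored before.
    fn add_header(&mut self, name: &str, value: &str) {
        self.inner
            .insert(name.to_ascii_lowercase(), vec![value.to_string()]);
    }

    /// Clean addition: keep earlier values and append the new one.
    fn append_header(&mut self, name: &str, value: &str) {
        self.inner
            .entry(name.to_ascii_lowercase())
            .or_default()
            .push(value.to_string());
    }

    /// Look up all values for a header, ignoring ASCII case.
    fn get(&self, name: &str) -> &[String] {
        match self.inner.get(&name.to_ascii_lowercase()) {
            Some(values) => values.as_slice(),
            None => &[],
        }
    }
}
```

Calling add_header twice with the same name leaves one value, while append_header twice leaves two, which is the distinction the invariant records.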

Tests: 7 unit + 7 live smoke against httpbin.org (gated on
'online-tests' feature). All green.
2026-05-24 16:32:36 -07:00
f44b46fab5 Initial commit
2026-05-24 16:26:57 -07:00