Five-step daily pipeline with failure isolation
Each stage (scrape → generate docs → ingest to ChromaDB → ingest YouTube content → cleanup) runs independently. If scraping fails, the stage retries after 30 minutes. If ingestion fails, cleanup is skipped to preserve data for manual recovery. This keeps the system resilient without the complexity of a full job queue.
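A minimal sketch of this failure-isolation logic, assuming stages are plain callables run in order (the stage names and `run_pipeline` helper here are illustrative, not the project's actual API):

```python
import time

def run_pipeline(stages, scrape_retry_delay=30 * 60):
    """Run each (name, callable) stage in order, isolating failures.

    Per the design above: a failed scrape gets one retry after a delay,
    and a failed ingest skips cleanup so data survives for manual recovery.
    """
    status = {}
    for name, fn in stages:
        try:
            fn()
            status[name] = "ok"
        except Exception:
            if name == "scrape":
                # Scrape failures get a single retry after the delay.
                time.sleep(scrape_retry_delay)
                try:
                    fn()
                    status[name] = "ok-after-retry"
                    continue
                except Exception:
                    pass
            status[name] = "failed"
            if name.startswith("ingest"):
                # Preserve ingested-but-unconfirmed data: don't run cleanup.
                status["cleanup"] = "skipped"
                return status
    return status
```

Passing `scrape_retry_delay=0` makes the retry path cheap to exercise in tests.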
Smart retrieval routing over pure embedding search
Rather than sending every question through the same vector similarity search, the chatbot classifies questions by keyword regex and adjusts which document types to prioritize — unit analysis chunks for unit questions, video guide chunks for strategy questions, a mix for comp questions. This hybrid approach significantly improved answer relevance compared to naive RAG.
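The routing idea can be sketched as a small ordered rule table; the keyword patterns and document-type names below are assumptions for illustration, not the project's real rules:

```python
import re

# Ordered routing rules: first matching pattern wins.
# Patterns and doc-type labels are illustrative assumptions.
ROUTES = [
    (re.compile(r"\b(unit|champion|item)\b", re.I), ["unit_analysis"]),
    (re.compile(r"\b(strategy|positioning|econ)\b", re.I), ["video_guide"]),
    (re.compile(r"\b(comp|composition|team)\b", re.I),
     ["unit_analysis", "video_guide"]),
]

def route_question(question, default=("unit_analysis", "video_guide")):
    """Pick which document types to prioritize in retrieval."""
    for pattern, doc_types in ROUTES:
        if pattern.search(question):
            return doc_types
    return list(default)
```

The returned document types would then be passed as a metadata filter to the vector search, narrowing the candidate pool before similarity ranking.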
Deterministic chunk IDs via MD5 hashing
Every chunk gets an ID derived from MD5(type + date + chunk_index), which means re-running the pipeline upserts instead of duplicating data. This was critical for a daily-refresh system — without it, ChromaDB would accumulate duplicate entries on every pipeline run.
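A sketch of the ID scheme, assuming the three fields are joined with a separator before hashing (the exact concatenation format is an assumption):

```python
import hashlib

def chunk_id(doc_type, date, chunk_index):
    """Deterministic chunk ID: identical inputs always hash to the
    same ID, so re-running the pipeline upserts rather than duplicating."""
    key = f"{doc_type}:{date}:{chunk_index}"  # separator is an assumption
    return hashlib.md5(key.encode("utf-8")).hexdigest()

# e.g. collection.upsert(ids=[chunk_id("unit_analysis", "2024-05-01", 0)], ...)
```

Because ChromaDB's `upsert` replaces entries with matching IDs, idempotency falls out of the ID scheme rather than requiring a separate dedup pass.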
Header-based semantic chunking
Documents are split on ## markdown headers (with a secondary ### split for oversized chunks). This keeps each comp or unit analysis as a single coherent chunk rather than breaking mid-sentence at a fixed token count. The tradeoff is occasional chunk size variance, but embedding quality is meaningfully better.
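A minimal version of this two-level split, using zero-width lookahead so each header stays attached to its section (the `max_len` threshold is an illustrative assumption):

```python
import re

def split_on_headers(markdown, max_len=2000):
    """Split on '## ' headers; re-split oversized chunks on '### '."""
    chunks = [c for c in re.split(r"(?m)^(?=## )", markdown) if c.strip()]
    out = []
    for chunk in chunks:
        if len(chunk) > max_len:
            # Secondary split for oversized sections.
            out.extend(c for c in re.split(r"(?m)^(?=### )", chunk)
                       if c.strip())
        else:
            out.append(chunk)
    return out
```

The lookahead `(?=## )` requires a literal space after the two hashes, so `###` subheaders do not trigger the primary split.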
Two-tier YouTube processing
YouTube's transcript API blocks cloud provider IPs, so transcripts are fetched locally, processed through Gemini for structured extraction, and committed as JSON files. The Lightsail pipeline then ingests those pre-processed files. A pragmatic workaround that avoids introducing a proxy layer just for one data source.
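The local half of that workflow can be sketched as follows, assuming transcript segments arrive as the usual list of `{'text', 'start', 'duration'}` dicts; the Gemini extraction step is omitted, and the function and output-directory names are illustrative:

```python
import json
from pathlib import Path

def export_transcript(video_id, segments, out_dir="youtube_processed"):
    """Local step: flatten fetched transcript segments into the JSON file
    that gets committed and later ingested by the server-side pipeline.
    (The Gemini structured-extraction step is omitted in this sketch.)
    """
    text = " ".join(s["text"] for s in segments)
    record = {"video_id": video_id, "transcript": text}
    path = Path(out_dir) / f"{video_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, indent=2))
    return path
```

Committing these files to the repo means the cloud pipeline never has to call YouTube at all, sidestepping the IP block entirely.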
Next.js __NEXT_DATA__ extraction
Instead of intercepting XHR requests from tactics.tools (fragile and race-condition-prone), the scraper extracts server-rendered data directly from the __NEXT_DATA__ script tag. More reliable and simpler to maintain.
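A regex-based sketch of the extraction (a real scraper might use an HTML parser instead, and attribute order in the tag can vary, so this pattern is an assumption):

```python
import json
import re

def extract_next_data(html):
    """Pull the server-rendered JSON payload from a Next.js page's
    __NEXT_DATA__ script tag."""
    m = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html,
        re.S,
    )
    if not m:
        raise ValueError("__NEXT_DATA__ script tag not found")
    return json.loads(m.group(1))
```

Since the payload is embedded in the initial HTML response, a single GET suffices; there is no need to run a headless browser or wait for client-side fetches to settle.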
Lazy-loaded Riot Data Dragon lookups
Unit/item/trait IDs from the raw data are resolved to human-readable names via Riot's CDN. The lookup table is fetched once per session, cached in memory, and falls back to raw IDs on failure — so the system never crashes over a cosmetic lookup.
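A sketch of the lazy cache with graceful fallback; the Data Dragon URL shown is the LoL champion endpoint with an example version (the project's actual endpoint and version handling may differ), and the `fetch` injection point is added here for testability:

```python
import json
import urllib.request

_CHAMPION_NAMES = None  # lazy cache, populated at most once per session

# Real Data Dragon CDN; the version string here is only an example.
DDRAGON_URL = ("https://ddragon.leagueoflegends.com/cdn/"
               "14.1.1/data/en_US/champion.json")

def resolve_unit(unit_id, fetch=None):
    """Resolve a raw unit ID to a display name, falling back to the
    raw ID if the lookup table can't be fetched."""
    global _CHAMPION_NAMES
    if _CHAMPION_NAMES is None:
        try:
            fetch = fetch or (lambda: json.load(
                urllib.request.urlopen(DDRAGON_URL)))
            data = fetch()
            _CHAMPION_NAMES = {k: v["name"] for k, v in data["data"].items()}
        except Exception:
            _CHAMPION_NAMES = {}  # cache the failure; don't retry every call
    return _CHAMPION_NAMES.get(unit_id, unit_id)
```

Caching the failure as an empty dict keeps the fallback cheap: every subsequent lookup returns the raw ID immediately instead of re-attempting the network call.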