Methodology

How this platform actually works

Every number, name, and timeline on this site comes from a specific source and a specific calculation. This page walks through both - first in plain language, then with the technical detail underneath, so you can trust (and challenge) what you see.

1. Where the data comes from

Four feeds power the whole platform. Nothing is hand-typed.

  • ASCO program CSV - the official nightly export of every abstract and session for the 2026 Annual Meeting, published by ASCO on CloudFront.
  • HCP Enrichment Engine - your KOL roster (name, institution, specialty, tier) served as a paginated JSON API.
  • Lovable AI Gateway - runs Google Gemini for short summaries, keyword extraction, and the Ask assistant. No data is sent to model providers other than the text we explicitly include in the prompt.
  • Our own database - every fetched row is stored so the UI is fast and the same answer is reproducible from one visit to the next.

In plain English
Think of it as a kitchen. ASCO ships the groceries (the CSV). Your HCP engine ships the guest list (the KOLs). Our kitchen (the database) preps everything once, and the dining room (the website) just plates it.

2. Pulling the ASCO program

The ASCO program guide is a Flutter single-page app - there are no scrapeable links. So we use the canonical source they publish themselves.

Every run downloads the entire ASCO abstracts CSV for meeting 335 (the 2026 Annual Meeting). Each row is one abstract: a title, a session type (Poster Session, Oral Abstract, Clinical Science Symposium, Plenary, etc.), a track, the speaker’s display name, the abstract body, and a start/end timestamp.

We split the rows into posters (anything labelled “Poster”) and sessions (everything else). “Publication Only” abstracts are dropped - they have no live presentation to attend.
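
A minimal sketch of that split, for readers who want to see the rule as code. The row shape and the case-insensitive matching are assumptions made for illustration; the real column names in the CSV may differ.

```typescript
// Hypothetical row shape: only the columns this sketch needs.
interface AbstractRow {
  title: string;
  sessionType: string; // e.g. "Poster Session", "Oral Abstract", "Publication Only"
}

type RowKind = "poster" | "session" | "dropped";

// Classify one CSV row: drop "Publication Only", then posters, then everything else.
function classifyRow(row: AbstractRow): RowKind {
  const type = row.sessionType.trim().toLowerCase();
  if (type.includes("publication only")) return "dropped";
  if (type.includes("poster")) return "poster";
  return "session";
}

// Split a parsed CSV into the two buckets that get stored.
function splitRows(rows: AbstractRow[]): { posters: AbstractRow[]; sessions: AbstractRow[] } {
  const posters: AbstractRow[] = [];
  const sessions: AbstractRow[] = [];
  for (const row of rows) {
    const kind = classifyRow(row);
    if (kind === "poster") posters.push(row);
    else if (kind === "session") sessions.push(row);
    // "dropped" rows are skipped on purpose
  }
  return { posters, sessions };
}
```

Checking "Publication Only" first means a row is dropped before it can be counted as a poster or a session.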

In plain English
We grab ASCO’s own master spreadsheet, throw away the rows that aren’t real presentations, and save the rest. We never guess what’s on the agenda.

Under the hood

Source URL: d32wbias3z7pxg.cloudfront.net/abstract-exports/335/meeting_335_abstracts.csv. Parsed with papaparse. Each row is upserted into the sessions table keyed by source_url (so re-runs update in place instead of duplicating). Speakers are upserted into speakers keyed by a normalised name and linked via session_speakers. Every run writes a row to ingest_runs with live progress counters so the Admin page can show the crawl moving.

3. Dates, times and timezones

Conference scheduling is where most planning tools get sloppy. We refuse to guess.

ASCO publishes presentation times like 2026-05-31 09:57:00 CDT. We parse each one and convert it to a true UTC timestamp using the stated offset (CDT = UTC−5).
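
The conversion can be sketched as below. This is an illustration, not the production parser: the function name, the strict regular expression, and the CST entry in the offset table are assumptions.

```typescript
// Offsets in minutes from UTC. CDT is stated on this page;
// CST is included only as an assumption for completeness.
const OFFSET_MINUTES: Record<string, number> = {
  CDT: -5 * 60,
  CST: -6 * 60,
};

// Parse "YYYY-MM-DD HH:mm:ss ZZZ" into a UTC Date, or null if the text does not fit.
function parseAscoTimestamp(raw: string): Date | null {
  const m = /^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2}) ([A-Z]{2,4})$/.exec(raw.trim());
  if (!m) return null;
  const offset = OFFSET_MINUTES[m[7] ?? ""];
  if (offset === undefined) return null; // unknown zone: refuse to guess
  // Read the wall-clock fields as if they were UTC, then remove the zone offset.
  const wallClockAsUtc = Date.UTC(
    Number(m[1]), Number(m[2]) - 1, Number(m[3]),
    Number(m[4]), Number(m[5]), Number(m[6]),
  );
  return new Date(wallClockAsUtc - offset * 60_000);
}

// "2026-05-31 09:57:00 CDT" becomes 2026-05-31T14:57:00.000Z
const example = parseAscoTimestamp("2026-05-31 09:57:00 CDT");
```

An unrecognised zone abbreviation returns null instead of falling back to a default offset, in keeping with the "refuse to guess" rule.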

Safety net: ASCO occasionally leaks publication or receipt dates into the same column. We hard-reject any date outside the meeting window (2026-05-29 to 2026-06-02). If a row has no valid in-window date, we show “Time TBA” rather than mislead you.
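
As a rough sketch, the window check might look like this. The page gives only the two dates, so the exact bounds used here (whole Chicago calendar days, with CDT = UTC-5) are an assumption, as are the function names.

```typescript
// Window bounds as UTC instants. Assumption: the window covers whole Chicago
// calendar days, from the start of 29 May to the end of 2 June 2026.
const WINDOW_START_MS = Date.UTC(2026, 4, 29, 5, 0, 0); // 2026-05-29 00:00 CDT
const WINDOW_END_MS = Date.UTC(2026, 5, 3, 5, 0, 0); // 2026-06-03 00:00 CDT, exclusive

function inMeetingWindow(t: Date): boolean {
  const ms = t.getTime(); // NaN for invalid dates, which fails both comparisons
  return ms >= WINDOW_START_MS && ms < WINDOW_END_MS;
}

// Return the first in-window candidate, or null when nothing is trustworthy.
function pickStartTime(candidates: Array<Date | null>): Date | null {
  for (const c of candidates) {
    if (c !== null && inMeetingWindow(c)) return c;
  }
  return null;
}

// What the UI would show for a row (ISO string used here as a stand-in for real formatting).
function displayLabel(start: Date | null): string {
  return start === null ? "Time TBA" : start.toISOString();
}
```

A receipt date such as 2025-11-15 fails the check, so the row falls through to "Time TBA" instead of being placed on the schedule.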

Display: a toggle in the header switches between Chicago time (CT, the conference’s home) and your browser’s local timezone. Both modes use 24-hour format so a server-rendered page and your browser render the same string (no hydration flicker, no “01:00 PM” vs “13:00” disagreement).
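
One way to get the same string on server and browser is to pin the locale and the hour cycle and vary only the timezone. The sketch below uses the standard Intl.DateTimeFormat API; the "en-GB" locale, the function name, and the mode names are assumptions.

```typescript
type ZoneMode = "chicago" | "local";

// Format a UTC instant as 24-hour HH:mm in either Chicago time or a given local zone.
// When mode is "local" and no zone is passed, the runtime's own zone is used.
function formatTime(utc: Date, mode: ZoneMode, localZone?: string): string {
  const fmt = new Intl.DateTimeFormat("en-GB", {
    timeZone: mode === "chicago" ? "America/Chicago" : localZone,
    hour: "2-digit",
    minute: "2-digit",
    hour12: false, // never "01:00 PM"
  });
  return fmt.format(utc);
}

// 18:00 UTC on 31 May 2026 is 13:00 in Chicago (CDT).
const chicagoLabel = formatTime(new Date("2026-05-31T18:00:00Z"), "chicago");
```

Because the locale is fixed, the output does not depend on the visitor's language settings, only on the chosen timezone.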

In plain English
We only trust times that fall during the actual meeting. Anything else is shown as “TBA” instead of being fudged into a fake slot.

4. KOL roster from the HCP engine

Your KOL list isn’t hardcoded - it’s pulled live from your HCP Enrichment Engine.

We page through your HCP API (100 records per page, polite 1.1s gap between calls to stay under the 60 req/min limit) until there are no more pages. Each HCP record is normalised into a single shape: full name, institution, specialty, tier.

Under the hood

Endpoint: GET /hcps?project_id=…&page=N&per_page=100. Retried up to 3 times on 429 or 5xx with exponential backoff. Records are upserted into kols keyed by external_id, so the same person is never inserted twice even across runs.
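
The paging and retry behaviour can be sketched as follows. The response shape (records, hasMore), the injected fetchPage function, and the 500 ms base delay are assumptions for illustration; only the 3 retries, the 429/5xx rule, and the 1.1s gap come from the description above.

```typescript
// Hypothetical shapes: the real API response is not documented on this page.
interface HcpPage { records: unknown[]; hasMore: boolean }
interface PageResult { status: number; body?: HcpPage }
type FetchPage = (page: number) => Promise<PageResult>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Fetch one page, retrying on 429 or 5xx with exponential backoff.
async function fetchWithRetry(
  fetchPage: FetchPage, page: number, maxRetries = 3, baseDelayMs = 500,
): Promise<HcpPage> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetchPage(page);
    if (res.status === 200 && res.body) return res.body;
    const retryable = res.status === 429 || res.status >= 500;
    if (!retryable || attempt >= maxRetries) {
      throw new Error(`page ${page} failed with status ${res.status}`);
    }
    await sleep(baseDelayMs * 2 ** attempt); // 1x, 2x, 4x the base delay
  }
}

// Walk pages until the API reports no more, pausing between calls.
async function fetchAllHcps(
  fetchPage: FetchPage, gapMs = 1100, baseDelayMs = 500,
): Promise<unknown[]> {
  const all: unknown[] = [];
  for (let page = 1; ; page++) {
    const body = await fetchWithRetry(fetchPage, page, 3, baseDelayMs);
    all.push(...body.records);
    if (!body.hasMore) return all;
    await sleep(gapMs); // 1.1s between calls keeps us under 60 requests per minute
  }
}
```

Passing the fetch function in as a parameter keeps the loop testable without a network. A 1.1s gap works out to at most about 54 calls per minute.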

5. Matching KOLs to speakers

The hardest problem: deciding whether “J. Smith, Dana-Farber” in your KOL list is the same person as “Jane Smith” on the ASCO program.

We use a two-pass match per KOL:

  1. Exact normalised name. First we strip titles (Dr., Prof., MD, PhD, MBBS…), accents, and punctuation, lowercase everything, and collapse whitespace. If a KOL and a speaker share an identical normalised name, that’s a strong hit (base score 0.90). If their institutions also overlap, the score is nudged up toward 1.00.
  2. Fuzzy fallback. If there’s no exact match (or it scored under 0.95), we compare every speaker’s name to the KOL using a token-set ratio: the size of the shared word-set divided by the average set size. Anything below 0.85 is discarded. The remainder is combined with institution overlap (0.85 × name + 0.15 × affiliation).
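
The normalisation in step 1 can be sketched like this. It is not a copy of src/lib/normalize.ts: the title list below is illustrative (the real list is longer), and turning punctuation into spaces rather than deleting it is an assumption.

```typescript
// Illustrative title list only; the real list contains more entries.
const TITLES = new Set(["dr", "prof", "md", "phd", "mbbs"]);

function normalizeName(raw: string): string {
  return raw
    .normalize("NFD") // split accented letters into base letter + combining mark
    .replace(/[\u0300-\u036f]/g, "") // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ") // punctuation becomes whitespace
    .split(/\s+/) // splitting on runs of whitespace also collapses it
    .filter((tok) => tok.length > 0 && !TITLES.has(tok))
    .join(" ");
}

// "Dr. José García-López, MD, PhD" becomes "jose garcia lopez"
const normalised = normalizeName("Dr. José García-López, MD, PhD");
```

This sketch only handles Latin-script names; anything outside a-z and 0-9 after accent stripping is treated as punctuation.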

A match below 0.70 is dropped entirely. Everything above is fanned out: if a matched speaker appears in five sessions, that KOL gets five kol_matches rows - one per session - each carrying the score and a short reason (name_exact+affiliation, fuzzy_name(0.92)+aff, etc.).

In plain English
We try the obvious match first (same exact name). If that fails, we try “mostly the same name + same hospital”. We never silently merge two people who only share a first initial.

Under the hood

Token-set ratio: 2 × |A ∩ B| / (|A| + |B|), computed on tokens longer than 1 character. Affiliation overlap: |A ∩ B| / min(|A|, |B|) on tokens longer than 3 characters. See src/lib/normalize.ts and src/lib/matcher.server.ts. The kol_matches table is truncated and rewritten on every rematch run - there are no stale matches.
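
The two formulas and the fuzzy-pass weighting can be written out as a short sketch. Function names are hypothetical, inputs are assumed to be already normalised (lowercase, no punctuation), and the exact-match "nudge" is left out because its formula is not given here.

```typescript
// Unique whitespace-separated tokens of at least minLength characters.
function tokenSet(text: string, minLength: number): Set<string> {
  return new Set(text.split(/\s+/).filter((t) => t.length >= minLength));
}

function intersectionSize(a: Set<string>, b: Set<string>): number {
  let n = 0;
  a.forEach((t) => { if (b.has(t)) n++; });
  return n;
}

// 2 × |A ∩ B| / (|A| + |B|), on tokens longer than 1 character.
function tokenSetRatio(a: string, b: string): number {
  const A = tokenSet(a, 2);
  const B = tokenSet(b, 2);
  if (A.size + B.size === 0) return 0;
  return (2 * intersectionSize(A, B)) / (A.size + B.size);
}

// |A ∩ B| / min(|A|, |B|), on tokens longer than 3 characters.
function affiliationOverlap(a: string, b: string): number {
  const A = tokenSet(a, 4);
  const B = tokenSet(b, 4);
  const smaller = Math.min(A.size, B.size);
  return smaller === 0 ? 0 : intersectionSize(A, B) / smaller;
}

// Fuzzy pass: returns null when the candidate is discarded.
function fuzzyScore(nameA: string, nameB: string, affA: string, affB: string): number | null {
  const name = tokenSetRatio(nameA, nameB);
  if (name < 0.85) return null; // name similarity too weak
  const score = 0.85 * name + 0.15 * affiliationOverlap(affA, affB);
  return score < 0.7 ? null : score; // overall floor
}
```

For example, "jane marie smith" against "jane smith" shares 2 tokens out of 3 + 2, giving 2 × 2 / 5 = 0.80, which is below the 0.85 cut-off and is discarded. With these thresholds a surviving fuzzy score is at least 0.85 × 0.85 = 0.7225, so in this pass the 0.70 floor acts as a backstop.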

6. TL;DRs and keywords

Every poster gets a one-paragraph summary and a handful of keywords.

The TL;DR is generated deterministically from the abstract text - same input always produces the same output. Keywords are pulled by frequency and stop-word filtering from the title and abstract combined.
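
A minimal sketch of frequency-based keyword extraction, to show why the result is repeatable. It is not the code in src/lib/tldr.ts: the stop-word list, the three-character minimum, and the alphabetical tie-break are assumptions.

```typescript
// Illustrative stop-word list; a real list would be much longer.
const STOP_WORDS = new Set([
  "the", "and", "for", "with", "was", "were", "are", "from", "that", "this",
]);

function extractKeywords(title: string, abstractText: string, limit = 5): string[] {
  const counts = new Map<string, number>();
  // Words of 3+ characters, starting with a letter (keeps terms like "pd-1").
  const words = `${title} ${abstractText}`.toLowerCase().match(/[a-z][a-z0-9-]{2,}/g) ?? [];
  for (const w of words) {
    if (STOP_WORDS.has(w)) continue;
    counts.set(w, (counts.get(w) ?? 0) + 1);
  }
  const entries: Array<[string, number]> = [];
  counts.forEach((n, w) => entries.push([w, n]));
  // Highest count first; ties broken alphabetically so the order never varies.
  entries.sort((x, y) => y[1] - x[1] || (x[0] < y[0] ? -1 : x[0] > y[0] ? 1 : 0));
  return entries.slice(0, limit).map(([w]) => w);
}
```

The explicit tie-break matters: without it, two words with the same count could come out in a different order depending on the runtime's sort.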

For richer summaries we call Google Gemini via the Lovable AI Gateway and ask for a ≤2-sentence plain-English summary plus three bullet key points per session. We only send the title and the first ~800 characters of the abstract - both of which are already public on ASCO’s site.

Under the hood

See src/lib/tldr.ts (deterministic) and src/lib/tldr.server.ts (backfill). The backfill writes to sessions.summary and sessions.keywords in one of two modes: mode: "missing" fills only rows where those columns are empty, and mode: "all" rewrites every row.
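
The two backfill modes amount to a row filter, sketched below. The row shape is hypothetical, and treating a row as "missing" when either column is empty (rather than both) is an assumption.

```typescript
type BackfillMode = "missing" | "all";

// Hypothetical subset of a sessions row.
interface SessionRow {
  id: string;
  summary: string | null;
  keywords: string[] | null;
}

// Decide whether the backfill should (re)write this row.
function needsBackfill(row: SessionRow, mode: BackfillMode): boolean {
  if (mode === "all") return true;
  const noSummary = row.summary === null || row.summary.trim() === "";
  const noKeywords = row.keywords === null || row.keywords.length === 0;
  return noSummary || noKeywords;
}

// Select the rows a run would touch.
function selectForBackfill(rows: SessionRow[], mode: BackfillMode): SessionRow[] {
  return rows.filter((r) => needsBackfill(r, mode));
}
```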

7. The Ask assistant

Ask is a tool-using agent, not a chatbot guessing answers.

When you type a question on /ask, the model isn’t allowed to invent data. It can only call a small set of read-only tools we expose: search_sessions, get_session, search_posters, search_speakers, list_kols, get_kol_matches, list_articles, coverage_snapshot, recent_ingest_runs, and missing_schedule.

Every tool call is shown to you in the left-hand “Steps” panel with its filters and row count, so you can see exactly how the answer was assembled. If the data isn’t there, the model is instructed to say so - not to fabricate.
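
The gatekeeping idea can be sketched as an allowlist plus a step log. The tool names are the ones listed above; the handler signature, the Step shape, and runToolCall itself are hypothetical and are not the actual agent code.

```typescript
// Tool names as listed on this page.
const READ_ONLY_TOOLS = [
  "search_sessions", "get_session", "search_posters", "search_speakers",
  "list_kols", "get_kol_matches", "list_articles", "coverage_snapshot",
  "recent_ingest_runs", "missing_schedule",
] as const;
type ToolName = (typeof READ_ONLY_TOOLS)[number];

// One entry in the "Steps" panel.
interface Step { tool: ToolName; filters: Record<string, unknown>; rowCount: number }
type ToolHandler = (filters: Record<string, unknown>) => unknown[];

// Run a tool the model asked for, but only if it is on the allowlist,
// and record what was asked and how many rows came back.
function runToolCall(
  name: string,
  filters: Record<string, unknown>,
  handlers: Partial<Record<ToolName, ToolHandler>>,
  steps: Step[],
): unknown[] {
  const tool = READ_ONLY_TOOLS.find((t) => t === name);
  if (tool === undefined) throw new Error(`tool not allowed: ${name}`);
  const handler = handlers[tool];
  if (!handler) throw new Error(`no handler registered for ${tool}`);
  const rows = handler(filters);
  steps.push({ tool, filters, rowCount: rows.length });
  return rows;
}
```

Anything the model requests that is not on the list is rejected before it reaches the database, and a call that returns zero rows is still logged, so an empty result is visible as such.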

In plain English
The assistant has to look things up in our own database before answering. You see every query it ran.

8. Coverage and quality checks

We measure ourselves out loud.

  • Coverage snapshot: total sessions, posters, speakers, KOLs, and the percentage of sessions with a parsed day/time.
  • Missing schedule view: a live list of every row where day, start_at, or end_at is null - with a one-click re-run against just those rows.
  • Ingest runs: every crawl logs counters (processed, matched, inserted, updated, errors) so you can see whether yesterday’s run was healthy.
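
As an illustration of the first two checks, the sketch below derives the schedule percentage from the same null test the missing-schedule view uses. The row shape is hypothetical, and treating "parsed day/time" as the exact complement of the missing list is an assumption.

```typescript
// Hypothetical subset of a sessions row.
interface ScheduleRow {
  id: string;
  day: string | null;
  start_at: string | null;
  end_at: string | null;
}

// Rows that would appear in the missing-schedule view.
function missingSchedule(rows: ScheduleRow[]): ScheduleRow[] {
  return rows.filter((r) => r.day === null || r.start_at === null || r.end_at === null);
}

// Percentage of rows with a full day/time, rounded to one decimal place.
function scheduledPercent(rows: ScheduleRow[]): number {
  if (rows.length === 0) return 0;
  const parsed = rows.length - missingSchedule(rows).length;
  return Math.round((parsed / rows.length) * 1000) / 10;
}
```

Deriving both numbers from one predicate keeps the percentage and the list from disagreeing with each other.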

9. Known limits (so you don’t get burned)

  • ASCO’s CSV only carries one primary speaker per abstract. Chair and panelist names appear only when a session-level export is available; otherwise the UI shows “no chair listed”.
  • The CSV’s AbstractNumber isn’t the same as ASCO’s internal presentation ID used in their URLs, so direct deep links built from it return a 404. We route “View on ASCO” buttons through a Google site-search instead so you still reach the real page.
  • Fuzzy matching has a floor at 0.70. If a KOL’s name is materially different from what ASCO prints (initials only, married name, transliteration), they won’t be auto-matched. That’s a deliberate trade-off in favour of precision.
  • All AI summaries are model output. They’re short and conservative, but you should still read the source abstract before quoting them in a clinical context.

Last updated: rolling - this page describes the live behaviour of the codebase as deployed.