Methodology

How data enters The Well, how it is validated, and how it is reviewed before publication. For the full technical document with code paths, see docs/methodology.md.

Data sources

Every procedural field on a judge page comes from one of four source types. The source type appears next to the field on the page.

Court website / standing order. The originating page on the court's site, or a linked PDF order. Extracted with deterministic rules and cited with a short excerpt.
PACER-derived statistic. Computed from public docket data — for example, motion-ruling cadence (median days from full submission to order) or bench-ruling rate.
Aggregated contributor observation. Verified-attorney observations, aggregated across many submissions before any field is published. No individual observation appears on the site.
Verified court page. A field copied verbatim from a court-published procedure page, linked and dated. Used for stable facts like courtroom equipment lists.
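
As a rough illustration of the PACER-derived statistic described above, motion-ruling cadence could be computed along the following lines. This is a sketch under stated assumptions, not the project's code: the function name and the record shape (pairs of submission and order dates) are hypothetical.

```python
from datetime import date
from statistics import median


def motion_ruling_cadence(rulings):
    """Median days from full submission to order.

    `rulings` is an iterable of (fully_submitted, order_entered) date pairs.
    Returns None when there is nothing to measure.
    """
    # Hypothetical record shape: the real docket fields may differ.
    days = [(ordered - submitted).days for submitted, ordered in rulings]
    return median(days) if days else None


sample = [
    (date(2024, 1, 2), date(2024, 1, 30)),   # 28 days
    (date(2024, 2, 1), date(2024, 3, 17)),   # 45 days
    (date(2024, 3, 4), date(2024, 3, 25)),   # 21 days
]
print(motion_ruling_cadence(sample))  # prints 28
```

A median is less sensitive than a mean to the occasional motion that sits for a year, which is presumably why the statistic is defined that way.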

Pipeline

Scrape. One scraper per jurisdiction fetches the judge index, standing orders, and chambers-rules pages from the court's site. Scrapers live in scrapers/ and run on a weekly schedule.
Extract. Shared extractors at scrapers/common/extractors.py turn standing-order text into structured fields. Extraction is regex and structural HTML parsing — no language models.
Persist. Output is written as YAML to data/judges/{jurisdiction}/. Every procedural field is paired with a sources entry containing the originating URL and a short excerpt.
Validate. scripts/lint-judge-yaml.py enforces the JSON Schema at data/schema/judge.schema.json. CI fails on any schema error.
Review. The weekly scraper workflow commits directly to main. Maintainers review the resulting Git diff and roll back anomalies. Human-authored corrections go through the normal pull-request review.
Render. Astro builds a fully static site from the validated YAML. Judge pages are pre-rendered HTML with no client-side JavaScript and no runtime database.
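
The Extract, Persist, and Validate steps above can be sketched roughly as follows. Everything named here is an assumption made for illustration: the page-limit pattern, the function names, and the key names (value, url, excerpt, confidence) are hypothetical, and the small citation check is a plain-Python stand-in for the JSON Schema validation the project actually runs.

```python
import re

# Hypothetical pattern: matches phrases such as "shall not exceed 25 pages".
PAGE_LIMIT = re.compile(r"(?:shall|may) not exceed (\d+) pages", re.IGNORECASE)


def extract_page_limit(text, url):
    """Return a field dict with its paired sources entry, or None if no match."""
    match = PAGE_LIMIT.search(text)
    if match is None:
        return None
    return {
        "value": int(match.group(1)),
        "sources": [{
            "url": url,
            "excerpt": match.group(0),       # the matched text is the cited excerpt
            "confidence": "auto-extracted",  # no human verification yet
        }],
    }


def missing_citations(field):
    """Stand-in for schema validation: sources entries lacking a url or excerpt."""
    return [
        s for s in field.get("sources", [])
        if not s.get("url") or not s.get("excerpt")
    ]


order_text = "Memoranda in support of motions shall not exceed 25 pages."
field = extract_page_limit(order_text, "https://example.invalid/standing-order")
print(field["value"], "|", field["sources"][0]["excerpt"])
print(missing_citations(field))
```

Because the extractor is a pure function of the input text, running it twice on the same standing order yields the same field and the same excerpt, which is the reproducibility property the pipeline relies on.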

What the project does not do

No AI in the data pipeline. Extraction is deterministic. The output of a scraper run is reproducible from its inputs. This is a hard constraint, not a temporary stance.
No subjective ratings. The schema does not contain rating fields. Pages describe procedure and cite where each fact came from.
No silent updates. Every change is a Git commit. Scraper-driven updates are visible in the public history and reviewable as diffs.

Confidence levels

Each sources entry on a judge page carries a confidence label.

Verified. A maintainer has confirmed the field against the cited source.
Auto-extracted. The field came from a deterministic extractor with no human verification yet. The cited excerpt is the matched text.
Aggregated. The field came from aggregated contributor observations. No individual observation appears on the site.

Corrections and disputes

Corrections require a citation. Anyone may file an issue with the data:correction label or open a pull request against the YAML. If a proposed correction conflicts with the cited source, the standing order or court page controls, and the field is updated to match that source.

Where a court page and an aggregated observation conflict, the court page wins, but both remain visible: the field reflects the rule, and the aggregated observation is preserved for context.
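
The precedence rule above might look like this in code. It is a minimal sketch only: the function name, the dictionary keys, and the choice to keep the observation only when it disagrees with the court page are assumptions, not the project's schema.

```python
def resolve_field(court_page_value, aggregated_value):
    """Court page controls; a conflicting aggregated observation is kept as context."""
    if court_page_value is not None:
        resolved = {"value": court_page_value, "source_type": "court page"}
        # Assumption: the observation is shown alongside only when it disagrees.
        if aggregated_value is not None and aggregated_value != court_page_value:
            resolved["observed"] = aggregated_value
        return resolved
    if aggregated_value is not None:
        # No court page on point, so the aggregated observation stands alone.
        return {"value": aggregated_value, "source_type": "aggregated observation"}
    return None


print(resolve_field("courtesy copies required", "courtesy copies rarely requested"))
```

With those inputs the resolved value is the court page's rule, and the differing observation travels with it under the hypothetical observed key.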

Update cadence

Scrapers run weekly. Standing orders and court pages are re-fetched on that cadence; the last fetched timestamp on each entry records when. Aggregated-observation fields are recomputed when new contributions cross the aggregation threshold for that field. The last verified timestamp on each sources entry records the most recent confirmation.
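
To make the aggregation threshold concrete, here is one way the gating could work. The threshold value, the function name, and the use of the most common answer as the aggregate are all assumptions for this sketch; the text above does not specify them.

```python
from collections import Counter

# Hypothetical threshold: the real per-field value is not stated here.
AGGREGATION_THRESHOLD = 5


def aggregate(observations, threshold=AGGREGATION_THRESHOLD):
    """Return the most common observed value once enough submissions exist.

    Below the threshold nothing is published, so the function returns None.
    """
    if len(observations) < threshold:
        return None
    value, count = Counter(observations).most_common(1)[0]
    return {"value": value, "supporting": count, "total": len(observations)}


print(aggregate(["yes", "yes", "no"]))                # below threshold, prints None
print(aggregate(["yes", "yes", "no", "yes", "yes"]))  # threshold reached
```

Only the aggregate leaves the function, so no individual observation is exposed even after the threshold is crossed.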