Methodology
How data enters The Well, how it is validated, and how it is reviewed before publication. For the full technical document with code paths, see docs/methodology.md.
Data sources
Every procedural field on a judge page comes from one of four source types. The source type appears next to the field on the page.
- Court website / standing order. The originating page on the court's site, or a linked PDF order. Extracted with deterministic rules and cited with a short excerpt.
- PACER-derived statistic. Computed from public docket data — for example, motion-ruling cadence (median days from full submission to order) or bench-ruling rate.
- Aggregated contributor observation. Verified-attorney observations, aggregated across many submissions before any field is published. No individual observation appears on the site.
- Verified court page. A field copied verbatim from a court-published procedure page, linked and dated. Used for stable facts like courtroom equipment lists.
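As an illustration of the PACER-derived motion-ruling cadence described above, here is a minimal sketch of a median computation. The function name `ruling_cadence_days` and the input shape (pairs of submission and order dates) are assumptions made for this example, not the project's actual code.

```python
from datetime import date
from statistics import median


def ruling_cadence_days(motions):
    """Median days from full submission to order.

    `motions` is an iterable of (fully_submitted, order_entered) date pairs.
    Pairs with no order yet (order_entered is None) are skipped.
    Returns None when no motion has been ruled on.
    """
    waits = [
        (ordered - submitted).days
        for submitted, ordered in motions
        if ordered is not None
    ]
    return median(waits) if waits else None


# Illustrative data only, not real docket entries.
sample = [
    (date(2024, 1, 2), date(2024, 1, 30)),  # 28 days
    (date(2024, 2, 1), date(2024, 3, 17)),  # 45 days
    (date(2024, 3, 4), date(2024, 3, 14)),  # 10 days
    (date(2024, 4, 1), None),               # still pending, ignored
]
print(ruling_cadence_days(sample))  # 28
```

A median is less sensitive than a mean to a single motion that sat for an unusually long time, which is presumably why the page reports it.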
Pipeline
- Scrape. One scraper per jurisdiction fetches the judge index, standing orders, and chambers-rules pages from the court's site. Scrapers live in scrapers/ and run on a weekly schedule.
- Extract. Shared extractors at scrapers/common/extractors.py turn standing-order text into structured fields. Extraction is regex and structural HTML parsing; no language models.
- Persist. Output is written as YAML to data/judges/{jurisdiction}/. Every procedural field is paired with a sources entry containing the originating URL and a short excerpt.
- Validate. scripts/lint-judge-yaml.py enforces the JSON Schema at data/schema/judge.schema.json. CI fails on any schema error.
- Review. The weekly scraper workflow commits directly to main. Maintainers review the resulting Git diff and roll back anomalies. Human-authored corrections go through the normal pull-request review.
- Render. Astro builds a fully static site from the validated YAML. Judge pages are pre-rendered HTML with no client-side JavaScript and no runtime database.
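The Extract and Persist steps can be sketched roughly as follows. This is a hypothetical example, not the contents of scrapers/common/extractors.py: the regex, the function name `extract_page_limit`, the dictionary keys, and the example URL are all assumptions chosen to show a deterministic rule that pairs a value with its source URL and matched excerpt.

```python
import re

# Hypothetical rule: a brief page limit phrased as "shall not exceed N pages".
PAGE_LIMIT_RE = re.compile(r"shall not exceed (\d+) pages", re.IGNORECASE)


def extract_page_limit(text, url):
    """Return a structured field with its source entry, or None if no match.

    The same input text always yields the same output, so a run is
    reproducible from its inputs.
    """
    match = PAGE_LIMIT_RE.search(text)
    if match is None:
        return None
    return {
        "value": int(match.group(1)),
        "sources": [
            {
                "url": url,
                "excerpt": match.group(0),       # the matched text itself
                "confidence": "auto-extracted",  # assumed serialized label
            }
        ],
    }


order_text = (
    "Memoranda in support of a motion shall not exceed 25 pages "
    "without leave of court."
)
field = extract_page_limit(order_text, "https://example.org/standing-order")
print(field["value"], "|", field["sources"][0]["excerpt"])
```

In a real pipeline the returned dictionary would be serialized to YAML and checked against the schema; both steps are left out here to keep the sketch free of third-party dependencies.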
What the project does not do
- No AI in the data pipeline. Extraction is deterministic. The output of a scraper run is reproducible from its inputs. This is a hard constraint, not a temporary stance.
- No subjective ratings. The schema does not contain rating fields. Pages describe procedure and cite where each fact came from.
- No silent updates. Every change is a Git commit. Scraper-driven updates are visible in the public history and reviewable as diffs.
Confidence levels
Each sources entry on a judge page carries a confidence label.
- Verified. A maintainer has confirmed the field against the cited source.
- Auto-extracted. The field came from a deterministic extractor with no human verification yet. The cited excerpt is the matched text.
- Aggregated. The field came from aggregated contributor observations. No individual observation appears.
Corrections and disputes
Corrections require a citation. Anyone may file an issue with the data:correction label or open a pull request against the YAML. If a correction conflicts with the cited source, the standing order or court page is the controlling authority, and the field is updated to reflect it.
Where a court page and an aggregated observation conflict, the court page wins, but both are visible — the field reflects the rule, and the observation is preserved for context.
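The precedence rule above could be expressed along these lines. This is a sketch under stated assumptions: the function name `resolve_field` and the keys `value`, `source_type`, and `observed` are hypothetical and do not come from the project's schema, and the fallback to the aggregated value when no court page exists is an assumption as well.

```python
def resolve_field(court_page_value, aggregated_value):
    """Pick the published value for a field with up to two candidate sources.

    The court page wins whenever it is present. A conflicting aggregated
    observation is kept under "observed" so it stays visible for context.
    """
    if court_page_value is not None:
        result = {"value": court_page_value, "source_type": "court_page"}
        if aggregated_value is not None and aggregated_value != court_page_value:
            result["observed"] = aggregated_value
        return result
    if aggregated_value is not None:
        # Assumed fallback: no court page, so the aggregate is all there is.
        return {"value": aggregated_value, "source_type": "aggregated"}
    return None


# The rule says 14 days; contributors report 21 in practice.
print(resolve_field(14, 21))
```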
Update cadence
Scrapers run weekly. Standing orders and court pages are re-fetched on that cadence; the last fetched timestamp on each entry records when. Aggregated-observation fields are recomputed when new contributions cross the aggregation threshold for that field. The last verified timestamp on each sources entry records the most recent confirmation.
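To make the aggregation threshold concrete, here is a minimal sketch. The threshold value, the function name `aggregate`, and the choice of the most common answer as the published value are all assumptions; this page does not state how the project actually aggregates or what its thresholds are.

```python
from collections import Counter

MIN_OBSERVATIONS = 5  # hypothetical threshold, not the project's real value


def aggregate(observations, threshold=MIN_OBSERVATIONS):
    """Return an aggregate summary, or None while below the threshold.

    Only counts and the most common answer are returned, so no individual
    observation is exposed.
    """
    total = len(observations)
    if total < threshold:
        return None
    value, count = Counter(observations).most_common(1)[0]
    return {"value": value, "n": total, "share": count / total}


print(aggregate(["yes", "yes", "no"]))                # None, below threshold
print(aggregate(["yes", "yes", "no", "yes", "yes"]))  # value "yes", n 5, share 0.8
```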