Public calibration backtest · all-time

Cardpulse grades its own homework in public.

Every Cardpulse forecast carries an 80% prediction band. This page shows what fraction of those bands actually contained the realized comp price after the horizon elapsed. No incumbent publishes this. That is the wedge.

Iglewicz–Hoaglin |M|>3.5 outlier filter · ±15-day realized window · nightly re-evaluation


Methodology

How the backtest is computed.

Every Cardpulse forecast carries an 80% prediction band [p10, p90]. Once a forecast reaches its target date (generated_at + horizon_days), the nightly job joins it to the realized comp price — the median of all ticks in a ±15-day window around the target date.
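A minimal sketch of that join, for illustration only. The function names and the (timestamp, price) tick shape are hypothetical. It also assumes the Iglewicz–Hoaglin filter is applied to the ticks inside the window before the median is taken, which this page does not spell out.

```python
from datetime import datetime, timedelta
from statistics import median

WINDOW = timedelta(days=15)  # the ±15-day realized window
M_CUTOFF = 3.5               # Iglewicz-Hoaglin modified z-score cutoff


def drop_outliers(prices):
    """Drop prices whose modified z-score |M| exceeds 3.5.

    M = 0.6745 * (x - median) / MAD, where MAD is the median absolute deviation.
    """
    if len(prices) < 3:
        return list(prices)
    med = median(prices)
    mad = median(abs(p - med) for p in prices)
    if mad == 0:
        # Assumption: with zero spread there is nothing to scale by, so keep everything.
        return list(prices)
    return [p for p in prices if abs(0.6745 * (p - med) / mad) <= M_CUTOFF]


def realized_price(ticks, generated_at, horizon_days):
    """Median comp price in the ±15-day window around the target date.

    `ticks` is an iterable of (timestamp, price) pairs (hypothetical shape).
    Returns None when no comps fall inside the window.
    """
    target = generated_at + timedelta(days=horizon_days)
    in_window = [price for ts, price in ticks if abs(ts - target) <= WINDOW]
    if not in_window:
        return None
    return median(drop_outliers(in_window))
```

A forecast generated on 2023-12-16 with a 30-day horizon has a target date of 2024-01-15, so only ticks dated 2023-12-31 through 2024-01-30 would count toward its realized price.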

Each prediction is bucketed: in band (realized inside [p10, p90]), above (realized above p90), below (realized below p10), or no data (no comps in the target window). A perfectly calibrated 80% band lands in-band 80% of the time.
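The bucketing rule can be sketched in a few lines. The function names and bucket labels below are hypothetical, and the sketch assumes that "no data" predictions are left out of the coverage denominator, which this page does not state.

```python
def bucket(realized, p10, p90):
    """Classify one prediction against its 80% band [p10, p90]."""
    if realized is None:
        return "no_data"   # no comps in the target window
    if realized > p90:
        return "above"
    if realized < p10:
        return "below"
    return "in_band"       # band edges count as inside


def coverage(buckets):
    """Fraction of scored predictions that landed in band.

    Assumption: "no_data" rows are excluded from the denominator.
    Returns None when nothing could be scored.
    """
    scored = [b for b in buckets if b != "no_data"]
    if not scored:
        return None
    return sum(b == "in_band" for b in scored) / len(scored)
```

Under this sketch, a perfectly calibrated predictor would have `coverage(...)` converge toward 0.80 as the number of scored predictions grows.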

Re-evaluation is append-only: when new comps arrive in an old prediction's target window, a fresh row lands in the forecast_eval hypertable. The latest evaluation per prediction is what this page reports.
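As a rough illustration of "latest evaluation per prediction", the sketch below picks the newest row for each prediction from an append-only list. The row keys (`prediction_id`, `evaluated_at`, `verdict`) are hypothetical and are not the actual forecast_eval schema.

```python
def latest_per_prediction(rows):
    """Return {prediction_id: newest row} from append-only evaluation rows.

    Each row is a dict with hypothetical keys "prediction_id" and
    "evaluated_at"; older rows are kept in storage but ignored here.
    """
    latest = {}
    for row in rows:
        current = latest.get(row["prediction_id"])
        if current is None or row["evaluated_at"] > current["evaluated_at"]:
            latest[row["prediction_id"]] = row
    return latest
```

Because nothing is overwritten, a prediction first recorded as "no data" can later flip to a scored verdict once comps arrive, and the earlier row remains available for audit.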

Collectibles markets move slowly; comps for thin variants can be months apart. The ±15-day window is the intentional compromise: tight enough to stay honest, wide enough to evaluate cards that trade monthly.

proper scoring rules

Pinball loss at q=0.10 and q=0.90 — the proper scoring rule for quantile forecasts. Lower is better.
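For reference, the standard pinball (quantile) loss can be written as below. This is a sketch of the textbook definition, not necessarily how Cardpulse computes or normalizes it; the function names are hypothetical.

```python
def pinball_loss(q, forecast, realized):
    """Pinball loss of a quantile forecast at level q (0 < q < 1).

    Under-forecasts (realized above forecast) are weighted by q,
    over-forecasts by (1 - q).
    """
    if realized >= forecast:
        return q * (realized - forecast)
    return (1.0 - q) * (forecast - realized)


def mean_pinball_loss(q, pairs):
    """Average pinball loss over (forecast, realized) pairs."""
    pairs = list(pairs)
    return sum(pinball_loss(q, f, y) for f, y in pairs) / len(pairs)
```

At q=0.10, a realized price 10 below p10 costs 9.0 while one 10 above p10 costs only 1.0, which is what pushes a well-fit p10 to sit below roughly 90% of outcomes.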

Winkler interval score (Bentzien & Friederichs 2014) — width plus miss penalty in one number. Lower is better.
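A sketch of the usual Winkler interval score for a central (1 − α) band, assuming α = 0.20 for the 80% band [p10, p90]. The function name is hypothetical, and any rescaling Cardpulse applies (for example, dividing by price) is not shown.

```python
def winkler_score(lower, upper, realized, alpha=0.20):
    """Winkler interval score: band width plus a miss penalty.

    Misses are penalized by (2 / alpha) times the distance outside the
    band, so an 80% band (alpha = 0.20) pays 10x the overshoot.
    """
    width = upper - lower
    if realized < lower:
        return width + (2.0 / alpha) * (lower - realized)
    if realized > upper:
        return width + (2.0 / alpha) * (realized - upper)
    return width
```

For a band of [90, 110], a realized price of 100 scores 20 (width only), while a realized price of 120 scores 20 + 10 × 10 = 120. A wider band avoids the penalty but pays for it in width.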

Mean signed error — average of (realized − p50) / p50, signed. Positive means the predictor leaned conservative; negative means it leaned aggressive. Italic on negative throughout the site, never red/green.
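The mean signed error follows directly from the formula above. In this sketch the function name is hypothetical, and skipping predictions with no realized price is an assumption.

```python
def mean_signed_error(pairs):
    """Average of (realized - p50) / p50 over (realized, p50) pairs.

    Pairs with no realized price (None) are skipped (an assumption).
    Returns None when nothing can be scored.
    """
    errors = [(realized - p50) / p50 for realized, p50 in pairs if realized is not None]
    if not errors:
        return None
    return sum(errors) / len(errors)
```

A median forecast of 100 against a realized price of 120 contributes +0.20: the market cleared above the forecast, so the predictor leaned conservative.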

nightly cadence

The Arq cron at 02:00 UTC re-evaluates every elapsed prediction made in the last 730 days. It runs unconditionally, whatever the verdicts look like. Willingness to publish bad numbers is the moat.
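The selection rule for the nightly run can be sketched as a simple filter. The dict keys and function name are hypothetical, and the Arq job wiring itself is not shown. The sketch assumes "elapsed" means the target date is at or before the run time.

```python
from datetime import datetime, timedelta, timezone

LOOKBACK = timedelta(days=730)  # predictions older than this are not re-evaluated


def due_for_reevaluation(predictions, now):
    """Predictions made in the last 730 days whose target date has passed.

    Each prediction is a dict with hypothetical keys "generated_at"
    (timezone-aware datetime) and "horizon_days" (int).
    """
    cutoff = now - LOOKBACK
    due = []
    for p in predictions:
        target = p["generated_at"] + timedelta(days=p["horizon_days"])
        if p["generated_at"] >= cutoff and target <= now:
            due.append(p)
    return due
```

Nothing in the filter looks at the previous verdict, which mirrors the point above: a prediction is re-scored whether or not the result is flattering.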