How We Score Patreek AI

What we measure

Our primary metric is the Brier score: the mean of (predicted_probability − actual_outcome)² across every analysis we ran on a market that has since resolved. Actual outcome is 1 for YES, 0 for NO.

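Lower is better. Perfect prediction scores 0. A forecaster that always says 50% scores exactly 0.25, whatever the outcomes turn out to be. A Brier score above 0.25 means the AI is doing worse than that uninformative baseline; a score below 0.25 means it is beating the baseline and carries some signal.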
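As a rough illustration of the definition above, the score can be computed in a few lines of Python. This is a minimal sketch, not our production code; the function name `brier_score` and the input shape (a list of probability/outcome pairs) are assumptions made for the example.

```python
def brier_score(pairs):
    """Mean squared error between predicted probabilities and 0/1 outcomes.

    `pairs` is an iterable of (predicted_probability, actual_outcome),
    where actual_outcome is 1 for YES and 0 for NO.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one scored analysis")
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)


# Always saying 50% gives 0.25 no matter how the markets resolve.
always_half = brier_score([(0.5, 1), (0.5, 0), (0.5, 1)])

# Confident and right lands near 0; confident and wrong lands near 1.
confident_right = brier_score([(0.9, 1), (0.1, 0)])
confident_wrong = brier_score([(0.9, 0), (0.1, 1)])
```

Because the error is squared, a confident miss costs far more than a hesitant one, which is why the metric rewards honest probabilities rather than bold calls.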

How we score

For each public prediction market that closes, we look up every analysis Patreek AI generated for that market and compare the AI's stated probability to the realized YES/NO outcome. One market can contribute multiple data points if it was analyzed multiple times (e.g., on refresh by a Pro subscriber).

The overall Brier score is a straight mean across all scored data points — no weighting by market volume or time.
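To make the "straight mean" concrete, here is a small sketch with made-up numbers. The market ids and the dictionary layout are hypothetical; the per-market average is shown only for contrast and is not what this page reports.

```python
# Hypothetical data: market id -> list of (predicted_probability, outcome).
analyses_by_market = {
    "market-a": [(0.7, 1), (0.8, 1), (0.9, 1)],  # analyzed three times
    "market-b": [(0.6, 0)],                      # analyzed once
}

# Pool every analysis into one flat list, then take a plain mean.
data_points = [pair for pairs in analyses_by_market.values() for pair in pairs]
pooled_brier = sum((p - o) ** 2 for p, o in data_points) / len(data_points)

# For contrast only: averaging within each market first would weight
# every market equally, regardless of how often it was analyzed.
per_market = [
    sum((p - o) ** 2 for p, o in pairs) / len(pairs)
    for pairs in analyses_by_market.values()
]
per_market_brier = sum(per_market) / len(per_market)
```

With pooling, a market that was analyzed many times carries more weight than one analyzed once, so the two averages can differ noticeably.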

The 48-hour cutoff

Analyses generated within 48 hours of resolution are NOT counted toward calibration. This rule exists because late-breaking news could leak resolution information into the AI's input context — if a market is about to resolve YES and every headline says so, an analysis run at that moment would score unfairly well (or poorly, if the AI ignores obvious signals).

We score only analyses that were “honest” — made when the outcome was still genuinely uncertain and the AI had no advantage from proximity to resolution.
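The cutoff amounts to a simple timestamp comparison. The sketch below is one way to express it, under two assumptions: the name `is_scoreable` is invented for this example, and an analysis generated exactly 48 hours before resolution is treated as inside the window (the rule above does not spell out the boundary).

```python
from datetime import datetime, timedelta, timezone

CUTOFF = timedelta(hours=48)


def is_scoreable(generated_at, resolved_at, cutoff=CUTOFF):
    """True if the analysis was generated more than `cutoff` before resolution.

    Assumption: exactly 48 hours counts as "within 48 hours" and is excluded.
    """
    return resolved_at - generated_at > cutoff


resolved = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
early = is_scoreable(resolved - timedelta(hours=72), resolved)    # counted
late = is_scoreable(resolved - timedelta(hours=47), resolved)     # dropped
boundary = is_scoreable(resolved - timedelta(hours=48), resolved)  # dropped
```

Using timezone-aware timestamps on both sides avoids off-by-hours mistakes when the market and the analysis are recorded in different zones.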

What we exclude

Markets that resolved as INVALID (e.g., UMA disputes that never finalized) — there is no genuine YES/NO outcome to score against.
Markets we never analyzed — they have no entry in our scoring table.
Disputed-then-resolved markets are not excluded: they are scored using the final, post-dispute outcome, not the initial disputed result.
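The rules above can be sketched as a filter over market records. Everything about the record layout here is hypothetical (the `final_outcome`, `initial_outcome`, and `analyses` fields are names invented for the example); the point is only to show which records produce scored data points.

```python
def outcome_for_scoring(market):
    """Map a market record to 1 (YES), 0 (NO), or None (not scoreable).

    `market` is a hypothetical dict. `final_outcome` holds the post-dispute
    result; `initial_outcome`, if present, is deliberately ignored.
    """
    final = market.get("final_outcome")
    if final == "YES":
        return 1
    if final == "NO":
        return 0
    return None  # INVALID or never finalized: nothing to score against


def scoreable_pairs(markets):
    """Yield (predicted_probability, outcome) for every scoreable analysis."""
    for market in markets:
        outcome = outcome_for_scoring(market)
        if outcome is None:
            continue
        # A market we never analyzed has no analyses and yields nothing.
        for probability in market.get("analyses", []):
            yield probability, outcome


example_markets = [
    # Disputed, then finally resolved YES: scored against YES.
    {"final_outcome": "YES", "initial_outcome": "NO", "analyses": [0.7]},
    # Resolved INVALID: excluded.
    {"final_outcome": "INVALID", "analyses": [0.4]},
    # Resolved NO but never analyzed: contributes no data points.
    {"final_outcome": "NO", "analyses": []},
]
pairs = list(scoreable_pairs(example_markets))
```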

Why public

This page exists as a credibility asset — for you, not for us. We could quietly discard this data or hide the page behind a paywall. We don't. If our Brier is worse than 0.25, you'll see it here in plain numbers, with the exact sample size so you can judge statistical significance yourself.

Calibration is not something you can fake long-term with cherry-picking: the Brier score is computed across all scored analyses in our database, not a curated subset.

Sample size caveats

We display the overall Brier score only after n≥20 resolved markets. Below 20, variance is high enough that the number is more noise than signal.
Per-category breakdowns require n≥5 per category. Categories with fewer resolved markets are hidden, not zeroed.
Every chart is labeled with its sample size. Small samples are noisy — treat them as directional indicators, not authoritative figures.
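The display thresholds above can be sketched as a small gating function. The function name, the input shapes, and the numbers in the example call are all made up for illustration; only the two thresholds (20 overall, 5 per category) come from this page.

```python
MIN_MARKETS_OVERALL = 20
MIN_MARKETS_PER_CATEGORY = 5


def displayable(overall, by_category):
    """Apply the display thresholds.

    `overall` is (brier, n_markets); `by_category` maps a category name to
    (brier, n_markets). Both shapes are assumptions for this sketch.
    """
    brier, n = overall
    shown_overall = {"brier": brier, "n": n} if n >= MIN_MARKETS_OVERALL else None
    shown_categories = {
        name: {"brier": b, "n": count}
        for name, (b, count) in by_category.items()
        if count >= MIN_MARKETS_PER_CATEGORY  # hidden, not reported as zero
    }
    return shown_overall, shown_categories


# Illustrative, made-up numbers.
shown_overall, shown_categories = displayable(
    (0.21, 34),
    {"politics": (0.19, 12), "sports": (0.30, 3)},
)
```

Keeping the sample size next to every displayed score means a chart can always be labeled with its n, as described above.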

What we do NOT do

We do not backfill historical predictions retrospectively. Every score on this page represents an analysis that was generated live — at the time a real user or the system ran it — and graded honestly against the eventual outcome.

We never go back and manufacture earlier analyses that happened to score well. If we ever change this policy, we will disclose it on this page first, including what changed and why.

Calibration curve

The scatter plot on the Track Record page groups predictions into probability buckets (e.g., 60–70%) and plots each bucket's predicted probability against how often those markets actually resolved YES. Points on the diagonal line are perfectly calibrated. Points above the diagonal mean YES happened more often than the AI predicted (under-confident); points below mean YES happened less often than predicted (over-confident).

Dot size is proportional to the square root of the bucket's sample size — bigger dots deserve more trust. Only buckets with at least 3 data points are plotted.
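A minimal sketch of how such a curve could be built is shown below. It assumes ten equal-width buckets (the 60–70% example suggests this, but the exact bucketing is an assumption), uses the square root of the count directly as the dot size with no scale factor, and the name `calibration_points` is invented for the example.

```python
import math


def calibration_points(pairs, n_buckets=10, min_count=3):
    """Bucket (predicted_probability, outcome) pairs for a calibration plot.

    Returns one point per bucket that has at least `min_count` data points.
    """
    buckets = [[] for _ in range(n_buckets)]
    for p, o in pairs:
        # A prediction of exactly 1.0 goes into the last bucket.
        idx = min(int(p * n_buckets), n_buckets - 1)
        buckets[idx].append((p, o))

    points = []
    for idx, items in enumerate(buckets):
        if len(items) < min_count:
            continue  # too few data points to plot
        points.append({
            "bucket": (idx / n_buckets, (idx + 1) / n_buckets),
            "mean_predicted": sum(p for p, _ in items) / len(items),
            "observed_yes_rate": sum(o for _, o in items) / len(items),
            "dot_size": math.sqrt(len(items)),  # proportional to sqrt(n)
            "n": len(items),
        })
    return points


# Made-up data: four predictions in the 60-70% bucket, two in 10-20%.
points = calibration_points(
    [(0.62, 1), (0.65, 1), (0.68, 0), (0.69, 1), (0.15, 0), (0.12, 0)]
)
```

In this example the 60–70% bucket resolved YES 75% of the time, so its point would sit above the diagonal, while the 10–20% bucket is dropped because it has only two data points.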
