The Brier Score is a standard evaluation metric for probabilistic predictions. It measures the mean squared error between a model's predicted probability and the actual binary outcome (1 = win, 0 = loss).
Formula: BS = (1/N) × Σ(predicted_probability − actual_outcome)²
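The formula can be sketched in a few lines of plain Python. The function name `brier_score` and the four sample games are illustrative assumptions, not part of any particular library or dataset.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical predictions for four games (1 = win, 0 = loss).
probs = [0.7, 0.4, 0.9, 0.5]
outcomes = [1, 0, 1, 0]

score = brier_score(probs, outcomes)
print(f"Brier Score: {score:.4f}")
```

The squared errors here are 0.09, 0.16, 0.01, and 0.25, so the mean works out to 0.1275.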
Interpretation:
- 0.0 = perfect prediction
- 0.25 = random (50% predictions for binary outcomes)
- Lower is better; typical skilled sports models score 0.22-0.24 for game outcomes
Why Brier Score over accuracy: Accuracy scores a correct pick the same whether the model was 51% or 99% confident. Brier Score penalizes incorrect high-confidence predictions more than incorrect low-confidence ones, which rewards well-calibrated probabilities over overconfident guesses.
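A small comparison may make the difference concrete. In this sketch, two hypothetical models make the same picks on the same four games, so their accuracy is identical, but one states 60% confidence and the other 99%. The helper names and the data are assumptions chosen for illustration.

```python
def accuracy(probs, outcomes, threshold=0.5):
    """Fraction of games where the thresholded pick matches the outcome."""
    picks = [1 if p > threshold else 0 for p in probs]
    return sum(pick == o for pick, o in zip(picks, outcomes)) / len(outcomes)


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical results: the favored side wins three of four games.
outcomes = [1, 1, 1, 0]
cautious = [0.60, 0.60, 0.60, 0.60]        # modest confidence
overconfident = [0.99, 0.99, 0.99, 0.99]   # near-certain every time

acc_cautious = accuracy(cautious, outcomes)
acc_overconfident = accuracy(overconfident, outcomes)
bs_cautious = brier_score(cautious, outcomes)
bs_overconfident = brier_score(overconfident, outcomes)

print(acc_cautious, acc_overconfident)   # same accuracy for both models
print(bs_cautious, bs_overconfident)     # the overconfident model scores worse
```

Both models are 75% accurate, yet the single confident miss costs the overconfident model enough to give it the higher (worse) Brier Score.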
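Decomposition: Brier Score decomposes into three terms: BS = reliability − resolution + uncertainty. Reliability is the calibration error (lower is better), resolution measures how far the outcome frequencies at each forecast level move away from the overall base rate (higher is better), and uncertainty = base_rate × (1 − base_rate) depends only on the outcomes. A model can improve BS by lowering reliability or raising resolution. A model that always predicts the base rate (50% when wins and losses are equally frequent) has zero calibration error but also zero resolution.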
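The standard three-term (Murphy) decomposition, BS = reliability − resolution + uncertainty, can be checked numerically. The sketch below groups predictions by their exact forecast value, which makes the identity hold exactly; binning continuous forecasts into ranges would make it approximate. The function name and the ten sample games are assumptions for illustration.

```python
from collections import defaultdict


def murphy_decomposition(probs, outcomes):
    """Return (reliability, resolution, uncertainty), grouping by distinct forecast value."""
    n = len(probs)
    base_rate = sum(outcomes) / n

    # Collect the observed outcomes for each distinct forecast value.
    groups = defaultdict(list)
    for p, o in zip(probs, outcomes):
        groups[p].append(o)

    reliability = 0.0
    resolution = 0.0
    for p, obs in groups.items():
        observed_freq = sum(obs) / len(obs)
        reliability += len(obs) * (p - observed_freq) ** 2          # calibration error
        resolution += len(obs) * (observed_freq - base_rate) ** 2   # separation from base rate
    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty


# Hypothetical forecasts: five games at 80%, five games at 30%.
probs = [0.8] * 5 + [0.3] * 5
outcomes = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0]

rel, res, unc = murphy_decomposition(probs, outcomes)
direct = sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)
print(rel, res, unc, rel - res + unc, direct)
```

In this example the 80% group is perfectly calibrated and the 30% group wins 20% of the time, so reliability is small while resolution carries most of the improvement over the uncertainty term.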
Brier Skill Score (BSS): Brier Score normalized against a baseline (e.g., always predicting the climatological frequency). BSS = 1 − (BS / BS_climatology). A BSS of 1 is perfect, 0 matches the baseline, and a negative value means the model is worse than the baseline.
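A minimal sketch of the BSS calculation follows. It assumes the climatological baseline is the win rate of the same sample being scored; in a real evaluation the base rate would more likely come from earlier seasons, so treat that choice, along with the function names and data, as an illustrative assumption.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs, outcomes):
    """BSS = 1 - BS / BS_climatology, with climatology = sample base rate (assumption)."""
    base_rate = sum(outcomes) / len(outcomes)
    bs_climatology = brier_score([base_rate] * len(outcomes), outcomes)
    if bs_climatology == 0:
        raise ValueError("baseline Brier Score is zero; BSS is undefined")
    return 1 - brier_score(probs, outcomes) / bs_climatology


# Hypothetical forecasts: five games at 80%, five games at 30%.
probs = [0.8] * 5 + [0.3] * 5
outcomes = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0]

bss = brier_skill_score(probs, outcomes)
print(f"BSS: {bss:.2f}")
```

Here the model's Brier Score is 0.165 against a baseline of 0.25, giving a BSS of about 0.34.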
In practice: Track Brier Score on held-out test data across multiple seasons. A Brier Score that holds steady or improves on out-of-sample data, and stays below the baseline, is strong evidence of genuine predictive skill.
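Per-season tracking could look something like the sketch below. The record layout `(season, predicted_probability, outcome)`, the function name, and the sample values are all hypothetical; the records are assumed to be held-out predictions made before each game was played.

```python
from collections import defaultdict


def brier_by_season(records):
    """Group (season, prob, outcome) records and return {season: Brier Score}."""
    squared_errors = defaultdict(list)
    for season, prob, outcome in records:
        squared_errors[season].append((prob - outcome) ** 2)
    return {
        season: sum(errs) / len(errs)
        for season, errs in sorted(squared_errors.items())
    }


# Hypothetical held-out predictions from two seasons.
records = [
    (2021, 0.6, 1),
    (2021, 0.7, 0),
    (2022, 0.8, 1),
    (2022, 0.3, 0),
]

season_scores = brier_by_season(records)
for season, score in season_scores.items():
    print(season, round(score, 3))
```

With only two games per season these numbers are far too noisy to mean anything; the point is the shape of the bookkeeping, which makes season-over-season drift easy to spot once real sample sizes are used.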
