We Graded 5 NBA Pundits Against the Box Score. Bill Simmons

One of the long-running annoyances of sports media is that pundits make confident predictions, the predictions vanish into the void, and three months later the same pundits make new confident predictions. Nobody scores them. We built Tout Tracker to score them. This post is the launch piece: five NBA sources, ninety days of mentions, real math, no model-said handwaving.

The numbers

Window: rolling 90 days ending 2026-05-15. A source must clear n ≥ 20 matched mention-game pairs in the window. Sources are sorted by empirical-Bayes shrunk lift, where raw lift is the average of (player's actual fantasy points - position baseline) × sign(source sentiment) across all matched mentions. Positive = the source's bullish picks beat the position baseline and their bearish picks fell short of it. Negative = the source was either confident in the wrong direction or systematically wrong on player evaluation.
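To make the definition concrete, here is a minimal Python sketch of the raw (pre-shrinkage) lift. The function names and the three sample pairs are invented for illustration, and treating a sentiment of exactly zero as contributing zero lift is an assumption, not something the production SQL is known to do.

```python
from statistics import mean


def sign(x: float) -> int:
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def pair_lift(actual_fp: float, position_baseline: float, sentiment: float) -> float:
    """Lift for one mention-game pair: (actual - baseline) x sign(sentiment)."""
    return (actual_fp - position_baseline) * sign(sentiment)


def raw_lift(pairs) -> float:
    """Average per-pair lift over (actual_fp, baseline, sentiment) tuples."""
    return mean(pair_lift(a, b, s) for a, b, s in pairs)


# Made-up pairs, not real Tout Tracker data.
pairs = [
    (34.0, 27.0, +0.8),  # bullish call, player beat baseline by 7  -> +7
    (21.0, 27.0, -0.5),  # bearish call, player fell 6 short        -> +6
    (23.0, 27.0, +0.6),  # bullish call, player fell 4 short        -> -4
]
print(raw_lift(pairs))  # 3.0
```

A bearish call on a player who underperforms earns positive lift, which is why the sign of the sentiment multiplies the gap rather than being added to it.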

Rank	Source	Type	n (90d)	Shrunk lift	95% CI
1	Thinking Basketball	YouTube	182	+8.01	[+4.10, +11.92]
2	Portland Trail Blazers (Official)	YouTube	107	+3.08	[+2.03, +5.08]
3	JxmyHighroller	YouTube	138	+2.14	[+0.37, +8.25]
4	The Bill Simmons Podcast	Podcast	226	-0.73	[-2.57, +0.82]

Four sources cleared. A fifth (Pro Football Focus, tagged NBA on a couple of crossover episodes) showed at n=2 and got correctly filtered as noise.

Why Thinking Basketball runs away with it

Ben Taylor's channel built its reputation on long-form, math-heavy player analysis. The lift number — +8 fantasy points above position baseline per mention — is exactly what you'd expect from a creator whose process is "find an underrated player, explain why the box score plus a few advanced stats vindicate them, watch the player's next month." The 95% confidence interval doesn't even touch zero. At n=182, the shrinkage prior barely moves him because the raw lift is so consistent.

Two caveats: (1) the bidirectional ±30d window means some of Taylor's "credit" comes from retrospective calls (player just did X, here's why Y) which is easier than forecasting; (2) this window is dominated by NBA playoffs, which is exactly when his depth-of-analysis content tends to hit. Reset expectations slightly when the regular season returns and the sample diversifies.

Why Bill Simmons sits below zero

The most popular sports podcast in America has the largest sample in our window (n=226) and the worst shrunk lift (-0.73). The number itself is small, roughly three-quarters of a fantasy point below baseline per mention, but he's the only one of the four under zero, the shrinkage barely moves him, and the trend across windows is consistent: his 30d, 90d, 365d, and all-time shrunk lifts are all -0.73, so the recent 30 days aren't pulling his career line down. This is steady-state.

The honest read isn't "Bill Simmons is bad at basketball." It's that his show is structured around narrative confidence (bold calls about Stars and Trade Markets and What This All Means), and at the player-level mention granularity we score on, narrative confidence underperforms math-heavy analysis. The CI (-2.57 to +0.82) doesn't exclude zero. We're not telling you he's a fade. We're showing you the two-decimal number and the sample behind it.

The methodology, briefly

The full SQL is in migration 20260601000050, but the moving parts:

Mention extraction. Every podcast / YouTube / Reddit RSS item gets transcribed (where applicable), passed to Claude Haiku, and player names get resolved against an alias table backed by player_feature_store. Result: a mentions row with (source_id, sport_key, entity_key, sentiment_score, confidence, prop_implication). 1,341 matched player mentions in the 90 days ending 2026-05-15.
Player-game bridge. Mention entity_key uses our internal player_id format (e.g. nba-4683689) and game logs use stats.nba.com IDs (e.g. 101108). We bridge through player_feature_store, which carries both — name-matched, 97% overlap. NFL bridges the same way through player_display_name.
Pair join. Each matched mention pairs with the player's games in a ±30-day window around episode-published-at. We compute (actual_fantasy_points - position_baseline) × sign(sentiment_score) per pair. Position baseline = average fantasy points across all players at that position in the window.
Aggregate + shrink. Per (source, sport, position, window_days) we average the per-pair lifts and apply empirical-Bayes shrinkage toward zero with a 50-observation pseudo-prior. Variance-aware: small-n sources get pulled to zero hard.
Confidence intervals. Standard Wald CIs on the shrunk mean using the per-source variance. The 95% bounds are the published ci_lo_95 and ci_hi_95 columns.
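The aggregate, shrink, and interval steps can be sketched in a few lines of Python. This is a simplified illustration, not the migration's SQL: it assumes the simplest shrinkage form, raw mean × n / (n + 50), and a symmetric Wald interval built from the sample standard deviation of the per-pair lifts. The production job is described as variance-aware and the published intervals are not all symmetric around the point estimate, so this sketch is not expected to reproduce the ci_lo_95 and ci_hi_95 columns.

```python
import math
from statistics import mean, stdev

PRIOR_N = 50   # pseudo-prior size stated in the methodology
Z_95 = 1.96    # normal quantile for a two-sided 95% interval


def shrink_toward_zero(raw_mean: float, n: int, prior_n: int = PRIOR_N) -> float:
    """Assumed shrinkage form: weight the raw mean by n / (n + prior_n)."""
    return raw_mean * n / (n + prior_n)


def wald_ci(center: float, sd: float, n: int, z: float = Z_95):
    """Symmetric Wald interval: center +/- z * sd / sqrt(n)."""
    half_width = z * sd / math.sqrt(n)
    return center - half_width, center + half_width


def score_source(pair_lifts):
    """Aggregate one source's per-pair lifts into a leaderboard-style row."""
    n = len(pair_lifts)
    raw = mean(pair_lifts)
    shrunk = shrink_toward_zero(raw, n)
    lo, hi = wald_ci(shrunk, stdev(pair_lifts), n)
    return {"n": n, "raw": raw, "shrunk": shrunk, "ci_lo_95": lo, "ci_hi_95": hi}


# Made-up lifts: 30 pairs averaging +4.0.
print(score_source([2.0, 4.0, 6.0] * 10))
```

Under this assumed form, a source with 50 pairs keeps half of its raw lift and a source with 200 pairs keeps 80%, which is the sense in which small-n sources get pulled toward zero hardest.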

Why this isn't the final word

Three honest limitations we'll close in the next two months:

Bidirectional window. The original Phase 1 design used a forward-only +14d window ("after the source talks, what happens next"). Production showed zero NBA pairs that way — most mentions sit just outside game data. We relaxed to ±30d, which mixes forecasting and commentary. Real forecasting-only lift will be a separate metric once the in-season corpus is dense enough.
Position baseline is sport-position level, not era-adjusted. A 2024-25 fantasy game vs a 2026 fantasy game both count toward the baseline. Pace and rule changes wash through. Defensible for a 90d window, less so for the all-time leaderboard column.
Explicit-pick hit rate is empty. Tout Tracker has a second column for "explicit prop hit rate" — when a source says "Wembanyama over 18.5 points" and the player did or didn't. Phase 1.5 wires the SQL but only one matched mention had a structured prop_implication that landed on a tracked pickem_lines row. That metric will fill in over the next 30 days as we backfill prop lines.
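The difference between the forward-only and bidirectional designs comes down to which games a mention is allowed to pair with. A minimal Python sketch, with hypothetical dates and a hypothetical helper name; whether the production window is inclusive at both ends, and whether a game on the publication day counts as "forward", are assumptions here.

```python
from datetime import date, timedelta


def games_in_window(published: date, game_dates, days_before: int, days_after: int):
    """Return the game dates that fall inside the window around `published`."""
    start = published - timedelta(days=days_before)
    end = published + timedelta(days=days_after)
    return [g for g in game_dates if start <= g <= end]


# Made-up episode date and game log.
published = date(2026, 5, 1)
games = [date(2026, 4, 10), date(2026, 4, 28), date(2026, 5, 3), date(2026, 5, 20)]

# Phase 1 design: only games in the 14 days after the episode.
forward_only = games_in_window(published, games, days_before=0, days_after=14)

# Relaxed design: 30 days on either side of the episode.
bidirectional = games_in_window(published, games, days_before=30, days_after=30)

print(len(forward_only), len(bidirectional))  # 1 4
```

Every pair the bidirectional window adds before the publication date is commentary rather than forecasting, which is why a forecasting-only metric needs the narrower filter.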

Read this leaderboard yourself

The live page is at /tout-tracker. Switch sport, window, and position. The "fade chip" on player /tout-tracker pages quotes a source's shrunk lift inline whenever the player has a recent mention from a source that's cleared the n≥20 threshold. We'll add NFL sources in September when the season window opens.

If a source disagrees with their score and wants the per-mention game-pair list that produced it, we'll send it. The math should hold up to scrutiny — that's the whole point.

Build your own edge

Scoring pundits is the same discipline as scoring models. Spin up a model in the model builder, backtest it against completed seasons in the workshop, and see where it lands against the field on the model leaderboards.

How to read a lift number without overreading it

A leaderboard like this is most dangerous to the reader who takes the top line and turns it into a betting rule: "Thinking Basketball is +8, so I'll blindly tail every player he's bullish on." That collapses two things the number deliberately keeps separate. Lift measures whether a source's directional reads have historically tracked above- or below-baseline player performance in a window. It is not a price. A source can be excellent at identifying who is undervalued and still leave you no edge if the market has already moved the player's line to fair by the time you act on it. The pundit's job is finding the player; the bettor's job is finding the player before the number reflects it. Those are different skills and this leaderboard only grades the first one.

The second trap is treating a single window's ranking as a stable ranking. Empirical-Bayes shrinkage is the right tool precisely because small samples lie — a source with a hot 25-mention stretch will look elite until the prior pulls them back toward the field, and a wide confidence interval (look at JxmyHighroller's, which spans from "barely positive" to "very good") is the system telling you it doesn't yet know. Read the interval before the point estimate. When the lower bound of the CI is comfortably above zero, you have a source whose edge has survived the shrinkage stress-test. When the interval straddles zero, you have a source the data can't distinguish from a coin, no matter how confident the point estimate looks. Rank by the bound you'd bet on, not the headline you'd screenshot.
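Reading the interval before the point estimate is easy to mechanize. The Python sketch below re-sorts the four published rows by the lower bound of the 95% CI and labels each interval; the `verdict` helper and its wording are illustrative, not part of Tout Tracker.

```python
# (source, shrunk lift, ci_lo_95, ci_hi_95) copied from the table above.
rows = [
    ("Thinking Basketball", 8.01, 4.10, 11.92),
    ("Portland Trail Blazers (Official)", 3.08, 2.03, 5.08),
    ("JxmyHighroller", 2.14, 0.37, 8.25),
    ("The Bill Simmons Podcast", -0.73, -2.57, 0.82),
]


def verdict(ci_lo: float, ci_hi: float) -> str:
    """Classify an interval by where it sits relative to zero."""
    if ci_lo > 0:
        return "positive"
    if ci_hi < 0:
        return "negative"
    return "indistinguishable from zero"


# Rank by the conservative end of the interval, not the headline number.
ranked = sorted(rows, key=lambda r: r[2], reverse=True)
for name, lift, lo, hi in ranked:
    print(f"{name}: lower bound {lo:+.2f}, lift {lift:+.2f} -> {verdict(lo, hi)}")
```

In this window the lower-bound ordering matches the point-estimate ordering, but the gap between second and third place widens considerably once the width of each interval is taken into account.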

Finally, watch for the difference between a source's lift drifting and the underlying players regressing. A creator who built a positive lift on a handful of breakout players will see that lift decay as those players mean-revert toward their new, higher baseline — not because the creator got worse, but because the easy asymmetry they exploited is gone. This is why we report lift per window and keep the all-time column visible: a source whose 30-day, 90-day, and 365-day numbers agree is showing you a process; a source whose recent window towers over their career line is showing you variance you should discount.
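One rough way to apply the "do the windows agree" test is to look at the spread between a source's best and worst window. In the Python sketch below, the tolerance of 1.0 fantasy points is an arbitrary illustrative threshold and the hot-streak source is invented; only the Simmons figures come from this post.

```python
def windows_agree(lifts_by_window: dict, tolerance: float = 1.0) -> bool:
    """True when the spread between the highest and lowest window is small."""
    values = list(lifts_by_window.values())
    return max(values) - min(values) <= tolerance


# Shrunk lifts reported above for The Bill Simmons Podcast.
simmons = {"30d": -0.73, "90d": -0.73, "365d": -0.73, "all": -0.73}

# Invented source whose recent window towers over its career line.
hot_streak = {"30d": 6.5, "90d": 3.0, "365d": 0.9, "all": 0.4}

print(windows_agree(simmons))     # True
print(windows_agree(hot_streak))  # False
```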

Retrospective vs. forecasting — the distinction that separates analysts from touts

The single most important caveat in our methodology, and the one most worth internalizing as a media consumer, is the gap between explaining what already happened and predicting what will. Our bidirectional ±30-day window mixes the two by necessity — the in-season corpus isn't yet dense enough to isolate forward-only calls — and that means part of any source's credit comes from commentary that was easy: a player just had a monster month, and the analyst walks you through why the advanced numbers support it. That's genuinely useful content. It is not the same as having told you a month earlier that the breakout was coming.

This matters because the two failure modes of sports media map cleanly onto this split. The analyst's failure mode is hindsight dressed as insight — confident retrospective narration that sounds predictive but never put a stake in the ground before the outcome. The tout's failure mode is the opposite: loud forward calls with no scorekeeping, so the hits get screenshotted and the misses get deleted. A trustworthy source survives both tests. They make calls you can timestamp before the result, and they let you check the full ledger, not the highlight reel. When the in-season window fills out, forecasting-only lift becomes its own column for exactly this reason — it's the harder, more honest metric, and it's the one that separates a real edge from good post-game television.

As a reader, you can apply this filter today without any leaderboard: when a pundit makes a player call, ask whether it would have been falsifiable the day they said it. "He's a buy-low" before a quiet stretch is a forecast. "He's clearly elite" after a 40-burger is narration. Both can be entertaining; only one is predictive, and only the predictive one is worth weighting when you build a betting view.

What makes a scoring system trustworthy

The reason we publish the migration SQL, the sample sizes, the confidence intervals, and the offer to hand any source their per-mention game-pair list is not transparency theater — it's the actual product. Anyone can publish a leaderboard. The hard part, and the part most accountability projects skip, is making the methodology adversarially auditable: a source who disputes their score should be able to reconstruct it, find the mentions that hurt them, and either accept the number or show us where the resolution logic mislabeled a sarcastic take as bearish. A score you can't contest isn't a score, it's an opinion with decimals.

Three properties separate a defensible accountability metric from a vanity stat. First, it has to be pre-registered in its rules — the definition of lift, the shrinkage prior, the n-threshold, and the window were fixed before we looked at who landed on top, so the ranking isn't reverse-engineered to embarrass anyone in particular. Second, it has to punish small samples, because the easiest way to fake a leaderboard is to cherry-pick the source's best 20 takes; shrinkage toward zero with a fixed pseudo-prior is the structural defense against that. Third, it has to publish its own uncertainty — a confidence interval that crosses zero is the system admitting it can't yet make a claim, and a system willing to say "we don't know" about a popular source is more credible on the ones where it does make a claim.

Apply the same three tests to anyone selling you picks. Are the rules fixed in advance, or does the win-rate denominator quietly shift to flatter the record? Does the sample punish luck, or is the pitch built on a recent heater? Does the source ever publish a loss or a "we don't know," or is the feed a wall of green checkmarks? The value of a system like this isn't that it crowns a winner this window — it's that it gives you a template for distinguishing measured edges from confident noise, which is the only durable skill in this entire space.

NBA example board

Use the named prop board instead of a generic “good matchup” note. Nikola Jokic assist and rebound props should start with touch volume and whether Denver is using him as a hub. Shai Gilgeous-Alexander points props should start with free-throw equity, opponent rim pressure, and whether the market has already priced his usage. Luka Doncic PRA props, Jayson Tatum three-point volume, and Victor Wembanyama blocks or rebounds each need different inputs even when the headline market looks similar.

Jokic assists: check teammate shooting availability, pace, and whether the defense sends help early.
Shai points: separate true usage from a public star tax when the Thunder are heavily favored.
Doncic PRA: watch blowout risk because rebounds and assists can disappear before points do.
Tatum threes: price attempts, not only make rate, especially against switch-heavy defenses.
Wembanyama blocks and rebounds: account for opponent rim attempts, foul risk, and minute stability.

How to keep NBA examples from going stale

Recheck the Celtics, Thunder, Nuggets, and Spurs context before acting because rotations move quickly around rest, injuries, and playoff leverage. The example is still useful if the player changes teams or the line changes, as long as the input stays explicit: minutes, usage, pace, matchup, and price. Pair this with reading NBA player props and NBA prop market structure when you need a deeper prop workflow.

Sport-specific model signals

Use names as evidence, not decoration. Josh Allen, Ja'Marr Chase, Bijan Robinson, and Puka Nacua, like the Chiefs, Bills, Eagles, and Lions, belong inside decisions, thresholds, and internal links rather than in a keyword list.

Prop EV example: Luka Doncic points or PRA at 32.5 should be checked against projected minutes, usage without key teammates, pace, spread, and back-to-back fatigue before price.
MLB: a Dodgers at Rockies first-five total of 5.5 should account for starter xFIP, K-BB%, handedness, Coors Field run environment, wind, bullpen rest, and umpire zone.
NHL: a Maple Leafs puck-line price at +160 needs confirmed goalie, 5v5 expected-goal share, special-teams edge, and empty-net probability before the margin bet makes sense.
UFC: an Islam Makhachev-style grappling favorite needs takedown entries, control time, get-up rate, and submission exposure; an Alex Pereira-style striker needs knockdown equity and round-by-round cardio risk.
DFS value example: NBA showdown builds need projected minutes, usage, salary, ownership, and late-swap flexibility before a star salary is worth paying.
Stack example: an NBA same-game entry with Doncic points, teammate assists, and opponent threes needs one coherent pace script instead of three unrelated legs.

The goal is not to mention every star. It is to show how the model changes when the example changes from Doncic to Shohei Ohtani, Igor Shesterkin, Connor McDavid, or Tom Aspinall. Revisit and update the board when lineups, minutes, starters, goalie confirmations, weigh-ins, or market prices change.

Research note board

Use this table to turn the guide into a decision note. The point is to know when the idea is actionable and when it is only context.

Angle	Input to verify	Example application	Pass when
Market price	Spread, total, moneyline, prop price, or futures hold	Chiefs and Bills compared through hold	The price has moved past the number that created the edge
Football or sport context	Role, pace, weather, injury status, opponent style	Josh Allen role news mapped to the relevant market	The original input changes or remains unconfirmed
Review loop	Entry, close, result, and reason code	Closing line value logged with a clear thesis	You cannot explain whether the process beat the market

Educational analysis only, not a bet recommendation. Check current lines, injuries, rules, contest terms, and local regulations before acting.

Average total points by weather bucket

Average combined points scored in NFL games by weather bucket over recent seasons. Wind above 20 mph and snow each clip totals by 6-8 points compared with domed games, which is why books move totals aggressively when forecasts shift.

NFL ATS cover-margin distribution

Distribution of (final margin − closing spread) across an NFL season. Roughly normal with mean ≈ 0 and standard deviation ≈ 13 points, which is why most ATS edges live in the ±1.5 point window.
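Taking the normal approximation in that caption at face value (mean 0, standard deviation 13), a short Python calculation shows how much of the distribution sits inside the ±1.5 point window. This is a sketch of the approximation only; it does not model how real NFL margins are distributed.

```python
import math


def normal_cdf(x: float, mean: float = 0.0, sd: float = 1.0) -> float:
    """Cumulative distribution function of a normal(mean, sd)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def prob_within(half_width: float, sd: float = 13.0) -> float:
    """P(|margin - spread| <= half_width) under a normal(0, sd) model."""
    return normal_cdf(half_width, sd=sd) - normal_cdf(-half_width, sd=sd)


print(f"{prob_within(1.5):.1%}")
```

Under these assumptions roughly 9% of games finish within a point and a half of the closing spread, so that window is a small share of outcomes; a half-point of line value only changes the result of a bet in that narrow band.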

Frequently asked questions

How is "lift" defined here?

For every matched player mention we find the player's games in a ±30-day window around the episode, compute (actual fantasy points - position baseline) × sign(sentiment) for each mention-game pair, then average the per-pair values per source. Empirical-Bayes shrinkage toward zero with a 50-observation pseudo-prior keeps small-sample sources from dominating. Positive = the source's bullish calls beat baseline and their bearish calls underperformed.

Why is the sample only five sources?

Two filters cut the long tail: every source needs n ≥ 20 matched mention-game pairs (most podcasts have plenty), and every player name must bridge through player_feature_store to nba_player_game_logs. 433 of 447 NBA player names in our feature store overlap with stats.nba.com names — 97% — but the ~3% gap drops sources that talked exclusively about edge-case players. Expect 15+ sources to clear the bar once we backfill playoff data and the next month of episodes.

Is +8.01 lift for Thinking Basketball really that good?

It's genuinely high, but read the caveat: lift is measured in fantasy points per game relative to the position baseline, signed by the direction of his call. NBA position baselines sit in the 25-30 fantasy-point range, so +8 lift means the players he's positive on score roughly 30% above the average for the rest of their position in the matching window. That's a real signal at n=182. The 95% confidence interval (after shrinkage) is +4.1 to +11.9, both bounds well above zero.
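The "roughly 30%" figure is the lift divided by the baseline. A quick Python check across the stated 25-30 baseline range (the helper name is illustrative):

```python
def relative_lift_pct(lift: float, baseline: float) -> float:
    """Lift expressed as a percentage of the position baseline."""
    return 100.0 * lift / baseline


for baseline in (25.0, 30.0):
    print(baseline, round(relative_lift_pct(8.01, baseline), 1))
```

That gives about 32% at a baseline of 25 and about 27% at a baseline of 30, which brackets the 30% quoted above.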

Why is Bill Simmons "underwater" when his shrunk lift is only -0.73?

Compared to the other three sources, who all sit positive, he's the only one with negative shrunk lift in this window. His sample (n=226) is the largest, so the shrinkage barely moves him. -0.73 fantasy points below baseline per mention isn't a disaster; it's a slight, persistent negative edge that compounds over hundreds of takes. The 95% CI is -2.6 to +0.8, so we can't rule out zero, but he's the clear bottom of the named sample.

How often does this leaderboard update?

Every 6 hours. compute-source-accuracy reads from a materialized view that gets refreshed at :05 past each 6h tick, the score writer runs at :10, and the player-level fade features compute at :20. New podcast episodes hitting the corpus typically show up in the leaderboard within 12 hours of being ingested and resolved.
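As a rough illustration of that cadence, the Python sketch below computes when the next full cycle's three steps would run. It assumes the 6-hour ticks fall at 00:00, 06:00, 12:00, and 18:00 UTC, which this post does not state, and the dictionary keys are hypothetical labels rather than real job names.

```python
from datetime import datetime, timedelta, timezone

# Minutes past each 6h tick, per the schedule described above.
OFFSETS_MIN = {"refresh_view": 5, "write_scores": 10, "fade_features": 20}


def next_tick(now: datetime) -> datetime:
    """Start of the next 6-hour block after the one containing `now`."""
    current = now.replace(hour=(now.hour // 6) * 6, minute=0, second=0, microsecond=0)
    return current + timedelta(hours=6)


def next_cycle(now: datetime) -> dict:
    """Run time of each step in the next full cycle."""
    tick = next_tick(now)
    return {step: tick + timedelta(minutes=m) for step, m in OFFSETS_MIN.items()}


now = datetime(2026, 5, 15, 7, 30, tzinfo=timezone.utc)
for step, when in next_cycle(now).items():
    print(step, when.isoformat())
```

With a 6-hour cadence, an episode resolved just after one cycle waits most of six hours for the next, which fits comfortably inside the 12-hour figure quoted above.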