Sports data used to be a closed shop. The lines you saw on TV came from data feeds that cost five figures a year, the play-by-play that powered every model lived behind paywalls or in scraped HTML, and the player tracking data that made the modern game legible was locked inside league partnerships. The story of the last decade is the steady opening of that shop. Today an individual modeler can stand up a respectable stack of free NFL data sources in an afternoon, refresh it daily without owning a server, and feed a model that runs entirely in the browser. This handbook is a tour of that stack: what is free, what is fragile, what is licensable, and where the seams are between a hobbyist setup and something you can build a real product on. We will end with the question that eventually finds every successful builder: when is it time to graduate to Python and SQL, and how do you make the leap without throwing away the work that got you here?
The free NFL data landscape in 2026
There are four sources every serious NFL modeler should know by name and one that almost everyone uses without realizing it. The four are nflverse, ESPN's public scoreboard JSON, NFL Next Gen Stats, and Pro Football Reference. The fifth is your sportsbook's own line feed, which is technically your data when you are placing the bet. The mix of these five plus a couple of optional licensable feeds is enough to power every model on Shark Snip's leaderboards today.
nflverse — the canonical public dataset
nflverse is a community-maintained collection of NFL datasets published as free GitHub releases. It includes play-by-play back to 1999, weekly player stats, rosters, snap counts, draft history, depth charts, and a growing catalog of injury and combine data. The play-by-play release is roughly 50 MB compressed per season and is updated within hours of each game ending. Most public NFL models on GitHub touch it, and most published models on Shark Snip pull at least one nflverse table.
The pros: complete, free, well-documented, redistributable under the project's open license, and consumable as either CSV or Parquet. The cons: it is a community project with no commercial SLA, the schema occasionally changes between seasons, and there is no streaming feed — you fetch a snapshot and you fetch again next time. For a daily-refresh browser pipeline that is exactly the right shape, but for live in-game markets you need a faster source.
ESPN public JSON — fast, free, legally fragile
ESPN exposes scoreboards, box scores, schedules, and team pages as undocumented JSON endpoints. They are unauthenticated and they are quick. A typical scoreboard payload is under 200 KB and updates within seconds of a play ending. Many consumer-facing sports apps built in the last five years have used these endpoints at some point.
The catch is licensing. ESPN publishes no terms for these endpoints because they were never officially documented as a developer API. Hobbyist use is widely tolerated; commercial use is not. If you build a product on top of ESPN public JSON and the product gets noticed, you should expect a polite cease-and-desist. Use ESPN for development, validation, and personal models, and for any product you charge for, swap in a licensed feed before launch.
NFL Next Gen Stats — tracking, but not yours
The NFL publishes weekly tracking aggregates at the Next Gen Stats site: average separation for receivers, time to throw for quarterbacks, snap-weighted route depths, and a few dozen others. These are the only public window into the league's full tracking-data pipeline; the raw player x/y coordinates are tightly held by official partners.
The aggregates are published as HTML tables that are not designed for programmatic consumption. They scrape easily but the terms of use are unambiguous: NFL content is not licensed for redistribution. Treat Next Gen Stats the same way you would treat any league site — useful for your own modeling, off-limits for any product you plan to sell. If you need licensed tracking, the commercial vendors are SIS, PFF, and TruMedia.
Pro Football Reference — historical depth, scrape carefully
Pro Football Reference is the gold standard for historical NFL data. Box scores back to 1920, play indices, advanced stats, college crossover data, coaching trees — almost everything you would want for a long-horizon study lives there. The catch is in the robots.txt: a 3-second crawl delay, and the play index is explicitly disallowed. The site is also clear that bulk scraping for redistribution is forbidden.
For one-off historical lookups, manual queries through the site or the official downloadable CSVs are fine. For automated daily ingestion, treat PFR as off-limits and prefer nflverse, which derives most of the same fields legally from the play-by-play feed. The companion guide on polite scraping walks through how to read a robots.txt and how to set a User-Agent that identifies your project to the source.
Beyond the NFL — quick tour of the other major sports
The same shape repeats across every major league: one canonical public dataset, one fast-but-fragile public API, one or two licensable commercial feeds. Below is the short version for the rest of the stack you will need to cover NBA, MLB, and NHL.
NBA — nba_api and the league's own JSON
The unofficial nba_api Python wrapper is the canonical entry point for NBA data. It wraps the league's stats.nba.com endpoints and gives you box scores, play-by-play, player tracking aggregates, shot charts, and lineup data going back two decades. The library is heavily used and well-maintained (over 1,600 stars on GitHub at the time of writing), and the underlying endpoints are tolerated for non-commercial use the same way ESPN's are.
For a browser pipeline you can hit the same endpoints directly with fetch, provided they return JSON and allow cross-origin (CORS) requests from your page. The Shark Snip build canvas ships an NBA Schedule block that uses these endpoints under the hood and caches responses for 24 hours so you do not double-poll on rebuild.
MLB — pybaseball and the official Stats API
Baseball is the best-instrumented sport on the planet. pybaseball pulls from Baseball Savant's Statcast, Baseball Reference, and FanGraphs, giving you pitch-by-pitch detail since 2008, roughly 30 GB total if you pull every pitch. For browser ingestion, the MLB Stats API is freely accessible and does not require a key; Baseball Savant's CSV exports are also free but have a community-respected daily cap.
For modeling, pitch-level Statcast is overkill for a moneyline model and exactly right for a pitcher prop model. The companion guide on building a CSV-to-model pipeline with no backend uses pitch-by-pitch totals as the worked example because the schema is rich enough to demonstrate every common feature engineering pitfall.
NHL — the league's open API
The NHL operates an open JSON API at api-web.nhle.com, documented mainly by the community. Schedules, box scores, play-by-play, and shift data are all available without a key and with reasonable rate limits. The community R package hockeyR wraps it and is the easiest entry point for analysts; for a browser pipeline you can hit the API directly and parse the JSON.
NHL coverage on Shark Snip uses these endpoints exclusively for game-level data, with line shopping handled by the same odds APIs as the other sports. The seam between league data and book data is clean because the league does not publish lines.
Browser-first ingestion: CSV upload to validated blocks
Once you have identified the sources, the next question is how to get rows out of them and into something a model can consume. The traditional answer is a Python script on your laptop or a server-side ETL pipeline. The browser-first answer is a CSV upload, a schema validator, and a typed block. Same logic, no infrastructure.
The minimal pipeline
The smallest useful pipeline has four stages: fetch, parse, validate, store. In a browser-first stack each stage is a function or a block. Fetch is a fetch() call to a CSV URL or a file picker for a local upload. Parse is a streaming CSV parser like Papa Parse that hands you rows without loading the whole file into memory. Validate is a schema check that refuses files missing required columns. Store is either IndexedDB for ephemeral state or a Supabase table for durable shared state.
The Shark Snip build canvas hides this entire pipeline behind a single Upload CSV button, but the underlying decomposition is the same. When you drop a file, the canvas runs the parser, runs the schema validator against the registered Data block, and either accepts the rows or shows you exactly which columns are missing or wrongly typed. That validation step is the part most homemade pipelines skip and the part that most homemade pipelines later regret skipping.
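The four stages can be sketched as four small functions. This is a minimal illustration under stated assumptions, not the canvas's actual implementation: it is written in Python for brevity, an in-memory SQLite table stands in for IndexedDB or Supabase, the fetch stage is represented by a string that has already been downloaded, and the column names (`game_id`, `posteam`, `epa`) are assumed to follow nflverse play-by-play conventions.

```python
import csv
import io
import sqlite3

# Hypothetical contract: column names follow nflverse play-by-play
# conventions, but treat them as assumptions for this sketch.
REQUIRED_COLUMNS = ("game_id", "posteam", "epa")


def parse(text):
    """Parse: yield one dict per CSV row instead of building a full list."""
    yield from csv.DictReader(io.StringIO(text))


def validate(rows, required=REQUIRED_COLUMNS):
    """Validate: refuse the whole file if a required column is missing."""
    rows = iter(rows)
    first = next(rows, None)
    if first is None:
        raise ValueError("file has no data rows")
    missing = [column for column in required if column not in first]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    yield first
    yield from rows


def store(rows, conn):
    """Store: insert validated rows and return the table's row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS plays (game_id TEXT, posteam TEXT, epa REAL)"
    )
    conn.executemany("INSERT INTO plays VALUES (:game_id, :posteam, :epa)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM plays").fetchone()[0]


def run_pipeline(csv_text, conn):
    """Fetch is represented by csv_text, e.g. the body of a fetch() response."""
    return store(validate(parse(csv_text)), conn)
```

Because validation raises before the first row is yielded, a file with a missing column never reaches the store stage.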
Schema validation as the contract
The single most useful habit you can adopt is treating schema validation as a contract. Define the columns and types your downstream code assumes, and refuse files that violate the contract at the boundary. The cost is a few minutes writing a Zod or Valibot schema; the saving is the entire category of "my model retrained but the column was renamed and now everything is wrong" bugs.
The block builder enforces this for you because every Data block declares its outputs as TypeScript types. If a CSV does not match, the block flips to a red error state and the canvas refuses to wire downstream consumers. If you are building outside the block builder, copy the pattern: validate at the source, never inside the model.
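Outside the block builder, the contract can be as small as a dictionary of column names and types. The sketch below is a hand-rolled stand-in for a Zod or Valibot schema, written in Python; the `SCHEMA` contents and the `check_row` helper are hypothetical names chosen for illustration.

```python
# Hypothetical contract for one play-by-play row.
SCHEMA = {"game_id": str, "week": int, "epa": float}


def check_row(row, schema=SCHEMA):
    """Return a typed copy of row, or raise ValueError naming every problem."""
    problems = []
    typed = {}
    for column, cast in schema.items():
        if column not in row or row[column] in (None, ""):
            problems.append(f"{column}: missing")
            continue
        try:
            typed[column] = cast(row[column])
        except (TypeError, ValueError):
            problems.append(
                f"{column}: expected {cast.__name__}, got {row[column]!r}"
            )
    if problems:
        raise ValueError("; ".join(problems))
    return typed
```

Collecting every problem before raising mirrors the canvas behavior described above: the user sees all missing or wrongly typed columns at once rather than fixing them one upload at a time.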
Daily refresh patterns without a backend
Polling for fresh data is the part of a pipeline that traditionally requires a server. Cron, systemd timers, and Airflow all assume a long-running process somewhere. The browser-first stack solves this with three patterns: scheduled cloud functions that the browser reads from, scheduled jobs inside the database, and on-demand refresh when the user opens the tab.
Cloudflare Workers with cron triggers
Cloudflare's Workers free plan includes 100,000 requests per day and supports scheduled triggers via cron syntax. A Worker that fetches the nflverse weekly play-by-play release every morning at 6 AM Eastern (cron triggers are evaluated in UTC, so convert the hour), parses it, and writes the rows to Supabase or to a KV store costs nothing and runs forever. The browser app then reads the freshly written rows on next load with a normal fetch.
The pattern is durable because the Worker has no state of its own; it is a stateless function that runs on a schedule. If a run fails, the next run fetches the same snapshot again, so nothing is lost as long as the writes are idempotent upserts; if you change the schema, you redeploy the Worker without losing any data. The Shark Snip data layer uses this pattern for nflverse weekly refreshes and for The Odds API hourly polls.
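What makes a stateless scheduled job safe to re-run is that every write is keyed, so repeating a run overwrites rows rather than duplicating them. A minimal sketch of that idea, in Python with SQLite standing in for Supabase or KV; the `refresh` function, the `fetch_snapshot` callable, and the table layout are assumptions for illustration, not the Shark Snip data layer.

```python
import sqlite3


def refresh(conn, fetch_snapshot):
    """One scheduled run: fetch the full snapshot and upsert it by primary key."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS plays ("
        "game_id TEXT, play_id INTEGER, epa REAL, "
        "PRIMARY KEY (game_id, play_id))"
    )
    rows = fetch_snapshot()  # iterable of (game_id, play_id, epa) tuples
    # INSERT OR REPLACE makes the write idempotent: re-running the job with
    # the same snapshot leaves the table unchanged.
    conn.executemany(
        "INSERT OR REPLACE INTO plays (game_id, play_id, epa) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM plays").fetchone()[0]
```

A run that crashes halfway needs no recovery logic; the next run simply writes the same keys again, including any corrected values the source has published since.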
Supabase scheduled functions via pg_cron
If your data already lives in Supabase, the simpler pattern is pg_cron running inside the database. You write a SQL function that fetches and upserts rows, you schedule it with one line, and Postgres runs it on the cadence you specify. No Worker, no extra service.
The tradeoff is that pg_cron jobs run inside the database connection, so long-running fetches can hold a connection longer than you want. For light hourly polls of small endpoints it is the cleanest option; for heavy multi-megabyte pulls, a Cloudflare Worker is more polite to the database.
Browser-triggered refresh with smart caching
The third pattern is the lowest-infrastructure of all: refresh data when the user opens the tab, but only if the cached copy is stale. A 30-line wrapper around fetch() that checks IndexedDB for a recent copy and only re-fetches if the timestamp is old gives you a self-refreshing pipeline with no cron at all. It works well for personal models you check daily and badly for shared products where the data needs to be current even when no one is looking. The companion guide on daily refresh workflows in the browser compares all three patterns side by side and shows when each one wins.
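The stale-check wrapper is short enough to show in full. This sketch keeps the logic and swaps the browser pieces for injectable stand-ins: a plain dict plays the role of IndexedDB, and `fetch` is any callable that takes a URL and returns a body. The names `make_cached_fetch`, `store`, and `clock` are hypothetical.

```python
import time


def make_cached_fetch(fetch, ttl_seconds, store=None, clock=time.time):
    """Wrap fetch so a URL is re-fetched only when its cached copy is stale."""
    store = {} if store is None else store

    def cached_fetch(url):
        entry = store.get(url)
        now = clock()
        # Serve the cached body while it is younger than the TTL.
        if entry is not None and now - entry["fetched_at"] < ttl_seconds:
            return entry["body"]
        body = fetch(url)
        store[url] = {"body": body, "fetched_at": now}
        return body

    return cached_fetch
```

Passing the clock in as a parameter keeps the wrapper testable: a test can advance time without sleeping.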
Ethical scraping: the rules that keep your pipeline alive
Most modeling-related data sources are happy to be polled politely and unhappy to be hammered. The difference between a pipeline that runs for years and a pipeline that gets blocked in a week is almost entirely about how it identifies itself and how often it asks.
Read robots.txt and respect it
robots.txt is the source operator's documented preference about how their site should be crawled. It is not legally binding everywhere, but ignoring it is a fast path to being blocked and a slow path to a lawsuit. The Pro Football Reference example above is typical: a published crawl delay, a list of disallowed paths, and a clear preference. Read it, follow it, and your pipeline survives.
For sources without a robots.txt, default to one request per endpoint per minute and slow down further if the source pushes back (rate-limit headers, 429 responses, suddenly slower responses). Identify yourself with a User-Agent like YourProjectName/1.0 (contact: you@example.com) so the source operator can ping you instead of banning your IP block.
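Python's standard library can read a robots.txt for you. The sketch below parses a made-up file (not any real site's current rules), checks whether a path is allowed, and falls back to the one-request-per-minute default when no crawl delay is published. `load_rules` and `polite_delay` are hypothetical helper names; `RobotFileParser` is the standard-library class.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "YourProjectName/1.0 (contact: you@example.com)"

# Made-up robots.txt for the sketch; not any real site's current file.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 3
Disallow: /play-index/
"""


def load_rules(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_delay(parser, user_agent=USER_AGENT, default_seconds=60):
    """Seconds to wait between requests: the published delay, else the default."""
    delay = parser.crawl_delay(user_agent)
    return default_seconds if delay is None else delay
```

Before each request, call `rules.can_fetch(USER_AGENT, url)` and skip the URL when it returns False, then sleep for `polite_delay(rules)` seconds between requests.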
Cache aggressively, batch when possible
Every request you do not make is a request that cannot annoy the source. If you need today's NFL scoreboard, fetch it once and cache it for the rest of the day. If you need every team's roster, prefer a single bulk endpoint over thirty-two single-team endpoints when the source offers one. The Shark Snip data layer caches every external request for at least an hour by default and surfaces a cache-bypass option only in the Workshop UI for explicit refreshes.
Know when to license
The signal that you should stop scraping and start licensing is not legal — it is product. As soon as your model becomes a paid product, the cost of a feed is dwarfed by the legal risk of building on tolerated-but-unlicensed scraping. MySportsFeeds, SportsDataIO, and The Odds API all offer commercial tiers that are affordable for solo developers; pay them and sleep better. The published marketplace models that scale beyond a hobby use licensed feeds underneath, even when their first iteration was scraped.
From raw data to a feature store
Raw data is not what a model consumes. A model consumes features — engineered, time-aware, validated columns that are the same shape every week. The difference between raw data and features is the most common place where a homemade pipeline goes wrong, and the place where a feature store earns its keep.
What a feature store actually is
A feature store is two things: a typed schema for the columns your models consume, and a versioned history of how those columns were computed. The first part lets you ship a model and know it will keep getting the same shape of input next week. The second part lets you reproduce a model's training data months later — necessary the first time you need to debug why a published model went cold.
You do not need a fancy feature store to start. A Postgres table with a primary key of (game_id, as_of_timestamp) and a column per feature is a feature store. A Parquet file partitioned by season is a feature store. The discipline is the part that matters: every row is timestamped, every feature has a clear definition, and the computation that produced the row is reproducible from the raw inputs.
The as-of-timestamp rule
Every feature in a betting model needs an as-of timestamp that says when the value was knowable in the real world. A "team's offensive EPA over last 8 games" computed on Tuesday for Sunday's game uses a different value than the same feature computed Friday after a Thursday-night game. The as-of timestamp is what lets you avoid look-ahead bias — the modeler's most common self-inflicted wound.
The Shark Snip block library bakes the as-of rule into every Feature block; the canvas refuses to wire a feature whose as-of date is after the prediction kickoff. If you build a feature store outside the block library, copy the rule: every row carries the timestamp at which it was knowable, and every join is keyed on (entity, as_of_timestamp <= kickoff).
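The join rule is easiest to see on a tiny history. In this sketch each feature row carries the timestamp at which it became knowable, and the lookup returns the latest row at or before kickoff. The row layout and the `feature_as_of` name are assumptions for illustration.

```python
from datetime import datetime


def feature_as_of(history, team, kickoff):
    """Return the latest value for team that was knowable at or before kickoff."""
    eligible = [
        row for row in history if row["team"] == team and row["as_of"] <= kickoff
    ]
    if not eligible:
        return None  # nothing was knowable yet, so the feature is missing
    return max(eligible, key=lambda row: row["as_of"])["value"]


# Hypothetical history of one team's rolling offensive EPA.
history = [
    {"team": "BUF", "as_of": datetime(2024, 10, 1, 9, 0), "value": 0.10},
    {"team": "BUF", "as_of": datetime(2024, 10, 4, 9, 0), "value": 0.15},
    {"team": "BUF", "as_of": datetime(2024, 10, 7, 9, 0), "value": 0.30},
]
sunday_kickoff = datetime(2024, 10, 6, 13, 0)
value_at_kickoff = feature_as_of(history, "BUF", sunday_kickoff)
```

The row stamped October 7 is excluded even though it exists in the table, which is exactly the look-ahead leak the rule is meant to prevent.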
One source of truth per feature
If two parts of your code compute "rolling EPA" with two different windows, you have two features with the same name. The first time one of them changes, every model downstream is silently affected. The fix is one canonical definition per feature, in one file, imported wherever needed. The Shark Snip block catalog is that file: every Feature block has one definition and is referenced by ID. Copying that pattern in your own pipeline — a single features/ module with one function per feature — saves more pain than any other refactor.
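One way to enforce a single definition per feature in your own code is a small registry that refuses duplicate names. This is a sketch of the idea, not the Shark Snip block catalog; `FEATURES`, the `feature` decorator, and `rolling_epa_8` are hypothetical names.

```python
FEATURES = {}


def feature(name):
    """Register a feature function under a unique name."""

    def register(fn):
        if name in FEATURES:
            raise ValueError(f"feature {name!r} is already defined")
        FEATURES[name] = fn
        return fn

    return register


@feature("rolling_epa_8")
def rolling_epa_8(epa_by_game):
    """Mean EPA over the most recent 8 games (fewer if fewer exist)."""
    window = epa_by_game[-8:]
    return sum(window) / len(window) if window else None
```

Putting the window length in the name means a 4-game variant has to be registered as a separate feature, so two different computations can never share one label.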
From CSV to model in 30 minutes
Here is the full path from a freshly-downloaded nflverse CSV to a backtested NFL spread model on Shark Snip, end to end, with no Python and no server.
- Open /build and click "New model — NFL spread". The canvas spawns the starter graph.
- Drag the Upload CSV block onto the canvas, drop in the nflverse pbp_2024.csv file, and watch the schema validator turn green when every column resolves.
- Wire the upload to a Schedule block (auto-suggested) and an EPA Aggregation block (drag from the sidebar). The EPA block reads the play-by-play and emits one row per team per week with offensive and defensive net EPA.
- Add a Rest Differential block and a QB Availability block. Both auto-wire from the schedule and complete a five-feature inputs side.
- Drop a Linear Regression model block, then a Walk-Forward Split block above it. Set training to 2018-2023 and validation to 2024.
- Hit Train. The browser fits the regression on roughly 1,500 games in under three seconds and reports validation RMSE, ATS cover rate, and edge per bet.
- Open the Backtest tab. The same wired graph runs across the full window and produces an equity curve. A healthy first model lands at roughly 52-53% cover rate and a noisy upward equity drift.
- Drag a Quarter Kelly Staking block, set your bankroll cap, and click Publish. The model now generates live picks visible on Gridiron and is eligible for the leaderboards.
Total elapsed time on a normal laptop: about 30 minutes from a blank canvas to a published model, with the only typing being the upload filename. The full step-by-step worked example with screenshots lives in the CSV-to-model with no backend handbook; for a deeper comparison of the data sources you might use as the input layer, the companion guide to free NFL data sources, ranked, goes source by source with current uptime numbers.
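Two of the blocks in that walkthrough reduce to a few lines of arithmetic. The sketch below shows one common reading of a walk-forward split (train on every season before the validation season) and the standard Kelly formula scaled to a quarter. It is an illustration under those assumptions, not the code the blocks run; function names are hypothetical.

```python
def walk_forward_splits(seasons, min_train=1):
    """Return (train_seasons, validation_season) pairs in time order."""
    seasons = sorted(seasons)
    return [(seasons[:i], seasons[i]) for i in range(min_train, len(seasons))]


def kelly_fraction(win_prob, decimal_odds, scale=0.25):
    """Fraction of bankroll to stake; scale=0.25 gives quarter Kelly."""
    b = decimal_odds - 1.0  # net profit per unit staked on a win
    full_kelly = (b * win_prob - (1.0 - win_prob)) / b
    return max(0.0, full_kelly) * scale  # never stake on a negative edge


# Training on 2018-2023 and validating on 2024 is the last split.
splits = walk_forward_splits(range(2018, 2025), min_train=6)

# A 53% cover rate at -110 (decimal odds 1 + 100/110).
stake = kelly_fraction(0.53, 1.0 + 100.0 / 110.0)
```

At a 53% cover rate and -110 pricing, full Kelly is 1.3% of bankroll and quarter Kelly is about 0.33%, which shows how thin the edge is even for a healthy first model.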
When to graduate to Python or SQL — and how to leave cleanly
Browser-first is the right answer for almost every personal model and many small products. It stops being the right answer at three thresholds, and a good builder graduates without throwing away the work that got them there.
The 200 MB threshold
Browser memory is finite. Modern Chrome handles a couple of hundred megabytes of in-tab data comfortably; beyond that, IndexedDB persists fine but loading rows back into memory for training starts to chug. If your training corpus exceeds ~200 MB compressed, the in-browser trainer is no longer the fastest path. Move the training to a Python or DuckDB script on your laptop and keep the browser for inference and visualization.
The two-minute threshold
If a single training run takes more than two minutes in the browser, the iteration loop gets painful. The threshold sneaks up on you as you add features. The honest fix is either to reduce the feature count (often the right answer) or to move training to a faster runtime. The Shark Snip Workshop ships an Export to Python button that emits scikit-learn or XGBoost code reproducing the wired graph; you keep iterating in the browser, and only spend Python time on the long final fits.
The collaboration threshold
Browser-first models live in your tab. The first time you want a teammate to inspect the same model and propose a change, you hit a sharing problem. The Shark Snip Workshop solves this by storing the wired graph in Supabase and making it forkable from the marketplace; outside the platform, the analogous fix is to commit the model definition to a Git repo and to share the artifact through your normal code review flow. Either way, the moment you have collaborators is the moment to standardize where the model lives.
Exporting from Shark Snip
The Workshop export emits four artifacts: the wired graph as JSON, the equivalent Python script, the trained model weights as ONNX, and a reproducibility manifest pinning the source data versions. Drop the four into a Git repo and you have a Python pipeline that reproduces the browser model bit-for-bit. The reverse is also supported: if you build something heavy in Python and want to ship it as a published Shark Snip model, the manifest format is documented and the import path is one upload.
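If you are building outside the platform, the idea behind pinning source data versions can be approximated with content hashes. The sketch below is a generic illustration and deliberately not the Shark Snip manifest format, which is not reproduced here; `build_manifest` and `verify` are hypothetical names.

```python
import hashlib


def build_manifest(sources):
    """Map each input label to the SHA-256 digest and size of its raw bytes."""
    return {
        label: {"sha256": hashlib.sha256(data).hexdigest(), "bytes": len(data)}
        for label, data in sorted(sources.items())
    }


def verify(manifest, sources):
    """True only if every input matches the digest recorded in the manifest."""
    return build_manifest(sources) == manifest
```

Commit the manifest next to the model definition; a retrain months later can call `verify` first and refuse to run if any input file has changed.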
Bottom line
Sports data is no longer gated the way it used to be. The free public datasets cover NFL, NBA, MLB, and NHL with enough depth to build real models; the browser is fast enough to ingest, validate, train, and serve those models with zero infrastructure; and the patterns for daily refresh, ethical scraping, and graduation to heavier tooling are well-trodden. The skills that separate a hobbyist pipeline from a deployable one are not about picking exotic sources; they are about validating schemas at the boundary, respecting source operators with polite request rates, baking the as-of timestamp into every feature, and knowing when the browser stops being the right runtime. Open the build canvas, upload an nflverse CSV, and follow the 30-minute path above. From there every additional model is a fork in Workshop, every published artifact gets ranked on the leaderboards, and the genuinely good ones earn their listing on the marketplace.
Bet responsibly — set limits, never chase losses.
Price examples and pass rules
- Spread example: if Chiefs-Broncos opens Chiefs -3.5 and your fair number is Chiefs -2.8, Broncos +3.5 is the bet, Broncos +3 is a pass, and the Broncos moneyline needs roughly +155 or better before it replaces the spread.
- Total example: if a Bills outdoor total opens 46.5 and wind moves from 8 mph to 21 mph, an under projection at 42.8 still needs a playable number; under 45 or better is different from chasing 43.5.
- Futures example: Bengals AFC North +280 is 26.3% before hold. If your fair number is 30%, stake modestly, track portfolio correlation, and avoid stacking every Burrow, Chase, and Higgins bet into the same thesis.
- CLV rule: a good write-up is not enough. Track whether the spread, total, prop, or futures price closed better than your entry before grading the process.
Use the closing-line value guide to keep the examples attached to measurable prices.
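The price arithmetic in the examples above follows from the standard conversion of American odds to implied probability. A minimal sketch, with hypothetical function names; the closing-line check compares two prices on the same selection, so it suits moneylines and futures, while spread and total CLV is usually tracked in points instead.

```python
def implied_probability(american_odds):
    """Implied win probability of an American price, before removing hold."""
    if american_odds > 0:
        return 100.0 / (american_odds + 100.0)
    return -american_odds / (-american_odds + 100.0)


def edge(fair_probability, american_odds):
    """Your fair probability minus the price's implied probability."""
    return fair_probability - implied_probability(american_odds)


def beat_the_close(entry_odds, closing_odds):
    """True if the entry price implied a lower probability than the close."""
    return implied_probability(entry_odds) < implied_probability(closing_odds)


futures_implied = implied_probability(280)  # the +280 futures example
futures_edge = edge(0.30, 280)              # fair number of 30%
```

For the futures example, +280 implies about 26.3%, so a 30% fair number is an edge of roughly 3.7 percentage points before accounting for hold.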
Research note board
Use this table to turn the guide into a decision note. The point is to know when the idea is actionable and when it is only context.
| Angle | Input to verify | Example application | Pass when |
|---|---|---|---|
| Market price | Spread, total, moneyline, prop price, or futures hold | Chiefs and Bills prices compared against your fair number | The price has moved past the number that created the edge |
| Football or sport context | Role, pace, weather, injury status, opponent style | Josh Allen role news mapped to the relevant market | The original input changes or remains unconfirmed |
| Review loop | Entry, close, result, and reason code | Each bet logged with a clear thesis | You cannot explain whether the process beat the market |
Average total points by weather bucket
Average combined points scored in NFL games by weather bucket over recent seasons. Wind above 20 mph and snow each clip totals by 6-8 points versus domed games, which is why books move totals aggressively when forecasts shift.
NFL ATS cover-margin distribution
Distribution of (final margin − closing spread) across an NFL season. Roughly normal with mean ≈ 0 and standard deviation ≈ 13 points, which is why most ATS edges live in the ±1.5 point window.
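Under that normal approximation, a points edge converts to a cover probability with the standard normal CDF. The sketch below assumes mean 0 and a 13-point standard deviation as stated above; it ignores pushes and the extra weight of key numbers such as 3 and 7, so treat the output as a rough guide. `cover_probability` is a hypothetical name.

```python
import math


def cover_probability(edge_points, sd_points=13.0):
    """P(cover) when your fair line beats the market by edge_points."""
    z = edge_points / sd_points
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


no_edge = cover_probability(0.0)
edge_1_5 = cover_probability(1.5)
```

A 1.5-point edge maps to roughly a 54.6% cover probability under these assumptions, against a break-even of about 52.4% at -110.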



