Sports Data, Demystified: A Browser-First Pipeline Tour

Sports data used to be a closed shop. The lines you saw on TV came from data feeds that cost five figures a year, the play-by-play that powered every model lived behind paywalls or in scraped HTML, and the player tracking data that made the modern game legible was locked inside league partnerships. The story of the last decade is the steady opening of that shop. Today an individual modeler can stand up a respectable free NFL data stack in an afternoon, refresh it daily without owning a server, and feed a model that runs entirely in the browser. This handbook is a tour of that stack: what is free, what is fragile, what is licensable, and where the seams are between a hobbyist setup and something you can build a real product on. We will end with the question that eventually finds every successful builder: when is it time to graduate to Python and SQL, and how do you make the leap without throwing away the work that got you here?

The free NFL data landscape in 2026

There are four sources every serious NFL modeler should know by name and one that almost everyone uses without realizing it. The four are nflverse, ESPN's public scoreboard JSON, NFL Next Gen Stats, and Pro Football Reference. The fifth is your sportsbook's own line feed, which is technically your data when you are placing the bet. The mix of these five plus a couple of optional licensable feeds is enough to power every model on Shark Snip's leaderboards today.

nflverse — the canonical public dataset

nflverse is a community-maintained collection of NFL datasets published as free GitHub releases. It includes play-by-play back to 1999, weekly player stats, rosters, snap counts, draft history, depth charts, and a growing catalog of injury and combine data. The play-by-play release is roughly 50 MB compressed per season and is updated within hours of each game ending. Nearly every public NFL model on GitHub touches it, and most published models on Shark Snip pull at least one nflverse table.

The pros: complete, free, well-documented, redistributable under the project's open license, and consumable as either CSV or Parquet. The cons: it is a community project with no commercial SLA, the schema occasionally changes between seasons, and there is no streaming feed — you fetch a snapshot and you fetch again next time. For a daily-refresh browser pipeline that is exactly the right shape, but for live in-game markets you need a faster source.

ESPN public JSON — fast, free, legally fragile

ESPN exposes scoreboards, box scores, schedules, and team pages as undocumented JSON endpoints. They are unauthenticated and they are quick. A typical scoreboard payload is under 200 KB and updates within seconds of a play ending. Many consumer-facing sports apps built in the last five years have used these endpoints at some point.

The catch is licensing. ESPN does not publish terms for these endpoints because they were never officially documented as a developer API. Hobbyist use is widely tolerated; commercial use is not. If you build a product on top of ESPN public JSON and the product gets noticed, you should expect a polite cease-and-desist. Use ESPN for development, validation, and personal models; for any product you charge for, swap in a licensed feed before launch.

NFL Next Gen Stats — tracking, but not yours

The NFL publishes weekly tracking aggregates at the Next Gen Stats site: average separation for receivers, time to throw for quarterbacks, snap-weighted route depths, and a few dozen others. These are the only public window into the league's full tracking-data pipeline; the raw player x/y coordinates are tightly held by official partners.

The aggregates are published as HTML tables that are not designed for programmatic consumption. They scrape easily but the terms of use are unambiguous: NFL content is not licensed for redistribution. Treat Next Gen Stats the same way you would treat any league site — useful for your own modeling, off-limits for any product you plan to sell. If you need licensed tracking, the commercial vendors are SIS, PFF, and TruMedia.

Pro Football Reference — historical depth, scrape carefully

Pro Football Reference is the gold standard for historical NFL data. Box scores back to 1920, play indices, advanced stats, college crossover data, coaching trees — almost everything you would want for a long-horizon study lives there. The catch is in the robots.txt: a 3-second crawl delay, and the play index is explicitly disallowed. The site is also clear that bulk scraping for redistribution is forbidden.

For one-off historical lookups, manual queries through the site or the official downloadable CSVs are fine. For automated daily ingestion, treat PFR as off-limits and prefer nflverse, which derives most of the same fields legally from the play-by-play feed. The companion guide on polite scraping walks through how to read a robots.txt and how to set a User-Agent that identifies your project to the source.

Beyond the NFL — quick tour of the other major sports

The same shape repeats across every major league: one canonical public dataset, one fast-but-fragile public API, one or two licensable commercial feeds. Below is the short version for the rest of the stack you will need to cover NBA, MLB, and NHL.

NBA — nba_api and the league's own JSON

The unofficial nba_api Python wrapper is the canonical entry point for NBA data. It wraps the league's stats.nba.com endpoints and gives you box scores, play-by-play, player tracking aggregates, shot charts, and lineup data going back two decades. The library is heavily used and well-maintained (over 1,600 stars on GitHub at the time of writing), and the underlying endpoints are tolerated for non-commercial use the same way ESPN's are.

For a browser pipeline you can hit the same endpoints directly with fetch, since they return JSON and accept browser CORS. The Shark Snip build canvas ships an NBA Schedule block that uses these endpoints under the hood and caches responses for 24 hours so you do not double-poll on rebuild.

MLB — pybaseball and the official Stats API

Baseball is the best-instrumented sport on the planet. pybaseball wraps the official MLB Stats API plus Baseball Savant's Statcast, giving you pitch-by-pitch detail since 2008 — roughly 30 GB total if you pull every pitch. For browser ingestion, the MLB Stats API is freely accessible and does not require a key; Baseball Savant's CSV exports are also free but have a community-respected daily cap.

For modeling, pitch-level Statcast is overkill for a moneyline model and exactly right for a pitcher prop model. The companion guide on building a CSV-to-model pipeline with no backend uses pitch-by-pitch totals as the worked example because the schema is rich enough to demonstrate every common feature engineering pitfall.

NHL — the league's open API

The NHL serves a genuinely open JSON API at api-web.nhle.com. It has no official developer documentation, but community-maintained references cover it well. Schedules, box scores, play-by-play, and shift data are all available without a key and with reasonable rate limits. The community R package hockeyR wraps it and is the easiest entry point for analysts; for a browser pipeline you can hit the API directly and parse the JSON.

NHL coverage on Shark Snip uses these endpoints exclusively for game-level data, with line shopping handled by the same odds APIs as the other sports. The seam between league data and book data is clean because the league does not publish lines.

Browser-first ingestion: CSV upload to validated blocks

Once you have identified the sources, the next question is how to get rows out of them and into something a model can consume. The traditional answer is a Python script on your laptop or a server-side ETL pipeline. The browser-first answer is a CSV upload, a schema validator, and a typed block. Same logic, no infrastructure.

The minimal pipeline

The smallest useful pipeline has four stages: fetch, parse, validate, store. In a browser-first stack each stage is a function or a block. Fetch is a fetch() call to a CSV URL or a file picker for a local upload. Parse is a streaming CSV parser like Papa Parse that hands you rows without loading the whole file into memory. Validate is a schema check that refuses files missing required columns. Store is either IndexedDB for ephemeral state or a Supabase table for durable shared state.
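The four stages can be sketched as plain functions. This is a minimal illustration rather than the Shark Snip implementation: the naive parseCsv below is only a stand-in for a streaming parser such as Papa Parse (it does not handle quoted fields), and RowStore, missingColumns, and runPipeline are hypothetical names chosen for the example.

```typescript
// Minimal fetch -> parse -> validate -> store sketch. All names are illustrative.
type Row = Record<string, string>;

interface RowStore {
  // IndexedDB or a Supabase table would sit behind this interface.
  putAll(rows: Row[]): Promise<void>;
}

// Stand-in for a real streaming parser: splits on newlines and commas only,
// so it does NOT handle quoted fields or embedded commas.
function parseCsv(text: string): Row[] {
  const lines = text.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length === 0) return [];
  const header = lines[0].split(",").map((h) => h.trim());
  return lines.slice(1).map((line) => {
    const cells = line.split(",");
    const row: Row = {};
    header.forEach((name, i) => {
      row[name] = (cells[i] ?? "").trim();
    });
    return row;
  });
}

// Validate: list the required columns that are absent.
// A file with no data rows is treated as missing every required column.
function missingColumns(rows: Row[], required: string[]): string[] {
  if (rows.length === 0) return required;
  return required.filter((name) => !(name in rows[0]));
}

async function runPipeline(
  fetchText: () => Promise<string>, // e.g. () => fetch(url).then((r) => r.text())
  required: string[],
  store: RowStore,
): Promise<number> {
  const text = await fetchText();                 // fetch
  const rows = parseCsv(text);                    // parse
  const missing = missingColumns(rows, required); // validate
  if (missing.length > 0) {
    throw new Error(`CSV rejected, missing columns: ${missing.join(", ")}`);
  }
  await store.putAll(rows);                       // store
  return rows.length;
}
```

Passing the fetch and store stages in as arguments keeps each stage replaceable: the same runPipeline works with a file picker instead of a URL, or with an in-memory store in tests.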

The Shark Snip build canvas hides this entire pipeline behind a single Upload CSV button, but the underlying decomposition is the same. When you drop a file, the canvas runs the parser, runs the schema validator against the registered Data block, and either accepts the rows or shows you exactly which columns are missing or wrongly typed. That validation step is the part most homemade pipelines skip and the part that most homemade pipelines later regret skipping.

Schema validation as the contract

The single most useful habit you can adopt is treating schema validation as a contract. Define the columns and types your downstream code assumes, and refuse files that violate the contract at the boundary. The cost is a few minutes writing a Zod or Valibot schema; the saving is the entire category of "my model retrained but the column was renamed and now everything is wrong" bugs.

The block builder enforces this for you because every Data block declares its outputs as TypeScript types. If a CSV does not match, the block flips to a red error state and the canvas refuses to wire downstream consumers. If you are building outside the block builder, copy the pattern: validate at the source, never inside the model.
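Outside the block builder, the contract can be as small as a column-to-type map checked at the boundary. The sketch below is a hand-rolled stand-in for a Zod or Valibot schema so that it runs without dependencies; ColumnType, validateRow, and assertFile are hypothetical names, and the three-column schema in the usage note is made up for illustration.

```typescript
// Hand-rolled stand-in for a Zod or Valibot schema. All names are illustrative.
type ColumnType = "string" | "number" | "integer";
type Schema = Record<string, ColumnType>;

interface ValidationResult {
  ok: boolean;
  missing: string[];   // declared columns absent from the row
  wrongType: string[]; // columns present but not parseable as the declared type
}

// CSV cells arrive as strings, so "type" means "parses cleanly as".
function checkCell(value: string, type: ColumnType): boolean {
  if (type === "string") return value.length > 0;
  if (value.trim() === "") return false;
  const n = Number(value);
  if (!Number.isFinite(n)) return false;
  return type === "integer" ? Number.isInteger(n) : true;
}

function validateRow(row: Record<string, string>, schema: Schema): ValidationResult {
  const missing: string[] = [];
  const wrongType: string[] = [];
  Object.keys(schema).forEach((name) => {
    if (!(name in row)) missing.push(name);
    else if (!checkCell(row[name], schema[name])) wrongType.push(name);
  });
  return { ok: missing.length === 0 && wrongType.length === 0, missing, wrongType };
}

// Refuse the whole file at the boundary if any row violates the contract.
function assertFile(rows: Record<string, string>[], schema: Schema): void {
  rows.forEach((row, i) => {
    const result = validateRow(row, schema);
    if (!result.ok) {
      throw new Error(
        `row ${i}: missing [${result.missing.join(", ")}], ` +
          `wrong type [${result.wrongType.join(", ")}]`,
      );
    }
  });
}
```

With a schema of { game_id: "string", week: "integer", epa: "number" }, a row whose week is "3.5" and whose epa column is absent fails with both problems named, which is the same information the canvas surfaces in its red error state.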

Daily refresh patterns without a backend

Polling for fresh data is the part of a pipeline that traditionally requires a server. Cron, systemd timers, Airflow — all of these assume a long-running process somewhere. The browser-first stack solves this with two patterns: scheduled cloud functions that the browser reads from, and on-demand refresh when the user opens the tab.

Cloudflare Workers with cron triggers

Cloudflare's Workers free plan includes 100,000 requests per day and supports scheduled triggers via cron syntax. A Worker that fetches the nflverse weekly play-by-play release every morning at 6 AM Eastern, parses it, and writes the rows to Supabase or to a KV store costs nothing and runs forever. The browser app then reads the freshly-written rows on next load with a normal fetch.

The pattern is durable because the Worker has no state of its own: it is a stateless function that runs on a schedule. If a run fails, nothing is left half-done, and the next scheduled run simply fetches the current snapshot again; if you change the schema, you redeploy the Worker without losing any data. The Shark Snip data layer uses this pattern for nflverse weekly refreshes and for The Odds API hourly polls.
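A sketch of the refresh logic such a Worker might run is below. It is an illustration under assumptions, not the Shark Snip Worker: the SNAPSHOTS binding, the SOURCE_URL variable, and the key names are hypothetical, and the small interfaces are local stand-ins for the runtime's own types so the example is self-contained. Cloudflare evaluates cron expressions in UTC, so a 6 AM Eastern run would be scheduled at 11:00 UTC during standard time (for example "0 11 * * *") and 10:00 UTC during daylight time.

```typescript
// Local stand-ins for runtime types; the real Worker would use the platform's own.
interface KVLike {
  put(key: string, value: string): Promise<void>;
}
interface TextResponse {
  ok: boolean;
  status: number;
  text(): Promise<string>;
}
type FetchLike = (url: string) => Promise<TextResponse>;

interface Env {
  SNAPSHOTS: KVLike;  // hypothetical KV binding name
  SOURCE_URL: string; // hypothetical: URL of the release file to mirror
}

// Fetch one snapshot and overwrite the stored copy. The function keeps no state
// between runs, so a failed run is repaired by the next run fetching the file again.
async function refreshSnapshot(fetchFn: FetchLike, env: Env, now: Date): Promise<string> {
  const res = await fetchFn(env.SOURCE_URL);
  if (!res.ok) throw new Error(`source returned HTTP ${res.status}`);
  const body = await res.text();
  const key = `pbp:${now.toISOString().slice(0, 10)}`; // one key per UTC day
  await env.SNAPSHOTS.put(key, body);
  await env.SNAPSHOTS.put("pbp:latest", body); // what the browser app reads on load
  return key;
}

// In a Worker module the function would be wired to the cron trigger roughly as:
//
//   export default {
//     async scheduled(event, env, ctx) {
//       ctx.waitUntil(refreshSnapshot(fetch, env, new Date()));
//     },
//   };
```

KV values have a size cap, so a full season of play-by-play may be better written to Supabase or object storage; the refresh logic stays the same and only the put target changes.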

Supabase scheduled functions via pg_cron

If your data already lives in Supabase, the simpler pattern is pg_cron running inside the database. You write a SQL function that fetches and upserts rows, you schedule it with one line, and Postgres runs it on the cadence you specify. No Worker, no extra service.

The tradeoff is that pg_cron jobs run inside the database connection, so long-running fetches can hold a connection longer than you want. For light hourly polls of small endpoints it is the cleanest option; for heavy multi-megabyte pulls, a Cloudflare Worker is more polite to the database.

Browser-triggered refresh with smart caching

The third pattern is the lowest-infrastructure of all: refresh data when the user opens the tab, but only if the cached copy is stale. A 30-line wrapper around fetch() that checks IndexedDB for a recent copy and only re-fetches if the timestamp is old gives you a self-refreshing pipeline with no cron at all. It works great for personal models you check daily and badly for shared products where the data needs to be current even when no one is looking. The companion guide on daily refresh workflows in the browser compares all three patterns side by side and shows when each one wins.
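The wrapper can be sketched as follows. To keep the example runnable anywhere, the cache sits behind a small CacheStore interface with an in-memory implementation; in a real tab, IndexedDB (directly or through a wrapper) would implement the same two methods. The names fetchIfStale, CacheStore, and memoryStore are hypothetical, and the fallback to a stale copy on network failure is a design choice of this sketch rather than something the pattern requires.

```typescript
interface CachedEntry {
  fetchedAt: number; // epoch milliseconds when the body was fetched
  body: string;
}

interface CacheStore {
  get(key: string): Promise<CachedEntry | undefined>;
  set(key: string, entry: CachedEntry): Promise<void>;
}

// Return the cached body if it is younger than maxAgeMs, otherwise re-fetch.
async function fetchIfStale(
  url: string,
  maxAgeMs: number,
  store: CacheStore,
  fetchText: (url: string) => Promise<string>, // e.g. (u) => fetch(u).then((r) => r.text())
  now: () => number = Date.now,
): Promise<{ body: string; fromCache: boolean }> {
  const cached = await store.get(url);
  if (cached !== undefined && now() - cached.fetchedAt < maxAgeMs) {
    return { body: cached.body, fromCache: true };
  }
  try {
    const body = await fetchText(url);
    await store.set(url, { fetchedAt: now(), body });
    return { body, fromCache: false };
  } catch (err) {
    // Network failure: serve the stale copy if one exists, otherwise surface the error.
    if (cached !== undefined) return { body: cached.body, fromCache: true };
    throw err;
  }
}

// In-memory store for illustration; an IndexedDB-backed store would persist across reloads.
function memoryStore(): CacheStore {
  const map = new Map<string, CachedEntry>();
  return {
    get: async (key) => map.get(key),
    set: async (key, entry) => {
      map.set(key, entry);
    },
  };
}
```

Injecting the clock through the now parameter makes the staleness rule testable without waiting for real time to pass.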

Ethical scraping: the rules that keep your pipeline alive

Most modeling-related data sources are happy to be polled politely and unhappy to be hammered. The difference between a pipeline that runs for years and a pipeline that gets blocked in a week is almost entirely about how it identifies itself and how often it asks.

Read robots.txt and respect it

robots.txt is the source operator's documented preference about how their site should be crawled. It is not legally binding everywhere, but ignoring it is a fast path to being blocked and a slow path to a lawsuit. The Pro Football Reference example above is typical: a published crawl delay, a list of disallowed paths, and a clear preference. Read it, follow it, and your pipeline survives.

For sources without a robots.txt, default to one request per endpoint per minute and adjust down if the source seems annoyed (rate-limit headers, 429 responses, suddenly slower responses). Default to identifying yourself with a User-Agent like YourProjectName/1.0 (contact: you@example.com) so the source operator can ping you instead of banning your IP block.
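These defaults can be wrapped in a small client. The sketch below is an assumption-laden illustration: createPoliteClient and its options are hypothetical, the doubling backoff on 429 is one reasonable policy among several, and the fetch, sleep, and clock functions are injected so the example does not depend on any runtime. Note that browsers may ignore a custom User-Agent set from page script, so the header matters mostly when the request is made from a Worker or a server-side script.

```typescript
interface PoliteResponse {
  status: number;
  text(): Promise<string>;
}
type PoliteFetch = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<PoliteResponse>;

interface PoliteOptions {
  userAgent: string;     // e.g. "YourProjectName/1.0 (contact: you@example.com)"
  minIntervalMs: number; // e.g. 60000 for one request per minute
  maxRetries: number;    // how many times to retry after a 429
  fetchFn: PoliteFetch;
  sleep: (ms: number) => Promise<void>;
  now: () => number;
}

function createPoliteClient(opts: PoliteOptions) {
  let nextAllowedAt = 0; // earliest time the next request may be sent

  return async function get(url: string): Promise<string> {
    for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
      const wait = nextAllowedAt - opts.now();
      if (wait > 0) await opts.sleep(wait);
      nextAllowedAt = opts.now() + opts.minIntervalMs;

      const res = await opts.fetchFn(url, { headers: { "User-Agent": opts.userAgent } });
      if (res.status === 429) {
        // The source asked us to slow down: double the pause after each 429.
        nextAllowedAt = opts.now() + opts.minIntervalMs * Math.pow(2, attempt + 1);
        continue;
      }
      if (res.status >= 400) throw new Error(`HTTP ${res.status} for ${url}`);
      return res.text();
    }
    throw new Error(`gave up on ${url} after repeated 429 responses`);
  };
}
```

One client instance per source keeps the spacing rule scoped to that source, so polling a second API does not consume the first one's budget.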

Cache aggressively, batch when possible

Every request you do not make is a request that cannot annoy the source. If you need today's NFL scoreboard, fetch it once and cache it for the rest of the day. If you need every team's roster, prefer a single bulk endpoint over thirty-two single-team endpoints when the source offers one. The Shark Snip data layer caches every external request for at least an hour by default and surfaces a cache-bypass option only in the Workshop UI for explicit refreshes.

Know when to license

The signal that you should stop scraping and start licensing is not legal — it is product. As soon as your model becomes a paid product, the cost of a feed is dwarfed by the legal risk of building on tolerated-but-unlicensed scraping. MySportsFeeds, SportsDataIO, and The Odds API all offer commercial tiers that are affordable for solo developers; pay them and sleep better. The published marketplace models that scale beyond a hobby use licensed feeds underneath, even when their first iteration was scraped.

From raw data to a feature store

Raw data is not what a model consumes. A model consumes features — engineered, time-aware, validated columns that are the same shape every week. The difference between raw data and features is the most common place where a homemade pipeline goes wrong, and the place where a feature store earns its keep.

What a feature store actually is

A feature store is two things: a typed schema for the columns your models consume, and a versioned history of how those columns were computed. The first part lets you ship a model and know it will keep getting the same shape of input next week. The second part lets you reproduce a model's training data months later — necessary the first time you need to debug why a published model went cold.

You do not need a fancy feature store to start. A Postgres table with a primary key of (game_id, as_of_timestamp) and a column per feature is a feature store. A Parquet file partitioned by season is a feature store. The discipline is the part that matters: every row is timestamped, every feature has a clear definition, and the computation that produced the row is reproducible from the raw inputs.

The as-of-timestamp rule

Every feature in a betting model needs an as-of timestamp that says when the value was knowable in the real world. A "team's offensive EPA over last 8 games" computed on Tuesday for Sunday's game uses a different value than the same feature computed Friday after a Thursday-night game. The as-of timestamp is what lets you avoid look-ahead bias — the modeler's most common self-inflicted wound.

The Shark Snip block library bakes the as-of rule into every Feature block; the canvas refuses to wire a feature whose as-of date is after the prediction kickoff. If you build a feature store outside the block library, copy the rule: every row carries the timestamp at which it was knowable, and every join is keyed on (entity, as_of_timestamp <= kickoff).
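Outside the block library, the join rule reduces to "take the latest row for the entity whose as-of timestamp is not after kickoff". A minimal sketch, with FeatureRow and latestAsOf as hypothetical names and timestamps held as epoch milliseconds:

```typescript
interface FeatureRow {
  entity: string; // e.g. a team abbreviation
  asOf: number;   // epoch ms at which the value was knowable
  value: number;
}

// Latest row for `entity` with asOf <= kickoff, or undefined if none qualifies.
function latestAsOf(
  rows: FeatureRow[],
  entity: string,
  kickoff: number,
): FeatureRow | undefined {
  let best: FeatureRow | undefined;
  for (const row of rows) {
    if (row.entity !== entity) continue;
    if (row.asOf > kickoff) continue; // knowable only after kickoff: never read it
    if (best === undefined || row.asOf > best.asOf) best = row;
  }
  return best;
}

// Usage: three snapshots of the same feature around a Sunday kickoff.
const exampleRows: FeatureRow[] = [
  { entity: "KC", asOf: Date.parse("2026-01-06T12:00:00Z"), value: 0.10 }, // Tuesday
  { entity: "KC", asOf: Date.parse("2026-01-09T12:00:00Z"), value: 0.15 }, // Friday
  { entity: "KC", asOf: Date.parse("2026-01-12T12:00:00Z"), value: 0.30 }, // Monday, after the game
];
const exampleKickoff = Date.parse("2026-01-11T18:00:00Z");
const picked = latestAsOf(exampleRows, "KC", exampleKickoff); // the Friday row
```

The Monday row is the look-ahead trap: it has the most recent timestamp, but it was not knowable at kickoff, so the join must skip it.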

One source of truth per feature

If two parts of your code compute "rolling EPA" with two different windows, you have two features with the same name. The first time one of them changes, every model downstream is silently affected. The fix is one canonical definition per feature, in one file, imported wherever needed. The Shark Snip block catalog is that file: every Feature block has one definition and is referenced by ID. Copying that pattern in your own pipeline — a single features/ module with one function per feature — saves more pain than any other refactor.
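A single features module can be as small as a lookup from feature ID to one function. The sketch below is illustrative only: the IDs off_epa_last_8 and off_epa_last_4, the FEATURES map, and computeFeature are hypothetical, and the input is assumed to be a team's per-game EPA in chronological order, already limited to games knowable at the as-of time.

```typescript
// features.ts (hypothetical): one definition per feature, referenced by ID.
type FeatureFn = (epaByGame: number[]) => number | undefined;

// Mean of the last `window` values; undefined when there is not enough history.
function rollingMean(values: number[], window: number): number | undefined {
  if (window <= 0 || values.length < window) return undefined;
  const recent = values.slice(values.length - window);
  return recent.reduce((sum, v) => sum + v, 0) / window;
}

// Two windows means two features with two distinct IDs, never one shared name.
const FEATURES: Record<string, FeatureFn> = {
  off_epa_last_8: (epa) => rollingMean(epa, 8),
  off_epa_last_4: (epa) => rollingMean(epa, 4),
};

// Every caller goes through this lookup, so a definition changes in one place.
function computeFeature(id: string, epaByGame: number[]): number | undefined {
  const fn = FEATURES[id];
  if (fn === undefined) throw new Error(`unknown feature id: ${id}`);
  return fn(epaByGame);
}
```

Returning undefined for short histories, rather than averaging whatever is available, keeps the feature's definition identical in week 2 and week 12; how to impute the gap is then an explicit downstream decision.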

From CSV to model in 30 minutes

Here is the full path from a freshly-downloaded nflverse CSV to a backtested NFL spread model on Shark Snip, end to end, with no Python and no server.

1. Open /build and click "New model — NFL spread". The canvas spawns the starter graph.
2. Drag the Upload CSV block onto the canvas, drop in the nflverse pbp_2024.csv file, and watch the schema validator turn green when every column resolves.
3. Wire the upload to a Schedule block (auto-suggested) and an EPA Aggregation block (drag from the sidebar). The EPA block reads the play-by-play and emits one row per team per week with offensive and defensive net EPA.
4. Add a Rest Differential block and a QB Availability block. Both auto-wire from the schedule and complete a five-feature inputs side.
5. Drop a Linear Regression model block, then a Walk-Forward Split block above it. Set training to 2018-2023 and validation to 2024.
6. Hit Train. The browser fits the regression on roughly 1,500 games in under three seconds and reports validation RMSE, ATS cover rate, and edge per bet.
7. Open the Backtest tab. The same wired graph runs across the full window and produces an equity curve. A healthy first model lands at roughly 52-53% cover rate and a noisy upward equity drift.
8. Drag a Quarter Kelly Staking block, set your bankroll cap, and click Publish. The model now generates live picks visible on Gridiron and is eligible for the leaderboards.

Total elapsed time on a normal laptop: about 30 minutes from a blank canvas to a published model, with the only typing being the upload filename. The full step-by-step worked example with screenshots lives in the CSV-to-model with no backend handbook; for a deeper comparison of the data sources you might use as the input layer, the companion guide ranking free NFL data sources goes source by source with current uptime numbers.
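For readers who want to see what the walk-forward split and the regression fit in the steps above amount to, here is a deliberately small sketch. It is not the Shark Snip trainer: it fits a single-feature least-squares line for brevity where the canvas model uses several features, and walkForwardFolds, fitLine, and rmse are hypothetical names.

```typescript
interface Fold {
  train: number[];  // seasons used for fitting
  validate: number; // the single later season used for scoring
}

// Each fold trains on every earlier season and validates on the next one,
// so the model is never scored on data older than its training set.
function walkForwardFolds(seasons: number[], minTrainSeasons: number): Fold[] {
  const sorted = seasons.slice().sort((a, b) => a - b);
  const folds: Fold[] = [];
  for (let i = minTrainSeasons; i < sorted.length; i++) {
    folds.push({ train: sorted.slice(0, i), validate: sorted[i] });
  }
  return folds;
}

interface Line {
  slope: number;
  intercept: number;
}

// Ordinary least squares for one feature: slope = cov(x, y) / var(x).
function fitLine(xs: number[], ys: number[]): Line {
  const n = xs.length;
  if (n < 2 || ys.length !== n) throw new Error("need at least two paired samples");
  const meanX = xs.reduce((s, v) => s + v, 0) / n;
  const meanY = ys.reduce((s, v) => s + v, 0) / n;
  let sxx = 0;
  let sxy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    sxx += dx * dx;
    sxy += dx * (ys[i] - meanY);
  }
  if (sxx === 0) throw new Error("feature has no variance");
  const slope = sxy / sxx;
  return { slope, intercept: meanY - slope * meanX };
}

// Root mean squared error of the fitted line on held-out samples.
function rmse(line: Line, xs: number[], ys: number[]): number {
  let total = 0;
  for (let i = 0; i < xs.length; i++) {
    const err = line.intercept + line.slope * xs[i] - ys[i];
    total += err * err;
  }
  return Math.sqrt(total / xs.length);
}
```

Calling walkForwardFolds with the seasons 2018 through 2024 and a minimum of six training seasons yields exactly one fold, training on 2018-2023 and validating on 2024, which matches the setting in the walkthrough; lowering the minimum adds earlier folds.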

When to graduate to Python or SQL — and how to leave cleanly

Browser-first is the right answer for almost every personal model and many small products. It stops being the right answer at three thresholds, and a good builder graduates without throwing away the work that got them there.

The 200 MB threshold

Browser memory is finite. Modern Chrome handles a couple of hundred megabytes of in-tab data comfortably; beyond that, IndexedDB persists fine but loading rows back into memory for training starts to chug. If your training corpus exceeds ~200 MB compressed, the in-browser trainer is no longer the fastest path. Move the training to a Python or DuckDB script on your laptop and keep the browser for inference and visualization.

The two-minute threshold

If a single training run takes more than two minutes in the browser, the iteration loop gets painful. The threshold sneaks up on you as you add features. The honest fix is to either reduce the feature count (often the right answer) or to move training to a faster runtime. The Shark Snip Workshop ships an Export to Python button that emits scikit-learn or XGBoost code reproducing the wired graph; you keep iterating in the browser, and only spend Python time on the long final fits.

The collaboration threshold

Browser-first models live in your tab. The first time you want a teammate to inspect the same model and propose a change, you hit a sharing problem. The Shark Snip Workshop solves this by storing the wired graph in Supabase and making it forkable from the marketplace; outside the platform, the analogous fix is to commit the model definition to a Git repo and to share the artifact through your normal code review flow. Either way, the moment you have collaborators is the moment to standardize where the model lives.

Exporting from Shark Snip

The Workshop export emits four artifacts: the wired graph as JSON, the equivalent Python script, the trained model weights as ONNX, and a reproducibility manifest pinning the source data versions. Drop the four into a Git repo and you have a Python pipeline that reproduces the browser model bit-for-bit. The reverse is also supported: if you build something heavy in Python and want to ship it as a published Shark Snip model, the manifest format is documented and the import path is one upload.

Bottom line

Sports data is not the gatekeeper it used to be. The free public datasets cover NFL, NBA, MLB, and NHL with enough depth to build real models; the browser is fast enough to ingest, validate, train, and serve those models with zero infrastructure; and the patterns for daily refresh, ethical scraping, and graduation to heavier tooling are well-trodden. The skills that separate a hobbyist pipeline from a deployable one are not about picking exotic sources — they are about validating schemas at the boundary, respecting source operators with polite request rates, baking the as-of timestamp into every feature, and knowing when the browser stops being the right runtime. Open the build canvas, upload an nflverse CSV, and follow the 30-minute path above. From there every additional model is a fork in Workshop, every published artifact gets ranked on the leaderboards, and the genuinely good ones earn their listing on the marketplace.

Bet responsibly — set limits, never chase losses.

Price examples and pass rules


Spread example: if Chiefs-Broncos opens Chiefs -3.5 and your fair number is Chiefs -2.8, Broncos +3.5 is the bet, Broncos +3 is a pass, and the Broncos moneyline needs roughly +155 or better before it replaces the spread.
Total example: if a Bills outdoor total opens 46.5 and wind moves from 8 mph to 21 mph, an under projection at 42.8 still needs a playable number; under 45 or better is different from chasing 43.5.
Futures example: Bengals AFC North +280 is 26.3% before hold. If your fair number is 30%, stake modestly, track portfolio correlation, and avoid stacking every Burrow, Chase, and Higgins bet into the same thesis.
CLV rule: a good write-up is not enough. Track whether the spread, total, prop, or futures price closed better than your entry before grading the process.

Use the closing-line value guide to keep the examples attached to measurable prices.
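The arithmetic behind the futures example can be checked with a few lines. The conversion from American odds to implied probability is standard; the staking function uses the Kelly formula scaled by one quarter, which is one way to read "stake modestly" and is an assumption of this sketch rather than a rule stated in the examples. Function names are hypothetical.

```typescript
// Implied probability of American odds, before removing the book's hold.
function impliedProbability(americanOdds: number): number {
  if (americanOdds === 0) throw new Error("American odds cannot be 0");
  return americanOdds > 0
    ? 100 / (americanOdds + 100)
    : -americanOdds / (-americanOdds + 100);
}

// Net profit per unit staked (the "b" in the Kelly formula).
function netOdds(americanOdds: number): number {
  if (americanOdds === 0) throw new Error("American odds cannot be 0");
  return americanOdds > 0 ? americanOdds / 100 : 100 / -americanOdds;
}

// Fraction of bankroll to stake: Kelly f* = (b*p - q) / b, scaled by `multiplier`.
// Returns 0 when the fair probability does not beat the price.
function kellyFraction(
  fairProbability: number,
  americanOdds: number,
  multiplier: number = 0.25,
): number {
  const b = netOdds(americanOdds);
  const full = (b * fairProbability - (1 - fairProbability)) / b;
  return Math.max(0, full) * multiplier;
}
```

For the Bengals example, impliedProbability(280) is about 0.263, and kellyFraction(0.30, 280) is about 0.0125: a full-Kelly stake of 5% of bankroll, cut to 1.25% at quarter Kelly. If the fair number were 26% or lower, the function returns 0, which is the pass.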

Research note board

Use this table to turn the guide into a decision note. The point is to know when the idea is actionable and when it is only context.

Angle	Input to verify	Example application	Pass when
Market price	Spread, total, moneyline, prop price, or futures hold	Chiefs and Bills compared through keeper	The price has moved past the number that created the edge
Football or sport context	Role, pace, weather, injury status, opponent style	Josh Allen role news mapped to the relevant market	The original input changes or remains unconfirmed
Review loop	Entry, close, result, and reason code	hold logged with a clear thesis	You cannot explain whether the process beat the market

Average total points by weather bucket

Average combined points scored in NFL games by weather bucket over recent seasons. Wind above 20 mph and snow each clip totals by 6-8 points versus domed games, which is why books move totals aggressively when forecasts shift.

NFL ATS cover-margin distribution

Distribution of (final margin − closing spread) across an NFL season. Roughly normal with mean ≈ 0 and standard deviation ≈ 13 points, which is why most ATS edges live in the ±1.5 point window.
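Under the normal approximation described above, an edge in points converts to a cover probability through the normal CDF. The sketch below assumes exactly that model (mean 0, standard deviation 13) and nothing more: it ignores pushes and the clustering of NFL margins on key numbers such as 3 and 7, so treat the output as a rough guide. The erf implementation is the Abramowitz and Stegun polynomial approximation, and coverProbability is a hypothetical name.

```typescript
// Abramowitz and Stegun formula 7.1.26; absolute error is on the order of 1e-7.
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) *
    t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Standard normal cumulative distribution function.
function normalCdf(z: number): number {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// Probability of covering when your fair line differs from the market by
// `edgePoints`, assuming (margin - spread) ~ Normal(0, sd). Pushes are ignored.
function coverProbability(edgePoints: number, sd: number = 13): number {
  return normalCdf(edgePoints / sd);
}
```

With a standard deviation of 13, a 1.5-point edge maps to a cover probability between 54% and 55%, while a zero-point edge maps to 50%. Small differences in points therefore account for the whole gap between a coin flip and a profitable cover rate, which is the sense in which ATS edges live in a narrow window.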

Frequently asked questions

What are the best free NFL data sources for a betting model?

For a public stack the answer is nflverse for play-by-play and rosters, NFL Next Gen Stats for weekly tracking aggregates (personal modeling only, since its terms do not allow commercial reuse), and either The Odds API or your sportsbook of choice for closing lines. Pro Football Reference fills historical gaps but its terms restrict scraping; treat it as a manual lookup, not a daily feed. ESPN public JSON is fast and free but its terms are unclear, so use it for development and validation rather than as the data layer in a paid product.

Can I really build a sports model without a server?

Yes, and the architecture is simpler than the server version. Modern browsers can fetch CSV, parse it, validate the schema, fit a model, and persist artifacts to IndexedDB or a free Supabase project, all without any backend code you wrote. The tradeoff is that long-running jobs need to start while the tab is open; for daily refreshes you delegate the cron to a Cloudflare Worker or a Supabase scheduled function and let the browser pick up the freshly written rows on next load.

How often should I refresh data for an NFL model?

Once per day is enough for spread and total models because the underlying play-by-play and injury reports change slowly. For player props and live in-game markets you want hourly during gameday and at minute granularity in the final hour before kickoff. Set the cadence to the volatility of the markets you bet, not to the maximum your data source allows. Polling a public source faster than necessary is the single fastest way to get rate-limited or blocked.

Is it ethical to scrape sports data?

Scraping is ethical when you respect robots.txt, you stay below documented or implied rate limits, you do not bypass authentication, and you do not redistribute the raw data in violation of the source terms. The honest test is whether you would be comfortable showing your traffic logs to the source operator. For commercial products, the safer path is to license a feed: MySportsFeeds, SportsDataIO, and similar vendors offer affordable tiers and legitimate redistribution rights.

What is a feature store and do I need one as a beginner?

A feature store is a typed, versioned table of model-ready columns derived from raw data. You do not need a full feature store to ship your first model, but you need the discipline of separating raw ingestion from feature computation in your pipeline. The Shark Snip block builder enforces that separation by making Data blocks distinct from Feature blocks; copying that pattern in your own scripts pays off the first time you change a feature definition and need to know which models are affected.

When should I graduate from browser-only to Python or SQL?

Graduate when one of three things is true: your training data exceeds about 200 megabytes after compression, your training run takes longer than two minutes in the browser, or you need to share artifacts with collaborators who do not use the same web app. Until then the browser stack is faster to iterate in because there is no deploy step. The Shark Snip Workshop has an Export to Python button so you can lift a working browser model into a server pipeline when you cross the threshold.

How do I avoid rate-limiting when polling free APIs?

Cache aggressively, batch requests where the API allows, set a polite User-Agent that identifies your project and a contact email, and stagger jobs so you are not hitting the source on the round minute when every other scraper is. For ESPN-style endpoints, one request per endpoint per game per day is plenty. For odds APIs, only refresh the lines you actually plan to bet, not every market in every sport.

Can I use the NFL Next Gen Stats data in a commercial product?

No. The Next Gen Stats site is published by the NFL for fan consumption and the terms do not grant commercial reuse. You can use the data for your personal models and for editorial analysis with attribution, but you should not redistribute it or use it as the data layer in a product you sell. For commercial tracking data, license SIS or PFF, or fall back to nflverse-derived approximations.

What format should I store raw sports data in?

Parquet for archives larger than a season, CSV for small tables you want to inspect by eye, and JSON for nested data like box scores and play tags. Avoid storing raw HTML — parse it once at ingestion and keep the structured output. For browser-side persistence, IndexedDB through a wrapper like Dexie handles tens of megabytes comfortably; beyond that, push the archive to Supabase Storage and load partitions on demand.

How do I validate that a CSV actually has the columns I expect?

Use a schema validator like Zod or Valibot at the boundary between your CSV parser and the rest of the pipeline. Define the expected columns, types, and acceptable null counts in one place, and refuse to load a file that fails the schema. The Shark Snip CSV upload flow runs this check on every file before any block can consume it; that pattern catches data drift the day a vendor renames a column rather than weeks later when your model starts losing money.