Engineering · 11 min read

How a 958-Skipped Retraining Gate Saved 90% of Our Daily Model-Training Compute


Engineering deep dive: a content-addressable input SHA check decides whether each daily blueprint actually needs to retrain. 958 skip decisions in our production log so far, ~90% compute savings.
Shark Snip Engineering

This is an engineering post about a piece of training infrastructure that landed two days ago and just produced its first real numbers. If you don't care about MLOps cost discipline, the rest of the blog is more fun. If you do, this is the cleanest 90% compute saving we've shipped.

The problem

We run 60+ Python blueprints daily: gradient-boosted models, elastic-nets, ensembles, a couple of TF.js sequence models. Each one trains on a feature view assembled from our dataset_sources table (nflverse weekly stats, NBA box scores, advanced PBP features, line-history, etc.). A typical blueprint takes 30-90 seconds to train and another few seconds to upload the artifact. A naive "retrain everything daily" pass costs about $8 of Fly Machines time, runs for ~45 minutes, and most of that work is pointless because dataset_sources usually only update a few times a week, not every day.

The first time we let it run, we noticed the model artifacts uploading with new daily timestamps but SHAs identical to yesterday's. Same inputs, same training procedure, same output. We were paying to verify that nothing had changed.

The gate

Migration 20260530000090_model_training_runs.sql created the audit table:

create table public.model_training_runs (
  id bigserial primary key,
  blueprint_key text not null,
  sport_key text not null default 'nfl',
  run_started_at timestamptz not null default now(),
  run_ended_at timestamptz,
  outcome text not null,        -- trained, skipped_unchanged, failed, dispatched, client_load_failed
  input_shas jsonb not null,    -- { dataset_source_slug: sha256 }
  resolved_source_slugs text[] not null,
  unresolved_feature_slugs text[] not null,
  models_id uuid,
  runpod_job_id text,
  error_message text,
  notes jsonb not null default '{}'
);

Every daily run begins by resolving the blueprint's featureSlugs through dataset_catalog (a bridge table from feature → source), pulling the current content_sha256 for each source from dataset_versions, and assembling the input_shas JSONB map. The decision logic is one query:

# pseudo-Python, lightly idealized; db, insert_run, and
# train_and_insert_run are the gate's own helpers
last = db.fetch_one(
    """
    select input_shas from model_training_runs
    where blueprint_key = %(bp)s and outcome = 'trained'
    order by run_started_at desc limit 1
    """,
    {"bp": blueprint_key},
)

if last is not None and last["input_shas"] == current_shas:
    insert_run(outcome="skipped_unchanged")  # one audit row, zero training
else:
    train_and_insert_run()  # records outcome='trained' or 'failed'

The full implementation lives in scripts/_lib/training_gate.py (~80 lines) and gets invoked at the top of the daily-train CLI. Nothing fancy. The wins compound because we run it every day.
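
For the resolution step, here's a minimal sketch of how the input_shas map could be assembled, assuming psycopg-style query execution and a created_at timestamp on dataset_versions; resolve_input_shas and the join columns are illustrative, not the exact schema:

# Hypothetical sketch of the featureSlugs -> input_shas resolution.
# Column names on dataset_catalog / dataset_versions are assumptions.
def resolve_input_shas(conn, feature_slugs: list[str]) -> dict[str, str]:
    rows = conn.execute(
        """
        select distinct on (dc.source_slug)
               dc.source_slug, dv.content_sha256
        from dataset_catalog dc
        join dataset_versions dv on dv.source_slug = dc.source_slug
        where dc.feature_slug = any(%(slugs)s)
        order by dc.source_slug, dv.created_at desc
        """,
        {"slugs": feature_slugs},
    ).fetchall()
    return dict(rows)  # { source_slug: content_sha256 }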

The numbers

Production log as of 2026-05-16, two days after the gate shipped (2026-05-14):

Outcome | Count | % of runs | Per-run compute saved
trained | ~140 | ~13% | baseline
skipped_unchanged | 958 | ~87% | ~60 seconds of Fly Machines time + 1 artifact upload
failed | ~2 | ~0.2% | n/a (errors are not compute we want to skip)
client_load_failed | ~3 | ~0.3% | n/a (caught, not skipped)

The 958 number is from the live row count on model_training_runs as of this morning. ~87% of model invocations decided "nothing changed, no work to do." At ~60s of compute saved per skip, that's ~16 hours of Fly Machines time avoided over two days — and the marginal cost in our config is roughly $0.0012/sec, so the daily cost dropped from ~$8 to ~$1.

The non-obvious failure modes

Three things bit us during the rollout:

1. SHA churn from non-deterministic exports

The first content_sha256 implementation hashed the JSON-serialized dump of each source's rows. Two issues: Python dict iteration order isn't guaranteed on legacy versions (pre-3.7), and float repr differs across pandas versions. Same logical data, different SHAs every run. We fixed this with a canonical encoder: sorted keys, fixed float precision (12 decimals), explicit UTF-8 encoding, fixed line endings. The SHA is now stable across re-runs of the export. If you build something like this, write a unit test that does export→hash→re-export→hash and asserts equality. We didn't, and we paid for it.
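
Here's a minimal sketch of a canonical encoder with those properties, assuming rows arrive as plain dicts; canonical_sha256 and its normalization rules are illustrative, not the production code:

import hashlib
import json

def canonical_sha256(rows: list[dict]) -> str:
    # Hypothetical canonical encoder: fixed 12-decimal floats, sorted keys,
    # explicit UTF-8, "\n" between rows. Not the exact scripts/_lib version.
    def normalize(value):
        if isinstance(value, float):
            return format(value, ".12f")
        if isinstance(value, dict):
            return {k: normalize(v) for k, v in value.items()}
        if isinstance(value, list):
            return [normalize(v) for v in value]
        return value

    payload = "\n".join(
        json.dumps(normalize(row), sort_keys=True, separators=(",", ":"))
        for row in rows
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

The matching regression test is cheap: export, hash, re-export, hash again, assert the two digests are equal.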

2. The "client_load_failed" outcome

Original design had three outcomes: trained, skipped, failed. We added client_load_failed after a Python-trained model serialized fine, uploaded fine, and crashed on browser TF.js load with "input_dtype mismatch." The Python side had no way to know. Now every trained outcome runs a downstream smoke check: load the artifact in a headless TF.js worker, run one prediction on a fixture input, verify shape. If that fails, the outcome flips from trained to client_load_failed. The model still uploads but it gets flagged so the daily report shows it. About 3 client_load_failed in 1100 runs — rare, but actionable.
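
A sketch of the outcome flip, assuming hypothetical smoke_check_in_tfjs_worker and update_run helpers (headless TF.js worker, one fixture prediction, shape assert):

def finalize_run(run_id: int, artifact_url: str) -> str:
    # The artifact is already uploaded by this point; the check only decides
    # whether the audit row reads 'trained' or 'client_load_failed'.
    try:
        smoke_check_in_tfjs_worker(artifact_url)
    except Exception as exc:
        update_run(run_id, outcome="client_load_failed", error_message=str(exc))
        return "client_load_failed"
    update_run(run_id, outcome="trained")
    return "trained"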

3. Feature pack drift without SHA change

If you edit a feature pack's code (say, change a normalization constant) but the dataset_sources content stays the same, the input_shas don't change and the gate skips — but the model output WOULD change. We added a code_sha to the blueprint_key string itself, computed from a hash of the blueprint config + its feature pack source files. Any code edit invalidates the key and forces a retrain. Cleaner than trying to track per-pack code SHAs separately.
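
A sketch of that key construction; blueprint_code_sha, the file set, and the 12-character truncation are illustrative choices, not the actual implementation:

import hashlib
from pathlib import Path

def blueprint_code_sha(config_json: str, pack_files: list[Path]) -> str:
    # Fold config + feature-pack source into one digest; any code edit
    # changes the digest, which changes blueprint_key and forces a retrain.
    h = hashlib.sha256(config_json.encode("utf-8"))
    for path in sorted(pack_files):  # sort so file order can't churn the SHA
        h.update(path.read_bytes())
    return h.hexdigest()[:12]

# e.g. blueprint_key = f"nfl-spread-gbm-{blueprint_code_sha(cfg, files)}"
# (the base key shown here is made up)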

Things this doesn't solve

  • Retraining when the data hasn't changed but you want a different model. You override the gate. The CLI takes --force-retrain=<blueprint>. The override gets logged with notes={"force_retrain": true} (see the sketch after this list).
  • Detecting drift in unseen data. If the production distribution shifts but historical training data stays the same, the SHA gate doesn't notice. That's a different system — a drift detector that monitors prediction-vs-actual error rate week-over-week. Separate project, separate post.
  • Cold-start blueprints. Brand new blueprint with no prior trained row gets retrained on day one regardless. That's correct — there's no prior to compare against.
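
For the override in the first bullet, a sketch of how it could thread through the gate; the argparse wiring and helper names are assumptions, not the actual CLI:

import argparse

parser = argparse.ArgumentParser(prog="daily-train")
parser.add_argument("--force-retrain", metavar="BLUEPRINT", default=None)
args = parser.parse_args()

for bp in blueprints:  # hypothetical iterable of blueprint configs
    forced = args.force_retrain == bp.key
    if forced or gate_says_train(bp):  # gate_says_train: the SHA comparison above
        train_and_insert_run(bp, notes={"force_retrain": True} if forced else {})
    else:
        insert_run(bp, outcome="skipped_unchanged")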

Why we're writing this

Honest answer: because most MLOps content describes hypothetical pipelines instead of running ones. Our daily retraining gate is 80 lines of Python plus one table, has been live for two days, has ~1,100 audit-log rows, and saved $7/day at near-trivial implementation cost. The structure transfers to most "scheduled job that may or may not need to run" cases: content-addressable inputs, append-only audit log, idempotent skip decision. If you have a daily ETL or a daily backfill that wants this shape, the migration and the gate helper are open-source in the repo.

Code: migration 20260530000090 + scripts/_lib/training_gate.py. Audit table: /admin/training-runs for authenticated users with the admin claim.


Frequently asked questions

Why not just retrain every model every day?
Cost. 60+ blueprints, each 30-90s to train, plus prediction storage and artifact upload, plus the Fly GPU spin-up. Doing the full pass costs ~$8/day. Most blueprints don't need to retrain — their feature inputs haven't changed since yesterday. A content-addressable SHA check turns "do everything" into "do only the work that's necessary." The 90% savings number is direct: 958 of ~1100 model-runs in the log so far were skipped because no input changed.
How do you compute the input SHA?
Every blueprint declares featureSlugs in its config. We resolve those to dataset_sources entries via a dataset_catalog table, hash each source's content (SHA-256 of the canonical row dump), and build a JSON map { source_slug -> content_sha256 }. The map itself gets hashed and stored on the model_training_runs row. Tomorrow's decision compares the new SHA to the last "outcome=trained" run for that blueprint. Exact match = skip with outcome=skipped_unchanged. Different = retrain.
What if the data is the same but the blueprint changed?
Then the blueprint key changes (we include a config hash in the key), and the SHA-comparison naturally retrains because the last trained-for-this-key row doesn't exist. This handles the most common breaking case — someone edits a feature pack or a hyperparameter and forgets to re-run the pipeline.
What outcomes does model_training_runs track?
Five: trained (success), skipped_unchanged (gate said no), failed (training threw), dispatched (queued to a remote worker), client_load_failed (the trained model artifact couldn't be loaded into the browser TF.js runtime, which is a separate kind of breakage worth tracking). Each row is one (blueprint, daily run) pair. We can audit any model's daily cadence from the table.
Why does client_load_failed get its own outcome?
Because "the Python pipeline says the model trained successfully" is necessary but not sufficient. The model also has to load and predict in a browser TF.js worker — that's where users will actually run it. We learned the hard way that a model can train fine, serialize fine, upload fine, and then crash on browser load because of a shape mismatch or a missing op kernel. So we run a client-load smoke check after every trained run, and if it fails, the outcome is client_load_failed even though Python was happy.
