§9.3
Train/Test Splits, Generalization, and Leakage
A model that explains the past is not the same as a model that survives the future. The whole discipline of generalization is about that gap. We hold out a slice of history the model never sees, fit the model on the rest, and ask: when the held-out cases were brand-new to the algorithm, how well did it score them? If the answer is "almost as well as the training cases," the model has learned something the world will reward. If held-out performance collapses, the model has most likely memorized noise in the training data rather than learned signal.
This article covers three ideas. First, what the train/test split is and why it matters more than any single model choice. Second, how cross-validation generalizes the idea. Third, and this is where most time is lost in practice, the catalog of leakage traps that quietly let information from the future creep into the past. A closing section on overfitting and underfitting describes the two directions in which generalization fails.
The Executive Question
If the model looks good on the data we used to train it, why might it still fail on next quarter's customers?
The unsatisfying answer: a model can be brilliant at the historical task it was trained on and useless at the slightly different task that production presents. Generalization is the test of whether those two tasks are close enough.
The Train/Test Split
The test set is a rehearsal for the future
In practice the split is rarely a single fixed cut. Three variants come up:
- Random split. Roughly 70/30 or 80/20. Works when rows are exchangeable and unrelated across time. Cheap, fast, the default for an early prototype.
- Time-based split. Train on data up to a cut-off date, evaluate on data after. The right default whenever the process being modeled has trend, seasonality, or drift over time, which is almost every business setting.
- Group split. Hold out entire groups (customers, stores, regions) rather than rows. Required when the model will be used to score new groups, not just future rows of old groups.
The choice should match the deployment story. A churn model scored weekly on the same customer base wants a time-based split. A pricing model that will be applied to listings the firm has never seen wants a group split where new listings are the held-out set.
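The three variants can be sketched in a few lines of plain Python. This is a minimal illustration, not a production splitter: the `rows` data, the tuple layout (customer, date, label), and the function names are all assumptions made for the example.

```python
import random
from datetime import date

# Hypothetical rows: (customer_id, observation_date, label)
rows = [
    ("c1", date(2024, 1, 5), 0),
    ("c1", date(2024, 3, 9), 1),
    ("c2", date(2024, 1, 20), 0),
    ("c2", date(2024, 4, 2), 0),
    ("c3", date(2024, 2, 11), 1),
    ("c3", date(2024, 4, 18), 0),
    ("c4", date(2024, 2, 27), 0),
    ("c4", date(2024, 5, 6), 1),
]

def random_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows and cut; only sound when rows are exchangeable."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def time_split(rows, cutoff):
    """Train on rows dated up to the cutoff, test on rows after it."""
    train = [r for r in rows if r[1] <= cutoff]
    test = [r for r in rows if r[1] > cutoff]
    return train, test

def group_split(rows, test_groups):
    """Hold out whole customers so no customer appears on both sides."""
    train = [r for r in rows if r[0] not in test_groups]
    test = [r for r in rows if r[0] in test_groups]
    return train, test

rand_train, rand_test = random_split(rows)
time_train, time_test = time_split(rows, cutoff=date(2024, 3, 31))
group_train, group_test = group_split(rows, test_groups={"c4"})
```

Note that the random split can place the same customer on both sides, which is exactly the situation the group split rules out.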
Cross-Validation
Cross-validation is what you reach for when one held-out slice doesn't give you a stable estimate of performance — typical when the dataset is small or noisy. The idea is to rotate the held-out role across many slices and average the resulting scores.
The point is not to squeeze more accuracy out of the model. It is to get a more stable estimate of how it would do on a slice it has not seen. A model with cross-validated AUC of 0.82 ± 0.01 is a different deployment risk than one with 0.82 ± 0.07, even though they have the same headline number.
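The rotation can be sketched as follows. This is an illustrative k-fold loop in plain Python under stated assumptions: the synthetic data, the toy threshold model, and the use of accuracy as a stand-in for AUC are all hypothetical choices made to keep the example self-contained.

```python
import random
import statistics

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every row is held out exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal slices
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, test_idx

def cross_validate(xs, ys, fit, score, k=5, seed=0):
    """Fit on k-1 folds, score on the remaining fold, return mean and spread."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k, seed):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return statistics.mean(scores), statistics.stdev(scores)

def fit_threshold(xs, ys):
    """Toy model: the midpoint between the two class means."""
    mean0 = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
    mean1 = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
    return (mean0 + mean1) / 2

def accuracy(threshold, xs, ys):
    """Share of rows where 'x above threshold' matches the label."""
    hits = sum((x > threshold) == bool(y) for x, y in zip(xs, ys))
    return hits / len(xs)

# Hypothetical data: class 1 is shifted upward, with overlapping noise.
rng = random.Random(1)
ys = [i % 2 for i in range(40)]
xs = [rng.gauss(1.5 * y, 1.0) for y in ys]

mean_acc, std_acc = cross_validate(xs, ys, fit_threshold, accuracy, k=5)
print(f"accuracy {mean_acc:.2f} ± {std_acc:.2f}")
```

The second number is the one to read first: a wide spread across folds means the headline score depends heavily on which slice happened to be held out.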
Leakage: Information From the Future
Leakage is when a feature contains information that would not have been available at decision time. It is the single most common reason a model with stellar offline metrics produces disappointing live results.
Feature leakage gallery — would this be known at decision time?
- leak: cancellation_date (only known after churn)
- safe: days_since_last_purchase (computed at decision time)
- safe: support_tickets_last_30d (past window only)
- maybe: refund_amount_total (safe only if cut off before label window)
- leak: final_account_balance ("final" implies end of relationship)
- safe: discount_share (historical behaviour)
- leak: churn_reason_code (recorded at the time of churn)
- safe: loyalty_tier (attribute at decision time)
A feature is leaking whenever its value at training time is not knowable when the prediction needs to be made.
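One way to make that rule mechanical is to record, for each feature value, the moment it became known, and compare that with the decision time. The sketch below assumes such timestamps exist; the `feature_known_at` mapping, its dates, and the function name are hypothetical.

```python
from datetime import datetime

def leaking_features(feature_known_at, decision_time):
    """Return names of features whose values only became known after decision_time."""
    return sorted(
        name
        for name, known_at in feature_known_at.items()
        if known_at > decision_time
    )

# Hypothetical record for one customer scored on 1 June 2024.
decision_time = datetime(2024, 6, 1)
feature_known_at = {
    "days_since_last_purchase": datetime(2024, 6, 1),
    "support_tickets_last_30d": datetime(2024, 5, 31),
    "cancellation_date": datetime(2024, 7, 14),   # written when the customer churned
    "churn_reason_code": datetime(2024, 7, 14),   # written when the customer churned
}

flagged = leaking_features(feature_known_at, decision_time)
print(flagged)  # ['cancellation_date', 'churn_reason_code']
```

In a real warehouse the "known at" time is often a load or update timestamp rather than an explicit field, so this check is only as good as that metadata.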
A few patterns to watch for:
- Outcome-derived fields. cancellation_date, churn_reason_code, final_account_balance: anything that exists because the outcome happened.
- Time-window inconsistency. Features computed over different windows for different rows. The classic version: avg_spend_last_30_days ending on the date of churn for churners and on the score date for non-churners.
- Target encoding bleed. Replacing a categorical with its mean target is fine if the mean is computed using only training rows, deadly if computed over the whole dataset.
- Duplicate rows across the split. The same customer appearing in both training and test with different timestamps, allowing the model to memorize their idiosyncrasies.
- Future joins. Joining a snapshot table that was rebuilt after the prediction date, so old rows carry today's metadata.
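To illustrate the target encoding pattern from the list above, the sketch below fits the category means on training rows only and then applies that fixed mapping to test rows. The category names, labels, and the choice to fall back to the global training mean for unseen categories are assumptions for the example.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets):
    """Learn per-category mean target from training rows only."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    mapping = {c: sums[c] / counts[c] for c in counts}
    global_mean = sum(targets) / len(targets)
    return mapping, global_mean

def apply_target_encoding(categories, mapping, global_mean):
    """Encode rows with the frozen mapping; unseen categories get the training mean."""
    return [mapping.get(c, global_mean) for c in categories]

# Hypothetical training and test rows (loyalty tier, churn label).
train_cat = ["gold", "gold", "silver", "silver", "silver"]
train_y = [0, 1, 1, 1, 0]
test_cat = ["gold", "bronze"]

mapping, global_mean = fit_target_encoding(train_cat, train_y)
test_encoded = apply_target_encoding(test_cat, mapping, global_mean)
print(test_encoded)  # [0.5, 0.6]
```

The test labels never enter `fit_target_encoding`, which is the whole point. Inside cross-validation the same discipline applies per fold: the mapping is refit on each fold's training rows.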
The remedy is procedural, not statistical: every feature should be reconstructible as of the score date by someone with access only to the warehouse and a clock, not by someone with the labelled outcomes.
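A point-in-time feature can be sketched as a function of the raw events and the score date alone. The example below is illustrative: the ticket dates and the exact window convention (30 days, score date included) are assumptions, and the function never sees the label.

```python
from datetime import date, timedelta

def tickets_last_30d(ticket_dates, score_date):
    """Count tickets in the 30 days up to and including score_date.

    Events after score_date are ignored, so the value is the same one
    that would have been computed on the day the prediction was made.
    """
    window_start = score_date - timedelta(days=30)
    return sum(window_start < d <= score_date for d in ticket_dates)

# Hypothetical ticket history for one customer.
ticket_dates = [
    date(2024, 4, 20),  # before the window
    date(2024, 5, 10),
    date(2024, 5, 28),
    date(2024, 6, 15),  # after the score date: must not count
]
score_date = date(2024, 6, 1)

print(tickets_last_30d(ticket_dates, score_date))  # 2
```

Anchoring every row's window on its score date, for churners and non-churners alike, also avoids the time-window inconsistency described earlier.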
Overfitting and Underfitting
Generalization can fail in two directions:
- Overfitting. The model is too flexible for the amount of signal in the data. Training error collapses; test error stays high or worsens. Visual symptom: dramatic gap between training and validation performance.
- Underfitting. The model is too constrained. Both training and test error stay high; the model misses real structure that a richer model would catch.
The bias–variance trade-off, which we'll revisit in §10.4, is the conceptual handle for this. A linear logistic model is biased toward simple relationships and rarely overfits. A deep tree or wide neural net is flexible and frequently does, unless regularized.
The diagnostic is always the same. If training error is much lower than test error, the model is overfitting. If both are bad, the model is underfitting or the features are insufficient.
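The diagnostic can be written down as a small rule. The thresholds below (`gap_tolerance`, `acceptable_error`) are hypothetical defaults chosen for illustration; what counts as "much lower" or "bad" depends on the problem and the metric.

```python
def diagnose(train_error, test_error, gap_tolerance=0.05, acceptable_error=0.20):
    """Classify a train/test error pair using the rule of thumb in the text.

    gap_tolerance and acceptable_error are illustrative assumptions,
    not universal constants.
    """
    if test_error - train_error > gap_tolerance:
        return "overfitting"
    if train_error > acceptable_error and test_error > acceptable_error:
        return "underfitting or insufficient features"
    return "no obvious problem"

print(diagnose(0.02, 0.30))  # overfitting
print(diagnose(0.35, 0.37))  # underfitting or insufficient features
print(diagnose(0.10, 0.12))  # no obvious problem
```

The rule is only as trustworthy as the test error it is fed: if the split leaks, both numbers look good and the diagnostic reports nothing wrong.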