§9.3

Train/Test Splits, Generalization, and Leakage

A model that explains the past is not the same as a model that survives the future. The whole discipline of generalization is about that gap. We hold out a slice of history the model never sees, fit the model on the rest, and ask: when the held-out cases were brand-new to the algorithm, how well did it score them? If the answer is "almost as well as the training cases," the model has learned something the world will reward. If the held-out performance collapses, the model has memorized noise.

This article covers three ideas. First, what the train/test split is and why it matters more than any single model choice. Second, how cross-validation generalizes the idea. Third, and this is where time is lost in practice, the catalog of leakage traps that quietly let information from the future creep into the past. A closing section covers the two directions in which generalization fails: overfitting and underfitting.

The Executive Question

If the model looks good on the data we used to train it, why might it still fail on next quarter's customers?

The deeply unsatisfying answer is: because the model can be brilliant at the historical task it was trained on and useless at the slightly different task production presents. Generalization is the test of whether those two tasks are close enough.

The Train/Test Split

The test set is a rehearsal for the future

Figure 1. The minimum-viable evaluation. Historical labelled data is split before any model touches it; the training share is used to fit, the held-out share to grade. The grade is honest only if the test set was never used to choose anything about the model.

In practice the split is rarely a single fixed cut. Three variants come up:

Random split. Typically 70/30 or 80/20 train/test. Works when rows are exchangeable, that is, when neither time order nor shared entities link one row to another. Cheap, fast, the default for an early prototype.
Time-based split. Train on data up to a cut-off date, evaluate on data after. The right default whenever the unit-by-time process has trend, seasonality, or drift — which is almost every business setting.
Group split. Hold out entire groups (customers, stores, regions) rather than rows. Required when the model will be used to score new groups, not just future rows of old groups.

The choice should match the deployment story. A churn model scored weekly on the same customer base wants a time-based split. A pricing model that will be applied to listings the firm has never seen wants a group split where new listings are the held-out set.
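The three variants can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production splitter: the row layout (dicts with `customer_id`, `date`, and `churned` keys), the function names, and the toy data are all assumptions made for the example.

```python
import random
from datetime import date


def random_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and cut; only sensible when rows are exchangeable."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def time_split(rows, cutoff):
    """Train on rows dated up to the cutoff (inclusive), test on later rows."""
    train = [r for r in rows if r["date"] <= cutoff]
    test = [r for r in rows if r["date"] > cutoff]
    return train, test


def group_split(rows, test_fraction=0.2, seed=0, key="customer_id"):
    """Hold out whole groups so no group appears on both sides."""
    groups = sorted({r[key] for r in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, int(round(len(groups) * test_fraction)))
    held_out = set(groups[:n_test])
    train = [r for r in rows if r[key] not in held_out]
    test = [r for r in rows if r[key] in held_out]
    return train, test


# Hypothetical toy data: 5 customers, 4 monthly rows each.
rows = [
    {"customer_id": c, "date": date(2024, m, 1), "churned": (c + m) % 2}
    for c in range(5)
    for m in range(1, 5)
]

train_r, test_r = random_split(rows)
train_t, test_t = time_split(rows, cutoff=date(2024, 3, 1))
train_g, test_g = group_split(rows)
```

Only the group split guarantees that a customer never sits on both sides of the cut. The random split will usually scatter each customer's rows across both sides, which is exactly what makes it unsafe when rows are not exchangeable.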

Cross-Validation

Cross-validation is what you reach for when one held-out slice doesn't give you a stable estimate of performance — typical when the dataset is small or noisy. The idea is to rotate the held-out role across many slices and average the resulting scores.

The point is not to squeeze more accuracy out of the model. It is to get a more stable estimate of how it would do on a slice it has not seen. A model with cross-validated AUC of 0.82 ± 0.01 is a different deployment risk than one with 0.82 ± 0.07, even though they have the same headline number.
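A minimal k-fold sketch makes the rotation concrete. The assumptions are loud ones: the model is a toy one-feature threshold classifier, the data is synthetic, and the metric is accuracy rather than the AUC used in the example above, purely to keep the code dependency-free. All function names are hypothetical.

```python
import random
import statistics


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each row is held out exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f_no, f in enumerate(folds) if f_no != i for j in f]
        yield train_idx, test_idx


def cross_validate(xs, ys, fit, score, k=5, seed=0):
    """Fit on k-1 folds, score on the remaining fold, and summarize the spread."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k, seed):
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        scores.append(
            score(model, [xs[j] for j in test_idx], [ys[j] for j in test_idx])
        )
    return statistics.mean(scores), statistics.stdev(scores), scores


def fit_threshold(xs, ys):
    """Toy model: the midpoint between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2


def accuracy(threshold, xs, ys):
    """Share of rows where 'x above threshold' agrees with the label."""
    hits = sum((x > threshold) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)


# Synthetic data: class 1 is centred one standard deviation above class 0.
rng = random.Random(1)
ys = [i % 2 for i in range(100)]
xs = [rng.gauss(1.0 if y else 0.0, 1.0) for y in ys]

mean_acc, sd_acc, fold_scores = cross_validate(xs, ys, fit_threshold, accuracy)
print(f"accuracy {mean_acc:.2f} +/- {sd_acc:.2f} over {len(fold_scores)} folds")
```

The number to report is the pair (mean, spread), not the mean alone. For time-ordered or grouped data, the shuffle inside `k_fold_indices` would need to be replaced by folds that respect time order or group membership.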

Leakage: Information From the Future

Leakage is when a feature contains information that would not have been available at decision time. It is the single most common reason a model with stellar offline metrics produces disappointing live results.

Feature leakage gallery — would this be known at decision time?

leak: cancellation_date (only known after churn)
safe: days_since_last_purchase (computed at decision time)
safe: support_tickets_last_30d (past window only)
maybe: refund_amount_total (safe only if cut off before label window)
leak: final_account_balance ("final" implies end of relationship)
safe: discount_share (historical behaviour)
leak: churn_reason_code (recorded at the time of churn)
safe: loyalty_tier (attribute at decision time)

A feature is leaking whenever its value at training time is not knowable when the prediction needs to be made.

Figure 2. A leakage gallery for a churn task. Every feature has to be inspectable as of the score date — if its value is only knowable after the outcome occurs, it leaks.
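The gallery's test can be automated when each feature value carries the date on which it became known. The sketch below assumes such a record exists; the `known_at` mapping, its dates, and the function name are hypothetical.

```python
from datetime import date


def leaking_features(known_at, score_date):
    """Return names of features whose value only became known after score_date."""
    return sorted(name for name, when in known_at.items() if when > score_date)


# Hypothetical record for one churned customer scored on 2024-03-01.
score_date = date(2024, 3, 1)
known_at = {
    "days_since_last_purchase": date(2024, 3, 1),
    "support_tickets_last_30d": date(2024, 3, 1),
    "loyalty_tier": date(2024, 1, 15),
    "cancellation_date": date(2024, 3, 20),
    "churn_reason_code": date(2024, 3, 20),
}

flagged = leaking_features(known_at, score_date)
print(flagged)  # ['cancellation_date', 'churn_reason_code']
```

In many warehouses this availability date is not stored, in which case the check has to be done by reading how each column is populated.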

A few patterns to watch for:

Outcome-derived fields. cancellation_date, churn_reason_code, final_account_balance — anything that exists because the outcome happened.
Time-window inconsistency. Features computed over different windows for different rows. The classic version: avg_spend_last_30_days ending on the date of churn for churners and on the score date for non-churners.
Target encoding bleed. Replacing a categorical with its mean target: acceptable if the mean is computed using only training rows (ideally out-of-fold, so that a row's own label does not feed its own encoding), deadly if computed over the whole dataset.
Duplicate rows across the split. The same customer appearing in both training and test with different timestamps, allowing the model to memorize their idiosyncrasies.
Future joins. Joining a snapshot table that was rebuilt after the prediction date, so old rows carry today's metadata.
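As one illustration from the list above, target encoding stays on the safe side of the split when the category means are learned from training rows and merely looked up for test rows. A minimal sketch follows; the function names, the `plan` categories, and the fallback to the overall training mean for unseen categories are assumptions of this example.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Learn the per-category mean target from training rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    prior = sum(targets) / len(targets)  # overall training mean
    means = {c: sums[c] / counts[c] for c in sums}
    return means, prior


def apply_target_encoding(categories, means, prior):
    """Look up learned means; unseen categories fall back to the prior."""
    return [means.get(c, prior) for c in categories]


# Hypothetical training and test rows.
train_plan = ["basic", "basic", "pro", "pro", "pro", "basic"]
train_churn = [1, 1, 0, 0, 1, 0]
test_plan = ["basic", "pro", "enterprise"]

means, prior = fit_target_encoding(train_plan, train_churn)
test_encoded = apply_target_encoding(test_plan, means, prior)
```

The test labels never enter `fit_target_encoding`, so the test score cannot benefit from them. The sketch does not address the second-order problem of encoding the training rows themselves with their own labels.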

The remedy is procedural, not statistical: every feature should be reconstructible as of the score date by someone with access only to the warehouse and a clock, not by someone with the labelled outcomes.

Overfitting and Underfitting

Generalization can fail in two directions:

Overfitting. The model is too flexible for the amount of signal in the data. Training error collapses; test error stays high or worsens. Visual symptom: dramatic gap between training and validation performance.
Underfitting. The model is too constrained. Both training and test error stay high; the model misses real structure that a richer model would catch.

The bias–variance trade-off, which we'll revisit in §10.4, is the conceptual handle for this. A linear logistic model is biased toward simple relationships and overfits less readily, at least when rows comfortably outnumber features. A deep tree or wide neural net is flexible and frequently overfits unless regularized.

The diagnostic is always the same. If training error is much lower than test error, the model is overfitting. If both are bad, the model is underfitting or the features are insufficient.
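The diagnostic reduces to two comparisons, sketched below. The thresholds are placeholders, not recommendations: what counts as a "much lower" training error or a "bad" error level depends on the task, so `gap_tolerance` and `acceptable_error` are assumptions to be set per project.

```python
def diagnose(train_error, test_error, gap_tolerance=0.05, acceptable_error=0.15):
    """Classify a fit from its training and test error rates.

    Both thresholds are illustrative defaults, not universal constants.
    """
    # A large train/test gap is checked first, following the order in the text.
    if test_error - train_error > gap_tolerance:
        return "overfitting"
    if train_error > acceptable_error and test_error > acceptable_error:
        return "underfitting or insufficient features"
    return "no obvious problem"


print(diagnose(0.02, 0.30))  # overfitting
print(diagnose(0.28, 0.30))  # underfitting or insufficient features
print(diagnose(0.08, 0.10))  # no obvious problem
```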