§9.3

Train/Test Splits, Generalization, and Leakage

A model that explains the past is not the same as a model that survives the future. The whole discipline of generalization is about that gap. We hold out a slice of history the model never sees, fit the model on the rest, and ask: when the held-out cases were brand-new to the algorithm, how well did it score them? If the answer is "almost as well as the training cases," the model has learned something the world will reward. If the held-out performance collapses, the model has memorized noise.

This article is about three ideas. First, what the train/test split is and why it matters more than any single model choice. Second, how cross-validation generalizes the idea. Third — and this is where time is lost in practice — the catalog of leakage traps that quietly let information from the future creep into the past.


The Executive Question

If the model looks good on the data we used to train it, why might it still fail on next quarter's customers?

The deeply unsatisfying answer is: because the model can be brilliant at the historical task it was trained on and useless at the slightly different task production presents. Generalization is the test of whether those two tasks are close enough.


The Train/Test Split

The test set is a rehearsal for the future

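[Figure 1 diagram: historical labelled data (past customers, known outcomes) is split into Training (≈70%) and Test (≈30%). The training share is used to fit the model; the test share is used to predict and score, comparing predictions to known test outcomes. If performance collapses outside training, the model overfit.]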
Figure 1. The minimum-viable evaluation. Historical labelled data is split before any model touches it; the training share is used to fit, the held-out share to grade. The grade is honest only if the test set was never used to choose anything about the model.

In practice the split is rarely a single fixed cut. Three variants come up:

  • Random split. Roughly 70/30 or 80/20. Works when rows are exchangeable and unrelated across time. Cheap, fast, the default for an early prototype.
  • Time-based split. Train on data up to a cut-off date, evaluate on data after. The right default whenever the data shows trend, seasonality, or drift over time, which is almost every business setting.
  • Group split. Hold out entire groups (customers, stores, regions) rather than rows. Required when the model will be used to score new groups, not just future rows of old groups.

The choice should match the deployment story. A churn model scored weekly on the same customer base wants a time-based split. A pricing model that will be applied to listings the firm has never seen wants a group split where new listings are the held-out set.
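The three variants can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed implementation: the toy rows, the tuple layout (customer_id, snapshot_date, churned), and the function names are all assumptions made for the example.

```python
import random
from datetime import date

# Hypothetical toy rows: (customer_id, snapshot_date, churned).
rows = [
    (cid, date(2024, month, 1), (cid + month) % 5 == 0)
    for cid in range(1, 11)
    for month in range(1, 7)
]

def random_split(rows, test_share=0.3, seed=0):
    """Shuffle rows and cut; only sound when rows are exchangeable."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]

def time_split(rows, cutoff):
    """Train on snapshots up to and including the cutoff, test on later ones."""
    train = [r for r in rows if r[1] <= cutoff]
    test = [r for r in rows if r[1] > cutoff]
    return train, test

def group_split(rows, test_share=0.3, seed=0):
    """Hold out whole customers so no customer appears on both sides."""
    customers = sorted({r[0] for r in rows})
    random.Random(seed).shuffle(customers)
    n_test = round(len(customers) * test_share)
    held_out = set(customers[:n_test])
    train = [r for r in rows if r[0] not in held_out]
    test = [r for r in rows if r[0] in held_out]
    return train, test

train_r, test_r = random_split(rows)
train_t, test_t = time_split(rows, cutoff=date(2024, 4, 1))
train_g, test_g = group_split(rows)
```

Note that the random split puts other snapshots of a test customer into training, which is exactly the situation the group split is designed to rule out.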


Cross-Validation

Cross-validation is what you reach for when one held-out slice doesn't give you a stable estimate of performance — typical when the dataset is small or noisy. The idea is to rotate the held-out role across many slices and average the resulting scores.

The point is not to squeeze more accuracy out of the model. It is to get a more stable estimate of how it would do on a slice it has not seen. A model with cross-validated AUC of 0.82 ± 0.01 is a different deployment risk than one with 0.82 ± 0.07, even though they have the same headline number.
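The rotation can be sketched as a k-fold loop. The example below is a hedged illustration: the data is synthetic, the "model" is a toy nearest-centroid classifier standing in for whatever would really be fitted, and the score is accuracy rather than AUC to keep the code short. All function names are hypothetical.

```python
import random
from statistics import mean, stdev

def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and deal them into k disjoint folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_centroids(xs, ys):
    """Toy stand-in model: the per-class mean of a single feature."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label)
            for label in set(ys)}

def predict(centroids, x):
    """Assign the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def cross_validate(xs, ys, k=5):
    """Return one held-out accuracy per fold."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_idx = [i for i in range(len(xs)) if i not in held]
        # Fit only on the rows outside the current fold.
        model = fit_centroids([xs[i] for i in train_idx],
                              [ys[i] for i in train_idx])
        correct = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores

# Synthetic data: two classes whose single feature differs in mean.
rng = random.Random(42)
ys = [i % 2 for i in range(200)]
xs = [rng.gauss(1.0 if y else -1.0, 1.0) for y in ys]

scores = cross_validate(xs, ys, k=5)
print(f"accuracy = {mean(scores):.2f} ± {stdev(scores):.2f}")
```

The spread across folds is the quantity of interest here: it is what distinguishes a stable estimate from a lucky one.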


Leakage: Information From the Future

Leakage is when a feature contains information that would not have been available at decision time. It is the single most common reason a model with stellar offline metrics produces disappointing live results.

Feature leakage gallery — would this be known at decision time?

  • cancellation_date: leak (only known after churn)
  • days_since_last_purchase: safe (computed at decision time)
  • support_tickets_last_30d: safe (past window only)
  • refund_amount_total: maybe (safe only if cut off before label window)
  • final_account_balance: leak ("final" implies end of relationship)
  • discount_share: safe (historical behaviour)
  • churn_reason_code: leak (recorded at the time of churn)
  • loyalty_tier: safe (attribute at decision time)

A feature is leaking whenever its value at training time is not knowable when the prediction needs to be made.

Figure 2. A leakage gallery for a churn task. Every feature has to be inspectable as of the score date — if its value is only knowable after the outcome occurs, it leaks.

A few patterns to watch for:

  • Outcome-derived fields. cancellation_date, churn_reason_code, final_account_balance — anything that exists because the outcome happened.
  • Time-window inconsistency. Features computed over different windows for different rows. The classic version: avg_spend_last_30_days ending on the date of churn for churners and on the score date for non-churners.
  • Target encoding bleed. Replacing a categorical with its mean target is fine if the mean is computed using only training rows (ideally out-of-fold within training), deadly if computed over the whole dataset, because test labels then shape the training features.
  • Duplicate rows across the split. The same customer appearing in both training and test with different timestamps, allowing the model to memorize their idiosyncrasies.
  • Future joins. Joining a snapshot table that was rebuilt after the prediction date, so old rows carry today's metadata.
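The target-encoding pattern lends itself to a short sketch: the category means are learned from training rows only, then applied unchanged to test rows. The column names, toy values, and function names below are assumptions for illustration. The sketch also glosses over one refinement: within the training set itself, each row's own label contributes to its category mean, which is why out-of-fold encoding is often preferred.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets):
    """Learn per-category target means from training rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    global_mean = sum(targets) / len(targets)
    mapping = {c: sums[c] / counts[c] for c in sums}
    return mapping, global_mean

def apply_target_encoding(categories, mapping, fallback):
    """Encode rows with a fitted mapping; unseen categories get the fallback."""
    return [mapping.get(c, fallback) for c in categories]

# Hypothetical toy data: plan type and churn label.
train_plan = ["basic", "basic", "pro", "pro", "pro"]
train_churn = [1, 0, 0, 0, 1]
test_plan = ["basic", "pro", "enterprise"]  # "enterprise" never seen in training

mapping, global_mean = fit_target_encoding(train_plan, train_churn)
test_encoded = apply_target_encoding(test_plan, mapping, global_mean)
```

No test label is touched at any point, which is the property the leaky version violates.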

The remedy is procedural, not statistical: every feature should be reconstructible as of the score date by someone with access only to the warehouse and a clock, not by someone with the labelled outcomes.
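One way to make "reconstructible as of the score date" concrete is to pass the score date into every feature function and let it bound the window. The sketch below rebuilds support_tickets_last_30d from a hypothetical ticket log; the log layout, the exact window convention (the 30 days ending the day before the score date), and the function signature are assumptions.

```python
from datetime import date, timedelta

# Hypothetical event log: (customer_id, ticket_date).
tickets = [
    (1, date(2024, 3, 5)),
    (1, date(2024, 3, 20)),
    (1, date(2024, 4, 2)),   # after the score date: must not count
    (2, date(2024, 1, 15)),  # older than 30 days: outside the window
]

def support_tickets_last_30d(tickets, customer_id, score_date):
    """Count tickets in the 30 days ending the day before score_date."""
    window_start = score_date - timedelta(days=30)
    return sum(
        1
        for cid, ticket_date in tickets
        if cid == customer_id and window_start <= ticket_date < score_date
    )

score_date = date(2024, 3, 31)
features = {
    cid: support_tickets_last_30d(tickets, cid, score_date) for cid in (1, 2)
}
# features == {1: 2, 2: 0}
```

Because the window is anchored on the score date for every row, churners and non-churners are measured over the same period, which also guards against the time-window inconsistency described above.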


Overfitting and Underfitting

Generalization can fail in two directions:

  • Overfitting. The model is too flexible for the amount of signal in the data. Training error collapses; test error stays high or worsens. Visual symptom: a dramatic gap between training and validation performance.
  • Underfitting. The model is too constrained. Both training and test error stay high; the model misses real structure that a richer model would catch.

The bias–variance trade-off, which we'll revisit in §10.4, is the conceptual handle for this. A logistic regression is biased toward simple relationships and overfits less readily, at least when features are few relative to rows. A deep tree or wide neural net is flexible and frequently does overfit, unless regularized.

The diagnostic is always the same. If training error is much lower than test error, the model is overfitting. If both are bad, the model is underfitting or the features are insufficient.
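The diagnostic can be written down as a small rule. The sketch below is illustrative only: the function name and both thresholds (how large a gap counts as "much lower", and what error level counts as "bad") are assumptions that would have to be set per problem.

```python
def diagnose(train_error, test_error, gap_tolerance=0.05, acceptable_error=0.20):
    """Classify a train/test error pair using hypothetical thresholds."""
    gap = test_error - train_error
    if gap > gap_tolerance:
        # Training error is much lower than test error.
        return "overfitting"
    if train_error > acceptable_error and test_error > acceptable_error:
        # Both errors are bad and close together.
        return "underfitting or insufficient features"
    return "no obvious generalization problem"

print(diagnose(0.02, 0.30))  # overfitting
print(diagnose(0.35, 0.36))  # underfitting or insufficient features
print(diagnose(0.10, 0.12))  # no obvious generalization problem
```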