§9.2
The Supervised Learning Setup
Supervised learning is the kind of pattern recognition where the answer is in the training data. We have a stack of past cases — customers who churned or stayed, listings that sold at one price or another, claims that were paid or denied — and we want a function that, for each new case, guesses the answer the world will produce next. The "supervision" is the labelled history. Without it, a model is still possible, but the task is different and lives in Chapter 11.
This article fixes the vocabulary every supervised model in the next chapter rests on: target, features, unit of analysis, and label timing. Get these four right and the modelling choices fall out almost mechanically. Get them wrong, most often the third or fourth, and even an excellent algorithm produces a useless score.
The four-decision setup that follows, referred to here as the Task Contract, is in effect the predictive descendant of the Decision Question Card from §5.1. Same discipline; different question. The card asked, "What action are we taking?" The Task Contract asks, "What target are we predicting?" Both end with a one-sentence contract a team can review in fifteen minutes.
The Executive Question
What, exactly, are we trying to predict, for whom, and at what moment?
The wording matters. "Predict churn" is a topic. "For each weekly active customer, the probability of churn over the next 60 days, scored every Monday from features known by Sunday night" is a task. The second version specifies the target, the unit, the horizon, and the label-timing rule. The difference between the two is the difference between a project that ships and one that quietly stalls.
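One way to keep a task from sliding back into a topic is to write it down as a small record. The sketch below is illustrative only: the `TaskSpec` class, its field names, and the example date are hypothetical and not part of any library. It encodes the example task above, assuming that "features known by Sunday night" means a one-day lag before a Monday score date.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical record of one supervised task (all names are illustrative)."""
    unit: str              # what one row represents
    target: str            # what is being predicted
    horizon_days: int      # how far ahead the label looks
    score_weekday: int     # 0 = Monday, following date.weekday()
    feature_lag_days: int  # features must be known this many days before scoring

    def feature_cutoff(self, score_date: date) -> date:
        """Last day whose data may feed the features for this score date."""
        return score_date - timedelta(days=self.feature_lag_days)

    def label_window(self, score_date: date) -> tuple[date, date]:
        """Start and end of the outcome window for this score date."""
        return score_date, score_date + timedelta(days=self.horizon_days)


churn_task = TaskSpec(
    unit="weekly active customer",
    target="churn",
    horizon_days=60,
    score_weekday=0,      # scored every Monday
    feature_lag_days=1,   # features known by Sunday night
)

monday = date(2024, 6, 3)  # an arbitrary Monday chosen for the example
cutoff = churn_task.feature_cutoff(monday)
window_start, window_end = churn_task.label_window(monday)
print(cutoff, window_start, window_end)
```

Writing the horizon and the feature lag as explicit fields makes them reviewable: two people reading the record cannot silently assume different windows.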
The Four Decisions That Define the Task
1. The target Y
The target is the thing the model is supposed to guess. It comes in two flavours: a class label (yes/no, fraud/not, segment-A/B/C) or a number (price, demand, lifetime value). Almost every algorithmic choice in Chapter 10 follows from this choice.
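Same setup, two flavours of target

| Flavour | Example | Target |
|---|---|---|
| Classification | Churn | churned = yes/no |
| Classification | Loan default | default = yes/no |
| Classification | Lead conversion | converted = yes/no |
| Regression | Listing price | price in $ |
| Regression | Demand next month | units sold |
| Regression | Customer LTV | expected revenue |
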
Choosing classification vs regression is a choice about what action the answer must support, not a property of the data.
A subtle rule: define the target so that it is knowable at evaluation time. If "churn within 60 days" is the target, then at any given Monday, only cohorts from at least 60 days ago can be used to score the model honestly. This single discipline removes most of the bad surprises a churn model produces in its first month live.
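The rule can be expressed as a simple date check. The following is a minimal sketch, assuming weekly scoring cohorts and the 60-day horizon from the text; the function names and the evaluation date are hypothetical.

```python
from datetime import date, timedelta

HORIZON_DAYS = 60  # the churn horizon used in the text


def label_is_mature(score_date: date, today: date,
                    horizon_days: int = HORIZON_DAYS) -> bool:
    """True once the full outcome window after score_date has elapsed."""
    return score_date + timedelta(days=horizon_days) <= today


def evaluable_cohorts(score_dates: list[date], today: date) -> list[date]:
    """Keep only the scoring cohorts whose labels are fully observed."""
    return [d for d in score_dates if label_is_mature(d, today)]


today = date(2024, 9, 2)  # hypothetical evaluation date
# Twelve weekly cohorts, from this week back to eleven weeks ago.
weekly_cohorts = [today - timedelta(weeks=k) for k in range(12)]
usable = evaluable_cohorts(weekly_cohorts, today)
print(len(usable), "of", len(weekly_cohorts), "cohorts have complete labels")
```

With weekly cohorts, the most recent usable one is nine weeks old (63 days), so a freshly launched model has no honest evaluation data for roughly its first two months.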
2. The features X
Features are the columns the model will use to guess the target. The art is less in choosing fancy transformations than in being honest about what is known at decision time. If "number of refunds last month" can only be computed in the second week of the next month, it is not a real-time feature for a model that scores customers each Monday morning.
A simple test: would a human in the loop, with access to your data warehouse but no clairvoyance, be able to look the value up at the moment a prediction needs to be made? If not, the feature is information from the future.
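The same test can be applied mechanically if each feature value carries the time at which it first became available. This is a sketch under that assumption; the `FeatureValue` class, the feature names, and the timestamps are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FeatureValue:
    """Hypothetical feature value tagged with when it could first be looked up."""
    name: str
    value: float
    available_at: datetime


def usable_features(values: list[FeatureValue],
                    decision_time: datetime) -> list[FeatureValue]:
    """Keep only values a person could have looked up at decision time."""
    return [v for v in values if v.available_at <= decision_time]


decision_time = datetime(2024, 6, 3, 8, 0)  # a Monday-morning scoring run
candidates = [
    # Computed nightly, so it is ready before the Monday run.
    FeatureValue("logins_last_7d", 4.0, datetime(2024, 6, 2, 23, 0)),
    # Last month's refunds are only finalised in the second week of the month.
    FeatureValue("refunds_last_month", 2.0, datetime(2024, 6, 10, 9, 0)),
]

usable_names = [v.name for v in usable_features(candidates, decision_time)]
print(usable_names)
```

The filter depends on `available_at`, not on the period the feature describes: a "last month" feature is about the past but can still arrive too late.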
3. The unit of analysis
The unit names what one row of the dataset represents. Common units in this book:
- a customer, scored periodically;
- a listing, scored once at posting time;
- a transaction, scored at the moment of authorization;
- a store-week, scored before each week begins.
The unit determines the count, the variance, and the meaning of every score. A model that predicts "transactions that are fraudulent" learns different patterns than one that predicts "customers who will, at some point this quarter, have a fraudulent transaction." Both can be useful — but they cannot be evaluated against each other.
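A toy example makes the point about counts concrete. The data below is invented; the sketch only shows that the same transactions yield a different number of rows and a different base rate depending on whether a row is a transaction or a customer.

```python
from collections import defaultdict

# Hypothetical toy data: one (customer_id, is_fraud) pair per transaction.
transactions = [
    ("c1", False), ("c1", False), ("c1", True),
    ("c2", False), ("c2", False),
    ("c3", False),
    ("c4", True), ("c4", False),
]

# Unit = transaction: one row per transaction.
txn_rows = len(transactions)
txn_fraud_rate = sum(is_fraud for _, is_fraud in transactions) / txn_rows

# Unit = customer: one row per customer, positive if any transaction was fraud.
any_fraud = defaultdict(bool)  # missing customers default to False
for customer_id, is_fraud in transactions:
    any_fraud[customer_id] = any_fraud[customer_id] or is_fraud

cust_rows = len(any_fraud)
cust_fraud_rate = sum(any_fraud.values()) / cust_rows

print(txn_rows, txn_fraud_rate)    # 8 rows, base rate 0.25
print(cust_rows, cust_fraud_rate)  # 4 rows, base rate 0.5
```

Because the base rates differ, an accuracy or precision figure from one unit says nothing about performance on the other.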
4. Label timing
Label timing is the rule that connects features and outcome along the time axis. It looks innocuous and is the source of most leakage incidents:
| Variant | Label rule | Feature window | Used for |
|---|---|---|---|
| Forward-looking | Did the customer churn in the next 60 days? | Features cut off as of the score date. | Weekly retention scoring. |
| Retrospective | Did the customer churn within the dataset? | All features ever observed. | Post-hoc segmentation only — not safe for scoring. |
| Calendar-anchored | Did the customer churn between July and September? | Features as of June 30. | Annual planning and offer design. |
The retrospective variant looks brilliant in cross-validation and collapses in production: every feature has, in effect, been allowed to "look ahead" at the label. We will return to this in §9.3.
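The contrast between the forward-looking and retrospective rows of the table can be sketched for a single customer. Everything here is hypothetical: the ticket dates, the churn date, and the function names are invented, and a single ticket count stands in for a full feature set.

```python
from datetime import date, timedelta
from typing import Optional

HORIZON = timedelta(days=60)

# Hypothetical history for one customer.
ticket_dates = [date(2024, 5, 1), date(2024, 5, 20), date(2024, 6, 25)]
churn_date: Optional[date] = date(2024, 7, 10)  # None would mean "never churned"


def forward_looking_row(score_date: date, tickets: list[date],
                        churned_on: Optional[date]) -> dict:
    """Features stop before the score date; the label looks 60 days ahead."""
    n_tickets = sum(1 for t in tickets if t < score_date)
    churned = (churned_on is not None
               and score_date <= churned_on < score_date + HORIZON)
    return {"tickets": n_tickets, "churned": churned}


def retrospective_row(tickets: list[date], churned_on: Optional[date]) -> dict:
    """All features ever observed, label over the whole dataset: leaks."""
    return {"tickets": len(tickets), "churned": churned_on is not None}


fwd = forward_looking_row(date(2024, 6, 3), ticket_dates, churn_date)
retro = retrospective_row(ticket_dates, churn_date)
print(fwd)    # 2 tickets: only those raised before the score date
print(retro)  # 3 tickets: includes one raised after the score date
```

The third ticket was raised after the score date, on the way to the churn event. The retrospective row counts it as a feature, which is exactly the "look ahead" that inflates cross-validation results.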
Why "Same Algorithm" Does Not Mean "Same Task"
Two teams may both report they are running a gradient-boosted classifier on customer features to predict churn, and yet be solving entirely different problems. One scores active customers; the other scores everyone who ever transacted. One uses a 60-day horizon; the other uses 12 months. One uses features known at decision time; the other has leakage.
When two churn models disagree about whom to target, the discrepancy is almost never about the algorithm. It is about the task as defined by the four decisions above.
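A quick way to surface such differences is to compare the task definitions field by field before comparing the models. The sketch below is only an illustration: the dictionary keys and values are hypothetical labels for the two teams described above.

```python
# Hypothetical task definitions for the two teams described above.
team_a = {
    "algorithm": "gradient-boosted classifier",
    "target": "churn",
    "population": "active customers",
    "horizon_days": 60,
    "feature_rule": "known at decision time",
}
team_b = {
    "algorithm": "gradient-boosted classifier",
    "target": "churn",
    "population": "everyone who ever transacted",
    "horizon_days": 365,
    "feature_rule": "all features ever observed",
}


def task_differences(a: dict, b: dict) -> dict:
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    fields = sorted(a.keys() | b.keys())
    return {f: (a.get(f), b.get(f)) for f in fields if a.get(f) != b.get(f)}


diff = task_differences(team_a, team_b)
for field, (a_value, b_value) in diff.items():
    print(f"{field}: {a_value!r} vs {b_value!r}")
```

The algorithm and the nominal target match, so neither appears in the output; the population, the horizon, and the feature rule do.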