§9.4
Feature Engineering
In the AutoML era, two parts of the modelling lifecycle are getting cheaper by the year. Picking the algorithm is one. Tuning the hyperparameters is the other. The part that has stayed stubbornly human is the one between raw data and the model — turning warehouse columns into features that capture the business sense of the question. This is feature engineering, and as the rest of the pipeline automates, the managerial leverage of feature engineering goes up, not down.
This article is about the move from a transactional warehouse to a feature catalog: the recency, frequency, monetary, and engagement summaries that most customer models lean on; the encodings that handle categories and dates without leaking; and the interaction features where business intuition pays off.
The Executive Question
What do we know about a customer (or a listing, or a transaction) that a model could use to tell them apart?
The deeper version: which of those things are available at decision time, reconstructible from the warehouse, and aligned with a managerial story? Features that score well on all three are durable. Features that score on accuracy alone tend to break in production.
A Feature Catalog for a Customer Model
Most customer-level models in retail and services lean on four families of features: recency, frequency, monetary, and engagement.
Figure: A feature catalog for a Bean & Basket churn model. Better features often matter more than fancier algorithms.
The taxonomy is older than machine learning. RFM — recency, frequency, monetary — was a direct-marketing tool decades before logistic regression became standard. The engagement column is the natural extension once email opens, app sessions, and loyalty interactions become trackable.
Why this matters managerially: every feature on the catalog corresponds to a story a marketer can tell. "We're going to call customers whose recency just slipped past two weeks and whose support tickets are up." Models built on stories are easier to maintain, easier to explain, and harder to break than models built on whatever the warehouse happened to expose.
Encoding the Awkward Columns
A handful of column types refuse to drop directly into a model:
Categoricals. A column with values like email, app, retail needs to be converted into numbers. The default move is one-hot encoding — one indicator column per value. For high-cardinality columns (ZIP code, product SKU) one-hot creates too many sparse columns; target encoding and embedding are common alternatives. Target encoding must be done inside the training fold to avoid the leakage we flagged in §9.3.
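To make the leakage point concrete, here is a minimal sketch of out-of-fold target encoding in plain Python. The function name, the index-modulo fold assignment, and the smoothing constant are illustrative assumptions, not a prescribed method; a real pipeline would usually shuffle rows into folds and lean on a tested library.

```python
from collections import defaultdict

def target_encode_out_of_fold(categories, targets, n_folds=3, smoothing=5.0):
    """Encode each row with the target mean of its category,
    computed only from rows in the OTHER folds."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        total, total_n = 0.0, 0
        # Fit statistics on every row that is NOT in the held-out fold.
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total += targets[i]
                total_n += 1
        prior = total / total_n  # overall target rate in the fitting rows
        # Encode the held-out rows; their own targets were never used.
        for i in range(n):
            if i % n_folds == fold:
                c = categories[i]
                # Shrink toward the prior; an unseen category gets the prior.
                encoded[i] = (sums[c] + smoothing * prior) / (counts[c] + smoothing)
    return encoded

channels = ["email", "email", "app", "app", "email", "app"]
churned = [1, 0, 1, 1, 0, 1]
channel_encoded = target_encode_out_of_fold(channels, churned)
```

Because a row's own label never enters its encoding, changing that label leaves the row's encoded value unchanged, which is the property that keeps the feature from leaking.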
Dates and times. The raw timestamp is rarely a useful column in itself. The useful columns are derived from it: day of week, hour of day, days since last visit, days until next holiday, week-of-year as a cyclical encoding. The right derivation depends on the periodicity the business actually has.
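The derivations listed above can be sketched with the standard library alone. The function and argument names are hypothetical, and the 52-week period for the cyclical encoding is a simplifying assumption (ISO week 53 lands on the same point as week 1).

```python
import math
from datetime import datetime

def date_features(event_time, last_visit, next_holiday):
    """Derive model-ready columns from a raw timestamp."""
    week = event_time.isocalendar()[1]  # ISO week number, 1..53
    # Map the week onto a circle so late December sits next to early January.
    angle = 2 * math.pi * (week - 1) / 52
    return {
        "day_of_week": event_time.weekday(),  # Monday = 0
        "hour_of_day": event_time.hour,
        "days_since_last_visit": (event_time.date() - last_visit.date()).days,
        "days_until_next_holiday": (next_holiday.date() - event_time.date()).days,
        "week_sin": math.sin(angle),
        "week_cos": math.cos(angle),
    }

features = date_features(
    event_time=datetime(2024, 3, 15, 18, 30),
    last_visit=datetime(2024, 3, 1, 9, 0),
    next_holiday=datetime(2024, 3, 31),
)
```

The sine and cosine pair is what makes the encoding cyclical: a single week number would tell the model that week 52 and week 1 are far apart.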
Free text. Bag-of-words and TF-IDF turn short strings into feature columns; modern embedding methods (Part V) turn them into dense vectors. For Part IV, the move that matters is that a single text column becomes many feature columns — and those many columns can leak in all the same ways the original could.
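As a rough illustration of one text column becoming many feature columns, the sketch below builds a bag-of-words count matrix. The helper names and the tokenizer regex are assumptions made for the example. The vocabulary is learned from training text only, so words that first appear at scoring time are ignored rather than silently creating new columns.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def fit_vocabulary(train_texts):
    """Learn the column list from training text only."""
    return sorted({tok for text in train_texts for tok in tokenize(text)})

def vectorize(texts, vocabulary):
    """One count column per vocabulary word; out-of-vocabulary words are dropped."""
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts[word] for word in vocabulary])
    return rows

train_notes = ["late delivery", "cancel my order", "delivery was late"]
vocabulary = fit_vocabulary(train_notes)
new_rows = vectorize(["late late refund"], vocabulary)
```

Three short support notes already yield six columns, which is why every leakage check that applied to the original text column has to be repeated for the derived ones.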
Continuous variables with skew. Spend, price, and revenue tend to be right-skewed. Linear and distance-based models are sensitive to that skew, while tree-based models largely are not; where it matters, a log transform usually helps. The transform should be saved with the model so production inputs are transformed identically.
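One way to keep training and production in step is to store the transform as data next to the model. The sketch below assumes a hypothetical column called spend_90d and uses log1p so that zero spend is handled; the spec format is an illustration, not a standard.

```python
import json
import math

def make_transform_spec(column):
    """Record which transform the model expects for a column."""
    return {"column": column, "transform": "log1p"}

def apply_transform(spec, value):
    """Apply the recorded transform to one raw value."""
    if spec["transform"] == "log1p":
        # log1p(x) = log(1 + x), so zero spend maps to 0 instead of failing.
        return math.log1p(value)
    raise ValueError(f"unknown transform: {spec['transform']}")

# Training time: transform the column and serialize the spec with the model.
spend_spec = make_transform_spec("spend_90d")
train_spend = [0.0, 12.0, 40.0, 2500.0]
train_spend_log = [apply_transform(spend_spec, v) for v in train_spend]
saved_spec = json.dumps(spend_spec)

# Scoring time: reload the spec and apply the identical transform.
loaded_spec = json.loads(saved_spec)
score_value = apply_transform(loaded_spec, 40.0)
```

The scoring code never decides how to transform the input; it only reads what training recorded, so the two paths cannot drift apart quietly.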
Interactions: Where Business Knowledge Lives
A pure linear model on raw features encodes the assumption that effects are additive. Interactions are how non-additivity gets back in:
- A loyalty-tier × discount-share interaction lets the model say "discount sensitivity is high for bronze customers and low for platinum."
- A device-type × time-of-day interaction captures the mobile evening / desktop morning split many e-commerce models hinge on.
- A region × season interaction lets the model recognize that the southern California iced-drink seasonality is offset from the New England one.
Tree-based models discover interactions implicitly when they split on one feature then on another. Linear and logistic models need them spelled out. Either way, the manager's role is the same: name the interactions that the business already knows about, and let the algorithm confirm or deny them.
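For a linear or logistic model, spelling out the loyalty-tier by discount-share interaction might look like the sketch below. The tier list comes from the catalog in this section; the function and column names are assumptions for illustration.

```python
TIERS = ["bronze", "silver", "gold", "platinum"]

def tier_discount_interactions(loyalty_tier, discount_share):
    """One column per tier: the discount share where the tier matches, else 0.

    A linear model fits a separate coefficient to each column, which gives
    each tier its own discount-sensitivity slope.
    """
    return {
        f"discount_share_x_{tier}": discount_share if loyalty_tier == tier else 0.0
        for tier in TIERS
    }

bronze_row = tier_discount_interactions("bronze", 0.4)
platinum_row = tier_discount_interactions("platinum", 0.4)
```

The four interaction columns always sum to the plain discount share, so if that base column is also in the model, one tier's interaction should be dropped to avoid perfect collinearity.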
RFM, Plus
Below is the smallest feature set that does respectable work on a retail churn problem. It is also the right starting point for any new customer model: ship this catalog first, see how it performs, then layer on engagement and behavior.
| Feature | Family | Definition | Why a manager cares |
|---|---|---|---|
| days_since_last_purchase | Recency | score_date − max(purchase_date) | Drop-off is the leading indicator of churn. |
| orders_last_90d | Frequency | count of purchases in trailing 90 days | Steady cadence vs. lapsing. |
| avg_order_value_lt | Monetary | lifetime revenue / lifetime orders | Separates premium and value customers. |
| discount_share_90d | Monetary | discounted spend / total spend, 90 days | Detects coupon-dependent buyers. |
| email_open_rate_30d | Engagement | opens / sends, trailing 30 days | Brand attention proxy. |
| tickets_last_30d | Engagement | support tickets in trailing 30 days | Friction is a strong churn signal. |
| loyalty_tier | Static | current tier (bronze/silver/gold/platinum) | Tier-specific retention plays. |
| tenure_days | Static | days since signup | New customers churn for different reasons. |
The catalog has two virtues beyond predictive power. Every column has a one-line definition a non-analyst can read. And every column is as-of-able: given the score date, it can be re-derived from the warehouse without peeking forward.
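The as-of property can be sketched for a single customer as follows. The tuple layout, the function name, and the choice to treat purchases dated strictly before the score date as visible are all assumptions; the feature definitions follow the table above.

```python
from datetime import date, timedelta

def rfm_features(purchases, signup_date, score_date):
    """Compute catalog features for one customer as of score_date.

    purchases: list of (purchase_date, amount, discounted_amount) tuples,
    where discounted_amount is the part of amount bought on discount.
    """
    # As-of filter: drop anything dated on or after the score date.
    history = [p for p in purchases if p[0] < score_date]
    window_start = score_date - timedelta(days=90)
    recent = [p for p in history if p[0] >= window_start]

    lifetime_revenue = sum(amount for _, amount, _ in history)
    spend_90d = sum(amount for _, amount, _ in recent)
    discounted_90d = sum(disc for _, _, disc in recent)

    return {
        "days_since_last_purchase":
            (score_date - max(p[0] for p in history)).days if history else None,
        "orders_last_90d": len(recent),
        "avg_order_value_lt": lifetime_revenue / len(history) if history else None,
        "discount_share_90d": discounted_90d / spend_90d if spend_90d else None,
        "tenure_days": (score_date - signup_date).days,
    }

purchases = [
    (date(2024, 1, 10), 20.0, 0.0),
    (date(2024, 5, 1), 30.0, 15.0),
    (date(2024, 5, 20), 10.0, 5.0),
    (date(2024, 7, 1), 99.0, 99.0),  # after the score date, must be ignored
]
features = rfm_features(purchases, date(2023, 6, 1), date(2024, 6, 1))
```

Rerunning the function with an earlier score date reproduces what the model would have seen on that day, which is the behavior a backtest depends on.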
A Quick Note on Feature Stores
As soon as more than one model in the firm needs the same features, the case for a feature store appears. It is a shared library of feature definitions, joined to a unit-by-time key, computed on a schedule, and available to both training and production with identical semantics. The promise is that the score-time value of a feature is the same value the training-time derivation produced.
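A toy, in-memory sketch of the idea is shown below. The class, its method names, and the two feature definitions are hypothetical and far simpler than any production feature store; the point is only that training and scoring read through the same path.

```python
from datetime import date

# Shared feature definitions: each takes (events, as_of) and returns a value.
# events is a list of (event_date, amount) tuples for one unit.
FEATURE_DEFINITIONS = {
    "orders_total": lambda events, as_of: sum(1 for d, _ in events if d < as_of),
    "spend_total": lambda events, as_of: sum(a for d, a in events if d < as_of),
}

class FeatureStore:
    def __init__(self, definitions):
        self.definitions = definitions
        self._table = {}  # (unit_id, as_of) -> {feature_name: value}

    def materialize(self, unit_id, as_of, events):
        """Scheduled job: compute every registered feature for one key."""
        self._table[(unit_id, as_of)] = {
            name: fn(events, as_of) for name, fn in self.definitions.items()
        }

    def get(self, unit_id, as_of, names):
        """Single read path used by both training and production scoring."""
        row = self._table[(unit_id, as_of)]
        return [row[name] for name in names]

store = FeatureStore(FEATURE_DEFINITIONS)
events = [(date(2024, 5, 1), 10.0), (date(2024, 6, 15), 99.0)]
store.materialize("c1", date(2024, 6, 1), events)
training_row = store.get("c1", date(2024, 6, 1), ["orders_total", "spend_total"])
scoring_row = store.get("c1", date(2024, 6, 1), ["orders_total", "spend_total"])
```

Because both rows come from the same stored values under the same unit-by-time key, they cannot disagree.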
Feature stores are an architecture topic, not a Part IV topic, but the managerial implication is worth noting: investing in a few well-defined, well-tested features that several models share is almost always a better use of resources than building one large model from many ad-hoc features.
Concept check
Three questions spanning the supervised setup — task definition, generalization, and features — across Chapter 9.
- 1. Two churn models report different lift on the same customer base. The first scores customers; the second scores transactions and aggregates to customers. What is the most likely source of disagreement?
- 2. A churn model trained with a random 80/20 split shows AUC 0.92 on the test set. When deployed to score next month's customers, AUC drops to 0.71. What is the most likely diagnosis?
- 3. A team adds a new feature, refund_total_lifetime, to their churn model and AUC jumps from 0.82 to 0.93. Most likely explanation?