§9.4

Feature Engineering

In the AutoML era, two parts of the modelling lifecycle are getting cheaper by the year. Picking the algorithm is one. Tuning the hyperparameters is the other. The part that has stayed stubbornly human is the one between raw data and the model — turning warehouse columns into features that capture the business sense of the question. This is feature engineering, and as the rest of the pipeline automates, the managerial leverage of feature engineering goes up, not down.

This article is about the move from a transactional warehouse to a feature catalog: the recency, frequency, monetary, and engagement summaries that most customer models lean on; the encodings that handle categories and dates without leaking; and the interaction features where business intuition pays off.

The Executive Question

What do we know about a customer (or a listing, or a transaction) that a model could use to tell them apart?

The deeper version: which of those things are available at decision time, reconstructible from the warehouse, and aligned with a managerial story? Features that score well on all three are durable. Features that score on accuracy alone tend to break in production.

A Feature Catalog for a Customer Model

Most customer-level models in retail and services lean on four families of features: recency, frequency, monetary, and engagement.

[Figure 1: A feature catalog for a Bean & Basket churn model. Figure tagline: "Better features often matter more than fancier algorithms."]

Figure 1. A working feature catalog for a Bean & Basket churn model. The columns are not just convenient categories: they are the four ways a customer can be uneven over time, and each says something different to a manager.

The taxonomy is older than machine learning. RFM — recency, frequency, monetary — was a direct-marketing tool decades before logistic regression became standard. The engagement column is the natural extension once email opens, app sessions, and loyalty interactions become trackable.

Why this matters managerially: every feature on the catalog corresponds to a story a marketer can tell. "We're going to call customers whose recency just slipped past two weeks and whose support tickets are up." Models built on stories are easier to maintain, easier to explain, and harder to break than models built on whatever the warehouse happened to expose.

Encoding the Awkward Columns

A handful of column types refuse to drop directly into a model:

Categoricals. A column with values like email, app, retail needs to be converted into numbers. The default move is one-hot encoding: one indicator column per value. For high-cardinality columns (ZIP code, product SKU) one-hot creates too many sparse columns; target encoding, which replaces each value with the average outcome observed for that value, and learned embeddings are common alternatives. Because target encoding is computed from the label, it must be fitted inside the training fold only, never on validation or test rows, to avoid the leakage we flagged in §9.3.
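A minimal sketch of what "inside the training fold" can look like, in plain Python. The function name target_encode_oof, the index-based fold assignment, and the smoothing constant are illustrative assumptions, not a prescribed method. The only point is that each row is encoded from other rows' outcomes, never from its own.

```python
from collections import defaultdict


def target_encode_oof(categories, targets, n_folds=5, smoothing=10.0):
    """Out-of-fold target encoding for one categorical column.

    Assumes rows are already shuffled (folds are assigned by index modulo
    n_folds) and that smoothing > 0.
    """
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Fit category statistics on every row that is NOT in this fold.
        sums = defaultdict(float)
        counts = defaultdict(int)
        total, total_n = 0.0, 0
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total += targets[i]
                total_n += 1
        prior = total / total_n if total_n else 0.0
        # Encode the held-out rows with statistics they did not contribute to.
        for i in range(fold, n, n_folds):
            cat = categories[i]
            # Shrink the category mean toward the prior; a category unseen
            # in the fitting rows falls back to the prior itself.
            encoded[i] = (sums[cat] + smoothing * prior) / (counts[cat] + smoothing)
    return encoded
```

Changing a row's own label leaves that row's encoded value untouched, which is the property that keeps the label from leaking into the feature. At score time, the same statistics would be fitted once on the full training set and then frozen.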

Dates and times. The raw timestamp is rarely a useful column. The useful columns are derived from it: day of week, hour of day, days since last visit, days until next holiday, and week of year with a cyclical (sine and cosine) encoding so that the last week of the year sits next to the first. The right derivation depends on the periodicity the business actually has.
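The derivations above can be sketched in a few lines of standard-library Python. The function names and the choice of a 52-week period are assumptions made for illustration; a business with a different rhythm would pick a different period.

```python
import math
from datetime import date


def cyclical(value, period):
    """Place a periodic value on the unit circle so the cycle's ends are neighbours."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def date_features(visit_date, last_visit_date):
    """Derive model-ready columns from two dates (both datetime.date)."""
    day_of_week = visit_date.weekday()      # Monday = 0 ... Sunday = 6
    iso_week = visit_date.isocalendar()[1]  # ISO week number, 1..53
    dow_sin, dow_cos = cyclical(day_of_week, 7)
    # Assumption: a 52-week cycle, so an ISO week 53 lands on week 1's position.
    week_sin, week_cos = cyclical(iso_week - 1, 52)
    return {
        "day_of_week": day_of_week,
        "dow_sin": dow_sin,
        "dow_cos": dow_cos,
        "week_sin": week_sin,
        "week_cos": week_cos,
        "days_since_last_visit": (visit_date - last_visit_date).days,
    }
```

With the raw integer, Sunday (6) and Monday (0) look maximally far apart. On the circle they are adjacent, which is usually closer to how the business behaves.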

Free text. Bag-of-words and TF-IDF turn short strings into feature columns; modern embedding methods (Part V) turn them into dense vectors. For Part IV, the move that matters is that a single text column becomes many feature columns — and those many columns can leak in all the same ways the original could.
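To make the "one column becomes many" point concrete, here is a deliberately small bag-of-words sketch in plain Python. The function names, whitespace tokenization, and the max_terms cap are illustrative assumptions; a production pipeline would more likely use a library vectorizer. The vocabulary is fitted on training texts only, for the same leakage reasons as any other fitted transform.

```python
from collections import Counter


def fit_vocabulary(train_texts, max_terms=1000):
    """Pick the most frequent tokens in the TRAINING texts as feature columns."""
    counts = Counter(
        token for text in train_texts for token in text.lower().split()
    )
    return [term for term, _ in counts.most_common(max_terms)]


def bag_of_words(text, vocabulary):
    """Turn one string into one count per vocabulary term.

    Tokens outside the fitted vocabulary are ignored.
    """
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]
```

A support-ticket column with a vocabulary of a thousand terms becomes a thousand numeric columns, each of which deserves the same as-of scrutiny as the original text.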

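Continuous variables with skew. Spend, price, and revenue tend to be right-skewed. Linear and distance-based models are sensitive to that skew, while tree-based models largely are not; for the former, a log transform usually helps (log(1 + x) when zeros are possible). The transform should be saved with the model so production inputs are transformed identically.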
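One way to read "saved with the model" is that the transform is a small object whose definition is serialized next to the model artifact and reloaded at score time. The sketch below assumes a hypothetical Log1pTransform class and a JSON payload; neither is a standard API, and the column name in the usage note is invented.

```python
import json
import math


class Log1pTransform:
    """Applies log(1 + x) to named columns; the column list is the only state."""

    def __init__(self, columns):
        self.columns = list(columns)

    def apply(self, row):
        # Copy the row so the caller's raw values are left untouched.
        out = dict(row)
        for col in self.columns:
            out[col] = math.log1p(row[col])
        return out

    def to_json(self):
        # Persist this string alongside the trained model.
        return json.dumps({"transform": "log1p", "columns": self.columns})

    @classmethod
    def from_json(cls, payload):
        spec = json.loads(payload)
        if spec["transform"] != "log1p":
            raise ValueError("unexpected transform: " + spec["transform"])
        return cls(spec["columns"])
```

Training would call Log1pTransform(["spend_90d"]).to_json() once. Production would rebuild the object with from_json and apply it to every incoming row, so both sides run the same arithmetic on the same columns.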

Interactions: Where Business Knowledge Lives

A pure linear model on raw features encodes the assumption that effects are additive. Interactions are how non-additivity gets back in:

A loyalty-tier × discount-share interaction lets the model say "discount sensitivity is high for bronze customers and low for platinum."
A device-type × time-of-day interaction captures the mobile evening / desktop morning split many e-commerce models hinge on.
A region × season interaction lets the model recognize that the southern California iced-drink seasonality is offset from the New England one.

Tree-based models discover interactions implicitly when they split on one feature then on another. Linear and logistic models need them spelled out. Either way, the manager's role is the same: name the interactions that the business already knows about, and let the algorithm confirm or deny them.
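For a linear or logistic model, "spelled out" means literally adding columns. A minimal sketch of the loyalty-tier × discount-share interaction follows; the function name is hypothetical, and the column names are borrowed from the catalog in Table 1.

```python
TIERS = ("bronze", "silver", "gold", "platinum")


def add_tier_discount_interaction(row):
    """Add one discount-share column per loyalty tier.

    Each new column equals discount_share_90d for customers in that tier and
    0 otherwise, so a linear model can fit a separate discount slope per tier.
    """
    out = dict(row)
    for tier in TIERS:
        is_tier = 1.0 if row["loyalty_tier"] == tier else 0.0
        out["discount_share_90d_x_" + tier] = is_tier * row["discount_share_90d"]
    return out
```

If the fitted coefficient on the bronze column is large and the platinum one is near zero, the model has confirmed the story; if they are indistinguishable, it has denied it.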

RFM, Plus

Below is the smallest feature set that does respectable work on a retail churn problem. It is also the right starting point for any new customer model: ship this catalog first, see how it performs, then layer on richer engagement and behavioral features.

Table 1. A minimum-viable feature catalog for a customer scoring model. Every feature here is reconstructible as of the score date with only past data — no clairvoyance required.

Feature	Family	Definition	Why a manager cares
days_since_last_purchase	Recency	score_date − max(purchase_date)	Drop-off is the leading indicator of churn.
orders_last_90d	Frequency	count of purchases in trailing 90 days	Steady cadence vs. lapsing.
avg_order_value_lt	Monetary	lifetime revenue / lifetime orders	Separates premium and value customers.
discount_share_90d	Monetary	discounted spend / total spend, 90 days	Detects coupon-dependent buyers.
email_open_rate_30d	Engagement	opens / sends, trailing 30 days	Brand attention proxy.
tickets_last_30d	Engagement	support tickets in trailing 30 days	Friction is a strong churn signal.
loyalty_tier	Static	tier as of score date (bronze/silver/gold/platinum)	Tier-specific retention plays.
tenure_days	Static	score_date − signup_date, in days	New customers churn for different reasons.

The catalog has two virtues beyond predictive power. Every column has a one-line definition a non-analyst can read. And every column is as-of-able: given the score date, it can be re-derived from the warehouse without peeking forward.
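The as-of property is easiest to see in code. The sketch below derives four of the Table 1 features for one customer from a list of purchase records. The record layout (purchase_date, amount, discounted_amount), the function name, and the choice to treat purchases strictly before the score date as "past" are all assumptions for illustration, not the warehouse's actual schema.

```python
from datetime import date, timedelta


def customer_features(purchases, score_date):
    """Compute as-of features for one customer.

    purchases: list of dicts with purchase_date (datetime.date), amount, and
    discounted_amount (the part of amount spent on discounted items).
    Returns None if the customer has no purchase before score_date.
    """
    # As-of filter: anything on or after the score date is invisible.
    past = [p for p in purchases if p["purchase_date"] < score_date]
    if not past:
        return None
    window_start = score_date - timedelta(days=90)
    recent = [p for p in past if p["purchase_date"] >= window_start]
    spend_90d = sum(p["amount"] for p in recent)
    discounted_90d = sum(p["discounted_amount"] for p in recent)
    last_purchase = max(p["purchase_date"] for p in past)
    return {
        "days_since_last_purchase": (score_date - last_purchase).days,
        "orders_last_90d": len(recent),
        "avg_order_value_lt": sum(p["amount"] for p in past) / len(past),
        # No spend in the window means no discount dependence to measure.
        "discount_share_90d": discounted_90d / spend_90d if spend_90d else 0.0,
    }
```

Because score_date is an explicit argument, the same function can rebuild a training row for any historical date and a production row for today, and a purchase made after the score date cannot influence either.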

A Quick Note on Feature Stores

As soon as more than one model in the firm needs the same features, the case for a feature store appears. It is a shared library of feature definitions, joined to a unit-by-time key, computed on a schedule, and available to both training and production with identical semantics. The promise is that the score-time value of a feature is the same value the training-time derivation produced.
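The core idea can be shown without any infrastructure: one registry of named feature definitions, and one function that both training and scoring call. Everything in this sketch (the FEATURES registry, the feature decorator, feature_row) is hypothetical and stands in for what a real feature store provides at scale.

```python
from datetime import date

# Shared registry: feature name -> function of (customer record, as-of date).
FEATURES = {}


def feature(name):
    """Decorator that registers a feature definition under a shared name."""
    def register(fn):
        FEATURES[name] = fn
        return fn
    return register


@feature("tenure_days")
def tenure_days(customer, as_of):
    return (as_of - customer["signup_date"]).days


def feature_row(customer, as_of, names):
    """Build one row keyed by unit (customer_id) and time (as_of).

    Training and production both call this function, so a given key always
    yields the same feature values.
    """
    row = {"customer_id": customer["customer_id"], "as_of": as_of}
    for name in names:
        row[name] = FEATURES[name](customer, as_of)
    return row
```

A second model that needs tenure_days asks for it by name rather than re-deriving it, which is where the shared-definition guarantee comes from.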

Feature stores are an architecture topic, not a Part IV topic — but the managerial implication is worth noting: investing in a few well-defined, well-tested features that several models share is almost always a better use of resources than building one large model from many ad-hoc ones.

Concept check

Three questions spanning the supervised setup — task definition, generalization, and features — across Chapter 9.

1. Two churn models report different lift on the same customer base. The first scores customers; the second scores transactions and aggregates to customers. What is the most likely source of disagreement?
2. A churn model trained with a random 80/20 split shows AUC 0.92 on the test set. When deployed to score next month's customers, AUC drops to 0.71. What is the most likely diagnosis?
3. A team adds a new feature, refund_total_lifetime, to their churn model and AUC jumps from 0.82 to 0.93. Most likely explanation?