§11.1
Clustering for Segmentation
Supervised learning needs a target. Many business questions arrive without one. Which customers behave alike? Which products move together? Which stores form a natural cohort? In these settings, the algorithm is asked to find structure rather than confirm a known pattern. The result is not a verdict — there is no right answer to compare to — but a lens: a way of looking at the data that makes a managerial story possible.
This article introduces the two strands of unsupervised learning that recur across the book — clustering (row-oriented: which observations are alike?) and dimensionality reduction (column-oriented: which variables are correlated?). The first half of Chapter 11 focuses on clustering for segmentation; the next two articles take up PCA, perceptual maps, and nonlinear methods.
The Executive Question
Without telling the algorithm what we are looking for, what natural groupings exist in our customer base — and are any of them useful for action?
The honest version of that question carries a warning. Clusters from an algorithm are not segments by virtue of existing. They become segments when a manager attaches a name, a story, and a different action to each one. The algorithm proposes; the business disposes.
What Clustering Is Doing
Three ingredients define every clustering algorithm:
- A feature space. Each unit (customer, product, store) is a point in some space whose axes are features.
- A distance or similarity measure. Most commonly Euclidean distance after standardization, but cosine similarity, Manhattan distance, and Gower's distance for mixed types all show up.
- A grouping rule that turns distances into clusters.
The fundamental move is the same across methods: group points that are close, separate points that are far. What varies is what counts as "close" and how the grouping is built.
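A minimal sketch of how the choice of measure changes what counts as "close". The two customers and their feature values are invented for illustration, and the helper function names are ours, not from any library.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute differences, feature by feature
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Compares direction only; ignores the overall magnitude of each vector
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical customers described by (visits per month, items per basket).
light_buyer = [1.0, 2.0]
heavy_buyer = [2.0, 4.0]  # same mix of behaviour, twice the volume

print(euclidean(light_buyer, heavy_buyer))          # about 2.236
print(manhattan(light_buyer, heavy_buyer))          # 3.0
print(cosine_similarity(light_buyer, heavy_buyer))  # 1.0, up to floating-point rounding
```

Euclidean and Manhattan distance treat these two customers as different because their volumes differ; cosine similarity treats them as identical because their behaviour has the same shape. Which answer is right depends on whether volume should separate segments.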
Three Algorithms Worth Knowing
K-means is the default. It chooses k cluster centers, assigns every point to the nearest one, recomputes the centers as the mean of their assigned points, and iterates until the assignments stop changing. Fast, scalable, easy to explain. The cost: you must choose k in advance, and the algorithm assumes roughly spherical clusters of similar size. K-means struggles when real clusters are elongated, of very different sizes, or have nested structure.
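The loop described above can be sketched in a few lines of plain Python. This is an illustration of the assign-and-recompute cycle, not a production implementation: the random initialization, the `seed` and `max_iter` parameters, and the toy data are assumptions made for the example. In practice a library routine (for instance scikit-learn's `KMeans`) would normally be used.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k randomly chosen points
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing
        assignments = new_assignments
        # Update step: each center becomes the mean of its assigned points
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignments, centers

# Toy data: two compact, well-separated groups (invented values)
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

The cluster numbers themselves are arbitrary; only the grouping matters. On less tidy data, different starting centers can lead to different final clusters, which is why library implementations usually run several initializations.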
Hierarchical clustering builds a tree of merges (or splits). Agglomerative hierarchical clustering starts with every point in its own cluster and repeatedly merges the closest pair. The output is a dendrogram — a tree that lets the analyst pick the level of granularity after the fact. Slower than K-means, but more interpretable and useful when the right number of clusters is itself in question.
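A small sketch of the agglomerative procedure, written naively in plain Python for readability. It assumes average linkage (the distance between two clusters is the mean pairwise distance between their members), which is one common choice among several; the function names and the one-dimensional toy data are invented for the example.

```python
import math
from itertools import combinations

def average_linkage(a, b):
    # Mean distance over all cross-cluster pairs of points
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def agglomerate(points):
    """Return the merge history as a list of (merge_height, merged_cluster)."""
    clusters = [[p] for p in points]  # every point starts in its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        height = average_linkage(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        merges.append((height, merged))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

# Toy one-dimensional data (invented values), stored as 1-tuples
points = [(1.0,), (1.5,), (5.0,), (5.4,), (9.0,)]
merges = agglomerate(points)
print([round(height, 3) for height, _ in merges])  # [0.4, 0.5, 3.8, 5.217]
```

The merge heights are what a dendrogram draws on its vertical axis. Here the jump from 0.5 to 3.8 suggests cutting the tree after the second merge, which leaves three clusters. The choice is made after the tree is built, not before.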
DBSCAN clusters based on density. A point with enough neighbours within a chosen radius is a core point; core points that lie within that radius of one another, together with the points they reach, form a cluster; points in sparse regions that no core point reaches are labelled as noise. DBSCAN is the right tool when clusters are irregularly shaped, when noise is real and shouldn't be forced into a group, and when the number of clusters is unknown.
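The density rule can be sketched as follows. This is a simplified plain-Python version for illustration, assuming Euclidean distance and the common convention that a point counts as its own neighbour; the parameter names `eps` (the radius) and `min_pts` (the neighbour threshold) and the toy data are choices made for this example.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself)
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be reclaimed later as the edge of a cluster
            continue
        cluster += 1  # point i is dense enough to start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # edge point: joins the cluster, does not expand it
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is dense too, so keep growing
    return labels

# Two dense groups and one isolated point (invented values)
points = [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (9.0, 1.0)]
print(dbscan(points, eps=0.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

Note that the number of clusters was never specified: it falls out of the radius and the neighbour threshold, and the isolated point is left unassigned rather than forced into a group.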
A practical rule: try K-means first because it is fastest. If the cluster sizes look unbalanced, the shapes look wrong, or many points seem to be assigned to a "leftover" cluster, switch to hierarchical or DBSCAN.
Standardization and Scale
Distance is sensitive to scale. A feature measured in thousands (annual spend) will dominate a feature measured in single digits (visits per month) unless the two are standardized. Two common moves:
- Z-score standardization. Subtract the mean, divide by the standard deviation. Every feature is on a comparable scale; outliers can still pull means around.
- Min-max scaling. Map every feature to [0, 1]. Convenient when features have wildly different units and a bounded range is wanted, but brittle when there are extreme outliers (one outlier compresses everyone else into a narrow band).
Forget this step and the clusters will be driven by the largest-scale feature, regardless of what it measures.
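A short sketch of the two scaling moves, using only the standard library. The spend figures are invented, and the use of the population standard deviation (`pstdev`) is an assumption; the sample standard deviation is an equally common choice.

```python
from statistics import mean, pstdev

def z_score(values):
    # Subtract the mean, divide by the standard deviation
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    # Map the smallest value to 0 and the largest to 1
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical annual spend for five customers; the last one is an extreme outlier
annual_spend = [1200.0, 1500.0, 1800.0, 2100.0, 24000.0]

print([round(v, 3) for v in z_score(annual_spend)])
print([round(v, 3) for v in min_max(annual_spend)])
```

After min-max scaling, the four ordinary customers all land below 0.05 while the outlier sits at 1.0, so the differences among the ordinary customers nearly vanish. That is the compression effect described in the second bullet.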
Choosing k
For K-means and many hierarchical methods, the analyst has to choose how many clusters to keep. Two standard diagnostics:
- Elbow plot. Plot within-cluster sum of squares against k. As k rises, the curve falls; the "elbow" — the kink past which adding another cluster doesn't help much — is the suggested choice.
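- Silhouette score. For each point, compare its average distance to the other points in its own cluster (a) with its average distance to the points in the nearest other cluster (b); the point's silhouette is (b - a) / max(a, b), which ranges from -1 to 1. A high average silhouette means clusters are tight and well-separated.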
Neither diagnostic is binding. Both are advisory. The final number of clusters should be the smallest k for which the managerial story holds — too many segments and the team cannot operate on them; too few and the lens stops being useful.
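Both diagnostics can be computed by hand on a small example. The sketch below assumes Euclidean distance and the usual silhouette definition: a is a point's mean distance to the other members of its own cluster, b is its mean distance to the members of the nearest other cluster, and the score is (b - a) / max(a, b). The four points and the two candidate labelings are invented for illustration.

```python
import math
from statistics import mean

def within_cluster_ss(points, labels):
    # Sum of squared distances from each point to its cluster's mean
    total = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        center = [sum(col) / len(members) for col in zip(*members)]
        total += sum(math.dist(p, center) ** 2 for p in members)
    return total

def silhouette_scores(points, labels):
    # Requires at least two clusters
    scores = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # common convention for single-point clusters
            continue
        a = mean(math.dist(p, q) for q in own)
        b = min(
            mean(math.dist(p, q) for q, l in zip(points, labels) if l == other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return scores

points = [(1.0, 1.0), (1.0, 2.0), (5.0, 1.0), (5.0, 2.0)]
sensible = [0, 0, 1, 1]  # groups the left pair and the right pair
awkward = [0, 1, 0, 1]   # pairs each left point with a right point

for name, labels in [("sensible", sensible), ("awkward", awkward)]:
    print(name,
          round(within_cluster_ss(points, labels), 3),
          round(mean(silhouette_scores(points, labels)), 3))
```

The sensible grouping has a small within-cluster sum of squares and a mean silhouette close to 1; the awkward grouping has a large sum of squares and a negative mean silhouette. An elbow plot is the first number computed for a range of k and plotted; the silhouette is usually compared across the same range.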
What a Segment Looks Like
Once clusters are chosen, the work that matters is profiling them: which features are unusually high, low, or distinctive in each cluster?
[Figure: Five Bean & Basket segments, profiled against five features (Recency, Frequency, Spend, Discounts, Premium), one bar-chart panel per segment.]
The clusters are a lens, not a truth. Names come from the analyst after looking at the bars — the algorithm only sees similarity.
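One way to produce the numbers behind such a profile is to express each cluster's average on a feature as a distance from the overall average, measured in standard deviations. The sketch below assumes that approach; the function name, the customer rows, and the cluster labels are all invented, and only two of the five features are shown to keep it short.

```python
from statistics import mean, pstdev

def profile(rows, labels, features):
    """For each cluster, each feature's mean in standard deviations from the overall mean."""
    overall = {
        f: (mean([r[f] for r in rows]), pstdev([r[f] for r in rows])) for f in features
    }
    profiles = {}
    for c in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == c]
        profiles[c] = {
            f: round((mean([r[f] for r in members]) - overall[f][0]) / overall[f][1], 2)
            for f in features
        }
    return profiles

# Hypothetical customers and hypothetical cluster assignments
rows = [
    {"frequency": 2, "spend": 20},
    {"frequency": 3, "spend": 30},
    {"frequency": 10, "spend": 200},
    {"frequency": 11, "spend": 190},
]
labels = [0, 0, 1, 1]

for cluster, values in profile(rows, labels, ["frequency", "spend"]).items():
    print(cluster, values)
```

Values well above zero mark features that are unusually high in a cluster, values well below zero mark features that are unusually low, and values near zero mark features that do not distinguish the cluster at all.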
Three habits make segment profiles actually usable:
- Name the segments before you describe them. Forcing a one-line name disciplines the analyst into telling a story and surfaces the cases where the algorithm has not found a coherent group.
- Compare on a small set of human-readable features. Five to eight features in the profile is enough; thirty turns into a wall of numbers no one will reread.
- State the strategy that follows from the segment. A segment without a different action is not a segment — it is a population the firm is going to treat the same way as everyone else.
Where Clustering Has Limits
A short list of failures that recur:
- No "true" segments. Many customer bases look continuous. The algorithm will still return k clusters; if the manager treats them as discoveries about the world rather than slices of a continuum, the resulting strategy will overcommit to artificial boundaries.
- Algorithm-dependent boundaries. Two reasonable choices (K-means vs hierarchical, k=5 vs k=6, Euclidean vs cosine) can produce visibly different assignments. The right response is to lean on the parts of the segmentation that are stable across choices.
- Sensitivity to features. Add a noisy feature and the clusters shift; drop a high-variance feature and they shift again. The feature catalog is part of the segmentation, not a separate decision.
- Drift. Customer behaviour moves; clusters fitted six months ago may misclassify today's customers. Refit segmentations on a cadence the business commits to.
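Stability across choices, and across refits over time, can be quantified by comparing two sets of assignments for the same customers. A minimal sketch, assuming the plain Rand index: the share of customer pairs on which two segmentations agree about "same cluster" versus "different cluster". The label lists are invented. Chance-corrected variants such as the adjusted Rand index are generally preferred in practice (scikit-learn provides `adjusted_rand_score`, for example).

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    # Fraction of pairs treated the same way (together or apart) by both segmentations
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical assignments of the same six customers under three settings
run_a = [0, 0, 0, 1, 1, 1]
run_b = [1, 1, 1, 0, 0, 0]  # same grouping, cluster numbers swapped
run_c = [0, 0, 1, 1, 2, 2]  # a finer split with a shifted boundary

print(rand_index(run_a, run_b))            # 1.0
print(round(rand_index(run_a, run_c), 3))  # 0.667
```

Because the index works on pairs rather than on cluster numbers, relabelled but identical segmentations score 1.0. Scores well below 1 flag the parts of the segmentation that depend on the algorithm, the value of k, or the date of the fit.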