§10.2

Classification Evaluation

A classifier that returns probabilities has been graded long before anyone speaks of accuracy. The choice of evaluation language is itself a decision: about whose mistakes count, which units they count in, and whether the manager cares about getting the order right, the magnitudes right, or both. This article assembles the standard toolkit — confusion matrix, ROC and PR curves, calibration, lift — and ends with the only chart most managers should read first: the threshold–profit curve.

The teaching arc is short. Each evaluation view is the answer to a different question. The questions stack on top of each other, and the threshold–profit curve is the place where the modeller's answers meet the business's costs.

The Executive Question

Given a probability score for every customer, what is the action threshold that maximizes expected business value, and how confident are we in that choice?

The question already mixes two ideas: a model property (how well the scores rank and calibrate) and a business property (what we pay or earn when we get a unit right or wrong). The evaluations below isolate the first; the threshold–profit curve combines both.

The Confusion Matrix and Its Costs

Every classifier produces four kinds of outcomes once a threshold has been fixed. Naming them is half the work of evaluation.

Confusion matrix: each cell has a business cost

                 | Actually churned                                            | Actually stayed
Predicted churn  | True Positive: caught a churner; retention spend justified  | False Positive: wasted offer; spend on someone who would have stayed
Predicted stay   | False Negative: missed churner; lost customer revenue       | True Negative: correctly left alone; no unnecessary spend

Figure 1. The four outcomes of a thresholded classifier, with the business cost a manager has to attach to each cell.

From the matrix, the standard summary metrics fall out:

Accuracy = correct / total. Useful when the classes are balanced; misleading when they are not.
Precision = TP / (TP + FP). Of those we acted on, how many were right?
Recall (sensitivity) = TP / (TP + FN). Of those who needed action, how many did we reach?
F1 = 2 × precision × recall / (precision + recall), the harmonic mean of precision and recall. A single number for when the costs of FP and FN are similar.

In most business settings, accuracy alone is the wrong headline. A churn rate of 3% means a "predict everyone stays" model is 97% accurate and useless. Precision-recall trade-offs are where the analytics actually live.
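The 3% churn example above can be checked in a few lines. The sketch below is illustrative only: the function name `summary_metrics` and the book of 1,000 customers are assumptions made for the example, and reporting 0.0 when a denominator is zero is a convention chosen here, not a standard.

```python
def summary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision and recall are undefined when their denominators are zero;
    # this sketch reports 0.0 in that case (an assumption, not a standard).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical book of 1,000 customers with a 3% churn rate,
# scored by a model that predicts "stays" for everyone.
always_stay = summary_metrics(tp=0, fp=0, fn=30, tn=970)
print(always_stay)
```

Accuracy comes out at 0.97 while recall and F1 are both zero, which is the sense in which the model is accurate and useless.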

ROC vs Precision-Recall

Two ranked-list summaries dominate practical evaluation:

The ROC curve plots true-positive rate against false-positive rate as the threshold sweeps from 0 to 1. The AUC is the area under this curve and equals the probability that a randomly chosen positive ranks above a randomly chosen negative. ROC is invariant to class balance and is the default summary for ranking quality.
The PR curve plots precision against recall as the threshold sweeps. PR-AUC is more revealing than ROC-AUC when the positive class is rare — which is the norm in churn, fraud, and conversion.
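The ranking interpretation of AUC can be computed directly by counting pairs. The sketch below is a brute-force illustration, not how a library would compute it; the function name `roc_auc_pairwise` and the eight toy scores are assumptions made for the example, and ties are counted as half a win.

```python
def roc_auc_pairwise(scores, labels):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half. Cost is O(n_pos * n_neg), so this is for
    illustration on small data only.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): 4 churners and 4 non-churners.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   0,   1,   0]
auc = roc_auc_pairwise(scores, labels)
print(auc)  # 12 of the 16 positive/negative pairs are ordered correctly
```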

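A useful intuition: a model can have a ROC-AUC of 0.85 and still be unusable if 99% of the units are negatives, because even a small false-positive rate applied to a very large negative class can swamp the true positives. ROC tells you whether the model ranks well; PR tells you whether the top of the ranked list is precise enough to act on.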
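A little arithmetic shows why the same ROC operating point can look very different on the PR curve. The sketch below assumes a hypothetical operating point (true-positive rate 0.80, false-positive rate 0.10); those numbers are chosen for illustration and are not derived from any particular model or AUC value.

```python
def precision_from_rates(prevalence, tpr, fpr):
    """Precision implied by a ROC operating point at a given class prevalence."""
    true_pos = prevalence * tpr           # share of the population that is TP
    false_pos = (1.0 - prevalence) * fpr  # share of the population that is FP
    return true_pos / (true_pos + false_pos)

# Same hypothetical ROC point, two different class balances.
rare = precision_from_rates(prevalence=0.01, tpr=0.80, fpr=0.10)
balanced = precision_from_rates(prevalence=0.50, tpr=0.80, fpr=0.10)
print(f"precision at 1% prevalence:  {rare:.3f}")
print(f"precision at 50% prevalence: {balanced:.3f}")
```

At 1% prevalence, fewer than one in ten of the flagged units is a true positive, even though the ROC point itself has not moved.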

Calibration and Lift: Two Practical Charts

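The two views below are the ones most non-technical reviewers should learn to read. Calibration answers "is the probability the model reports the rate I should expect?" Lift answers "if I target the top X%, what share of the positive class do I capture?" Strictly, that share-captured curve is a cumulative gains chart, and lift is the ratio of the share captured to the share targeted; this article follows common usage and calls the chart lift.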

Two evaluation views that managers actually read

Figure 2. Two views every model review should include. Calibration (left) compares predicted probability with observed rate by decile of score. Lift (right) shows the share of churners captured by targeting the top-X% by score; the dashed diagonal is the random baseline.

Calibration matters when the score has to enter a downstream decision unchanged, for example when an expected-value computation multiplies the churn probability by the value of a saved customer and compares the result with the offer cost. If a model says "0.4" but the empirical rate at that bin is 0.25, expected-value calculations using the raw score will systematically over-spend.
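A calibration check amounts to sorting by score, cutting into equal-sized bins, and comparing the mean predicted probability with the observed rate in each bin. The sketch below is a minimal version of that table; the function name `calibration_table` and the toy data are assumptions, with the data built to reproduce the "says 0.4, observes 0.25" case described above.

```python
def calibration_table(scores, outcomes, n_bins=10):
    """Compare mean predicted probability with observed rate per score bin.

    Returns rows of (bin_number, size, mean_predicted, observed_rate),
    with bins ordered from lowest to highest score.
    """
    pairs = sorted(zip(scores, outcomes))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # more bins than observations
        mean_pred = sum(s for s, _ in chunk) / len(chunk)
        obs_rate = sum(y for _, y in chunk) / len(chunk)
        rows.append((b + 1, len(chunk), mean_pred, obs_rate))
    return rows

# Hypothetical data: 20 customers scored 0.1 (2 churned),
# 20 customers scored 0.4 (5 churned).
scores = [0.1] * 20 + [0.4] * 20
outcomes = [1] * 2 + [0] * 18 + [1] * 5 + [0] * 15
rows = calibration_table(scores, outcomes, n_bins=2)
for bin_no, size, mean_pred, obs_rate in rows:
    print(f"bin {bin_no}: n={size} predicted={mean_pred:.2f} observed={obs_rate:.2f}")
```

The lower bin is well calibrated (0.10 predicted, 0.10 observed); the upper bin over-predicts, which is the pattern that inflates expected-value estimates.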

Lift matters when the action is "target the top of the list." It reads directly off the curve: at 10% targeted, what share of churners do we reach? That number, paired with the cost of an offer and the value of a saved customer, is the headline number for retention programs.
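Reading the chart at a single point is a short computation. The sketch below returns both the share of churners captured and the lift ratio (share captured divided by share targeted); the function name `gains_and_lift` and the ten-customer data set are assumptions made for illustration.

```python
def gains_and_lift(scores, outcomes, fraction):
    """Share of positives captured in the top `fraction` by score, and lift."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(fraction * len(ranked)))  # customers actually targeted
    total_positives = sum(outcomes)
    if total_positives == 0:
        raise ValueError("no positives in the data")
    share_captured = sum(y for _, y in ranked[:k]) / total_positives
    share_targeted = k / len(ranked)
    # Lift: how many times better than targeting the same share at random.
    return share_captured, share_captured / share_targeted

# Hypothetical book of 10 customers, 4 of whom churned.
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 1]
share, lift = gains_and_lift(scores, outcomes, fraction=0.2)
print(f"top 20% captures {share:.0%} of churners, lift {lift:.1f}x")
```

In this toy book, the top 20% by score holds half of the churners, a lift of 2.5 over random targeting.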

The Threshold–Profit Curve

The chart most managers should see first is none of the above. It is a curve that puts net profit, computed from the business's own cost matrix, on the y-axis and the action threshold on the x-axis.

Threshold–profit curve — the manager's lever

Treating everyone wastes offers; treating no one loses customers. The peak is the manager's threshold under current costs.

Figure 3. Threshold–profit curve. Net profit per customer rises from a 'treat everyone' default, peaks at a sensible threshold (here roughly 0.42), and falls as the threshold rises further and the model misses real churners. The peak is the threshold a deployed model should use.

The construction is mechanical. For each threshold t:

Compute predicted-positive and predicted-negative sets on the held-out data.
Apply the cost matrix: every TP earns the saved-customer value minus the offer cost (this assumes the offer actually retains the customer); every FP costs the offer; every FN costs the lost customer's revenue; every TN is free.
Sum the cell-level dollars and divide by population size to get net profit per customer.
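The three steps above can be sketched as a threshold sweep. This is a minimal illustration, not a production implementation: the function names, the dollar figures (saved value 100, offer cost 20, churn loss 100) and the ten-customer held-out set are all assumptions, and a customer is treated when the score is at or above the threshold.

```python
def profit_per_customer(scores, outcomes, threshold,
                        saved_value, offer_cost, churn_loss):
    """Net profit per customer at one threshold, using the four-cell cost matrix."""
    total = 0.0
    for score, churned in zip(scores, outcomes):
        treated = score >= threshold
        if treated and churned:
            total += saved_value - offer_cost  # TP: saved customer, paid for offer
        elif treated:
            total -= offer_cost                # FP: wasted offer
        elif churned:
            total -= churn_loss                # FN: lost customer
        # TN: no cost, no gain
    return total / len(scores)

def threshold_profit_curve(scores, outcomes, thresholds, **costs):
    """Return the full curve and the (threshold, profit) point at its peak."""
    curve = [(t, profit_per_customer(scores, outcomes, t, **costs))
             for t in thresholds]
    return curve, max(curve, key=lambda point: point[1])

# Hypothetical held-out set: 10 customers, 3 of whom churned.
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
thresholds = [i / 10 for i in range(11)]
curve, best = threshold_profit_curve(
    scores, outcomes, thresholds,
    saved_value=100.0, offer_cost=20.0, churn_loss=100.0)
for t, profit in curve:
    print(f"t={t:.1f}  profit per customer={profit:6.1f}")
print("peak:", best)
```

On this toy data the profit per customer is 10.0 when everyone is treated, peaks at 22.0 at a threshold of 0.6, and falls to -30.0 when no one is treated. With only ten customers the curve is jagged; a real held-out set gives a smoother shape.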

The shape — typically an inverted U — has a clear interpretation. The left side is "treat everyone": every customer gets an offer regardless of risk, and FP cost dominates. The right side is "treat almost no one": the model misses real churners, and FN cost dominates. The peak is where these are best balanced for the model the firm has.

A few practical notes:

The peak shifts as costs change. If the offer becomes cheaper, the peak moves left; if customer value drops, the peak moves right.
The peak is a property of the model + the costs, not the model alone. Two firms with the same model can rationally use different thresholds.
The peak depends on a held-out set that reflects production. If the held-out set is stale or non-representative, the chosen threshold is wrong before any drift sets in.
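The direction of these shifts can be read off a short break-even calculation. The derivation below is a sketch under two assumptions that go beyond the text: the score p is a calibrated churn probability, and the offer always retains a churner (as the cost matrix above implies). The symbols v (saved-customer value), c (offer cost) and ℓ (loss from a missed churner) are introduced here for illustration.

```latex
\begin{align*}
\mathbb{E}[\text{profit} \mid \text{treat}]
  &= p\,(v - c) + (1 - p)(-c) = p\,v - c, \\
\mathbb{E}[\text{profit} \mid \text{do not treat}]
  &= p\,(-\ell) + (1 - p)\cdot 0 = -p\,\ell, \\
\text{treat} \iff p\,v - c > -p\,\ell
  &\iff p > t^{*} = \frac{c}{v + \ell}.
\end{align*}
```

A cheaper offer lowers c and therefore t*, moving the peak left; a lower customer value shrinks v (and usually ℓ), raising t* and moving the peak right. As a hypothetical example, c = 20 and v = ℓ = 100 give t* = 0.10, and halving the offer cost to 10 gives t* = 0.05. On real data the empirical peak will differ from t* to the extent that the scores are miscalibrated.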

Putting the Views Together

A short workflow that combines the views:

ROC-AUC and PR-AUC confirm the model ranks reasonably and that the top of the list is meaningfully better than random.
Calibration confirms the probabilities can be used in expected-value math without recalibration. If miscalibrated, fit a post-hoc calibration on held-out data (e.g., Platt scaling, a two-parameter logistic fit, or isotonic regression, which is non-parametric) and re-check.
Lift translates the model's ranking quality into operational language: at X% of the book targeted, what is captured?
Threshold–profit combines all of the above with the firm's cost matrix and chooses the threshold to deploy.

If any of the first three views looks wrong, do not proceed to the fourth. The threshold–profit curve will still draw without complaint, and it will still mislead.
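The recalibration step in the workflow above can be sketched with a hand-rolled Platt-style fit. This is a minimal illustration under stated assumptions: the functions `fit_platt` and `apply_platt` are hypothetical names, the sigmoid is fit directly on the raw score by plain gradient descent (in practice it is often fit on a margin or logit, with a library optimizer), and the learning rate and iteration count are tuned only for this toy data.

```python
import math

def apply_platt(score, a, b):
    """Map a raw score to a recalibrated probability: sigmoid(a * score + b)."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

def fit_platt(scores, outcomes, lr=2.0, n_iter=5000):
    """Fit slope a and intercept b by gradient descent on mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, outcomes):
            err = apply_platt(s, a, b) - y  # gradient of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Hypothetical calibration set: raw score 0.1 churns at 10%,
# raw score 0.4 churns at only 25%.
scores = [0.1] * 20 + [0.4] * 20
outcomes = [1] * 2 + [0] * 18 + [1] * 5 + [0] * 15
a, b = fit_platt(scores, outcomes)
print(f"raw 0.4 -> recalibrated {apply_platt(0.4, a, b):.2f}")
print(f"raw 0.1 -> recalibrated {apply_platt(0.1, a, b):.2f}")
```

After the fit, a raw score of 0.4 maps to roughly the observed 0.25, so expected-value math downstream uses the rate the data supports. The calibration view should be re-checked on data not used for the fit.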