§10.5

AutoML, Explainability, and Model Cards

The economics of machine-learning work have shifted. A decade ago, a churn project could spend whole quarters picking and tuning algorithms. Today, an AutoML pipeline can fit a hundred candidate models, compare their cross-validated scores, and serve a tuned ensemble before lunch. What it cannot do is define the target, name the action, audit the features, or write the contract under which the model gets used. As the engineering surface automates, the managerial surface gets sharper.

This article maps that shift. We look briefly at what an AutoML leaderboard does (and doesn't) tell a manager; we work through the two interpretability tools — feature importance and partial dependence — that survive into production; and we land on the model card, the one-page document that should accompany every shipped model.

The Model Card is the third entry in the artefact family introduced at §0.1. The Decision Question Card (§5.1) frames the original action; the Predictive Task Contract (§9.2) specifies the prediction; the Model Card describes the system that ships. Each one extends the discipline of the last. The chain continues into §16.4 (AI Workflow Card) and lands at the decision memo that ships.

The Executive Question

If algorithm selection is mostly automated, what is the manager's job in the model — and how do we make sure the model behaves the way it claims to?

The short answer: the manager defines the task, audits the features, sets the threshold and cost matrix, signs off on the limitations, and owns the monitoring. Everything in this article is in service of those five duties.

What AutoML Does and Doesn't Do

A typical AutoML run will:

Take a labelled dataset and a metric (AUC, RMSE, log-loss).
Iterate through a configurable space of preprocessing pipelines and model families.
Cross-validate each candidate.
Return a leaderboard ranked by the chosen metric, plus a designated "winner" pipeline that the user can deploy.

This automates, and automates well, much of the work that used to be the valuable part. The leaderboard, however, is only as good as the task definition that produced the labels. AutoML does not check:

Whether the target represents what the firm actually wants to predict.
Whether features could be observed at decision time.
Whether the units are aligned with the action.
Whether class imbalance has been handled or hidden.
Whether the cross-validation reflects the deployment split.

In other words, AutoML automates the easy half. The hard half — the half this entire book is about — remains the manager's responsibility.
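To make the automated half concrete, the loop described above (enumerate candidates, cross-validate each one, rank by a single metric) can be sketched in a few dozen lines. This is a toy illustration in plain Python, not a description of how any particular AutoML product works. The synthetic data, the three candidate rules, and the names make_customer, fit_threshold, and cross_validate are all assumptions made for the example.

```python
import random

random.seed(0)

# Synthetic customers: (days_since_last_purchase, support_tickets_last_30d, churned).
# The data-generating rule is invented purely for illustration.
def make_customer():
    days = random.randint(0, 60)
    tickets = random.randint(0, 5)
    p_churn = min(0.95, 0.05 + 0.012 * days + 0.05 * tickets)
    return (days, tickets, int(random.random() < p_churn))

data = [make_customer() for _ in range(600)]

def accuracy(predict, rows):
    return sum(predict(r) == r[2] for r in rows) / len(rows)

def fit_majority(train):
    """Candidate 1: always predict the majority class seen in training."""
    label = int(sum(r[2] for r in train) / len(train) >= 0.5)
    return lambda r: label

def fit_threshold(col):
    """Candidates 2 and 3: pick the best cut-off on a single feature."""
    def fit(train):
        best = max(range(0, 61),
                   key=lambda t: accuracy(lambda r: int(r[col] >= t), train))
        return lambda r: int(r[col] >= best)
    return fit

candidates = {
    "majority_class": fit_majority,
    "recency_cutoff": fit_threshold(0),
    "tickets_cutoff": fit_threshold(1),
}

def cross_validate(fit, rows, k=5):
    """Mean held-out accuracy over k folds."""
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(accuracy(fit(train), folds[i]))
    return sum(scores) / k

# The "leaderboard": candidates ranked by their cross-validated metric.
leaderboard = sorted(
    ((name, cross_validate(fit, data)) for name, fit in candidates.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in leaderboard:
    print(f"{name:16s} {score:.3f}")
```

Nothing in this loop questions the label, the timing of the features, or the shape of the split. It ranks whatever it is given, which is exactly why the checks listed above stay with the manager.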

A practical guideline: treat the AutoML leaderboard as a menu of candidate models, not as a deployment recommendation. The model the team ships is the one whose threshold–profit curve looks best under the production-shaped evaluation from §10.2.
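A threshold–profit curve of the kind referred to here can be sketched as follows. The economics are hypothetical: OFFER_COST, RETAINED_VALUE, and SAVE_RATE are placeholder numbers, the held-out scores are synthetic, and the profit formula is one simple assumption about how contact decisions turn into money, not the definition used in §10.2.

```python
import random

random.seed(1)

# Synthetic held-out set of (model_score, churned) pairs.
# Scores are loosely informative by construction.
held_out = []
for _ in range(2000):
    score = random.random()
    churned = int(random.random() < score * 0.6)
    held_out.append((score, churned))

# Hypothetical economics, per contacted customer.
OFFER_COST = 4.0        # cost of sending the retention offer
RETAINED_VALUE = 60.0   # margin kept if a would-be churner stays
SAVE_RATE = 0.25        # share of contacted churners the offer retains

def profit_at(threshold, rows):
    """Expected profit from contacting everyone scored at or above the threshold."""
    contacted = [churned for score, churned in rows if score >= threshold]
    saved_value = sum(contacted) * SAVE_RATE * RETAINED_VALUE
    return saved_value - len(contacted) * OFFER_COST

thresholds = [t / 100 for t in range(0, 101, 2)]
curve = [(t, profit_at(t, held_out)) for t in thresholds]

# The operating point is the threshold where the curve peaks.
best_threshold, best_profit = max(curve, key=lambda point: point[1])
print(f"operating point: threshold={best_threshold:.2f}, profit={best_profit:.0f}")
```

Because the peak depends on the cost and value assumptions as much as on the model, two candidates with similar AUC can have different best thresholds and different peak profits. Re-running the curve after any change in the economics is cheap.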

Feature Importance: Inspection, Not Causation

Once a model — single or ensemble — has been fit, the standard inspection move is to ask which features it relied on most.

Feature importance — what the model leaned on (not what caused it)

days_since_last_purchase	31%
support_tickets_last_30d	22%
email_open_rate	14%
discount_share	10%
orders_last_90d	8%
loyalty_tier	6%
avg_order_value	5%
channel_mix	4%

High importance means the model used the variable to sort customers. It does not prove the variable would change churn if you intervened on it.

Figure 1. Feature importance for the deployed Bean & Basket churn model. The bars rank features by their contribution to the model's predictions, not by their causal effect on churn.

Two flavours of importance dominate:

Tree-based importance. For random forests and boosting, the standard summary is total impurity reduction across all splits that use a feature. Fast to compute, dependent on the specific tree structures, and biased toward high-cardinality features.
Permutation importance. Shuffle a feature's values across rows and measure the drop in held-out performance. Slower, more reliable, model-agnostic. The honest first choice when you have the compute budget.
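Permutation importance is simple enough to write out by hand. The sketch below uses plain Python with a hard-coded scoring rule standing in for a fitted model; the synthetic data, the rule, and the deliberately useless noise column are all assumptions for illustration. In practice a library routine such as scikit-learn's permutation_importance does the same job against a real estimator.

```python
import random

random.seed(2)

FEATURES = ["days_since_last_purchase", "support_tickets_last_30d", "noise"]

# Synthetic held-out rows; the churn rule is invented for illustration only.
X, y = [], []
for _ in range(1000):
    row = [random.randint(0, 60), random.randint(0, 5), random.random()]
    p = min(0.95, 0.02 + 0.013 * row[0] + 0.04 * row[1])
    X.append(row)
    y.append(int(random.random() < p))

def model_predict(row):
    """Stand-in for a fitted model: a fixed score thresholded at 0.5.
    It ignores the noise column entirely."""
    score = 0.02 + 0.013 * row[0] + 0.04 * row[1]
    return int(score >= 0.5)

def accuracy(rows, labels):
    return sum(model_predict(r) == t for r, t in zip(rows, labels)) / len(labels)

def permutation_importance(rows, labels, col, n_repeats=10):
    """Mean drop in held-out accuracy after shuffling one column across rows."""
    baseline = accuracy(rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[col] for r in rows]
        random.shuffle(shuffled)
        permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(permuted, labels))
    return sum(drops) / n_repeats

importances = {name: permutation_importance(X, y, i)
               for i, name in enumerate(FEATURES)}
for name, drop in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:28s} {drop:+.3f}")
```

The noise column scores exactly zero because the stand-in model never reads it: shuffling a feature the model ignores cannot change its predictions. That is the sense in which permutation importance measures reliance, and nothing more.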

A standing rule, restated as often as needed: importance is not causation. A high importance on support_tickets_last_30d tells you the model uses tickets to rank customers, not that reducing tickets will reduce churn. A retention program that intervenes on a high-importance feature is doing causal inference, not prediction, and the Part III framework applies.

Partial dependence plots complement importance by showing, for a chosen feature, how the model's prediction changes as that feature varies (averaging over the others). A churn model's partial dependence on days_since_last_purchase may rise steeply between days 14 and 30 and plateau afterward — useful information for designing intervention timing, with the same causal caveat.
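The averaging step behind a partial dependence curve can be sketched directly. The function model_predict_proba below is a hypothetical stand-in for a fitted churn model, written so that risk ramps between day 14 and day 30 and is flat afterward, mirroring the example in the text. The background sample is synthetic.

```python
import random

random.seed(3)

def model_predict_proba(row):
    """Stand-in for a fitted churn model (hypothetical, for illustration).

    row = (days_since_last_purchase, support_tickets_last_30d)
    """
    days, tickets = row
    # Ramps from 0 at day 14 to 1 at day 30, then stays flat.
    recency_effect = min(max(days - 14, 0), 16) / 16
    return min(1.0, 0.10 + 0.45 * recency_effect + 0.05 * tickets)

# Synthetic background sample of customers.
background = [(random.randint(0, 60), random.randint(0, 5)) for _ in range(500)]

def partial_dependence(predict, rows, col, grid):
    """For each grid value, overwrite one feature in every row,
    then average the model's predictions over all rows."""
    curve = []
    for value in grid:
        modified = [r[:col] + (value,) + r[col + 1:] for r in rows]
        curve.append((value, sum(predict(r) for r in modified) / len(modified)))
    return curve

pd_curve = partial_dependence(model_predict_proba, background, 0, range(0, 61, 5))
for days, avg_pred in pd_curve:
    print(f"days={days:2d}  average predicted churn={avg_pred:.3f}")
```

The curve describes how the model's output moves when one input is overwritten, holding the rest of each row as observed. It says nothing about what would happen to a customer whose recency was actually changed.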

Local Explanations: SHAP and Friends

For per-customer explanations, modern practice leans on SHAP (Shapley additive explanations). For each prediction, SHAP attributes the model's deviation from a baseline (the average prediction) to individual features, so that the baseline plus the attributions equals the actual prediction.
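The additivity property can be checked by brute force on a small model. The sketch below computes exact Shapley values by enumerating feature subsets, filling in "absent" features from a background sample. This is one common way to define the value function and an assumption of this example; it is not a description of what the shap library does internally, which uses faster algorithms. The scoring function, background data, and customer are all hypothetical.

```python
import random
from itertools import combinations
from math import factorial

random.seed(4)

FEATURES = ["days_since_last_purchase", "support_tickets_last_30d", "discount_share"]

def model(row):
    """Hypothetical churn score with an interaction term, standing in for a fitted model."""
    days, tickets, discount = row
    return 0.18 + 0.006 * days + 0.03 * tickets - 0.15 * discount + 0.001 * days * tickets

# Synthetic background sample used to represent "feature not known".
background = [(random.randint(0, 60), random.randint(0, 5), random.random())
              for _ in range(200)]

def value(subset, x):
    """Average prediction when features in `subset` are fixed to x
    and the remaining features are drawn from the background rows."""
    total = 0.0
    for b in background:
        mixed = tuple(x[i] if i in subset else b[i] for i in range(len(x)))
        total += model(mixed)
    return total / len(background)

def shapley_values(x):
    """Exact Shapley attribution for each feature (exponential in feature count)."""
    n = len(x)
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}, x) - value(s, x))
        phis.append(phi)
    return phis

customer = (45, 3, 0.10)
baseline = value(set(), customer)   # average prediction over the background
phis = shapley_values(customer)

for name, phi in zip(FEATURES, phis):
    print(f"{name:28s} {phi:+.4f}")
print(f"baseline + attributions = {baseline + sum(phis):.4f}")
print(f"model prediction        = {model(customer):.4f}")
```

The last two printed lines agree: the attributions account for the gap between this customer's score and the average score, split across features. They describe the model's arithmetic, not what would happen if a feature were changed in the real world.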

The strongest use for SHAP is debugging and regulatory accountability:

A high-confidence prediction dominated by a single feature's attribution is a red flag for leakage.
Local explanations are the kind of artefact a regulator or a customer-care team can actually act on ("why was this loan flagged?").

The weakest use is causal storytelling. A SHAP value of −0.12 for discount_share on a specific customer's churn probability is a model attribution, not a counterfactual. The warning given for feature importance applies here too: attribution is not causation.

The Model Card

Every shipped model needs a one-page contract that captures what it is, what it is for, and what it can and cannot do.

One-page model card

Model name	BB-Churn-2026Q2 (gradient boosting)
Intended use	Rank weekly active customers by 60-day churn risk for retention offers.
Target	Churn within 60 days, observed on 2024–2025 cohorts.
Features	RFM, support activity, email engagement, loyalty tier (12 features).
Training data	180k customers, 6 store regions, Jan 2024 – Dec 2025.
Held-out AUC	0.84 (PR-AUC 0.41).
Calibration	Well calibrated up to 0.5; slightly under-confident above.
Known failure modes	New customers (<30 days tenure), B2B accounts.
Fairness review	No disparate FNR across region; not audited for income proxies.
Refresh cadence	Retrain quarterly; monitor weekly KS drift on top-3 features.
Owner	Customer Analytics, Bean & Basket Coffee.

The card is the artefact, not the spreadsheet. If a peer cannot reproduce the decision context from this single page, the model is not ready to ship.

Figure 2. A representative model card for a deployed churn model. The fields are deliberately operational — every row corresponds to a question the team will be asked at some point in the model's lifetime.

The card should include, at minimum:

Intended use. What decision the model supports, for which customers, on what cadence.
Target and features. What is being predicted, what windows the features were computed over, and what the unit of prediction is.
Training data. Dates, populations, exclusions.
Held-out performance. Headline metrics (AUC, PR-AUC, calibration, lift at decile) on a production-shaped held-out set.
Known failure modes. Subpopulations where performance is poor, edge cases, types of input the model should not be scored on.
Fairness review. Whether the model was audited for disparate performance across protected groups; what was found.
Refresh cadence. When the model retrains, what triggers an early refresh, who owns the schedule.
Owner. A real human and a real team.
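One way to make the minimum field list enforceable is to hold the card as a small data structure and refuse to ship when any field is blank. The sketch below is an assumption about how a team might do this, not a prescribed format. The class name ModelCard and the method missing_fields are hypothetical, and the field values are copied from Figure 2, with the owner left blank on purpose.

```python
from dataclasses import dataclass, fields

@dataclass
class ModelCard:
    model_name: str
    intended_use: str
    target: str
    features: str
    training_data: str
    held_out_performance: str
    known_failure_modes: str
    fairness_review: str
    refresh_cadence: str
    owner: str

    def missing_fields(self):
        """Names of fields left blank; any blank means the card is incomplete."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

card = ModelCard(
    model_name="BB-Churn-2026Q2 (gradient boosting)",
    intended_use="Rank weekly active customers by 60-day churn risk for retention offers.",
    target="Churn within 60 days, observed on 2024–2025 cohorts.",
    features="RFM, support activity, email engagement, loyalty tier (12 features).",
    training_data="180k customers, 6 store regions, Jan 2024 – Dec 2025.",
    held_out_performance="AUC 0.84 (PR-AUC 0.41).",
    known_failure_modes="New customers (<30 days tenure), B2B accounts.",
    fairness_review="No disparate FNR across region; not audited for income proxies.",
    refresh_cadence="Retrain quarterly; monitor weekly KS drift on top-3 features.",
    owner="",  # deliberately blank to show the check firing
)

print(card.missing_fields())  # ['owner']
```

A check like this could sit in a deployment pipeline so that an incomplete card blocks release. It verifies only that each field is filled in, not that what it says is true; that part remains a human sign-off.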

A model without a model card is not yet operational. It is a research artefact that has accidentally ended up in production.
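The refresh-cadence row in Figure 2 mentions weekly KS drift monitoring. A minimal sketch of that check follows, computing the two-sample Kolmogorov–Smirnov statistic by hand in plain Python. The feature values are synthetic and the alert level DRIFT_ALERT is a hypothetical placeholder, since the text does not specify one. SciPy's ks_2samp computes the same statistic and also returns a p-value.

```python
import random
from bisect import bisect_right

def ks_statistic(reference, current):
    """Two-sample Kolmogorov–Smirnov statistic:
    the largest vertical gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    for v in ref + cur:
        cdf_ref = bisect_right(ref, v) / len(ref)
        cdf_cur = bisect_right(cur, v) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap

random.seed(5)
# Synthetic values of one monitored feature, e.g. days_since_last_purchase.
training_values = [random.gauss(25, 8) for _ in range(2000)]  # at training time
stable_week = [random.gauss(25, 8) for _ in range(500)]       # same distribution
drifted_week = [random.gauss(32, 8) for _ in range(500)]      # mean has shifted

DRIFT_ALERT = 0.10  # hypothetical alert level, to be set by the model owner

ks_stable = ks_statistic(training_values, stable_week)
ks_drifted = ks_statistic(training_values, drifted_week)
for label, ks in [("stable", ks_stable), ("drifted", ks_drifted)]:
    print(f"{label:8s} KS={ks:.3f}  alert={ks > DRIFT_ALERT}")
```

A drift alert on an input feature does not by itself mean the model has degraded. It is a trigger for the owner named on the card to look, and possibly to bring the quarterly retrain forward.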

Putting Chapter 10 Together

Across Chapter 10 we have built a single recipe:

Start with a logistic regression baseline (§10.1) on a thoughtful feature catalog from §9.4.
Grade it with the full classification evaluation toolkit (§10.2), ending with a threshold–profit curve and a marked operating point.
For numeric targets, swap classification metrics for numeric prediction evaluation (§10.3).
Move to trees and ensembles (§10.4) only when the baseline's bias is the binding constraint.
Ship the model with a model card and inspection artefacts — feature importance, partial dependence, local explanations — that survive contact with audit, compliance, and rotation of personnel.

That recipe is conservative on purpose. The chapters that follow — segmentation, targeting, ranking — will show how the score, once produced, becomes part of a larger customer system. The discipline of Chapter 10 is what makes those downstream systems trustworthy.

Concept check

Three questions spanning Chapter 10 — thresholds, model choice, and the line between importance and cause.

1. The threshold–profit curve has a clear peak at 0.38. Six months later, the offer cost falls by half. The right next step is:
2. A random forest and a gradient-boosted model both achieve AUC 0.86 on the same task. The team has limited maintenance capacity. Which should they deploy?
3. days_since_last_purchase has the highest permutation importance in a churn model. A manager proposes a "recency campaign" that contacts customers as soon as they cross 14 days. Which statement is supported?