§7.3

Heterogeneous Treatment Effects

A causal study's headline is almost always a single number: the Average Treatment Effect (ATE). A push notification lifts conversion 3%. A loyalty offer raises sales 8%. These numbers are useful planning tools — and they are almost always a strategic illusion. The same 3% can hide a 15% lift among occasional shoppers, a 0% response from loyalists who would have bought anyway, and a small negative margin impact from deal-hunters who shifted from full-price purchases. The aggregate hides the targeting opportunity.

This article works through Heterogeneous Treatment Effects (HTE) — what they are, how to estimate them, and why every causal study should report at least one heterogeneity analysis along with the headline ATE. We will distinguish two methods (subgroup analysis and interaction regression), name the traps that make naive heterogeneity claims false, and apply the framework to a real demographic targeting question.

The Executive Question: For Whom Does This Action Work?

A coupon push generates a +3.0 pp aggregate lift in transaction probability. Before approving a blast to a million users, segment the response and the picture changes:

Segment	Causal lift	Pre-coupon margin	Profitability of coupon
Brand loyalists (4+ visits/week)	0 pp	$2.50	Negative — pure margin erosion
Occasional shoppers (1/month)	+12 pp	$2.50	Strongly positive
Deal-hunters (visit only on promotion)	+8 pp	$1.10	Roughly breakeven
Premium connoisseurs (single-origin)	0 pp	$4.50	Negative — brand dilution

The same lever is the right action for one segment and the wrong action for another. The blast is wasteful and possibly value-destroying because the margin given away to loyalists alone, the largest single segment, can offset the entire gain from the genuine responders. Recognizing this is the difference between a campaign that loses money and one that makes it.

Subgroup Analysis vs. Interaction Regression

There are two operational ways to estimate heterogeneity, and they are not interchangeable.

Subgroup analysis

Split the sample by a pre-treatment characteristic (income tier, prior frequency, geography) and estimate the treatment effect separately within each subgroup.

The strengths: simple, easy to explain, and free of any functional-form commitment about how the effect varies. The weaknesses: each subgroup uses a fraction of the sample, so standard errors blow up; and the differences between subgroups are not tested formally. A segment whose effect is statistically significant and a segment whose effect is not might still not be reliably different from each other, and overlapping confidence intervals do not by themselves settle the question either way.
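As a minimal sketch of subgroup analysis, the code below estimates a difference in means inside each level of a pre-treatment segment and reports its standard error. The function names, the simulated data, and the segment labels are illustrative assumptions, not part of the original study. The `difference_z` helper is one simple way to test the gap between two independent subgroups directly, which subgroup analysis on its own does not do.

```python
import numpy as np

def diff_in_means(y, t):
    """Treated-minus-control difference in mean outcome, with its standard error."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=bool)
    treated, control = y[t], y[~t]
    effect = treated.mean() - control.mean()
    se = np.sqrt(treated.var(ddof=1) / treated.size
                 + control.var(ddof=1) / control.size)
    return effect, se

def subgroup_effects(y, t, segment):
    """Run diff_in_means separately inside each level of a pre-treatment segment."""
    y, t, segment = np.asarray(y), np.asarray(t), np.asarray(segment)
    return {str(level): diff_in_means(y[segment == level], t[segment == level])
            for level in np.unique(segment)}

def difference_z(result_a, result_b):
    """z-statistic for the gap between two independent subgroup effects."""
    (eff_a, se_a), (eff_b, se_b) = result_a, result_b
    return (eff_a - eff_b) / np.sqrt(se_a**2 + se_b**2)

# Simulated data (hypothetical): only occasional shoppers respond, by +12 pp.
rng = np.random.default_rng(0)
n = 4000
segment = rng.choice(["loyalist", "occasional"], size=n)
t = rng.integers(0, 2, size=n).astype(bool)
p_buy = 0.20 + 0.12 * (t & (segment == "occasional"))
y = rng.random(n) < p_buy

effects = subgroup_effects(y, t, segment)
for name, (effect, se) in effects.items():
    print(f"{name}: effect = {effect:+.3f}, 95% CI half-width = {1.96 * se:.3f}")
print("z for the difference:",
      difference_z(effects["occasional"], effects["loyalist"]))
```

Each subgroup estimate uses only its own rows, so the half-widths printed here are wider than the one for the pooled sample would be.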

Interaction regression

Estimate one regression that includes a segment indicator and its interaction with the treatment:


$$Y_i \;=\; \beta_0 + \beta_1 \,T_i + \beta_2 \,D_i + \beta_3 \,(T_i \times D_i) + u_i$$

where $T_i$ is the treatment indicator and $D_i$ is a binary segment indicator. The four coefficients each have a precise meaning:

$\beta_0$: average outcome for the untreated, non-segment baseline.
$\beta_1$: treatment effect in the baseline group (where $D_i = 0$).
$\beta_2$: baseline difference between the segment and the baseline group in the absence of treatment (where $T_i = 0$).
$\beta_3$: differential treatment effect, i.e. how much the segment's response differs from the baseline group's.

The total treatment effect for the segment is

$$\tau_{\text{segment}} \;=\; \beta_1 + \beta_3$$

and the formal test for "is the response different in this segment" is whether $\beta_3$ is significantly different from zero. Interaction regression uses all of the data and produces a direct test of the difference between segments — both major improvements over simple subgroup analysis.
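The regression above can be sketched with plain NumPy least squares. This is an illustration under stated assumptions: the function name `interaction_ols` and the simulated data are hypothetical, and the standard errors use the classical homoskedastic OLS formula (an applied analysis would often prefer robust standard errors).

```python
import numpy as np

def interaction_ols(y, t, d):
    """Fit Y = b0 + b1*T + b2*D + b3*(T*D) + u; return coefficients and classical SEs."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    d = np.asarray(d, dtype=float)
    X = np.column_stack([np.ones_like(y), t, d, t * d])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)            # homoskedastic covariance
    return beta, np.sqrt(np.diag(cov))

# Simulated data (hypothetical): baseline effect 0, segment effect +12 pp.
rng = np.random.default_rng(1)
n = 4000
d = rng.integers(0, 2, size=n)            # 1 = member of the segment
t = rng.integers(0, 2, size=n)            # 1 = treated
y = (rng.random(n) < 0.20 + 0.12 * t * d).astype(float)

beta, se = interaction_ols(y, t, d)
tau_baseline = beta[1]                    # effect where D = 0
tau_segment = beta[1] + beta[3]           # effect where D = 1
z_interaction = beta[3] / se[3]           # test of "is the response different?"
print(f"baseline effect {tau_baseline:+.3f}, segment effect {tau_segment:+.3f}, "
      f"z for beta_3 = {z_interaction:.2f}")
```

With one binary segment the model is saturated, so the two point estimates match the separate subgroup differences in means exactly. What the regression adds is the direct test on $\beta_3$.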

Three Traps That Sink Naive HTE Claims

Trap 1: The post-treatment segment

The single most common HTE error is defining a segment using a variable measured after the treatment. "Customers who clicked the email," "users who reached the upgrade page," "stores that hit $10k in monthly sales": each of these is partly a consequence of the treatment. Conditioning on it produces the collider bias discussed in Chapter 6.3 and destroys the causal interpretation.

The rule is absolute: segments must be defined on variables fixed before treatment was assigned.
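A small simulation makes the danger concrete. Everything in it is a hypothetical setup: the true treatment effect on the outcome is zero, a latent `engagement` variable drives the outcome, and a post-treatment flag `active` depends on both engagement and treatment. Comparing treated and control customers inside the `active` segment then produces a spurious effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

engagement = rng.normal(size=n)                  # latent, fixed before treatment
t = rng.integers(0, 2, size=n).astype(bool)      # randomized treatment

# Post-treatment variable: treatment pushes marginal customers over the threshold.
active = engagement + 1.0 * t + rng.normal(size=n) > 0.5

# Outcome depends on engagement only, so the true treatment effect is zero.
y = engagement + rng.normal(size=n)

def diff_in_means(y, t):
    """Treated-minus-control difference in mean outcome."""
    return y[t].mean() - y[~t].mean()

overall = diff_in_means(y, t)
within_active = diff_in_means(y[active], t[active])

print(f"full sample:           {overall:+.3f}")
print(f"inside 'active' only:  {within_active:+.3f}")
```

The full-sample estimate sits near zero, as randomization guarantees. Inside the `active` segment the estimate is clearly negative, because the treated customers who qualify as active are on average less engaged than the control customers who qualify without any nudge.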

Trap 2: Multiple testing (p-hacking)

If you slice the data by 20 different segmenting variables at the 5% significance level, you should expect roughly one to come up significant by chance alone, even when the treatment has no effect anywhere. Dashboards that let stakeholders slice lift by hundreds of customer features manufacture spurious heterogeneity at scale.

The defenses are well-known: pre-specify the segments of interest before looking at the data; apply a multiple-testing correction (Bonferroni, Benjamini–Hochberg) when many segments are tested; and treat exploratory subgroup findings as hypotheses to validate in a fresh experiment, not as conclusions in themselves.
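The arithmetic behind the trap and the two corrections named above can be sketched as follows. The implementations are short textbook versions written for illustration, the p-values are made up, and the family-wise calculation assumes the 20 tests are independent.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject only p-values at or below alpha / m (controls family-wise error)."""
    p = np.asarray(pvals, dtype=float)
    return p <= alpha / p.size

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up BH procedure: controls the false discovery rate at level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])    # largest rank that passes
        reject[order[:k + 1]] = True         # reject everything up to that rank
    return reject

m, alpha = 20, 0.05
expected_false_positives = m * alpha          # 1.0 spurious "finding" on average
prob_at_least_one = 1 - (1 - alpha) ** m      # about 0.64 if tests are independent

# Hypothetical p-values from ten segment-level tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print("uncorrected:", int(np.sum(np.asarray(pvals) <= alpha)))
print("Bonferroni: ", int(bonferroni(pvals).sum()))
print("BH:         ", int(benjamini_hochberg(pvals).sum()))
```

On these made-up p-values, five segments look significant before correction, two survive Benjamini–Hochberg, and one survives Bonferroni.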

Trap 3: The noisy subgroup illusion

Slice the sample into "female shoppers aged 18–22 in Denver who buy decaf" and you get a large point estimate with an enormous standard error. Highly specific subgroups with eye-catching point estimates are almost always sampling noise. Always report the interval around segment effects, and refuse to act on a segment whose interval crosses zero.
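To see why narrow slices produce wide intervals, the sketch below computes the 95% half-width for a difference in two conversion rates at different sample sizes. The baseline rate, the lift, and the sample sizes are hypothetical numbers chosen for illustration.

```python
import math

def ci_half_width(p_control, p_treated, n_per_arm, z=1.96):
    """95% half-width for a difference in two proportions, equal arm sizes."""
    se = math.sqrt(p_control * (1 - p_control) / n_per_arm
                   + p_treated * (1 - p_treated) / n_per_arm)
    return z * se

# Hypothetical: 20% baseline conversion, 23% under treatment (a 3 pp lift).
full_sample = ci_half_width(0.20, 0.23, n_per_arm=50_000)
tiny_slice = ci_half_width(0.20, 0.23, n_per_arm=50)

print(f"50,000 per arm: +/- {100 * full_sample:.1f} pp")   # about 0.5 pp
print(f"50 per arm:     +/- {100 * tiny_slice:.1f} pp")    # about 16 pp
```

The half-width scales with one over the square root of the sample size, so a slice holding a thousandth of the data has an interval roughly 32 times wider. A 3 pp effect is invisible inside a 16 pp band, and any point estimate large enough to look exciting there is consistent with noise.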

Targeting with HTE

Once you have credible per-segment effects, targeting follows mechanically. Compute the per-segment expected profit:

$$\pi_{\text{segment}} \;=\; \tau_{\text{segment}} \times (\text{margin per response}) - (\text{cost per send})$$

Send the treatment to segments where $\pi_{\text{segment}} > 0$; suppress it where $\pi_{\text{segment}} \le 0$. The headline aggregate ATE is rarely the decision-relevant number. The decision-relevant numbers are the per-segment $\tau_{\text{segment}}$ values and the profitability they imply at each segment's margin.
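Applied to the coupon example from the start of this section, the rule can be sketched as below. The lifts and margins come from the segment table. The cost per send of $0.09 is an assumed value, and using the pre-coupon margin as the margin per response is a simplification, so the figures are illustrative only.

```python
# Lifts (percentage points) and margins ($) from the coupon segment table.
segments = {
    "Brand loyalists":      {"lift_pp": 0.0,  "margin": 2.50},
    "Occasional shoppers":  {"lift_pp": 12.0, "margin": 2.50},
    "Deal-hunters":         {"lift_pp": 8.0,  "margin": 1.10},
    "Premium connoisseurs": {"lift_pp": 0.0,  "margin": 4.50},
}

COST_PER_SEND = 0.09  # hypothetical; not given in the text

def expected_profit(lift_pp, margin, cost_per_send):
    """pi = tau * (margin per response) - (cost per send), with tau as a probability."""
    return (lift_pp / 100.0) * margin - cost_per_send

profits = {
    name: expected_profit(s["lift_pp"], s["margin"], COST_PER_SEND)
    for name, s in segments.items()
}
targeted = [name for name, pi in profits.items() if pi > 0]

for name, pi in sorted(profits.items(), key=lambda kv: kv[1], reverse=True):
    decision = "send" if pi > 0 else "suppress"
    print(f"{name:22s} pi = {pi:+.3f}  -> {decision}")
```

Under these assumptions only the occasional shoppers clear the bar, and deal-hunters land almost exactly at breakeven. The formula as written charges the zero-lift segments only the cost of the send; any discount redeemed by customers who would have bought anyway would push their profit further below zero.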

A useful summary chart: the segment effect as a horizontal interval plot, sorted from most to least responsive. The visual lets a manager see at a glance which segments to target, which to ignore, and which to investigate further because the interval is too wide to act on yet.

Concept check

Three questions spanning the two-group counterfactual, single-unit synthetic controls, and segment-level effects.

1. The DiD estimator works by netting out two specific confounders. Which pair?
2. Why is the non-negativity constraint ($w_j \ge 0$) on donor weights so important for synthetic control?
3. You estimate the treatment effect of an email campaign separately among "customers who opened the email" and "customers who did not open the email." Why is this analysis broken?

Data Case: Income Heterogeneity in the Milk Pricing Quasi-Experiment

The milk pricing study has a natural HTE question: does equal pricing nudge whole-milk share more in some neighborhoods than others? We stratify the supermarket sample by ZIP-code median income (low, medium-low, medium-high, high) and re-estimate the equal-pricing effect within each tier.

Figure 1. Whole-milk share effect of equal pricing, by ZIP-code income tier (chart title: "The price-structure effect is largest in lower-income ZIP codes"). Each interval is the estimated lift in whole-milk share inside that income tier, measured as the equal-price minus whole-milk-expensive difference, with its 95% confidence interval. The effect is concentrated in lower-income ZIP codes and shrinks to near zero in higher-income ones.

Two readings of Figure 1 are worth being explicit about. First, the aggregate effect from Chapter 5 (about +8 pp) averages across all four tiers and is well below the response in the lowest-income tier (+12.8 pp). Read alone, the aggregate understates the case for an intervention targeted at that tier, and by the same averaging it overstates the response to expect in the higher-income tiers. Second, the high-income tier's interval crosses zero. A defensible reading is "we cannot reject zero effect in this group with the data we have." That is not the same as "there is no effect there"; it is a precision statement about what this dataset can show.

For decision purposes, the figure does the targeting work directly. A policy aimed at moving whole-milk share would be much more cost-effective if it concentrated on lower-income ZIP codes, where the response is large and clearly distinguishable from zero. A blanket rollout across all neighborhoods would spend part of its budget in the high-income tier, where the estimated movement is small and not distinguishable from zero.