§5.3

Experiments and A/B Testing

The randomized controlled trial — the A/B test — is the gold standard of causal inference for one reason and one reason only: random assignment buys statistical equivalence between the treated and untreated groups before any treatment occurs. Any systematic difference in outcomes that appears afterwards can be attributed to the treatment with very few additional assumptions. Every other design in Part III is, in effect, a strategy for recovering that equivalence when randomization is unavailable.

This article works through what randomization gives you, why the interval around the estimate is what should drive a decision, and the operational decisions a manager actually makes when running an experiment in production — stratification, multi-armed bandits, and the gap between statistical significance and a business-relevant lift. The data case at the end uses the Milk quasi-experiment to show how the same diagnostic discipline carries over when randomization is impossible.


The Executive Question: Did the Action Cause Enough Lift?

A product team launches a redesigned mobile loyalty board. Conversion and gross margin both move up in the test arm. Two questions every executive should ask before approving rollout:

  • Is the lift real, or could it be a coincidence? — a statistical question, answered by the confidence interval.
  • Is the lift big enough to justify the rollout? — a decision question, answered by the threshold on the Decision Question Card.

These two questions are independent. A test can be statistically decisive and still fail the threshold; another can pass the threshold but be too imprecise to act on yet. Conflating the two is the single most common failure mode of corporate experimentation.


Why Randomization Works

Let $D_i \in \{0,1\}$ be the treatment indicator. Random assignment guarantees that $D_i$ is statistically independent of both potential outcomes:

Independence from random assignment

$$\bigl(Y_i(1),\, Y_i(0)\bigr) \;\perp\; D_i$$

This independence has a concrete consequence: the average outcome we observe in the treated arm is an unbiased estimate of what everyone would have looked like under treatment, and the same holds for control:

$$\mathbb{E}[\,Y_i \mid D_i = 1\,] = \mathbb{E}[\,Y_i(1)\,], \qquad \mathbb{E}[\,Y_i \mid D_i = 0\,] = \mathbb{E}[\,Y_i(0)\,]$$

The difference in observed means is therefore an unbiased estimate of the ATE — selection bias from the previous chapter is gone by construction:

Experimental lift estimator

$$\widehat{\tau} \;=\; \overline{Y}_{\text{treatment}} - \overline{Y}_{\text{control}}$$
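
The step from independence to an unbiased lift can be written out in a few lines. This is a sketch that additionally assumes the usual consistency condition, which the text uses implicitly: the observed outcome is $Y_i = D_i\,Y_i(1) + (1 - D_i)\,Y_i(0)$. The second line applies consistency, the third applies independence:

```latex
\begin{aligned}
\mathbb{E}[\,\widehat{\tau}\,]
  &= \mathbb{E}[\,Y_i \mid D_i = 1\,] - \mathbb{E}[\,Y_i \mid D_i = 0\,] \\
  &= \mathbb{E}[\,Y_i(1) \mid D_i = 1\,] - \mathbb{E}[\,Y_i(0) \mid D_i = 0\,] \\
  &= \mathbb{E}[\,Y_i(1)\,] - \mathbb{E}[\,Y_i(0)\,] \;=\; \text{ATE}
\end{aligned}
```

Without independence the third line fails, and the gap between the second and third lines is the selection bias.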

Figure 1 makes the contrast concrete. With random assignment, the two arms have equivalent distributions of every pre-treatment covariate, observed or unobserved. With self-selection, the two arms differ on the covariate before any treatment is applied, and the post-treatment comparison is contaminated by that pre-existing difference.

Random assignment makes arms comparable; self-selection does not

[Figure 1 panels. Left, "Randomized assignment": each bin split ~50/50, arms balanced at baseline. Right, "Self-selected adoption": high-spend customers opt in, arms differ before any treatment. Horizontal axis in both panels: pre-treatment covariate (e.g. baseline spend). Legend: Arm A (treatment), Arm B (control).]
Figure 1. Random assignment produces arms that look statistically identical on any baseline covariate (left). Self-selection produces arms that already differ before any treatment (right) — the post-treatment difference is then a mix of treatment effect and selection bias.
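
The contrast in Figure 1 can be reproduced with a small simulation. The sketch below is illustrative only: the function name `simulate`, the spend distribution, the opt-in rule, and the true lift of 2.0 are all assumptions made up for the example, not values from this article.

```python
import random
import statistics

def simulate(n=20000, true_lift=2.0, seed=0):
    """Difference in means under random assignment vs. self-selection.

    All numbers here are invented for illustration.
    """
    rng = random.Random(seed)
    results = {}
    for design in ("randomized", "self_selected"):
        treated, control = [], []
        for _ in range(n):
            baseline = rng.gauss(50.0, 10.0)     # pre-treatment covariate (baseline spend)
            y0 = baseline + rng.gauss(0.0, 5.0)  # potential outcome without treatment
            y1 = y0 + true_lift                  # potential outcome with treatment
            if design == "randomized":
                d = rng.random() < 0.5           # coin flip, independent of (y0, y1)
            else:
                d = baseline > 50.0              # assumed rule: high spenders opt in
            # Only one potential outcome is ever observed per unit.
            (treated if d else control).append(y1 if d else y0)
        results[design] = statistics.fmean(treated) - statistics.fmean(control)
    return results
```

Under these assumptions the randomized estimate should land close to the true lift, while the self-selected estimate also absorbs the baseline spend gap between the two arms.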

The Interval, Not the Point

A single number — "the lift was 1.7 percentage points" — is almost never enough to decide on. What an executive needs is the uncertainty around it, and how that uncertainty sits relative to two reference lines: zero (is there any effect?) and the decision threshold (is the effect big enough?).

The standard error of the lift is

Standard error of the lift

$$SE(\widehat{\tau}) \;=\; \sqrt{\frac{s^2_T}{N_T} + \frac{s^2_C}{N_C}}$$

where $s^2_T$ and $s^2_C$ are the sample variances of the outcome in the treatment and control arms, and $N_T$ and $N_C$ are the arm sizes. A 95% confidence interval is the point estimate plus or minus roughly two standard errors. Three qualitatively different situations are worth recognizing on sight:

Why the interval matters more than the point

[Figure 2: three intervals plotted against two reference lines, 0 (no effect) and the decision threshold. Underpowered test: wide CI crosses zero, undecided. Precise null: tight CI around zero, confident no effect. Decisive lift: CI well above zero and threshold.]
Figure 2. Three readouts with the same conceptual shape but very different decisions. An underpowered test cannot distinguish zero from a real effect; a precise null is a confident 'no'; a decisive lift sits cleanly above both zero and the decision threshold.

The leftmost case in Figure 2 is the worst kind of result to act on: a positive point estimate whose interval crosses both zero and the threshold. You learn nothing decision-relevant. The middle case is informative even though "nothing happened" — it lets you stop investing in this lever and move on. The rightmost case is the green light.
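
The readouts above can be sketched in code. This is a minimal illustration, assuming a normal approximation (z = 1.96) and a positive decision threshold. The function names and the returned labels are hypothetical, chosen only to mirror the cases discussed in the text.

```python
import math
import statistics

def lift_interval(treatment, control, z=1.96):
    """Return (lift, lower, upper) using the unpooled standard error of the lift."""
    lift = statistics.fmean(treatment) - statistics.fmean(control)
    se = math.sqrt(
        statistics.variance(treatment) / len(treatment)
        + statistics.variance(control) / len(control)
    )
    return lift, lift - z * se, lift + z * se

def readout(lower, upper, threshold):
    """Place an interval relative to zero and a positive decision threshold."""
    if lower > threshold:
        return "decisive lift"                 # clears both zero and the threshold
    if upper < threshold:
        if lower > 0:
            return "real but below threshold"  # statistically decisive, fails the threshold
        return "no actionable effect"          # includes the precise null of Figure 2
    return "undecided"                         # interval straddles the threshold
```

Keeping the interval computation separate from the readout reflects the point that the statistical question and the decision question are distinct: the same interval can receive a different readout when the threshold changes.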


Stratification: Balancing What You Can Name

Random assignment balances arms in expectation. In any single experiment, especially a small one, sheer luck can produce a treatment arm that is, say, mostly high-spend customers. The fix is stratified randomization: partition units into homogeneous strata based on observable pre-treatment characteristics (income tier, store size, prior 90-day spend bucket), then randomize within each stratum. The arms are then balanced on the stratifying variables by construction, and the stratum-aware estimator has lower variance than the simple difference in means whenever the strata help predict the outcome.

Stratify on variables that (a) you can measure before treatment and (b) you believe drive the outcome. Stratifying on something irrelevant adds complexity without buying precision.
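
A minimal sketch of stratified randomization and a stratum-aware estimator follows. The function names, the even split within each stratum, and the stratum-size weighting are illustrative assumptions. The sketch also assumes every stratum contains at least two units, so that both arms are non-empty.

```python
import random
import statistics
from collections import defaultdict

def stratified_assign(units, stratum_of, seed=0):
    """Randomize within each stratum; returns {unit: 1 (treatment) or 0 (control)}."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for u in units:
        by_stratum[stratum_of[u]].append(u)
    assignment = {}
    for members in by_stratum.values():
        rng.shuffle(members)                  # random order within the stratum
        half = len(members) // 2
        for i, u in enumerate(members):
            assignment[u] = 1 if i < half else 0
    return assignment

def stratified_lift(outcome, assignment, stratum_of):
    """Stratum-size-weighted average of within-stratum differences in means."""
    groups = defaultdict(lambda: ([], []))    # stratum -> (control outcomes, treated outcomes)
    for u, d in assignment.items():
        groups[stratum_of[u]][d].append(outcome[u])
    total = len(assignment)
    lift = 0.0
    for control, treated in groups.values():
        weight = (len(control) + len(treated)) / total
        lift += weight * (statistics.fmean(treated) - statistics.fmean(control))
    return lift
```

Because each comparison is made within a stratum, differences in outcome levels between strata (for example, high-spend versus low-spend customers) drop out of the estimate.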


Multi-Armed Bandits: When to Use Them, and When Not To

A standard A/B test holds the traffic split fixed (often 50/50) until a pre-registered stopping rule fires. Every customer routed to the inferior arm during the exploration phase is an opportunity cost.

Multi-armed bandit algorithms (Thompson sampling, UCB, epsilon-greedy) shift the traffic split dynamically toward whichever arm is currently winning, reducing that opportunity cost. The cost: because the split adapts to the data, the usual per-arm averages become biased estimators of the long-run effect, and downstream metrics measured days or weeks after exposure become hard to attribute cleanly.

The decision is not "A/B vs. bandit" in the abstract. It is:

  • Use bandits when the outcome is fast (clicks, page views, immediate conversion), the lever is reversible, and you care more about cumulative regret than about a clean long-run estimate. Headlines, banner choices, recommendation slates.
  • Use a stable A/B test when the outcome is slow (90-day retention, lifetime value, churn), when downstream business metrics matter as much as the immediate one, or when the result will be cited as evidence outside the team that ran it.
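
To make the dynamic traffic shift concrete, here is a minimal Thompson sampling sketch for arms with a binary outcome such as a click. It assumes a uniform Beta(1, 1) prior on each arm's conversion rate. The function name and the simulated `true_rates` are hypothetical: in production the rates are unknown and the feedback comes from real users.

```python
import random

def thompson_sampling(true_rates, n_rounds=5000, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling; returns (pulls, successes) per arm."""
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(n_rounds):
        # Draw one plausible conversion rate per arm from its Beta posterior.
        draws = [rng.betavariate(successes[a] + 1, failures[a] + 1) for a in range(k)]
        arm = max(range(k), key=lambda a: draws[a])  # route this visitor to the best draw
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:           # simulated binary feedback
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls, successes
```

In a run of this sketch, `pulls` should show traffic concentrating on the better arm as evidence accumulates. That is the opportunity-cost saving, and it is also why the weaker arm ends up with few observations and a noisy estimate.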

When You Cannot Randomize: Quasi-Experiments

Many of the most important business questions cannot be tested by flipping a coin. You cannot randomize who lives in which ZIP code, which stores get a regional pricing structure, or which states pass a policy. In these settings we look for quasi-experiments: naturally occurring assignment mechanisms that mimic random assignment well enough to act on.

A defensible quasi-experiment carries the same diagnostic discipline an A/B test would:

  1. Balance check. Are the treated and control groups similar on pre-treatment variables we can measure? If not, the comparison is already contaminated.
  2. Placebo check. Does the supposed treatment "move" an outcome that it has no business moving? If so, the two groups differ on more than the treatment.
  3. Robustness across specifications. Does the headline estimate survive sensible alternative definitions of the treated and control groups, time windows, and controls?

These checks are exactly the discipline of the Decision Question Card carried into observational data.
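
The first two checks reduce to the same computation applied to different variables: a difference in means between the groups, with an interval around it. The sketch below is one simple way to organize that, not the procedure used in the data case. The function names, the normal-approximation interval, and the "quiet" rule (interval includes zero) are assumptions for illustration. The third check, robustness, would mean re-running this under alternative group definitions, time windows, and controls.

```python
import math
import statistics

def difference_check(treated, control, z=1.96):
    """Difference in means between groups with an approximate 95% interval."""
    diff = statistics.fmean(treated) - statistics.fmean(control)
    se = math.sqrt(
        statistics.variance(treated) / len(treated)
        + statistics.variance(control) / len(control)
    )
    lower, upper = diff - z * se, diff + z * se
    # "quiet" means the interval includes zero: no detectable group difference.
    return {"diff": diff, "lower": lower, "upper": upper, "quiet": lower <= 0.0 <= upper}

def run_diagnostics(checks):
    """checks maps a label to (treated_values, control_values); returns label -> result."""
    return {label: difference_check(t, c) for label, (t, c) in checks.items()}
```

A hypothetical call would pass a pre-treatment covariate under a label such as `"balance"` and an outcome the treatment should not move under `"placebo"`. A quiet result is reassuring only when the interval is also narrow, since an imprecise check can include zero without demonstrating comparability.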


Data Case: The Milk Pricing Quasi-Experiment

Roughly 1,700 supermarkets price whole milk in one of two structures. In flat-priced stores, whole milk costs the same as 2%, 1%, and skim. In whole-expensive stores, whole milk carries a small premium — roughly fourteen cents on average. The pricing structure was chosen by regional chains, not assigned by a researcher.

We want to know whether the flat-pricing structure causally shifts share toward whole milk. Before looking at the outcome, we run the two diagnostic checks an A/B test would have made unnecessary: balance and placebo.

The placebo is quiet; the milk outcome moves

1,708 stores. Differences are equal-price stores minus whole-milk-expensive stores.

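[Figure 3: bar chart of group differences, horizontal axis from -2 pp to 8 pp. Legend: balance check, placebo outcome, milk outcome. Labeled bars: diet soda share, 0.5 pp; whole milk share, 8.2 pp.]

  • Stores: 627 equal-price
  • ZIP income gap: $197
  • Diet-income correlation: 0.63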

Figure 3. Quasi-experimental diagnostics for the milk pricing comparison. The income difference between the two store groups (balance check, blue) sits close to zero. The diet-soda placebo (orange), an outcome the price of milk should not move, is also quiet. The whole-milk share difference (green) is large and clearly separated from zero.

  • Equal-price stores: 627, compared with 1,081 stores where whole milk is more expensive.
  • Whole-milk share: +8.2 pp, the main behavioral difference in the quasi-experimental comparison.
  • Diet-soda placebo: +0.5 pp, a small difference on an outcome the milk price structure should not move.

Figure 4. High-level cohort summary. The two store groups are large and demographically comparable; the headline outcome is concentrated in the whole-milk category.

The diagnostics in Figure 3 are the closest a real field setting comes to behaving like a randomized experiment. Both the balance check and the placebo are quiet — the two groups look comparable on income and on an outcome that the treatment should not affect. The outcome itself moves, by roughly eight percentage points, and is clearly separated from the noise band of the placebo.

That is the most this quasi-experimental design can claim: given the diagnostics behave like a randomized trial, the eight-point shift can be read causally. It does not rule out an unobserved confounder that happens to be correlated with the pricing structure; only randomization can do that.