§5.2

Causality and the Counterfactual

Every strategic decision rests on a causal claim. When a firm spends millions on a loyalty program, it is asserting that the program will cause customers to spend more than they would have spent otherwise. The hard truth of data-driven decision making is that this "otherwise" — the counterfactual — is never directly observable. We can see what a customer spent after joining the program. We can never observe what that same customer would have spent, at the same moment, had they not joined.

This unbridgeable gap is the Fundamental Problem of Causal Inference. This chapter introduces the standard mathematical language used to reason about it, the Potential Outcomes Framework, and shows precisely why a naive comparison of treated and untreated units generally misstates causal effects.


The Executive Question: What Is the True Causal Lift?

Suppose a marketing dashboard reports that customers who received a discount coupon spent more on average than customers who did not. It is tempting to subtract the two averages and call the difference "the coupon's lift." The framing is wrong.

The right question is not "do recipients spend more than non-recipients?", which is descriptive arithmetic. The right question is "would the recipients have spent less if they had not received the coupon?" Figure 1 illustrates the branching structure of this dilemma: for any single unit at any single moment, we observe one of two possible paths and never both.

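[Figure 1 diagram: starting from "Same store, same month", one branch runs from "Observed: action taken" to "Observed outcome" (seen in the data); the other runs from "Missing: action not taken" to "Counterfactual outcome" (must be estimated). The gap between the two outcomes is labeled "effect".]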
Figure 1. The potential-outcomes branching model. For each unit we observe one path; causal lift is the gap between the observed path and the unobservable counterfactual path.

The Potential Outcomes Framework

Let $i$ index a unit (a customer, a store-week, a state-month) and let $D_i \in \{0,1\}$ be a binary treatment indicator: $D_i = 1$ if unit $i$ received the treatment, $D_i = 0$ otherwise.

For every unit, define two potential outcomes:

  • $Y_i(1)$: the outcome unit $i$ would experience under treatment.
  • $Y_i(0)$: the outcome unit $i$ would experience under no treatment.

The individual treatment effect is the difference between these two states:

Individual treatment effect

$$\tau_i = Y_i(1) - Y_i(0)$$

The Fundamental Problem of Causal Inference is that we only observe the realized outcome, which depends on whether unit $i$ was actually treated:

Observed outcome

$$Y_i = D_i \, Y_i(1) + (1 - D_i) \, Y_i(0)$$

If unit $i$ was treated, we see $Y_i(1)$ and $Y_i(0)$ is missing. If untreated, we see $Y_i(0)$ and $Y_i(1)$ is missing. The individual effect $\tau_i$ is therefore never observable for any single unit.
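As a minimal sketch of the observed-outcome equation, the Python snippet below uses a hypothetical four-customer table. The ids and dollar values are invented for illustration, and both potential outcomes are visible only because the data is made up. It computes the realized outcome and shows which potential outcome goes missing for each unit.

```python
# Hypothetical potential outcomes (dollars of spend) for four customers.
# Invented for illustration: in real data only one of y1 / y0 is recorded.
UNITS = [
    {"id": "a", "d": 1, "y1": 14.0, "y0": 11.0},
    {"id": "b", "d": 0, "y1": 9.0, "y0": 8.0},
    {"id": "c", "d": 1, "y1": 20.0, "y0": 18.0},
    {"id": "d", "d": 0, "y1": 7.0, "y0": 7.0},
]


def observed_outcome(d, y1, y0):
    """Observed outcome: Y = D * Y(1) + (1 - D) * Y(0)."""
    return d * y1 + (1 - d) * y0


def analyst_view(unit):
    """Return what the data actually records for one unit.

    The potential outcome of the path not taken is replaced by None,
    so the individual effect y1 - y0 cannot be computed from this view.
    """
    d = unit["d"]
    return {
        "id": unit["id"],
        "d": d,
        "y": observed_outcome(d, unit["y1"], unit["y0"]),
        "y1": unit["y1"] if d == 1 else None,
        "y0": unit["y0"] if d == 0 else None,
    }


for unit in UNITS:
    print(analyst_view(unit))
```

Every printed row has exactly one `None`, which is the missing half of that unit's individual effect.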

Because we cannot recover individual effects, decision-makers focus on averages over units. The Average Treatment Effect is

Average Treatment Effect (ATE)

$$\tau_{\text{ATE}} = \mathbb{E}\!\left[\,Y_i(1) - Y_i(0)\,\right]$$

and the Average Treatment Effect on the Treated narrows the average to units that actually received the treatment:

Average Treatment Effect on the Treated (ATT)

$$\tau_{\text{ATT}} = \mathbb{E}\!\left[\,Y_i(1) - Y_i(0) \,\big|\, D_i = 1\,\right]$$

ATE and ATT generally differ when effects vary across units. If your treatment is opt-in (coupons that loyal customers self-select into, software features that power users enable), the treated population is not a random draw from the full population, and the ATT will reflect the response of the kind of unit that takes the treatment rather than the response of the population as a whole.
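To make the gap between the two averages concrete, here is a small Python sketch on hypothetical data. The six `(d, y1, y0)` triples are invented, and both potential outcomes are listed only because this is a made-up table. The three opt-in customers are assumed to respond more strongly to the coupon than the rest.

```python
from statistics import mean

# Hypothetical (d, y1, y0) triples, invented for illustration.
# The three opt-in customers (d = 1) respond more strongly to the coupon.
UNITS = [
    (1, 16.0, 12.0),
    (1, 15.0, 12.0),
    (1, 17.0, 13.0),
    (0, 10.0, 9.0),
    (0, 9.0, 9.0),
    (0, 11.0, 9.0),
]


def ate(units):
    """Average of Y(1) - Y(0) over every unit."""
    return mean(y1 - y0 for _, y1, y0 in units)


def att(units):
    """Average of Y(1) - Y(0) over treated units only (d == 1)."""
    return mean(y1 - y0 for d, y1, y0 in units if d == 1)


print(f"ATE = {ate(UNITS):.2f}")  # ATE = 2.33
print(f"ATT = {att(UNITS):.2f}")  # ATT = 3.67
```

In this toy table the ATT exceeds the ATE because the units that opted in are exactly the ones with the larger individual effects.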


Why Naive Comparisons Fail

If we cannot see individual counterfactuals, can we just compare group averages? The naive difference is

Naive group comparison

$$\widehat{\Delta}_{\text{naive}} = \mathbb{E}[\,Y_i \mid D_i = 1\,] - \mathbb{E}[\,Y_i \mid D_i = 0\,]$$

Replace the observed outcomes with potential outcomes (using $Y_i = Y_i(1)$ when $D_i = 1$ and $Y_i = Y_i(0)$ when $D_i = 0$), then add and subtract $\mathbb{E}[\,Y_i(0) \mid D_i = 1\,]$, and the difference splits cleanly into two terms:

Selection bias decomposition

$$\widehat{\Delta}_{\text{naive}} \;=\; \underbrace{\mathbb{E}[\,Y_i(1) - Y_i(0) \mid D_i = 1\,]}_{\text{ATT (true causal effect on the treated)}} \;+\; \underbrace{\mathbb{E}[\,Y_i(0) \mid D_i = 1\,] - \mathbb{E}[\,Y_i(0) \mid D_i = 0\,]}_{\text{Selection bias}}$$

This is one of the most important equations in applied business analytics. It says: the naive treated-minus-untreated difference equals the true ATT plus selection bias, where selection bias is the difference in untreated-state potential outcomes between the two groups.
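The split comes from adding and subtracting the unobserved term $\mathbb{E}[\,Y_i(0) \mid D_i = 1\,]$. Written out line by line, using only the definitions introduced earlier in this section:

```latex
\begin{aligned}
\widehat{\Delta}_{\text{naive}}
  &= \mathbb{E}[\,Y_i \mid D_i = 1\,] - \mathbb{E}[\,Y_i \mid D_i = 0\,] \\
  &= \mathbb{E}[\,Y_i(1) \mid D_i = 1\,] - \mathbb{E}[\,Y_i(0) \mid D_i = 0\,] \\
  &= \mathbb{E}[\,Y_i(1) \mid D_i = 1\,] - \mathbb{E}[\,Y_i(0) \mid D_i = 1\,]
     + \mathbb{E}[\,Y_i(0) \mid D_i = 1\,] - \mathbb{E}[\,Y_i(0) \mid D_i = 0\,] \\
  &= \underbrace{\mathbb{E}[\,Y_i(1) - Y_i(0) \mid D_i = 1\,]}_{\text{ATT}}
     + \underbrace{\mathbb{E}[\,Y_i(0) \mid D_i = 1\,] - \mathbb{E}[\,Y_i(0) \mid D_i = 0\,]}_{\text{Selection bias}}
\end{aligned}
```

The second line uses the observed-outcome equation within each group; the third line adds zero; the fourth groups the first two terms by linearity of conditional expectation.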

Selection bias is zero only when the two groups would have looked identical in the absence of treatment. That is rarely true in observational data, because units choose, or are chosen for, treatment based on factors that also drive the outcome. Loyal customers opt into coupons; high-margin stores adopt new features first; states that legalize a policy differ in unobserved ways from states that do not.
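A short simulation can illustrate when the bias term vanishes. The sketch below is hypothetical throughout: baseline spend is assumed to rise with a latent loyalty score, the coupon is assumed to add a constant 2.0 for everyone, and the function name `simulate` and its parameters are invented for this example. It compares an opt-in assignment rule with a coin flip.

```python
import random
from statistics import mean


def simulate(assignment, n=20000, seed=0):
    """Return (naive difference, selection bias) for one assignment rule.

    All numbers are hypothetical. The selection-bias term is computable
    here only because the simulation records Y(0) for treated units too.
    """
    rng = random.Random(seed)
    treated_y1, treated_y0, control_y0 = [], [], []
    for _ in range(n):
        loyalty = rng.random()        # latent trait in [0, 1)
        y0 = 8.0 + 6.0 * loyalty      # spend without the coupon
        y1 = y0 + 2.0                 # spend with the coupon (constant effect)
        if assignment == "self_selected":
            treated = loyalty > 0.5   # loyal customers opt in
        else:
            treated = rng.random() < 0.5  # coin flip, unrelated to loyalty
        if treated:
            treated_y1.append(y1)
            treated_y0.append(y0)
        else:
            control_y0.append(y0)
    naive = mean(treated_y1) - mean(control_y0)
    bias = mean(treated_y0) - mean(control_y0)
    return naive, bias


for rule in ("self_selected", "randomized"):
    naive, bias = simulate(rule)
    print(f"{rule:>13}: naive = {naive:.2f}, selection bias = {bias:.2f}")
```

Under the opt-in rule the naive difference should land near 5 even though the true effect is 2; under the coin flip the bias term should be close to zero, so the naive difference sits near the true effect.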

A numerical illustration

Suppose loyal customers self-select into a coupon program. We observe an average spend of $\$15.50$ in the coupon group and $\$9.20$ in the non-coupon group, a naive lift of $\$6.30$. The decomposition above warns us that this $\$6.30$ is

$$\$6.30 \;=\; \underbrace{\text{ATT}}_{\text{causal coupon effect}} \;+\; \underbrace{\mathbb{E}[Y_i(0) \mid D_i = 1] - \mathbb{E}[Y_i(0) \mid D_i = 0]}_{\text{Selection bias: loyal customers spend more anyway}}$$

The bias term is plausibly large and positive: the customers who took the coupon were the kind of customer who would have spent more even without it. Without a design that controls selection, we have no way to split the $\$6.30$ into its two components.
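The arithmetic of that ambiguity fits in a few lines of Python. The group averages come from the example above; the candidate selection-bias values and the helper name `implied_att` are hypothetical, chosen only to show how widely the implied effect can move.

```python
def implied_att(naive_lift, selection_bias):
    """Rearranged decomposition: ATT = naive difference - selection bias."""
    return naive_lift - selection_bias


# Group averages from the coupon example in the text.
naive_lift = round(15.50 - 9.20, 2)  # 6.30

# Hypothetical selection-bias values; the data alone cannot pick among them.
for bias in (0.00, 2.00, 4.00, 6.30, 7.50):
    att = implied_att(naive_lift, bias)
    print(f"selection bias = {bias:5.2f}  ->  implied ATT = {att:5.2f}")
```

The same observed lift is consistent with a large effect, no effect, or a negative effect, depending entirely on a quantity the dashboard does not contain.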


The Four Standard Counterfactual Designs

The rest of Part III is, essentially, four families of methods for constructing a credible stand-in for $\mathbb{E}[Y_i(0) \mid D_i = 1]$, the missing untreated potential outcome of the treated group.

Table 1. Four standard ways to construct a counterfactual, and the price each one asks in identifying assumptions.
| Method | Counterfactual is built from… | Key assumption | Home in Part III |
|---|---|---|---|
| Randomization (A/B test) | Randomly assigned control group | Random assignment was actually random and held up | Chapter 5 |
| Regression control | Untreated units, adjusted for observed confounders | No unobserved confounders that move with both treatment and outcome | Chapter 6 |
| Difference-in-differences | Treated units' pre-trend extrapolated using control units' trend | Parallel pre-treatment trends would have continued | Chapter 7 |
| Synthetic control | Weighted combination of untreated donor units tracking the treated unit pre-treatment | Good pre-treatment fit and no shocks unique to the treated unit | Chapter 7 |

Each method buys credibility at the cost of an assumption you can name and defend. The Decision Question Card from the previous article forces you to choose which counterfactual you are willing to defend before you choose a method.
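As one illustration of how a design turns an assumption into a counterfactual, here is a minimal two-period difference-in-differences sketch. The function name `did_estimate` and all spend figures are hypothetical, and the calculation is valid only under the parallel-trends assumption named in Table 1.

```python
def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-period difference-in-differences.

    The counterfactual for the treated group is its own pre-period level
    plus the change observed in the control group (parallel trends).
    Returns (estimated effect, counterfactual post-period level).
    """
    counterfactual_post = treated_pre + (control_post - control_pre)
    return treated_post - counterfactual_post, counterfactual_post


# Hypothetical average spend per customer, before and after a coupon launch.
effect, counterfactual = did_estimate(
    treated_pre=12.0, treated_post=15.5,
    control_pre=8.0, control_post=9.0,
)
print(f"counterfactual treated spend: {counterfactual:.2f}")  # 13.00
print(f"DiD estimate of the effect:   {effect:.2f}")          # 2.50
```

The treated group's raw before-after change is 3.5, but 1.0 of that is the trend the control group also experienced, which the design attributes to the counterfactual rather than to the treatment.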

Concept check

Three questions on framing a decision and reasoning about the missing counterfactual.

  1. Which of the following is a decision-ready question rather than a metric-focused question?
  2. The Fundamental Problem of Causal Inference says that…
  3. In the decomposition $\widehat{\Delta}_{\text{naive}} = \text{ATT} + \text{Selection bias}$, "selection bias" is…

Data Case: Colorado Housing and the Synthetic Counterfactual

To see counterfactual construction in action with real data, consider Colorado's legalization of recreational cannabis, with legal retail sales beginning in January 2014. A real-estate investor wants to know whether the policy moved housing values. The unit of analysis is the state-month and the outcome is the Zillow Home Value Index (ZHVI).

The Fundamental Problem applies at the state level: we observe Colorado after legalization, but we never observe the Colorado that would have existed had legalization not happened. A naive before-after comparison conflates the policy with the broader US housing recovery that was already underway in 2014.

The synthetic control method constructs a credible counterfactual by searching for a weighted combination of donor states — states where the policy did not change — whose weighted housing trajectory closely matches Colorado's pre-treatment path. Figure 2 shows the actual and synthetic paths; Figure 3 shows the donor weights.

Colorado separates from its synthetic comparison after 2014

Pre-period fit uses 216 months before 2014-01-31.

[Figure 2 chart: Zillow Home Value Index by year; x-axis 1996 to 2020, y-axis 100k to 400k dollars; series "Colorado" and "synthetic", with a marker at 2014.]
Figure 2. Colorado's actual ZHVI (blue) against the optimized synthetic counterfactual (orange). The two paths track tightly before January 2014, the visual signature of a credible synthetic control design, and separate cleanly afterwards.

The synthetic Colorado is mostly Kansas, Massachusetts, Utah, and Michigan

Weights are constrained to be nonnegative and sum to one.

  • Kansas: 41.9%
  • Massachusetts: 35.3%
  • Utah: 12.5%
  • Michigan: 10.3%
Figure 3. Donor weights selected by the synthetic control algorithm. The synthetic Colorado is dominated by a small number of states whose pre-treatment housing trajectories closely tracked Colorado's.
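The weight search can be sketched in a few lines of Python. This is not the procedure that produced Figures 2 and 3: it swaps in a brute-force grid search over the weight simplex where real implementations use a constrained optimizer, and every donor path, the treated path, and the 10% post-treatment jump are invented for illustration. The helper names (`simplex_grid`, `synthetic_path`, `pre_period_sse`) are likewise hypothetical.

```python
import math
from itertools import product


def simplex_grid(n_donors, steps):
    """Yield weight vectors with entries k/steps, all >= 0, summing to 1."""
    for combo in product(range(steps + 1), repeat=n_donors - 1):
        used = sum(combo)
        if used <= steps:
            yield [k / steps for k in combo] + [(steps - used) / steps]


def synthetic_path(weights, donors):
    """Weighted combination of donor paths, period by period."""
    n_periods = len(donors[0])
    return [sum(w * path[t] for w, path in zip(weights, donors))
            for t in range(n_periods)]


def pre_period_sse(weights, donors, treated, n_pre):
    """Sum of squared gaps between treated and synthetic before treatment."""
    synth = synthetic_path(weights, donors)
    return sum((treated[t] - synth[t]) ** 2 for t in range(n_pre))


# Hypothetical data: 30 periods, treatment starts at period 24.
N_PERIODS, N_PRE = 30, 24
periods = range(N_PERIODS)
donors = [
    [100 + 2.0 * t for t in periods],
    [150 + 0.5 * t + 5 * math.sin(t) for t in periods],
    [120 + 0.1 * t * t for t in periods],
    [90 + 3.0 * t + 4 * math.cos(t / 2) for t in periods],
]
# The treated unit tracks 0.6 * donor 0 + 0.4 * donor 2, then jumps 10%
# after treatment (an invented effect, so the sketch has something to find).
base = synthetic_path([0.6, 0.0, 0.4, 0.0], donors)
treated = [y if t < N_PRE else 1.10 * y for t, y in enumerate(base)]

# Pick the grid point with the best pre-treatment fit.
weights = min(simplex_grid(len(donors), steps=20),
              key=lambda w: pre_period_sse(w, donors, treated, N_PRE))
synth = synthetic_path(weights, donors)
post_gaps = [(treated[t] - synth[t]) / synth[t]
             for t in range(N_PRE, N_PERIODS)]
avg_gap = sum(post_gaps) / len(post_gaps)

print("weights:", weights)
print(f"average post-treatment gap: {avg_gap:.1%}")
```

Because the weights are fitted on pre-treatment periods only, the post-treatment gap is not something the fit could have manufactured; it is read off as the estimated effect, subject to the no-unique-shocks assumption. A grid search scales poorly with the number of donors, which is why it serves here only as a transparent stand-in.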

The post-2014 gap between Colorado and the synthetic counterfactual averages roughly twenty percent. That is the design's estimate of the causal effect of legalization on Colorado housing values, under the assumption that no other Colorado-specific shock after 2014 could have produced the gap. We will return to this design in Chapter 7 and stress-test that assumption with placebo donor states.

For now, treat this case as a worked example of the framework: an action, a unit, an outcome, and a counterfactual that was built — not assumed — from the data.