§7.1

Difference-in-Differences

When a firm rolls out a new feature, a price change, or a regional policy, it rarely has the luxury of randomly assigning treatment. Rollouts are regional, gradual, and tangled up with macro trends. Difference-in-differences (DiD) is the workhorse design for these settings. It bypasses the two naive comparisons every executive deck reaches for first — before vs. after in the treated group, and treated vs. untreated after the rollout — and replaces them with a comparison that nets out both confounders at once.

This article works through the DiD logic visually and algebraically, derives the regression specification that recovers the DiD estimate as an interaction coefficient, and ends with the identifying assumption — parallel trends — that everything else hinges on.

The Executive Question: Did We Grow Faster Than the Tide?

A regional team launches a new mobile checkout feature in the West Coast stores. Weekly transactions in West Coast stores rise from 100,000 to 130,000 — a thirty-thousand-transaction gain. The team prepares to recommend a national rollout.

Before approving, the executive asks:

How much of the 30,000-transaction increase was actually caused by the feature, and how much would have happened anyway?

Over the same period, untreated East Coast stores rose from 90,000 to 100,000. Ten thousand of the West Coast's gain was the tide rising for everyone. The remaining twenty thousand is the part the feature can plausibly claim.

That subtraction-of-subtractions is the DiD estimator. Table 1 shows where it sits relative to the two naive comparisons.

Table 1. Why naive comparisons overstate the effect of a regional rollout. The DiD estimate (third row) nets out both the common tide and the stable baseline gap between regions.

Comparison	Calculation	Implied effect	What it confounds with the effect
Naive before/after (West only)	West Post − West Pre = 130k − 100k	+30k	Common seasonal/macro tide affecting both regions
Naive cross-section (Post only)	West Post − East Post = 130k − 100k	+30k	Stable baseline difference between the two regions
Difference-in-differences	(130k − 100k) − (100k − 90k)	+20k	Neither (under parallel trends)
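The three rows of Table 1 can be reproduced in a few lines. This is a minimal sketch using the four weekly-transaction figures from the example; the variable names are illustrative and not taken from any real codebase.

```python
# Weekly transactions from the West Coast checkout example.
west_pre, west_post = 100_000, 130_000   # treated region
east_pre, east_post = 90_000, 100_000    # untreated region

# Naive comparison 1: before vs. after, treated region only.
before_after = west_post - west_pre

# Naive comparison 2: treated vs. untreated, post period only.
cross_section = west_post - east_post

# Difference-in-differences: treated change minus control change.
did = (west_post - west_pre) - (east_post - east_pre)

print(f"Naive before/after:  {before_after:+,}")
print(f"Naive cross-section: {cross_section:+,}")
print(f"DiD estimate:        {did:+,}")
```

Both naive numbers come out at 30,000 here only by coincidence of the example; they are contaminated by different things, as the last column of Table 1 notes.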

The 2×2 Picture

The cleanest way to understand DiD is to memorize the 2×2.

Difference-in-differences as a 2×2 comparison

Figure 1. The four cells of a difference-in-differences design. The DiD estimate is the difference between the two row-wise differences: how much the treated group changed minus how much the control group changed.

Each row-wise difference (post minus pre within one group) removes that group's stable level, and with it the baseline gap between the two groups. Subtracting the control row's difference from the treated row's difference then removes the time shock common to both groups. What remains is exactly the part of the post-treatment change in the treated group that the control group did not experience.
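One way to see the cancellation is to write each cell mean under a simple additive model. The symbols below ($\alpha_g$ for a group's stable level, $\lambda_t$ for the period shock shared by both groups, $\tau$ for the treatment effect, and $D_{g,t}$ for an indicator equal to 1 only in the treated group's post period) are notation assumed here for illustration.

$$
\begin{aligned}
E[Y_{g,t}] &= \alpha_g + \lambda_t + \tau \, D_{g,t} \\
\Delta_{\text{treated}} &= (\alpha_T + \lambda_{\text{post}} + \tau) - (\alpha_T + \lambda_{\text{pre}}) = (\lambda_{\text{post}} - \lambda_{\text{pre}}) + \tau \\
\Delta_{\text{control}} &= (\alpha_C + \lambda_{\text{post}}) - (\alpha_C + \lambda_{\text{pre}}) = \lambda_{\text{post}} - \lambda_{\text{pre}} \\
\Delta_{\text{treated}} - \Delta_{\text{control}} &= \tau
\end{aligned}
$$

In the checkout example this corresponds to $\lambda_{\text{post}} - \lambda_{\text{pre}} = 10\text{k}$ and $\tau = 20\text{k}$. The group levels $\alpha_T$ and $\alpha_C$ never appear in the final line, which is why the baseline gap between regions does not bias the estimate under this model.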

Visualizing DiD: Three Stories, One Picture

The pedagogical power of DiD is best seen by toggling between the three comparison frames in Figure 2. The same four numbers support three very different decisions depending on which comparison you make.

Difference-in-Differences vs. Naive Causal Comparisons

Difference-in-differences view: subtracts the control group's trend to isolate the feature's lift of +20k weekly transactions. Series shown: Treated Stores (West) and Control Stores (East).

The Math:

$$\text{DiD Effect} = (130\text{k} - 100\text{k}) - (100\text{k} - 90\text{k}) = 30\text{k} - 10\text{k} = 20\text{k}$$

Subtracting the common trend of +10k (captured by the untreated East Coast control stores) from the total observed growth of +30k in the treated West Coast stores isolates the part of the gain attributable to the mobile checkout rollout, provided the parallel-trends assumption holds.

Figure 2. Difference-in-differences with parallel trends. Toggle through the three views. The first two are the naive comparisons that overstate the effect; the third constructs the counterfactual path the treated group would have followed without treatment — the dashed line that the control group's slope projects forward — and the gap between that and the observed post-treatment value is the DiD estimate.

The dashed projected line is the heart of the design: it is the counterfactual for the treated group, built from the control group's trend. DiD's identifying assumption is that this projection is credible.
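The projection can be written out explicitly. The sketch below builds the counterfactual post-period value for the West by adding the East's observed change to the West's pre-period level; the variable names are illustrative, and the whole calculation is only as good as the parallel-trends assumption behind it.

```python
# Weekly transactions from the West Coast checkout example.
west_pre, west_post = 100_000, 130_000   # treated region
east_pre, east_post = 90_000, 100_000    # untreated region

# The control group's change is the slope projected forward.
control_change = east_post - east_pre

# Counterfactual: where the West would have ended up without the feature,
# assuming it would have moved by the same amount as the East.
counterfactual_west_post = west_pre + control_change

# DiD estimate: observed post value minus the counterfactual.
did = west_post - counterfactual_west_post

print(f"Counterfactual West (post): {counterfactual_west_post:,}")
print(f"DiD estimate:               {did:+,}")
```

This is algebraically the same number as the difference of differences; framing it as observed minus counterfactual just makes the assumption visible.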

DiD as a Regression

The 2×2 view is intuitive, but most production estimates of DiD use a regression, because regressions extend gracefully to many regions, many time periods, and additional controls.

The basic two-group, two-period specification is

Difference-in-differences regression

$$Y_{it} \;=\; \beta_0 + \beta_1 \,\text{Treated}_i + \beta_2 \,\text{Post}_t + \beta_3 \,(\text{Treated}_i \times \text{Post}_t) + \varepsilon_{it}$$

where

$Y_{it}$ is the outcome for unit $i$ in period $t$,
$\text{Treated}_i = 1$ if unit $i$ is in the treated group and $0$ otherwise,
$\text{Post}_t = 1$ if period $t$ is after the treatment date and $0$ otherwise,
$\beta_3$ is the coefficient on the interaction, and it is the DiD estimate.

To see why $\beta_3$ is the DiD, write out the expected outcome in each of the four cells and take the difference of the two differences:

Cell	Expected outcome
Control · Pre	$\beta_0$
Control · Post	$\beta_0 + \beta_2$
Treated · Pre	$\beta_0 + \beta_1$
Treated · Post	$\beta_0 + \beta_1 + \beta_2 + \beta_3$

The change for the control group is $\beta_2$. The change for the treated group is $\beta_2 + \beta_3$. The difference of differences is

$$\text{DiD} \;=\; (\beta_2 + \beta_3) - \beta_2 \;=\; \beta_3$$

The main effects absorb the two nuisance differences: $\beta_1$ captures the stable treated-minus-control gap, and $\beta_2$ captures the common post-minus-pre shift. The interaction coefficient $\beta_3$ is what is left over: the extra change that appears only in the treated-post cell.
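The equivalence can be checked numerically. The sketch below fits the four-coefficient regression by ordinary least squares on a small hypothetical store-level dataset, constructed so that the cell means match the checkout example (90k, 100k, 100k, 130k). It uses a hand-rolled normal-equations solver only to stay dependency-free; in practice a regression library would do this step and also report standard errors, which this sketch does not.

```python
# Hypothetical store-period rows: (treated, post, weekly_transactions).
# Two stores per cell; cell means are 90k, 100k, 100k, and 130k.
rows = [
    (0, 0, 88_000), (0, 0, 92_000),     # control, pre
    (0, 1, 99_000), (0, 1, 101_000),    # control, post
    (1, 0, 97_000), (1, 0, 103_000),    # treated, pre
    (1, 1, 128_000), (1, 1, 132_000),   # treated, post
]

def design_row(treated, post):
    """Regressors: intercept, Treated, Post, Treated x Post."""
    return [1.0, float(treated), float(post), float(treated * post)]

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

X = [design_row(t, p) for t, p, _ in rows]
y = [float(v) for _, _, v in rows]
b0, b1, b2, b3 = ols(X, y)

print(f"beta_0 (control, pre level): {b0:,.0f}")
print(f"beta_1 (baseline gap):       {b1:,.0f}")
print(f"beta_2 (common time shift):  {b2:,.0f}")
print(f"beta_3 (DiD estimate):       {b3:,.0f}")
```

Because the model has one coefficient per cell, the fitted values reproduce the four cell means exactly, so the interaction coefficient coincides with the hand-computed difference of differences.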

The Identifying Assumption: Parallel Trends

DiD works only if the treated and control groups would have followed parallel trends in the absence of treatment. The two pre-treatment levels are allowed to differ; the two pre-treatment slopes are not.

The visual test is to plot both groups' pre-treatment trajectories on the same chart. If they look parallel for several periods before the treatment date, the assumption is more credible, though it can never be confirmed directly because the treated group's untreated post-period path is unobserved. If the treated group was already accelerating relative to the control group before the treatment hit, the design is in trouble: that pre-existing momentum will be mistaken for a treatment effect.

The most useful pre-treatment plot is the event study: align all units to event time (treatment date = 0), plot the average treated–control gap in each pre and post period, and check whether the pre-period gaps are flat. We will see event-study plots throughout the next chapters; for now, the principle is just: if your pre-trends are not parallel, you do not have a DiD design — you have a story.
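The event-study calculation can be sketched with a handful of numbers. The series below are hypothetical (weekly transactions in thousands, indexed by event time) and the choice of the last pre-period as the reference is an assumption of this sketch, though a common one. It computes the treated-minus-control gap in each period, normalizes by the reference period, and reports how far the pre-period gaps stray from zero.

```python
# Hypothetical weekly transactions (thousands) by event time.
# Event time 0 is the first treated period.
treated = {-4: 96.5, -3: 97.0, -2: 99.0, -1: 100.0, 0: 130.0, 1: 131.0}
control = {-4: 86.0, -3: 87.0, -2: 89.0, -1: 90.0, 0: 100.0, 1: 101.0}

REFERENCE = -1  # last pre-treatment period, used as the baseline

# Treated-minus-control gap in every period.
gap = {t: treated[t] - control[t] for t in sorted(treated)}

# Event-study coefficients: each gap relative to the reference period.
event_study = {t: gap[t] - gap[REFERENCE] for t in gap}

pre = {t: v for t, v in event_study.items() if t < 0}
post = {t: v for t, v in event_study.items() if t >= 0}

# Flat pre-period gaps (values near zero) support parallel trends.
max_pre_deviation = max(abs(v) for v in pre.values())

for t, v in event_study.items():
    label = "pre " if t < 0 else "post"
    print(f"event time {t:+d} ({label}): gap vs. reference = {v:+.1f}")
print(f"largest pre-period deviation: {max_pre_deviation:.1f}")
```

How large a pre-period deviation is tolerable is a judgment call that depends on the noise in the data; this sketch reports the number and does not perform a formal test.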

Data Case: Colorado Legalization, the Naive DiD

The West Coast checkout example was built to teach the mechanics: two groups, two periods, four clean numbers. Real policy questions rarely hand you a matched treated/control pair. Colorado's January 2014 legalization of recreational cannabis — the case that opened Chapter 5 and returns in full in the next article — is a sharper test, because it has only one treated unit: Colorado. Every other state is, in principle, a candidate control.

The most naive version of DiD treats "every other state" as if it were a single control group and averages them. The outcome is the Zillow Home Value Index (ZHVI), the treatment date is January 2014, and the pre/post windows are the same ones used throughout this case.

Table 2. Naive difference-in-differences on Colorado housing values, using the simple unweighted average of all 49 other states as the control group.

Group	Pre-2014 avg. ZHVI	Post-2014 avg. ZHVI	Growth
Colorado (treated)	$215,192	$344,321	+60.0%
All other states (simple average)	$173,780	$232,596	+33.8%

The DiD arithmetic is the same subtraction-of-subtractions as the checkout example, applied here to growth rates rather than levels: Colorado's growth rate minus the control group's growth rate.

Naive DiD estimate, Colorado legalization

$$\text{Naive DiD} \;=\; 60.0\% - 33.8\% \;=\; 26.2 \text{ percentage points}$$

Read literally, this says legalization lifted Colorado home values by roughly 26 percentage points relative to the rest of the country. That number is real arithmetic, correctly executed — and it is also a design you should not trust.
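The arithmetic can be recomputed directly from the dollar averages in Table 2. The sketch below assumes the growth figures are the growth of the averaged values, not an average of state-level growth rates; the function and variable names are illustrative.

```python
# Average ZHVI values from Table 2 (US dollars).
co_pre, co_post = 215_192, 344_321          # Colorado (treated)
other_pre, other_post = 173_780, 232_596    # simple average, 49 other states

def growth_pct(pre, post):
    """Percentage growth from the pre-period to the post-period average."""
    return (post / pre - 1.0) * 100.0

co_growth = growth_pct(co_pre, co_post)
other_growth = growth_pct(other_pre, other_post)

# Naive DiD on growth rates, in percentage points.
naive_did_pp = co_growth - other_growth

print(f"Colorado growth:     {co_growth:.1f}%")
print(f"Other-states growth: {other_growth:.1f}%")
print(f"Naive DiD:           {naive_did_pp:.1f} percentage points")
```

Note that this is a difference in growth rates, so the result is in percentage points of growth and not in dollars.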

Why the naive control group is the wrong comparison

The DiD logic depends on the control group's trend standing in for the treated unit's counterfactual trend. Averaging all 49 other states together quietly assumes that, absent legalization, Colorado's home values would have grown after 2014 at the same rate as this grand average. Colorado has a distinctive mountain-west, tech-migration, land-constrained housing market, and there is no reason a simple average dominated by, say, large Northeast and Midwest states should share its counterfactual path. The parallel-trends assumption is doing an enormous amount of unexamined work here: it is not one comparison to defend but an implicit claim about 49 states at once, most of which look nothing like Colorado on the dimensions that drive home prices.

This is the general failure mode flagged above: there is only one treated unit, and classical DiD does not naturally pick a "best" comparison. Naive averaging is the ad-hoc fallback, and it is fragile in exactly the way the pre-trends check exists to catch. It only looks like a fully specified estimate because it produces a clean number.

The fix: build a weighted twin instead of an averaged one

The next article, Synthetic Control (Chapter 7.2), works this same Colorado case and fixes exactly this weakness. Instead of collapsing 49 states into one flat average, it optimizes a weighted combination of donor states chosen so the blend's pre-2014 housing trajectory tracks Colorado's as closely as possible — and lets the post-2014 gap between actual and synthetic Colorado be the estimate.

Table 3. Naive DiD versus synthetic control on the same Colorado case. The synthetic control estimate replaces a flat average of 49 states with an optimized weighted twin fit to Colorado's own pre-treatment path.

Method	Control group construction	Estimated effect
Naive DiD	Simple average of all 49 other states	+26.2%
Synthetic control	Optimized weighted blend of donor states, fit to Colorado's pre-trend	+20.2%

The two designs do not agree, and the gap between 26.2% and 20.2% is not noise — it is the naive design's control group failing to reproduce Colorado's actual counterfactual trend. Chapter 7.2 shows the weights behind that second number and the pre-treatment fit that makes it credible.
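As a preview of the idea only, the weighting step can be sketched with two donors and a grid search. Everything in this block is hypothetical: the series are made-up index values (not ZHVI data), two donors stand in for a full donor pool, and a grid search stands in for whatever optimizer Chapter 7.2 uses. The sketch picks the donor weight that best reproduces the treated unit's pre-period path, then measures the post-period gap against that weighted blend.

```python
# Hypothetical pre-treatment index values (NOT real ZHVI data).
treated_pre = [94.0, 97.4, 101.2, 106.0]
donor_a_pre = [100.0, 104.0, 109.0, 115.0]
donor_b_pre = [80.0, 82.0, 83.0, 85.0]

def pre_fit_error(w):
    """Sum of squared pre-period gaps for donor weights (w, 1 - w)."""
    return sum(
        (y - (w * a + (1.0 - w) * b)) ** 2
        for y, a, b in zip(treated_pre, donor_a_pre, donor_b_pre)
    )

# Grid search over weights in [0, 1]; weights are non-negative and sum to 1.
candidate_weights = [i / 100 for i in range(101)]
best_w = min(candidate_weights, key=pre_fit_error)

# Hypothetical post-treatment values for the treated unit and each donor.
treated_post, donor_a_post, donor_b_post = 120.0, 121.0, 87.0

# Synthetic counterfactual and the implied effect.
synthetic_post = best_w * donor_a_post + (1.0 - best_w) * donor_b_post
effect = treated_post - synthetic_post

# For contrast: the flat 50/50 average that naive DiD would use.
naive_fit_error = pre_fit_error(0.5)

print(f"best weight on donor A:    {best_w:.2f}")
print(f"pre-fit error (weighted):  {pre_fit_error(best_w):.4f}")
print(f"pre-fit error (50/50 avg): {naive_fit_error:.4f}")
print(f"estimated effect:          {effect:+.1f}")
```

The contrast between the two pre-fit errors is the point: an equal-weight average is just one choice of weights, and usually not the one that tracks the treated unit best.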

When DiD Is Not Enough: Looking Ahead

DiD assumes you have a group of control units whose averaged trajectory is a credible counterfactual. Two things stretch that assumption in real settings:

There is only one treated unit. A state passes a unique policy; a firm pilots in a single city. The classical DiD does not naturally pick a "best" comparison — and, as the Colorado case just showed, ad-hoc averaging is fragile. Synthetic control (Chapter 7.2) addresses this by building a custom weighted counterfactual in place of the naive average.
Treatment effects vary across units. The same rollout helps low-income customers a lot and loyal customers not at all. The average effect understates the targeting opportunity. Heterogeneous treatment effects (Chapter 7.3) tackle this directly.

DiD remains the right starting point for most multi-unit rollouts. The next two articles handle the cases where it isn't enough, starting immediately with the single-treated-unit problem the Colorado case just made concrete.