§6.2
Identification
A model can be mathematically elegant, statistically precise, and visually beautiful — and still be wrong about what it claims. Statistical tools are designed for estimation: turning a dataset into a number with a standard error. They do not, on their own, tell you whether that number is the causal effect you wanted or a confounded artifact of the way the data were generated. Closing that gap is the job of identification.
This article separates identification from estimation, introduces directed acyclic graphs as the visual language of identification, walks through the three pathway patterns every manager should be able to recognize, and gives you a one-page identification memo to demand before approving any data-driven recommendation.
The Executive Question: Why Should We Believe This Comparison Is Fair?
A team reports that stores with new digital signage outperformed stores without by 25%. The regression is tight, the p-value is small, and the deck recommends a rollout.
Then the audit: store managers were allowed to opt in. The signs went disproportionately to high-traffic suburban stores whose managers had budget. The 25% reflects the gap between two different kinds of stores, not what installing a sign in a given store would do.
The model estimated something. It just was not what the deck was claiming. The question every executive should ask before reading a coefficient as a lever is:
Why should we believe the difference in the outcome between these groups is caused by the treatment, rather than by some other systematic difference between the groups?
That question is identification.
Identification vs. Estimation
The two phases of any causal analysis are easy to conflate, but separating them is the single most useful habit in evaluating data work.
| Phase | Question it answers | Substance |
|---|---|---|
| Identification | In an ideal world with infinite data, does the comparison we want to make actually exist in the design? | Assumptions, business mechanism, choice of treatment and control group |
| Estimation | Given the identified design, how do we compute the number and its uncertainty? | OLS, matching, maximum likelihood, sample size, standard errors |
A pipeline analogy: identification is choosing where the pipe taps the water — a clean source or a polluted one. Estimation is measuring the flow once the pipe is laid. Measuring the flow with great precision tells you nothing useful if the pipe is hooked to a muddy swamp.
Identification is the binary property — either your causal effect is identified by the design or it is not. Estimation is the continuous one — given identification, more data buys you more precision. The two cannot substitute for each other.
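A minimal simulation sketch of why precision cannot buy identification, assuming a hypothetical opt-in process like the signage story above. The true effect is set to zero, yet the naive treated-minus-untreated gap settles on a nonzero value as the sample grows. The variable `traffic`, the coefficient 2.0, and the sample sizes are illustrative assumptions, not figures from any study.

```python
import random
import statistics


def naive_gap(n, true_effect=0.0, seed=0):
    """Treated-minus-untreated difference in mean outcome under opt-in."""
    rng = random.Random(seed)
    treated, untreated = [], []
    for _ in range(n):
        traffic = rng.gauss(0, 1)  # hypothetical store traffic (assumption)
        # High-traffic stores are more likely to opt in to the treatment.
        d = 1 if traffic + rng.gauss(0, 1) > 0 else 0
        # Traffic also drives the outcome, so it confounds the comparison.
        y = true_effect * d + 2.0 * traffic + rng.gauss(0, 1)
        (treated if d else untreated).append(y)
    return statistics.mean(treated) - statistics.mean(untreated)


# More data tightens the estimate around the wrong number, not the true one.
for n in (200, 2_000, 20_000):
    print(n, round(naive_gap(n), 2))
```

In this sketch the gap stays well away from the true effect of zero at every sample size; only a change in the design (here, breaking the link between traffic and opt-in) would move it.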
Directed Acyclic Graphs: The Language of Identification
A directed acyclic graph (DAG) is a flowchart of causal assumptions. Nodes are variables; arrows are causal directions you are willing to assert. The graph is the picture of the data-generating process the team believes is operating, and three of its patterns recur constantly.
The fork: a confounder opens a backdoor
The simplest and most common identification threat: a third variable Z drives both the treatment D and the outcome Y, creating a path that produces correlation between D and Y without any causal effect at all.
Confounding: a backdoor path from D to Y through Z
Comparing treated and untreated units without controlling for Z mixes the causal D → Y arrow with the spurious D ← Z → Y path.
The fix for a fork is to control for (or condition on) Z, either by including it in a regression or by selecting comparison units to be balanced on Z. The fork is the pattern that omitted-variable bias formalizes; it is why we add controls.
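The fork can be sketched on synthetic data. The example below assumes a linear world with a true effect of 1.0 and a confounder Z that raises both D and Y; all coefficients are made up for illustration. Controlling for Z is done by residualizing D and Y on Z, which gives the same coefficient on D as a regression of Y on D and Z together.

```python
import random
from statistics import mean


def slope(x, y):
    """OLS slope from regressing y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals of y after regressing it on x (with an intercept)."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - a - b * xi for xi, yi in zip(x, y)]


def simulate_fork(n=5_000, effect=1.0, seed=1):
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]                  # confounder
    d = [0.8 * zi + rng.gauss(0, 1) for zi in z]             # Z -> D
    y = [effect * di + 1.5 * zi + rng.gauss(0, 1)            # D -> Y and Z -> Y
         for di, zi in zip(d, z)]
    naive = slope(d, y)                                      # ignores Z
    adjusted = slope(residuals(z, d), residuals(z, y))       # controls for Z
    return naive, adjusted


naive, adjusted = simulate_fork()
print(f"naive: {naive:.2f}  adjusted for Z: {adjusted:.2f}  (true effect: 1.00)")
```

The adjustment only works here because Z is observed and the simulated relationships are linear; an unmeasured confounder would leave the backdoor open.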
The chain: a mediator should not be controlled for
A chain is D → M → Y: the treatment D moves the outcome Y through an intermediate variable M. Customer satisfaction is the canonical example: a price cut increases satisfaction, which increases retention. Satisfaction (M) is on the causal pathway.
The instinct to "control for everything" goes badly wrong here. Conditioning on the mediator M closes off the very path you wanted to measure: the regression coefficient on D then captures only the part of the effect that does not go through satisfaction, which is usually not the quantity you want.
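A short synthetic sketch of the mediator problem, assuming a randomized treatment whose effect runs mostly through a mediator (think satisfaction). The coefficients are invented for illustration: the total effect is 0.3 + 2.0 × 0.7 = 1.7, of which only 0.3 bypasses the mediator.

```python
import random
from statistics import mean


def slope(x, y):
    """OLS slope from regressing y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals of y after regressing it on x (with an intercept)."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - a - b * xi for xi, yi in zip(x, y)]


def simulate_chain(n=20_000, seed=2):
    rng = random.Random(seed)
    d = [rng.choice((0, 1)) for _ in range(n)]               # randomized treatment
    m = [0.7 * di + rng.gauss(0, 0.5) for di in d]           # D -> mediator
    y = [2.0 * mi + 0.3 * di + rng.gauss(0, 1)               # mediator -> Y, plus a small direct path
         for mi, di in zip(m, d)]
    total = slope(d, y)                                      # what the business usually wants
    with_mediator = slope(residuals(m, d), residuals(m, y))  # "controls for" the mediator
    return total, with_mediator


total, with_mediator = simulate_chain()
print(f"total effect: {total:.2f}  after controlling for the mediator: {with_mediator:.2f}")
```

Controlling for the mediator is not always a mistake: it is the right move when the direct effect is the quantity of interest. The sketch only shows that the two regressions answer different questions.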
The collider: conditioning creates spurious correlation
A collider is a variable C that D and Y both cause: D → C ← Y. The trap is counterintuitive. In the raw data, D and Y might be independent. Conditioning on the collider, whether by including it as a control or by restricting the sample to a particular value of it, induces a correlation between D and Y that does not exist in the underlying world.
The classic example: restrict your churn analysis to customers who reached the upgrade screen (a behavior caused by both your treatment and the customer's underlying engagement). Inside that screened sample, treatment and outcome will appear correlated even if they are not.
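The upgrade-screen trap can be reproduced on synthetic data. The sketch below assumes a randomized treatment with no effect on the outcome, an unobserved engagement score that drives the outcome, and a screen that customers reach more often when they are treated or highly engaged. Every number and name is a made-up assumption for illustration.

```python
import random
from statistics import mean


def gap(d, y):
    """Treated-minus-untreated difference in mean outcome."""
    treated = [yi for di, yi in zip(d, y) if di == 1]
    control = [yi for di, yi in zip(d, y) if di == 0]
    return mean(treated) - mean(control)


def simulate_collider(n=20_000, seed=3):
    rng = random.Random(seed)
    d, y, reached = [], [], []
    for _ in range(n):
        di = rng.choice((0, 1))                      # randomized treatment
        engagement = rng.gauss(0, 1)                 # unobserved customer trait
        # Reaching the screen is caused by both treatment and engagement.
        screen = 1.5 * di + engagement + rng.gauss(0, 0.5) > 1.0
        yi = engagement + rng.gauss(0, 0.5)          # outcome: treatment has NO effect
        d.append(di)
        y.append(yi)
        reached.append(screen)
    full_gap = gap(d, y)
    kept = [(di, yi) for di, yi, r in zip(d, y, reached) if r]
    screened_gap = gap([k[0] for k in kept], [k[1] for k in kept])
    return full_gap, screened_gap


full_gap, screened_gap = simulate_collider()
print(f"all customers: {full_gap:.2f}  screen-reachers only: {screened_gap:.2f}")
```

In the full sample the gap is close to zero, as it should be. Inside the screened sample, treated customers needed less engagement to reach the screen, so they look worse on the outcome even though the treatment did nothing.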
| Pattern | Diagram | What to do |
|---|---|---|
| Fork (confounding) | D ← Z → Y | Control for Z (or design to break the dependence) |
| Chain (mediation) | D → M → Y | Do not control for the mediator M if you want the total effect |
| Collider (selection) | D → C ← Y | Do not condition on the collider C |
Drawing the DAG for a proposed analysis takes minutes and forces every team member to commit to a specific causal story. If two team members draw different graphs, that disagreement is the analysis — not a side conversation to be resolved later.
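One lightweight way to make that commitment explicit is to write each person's graph down as a set of arrows and compare them. The sketch below is a toy under stated assumptions: it only recognizes the three three-node patterns discussed here, not the general backdoor criterion, and the function name `role` and the two example graphs are hypothetical.

```python
def role(edges, d, y, v):
    """Classify variable v relative to treatment d and outcome y.

    `edges` is a collection of (cause, effect) pairs.
    """
    e = set(edges)
    if (v, d) in e and (v, y) in e:
        return "fork: control for it"
    if (d, v) in e and (v, y) in e:
        return "chain: do not control for it if you want the total effect"
    if (d, v) in e and (y, v) in e:
        return "collider: do not condition on it"
    return "no three-node pattern with the treatment and outcome"


# Two team members draw the same variables with one arrow reversed.
analyst_a = {("Z", "D"), ("Z", "Y"), ("D", "Y")}   # Z causes the treatment
analyst_b = {("D", "Z"), ("Z", "Y"), ("D", "Y")}   # the treatment causes Z

print("A:", role(analyst_a, "D", "Y", "Z"))
print("B:", role(analyst_b, "D", "Y", "Z"))

# Arrows that appear in exactly one of the two graphs.
disagreements = analyst_a ^ analyst_b
print("Arrows in dispute:", sorted(disagreements))
```

A single reversed arrow flips the advice from "control for Z" to "do not control for Z", which is why the disagreement has to be settled before any regression is run.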
The Identification Memo
The identification memo is the one-page artifact every causal analysis should produce before the regression runs. It serves as a contract for the design. If your team cannot fill in all seven fields, the analysis is not ready to be a basis for action.
Identification memo
Worked example: milk pricing structure and whole-milk share
| Field | Entry |
|---|---|
| Treatment | Equal pricing across all milk fat levels vs. a small premium on whole milk. |
| Outcome | Store-level share of whole milk in total milk volume (percentage points). |
| Unit of analysis | Supermarket store. |
| Comparison group | Equal-price stores vs. whole-milk-premium stores within the same retail chain footprint. |
| Identifying assumption | The pricing structure was set at the regional corporate level for operational reasons, independent of local store-level demand for whole milk. |
| Empirical support | Pre-treatment balance on ZIP-code demographics; quiet placebo on diet-soda share (an outcome the milk pricing structure should not move). |
| Major threat | If the corporate pricing rule was itself responsive to regional demand patterns correlated with milk preferences, the assumption fails. |
The memo is short on purpose. Long memos hide soft thinking; short memos force commitment.
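For teams that track analyses in code, the memo can be sketched as a small record with a completeness check. This is one possible encoding, not a prescribed format: the class name `IdentificationMemo` and the helper `missing_fields` are hypothetical, while the seven field names and the example entries come from the worked example above.

```python
from dataclasses import dataclass, fields


@dataclass
class IdentificationMemo:
    treatment: str = ""
    outcome: str = ""
    unit_of_analysis: str = ""
    comparison_group: str = ""
    identifying_assumption: str = ""
    empirical_support: str = ""
    major_threat: str = ""


def missing_fields(memo):
    """Names of memo fields that are still blank."""
    return [f.name for f in fields(memo) if not getattr(memo, f.name).strip()]


milk_memo = IdentificationMemo(
    treatment="Equal pricing across all milk fat levels vs. a small premium on whole milk.",
    outcome="Store-level share of whole milk in total milk volume (percentage points).",
    unit_of_analysis="Supermarket store.",
    comparison_group="Equal-price stores vs. whole-milk-premium stores within the same chain footprint.",
    identifying_assumption="Pricing structure set at the regional corporate level, independent of local demand.",
    empirical_support="Pre-treatment balance on ZIP-code demographics; quiet placebo on diet-soda share.",
    major_threat="Corporate pricing rule may respond to regional demand correlated with milk preferences.",
)

draft = IdentificationMemo(treatment="New digital signage vs. none.")

print("milk memo missing:", missing_fields(milk_memo))
print("draft memo missing:", missing_fields(draft))
```

The check only confirms that every field has been written down. Whether the identifying assumption is believable is still a judgment the reviewer has to make.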
Data Case: Auditing the Milk Pricing Quasi-Experiment
The milk pricing study (Chapter 5) is the canonical worked example of a quasi-experimental identification audit. Two diagnostic checks support — or refute — the identifying assumption that the pricing structure was effectively as-if random across the two groups of stores:
- Balance. Are the two groups of stores comparable on observable pre-treatment characteristics (ZIP-code income, household size, baseline volume)? If the groups already differed before the treatment, the identifying assumption is in trouble.
- Placebo. Does the "treatment" move an outcome that, by domain logic, it should not move? Diet-soda share is the placebo here — a milk-pricing rule has no business affecting how much diet soda people buy. A loud placebo signals unobserved differences between the groups.
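The two checks above can be sketched with standard summary statistics. The example uses synthetic data generated on the spot, not the milk study's data, and the group sizes, income figures, and diet-soda shares are invented. The standardized mean difference is one common balance metric (an absolute value below roughly 0.1 is a frequently cited rule of thumb), and the placebo check is a difference in means compared with its standard error.

```python
import math
import random
from statistics import mean, variance


def standardized_mean_difference(a, b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((variance(a) + variance(b)) / 2)
    return (mean(a) - mean(b)) / pooled_sd


def difference_with_se(a, b):
    """Difference in means and its standard error (unequal variances)."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return diff, se


# Synthetic stand-in data (assumption): both groups drawn from the same
# distributions, which is what a clean design would look like.
rng = random.Random(4)
income_equal = [rng.gauss(60_000, 15_000) for _ in range(500)]
income_premium = [rng.gauss(60_000, 15_000) for _ in range(500)]
diet_soda_equal = [rng.gauss(0.30, 0.05) for _ in range(500)]
diet_soda_premium = [rng.gauss(0.30, 0.05) for _ in range(500)]

smd = standardized_mean_difference(income_equal, income_premium)
diff, se = difference_with_se(diet_soda_equal, diet_soda_premium)

print(f"balance: standardized income difference = {smd:.3f}")
print(f"placebo: diet-soda gap = {diff:.4f} (standard error {se:.4f})")
```

A large standardized difference or a placebo gap several standard errors from zero would be the "loud" result that puts the identifying assumption in doubt.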
Figure: The placebo is quiet; the milk outcome moves. 1,708 stores. Differences are equal-price stores minus whole-milk-expensive stores. Summary tiles: 627 equal-price stores; ZIP income gap $197; diet-income correlation 0.63.
What the figure does, and what it does not do, is worth being explicit about. Quiet diagnostics support the identifying assumption: the two store groups look comparable on the observables we can measure, and the placebo outcome moves no more than noise. The diagnostics do not rule out unobserved confounding: there could still be a story in which corporate pricing decisions correlated with some unmeasured local taste. The memo lists that as the major threat for a reason. The argument the memo makes is "given that the diagnostics behave as they would in a randomized trial, the eight-point lift is the most defensible reading of the data." That is identification doing the work, not the regression.