§5.4 Why Historical Data Is Hard
Historical data is the most attractive raw material in business analytics. It is already collected, virtually free, and often runs into millions of rows. It is also notoriously hard to interpret causally, for one fundamental reason: in historical data, the actions — prices, ads, promotions, layouts, offers — were not assigned randomly. They were chosen by managers responding to the very conditions we are trying to study.
That purposeful decision-making makes the treatment variable endogenous: it carries information about the demand environment in addition to its own causal effect. This article works through the four standard sources of endogeneity, derives the omitted-variable-bias formula that ties them together, and shows the visible signature of confounding in a real scanner dataset.
The Executive Question: What Did the Decision-Maker Know?
A regional manager observes that store managers cut pastry prices by half whenever afternoon traffic looks weak. A junior analyst regresses pastry sales on the discount flag and finds a negative correlation. The presented conclusion: "Discounts hurt sales."
The conclusion is wrong, and the reason is structural. The discount was triggered because expected sales were low. The historical data has an unobserved confounder built into how the action was chosen. Before treating a historical regression coefficient as a lever, every manager should ask:
What did the decision-maker know when they took this action, and how did that knowledge affect both the action and the outcome?
That question, applied seriously, surfaces four recurring patterns.
| Source | Mechanism | Business setting | Typical bias on the naive correlation |
|---|---|---|---|
| Omitted variable | Unobserved third factor moves both treatment and outcome. | Holiday season raises both ad spend and sales. | Inflates the apparent effect of advertising. |
| Reverse causality | The outcome causes the treatment, not the other way around. | Prices are cut when sales are already expected to be weak. | Makes price look harmful (or beneficial) for the wrong reason. |
| Simultaneity | Treatment and outcome are jointly determined in equilibrium. | Our prices and competitor prices both adjust to a shared local cost shock. | Blurs who is taking share from whom. |
| Selection | Units opt into treatment based on expected outcomes. | Heavy app users opt into the loyalty program first. | Overstates the program lift. |
The first two are easy to draw, and the picture is worth memorizing.
Figure: Confounding, a backdoor path from D to Y through Z. Comparing treated and untreated units without controlling for Z mixes the causal D → Y arrow with the spurious D ← Z → Y path.
Figure: Forward causation vs. reverse causation.
The Mathematics of Omitted Variable Bias
The four sources above all collapse, mathematically, into the same form. Suppose the true outcome model is
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon \qquad \text{(True model)}$$
where $X_1$ is the treatment of interest (price), $X_2$ is some other driver of the outcome (seasonal demand), and $\beta_1$ is the causal effect we want.
An analyst who does not observe (or forgets to control for) $X_2$ runs the short regression
$$Y = \alpha_0 + \alpha_1 X_1 + u \qquad \text{(Short regression)}$$
Standard OLS algebra gives a clean expression for what the short-regression coefficient is really estimating:
$$\alpha_1 = \beta_1 + \beta_2 \cdot \frac{\mathrm{Cov}(X_1, X_2)}{\mathrm{Var}(X_1)} \qquad \text{(Omitted variable bias)}$$
The bias term is the product of two pieces, and both signs must be right to predict the direction of the bias:
- How the omitted variable moves the outcome: the sign of $\beta_2$.
- How the omitted variable moves with the treatment: the sign of $\mathrm{Cov}(X_1, X_2)/\mathrm{Var}(X_1)$, which is exactly the slope of a regression of $X_2$ on $X_1$.
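For readers who want the step behind the formula, here is a sketch of the derivation. It uses the notation of the 2×2 table ($X_1$ is the treatment, $X_2$ the omitted driver, $\beta_2$ the effect of $X_2$ on $Y$). Two things are assumptions of this sketch: $\alpha_1$ is a label for the population slope from regressing $Y$ on $X_1$ alone, and the error $\varepsilon$ of the true model is taken to be uncorrelated with $X_1$.

$$
\begin{aligned}
\alpha_1 &= \frac{\mathrm{Cov}(X_1, Y)}{\mathrm{Var}(X_1)} \\
&= \frac{\mathrm{Cov}(X_1,\ \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon)}{\mathrm{Var}(X_1)} \\
&= \beta_1 + \beta_2 \cdot \frac{\mathrm{Cov}(X_1, X_2)}{\mathrm{Var}(X_1)} + \frac{\mathrm{Cov}(X_1, \varepsilon)}{\mathrm{Var}(X_1)}
\end{aligned}
$$

Under the stated assumption the last term is zero, which leaves the causal effect $\beta_1$ plus the bias term $\beta_2 \cdot \mathrm{Cov}(X_1, X_2)/\mathrm{Var}(X_1)$.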
Memorize the 2×2:
| Cov(X₁, X₂) | β₂ > 0 (omitted variable raises Y) | β₂ < 0 (omitted variable lowers Y) |
|---|---|---|
| + (treatment moves with omitted variable) | Bias > 0 — short coefficient too high | Bias < 0 — short coefficient too low |
| − (treatment moves against omitted variable) | Bias < 0 — short coefficient too low | Bias > 0 — short coefficient too high |
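The four cells of the table can be checked with a small simulation. The sketch below is illustrative only: the function name `short_minus_true`, the parameter `rho` (which sets the sign and strength of Cov(X₁, X₂)), and every numeric value are assumptions made up for the example, not estimates from any dataset.

```python
import numpy as np

def short_minus_true(beta2, rho, beta1=-2.0, n=200_000, seed=0):
    """Return (short-regression slope - beta1) on simulated data.

    beta2: effect of the omitted variable on Y.
    rho:   how strongly the treatment moves with the omitted variable.
    All values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x2 = rng.normal(size=n)                    # omitted driver
    x1 = rho * x2 + rng.normal(size=n)         # treatment moves with (or against) x2
    y = 1.0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    c = np.cov(x1, y)                          # 2x2 sample covariance matrix
    short_slope = c[0, 1] / c[0, 0]            # slope of Y on X1 alone
    return short_slope - beta1

for rho in (+0.8, -0.8):
    for beta2 in (+1.5, -1.5):
        bias = short_minus_true(beta2, rho)
        print(f"Cov sign {rho:+.1f}, beta2 {beta2:+.1f} -> bias {bias:+.3f}")
```

In this setup the theoretical bias is `beta2 * rho / (rho**2 + 1)`, so the printed signs should line up with the table: same-signed inputs give a positive bias, opposite-signed inputs give a negative one.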
Walking through a pricing example
Why does omitting season distort soup elasticity?
- True elasticity is negative — higher price means lower volume.
- The seasonal demand effect is positive ($\beta_2 > 0$): in winter, demand is higher at any price.
- Pricing behavior: retailers raise prices in winter when demand is high, so $\mathrm{Cov}(X_1, X_2) > 0$.
- Bias sign: positive × positive ⇒ bias is positive ⇒ short-regression coefficient is pulled less negative than the truth.
A manager who reads the biased coefficient sees customers as relatively insensitive to price, raises prices further, and watches profit collapse when the true, more strongly negative elasticity finally bites. The bug is not in the math; it is in pretending the omitted variable does not exist.
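The soup walkthrough can be reproduced on synthetic data. This is a minimal sketch, not the scanner data: the data-generating numbers (a true elasticity of -2.5, a winter demand shift of 0.8, a winter price bump of 0.20) and the helper `ols` are hypothetical choices for illustration. It also checks a fact from OLS algebra: in any sample, the short-regression slope equals the long-regression slope plus the season coefficient times the slope of season on price.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Hypothetical data-generating process (all numbers are made up).
winter = rng.integers(0, 2, size=n).astype(float)          # omitted variable: 1 in winter
log_price = 0.5 + 0.20 * winter + rng.normal(0, 0.10, n)   # retailers price higher in winter
true_elasticity = -2.5
log_volume = (3.0 + true_elasticity * log_price
              + 0.8 * winter + rng.normal(0, 0.3, n))      # winter lifts demand at any price

def ols(y, *regressors):
    """OLS coefficients [intercept, slope_1, ...] via least squares."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

short = ols(log_volume, log_price)           # omits season
long_ = ols(log_volume, log_price, winter)   # controls for season
delta = ols(winter, log_price)[1]            # slope of season on price

print(f"short-regression elasticity: {short[1]:.2f}")
print(f"long-regression elasticity:  {long_[1]:.2f}")
print(f"long + season coef * delta:  {long_[1] + long_[2] * delta:.2f}")
```

The first and third printed numbers coincide by construction, and both sit well above (less negative than) the long-regression elasticity, which is the distortion described in the bullets.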
Reverse Causality and Simultaneity in One Picture
Reverse causality is a close cousin of omitted-variable bias: in the short regression, the residual contains the demand shock that triggered the action. Now the treatment $X_1$ and the short-regression residual $u$ are correlated because the manager looked at the demand shock before choosing the action. The OLS estimate inherits whatever pattern of action-given-demand the manager was using.
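The pastry example from the start of the section makes this concrete. The sketch below assumes a hypothetical decision rule (discount whenever an unrecorded demand shock falls below a threshold) and a hypothetical true lift of 5 units. All numbers are invented for illustration. With a binary treatment, the difference in means is the same as the OLS slope on the discount flag.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000

demand_shock = rng.normal(0, 1, n)                # seen by the store manager, not the analyst
discount = (demand_shock < -0.5).astype(float)    # assumed rule: discount when traffic looks weak
true_lift = 5.0                                   # assumed causal effect of a discount
sales = 50 + 10 * demand_shock + true_lift * discount + rng.normal(0, 2, n)

# Naive comparison on the historical data.
naive = sales[discount == 1].mean() - sales[discount == 0].mean()

# Benchmark: same lift, but the discount is assigned by coin flip.
coin = rng.integers(0, 2, n).astype(float)
sales_rand = 50 + 10 * demand_shock + true_lift * coin + rng.normal(0, 2, n)
randomized = sales_rand[coin == 1].mean() - sales_rand[coin == 0].mean()

print(f"naive difference on historical data: {naive:.1f}")
print(f"difference under random assignment:  {randomized:.1f}")
```

The naive difference comes out negative even though every discount raises sales by construction, because discounted afternoons are exactly the ones with weak demand. The coin-flip version lands near the assumed lift.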
Simultaneity is the most general case: two variables move together because each one depends on the other in equilibrium. Competitor pricing is the canonical example — our price depends on theirs, theirs depends on ours, and a shared cost shock moves both. None of OLS, fixed effects, or naive controls can untangle this without an instrument or a design.
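A two-equation toy model shows why controls alone do not help. This sketch assumes a symmetric, hypothetical system in which each price responds to the rival's price with coefficient `b = 0.5` and both are hit by a shared cost shock. The system, the helper `ols`, and all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 50_000
b = 0.5   # assumed structural response of each price to the rival's price

cost = rng.normal(0, 1, n)   # shared local cost shock
e1 = rng.normal(0, 1, n)     # our idiosyncratic shock
e2 = rng.normal(0, 1, n)     # competitor's idiosyncratic shock

# Structural system:
#   ours   = 1 + b * theirs + cost + e1
#   theirs = 1 + b * ours   + cost + e2
# Solving the two equations jointly gives the equilibrium prices.
u1 = 1 + cost + e1
u2 = 1 + cost + e2
ours = (u1 + b * u2) / (1 - b**2)
theirs = (u2 + b * u1) / (1 - b**2)

def ols(y, *regressors):
    """OLS coefficients [intercept, slope_1, ...] via least squares."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

naive = ols(ours, theirs)[1]            # regress our price on theirs
with_cost = ols(ours, theirs, cost)[1]  # same, controlling for the cost shock

print(f"structural b: {b}, naive OLS: {naive:.2f}, controlling for cost: {with_cost:.2f}")
```

Both estimates overshoot the structural coefficient. Controlling for the cost shock removes the shared-shock part of the bias, but the feedback from our price into theirs remains, which is why an instrument or a design is needed.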
Concept check
Three questions spanning what randomization buys and why raw historical data resists causal reading.
1. Why does random assignment produce an unbiased estimate of the average treatment effect?
2. Which of the following is the cleanest definition of an endogenous treatment variable?
3. Suppose holiday season raises both ad spend ($X_1$) and sales ($Y$). You omit season ($X_2$) from the regression. The OVB formula predicts your estimated ad-spend coefficient is…
Data Case: Season as a Confounder in Soup Pricing
A short-run picture of the confounding pattern in real scanner data: Progresso soup volume against price, faceted by season. If season were unrelated to either price or volume, the two panels would lie on top of each other. Figure 3 shows that they do not.
Figure 3: Progresso soup volume against price, faceted by season. Winter panel: sample trend slope -2.82. Non-winter panel: sample trend slope -2.02.
Two features of Figure 3 are diagnostic. First, the winter cloud sits visibly above the non-winter cloud — soup demand is higher at any given price in winter. Second, the relationship between price and volume is steeper within season than the line that would be fit across the pooled data, because the pooled line gets dragged toward the high-price, high-volume corner where winter prices and winter demand coincide.
The naive pooled regression returns an elasticity closer to zero than the within-season slopes or the store fixed-effects model we will build in Chapter 6. The mechanical reason is exactly the omitted-variable formula above: omitting season, with $\beta_2 > 0$ and $\mathrm{Cov}(X_1, X_2) > 0$, pulls the price coefficient toward zero.
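The pooled-versus-within comparison can be sketched in a few lines. The data here are a synthetic stand-in, not the Progresso scanner data, and the helpers `slope` and `demean` are hypothetical names. The demeaning step subtracts each season's average from price and volume before fitting, which gives the same slope as a regression with a season dummy.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4_000

# Synthetic stand-in for the scanner data; NOT the Progresso numbers.
winter = rng.integers(0, 2, n)                        # 1 = winter, 0 = non-winter
price = 2.0 + 0.4 * winter + rng.normal(0, 0.2, n)    # higher prices in winter
volume = 10.0 - 2.5 * price + 1.5 * winter + rng.normal(0, 0.5, n)

def slope(x, y):
    """Slope of the least-squares line of y on x."""
    return np.polyfit(x, y, 1)[0]

pooled = slope(price, volume)
by_season = {s: slope(price[winter == s], volume[winter == s]) for s in (0, 1)}

def demean(v):
    """Subtract the season-specific mean from each observation."""
    means = np.array([v[winter == 0].mean(), v[winter == 1].mean()])
    return v - means[winter]

within = slope(demean(price), demean(volume))

print(f"pooled slope:          {pooled:.2f}")
print(f"non-winter slope:      {by_season[0]:.2f}")
print(f"winter slope:          {by_season[1]:.2f}")
print(f"within-season (demeaned) slope: {within:.2f}")
```

On this synthetic data the pooled slope is much flatter than either season's slope, and the demeaned fit recovers a slope close to the assumed -2.5.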
We will return to this case across Chapter 6 (regression) and Chapter 8 (pricing). For now, the picture is the lesson: where the demand environment moves with the action, the naive coefficient is biased in a predictable direction. Predicting the direction is the first defense; designing around the confounder is the second.