§5.4

Why Historical Data Is Hard

Historical data is the most attractive raw material in business analytics. It is already collected, virtually free, and often runs into millions of rows. It is also notoriously hard to interpret causally, for one fundamental reason: in historical data, the actions — prices, ads, promotions, layouts, offers — were not assigned randomly. They were chosen by managers responding to the very conditions we are trying to study.

That purposeful decision-making makes the treatment variable endogenous: it carries information about the demand environment in addition to its own causal effect. This article works through the four standard sources of endogeneity, derives the omitted-variable-bias formula that ties them together, and shows the visible signature of confounding in a real scanner dataset.

The Executive Question: What Did the Decision-Maker Know?

A regional manager observes that store managers cut pastry prices by half whenever afternoon traffic looks weak. A junior analyst regresses pastry sales on the discount flag and finds a negative correlation. The presented conclusion: "Discounts hurt sales."

The conclusion is wrong, and the reason is structural. The discount was triggered because expected sales were low. The historical data has an unobserved confounder built into how the action was chosen. Before treating a historical regression coefficient as a lever, every manager should ask:

What did the decision-maker know when they took this action, and how did that knowledge affect both the action and the outcome?

That question, applied seriously, surfaces four recurring patterns.

Table 1. Four recurring sources of endogeneity in historical business data. Each one is a different reason the observed treatment is correlated with the unobserved drivers of the outcome.

Source	Mechanism	Business setting	Typical bias on the naive correlation
Omitted variable	Unobserved third factor moves both treatment and outcome.	Holiday season raises both ad spend and sales.	Inflates the apparent effect of advertising.
Reverse causality	The outcome causes the treatment, not the other way around.	Prices are cut when sales are already expected to be weak.	Makes price cuts look ineffective or even harmful, because discounts coincide with weak demand.
Simultaneity	Treatment and outcome are jointly determined in equilibrium.	Our prices and competitor prices both adjust to a shared local cost shock.	Blurs who is taking share from whom.
Selection	Units opt into treatment based on expected outcomes.	Heavy app users opt into the loyalty program first.	Overstates the program lift.

The first two are easy to draw, and the picture is worth memorizing.

Confounding: a backdoor path from D to Y through Z

Comparing treated and untreated units without controlling for Z mixes the causal D → Y arrow with the spurious D ← Z → Y path.

Figure 1. The confounding pattern. A third factor Z drives both the treatment D and the outcome Y, creating a backdoor path D ← Z → Y that contaminates the simple D–Y correlation.

Forward causation vs. reverse causation

Figure 2. Forward vs. reverse causation. The left panel is the world we assume when we read a regression coefficient as a lever; the right panel is the world that often actually generated the data.

The Mathematics of Omitted Variable Bias

The four sources above all collapse, mathematically, into the same form. Suppose the true outcome model is

True model

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i, \qquad \mathbb{E}[\varepsilon_i \mid X_{1i}, X_{2i}] = 0$$

where $X_1$ is the treatment of interest (price), $X_2$ is some other driver of the outcome (seasonal demand), and $\beta_1$ is the causal effect we want.

An analyst who does not observe (or forgets to control for) $X_2$ runs the short regression

Short regression

$$Y_i = \alpha_0 + \alpha_1 X_{1i} + u_i$$

Standard OLS algebra gives a clean expression for what the short-regression coefficient is really estimating:

Omitted variable bias

$$\mathbb{E}\!\left[\widehat{\alpha}_1\right] \;=\; \underbrace{\beta_1}_{\text{true effect}} \;+\; \underbrace{\beta_2 \cdot \frac{\operatorname{Cov}(X_{1}, X_{2})}{\operatorname{Var}(X_{1})}}_{\text{omitted-variable bias}}$$
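One way to see where this expression comes from is to write the omitted variable as a linear projection on the treatment and substitute it into the true model. The sketch below assumes that projection is an adequate summary of how the two regressors move together; the symbols $\gamma_0$, $\gamma_1$, and $v_i$ are introduced here for illustration only.

```latex
\begin{aligned}
X_{2i} &= \gamma_0 + \gamma_1 X_{1i} + v_i,
  \qquad \gamma_1 = \frac{\operatorname{Cov}(X_1, X_2)}{\operatorname{Var}(X_1)},
  \qquad \operatorname{Cov}(X_{1i}, v_i) = 0 \\
Y_i &= \beta_0 + \beta_1 X_{1i} + \beta_2\left(\gamma_0 + \gamma_1 X_{1i} + v_i\right) + \varepsilon_i \\
    &= \underbrace{(\beta_0 + \beta_2 \gamma_0)}_{\alpha_0}
     + \underbrace{(\beta_1 + \beta_2 \gamma_1)}_{\alpha_1} X_{1i}
     + \underbrace{(\beta_2 v_i + \varepsilon_i)}_{u_i}
\end{aligned}
```

Because both $v_i$ and $\varepsilon_i$ are uncorrelated with $X_{1i}$, the short regression recovers $\alpha_1 = \beta_1 + \beta_2 \gamma_1$, which is the true effect plus the bias term.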

The bias term is the product of two pieces, and both signs must be right to predict the direction of the bias:

How the omitted variable moves the outcome: the sign of $\beta_2$.
How the omitted variable moves with the treatment: the sign of $\operatorname{Cov}(X_1, X_2)$, which is also the sign of the slope from regressing $X_2$ on $X_1$, since that slope equals $\operatorname{Cov}(X_1, X_2)/\operatorname{Var}(X_1)$.

Memorize the 2×2:

Table 2. The sign of omitted-variable bias depends on the sign of two simpler relationships. Get them both right and you can predict whether your naive coefficient overstates or understates the truth.

Cov(X₁, X₂)	β₂ > 0 (omitted variable raises Y)	β₂ < 0 (omitted variable lowers Y)
+ (treatment moves with omitted variable)	Bias > 0 — short coefficient too high	Bias < 0 — short coefficient too low
− (treatment moves against omitted variable)	Bias < 0 — short coefficient too low	Bias > 0 — short coefficient too high
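The four cells of Table 2 can be checked numerically. The following is a minimal simulation sketch, not an analysis of any real dataset: the true effect `BETA1 = -2.0`, the magnitudes `±1.5` and `0.5`, and the helper name `short_vs_formula` are all assumptions chosen for illustration. For each sign combination it compares the short-regression slope with the value the omitted-variable formula predicts.

```python
import numpy as np

BETA1 = -2.0  # assumed true effect of the treatment (illustrative value)

def short_vs_formula(beta2, cov_sign, n=50_000, seed=0):
    """Return (short-regression slope, slope predicted by the OVB formula)."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    # The omitted variable moves with (+1) or against (-1) the treatment.
    x2 = cov_sign * 0.5 * x1 + rng.normal(size=n)
    y = 1.0 + BETA1 * x1 + beta2 * x2 + rng.normal(size=n)

    # Short regression of y on x1 alone: slope = Cov(x1, y) / Var(x1).
    cov_x1_y = np.cov(x1, y)
    alpha1_hat = cov_x1_y[0, 1] / cov_x1_y[0, 0]

    # OVB formula: beta1 + beta2 * Cov(x1, x2) / Var(x1).
    cov_x1_x2 = np.cov(x1, x2)
    predicted = BETA1 + beta2 * cov_x1_x2[0, 1] / cov_x1_x2[0, 0]
    return alpha1_hat, predicted

bias = {}
for cov_sign in (+1, -1):
    for beta2 in (+1.5, -1.5):
        alpha1_hat, predicted = short_vs_formula(beta2, cov_sign)
        bias[(cov_sign, beta2)] = alpha1_hat - BETA1
        print(f"Cov sign {cov_sign:+d}, beta2 {beta2:+.1f}: "
              f"short slope {alpha1_hat:.3f}, formula {predicted:.3f}, "
              f"bias {alpha1_hat - BETA1:+.3f}")
```

In each cell the sign of the simulated bias should match the sign of the product of the two simpler relationships, and the short slope should sit close to the formula's prediction up to sampling noise.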

Walking through a pricing example

Why does omitting season distort soup elasticity?

True elasticity $\beta_1$ is negative — higher price means lower volume.
Seasonal demand $\beta_2$ is positive — in winter, demand is higher at any price.
Pricing behavior: retailers raise prices in winter when demand is high, so $\operatorname{Cov}(X_1, X_2) > 0$.
Bias sign: positive × positive ⇒ bias is positive ⇒ short-regression coefficient is pulled less negative than the truth.

A manager who reads the biased coefficient sees customers as relatively insensitive to price, raises prices further, and watches profit collapse when the genuinely negative elasticity finally bites. The bug is not in the math; it is in pretending the omitted variable does not exist.

Reverse Causality and Simultaneity in One Picture

Reverse causality is a close cousin of omitted-variable bias: in the short regression, the residual $u_i$ contains the demand shock that triggered the action. $X_{1i}$ and $u_i$ are therefore correlated, because the manager looked at the demand shock before choosing the action, and the OLS estimate inherits whatever action-given-demand rule the manager was using.
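The pastry-discount story from the start of the section can be reproduced in a small simulation. This is a hedged sketch under invented numbers: the assumed true lift of 5 units, the demand scale, and the rule "discount when the demand signal falls below -0.5" are illustrative assumptions, not estimates from any store. It compares a manager-chosen discount with a coin-flip discount applied at the same overall rate.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
true_lift = 5.0  # assumed: a discount adds 5 units of sales (illustrative)

# Demand signal the store manager sees before deciding (e.g. afternoon traffic).
demand_shock = rng.normal(size=n)
noise = rng.normal(size=n)

def naive_slope(discount):
    """OLS slope of sales on a 0/1 discount flag (equals the difference in means)."""
    sales = 50.0 + 10.0 * demand_shock + true_lift * discount + noise
    return sales[discount == 1].mean() - sales[discount == 0].mean()

# Manager rule: discount whenever traffic looks weak.
managed = (demand_shock < -0.5).astype(float)
# Benchmark: discounts assigned by coin flip at the same overall rate.
randomized = (rng.random(n) < managed.mean()).astype(float)

naive_managed = naive_slope(managed)
naive_random = naive_slope(randomized)
print(f"manager-chosen discounts: {naive_managed:+.2f}")
print(f"randomized discounts:     {naive_random:+.2f}")
```

Under these assumptions the manager-chosen comparison comes out negative even though every discount raises sales, while the randomized comparison lands near the assumed lift.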

Simultaneity is the most general case: two variables move together because each one depends on the other in equilibrium. Competitor pricing is the canonical example — our price depends on theirs, theirs depends on ours, and a shared cost shock moves both. None of OLS, fixed effects, or naive controls can untangle this without an instrument or a design.
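A stylized version of the competitor-pricing case shows why adding a control does not rescue the estimate. The sketch below assumes a symmetric two-firm system with a reaction coefficient `theta = 0.5` and unit-variance shocks; these values and variable names are hypothetical. Each firm's price responds to the rival's price, to a shared cost shock, and to its own idiosyncratic shock.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
theta = 0.5  # assumed true reaction of each price to the rival's price

cost = rng.normal(size=n)      # shared local cost shock
e_ours = rng.normal(size=n)    # our idiosyncratic pricing shock
e_theirs = rng.normal(size=n)  # their idiosyncratic pricing shock

# Structural system:
#   ours   = theta * theirs + cost + e_ours
#   theirs = theta * ours   + cost + e_theirs
# Solving the two equations jointly gives the equilibrium prices.
ours = cost / (1 - theta) + (e_ours + theta * e_theirs) / (1 - theta**2)
theirs = cost / (1 - theta) + (e_theirs + theta * e_ours) / (1 - theta**2)

# Naive OLS of our price on their price.
c = np.cov(theirs, ours)
naive = c[0, 1] / c[0, 0]

# OLS that also controls for the shared cost shock.
X = np.column_stack([np.ones(n), theirs, cost])
controlled = np.linalg.lstsq(X, ours, rcond=None)[0][1]

print(f"true reaction {theta:.2f}, naive {naive:.2f}, with cost control {controlled:.2f}")
```

Even with the cost shock observed and controlled for, the coefficient stays above the assumed reaction, because the rival's price still contains our own shock fed back through the equilibrium.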

Concept check

Three questions spanning what randomization buys and why raw historical data resists causal reading.

1. Why does random assignment produce an unbiased estimate of the average treatment effect?
2. Which of the following is the cleanest definition of an endogenous treatment variable?
3. Suppose holiday season raises both ad spend ($X_1$) and sales ($Y$). You omit season ($X_2$) from the regression. The OVB formula predicts your estimated ad-spend coefficient is…

Data Case: Season as a Confounder in Soup Pricing

A short-run picture of the confounding pattern in real scanner data: Progresso soup volume against price, faceted by season. If season were unrelated to either price or volume, the two panels would lie on top of each other. Figure 3 shows that they do not.

Figure 3 panels: Winter (sample trend slope -2.82); Non-winter (sample trend slope -2.02).

Figure 3. Log price and log volume for Progresso soup, faceted by season. Winter sits systematically higher (more soup sold at every price level), and the within-season slope is meaningfully different from the pooled slope. Pooling the two seasons without a seasonal control is a textbook omitted-variable trap.

Two features of Figure 3 are diagnostic. First, the winter cloud sits visibly above the non-winter cloud — soup demand is higher at any given price in winter. Second, the relationship between price and volume is steeper within season than the line that would be fit across the pooled data, because the pooled line gets dragged toward the high-price, high-volume corner where winter prices and winter demand coincide.

The naive pooled regression returns an elasticity near $-2.23$. The within-season slopes, and the store fixed-effects model we will build in Chapter 6, return something closer to $-3.21$. The mechanical reason is exactly the omitted-variable formula above: omitting season, with $\beta_{\text{season}} > 0$ and $\operatorname{Cov}(\text{price}, \text{season}) > 0$, adds a positive bias that pulls the price coefficient toward zero.
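The pooled-versus-seasonal comparison can be sketched on simulated data. This is not the Progresso scanner data: the assumed elasticity of -3.0, the winter share, the winter price premium, and the demand shift are invented for illustration, and `ols` is a hypothetical helper. The point is only to show how adding a winter indicator changes the price coefficient when winter raises both price and volume.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
true_elasticity = -3.0  # assumed for the simulation

winter = (rng.random(n) < 0.4).astype(float)
# Assumed pricing behavior: log price is higher in winter.
log_price = 0.5 + 0.3 * winter + rng.normal(scale=0.2, size=n)
# Assumed demand: winter shifts log volume up at any given price.
log_volume = (4.0 + true_elasticity * log_price + 1.0 * winter
              + rng.normal(scale=0.3, size=n))

def ols(y, *regressors):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

pooled = ols(log_volume, log_price)[1]              # season omitted
controlled = ols(log_volume, log_price, winter)[1]  # season included

print(f"pooled slope {pooled:.2f}, with winter control {controlled:.2f}")
```

Under these assumptions the pooled slope is noticeably closer to zero than the slope with the seasonal control, which sits near the assumed elasticity.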

We will return to this case across Chapter 6 (regression) and Chapter 8 (pricing). For now, the picture is the lesson: where the demand environment moves with the action, the naive coefficient is biased in a predictable direction. Predicting the direction is the first defense; designing around the confounder is the second.