§2.5
Data Quality and Measurement Bias
A junior analyst at Bean & Basket exports last week's transactions to investigate why average order value spiked. Eight rows in, he finds the answer: one transaction recorded a $99,999 purchase — almost certainly a typo where the cashier meant $9.99. He nearly deletes the row before stopping. Two rows further down is a transaction for negative $5.75. That one is not an error; it is a return. The reflex to "clean up the bad data" by dropping rows is, half the time, the reflex to delete the most informative parts of the file. Data quality is rarely a technical problem with a technical fix. It is almost always a business process problem visible as bad rows, and the analyst's job is to triage which rows to fix, which to investigate, and which to keep exactly as they are.
The executive question: is this a data problem, a business process problem, or both?
Most data quality problems fall into one of three categories. Some are definitional — the column means something different than the analyst assumed, or the same field is encoded inconsistently across systems. Some are operational — the upstream process produces wrong values, missing values, or duplicates; the data faithfully records what the process did. Some are interpretive — the row is technically correct but represents a business event (a return, a corporate order, a system test) that doesn't belong in the next sum without a flag. The first kind is fixed in code. The second is fixed in process. The third is fixed in the metric definition. Conflating them is how analytics teams end up writing patches that mask the real problem.
Figure 1 is a small sample of what real Bean & Basket transaction data looks like before triage. Eight rows; six of them have visible quality issues. Read the "Issue" column down the right-hand side — each row is a different kind of problem, and each kind needs a different response.
| Transaction | Date | Customer ID | Store | Amount | Issue |
|---|---|---|---|---|---|
| T01 | 2024-03-04 | C12 | A | $9.50 | — |
| T02 | 2024-03-04 | — null — | A | $6.25 | Missing customer_id |
| T03 | 2024-03-05 | C12 | A | $5.50 | — |
| T04 | 2024-13-01 | C77 | A | $8.75 | Impossible date (month 13) |
| T05 | 2024-03-06 | C22 | B | $-5.75 | Negative amount — return? |
| T06 | 2024-03-07 | C99 | — null — | $3.00 | Missing store; unmatched customer |
| T07 | 2024-03-08 | C12 | A | $99999.00 | Impossible amount — typo? |
| T08 | 2024-03-09 | C45 | a | $5.50 | Inconsistent store_id case (a vs A) |
Look at row T05 in Figure 1, the negative amount. The naive cleaning rule — "drop rows where amount < 0" — would erase a return event. The naive correction rule — "set the amount to its absolute value" — would silently double the revenue at that store for the day, since the return is now indistinguishable from a sale. The right answer is to keep the row, add a transaction_type column with values like sale and return, and let every downstream metric explicitly choose whether to include returns. The work is to make the business event visible in the schema, not to make the row go away.
The same pattern repeats throughout the table. T04's impossible date (month 13) is an ingestion bug; reject it at the boundary and log the source. T07's $99,999 amount is almost certainly a misplaced decimal; quarantine the row and ask the store. T08's lowercase a is a categorical encoding issue; normalize at ingestion and add a uniqueness check on the store dimension. Each row in Figure 1 wants a different response, and Figure 2 organizes those responses into a standing triage list.
| Check | Severity | Business risk | Action |
|---|---|---|---|
| Missing customer_id | High | Can't connect transactions to customer history; loyalty, churn, and CLV all blind to these. | Investigate root cause (guest checkout? POS bug?). Flag the rows; do not drop. |
| Impossible date | High | Will break time-based joins and dashboards silently — rows fall out of every window. | Reject at ingestion. Add validation rule: month ∈ [01–12], day ∈ [01–31]. |
| Negative revenue | Medium | Probably a return — meaningful business signal, NOT a bug. Auto-deleting hides churn behavior. | Add a return-flag column. Keep the row; let analyses filter when needed. |
| Impossible amount (e.g. $99,999) | High | One outlier can dominate a mean. A single $99,999 row inflates AOV across the entire week. | Reject or quarantine. If real (corporate order), tag and analyze separately. |
| Inconsistent category encoding | Low | Groups by store_id will treat 'A' and 'a' as two stores; rankings split into halves. | Normalize at ingestion (uppercase, trim). Add a uniqueness check on the dimension table. |
| Unknown foreign key | Medium | Anti-join surfaces it — see Chapter 5. Often a sync gap with the loyalty system. | Daily anti-join report; investigate any sustained increase in unmatched rows. |
| Duplicate row | Medium | Double-counts revenue, conversions, or any aggregated metric. | Look for a natural-key uniqueness check (transaction_id should be unique). Investigate, do not auto-delete. |
| Survivorship bias | High | If churned customers are dropped from the master file, every retention metric is overstated. | Keep a soft-delete flag, never hard-delete. Customer master is append-only. |
| Leaking future variables | High | A churn model trained on a 'days_since_last_purchase' field computed today learns nothing about future churn — it learns the definition. | Snapshot features as of the decision date, not as of today (see Chapter 27). |
Figure 2 is meant to be a living document, not a one-time audit. The nine checks here are the most common at a retail business of Bean & Basket's shape; a financial services firm or a software company would have its own variants. What stays constant is the structure: every check has a name, a severity that reflects business impact (not row count), a description of what the bad rows actually break, and a response. A team that runs Figure 2 weekly and reviews any failed check in the Monday meeting will catch every problem in Figure 1 within twenty-four hours of it arriving in the data.
Concept check
These three questions span a derived-metric trap (averaging ratios versus computing the ratio of totals), a data-quality decision (how silent imputation of missing or negative values rewrites a conclusion), and a definitional validation check (why a shrinking denominator can manufacture a trend that isn't real).
- 1.Bean and Basket has three stores. Store A earned 50,000 dollars on 1,000 square feet, Store B earned 60,000 dollars on 2,000 square feet, and Store C earned 40,000 dollars on 1,000 square feet. An analyst wants a single chain-wide "revenue per square foot" figure for the quarterly review. She computes each store's revenue per square foot, then averages those three numbers, and reports about 40 dollars per square foot. The CFO instead divides total revenue by total square footage and gets about 37.50 dollars per square foot. Who is right, and why does it matter?
- 2.An analyst is preparing the weekly revenue-by-store report. Some rows have a missing (null) amount because of a point-of-sale sync failure, and a handful of rows have negative amounts that are genuine returns. To get the dashboard to run without errors, a colleague suggests one cleanup pass: replace every null amount with zero, and take the absolute value of every negative amount so all the numbers are positive and the totals add up. What is the most important objection?
- 3.A retention dashboard shows the 12-month repeat-purchase rate climbing steadily for three straight quarters, and the team is ready to credit a new loyalty program. Before celebrating, you learn that the data engineering team recently started hard-deleting customer records 12 months after a customer's last purchase, to keep the master table small. What single check would most directly tell you whether the rising retention rate is real?