§2.5

Data Quality and Measurement Bias

A junior analyst at Bean & Basket exports last week's transactions to investigate why average order value spiked. Eight rows in, he finds the answer: one transaction recorded a $99,999 purchase — almost certainly a typo where the cashier meant $9.99. He nearly deletes the row before stopping. Two rows further down is a transaction for negative $5.75. That one is not an error; it is a return. The reflex to "clean up the bad data" by dropping rows is, half the time, the reflex to delete the most informative parts of the file. Data quality is rarely a technical problem with a technical fix. It is almost always a business process problem visible as bad rows, and the analyst's job is to triage which rows to fix, which to investigate, and which to keep exactly as they are.

The executive question: is this a data problem, a business process problem, or both?

Three kinds of quality problem

Most data quality problems fall into one of three categories. Some are definitional — the column means something different than the analyst assumed, or the same field is encoded inconsistently across systems. Some are operational — the upstream process produces wrong values, missing values, or duplicates; the data faithfully records what the process did. Some are interpretive — the row is technically correct but represents a business event (a return, a corporate order, a system test) that doesn't belong in the next sum without a flag. The first kind is fixed in code. The second is fixed in process. The third is fixed in the metric definition. Conflating them is how analytics teams end up writing patches that mask the real problem.

Figure 1 is a small sample of what real Bean & Basket transaction data looks like before triage. Eight rows; six of them have visible quality issues. Read the "Issue" column down the right-hand side — each row is a different kind of problem, and each kind needs a different response.

Transaction	Date	Customer ID	Store	Amount	Issue
T01	2024-03-04	C12	A	$9.50	—
T02	2024-03-04	— null —	A	$6.25	Missing customer_id
T03	2024-03-05	C12	A	$5.50	—
T04	2024-13-01	C77	A	$8.75	Impossible date (month 13)
T05	2024-03-06	C22	B	$-5.75	Negative amount — return?
T06	2024-03-07	C99	— null —	$3.00	Missing store; unmatched customer
T07	2024-03-08	C12	A	$99999.00	Impossible amount — typo?
T08	2024-03-09	C45	a	$5.50	Inconsistent store_id case (a vs A)

Figure 1. Eight transactions, six data-quality issues. The 'Issue' column on the right is what an analyst would flag on a first read. Notice that the responses are not symmetric: the negative amount is probably a return (keep, flag), the impossible amount is probably a typo (reject), the missing store is probably a sync gap (investigate). 'Cleaning' all six with the same rule would erase the differences.

Triage, don't bulk-clean

Look at row T05 in Figure 1, the negative amount. The naive cleaning rule — "drop rows where amount < 0" — would erase a return event. The naive correction rule — "set the amount to its absolute value" — would silently double the revenue at that store for the day, since the return is now indistinguishable from a sale. The right answer is to keep the row, add a transaction_type column with values like sale and return, and let every downstream metric explicitly choose whether to include returns. The work is to make the business event visible in the schema, not to make the row go away.

The same pattern repeats throughout the table. T04's impossible date (month 13) is an ingestion bug; reject it at the boundary and log the source. T07's $99,999 amount is almost certainly a misplaced decimal; quarantine the row and ask the store. T08's lowercase a is a categorical encoding issue; normalize at ingestion and add a uniqueness check on the store dimension. Each row in Figure 1 wants a different response, and Figure 2 organizes those responses into a standing triage list.

Figure 2. The standing data-quality triage list — nine checks every Bean & Basket pipeline should run continuously. Severity is about business risk, not about how many rows fail the check. A single impossible amount can do more damage than a thousand minor encoding inconsistencies.

Check	Severity	Business risk	Action
Missing customer_id	High	Can't connect transactions to customer history; loyalty, churn, and CLV all blind to these.	Investigate root cause (guest checkout? POS bug?). Flag the rows; do not drop.
Impossible date	High	Will break time-based joins and dashboards silently — rows fall out of every window.	Reject at ingestion. Add validation rule: month ∈ [01–12], day ∈ [01–31].
Negative revenue	Medium	Probably a return — meaningful business signal, NOT a bug. Auto-deleting hides churn behavior.	Add a return-flag column. Keep the row; let analyses filter when needed.
Impossible amount (e.g. $99,999)	High	One outlier can dominate a mean. A single $99,999 row inflates AOV across the entire week.	Reject or quarantine. If real (corporate order), tag and analyze separately.
Inconsistent category encoding	Low	Groups by store_id will treat 'A' and 'a' as two stores; rankings split into halves.	Normalize at ingestion (uppercase, trim). Add a uniqueness check on the dimension table.
Unknown foreign key	Medium	Anti-join surfaces it — see Chapter 5. Often a sync gap with the loyalty system.	Daily anti-join report; investigate any sustained increase in unmatched rows.
Duplicate row	Medium	Double-counts revenue, conversions, or any aggregated metric.	Look for a natural-key uniqueness check (transaction_id should be unique). Investigate, do not auto-delete.
Survivorship bias	High	If churned customers are dropped from the master file, every retention metric is overstated.	Keep a soft-delete flag, never hard-delete. Customer master is append-only.
Leaking future variables	High	A churn model trained on a 'days_since_last_purchase' field computed today learns nothing about future churn — it learns the definition.	Snapshot features as of the decision date, not as of today (see Chapter 27).

A living checklist

Figure 2 is meant to be a living document, not a one-time audit. The nine checks here are the most common at a retail business of Bean & Basket's shape; a financial services firm or a software company would have its own variants. What stays constant is the structure: every check has a name, a severity that reflects business impact (not row count), a description of what the bad rows actually break, and a response. A team that runs Figure 2 weekly and reviews any failed check in the Monday meeting will catch every problem in Figure 1 within twenty-four hours of it arriving in the data.

For the manager

Before asking why is this number weird?, ask what does the bad row actually represent? Data quality is the layer where the technical artifacts of the data pipeline collide with the messy realities of the business — returns, corporate orders, system migrations, deprecated products, churned customers, integration failures. The analyst's tools — flagging, anti-joins, severity-tiered triage — are means to an end; the end is to keep the business meaning of each row legible. The cheapest analytics governance a team can run is the weekly Monday triage on Figure 2: any check that flips from green to red gets a name, an owner, and a date. A team that does this stops producing dashboards whose numbers move for unexplained reasons. A team that doesn't will spend most of its third-year analyst-time explaining why the numbers don't agree across reports.