§2.2
Joining Data
The regional manager at Bean & Basket needs to know how the new loyalty program is doing. She has a transactions table with six purchases from last week. The loyalty tier lives in a separate customers table. The campaign sends live in a third table. The natural instinct — and it is the right one — is to join. But the moment she runs the first join two surprises arrive. One transaction has no matching customer; somebody walked into a store, used a customer ID that the master file has never heard of. And one of the matched customers has been sent three different campaigns this month, which means that joining transactions to campaigns naively will multiply her revenue figures by three for that customer. Neither surprise is a SQL error. Both are business signals dressed as table arithmetic.
The executive question: what changes when we attach customer, product, and campaign context to a transaction?
A raw transaction is six numbers and an ID — a date, a customer reference, a store, an amount. By itself it tells you almost nothing about the business. The loyalty tier of the buyer, the margin of the product, the region of the store, the campaign exposure of the customer — none of that lives in the transactions table. Each lives in its own master table, and the operation that connects them is the join. A join is how business context enters a dataset. It is also, by the same mechanism, how duplicate explosions and missing matches enter the dashboard.
Four join flavors cover almost every business case. A left join keeps every row of the left table and attaches matching rows from the right, leaving blanks where nothing matched — this is the default for "enrich my transactions with whatever customer info we have." An inner join keeps only rows where both sides matched — useful when missing matches mean the row should be excluded entirely. An anti-join keeps only rows where no match was found — the right way to find data quality problems ("which transactions came from customer IDs we have never heard of?"). A full outer join keeps everything from both sides, matched or not. The four are not interchangeable. Choosing the wrong one is the most common silent error in business analytics.
The deeper rule, the one we already wrote down in Chapter 1, is about grain. A join multiplies rows. If the left table has six rows and the right table has, for each left-side key, exactly one matching row, the result has six rows. If the right table has, on average, three rows per left-side key, the result has eighteen. The join engine does this without complaint. It is the analyst's responsibility to know what the grain of each input is, and what the grain of the output will be. Figure 1 makes the three most common outcomes concrete on the same Bean & Basket data.
Every transaction is kept. T06 has no matching customer, so the customer columns are blank — a visible signal that something is wrong with the data, not a row to silently drop.
| Transaction | Customer ID | Name | Loyalty | Amount |
|---|---|---|---|---|
| T01 | C12 | Maria Reyes | Gold | $9.50 |
| T02 | C45 | Jin Park | Silver | $6.25 |
| T03 | C12 | Maria Reyes | Gold | $5.50 |
| T04 | C77 | Sara Kim | Gold | $8.75 |
| T05 | C22 | Alex Chen | None | $5.75 |
| T06 | C99 | — no match — | — | $3.00 |
Read Figure 1 from left to right. The first tab is the join most managers actually want: every transaction kept, customer attributes attached when available, blanks made visible when not. The second tab is the join most managers should run next: the anti-join is the cheapest data-quality check in analytics. A single row in this table is one entry on the data-quality triage list — a transaction whose customer ID is unknown to the master file, which usually means a deleted account, a typo, or a sync gap with the loyalty system. The third tab is the join most managers should never run unaggregated. The inflated revenue total is not a calculation error; the SQL is correct. The error is conceptual: the joined table no longer has one row per transaction, and any column summed across that table is being double-counted by however many campaign rows attach to each customer.
The fix on the third tab is to aggregate before joining. Roll the campaigns table down to one row per customer first — for example, a single flag was_exposed_to_any_campaign — and then join. The resulting table has six rows, the revenue total is right, and the campaign-exposure information is preserved. The general rule is to match grains before joining: if you want a result with one row per transaction, every table you join in must contribute exactly one row per transaction's join key.
A subtler version, worth flagging once: joining future information into a historical row. If the customer master file has been updated to reflect today's loyalty tier, and you join it to a transaction from six months ago, the resulting row attributes the current loyalty tier to a past purchase. For descriptive reports this is usually harmless. For causal analysis — Chapter 12 onward — it is fatal: the model will learn that "Gold tier" predicts purchase behavior, but the Gold tier was assigned because of subsequent purchase behavior. We will come back to this trap several times; the name for it is leakage, and the defense is to use the customer attributes as they were at the time of the transaction, not as they are now.