§1.1
What Is a Dataset?
A regional manager at Bean & Basket Coffee opens her laptop and finds two views of last week on the shared drive. The first, transactions.csv, lists every purchase rung up at a register: twelve rows for the week across three stores. The second, store_week.csv, takes that same week and rolls it up to three rows, one per store. Both describe identical business activity. Both look like reasonable starting points for an analysis. But they will not answer the same questions, and a manager who blurs that distinction will make different decisions than one who keeps it sharp.
The executive question: what does one row mean?
A dataset is a business story told in rows and columns. The columns name the things you can measure — date, store, customer, amount. The rows are the units that get measured. Before you reach for a chart, a regression, a model, or even an average, the first thing to know is what those rows actually represent. That is the dataset's grain: the level of detail at which one row is a complete observation.
When the grain is one purchase, each row is a moment in a customer's day — a person walked into a store on a particular date and spent a particular amount on particular items. When the grain is one store per week, each row is a store's commercial life summarized down to a handful of numbers. The same week of business activity supports both grains. They are not contradictions; they are different lenses. The trap is that they look interchangeable, and they are not.
Almost every analytical confusion a manager will encounter — incompatible joins, double-counted revenue, averages that lie, "the dashboard says X but the report says Y" — starts here, at the grain. Once you have read the first row, you have implicitly chosen the set of questions you can ask without trouble. Figure 1 makes the trade-off concrete by showing the same Bean & Basket week at two grains side by side.
One row = one purchase. Customer C12 visited Downtown twice; customer C66 visited Suburban twice. The customer ID makes those repeat visits visible.
| Transaction | Date | Store | Customer | Items | Amount |
|---|---|---|---|---|---|
| T001 | 2024-03-04 | A — Downtown | C12 | Latte + Croissant | $9.50 |
| T002 | 2024-03-04 | A — Downtown | C45 | Drip + Muffin | $6.25 |
| T003 | 2024-03-05 | A — Downtown | C12 | Latte | $5.50 |
| T004 | 2024-03-06 | A — Downtown | C77 | Cold brew + Bagel | $8.75 |
| T005 | 2024-03-08 | A — Downtown | C45 | Cappuccino + Croissant | $9.00 |
| T006 | 2024-03-04 | B — Campus | C22 | Drip coffee | $3.25 |
| T007 | 2024-03-05 | B — Campus | C99 | Latte | $5.50 |
| T008 | 2024-03-06 | B — Campus | C22 | Drip + Pastry | $5.75 |
| T009 | 2024-03-07 | B — Campus | C33 | Mocha | $6.00 |
| T010 | 2024-03-04 | C — Suburban | C66 | Drip coffee | $3.00 |
| T011 | 2024-03-06 | C — Suburban | C88 | Drip + Cookie | $4.50 |
| T012 | 2024-03-09 | C — Suburban | C66 | Cappuccino | $5.50 |
The per-transaction view tells you, immediately and concretely, that customer C12 walked into the Downtown store on two different days. You can see what each customer bought and what each one spent. You can also see absence: the Suburban store saw only two distinct customers that week, C66 and C88, and only C66 came back. The per-store-week view shows none of that. It cannot: the customer column has been aggregated away. What you get instead is a clean three-row ranking. Summing the amounts above gives $39.00 for Downtown, $20.50 for Campus, and $13.00 for Suburban, so Downtown out-earned Campus by $18.50 and Campus out-earned Suburban by $7.50. The ranking is legible in a way it never is in the per-transaction table, because there each store's total is spread across several rows and has to be added up before the stores can be compared.
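The roll-up from purchase grain to store-week grain can be sketched in a few lines of Python. This is a minimal illustration, not the way store_week.csv was actually produced: the rows are copied from the per-transaction table, while the function name `roll_up_by_store` and the output fields `transactions` and `revenue` are assumptions made for the example.

```python
from collections import defaultdict
from decimal import Decimal

# (transaction, store, customer, amount) copied from the per-transaction table.
TRANSACTIONS = [
    ("T001", "Downtown", "C12", "9.50"),
    ("T002", "Downtown", "C45", "6.25"),
    ("T003", "Downtown", "C12", "5.50"),
    ("T004", "Downtown", "C77", "8.75"),
    ("T005", "Downtown", "C45", "9.00"),
    ("T006", "Campus", "C22", "3.25"),
    ("T007", "Campus", "C99", "5.50"),
    ("T008", "Campus", "C22", "5.75"),
    ("T009", "Campus", "C33", "6.00"),
    ("T010", "Suburban", "C66", "3.00"),
    ("T011", "Suburban", "C88", "4.50"),
    ("T012", "Suburban", "C66", "5.50"),
]


def roll_up_by_store(transactions):
    """Collapse purchase-grain rows into one summary row per store.

    The field names 'transactions' and 'revenue' are hypothetical; the
    real store_week.csv may use different columns.
    """
    summary = defaultdict(lambda: {"transactions": 0, "revenue": Decimal("0")})
    for _txn, store, _customer, amount in transactions:
        summary[store]["transactions"] += 1
        # Decimal keeps currency arithmetic exact (no float rounding).
        summary[store]["revenue"] += Decimal(amount)
    return dict(summary)


store_week = roll_up_by_store(TRANSACTIONS)
for store, row in store_week.items():
    print(f"{store:<9} {row['transactions']:>2}  ${row['revenue']}")
# Downtown   5  $39.00
# Campus     4  $20.50
# Suburban   3  $13.00
```

Note what the loop throws away: the customer and transaction identifiers never reach the output, which is why the three-row view cannot be turned back into the twelve-row one.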
Each grain is right for a different question. Per-transaction is right for "who buys what, how often, in what baskets?" — questions where the unit of analysis is a customer or a purchase. Per-store-week is right for "which store is growing, which is flat, which is shrinking?" — questions where the unit of analysis is a store across time. Trying to answer the first question from the per-store-week view is impossible. Trying to answer the second from the per-transaction view is possible but laborious, and easy to do wrong if you forget to group correctly.
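To illustrate a question that only the purchase grain can answer, here is a short Python sketch that counts visits per customer and distinct customers per store. The (store, customer) pairs are taken from the per-transaction table; the variable names are illustrative assumptions, not part of any Bean & Basket system.

```python
from collections import Counter, defaultdict

# (store, customer) for each of the twelve purchases, in table order.
VISITS = [
    ("Downtown", "C12"), ("Downtown", "C45"), ("Downtown", "C12"),
    ("Downtown", "C77"), ("Downtown", "C45"),
    ("Campus", "C22"), ("Campus", "C99"), ("Campus", "C22"), ("Campus", "C33"),
    ("Suburban", "C66"), ("Suburban", "C88"), ("Suburban", "C66"),
]

# How often did each customer buy? One row = one purchase, so counting
# rows per customer counts visits.
visits_per_customer = Counter(customer for _store, customer in VISITS)
repeat_customers = sorted(c for c, n in visits_per_customer.items() if n > 1)
print(repeat_customers)  # ['C12', 'C22', 'C45', 'C66']

# How many different people did each store serve? Grouping by store first
# is the step that is easy to forget.
customers_by_store = defaultdict(set)
for store, customer in VISITS:
    customers_by_store[store].add(customer)

distinct_customers = {s: len(c) for s, c in customers_by_store.items()}
print(distinct_customers)  # {'Downtown': 3, 'Campus': 3, 'Suburban': 2}
```

Neither result can be computed from a table that holds only one row per store, because the customer identifier no longer exists at that grain.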
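The other classic mistake is joining tables of different grains without aggregating first. Attach each store's weekly total to its transaction rows, for example, and that total repeats once per purchase, so summing the joined column counts the same revenue several times over.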
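The double-counting that a mixed-grain join produces can be reproduced with the week's own numbers. The sketch below is a hedged illustration in plain Python: the amounts come from the per-transaction table, the store totals are derived from them, and names such as `joined` and `inflated_revenue` are invented for the example.

```python
from collections import defaultdict
from decimal import Decimal

# (store, amount) for each purchase, copied from the per-transaction table.
TRANSACTIONS = [
    ("Downtown", "9.50"), ("Downtown", "6.25"), ("Downtown", "5.50"),
    ("Downtown", "8.75"), ("Downtown", "9.00"),
    ("Campus", "3.25"), ("Campus", "5.50"), ("Campus", "5.75"), ("Campus", "6.00"),
    ("Suburban", "3.00"), ("Suburban", "4.50"), ("Suburban", "5.50"),
]

# Store-week grain: one revenue total per store.
store_week_revenue = defaultdict(Decimal)
for store, amount in TRANSACTIONS:
    store_week_revenue[store] += Decimal(amount)

# Naive join: copy the store's weekly total onto every one of its purchases.
# The result has twelve rows, so each weekly total now appears several times.
joined = [
    (store, Decimal(amount), store_week_revenue[store])
    for store, amount in TRANSACTIONS
]

true_revenue = sum(amount for _store, amount, _total in joined)
inflated_revenue = sum(total for _store, _amount, total in joined)
print(true_revenue)      # 72.50
print(inflated_revenue)  # 316.00 (each store total counted once per purchase)

# Safe alternative: sum the weekly totals at their own grain, one per store.
safe_revenue = sum(store_week_revenue.values())
print(safe_revenue)      # 72.50
```

The inflated figure is not random: it is each store's total multiplied by its number of purchases (5, 4, and 3), which is the signature of a join that fanned out a coarse-grain value across fine-grain rows.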