§1.1

What Is a Dataset?

Part I — where on the decision ladder we are.

A regional manager at Bean & Basket Coffee opens her laptop and finds two views of last week on the shared drive. The first, transactions.csv, lists every purchase that went through a register: twelve rows for the week across three stores. The second, store_week.csv, takes that same week and rolls it up to three rows, one per store. Both describe identical business activity. Both look like reasonable starting points for an analysis. But they will not answer the same questions, and a manager who blurs that distinction will make different decisions than one who keeps it sharp.

The executive question: what does one row mean?

A dataset is a business story told in rows and columns. The columns name the things you can measure — date, store, customer, amount. The rows are the units that get measured. Before you reach for a chart, a regression, a model, or even an average, the first thing to know is what those rows actually represent. That is the dataset's grain: the level of detail at which one row is a complete observation.

When the grain is one purchase, each row is a moment in a customer's day — a person walked into a store on a particular date and spent a particular amount on particular items. When the grain is one store per week, each row is a store's commercial life summarized down to a handful of numbers. The same week of business activity supports both grains. They are not contradictions; they are different lenses. The trap is that they look interchangeable, and they are not.

Almost every analytical confusion a manager will encounter — incompatible joins, double-counted revenue, averages that lie, "the dashboard says X but the report says Y" — starts here, at the grain. Once you have read the first row, you have implicitly chosen the set of questions you can ask without trouble. Figure 1 makes the trade-off concrete by showing the same Bean & Basket week at two grains side by side.

One row = one purchase. Customer C12 visited Downtown twice; customer C66 visited Suburban twice. The customer ID makes those repeat visits visible.

Transaction	Date	Store	Customer	Items	Amount
T001	2024-03-04	A — Downtown	C12	Latte + Croissant	$9.50
T002	2024-03-04	A — Downtown	C45	Drip + Muffin	$6.25
T003	2024-03-05	A — Downtown	C12	Latte	$5.50
T004	2024-03-06	A — Downtown	C77	Cold brew + Bagel	$8.75
T005	2024-03-08	A — Downtown	C45	Cappuccino + Croissant	$9.00
T006	2024-03-04	B — Campus	C22	Drip coffee	$3.25
T007	2024-03-05	B — Campus	C99	Latte	$5.50
T008	2024-03-06	B — Campus	C22	Drip + Pastry	$5.75
T009	2024-03-07	B — Campus	C33	Mocha	$6.00
T010	2024-03-04	C — Suburban	C66	Drip coffee	$3.00
T011	2024-03-06	C — Suburban	C88	Drip + Cookie	$4.50
T012	2024-03-09	C — Suburban	C66	Cappuccino	$5.50

Figure 1. The same week of Bean & Basket sales at two grains. The per-transaction view lets you trace what individual customers bought; the per-store-week view lets you compare stores at a glance. Neither answers the other's questions cleanly.

The per-transaction view tells you, immediately and concretely, that customer C12 walked into the Downtown store on two different days. You can see what each customer bought and what each one spent. You can also see absence: the Suburban store had only two distinct customers that week. The per-store-week view shows none of that. It cannot: the customer column has been aggregated away. What you get instead is a clean three-row ranking. Downtown out-earned Campus by $18.50 ($39.00 against $20.50), and Campus out-earned Suburban by $7.50 ($20.50 against $13.00). The ranking is legible in a way it never is in the per-transaction table, because there each store's total is spread across several rows and has to be added up before stores can be compared.
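The roll-up from one grain to the other is mechanical, which is part of why the two views are easy to confuse. Here is a minimal sketch in Python, using the amounts shown in Figure 1 with store labels shortened. The function name and the output fields (`purchases`, `revenue`) are illustrative choices, since the actual columns of store_week.csv are not listed here.

```python
from collections import defaultdict
from decimal import Decimal

# Rows copied from Figure 1: (transaction, date, store, customer, amount).
# The Items column is omitted because the roll-up does not use it.
TRANSACTIONS = [
    ("T001", "2024-03-04", "Downtown", "C12", Decimal("9.50")),
    ("T002", "2024-03-04", "Downtown", "C45", Decimal("6.25")),
    ("T003", "2024-03-05", "Downtown", "C12", Decimal("5.50")),
    ("T004", "2024-03-06", "Downtown", "C77", Decimal("8.75")),
    ("T005", "2024-03-08", "Downtown", "C45", Decimal("9.00")),
    ("T006", "2024-03-04", "Campus", "C22", Decimal("3.25")),
    ("T007", "2024-03-05", "Campus", "C99", Decimal("5.50")),
    ("T008", "2024-03-06", "Campus", "C22", Decimal("5.75")),
    ("T009", "2024-03-07", "Campus", "C33", Decimal("6.00")),
    ("T010", "2024-03-04", "Suburban", "C66", Decimal("3.00")),
    ("T011", "2024-03-06", "Suburban", "C88", Decimal("4.50")),
    ("T012", "2024-03-09", "Suburban", "C66", Decimal("5.50")),
]


def roll_up_to_store_week(transactions):
    """Aggregate one-row-per-purchase data to one row per store."""
    summary = defaultdict(lambda: {"purchases": 0, "revenue": Decimal("0")})
    for _txn, _date, store, _customer, amount in transactions:
        summary[store]["purchases"] += 1
        summary[store]["revenue"] += amount
    return dict(summary)


store_week = roll_up_to_store_week(TRANSACTIONS)
for store, row in store_week.items():
    print(store, row["purchases"], row["revenue"])
```

Note what the loop discards: the customer ID is read and then dropped, so nothing in the three-row result can tell you who came back.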

Each grain is right for a different question. Per-transaction is right for "who buys what, how often, in what baskets?" — questions where the unit of analysis is a customer or a purchase. Per-store-week is right for "which store is growing, which is flat, which is shrinking?" — questions where the unit of analysis is a store across time. Trying to answer the first question from the per-store-week view is impossible. Trying to answer the second from the per-transaction view is possible but laborious, and easy to do wrong if you forget to group correctly.
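As an illustration of a customer-grain question, the sketch below counts visits per customer per store from the Figure 1 rows. The variable names are illustrative, and the data is copied by hand from the table rather than read from transactions.csv.

```python
from collections import Counter

# (transaction, store, customer) copied from Figure 1; other columns omitted.
PURCHASES = [
    ("T001", "Downtown", "C12"),
    ("T002", "Downtown", "C45"),
    ("T003", "Downtown", "C12"),
    ("T004", "Downtown", "C77"),
    ("T005", "Downtown", "C45"),
    ("T006", "Campus", "C22"),
    ("T007", "Campus", "C99"),
    ("T008", "Campus", "C22"),
    ("T009", "Campus", "C33"),
    ("T010", "Suburban", "C66"),
    ("T011", "Suburban", "C88"),
    ("T012", "Suburban", "C66"),
]

# Group on (store, customer): the grouping key is what defines the question.
visits = Counter((store, customer) for _txn, store, customer in PURCHASES)

# Keep only customers who appear more than once at the same store.
repeat_visitors = {key: n for key, n in visits.items() if n > 1}

for (store, customer), n in sorted(repeat_visitors.items()):
    print(f"{customer} visited {store} {n} times")
```

No equivalent computation exists for the per-store-week table, because the customer column is no longer there to group on.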

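The other classic mistake is joining tables of different grains without aggregating first. Attach each store's weekly total to its transaction rows and that total repeats once per purchase; sum the column and Downtown's $39.00 week is counted five times.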
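To see how a mixed-grain join goes wrong, consider this sketch. It assumes the store-week table carries a weekly revenue figure per store, derived here from the Figure 1 amounts; the variable names are hypothetical.

```python
from decimal import Decimal

# (transaction, store, amount) copied from Figure 1.
TRANSACTIONS = [
    ("T001", "Downtown", Decimal("9.50")),
    ("T002", "Downtown", Decimal("6.25")),
    ("T003", "Downtown", Decimal("5.50")),
    ("T004", "Downtown", Decimal("8.75")),
    ("T005", "Downtown", Decimal("9.00")),
    ("T006", "Campus", Decimal("3.25")),
    ("T007", "Campus", Decimal("5.50")),
    ("T008", "Campus", Decimal("5.75")),
    ("T009", "Campus", Decimal("6.00")),
    ("T010", "Suburban", Decimal("3.00")),
    ("T011", "Suburban", Decimal("4.50")),
    ("T012", "Suburban", Decimal("5.50")),
]

# Store-week grain: one weekly revenue figure per store.
store_week_revenue = {}
for _txn, store, amount in TRANSACTIONS:
    store_week_revenue[store] = store_week_revenue.get(store, Decimal("0")) + amount

# Naive join: every transaction row picks up its store's weekly total.
joined = [
    (txn, store, amount, store_week_revenue[store])
    for txn, store, amount in TRANSACTIONS
]

# Summing the transaction-grain column gives the real weekly revenue.
true_total = sum((amount for _t, _s, amount, _w in joined), Decimal("0"))

# Summing the store-week column counts each store once per purchase.
inflated_total = sum((weekly for _t, _s, _a, weekly in joined), Decimal("0"))

print("true total:    ", true_total)
print("inflated total:", inflated_total)
```

The join itself raises no error, which is what makes the mistake dangerous. The safer pattern is to bring both tables to the same grain first, then join.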

For the manager

Before asking "what model should I use?", ask "what does one row mean?" The grain of a dataset is the first decision an analyst makes, usually implicitly and often before opening the file, and it sets the ceiling on what the rest of the analysis can do. The finest grain is not always the right one; it is just the most flexible, because it can be aggregated upward to anything coarser. Per-store-week is a natural reporting grain because each row is already a comparable unit. The art is keeping both available and knowing which one any given question wants.

A practical habit: when a new file arrives, the first paragraph of any analysis memo should describe what one row is. "transactions.csv: one row per completed purchase. Twelve rows for the week of 2024-03-04 across three stores. Each row has a transaction ID, a date, a store, a customer ID, an item list, and an amount in dollars." If you cannot write that paragraph, you do not yet know what you have.
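One way to back that paragraph with evidence is to check whether the column you believe identifies a row really is unique. The sketch below is one possible check, not a standard routine: `describe_grain` is a hypothetical helper, the CSV header names are assumptions about transactions.csv, and the text is embedded inline so the example runs on its own.

```python
import csv
import io

# Stand-in for transactions.csv; header names are assumed, values from Figure 1.
CSV_TEXT = """\
transaction,date,store,customer,items,amount
T001,2024-03-04,Downtown,C12,Latte + Croissant,9.50
T002,2024-03-04,Downtown,C45,Drip + Muffin,6.25
T003,2024-03-05,Downtown,C12,Latte,5.50
T004,2024-03-06,Downtown,C77,Cold brew + Bagel,8.75
T005,2024-03-08,Downtown,C45,Cappuccino + Croissant,9.00
T006,2024-03-04,Campus,C22,Drip coffee,3.25
T007,2024-03-05,Campus,C99,Latte,5.50
T008,2024-03-06,Campus,C22,Drip + Pastry,5.75
T009,2024-03-07,Campus,C33,Mocha,6.00
T010,2024-03-04,Suburban,C66,Drip coffee,3.00
T011,2024-03-06,Suburban,C88,Drip + Cookie,4.50
T012,2024-03-09,Suburban,C66,Cappuccino,5.50
"""


def describe_grain(rows, key_columns):
    """Report whether key_columns identify exactly one row each."""
    keys = [tuple(row[col] for col in key_columns) for row in rows]
    distinct = len(set(keys))
    return {
        "rows": len(keys),
        "distinct_keys": distinct,
        "one_row_per_key": len(keys) == distinct,
    }


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Claimed grain: one row per purchase, identified by the transaction ID.
per_purchase = describe_grain(rows, ["transaction"])

# A wrong guess at the grain: one row per store.
per_store = describe_grain(rows, ["store"])

print(per_purchase)
print(per_store)
```

If the claimed key turns out not to be unique, the sentence "one row per purchase" in the memo is wrong, and it is better to learn that before the first chart than after.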