§1.1

Grain, Structure, and Measurement

Part I — where on the decision ladder we are.

A regional manager at Bean & Basket Coffee opens her laptop and finds two views of last week on the shared drive. The first, transactions.csv, lists every drink that left a register: twelve rows for the week across three stores. The second, store_week.csv, takes that same week and rolls it up to three rows, one per store. Both describe identical business activity. Both look like reasonable starting points for an analysis. But they will not answer the same questions, and a manager who blurs that distinction will make different decisions than one who keeps it sharp.

What Is a Dataset?

The executive question: what does one row mean?

A dataset is a business story told in rows and columns. The columns name the things you can measure — date, store, customer, amount. The rows are the units that get measured. Before you reach for a chart, a regression, a model, or even an average, the first thing to know is what those rows actually represent. That is the dataset's grain: the level of detail at which one row is a complete observation.

When the grain is one purchase, each row is a moment in a customer's day — a person walked into a store on a particular date and spent a particular amount on particular items. When the grain is one store per week, each row is a store's commercial life summarized down to a handful of numbers. The same week of business activity supports both grains. They are not contradictions; they are different lenses. The trap is that they look interchangeable, and they are not.

Almost every analytical confusion a manager will encounter — incompatible joins, double-counted revenue, averages that lie, "the dashboard says X but the report says Y" — starts here, at the grain. Once you have read the first row, you have implicitly chosen the set of questions you can ask without trouble. Figure 1 makes the trade-off concrete by showing the same Bean & Basket week at two grains side by side.

One row = one purchase. Customer C12 visited Downtown twice; customer C66 visited Suburban twice. The customer ID makes those repeat visits visible.

Transaction	Date	Store	Customer	Items	Amount
T001	2024-03-04	A — Downtown	C12	Latte + Croissant	$9.50
T002	2024-03-04	A — Downtown	C45	Drip + Muffin	$6.25
T003	2024-03-05	A — Downtown	C12	Latte	$5.50
T004	2024-03-06	A — Downtown	C77	Cold brew + Bagel	$8.75
T005	2024-03-08	A — Downtown	C45	Cappuccino + Croissant	$9.00
T006	2024-03-04	B — Campus	C22	Drip coffee	$3.25
T007	2024-03-05	B — Campus	C99	Latte	$5.50
T008	2024-03-06	B — Campus	C22	Drip + Pastry	$5.75
T009	2024-03-07	B — Campus	C33	Mocha	$6.00
T010	2024-03-04	C — Suburban	C66	Drip coffee	$3.00
T011	2024-03-06	C — Suburban	C88	Drip + Cookie	$4.50
T012	2024-03-09	C — Suburban	C66	Cappuccino	$5.50

Figure 1. The same week of Bean & Basket sales at two grains. Toggle the tabs: the per-transaction view lets you trace what individual customers bought; the per-store-week view lets you compare stores at a glance. Neither answers the other's questions cleanly.

The per-transaction view tells you, immediately and concretely, that customer C12 walked into the Downtown store on two different days. You can see what each customer bought and what each one spent. You can also see absence — the Suburban store had only two regulars that week. The per-store-week view shows none of that. It cannot: the customer column has been aggregated away. What you get instead is a clean three-row ranking. Downtown out-earned Campus by roughly $19, and Campus out-earned Suburban by roughly $7. The ranking is legible in a way it never is in the per-transaction table, because there each store's rows are mixed in with the others.

Each grain is right for a different question. Per-transaction is right for "who buys what, how often, in what baskets?" — questions where the unit of analysis is a customer or a purchase. Per-store-week is right for "which store is growing, which is flat, which is shrinking?" — questions where the unit of analysis is a store across time. Trying to answer the first question from the per-store-week view is impossible. Trying to answer the second from the per-transaction view is possible but laborious, and easy to do wrong if you forget to group correctly.

The other classic mistake is joining tables of different grains without aggregating first.

For the manager

Before asking what model should I use?, ask what does one row mean? The grain of a dataset is the first decision an analyst makes — usually implicitly, often before opening the file — and it sets the ceiling on what the rest of the analysis can do. The finest grain is not always the right one; it is just the most flexible, because it can be aggregated upward to anything coarser. Per-store-week is a natural reporting grain because each row is already a comparable unit. The art is keeping both available and knowing which one any given question wants.

A practical habit: when a new file arrives, the first paragraph of any analysis memo should describe what one row is. "transactions.csv: one row per completed purchase. Twelve rows for the week of 2024-03-04 across three stores. Each row has a transaction ID, a date, a store, a customer ID, an item list, and an amount in dollars." If you cannot write that paragraph, you do not yet know what you have.

Reading one row is the first lens. The second is the arrangement of the rows themselves — the shape of the table, which decides what the eye can see.

Data Structures

The regional manager at Bean & Basket has three slides to prepare for Friday's review. The first asks which store is performing best this quarter. The second asks whether the chain is growing. The third asks where the growth is coming from. She has one folder of data and three reasonable-sounding charts in mind, and none of the three will show all three answers — because the shape of the underlying table changes what the eye can see in the chart. Cross-section, time-series, panel, geo, network: these are not five different files. They are five different ways of arranging the same business reality, and each one closes off some questions while opening up others.

The executive question: what questions can this data structure answer?

In Chapter 1 we worried about what one row means. In this chapter we worry about how those rows are arranged. Data structure is the larger pattern: how the unit of observation combines with time and with relationships to other units. Five structures cover almost every dataset a business analyst will see.

A cross-sectional dataset has one row per unit at a single point in time — a customer survey, a snapshot of stores at the end of last week, a leaderboard of products by year-to-date revenue. It compares units to each other, with time fixed. A time-series has one row per period for a single metric — monthly revenue, weekly active users, daily ad spend. It tracks one thing as it changes, with units collapsed. A panel combines them: the same units, observed repeatedly over time. Store-week sales, customer-month engagement, product-quarter unit movement — the panel is the most informative shape because it lets you ask both who is doing better and who is changing fastest.

Two more structures matter, even though we will not generate synthetic data for them in this chapter. Geo-spatial data attaches location to each row — store coordinates, delivery zones, ZIP codes — and supports questions about distance, catchment, and territory. Network data describes relationships between entities — which customers refer which other customers, which products are bought together, which employees report to whom — and supports questions about influence, communities, and flow.

Figure 2 shows the same eight weeks of Bean & Basket sales arranged three ways. The business question is identical across all three tabs: which store is doing best, and is the chain growing? The data is identical too — 24 store-week observations underneath. Only the arrangement changes. So does the answer.

One row per store, snapshot at week 8 only. Reads as: "Downtown is winning by a wide margin; Suburban is the runner-up; Campus is a distant third." Time has been thrown away.

Store	Week	Revenue	Transactions
A — Downtown	2024-04-22	$145.00	19
B — Campus	2024-04-22	$60.00	10
C — Suburban	2024-04-22	$90.00	15

Figure 2. Three structures, one business. Toggle between the panel (24 rows, 3 stores × 8 weeks), the cross-section (just last week's 3 rows), and the time-series (the chain-wide weekly trend). Each view answers the question 'which store is doing best?' differently — only one of them answers it correctly.

The cross-section in Figure 2 looks decisive. Downtown earned $145 last week; Suburban $90; Campus $60. If a regional manager built her Friday slide from this single table, she would walk into the room with a clean ranking and a quiet recommendation to invest more in Downtown. The time-series tells a different and equally tidy story: chain revenue is up. Both stories are true in their narrow way. Both miss the thing that should actually drive the decision.

The panel makes the gap visible. Downtown's eight-week revenue line is flat — it is large, but it is not growing. Campus is bleeding out, slowly, every single week. Suburban started at $35 and ended at $90, nearly tripling. A manager who saw only the cross-section would invest in the wrong store; one who saw only the time-series would feel good about a chain-wide trend that is, in reality, the sum of one collapsing store, one stagnant store, and one breakout. The panel is the only structure of the three that lets you ask and answer the right question: where is the growth coming from? Once that question is on the table, the decision is no longer "give Downtown more marketing money" — it is "figure out what's happening at Suburban, and what's going wrong at Campus."

Geographic and network data follow the same logic. A geo-spatial table that records each store's latitude and longitude lets you ask whether the Suburban breakout has anything to do with proximity to a new university campus three miles away — a question the panel alone cannot answer. A network table that records which customers introduced which other customers lets you ask whether the Suburban growth is driven by a single high-influence customer who brings friends — again, invisible at the store-week level. The deeper rule is that adding structure unlocks questions: each new layer (time, place, relationship) lets the same business evidence be asked something new.

The operating posture

Before asking what chart should I make?, ask what is the shape of my data? The five structures map onto five families of business questions:

Cross-section answers who is highest right now?
Time-series answers is this metric rising or falling?
Panel answers which units changed, and how?
Geo-spatial answers where is this happening?
Network answers who is connected to whom?

The dangerous structures are the first two, because they look more decisive than they are. A single-week ranking and a single chain-wide trend each fit on one slide; the panel takes more room and looks less clean. The slide-friendly structures are the ones a manager is most likely to over-trust. The discipline is to design the underlying table in the richest shape the business can sustain — almost always a panel — and to derive the cross-sectional or time-series views from it only when the simpler view is enough to answer the question being asked. The panel is the source of truth; the simpler views are convenience.

Grain and structure describe how the rows are organized. The last lens turns to the columns: what kind of thing each one measures, and what you are allowed to do with it.

Variable Types and Measurement

A manager opens the customer-feedback report and sees "average rating: 4.2 out of 5." A few pages later the same report says "average ZIP code: 11042." The first number is misleading; the second is meaningless. Both errors come from the same root cause: someone treated a column as a number when it was not really a number. Almost every analytic mistake a manager will hear about, once you scratch the surface, is a quiet type confusion of this kind. The defense is to know what kind of column you are looking at before deciding what to do with it.

The executive question: is this column a number, a category, a date, a location, a label, or a business definition?

Variables are not just columns. They are measurement choices. When the company decides to track customer ratings on a 1-to-5 scale, that decision implies a set of operations that are sensible (the median, the share of 4-or-5 ratings) and a set that look sensible but are not (the arithmetic mean, the standard deviation). When the company records ZIP codes, the decision implies that the column will be used for maps and territory grouping, not for arithmetic. Type, in the small-t sense an analyst means, is the rule about what you are allowed to do with a column.

Seven types cover almost every business column. Numeric variables — price, revenue, quantity — support sums, means, ratios, and most of the textbook tools. Categorical variables — store, region, product family — support grouping and counting. Ordered categorical variables — satisfaction levels, T-shirt sizes, income brackets — support medians and rank comparisons, but not arithmetic, because the gaps between levels are not equal. Date/time is a category of its own: it sorts, supports lags and rolling windows, and decomposes into month, day-of-week, and hour. Geographic variables look numeric (ZIP, lat/lon) but behave categorically: distance and territory are sensible, sums are not. Boolean columns are simple — they support rates and conversions — but managers will sometimes report "average is_member = 0.43" instead of "43% are members," which is the same number stated badly. Text is its own world, treated in Part V; for now, the rule is that it does not belong in a normal table operation.

The trap is that a column's storage type — what the database engine sees — is not the same as its measurement type. A customer_id of 12345 is stored as an integer and will gladly average with other integers, but the average customer ID is not a number. A ZIP of 10003 is stored as an integer too. A rating of 4 is also an integer but means something quite different. Figure 3 walks through a Bean & Basket variable dictionary, naming the measurement type for each column, the operations that are sensible on it, and the common mistake — the one your team will make if no one writes the dictionary down.

Figure 3. What you can do with a column depends on what kind of column it is. Bean & Basket variable dictionary, eleven representative columns. The 'common mistake' column is the one to read carefully — every entry corresponds to a real report someone wrote.

Variable	Type	Example	Sensible operations	Common mistake
transaction_id	ID (categorical)	"T001"	Join key, distinct count	Averaging it. Sums and means of IDs are meaningless.
date	Date / time	"2024-03-04"	Sort, extract month, build lags, group by week	Storing as a string. "2024-13-01" sorts before "2024-2-01".
store_id	Categorical (nominal)	"A — Downtown"	Group, filter, count distinct	Treating alphabetical order as a ranking.
store_zip	Geographic / categorical	"10003"	Map, compute distance, group by region	Averaging ZIP codes. "Mean ZIP = 11042" is a noise sentence.
product_category	Categorical (nominal)	"Coffee"	Group, count, share-of-revenue ratios	Inventing implicit order (Coffee < Tea < Pastry).
product_size	Ordered categorical	"Medium"	Rank, median, compare adjacent ranks	Treating (Small + Large) / 2 = Medium. The interval is not equal.
price	Numeric (continuous)	5.50	Sum, mean, ratio, log, elasticity	Comparing across currencies without normalizing.
quantity	Numeric (count, integer)	2	Sum, mean, distribution by category	Reporting non-integer means without context ("avg basket = 1.4 items").
rating_1_to_5	Ordered categorical	4	Median, % top-2-box, compare ranks	Reporting "mean rating = 4.2" as if the gap from 4 to 5 equals the gap from 1 to 2.
is_loyalty_member	Boolean	true	Conversion rate (% true), filter, segment	Reporting "avg member = 0.43". Say "43% are members" instead.
review_text	Text / unstructured	"Great oat-milk latte but slow service."	Sentiment, topic, embedding, semantic search	Treating it as a structured column. "COUNT DISTINCT review" is rarely the question.

Figure 3 is the single most useful artifact a new analytics team can produce in its first week. The act of writing it down forces every column's type to be decided, named, and shared. Once the dictionary exists, downstream errors are caught at the table-design stage instead of at the slide-review stage. Notice how many of the entries warn against arithmetic on things that look numeric: IDs, ZIPs, ratings, and booleans all live in this looks-like-a-number-but-isn't trap. The trap is so reliable that you can find a version of it in almost every business intelligence stack ever shipped.