§1.1

Grain, Structure, and Measurement

I
What happened?
II
Where & for whom?
III
What caused it?
III
How much does X matter?
IV
What is likely next?
V
What does the text/image say?
VI
How do we operate this?

Part I — where on the decision ladder we are.

A regional manager at Bean & Basket Coffee opens her laptop and finds two views of last week on the shared drive. The first, transactions.csv, lists every drink that left a register: twelve rows for the week across three stores. The second, store_week.csv, takes that same week and rolls it up to three rows, one per store. Both describe identical business activity. Both look like reasonable starting points for an analysis. But they will not answer the same questions, and a manager who blurs that distinction will make different decisions than one who keeps it sharp.

What Is a Dataset?

The executive question: what does one row mean?

A dataset is a business story told in rows and columns. The columns name the things you can measure — date, store, customer, amount. The rows are the units that get measured. Before you reach for a chart, a regression, a model, or even an average, the first thing to know is what those rows actually represent. That is the dataset's grain: the level of detail at which one row is a complete observation.

When the grain is one purchase, each row is a moment in a customer's day — a person walked into a store on a particular date and spent a particular amount on particular items. When the grain is one store per week, each row is a store's commercial life summarized down to a handful of numbers. The same week of business activity supports both grains. They are not contradictions; they are different lenses. The trap is that they look interchangeable, and they are not.

Almost every analytical confusion a manager will encounter — incompatible joins, double-counted revenue, averages that lie, "the dashboard says X but the report says Y" — starts here, at the grain. Once you have read the first row, you have implicitly chosen the set of questions you can ask without trouble. Figure 1 makes the trade-off concrete by showing the same Bean & Basket week at two grains side by side.

One row = one purchase. Customer C12 visited Downtown twice; customer C66 visited Suburban twice. The customer ID makes those repeat visits visible.

TransactionDateStoreCustomerItemsAmount
T0012024-03-04A — DowntownC12Latte + Croissant$9.50
T0022024-03-04A — DowntownC45Drip + Muffin$6.25
T0032024-03-05A — DowntownC12Latte$5.50
T0042024-03-06A — DowntownC77Cold brew + Bagel$8.75
T0052024-03-08A — DowntownC45Cappuccino + Croissant$9.00
T0062024-03-04B — CampusC22Drip coffee$3.25
T0072024-03-05B — CampusC99Latte$5.50
T0082024-03-06B — CampusC22Drip + Pastry$5.75
T0092024-03-07B — CampusC33Mocha$6.00
T0102024-03-04C — SuburbanC66Drip coffee$3.00
T0112024-03-06C — SuburbanC88Drip + Cookie$4.50
T0122024-03-09C — SuburbanC66Cappuccino$5.50
Figure 1. The same week of Bean & Basket sales at two grains. Toggle the tabs: the per-transaction view lets you trace what individual customers bought; the per-store-week view lets you compare stores at a glance. Neither answers the other's questions cleanly.

The per-transaction view tells you, immediately and concretely, that customer C12 walked into the Downtown store on two different days. You can see what each customer bought and what each one spent. You can also see absence — the Suburban store had only two regulars that week. The per-store-week view shows none of that. It cannot: the customer column has been aggregated away. What you get instead is a clean three-row ranking. Downtown out-earned Campus by roughly $19, and Campus out-earned Suburban by roughly $7. The ranking is legible in a way it never is in the per-transaction table, because there each store's rows are mixed in with the others.

Each grain is right for a different question. Per-transaction is right for "who buys what, how often, in what baskets?" — questions where the unit of analysis is a customer or a purchase. Per-store-week is right for "which store is growing, which is flat, which is shrinking?" — questions where the unit of analysis is a store across time. Trying to answer the first question from the per-store-week view is impossible. Trying to answer the second from the per-transaction view is possible but laborious, and easy to do wrong if you forget to group correctly.

The other classic mistake is joining tables of different grains without aggregating first.

Reading one row is the first lens. The second is the arrangement of the rows themselves — the shape of the table, which decides what the eye can see.

Data Structures

The regional manager at Bean & Basket has three slides to prepare for Friday's review. The first asks which store is performing best this quarter. The second asks whether the chain is growing. The third asks where the growth is coming from. She has one folder of data and three reasonable-sounding charts in mind, and none of the three will show all three answers — because the shape of the underlying table changes what the eye can see in the chart. Cross-section, time-series, panel, geo, network: these are not five different files. They are five different ways of arranging the same business reality, and each one closes off some questions while opening up others.

The executive question: what questions can this data structure answer?

In Chapter 1 we worried about what one row means. In this chapter we worry about how those rows are arranged. Data structure is the larger pattern: how the unit of observation combines with time and with relationships to other units. Five structures cover almost every dataset a business analyst will see.

A cross-sectional dataset has one row per unit at a single point in time — a customer survey, a snapshot of stores at the end of last week, a leaderboard of products by year-to-date revenue. It compares units to each other, with time fixed. A time-series has one row per period for a single metric — monthly revenue, weekly active users, daily ad spend. It tracks one thing as it changes, with units collapsed. A panel combines them: the same units, observed repeatedly over time. Store-week sales, customer-month engagement, product-quarter unit movement — the panel is the most informative shape because it lets you ask both who is doing better and who is changing fastest.

Two more structures matter, even though we will not generate synthetic data for them in this chapter. Geo-spatial data attaches location to each row — store coordinates, delivery zones, ZIP codes — and supports questions about distance, catchment, and territory. Network data describes relationships between entities — which customers refer which other customers, which products are bought together, which employees report to whom — and supports questions about influence, communities, and flow.

Figure 2 shows the same eight weeks of Bean & Basket sales arranged three ways. The business question is identical across all three tabs: which store is doing best, and is the chain growing? The data is identical too — 24 store-week observations underneath. Only the arrangement changes. So does the answer.

One row per store, snapshot at week 8 only. Reads as: "Downtown is winning by a wide margin; Suburban is the runner-up; Campus is a distant third." Time has been thrown away.

StoreWeekRevenueTransactions
A — Downtown2024-04-22$145.0019
B — Campus2024-04-22$60.0010
C — Suburban2024-04-22$90.0015
Figure 2. Three structures, one business. Toggle between the panel (24 rows, 3 stores × 8 weeks), the cross-section (just last week's 3 rows), and the time-series (the chain-wide weekly trend). Each view answers the question 'which store is doing best?' differently — only one of them answers it correctly.

The cross-section in Figure 2 looks decisive. Downtown earned $145 last week; Suburban $90; Campus $60. If a regional manager built her Friday slide from this single table, she would walk into the room with a clean ranking and a quiet recommendation to invest more in Downtown. The time-series tells a different and equally tidy story: chain revenue is up. Both stories are true in their narrow way. Both miss the thing that should actually drive the decision.

The panel makes the gap visible. Downtown's eight-week revenue line is flat — it is large, but it is not growing. Campus is bleeding out, slowly, every single week. Suburban started at $35 and ended at $90, nearly tripling. A manager who saw only the cross-section would invest in the wrong store; one who saw only the time-series would feel good about a chain-wide trend that is, in reality, the sum of one collapsing store, one stagnant store, and one breakout. The panel is the only structure of the three that lets you ask and answer the right question: where is the growth coming from? Once that question is on the table, the decision is no longer "give Downtown more marketing money" — it is "figure out what's happening at Suburban, and what's going wrong at Campus."

Geographic and network data follow the same logic. A geo-spatial table that records each store's latitude and longitude lets you ask whether the Suburban breakout has anything to do with proximity to a new university campus three miles away — a question the panel alone cannot answer. A network table that records which customers introduced which other customers lets you ask whether the Suburban growth is driven by a single high-influence customer who brings friends — again, invisible at the store-week level. The deeper rule is that adding structure unlocks questions: each new layer (time, place, relationship) lets the same business evidence be asked something new.

Grain and structure describe how the rows are organized. The last lens turns to the columns: what kind of thing each one measures, and what you are allowed to do with it.

Variable Types and Measurement

A manager opens the customer-feedback report and sees "average rating: 4.2 out of 5." A few pages later the same report says "average ZIP code: 11042." The first number is misleading; the second is meaningless. Both errors come from the same root cause: someone treated a column as a number when it was not really a number. Almost every analytic mistake a manager will hear about, once you scratch the surface, is a quiet type confusion of this kind. The defense is to know what kind of column you are looking at before deciding what to do with it.

The executive question: is this column a number, a category, a date, a location, a label, or a business definition?

Variables are not just columns. They are measurement choices. When the company decides to track customer ratings on a 1-to-5 scale, that decision implies a set of operations that are sensible (the median, the share of 4-or-5 ratings) and a set that look sensible but are not (the arithmetic mean, the standard deviation). When the company records ZIP codes, the decision implies that the column will be used for maps and territory grouping, not for arithmetic. Type, in the small-t sense an analyst means, is the rule about what you are allowed to do with a column.

Seven types cover almost every business column. Numeric variables — price, revenue, quantity — support sums, means, ratios, and most of the textbook tools. Categorical variables — store, region, product family — support grouping and counting. Ordered categorical variables — satisfaction levels, T-shirt sizes, income brackets — support medians and rank comparisons, but not arithmetic, because the gaps between levels are not equal. Date/time is a category of its own: it sorts, supports lags and rolling windows, and decomposes into month, day-of-week, and hour. Geographic variables look numeric (ZIP, lat/lon) but behave categorically: distance and territory are sensible, sums are not. Boolean columns are simple — they support rates and conversions — but managers will sometimes report "average is_member = 0.43" instead of "43% are members," which is the same number stated badly. Text is its own world, treated in Part V; for now, the rule is that it does not belong in a normal table operation.

The trap is that a column's storage type — what the database engine sees — is not the same as its measurement type. A customer_id of 12345 is stored as an integer and will gladly average with other integers, but the average customer ID is not a number. A ZIP of 10003 is stored as an integer too. A rating of 4 is also an integer but means something quite different. Figure 3 walks through a Bean & Basket variable dictionary, naming the measurement type for each column, the operations that are sensible on it, and the common mistake — the one your team will make if no one writes the dictionary down.

Figure 3. What you can do with a column depends on what kind of column it is. Bean & Basket variable dictionary, eleven representative columns. The 'common mistake' column is the one to read carefully — every entry corresponds to a real report someone wrote.
VariableTypeExampleSensible operationsCommon mistake
transaction_idID (categorical)"T001"Join key, distinct countAveraging it. Sums and means of IDs are meaningless.
dateDate / time"2024-03-04"Sort, extract month, build lags, group by weekStoring as a string. "2024-13-01" sorts before "2024-2-01".
store_idCategorical (nominal)"A — Downtown"Group, filter, count distinctTreating alphabetical order as a ranking.
store_zipGeographic / categorical"10003"Map, compute distance, group by regionAveraging ZIP codes. "Mean ZIP = 11042" is a noise sentence.
product_categoryCategorical (nominal)"Coffee"Group, count, share-of-revenue ratiosInventing implicit order (Coffee < Tea < Pastry).
product_sizeOrdered categorical"Medium"Rank, median, compare adjacent ranksTreating (Small + Large) / 2 = Medium. The interval is not equal.
priceNumeric (continuous)5.50Sum, mean, ratio, log, elasticityComparing across currencies without normalizing.
quantityNumeric (count, integer)2Sum, mean, distribution by categoryReporting non-integer means without context ("avg basket = 1.4 items").
rating_1_to_5Ordered categorical4Median, % top-2-box, compare ranksReporting "mean rating = 4.2" as if the gap from 4 to 5 equals the gap from 1 to 2.
is_loyalty_memberBooleantrueConversion rate (% true), filter, segmentReporting "avg member = 0.43". Say "43% are members" instead.
review_textText / unstructured"Great oat-milk latte but slow service."Sentiment, topic, embedding, semantic searchTreating it as a structured column. "COUNT DISTINCT review" is rarely the question.

Figure 3 is the single most useful artifact a new analytics team can produce in its first week. The act of writing it down forces every column's type to be decided, named, and shared. Once the dictionary exists, downstream errors are caught at the table-design stage instead of at the slide-review stage. Notice how many of the entries warn against arithmetic on things that look numeric: IDs, ZIPs, ratings, and booleans all live in this looks-like-a-number-but-isn't trap. The trap is so reliable that you can find a version of it in almost every business intelligence stack ever shipped.