§1.3

Variable Types and Measurement

A manager opens the customer-feedback report and sees "average rating: 4.2 out of 5." A few pages later the same report says "average ZIP code: 11042." The first number is misleading; the second is meaningless. Both errors come from the same root cause: someone treated a column as a number when it was not really a number. Almost every analytic mistake a manager will hear about, once you scratch the surface, is a quiet type confusion of this kind. The defense is to know what kind of column you are looking at before deciding what to do with it.

The executive question: is this column a number, a category, a date, a location, a label, or a business definition?

Variables are not just columns. They are measurement choices. When the company decides to track customer ratings on a 1-to-5 scale, that decision implies a set of operations that are sensible (the median, the share of 4-or-5 ratings) and a set that look sensible but are not (the arithmetic mean, the standard deviation). When the company records ZIP codes, the decision implies that the column will be used for maps and territory grouping, not for arithmetic. Type, in the small-t sense an analyst means, is the rule about what you are allowed to do with a column.

Seven types cover almost every business column. Numeric variables — price, revenue, quantity — support sums, means, ratios, and most of the textbook tools. Categorical variables — store, region, product family — support grouping and counting. Ordered categorical variables — satisfaction levels, T-shirt sizes, income brackets — support medians and rank comparisons, but not arithmetic, because the gaps between levels are not equal. Date/time is a category of its own: it sorts, supports lags and rolling windows, and decomposes into month, day-of-week, and hour. Geographic variables look numeric (ZIP, lat/lon) but behave categorically: distance and territory are sensible, sums are not. Boolean columns are simple — they support rates and conversions — but managers will sometimes report "average is_member = 0.43" instead of "43% are members," which is the same number stated badly. Text is its own world, treated in Part V; for now, the rule is that it does not belong in a normal table operation.

The trap is that a column's storage type — what the database engine sees — is not the same as its measurement type. A customer_id of 12345 is stored as an integer and will gladly average with other integers, but the average customer ID is not a number. A ZIP of 10003 is stored as an integer too. A rating of 4 is also an integer but means something quite different. Figure 1 walks through a Bean & Basket variable dictionary, naming the measurement type for each column, the operations that are sensible on it, and the common mistake — the one your team will make if no one writes the dictionary down.

Figure 1. What you can do with a column depends on what kind of column it is. Bean & Basket variable dictionary, eleven representative columns. The 'common mistake' column is the one to read carefully — every entry corresponds to a real report someone wrote.

Variable	Type	Example	Sensible operations	Common mistake
transaction_id	ID (categorical)	"T001"	Join key, distinct count	Averaging it. Sums and means of IDs are meaningless.
date	Date / time	"2024-03-04"	Sort, extract month, build lags, group by week	Storing as a string. "2024-13-01" sorts before "2024-2-01".
store_id	Categorical (nominal)	"A — Downtown"	Group, filter, count distinct	Treating alphabetical order as a ranking.
store_zip	Geographic / categorical	"10003"	Map, compute distance, group by region	Averaging ZIP codes. "Mean ZIP = 11042" is a noise sentence.
product_category	Categorical (nominal)	"Coffee"	Group, count, share-of-revenue ratios	Inventing implicit order (Coffee < Tea < Pastry).
product_size	Ordered categorical	"Medium"	Rank, median, compare adjacent ranks	Treating (Small + Large) / 2 = Medium. The interval is not equal.
price	Numeric (continuous)	5.50	Sum, mean, ratio, log, elasticity	Comparing across currencies without normalizing.
quantity	Numeric (count, integer)	2	Sum, mean, distribution by category	Reporting non-integer means without context ("avg basket = 1.4 items").
rating_1_to_5	Ordered categorical	4	Median, % top-2-box, compare ranks	Reporting "mean rating = 4.2" as if the gap from 4 to 5 equals the gap from 1 to 2.
is_loyalty_member	Boolean	true	Conversion rate (% true), filter, segment	Reporting "avg member = 0.43". Say "43% are members" instead.
review_text	Text / unstructured	"Great oat-milk latte but slow service."	Sentiment, topic, embedding, semantic search	Treating it as a structured column. "COUNT DISTINCT review" is rarely the question.

Figure 1 is the single most useful artifact a new analytics team can produce in its first week. The act of writing it down forces every column's type to be decided, named, and shared. Once the dictionary exists, downstream errors are caught at the table-design stage instead of at the slide-review stage. Notice how many of the entries warn against arithmetic on things that look numeric: IDs, ZIPs, ratings, and booleans all live in this looks-like-a-number-but-isn't trap. The trap is so reliable that you can find a version of it in almost every business intelligence stack ever shipped.

Concept check

Three questions spanning the unit of observation (grain), the data structure that fits a question, and measurement type versus storage type.

1.
A regional report shows three stores' "average basket" — $7.10, $5.80, $4.35 — and the manager averages those three figures to claim a chain-wide average basket of $5.75. Why is the $5.75 wrong, and what would fix it?
2.
Last quarter's store ranking shows Downtown on top and Suburban near the bottom. Hidden underneath is eight weeks of store-week data in which Downtown is flat and Suburban has nearly tripled. Why can the ranking actively mislead the investment decision?
3.
An analyst notices that customer_id, ZIP code, and a 1-to-5 satisfaction rating are all stored as integers, and reports the "average" of each. Which diagnosis is correct?