§1.3
Variable Types and Measurement
A manager opens the customer-feedback report and sees "average rating: 4.2 out of 5." A few pages later the same report says "average ZIP code: 11042." The first number is misleading; the second is meaningless. Both errors come from the same root cause: someone treated a column as a number when it was not really a number. Almost every analytic mistake a manager will hear about, once you scratch the surface, is a quiet type confusion of this kind. The defense is to know what kind of column you are looking at before deciding what to do with it.
The executive question: is this column a number, a category, a date, a location, a label, or a business definition?
Variables are not just columns. They are measurement choices. When the company decides to track customer ratings on a 1-to-5 scale, that decision implies a set of operations that are sensible (the median, the share of 4-or-5 ratings) and a set that look sensible but are not (the arithmetic mean, the standard deviation). When the company records ZIP codes, the decision implies that the column will be used for maps and territory grouping, not for arithmetic. Type, in the small-t sense an analyst means, is the rule about what you are allowed to do with a column.
Seven types cover almost every business column. Numeric variables — price, revenue, quantity — support sums, means, ratios, and most of the textbook tools. Categorical variables — store, region, product family — support grouping and counting. Ordered categorical variables — satisfaction levels, T-shirt sizes, income brackets — support medians and rank comparisons, but not arithmetic, because the gaps between levels are not equal. Date/time is a category of its own: it sorts, supports lags and rolling windows, and decomposes into month, day-of-week, and hour. Geographic variables look numeric (ZIP, lat/lon) but behave categorically: distance and territory are sensible, sums are not. Boolean columns are simple — they support rates and conversions — but managers will sometimes report "average is_member = 0.43" instead of "43% are members," which is the same number stated badly. Text is its own world, treated in Part V; for now, the rule is that it does not belong in a normal table operation.
The trap is that a column's storage type — what the database engine sees — is not the same as its measurement type. A customer_id of 12345 is stored as an integer and will gladly average with other integers, but the average customer ID is not a number. A ZIP of 10003 is stored as an integer too. A rating of 4 is also an integer but means something quite different. Figure 1 walks through a Bean & Basket variable dictionary, naming the measurement type for each column, the operations that are sensible on it, and the common mistake — the one your team will make if no one writes the dictionary down.
| Variable | Type | Example | Sensible operations | Common mistake |
|---|---|---|---|---|
| transaction_id | ID (categorical) | "T001" | Join key, distinct count | Averaging it. Sums and means of IDs are meaningless. |
| date | Date / time | "2024-03-04" | Sort, extract month, build lags, group by week | Storing as a string. "2024-13-01" sorts before "2024-2-01". |
| store_id | Categorical (nominal) | "A — Downtown" | Group, filter, count distinct | Treating alphabetical order as a ranking. |
| store_zip | Geographic / categorical | "10003" | Map, compute distance, group by region | Averaging ZIP codes. "Mean ZIP = 11042" is a noise sentence. |
| product_category | Categorical (nominal) | "Coffee" | Group, count, share-of-revenue ratios | Inventing implicit order (Coffee < Tea < Pastry). |
| product_size | Ordered categorical | "Medium" | Rank, median, compare adjacent ranks | Treating (Small + Large) / 2 = Medium. The interval is not equal. |
| price | Numeric (continuous) | 5.50 | Sum, mean, ratio, log, elasticity | Comparing across currencies without normalizing. |
| quantity | Numeric (count, integer) | 2 | Sum, mean, distribution by category | Reporting non-integer means without context ("avg basket = 1.4 items"). |
| rating_1_to_5 | Ordered categorical | 4 | Median, % top-2-box, compare ranks | Reporting "mean rating = 4.2" as if the gap from 4 to 5 equals the gap from 1 to 2. |
| is_loyalty_member | Boolean | true | Conversion rate (% true), filter, segment | Reporting "avg member = 0.43". Say "43% are members" instead. |
| review_text | Text / unstructured | "Great oat-milk latte but slow service." | Sentiment, topic, embedding, semantic search | Treating it as a structured column. "COUNT DISTINCT review" is rarely the question. |
Figure 1 is the single most useful artifact a new analytics team can produce in its first week. The act of writing it down forces every column's type to be decided, named, and shared. Once the dictionary exists, downstream errors are caught at the table-design stage instead of at the slide-review stage. Notice how many of the entries warn against arithmetic on things that look numeric: IDs, ZIPs, ratings, and booleans all live in this looks-like-a-number-but-isn't trap. The trap is so reliable that you can find a version of it in almost every business intelligence stack ever shipped.
Concept check
Three questions spanning the unit of observation (grain), the data structure that fits a question, and measurement type versus storage type.
- 1.A regional report shows three stores' "average basket" — $7.10, $5.80, $4.35 — and the manager averages those three figures to claim a chain-wide average basket of $5.75. Why is the $5.75 wrong, and what would fix it?
- 2.Last quarter's store ranking shows Downtown on top and Suburban near the bottom. Hidden underneath is eight weeks of store-week data in which Downtown is flat and Suburban has nearly tripled. Why can the ranking actively mislead the investment decision?
- 3.An analyst notices that customer_id, ZIP code, and a 1-to-5 satisfaction rating are all stored as integers, and reports the "average" of each. Which diagnosis is correct?