§11.4
Case Study: Lottery ZIP Psychographics
Lottery data is a small but revealing example of what unsupervised learning can and cannot do. A ZIP code has no psychology in the individual sense. It has retailers, commuting patterns, neighborhood routines, product availability, demographic composition, and many players whose behavior is aggregated into one row. The task is not to infer who buys lottery tickets or why a person plays. The task is to ask whether neighborhood-level lottery portfolios differ in stable, interpretable ways.
This case study uses a baseline ZIP-level file derived from public NY Lottery data and joined to demographic measures. The original time-window metadata is unavailable, so the pre_ fields are treated as a baseline cross-section. The analysis is intentionally non-causal: PCA and k-means are fit on lottery behavior and retailer-access variables, then the resulting scores and segments are profiled by income, poverty, education, race/ethnicity, population, and region.
The strongest finding is not that "poor neighborhoods play the lottery." That story is too simple for this file. The more interesting pattern is that different kinds of neighborhoods play different product portfolios. Dense ZIPs, especially in NYC and downstate, are more daily-number and habit-index oriented. Small upstate ZIPs are more likely to split into routine checkout scratch corridors or specialized Quick Draw venue ZIPs. Higher-income ZIPs do not disappear from the lottery map; they shift toward jackpot-style portfolios and lower retailer density.
The Research Question
What neighborhood-level lottery routines can be recovered from ZIP-level behavior, and how do those routines interact with local demography?
That wording is deliberately ecological. The data supports statements about ZIPs, not individuals. A high-Hispanic-share ZIP with high Daily Numbers share does not prove that Hispanic residents buy Daily Numbers. It says that ZIP-level demographic composition, urban density, retail ecology, and product mix move together in the cross-section.
Scope, Measures, and Modeling Contract
The source file contains 1,400 ZIP rows and 59 columns. The main model uses 1,326 active ZIPs, or 94.7% of the source rows, after excluding rows with no recorded sales, no retailers, or population below 100. This exclusion is not a claim that the omitted ZIPs do not matter; it prevents inactive or unstable rows from becoming an artificial cluster.
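The filter described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the field names `pre_sales`, `retailers`, and `population` are hypothetical stand-ins for whatever the source file calls them, and the four toy rows are invented for the example.

```python
# Hypothetical field names; the source file's real column names are not shown.
MIN_POPULATION = 100

def is_active(row: dict) -> bool:
    """Keep a ZIP only if it has sales, at least one retailer, and population >= 100."""
    return (
        row["pre_sales"] > 0
        and row["retailers"] > 0
        and row["population"] >= MIN_POPULATION
    )

# Invented toy rows, one per exclusion reason plus one active ZIP.
rows = [
    {"zip": "A", "pre_sales": 125000.0, "retailers": 4, "population": 8200},
    {"zip": "B", "pre_sales": 0.0, "retailers": 2, "population": 5100},    # no recorded sales
    {"zip": "C", "pre_sales": 9000.0, "retailers": 0, "population": 900},  # no retailers
    {"zip": "D", "pre_sales": 4000.0, "retailers": 1, "population": 60},   # population below 100
]

active = [r for r in rows if is_active(r)]
active_share = len(active) / len(rows)
print(f"{len(active)} of {len(rows)} rows kept ({active_share:.1%})")

# The article's own counts: 1,326 active rows out of 1,400 source rows.
article_share = round(1326 / 1400 * 100, 1)
```

Applying the filter before any standardization matters, because zero-sales rows would otherwise distort the feature means and variances that the later models depend on.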
| Item | Definition | Analytical implication |
|---|---|---|
| Unit | ZIP code cross-section | 1,400 source ZIP rows; 1,326 active rows after filtering zero/thin observations. |
| Behavior features | Product mix, channel context, timing, entropy, habit, retailer density, and log intensity | 38 standardized features enter PCA and k-means. |
| Demographics | Income, poverty, college share, Black share, Hispanic share, and population | Used only after fitting the unsupervised model, so demographics profile the segments rather than define them. |
| Interpretation | Descriptive ecological analysis | The row is a neighborhood-like ZIP aggregate. The article makes no individual-level or causal claim. |
The feature set includes product shares (instant scratch, Quick Draw, Daily Numbers, jackpot), channel context (bar, convenience, grocery, chain, gas, and related contexts), temporal structure, entropy/concentration indices, add-on behavior, retailer density, and log sales intensity. The demographic fields are held out of the unsupervised fit. They enter only after the clusters and PCA scores exist.
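The article does not spell out how its entropy and concentration indices are computed. As a hedged illustration, the sketch below uses two common definitions, Shannon entropy and a Herfindahl-style sum of squared shares, applied to a vector of product shares. Both definitions and the example share vectors are assumptions, not the file's documented formulas.

```python
import math

def shannon_entropy(shares):
    """Shannon entropy (natural log) of a share vector; zero shares contribute nothing."""
    total = sum(shares)
    if total <= 0:
        raise ValueError("shares must sum to a positive number")
    probs = [s / total for s in shares if s > 0]
    return -sum(p * math.log(p) for p in probs)

def herfindahl(shares):
    """Sum of squared normalized shares: 1/n for an even split, 1.0 for a single product."""
    total = sum(shares)
    if total <= 0:
        raise ValueError("shares must sum to a positive number")
    return sum((s / total) ** 2 for s in shares)

# Illustrative portfolios over four product families (invented numbers).
balanced = [0.25, 0.25, 0.25, 0.25]
venue_like = [0.0, 0.93, 0.0, 0.07]  # nearly all volume in one product

print(shannon_entropy(balanced), herfindahl(balanced))
print(shannon_entropy(venue_like), herfindahl(venue_like))
```

Under these definitions a broad portfolio scores high on entropy and low on concentration, and a single-product portfolio does the reverse, so the two indices carry largely mirrored information.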
Sales Are Concentrated, but Concentration Is Not the Whole Story
The recorded baseline volume is highly concentrated. The top 10% of active ZIPs account for 47.6% of observed lottery volume, and the top 20% account for 70.4%. NYC alone accounts for 42.4% of observed volume while representing a much smaller share of active ZIP rows.
Figure: Statewide ZIP baseline (panels: sales concentration; region share of volume).
This concentration has two implications. First, raw volume is not a neutral measure of neighborhood propensity because it absorbs population, commuting, retailer count, and destination retail. Second, a good unsupervised case should not stop at raw sales. It should ask which ZIPs have similar portfolios after standardizing behavior and access.
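The concentration shares quoted above are top-k shares of total volume. A minimal sketch of that calculation follows; the function name `top_share` and the toy volumes are invented for illustration, and the rounding rule for the size of the top group is an assumption.

```python
def top_share(volumes, top_fraction):
    """Share of total volume held by the highest-volume `top_fraction` of units."""
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    ordered = sorted(volumes, reverse=True)
    # Assumption: round to the nearest whole unit, keeping at least one.
    n_top = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:n_top]) / sum(ordered)

# Ten toy ZIP volumes that sum to 1,000 (invented numbers).
volumes = [500, 200, 100, 50, 50, 40, 30, 15, 10, 5]

share_top_10 = top_share(volumes, 0.10)  # top 1 of 10 ZIPs
share_top_20 = top_share(volumes, 0.20)  # top 2 of 10 ZIPs
print(f"top 10%: {share_top_10:.1%}, top 20%: {share_top_20:.1%}")
```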
PCA Finds a Product-Portfolio Space, Not a Moral Ranking
The first three principal components explain 45.2% of standardized behavioral variance. That is enough to make a useful map, not enough to pretend the map is the whole data. PC1 contrasts broad, high-entropy daily-number routines against concentrated rapid-resolution or venue-style play. PC2 separates checkout scratch retail from social Quick Draw venues. PC3, not plotted here, is largely an add-on and jackpot-sophistication axis.
Figure: PCA score space.

| Component | Interpretation | Positive loadings | Negative loadings |
|---|---|---|---|
| PC1 | Portfolio breadth and daily routine versus concentrated rapid play | Portfolio entropy; speed entropy; Daily Numbers share; habit index | Portfolio concentration; rapid-resolution share; bar context; Quick Draw share |
| PC2 | Checkout scratch retail versus social Quick Draw venues | Instant scratch share; convenience context; routine-checkout share; incidental share | Social-channel share; Quick Draw share; bar context; draw payout rate |
The PCA map is useful because it makes a negative result visible: there is no single "lottery intensity" line. Instead, ZIPs vary along several behavioral dimensions. A ZIP can be high in per-resident sales because it is dense and daily-number oriented, because it is a retail destination, or because it has a small population denominator. These are different sociological objects.
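For readers who want to see the mechanics behind the explained-variance and loading figures, here is a minimal sketch of PCA on standardized features using a singular value decomposition in NumPy. It runs on synthetic data, not the lottery file, and the helper names `standardize` and `pca` are illustrative. The article's own fit may have used a library routine instead.

```python
import numpy as np

def standardize(X):
    """Center each column and scale to unit variance (constant columns are left at zero)."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)
    return (X - mu) / sd

def pca(X, n_components):
    """Return scores, loadings (features x components), and explained-variance ratios."""
    Z = standardize(X)
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    scores = U[:, :n_components] * S[:n_components]
    loadings = Vt[:n_components].T
    return scores, loadings, explained[:n_components]

# Synthetic stand-in for a behavioral feature matrix: 200 rows, 6 features,
# with two features made strongly correlated so one component dominates.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)

scores, loadings, ratio = pca(X, n_components=3)
print("explained variance ratio:", np.round(ratio, 3))
print("cumulative:", round(float(ratio.sum()), 3))
```

The cumulative ratio is the quantity reported as 45.2% for the first three components in the article, and the columns of `loadings` are what the PC1 and PC2 interpretations are read from.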
Four ZIP Routines Are More Useful Than One Statewide Average
K-means with k=4 gives the smallest operationally readable segmentation: 410 dense daily-number ZIPs, 515 mixed retail ZIPs, 72 Quick Draw venue ZIPs, and 329 checkout scratch ZIPs. The silhouette scores are modest, which is typical for continuous social data; the point is not to discover natural species, but to create a stable lens for interpretation.
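The segmentation step can be sketched with a plain Lloyd's-iteration k-means and a mean silhouette score in NumPy. This is a stand-in for whatever library implementation produced the article's segments. The deterministic farthest-first seeding is an assumption chosen for reproducibility (library versions typically use k-means++ with several restarts), and the data are four synthetic, well-separated blobs, so the silhouette here is far higher than the modest values reported for the real ZIPs.

```python
import numpy as np

def farthest_first_centers(X, k):
    """Deterministic seeding: start at row 0, then repeatedly add the farthest point."""
    centers = [X[0]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dists)])
    return np.array(centers)

def assign(X, centers):
    """Index of the nearest center for every row."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, n_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centers = farthest_first_centers(X, k)
    for _ in range(n_iter):
        labels = assign(X, centers)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(X, centers), centers

def mean_silhouette(X, labels):
    """Average of (b - a) / max(a, b) over rows; singleton clusters score 0."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            scores.append(0.0)
            continue
        a = D[i, same].sum() / (same.sum() - 1)  # mean distance within own cluster
        b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Synthetic data: four tight blobs of 25 points each.
rng = np.random.default_rng(0)
blob_centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]], dtype=float)
X = np.vstack([c + 0.3 * rng.normal(size=(25, 2)) for c in blob_centers])

labels, centers = kmeans(X, k=4)
score = mean_silhouette(X, labels)
print("cluster sizes:", np.bincount(labels), "silhouette:", round(score, 3))
```

On continuous social data the clusters overlap, which is why a modest silhouette is compatible with a segmentation that is still useful as an interpretive lens.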
Figure: Four behavioral segments.

| Segment | Profile | Median income | Hispanic share | Black share |
|---|---|---|---|---|
| Dense Daily-Number Routines | Large, dense ZIPs where Daily Numbers share, portfolio entropy, and habit index are high. | $93,862 | 15.2% | 7.9% |
| Mixed Retail Portfolios | Mid-sized ZIPs with broad instant-ticket retail, some Quick Draw, and a balanced product portfolio. | $74,017 | 3.0% | 1.6% |
| Quick Draw Venue ZIPs | Small ZIPs where the behavioral signature is almost entirely Quick Draw and bar/social-channel context. | $68,214 | 1.0% | 0.1% |
| Checkout Scratch Corridors | Small-population ZIPs where instant tickets, routine checkout, and incidental convenience-store play dominate. | $71,563 | 2.1% | 0.5% |
| Segment | ZIPs | Sales/resident | Instant | Daily | Quick Draw | Habit |
|---|---|---|---|---|---|---|
| Dense Daily-Number Routines | 410 | 8.2 | 52% | 28% | 5% | 2.7 |
| Mixed Retail Portfolios | 515 | 6.9 | 70% | 8% | 10% | 0.4 |
| Quick Draw Venue ZIPs | 72 | 2.2 | 0% | 0% | 93% | 0.0 |
| Checkout Scratch Corridors | 329 | 3.9 | 76% | 7% | 1% | 0.2 |
The segment labels are analytical shorthand:
- Dense Daily-Number Routines. These ZIPs are large, heavily downstate, and show high Daily Numbers share, high portfolio entropy, and high habit index. They look less like occasional jackpot play and more like repeated low-denomination routines embedded in dense retail life.
- Mixed Retail Portfolios. This is the broad middle of the state: instant scratch tickets are high, Quick Draw is present but not dominant, and the product portfolio is balanced enough to avoid a single-channel story.
- Quick Draw Venue ZIPs. These ZIPs are small and specialized. Median Quick Draw share is 92.6%, with bar/social context defining the segment. This is less a neighborhood deprivation story than a venue ecology story.
- Checkout Scratch Corridors. These are small-population, routine checkout ZIPs where instant tickets and convenience context dominate. They are behaviorally regular but not high in habit index the way dense daily-number ZIPs are.
Demography Predicts Product Mix More Than Simple Intensity
The most important demographic result is a distinction between intensity and portfolio. Median sales per resident is surprisingly flat across income quartiles: 6.49 in the lowest-income quartile and 6.50 in the highest-income quartile. But product mix changes sharply. Higher-income ZIPs have higher median jackpot share (10.5% versus 6.3%) and higher portfolio entropy.
Figure: Demographic gradients (panels: income quartile by jackpot share; income quartile by habit index; Hispanic-share quartile by Daily Numbers share; Black-share quartile by habit index).
The racial/ethnic composition gradients are stronger, but they require more caution. ZIPs in the highest Hispanic-share quartile have median Daily Numbers share of 24.7%, compared with 4.4% in the lowest quartile. Their median habit index is 2.5, compared with 0.1. The same broad pattern appears for high-Black-share ZIPs. This is not evidence of individual preference by race or ethnicity. It is evidence that ZIP-level race/ethnicity, density, retailer ecology, and daily-number product routines are spatially entangled in New York.
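The quartile comparisons above amount to binning ZIPs by a demographic variable and taking the median of a behavioral outcome within each bin. The sketch below shows one way to do that in NumPy. The function name `quartile_medians` and the eight toy rows are invented, and the exact binning rule behind the article's figures is not documented, so treat the cut points as an assumption.

```python
import numpy as np

def quartile_medians(group_var, outcome):
    """Median of `outcome` within quartiles (1 = lowest) of `group_var`."""
    group_var = np.asarray(group_var, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    edges = np.quantile(group_var, [0.25, 0.5, 0.75])
    # Values at or below the first edge fall in bin 0, and so on up to bin 3.
    quartile = np.searchsorted(edges, group_var, side="left")
    return {
        q + 1: float(np.median(outcome[quartile == q]))
        for q in range(4)
        if np.any(quartile == q)
    }

# Invented toy data: median income (in $1,000s) and jackpot share for eight ZIPs.
income = [30, 40, 50, 60, 70, 80, 90, 100]
jackpot_share = [0.06, 0.07, 0.07, 0.08, 0.09, 0.09, 0.10, 0.11]

profile = quartile_medians(income, jackpot_share)
print(profile)
```

Because the grouping variable is held out of the unsupervised fit, this kind of profile describes how portfolios vary across ZIP types without feeding demographics back into the segments.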
The Daily-Number Pattern Is an Urban Interaction
The clearest sociological interaction is between density and demographic composition. Low-population, low-Hispanic-share ZIPs have median Daily Numbers share of 4.4% and habit index 0.1. High-population, high-Hispanic-share ZIPs have median Daily Numbers share of 28.3% and habit index 3.1.
Figure: Population × Hispanic-share interaction.
This interaction is the article's most sociologically interesting result. It suggests a neighborhood routine rather than a single psychological preference. Daily-number play appears to concentrate where dense retail circulation, repeated local transactions, and demographic composition coincide. That reading is still descriptive, but it turns the result into a set of candidate mechanisms: language-market retail, cash-oriented checkout routines, age structure, commuting nodes, and neighborhood lottery advertising. Those mechanisms are not in this file, but the unsupervised analysis tells us where to look.
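A two-way comparison like the population by Hispanic-share contrast can be sketched as cell medians over a pair of high/low splits. The article does not say where its low and high cut points sit, so the median split used here is an assumption, as are the field names and the eight invented rows.

```python
from statistics import median

def median_split_cells(rows, var_a, var_b, outcome):
    """Median of `outcome` in each (low/high var_a, low/high var_b) cell."""
    cut_a = median(r[var_a] for r in rows)
    cut_b = median(r[var_b] for r in rows)
    cells = {}
    for r in rows:
        key = (
            "high" if r[var_a] > cut_a else "low",
            "high" if r[var_b] > cut_b else "low",
        )
        cells.setdefault(key, []).append(r[outcome])
    return {key: median(values) for key, values in cells.items()}

# Invented toy rows; field names are hypothetical.
rows = [
    {"population": 500, "hispanic_share": 0.01, "daily_share": 0.03},
    {"population": 800, "hispanic_share": 0.02, "daily_share": 0.05},
    {"population": 1200, "hispanic_share": 0.20, "daily_share": 0.08},
    {"population": 1500, "hispanic_share": 0.25, "daily_share": 0.10},
    {"population": 20000, "hispanic_share": 0.03, "daily_share": 0.12},
    {"population": 30000, "hispanic_share": 0.04, "daily_share": 0.14},
    {"population": 40000, "hispanic_share": 0.30, "daily_share": 0.26},
    {"population": 60000, "hispanic_share": 0.35, "daily_share": 0.30},
]

cells = median_split_cells(rows, "population", "hispanic_share", "daily_share")
for key in sorted(cells):
    print(key, round(cells[key], 3))
```

An interaction shows up when the high/high cell departs from what the two single-variable contrasts would suggest on their own, which is the pattern the article describes for Daily Numbers share.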
Controlled Checks Preserve the Main Pattern but Do Not Make It Causal
To check whether the raw gradients were only proxies for region or population, I fit descriptive ridge models with standardized predictors: population, income, poverty, college share, Black share, Hispanic share, retailer density, and region indicators. These are not causal regressions. They are robustness checks against the easiest alternative explanation.
Figure: Controlled descriptive associations.

| Outcome | R² |
|---|---|
| Daily Numbers share | 0.73 |
| Jackpot share | 0.27 |
| Habit index | 0.84 |
| Log sales per resident | 0.39 |

The adjusted models reinforce three claims:
- Daily Numbers are spatially structured. The Daily Numbers model has high descriptive fit (R² 0.73). The NYC, Black-share, Hispanic-share, poverty, and income associations all remain positive after controls. That is a pattern to explain, not a causal estimate.
- Habit index is even more structured. The habit-index model has an R² of roughly 0.85, with NYC, Black share, population, downstate suburban location, and Hispanic share among the largest positive associations.
- Per-resident sales are more about access and size. Log sales per resident is most strongly associated with retailer density and population, while region effects distinguish downstate and upstate patterns. Income is not the central story for intensity.
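A descriptive ridge fit of the kind used for these checks can be written in closed form. The sketch below standardizes the predictors, leaves the intercept unpenalized, and reports in-sample R². It runs on synthetic data, and the penalty `alpha=1.0` is an assumption because the article does not report the penalty it used.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Ridge on standardized predictors; returns coefficients and in-sample R^2."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    y_mean = y.mean()
    n_features = Z.shape[1]
    # Closed form: (Z'Z + alpha * I)^-1 Z'(y - mean(y)). Centering removes the intercept.
    coef = np.linalg.solve(Z.T @ Z + alpha * np.eye(n_features), Z.T @ (y - y_mean))
    fitted = y_mean + Z @ coef
    r2 = 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y_mean) ** 2)
    return coef, float(r2)

# Synthetic stand-in: three predictors, the third unrelated to the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * rng.normal(size=300)

coef, r2 = fit_ridge(X, y, alpha=1.0)
print("standardized coefficients:", np.round(coef, 3), "R2:", round(r2, 3))
```

Standardized coefficients make the sizes of associations comparable across predictors, but correlated predictors still share credit under the ridge penalty, which is one more reason to read these models as descriptive.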
A Sociological Reading: Routines, Retail Ecology, and Product Form
The lottery portfolio is a social object because games differ in how they fit everyday life. Instant scratch tickets are immediate and checkout-compatible. Quick Draw is venue-compatible. Daily Numbers are routine-compatible: repeated, local, low-denomination, and easy to fold into a daily retail circuit. Jackpot games are event-compatible: they attach to large draws, add-ons, and occasional upside.
That frame explains why the demographic findings are not reducible to poverty. Poverty is present in the file, but it is not the master variable. Higher-income ZIPs show more jackpot share and portfolio entropy. Dense minority ZIPs show more Daily Numbers and habit. Small upstate ZIPs split between convenience scratch and venue Quick Draw. These patterns are about how lottery products are socially and commercially embedded, not simply how much money a ZIP has.
Robustness, Uncertainty, and What Could Change the Story
| Check | Result | Interpretation |
|---|---|---|
| Active-ZIP filter | 74 rows excluded from the main model | Zero sales, zero retailers, or very small populations would otherwise form a technical "inactive" cluster rather than a sociological segment. |
| Demographic holdout from model fit | PCA and k-means use behavior and access features only | Race, ethnicity, income, poverty, education, and population are profiling variables, not segmentation inputs. |
| k sensitivity | k=4 separates the two dominant retail routines, the dense daily-number pattern, and a small venue Quick Draw group. k=5 mainly splits the scratch/checkout cluster. | The main story survives as four broad routines; additional clusters mostly subdivide scratch/checkout ZIPs. |
| Controlled associations | Region, population, income, poverty, education, race/ethnicity, and retailer density entered the descriptive adjustment | The adjustment reduces but does not erase the dense daily-number pattern; it still does not identify causal mechanisms. |
Four limitations matter most.
First, the file lacks the original calendar metadata, so the analysis treats the pre_ variables as a baseline period without dating the period. Second, ZIPs are imperfect neighborhoods: commercial corridors, commuters, tourists, and destination retail can all enter the numerator without belonging to the resident denominator. Third, demographic variables are aggregate shares. They cannot be mapped to individual players. Fourth, the channel-context shares may overlap; a ZIP's retail context is not a mutually exclusive set of store types.
The next data improvements are clear:
- Add exact time-window metadata and repeat the analysis across time.
- Join ZIPs to county, borough, urbanicity, age, transit, and retail-establishment measures.
- Split resident denominators from retail-destination denominators where possible.
- Test whether the segment assignments remain stable under hierarchical clustering, Gower distance, or a model that excludes intensity variables entirely.
- Use the unsupervised segments to generate hypotheses for causal work, not to make policy claims directly.
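One way to carry out the stability test in the list above is to compare two sets of segment assignments with an agreement score that ignores label names. The sketch below uses the adjusted Rand index, which is an illustrative choice rather than a measure the article commits to. The label vectors are invented.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected pair agreement between two labelings of the same rows."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same rows")
    n = len(labels_a)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put every row in one cluster
    return (sum_cells - expected) / (max_index - expected)

# Same partition under different label names: perfect agreement.
kmeans_labels = [0, 0, 1, 1, 2, 2]
other_labels = [2, 2, 0, 0, 1, 1]
same_partition = adjusted_rand_index(kmeans_labels, other_labels)

# Two labelings that cut across each other: agreement below chance.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_partition, crossed)
```

A score near 1 across alternative clustering methods or feature sets would support treating the four routines as a stable lens, and a score near 0 would suggest that they depend on the method.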