§11.4

Case Study: Lottery ZIP Psychographics

Lottery data is a small but revealing example of what unsupervised learning can and cannot do. A ZIP code has no psychology in the individual sense. It has retailers, commuting patterns, neighborhood routines, product availability, demographic composition, and many players whose behavior is aggregated into one row. The task is not to infer who buys lottery tickets or why a person plays. The task is to ask whether neighborhood-level lottery portfolios differ in stable, interpretable ways.

This case study uses a baseline ZIP-level file derived from public NY Lottery data and joined to demographic measures. The original time-window metadata is unavailable, so the pre_ fields are treated as a baseline cross-section. The analysis is intentionally non-causal: PCA and k-means are fit on lottery behavior and retailer-access variables, then the resulting scores and segments are profiled by income, poverty, education, race/ethnicity, population, and region.

The strongest finding is not that "poor neighborhoods play the lottery." That story is too simple for this file. The more interesting pattern is that different kinds of neighborhoods play different product portfolios. Dense ZIPs, especially in NYC and downstate, are more daily-number and habit-index oriented. Small upstate ZIPs are more likely to split into routine checkout scratch corridors or specialized Quick Draw venue ZIPs. Higher-income ZIPs do not disappear from the lottery map; they shift toward jackpot-style portfolios and lower retailer density.


The Research Question

What neighborhood-level lottery routines can be recovered from ZIP-level behavior, and how do those routines interact with local demography?

That wording is deliberately ecological. The data supports statements about ZIPs, not individuals. A high-Hispanic-share ZIP with high Daily Numbers share does not prove that Hispanic residents buy Daily Numbers. It says that ZIP-level demographic composition, urban density, retail ecology, and product mix move together in the cross-section.


Scope, Measures, and Modeling Contract

The source file contains 1,400 ZIP rows and 59 columns. The main model uses 1,326 active ZIPs, or 94.7% of the source rows, after excluding rows with no recorded sales, no retailers, or population below 100. This exclusion is not a claim that the omitted ZIPs do not matter; it prevents inactive or unstable rows from becoming an artificial cluster.
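A minimal sketch of the active-ZIP filter described above. The field names (`sales`, `retailers`, `population`) and the ZIP values are hypothetical stand-ins, since the article does not list the source columns; only the three exclusion rules come from the text.

```python
from dataclasses import dataclass

@dataclass
class ZipRow:
    zip_code: str
    sales: float       # hypothetical field: recorded baseline sales
    retailers: int     # hypothetical field: number of lottery retailers
    population: int    # hypothetical field: resident population

def is_active(row: ZipRow, min_population: int = 100) -> bool:
    """Keep a ZIP only if it has sales, at least one retailer, and population >= 100."""
    return row.sales > 0 and row.retailers > 0 and row.population >= min_population

def filter_active(rows):
    """Return the active rows and their share of all source rows."""
    active = [r for r in rows if is_active(r)]
    share = len(active) / len(rows) if rows else 0.0
    return active, share

# Toy rows, not the lottery file.
rows = [
    ZipRow("00001", 1250.0, 4, 5200),
    ZipRow("00002", 0.0, 2, 800),     # no recorded sales: excluded
    ZipRow("00003", 310.0, 0, 1500),  # no retailers: excluded
    ZipRow("00004", 95.0, 1, 60),     # population below 100: excluded
    ZipRow("00005", 40.0, 1, 100),    # exactly at the threshold: kept
]
active, share = filter_active(rows)
print([r.zip_code for r in active], f"{share:.1%}")
```

Applied to the counts reported here, the same share calculation gives 1,326 / 1,400 = 94.7%, with 74 rows excluded.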

Table 1. The lottery psychographics case is a descriptive ZIP-level analysis; demographics are used for interpretation, not model fitting.
Item | Definition | Analytical implication
Unit | ZIP code cross-section | 1,400 source ZIP rows; 1,326 active rows after filtering zero/thin observations.
Behavior features | Product mix, channel context, timing, entropy, habit, retailer density, and log intensity | 38 standardized features enter PCA and k-means.
Demographics | Income, poverty, college share, Black share, Hispanic share, and population | Used only after fitting the unsupervised model, so demographics explain the segments rather than define them.
Interpretation | Descriptive ecological analysis | The row is a neighborhood-like ZIP aggregate. The article makes no individual-level or causal claim.

The feature set includes product shares (instant scratch, Quick Draw, Daily Numbers, jackpot), channel context (bar, convenience, grocery, chain, gas, and related contexts), temporal structure, entropy/concentration indices, add-on behavior, retailer density, and log sales intensity. The demographic fields are held out of the unsupervised fit. They enter only after the clusters and PCA scores exist.


Sales Are Concentrated, but Concentration Is Not the Whole Story

The recorded baseline volume is highly concentrated. The top 10% of active ZIPs account for 47.6% of observed lottery volume, and the top 20% account for 70.4%. NYC alone accounts for 42.4% of observed volume while representing only 13.1% of active ZIP rows.
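The concentration shares quoted above are top-k shares of a ranked volume column. A small sketch of that calculation follows; the volumes are toy numbers, and the rounding rule for how many rows count as the "top 10%" is an assumption, since the article does not say how ties or fractional row counts were handled.

```python
def top_share(volumes, top_fraction):
    """Share of total volume held by the largest `top_fraction` of rows."""
    if not volumes:
        raise ValueError("volumes must be non-empty")
    ranked = sorted(volumes, reverse=True)
    # Assumption: round to the nearest whole row, keeping at least one row.
    n_top = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:n_top]) / sum(ranked)

# Toy volumes, not the lottery data.
volumes = [500, 200, 100, 50, 50, 40, 30, 15, 10, 5]
for frac in (0.10, 0.20):
    print(f"top {frac:.0%} of rows: {top_share(volumes, frac):.1%} of volume")
```

The same function works for a fixed count such as "top 25 ZIPs" by passing `25 / len(volumes)` as the fraction.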

Statewide ZIP baseline

Source rows: 1.4K (1.3K active ZIPs used)
Behavior features: 38 (product mix, channel, timing, access)
Top 10% of ZIPs: 47.6% of recorded lottery volume
NYC sales share: 42.4%, against 13.1% of active ZIP rows

Sales concentration

Group | Share of recorded volume
Top 5% of ZIPs | 30.6%
Top 10% of ZIPs | 47.6%
Top 20% of ZIPs | 70.4%
Top 25 ZIPs | 15.5%

Region share of volume

Region | Share of recorded volume
NYC | 42.4%
Downstate suburbs | 25.6%
Capital/Hudson/North Country | 12.6%
Western/Finger Lakes | 11.9%
Central/Southern Tier | 7.5%

Figure 1. Statewide lottery baseline. The left panel shows concentration of observed lottery volume; the right panel shows that NYC and the downstate suburbs carry most recorded volume, even though the active ZIP rows cover the entire state.

This concentration has two implications. First, raw volume is not a neutral measure of neighborhood propensity because it absorbs population, commuting, retailer count, and destination retail. Second, a good unsupervised case should not stop at raw sales. It should ask which ZIPs have similar portfolios after standardizing behavior and access.


PCA Finds a Product-Portfolio Space, Not a Moral Ranking

The first three principal components explain 45.2% of standardized behavioral variance. That is enough to make a useful map, not enough to pretend the map is the whole data. PC1 contrasts broad, high-entropy daily-number routines against concentrated rapid-resolution or venue-style play. PC2 separates checkout scratch retail from social Quick Draw venues. PC3, not plotted here, is largely an add-on and jackpot-sophistication axis.
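To make the explained-variance figures concrete, here is a sketch of PCA on z-scored features using a plain SVD. The data are synthetic and the function names are illustrative; the article does not say which PCA implementation was used, so this shows the technique rather than the author's pipeline.

```python
import numpy as np

def standardize(X):
    """Z-score each column; constant columns are mapped to zero."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)
    return (X - mu) / sd

def pca(X, n_components):
    """PCA via SVD of the standardized matrix.

    Returns scores (rows x components), loadings (components x features),
    and the share of total variance explained by each kept component.
    """
    Z = standardize(X)
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    variance = S ** 2
    ratio = variance / variance.sum()
    scores = U[:, :n_components] * S[:n_components]
    return scores, Vt[:n_components], ratio[:n_components]

# Synthetic stand-in: 200 rows, 6 features driven by 2 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.3 * rng.normal(size=(200, 6))

scores, loadings, ratio = pca(X, n_components=3)
print("explained variance ratios:", ratio.round(3))
print("cumulative share:", ratio.sum().round(3))
```

Each row of `loadings` plays the role of the positive and negative feature lists reported for PC1 and PC2: large positive entries pull a ZIP toward one end of the axis, large negative entries toward the other.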

PCA score space

Axes: PC1 (portfolio breadth and daily routine) and PC2 (checkout scratch retail). PC1 explains 19.4% of standardized behavioral variance; PC2 explains 15.7%. Points are ZIPs, colored by k-means segment.

Segment | ZIPs
Daily routine | 410
Mixed retail | 515
Venue Quick Draw | 72
Checkout scratch | 329

PC1 loadings: portfolio breadth and daily routine versus concentrated rapid play
Positive: portfolio entropy, speed entropy, Daily Numbers share, habit index
Negative: portfolio concentration, rapid-resolution share, bar context, Quick Draw share

PC2 loadings: checkout scratch retail versus social Quick Draw venues
Positive: instant scratch share, convenience context, routine-checkout share, incidental share
Negative: social-channel share, Quick Draw share, bar context, draw payout rate

Figure 2. PCA score space for active ZIPs. The first axis separates broad daily-number routines from concentrated rapid-resolution or venue play; the second separates checkout scratch retail from social Quick Draw contexts.

The PCA map is useful because it makes a negative result visible: there is no single "lottery intensity" line. Instead, ZIPs vary along several behavioral dimensions. A ZIP can be high in per-resident sales because it is dense and daily-number oriented, because it is a retail destination, or because it has a small population denominator. These are different sociological objects.


Four ZIP Routines Are More Useful Than One Statewide Average

K-means with k=4 gives the smallest operationally readable segmentation: 410 dense daily-number ZIPs, 515 mixed retail ZIPs, 72 Quick Draw venue ZIPs, and 329 checkout scratch ZIPs. The silhouette scores are modest, which is typical for continuous social data; the point is not to discover natural species, but to create a stable lens for interpretation.
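The segmentation step and the silhouette check can be sketched as follows. This is plain Lloyd's k-means with a deterministic farthest-first initialization, which is swapped in here for reproducibility; the article does not state the initialization, the number of restarts, or the library used. The data are four synthetic blobs, far better separated than real ZIP behavior.

```python
import numpy as np

def farthest_first_centers(Z, k):
    """Deterministic init: start at row 0, then repeatedly take the farthest row."""
    centers = [Z[0]]
    for _ in range(k - 1):
        d = np.min([((Z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Z[d.argmax()])
    return np.array(centers)

def assign(Z, centers):
    """Label each row with the index of its nearest center."""
    d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(Z, k, n_iter=100):
    """Lloyd's algorithm: alternate assignment and center updates until stable."""
    centers = farthest_first_centers(Z, k)
    for _ in range(n_iter):
        labels = assign(Z, centers)
        new_centers = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(Z, centers), centers

def mean_silhouette(Z, labels):
    """Average of (b - a) / max(a, b) over rows; needs at least two clusters."""
    D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2))
    values = []
    for i in range(len(Z)):
        same = labels == labels[i]
        if same.sum() == 1:
            values.append(0.0)  # convention for singleton clusters
            continue
        a = D[i, same].sum() / (same.sum() - 1)  # D[i, i] is 0, so self is excluded
        b = min(D[i, labels == j].mean() for j in np.unique(labels) if j != labels[i])
        values.append((b - a) / max(a, b))
    return float(np.mean(values))

# Synthetic, well-separated blobs (not the lottery features).
rng = np.random.default_rng(0)
blob_centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]], dtype=float)
Z = np.vstack([c + 0.5 * rng.normal(size=(50, 2)) for c in blob_centers])
truth = np.repeat(np.arange(4), 50)

labels, centers = kmeans(Z, k=4)
print("cluster sizes:", np.bincount(labels))
print("mean silhouette:", round(mean_silhouette(Z, labels), 2))
```

On continuous social data the silhouette would be expected to sit well below what clean blobs produce, which is the sense in which the scores reported here are modest.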

Four behavioral segments

Dense Daily-Number Routines (410 ZIPs, 30.9% of active ZIPs). Large, dense ZIPs where Daily Numbers share, portfolio entropy, and habit index are high.
Mixed Retail Portfolios (515 ZIPs, 38.8%). Mid-sized ZIPs with broad instant-ticket retail, some Quick Draw, and a balanced product portfolio.
Quick Draw Venue ZIPs (72 ZIPs, 5.4%). Small ZIPs where the behavioral signature is almost entirely Quick Draw and bar/social-channel context.
Checkout Scratch Corridors (329 ZIPs, 24.8%). Small-population ZIPs where instant tickets, routine checkout, and incidental convenience-store play dominate.

Segment | Instant | Daily | Quick Draw | Jackpot | Habit | Sales/resident | Median income | Hispanic | Black
Dense Daily-Number Routines | 52% | 28% | 5% | 9% | 2.7 | 8.2 | $93,862 | 15.2% | 7.9%
Mixed Retail Portfolios | 70% | 8% | 10% | 6% | 0.4 | 6.9 | $74,017 | 3.0% | 1.6%
Quick Draw Venue ZIPs | 0% | 0% | 93% | 3% | 0.0 | 2.2 | $68,214 | 1.0% | 0.1%
Checkout Scratch Corridors | 76% | 7% | 1% | 9% | 0.2 | 3.9 | $71,563 | 2.1% | 0.5%

Figure 3. Segment profiles. Each segment is named after its behavioral profile, not after its demographic composition. The demographic rows at the bottom are post-hoc profiles.

Table 2. The four segments differ most sharply in product mix and retail context: dense daily-number routines, mixed portfolios, Quick Draw venues, and checkout scratch corridors.
Segment | ZIPs | Sales/resident | Instant | Daily | Quick Draw | Habit
Dense Daily-Number Routines | 410 | 8.2 | 52% | 28% | 5% | 2.7
Mixed Retail Portfolios | 515 | 6.9 | 70% | 8% | 10% | 0.4
Quick Draw Venue ZIPs | 72 | 2.2 | 0% | 0% | 93% | 0.0
Checkout Scratch Corridors | 329 | 3.9 | 76% | 7% | 1% | 0.2

The segment labels are analytical shorthand:

  1. Dense Daily-Number Routines. These ZIPs are large, heavily downstate, and show high Daily Numbers share, high portfolio entropy, and high habit index. They look less like occasional jackpot play and more like repeated low-denomination routines embedded in dense retail life.
  2. Mixed Retail Portfolios. This is the broad middle of the state: instant scratch tickets are high, Quick Draw is present but not dominant, and the product portfolio is balanced enough to avoid a single-channel story.
  3. Quick Draw Venue ZIPs. These ZIPs are small and specialized. Median Quick Draw share is 92.6%, with bar/social context defining the segment. This is less a neighborhood deprivation story than a venue ecology story.
  4. Checkout Scratch Corridors. These are small-population, routine checkout ZIPs where instant tickets and convenience context dominate. They are behaviorally regular but not high in habit index the way dense daily-number ZIPs are.

Demography Predicts Product Mix More Than Simple Intensity

The most important demographic result is a distinction between intensity and portfolio. Median sales per resident is surprisingly flat across income quartiles: 6.49 in the lowest-income quartile and 6.50 in the highest-income quartile. But product mix changes sharply. Higher-income ZIPs have higher median jackpot share (10.5% versus 6.3%) and higher portfolio entropy.
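The quartile comparison above is a post-hoc profile: ZIPs are ranked on a demographic variable, cut into four equal-sized groups, and a behavioral median is reported per group. A sketch with toy values follows; the numbers and variable names are illustrative, and breaking ties by input order is an assumption.

```python
from statistics import median

def quantile_bins(values, n_bins=4):
    """Rank-based bins 0..n_bins-1 of near-equal size; ties keep input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def median_by_bin(group_var, outcome, n_bins=4):
    """Median of `outcome` within each quantile bin of `group_var`."""
    bins = quantile_bins(group_var, n_bins)
    return [
        median(o for o, b in zip(outcome, bins) if b == q)
        for q in range(n_bins)
    ]

# Toy ZIP-level values: income in thousands of dollars, jackpot share as a fraction.
income = [30, 45, 52, 61, 70, 83, 95, 120]
jackpot_share = [0.06, 0.07, 0.06, 0.065, 0.07, 0.08, 0.10, 0.11]

for q, m in enumerate(median_by_bin(income, jackpot_share), start=1):
    print(f"Q{q}: median jackpot share {m:.1%}")
```

Because the grouping variable never enters the clustering, the same helper can profile any held-out demographic field against any behavioral measure.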

Demographic gradients

Income quartile | Jackpot share | Habit index
Q1 (lower income) | 6.3% | 0.3
Q2 | 6.3% | 0.2
Q3 | 7.4% | 0.4
Q4 (higher income) | 10.5% | 1.1

Hispanic-share quartile | Daily Numbers share
Q1 (lower Hispanic share) | 4.4%
Q2 | 6.8%
Q3 | 15.6%
Q4 (higher Hispanic share) | 24.7%

Black-share quartile | Habit index
Q1 (lower Black share) | 0.1
Q2 | 0.3
Q3 | 0.7
Q4 (higher Black share) | 2.6

Quartiles are ZIP-level groups. They describe neighborhood composition, not individual player behavior.
Figure 4. Demographic gradients. Income is more visible in jackpot and portfolio mix than in simple per-resident intensity; race/ethnicity composition is more visible in Daily Numbers and habit-index gradients.

The racial/ethnic composition gradients are stronger, but they require more caution. ZIPs in the highest Hispanic-share quartile have median Daily Numbers share of 24.7%, compared with 4.4% in the lowest quartile. Their median habit index is 2.5, compared with 0.1. The same broad pattern appears for high-Black-share ZIPs. This is not evidence of individual preference by race or ethnicity. It is evidence that ZIP-level race/ethnicity, density, retailer ecology, and daily-number product routines are spatially entangled in New York.


The Daily-Number Pattern Is an Urban Interaction

The clearest sociological interaction is between density and demographic composition. Low-population, low-Hispanic-share ZIPs have median Daily Numbers share of 4.4% and habit index 0.1. High-population, high-Hispanic-share ZIPs have median Daily Numbers share of 28.3% and habit index 3.1.
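A two-way grid of this kind can be built by cutting each variable into terciles and taking a median per cell. The sketch below uses synthetic data with an interaction built in on purpose, so that the steepening pattern is visible; the variable names and the generating formula are assumptions for illustration, not estimates from the lottery file.

```python
import numpy as np

def tercile(x):
    """Labels 0/1/2, split at the 1/3 and 2/3 quantiles of x."""
    cuts = np.quantile(x, [1 / 3, 2 / 3])
    return np.searchsorted(cuts, x, side="left")

def interaction_grid(row_var, col_var, outcome):
    """3x3 grids of cell medians and cell counts (rows: row_var, cols: col_var)."""
    r, c = tercile(row_var), tercile(col_var)
    med = np.full((3, 3), np.nan)
    n = np.zeros((3, 3), dtype=int)
    for i in range(3):
        for j in range(3):
            cell = (r == i) & (c == j)
            n[i, j] = cell.sum()
            if n[i, j]:
                med[i, j] = np.median(outcome[cell])
    return med, n

# Synthetic ZIPs: the outcome rises with Hispanic share, faster in larger ZIPs.
rng = np.random.default_rng(0)
n_zips = 900
population = rng.lognormal(mean=8.0, sigma=1.0, size=n_zips)
hispanic_share = rng.uniform(0.0, 1.0, size=n_zips)
pop_rank = population.argsort().argsort() / (n_zips - 1)
daily_share = (0.04 + 0.05 * hispanic_share
               + 0.25 * hispanic_share * pop_rank
               + rng.normal(0, 0.01, n_zips))

med, n = interaction_grid(population, hispanic_share, daily_share)
print(np.round(med, 3))
print(n)
```

Reporting the cell counts alongside the medians matters: in the grid reported here the high-population, low-Hispanic-share cell holds only 26 ZIPs, so its median is the least stable entry.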

Population x Hispanic-share interaction

Population | Lower Hispanic share | Middle Hispanic share | Higher Hispanic share
Low population | 4.4% (habit 0.1, n=261) | 6.2% (habit 0.2, n=119) | 13.1% (habit 0.3, n=62)
Middle population | 5.2% (habit 0.2, n=155) | 9.1% (habit 0.4, n=177) | 19.2% (habit 1.1, n=110)
High population | 8.7% (habit 0.6, n=26) | 14.8% (habit 1.1, n=146) | 28.3% (habit 3.1, n=270)

Cells show median Daily Numbers share. The gradient is strongest where high population and high Hispanic-share ZIPs overlap.
Figure 5. Population and Hispanic-share interaction. The daily-number gradient is not simply an ethnic-composition gradient; it steepens in high-population ZIPs, where retailer density, foot traffic, and repeated routines are more plausible mechanisms.

This interaction is the article's most sociologically interesting result. It suggests a neighborhood routine rather than a single psychological preference. Daily-number play seems to live where dense retail circulation, repeated local transactions, and demographic composition coincide. That reading is still descriptive, but it turns the result into a set of candidate mechanisms to test: language-market retail, cash-oriented checkout routines, age structure, commuting nodes, and neighborhood lottery advertising. None of those mechanisms is measured in this file, but the unsupervised analysis tells us where to look.


Controlled Checks Preserve the Main Pattern but Do Not Make It Causal

To check whether the raw gradients were only proxies for region or population, I fit descriptive ridge models with standardized predictors: population, income, poverty, college share, Black share, Hispanic share, retailer density, and region indicators. These are not causal regressions. They are robustness checks against the easiest alternative explanation.
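For readers who want to see what such a check involves, here is a closed-form ridge sketch on z-scored predictors. The penalty value, the choice to leave the outcome unstandardized, and the synthetic data are all assumptions; the article does not report the penalty, the software, or whether outcomes were also standardized, so coefficient magnitudes here are not comparable to the reported ones.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Ridge on z-scored predictors with an unpenalized intercept (the mean of y).

    Assumes every predictor column has non-zero variance.
    """
    mu, sd = X.mean(axis=0), X.std(axis=0)
    Z = (X - mu) / sd
    y_centered = y - y.mean()
    p = Z.shape[1]
    beta = np.linalg.solve(Z.T @ Z + alpha * np.eye(p), Z.T @ y_centered)
    return beta, y.mean(), (mu, sd)

def r_squared(X, y, beta, intercept, scaler):
    """In-sample R^2 of the fitted ridge model."""
    mu, sd = scaler
    pred = intercept + ((X - mu) / sd) @ beta
    return 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# Synthetic predictors: two that matter and one that does not.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * rng.normal(size=300)

beta, intercept, scaler = ridge_fit(X, y, alpha=10.0)
print("standardized coefficients:", beta.round(2))
print("in-sample R^2:", round(float(r_squared(X, y, beta, intercept, scaler)), 2))
```

Because the predictors share a common scale, the coefficients can be ranked by magnitude, which is how the bar charts in the figure are read. The penalty shrinks all coefficients toward zero, so they describe adjusted association, not effect size.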

Controlled descriptive associations

Daily Numbers share (R2 0.73)
Region: NYC | +0.66
Black share | +0.52
Region: Central/Southern Tier | -0.19
Poverty | +0.18
Hispanic share | +0.16
Income | +0.16

Jackpot share (R2 0.27)
Region: NYC | +0.41
Population | -0.39
Region: Central/Southern Tier | -0.29
Income | +0.29
College+ | +0.22
Poverty | +0.14

Habit index (R2 0.84)
Region: NYC | +0.97
Black share | +0.49
Population | +0.24
Region: Downstate suburbs | +0.23
Hispanic share | +0.14
Retailer density | +0.07

Log sales per resident (R2 0.39)
Retailer density | +0.64
Population | +0.48
Region: Central/Southern Tier | -0.29
Region: NYC | -0.23
Region: Downstate suburbs | +0.14
Income | -0.08

Bars are standardized ridge coefficients with region indicators and demographic/access controls. They are descriptive adjustments, not causal estimates.
Figure 6. Controlled descriptive associations. Region and population explain a large part of product mix, but the dense daily-number and habit-index patterns remain aligned with ZIP demographic composition after simple controls.

The adjusted models reinforce three claims:

  1. Daily Numbers are spatially structured. The Daily Numbers model has high descriptive fit (R2 0.73). NYC, Black share, Hispanic share, poverty, and income all remain positive after controls. That is a pattern to explain, not a causal estimate.
  2. Habit index is even more structured. The habit-index model has R2 0.84, with NYC, Black share, population, downstate suburban location, and Hispanic share among the largest positive associations.
  3. Per-resident sales is more about access and size. Log sales per resident loads most strongly on retailer density and population, while region effects distinguish downstate and upstate patterns. Income is not the central story for intensity.

A Sociological Reading: Routines, Retail Ecology, and Product Form

The lottery portfolio is a social object because games differ in how they fit everyday life. Instant scratch tickets are immediate and checkout-compatible. Quick Draw is venue-compatible. Daily Numbers are routine-compatible: repeated, local, low-denomination, and easy to fold into a daily retail circuit. Jackpot games are event-compatible: they attach to large draws, add-ons, and occasional upside.

That frame explains why the demographic findings are not reducible to poverty. Poverty is present in the file, but it is not the master variable. Higher-income ZIPs show more jackpot share and portfolio entropy. Dense minority ZIPs show more Daily Numbers and habit. Small upstate ZIPs split between convenience scratch and venue Quick Draw. These patterns are about how lottery products are socially and commercially embedded, not simply how much money a ZIP has.


Robustness, Uncertainty, and What Could Change the Story

Table 3. The case is rigorous because the main caveats are part of the result: active-row filtering, demographic holdout, k sensitivity, and descriptive controls all shape the interpretation.
Check | Result | Interpretation
Active-ZIP filter | 74 rows excluded from the main model | Zero sales, zero retailers, or very small populations would otherwise form a technical "inactive" cluster rather than a sociological segment.
Demographic holdout from model fit | PCA and k-means use behavior and access features only | Race, ethnicity, income, poverty, education, and population are profiling variables, not segmentation inputs.
k sensitivity | k=4 separates the two dominant retail routines, the dense daily-number pattern, and a small venue Quick Draw group. k=5 mainly splits the checkout scratch cluster. | The main story survives as four broad routines; additional clusters mostly subdivide checkout scratch ZIPs.
Controlled associations | Region, population, income, poverty, education, race/ethnicity, and retailer density entered the descriptive adjustment | The adjustment reduces but does not erase the dense daily-number pattern; it still does not identify causal mechanisms.

Four limitations matter most.

First, the file lacks the original calendar metadata, so the analysis treats the pre_ variables as a baseline period without dating the period. Second, ZIPs are imperfect neighborhoods: commercial corridors, commuters, tourists, and destination retail can all enter the numerator without belonging to the resident denominator. Third, demographic variables are aggregate shares. They cannot be mapped to individual players. Fourth, the channel-context shares may overlap; a ZIP's retail context is not a mutually exclusive set of store types.

The next data improvements are clear:

  1. Add exact time-window metadata and repeat the analysis across time.
  2. Join ZIPs to county, borough, urbanicity, age, transit, and retail-establishment measures.
  3. Split resident denominators from retail-destination denominators where possible.
  4. Test whether the segment assignments remain stable under hierarchical clustering, Gower distance, or a model that excludes intensity variables entirely.
  5. Use the unsupervised segments to generate hypotheses for causal work, not to make policy claims directly.
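One way to quantify the stability test in item 4 is to compare two label vectors for the same ZIPs with the adjusted Rand index (ARI), which is 1 for identical partitions and near 0 for chance-level agreement. The ARI is a suggestion here, not a method the article reports using; the label vectors below are toy examples.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two partitions of the same rows, corrected for chance."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two rows")
    # Pairs of rows placed together in both partitions, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (together_both - expected) / (maximum - expected)

# Same grouping under different cluster numbers: perfect agreement.
kmeans_labels = [0, 0, 1, 1, 2, 2]
relabeled = [2, 2, 0, 0, 1, 1]
print(adjusted_rand_index(kmeans_labels, relabeled))

# Two partitions that cut the rows in unrelated ways.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because the index ignores cluster numbering, it can compare k-means labels against hierarchical-clustering labels, or a run with intensity variables against a run without them, without first matching segment names.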