§13.4

Text Classification and Sentiment

Text classification is supervised learning, with the features built from language. Everything from Chapter 10 — train/test splits, calibration, the threshold–profit curve — applies. What changes is the feature engineering. The features are no longer "days since last purchase"; they are "presence of the word crashed", "TF-IDF weight of refund", "the average BERT embedding of the document". The decision logic is the same. The representations are richer and the failure modes are different.

This article walks through the two text classification tasks every customer-voice system needs, routing (which bucket?) and sentiment (positive, negative, neutral), and then adds the step that matters most for managers: aspect-based sentiment, where the model says what specifically the customer liked or disliked.

The Executive Question

Of the text the firm is receiving, what should be routed where, what should be flagged urgent, and which aspects of the product or service are driving the negative cases?

This is the same setup as §10.1, applied to a different input space. The answer is a probability for each class and a managerial threshold for action.
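A minimal sketch of how class probabilities and a threshold combine into an action. The function name `route`, the 0.70 threshold, and the probability values are illustrative assumptions, not figures from this chapter; the right threshold comes from the cost of a misroute versus the cost of a manual read.

```python
def route(probabilities, threshold=0.70):
    """Return (bucket, confidence, action) for one ticket.

    `probabilities` maps class name -> predicted probability.
    `threshold` is a hypothetical managerial cut-off.
    """
    bucket = max(probabilities, key=probabilities.get)
    confidence = probabilities[bucket]
    action = "auto-route" if confidence >= threshold else "human triage"
    return bucket, confidence, action


# Hypothetical model outputs for two tickets.
confident = {"billing": 0.08, "delivery": 0.81, "app issue": 0.06, "quality": 0.05}
unsure = {"billing": 0.40, "delivery": 0.10, "app issue": 0.15, "quality": 0.35}

print(route(confident))  # ('delivery', 0.81, 'auto-route')
print(route(unsure))     # ('billing', 0.4, 'human triage')
```

Raising the threshold sends more tickets to a human and fewer to the wrong queue; lowering it does the opposite.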

Routing: Many Classes, Operational Pay-off

A support team that handles a thousand tickets a day cannot read each one to decide whether it's a billing question, a delivery problem, an app bug, or a quality complaint. A classifier that assigns a bucket and a confidence is the standard fix.

Confusion matrix for a support-ticket router (held-out, 360 tickets)

predicted ↓ / actual →	billing	delivery	app issue	quality
billing	82	4	2	6
delivery	3	71	1	5
app issue	5	3	64	4
quality	10	12	7	76

The diagonal is the model getting tickets right. The brightest off-diagonal cell (12 delivery tickets misrouted to quality) is where retraining will pay the most.

Figure 1. Confusion matrix from a held-out evaluation of a four-class ticket router. The diagonal is the model getting tickets right; the brightest off-diagonal cell (12 delivery tickets misrouted to quality) is where the next round of feature work or labelled examples will pay off.

Three reading habits for multi-class confusion matrices:

The diagonal is the win. Sum the diagonal and divide by the total for an overall accuracy.
The brightest off-diagonal cell is the next investment. That confusion costs the team the most.
The asymmetry is informative. Billing predicted as quality is a different problem than quality predicted as billing.
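The three habits can be checked mechanically. The sketch below uses the cell counts exactly as printed in Figure 1, with rows as predicted classes and columns as actual classes; the variable names are illustrative.

```python
labels = ["billing", "delivery", "app issue", "quality"]

# Rows are predicted classes, columns are actual classes, as in Figure 1.
matrix = [
    [82, 4, 2, 6],
    [3, 71, 1, 5],
    [5, 3, 64, 4],
    [10, 12, 7, 76],
]
n = len(labels)

# Habit 1: diagonal over the sum of all cells gives overall accuracy.
total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(n))
accuracy = correct / total

# Habit 2: the largest off-diagonal cell is the most common confusion.
off_diagonal = [
    (matrix[p][a], labels[p], labels[a])
    for p in range(n)
    for a in range(n)
    if p != a
]
count, predicted, actual = max(off_diagonal)

# Habit 3: compare a cell with its mirror image across the diagonal.
billing_as_quality = matrix[labels.index("quality")][labels.index("billing")]
quality_as_billing = matrix[labels.index("billing")][labels.index("quality")]

print(f"accuracy: {accuracy:.3f}")
print(f"largest confusion: {count} '{actual}' tickets predicted as '{predicted}'")
print(f"billing predicted as quality: {billing_as_quality}")
print(f"quality predicted as billing: {quality_as_billing}")
```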

The standard recipe stays the same as Chapter 10: start with TF-IDF + logistic regression as the baseline, add features (n-grams, domain dictionary, metadata), move to a tree ensemble or a fine-tuned transformer only if the baseline's bias is the binding constraint.
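A minimal sketch of the baseline step in that recipe, assuming scikit-learn is available. The eight tickets are invented for illustration; a real router would train on thousands of labelled examples and be scored on a held-out set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy tickets, two per class.
texts = [
    "I was charged twice for my order",
    "the refund has not reached my card",
    "my package arrived three days late",
    "the courier left the box at the wrong address",
    "the app crashed when I opened my cart",
    "cannot log in to the app after the update",
    "the beans tasted stale",
    "the mug arrived cracked and chipped",
]
labels = [
    "billing", "billing",
    "delivery", "delivery",
    "app issue", "app issue",
    "quality", "quality",
]

# TF-IDF on unigrams and bigrams feeding a multi-class logistic regression.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, labels)

# One probability per class for a new ticket.
new_ticket = ["the app crashed when I tried to pay"]
probs = baseline.predict_proba(new_ticket)[0]
for cls, p in sorted(zip(baseline.classes_, probs), key=lambda pair: -pair[1]):
    print(f"{cls:10s} {p:.2f}")
```

With this little data the probabilities are not trustworthy; the point is the shape of the pipeline, which stays the same when the training set grows.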

Sentiment: A Number for the Tone of a Document

Sentiment classification is a special case of text classification with two or three target classes (positive / negative / neutral) and a large supply of off-the-shelf models. Two camps:

Dictionary-based (VADER, Loughran-McDonald). Predefined lists of words with polarity scores. Fast, transparent, easy to deploy. Struggles with sarcasm, complex negation, and domain-specific language. Often the right tool for monitoring on standard English text.
Transformer-based (BERT, FinBERT, RoBERTa). Models fine-tuned on labelled sentiment data. Handle context, negation, and longer documents better. Slower, opaque, often more accurate.
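To make the dictionary approach concrete, here is a deliberately naive sketch. The six-word lexicon and the function `dictionary_sentiment` are hypothetical; published dictionaries hold thousands of scored entries, and some add hand-written rules on top of the word list.

```python
import re

# Hypothetical toy lexicon: word -> polarity score.
LEXICON = {
    "great": 1.0,
    "love": 1.0,
    "incredible": 1.0,
    "awful": -1.0,
    "crashed": -1.0,
    "slow": -0.5,
}


def dictionary_sentiment(text):
    """Mean polarity of the lexicon words found in the text; 0.0 if none match."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


print(dictionary_sentiment("The coffee is great and I love the staff"))  # 1.0
print(dictionary_sentiment("The app crashed again, awful"))              # -1.0
# A pure word lookup has no notion of context, so "not great" still scores positive.
print(dictionary_sentiment("The coffee was not great"))                  # 1.0
```

The last line is the failure mode in miniature: the scorer sees "great" and never sees "not".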

The honest position: neither dominates. Dictionary models are excellent baselines and surprisingly hard to beat when the corpus is clean. Transformer models are necessary when domain-specific or sarcastic language is common. The next section — aspect-based sentiment — usually beats both for managerial use.

A standard view is sentiment over time, with event annotations:

Average review sentiment, weekly — the app outage in week 13

Figure 2. Weekly average sentiment on Bean & Basket reviews. The dip in weeks 12–15 corresponds to the app outage on May 12 — sentiment recovers but takes about three weeks to return to baseline.

The chart is more useful than a single overall sentiment number for one reason: it puts the context on the screen. The dip without the outage annotation is a curiosity; the dip with the annotation is a story.
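The aggregation behind such a chart is small. The sketch below assumes each review already carries a week number and a sentiment score in [-1, 1]; the scores are invented, and the `events` dictionary is a hypothetical stand-in for whatever incident log the firm keeps.

```python
from collections import defaultdict

# Hypothetical (week number, review sentiment) pairs.
reviews = [
    (11, 0.5), (11, 0.3),
    (12, 0.4),
    (13, -0.6), (13, -0.4),
    (14, -0.2),
    (15, 0.1),
    (16, 0.4),
]

# Hypothetical event annotations keyed by week number.
events = {13: "app outage"}


def weekly_mean(pairs):
    """Average the scores within each week, returned in week order."""
    buckets = defaultdict(list)
    for week, score in pairs:
        buckets[week].append(score)
    return {week: sum(s) / len(s) for week, s in sorted(buckets.items())}


weekly = weekly_mean(reviews)
for week, mean in weekly.items():
    label = events.get(week, "")
    print(f"week {week}: {mean:+.2f}  {label}")
```

Keeping the annotations in the same loop as the averages means the dip and its explanation always arrive together.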

Aspect-Based Sentiment: The Move That Matters

Overall sentiment hides the picture. A 3-star review that says "food was incredible but the wait was awful" looks the same to a sentiment classifier as one that says "totally fine, nothing memorable." Both score around 0. They encode different information for an operator.

Aspect-based sentiment scores the document against named aspects of the product or service. For Bean & Basket the natural aspects are coffee, pastry, staff, wait time, app, value. The output is a grid of (aspect × document) sentiment scores.
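One simple way to produce a row of that grid is to split a review into clauses, tag each clause with the aspects it mentions, and score the clause. The sketch below does this with keyword lists and a toy lexicon, both hypothetical; it is a stand-in for a trained aspect model, not a description of how any particular system works.

```python
import re

# Hypothetical keyword lists for the Bean & Basket aspects.
ASPECT_KEYWORDS = {
    "coffee": {"coffee", "espresso", "latte"},
    "pastry": {"pastry", "croissant", "muffin"},
    "staff": {"staff", "barista"},
    "wait time": {"wait", "queue", "line"},
    "app": {"app"},
    "value": {"price", "prices", "value"},
}

# Hypothetical toy polarity lexicon.
LEXICON = {"incredible": 1.0, "great": 1.0, "friendly": 1.0,
           "awful": -1.0, "crashed": -1.0, "slow": -0.5}


def aspect_sentiment(review):
    """Return {aspect: mean clause score} for the aspects the review mentions."""
    scores = {}
    # Split on "but" and on punctuation so contrasting clauses are scored apart.
    for clause in re.split(r"\bbut\b|[.,;!?]", review.lower()):
        tokens = re.findall(r"[a-z']+", clause)
        polarity = [LEXICON[t] for t in tokens if t in LEXICON]
        if not polarity:
            continue
        clause_score = sum(polarity) / len(polarity)
        for aspect, keywords in ASPECT_KEYWORDS.items():
            if keywords & set(tokens):
                scores.setdefault(aspect, []).append(clause_score)
    return {aspect: sum(v) / len(v) for aspect, v in scores.items()}


review = "The coffee was incredible but the wait was awful"
per_aspect = aspect_sentiment(review)
overall = sum(per_aspect.values()) / len(per_aspect) if per_aspect else 0.0

print(per_aspect)  # {'coffee': 1.0, 'wait time': -1.0}
print(overall)     # 0.0
```

The two aspect scores cancel to an overall 0.0, which is exactly the information a single document-level number throws away.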

Aspect-based sentiment — Bean & Basket, by store region

	NE	SE	MW	SW	WC
coffee	+0.60	+0.55	+0.70	+0.50	+0.45
pastry	+0.50	+0.60	+0.45	+0.40	+0.55
staff	+0.70	+0.40	+0.65	+0.55	+0.50
wait time	-0.30	-0.55	-0.10	-0.45	-0.60
app	-0.50	-0.40	-0.45	-0.30	-0.55
value	0.00	-0.15	+0.10	-0.20	-0.30


Overall sentiment hides the picture: coffee and staff are loved everywhere, the app is hurting everywhere, and wait time is negative in every region, worst in SE (-0.55) and WC (-0.60).

Figure 3. Aspect-based sentiment for Bean & Basket, by store region. Coffee and staff are loved everywhere; the app is hurting everywhere; wait time is negative in every region, with SE (-0.55) and WC (-0.60) hit hardest. None of this is visible from overall sentiment.

Two things to notice:

The aspects are a managerial choice. They reflect the levers the firm can pull. "Wait time" appears because there is an operations team that can fix it; "app" appears because there is a product team that owns it.
Geographic variation is the action. The headline number ("overall sentiment is +0.18 this week") doesn't tell anyone what to do on Monday. The heatmap tells the South-East operations lead exactly where to look.
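The contrast between a headline number and the heatmap can be computed from the Figure 3 values directly. In the sketch below, the headline for each region is taken to be the unweighted mean across the six aspects; that is an assumption made for illustration, not how the +0.18 quoted above was produced.

```python
regions = ["NE", "SE", "MW", "SW", "WC"]

# Values copied from Figure 3, one list per aspect in region order.
scores = {
    "coffee":    [0.60, 0.55, 0.70, 0.50, 0.45],
    "pastry":    [0.50, 0.60, 0.45, 0.40, 0.55],
    "staff":     [0.70, 0.40, 0.65, 0.55, 0.50],
    "wait time": [-0.30, -0.55, -0.10, -0.45, -0.60],
    "app":       [-0.50, -0.40, -0.45, -0.30, -0.55],
    "value":     [0.00, -0.15, 0.10, -0.20, -0.30],
}

# Headline per region: unweighted mean across aspects (an illustrative assumption).
headline = {
    region: sum(scores[aspect][i] for aspect in scores) / len(scores)
    for i, region in enumerate(regions)
}

# The actionable view: the lowest-scoring aspect in each region.
worst = {
    region: min(scores, key=lambda aspect: scores[aspect][i])
    for i, region in enumerate(regions)
}

for region in regions:
    aspect = worst[region]
    value = scores[aspect][regions.index(region)]
    print(f"{region}: headline {headline[region]:+.2f}, weakest aspect {aspect} ({value:+.2f})")
```

Every headline lands in a narrow band just above zero, while the weakest-aspect column names a different owner depending on the region.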

This is also the cleanest setup for the GPT-as-measurement move in §14.4. Once you can name aspects, you can also name constructs: "intent to return", "severity of complaint", "anger vs. disappointment". The aspect-based heatmap is the warm-up for the construct-based heatmap.

Two Variants Worth Knowing

Entity-based sentiment. Same idea, but the units are named entities (people, products, competitors) rather than aspects. The classical use case: a tweet that says "Goose Island used to be great, AB InBev ruined it" is negative on AB InBev and positive-but-nostalgic on Goose Island. Standard sentiment models return a single negative score and miss the asymmetry.

Emotion detection. Instead of three sentiment classes, fine-grained emotions: joy, anger, sadness, fear, disappointment, surprise, etc. Useful when the difference between "sad" and "angry" implies a different action — disappointment from a brand acquisition is not the same as anger about a defective product. The Goose Island brewery acquisition case in §14.2 leans heavily on this distinction.

Both are special cases of the same idea: aggregating sentiment to a unit smaller than the document gives operators the leverage to act.