§13.4
Text Classification and Sentiment
Text classification is supervised learning, with the features built from language. Everything from Chapter 10 — train/test splits, calibration, the threshold–profit curve — applies. What changes is the feature engineering. The features are no longer "days since last purchase"; they are "presence of the word crashed", "TF-IDF weight of refund", "the average BERT embedding of the document". The decision logic is the same. The representations are richer and the failure modes are different.
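To make the feature idea concrete, here is a minimal sketch of a word-presence feature and a TF-IDF weight, using only the standard library. The three tickets are hypothetical, and the weighting shown (term frequency times log(N / df)) is one common textbook variant; library implementations typically add smoothing and normalisation, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus; each string stands in for one support ticket.
docs = [
    "app crashed after the update",
    "refund still not received",
    "app crashed again please refund",
]

def tokenize(text):
    return text.lower().split()

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tfidf(docs)

# Binary "presence of the word crashed" feature, one value per document.
has_crashed = [int("crashed" in tokenize(d)) for d in docs]
print(has_crashed)  # [1, 0, 1]

# "refund" appears in two of three tickets, so it is down-weighted
# relative to "still", which appears in only one.
print(weights[1]["refund"] < weights[1]["still"])  # True
```

Either representation turns a document into a row of numbers, which is all the Chapter 10 machinery needs.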
This article walks through the two text classification tasks every customer-voice system needs: routing (which bucket?) and sentiment (positive, negative, neutral). It then adds the step that matters most for managers: aspect-based sentiment, where the model says what specifically the customer liked or disliked.
The Executive Question
Of the text the firm is receiving, what should be routed where, what should be flagged urgent, and which aspects of the product or service are driving the negative cases?
The same setup as §10.1, applied to a different input space. The answer is a probability for each class and a managerial threshold for action.
Routing: Many Classes, Operational Pay-off
A support team that handles a thousand tickets a day cannot read each one to decide whether it's a billing question, a delivery problem, an app bug, or a quality complaint. A classifier that assigns a bucket and a confidence is the standard fix.
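A bucket plus a confidence can be turned into an action with a single threshold. The sketch below assumes the classifier already returns one probability per bucket; the 0.70 threshold, the ticket IDs, and the "manual triage" fallback are illustrative assumptions, not values from this chapter.

```python
# Bucket names follow the router described in the text.
BUCKETS = ["billing", "delivery", "app issue", "quality"]

def route(probabilities, threshold=0.70):
    """Return (destination, confidence); low-confidence tickets go to a human."""
    confidence = max(probabilities)
    bucket = BUCKETS[probabilities.index(confidence)]
    if confidence < threshold:
        return "manual triage", confidence
    return bucket, confidence

# Hypothetical per-class probabilities for three tickets.
tickets = {
    "T-1": [0.91, 0.03, 0.02, 0.04],
    "T-2": [0.40, 0.35, 0.05, 0.20],
    "T-3": [0.05, 0.10, 0.80, 0.05],
}
for ticket_id, probs in tickets.items():
    print(ticket_id, route(probs))
```

Raising the threshold sends more tickets to manual triage and fewer to the wrong queue; where to set it is the same cost trade-off as the threshold–profit curve in Chapter 10.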
Confusion matrix for a support-ticket router (held-out, 355 tickets)
| predicted ↓ / actual → | billing | delivery | app issue | quality |
|---|---|---|---|---|
| billing | 82 | 4 | 2 | 6 |
| delivery | 3 | 71 | 1 | 5 |
| app issue | 5 | 3 | 64 | 4 |
| quality | 10 | 12 | 7 | 76 |
The diagonal is the model getting tickets right. The brightest off-diagonal cell, 12 delivery tickets misrouted to quality, is where retraining will pay the most.
Three reading habits for multi-class confusion matrices:
- The diagonal is the win. Sum the diagonal and divide by the total for an overall accuracy.
- The brightest off-diagonal cell is the next investment. That confusion costs the team the most.
- The asymmetry is informative. Billing predicted as quality is a different problem than quality predicted as billing.
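The three habits can be checked mechanically. This sketch hard-codes the matrix from the table above (rows are predicted classes, columns are actual classes) and uses only the standard library; the cells as printed sum to 355.

```python
LABELS = ["billing", "delivery", "app issue", "quality"]

# Rows are predicted classes, columns are actual classes, as in the table.
MATRIX = [
    [82, 4, 2, 6],
    [3, 71, 1, 5],
    [5, 3, 64, 4],
    [10, 12, 7, 76],
]
n = len(LABELS)

# Habit 1: sum the diagonal and divide by the total.
total = sum(sum(row) for row in MATRIX)
correct = sum(MATRIX[i][i] for i in range(n))
accuracy = correct / total
print(f"accuracy = {correct}/{total} = {accuracy:.3f}")  # accuracy = 293/355 = 0.825

# Habit 2: find the largest off-diagonal cell as (count, predicted, actual).
worst = max(
    (MATRIX[p][a], LABELS[p], LABELS[a])
    for p in range(n)
    for a in range(n)
    if p != a
)
print(worst)  # (12, 'quality', 'delivery')

# Habit 3: compare the two directions of one confusion pair.
billing_as_quality = MATRIX[3][0]  # actual billing, predicted quality
quality_as_billing = MATRIX[0][3]  # actual quality, predicted billing
print(billing_as_quality, quality_as_billing)  # 10 6
```

Ranking cells by raw count treats every misroute as equally costly; if some misroutes are more expensive than others, weight each cell by its cost before taking the maximum.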
The standard recipe stays the same as Chapter 10: start with TF-IDF + logistic regression as the baseline, add features (n-grams, domain dictionary, metadata), move to a tree ensemble or a fine-tuned transformer only if the baseline's bias is the binding constraint.
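A minimal sketch of that baseline, assuming scikit-learn is installed. The eight tickets are hypothetical and far too few for a real router; they only show the shape of the pipeline. The bigram setting stands in for the "add n-grams" step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled tickets; a real router needs many more, plus a held-out split.
texts = [
    "charged twice on my card",
    "refund for the duplicate charge",
    "package never arrived",
    "courier left the parcel at the wrong address",
    "app crashes on login",
    "checkout screen freezes in the app",
    "coffee tasted burnt",
    "pastry was stale and dry",
]
labels = ["billing", "billing", "delivery", "delivery",
          "app issue", "app issue", "quality", "quality"]

# TF-IDF over unigrams and bigrams, feeding a multi-class logistic regression.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

query = ["refund the duplicate charge on my card"]
predicted = model.predict(query)[0]
# One probability per class, in the order given by model.classes_.
probabilities = dict(zip(model.classes_, model.predict_proba(query)[0]))
print(predicted)
print({label: round(float(p), 2) for label, p in probabilities.items()})
```

The per-class probabilities are what a routing threshold acts on; with so little training data they will be close together, which is itself a reminder that the baseline needs real volume before its confidence means much.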
Sentiment: A Number for the Tone of a Document
Sentiment classification is a special case of text classification with two or three target classes (positive / negative / neutral) and a heavy industry of off-the-shelf models. Two camps:
- Dictionary-based (VADER, Loughran-McDonald). Predefined lists of words with polarity scores. Fast, transparent, easy to deploy. Struggles with sarcasm, complex negation, and domain-specific language, although some tools (VADER among them) add rules for simple negation. Often the right tool for monitoring on standard English text.
- Transformer-based (BERT, FinBERT, RoBERTa). Models fine-tuned on labelled sentiment data. Handle context, negation, and longer documents better. Slower, opaque, often more accurate.
The honest position: neither dominates. Dictionary models are excellent baselines and surprisingly hard to beat when the corpus is clean. Transformer models are necessary when domain-specific or sarcastic language is common. The next section — aspect-based sentiment — usually beats both for managerial use.
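The dictionary approach, and the way it breaks, fits in a few lines. The lexicon below is a hypothetical toy, not VADER or Loughran-McDonald, and it deliberately has no negation or sarcasm rules, so that the failure modes are visible.

```python
# Hypothetical toy lexicon; real dictionaries are far larger and
# some add rules of their own on top of the word list.
LEXICON = {"great": 1.0, "love": 1.0, "good": 0.5,
           "bad": -0.5, "awful": -1.0, "crashed": -1.0}

def lexicon_score(text):
    """Average polarity of the lexicon words found in the text (0.0 if none)."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    hits = [LEXICON[w] for w in words if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

print(lexicon_score("Love the coffee, great staff"))  # 1.0
print(lexicon_score("The app crashed again"))         # -1.0
print(lexicon_score("Not good at all"))               # 0.5, negation missed
print(lexicon_score("Oh great, another outage"))      # 1.0, sarcasm missed
```

The first two results are why dictionaries make good baselines; the last two are why a fine-tuned model earns its cost on messier text.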
A standard view is sentiment over time, with event annotations:
Average review sentiment, weekly — the app outage in week 13
The chart is more useful than a single overall sentiment number for one reason: it puts the context on the screen. The dip without the outage annotation is a curiosity; the dip with the annotation is a story.
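The data behind such a chart is a group-by on week plus a lookup of annotated events. In this sketch the review scores are hypothetical; only the week-13 outage label comes from the chart title.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (week, sentiment score) pairs, one per review.
reviews = [(12, 0.4), (12, 0.2), (13, -0.6), (13, -0.4), (14, 0.1), (14, 0.3)]

# Event annotations keyed by week.
events = {13: "app outage"}

by_week = defaultdict(list)
for week, score in reviews:
    by_week[week].append(score)

# Weekly average, in week order.
weekly = {week: round(mean(scores), 2) for week, scores in sorted(by_week.items())}

for week, avg in weekly.items():
    note = f"  <- {events[week]}" if week in events else ""
    print(f"week {week}: {avg:+.2f}{note}")
```

Keeping the annotations in a separate table from the scores means the same series can be re-plotted as new events are logged.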
Aspect-Based Sentiment: The Move That Matters
Overall sentiment hides the picture. A 3-star review that says "food was incredible but the wait was awful" looks the same to a sentiment classifier as one that says "totally fine, nothing memorable." Both score around 0. They encode different information for an operator.
Aspect-based sentiment scores the document against named aspects of the product or service. For Bean & Basket the natural aspects are coffee, pastry, staff, wait time, app, value. The output is a grid of (aspect × document) sentiment scores.
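A rule-based sketch shows the mechanics: split the review into clauses, find which aspect each clause mentions, and score the clause with a polarity lexicon. The keyword lists and polarity values are hypothetical, and only three of the six aspects are included; production systems usually learn these mappings with a fine-tuned model.

```python
import re

# Hypothetical keyword lists for three of the Bean & Basket aspects.
ASPECT_KEYWORDS = {
    "coffee": {"coffee", "espresso", "latte"},
    "wait time": {"wait", "queue", "line"},
    "app": {"app", "checkout"},
}
# Hypothetical polarity lexicon.
POLARITY = {"incredible": 1.0, "great": 1.0, "fine": 0.0,
            "slow": -0.5, "awful": -1.0}

def aspect_sentiment(review):
    """Score each aspect by the polarity words in the clause that mentions it."""
    scores = {}
    # Split on punctuation and on a contrastive "but" so clauses are scored alone.
    for clause in re.split(r"[,.;]|\bbut\b", review.lower()):
        words = set(re.findall(r"[a-z]+", clause))
        polarity = [POLARITY[w] for w in words if w in POLARITY]
        if not polarity:
            continue
        for aspect, keywords in ASPECT_KEYWORDS.items():
            if words & keywords:
                scores[aspect] = sum(polarity) / len(polarity)
    return scores

print(aspect_sentiment("The coffee was incredible but the wait was awful"))
# {'coffee': 1.0, 'wait time': -1.0}
print(aspect_sentiment("Totally fine, nothing memorable"))
# {}
```

Both reviews average out to roughly zero at the document level, yet only the first one produces rows in the aspect grid, and those rows point at two different teams.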
Aspect-based sentiment — Bean & Basket, by store region
| | NE | SE | MW | SW | WC |
|---|---|---|---|---|---|
| coffee | +0.60 | +0.55 | +0.70 | +0.50 | +0.45 |
| pastry | +0.50 | +0.60 | +0.45 | +0.40 | +0.55 |
| staff | +0.70 | +0.40 | +0.65 | +0.55 | +0.50 |
| wait time | -0.30 | -0.55 | -0.10 | -0.45 | -0.60 |
| app | -0.50 | -0.40 | -0.45 | -0.30 | -0.55 |
| value | 0.00 | -0.15 | +0.10 | -0.20 | -0.30 |
Overall sentiment hides the picture: coffee and staff are liked everywhere, the app is hurting everywhere, and wait time is worst in the SE and WC regions.
Two things to notice:
- The aspects are a managerial choice. They reflect the levers the firm can pull. "Wait time" appears because there is an operations team that can fix it; "app" appears because there is a product team that owns it.
- Geographic variation is the action. The headline number ("overall sentiment is +0.18 this week") doesn't tell anyone what to do on Monday. The heatmap tells the South-East operations lead exactly where to look.
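Reading the grid can also be automated. The sketch below hard-codes the table above and reports the weakest region per aspect plus every cell at or below an alert level; the -0.50 cutoff is an illustrative assumption.

```python
REGIONS = ["NE", "SE", "MW", "SW", "WC"]

# Aspect-by-region scores copied from the table above.
SCORES = {
    "coffee":    [0.60, 0.55, 0.70, 0.50, 0.45],
    "pastry":    [0.50, 0.60, 0.45, 0.40, 0.55],
    "staff":     [0.70, 0.40, 0.65, 0.55, 0.50],
    "wait time": [-0.30, -0.55, -0.10, -0.45, -0.60],
    "app":       [-0.50, -0.40, -0.45, -0.30, -0.55],
    "value":     [0.00, -0.15, 0.10, -0.20, -0.30],
}

# Weakest region for each aspect.
worst_region = {
    aspect: REGIONS[row.index(min(row))] for aspect, row in SCORES.items()
}

# Every (aspect, region) cell at or below the alert level.
ALERT_LEVEL = -0.50  # assumed cutoff, chosen for illustration
flagged = [
    (aspect, region)
    for aspect, row in SCORES.items()
    for region, score in zip(REGIONS, row)
    if score <= ALERT_LEVEL
]

print(worst_region["wait time"])  # WC
print(flagged)
# [('wait time', 'SE'), ('wait time', 'WC'), ('app', 'NE'), ('app', 'WC')]
```

Each flagged pair names both an owner (the aspect's team) and a place (the region), which is the information a single headline number leaves out.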
This is also the cleanest setup for the GPT-as-measurement move in §14.4. Once you can name aspects, you can also name constructs: "intent to return", "severity of complaint", "anger vs. disappointment". The aspect-based heatmap is the warm-up for the construct-based heatmap.
Two Variants Worth Knowing
Entity-based sentiment. Same idea, but the units are named entities (people, products, competitors) rather than aspects. The classical use case: a tweet that says "Goose Island used to be great, AB InBev ruined it" is negative on AB InBev and positive-but-nostalgic on Goose Island. Standard sentiment models return a single negative score and miss the asymmetry.
Emotion detection. Instead of three sentiment classes, fine-grained emotions: joy, anger, sadness, fear, disappointment, surprise, etc. Useful when the difference between "sad" and "angry" implies a different action — disappointment from a brand acquisition is not the same as anger about a defective product. The Goose Island brewery acquisition case in §14.2 leans heavily on this distinction.
Both are special cases of the same idea: reporting something finer than one score per document, whether per entity or per emotion, gives operators the leverage to act.