§13.6

Limits of Classical NLP

Bag-of-words gets a remarkable amount right. A TF-IDF classifier on customer reviews will identify complaints, route tickets, and surface aspect-level sentiment well enough to ship. The gap appears in two places: when the words are misleading because of context, and when the question the manager actually wants to ask is about meaning rather than vocabulary. This article inventories those gaps. The next chapter is the response.

The article is short on purpose. The point is to install a small mental gallery of failure patterns, so that when an embedding or an LLM solves them, the manager understands why the upgrade was worth the cost.

The Executive Question

Where, predictably, do bag-of-words and dictionary methods get the meaning wrong — and when does it matter enough to upgrade the representation?

The honest version: the failures are real, the cost of the upgrade is also real, and the right question is which subset of cases needs the upgrade.

The Failure Gallery

A short, recurring catalogue. Each row is a kind of mistake classical NLP makes, with an example that any team will eventually hit on their own data.

Where bag-of-words and dictionary sentiment quietly fail

Kind	Example	Surface read	What it really says
Sarcasm	"I just love waiting forty minutes for cold coffee."	positive (love)	strongly negative
Negation	"not bad — actually really good"	mixed (bad / good)	positive
Polysemy	"cold brew is amazing"	negative (cold)	positive (cold brew = product)
Domain idiom	"this app is killing me"	extreme negative	mild frustration
Mixed sentiment	"latte was perfect but the wait was awful"	mixed	positive on drink, negative on service
Context shift	"premium price" (luxury review)	negative (price)	positive — premium = quality

Bag-of-words knows which words appear, not what they mean together. The fix is a representation that places similar meanings near each other.

Figure 1. Where classical NLP quietly fails. The 'Surface read' column shows what a TF-IDF or dictionary model returns; the 'What it really says' column shows what the customer meant. How much each row costs you depends on how often the case appears in your corpus.

A walk-through of each category:

Sarcasm. Summing the word polarities gives a positive score; the document is in fact negative. Sarcasm is rare in support tickets but common on social media, so the cost of getting it wrong depends on where the model is deployed.
Negation. The word "good" appears; the context is "not good". Marking the tokens that follow "not" before counting (for example, turning "good" into "NOT_good") helps; some libraries' default stop-word lists actively make this worse by deleting "not" altogether.
Polysemy. "Cold" is negative in one corpus and part of a product name in another. Without context, the model can't tell. The classical fix, n-gram dictionaries, covers the most consequential phrases but doesn't scale, because every new phrase has to be added by hand.
Domain idiom. "This app is killing me" reads as extreme negative on a polarity dictionary; in practice it is mild frustration. Domain-specific lexicons help, but only for the phrases the firm bothers to enumerate.
Mixed sentiment. "Latte was perfect but the wait was awful" averages to near zero. Aspect-based sentiment (§13.4) is the proper fix; bag-of-words on the whole review averages the signal away.
Context shift. "Premium price" is negative in a budget-segment review and positive in a luxury-segment one. The same words flip polarity by context.
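The negation case is the easiest one to reproduce by hand. The following is a minimal sketch, not a production scorer: the four-word `LEXICON`, the `NEGATORS` set, and the one-token `window` are all hypothetical choices made for illustration. It compares a plain dictionary sum with a version that marks the token after a negator, and then shows what happens when a stop-word list removes "not" first.

```python
import re

# Hypothetical toy lexicon and negator list, for illustration only.
LEXICON = {"good": 1, "perfect": 1, "bad": -1, "awful": -1}
NEGATORS = {"not", "never", "no"}

def tokenize(text, stop_words=frozenset()):
    """Lowercase, keep runs of letters, and drop any stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stop_words]

def mark_negation(tokens, window=1):
    """Prefix the `window` tokens after a negator with NOT_; drop the negator."""
    marked, left = [], 0
    for tok in tokens:
        if tok in NEGATORS:
            left = window
        elif left > 0:
            marked.append("NOT_" + tok)
            left -= 1
        else:
            marked.append(tok)
    return marked

def score(tokens):
    """Sum polarities; a NOT_ prefix flips the sign of the word it marks."""
    total = 0
    for tok in tokens:
        if tok.startswith("NOT_"):
            total -= LEXICON.get(tok[4:], 0)
        else:
            total += LEXICON.get(tok, 0)
    return total

for text in ["not good", "not bad, actually really good"]:
    toks = tokenize(text)
    print(text, "| surface:", score(toks), "| marked:", score(mark_negation(toks)))

# A stop-word list that contains "not" deletes the cue before marking can use it.
toks = tokenize("not good", stop_words={"not"})
print(toks, "| marked:", score(mark_negation(toks)))
```

With a one-token window, "not very good" would mark "very" and leave "good" untouched, which is why this kind of rule helps without closing the gap.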

The pattern in all six is the same: bag-of-words knows which words appear, not what they mean together. That sentence is the entire motivation for the rest of Part V.

How Much Does It Matter?

The honest answer depends on the corpus. Two heuristics:

What fraction of the corpus falls into the failure modes? Read a random sample of 100 documents and tag each one that hits any of the patterns above. If the share is around 5%, the classical model is probably good enough; if it is around 30%, the failures are eating real value.
Are the failures concentrated in the cases that matter most? A sentiment model that gets ordinary reviews right and fails on the most strongly negative ones is worse than its average accuracy suggests. The tail is what gets attention.
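Both heuristics reduce to a few lines of counting once the sample has been tagged. The sketch below assumes a hand-tagged sample of 100 documents stored as dictionaries; the field names `patterns` and `high_stakes` are hypothetical, and every number in the sample is made up to illustrate the calculation, not drawn from any real corpus.

```python
from collections import Counter

# Hypothetical hand-tagged sample of 100 documents. Each record lists the
# failure patterns a reader spotted and flags high-stakes cases (for example,
# strongly negative reviews). All counts here are invented for illustration.
sample = (
    [{"patterns": ["sarcasm"], "high_stakes": True}] * 4
    + [{"patterns": ["negation"], "high_stakes": True}] * 2
    + [{"patterns": [], "high_stakes": True}] * 4
    + [{"patterns": ["polysemy"], "high_stakes": False}] * 6
    + [{"patterns": [], "high_stakes": False}] * 84
)

def failure_rate(docs):
    """Share of documents tagged with at least one failure pattern."""
    docs = list(docs)
    if not docs:
        return 0.0
    return sum(1 for d in docs if d["patterns"]) / len(docs)

# Heuristic 1: how much of the corpus sits in the failure modes?
overall = failure_rate(sample)

# Heuristic 2: are the failures concentrated where it matters most?
tail = failure_rate(d for d in sample if d["high_stakes"])

# Which patterns account for the tagged documents?
by_pattern = Counter(p for d in sample for p in d["patterns"])

print(f"overall: {overall:.0%}, high-stakes: {tail:.0%}")
print(by_pattern.most_common())
```

In this invented sample the overall rate looks tolerable while the high-stakes rate does not, which is the situation the second heuristic is meant to catch. A sample of 100 gives only a rough estimate, so treat the percentages as a prompt for a closer look, not as a measurement.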

Most teams discover the failure modes the same way: a classical model ships, looks great in aggregate, and then someone reads a few high-confidence predictions out loud and finds enough of the gallery above to embarrass the model.

What the Failures Have in Common

A bag-of-words representation places every document in a space with one dimension per vocabulary word. Two documents are close only if they share words; they cannot be close if they share meaning without sharing vocabulary. That is the structural limit, and it produces every category in the gallery:

Sarcasm — same words, opposite meaning. Bag-of-words can't tell.
Negation — words modified by context. Bag-of-words ignores order.
Polysemy — same word, different meanings. Bag-of-words has no context.
Idiom — phrases that mean something different from their parts. Bag-of-words decomposes meaning into independent words.
Mixed sentiment — aspects with different polarities. Bag-of-words averages.
Context shift — the same words colored by surrounding context. Bag-of-words ignores it.

The structural fix is a representation that places documents with similar meanings near each other, whether or not their vocabulary overlaps. That representation is the embedding.
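The structural limit can be checked directly with raw word counts and cosine similarity. This is a minimal sketch using plain counts instead of TF-IDF weights, and the example sentences are invented for illustration; `bow` and `cosine` are helper names introduced here, not part of any library.

```python
import math
import re
from collections import Counter

def bow(text):
    """Bag-of-words vector: word -> count, with word order discarded."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Similar meaning, no shared vocabulary: nothing overlaps, so similarity is zero.
slow_a = bow("the wait was awful")
slow_b = bow("service took forever")
print(cosine(slow_a, slow_b))

# Same vocabulary, different word order: the two vectors are identical.
praise = bow("good, not bad")
complaint = bow("bad, not good")
print(praise == complaint)
```

The first pair is what a customer-voice analyst would call the same complaint, yet the two documents are as far apart as the space allows. The second pair is indistinguishable to the model even though a reader would not treat the two sentences as interchangeable.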

Two Honest Caveats Before Moving On

Two things worth saying before the next chapter takes over:

Classical NLP doesn't go away. Even production systems that use embeddings or LLMs frequently keep a TF-IDF baseline, because the failure modes of embeddings (drift, opacity, cost) are different from those of TF-IDF, so each approach covers for the other. The right architecture often runs both.
Embeddings and LLMs have their own gallery. They handle the six cases above well. They introduce new failure modes — hallucinated relationships, sensitivity to prompt or model version, opaque attributions. The rest of Part V is partly about that new gallery.

The right framing isn't "classical NLP is obsolete." It is "classical NLP and modern methods solve different cases of the same problem, and a serious customer-voice system uses both."

Concept check

Three questions spanning the chapter — framing unstructured work, text classification, and the limits of word counts.

1. A team has a clean churn model from Part IV. Customer A and Customer B score nearly identically. A's recent reviews are full of frustration about a recent outage; B's reviews are short and bland. What is the cleanest framing for the next step?
2. A confusion matrix for a four-class ticket router shows the strongest off-diagonal cell is "quality tickets predicted as billing". The most cost-effective improvement is to:
3. Bag-of-words and embeddings are best thought of as: