§13.3

Preprocessing, Bag-of-Words, and TF-IDF

A document-term matrix is the simplest text representation that still works. Each row is a document. Each column is a word in the vocabulary. Each cell is a count or a weight. That's it. Almost every classical method in this chapter — and almost every benchmark a more elaborate method has to beat — runs on this matrix. The art is in two places: cleaning the text into honest tokens, and weighting the resulting counts so that informative words rise to the top.

This article walks through both. The cleaning side is unglamorous and consequential. The TF-IDF side is the simplest piece of math in the book that consistently does useful work.

The Executive Question

Which words distinguish the documents in this corpus from each other — and how do we get to them without throwing away the meaning along the way?

The cleaning step is full of small choices that look neutral and aren't. The TF-IDF step is the standard way to surface the words that actually carry information.

Preprocessing: Honest Choices

Most preprocessing choices have a defensible version and an aggressive version. The aggressive version is faster and throws more away.

Table 1. Common text preprocessing choices, with the defensible-vs-aggressive trade-off and a Bean & Basket example where each matters.

Step	Defensible	Aggressive variant risks	Example where it matters
Lowercasing	Lower everything.	Mostly safe; can erase NER signal ("Apple" the company vs. "apple" the fruit).	Brand mentions in tweets.
Punctuation	Strip standalone punctuation.	Aggressive removal of "!"/"?" drops urgency signal.	Support tickets — "URGENT!!!" matters.
Stop words	Remove a small list of true noise words.	Big lists strip negation: "not" and "no" are stop words in some libraries.	"not good" → "good" — sentiment inverted.
Stemming	Use sparingly on noisy corpora.	Reduces "universal" and "universe" to the same stem; collapses meaning.	Marketing copy with deliberate word choice.
Lemmatization	Reduce inflections to dictionary form.	Generally safer than stemming; can mis-tag homographs.	"better" → "good" — useful in sentiment.
N-grams	Keep bigrams for domain phrases.	Including trigrams explodes vocabulary with little signal.	"cold brew", "wait time", "app crash".
Negation handling	Tag tokens after "not"/"no" with a negation marker.	Ignoring negation flips sentiment polarity silently.	"not good" should not become positive.
Emoji / unicode	Map common emojis to sentiment tokens.	Dropping them throws away signal in social text.	"app keeps crashing 😡" — emoji is the verdict.

One lesson to take from this table: the choices that look most innocuous are usually the ones that change results. Removing stop words from a sentiment task that hinges on "not" silently inverts the result.
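The stop-word risk is easy to reproduce. The sketch below is a minimal illustration, not a production tokenizer: the two stop lists, the `NOT_` prefix, and the one-token negation scope are all assumptions chosen for the example.

```python
import re

# Hypothetical stop lists for illustration; real library lists differ.
SMALL_STOPS = {"the", "a", "an", "is", "was", "of"}
BIG_STOPS = SMALL_STOPS | {"not", "no", "very", "but"}
NEGATORS = {"not", "no"}


def tokenize(text):
    """Lowercase and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stops(tokens, stops):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in stops]


def tag_negation(tokens, scope=1):
    """Drop each negator and prefix the next `scope` tokens with NOT_."""
    out, remaining = [], 0
    for tok in tokens:
        if tok in NEGATORS:
            remaining = scope
            continue
        if remaining > 0:
            out.append("NOT_" + tok)
            remaining -= 1
        else:
            out.append(tok)
    return out


review = "The coffee was not good"
tokens = tokenize(review)

# Aggressive: a big stop list that happens to contain "not".
aggressive = remove_stops(tokens, BIG_STOPS)
# Defensible: tag negation first, then remove a small stop list.
defensible = remove_stops(tag_negation(tokens), SMALL_STOPS)

print(aggressive)  # ['coffee', 'good']
print(defensible)  # ['coffee', 'NOT_good']
```

The order of operations is the design choice here: negation is tagged before stop words are removed, so the signal survives the cleanup. A one-token scope is deliberately crude; a phrase such as "not a good" would tag "a" instead of "good", which is why practical schemes often extend the scope to the next punctuation mark.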

Bag-of-Words

After preprocessing, a corpus of $N$ documents over a vocabulary of $V$ words becomes a document-term matrix of shape $N \times V$. Each cell is the count of that word in that document. Most cells are zero, so the matrix is sparse.

That's the entire idea of bag-of-words. Word order is gone. Context is gone. What survives is the frequency profile of words inside each document.
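As a minimal sketch of that construction, the snippet below builds a dense document-term matrix for a three-document toy corpus. The corpus is invented for illustration, and tokenization is a plain whitespace split; a real pipeline would store the matrix in a sparse format.

```python
from collections import Counter

# Toy corpus, invented for illustration.
docs = [
    "cold brew was smooth",
    "app crashed and the app crashed again",
    "coffee was cold",
]
tokenized = [d.split() for d in docs]

# Vocabulary: every distinct token, sorted so column order is reproducible.
vocab = sorted({tok for toks in tokenized for tok in toks})
col = {word: j for j, word in enumerate(vocab)}

# Dense N x V matrix of counts (fine for a toy corpus).
matrix = [[0] * len(vocab) for _ in docs]
for i, toks in enumerate(tokenized):
    for word, count in Counter(toks).items():
        matrix[i][col[word]] = count

n_cells = len(docs) * len(vocab)
n_zero = sum(row.count(0) for row in matrix)

print(vocab)
print(matrix[1][col["crashed"]])  # 2
print(f"{n_zero} of {n_cells} cells are zero")  # 18 of 30 cells are zero
```

Even at this size more than half the cells are zero, and shuffling the words inside any document would leave its row unchanged.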

Two consequences for the manager:

Counts overweight common words. "The" appears everywhere; "cold" appears in some negative reviews and many cold-brew reviews. Raw counts confuse the two.
No semantics. To bag-of-words, "refund" and "money back" are unrelated tokens. Embeddings (§19) will place them close together.

The fix for the first consequence is TF-IDF. The fix for the second is to move to embeddings.
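The second consequence can be made concrete with cosine similarity on raw counts. This is a small sketch under stated assumptions: the three sentences are invented, and cosine similarity is used here only as a convenient way to compare two bag-of-words rows.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token-count dictionaries."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


doc_a = Counter("please send my refund".split())
doc_b = Counter("i want money back".split())
doc_c = Counter("please send my receipt".split())

print(cosine(doc_a, doc_b))  # 0.0: same request, no shared tokens
print(cosine(doc_a, doc_c))  # 0.75: different request, three shared tokens
```

Bag-of-words scores the two refund requests as completely unrelated and the refund and receipt requests as close, which is the opposite of what a reader would say.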

TF-IDF: Weighting Counts by Informativeness

Term frequency–inverse document frequency (TF-IDF) weights each word in each document by:

TF-IDF

$$\text{tfidf}(w, d) = \underbrace{\text{tf}(w, d)}_{\text{how often in } d} \times \underbrace{\log \frac{N}{\text{df}(w)}}_{\text{rare across corpus}}$$

The first factor rewards words that appear often inside the document. The second factor, the IDF, penalizes words that appear in many documents: $\text{df}(w)$ is the number of documents that contain $w$ at least once, and $N$ is the number of documents in the corpus. A word that is everywhere ("the") has near-zero IDF and gets crushed. A word that appears in a few documents and is dense inside them ("crashed") gets a large weight.
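The formula translates almost line for line into code. The sketch below applies it to a four-document toy corpus (invented for illustration) using raw counts for tf and the natural logarithm for the IDF; both are assumptions, since the formula above does not fix a log base.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = [
    "the app crashed and the app crashed again",
    "the coffee was smooth",
    "the staff was friendly",
    "the wait was slow",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# df(w): number of documents that contain w at least once.
df = Counter(word for toks in tokenized for word in set(toks))


def tfidf(tokens):
    """tf(w, d) * log(N / df(w)) for every word in one document."""
    tf = Counter(tokens)
    return {w: count * math.log(N / df[w]) for w, count in tf.items()}


weights = tfidf(tokenized[0])
for word, score in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {score:.3f}")
```

In the first document "the" appears twice but occurs in all four documents, so its weight is exactly zero, while "crashed" appears twice in one document only and gets the top weight. Library implementations often differ from this textbook form: scikit-learn's `TfidfVectorizer`, for example, smooths the IDF and L2-normalizes each row by default, so its numbers will not match these.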

The result is a matrix where the highest-weighted cells correspond to the most distinctive words in each document. That is the property classical text methods lean on.

Reading TF-IDF Honestly

The most common managerial use of TF-IDF is to rank words by their weight inside a class of documents — positive reviews, support tickets, complaints from one region. The bar chart below is the standard output.

Top TF-IDF terms in positive vs. negative Bean & Basket reviews

Positive reviews

Term	TF-IDF weight
smooth	0.92
friendly	0.85
cinnamon	0.78
cozy	0.72
fresh	0.68
fast	0.62

Negative reviews

Term	TF-IDF weight
crashed	0.95
slow	0.87
rude	0.80
cold	0.74
expensive	0.69
wait	0.65

TF-IDF tells us which words appear — not what they mean in context. "Cold" lands in the negative column even when a review says "cold brew is fine."

Figure 1. Top TF-IDF terms in positive vs. negative Bean & Basket reviews. The lists look reassuring but contain the trap discussed in this chapter: "cold" is in the negative column even though some negative reviews mention it only to praise cold brew.
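A ranking like the one in Figure 1 can be produced in a few lines. The sketch below is one way to do it, not the method behind the figure: it assumes a class's score for a word is the mean TF-IDF weight over that class's documents, and it uses six invented, already-cleaned reviews, so its numbers are unrelated to the figure's.

```python
import math
from collections import Counter, defaultdict

# Toy labeled reviews, already preprocessed; invented for illustration.
reviews = [
    ("positive", "smooth cold brew friendly staff"),
    ("positive", "friendly barista smooth latte"),
    ("positive", "fresh pastry friendly staff"),
    ("negative", "cold coffee rude staff"),
    ("negative", "app crashed slow wait"),
    ("negative", "cold latte slow wait rude"),
]
tokenized = [(label, text.split()) for label, text in reviews]
N = len(tokenized)
df = Counter(w for _, toks in tokenized for w in set(toks))


def tfidf(tokens):
    """tf(w, d) * log(N / df(w)) for every word in one document."""
    tf = Counter(tokens)
    return {w: c * math.log(N / df[w]) for w, c in tf.items()}


# Assumption: class score = mean TF-IDF over the documents in that class.
totals = defaultdict(Counter)
sizes = Counter()
for label, toks in tokenized:
    sizes[label] += 1
    for w, score in tfidf(toks).items():
        totals[label][w] += score


def top_terms(label, k=3):
    """Top-k words for a class; ties are broken alphabetically."""
    means = {w: s / sizes[label] for w, s in totals[label].items()}
    return sorted(means.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


for label in ("positive", "negative"):
    print(label, [w for w, _ in top_terms(label)])

cold_neg = totals["negative"]["cold"] / sizes["negative"]
cold_pos = totals["positive"]["cold"] / sizes["positive"]
print(f"cold: negative {cold_neg:.3f}, positive {cold_pos:.3f}")
```

In this toy data "cold" scores higher in the negative class than in the positive one, yet one of the documents that contains it is a positive review of cold brew. The score alone cannot tell those two uses apart.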

Three reading habits:

Read the words, not just the scores. A high TF-IDF score does not guarantee that a word means what you assume it means. "Cold" can land in the negative column for two unrelated reasons: a complaint about a drink served cold, or a passing mention of cold brew.
Look for the trap words. Words with multiple senses ("cold", "fresh", "premium") are where bag-of-words leaks meaning. They are the bridge into §13.6 (limits of classical NLP) and §19 (embeddings).
Pair TF-IDF with example documents. Always click through the top-ranked words to a few example reviews. The interpretation lives there, not in the score.
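The third habit is cheap to automate. The helper below is a hypothetical sketch (the name `examples_for` and the toy reviews are invented for illustration) that pulls raw reviews from one class that contain a given top-ranked word, so the score can be read next to the text that produced it.

```python
# Toy reviews, invented for illustration.
reviews = [
    ("negative", "Coffee arrived cold and the barista was rude."),
    ("negative", "Cold brew is fine but the app crashed twice."),
    ("positive", "The cold brew is smooth and the staff is friendly."),
]


def examples_for(term, label, docs, limit=3):
    """Return up to `limit` raw reviews in `label` that contain `term`."""
    term = term.lower()
    hits = []
    for doc_label, text in docs:
        # Crude matching: lowercase, split on spaces, strip edge punctuation.
        words = [w.strip(".,!?") for w in text.lower().split()]
        if doc_label == label and term in words:
            hits.append(text)
        if len(hits) == limit:
            break
    return hits


for text in examples_for("cold", "negative", reviews):
    print("-", text)
```

Both negative reviews contain "cold", but only the first is a complaint about temperature; the second is unhappy about the app and neutral about cold brew. That distinction is visible in the text and invisible in the weight.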

When TF-IDF Earns Its Place

TF-IDF still earns its place in a modern stack despite the dominance of embeddings:

Speed. TF-IDF features with a linear model train quickly on a single machine, often in seconds to minutes even for corpora of millions of short documents.
Transparency. Every prediction can be traced to the words that drove it.
Strong baseline. Many production text classifiers are TF-IDF + logistic regression. A more elaborate method should have to beat that baseline to justify the maintenance cost.
Diagnostic value. Even when embeddings power the final model, TF-IDF rankings are a useful first inspection — they tell you what the data looks like before any neural network is trained.
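The baseline and transparency points can be seen together in a small sketch. In practice this would usually be a library pipeline; the version below is pure Python so that the mechanics stay visible. Everything in it is an assumption made for illustration: the toy reviews, the learning rate, the iteration count, and the use of plain batch gradient descent without regularization.

```python
import math
from collections import Counter

# Toy training set, invented for illustration (1 = negative review).
train = [
    ("app crashed slow wait", 1),
    ("rude staff cold coffee", 1),
    ("slow wait rude barista", 1),
    ("smooth latte friendly staff", 0),
    ("fresh pastry friendly barista", 0),
    ("smooth cozy fresh", 0),
]
tokenized = [text.split() for text, _ in train]
labels = [y for _, y in train]
N = len(train)
df = Counter(w for toks in tokenized for w in set(toks))


def tfidf(tokens):
    """TF-IDF over the training vocabulary; unseen words are skipped."""
    tf = Counter(t for t in tokens if t in df)
    return {w: c * math.log(N / df[w]) for w, c in tf.items()}


def predict_proba(x, weights, bias):
    """Logistic model: probability that a review is negative."""
    z = bias + sum(weights.get(w, 0.0) * v for w, v in x.items())
    return 1.0 / (1.0 + math.exp(-z))


# Plain batch gradient descent on the logistic loss; no regularization.
X = [tfidf(toks) for toks in tokenized]
weights, bias, lr = {}, 0.0, 0.5
for _ in range(200):
    grad_w, grad_b = Counter(), 0.0
    for x, y in zip(X, labels):
        err = predict_proba(x, weights, bias) - y
        grad_b += err
        for w, v in x.items():
            grad_w[w] += err * v
    bias -= lr * grad_b / N
    for w, g in grad_w.items():
        weights[w] = weights.get(w, 0.0) - lr * g / N

# Transparency: each word's contribution is its weight times its TF-IDF value.
new_review = "friendly staff but app crashed".split()
x_new = tfidf(new_review)
contributions = sorted(
    ((w, weights.get(w, 0.0) * v) for w, v in x_new.items()),
    key=lambda kv: kv[1],
)
print(f"P(negative) = {predict_proba(x_new, weights, bias):.2f}")
for word, c in contributions:
    print(f"{word:10s} {c:+.2f}")
```

The printed contributions show which words pushed the prediction toward negative ("app", "crashed") and which pushed it away ("friendly"). That per-word breakdown is what a more elaborate model has to give up or approximate.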