§13.2
Text as Data
Before any model touches a single review, the team has already made a dozen quiet choices. Was the document the review itself, or the (review + star rating) pair? Were two-word phrases — "cold brew" — treated as one unit or two? Was "running" reduced to "run"? Was the emoji thrown away? These choices look small. They are why two text projects working on the same data can end up answering different questions.
This article fixes the vocabulary that every text method in Part V depends on. Once a manager can speak fluently in document, corpus, token, vocabulary, n-gram, and metadata, the rest of the part (sentiment, topics, embeddings, RAG) falls into place.
The Executive Question
What is a unit of text in our business, and what is around it that the model also needs to see?
The honest version: the unit of text is the level at which a decision is made. A "complaint" can be one review, one paragraph of a review, one ticket thread, or one transcript. The unit of analysis decides what gets compared to what.
The Pipeline at a Glance
The path from raw text to a business decision is the same shape across every method in this part.
[Figure: From raw text to a business action, the standard pipeline]
The chapters of Part V map onto this pipeline:
- 18.3 lives in stage 3 (tokens) and stage 4 (features) — bag-of-words, TF-IDF.
- 18.4 and 18.5 live in stage 5 (model) — classification, sentiment, topic models.
- 19.1–19.2 swap the stage-4 representation from TF-IDF to embeddings.
- 19.3 replaces stages 4–5 with an LLM that produces constructs directly.
- 20.1 adds a retrieval step before stage 5.
- 21 and 22 govern the whole loop.
The pipeline is generic. The chapters specialize it.
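A minimal sketch of stages 3 to 5 as composable functions may make the "generic pipeline, specialized chapters" idea concrete. Everything here is an illustrative assumption: the function names, the whitespace tokenizer, and the hand-written `WEIGHTS` table that stands in for a trained model. The earlier and later stages of the pipeline are not modeled.

```python
from collections import Counter

def tokens(document):
    """Stage 3 (tokens): lowercase and split on whitespace (a toy rule)."""
    return document.lower().split()

def bag_of_words(token_list):
    """Stage 4 (features): count how often each token occurs."""
    return Counter(token_list)

def presence(token_list):
    """Alternative stage 4: record only whether a token occurs at all."""
    return Counter(set(token_list))

# Hypothetical word weights standing in for a trained stage-5 model.
WEIGHTS = {"perfect": 1.0, "great": 1.0, "weak": -1.0, "slow": -1.0}

def score(features):
    """Stage 5 (model): weighted sum of the features."""
    return sum(WEIGHTS.get(tok, 0.0) * count for tok, count in features.items())

def run(document, featurize=bag_of_words):
    """Chain the three stages; the stage-4 representation is swappable."""
    return score(featurize(tokens(document)))

text = "weak coffee and slow slow service"
counted = run(text)                      # counts "slow" twice
binary = run(text, featurize=presence)   # counts "slow" once
```

Swapping `featurize` changes the answer without touching the other stages, which is the same kind of substitution the later chapters make when they replace one representation with another.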
The Vocabulary
A short glossary, with one Bean & Basket example per term.
| Term | Definition | Bean & Basket example |
|---|---|---|
| Document | A single unit of text the model treats as one thing. | One app store review. |
| Corpus | A collection of documents, often the dataset for a method. | All May 2026 app store reviews. |
| Token | A unit produced by splitting a document — usually a word or sub-word. | "My", "latte", "was", "perfect". |
| Vocabulary | The set of unique tokens across the corpus. | ~6,000 unique words across 20,000 reviews. |
| N-gram | A contiguous run of n tokens. Unigram = 1; bigram = 2. | "cold brew" is a useful bigram in coffee text. |
| Stop words | Common tokens that often contribute little signal. | "the", "and", "of" — usually dropped. |
| Metadata | Structured fields attached to each document. | Star rating, store, date, verified-purchase flag. |
| Label | A target attached to the document for supervised tasks. | Sentiment, complaint category, recommend-or-not. |
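The glossary terms can be sketched on a three-review toy corpus. This is an illustration under stated assumptions, not a production tokenizer: the reviews, the short `STOP_WORDS` set, and the lowercase-and-split rule are all invented for the example, and lowercasing is itself one of the quiet choices discussed earlier.

```python
import re
from collections import Counter

# Hypothetical mini-corpus: each string is one document (one review).
corpus = [
    "My latte was perfect",
    "The cold brew was weak and the line was long",
    "Cold brew and a latte, perfect",
]

# Illustrative stop-word list; real lists are longer and domain-tuned.
STOP_WORDS = {"the", "and", "of", "a", "was", "my"}

def tokenize(document):
    """Lowercase the text and extract word tokens."""
    return re.findall(r"[a-z']+", document.lower())

def ngrams(token_list, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(token_list[i:i + n]) for i in range(len(token_list) - n + 1)]

tokenized = [tokenize(doc) for doc in corpus]

# Vocabulary: the set of unique tokens across the whole corpus.
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# The same vocabulary with stop words dropped.
content_vocabulary = [tok for tok in vocabulary if tok not in STOP_WORDS]

# Bigram counts across the corpus; ("cold", "brew") appears in two reviews.
bigram_counts = Counter(bg for doc in tokenized for bg in ngrams(doc, 2))
```

Treating ("cold", "brew") as one feature rather than two separate words is exactly the kind of decision that changes what a downstream model can see.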
A subtle but important point: the document is a choice, not a given. The same data can be treated as:
- one review per document — natural for review classification.
- one sentence per document — natural for aspect-based sentiment.
- one customer per document (concatenated reviews) — natural for customer-level scoring.
Each choice produces a different corpus, a different vocabulary, and different models.
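The three choices of document can be sketched as three groupings of the same raw records. The records, the `customer_id` field, and the punctuation-based sentence splitter are assumptions made for illustration; real sentence segmentation is harder than this.

```python
import re
from collections import defaultdict

# Hypothetical raw records: (customer_id, review_text).
reviews = [
    ("c1", "The cold brew was great. The line was slow."),
    ("c2", "My latte was perfect."),
    ("c1", "Still love the cold brew."),
]

def by_review(records):
    """One review per document."""
    return [text for _, text in records]

def by_sentence(records):
    """One sentence per document, split after ., ! or ? (a naive rule)."""
    docs = []
    for _, text in records:
        parts = re.split(r"(?<=[.!?])\s+", text)
        docs.extend(p.strip() for p in parts if p.strip())
    return docs

def by_customer(records):
    """One customer per document: concatenate that customer's reviews."""
    grouped = defaultdict(list)
    for customer_id, text in records:
        grouped[customer_id].append(text)
    return [" ".join(texts) for texts in grouped.values()]

corpus_sizes = {
    "review": len(by_review(reviews)),
    "sentence": len(by_sentence(reviews)),
    "customer": len(by_customer(reviews)),
}
```

The same three records yield corpora of three, four, and two documents, so any per-document statistic computed downstream is answering a different question in each case.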
Five Questions That Drive the Pipeline
For Bean & Basket the recurring business questions are short. Each one bends the pipeline a little.
- What are customers complaining about? Document = review; method = topic model + classification (§13.5, §13.4).
- Which reviews indicate churn risk? Document = review; label = churn-within-60-days; method = supervised classification (§13.4, §10.1).
- What product attributes drive positive sentiment? Document = review; method = aspect-based sentiment (§13.4).
- Which topics are growing over time? Document = review; method = topic model + time series (§13.5).
- How does our brand language differ from competitors'? Document = social post; method = embedding clustering (§14.3).
Three of these questions go through bag-of-words and TF-IDF. Two need embeddings to handle synonyms and rephrasings. The questions draw on largely the same source data, vocabulary, and metadata, yet they produce five distinct configurations of the pipeline.
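One way to keep these configurations explicit is to write each one down as a small record. The `PipelineConfig` class and its field names are hypothetical; the values are taken from the five questions listed above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class PipelineConfig:
    """One business question and the pipeline choices it implies."""
    question: str
    document: str
    method: str
    label: Optional[str] = None  # set only for supervised tasks

CONFIGS = [
    PipelineConfig("What are customers complaining about?",
                   "review", "topic model + classification"),
    PipelineConfig("Which reviews indicate churn risk?",
                   "review", "supervised classification",
                   label="churn-within-60-days"),
    PipelineConfig("What product attributes drive positive sentiment?",
                   "review", "aspect-based sentiment"),
    PipelineConfig("Which topics are growing over time?",
                   "review", "topic model + time series"),
    PipelineConfig("How does our brand language differ from competitors'?",
                   "social post", "embedding clustering"),
]

# Questions that name an explicit label before any modeling can start.
needs_labels = [c.question for c in CONFIGS if c.label is not None]

# The distinct document units the five questions rely on.
document_units = {c.document for c in CONFIGS}
```

Writing the configuration down before modeling makes it easier to see when two projects that share a dataset are in fact answering different questions.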
Text Plus Metadata
Almost every business text dataset comes with structured fields alongside the body. Reviews come with stars, store, date, and verified-purchase flag. Tickets come with issue type, urgency, and resolution time. Treating text as if those fields didn't exist throws away signal.
A few patterns:
- Stratify before you model. Run sentiment by store region; topics by month; classification accuracy by channel. The variation in metadata is where the manager's interpretation lives.
- Use metadata as labels. Star rating is a free sentiment label. Resolution time is a free label for "did the ticket get resolved?" Many supervised text models need no hand-labeling once metadata is available.
- Avoid leakage through metadata. If a model trained to predict satisfaction has access to `resolution_time`, it is not really predicting satisfaction; it is learning that quickly resolved tickets score well, using a field that exists only after the ticket has closed.
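The three patterns above can be sketched on a handful of records. The records, the field names, the four-star threshold for a positive label, and the `LEAKY_FIELDS` set are all assumptions chosen for illustration, not a real schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical support tickets with a follow-up star rating.
records = [
    {"text": "My latte was perfect", "region": "north", "stars": 5, "resolution_time": 1.5},
    {"text": "Cold brew was weak", "region": "north", "stars": 2, "resolution_time": 30.0},
    {"text": "Friendly staff, fast service", "region": "south", "stars": 4, "resolution_time": 3.0},
    {"text": "Refund arrived quickly", "region": "south", "stars": 5, "resolution_time": 2.0},
]

def label_from_stars(stars):
    """Use metadata as a label: 4+ stars counts as satisfied (assumed cutoff)."""
    return 1 if stars >= 4 else 0

def stratify(rows, key):
    """Stratify: share of satisfied customers within each metadata group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(label_from_stars(row["stars"]))
    return {group: mean(labels) for group, labels in groups.items()}

# Fields that encode or follow the outcome and must not reach the model.
LEAKY_FIELDS = {"stars", "resolution_time"}

def model_inputs(row):
    """Avoid leakage: keep only fields available before the outcome is known."""
    return {k: v for k, v in row.items() if k not in LEAKY_FIELDS}

satisfaction_by_region = stratify(records, "region")
safe_features = [model_inputs(r) for r in records]
```

Which fields count as leaky depends on when the prediction is made: a field that is safe for a retrospective report can still be unavailable at the moment a live model has to score a ticket.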