Part V · Chapter 13
Text as Business Data
Turning reviews, tickets, and transcripts into evidence a model can act on — and knowing exactly where word counts stop working.
This chapter focuses on turning prose — reviews, tickets, transcripts, and social posts — into evidence a model can act on, using the classical NLP stack: tokens, document-term matrices, TF-IDF weighting, supervised classifiers for routing and sentiment, and LDA topic models surfaced as weekly text dashboards. Working the Bean & Basket coffee case, it shows where word counts earn their keep as a transparent baseline and where they quietly break. It closes with a gallery of failure modes — sarcasm, negation, polysemy, idiom, mixed and context-dependent sentiment — that motivates the move to embeddings in the next chapter. The recurring discipline: name the document, choose the representation, state the construct, then inspect what the method threw away.
Topics covered
In this chapter
- 13.1 From Structured to Unstructured Data: Reframes unstructured text not as unusable but as data needing a representation layer, mapping six families of business text to their questions.
- 13.2 Text as Data: Installs the core vocabulary (document, corpus, token, vocabulary, n-gram, metadata) and shows how the document boundary reshapes the whole pipeline.
- 13.3 Preprocessing, Bag-of-Words, and TF-IDF: Walks through honest preprocessing choices, the bag-of-words matrix, and TF-IDF weighting that lifts distinctive words above common ones.
- 13.4 Text Classification and Sentiment: Covers supervised routing and sentiment, then aspect-based sentiment heatmaps that reveal which part of the experience is under stress, and where.
- 13.5 Topic Models and Text Dashboards: Explains LDA topic discovery, why humans name the topics, and the trend-over-time dashboard that drives an operating cadence.
- 13.6 Limits of Classical NLP: Catalogues where bag-of-words fails (sarcasm, negation, polysemy, idiom, mixed and context-shifted sentiment), bridging to embeddings.