§13.1
From Structured to Unstructured Data
Part V — reading text, images, and documents the structured methods cannot see.
Part IV scored customers, built segments, and targeted audiences. Every input was a number in a column. Most of what a firm actually knows about its customers — what they complain about, what they thank the staff for, what the rep promised on the call, what the receipt showed, what the product looked like on the shelf — is not in a column. It is in reviews, tickets, social posts, transcripts, contracts, images, and video.
Part V is about pulling that evidence into the same decision loop. The path is short: classical text analysis, then embeddings, then language models, then agents and governance. But the first move — the one this article is about — is to stop treating "unstructured" as a synonym for "unusable" and start treating it as data that needs a representation layer.
The Executive Question
What business claim is this body of text, document, or image capable of supporting — and how does it become evidence a model can use?
The honest version: a review, taken raw, supports no claim a database can act on. Once it is converted into a representation — tokens, TF-IDF, an embedding, a structured extraction — it supports many.
Two Evidence Languages, Same Customer
The clearest way to see why this matters is to look at what the warehouse stores about a customer next to what the customer is telling us in their own words.
Structured rows vs. unstructured reviews — same customer, two evidence languages
| customer_id | orders_90d | tenure_mo | days_since |
|---|---|---|---|
| C-204 | 14 | 18 | 6 |
| C-205 | 22 | 9 | 1 |
| C-206 | 7 | 24 | 14 |
Three numbers per customer. Fast to model, easy to compare, no language signal.
- “My latte was perfect but the wait felt forever. App crashed at checkout — second time this month. Still love the staff.”
- “Cinnamon roll is incredible. Why is the music always so loud though?”
- “Honestly the app is killing me. Cold brew is fine.”
A few sentences per customer. Slow to compare, but loaded with intent, complaint type, and emotional tone.
Unstructured does not mean unusable. It means we need a representation layer.
The table above is what the Part IV models lived on: recency, frequency, monetary value, engagement. Fast, comparable, easy to model. The reviews below it carry a different kind of signal: intent ("still love the staff"), specific complaint type ("app crashed at checkout", for the second time this month), and tonal weight ("the app is killing me"). A churn model trained only on the table will rank these three customers similarly. The reviews suggest one is loyal but frustrated, one is mildly annoyed, and one is signalling churn.
The job of Part V is to make both signals usable together.
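One way to picture "usable together" is a single feature row per customer that holds both the warehouse columns and a few flags derived from the review text. The sketch below is deliberately crude and only illustrative: the keyword lists, the `text_flags` helper, and the pairing of reviews to customer IDs in listing order are assumptions made for this example, not the method developed later in Part V.

```python
import re

# Structured rows, copied from the warehouse table above.
rows = {
    "C-204": {"orders_90d": 14, "tenure_mo": 18, "days_since": 6},
    "C-205": {"orders_90d": 22, "tenure_mo": 9, "days_since": 1},
    "C-206": {"orders_90d": 7, "tenure_mo": 24, "days_since": 14},
}

# Assumption: the three reviews belong to the three customers in listing order.
reviews = {
    "C-204": "My latte was perfect but the wait felt forever. "
             "App crashed at checkout - second time this month. "
             "Still love the staff.",
    "C-205": "Cinnamon roll is incredible. Why is the music always so loud though?",
    "C-206": "Honestly the app is killing me. Cold brew is fine.",
}

# Hypothetical keyword lists; a real representation layer would be far richer.
APP_TERMS = {"app"}
PRAISE_TERMS = {"love", "perfect", "incredible"}


def text_flags(review):
    """Turn one review into two 0/1 columns a tabular model could consume."""
    words = set(re.findall(r"[a-z']+", review.lower()))
    return {
        "mentions_app": int(bool(words & APP_TERMS)),
        "has_praise": int(bool(words & PRAISE_TERMS)),
    }


# One row per customer: warehouse columns plus text-derived columns.
features = {cid: {**rows[cid], **text_flags(reviews[cid])} for cid in rows}

for cid, row in features.items():
    print(cid, row)
```

Even these two flags separate the customers in a way the three numeric columns do not: C-204 mentions the app and praises the shop, C-205 praises without mentioning the app, and C-206 mentions the app with no praise at all.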
Six Families of Unstructured Business Data
The technique used to extract evidence depends on the data type. The decision logic stays the same.
| Data family | Representative source | Typical question |
|---|---|---|
| Reviews | app store, Yelp, Trustpilot | What are customers complaining about? |
| Support tickets | internal ticketing system | How should this be routed; is it urgent? |
| Social posts | Twitter / X, Reddit | How is the brand being talked about right now? |
| Transcripts | call centre, sales calls | Did the rep cover the required steps? |
| Documents | contracts, invoices, manuals | What are the terms; what should be extracted? |
| Images / video | shelf, product, ad creative | What is shown; is it on-brand; is it defective? |
Each family gets its own treatment. Reviews and tickets dominate Chapter 13. Documents enter in §15.3. Images and multimodal arrive in §15.2–§15.4. Transcripts and "anything in language" recur across §14.4, §21, and §22.
The Representation Move
Every method in this part begins with the same move: turn the raw input into something a model can compute on.
For text:
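- Bag-of-words / TF-IDF. A document becomes a sparse vector of word counts: raw counts for bag-of-words, counts down-weighted for words that appear in many documents for TF-IDF. Cheap, transparent, the bedrock of §13.3–§13.5.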
- Embeddings. A document becomes a dense vector in a learned meaning space. The bridge from classical to modern (§19).
- LLM measurement. A document is read by a language model that returns scores or labels on managerially defined constructs. The conceptual heart of the part (§14.4).
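The first of these representations is small enough to compute by hand. The sketch below builds bag-of-words counts and TF-IDF weights for the three sample reviews using only the standard library. The tokenizer and the smoothed inverse-document-frequency formula, log((1 + N) / (1 + df)) + 1, are assumptions chosen for the example; it is one common variant, and §13.3–§13.5 may define the weighting differently.

```python
import math
import re
from collections import Counter

reviews = [
    "My latte was perfect but the wait felt forever. "
    "App crashed at checkout - second time this month. Still love the staff.",
    "Cinnamon roll is incredible. Why is the music always so loud though?",
    "Honestly the app is killing me. Cold brew is fine.",
]


def tokenize(text):
    """Lowercase and keep runs of letters and apostrophes (a simplistic tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


# Bag-of-words: one sparse count vector (a Counter) per document.
docs = [Counter(tokenize(r)) for r in reviews]

# Document frequency: in how many documents does each term appear?
n_docs = len(docs)
df = Counter(term for doc in docs for term in doc)


def tfidf(doc):
    """Weight each count by a smoothed inverse document frequency."""
    return {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in doc.items()
    }


vectors = [tfidf(doc) for doc in docs]

# Show the five highest-weighted terms of the first review.
top = sorted(vectors[0].items(), key=lambda kv: kv[1], reverse=True)[:5]
print(top)
```

Each vector stores only the terms that occur in its document, which is what "sparse" means here. A word that appears in every review, such as "the", gets the minimum per-occurrence weight, while a word unique to one review, such as "latte", gets the maximum.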
For images:
- CNN features. An image becomes a hierarchy of edges → textures → object parts → objects (§15.2).
- Image embeddings. An image becomes a vector in the same space as text — the basis for multimodal search (§15.4).
For documents:
- OCR + layout. A scanned page becomes text plus spatial structure (§15.3).
- Schema-driven extraction. A document plus a JSON schema becomes a set of validated fields (§16.2).
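The "validated fields" step in schema-driven extraction can be sketched as follows. Everything here is hypothetical: the invoice schema, the field names, and the two JSON strings stand in for whatever an extractor would return, and the type check is a minimal stand-in rather than a full JSON Schema validator.

```python
import json

# Hypothetical schema: field name -> expected Python type.
INVOICE_SCHEMA = {
    "invoice_number": str,
    "total": float,
    "currency": str,
}


def validate(raw_json, schema):
    """Parse extractor output and return (valid_fields, errors)."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return {}, [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return {}, ["expected a JSON object"]

    fields, errors = {}, []
    for name, expected in schema.items():
        if name not in data:
            errors.append(f"missing field: {name}")
            continue
        value = data[name]
        # JSON does not distinguish 120 from 120.0, so accept integers as floats.
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if isinstance(value, expected):
            fields[name] = value
        else:
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
    return fields, errors


# Made-up extractor outputs, one clean and one faulty.
good = '{"invoice_number": "INV-001", "total": 120, "currency": "EUR"}'
bad = '{"invoice_number": 17, "currency": "EUR"}'

print(validate(good, INVOICE_SCHEMA))
print(validate(bad, INVOICE_SCHEMA))
```

The design point is that downstream code only ever sees fields that passed the schema, and every rejection is recorded as an explicit error rather than silently dropped.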
The representation determines which questions become askable. Bag-of-words can answer "which words are most overrepresented in negative reviews?" An embedding can answer "what are customers angry about, even if they don't use the word angry?" An LLM measurement can answer "do these customers feel betrayed?" These are three different questions, and they require three different representations.
Where This Part Is Going
The teaching arc has three turns:
- Classical NLP (Chapter 13). Tokens, TF-IDF, classification, sentiment, topics. The same supervised-vs-unsupervised distinction from Part IV, now on text. The chapter ends with a gallery of cases where word counts fail.
- Embeddings and measured constructs (Chapter 14). Embeddings as a meaning space; semantic search; the bridge to perceptual maps from §11.2. The third article — GPT-as-measurement — is the conceptual heart of the part. It is where "the model measures the construct" replaces "we count proxies for the construct."
- RAG, vision, multimodal, LLMs, agents, governance (Chapters 15–22). The operational use of language models — answering from internal knowledge, reading documents and images, drafting actions, evaluating the whole pipeline.
The capstone (§16.5) integrates all of this into a single customer-voice loop that lives next to the customer intelligence loop from §12.4.