§13.1

From Structured to Unstructured Data

Part V — reading text, images, and documents the structured methods cannot see.

Part IV scored customers, built segments, and targeted audiences. Every input was a number in a column. Most of what a firm actually knows about its customers — what they complain about, what they thank the staff for, what the rep promised on the call, what the receipt showed, what the product looked like on the shelf — is not in a column. It is in reviews, tickets, social posts, transcripts, contracts, images, and video.

Part V is about pulling that evidence into the same decision loop. The path is short: classical text analysis, then embeddings, then language models, then agents and governance. But the first move — the one this article is about — is to stop treating "unstructured" as a synonym for "unusable" and start treating it as data that needs a representation layer.


The Executive Question

What business claim is this body of text, document, or image capable of supporting — and how does it become evidence a model can use?

The honest version: a review, taken raw, supports no claim a database can act on. Once it is converted into a representation — tokens, TF-IDF, an embedding, a structured extraction — it supports many.


Two Evidence Languages, Same Customer

The clearest way to see why this matters is to look at what the warehouse stores about a customer next to what the customer is telling us in their own words.

Structured rows vs. unstructured reviews — same customer, two evidence languages

Warehouse table
customer_id | orders_90d | tenure_mo | days_since
C-20414186
C-2052291
C-20672414

Three numbers per customer. Fast to model, easy to compare, no language signal.

Reviews (same customers)
  • “My latte was perfect but the wait felt forever. App crashed at checkout — second time this month. Still love the staff.”
  • “Cinnamon roll is incredible. Why is the music always so loud though?”
  • “Honestly the app is killing me. Cold brew is fine.”

One free-text review per customer. Slow to compare, but loaded with intent, complaint type, and emotional tone.

Unstructured does not mean unusable. It means we need a representation layer.

Figure 1. The same three customers viewed two ways. The warehouse table gives a model a clean numeric foothold; the reviews are loaded with intent, complaint type, and emotional tone that no column captures.

The table on the left is what Part IV models lived on. Recency, frequency, monetary, engagement. Fast, comparable, easy to model. The reviews on the right carry a different kind of signal — intent ("still love the staff"), specific complaint type ("app crashed at checkout — second time"), and tonal weight ("the app is killing me"). A churn model trained only on the left will rank these three customers similarly. The reviews suggest one is loyal-but-frustrated, one is mildly annoyed, and one is signalling churn.

The job of Part V is to make both signals usable together.


Six Families of Unstructured Business Data

The technique used to extract evidence depends on the data type. The decision logic stays the same.

Table 1. The six families of unstructured business data this part covers, with a representative source and the typical business questions each one answers. Every row will reappear at least once across Chapters 13–22.
Data family | Representative source | Typical question
Reviews | app store, Yelp, Trustpilot | What are customers complaining about?
Support tickets | internal ticketing system | How should this be routed; is it urgent?
Social posts | Twitter / X, Reddit | How is the brand being talked about right now?
Transcripts | call centre, sales calls | Did the rep cover the required steps?
Documents | contracts, invoices, manuals | What are the terms; what should be extracted?
Images / video | shelf, product, ad creative | What is shown; is it on-brand; is it defective?

Each family will get its own treatment. Reviews and tickets dominate Chapter 13. Documents enter in §15.3. Images and multimodal arrive in §15.2–§15.4. Transcripts and "anything in language" recur across §14.4, §21, and §22.


The Representation Move

Every method in this part begins with the same move: turn the raw input into something a model can compute on.

For text:

  • Bag-of-words / TF-IDF. A document becomes a sparse vector of word counts, which TF-IDF then reweights so that words appearing in nearly every document count for less. Cheap, transparent, the bedrock of §13.3–§13.5.
  • Embeddings. A document becomes a dense vector in a learned meaning space. The bridge from classical to modern (§19).
  • LLM measurement. A document is read by a language model that returns scores or labels on managerially defined constructs. The conceptual heart of the part (§14.4).
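The first of these representations is simple enough to sketch by hand. The snippet below is a minimal illustration, not the implementation used later in the chapter: it applies a deliberately crude tokenizer and one common TF-IDF variant (term frequency times log of inverse document frequency) to the three reviews from Figure 1. The function names `tokenize` and `tfidf` are assumptions made for this sketch, and the first review's punctuation is lightly simplified.

```python
import math
import re
from collections import Counter

# The three reviews from Figure 1 (punctuation lightly simplified).
reviews = [
    "My latte was perfect but the wait felt forever. App crashed at checkout, "
    "second time this month. Still love the staff.",
    "Cinnamon roll is incredible. Why is the music always so loud though?",
    "Honestly the app is killing me. Cold brew is fine.",
]


def tokenize(text):
    """Lowercase and keep runs of letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times log inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


vectors = tfidf(reviews)
# Highest-weighted terms per review, for inspection.
top_terms = [sorted(v, key=v.get, reverse=True)[:3] for v in vectors]
```

Under this weighting, "the" appears in all three reviews and so receives a weight of zero, while "latte" (one review) outweighs "app" (two reviews) in the first review. Each vector stores only the terms that occur in its document, which is what "sparse" means here.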

For images:

  • CNN features. An image becomes a hierarchy of edges → textures → object parts → objects (§15.2).
  • Image embeddings. An image becomes a vector in the same space as text — the basis for multimodal search (§15.4).
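To make the "edges" stage of that hierarchy concrete, here is a small sketch under stated assumptions. It is not a CNN: a real network learns its filters from data, whereas this example hand-sets a single Sobel filter and slides it over a hypothetical 5x6 grayscale patch. The helper name `correlate2d` is chosen for this sketch only.

```python
def correlate2d(image, kernel):
    """Slide `kernel` over `image` without padding and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical grayscale patch: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

# Hand-set Sobel filter that responds to left-to-right brightness changes.
sobel_x = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

edges = correlate2d(image, sobel_x)
# Each output row is [0, 4, 4, 0]: zero over flat regions, high at the boundary.
```

The response is zero wherever the patch is uniform and peaks where dark meets bright. Stacking many such filters, and then filters over their outputs, is what produces the progression from edges to textures to object parts.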

For documents:

  • OCR + layout. A scanned page becomes text plus spatial structure (§15.3).
  • Schema-driven extraction. A document plus a JSON schema becomes a set of validated fields (§16.2).
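The "validated fields" step can be sketched as follows. This is an illustration only: the invoice schema, the field names, and the `validate` helper are hypothetical, and the extractor that produces the JSON (OCR plus a language model, for example) is not shown. A production pipeline would more likely use a full JSON Schema validator than this hand-rolled type check.

```python
import json

# Hypothetical schema for an invoice: field name -> required Python type.
INVOICE_SCHEMA = {
    "invoice_number": str,
    "vendor": str,
    "total": float,
    "currency": str,
}


def validate(raw_json, schema):
    """Parse extractor output and return (accepted_fields, errors)."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return {}, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return {}, ["expected a JSON object"]

    fields, errors = {}, []
    for name, expected in schema.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        elif not isinstance(data[name], expected):
            got = type(data[name]).__name__
            errors.append(f"{name}: expected {expected.__name__}, got {got}")
        else:
            fields[name] = data[name]
    return fields, errors


# Two invented extractor outputs: one clean, one with problems.
good = '{"invoice_number": "INV-001", "vendor": "Acme", "total": 1250.5, "currency": "USD"}'
bad = '{"invoice_number": "INV-002", "total": "1,250.50", "currency": "USD"}'

good_fields, good_errors = validate(good, INVOICE_SCHEMA)
bad_fields, bad_errors = validate(bad, INVOICE_SCHEMA)
```

The second output is rejected on two counts: the vendor is missing and the total arrived as a formatted string rather than a number. Only fields that pass the check reach the downstream table, which is the point of putting a schema between the document and the database.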

The representation determines which questions become askable. Bag-of-words can answer "which words are most overrepresented in negative reviews?" An embedding can answer "what are customers angry about, even if they never use the word 'angry'?" An LLM measurement can answer "do these customers feel betrayed?" These are three different questions, and they require three different representations.
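The first of those questions, which words are overrepresented in negative reviews, needs nothing beyond word counts. The sketch below is one simple way to answer it, assuming a smoothed ratio of word rates; the labelled reviews are invented for illustration, and `overrepresentation` is a hypothetical helper, not a method prescribed by this part.

```python
import re
from collections import Counter

# Hypothetical labelled reviews, invented for illustration.
negative = [
    "app crashed at checkout again",
    "the app is slow and the wait is long",
    "checkout crashed twice",
]
positive = [
    "the latte is perfect",
    "love the staff and the cinnamon roll",
    "cold brew is great and the app is fine",
]


def word_counts(docs):
    """Count lowercase alphabetic tokens across a list of documents."""
    return Counter(w for d in docs for w in re.findall(r"[a-z]+", d.lower()))


def overrepresentation(target_docs, other_docs):
    """Add-one smoothed ratio of each word's rate in target vs. other."""
    target, other = word_counts(target_docs), word_counts(other_docs)
    vocab = set(target) | set(other)
    # Add-one smoothing keeps words unseen on one side from dividing by zero.
    t_total = sum(target.values()) + len(vocab)
    o_total = sum(other.values()) + len(vocab)
    return {
        w: ((target[w] + 1) / t_total) / ((other[w] + 1) / o_total)
        for w in vocab
    }


ratios = overrepresentation(negative, positive)
ranked = sorted(ratios, key=ratios.get, reverse=True)
```

On this toy data "crashed" and "checkout" rank highest, "app" sits above 1 because it also shows up in a positive review, and "the" falls below 1. What the ratio cannot do is notice anger expressed without a telltale word, which is where the embedding and LLM representations take over.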


Where This Part Is Going

The teaching arc has three turns:

  1. Classical NLP (Chapter 13). Tokens, TF-IDF, classification, sentiment, topics. The same supervised-vs-unsupervised distinction from Part IV, now on text. The chapter closes with a gallery of cases where word counts fail.
  2. Embeddings and measured constructs (Chapter 14). Embeddings as a meaning space; semantic search; the bridge to perceptual maps from §11.2. The third article — GPT-as-measurement — is the conceptual heart of the part. It is where "the model measures the construct" replaces "we count proxies for the construct."
  3. RAG, vision, multimodal, LLMs, agents, governance (Chapters 15–22). The operational use of language models — answering from internal knowledge, reading documents and images, drafting actions, evaluating the whole pipeline.

The capstone (§16.5) integrates all of this into a single customer-voice loop that lives next to the customer intelligence loop from §12.4.