Part V

Unstructured Data, Embeddings, and Generative AI

Turning prose and pixels into governed evidence

This part takes on the data that never made it into a warehouse — reviews, tickets, transcripts, invoices, and images — and turns it into evidence a manager can act on. The arc climbs in four steps: Chapter 13 earns trust with transparent word-counting NLP and then catalogs where it breaks; Chapter 14 moves from words to meaning, using embeddings for structure and a language model to score the constructs a manager cares about; Chapter 15 makes embeddings concrete as retrieval, vision, and document extraction; and Chapter 16 reframes the model as a programmable, schema-bound, human-gated component inside a real governance layer. The throughline is a refusal to treat any of it as magic: name the document, choose the representation, state the construct, and inspect what the method discarded.

4 chapters · 19 articles

What you’ll learn

  • Build a transparent text pipeline (tokens, TF-IDF, classifiers, topic models) and recognize the failure modes — sarcasm, negation, polysemy — that force a move to embeddings
  • Use embeddings for semantic search, clustering, brand maps, and drift detection, and direct a language model to score named constructs like intent-to-return or evasiveness instead of a sentiment proxy
  • Stand up retrieval-augmented generation, vision, and document-AI workflows where the boundary work — chunking, confidence thresholds, citations, bias audits — decides whether anything ships
  • Convert a language model into a governed workflow component with schema-enforced JSON output, tool use, a human-approval gate, and an evaluation-and-risk rubric
  • Integrate classification, construct measurement, embedding clusters, and RAG into one monitored customer-voice loop a sponsor can approve

Chapters in this part