Part V · Chapter 15
Retrieval, Vision, and Multimodal Workflows
One shared embedding space, four production patterns: ground the text, see the image, read the document, reason across all of them.
This chapter makes the embedding idea concrete as plumbing: facts pulled from a firm's own indexed documents, pixels turned into searchable vectors, scanned invoices flattened into database rows. It opens with Retrieval-Augmented Generation (RAG), the standard way to keep a model's language ability while drawing its facts from a corpus that can be re-indexed, then moves through what a CNN learns, how Vision Transformers and CLIP extend it, and how layout-aware document AI extracts the easy 80 percent of invoices and contracts automatically while routing the rest to a human. It closes on multimodal models, where text, image, audio, and video share one embedding space. The recurring lesson: the model is rarely the differentiator; the boundary work of chunking, confidence thresholds, citation-required prompts, and bias audits decides whether anything ships.
In this chapter
- 15.1 Retrieval-Augmented Generation: Wires chunking, embeddings, and an LLM into a grounded Q&A system, and shows how to catch missing-context and ungrounded-generation failures.
- 15.2 Computer Vision Fundamentals: Explains what CNNs, ViTs, and CLIP learn, the four output shapes to choose from, and where vision already ships in industry.
- 15.3 OCR and Document AI: Walks the six-stage scan-to-database pipeline for invoices and contracts, with the confidence threshold as the central design decision.
- 15.4 Multimodal AI: Surveys models that place text, image, audio, and video in one shared embedding space for cross-modal search and joint reasoning.