Part V · Chapter 15

Retrieval, Vision, and Multimodal Workflows

One shared embedding space, four production patterns: ground the text, see the image, read the document, reason across all of them.

This chapter makes the embedding idea concrete as plumbing: facts pulled from a firm's own indexed documents, pixels turned into searchable vectors, scanned invoices flattened into database rows. It opens with Retrieval-Augmented Generation (RAG), the standard way to keep a model's language ability while replacing its factual knowledge with a re-indexable corpus. It then moves through what a CNN learns, how Vision Transformers and CLIP extend that approach, and how layout-aware document AI extracts the easy 80 percent of invoices and contracts automatically while routing the rest to a human. It closes on multimodal models, where text, image, audio, and video share one space. The recurring lesson: the model is rarely the differentiator. The boundary work of chunking, confidence thresholds, citation-required prompts, and bias audits decides whether anything ships.

Topics covered

the Retrieval-Augmented Generation pipeline
chunking and re-ranking trade-offs
citation-required prompting and grounding failures
CNN feature hierarchies and transfer learning
Vision Transformers and CLIP
the four vision output shapes (label, boxes, mask, embedding)
layout-aware document extraction and OCR
confidence-threshold workflow design
shared-space multimodal search

In this chapter