§15.4

Multimodal AI

A customer support team has the call recording, the transcript, the CRM notes, and the photo of the broken espresso machine the customer sent. A merchandising team has the shelf photo, the planogram document, the SKU list, and the regional sales numbers. A creative team has the ad video, the campaign brief, the music track, and the platform analytics. Every one of these workflows mixes modalities — text, images, audio, video, structured data — and every one of them is a candidate for multimodal AI.

The unifying idea is the same one that powered embeddings in §14.3: place different kinds of inputs into a shared meaning space so that a single nearest-neighbour query or a single language model can act across all of them. This article surveys what that enables in business, and where the rough edges still are.

The Executive Question

When the evidence a decision needs comes from multiple modalities — text, image, audio, video — can we use one system to reason across all of them?

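The answer is increasingly yes. The architecture is the shared embedding space introduced by models like CLIP and extended to audio (CLAP) and video (VideoCLIP). Whisper (speech-to-text) and V-JEPA (self-supervised video representation learning) are often mentioned alongside them, although neither is a text-aligned contrastive model in the CLIP sense.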

The Shared-Space Architecture

A multimodal model trains an encoder per modality, with a contrastive objective that pulls semantically related items together in the embedding space. The result: the vector for the text "labrador puppy" sits near the vector for an actual photo of a labrador puppy.
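The contrastive objective can be sketched in a few lines. This is a minimal illustration, assuming a CLIP-style symmetric cross-entropy over a batch of matched text/image pairs. The function names, the temperature value, and the two-dimensional toy vectors are assumptions for illustration and do not come from any specific library.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def contrastive_loss(text_vecs, image_vecs, temperature=0.07):
    """Symmetric cross-entropy where text_vecs[i] is the true match of image_vecs[i]."""
    t = [normalize(v) for v in text_vecs]
    im = [normalize(v) for v in image_vecs]
    n = len(t)
    # sim[i][j]: cosine similarity of text i and image j, divided by the temperature.
    sim = [[sum(a * b for a, b in zip(t[i], im[j])) / temperature
            for j in range(n)] for i in range(n)]

    def cross_entropy(rows):
        # For row i the correct "class" is column i (the matched pair).
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / len(rows)

    cols = [list(c) for c in zip(*sim)]  # image-to-text direction
    return 0.5 * (cross_entropy(sim) + cross_entropy(cols))

# Toy batch: two captions and two images in a 2-d space.
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[0.9, 0.1], [0.1, 0.9]]   # each image near its caption
swapped_images = [[0.1, 0.9], [0.9, 0.1]]   # each image near the wrong caption

loss_aligned = contrastive_loss(texts, aligned_images)
loss_swapped = contrastive_loss(texts, swapped_images)
```

The loss is small when matched pairs already sit close together and large when they do not, which is the pressure that shapes the shared space during training.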

A multimodal model — different inputs, one shared meaning space, many use cases

Figure 1. A multimodal model takes four different input types and places them in a shared embedding space. The same nearest-neighbour primitive from §14.3 now works across modalities — search by text, find an image; search by audio, find a similar clip; search by image, find related video.

Three properties make the architecture useful:

Cross-modal search. A text query retrieves images, or an image retrieves related text. The standard example is product image search: a query such as "show me dark espresso mugs" returns matching product photos.
Cross-modal generation. Text-to-image (Midjourney, DALL-E, Imagen), image-to-text (captioning, alt-text), audio-to-text (Whisper). Each is the model translating between modalities.
Joint reasoning. Modern frontier models (GPT-4o, Gemini, Claude with vision) can take a prompt that includes text and images, and in some cases audio, and produce a single response that reasons across all of them.

The third property is the most operationally important. A support agent can hand the model an image of a defective product together with the customer's message, ask "what should I tell them?", and get a single answer that integrates both.

Business Use Cases

A non-exhaustive tour of what's shipping in 2026.

Table 1. Multimodal AI use cases by industry. Each row combines at least two modalities; the business value is usually in the integration, not in any single modality alone.

Use case	Modalities	Example
Visual product search	image → text → catalog	User uploads a photo of a shoe; system finds similar SKUs.
Automated product descriptions	image → text	E-commerce platform generates alt-text and listing copy from product photos.
Sales-call coaching	audio → text → summary	Whisper transcribes; LLM extracts objections, commitments, next steps.
Meeting summarization	audio + video + text	Notetaker captures speakers, slides shown, action items.
Retail shelf monitoring	video + structured planogram	In-store cameras check planogram compliance and detect out-of-stock items in real time.
Visual social-media monitoring	image + text	Surface posts where the brand logo appears, even when the brand isn't named in the caption.
Compliance review of marketing creative	image + video + text	Flag ads that violate brand guidelines or regulatory restrictions before launch.
Support ticket triage with media	image + text	Customer attaches a photo of a defect; the ticket is automatically routed and a product replacement is triggered.
Audio brand listening	audio → embedding	Identify podcast and video mentions of the brand without explicit keyword matching.

A pattern: most of these were previously possible but expensive — they required custom integration of separate models. Multimodal foundation models collapse the integration step into a single API call.

Three Practical Modalities Beyond Text-and-Image

Brief notes on the three less-discussed modalities:

Audio. OpenAI's Whisper made high-quality transcription a commodity. Beyond transcription, audio embeddings (CLAP and successors) let you do similarity search and classification on raw audio — useful for podcast monitoring, ad-spot tracking, and acoustic scene analysis. Speaker diarization ("who said what") is the next layer; still imperfect but production-viable.
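One simple way to get "who said what" is to combine timestamped transcript segments with a diarizer's speaker turns, assigning each segment to the turn it overlaps most. The sketch below uses hypothetical data structures. They are not the output format of Whisper or of any particular diarization tool.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker turn it overlaps most."""
    labelled = []
    for seg in segments:
        best = max(
            turns,
            key=lambda t: overlap(seg["start"], seg["end"], t["start"], t["end"]),
        )
        has_overlap = overlap(seg["start"], seg["end"], best["start"], best["end"]) > 0
        speaker = best["speaker"] if has_overlap else "unknown"
        labelled.append({**seg, "speaker": speaker})
    return labelled

# Hypothetical transcript segments (from a speech-to-text step).
segments = [
    {"start": 0.0, "end": 4.0, "text": "Thanks for calling Bean & Basket."},
    {"start": 4.2, "end": 9.0, "text": "My espresso machine arrived cracked."},
]
# Hypothetical speaker turns (from a diarization step).
turns = [
    {"start": 0.0, "end": 4.1, "speaker": "agent"},
    {"start": 4.1, "end": 9.5, "speaker": "customer"},
]

labelled = assign_speakers(segments, turns)
```

Overlapping speech and very short turns are where this kind of assignment tends to go wrong, which is one reason diarization remains imperfect in practice.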

Video. Video is treated either as a stack of images or as a separate modality. Two patterns dominate: extract keyframes and run image analysis (cheap, works well for static scenes), or use a video-aware model that reasons about motion (more expensive, necessary for action recognition). In-store analytics, content moderation at scale, and ad-creative analysis all run on this.

Structured data alongside text. Less glamorous, increasingly important: an LLM that can read a CSV or query a database alongside its prose context. Tools like Code Interpreter, function calling, and "computer use" agents (§16.3) make this routine. The boundary between "text + structured" and "agentic" is blurring fast.
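The function-calling pattern looks roughly like this: the application exposes a small set of functions, the model emits a structured request naming one of them, and the application runs it and returns the result. The sketch below simulates the model's request as a JSON string. The request shape, the `units_sold` function, and the sales data are all hypothetical and do not follow any specific vendor's format.

```python
import csv
import io
import json

# Hypothetical regional sales data, inlined so the example is self-contained.
SALES_CSV = """region,sku,units
north,MUG-01,120
south,MUG-01,80
north,KETTLE-02,30
"""

def units_sold(sku, region=None):
    """Sum units for a SKU, optionally restricted to one region."""
    rows = csv.DictReader(io.StringIO(SALES_CSV))
    return sum(
        int(r["units"])
        for r in rows
        if r["sku"] == sku and (region is None or r["region"] == region)
    )

# Registry of functions the model is allowed to request.
TOOLS = {"units_sold": units_sold}

def run_tool_call(call_json):
    """Execute a model-issued tool request and return a JSON reply."""
    call = json.loads(call_json)
    result = TOOLS[call["name"]](**call["arguments"])
    return json.dumps({"name": call["name"], "result": result})

# Stand-in for what a model might emit when asked "how many MUG-01 did we sell?"
model_request = '{"name": "units_sold", "arguments": {"sku": "MUG-01"}}'
tool_reply = run_tool_call(model_request)
```

The reply is fed back into the model's context, so the final prose answer rests on a computed number rather than on the model's guess.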

A Worked Example: Visual Product Search

The use case: Bean & Basket runs an online store with several hundred SKUs (coffee mugs, brewing equipment, branded merchandise). A customer uploads a photo of a mug they saw in a café and asks "do you sell something like this?"

The pipeline:

At index time. Embed every product image in the catalog with a multimodal model (CLIP-style image encoder). Store the vectors.
At query time. Embed the user-uploaded photo with the same model.
Retrieve. Top-k nearest catalog items by cosine similarity.
Display. Return the matches with prices and links.
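The four steps above can be sketched as follows. The embedding step is the part a real system would delegate to a pretrained CLIP-style image encoder. Here it is replaced by hand-written three-dimensional vectors, and the SKU names and function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))

def build_index(catalog_embeddings):
    """Index time: store one normalized vector per catalog item."""
    return {sku: normalize(vec) for sku, vec in catalog_embeddings.items()}

def search(index, query_vec, k=2):
    """Query time: rank catalog items by cosine similarity to the query."""
    q = normalize(query_vec)
    scored = [(cosine(q, vec), sku) for sku, vec in index.items()]
    scored.sort(reverse=True)
    return [(sku, round(score, 3)) for score, sku in scored[:k]]

# Stand-ins for image-encoder outputs (a real encoder returns hundreds of dimensions).
catalog_embeddings = {
    "MUG-BLACK": [0.9, 0.1, 0.0],
    "MUG-WHITE": [0.7, 0.6, 0.1],
    "KETTLE": [0.0, 0.2, 0.9],
}
index = build_index(catalog_embeddings)

# Stand-in for the embedding of the customer's uploaded photo.
uploaded_photo_vec = [0.85, 0.2, 0.05]
matches = search(index, uploaded_photo_vec, k=2)
```

With a few hundred SKUs a linear scan like this is fast enough. A larger catalog would typically move the stored vectors into an approximate nearest-neighbour index.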

That's it: a few hundred lines of code, no labelled training data, and a pretrained multimodal model doing the heavy lifting. The same pattern works for "find this on our site" features across retail and content platforms.

A variant: combine the image embedding with a text query ("show me mugs like this, but in white"). The user's text adjusts the query vector before retrieval, so the system returns visually similar items steered toward the text constraint. This is a common design pattern in modern visual search engines.
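One simple way to let the text adjust the query vector is a weighted blend of the image and text embeddings, which only makes sense because both live in the same space. This is a sketch under that assumption. The blend weight, the toy vectors, and the function names are illustrative, and production systems may compose queries differently.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def blend(image_vec, text_vec, text_weight=0.5):
    """Mix normalized image and text embeddings into one query vector."""
    img, txt = normalize(image_vec), normalize(text_vec)
    mixed = [(1 - text_weight) * a + text_weight * b for a, b in zip(img, txt)]
    return normalize(mixed)

def top_match(catalog, query_vec):
    """Return the catalog key whose vector has the highest cosine similarity."""
    q = normalize(query_vec)
    return max(
        catalog,
        key=lambda sku: sum(a * b for a, b in zip(q, normalize(catalog[sku]))),
    )

# Toy 3-d embeddings; a real shared space has hundreds of dimensions.
catalog = {
    "MUG-BLACK": [0.9, 0.1, 0.0],
    "MUG-WHITE": [0.7, 0.6, 0.1],
    "KETTLE": [0.0, 0.2, 0.9],
}
photo_vec = [0.9, 0.1, 0.0]   # stand-in for the uploaded photo (a dark mug)
white_vec = [0.3, 0.9, 0.0]   # stand-in for the text "in white"

image_only = top_match(catalog, photo_vec)
image_plus_text = top_match(catalog, blend(photo_vec, white_vec))
```

The weight controls how far the text is allowed to pull the query away from the photo. Set it too high and the results stop resembling the uploaded image.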

The Quiet Revolution: Vision-Language Models in the Browser

Frontier multimodal models running in the cloud (GPT-4o, Gemini, Claude) have gotten the headlines. The quieter shift is that smaller, capable multimodal models now run locally — in browsers, on phones, on edge devices. A merchandiser walking a store can take a photo, ask an on-device model "is this planogram-compliant?", and get an answer without sending the image anywhere.

The implication for governance: privacy-sensitive multimodal workflows that previously required sending media to a cloud API can increasingly run on-device. The privacy story changes; the cost story changes; the latency story changes. Worth tracking.

Concept check

Three questions spanning the chapter — RAG, document AI, and the shared-space idea behind multimodal models.

1. A team is choosing between fine-tuning a model on company documents vs. building a RAG system. The documents change weekly. Which is more operationally sustainable?
2. The single most important tunable parameter in a production document AI pipeline is:
3. The defining property of a multimodal foundation model (CLIP and successors) is: