§15.4

Multimodal AI

A customer support team has the call recording, the transcript, the CRM notes, and the photo of the broken espresso machine the customer sent. A merchandising team has the shelf photo, the planogram document, the SKU list, and the regional sales numbers. A creative team has the ad video, the campaign brief, the music track, and the platform analytics. Every one of these workflows mixes modalities — text, images, audio, video, structured data — and every one of them is a candidate for multimodal AI.

The unifying idea is the same one that powered embeddings in §14.3: place different kinds of inputs into a shared meaning space so that a single nearest-neighbour query or a single language model can act across all of them. This article surveys what that enables in business, and where the rough edges still are.


The Executive Question

When the evidence a decision needs comes from multiple modalities — text, image, audio, video — can we use one system to reason across all of them?

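The answer is increasingly yes. The architecture is the shared embedding space introduced by models like CLIP and extended to audio (CLAP) and video (VideoCLIP). Related models such as Whisper (speech-to-text) and V-JEPA (self-supervised video representations) handle a single modality rather than a text-aligned shared space, but they often feed the same pipelines.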


The Shared-Space Architecture

A multimodal model trains an encoder per modality, with a contrastive objective that pulls semantically related items together in the embedding space. The result: the vector for the text "labrador puppy" sits near the vector for an actual photo of a labrador puppy.
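To make the contrastive objective concrete, here is a minimal sketch of a symmetric, CLIP-style loss over a batch of matched (text, image) pairs. It assumes the encoders have already produced the vectors; the toy vectors, the function names, and the temperature value of 0.07 are illustrative assumptions, not details taken from this chapter.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_on_diagonal(logits):
    """Mean of -log softmax(row)[i], treating item i as the correct match for row i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_loss(text_vecs, image_vecs, temperature=0.07):
    """Symmetric loss: text-to-image plus image-to-text, averaged."""
    t = [normalize(v) for v in text_vecs]
    im = [normalize(v) for v in image_vecs]
    logits = [[dot(a, b) / temperature for b in im] for a in t]
    transposed = [list(col) for col in zip(*logits)]
    return (cross_entropy_on_diagonal(logits)
            + cross_entropy_on_diagonal(transposed)) / 2

# Toy 3-dimensional "embeddings" (made up for illustration).
image_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_text = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled_text = [aligned_text[1], aligned_text[2], aligned_text[0]]

loss_aligned = contrastive_loss(aligned_text, image_vecs)
loss_shuffled = contrastive_loss(shuffled_text, image_vecs)
print(loss_aligned < loss_shuffled)
```

The loss is low when each caption sits closest to its own image and high when the pairs are mismatched. Training pushes the encoders toward the low-loss arrangement.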

[Figure: A multimodal model: different inputs, one shared meaning space, many use cases. Inputs: Text (reviews, tickets, docs), Image (shelf, product, ad), Audio (sales call, podcast), Video (in-store, ad spot). Center: a shared embedding space that holds text, image, audio, and video in the same coordinates. Outputs: product image search ("dark espresso mugs"), caption generation (image → alt-text), sales-call coaching (audio → summary), shelf monitoring (video → KPI).]
Figure 1. A multimodal model takes four different input types and places them in a shared embedding space. The same nearest-neighbour primitive from §14.3 now works across modalities — search by text, find an image; search by audio, find a similar clip; search by image, find related video.

Three properties make the architecture useful:

  • Cross-modal search. A text query retrieves images, or an image retrieves related text. The standard "show me dark espresso mugs" → product image search.
  • Cross-modal generation. Text-to-image (Midjourney, DALL-E, Imagen), image-to-text (captioning, alt-text), audio-to-text (Whisper). Each is the model translating between modalities.
  • Joint reasoning. Modern frontier models (GPT-4o, Gemini, Claude with vision) can take a single prompt that combines text and images (and, for models that accept it, audio) and produce one response that reasons across all of them.

The third property is the most operationally important. A support agent can hand the model an image of a defective product and the customer's message, then ask "what should I tell them?" and get a single answer that integrates both.


Business Use Cases

A non-exhaustive tour of what's shipping in 2026.

Table 1. Multimodal AI use cases by industry. Each row combines at least two modalities; the business value is usually in the integration, not in any single modality alone.
| Use case | Modalities | Example |
|---|---|---|
| Visual product search | image → text → catalog | User uploads a photo of a shoe; system finds similar SKUs. |
| Automated product descriptions | image → text | E-commerce platform generates alt-text and listing copy from product photos. |
| Sales-call coaching | audio → text → summary | Whisper transcribes; LLM extracts objections, commitments, next steps. |
| Meeting summarization | audio + video + text | Notetaker captures speakers, slides shown, action items. |
| Retail shelf monitoring | video + structured planogram | In-store cameras check planogram compliance and out-of-stock in real time. |
| Visual social-media monitoring | image + text | Surface posts where the brand logo appears, even when the brand isn't named in caption. |
| Compliance review of marketing creative | image + video + text | Flag ads that violate brand guidelines or regulatory restrictions before launch. |
| Support ticket triage with media | image + text | Customer attaches a photo of a defect; ticket is automatically routed and product replacement triggered. |
| Audio brand listening | audio → embedding | Identify podcast and video mentions of the brand without explicit keyword matching. |

A pattern: most of these were previously possible but expensive — they required custom integration of separate models. Multimodal foundation models collapse the integration step into a single API call.


Three Practical Modalities Beyond Text-and-Image

Brief notes on the three less-discussed modalities:

Audio. OpenAI's Whisper made high-quality transcription a commodity. Beyond transcription, audio embeddings (CLAP and successors) let you do similarity search and classification on raw audio — useful for podcast monitoring, ad-spot tracking, and acoustic scene analysis. Speaker diarization ("who said what") is the next layer; still imperfect but production-viable.

Video. Video is treated either as a stack of images or as a modality in its own right. Two patterns dominate: extract keyframes and run image analysis (cheap, works well for static scenes), or use a video-aware model that reasons about motion (more expensive, necessary for action recognition). In-store analytics, content moderation at scale, and ad-creative analysis all run on this.
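The keyframe pattern can be sketched in a few lines. This is one simple heuristic, assuming each frame has already been decoded into a flat list of pixel intensities: keep a frame only when it differs enough from the last kept frame. The function names, the threshold of 10.0, and the toy frames are assumptions made for illustration. Production systems typically use a video library and a more robust scene-change detector.

```python
def mean_abs_diff(a, b):
    """Average per-pixel absolute difference between two equal-length frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def select_keyframes(frames, threshold=10.0):
    """Return indices of frames that differ enough from the last kept frame."""
    if not frames:
        return []
    kept = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept

# Five toy 4-pixel "frames": two near-identical pairs and one scene change.
frames = [
    [10, 10, 10, 10],
    [11, 10, 10, 9],
    [200, 190, 10, 10],
    [201, 190, 11, 10],
    [0, 0, 0, 0],
]
keyframes = select_keyframes(frames)
print(keyframes)  # [0, 2, 4]
```

Only the selected frames would then be passed to the image model, which is where the cost saving comes from.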

Structured data alongside text. Less glamorous, increasingly important: an LLM that can read a CSV or query a database alongside its prose context. Tools like Code Interpreter, function calling, and "computer use" agents (§16.3) make this routine. The boundary between "text + structured" and "agentic" is blurring fast.
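As a rough illustration of the function-calling pattern, the sketch below dispatches a model-emitted tool request to a local function that reads a CSV. The JSON shape, the `units_sold` tool, and the sales data are all hypothetical. Each vendor defines its own tool-call format, and the model call itself is replaced here by a hard-coded string.

```python
import csv
import io
import json

# Hypothetical regional sales data standing in for a real file or database.
SALES_CSV = """region,sku,units
north,MUG-01,120
south,MUG-01,80
north,GRINDER-03,15
"""

def units_sold(sku: str) -> int:
    """Sum units across regions for one SKU."""
    rows = csv.DictReader(io.StringIO(SALES_CSV))
    return sum(int(row["units"]) for row in rows if row["sku"] == sku)

# Registry of functions the model is allowed to request.
TOOLS = {"units_sold": units_sold}

def dispatch(tool_call_json: str):
    """Parse a tool request and run the matching registered function."""
    call = json.loads(tool_call_json)
    fn = TOOLS[call["name"]]  # raises KeyError for tools not in the registry
    return fn(**call["arguments"])

# Assumed model output (illustrative shape, not any specific vendor's format).
model_output = '{"name": "units_sold", "arguments": {"sku": "MUG-01"}}'
result = dispatch(model_output)
print(result)  # 200
```

The result would be passed back to the model as context for its prose answer. Restricting calls to an explicit registry keeps the model from invoking arbitrary code.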


The use case: Bean & Basket runs an online store with several hundred SKUs (coffee mugs, brewing equipment, branded merchandise). A customer uploads a photo of a mug they saw in a café and asks "do you sell something like this?"

The pipeline:

  1. At index time. Embed every product image in the catalog with a multimodal model (CLIP-style image encoder). Store the vectors.
  2. At query time. Embed the user-uploaded photo with the same model.
  3. Retrieve. Top-k nearest catalog items by cosine similarity.
  4. Display. Return the matches with prices and links.
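The four steps above can be sketched as follows. The vectors are made-up three-dimensional stand-ins for what an image encoder would produce, and the SKU names and function names are hypothetical. A real system would call a pretrained encoder and, for larger catalogs, a vector index.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_index(catalog_vectors):
    """Step 1: store one normalized vector per catalog item."""
    return {sku: normalize(vec) for sku, vec in catalog_vectors.items()}

def search(index, query_vec, k=2):
    """Steps 2 and 3: normalize the query, then take top-k by cosine similarity."""
    q = normalize(query_vec)
    scored = [(sku, dot(q, vec)) for sku, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Assumed encoder outputs for three catalog images (illustrative values).
catalog_vectors = {
    "MUG-DARK-01": [0.9, 0.1, 0.0],
    "MUG-WHITE-02": [0.7, 0.1, 0.6],
    "GRINDER-03": [0.0, 1.0, 0.1],
}
index = build_index(catalog_vectors)

# Assumed encoder output for the customer's uploaded photo.
photo_vec = [0.85, 0.15, 0.05]
results = search(index, photo_vec, k=2)

# Step 4: display the matches (prices and links omitted in this sketch).
for sku, score in results:
    print(sku, round(score, 3))
```

A brute-force scan like this is adequate for a few hundred SKUs. Approximate nearest-neighbour indexes become worthwhile only at much larger catalog sizes.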

That is the whole pipeline: a few hundred lines of code, no labelled training data, and a pretrained multimodal model doing the heavy lifting. The same pattern works for "find this on our site" features across retail and content platforms.

A variant: combine the image embedding with a text query ("show me mugs like this, but in white"). The user's text adjusts the query vector before retrieval, so the system returns visually similar items that also satisfy the text constraint. This is a common design pattern in modern visual search engines.
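One simple way to let text adjust the query vector is to add a weighted text embedding to the image embedding and renormalize. The sketch below assumes both encoders share a space. The vectors, the SKU names, the `text_weight` parameter, and the idea that the third dimension loosely encodes "white" are all illustrative assumptions. Real systems may instead use a learned fusion model or a hard filter on catalog metadata.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def compose_query(image_vec, text_vec, text_weight=1.0):
    """Shift the image query toward the text constraint, then renormalize."""
    img = normalize(image_vec)
    txt = normalize(text_vec)
    return normalize([i + text_weight * t for i, t in zip(img, txt)])

def rank(catalog, query_vec):
    """Rank catalog items by cosine similarity to a unit-length query."""
    scored = [(sku, dot(query_vec, normalize(vec))) for sku, vec in catalog.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Assumed encoder outputs (illustrative values).
catalog = {
    "MUG-DARK-01": [0.9, 0.1, 0.0],
    "MUG-WHITE-02": [0.7, 0.1, 0.6],
    "GRINDER-03": [0.0, 1.0, 0.1],
}
photo_vec = [0.85, 0.15, 0.05]      # photo of a dark mug
white_text_vec = [0.0, 0.0, 1.0]    # text "in white"

image_only = rank(catalog, normalize(photo_vec))
image_plus_text = rank(catalog, compose_query(photo_vec, white_text_vec))
print(image_only[0][0], image_plus_text[0][0])
```

With the image alone the dark mug ranks first. With the text added, the white mug moves to the top while the grinder stays last, which is the behaviour the variant describes.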


The Quiet Revolution: Vision-Language Models in the Browser

Frontier multimodal models running in the cloud (GPT-4o, Gemini, Claude) have gotten the headlines. The quieter shift is that smaller, capable multimodal models now run locally — in browsers, on phones, on edge devices. A merchandiser walking a store can take a photo, ask an on-device model "is this planogram-compliant?", and get an answer without sending the image anywhere.

The implication for governance: privacy-sensitive multimodal workflows that previously required sending media to a cloud API can increasingly run on-device. The privacy story changes; the cost story changes; the latency story changes. Worth tracking.



Concept check

Three questions spanning the chapter — RAG, document AI, and the shared-space idea behind multimodal models.

  1. A team is choosing between fine-tuning a model on company documents vs. building a RAG system. The documents change weekly. Which is more operationally sustainable?
  2. The single most important tunable parameter in a production document AI pipeline is:
  3. The defining property of a multimodal foundation model (CLIP and successors) is: