§15.2
Computer Vision Fundamentals
The same pattern that turned text into a vector — learn a representation that captures meaning — turns images into something a model can act on. A retail shelf is a grid of products. A medical X-ray is a structured field of features. A defective bottle on a manufacturing line has a signature in pixel space. None of these are columns in a warehouse, and all of them are routinely analyzed by vision systems today.
This article is the conceptual baseline for vision AI: what a Convolutional Neural Network (CNN) actually learns, how vision transformers (ViTs) and multimodal models like CLIP extend the picture, the four output shapes a manager will encounter in production, and the industries where vision is already shipping. The next two articles narrow into documents and OCR (§15.3) and multimodal AI (§15.4).
The Executive Question
What can a vision model see, what should it be asked to do, and what are the failure modes we should expect?
The honest version: vision models are not a "magic eye". They are pattern matchers trained on labelled images, and they fail exactly where their training data was thin.
What a CNN Learns
A CNN doesn't look at an image holistically. It slides small filters across the image, layer by layer, and the features those filters respond to become progressively more abstract as the layers deepen. Early layers detect edges and gradients; middle layers detect textures; deep layers detect object parts; the final layer combines those parts into a classification of the whole image.
What a CNN actually learns — a hierarchy of visual features
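To make "applies small filters across the image" concrete, here is a minimal sketch in plain Python. The toy image and the hand-written vertical-edge filter are illustrative assumptions; in a real CNN the filter values are learned during training, not written by hand. As in most deep learning libraries, the operation shown is technically a cross-correlation (the kernel is not flipped).

```python
from typing import List

Image = List[List[float]]

def convolve2d(image: Image, kernel: Image) -> Image:
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0.0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# Toy 5x6 image (assumption): dark left half (0), bright right half (1).
toy_image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

# Hand-written Sobel-like filter that responds to dark-to-bright vertical edges.
vertical_edge = [[-1, 0, 1],
                 [-2, 0, 2],
                 [-1, 0, 1]]

response = convolve2d(toy_image, vertical_edge)
for row in response:
    print(row)  # each row is [0.0, 4.0, 4.0, 0.0]: strong only where the edge sits
```

The response is zero over the flat regions and large only where the window straddles the edge. Stacking many such filters, and feeding their outputs into further filters, is what produces the hierarchy from edges to textures to parts.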
Three managerial implications:
- Generic backbones transfer. A CNN trained on a huge general image corpus (ImageNet) has already learned good filters for edges, textures, and parts. Fine-tuning that backbone on a small domain-specific dataset is the standard way to get a defect detector or a shelf analyzer working without millions of labelled images.
- Layers are interpretable but rough. You can visualize what each layer responds to. You cannot ask the model "why did you classify this as a cat?" and get a clean answer the way a decision tree gives one.
- What the model can see is bounded by its training data. A model that never saw left-handed coffee machines will be unreliable on them.
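The "generic backbones transfer" point can be sketched as follows: keep the feature extractor frozen and train only a small classification head on a handful of domain images. Everything here is a toy assumption. `frozen_backbone` is a hand-written stand-in for a pretrained network (it computes two fixed features), and the "clean" and "scratched" images are synthetic. A real project would load a pretrained CNN or ViT from a deep learning library instead.

```python
import math
from typing import List, Tuple

Image = List[List[float]]

def frozen_backbone(image: Image) -> List[float]:
    """Hypothetical stand-in for a pretrained backbone; it is never updated.

    Returns two fixed features: mean brightness and mean horizontal contrast.
    """
    pixels = [p for row in image for p in row]
    brightness = sum(pixels) / len(pixels)
    diffs = [abs(row[i + 1] - row[i]) for row in image for i in range(len(row) - 1)]
    contrast = sum(diffs) / len(diffs)
    return [brightness, contrast]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_head(features: List[List[float]], labels: List[int],
               epochs: int = 10000, lr: float = 1.0) -> Tuple[List[float], float]:
    """Fit a logistic-regression head on the frozen features (full-batch gradient descent)."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for k, xi in enumerate(x):
                grad_w[k] += error * xi / n
            grad_b += error / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

def predict(image: Image, weights: List[float], bias: float) -> int:
    x = frozen_backbone(image)
    return int(sum(w * xi for w, xi in zip(weights, x)) + bias > 0)

def clean_unit(v: float) -> Image:
    """Synthetic 'good' unit: a uniform 4x4 grey patch."""
    return [[v] * 4 for _ in range(4)]

def scratched_unit(v: float) -> Image:
    """Synthetic 'defective' unit: the same patch with a bright vertical scratch."""
    image = clean_unit(v)
    for row in image:
        row[2] = 1.0
    return image

# Six labelled images in total: 0 = good, 1 = defective.
train_images = [clean_unit(v) for v in (0.4, 0.5, 0.6)] + \
               [scratched_unit(v) for v in (0.4, 0.5, 0.6)]
train_labels = [0, 0, 0, 1, 1, 1]

features = [frozen_backbone(img) for img in train_images]
weights, bias = train_head(features, train_labels)
predictions = [predict(img, weights, bias) for img in train_images]
print(predictions)
```

The design choice to notice is that only the three numbers in the head are trained. That is why a small labelled set can be enough when the backbone's features already separate the classes, and why it fails when they do not.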
Vision Transformers and CLIP
CNNs dominated computer vision through the late 2010s. For many tasks, the current state of the art is the Vision Transformer (ViT), introduced in 2020, alongside multimodal models such as CLIP (Contrastive Language-Image Pretraining).
The key shifts:
- ViTs split an image into a grid of patches and process each patch the way an LLM processes tokens. Same transformer architecture as language models, applied to images.
- CLIP trains an image encoder and a text encoder jointly, so that the embedding of "a photo of a labrador" sits near the embedding of an actual photo of a labrador. The result is a shared embedding space across text and images — the foundation of multimodal AI in §15.4.
- Self-supervised pretraining lets the model learn from billions of unlabelled images by predicting masked patches, contrasting augmentations, or reconstructing inputs. Labelled data is reserved for fine-tuning on the task that matters.
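The first shift, turning an image into a sequence of patch "tokens", can be sketched in a few lines. This is a simplified illustration on a tiny single-channel image: the function name `split_into_patches` is hypothetical, and a real ViT would additionally project each flattened patch through a learned linear layer and add position information.

```python
from typing import List

Image = List[List[float]]

def split_into_patches(image: Image, patch_size: int) -> List[List[float]]:
    """Cut an image into non-overlapping square patches and flatten each one.

    Patches are returned in reading order (left to right, top to bottom),
    which is the sequence a transformer would consume.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[top + i][left + j]
                     for i in range(patch_size)
                     for j in range(patch_size)]
            patches.append(patch)
    return patches

# Toy 4x4 image whose pixel values are simply 0..15, so patches are easy to read.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = split_into_patches(image, patch_size=2)
for token in tokens:
    print(token)
# [0, 1, 4, 5]
# [2, 3, 6, 7]
# [8, 9, 12, 13]
# [10, 11, 14, 15]
```

Four patches of four pixels each play the role that four word tokens would play in a language model.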
A useful generalization: most of what you learned about embeddings in §14.3 transfers. The vision model produces vectors. Vectors near each other in the space correspond to images that are similar. Nearest-neighbour search, clustering, anomaly detection — all the patterns from text — apply to images too.
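A minimal sketch of nearest-neighbour search over image embeddings follows. The three-dimensional vectors and file names are invented for illustration; real embeddings have hundreds of dimensions and come from a trained encoder. In a CLIP-style shared space, the query vector could equally be the embedding of a text prompt.

```python
import math
from typing import Dict, List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query: List[float], catalog: Dict[str, List[float]], k: int = 2) -> List[str]:
    """Return the names of the k catalog items most similar to the query."""
    ranked = sorted(catalog.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical catalog embeddings (made-up numbers, not model output).
catalog = {
    "red_sneaker.jpg": [0.9, 0.1, 0.0],
    "red_boot.jpg":    [0.7, 0.3, 0.1],
    "blue_sofa.jpg":   [0.0, 0.2, 0.9],
    "green_lamp.jpg":  [0.1, 0.9, 0.2],
}

# Hypothetical embedding of a user-uploaded photo.
query = [0.85, 0.15, 0.05]

print(nearest(query, catalog, k=2))  # ['red_sneaker.jpg', 'red_boot.jpg']
```

A linear scan like this is fine for a few thousand items. Larger catalogs typically use an approximate nearest-neighbour index, which is outside the scope of this sketch.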
The Four Output Shapes
A vision model can be asked for several output shapes. The choice depends on the business question.
Four common output shapes from a vision model
| Output shape | Typical business use | Labelling cost |
|---|---|---|
| Class label | Defect / no-defect; product category; brand-safe / unsafe creative. | Lowest — one label per image. |
| Bounding boxes | Count products on shelf; locate logos; detect people in surveillance. | Moderate — bounding box per object per image. |
| Segmentation mask | Inventory by area on shelf; medical region-of-interest; satellite landcover. | Highest — pixel-level annotation. |
| Image embedding | Visual product search; near-duplicate detection; clustering of creative assets. | None — self-supervised pretraining. |
The choice should follow the action. A defect-detection workflow that just sorts good from bad doesn't need bounding boxes. A shelf-analytics workflow that needs to count SKUs does. An e-commerce visual search needs embeddings; a class label is too coarse and a mask is overkill.
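One way to see the difference between the four shapes is to write them down as data structures. The class and field names below are illustrative assumptions, not the output format of any particular library, and the shelf detections are made-up values.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ClassLabel:
    """One answer for the whole image."""
    label: str
    confidence: float

@dataclass
class BoundingBox:
    """One rectangle per detected object, in pixel coordinates."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int
    confidence: float

@dataclass
class SegmentationMask:
    """One value per pixel: 1 where the pixel belongs to `label`, else 0."""
    label: str
    pixels: List[List[int]]

    def area_fraction(self) -> float:
        total = sum(len(row) for row in self.pixels)
        covered = sum(sum(row) for row in self.pixels)
        return covered / total

@dataclass
class ImageEmbedding:
    """A vector with no label at all; meaning comes from distances to other vectors."""
    vector: List[float]

def count_by_label(boxes: List[BoundingBox], min_confidence: float = 0.5) -> Dict[str, int]:
    """Count detections per label, ignoring low-confidence boxes."""
    return dict(Counter(b.label for b in boxes if b.confidence >= min_confidence))

# Defect sorting only needs a single verdict per image.
verdict = ClassLabel(label="defect", confidence=0.97)

# Shelf counting needs one box per product (hypothetical detections).
shelf_boxes = [
    BoundingBox("cola", 10, 20, 60, 120, 0.91),
    BoundingBox("cola", 70, 20, 120, 120, 0.84),
    BoundingBox("juice", 130, 20, 180, 120, 0.77),
    BoundingBox("cola", 190, 20, 240, 120, 0.30),  # below threshold, not counted
]
sku_counts = count_by_label(shelf_boxes)
print(sku_counts)  # {'cola': 2, 'juice': 1}

# Share-of-shelf by area needs a mask.
shelf_share = SegmentationMask("cola", [[0, 1], [1, 1]]).area_fraction()
print(shelf_share)  # 0.75
```

The structures grow in size from one label, to a few rectangles, to a value for every pixel, which mirrors the labelling cost column in the table.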
Where Vision Ships Today
A short tour of established industry applications:
- Manufacturing quality control. Defect detection on production lines. Image classification or segmentation, often with a small fine-tuned model running at the edge. ROI is straightforward: fewer defective units shipped, less manual inspection.
- Retail shelf analytics. Object detection on shop-floor photos. Counts of SKUs, planogram compliance, out-of-stock detection. Increasingly real-time via in-store cameras.
- Medical imaging. Classification (disease / no disease), segmentation (tumour boundaries), measurement. The dominant use of vision AI in healthcare; regulated and rigorously validated.
- Autonomous vehicles. Object detection + segmentation + tracking, fused with LiDAR and radar. The technical frontier of vision AI.
- Agriculture. Drone and satellite imagery for crop monitoring, pest detection, yield prediction. Often paired with multispectral input.
- Security and surveillance. Anomaly detection, facial recognition, person tracking. Where ethics and regulation matter most.
- Document processing. OCR + layout understanding. The subject of §15.3.
- Content moderation. Brand-safe / unsafe classification on user-generated media. Operates at platform scale.
- Visual product search. Embed the catalog; embed the user-uploaded image; nearest-neighbour. The foundation of "shop the look" features.
The pattern across these: vision works best where the concept the team wants to detect is visually consistent and the labelling cost is manageable.
Self-Supervised Learning, Briefly
Why care about self-supervised learning (SSL)? Because labelling images is expensive. A medical-imaging dataset of 100,000 labelled scans costs more than most teams can fund. Self-supervised pretraining lets the model learn structure from unlabelled images first (by predicting masked patches, by matching different augmentations of the same image, or by reconstructing inputs), so that it then needs only a small labelled set to fine-tune to the actual task.
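The key idea is that the training signal comes from the image itself. A minimal sketch of the masked-patch variant is below: some patches are hidden, and the hidden content becomes the target the model must predict. The function name, the toy patch values, and the 75% mask ratio are assumptions for illustration; no model is trained here, only the input and target pair is built.

```python
import random
from typing import Dict, List, Optional, Tuple

Patch = List[float]

def make_masked_example(
    patches: List[Patch], mask_ratio: float, rng: random.Random
) -> Tuple[List[Optional[Patch]], Dict[int, Patch]]:
    """Hide a random subset of patches and return (model input, prediction targets).

    Hidden patches appear as None in the input; `targets` maps each hidden
    position to the original patch the model would be asked to reconstruct.
    """
    n_masked = max(1, round(len(patches) * mask_ratio))
    masked_positions = set(rng.sample(range(len(patches)), n_masked))
    visible = [None if i in masked_positions else patch
               for i, patch in enumerate(patches)]
    targets = {i: patches[i] for i in sorted(masked_positions)}
    return visible, targets

# Four toy patches from one unlabelled image (made-up pixel values).
patches = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

visible, targets = make_masked_example(patches, mask_ratio=0.75, rng=random.Random(0))
print("model sees:   ", visible)
print("must predict: ", targets)
```

No human supplied a label at any point, which is why this style of pretraining scales to image collections that would be too expensive to annotate.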
The economic implication: small teams in specialized domains can now train competitive vision models, because the heavy lifting happens on unlabelled data that already exists.
Bias and Ethics in Vision
Vision AI has documented failure modes that take a different form from those of text models:
- Skin-tone bias in facial recognition. Models trained on unrepresentative datasets have substantially higher error rates on darker skin tones. Documented since the Gender Shades study (2018), which audited commercial gender classifiers.
- Surveillance and consent. Vision-AI deployments often touch privacy law (GDPR, CCPA, sector-specific). The architecture must support data minimization and audit.
- Defect detection bias by lighting. Models trained in one factory's lighting conditions fail in another's. The fix is data diversity, not a bigger model.
- Adversarial examples. Tiny pixel-level perturbations can flip a vision model's prediction. Robustness is a real concern for security-critical systems.
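The adversarial-example point can be illustrated on the simplest possible classifier. The sketch below uses a toy linear scorer with made-up weights and nudges every pixel by a small amount in the direction that lowers the score, in the style of the fast gradient sign method. Real attacks target deep networks and compute the gradient through the whole model; the names and numbers here are illustrative assumptions.

```python
from typing import List

def score(x: List[float], w: List[float], b: float) -> float:
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x: List[float], w: List[float], b: float) -> int:
    return int(score(x, w, b) > 0)

def sign(v: float) -> int:
    return (v > 0) - (v < 0)

def sign_perturbation(x: List[float], w: List[float], b: float, epsilon: float) -> List[float]:
    """Shift each pixel by at most `epsilon` toward the opposite class.

    For a linear scorer the gradient of the score with respect to the input
    is simply `w`, so the sign of each weight gives the step direction.
    Pixels are clipped to the valid range [0, 1].
    """
    direction = -1 if score(x, w, b) > 0 else 1
    return [min(1.0, max(0.0, xi + direction * epsilon * sign(wi)))
            for xi, wi in zip(x, w)]

# Toy 16-pixel "image" and hypothetical classifier weights.
w = [1.0, -1.0] * 8
b = 0.0
x = [0.55, 0.45] * 8

x_adv = sign_perturbation(x, w, b, epsilon=0.06)

print(predict(x, w, b))      # 1
print(predict(x_adv, w, b))  # 0
print(max(abs(a - o) for a, o in zip(x_adv, x)))  # about 0.06 per pixel
```

No single pixel moved by more than 0.06 on a 0-to-1 scale, yet the prediction flipped, because many small changes all push in the same direction.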
For business deployments, the right governance posture is the same as in §22: name the failure modes, evaluate by subgroup, build a monitoring loop, set a human-review path for high-stakes decisions.
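"Evaluate by subgroup" has a simple arithmetic core: compute the error rate separately for each group and compare it with the overall rate. The sketch below uses invented records and neutral group names; the 0.1 tolerance is an arbitrary assumption, and choosing the real threshold is a governance decision rather than a technical one.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, int, int]  # (subgroup, true label, predicted label)

def error_rate_by_group(records: List[Record]) -> Dict[str, float]:
    """Fraction of wrong predictions within each subgroup."""
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        if y_true != y_pred:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}

def flag_groups(rates: Dict[str, float], overall: float, tolerance: float) -> List[str]:
    """Groups whose error rate exceeds the overall rate by more than `tolerance`."""
    return sorted(g for g, rate in rates.items() if rate - overall > tolerance)

# Hypothetical evaluation records (not real data).
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 0), ("group_b", 1, 0), ("group_b", 1, 1),
]

rates = error_rate_by_group(records)
overall = sum(t != p for _, t, p in records) / len(records)
flagged = flag_groups(rates, overall, tolerance=0.1)

print(rates)    # {'group_a': 0.0, 'group_b': 0.5}
print(overall)  # 0.25
print(flagged)  # ['group_b']
```

The overall error rate of 25% looks tolerable on its own; the breakdown shows that all of the errors fall on one group, which is the pattern an aggregate metric hides.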