§14.3
Embeddings and Semantic Search
If a document is a sparse vector of word counts, "refund" and "money back" are orthogonal: they share no words, and the bag-of-words model sees no relationship. To a customer service team they are obviously the same thing. That representational gap is the entire reason embeddings exist. An embedding is a dense vector (a few hundred or a few thousand numbers) placed in a learned coordinate system where similar meanings sit near each other, even when they share no vocabulary.
This article does two things: it introduces embeddings as a coordinate system for meaning, then puts that coordinate system to work in its most operationally valuable use — semantic search, finding documents by meaning rather than vocabulary. The math is light; the consequences are large. Most of the modern toolkit — semantic search, RAG, multimodal AI, embedding-based recommenders — rests on the idea that meaning has coordinates and those coordinates can be computed.
The Executive Question
When we have text, images, audio, or documents, can we represent each one as a vector such that similar things are close together — even when they share nothing on the surface — and then retrieve, cluster, and position them by that similarity?
The answer is yes, and the rest of Part V is what that yes lets us build.
The Coordinate System Analogy
The most useful intuition is geographic. A city in a coordinate system has a latitude and longitude. Two cities close on the map are close in the world. The map is a representation — a few numbers per city — that preserves something about the relationships of the underlying cities.
An embedding does the same thing for documents (or words, or images). Each document gets a vector. Documents whose meanings are close sit near each other in the vector space. The "map" is now hundreds or thousands of dimensions instead of two, so we can't draw it directly — but the property that matters (close meanings → close vectors) is the same.
A small worked example. Suppose the corpus has these phrases:
- refund
- return
- money back
- cancel order
- delivery delay
- order arrived late
- driver was slow
- app crashed
- cannot sign in
Bag-of-words puts them in nine nearly orthogonal directions: apart from "order," which appears in both "cancel order" and "order arrived late," no phrase shares a word with any other. An embedding places them in roughly three clusters:
- A "refund/return" cluster.
- A "delivery delay" cluster.
- An "app technical" cluster.
That clustering wasn't programmed in. It was learned from the training corpus the embedding model saw. The model observed that "refund" and "money back" co-occur with similar surrounding language and therefore mean similar things — and represented them with nearby vectors.
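The bag-of-words side of this comparison can be checked directly. The sketch below uses only the standard library and the nine phrases listed above; it builds a word-count vector per phrase and collects every pair whose dot product is non-zero, meaning every pair that shares at least one word. The helper names (`bag_of_words`, `dot`) are illustrative, not from any particular library.

```python
from itertools import combinations

phrases = [
    "refund", "return", "money back", "cancel order", "delivery delay",
    "order arrived late", "driver was slow", "app crashed", "cannot sign in",
]

# Vocabulary: every distinct word in the corpus, in a fixed order.
vocab = sorted({word for phrase in phrases for word in phrase.split()})

def bag_of_words(phrase):
    """Return the word-count vector of a phrase over the shared vocabulary."""
    words = phrase.split()
    return [words.count(term) for term in vocab]

def dot(u, v):
    """Dot product; zero means the two vectors are orthogonal."""
    return sum(a * b for a, b in zip(u, v))

vectors = {phrase: bag_of_words(phrase) for phrase in phrases}

# Pairs with a non-zero dot product are the only ones bag-of-words sees as related.
overlapping = [
    (p, q) for p, q in combinations(phrases, 2)
    if dot(vectors[p], vectors[q]) > 0
]
print(overlapping)
```

For this corpus the only related pair is "cancel order" and "order arrived late", and the relation rests on the shared word "order" rather than on meaning. "refund" and "money back" remain orthogonal.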
A Picture of the Space
A scatter plot of the embeddings (after reducing to 2D for the page) makes the structure visible.
An embedding space — phrases near each other mean similar things
The query never used the word "refund" outside the bracket. Yet "refund / money back / return / cancel order" surfaced — the embedding captured the intent, not the keywords.
Two reading habits carry forward from §11.3:
- Trust neighbourhoods. The clusters and which phrases sit near each other are real.
- Distrust geometry. The exact distances between clusters depend on the projection. The axes have no business meaning. UMAP/t-SNE caveats apply.
The neighbour panel on the right of the figure is what most embedding applications use. Given a query, return the top-k items by vector distance. That single primitive — "find me documents that mean approximately the same thing as this one" — underwrites semantic search (later in this article), RAG (§15.1), embedding-based segmentation, recommendation, and anomaly detection.
Distance and Similarity
Two vectors in the embedding space have a distance. The two standard choices:
- Euclidean distance — straight-line distance.
- Cosine similarity — the cosine of the angle between vectors. Higher is more similar; ranges from -1 to 1.
For text embeddings, cosine similarity is the default. The reason is that text vectors can differ in magnitude for reasons unrelated to meaning (with some models, longer documents tend to have larger vectors), and cosine similarity ignores magnitude: it asks only about direction in meaning space.
Cosine similarity
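For reference, the standard definition for two vectors, written here as $\mathbf{a}$ and $\mathbf{b}$ (symbols chosen for illustration) with components $a_i$ and $b_i$, is:

```latex
\[
\operatorname{cos\_sim}(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^{2}} \; \sqrt{\sum_{i} b_i^{2}}}
\]
```

Rescaling either vector by a positive constant leaves the value unchanged, which is the sense in which the measure ignores magnitude.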
Practical note: many embedding APIs return vectors already normalized to unit length. In that case Euclidean distance and cosine similarity give the same ranking, because for unit vectors the squared Euclidean distance equals 2 minus twice the cosine similarity.
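A minimal sketch of the practical note, using made-up three-dimensional vectors in place of real embeddings (the numbers and the phrase labels attached to them are assumptions for illustration only). It normalizes each vector to unit length, ranks the documents both ways, and measures how far the values stray from the unit-vector identity between the two measures.

```python
import math

def normalize(u):
    """Scale a vector to unit length."""
    norm = math.hypot(*u)
    return [a / norm for a in u]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Made-up vectors standing in for real embeddings, normalized to unit length.
query = normalize([0.9, 0.1, 0.2])
docs = {
    "money back": normalize([0.8, 0.2, 0.1]),
    "driver was slow": normalize([0.1, 0.9, 0.2]),
    "app crashed": normalize([0.1, 0.2, 0.9]),
}

# Higher cosine similarity is better; lower Euclidean distance is better.
by_cosine = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
by_euclid = sorted(docs, key=lambda name: math.dist(query, docs[name]))

# For unit vectors: squared Euclidean distance = 2 - 2 * cosine similarity.
max_gap = max(
    abs(math.dist(query, v) ** 2 - (2 - 2 * cosine_similarity(query, v)))
    for v in docs.values()
)
print(by_cosine, by_euclid, max_gap)
```

The two orderings coincide, and `max_gap` is zero up to floating-point rounding. Without the normalization step the two rankings are not guaranteed to agree.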
What Gets Embedded
The same idea works across modalities:
| What gets embedded | Representative model family | What the vector is good for |
|---|---|---|
| Words | word2vec, GloVe (classical); BERT subword embeddings (contextual). | Synonym detection, analogies, classical NLP features. |
| Sentences / documents | Sentence-Transformers (open), text-embedding-3 (OpenAI), Cohere embed. | Semantic search, clustering, RAG retrieval. |
| Images | CLIP, DINO, SigLIP image encoders. | Visual similarity, image search, content moderation. |
| Images + text in one space | CLIP and successors. | Text-query → image search and vice versa (§15.4). |
| Audio | Whisper encoder, audio CLIP variants. | Acoustic similarity, transcript-free search. |
| Products / users / structured entities | Recommender embeddings (matrix factorization, two-tower nets). | Collaborative recommenders (§12.2). |
The unifying idea: anything that has a notion of "similar" can in principle be embedded. The space, the model, and the training data change. The "find nearby items" primitive does not.
Embeddings Are Learned, Not Declared
Two properties to hold onto before we start building.
Embeddings are learned, not declared. Nobody told the model that "refund" and "money back" are similar. The model learned it from the contexts in which those words appear. That means the embedding inherits the biases, gaps, and assumptions of its training data. A clinical-domain embedding may not recognize coffee jargon; an English embedding may garble Mandarin.
The embedding is downstream of a choice of model. "OpenAI text-embedding-3-small" and "Sentence-Transformers all-MiniLM-L6-v2" produce different vectors for the same input. The downstream system has to commit to one model. Mixing embeddings from different models is meaningless — the vectors live in different spaces.
Semantic Search: Keywords vs. Meaning
Now put the primitive to work. A manager wants to find "customers angry about delivery delays." A keyword search returns only documents that contain one of the query's words, such as "angry" or "delivery." A semantic search returns documents that share the meaning, regardless of vocabulary.
Keyword vs. semantic search for the query "customers angry about delivery delays"
| Document | Keyword search | Semantic search |
|---|---|---|
| "Angry that my order took forever." | Hit: matches "angry" | Hit: anger + delay |
| "Driver was late again." | Miss: no shared word | Hit: delivery + late |
| "Delivery window was missed." | Hit: matches "delivery" | Hit: delivery problem |
| "Food arrived cold after a long wait." | Miss: no shared word | Hit: wait + frustration |
Keyword search recovers only the documents that share words with the query. Semantic search recovers the documents that share meaning.
The mechanism is the nearest-neighbour primitive applied in two steps:
- At index time. Embed every document in the corpus. Store the vectors in a database that supports nearest-neighbour queries.
- At query time. Embed the query with the same model. Retrieve the top-k documents whose vectors are nearest.
That's the entire architecture. The complexity is in scaling the nearest-neighbour search (approximate methods become essential beyond a few million documents) and in the prompt or interface that uses the retrieved documents.
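The two steps can be sketched in a few lines. This is an illustration under stated assumptions, not a production design: `SemanticIndex` is a hypothetical class, the search is exact brute force rather than approximate, and `toy_embed` is a lookup table of made-up vectors standing in for a call to a real embedding model.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

class SemanticIndex:
    """Brute-force (exact) nearest-neighbour index over embedded documents."""

    def __init__(self, embed):
        self.embed = embed    # one embedding function for documents and queries
        self.entries = []     # list of (document, vector) pairs

    def add(self, document):
        # Index time: embed the document once and store its vector.
        self.entries.append((document, self.embed(document)))

    def search(self, query, k=3):
        # Query time: embed the query with the same function, rank by similarity.
        q = self.embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(q, entry[1]),
            reverse=True,
        )
        return [document for document, _ in ranked[:k]]

# Made-up vectors (assumption): a real system would call an embedding model here.
FAKE_VECTORS = {
    "Angry that my order took forever.": [0.9, 0.8, 0.0],
    "Driver was late again.": [0.9, 0.3, 0.0],
    "Delivery window was missed.": [0.8, 0.2, 0.0],
    "Food arrived cold after a long wait.": [0.7, 0.5, 0.1],
    "App crashed at checkout.": [0.0, 0.4, 0.9],
    "customers angry about delivery delays": [0.9, 0.7, 0.0],
}

def toy_embed(text):
    return FAKE_VECTORS[text]

index = SemanticIndex(toy_embed)
for text in list(FAKE_VECTORS)[:5]:   # the first five entries are the corpus
    index.add(text)

results = index.search("customers angry about delivery delays", k=4)
print(results)
```

Because the index holds a single `embed` function, documents and queries cannot end up in different vector spaces. Swapping the sorted scan for an approximate-nearest-neighbour structure changes the speed of `search` at scale, not its interface.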
Vector Databases, Plainly
A vector database is what stores embeddings and answers "give me the k nearest neighbours of this vector, fast." The standard names — Pinecone, Weaviate, Qdrant, Milvus, Postgres+pgvector — implement variations of the same idea: index the vectors with an approximate-nearest-neighbour structure (HNSW, IVF, ScaNN), trade exactness for speed at scale, expose a simple query API.
Three managerial decisions when picking one:
- Approximate or exact? Exact search is fine for small corpora. Approximate search usually becomes necessary beyond a few million vectors; the accuracy loss (typically measured as recall) is usually small and tunable.
- Hybrid search? Combining vector similarity with classical metadata filters (date, region, language) is often necessary. Most databases support this; pure-vector ones lose value quickly in production.
- Where it lives. Hosted service vs. self-hosted vs. embedded-in-existing-DB. The right choice depends on data sensitivity, scale, and team capacity.
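The filter-plus-similarity combination in the hybrid-search bullet can be sketched without any database. The records, the `region` field, and the vectors below are all made up for illustration, and `hybrid_search` is a hypothetical helper; real vector databases expose their own filter syntax, which this sketch does not attempt to reproduce.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Made-up records: text, one metadata field, and a stand-in embedding vector.
records = [
    {"text": "Driver was late again.", "region": "EU", "vector": [0.9, 0.3, 0.0]},
    {"text": "Delivery window was missed.", "region": "US", "vector": [0.8, 0.2, 0.0]},
    {"text": "App crashed at checkout.", "region": "EU", "vector": [0.0, 0.4, 0.9]},
    {"text": "Order arrived two hours late.", "region": "EU", "vector": [0.9, 0.5, 0.0]},
]

def hybrid_search(query_vector, region, k=2):
    """Apply the metadata filter first, then rank the survivors by similarity."""
    candidates = [r for r in records if r["region"] == region]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(query_vector, r["vector"]),
        reverse=True,
    )
    return [r["text"] for r in ranked[:k]]

# The query vector is also made up; it stands in for an embedded query.
eu_hits = hybrid_search([0.9, 0.7, 0.0], region="EU", k=2)
print(eu_hits)
```

The US ticket is excluded by the filter even though its vector is close to the query, which is the behaviour a regional manager would expect.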
Embedding-Based Clustering
The same vectors that power search also power clustering. Once every document is a vector, K-means, hierarchical clustering, or DBSCAN — all from §11.1 — apply. The output is a segmentation of the corpus by meaning, not by vocabulary.
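As a rough illustration, here is K-means (Lloyd's algorithm) run on made-up vectors for six of the phrases from earlier in the article. The vectors and the choice of starting centroids are assumptions made so the example is small and deterministic; in practice one would more likely use a library implementation such as scikit-learn's `KMeans` on real embeddings.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def kmeans(vectors, start_centroids, iterations=10):
    """Plain Lloyd's algorithm with caller-supplied starting centroids."""
    centroids = [list(c) for c in start_centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels

# Made-up stand-ins for embeddings of six short tickets.
tickets = {
    "refund": [0.9, 0.1, 0.1],
    "money back": [0.8, 0.2, 0.1],
    "delivery delay": [0.1, 0.9, 0.1],
    "driver was slow": [0.2, 0.8, 0.1],
    "app crashed": [0.1, 0.1, 0.9],
    "cannot sign in": [0.2, 0.1, 0.8],
}
vectors = list(tickets.values())

# Start from one ticket per suspected theme (an arbitrary, deterministic choice).
labels = kmeans(vectors, start_centroids=[vectors[0], vectors[2], vectors[4]])
clusters = dict(zip(tickets, labels))
print(clusters)
```

The algorithm returns only integer labels. Naming cluster 0 "refunds" or cluster 2 "app problems" remains the analyst's job.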
A few common business applications:
- Complaint segmentation. Cluster customer support tickets into themes; compare to the topic-model results from §13.5. The clusters are usually more stable across refits.
- Lead segmentation from sales notes. Cluster sales-call notes; identify customer personas the structured CRM fields don't capture.
- Story discovery in social posts. Cluster posts mentioning the brand; surface narratives the team didn't know existed.
The interpretive workflow is the same as §13.5: the algorithm produces clusters, the analyst names them, the manager decides which deserve action.
Brand Positioning From Text
In §11.2, a perceptual map placed brands in a low-dimensional space using survey ratings (premium, affordable, cozy, convenient). The same kind of map can be built from review text using embeddings.
Same brands, two evidence languages — survey PCA on the left, review embeddings on the right
The same competitive structure surfaces from two evidence languages. Embeddings let us read positioning from text the way PCA lets us read it from surveys.
The procedure for the text map:
- Collect reviews for each brand.
- Embed each review with a sentence/document embedding model.
- Average the embeddings within each brand to get a single brand vector.
- Reduce to 2D with UMAP or PCA for visualization.
- Read the resulting positions.
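The averaging step in the procedure above can be sketched as follows. The brand names and review vectors are made up, and the 2D reduction is omitted because it needs a library such as a PCA or UMAP implementation. Comparing brand vectors directly by cosine similarity already shows which brands are neighbours.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Made-up review embeddings, grouped by hypothetical brand.
review_vectors = {
    "Brand A": [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1]],
    "Brand B": [[0.85, 0.15, 0.1], [0.9, 0.2, 0.2]],
    "Brand C": [[0.1, 0.9, 0.3], [0.2, 0.8, 0.2]],
}

# One vector per brand: the mean of that brand's review embeddings.
brand_vectors = {
    brand: mean_vector(vectors) for brand, vectors in review_vectors.items()
}

def nearest_brand(brand):
    """Return the other brand whose averaged vector is most similar."""
    others = [b for b in brand_vectors if b != brand]
    return max(
        others,
        key=lambda b: cosine_similarity(brand_vectors[brand], brand_vectors[b]),
    )

print(nearest_brand("Brand A"))
```

One design caveat: a plain mean weights every review equally, so a brand with a few very long or very unusual reviews can be pulled off position. Whether to weight or trim is a judgment call the procedure leaves open.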
The reading habits differ across the two maps:
- The survey map has named axes. PC1 might be "value → premium." Useful for explaining the position.
- The text map has neighbourhoods. Brands close together are perceived similarly; the axes have no business meaning.
The strongest use is triangulation. If the survey map and the text map agree, the positioning is robust. If they disagree, that disagreement is informative — perhaps customers say one thing on surveys and another in reviews, or the survey questions don't capture the dimensions reviews actually emphasize.
Beyond Search: Anomaly Detection and Drift
Anomaly detection. Documents whose embeddings are far from every cluster are candidates for human review. A support ticket that doesn't look like any previous ticket is exactly what an escalation queue should surface.
Drift monitoring. The centroid of "incoming reviews this week" can drift away from "incoming reviews last quarter" in embedding space. The drift is a richer signal than a single sentiment number — it captures what kind of language has changed, not just whether the overall tone shifted.
Both are restatements of the same primitive: distance in embedding space carries managerial information when interpreted carefully.
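Both signals reduce to a few lines once the vectors exist. In this sketch the centroids, tickets, and review batches are made-up numbers, and the function names are illustrative. An anomaly score is taken to be the distance to the nearest cluster centroid, and drift is taken to be the distance between the centroids of two batches. Other definitions are possible.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def anomaly_score(vector, centroids):
    """Distance to the nearest cluster centroid; larger means more unusual."""
    return min(math.dist(vector, c) for c in centroids)

def centroid_drift(previous_batch, current_batch):
    """Euclidean distance between the centroids of two batches of embeddings."""
    return math.dist(mean_vector(previous_batch), mean_vector(current_batch))

# Made-up cluster centroids from past tickets (for example, refunds and delivery).
centroids = [[0.85, 0.15, 0.1], [0.15, 0.85, 0.1]]

typical_ticket = [0.8, 0.2, 0.1]   # resembles the first cluster
odd_ticket = [0.1, 0.1, 0.9]       # resembles neither cluster

typical_score = anomaly_score(typical_ticket, centroids)
odd_score = anomaly_score(odd_ticket, centroids)

# Made-up review embeddings for two time windows.
last_quarter = [[0.8, 0.2, 0.1], [0.9, 0.1, 0.1]]
this_week = [[0.5, 0.5, 0.1], [0.4, 0.6, 0.1]]
drift = centroid_drift(last_quarter, this_week)

print(typical_score, odd_score, drift)
```

The odd ticket scores far higher than the typical one, so it is the candidate for the escalation queue. What counts as a large score or a large drift has to be calibrated on historical data, which is the "interpreted carefully" part.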
The next article generalizes the move. Instead of using embeddings as a similarity primitive, it uses a language model directly to measure the constructs a manager cares about.