In the real world, two places that are close to each other are more likely to be similar than two places far apart. For example, Taipei is more similar to Tokyo than it is to Prague, in most aspects. Simple as it is, this intuition serves as the basis for how embeddings work.
As a fun bonus at the end of the article, we will see whether embeddings of real-world places correspond to their actual locations.
What is a vector
There are many ways to define the location of a place on Earth. Probably the most common one uses latitude and longitude (e.g., 50° north, 20° west), a system1 that is based on angles. But to understand vectors, we will talk about another widely used system - the geocentric coordinate system - that is based on the distances from the center of Earth.
Our world has 3 dimensions, so this system has three axes intersecting at the origin - the center of Earth.
- The Z axis goes through the poles.
- The X axis goes through the intersection of the equator and the prime meridian (Greenwich).
- The Y axis goes through the equator at a 90° angle from X, piercing the Indian Ocean.
To simplify it, instead of measuring the distances in meters, miles, or chicken feet, let’s say that the distance from the Earth’s center to its surface is 1 unit.2 In this system, the North Pole has coordinates (0, 0, 1), Taipei (-0.47, 0.77, 0.42), Tokyo (-0.62, 0.52, 0.58), and Prague (0.62, 0.16, 0.77).
Each coordinate triple is practically an arrow telling us where the place is relative to the center of Earth. It is a vector. And since the length - or, as mathematicians like to say, magnitude - of the vector is 1 unit, it is a special type of vector called a unit vector.
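To make the link between latitude/longitude and these geocentric coordinates concrete, here is a minimal Python sketch - assuming a perfectly spherical Earth (as the footnote above does) and rounded city coordinates:

```python
import math

def geocentric_unit_vector(lat_deg: float, lon_deg: float) -> tuple[float, float, float]:
    """Convert latitude/longitude in degrees to (x, y, z) on a unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = math.cos(lat) * math.cos(lon)  # towards the prime meridian
    y = math.cos(lat) * math.sin(lon)  # towards the Indian Ocean
    z = math.sin(lat)                  # towards the North Pole
    return (x, y, z)

print(geocentric_unit_vector(25.0, 121.6))  # Taipei -> roughly (-0.47, 0.77, 0.42)
print(geocentric_unit_vector(50.1, 14.4))   # Prague -> roughly (0.62, 0.16, 0.77)
print(geocentric_unit_vector(90.0, 0.0))    # North Pole -> roughly (0, 0, 1)
```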
But vectors don’t have to represent physical locations only. They can represent any attributes that can be expressed as numbers. For example, when tasting coffee,3 you can rate its 1. acidity, 2. aroma and 3. bitterness on a numerical scale. Each coffee sample is then a point in a three-dimensional space, represented by its unique combination of attributes. In fact, you are not restricted to 3 attributes and can add more, like body and aftertaste. The coffee vectors now have 5 dimensions. Human brains cannot imagine a 5-dimensional space, but the math stays the same, regardless of whether the space has three, five, or a thousand dimensions.
What is an embedding
An embedding is a special type of vector that captures the meaning - that is, semantics - of data, like text or images. It is produced by a type of AI model called an embedding model. Many training techniques exist for embedding models. Most modern embedding models, including the one we will use in this article, are trained using contrastive learning: the model is shown pairs of very similar items (positive pairs) and very different items (negative pairs) and learns to place them either close together or far apart in a multi-dimensional space. By repeating this millions of times, it learns where to place items so that similar ones end up nearby.
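To give a flavor of what "pulling similar items together and pushing different ones apart" looks like in code, here is a minimal, hypothetical sketch of a contrastive (InfoNCE-style) loss in PyTorch. It illustrates the general idea only - the actual recipe of any particular commercial model will differ:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style loss: pull the positive pair together, push the negatives apart.

    anchor, positive: (d,) embeddings of two items that should be similar
    negatives:        (n, d) embeddings of items that should be dissimilar
    """
    # Normalize so that dot products become cosine similarities
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_sim = (anchor @ positive) / temperature    # scalar similarity of the positive pair
    neg_sims = (negatives @ anchor) / temperature  # (n,) similarities of the negatives

    logits = torch.cat([pos_sim.unsqueeze(0), neg_sims]).unsqueeze(0)  # shape (1, n + 1)
    target = torch.tensor([0])  # the "correct" answer is the positive pair at index 0
    return F.cross_entropy(logits, target)
```

Minimizing this loss nudges the model to give the positive pair a higher similarity than any of the negatives, which is exactly the "place similar things nearby" behavior described above.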
Most embedding models work with hundreds or even thousands of dimensions and the model is completely free to decide what each dimension represents. This virtual space into which the model places the data is called an embedding space.4 Some of the dimensions may correlate with human-understandable concepts (like sweetness, size, or maybe wetness) but most don’t have a counterpart in the human world. They are just abstract concepts, results of mathematical optimizations during the model training process.
To see concrete examples, let’s generate embeddings for a couple of words from three categories: transport, animals, and water. We will use the Embed 4 model from Cohere, which produces vectors with 1,536 dimensions. I will only show the first 6 dimensions. The remaining 1,530 look just like this - you get the idea.
motorcycle = (-0.014469886, -0.032854330, -0.0053475660, 0.0180349300, -0.0095067840, -0.0399844160, ...)
bicycle = ( 0.006415045, -0.031378735, -0.0309388450, 0.0155427370, -0.0017962126, -0.0284461420, ...)
train = ( 0.018519579, 0.028055193, -0.0244300830, -0.0204897470, 0.0244300830, 0.0050830333, ...)
salmon = (-0.030416660, -0.029982137, -0.0027338786, -0.0169464260, 0.0188293620, -0.0373690400, ...)
butterfly = ( 0.024719475, -0.015482897, -0.0467809400, -0.0024088197, 0.0348199050, -0.0122268380, ...)
ocean = ( 0.011753302, -0.009939521, 0.0423699300, 0.0017865745, 0.0015507828, 0.0155985180, ...)
river = ( 0.034048382, 0.012981917, 0.0043726517, -0.0165577750, 0.0160136220, -3.0365636e-6, ...)
...
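For the curious, generating vectors like these takes only a few lines of code. Below is a minimal sketch using the Cohere Python SDK - the model identifier and parameters are assumptions based on the documentation at the time of writing, so verify them against the current docs:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder - use your own key

words = ["motorcycle", "bicycle", "train", "salmon", "butterfly", "ocean", "river"]

response = co.embed(
    texts=words,
    model="embed-v4.0",            # assumed identifier for Embed 4
    input_type="search_document",
)

for word, vector in zip(words, response.embeddings):
    print(word, vector[:6], "...")  # show only the first 6 of the 1,536 dimensions
```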
Looking at the embedding vectors, you can tell that they are very similar to the place vectors from the previous section, the main difference being that instead of 3, they have 1,536 dimensions.
Just like the place vectors, each number is the distance from the origin along some axis.5 For example, the embedding of the word "motorcycle" sits -0.014469886 units from the origin along the first axis, -0.03285433 units along the second axis, and so on. But unlike the place vectors, the distances aren’t physical; they capture some property of the embedded text - more like the coffee taste attributes. Who knows,6 maybe the 500th dimension captures the roundness, the 800th dimension the liveness, and the 1,150th dimension the yellow-ness of the thing.
Also, like the place vectors, their values range from -1 to 1, and if you were to calculate their magnitude (length), you would get exactly 1. One unit. Each embedding is a unit vector.7
And since their magnitude is 1, they too all end up on a "surface" around the origin. While we call this surface in 3-dimensional space a sphere, in a space with 4 or more dimensions we call it a hypersphere.
But no matter how much we squint our eyes, we cannot make sense of the thousands of decimal numbers.
Visualizing embeddings
Fortunately, there is a way to display them in a form our human brains can understand. Without getting into too many technical details, the process is quite straightforward: reduce the number of dimensions from 1,536 to 3 using UMAP,8 center them, and finally normalize them.
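In code, the whole pipeline is only a handful of lines. A minimal sketch, assuming the 1,536-dimensional embeddings are stacked row by row in a NumPy array and the umap-learn package is installed:

```python
import numpy as np
import umap  # pip install umap-learn

def reduce_to_unit_sphere(embeddings: np.ndarray) -> np.ndarray:
    """Turn 1,536-dimensional embeddings into 3D points sitting on a unit sphere."""
    # 1. Reduce the number of dimensions from 1,536 to 3
    reduced = umap.UMAP(n_components=3, random_state=42).fit_transform(embeddings)
    # 2. Center the points around the origin
    centered = reduced - reduced.mean(axis=0)
    # 3. Normalize each vector to unit length so it lands on the sphere's surface
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)
```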
When we are done, each embedding is reduced to a 3-dimensional vector. Only the main, most defining properties survived the dimensionality reduction and many details were lost, but that is the price we must pay for trimming 99.8% of the dimensions.
motorcycle = ( 0.58553650, 0.36164224, -0.72550810)
bicycle = ( 0.49223563, 0.17625500, -0.85243080)
train = ( 0.14205086, 0.95349850, -0.26582354)
salmon = (-0.43607110, -0.89988690, 0.00674733)
butterfly = (-0.11054645, -0.93424726, -0.33905977)
ocean = (-0.12926267, -0.22348693, 0.96609765)
river = ( 0.26605076, 0.39000934, 0.88153820)
...
In return, we can now draw these 3D vectors and finally see them!
We can immediately notice that they are sitting on the surface of a sphere, a virtual globe, just like the places on Earth.
Some of them are hiding on the other side but, luckily for us, cartographers have been working on this exact problem for hundreds of years and have come up with some clever solutions for drawing a 3D globe on 2D paper. Many projection techniques exist, but I will use the Robinson projection, simply because it looks classy.
Now that we finally have the full picture, we can immediately spot the obvious:
- Words tend to cluster together by group and create "continents". Note that I manually placed the words into the three color-coded groups for our eyes only. The model knew nothing about my groups; it only got a word and generated its embedding. Nothing more, nothing less.
- Within the groups, similar words stick close to each other - "motorcycle" is close to "bicycle", "salmon" is close to "dolphin" or "shark", and "lake" is close to "river".
- There is some empty space between the continents. That’s where other words would go if we embedded them, like "pizza" or "professionalism".
When you take a closer look, there are other, more subtle things to notice:
- The word "horse" is approximately in the middle between the animal and transport groups. When you think about it, it makes perfect sense - horses are animals but also a means of transport.
- The word "canoe" is quite far from the transport group, near the water group and close to "sailing". After all, a canoe is a water-based vehicle.
The embeddings clearly capture the meaning of the data that they embed - often a surprisingly deep meaning.
But modern models can embed much more than just single words. They can generate embeddings for whole sentences, paragraphs, and even many pages of text. This entire article could be embedded, probably ending up near other texts about embeddings, not too far from other texts about AI, but quite far from apple pie recipes.
And, what’s more, the most modern embedding models understand more than just text.
Multimodal embeddings
A modality in machine learning refers to a type of data - like text, images, audio, or video. An AI model that understands multiple types of data is called a multimodal model. For instance, the model Cohere Embed 4 is multimodal because it understands both text and images, which is very useful for some applications. When the texts and images are embedded into the same space, we can compare their meanings with one another. Let’s produce one more visualization - this time, I will add a few photos so we can compare them to the text.
The photos of a shark, butterfly and motorcycle are very close to their respective words. This shows that the model understands the meanings across modalities and can encode them into the same space.
But even more interesting are the locations of two other photos. The first one is a floatplane on water, which landed between the vectors of "airplane" and "river". The other one is a photo of a boat on water, which is docked between the vectors of the words "boat" and "river". The model captured the nuanced meanings of the photos and placed them on the sphere perfectly.
Some multimodal embedding models also understand video or audio (for example, Google Gemini Embedding 2 or Voyage Multimodal 3.5).
Now that we have wrapped our heads around the concept of embeddings, the next natural question is: so what can we do with them?
The real-world applications of embeddings
I will start with two applications that I have implemented on Atlas. Note that to make them possible, every photo uploaded to Atlas is embedded.
Semantic search
Traditional keyword-based search often requires an exact match and struggles with different spelling9 or synonyms. For example, if I captioned a photo as "motorcycle", then a search for "motorbike" would probably come back empty-handed even though it means the same thing - it is semantically identical. To overcome this limitation, the search on Atlas is powered by embeddings. When you type a query into the search box in the Gallery, the text is first converted to an embedding, which is then compared to all photo embeddings, and finally the closest ones - the photos with the most similar meaning - are displayed. Go ahead and give it a try. You can type "motorbike", "red flower", or "temple in a jungle" and see what photos appear. While not always 100% accurate, it is an incredibly flexible system that makes it easy to find photos without having to manually label them or classify them into predefined categories.
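Under the hood, the core of such a search can be surprisingly small. Here is a simplified sketch - it assumes the photo embeddings are already stacked in a NumPy array, whereas a production system would more likely query a vector database:

```python
import numpy as np

def semantic_search(query_embedding: np.ndarray,
                    photo_embeddings: np.ndarray,
                    top_k: int = 10) -> np.ndarray:
    """Return the indices of the top_k photos whose meaning is closest to the query."""
    # With unit vectors, the dot product is the cosine similarity
    similarities = photo_embeddings @ query_embedding
    return np.argsort(similarities)[::-1][:top_k]

# query_embedding = embed("temple in a jungle")  # hypothetical embedding call
# best_photos = semantic_search(query_embedding, photo_embeddings)
```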
This technique is also the foundation of retrieval-augmented generation (RAG), which you may know from Microsoft 365 Copilot and similar programs. Instead of searching for the most relevant photos, these programs typically search for the most relevant documents (or their parts, known as chunks), which they invisibly inject into a large language model (LLM) as its grounding context along with the user’s question. For instance, when you ask about your company’s HR policy on bringing dogs to the office, it will use embeddings to very quickly find a few semantically relevant documents - about office rules and animals - instead of depending on an overly rigid keyword search that would ignore potentially important documents containing words like "canine" or "pets".
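A bare-bones sketch of that retrieve-and-inject step, reusing the semantic_search function from above; the embed function and the llm client are hypothetical stand-ins for whichever provider you use:

```python
def answer_with_rag(question, chunks, chunk_embeddings, llm, top_k=3):
    """Retrieve the most relevant chunks and hand them to an LLM as grounding context."""
    question_embedding = embed(question)  # hypothetical embedding call
    best = semantic_search(question_embedding, chunk_embeddings, top_k)
    context = "\n\n".join(chunks[i] for i in best)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.generate(prompt)  # hypothetical LLM call
```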
Recommendations aka similar items search
Every photo has a small icon in the corner titled "Show similar photos". When clicked, other photos with a similar vector - similar meaning - are displayed. For example, when you are looking at a photo of a delicious Costa Rican lunch and click the button, you will be presented with photos of other mouth-watering dishes.
This concept can be applied to searching for similar items across different domains, like finding similar documents or emails, products based on their descriptions, articles, books, job postings, or songs in a music library.
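Mechanically, this differs from semantic search only in that the query is the embedding of an existing item rather than of freshly typed text, and that the item itself has to be excluded from the results. A minimal sketch:

```python
import numpy as np

def similar_items(item_index: int, embeddings: np.ndarray, top_k: int = 10) -> np.ndarray:
    """Return the indices of the top_k items most similar to the item at item_index."""
    similarities = embeddings @ embeddings[item_index]
    similarities[item_index] = -np.inf  # never recommend the item itself
    return np.argsort(similarities)[::-1][:top_k]
```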
Let’s move on to other applications that are not (yet) implemented on Atlas.
Clustering
Embeddings can help you discover clusters of similar items, like the "continents" we saw above. A note-taking app could, for example, automatically generate folders for users’ unorganized notes. Or a news app could detect a new cluster of news coming from multiple sources and report a new event. Or a customer feedback system could reveal common types of user messages.
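A minimal sketch using scikit-learn’s KMeans, assuming you have a rough idea of how many clusters to expect (density-based algorithms such as HDBSCAN can discover the number of clusters on their own):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_notes(note_embeddings: np.ndarray, n_clusters: int = 5) -> np.ndarray:
    """Assign each note to one of n_clusters groups based on its embedding."""
    return KMeans(n_clusters=n_clusters, random_state=42).fit_predict(note_embeddings)
```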
Classification
A few labeled examples from hand-picked categories can be used to calculate each category’s position in the embedding space. When new, unlabeled examples then come in, they can be automatically matched to the nearest category. Imagine an online forum about traveling which automatically suggests the category of the user’s post - is it "Accommodation tips" or "Paperwork and visas"? Or a ticketing system that autonomously triages incoming tickets, depending on their nature, to a pre-defined team.
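One simple way to implement this is nearest-centroid classification: average the embeddings of each category’s labeled examples and assign new items to the closest centroid. A minimal sketch:

```python
import numpy as np

def build_centroids(labeled: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """labeled maps a category name to an array of embeddings of its labeled examples."""
    centroids = {}
    for name, vectors in labeled.items():
        centroid = vectors.mean(axis=0)
        centroids[name] = centroid / np.linalg.norm(centroid)  # keep it a unit vector
    return centroids

def classify(embedding: np.ndarray, centroids: dict[str, np.ndarray]) -> str:
    """Return the category whose centroid is most similar to the new embedding."""
    return max(centroids, key=lambda name: embedding @ centroids[name])
```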
De-duplication
When multiple documents produce very similar embeddings, there is a high chance they are duplicates - perhaps a Word document and an exported PDF file, or its scanned copy. These can then be automatically flagged. Or the system could detect duplicate photos with different file names, even if they are resized, compressed, and maybe even partially cropped.
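A minimal sketch that flags pairs of items whose embeddings are nearly identical; the 0.95 threshold is an illustrative assumption - the right value depends on the model and the data:

```python
import numpy as np

def find_duplicates(embeddings: np.ndarray, threshold: float = 0.95) -> list[tuple[int, int]]:
    """Return index pairs of items whose cosine similarity exceeds the threshold."""
    similarities = embeddings @ embeddings.T  # pairwise cosine similarities of unit vectors
    duplicates = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if similarities[i, j] > threshold:
                duplicates.append((i, j))
    return duplicates
```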
Wrapping up
Compared with LLMs, embeddings are a somewhat underappreciated marvel of technology. They are the cornerstone of many modern applications, powering features like search, RAG, or recommendation systems, yet the general public often doesn’t know it. When the only "AI" one knows is LLMs, every problem looks like a task for an LLM. However, embeddings are significantly faster and cheaper than many LLMs, and for some applications they are the better fit. Knowing more types of AI models will help you choose the best one for your use case. When planning your next project, spend a few seconds considering whether embeddings could help.
Bonus: world map according to an embedding model
We started this article talking about a globe, and we will end it talking about a globe. In this fun experiment, we will embed the names of some countries, regions and territories (hereafter simply "countries") using the Embed 4 model and project them on a sphere. Will the names of countries neighboring in the real world be neighbors in the embedding space as well?
We only see half of the sphere so, again, let’s project it.
Just as with the words example above, I manually grouped the countries by their continents, loosely following the United Nations geoscheme. The model, again, knew nothing about my groups. The fact that countries from the same continent ended up, for the most part, close to each other is quite remarkable in itself, but there are some even more surprising observations.
- Mexico, Canada and the United States are much closer to the European cluster than to the other North American countries. This likely reflects the fact that, in the training data, these countries appear in contexts similar to those of European countries rather than alongside Caribbean or Central American nations.
- Australia and New Zealand are very close to the European cluster, far from their Oceanian neighbors.
- South Africa is on the opposite side of the globe from other African countries, almost as if they were negative pairs in the training corpus.
- Cyprus, Georgia and Turkey are in the European cluster even though the UN geoscheme considers them part of "Western Asia".
You can let me know if you notice any other interesting details in the "map".
Note: I generated all visuals in this article programmatically; they are completely reproducible, and I will be happy to share the source code if there is interest.
Note 2: Why did I go to the extra trouble of first reducing to 3 dimensions using UMAP and then projecting to 2 dimensions using Robinson, instead of reducing directly to 2 dimensions? The reason is simple - I didn’t want to ignore the concepts of unit vectors and a (hyper)sphere, which I consider very important. Their importance will become more apparent when discussing distance metrics and other more advanced topics.
This system is called a geographic coordinate system.↩
We ignore the ellipsoid shape of Earth and assume it is a perfect sphere.↩
Coffee nerds call it coffee cupping.↩
An embedding space is a type of latent space. A latent space means that the dimensions were not specified by a human but determined by the algorithm during training. Latent spaces always express some learned representation of the data but are often hidden inside the models, not exposed to the users - unlike an embedding which is the final output of the model that the users get.↩
This is called a Cartesian system.↩
The model knows, of course. And, perhaps, a few researchers who are trying to unravel the inner workings of the model, a discipline known as mechanistic interpretability.↩
This is not a universal property of all embeddings - some models produce vectors with varying magnitudes - but many modern models, including Embed 4, normalize their output to unit length.↩
UMAP stands for Uniform Manifold Approximation and Projection. This is a relatively new dimensionality reduction technique which is often preferred over the t-Distributed Stochastic Neighbor Embedding (t-SNE) for its speed and better preservation of global structure, and over the linear Principal Component Analysis (PCA) for its ability to capture non-linear relationships.↩
In practice, this issue is usually addressed using fuzzy search techniques. I have also implemented trigram-based search on Atlas which works in tandem with embeddings. Trigrams can match misspelled words and typos.↩