
Semantic vs episodic memory in AI agents

Frequently Asked Questions

How are embeddings used in RAG?

In a RAG pipeline, embeddings are used in two ways. At indexing time, each document chunk is converted to an embedding vector and stored in a vector database. At query time, each incoming query is converted to an embedding vector by the same model and compared against the stored embeddings using cosine similarity or dot product. The chunks whose embeddings are closest to the query embedding are retrieved as context for the LLM. Because the LLM only sees what retrieval returns, the quality of the embedding model strongly influences the quality of the final answer.
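The retrieval step can be sketched in a few lines. This is a minimal illustration, not a production implementation: the three-dimensional vectors are hand-written stand-ins for what an embedding model would produce, and the names `cosine_similarity`, `retrieve`, and `index` are hypothetical. A real pipeline would call an embedding model and a vector database instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, k=2):
    """Return the k chunks whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]

# Toy vectors written by hand for illustration only (assumption: a real
# embedding model would produce these, with hundreds of dimensions).
index = [
    ("Cats are small domesticated felines.", [0.9, 0.1, 0.0]),
    ("Dogs are loyal companions.", [0.7, 0.3, 0.1]),
    ("The stock market fell on Tuesday.", [0.0, 0.2, 0.9]),
]

# Stand-in for the embedding of a query such as "Tell me about pets".
query_vec = [0.8, 0.2, 0.0]

top_chunks = retrieve(query_vec, index, k=2)
print(top_chunks)
```

With these toy vectors, the two pet-related chunks rank above the finance chunk because their vectors point in nearly the same direction as the query vector.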

What is the difference between an embedding and a token?

A token is a discrete unit of text (typically a word, subword, or character) that a model processes as input. An embedding is the continuous numerical vector that represents that token inside the model. Every token is first looked up in an embedding table to get its vector representation, and the model then computes over these vectors rather than over the tokens themselves. In short, tokens are the input format and embeddings are the internal mathematical representation. The embeddings used for retrieval apply the same idea at a coarser level: one vector represents a whole passage rather than a single token.
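The lookup described above can be illustrated with a toy example. Everything here is hypothetical: the four-entry vocabulary, the two-dimensional vectors, and the whitespace tokenizer are simplifications chosen for readability. Real models typically use subword tokenizers and learned tables with tens of thousands of rows.

```python
# Hypothetical vocabulary: maps each token string to an integer ID.
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}

# Hypothetical embedding table: row i holds the vector for token ID i.
# In a trained model these values are learned, not written by hand.
embedding_table = [
    [0.1, 0.3],    # "the"
    [0.8, -0.2],   # "cat"
    [-0.4, 0.5],   # "sat"
    [0.0, 0.0],    # "<unk>", used for tokens missing from the vocabulary
]

def tokenize(text):
    """Split on whitespace; a simplification of real subword tokenization."""
    return text.lower().split()

def embed_tokens(tokens):
    """Map tokens to IDs, then look up each ID's row in the table."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    vectors = [embedding_table[i] for i in ids]
    return ids, vectors

tokens = tokenize("The cat sat")
token_ids, vectors = embed_tokens(tokens)
print(tokens)     # discrete units of text
print(token_ids)  # integer IDs
print(vectors)    # continuous vectors the model would compute over
```

The tokens and IDs are discrete symbols, while the vectors are the continuous values that downstream layers would operate on.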

How do you choose an embedding model?

Choose an embedding model based on three factors. The first is domain fit: a model trained on scientific text will tend to produce better embeddings for scientific queries than a general-purpose model. The second is dimensionality: higher-dimensional vectors can capture more nuance but require more storage and more compute per similarity comparison. The third is benchmark performance on tasks that match your use case; the MTEB leaderboard is the standard reference for text embedding benchmarks as of 2025. For multilingual use cases, also verify that the model was trained on your target languages.
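The storage side of the dimensionality trade-off is simple arithmetic. The sketch below assumes vectors are stored as 32-bit floats (4 bytes per value) and counts raw vector data only; index structures, metadata, and any compression or quantization a vector database applies are ignored. The function name `raw_vector_bytes` and the corpus size of one million chunks are illustrative choices.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for the vectors alone, assuming float32 by default."""
    return num_vectors * dimensions * bytes_per_value

# Compare three commonly seen embedding sizes for one million chunks.
for dims in (384, 768, 1536):
    gigabytes = raw_vector_bytes(1_000_000, dims) / 1e9
    print(f"{dims} dims: {gigabytes:.2f} GB")
# 384 dims: 1.54 GB
# 768 dims: 3.07 GB
# 1536 dims: 6.14 GB
```

Storage grows linearly with dimensionality, so doubling the vector size doubles the raw footprint. Whether the extra dimensions improve retrieval enough to justify that cost is something to check against your own evaluation data.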

See Also