In RAG, matrices appear mainly in the embedding and retrieval stages, since embedded text is stored and compared as vectors and matrices.
Each text chunk (Document.page_content) is converted into a vector (e.g., 768-dim for BERT, 1536-dim for OpenAI embeddings).
When you store multiple vectors, they form an embedding matrix:
Shape: (#documents/chunks × embedding_dim)
Example: 10,000 chunks → matrix of size (10000 × 1536)
👉 This is the core mathematical structure that powers retrieval.
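As a minimal sketch of that structure (NumPy only; embed() here is a placeholder standing in for a real embedding model such as a BERT or OpenAI encoder):

```python
import numpy as np

EMBEDDING_DIM = 1536  # e.g., OpenAI text embeddings

def embed(text: str) -> np.ndarray:
    # Placeholder: a real implementation would call an embedding model.
    # Seeding from the text just makes the stand-in deterministic.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(EMBEDDING_DIM)

chunks = ["First chunk of text...", "Second chunk...", "Third chunk..."]

# Stack one vector per chunk into an (n_chunks × embedding_dim) matrix.
embedding_matrix = np.stack([embed(c) for c in chunks])
print(embedding_matrix.shape)  # (3, 1536)
```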
Retrieval works by comparing a query vector (1 × embedding_dim) with the embedding matrix.
Mathematically:
Q = query vector (1 × d)
D = document matrix (n × d)
Result = Q Dᵀ → similarity scores (1 × n)
👉 This gives you a score vector showing which documents are closest to the query.
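In NumPy this is a single matrix product (the random vectors below are stand-ins for real embeddings):

```python
import numpy as np

def cosine_scores(Q: np.ndarray, D: np.ndarray) -> np.ndarray:
    """Q: (1 × d) query vector, D: (n × d) document matrix.
    Returns a (1 × n) vector of cosine similarities."""
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)  # normalize rows
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    return Q @ D.T  # (1 × d) @ (d × n) → (1 × n)

D = np.random.randn(1_000, 1536)  # 1,000 chunks (stand-in embeddings)
Q = np.random.randn(1, 1536)      # one query
scores = cosine_scores(Q, D)      # shape (1, 1000)
top_k = np.argsort(scores[0])[::-1][:5]  # indices of the 5 closest chunks
```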
Sometimes you calculate pairwise similarities between documents or between queries and documents.
This forms a similarity/distance matrix:
Shape: (n × n) for document-to-document
Shape: (m × n) for m queries vs. n documents
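Both shapes fall out of the same normalized matrix product; a sketch with stand-in embeddings:

```python
import numpy as np

D = np.random.randn(100, 1536)   # stand-in embeddings for 100 chunks
D = D / np.linalg.norm(D, axis=1, keepdims=True)

doc_sim = D @ D.T                # (100 × 100) document-to-document matrix

Q = np.random.randn(3, 1536)     # 3 queries
Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
query_doc_sim = Q @ D.T          # (3 × 100) queries-vs-documents matrix
```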
Advanced RAG setups may use SVD (Singular Value Decomposition) or PCA (Principal Component Analysis) to reduce embedding dimensionality and make retrieval faster. These rely on linear algebra operations on the embedding matrix.
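For instance, a truncated SVD on the centered embedding matrix keeps only the top k directions. This is a sketch rather than a tuned pipeline, and k = 256 is an arbitrary choice:

```python
import numpy as np

D = np.random.randn(1_000, 1536)         # stand-in embedding matrix
mean = D.mean(axis=0)

# Truncated SVD on the centered matrix (equivalent to PCA here).
U, S, Vt = np.linalg.svd(D - mean, full_matrices=False)
k = 256                                  # arbitrary reduced dimension
D_reduced = (D - mean) @ Vt[:k].T        # (1,000 × 256)

# Queries must be projected into the same reduced space before comparison.
q = np.random.randn(1, 1536)
q_reduced = (q - mean) @ Vt[:k].T        # (1 × 256)
```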
Once relevant docs are retrieved, they are fed into the LLM.
Inside the LLM, attention matrices determine how tokens (from query + retrieved docs) relate to each other.
While attention is not part of retrieval, it is still a matrix operation that underpins generation.
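To make the shape concrete, here is standard scaled dot-product attention written out in NumPy (random weights as stand-ins for learned parameters):

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Single-head self-attention over token embeddings X (seq_len × d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])   # (seq_len × seq_len)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)     # row-wise softmax
    return w @ V, w                          # output and attention matrix

d = 64
X = np.random.randn(10, d)                 # 10 tokens (query + retrieved docs)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
out, attn = attention(X, Wq, Wk, Wv)
print(attn.shape)                          # (10, 10) attention matrix
```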
In summary, matrices appear in RAG in two main ways (plus optional extras):
Embedding Matrix → Stores vector representations of chunks.
Similarity Matrices → Used for retrieval via dot products / cosine similarity.
Optional → Dimensionality reduction (PCA/SVD) and attention matrices inside the LLM.