In RAG (Retrieval-Augmented Generation), Contextual Compression refers to the process of shrinking or filtering retrieved documents into a more relevant and concise form before passing them to the LLM.
Normally, in RAG:
Retriever → Finds documents (chunks of text, PDFs, etc.) from a vector database or knowledge source.
LLM → Consumes those retrieved docs to generate an answer.
⚠️ Problem: Retrievers often return long or noisy chunks. Feeding them all directly into the LLM wastes context window space and may reduce accuracy.
✅ Solution: Contextual compression shrinks the retrieved documents so that only the parts of the text most relevant to the query are preserved. This way:
Less irrelevant information is passed.
More space remains for multiple documents.
The LLM sees only the most useful context.
There are a few common ways to do this compression:
LLM-based Filtering (Query-aware Summarization)
Use an LLM to summarize each retrieved doc with respect to the query.
Example: If query = “What are the side effects of Rosuvastatin?”, and a document talks about uses, dosage, and side effects, the compressor extracts only the side effects part.
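A minimal sketch of this query-aware extraction step (assuming the classic LangChain LLMChain + PromptTemplate API and an OpenAI API key; the prompt wording and the sample document text are purely illustrative):
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Prompt that asks the model to keep only the sentences relevant to the query
extract_prompt = PromptTemplate(
    input_variables=["query", "document"],
    template=(
        "Extract, verbatim, only the sentences from the document that help answer the question. "
        "If nothing is relevant, reply with NO_RELEVANT_CONTENT.\n\n"
        "Question: {query}\n\nDocument:\n{document}"
    ),
)

compress_chain = LLMChain(llm=ChatOpenAI(temperature=0), prompt=extract_prompt)

doc_text = (
    "Rosuvastatin is used to lower cholesterol. "
    "Common side effects include muscle pain and headache. "
    "The typical dose is 5-40 mg once daily."
)
compressed = compress_chain.run(
    query="What are the side effects of Rosuvastatin?",
    document=doc_text,
)
print(compressed)  # expected to contain only the side-effects sentence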
Re-ranking / Scoring
A smaller model ranks the sentences/paragraphs inside each document.
Only the top-ranked, query-relevant parts are retained.
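A rough sketch of this re-ranking step, assuming the sentence-transformers package and its public cross-encoder/ms-marco-MiniLM-L-6-v2 checkpoint (the naive sentence split is only for illustration):
from sentence_transformers import CrossEncoder

def rerank_sentences(query: str, document: str, top_k: int = 3) -> str:
    # Naive sentence split; a real pipeline would use a proper sentence tokenizer
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    # A small cross-encoder scores each (query, sentence) pair for relevance
    scorer = CrossEncoder("cross-encoder/ms-marco-MiniL M-L-6-v2".replace(" ", ""))
    scores = scorer.predict([(query, s) for s in sentences])
    ranked = sorted(zip(scores, sentences), reverse=True, key=lambda pair: pair[0])
    # Keep only the top-ranked, query-relevant sentences
    return ". ".join(s for _, s in ranked[:top_k])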
Keyword or Semantic Extraction
Keep only key sentences that contain entities or terms related to the query.
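A purely illustrative sketch of the keyword variant, using plain term overlap (a semantic variant would instead score sentence embeddings against the query, e.g. with LangChain's EmbeddingsFilter):
def keyword_filter(query: str, document: str) -> str:
    # Very naive term overlap; real systems would use entity extraction or embeddings
    stopwords = {"the", "of", "what", "are", "a", "is", "and", "to"}
    query_terms = {w.lower().strip("?,.") for w in query.split()} - stopwords
    kept = [
        s.strip() for s in document.split(".")
        if query_terms & {w.lower().strip("?,.") for w in s.split()}
    ]
    return ". ".join(kept)

print(keyword_filter(
    "What are the side effects of Rosuvastatin?",
    "Rosuvastatin lowers cholesterol. Side effects include muscle pain. The dose is 5-40 mg.",
))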
Without Compression:
Retriever → returns 5 docs (~2000 tokens each)
LLM input = 10,000 tokens (mostly irrelevant fluff)
With Compression:
Retriever → returns 5 docs
Compressor → trims each doc to ~200 tokens, query-relevant only
LLM input = 1000 tokens (focused context)
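To check numbers like these on your own data, you can count tokens with tiktoken (a sketch; the document strings below are placeholders, not real retrieved chunks):
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def total_tokens(docs: list[str]) -> int:
    return sum(len(enc.encode(d)) for d in docs)

raw_docs = ["<full retrieved chunk>"] * 5          # ~2000 tokens each in practice
trimmed_docs = ["<query-relevant extract>"] * 5    # ~200 tokens each in practice
print("raw:", total_tokens(raw_docs), "compressed:", total_tokens(trimmed_docs))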
✅ Reduces hallucination (LLM focuses on relevant context)
✅ Saves token cost & context window space
✅ Improves accuracy in multi-hop QA
✅ Helps when docs are very long (e.g., legal, medical, scientific PDFs)
👉 In LangChain, this is implemented as a ContextualCompressionRetriever, which wraps around your base retriever. It first fetches documents, then compresses them with an LLM (or another method), and only then passes them to your RAG pipeline.
Here’s a small LangChain code snippet that demonstrates how contextual compression works in RAG:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings

# 1. Load base retriever (e.g., FAISS); assumes a "faiss_index" folder already exists
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.load_local("faiss_index", embeddings)
retriever = vectorstore.as_retriever()

# 2. Create a compressor (extracts query-relevant text from each document)
llm = ChatOpenAI(model="gpt-4", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

# 3. Wrap the base retriever with contextual compression
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)

# 4. Build the RetrievalQA chain on top of the compression retriever
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=compression_retriever,
)

# 5. Ask a query
query = "What are the side effects of Rosuvastatin?"
answer = qa.run(query)
print("Answer:", answer)