In RAG (Retrieval-Augmented Generation), Contextual Compression refers to the process of shrinking or filtering retrieved documents into a more relevant and concise form before passing them to the LLM.
Normally, in RAG:
Retriever → Finds documents (chunks of text, PDFs, etc.) from a vector database or knowledge source.
LLM → Consumes those retrieved docs to generate an answer.
⚠️ Problem: Retrievers often return long or noisy chunks. Feeding them all directly into the LLM wastes context window space and may reduce accuracy.
✅ Solution: Contextual compression shrinks the retrieved documents so that only the parts of the text most relevant to the query are preserved. This way:
Less irrelevant information is passed.
More space remains for multiple documents.
The LLM sees only the most useful context.
There are a few common ways to do this compression:
LLM-based Filtering (Query-aware Summarization)
Use an LLM to summarize each retrieved doc with respect to the query.
Example: If query = “What are the side effects of Rosuvastatin?”, and a document talks about uses, dosage, and side effects, the compressor extracts only the side effects part.
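A minimal sketch of this query-aware extraction step (assuming the classic LangChain LLMChain + PromptTemplate API and an OpenAI API key; the prompt wording and the sample document text are purely illustrative):
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Prompt that asks the model to keep only the sentences relevant to the query
extract_prompt = PromptTemplate(
    input_variables=["query", "document"],
    template=(
        "Extract, verbatim, only the sentences from the document that help answer the question. "
        "If nothing is relevant, reply with NO_RELEVANT_CONTENT.\n\n"
        "Question: {query}\n\nDocument:\n{document}"
    ),
)

compress_chain = LLMChain(llm=ChatOpenAI(temperature=0), prompt=extract_prompt)

doc_text = (
    "Rosuvastatin is used to lower cholesterol. "
    "Common side effects include muscle pain and headache. "
    "The typical dose is 5-40 mg once daily."
)
compressed = compress_chain.run(
    query="What are the side effects of Rosuvastatin?",
    document=doc_text,
)
print(compressed)  # expected to contain only the side-effects sentence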
Re-ranking / Scoring
A smaller model ranks the sentences/paragraphs inside each document.
Only the top-ranked, query-relevant parts are retained.
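A rough sketch of this re-ranking step, assuming the sentence-transformers package and its public cross-encoder/ms-marco-MiniLM-L-6-v2 checkpoint (the naive sentence split is only for illustration):
from sentence_transformers import CrossEncoder

def rerank_sentences(query: str, document: str, top_k: int = 3) -> str:
    # Naive sentence split; a real pipeline would use a proper sentence tokenizer
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    # A small cross-encoder scores each (query, sentence) pair for relevance
    scorer = CrossEncoder("cross-encoder/ms-marco-MiniL M-L-6-v2".replace(" ", ""))
    scores = scorer.predict([(query, s) for s in sentences])
    ranked = sorted(zip(scores, sentences), reverse=True, key=lambda pair: pair[0])
    # Keep only the top-ranked, query-relevant sentences
    return ". ".join(s for _, s in ranked[:top_k])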
Keyword or Semantic Extraction
Keep only key sentences that contain entities or terms related to the query.
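A purely illustrative sketch of the keyword variant, using plain term overlap (a semantic variant would instead score sentence embeddings against the query, e.g. with LangChain's EmbeddingsFilter):
def keyword_filter(query: str, document: str) -> str:
    # Very naive term overlap; real systems would use entity extraction or embeddings
    stopwords = {"the", "of", "what", "are", "a", "is", "and", "to"}
    query_terms = {w.lower().strip("?,.") for w in query.split()} - stopwords
    kept = [
        s.strip() for s in document.split(".")
        if query_terms & {w.lower().strip("?,.") for w in s.split()}
    ]
    return ". ".join(kept)

print(keyword_filter(
    "What are the side effects of Rosuvastatin?",
    "Rosuvastatin lowers cholesterol. Side effects include muscle pain. The dose is 5-40 mg.",
))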
Without Compression:
Retriever → returns 5 docs (~2000 tokens each)
LLM input = 10,000 tokens (mostly irrelevant fluff)
With Compression:
Retriever → returns 5 docs
Compressor → trims each doc to ~200 tokens, query-relevant only
LLM input = 1000 tokens (focused context)
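To check numbers like these on your own data, you can count tokens with tiktoken (a sketch; the document strings below are placeholders, not real retrieved chunks):
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def total_tokens(docs: list[str]) -> int:
    return sum(len(enc.encode(d)) for d in docs)

raw_docs = ["<full retrieved chunk>"] * 5          # ~2000 tokens each in practice
trimmed_docs = ["<query-relevant extract>"] * 5    # ~200 tokens each in practice
print("raw:", total_tokens(raw_docs), "compressed:", total_tokens(trimmed_docs))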
✅ Reduces hallucination (LLM focuses on relevant context)
✅ Saves token cost & context window space
✅ Improves accuracy in multi-hop QA
✅ Helps when docs are very long (e.g., legal, medical, scientific PDFs)
👉 In LangChain, this is implemented as a ContextualCompressionRetriever, which wraps around your base retriever. It first fetches documents, then compresses them with an LLM (or another method), and only then passes them to your RAG pipeline.
Here’s a small LangChain code snippet that demonstrates how contextual compression works in RAG:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings

# 1. Load base retriever (e.g., FAISS); assumes a "faiss_index" folder already exists
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.load_local("faiss_index", embeddings)
retriever = vectorstore.as_retriever()

# 2. Create a compressor (extracts query-relevant text from each document)
llm = ChatOpenAI(model="gpt-4", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

# 3. Wrap the base retriever with contextual compression
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)

# 4. Build the RetrievalQA chain on top of the compression retriever
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=compression_retriever,
)

# 5. Ask a query
query = "What are the side effects of Rosuvastatin?"
answer = qa.run(query)
print("Answer:", answer)