Contextual Compression in RAG (Retrieval-Augmented Generation)
22 Aug, 2025

In RAG (Retrieval-Augmented Generation), Contextual Compression refers to the process of shrinking or filtering retrieved documents into a more relevant and concise form before passing them to the LLM.

Normally, in RAG:

  1. Retriever → Finds documents (chunks of text, PDFs, etc.) from a vector database or knowledge source.

  2. LLM → Consumes those retrieved docs to generate an answer.

⚠️ Problem: Retrievers often return long or noisy chunks. Feeding them all directly into the LLM wastes context window space and may reduce accuracy.
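
To make the problem concrete, here is a minimal sketch (plain Python with hard-coded placeholder chunks standing in for a real retriever) of the naive approach: every retrieved chunk is stuffed into the prompt, so the prompt grows with chunk length rather than with relevance.

# Naive context stuffing (placeholder data, no real retriever or LLM call)
retrieved_chunks = [
    "Rosuvastatin is a statin used to treat high cholesterol. Typical dosing ranges from 5 to 40 mg...",
    "Background on clinical trials, manufacturing, storage instructions and other details...",
    "Common side effects include headache, muscle pain, nausea and abdominal pain...",
]

query = "What are the side effects of Rosuvastatin?"

# Everything gets concatenated, relevant or not
prompt = "Answer the question using the context below.\n\nContext:\n"
prompt += "\n\n".join(retrieved_chunks)
prompt += "\n\nQuestion: " + query

# Rough token estimate (~4 characters per token) shows how quickly the context window fills up
print("Approx. prompt tokens:", len(prompt) // 4)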


🔹 What Contextual Compression Does

It compresses retrieved documents so that only the parts of the text most relevant to the query are preserved. This way:

  • Less irrelevant information is passed.

  • More space remains for multiple documents.

  • The LLM sees only the most useful context.


🔹 Techniques for Contextual Compression

  1. LLM-based Filtering (Query-aware Summarization)

    • Use an LLM to summarize each retrieved doc with respect to the query.

    • Example: If query = “What are the side effects of Rosuvastatin?”, and a document talks about uses, dosage, and side effects, the compressor extracts only the side effects part.

  2. Re-ranking / Scoring

    • A smaller model ranks the sentences/paragraphs inside each document.

    • Only the top-ranked, query-relevant parts are retained.

  3. Keyword or Semantic Extraction

    • Keep only key sentences that contain entities or terms related to the query (a small scoring sketch follows this list).
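
As a rough illustration of techniques 2 and 3, the sketch below (plain Python, no external libraries; keyword overlap is only a stand-in for a real re-ranking model or semantic similarity scores) scores each sentence of a document against the query and keeps only the top-ranked ones.

import re

def compress_by_keyword_overlap(document: str, query: str, top_k: int = 2) -> str:
    """Keep only the top_k sentences that share the most terms with the query."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", document)

    # Score each sentence by how many query terms it contains
    scored = [(len(query_terms & set(re.findall(r"\w+", s.lower()))), s) for s in sentences]

    # Retain the highest-scoring sentences, preserving their original order
    top_sentences = {s for score, s in sorted(scored, reverse=True)[:top_k] if score > 0}
    return " ".join(s for s in sentences if s in top_sentences)

doc = ("Rosuvastatin is used to lower cholesterol. "
       "It is usually taken once daily. "
       "Common side effects of Rosuvastatin include headache, muscle pain and nausea.")
print(compress_by_keyword_overlap(doc, "What are the side effects of Rosuvastatin?"))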


🔹 Example Flow

Without Compression:

Retriever → returns 5 docs (~2000 tokens each)
LLM input = 10,000 tokens (mostly irrelevant fluff)

With Compression:

Retriever → returns 5 docs
Compressor → trims each doc to ~200 tokens, query-relevant only
LLM input = 1000 tokens (focused context)

🔹 Benefits in RAG

  • ✅ Reduces hallucination (LLM focuses on relevant context)

  • ✅ Saves token cost & context window space

  • ✅ Improves accuracy in multi-hop QA

  • ✅ Helps when docs are very long (e.g., legal, medical, scientific PDFs)


👉 In LangChain, this is implemented as a ContextualCompressionRetriever, which wraps around your base retriever. It first fetches documents, then compresses them with an LLM (or another method), and only then passes them to your RAG pipeline.


🔹 Example: Contextual Compression in LangChain (Python)

Here’s a small LangChain code snippet that demonstrates how contextual compression works in RAG:

from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# 1. Load the base retriever (assumes a FAISS index was previously saved with save_local("faiss_index"))
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.load_local("faiss_index", embeddings)
retriever = vectorstore.as_retriever()

# 2. Create a compressor that uses the LLM to extract only query-relevant text from each document
#    (gpt-4 is a chat model, so ChatOpenAI is used rather than the completion-style OpenAI wrapper)
llm = ChatOpenAI(model_name="gpt-4")
compressor = LLMChainExtractor.from_llm(llm)

# 3. Wrap the base retriever with contextual compression
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)

# 4. Build the RetrievalQA chain on top of the compression retriever
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=compression_retriever,
)

# 5. Ask a query
query = "What are the side effects of Rosuvastatin?"
answer = qa.run(query)
print("Answer:", answer)
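
LLMChainExtractor is not the only compressor option in LangChain. If an extra LLM call per retrieved document is too slow or costly, an embeddings-based filter approximates the same idea (it corresponds to the re-ranking / semantic-extraction techniques above). The snippet below is a sketch that reuses the embeddings and retriever objects from the example; the similarity_threshold value is an arbitrary illustration, not a recommended setting.

from langchain.retrievers.document_compressors import EmbeddingsFilter

# Drops retrieved documents whose embedding similarity to the query is below the threshold,
# so no LLM call is needed during compression
embeddings_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.75)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter,
    base_retriever=retriever,
)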