🔑 Significance of the Document Object in Embedding (RAG Context)
The Document object is not just plain text.
It typically has two main parts:
page_content (text/content): the actual chunk of text you want to embed.
metadata: information like source (file name, URL, page number, section heading, timestamp, etc.).
This separation is crucial because embeddings are usually created only from text, but metadata helps with filtering, ranking, and context in retrieval.
👉 Example:
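# Recent LangChain releases expose Document from langchain_core;
# older releases used `from langchain.schema import Document`.
from langchain_core.documents import Document

doc = Document(
    page_content="Aspirin is an anti-inflammatory drug used to reduce pain and fever.",
    metadata={"source": "pharma_textbook.pdf", "page": 12},
)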
When a Document is added to a vector store, only its page_content is sent to the embedding model and vectorized.
But the metadata travels alongside in the vector store, so you know where the text came from after retrieval.
Without Document, you’d just have “raw embeddings” with no traceability.
👉 Example Flow:
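# Extract text → Wrap in Document → Generate embedding
# Embedding the chunk by hand: only page_content goes to the model
vector = embed_model.embed_documents([doc.page_content])[0]
# In practice the vector store embeds page_content itself and keeps the metadata with it
vector_store.add_documents([doc])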
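To make the traceability point concrete, here is a minimal, library-free sketch. Everything in it is an assumption made for illustration: `Doc` is a hypothetical stand-in for the Document object, `toy_embed` is a bag-of-words placeholder for a real embedding model, and `ToyVectorStore` is not any real vector store API. It only shows the idea that the vector is computed from the text while the metadata rides along with it.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def toy_embed(text: str) -> dict:
    """Toy sparse bag-of-words vector, normalized to unit length.
    A real pipeline would call an embedding model here."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a: dict, b: dict) -> float:
    """Dot product of two unit-length sparse vectors."""
    return sum(value * b.get(word, 0.0) for word, value in a.items())


class ToyVectorStore:
    """Keeps every vector next to the Doc it was computed from."""

    def __init__(self):
        self.rows = []  # list of (vector, Doc) pairs

    def add_documents(self, docs):
        for doc in docs:
            # Only page_content is embedded; metadata is stored untouched.
            self.rows.append((toy_embed(doc.page_content), doc))

    def search(self, query: str, k: int = 1):
        q = toy_embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [doc for _, doc in ranked[:k]]


store = ToyVectorStore()
store.add_documents([
    Doc("Aspirin is an anti-inflammatory drug used to reduce pain and fever.",
        {"source": "pharma_textbook.pdf", "page": 12}),
    Doc("Quarterly revenue grew because of strong cloud sales.",
        {"source": "annual_report.pdf", "page": 3}),
])

hit = store.search("drug to reduce fever")[0]
print(hit.metadata)  # the source and page of the retrieved chunk
```

If the store kept only the vectors, the search could still find the nearest one, but there would be no way to say which file or page it came from.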
In RAG, large documents are split into chunks for better retrieval.
Each chunk becomes a Document object so that embeddings are tied to manageable text pieces.
This makes retrieval precise: the system returns only the relevant sections instead of entire PDFs or books.
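The chunking step can be sketched as follows. This is a simplified fixed-size character splitter with overlap, not the splitter of any particular library. The `Doc` class, the `split_into_docs` helper, and the `chunk` / `start` metadata keys are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_into_docs(text: str, source: str, chunk_size: int = 200, overlap: int = 50):
    """Cut text into overlapping character windows, one Doc per window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    docs = []
    step = chunk_size - overlap
    for index, start in enumerate(range(0, len(text), step)):
        piece = text[start:start + chunk_size]
        # Each chunk inherits the source and records where it sits in the original.
        docs.append(Doc(piece, {"source": source, "chunk": index, "start": start}))
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return docs


long_text = "abcdefghij" * 50  # 500 characters of placeholder text
docs = split_into_docs(long_text, source="pharma_textbook.pdf")
print(len(docs), [d.metadata["start"] for d in docs])
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary is less likely to be lost. Real splitters usually also try to break on paragraph or sentence boundaries.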
Since Document carries metadata, you can:
Filter by source, author, date, tag, etc.
Re-rank results by importance (e.g., prefer certain document sources).
This is important for enterprise RAG, compliance, and multi-source knowledge bases.
👉 Example:
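# Filter support and syntax depend on the vector store backend
retriever = vector_store.as_retriever(
    search_kwargs={"filter": {"source": "pharma_textbook.pdf"}}
)
retriever.invoke("What are side effects of Aspirin?")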
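Filtering and re-ranking by metadata can also be illustrated without any library. The sketch below assumes retrieval has already produced (document, similarity) pairs. `Doc`, `filter_hits`, `rerank`, the sample texts, and the scores are all made up for the example, and the additive boost is just one simple way to prefer a source.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_hits(hits, **conditions):
    """Keep hits whose metadata matches every key=value condition."""
    return [
        (doc, score) for doc, score in hits
        if all(doc.metadata.get(key) == value for key, value in conditions.items())
    ]


def rerank(hits, source_boost):
    """Add a per-source bonus to each similarity score, then sort descending."""
    def boosted(hit):
        doc, score = hit
        return score + source_boost.get(doc.metadata.get("source"), 0.0)
    return sorted(hits, key=boosted, reverse=True)


# Illustrative retrieval results: (Doc, similarity score)
hits = [
    (Doc("Textbook chunk on side effects.",
         {"source": "pharma_textbook.pdf", "tag": "side-effects"}), 0.80),
    (Doc("Forum chunk on side effects.",
         {"source": "forum_post.html", "tag": "side-effects"}), 0.85),
    (Doc("Textbook chunk on dosage.",
         {"source": "pharma_textbook.pdf", "tag": "dosage"}), 0.40),
]

side_effects = filter_hits(hits, tag="side-effects")
ranked = rerank(side_effects, source_boost={"pharma_textbook.pdf": 0.1})
print([doc.metadata["source"] for doc, _ in ranked])
```

Here the forum post has the higher raw similarity, but the boost moves the textbook chunk to the top, which is the "prefer certain document sources" behaviour described above.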
RAG is often used where citation/explainability matters.
Document metadata lets the system return the original source alongside the answer.
This increases trust (especially in finance, medicine, law).
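As a rough sketch of how citations could be built from retrieved chunks, the helper below turns metadata into a numbered source list. `Doc`, `format_citations`, the second file name, and the label format are assumptions made for illustration. A real system would adapt this to whatever metadata keys it stores.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_citations(docs):
    """Build a numbered, de-duplicated source list from chunk metadata."""
    labels = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page")
        label = f"{source}, p. {page}" if page is not None else source
        if label not in labels:  # two chunks from the same page cite once
            labels.append(label)
    return [f"[{number}] {label}" for number, label in enumerate(labels, start=1)]


retrieved = [
    Doc("First chunk.", {"source": "pharma_textbook.pdf", "page": 12}),
    Doc("Second chunk.", {"source": "pharma_textbook.pdf", "page": 12}),
    Doc("Third chunk.", {"source": "drug_faq.html"}),
]

citations = format_citations(retrieved)
print("\n".join(citations))
```

The list can then be appended to the generated answer so a reader can check each claim against its source.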
The Document object in RAG is the bridge between raw text, embeddings, and retrieval.
It ensures:
Embeddings are generated from clean text (page_content).
Metadata preserves context, source, and traceability.
Retrieval is accurate, explainable, and filterable.