Let's deep dive into the core logic of document deduplication in FAISS-based pipelines.
This little `_fingerprint()` method is one of the most crucial utility functions in your entire doc-ingestion process.
```python
@staticmethod
def _fingerprint(text: str, md: Dict[str, Any]) -> str:
    src = md.get("source") or md.get("file_path")
    rid = md.get("row_id")
    if src is not None:
        return f"{src}::{'' if rid is None else rid}"
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```
What does `_fingerprint()` do? 👉 It creates a unique identifier (hash or key) for each document or text chunk being added to the FAISS vector store.
👉 This unique ID (fingerprint) helps in:
Detecting duplicates
Tracking specific sources (e.g., same file, same row)
Maintaining metadata consistency
`@staticmethod` means this function does not depend on the class instance (`self`).
It can be called directly like:

```python
FaissManager._fingerprint(text, metadata)
```

It's just a helper utility that uses only its parameters.
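As a quick sanity check, here is a minimal sketch showing the static method invoked on the class itself, with no instance created (the `FaissManager` class here is a stripped-down stand-in; the real manager also holds the FAISS index):

```python
import hashlib
from typing import Any, Dict

class FaissManager:
    """Minimal stand-in: only the fingerprint helper, no index state."""

    @staticmethod
    def _fingerprint(text: str, md: Dict[str, Any]) -> str:
        src = md.get("source") or md.get("file_path")
        rid = md.get("row_id")
        if src is not None:
            return f"{src}::{'' if rid is None else rid}"
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Called directly on the class -- no instance required
key = FaissManager._fingerprint("hello", {"source": "a.txt", "row_id": 1})
print(key)  # a.txt::1
```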
```python
def _fingerprint(text: str, md: Dict[str, Any]) -> str:
```
Takes two inputs:
text → actual document content (string)
md → metadata dictionary (with fields like source, file_path, or row_id)
Returns a string fingerprint (unique key).
source or file_path

```python
src = md.get("source") or md.get("file_path")
```
Tries to get "source" from metadata.
If not present, it tries "file_path".
Either of these typically indicates where the text came from (e.g., PDF name, text file, database row).
✅ Example:

```python
md = {"source": "report1.pdf"}
# src -> "report1.pdf"
```
row_id

```python
rid = md.get("row_id")
```
If documents are coming from a table (like a CSV or database), row_id uniquely identifies each row.
This adds fine-grained uniqueness to each text piece.
✅ Example:

```python
md = {"source": "report1.pdf", "row_id": 3}
# rid -> 3
```
src exists → create a readable fingerprint

```python
if src is not None:
    return f"{src}::{'' if rid is None else rid}"
```
Let’s decode this pattern 👇
| Case | Example Metadata | Fingerprint Output |
|---|---|---|
| File only | `{"source": "report1.pdf"}` | `"report1.pdf::"` |
| File + Row | `{"source": "report1.pdf", "row_id": 3}` | `"report1.pdf::3"` |
| From database | `{"file_path": "users.csv", "row_id": 7}` | `"users.csv::7"` |
So it creates a human-readable and consistent key combining file name + row number.
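The three table rows can be reproduced with a standalone copy of the helper (`fingerprint` here is a module-level stand-in for the method, so the snippet runs on its own):

```python
import hashlib

def fingerprint(text, md):
    # Same logic as FaissManager._fingerprint, lifted out for demonstration
    src = md.get("source") or md.get("file_path")
    rid = md.get("row_id")
    if src is not None:
        return f"{src}::{'' if rid is None else rid}"
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

print(fingerprint("...", {"source": "report1.pdf"}))               # report1.pdf::
print(fingerprint("...", {"source": "report1.pdf", "row_id": 3}))  # report1.pdf::3
print(fingerprint("...", {"file_path": "users.csv", "row_id": 7})) # users.csv::7
```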
```python
return hashlib.sha256(text.encode("utf-8")).hexdigest()
```
If the document has no identifiable source,
the function uses the text content itself to generate a unique hash.
💡 SHA256 ensures that even a single character change produces a different hash:
"Hello World" → a591a6d40bf420404a0117...
"Hello world" → 64ec88ca00b268e5ba1a35...
✅ Example:

```python
import hashlib

text = "This is a paragraph from a random source."
key = hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Output (64-char hexadecimal hash):

`"ed76d3e4b95ad2342b1d1e18afcc8b93efb8a894a823cbd7757d93922abdc3c5"`
| Step | Purpose | Example Output |
|---|---|---|
| Check `source` or `file_path` | Identify file name | `"report1.pdf"` |
| Check `row_id` | Identify row inside file | `"report1.pdf::3"` |
| If no file info | Create hash from text | `"ed76d3e4b95ad2342b1d..."` |
| Return | Unique fingerprint | Used as dictionary key in `_meta["rows"]` |
✅ Prevents re-indexing the same content
✅ Enables fast lookup by file name or row
✅ Supports traceability between FAISS vectors and source documents
✅ Helps in maintaining clean, non-duplicated indexes
```python
for doc in docs:
    md = doc.metadata or {}
    key = self._fingerprint(doc.page_content, md)
    if key in self._meta["rows"]:
        continue  # skip duplicate
    self._meta["rows"][key] = {"source": md.get("source")}
```
`_fingerprint()` = unique ID generator for every document chunk.
📎 It prefers readable keys like:
"report1.pdf::3"
and falls back to secure hashed keys like:
"ab12cd34ef56..."
when no metadata is available.