_fingerprint() method: core logic of document deduplication

15 Oct, 2025

Let's take a deep dive into the core logic of document deduplication in FAISS-based pipelines.
This small _fingerprint() method is one of the most important utility functions in your doc-ingestion process.

🧩 Code to Understand

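import hashlib
from typing import Any, Dict

@staticmethod
def _fingerprint(text: str, md: Dict[str, Any]) -> str:
    # Prefer a readable key built from where the document came from.
    src = md.get("source") or md.get("file_path")
    rid = md.get("row_id")
    if src is not None:
        return f"{src}::{'' if rid is None else rid}"
    # No origin recorded: fall back to a hash of the content itself.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()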

🎯 Purpose: What Does `_fingerprint()` Do?

👉 It creates a unique identifier (hash or key) for each document or text chunk being added to the FAISS vector store.
👉 This unique ID (fingerprint) helps in:

Detecting duplicates
Tracking specific sources (e.g., same file, same row)
Maintaining metadata consistency

🧱 Step-by-Step Breakdown

1️⃣ `@staticmethod`

This means the function does not depend on the class instance (self).

It can be called directly like:

FaissManager._fingerprint(text, metadata)

It’s just a helper utility that uses only its parameters.
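To make this concrete, here is a minimal sketch showing that the helper gives the same result whether it is called on the class or on an instance. The FaissManager body below is a stripped-down stand-in that contains only _fingerprint(); the real class presumably holds much more.

```python
import hashlib
from typing import Any, Dict


class FaissManager:
    """Stand-in class: only the helper under discussion is included."""

    @staticmethod
    def _fingerprint(text: str, md: Dict[str, Any]) -> str:
        src = md.get("source") or md.get("file_path")
        rid = md.get("row_id")
        if src is not None:
            return f"{src}::{'' if rid is None else rid}"
        return hashlib.sha256(text.encode("utf-8")).hexdigest()


metadata = {"source": "report1.pdf", "row_id": 3}

# Called on the class: no instance is needed.
via_class = FaissManager._fingerprint("some text", metadata)

# Called on an instance: also works, and no `self` is passed to the function.
via_instance = FaissManager()._fingerprint("some text", metadata)

print(via_class)                   # report1.pdf::3
print(via_class == via_instance)   # True
```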

2️⃣ Function signature

def _fingerprint(text: str, md: Dict[str, Any]) -> str:

Takes two inputs:
- text → actual document content (string)
- md → metadata dictionary (with fields like source, file_path, or row_id)
Returns a string fingerprint (unique key).

3️⃣ Extract `source` or `file_path`

src = md.get("source") or md.get("file_path")

Tries to get "source" from metadata.
If it is missing or empty (the `or` treats None and "" the same way), it tries "file_path".
Either of these typically indicates where the text came from (e.g., PDF name, text file, database row).
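The fallback behaviour is easy to check in isolation. The sketch below wraps the same expression in a small function; the name pick_source is made up for this illustration and does not appear in the original class.

```python
from typing import Any, Dict, Optional


def pick_source(md: Dict[str, Any]) -> Optional[str]:
    """Same expression as in _fingerprint(), isolated for experimentation."""
    return md.get("source") or md.get("file_path")


# "source" present: used as-is.
print(pick_source({"source": "report1.pdf"}))                  # report1.pdf

# "source" missing: falls back to "file_path".
print(pick_source({"file_path": "users.csv"}))                 # users.csv

# "source" is an empty string: `or` treats it as falsy and falls back too.
print(pick_source({"source": "", "file_path": "users.csv"}))   # users.csv

# Neither key present: the result is None, which later triggers the hash branch.
print(pick_source({}))                                         # None
```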

✅ Example:

md = {"source": "report1.pdf"}
src = "report1.pdf"

4️⃣ Extract `row_id`

rid = md.get("row_id")

If documents are coming from a table (like a CSV or database), row_id uniquely identifies each row.
This adds fine-grained uniqueness to each text piece.

✅ Example:

md = {"source": "report1.pdf", "row_id": 3}
rid = 3
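One detail worth seeing in action: the method tests row_id with `is None` rather than by truthiness, so a row numbered 0 is kept. The helper name key_suffix below is hypothetical and only isolates the row part of the key.

```python
from typing import Any, Dict


def key_suffix(md: Dict[str, Any]) -> str:
    """Row part of the fingerprint, using the same `is None` test."""
    rid = md.get("row_id")
    return "" if rid is None else str(rid)


print(repr(key_suffix({"row_id": 3})))   # '3'
print(repr(key_suffix({"row_id": 0})))   # '0'  (row 0 is kept, not dropped)
print(repr(key_suffix({})))              # ''
```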

5️⃣ If `src` exists → Create readable fingerprint

if src is not None:
    return f"{src}::{'' if rid is None else rid}"

Let’s decode this pattern 👇

Case	Example Metadata	Fingerprint Output
File only	`{"source": "report1.pdf"}`	`"report1.pdf::"`
File + Row	`{"source": "report1.pdf", "row_id": 3}`	`"report1.pdf::3"`
file_path + Row	`{"file_path": "users.csv", "row_id": 7}`	`"users.csv::7"`

So it creates a human-readable, consistent key that combines the file name and the row number (the row part is empty when there is no row_id).
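The table rows can be reproduced with a few lines. The sketch also checks a consequence of the key format: two different chunks from the same file, with no row_id, end up with the same key. The function name readable_key and the "page" metadata field are assumptions made for this example.

```python
from typing import Any, Dict


def readable_key(md: Dict[str, Any]) -> str:
    """Readable branch only; assumes "source" or "file_path" is set."""
    src = md.get("source") or md.get("file_path")
    rid = md.get("row_id")
    return f"{src}::{'' if rid is None else rid}"


print(readable_key({"source": "report1.pdf"}))                 # report1.pdf::
print(readable_key({"source": "report1.pdf", "row_id": 3}))    # report1.pdf::3
print(readable_key({"file_path": "users.csv", "row_id": 7}))   # users.csv::7

# Two different chunks of the same file, neither carrying a row_id.
# "page" is a hypothetical field that the key does not look at.
chunk_a = {"source": "report1.pdf", "page": 1}
chunk_b = {"source": "report1.pdf", "page": 2}
print(readable_key(chunk_a) == readable_key(chunk_b))          # True
```

If each chunk of a file should be indexed separately, the metadata would need a row_id (or similar) per chunk; otherwise only the first chunk passes the duplicate check.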

6️⃣ Else → Generate SHA256 hash of text

return hashlib.sha256(text.encode("utf-8")).hexdigest()

If the document has no identifiable source, the function hashes the text content itself, so identical text always produces the same key.

💡 SHA256 ensures that even a single character change produces a different hash:

"Hello World" → a591a6d40bf420404a0117...
"Hello world" → 64ec88ca00b268e5ba1a35...

✅ Example:

text = "This is a paragraph from a random source."
key = hashlib.sha256(text.encode("utf-8")).hexdigest()

Output (a 64-character hexadecimal digest; the value shown is illustrative):

"ed76d3e4b95ad2342b1d1e18afcc8b93efb8a894a823cbd7757d93922abdc3c5"
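The "Hello World" versus "Hello world" comparison can be verified with a short script. The wrapper name sha256_hex is only a convenience for this sketch; the hashing call is the same one used in the fallback branch.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Hex digest of the UTF-8 encoded text, as in the fallback branch."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


upper = sha256_hex("Hello World")
lower = sha256_hex("Hello world")

print(upper[:22])        # a591a6d40bf420404a0117
print(lower[:22])        # 64ec88ca00b268e5ba1a35
print(len(upper))        # 64
print(upper == lower)    # False

# Identical text always maps to the same digest, which is what
# makes the hash usable as a deduplication key.
print(sha256_hex("Hello World") == upper)   # True
```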

🧠 Summary Table

Step	Purpose	Example Output
Check `source` or `file_path`	Identify file name	`"report1.pdf"`
Check `row_id`	Identify row inside file	`"report1.pdf::3"`
If no file info	Create hash from text	`"ed76d3e4b95ad2342b1d..."`
Return	Unique fingerprint	Used as dictionary key in `_meta["rows"]`

💡 Why This Is Important

✅ Prevents re-indexing the same content
✅ Enables fast lookup by file name or row
✅ Supports traceability between FAISS vectors and source documents
✅ Helps in maintaining clean, non-duplicated indexes

🔍 Example in Context

for doc in docs:
    md = doc.metadata or {}  # guard against metadata being None
    key = self._fingerprint(doc.page_content, md)
    if key in self._meta["rows"]:
        continue  # Skip duplicate
    self._meta["rows"][key] = {"source": md.get("source")}
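For readers who want to run the loop end to end, here is a self-contained sketch. The Doc dataclass and the add_new() method are hypothetical stand-ins invented for this example, and the class keeps only the `_meta["rows"]` bookkeeping with no actual FAISS index.

```python
import hashlib
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical document object with the two fields the loop uses."""
    page_content: str
    metadata: Optional[Dict[str, Any]] = None


class FaissManager:
    """Stripped-down stand-in: bookkeeping only, no FAISS index."""

    def __init__(self) -> None:
        self._meta: Dict[str, Dict[str, Any]] = {"rows": {}}

    @staticmethod
    def _fingerprint(text: str, md: Dict[str, Any]) -> str:
        src = md.get("source") or md.get("file_path")
        rid = md.get("row_id")
        if src is not None:
            return f"{src}::{'' if rid is None else rid}"
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def add_new(self, docs: List[Doc]) -> List[Doc]:
        """Return only the docs whose fingerprint has not been seen before."""
        fresh = []
        for doc in docs:
            md = doc.metadata or {}
            key = self._fingerprint(doc.page_content, md)
            if key in self._meta["rows"]:
                continue  # Skip duplicate
            self._meta["rows"][key] = {"source": md.get("source")}
            fresh.append(doc)
        return fresh


manager = FaissManager()
docs = [
    Doc("row three", {"source": "report1.pdf", "row_id": 3}),
    Doc("row three again", {"source": "report1.pdf", "row_id": 3}),  # same key
    Doc("loose text"),   # no metadata: keyed by content hash
    Doc("loose text"),   # same text, same hash: skipped
]
added = manager.add_new(docs)

print(len(added))                                   # 2
print("report1.pdf::3" in manager._meta["rows"])    # True
```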

🎯 In Short

_fingerprint() = Unique ID generator for every document chunk.

📎 It prefers readable keys like:

"report1.pdf::3"

and falls back to secure hashed keys like:

"ab12cd34ef56..."

when the metadata has no source or file_path.
