_fingerprint() method: core logic of document deduplication
15 Oct, 2025

Let's take a deep dive into the core logic of document deduplication in FAISS-based pipelines.
This small _fingerprint() method is one of the most important utility functions in your entire document-ingestion process.


🧩 Code to Understand

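import hashlib
from typing import Any, Dict

# Defined inside the manager class (called as FaissManager._fingerprint later in this post)
@staticmethod
def _fingerprint(text: str, md: Dict[str, Any]) -> str:
    src = md.get("source") or md.get("file_path")  # where the text came from
    rid = md.get("row_id")  # optional row identifier
    if src is not None:
        return f"{src}::{'' if rid is None else rid}"  # readable key
    return hashlib.sha256(text.encode("utf-8")).hexdigest()  # content hash fallback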

🎯 Purpose: What Does _fingerprint() Do?

👉 It creates a deterministic identifier (a readable key or a hash) for each document or text chunk being added to the FAISS vector store.
👉 This identifier (the fingerprint) helps in:

  • Detecting duplicates

  • Tracking specific sources (e.g., same file, same row)

  • Maintaining metadata consistency


🧱 Step-by-Step Breakdown

1️⃣ @staticmethod

  • Means this function does not depend on the class instance (self).

  • It can be called directly like:

    FaissManager._fingerprint(text, metadata)
    
  • It’s just a helper utility that uses only its parameters.
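To make the static call concrete, here is a minimal sketch. The FaissManager class below is a stripped-down stand-in that contains only this one method; the real class presumably holds much more, so treat everything except _fingerprint() itself as an assumption.

```python
import hashlib
from typing import Any, Dict


class FaissManager:
    """Stand-in class for illustration; only the helper is included."""

    @staticmethod
    def _fingerprint(text: str, md: Dict[str, Any]) -> str:
        src = md.get("source") or md.get("file_path")
        rid = md.get("row_id")
        if src is not None:
            return f"{src}::{'' if rid is None else rid}"
        return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Called on the class itself: no instance is needed.
via_class = FaissManager._fingerprint("some text", {"source": "report1.pdf"})

# Calling through an instance also works and gives the same result.
via_instance = FaissManager()._fingerprint("some text", {"source": "report1.pdf"})

print(via_class)     # report1.pdf::
print(via_instance)  # report1.pdf::
```

Because the method never touches self, both call styles behave identically.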


2️⃣ Function signature

def _fingerprint(text: str, md: Dict[str, Any]) -> str:
  • Takes two inputs:

    • text → actual document content (string)

    • md → metadata dictionary (with fields like source, file_path, or row_id)

  • Returns a string fingerprint (unique key).


3️⃣ Extract source or file_path

src = md.get("source") or md.get("file_path")
  • Tries to get "source" from metadata.

  • If it is missing or empty (any falsy value), it falls back to "file_path".

  • Either of these typically indicates where the text came from (e.g., PDF name, text file, database row).

✅ Example:

md = {"source": "report1.pdf"}
src = "report1.pdf"
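The fallback behaviour of `or` is easy to check in isolation. The sketch below pulls that single line into a helper named pick_source, which is a hypothetical name used only for this illustration.

```python
from typing import Any, Dict, Optional


def pick_source(md: Dict[str, Any]) -> Optional[str]:
    """Hypothetical helper isolating the first line of _fingerprint()."""
    # `or` returns the first truthy operand, otherwise the last operand.
    return md.get("source") or md.get("file_path")


print(pick_source({"source": "report1.pdf"}))                 # report1.pdf
print(pick_source({"file_path": "users.csv"}))                # users.csv
print(pick_source({"source": "", "file_path": "users.csv"}))  # users.csv
print(pick_source({}))                                        # None
```

Note the third case: an empty "source" string is falsy, so "file_path" is used instead. When both keys hold real values, "source" wins.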

4️⃣ Extract row_id

rid = md.get("row_id")
  • If documents are coming from a table (like a CSV or database), row_id uniquely identifies each row.

  • This adds fine-grained uniqueness to each text piece.

✅ Example:

md = {"source": "report1.pdf", "row_id": 3}
rid = 3

5️⃣ If src exists → Create readable fingerprint

if src is not None:
    return f"{src}::{'' if rid is None else rid}"

Let’s decode this pattern 👇

Case          | Example Metadata                        | Fingerprint Output
--------------|-----------------------------------------|-------------------
File only     | {"source": "report1.pdf"}               | "report1.pdf::"
File + Row    | {"source": "report1.pdf", "row_id": 3}  | "report1.pdf::3"
From database | {"file_path": "users.csv", "row_id": 7} | "users.csv::7"

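So it creates a human-readable, consistent key combining the file name and the row number. When row_id is absent, every chunk from the same file maps to the same key (for example "report1.pdf::").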
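The cases above can be reproduced with a standalone copy of the function. This is a sketch: the logic is taken from the method shown earlier, but it is written as a plain module-level function named fingerprint purely for convenience.

```python
import hashlib
from typing import Any, Dict


def fingerprint(text: str, md: Dict[str, Any]) -> str:
    """Standalone copy of _fingerprint() for experimentation."""
    src = md.get("source") or md.get("file_path")
    rid = md.get("row_id")
    if src is not None:
        return f"{src}::{'' if rid is None else rid}"
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


print(fingerprint("t", {"source": "report1.pdf"}))               # report1.pdf::
print(fingerprint("t", {"source": "report1.pdf", "row_id": 3}))  # report1.pdf::3
print(fingerprint("t", {"file_path": "users.csv", "row_id": 7})) # users.csv::7

# row_id 0 is kept because the check is `rid is None`, not truthiness.
print(fingerprint("t", {"source": "report1.pdf", "row_id": 0}))  # report1.pdf::0
```

The `is None` test is a deliberate choice worth noticing: a plain `if rid` would silently drop the first row of a zero-indexed table.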


6️⃣ Else → Generate SHA256 hash of text

return hashlib.sha256(text.encode("utf-8")).hexdigest()

If the document has no identifiable source,
the function uses the text content itself to generate a unique hash.

💡 SHA256 ensures that even a single character change produces a different hash:

"Hello World" → a591a6d40bf420404a0117...
"Hello world" → 64ec88ca00b268e5ba1a35...

✅ Example:

text = "This is a paragraph from a random source."
key = hashlib.sha256(text.encode("utf-8")).hexdigest()

Output (a 64-character hexadecimal string; the value shown is illustrative):

"ed76d3e4b95ad2342b1d1e18afcc8b93efb8a894a823cbd7757d93922abdc3c5"
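A quick way to see the fallback branch in action is to hash two nearly identical strings. The sketch below uses a small wrapper called sha256_hex, a hypothetical name for this illustration; the hashing call itself is the same one used in _fingerprint().

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Hypothetical wrapper around the fallback branch of _fingerprint()."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


upper = sha256_hex("Hello World")
lower = sha256_hex("Hello world")

print(len(upper), len(lower))               # 64 64
print(upper == lower)                       # False: one letter differs
print(sha256_hex("Hello World") == upper)   # True: same text, same key
```

The last line is the property deduplication relies on: identical text always yields the identical key, regardless of when or where it is hashed.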

🧠 Summary Table

Step                      | Purpose                  | Example Output
--------------------------|--------------------------|---------------------------------------
Check source or file_path | Identify file name       | "report1.pdf"
Check row_id              | Identify row inside file | "report1.pdf::3"
If no file info           | Create hash from text    | "ed76d3e4b95ad2342b1d..."
Return                    | Unique fingerprint       | Used as dictionary key in _meta["rows"]

💡 Why This Is Important

✅ Prevents re-indexing the same content
✅ Enables fast dictionary lookup by the combined source and row key
✅ Supports traceability between FAISS vectors and source documents
✅ Helps in maintaining clean, non-duplicated indexes


🔍 Example in Context

for doc in docs:
    md = doc.metadata or {}  # tolerate documents without metadata
    key = self._fingerprint(doc.page_content, md)
    if key in self._meta["rows"]:
        continue  # Skip duplicate
    self._meta["rows"][key] = {"source": md.get("source")}
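The loop above can be tried end to end without FAISS. The following is a minimal sketch under stated assumptions: Doc is a hypothetical stand-in for whatever document class the pipeline uses, add_new is a made-up helper name, and a plain dict plays the role of _meta["rows"].

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a document with content and metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def fingerprint(text: str, md: Dict[str, Any]) -> str:
    src = md.get("source") or md.get("file_path")
    rid = md.get("row_id")
    if src is not None:
        return f"{src}::{'' if rid is None else rid}"
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def add_new(docs: List[Doc], rows: Dict[str, Dict[str, Any]]) -> List[Doc]:
    """Return only the docs whose fingerprint is not already in rows."""
    added = []
    for doc in docs:
        md = doc.metadata or {}
        key = fingerprint(doc.page_content, md)
        if key in rows:
            continue  # skip duplicate
        rows[key] = {"source": md.get("source")}
        added.append(doc)
    return added


rows: Dict[str, Dict[str, Any]] = {}
docs = [
    Doc("row three", {"source": "users.csv", "row_id": 3}),
    Doc("row three", {"source": "users.csv", "row_id": 3}),  # exact duplicate
    Doc("page one", {"source": "report1.pdf"}),
    Doc("page two", {"source": "report1.pdf"}),              # same key as page one
    Doc("loose text"),                                       # no source: hashed
]
added = add_new(docs, rows)

print(len(added))                            # 3
print(sorted(k for k in rows if "::" in k))  # ['report1.pdf::', 'users.csv::3']
```

Notice that "page two" is skipped even though its text differs from "page one": without a row_id, both chunks share the key "report1.pdf::". If chunk-level deduplication is needed, the metadata should probably carry a row_id (or a similar per-chunk identifier) for every chunk.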

🎯 In Short

_fingerprint() = Unique ID generator for every document chunk.

📎 It prefers readable keys like:

"report1.pdf::3"

and falls back to secure hashed keys like:

"ab12cd34ef56..."

when the metadata has no source or file_path.