The right Python library depends on the document type:

PDF (.pdf)

| Library | Purpose | Example |
|---|---|---|
| PyMuPDF (fitz) | Extract text, images, metadata from PDFs | `fitz.open("file.pdf").get_page_text(0)` |
| pdfplumber | Accurate text extraction, tables | `pdfplumber.open("file.pdf")` |
| PyPDF2 / pypdf | Basic text and metadata | `PdfReader("file.pdf").pages[0].extract_text()` |

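
As a minimal sketch of the pypdf row, assuming the `pypdf` package is installed: the older `getPage()` accessor is gone from current releases, so pages are read through `reader.pages`. The helper names `join_page_texts` and `extract_pdf_text` are hypothetical, chosen only for this illustration.

```python
def join_page_texts(page_texts):
    """Join per-page strings, skipping pages that yielded no text."""
    # Hypothetical helper: scanned pages often come back empty or None-like.
    return "\n\n".join(t.strip() for t in page_texts if t and t.strip())


def extract_pdf_text(path):
    """Return the text of every page in a PDF (hypothetical helper)."""
    # Imported lazily so the module loads even without the third-party package.
    from pypdf import PdfReader  # pip install pypdf

    reader = PdfReader(path)
    return join_page_texts(page.extract_text() for page in reader.pages)
```

If `extract_pdf_text("file.pdf")` comes back empty, the PDF is probably a scan and needs OCR instead.
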

Word (.docx)

| Library | Purpose | Example |
|---|---|---|
| python-docx | Extract & manipulate DOCX content | `Document("file.docx").paragraphs` |

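
One thing worth knowing about python-docx is that `.paragraphs` does not include text inside tables. The sketch below, assuming `python-docx` is installed, walks both `doc.paragraphs` and `doc.tables`; the function names `document_text` and `load_docx_text` are hypothetical.

```python
def document_text(doc):
    """Collect body paragraphs, then table rows, from a python-docx Document."""
    # Body paragraphs first, skipping blank ones.
    lines = [p.text for p in doc.paragraphs if p.text.strip()]
    # Table text is not part of doc.paragraphs, so gather it row by row.
    for table in doc.tables:
        for row in table.rows:
            lines.append(" | ".join(cell.text.strip() for cell in row.cells))
    return "\n".join(lines)


def load_docx_text(path):
    """Open a .docx file and return its text (hypothetical helper)."""
    from docx import Document  # pip install python-docx

    return document_text(Document(path))
```

Note that this puts all table rows after the paragraphs, so the original reading order is not preserved.
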

Excel (.xlsx, .xls)

| Library | Purpose | Example |
|---|---|---|
| openpyxl | Read/write .xlsx | `load_workbook("file.xlsx")` |
| pandas | Load Excel into DataFrames | `pd.read_excel("file.xlsx")` |
| xlrd | Read .xls files (older Excel) | `xlrd.open_workbook("file.xls")` |

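
The three rows fit together: pandas delegates the actual parsing to openpyxl for `.xlsx` and to xlrd for `.xls`. The sketch below makes that choice explicit, assuming pandas plus the matching engine package are installed. `ENGINES`, `pick_engine`, and `read_all_sheets` are hypothetical names.

```python
from pathlib import Path

# Which reader backs which extension (illustrative mapping, not exhaustive).
ENGINES = {".xlsx": "openpyxl", ".xlsm": "openpyxl", ".xls": "xlrd"}


def pick_engine(path):
    """Return the pandas engine name for a spreadsheet path."""
    suffix = Path(path).suffix.lower()
    if suffix not in ENGINES:
        raise ValueError(f"Unsupported spreadsheet type: {suffix!r}")
    return ENGINES[suffix]


def read_all_sheets(path):
    """Load every sheet into a dict of {sheet_name: DataFrame}."""
    import pandas as pd  # pip install pandas openpyxl xlrd

    # sheet_name=None asks pandas for all sheets instead of only the first.
    return pd.read_excel(path, sheet_name=None, engine=pick_engine(path))
```

pandas can usually infer the engine on its own; passing it explicitly mainly gives a clearer error when the file type is unexpected.
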

Images and scans (OCR)

| Library | Purpose | Example |
|---|---|---|
| pytesseract | OCR text from images (render PDF pages to images first) | `pytesseract.image_to_string(Image.open("scan.jpg"))` |
| easyocr | Supports multiple languages, robust OCR | `easyocr.Reader(["en"]).readtext("scan.jpg")` |

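
By default, EasyOCR's `readtext` returns one `(bounding_box, text, confidence)` entry per detected region, so a common follow-up step is dropping low-confidence hits. A minimal sketch, assuming `easyocr` is installed; the 0.5 threshold and the names `confident_text` and `ocr_image` are arbitrary choices for illustration.

```python
def confident_text(results, min_confidence=0.5):
    """Join recognised strings whose confidence meets the threshold."""
    # Each result is (bounding_box, text, confidence).
    return " ".join(text for _bbox, text, conf in results if conf >= min_confidence)


def ocr_image(path, languages=("en",)):
    """Run EasyOCR on one image and return the filtered text (hypothetical helper)."""
    import easyocr  # pip install easyocr

    # Building a Reader loads the models, so reuse it across many images.
    reader = easyocr.Reader(list(languages))
    return confident_text(reader.readtext(str(path)))
```
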

If you're building a retriever or a RAG (retrieval-augmented generation) system, combine:
- 📝 LangChain or Haystack for pipelines
- 📁 Unstructured or LangChain's document loaders (`langchain_community.document_loaders` in recent releases) for document ingestion
- 📚 Vector stores like FAISS, Chroma, or Pinecone for retrieval
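
To show what the retrieval step does conceptually, here is a toy sketch in plain Python. It is not FAISS, Chroma, or LangChain: word counts stand in for real embeddings, and a linear scan with cosine similarity stands in for a vector index. `embed`, `cosine`, and `ToyVectorStore` are all hypothetical names.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Keeps (chunk, vector) pairs and ranks them by similarity to a query."""

    def __init__(self):
        self._items = []

    def add(self, chunk):
        self._items.append((chunk, embed(chunk)))

    def search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _vec in ranked[:k]]


store = ToyVectorStore()
for chunk in [
    "Invoices are stored as PDF files.",
    "Quarterly budgets live in Excel workbooks.",
    "Scanned receipts need OCR before indexing.",
]:
    store.add(chunk)

top = store.search("Where are the budgets in Excel?", k=1)
print(top)  # ['Quarterly budgets live in Excel workbooks.']
```

In a real pipeline the chunks would come from the loaders above, the vectors from an embedding model, and the search from the vector store's own index.
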