Python libraries depending on the document type:

PDF (.pdf)

Library | Purpose | Example |
---|---|---|
PyMuPDF (`fitz`) | Extract text, images, and metadata from PDFs | `fitz.open("file.pdf").get_page_text(0)` |
pdfplumber | Accurate text extraction, tables | `pdfplumber.open("file.pdf")` |
PyPDF2 / pypdf | Basic text and metadata | `PdfReader("file.pdf").pages[0].extract_text()` |

Word (.docx)

Library | Purpose | Example |
---|---|---|
python-docx | Extract and manipulate DOCX content | `Document("file.docx").paragraphs` |
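
A minimal sketch of turning `Document(...).paragraphs` into plain text, assuming python-docx is installed. The function names `docx_text` and `read_docx` are illustrative only; `docx_text` accepts any objects with a `.text` attribute, which is what python-docx paragraphs expose.

```python
def docx_text(paragraphs) -> str:
    """Join the non-empty paragraph texts with newlines."""
    return "\n".join(p.text.strip() for p in paragraphs if p.text.strip())


def read_docx(path: str) -> str:
    """Load a .docx file and return its body text."""
    # The python-docx package is imported under the name "docx".
    from docx import Document

    return docx_text(Document(path).paragraphs)
```

Note that `paragraphs` covers body paragraphs only; text inside tables is reached separately through `Document(path).tables`.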

Excel (.xlsx, .xls)

Library | Purpose | Example |
---|---|---|
openpyxl | Read/write .xlsx | `load_workbook("file.xlsx")` |
pandas | Load Excel into DataFrames | `pd.read_excel("file.xlsx")` |
xlrd | Read .xls files (older Excel) | `xlrd.open_workbook("file.xls")` |
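
The split between openpyxl and xlrd can be made explicit when loading through pandas. This is a sketch under the assumption that pandas plus the matching engine package are installed; `excel_engine` and `load_sheets` are hypothetical helper names.

```python
from pathlib import Path


def excel_engine(path: str) -> str:
    """Pick the pandas engine name from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".xlsx":
        return "openpyxl"
    if suffix == ".xls":
        return "xlrd"  # xlrd 2.x reads only the legacy .xls format
    raise ValueError(f"Not an Excel file: {path}")


def load_sheets(path: str):
    """Return a dict mapping each sheet name to a DataFrame."""
    import pandas as pd

    # sheet_name=None asks pandas to load every sheet in the workbook.
    return pd.read_excel(path, sheet_name=None, engine=excel_engine(path))
```

pandas usually infers the engine on its own; passing it explicitly mainly makes a missing dependency fail with a clearer error.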

Images and scans (OCR)

Library | Purpose | Example |
---|---|---|
pytesseract | OCR text from images (rasterize PDF pages to images first) | `pytesseract.image_to_string(Image.open("scan.jpg"))` |
easyocr | Supports multiple languages, robust OCR | `easyocr.Reader(["en"]).readtext("scan.jpg")` |
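
One way to tie the tables above together is a small lookup from file extension to a suggested library, with OCR as the route for images. The mapping and the names `LIBRARY_BY_EXTENSION`, `library_for`, and `ocr_image` are assumptions for this sketch; the values are labels for the reader, not import names. `ocr_image` also assumes that Pillow, pytesseract, and the Tesseract binary are installed.

```python
from pathlib import Path

# Extension -> library suggested for that document type (labels only).
LIBRARY_BY_EXTENSION = {
    ".pdf": "pymupdf",
    ".docx": "python-docx",
    ".xlsx": "openpyxl",
    ".xls": "xlrd",
    ".jpg": "pytesseract",
    ".jpeg": "pytesseract",
    ".png": "pytesseract",
}


def library_for(path: str):
    """Return the suggested library label, or None for unknown types."""
    return LIBRARY_BY_EXTENSION.get(Path(path).suffix.lower())


def ocr_image(path: str, lang: str = "eng") -> str:
    """Run Tesseract OCR on a single image file."""
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        return pytesseract.image_to_string(img, lang=lang)
```

For example, `library_for("scan.JPG")` returns `"pytesseract"`, at which point `ocr_image("scan.JPG")` would be the call to make.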
If you're building a Retriever or a RAG system, combine:
- 📝 LangChain or Haystack for pipelines
- 📁 Unstructured or LangChain's document loaders for document ingestion
- 📚 Vector stores like FAISS, Chroma, or Pinecone for retrieval
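
To make the ingest, chunk, and retrieve flow concrete, here is a toy sketch that uses only the standard library. It is not the LangChain or Haystack API: word-count vectors and cosine similarity stand in for real embeddings and a vector store such as FAISS, and the names `chunk_text` and `retrieve` are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks, start, step = [], 0, size - overlap
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def _vector(text: str) -> Counter:
    """Bag-of-words counts; a stand-in for an embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = _vector(query)
    ranked = sorted(chunks, key=lambda c: _cosine(q, _vector(c)), reverse=True)
    return ranked[:k]
```

In a real pipeline the extracted document text would be chunked, embedded, and stored in the vector store, and `retrieve` would become a similarity search against that store.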