🐪 Camelot: PDF Table Extraction in Python

31 Aug, 2025

ABHISHEK AGNIHOTRI

Camelot is a Python library that makes it easy to extract tables from PDF files and convert them into structured formats like CSV, Excel, JSON, or Pandas DataFrames.

Unlike OCR tools (like Tesseract), Camelot does not recognize characters from pixels. It reads the text and drawn lines already embedded in a text-based PDF and uses their positions to detect and parse rows and columns.

✨ Key Features of Camelot

📊 Extracts tables directly into Pandas DataFrames.
📂 Exports tables into CSV, Excel, JSON, HTML formats.
🖼️ Supports two parsing methods:
- Lattice → works best with PDFs that have lines/borders around cells.
- Stream → works best with PDFs where tables are defined by whitespace between columns, without explicit borders.
📑 Can handle multi-page PDFs and extract multiple tables from each page.
⚡ Easy to use, with a small Python API. Lattice mode relies on OpenCV and, depending on the Camelot version, on Ghostscript to render pages as images.

🔧 Installation

You can install Camelot with pip:

pip install "camelot-py[cv]"

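The cv extra installs OpenCV, which lattice mode uses to detect the lines of a table.
Depending on your Camelot version, lattice mode may also need Ghostscript installed separately, because it renders PDF pages as images before looking for lines.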
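A quick way to confirm the install worked is to check that the relevant modules are importable. The sketch below uses only the standard library; the helper name check_camelot_dependencies is made up for this post, and it checks Python modules only, not a system-level Ghostscript install.

```python
import importlib.util


def check_camelot_dependencies():
    """Report which of the relevant Python modules can be imported here."""
    modules = {
        "camelot": "camelot",  # provided by the camelot-py package
        "opencv": "cv2",       # provided by the [cv] extra
        "pandas": "pandas",    # Camelot returns tables as DataFrames
    }
    # find_spec returns None when a top-level module is not installed
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in modules.items()
    }


if __name__ == "__main__":
    for name, found in check_camelot_dependencies().items():
        print(f"{name}: {'found' if found else 'missing'}")
```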

🖇️ How Camelot Works (Two Methods)

1. Lattice Mode

Detects tables using lines and borders.
Works best when the table has clear row/column separators.

Example:

| Name   | Age | Country |
|--------|-----|---------|
| Abhi   | 30  | India   |

Produces very accurate results when the table is cleanly ruled.
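To make the idea concrete, here is a toy sketch of the lattice principle. It is not Camelot's actual implementation. It assumes we already know where the vertical and horizontal lines are, and that each word comes with an (x, y) position. The coordinates and the names build_cells and fill_grid are invented for illustration, and y grows downward here for readability.

```python
def build_cells(v_lines, h_lines):
    """Return a grid of cell boxes (x0, y0, x1, y1) from ruling-line positions."""
    xs, ys = sorted(v_lines), sorted(h_lines)
    return [
        [(xs[c], ys[r], xs[c + 1], ys[r + 1]) for c in range(len(xs) - 1)]
        for r in range(len(ys) - 1)
    ]


def fill_grid(cells, words):
    """Place each (x, y, text) word into the cell whose box contains it."""
    grid = [["" for _ in row] for row in cells]
    for x, y, text in words:
        for r, row in enumerate(cells):
            for c, (x0, y0, x1, y1) in enumerate(row):
                if x0 <= x < x1 and y0 <= y < y1:
                    grid[r][c] = (grid[r][c] + " " + text).strip()
    return grid


# Made-up geometry for the bordered table above
cells = build_cells(v_lines=[0, 50, 80, 140], h_lines=[0, 20, 40])
words = [
    (5, 5, "Name"), (55, 5, "Age"), (85, 5, "Country"),
    (5, 25, "Abhi"), (55, 25, "30"), (85, 25, "India"),
]
table = fill_grid(cells, words)
for row in table:
    print(row)
```

Because the lines define the cell boundaries explicitly, there is no guessing about where one column ends and the next begins, which is why this approach tends to be accurate.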

2. Stream Mode

Detects tables using whitespace alignment.
Useful when tables don’t have borders, just spacing.

Example:

Name     Age   Country
Abhi     30    India
Rahul    28    USA

More error-prone than lattice but powerful when borders are missing.
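The whitespace idea can be sketched in a few lines of plain Python. This is a simplified illustration on fixed-width text, not Camelot's real algorithm (which works on text coordinates inside the PDF). The function name split_by_whitespace_columns and the min_gap parameter are assumptions made for this example.

```python
def split_by_whitespace_columns(lines, min_gap=2):
    """Split rows at runs of >= min_gap character positions blank in every row."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # A position is a gutter candidate if it is blank in every row
    blank = [all(row[i] == " " for row in padded) for i in range(width)]

    spans = []
    start = None  # start index of the column currently being read
    gap = 0       # consecutive all-blank positions seen inside it
    # Trailing sentinel blanks close the last column
    for i, is_blank in enumerate(blank + [True] * min_gap):
        if not is_blank:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, i - gap + 1))
                start, gap = None, 0
    return [[row[a:b].strip() for a, b in spans] for row in padded]


rows = split_by_whitespace_columns([
    "Name     Age   Country",
    "Abhi     30    India",
    "Rahul    28    USA",
])
for row in rows:
    print(row)
```

The sketch also hints at why this approach is more fragile: a single long value that spills into a gutter, or a gap narrower than min_gap, changes where the columns are cut.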

📝 Example Code

Extracting a Table

import camelot

# Reading tables from a PDF
tables = camelot.read_pdf("example.pdf", pages="1", flavor="lattice")

print("Total tables found:", tables.n)
print(tables[0].df)  # Print first table as DataFrame
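The pages argument is a string, so for multi-page PDFs you pass something like "1,3,4", a range such as "1-3", or "all". If your page numbers live in a Python list, a small helper can build that string. The helper below is hypothetical (it is not part of Camelot) and only formats the string.

```python
def pages_arg(page_numbers):
    """Turn page numbers like [1, 2, 3, 7] into a string like '1-3,7'."""
    nums = sorted(set(page_numbers))
    if not nums:
        raise ValueError("at least one page number is required")
    ranges = []
    start = prev = nums[0]
    for n in nums[1:]:
        if n == prev + 1:
            prev = n  # still inside a consecutive run
            continue
        ranges.append((start, prev))
        start = prev = n
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


print(pages_arg([1, 2, 3, 7]))
# Intended use (requires Camelot and a real file):
# tables = camelot.read_pdf("example.pdf", pages=pages_arg([1, 2, 3, 7]))
```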

Exporting a Table

# Export first table to CSV
tables[0].to_csv("output.csv")

# Export all tables to Excel
tables.export("tables.xlsx", f="excel")
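If the output path is chosen at runtime, you need to pick a matching f value for export(). One possible approach, sketched below, maps the file extension to the format name. The helper export_format is an assumption of this post, not a Camelot function, and it only covers the four formats mentioned above.

```python
from pathlib import Path

# File extension -> value for the f argument of tables.export()
_FORMATS = {".csv": "csv", ".xlsx": "excel", ".json": "json", ".html": "html"}


def export_format(path):
    """Pick the export format name from a file path's extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in _FORMATS:
        raise ValueError(f"unsupported export extension: {suffix!r}")
    return _FORMATS[suffix]


print(export_format("tables.xlsx"))
# Intended use (requires a TableList from camelot.read_pdf):
# tables.export("tables.xlsx", f=export_format("tables.xlsx"))
```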

Stream Mode Example

import camelot

# Stream flavor: infer columns from whitespace instead of drawn lines
tables = camelot.read_pdf("example.pdf", pages="1", flavor="stream")
for i, table in enumerate(tables):
    print(f"Table {i+1}")
    print(table.df)
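When you don't know in advance whether a PDF has bordered tables, one simple strategy is to try lattice first and fall back to stream if nothing is found. The sketch below is one way to do that, not an official recipe. The function read_with_fallback and its read_pdf parameter are made up here; the parameter exists only so the logic can be tried with a stand-in reader when Camelot isn't installed.

```python
def read_with_fallback(path, pages="1", read_pdf=None):
    """Try lattice first; if it finds no tables, retry with stream."""
    if read_pdf is None:
        import camelot  # imported lazily so the logic can be tested without it
        read_pdf = camelot.read_pdf
    for flavor in ("lattice", "stream"):
        tables = read_pdf(path, pages=pages, flavor=flavor)
        if len(tables) > 0:
            return flavor, tables
    return None, []


# Stand-in reader that pretends the PDF only has borderless tables
def fake_reader(path, pages, flavor):
    return ["table-1"] if flavor == "stream" else []


print(read_with_fallback("example.pdf", read_pdf=fake_reader))
# With Camelot installed and a real file:
# flavor, tables = read_with_fallback("example.pdf")
```

Note that "no tables found" is a crude signal. Lattice can also return a table that is wrong, so it is still worth inspecting the result.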

📊 Real-World Use Cases

Finance/Banking → Extracting tabular data from financial reports, invoices, statements.
Healthcare → Extracting tables from medical trial results in research PDFs.
Data Science → Preprocessing government reports, survey results, census PDFs.
Business Intelligence → Parsing competitor product lists, catalogs, or regulatory documents.

✅ Advantages of Camelot

Easy integration with Pandas for data analysis.
Handles both bordered and borderless tables.
Exports in multiple structured formats.
Open-source (MIT-licensed), with the code and issue tracker publicly available.

⚠️ Limitations

❌ Doesn’t work with scanned (image-only) PDFs. Run OCR first (e.g. Tesseract) to add a text layer, then pass the result to Camelot.
❌ Complex tables (merged cells, nested headers) may be extracted incorrectly.
❌ Requires manual tweaking (flavor, edge_tol, row_tol, etc.) for best accuracy.
❌ Ghostscript/OpenCV dependencies can be tricky to install on some systems.
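One practical aid when tuning: every extracted table carries a parsing_report dictionary with an accuracy score, so you can flag weak extractions for manual review. The sketch below uses stand-in objects so it runs without Camelot. The helper split_by_accuracy and the 90.0 threshold are assumptions chosen for illustration, and the report values are invented.

```python
from types import SimpleNamespace


def split_by_accuracy(tables, min_accuracy=90.0):
    """Separate tables into (good, needs_review) by reported accuracy."""
    good, needs_review = [], []
    for table in tables:
        if table.parsing_report["accuracy"] >= min_accuracy:
            good.append(table)
        else:
            needs_review.append(table)
    return good, needs_review


# Stand-ins mimicking the shape of a Camelot table's parsing_report
fake_tables = [
    SimpleNamespace(parsing_report={"accuracy": 99.0, "whitespace": 12.0, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 61.5, "whitespace": 40.0, "order": 2, "page": 1}),
]
good, needs_review = split_by_accuracy(fake_tables)
print(len(good), "good,", len(needs_review), "to review")
# With Camelot: good, needs_review = split_by_accuracy(camelot.read_pdf("example.pdf"))
```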

🔄 Alternatives to Camelot

Tabula-py → A popular Python wrapper around Tabula (Java-based), so it needs a Java runtime.
pdfplumber → More versatile (not only tables, but text extraction too).
PyMuPDF (fitz) → Lower-level PDF parsing, can be combined with table detection logic.

🎨 Workflow Summary (Infographic Style)

1️⃣ Load PDF → camelot.read_pdf()
2️⃣ Choose method → flavor="lattice" OR flavor="stream"
3️⃣ Extract tables → tables[i].df
4️⃣ Export results → to_csv(), to_excel(), to_json()