Unstructured is an open-source Python library that extracts and preprocesses unstructured data (PDFs, DOCX files, HTML, emails, and more) and converts it into structured, machine-readable elements for use in NLP, RAG (Retrieval-Augmented Generation), search, and similar pipelines.
It is designed to:
- Parse a variety of document formats (PDF, DOCX, HTML, EML, PPTX, etc.)
- Break documents into semantic elements like `Title`, `NarrativeText`, `ListItem`, etc.
- Output structured formats like JSON, `dict`, or text that can be directly embedded and stored in a vector database for use in LLM pipelines (e.g., LangChain, Haystack).

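The parse-then-export flow in the list above can be sketched as follows. This is a minimal illustration, assuming the library's auto-detecting `partition` function and each element's `to_dict()` method; `partition_file` and `to_json` are hypothetical helper names, and the sample records are hand-written rather than real output.

```python
import json
from typing import Any


def partition_file(path: str) -> list[dict[str, Any]]:
    """Partition a document and return its elements as plain dicts."""
    # Imported lazily so the JSON helper below works without the library installed.
    from unstructured.partition.auto import partition

    elements = partition(filename=path)  # picks a parser from the file type
    return [el.to_dict() for el in elements]


def to_json(records: list[dict[str, Any]]) -> str:
    """Serialize element dicts to a JSON string ready for storage."""
    return json.dumps(records, indent=2, ensure_ascii=False)


# Hand-written records in the element shape, used here instead of a real file.
sample = [
    {"type": "Title", "text": "Introduction to AI", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "AI systems learn from data.", "metadata": {"page_number": 1}},
]
print(to_json(sample))
```

With the library installed, `to_json(partition_file("example.pdf"))` would produce the same kind of output from a real document.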
Feature | Description |
---|---|
📂 Multi-format support | PDF, DOCX, PPTX, EML, HTML, TXT, images |
📑 Semantic chunking | Breaks documents into meaningful chunks (paragraphs, headings, lists, etc.) |
🧠 Intelligent type tagging | Tags chunks as `Title`, `ListItem`, `NarrativeText`, etc. |
🔄 Export formats | JSON, Markdown, text, dict, etc. |
🌐 Web content parsing | Supports scraping + HTML parsing |
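
To make the semantic chunking row concrete, here is a simplified sketch that groups elements under the most recent title. It is not the library's own algorithm (the library ships a `chunk_by_title` function in `unstructured.chunking.title`); `group_by_title` is a hypothetical helper that works on plain dicts.

```python
from typing import Any


def group_by_title(elements: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Start a new chunk at every Title and append following text to it."""
    chunks: list[dict[str, Any]] = []
    for el in elements:
        if el["type"] == "Title" or not chunks:
            # A Title opens a new chunk; text before any Title gets an untitled chunk.
            title = el["text"] if el["type"] == "Title" else None
            chunks.append({"title": title, "body": []})
            if el["type"] == "Title":
                continue
        chunks[-1]["body"].append(el["text"])
    return chunks


elements = [
    {"type": "Title", "text": "Introduction to AI"},
    {"type": "NarrativeText", "text": "AI systems learn from data."},
    {"type": "ListItem", "text": "Supervised learning"},
    {"type": "Title", "text": "Applications"},
    {"type": "NarrativeText", "text": "Search and retrieval are common uses."},
]

for chunk in group_by_title(elements):
    print(chunk["title"], "->", " ".join(chunk["body"]))
```

Keeping each heading together with the text beneath it tends to give retrieval systems more self-contained chunks than splitting on a fixed character count.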

Install the library with support for all document types:

    pip install "unstructured[all-docs]"

Some file types need additional dependencies, such as the Tesseract engine (used through `pytesseract`) for OCR.
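Before processing images, it can help to confirm that the OCR pieces are present. The check below is a small sketch using only the standard library; `check_ocr_dependencies` is a hypothetical helper, and it assumes the Tesseract executable is named `tesseract` on your PATH.

```python
import importlib.util
import shutil


def check_ocr_dependencies() -> dict[str, bool]:
    """Report whether the OCR pieces are visible to this Python environment."""
    return {
        # Python wrapper around the Tesseract engine
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
        # The Tesseract executable itself must be on PATH
        "tesseract": shutil.which("tesseract") is not None,
    }


for name, available in check_ocr_dependencies().items():
    print(f"{name}: {'found' if available else 'missing'}")
```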
To partition a PDF into elements:

    from unstructured.partition.pdf import partition_pdf

    # Partition the PDF into a list of typed elements
    elements = partition_pdf(filename="example.pdf")
    for element in elements:
        print(type(element), element.text)
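Because PDF elements carry a page number in their metadata, a common next step is to regroup text by page. The sketch below assumes elements expose `.text` and `.metadata.page_number`; `text_by_page` is a hypothetical helper, and stand-in objects replace real partition output so the example runs without a PDF.

```python
from collections import defaultdict
from types import SimpleNamespace


def text_by_page(elements) -> dict[int, list[str]]:
    """Collect element text under the page number stored in its metadata."""
    pages: dict[int, list[str]] = defaultdict(list)
    for el in elements:
        # Elements without a page number are filed under page 0 (a choice made for this sketch).
        page = getattr(el.metadata, "page_number", None) or 0
        pages[page].append(el.text)
    return dict(pages)


# Stand-in objects with the same attributes as partitioned elements.
pdf_like = [
    SimpleNamespace(text="Introduction to AI", metadata=SimpleNamespace(page_number=1)),
    SimpleNamespace(text="AI systems learn from data.", metadata=SimpleNamespace(page_number=1)),
    SimpleNamespace(text="Applications", metadata=SimpleNamespace(page_number=2)),
]

for page, texts in text_by_page(pdf_like).items():
    print(page, texts)
```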
DOCX files work the same way:

    from unstructured.partition.docx import partition_docx

    elements = partition_docx(filename="example.docx")
    # category is the element type name, e.g. "Title" or "NarrativeText"
    for el in elements:
        print(el.category, el.text)
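The `category` attribute makes it easy to keep only the parts of a document you care about. As a rough sketch, the hypothetical `body_text` helper below keeps narrative text and list items and drops everything else; which categories count as body content is an assumption to adjust for your documents.

```python
from types import SimpleNamespace

# Categories treated as body content in this sketch (an assumption, adjust as needed).
BODY_CATEGORIES = {"NarrativeText", "ListItem"}


def body_text(elements) -> str:
    """Join the text of body elements, skipping titles and other categories."""
    return "\n".join(el.text for el in elements if el.category in BODY_CATEGORIES)


# Stand-in objects with the same attributes as partitioned elements.
docx_like = [
    SimpleNamespace(category="Title", text="Introduction to AI"),
    SimpleNamespace(category="NarrativeText", text="AI systems learn from data."),
    SimpleNamespace(category="ListItem", text="Supervised learning"),
]
print(body_text(docx_like))
```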
An element is an instance of a class like `Title`, `NarrativeText`, or `ListItem`. Serialized, an element looks like this:

    {
      "type": "Title",
      "text": "Introduction to AI",
      "metadata": {
        "page_number": 1
      }
    }
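Once elements are serialized in this shape, downstream code can load them back into typed records. The following is a sketch using only the standard library; `ElementRecord` and `load_records` are hypothetical names, and only the three fields shown above are read (any other fields in real output are ignored).

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ElementRecord:
    type: str
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


def load_records(payload: str) -> list[ElementRecord]:
    """Parse a JSON array of serialized elements into typed records."""
    return [
        ElementRecord(type=item["type"], text=item["text"], metadata=item.get("metadata", {}))
        for item in json.loads(payload)
    ]


payload = '[{"type": "Title", "text": "Introduction to AI", "metadata": {"page_number": 1}}]'
records = load_records(payload)
print(records[0].type, records[0].metadata.get("page_number"))
```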
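Unstructured also integrates with LangChain through a document loader:

    # Newer LangChain releases ship document loaders in langchain_community
    from langchain_community.document_loaders import UnstructuredPDFLoader

    loader = UnstructuredPDFLoader("example.pdf")
    docs = loader.load()
    print(docs[0].page_content)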
You can plug `docs` directly into vector stores (FAISS, Chroma, etc.).
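To show what a vector store does with that text, here is a toy in-memory stand-in. It is not FAISS or Chroma: the word-count "embedding" and the `TinyVectorStore` class are assumptions made purely for illustration, where a real pipeline would use a learned embedding model and one of the stores named above.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (real pipelines use a learned model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """In-memory stand-in for a vector store such as FAISS or Chroma."""

    def __init__(self) -> None:
        self._items: list[tuple[Counter, str]] = []

    def add(self, texts: list[str]) -> None:
        self._items.extend((embed(t), t) for t in texts)

    def search(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = TinyVectorStore()
# In a real pipeline these strings would come from doc.page_content.
store.add(["AI systems learn from data.", "Invoices are due in thirty days."])
print(store.search("how do systems learn"))
```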

Supported file types:
- `.pdf`, `.docx`, `.pptx`, `.html`, `.txt`
- `.eml`, `.msg` (emails)
- `.jpg`, `.png` (via OCR)
- `.csv`, `.md`, `.rst`

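Each of these extensions maps to a format-specific partition function. The lookup below is a sketch of that mapping, with module paths assumed from the layout of the library's `unstructured.partition` package; `partitioner_name` and `partition_path` are hypothetical helpers. In practice the library's auto-detecting `partition` function handles this dispatch for you.

```python
import importlib
from pathlib import Path

# Extension -> (module, function) pairs, assumed from the partition package layout.
PARTITIONERS = {
    ".pdf": ("unstructured.partition.pdf", "partition_pdf"),
    ".docx": ("unstructured.partition.docx", "partition_docx"),
    ".pptx": ("unstructured.partition.pptx", "partition_pptx"),
    ".html": ("unstructured.partition.html", "partition_html"),
    ".txt": ("unstructured.partition.text", "partition_text"),
    ".eml": ("unstructured.partition.email", "partition_email"),
    ".msg": ("unstructured.partition.msg", "partition_msg"),
    ".jpg": ("unstructured.partition.image", "partition_image"),
    ".png": ("unstructured.partition.image", "partition_image"),
    ".csv": ("unstructured.partition.csv", "partition_csv"),
    ".md": ("unstructured.partition.md", "partition_md"),
    ".rst": ("unstructured.partition.rst", "partition_rst"),
}


def partitioner_name(path: str) -> str:
    """Return the dotted name of the partition function for a file path."""
    suffix = Path(path).suffix.lower()
    if suffix not in PARTITIONERS:
        raise ValueError(f"No partitioner registered for {suffix!r}")
    module, func = PARTITIONERS[suffix]
    return f"{module}.{func}"


def partition_path(path: str):
    """Import the matching partition function on demand and run it."""
    module, func = partitioner_name(path).rsplit(".", 1)
    return getattr(importlib.import_module(module), func)(filename=path)


print(partitioner_name("report.PDF"))
```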
Tool | Use Case |
---|---|
LangChain | Document loaders + retrievers |
LlamaIndex | Data ingestion |
Haystack | Document preprocessing |
Chroma/FAISS | Vector store embeddings |
HuggingFace | Data pipelines for model training |

- Official site: https://unstructured.io/
- GitHub: https://github.com/Unstructured-IO/unstructured