Unstructured is an open-source Python library that extracts and preprocesses unstructured data (PDFs, DOCX, HTML, emails, and more) and converts it into structured, machine-readable elements for use in NLP, RAG (Retrieval-Augmented Generation), search, and similar applications.
It is designed to:
- Parse a variety of document formats (PDF, DOCX, HTML, EML, PPTX, etc.)
- Break documents into semantic elements such as Title, NarrativeText, and ListItem
- Output structured formats such as JSON, dict, or text that can be embedded and stored in a vector database for use in LLM pipelines (e.g., LangChain, Haystack)
| Feature | Description |
|---|---|
| 📂 Multi-format support | PDF, DOCX, PPTX, EML, HTML, TXT, images |
| 📑 Semantic chunking | Breaks documents into meaningful chunks (paragraphs, headings, lists, etc.) |
| 🧠 Intelligent type tagging | Tags chunks as Title, ListItem, NarrativeText, etc. |
| 🔄 Export formats | JSON, Markdown, text, dict, etc. |
| 🌐 Web content parsing | Parses HTML from local files, raw strings, or a URL |
pip install "unstructured[all-docs]"
Some file types need additional dependencies, such as the Tesseract OCR engine (used through pytesseract) for images and scanned PDFs.
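Which extras are missing depends on the file types involved. As a rough pre-flight check, the sketch below looks for two command-line tools that are commonly needed for PDF and image processing. The executable names `tesseract` and `pdftoppm` (part of Poppler) and the helper `missing_tools` are assumptions chosen for illustration; they are not part of the Unstructured API, and your platform may use different names.

```python
import shutil

# Assumed executable names: Tesseract for OCR, pdftoppm (Poppler) for PDF rendering.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All checked tools are available.")
```

A check like this only confirms that an executable is present on PATH, not that its version is compatible.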
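from unstructured.partition.pdf import partition_pdf
# Partition the PDF into a list of elements (Title, NarrativeText, ...)
elements = partition_pdf(filename="example.pdf")
for element in elements:
    # Print each element's class alongside its extracted text
    print(type(element), element.text)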
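from unstructured.partition.docx import partition_docx
# Partition the Word document into a list of elements
elements = partition_docx(filename="example.docx")
for el in elements:
    # el.category is the element type as a string, e.g. "Title"
    print(el.category, el.text)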
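The two examples above call format-specific functions. The library also provides a generic `partition` function in `unstructured.partition.auto` that detects the file type for you. The sketch below wraps it and adds a small summary of how many elements of each category came back. The helper names `partition_any` and `count_categories` are hypothetical, introduced only for this example.

```python
from collections import Counter


def partition_any(path):
    """Partition a file, letting the library detect its type."""
    # Imported lazily so count_categories stays usable without the library installed.
    from unstructured.partition.auto import partition

    return partition(filename=path)


def count_categories(elements):
    """Count elements per category, e.g. Title or NarrativeText."""
    return Counter(el.category for el in elements)
```

A call such as `count_categories(partition_any("example.pdf"))` gives a quick view of how a document was segmented, which can help spot files where most text landed in a single category.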
An element is an instance of a class like Title, NarrativeText, or ListItem.
{
  "type": "Title",
  "text": "Introduction to AI",
  "metadata": {
    "page_number": 1
  }
}
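Once elements are serialized in this shape, plain Python is enough to work with them. The sketch below groups a list of element dicts into sections, starting a new section at each `Title`. It is a hand-rolled illustration rather than the library's own chunking utilities; `group_by_title` is a hypothetical helper, and every sample entry after the first is made up for the example.

```python
def group_by_title(elements):
    """Group element dicts into sections, starting a new section at each Title."""
    sections = []
    current = {"title": None, "texts": []}
    for el in elements:
        if el["type"] == "Title":
            # Close the previous section if it holds anything.
            if current["title"] is not None or current["texts"]:
                sections.append(current)
            current = {"title": el["text"], "texts": []}
        else:
            current["texts"].append(el["text"])
    # Flush the final section.
    if current["title"] is not None or current["texts"]:
        sections.append(current)
    return sections


# Illustrative data: only the first entry comes from the example above.
sample = [
    {"type": "Title", "text": "Introduction to AI", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "AI studies intelligent agents.", "metadata": {"page_number": 1}},
    {"type": "ListItem", "text": "Search", "metadata": {"page_number": 1}},
    {"type": "Title", "text": "History", "metadata": {"page_number": 2}},
]
sections = group_by_title(sample)
print([s["title"] for s in sections])
```

Text that appears before the first title is kept in a section whose title is `None`, so no content is silently dropped.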
from langchain_community.document_loaders import UnstructuredPDFLoader
# Load the PDF through Unstructured; load() returns a list of LangChain Document objects
loader = UnstructuredPDFLoader("example.pdf")
docs = loader.load()
print(docs[0].page_content)
You can pass `docs` directly to vector stores such as FAISS or Chroma.
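Before indexing, it can help to normalize the loaded documents into plain records. The sketch below assumes objects with `page_content` and `metadata` attributes, as LangChain `Document` objects have, and produces id/text/metadata dicts while skipping empty entries. The helper `to_records` and the record layout are assumptions for illustration; a real vector store defines its own ingestion format.

```python
import hashlib


def to_records(docs):
    """Turn document-like objects into id/text/metadata records, skipping empty ones."""
    records = []
    for doc in docs:
        text = doc.page_content.strip()
        if not text:
            continue
        records.append(
            {
                # A content hash gives a stable id: the same text always maps to the same id.
                "id": hashlib.sha1(text.encode("utf-8")).hexdigest(),
                "text": text,
                "metadata": dict(doc.metadata),
            }
        )
    return records
```

Deriving the id from the content means that re-ingesting an unchanged document yields the same ids, which makes duplicates easier to detect.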
- .pdf, .docx, .pptx, .html, .txt
- .eml, .msg (emails)
- .jpg, .png (via OCR)
- .csv, .md, .rst
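When ingesting a whole folder, a simple extension filter keeps unsupported files out of the pipeline. The sketch below uses only the extensions listed above; `SUPPORTED`, `is_supported`, and `find_supported` are hypothetical names, and the set is not an exhaustive list of what the library can read.

```python
from pathlib import Path

# Extensions taken from the list above; not an exhaustive list.
SUPPORTED = {
    ".pdf", ".docx", ".pptx", ".html", ".txt",
    ".eml", ".msg",
    ".jpg", ".png",
    ".csv", ".md", ".rst",
}


def is_supported(path):
    """Return True if the file extension is in SUPPORTED (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED


def find_supported(root):
    """Recursively collect supported files under `root`, sorted by path."""
    return sorted(p for p in Path(root).rglob("*") if p.is_file() and is_supported(p))
```

Matching on the extension is a convenience only; it says nothing about whether the file contents are valid for that format.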
| Tool | Use Case |
|---|---|
| LangChain | Document loaders + retrievers |
| LlamaIndex | Data ingestion |
| Haystack | Document preprocessing |
| Chroma/FAISS | Vector store embeddings |
| HuggingFace | Data pipelines for model training |
Official site: https://unstructured.io/
GitHub: https://github.com/Unstructured-IO/unstructured