
Unstructured

GitHub Repo · Pretty sure · Freemium model is honest
https://github.com/Unstructured-IO/unstructured

Document parsing library that actually works on PDFs instead of just talking about them. The paid tier exists because the open-source version gets the job done.

Slop 25% · Signal 60% · Science 15%

Unstructured solves a real, painful problem: turning documents into machine-readable text for RAG pipelines. The library extracts tables, preserves layout structure, and handles multiple formats (PDF, HTML, DOCX, images). There is no novel ML here: it's OCR plus heuristic layout parsing plus chunking utilities. The README leads with a hard sell for the Enterprise platform (chunking, embeddings, UI), which is a standard SaaS upsell, but the open-source tier is genuinely useful and doesn't pretend...
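To make "heuristic chunking" concrete, here is a minimal standard-library sketch of the general idea: split a list of parsed elements into chunks at each title, or when a size budget would be exceeded. The `Element` class, the category names, and `chunk_by_title_sketch` are hypothetical names invented for this illustration; this is not the library's implementation or API.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical parsed element: a category label plus its text."""
    category: str  # assumed labels, e.g. "Title", "NarrativeText", "Table"
    text: str


def chunk_by_title_sketch(elements, max_chars=500):
    """Group elements into text chunks.

    A new chunk starts at every "Title" element, or when adding the next
    element would push the current chunk past max_chars.
    """
    chunks, current, size = [], [], 0
    for el in elements:
        starts_section = el.category == "Title"
        too_big = size + len(el.text) > max_chars
        if current and (starts_section or too_big):
            chunks.append("\n".join(e.text for e in current))
            current, size = [], 0
        current.append(el)
        size += len(el.text)
    if current:
        chunks.append("\n".join(e.text for e in current))
    return chunks


# Toy input standing in for the output of a document parser.
elements = [
    Element("Title", "Introduction"),
    Element("NarrativeText", "Parsing PDFs is hard."),
    Element("Title", "Method"),
    Element("NarrativeText", "Run OCR, then group text blocks."),
    Element("Table", "col_a | col_b"),
]
chunks = chunk_by_title_sketch(elements)
```

Splitting on titles keeps each section's text and tables together, which tends to matter more for retrieval quality than the exact character budget.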

14,232 stars · HTML · 2026-03-04 · 1265 days old
