kreuzberg
Active · ★ 8.4k · MIT · Updated 2026-05-29
★ Trending · ★ Essential
A polyglot document intelligence framework with a Rust core. Extract text, metadata, and structured information from PDFs, Office documents, images, and 50+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, and TypeScript (Node/Bun/Wasm/Deno), or usable via the CLI, REST API, or MCP server.
Kreuzberg is a high-performance, polyglot library that extracts text and metadata from 57+ file formats and includes OCR. Its Rust core provides native-speed processing and memory efficiency, and it can generate embeddings without a GPU.
#Document Processing #Data Extraction #OCR #Multi-language #Embeddings #Coding #Data Analysis #Image Generation
01
Features
01 Extensible architecture with a plugin system for custom backends and processors.
02 Polyglot support with native bindings for 10+ programming languages.
03 Support for 57+ file formats across 8 categories, including Office, PDF, and images.
04 OCR with multiple backends and table detection.
05 High performance from a Rust core, SIMD optimizations, and full parallelism.
02
Compatibility
Rust · Core Library · Verified via docs
Python · Language Binding · Verified via docs
Elixir · Language Binding · Verified via docs
Node.js · Language Binding · Verified via docs
WASM · WebAssembly Support · Verified via docs
Java · Language Binding · Verified via docs
03
Use cases
↳Automated extraction of text, metadata, and structured data from diverse document types.
↳Building intelligent document processing pipelines for data ingestion and analysis.
↳Enabling efficient search and retrieval systems for unstructured and semi-structured content.
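The use cases above (extract, ingest, then search) can be sketched end to end in a few lines of Python. This is a minimal illustration under stated assumptions, not Kreuzberg's API: `ExtractedDocument`, `extract_stub`, `InvertedIndex`, and `ingest` are hypothetical names invented for this example, and `extract_stub` simply decodes UTF-8 bytes where a real pipeline would call an extraction library.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ExtractedDocument:
    """Hypothetical container for extractor output (not a Kreuzberg type)."""
    source: str
    text: str
    metadata: dict = field(default_factory=dict)


def extract_stub(source: str, raw: bytes) -> ExtractedDocument:
    """Stand-in extractor: decodes bytes as UTF-8.

    Assumption: a real pipeline would call an extraction library here
    to handle PDFs, Office files, images, and so on.
    """
    text = raw.decode("utf-8", errors="replace")
    return ExtractedDocument(source=source, text=text, metadata={"bytes": len(raw)})


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps each token to the set of document sources that contain it."""

    def __init__(self) -> None:
        self._postings: dict[str, set[str]] = defaultdict(set)

    def add(self, doc: ExtractedDocument) -> None:
        for token in tokenize(doc.text):
            self._postings[token].add(doc.source)

    def search(self, query: str) -> set[str]:
        """Return the sources that contain every token in the query."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        # .get avoids creating empty entries for unknown tokens.
        matches = [self._postings.get(token, set()) for token in tokens]
        return set.intersection(*matches)


def ingest(files: dict[str, bytes]) -> InvertedIndex:
    """Extract every file and add the result to a fresh index."""
    index = InvertedIndex()
    for source, raw in files.items():
        index.add(extract_stub(source, raw))
    return index
```

Calling `ingest({"a.txt": b"Quarterly revenue report"})` and then `search("revenue")` on the result returns the set of matching sources. Keeping the extractor behind a single function means the stub could be swapped for a real extraction call without touching the indexing or search code.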
04
Alternatives
ragflow · ★ 81.5k
RAGFlow is a leading open-source Retrieval-Augmented Generation (RAG) engine that fuses cutting-edge RAG with Agent capabilities to create a superior context layer for LLMs.
n8n · ★ 190.2k
Fair-code workflow automation platform with native AI capabilities. Combine visual building with custom code, self-host or cloud, 400+ integrations.
Context7 · ★ 56.4k
MCP Server that provides up-to-date code documentation for LLMs and AI code editors.
GitHub MCP Server · ★ 30.3k
GitHub's official MCP Server. Allows AI agents to interact directly with your GitHub repositories (read files, search code, issues).
Microsoft AutoGen · ★ 58.5k
A framework that enables the development of LLM applications using multiple agents that can converse with each other to solve tasks.