Omnivore
A high-performance web crawler and content extraction framework written in Rust.
Features
✅ Implemented
- Parallel web crawling with configurable concurrency
- HTML content extraction using CSS selectors and rules
- Git repository analysis with intelligent code extraction
- Metadata extraction (OpenGraph, Twitter Cards, JSON-LD)
- Politeness controls with rate limiting and delays
- RocksDB storage for crawled content
- REST API for programmatic access
- CLI with progress tracking
- TOML configuration for flexible setup
- Smart code filtering for repository analysis
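To make the concurrency and politeness items above concrete, here is a minimal sketch using only the Rust standard library. It is not Omnivore's actual implementation: `Politeness`, `reserve`, and `crawl_parallel` are hypothetical names chosen for illustration, and the `fetch` closure stands in for a real HTTP request.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical per-host rate limiter: remembers when each host may be hit next.
pub struct Politeness {
    min_delay: Duration,
    next_allowed: HashMap<String, Instant>,
}

impl Politeness {
    pub fn new(min_delay: Duration) -> Self {
        Self { min_delay, next_allowed: HashMap::new() }
    }

    /// Reserves the next request slot for `host` and returns how long the
    /// caller should wait before sending the request.
    pub fn reserve(&mut self, host: &str, now: Instant) -> Duration {
        // Use the stored slot if it is still in the future, otherwise go now.
        let slot = match self.next_allowed.get(host) {
            Some(&t) if t > now => t,
            _ => now,
        };
        self.next_allowed.insert(host.to_string(), slot + self.min_delay);
        slot.duration_since(now)
    }
}

/// Hypothetical worker pool: runs `fetch` over `urls` on at most `workers`
/// threads and returns (url, result) pairs in completion order.
pub fn crawl_parallel<F>(urls: Vec<String>, workers: usize, fetch: F) -> Vec<(String, usize)>
where
    F: Fn(&str) -> usize + Sync,
{
    let queue = Mutex::new(VecDeque::from(urls));
    let results = Mutex::new(Vec::new());
    thread::scope(|s| {
        for _ in 0..workers.max(1) {
            s.spawn(|| loop {
                // The lock is released at the end of this statement,
                // so fetching happens outside the critical section.
                let next = queue.lock().unwrap().pop_front();
                let Some(url) = next else { break };
                let size = fetch(&url);
                results.lock().unwrap().push((url, size));
            });
        }
    });
    results.into_inner().unwrap()
}
```

In this sketch a worker would call `reserve` with the URL's host, sleep for the returned duration, and then fetch. Because results arrive in completion order, callers that need a stable order should sort them.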
🚧 In Development
- Knowledge graph construction
- Advanced browser automation
- Vector embeddings and search
- Entity and relation extraction
- Robots.txt rule parsing
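As an illustration of what robots.txt rule parsing involves, the sketch below collects `Allow` and `Disallow` prefixes from the `User-agent: *` groups and applies the longest-match rule. It is a simplified, assumption-laden example and not Omnivore's planned implementation: `RobotsRules` is a hypothetical type, and wildcard (`*`) and end-of-path (`$`) patterns are deliberately not handled.

```rust
/// Hypothetical rule set built from the `User-agent: *` groups of a robots.txt file.
pub struct RobotsRules {
    /// (allowed, path prefix) pairs in file order.
    rules: Vec<(bool, String)>,
}

impl RobotsRules {
    pub fn parse(body: &str) -> Self {
        let mut rules = Vec::new();
        let mut in_star_group = false;
        let mut last_was_agent = false;
        for raw in body.lines() {
            // Drop trailing comments and surrounding whitespace.
            let line = raw.split('#').next().unwrap_or("").trim();
            let Some((key, value)) = line.split_once(':') else { continue };
            let (key, value) = (key.trim().to_ascii_lowercase(), value.trim());
            match key.as_str() {
                "user-agent" => {
                    // Consecutive User-agent lines belong to the same group.
                    if !last_was_agent {
                        in_star_group = false;
                    }
                    if value == "*" {
                        in_star_group = true;
                    }
                    last_was_agent = true;
                }
                "allow" | "disallow" => {
                    last_was_agent = false;
                    // An empty value places no restriction, so it is skipped.
                    if in_star_group && !value.is_empty() {
                        rules.push((key == "allow", value.to_string()));
                    }
                }
                _ => {} // Other fields (e.g. Sitemap) are ignored in this sketch.
            }
        }
        Self { rules }
    }

    /// Longest matching prefix wins; on a tie, Allow wins; no match means allowed.
    pub fn is_allowed(&self, path: &str) -> bool {
        self.rules
            .iter()
            .filter(|(_, prefix)| path.starts_with(prefix.as_str()))
            .max_by_key(|(allow, prefix)| (prefix.len(), *allow))
            .map(|(allow, _)| *allow)
            .unwrap_or(true)
    }
}
```

A crawler would typically fetch `/robots.txt` once per host, cache the parsed rules, and call `is_allowed` before enqueueing each URL.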
Quick Links
- Installation - Get Omnivore running on your system
- Quickstart - Start crawling in 5 minutes
- CLI Reference - Command-line interface documentation
- Git Command - Extract and analyze code from repositories
- API Documentation - REST API endpoints
- Configuration - Customize crawler behavior
Use Cases
Omnivore is ideal for:
- Web scraping - Extract structured data from websites
- Code analysis - Extract and analyze source code from Git repositories
- Content archival - Save website content locally
- Data mining - Collect data for analysis
- Site monitoring - Track changes over time
- Research - Gather information systematically
- Codebase documentation - Generate reports of repository structure
Architecture Overview
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│     CLI     │────▶│    Core     │────▶│   Storage   │
└─────────────┘     │   Crawler   │     │  (RocksDB)  │
                    └─────────────┘     └─────────────┘
┌─────────────┐            │
│  REST API   │────────────┘
└─────────────┘
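The layering in the diagram can be sketched in Rust as follows. This is only an illustration of the shape, under stated assumptions: the `Storage` trait, `MemoryStorage`, `Crawler`, `cli_crawl`, and `api_get_page` are hypothetical names, and an in-memory map stands in for RocksDB so the example needs no external crates.

```rust
use std::collections::HashMap;

/// Storage boundary; in the diagram RocksDB sits behind this layer.
pub trait Storage {
    fn put(&mut self, url: &str, body: &str);
    fn get(&self, url: &str) -> Option<String>;
}

/// In-memory stand-in for the RocksDB-backed store (assumption for this sketch).
#[derive(Default)]
pub struct MemoryStorage {
    pages: HashMap<String, String>,
}

impl Storage for MemoryStorage {
    fn put(&mut self, url: &str, body: &str) {
        self.pages.insert(url.to_string(), body.to_string());
    }
    fn get(&self, url: &str) -> Option<String> {
        self.pages.get(url).cloned()
    }
}

/// Core crawler: the only component that talks to storage directly.
pub struct Crawler<S: Storage> {
    storage: S,
}

impl<S: Storage> Crawler<S> {
    pub fn new(storage: S) -> Self {
        Self { storage }
    }
    pub fn store_page(&mut self, url: &str, body: &str) {
        self.storage.put(url, body);
    }
    pub fn lookup(&self, url: &str) -> Option<String> {
        self.storage.get(url)
    }
}

/// Hypothetical CLI front end: forwards to the core and formats a status line.
pub fn cli_crawl<S: Storage>(core: &mut Crawler<S>, url: &str, body: &str) -> String {
    core.store_page(url, body);
    format!("stored {url} ({} bytes)", body.len())
}

/// Hypothetical REST handler: returns an HTTP-style status code and body.
pub fn api_get_page<S: Storage>(core: &Crawler<S>, url: &str) -> (u16, String) {
    match core.lookup(url) {
        Some(body) => (200, body),
        None => (404, String::new()),
    }
}
```

Keeping both front ends behind the same core type means a storage backend can be swapped without touching the CLI or API code.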
Getting Started
- Install Omnivore using Cargo (see Installation for the exact command).
- Run your first crawl (see Quickstart).
- Start the API server (see API Documentation).
Project Status
Omnivore is actively developed with a focus on stability and performance. The core crawling functionality is production-ready, while advanced features like knowledge graphs are under development.
See our GitHub repository for the latest updates.