Omnivore

A high-performance web crawler and content extraction framework written in Rust.

Features

✅ Implemented

  • Parallel web crawling with configurable concurrency
  • HTML content extraction using CSS selectors and rules
  • Git repository analysis with intelligent code extraction
  • Metadata extraction (OpenGraph, Twitter Cards, JSON-LD)
  • Politeness controls with rate limiting and delays
  • RocksDB storage for crawled content
  • REST API for programmatic access
  • CLI with progress tracking
  • TOML configuration for flexible setup
  • Smart code filtering for repository analysis
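
To make the politeness item above concrete, here is a minimal sketch of per-host request spacing using only the standard library. The `Politeness` type and its `reserve` method are hypothetical names invented for this illustration; they are not taken from Omnivore's code, and its actual rate limiter may work differently.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical helper (not Omnivore's API): remembers, per host, the
/// earliest instant at which the next request may be sent.
pub struct Politeness {
    min_delay: Duration,
    next_allowed: HashMap<String, Instant>,
}

impl Politeness {
    pub fn new(min_delay: Duration) -> Self {
        Self { min_delay, next_allowed: HashMap::new() }
    }

    /// Returns how long the caller should wait before contacting `host`
    /// at time `now`, and reserves that slot so the following request to
    /// the same host is pushed `min_delay` later.
    pub fn reserve(&mut self, host: &str, now: Instant) -> Duration {
        // Use the stored slot if it is still in the future, else "now".
        let slot = match self.next_allowed.get(host) {
            Some(&t) if t > now => t,
            _ => now,
        };
        self.next_allowed.insert(host.to_string(), slot + self.min_delay);
        slot.duration_since(now)
    }
}
```

A worker would call `reserve` with the URL's host, sleep for the returned duration, and then send the request. Hosts are tracked independently, so a slow host does not hold up requests to others.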

🚧 In Development

  • Knowledge graph construction
  • Advanced browser automation
  • Vector embeddings and search
  • Entity and relation extraction
  • Robots.txt rule parsing

Use Cases

Omnivore is ideal for:

  • Web scraping: extract structured data from websites
  • Code analysis: extract and analyze source code from Git repositories
  • Content archival: save website content locally
  • Data mining: collect data for analysis
  • Site monitoring: track changes over time
  • Research: gather information systematically
  • Codebase documentation: generate reports of repository structure

Architecture Overview

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│     CLI     │────▶│    Core     │────▶│   Storage   │
└─────────────┘     │   Crawler   │     │  (RocksDB)  │
                    └─────────────┘     └─────────────┘
┌─────────────┐            │
│   REST API  │────────────┘
└─────────────┘
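
The diagram shows two front ends sharing one core crawler, which writes to a storage layer. The sketch below shows one way that layering could look in Rust. Every name in it (`Storage`, `MemoryStorage`, `Crawler`, `cli_crawl`, `api_crawl`) is an assumption made for illustration, and an in-memory map stands in for RocksDB so the example needs no external crates.

```rust
use std::collections::HashMap;

/// Storage boundary; in the diagram this box is backed by RocksDB.
pub trait Storage {
    fn put(&mut self, url: &str, body: &str);
    fn get(&self, url: &str) -> Option<&str>;
}

/// In-memory stand-in for the real storage backend (illustration only).
#[derive(Default)]
pub struct MemoryStorage {
    pages: HashMap<String, String>,
}

impl Storage for MemoryStorage {
    fn put(&mut self, url: &str, body: &str) {
        self.pages.insert(url.to_string(), body.to_string());
    }
    fn get(&self, url: &str) -> Option<&str> {
        self.pages.get(url).map(String::as_str)
    }
}

/// Core crawler: the single component both front ends call into.
pub struct Crawler<S: Storage> {
    storage: S,
}

impl<S: Storage> Crawler<S> {
    pub fn new(storage: S) -> Self {
        Self { storage }
    }

    /// Placeholder for fetch + extract: stores a stub body for `seed`
    /// and returns the number of pages written.
    pub fn crawl(&mut self, seed: &str) -> usize {
        self.storage.put(seed, "<html>stub</html>");
        1
    }

    pub fn storage(&self) -> &S {
        &self.storage
    }
}

/// Front end 1: roughly what a CLI subcommand handler might do.
pub fn cli_crawl<S: Storage>(crawler: &mut Crawler<S>, url: &str) -> String {
    let pages = crawler.crawl(url);
    format!("crawled {} page(s) from {}", pages, url)
}

/// Front end 2: roughly what a REST handler might return.
/// The URL is not JSON-escaped; this is an illustration only.
pub fn api_crawl<S: Storage>(crawler: &mut Crawler<S>, url: &str) -> String {
    let pages = crawler.crawl(url);
    format!("{{\"url\":\"{}\",\"pages\":{}}}", url, pages)
}
```

Keeping storage behind a trait means the crawler can be tested against `MemoryStorage` while a production build plugs in a persistent backend.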

Getting Started

  1. Install Omnivore using Cargo:

    cargo install omnivore
    

  2. Run your first crawl:

    omnivore crawl https://example.com --depth 2
    

  3. Start the API server:

    omnivore-api
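
The `--depth 2` flag in step 2 is not defined on this page. Assuming it means "follow links at most two hops from the start URL", a depth-limited breadth-first crawl can be sketched as follows. The function `crawl_to_depth` and the in-memory `links` map, which stands in for fetching and parsing pages, are hypothetical and only illustrate the idea.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Breadth-first traversal from `seed`, visiting pages at most
/// `max_depth` hops away. `links` maps a page to the pages it links to
/// and replaces real fetching and parsing in this sketch.
pub fn crawl_to_depth(
    links: &HashMap<&str, Vec<&str>>,
    seed: &str,
    max_depth: usize,
) -> Vec<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();
    let mut order = Vec::new();

    seen.insert(seed.to_string());
    queue.push_back((seed.to_string(), 0));

    while let Some((url, depth)) = queue.pop_front() {
        order.push(url.clone());
        if depth == max_depth {
            continue; // visit this page but do not follow its links
        }
        for &next in links.get(url.as_str()).into_iter().flatten() {
            // `insert` returns false for pages already queued or visited.
            if seen.insert(next.to_string()) {
                queue.push_back((next.to_string(), depth + 1));
            }
        }
    }
    order
}
```

Under this assumed meaning, a depth of 0 visits only the start page, and each additional level adds the pages newly reachable by one more link.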
    

Project Status

Omnivore is actively developed with a focus on stability and performance. The core crawling functionality is production-ready, while advanced features like knowledge graphs are under development.

See our GitHub repository for the latest updates.