Quickstart Guide¶
Get up and running with Omnivore in minutes. This guide covers the essential commands and common use cases to help you start crawling and analyzing content immediately.
Table of Contents¶
- Initial Setup
- Basic Web Crawling
- Git Repository Analysis
- AI-Powered Extraction
- Browser Mode
- Output Formats
- Common Workflows
- Tips and Best Practices
Initial Setup¶
Step 1: Install Omnivore¶
# Quick install (Linux/macOS)
curl -sSfL https://raw.githubusercontent.com/Pranav-Karra-3301/omnivore/master/install.sh | sh
# Or via Homebrew
brew tap Pranav-Karra-3301/omnivore && brew install omnivore
# Or from source
git clone https://github.com/Pranav-Karra-3301/omnivore.git
cd omnivore
cargo build --release
make install
Step 2: Run Setup Wizard¶
The setup wizard will guide you through: - OpenAI API Key: For AI-powered extraction features - Default Settings: Workers, depth, delays - Output Preferences: Default formats and directories - Browser Configuration: ChromeDriver setup for JavaScript rendering
Step 3: Verify Installation¶
# Check version
omnivore --version
# View available commands
omnivore --help
# Test with a simple crawl
omnivore crawl https://example.com --depth 1
Basic Web Crawling¶
Simple Crawl¶
# Crawl a website with default settings
omnivore crawl https://example.com
# Output:
# πΈοΈ Omnivore Web Crawler
# Starting crawl from: https://example.com
# Configuration:
# Workers: 10
# Max depth: 5
# ...
# β
Successfully crawled 42 pages
Crawl with Custom Settings¶
# Shallow crawl with more workers
omnivore crawl https://news.ycombinator.com \
--workers 20 \
--depth 2 \
--output hn-crawl.json
# Deep crawl with rate limiting
omnivore crawl https://blog.example.com \
--depth 10 \
--delay 500 \
--respect-robots
Organized Output with Tables¶
# Extract tables and organize output
omnivore crawl https://en.wikipedia.org/wiki/List_of_countries_by_population \
--extract-tables \
--organize \
--output population-data
# Creates:
# population-data/
# βββ index.json # Crawl metadata
# βββ page_0001.json # Page content
# βββ page_0002.json # ...
# βββ tables/
# βββ page_0001_Population_by_country.csv
# βββ page_0001_Historical_data.csv
Multiple Output Formats¶
# JSON output (default)
omnivore crawl https://example.com --output data.json
# Markdown format
omnivore crawl https://example.com --format markdown --output report.md
# CSV format (best for structured data)
omnivore crawl https://example.com --format csv --output data.csv
# Plain text
omnivore crawl https://example.com --format text --output content.txt
# YAML format
omnivore crawl https://example.com --format yaml --output data.yaml
Git Repository Analysis¶
Basic Repository Analysis¶
# Analyze a GitHub repository
omnivore git https://github.com/rust-lang/rust-clippy --output clippy-analysis.txt
# Output includes:
# - Project type detection (CLI tool, Web app, Library, etc.)
# - Language and framework identification
# - Organized code structure
# - Full source code with line numbers
Filter Specific Files¶
# Extract only Rust source files
omnivore git https://github.com/rust-lang/cargo \
--only "*.rs,*.toml" \
--output cargo-rust.txt
# Exclude test files
omnivore git ./my-project \
--exclude "**/tests/**,**/*.test.js" \
--output src-only.txt
# Include normally excluded directories
omnivore git ./my-project \
--include "**/node_modules/express/**" \
--output with-deps.txt
Analyze Local Repository¶
# Current directory
omnivore git . --output this-project.txt
# Specific local path
omnivore git ~/projects/my-app --output my-app-analysis.txt
# With JSON output for processing
omnivore git . --json --output codebase.json
AI-Powered Extraction¶
Setup AI Features¶
Natural Language Queries¶
# Extract specific information using AI
omnivore crawl https://shop.example.com \
--ai "Extract all product names, prices, and availability"
# Extract structured data from news sites
omnivore crawl https://news.site.com \
--ai "Get article titles, authors, publication dates, and summaries"
# Academic data extraction
omnivore crawl https://university.edu/courses \
--ai "Extract course codes, titles, credits, prerequisites, and instructor names"
Auto Mode - Intelligent Detection¶
# Automatically detect and extract everything
omnivore crawl https://example.com --auto
# Auto mode detects:
# - Tables (exports as CSV)
# - Forms and input fields
# - Dropdowns and filters
# - Pagination
# - Contact information
# - Downloadable files
# - Media content
# - Structured data (JSON-LD, microdata)
Using Templates¶
# E-commerce template
omnivore crawl https://shop.com --template ecommerce
# Extracts: products, prices, reviews, categories
# News template
omnivore crawl https://news.com --template news
# Extracts: articles, authors, dates, categories
# Academic template
omnivore crawl https://university.edu --template academic
# Extracts: courses, faculty, departments, research
Browser Mode¶
Enable JavaScript Rendering¶
# First, ensure ChromeDriver is running
chromedriver --port=9515
# Crawl JavaScript-heavy site
omnivore crawl https://spa-app.com --browser
# With custom wait time for dynamic content
omnivore crawl https://slow-app.com \
--browser \
--wait 3000 # Wait 3 seconds for content to load
Interactive Mode¶
# Interact with dropdowns and filters
omnivore crawl https://data-portal.com \
--browser \
--interact \
--extract-tables
# This will:
# 1. Load the page with JavaScript
# 2. Click on all dropdowns to reveal options
# 3. Try different filter combinations
# 4. Extract data from each variation
# 5. Export all tables found
Output Formats¶
JSON Output (Default)¶
{
"stats": {
"total_urls": 50,
"successful": 48,
"failed": 2,
"elapsed_time": "45.3s"
},
"results": [
{
"url": "https://example.com",
"title": "Example Domain",
"content": "...",
"links": ["..."],
"metadata": {}
}
]
}
Markdown Output¶
# Crawl Results: https://docs.example.com
**Date:** 2024-01-15
**Pages:** 25 | **Words:** 15,420
## Page 1: Getting Started
**URL:** https://docs.example.com/getting-started
...
Organized Directory Structure¶
crawl-output/
βββ index.json # Metadata and statistics
βββ page_0001.json # Individual page data
βββ page_0002.json
βββ tables/ # Extracted tables as CSV
βββ table_001.csv
βββ table_002.csv
Common Workflows¶
Website Backup¶
# Complete website backup with media
omnivore crawl https://my-site.com \
--depth 20 \
--organize \
--include-raw \
--zip \
--output backup-$(date +%Y%m%d).zip
Data Mining¶
# Extract all tables from a data portal
omnivore crawl https://data.gov/statistics \
--extract-tables \
--format csv \
--organize \
--output gov-data/
# Mine product data from e-commerce
omnivore crawl https://shop.com/products \
--ai "Extract product name, price, SKU, stock status" \
--format json \
--output products.json
Documentation Scraping¶
# Scrape technical documentation
omnivore crawl https://docs.framework.com \
--depth 10 \
--format markdown \
--exclude-urls "*/api/*" \
--output framework-docs.md
Competitive Analysis¶
# Analyze competitor websites
for site in competitor1.com competitor2.com competitor3.com; do
omnivore crawl https://$site \
--depth 3 \
--ai "Extract products, pricing, features, testimonials" \
--output "analysis-$site.json"
done
Tips and Best Practices¶
Performance Optimization¶
# For large sites, use shallow depth first
omnivore crawl https://huge-site.com --depth 2 --output preview.json
# Use more workers for faster crawling (if site allows)
omnivore crawl https://fast-site.com --workers 50
# But be polite with rate limiting
omnivore crawl https://small-blog.com --workers 2 --delay 1000
Respectful Crawling¶
# Always respect robots.txt
omnivore crawl https://example.com --respect-robots
# Use appropriate delays
omnivore crawl https://example.com --delay 500
# Set a custom user agent
omnivore crawl https://example.com \
--user-agent "MyBot/1.0 (contact@example.com)"
Error Handling¶
# Verbose mode for debugging
omnivore crawl https://problem-site.com \
--verbose \
--output debug.json
# Save error log
omnivore crawl https://example.com \
--output data.json \
2> errors.log
Next Steps¶
Now that you're familiar with the basics:
- Explore Advanced Features:
- Read about Browser Mode for JavaScript sites
- Learn about AI Integration for intelligent extraction
-
Check Git Command for code repository analysis
-
Customize Your Setup:
- Configure defaults in Configuration
- Set up API keys for AI features
-
Create custom templates
-
Scale Your Operations:
- Use the REST API for programmatic access
- Deploy with Docker
-
Set up monitoring and logging
-
Get Help:
- Run
omnivore help <command>for detailed command help - Check Troubleshooting for common issues
- Visit the GitHub repository for updates