Omnivore Crawl Command

The omnivore crawl command is Omnivore's core web crawler. It extracts content from websites and supports AI-powered extraction, browser rendering, and multiple output formats.

Basic Usage

omnivore crawl <URL> [OPTIONS]

Simple Examples

# Basic crawl with defaults
omnivore crawl https://example.com

# Save results to file
omnivore crawl https://example.com --output results.json

# Limit depth and workers
omnivore crawl https://example.com --depth 3 --workers 5

Command Options

Core Options

| Option | Short | Default | Description |
| --- | --- | --- | --- |
| --workers | -w | 10 | Number of concurrent workers (1-100) |
| --depth | -d | 5 | Maximum crawl depth (1-20) |
| --output | -o | - | Output file path |
| --format | - | json | Output format (json, markdown, csv, yaml, text) |
| --verbose | -v | false | Enable verbose logging |

Crawling Control

| Option | Default | Description |
| --- | --- | --- |
| --delay | 100 | Delay between requests in milliseconds |
| --respect-robots | false | Honor robots.txt directives |
| --user-agent | Omnivore/X.X | Custom User-Agent string |
| --timeout | 30000 | Request timeout in milliseconds |
| --max-retries | 3 | Maximum retry attempts for failed requests |
| --follow-redirects | true | Follow HTTP redirects |
| --max-redirects | 10 | Maximum number of redirects to follow |

Content Filtering

| Option | Description |
| --- | --- |
| --include-urls | URL patterns to include (comma-separated) |
| --exclude-urls | URL patterns to exclude (comma-separated) |
| --content-types | Content types to accept (default: text/html) |
| --max-page-size | Maximum page size in bytes |
| --min-content-length | Minimum content length to save |

Output Options

| Option | Description |
| --- | --- |
| --organize | Create organized folder structure |
| --extract-tables | Extract HTML tables as CSV files |
| --include-raw | Include raw HTML in output |
| --exclude-urls | Don't include URLs/links in output |
| --zip | Compress output to ZIP file |
| --stdout | Output to stdout instead of file |

AI-Powered Features

Natural Language Extraction

Use the --ai flag with a natural language query to extract specific information:

# Extract product information
omnivore crawl https://shop.com \
  --ai "Extract product names, prices, descriptions, and availability"

# Extract article metadata
omnivore crawl https://news.site.com \
  --ai "Get article titles, authors, publication dates, and summaries"

# Extract contact information
omnivore crawl https://company.com \
  --ai "Find all email addresses, phone numbers, and physical addresses"

Auto Mode

The --auto flag enables intelligent automatic extraction:

omnivore crawl https://example.com --auto

Auto mode automatically detects and extracts:

  • Tables: Converted to CSV files
  • Forms: Input fields and their attributes
  • Dropdowns: All options and values
  • Pagination: Next/previous links
  • Contact Info: Emails, phones, addresses
  • Downloads: Links to downloadable files
  • Media: Images and videos with metadata
  • Structured Data: JSON-LD, Microdata, OpenGraph

Templates

Use pre-configured templates for common website types:

# E-commerce sites
omnivore crawl https://shop.com --template ecommerce

# News and blogs
omnivore crawl https://news.com --template news

# Academic sites
omnivore crawl https://university.edu --template academic

# Real estate
omnivore crawl https://realestate.com --template realestate

# Job boards
omnivore crawl https://jobs.com --template jobs

# Social media
omnivore crawl https://social.com --template social

Available Templates

| Template | Extracts |
| --- | --- |
| ecommerce | Products, prices, reviews, categories, stock status |
| news | Articles, authors, dates, categories, tags |
| academic | Courses, faculty, departments, research papers |
| realestate | Listings, prices, features, locations, agents |
| jobs | Job titles, companies, salaries, requirements |
| social | Posts, users, comments, likes, shares |
| forum | Threads, posts, users, timestamps |
| documentation | Sections, code examples, API references |

Browser Mode

Enable JavaScript rendering for dynamic websites:

Basic Browser Mode

# Start ChromeDriver first (in a separate terminal, or append & to run it in the background)
chromedriver --port=9515

# Crawl with browser
omnivore crawl https://spa-app.com --browser
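
Because browser mode depends on ChromeDriver already listening, a script may want to wait for it before starting the crawl. The sketch below polls the WebDriver /status endpoint with curl. The function name wait_for_chromedriver, the retry count, and the assumption that Omnivore connects to ChromeDriver on localhost port 9515 are illustrative, not documented behavior.

```shell
#!/usr/bin/env bash
# Poll a WebDriver server's /status endpoint until it answers or attempts run out.
# Usage: wait_for_chromedriver [port] [attempts] [pause_seconds]
# (hypothetical helper, not part of Omnivore)
wait_for_chromedriver() {
  local port="${1:-9515}" attempts="${2:-10}" pause="${3:-1}"
  local i
  for ((i = 1; i <= attempts; i++)); do
    # -f: fail on HTTP errors, -s: silent, --max-time: do not hang on a dead port
    if curl -fs --max-time 2 "http://127.0.0.1:${port}/status" >/dev/null 2>&1; then
      return 0
    fi
    # Skip the pause after the final attempt.
    if ((i < attempts)); then
      sleep "$pause"
    fi
  done
  return 1
}

# Example (not run here):
#   chromedriver --port=9515 &
#   wait_for_chromedriver 9515 && omnivore crawl https://spa-app.com --browser
```

The function returns a non-zero status if nothing answers, so the crawl on the right-hand side of && is skipped rather than started against a missing driver.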

Browser Options

| Option | Description |
| --- | --- |
| --browser | Enable browser mode with JavaScript rendering |
| --interact | Interact with dropdowns and dynamic elements |
| --wait | Wait time in ms for content to load (default: 2000) |
| --screenshot | Take screenshots of each page |
| --scroll | Scroll to load lazy content |
| --max-scroll | Maximum scroll attempts for infinite scroll |

Interactive Mode

# Interact with all dynamic elements
omnivore crawl https://data-portal.com \
  --browser \
  --interact \
  --extract-tables

# Handle infinite scroll
omnivore crawl https://feed.com \
  --browser \
  --scroll \
  --max-scroll 20

Output Formats

JSON (Default)

omnivore crawl https://example.com --format json --output data.json

Output structure:

{
  "crawler": "omnivore",
  "version": "0.4.0",
  "start_url": "https://example.com",
  "timestamp": "2024-01-15T10:30:00Z",
  "stats": {
    "total_urls": 100,
    "successful": 95,
    "failed": 5,
    "elapsed_time": "120.5s"
  },
  "results": [
    {
      "url": "https://example.com",
      "status_code": 200,
      "title": "Example Domain",
      "content": "...",
      "cleaned_content": {
        "title": "Example Domain",
        "content": "Main text content...",
        "word_count": 500,
        "links": ["..."],
        "tables": []
      },
      "metadata": {
        "description": "...",
        "keywords": ["..."],
        "og:title": "..."
      },
      "extracted_data": {}
    }
  ]
}
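
The structure above can be queried with jq once a crawl has finished. The following is a minimal sketch that relies only on the field names shown in the sample output (stats.successful, stats.total_urls, results[].url, results[].status_code). The function names are hypothetical, and treating any non-2xx status as a failure is an assumption.

```shell
#!/usr/bin/env bash
# List the URLs of results whose status_code is outside the 2xx range.
failed_urls() {
  jq -r '.results[] | select(.status_code < 200 or .status_code >= 300) | .url' "$1"
}

# Print "successful/total" from the stats block.
success_ratio() {
  jq -r '"\(.stats.successful)/\(.stats.total_urls)"' "$1"
}

# Example (not run here):
#   failed_urls data.json
#   success_ratio data.json
```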

Markdown

omnivore crawl https://docs.site.com --format markdown --output docs.md

CSV

omnivore crawl https://data.site.com --format csv --output data.csv

CSV columns:

  • url
  • title
  • content
  • word_count
  • links (comma-separated)
  • status_code
  • crawled_at

YAML

omnivore crawl https://example.com --format yaml --output data.yaml

Plain Text

omnivore crawl https://example.com --format text --output content.txt

Organized Output

omnivore crawl https://example.com --organize --output site-backup/

Creates structure:

site-backup/
├── index.json                 # Crawl metadata and statistics
├── page_0001.json             # First page content
├── page_0002.json             # Second page content
├── ...
├── tables/                    # Extracted tables (if --extract-tables)
│   ├── page_0001_table_1.csv
│   ├── page_0001_table_2.csv
│   └── ...
├── media/                     # Downloaded media (if --download-media)
│   ├── images/
│   └── videos/
└── screenshots/               # Screenshots (if --screenshot in browser mode)
    ├── page_0001.png
    └── ...
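
A quick way to sanity-check an organized crawl is to count what landed on disk. This sketch assumes the page_NNNN.json and tables/*.csv naming shown in the tree above. The function name summarize_backup is hypothetical.

```shell
#!/usr/bin/env bash
# Count page files and extracted tables in an --organize output directory.
# Assumes the layout shown above: page_*.json at the top level, CSVs under tables/.
summarize_backup() {
  local dir="${1:?usage: summarize_backup <dir>}"
  local pages tables=0
  pages=$(find "$dir" -maxdepth 1 -type f -name 'page_*.json' | wc -l)
  if [ -d "$dir/tables" ]; then
    tables=$(find "$dir/tables" -type f -name '*.csv' | wc -l)
  fi
  # Arithmetic expansion strips the padding some wc implementations add.
  printf 'pages=%d tables=%d\n' "$((pages))" "$((tables))"
}

# Example (not run here):
#   summarize_backup site-backup/
```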

Advanced Features

Table Extraction

# Extract all tables as CSV files
omnivore crawl https://data.gov/statistics \
  --extract-tables \
  --organize \
  --output gov-data/

Extracted tables are written in three forms:

  • Individual CSV files in the tables/ directory
  • Structured data embedded in the JSON output
  • Formatted tables in the Markdown output

URL Filtering

# Include only specific paths
omnivore crawl https://example.com \
  --include-urls "/blog/*,/news/*"

# Exclude admin and private areas
omnivore crawl https://example.com \
  --exclude-urls "/admin/*,/private/*,*.pdf"

# Combine include and exclude
omnivore crawl https://example.com \
  --include-urls "/products/*" \
  --exclude-urls "*/reviews/*"
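
To reason about how include and exclude lists combine, it can help to model the decision in a few lines of shell. This page does not say how Omnivore evaluates patterns, so the sketch below makes three labeled assumptions: patterns are shell-style globs, they are matched against the URL path, and an exclude match always wins over an include match. The function url_allowed is hypothetical.

```shell
#!/usr/bin/env bash
# Return 0 if a URL path passes the include/exclude lists, 1 otherwise.
# Usage: url_allowed <path> <include_csv> <exclude_csv>
# Assumptions (not confirmed by Omnivore): glob patterns, path matching, exclude wins.
url_allowed() {
  local path="$1" pattern
  local -a includes excludes
  IFS=',' read -r -a includes <<< "$2"
  IFS=',' read -r -a excludes <<< "$3"

  # Any matching exclude pattern rejects the path outright.
  for pattern in "${excludes[@]}"; do
    # The unquoted right-hand side makes [[ == ]] treat it as a glob.
    [[ "$path" == $pattern ]] && return 1
  done

  # An empty include list accepts everything that was not excluded.
  ((${#includes[@]} == 0)) && return 0

  for pattern in "${includes[@]}"; do
    [[ "$path" == $pattern ]] && return 0
  done
  return 1
}

# Example: a review page under /products/ is rejected by the exclude list.
url_allowed "/products/42/reviews/1" "/products/*" "*/reviews/*" || echo "skipped"
```

Testing a handful of real paths against a pattern set this way, before a long crawl, is cheaper than discovering afterwards that a pattern matched too much or too little.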

Content Type Filtering

# Only HTML pages
omnivore crawl https://example.com \
  --content-types "text/html"

# HTML and JSON
omnivore crawl https://api.example.com \
  --content-types "text/html,application/json"

# Everything except images
omnivore crawl https://example.com \
  --exclude-content-types "image/*"

Custom Headers

# Add authentication
omnivore crawl https://api.example.com \
  --header "Authorization: Bearer TOKEN" \
  --header "X-API-Key: KEY"

# Custom cookies
omnivore crawl https://example.com \
  --header "Cookie: session=abc123; user=john"

Session Management

# Save session for resuming
omnivore crawl https://large-site.com \
  --session my-crawl \
  --output partial.json

# Resume interrupted crawl
omnivore crawl --resume my-crawl \
  --output complete.json

# List saved sessions
omnivore sessions list

# Delete session
omnivore sessions delete my-crawl

Examples

Complete Website Backup

omnivore crawl https://my-site.com \
  --depth 20 \
  --workers 20 \
  --organize \
  --extract-tables \
  --include-raw \
  --zip \
  --output "backup-$(date +%Y%m%d).zip"

E-commerce Product Extraction

omnivore crawl https://shop.com/products \
  --ai "Extract product name, price, SKU, description, images, stock" \
  --format json \
  --include-urls "*/product/*" \
  --exclude-urls "*/reviews/*" \
  --output products.json

News Article Collection

omnivore crawl https://news.site.com \
  --template news \
  --depth 5 \
  --format markdown \
  --organize \
  --output news-archive/

Academic Research

omnivore crawl https://journal.edu \
  --ai "Extract paper titles, authors, abstracts, DOIs, citations" \
  --extract-tables \
  --include-urls "*/papers/*,*/articles/*" \
  --output research.json

API Documentation Scraping

omnivore crawl https://api.service.com/docs \
  --ai "Extract endpoints, methods, parameters, examples, responses" \
  --format markdown \
  --depth 10 \
  --output api-docs.md

Multi-site Comparison

#!/bin/bash
sites=("competitor1.com" "competitor2.com" "competitor3.com")

for site in "${sites[@]}"; do
  omnivore crawl "https://$site" \
    --depth 3 \
    --ai "Extract products, pricing, features" \
    --format json \
    --output "analysis-$site.json" &
done
wait

# Combine results
jq -s '.' analysis-*.json > combined-analysis.json

Dynamic Site with Infinite Scroll

omnivore crawl https://social-feed.com \
  --browser \
  --interact \
  --scroll \
  --max-scroll 50 \
  --wait 3000 \
  --output feed-content.json

Monitoring Price Changes

# Initial crawl
omnivore crawl https://shop.com/sale \
  --ai "Extract product names and prices" \
  --output prices-baseline.json

# Later crawl
omnivore crawl https://shop.com/sale \
  --ai "Extract product names and prices" \
  --output prices-current.json

# Compare
diff <(jq '.results[].extracted_data' prices-baseline.json) \
     <(jq '.results[].extracted_data' prices-current.json)

Performance Tuning

For Large Sites

# Start with shallow crawl
omnivore crawl https://huge-site.com \
  --depth 2 \
  --workers 5 \
  --output preview.json

# Then deep crawl specific sections
omnivore crawl https://huge-site.com/important-section \
  --depth 10 \
  --workers 20

For Slow Sites

omnivore crawl https://slow-site.com \
  --workers 2 \
  --delay 2000 \
  --timeout 60000 \
  --max-retries 5

For Fast Sites

omnivore crawl https://fast-site.com \
  --workers 50 \
  --delay 50 \
  --timeout 10000

Memory Management

# Limit memory usage (10485760 bytes = 10 MiB per page)
omnivore crawl https://example.com \
  --max-queue-size 1000 \
  --max-page-size 10485760 \
  --stream-mode  # Don't store in memory

Best Practices

1. Always Be Respectful

omnivore crawl https://example.com \
  --respect-robots \
  --delay 500 \
  --workers 5 \
  --user-agent "MyBot/1.0 (contact@me.com)"

2. Start Small

  • Begin with --depth 1 or --depth 2
  • Use fewer workers initially
  • Test on a small section first

3. Use Appropriate Delays

  • Small sites: 500-1000ms delay
  • Medium sites: 200-500ms delay
  • Large sites: 50-200ms delay (if allowed)
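
To see what a given delay implies for the target server, a rough upper bound on request rate can be computed from the worker count. The sketch assumes that --delay applies to each worker independently, which this page does not state; if the delay is global, the bound is simply 1000 / delay. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Upper bound on requests per second, assuming --delay applies per worker
# and ignoring network latency. Integer division rounds down.
max_requests_per_second() {
  local workers="$1" delay_ms="$2"
  if ((delay_ms <= 0)); then
    echo "delay must be positive" >&2
    return 1
  fi
  echo $((workers * 1000 / delay_ms))
}

max_requests_per_second 5 500    # prints 10
max_requests_per_second 50 50    # prints 1000
```

Under the same assumption, the defaults of 10 workers and a 100 ms delay would allow up to 100 requests per second, so for small sites lowering --workers matters as much as raising --delay.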

4. Handle Errors Gracefully

omnivore crawl https://unstable-site.com \
  --max-retries 5 \
  --timeout 30000 \
  --continue-on-error \
  --verbose 2> errors.log

5. Optimize Output

  • Use --organize for large crawls
  • Use --zip to save space
  • Use appropriate format for your use case
  • Exclude unnecessary data with flags

6. Monitor Resource Usage

# Monitor with system tools (top -p is the Linux procps syntax)
omnivore crawl https://example.com &
PID=$!
top -p "$PID"
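
For unattended runs, where an interactive top session is not practical, peak memory can be sampled in a loop instead. This is a minimal sketch using ps. The function name peak_rss and the sampling interval are illustrative, and the value reported is the resident set size of the one process only, not of any child processes such as a browser.

```shell
#!/usr/bin/env bash
# Sample a process's resident set size until it exits; print the peak in KiB.
# Usage: peak_rss <pid> [interval_seconds]
peak_rss() {
  local pid="$1" interval="${2:-1}" state rss peak=0
  while true; do
    # Empty output means the process is gone; a state starting with Z means it has exited.
    read -r state rss <<< "$(ps -o stat= -o rss= -p "$pid" 2>/dev/null)"
    [[ -z "$state" || "$state" == Z* ]] && break
    if [[ "$rss" =~ ^[0-9]+$ ]] && ((rss > peak)); then
      peak=$rss
    fi
    sleep "$interval"
  done
  echo "$peak"
}

# Example (not run here):
#   omnivore crawl https://example.com &
#   echo "peak RSS: $(peak_rss $!) KiB"
```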

Troubleshooting

Common Issues

429 Too Many Requests

  • Increase delay: --delay 1000
  • Reduce workers: --workers 2
  • Add exponential backoff
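
No backoff option is listed on this page, so one way to apply exponential backoff is around the whole command. The wrapper below is a sketch: retry_with_backoff is a hypothetical helper, and it assumes that omnivore exits with a non-zero status when a crawl fails, which is not confirmed here.

```shell
#!/usr/bin/env bash
# Run a command, retrying with a doubling pause after each failure.
# Usage: retry_with_backoff <max_attempts> <initial_pause_seconds> <command...>
retry_with_backoff() {
  local max_attempts="$1" pause="$2" attempt=1
  shift 2
  until "$@"; do
    if ((attempt >= max_attempts)); then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${pause}s" >&2
    sleep "$pause"
    pause=$((pause * 2))
    attempt=$((attempt + 1))
  done
}

# Example (not run here): pauses of 5s, 10s, 20s between up to four attempts.
#   retry_with_backoff 4 5 omnivore crawl https://example.com --delay 1000 --workers 2
```

This retries the entire crawl, which is coarser than per-request backoff; combining it with a saved session, where available, would avoid refetching pages that already succeeded.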

JavaScript Content Not Loading

  • Use browser mode: --browser
  • Increase wait time: --wait 5000
  • Enable interaction: --interact

Memory Issues

  • Reduce workers
  • Limit queue size
  • Use streaming mode
  • Process in batches

Incomplete Crawls

  • Check robots.txt compliance
  • Verify URL patterns
  • Check for session requirements
  • Look for rate limiting
