CLI Reference¶

The Omnivore CLI provides a command-line interface for web crawling and content extraction.

Installation¶

cargo install omnivore

Global Options¶

-v, --verbose: Enable verbose logging
-c, --config <FILE>: Path to TOML configuration file
-h, --help: Print help information
-V, --version: Print version information

Commands¶

`crawl` - Web Crawling¶

Start a web crawl from one or more seed URLs.

omnivore crawl <URL> [OPTIONS]

Arguments¶

<URL>: Starting URL(s) to crawl

Options¶

--workers <N>: Number of concurrent workers (default: 10, range: 1-100)
--depth <N>: Maximum crawl depth (default: 5, range: 1-20)
--delay <MS>: Delay between requests in milliseconds (default: 100)
--output <FILE>: Export crawl statistics to JSON file
--respect-robots: Honor robots.txt directives (currently fetches but doesn't parse)
--user-agent <STRING>: Custom User-Agent string

Examples¶

Basic crawl:

omnivore crawl https://example.com

Deep crawl with more workers:

omnivore crawl https://example.com --depth 10 --workers 20

Polite crawl with delays:

omnivore crawl https://example.com --delay 1000 --respect-robots

Export statistics:

omnivore crawl https://example.com --output stats.json

`parse` - HTML Parsing¶

Parse HTML content and extract structured data.

omnivore parse <FILE> [OPTIONS]

Arguments¶

<FILE>: Path to HTML file

Options¶

--rules <FILE>: Path to extraction rules file (JSON)
--output <FILE>: Write extracted data to file

Examples¶

Basic parsing:

omnivore parse page.html

With extraction rules:

omnivore parse page.html --rules rules.json

`git` - Git Repository Analysis¶

Extract and analyze code from Git repositories with intelligent filtering.

omnivore git <SOURCE> [OPTIONS]

Arguments¶

<SOURCE>: Repository URL or local path

Options¶

--output <PATH>: Output file path (.txt or .json)
--only <PATTERNS>: Include only matching files (comma-separated)
--include <PATTERNS>: Include matching files
--exclude <PATTERNS>: Exclude matching files
--keep: Keep cloned repository after completion
--json: Output in JSON format
--stdout: Output to stdout

Examples¶

Analyze a GitHub repository:

omnivore git https://github.com/rust-lang/cargo --output cargo-analysis.txt

Extract specific file types:

omnivore git ./my-project --only "*.rs,*.toml"

See Git Command Documentation for detailed usage.

`graph` - Graph Operations (⚠️ Not Implemented)¶

Build knowledge graphs from crawled data.

omnivore graph <INPUT> [OPTIONS]

Note: This command currently only prints placeholder messages. Graph functionality is under development.

`stats` - Statistics (⚠️ Limited Implementation)¶

Display crawl session statistics.

omnivore stats [SESSION]

Note: This command has limited functionality. Session tracking is not fully implemented.

`generate-completions` - Shell Completions¶

Generate shell completion scripts.

omnivore generate-completions <SHELL>

Supported Shells¶

bash
zsh
fish
powershell

Installation Examples¶

Bash:

omnivore generate-completions bash > /usr/local/etc/bash_completion.d/omnivore

Zsh:

omnivore generate-completions zsh > ~/.zsh/completions/_omnivore

Fish:

omnivore generate-completions fish > ~/.config/fish/completions/omnivore.fish

Configuration File¶

Use a TOML configuration file to set default values:

omnivore -c config.toml crawl https://example.com

See Configuration for file format details.

Exit Codes¶

0: Success
1: General error
2: Invalid arguments
3: Configuration error
4: Network error

Environment Variables¶

OMNIVORE_CONFIG: Default configuration file path
OMNIVORE_DATA_DIR: Data storage directory (default: ~/.omnivore)
RUST_LOG: Logging level (trace, debug, info, warn, error)

Common Use Cases¶

1. Quick Site Snapshot¶

omnivore crawl https://example.com --depth 2 --output snapshot.json

2. Respectful Crawling¶

omnivore crawl https://example.com --workers 2 --delay 2000 --respect-robots

3. Deep Site Analysis¶

omnivore crawl https://example.com --depth 10 --workers 20 --output analysis.json

4. Parse Downloaded Content¶

# First download
curl https://example.com > page.html
# Then parse
omnivore parse page.html

Troubleshooting¶

Issue: Crawl seems stuck¶

Reduce worker count: --workers 2
Increase delay: --delay 1000
Check network connectivity

Issue: Too many requests error¶

Increase delay between requests
Reduce worker count
Check if site has rate limiting

Issue: Memory usage high¶

Reduce worker count
Decrease crawl depth
Use configuration file to limit queue size

Performance Tips¶

Worker Count: Start with 5-10 workers and adjust based on site response
Delays: Use at least 100ms delay for polite crawling
Depth: Keep depth low (2-3) for initial exploration
Output: Use --output to save results for later analysis

Limitations¶

Robots.txt parsing not fully implemented (fetches but doesn't parse rules)
No JavaScript rendering (static HTML only)
Session management not persistent across runs
Graph and stats commands have limited functionality

CLI Reference¶

Installation¶

Global Options¶

Commands¶

crawl - Web Crawling¶

Arguments¶

Options¶

Examples¶

parse - HTML Parsing¶

Arguments¶

Options¶

Examples¶

git - Git Repository Analysis¶

Arguments¶

Options¶

Examples¶

graph - Graph Operations (⚠️ Not Implemented)¶

stats - Statistics (⚠️ Limited Implementation)¶

generate-completions - Shell Completions¶

Supported Shells¶

Installation Examples¶

Configuration File¶

Exit Codes¶

Environment Variables¶

Common Use Cases¶

1. Quick Site Snapshot¶

2. Respectful Crawling¶

3. Deep Site Analysis¶

4. Parse Downloaded Content¶

Troubleshooting¶

Issue: Crawl seems stuck¶

Issue: Too many requests error¶

Issue: Memory usage high¶

Performance Tips¶

Limitations¶

`crawl` - Web Crawling¶

`parse` - HTML Parsing¶

`git` - Git Repository Analysis¶

`graph` - Graph Operations (⚠️ Not Implemented)¶

`stats` - Statistics (⚠️ Limited Implementation)¶

`generate-completions` - Shell Completions¶