CLI Reference¶
The Omnivore CLI provides a command-line interface for web crawling and content extraction.
Installation¶
Global Options¶
- -v, --verbose: Enable verbose logging
- -c, --config <FILE>: Path to TOML configuration file
- -h, --help: Print help information
- -V, --version: Print version information
Commands¶
crawl - Web Crawling¶
Start a web crawl from one or more seed URLs.
Arguments¶
<URL>: Starting URL(s) to crawl
Options¶
- --workers <N>: Number of concurrent workers (default: 10, range: 1-100)
- --depth <N>: Maximum crawl depth (default: 5, range: 1-20)
- --delay <MS>: Delay between requests in milliseconds (default: 100)
- --output <FILE>: Export crawl statistics to JSON file
- --respect-robots: Honor robots.txt directives (currently fetches but doesn't parse)
- --user-agent <STRING>: Custom User-Agent string
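The documented defaults and ranges can be checked before a crawl is launched. The following is a minimal sketch, not part of Omnivore itself: the executable name `omnivore`, the `CrawlOptions` class, and the `build_crawl_argv` helper are all assumptions made for illustration, and only the flags, defaults, and ranges come from the option list above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: the executable is called "omnivore"; this reference does not state it.
BINARY = "omnivore"


@dataclass
class CrawlOptions:
    """Hypothetical container mirroring the documented crawl options."""
    workers: int = 10                 # documented default, range 1-100
    depth: int = 5                    # documented default, range 1-20
    delay_ms: int = 100               # documented default, milliseconds
    output: Optional[str] = None      # JSON statistics file
    respect_robots: bool = False
    user_agent: Optional[str] = None


def build_crawl_argv(urls: List[str], opts: CrawlOptions) -> List[str]:
    """Validate options against the documented ranges and return an argv list."""
    if not urls:
        raise ValueError("at least one seed URL is required")
    if not 1 <= opts.workers <= 100:
        raise ValueError("--workers must be between 1 and 100")
    if not 1 <= opts.depth <= 20:
        raise ValueError("--depth must be between 1 and 20")
    if opts.delay_ms < 0:
        raise ValueError("--delay must not be negative")

    argv = [BINARY, "crawl", *urls,
            "--workers", str(opts.workers),
            "--depth", str(opts.depth),
            "--delay", str(opts.delay_ms)]
    if opts.output:
        argv += ["--output", opts.output]
    if opts.respect_robots:
        argv.append("--respect-robots")
    if opts.user_agent:
        argv += ["--user-agent", opts.user_agent]
    return argv
```

The returned list can be passed to `subprocess.run` unchanged, which avoids shell quoting issues with URLs that contain `&` or `?`.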
Examples¶
Basic crawl:
Deep crawl with more workers:
Polite crawl with delays:
Export statistics:
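The four scenarios above could be expressed as argument lists along the following lines. This is an illustrative sketch only: the executable name `omnivore`, the URL, the file name `stats.json`, and the specific numbers are assumptions chosen within the documented ranges, not the original examples.

```python
import shlex

# Assumption: the executable is called "omnivore"; example.com is a placeholder URL.
BINARY = "omnivore"
SEED = "https://example.com"

EXAMPLES = {
    # Basic crawl: all documented defaults apply.
    "basic": [BINARY, "crawl", SEED],
    # Deep crawl with more workers: values picked inside the documented ranges.
    "deep": [BINARY, "crawl", SEED, "--workers", "20", "--depth", "10"],
    # Polite crawl with delays: one second between requests.
    "polite": [BINARY, "crawl", SEED, "--delay", "1000", "--respect-robots"],
    # Export statistics: writes crawl statistics to a JSON file.
    "export": [BINARY, "crawl", SEED, "--output", "stats.json"],
}

if __name__ == "__main__":
    # Print each scenario as a copy-pasteable command line.
    for name, argv in EXAMPLES.items():
        print(f"{name}: {shlex.join(argv)}")
```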
parse - HTML Parsing¶
Parse HTML content and extract structured data.
Arguments¶
<FILE>: Path to HTML file
Options¶
- --rules <FILE>: Path to extraction rules file (JSON)
- --output <FILE>: Write extracted data to file
Examples¶
Basic parsing:
With extraction rules:
git - Git Repository Analysis¶
Extract and analyze code from Git repositories with intelligent filtering.
Arguments¶
<SOURCE>: Repository URL or local path
Options¶
- --output <PATH>: Output file path (.txt or .json)
- --only <PATTERNS>: Include only matching files (comma-separated)
- --include <PATTERNS>: Include matching files
- --exclude <PATTERNS>: Exclude matching files
- --keep: Keep cloned repository after completion
- --json: Output in JSON format
- --stdout: Output to stdout
Examples¶
Analyze a GitHub repository:
Extract specific file types:
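Both cases above can be sketched with a small helper that assembles the argument list. The executable name `omnivore`, the helper `build_git_argv`, the repository URL, and the glob-style patterns are assumptions for illustration; the comma-separated form of `--only` and the `.txt`/`.json` output extensions come from the option list.

```python
from pathlib import PurePath
from typing import List, Optional, Sequence

# Assumption: the executable is called "omnivore".
BINARY = "omnivore"


def build_git_argv(source: str,
                   only: Sequence[str] = (),
                   output: Optional[str] = None,
                   keep: bool = False) -> List[str]:
    """Return an argv list for the git command (hypothetical helper)."""
    argv = [BINARY, "git", source]
    if only:
        # --only is documented as a comma-separated pattern list.
        argv += ["--only", ",".join(only)]
    if output is not None:
        # The documented output extensions are .txt and .json.
        if PurePath(output).suffix not in {".txt", ".json"}:
            raise ValueError("--output should end in .txt or .json")
        argv += ["--output", output]
    if keep:
        argv.append("--keep")
    return argv


# Analyze a repository (placeholder URL).
analyze = build_git_argv("https://github.com/user/repo")

# Extract specific file types; the glob syntax here is an assumption.
rust_only = build_git_argv("https://github.com/user/repo",
                           only=["*.rs", "*.toml"],
                           output="code.txt")
```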
See Git Command Documentation for detailed usage.
graph - Graph Operations (⚠️ Not Implemented)¶
Build knowledge graphs from crawled data.
Note: This command currently only prints placeholder messages. Graph functionality is under development.
stats - Statistics (⚠️ Limited Implementation)¶
Display crawl session statistics.
Note: This command has limited functionality. Session tracking is not fully implemented.
generate-completions - Shell Completions¶
Generate shell completion scripts.
Supported Shells¶
- bash
- zsh
- fish
- powershell
Installation Examples¶
Bash:
Zsh:
Fish:
Configuration File¶
Use a TOML configuration file to set default values:
See Configuration for file format details.
Exit Codes¶
- 0: Success
- 1: General error
- 2: Invalid arguments
- 3: Configuration error
- 4: Network error
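Scripts that wrap the CLI can branch on these codes. A minimal sketch follows; the `ExitCode` enum and the `describe_exit` and `should_retry` helpers are hypothetical names, and treating only network errors as retryable is an assumption rather than documented behavior.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """The exit codes listed in this reference."""
    SUCCESS = 0
    GENERAL_ERROR = 1
    INVALID_ARGUMENTS = 2
    CONFIGURATION_ERROR = 3
    NETWORK_ERROR = 4


def describe_exit(code: int) -> str:
    """Turn a numeric exit status into a readable label."""
    try:
        return ExitCode(code).name.replace("_", " ").capitalize()
    except ValueError:
        return f"Unknown exit code {code}"


def should_retry(code: int) -> bool:
    """Assumed policy: only a network error is worth retrying."""
    return code == ExitCode.NETWORK_ERROR
```

With `subprocess.run(...)`, the value to pass in is the `returncode` attribute of the result.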
Environment Variables¶
- OMNIVORE_CONFIG: Default configuration file path
- OMNIVORE_DATA_DIR: Data storage directory (default: ~/.omnivore)
- RUST_LOG: Logging level (trace, debug, info, warn, error)
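A wrapper script might resolve these locations as sketched below. The helper names are hypothetical, and the precedence shown (an explicit `--config` value winning over `OMNIVORE_CONFIG`) is an assumption; this reference does not state how the two interact.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_data_dir(env: Mapping[str, str] = os.environ) -> Path:
    """Return OMNIVORE_DATA_DIR if set, else the documented default ~/.omnivore."""
    raw = env.get("OMNIVORE_DATA_DIR")
    return Path(raw).expanduser() if raw else Path.home() / ".omnivore"


def resolve_config(cli_config: Optional[str],
                   env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Pick a configuration file path.

    Assumption: a path given on the command line takes priority over
    OMNIVORE_CONFIG. Returns None when neither is provided.
    """
    raw = cli_config or env.get("OMNIVORE_CONFIG")
    return Path(raw).expanduser() if raw else None
```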
Common Use Cases¶
1. Quick Site Snapshot¶
2. Respectful Crawling¶
3. Deep Site Analysis¶
4. Parse Downloaded Content¶
Troubleshooting¶
Issue: Crawl seems stuck¶
- Reduce worker count: --workers 2
- Increase delay: --delay 1000
- Check network connectivity
Issue: Too many requests error¶
- Increase delay between requests
- Reduce worker count
- Check if site has rate limiting
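The first two remedies can be applied step by step between attempts. The sketch below shows one possible policy, halving workers and doubling the delay; the `back_off` function and its factors are assumptions for illustration, not behavior built into the CLI.

```python
from typing import Tuple


def back_off(workers: int, delay_ms: int) -> Tuple[int, int]:
    """Return gentler (workers, delay_ms) settings for the next attempt.

    Workers are halved but never drop below the documented minimum of 1.
    The delay is doubled, with a floor of 100 ms (the documented default).
    """
    new_workers = max(1, workers // 2)
    new_delay = max(100, delay_ms * 2)
    return new_workers, new_delay


# Starting from the documented defaults of 10 workers and 100 ms:
settings = (10, 100)
for _ in range(3):
    settings = back_off(*settings)
```

After three rounds starting from the defaults, this policy reaches 1 worker and an 800 ms delay.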
Issue: Memory usage high¶
- Reduce worker count
- Decrease crawl depth
- Use configuration file to limit queue size
Performance Tips¶
- Worker Count: Start with 5-10 workers and adjust based on site response
- Delays: Use at least 100ms delay for polite crawling
- Depth: Keep depth low (2-3) for initial exploration
- Output: Use --output to save results for later analysis
Limitations¶
- Robots.txt parsing not fully implemented (fetches but doesn't parse rules)
- No JavaScript rendering (static HTML only)
- Session management not persistent across runs
- Graph and stats commands have limited functionality