REST API Reference¶
The Omnivore REST API provides programmatic access to the crawler functionality.
Base Configuration¶
- Default URL:
http://localhost:3000 - Content-Type:
application/json - CORS: Enabled for all origins
Starting the API Server¶
omnivore-api
# With custom port
omnivore-api --port 8080
# With custom config
omnivore-api --config config.toml
Endpoints¶
GET / - API Information¶
Returns API metadata and available endpoints.
Response¶
{
"name": "Omnivore API",
"version": "0.1.0",
"endpoints": [
"/",
"/health",
"/api/crawl",
"/api/stats",
"/graphql"
]
}
GET /health - Health Check¶
Check if the API server is running.
Response¶
- Status:
200 OK - Body:
"OK"
Example¶
POST /api/crawl - Start Crawl¶
Initiate a new web crawl.
Request Body¶
Parameters¶
url(required): Starting URL to crawlmax_depth(optional): Maximum crawl depth (default: 5, max: 20)max_workers(optional): Number of concurrent workers (default: 10, max: 100)
Response¶
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"status": "started",
"message": "Crawl started for URL: https://example.com"
}
Example¶
curl -X POST http://localhost:3000/api/crawl \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "max_depth": 3}'
GET /api/stats - Get Statistics¶
Retrieve statistics from the last crawl.
Response (Success)¶
{
"status": "completed",
"stats": {
"urls_visited": 150,
"urls_queued": 0,
"success_count": 145,
"error_count": 5,
"start_time": "2024-01-15T10:30:00Z",
"end_time": "2024-01-15T10:35:30Z",
"duration_secs": 330,
"pages_per_second": 0.45,
"average_response_time_ms": 250,
"total_bytes_downloaded": 15728640
}
}
Response (No Data)¶
Example¶
Error Responses¶
All endpoints may return error responses in the following format:
Common Error Codes¶
400 Bad Request: Invalid request parameters404 Not Found: Endpoint not found500 Internal Server Error: Server error during processing
Rate Limiting¶
Currently, the API does not implement rate limiting. For production use, consider: - Adding a reverse proxy (nginx, Caddy) with rate limiting - Implementing application-level rate limiting - Using an API gateway
Authentication¶
⚠️ Not Implemented: The API currently has no authentication. All endpoints are publicly accessible.
For production deployment: 1. Implement JWT or API key authentication 2. Use HTTPS/TLS 3. Restrict CORS origins 4. Add request validation
Limitations¶
Current Limitations¶
- No persistent sessions: Crawl state is not preserved across API restarts
- Single crawl at a time: Cannot run multiple crawls concurrently
- No crawl management: Cannot pause, resume, or cancel crawls
- Limited statistics: Only basic metrics are tracked
- No authentication: All endpoints are public
Not Implemented¶
- Crawl status endpoint (
GET /api/crawl/{id}) - Cancel crawl endpoint (
DELETE /api/crawl/{id}) - List crawls endpoint (
GET /api/crawls) - Configuration endpoint (
GET/POST /api/config) - WebSocket support for real-time updates
Usage Examples¶
Basic Crawl Workflow¶
-
Start a crawl:
-
Check statistics:
Python Example¶
import requests
import time
# Start crawl
response = requests.post(
"http://localhost:3000/api/crawl",
json={
"url": "https://example.com",
"max_depth": 3,
"max_workers": 5
}
)
crawl_id = response.json()["id"]
# Wait and check stats
time.sleep(30)
stats = requests.get("http://localhost:3000/api/stats").json()
print(f"Crawled {stats['stats']['urls_visited']} pages")
JavaScript Example¶
// Start crawl
fetch('http://localhost:3000/api/crawl', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({
url: 'https://example.com',
max_depth: 2
})
})
.then(res => res.json())
.then(data => console.log('Crawl started:', data.id));
// Get stats
fetch('http://localhost:3000/api/stats')
.then(res => res.json())
.then(stats => console.log('Stats:', stats));
Performance Considerations¶
- Concurrent Requests: The API can handle multiple requests but only one crawl at a time
- Response Times: Health check < 10ms, crawl start < 100ms
- Memory Usage: Increases with crawl depth and worker count
- CPU Usage: Scales with number of workers and parsing complexity
Future Enhancements¶
Planned improvements for the REST API: - WebSocket support for real-time crawl updates - Batch crawl operations - Crawl templates and presets - Export formats (CSV, JSON, XML) - Webhook notifications - Metrics and monitoring endpoints