Crawler
Usage
- Install the project in your virtualenv so the `basango` CLI is available:
  - With uv: `uv run --with . basango --help`
  - Or install locally: `uv sync`, then `basango --help`
Sync crawl (in-process)
- Crawl a configured source by id and write to CSV/JSON: `basango crawl --source-id my-source`
- Limit by page range: `basango crawl --source-id my-source -p 1:3`
- Limit by date range: `basango crawl --source-id my-source -d 2024-10-01:2024-10-31`
- Filter by category, when supported: `basango crawl --source-id my-source -g tech`
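The `-p` and `-d` filters both take a `START:END` range spec. A minimal sketch of how such a spec could be parsed (a hypothetical helper for illustration, not the project's actual implementation):

```python
from datetime import date

def parse_range(spec: str) -> tuple[str, str]:
    """Split a START:END spec such as '1:3' into its two endpoints."""
    start, end = spec.split(":", 1)
    return start, end

def parse_date_range(spec: str) -> tuple[date, date]:
    """Parse a '-d' style spec like '2024-10-01:2024-10-31' into dates."""
    start, end = parse_range(spec)
    return date.fromisoformat(start), date.fromisoformat(end)
```

Note that ISO dates contain no colons, so a plain split on `:` handles both page and date ranges.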
Async crawl (Redis + RQ)
- Enqueue a crawl job and return immediately: `basango crawl --source-id my-source --async`
- Start one or more workers to process queues:
  - Article-only (default): `basango worker`
  - Multiple queues: `basango worker -q listing -q articles -q processed`
  - macOS friendly (no forking): `basango worker --simple`
  - One-shot draining for CI: `basango worker --burst`
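The queue names passed via `-q` are presumably namespaced with `BASANGO_QUEUE_PREFIX` (see Environment below). A plausible sketch of that naming, assuming a `<prefix>:<queue>` scheme (the project's actual separator and scheme may differ):

```python
import os

def queue_name(name: str) -> str:
    # Assumed naming scheme "<prefix>:<queue>"; the real separator
    # used by the project may differ.
    prefix = os.environ.get("BASANGO_QUEUE_PREFIX", "crawler")
    return f"{prefix}:{name}"

# With the default prefix this would yield names like "crawler:articles".
names = [queue_name(q) for q in ("listing", "articles", "processed")]
```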
Environment
- `BASANGO_REDIS_URL` (default `redis://localhost:6379/0`)
- `BASANGO_QUEUE_PREFIX` (default `crawler`)
- `BASANGO_QUEUE_TIMEOUT` (default `600` seconds)
- `BASANGO_QUEUE_RESULT_TTL` (default `3600` seconds)
- `BASANGO_QUEUE_FAILURE_TTL` (default `3600` seconds)
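A sketch of how these settings and their documented defaults might be resolved at startup (illustrative only; the variable names and defaults are taken from the list above, the function is hypothetical):

```python
import os

def queue_settings() -> dict:
    """Resolve crawler queue settings from the environment,
    falling back to the documented defaults."""
    env = os.environ.get
    return {
        "redis_url": env("BASANGO_REDIS_URL", "redis://localhost:6379/0"),
        "queue_prefix": env("BASANGO_QUEUE_PREFIX", "crawler"),
        "queue_timeout": int(env("BASANGO_QUEUE_TIMEOUT", "600")),
        "result_ttl": int(env("BASANGO_QUEUE_RESULT_TTL", "3600")),
        "failure_ttl": int(env("BASANGO_QUEUE_FAILURE_TTL", "3600")),
    }
```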
Configuration
- See `config/pipeline.*.yaml` for source definitions and HTTP client settings.
- Use `-c`/`--env` to select which pipeline to load (default `development`).
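A hypothetical shape for a `config/pipeline.development.yaml` entry, to show the kind of content these files hold (the field names here are illustrative assumptions, not the project's actual schema):

```yaml
# Illustrative only: consult the real config/pipeline.*.yaml
# files for the actual schema.
sources:
  - id: my-source
    base_url: https://example.com/news
http:
  timeout: 30
```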