Files
basango/projects/crawler/README.md
T

47 lines
2.2 KiB
Markdown

# Crawler
[![crawler audit](https://github.com/bernard-ng/basango/actions/workflows/crawler_audit.yml/badge.svg)](https://github.com/bernard-ng/basango/actions/workflows/crawler_audit.yml)
[![crawler quality](https://github.com/bernard-ng/basango/actions/workflows/crawler_quality.yml/badge.svg)](https://github.com/bernard-ng/basango/actions/workflows/crawler_quality.yml)
[![crawler tests](https://github.com/bernard-ng/basango/actions/workflows/crawler_tests.yml/badge.svg)](https://github.com/bernard-ng/basango/actions/workflows/crawler_tests.yml)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)
---
### Usage
- Install the project in your virtualenv so the `basango` CLI is available:
- With uv: `uv run --with . basango --help`
- Or install locally: `uv sync` then `basango --help`
#### Sync crawl (in-process)
- Crawl a configured source by id and write to CSV/JSON:
- `basango crawl --source-id my-source`
- Limit by page range: `basango crawl --source-id my-source -p 1:3`
- Limit by date range: `basango crawl --source-id my-source -d 2024-10-01:2024-10-31`
- Category, when supported: `basango crawl --source-id my-source -g tech`
#### Async crawl (Redis + RQ)
- Enqueue a crawl job and return immediately:
- `basango crawl --source-id my-source --async`
- Start one or more workers to process queues:
- Article-only (default): `basango worker`
- Multiple queues: `basango worker -q listing -q articles -q processed`
- macOS friendly (no forking): `basango worker --simple`
- One-shot draining for CI: `basango worker --burst`
#### Environment
- `BASANGO_REDIS_URL` (default `redis://localhost:6379/0`)
- `BASANGO_QUEUE_PREFIX` (default `crawler`)
- `BASANGO_QUEUE_TIMEOUT` (default `600` seconds)
- `BASANGO_QUEUE_RESULT_TTL` (default `3600` seconds)
- `BASANGO_QUEUE_FAILURE_TTL` (default `3600` seconds)
#### Configuration
- See `config/pipeline.*.yaml` for source definitions and HTTP client settings.
- Use `-c/--env` to select which pipeline to load (default `development`).