diff --git a/basango/apps/crawler/README.md b/basango/apps/crawler/README.md
new file mode 100644
index 0000000..ffb0bfd
--- /dev/null
+++ b/basango/apps/crawler/README.md
@@ -0,0 +1,179 @@
# @basango/crawler

A scalable web crawler built with TypeScript for extracting and processing articles from news sources and other websites. It runs on Bun (recommended) or Node.js.

The Basango Crawler is designed to systematically crawl news websites and extract article content. It supports both synchronous and asynchronous crawling modes, with configurable sources, queue-based processing, and robust error handling.

## Features

- **Multi-mode Operation**: Synchronous and asynchronous crawling modes
- **Queue-based Processing**: Uses BullMQ with Redis for scalable job processing
- **Configurable Sources**: JSON-based configuration for different website sources
- **HTML & WordPress Support**: Built-in parsers for HTML websites and WordPress APIs
- **Rate Limiting**: Respects website rate limits and implements backoff strategies
- **Data Persistence**: JSONL output format for processed articles (see the Output Format section below)
- **Worker Management**: Distributed worker system for parallel processing
- **Type Safety**: Full TypeScript implementation with Zod schema validation

## Prerequisites

- [Bun](https://bun.sh/) runtime (recommended) or Node.js (v22+)
- Redis server (for async operations)
- TypeScript familiarity if you plan to extend or customize the crawler

## Installation

```bash
# Navigate to the crawler directory
cd basango/apps/crawler

# Install dependencies
bun install
```

## Configuration

### 1. Environment Variables

Create a `.env.local` file with the following variables:

```bash
# Redis configuration for async operations
BASANGO_CRAWLER_ASYNC_REDIS_URL=redis://localhost:6379/0
BASANGO_CRAWLER_ASYNC_QUEUE_LISTING=listing
BASANGO_CRAWLER_ASYNC_QUEUE_DETAILS=details
BASANGO_CRAWLER_ASYNC_QUEUE_PROCESSING=processing

# Fetch configuration
BASANGO_CRAWLER_FETCH_MAX_RETRIES=3
BASANGO_CRAWLER_FETCH_RESPECT_RETRY_AFTER=true
BASANGO_CRAWLER_FETCH_USER_AGENT=Basango/0.1 (+https://github.com/bernard-ng/basango)

# Crawler behavior
BASANGO_CRAWLER_UPDATE_DIRECTION=forward

# TTL settings (in seconds)
BASANGO_CRAWLER_ASYNC_TTL_FAILURE=3600
BASANGO_CRAWLER_ASYNC_TTL_RESULT=3600
```
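Configuration is loaded and validated with Zod (see `src/config.ts`). The sketch below is illustrative rather than the app's actual schema; it shows how these variables could be parsed, with names and defaults taken from the `.env.local` example above:

```typescript
// Illustrative sketch only; the real schema lives in src/config.ts and may
// differ. Variable names and defaults mirror the .env.local example above.
import { z } from "zod";

const envSchema = z.object({
  BASANGO_CRAWLER_ASYNC_REDIS_URL: z.string().url(),
  BASANGO_CRAWLER_ASYNC_QUEUE_LISTING: z.string().default("listing"),
  BASANGO_CRAWLER_ASYNC_QUEUE_DETAILS: z.string().default("details"),
  BASANGO_CRAWLER_ASYNC_QUEUE_PROCESSING: z.string().default("processing"),
  BASANGO_CRAWLER_FETCH_MAX_RETRIES: z.coerce.number().int().min(0).default(3),
  // z.coerce.boolean() would turn the string "false" into true, so parse explicitly.
  BASANGO_CRAWLER_FETCH_RESPECT_RETRY_AFTER: z
    .enum(["true", "false"])
    .default("true")
    .transform((v) => v === "true"),
  BASANGO_CRAWLER_FETCH_USER_AGENT: z.string(),
  // "forward" appears in the example above; other accepted values are an assumption.
  BASANGO_CRAWLER_UPDATE_DIRECTION: z.string().default("forward"),
  BASANGO_CRAWLER_ASYNC_TTL_FAILURE: z.coerce.number().int().default(3600),
  BASANGO_CRAWLER_ASYNC_TTL_RESULT: z.coerce.number().int().default(3600),
});

// Fails fast with a readable error if anything is missing or malformed.
export const env = envSchema.parse(process.env);
```

Validating once at startup means a missing or malformed variable fails immediately with a clear error instead of surfacing mid-crawl.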
### 2. Source Configuration

Sources are configured in `config/sources.json`. Example configuration for an HTML source:

```json
{
  "sources": {
    "html": [
      {
        "sourceId": "example.com",
        "sourceKind": "html",
        "sourceUrl": "https://example.com",
        "sourceSelectors": {
          "articles": ".article-list .article",
          "articleTitle": "h2.title",
          "articleLink": "a.permalink",
          "articleDate": ".publish-date",
          "articleBody": ".content",
          "pagination": ".pagination .next"
        },
        "requiresDetails": true,
        "supportsCategories": false
      }
    ]
  }
}
```

## Usage

### Synchronous Crawling

Perfect for immediate, one-time crawling tasks:

```bash
# Crawl a specific source
bun run crawler:sync -- --sourceId radiookapi.net

# Crawl with page range filter
bun run crawler:sync -- --sourceId radiookapi.net --pageRange 1:5

# Crawl with date range filter
bun run crawler:sync -- --sourceId radiookapi.net --dateRange 2024-01-01:2024-01-31

# Crawl specific category (if supported)
bun run crawler:sync -- --sourceId example.com --category politics
```

### Asynchronous Crawling

Best for large-scale operations where jobs are queued and processed by workers:

```bash
# Schedule an async crawl job
bun run crawler:async -- --sourceId radiookapi.net

# Schedule with filters
bun run crawler:async -- --sourceId radiookapi.net --pageRange 1:10 --category economics
```

### Worker Management

Start workers to process async jobs:

```bash
# Start workers for all queues
bun run crawler:worker

# Start workers for specific queues
bun run crawler:worker -- --queue listing --queue details

# Start workers with short option
bun run crawler:worker -- -q listing -q processing
```

For a rough picture of what these workers look like internally, see the appendix at the end of this README.

## CLI Options

### Crawling Commands

| Option | Description | Example |
|--------|-------------|---------|
| `--sourceId` | **Required.** Source identifier from `config/sources.json` | `--sourceId radiookapi.net` |
| `--pageRange` | Page range to crawl (format: `start:end`) | `--pageRange 1:5` |
| `--dateRange` | Date range filter (format: `YYYY-MM-DD:YYYY-MM-DD`) | `--dateRange 2024-01-01:2024-01-31` |
| `--category` | Category slug to crawl | `--category politics` |

### Worker Commands

| Option | Description | Example |
|--------|-------------|---------|
| `--queue`, `-q` | Specify queue(s) to process (can be used multiple times) | `--queue listing --queue details` |

## Project Structure

```
basango/apps/crawler/
├── src/
│   ├── config.ts          # Configuration schema and loading
│   ├── constants.ts       # Application constants
│   ├── schema.ts          # Zod validation schemas
│   ├── utils.ts           # Utility functions
│   ├── http/              # HTTP client and utilities
│   ├── process/           # Core crawling logic
│   │   ├── async/         # Async processing (queues, workers)
│   │   ├── sync/          # Synchronous processing
│   │   ├── parsers/       # Content parsers (HTML, WordPress)
│   │   ├── crawler.ts     # Main crawler interface
│   │   └── persistence.ts # Data persistence layer
│   ├── scripts/           # CLI entry points
│   │   ├── crawl.ts       # Sync crawling script
│   │   ├── queue.ts       # Async job scheduling
│   │   ├── worker.ts      # Worker process
│   │   └── utils.ts       # CLI utilities
│   └── __tests__/         # Test files
├── config/
│   ├── sources.json       # Source configurations
│   └── pipeline.json      # Pipeline settings
├── data/                  # Output directory for crawled data
└── package.json
```
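## Output Format

Processed articles are persisted as JSONL (one JSON object per line) under `data/`. The authoritative record shape is defined by the Zod schemas in `src/schema.ts`; the interface below is a hypothetical illustration, with field names inferred from the selector names in `sources.json`:

```typescript
// Hypothetical record shape; see src/schema.ts for the authoritative schema.
// Field names are inferred from the selectors configured in sources.json.
interface CrawledArticle {
  sourceId: string; // e.g. "radiookapi.net"
  title: string;    // text matched by the articleTitle selector
  link: string;     // URL matched by the articleLink selector
  date: string;     // value matched by the articleDate selector
  body: string;     // content matched by the articleBody selector
}
```

Each record would be serialized with `JSON.stringify` and appended as a single line to the output file, which keeps writes append-only and the files easy to stream.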
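## Appendix: Minimal Worker Sketch

For orientation, the sketch below shows the general shape of a BullMQ worker like the ones `crawler:worker` starts. It is illustrative only; the app's actual workers live in `src/process/async/` and are wired up by `src/scripts/worker.ts`, and the job payload here is an assumption:

```typescript
// Illustrative only; not the app's actual worker code.
import { Worker } from "bullmq";
import IORedis from "ioredis";

// BullMQ workers require maxRetriesPerRequest: null on their Redis connection.
const connection = new IORedis(process.env.BASANGO_CRAWLER_ASYNC_REDIS_URL!, {
  maxRetriesPerRequest: null,
});

// "listing" matches the BASANGO_CRAWLER_ASYNC_QUEUE_LISTING default above.
const worker = new Worker(
  "listing",
  async (job) => {
    // The payload shape is assumed for illustration.
    console.log(`crawling listing page for job ${job.id}`, job.data);
  },
  { connection },
);

worker.on("completed", (job) => console.log(`job ${job.id} completed`));
worker.on("failed", (job, err) => console.error(`job ${job?.id} failed`, err));
```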