# @basango/crawler

A powerful, scalable web crawler application built with Node.js and TypeScript for extracting and processing data from various news sources and websites.

The Basango Crawler is designed to systematically crawl news websites and extract article content. It supports both synchronous and asynchronous crawling modes, with configurable sources, queue-based processing, and robust error handling.

## Features

- **Multi-mode Operation**: Synchronous and asynchronous crawling capabilities
- **Queue-based Processing**: Uses BullMQ with Redis for scalable job processing
- **Configurable Sources**: JSON-based configuration for different website sources
- **HTML & WordPress Support**: Built-in parsers for HTML websites and WordPress APIs
- **Rate Limiting**: Respects website rate limits and implements backoff strategies
- **Data Persistence**: JSONL output format for processed articles
- **Worker Management**: Distributed worker system for parallel processing
- **Type Safety**: Full TypeScript implementation with Zod schema validation

## Prerequisites

- [Bun](https://bun.sh/) runtime (recommended) or Node.js (v22+)
- Redis server (for async operations)
- TypeScript knowledge for configuration

## Installation

```bash
# Navigate to the crawler directory
cd basango/apps/crawler

# Install dependencies
bun install
```

## Configuration

### 1. Environment Variables

Create a `.env.local` file with the following variables:

```bash
# Redis configuration for async operations
BASANGO_CRAWLER_ASYNC_REDIS_URL=redis://localhost:6379/0
BASANGO_CRAWLER_ASYNC_QUEUE_LISTING=listing
BASANGO_CRAWLER_ASYNC_QUEUE_DETAILS=details
BASANGO_CRAWLER_ASYNC_QUEUE_PROCESSING=processing

# Fetch configuration
BASANGO_CRAWLER_FETCH_MAX_RETRIES=3
BASANGO_CRAWLER_FETCH_RESPECT_RETRY_AFTER=true
BASANGO_CRAWLER_FETCH_USER_AGENT=Basango/0.1 (+https://github.com/bernard-ng/basango)

# Crawler behavior
BASANGO_CRAWLER_UPDATE_DIRECTION=forward

# TTL settings (in seconds)
BASANGO_CRAWLER_ASYNC_TTL_FAILURE=3600
BASANGO_CRAWLER_ASYNC_TTL_RESULT=3600
```
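Since the project validates everything with Zod, these variables can be checked once at startup. The following is a minimal sketch of such a loader; the names and defaults are illustrative, not the actual `src/config.ts` API:

```typescript
import { z } from "zod";

// Illustrative sketch only; the real schema and loader live in src/config.ts.
const EnvSchema = z.object({
  BASANGO_CRAWLER_ASYNC_REDIS_URL: z.string().min(1),
  BASANGO_CRAWLER_ASYNC_QUEUE_LISTING: z.string().default("listing"),
  BASANGO_CRAWLER_ASYNC_QUEUE_DETAILS: z.string().default("details"),
  BASANGO_CRAWLER_ASYNC_QUEUE_PROCESSING: z.string().default("processing"),
  // Numeric variables arrive as strings, so coerce before validating.
  BASANGO_CRAWLER_FETCH_MAX_RETRIES: z.coerce.number().int().min(0).default(3),
  // Booleans are matched explicitly: z.coerce.boolean() would turn "false" into true.
  BASANGO_CRAWLER_FETCH_RESPECT_RETRY_AFTER: z
    .enum(["true", "false"])
    .default("true")
    .transform((v) => v === "true"),
  BASANGO_CRAWLER_FETCH_USER_AGENT: z.string().min(1),
  BASANGO_CRAWLER_UPDATE_DIRECTION: z.string().default("forward"),
  BASANGO_CRAWLER_ASYNC_TTL_FAILURE: z.coerce.number().int().default(3600),
  BASANGO_CRAWLER_ASYNC_TTL_RESULT: z.coerce.number().int().default(3600),
});

// Fails fast with a readable error if a variable is missing or malformed.
export const env = EnvSchema.parse(process.env);
```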
### 2. Source Configuration

Sources are configured in `config/sources.json`. Example source configuration:

```json
{
  "sources": {
    "html": [
      {
        "sourceId": "example.com",
        "sourceKind": "html",
        "sourceUrl": "https://example.com",
        "sourceSelectors": {
          "articles": ".article-list .article",
          "articleTitle": "h2.title",
          "articleLink": "a.permalink",
          "articleDate": ".publish-date",
          "articleBody": ".content",
          "pagination": ".pagination .next"
        },
        "requiresDetails": true,
        "supportsCategories": false
      }
    ]
  }
}
```

## Usage

### Synchronous Crawling

Perfect for immediate, one-time crawling tasks:

```bash
# Crawl a specific source
bun run crawler:sync -- --sourceId radiookapi.net

# Crawl with page range filter
bun run crawler:sync -- --sourceId radiookapi.net --pageRange 1:5

# Crawl with date range filter
bun run crawler:sync -- --sourceId radiookapi.net --dateRange 2024-01-01:2024-01-31

# Crawl specific category (if supported)
bun run crawler:sync -- --sourceId example.com --category politics
```

### Asynchronous Crawling

Best for large-scale operations and when you need job queuing:

```bash
# Schedule an async crawl job
bun run crawler:async -- --sourceId radiookapi.net

# Schedule with filters
bun run crawler:async -- --sourceId radiookapi.net --pageRange 1:10 --category economics
```

### Worker Management

Start workers to process async jobs (a minimal, illustrative worker sketch appears at the end of this README):

```bash
# Start workers for all queues
bun run crawler:worker

# Start workers for specific queues
bun run crawler:worker -- --queue listing --queue details

# Start workers with short option
bun run crawler:worker -- -q listing -q processing
```

## CLI Options

### Crawling Commands

| Option | Description | Example |
|--------|-------------|---------|
| `--sourceId` | **Required.** Source identifier from `sources.json` | `--sourceId radiookapi.net` |
| `--pageRange` | Page range to crawl (format: `start:end`) | `--pageRange 1:5` |
| `--dateRange` | Date range filter (format: `YYYY-MM-DD:YYYY-MM-DD`) | `--dateRange 2024-01-01:2024-01-31` |
| `--category` | Category slug to crawl | `--category politics` |

### Worker Commands

| Option | Description | Example |
|--------|-------------|---------|
| `--queue`, `-q` | Queue(s) to process (can be repeated) | `--queue listing --queue details` |

## Project Structure

```
basango/apps/crawler/
├── src/
│   ├── config.ts            # Configuration schema and loading
│   ├── constants.ts         # Application constants
│   ├── schema.ts            # Zod validation schemas
│   ├── utils.ts             # Utility functions
│   ├── http/                # HTTP client and utilities
│   ├── process/             # Core crawling logic
│   │   ├── async/           # Async processing (queues, workers)
│   │   ├── sync/            # Synchronous processing
│   │   ├── parsers/         # Content parsers (HTML, WordPress)
│   │   ├── crawler.ts       # Main crawler interface
│   │   └── persistence.ts   # Data persistence layer
│   ├── scripts/             # CLI entry points
│   │   ├── sync.ts          # Sync crawling script
│   │   ├── async.ts         # Async job scheduling
│   │   ├── worker.ts        # Worker process
│   │   └── utils.ts         # CLI utilities
│   └── __tests__/           # Test files
├── config/
│   ├── sources.json         # Source configurations
│   └── pipeline.json        # Pipeline settings
├── data/                    # Output directory for crawled data
└── package.json
```
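The async commands above schedule and drain BullMQ jobs, so it can help to see the underlying pattern. The sketch below illustrates a plain BullMQ worker against the `listing` queue and the Redis URL from the configuration above; the job payload shape is assumed for illustration, and the project's actual worker code lives in `src/scripts/worker.ts` and `src/process/async/`:

```typescript
import { Worker } from "bullmq";
import IORedis from "ioredis";

// BullMQ uses blocking Redis commands and requires maxRetriesPerRequest: null.
const connection = new IORedis(
  process.env.BASANGO_CRAWLER_ASYNC_REDIS_URL ?? "redis://localhost:6379/0",
  { maxRetriesPerRequest: null },
);

// Hypothetical payload for a listing job; the real schema is defined in src/schema.ts.
interface ListingJobData {
  sourceId: string;
  page: number;
}

// One Worker instance drains one queue; starting more workers scales horizontally.
const listingWorker = new Worker<ListingJobData>(
  "listing",
  async (job) => {
    // A real processor would fetch the listing page, extract article links,
    // and enqueue follow-up jobs on the "details" queue.
    console.log(`Processing page ${job.data.page} of ${job.data.sourceId}`);
  },
  { connection, concurrency: 5 },
);

listingWorker.on("failed", (job, err) => {
  console.error(`Job ${job?.id} failed: ${err.message}`);
});
```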
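Processed articles are written under `data/` as JSONL, one JSON document per line. As a rough sketch of consuming that output (the file name and field names here are hypothetical; the authoritative record shape is the Zod schema in `src/schema.ts`):

```typescript
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

// Stream line by line so large crawl outputs never have to fit in memory.
const lines = createInterface({
  input: createReadStream("data/radiookapi.net.jsonl"), // hypothetical file name
  crlfDelay: Infinity,
});

for await (const line of lines) {
  if (!line.trim()) continue;
  const article = JSON.parse(line);
  // Field names are illustrative; src/schema.ts defines the actual record shape.
  console.log(article.title ?? "(untitled)");
}
```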