# @basango/crawler

A powerful, scalable web crawler application built with Node.js and TypeScript for extracting and processing data from various news sources and websites.

The Basango Crawler is designed to systematically crawl news websites and extract article content. It supports both synchronous and asynchronous crawling modes, with configurable sources, queue-based processing, and robust error handling.
## Features

- **Multi-mode Operation**: Synchronous and asynchronous crawling capabilities
- **Queue-based Processing**: Uses BullMQ with Redis for scalable job processing
- **Configurable Sources**: JSON-based configuration for different website sources
- **HTML & WordPress Support**: Built-in parsers for HTML websites and WordPress APIs
- **Rate Limiting**: Respects website rate limits and implements backoff strategies
- **Data Persistence**: JSONL output format for processed articles
- **Worker Management**: Distributed worker system for parallel processing
- **Type Safety**: Full TypeScript implementation with Zod schema validation
## Prerequisites

- [Bun](https://bun.sh/) runtime (recommended) or Node.js (v22+)
- Redis server (for async operations)
- TypeScript knowledge for configuration
## Installation

```bash
# Navigate to the crawler directory
cd basango/apps/crawler

# Install dependencies
bun install
```
## Configuration

### 1. Environment Variables

Create a `.env.local` file with the following variables:

```bash
# Redis configuration for async operations
BASANGO_CRAWLER_ASYNC_REDIS_URL=redis://localhost:6379/0
BASANGO_CRAWLER_ASYNC_QUEUE_LISTING=listing
BASANGO_CRAWLER_ASYNC_QUEUE_DETAILS=details
BASANGO_CRAWLER_ASYNC_QUEUE_PROCESSING=processing

# Fetch configuration
BASANGO_CRAWLER_FETCH_MAX_RETRIES=3
BASANGO_CRAWLER_FETCH_RESPECT_RETRY_AFTER=true
BASANGO_CRAWLER_FETCH_USER_AGENT=Basango/0.1 (+https://github.com/bernard-ng/basango)

# Crawler behavior
BASANGO_CRAWLER_UPDATE_DIRECTION=forward

# TTL settings (in seconds)
BASANGO_CRAWLER_ASYNC_TTL_FAILURE=3600
BASANGO_CRAWLER_ASYNC_TTL_RESULT=3600
```
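Taken together, the fetch variables suggest bounded retries that honor a server's `Retry-After` header when enabled. A minimal sketch of that behavior with exponential backoff as the fallback (the function and its defaults are illustrative, not the crawler's actual implementation):

```typescript
// Illustrative retrying fetch: honors Retry-After when
// BASANGO_CRAWLER_FETCH_RESPECT_RETRY_AFTER is enabled.
const maxRetries = Number(process.env.BASANGO_CRAWLER_FETCH_MAX_RETRIES ?? 3);
const respectRetryAfter = process.env.BASANGO_CRAWLER_FETCH_RESPECT_RETRY_AFTER === "true";
const userAgent = process.env.BASANGO_CRAWLER_FETCH_USER_AGENT ?? "Basango/0.1";

async function fetchWithRetry(url: string): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const response = await fetch(url, { headers: { "User-Agent": userAgent } });
    if (response.ok || attempt >= maxRetries) return response;

    // Prefer the server's Retry-After hint; fall back to exponential backoff.
    const header = respectRetryAfter ? response.headers.get("Retry-After") : null;
    const retryAfter = header ? Number(header) : NaN;
    const delayMs = Number.isFinite(retryAfter) ? retryAfter * 1000 : 2 ** attempt * 1000;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```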
### 2. Source Configuration

Sources are configured in `config/sources.json`. Example source configuration:

```json
{
  "sources": {
    "html": [
      {
        "sourceId": "example.com",
        "sourceKind": "html",
        "sourceUrl": "https://example.com",
        "sourceSelectors": {
          "articles": ".article-list .article",
          "articleTitle": "h2.title",
          "articleLink": "a.permalink",
          "articleDate": ".publish-date",
          "articleBody": ".content",
          "pagination": ".pagination .next"
        },
        "requiresDetails": true,
        "supportsCategories": false
      }
    ]
  }
}
```
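The features list mentions Zod schema validation; a sketch of what a schema for the entry above could look like (`HtmlSourceSchema` is an assumed name, not necessarily the crawler's own definition):

```typescript
import { z } from "zod";

// Illustrative schema mirroring the sources.json entry above.
const HtmlSourceSchema = z.object({
  sourceId: z.string(),
  sourceKind: z.literal("html"),
  sourceUrl: z.string().url(),
  sourceSelectors: z.object({
    articles: z.string(),
    articleTitle: z.string(),
    articleLink: z.string(),
    articleDate: z.string(),
    articleBody: z.string(),
    pagination: z.string(),
  }),
  requiresDetails: z.boolean(),
  supportsCategories: z.boolean(),
});

type HtmlSource = z.infer<typeof HtmlSourceSchema>;
```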
## Usage

### Synchronous Crawling

Perfect for immediate, one-time crawling tasks:

```bash
# Crawl a specific source
bun run crawler:sync -- --sourceId radiookapi.net

# Crawl with page range filter
bun run crawler:sync -- --sourceId radiookapi.net --pageRange 1:5

# Crawl with date range filter
bun run crawler:sync -- --sourceId radiookapi.net --dateRange 2024-01-01:2024-01-31

# Crawl specific category (if supported)
bun run crawler:sync -- --sourceId example.com --category politics
```
Crawled data will be saved in the `data/` directory as JSONL files and can be pushed to the database with `bun run crawler:push -- --sourceId radiookapi.net`.
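Since JSONL stores one JSON record per line, the output is easy to consume with Bun's file API; a small sketch (the file name is illustrative):

```typescript
// Read one crawled JSONL file; each line is a standalone JSON article record.
const text = await Bun.file("data/radiookapi.net.jsonl").text();

const articles = text
  .split("\n")
  .filter((line) => line.trim().length > 0)
  .map((line) => JSON.parse(line));

console.log(`${articles.length} articles`);
```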
### Asynchronous Crawling

Best for large-scale operations and when you need job queuing:

```bash
# Schedule an async crawl job
bun run crawler:async -- --sourceId radiookapi.net

# Schedule with filters
bun run crawler:async -- --sourceId radiookapi.net --pageRange 1:10 --category economics
```
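The async commands enqueue jobs rather than crawling immediately. A rough sketch of what scheduling looks like with BullMQ (the queue name follows the env config above; the job name and payload shape are assumptions):

```typescript
import IORedis from "ioredis";
import { Queue } from "bullmq";

// Connect to the same Redis instance configured for async operations.
const connection = new IORedis(
  process.env.BASANGO_CRAWLER_ASYNC_REDIS_URL ?? "redis://localhost:6379/0",
);

// Queue name matches BASANGO_CRAWLER_ASYNC_QUEUE_LISTING.
const listing = new Queue("listing", { connection });

// Payload shape is illustrative only.
await listing.add("crawl-listing", {
  sourceId: "radiookapi.net",
  pageRange: { start: 1, end: 10 },
  category: "economics",
});
```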
### Worker Management

Start workers to process async jobs:

```bash
# Start workers for all queues
bun run crawler:worker

# Start workers for specific queues
bun run crawler:worker -- --queue listing --queue details

# Start workers with the short option
bun run crawler:worker -- -q listing -q processing
```
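Workers consume jobs from the Redis-backed queues. A minimal BullMQ worker sketch for the `listing` queue (the processor body and event handling are illustrative, not the crawler's actual code):

```typescript
import IORedis from "ioredis";
import { Worker } from "bullmq";

// BullMQ workers require maxRetriesPerRequest: null on the Redis connection.
const connection = new IORedis(
  process.env.BASANGO_CRAWLER_ASYNC_REDIS_URL ?? "redis://localhost:6379/0",
  { maxRetriesPerRequest: null },
);

const worker = new Worker(
  "listing",
  async (job) => {
    // Fetch and parse the listing page described by the job payload.
    console.log(`processing ${job.name} for ${job.data.sourceId}`);
  },
  { connection },
);

worker.on("failed", (job, err) => console.error(job?.id, err.message));
```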
## CLI Options

### Crawling Commands

| Option | Description | Example |
|--------|-------------|---------|
| `--sourceId` | **Required.** Source identifier from `sources.json` | `--sourceId radiookapi.net` |
| `--pageRange` | Page range to crawl (format: `start:end`) | `--pageRange 1:5` |
| `--dateRange` | Date range filter (format: `YYYY-MM-DD:YYYY-MM-DD`) | `--dateRange 2024-01-01:2024-01-31` |
| `--category` | Category slug to crawl | `--category politics` |
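Both range options use a colon-separated `start:end` format. A hypothetical parser for `--pageRange`, just to make the format concrete (`parsePageRange` is not part of the CLI):

```typescript
// Hypothetical helper: parses "1:5" into a numeric page range.
function parsePageRange(input: string): { start: number; end: number } {
  const [start, end] = input.split(":").map(Number);
  if (!Number.isInteger(start) || !Number.isInteger(end) || start > end) {
    throw new Error(`Invalid page range: ${input}`);
  }
  return { start, end };
}

console.log(parsePageRange("1:5")); // { start: 1, end: 5 }
```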
### Worker Commands

| Option | Description | Example |
|--------|-------------|---------|
| `--queue`, `-q` | Specify queue(s) to process (can be used multiple times) | `--queue listing --queue details` |