feat(monorepo): migrate to typescript monorepo
@@ -0,0 +1,179 @@
# @basango/crawler

A powerful, scalable web crawler application built with Node.js and TypeScript for extracting and processing data from various news sources and websites.

The Basango Crawler is designed to systematically crawl news websites and extract article content. It supports both synchronous and asynchronous crawling modes, with configurable sources, queue-based processing, and robust error handling.

## Features

- **Multi-mode Operation**: Synchronous and asynchronous crawling capabilities
- **Queue-based Processing**: Uses BullMQ with Redis for scalable job processing
- **Configurable Sources**: JSON-based configuration for different website sources
- **HTML & WordPress Support**: Built-in parsers for HTML websites and WordPress APIs
- **Rate Limiting**: Respects website rate limits and implements backoff strategies
- **Data Persistence**: JSONL output format for processed articles
- **Worker Management**: Distributed worker system for parallel processing
- **Type Safety**: Full TypeScript implementation with Zod schema validation
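
The rate-limiting feature above can be pictured with a small sketch of how a retry delay might be chosen. This is not the crawler's actual code: `retryDelayMs`, `baseMs`, and `maxMs` are hypothetical names, and the 500 ms base and 30 s cap are illustrative values. It assumes the server may send a `Retry-After` header holding a delay in whole seconds, and falls back to exponential backoff otherwise.

```typescript
// Hypothetical helper, not taken from the Basango source.
// attempt is zero-based: 0 for the first retry, 1 for the second, and so on.
function retryDelayMs(
  attempt: number,
  retryAfter: string | null,
  baseMs: number = 500,
  maxMs: number = 30000,
): number {
  // Retry-After may carry a delay in whole seconds.
  // The HTTP-date form of the header is deliberately ignored in this sketch.
  if (retryAfter !== null && /^\d+$/.test(retryAfter.trim())) {
    return Math.min(Number(retryAfter.trim()) * 1000, maxMs);
  }
  // Otherwise use exponential backoff: base, 2x base, 4x base, ... capped at maxMs.
  return Math.min(baseMs * Math.pow(2, attempt), maxMs);
}
```

A caller would stop retrying once the attempt count reaches the configured maximum and sleep for the returned number of milliseconds between attempts.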

## Prerequisites

- [Bun](https://bun.sh/) runtime (recommended) or Node.js (v22+)
- Redis server (for async operations)
- TypeScript knowledge for configuration

## Installation

```bash
# Navigate to the crawler directory
cd basango/apps/crawler

# Install dependencies
bun install
```

## Configuration

### 1. Environment Variables

Create a `.env.local` file with the following variables:

```bash
# Redis configuration for async operations
BASANGO_CRAWLER_ASYNC_REDIS_URL=redis://localhost:6379/0
BASANGO_CRAWLER_ASYNC_QUEUE_LISTING=listing
BASANGO_CRAWLER_ASYNC_QUEUE_DETAILS=details
BASANGO_CRAWLER_ASYNC_QUEUE_PROCESSING=processing

# Fetch configuration
BASANGO_CRAWLER_FETCH_MAX_RETRIES=3
BASANGO_CRAWLER_FETCH_RESPECT_RETRY_AFTER=true
BASANGO_CRAWLER_FETCH_USER_AGENT=Basango/0.1 (+https://github.com/bernard-ng/basango)

# Crawler behavior
BASANGO_CRAWLER_UPDATE_DIRECTION=forward

# TTL settings (in seconds)
BASANGO_CRAWLER_ASYNC_TTL_FAILURE=3600
BASANGO_CRAWLER_ASYNC_TTL_RESULT=3600
```
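
To show how these variables might be turned into typed settings, here is a dependency-free sketch. It is an illustration only: the real loader lives in `src/config.ts` and, per the feature list, relies on Zod. The names `loadSettings`, `CrawlerSettings`, `readInt`, and `readBool` are hypothetical, and using the example values above as defaults is an assumption.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical shape; the actual schema in src/config.ts may differ.
interface CrawlerSettings {
  fetch: { maxRetries: number; respectRetryAfter: boolean; userAgent: string };
  queue: {
    redisUrl: string;
    names: { listing: string; details: string; processing: string };
    ttlFailureSeconds: number;
    ttlResultSeconds: number;
  };
}

const PREFIX = "BASANGO_CRAWLER_";

// Read a string variable, falling back when it is unset or empty.
function readString(env: Env, key: string, fallback: string): string {
  const raw = env[PREFIX + key];
  return raw === undefined || raw === "" ? fallback : raw;
}

// Read a non-negative integer variable and reject anything else.
function readInt(env: Env, key: string, fallback: number): number {
  const raw = env[PREFIX + key];
  if (raw === undefined || raw === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${PREFIX}${key} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

// Read a boolean variable written as the literal "true" or "false".
function readBool(env: Env, key: string, fallback: boolean): boolean {
  const raw = env[PREFIX + key];
  if (raw === undefined || raw === "") return fallback;
  if (raw !== "true" && raw !== "false") {
    throw new Error(`${PREFIX}${key} must be "true" or "false", got "${raw}"`);
  }
  return raw === "true";
}

// Defaults mirror the example .env.local values (an assumption).
function loadSettings(env: Env): CrawlerSettings {
  return {
    fetch: {
      maxRetries: readInt(env, "FETCH_MAX_RETRIES", 3),
      respectRetryAfter: readBool(env, "FETCH_RESPECT_RETRY_AFTER", true),
      userAgent: readString(env, "FETCH_USER_AGENT", "Basango/0.1"),
    },
    queue: {
      redisUrl: readString(env, "ASYNC_REDIS_URL", "redis://localhost:6379/0"),
      names: {
        listing: readString(env, "ASYNC_QUEUE_LISTING", "listing"),
        details: readString(env, "ASYNC_QUEUE_DETAILS", "details"),
        processing: readString(env, "ASYNC_QUEUE_PROCESSING", "processing"),
      },
      ttlFailureSeconds: readInt(env, "ASYNC_TTL_FAILURE", 3600),
      ttlResultSeconds: readInt(env, "ASYNC_TTL_RESULT", 3600),
    },
  };
}
```

In a script this would typically be called as `loadSettings(process.env)`. Failing fast on a malformed value keeps a typo in `.env.local` from surfacing later as a confusing runtime error.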

### 2. Source Configuration

Sources are configured in `config/sources.json`. Example source configuration:

```json
{
  "sources": {
    "html": [
      {
        "sourceId": "example.com",
        "sourceKind": "html",
        "sourceUrl": "https://example.com",
        "sourceSelectors": {
          "articles": ".article-list .article",
          "articleTitle": "h2.title",
          "articleLink": "a.permalink",
          "articleDate": ".publish-date",
          "articleBody": ".content",
          "pagination": ".pagination .next"
        },
        "requiresDetails": true,
        "supportsCategories": false
      }
    ]
  }
}
```
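
As a rough illustration of how a `sourceId` could be resolved against this file, the sketch below models only the fields shown in the example. `HtmlSource`, `SourcesFile`, `findSource`, and `resolveCategory` are hypothetical names, and other source kinds (such as WordPress) are left out because their shape is not shown here. The project itself validates configuration with Zod, so treat this as a simplified stand-in.

```typescript
// Hypothetical types covering only the fields in the example above.
interface HtmlSource {
  sourceId: string;
  sourceKind: "html";
  sourceUrl: string;
  sourceSelectors: Record<string, string>;
  requiresDetails: boolean;
  supportsCategories: boolean;
}

interface SourcesFile {
  sources: { html: HtmlSource[] };
}

// Look up a configured source by its identifier, failing loudly when missing.
function findSource(config: SourcesFile, sourceId: string): HtmlSource {
  for (const source of config.sources.html) {
    if (source.sourceId === sourceId) return source;
  }
  throw new Error(`Unknown sourceId "${sourceId}"`);
}

// Reject a category filter for sources that do not declare category support.
// Whether the real crawler rejects or silently ignores it is an assumption.
function resolveCategory(source: HtmlSource, category?: string): string | undefined {
  if (category !== undefined && !source.supportsCategories) {
    throw new Error(`Source "${source.sourceId}" does not support categories`);
  }
  return category;
}
```

With the example configuration, `findSource(config, "example.com")` returns the entry, while passing a category to `resolveCategory` for it throws because `supportsCategories` is `false`.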

## Usage

### Synchronous Crawling

Perfect for immediate, one-time crawling tasks:

```bash
# Crawl a specific source
bun run crawler:sync -- --sourceId radiookapi.net

# Crawl with page range filter
bun run crawler:sync -- --sourceId radiookapi.net --pageRange 1:5

# Crawl with date range filter
bun run crawler:sync -- --sourceId radiookapi.net --dateRange 2024-01-01:2024-01-31

# Crawl specific category (if supported)
bun run crawler:sync -- --sourceId example.com --category politics
```

### Asynchronous Crawling

Best for large-scale operations and when you need job queuing:

```bash
# Schedule an async crawl job
bun run crawler:async -- --sourceId radiookapi.net

# Schedule with filters
bun run crawler:async -- --sourceId radiookapi.net --pageRange 1:10 --category economics
```

### Worker Management

Start workers to process async jobs:

```bash
# Start workers for all queues
bun run crawler:worker

# Start workers for specific queues
bun run crawler:worker -- --queue listing --queue details

# Start workers with short option
bun run crawler:worker -- -q listing -q processing
```

## CLI Options

### Crawling Commands

| Option | Description | Example |
|--------|-------------|---------|
| `--sourceId` | **Required.** Source identifier from sources.json | `--sourceId radiookapi.net` |
| `--pageRange` | Page range to crawl (format: start:end) | `--pageRange 1:5` |
| `--dateRange` | Date range filter (format: YYYY-MM-DD:YYYY-MM-DD) | `--dateRange 2024-01-01:2024-01-31` |
| `--category` | Category slug to crawl | `--category politics` |
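
The `start:end` formats in the table can be parsed along the following lines. This is a sketch under assumptions, not the CLI's real implementation: `parsePageRange` and `parseDateRange` are hypothetical names, both dates are read as midnight UTC, and rejecting a range whose start exceeds its end is a design choice made here.

```typescript
interface PageRange {
  start: number;
  end: number;
}

interface DateRange {
  start: Date;
  end: Date;
}

// Parse "start:end" where both sides are whole page numbers.
function parsePageRange(value: string): PageRange {
  const match = /^(\d+):(\d+)$/.exec(value);
  if (match === null) {
    throw new Error(`Invalid page range "${value}", expected start:end`);
  }
  const start = Number(match[1]);
  const end = Number(match[2]);
  if (start > end) {
    throw new Error(`Page range start ${start} is after end ${end}`);
  }
  return { start, end };
}

// Parse "YYYY-MM-DD:YYYY-MM-DD"; both dates are interpreted as midnight UTC.
function parseDateRange(value: string): DateRange {
  const match = /^(\d{4}-\d{2}-\d{2}):(\d{4}-\d{2}-\d{2})$/.exec(value);
  if (match === null) {
    throw new Error(`Invalid date range "${value}", expected YYYY-MM-DD:YYYY-MM-DD`);
  }
  const start = new Date(`${match[1]}T00:00:00Z`);
  const end = new Date(`${match[2]}T00:00:00Z`);
  if (isNaN(start.getTime()) || isNaN(end.getTime())) {
    throw new Error(`Invalid calendar date in "${value}"`);
  }
  if (start.getTime() > end.getTime()) {
    throw new Error(`Date range start is after end in "${value}"`);
  }
  return { start, end };
}
```

For instance, `parsePageRange("1:5")` yields `{ start: 1, end: 5 }`. Whether the end date is treated as inclusive is left to the caller in this sketch.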

### Worker Commands

| Option | Description | Example |
|--------|-------------|---------|
| `--queue`, `-q` | Specify queue(s) to process (can be used multiple times) | `--queue listing --queue details` |
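
A repeatable flag like `--queue` / `-q` can be collected with a simple loop. The sketch below is illustrative: `collectQueues` and `KNOWN_QUEUES` are hypothetical names, the queue names are taken from the example environment values (they may be overridden there), and defaulting to every queue when no flag is given mirrors the "all queues" behaviour described in the usage section.

```typescript
// Queue names from the example .env.local; treated as fixed here for simplicity.
const KNOWN_QUEUES: string[] = ["listing", "details", "processing"];

// Collect every value passed via --queue or -q, in order, without duplicates.
function collectQueues(argv: string[]): string[] {
  const selected: string[] = [];
  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    if (flag === "--queue" || flag === "-q") {
      const name = argv[i + 1];
      if (name === undefined || KNOWN_QUEUES.indexOf(name) === -1) {
        throw new Error(`Unknown or missing queue name after ${flag}`);
      }
      if (selected.indexOf(name) === -1) selected.push(name);
      i++; // skip the value that was just consumed
    }
  }
  // No flag given: process all queues.
  return selected.length > 0 ? selected : KNOWN_QUEUES.slice();
}
```

Called with `["-q", "listing", "-q", "processing"]` it returns those two queues, and with an empty argument list it returns all three.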

## Project Structure

```
basango/apps/crawler/
├── src/
│   ├── config.ts          # Configuration schema and loading
│   ├── constants.ts       # Application constants
│   ├── schema.ts          # Zod validation schemas
│   ├── utils.ts           # Utility functions
│   ├── http/              # HTTP client and utilities
│   ├── process/           # Core crawling logic
│   │   ├── async/         # Async processing (queues, workers)
│   │   ├── sync/          # Synchronous processing
│   │   ├── parsers/       # Content parsers (HTML, WordPress)
│   │   ├── crawler.ts     # Main crawler interface
│   │   └── persistence.ts # Data persistence layer
│   ├── scripts/           # CLI entry points
│   │   ├── crawl.ts       # Sync crawling script
│   │   ├── queue.ts       # Async job scheduling
│   │   ├── worker.ts      # Worker process
│   │   └── utils.ts       # CLI utilities
│   └── __tests__/         # Test files
├── config/
│   ├── sources.json       # Source configurations
│   └── pipeline.json      # Pipeline settings
├── data/                  # Output directory for crawled data
└── package.json
```