# DRC Names Gender Prediction Pipeline: A Culturally-Aware NLP System for Congolese Name Analysis

A comprehensive, research-friendly pipeline for analyzing Congolese names and predicting gender using culturally-aware machine learning models. This system provides advanced data processing, experiment management, and an intuitive web interface for non-technical users.

## Overview

Despite the growing success of gender inference models in Natural Language Processing (NLP), these tools often underperform when applied to culturally diverse African contexts due to the lack of culturally-representative training data. This project introduces a comprehensive pipeline for Congolese name analysis with a large-scale dataset of over 7 million names from the Democratic Republic of Congo (DRC), annotated with gender and demographic metadata.

Our approach involves:

1. **Advanced data processing pipeline** with batching, checkpointing, and parallel processing
2. **Modular experiment framework** for systematic model comparison and research iteration
3. **Multiple feature extraction strategies** leveraging name components, linguistic patterns, and demographic data
4. **Culturally-aware gender prediction models** trained specifically on Congolese naming patterns
5. **User-friendly web interface** enabling non-technical users to run experiments and make predictions
6. **Comprehensive research tools** for reproducible experimentation and result analysis

## Getting Started

The instructions and command-line snippets below will help you set up the project environment quickly and efficiently. They assume you have Python 3.11 and Git installed and working on a Unix-like system (Linux, macOS, etc.).

## Key Features

### **Advanced Data Processing**

- **Batched processing** with configurable batch sizes and parallel execution
- **Automatic checkpointing** and resume capability for large datasets
- **LLM-powered annotation** with rate limiting and retry logic
- **Memory-efficient** chunked data loading for datasets of any size
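
The batching-plus-checkpointing pattern above can be sketched in a few lines. This is an illustrative helper, not the project's actual implementation (which lives in the processing pipeline); the function and file names here are hypothetical:

```python
import json
from pathlib import Path


def process_in_batches(rows, handle_batch, checkpoint_path="checkpoint.json", batch_size=1000):
    """Process rows in fixed-size batches, recording progress after each
    batch so an interrupted run can resume where it left off."""
    ckpt = Path(checkpoint_path)
    start = json.loads(ckpt.read_text())["next_batch"] if ckpt.exists() else 0
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    for idx in range(start, len(batches)):
        handle_batch(batches[idx])  # e.g. clean or annotate a batch of names
        ckpt.write_text(json.dumps({"next_batch": idx + 1}))  # commit progress
    return len(batches)
```

If the process dies mid-run, the next invocation reads the checkpoint file and skips already-completed batches.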

### **Research-Friendly Experiment Framework**

- **Modular model architecture** - easily add new models and features
- **Systematic experiment tracking** with automatic result storage
- **Feature ablation studies** and component analysis tools
- **Cross-validation** and statistical significance testing
- **Automated baseline comparisons** and performance analysis

### **Intuitive Web Interface**

- **No-code experiment creation** with visual parameter selection
- **Real-time monitoring** of data processing and training progress
- **Interactive result visualization** with charts and comparisons
- **Batch prediction capabilities** for CSV file upload and processing
- **Model comparison tools** with automatic performance rankings

### **Comprehensive Analytics**

- **Feature importance analysis** showing which name components matter most
- **Province-specific studies** examining regional naming patterns
- **Learning curve analysis** for understanding data requirements
- **Prediction confidence scoring** and error analysis tools
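
The LLM annotation step combines rate limiting with retry logic. A minimal sketch of that pattern (the function and parameter names are illustrative, not this project's API; `request_fn` stands in for the real LLM client call):

```python
import time


def call_with_retry(request_fn, max_retries=3, requests_per_minute=60):
    """Throttle calls to an annotation backend and retry transient
    failures with exponential backoff."""
    min_interval = 60.0 / requests_per_minute
    for attempt in range(max_retries):
        try:
            result = request_fn()
            time.sleep(min_interval)  # simple rate limiting between calls
            return result
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
```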

## Quick Start

### Using Make Commands (Recommended)

```bash
# Complete setup and basic processing
make quick-start

# Launch web interface
make web

# Run research workflow
make research-flow

# Show all available commands
make help
```

### Manual Installation

```bash
git clone https://github.com/bernard-ng/drc-ners-nlp.git
cd drc-ners-nlp

# Setup environment
make setup
make activate

# Process the dataset
make process

# Launch web interface
make web
```

## Usage

### Web Interface (Recommended for Non-Technical Users)

Launch the Streamlit web application:

```bash
make web
```

The interface provides:

- **Dashboard**: Overview of datasets and recent experiments
- **Data Overview**: Interactive data exploration and statistics
- **Data Processing**: Monitor and control the processing pipeline
- **Experiments**: Create and manage machine learning experiments
- **Results & Analysis**: Compare models and analyze performance
- **Predictions**: Make predictions on new names or upload CSV files
- **Settings**: Configure the system and manage data

### Research & Experiments

#### Quick Research Studies

```bash
# Compare different approaches (full name vs native vs surname)
make baseline

# Analyze which name components are most effective
make components

# Test feature importance through ablation study
make ablation

# View all experiment results
make list-experiments

# Export results for publication
make export-results
```

#### Custom Experiments

```bash
# Run specific experiment via command line
python research/cli.py run \
    --name "native_name_study" \
    --features native_name \
    --model-type logistic_regression \
    --description "Test native name effectiveness"

# Compare multiple experiments
python research/cli.py compare <exp_id_1> <exp_id_2>

# View detailed results
python research/cli.py show <experiment_id>
```

### Data Processing Pipeline

#### Basic Processing (No LLM)

```bash
make process-basic  # Fast processing without LLM annotation
```

#### Complete Processing (With LLM)

```bash
make process      # Full pipeline including LLM annotation
make process-dev  # Development mode with smaller batches
```

#### Monitor Progress

```bash
make monitoring  # Show current pipeline status
make status      # Show overall system status
```

#### Resume Interrupted Processing

```bash
make process-resume  # Resume from last checkpoint
```

### Available Models and Features

#### Models

- **Logistic Regression**: Character n-gram based classification
- **Random Forest**: Engineered feature-based classification
- **LSTM**: Sequential neural network (planned)
- **Transformer**: Attention-based model (planned)

#### Features

- **Full Name**: Complete name as given
- **Native Name**: Identified native/given name component
- **Surname**: Family name component
- **Name Length**: Character count features
- **Word Count**: Number of words in name
- **Province**: Geographic/demographic features
- **Name Beginnings/Endings**: Prefix/suffix patterns
- **Character N-grams**: Linguistic pattern features
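
To make the character n-gram and prefix/suffix features concrete, here is a stdlib-only sketch of how such features can be counted; it is illustrative only (in practice a vectorizer such as scikit-learn's `CountVectorizer` with a character analyzer does this job):

```python
from collections import Counter


def char_ngrams(name, n_min=2, n_max=3):
    """Count character n-grams of a name. Boundary markers (^ and $)
    turn name beginnings and endings into distinct features."""
    padded = f"^{name.lower()}$"
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts
```

For example, `char_ngrams("Ngoy", 2, 2)` yields the bigrams `^n`, `ng`, `go`, `oy`, and `y$`, where `^n` captures a name beginning and `y$` a name ending.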

## Configuration

### Environment Configurations

```bash
# Switch to development configuration (smaller batches, more logging)
make config-dev

# Switch to production configuration (optimized for performance)
make config-prod

# View current configuration
make show-config
```

### Manual Setup

If you prefer not to use the Makefile, you can set up the environment manually:

```bash
git clone https://github.com/bernard-ng/drc-ners-nlp.git
cd drc-ners-nlp

# Setup environment
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install jupyter notebook ipykernel pytest black flake8 mypy
```

### Custom Configuration

Edit the configuration files in `config/`:

- `pipeline.yaml` - Main configuration
- `pipeline.development.yaml` - Development overrides
- `pipeline.production.yaml` - Production settings

## Data Processing

This project includes a robust data processing pipeline designed to handle large datasets efficiently, with batching, checkpointing, and parallel processing capabilities. Processing steps are defined in the `drc-ners-nlp/processing/steps` directory, and the configuration that enables them is managed through the `drc-ners-nlp/config/pipeline.yaml` file.

**Pipeline Configuration**

Example configuration:

```yaml
processing:
  batch_size: 1000
  max_workers: 4

llm:
  model_name: "mistral:7b"
  requests_per_minute: 60

data:
  split_evaluation: true
  split_by_gender: true

stages:
  - "data_cleaning"
  - "feature_extraction"
  - "llm_annotation"
  - "data_splitting"
```
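
The environment-specific files (`pipeline.development.yaml`, `pipeline.production.yaml`) override the base configuration. A recursive merge like the following captures that override semantics; this is a sketch of the idea, not the project's actual loader (which presumably lives in `core/config.py`):

```python
def merge_config(base, override):
    """Recursively overlay an environment-specific config (e.g. the
    development overrides) onto the base pipeline configuration."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)  # merge nested sections
        else:
            merged[key] = value  # override scalar or replace wholesale
    return merged
```

With this semantics, a development file only needs to state the keys it changes (say, a smaller `batch_size`); everything else falls through to `pipeline.yaml`.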

## Research Capabilities

### Systematic Experimentation

The framework supports systematic research through:

1. **Baseline Studies**: Compare fundamental approaches
2. **Feature Studies**: Test individual name components
3. **Ablation Studies**: Identify most important features
4. **Cross-Province Analysis**: Test generalization across regions
5. **Hyperparameter Optimization**: Systematic parameter tuning

### Reproducible Research

- **Experiment Tracking**: All experiments automatically logged with full configuration
- **Result Export**: CSV export for publication and further analysis
- **Statistical Testing**: Cross-validation and confidence intervals
- **Version Control**: Configuration-based approach enables easy replication
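
The cross-validation-with-confidence-intervals idea can be sketched with the standard library alone. This is an illustrative, stdlib-only outline (the `train_fn` callback and the normal-approximation interval are assumptions for the sketch, not the framework's actual API):

```python
import statistics


def kfold_scores(examples, labels, train_fn, k=5):
    """Estimate accuracy with k-fold cross-validation and report the
    mean with an approximate 95% confidence half-width."""
    folds = [list(range(i, len(examples), k)) for i in range(k)]
    scores = []
    for held_out in folds:
        train_idx = [i for i in range(len(examples)) if i not in held_out]
        # train_fn returns a predict(example) -> label callable
        predict = train_fn([examples[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        correct = sum(predict(examples[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, half_width
```

Reporting `mean ± half_width` rather than a single accuracy number is what makes model comparisons in the results tables statistically meaningful.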

### Publication-Ready Output

```bash
# Generate comprehensive results for publication
make research-flow
make export-results

# Get best models for each approach
make list-completed
python research/cli.py list --status completed | head -10
```

**Running the Pipeline**

```bash
python main.py --env development
```

## Development

### Code Quality and Testing

```bash
make format      # Format code with black
make lint        # Lint with flake8
make check-deps  # Verify dependencies
```

## Experiments

This project provides a modular experiment framework (model training and evaluation) for systematic model comparison and research iteration. Models are defined in the `drc-ners-nlp/research/models` directory. You can define model features, training parameters, and evaluation metrics in the `research_templates.yaml` file.

**Running Experiments**

```bash
python train.py --name="bigru" --type="baseline" --env="development"
python train.py --name="cnn" --type="baseline" --env="development"
python train.py --name="lightgbm" --type="baseline" --env="development"

python train.py --name="logistic_regression_fullname" --type="baseline" --env="development"
python train.py --name="logistic_regression_native" --type="baseline" --env="development"
python train.py --name="logistic_regression_surname" --type="baseline" --env="development"

python train.py --name="lstm" --type="baseline" --env="development"
python train.py --name="random_forest" --type="baseline" --env="development"
python train.py --name="svm" --type="baseline" --env="development"
python train.py --name="naive_bayes" --type="baseline" --env="development"
python train.py --name="transformer" --type="baseline" --env="development"
python train.py --name="xgboost" --type="baseline" --env="development"
```

## Web Interface

This project includes a user-friendly web interface built with Streamlit, allowing non-technical users to run experiments and make predictions without needing to understand the underlying code.

### Running the Web Interface

```bash
streamlit run app.py
```

### Development Workflow

```bash
make daily-work  # Daily development setup
make notebook    # Launch Jupyter for analysis
make web-dev     # Launch web interface with auto-reload
```

### Data Management

```bash
make check-data         # Verify all data files
make data-stats         # Show dataset statistics
make backup-data        # Create timestamped backup
make clean-checkpoints  # Clean processing checkpoints
```

## Project Structure

```
├── Makefile                       # All command shortcuts
├── streamlit_app.py               # Web interface application
├── config/                        # Configuration files
│   ├── pipeline.yaml              # Main configuration
│   ├── pipeline.development.yaml  # Dev settings
│   └── pipeline.production.yaml   # Prod settings
├── core/                          # Core framework
│   ├── config.py                  # Configuration management
│   ├── domain.py                  # Domain-specific data
│   └── utils.py                   # Reusable utilities
├── processing/                    # Data processing pipeline
│   ├── main.py                    # Main pipeline script
│   ├── pipeline.py                # Pipeline framework
│   ├── steps_config.py            # Configurable processing steps
│   └── monitor.py                 # Monitoring utilities
├── research/                      # Research and experiments
│   ├── cli.py                     # Command-line interface
│   ├── experiment.py              # Experiment management
│   ├── models.py                  # Model implementations
│   └── runner.py                  # Experiment execution
└── dataset/                       # Data files
    └── names.csv                  # Raw dataset
```

## Citation

If you use this pipeline in your research, please cite:

```bibtex
@software{drc_names_pipeline,
  title={DRC Names Gender Prediction Pipeline: A Culturally-Aware NLP System},
  author={Your Name},
  year={2025},
  url={https://github.com/bernard-ng/drc-ners-nlp}
}
```

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- Democratic Republic of Congo population data contributors
- Open source NLP and machine learning communities
- Cultural linguistics research communities

## Contributors

<a href="https://github.com/bernard-ng/drc-ners-nlp/graphs/contributors" title="show all contributors">
  <img src="https://contrib.rocks/image?repo=bernard-ng/drc-ners-nlp" alt="contributors"/>
</a>