feat: enhance training pipeline with research templates and experiment configuration

2025-08-08 23:48:55 +02:00
parent 96291b4ad0
commit 6d39c3afc1
9 changed files with 341 additions and 755 deletions
@@ -1,69 +1,20 @@
-# DRC Names Gender Prediction Pipeline: A Culturally-Aware NLP System for Congolese Name Analysis
+# A Culturally-Aware NLP System for Congolese Name Analysis and Gender Inference

-A comprehensive, research-friendly pipeline for analyzing Congolese names and predicting gender using culturally-aware machine learning models. 
-This system provides advanced data processing, experiment management, and an intuitive web interface for non-technical users.
+Despite the growing success of gender inference models in Natural Language Processing (NLP), these tools often
+underperform when applied to culturally diverse African contexts due to the lack of culturally-representative training
+data.
+This project introduces a comprehensive pipeline for Congolese name analysis with a large-scale dataset of over 5
+million names from the Democratic Republic of Congo (DRC) annotated with gender and demographic metadata.

-## Overview
+## Getting Started

-Despite the growing success of gender inference models in Natural Language Processing (NLP), these tools often underperform when applied to culturally diverse African contexts due to the lack of culturally-representative training data. 
-This project introduces a comprehensive pipeline for Congolese name analysis with a large-scale dataset of over 7 million names from the Democratic Republic of Congo (DRC) annotated with gender and demographic metadata.
+### Installation & Setup

-Our approach involves:
+The instructions and command-line snippets below help you set up the project environment quickly and
+efficiently.
+They assume that Python 3.11 and Git are installed and that you are working on a Unix-like system (Linux, macOS, etc.).

- **(1) Advanced data processing pipeline** with batching, checkpointing, and parallel processing
- **(2) Modular experiment framework** for systematic model comparison and research iteration  
- **(3) Multiple feature extraction strategies** leveraging name components, linguistic patterns, and demographic data
- **(4) Culturally-aware gender prediction models** trained specifically on Congolese naming patterns
- **(5) User-friendly web interface** enabling non-technical users to run experiments and make predictions
- **(6) Comprehensive research tools** for reproducible experimentation and result analysis
-
-## Key Features
-
-### **Advanced Data Processing**
- **Batched processing** with configurable batch sizes and parallel execution
- **Automatic checkpointing** and resume capability for large datasets
- **LLM-powered annotation** with rate limiting and retry logic
- **Memory-efficient** chunked data loading for datasets of any size
-
-### **Research-Friendly Experiment Framework**
- **Modular model architecture** - easily add new models and features
- **Systematic experiment tracking** with automatic result storage
- **Feature ablation studies** and component analysis tools
- **Cross-validation** and statistical significance testing
- **Automated baseline comparisons** and performance analysis
-
-### **Intuitive Web Interface**
- **No-code experiment creation** with visual parameter selection
- **Real-time monitoring** of data processing and training progress
- **Interactive result visualization** with charts and comparisons
- **Batch prediction capabilities** for CSV file upload and processing
- **Model comparison tools** with automatic performance rankings
-
-### **Comprehensive Analytics**
- **Feature importance analysis** showing which name components matter most
- **Province-specific studies** examining regional naming patterns
- **Learning curve analysis** for understanding data requirements
- **Prediction confidence scoring** and error analysis tools
-
-## Quick Start
-
-### Using Make Commands (Recommended)
-
-```bash
-# Complete setup and basic processing
-make quick-start
-
-# Launch web interface
-make web
-
-# Run research workflow  
-make research-flow
-
-# Show all available commands
-make help
-```
-
-### Manual Installation
+**Using Makefile (Recommended)**

 ```bash
 git clone https://github.com/bernard-ng/drc-ners-nlp.git
@@ -71,246 +22,88 @@ cd drc-ners-nlp

 # Setup environment
 make setup
-make process
-
-# Launch web interface
-make web
+make activate
 ```

-## Usage
-
-### Web Interface (Recommended for Non-Technical Users)
-
-Launch the Streamlit web application:
-```bash
-make web
-```
-
-The interface provides:
- **Dashboard**: Overview of datasets and recent experiments
- **Data Overview**: Interactive data exploration and statistics  
- **Data Processing**: Monitor and control the processing pipeline
- **Experiments**: Create and manage machine learning experiments
- **Results & Analysis**: Compare models and analyze performance
- **Predictions**: Make predictions on new names or upload CSV files
- **Settings**: Configure the system and manage data
-
-### Research & Experiments
-
-#### Quick Research Studies
-```bash
-# Compare different approaches (full name vs native vs surname)
-make baseline
-
-# Analyze which name components are most effective
-make components  
-
-# Test feature importance through ablation study
-make ablation
-
-# View all experiment results
-make list-experiments
-
-# Export results for publication
-make export-results
-```
-
-#### Custom Experiments
-```bash
-# Run specific experiment via command line
-python research/cli.py run \
-  --name "native_name_study" \
-  --features native_name \
-  --model-type logistic_regression \
-  --description "Test native name effectiveness"
-
-# Compare multiple experiments
-python research/cli.py compare <exp_id_1> <exp_id_2>
-
-# View detailed results
-python research/cli.py show <experiment_id>
-```
-
-### Data Processing Pipeline
-
-#### Basic Processing (No LLM)
-```bash
-make process-basic    # Fast processing without LLM annotation
-```
-
-#### Complete Processing (With LLM)
-```bash
-make process         # Full pipeline including LLM annotation
-make process-dev     # Development mode with smaller batches
-```
-
-#### Monitor Progress
-```bash
-make monitoring         # Show current pipeline status
-make status          # Show overall system status
-```
-
-#### Resume Interrupted Processing
-```bash
-make process-resume  # Resume from last checkpoint
-```
-
-### Available Models and Features
-
-#### Models
- **Logistic Regression**: Character n-gram based classification
- **Random Forest**: Engineered feature-based classification
- **LSTM**: Sequential neural network (planned)
- **Transformer**: Attention-based model (planned)
-
-#### Features
- **Full Name**: Complete name as given
- **Native Name**: Identified native/given name component  
- **Surname**: Family name component
- **Name Length**: Character count features
- **Word Count**: Number of words in name
- **Province**: Geographic/demographic features
- **Name Beginnings/Endings**: Prefix/suffix patterns
- **Character N-grams**: Linguistic pattern features
-
-## Configuration
-
-### Environment Configurations
+**Manual Setup**

 ```bash
-# Switch to development configuration (smaller batches, more logging)
-make config-dev
+git clone https://github.com/bernard-ng/drc-ners-nlp.git
+cd drc-ners-nlp

-# Switch to production configuration (optimized for performance) 
-make config-prod
+# Setup environment
+python -m venv .venv
+.venv/bin/pip install --upgrade pip
+.venv/bin/pip install -r requirements.txt

-# View current configuration
-make show-config
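+# Install development tools with the virtual environment's pip; the environment is not
+# activated until the last step, so a bare `pip` would target a different Python.
+.venv/bin/pip install jupyter notebook ipykernel pytest black flake8 mypy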
+
+source .venv/bin/activate
 ```

-### Custom Configuration
+## Data Processing

-Edit configuration files in `config/`:
- `pipeline.yaml` - Main configuration
- `pipeline.development.yaml` - Development overrides  
- `pipeline.production.yaml` - Production settings
+This project includes a robust data processing pipeline designed to handle large datasets efficiently with batching,
+checkpointing, and parallel processing capabilities.
+Steps are defined in the `drc-ners-nlp/processing/steps` directory, and the configuration that enables them is managed
+through the `drc-ners-nlp/config/pipeline.yaml` file.
+
+**Pipeline Configuration**

-Example configuration:
 ```yaml
-processing:
-  batch_size: 1000
-  max_workers: 4
-  
-llm:
-  model_name: "mistral:7b"
-  requests_per_minute: 60
-  
-data:
-  split_evaluation: true
-  split_by_gender: true
+stages:
+  - "data_cleaning"
+  - "feature_extraction"
+  - "llm_annotation"
+  - "data_splitting"
 ```

-## Research Capabilities
-
-### Systematic Experimentation
-
-The framework supports systematic research through:
-
-1. **Baseline Studies**: Compare fundamental approaches
-2. **Feature Studies**: Test individual name components  
-3. **Ablation Studies**: Identify most important features
-4. **Cross-Province Analysis**: Test generalization across regions
-5. **Hyperparameter Optimization**: Systematic parameter tuning
-
-### Reproducible Research
-
- **Experiment Tracking**: All experiments automatically logged with full configuration
- **Result Export**: CSV export for publication and further analysis
- **Statistical Testing**: Cross-validation and confidence intervals
- **Version Control**: Configuration-based approach enables easy replication
-
-### Publication-Ready Output
+**Running the Pipeline**

 ```bash
-# Generate comprehensive results for publication
-make research-flow
-make export-results
-
-# Get best models for each approach  
-make list-completed
-python research/cli.py list --status completed | head -10
+python main.py --env development
 ```

-## Development
+## Experiments
+
+This project provides a modular experiment framework (model training and evaluation) for systematic model comparison and
+research iteration. Models are defined in the `drc-ners-nlp/research/models` directory.
+You can define model features, training parameters, and evaluation metrics in the `research_templates.yaml` file.
+
+**Running Experiments**

-### Code Quality and Testing
 ```bash
-make format          # Format code with black
-make lint           # Lint with flake8  
-make check-deps     # Verify dependencies
+python train.py --name="bigru" --type="baseline" --env="development"
+python train.py --name="cnn" --type="baseline" --env="development"
+python train.py --name="lightgbm" --type="baseline" --env="development"
+
+python train.py --name="logistic_regression_fullname" --type="baseline" --env="development"
+python train.py --name="logistic_regression_native" --type="baseline" --env="development"
+python train.py --name="logistic_regression_surname" --type="baseline" --env="development"
+
+python train.py --name="lstm" --type="baseline" --env="development"
+python train.py --name="random_forest" --type="baseline" --env="development"
+python train.py --name="svm" --type="baseline" --env="development"
+python train.py --name="naive_bayes" --type="baseline" --env="development"
+python train.py --name="transformer" --type="baseline" --env="development"
+python train.py --name="xgboost" --type="baseline" --env="development"
 ```

-### Development Workflow
+## Web Interface
+
+This project includes a user-friendly web interface built with Streamlit, allowing non-technical users to run
+experiments and make predictions without needing to understand the underlying code.
+
+**Running the Web Interface**
+
 ```bash
-make daily-work     # Daily development setup
-make notebook       # Launch Jupyter for analysis
-make web-dev        # Launch web interface with auto-reload
+streamlit run app.py
 ```

-### Data Management
-```bash
-make check-data     # Verify all data files
-make data-stats     # Show dataset statistics
-make backup-data    # Create timestamped backup
-make clean-checkpoints  # Clean processing checkpoints
-```
+## Contributors

-## Project Structure
-
-```
-├── Makefile                    # All command shortcuts
-├── streamlit_app.py           # Web interface application
-├── config/                    # Configuration files
-│   ├── pipeline.yaml         # Main configuration
-│   ├── pipeline.development.yaml  # Dev settings
-│   └── pipeline.production.yaml   # Prod settings
-├── core/                      # Core framework
-│   ├── config.py             # Configuration management
-│   ├── domain.py             # Domain-specific data
-│   └── utils.py              # Reusable utilities
-├── processing/                # Data processing pipeline
-│   ├── main.py               # Main pipeline script
-│   ├── pipeline.py           # Pipeline framework
-│   ├── steps_config.py       # Configurable processing steps
-│   └── monitor.py            # Monitoring utilities
-├── research/                  # Research and experiments
-│   ├── cli.py                # Command-line interface
-│   ├── experiment.py         # Experiment management
-│   ├── models.py             # Model implementations
-│   └── runner.py             # Experiment execution
-└── dataset/                   # Data files
-    └── names.csv             # Raw dataset
-```
-
-## Citation
-
-If you use this pipeline in your research, please cite:
-
-```bibtex
-@software{drc_names_pipeline,
-  title={DRC Names Gender Prediction Pipeline: A Culturally-Aware NLP System},
-  author={Your Name},
-  year={2025},
-  url={https://github.com/bernard-ng/drc-ners-nlp}
-}
-```
-
-## License
-
-This project is licensed under the MIT License - see the LICENSE file for details.
-
-## Acknowledgments
-
- Democratic Republic of Congo population data contributors
- Open source NLP and machine learning communities
- Cultural linguistics research communities
+<a href="https://github.com/bernard-ng/drc-ners-nlp/graphs/contributors" title="show all contributors">
+  <img src="https://contrib.rocks/image?repo=bernard-ng/drc-ners-nlp" alt="contributors"/>
+</a>