feat: enhance training pipeline with research templates and experiment configuration

This commit is contained in:
2025-08-08 23:48:55 +02:00
parent 96291b4ad0
commit 6d39c3afc1
9 changed files with 341 additions and 755 deletions
+1 -77
View File
@@ -12,6 +12,7 @@ help: ## Show this help message
.PHONY: setup
setup: ## Setup virtual environment and install dependencies
python -m venv .venv
source .venv/bin/activate
.venv/bin/pip install --upgrade pip
.venv/bin/pip install -r requirements.txt
@@ -20,79 +21,6 @@ install: ## Install/update dependencies
pip install --upgrade pip
pip install -r requirements.txt
.PHONY: install-dev
install-dev: ## Install development dependencies
pip install -r requirements.txt
pip install jupyter notebook ipykernel pytest black flake8 mypy
.PHONY: activate
activate: ## Show activation command
@echo "Run: source .venv/bin/activate"
# =============================================================================
# MODEL TRAINING & ARTIFACTS
# =============================================================================
.PHONY: train-baseline
train-baseline: ## Train all baseline models and save artifacts
python research/train.py --mode baseline
.PHONY: train-neural
train-neural: ## Train neural network models (LSTM, CNN, Transformer)
python research/train.py --mode neural
.PHONY: train-model
train-model: ## Train specific model (use: make train-model MODEL=logistic_regression NAME=my_model)
python research/train.py --model-type $(MODEL) --name $(NAME)
.PHONY: list-models
list-models: ## List all saved model artifacts
python research/train.py --mode list
# =============================================================================
# RESEARCH & EXPERIMENTS
# =============================================================================
.PHONY: experiment
experiment: ## Create sample experiment configuration
python research/cli.py run --name "sample_experiment" --features full_name --model-type logistic_regression
.PHONY: baseline
baseline: ## Run baseline experiments
python research/cli.py baseline
.PHONY: ablation
ablation: ## Run feature ablation study
python research/cli.py ablation
.PHONY: components
components: ## Run name component analysis
python research/cli.py components
.PHONY: list-experiments
list-experiments: ## List all experiments
python research/cli.py list
.PHONY: list-completed
list-completed: ## List completed experiments only
python research/cli.py list --status completed
.PHONY: export-results
export-results: ## Export all experiment results to CSV
python research/cli.py export --output results_$(shell date +%Y%m%d_%H%M%S).csv
.PHONY: best-model
best-model: ## Show best performing model
python research/cli.py list --status completed | head -5
# =============================================================================
# WEB INTERFACE
# =============================================================================
.PHONY: web
web: ## Launch Streamlit web interface
streamlit run web/app.py --server.runOnSave true --server.port 8501
# =============================================================================
# DEVELOPMENT & CODE QUALITY
# =============================================================================
@@ -113,10 +41,6 @@ type-check: ## Type check with mypy
notebook: ## Start Jupyter notebook
jupyter notebook notebooks/
.PHONY: lab
lab: ## Start Jupyter lab
jupyter lab notebooks/
# =============================================================================
# DEPLOYMENT & PRODUCTION
# =============================================================================
+72 -279
View File
@@ -1,69 +1,20 @@
# DRC Names Gender Prediction Pipeline: A Culturally-Aware NLP System for Congolese Name Analysis
# A Culturally-Aware NLP System for Congolese Name Analysis and Gender Inference
A comprehensive, research-friendly pipeline for analyzing Congolese names and predicting gender using culturally-aware machine learning models.
This system provides advanced data processing, experiment management, and an intuitive web interface for non-technical users.
Despite the growing success of gender inference models in Natural Language Processing (NLP), these tools often
underperform when applied to culturally diverse African contexts due to the lack of culturally-representative training
data.
This project introduces a comprehensive pipeline for Congolese name analysis with a large-scale dataset of over 5
million names from the Democratic Republic of Congo (DRC) annotated with gender and demographic metadata.
## Overview
## Getting Started
Despite the growing success of gender inference models in Natural Language Processing (NLP), these tools often underperform when applied to culturally diverse African contexts due to the lack of culturally-representative training data.
This project introduces a comprehensive pipeline for Congolese name analysis with a large-scale dataset of over 7 million names from the Democratic Republic of Congo (DRC) annotated with gender and demographic metadata.
### Installation & Setup
Our approach involves:
The instructions and command-line snippets below will help you set up the project environment quickly and
efficiently, assuming you have Python 3.11 and Git installed on a Unix-like system (Linux, macOS, etc.).
- **(1) Advanced data processing pipeline** with batching, checkpointing, and parallel processing
- **(2) Modular experiment framework** for systematic model comparison and research iteration
- **(3) Multiple feature extraction strategies** leveraging name components, linguistic patterns, and demographic data
- **(4) Culturally-aware gender prediction models** trained specifically on Congolese naming patterns
- **(5) User-friendly web interface** enabling non-technical users to run experiments and make predictions
- **(6) Comprehensive research tools** for reproducible experimentation and result analysis
## Key Features
### **Advanced Data Processing**
- **Batched processing** with configurable batch sizes and parallel execution
- **Automatic checkpointing** and resume capability for large datasets
- **LLM-powered annotation** with rate limiting and retry logic
- **Memory-efficient** chunked data loading for datasets of any size
### **Research-Friendly Experiment Framework**
- **Modular model architecture** - easily add new models and features
- **Systematic experiment tracking** with automatic result storage
- **Feature ablation studies** and component analysis tools
- **Cross-validation** and statistical significance testing
- **Automated baseline comparisons** and performance analysis
### **Intuitive Web Interface**
- **No-code experiment creation** with visual parameter selection
- **Real-time monitoring** of data processing and training progress
- **Interactive result visualization** with charts and comparisons
- **Batch prediction capabilities** for CSV file upload and processing
- **Model comparison tools** with automatic performance rankings
### **Comprehensive Analytics**
- **Feature importance analysis** showing which name components matter most
- **Province-specific studies** examining regional naming patterns
- **Learning curve analysis** for understanding data requirements
- **Prediction confidence scoring** and error analysis tools
## Quick Start
### Using Make Commands (Recommended)
```bash
# Complete setup and basic processing
make quick-start
# Launch web interface
make web
# Run research workflow
make research-flow
# Show all available commands
make help
```
### Manual Installation
**Using Makefile (Recommended)**
```bash
git clone https://github.com/bernard-ng/drc-ners-nlp.git
@@ -71,246 +22,88 @@ cd drc-ners-nlp
# Setup environment
make setup
make process
# Launch web interface
make web
make activate
```
## Usage
### Web Interface (Recommended for Non-Technical Users)
Launch the Streamlit web application:
```bash
make web
```
The interface provides:
- **Dashboard**: Overview of datasets and recent experiments
- **Data Overview**: Interactive data exploration and statistics
- **Data Processing**: Monitor and control the processing pipeline
- **Experiments**: Create and manage machine learning experiments
- **Results & Analysis**: Compare models and analyze performance
- **Predictions**: Make predictions on new names or upload CSV files
- **Settings**: Configure the system and manage data
### Research & Experiments
#### Quick Research Studies
```bash
# Compare different approaches (full name vs native vs surname)
make baseline
# Analyze which name components are most effective
make components
# Test feature importance through ablation study
make ablation
# View all experiment results
make list-experiments
# Export results for publication
make export-results
```
#### Custom Experiments
```bash
# Run specific experiment via command line
python research/cli.py run \
--name "native_name_study" \
--features native_name \
--model-type logistic_regression \
--description "Test native name effectiveness"
# Compare multiple experiments
python research/cli.py compare <exp_id_1> <exp_id_2>
# View detailed results
python research/cli.py show <experiment_id>
```
### Data Processing Pipeline
#### Basic Processing (No LLM)
```bash
make process-basic # Fast processing without LLM annotation
```
#### Complete Processing (With LLM)
```bash
make process # Full pipeline including LLM annotation
make process-dev # Development mode with smaller batches
```
#### Monitor Progress
```bash
make monitoring # Show current pipeline status
make status # Show overall system status
```
#### Resume Interrupted Processing
```bash
make process-resume # Resume from last checkpoint
```
### Available Models and Features
#### Models
- **Logistic Regression**: Character n-gram based classification
- **Random Forest**: Engineered feature-based classification
- **LSTM**: Sequential neural network (planned)
- **Transformer**: Attention-based model (planned)
#### Features
- **Full Name**: Complete name as given
- **Native Name**: Identified native/given name component
- **Surname**: Family name component
- **Name Length**: Character count features
- **Word Count**: Number of words in name
- **Province**: Geographic/demographic features
- **Name Beginnings/Endings**: Prefix/suffix patterns
- **Character N-grams**: Linguistic pattern features
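
The feature strategies above can be sketched in plain Python. The helper below is illustrative only; the function name and defaults are assumptions, not the project's actual API:

```python
from collections import Counter

def extract_features(full_name: str, n: int = 3, ending_length: int = 3) -> dict:
    """Illustrative feature extraction for a single name (hypothetical helper)."""
    words = full_name.lower().split()
    # Pad with spaces so character n-grams capture word-boundary patterns
    padded = f" {' '.join(words)} "
    ngrams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return {
        "name_length": len(full_name.replace(" ", "")),   # character count feature
        "word_count": len(words),                          # number of words in name
        "name_ending": words[-1][-ending_length:] if words else "",  # suffix pattern
        "char_ngrams": dict(ngrams),                       # linguistic pattern features
    }

features = extract_features("Tshilombo Kabeya")
print(features["word_count"], features["name_ending"])
```

In the real pipeline these features would be vectorized (e.g. via a character n-gram vectorizer) before training.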
## Configuration
### Environment Configurations
**Manual Setup**
```bash
# Switch to development configuration (smaller batches, more logging)
make config-dev
git clone https://github.com/bernard-ng/drc-ners-nlp.git
cd drc-ners-nlp
# Switch to production configuration (optimized for performance)
make config-prod
# Setup environment
python -m venv .venv
.venv/bin/pip install --upgrade pip
.venv/bin/pip install -r requirements.txt
# View current configuration
make show-config
pip install --upgrade pip
pip install -r requirements.txt
pip install jupyter notebook ipykernel pytest black flake8 mypy
source .venv/bin/activate
```
### Custom Configuration
## Data Processing
Edit configuration files in `config/`:
- `pipeline.yaml` - Main configuration
- `pipeline.development.yaml` - Development overrides
- `pipeline.production.yaml` - Production settings
This project includes a robust data processing pipeline designed to handle large datasets efficiently, with batching,
checkpointing, and parallel processing capabilities.
Processing steps are defined in the `drc-ners-nlp/processing/steps` directory, and the configuration that enables them
is managed through the `drc-ners-nlp/config/pipeline.yaml` file.
**Pipeline Configuration**
Example configuration:
```yaml
processing:
batch_size: 1000
max_workers: 4
llm:
model_name: "mistral:7b"
requests_per_minute: 60
data:
split_evaluation: true
split_by_gender: true
stages:
- "data_cleaning"
- "feature_extraction"
- "llm_annotation"
- "data_splitting"
```
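
As a sketch of how a fragment like this can be consumed, the snippet below parses the example with PyYAML (assuming the `yaml` package is installed; `CONFIG_TEXT` mirrors the example above and is not read from the repo's config files):

```python
import yaml  # PyYAML, assumed to be among the project's dependencies

CONFIG_TEXT = """
processing:
  batch_size: 1000
  max_workers: 4
llm:
  model_name: "mistral:7b"
  requests_per_minute: 60
data:
  split_evaluation: true
  split_by_gender: true
stages:
  - "data_cleaning"
  - "feature_extraction"
  - "llm_annotation"
  - "data_splitting"
"""

config = yaml.safe_load(CONFIG_TEXT)
# Stages run in list order; disabling one is just removing (or commenting) its entry
print(config["processing"]["batch_size"], config["stages"][0])
```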
## Research Capabilities
### Systematic Experimentation
The framework supports systematic research through:
1. **Baseline Studies**: Compare fundamental approaches
2. **Feature Studies**: Test individual name components
3. **Ablation Studies**: Identify most important features
4. **Cross-Province Analysis**: Test generalization across regions
5. **Hyperparameter Optimization**: Systematic parameter tuning
### Reproducible Research
- **Experiment Tracking**: All experiments automatically logged with full configuration
- **Result Export**: CSV export for publication and further analysis
- **Statistical Testing**: Cross-validation and confidence intervals
- **Version Control**: Configuration-based approach enables easy replication
### Publication-Ready Output
**Running the Pipeline**
```bash
# Generate comprehensive results for publication
make research-flow
make export-results
# Get best models for each approach
make list-completed
python research/cli.py list --status completed | head -10
python main.py --env development
```
## Development
## Experiments
This project provides a modular experiment framework (model training and evaluation) for systematic model comparison and
research iteration. Models are defined in the `drc-ners-nlp/research/models` directory.
Model features, training parameters, and evaluation metrics are defined in the `research_templates.yaml` file.
**Running Experiments**
### Code Quality and Testing
```bash
make format # Format code with black
make lint # Lint with flake8
make check-deps # Verify dependencies
python train.py --name="bigru" --type="baseline" --env="development"
python train.py --name="cnn" --type="baseline" --env="development"
python train.py --name="lightgbm" --type="baseline" --env="development"
python train.py --name="logistic_regression_fullname" --type="baseline" --env="development"
python train.py --name="logistic_regression_native" --type="baseline" --env="development"
python train.py --name="logistic_regression_surname" --type="baseline" --env="development"
python train.py --name="lstm" --type="baseline" --env="development"
python train.py --name="random_forest" --type="baseline" --env="development"
python train.py --name="svm" --type="baseline" --env="development"
python train.py --name="naive_bayes" --type="baseline" --env="development"
python train.py --name="transformer" --type="baseline" --env="development"
python train.py --name="xgboost" --type="baseline" --env="development"
```
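
Running every baseline by hand gets tedious; a small helper (hypothetical, not part of the repo) can generate the invocations listed above, e.g. for use with `subprocess` or a shell loop:

```python
# Hypothetical helper mirroring the commands above. Model names are assumed
# to match the baseline entries in research_templates.yaml.
BASELINES = [
    "bigru", "cnn", "lightgbm", "logistic_regression_fullname",
    "logistic_regression_native", "logistic_regression_surname",
    "lstm", "naive_bayes", "random_forest", "svm", "transformer", "xgboost",
]

def baseline_commands(env: str = "development") -> list[str]:
    """Build one train.py invocation per baseline template."""
    return [
        f'python train.py --name="{name}" --type="baseline" --env="{env}"'
        for name in BASELINES
    ]

for cmd in baseline_commands():
    print(cmd)
```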
### Development Workflow
## Web Interface
This project includes a user-friendly web interface built with Streamlit, allowing non-technical users to run
experiments and make predictions without needing to understand the underlying code.
### Running the Web Interface
```bash
make daily-work # Daily development setup
make notebook # Launch Jupyter for analysis
make web-dev # Launch web interface with auto-reload
streamlit run app.py
```
### Data Management
```bash
make check-data # Verify all data files
make data-stats # Show dataset statistics
make backup-data # Create timestamped backup
make clean-checkpoints # Clean processing checkpoints
```
## Contributors
## Project Structure
```
├── Makefile # All command shortcuts
├── streamlit_app.py # Web interface application
├── config/ # Configuration files
│ ├── pipeline.yaml # Main configuration
│ ├── pipeline.development.yaml # Dev settings
│ └── pipeline.production.yaml # Prod settings
├── core/ # Core framework
│ ├── config.py # Configuration management
│ ├── domain.py # Domain-specific data
│ └── utils.py # Reusable utilities
├── processing/ # Data processing pipeline
│ ├── main.py # Main pipeline script
│ ├── pipeline.py # Pipeline framework
│ ├── steps_config.py # Configurable processing steps
│ └── monitor.py # Monitoring utilities
├── research/ # Research and experiments
│ ├── cli.py # Command-line interface
│ ├── experiment.py # Experiment management
│ ├── models.py # Model implementations
│ └── runner.py # Experiment execution
└── dataset/ # Data files
└── names.csv # Raw dataset
```
## Citation
If you use this pipeline in your research, please cite:
```bibtex
@software{drc_names_pipeline,
title={DRC Names Gender Prediction Pipeline: A Culturally-Aware NLP System},
author={Your Name},
year={2025},
url={https://github.com/bernard-ng/drc-ners-nlp}
}
```
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- Democratic Republic of Congo population data contributors
- Open source NLP and machine learning communities
- Cultural linguistics research communities
<a href="https://github.com/bernard-ng/drc-ners-nlp/graphs/contributors" title="show all contributors">
<img src="https://contrib.rocks/image?repo=bernard-ng/drc-ners-nlp" alt="contributors"/>
</a>
-205
View File
@@ -1,205 +0,0 @@
#!.venv/bin/python3
import argparse
import logging
import sys
from pathlib import Path
import pandas as pd
from core.config import setup_config
from research.experiment.experiment_runner import ExperimentRunner
from research.experiment.experiment_tracker import ExperimentTracker
def list_experiments(args):
"""List experiments with optional filtering"""
tracker = ExperimentTracker()
# Apply filters
filters = {}
if args.status:
from research.experiment import ExperimentStatus
filters["status"] = ExperimentStatus(args.status)
if args.model_type:
filters["model_type"] = args.model_type
if args.tags:
filters["tags"] = args.tags
experiments = tracker.list_experiments(**filters)
if not experiments:
logging.info("No experiments found matching criteria")
return
# Create summary table
rows = []
for exp in experiments:
row = {
"ID": exp.experiment_id[:12] + "...",
"Name": exp.config.name,
"Model": exp.config.model_type,
"Status": exp.status.value,
"Test Acc": f"{exp.test_metrics.get('accuracy', 0):.4f}" if exp.test_metrics else "N/A",
"Start Time": exp.start_time.strftime("%Y-%m-%d %H:%M"),
}
rows.append(row)
df = pd.DataFrame(rows)
logging.info(df.to_string(index=False))
def show_experiment_details(args):
"""Show detailed results for an experiment"""
tracker = ExperimentTracker()
experiment = tracker.get_experiment(args.experiment_id)
if not experiment:
logging.error(f"Experiment not found: {args.experiment_id}")
return
logging.info("=== Experiment Details ===")
logging.info(f"ID: {experiment.experiment_id}")
logging.info(f"Name: {experiment.config.name}")
logging.info(f"Description: {experiment.config.description}")
logging.info(f"Model Type: {experiment.config.model_type}")
logging.info(f"Features: {', '.join([f.value for f in experiment.config.features])}")
logging.info(f"Status: {experiment.status.value}")
logging.info(f"Start Time: {experiment.start_time}")
logging.info(f"End Time: {experiment.end_time}")
if experiment.test_metrics:
logging.info("=== Test Metrics ===")
for metric, value in experiment.test_metrics.items():
logging.info(f"{metric}: {value:.4f}")
if experiment.cv_metrics:
logging.info("=== Cross-Validation Metrics ===")
for metric, value in experiment.cv_metrics.items():
if not metric.endswith("_std"):
std_key = f"{metric}_std"
std_val = experiment.cv_metrics.get(std_key, 0)
logging.info(f"{metric}: {value:.4f} ± {std_val:.4f}")
if experiment.feature_importance:
logging.info("=== Top 10 Feature Importances ===")
sorted_features = sorted(
experiment.feature_importance.items(), key=lambda x: x[1], reverse=True
)
for feature, importance in sorted_features[:10]:
logging.info(f"{feature}: {importance:.4f}")
if experiment.prediction_examples:
logging.info("=== Prediction Examples ===")
for i, example in enumerate(experiment.prediction_examples[:5]):
correct = "✓" if example["correct"] else "✗"
logging.info(
f"{i + 1}. {example['name']} -> True: {example['true_label']}, "
f"Pred: {example['predicted_label']} {correct}"
)
def compare_experiments_cmd(args):
"""Compare multiple experiments"""
config = setup_config(env="development")
runner = ExperimentRunner(config)
comparison = runner.compare_experiments(args.experiment_ids)
if comparison.empty:
logging.info("No experiments found for comparison")
return
logging.info("=== Experiment Comparison ===")
# Show key columns
key_columns = ["name", "model_type", "features", "test_accuracy", "test_f1"]
available_columns = [col for col in key_columns if col in comparison.columns]
logging.info(comparison[available_columns].to_string(index=False))
def export_results(args):
"""Export experiment results"""
tracker = ExperimentTracker()
output_path = tracker.export_results(Path(args.output) if args.output else None)
logging.info(f"Results exported to: {output_path}")
def main():
"""Main CLI entry point with unified configuration loading"""
parser = argparse.ArgumentParser(
description="DRC Names Research Experiment Manager",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
# Global arguments
parser.add_argument("--config", type=Path, help="Path to configuration file")
parser.add_argument(
"--env", type=str, default="development",
help="Environment name (default: development)"
)
parser.add_argument("--verbose", "-v", action="store_true", help="Verbose logging")
subparsers = parser.add_subparsers(dest="command", help="Available commands")
# List experiments
list_parser = subparsers.add_parser("list", help="List experiments")
list_parser.add_argument("--status", choices=["pending", "running", "completed", "failed"])
list_parser.add_argument("--tags", nargs="+", help="Filter by tags")
# Show experiment details
detail_parser = subparsers.add_parser("show", help="Show experiment details")
detail_parser.add_argument("experiment_id", help="Experiment ID")
# Compare experiments
compare_parser = subparsers.add_parser("compare", help="Compare experiments")
compare_parser.add_argument("experiment_ids", nargs="+", help="Experiment IDs to compare")
# Export results
export_parser = subparsers.add_parser("export", help="Export results to CSV")
export_parser.add_argument("--output", help="Output file path")
args = parser.parse_args()
if not args.command:
parser.print_help()
return 1
try:
# Load configuration and setup logging
config = setup_config(config_path=args.config, env=args.env)
# Override log level if verbose requested
if args.verbose:
logging.getLogger().setLevel(logging.DEBUG)
# Execute command
command_map = {
"list": list_experiments,
"show": show_experiment_details,
"compare": compare_experiments_cmd,
"export": export_results,
}
handler = command_map.get(args.command)
if handler:
handler(args)
return 0
except Exception as e:
logging.error(f"Command failed: {e}")
if args.verbose:
import traceback
traceback.print_exc()
return 1
if __name__ == "__main__":
exit_code = main()
sys.exit(exit_code)
+5 -15
View File
@@ -1,17 +1,12 @@
# Development Environment Configuration
# Settings tuned for local development and testing
name: "drc_names_pipeline"
version: "1.0.0"
environment: "development"
debug: true
# Processing settings
processing:
batch_size: 100_000
batch_size: 10_000
max_workers: 8
checkpoint_interval: 10
use_multiprocessing: true # Enable multiprocessing for CPU-bound tasks
use_multiprocessing: true
# Pipeline stages
stages:
@@ -20,7 +15,6 @@ stages:
#- "llm_annotation"
- "data_splitting"
# Production LLM settings
llm:
model_name: "mistral:7b"
@@ -31,14 +25,10 @@ llm:
max_concurrent_requests: 4
enable_rate_limiting: true
# Development data settings - limited dataset for faster testing
# Data handling configuration
data:
split_evaluation: true
split_by_gender: true
evaluation_fraction: 0.2
random_seed: 42
max_dataset_size: ~ # No limit; set a number to cap records for development/testing
balance_by_sex: false # Balance male/female samples when limiting
max_dataset_size: 100_000
balance_by_sex: true
# Enhanced logging for development
logging:
+4 -14
View File
@@ -1,17 +1,12 @@
# Production Environment Configuration
# Optimized settings for production deployment
name: "drc_names_pipeline"
version: "1.0.0"
environment: "production"
debug: false
# Production processing settings (optimized for performance)
# Processing settings
processing:
batch_size: 10_000
max_workers: 8
checkpoint_interval: 10
use_multiprocessing: true # Enable multiprocessing for CPU-bound tasks
use_multiprocessing: true
# Pipeline stages
stages:
@@ -20,7 +15,6 @@ stages:
- "llm_annotation"
- "data_splitting"
# Production LLM settings
llm:
model_name: "mistral:7b"
@@ -31,19 +25,15 @@ llm:
max_concurrent_requests: 4
enable_rate_limiting: true
# Production data settings
# Data handling configuration
data:
split_evaluation: true
split_by_gender: true
evaluation_fraction: 0.2
random_seed: 42
max_dataset_size: null
balance_by_sex: false
# Production logging (less verbose)
logging:
level: "INFO"
console_logging: false # Disable console in production
console_logging: false
file_logging: true
log_file: "pipeline.production.log"
max_log_size: 52428800 # 50MB
+47 -47
View File
@@ -1,72 +1,72 @@
# DRC Names Processing Pipeline Configuration
# Main configuration file with default settings
name: "drc_names_pipeline"
version: "1.0.0"
description: "DRC Names NLP Processing Pipeline"
environment: "development"
debug: false
name: "drc_ners_pipeline" # Name of the pipeline
version: "1.0.0" # Version of the pipeline
description: "DRC NERS NLP Processing" # Description of the pipeline
environment: "development" # Environment type (development, production, etc.)
debug: false # Enable debug mode for detailed logging and error reporting
# Project directory structure
paths:
root_dir: "."
configs_dir: "./config"
data_dir: "./data/dataset"
models_dir: "./data/models"
outputs_dir: "./data/outputs"
logs_dir: "./data/logs"
checkpoints_dir: "./data/checkpoints"
root_dir: "." # Root directory of the project
configs_dir: "./config" # Directory for configuration files
data_dir: "./data/dataset" # Directory for dataset files
models_dir: "./data/models" # Directory for model files
outputs_dir: "./data/outputs" # Directory for output files
logs_dir: "./data/logs" # Directory for log files
checkpoints_dir: "./data/checkpoints" # Directory for model checkpoints
# Pipeline stages
stages:
- "data_cleaning"
- "feature_extraction"
- "llm_annotation"
- "data_splitting"
stages: # List of stages in the processing pipeline
- "data_cleaning" # Data cleaning stage
- "feature_extraction" # Feature extraction stage
- "llm_annotation" # LLM annotation stage (computational intensive)
- "data_splitting" # Data splitting stage
# Data processing configuration
processing:
batch_size: 1_000
max_workers: 4
checkpoint_interval: 5
use_multiprocessing: false
encoding_options:
batch_size: 1_000 # Size of data batches to process at once
max_workers: 4 # Number of worker threads for parallel processing
checkpoint_interval: 5 # Interval for saving checkpoints during processing
use_multiprocessing: false # Enable multiprocessing for CPU-bound tasks
encoding_options: # List of encodings to try when reading files
- "utf-8"
- "utf-16"
- "latin1"
chunk_size: 100_000
chunk_size: 100_000 # Size of data chunks to process in parallel
# LLM annotation settings
llm:
model_name: "mistral:7b"
requests_per_minute: 60
requests_per_second: 2
retry_attempts: 3
timeout_seconds: 600
max_concurrent_requests: 2
enable_rate_limiting: true
model_name: "mistral:7b" # Name of the LLM model to use
requests_per_minute: 60 # Requests per minute to the LLM service
requests_per_second: 2 # Requests per second to the LLM service
retry_attempts: 3 # Number of retry attempts for LLM requests
timeout_seconds: 600 # Timeout for LLM requests
max_concurrent_requests: 2 # Maximum concurrent requests to the LLM service
enable_rate_limiting: true # Enable rate limiting to avoid overloading the LLM service
# Data handling configuration
data:
input_file: "names.csv"
input_file: "names.csv" # Input file containing names data
output_files:
featured: "names_featured.csv"
evaluation: "names_evaluation.csv"
males: "names_males.csv"
females: "names_females.csv"
split_evaluation: true
split_by_gender: true
evaluation_fraction: 0.2
random_seed: 42
max_dataset_size: null
balance_by_sex: false
featured: "names_featured.csv" # Output file for featured data
evaluation: "names_evaluation.csv" # Output file for evaluation set
males: "names_males.csv" # Output files for male names
females: "names_females.csv" # Output files for female names
split_evaluation: true # Should the dataset be split into training and evaluation sets?
split_by_gender: true # Should the dataset be split by gender?
evaluation_fraction: 0.2 # Fraction of data to use for evaluation
random_seed: 42 # Random seed for reproducibility
max_dataset_size: null # Maximum size of the dataset to process; set to null for no limit
balance_by_sex: false # Should the dataset be balanced by sex when limiting the dataset size?
# Logging configuration
logging:
level: "INFO"
level: "INFO" # Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
file_logging: true
console_logging: true
log_file: "pipeline.log"
max_log_size: 10485760 # 10MB
backup_count: 5
file_logging: true # Enable logging to file
console_logging: true # Enable logging to console
log_file: "pipeline.log" # Log file name
max_log_size: 10485760 # Maximum size of log file before rotation (10MB)
backup_count: 5 # Number of backup log files to keep
+129 -109
View File
@@ -1,128 +1,148 @@
# Research Experiment Configuration Templates
# These configurations can be used as starting points for different types of experiments
# Baseline Experiments Configuration
baseline_experiments:
- name: "baseline_logistic_regression_fullname"
- name: "bigru"
description: "Baseline BiGRU with full name features"
model_type: "bigru"
features: [ "full_name" ]
model_params:
max_len: 20
embedding_dim: 64
gru_units: 32
epochs: 10
batch_size: 32
tags: [ "baseline", "neural", "bigru" ]
- name: "cnn"
description: "Baseline CNN with character patterns"
model_type: "cnn"
features: [ "full_name" ]
model_params:
max_len: 20
embedding_dim: 64
filters: 64
kernel_size: 3
dropout: 0.5
epochs: 10
batch_size: 32
tags: [ "baseline", "neural", "cnn" ]
- name: "ensemble"
description: "Baseline Ensemble with multiple models"
model_type: "ensemble"
features: [ "full_name", "name_length", "word_count" ]
model_params:
base_models: [ "logistic_regression", "random_forest", "xgboost" ]
voting: "soft"
cv_folds: 5
tags: [ "baseline", "ensemble" ]
- name: "lightgbm"
description: "Baseline LightGBM with engineered features"
model_type: "lightgbm"
features: [ "full_name", "name_length", "word_count" ]
model_params:
n_estimators: 100
max_depth: -1
learning_rate: 0.1
num_leaves: 31
subsample: 0.8
colsample_bytree: 0.8
tags: [ "baseline", "lightgbm" ]
- name: "logistic_regression_fullname"
description: "Baseline logistic regression with full name"
model_type: "logistic_regression"
features: ["full_name"]
features: [ "full_name" ]
model_params:
ngram_range: [2, 5]
max_features: 10000
max_iter: 1000
tags: ["baseline", "fullname"]
tags: [ "baseline", "logistic_regression", "fullname" ]
- name: "baseline_logistic_regression_native"
- name: "logistic_regression_native"
description: "Logistic regression with native name only"
model_type: "logistic_regression"
features: ["native_name"]
features: [ "native_name" ]
model_params:
ngram_range: [2, 4]
max_features: 5000
tags: ["baseline", "native"]
tags: [ "baseline", "logistic_regression", "native" ]
- name: "baseline_rf_engineered"
description: "Random Forest with engineered features"
- name: "logistic_regression_surname"
description: "Logistic regression with surname only"
model_type: "logistic_regression"
features: [ "surname" ]
model_params:
max_features: 5000
tags: [ "baseline", "logistic_regression", "surname" ]
- name: "lstm"
description: "Baseline LSTM with full name features"
model_type: "lstm"
features: [ "full_name" ]
model_params:
embedding_dim: 128
lstm_units: 64
epochs: 10
batch_size: 64
tags: [ "baseline", "neural", "lstm" ]
- name: "naive_bayes"
description: "Baseline Naive Bayes with full name features"
model_type: "naive_bayes"
features: [ "full_name" ]
model_params:
max_features: 5000
tags: [ "baseline", "naive_bayes" ]
- name: "random_forest"
description: "Baseline Random Forest with engineered features"
model_type: "random_forest"
features: ["name_length", "word_count", "province"]
features: [ "name_length", "word_count", "province" ]
model_params:
n_estimators: 100
max_depth: 10
tags: ["baseline", "engineered"]
min_samples_split: 2
min_samples_leaf: 1
tags: [ "baseline", "random_forest", "engineered" ]
- name: "svm"
description: "Baseline SVM with full name features"
model_type: "svm"
features: [ "full_name" ]
model_params:
C: 1.0
kernel: "rbf"
ngram_range: [ 2, 4 ]
max_features: 5000
tags: [ "baseline", "svm" ]
- name: "transformer"
description: "Baseline Transformer with attention mechanism"
model_type: "transformer"
features: [ "full_name" ]
model_params:
embedding_dim: 128
num_heads: 4
num_layers: 2
epochs: 10
batch_size: 64
tags: [ "baseline", "neural", "transformer" ]
- name: "xgboost"
description: "Baseline XGBoost with engineered features"
model_type: "xgboost"
features: [ "full_name", "name_length", "word_count" ]
model_params:
n_estimators: 100
max_depth: 6
learning_rate: 0.1
subsample: 0.8
colsample_bytree: 0.8
tags: [ "baseline", "xgboost" ]
# Advanced Experiments Configuration
advanced_experiments:
# Feature Study Configurations
feature_studies:
- name: "native_vs_surname"
description: "Compare native name vs surname effectiveness"
experiments:
- model_type: "logistic_regression"
features: ["native_name"]
tags: ["feature_study", "native"]
- model_type: "logistic_regression"
features: ["surname"]
tags: ["feature_study", "surname"]
- name: "name_parts_analysis"
description: "Analyze effectiveness of different name parts"
experiments:
- features: ["first_word"]
tags: ["name_parts", "first"]
- features: ["last_word"]
tags: ["name_parts", "last"]
- features: ["name_beginnings"]
feature_params:
beginning_length: 3
tags: ["name_parts", "beginnings"]
- features: ["name_endings"]
feature_params:
ending_length: 3
tags: ["name_parts", "endings"]
# Province-Specific Studies
province_studies:
- name: "kinshasa_study"
description: "Gender prediction for Kinshasa province"
model_type: "logistic_regression"
features: ["full_name"]
train_data_filter:
province: "kinshasa"
tags: ["province_study", "kinshasa"]
- name: "cross_province_generalization"
description: "Train on one province, test on another"
experiments:
- train_filter: {"province": "kinshasa"}
test_filter: {"province": "bas-congo"}
tags: ["generalization", "kinshasa_to_bas-congo"]
# Model Comparison Studies
model_comparisons:
- name: "model_comparison_fullname"
description: "Compare different models with full name"
base_config:
features: ["full_name"]
tags: ["model_comparison"]
models:
- model_type: "logistic_regression"
model_params:
ngram_range: [2, 5]
- model_type: "random_forest"
# Note: RF will need different feature preparation
features: ["name_length", "word_count", "province"]
# Advanced Feature Combinations
advanced_features:
- name: "multi_feature_combination"
description: "Test various feature combinations"
experiments:
- features: ["full_name", "name_length"]
tags: ["combination", "name_plus_length"]
- features: ["native_name", "surname", "province"]
tags: ["combination", "semantic_features"]
- features: ["name_beginnings", "name_endings", "word_count"]
tags: ["combination", "structural_features"]
# Hyperparameter Studies
hyperparameter_studies:
- name: "ngram_range_study"
description: "Study effect of different n-gram ranges"
base_config:
model_type: "logistic_regression"
features: ["full_name"]
tags: ["hyperparameter", "ngram"]
variants:
- model_params: {"ngram_range": [1, 3]}
- model_params: {"ngram_range": [2, 4]}
- model_params: {"ngram_range": [2, 5]}
- model_params: {"ngram_range": [3, 6]}
# Data Size Studies
data_studies:
- name: "learning_curve_study"
description: "Study performance vs training data size"
base_config:
model_type: "logistic_regression"
features: ["full_name"]
tags: ["learning_curve"]
data_sizes: [0.1, 0.25, 0.5, 0.75, 1.0] # Fractions of training data to use
# Hyperparameter Tuning Configurations
hyperparameter_tuning:
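A template entry like the ones above is plain YAML, so it can be inspected directly before wiring it into the trainer. A minimal, self-contained sketch (the inline string below is a trimmed copy of one baseline entry so the example runs without the real config/research_templates.yaml file; PyYAML is assumed, as the CLI already imports it):

```python
import yaml

# A trimmed-down template in the same shape as research_templates.yaml
# (the real file lives under config/; this inline copy keeps the sketch runnable).
TEMPLATE = """
baseline_experiments:
  - name: "naive_bayes"
    description: "Baseline Naive Bayes with full name features"
    model_type: "naive_bayes"
    features: [ "full_name" ]
    model_params:
      max_features: 5000
    tags: [ "baseline", "naive_bayes" ]
"""

templates = yaml.safe_load(TEMPLATE)
experiment = templates["baseline_experiments"][0]
print(experiment["name"], experiment["features"])  # naive_bayes ['full_name']
```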
+6 -1
@@ -33,6 +33,7 @@ class ModelTrainer:
model_type: str = "logistic_regression",
features: List[str] = None,
model_params: Dict[str, Any] = None,
tags: List[str] = None,
save_artifacts: bool = True,
) -> str:
"""
@@ -45,6 +46,10 @@ class ModelTrainer:
features = ["full_name"]
feature_types = [FeatureType(f) for f in features]
# Prepare tags - combine default tags with template tags
default_tags = ["training", model_type]
experiment_tags = default_tags + (tags or [])
# Create experiment configuration
config = ExperimentConfig(
name=model_name,
@@ -52,7 +57,7 @@ class ModelTrainer:
model_type=model_type,
features=feature_types,
model_params=model_params or {},
tags=["training", model_type],
tags=experiment_tags,
)
# Run experiment
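The tag handling added in this hunk is plain list concatenation; a standalone sketch mirroring the `default_tags` logic (the helper name here is illustrative, not part of `ModelTrainer`):

```python
def build_experiment_tags(model_type, tags=None):
    # Mirrors ModelTrainer.train_single_model: default tags first,
    # then any template-supplied tags appended in order.
    default_tags = ["training", model_type]
    return default_tags + (tags or [])

print(build_experiment_tags("lstm"))                          # ['training', 'lstm']
print(build_experiment_tags("lstm", ["baseline", "neural"]))  # ['training', 'lstm', 'baseline', 'neural']
```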
+77 -8
@@ -3,29 +3,98 @@ import argparse
import logging
import sys
import traceback
import yaml
from pathlib import Path
from core.config import setup_config
from research.model_trainer import ModelTrainer
def load_research_templates(templates_path: str = "config/research_templates.yaml") -> dict:
"""Load research templates from YAML file"""
try:
with open(templates_path, 'r') as file:
return yaml.safe_load(file)
except FileNotFoundError:
logging.error(f"Templates file not found: {templates_path}")
raise
except yaml.YAMLError as e:
logging.error(f"Error parsing templates file: {e}")
raise
def find_experiment_config(templates: dict, name: str, experiment_type: str) -> dict:
"""Find experiment configuration by name and type"""
# Map type to section in templates
type_mapping = {
"baseline": "baseline_experiments",
"advanced": "advanced_experiments",
"feature_study": "feature_studies",
"tuning": "hyperparameter_tuning"
}
section_name = type_mapping.get(experiment_type)
if not section_name:
available_types = list(type_mapping.keys())
raise ValueError(f"Unknown experiment type '{experiment_type}'. Available types: {available_types}")
if section_name not in templates:
raise ValueError(f"Section '{section_name}' not found in templates")
experiments = templates[section_name]
# Search for experiment by model name
for experiment in experiments:
# Check if this is the experiment we're looking for
# Look for experiments that match the model type or contain the name
if (experiment.get("model_type") == name or
name.lower() in experiment.get("name", "").lower() or
f"baseline_{name}" == experiment.get("name") or
f"advanced_{name}" == experiment.get("name")):
return experiment
# If not found, list available experiments
available_experiments = [exp.get("name", exp.get("model_type", "unknown")) for exp in experiments]
raise ValueError(f"Experiment '{name}' not found in '{experiment_type}' section. "
f"Available experiments: {available_experiments}")
def main():
parser = argparse.ArgumentParser(description="Train DRC Names Models")
parser.add_argument("--type", type=str, help="Specific model type to train")
parser.add_argument("--name", type=str, help="Model name")
parser = argparse.ArgumentParser(description="Train DRC Names Models using Research Templates")
parser.add_argument("--name", type=str, required=True, help="Model name to train")
parser.add_argument("--type", type=str, required=True, help="Experiment type")
parser.add_argument("--config", type=str, help="Path to configuration file")
parser.add_argument("--env", type=str, default="development", help="Environment name")
parser.add_argument("--templates", type=str, default="config/research_templates.yaml")
args = parser.parse_args()
try:
# Setup pipeline configuration
config = setup_config(config_path=args.config, env=args.env)
# Load research templates
logging.info(f"Loading research templates from: {args.templates}")
templates = load_research_templates(args.templates)
# Find the specific experiment configuration
logging.info(f"Looking for experiment: name='{args.name}', type='{args.type}'")
experiment_config = find_experiment_config(templates, args.name, args.type)
logging.info(f"Found experiment: {experiment_config.get('name')}")
logging.info(f"Description: {experiment_config.get('description')}")
logging.info(f"Features: {experiment_config.get('features')}")
# Train the model using template configuration
trainer = ModelTrainer(config)
trainer.train_single_model(
model_name=experiment_config.get("name"),
model_type=experiment_config.get("model_type"),
features=experiment_config.get("features"),
model_params=experiment_config.get("model_params", {}),
tags=experiment_config.get("tags", [])
)
logging.info("Training completed successfully!")
return 0
except Exception as e: