
Amaury Cansa 4874b178c9 Name Analysis (#9)

* feat: implement representative sampling by province (~500k records), extract surnames from the first token of name, build letter transition matrices (frequency and probability), add heatmap visualization for transitions, and integrate a Markov chain–based name generator.

* Implemented letter frequency analysis with histograms, computed bigram and trigram frequencies, and displayed the top results in tabular format. Rebuilt the transition probability matrix, and developed a name generator capable of producing realistic outputs based on surname data.

2025-09-24 20:23:40 +02:00

config

feat: add osm data

2025-09-21 16:23:44 +02:00

core

fix: add missing regions in region_mapper

2025-09-23 00:05:35 +02:00

notebooks

Name Analysis (#9)

2025-09-24 20:23:40 +02:00

osm

feat: add osm data

2025-09-21 16:23:44 +02:00

processing

fix: add missing regions in region_mapper

2025-09-23 00:05:35 +02:00

research

feat: add osm data

2025-09-21 16:23:44 +02:00

web

feat: add osm data

2025-09-21 16:23:44 +02:00

.gitattributes

hotfixes

2025-08-16 20:34:45 +02:00

.gitignore

refactoring: add initial pipeline configuration and model classes

2025-08-04 16:12:25 +02:00

main.py

feat: add osm data

2025-09-21 16:23:44 +02:00

Makefile

feat: enhance training pipeline with research templates and experiment configuration

2025-08-08 23:48:55 +02:00

model_notation.md

feat: document models

2025-09-20 23:35:54 +02:00

monitor.py

feat: add osm data

2025-09-21 16:23:44 +02:00

ner.py

feat: add NER testing interface and evaluation statistics handling

2025-08-17 15:33:16 +02:00

README.md

feat: add osm data

2025-09-21 16:23:44 +02:00

requirements.txt

fix: dependencies in requirements.txt

2025-08-19 17:38:56 +02:00

train.py

fix: update default template path in argument parser

2025-08-17 16:31:11 +02:00

README.md

A Culturally-Aware NLP System for Congolese Name Analysis and Gender Inference

Despite the growing success of gender inference models in Natural Language Processing (NLP), these tools often underperform in culturally diverse African contexts because they lack culturally representative training data. This project introduces a comprehensive pipeline for Congolese name analysis, built on a large-scale dataset of over 5 million names from the Democratic Republic of Congo (DRC) annotated with gender and demographic metadata.

Getting Started

Installation & Setup

The instructions and command-line snippets below help you set up the project environment quickly. They assume you have Python 3.11 and Git installed and are working on a Unix-like system (Linux, macOS, etc.).

Using Makefile (Recommended)

git clone https://github.com/bernard-ng/drc-ners-nlp.git
cd drc-ners-nlp

# Setup environment
make setup
make activate

Manual Setup

git clone https://github.com/bernard-ng/drc-ners-nlp.git
cd drc-ners-nlp

# Setup environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies inside the activated virtual environment
pip install --upgrade pip
pip install -r requirements.txt
pip install jupyter notebook ipykernel pytest black flake8 mypy

Data Processing

This project includes a data processing pipeline designed to handle large datasets efficiently, with batching, checkpointing, and parallel processing capabilities. Steps are defined in the drc-ners-nlp/processing/steps directory, and the configuration that enables them is managed through the drc-ners-nlp/config/pipeline.yaml file.
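To illustrate what batching with checkpointing means in practice, here is a minimal sketch. It is not the project's implementation: the function names (`batches`, `run_with_checkpoint`, `clean`), the JSON checkpoint format, and the sample names are all assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterator, List


def batches(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_with_checkpoint(
    items: List[str],
    step: Callable[[str], str],
    checkpoint: Path,
    batch_size: int = 2,
) -> List[str]:
    """Apply `step` batch by batch, saving progress after each batch."""
    # Resume from a previous run if a checkpoint file already exists.
    done: List[str] = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    remaining = items[len(done):]
    for batch in batches(remaining, batch_size):
        done.extend(step(name) for name in batch)
        # Persist everything processed so far (hypothetical JSON format).
        checkpoint.write_text(json.dumps(done))
    return done


def clean(name: str) -> str:
    """Hypothetical cleaning step: trim whitespace and uppercase."""
    return name.strip().upper()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "checkpoint.json"
    names = ["  ilunga ", "kabila", "mwamba  "]  # made-up sample input
    result = run_with_checkpoint(names, clean, path)
    print(result)
```

If a run is interrupted, calling `run_with_checkpoint` again with the same checkpoint path skips the items already recorded and continues with the rest.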

Pipeline Configuration

stages:
  - "data_cleaning"
  - "feature_extraction"
  - "data_splitting"
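One way to picture how an ordered list of stage names like the one above could drive a pipeline is a small registry that maps each name to a function. The sketch below is only an assumption about the general pattern: the stage names come from the configuration above, but the function bodies, the `REGISTRY` dictionary, and `run_pipeline` are hypothetical and do not reflect the code in processing/steps.

```python
from typing import Callable, Dict, List

Record = dict
Step = Callable[[List[Record]], List[Record]]


def data_cleaning(records: List[Record]) -> List[Record]:
    # Placeholder logic: drop blank names, normalize the rest.
    return [{**r, "name": r["name"].strip().lower()} for r in records if r["name"].strip()]


def feature_extraction(records: List[Record]) -> List[Record]:
    # Placeholder logic: add the name length as a feature.
    return [{**r, "length": len(r["name"])} for r in records]


def data_splitting(records: List[Record]) -> List[Record]:
    # Placeholder logic: every fifth record goes to the test split.
    return [{**r, "split": "test" if i % 5 == 4 else "train"} for i, r in enumerate(records)]


# Hypothetical registry: stage name -> function implementing it.
REGISTRY: Dict[str, Step] = {
    "data_cleaning": data_cleaning,
    "feature_extraction": feature_extraction,
    "data_splitting": data_splitting,
}


def run_pipeline(stages: List[str], records: List[Record]) -> List[Record]:
    """Run the enabled stages in the order they are listed."""
    for stage in stages:
        if stage not in REGISTRY:
            raise KeyError(f"unknown stage: {stage}")
        records = REGISTRY[stage](records)
    return records


stages = ["data_cleaning", "feature_extraction", "data_splitting"]
output = run_pipeline(stages, [{"name": " Ilunga "}, {"name": "   "}, {"name": "KABILA"}])
print(output)
```

With this pattern, enabling or disabling a stage is a matter of adding or removing its name from the list, which is why the stage order in the configuration matters.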

Running the Pipeline

python main.py --env production

NER Processing (Optional)

This project implements a custom named entity recognition (NER) pipeline tailored for Congolese names. Its main objective is to accurately identify and tag the different components of a Congolese name, specifically distinguishing between the native part and the surname.

python ner.py --env production

Once you've built and trained the NER model, you can use it to annotate COMPOSE names in the original dataset.
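To make the tagging task concrete, the sketch below builds character-offset spans for the two components of a name, which is a common way to express NER annotations. The labels `NATIVE` and `SURNAME`, the helper `tag_components`, and the sample name with its split are assumptions for illustration only; the project's actual label set and annotation format may differ.

```python
from typing import List, Tuple

# (start offset, end offset, label), with the end offset exclusive.
Span = Tuple[int, int, str]


def tag_components(full_name: str, native: str, surname: str) -> List[Span]:
    """Locate each component in the full name and return labeled spans."""
    spans: List[Span] = []
    for text, label in ((native, "NATIVE"), (surname, "SURNAME")):
        start = full_name.find(text)
        # Skip empty components and components not found in the name.
        if text and start != -1:
            spans.append((start, start + len(text), label))
    return sorted(spans)


# Made-up example; how the name is split here is an assumption.
example = "ilunga kabila jean"
spans = tag_components(example, native="ilunga kabila", surname="jean")
print(spans)
for start, end, label in spans:
    print(label, "->", example[start:end])
```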

Running the Pipeline with NER Annotation

stages:
  - "data_cleaning"
  - "feature_extraction"
  - "ner_annotation"
  - "data_splitting"

Running the Pipeline with LLM Annotation

stages:
  - "data_cleaning"
  - "feature_extraction"
  - "llm_annotation"
  - "data_splitting"

Experiments

This project provides a modular experiment framework (model training and evaluation) for systematic model comparison and research iteration. Models are defined in the drc-ners-nlp/research/models directory. You can define model features, training parameters, and evaluation metrics in the research_templates.yaml file.

Running Experiments

# bigru
python train.py --name="bigru" --type="baseline" --env="production"
python train.py --name="bigru_native" --type="baseline" --env="production"
python train.py --name="bigru_surname" --type="baseline" --env="production"

# cnn
python train.py --name="cnn" --type="baseline" --env="production"
python train.py --name="cnn_native" --type="baseline" --env="production"
python train.py --name="cnn_surname" --type="baseline" --env="production"

# lightgbm
python train.py --name="lightgbm" --type="baseline" --env="production"
python train.py --name="lightgbm_native" --type="baseline" --env="production"
python train.py --name="lightgbm_surname" --type="baseline" --env="production"

# logistic regression
python train.py --name="logistic_regression" --type="baseline" --env="production"
python train.py --name="logistic_regression_native" --type="baseline" --env="production"
python train.py --name="logistic_regression_surname" --type="baseline" --env="production"

# lstm
python train.py --name="lstm" --type="baseline" --env="production"
python train.py --name="lstm_native" --type="baseline" --env="production"
python train.py --name="lstm_surname" --type="baseline" --env="production"

# random forest
python train.py --name="random_forest" --type="baseline" --env="production"
python train.py --name="random_forest_native" --type="baseline" --env="production"
python train.py --name="random_forest_surname" --type="baseline" --env="production"

# svm
python train.py --name="svm" --type="baseline" --env="production"
python train.py --name="svm_native" --type="baseline" --env="production"
python train.py --name="svm_surname" --type="baseline" --env="production"

# naive bayes
python train.py --name="naive_bayes" --type="baseline" --env="production"
python train.py --name="naive_bayes_native" --type="baseline" --env="production"
python train.py --name="naive_bayes_surname" --type="baseline" --env="production"

# transformer
python train.py --name="transformer" --type="baseline" --env="production"
python train.py --name="transformer_native" --type="baseline" --env="production"
python train.py --name="transformer_surname" --type="baseline" --env="production"

# xgboost
python train.py --name="xgboost" --type="baseline" --env="production"
python train.py --name="xgboost_native" --type="baseline" --env="production"
python train.py --name="xgboost_surname" --type="baseline" --env="production"
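The commands above follow a regular pattern: ten model names, each with a plain, `_native`, and `_surname` variant. As a convenience, the list could be generated instead of typed out. The sketch below is a hypothetical helper script, not part of the repository; it only uses the `--name`, `--type`, and `--env` flags shown above, and `build_commands` and `run_all` are names invented here.

```python
import shlex
import subprocess
from typing import List

MODELS = [
    "bigru", "cnn", "lightgbm", "logistic_regression", "lstm",
    "random_forest", "svm", "naive_bayes", "transformer", "xgboost",
]
VARIANTS = ["", "_native", "_surname"]


def build_commands(kind: str = "baseline", env: str = "production") -> List[List[str]]:
    """Return one train.py invocation per model and variant."""
    return [
        ["python", "train.py", f"--name={model}{variant}", f"--type={kind}", f"--env={env}"]
        for model in MODELS
        for variant in VARIANTS
    ]


def run_all(commands: List[List[str]]) -> None:
    """Run each command in turn, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


commands = build_commands()
# Print the commands for review; call run_all(commands) to execute them.
for argv in commands:
    print(shlex.join(argv))
```

Printing first and running second keeps the script safe to try out: nothing is trained until `run_all(commands)` is called explicitly from the repository root.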

Web Interface

This project includes a user-friendly web interface built with Streamlit, allowing non-technical users to run experiments and make predictions without needing to understand the underlying code.

Running the Web Interface

streamlit run web/app.py

Contributors

Acknowledgements

Map Visualization: https://data.humdata.org/dataset/anciennes-provinces-rdc-old-provinces-drc