
# A Culturally-Aware NLP System for Congolese Name Analysis and Gender Inference

Despite the growing success of gender inference models in Natural Language Processing (NLP), these tools often underperform when applied to culturally diverse African contexts due to the lack of culturally-representative training data. This project introduces a comprehensive pipeline for Congolese name analysis with a large-scale dataset of over 5 million names from the Democratic Republic of Congo (DRC) annotated with gender and demographic metadata.

## Getting Started

### Installation & Setup

#### Unix-based

```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository and install dependencies
git clone https://github.com/bernard-ng/drc-ners-nlp.git
cd drc-ners-nlp

uv sync
```

#### macOS & Windows (Docker)

```bash
docker compose build
docker compose run --rm app
docker compose run --rm app ners pipeline run --env=production
docker compose run --rm app ners research train --name=lightgbm --type=baseline --env=production
docker compose run --rm --service-ports app ners web run --env=production
```

## Data Processing

This project includes a robust data processing pipeline designed to handle large datasets efficiently, with batching, checkpointing, and parallel processing. Steps are defined in the `drc-ners-nlp/processing/steps` directory, and the configuration that enables them is managed through the `drc-ners-nlp/config/pipeline.yaml` file.
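As an illustration of the batching-plus-checkpointing idea (a hypothetical sketch, not the project's actual API — `run_step` and the JSONL checkpoint format are invented here):

```python
import json
from pathlib import Path

def run_step(records, step_fn, out_path, batch_size=1000):
    """Apply step_fn to records in batches, appending each result to a
    JSONL file. The line count doubles as a checkpoint: an interrupted
    run restarts after the last record already written."""
    out = Path(out_path)
    done = sum(1 for _ in out.open()) if out.exists() else 0
    with out.open("a") as fh:
        for i in range(done, len(records), batch_size):
            for rec in records[i:i + batch_size]:
                fh.write(json.dumps(step_fn(rec)) + "\n")
    return [json.loads(line) for line in out.open()]
```

Re-running a completed step is then a no-op, which is what makes multi-stage runs over millions of rows restartable.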

### Pipeline Configuration

```yaml
stages:
  - "data_cleaning"
  - "data_selection"
  - "feature_extraction"
  - "data_splitting"
```
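Each stage name in the config maps to a step implementation, and the pipeline folds the data through them in order. A minimal dispatch sketch (the stage bodies here are simplified placeholders, not the real implementations in `processing/steps`):

```python
# Toy implementations keyed by the stage names from pipeline.yaml.
STAGES = {
    "data_cleaning":      lambda rows: [r.strip().lower() for r in rows],
    "data_selection":     lambda rows: [r for r in rows if r],
    "feature_extraction": lambda rows: [{"name": r, "length": len(r)} for r in rows],
    "data_splitting":     lambda rows: rows,  # real code would split train/test
}

def run_pipeline(rows, stage_names):
    """Run the enabled stages in the order they appear in the config."""
    for name in stage_names:
        rows = STAGES[name](rows)
    return rows
```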

### Running the Pipeline

```bash
uv run ners pipeline run --env="production"
```

## Experiments

This project provides a modular experiment framework (model training and evaluation) for systematic model comparison and research iteration. Models are defined in the `drc-ners-nlp/research/models` directory; model features, training parameters, and evaluation metrics are defined in the `research_templates.yaml` file.
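Judging from the experiment names, each model comes in three feature variants: full name, `_native`, and `_surname`. The split rule sketched here is purely illustrative (the actual variant definitions live in `research_templates.yaml`):

```python
def name_text(full_name, variant="baseline"):
    """Pick the part of the name a variant trains on (illustrative rule:
    last token = surname, the remaining tokens = native/given names)."""
    parts = full_name.lower().split()
    if variant == "surname":
        return parts[-1]
    if variant == "native":
        return " ".join(parts[:-1]) or parts[0]
    return " ".join(parts)

def char_bigrams(text):
    """Character bigrams, a common lightweight feature for name models."""
    return [text[i:i + 2] for i in range(len(text) - 1)]
```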

### Running Experiments

```bash
# bigru
uv run ners research train --name="bigru" --type="baseline" --env="production"
uv run ners research train --name="bigru_native" --type="baseline" --env="production"
uv run ners research train --name="bigru_surname" --type="baseline" --env="production"

# cnn
uv run ners research train --name="cnn" --type="baseline" --env="production"
uv run ners research train --name="cnn_native" --type="baseline" --env="production"
uv run ners research train --name="cnn_surname" --type="baseline" --env="production"

# lightgbm
uv run ners research train --name="lightgbm" --type="baseline" --env="production"
uv run ners research train --name="lightgbm_native" --type="baseline" --env="production"
uv run ners research train --name="lightgbm_surname" --type="baseline" --env="production"

# logistic regression
uv run ners research train --name="logistic_regression" --type="baseline" --env="production"
uv run ners research train --name="logistic_regression_native" --type="baseline" --env="production"
uv run ners research train --name="logistic_regression_surname" --type="baseline" --env="production"

# lstm
uv run ners research train --name="lstm" --type="baseline" --env="production"
uv run ners research train --name="lstm_native" --type="baseline" --env="production"
uv run ners research train --name="lstm_surname" --type="baseline" --env="production"

# random forest
uv run ners research train --name="random_forest" --type="baseline" --env="production"
uv run ners research train --name="random_forest_native" --type="baseline" --env="production"
uv run ners research train --name="random_forest_surname" --type="baseline" --env="production"

# svm
uv run ners research train --name="svm" --type="baseline" --env="production"
uv run ners research train --name="svm_native" --type="baseline" --env="production"
uv run ners research train --name="svm_surname" --type="baseline" --env="production"

# naive bayes
uv run ners research train --name="naive_bayes" --type="baseline" --env="production"
uv run ners research train --name="naive_bayes_native" --type="baseline" --env="production"
uv run ners research train --name="naive_bayes_surname" --type="baseline" --env="production"

# transformer
uv run ners research train --name="transformer" --type="baseline" --env="production"
uv run ners research train --name="transformer_native" --type="baseline" --env="production"
uv run ners research train --name="transformer_surname" --type="baseline" --env="production"

# xgboost
uv run ners research train --name="xgboost" --type="baseline" --env="production"
uv run ners research train --name="xgboost_native" --type="baseline" --env="production"
uv run ners research train --name="xgboost_surname" --type="baseline" --env="production"
```
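Since these 30 commands follow a single model × variant pattern, they can be generated with a loop; this sketch only prints each command (remove `echo` to actually run them):

```bash
for model in bigru cnn lightgbm logistic_regression lstm \
             random_forest svm naive_bayes transformer xgboost; do
  for suffix in "" "_native" "_surname"; do
    echo uv run ners research train --name="${model}${suffix}" --type="baseline" --env="production"
  done
done
```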

## TensorFlow on macOS (Intel) with uv

TensorFlow no longer publishes wheels for macOS on Intel. To keep using uv and run TensorFlow reliably, use a Linux container with TensorFlow preinstalled and install the project code with minimal extras inside the container.

### One-time build

```bash
docker compose -f docker/compose.tf.yml build
```

If you see a message like `tensorflow/tensorflow:<tag>: not found`, update `docker/Dockerfile.tf-cpu` to a tag that exists (e.g., `2.17.0`) and rebuild:

```bash
sed -n '1,20p' docker/Dockerfile.tf-cpu  # verify the FROM line
docker pull tensorflow/tensorflow:2.17.0 # quick availability check
docker compose -f docker/compose.tf.yml build
```

### Start a shell with uv and TF available

```bash
docker compose -f docker/compose.tf.yml run --rm tf bash
```
Inside the container:

```bash
# Install the project in editable mode without pulling full deps
uv pip install -e . --no-deps

# Install only what research needs alongside TensorFlow
uv pip install typer pandas scikit-learn seaborn plotly

# Sanity check
uv run python -c "import tensorflow as tf; print(tf.__version__)"

# Run an experiment
uv run ners research train --name="lstm" --type="baseline" --env="production"
```

## Web Interface

This project includes a user-friendly web interface built with Streamlit, allowing non-technical users to run experiments and make predictions without needing to understand the underlying code.

### Running the Web Interface

```bash
uv run ners web run --env="production"
```

## Contributors


## Acknowledgements
