# A Culturally-Aware NLP System for Congolese Name Analysis and Gender Inference
Despite the growing success of gender inference models in Natural Language Processing (NLP), these tools often
underperform when applied to culturally diverse African contexts because culturally representative training data is
lacking.
This project introduces a comprehensive pipeline for Congolese name analysis, built on a large-scale dataset of over 5
million names from the Democratic Republic of Congo (DRC) annotated with gender and demographic metadata.
## Getting Started
### Installation & Setup
**Unix-based**
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/bernard-ng/drc-ners-nlp.git
cd drc-ners-nlp
uv sync
```
**macOS & Windows**
```bash
docker compose build
docker compose run --rm app
docker compose run --rm app ners pipeline run --env=production
docker compose run --rm app ners research train --name=lightgbm --type=baseline --env=production
docker compose run --rm --service-ports app ners web run --env=production
```
## Data Processing
This project includes a data processing pipeline designed to handle large datasets efficiently with batching,
checkpointing, and parallel processing.
Steps are defined in the `drc-ners-nlp/processing/steps` directory, and the configuration that enables them is managed
through the `drc-ners-nlp/config/pipeline.yaml` file.
**Pipeline Configuration**
```yaml
stages:
- "data_cleaning"
- "data_selection"
- "feature_extraction"
- "data_splitting"
```
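To check which stages are enabled before starting a run, the list can be read straight from the configuration file. The following is a minimal sketch, not part of the project: `list_stages` is a hypothetical helper, and it assumes `stages:` is a top-level block list laid out as in the excerpt above.

```bash
# Hypothetical helper (not part of the ners CLI): print the entries of the
# top-level `stages:` list of a YAML file, one per line.
# Assumes the simple block-list layout shown above.
list_stages() {
  awk '
    /^stages:/ { in_stages = 1; next }           # start of the list
    in_stages && /^[ \t]*-/ {                    # a list item
      gsub(/^[ \t]*-[ \t]*|"/, "")               # drop the dash and quotes
      print
      next
    }
    in_stages && /^[^ \t#]/ { in_stages = 0 }    # next top-level key ends the list
  ' "$1"
}

# Usage, assuming the current directory is the repository root:
#   list_stages config/pipeline.yaml
```

This is a plain-text scan rather than a real YAML parser, so it will not handle inline lists such as `stages: [a, b]`.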
**Running the Pipeline**
```bash
uv run ners pipeline run --env="production"
```
## Experiments
This project provides a modular experiment framework (model training and evaluation) for systematic model comparison
and research iteration. Models are defined in the `drc-ners-nlp/research/models` directory.
You can define model features, training parameters, and evaluation metrics in the `research_templates.yaml` file.
**Running Experiments**
```bash
# bigru
uv run ners research train --name="bigru" --type="baseline" --env="production"
uv run ners research train --name="bigru_native" --type="baseline" --env="production"
uv run ners research train --name="bigru_surname" --type="baseline" --env="production"
# cnn
uv run ners research train --name="cnn" --type="baseline" --env="production"
uv run ners research train --name="cnn_native" --type="baseline" --env="production"
uv run ners research train --name="cnn_surname" --type="baseline" --env="production"
# lightgbm
uv run ners research train --name="lightgbm" --type="baseline" --env="production"
uv run ners research train --name="lightgbm_native" --type="baseline" --env="production"
uv run ners research train --name="lightgbm_surname" --type="baseline" --env="production"
# logistic regression
uv run ners research train --name="logistic_regression" --type="baseline" --env="production"
uv run ners research train --name="logistic_regression_native" --type="baseline" --env="production"
uv run ners research train --name="logistic_regression_surname" --type="baseline" --env="production"
# lstm
uv run ners research train --name="lstm" --type="baseline" --env="production"
uv run ners research train --name="lstm_native" --type="baseline" --env="production"
uv run ners research train --name="lstm_surname" --type="baseline" --env="production"
# random forest
uv run ners research train --name="random_forest" --type="baseline" --env="production"
uv run ners research train --name="random_forest_native" --type="baseline" --env="production"
uv run ners research train --name="random_forest_surname" --type="baseline" --env="production"
# svm
uv run ners research train --name="svm" --type="baseline" --env="production"
uv run ners research train --name="svm_native" --type="baseline" --env="production"
uv run ners research train --name="svm_surname" --type="baseline" --env="production"
# naive bayes
uv run ners research train --name="naive_bayes" --type="baseline" --env="production"
uv run ners research train --name="naive_bayes_native" --type="baseline" --env="production"
uv run ners research train --name="naive_bayes_surname" --type="baseline" --env="production"
# transformer
uv run ners research train --name="transformer" --type="baseline" --env="production"
uv run ners research train --name="transformer_native" --type="baseline" --env="production"
uv run ners research train --name="transformer_surname" --type="baseline" --env="production"
# xgboost
uv run ners research train --name="xgboost" --type="baseline" --env="production"
uv run ners research train --name="xgboost_native" --type="baseline" --env="production"
uv run ners research train --name="xgboost_surname" --type="baseline" --env="production"
```
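The commands above follow a regular pattern: ten model names, each with a base, `_native`, and `_surname` variant. As a convenience, they can be generated with a loop. This is only a sketch: `train_commands` is a hypothetical helper, not part of the `ners` CLI, and it prints the commands rather than running them.

```bash
# Hypothetical helper: print one training command per model/variant pair,
# in the same order as the list above (30 commands in total).
train_commands() {
  for model in bigru cnn lightgbm logistic_regression lstm \
               random_forest svm naive_bayes transformer xgboost; do
    for suffix in "" _native _surname; do
      printf 'uv run ners research train --name="%s" --type="baseline" --env="production"\n' \
        "${model}${suffix}"
    done
  done
}

# Dry run: show the commands without executing anything.
train_commands
```

After reviewing the output, the experiments can be run sequentially with `train_commands | sh`.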
## TensorFlow on macOS (Intel) with uv
TensorFlow no longer publishes wheels for macOS Intel. To keep using uv and run TensorFlow reliably, use a Linux container with TensorFlow preinstalled, and install the project code inside the container with only the extras it needs.
### One-time build
```bash
docker compose -f docker/compose.tf.yml build
```
If you see a message like `tensorflow/tensorflow:<tag>: not found`, update `docker/Dockerfile.tf-cpu` to a tag that exists (e.g., `2.17.0`) and rebuild:
```bash
sed -n '1,20p' docker/Dockerfile.tf-cpu # verify the FROM line
docker pull tensorflow/tensorflow:2.17.0 # quick availability check
docker compose -f docker/compose.tf.yml build
```
### Start a shell with uv and TF available
```bash
docker compose -f docker/compose.tf.yml run --rm tf bash
```
Inside the container:
```bash
# Install project in editable mode without pulling full deps
uv pip install -e . --no-deps
# Install only what research needs alongside TensorFlow
uv pip install typer pandas scikit-learn seaborn plotly
# Sanity check
uv run python -c "import tensorflow as tf; print(tf.__version__)"
# Run an experiment
uv run ners research train --name="lstm" --type="baseline" --env="production"
```
## Web Interface
This project includes a user-friendly web interface built with Streamlit, allowing non-technical users to run
experiments and make predictions without needing to understand the underlying code.
### Running the Web Interface
```bash
uv run ners web run --env="production"
```
## Contributors
<a href="https://github.com/bernard-ng/drc-ners-nlp/graphs/contributors" title="show all contributors">
<img src="https://contrib.rocks/image?repo=bernard-ng/drc-ners-nlp" alt="contributors"/>
</a>
## Acknowledgements
- Map Visualization: [https://data.humdata.org/dataset/anciennes-provinces-rdc-old-provinces-drc](https://data.humdata.org/dataset/anciennes-provinces-rdc-old-provinces-drc)