refactoring: uv

This commit is contained in:
2025-10-05 18:14:15 +02:00
parent f3b06fbd07
commit 9dd4f759b3
120 changed files with 5525 additions and 3366 deletions
@@ -10,37 +10,23 @@ million names from the Democratic Republic of Congo (DRC) annotated with gender
### Installation & Setup
Instructions and command-line snippets below are provided to help you set up the project environment quickly and
efficiently, assuming you have Python 3.11 and Git installed and working on a Unix-like system (Linux, macOS, etc.).

**Unix-based**
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/bernard-ng/drc-ners-nlp.git
cd drc-ners-nlp

# Setup environment
uv sync
```

**macOS & Windows**
```bash
git clone https://github.com/bernard-ng/drc-ners-nlp.git
cd drc-ners-nlp

# Setup environment
docker compose build
docker compose run --rm app

# Examples
docker compose run --rm app ners pipeline run --env=production
docker compose run --rm app ners research train --name=lightgbm --type=baseline --env=production
docker compose run --rm --service-ports app ners web run --env=production
```
## Data Processing
@@ -55,6 +41,7 @@ the `drc-ners-nlp/config/pipeline.yaml` file.
```yaml
stages:
- "data_cleaning"
- "data_selection"
- "feature_extraction"
- "data_splitting"
```
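For intuition, the `stages` list can be thought of as an ordered sequence of transformations applied to the dataset. The sketch below is hypothetical: the registry and the placeholder stage functions are illustrative, not the project's actual API.

```python
from typing import Callable, Dict, List

# Illustrative stage registry; the real pipeline's stages are richer than
# these placeholder functions.
STAGE_REGISTRY: Dict[str, Callable[[list], list]] = {
    "data_cleaning": lambda rows: [r.strip() for r in rows],
    "data_selection": lambda rows: [r for r in rows if r],
    "feature_extraction": lambda rows: [(r, len(r)) for r in rows],
    "data_splitting": lambda rows: rows,  # train/test split elided
}

def run_pipeline(stages: List[str], rows: list) -> list:
    # Stages run in the order they appear in pipeline.yaml
    for name in stages:
        rows = STAGE_REGISTRY[name](rows)
    return rows

print(run_pipeline(["data_cleaning", "data_selection"], [" Ilunga ", ""]))
```

Reordering or removing entries in the YAML changes which transformations run and in what order, without touching code.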
@@ -62,37 +49,7 @@ stages:
**Running the Pipeline**
```bash
uv run ners pipeline run --env="production"
```
## NER Processing (Optional)
This project implements a custom named entity recognition (NER) pipeline tailored for Congolese names.
Its main objective is to accurately identify and tag the different components of a Congolese name,
specifically distinguishing between the native part and the surname.
```bash
python ner.py --env production
```
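For intuition, here is a toy illustration of the kind of token tagging the model performs. The label names and the lookup rule below are assumptions for illustration only; the real component labels are learned by the NER model, not produced by a dictionary.

```python
# Toy illustration only: tag each token of a name as surname or native part.
# A real model predicts these labels; this lookup rule is a stand-in.
def tag_name(tokens, known_surnames):
    return [
        (t, "SURNAME" if t.lower() in known_surnames else "NATIVE")
        for t in tokens
    ]

print(tag_name(["Kabila", "Joseph"], {"joseph"}))
```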
Once you've built and trained the NER model, you can use it to annotate **composed** names in the original dataset.
**Running the Pipeline with NER Annotation**
```yaml
stages:
- "data_cleaning"
- "feature_extraction"
- "ner_annotation"
- "data_splitting"
```
**Running the Pipeline with LLM Annotation**
```yaml
stages:
- "data_cleaning"
- "feature_extraction"
- "llm_annotation"
- "data_splitting"
```
```bash
uv run ners pipeline run --env="production"
```
## Experiments
@@ -105,54 +62,94 @@ you can define model features, training parameters, and evaluation metrics in th
```bash
# bigru
uv run ners research train --name="bigru" --type="baseline" --env="production"
uv run ners research train --name="bigru_native" --type="baseline" --env="production"
uv run ners research train --name="bigru_surname" --type="baseline" --env="production"

# cnn
uv run ners research train --name="cnn" --type="baseline" --env="production"
uv run ners research train --name="cnn_native" --type="baseline" --env="production"
uv run ners research train --name="cnn_surname" --type="baseline" --env="production"

# lightgbm
uv run ners research train --name="lightgbm" --type="baseline" --env="production"
uv run ners research train --name="lightgbm_native" --type="baseline" --env="production"
uv run ners research train --name="lightgbm_surname" --type="baseline" --env="production"

# logistic regression
uv run ners research train --name="logistic_regression" --type="baseline" --env="production"
uv run ners research train --name="logistic_regression_native" --type="baseline" --env="production"
uv run ners research train --name="logistic_regression_surname" --type="baseline" --env="production"

# lstm
uv run ners research train --name="lstm" --type="baseline" --env="production"
uv run ners research train --name="lstm_native" --type="baseline" --env="production"
uv run ners research train --name="lstm_surname" --type="baseline" --env="production"

# random forest
uv run ners research train --name="random_forest" --type="baseline" --env="production"
uv run ners research train --name="random_forest_native" --type="baseline" --env="production"
uv run ners research train --name="random_forest_surname" --type="baseline" --env="production"

# svm
uv run ners research train --name="svm" --type="baseline" --env="production"
uv run ners research train --name="svm_native" --type="baseline" --env="production"
uv run ners research train --name="svm_surname" --type="baseline" --env="production"

# naive bayes
uv run ners research train --name="naive_bayes" --type="baseline" --env="production"
uv run ners research train --name="naive_bayes_native" --type="baseline" --env="production"
uv run ners research train --name="naive_bayes_surname" --type="baseline" --env="production"

# transformer
uv run ners research train --name="transformer" --type="baseline" --env="production"
uv run ners research train --name="transformer_native" --type="baseline" --env="production"
uv run ners research train --name="transformer_surname" --type="baseline" --env="production"

# xgboost
uv run ners research train --name="xgboost" --type="baseline" --env="production"
uv run ners research train --name="xgboost_native" --type="baseline" --env="production"
uv run ners research train --name="xgboost_surname" --type="baseline" --env="production"
```
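Since the commands above differ only in the model name and variant suffix, they can be generated with a loop. This is a convenience sketch, not part of the project; drop the leading `echo` to actually execute each command.

```shell
# Sketch: print (or run, without the echo) every baseline/variant combination.
for model in bigru cnn lightgbm logistic_regression lstm random_forest svm naive_bayes transformer xgboost; do
  for variant in "" "_native" "_surname"; do
    echo uv run ners research train --name="${model}${variant}" --type="baseline" --env="production"
  done
done
```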
## TensorFlow on macOS (Intel) with uv
TensorFlow no longer publishes wheels for macOS Intel. To keep using uv and run TF reliably, use a Linux container with TF preinstalled and install project code with minimal extras inside the container.
### One-time build
```bash
docker compose -f docker/compose.tf.yml build
```
If you see a message like `tensorflow/tensorflow:<tag>: not found`, update `docker/Dockerfile.tf-cpu` to a tag that exists (e.g., `2.17.0`) and rebuild:
```bash
sed -n '1,20p' docker/Dockerfile.tf-cpu   # verify the FROM line
docker pull tensorflow/tensorflow:2.17.0  # quick availability check
docker compose -f docker/compose.tf.yml build
```
### Start a shell with uv and TF available
```bash
docker compose -f docker/compose.tf.yml run --rm tf bash
```
Inside the container:
```bash
# Install project in editable mode without pulling full deps
uv pip install -e . --no-deps
# Install only what research needs alongside TensorFlow
uv pip install typer pandas scikit-learn seaborn plotly
# Sanity check
uv run python -c "import tensorflow as tf; print(tf.__version__)"
# Run an experiment
uv run ners research train --name="lstm" --type="baseline" --env="production"
```
## Web Interface
@@ -163,60 +160,9 @@ experiments and make predictions without needing to understand the underlying co
### Running the Web Interface
```bash
uv run ners web run --env="production"
```
## GPU Acceleration
This project can leverage GPUs for faster training when supported libraries and hardware are available.
- TensorFlow/Keras models (BiGRU, LSTM, CNN, Transformer)
  - Uses GPU automatically if a TensorFlow GPU build is installed.
  - The code enables safe GPU memory growth by default; optionally enable mixed precision for additional speed:
    - Add `mixed_precision: true` in the experiment `model_params` (e.g., in `config/research_templates.yaml`).
    - The final layer outputs are set to float32 for numerical stability under mixed precision.
- spaCy NER
  - Automatically prefers GPU if available; otherwise falls back to CPU.
  - Ensure a compatible CUDA-enabled spaCy/thinc stack is installed to use GPU.
- XGBoost
  - Enable GPU by adding `use_gpu: true` to the experiment `model_params` (sets `tree_method: gpu_hist` and `predictor: gpu_predictor`).
- LightGBM
  - Enable GPU by adding `use_gpu: true` to the experiment `model_params` (sets `device: gpu`). Optional: `gpu_platform_id`, `gpu_device_id`.
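The `use_gpu` flag can be understood as a small translation layer from experiment config to library-specific settings. The sketch below is a hypothetical illustration of that mapping, not the project's implementation; the function name is an assumption.

```python
# Hypothetical sketch: expand a generic `use_gpu` flag in model_params into
# the per-library parameter names described above.
def resolve_gpu_params(model_type: str, model_params: dict) -> dict:
    params = dict(model_params)  # don't mutate the caller's config
    if params.pop("use_gpu", False):
        if model_type == "xgboost":
            params.update(tree_method="gpu_hist", predictor="gpu_predictor")
        elif model_type == "lightgbm":
            params.update(device="gpu")
        # TF/Keras models pick up the GPU automatically; nothing to map here.
    return params

print(resolve_gpu_params("xgboost", {"n_estimators": 200, "use_gpu": True}))
```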
Example template snippet (GPU on):
```yaml
- name: "lstm_gpu"
  description: "LSTM with GPU + mixed precision"
  model_type: "lstm"
  features: ["full_name"]
  model_params:
    embedding_dim: 128
    lstm_units: 64
    epochs: 5
    batch_size: 128
    use_gpu: true
    mixed_precision: true
  tags: ["gpu", "mixed_precision"]

- name: "xgboost_gpu"
  description: "XGBoost with GPU"
  model_type: "xgboost"
  features: ["full_name"]
  model_params:
    n_estimators: 200
    use_gpu: true
```
Notes:
- Install CUDA-enabled binaries for TensorFlow/spaCy/LightGBM/XGBoost to actually use the GPU.
- If GPU is requested but not available, training will proceed on CPU with a warning.
## Contributors
<a href="https://github.com/bernard-ng/drc-ners-nlp/graphs/contributors" title="show all contributors">