refactor: include province and annotation pipeline
This commit is contained in:
@@ -12,28 +12,44 @@ Experiments conducted on custom evaluation sets, including multilingual and code
This work demonstrates the importance of culturally grounded resources in reducing bias and improving performance in NLP systems applied to underrepresented regions. Our findings open new directions for inclusive language technologies in African contexts and contribute a valuable resource for future research in regional linguistics, onomastics, and identity-aware artificial intelligence.
# Usage

## Installation

```bash
git clone https://github.com/bernard-ng/drc-ners-nlp.git
cd drc-ners-nlp

python3 -m venv .venv
source .venv/bin/activate
cp .env .env.local

pip install -r requirements.txt
```
## Dataset

### Preparation

| Name              | Description                                        | Default |
|-------------------|----------------------------------------------------|---------|
| --split_eval      | Split into evaluation and featured datasets        | True    |
| --no-split_eval   | Do not split into evaluation and featured datasets |         |
| --split_by_sex    | Split by sex into male/female datasets             | True    |
| --no-split_by_sex | Do not split by sex into male/female datasets      |         |

```bash
python -m processing.prepare --split_eval --split_by_sex
```
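With `--split_by_sex`, preparation writes separate male and female datasets. A minimal sketch of that split, assuming rows are dicts with a `sex` column holding `m`/`f` (the `split_by_sex` helper here is hypothetical, not the module's actual code):

```python
def split_by_sex(rows):
    """Split rows into male/female subsets on the 'sex' column (hypothetical helper)."""
    males = [r for r in rows if r.get("sex") == "m"]
    females = [r for r in rows if r.get("sex") == "f"]
    return males, females

rows = [
    {"name": "Ilunga Ngandu", "sex": "m"},
    {"name": "Tshisekedi", "sex": "m"},
    {"name": "Musenga", "sex": "f"},
]
males, females = split_by_sex(rows)
print(len(males), len(females))  # 2 1
```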
### Annotation

Arguments:

| Name        | Description              | Default     |
|-------------|--------------------------|-------------|
| --llm_model | Ollama model name to use | llama3.2:3b |

Example:

```bash
python -m processing.annotate --llm_model=mistral7b
```
## Experiments

### Training

| Name      | Description              | Default            |
|-----------|--------------------------|--------------------|
| --dataset | Path to the dataset file | names_featured.csv |
@@ -50,22 +66,18 @@ Arguments:

Examples:

```bash
python -m pipeline.gender.models.lstm --size 1000000 --save
python -m pipeline.gender.models.logreg --size 1000000 --save
python -m pipeline.gender.models.transformer --size 1000000 --save
```

```bash
python -m pipeline.gender.models.lstm --size 1000000 --balanced --save
python -m pipeline.gender.models.logreg --size 1000000 --balanced --save
python -m pipeline.gender.models.transformer --size 1000000 --balanced --save
```
### Evaluation

Arguments:

| Name    | Description                              | Default    |
|---------|------------------------------------------|------------|
| --model | Model type: logreg, lstm, or transformer | (required) |
@@ -77,15 +89,12 @@ Arguments:

Examples:

```bash
python -m pipeline.gender.eval --dataset names_evaluations.csv --model logreg
python -m pipeline.gender.eval --dataset names_evaluations.csv --model lstm
python -m pipeline.gender.eval --dataset names_evaluations.csv --model transformer
```
### Inference

Arguments:

| Name    | Description                              | Default    |
|---------|------------------------------------------|------------|
| --model | Model type: logreg, lstm, or transformer | (required) |
@@ -95,7 +104,7 @@ Arguments:

Examples:

```bash
python -m pipeline.gender.predict --model logreg --names "Tshisekedi"
python -m pipeline.gender.predict --model lstm --names "Ilunga Ngandu"
python -m pipeline.gender.predict --model transformer --names "musenga wa musenga"
```
-140
@@ -1,140 +0,0 @@
+149
-6
@@ -1,6 +1,7 @@
import csv
import io
import json
import logging
import os
import pickle
from typing import List, Dict
@@ -16,15 +17,157 @@ GENDER_RESULT_DIR = os.path.join(ROOT_DIR, 'gender', 'results')
NER_MODELS_DIR = os.path.join(MODELS_DIR, 'ner')
NER_RESULT_DIR = os.path.join(ROOT_DIR, 'ner', 'results')
REGION_MAPPING = {
    # Kinshasa
    "kinshasa": ("KINSHASA", "KINSHASA"),
    "kinshasa-centre": ("KINSHASA", "KINSHASA"),
    "kinshasa-est": ("KINSHASA", "KINSHASA"),
    "kinshasa-funa": ("KINSHASA", "KINSHASA"),
    "kinshasa-lukunga": ("KINSHASA", "KINSHASA"),
    "kinshasa-mont-amba": ("KINSHASA", "KINSHASA"),
    "kinshasa-ouest": ("KINSHASA", "KINSHASA"),
    "kinshasa-plateau": ("KINSHASA", "KINSHASA"),
    "kinshasa-tshangu": ("KINSHASA", "KINSHASA"),

    # Bas-Congo → Kongo-Central → BAS-CONGO
    "bas-congo": ("KONGO-CENTRAL", "BAS-CONGO"),
    "bas-congo-1": ("KONGO-CENTRAL", "BAS-CONGO"),
    "bas-congo-2": ("KONGO-CENTRAL", "BAS-CONGO"),
    "kongo-central": ("KONGO-CENTRAL", "BAS-CONGO"),
    "kongo-central-1": ("KONGO-CENTRAL", "BAS-CONGO"),
    "kongo-central-2": ("KONGO-CENTRAL", "BAS-CONGO"),
    "kongo-central-3": ("KONGO-CENTRAL", "BAS-CONGO"),

    # Kwilu, Kwango, Mai-Ndombe → BANDUNDU
    "bandundu": ("BANDUNDU", "BANDUNDU"),
    "bandundu-1": ("BANDUNDU", "BANDUNDU"),
    "bandundu-2": ("BANDUNDU", "BANDUNDU"),
    "bandundu-3": ("BANDUNDU", "BANDUNDU"),
    "kwilu": ("KWILU", "BANDUNDU"),
    "kwilu-1": ("KWILU", "BANDUNDU"),
    "kwilu-2": ("KWILU", "BANDUNDU"),
    "kwilu-3": ("KWILU", "BANDUNDU"),
    "kwango": ("KWANGO", "BANDUNDU"),
    "kwango-1": ("KWANGO", "BANDUNDU"),
    "kwango-2": ("KWANGO", "BANDUNDU"),
    "mai-ndombe": ("MAI-NDOMBE", "BANDUNDU"),
    "mai-ndombe-1": ("MAI-NDOMBE", "BANDUNDU"),
    "mai-ndombe-2": ("MAI-NDOMBE", "BANDUNDU"),
    "mai-ndombe-3": ("MAI-NDOMBE", "BANDUNDU"),

    # Katanga → HAUT-KATANGA, HAUT-LOMAMI, LUALABA, TANGANYIKA
    "haut-katanga": ("HAUT-KATANGA", "KATANGA"),
    "haut-katanga-1": ("HAUT-KATANGA", "KATANGA"),
    "haut-katanga-2": ("HAUT-KATANGA", "KATANGA"),
    "haut-lomami": ("HAUT-LOMAMI", "KATANGA"),
    "haut-lomami-1": ("HAUT-LOMAMI", "KATANGA"),
    "haut-lomami-2": ("HAUT-LOMAMI", "KATANGA"),
    "lualaba": ("LUALABA", "KATANGA"),
    "lualaba-1": ("LUALABA", "KATANGA"),
    "lualaba-2": ("LUALABA", "KATANGA"),
    "lualaba-74-corrige-922a": ("LUALABA", "KATANGA"),
    "tanganyika": ("TANGANYIKA", "KATANGA"),
    "tanganyika-1": ("TANGANYIKA", "KATANGA"),
    "tanganyika-2": ("TANGANYIKA", "KATANGA"),

    # Equateur → MONGALA, NORD-UBANGI, SUD-UBANGI, TSHUAPA
    "equateur": ("EQUATEUR", "EQUATEUR"),
    "equateur-1": ("EQUATEUR", "EQUATEUR"),
    "equateur-2": ("EQUATEUR", "EQUATEUR"),
    "equateur-3": ("EQUATEUR", "EQUATEUR"),
    "equateur-4": ("EQUATEUR", "EQUATEUR"),
    "equateur-5": ("EQUATEUR", "EQUATEUR"),
    "mongala": ("MONGALA", "EQUATEUR"),
    "mongala-1": ("MONGALA", "EQUATEUR"),
    "mongala-2": ("MONGALA", "EQUATEUR"),
    "nord-ubangi": ("NORD-UBANGI", "EQUATEUR"),
    "nord-ubangi-1": ("NORD-UBANGI", "EQUATEUR"),
    "nord-ubangi-2": ("NORD-UBANGI", "EQUATEUR"),
    "sud-ubangi": ("SUD-UBANGI", "EQUATEUR"),
    "sud-ubangi-1": ("SUD-UBANGI", "EQUATEUR"),
    "sud-ubangi-2": ("SUD-UBANGI", "EQUATEUR"),
    "tshuapa": ("TSHUAPA", "EQUATEUR"),
    "tshuapa-1": ("TSHUAPA", "EQUATEUR"),
    "tshuapa-2": ("TSHUAPA", "EQUATEUR"),

    # Province-Orientale
    "province-orientale": ("PROVINCE-ORIENTALE", "PROVINCE-ORIENTALE"),
    "province-orientale-1": ("PROVINCE-ORIENTALE", "PROVINCE-ORIENTALE"),
    "province-orientale-2": ("PROVINCE-ORIENTALE", "PROVINCE-ORIENTALE"),
    "province-orientale-3": ("PROVINCE-ORIENTALE", "PROVINCE-ORIENTALE"),
    "province-orientale-4": ("PROVINCE-ORIENTALE", "PROVINCE-ORIENTALE"),
    "haut-uele": ("HAUT-UELE", "PROVINCE-ORIENTALE"),
    "haut-uele-1": ("HAUT-UELE", "PROVINCE-ORIENTALE"),
    "haut-uele-2": ("HAUT-UELE", "PROVINCE-ORIENTALE"),
    "bas-uele": ("BAS-UELE", "PROVINCE-ORIENTALE"),
    "ituri": ("ITURI", "PROVINCE-ORIENTALE"),
    "ituri-1": ("ITURI", "PROVINCE-ORIENTALE"),
    "ituri-2": ("ITURI", "PROVINCE-ORIENTALE"),
    "ituri-3": ("ITURI", "PROVINCE-ORIENTALE"),
    "tshopo": ("TSHOPO", "PROVINCE-ORIENTALE"),
    "tshopo-1": ("TSHOPO", "PROVINCE-ORIENTALE"),
    "tshopo-2": ("TSHOPO", "PROVINCE-ORIENTALE"),

    # Kasaï
    "kasai-1": ("KASAÏ", "KASAÏ-OCCIDENTAL"),
    "kasai-2": ("KASAÏ", "KASAÏ-OCCIDENTAL"),
    "kasai-ce": ("KASAÏ", "KASAÏ-OCCIDENTAL"),
    "kasai-central": ("KASAÏ-CENTRAL", "KASAÏ-OCCIDENTAL"),
    "kasai-central-1": ("KASAÏ-CENTRAL", "KASAÏ-OCCIDENTAL"),
    "kasai-central-2": ("KASAÏ-CENTRAL", "KASAÏ-OCCIDENTAL"),
    "kasai-occidental": ("KASAÏ-CENTRAL", "KASAÏ-OCCIDENTAL"),
    "kasai-occidental-1": ("KASAÏ-CENTRAL", "KASAÏ-OCCIDENTAL"),
    "kasai-occidental-2": ("KASAÏ-CENTRAL", "KASAÏ-OCCIDENTAL"),
    "kasai-oriental": ("KASAÏ-ORIENTAL", "KASAÏ-ORIENTAL"),
    "kasai-oriental-1": ("KASAÏ-ORIENTAL", "KASAÏ-ORIENTAL"),
    "kasai-oriental-2": ("KASAÏ-ORIENTAL", "KASAÏ-ORIENTAL"),
    "kasai-oriental-3": ("KASAÏ-ORIENTAL", "KASAÏ-ORIENTAL"),
    "kasai-orientale": ("KASAÏ-ORIENTAL", "KASAÏ-ORIENTAL"),
    "lomami": ("LOMAMI", "KASAÏ-ORIENTAL"),
    "lomami-1": ("LOMAMI", "KASAÏ-ORIENTAL"),
    "lomami-2": ("LOMAMI", "KASAÏ-ORIENTAL"),
    "sankuru": ("SANKURU", "KASAÏ-ORIENTAL"),
    "sankuru-1": ("SANKURU", "KASAÏ-ORIENTAL"),
    "sankuru-2": ("SANKURU", "KASAÏ-ORIENTAL"),

    # Nord-Kivu
    "nord-kivu": ("NORD-KIVU", "NORD-KIVU"),
    "nord-kivu-1": ("NORD-KIVU", "NORD-KIVU"),
    "nord-kivu-2": ("NORD-KIVU", "NORD-KIVU"),
    "nord-kivu-3": ("NORD-KIVU", "NORD-KIVU"),

    # Sud-Kivu
    "sud-kivu": ("SUD-KIVU", "SUD-KIVU"),
    "sud-kivu-1": ("SUD-KIVU", "SUD-KIVU"),
    "sud-kivu-2": ("SUD-KIVU", "SUD-KIVU"),
    "sud-kivu-3": ("SUD-KIVU", "SUD-KIVU"),

    # Maniema
    "maniema": ("MANIEMA", "MANIEMA"),
    "maniema-1": ("MANIEMA", "MANIEMA"),
    "maniema-2": ("MANIEMA", "MANIEMA"),

    # Divers
    "hors-frontieres": ("AUTRES", "AUTRES"),
    "lukaya": ("AUTRES", "AUTRES"),
    "recours": ("AUTRES", "AUTRES"),
    "junacyc": ("AUTRES", "AUTRES"),
    "junacyp": ("AUTRES", "AUTRES"),
    "junacyc-lualaba-corrige": ("LUALABA", "KATANGA"),
    "options-techniques-toutes-les-provinces-et-hors-frontieres": ("AUTRES", "AUTRES"),
    "region": ("AUTRES", "AUTRES"),
}
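Lookups against this table are keyed by the lowercase, hyphenated label. A minimal sketch of how a raw value might be resolved, using a two-entry subset of the mapping and a hypothetical `resolve_region` helper (the `("AUTRES", "AUTRES")` fallback mirrors the `# Divers` entries):

```python
# Subset of REGION_MAPPING, kept small so the example is self-contained.
REGION_MAPPING = {
    "kinshasa-est": ("KINSHASA", "KINSHASA"),
    "kwilu-2": ("KWILU", "BANDUNDU"),
}

def resolve_region(raw: str):
    """Normalize a raw region label and look it up; unknown labels fall back to AUTRES."""
    key = raw.strip().lower()
    return REGION_MAPPING.get(key, ("AUTRES", "AUTRES"))

print(resolve_region("  Kwilu-2 "))  # ('KWILU', 'BANDUNDU')
print(resolve_region("unknown"))     # ('AUTRES', 'AUTRES')
```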
logging.basicConfig(level=logging.INFO, format=">> %(message)s")


def load_json_dataset(path: str) -> list:
    logging.info(f"Loading JSON dataset from {path}")
    with open(os.path.join(DATA_DIR, path), "r", encoding="utf-8") as f:
        return json.load(f)


def save_csv_dataset(data: list, path: str) -> None:
    logging.info(f"Saving CSV dataset to {path}")
    with open(os.path.join(DATA_DIR, path), "w", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
@@ -32,14 +175,14 @@ def save_csv_dataset(data: list, path: str) -> None:


def load_csv_dataset(path: str, limit: int = None, balanced: bool = False) -> List[Dict[str, str]]:
    logging.info(f"Loading CSV dataset from {path}")

    file_path = os.path.join(DATA_DIR, path)
    with open(file_path, "r", encoding="utf-8", errors="replace", newline="") as f:
        raw_text = f.read().replace('\x00', '')

    reader = csv.DictReader(io.StringIO(raw_text))
    logging.info(f"Detected fieldnames: {reader.fieldnames}")

    if balanced:
        by_sex = {'m': [], 'f': []}
@@ -58,12 +201,12 @@ def load_csv_dataset(path: str, limit: int = None, balanced: bool = False) -> Li
        if limit and i + 1 >= limit:
            break

    logging.info("Successfully loaded with UTF-8 encoding")
    return data


def save_json_dataset(data: list, path: str) -> None:
    logging.info(f"Saving JSON dataset to {path}")
    with open(os.path.join(DATA_DIR, path), "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, separators=(',', ':'))
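The `balanced` branch of `load_csv_dataset` buckets rows by sex before the hunk cuts off. One way such buckets can be turned into a class-balanced sample, as a sketch with a hypothetical `balance` helper (not the function's actual body):

```python
def balance(by_sex: dict, limit: int = None):
    """Interleave male/female rows so both classes contribute equally (hypothetical helper)."""
    n = min(len(by_sex['m']), len(by_sex['f']))
    if limit:
        n = min(n, limit // 2)
    data = []
    for m, f in zip(by_sex['m'][:n], by_sex['f'][:n]):
        data.extend([m, f])
    return data

by_sex = {'m': [{'name': 'A', 'sex': 'm'}, {'name': 'B', 'sex': 'm'}],
          'f': [{'name': 'C', 'sex': 'f'}]}
print(len(balance(by_sex)))  # 2
```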
|||||||
@@ -1,5 +1,4 @@
import argparse
from dataclasses import dataclass
from typing import Optional

@@ -8,8 +7,7 @@ from sklearn.metrics import (
    classification_report, confusion_matrix
)

from misc import logging


def evaluate_proba(y_true, y_proba, threshold, class_names):
    y_pred = (y_proba[:, 1] >= threshold).astype(int)
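`evaluate_proba` turns class probabilities into hard labels by thresholding the positive-class column; a quick self-contained demonstration of that first line:

```python
import numpy as np

# Column 1 holds the positive-class probability; thresholding it yields
# hard 0/1 labels, exactly as the first line of evaluate_proba does.
y_proba = np.array([[0.9, 0.1], [0.2, 0.8], [0.45, 0.55]])
threshold = 0.5
y_pred = (y_proba[:, 1] >= threshold).astype(int)
print(y_pred.tolist())  # [0, 1, 1]
```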
@@ -14,7 +14,7 @@ from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import LabelEncoder

from misc import GENDER_MODELS_DIR, load_csv_dataset, save_pickle
from pipeline.gender.models import BaseConfig, load_config, logging


@dataclass
@@ -16,7 +16,7 @@ from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

from misc import GENDER_MODELS_DIR, load_csv_dataset, save_pickle
from pipeline.gender.models import load_config, BaseConfig, evaluate_proba, logging


@dataclass
@@ -20,7 +20,7 @@ from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

from misc import GENDER_MODELS_DIR, load_csv_dataset, save_pickle
from pipeline.gender.models import BaseConfig, load_config, evaluate_proba, logging


@dataclass
@@ -0,0 +1,86 @@
import os
import argparse

import ollama
import pandas as pd
from pydantic import BaseModel, ValidationError
from tqdm import tqdm
from typing import Optional

from misc import load_prompt, load_csv_dataset, DATA_DIR, logging


class NameAnalysis(BaseModel):
    identified_name: Optional[str]
    identified_surname: Optional[str]


def analyze_name(client: ollama.Client, model: str, prompt: str, name: str) -> dict:
    """
    Analyze a name using the specified model and prompt.
    Returns a dictionary with the identified name and surname.
    """
    try:
        response = client.chat(
            model=model,
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": name}
            ],
            format=NameAnalysis.model_json_schema()
        )
        analysis = NameAnalysis.model_validate_json(response.message.content)
        return analysis.model_dump()
    except ValidationError as ve:
        logging.warning(f"Validation error: {ve}")
    except Exception as e:
        logging.error(f"Unexpected error: {e}")
    return {
        "identified_name": None,
        "identified_surname": None
    }


def build_updates(client: ollama.Client, prompt: str, llm_model: str, rows: pd.DataFrame) -> pd.DataFrame:
    """
    Build updates for the DataFrame by analyzing names.
    Iterates through the DataFrame rows, analyzes each name, and returns a DataFrame with updates.
    """
    logging.getLogger("httpx").setLevel(logging.WARNING)
    updates = []

    for idx, row in rows.iterrows():
        entry = analyze_name(client, llm_model, prompt, row['name'])
        entry["annotated"] = 1
        updates.append((idx, entry))
        logging.info(f"Analyzed name: {row['name']} - {entry}")

    return pd.DataFrame.from_dict(dict(updates), orient='index')


def main(llm_model: str = "llama3.2:3b"):
    df = pd.DataFrame(load_csv_dataset('names_featured.csv'))
    prompt = load_prompt()

    entries = df[df['annotated'].astype("Int8") == 0]
    if entries.empty:
        logging.info("No names to analyze.")
        return

    logging.info(f"Found {len(entries)} names to analyze.")
    client = ollama.Client()

    df.update(build_updates(client, prompt, llm_model, entries))
    df.to_csv(os.path.join(DATA_DIR, 'names_featured.csv'), index=False)
    logging.info("Done.")


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Analyze names using an LLM model.")
    parser.add_argument('--llm_model', type=str, default="llama3.2:3b", help="Ollama model name to use (default: llama3.2:3b)")
    args = parser.parse_args()

    try:
        main(llm_model=args.llm_model)
    except Exception as e:
        logging.error(f"Fatal error: {e}", exc_info=True)
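The new script constrains the LLM output with `format=NameAnalysis.model_json_schema()` and then validates the reply with Pydantic. A small standalone demonstration of that validate-then-dump round trip (the sample JSON string is made up):

```python
from typing import Optional
from pydantic import BaseModel

class NameAnalysis(BaseModel):
    identified_name: Optional[str]
    identified_surname: Optional[str]

# The chat response content is a JSON string matching the schema passed via
# `format=...`; validating it recovers a typed object.
raw = '{"identified_name": "Ilunga", "identified_surname": "Ngandu"}'
analysis = NameAnalysis.model_validate_json(raw)
print(analysis.model_dump())  # {'identified_name': 'Ilunga', 'identified_surname': 'Ngandu'}
```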
@@ -1,72 +0,0 @@
import os
import ollama
import pandas as pd
from pydantic import BaseModel, ValidationError
from tqdm import tqdm

from misc import load_prompt, load_csv_dataset, DATA_DIR


class NameAnalysis(BaseModel):
    identified_name: str | None
    identified_surname: str | None
    identified_category: str | None


def main():
    dataset = pd.DataFrame(load_csv_dataset('names_featured.csv'))
    prompt = load_prompt()

    print(">> Filtering dataset for names that need analysis...")
    to_analyze = dataset[dataset['llm_annotated'] == 0].copy()
    if to_analyze.empty:
        print(">> No names to analyze.")
        return

    client = ollama.Client()
    updates = []

    print(">> Starting name analysis with LLM...")
    for row in tqdm(to_analyze.itertuples(index=True), total=len(to_analyze)):
        name = row.name
        try:
            response = client.chat(
                model="llama3.2:3b",
                messages=[
                    {"role": "system", "content": prompt},
                    {"role": "user", "content": name}
                ],
                format=NameAnalysis.model_json_schema()
            )
            analysis = NameAnalysis.model_validate_json(response.message.content)
            result = analysis.model_dump()
        except (ValidationError, Exception):
            result = {
                "identified_name": None,
                "identified_surname": None,
                "identified_category": None
            }

        updates.append({
            "index": row.Index,
            "identified_name": result["identified_name"],
            "identified_surname": result["identified_surname"],
            "identified_category": result["identified_category"],
            "llm_annotated": 1
        })

    print(">> Updating dataset with results...")
    updates_df = pd.DataFrame(updates).set_index("index")
    dataset.update(updates_df)

    print(">> Saving updated dataset...")
    dataset.to_csv(os.path.join(DATA_DIR, 'names_featured.csv'), index=False)
    print(">> Done.")


if __name__ == '__main__':
    try:
        main()
    except Exception as e:
        print(f">> Fatal error: {e}")
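The deleted script's try/except fallback (all fields set to `None` when the model's reply fails validation) can be reproduced with the stdlib `json` module instead of Pydantic. A sketch under the assumption that the model returns either a JSON object or garbage; `parse_analysis` is a hypothetical helper name:

```python
import json

FIELDS = ("identified_name", "identified_surname", "identified_category")

def parse_analysis(raw: str) -> dict:
    """Parse an LLM response, falling back to all-None fields on any failure."""
    try:
        data = json.loads(raw)
        # data.get raises AttributeError if the reply is valid JSON but not an object
        return {field: data.get(field) for field in FIELDS}
    except (json.JSONDecodeError, AttributeError):
        return {field: None for field in FIELDS}

good = parse_analysis('{"identified_name": "tshabu ngandu", '
                      '"identified_surname": "bernard", "identified_category": "simple"}')
bad = parse_analysis('not json at all')
```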
@@ -1,78 +0,0 @@
import os
import pandas as pd
from misc import DATA_DIR


def clean(filepath):
    encodings = ['utf-8', 'utf-16', 'latin1']
    for enc in encodings:
        try:
            print(f">> Trying to read {filepath} with encoding: {enc}")
            # Use chunked reading to handle large files
            chunks = pd.read_csv(filepath, encoding=enc, chunksize=100_000, on_bad_lines='skip')
            cleaned_chunks = []

            for chunk in chunks:
                # Drop rows with essential missing values early
                chunk = chunk.dropna(subset=['name', 'sex', 'region'])

                # Clean string columns in-place
                for col in chunk.select_dtypes(include='object').columns:
                    chunk[col] = (
                        chunk[col]
                        .astype(str)
                        .str.replace('\x00', ' ', regex=False)
                        .str.replace('\u00a0', ' ', regex=False)
                        .str.replace(' +', ' ', regex=True)
                    )

                cleaned_chunks.append(chunk)

            df = pd.concat(cleaned_chunks, ignore_index=True)
            df.to_csv(filepath, index=False, encoding='utf-8')
            print(f">> Successfully read with encoding: {enc}")
            return df
        except Exception:
            continue
    raise UnicodeDecodeError(f"Unable to decode {filepath} with common encodings.")


def process(df: pd.DataFrame):
    print(">> Preprocessing names")
    df['name'] = df['name'].str.strip().str.lower()

    df['words'] = df['name'].str.count(' ') + 1
    df['length'] = df['name'].str.replace(' ', '', regex=False).str.len()

    name_split = df['name'].str.split()
    df['probable_native'] = name_split.apply(lambda x: ' '.join(x[:-1]) if len(x) > 1 else '')
    df['probable_surname'] = name_split.apply(lambda x: x[-1] if x else '')
    df['llm_annotated'] = 0

    return df


def split_and_save(df: pd.DataFrame):
    print(">> Saving evaluation and featured datasets")
    eval_idx = df.sample(frac=0.2, random_state=42).index

    df_evaluation = df.loc[eval_idx]
    df_featured = df.drop(index=eval_idx)

    df_evaluation.to_csv(os.path.join(DATA_DIR, 'names_evaluation.csv'), index=False)
    df_featured.to_csv(os.path.join(DATA_DIR, 'names_featured.csv'), index=False)

    print(">> Saving by sex")
    df[df['sex'].str.lower() == 'm'].to_csv(os.path.join(DATA_DIR, 'names_males.csv'), index=False)
    df[df['sex'].str.lower() == 'f'].to_csv(os.path.join(DATA_DIR, 'names_females.csv'), index=False)


def main():
    filepath = os.path.join(DATA_DIR, 'names.csv')
    df = clean(filepath)
    df = process(df)
    split_and_save(df)


if __name__ == '__main__':
    main()
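The name-splitting heuristic in `process` (everything before the last token is the probable native part, the last token the probable surname) can be isolated as a pure function. A sketch; `split_name` is a hypothetical helper name, not from the repo:

```python
def split_name(name: str) -> tuple[str, str]:
    """Mirror the probable_native / probable_surname heuristic from process()."""
    parts = name.strip().lower().split()
    probable_native = ' '.join(parts[:-1]) if len(parts) > 1 else ''
    probable_surname = parts[-1] if parts else ''
    return probable_native, probable_surname

# A single-token name yields no native part; an empty string yields nothing
split_name("Tshabu Ngandu Bernard")  # ('tshabu ngandu', 'bernard')
```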
@@ -0,0 +1,110 @@
import os
import argparse
import pandas as pd
from misc import DATA_DIR, REGION_MAPPING, logging


def clean(filepath) -> pd.DataFrame:
    """
    Clean the CSV file by removing null bytes, non-breaking spaces, and extra spaces.
    It also attempts to read the file with different encodings to handle potential encoding issues.
    """
    encodings = ['utf-8', 'utf-16', 'latin1']
    for enc in encodings:
        try:
            logging.info(f"Trying to read {filepath} with encoding: {enc}")
            # Use chunked reading to handle large files
            chunks = pd.read_csv(filepath, encoding=enc, chunksize=100_000, on_bad_lines='skip')
            cleaned_chunks = []

            for chunk in chunks:
                # Drop rows with essential missing values early
                chunk = chunk.dropna(subset=['name', 'sex', 'region'])

                # Clean string columns in-place
                for col in chunk.select_dtypes(include='object').columns:
                    chunk[col] = (
                        chunk[col]
                        .astype(str)
                        .str.replace('\x00', ' ', regex=False)
                        .str.replace('\u00a0', ' ', regex=False)
                        .str.replace(' +', ' ', regex=True)
                        .str.strip()
                        .str.lower()
                    )

                cleaned_chunks.append(chunk)

            df = pd.concat(cleaned_chunks, ignore_index=True)
            df.to_csv(filepath, index=False, encoding='utf-8')
            logging.info(f"Successfully read with encoding: {enc}")
            return df
        except Exception:
            continue
    # UnicodeDecodeError requires five positional arguments, so raise ValueError for a plain message
    raise ValueError(f"Unable to decode {filepath} with common encodings.")


def process(df: pd.DataFrame) -> pd.DataFrame:
    """
    Process the DataFrame to extract features and clean data.
    This includes counting words, calculating name length, and extracting probable native names and surnames.
    It also maps regions to provinces based on REGION_MAPPING.
    """
    logging.info("Preprocessing names")
    df['words'] = df['name'].str.count(' ') + 1
    df['length'] = df['name'].str.replace(' ', '', regex=False).str.len()

    name_split = df['name'].str.split()
    df['probable_native'] = name_split.apply(lambda x: ' '.join(x[:-1]) if len(x) > 1 else '')
    df['probable_surname'] = name_split.apply(lambda x: x[-1] if x else '')
    df['identified_category'] = df['words'].apply(lambda x: 'compose' if x > 3 else 'simple')
    df['identified_name'] = None
    df['identified_surname'] = None

    logging.info("Mapping regions to provinces")
    df['province'] = df['region'].map(lambda r: REGION_MAPPING.get(r, ('AUTRES', 'AUTRES'))[1])
    df['province'] = df['province'].str.lower()
    df['annotated'] = 0

    return df


def save_artifacts(df: pd.DataFrame, split_eval: bool = True, split_by_sex: bool = True) -> None:
    """
    Splits the input DataFrame into evaluation and featured datasets, saves them as CSV files,
    and additionally saves separate CSV files for male and female entries if requested.
    """
    if split_eval:
        logging.info("Saving evaluation and featured datasets")
        eval_idx = df.sample(frac=0.2, random_state=42).index
        df_evaluation = df.loc[eval_idx]
        df_featured = df.drop(index=eval_idx)
        df_evaluation.to_csv(os.path.join(DATA_DIR, 'names_evaluation.csv'), index=False)
        df_featured.to_csv(os.path.join(DATA_DIR, 'names_featured.csv'), index=False)
    else:
        df.to_csv(os.path.join(DATA_DIR, 'names_featured.csv'), index=False)

    if split_by_sex:
        logging.info("Saving by sex")
        df[df['sex'] == 'm'].to_csv(os.path.join(DATA_DIR, 'names_males.csv'), index=False)
        df[df['sex'] == 'f'].to_csv(os.path.join(DATA_DIR, 'names_females.csv'), index=False)


def main(split_eval: bool = True, split_by_sex: bool = True):
    df = process(clean(os.path.join(DATA_DIR, 'names.csv')))
    save_artifacts(df, split_eval=split_eval, split_by_sex=split_by_sex)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Prepare name datasets with optional splits.")

    parser.add_argument('--split_eval', action='store_true', default=True, help="Split into evaluation and featured datasets (default: True)")
    parser.add_argument('--no-split_eval', action='store_false', dest='split_eval', help="Do not split into evaluation and featured datasets")
    parser.add_argument('--split_by_sex', action='store_true', default=True, help="Split by sex into male/female datasets (default: True)")
    parser.add_argument('--no-split_by_sex', action='store_false', dest='split_by_sex', help="Do not split by sex into male/female datasets")

    args = parser.parse_args()
    main(split_eval=args.split_eval, split_by_sex=args.split_by_sex)
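The paired `--x` / `--no-x` flags in the new script implement a manual boolean toggle, which is what the README's option table describes. A minimal sketch of that pattern in isolation (on Python 3.9+, `argparse.BooleanOptionalAction` would generate the pair automatically):

```python
import argparse

# Two options sharing one dest: the bare flag is a no-op (default already True),
# while the --no- variant stores False into the same destination
parser = argparse.ArgumentParser()
parser.add_argument('--split_eval', action='store_true', default=True)
parser.add_argument('--no-split_eval', action='store_false', dest='split_eval')

assert parser.parse_args([]).split_eval is True
assert parser.parse_args(['--no-split_eval']).split_eval is False
```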
@@ -7,7 +7,6 @@ from misc import load_prompt
 class NameAnalysis(BaseModel):
     identified_name: str | None
     identified_surname: str | None
-    identified_category: str | None


 name = input("Enter name: ")
+15 -22
@@ -1,31 +1,24 @@
 ## Instructions:
-You are analyzing Congolese full names. For each input, return:
-
-- "identified_name": the native name part of the full name
-- "identified_surname": the French or English, usually last part of the full name (can also be composed of multiple words)
-- "identified_category":
-  - "simple" if the native name has no connector
-  - "compose" if it includes connectors like "wa", "ya", etc.
-
-if you cannot identify any field, return null for that field.
-do not alter the original name, just identify the parts.
-do not add any additional information or explanations.
-
-## Example:
-- "tshabu ngandu bernard"
-```json
+Identify the identified_name (native Congolese part) and identified_surname (non-native, French or English part) from the provided full name.
+Return null if a part cannot be identified. Do not alter the original name or add any additional information.
+
+## Examples:
+```
+"tshabu ngandu bernard"
 {
 "identified_name": "tshabu ngandu",
-"identified_surname": "bernard",
-"identified_category": "simple"
+"identified_surname": "bernard"
 }
-```
-- "ilunga wa ilunga albert"
-```json
+
+"tshisekedi wa mulumba"
 {
-"identified_name": "ilunga wa ilunga",
-"identified_surname": "albert",
-"identified_category": "compose"
+"identified_name": "tshisekedi wa mulumba",
+"identified_surname": null
 }
+
+"ntumba wasokadio marie france"
+{
+"identified_name": "ntumba wasokadio",
+"identified_surname": "marie france"
+}
 ```
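The revised prompt expects exactly two nullable fields per answer. A quick stdlib check of its example outputs (the sample strings below are copied from the prompt examples above):

```python
import json

examples = [
    '{"identified_name": "tshabu ngandu", "identified_surname": "bernard"}',
    '{"identified_name": "tshisekedi wa mulumba", "identified_surname": null}',
    '{"identified_name": "ntumba wasokadio", "identified_surname": "marie france"}',
]

parsed = [json.loads(e) for e in examples]
# Every example carries exactly the two fields the prompt asks for,
# and JSON null round-trips to Python None
assert all(set(p) == {"identified_name", "identified_surname"} for p in parsed)
assert parsed[1]["identified_surname"] is None
```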