2 Commits

158 changed files with 12330 additions and 45610 deletions
@@ -1,16 +0,0 @@
.git
.gitignore
.idea
.vscode
__pycache__
.ruff_cache
.venv
*.pyc
*.pyo
*.pyd
*.swp
*.swo
*.DS_Store
dist
build
*.egg-info
@@ -1,35 +0,0 @@
name: audit
on:
push:
branches:
- main
pull_request:
jobs:
bandit:
name: bandit
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Install uv
run: curl -LsSf https://astral.sh/uv/install.sh | sh
- name: Cache uv dependencies
uses: actions/cache@v4
with:
path: |
~/.cache/uv
.venv
key: ${{ runner.os }}-uv-${{ hashFiles('**/uv.lock') }}
restore-keys: |
${{ runner.os }}-uv-
- name: Sync dependencies (with dev tools)
run: uv sync --dev
- name: Run Bandit (security linter)
run: uv run bandit -r . -c pyproject.toml || true
@@ -1,40 +0,0 @@
name: quality
on:
push:
branches:
- main
pull_request:
jobs:
lint:
name: ruff and pyright
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Install uv
run: curl -LsSf https://astral.sh/uv/install.sh | sh
- name: Cache uv dependencies
uses: actions/cache@v4
with:
path: |
~/.cache/uv
.venv
key: ${{ runner.os }}-uv-${{ hashFiles('**/uv.lock') }}
restore-keys: |
${{ runner.os }}-uv-
- name: Sync dependencies (with dev tools)
run: uv sync --dev
- name: Run Ruff (lint + format checks)
run: |
uv run ruff check .
uv run ruff format --check .
- name: Run Pyright (type checks)
run: uv run pyright
@@ -1 +0,0 @@
3.11
@@ -1,49 +0,0 @@
# syntax=docker/dockerfile:1
# Minimal Linux base (glibc); Python will be installed by uv
FROM debian:bookworm-slim
ENV DEBIAN_FRONTEND=noninteractive \
PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
UV_INSTALL_DIR=/usr/local/bin \
UV_LINK_MODE=copy \
UV_PYTHON_DOWNLOADS=1 \
UV_PROJECT_ENVIRONMENT=/app/.venv \
PATH=/app/.venv/bin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
WORKDIR /app
# System deps for building/using common scientific stack
# Keep minimal; rely on wheels where possible
RUN apt-get update && apt-get install -y --no-install-recommends \
ca-certificates curl git \
build-essential pkg-config \
libssl-dev libffi-dev \
libopenblas0 libstdc++6 \
libfreetype6 libpng16-16 libjpeg62-turbo \
&& rm -rf /var/lib/apt/lists/*
# Install uv (static binary)
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
# Copy project metadata first for layer caching (the lockfile glob tolerates its absence)
COPY pyproject.toml uv.lock* README.md ./
# Install a managed Python via uv and create the project venv
RUN uv python install 3.11 \
&& uv venv /app/.venv --python 3.11
# Resolve and install runtime deps into project venv
# Use lockfile if present for reproducibility
RUN if [ -f uv.lock ]; then uv sync --no-dev --no-install-project --frozen; else uv sync --no-dev --no-install-project; fi
# Copy source code and optional templates
COPY src ./src
# Re-sync to ensure the local package is installed
RUN uv sync --no-dev \
&& rm -rf /root/.cache
# Default command shows help; override in compose or docker run
CMD ["ners", "--help"]
@@ -1,206 +0,0 @@
# Formal Model Specifications
This document formalises the statistical models implemented in
`src/ners/research/models`. Throughout, the training set is
$\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^N$ with labels
$y^{(i)} \in \{0,1\}$ for the binary gender classes. Feature vectors
$\mathbf{x}^{(i)}$ combine
* character $n$-gram count representations of name strings produced by
`CountVectorizer` or `TfidfVectorizer`, and
* engineered scalar or categorical metadata (e.g., name length, province)
that are either used directly or encoded by `LabelEncoder`.
For neural architectures, character or token sequences are converted into
integer index sequences using a `Tokenizer` before being padded to a
maximum length specified in the configuration. Predictions are returned as
class posterior probabilities via a softmax layer unless otherwise noted.
## Logistic Regression (`logistic_regression_model.py`)
**Feature map.** Character $n$-gram counts
$\phi(\mathbf{x}) \in \mathbb{R}^d$ obtained with
`CountVectorizer(analyzer="char", ngram_range=(2,4))` (default configuration).【F:src/ners/research/models/logistic_regression_model.py†L16-L46】
**Model.** The linear logit for class $1$ is
$z = \mathbf{w}^\top \phi(\mathbf{x}) + b$. The class posteriors are
$p(y=1\mid \mathbf{x}) = \sigma(z)$ and $p(y=0\mid \mathbf{x}) = 1 - \sigma(z)$
with $\sigma(u) = (1 + e^{-u})^{-1}$.
**Training objective.** Minimise the regularised negative log-likelihood
$$\mathcal{L}(\mathbf{w}, b) = -\sum_{i=1}^N \left[y^{(i)}\log p(y^{(i)}=1\mid \mathbf{x}^{(i)}) + (1-y^{(i)}) \log p(y^{(i)}=0\mid \mathbf{x}^{(i)})\right] + \lambda R(\mathbf{w}),$$
where $R$ is the penalty induced by the chosen solver (e.g., $\ell_2$ for
`liblinear`).
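As a concrete sketch of this setup (the toy names and labels are illustrative, and it assumes scikit-learn is available), the character $n$-gram feature map and the logistic classifier compose into a single pipeline:

```python
# Minimal sketch: character n-gram counts feeding logistic regression.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data; real training uses the DRC names dataset.
names = ["kabila", "marie", "tshisekedi", "esther", "mukendi", "grace"]
labels = [0, 1, 0, 1, 0, 1]  # illustrative binary gender labels

clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(solver="liblinear"),  # l2 penalty by default
)
clf.fit(names, labels)
proba = clf.predict_proba(["kabila"])[0]  # [p(y=0|x), p(y=1|x)]
```

The pipeline applies $\phi$ and the sigmoid posterior in one call; `predict_proba` returns both class posteriors, which sum to one.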
## Multinomial Naive Bayes (`naive_bayes_model.py`)
**Feature map.** Character $n$-gram counts
$\phi(\mathbf{x}) \in \mathbb{N}^d$ derived with
`CountVectorizer(analyzer="char", ngram_range=(2,4))` by default.【F:src/ners/research/models/naive_bayes_model.py†L15-L38】
**Generative model.** For each class $c \in \{0,1\}$, the class prior is
$\pi_c = \frac{N_c}{N}$. Conditional feature probabilities are estimated with
Laplace smoothing (parameter $\alpha$):
$$\theta_{cj} = \frac{N_{cj} + \alpha}{\sum_{k=1}^d (N_{ck} + \alpha)},$$
where $N_{cj}$ counts the total occurrences of feature $j$ among examples of
class $c$. The likelihood of an input with counts $\phi_j(\mathbf{x})$ is
$$p(\phi(\mathbf{x})\mid y=c) = \prod_{j=1}^d \theta_{cj}^{\phi_j(\mathbf{x})}.$$
**Inference.** Predict with the maximum a posteriori (MAP) decision
$\hat{y} = \arg\max_c \left\{ \log \pi_c + \sum_{j=1}^d \phi_j(\mathbf{x}) \log \theta_{cj} \right\}$.
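The smoothed estimates and the MAP rule above can be reproduced with NumPy on an illustrative count matrix (a didactic re-implementation, not the project's scikit-learn-backed code):

```python
import numpy as np

# Illustrative count matrix: rows = examples, columns = n-gram features.
X = np.array([[2, 0, 1], [1, 0, 2], [0, 3, 0], [0, 2, 1]])
y = np.array([0, 0, 1, 1])
alpha = 1.0  # Laplace smoothing parameter

classes = np.unique(y)
log_prior = np.log(np.array([np.mean(y == c) for c in classes]))
# theta_cj = (N_cj + alpha) / (sum_k N_ck + alpha * d)
N = np.vstack([X[y == c].sum(axis=0) for c in classes])
theta = (N + alpha) / (N.sum(axis=1, keepdims=True) + alpha * X.shape[1])

def predict(x):
    # MAP decision: argmax_c { log pi_c + sum_j x_j log theta_cj }
    return int(np.argmax(log_prior + x @ np.log(theta).T))
```

Each row of `theta` sums to one, and `predict` scores both classes in log space to avoid underflow on long names.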
## Random Forest (`random_forest_model.py`)
**Feature map.** Concatenation of engineered numerical features and label
encoded categorical attributes produced on demand in
`prepare_features`.【F:src/ners/research/models/random_forest_model.py†L28-L71】
**Model.** An ensemble of $T$ decision trees $\{T_t\}_{t=1}^T$, each trained on
a bootstrap sample of the data with random feature sub-sampling. Each tree
outputs a class prediction $T_t(\mathbf{x}) \in \{0,1\}$. The forest prediction
is the mode of individual votes:
$$\hat{y} = \operatorname{mode}\{T_t(\mathbf{x}) : t = 1,\dots,T\}.$$
**Class probability.** For soft outputs,
$p(y=1\mid \mathbf{x}) = \frac{1}{T} \sum_{t=1}^T p_t(y=1\mid \mathbf{x})$ where
$p_t$ is the class distribution estimated at the leaf reached by
$\mathbf{x}$ in tree $t$.
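The two aggregation rules can be contrasted on illustrative per-tree outputs (a NumPy sketch, not the scikit-learn internals):

```python
import numpy as np

# Per-tree probabilities p_t(y=1|x) at the leaf reached by one input x,
# for an illustrative forest of T = 5 trees.
p_t = np.array([0.9, 0.6, 0.8, 0.4, 0.7])

p_soft = p_t.mean()                        # soft output: average of leaf distributions
votes = (p_t >= 0.5).astype(int)           # each tree's hard class prediction
y_hard = int(np.bincount(votes).argmax())  # majority vote (mode)
```

Here the averaged probability is 0.68 and four of five trees vote for class 1, so both rules agree; with more dispersed leaf distributions they can differ.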
## LightGBM (`lightgbm_model.py`)
**Feature map.** Hybrid of numeric inputs, categorical label encodings, and
character $n$-gram counts expanded into dense columns and assembled into a
feature matrix persisted in `self.feature_columns`.【F:src/ners/research/models/lightgbm_model.py†L38-L118】
**Model.** Gradient boosted decision trees forming an additive function
$F_M(\mathbf{x}) = \sum_{m=0}^M \eta h_m(\mathbf{x})$, where $h_m$ denotes the
$m$-th tree and $\eta$ is the learning rate.
**Training objective.** LightGBM minimises
$$\mathcal{L}^{(m)} = \sum_{i=1}^N \ell\big(y^{(i)}, F_{m-1}(\mathbf{x}^{(i)}) + h_m(\mathbf{x}^{(i)})\big) + \Omega(h_m),$$
using second-order Taylor approximations of the loss $\ell$ (binary log-loss by
default) and regulariser $\Omega$ determined by tree complexity constraints.
## XGBoost (`xgboost_model.py`)
**Feature map.** Combination of numeric metadata, categorical label encodings,
and character $n$-gram counts as described in `prepare_features`.【F:src/ners/research/models/xgboost_model.py†L41-L113】
**Model.** Additive ensemble of regression trees
$F_M(\mathbf{x}) = \sum_{m=1}^M f_m(\mathbf{x})$ with $f_m \in \mathcal{F}$, the
space of trees with fixed structure.
**Training objective.** At boosting iteration $m$, minimise the regularised
objective
$$\mathcal{L}^{(m)} = \sum_{i=1}^N \ell\big(y^{(i)}, F_{m-1}(\mathbf{x}^{(i)}) + f_m(\mathbf{x}^{(i)})\big) + \Omega(f_m),$$
where $\Omega(f) = \gamma T_f + \tfrac{1}{2} \lambda \sum_{j=1}^{T_f} w_j^2$ penalises the
number of leaves $T_f$ and their scores $w_j$. The optimal leaf weights follow
$$w_j^{\star} = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda},$$
with $g_i$ and $h_i$ denoting first- and second-order gradients of the loss for
sample $i$.
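For binary log-loss the gradients have closed forms, so the optimal leaf weight can be checked numerically (the margins, labels, and $\lambda$ below are illustrative):

```python
import numpy as np

# First/second-order gradients of binary log-loss for samples in leaf j.
# For l(y, z) = -[y log sigma(z) + (1-y) log(1 - sigma(z))]:
#   g_i = sigma(z_i) - y_i,   h_i = sigma(z_i) * (1 - sigma(z_i))
z = np.array([0.2, -0.1, 0.4])   # current margins F_{m-1}(x_i)
y = np.array([1.0, 1.0, 0.0])
lam = 1.0                        # l2 regularisation lambda

p = 1.0 / (1.0 + np.exp(-z))
g, h = p - y, p * (1.0 - p)
w_star = -g.sum() / (h.sum() + lam)  # optimal leaf weight w_j*
```

With two under-predicted positives and one negative, the gradient sum is negative, so the leaf weight comes out positive (about 0.22 here), nudging the margins toward the positive class.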
## Convolutional Neural Network (`cnn_model.py`)
**Input encoding.** Character-level token sequences padded to length
$L$ using `Tokenizer(char_level=True)` followed by `pad_sequences`.【F:src/ners/research/models/cnn_model.py†L23-L64】
**Architecture.** Embedding layer producing $X \in \mathbb{R}^{L \times d}$,
followed by two convolutional blocks:
1. $H^{(1)} = \operatorname{ReLU}(\operatorname{Conv1D}_{k_1}(X))$ with kernel size
$k_1$ and $F$ filters, then temporal max-pooling.
2. $H^{(2)} = \operatorname{ReLU}(\operatorname{Conv1D}_{k_2}(H^{(1)}))$ with kernel
size $k_2$ and $F$ filters.
Global max-pooling yields $h = \max_{t} H^{(2)}_{t,:}$, which passes through a
dense layer and dropout before the softmax layer producing
$p(y\mid \mathbf{x}) = \operatorname{softmax}(W h + b)$.
**Loss.** Cross-entropy between softmax output and the ground-truth label.
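A NumPy sketch of one forward pass through this architecture, with random weights and indices standing in for trained parameters (the actual model is built from `Tokenizer`/`pad_sequences` and deep-learning layers as described above); a single convolutional block is shown for brevity:

```python
import numpy as np
rng = np.random.default_rng(0)

L, d, F, k = 8, 4, 3, 3            # seq length, embed dim, filters, kernel size
ids = rng.integers(1, 20, size=L)  # padded character index sequence
E = rng.normal(size=(20, d))       # embedding table
X = E[ids]                         # embedded sequence, shape (L, d)

W = rng.normal(size=(k, d, F))     # Conv1D kernel
H = np.stack([np.maximum(0, np.einsum("kd,kdf->f", X[t:t + k], W))
              for t in range(L - k + 1)])   # ReLU(Conv1D), shape (L-k+1, F)
h = H.max(axis=0)                  # global max-pooling over time

Wo = rng.normal(size=(F, 2))
logits = h @ Wo
p = np.exp(logits - logits.max()); p /= p.sum()  # softmax posterior
```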
## Bidirectional GRU (`bigru_model.py`)
**Input encoding.** Word-level sequences padded to length $L$ with
`Tokenizer(char_level=False)` and `pad_sequences`.【F:src/ners/research/models/bigru_model.py†L47-L69】
**Recurrent dynamics.** A stacked bidirectional GRU computes forward and
backward hidden states according to
\[
\begin{aligned}
\mathbf{z}_t &= \sigma(W_z \mathbf{x}_t + U_z \mathbf{h}_{t-1} + \mathbf{b}_z),\\
\mathbf{r}_t &= \sigma(W_r \mathbf{x}_t + U_r \mathbf{h}_{t-1} + \mathbf{b}_r),\\
\tilde{\mathbf{h}}_t &= \tanh\big(W_h \mathbf{x}_t + U_h(\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{b}_h\big),\\
\mathbf{h}_t &= (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t.
\end{aligned}
\]
The final representation concatenates the last forward and backward states
before passing through dense layers and a softmax classifier.
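One GRU timestep, written directly from the equations above with random illustrative weights (a didactic sketch, not the Keras implementation):

```python
import numpy as np
rng = np.random.default_rng(1)

d_in, d_h = 3, 4
x_t = rng.normal(size=d_in)
h_prev = np.zeros(d_h)             # initial hidden state
Wz, Wr, Wh = (rng.normal(size=(d_h, d_in)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(size=(d_h, d_h)) for _ in range(3))
bz = br = bh = np.zeros(d_h)

sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)              # update gate
r = sigmoid(Wr @ x_t + Ur @ h_prev + br)              # reset gate
h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)  # candidate state
h_t = (1 - z) * h_prev + z * h_tilde                  # gated interpolation
```

Because `h_t` interpolates between `h_prev` and a `tanh` candidate, every component stays in $(-1, 1)$.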
## Bidirectional LSTM (`lstm_model.py`)
**Input encoding.** Word-level sequences padded to length $L$ using the same
pipeline as the BiGRU model.【F:src/ners/research/models/lstm_model.py†L45-L67】
**Recurrent dynamics.** At each timestep, the LSTM updates its memory cell via
\[
\begin{aligned}
\mathbf{i}_t &= \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i),\\
\mathbf{f}_t &= \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f),\\
\mathbf{o}_t &= \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + \mathbf{b}_o),\\
\tilde{\mathbf{c}}_t &= \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c),\\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t,\\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t).
\end{aligned}
\]
Bidirectional aggregation concatenates terminal forward/backward hidden vectors
before the dense-softmax head.
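Similarly, a single LSTM timestep with illustrative random weights (zero biases shared across gates for brevity):

```python
import numpy as np
rng = np.random.default_rng(2)

d_in, d_h = 3, 4
x_t = rng.normal(size=d_in)
h_prev, c_prev = np.zeros(d_h), np.zeros(d_h)
Wi, Wf, Wo, Wc = (rng.normal(size=(d_h, d_in)) for _ in range(4))
Ui, Uf, Uo, Uc = (rng.normal(size=(d_h, d_h)) for _ in range(4))
b = np.zeros(d_h)

sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
i = sigmoid(Wi @ x_t + Ui @ h_prev + b)       # input gate
f = sigmoid(Wf @ x_t + Uf @ h_prev + b)       # forget gate
o = sigmoid(Wo @ x_t + Uo @ h_prev + b)       # output gate
c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev + b) # candidate cell
c_t = f * c_prev + i * c_tilde                # cell update
h_t = o * np.tanh(c_t)                        # hidden state
```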
## Transformer Encoder (`transformer_model.py`)
**Input encoding.** Token sequences padded to a fixed length with positional
indices $\{0, \ldots, L-1\}$ added through a learned positional embedding.
`Tokenizer` initialises the vocabulary; padding uses `pad_sequences`.【F:src/ners/research/models/transformer_model.py†L25-L77】
**Architecture.** For hidden dimension $d$, the encoder block computes
\[
\begin{aligned}
Z^{(0)} &= X + P,\\
Z^{(1)} &= \operatorname{LayerNorm}\big(Z^{(0)} + \operatorname{Dropout}(\operatorname{MHAttn}(Z^{(0)}))\big),\\
Z^{(2)} &= \operatorname{LayerNorm}\big(Z^{(1)} + \operatorname{Dropout}(\phi(Z^{(1)} W_1 + b_1) W_2 + b_2)\big),
\end{aligned}
\]
where $\operatorname{MHAttn}$ is multi-head self-attention with
$H$ heads. Global average pooling produces a fixed-length vector for the dense
and dropout layers before the final softmax classifier.
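The core of $\operatorname{MHAttn}$ for a single head can be sketched in NumPy (random matrices are illustrative; the real block adds multiple heads, dropout, residual connections, and LayerNorm as above):

```python
import numpy as np
rng = np.random.default_rng(3)

L, d = 5, 8
Z = rng.normal(size=(L, d))          # token embeddings + positional embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
scores = Q @ K.T / np.sqrt(d)        # scaled dot-product
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)   # attention weights, rows sum to 1
attn = A @ V                         # attended representation, shape (L, d)
pooled = attn.mean(axis=0)           # global average pooling for the classifier head
```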
## Ensemble Voting (`ensemble_model.py`)
**Base learners.** A configurable set of pipelines that include character
$n$-gram vectorisers and classical classifiers (logistic regression,
random forest, naive Bayes).【F:src/ners/research/models/ensemble_model.py†L29-L96】
**Aggregation.** Given model posteriors $\mathbf{p}_j(\mathbf{x})$ and non-negative
weights $w_j$, the soft-voting ensemble predicts
$$p(y=c\mid \mathbf{x}) = \frac{1}{\sum_j w_j} \sum_j w_j \, p_j(y=c\mid \mathbf{x}).$$
Hard voting instead returns
$\hat{y} = \operatorname{mode}\{\arg\max_c p_j(y=c\mid \mathbf{x})\}$.
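Both voting rules in NumPy, on illustrative posteriors chosen so the two rules disagree (a sketch, not the project's ensemble code):

```python
import numpy as np

# Posteriors p_j(y=c|x) from 3 base models over 2 classes, one input.
P = np.array([[0.7, 0.3],
              [0.6, 0.4],
              [0.2, 0.8]])
w = np.array([1.0, 1.0, 2.0])   # non-negative model weights

p_soft = (w[:, None] * P).sum(axis=0) / w.sum()       # weighted soft voting
y_soft = int(p_soft.argmax())
y_hard = int(np.bincount(P.argmax(axis=1)).argmax())  # hard-voting mode
```

Two of three models prefer class 0, so hard voting returns 0; but the heavily weighted, confident third model tips the weighted average toward class 1, so soft voting returns 1.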
@@ -0,0 +1,52 @@
.PHONY: default
default: help
.PHONY: help
help: ## Show this help message
@awk 'BEGIN {FS = ":.*?## "} /^[a-zA-Z_-]+:.*?## / {printf "\033[36m%-30s\033[0m %s\n", $$1, $$2}' $(MAKEFILE_LIST)
# =============================================================================
# ENVIRONMENT SETUP
# =============================================================================
.PHONY: setup
setup: ## Setup virtual environment and install dependencies
	python -m venv .venv
	.venv/bin/pip install --upgrade pip
	.venv/bin/pip install -r requirements.txt
	@echo "Activate with: source .venv/bin/activate"
.PHONY: install
install: ## Install/update dependencies
pip install --upgrade pip
pip install -r requirements.txt
# =============================================================================
# DEVELOPMENT & CODE QUALITY
# =============================================================================
.PHONY: format
format: ## Format code with black
black . --line-length 100
.PHONY: lint
lint: ## Lint code with flake8
flake8 . --max-line-length=100 --ignore=E203,W503 --exclude=.venv
.PHONY: type-check
type-check: ## Type check with mypy
mypy . --ignore-missing-imports
.PHONY: notebook
notebook: ## Start Jupyter notebook
jupyter notebook notebooks/
# =============================================================================
# DEPLOYMENT & PRODUCTION
# =============================================================================
.PHONY: backup
backup: ## Backup datasets and results
	@ts=$$(date +%Y%m%d_%H%M%S); \
	mkdir -p backups/$$ts; \
	cp -r data/ backups/$$ts/data/; \
	echo "Backup created in backups/$$ts/"
@@ -1,10 +1,5 @@
# A Culturally-Aware NLP System for Congolese Name Analysis and Gender Inference
[![audit](https://github.com/bernard-ng/drc-ners-nlp/actions/workflows/audit.yml/badge.svg)](https://github.com/bernard-ng/drc-ners-nlp/actions/workflows/audit.yml)
[![quality](https://github.com/bernard-ng/drc-ners-nlp/actions/workflows/quality.yml/badge.svg)](https://github.com/bernard-ng/drc-ners-nlp/actions/workflows/quality.yml)
---
Despite the growing success of gender inference models in Natural Language Processing (NLP), these tools often
underperform when applied to culturally diverse African contexts due to the lack of culturally-representative training
data.
@@ -15,41 +10,51 @@ million names from the Democratic Republic of Congo (DRC) annotated with gender
### Installation & Setup
> download [the dataset](https://drive.google.com/file/d/1a5wQnOZdsRWBOeoMA_0lNtbneTvS9xqy/view?usp=drive_link); if you need access, please reach out to us at mlec.academia@gmail.com.
The instructions and command-line snippets below will help you set up the project environment quickly and efficiently,
assuming you have Python 3.11 and Git installed and are working on a Unix-like system (Linux, macOS, etc.).
**Using Makefile (Recommended)**
```bash
git clone https://github.com/bernard-ng/drc-ners-nlp.git
mkdir -p drc-ners-nlp/data/dataset
cp names.csv drc-ners-nlp/data/dataset
cd drc-ners-nlp
# Setup environment
make setup
source .venv/bin/activate
```
**Linux**
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
```
**macOS & Windows**
```bash
docker compose build
docker compose exec app bash
```
**Manual Setup**
```bash
git clone https://github.com/bernard-ng/drc-ners-nlp.git
cd drc-ners-nlp
# Setup environment
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install jupyter notebook ipykernel pytest black flake8 mypy
```
## Data Processing
This project includes a robust data processing pipeline designed to handle large datasets efficiently, with batching,
checkpointing, and parallel processing capabilities.
Steps are defined in the `drc-ners-nlp/processing/steps` directory, and the configuration that enables them is managed
through the `drc-ners-nlp/config/pipeline.yaml` file.
**Pipeline Configuration**
```yaml
stages:
- "data_cleaning"
- "data_selection"
- "feature_extraction"
- "data_splitting"
```
@@ -57,77 +62,97 @@ stages:
**Running the Pipeline**
```bash
uv run ners pipeline run --env="production"
```
## NER Processing (Optional)
This project implements a custom named entity recognition (NER) pipeline tailored for Congolese names.
Its main objective is to accurately identify and tag the different components of a Congolese name,
specifically distinguishing between the native part and the surname.
```bash
python ner.py --env production
```
Once you've built and trained the NER model, you can use it to annotate **COMPOSE** names in the original dataset.
**Running the Pipeline with NER Annotation**
```yaml
stages:
- "data_cleaning"
- "feature_extraction"
- "ner_annotation"
- "data_splitting"
```
**Running the Pipeline with LLM Annotation**
```yaml
stages:
- "data_cleaning"
- "feature_extraction"
- "llm_annotation"
- "data_splitting"
```
## Experiments
This project provides a modular experiment (model training and evaluation) framework for systematic model comparison
and research iteration. Models are defined in the `drc-ners-nlp/research/models` directory, and you can define model
features, training parameters, and evaluation metrics in the `config/research_templates.yaml` file.
**Running Experiments**
```bash
# bigru
uv run ners research train --name="bigru" --type="baseline" --env="production"
uv run ners research train --name="bigru_native" --type="baseline" --env="production"
uv run ners research train --name="bigru_surname" --type="baseline" --env="production"
```
```bash
# cnn
uv run ners research train --name="cnn" --type="baseline" --env="production"
uv run ners research train --name="cnn_native" --type="baseline" --env="production"
uv run ners research train --name="cnn_surname" --type="baseline" --env="production"
```
```bash
# lightgbm
uv run ners research train --name="lightgbm" --type="baseline" --env="production"
uv run ners research train --name="lightgbm_native" --type="baseline" --env="production"
uv run ners research train --name="lightgbm_surname" --type="baseline" --env="production"
```
```bash
# logistic regression
uv run ners research train --name="logistic_regression" --type="baseline" --env="production"
uv run ners research train --name="logistic_regression_native" --type="baseline" --env="production"
uv run ners research train --name="logistic_regression_surname" --type="baseline" --env="production"
```
```bash
# lstm
uv run ners research train --name="lstm" --type="baseline" --env="production"
uv run ners research train --name="lstm_native" --type="baseline" --env="production"
uv run ners research train --name="lstm_surname" --type="baseline" --env="production"
```
```bash
# random forest
uv run ners research train --name="random_forest" --type="baseline" --env="production"
uv run ners research train --name="random_forest_native" --type="baseline" --env="production"
uv run ners research train --name="random_forest_surname" --type="baseline" --env="production"
```
```bash
# svm
uv run ners research train --name="svm" --type="baseline" --env="production"
uv run ners research train --name="svm_native" --type="baseline" --env="production"
uv run ners research train --name="svm_surname" --type="baseline" --env="production"
```
```bash
# naive bayes
uv run ners research train --name="naive_bayes" --type="baseline" --env="production"
uv run ners research train --name="naive_bayes_native" --type="baseline" --env="production"
uv run ners research train --name="naive_bayes_surname" --type="baseline" --env="production"
```
```bash
# transformer
uv run ners research train --name="transformer" --type="baseline" --env="production"
uv run ners research train --name="transformer_native" --type="baseline" --env="production"
uv run ners research train --name="transformer_surname" --type="baseline" --env="production"
```
```bash
# xgboost
uv run ners research train --name="xgboost" --type="baseline" --env="production"
uv run ners research train --name="xgboost_native" --type="baseline" --env="production"
uv run ners research train --name="xgboost_surname" --type="baseline" --env="production"
```
## Web Interface
@@ -137,18 +162,10 @@ experiments and make predictions without needing to understand the underlying co
### Running the Web Interface
![web](./assets/web.png)
```bash
uv run ners web run --env="production"
```
```bash
docker compose run --rm --service-ports app ners web run --env=production
```
then open: http://localhost:8501/
## Contributors
<a href="https://github.com/bernard-ng/drc-ners-nlp/graphs/contributors" title="show all contributors">
@@ -1,72 +0,0 @@
# Technical Architecture Report: DRC NERS NLP
## Project Overview
The DRC NERS NLP project delivers an end-to-end system for Congolese name analysis and gender inference backed by a 5-million-record dataset enriched with demographic metadata.【F:README.md†L1-L12】 The toolkit wraps a configurable processing pipeline, experiment runner, and Streamlit dashboard so that researchers and practitioners can reproducibly clean raw registry data, engineer features, benchmark multiple models, and publish insights without modifying core code.
```mermaid
flowchart LR
A[Data ingestion\nDataLoader] --> B[Preprocessing\nBatch pipeline]
B --> C[Feature extraction\nName heuristics + NER]
C --> D[Model training\nExperiment runner]
D --> E[Evaluation\nMetrics + tracking]
E --> F[Visualization & Deployment\nStreamlit + exports]
```
## Software Architecture and Implementation
### Configuration-driven Orchestration
* **Central config management** `ConfigManager` resolves layered YAML/JSON configs, injects project paths, and merges environment overrides, ensuring every workflow can be replayed with the same parameters.【F:src/ners/core/config/config_manager.py†L12-L157】
* **Pipeline definition** `PipelineConfig` captures stage order, batch settings, annotation parameters, and dataset splits. The default `pipeline.yaml` enumerates every stage and shared directories, creating a single source of truth for the runbook.【F:src/ners/core/config/pipeline_config.py†L10-L29】【F:config/pipeline.yaml†L1-L68】
* **Research templates** Pre-built experiment templates map feature sets and hyperparameters for each baseline architecture, allowing experiments to be reproduced or extended declaratively.【F:config/research_templates.yaml†L1-L86】
### Modular Data Pipeline
1. **Data ingestion** `DataLoader` streams CSV chunks with typed columns, optional balancing, and dataset size limits; it also writes artifacts using consistent encodings for downstream reuse.【F:src/ners/core/utils/data_loader.py†L33-L174】
2. **Batch processing engine** The `Pipeline` class wires ordered steps and delegates to a `BatchProcessor` that supports sequential or concurrent execution, checkpointing, and memory-aware concatenation via `MemoryMonitor` to handle multi-million row datasets safely.【F:src/ners/processing/pipeline.py†L12-L57】【F:src/ners/processing/batch/batch_processor.py†L12-L173】【F:src/ners/processing/batch/memory_monitor.py†L7-L25】
3. **Preprocessing steps**
* `DataCleaningStep` drops critical nulls, normalizes text, and deduplicates records.【F:src/ners/processing/steps/data_cleaning_step.py†L10-L31】
* `FeatureExtractionStep` engineers linguistic statistics, name segments, gender/category inference, region mapping, and spaCy-based tagging while optimizing dtypes to keep memory usage low.【F:src/ners/processing/steps/feature_extraction_step.py†L24-L196】
* `DataSelectionStep` enforces column whitelists and domain-specific filters (e.g., removing “global” regions for certain years).【F:src/ners/processing/steps/data_selection_step.py†L9-L60】
* `NERAnnotationStep` loads a spaCy model, parallelizes tagging with retries, and records provenance for each batch.【F:src/ners/processing/steps/ner_annotation_step.py†L13-L172】
* `LLMAnnotationStep` calls an Ollama-hosted model with configurable concurrency, rate limiting, and exponential backoff to enrich unannotated rows while maintaining checkpoints.【F:src/ners/processing/steps/llm_annotation_step.py†L18-L169】
* `DataSplittingStep` persists evaluation, gender, and province-specific splits in deterministic fashion, reusing the shared data loader for consistent I/O.【F:src/ners/processing/steps/data_splitting_step.py†L11-L69】
4. **Pipeline runner** `run_pipeline` composes the configured steps, captures progress metrics, and invokes the splitter to materialize curated datasets, turning raw CSVs into model-ready corpora with a single command.【F:src/ners/main.py†L14-L75】
5. **Operational visibility** `PipelineMonitor` inspects checkpoint state, estimates storage use, and exposes Typer commands for status, cleanup, or reset, simplifying long-running batch management.【F:src/ners/processing/monitoring/pipeline_monitor.py†L11-L196】【F:src/ners/cli.py†L156-L200】
### Research Experimentation Pipeline
* **CLI and templates** Typer-powered commands load configuration environments and instantiate experiments from templates, guaranteeing every run is traceable to a versioned config file.【F:src/ners/cli.py†L13-L146】
* **Experiment runner** `ExperimentRunner` fetches the featured dataset, applies filters, splits data, trains models from the registry, computes metrics, confusion matrices, and feature importance, then persists joblib artifacts per run.【F:src/ners/research/experiment/experiment_runner.py†L24-L271】
* **Model registry and abstractions** Traditional, neural, and ensemble estimators inherit from `BaseModel`, which standardizes feature extraction, training, persistence, and probability interfaces. For example, the logistic regression model couples a character n-gram vectorizer with solver-specific tuning guidance.【F:src/ners/research/base_model.py†L15-L210】【F:src/ners/research/models/logistic_regression_model.py†L1-L47】
* **Tracking and artifact management** `ExperimentTracker` records metadata, metrics, and tags in JSON for comparison/export, while `ModelTrainer` orchestrates runs, saves serialized models/configs, and generates learning curves for later visualization or deployment.【F:src/ners/research/experiment/experiment_tracker.py†L14-L200】【F:src/ners/research/model_trainer.py†L16-L200】
### Visualization and User Interfaces
* **Streamlit portal** `ners.web.app` bootstraps a Streamlit dashboard that shares the same configuration stack, giving analysts access to pipeline monitors, experiment summaries, and predictions through a browser-friendly UI.【F:src/ners/web/app.py†L1-L67】
* **Analysis utilities** The statistics package offers reusable seaborn/matplotlib plots (e.g., transition matrices, letter frequency charts) for exploratory studies, exporting intermediate CSVs alongside visuals.【F:src/ners/research/statistics/plots.py†L1-L39】
## Technology Stack and Environments
* **Core languages & libraries** Python 3.11 orchestrated by the `uv` package manager, with heavy use of pandas, NumPy, scikit-learn, joblib, spaCy, Streamlit, seaborn, and Typer across modules.【F:Dockerfile†L3-L48】【F:src/ners/research/experiment/experiment_runner.py†L6-L21】
* **LLM integration** The LLM annotation step leverages the Ollama client, optional rate limiting, and JSON-schema validation to keep third-party inference reproducible and auditable.【F:src/ners/processing/steps/llm_annotation_step.py†L21-L116】
* **Containerization** The Dockerfile provisions a slim Debian image with reproducible uv-managed environments, while `compose.yml` mounts configs/data and exposes Streamlit, providing parity between local and deployment setups.【F:Dockerfile†L3-L48】【F:compose.yml†L1-L23】
## Reproducibility and Automation
* All entry points (`ners pipeline`, `ners ner`, `ners research`, `ners monitor`) source environment-specific configs and return exit codes, enabling CI/CD and scheduled jobs to orchestrate pipelines reliably.【F:src/ners/cli.py†L13-L200】
* Version-controlled configs cover pipeline stages, annotations, prompts, and experiment templates; `ConfigManager` ensures default paths map to versioned `data/`, `models/`, and `outputs/` directories for each run.【F:src/ners/core/config/config_manager.py†L15-L111】【F:config/pipeline.yaml†L1-L68】
* Experiment artifacts (models, metrics, learning curves, exports) are stored per experiment ID with timestamps and hashes, easing regression comparisons and rollbacks.【F:src/ners/research/experiment/experiment_tracker.py†L45-L163】【F:src/ners/research/model_trainer.py†L109-L200】
## Scalability and Performance Considerations
* Chunked reads, optimized dtypes, and optional stratified sampling keep ingestion memory-efficient even for multi-million row CSVs.【F:src/ners/core/utils/data_loader.py†L40-L161】
* Batch processing supports threaded or multiprocess execution with incremental checkpointing, enabling restarts mid-run and reducing wasted computation on failure.【F:src/ners/processing/batch/batch_processor.py†L29-L156】
* Memory monitoring and dtype normalization inside feature engineering prevent ballooning DataFrame footprints during annotation-heavy stages.【F:src/ners/processing/steps/feature_extraction_step.py†L53-L195】【F:src/ners/processing/batch/memory_monitor.py†L7-L25】
* Rate-limited concurrency in NER and LLM steps balances throughput with external service stability, retrying transient failures without blocking the whole run.【F:src/ners/processing/steps/ner_annotation_step.py†L50-L166】【F:src/ners/processing/steps/llm_annotation_step.py†L58-L163】
## Deployment and Interfaces
* **Command-line workflows** The Typer CLI exposes discrete subcommands for pipeline execution, NER dataset generation, research training, and checkpoint maintenance, simplifying automation scripts and developer onboarding.【F:src/ners/cli.py†L13-L200】
* **Web interface** Streamlit shares the same configuration context as the CLI and surfaces monitoring utilities, experiment tracking, and interactive analysis for non-technical stakeholders.【F:src/ners/web/app.py†L1-L67】
* **Containerized services** Docker Compose binds the CLI, configs, and data directories into a reproducible container, standardizing environment setup across OSes and enabling GPU-enabled hosts to mount device-specific resources if needed.【F:compose.yml†L1-L23】
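The subcommand-plus-exit-code pattern behind the CLI can be sketched minimally. The project uses Typer; stdlib `argparse` is used here to keep the example dependency-free, and `run_pipeline` is a hypothetical stand-in for the real orchestration call.

```python
import argparse
import sys

def run_pipeline(config: str) -> bool:
    """Hypothetical stand-in for the real pipeline orchestration."""
    return bool(config)

def main(argv=None) -> int:
    # One subcommand per workflow, mirroring `ners pipeline`, `ners ner`, etc.
    parser = argparse.ArgumentParser(prog="ners")
    sub = parser.add_subparsers(dest="command", required=True)
    p = sub.add_parser("pipeline", help="run the full pipeline")
    p.add_argument("--config", default="config/pipeline.yaml")
    args = parser.parse_args(argv)
    if args.command == "pipeline":
        ok = run_pipeline(args.config)
        # A non-zero exit code is what lets CI/CD and cron jobs detect failure.
        return 0 if ok else 1
    return 2

if __name__ == "__main__":
    sys.exit(main())
```

The important design point is the return value: schedulers and CI runners only see the process exit code, so every subcommand must map success and failure onto 0 and non-zero explicitly.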
## Summary of Design Choices
* Configuration-first design separates code from experiment definitions, allowing fast iteration without code changes.【F:src/ners/core/config/config_manager.py†L15-L157】【F:config/research_templates.yaml†L1-L86】
* Batch checkpoints, memory monitoring, and rate limiting deliver resilience against large-scale processing failures and external service hiccups.【F:src/ners/processing/batch/batch_processor.py†L29-L173】【F:src/ners/processing/steps/llm_annotation_step.py†L21-L169】
* Unified model abstractions, experiment tracking, and artifact exports make the research stack extensible and production-ready, with metrics and models stored alongside configuration for reproducible science.【F:src/ners/research/base_model.py†L15-L210】【F:src/ners/research/experiment/experiment_tracker.py†L14-L200】
* Streamlit dashboards and Typer commands democratize access, letting analysts trigger pipelines or inspect experiments without touching Python modules.【F:src/ners/cli.py†L13-L200】【F:src/ners/web/app.py†L1-L67】
By combining configuration-driven orchestration, modular batch processing, and standardized experiment tooling, the DRC NERS NLP project functions as a robust, reproducible pipeline capable of scaling from exploratory research to production-ready deployments.
[Deleted image assets: a 42 KiB binary image not shown, and a 987-line matplotlib-generated SVG figure (25 KiB) containing two pie charts: "Distribution by Category" (Simple 77.4%, Compose 22.6%) and "Distribution by Sex" (Male 61.6%, Female 38.4%).]
@@ -1,13 +0,0 @@
compose,simple
0.2062165520477412,0.7937834479522587
0.6269061385346485,0.3730938614653515
0.09081330148566008,0.90918669851434
0.12423822403788959,0.8757617759621105
0.2612655252892886,0.7387344747107114
0.07622377139542966,0.9237762286045703
0.18062352012628255,0.8193764798737174
0.07679244621346286,0.9232075537865372
0.4611502742287561,0.5388497257712439
0.11962561930536533,0.8803743806946347
0.16090483213325235,0.8390951678667476
0.409646629226467,0.590353370773533
[Deleted image assets: a 20 KiB binary image not shown, and a 487-line matplotlib-generated SVG pie chart showing the Simple vs. Compose split (Simple 77.4%).]
Q 1563 2625 1702 2806
Q 1841 2988 1841 3469
Q 1841 3913 1700 4092
Q 1559 4272 1350 4272
z
M 1353 -169
L 3859 4659
L 4316 4659
L 1819 -169
L 1353 -169
z
M 3334 1075
Q 3334 1569 3581 1911
Q 3828 2253 4300 2253
Q 4731 2253 5014 1945
Q 5297 1638 5297 1041
Q 5297 459 5011 145
Q 4725 -169 4303 -169
Q 3888 -169 3611 142
Q 3334 453 3334 1075
z
M 4316 1866
Q 4103 1866 3964 1684
Q 3825 1503 3825 1019
Q 3825 581 3965 400
Q 4106 219 4313 219
Q 4528 219 4667 400
Q 4806 581 4806 1063
Q 4806 1506 4665 1686
Q 4525 1866 4316 1866
z
" transform="scale(0.015625)"/>
</defs>
<use xlink:href="#ArialMT-37"/>
<use xlink:href="#ArialMT-37" transform="translate(55.615234 0)"/>
<use xlink:href="#ArialMT-2e" transform="translate(111.230469 0)"/>
<use xlink:href="#ArialMT-34" transform="translate(139.013672 0)"/>
<use xlink:href="#ArialMT-25" transform="translate(194.628906 0)"/>
</g>
</g>
<g id="text_3">
<!-- Compose -->
<g style="fill: #262626" transform="translate(348.434338 329.82462) scale(0.1 -0.1)">
<defs>
<path id="ArialMT-43" d="M 3763 1606
L 4369 1453
Q 4178 706 3683 314
Q 3188 -78 2472 -78
Q 1731 -78 1267 223
Q 803 525 561 1097
Q 319 1669 319 2325
Q 319 3041 592 3573
Q 866 4106 1370 4382
Q 1875 4659 2481 4659
Q 3169 4659 3637 4309
Q 4106 3959 4291 3325
L 3694 3184
Q 3534 3684 3231 3912
Q 2928 4141 2469 4141
Q 1941 4141 1586 3887
Q 1231 3634 1087 3207
Q 944 2781 944 2328
Q 944 1744 1114 1308
Q 1284 872 1643 656
Q 2003 441 2422 441
Q 2931 441 3284 734
Q 3638 1028 3763 1606
z
" transform="scale(0.015625)"/>
<path id="ArialMT-6f" d="M 213 1659
Q 213 2581 725 3025
Q 1153 3394 1769 3394
Q 2453 3394 2887 2945
Q 3322 2497 3322 1706
Q 3322 1066 3130 698
Q 2938 331 2570 128
Q 2203 -75 1769 -75
Q 1072 -75 642 372
Q 213 819 213 1659
z
M 791 1659
Q 791 1022 1069 705
Q 1347 388 1769 388
Q 2188 388 2466 706
Q 2744 1025 2744 1678
Q 2744 2294 2464 2611
Q 2184 2928 1769 2928
Q 1347 2928 1069 2612
Q 791 2297 791 1659
z
" transform="scale(0.015625)"/>
<path id="ArialMT-73" d="M 197 991
L 753 1078
Q 800 744 1014 566
Q 1228 388 1613 388
Q 2000 388 2187 545
Q 2375 703 2375 916
Q 2375 1106 2209 1216
Q 2094 1291 1634 1406
Q 1016 1563 777 1677
Q 538 1791 414 1992
Q 291 2194 291 2438
Q 291 2659 392 2848
Q 494 3038 669 3163
Q 800 3259 1026 3326
Q 1253 3394 1513 3394
Q 1903 3394 2198 3281
Q 2494 3169 2634 2976
Q 2775 2784 2828 2463
L 2278 2388
Q 2241 2644 2061 2787
Q 1881 2931 1553 2931
Q 1166 2931 1000 2803
Q 834 2675 834 2503
Q 834 2394 903 2306
Q 972 2216 1119 2156
Q 1203 2125 1616 2013
Q 2213 1853 2448 1751
Q 2684 1650 2818 1456
Q 2953 1263 2953 975
Q 2953 694 2789 445
Q 2625 197 2315 61
Q 2006 -75 1616 -75
Q 969 -75 630 194
Q 291 463 197 991
z
" transform="scale(0.015625)"/>
</defs>
<use xlink:href="#ArialMT-43"/>
<use xlink:href="#ArialMT-6f" transform="translate(72.216797 0)"/>
<use xlink:href="#ArialMT-6d" transform="translate(127.832031 0)"/>
<use xlink:href="#ArialMT-70" transform="translate(211.132812 0)"/>
<use xlink:href="#ArialMT-6f" transform="translate(266.748047 0)"/>
<use xlink:href="#ArialMT-73" transform="translate(322.363281 0)"/>
<use xlink:href="#ArialMT-65" transform="translate(372.363281 0)"/>
</g>
</g>
<g id="text_4">
<!-- 22.6% -->
<g style="fill: #262626" transform="translate(276.514892 280.209809) scale(0.1 -0.1)">
<defs>
<path id="ArialMT-32" d="M 3222 541
L 3222 0
L 194 0
Q 188 203 259 391
Q 375 700 629 1000
Q 884 1300 1366 1694
Q 2113 2306 2375 2664
Q 2638 3022 2638 3341
Q 2638 3675 2398 3904
Q 2159 4134 1775 4134
Q 1369 4134 1125 3890
Q 881 3647 878 3216
L 300 3275
Q 359 3922 746 4261
Q 1134 4600 1788 4600
Q 2447 4600 2831 4234
Q 3216 3869 3216 3328
Q 3216 3053 3103 2787
Q 2991 2522 2730 2228
Q 2469 1934 1863 1422
Q 1356 997 1212 845
Q 1069 694 975 541
L 3222 541
z
" transform="scale(0.015625)"/>
<path id="ArialMT-36" d="M 3184 3459
L 2625 3416
Q 2550 3747 2413 3897
Q 2184 4138 1850 4138
Q 1581 4138 1378 3988
Q 1113 3794 959 3422
Q 806 3050 800 2363
Q 1003 2672 1297 2822
Q 1591 2972 1913 2972
Q 2475 2972 2870 2558
Q 3266 2144 3266 1488
Q 3266 1056 3080 686
Q 2894 316 2569 119
Q 2244 -78 1831 -78
Q 1128 -78 684 439
Q 241 956 241 2144
Q 241 3472 731 4075
Q 1159 4600 1884 4600
Q 2425 4600 2770 4297
Q 3116 3994 3184 3459
z
M 888 1484
Q 888 1194 1011 928
Q 1134 663 1356 523
Q 1578 384 1822 384
Q 2178 384 2434 671
Q 2691 959 2691 1453
Q 2691 1928 2437 2201
Q 2184 2475 1800 2475
Q 1419 2475 1153 2201
Q 888 1928 888 1484
z
" transform="scale(0.015625)"/>
</defs>
<use xlink:href="#ArialMT-32"/>
<use xlink:href="#ArialMT-32" transform="translate(55.615234 0)"/>
<use xlink:href="#ArialMT-2e" transform="translate(111.230469 0)"/>
<use xlink:href="#ArialMT-36" transform="translate(139.013672 0)"/>
<use xlink:href="#ArialMT-25" transform="translate(194.628906 0)"/>
</g>
</g>
</g>
</g>
</svg>

Before: 12 KiB

Binary file not shown.

Before: 37 KiB

File diff suppressed because it is too large.

Before: 47 KiB

Binary file not shown.

Before: 54 KiB

File diff suppressed because it is too large.

Before: 31 KiB

-27
View File
@@ -1,27 +0,0 @@
letter,Male,Female
a,0.1726871198212362,0.1780007719968084
b,0.06275449167118631,0.05757115683434764
c,0.002527913112031784,0.002525362815502787
d,0.02639274743484273,0.025798130028588412
e,0.060460557268468315,0.05992111228866155
f,0.004168185425527368,0.005738163668905593
g,0.03710295718248242,0.035944081768606244
h,0.015744753594548896,0.016324638692088497
i,0.07320872667180656,0.07954877144283247
j,0.00442530712700423,0.004397881604276826
k,0.06012485644271973,0.05719911396875115
l,0.04930645065793003,0.04598845291218479
m,0.08281339976187696,0.08014229460267776
n,0.08138893330151427,0.08430430794896865
o,0.06920807306069308,0.06452478894803111
p,0.009832203366545821,0.009371006578405026
q,5.0822826402147366e-05,8.43622136063042e-05
r,0.009139850680293098,0.010064380634131025
s,0.032639239825093015,0.034139532349508485
t,0.0277669772899704,0.027953179679053274
u,0.06917254296038988,0.06619473621457156
v,0.0035449558612418576,0.006171217790567778
w,0.013512780220408454,0.014295070954872152
x,4.796818419701171e-05,1.6707334940670683e-05
y,0.020592394840652214,0.020809516372185803
z,0.01138579141093724,0.012971260356926069
-7
View File
@@ -1,7 +0,0 @@
sex,position,2-grams,3-grams,4-grams
female,prefix,"ka, ma, mu, mb, ng, ba, ki, lu, ts, bo","tsh, kab, ngo, mas, kas, kal, muk, kav, mbu, man","tshi, kavi, ngoy, kaso, ilun, mbuy, kaba, ntum, kavu, ngal"
female,suffix,"ba, ga, la, ka, ma, da, go, ya, bo, na","nga, mba, ngo, nda, ala, mbo, ngu, ndo, mbe, mbu","anga, amba, ongo, umba, inga, ombo, unga, enga, anda, ungu"
female,any,"ng, ka, mb, an, ba, ma, nd, ga, la, am","nga, mba, ang, ngo, amb, ong, nda, ala, mbo, eng","anga, amba, ongo, tshi, umba, inga, ombo, unga, anda, enga"
male,prefix,"ka, mu, ma, ba, mb, ng, ki, lu, ts, bo","kab, tsh, kal, kas, muk, ngo, kam, mut, mul, mbu","tshi, ngoy, ilun, kaba, kaso, kamb, muke, kabe, kalo, muto"
male,suffix,"ba, ga, la, go, ka, da, bo, le, di, ma","nga, mba, ngo, nda, ala, mbo, ngu, mbe, ndo, ele","amba, ongo, anga, umba, unga, ombo, anda, enga, onga, angu"
male,any,"ng, ka, mb, ba, an, ma, mu, nd, am, al","nga, mba, ngo, amb, ang, ong, ala, nda, shi, mbo","amba, ongo, anga, tshi, umba, unga, ombo, anda, lung, enga"
Binary file not shown.

Before: 33 KiB

File diff suppressed because it is too large.

Before: 463 KiB

-29
View File
@@ -1,29 +0,0 @@
^,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,$
0.0,0.03240135083510582,0.09260231861008805,0.008075675043119726,0.017857766912447937,0.02028878683448995,0.008457996285223429,0.0056965765224603,0.0036982016623235744,0.026885400880147632,0.0030749451480349046,0.18878761331096985,0.060296922518691204,0.26743305983402327,0.1199608033359101,0.014603892627340732,0.012508764232704531,9.485640637203402e-06,0.005467323565586415,0.032579081785992364,0.042149993340081786,0.004793643382647348,0.005066031042208305,0.01062281917633087,8.287454451451392e-06,0.010728658956072298,0.005944601062910965,0.0
0.0,0.0009791886146586694,0.048581928219493954,0.0020218103912897157,0.020990537577026614,0.001131268956960644,0.006787613741763863,0.006186016110784558,0.011832275288244888,0.004851088661689449,0.0032179280340603345,0.04386672984621395,0.07634733982374604,0.08293271059721305,0.13517553734169577,0.001808420172771762,0.01144167917930759,6.458327509042543e-06,0.010475318749427155,0.046487395291047624,0.033763074574396215,0.004567187662011127,0.011274824306950683,0.008258962300707514,1.6278524132381206e-05,0.032413284125006325,0.011652857641829295,0.3829322859400618
0.0,0.39420804956524974,0.00012606255453160502,2.546718273365758e-07,0.0006208899150465718,0.11658188641534632,1.273359136682879e-06,5.781050480540271e-05,0.0016016311221197252,0.11411997386048364,6.366795683414395e-06,1.4261622330848246e-05,0.0003718208679114007,2.2156448978282096e-05,3.8200774100486376e-05,0.19131406078812782,6.366795683414395e-06,2.546718273365758e-07,0.0013390644681357156,1.884571522290661e-05,9.422857611453305e-06,0.14365426567670633,1.222424771215564e-05,0.02738409757801999,2.546718273365758e-07,0.004381374117498451,1.3242935021501942e-05,0.004095886999054148
0.0,0.07019293726020524,3.059582305823609e-05,0.004779067561696477,1.8357493834941653e-05,0.10123545933509158,0.0,7.954913995141383e-05,0.33171991359739567,0.3235936629931282,1.8357493834941653e-05,0.03783479479381475,0.015714014722710057,8.566830456306106e-05,4.283415228153053e-05,0.03822642132896017,2.4476658446588873e-05,0.00897069532067482,0.0032064422565031424,8.566830456306106e-05,0.013223514725769638,0.018045416439747646,6.119164611647218e-06,0.0037694054007746864,1.8357493834941653e-05,0.008328183036451864,1.2238329223294436e-05,0.02073784886887242
0.0,0.2424810241211335,1.5959776634059462e-05,5.911028382984986e-07,0.0006726750299836915,0.10030246732235734,9.457645412775978e-06,4.0786095842596404e-05,0.004887238267051987,0.3104158822239417,0.06585713162618893,1.5959776634059462e-05,2.6008524885133942e-05,0.00012176718468949073,7.152344343411834e-05,0.1405991300148426,1.773308514895496e-06,5.911028382984986e-07,0.009874964016614718,9.398535128946128e-05,1.4777570957462466e-05,0.0932293307592775,3.014624475322343e-05,0.00875600634371566,5.911028382984986e-07,0.004086293921157521,0.0022692437962279362,0.016124694325944745
0.0,0.006205216662872565,0.028941081239369993,0.0011474697534826386,0.01425752342098435,0.003730367449534814,0.004639925223263839,0.005866185675542982,0.003561493573938398,0.0016397191355402771,0.0038625407716201547,0.05415821120988892,0.10151501424777083,0.09480497248870047,0.17112338854414036,0.0027153276653648665,0.008680322523437667,5.646239001703881e-06,0.03422185458932722,0.03327456967317772,0.03469819184328915,0.002615748541152998,0.002961965650848386,0.005427832211228881,0.00028205530285784386,0.028868963368484594,0.01293322372785744,0.33786118926732095
0.0,0.23077320736794152,3.230986452473805e-06,6.46197290494761e-06,3.5540850977211856e-05,0.08830932171901404,0.0015282565920201096,3.877183742968566e-05,0.000132470444551426,0.14398568026804265,6.46197290494761e-06,9.692959357421415e-05,0.0038481048648963015,3.5540850977211856e-05,1.938591871484283e-05,0.08549513251890935,6.46197290494761e-06,6.46197290494761e-06,0.013518447317150399,0.0002003211600533759,0.0005040338865859135,0.3976407336924036,0.0,0.02652962976126241,0.0,0.004814169814185969,6.46197290494761e-06,0.0024587806903325652
0.0,0.38404120076639636,0.02137884680216012,5.907040897397653e-06,4.261508075979735e-05,0.09362322277181143,5.485109404726393e-06,4.514666971582492e-05,0.015463789206401714,0.07891806638923264,1.0548287316781523e-05,1.4345670750822873e-05,0.000878039436248894,2.9113272994317006e-05,0.0005134906265809245,0.2519884576420865,7.172835375411436e-06,8.438629853425219e-07,0.0015463789206401712,2.3206232096919352e-05,2.3628163589590613e-05,0.11810284411361265,5.485109404726393e-06,0.020353131343476286,2.5315889560275656e-06,0.002251004513401177,6.750903882740175e-06,0.010718747639820713
0.0,0.2614230396902226,4.453049370764763e-05,3.872216844143272e-06,2.5169409486931267e-05,0.09374152952565344,2.5169409486931267e-05,3.484995159728945e-05,2.420135527589545e-05,0.4197734753146176,1.1616650532429817e-05,9.486931268151016e-05,7.454017424975799e-05,0.0002468538238141336,0.0006030977734753146,0.1158296224588577,2.3233301064859633e-05,9.68054211035818e-07,0.002967086156824782,0.00020909970958373668,0.00020716360116166505,0.07815295256534366,1.1616650532429817e-05,0.007239109390125847,0.0,0.0033049370764762828,1.0648596321393998e-05,0.01591674733785092
0.0,0.057525431732821986,0.041996980300875106,0.0033912381338324768,0.016022951674192765,0.02875699673928052,0.007807629805802953,0.006448520094153173,0.005039572274652094,0.0003533603548473279,0.0023910036496201273,0.05831058618602033,0.08421986612340406,0.07337456279315632,0.10815890676433847,0.011699700072632914,0.009224952061459374,0.0004542620977921718,0.03307003561382166,0.050486207507906694,0.048087850694833095,0.002131396128804547,0.005388030520530441,0.005741799384458517,9.109752500688338e-05,0.017802008720851855,0.013278383415184936,0.3087466696297192
0.0,0.22479395580338848,3.1523973982213474e-05,4.2031965309617963e-05,0.00024168380053030328,0.10204660644420081,1.0507991327404491e-05,2.1015982654808982e-05,0.0002346784729787003,0.42807805335957994,9.457192194664042e-05,8.756659439503743e-05,3.502663775801497e-05,0.0007320567291425129,0.0001260958959288539,0.12674739139115296,2.8021310206411978e-05,0.0,3.502663775801497e-05,1.0507991327404491e-05,7.355593929183144e-05,0.06349628892772954,7.0053275516029945e-06,0.001986010360879449,0.0,0.00035727170513175267,1.0507991327404491e-05,0.05067303684452026
0.0,0.5034268610320017,6.895672925910058e-05,5.506050625251378e-06,1.3109644345836613e-05,0.07313660826232712,0.00021106527396796946,4.719471964501181e-06,0.00408968465012719,0.14606084028625171,6.292629286001574e-06,2.9627796221590745e-05,0.0002089677308726356,3.9853318811343305e-05,5.322515604409665e-05,0.12817613908388756,0.00323913092496931,2.8841217560840547e-06,0.0001966446651875492,0.0006470920449104952,2.0975430953338582e-05,0.10610631502055855,3.146314643000787e-06,0.021257812692547902,7.865786607501968e-07,0.01034980201815109,1.0225522589752559e-05,0.0026337275490785753
0.0,0.2907864656030218,0.0008378546668440517,4.703067334072724e-05,0.00035917260804733476,0.17203916946407904,0.00028798234223705586,9.696049777780069e-05,0.00010694646266521537,0.11111415343016283,4.831918493910333e-06,0.00016943927518645567,0.0027625688669183344,0.00026962105196019656,7.408941690662511e-05,0.17521728332147365,0.000927728350830784,3.5434068955342444e-06,9.663836987820666e-06,0.00025609168017724765,0.0001262741366408567,0.2173715845181466,0.0004770714192987469,0.010590921082852263,6.442557991880444e-07,0.004126136265899831,2.0938313473611443e-05,0.011915833133882475
0.0,0.2455144244761935,0.34861793685557824,2.345028152630318e-05,5.805835829496029e-05,0.02770991169645199,0.00440562708416741,6.392092867653609e-05,1.3805407672743e-05,0.04043547172794345,6.6190310759726715e-06,7.224199631490173e-05,7.98066032588705e-05,0.0006679547931524422,0.00019195190120320746,0.05560269776577443,0.03787277201050043,5.673455207976575e-07,1.7209480797528944e-05,0.0004308043654590213,0.00011952078971470652,0.19084103867348565,0.006140002341245849,0.037190633579328045,5.673455207976575e-07,0.0008513965115436847,0.0001529941754417683,0.00291861447415675
0.0,0.06852883077720875,0.00014519177304071664,0.0020409279683942802,0.189448848282653,0.02723591583598238,0.0006794974978305538,0.3949594662038515,0.00015362226308824212,0.04945662681480349,0.007555592524815616,0.040859213064337006,0.003013431831432164,4.027900800484397e-05,0.0018093705084222468,0.014132311483001883,5.18943498481013e-05,1.5924258978659246e-05,0.0002334309022048167,0.0345440266418473,0.03498709572990059,0.011673605896696897,0.0013887827271623645,0.0007604302022867985,0.0,0.03952251203124602,0.06717414469868305,0.009589026724278762
0.0,0.002811754741536618,0.022217453078125466,0.0009318958571949934,0.010439299379577709,0.0016556070980166227,0.007270623934115422,0.004762079642666141,0.004112736445620663,0.006897636060238115,0.0017462718427437218,0.0702396992225785,0.08710265314882101,0.10264970585602438,0.195075412409823,0.004091849124683534,0.008484842920460622,2.75437199170934e-06,0.007752409501665248,0.028939268392892802,0.03214650504623902,0.0037128934448241906,0.0025411376933511757,0.007407424409703652,2.2723568931602058e-05,0.045256168071778936,0.005312035917010772,0.33641715881938433
0.0,0.2755187546363871,0.00010093905364029487,3.204414401279202e-06,6.408828802558404e-06,0.20437434609918623,0.00369629201187556,3.204414401279202e-06,0.055202446890836816,0.1479045533126435,6.408828802558404e-06,0.0002483421160991382,0.0013074010757219146,6.729270242686325e-05,0.00011055229684413248,0.14242500468645605,0.0011712134636675483,0.0,0.0033470108421361265,0.0001538118912614017,0.0002675686025068134,0.14199881757108593,9.613243203837607e-06,0.012518044858597203,0.0,0.007613688617439385,8.011036003198005e-06,0.0019370685055732778
0.0,0.003864734299516908,0.0007246376811594203,0.0,0.0004830917874396135,0.0007246376811594203,0.0,0.0,0.0004830917874396135,0.00024154589371980676,0.0,0.0004830917874396135,0.0004830917874396135,0.0007246376811594203,0.0016908212560386474,0.0007246376811594203,0.0,0.0,0.00024154589371980676,0.001932367149758454,0.0007246376811594203,0.9835748792270531,0.0,0.00024154589371980676,0.0,0.0,0.0,0.0026570048309178746
0.0,0.2822393929187961,0.0012267882391033057,0.007813046583785243,0.008807168087886199,0.18842588750970124,0.00023592081521217416,0.004038313540390457,0.03821103686384766,0.19376257905381108,5.694640367190411e-05,0.000777725170147719,0.003924420733046649,0.0033565837364325194,0.004537814852598302,0.0694306823968904,0.0003530677027658055,0.00015294176986168533,0.0043604674811629435,0.0014366764126368952,0.011607304108438968,0.1015468270277394,0.0017295436315209734,0.0317907366098667,3.254080209823092e-06,0.008437829984071277,0.00010738464692416203,0.03162965963948045
0.0,0.23379934463089652,3.257301707105292e-05,0.0016072457280488111,4.281025100766955e-05,0.13016736016171107,7.445261044812096e-06,1.2098549197819656e-05,0.244367427355192,0.15211273242011003,1.535585090492495e-05,0.0005100003815696286,0.0002601188077531226,0.0006663508635106826,0.00011400555974868522,0.11456581564230732,0.001241497279222417,9.306576306015119e-06,0.00021265526859244547,0.016044072222754766,0.009141384576583351,0.07100824655726476,6.97993222951134e-06,0.0097877263010361,1.3959864459022679e-06,0.0037649754445984165,8.375918675413608e-06,0.010492699456216746
0.0,0.22270715194197135,1.4995917777938228e-05,0.004287166271403895,0.0006853689828879916,0.1378752450721748,7.386878016539942e-05,6.1094479836044635e-06,0.02331143188798605,0.08055140545073841,6.6648523457503235e-06,9.719576337552555e-05,2.221617448583441e-05,6.831473654394081e-05,4.6653966420252264e-05,0.14458064193636178,6.1094479836044635e-06,1.6662130864375809e-06,0.003284661397730618,0.19145788091019666,0.0033663058389660594,0.16493732261773184,8.331065432187904e-06,0.011131414226127331,4.443234897166882e-06,0.0022449444317935675,1.943915267510511e-05,0.009203050280756905
0.0,0.052226842718488986,0.03074039814488894,0.0014204842607524314,0.01637797660464469,0.022525374073740573,0.006430721206713545,0.012709093373135088,0.013169096520166011,0.009173231318596861,0.007254770344054158,0.09023383531205176,0.09354867756485294,0.11036573774681681,0.12381315549879986,0.0011981001392515705,0.007636325104666064,1.5917063911104562e-06,0.014746477553756472,0.05062194790299505,0.05958757523086564,0.0004966123940264624,0.005729233461488436,0.007166089559406575,7.64019067733019e-05,0.0306126068603455,0.018867178013714143,0.21327046547861703
0.0,0.18540060189078825,2.026568310551328e-05,0.0,1.68880692545944e-05,0.08221449874521645,3.3776138509188797e-06,4.053136621102656e-05,0.00018239114794961952,0.2600120918575863,3.3776138509188797e-06,3.3776138509188797e-06,9.119557397480976e-05,6.755227701837759e-06,4.390898006194544e-05,0.05670675894307707,0.0,3.3776138509188797e-06,0.0005843271962089662,1.013284155275664e-05,1.68880692545944e-05,0.3652484066106658,2.364329695643216e-05,0.011642634944117379,0.0,0.01438187977721259,3.37761385091888e-05,0.02330891318519119
0.0,0.5743844833643246,2.79704006032656e-05,0.0,2.0138688434351233e-05,0.2559571359204835,2.237632048261248e-06,1.5663424337828736e-05,0.00014768371518524237,0.09644529772813218,4.475264096522496e-06,2.1257504458481857e-05,3.915856084457184e-05,6.601014542370682e-05,7.94359377132743e-05,0.03731027677270805,1.3425792289567489e-05,5.59408012065312e-06,2.5732768555004354e-05,3.132684867565747e-05,3.692092879631059e-05,0.02981420941103287,7.831712168914368e-06,5.258435313413933e-05,2.237632048261248e-06,0.00013425792289567488,3.4683296748049344e-05,0.005319970194741118
0.0,0.2527091460771565,0.0013003901170351106,0.008669267446900737,0.0,0.045947117468573904,0.0,0.0,0.0034677069787602947,0.2492414390983962,0.0,0.0,0.0008669267446900737,0.0,0.00043346337234503684,0.03034243606415258,0.002600780234070221,0.0,0.0,0.002600780234070221,0.007368877329865626,0.014304291287386216,0.0,0.0013003901170351106,0.009102730819245773,0.0039011703511053317,0.0,0.3658430862592111
0.0,0.37652174563292656,0.0002512562814070352,0.00022957047140464225,0.0004613843024647045,0.13366460277578368,3.514596793491266e-05,9.496889207944485e-05,8.001316104331179e-05,0.20418461354391002,6.730078966259871e-06,0.0003701543431442929,0.0011328966259870784,0.0010469011725293131,0.0007941493180186647,0.10861898779612347,0.00036940655659248626,7.477865518066523e-07,0.0004120303900454654,0.0007156317300789662,0.0002736898779612347,0.05130638310600622,0.0006640344580043072,0.0017565506101938262,3.7389327590332615e-06,3.0659248624072745e-05,0.00010543790380473797,0.11686856903565446
0.0,0.42030782462700017,0.00013905412295474562,5.150152702027615e-06,0.00033475992563179497,0.11766811385957594,3.8626145265207114e-06,1.1587843579562133e-05,0.00012746627937518349,0.2313873481348722,6.308937059983828e-05,0.00010042797768953849,3.090091621216569e-05,9.656536316301779e-05,0.0001660924246403906,0.11382481240568783,3.8626145265207114e-06,0.0,0.0002060061080811046,7.338967600389351e-05,3.3475992563179495e-05,0.09470873311393682,0.00025750763510138075,0.005945851294490882,0.0,0.00251584959494049,0.0002600827114523946,0.011728185240692387
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
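The deleted file above is a character-bigram transition table: rows and columns run over `^` (start marker), a–z, and `$` (end marker), and each row holds the probability distribution over the next character, with row order matching the header's label order. A minimal sketch of sampling names from such a table, assuming that layout (the file path, function names, and sampling loop here are illustrative, not code from this repository):

```python
import csv
import random

def load_matrix(path):
    # CSV layout as above: header row is the labels (^, a..z, $);
    # each following row is the outgoing distribution for one label,
    # assumed to appear in the same order as the header.
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], [[float(x) for x in r] for r in rows[1:]]

def sample_name(labels, probs, rng=random):
    # Random walk from the start marker ^ until the end marker $ is drawn.
    current, out = "^", []
    while True:
        row = probs[labels.index(current)]
        current = rng.choices(labels, weights=row, k=1)[0]
        if current == "$":
            return "".join(out)
        out.append(current)
```

Because the `^` column and the `$` row are all zeros, the walk can never re-enter the start state or leave the end state, so the loop terminates exactly when `$` is drawn.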
15 0.0 0.2455144244761935 0.34861793685557824 2.345028152630318e-05 5.805835829496029e-05 0.02770991169645199 0.00440562708416741 6.392092867653609e-05 1.3805407672743e-05 0.04043547172794345 6.6190310759726715e-06 7.224199631490173e-05 7.98066032588705e-05 0.0006679547931524422 0.00019195190120320746 0.05560269776577443 0.03787277201050043 5.673455207976575e-07 1.7209480797528944e-05 0.0004308043654590213 0.00011952078971470652 0.19084103867348565 0.006140002341245849 0.037190633579328045 5.673455207976575e-07 0.0008513965115436847 0.0001529941754417683 0.00291861447415675
16 0.0 0.06852883077720875 0.00014519177304071664 0.0020409279683942802 0.189448848282653 0.02723591583598238 0.0006794974978305538 0.3949594662038515 0.00015362226308824212 0.04945662681480349 0.007555592524815616 0.040859213064337006 0.003013431831432164 4.027900800484397e-05 0.0018093705084222468 0.014132311483001883 5.18943498481013e-05 1.5924258978659246e-05 0.0002334309022048167 0.0345440266418473 0.03498709572990059 0.011673605896696897 0.0013887827271623645 0.0007604302022867985 0.0 0.03952251203124602 0.06717414469868305 0.009589026724278762
17 0.0 0.002811754741536618 0.022217453078125466 0.0009318958571949934 0.010439299379577709 0.0016556070980166227 0.007270623934115422 0.004762079642666141 0.004112736445620663 0.006897636060238115 0.0017462718427437218 0.0702396992225785 0.08710265314882101 0.10264970585602438 0.195075412409823 0.004091849124683534 0.008484842920460622 2.75437199170934e-06 0.007752409501665248 0.028939268392892802 0.03214650504623902 0.0037128934448241906 0.0025411376933511757 0.007407424409703652 2.2723568931602058e-05 0.045256168071778936 0.005312035917010772 0.33641715881938433
18 0.0 0.2755187546363871 0.00010093905364029487 3.204414401279202e-06 6.408828802558404e-06 0.20437434609918623 0.00369629201187556 3.204414401279202e-06 0.055202446890836816 0.1479045533126435 6.408828802558404e-06 0.0002483421160991382 0.0013074010757219146 6.729270242686325e-05 0.00011055229684413248 0.14242500468645605 0.0011712134636675483 0.0 0.0033470108421361265 0.0001538118912614017 0.0002675686025068134 0.14199881757108593 9.613243203837607e-06 0.012518044858597203 0.0 0.007613688617439385 8.011036003198005e-06 0.0019370685055732778
19 0.0 0.003864734299516908 0.0007246376811594203 0.0 0.0004830917874396135 0.0007246376811594203 0.0 0.0 0.0004830917874396135 0.00024154589371980676 0.0 0.0004830917874396135 0.0004830917874396135 0.0007246376811594203 0.0016908212560386474 0.0007246376811594203 0.0 0.0 0.00024154589371980676 0.001932367149758454 0.0007246376811594203 0.9835748792270531 0.0 0.00024154589371980676 0.0 0.0 0.0 0.0026570048309178746
20 0.0 0.2822393929187961 0.0012267882391033057 0.007813046583785243 0.008807168087886199 0.18842588750970124 0.00023592081521217416 0.004038313540390457 0.03821103686384766 0.19376257905381108 5.694640367190411e-05 0.000777725170147719 0.003924420733046649 0.0033565837364325194 0.004537814852598302 0.0694306823968904 0.0003530677027658055 0.00015294176986168533 0.0043604674811629435 0.0014366764126368952 0.011607304108438968 0.1015468270277394 0.0017295436315209734 0.0317907366098667 3.254080209823092e-06 0.008437829984071277 0.00010738464692416203 0.03162965963948045
21 0.0 0.23379934463089652 3.257301707105292e-05 0.0016072457280488111 4.281025100766955e-05 0.13016736016171107 7.445261044812096e-06 1.2098549197819656e-05 0.244367427355192 0.15211273242011003 1.535585090492495e-05 0.0005100003815696286 0.0002601188077531226 0.0006663508635106826 0.00011400555974868522 0.11456581564230732 0.001241497279222417 9.306576306015119e-06 0.00021265526859244547 0.016044072222754766 0.009141384576583351 0.07100824655726476 6.97993222951134e-06 0.0097877263010361 1.3959864459022679e-06 0.0037649754445984165 8.375918675413608e-06 0.010492699456216746
22 0.0 0.22270715194197135 1.4995917777938228e-05 0.004287166271403895 0.0006853689828879916 0.1378752450721748 7.386878016539942e-05 6.1094479836044635e-06 0.02331143188798605 0.08055140545073841 6.6648523457503235e-06 9.719576337552555e-05 2.221617448583441e-05 6.831473654394081e-05 4.6653966420252264e-05 0.14458064193636178 6.1094479836044635e-06 1.6662130864375809e-06 0.003284661397730618 0.19145788091019666 0.0033663058389660594 0.16493732261773184 8.331065432187904e-06 0.011131414226127331 4.443234897166882e-06 0.0022449444317935675 1.943915267510511e-05 0.009203050280756905
23 0.0 0.052226842718488986 0.03074039814488894 0.0014204842607524314 0.01637797660464469 0.022525374073740573 0.006430721206713545 0.012709093373135088 0.013169096520166011 0.009173231318596861 0.007254770344054158 0.09023383531205176 0.09354867756485294 0.11036573774681681 0.12381315549879986 0.0011981001392515705 0.007636325104666064 1.5917063911104562e-06 0.014746477553756472 0.05062194790299505 0.05958757523086564 0.0004966123940264624 0.005729233461488436 0.007166089559406575 7.64019067733019e-05 0.0306126068603455 0.018867178013714143 0.21327046547861703
24 0.0 0.18540060189078825 2.026568310551328e-05 0.0 1.68880692545944e-05 0.08221449874521645 3.3776138509188797e-06 4.053136621102656e-05 0.00018239114794961952 0.2600120918575863 3.3776138509188797e-06 3.3776138509188797e-06 9.119557397480976e-05 6.755227701837759e-06 4.390898006194544e-05 0.05670675894307707 0.0 3.3776138509188797e-06 0.0005843271962089662 1.013284155275664e-05 1.68880692545944e-05 0.3652484066106658 2.364329695643216e-05 0.011642634944117379 0.0 0.01438187977721259 3.37761385091888e-05 0.02330891318519119
25 0.0 0.5743844833643246 2.79704006032656e-05 0.0 2.0138688434351233e-05 0.2559571359204835 2.237632048261248e-06 1.5663424337828736e-05 0.00014768371518524237 0.09644529772813218 4.475264096522496e-06 2.1257504458481857e-05 3.915856084457184e-05 6.601014542370682e-05 7.94359377132743e-05 0.03731027677270805 1.3425792289567489e-05 5.59408012065312e-06 2.5732768555004354e-05 3.132684867565747e-05 3.692092879631059e-05 0.02981420941103287 7.831712168914368e-06 5.258435313413933e-05 2.237632048261248e-06 0.00013425792289567488 3.4683296748049344e-05 0.005319970194741118
26 0.0 0.2527091460771565 0.0013003901170351106 0.008669267446900737 0.0 0.045947117468573904 0.0 0.0 0.0034677069787602947 0.2492414390983962 0.0 0.0 0.0008669267446900737 0.0 0.00043346337234503684 0.03034243606415258 0.002600780234070221 0.0 0.0 0.002600780234070221 0.007368877329865626 0.014304291287386216 0.0 0.0013003901170351106 0.009102730819245773 0.0039011703511053317 0.0 0.3658430862592111
27 0.0 0.37652174563292656 0.0002512562814070352 0.00022957047140464225 0.0004613843024647045 0.13366460277578368 3.514596793491266e-05 9.496889207944485e-05 8.001316104331179e-05 0.20418461354391002 6.730078966259871e-06 0.0003701543431442929 0.0011328966259870784 0.0010469011725293131 0.0007941493180186647 0.10861898779612347 0.00036940655659248626 7.477865518066523e-07 0.0004120303900454654 0.0007156317300789662 0.0002736898779612347 0.05130638310600622 0.0006640344580043072 0.0017565506101938262 3.7389327590332615e-06 3.0659248624072745e-05 0.00010543790380473797 0.11686856903565446
28 0.0 0.42030782462700017 0.00013905412295474562 5.150152702027615e-06 0.00033475992563179497 0.11766811385957594 3.8626145265207114e-06 1.1587843579562133e-05 0.00012746627937518349 0.2313873481348722 6.308937059983828e-05 0.00010042797768953849 3.090091621216569e-05 9.656536316301779e-05 0.0001660924246403906 0.11382481240568783 3.8626145265207114e-06 0.0 0.0002060061080811046 7.338967600389351e-05 3.3475992563179495e-05 0.09470873311393682 0.00025750763510138075 0.005945851294490882 0.0 0.00251584959494049 0.0002600827114523946 0.011728185240692387
29 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Binary file not shown (deleted image, 217 KiB)

-27
View File
@@ -1,27 +0,0 @@
letter,Male,Female
a,0.10419923843992125,0.11230333493368445
b,0.018546448249013497,0.015225074098649389
c,0.041480767577953326,0.04405987823025858
d,0.03868872722439995,0.026961499859145175
e,0.13138825792008238,0.18038344738539613
f,0.010247500256025038,0.007027086508994281
g,0.0180527572420696,0.017807501663033867
h,0.031508761634381516,0.03697185263156448
i,0.09337919525658041,0.10271520299704247
j,0.026664242619993696,0.012183083323972286
k,0.012803255631156278,0.004848323140290566
l,0.0509726758057992,0.06672341587576307
m,0.03320386129622267,0.030360801411648666
n,0.07989188838489009,0.08206189166389144
o,0.057005660062330925,0.03761362219276409
p,0.021467218695097115,0.011531157247822707
q,0.0018784195453980996,0.001950967247682959
r,0.0734505638264324,0.06822482369855525
s,0.05242163399917173,0.0432875249054165
t,0.03949576796436023,0.04783038894946737
u,0.031398878017324786,0.020461941209356852
v,0.011932256018971252,0.013217596918162086
w,0.0020867055933035655,0.0015356192415710282
x,0.002258056881804058,0.0006618534967773684
y,0.013323645811687641,0.011759717016584062
z,0.002253616045629267,0.002292394152504883
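The table above gives per-letter relative frequencies in male and female names. As an illustrative sketch of how the strongest skews could be pulled out (the deleted file's path is not shown here, so a few rows are inlined verbatim rather than read from disk):

```python
import csv
import io

# A few rows copied verbatim from the deleted letter-frequency table.
CSV_TEXT = """letter,Male,Female
a,0.10419923843992125,0.11230333493368445
e,0.13138825792008238,0.18038344738539613
j,0.026664242619993696,0.012183083323972286
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Sort letters by how much more frequent they are in male names.
skew = sorted(rows, key=lambda r: float(r["Male"]) - float(r["Female"]), reverse=True)
print([r["letter"] for r in skew])  # → ['j', 'a', 'e']
```

On these three rows, 'j' is the only male-skewed letter, while 'e' shows the largest female skew.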
-7
View File
@@ -1,7 +0,0 @@
sex,position,2-grams,3-grams,4-grams
female,prefix,"ma, ch, be, na, an, sa, es, jo, ju, me","mar, cha, est, chr, gra, dor, sar, rut, ben, mer","mari, chri, esth, grac, sara, dorc, ruth, rach, naom, jean"
female,suffix,"ne, ie, te, le, ce, ia, se, el, ah, th","ine, tte, lle, rah, nce, ene, nne, her, lie, rie","ette, line, tine, elle, ther, arie, ille, rcas, ruth, arah"
female,any,"ne, in, el, an, ie, ri, ra, li, ar, er","ine, tte, ett, mar, ari, lle, lin, eli, the, ell","ette, line, ther, mari, tine, elle, rist, chri, ance, hris"
male,prefix,"jo, je, pa, ch, ma, da, al, ju, be, fr","jea, jos, chr, pat, mar, jon, fra, dan, cha, ben","jean, chri, jose, jona, patr, fran, mich, emma, davi, dieu"
male,suffix,"in, el, an, ck, on, re, ce, er, se, is","ean, tin, ick, ier, ard, uel, ert, ain, iel, ise","jean, stin, rick, bert, seph, than, oise, avid, ndre, tian"
male,any,"an, er, el, ie, in, ri, is, on, en, re","ric, sti, jea, ean, tin, ris, ier, ist, ick, jos","jean, stin, rist, rick, chri, hris, usti, bert, jose, atha"
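The `position` column in the table above distinguishes n-grams taken from the start of a name (`prefix`), the end (`suffix`), or any sliding window (`any`). A minimal sketch of that extraction, using a hypothetical name list just to show the counting step:

```python
from collections import Counter

def ngrams(name, n, position):
    """Extract n-grams from a name; position is 'prefix', 'suffix', or 'any'."""
    if len(name) < n:
        return []
    if position == "prefix":
        return [name[:n]]
    if position == "suffix":
        return [name[-n:]]
    # 'any': every contiguous window of length n
    return [name[i:i + n] for i in range(len(name) - n + 1)]

# Hypothetical name list, not taken from the deleted data.
names = ["marie", "marthe", "martine"]
counts = Counter(g for name in names for g in ngrams(name, 3, "prefix"))
print(counts.most_common(1))  # → [('mar', 3)]
```

Each row of the deleted CSV would then be the top ten of such a `Counter`, computed over names of one sex.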
Binary file not shown (deleted image, 34 KiB)

File diff suppressed because it is too large

(deleted image, 464 KiB)

-29
View File
@@ -1,29 +0,0 @@
^,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,$
0.0,0.08177487512565312,0.05064544871095967,0.07327059962450833,0.06495363065149523,0.061627841497941024,0.044425593979353147,0.05752526940789172,0.024078074472915836,0.016592203335813446,0.11910658380449511,0.017117380488251007,0.03302685352589563,0.08869583137140728,0.035182675783582286,0.013984289415344039,0.05503217558728984,0.00021346554218849907,0.05022750354744187,0.05679281701430199,0.024045325783562315,0.0015048422132203263,0.015574997094552255,0.0054400764881583534,0.00016613969233005726,0.0062853521103135605,0.002710153731134059,0.0
0.0,0.005939042406371192,0.02544883058808627,0.060741665942489004,0.03423882544765226,0.014545218880858291,0.0021159734936245364,0.009683458260117744,0.022014608259160358,0.0375319435995705,0.0012071918943162118,0.00483524835366186,0.06685892969977804,0.04498424729464408,0.22067338212572413,0.008778800791243883,0.010449662778691797,0.0001958961956369646,0.11578879149722136,0.04779808257392872,0.05734632827194516,0.041184450260777605,0.016855321085647877,0.0012893799222601413,0.0007600183229223588,0.005051176039980499,0.003479293182959682,0.1402042328307295
0.0,0.10740025063072939,0.008207325873470239,4.244850350569272e-05,0.003728085960065186,0.4237910943039645,3.691174217886323e-06,0.00013288227184390762,0.00039495564131383656,0.12999208243130264,3.1374980852033744e-05,2.7683806634147422e-05,0.046482956925842464,3.3220567960976905e-05,0.0004447864932553019,0.11562418678818012,9.227935544715807e-06,1.8455871089431614e-06,0.10180812169063161,0.0005019996936325399,4.060291639674955e-05,0.021691185291408975,2.214704530731794e-05,0.0056788715342181075,1.8455871089431614e-06,0.012356205694374466,1.291910976260213e-05,0.021538001561366694
0.0,0.10050204481617504,5.957221194601566e-06,0.012163156374077748,5.7338253998040075e-05,0.21854065952395846,8.191179142577154e-06,1.5637705635829112e-05,0.2589321085286557,0.08507730983805294,2.0105621531780285e-05,0.1057168473193994,0.04785063459298775,4.393450631018655e-05,2.978610597300783e-05,0.049474722021166005,5.957221194601566e-06,0.022256178383031452,0.010718530234386868,0.00032913647100173653,0.026146243823106274,0.0030076520506244655,1.1169789739877937e-05,1.71270109344795e-05,4.021124306356057e-05,0.017885067331492553,5.957221194601566e-06,0.04113833561197044
0.0,0.13123437278970704,4.496701950163427e-05,2.716757428223737e-05,0.01997097753616332,0.19341720307012325,3.6535703345077846e-05,0.0005414778598321794,0.001204741397481285,0.22505150128952298,0.023330388618097914,1.405219359426071e-05,0.000277296620260078,0.0018839307545372192,0.0003981454851707201,0.12731568440272087,7.494503250272378e-06,0.0,0.09698730337468113,0.00034755758823138154,4.215658078278213e-05,0.0051580918619999645,0.0001564477553494359,0.0013602523399244366,9.368129062840473e-07,0.04583919231738472,0.00014333237466145925,0.12520879217648806
0.0,0.031895413152132776,0.012309605339111969,0.00870721680660756,0.0298675163088017,0.01029528350039639,0.0020813614768936716,0.00574201810613231,0.0009116137714871438,0.010444608551166907,0.00043419130147119637,0.0026210201219160722,0.12830321631540123,0.027835024848523834,0.08354099609624636,0.013881382027362197,0.017625577146542382,3.717462802398891e-05,0.11198919346094116,0.05457966355708561,0.03163080498524293,0.014952763160862552,0.010523970116611377,0.0003372866531390005,0.007001151995776294,0.0020375037696743585,0.0024278373639262424,0.37798660543852275
0.0,0.24238728488124023,1.7778409899018633e-05,2.133409187882236e-05,4.977954771725217e-05,0.09607097141231688,0.010471483430521974,2.133409187882236e-05,0.00032712274214194285,0.18898094154458825,7.111363959607453e-06,5.689091167685962e-05,0.08809913241359693,7.111363959607453e-05,0.00019556250888920495,0.025647134120324277,1.066704593941118e-05,0.0,0.3050632911392405,0.00016711705305077514,0.0022969705589532072,0.02134475892476177,3.5556819798037263e-06,0.0007609159436779974,0.0,0.002400085336367515,3.5556819798037263e-06,0.015524107523823069
0.0,0.1356264584296794,0.001080252906268644,1.588607215100947e-05,0.0006989871746444167,0.31163178202896913,1.4120953023119531e-05,0.0013097183928943365,0.009215686966713384,0.0887413641546668,1.9416310406789355e-05,1.0590714767339647e-05,0.09359544175636414,8.649083726660713e-05,0.014772281981310918,0.05647322137771078,3.5302382557798828e-06,3.5302382557798828e-06,0.12695442815435615,0.00037420525511266755,0.00015709560238220477,0.14264104184391405,2.1181429534679295e-05,0.002757116077764088,8.825595639449706e-06,0.004749935573151832,3.883262081357871e-05,0.00899857731398292
0.0,0.23289963854076665,3.564871753738659e-05,4.690620728603499e-05,6.848306263761108e-05,0.26556981191549,1.7824358768693296e-05,3.940121412026939e-05,5.159682801463849e-05,0.12588969348669787,1.1257489748648397e-05,0.00019419169816418486,0.0004596808314031429,0.0005131539077092228,0.01605974725059266,0.07720574094452215,3.0019972663062393e-05,2.814372437162099e-06,0.09786511088158341,0.00017824358768693297,0.004121179372151034,0.016572901158301883,8.536929726058368e-05,0.00026642725738467874,0.0,0.010226491312501348,7.504993165765598e-06,0.15158116134140495
0.0,0.06228411243453732,0.005795446918619388,0.09845124598199072,0.03328635202876481,0.18937896710200938,0.005852137258390008,0.010564929010010733,0.0003991130242471751,0.00012739036120869017,0.0015853746742749012,0.004633946566422507,0.05203358674566824,0.029307928241758888,0.20644406259916737,0.010186993411539938,0.0038761205301784834,0.008475857408808376,0.03351050693544404,0.1267853384445865,0.034354671707545616,0.0007083034405823336,0.02195903569080437,0.0006861485951547353,0.002158142589888398,0.000965039002302149,0.0035929946378757934,0.052596254658219155
0.0,0.12490546761508407,4.295183412001663e-05,8.130168601288861e-05,4.295183412001663e-05,0.29643208317929476,7.669970378574399e-06,4.601982227144639e-06,0.0029621425602054325,0.02271998625541308,4.295183412001663e-05,3.374786966572735e-05,1.8407928908578555e-05,6.902973340716958e-05,5.6757780801450546e-05,0.36199192198719726,0.00021475917060008315,7.669970378574399e-06,0.00026691496917438906,3.988384596858687e-05,4.601982227144639e-06,0.1879909739788585,1.8407928908578555e-05,5.0621804498591024e-05,0.0,0.0005844517428473692,3.067988151429759e-06,0.0014066725674305445
0.0,0.17051293593552538,0.00010003301089359489,0.00018339385330492396,4.334763805389112e-05,0.16333056575336527,4.001320435743795e-05,1.0003301089359488e-05,0.0020973587950690394,0.09480795329125279,3.0009903268078467e-05,0.00013004291416167336,0.0040513369411905925,0.000266754695716253,0.00026342026201979984,0.04065008119346051,0.0005101683555573339,6.668867392906325e-06,0.002400792261446277,0.022800857616346728,0.00011337074567940754,0.031060249882461213,6.33542402326101e-05,0.007512479118108976,3.667877066098479e-05,0.03316094311122671,3.0009903268078467e-05,0.42578717643489017
0.0,0.14258031599664123,0.018121651433974158,0.0015416351691170474,0.010372152272561979,0.18043688111447329,0.003440854764523217,0.003236993073110312,0.0002837534353449891,0.16829829483469572,4.903700144796899e-05,0.0011846017203722846,0.08546157592801147,0.0018463257511476862,0.00020716755667906,0.0888495368482762,0.0208693764476935,4.297624846001776e-05,0.00014490709416647014,0.005683884347609529,0.0020215366102539125,0.020202693619018865,0.024516847791351416,0.001074406211500444,9.366618254106435e-06,0.0360691939639308,0.00015206980224313977,0.18330196434514115
0.0,0.31932204396028435,0.038368037519108326,0.00011873135546683685,0.00011081593176904773,0.17818311343296872,0.0006362021797098007,5.2439681997852944e-05,5.936567773341842e-05,0.20796192681201364,2.374627109336737e-05,0.00010092165214681132,0.00023053671519810822,0.04005699105062408,0.00020184330429362263,0.08672138203297763,0.010948020402004582,2.968283886670921e-06,0.0001612767578424534,0.005299376165669818,0.0002523041303670283,0.047970435892488755,0.0004581051465095455,0.006624220207087273,9.894279622236404e-07,0.018100595140919277,0.00036806720194719424,0.03766554366592954
0.0,0.09272466504327583,6.54951531625727e-05,0.0676682588133141,0.04243183895879178,0.2208676970453058,0.0016707146854644294,0.033734710117275545,0.0024899923798154134,0.0862990806755178,0.006808358436540491,0.00222879614025689,0.0004584660721380089,4.667019896015659e-05,0.05049048810360941,0.03840682843838769,0.0003961084113425055,0.0005533752162418567,0.005650231566546017,0.015463131131101295,0.04694629828845869,0.019948568656372817,0.007369969570245905,0.00011177316557684563,1.686402147299776e-05,0.014610517580350198,0.003478302475442259,0.23906279965503272
0.0,0.003164071575712822,0.016308332918058378,0.029613095334402047,0.041853890953758216,0.032495642848868175,0.0018930066187277572,0.011876432294110452,0.012510670409103,0.07853874127032717,0.0004316702272449279,0.00471407186694461,0.0692342092506865,0.04876514486192701,0.2053281179113374,0.0007785596452408519,0.01644424108555678,7.118999249916351e-05,0.17295026214096784,0.10292454961004063,0.022388605459236936,0.03825103015155055,0.011361275621116504,0.0008639876362398481,0.0006983091082417949,0.007007036807167667,0.002350564115972381,0.06718329028496059
0.0,0.26933300206615723,2.192209253680628e-05,0.0001096104626840314,0.00010778362163929754,0.08803364310467982,8.03810059682897e-05,2.5575774626273995e-05,0.34018520514511513,0.09085245883670416,2.9229456715741707e-05,2.5575774626273995e-05,0.04719826523174392,3.6536820894677135e-05,0.00012422519104190227,0.0354936946581341,0.013204407071336317,3.6536820894677134e-06,0.08679504487635026,0.003801656214091156,0.001415801809668739,0.0039459766566251305,1.8268410447338567e-06,0.0004694981484966012,3.6536820894677134e-06,0.01590082445336349,0.0,0.0028005473215770025
0.0,0.00038185048063353973,4.9806584430461705e-05,0.00014941975329138513,4.9806584430461705e-05,0.0014941975329138511,0.0,4.9806584430461705e-05,3.320438962030781e-05,0.00029883950658277027,0.0,0.0,3.320438962030781e-05,0.00011621536367107731,4.9806584430461705e-05,0.00018262414291169292,0.0,8.30109740507695e-05,8.30109740507695e-05,0.00023243072734215463,9.961316886092341e-05,0.9891919711785898,4.9806584430461705e-05,4.9806584430461705e-05,0.0,0.00011621536367107731,0.0,0.007205352547606794
0.0,0.1627793575149433,0.0035916362881505785,0.036496892212239035,0.040730844643505654,0.1503575411693532,0.0012619623047103333,0.015429811736916198,0.0012770756257248282,0.23819394035839017,0.00021692060750216365,0.000525410159974503,0.026246059979436992,0.01374067585882558,0.020855493981119907,0.0587779279725791,0.0009934786019822454,0.00013068577583122155,0.016929586592884027,0.006149343615133057,0.04790433801209155,0.02665767572236118,0.017815938419439997,0.00039161281805206186,2.5781547612962073e-05,0.01068822952333919,0.00010534873765986227,0.1017264302202411
0.0,0.11064837045719457,0.00019155378133028236,0.03693078718830975,0.00084374879871672,0.1936315534685894,6.319971696951493e-05,3.192563022171373e-05,0.02285614506485138,0.058747068868796735,1.1076239056512926e-05,0.0013578165996337023,0.008225084814671716,0.006192269176064638,0.0010717890145861037,0.04918110758481304,0.02873958263428148,0.0002521473244041472,0.005270335160478415,0.04403717185827366,0.20958133770996804,0.02021609090855783,8.274602118689069e-05,0.0006600135390733879,1.954630421737575e-06,0.024633555661684747,5.863891265212725e-06,0.17653570425659867
0.0,0.06351954441846226,1.4753288139094001e-05,0.0013484505359131916,1.99169389877769e-05,0.16353651069982222,1.84416101738675e-05,3.39325627199162e-05,0.18805131193614777,0.2149259016103214,1.4753288139094001e-05,1.69662813599581e-05,0.0001386809085074836,0.00011802630511275201,0.00019253041021517672,0.07676357118092694,5.38495017076931e-05,2.2129932208641e-06,0.06875179805699196,0.010426886392304685,0.08126775004979235,0.009717990897221218,0.0003732581899190782,0.0005030871255431055,1.10649661043205e-05,0.008581250046104025,3.98338779755538e-05,0.11155772592816623
0.0,0.018356080355773654,0.01654418269968032,0.045542802543262595,0.07255847970462294,0.18801741120403903,0.0036002028947896146,0.04503556556922605,0.002494898139737888,0.03139678907199226,0.0007679331862739315,0.011800927182003705,0.10299977587203472,0.02639873544642752,0.06185106106897243,0.0003314734644285327,0.009195144680499687,9.672891132789921e-05,0.06532858339329739,0.1206587002937256,0.050628148113196415,0.00014745260873155367,0.0076580986870819715,0.0010781734750451204,0.0062708645441355145,0.020688190815471907,0.010810045651327663,0.07974355042289408
0.0,0.12830504252517705,1.01674063454783e-05,3.304407062280448e-05,0.00014742739200943535,0.30917049215330417,5.08370317273915e-06,7.625554759108725e-06,0.00013471813407758747,0.5037873588636906,2.541851586369575e-06,7.625554759108725e-06,0.0008693132425383947,2.541851586369575e-05,0.00019572257215045728,0.04173211934501568,5.08370317273915e-06,0.0,0.00263081639189251,5.8462586486500224e-05,2.7960367450065326e-05,0.00580050532009537,4.321147696828278e-05,0.00031010589353708815,0.0,0.002524058625264988,5.08370317273915e-06,0.004161011046886994
0.0,0.36613233287858116,0.0001534788540245566,1.7053206002728514e-05,0.0002899045020463847,0.16732605729877217,1.7053206002728514e-05,6.821282401091406e-05,0.001892905866302865,0.4231412005457026,1.7053206002728514e-05,5.1159618008185536e-05,0.0003751705320600273,0.00042633015006821284,0.0036493860845839016,0.013250341064120055,1.7053206002728514e-05,5.1159618008185536e-05,0.0007162346521145975,0.0006309686221009549,0.0020804911323328784,0.007929740791268758,3.410641200545703e-05,3.410641200545703e-05,0.0,0.004280354706684857,6.821282401091406e-05,0.007349931787175989
0.0,0.42772328342798144,3.971248163297724e-05,0.012668281640919741,0.00015884992653190897,0.0298439299471824,1.985624081648862e-05,3.971248163297724e-05,0.0013899368571542036,0.19326079186688377,0.0,3.971248163297724e-05,0.00013899368571542036,0.00013899368571542036,0.00011913744489893174,0.02454231364917994,0.00230332393471268,3.971248163297724e-05,1.985624081648862e-05,0.0007148246693935904,0.002521742583694055,0.0016480679877685556,1.985624081648862e-05,0.0005956872244946587,7.942496326595449e-05,0.005877447281680632,0.00011913744489893174,0.29593741312894645
0.0,0.07903953601739935,0.0008749671887304226,0.020399235028686423,0.025884029348899416,0.017196855117933077,0.0011899553766733747,0.0022049173156006648,0.00033248753171756057,0.011642063422621652,9.499643763358873e-05,0.0003999850005624789,0.07594465207554717,0.013214504456082897,0.04049848130695099,0.024181593190255365,0.006899741259702761,5.749784383085634e-05,0.01051710560853968,0.03773608489681637,0.0037698586303013637,0.0075047185730535105,0.028846418259315276,0.00024249090659100284,0.00047998200067497467,6.749746884491831e-05,0.0010774595952651777,0.5897028861417697
0.0,0.40388373911101183,0.0005723698905517087,0.0004886084431538977,0.0008794951976770158,0.21811480902389993,2.792048246593701e-05,0.00011168192986374804,0.000991177127540764,0.20157192316283226,2.792048246593701e-05,0.00015356265356265356,0.0005584096493187402,0.0017589903953540316,0.0011168192986374804,0.07377987491623855,5.584096493187402e-05,2.792048246593701e-05,0.004844203707840071,6.980120616484253e-05,4.1880723698905516e-05,0.027669198123743577,6.980120616484253e-05,0.0018846325664507483,1.3960241232968505e-05,0.018092472637927185,0.004606879606879607,0.03858610676792495
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
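The matrix above is a first-order transition table: each row is the current symbol (`^` for start-of-name, `$` for end-of-name, the all-zero final row being `$` itself), and each row is a probability distribution over the next symbol. One common use of such a table is sampling new names by walking the chain from `^` to `$`. A sketch under a toy table (these probabilities are made up for illustration, not taken from the real matrix):

```python
import random

# Toy table in the same shape as the matrix above: one row per current
# symbol, each row a distribution over the next symbol.
T = {
    "^": {"j": 0.5, "a": 0.5},
    "j": {"a": 1.0},
    "a": {"n": 0.5, "$": 0.5},
    "n": {"$": 1.0},
}

def sample_name(table, rng):
    """Walk the first-order chain from '^' until '$' is drawn."""
    name, state = "", "^"
    while state != "$":
        state = rng.choices(list(table[state]), weights=list(table[state].values()))[0]
        if state != "$":
            name += state
    return name

print(sample_name(T, random.Random(0)))
```

With this toy table the only reachable outputs are "a", "an", "ja", and "jan"; with the full 28×28 matrix the same walk produces arbitrary name-like strings.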
-3
View File
@@ -1,3 +0,0 @@
category,l2,kl_mf,kl_fm,jsd,permutation_p_value
names,0.3189041485139616,0.04320097944655348,0.0215380760498496,0.03236952774820154,0.977
surnames,1.2770018925640299,0.2936188220992242,0.23989460296618093,0.26675671253270256,0.002
Binary file not shown.
Before | Width: | Height: | Size: 45 KiB
BIN
View File
Binary file not shown.
Before | Width: | Height: | Size: 158 KiB
-23
View File
@@ -1,23 +0,0 @@
services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
    image: drc-ners:uv
    working_dir: /app
    tty: true
    stdin_open: true
    environment:
      NERS_ENV: production
      STREAMLIT_SERVER_ADDRESS: 0.0.0.0
      PYTHONPATH: /app/src
    # expose Streamlit for `ners web run`
    ports:
      - "8501:8501"
    volumes:
      - ./src:/app/src
      - ./assets:/app/assets
      - ./config:/app/config
      - ./data:/app/data
    # default command shows CLI help; override per run
    command: ["ners", "--help"]
+1 -1
View File
@@ -30,7 +30,7 @@ llm:
# Data handling configuration
data:
split_evaluation: false
max_dataset_size: 10_000
max_dataset_size: 100_000
balance_by_sex: true
# Enhanced logging for development
+65
View File
@@ -73,6 +73,37 @@ baseline_experiments:
batch_size: 32
tags: [ "baseline", "neural", "cnn", "surname" ]
  ## Ensemble Models
  - name: "ensemble"
    description: "Baseline Ensemble with multiple models"
    model_type: "ensemble"
    features: [ "full_name" ]
    model_params:
      base_models: [ "logistic_regression", "random_forest", "xgboost" ]
      voting: "soft"
      cv_folds: 5
    tags: [ "baseline", "ensemble" ]
  - name: "ensemble_native"
    description: "Baseline Ensemble with native name"
    model_type: "ensemble"
    features: [ "native_name" ]
    model_params:
      base_models: [ "logistic_regression", "random_forest", "xgboost" ]
      voting: "soft"
      cv_folds: 5
    tags: [ "baseline", "ensemble", "native" ]
  - name: "ensemble_surname"
    description: "Baseline Ensemble with surname"
    model_type: "ensemble"
    features: [ "surname" ]
    model_params:
      base_models: [ "logistic_regression", "random_forest", "xgboost" ]
      voting: "soft"
      cv_folds: 5
    tags: [ "baseline", "ensemble", "surname" ]
# LightGBM Models
- name: "lightgbm"
description: "Baseline LightGBM with engineered features"
@@ -231,6 +262,40 @@ baseline_experiments:
min_samples_leaf: 1
tags: [ "baseline", "random_forest", "engineered", "surname" ]
  # SVM Models
  - name: "svm"
    description: "Baseline SVM with full name features"
    model_type: "svm"
    features: [ "full_name" ]
    model_params:
      C: 1.0
      kernel: "rbf"
      ngram_range: [ 2, 4 ]
      max_features: 5000
    tags: [ "baseline", "svm" ]
  - name: "svm_native"
    description: "Baseline SVM with native name features"
    model_type: "svm"
    features: [ "native_name" ]
    model_params:
      C: 1.0
      kernel: "rbf"
      ngram_range: [ 2, 4 ]
      max_features: 5000
    tags: [ "baseline", "svm", "native" ]
  - name: "svm_surname"
    description: "Baseline SVM with surname features"
    model_type: "svm"
    features: [ "surname" ]
    model_params:
      C: 1.0
      kernel: "rbf"
      ngram_range: [ 2, 4 ]
      max_features: 5000
    tags: [ "baseline", "svm", "surname" ]
# Transformer Models
- name: "transformer"
description: "Baseline Transformer with attention mechanism"
@@ -2,9 +2,10 @@ import logging
from pathlib import Path
from typing import Optional, Union
from ners.core.utils import ensure_directories
from ners.core.config.config_manager import ConfigManager
from ners.core.config.pipeline_config import PipelineConfig
from core.utils import ensure_directories
from .config_manager import ConfigManager
from .logging_config import LoggingConfig
from .pipeline_config import PipelineConfig
config_manager = ConfigManager()
@@ -16,14 +17,12 @@ def get_config() -> PipelineConfig:
def load_config(config_path: Optional[Union[str, Path]] = None) -> PipelineConfig:
"""Load configuration from specified path"""
if config_path is not None:
return config_manager.load_config(config_path)
if config_path:
return config_manager.load_config(Path(config_path))
return config_manager.get_config()
def setup_config(
config_path: Optional[Union[str, Path]] = None, env: str = "development"
) -> PipelineConfig:
def setup_config(config_path: Optional[Path] = None, env: str = "development") -> PipelineConfig:
"""
Unified configuration loading and logging setup for all entrypoint scripts.
@@ -37,8 +36,6 @@ def setup_config(
# Determine config path
if config_path is None:
config_path = Path("config") / f"pipeline.{env}.yaml"
else:
config_path = Path(config_path)
# Load configuration
config = ConfigManager(config_path).load_config()
@@ -5,17 +5,15 @@ from typing import Optional, Union, Dict, Any
import yaml
from ners.core.config.pipeline_config import PipelineConfig
from ners.core.config.project_paths import ProjectPaths
from core.config.pipeline_config import PipelineConfig
from core.config.project_paths import ProjectPaths
class ConfigManager:
"""Centralized configuration management"""
def __init__(self, config_path: Optional[Union[str, Path]] = None):
self.config_path: Path = (
Path(config_path) if config_path is not None else self._find_config_file()
)
self.config_path = config_path or self._find_config_file()
self._config: Optional[PipelineConfig] = None
self._setup_default_paths()
@@ -38,7 +36,7 @@ class ConfigManager:
def _setup_default_paths(self):
"""Setup default project paths"""
root_dir = Path(__file__).parent.parent.parent.parent.parent
root_dir = Path(__file__).parent.parent.parent
self.default_paths = ProjectPaths(
root_dir=root_dir,
configs_dir=root_dir / "config",
@@ -49,17 +47,13 @@ class ConfigManager:
checkpoints_dir=root_dir / "data" / "checkpoints",
)
def load_config(
self, config_path: Optional[Union[str, Path]] = None
) -> PipelineConfig:
def load_config(self, config_path: Optional[Path] = None) -> PipelineConfig:
"""Load configuration from file"""
if config_path is not None:
self.config_path = Path(config_path)
if config_path:
self.config_path = config_path
if not self.config_path.exists():
logging.warning(
f"Config file not found: {self.config_path}. Using defaults."
)
logging.warning(f"Config file not found: {self.config_path}. Using defaults.")
return self._create_default_config()
try:
@@ -84,11 +78,9 @@ class ConfigManager:
"""Create default configuration"""
return PipelineConfig(paths=self.default_paths)
def save_config(
self, config: PipelineConfig, path: Optional[Union[str, Path]] = None
):
def save_config(self, config: PipelineConfig, path: Optional[Path] = None):
"""Save configuration to file"""
save_path = Path(path) if path is not None else self.config_path
save_path = path or self.config_path
save_path.parent.mkdir(parents=True, exist_ok=True)
config_dict = config.model_dump()
@@ -130,11 +122,7 @@ class ConfigManager:
def _deep_update(self, base_dict: Dict, update_dict: Dict):
"""Recursively update nested dictionaries"""
for key, value in update_dict.items():
if (
key in base_dict
and isinstance(base_dict[key], dict)
and isinstance(value, dict)
):
if key in base_dict and isinstance(base_dict[key], dict) and isinstance(value, dict):
self._deep_update(base_dict[key], value)
else:
base_dict[key] = value
@@ -148,8 +136,8 @@ class ConfigManager:
env_config = self.load_config(env_config_path)
# Merge configurations
base_dict = base_config.model_dump()
env_dict = env_config.model_dump()
base_dict = base_config.dict()
env_dict = env_config.dict()
self._deep_update(base_dict, env_dict)
return PipelineConfig(**base_dict)
@@ -1,10 +1,10 @@
from pydantic import BaseModel
from ners.core.config.annotation_config import AnnotationConfig
from ners.core.config.data_config import DataConfig
from ners.core.config.logging_config import LoggingConfig
from ners.core.config.processing_config import ProcessingConfig
from ners.core.config.project_paths import ProjectPaths
from core.config.annotation_config import AnnotationConfig
from core.config.data_config import DataConfig
from core.config.logging_config import LoggingConfig
from core.config.processing_config import ProcessingConfig
from core.config.project_paths import ProjectPaths
class PipelineConfig(BaseModel):
@@ -10,8 +10,6 @@ class ProcessingConfig(BaseModel):
max_workers: int = 4
checkpoint_interval: int = 5
use_multiprocessing: bool = False
encoding_options: list = field(
default_factory=lambda: ["utf-8", "utf-16", "latin1"]
)
encoding_options: list = field(default_factory=lambda: ["utf-8", "utf-16", "latin1"])
chunk_size: int = 100_000
epochs: int = 2
@@ -4,13 +4,13 @@ from pathlib import Path
from typing import TYPE_CHECKING
if TYPE_CHECKING:
from ners.core.config import PipelineConfig
from core.config import PipelineConfig
@contextmanager
def temporary_config_override(**overrides):
"""Context manager for temporarily overriding configuration"""
from ners.core.config import get_config
from core.config import get_config
config = get_config()
original_values = {}
@@ -5,7 +5,7 @@ from typing import Optional, Union, Iterator, Dict
import pandas as pd
from ners.core.config.pipeline_config import PipelineConfig
from core.config.pipeline_config import PipelineConfig
OPTIMIZED_DTYPES = {
# Numeric columns with appropriate bit-width
@@ -113,9 +113,7 @@ class DataLoader:
sex_values = df["sex"].dropna().unique()
if len(sex_values) == 0:
logging.warning(
"No valid values found in sex column 'sex', using random sampling"
)
logging.warning(f"No valid values found in sex column 'sex', using random sampling")
return df.sample(n=max_size, random_state=self.config.data.random_seed)
# Calculate samples per sex category
@@ -142,22 +140,18 @@ class DataLoader:
logging.info(f"Sampled {current_samples} records for sex '{sex}'")
if not balanced_samples:
logging.warning(
"No balanced samples could be created, using random sampling"
)
logging.warning("No balanced samples could be created, using random sampling")
return df.sample(n=max_size, random_state=self.config.data.random_seed)
# Create result using iloc with indices (no copying until final step)
result = df.iloc[balanced_samples].copy()
# Shuffle the final result
result = result.sample(
frac=1, random_state=self.config.data.random_seed
).reset_index(drop=True)
logging.info(
f"Created balanced dataset with {len(result)} records from {len(df)} total"
result = result.sample(frac=1, random_state=self.config.data.random_seed).reset_index(
drop=True
)
logging.info(f"Created balanced dataset with {len(result)} records from {len(df)} total")
return result
@classmethod
@@ -1,4 +1,4 @@
from ners.core.config.pipeline_config import PipelineConfig
from core.config.pipeline_config import PipelineConfig
class PromptManager:
@@ -19,15 +19,9 @@ class RegionMapper:
return (
series.str.upper()
.str.strip()
.apply(
lambda x: (
unicodedata.normalize("NFKD", x)
.encode("ascii", errors="ignore")
.decode("utf-8")
if isinstance(x, str)
else x
)
)
.apply(lambda x: unicodedata.normalize("NFKD", x)
.encode("ascii", errors="ignore")
.decode("utf-8") if isinstance(x, str) else x)
)
@staticmethod
@@ -2,7 +2,7 @@ import json
import logging
from typing import Dict, Any
from ners.core.config.pipeline_config import PipelineConfig
from core.config.pipeline_config import PipelineConfig
class StateManager:
+41 -11
View File
@@ -1,17 +1,21 @@
#!.venv/bin/python3
import argparse
import logging
from ners.core.utils.data_loader import DataLoader
from ners.processing.batch.batch_config import BatchConfig
from ners.processing.pipeline import Pipeline
from ners.processing.steps.data_cleaning_step import DataCleaningStep
from ners.processing.steps.data_selection_step import DataSelectionStep
from ners.processing.steps.data_splitting_step import DataSplittingStep
from ners.processing.steps.llm_annotation_step import LLMAnnotationStep
from ners.processing.steps.ner_annotation_step import NERAnnotationStep
from ners.processing.steps.feature_extraction_step import FeatureExtractionStep
import sys
import traceback
from core.config import setup_config
from core.utils.data_loader import DataLoader
from processing.batch.batch_config import BatchConfig
from processing.pipeline import Pipeline
from processing.steps.data_cleaning_step import DataCleaningStep
from processing.steps.data_selection_step import DataSelectionStep
from processing.steps.data_splitting_step import DataSplittingStep
from processing.steps.feature_extraction_step import FeatureExtractionStep
def create_pipeline(config) -> Pipeline:
"""Create pipeline from configuration"""
batch_config = BatchConfig(
batch_size=config.processing.batch_size,
max_workers=config.processing.max_workers,
@@ -19,13 +23,14 @@ def create_pipeline(config) -> Pipeline:
use_multiprocessing=config.processing.use_multiprocessing,
)
# Add steps based on configuration
pipeline = Pipeline(batch_config)
steps = [
DataCleaningStep(config),
FeatureExtractionStep(config),
DataSelectionStep(config),
NERAnnotationStep(config),
LLMAnnotationStep(config),
# NERAnnotationStep(config),
# LLMAnnotationStep(config),
]
for stage in config.stages:
@@ -37,6 +42,7 @@ def create_pipeline(config) -> Pipeline:
def run_pipeline(config) -> int:
"""Run the complete pipeline"""
try:
logging.info(f"Starting pipeline: {config.name} v{config.version}")
@@ -73,3 +79,27 @@ def run_pipeline(config) -> int:
except Exception as e:
logging.error(f"Pipeline failed: {e}", exc_info=True)
return 1
def main():
"""Main entry point with unified configuration loading"""
parser = argparse.ArgumentParser(
description="DRC NERS Processing Pipeline",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
parser.add_argument("--config", type=str, help="Path to configuration file")
parser.add_argument("--env", type=str, default="development", help="Environment name")
args = parser.parse_args()
try:
config = setup_config(config_path=args.config, env=args.env)
return run_pipeline(config)
except Exception as e:
print(f"Pipeline failed: {e}")
traceback.print_exc()
return 1
if __name__ == "__main__":
sys.exit(main())
+93
View File
@@ -0,0 +1,93 @@
# Model Notation Reference
This document summarises the mathematical formulation and notation behind the models available in `research/models`. In all cases, the input example is represented by a feature vector $\mathbf{x}$ (after any feature-extraction or vectorisation steps) and the target label belongs to a finite set of classes $\mathcal{Y}$.
## Logistic Regression
- Decision function: $z = \mathbf{w}^\top \mathbf{x} + b$.
- Binary posterior: $p(y=1\mid \mathbf{x}) = \sigma(z) = \frac{1}{1 + e^{-z}}$ and $p(y=0\mid \mathbf{x}) = 1 - \sigma(z)$.
- Multi-class (one-vs-rest or softmax): $p(y=c\mid \mathbf{x}) = \frac{\exp(\mathbf{w}_c^\top \mathbf{x} + b_c)}{\sum_{k \in \mathcal{Y}} \exp(\mathbf{w}_k^\top \mathbf{x} + b_k)}$.
- Loss: negative log-likelihood $\mathcal{L} = -\sum_i \log p(y_i\mid \mathbf{x}_i)$ plus regularisation when configured.
- Gender prediction rationale: linear decision boundaries over character n-gram counts provide a strong, interpretable baseline for name-based gender attribution.
- Implementation notes: uses character n-grams via `CountVectorizer`; `solver='liblinear'` with optional `class_weight` and `n_jobs` to speed up sparse optimisation.
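The binary formulas above can be transcribed into a few lines of plain Python (an illustrative sketch only; the repository's models rely on scikit-learn's fitted `LogisticRegression`):

```python
import math

def sigmoid(z: float) -> float:
    """Binary posterior: p(y=1 | x) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Decision function z = w . x + b, squashed through the sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

def negative_log_likelihood(w, b, xs, ys):
    """L = -sum_i log p(y_i | x_i), the loss minimised during training."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict_proba(w, b, x)
        total -= math.log(p if y == 1 else 1.0 - p)
    return total
```

Training amounts to choosing `w` and `b` that minimise the negative log-likelihood (plus any regulariser) over the vectorised n-gram counts.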
## Multinomial Naive Bayes
- Class prior: $p(y=c) = \frac{N_c}{N}$ where $N_c$ counts training instances in class $c$.
- Conditional likelihood (bag-of-ngrams): $p(\mathbf{x}\mid y=c) = \prod_{j=1}^{d} p(x_j\mid y=c)^{x_j}$ with categorical parameters estimated via Laplace smoothing.
- Posterior up to an additive constant: $\log p(y=c\mid \mathbf{x}) = \log p(y=c) + \sum_{j=1}^{d} x_j \log p(x_j\mid y=c) + \text{const}$.
- Gender prediction rationale: captures the relative frequency of character patterns associated with each gender, giving a fast and robust probabilistic baseline for sparse n-gram features.
- Implementation notes: character n-gram counts with Laplace smoothing; extremely fast to train and deploy.
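The estimation and scoring steps follow directly from these formulas; a toy sketch over character bigrams (the real pipeline vectorises names with scikit-learn, and these example tokens are invented):

```python
import math
from collections import Counter

def train_mnb(docs, labels, vocab, alpha=1.0):
    """Estimate log p(y=c) and Laplace-smoothed log p(x_j | y=c)."""
    classes = sorted(set(labels))
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    log_lik = {}
    for c in classes:
        counts = Counter(t for d, y in zip(docs, labels) if y == c for t in d)
        total = sum(counts.values())
        log_lik[c] = {t: math.log((counts[t] + alpha) / (total + alpha * len(vocab)))
                      for t in vocab}
    return log_prior, log_lik

def log_posterior(doc, log_prior, log_lik):
    """log p(y=c | x) for each class c, up to the shared normaliser."""
    return {c: log_prior[c] + sum(log_lik[c][t] for t in doc) for c in log_prior}

# Toy bigram "documents" for two classes
log_prior, log_lik = train_mnb(
    docs=[["an", "na"], ["na"], ["ko", "ok"]],
    labels=["f", "f", "m"],
    vocab=["an", "na", "ko", "ok"],
)
posterior = log_posterior(["an", "na"], log_prior, log_lik)
```

Because only the argmax matters, the shared normalising constant is never computed.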
## Support Vector Machine (RBF Kernel)
- Dual-form decision function: $f(\mathbf{x}) = \operatorname{sign}\Big( \sum_{i=1}^{M} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \Big)$.
- RBF kernel: $K(\mathbf{x}_i, \mathbf{x}) = \exp\big(-\gamma \lVert \mathbf{x}_i - \mathbf{x} \rVert_2^2\big)$.
- Soft-margin optimisation: $\min_{\mathbf{w}, \xi} \frac{1}{2}\lVert \mathbf{w} \rVert_2^2 + C \sum_i \xi_i$ s.t. $y_i(\mathbf{w}^\top \phi(\mathbf{x}_i) + b) \geq 1 - \xi_i$, $\xi_i \geq 0$.
- Gender prediction rationale: non-linear kernels model subtle character-pattern interactions beyond linear baselines, improving separability when male and female names share prefixes but diverge in internal structure.
- Implementation notes: TF-IDF character features; increased `cache_size` and optional `class_weight` for stability on imbalanced data.
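The dual-form decision function transcribes directly; the support vectors, dual coefficients, and $\gamma$ below are made up for illustration (in practice they come from the fitted scikit-learn model):

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def svm_decision(x, support_vectors, alphas, sv_labels, b=0.0, gamma=0.5):
    """f(x) = sign(sum_i alpha_i y_i K(x_i, x) + b) over the support vectors."""
    s = sum(a * y * rbf_kernel(sv, x, gamma)
            for a, y, sv in zip(alphas, sv_labels, support_vectors)) + b
    return 1 if s >= 0 else -1

# Two support vectors, one per class
svs = [(0.0, 0.0), (2.0, 2.0)]
alphas = [1.0, 1.0]
ys = [-1, 1]
```

Only the support vectors (points with $\alpha_i > 0$) contribute, which keeps inference cost proportional to their number rather than the training-set size.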
## Random Forest
- Ensemble of $T$ decision trees: $\hat{y} = \operatorname{mode}\{ h_t(\mathbf{x}) : t=1, \dots, T \}$ for classification.
- Each tree draws a bootstrap sample of the training set and a random subset of features at each split.
- Feature importance (used in implementation): mean decrease in impurity aggregated over splits per feature.
- Gender prediction rationale: handles heterogeneous engineered features (length, province, endings) without heavy preprocessing, while delivering interpretable feature-importance signals.
- Implementation notes: enables `n_jobs=-1` for parallel trees; persistent label encoders ensure stable categorical mappings.
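The two randomisation mechanisms named above (bootstrap sampling per tree, majority vote across trees) in miniature; fitting the actual trees is left to scikit-learn:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw |data| examples with replacement, as each tree does."""
    return [rng.choice(data) for _ in data]

def majority_vote(tree_predictions):
    """hat{y} = mode of the individual tree outputs."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed for reproducibility
sample = bootstrap_sample(["a", "b", "c", "d"], rng)
```

Each bootstrap sample leaves out roughly a third of the data, which is what makes the trees decorrelated enough for voting to help.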
## LightGBM (Gradient Boosted Trees)
- Additive model: $F_0(\mathbf{x}) = \hat{p}$ (initial prediction), $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta h_m(\mathbf{x})$.
- Each weak learner $h_m$ is a decision tree grown with a leaf-wise strategy under a depth constraint.
- Optimises differentiable loss (default: logistic) using first- and second-order gradients over data in each boosting iteration.
- Gender prediction rationale: excels with sparse categorical encodings and numerous engineered features, offering strong accuracy with manageable inference cost.
- Implementation notes: `objective='binary'`, `n_jobs=-1` for throughput; works well with compact character-gram features plus metadata.
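The additive model is simple to state in code; here the weak learners $h_m$ are stubbed as plain functions rather than fitted trees (a sketch of the recursion, not LightGBM itself):

```python
def boosted_score(x, weak_learners, eta=0.1, f0=0.0):
    """F_M(x) = F_0 + eta * sum_m h_m(x); the sigmoid of this raw score
    gives p(y=1 | x) under the logistic objective."""
    score = f0
    for h in weak_learners:
        score += eta * h(x)
    return score

# Stub "trees": each maps an input to a leaf value
learners = [lambda x: 1.0, lambda x: -0.5]
```

The learning rate `eta` shrinks each tree's contribution, trading more boosting rounds for better generalisation.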
## XGBoost
- Objective: $\mathcal{L}^{(t)} = \sum_{i} \ell(y_i, \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)) + \Omega(f_t)$ with regulariser $\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_j w_j^2$.
- Tree score expansion via second-order Taylor approximation; optimal leaf weight $w_j = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$ where $g_i$ and $h_i$ are gradient and Hessian statistics.
- Final prediction: $\hat{y}(\mathbf{x}) = \sum_{t=1}^{M} \eta f_t(\mathbf{x})$.
- Gender prediction rationale: strong regularisation and gradient-informed splits capture interactions between textual and metadata features; suited to high-stakes deployment when tuned carefully.
- Implementation notes: `tree_method='hist'`, `n_jobs=-1` for efficient CPU training; integrates engineered categorical encodings.
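The closed-form leaf weight above can be checked numerically: for a fixed leaf, the second-order objective is a quadratic in $w$, so $w^\ast$ must beat any other candidate. The gradient and Hessian values below are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
g = rng.normal(size=50)              # gradient statistics for instances in the leaf
h = rng.uniform(0.1, 1.0, size=50)   # Hessian statistics (positive for logistic loss)
lam = 1.0

def leaf_obj(w):
    # sum_i [g_i * w + 0.5 * h_i * w^2] + 0.5 * lam * w^2
    return g.sum() * w + 0.5 * (h.sum() + lam) * w ** 2

w_star = -g.sum() / (h.sum() + lam)  # optimal leaf weight from the formula
grid = np.linspace(w_star - 1.0, w_star + 1.0, 201)
```

Evaluating `leaf_obj` over the grid confirms that no candidate weight improves on `w_star`.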
## Convolutional Neural Network (1D)
- Token/character embeddings produce $X \in \mathbb{R}^{L \times d}$.
- Convolution layer: $H^{(k)} = \operatorname{ReLU}(X * W^{(k)} + b^{(k)})$ where $*$ denotes 1D convolution with filter $W^{(k)}$.
- Pooling summarises temporal dimension (max or global max); dense layers map pooled vector to logits $\mathbf{z}$.
- Output probabilities: $p(y=c\mid \mathbf{x}) = \operatorname{softmax}_c(\mathbf{z})$; loss via cross-entropy.
- Gender prediction rationale: convolutional filters learn discriminative prefixes, suffixes, and intra-name motifs directly from characters, accommodating mixed-language inputs.
- Implementation notes: adds `SpatialDropout1D` on embeddings and `padding='same'` in conv layers for stability and length-invariance.
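The forward pass can be sketched in NumPy; sequence length, filter count, and class count are illustrative assumptions, and valid convolution is used here for brevity (the implementation uses `padding='same'`):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, k, n_filters, n_classes = 12, 8, 3, 4, 2

X = rng.normal(size=(L, d))                  # embedded name: L characters x d dims
W = rng.normal(size=(n_filters, k, d))       # n_filters filters of width k
b = np.zeros(n_filters)

# Valid 1D convolution + ReLU: slide each filter over the character sequence
H = np.stack([
    [np.maximum((X[t:t + k] * W[f]).sum() + b[f], 0.0)
     for t in range(L - k + 1)]
    for f in range(n_filters)
])                                           # shape (n_filters, L - k + 1)

pooled = H.max(axis=1)                       # global max pool per filter
Wd = rng.normal(size=(n_filters, n_classes))
z = pooled @ Wd                              # dense logits
prob = np.exp(z - z.max()); prob /= prob.sum()  # softmax probabilities
```

Each pooled value answers "did this filter's motif occur anywhere in the name?", which is why max pooling suits prefix/suffix detection.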
## Bidirectional GRU
- Forward GRU recursion: $\begin{aligned}
&\mathbf{z}_t = \sigma(W_z \mathbf{x}_t + U_z \mathbf{h}_{t-1} + \mathbf{b}_z),\\
&\mathbf{r}_t = \sigma(W_r \mathbf{x}_t + U_r \mathbf{h}_{t-1} + \mathbf{b}_r),\\
&\tilde{\mathbf{h}}_t = \tanh(W_h \mathbf{x}_t + U_h(\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{b}_h),\\
&\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t.
\end{aligned}$
- Backward GRU mirrors the recurrence from $t=L$ to $1$; final representation concatenates $[\mathbf{h}_L^{\rightarrow}; \mathbf{h}_1^{\leftarrow}]$ before dense layers and softmax output.
- Gender prediction rationale: bidirectional context processes character sequences in both directions, learning gender-specific morphemes appearing at any position within the name.
- Implementation notes: `Embedding(mask_zero=True)` propagates masks to GRUs, ignoring padding; optional `recurrent_dropout` reduces overfitting.
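The recursion above, run in both directions and concatenated as $[\mathbf{h}_L^{\rightarrow}; \mathbf{h}_1^{\leftarrow}]$, can be sketched directly in NumPy (dimensions are assumptions; biases are dropped for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_in, d_h = 6, 4, 5
X = rng.normal(size=(L, d_in))   # embedded character sequence

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_pass(seq):
    # Fresh weights per call: forward and backward directions use separate
    # parameters, as in a real bidirectional layer.
    Wz, Wr, Wh = (rng.normal(size=(d_h, d_in)) for _ in range(3))
    Uz, Ur, Uh = (rng.normal(size=(d_h, d_h)) for _ in range(3))
    h = np.zeros(d_h)
    for x in seq:
        z = sigmoid(Wz @ x + Uz @ h)              # update gate z_t
        r = sigmoid(Wr @ x + Ur @ h)              # reset gate r_t
        h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
        h = (1 - z) * h + z * h_tilde             # convex blend old/candidate
    return h

rep = np.concatenate([gru_pass(X), gru_pass(X[::-1])])  # bidirectional feature
```

Because each state is a convex combination of a zero start and tanh candidates, every coordinate of `rep` stays strictly inside $(-1, 1)$.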
## LSTM
- Gates per timestep: $\begin{aligned}
&\mathbf{i}_t = \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i),\\
&\mathbf{f}_t = \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f),\\
&\mathbf{o}_t = \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + \mathbf{b}_o),\\
&\tilde{\mathbf{c}}_t = \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c),\\
&\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t,\\
&\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t).
\end{aligned}$
- Bidirectional stacking concatenates final hidden vectors before classification via softmax.
- Gender prediction rationale: long short-term memory cells model long-range dependencies within names, capturing compound structures common in multilingual gendered naming conventions.
- Implementation notes: `Embedding(mask_zero=True)` and `recurrent_dropout` regularise sequence modeling across padded batches.
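One forward pass of the gate equations above, sketched in NumPy (sizes and zero biases are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
L, d_in, d_h = 6, 4, 5
X = rng.normal(size=(L, d_in))   # embedded character sequence

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

Wi, Wf, Wo, Wc = (rng.normal(size=(d_h, d_in)) for _ in range(4))
Ui, Uf, Uo, Uc = (rng.normal(size=(d_h, d_h)) for _ in range(4))

h, c = np.zeros(d_h), np.zeros(d_h)
for x in X:
    i = sigmoid(Wi @ x + Ui @ h)      # input gate i_t
    f = sigmoid(Wf @ x + Uf @ h)      # forget gate f_t
    o = sigmoid(Wo @ x + Uo @ h)      # output gate o_t
    c_tilde = np.tanh(Wc @ x + Uc @ h)
    c = f * c + i * c_tilde           # cell state update
    h = o * np.tanh(c)                # hidden state
```

The additive cell update `c = f * c + i * c_tilde` is what lets gradients flow across long names without vanishing.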
## Transformer Encoder (Single Block)
- Input embeddings $X \in \mathbb{R}^{L \times d}$ plus positional embeddings $P$ produce $Z^{(0)} = X + P$.
- Multi-head self-attention: $\operatorname{MHAttn}(Z) = \operatorname{Concat}(\text{head}_1, \dots, \text{head}_H) W^O$ where $\text{head}_h = \operatorname{softmax}\big(\frac{Q_h K_h^\top}{\sqrt{d_k}}\big) V_h$ and $(Q_h, K_h, V_h) = (Z W_h^Q, Z W_h^K, Z W_h^V)$.
- Add & norm: $Z^{(1)} = \operatorname{LayerNorm}(Z^{(0)} + \operatorname{Dropout}(\operatorname{MHAttn}(Z^{(0)})))$.
- Position-wise feed-forward: $Z^{(2)} = \operatorname{LayerNorm}(Z^{(1)} + \operatorname{Dropout}(\phi(Z^{(1)} W_1 + b_1) W_2 + b_2))$, with activation $\phi(\cdot)$ (ReLU).
- Sequence pooling (global average) feeds dense layers and softmax classifier.
- Gender prediction rationale: self-attention captures global dependencies and shared subword units across names, outperforming recurrent models when sufficient labelled data is available; with smaller datasets, the risk of overfitting should be monitored.
- Implementation notes: `Embedding(mask_zero=True)` with learned positional embeddings; attention dropout (`attn_dropout`) and classifier dropout improve generalisation.
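The attention head can be sketched in NumPy; a single unmasked head with random weights is a simplifying assumption relative to the multi-head block above:

```python
import numpy as np

rng = np.random.default_rng(2)
L, d, d_k = 5, 8, 4
Z = rng.normal(size=(L, d))                        # embeddings + positions
Wq, Wk, Wv = (rng.normal(size=(d, d_k)) for _ in range(3))

Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
scores = Q @ K.T / np.sqrt(d_k)                    # (L, L) similarity matrix
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)                  # row-wise softmax weights
head = A @ V                                       # attended values, (L, d_k)
pooled = head.mean(axis=0)                         # global average pooling
```

Each row of `A` is a probability distribution over all positions, which is how every character can attend to every other regardless of distance.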
## Ensemble (Soft Voting)
- Base learners indexed by $j$ output probability vectors $\mathbf{p}_j(\mathbf{x})$.
- Aggregated prediction with weights $w_j$: $p(y=c\mid \mathbf{x}) = \frac{1}{\sum_j w_j} \sum_j w_j \, p_j(y=c\mid \mathbf{x})$.
- Hard voting variant predicts $\hat{y} = \operatorname{mode}\{ \hat{y}_j \}$, where $\hat{y}_j = \arg\max_c p_j(y=c\mid \mathbf{x})$.
- Gender prediction rationale: blends complementary inductive biases (linear, tree-based, neural) to reduce variance on ambiguous names; remains suitable provided individual members are well-calibrated.
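The weighted soft vote reduces to a normalised average of probability vectors; the three base-learner outputs and weights below are made-up illustrative values:

```python
import numpy as np

P = np.array([
    [0.70, 0.30],   # e.g. a linear model's class probabilities
    [0.60, 0.40],   # e.g. a tree ensemble
    [0.45, 0.55],   # e.g. a neural model
])
w = np.array([1.0, 2.0, 1.0])   # hypothetical per-model weights

p = (w[:, None] * P).sum(axis=0) / w.sum()   # normalised weighted average
y_hat = int(p.argmax())                      # soft-voting prediction
```

Note the third model alone would predict class 1, but the weighted blend still settles on class 0, illustrating how soft voting smooths out an outlier member.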
Executable
+90
View File
@@ -0,0 +1,90 @@
#!.venv/bin/python3
import argparse
import sys
import traceback
from pathlib import Path
from core.config import setup_config
from processing.monitoring.pipeline_monitor import PipelineMonitor
def main():
choices = [
"data_cleaning",
"data_selection",
"feature_extraction",
"ner_annotation",
"llm_annotation",
"data_splitting",
]
parser = argparse.ArgumentParser(description="DRC NERS Processing Monitoring")
parser.add_argument("--config", type=Path, help="Path to configuration file")
parser.add_argument("--env", type=str, default="development", help="Environment")
subparsers = parser.add_subparsers(dest="command", help="Available commands")
# Clean command
clean_parser = subparsers.add_parser("clean", help="Clean checkpoint files")
clean_parser.add_argument("--step", type=str, choices=choices, help="default: all")
clean_parser.add_argument("--keep-last", type=int, default=1, help="(default: 1)")
clean_parser.add_argument("--force", action="store_true", help="Clean without confirmation")
# Reset command
reset_parser = subparsers.add_parser("reset", help="Reset pipeline step")
reset_parser.add_argument("--step", type=str, choices=choices, help="(default: all)")
reset_parser.add_argument("--all", action="store_true", help="Reset all steps")
reset_parser.add_argument("--force", action="store_true", help="Reset without confirmation")
args = parser.parse_args()
try:
setup_config(config_path=args.config, env=args.env)
monitor = PipelineMonitor()
if not args.command:
parser.print_help()
monitor.print_status(detailed=True)
return 1
elif args.command == "clean":
checkpoint_info = monitor.count_checkpoint_files()
print(f"Current checkpoint storage: {checkpoint_info['total_size_mb']:.1f} MB")
if not args.force:
response = input("Are you sure you want to clean checkpoints? (y/N): ")
if response.lower() != "y":
print("Cancelled")
return 0
if args.step:
monitor.clean_step_checkpoints(args.step, args.keep_last)
else:
for step in monitor.steps:
monitor.clean_step_checkpoints(step, args.keep_last)
print("Checkpoint cleaning completed")
elif args.command == "reset":
if not args.force:
response = input(
f"Are you sure you want to reset {args.step}? This will delete all checkpoints. (y/N): "
)
if response.lower() != "y":
print("Cancelled")
return 0
if args.step:
monitor.reset_step(args.step)
else:
for step in monitor.steps:
monitor.reset_step(step)
print("Reset completed")
except Exception as e:
print(f"Monitoring failed: {e}")
traceback.print_exc()
return 1
if __name__ == "__main__":
sys.exit(main())
+25 -10
View File
@@ -1,24 +1,29 @@
#!/usr/bin/env python3
import argparse
import logging
import os
import sys
import traceback
from pathlib import Path
from ners.core.config import PipelineConfig
from ners.processing.ner.name_builder import NameBuilder
from ners.processing.ner.name_engineering import NameEngineering
from ners.processing.ner.name_model import NameModel
from core.config import setup_config, PipelineConfig
from processing.ner.name_builder import NameBuilder
from processing.ner.name_engineering import NameEngineering
from processing.ner.name_model import NameModel
def feature(config: PipelineConfig):
"""Apply feature engineering to create position-independent NER dataset."""
NameEngineering(config).compute()
def build(config: PipelineConfig):
"""Build NER dataset using NERDataBuilder."""
NameBuilder(config).build()
def train(config: PipelineConfig):
"""Train the NER model."""
name_model = NameModel(config)
data_path = Path(config.paths.data_dir) / config.data.output_files["ner_data"]
@@ -32,9 +37,7 @@ def train(config: PipelineConfig):
split_idx = int(len(data) * 0.9)
train_data, eval_data = data[:split_idx], data[split_idx:]
logging.info(
f"Training with {len(train_data)} examples, evaluating on {len(eval_data)}"
)
logging.info(f"Training with {len(train_data)} examples, evaluating on {len(eval_data)}")
name_model.train(
data=train_data,
epochs=config.processing.epochs,
@@ -72,9 +75,21 @@ def run_pipeline(config: PipelineConfig, reset: bool = False):
def main():
parser = argparse.ArgumentParser(description="NER model management for DRC names")
parser.add_argument("--config", type=str, help="Path to configuration file")
parser.add_argument("--env", type=str, default="development", help="Environment name")
parser.add_argument("--reset", action="store_true", help="Reset all steps")
args = parser.parse_args()
try:
logging.error("This module is no longer a CLI. Use 'ners ner ...' instead.")
return 1
except Exception:
config = setup_config(config_path=args.config, env=args.env)
return run_pipeline(config, args.reset)
except Exception as e:
print(f"Pipeline failed: {e}")
traceback.print_exc()
return 1
if __name__ == "__main__":
sys.exit(main())
+499
View File
File diff suppressed because one or more lines are too long
+2957
View File
File diff suppressed because one or more lines are too long
+5746
View File
File diff suppressed because one or more lines are too long
+832
View File
File diff suppressed because one or more lines are too long
+107
View File
@@ -0,0 +1,107 @@
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "# Qualitative Analysis",
"id": "d20715dd63f57364"
},
{
"cell_type": "code",
"id": "c93a55c8",
"metadata": {
"ExecuteTime": {
"end_time": "2025-09-21T13:34:50.973298Z",
"start_time": "2025-09-21T13:34:50.969142Z"
}
},
"source": [
"import pandas as pd\n",
"import geopandas as gpd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import sys\n",
"import os\n",
"\n",
"sys.path.append(os.path.abspath(\"..\"))\n",
"from core.utils.data_loader import DataLoader\n",
"from core.config.pipeline_config import PipelineConfig"
],
"outputs": [],
"execution_count": 3
},
{
"cell_type": "code",
"id": "c0b00261",
"metadata": {
"ExecuteTime": {
"end_time": "2025-09-21T13:34:51.002610Z",
"start_time": "2025-09-21T13:34:50.998586Z"
}
},
"source": [
"config = PipelineConfig(\n",
" paths={\n",
" \"root_dir\": \"../data\",\n",
" \"data_dir\": \"../data/dataset\",\n",
" \"models_dir\": \"../models\",\n",
" \"outputs_dir\": \"../data/processed\",\n",
" \"logs_dir\": \"../logs\",\n",
" \"configs_dir\": \"../configs\",\n",
" \"checkpoints_dir\": \"../checkpoints\"\n",
" }\n",
")\n",
"\n",
"loader = DataLoader(config)"
],
"outputs": [],
"execution_count": 4
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-09-21T13:35:27.430639Z",
"start_time": "2025-09-21T13:34:51.013412Z"
}
},
"cell_type": "code",
"outputs": [],
"execution_count": 5,
"source": [
"gdf = gpd.read_file(\"../osm/provinces.shp\")\n",
"gdf_proj = gdf.to_crs(epsg=32732)\n",
"gdf['centroid'] = gdf_proj.geometry.centroid.to_crs(gdf.crs)\n",
"\n",
"df = loader.load_csv_complete(config.paths.data_dir / \"names_featured.csv\")"
],
"id": "b38394ce38864379"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Exploration",
"id": "a1af5626d2a948d6"
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
+107
View File
@@ -0,0 +1,107 @@
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "# Quantitative Analysis",
"id": "a605c0f92056a825"
},
{
"cell_type": "code",
"id": "c93a55c8",
"metadata": {
"ExecuteTime": {
"end_time": "2025-09-21T14:14:47.287549Z",
"start_time": "2025-09-21T14:14:47.279199Z"
}
},
"source": [
"import pandas as pd\n",
"import geopandas as gpd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import sys\n",
"import os\n",
"\n",
"sys.path.append(os.path.abspath(\"..\"))\n",
"from core.utils.data_loader import DataLoader\n",
"from core.config.pipeline_config import PipelineConfig"
],
"outputs": [],
"execution_count": 30
},
{
"cell_type": "code",
"id": "c0b00261",
"metadata": {
"ExecuteTime": {
"end_time": "2025-09-21T14:14:47.315980Z",
"start_time": "2025-09-21T14:14:47.308376Z"
}
},
"source": [
"config = PipelineConfig(\n",
" paths={\n",
" \"root_dir\": \"../data\",\n",
" \"data_dir\": \"../data/dataset\",\n",
" \"models_dir\": \"../models\",\n",
" \"outputs_dir\": \"../data/processed\",\n",
" \"logs_dir\": \"../logs\",\n",
" \"configs_dir\": \"../configs\",\n",
" \"checkpoints_dir\": \"../checkpoints\"\n",
" }\n",
")\n",
"\n",
"loader = DataLoader(config)"
],
"outputs": [],
"execution_count": 31
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-09-21T14:15:47.899044Z",
"start_time": "2025-09-21T14:14:47.339266Z"
}
},
"cell_type": "code",
"source": [
"gdf = gpd.read_file(\"../osm/provinces.shp\")\n",
"gdf_proj = gdf.to_crs(epsg=32732)\n",
"gdf['centroid'] = gdf_proj.geometry.centroid.to_crs(gdf.crs)\n",
"\n",
"df = loader.load_csv_complete(config.paths.data_dir / \"names_featured.csv\")"
],
"id": "b38394ce38864379",
"outputs": [],
"execution_count": 32
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Exploration",
"id": "a1af5626d2a948d6"
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
@@ -8,6 +8,4 @@ class BatchConfig:
batch_size: int = 1000
max_workers: int = 4
checkpoint_interval: int = 5 # Save checkpoint every N batches
use_multiprocessing: bool = (
False # Use ProcessPoolExecutor instead of ThreadPoolExecutor
)
use_multiprocessing: bool = False # Use ProcessPoolExecutor instead of ThreadPoolExecutor
@@ -4,9 +4,9 @@ from typing import Iterator
import pandas as pd
from ners.processing.batch.batch_config import BatchConfig
from ners.processing.batch.memory_monitor import MemoryMonitor
from ners.processing.steps import PipelineStep
from processing.batch.batch_config import BatchConfig
from processing.batch.memory_monitor import MemoryMonitor
from processing.steps import PipelineStep
class BatchProcessor:
@@ -33,9 +33,7 @@ class BatchProcessor:
for batch_num, (batch, batch_id) in enumerate(self.create_batches(df)):
if step.batch_exists(batch_id):
logging.info(
f"Batch {batch_id} already processed, loading from checkpoint"
)
logging.info(f"Batch {batch_id} already processed, loading from checkpoint")
processed_batch = step.load_batch(batch_id)
else:
try:
@@ -82,9 +80,7 @@ class BatchProcessor:
def process_concurrent(self, step: PipelineStep, df: pd.DataFrame) -> pd.DataFrame:
"""Memory-optimized concurrent processing"""
executor_class = (
ProcessPoolExecutor
if self.config.use_multiprocessing
else ThreadPoolExecutor
ProcessPoolExecutor if self.config.use_multiprocessing else ThreadPoolExecutor
)
results = {}
@@ -93,9 +89,7 @@ class BatchProcessor:
future_to_batch = {}
for batch, batch_id in self.create_batches(df):
if step.batch_exists(batch_id):
logging.info(
f"Batch {batch_id} already processed, loading from checkpoint"
)
logging.info(f"Batch {batch_id} already processed, loading from checkpoint")
results[batch_id] = step.load_batch(batch_id)
else:
# Only copy if necessary for concurrent processing
@@ -127,9 +121,7 @@ class BatchProcessor:
del results
self.memory_monitor.cleanup_memory()
result = (
self._safe_concat(ordered_results) if ordered_results else pd.DataFrame()
)
result = self._safe_concat(ordered_results) if ordered_results else pd.DataFrame()
# Final cleanup
del ordered_results
@@ -139,9 +131,7 @@ class BatchProcessor:
def process(self, step: PipelineStep, df: pd.DataFrame) -> pd.DataFrame:
"""Process data using the configured strategy"""
step.state.total_batches = (
len(df) + self.config.batch_size - 1
) // self.config.batch_size
step.state.total_batches = (len(df) + self.config.batch_size - 1) // self.config.batch_size
step.load_state()
logging.info(f"Starting {step.name} with {step.state.total_batches} batches")
@@ -4,8 +4,8 @@ import shutil
from datetime import datetime
from typing import Optional, Dict
from ners.core.config.config_manager import ConfigManager
from ners.core.config.project_paths import ProjectPaths
from core.config.config_manager import ConfigManager
from core.config.project_paths import ProjectPaths
class PipelineMonitor:
@@ -97,10 +97,7 @@ class PipelineMonitor:
avg_completion = total_completion / len(self.steps)
if avg_completion >= 100 and overall_status not in [
"error",
"completed_with_errors",
]:
if avg_completion >= 100 and overall_status not in ["error", "completed_with_errors"]:
overall_status = "completed"
return {
@@ -124,9 +121,7 @@ class PipelineMonitor:
print(f"{step_name.replace('_', ' ').title()}:")
print(f" Status: {step_status['status']}")
print(f" Progress: {step_status['completion_percentage']:.1f}%")
print(
f" Batches: {step_status['processed_batches']}/{step_status['total_batches']}"
)
print(f" Batches: {step_status['processed_batches']}/{step_status['total_batches']}")
if step_status["failed_batches"] > 0:
print(f" Failed Batches: {step_status['failed_batches']}")
@@ -146,10 +141,7 @@ class PipelineMonitor:
if step_dir.exists():
csv_files = list(step_dir.glob("*.csv"))
step_size = sum(f.stat().st_size for f in csv_files)
counts[step] = {
"files": len(csv_files),
"size_mb": step_size / (1024 * 1024),
}
counts[step] = {"files": len(csv_files), "size_mb": step_size / (1024 * 1024)}
total_size += step_size
else:
counts[step] = {"files": 0, "size_mb": 0}
@@ -168,9 +160,7 @@ class PipelineMonitor:
csv_files = sorted(step_dir.glob("batch_*.csv"))
if len(csv_files) <= keep_last:
logging.info(
f"Only {len(csv_files)} checkpoint files for {step_name}, keeping all"
)
logging.info(f"Only {len(csv_files)} checkpoint files for {step_name}, keeping all")
return
files_to_delete = csv_files[:-keep_last] if keep_last > 0 else csv_files
@@ -3,7 +3,7 @@ from typing import List, Tuple, Dict
import pandas as pd
from ners.processing.steps.feature_extraction_step import NameCategory
from processing.steps.feature_extraction_step import NameCategory
class BaseNameFormatter(ABC):
@@ -12,9 +12,7 @@ class BaseNameFormatter(ABC):
Contains common logic for NER tagging and attribute computation.
"""
def __init__(
self, connectors: List[str] = None, additional_surnames: List[str] = None
):
def __init__(self, connectors: List[str] = None, additional_surnames: List[str] = None):
self.connectors = connectors or ["wa", "ya", "ka", "ba"]
self.additional_surnames = additional_surnames or [
"jean",
@@ -48,9 +46,7 @@ class BaseNameFormatter(ABC):
end_pos = current_pos + len(word)
# Determine tag based on word content
if word in native_parts or any(
connector in word for connector in self.connectors
):
if word in native_parts or any(connector in word for connector in self.connectors):
tag = "NATIVE"
elif word == surname or word in self.additional_surnames:
tag = "SURNAME"
@@ -76,9 +72,7 @@ class BaseNameFormatter(ABC):
"words": words_count,
"length": length,
"identified_category": (
NameCategory.SIMPLE.value
if words_count == 3
else NameCategory.COMPOSE.value
NameCategory.SIMPLE.value if words_count == 3 else NameCategory.COMPOSE.value
),
}
@@ -3,7 +3,7 @@ from typing import Dict
import pandas as pd
from ners.processing.ner.formats import BaseNameFormatter
from processing.ner.formats import BaseNameFormatter
class ConnectorFormatter(BaseNameFormatter):
@@ -3,15 +3,13 @@ from typing import Dict
import pandas as pd
from ners.processing.ner.formats import BaseNameFormatter
from processing.ner.formats import BaseNameFormatter
class ExtendedSurnameFormatter(BaseNameFormatter):
def transform(self, row: pd.Series) -> Dict:
native_parts = self.parse_native_components(row["probable_native"])
original_surname = (
row["probable_surname"] if pd.notna(row["probable_surname"]) else ""
)
original_surname = row["probable_surname"] if pd.notna(row["probable_surname"]) else ""
# Add random additional surname
additional_surname = random.choice(self.additional_surnames)
@@ -24,9 +22,7 @@ class ExtendedSurnameFormatter(BaseNameFormatter):
"identified_name": row["probable_native"],
"probable_surname": combined_surname,
"identified_surname": combined_surname,
"ner_entities": str(
self.create_ner_tags(full_name, native_parts, combined_surname)
),
"ner_entities": str(self.create_ner_tags(full_name, native_parts, combined_surname)),
"transformation_type": self.transformation_type,
**self.compute_numeric_features(full_name),
}
@@ -2,7 +2,7 @@ from typing import Dict
import pandas as pd
from ners.processing.ner.formats import BaseNameFormatter
from processing.ner.formats import BaseNameFormatter
class NativeOnlyFormatter(BaseNameFormatter):
@@ -2,7 +2,7 @@ from typing import Dict
import pandas as pd
from ners.processing.ner.formats import BaseNameFormatter
from processing.ner.formats import BaseNameFormatter
class OriginalFormatter(BaseNameFormatter):
@@ -2,7 +2,7 @@ from typing import Dict
import pandas as pd
from ners.processing.ner.formats import BaseNameFormatter
from processing.ner.formats import BaseNameFormatter
class PositionFlippedFormatter(BaseNameFormatter):
@@ -2,7 +2,7 @@ from typing import Dict
import pandas as pd
from ners.processing.ner.formats import BaseNameFormatter
from processing.ner.formats import BaseNameFormatter
class ReducedNativeFormatter(BaseNameFormatter):
@@ -11,9 +11,7 @@ class ReducedNativeFormatter(BaseNameFormatter):
surname = row["probable_surname"] if pd.notna(row["probable_surname"]) else ""
# Keep only first native component + surname
reduced_native = (
native_parts[0] if len(native_parts) > 1 else row["probable_native"]
)
reduced_native = native_parts[0] if len(native_parts) > 1 else row["probable_native"]
full_name = f"{reduced_native} {surname}".strip()
return {
@@ -22,9 +20,7 @@ class ReducedNativeFormatter(BaseNameFormatter):
"identified_name": reduced_native,
"probable_surname": surname,
"identified_surname": surname,
"ner_entities": str(
self.create_ner_tags(full_name, [reduced_native], surname)
),
"ner_entities": str(self.create_ner_tags(full_name, [reduced_native], surname)),
"transformation_type": self.transformation_type,
**self.compute_numeric_features(full_name),
}
@@ -4,8 +4,8 @@ import logging
import spacy
from spacy.tokens import DocBin
from ners.core.config import PipelineConfig
from ners.core.utils.data_loader import DataLoader
from core.config import PipelineConfig
from core.utils.data_loader import DataLoader
from .name_tagger import NameTagger
@@ -20,9 +20,7 @@ class NameBuilder:
self.tagger = NameTagger()
def build(self) -> int:
filepath = self.config.paths.get_data_path(
self.config.data.output_files["engineered"]
)
filepath = self.config.paths.get_data_path(self.config.data.output_files["engineered"])
df = self.data_loader.load_csv_complete(filepath)
df = df[["name", "ner_tagged", "ner_entities"]]
@@ -40,9 +38,7 @@ class NameBuilder:
# Use NERNameTagger for parsing and validation
parsed_entities = self.tagger.parse_entities(ner_df["ner_entities"])
validated_entities = self.tagger.validate_entities(
ner_df["name"], parsed_entities
)
validated_entities = self.tagger.validate_entities(ner_df["name"], parsed_entities)
# Drop rows with no valid entities
mask = validated_entities.map(bool)
@@ -55,33 +51,22 @@ class NameBuilder:
# Prepare training data
training_data = list(
zip(
ner_df["name"].tolist(),
[{"entities": ents} for ents in validated_entities],
)
zip(ner_df["name"].tolist(), [{"entities": ents} for ents in validated_entities])
)
# Use NERNameTagger to create spaCy DocBin
docs = self.tagger.create_docs(
nlp, ner_df["name"].tolist(), validated_entities.tolist()
)
docs = self.tagger.create_docs(nlp, ner_df["name"].tolist(), validated_entities.tolist())
doc_bin = DocBin(docs=docs)
# Save
json_path = self.config.paths.get_data_path(
self.config.data.output_files["ner_data"]
)
spacy_path = self.config.paths.get_data_path(
self.config.data.output_files["ner_spacy"]
)
json_path = self.config.paths.get_data_path(self.config.data.output_files["ner_data"])
spacy_path = self.config.paths.get_data_path(self.config.data.output_files["ner_spacy"])
with open(json_path, "w", encoding="utf-8") as f:
json.dump(training_data, f, ensure_ascii=False, separators=(",", ":"))
doc_bin.to_disk(spacy_path)
logging.info(
f"Processed: {len(training_data)}, Skipped: {total_rows - len(training_data)}"
)
logging.info(f"Processed: {len(training_data)}, Skipped: {total_rows - len(training_data)}")
logging.info(f"Saved NER JSON to {json_path}")
logging.info(f"Saved NER spacy to {spacy_path}")
return 0
@@ -6,14 +6,14 @@ import numpy as np
import pandas as pd
from tqdm import tqdm
from ners.core.config import PipelineConfig
from ners.core.utils.data_loader import DataLoader
from ners.processing.ner.formats.connectors_format import ConnectorFormatter
from ners.processing.ner.formats.extended_surname_format import ExtendedSurnameFormatter
from ners.processing.ner.formats.native_only_format import NativeOnlyFormatter
from ners.processing.ner.formats.original_format import OriginalFormatter
from ners.processing.ner.formats.position_flipped_format import PositionFlippedFormatter
from ners.processing.ner.formats.reduced_native_format import ReducedNativeFormatter
from core.config import PipelineConfig
from core.utils.data_loader import DataLoader
from processing.ner.formats.connectors_format import ConnectorFormatter
from processing.ner.formats.extended_surname_format import ExtendedSurnameFormatter
from processing.ner.formats.native_only_format import NativeOnlyFormatter
from processing.ner.formats.original_format import OriginalFormatter
from processing.ner.formats.position_flipped_format import PositionFlippedFormatter
from processing.ner.formats.reduced_native_format import ReducedNativeFormatter
class NameEngineering:
@@ -44,60 +44,42 @@ class NameEngineering:
# Initialize format classes
self.formatters = {
"original": OriginalFormatter(self.connectors, self.additional_surnames),
"native_only": NativeOnlyFormatter(
self.connectors, self.additional_surnames
),
"position_flipped": PositionFlippedFormatter(
self.connectors, self.additional_surnames
),
"reduced_native": ReducedNativeFormatter(
self.connectors, self.additional_surnames
),
"connector_added": ConnectorFormatter(
self.connectors, self.additional_surnames
),
"extended_surname": ExtendedSurnameFormatter(
self.connectors, self.additional_surnames
),
"native_only": NativeOnlyFormatter(self.connectors, self.additional_surnames),
"position_flipped": PositionFlippedFormatter(self.connectors, self.additional_surnames),
"reduced_native": ReducedNativeFormatter(self.connectors, self.additional_surnames),
"connector_added": ConnectorFormatter(self.connectors, self.additional_surnames),
"extended_surname": ExtendedSurnameFormatter(self.connectors, self.additional_surnames),
}
def load_data(self) -> pd.DataFrame:
"""Load and filter NER-tagged data from CSV file"""
filepath = self.config.paths.get_data_path(
self.config.data.output_files["featured"]
)
filepath = self.config.paths.get_data_path(self.config.data.output_files["featured"])
df = self.data_loader.load_csv_complete(filepath)
# Filter only NER-tagged rows
ner_data = df[df["ner_tagged"] == 1].copy()
logging.info(
f"Loaded {len(ner_data)} NER-tagged records from {len(df)} total records"
)
logging.info(f"Loaded {len(ner_data)} NER-tagged records from {len(df)} total records")
return ner_data
def compute(self) -> None:
logging.info("Applying feature engineering transformations...")
input_filepath = self.config.paths.get_data_path(
self.config.data.output_files["featured"]
)
input_filepath = self.config.paths.get_data_path(self.config.data.output_files["featured"])
output_filepath = self.config.paths.get_data_path(
self.config.data.output_files["engineered"]
)
df = self.data_loader.load_csv_complete(input_filepath)
ner_df = df[df["ner_tagged"] == 1].copy()
logging.info(
f"Loaded {len(ner_df)} NER-tagged records from {len(df)} total records"
)
logging.info(f"Loaded {len(ner_df)} NER-tagged records from {len(df)} total records")
del df # No need to keep in memory
gc.collect()
ner_df = ner_df.sample(
frac=1, random_state=self.config.data.random_seed
).reset_index(drop=True)
ner_df = ner_df.sample(frac=1, random_state=self.config.data.random_seed).reset_index(
drop=True
)
total_rows = len(ner_df)
# Calculate split points
@@ -112,11 +94,7 @@ class NameEngineering:
(0, split_25_1, "original"), # First 25%: original format
(split_25_1, split_25_2, "native_only"), # Second 25%: remove surname
(split_25_2, split_25_3, "position_flipped"), # Third 25%: flip positions
(
split_25_3,
split_10_1,
"reduced_native",
), # Fourth 10%: reduce native components
(split_25_3, split_10_1, "reduced_native"), # Fourth 10%: reduce native components
(split_10_1, split_10_2, "connector_added"), # Fifth 10%: add connectors
(split_10_2, total_rows, "extended_surname"), # Last 5%: extend surnames
]
@@ -11,7 +11,7 @@ from spacy.training import Example
from spacy.util import minibatch
from tqdm import tqdm
from ners.core.config.pipeline_config import PipelineConfig
from core.config.pipeline_config import PipelineConfig
class NameModel:
@@ -29,15 +29,6 @@ class NameModel:
"""Create a blank spaCy model with NER pipeline"""
logging.info(f"Creating blank {language} model for NER training")
# Prefer GPU for spaCy if available (falls back to CPU automatically)
try:
if spacy.prefer_gpu():
logging.info("spaCy GPU enabled (cupy) for NER training")
else:
logging.info("spaCy running on CPU")
except Exception as e:
logging.debug(f"spaCy GPU selection skipped: {e}")
# Create blank model - French tokenizer works well for DRC names
self.nlp = spacy.blank(language)
@@ -87,9 +78,7 @@ class NameModel:
# Handle different annotation formats from NERNameTagger
if not isinstance(annotations, dict) or "entities" not in annotations:
logging.warning(
f"Skipping invalid annotations at index {i}: {annotations}"
)
logging.warning(f"Skipping invalid annotations at index {i}: {annotations}")
skipped_count += 1
continue
@@ -126,9 +115,7 @@ class NameModel:
valid_entities = []
for entity in entities:
if not isinstance(entity, (list, tuple)) or len(entity) != 3:
logging.warning(
f"Skipping invalid entity format in '{text}': {entity}"
)
logging.warning(f"Skipping invalid entity format in '{text}': {entity}")
continue
start, end, label = entity
@@ -142,30 +129,21 @@ class NameModel:
or start < 0
or end > len(text)
):
logging.warning(
f"Skipping invalid entity bounds in '{text}': {entity}"
)
logging.warning(f"Skipping invalid entity bounds in '{text}': {entity}")
continue
# Check for overlaps with already validated entities
has_overlap = any(
start < v_end and end > v_start
for v_start, v_end, _ in valid_entities
start < v_end and end > v_start for v_start, v_end, _ in valid_entities
)
if has_overlap:
logging.warning(
f"Skipping overlapping entity in '{text}': {entity}"
)
logging.warning(f"Skipping overlapping entity in '{text}': {entity}")
continue
# Validate that the span doesn't contain spaces (matching tagger validation)
span_text = text[start:end]
if (
not span_text
or span_text != span_text.strip()
or " " in span_text
):
if not span_text or span_text != span_text.strip() or " " in span_text:
logging.warning(
f"Skipping entity with spaces in '{text}': {entity} -> '{span_text}'"
)
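The validation chain in the hunk above (shape check, bounds check, overlap check, no internal spaces) can be exercised in isolation. A stand-alone sketch; `validate_spans` is an illustrative name, not a function from the codebase:

```python
def validate_spans(text, entities):
    """Keep only (start, end, label) triples that are well-formed spans of text."""
    valid = []
    for entity in entities:
        # Shape: must be a 3-item list or tuple
        if not isinstance(entity, (list, tuple)) or len(entity) != 3:
            continue
        start, end, label = entity
        # Bounds: start/end must index a non-empty slice of text
        if (not isinstance(start, int) or not isinstance(end, int)
                or start >= end or start < 0 or end > len(text)):
            continue
        # Overlap: half-open intervals [a, b) and [c, d) overlap iff a < d and b > c
        if any(start < v_end and end > v_start for v_start, v_end, _ in valid):
            continue
        # Span content: no leading/trailing whitespace and no internal spaces
        span_text = text[start:end]
        if not span_text or span_text != span_text.strip() or " " in span_text:
            continue
        valid.append((start, end, label))
    return valid
```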
@@ -174,9 +152,7 @@ class NameModel:
valid_entities.append((start, end, label))
if not valid_entities:
logging.warning(
f"Skipping training example with no valid entities: '{text}'"
)
logging.warning(f"Skipping training example with no valid entities: '{text}'")
skipped_count += 1
continue
@@ -234,9 +210,7 @@ class NameModel:
batches = minibatch(examples, size=batch_size)
for batch in batches:
batch_losses = {}
self.nlp.update(
batch, losses=batch_losses, drop=dropout_rate, sgd=optimizer
)
self.nlp.update(batch, losses=batch_losses, drop=dropout_rate, sgd=optimizer)
logging.info(
f"Training batch with {len(batch)} examples, current losses: {batch_losses}"
)
@@ -247,7 +221,7 @@ class NameModel:
del batches # free memory
losses_history.append(losses.get("ner", 0))
logging.info(f"Epoch {epoch + 1}/{epochs}, Total Loss: {losses['ner']:.4f}")
logging.info(f"Epoch {epoch+1}/{epochs}, Total Loss: {losses['ner']:.4f}")
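The `spacy.util.minibatch` helper used in the training loop above behaves roughly like this stdlib-only sketch (with an integer `size`, it yields successive fixed-size lists, plus a final short one):

```python
def minibatch(items, size):
    """Yield successive batches of up to `size` items, like spacy.util.minibatch."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # trailing partial batch
        yield batch
```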
# Store training statistics
self.training_stats = {
@@ -259,9 +233,7 @@ class NameModel:
"dropout_rate": dropout_rate,
}
logging.info(
f"Training completed. Final loss: {self.training_stats['final_loss']:.4f}"
)
logging.info(f"Training completed. Final loss: {self.training_stats['final_loss']:.4f}")
def evaluate(self, test_data: List[Tuple[str, Dict]]) -> Dict[str, Any]:
"""Evaluate the trained model on test data"""
@@ -310,14 +282,10 @@ class NameModel:
entity_stats[label]["fp"] += 1
# Calculate overall metrics
precision = (
correct_entities / predicted_entities if predicted_entities > 0 else 0
)
precision = correct_entities / predicted_entities if predicted_entities > 0 else 0
recall = correct_entities / actual_entities if actual_entities > 0 else 0
f1_score = (
2 * (precision * recall) / (precision + recall)
if (precision + recall) > 0
else 0
2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
)
# Calculate per-label metrics
@@ -327,11 +295,7 @@ class NameModel:
label_precision = tp / (tp + fp) if (tp + fp) > 0 else 0
label_recall = tp / (tp + fn) if (tp + fn) > 0 else 0
label_f1 = (
(
2
* (label_precision * label_recall)
/ (label_precision + label_recall)
)
(2 * (label_precision * label_recall) / (label_precision + label_recall))
if (label_precision + label_recall) > 0
else 0
)
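The metric arithmetic above, both overall and per-label, reduces to one zero-division-safe helper. The function name is illustrative:

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    # Harmonic mean of precision and recall, guarded against 0/0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0)
    return precision, recall, f1
```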
@@ -421,9 +385,7 @@ class NameModel:
"label": ent.label_,
"start": ent.start_char,
"end": ent.end_char,
"confidence": getattr(
ent, "score", None
), # If confidence scores are available
"confidence": getattr(ent, "score", None), # If confidence scores are available
}
)
@@ -48,9 +48,7 @@ class NameTagger:
# Find the first occurrence of this native word that doesn't overlap
start_pos = 0
while True:
pos = name_lower.find(
native_word_lower, start_pos
) # Case-insensitive search
pos = name_lower.find(native_word_lower, start_pos) # Case-insensitive search
if pos == -1:
break
@@ -80,9 +78,7 @@ class NameTagger:
# Find the first occurrence that doesn't overlap
start_pos = 0
while True:
pos = name_lower.find(
surname_lower, start_pos
) # Case-insensitive search
pos = name_lower.find(surname_lower, start_pos) # Case-insensitive search
if pos == -1:
break
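Both search loops above follow the same pattern: scan case-insensitively for the first occurrence of a word that does not overlap a span already claimed. A hedged stand-alone sketch (the function name and `taken` parameter are illustrative):

```python
def find_non_overlapping(name, word, taken):
    """First case-insensitive occurrence of word in name avoiding spans in taken."""
    name_lower, word_lower = name.lower(), word.lower()
    start_pos = 0
    while True:
        pos = name_lower.find(word_lower, start_pos)  # case-insensitive search
        if pos == -1:
            return None  # no non-overlapping occurrence left
        end = pos + len(word)
        # Accept unless the candidate overlaps any already-taken span
        if not any(pos < t_end and end > t_start for t_start, t_end in taken):
            return (pos, end)
        start_pos = pos + 1  # skip past this occurrence and keep scanning
```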
@@ -124,13 +120,8 @@ class NameTagger:
continue
# Check for overlaps with already validated entities
if any(
start < v_end and end > v_start
for v_start, v_end, _ in validated_entities
):
logging.warning(
f"Overlapping span ({start}, {end}, '{label}') in '{name}'"
)
if any(start < v_end and end > v_start for v_start, v_end, _ in validated_entities):
logging.warning(f"Overlapping span ({start}, {end}, '{label}') in '{name}'")
continue
# CRITICAL VALIDATION: Check that the span contains only the expected word (no spaces)
@@ -209,16 +200,10 @@ class NameTagger:
elif entities_str.startswith("[[") and entities_str.endswith("]]"):
return [tuple(e) for e in ast.literal_eval(entities_str)]
elif entities_str.startswith("[{") and entities_str.endswith("}]"):
return [
(e["start"], e["end"], e["label"]) for e in json.loads(entities_str)
]
return [(e["start"], e["end"], e["label"]) for e in json.loads(entities_str)]
else:
parsed = ast.literal_eval(entities_str)
return [
tuple(e)
for e in parsed
if isinstance(e, (list, tuple)) and len(e) == 3
]
return [tuple(e) for e in parsed if isinstance(e, (list, tuple)) and len(e) == 3]
except (ValueError, SyntaxError, json.JSONDecodeError):
return []
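The tolerant parsing above accepts several serialized shapes, falling back to an empty list on any parse failure. A self-contained sketch that collapses the tuple-list and list-list branches into one generic `ast.literal_eval` path (a simplification, not the exact dispatch in the diff):

```python
import ast
import json

def parse_entities(entities_str):
    """Parse a serialized entity list into (start, end, label) tuples."""
    try:
        if entities_str.startswith("[{") and entities_str.endswith("}]"):
            # JSON list of dicts: [{"start": ..., "end": ..., "label": ...}]
            return [(e["start"], e["end"], e["label"]) for e in json.loads(entities_str)]
        # Python-literal list of tuples or lists
        parsed = ast.literal_eval(entities_str)
        return [tuple(e) for e in parsed if isinstance(e, (list, tuple)) and len(e) == 3]
    except (ValueError, SyntaxError, json.JSONDecodeError, KeyError, TypeError):
        return []  # malformed input degrades to "no entities"
```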
@@ -260,15 +245,13 @@ class NameTagger:
# Remove overlaps
filtered, last_end = [], -1
for s, e, label in valid:
for s, e, l in valid:
if s >= last_end:
filtered.append((s, e, label))
filtered.append((s, e, l))
last_end = e
return filtered
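The overlap-removal loop above is a greedy sweep: a span survives only if it starts at or after the end of the last kept span, which assumes the input is sorted by start offset. Extracted as a stand-alone function:

```python
def drop_overlaps(spans):
    """Greedily drop spans overlapping an earlier kept span (input sorted by start)."""
    filtered, last_end = [], -1
    for s, e, label in spans:
        if s >= last_end:  # no overlap with the previously kept span
            filtered.append((s, e, label))
            last_end = e
    return filtered
```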
def validate_entities(
self, texts: pd.Series, entities_series: pd.Series
) -> pd.Series:
def validate_entities(self, texts: pd.Series, entities_series: pd.Series) -> pd.Series:
"""Vectorized entity validation."""
return pd.Series(map(self.validate, texts, entities_series), index=texts.index)
@@ -4,9 +4,9 @@ from typing import Dict, Any
import pandas as pd
from ners.processing.batch.batch_config import BatchConfig
from ners.processing.batch.batch_processor import BatchProcessor
from ners.processing.steps import PipelineStep
from processing.batch.batch_config import BatchConfig
from processing.batch.batch_processor import BatchProcessor
from processing.steps import PipelineStep
class Pipeline:
@@ -8,9 +8,9 @@ from typing import List, Optional
import pandas as pd
from pydantic import BaseModel
from ners.core.config.pipeline_config import PipelineConfig
from ners.core.utils.data_loader import DataLoader
from ners.processing.batch.batch_config import BatchConfig
from core.config.pipeline_config import PipelineConfig
from core.utils.data_loader import DataLoader
from processing.batch.batch_config import BatchConfig
@dataclass
@@ -19,7 +19,7 @@ class PipelineState:
processed_batches: int = 0
total_batches: int = 0
failed_batches: Optional[List[int]] = None
failed_batches: List[int] = None
last_checkpoint: Optional[str] = None
def __post_init__(self):
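The `Optional[List[int]]` annotation removed in this hunk matters: `None` is the safe dataclass default for a mutable field, with `__post_init__` swapping in a fresh list per instance. A hypothetical minimal version of the `PipelineState` pattern (the `__post_init__` body is cut off by the hunk boundary, so the None-to-list conversion here is an assumption):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PipelineState:
    processed_batches: int = 0
    total_batches: int = 0
    # A mutable default like [] would be shared across instances; use None instead
    failed_batches: Optional[List[int]] = None

    def __post_init__(self):
        if self.failed_batches is None:
            self.failed_batches = []  # fresh list per instance
```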
@@ -38,10 +38,7 @@ class PipelineStep(ABC):
"""Abstract base class for pipeline steps"""
def __init__(
self,
name: str,
pipeline_config: PipelineConfig,
batch_config: Optional[BatchConfig] = None,
self, name: str, pipeline_config: PipelineConfig, batch_config: Optional[BatchConfig] = None
):
self.name = name
self.pipeline_config = pipeline_config
@@ -2,9 +2,9 @@ import logging
import pandas as pd
from ners.core.config.pipeline_config import PipelineConfig
from ners.core.utils.text_cleaner import TextCleaner
from ners.processing.steps import PipelineStep
from core.config.pipeline_config import PipelineConfig
from core.utils.text_cleaner import TextCleaner
from processing.steps import PipelineStep
class DataCleaningStep(PipelineStep):
@@ -2,8 +2,8 @@ import logging
import pandas as pd
from ners.core.config.pipeline_config import PipelineConfig
from ners.processing.steps import PipelineStep
from core.config.pipeline_config import PipelineConfig
from processing.steps import PipelineStep
class DataSelectionStep(PipelineStep):
@@ -20,23 +20,15 @@ class DataSelectionStep(PipelineStep):
# Remove rows where region == "global" only for specific years
if "region" in batch.columns and "year" in batch.columns:
target_years = {2015, 2021, 2022}
mask_remove = batch["region"].str.lower().eq("global") & batch["year"].isin(
list(target_years)
)
mask_remove = batch["region"].str.lower().eq("global") & batch["year"].isin(target_years)
removed = int(mask_remove.sum())
if removed:
batch = batch[~mask_remove]
logging.info(
f"Removed {removed} rows with region == 'global' for years {sorted(target_years)} in batch {batch_id}"
)
logging.info(f"Removed {removed} rows with region == 'global' for years {sorted(target_years)} in batch {batch_id}")
# Check which columns exist in the batch
available_columns = [
col for col in self.selected_columns if col in batch.columns
]
missing_columns = [
col for col in self.selected_columns if col not in batch.columns
]
available_columns = [col for col in self.selected_columns if col in batch.columns]
missing_columns = [col for col in self.selected_columns if col not in batch.columns]
if missing_columns:
logging.warning(f"Missing columns in batch {batch_id}: {missing_columns}")
@@ -1,11 +1,11 @@
import numpy as np
import pandas as pd
from ners.core.config.pipeline_config import PipelineConfig
from ners.core.utils.region_mapper import RegionMapper
from ners.processing.batch.batch_config import BatchConfig
from ners.processing.steps import PipelineStep
from ners.processing.steps.feature_extraction_step import Gender
from core.config.pipeline_config import PipelineConfig
from core.utils.region_mapper import RegionMapper
from processing.batch.batch_config import BatchConfig
from processing.steps import PipelineStep
from processing.steps.feature_extraction_step import Gender
class DataSplittingStep(PipelineStep):
@@ -26,9 +26,7 @@ class DataSplittingStep(PipelineStep):
if self.eval_indices is None:
np.random.seed(self.pipeline_config.data.random_seed)
eval_size = int(total_size * self.pipeline_config.data.evaluation_fraction)
self.eval_indices = set(
np.random.choice(total_size, size=eval_size, replace=False)
)
self.eval_indices = set(np.random.choice(total_size, size=eval_size, replace=False))
return self.eval_indices
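The cached evaluation-split logic above reduces to a pure function: a seeded random sample of row indices, drawn once without replacement so the same rows land in the evaluation set on every run. A sketch with an illustrative function name:

```python
import numpy as np

def make_eval_indices(total_size, evaluation_fraction, random_seed):
    """Deterministic set of row indices reserved for evaluation."""
    np.random.seed(random_seed)
    eval_size = int(total_size * evaluation_fraction)
    # replace=False guarantees unique indices
    return set(np.random.choice(total_size, size=eval_size, replace=False))
```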
def process_batch(self, batch: pd.DataFrame, batch_id: int) -> pd.DataFrame:
@@ -47,9 +45,7 @@ class DataSplittingStep(PipelineStep):
df_evaluation = df[eval_mask]
df_featured = df[~eval_mask]
self.data_loader.save_csv(
df_evaluation, data_dir / output_files["evaluation"]
)
self.data_loader.save_csv(df_evaluation, data_dir / output_files["evaluation"])
self.data_loader.save_csv(df_featured, data_dir / output_files["featured"])
else:
self.data_loader.save_csv(df, data_dir / output_files["featured"])
@@ -57,9 +53,7 @@ class DataSplittingStep(PipelineStep):
if self.pipeline_config.data.split_by_province:
for province in RegionMapper.get_provinces():
df_region = df[df.province == province]
self.data_loader.save_csv(
df_region, data_dir / "provinces" / f"{province}.csv"
)
self.data_loader.save_csv(df_region, data_dir / "provinces" / f"{province}.csv")
if self.pipeline_config.data.split_by_gender:
df_males = df[df.sex == Gender.MALE.value]
@@ -5,10 +5,10 @@ from typing import Dict, Any
import pandas as pd
from ners.core.config.pipeline_config import PipelineConfig
from ners.core.utils.region_mapper import RegionMapper
from ners.processing.ner.name_tagger import NameTagger
from ners.processing.steps import PipelineStep
from core.config.pipeline_config import PipelineConfig
from core.utils.region_mapper import RegionMapper
from processing.ner.name_tagger import NameTagger
from processing.steps import PipelineStep
class Gender(Enum):
@@ -29,8 +29,8 @@ class FeatureExtractionStep(PipelineStep):
self.region_mapper = RegionMapper()
self.name_tagger = NameTagger()
@property
def requires_batch_mutation(self) -> bool:
@classmethod
def requires_batch_mutation(cls) -> bool:
"""This step creates new columns, so mutation is required"""
return True
@@ -64,14 +64,10 @@ class FeatureExtractionStep(PipelineStep):
self._assign_probable_names(result)
self._process_simple_names(result)
result["identified_category"] = self._assign_identified_category(
result["words"]
)
result["identified_category"] = self._assign_identified_category(result["words"])
if "year" in result.columns:
result["year"] = pd.to_numeric(result["year"], errors="coerce").astype(
"Int16"
)
result["year"] = pd.to_numeric(result["year"], errors="coerce").astype("Int16")
if "region" in result.columns:
result["province"] = self.region_mapper.map(result["region"]).str.lower()
@@ -7,12 +7,12 @@ import ollama
import pandas as pd
from pydantic import ValidationError
from ners.core.config.pipeline_config import PipelineConfig
from ners.core.utils.prompt_manager import PromptManager
from ners.core.utils.rate_limiter import RateLimitConfig
from ners.core.utils.rate_limiter import RateLimiter
from ners.processing.batch.batch_config import BatchConfig
from ners.processing.steps import PipelineStep, NameAnnotation
from core.config.pipeline_config import PipelineConfig
from core.utils.prompt_manager import PromptManager
from core.utils.rate_limiter import RateLimitConfig
from core.utils.rate_limiter import RateLimiter
from processing.batch.batch_config import BatchConfig
from processing.steps import PipelineStep, NameAnnotation
class LLMAnnotationStep(PipelineStep):
@@ -24,8 +24,7 @@ class LLMAnnotationStep(PipelineStep):
batch_config = BatchConfig(
batch_size=pipeline_config.processing.batch_size,
max_workers=min(
self.llm_config.max_concurrent_requests,
pipeline_config.processing.max_workers,
self.llm_config.max_concurrent_requests, pipeline_config.processing.max_workers
),
checkpoint_interval=pipeline_config.processing.checkpoint_interval,
use_multiprocessing=pipeline_config.processing.use_multiprocessing,
@@ -34,9 +33,7 @@ class LLMAnnotationStep(PipelineStep):
self.prompt = PromptManager(pipeline_config).load_prompt()
self.rate_limiter = (
self._create_rate_limiter()
if self.llm_config.enable_rate_limiting
else None
self._create_rate_limiter() if self.llm_config.enable_rate_limiting else None
)
# Statistics
@@ -79,9 +76,7 @@ class LLMAnnotationStep(PipelineStep):
f"Request took {elapsed_time:.2f}s, exceeding {self.llm_config.timeout_seconds}s timeout"
)
annotation = NameAnnotation.model_validate_json(
response.message.content
)
annotation = NameAnnotation.model_validate_json(response.message.content)
result = {
**annotation.model_dump(),
"annotated": 1,
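The response handling above validates the LLM's raw JSON straight into a pydantic model, then flattens it into a result dict. A hedged sketch: the field names follow the `NameAnnotation` usage visible elsewhere in this diff, and the model definition here is an assumption, not the project's actual class:

```python
from typing import Optional
from pydantic import BaseModel

class NameAnnotation(BaseModel):
    # Field names taken from the NERAnnotationStep hunk; defaults assumed
    identified_name: Optional[str] = None
    identified_surname: Optional[str] = None

# model_validate_json parses and validates in one step (pydantic v2)
annotation = NameAnnotation.model_validate_json(
    '{"identified_name": "Mwamba", "identified_surname": null}'
)
result = {**annotation.model_dump(), "annotated": 1}
```

A malformed response raises `pydantic.ValidationError`, which the surrounding step can catch to mark the row as unannotated.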
@@ -124,9 +119,7 @@ class LLMAnnotationStep(PipelineStep):
logging.info(f"Batch {batch_id}: No entries to annotate")
return batch
logging.info(
f"Batch {batch_id}: Annotating {len(unannotated_entries)} entries with LLM"
)
logging.info(f"Batch {batch_id}: Annotating {len(unannotated_entries)} entries with LLM")
batch = batch.copy()
client = ollama.Client()
@@ -5,9 +5,9 @@ from typing import Dict
import pandas as pd
from ners.core.config.pipeline_config import PipelineConfig
from ners.processing.ner.name_model import NameModel
from ners.processing.steps import PipelineStep, NameAnnotation
from core.config.pipeline_config import PipelineConfig
from processing.ner.name_model import NameModel
from processing.steps import PipelineStep, NameAnnotation
class NERAnnotationStep(PipelineStep):
@@ -39,9 +39,7 @@ class NERAnnotationStep(PipelineStep):
logging.info("NER model loaded successfully")
else:
logging.warning(f"NER model not found at {self.model_path}")
logging.warning(
"NER annotation will be skipped. Train the model first."
)
logging.warning("NER annotation will be skipped. Train the model first.")
self.name_model.nlp = None
except Exception as e:
logging.error(f"Failed to load NER model: {e}")
@@ -82,9 +80,7 @@ class NERAnnotationStep(PipelineStep):
# Create annotation result in same format as LLM step
annotation = NameAnnotation(
identified_name=" ".join(native_parts) if native_parts else None,
identified_surname=" ".join(surname_parts)
if surname_parts
else None,
identified_surname=" ".join(surname_parts) if surname_parts else None,
)
result = {
@@ -128,9 +124,7 @@ class NERAnnotationStep(PipelineStep):
logging.info(f"Batch {batch_id}: No entries to annotate")
return batch
logging.info(
f"Batch {batch_id}: Annotating {len(unannotated_entries)} entries with NER"
)
logging.info(f"Batch {batch_id}: Annotating {len(unannotated_entries)} entries with NER")
batch = batch.copy()
@@ -1,57 +0,0 @@
[project]
name = "ners"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
"geopandas>=1.1.1",
"joblib>=1.5.2",
"lightgbm>=4.6.0",
"matplotlib>=3.10.6",
"numpy>=2.3.3",
"ollama>=0.6.0",
"pandas>=2.3.3",
"plotly>=6.3.1",
"psutil>=7.1.0",
"pydantic>=2.11.10",
"pyyaml>=6.0.3",
"scikit-learn>=1.7.2",
"seaborn>=0.13.2",
"spacy>=3.8.7",
"streamlit>=1.50.0",
"tqdm>=4.67.1",
"typer>=0.19.2",
"tensorflow==2.20.0; sys_platform == 'linux' and platform_machine == 'x86_64'",
"xgboost>=3.0.5",
"networkx>=3.5",
]
[project.scripts]
ners = "ners.cli:app"
[build-system]
requires = ["uv_build>=0.8.12,<0.9.0"]
build-backend = "uv_build"
[dependency-groups]
dev = [
"ipykernel>=6.30.1",
"pyright>=1.1.406",
"pytest>=8.4.2",
"ruff>=0.13.3",
]
[tool.pyright]
pythonVersion = "3.11"
typeCheckingMode = "basic"
reportMissingImports = "none"
reportMissingModuleSource = "none"
useLibraryCodeForTypes = true
include = ["src"]
[tool.ruff]
# Keep defaults and additionally ignore notebooks
extend-exclude = [
"**/*.ipynb",
]
@@ -0,0 +1,170 @@
absl-py==2.3.0
altair==5.1.2
annotated-types==0.7.0
anyio==4.9.0
appnope==0.1.4
argon2-cffi==25.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==3.0.0
astunparse==1.6.3
async-lru==2.0.5
attrs==25.3.0
babel==2.17.0
beautifulsoup4==4.13.4
black==25.1.0
bleach==6.2.0
blinker==1.9.0
cachetools==6.1.0
certifi==2025.6.15
cffi==1.17.1
charset-normalizer==3.4.2
click==8.2.1
comm==0.2.2
contourpy==1.3.2
cycler==0.12.1
debugpy==1.8.14
decorator==5.2.1
defusedxml==0.7.1
executing==2.2.0
fastjsonschema==2.21.1
flake8==7.3.0
flatbuffers==25.2.10
fonttools==4.58.4
fqdn==1.5.1
gast==0.6.0
gitdb==4.0.12
GitPython==3.1.45
google-pasta==0.2.0
grpcio==1.73.0
h11==0.16.0
h5py==3.14.0
httpcore==1.0.9
httpx==0.28.1
idna==3.10
imbalanced-learn==0.13.0
ipykernel==6.29.5
ipython>=8.0,<9.0
ipython_pygments_lexers==1.1.1
isoduration==20.11.0
jedi==0.19.2
Jinja2==3.1.6
joblib==1.5.1
json5==0.12.0
jsonpointer==3.0.0
jsonschema==4.24.0
jsonschema-specifications==2025.4.1
jupyter-events==0.12.0
jupyter-lsp==2.2.5
jupyter_client==8.6.3
jupyter_core==5.8.1
jupyter_server==2.16.0
jupyter_server_terminals==0.5.3
jupyterlab==4.4.4
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.3
keras==3.10.0
kiwisolver==1.4.8
libclang==18.1.1
lightgbm~=4.6.0
Markdown==3.8.2
markdown-it-py==3.0.0
MarkupSafe==3.0.2
matplotlib==3.10.3
matplotlib-inline==0.1.7
mccabe==0.7.0
mdurl==0.1.2
mistune==3.1.3
ml-dtypes==0.3.2
mypy==1.17.0
mypy_extensions==1.1.0
namex==0.1.0
narwhals==2.0.1
nbclient==0.10.2
nbconvert==7.16.6
nbformat==5.10.4
nest-asyncio==1.6.0
nltk==3.9.1
notebook==7.4.4
notebook_shim==0.2.4
numpy==1.26.4
ollama~=0.5.1
opt_einsum==3.4.0
optree==0.16.0
overrides==7.7.0
packaging==25.0
pandas==2.3.0
pandocfilters==1.5.1
parso==0.8.4
pathspec==0.12.1
pexpect==4.9.0
pillow==11.2.1
platformdirs==4.3.8
plotly~=6.2.0
prometheus_client==0.22.1
prompt_toolkit==3.0.51
protobuf==4.25.8
psutil==7.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
pyarrow==21.0.0
pycodestyle==2.14.0
pycparser==2.22
pydantic~=2.11.7
pydantic_core==2.33.2
pydeck==0.9.1
pyflakes==3.4.0
Pygments==2.19.1
pyparsing==3.2.3
python-dateutil==2.9.0.post0
python-json-logger==3.3.0
pytz==2025.2
PyYAML~=6.0.2
pyzmq==27.0.0
referencing==0.36.2
regex==2024.11.6
requests==2.32.4
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==14.0.0
rpds-py==0.26.0
scikit-learn~=1.6.1
scipy==1.15.3
seaborn==0.13.2
Send2Trash==1.8.3
six==1.17.0
sklearn-compat==0.1.3
smmap==5.0.2
sniffio==1.3.1
soupsieve==2.7
spacy~=3.8.7
stack-data==0.6.3
streamlit~=1.47.1
tenacity==9.1.2
tensorboard==2.16.2
tensorboard-data-server==0.7.2
tensorflow==2.16.2
tensorflow-io-gcs-filesystem==0.37.1
termcolor==3.1.0
terminado==0.18.1
threadpoolctl==3.6.0
tinycss2==1.4.0
toml==0.10.2
toolz==1.0.0
tornado==6.5.1
tqdm==4.67.1
traitlets==5.14.3
types-python-dateutil==2.9.0.20250516
types-PyYAML==6.0.12.20250516
typing-inspection==0.4.1
typing_extensions==4.14.0
tzdata==2025.2
uri-template==1.3.0
urllib3==2.5.0
wcwidth==0.2.13
webcolors==24.11.1
webencodings==0.5.1
websocket-client==1.8.0
Werkzeug==3.1.3
wrapt==1.17.2
xgboost~=3.0.3
@@ -1,17 +1,13 @@
import logging
from abc import ABC, abstractmethod
from typing import Dict, Any, Optional, List, TYPE_CHECKING, Union
from typing import Dict, Any, Optional, List
import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from ners.research.experiment import ExperimentConfig
if TYPE_CHECKING:
from ners.research.experiment.feature_extractor import FeatureExtractor
from sklearn.preprocessing import LabelEncoder
from research.experiment import ExperimentConfig
class BaseModel(ABC):
@@ -19,13 +15,13 @@ class BaseModel(ABC):
def __init__(self, config: ExperimentConfig):
self.config = config
self.model: Any | None = None
self.feature_extractor: "FeatureExtractor | None" = None
self.label_encoder: "LabelEncoder | None" = None
self.tokenizer: Any | None = None # For neural models
self.is_fitted: bool = False
self.training_history: Dict[str, Any] = {} # For learning curves
self.learning_curve_data: Dict[str, Any] = {}
self.model = None
self.feature_extractor = None
self.label_encoder = None
self.tokenizer = None # For neural models
self.is_fitted = False
self.training_history = {} # Store training history for learning curves
self.learning_curve_data = {} # Store learning curve experiment data
@property
@abstractmethod
@@ -52,7 +48,7 @@ class BaseModel(ABC):
@abstractmethod
def generate_learning_curve(
self, X: pd.DataFrame, y: pd.Series, train_sizes: List[float] = []
self, X: pd.DataFrame, y: pd.Series, train_sizes: List[float] = None
) -> Dict[str, Any]:
"""Generate learning curve data for the model"""
pass
@@ -62,17 +58,10 @@ class BaseModel(ABC):
if not self.is_fitted:
raise ValueError("Model must be fitted before making predictions")
if (
self.feature_extractor is None
or self.model is None
or self.label_encoder is None
):
raise ValueError("Model is not fully initialized for prediction")
features_df = self.feature_extractor.extract_features(X)
X_prepared = self.prepare_features(features_df)
predictions: Union[np.ndarray, Any] = self.model.predict(X_prepared)
predictions = self.model.predict(X_prepared)
# Handle different prediction formats
if hasattr(predictions, "shape") and len(predictions.shape) > 1:
@@ -86,9 +75,6 @@ class BaseModel(ABC):
if not self.is_fitted:
raise ValueError("Model must be fitted before making predictions")
if self.feature_extractor is None or self.model is None:
raise ValueError("Model is not fully initialized for prediction")
features_df = self.feature_extractor.extract_features(X)
X_prepared = self.prepare_features(features_df)
@@ -97,11 +83,7 @@ class BaseModel(ABC):
elif hasattr(self.model, "predict"):
# For neural networks that return probabilities directly
probabilities = self.model.predict(X_prepared)
if (
hasattr(probabilities, "shape")
and len(probabilities.shape) == 2
and probabilities.shape[1] > 1
):
if len(probabilities.shape) == 2 and probabilities.shape[1] > 1:
return probabilities
raise NotImplementedError("Model does not support probability predictions")
@@ -109,44 +91,35 @@ class BaseModel(ABC):
def get_feature_importance(self) -> Optional[Dict[str, float]]:
"""Get feature importance if supported by the model"""
model = self.model
if model is None:
return None
if hasattr(model, "feature_importances_"):
if hasattr(self.model, "feature_importances_"):
# For tree-based models
importances = model.feature_importances_
importances = self.model.feature_importances_
feature_names = self._get_feature_names()
return dict(zip(feature_names, importances))
elif hasattr(model, "coef_"):
elif hasattr(self.model, "coef_"):
# For linear models
coefficients = np.abs(model.coef_[0])
coefficients = np.abs(self.model.coef_[0])
feature_names = self._get_feature_names()
return dict(zip(feature_names, coefficients))
elif hasattr(model, "named_steps") and "classifier" in model.named_steps:
elif hasattr(self.model, "named_steps") and "classifier" in self.model.named_steps:
# For sklearn pipelines (like LogisticRegression with vectorizer)
classifier = model.named_steps["classifier"]
classifier = self.model.named_steps["classifier"]
if hasattr(classifier, "coef_"):
coefficients = np.abs(classifier.coef_[0])
if hasattr(model.named_steps["vectorizer"], "get_feature_names_out"):
feature_names = model.named_steps[
"vectorizer"
].get_feature_names_out()
if hasattr(self.model.named_steps["vectorizer"], "get_feature_names_out"):
feature_names = self.model.named_steps["vectorizer"].get_feature_names_out()
# Take top features to avoid too many n-grams
top_indices = np.argsort(coefficients)[-20:]
return dict(
zip(feature_names[top_indices], coefficients[top_indices])
)
return dict(zip(feature_names[top_indices], coefficients[top_indices]))
return None
def _get_feature_names(self) -> List[str]:
"""Get feature names (override in subclasses if needed)"""
model = self.model
if model is not None and hasattr(model, "feature_names_in_"):
return list(model.feature_names_in_)
if hasattr(self.model, "feature_names_in_"):
return list(self.model.feature_names_in_)
return [f"feature_{i}" for i in range(100)] # Default fallback
def save(self, path: str):
@@ -170,7 +143,7 @@ class BaseModel(ABC):
model_data = joblib.load(path)
# Recreate the model instance
from ners.research.experiment import ExperimentConfig
from research.experiment import ExperimentConfig
config = ExperimentConfig.from_dict(model_data["config"])
instance = cls(config)
@@ -248,9 +221,7 @@ class BaseModel(ABC):
if "accuracy" in self.training_history:
axes[0].plot(self.training_history["accuracy"], label="Training Accuracy")
if "val_accuracy" in self.training_history:
axes[0].plot(
self.training_history["val_accuracy"], label="Validation Accuracy"
)
axes[0].plot(self.training_history["val_accuracy"], label="Validation Accuracy")
axes[0].set_title("Model Accuracy")
axes[0].set_xlabel("Epoch")
axes[0].set_ylabel("Accuracy")
@@ -18,9 +18,7 @@ class ExperimentConfig:
tags: List[str] = field(default_factory=list)
# Model configuration
model_type: str = (
"logistic_regression" # logistic_regression, lstm, transformer, etc.
)
model_type: str = "logistic_regression" # logistic_regression, lstm, transformer, etc.
model_params: Dict[str, Any] = field(default_factory=dict)
# Feature configuration
@@ -28,9 +26,7 @@ class ExperimentConfig:
feature_params: Dict[str, Any] = field(default_factory=dict)
# Data configuration
train_data_filter: Optional[Dict[str, Any]] = (
None # Filter criteria for training data
)
train_data_filter: Optional[Dict[str, Any]] = None # Filter criteria for training data
test_data_filter: Optional[Dict[str, Any]] = None
target_column: str = "sex"
@@ -40,9 +36,7 @@ class ExperimentConfig:
cross_validation_folds: int = 5
# Evaluation configuration
metrics: List[str] = field(
default_factory=lambda: ["accuracy", "precision", "recall", "f1"]
)
metrics: List[str] = field(default_factory=lambda: ["accuracy", "precision", "recall", "f1"])
def to_dict(self) -> Dict[str, Any]:
"""Convert to dictionary for serialization"""
@@ -70,7 +64,7 @@ class ExperimentStatus(Enum):
def calculate_metrics(
y_true: np.ndarray, y_pred: np.ndarray, metrics: Optional[List[str]] = None
y_true: np.ndarray, y_pred: np.ndarray, metrics: List[str] = None
) -> Dict[str, float]:
"""Calculate specified metrics"""
@@ -2,7 +2,7 @@ from dataclasses import dataclass, field, asdict
from datetime import datetime
from typing import Optional, Dict, List, Any
from ners.research.experiment import ExperimentConfig, ExperimentStatus
from research.experiment import ExperimentConfig, ExperimentStatus
@dataclass
@@ -51,8 +51,6 @@ class ExperimentResult:
"""Create from dictionary"""
data["config"] = ExperimentConfig.from_dict(data["config"])
data["start_time"] = datetime.fromisoformat(data["start_time"])
data["end_time"] = (
datetime.fromisoformat(data["end_time"]) if data["end_time"] else None
)
data["end_time"] = datetime.fromisoformat(data["end_time"]) if data["end_time"] else None
data["status"] = ExperimentStatus(data["status"])
return cls(**data)
