7 Commits

Author SHA1 Message Date
bernard-ng b7fc90ef71 [notebook] add results 2025-10-18 22:57:21 +02:00
Bernard Ngandu c463e6ed7e Rename model notation reference (#13) 2025-10-18 16:06:59 +02:00
Bernard Ngandu ad600ef565 Add technical architecture report (#12) 2025-10-18 15:43:28 +02:00
bernard-ng 8160bb0f6f docs: update README 2025-10-07 23:58:47 +02:00
bernard-ng f2ac0c9769 fix: add github workflow 2025-10-07 23:21:35 +02:00
bernard-ng d3b3840278 fix: nn models pad_sequences 2025-10-06 00:37:29 +02:00
bernard-ng cb22c06628 fix: remove svm model 2025-10-06 00:03:54 +02:00
47 changed files with 33447 additions and 30747 deletions
+35
@@ -0,0 +1,35 @@
name: audit
on:
  push:
    branches:
      - main
  pull_request:
jobs:
  bandit:
    name: bandit
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Install uv
        run: curl -LsSf https://astral.sh/uv/install.sh | sh
      - name: Cache uv dependencies
        uses: actions/cache@v4
        with:
          path: |
            ~/.cache/uv
            .venv
          key: ${{ runner.os }}-uv-${{ hashFiles('**/uv.lock') }}
          restore-keys: |
            ${{ runner.os }}-uv-
      - name: Sync dependencies (with dev tools)
        run: uv sync --dev
      - name: Run Bandit (security linter)
        run: uv run bandit -r . -c pyproject.toml || true
+40
@@ -0,0 +1,40 @@
name: quality
on:
  push:
    branches:
      - main
  pull_request:
jobs:
  lint:
    name: ruff and pyright
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Install uv
        run: curl -LsSf https://astral.sh/uv/install.sh | sh
      - name: Cache uv dependencies
        uses: actions/cache@v4
        with:
          path: |
            ~/.cache/uv
            .venv
          key: ${{ runner.os }}-uv-${{ hashFiles('**/uv.lock') }}
          restore-keys: |
            ${{ runner.os }}-uv-
      - name: Sync dependencies (with dev tools)
        run: uv sync --dev
      - name: Run Ruff (lint + format checks)
        run: |
          uv run ruff check .
          uv run ruff format --check .
      - name: Run Pyright (type checks)
        run: uv run pyright
+206
@@ -0,0 +1,206 @@
# Formal Model Specifications
This document formalises the statistical models implemented in
`src/ners/research/models`. Throughout, the training set is
$\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^N$ with labels
$y^{(i)} \in \{0,1\}$ for the binary gender classes. Feature vectors
$\mathbf{x}^{(i)}$ combine
* character $n$-gram count representations of name strings produced by
`CountVectorizer` or `TfidfVectorizer`, and
* engineered scalar or categorical metadata (e.g., name length, province)
that are either used directly or encoded by `LabelEncoder`.
For neural architectures, character or token sequences are converted into
integer index sequences using a `Tokenizer` before being padded to a
maximum length specified in the configuration. Predictions are returned as
class posterior probabilities via a softmax layer unless otherwise noted.
## Logistic Regression (`logistic_regression_model.py`)
**Feature map.** Character $n$-gram counts
$\phi(\mathbf{x}) \in \mathbb{R}^d$ obtained with
`CountVectorizer(analyzer="char", ngram_range=(2,4))` (default configuration).【F:src/ners/research/models/logistic_regression_model.py†L16-L46】
**Model.** The linear logit for class $1$ is
$z = \mathbf{w}^\top \phi(\mathbf{x}) + b$. The class posteriors are
$p(y=1\mid \mathbf{x}) = \sigma(z)$ and $p(y=0\mid \mathbf{x}) = 1 - \sigma(z)$
with $\sigma(u) = (1 + e^{-u})^{-1}$.
**Training objective.** Minimise the regularised negative log-likelihood
$$\mathcal{L}(\mathbf{w}, b) = -\sum_{i=1}^N \left[y^{(i)}\log p(y=1\mid \mathbf{x}^{(i)}) + (1-y^{(i)}) \log p(y=0\mid \mathbf{x}^{(i)})\right] + \lambda R(\mathbf{w}),$$
where $R$ is the penalty induced by the chosen solver (e.g., $\ell_2$ for
`liblinear`).
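The posterior and objective above can be evaluated directly. The following is a minimal NumPy sketch of the equations (illustrative only, not the project's scikit-learn implementation):

```python
import numpy as np

def sigmoid(u):
    # sigma(u) = (1 + e^{-u})^{-1}
    return 1.0 / (1.0 + np.exp(-u))

def regularised_nll(w, b, X, y, lam=1.0):
    # negative log-likelihood with an l2 penalty, matching the objective above
    p = sigmoid(X @ w + b)  # p(y=1 | x) per example
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll + lam * np.dot(w, w)
```

With $\mathbf{w} = \mathbf{0}$ and $b = 0$, every example receives $p = 0.5$ and the unregularised loss reduces to $N \log 2$, a useful sanity check when debugging training code.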
## Multinomial Naive Bayes (`naive_bayes_model.py`)
**Feature map.** Character $n$-gram counts
$\phi(\mathbf{x}) \in \mathbb{N}^d$ derived with
`CountVectorizer(analyzer="char", ngram_range=(2,4))` by default.【F:src/ners/research/models/naive_bayes_model.py†L15-L38】
**Generative model.** For each class $c \in \{0,1\}$, the class prior is
$\pi_c = \frac{N_c}{N}$. Conditional feature probabilities are estimated with
Laplace smoothing (parameter $\alpha$):
$$\theta_{cj} = \frac{N_{cj} + \alpha}{\sum_{k=1}^d (N_{ck} + \alpha)},$$
where $N_{cj}$ counts the total occurrences of feature $j$ among examples of
class $c$. The likelihood of an input with counts $\phi_j(\mathbf{x})$ is
$$p(\phi(\mathbf{x})\mid y=c) = \prod_{j=1}^d \theta_{cj}^{\phi_j(\mathbf{x})}.$$
**Inference.** Predict with the maximum a posteriori (MAP) decision
$\hat{y} = \arg\max_c \left\{ \log \pi_c + \sum_{j=1}^d \phi_j(\mathbf{x}) \log \theta_{cj} \right\}$.
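The smoothed estimates and MAP rule can be written out in a few lines of NumPy (an illustrative sketch of the formulas above, not `sklearn.naive_bayes.MultinomialNB` itself):

```python
import numpy as np

def nb_fit(X, y, alpha=1.0):
    # X: (N, d) integer n-gram counts; y: binary labels in {0, 1}
    log_prior = np.log(np.bincount(y, minlength=2) / len(y))
    counts = np.vstack([X[y == c].sum(axis=0) for c in (0, 1)])  # N_{cj}
    theta = (counts + alpha) / (counts + alpha).sum(axis=1, keepdims=True)
    return log_prior, np.log(theta)

def nb_predict(X, log_prior, log_theta):
    # MAP decision: argmax_c { log pi_c + sum_j phi_j(x) log theta_{cj} }
    return np.argmax(log_prior + X @ log_theta.T, axis=1)
```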
## Random Forest (`random_forest_model.py`)
**Feature map.** Concatenation of engineered numerical features and label
encoded categorical attributes produced on demand in
`prepare_features`.【F:src/ners/research/models/random_forest_model.py†L28-L71】
**Model.** An ensemble of $T$ decision trees $\{T_t\}_{t=1}^T$, each trained on
a bootstrap sample of the data with random feature sub-sampling. Each tree
outputs a class prediction $T_t(\mathbf{x}) \in \{0,1\}$. The forest prediction
is the mode of individual votes:
$$\hat{y} = \operatorname{mode}\{T_t(\mathbf{x}) : t = 1,\dots,T\}.$$
**Class probability.** For soft outputs,
$p(y=1\mid \mathbf{x}) = \frac{1}{T} \sum_{t=1}^T p_t(y=1\mid \mathbf{x})$ where
$p_t$ is the class distribution estimated at the leaf reached by
$\mathbf{x}$ in tree $t$.
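Both aggregation rules above reduce to simple array operations once each tree's leaf distribution is known. A minimal sketch (illustrative, not the `RandomForestClassifier` internals):

```python
import numpy as np

def forest_aggregate(tree_probs):
    # tree_probs: (T, 2) leaf class distributions, one row per tree
    soft = tree_probs.mean(axis=0)                   # averaged posterior
    votes = tree_probs.argmax(axis=1)                # each tree's hard prediction
    hard = np.bincount(votes, minlength=2).argmax()  # mode of the votes
    return soft, hard
```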
## LightGBM (`lightgbm_model.py`)
**Feature map.** Hybrid of numeric inputs, categorical label encodings, and
character $n$-gram counts expanded into dense columns and assembled into a
feature matrix persisted in `self.feature_columns`.【F:src/ners/research/models/lightgbm_model.py†L38-L118】
**Model.** Gradient boosted decision trees forming an additive function
$F_M(\mathbf{x}) = F_0(\mathbf{x}) + \sum_{m=1}^M \eta h_m(\mathbf{x})$, where $h_m$ denotes the
$m$-th tree, $F_0$ the initial prediction, and $\eta$ the learning rate.
**Training objective.** LightGBM minimises
$$\mathcal{L}^{(m)} = \sum_{i=1}^N \ell\big(y^{(i)}, F_{m-1}(\mathbf{x}^{(i)}) + h_m(\mathbf{x}^{(i)})\big) + \Omega(h_m),$$
using second-order Taylor approximations of the loss $\ell$ (binary log-loss by
default) and regulariser $\Omega$ determined by tree complexity constraints.
## XGBoost (`xgboost_model.py`)
**Feature map.** Combination of numeric metadata, categorical label encodings,
and character $n$-gram counts as described in `prepare_features`.【F:src/ners/research/models/xgboost_model.py†L41-L113】
**Model.** Additive ensemble of regression trees
$F_M(\mathbf{x}) = \sum_{m=1}^M f_m(\mathbf{x})$ with $f_m \in \mathcal{F}$, the
space of candidate regression trees.
**Training objective.** At boosting iteration $m$, minimise the regularised
objective
$$\mathcal{L}^{(m)} = \sum_{i=1}^N \ell\big(y^{(i)}, F_{m-1}(\mathbf{x}^{(i)}) + f_m(\mathbf{x}^{(i)})\big) + \Omega(f_m),$$
where $\Omega(f) = \gamma T_f + \tfrac{1}{2} \lambda \sum_{j=1}^{T_f} w_j^2$ penalises the
number of leaves $T_f$ and their scores $w_j$. The optimal leaf weights follow
$$w_j^{\star} = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda},$$
with $g_i$ and $h_i$ denoting first- and second-order gradients of the loss for
sample $i$.
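The closed-form leaf weight is easy to verify numerically. For binary log-loss, $g_i = p_i - y_i$ and $h_i = p_i(1 - p_i)$; the sketch below just evaluates the formula (illustrative, not XGBoost's implementation):

```python
import numpy as np

def optimal_leaf_weight(g, h, lam=1.0):
    # w*_j = -sum(g_i) / (sum(h_i) + lambda) over the samples in leaf j
    return -np.sum(g) / (np.sum(h) + lam)
```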
## Convolutional Neural Network (`cnn_model.py`)
**Input encoding.** Character-level token sequences padded to length
$L$ using `Tokenizer(char_level=True)` followed by `pad_sequences`.【F:src/ners/research/models/cnn_model.py†L23-L64】
**Architecture.** Embedding layer producing $X \in \mathbb{R}^{L \times d}$,
followed by two convolutional blocks:
1. $H^{(1)} = \operatorname{ReLU}(\operatorname{Conv1D}_{k_1}(X))$ with kernel size
$k_1$ and $F$ filters, then temporal max-pooling.
2. $H^{(2)} = \operatorname{ReLU}(\operatorname{Conv1D}_{k_2}(H^{(1)}))$ with kernel
size $k_2$ and $F$ filters.
Global max-pooling yields $h = \max_{t} H^{(2)}_{t,:}$, which passes through a
dense layer and dropout before the softmax layer producing
$p(y\mid \mathbf{x}) = \operatorname{softmax}(W h + b)$.
**Loss.** Cross-entropy between softmax output and the ground-truth label.
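The pooled classification head described above amounts to a max over time followed by a softmax; a minimal NumPy sketch of that step (illustrative, not the Keras model in `cnn_model.py`):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cnn_head(H2, W, b):
    # H2: (T, F) activations from the second conv block
    h = H2.max(axis=0)         # global max-pooling over time
    return softmax(W @ h + b)  # class posterior p(y | x)
```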
## Bidirectional GRU (`bigru_model.py`)
**Input encoding.** Word-level sequences padded to length $L$ with
`Tokenizer(char_level=False)` and `pad_sequences`.【F:src/ners/research/models/bigru_model.py†L47-L69】
**Recurrent dynamics.** A stacked bidirectional GRU computes forward and
backward hidden states according to
$$
\begin{aligned}
\mathbf{z}_t &= \sigma(W_z \mathbf{x}_t + U_z \mathbf{h}_{t-1} + \mathbf{b}_z),\\
\mathbf{r}_t &= \sigma(W_r \mathbf{x}_t + U_r \mathbf{h}_{t-1} + \mathbf{b}_r),\\
\tilde{\mathbf{h}}_t &= \tanh\big(W_h \mathbf{x}_t + U_h(\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{b}_h\big),\\
\mathbf{h}_t &= (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t.
\end{aligned}
$$
The final representation concatenates the last forward and backward states
before passing through dense layers and a softmax classifier.
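A single forward GRU step can be transcribed directly from the recurrence above (a NumPy sketch under the assumption that `P` holds the weight matrices and biases; not the Keras layer itself):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def gru_step(x_t, h_prev, P):
    # P maps names to the weight matrices/biases of the equations above
    z = sigmoid(P["Wz"] @ x_t + P["Uz"] @ h_prev + P["bz"])
    r = sigmoid(P["Wr"] @ x_t + P["Ur"] @ h_prev + P["br"])
    h_tilde = np.tanh(P["Wh"] @ x_t + P["Uh"] @ (r * h_prev) + P["bh"])
    return (1.0 - z) * h_prev + z * h_tilde
```

Because $\mathbf{h}_t$ is a convex combination of $\mathbf{h}_{t-1}$ and $\tanh(\cdot)$, the state stays in $[-1, 1]$ whenever the previous state does.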
## Bidirectional LSTM (`lstm_model.py`)
**Input encoding.** Word-level sequences padded to length $L$ using the same
pipeline as the BiGRU model.【F:src/ners/research/models/lstm_model.py†L45-L67】
**Recurrent dynamics.** At each timestep, the LSTM updates its memory cell via
$$
\begin{aligned}
\mathbf{i}_t &= \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i),\\
\mathbf{f}_t &= \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f),\\
\mathbf{o}_t &= \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + \mathbf{b}_o),\\
\tilde{\mathbf{c}}_t &= \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c),\\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t,\\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t).
\end{aligned}
$$
Bidirectional aggregation concatenates terminal forward/backward hidden vectors
before the dense-softmax head.
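As with the GRU, one LSTM timestep follows mechanically from the gate equations (a NumPy sketch assuming a parameter dictionary `P`; not the Keras layer):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def lstm_step(x_t, h_prev, c_prev, P):
    # gates follow the update equations above
    i = sigmoid(P["Wi"] @ x_t + P["Ui"] @ h_prev + P["bi"])
    f = sigmoid(P["Wf"] @ x_t + P["Uf"] @ h_prev + P["bf"])
    o = sigmoid(P["Wo"] @ x_t + P["Uo"] @ h_prev + P["bo"])
    c_tilde = np.tanh(P["Wc"] @ x_t + P["Uc"] @ h_prev + P["bc"])
    c_t = f * c_prev + i * c_tilde
    return o * np.tanh(c_t), c_t
```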
## Transformer Encoder (`transformer_model.py`)
**Input encoding.** Token sequences padded to a fixed length with positional
indices $\{0, \ldots, L-1\}$ added through a learned positional embedding.
`Tokenizer` initialises the vocabulary; padding uses `pad_sequences`.【F:src/ners/research/models/transformer_model.py†L25-L77】
**Architecture.** For hidden dimension $d$, the encoder block computes
$$
\begin{aligned}
Z^{(0)} &= X + P,\\
Z^{(1)} &= \operatorname{LayerNorm}\big(Z^{(0)} + \operatorname{Dropout}(\operatorname{MHAttn}(Z^{(0)}))\big),\\
Z^{(2)} &= \operatorname{LayerNorm}\big(Z^{(1)} + \operatorname{Dropout}(\phi(Z^{(1)} W_1 + b_1) W_2 + b_2)\big),
\end{aligned}
$$
where $\operatorname{MHAttn}$ is multi-head self-attention with
$H$ heads. Global average pooling produces a fixed-length vector for the dense
and dropout layers before the final softmax classifier.
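The core of $\operatorname{MHAttn}$ is scaled dot-product attention per head; a single-head NumPy sketch (illustrative, not the Keras `MultiHeadAttention` layer):

```python
import numpy as np

def scaled_dot_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V for one head
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)  # attention rows sum to 1
    return A @ V
```

Each output row is a convex combination of the value rows, which is why the result stays in the span of $V$.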
## Ensemble Voting (`ensemble_model.py`)
**Base learners.** A configurable set of pipelines that include character
$n$-gram vectorisers and classical classifiers (logistic regression,
random forest, naive Bayes).【F:src/ners/research/models/ensemble_model.py†L29-L96】
**Aggregation.** Given model posteriors $\mathbf{p}_j(\mathbf{x})$ and non-negative
weights $w_j$, the soft-voting ensemble predicts
$$p(y=c\mid \mathbf{x}) = \frac{1}{\sum_j w_j} \sum_j w_j \, p_j(y=c\mid \mathbf{x}).$$
Hard voting instead returns the mode of the individual hard predictions,
$\hat{y} = \operatorname{mode}\{\arg\max_c p_j(y=c\mid \mathbf{x}) : j = 1, \dots, J\}$ over the $J$ base learners.
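The weighted soft-voting rule is a one-line normalised average (a NumPy sketch of the formula above, not the project's `EnsembleModel`):

```python
import numpy as np

def soft_vote(probs, weights):
    # probs: (J, C) posteriors from J base learners; weights: length-J non-negative
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * np.asarray(probs)).sum(axis=0) / w.sum()
```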
+5
@@ -1,5 +1,10 @@
# A Culturally-Aware NLP System for Congolese Name Analysis and Gender Inference
[![audit](https://github.com/bernard-ng/drc-ners-nlp/actions/workflows/audit.yml/badge.svg)](https://github.com/bernard-ng/drc-ners-nlp/actions/workflows/audit.yml)
[![quality](https://github.com/bernard-ng/drc-ners-nlp/actions/workflows/quality.yml/badge.svg)](https://github.com/bernard-ng/drc-ners-nlp/actions/workflows/quality.yml)
---
Despite the growing success of gender inference models in Natural Language Processing (NLP), these tools often
underperform when applied to culturally diverse African contexts due to the lack of culturally-representative training
data.
+72
@@ -0,0 +1,72 @@
# Technical Architecture Report: DRC NERS NLP
## Project Overview
The DRC NERS NLP project delivers an end-to-end system for Congolese name analysis and gender inference backed by a 5-million-record dataset enriched with demographic metadata.【F:README.md†L1-L12】 The toolkit wraps a configurable processing pipeline, experiment runner, and Streamlit dashboard so that researchers and practitioners can reproducibly clean raw registry data, engineer features, benchmark multiple models, and publish insights without modifying core code.
```mermaid
flowchart LR
    A["Data ingestion<br/>DataLoader"] --> B["Preprocessing<br/>Batch pipeline"]
    B --> C["Feature extraction<br/>Name heuristics + NER"]
    C --> D["Model training<br/>Experiment runner"]
    D --> E["Evaluation<br/>Metrics + tracking"]
    E --> F["Visualization & Deployment<br/>Streamlit + exports"]
```
## Software Architecture and Implementation
### Configuration-driven Orchestration
* **Central config management** `ConfigManager` resolves layered YAML/JSON configs, injects project paths, and merges environment overrides, ensuring every workflow can be replayed with the same parameters.【F:src/ners/core/config/config_manager.py†L12-L157】
* **Pipeline definition** `PipelineConfig` captures stage order, batch settings, annotation parameters, and dataset splits. The default `pipeline.yaml` enumerates every stage and shared directories, creating a single source of truth for the runbook.【F:src/ners/core/config/pipeline_config.py†L10-L29】【F:config/pipeline.yaml†L1-L68】
* **Research templates** Pre-built experiment templates map feature sets and hyperparameters for each baseline architecture, allowing experiments to be reproduced or extended declaratively.【F:config/research_templates.yaml†L1-L86】
### Modular Data Pipeline
1. **Data ingestion** `DataLoader` streams CSV chunks with typed columns, optional balancing, and dataset size limits; it also writes artifacts using consistent encodings for downstream reuse.【F:src/ners/core/utils/data_loader.py†L33-L174】
2. **Batch processing engine** The `Pipeline` class wires ordered steps and delegates to a `BatchProcessor` that supports sequential or concurrent execution, checkpointing, and memory-aware concatenation via `MemoryMonitor` to handle multi-million row datasets safely.【F:src/ners/processing/pipeline.py†L12-L57】【F:src/ners/processing/batch/batch_processor.py†L12-L173】【F:src/ners/processing/batch/memory_monitor.py†L7-L25】
3. **Preprocessing steps**
* `DataCleaningStep` drops critical nulls, normalizes text, and deduplicates records.【F:src/ners/processing/steps/data_cleaning_step.py†L10-L31】
* `FeatureExtractionStep` engineers linguistic statistics, name segments, gender/category inference, region mapping, and spaCy-based tagging while optimizing dtypes to keep memory usage low.【F:src/ners/processing/steps/feature_extraction_step.py†L24-L196】
* `DataSelectionStep` enforces column whitelists and domain-specific filters (e.g., removing “global” regions for certain years).【F:src/ners/processing/steps/data_selection_step.py†L9-L60】
* `NERAnnotationStep` loads a spaCy model, parallelizes tagging with retries, and records provenance for each batch.【F:src/ners/processing/steps/ner_annotation_step.py†L13-L172】
* `LLMAnnotationStep` calls an Ollama-hosted model with configurable concurrency, rate limiting, and exponential backoff to enrich unannotated rows while maintaining checkpoints.【F:src/ners/processing/steps/llm_annotation_step.py†L18-L169】
* `DataSplittingStep` persists evaluation, gender, and province-specific splits in deterministic fashion, reusing the shared data loader for consistent I/O.【F:src/ners/processing/steps/data_splitting_step.py†L11-L69】
4. **Pipeline runner** `run_pipeline` composes the configured steps, captures progress metrics, and invokes the splitter to materialize curated datasets, turning raw CSVs into model-ready corpora with a single command.【F:src/ners/main.py†L14-L75】
5. **Operational visibility** `PipelineMonitor` inspects checkpoint state, estimates storage use, and exposes Typer commands for status, cleanup, or reset, simplifying long-running batch management.【F:src/ners/processing/monitoring/pipeline_monitor.py†L11-L196】【F:src/ners/cli.py†L156-L200】
### Research Experimentation Pipeline
* **CLI and templates** Typer-powered commands load configuration environments and instantiate experiments from templates, guaranteeing every run is traceable to a versioned config file.【F:src/ners/cli.py†L13-L146】
* **Experiment runner** `ExperimentRunner` fetches the featured dataset, applies filters, splits data, trains models from the registry, computes metrics, confusion matrices, and feature importance, then persists joblib artifacts per run.【F:src/ners/research/experiment/experiment_runner.py†L24-L271】
* **Model registry and abstractions** Traditional, neural, and ensemble estimators inherit from `BaseModel`, which standardizes feature extraction, training, persistence, and probability interfaces. For example, the logistic regression model couples a character n-gram vectorizer with solver-specific tuning guidance.【F:src/ners/research/base_model.py†L15-L210】【F:src/ners/research/models/logistic_regression_model.py†L1-L47】
* **Tracking and artifact management** `ExperimentTracker` records metadata, metrics, and tags in JSON for comparison/export, while `ModelTrainer` orchestrates runs, saves serialized models/configs, and generates learning curves for later visualization or deployment.【F:src/ners/research/experiment/experiment_tracker.py†L14-L200】【F:src/ners/research/model_trainer.py†L16-L200】
### Visualization and User Interfaces
* **Streamlit portal** `ners.web.app` bootstraps a Streamlit dashboard that shares the same configuration stack, giving analysts access to pipeline monitors, experiment summaries, and predictions through a browser-friendly UI.【F:src/ners/web/app.py†L1-L67】
* **Analysis utilities** The statistics package offers reusable seaborn/matplotlib plots (e.g., transition matrices, letter frequency charts) for exploratory studies, exporting intermediate CSVs alongside visuals.【F:src/ners/research/statistics/plots.py†L1-L39】
## Technology Stack and Environments
* **Core languages & libraries** Python 3.11 orchestrated by the `uv` package manager, with heavy use of pandas, NumPy, scikit-learn, joblib, spaCy, Streamlit, seaborn, and Typer across modules.【F:Dockerfile†L3-L48】【F:src/ners/research/experiment/experiment_runner.py†L6-L21】
* **LLM integration** The LLM annotation step leverages the Ollama client, optional rate limiting, and JSON-schema validation to keep third-party inference reproducible and auditable.【F:src/ners/processing/steps/llm_annotation_step.py†L21-L116】
* **Containerization** The Dockerfile provisions a slim Debian image with reproducible uv-managed environments, while `compose.yml` mounts configs/data and exposes Streamlit, providing parity between local and deployment setups.【F:Dockerfile†L3-L48】【F:compose.yml†L1-L23】
## Reproducibility and Automation
* All entry points (`ners pipeline`, `ners ner`, `ners research`, `ners monitor`) source environment-specific configs and return exit codes, enabling CI/CD and scheduled jobs to orchestrate pipelines reliably.【F:src/ners/cli.py†L13-L200】
* Version-controlled configs cover pipeline stages, annotations, prompts, and experiment templates; `ConfigManager` ensures default paths map to versioned `data/`, `models/`, and `outputs/` directories for each run.【F:src/ners/core/config/config_manager.py†L15-L111】【F:config/pipeline.yaml†L1-L68】
* Experiment artifacts (models, metrics, learning curves, exports) are stored per experiment ID with timestamps and hashes, easing regression comparisons and rollbacks.【F:src/ners/research/experiment/experiment_tracker.py†L45-L163】【F:src/ners/research/model_trainer.py†L109-L200】
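The layered-override behaviour described above (base config, then environment-specific overrides) is the key reproducibility mechanism. A purely illustrative sketch of such a merge, with hypothetical names rather than the project's actual `ConfigManager` API:

```python
def deep_merge(base, override):
    # Hypothetical helper: later config layers win, nested dicts merge recursively.
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```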
## Scalability and Performance Considerations
* Chunked reads, optimized dtypes, and optional stratified sampling keep ingestion memory-efficient even for multi-million row CSVs.【F:src/ners/core/utils/data_loader.py†L40-L161】
* Batch processing supports threaded or multiprocess execution with incremental checkpointing, enabling restarts mid-run and reducing wasted computation on failure.【F:src/ners/processing/batch/batch_processor.py†L29-L156】
* Memory monitoring and dtype normalization inside feature engineering prevent ballooning DataFrame footprints during annotation-heavy stages.【F:src/ners/processing/steps/feature_extraction_step.py†L53-L195】【F:src/ners/processing/batch/memory_monitor.py†L7-L25】
* Rate-limited concurrency in NER and LLM steps balances throughput with external service stability, retrying transient failures without blocking the whole run.【F:src/ners/processing/steps/ner_annotation_step.py†L50-L166】【F:src/ners/processing/steps/llm_annotation_step.py†L58-L163】
## Deployment and Interfaces
* **Command-line workflows** The Typer CLI exposes discrete subcommands for pipeline execution, NER dataset generation, research training, and checkpoint maintenance, simplifying automation scripts and developer onboarding.【F:src/ners/cli.py†L13-L200】
* **Web interface** Streamlit shares the same configuration context as the CLI and surfaces monitoring utilities, experiment tracking, and interactive analysis for non-technical stakeholders.【F:src/ners/web/app.py†L1-L67】
* **Containerized services** Docker Compose binds the CLI, configs, and data directories into a reproducible container, standardizing environment setup across OSes and enabling GPU-enabled hosts to mount device-specific resources if needed.【F:compose.yml†L1-L23】
## Summary of Design Choices
* Configuration-first design separates code from experiment definitions, allowing fast iteration without code changes.【F:src/ners/core/config/config_manager.py†L15-L157】【F:config/research_templates.yaml†L1-L86】
* Batch checkpoints, memory monitoring, and rate limiting deliver resilience against large-scale processing failures and external service hiccups.【F:src/ners/processing/batch/batch_processor.py†L29-L173】【F:src/ners/processing/steps/llm_annotation_step.py†L21-L169】
* Unified model abstractions, experiment tracking, and artifact exports make the research stack extensible and production-ready, with metrics and models stored alongside configuration for reproducible science.【F:src/ners/research/base_model.py†L15-L210】【F:src/ners/research/experiment/experiment_tracker.py†L14-L200】
* Streamlit dashboards and Typer commands democratize access, letting analysts trigger pipelines or inspect experiments without touching Python modules.【F:src/ners/cli.py†L13-L200】【F:src/ners/web/app.py†L1-L67】
By combining configuration-driven orchestration, modular batch processing, and standardized experiment tooling, the DRC NERS NLP project functions as a robust, reproducible pipeline capable of scaling from exploratory research to production-ready deployments.
+2 -2
@@ -1,3 +1,3 @@
 category,l2,kl_mf,kl_fm,jsd,permutation_p_value
-names,0.3189041485139616,0.04320097944655348,0.0215380760498496,0.03236952774820154,0.978
-surnames,1.2770018925640299,0.2936188220992242,0.23989460296618093,0.26675671253270256,0.001
+names,0.3189041485139616,0.04320097944655348,0.0215380760498496,0.03236952774820154,0.977
+surnames,1.2770018925640299,0.2936188220992242,0.23989460296618093,0.26675671253270256,0.002
-93
@@ -1,93 +0,0 @@
# Model Notation Reference
This document summarises the mathematical formulation and notation behind the models available in `research/models`. In all cases, the input example is represented by a feature vector $\mathbf{x}$ (after any feature-extraction or vectorisation steps) and the target label belongs to a finite set of classes $\mathcal{Y}$.
## Logistic Regression
- Decision function: $z = \mathbf{w}^\top \mathbf{x} + b$.
- Binary posterior: $p(y=1\mid \mathbf{x}) = \sigma(z) = \frac{1}{1 + e^{-z}}$ and $p(y=0\mid \mathbf{x}) = 1 - \sigma(z)$.
- Multi-class (one-vs-rest or softmax): $p(y=c\mid \mathbf{x}) = \frac{\exp(\mathbf{w}_c^\top \mathbf{x} + b_c)}{\sum_{k \in \mathcal{Y}} \exp(\mathbf{w}_k^\top \mathbf{x} + b_k)}$.
- Loss: negative log-likelihood $\mathcal{L} = -\sum_i \log p(y_i\mid \mathbf{x}_i)$ plus regularisation when configured.
- Gender prediction rationale: linear decision boundaries over character n-gram counts provide a strong, interpretable baseline for name-based gender attribution.
- Implementation notes: uses character n-grams via `CountVectorizer`; `solver='liblinear'` with optional `class_weight` and `n_jobs` to speed up sparse optimization.
## Multinomial Naive Bayes
- Class prior: $p(y=c) = \frac{N_c}{N}$ where $N_c$ counts training instances in class $c$.
- Conditional likelihood (bag-of-ngrams): $p(\mathbf{x}\mid y=c) = \prod_{j=1}^{d} p(x_j\mid y=c)^{x_j}$ with categorical parameters estimated via Laplace smoothing.
- Posterior up to normalisation: $\log p(y=c\mid \mathbf{x}) \propto \log p(y=c) + \sum_{j=1}^{d} x_j \log p(x_j\mid y=c)$.
- Gender prediction rationale: captures the relative frequency of character patterns associated with each gender, giving a fast and robust probabilistic baseline for sparse n-gram features.
- Implementation notes: character n-gram counts with Laplace smoothing; extremely fast to train and deploy.
## Support Vector Machine (RBF Kernel)
- Dual-form decision function: $f(\mathbf{x}) = \operatorname{sign}\Big( \sum_{i=1}^{M} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \Big)$.
- RBF kernel: $K(\mathbf{x}_i, \mathbf{x}) = \exp\big(-\gamma \lVert \mathbf{x}_i - \mathbf{x} \rVert_2^2\big)$.
- Soft-margin optimisation: $\min_{\mathbf{w}, \xi} \frac{1}{2}\lVert \mathbf{w} \rVert_2^2 + C \sum_i \xi_i$ s.t. $y_i(\mathbf{w}^\top \phi(\mathbf{x}_i) + b) \geq 1 - \xi_i$, $\xi_i \geq 0$.
- Gender prediction rationale: non-linear kernels model subtle character-pattern interactions beyond linear baselines, improving separability when male and female names share prefixes but diverge in internal structure.
- Implementation notes: TF-IDF character features; increased `cache_size` and optional `class_weight` for stability on imbalanced data.
## Random Forest
- Ensemble of $T$ decision trees: $\hat{y} = \operatorname{mode}\{ T_t(\mathbf{x}) : t=1, \dots, T \}$ for classification.
- Each tree draws a bootstrap sample of the training set and a random subset of features at each split.
- Feature importance (used in implementation): mean decrease in impurity aggregated over splits per feature.
- Gender prediction rationale: handles heterogeneous engineered features (length, province, endings) without heavy preprocessing, while delivering interpretable feature-importance signals.
- Implementation notes: enables `n_jobs=-1` for parallel trees; persistent label encoders ensure stable categorical mappings.
## LightGBM (Gradient Boosted Trees)
- Additive model: $F_0(\mathbf{x}) = \hat{p}$ (initial prediction), $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta h_m(\mathbf{x})$.
- Each weak learner $h_m$ is a decision tree grown with leaf-wise strategy and depth constraint.
- Optimises differentiable loss (default: logistic) using first- and second-order gradients over data in each boosting iteration.
- Gender prediction rationale: excels with sparse categorical encodings and numerous engineered features, offering strong accuracy with manageable inference cost.
- Implementation notes: `objective='binary'`, `n_jobs=-1` for throughput; works well with compact character-gram features plus metadata.
## XGBoost
- Objective: $\mathcal{L}^{(t)} = \sum_{i} \ell(y_i, \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)) + \Omega(f_t)$ with regulariser $\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_j w_j^2$.
- Tree score expansion via second-order Taylor approximation; optimal leaf weight $w_j = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$ where $g_i$ and $h_i$ are gradient and Hessian statistics.
- Final prediction: $\hat{y}(\mathbf{x}) = \sum_{t=1}^{M} \eta f_t(\mathbf{x})$.
- Gender prediction rationale: strong regularisation and gradient-informed splits capture interactions between textual and metadata features; suited to high-stakes deployment when tuned carefully.
- Implementation notes: `tree_method='hist'`, `n_jobs=-1` for efficient CPU training; integrates engineered categorical encodings.
## Convolutional Neural Network (1D)
- Token/character embeddings produce $X \in \mathbb{R}^{L \times d}$.
- Convolution layer: $H^{(k)} = \operatorname{ReLU}(X * W^{(k)} + b^{(k)})$ where $*$ denotes 1D convolution with filter $W^{(k)}$.
- Pooling summarises temporal dimension (max or global max); dense layers map pooled vector to logits $\mathbf{z}$.
- Output probabilities: $p(y=c\mid \mathbf{x}) = \operatorname{softmax}_c(\mathbf{z})$; loss via cross-entropy.
- Gender prediction rationale: convolutional filters learn discriminative prefixes, suffixes, and intra-name motifs directly from characters, accommodating mixed-language inputs.
- Implementation notes: adds `SpatialDropout1D` on embeddings and `padding='same'` in conv layers for stability and length-invariance.
## Bidirectional GRU
- Forward GRU recursion: $\begin{aligned}
&\mathbf{z}_t = \sigma(W_z \mathbf{x}_t + U_z \mathbf{h}_{t-1} + \mathbf{b}_z),\\
&\mathbf{r}_t = \sigma(W_r \mathbf{x}_t + U_r \mathbf{h}_{t-1} + \mathbf{b}_r),\\
&\tilde{\mathbf{h}}_t = \tanh(W_h \mathbf{x}_t + U_h(\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{b}_h),\\
&\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t.
\end{aligned}$
- Backward GRU mirrors the recurrence from $t=L$ to $1$; final representation concatenates $[\mathbf{h}_L^{\rightarrow}; \mathbf{h}_1^{\leftarrow}]$ before dense layers and softmax output.
- Gender prediction rationale: bidirectional context processes character sequences in both directions, learning gender-specific morphemes appearing at any position within the name.
- Implementation notes: `Embedding(mask_zero=True)` propagates masks to GRUs, ignoring padding; optional `recurrent_dropout` reduces overfitting.
## LSTM
- Gates per timestep: $\begin{aligned}
&\mathbf{i}_t = \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i),\\
&\mathbf{f}_t = \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f),\\
&\mathbf{o}_t = \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + \mathbf{b}_o),\\
&\tilde{\mathbf{c}}_t = \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c),\\
&\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t,\\
&\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t).
\end{aligned}$
- Bidirectional stacking concatenates final hidden vectors before classification via softmax.
- Gender prediction rationale: long short-term memory cells model long-range dependencies within names, capturing compound structures common in multilingual gendered naming conventions.
- Implementation notes: `Embedding(mask_zero=True)` and `recurrent_dropout` regularise sequence modelling across padded batches.
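A single LSTM timestep, following the gate equations above term by term, looks like this in NumPy (an illustrative sketch with assumed parameter names, not the repository's Keras code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    # Input, forget, and output gates
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])
    # Candidate cell state and cell/hidden updates
    c_tilde = np.tanh(p["Wc"] @ x + p["Uc"] @ h_prev + p["bc"])
    c = f * c_prev + i * c_tilde
    h = o * np.tanh(c)
    return h, c

def lstm_encode(X, p):
    """Run the recurrence over X (L, d); return the final hidden state."""
    h = np.zeros_like(p["bi"])
    c = np.zeros_like(p["bi"])
    for x in X:
        h, c = lstm_step(x, h, c, p)
    return h
```

The additive cell update $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$ is what lets gradients flow across long name sequences without vanishing.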
## Transformer Encoder (Single Block)
- Input embeddings $X \in \mathbb{R}^{L \times d}$ plus positional embeddings $P$ produce $Z^{(0)} = X + P$.
- Multi-head self-attention: $\operatorname{MHAttn}(Z) = \operatorname{Concat}(\text{head}_1, \dots, \text{head}_H) W^O$ where $\text{head}_h = \operatorname{softmax}\big(\frac{Q_h K_h^\top}{\sqrt{d_k}}\big) V_h$ and $(Q_h, K_h, V_h) = (Z W_h^Q, Z W_h^K, Z W_h^V)$.
- Add & norm: $Z^{(1)} = \operatorname{LayerNorm}(Z^{(0)} + \operatorname{Dropout}(\operatorname{MHAttn}(Z^{(0)})))$.
- Position-wise feed-forward: $Z^{(2)} = \operatorname{LayerNorm}(Z^{(1)} + \operatorname{Dropout}(\phi(Z^{(1)} W_1 + b_1) W_2 + b_2))$, with activation $\phi(\cdot)$ (ReLU).
- Sequence pooling (global average) feeds dense layers and softmax classifier.
- Gender prediction rationale: self-attention captures global dependencies and shared subword units across names, and can outperform recurrent models when sufficient labelled data is available; with smaller datasets the risk of overfitting should be monitored.
- Implementation notes: `Embedding(mask_zero=True)` with learned positional embeddings; attention dropout (`attn_dropout`) and classifier dropout improve generalisation.
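The multi-head self-attention formula above can be sketched in NumPy as follows (a minimal single-block illustration; the per-head weight tensors and shapes are assumptions, and the repository's Keras layers handle masking and dropout on top of this):

```python
import numpy as np

def softmax_rows(z):
    # Row-wise stable softmax, so each attention row sums to 1
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Z, Wq, Wk, Wv, Wo):
    """Z: (L, d) inputs; Wq/Wk/Wv: (H, d, d_k) per-head projections;
    Wo: (H * d_k, d) output projection. Returns MHAttn(Z) of shape (L, d)."""
    H, d_k = Wq.shape[0], Wq.shape[2]
    heads = []
    for h in range(H):
        Q, K, V = Z @ Wq[h], Z @ Wk[h], Z @ Wv[h]   # (L, d_k) each
        A = softmax_rows(Q @ K.T / np.sqrt(d_k))    # (L, L) attention weights
        heads.append(A @ V)                         # (L, d_k) head output
    return np.concatenate(heads, axis=-1) @ Wo      # (L, d)
```

Every position attends to every other in one step, which is how the block captures dependencies between a name's prefix and suffix without recurrence.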
## Ensemble (Soft Voting)
- Base learners indexed by $j$ output probability vectors $\mathbf{p}_j(\mathbf{x})$.
- Aggregated prediction with weights $w_j$: $p(y=c\mid \mathbf{x}) = \frac{1}{\sum_j w_j} \sum_j w_j \, p_j(y=c\mid \mathbf{x})$.
- Hard voting variant predicts $\hat{y} = \operatorname{mode}\{ \hat{y}_j \}$, where $\hat{y}_j = \arg\max_c p_j(y=c\mid \mathbf{x})$.
- Gender prediction rationale: blends complementary inductive biases (linear, tree-based, neural) to reduce variance on ambiguous names; it remains effective provided the individual members are well calibrated.
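Both voting rules above reduce to a few lines of NumPy (an illustrative sketch of the aggregation formulas; function names are assumptions, not the repository's API):

```python
import numpy as np

def soft_vote(probs, weights):
    """Weighted average of per-learner probability vectors.
    probs: (J, C) array, one row p_j per base learner; weights: (J,)."""
    w = np.asarray(weights, dtype=float)
    P = np.asarray(probs, dtype=float)
    return (w[:, None] * P).sum(axis=0) / w.sum()

def hard_vote(probs):
    """Majority vote over each learner's argmax prediction."""
    preds = np.argmax(np.asarray(probs), axis=1)
    vals, counts = np.unique(preds, return_counts=True)
    return int(vals[np.argmax(counts)])
```

For example, two equally weighted learners outputting `[0.8, 0.2]` and `[0.4, 0.6]` soft-vote to `[0.6, 0.4]`, so the ensemble still predicts the first class even though the learners disagree on the argmax.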
+16
View File
@@ -37,5 +37,21 @@ build-backend = "uv_build"
[dependency-groups]
dev = [
"ipykernel>=6.30.1",
"pyright>=1.1.406",
"pytest>=8.4.2",
"ruff>=0.13.3",
]
[tool.pyright]
pythonVersion = "3.11"
typeCheckingMode = "basic"
reportMissingImports = "none"
reportMissingModuleSource = "none"
useLibraryCodeForTypes = true
include = ["src"]
[tool.ruff]
# Keep defaults and additionally ignore notebooks
extend-exclude = [
"**/*.ipynb",
]
+23 -4
View File
@@ -118,12 +118,31 @@ def research_train(
exp_cfg = exp_builder.find_template(tmpl, name, type)
trainer = ModelTrainer(cfg)
# Validate and coerce template fields to expected types for type safety
model_name = exp_cfg.get("name")
model_type = exp_cfg.get("model_type")
features = exp_cfg.get("features")
tags = exp_cfg.get("tags", [])
if not isinstance(model_name, str) or not isinstance(model_type, str):
raise typer.BadParameter(
"Template must include 'name' and 'model_type' as strings"
)
if features is None:
features = ["full_name"]
elif not isinstance(features, list):
raise typer.BadParameter("Template 'features' must be a list of strings")
if not isinstance(tags, list):
tags = []
trainer.train_single_model(
model_name=exp_cfg.get("name"),
model_type=exp_cfg.get("model_type"),
features=exp_cfg.get("features"),
model_name=model_name,
model_type=model_type,
features=features,
model_params=exp_cfg.get("model_params", {}),
tags=exp_cfg.get("tags", []),
tags=tags,
)
+5 -3
View File
@@ -16,13 +16,13 @@ def get_config() -> PipelineConfig:
def load_config(config_path: Optional[Union[str, Path]] = None) -> PipelineConfig:
"""Load configuration from specified path"""
if config_path:
return config_manager.load_config(Path(config_path))
if config_path is not None:
return config_manager.load_config(config_path)
return config_manager.get_config()
def setup_config(
config_path: Optional[Path] = None, env: str = "development"
config_path: Optional[Union[str, Path]] = None, env: str = "development"
) -> PipelineConfig:
"""
Unified configuration loading and logging setup for all entrypoint scripts.
@@ -37,6 +37,8 @@ def setup_config(
# Determine config path
if config_path is None:
config_path = Path("config") / f"pipeline.{env}.yaml"
else:
config_path = Path(config_path)
# Load configuration
config = ConfigManager(config_path).load_config()
+14 -8
View File
@@ -13,7 +13,9 @@ class ConfigManager:
"""Centralized configuration management"""
def __init__(self, config_path: Optional[Union[str, Path]] = None):
self.config_path = config_path or self._find_config_file()
self.config_path: Path = (
Path(config_path) if config_path is not None else self._find_config_file()
)
self._config: Optional[PipelineConfig] = None
self._setup_default_paths()
@@ -47,10 +49,12 @@ class ConfigManager:
checkpoints_dir=root_dir / "data" / "checkpoints",
)
def load_config(self, config_path: Optional[Path] = None) -> PipelineConfig:
def load_config(
self, config_path: Optional[Union[str, Path]] = None
) -> PipelineConfig:
"""Load configuration from file"""
if config_path:
self.config_path = config_path
if config_path is not None:
self.config_path = Path(config_path)
if not self.config_path.exists():
logging.warning(
@@ -80,9 +84,11 @@ class ConfigManager:
"""Create default configuration"""
return PipelineConfig(paths=self.default_paths)
def save_config(self, config: PipelineConfig, path: Optional[Path] = None):
def save_config(
self, config: PipelineConfig, path: Optional[Union[str, Path]] = None
):
"""Save configuration to file"""
save_path = path or self.config_path
save_path = Path(path) if path is not None else self.config_path
save_path.parent.mkdir(parents=True, exist_ok=True)
config_dict = config.model_dump()
@@ -142,8 +148,8 @@ class ConfigManager:
env_config = self.load_config(env_config_path)
# Merge configurations
base_dict = base_config.dict()
env_dict = env_config.dict()
base_dict = base_config.model_dump()
env_dict = env_config.model_dump()
self._deep_update(base_dict, env_dict)
return PipelineConfig(**base_dict)
+2 -2
View File
@@ -260,9 +260,9 @@ class NameTagger:
# Remove overlaps
filtered, last_end = [], -1
for s, e, l in valid:
for s, e, label in valid:
if s >= last_end:
filtered.append((s, e, l))
filtered.append((s, e, label))
last_end = e
return filtered
+1 -1
View File
@@ -19,7 +19,7 @@ class PipelineState:
processed_batches: int = 0
total_batches: int = 0
failed_batches: List[int] = None
failed_batches: Optional[List[int]] = None
last_checkpoint: Optional[str] = None
def __post_init__(self):
@@ -21,7 +21,7 @@ class DataSelectionStep(PipelineStep):
if "region" in batch.columns and "year" in batch.columns:
target_years = {2015, 2021, 2022}
mask_remove = batch["region"].str.lower().eq("global") & batch["year"].isin(
target_years
list(target_years)
)
removed = int(mask_remove.sum())
if removed:
@@ -29,8 +29,8 @@ class FeatureExtractionStep(PipelineStep):
self.region_mapper = RegionMapper()
self.name_tagger = NameTagger()
@classmethod
def requires_batch_mutation(cls) -> bool:
@property
def requires_batch_mutation(self) -> bool:
"""This step creates new columns, so mutation is required"""
return True
+44 -26
View File
@@ -1,6 +1,6 @@
import logging
from abc import ABC, abstractmethod
from typing import Dict, Any, Optional, List
from typing import Dict, Any, Optional, List, TYPE_CHECKING, Union
import joblib
import matplotlib.pyplot as plt
@@ -9,19 +9,23 @@ import pandas as pd
from ners.research.experiment import ExperimentConfig
if TYPE_CHECKING:
from ners.research.experiment.feature_extractor import FeatureExtractor
from sklearn.preprocessing import LabelEncoder
class BaseModel(ABC):
"""Abstract base class for all models"""
def __init__(self, config: ExperimentConfig):
self.config = config
self.model = None
self.feature_extractor = None
self.label_encoder = None
self.tokenizer = None # For neural models
self.is_fitted = False
self.training_history = {} # Store training history for learning curves
self.learning_curve_data = {} # Store learning curve experiment data
self.model: Any | None = None
self.feature_extractor: "FeatureExtractor | None" = None
self.label_encoder: "LabelEncoder | None" = None
self.tokenizer: Any | None = None # For neural models
self.is_fitted: bool = False
self.training_history: Dict[str, Any] = {} # For learning curves
self.learning_curve_data: Dict[str, Any] = {}
@property
@abstractmethod
@@ -48,7 +52,7 @@ class BaseModel(ABC):
@abstractmethod
def generate_learning_curve(
self, X: pd.DataFrame, y: pd.Series, train_sizes: List[float] = None
self, X: pd.DataFrame, y: pd.Series, train_sizes: List[float] = []
) -> Dict[str, Any]:
"""Generate learning curve data for the model"""
pass
@@ -58,10 +62,17 @@ class BaseModel(ABC):
if not self.is_fitted:
raise ValueError("Model must be fitted before making predictions")
if (
self.feature_extractor is None
or self.model is None
or self.label_encoder is None
):
raise ValueError("Model is not fully initialized for prediction")
features_df = self.feature_extractor.extract_features(X)
X_prepared = self.prepare_features(features_df)
predictions = self.model.predict(X_prepared)
predictions: Union[np.ndarray, Any] = self.model.predict(X_prepared)
# Handle different prediction formats
if hasattr(predictions, "shape") and len(predictions.shape) > 1:
@@ -75,6 +86,9 @@ class BaseModel(ABC):
if not self.is_fitted:
raise ValueError("Model must be fitted before making predictions")
if self.feature_extractor is None or self.model is None:
raise ValueError("Model is not fully initialized for prediction")
features_df = self.feature_extractor.extract_features(X)
X_prepared = self.prepare_features(features_df)
@@ -83,7 +97,11 @@ class BaseModel(ABC):
elif hasattr(self.model, "predict"):
# For neural networks that return probabilities directly
probabilities = self.model.predict(X_prepared)
if len(probabilities.shape) == 2 and probabilities.shape[1] > 1:
if (
hasattr(probabilities, "shape")
and len(probabilities.shape) == 2
and probabilities.shape[1] > 1
):
return probabilities
raise NotImplementedError("Model does not support probability predictions")
@@ -91,30 +109,29 @@ class BaseModel(ABC):
def get_feature_importance(self) -> Optional[Dict[str, float]]:
"""Get feature importance if supported by the model"""
if hasattr(self.model, "feature_importances_"):
model = self.model
if model is None:
return None
if hasattr(model, "feature_importances_"):
# For tree-based models
importances = self.model.feature_importances_
importances = model.feature_importances_
feature_names = self._get_feature_names()
return dict(zip(feature_names, importances))
elif hasattr(self.model, "coef_"):
elif hasattr(model, "coef_"):
# For linear models
coefficients = np.abs(self.model.coef_[0])
coefficients = np.abs(model.coef_[0])
feature_names = self._get_feature_names()
return dict(zip(feature_names, coefficients))
elif (
hasattr(self.model, "named_steps")
and "classifier" in self.model.named_steps
):
elif hasattr(model, "named_steps") and "classifier" in model.named_steps:
# For sklearn pipelines (like LogisticRegression with vectorizer)
classifier = self.model.named_steps["classifier"]
classifier = model.named_steps["classifier"]
if hasattr(classifier, "coef_"):
coefficients = np.abs(classifier.coef_[0])
if hasattr(
self.model.named_steps["vectorizer"], "get_feature_names_out"
):
feature_names = self.model.named_steps[
if hasattr(model.named_steps["vectorizer"], "get_feature_names_out"):
feature_names = model.named_steps[
"vectorizer"
].get_feature_names_out()
# Take top features to avoid too many n-grams
@@ -127,8 +144,9 @@ class BaseModel(ABC):
def _get_feature_names(self) -> List[str]:
"""Get feature names (override in subclasses if needed)"""
if hasattr(self.model, "feature_names_in_"):
return list(self.model.feature_names_in_)
model = self.model
if model is not None and hasattr(model, "feature_names_in_"):
return list(model.feature_names_in_)
return [f"feature_{i}" for i in range(100)] # Default fallback
def save(self, path: str):
+1 -1
View File
@@ -70,7 +70,7 @@ class ExperimentStatus(Enum):
def calculate_metrics(
y_true: np.ndarray, y_pred: np.ndarray, metrics: List[str] = None
y_true: np.ndarray, y_pred: np.ndarray, metrics: Optional[List[str]] = None
) -> Dict[str, float]:
"""Calculate specified metrics"""
@@ -99,14 +99,24 @@ class ExperimentBuilder:
logging.warning(f"Unknown feature type: {feature_str}")
continue
name = (
template_config.get("name")
or template_config.get("model_type")
or "experiment"
)
model_type = template_config.get("model_type") or "logistic_regression"
description = template_config.get("description") or ""
return ExperimentConfig(
name=template_config.get("name"),
description=template_config.get("description"),
model_type=template_config.get("model_type"),
name=str(name),
description=str(description),
model_type=str(model_type),
features=features,
model_params=template_config.get("model_params", {}),
tags=template_config.get("tags", []),
test_size=template_config.get("test_size", 0.2),
cross_validation_folds=template_config.get("cross_validation_folds", 5),
test_size=float(template_config.get("test_size", 0.2)),
cross_validation_folds=int(
template_config.get("cross_validation_folds", 5)
),
train_data_filter=template_config.get("train_data_filter"),
)
@@ -1,5 +1,5 @@
from enum import Enum
from typing import List, Dict, Any, Union
from typing import List, Dict, Any, Union, Optional
import pandas as pd
@@ -25,7 +25,9 @@ class FeatureExtractor:
"""Extract different types of features from name data"""
def __init__(
self, feature_types: List[FeatureType], feature_params: Dict[str, Any] = None
self,
feature_types: List[FeatureType],
feature_params: Optional[Dict[str, Any]] = None,
):
self.feature_types = feature_types
self.feature_params = feature_params or {}
-2
View File
@@ -10,7 +10,6 @@ from ners.research.models.logistic_regression_model import LogisticRegressionMod
from ners.research.models.lstm_model import LSTMModel
from ners.research.models.naive_bayes_model import NaiveBayesModel
from ners.research.models.random_forest_model import RandomForestModel
from ners.research.models.svm_model import SVMModel
from ners.research.models.transformer_model import TransformerModel
from ners.research.models.xgboost_model import XGBoostModel
@@ -23,7 +22,6 @@ MODEL_REGISTRY = {
"lstm": LSTMModel,
"naive_bayes": NaiveBayesModel,
"random_forest": RandomForestModel,
"svm": SVMModel,
"transformer": TransformerModel,
"xgboost": XGBoostModel,
}
+5 -5
View File
@@ -1,7 +1,7 @@
import json
import logging
from datetime import datetime
from typing import List, Dict, Any
from typing import List, Dict, Any, Optional
import pandas as pd
@@ -30,9 +30,9 @@ class ModelTrainer:
self,
model_name: str,
model_type: str = "logistic_regression",
features: List[str] = None,
model_params: Dict[str, Any] = None,
tags: List[str] = None,
features: Optional[List[str]] = None,
model_params: Optional[Dict[str, Any]] = None,
tags: Optional[List[str]] = None,
save_artifacts: bool = True,
) -> str:
"""
@@ -106,7 +106,7 @@ class ModelTrainer:
logging.info(f"Completed training {len(experiment_ids)} models successfully")
return experiment_ids
def save_model_artifacts(self, experiment_id: str) -> Dict[str, str]:
def save_model_artifacts(self, experiment_id: str) -> Dict[str, Optional[str]]:
"""
Save model artifacts in a structured way for easy loading.
Returns paths to saved artifacts.
+12 -4
View File
@@ -13,7 +13,7 @@ from ners.research.neural_network_model import NeuralNetworkModel
class BiGRUModel(NeuralNetworkModel):
"""Bidirectional GRU model for name classification"""
def build_model_with_vocab(self, vocab_size: int, **kwargs) -> Any:
def build_model(self, vocab_size: int, **kwargs) -> Any:
params = kwargs
model = Sequential(
[
@@ -23,6 +23,7 @@ class BiGRUModel(NeuralNetworkModel):
input_dim=vocab_size,
output_dim=params.get("embedding_dim", 64),
mask_zero=True,
input_length=params.get("max_len", 6),
),
# First recurrent block returns full sequences to allow stacking.
# Moderate dropout + optional recurrent_dropout to reduce overfitting
@@ -32,7 +33,10 @@ class BiGRUModel(NeuralNetworkModel):
params.get("gru_units", 32),
return_sequences=True,
dropout=params.get("dropout", 0.2),
recurrent_dropout=params.get("recurrent_dropout", 0.0),
# Use a small non-zero recurrent_dropout by default to
# disable cuDNN path, which has strict right-padding mask
# requirements and can assert when using Bidirectional.
recurrent_dropout=params.get("recurrent_dropout", 0.1),
)
),
# Second GRU summarizes to the last hidden state (no return_sequences),
@@ -41,7 +45,7 @@ class BiGRUModel(NeuralNetworkModel):
GRU(
params.get("gru_units", 32),
dropout=params.get("dropout", 0.2),
recurrent_dropout=params.get("recurrent_dropout", 0.0),
recurrent_dropout=params.get("recurrent_dropout", 0.1),
)
),
# Small dense head; ReLU + dropout for capacity and regularization.
@@ -69,4 +73,8 @@ class BiGRUModel(NeuralNetworkModel):
sequences = self.tokenizer.texts_to_sequences(text_data)
max_len = self.config.model_params.get("max_len", 6)
return pad_sequences(sequences, maxlen=max_len, padding="post")
# Ensure padding and truncation are applied on the right to keep
# contiguous non-zero tokens on the left, matching RNN mask expectations.
return pad_sequences(
sequences, maxlen=max_len, padding="post", truncating="post"
)
+5 -2
View File
@@ -21,7 +21,7 @@ from ners.research.neural_network_model import NeuralNetworkModel
class CNNModel(NeuralNetworkModel):
"""1D Convolutional Neural Network for character patterns"""
def build_model_with_vocab(self, vocab_size: int, **kwargs) -> Any:
def build_model(self, vocab_size: int, **kwargs) -> Any:
"""Build CNN model with known vocabulary size"""
params = kwargs
@@ -83,4 +83,7 @@ class CNNModel(NeuralNetworkModel):
"max_len", 20
) # Longer for character level
return pad_sequences(sequences, maxlen=max_len, padding="post")
# Right-side padding and truncation ensure contiguous non-zero tokens on the left
return pad_sequences(
sequences, maxlen=max_len, padding="post", truncating="post"
)
+9 -4
View File
@@ -13,7 +13,7 @@ from ners.research.neural_network_model import NeuralNetworkModel
class LSTMModel(NeuralNetworkModel):
"""LSTM model for sequence learning"""
def build_model_with_vocab(self, vocab_size: int, **kwargs) -> Any:
def build_model(self, vocab_size: int, **kwargs) -> Any:
params = kwargs
model = Sequential(
[
@@ -30,7 +30,9 @@ class LSTMModel(NeuralNetworkModel):
params.get("lstm_units", 32),
return_sequences=True,
dropout=params.get("dropout", 0.2),
recurrent_dropout=params.get("recurrent_dropout", 0.0),
# Default to a small non-zero recurrent_dropout to avoid
# cuDNN mask assertions when masking with Bidirectional.
recurrent_dropout=params.get("recurrent_dropout", 0.1),
)
),
# Second LSTM condenses sequence to a fixed vector for classification.
@@ -38,7 +40,7 @@ class LSTMModel(NeuralNetworkModel):
LSTM(
params.get("lstm_units", 32),
dropout=params.get("dropout", 0.2),
recurrent_dropout=params.get("recurrent_dropout", 0.0),
recurrent_dropout=params.get("recurrent_dropout", 0.1),
)
),
# Compact dense head with dropout; sufficient capacity for name signals.
@@ -68,4 +70,7 @@ class LSTMModel(NeuralNetworkModel):
sequences = self.tokenizer.texts_to_sequences(text_data)
max_len = self.config.model_params.get("max_len", 6)
return pad_sequences(sequences, maxlen=max_len, padding="post")
# Right-side padding and truncation to preserve contiguous non-zero tokens
return pad_sequences(
sequences, maxlen=max_len, padding="post", truncating="post"
)
@@ -22,7 +22,7 @@ from ners.research.neural_network_model import NeuralNetworkModel
class TransformerModel(NeuralNetworkModel):
"""Transformer-based model"""
def build_model_with_vocab(self, vocab_size: int, **kwargs) -> Any:
def build_model(self, vocab_size: int, **kwargs) -> Any:
params = kwargs
# Use a single resolved max_len everywhere to avoid shape mismatches
max_len = int(params.get("max_len", 6))
@@ -88,4 +88,7 @@ class TransformerModel(NeuralNetworkModel):
sequences = self.tokenizer.texts_to_sequences(text_data)
max_len = int(self.config.model_params.get("max_len", 6))
return pad_sequences(sequences, maxlen=max_len, padding="post")
# Right-side padding and truncation for consistent masking/shape
return pad_sequences(
sequences, maxlen=max_len, padding="post", truncating="post"
)
+29 -8
View File
@@ -24,7 +24,7 @@ class NeuralNetworkModel(BaseModel):
return "neural_network"
@abstractmethod
def build_model_with_vocab(self, vocab_size: int, **kwargs) -> Any:
def build_model(self, vocab_size: int, **kwargs) -> Any:
"""Build neural network model with known vocabulary size"""
pass
@@ -86,9 +86,7 @@ class NeuralNetworkModel(BaseModel):
logging.info(f"Vocabulary size: {vocab_size}")
# Get additional model parameters
self.model = self.build_model_with_vocab(
vocab_size=vocab_size, **self.config.model_params
)
self.model = self.build_model(vocab_size=vocab_size, **self.config.model_params)
# Train the neural network
logging.info(
@@ -149,6 +147,29 @@ class NeuralNetworkModel(BaseModel):
if invalid_mask.any():
arr[invalid_mask] = oov_index
# Enforce strictly right-padded masks for RNN/cuDNN compatibility.
# Any zero appearing before the last non-zero in a sequence will be
# replaced with the OOV index so the mask remains contiguous True->False.
try:
nz = arr != 0 # non-padding tokens
if nz.ndim == 2 and arr.shape[1] > 0:
# Identify rows that have at least one non-zero
has_nz = nz.any(axis=1)
# Compute last non-zero position per row; if none, set to -1
indices = np.arange(arr.shape[1], dtype=np.int64)
# Max of indices where nz is True gives last non-zero
last_pos = (nz * indices).max(axis=1)
last_pos = np.where(has_nz, last_pos, -1)
# Broadcast to mark the left region up to last non-zero (inclusive)
left_region = indices <= last_pos[:, None]
# Zeros inside the left region are invalid padding -> set to OOV
zero_inside = (~nz) & left_region
if zero_inside.any():
arr[zero_inside] = oov_index
except Exception:
# Best-effort; skip if any unexpected broadcasting issue occurs
pass
# Use int32 for TF embedding ops compatibility
return arr.astype(np.int32, copy=False)
except Exception as e:
@@ -226,8 +247,8 @@ class NeuralNetworkModel(BaseModel):
max_len = self.config.model_params.get("max_len", 6)
for fold, (train_idx, val_idx) in enumerate(cv.split(X_prepared, y_encoded)):
# Create fresh model for each fold using build_model_with_vocab
fold_model = self.build_model_with_vocab(
# Create fresh model for each fold using build_model
fold_model = self.build_model(
vocab_size=vocab_size, max_len=max_len, **self.config.model_params
)
@@ -341,8 +362,8 @@ class NeuralNetworkModel(BaseModel):
val_scores = []
for seed in range(3): # 3 runs for variance
# Build fresh model using build_model_with_vocab
model = self.build_model_with_vocab(
# Build fresh model using build_model
model = self.build_model(
vocab_size=vocab_size, max_len=max_len, **self.config.model_params
)
+2 -6
View File
@@ -5,7 +5,8 @@ import numpy as np
import pandas as pd
from scipy.spatial.distance import euclidean
from scipy.stats import entropy
from typing import Dict, Any
from collections import Counter
from typing import Dict, Any, Literal
LETTERS = "abcdefghijklmnopqrstuvwxyz"
START_TOKEN = "^"
@@ -234,11 +235,6 @@ def build_transition_comparisons(
return out
import pandas as pd
from collections import Counter
from typing import Literal
def build_ngrams_count(
df: pd.DataFrame,
n: int,
+13 -4
View File
@@ -30,12 +30,21 @@ def train_from_template(
logging.info(f"Features: {experiment_config.get('features')}")
trainer = ModelTrainer(cfg)
name_val = experiment_config.get("name")
type_val = experiment_config.get("model_type")
features_val = experiment_config.get("features") or ["full_name"]
tags_val = experiment_config.get("tags", [])
if not isinstance(name_val, str) or not isinstance(type_val, str):
raise ValueError("Template must include 'name' and 'model_type' as strings")
if not isinstance(features_val, list):
raise ValueError("Template 'features' must be a list of strings")
trainer.train_single_model(
model_name=experiment_config.get("name"),
model_type=experiment_config.get("model_type"),
features=experiment_config.get("features"),
model_name=name_val,
model_type=type_val,
features=features_val,
model_params=experiment_config.get("model_params", {}),
tags=experiment_config.get("tags", []),
tags=tags_val if isinstance(tags_val, list) else [],
)
logging.info("Training completed successfully!")
+3 -1
View File
@@ -1 +1,3 @@
from .ner_testing import NERTesting
from .ner_testing import NERTesting as NERTesting
__all__ = ["NERTesting"]
+2 -2
View File
@@ -116,7 +116,7 @@ class Predictions:
try:
probabilities = model.predict_proba(input_df)[0]
return max(probabilities)
except:
except Exception:
return None
def _display_single_prediction_results(
@@ -209,7 +209,7 @@ class Predictions:
try:
probabilities = model.predict_proba(df)
df["confidence"] = np.max(probabilities, axis=1)
except:
except Exception:
df["confidence"] = None
st.success("Predictions completed!")
+584 -80
View File
File diff suppressed because one or more lines are too long
+167 -87
View File
@@ -73,60 +73,82 @@
" cv = exp.get(\"cv_metrics\", {}) or {}\n",
"\n",
" cm = exp.get(\"confusion_matrix\")\n",
" tn=fp=fn=tp=np.nan\n",
" if isinstance(cm, list) and len(cm)==2 and all(isinstance(r, list) and len(r)==2 for r in cm):\n",
" tn = fp = fn = tp = np.nan\n",
" if (\n",
" isinstance(cm, list)\n",
" and len(cm) == 2\n",
" and all(isinstance(r, list) and len(r) == 2 for r in cm)\n",
" ):\n",
" # By inspection of the provided metrics, mapping is:\n",
" # rows = true [f, m]; cols = pred [f, m]\n",
" tn, fp = cm[0][0], cm[0][1] # true negatives and false positives for positive class 'm'\n",
" tn, fp = (\n",
" cm[0][0],\n",
" cm[0][1],\n",
" ) # true negatives and false positives for positive class 'm'\n",
" fn, tp = cm[1][0], cm[1][1]\n",
"\n",
" # Derived metrics from confusion matrix (where present)\n",
" def safe_div(a,b): \n",
" return float(a)/float(b) if (b not in (0, None) and not pd.isna(b)) else np.nan\n",
" def safe_div(a, b):\n",
" return (\n",
" float(a) / float(b) if (b not in (0, None) and not pd.isna(b)) else np.nan\n",
" )\n",
"\n",
" sensitivity = safe_div(tp, tp+fn) # TPR for 'm'\n",
" specificity = safe_div(tn, tn+fp) # TNR for 'm'\n",
" sensitivity = safe_div(tp, tp + fn) # TPR for 'm'\n",
" specificity = safe_div(tn, tn + fp) # TNR for 'm'\n",
" balanced_acc = np.nanmean([sensitivity, specificity])\n",
" mcc_num = (tp*tn - fp*fn)\n",
" mcc_den = sqrt((tp+fp)*(tp+fn)*(tn+fp)*(tn+fn)) if all(x==x for x in [tp+fp, tp+fn, tn+fp, tn+fn]) else np.nan\n",
" mcc_num = tp * tn - fp * fn\n",
" mcc_den = (\n",
" sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))\n",
" if all(x == x for x in [tp + fp, tp + fn, tn + fp, tn + fn])\n",
" else np.nan\n",
" )\n",
" mcc = safe_div(mcc_num, mcc_den)\n",
"\n",
" n_test = exp.get(\"test_size\") or np.nansum([tn, fp, fn, tp])\n",
" test_acc = te.get(\"accuracy\", np.nan)\n",
" # 95% CI for accuracy via normal approximation (ok for n=2000)\n",
" if pd.notna(test_acc) and pd.notna(n_test) and n_test>0:\n",
" se = np.sqrt(test_acc*(1-test_acc)/n_test)\n",
" acc_ci_lo = test_acc - 1.96*se\n",
" acc_ci_hi = test_acc + 1.96*se\n",
" if pd.notna(test_acc) and pd.notna(n_test) and n_test > 0:\n",
" se = np.sqrt(test_acc * (1 - test_acc) / n_test)\n",
" acc_ci_lo = test_acc - 1.96 * se\n",
" acc_ci_hi = test_acc + 1.96 * se\n",
" else:\n",
" acc_ci_lo = acc_ci_hi = np.nan\n",
"\n",
" rows.append({\n",
" \"experiment_id\": exp_id,\n",
" \"model\": name or model_type,\n",
" \"model_family\": (model_type or \"\").upper(),\n",
" \"feature_set\": features,\n",
" \"train_accuracy\": tr.get(\"accuracy\", np.nan),\n",
" \"test_accuracy\": test_acc,\n",
" \"cv_accuracy_mean\": cv.get(\"accuracy\", np.nan),\n",
" \"cv_accuracy_std\": cv.get(\"accuracy_std\", np.nan),\n",
" \"train_f1\": tr.get(\"f1\", np.nan),\n",
" \"test_f1\": te.get(\"f1\", np.nan),\n",
" \"cv_f1_mean\": cv.get(\"f1\", np.nan),\n",
" \"cv_f1_std\": cv.get(\"f1_std\", np.nan),\n",
" \"TP\": tp, \"FP\": fp, \"TN\": tn, \"FN\": fn,\n",
" \"sensitivity_TPR_m\": sensitivity,\n",
" \"specificity_TNR_m\": specificity,\n",
" \"balanced_accuracy\": balanced_acc,\n",
" \"MCC\": mcc,\n",
" \"n_test\": n_test,\n",
" \"acc_95ci_lo\": acc_ci_lo,\n",
" \"acc_95ci_hi\": acc_ci_hi,\n",
" \"train_minus_test_gap\": (tr.get(\"accuracy\", np.nan) - test_acc) if pd.notna(tr.get(\"accuracy\", np.nan)) and pd.notna(test_acc) else np.nan,\n",
" \"test_minus_cv_gap\": (test_acc - cv.get(\"accuracy\", np.nan)) if pd.notna(test_acc) and pd.notna(cv.get(\"accuracy\", np.nan)) else np.nan,\n",
" \"start_time\": exp.get(\"start_time\"),\n",
" \"end_time\": exp.get(\"end_time\")\n",
" })\n",
" rows.append(\n",
" {\n",
" \"experiment_id\": exp_id,\n",
" \"model\": name or model_type,\n",
" \"model_family\": (model_type or \"\").upper(),\n",
" \"feature_set\": features,\n",
" \"train_accuracy\": tr.get(\"accuracy\", np.nan),\n",
" \"test_accuracy\": test_acc,\n",
" \"cv_accuracy_mean\": cv.get(\"accuracy\", np.nan),\n",
" \"cv_accuracy_std\": cv.get(\"accuracy_std\", np.nan),\n",
" \"train_f1\": tr.get(\"f1\", np.nan),\n",
" \"test_f1\": te.get(\"f1\", np.nan),\n",
" \"cv_f1_mean\": cv.get(\"f1\", np.nan),\n",
" \"cv_f1_std\": cv.get(\"f1_std\", np.nan),\n",
" \"TP\": tp,\n",
" \"FP\": fp,\n",
" \"TN\": tn,\n",
" \"FN\": fn,\n",
" \"sensitivity_TPR_m\": sensitivity,\n",
" \"specificity_TNR_m\": specificity,\n",
" \"balanced_accuracy\": balanced_acc,\n",
" \"MCC\": mcc,\n",
" \"n_test\": n_test,\n",
" \"acc_95ci_lo\": acc_ci_lo,\n",
" \"acc_95ci_hi\": acc_ci_hi,\n",
" \"train_minus_test_gap\": (tr.get(\"accuracy\", np.nan) - test_acc)\n",
" if pd.notna(tr.get(\"accuracy\", np.nan)) and pd.notna(test_acc)\n",
" else np.nan,\n",
" \"test_minus_cv_gap\": (test_acc - cv.get(\"accuracy\", np.nan))\n",
" if pd.notna(test_acc) and pd.notna(cv.get(\"accuracy\", np.nan))\n",
" else np.nan,\n",
" \"start_time\": exp.get(\"start_time\"),\n",
" \"end_time\": exp.get(\"end_time\"),\n",
" }\n",
" )\n",
"\n",
"df = pd.DataFrame(rows)"
]
@@ -139,23 +161,53 @@
"outputs": [],
"source": [
"# Clean and order categorical fields\n",
"df[\"feature_set\"] = df[\"feature_set\"].replace({\"full_name\":\"Full name\",\"native_name\":\"Native\",\"surname\":\"Surname\"})\n",
"order_features = [\"Full name\",\"Surname\",\"Native\"]\n",
"df[\"feature_set\"] = pd.Categorical(df[\"feature_set\"], categories=order_features, ordered=True)\n",
"df[\"feature_set\"] = df[\"feature_set\"].replace(\n",
" {\"full_name\": \"Full name\", \"native_name\": \"Native\", \"surname\": \"Surname\"}\n",
")\n",
"order_features = [\"Full name\", \"Surname\", \"Native\"]\n",
"df[\"feature_set\"] = pd.Categorical(\n",
" df[\"feature_set\"], categories=order_features, ordered=True\n",
")\n",
"\n",
"order_family = [\"LOGISTIC_REGRESSION\",\"LIGHTGBM\",\"LSTM\",\"CNN\",\"BIGRU\", \"RANDOM_FOREST\", \"TRANSFORMER\", \"NAIVE_BAYES\", \"XGBOOST\"]\n",
"df[\"model_family\"] = pd.Categorical(df[\"model_family\"], categories=order_family, ordered=True)\n",
"order_family = [\n",
" \"LOGISTIC_REGRESSION\",\n",
" \"LIGHTGBM\",\n",
" \"LSTM\",\n",
" \"CNN\",\n",
" \"BIGRU\",\n",
" \"RANDOM_FOREST\",\n",
" \"TRANSFORMER\",\n",
" \"NAIVE_BAYES\",\n",
" \"XGBOOST\",\n",
"]\n",
"df[\"model_family\"] = pd.Categorical(\n",
" df[\"model_family\"], categories=order_family, ordered=True\n",
")\n",
"\n",
"# Summary table (subset of most relevant columns)\n",
"summary_cols = [\n",
" \"experiment_id\",\"model_family\",\"feature_set\",\n",
" \"train_accuracy\",\"test_accuracy\",\"cv_accuracy_mean\",\"cv_accuracy_std\",\n",
" \"acc_95ci_lo\",\"acc_95ci_hi\",\n",
" \"balanced_accuracy\",\"MCC\",\n",
" \"train_minus_test_gap\",\"test_minus_cv_gap\",\n",
" \"n_test\"\n",
" \"experiment_id\",\n",
" \"model_family\",\n",
" \"feature_set\",\n",
" \"train_accuracy\",\n",
" \"test_accuracy\",\n",
" \"cv_accuracy_mean\",\n",
" \"cv_accuracy_std\",\n",
" \"acc_95ci_lo\",\n",
" \"acc_95ci_hi\",\n",
" \"balanced_accuracy\",\n",
" \"MCC\",\n",
" \"train_minus_test_gap\",\n",
" \"test_minus_cv_gap\",\n",
" \"n_test\",\n",
"]\n",
"summary = df[summary_cols].sort_values([\"model_family\",\"feature_set\",\"test_accuracy\"], ascending=[True, True, False]).reset_index(drop=True)\n",
"summary = (\n",
" df[summary_cols]\n",
" .sort_values(\n",
" [\"model_family\", \"feature_set\", \"test_accuracy\"], ascending=[True, True, False]\n",
" )\n",
" .reset_index(drop=True)\n",
")\n",
"\n",
"# Display the master summary table\n",
"display(summary)"
@@ -171,25 +223,37 @@
"# Build a pivot for plotting\n",
"plot_df = df.dropna(subset=[\"test_accuracy\"]).copy()\n",
"# Prepare positions\n",
"families = [f for f in order_family if f in plot_df[\"model_family\"].astype(str).unique()]\n",
"features = [f for f in order_features if f in plot_df[\"feature_set\"].astype(str).unique()]\n",
"families = [\n",
" f for f in order_family if f in plot_df[\"model_family\"].astype(str).unique()\n",
"]\n",
"features = [\n",
" f for f in order_features if f in plot_df[\"feature_set\"].astype(str).unique()\n",
"]\n",
"\n",
"# Bar positions\n",
"x = np.arange(len(families))\n",
"width = 0.8 / max(1,len(features)) # total width split by features\n",
"width = 0.8 / max(1, len(features)) # total width split by features\n",
"\n",
"fig1 = plt.figure(figsize=(10,6))\n",
"fig1 = plt.figure(figsize=(10, 6))\n",
"for i, feat in enumerate(features):\n",
" sub = plot_df[plot_df[\"feature_set\"].astype(str)==feat]\n",
" sub = plot_df[plot_df[\"feature_set\"].astype(str) == feat]\n",
" # Align to families\n",
" y = []\n",
" yerr = [[], []] # lower and upper errors for asymmetric CI\n",
" for fam in families:\n",
" row = sub[sub[\"model_family\"].astype(str)==fam]\n",
" row = sub[sub[\"model_family\"].astype(str) == fam]\n",
" if len(row):\n",
" val = float(row.iloc[0][\"test_accuracy\"])\n",
" lo = float(row.iloc[0][\"acc_95ci_lo\"]) if pd.notna(row.iloc[0][\"acc_95ci_lo\"]) else np.nan\n",
" hi = float(row.iloc[0][\"acc_95ci_hi\"]) if pd.notna(row.iloc[0][\"acc_95ci_hi\"]) else np.nan\n",
" lo = (\n",
" float(row.iloc[0][\"acc_95ci_lo\"])\n",
" if pd.notna(row.iloc[0][\"acc_95ci_lo\"])\n",
" else np.nan\n",
" )\n",
" hi = (\n",
" float(row.iloc[0][\"acc_95ci_hi\"])\n",
" if pd.notna(row.iloc[0][\"acc_95ci_hi\"])\n",
" else np.nan\n",
" )\n",
" else:\n",
" val, lo, hi = np.nan, np.nan, np.nan\n",
" y.append(val)\n",
@@ -201,7 +265,14 @@
" yerr[0].append(np.nan)\n",
" yerr[1].append(np.nan)\n",
"\n",
" plt.bar(x + i*width - (len(features)-1)*width/2, y, width, label=feat, yerr=yerr, capsize=4)\n",
" plt.bar(\n",
" x + i * width - (len(features) - 1) * width / 2,\n",
" y,\n",
" width,\n",
" label=feat,\n",
" yerr=yerr,\n",
" capsize=4,\n",
" )\n",
"\n",
"plt.xticks(x, families, rotation=0)\n",
"plt.ylabel(\"Test accuracy\")\n",
@@ -219,15 +290,15 @@
"metadata": {},
"outputs": [],
"source": [
"fig2 = plt.figure(figsize=(10,6))\n",
"fig2 = plt.figure(figsize=(10, 6))\n",
"for i, feat in enumerate(features):\n",
" sub = plot_df[plot_df[\"feature_set\"].astype(str)==feat]\n",
" sub = plot_df[plot_df[\"feature_set\"].astype(str) == feat]\n",
" y = []\n",
" for fam in families:\n",
" row = sub[sub[\"model_family\"].astype(str)==fam]\n",
" row = sub[sub[\"model_family\"].astype(str) == fam]\n",
" val = float(row.iloc[0][\"test_f1\"]) if len(row) else np.nan\n",
" y.append(val)\n",
" plt.bar(x + i*width - (len(features)-1)*width/2, y, width, label=feat)\n",
" plt.bar(x + i * width - (len(features) - 1) * width / 2, y, width, label=feat)\n",
"\n",
"plt.xticks(x, families, rotation=0)\n",
"plt.ylabel(\"Test F1\")\n",
@@ -245,14 +316,18 @@
"metadata": {},
"outputs": [],
"source": [
"fig3 = plt.figure(figsize=(7,7))\n",
"fig3 = plt.figure(figsize=(7, 7))\n",
"for feat in features:\n",
" sub = df[df[\"feature_set\"].astype(str)==feat]\n",
" sub = df[df[\"feature_set\"].astype(str) == feat]\n",
" plt.scatter(sub[\"train_accuracy\"], sub[\"test_accuracy\"], label=feat)\n",
"# y=x reference\n",
"lims = [min(df[\"train_accuracy\"].min(), df[\"test_accuracy\"].min())-0.02, max(df[\"train_accuracy\"].max(), df[\"test_accuracy\"].max())+0.02]\n",
"lims = [\n",
" min(df[\"train_accuracy\"].min(), df[\"test_accuracy\"].min()) - 0.02,\n",
" max(df[\"train_accuracy\"].max(), df[\"test_accuracy\"].max()) + 0.02,\n",
"]\n",
"plt.plot(lims, lims, linestyle=\"--\")\n",
"plt.xlim(lims); plt.ylim(lims)\n",
"plt.xlim(lims)\n",
"plt.ylim(lims)\n",
"plt.xlabel(\"Train accuracy\")\n",
"plt.ylabel(\"Test accuracy\")\n",
"plt.title(\"Overfitting analysis: Train vs Test accuracy\")\n",
@@ -268,22 +343,24 @@
"metadata": {},
"outputs": [],
"source": [
"best_rows = df.sort_values(\"test_accuracy\", ascending=False).groupby(\"feature_set\").head(1)\n",
"best_rows = (\n",
" df.sort_values(\"test_accuracy\", ascending=False).groupby(\"feature_set\").head(1)\n",
")\n",
"for _, row in best_rows.iterrows():\n",
" cm = np.array([[row[\"TN\"], row[\"FP\"]], [row[\"FN\"], row[\"TP\"]]], dtype=float)\n",
" if np.isnan(cm).any():\n",
" continue\n",
" fig = plt.figure(figsize=(5,5))\n",
" fig = plt.figure(figsize=(5, 5))\n",
" im = plt.imshow(cm, interpolation=\"nearest\")\n",
" plt.title(f\"Confusion Matrix — {row['model_family']} ({row['feature_set']})\")\n",
" plt.xticks([0,1], [\"Pred: f\",\"Pred: m\"])\n",
" plt.yticks([0,1], [\"True: f\",\"True: m\"])\n",
" plt.xticks([0, 1], [\"Pred: f\", \"Pred: m\"])\n",
" plt.yticks([0, 1], [\"True: f\", \"True: m\"])\n",
" # Annotate counts and rates\n",
" total = cm.sum()\n",
" for i in range(2):\n",
" for j in range(2):\n",
" val = cm[i,j]\n",
" plt.text(j, i, f\"{int(val)}\\n({val/total:.2%})\", ha=\"center\", va=\"center\")\n",
" val = cm[i, j]\n",
" plt.text(j, i, f\"{int(val)}\\n({val / total:.2%})\", ha=\"center\", va=\"center\")\n",
" plt.colorbar(im, fraction=0.046, pad=0.04)\n",
" plt.tight_layout()\n",
" plt.show()"
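The confusion-matrix cell above annotates each quadrant with its raw count and its share of all test samples. The annotation string can be sketched in isolation with hypothetical counts in the same `[[TN, FP], [FN, TP]]` layout:

```python
import numpy as np

# Hypothetical counts laid out as [[TN, FP], [FN, TP]], matching the notebook.
cm = np.array([[40.0, 10.0], [5.0, 45.0]])
total = cm.sum()

# Same f-string pattern as the cell: integer count plus percentage of total.
labels = [
    [f"{int(cm[i, j])}\n({cm[i, j] / total:.2%})" for j in range(2)]
    for i in range(2)
]
```

Normalising by the grand total (rather than per row) means the four percentages sum to 100%, which is what the notebook's annotation reports.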
@@ -298,34 +375,37 @@
"source": [
"deltas = []\n",
"for fam in families:\n",
" fam_rows = df[df[\"model_family\"].astype(str)==fam]\n",
" base = fam_rows[fam_rows[\"feature_set\"]==\"Native\"]\n",
" fam_rows = df[df[\"model_family\"].astype(str) == fam]\n",
" base = fam_rows[fam_rows[\"feature_set\"] == \"Native\"]\n",
" if len(base):\n",
" base_acc = float(base.iloc[0][\"test_accuracy\"])\n",
" for feat in [\"Full name\",\"Surname\"]:\n",
" tgt = fam_rows[fam_rows[\"feature_set\"]==feat]\n",
" for feat in [\"Full name\", \"Surname\"]:\n",
" tgt = fam_rows[fam_rows[\"feature_set\"] == feat]\n",
" if len(tgt):\n",
" deltas.append({\n",
" \"model_family\": fam,\n",
" \"comparison\": f\"{feat} minus Native\",\n",
" \"delta_accuracy\": float(tgt.iloc[0][\"test_accuracy\"]) - base_acc\n",
" })\n",
" deltas.append(\n",
" {\n",
" \"model_family\": fam,\n",
" \"comparison\": f\"{feat} minus Native\",\n",
" \"delta_accuracy\": float(tgt.iloc[0][\"test_accuracy\"])\n",
" - base_acc,\n",
" }\n",
" )\n",
"\n",
"deltas_df = pd.DataFrame(deltas)\n",
"display(deltas_df)\n",
"\n",
"fig5 = plt.figure(figsize=(10,6))\n",
"fig5 = plt.figure(figsize=(10, 6))\n",
"# Make bars grouped by model_family\n",
"comp_types = deltas_df[\"comparison\"].unique().tolist() if not deltas_df.empty else []\n",
"x2 = np.arange(len(families))\n",
"width2 = 0.8 / max(1, len(comp_types))\n",
"for i, comp in enumerate(comp_types):\n",
" sub = deltas_df[deltas_df[\"comparison\"]==comp]\n",
" sub = deltas_df[deltas_df[\"comparison\"] == comp]\n",
" y = []\n",
" for fam in families:\n",
" row = sub[sub[\"model_family\"]==fam]\n",
" row = sub[sub[\"model_family\"] == fam]\n",
" y.append(float(row.iloc[0][\"delta_accuracy\"]) if len(row) else np.nan)\n",
" plt.bar(x2 + i*width2 - (len(comp_types)-1)*width2/2, y, width2, label=comp)\n",
" plt.bar(x2 + i * width2 - (len(comp_types) - 1) * width2 / 2, y, width2, label=comp)\n",
"\n",
"plt.xticks(x2, families)\n",
"plt.axhline(0, linestyle=\"--\")\n",
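The summary columns `acc_95ci_lo` / `acc_95ci_hi` used throughout these cells hold a 95% confidence interval on test accuracy, but the interval's construction is not shown in this diff. One common choice for a proportion on `n_test` samples is the Wilson score interval, sketched here as an assumption rather than the notebook's actual method:

```python
import math

def wilson_ci(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a proportion such as test accuracy."""
    if n == 0:
        return (float("nan"), float("nan"))
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - half, centre + half)

lo, hi = wilson_ci(90, 100)  # e.g. 90% accuracy on 100 test samples
```

Unlike the normal approximation, the Wilson interval stays inside [0, 1] even at extreme accuracies, which keeps the asymmetric `yerr` bars in the plotting cell well defined.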
+1022 -107
File diff suppressed because one or more lines are too long
Generated
+60
@@ -710,6 +710,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/76/c6/c88e154df9c4e1a2a66ccf0005a88dfb2650c1dffb6f5ce603dfbd452ce3/idna-3.10-py3-none-any.whl", hash = "sha256:946d195a0d259cbba61165e88e65941f16e9b36ea6ddb97f00452bae8b1287d3", size = 70442, upload-time = "2024-09-15T18:07:37.964Z" },
]
[[package]]
name = "iniconfig"
version = "2.1.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/f2/97/ebf4da567aa6827c909642694d71c9fcf53e5b504f2d96afea02718862f3/iniconfig-2.1.0.tar.gz", hash = "sha256:3abbd2e30b36733fee78f9c7f7308f2d0050e88f0087fd25c2645f63c773e1c7", size = 4793, upload-time = "2025-03-19T20:09:59.721Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/2c/e1/e6716421ea10d38022b952c159d5161ca1193197fb744506875fbb87ea7b/iniconfig-2.1.0-py3-none-any.whl", hash = "sha256:9deba5723312380e77435581c6bf4935c94cbfab9b1ed33ef8d238ea168eb760", size = 6050, upload-time = "2025-03-19T20:10:01.071Z" },
]
[[package]]
name = "ipykernel"
version = "6.30.1"
@@ -1349,6 +1358,8 @@ dependencies = [
[package.dev-dependencies]
dev = [
{ name = "ipykernel" },
{ name = "pyright" },
{ name = "pytest" },
{ name = "ruff" },
]
@@ -1379,6 +1390,8 @@ requires-dist = [
[package.metadata.requires-dev]
dev = [
{ name = "ipykernel", specifier = ">=6.30.1" },
{ name = "pyright", specifier = ">=1.1.406" },
{ name = "pytest", specifier = ">=8.4.2" },
{ name = "ruff", specifier = ">=0.13.3" },
]
@@ -1400,6 +1413,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/eb/8d/776adee7bbf76365fdd7f2552710282c79a4ead5d2a46408c9043a2b70ba/networkx-3.5-py3-none-any.whl", hash = "sha256:0030d386a9a06dee3565298b4a734b68589749a544acbb6c412dc9e2489ec6ec", size = 2034406, upload-time = "2025-05-29T11:35:04.961Z" },
]
[[package]]
name = "nodeenv"
version = "1.9.1"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/43/16/fc88b08840de0e0a72a2f9d8c6bae36be573e475a6326ae854bcc549fc45/nodeenv-1.9.1.tar.gz", hash = "sha256:6ec12890a2dab7946721edbfbcd91f3319c6ccc9aec47be7c7e6b7011ee6645f", size = 47437, upload-time = "2024-06-04T18:44:11.171Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/d2/1d/1b658dbd2b9fa9c4c9f32accbfc0205d532c8c6194dc0f2a4c0428e7128a/nodeenv-1.9.1-py2.py3-none-any.whl", hash = "sha256:ba11c9782d29c27c70ffbdda2d7415098754709be8a7056d79a737cd901155c9", size = 22314, upload-time = "2024-06-04T18:44:08.352Z" },
]
[[package]]
name = "numpy"
version = "2.3.3"
@@ -1734,6 +1756,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/3f/93/023955c26b0ce614342d11cc0652f1e45e32393b6ab9d11a664a60e9b7b7/plotly-6.3.1-py3-none-any.whl", hash = "sha256:8b4420d1dcf2b040f5983eed433f95732ed24930e496d36eb70d211923532e64", size = 9833698, upload-time = "2025-10-02T16:10:22.584Z" },
]
[[package]]
name = "pluggy"
version = "1.6.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/f9/e2/3e91f31a7d2b083fe6ef3fa267035b518369d9511ffab804f839851d2779/pluggy-1.6.0.tar.gz", hash = "sha256:7dcc130b76258d33b90f61b658791dede3486c3e6bfb003ee5c9bfb396dd22f3", size = 69412, upload-time = "2025-05-15T12:30:07.975Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/54/20/4d324d65cc6d9205fabedc306948156824eb9f0ee1633355a8f7ec5c66bf/pluggy-1.6.0-py3-none-any.whl", hash = "sha256:e920276dd6813095e9377c0bc5566d94c932c33b27a3e3945d8389c374dd4746", size = 20538, upload-time = "2025-05-15T12:30:06.134Z" },
]
[[package]]
name = "preshed"
version = "3.0.10"
@@ -2079,6 +2110,35 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/15/73/a7141a1a0559bf1a7aa42a11c879ceb19f02f5c6c371c6d57fd86cefd4d1/pyproj-3.7.2-cp314-cp314t-win_arm64.whl", hash = "sha256:d9d25bae416a24397e0d85739f84d323b55f6511e45a522dd7d7eae70d10c7e4", size = 6391844, upload-time = "2025-08-14T12:05:40.745Z" },
]
[[package]]
name = "pyright"
version = "1.1.406"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "nodeenv" },
{ name = "typing-extensions" },
]
sdist = { url = "https://files.pythonhosted.org/packages/f7/16/6b4fbdd1fef59a0292cbb99f790b44983e390321eccbc5921b4d161da5d1/pyright-1.1.406.tar.gz", hash = "sha256:c4872bc58c9643dac09e8a2e74d472c62036910b3bd37a32813989ef7576ea2c", size = 4113151, upload-time = "2025-10-02T01:04:45.488Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/f6/a2/e309afbb459f50507103793aaef85ca4348b66814c86bc73908bdeb66d12/pyright-1.1.406-py3-none-any.whl", hash = "sha256:1d81fb43c2407bf566e97e57abb01c811973fdb21b2df8df59f870f688bdca71", size = 5980982, upload-time = "2025-10-02T01:04:43.137Z" },
]
[[package]]
name = "pytest"
version = "8.4.2"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "colorama", marker = "sys_platform == 'win32'" },
{ name = "iniconfig" },
{ name = "packaging" },
{ name = "pluggy" },
{ name = "pygments" },
]
sdist = { url = "https://files.pythonhosted.org/packages/a3/5c/00a0e072241553e1a7496d638deababa67c5058571567b92a7eaa258397c/pytest-8.4.2.tar.gz", hash = "sha256:86c0d0b93306b961d58d62a4db4879f27fe25513d4b969df351abdddb3c30e01", size = 1519618, upload-time = "2025-09-04T14:34:22.711Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/a8/a4/20da314d277121d6534b3a980b29035dcd51e6744bd79075a6ce8fa4eb8d/pytest-8.4.2-py3-none-any.whl", hash = "sha256:872f880de3fc3a5bdc88a11b39c9710c3497a547cfa9320bc3c5e62fbf272e79", size = 365750, upload-time = "2025-09-04T14:34:20.226Z" },
]
[[package]]
name = "python-dateutil"
version = "2.9.0.post0"