7 Commits

Author SHA1 Message Date
bernard-ng b7fc90ef71 [notebook] add results 2025-10-18 22:57:21 +02:00
Bernard Ngandu c463e6ed7e Rename model notation reference (#13) 2025-10-18 16:06:59 +02:00
Bernard Ngandu ad600ef565 Add technical architecture report (#12) 2025-10-18 15:43:28 +02:00
bernard-ng 8160bb0f6f docs: update README 2025-10-07 23:58:47 +02:00
bernard-ng f2ac0c9769 fix: add github workflow 2025-10-07 23:21:35 +02:00
bernard-ng d3b3840278 fix: nn models pad_sequences 2025-10-06 00:37:29 +02:00
bernard-ng cb22c06628 fix: remove svm model 2025-10-06 00:03:54 +02:00
47 changed files with 33447 additions and 30747 deletions
+35
@@ -0,0 +1,35 @@
name: audit
on:
  push:
    branches:
      - main
  pull_request:
jobs:
  bandit:
    name: bandit
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Install uv
        run: curl -LsSf https://astral.sh/uv/install.sh | sh
      - name: Cache uv dependencies
        uses: actions/cache@v4
        with:
          path: |
            ~/.cache/uv
            .venv
          key: ${{ runner.os }}-uv-${{ hashFiles('**/uv.lock') }}
          restore-keys: |
            ${{ runner.os }}-uv-
      - name: Sync dependencies (with dev tools)
        run: uv sync --dev
      - name: Run Bandit (security linter)
        run: uv run bandit -r . -c pyproject.toml || true
+40
@@ -0,0 +1,40 @@
name: quality
on:
  push:
    branches:
      - main
  pull_request:
jobs:
  lint:
    name: ruff and pyright
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Install uv
        run: curl -LsSf https://astral.sh/uv/install.sh | sh
      - name: Cache uv dependencies
        uses: actions/cache@v4
        with:
          path: |
            ~/.cache/uv
            .venv
          key: ${{ runner.os }}-uv-${{ hashFiles('**/uv.lock') }}
          restore-keys: |
            ${{ runner.os }}-uv-
      - name: Sync dependencies (with dev tools)
        run: uv sync --dev
      - name: Run Ruff (lint + format checks)
        run: |
          uv run ruff check .
          uv run ruff format --check .
      - name: Run Pyright (type checks)
        run: uv run pyright
+206
@@ -0,0 +1,206 @@
# Formal Model Specifications
This document formalises the statistical models implemented in
`src/ners/research/models`. Throughout, the training set is
$\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^N$ with labels
$y^{(i)} \in \{0,1\}$ for the binary gender classes. Feature vectors
$\mathbf{x}^{(i)}$ combine
* character $n$-gram count representations of name strings produced by
`CountVectorizer` or `TfidfVectorizer`, and
* engineered scalar or categorical metadata (e.g., name length, province)
that are either used directly or encoded by `LabelEncoder`.
For neural architectures, character or token sequences are converted into
integer index sequences using a `Tokenizer` before being padded to a
maximum length specified in the configuration. Predictions are returned as
class posterior probabilities via a softmax layer unless otherwise noted.
## Logistic Regression (`logistic_regression_model.py`)
**Feature map.** Character $n$-gram counts
$\phi(\mathbf{x}) \in \mathbb{R}^d$ obtained with
`CountVectorizer(analyzer="char", ngram_range=(2,4))` (default configuration).【F:src/ners/research/models/logistic_regression_model.py†L16-L46】
**Model.** The linear logit for class $1$ is
$z = \mathbf{w}^\top \phi(\mathbf{x}) + b$. The class posteriors are
$p(y=1\mid \mathbf{x}) = \sigma(z)$ and $p(y=0\mid \mathbf{x}) = 1 - \sigma(z)$
with $\sigma(u) = (1 + e^{-u})^{-1}$.
**Training objective.** Minimise the regularised negative log-likelihood
$$\mathcal{L}(\mathbf{w}, b) = -\sum_{i=1}^N \left[y^{(i)}\log p(y=1\mid \mathbf{x}^{(i)}) + (1-y^{(i)}) \log p(y=0\mid \mathbf{x}^{(i)})\right] + \lambda R(\mathbf{w}),$$
where $R$ is the penalty induced by the chosen solver (e.g., $\ell_2$ for
`liblinear`).
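The posterior and objective above can be evaluated directly. The following is a minimal NumPy sketch of the equations (illustrative only, not the project's scikit-learn implementation):

```python
import numpy as np

def sigmoid(u):
    # sigma(u) = (1 + e^{-u})^{-1}
    return 1.0 / (1.0 + np.exp(-u))

def regularised_nll(w, b, X, y, lam=1.0):
    # negative log-likelihood with an l2 penalty, matching the objective above
    p = sigmoid(X @ w + b)  # p(y=1 | x) per example
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll + lam * np.dot(w, w)
```

With $\mathbf{w} = \mathbf{0}$ and $b = 0$, every example receives $p = 0.5$ and the unregularised loss reduces to $N \log 2$, a useful sanity check when debugging training code.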
## Multinomial Naive Bayes (`naive_bayes_model.py`)
**Feature map.** Character $n$-gram counts
$\phi(\mathbf{x}) \in \mathbb{N}^d$ derived with
`CountVectorizer(analyzer="char", ngram_range=(2,4))` by default.【F:src/ners/research/models/naive_bayes_model.py†L15-L38】
**Generative model.** For each class $c \in \{0,1\}$, the class prior is
$\pi_c = \frac{N_c}{N}$. Conditional feature probabilities are estimated with
Laplace smoothing (parameter $\alpha$):
$$\theta_{cj} = \frac{N_{cj} + \alpha}{\sum_{k=1}^d (N_{ck} + \alpha)},$$
where $N_{cj}$ counts the total occurrences of feature $j$ among examples of
class $c$. The likelihood of an input with counts $\phi_j(\mathbf{x})$ is
$$p(\phi(\mathbf{x})\mid y=c) = \prod_{j=1}^d \theta_{cj}^{\phi_j(\mathbf{x})}.$$
**Inference.** Predict with the maximum a posteriori (MAP) decision
$\hat{y} = \arg\max_c \left\{ \log \pi_c + \sum_{j=1}^d \phi_j(\mathbf{x}) \log \theta_{cj} \right\}$.
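The smoothed estimates and MAP rule can be written out in a few lines of NumPy (an illustrative sketch of the formulas above, not `sklearn.naive_bayes.MultinomialNB` itself):

```python
import numpy as np

def nb_fit(X, y, alpha=1.0):
    # X: (N, d) integer n-gram counts; y: binary labels in {0, 1}
    log_prior = np.log(np.bincount(y, minlength=2) / len(y))
    counts = np.vstack([X[y == c].sum(axis=0) for c in (0, 1)])  # N_{cj}
    theta = (counts + alpha) / (counts + alpha).sum(axis=1, keepdims=True)
    return log_prior, np.log(theta)

def nb_predict(X, log_prior, log_theta):
    # MAP decision: argmax_c { log pi_c + sum_j phi_j(x) log theta_{cj} }
    return np.argmax(log_prior + X @ log_theta.T, axis=1)
```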
## Random Forest (`random_forest_model.py`)
**Feature map.** Concatenation of engineered numerical features and label
encoded categorical attributes produced on demand in
`prepare_features`.【F:src/ners/research/models/random_forest_model.py†L28-L71】
**Model.** An ensemble of $T$ decision trees $\{T_t\}_{t=1}^T$, each trained on
a bootstrap sample of the data with random feature sub-sampling. Each tree
outputs a class prediction $T_t(\mathbf{x}) \in \{0,1\}$. The forest prediction
is the mode of individual votes:
$$\hat{y} = \operatorname{mode}\{T_t(\mathbf{x}) : t = 1,\dots,T\}.$$
**Class probability.** For soft outputs,
$p(y=1\mid \mathbf{x}) = \frac{1}{T} \sum_{t=1}^T p_t(y=1\mid \mathbf{x})$ where
$p_t$ is the class distribution estimated at the leaf reached by
$\mathbf{x}$ in tree $t$.
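Both aggregation rules above reduce to simple array operations once each tree's leaf distribution is known. A minimal sketch (illustrative, not the `RandomForestClassifier` internals):

```python
import numpy as np

def forest_aggregate(tree_probs):
    # tree_probs: (T, 2) leaf class distributions, one row per tree
    soft = tree_probs.mean(axis=0)                   # averaged posterior
    votes = tree_probs.argmax(axis=1)                # each tree's hard prediction
    hard = np.bincount(votes, minlength=2).argmax()  # mode of the votes
    return soft, hard
```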
## LightGBM (`lightgbm_model.py`)
**Feature map.** Hybrid of numeric inputs, categorical label encodings, and
character $n$-gram counts expanded into dense columns and assembled into a
feature matrix persisted in `self.feature_columns`.【F:src/ners/research/models/lightgbm_model.py†L38-L118】
**Model.** Gradient boosted decision trees forming an additive function
$F_M(\mathbf{x}) = F_0(\mathbf{x}) + \sum_{m=1}^M \eta h_m(\mathbf{x})$, where $h_m$ denotes the
$m$-th tree, $F_0$ the initial prediction, and $\eta$ the learning rate.
**Training objective.** LightGBM minimises
$$\mathcal{L}^{(m)} = \sum_{i=1}^N \ell\big(y^{(i)}, F_{m-1}(\mathbf{x}^{(i)}) + h_m(\mathbf{x}^{(i)})\big) + \Omega(h_m),$$
using second-order Taylor approximations of the loss $\ell$ (binary log-loss by
default) and regulariser $\Omega$ determined by tree complexity constraints.
## XGBoost (`xgboost_model.py`)
**Feature map.** Combination of numeric metadata, categorical label encodings,
and character $n$-gram counts as described in `prepare_features`.【F:src/ners/research/models/xgboost_model.py†L41-L113】
**Model.** Additive ensemble of regression trees
$F_M(\mathbf{x}) = \sum_{m=1}^M f_m(\mathbf{x})$ with $f_m \in \mathcal{F}$, the
space of candidate regression trees.
**Training objective.** At boosting iteration $m$, minimise the regularised
objective
$$\mathcal{L}^{(m)} = \sum_{i=1}^N \ell\big(y^{(i)}, F_{m-1}(\mathbf{x}^{(i)}) + f_m(\mathbf{x}^{(i)})\big) + \Omega(f_m),$$
where $\Omega(f) = \gamma T_f + \tfrac{1}{2} \lambda \sum_{j=1}^{T_f} w_j^2$ penalises the
number of leaves $T_f$ and their scores $w_j$. The optimal leaf weights follow
$$w_j^{\star} = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda},$$
with $g_i$ and $h_i$ denoting first- and second-order gradients of the loss for
sample $i$.
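The closed-form leaf weight is easy to verify numerically. For binary log-loss, $g_i = p_i - y_i$ and $h_i = p_i(1 - p_i)$; the sketch below just evaluates the formula (illustrative, not XGBoost's implementation):

```python
import numpy as np

def optimal_leaf_weight(g, h, lam=1.0):
    # w*_j = -sum(g_i) / (sum(h_i) + lambda) over the samples in leaf j
    return -np.sum(g) / (np.sum(h) + lam)
```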
## Convolutional Neural Network (`cnn_model.py`)
**Input encoding.** Character-level token sequences padded to length
$L$ using `Tokenizer(char_level=True)` followed by `pad_sequences`.【F:src/ners/research/models/cnn_model.py†L23-L64】
**Architecture.** Embedding layer producing $X \in \mathbb{R}^{L \times d}$,
followed by two convolutional blocks:
1. $H^{(1)} = \operatorname{ReLU}(\operatorname{Conv1D}_{k_1}(X))$ with kernel size
$k_1$ and $F$ filters, then temporal max-pooling.
2. $H^{(2)} = \operatorname{ReLU}(\operatorname{Conv1D}_{k_2}(H^{(1)}))$ with kernel
size $k_2$ and $F$ filters.
Global max-pooling yields $h = \max_{t} H^{(2)}_{t,:}$, which passes through a
dense layer and dropout before the softmax layer producing
$p(y\mid \mathbf{x}) = \operatorname{softmax}(W h + b)$.
**Loss.** Cross-entropy between softmax output and the ground-truth label.
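The pooled classification head described above amounts to a max over time followed by a softmax; a minimal NumPy sketch of that step (illustrative, not the Keras model in `cnn_model.py`):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cnn_head(H2, W, b):
    # H2: (T, F) activations from the second conv block
    h = H2.max(axis=0)         # global max-pooling over time
    return softmax(W @ h + b)  # class posterior p(y | x)
```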
## Bidirectional GRU (`bigru_model.py`)
**Input encoding.** Word-level sequences padded to length $L$ with
`Tokenizer(char_level=False)` and `pad_sequences`.【F:src/ners/research/models/bigru_model.py†L47-L69】
**Recurrent dynamics.** A stacked bidirectional GRU computes forward and
backward hidden states according to
$$
\begin{aligned}
\mathbf{z}_t &= \sigma(W_z \mathbf{x}_t + U_z \mathbf{h}_{t-1} + \mathbf{b}_z),\\
\mathbf{r}_t &= \sigma(W_r \mathbf{x}_t + U_r \mathbf{h}_{t-1} + \mathbf{b}_r),\\
\tilde{\mathbf{h}}_t &= \tanh\big(W_h \mathbf{x}_t + U_h(\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{b}_h\big),\\
\mathbf{h}_t &= (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t.
\end{aligned}
$$
The final representation concatenates the last forward and backward states
before passing through dense layers and a softmax classifier.
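A single forward GRU step can be transcribed directly from the recurrence above (a NumPy sketch under the assumption that `P` holds the weight matrices and biases; not the Keras layer itself):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def gru_step(x_t, h_prev, P):
    # P maps names to the weight matrices/biases of the equations above
    z = sigmoid(P["Wz"] @ x_t + P["Uz"] @ h_prev + P["bz"])
    r = sigmoid(P["Wr"] @ x_t + P["Ur"] @ h_prev + P["br"])
    h_tilde = np.tanh(P["Wh"] @ x_t + P["Uh"] @ (r * h_prev) + P["bh"])
    return (1.0 - z) * h_prev + z * h_tilde
```

Because $\mathbf{h}_t$ is a convex combination of $\mathbf{h}_{t-1}$ and $\tanh(\cdot)$, the state stays in $[-1, 1]$ whenever the previous state does.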
## Bidirectional LSTM (`lstm_model.py`)
**Input encoding.** Word-level sequences padded to length $L$ using the same
pipeline as the BiGRU model.【F:src/ners/research/models/lstm_model.py†L45-L67】
**Recurrent dynamics.** At each timestep, the LSTM updates its memory cell via
$$
\begin{aligned}
\mathbf{i}_t &= \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i),\\
\mathbf{f}_t &= \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f),\\
\mathbf{o}_t &= \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + \mathbf{b}_o),\\
\tilde{\mathbf{c}}_t &= \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c),\\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t,\\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t).
\end{aligned}
$$
Bidirectional aggregation concatenates terminal forward/backward hidden vectors
before the dense-softmax head.
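As with the GRU, one LSTM timestep follows mechanically from the gate equations (a NumPy sketch assuming a parameter dictionary `P`; not the Keras layer):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def lstm_step(x_t, h_prev, c_prev, P):
    # gates follow the update equations above
    i = sigmoid(P["Wi"] @ x_t + P["Ui"] @ h_prev + P["bi"])
    f = sigmoid(P["Wf"] @ x_t + P["Uf"] @ h_prev + P["bf"])
    o = sigmoid(P["Wo"] @ x_t + P["Uo"] @ h_prev + P["bo"])
    c_tilde = np.tanh(P["Wc"] @ x_t + P["Uc"] @ h_prev + P["bc"])
    c_t = f * c_prev + i * c_tilde
    return o * np.tanh(c_t), c_t
```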
## Transformer Encoder (`transformer_model.py`)
**Input encoding.** Token sequences padded to a fixed length with positional
indices $\{0, \ldots, L-1\}$ added through a learned positional embedding.
`Tokenizer` initialises the vocabulary; padding uses `pad_sequences`.【F:src/ners/research/models/transformer_model.py†L25-L77】
**Architecture.** For hidden dimension $d$, the encoder block computes
$$
\begin{aligned}
Z^{(0)} &= X + P,\\
Z^{(1)} &= \operatorname{LayerNorm}\big(Z^{(0)} + \operatorname{Dropout}(\operatorname{MHAttn}(Z^{(0)}))\big),\\
Z^{(2)} &= \operatorname{LayerNorm}\big(Z^{(1)} + \operatorname{Dropout}(\phi(Z^{(1)} W_1 + b_1) W_2 + b_2)\big),
\end{aligned}
$$
where $\operatorname{MHAttn}$ is multi-head self-attention with
$H$ heads. Global average pooling produces a fixed-length vector for the dense
and dropout layers before the final softmax classifier.
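The core of $\operatorname{MHAttn}$ is scaled dot-product attention per head; a single-head NumPy sketch (illustrative, not the Keras `MultiHeadAttention` layer):

```python
import numpy as np

def scaled_dot_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V for one head
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)  # attention rows sum to 1
    return A @ V
```

Each output row is a convex combination of the value rows, which is why the result stays in the span of $V$.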
## Ensemble Voting (`ensemble_model.py`)
**Base learners.** A configurable set of pipelines that include character
$n$-gram vectorisers and classical classifiers (logistic regression,
random forest, naive Bayes).【F:src/ners/research/models/ensemble_model.py†L29-L96】
**Aggregation.** Given model posteriors $\mathbf{p}_j(\mathbf{x})$ and non-negative
weights $w_j$, the soft-voting ensemble predicts
$$p(y=c\mid \mathbf{x}) = \frac{1}{\sum_j w_j} \sum_j w_j \, p_j(y=c\mid \mathbf{x}).$$
Hard voting instead returns the mode of the individual hard predictions,
$\hat{y} = \operatorname{mode}\{\arg\max_c p_j(y=c\mid \mathbf{x}) : j = 1, \dots, J\}$ over the $J$ base learners.
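The weighted soft-voting rule is a one-line normalised average (a NumPy sketch of the formula above, not the project's `EnsembleModel`):

```python
import numpy as np

def soft_vote(probs, weights):
    # probs: (J, C) posteriors from J base learners; weights: length-J non-negative
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * np.asarray(probs)).sum(axis=0) / w.sum()
```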
+5
@@ -1,5 +1,10 @@
# A Culturally-Aware NLP System for Congolese Name Analysis and Gender Inference
[![audit](https://github.com/bernard-ng/drc-ners-nlp/actions/workflows/audit.yml/badge.svg)](https://github.com/bernard-ng/drc-ners-nlp/actions/workflows/audit.yml)
[![quality](https://github.com/bernard-ng/drc-ners-nlp/actions/workflows/quality.yml/badge.svg)](https://github.com/bernard-ng/drc-ners-nlp/actions/workflows/quality.yml)
---
Despite the growing success of gender inference models in Natural Language Processing (NLP), these tools often
underperform when applied to culturally diverse African contexts due to the lack of culturally-representative training
data.
+72
@@ -0,0 +1,72 @@
# Technical Architecture Report: DRC NERS NLP
## Project Overview
The DRC NERS NLP project delivers an end-to-end system for Congolese name analysis and gender inference backed by a 5-million-record dataset enriched with demographic metadata.【F:README.md†L1-L12】 The toolkit wraps a configurable processing pipeline, experiment runner, and Streamlit dashboard so that researchers and practitioners can reproducibly clean raw registry data, engineer features, benchmark multiple models, and publish insights without modifying core code.
```mermaid
flowchart LR
    A["Data ingestion<br/>DataLoader"] --> B["Preprocessing<br/>Batch pipeline"]
    B --> C["Feature extraction<br/>Name heuristics + NER"]
    C --> D["Model training<br/>Experiment runner"]
    D --> E["Evaluation<br/>Metrics + tracking"]
    E --> F["Visualization & Deployment<br/>Streamlit + exports"]
```
## Software Architecture and Implementation
### Configuration-driven Orchestration
* **Central config management** `ConfigManager` resolves layered YAML/JSON configs, injects project paths, and merges environment overrides, ensuring every workflow can be replayed with the same parameters.【F:src/ners/core/config/config_manager.py†L12-L157】
* **Pipeline definition** `PipelineConfig` captures stage order, batch settings, annotation parameters, and dataset splits. The default `pipeline.yaml` enumerates every stage and shared directories, creating a single source of truth for the runbook.【F:src/ners/core/config/pipeline_config.py†L10-L29】【F:config/pipeline.yaml†L1-L68】
* **Research templates** Pre-built experiment templates map feature sets and hyperparameters for each baseline architecture, allowing experiments to be reproduced or extended declaratively.【F:config/research_templates.yaml†L1-L86】
### Modular Data Pipeline
1. **Data ingestion** `DataLoader` streams CSV chunks with typed columns, optional balancing, and dataset size limits; it also writes artifacts using consistent encodings for downstream reuse.【F:src/ners/core/utils/data_loader.py†L33-L174】
2. **Batch processing engine** The `Pipeline` class wires ordered steps and delegates to a `BatchProcessor` that supports sequential or concurrent execution, checkpointing, and memory-aware concatenation via `MemoryMonitor` to handle multi-million row datasets safely.【F:src/ners/processing/pipeline.py†L12-L57】【F:src/ners/processing/batch/batch_processor.py†L12-L173】【F:src/ners/processing/batch/memory_monitor.py†L7-L25】
3. **Preprocessing steps**
* `DataCleaningStep` drops critical nulls, normalizes text, and deduplicates records.【F:src/ners/processing/steps/data_cleaning_step.py†L10-L31】
* `FeatureExtractionStep` engineers linguistic statistics, name segments, gender/category inference, region mapping, and spaCy-based tagging while optimizing dtypes to keep memory usage low.【F:src/ners/processing/steps/feature_extraction_step.py†L24-L196】
* `DataSelectionStep` enforces column whitelists and domain-specific filters (e.g., removing “global” regions for certain years).【F:src/ners/processing/steps/data_selection_step.py†L9-L60】
* `NERAnnotationStep` loads a spaCy model, parallelizes tagging with retries, and records provenance for each batch.【F:src/ners/processing/steps/ner_annotation_step.py†L13-L172】
* `LLMAnnotationStep` calls an Ollama-hosted model with configurable concurrency, rate limiting, and exponential backoff to enrich unannotated rows while maintaining checkpoints.【F:src/ners/processing/steps/llm_annotation_step.py†L18-L169】
* `DataSplittingStep` persists evaluation, gender, and province-specific splits in deterministic fashion, reusing the shared data loader for consistent I/O.【F:src/ners/processing/steps/data_splitting_step.py†L11-L69】
4. **Pipeline runner** `run_pipeline` composes the configured steps, captures progress metrics, and invokes the splitter to materialize curated datasets, turning raw CSVs into model-ready corpora with a single command.【F:src/ners/main.py†L14-L75】
5. **Operational visibility** `PipelineMonitor` inspects checkpoint state, estimates storage use, and exposes Typer commands for status, cleanup, or reset, simplifying long-running batch management.【F:src/ners/processing/monitoring/pipeline_monitor.py†L11-L196】【F:src/ners/cli.py†L156-L200】
### Research Experimentation Pipeline
* **CLI and templates** Typer-powered commands load configuration environments and instantiate experiments from templates, guaranteeing every run is traceable to a versioned config file.【F:src/ners/cli.py†L13-L146】
* **Experiment runner** `ExperimentRunner` fetches the featured dataset, applies filters, splits data, trains models from the registry, computes metrics, confusion matrices, and feature importance, then persists joblib artifacts per run.【F:src/ners/research/experiment/experiment_runner.py†L24-L271】
* **Model registry and abstractions** Traditional, neural, and ensemble estimators inherit from `BaseModel`, which standardizes feature extraction, training, persistence, and probability interfaces. For example, the logistic regression model couples a character n-gram vectorizer with solver-specific tuning guidance.【F:src/ners/research/base_model.py†L15-L210】【F:src/ners/research/models/logistic_regression_model.py†L1-L47】
* **Tracking and artifact management** `ExperimentTracker` records metadata, metrics, and tags in JSON for comparison/export, while `ModelTrainer` orchestrates runs, saves serialized models/configs, and generates learning curves for later visualization or deployment.【F:src/ners/research/experiment/experiment_tracker.py†L14-L200】【F:src/ners/research/model_trainer.py†L16-L200】
### Visualization and User Interfaces
* **Streamlit portal** `ners.web.app` bootstraps a Streamlit dashboard that shares the same configuration stack, giving analysts access to pipeline monitors, experiment summaries, and predictions through a browser-friendly UI.【F:src/ners/web/app.py†L1-L67】
* **Analysis utilities** The statistics package offers reusable seaborn/matplotlib plots (e.g., transition matrices, letter frequency charts) for exploratory studies, exporting intermediate CSVs alongside visuals.【F:src/ners/research/statistics/plots.py†L1-L39】
## Technology Stack and Environments
* **Core languages & libraries** Python 3.11 orchestrated by the `uv` package manager, with heavy use of pandas, NumPy, scikit-learn, joblib, spaCy, Streamlit, seaborn, and Typer across modules.【F:Dockerfile†L3-L48】【F:src/ners/research/experiment/experiment_runner.py†L6-L21】
* **LLM integration** The LLM annotation step leverages the Ollama client, optional rate limiting, and JSON-schema validation to keep third-party inference reproducible and auditable.【F:src/ners/processing/steps/llm_annotation_step.py†L21-L116】
* **Containerization** The Dockerfile provisions a slim Debian image with reproducible uv-managed environments, while `compose.yml` mounts configs/data and exposes Streamlit, providing parity between local and deployment setups.【F:Dockerfile†L3-L48】【F:compose.yml†L1-L23】
## Reproducibility and Automation
* All entry points (`ners pipeline`, `ners ner`, `ners research`, `ners monitor`) source environment-specific configs and return exit codes, enabling CI/CD and scheduled jobs to orchestrate pipelines reliably.【F:src/ners/cli.py†L13-L200】
* Version-controlled configs cover pipeline stages, annotations, prompts, and experiment templates; `ConfigManager` ensures default paths map to versioned `data/`, `models/`, and `outputs/` directories for each run.【F:src/ners/core/config/config_manager.py†L15-L111】【F:config/pipeline.yaml†L1-L68】
* Experiment artifacts (models, metrics, learning curves, exports) are stored per experiment ID with timestamps and hashes, easing regression comparisons and rollbacks.【F:src/ners/research/experiment/experiment_tracker.py†L45-L163】【F:src/ners/research/model_trainer.py†L109-L200】
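The layered-override behaviour described above (base config, then environment-specific overrides) is the key reproducibility mechanism. A purely illustrative sketch of such a merge, with hypothetical names rather than the project's actual `ConfigManager` API:

```python
def deep_merge(base, override):
    # Hypothetical helper: later config layers win, nested dicts merge recursively.
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```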
## Scalability and Performance Considerations
* Chunked reads, optimized dtypes, and optional stratified sampling keep ingestion memory-efficient even for multi-million row CSVs.【F:src/ners/core/utils/data_loader.py†L40-L161】
* Batch processing supports threaded or multiprocess execution with incremental checkpointing, enabling restarts mid-run and reducing wasted computation on failure.【F:src/ners/processing/batch/batch_processor.py†L29-L156】
* Memory monitoring and dtype normalization inside feature engineering prevent ballooning DataFrame footprints during annotation-heavy stages.【F:src/ners/processing/steps/feature_extraction_step.py†L53-L195】【F:src/ners/processing/batch/memory_monitor.py†L7-L25】
* Rate-limited concurrency in NER and LLM steps balances throughput with external service stability, retrying transient failures without blocking the whole run.【F:src/ners/processing/steps/ner_annotation_step.py†L50-L166】【F:src/ners/processing/steps/llm_annotation_step.py†L58-L163】
## Deployment and Interfaces
* **Command-line workflows** The Typer CLI exposes discrete subcommands for pipeline execution, NER dataset generation, research training, and checkpoint maintenance, simplifying automation scripts and developer onboarding.【F:src/ners/cli.py†L13-L200】
* **Web interface** Streamlit shares the same configuration context as the CLI and surfaces monitoring utilities, experiment tracking, and interactive analysis for non-technical stakeholders.【F:src/ners/web/app.py†L1-L67】
* **Containerized services** Docker Compose binds the CLI, configs, and data directories into a reproducible container, standardizing environment setup across OSes and enabling GPU-enabled hosts to mount device-specific resources if needed.【F:compose.yml†L1-L23】
## Summary of Design Choices
* Configuration-first design separates code from experiment definitions, allowing fast iteration without code changes.【F:src/ners/core/config/config_manager.py†L15-L157】【F:config/research_templates.yaml†L1-L86】
* Batch checkpoints, memory monitoring, and rate limiting deliver resilience against large-scale processing failures and external service hiccups.【F:src/ners/processing/batch/batch_processor.py†L29-L173】【F:src/ners/processing/steps/llm_annotation_step.py†L21-L169】
* Unified model abstractions, experiment tracking, and artifact exports make the research stack extensible and production-ready, with metrics and models stored alongside configuration for reproducible science.【F:src/ners/research/base_model.py†L15-L210】【F:src/ners/research/experiment/experiment_tracker.py†L14-L200】
* Streamlit dashboards and Typer commands democratize access, letting analysts trigger pipelines or inspect experiments without touching Python modules.【F:src/ners/cli.py†L13-L200】【F:src/ners/web/app.py†L1-L67】
By combining configuration-driven orchestration, modular batch processing, and standardized experiment tooling, the DRC NERS NLP project functions as a robust, reproducible pipeline capable of scaling from exploratory research to production-ready deployments.
+2 -2
@@ -1,3 +1,3 @@
 category,l2,kl_mf,kl_fm,jsd,permutation_p_value
-names,0.3189041485139616,0.04320097944655348,0.0215380760498496,0.03236952774820154,0.978
-surnames,1.2770018925640299,0.2936188220992242,0.23989460296618093,0.26675671253270256,0.001
+names,0.3189041485139616,0.04320097944655348,0.0215380760498496,0.03236952774820154,0.977
+surnames,1.2770018925640299,0.2936188220992242,0.23989460296618093,0.26675671253270256,0.002
-93
@@ -1,93 +0,0 @@
# Model Notation Reference
This document summarises the mathematical formulation and notation behind the models available in `research/models`. In all cases, the input example is represented by a feature vector $\mathbf{x}$ (after any feature-extraction or vectorisation steps) and the target label belongs to a finite set of classes $\mathcal{Y}$.
## Logistic Regression
- Decision function: $z = \mathbf{w}^\top \mathbf{x} + b$.
- Binary posterior: $p(y=1\mid \mathbf{x}) = \sigma(z) = \frac{1}{1 + e^{-z}}$ and $p(y=0\mid \mathbf{x}) = 1 - \sigma(z)$.
- Multi-class (one-vs-rest or softmax): $p(y=c\mid \mathbf{x}) = \frac{\exp(\mathbf{w}_c^\top \mathbf{x} + b_c)}{\sum_{k \in \mathcal{Y}} \exp(\mathbf{w}_k^\top \mathbf{x} + b_k)}$.
- Loss: negative log-likelihood $\mathcal{L} = -\sum_i \log p(y_i\mid \mathbf{x}_i)$ plus regularisation when configured.
- Gender prediction rationale: linear decision boundaries over character n-gram counts provide a strong, interpretable baseline for name-based gender attribution.
- Implementation notes: uses character n-grams via `CountVectorizer`; `solver='liblinear'` with optional `class_weight` and `n_jobs` to speed up sparse optimization.
## Multinomial Naive Bayes
- Class prior: $p(y=c) = \frac{N_c}{N}$ where $N_c$ counts training instances in class $c$.
- Conditional likelihood (bag-of-ngrams): $p(\mathbf{x}\mid y=c) = \prod_{j=1}^{d} p(x_j\mid y=c)^{x_j}$ with categorical parameters estimated via Laplace smoothing.
- Posterior up to normalisation: $\log p(y=c\mid \mathbf{x}) \propto \log p(y=c) + \sum_{j=1}^{d} x_j \log p(x_j\mid y=c)$.
- Gender prediction rationale: captures the relative frequency of character patterns associated with each gender, giving a fast and robust probabilistic baseline for sparse n-gram features.
- Implementation notes: character n-gram counts with Laplace smoothing; extremely fast to train and deploy.
## Support Vector Machine (RBF Kernel)
- Dual-form decision function: $f(\mathbf{x}) = \operatorname{sign}\Big( \sum_{i=1}^{M} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \Big)$.
- RBF kernel: $K(\mathbf{x}_i, \mathbf{x}) = \exp\big(-\gamma \lVert \mathbf{x}_i - \mathbf{x} \rVert_2^2\big)$.
- Soft-margin optimisation: $\min_{\mathbf{w}, \xi} \frac{1}{2}\lVert \mathbf{w} \rVert_2^2 + C \sum_i \xi_i$ s.t. $y_i(\mathbf{w}^\top \phi(\mathbf{x}_i) + b) \geq 1 - \xi_i$, $\xi_i \geq 0$.
- Gender prediction rationale: non-linear kernels model subtle character-pattern interactions beyond linear baselines, improving separability when male and female names share prefixes but diverge in internal structure.
- Implementation notes: TF-IDF character features; increased `cache_size` and optional `class_weight` for stability on imbalanced data.
## Random Forest
- Ensemble of $T$ decision trees: $\hat{y} = \operatorname{mode}\{ T_t(\mathbf{x}) : t=1, \dots, T \}$ for classification.
- Each tree draws a bootstrap sample of the training set and a random subset of features at each split.
- Feature importance (used in implementation): mean decrease in impurity aggregated over splits per feature.
- Gender prediction rationale: handles heterogeneous engineered features (length, province, endings) without heavy preprocessing, while delivering interpretable feature-importance signals.
- Implementation notes: enables `n_jobs=-1` for parallel trees; persistent label encoders ensure stable categorical mappings.
## LightGBM (Gradient Boosted Trees)
- Additive model: $F_0(\mathbf{x}) = \hat{p}$ (initial prediction), $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta h_m(\mathbf{x})$.
- Each weak learner $h_m$ is a decision tree grown with leaf-wise strategy and depth constraint.
- Optimises differentiable loss (default: logistic) using first- and second-order gradients over data in each boosting iteration.
- Gender prediction rationale: excels with sparse categorical encodings and numerous engineered features, offering strong accuracy with manageable inference cost.
- Implementation notes: `objective='binary'`, `n_jobs=-1` for throughput; works well with compact character-gram features plus metadata.
## XGBoost
- Objective: $\mathcal{L}^{(t)} = \sum_{i} \ell(y_i, \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)) + \Omega(f_t)$ with regulariser $\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_j w_j^2$.
- Tree score expansion via second-order Taylor approximation; optimal leaf weight $w_j = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$ where $g_i$ and $h_i$ are gradient and Hessian statistics.
- Final prediction: $\hat{y}(\mathbf{x}) = \sum_{t=1}^{M} \eta f_t(\mathbf{x})$.
- Gender prediction rationale: strong regularisation and gradient-informed splits capture interactions between textual and metadata features; suited to high-stakes deployment when tuned carefully.
- Implementation notes: `tree_method='hist'`, `n_jobs=-1` for efficient CPU training; integrates engineered categorical encodings.
## Convolutional Neural Network (1D)
- Token/character embeddings produce $X \in \mathbb{R}^{L \times d}$.
- Convolution layer: $H^{(k)} = \operatorname{ReLU}(X * W^{(k)} + b^{(k)})$ where $*$ denotes 1D convolution with filter $W^{(k)}$.
- Pooling summarises temporal dimension (max or global max); dense layers map pooled vector to logits $\mathbf{z}$.
- Output probabilities: $p(y=c\mid \mathbf{x}) = \operatorname{softmax}_c(\mathbf{z})$; loss via cross-entropy.
- Gender prediction rationale: convolutional filters learn discriminative prefixes, suffixes, and intra-name motifs directly from characters, accommodating mixed-language inputs.
- Implementation notes: adds `SpatialDropout1D` on embeddings and `padding='same'` in conv layers for stability and length-invariance.
## Bidirectional GRU
- Forward GRU recursion: $\begin{aligned}
&\mathbf{z}_t = \sigma(W_z \mathbf{x}_t + U_z \mathbf{h}_{t-1} + \mathbf{b}_z),\\
&\mathbf{r}_t = \sigma(W_r \mathbf{x}_t + U_r \mathbf{h}_{t-1} + \mathbf{b}_r),\\
&\tilde{\mathbf{h}}_t = \tanh(W_h \mathbf{x}_t + U_h(\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{b}_h),\\
&\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t.
\end{aligned}$
- Backward GRU mirrors the recurrence from $t=L$ to $1$; final representation concatenates $[\mathbf{h}_L^{\rightarrow}; \mathbf{h}_1^{\leftarrow}]$ before dense layers and softmax output.
- Gender prediction rationale: bidirectional context processes character sequences in both directions, learning gender-specific morphemes appearing at any position within the name.
- Implementation notes: `Embedding(mask_zero=True)` propagates masks to GRUs, ignoring padding; optional `recurrent_dropout` reduces overfitting.
## LSTM
- Gates per timestep: $\begin{aligned}
&\mathbf{i}_t = \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i),\\
&\mathbf{f}_t = \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f),\\
&\mathbf{o}_t = \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + \mathbf{b}_o),\\
&\tilde{\mathbf{c}}_t = \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c),\\
&\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t,\\
&\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t).
\end{aligned}$
- Bidirectional stacking concatenates final hidden vectors before classification via softmax.
- Gender prediction rationale: long short-term memory cells model long-range dependencies within names, capturing compound structures common in multilingual gendered naming conventions.
- Implementation notes: `Embedding(mask_zero=True)` and `recurrent_dropout` regularise sequence modelling across padded batches.
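A single LSTM timestep, following the gate equations above term by term, looks like this in NumPy (an illustrative sketch with assumed parameter names, not the repository's Keras code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    # Input, forget, and output gates
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])
    # Candidate cell state and cell/hidden updates
    c_tilde = np.tanh(p["Wc"] @ x + p["Uc"] @ h_prev + p["bc"])
    c = f * c_prev + i * c_tilde
    h = o * np.tanh(c)
    return h, c

def lstm_encode(X, p):
    """Run the recurrence over X (L, d); return the final hidden state."""
    h = np.zeros_like(p["bi"])
    c = np.zeros_like(p["bi"])
    for x in X:
        h, c = lstm_step(x, h, c, p)
    return h
```

The additive cell update $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$ is what lets gradients flow across long name sequences without vanishing.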
## Transformer Encoder (Single Block)
- Input embeddings $X \in \mathbb{R}^{L \times d}$ plus positional embeddings $P$ produce $Z^{(0)} = X + P$.
- Multi-head self-attention: $\operatorname{MHAttn}(Z) = \operatorname{Concat}(\text{head}_1, \dots, \text{head}_H) W^O$ where $\text{head}_h = \operatorname{softmax}\big(\frac{Q_h K_h^\top}{\sqrt{d_k}}\big) V_h$ and $(Q_h, K_h, V_h) = (Z W_h^Q, Z W_h^K, Z W_h^V)$.
- Add & norm: $Z^{(1)} = \operatorname{LayerNorm}(Z^{(0)} + \operatorname{Dropout}(\operatorname{MHAttn}(Z^{(0)})))$.
- Position-wise feed-forward: $Z^{(2)} = \operatorname{LayerNorm}(Z^{(1)} + \operatorname{Dropout}(\phi(Z^{(1)} W_1 + b_1) W_2 + b_2))$, with activation $\phi(\cdot)$ (ReLU).
- Sequence pooling (global average) feeds dense layers and softmax classifier.
- Gender prediction rationale: self-attention captures global dependencies and shared subword units across names, and can outperform recurrent models when sufficient labelled data is available; with smaller datasets the risk of overfitting should be monitored.
- Implementation notes: `Embedding(mask_zero=True)` with learned positional embeddings; attention dropout (`attn_dropout`) and classifier dropout improve generalisation.
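The multi-head self-attention formula above can be sketched in NumPy as follows (a minimal single-block illustration; the per-head weight tensors and shapes are assumptions, and the repository's Keras layers handle masking and dropout on top of this):

```python
import numpy as np

def softmax_rows(z):
    # Row-wise stable softmax, so each attention row sums to 1
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Z, Wq, Wk, Wv, Wo):
    """Z: (L, d) inputs; Wq/Wk/Wv: (H, d, d_k) per-head projections;
    Wo: (H * d_k, d) output projection. Returns MHAttn(Z) of shape (L, d)."""
    H, d_k = Wq.shape[0], Wq.shape[2]
    heads = []
    for h in range(H):
        Q, K, V = Z @ Wq[h], Z @ Wk[h], Z @ Wv[h]   # (L, d_k) each
        A = softmax_rows(Q @ K.T / np.sqrt(d_k))    # (L, L) attention weights
        heads.append(A @ V)                         # (L, d_k) head output
    return np.concatenate(heads, axis=-1) @ Wo      # (L, d)
```

Every position attends to every other in one step, which is how the block captures dependencies between a name's prefix and suffix without recurrence.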
## Ensemble (Soft Voting)
- Base learners indexed by $j$ output probability vectors $\mathbf{p}_j(\mathbf{x})$.
- Aggregated prediction with weights $w_j$: $p(y=c\mid \mathbf{x}) = \frac{1}{\sum_j w_j} \sum_j w_j \, p_j(y=c\mid \mathbf{x})$.
- Hard voting variant predicts $\hat{y} = \operatorname{mode}\{ \hat{y}_j \}$, where $\hat{y}_j = \arg\max_c p_j(y=c\mid \mathbf{x})$.
- Gender prediction rationale: blends complementary inductive biases (linear, tree-based, neural) to reduce variance on ambiguous names; it remains effective provided the individual members are well calibrated.
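Both voting rules above reduce to a few lines of NumPy (an illustrative sketch of the aggregation formulas; function names are assumptions, not the repository's API):

```python
import numpy as np

def soft_vote(probs, weights):
    """Weighted average of per-learner probability vectors.
    probs: (J, C) array, one row p_j per base learner; weights: (J,)."""
    w = np.asarray(weights, dtype=float)
    P = np.asarray(probs, dtype=float)
    return (w[:, None] * P).sum(axis=0) / w.sum()

def hard_vote(probs):
    """Majority vote over each learner's argmax prediction."""
    preds = np.argmax(np.asarray(probs), axis=1)
    vals, counts = np.unique(preds, return_counts=True)
    return int(vals[np.argmax(counts)])
```

For example, two equally weighted learners outputting `[0.8, 0.2]` and `[0.4, 0.6]` soft-vote to `[0.6, 0.4]`, so the ensemble still predicts the first class even though the learners disagree on the argmax.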
+16
View File
@@ -37,5 +37,21 @@ build-backend = "uv_build"
[dependency-groups]
dev = [
"ipykernel>=6.30.1",
"pyright>=1.1.406",
"pytest>=8.4.2",
"ruff>=0.13.3",
]
[tool.pyright]
pythonVersion = "3.11"
typeCheckingMode = "basic"
reportMissingImports = "none"
reportMissingModuleSource = "none"
useLibraryCodeForTypes = true
include = ["src"]
[tool.ruff]
# Keep defaults and additionally ignore notebooks
extend-exclude = [
"**/*.ipynb",
]
+23 -4
View File
@@ -118,12 +118,31 @@ def research_train(
exp_cfg = exp_builder.find_template(tmpl, name, type)
trainer = ModelTrainer(cfg)
# Validate and coerce template fields to expected types for type safety
model_name = exp_cfg.get("name")
model_type = exp_cfg.get("model_type")
features = exp_cfg.get("features")
tags = exp_cfg.get("tags", [])
if not isinstance(model_name, str) or not isinstance(model_type, str):
raise typer.BadParameter(
"Template must include 'name' and 'model_type' as strings"
)
if features is None:
features = ["full_name"]
elif not isinstance(features, list):
raise typer.BadParameter("Template 'features' must be a list of strings")
if not isinstance(tags, list):
tags = []
trainer.train_single_model(
model_name=exp_cfg.get("name"),
model_type=exp_cfg.get("model_type"),
features=exp_cfg.get("features"),
model_name=model_name,
model_type=model_type,
features=features,
model_params=exp_cfg.get("model_params", {}),
tags=exp_cfg.get("tags", []),
tags=tags,
)
+5 -3
View File
@@ -16,13 +16,13 @@ def get_config() -> PipelineConfig:
def load_config(config_path: Optional[Union[str, Path]] = None) -> PipelineConfig:
"""Load configuration from specified path"""
if config_path:
return config_manager.load_config(Path(config_path))
if config_path is not None:
return config_manager.load_config(config_path)
return config_manager.get_config()
def setup_config(
config_path: Optional[Path] = None, env: str = "development"
config_path: Optional[Union[str, Path]] = None, env: str = "development"
) -> PipelineConfig:
"""
Unified configuration loading and logging setup for all entrypoint scripts.
@@ -37,6 +37,8 @@ def setup_config(
# Determine config path
if config_path is None:
config_path = Path("config") / f"pipeline.{env}.yaml"
else:
config_path = Path(config_path)
# Load configuration
config = ConfigManager(config_path).load_config()
+14 -8
View File
@@ -13,7 +13,9 @@ class ConfigManager:
"""Centralized configuration management"""
def __init__(self, config_path: Optional[Union[str, Path]] = None):
self.config_path = config_path or self._find_config_file()
self.config_path: Path = (
Path(config_path) if config_path is not None else self._find_config_file()
)
self._config: Optional[PipelineConfig] = None
self._setup_default_paths()
@@ -47,10 +49,12 @@ class ConfigManager:
checkpoints_dir=root_dir / "data" / "checkpoints",
)
def load_config(self, config_path: Optional[Path] = None) -> PipelineConfig:
def load_config(
self, config_path: Optional[Union[str, Path]] = None
) -> PipelineConfig:
"""Load configuration from file"""
if config_path:
self.config_path = config_path
if config_path is not None:
self.config_path = Path(config_path)
if not self.config_path.exists():
logging.warning(
@@ -80,9 +84,11 @@ class ConfigManager:
"""Create default configuration"""
return PipelineConfig(paths=self.default_paths)
def save_config(self, config: PipelineConfig, path: Optional[Path] = None):
def save_config(
self, config: PipelineConfig, path: Optional[Union[str, Path]] = None
):
"""Save configuration to file"""
save_path = path or self.config_path
save_path = Path(path) if path is not None else self.config_path
save_path.parent.mkdir(parents=True, exist_ok=True)
config_dict = config.model_dump()
@@ -142,8 +148,8 @@ class ConfigManager:
env_config = self.load_config(env_config_path)
# Merge configurations
base_dict = base_config.dict()
env_dict = env_config.dict()
base_dict = base_config.model_dump()
env_dict = env_config.model_dump()
self._deep_update(base_dict, env_dict)
return PipelineConfig(**base_dict)
+2 -2
View File
@@ -260,9 +260,9 @@ class NameTagger:
# Remove overlaps
filtered, last_end = [], -1
for s, e, l in valid:
for s, e, label in valid:
if s >= last_end:
filtered.append((s, e, l))
filtered.append((s, e, label))
last_end = e
return filtered
+1 -1
View File
@@ -19,7 +19,7 @@ class PipelineState:
processed_batches: int = 0
total_batches: int = 0
failed_batches: List[int] = None
failed_batches: Optional[List[int]] = None
last_checkpoint: Optional[str] = None
def __post_init__(self):
@@ -21,7 +21,7 @@ class DataSelectionStep(PipelineStep):
if "region" in batch.columns and "year" in batch.columns:
target_years = {2015, 2021, 2022}
mask_remove = batch["region"].str.lower().eq("global") & batch["year"].isin(
target_years
list(target_years)
)
removed = int(mask_remove.sum())
if removed:
@@ -29,8 +29,8 @@ class FeatureExtractionStep(PipelineStep):
self.region_mapper = RegionMapper()
self.name_tagger = NameTagger()
@classmethod
def requires_batch_mutation(cls) -> bool:
@property
def requires_batch_mutation(self) -> bool:
"""This step creates new columns, so mutation is required"""
return True
+44 -26
View File
@@ -1,6 +1,6 @@
import logging
from abc import ABC, abstractmethod
from typing import Dict, Any, Optional, List
from typing import Dict, Any, Optional, List, TYPE_CHECKING, Union
import joblib
import matplotlib.pyplot as plt
@@ -9,19 +9,23 @@ import pandas as pd
from ners.research.experiment import ExperimentConfig
if TYPE_CHECKING:
from ners.research.experiment.feature_extractor import FeatureExtractor
from sklearn.preprocessing import LabelEncoder
class BaseModel(ABC):
"""Abstract base class for all models"""
def __init__(self, config: ExperimentConfig):
self.config = config
self.model = None
self.feature_extractor = None
self.label_encoder = None
self.tokenizer = None # For neural models
self.is_fitted = False
self.training_history = {} # Store training history for learning curves
self.learning_curve_data = {} # Store learning curve experiment data
self.model: Any | None = None
self.feature_extractor: "FeatureExtractor | None" = None
self.label_encoder: "LabelEncoder | None" = None
self.tokenizer: Any | None = None # For neural models
self.is_fitted: bool = False
self.training_history: Dict[str, Any] = {} # For learning curves
self.learning_curve_data: Dict[str, Any] = {}
@property
@abstractmethod
@@ -48,7 +52,7 @@ class BaseModel(ABC):
@abstractmethod
def generate_learning_curve(
self, X: pd.DataFrame, y: pd.Series, train_sizes: List[float] = None
self, X: pd.DataFrame, y: pd.Series, train_sizes: List[float] = []
) -> Dict[str, Any]:
"""Generate learning curve data for the model"""
pass
@@ -58,10 +62,17 @@ class BaseModel(ABC):
if not self.is_fitted:
raise ValueError("Model must be fitted before making predictions")
if (
self.feature_extractor is None
or self.model is None
or self.label_encoder is None
):
raise ValueError("Model is not fully initialized for prediction")
features_df = self.feature_extractor.extract_features(X)
X_prepared = self.prepare_features(features_df)
predictions = self.model.predict(X_prepared)
predictions: Union[np.ndarray, Any] = self.model.predict(X_prepared)
# Handle different prediction formats
if hasattr(predictions, "shape") and len(predictions.shape) > 1:
@@ -75,6 +86,9 @@ class BaseModel(ABC):
if not self.is_fitted:
raise ValueError("Model must be fitted before making predictions")
if self.feature_extractor is None or self.model is None:
raise ValueError("Model is not fully initialized for prediction")
features_df = self.feature_extractor.extract_features(X)
X_prepared = self.prepare_features(features_df)
@@ -83,7 +97,11 @@ class BaseModel(ABC):
elif hasattr(self.model, "predict"):
# For neural networks that return probabilities directly
probabilities = self.model.predict(X_prepared)
if len(probabilities.shape) == 2 and probabilities.shape[1] > 1:
if (
hasattr(probabilities, "shape")
and len(probabilities.shape) == 2
and probabilities.shape[1] > 1
):
return probabilities
raise NotImplementedError("Model does not support probability predictions")
@@ -91,30 +109,29 @@ class BaseModel(ABC):
def get_feature_importance(self) -> Optional[Dict[str, float]]:
"""Get feature importance if supported by the model"""
if hasattr(self.model, "feature_importances_"):
model = self.model
if model is None:
return None
if hasattr(model, "feature_importances_"):
# For tree-based models
importances = self.model.feature_importances_
importances = model.feature_importances_
feature_names = self._get_feature_names()
return dict(zip(feature_names, importances))
elif hasattr(self.model, "coef_"):
elif hasattr(model, "coef_"):
# For linear models
coefficients = np.abs(self.model.coef_[0])
coefficients = np.abs(model.coef_[0])
feature_names = self._get_feature_names()
return dict(zip(feature_names, coefficients))
elif (
hasattr(self.model, "named_steps")
and "classifier" in self.model.named_steps
):
elif hasattr(model, "named_steps") and "classifier" in model.named_steps:
# For sklearn pipelines (like LogisticRegression with vectorizer)
classifier = self.model.named_steps["classifier"]
classifier = model.named_steps["classifier"]
if hasattr(classifier, "coef_"):
coefficients = np.abs(classifier.coef_[0])
if hasattr(
self.model.named_steps["vectorizer"], "get_feature_names_out"
):
feature_names = self.model.named_steps[
if hasattr(model.named_steps["vectorizer"], "get_feature_names_out"):
feature_names = model.named_steps[
"vectorizer"
].get_feature_names_out()
# Take top features to avoid too many n-grams
@@ -127,8 +144,9 @@ class BaseModel(ABC):
def _get_feature_names(self) -> List[str]:
"""Get feature names (override in subclasses if needed)"""
if hasattr(self.model, "feature_names_in_"):
return list(self.model.feature_names_in_)
model = self.model
if model is not None and hasattr(model, "feature_names_in_"):
return list(model.feature_names_in_)
return [f"feature_{i}" for i in range(100)] # Default fallback
def save(self, path: str):
+1 -1
View File
@@ -70,7 +70,7 @@ class ExperimentStatus(Enum):
def calculate_metrics(
y_true: np.ndarray, y_pred: np.ndarray, metrics: List[str] = None
y_true: np.ndarray, y_pred: np.ndarray, metrics: Optional[List[str]] = None
) -> Dict[str, float]:
"""Calculate specified metrics"""
@@ -99,14 +99,24 @@ class ExperimentBuilder:
logging.warning(f"Unknown feature type: {feature_str}")
continue
name = (
template_config.get("name")
or template_config.get("model_type")
or "experiment"
)
model_type = template_config.get("model_type") or "logistic_regression"
description = template_config.get("description") or ""
return ExperimentConfig(
name=template_config.get("name"),
description=template_config.get("description"),
model_type=template_config.get("model_type"),
name=str(name),
description=str(description),
model_type=str(model_type),
features=features,
model_params=template_config.get("model_params", {}),
tags=template_config.get("tags", []),
test_size=template_config.get("test_size", 0.2),
cross_validation_folds=template_config.get("cross_validation_folds", 5),
test_size=float(template_config.get("test_size", 0.2)),
cross_validation_folds=int(
template_config.get("cross_validation_folds", 5)
),
train_data_filter=template_config.get("train_data_filter"),
)
@@ -1,5 +1,5 @@
from enum import Enum
from typing import List, Dict, Any, Union
from typing import List, Dict, Any, Union, Optional
import pandas as pd
@@ -25,7 +25,9 @@ class FeatureExtractor:
"""Extract different types of features from name data"""
def __init__(
self, feature_types: List[FeatureType], feature_params: Dict[str, Any] = None
self,
feature_types: List[FeatureType],
feature_params: Optional[Dict[str, Any]] = None,
):
self.feature_types = feature_types
self.feature_params = feature_params or {}
-2
View File
@@ -10,7 +10,6 @@ from ners.research.models.logistic_regression_model import LogisticRegressionMod
from ners.research.models.lstm_model import LSTMModel
from ners.research.models.naive_bayes_model import NaiveBayesModel
from ners.research.models.random_forest_model import RandomForestModel
from ners.research.models.svm_model import SVMModel
from ners.research.models.transformer_model import TransformerModel
from ners.research.models.xgboost_model import XGBoostModel
@@ -23,7 +22,6 @@ MODEL_REGISTRY = {
"lstm": LSTMModel,
"naive_bayes": NaiveBayesModel,
"random_forest": RandomForestModel,
"svm": SVMModel,
"transformer": TransformerModel,
"xgboost": XGBoostModel,
}
+5 -5
View File
@@ -1,7 +1,7 @@
import json
import logging
from datetime import datetime
from typing import List, Dict, Any
from typing import List, Dict, Any, Optional
import pandas as pd
@@ -30,9 +30,9 @@ class ModelTrainer:
self,
model_name: str,
model_type: str = "logistic_regression",
features: List[str] = None,
model_params: Dict[str, Any] = None,
tags: List[str] = None,
features: Optional[List[str]] = None,
model_params: Optional[Dict[str, Any]] = None,
tags: Optional[List[str]] = None,
save_artifacts: bool = True,
) -> str:
"""
@@ -106,7 +106,7 @@ class ModelTrainer:
logging.info(f"Completed training {len(experiment_ids)} models successfully")
return experiment_ids
def save_model_artifacts(self, experiment_id: str) -> Dict[str, str]:
def save_model_artifacts(self, experiment_id: str) -> Dict[str, Optional[str]]:
"""
Save model artifacts in a structured way for easy loading.
Returns paths to saved artifacts.
+12 -4
View File
@@ -13,7 +13,7 @@ from ners.research.neural_network_model import NeuralNetworkModel
class BiGRUModel(NeuralNetworkModel):
"""Bidirectional GRU model for name classification"""
def build_model_with_vocab(self, vocab_size: int, **kwargs) -> Any:
def build_model(self, vocab_size: int, **kwargs) -> Any:
params = kwargs
model = Sequential(
[
@@ -23,6 +23,7 @@ class BiGRUModel(NeuralNetworkModel):
input_dim=vocab_size,
output_dim=params.get("embedding_dim", 64),
mask_zero=True,
input_length=params.get("max_len", 6),
),
# First recurrent block returns full sequences to allow stacking.
# Moderate dropout + optional recurrent_dropout to reduce overfitting
@@ -32,7 +33,10 @@ class BiGRUModel(NeuralNetworkModel):
params.get("gru_units", 32),
return_sequences=True,
dropout=params.get("dropout", 0.2),
recurrent_dropout=params.get("recurrent_dropout", 0.0),
# Use a small non-zero recurrent_dropout by default to
# disable cuDNN path, which has strict right-padding mask
# requirements and can assert when using Bidirectional.
recurrent_dropout=params.get("recurrent_dropout", 0.1),
)
),
# Second GRU summarizes to the last hidden state (no return_sequences),
@@ -41,7 +45,7 @@ class BiGRUModel(NeuralNetworkModel):
GRU(
params.get("gru_units", 32),
dropout=params.get("dropout", 0.2),
recurrent_dropout=params.get("recurrent_dropout", 0.0),
recurrent_dropout=params.get("recurrent_dropout", 0.1),
)
),
# Small dense head; ReLU + dropout for capacity and regularization.
@@ -69,4 +73,8 @@ class BiGRUModel(NeuralNetworkModel):
sequences = self.tokenizer.texts_to_sequences(text_data)
max_len = self.config.model_params.get("max_len", 6)
return pad_sequences(sequences, maxlen=max_len, padding="post")
# Ensure padding and truncation are applied on the right to keep
# contiguous non-zero tokens on the left, matching RNN mask expectations.
return pad_sequences(
sequences, maxlen=max_len, padding="post", truncating="post"
)
+5 -2
View File
@@ -21,7 +21,7 @@ from ners.research.neural_network_model import NeuralNetworkModel
class CNNModel(NeuralNetworkModel):
"""1D Convolutional Neural Network for character patterns"""
def build_model_with_vocab(self, vocab_size: int, **kwargs) -> Any:
def build_model(self, vocab_size: int, **kwargs) -> Any:
"""Build CNN model with known vocabulary size"""
params = kwargs
@@ -83,4 +83,7 @@ class CNNModel(NeuralNetworkModel):
"max_len", 20
) # Longer for character level
return pad_sequences(sequences, maxlen=max_len, padding="post")
# Right-side padding and truncation ensure contiguous non-zero tokens on the left
return pad_sequences(
sequences, maxlen=max_len, padding="post", truncating="post"
)
+9 -4
View File
@@ -13,7 +13,7 @@ from ners.research.neural_network_model import NeuralNetworkModel
class LSTMModel(NeuralNetworkModel):
"""LSTM model for sequence learning"""
def build_model_with_vocab(self, vocab_size: int, **kwargs) -> Any:
def build_model(self, vocab_size: int, **kwargs) -> Any:
params = kwargs
model = Sequential(
[
@@ -30,7 +30,9 @@ class LSTMModel(NeuralNetworkModel):
params.get("lstm_units", 32),
return_sequences=True,
dropout=params.get("dropout", 0.2),
recurrent_dropout=params.get("recurrent_dropout", 0.0),
# Default to a small non-zero recurrent_dropout to avoid
# cuDNN mask assertions when masking with Bidirectional.
recurrent_dropout=params.get("recurrent_dropout", 0.1),
)
),
# Second LSTM condenses sequence to a fixed vector for classification.
@@ -38,7 +40,7 @@ class LSTMModel(NeuralNetworkModel):
LSTM(
params.get("lstm_units", 32),
dropout=params.get("dropout", 0.2),
recurrent_dropout=params.get("recurrent_dropout", 0.0),
recurrent_dropout=params.get("recurrent_dropout", 0.1),
)
),
# Compact dense head with dropout; sufficient capacity for name signals.
@@ -68,4 +70,7 @@ class LSTMModel(NeuralNetworkModel):
sequences = self.tokenizer.texts_to_sequences(text_data)
max_len = self.config.model_params.get("max_len", 6)
return pad_sequences(sequences, maxlen=max_len, padding="post")
# Right-side padding and truncation to preserve contiguous non-zero tokens
return pad_sequences(
sequences, maxlen=max_len, padding="post", truncating="post"
)
@@ -22,7 +22,7 @@ from ners.research.neural_network_model import NeuralNetworkModel
class TransformerModel(NeuralNetworkModel):
"""Transformer-based model"""
def build_model_with_vocab(self, vocab_size: int, **kwargs) -> Any:
def build_model(self, vocab_size: int, **kwargs) -> Any:
params = kwargs
# Use a single resolved max_len everywhere to avoid shape mismatches
max_len = int(params.get("max_len", 6))
@@ -88,4 +88,7 @@ class TransformerModel(NeuralNetworkModel):
sequences = self.tokenizer.texts_to_sequences(text_data)
max_len = int(self.config.model_params.get("max_len", 6))
return pad_sequences(sequences, maxlen=max_len, padding="post")
# Right-side padding and truncation for consistent masking/shape
return pad_sequences(
sequences, maxlen=max_len, padding="post", truncating="post"
)
+29 -8
View File
@@ -24,7 +24,7 @@ class NeuralNetworkModel(BaseModel):
return "neural_network"
@abstractmethod
def build_model_with_vocab(self, vocab_size: int, **kwargs) -> Any:
def build_model(self, vocab_size: int, **kwargs) -> Any:
"""Build neural network model with known vocabulary size"""
pass
@@ -86,9 +86,7 @@ class NeuralNetworkModel(BaseModel):
logging.info(f"Vocabulary size: {vocab_size}")
# Get additional model parameters
self.model = self.build_model_with_vocab(
vocab_size=vocab_size, **self.config.model_params
)
self.model = self.build_model(vocab_size=vocab_size, **self.config.model_params)
# Train the neural network
logging.info(
@@ -149,6 +147,29 @@ class NeuralNetworkModel(BaseModel):
if invalid_mask.any():
arr[invalid_mask] = oov_index
# Enforce strictly right-padded masks for RNN/cuDNN compatibility.
# Any zero appearing before the last non-zero in a sequence will be
# replaced with the OOV index so the mask remains contiguous True->False.
try:
nz = arr != 0 # non-padding tokens
if nz.ndim == 2 and arr.shape[1] > 0:
# Identify rows that have at least one non-zero
has_nz = nz.any(axis=1)
# Compute last non-zero position per row; if none, set to -1
indices = np.arange(arr.shape[1], dtype=np.int64)
# Max of indices where nz is True gives last non-zero
last_pos = (nz * indices).max(axis=1)
last_pos = np.where(has_nz, last_pos, -1)
# Broadcast to mark the left region up to last non-zero (inclusive)
left_region = indices <= last_pos[:, None]
# Zeros inside the left region are invalid padding -> set to OOV
zero_inside = (~nz) & left_region
if zero_inside.any():
arr[zero_inside] = oov_index
except Exception:
# Best-effort; skip if any unexpected broadcasting issue occurs
pass
# Use int32 for TF embedding ops compatibility
return arr.astype(np.int32, copy=False)
except Exception as e:
@@ -226,8 +247,8 @@ class NeuralNetworkModel(BaseModel):
max_len = self.config.model_params.get("max_len", 6)
for fold, (train_idx, val_idx) in enumerate(cv.split(X_prepared, y_encoded)):
# Create fresh model for each fold using build_model_with_vocab
fold_model = self.build_model_with_vocab(
# Create fresh model for each fold using build_model
fold_model = self.build_model(
vocab_size=vocab_size, max_len=max_len, **self.config.model_params
)
@@ -341,8 +362,8 @@ class NeuralNetworkModel(BaseModel):
val_scores = []
for seed in range(3): # 3 runs for variance
# Build fresh model using build_model_with_vocab
model = self.build_model_with_vocab(
# Build fresh model using build_model
model = self.build_model(
vocab_size=vocab_size, max_len=max_len, **self.config.model_params
)
+2 -6
View File
@@ -5,7 +5,8 @@ import numpy as np
import pandas as pd
from scipy.spatial.distance import euclidean
from scipy.stats import entropy
from typing import Dict, Any
from collections import Counter
from typing import Dict, Any, Literal
LETTERS = "abcdefghijklmnopqrstuvwxyz"
START_TOKEN = "^"
@@ -234,11 +235,6 @@ def build_transition_comparisons(
return out
import pandas as pd
from collections import Counter
from typing import Literal
def build_ngrams_count(
df: pd.DataFrame,
n: int,
+13 -4
View File
@@ -30,12 +30,21 @@ def train_from_template(
logging.info(f"Features: {experiment_config.get('features')}")
trainer = ModelTrainer(cfg)
name_val = experiment_config.get("name")
type_val = experiment_config.get("model_type")
features_val = experiment_config.get("features") or ["full_name"]
tags_val = experiment_config.get("tags", [])
if not isinstance(name_val, str) or not isinstance(type_val, str):
raise ValueError("Template must include 'name' and 'model_type' as strings")
if not isinstance(features_val, list):
raise ValueError("Template 'features' must be a list of strings")
trainer.train_single_model(
model_name=experiment_config.get("name"),
model_type=experiment_config.get("model_type"),
features=experiment_config.get("features"),
model_name=name_val,
model_type=type_val,
features=features_val,
model_params=experiment_config.get("model_params", {}),
tags=experiment_config.get("tags", []),
tags=tags_val if isinstance(tags_val, list) else [],
)
logging.info("Training completed successfully!")
+3 -1
View File
@@ -1 +1,3 @@
from .ner_testing import NERTesting
from .ner_testing import NERTesting as NERTesting
__all__ = ["NERTesting"]
+2 -2
View File
@@ -116,7 +116,7 @@ class Predictions:
try:
probabilities = model.predict_proba(input_df)[0]
return max(probabilities)
except:
except Exception:
return None
def _display_single_prediction_results(
@@ -209,7 +209,7 @@ class Predictions:
try:
probabilities = model.predict_proba(df)
df["confidence"] = np.max(probabilities, axis=1)
except:
except Exception:
df["confidence"] = None
st.success("Predictions completed!")
+584 -80
View File
File diff suppressed because one or more lines are too long
+167 -87
View File
@@ -73,60 +73,82 @@
" cv = exp.get(\"cv_metrics\", {}) or {}\n",
"\n",
" cm = exp.get(\"confusion_matrix\")\n",
" tn=fp=fn=tp=np.nan\n",
" if isinstance(cm, list) and len(cm)==2 and all(isinstance(r, list) and len(r)==2 for r in cm):\n",
" tn = fp = fn = tp = np.nan\n",
" if (\n",
" isinstance(cm, list)\n",
" and len(cm) == 2\n",
" and all(isinstance(r, list) and len(r) == 2 for r in cm)\n",
" ):\n",
" # By inspection of the provided metrics, mapping is:\n",
" # rows = true [f, m]; cols = pred [f, m]\n",
" tn, fp = cm[0][0], cm[0][1] # true negatives and false positives for positive class 'm'\n",
" tn, fp = (\n",
" cm[0][0],\n",
" cm[0][1],\n",
" ) # true negatives and false positives for positive class 'm'\n",
" fn, tp = cm[1][0], cm[1][1]\n",
"\n",
" # Derived metrics from confusion matrix (where present)\n",
" def safe_div(a,b): \n",
" return float(a)/float(b) if (b not in (0, None) and not pd.isna(b)) else np.nan\n",
" def safe_div(a, b):\n",
" return (\n",
" float(a) / float(b) if (b not in (0, None) and not pd.isna(b)) else np.nan\n",
" )\n",
"\n",
" sensitivity = safe_div(tp, tp+fn) # TPR for 'm'\n",
" specificity = safe_div(tn, tn+fp) # TNR for 'm'\n",
" sensitivity = safe_div(tp, tp + fn) # TPR for 'm'\n",
" specificity = safe_div(tn, tn + fp) # TNR for 'm'\n",
" balanced_acc = np.nanmean([sensitivity, specificity])\n",
" mcc_num = (tp*tn - fp*fn)\n",
" mcc_den = sqrt((tp+fp)*(tp+fn)*(tn+fp)*(tn+fn)) if all(x==x for x in [tp+fp, tp+fn, tn+fp, tn+fn]) else np.nan\n",
" mcc_num = tp * tn - fp * fn\n",
" mcc_den = (\n",
" sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))\n",
" if all(x == x for x in [tp + fp, tp + fn, tn + fp, tn + fn])\n",
" else np.nan\n",
" )\n",
" mcc = safe_div(mcc_num, mcc_den)\n",
"\n",
" n_test = exp.get(\"test_size\") or np.nansum([tn, fp, fn, tp])\n",
" test_acc = te.get(\"accuracy\", np.nan)\n",
" # 95% CI for accuracy via normal approximation (ok for n=2000)\n",
" if pd.notna(test_acc) and pd.notna(n_test) and n_test>0:\n",
" se = np.sqrt(test_acc*(1-test_acc)/n_test)\n",
" acc_ci_lo = test_acc - 1.96*se\n",
" acc_ci_hi = test_acc + 1.96*se\n",
" if pd.notna(test_acc) and pd.notna(n_test) and n_test > 0:\n",
" se = np.sqrt(test_acc * (1 - test_acc) / n_test)\n",
" acc_ci_lo = test_acc - 1.96 * se\n",
" acc_ci_hi = test_acc + 1.96 * se\n",
" else:\n",
" acc_ci_lo = acc_ci_hi = np.nan\n",
"\n",
" rows.append({\n",
" \"experiment_id\": exp_id,\n",
" \"model\": name or model_type,\n",
" \"model_family\": (model_type or \"\").upper(),\n",
" \"feature_set\": features,\n",
" \"train_accuracy\": tr.get(\"accuracy\", np.nan),\n",
" \"test_accuracy\": test_acc,\n",
" \"cv_accuracy_mean\": cv.get(\"accuracy\", np.nan),\n",
" \"cv_accuracy_std\": cv.get(\"accuracy_std\", np.nan),\n",
" \"train_f1\": tr.get(\"f1\", np.nan),\n",
" \"test_f1\": te.get(\"f1\", np.nan),\n",
" \"cv_f1_mean\": cv.get(\"f1\", np.nan),\n",
" \"cv_f1_std\": cv.get(\"f1_std\", np.nan),\n",
" \"TP\": tp, \"FP\": fp, \"TN\": tn, \"FN\": fn,\n",
" \"sensitivity_TPR_m\": sensitivity,\n",
" \"specificity_TNR_m\": specificity,\n",
" \"balanced_accuracy\": balanced_acc,\n",
" \"MCC\": mcc,\n",
" \"n_test\": n_test,\n",
" \"acc_95ci_lo\": acc_ci_lo,\n",
" \"acc_95ci_hi\": acc_ci_hi,\n",
" \"train_minus_test_gap\": (tr.get(\"accuracy\", np.nan) - test_acc) if pd.notna(tr.get(\"accuracy\", np.nan)) and pd.notna(test_acc) else np.nan,\n",
" \"test_minus_cv_gap\": (test_acc - cv.get(\"accuracy\", np.nan)) if pd.notna(test_acc) and pd.notna(cv.get(\"accuracy\", np.nan)) else np.nan,\n",
" \"start_time\": exp.get(\"start_time\"),\n",
" \"end_time\": exp.get(\"end_time\")\n",
" })\n",
" rows.append(\n",
" {\n",
" \"experiment_id\": exp_id,\n",
" \"model\": name or model_type,\n",
" \"model_family\": (model_type or \"\").upper(),\n",
" \"feature_set\": features,\n",
" \"train_accuracy\": tr.get(\"accuracy\", np.nan),\n",
" \"test_accuracy\": test_acc,\n",
" \"cv_accuracy_mean\": cv.get(\"accuracy\", np.nan),\n",
" \"cv_accuracy_std\": cv.get(\"accuracy_std\", np.nan),\n",
" \"train_f1\": tr.get(\"f1\", np.nan),\n",
" \"test_f1\": te.get(\"f1\", np.nan),\n",
" \"cv_f1_mean\": cv.get(\"f1\", np.nan),\n",
" \"cv_f1_std\": cv.get(\"f1_std\", np.nan),\n",
" \"TP\": tp,\n",
" \"FP\": fp,\n",
" \"TN\": tn,\n",
" \"FN\": fn,\n",
" \"sensitivity_TPR_m\": sensitivity,\n",
" \"specificity_TNR_m\": specificity,\n",
" \"balanced_accuracy\": balanced_acc,\n",
" \"MCC\": mcc,\n",
" \"n_test\": n_test,\n",
" \"acc_95ci_lo\": acc_ci_lo,\n",
" \"acc_95ci_hi\": acc_ci_hi,\n",
" \"train_minus_test_gap\": (tr.get(\"accuracy\", np.nan) - test_acc)\n",
" if pd.notna(tr.get(\"accuracy\", np.nan)) and pd.notna(test_acc)\n",
" else np.nan,\n",
" \"test_minus_cv_gap\": (test_acc - cv.get(\"accuracy\", np.nan))\n",
" if pd.notna(test_acc) and pd.notna(cv.get(\"accuracy\", np.nan))\n",
" else np.nan,\n",
" \"start_time\": exp.get(\"start_time\"),\n",
" \"end_time\": exp.get(\"end_time\"),\n",
" }\n",
" )\n",
"\n",
"df = pd.DataFrame(rows)"
]
@@ -139,23 +161,53 @@
"outputs": [],
"source": [
"# Clean and order categorical fields\n",
"df[\"feature_set\"] = df[\"feature_set\"].replace({\"full_name\":\"Full name\",\"native_name\":\"Native\",\"surname\":\"Surname\"})\n",
"order_features = [\"Full name\",\"Surname\",\"Native\"]\n",
"df[\"feature_set\"] = pd.Categorical(df[\"feature_set\"], categories=order_features, ordered=True)\n",
"df[\"feature_set\"] = df[\"feature_set\"].replace(\n",
" {\"full_name\": \"Full name\", \"native_name\": \"Native\", \"surname\": \"Surname\"}\n",
")\n",
"order_features = [\"Full name\", \"Surname\", \"Native\"]\n",
"df[\"feature_set\"] = pd.Categorical(\n",
" df[\"feature_set\"], categories=order_features, ordered=True\n",
")\n",
"\n",
"order_family = [\"LOGISTIC_REGRESSION\",\"LIGHTGBM\",\"LSTM\",\"CNN\",\"BIGRU\", \"RANDOM_FOREST\", \"TRANSFORMER\", \"NAIVE_BAYES\", \"XGBOOST\"]\n",
"df[\"model_family\"] = pd.Categorical(df[\"model_family\"], categories=order_family, ordered=True)\n",
"order_family = [\n",
" \"LOGISTIC_REGRESSION\",\n",
" \"LIGHTGBM\",\n",
" \"LSTM\",\n",
" \"CNN\",\n",
" \"BIGRU\",\n",
" \"RANDOM_FOREST\",\n",
" \"TRANSFORMER\",\n",
" \"NAIVE_BAYES\",\n",
" \"XGBOOST\",\n",
"]\n",
"df[\"model_family\"] = pd.Categorical(\n",
" df[\"model_family\"], categories=order_family, ordered=True\n",
")\n",
"\n",
"# Summary table (subset of most relevant columns)\n",
"summary_cols = [\n",
" \"experiment_id\",\"model_family\",\"feature_set\",\n",
" \"train_accuracy\",\"test_accuracy\",\"cv_accuracy_mean\",\"cv_accuracy_std\",\n",
" \"acc_95ci_lo\",\"acc_95ci_hi\",\n",
" \"balanced_accuracy\",\"MCC\",\n",
" \"train_minus_test_gap\",\"test_minus_cv_gap\",\n",
" \"n_test\"\n",
" \"experiment_id\",\n",
" \"model_family\",\n",
" \"feature_set\",\n",
" \"train_accuracy\",\n",
" \"test_accuracy\",\n",
" \"cv_accuracy_mean\",\n",
" \"cv_accuracy_std\",\n",
" \"acc_95ci_lo\",\n",
" \"acc_95ci_hi\",\n",
" \"balanced_accuracy\",\n",
" \"MCC\",\n",
" \"train_minus_test_gap\",\n",
" \"test_minus_cv_gap\",\n",
" \"n_test\",\n",
"]\n",
"summary = df[summary_cols].sort_values([\"model_family\",\"feature_set\",\"test_accuracy\"], ascending=[True, True, False]).reset_index(drop=True)\n",
"summary = (\n",
" df[summary_cols]\n",
" .sort_values(\n",
" [\"model_family\", \"feature_set\", \"test_accuracy\"], ascending=[True, True, False]\n",
" )\n",
" .reset_index(drop=True)\n",
")\n",
"\n",
"# Display the master summary table\n",
"display(summary)"
@@ -171,25 +223,37 @@
"# Build a pivot for plotting\n",
"plot_df = df.dropna(subset=[\"test_accuracy\"]).copy()\n",
"# Prepare positions\n",
"families = [f for f in order_family if f in plot_df[\"model_family\"].astype(str).unique()]\n",
"features = [f for f in order_features if f in plot_df[\"feature_set\"].astype(str).unique()]\n",
"families = [\n",
" f for f in order_family if f in plot_df[\"model_family\"].astype(str).unique()\n",
"]\n",
"features = [\n",
" f for f in order_features if f in plot_df[\"feature_set\"].astype(str).unique()\n",
"]\n",
"\n",
"# Bar positions\n",
"x = np.arange(len(families))\n",
"width = 0.8 / max(1,len(features)) # total width split by features\n",
"width = 0.8 / max(1, len(features)) # total width split by features\n",
"\n",
"fig1 = plt.figure(figsize=(10,6))\n",
"fig1 = plt.figure(figsize=(10, 6))\n",
"for i, feat in enumerate(features):\n",
" sub = plot_df[plot_df[\"feature_set\"].astype(str)==feat]\n",
" sub = plot_df[plot_df[\"feature_set\"].astype(str) == feat]\n",
" # Align to families\n",
" y = []\n",
" yerr = [[], []] # lower and upper errors for asymmetric CI\n",
" for fam in families:\n",
" row = sub[sub[\"model_family\"].astype(str)==fam]\n",
" row = sub[sub[\"model_family\"].astype(str) == fam]\n",
" if len(row):\n",
" val = float(row.iloc[0][\"test_accuracy\"])\n",
" lo = float(row.iloc[0][\"acc_95ci_lo\"]) if pd.notna(row.iloc[0][\"acc_95ci_lo\"]) else np.nan\n",
" hi = float(row.iloc[0][\"acc_95ci_hi\"]) if pd.notna(row.iloc[0][\"acc_95ci_hi\"]) else np.nan\n",
" lo = (\n",
" float(row.iloc[0][\"acc_95ci_lo\"])\n",
" if pd.notna(row.iloc[0][\"acc_95ci_lo\"])\n",
" else np.nan\n",
" )\n",
" hi = (\n",
" float(row.iloc[0][\"acc_95ci_hi\"])\n",
" if pd.notna(row.iloc[0][\"acc_95ci_hi\"])\n",
" else np.nan\n",
" )\n",
" else:\n",
" val, lo, hi = np.nan, np.nan, np.nan\n",
" y.append(val)\n",
@@ -201,7 +265,14 @@
" yerr[0].append(np.nan)\n",
" yerr[1].append(np.nan)\n",
"\n",
" plt.bar(x + i*width - (len(features)-1)*width/2, y, width, label=feat, yerr=yerr, capsize=4)\n",
" plt.bar(\n",
" x + i * width - (len(features) - 1) * width / 2,\n",
" y,\n",
" width,\n",
" label=feat,\n",
" yerr=yerr,\n",
" capsize=4,\n",
" )\n",
"\n",
"plt.xticks(x, families, rotation=0)\n",
"plt.ylabel(\"Test accuracy\")\n",
@@ -219,15 +290,15 @@
"metadata": {},
"outputs": [],
"source": [
"fig2 = plt.figure(figsize=(10,6))\n",
"fig2 = plt.figure(figsize=(10, 6))\n",
"for i, feat in enumerate(features):\n",
" sub = plot_df[plot_df[\"feature_set\"].astype(str)==feat]\n",
" sub = plot_df[plot_df[\"feature_set\"].astype(str) == feat]\n",
" y = []\n",
" for fam in families:\n",
" row = sub[sub[\"model_family\"].astype(str)==fam]\n",
" row = sub[sub[\"model_family\"].astype(str) == fam]\n",
" val = float(row.iloc[0][\"test_f1\"]) if len(row) else np.nan\n",
" y.append(val)\n",
" plt.bar(x + i*width - (len(features)-1)*width/2, y, width, label=feat)\n",
" plt.bar(x + i * width - (len(features) - 1) * width / 2, y, width, label=feat)\n",
"\n",
"plt.xticks(x, families, rotation=0)\n",
"plt.ylabel(\"Test F1\")\n",
@@ -245,14 +316,18 @@
"metadata": {},
"outputs": [],
"source": [
"fig3 = plt.figure(figsize=(7,7))\n",
"fig3 = plt.figure(figsize=(7, 7))\n",
"for feat in features:\n",
" sub = df[df[\"feature_set\"].astype(str)==feat]\n",
" sub = df[df[\"feature_set\"].astype(str) == feat]\n",
" plt.scatter(sub[\"train_accuracy\"], sub[\"test_accuracy\"], label=feat)\n",
"# y=x reference\n",
"lims = [min(df[\"train_accuracy\"].min(), df[\"test_accuracy\"].min())-0.02, max(df[\"train_accuracy\"].max(), df[\"test_accuracy\"].max())+0.02]\n",
"lims = [\n",
" min(df[\"train_accuracy\"].min(), df[\"test_accuracy\"].min()) - 0.02,\n",
" max(df[\"train_accuracy\"].max(), df[\"test_accuracy\"].max()) + 0.02,\n",
"]\n",
"plt.plot(lims, lims, linestyle=\"--\")\n",
"plt.xlim(lims); plt.ylim(lims)\n",
"plt.xlim(lims)\n",
"plt.ylim(lims)\n",
"plt.xlabel(\"Train accuracy\")\n",
"plt.ylabel(\"Test accuracy\")\n",
"plt.title(\"Overfitting analysis: Train vs Test accuracy\")\n",
@@ -268,22 +343,24 @@
"metadata": {},
"outputs": [],
"source": [
"best_rows = df.sort_values(\"test_accuracy\", ascending=False).groupby(\"feature_set\").head(1)\n",
"best_rows = (\n",
" df.sort_values(\"test_accuracy\", ascending=False).groupby(\"feature_set\").head(1)\n",
")\n",
"for _, row in best_rows.iterrows():\n",
" cm = np.array([[row[\"TN\"], row[\"FP\"]], [row[\"FN\"], row[\"TP\"]]], dtype=float)\n",
" if np.isnan(cm).any():\n",
" continue\n",
" fig = plt.figure(figsize=(5,5))\n",
" fig = plt.figure(figsize=(5, 5))\n",
" im = plt.imshow(cm, interpolation=\"nearest\")\n",
" plt.title(f\"Confusion Matrix — {row['model_family']} ({row['feature_set']})\")\n",
" plt.xticks([0,1], [\"Pred: f\",\"Pred: m\"])\n",
" plt.yticks([0,1], [\"True: f\",\"True: m\"])\n",
" plt.xticks([0, 1], [\"Pred: f\", \"Pred: m\"])\n",
" plt.yticks([0, 1], [\"True: f\", \"True: m\"])\n",
" # Annotate counts and rates\n",
" total = cm.sum()\n",
" for i in range(2):\n",
" for j in range(2):\n",
" val = cm[i,j]\n",
" plt.text(j, i, f\"{int(val)}\\n({val/total:.2%})\", ha=\"center\", va=\"center\")\n",
" val = cm[i, j]\n",
" plt.text(j, i, f\"{int(val)}\\n({val / total:.2%})\", ha=\"center\", va=\"center\")\n",
" plt.colorbar(im, fraction=0.046, pad=0.04)\n",
" plt.tight_layout()\n",
" plt.show()"
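The confusion-matrix cell above annotates each quadrant with its raw count and its share of all test samples. The annotation string can be sketched in isolation with hypothetical counts in the same `[[TN, FP], [FN, TP]]` layout:

```python
import numpy as np

# Hypothetical counts laid out as [[TN, FP], [FN, TP]], matching the notebook.
cm = np.array([[40.0, 10.0], [5.0, 45.0]])
total = cm.sum()

# Same f-string pattern as the cell: integer count plus percentage of total.
labels = [
    [f"{int(cm[i, j])}\n({cm[i, j] / total:.2%})" for j in range(2)]
    for i in range(2)
]
```

Normalising by the grand total (rather than per row) means the four percentages sum to 100%, which is what the notebook's annotation reports.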
@@ -298,34 +375,37 @@
"source": [
"deltas = []\n",
"for fam in families:\n",
" fam_rows = df[df[\"model_family\"].astype(str)==fam]\n",
" base = fam_rows[fam_rows[\"feature_set\"]==\"Native\"]\n",
" fam_rows = df[df[\"model_family\"].astype(str) == fam]\n",
" base = fam_rows[fam_rows[\"feature_set\"] == \"Native\"]\n",
" if len(base):\n",
" base_acc = float(base.iloc[0][\"test_accuracy\"])\n",
" for feat in [\"Full name\",\"Surname\"]:\n",
" tgt = fam_rows[fam_rows[\"feature_set\"]==feat]\n",
" for feat in [\"Full name\", \"Surname\"]:\n",
" tgt = fam_rows[fam_rows[\"feature_set\"] == feat]\n",
" if len(tgt):\n",
" deltas.append({\n",
" \"model_family\": fam,\n",
" \"comparison\": f\"{feat} minus Native\",\n",
" \"delta_accuracy\": float(tgt.iloc[0][\"test_accuracy\"]) - base_acc\n",
" })\n",
" deltas.append(\n",
" {\n",
" \"model_family\": fam,\n",
" \"comparison\": f\"{feat} minus Native\",\n",
" \"delta_accuracy\": float(tgt.iloc[0][\"test_accuracy\"])\n",
" - base_acc,\n",
" }\n",
" )\n",
"\n",
"deltas_df = pd.DataFrame(deltas)\n",
"display(deltas_df)\n",
"\n",
"fig5 = plt.figure(figsize=(10,6))\n",
"fig5 = plt.figure(figsize=(10, 6))\n",
"# Make bars grouped by model_family\n",
"comp_types = deltas_df[\"comparison\"].unique().tolist() if not deltas_df.empty else []\n",
"x2 = np.arange(len(families))\n",
"width2 = 0.8 / max(1, len(comp_types))\n",
"for i, comp in enumerate(comp_types):\n",
" sub = deltas_df[deltas_df[\"comparison\"]==comp]\n",
" sub = deltas_df[deltas_df[\"comparison\"] == comp]\n",
" y = []\n",
" for fam in families:\n",
" row = sub[sub[\"model_family\"]==fam]\n",
" row = sub[sub[\"model_family\"] == fam]\n",
" y.append(float(row.iloc[0][\"delta_accuracy\"]) if len(row) else np.nan)\n",
" plt.bar(x2 + i*width2 - (len(comp_types)-1)*width2/2, y, width2, label=comp)\n",
" plt.bar(x2 + i * width2 - (len(comp_types) - 1) * width2 / 2, y, width2, label=comp)\n",
"\n",
"plt.xticks(x2, families)\n",
"plt.axhline(0, linestyle=\"--\")\n",
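The summary columns `acc_95ci_lo` / `acc_95ci_hi` used throughout these cells hold a 95% confidence interval on test accuracy, but the interval's construction is not shown in this diff. One common choice for a proportion on `n_test` samples is the Wilson score interval, sketched here as an assumption rather than the notebook's actual method:

```python
import math

def wilson_ci(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a proportion such as test accuracy."""
    if n == 0:
        return (float("nan"), float("nan"))
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - half, centre + half)

lo, hi = wilson_ci(90, 100)  # e.g. 90% accuracy on 100 test samples
```

Unlike the normal approximation, the Wilson interval stays inside [0, 1] even at extreme accuracies, which keeps the asymmetric `yerr` bars in the plotting cell well defined.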
+1022 -107
File diff suppressed because one or more lines are too long
Generated
+60
@@ -710,6 +710,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/76/c6/c88e154df9c4e1a2a66ccf0005a88dfb2650c1dffb6f5ce603dfbd452ce3/idna-3.10-py3-none-any.whl", hash = "sha256:946d195a0d259cbba61165e88e65941f16e9b36ea6ddb97f00452bae8b1287d3", size = 70442, upload-time = "2024-09-15T18:07:37.964Z" },
]
[[package]]
name = "iniconfig"
version = "2.1.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/f2/97/ebf4da567aa6827c909642694d71c9fcf53e5b504f2d96afea02718862f3/iniconfig-2.1.0.tar.gz", hash = "sha256:3abbd2e30b36733fee78f9c7f7308f2d0050e88f0087fd25c2645f63c773e1c7", size = 4793, upload-time = "2025-03-19T20:09:59.721Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/2c/e1/e6716421ea10d38022b952c159d5161ca1193197fb744506875fbb87ea7b/iniconfig-2.1.0-py3-none-any.whl", hash = "sha256:9deba5723312380e77435581c6bf4935c94cbfab9b1ed33ef8d238ea168eb760", size = 6050, upload-time = "2025-03-19T20:10:01.071Z" },
]
[[package]]
name = "ipykernel"
version = "6.30.1"
@@ -1349,6 +1358,8 @@ dependencies = [
[package.dev-dependencies]
dev = [
{ name = "ipykernel" },
{ name = "pyright" },
{ name = "pytest" },
{ name = "ruff" },
]
@@ -1379,6 +1390,8 @@ requires-dist = [
[package.metadata.requires-dev]
dev = [
{ name = "ipykernel", specifier = ">=6.30.1" },
{ name = "pyright", specifier = ">=1.1.406" },
{ name = "pytest", specifier = ">=8.4.2" },
{ name = "ruff", specifier = ">=0.13.3" },
]
@@ -1400,6 +1413,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/eb/8d/776adee7bbf76365fdd7f2552710282c79a4ead5d2a46408c9043a2b70ba/networkx-3.5-py3-none-any.whl", hash = "sha256:0030d386a9a06dee3565298b4a734b68589749a544acbb6c412dc9e2489ec6ec", size = 2034406, upload-time = "2025-05-29T11:35:04.961Z" },
]
[[package]]
name = "nodeenv"
version = "1.9.1"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/43/16/fc88b08840de0e0a72a2f9d8c6bae36be573e475a6326ae854bcc549fc45/nodeenv-1.9.1.tar.gz", hash = "sha256:6ec12890a2dab7946721edbfbcd91f3319c6ccc9aec47be7c7e6b7011ee6645f", size = 47437, upload-time = "2024-06-04T18:44:11.171Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/d2/1d/1b658dbd2b9fa9c4c9f32accbfc0205d532c8c6194dc0f2a4c0428e7128a/nodeenv-1.9.1-py2.py3-none-any.whl", hash = "sha256:ba11c9782d29c27c70ffbdda2d7415098754709be8a7056d79a737cd901155c9", size = 22314, upload-time = "2024-06-04T18:44:08.352Z" },
]
[[package]]
name = "numpy"
version = "2.3.3"
@@ -1734,6 +1756,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/3f/93/023955c26b0ce614342d11cc0652f1e45e32393b6ab9d11a664a60e9b7b7/plotly-6.3.1-py3-none-any.whl", hash = "sha256:8b4420d1dcf2b040f5983eed433f95732ed24930e496d36eb70d211923532e64", size = 9833698, upload-time = "2025-10-02T16:10:22.584Z" },
]
[[package]]
name = "pluggy"
version = "1.6.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/f9/e2/3e91f31a7d2b083fe6ef3fa267035b518369d9511ffab804f839851d2779/pluggy-1.6.0.tar.gz", hash = "sha256:7dcc130b76258d33b90f61b658791dede3486c3e6bfb003ee5c9bfb396dd22f3", size = 69412, upload-time = "2025-05-15T12:30:07.975Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/54/20/4d324d65cc6d9205fabedc306948156824eb9f0ee1633355a8f7ec5c66bf/pluggy-1.6.0-py3-none-any.whl", hash = "sha256:e920276dd6813095e9377c0bc5566d94c932c33b27a3e3945d8389c374dd4746", size = 20538, upload-time = "2025-05-15T12:30:06.134Z" },
]
[[package]]
name = "preshed"
version = "3.0.10"
@@ -2079,6 +2110,35 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/15/73/a7141a1a0559bf1a7aa42a11c879ceb19f02f5c6c371c6d57fd86cefd4d1/pyproj-3.7.2-cp314-cp314t-win_arm64.whl", hash = "sha256:d9d25bae416a24397e0d85739f84d323b55f6511e45a522dd7d7eae70d10c7e4", size = 6391844, upload-time = "2025-08-14T12:05:40.745Z" },
]
[[package]]
name = "pyright"
version = "1.1.406"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "nodeenv" },
{ name = "typing-extensions" },
]
sdist = { url = "https://files.pythonhosted.org/packages/f7/16/6b4fbdd1fef59a0292cbb99f790b44983e390321eccbc5921b4d161da5d1/pyright-1.1.406.tar.gz", hash = "sha256:c4872bc58c9643dac09e8a2e74d472c62036910b3bd37a32813989ef7576ea2c", size = 4113151, upload-time = "2025-10-02T01:04:45.488Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/f6/a2/e309afbb459f50507103793aaef85ca4348b66814c86bc73908bdeb66d12/pyright-1.1.406-py3-none-any.whl", hash = "sha256:1d81fb43c2407bf566e97e57abb01c811973fdb21b2df8df59f870f688bdca71", size = 5980982, upload-time = "2025-10-02T01:04:43.137Z" },
]
[[package]]
name = "pytest"
version = "8.4.2"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "colorama", marker = "sys_platform == 'win32'" },
{ name = "iniconfig" },
{ name = "packaging" },
{ name = "pluggy" },
{ name = "pygments" },
]
sdist = { url = "https://files.pythonhosted.org/packages/a3/5c/00a0e072241553e1a7496d638deababa67c5058571567b92a7eaa258397c/pytest-8.4.2.tar.gz", hash = "sha256:86c0d0b93306b961d58d62a4db4879f27fe25513d4b969df351abdddb3c30e01", size = 1519618, upload-time = "2025-09-04T14:34:22.711Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/a8/a4/20da314d277121d6534b3a980b29035dcd51e6744bd79075a6ce8fa4eb8d/pytest-8.4.2-py3-none-any.whl", hash = "sha256:872f880de3fc3a5bdc88a11b39c9710c3497a547cfa9320bc3c5e62fbf272e79", size = 365750, upload-time = "2025-09-04T14:34:20.226Z" },
]
[[package]]
name = "python-dateutil"
version = "2.9.0.post0"