Rename model notation reference (#13)
@@ -0,0 +1,206 @@
# Formal Model Specifications

This document formalises the statistical models implemented in
`src/ners/research/models`. Throughout, the training set is
$\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^N$ with labels
$y^{(i)} \in \{0,1\}$ for the binary gender classes. Feature vectors
$\mathbf{x}^{(i)}$ combine

* character $n$-gram count representations of name strings produced by
  `CountVectorizer` or `TfidfVectorizer`, and
* engineered scalar or categorical metadata (e.g., name length, province)
  that are either used directly or encoded by `LabelEncoder`.

For neural architectures, character or token sequences are converted into
integer index sequences using a `Tokenizer` before being padded to a
maximum length specified in the configuration. Predictions are returned as
class posterior probabilities via a softmax layer unless otherwise noted.

## Logistic Regression (`logistic_regression_model.py`)

**Feature map.** Character $n$-gram counts
$\phi(\mathbf{x}) \in \mathbb{R}^d$ obtained with
`CountVectorizer(analyzer="char", ngram_range=(2,4))` (default configuration).【F:src/ners/research/models/logistic_regression_model.py†L16-L46】

**Model.** The linear logit for class $1$ is
$z = \mathbf{w}^\top \phi(\mathbf{x}) + b$. The class posteriors are
$p(y=1\mid \mathbf{x}) = \sigma(z)$ and $p(y=0\mid \mathbf{x}) = 1 - \sigma(z)$
with $\sigma(u) = (1 + e^{-u})^{-1}$.

**Training objective.** Minimise the regularised negative log-likelihood

$$\mathcal{L}(\mathbf{w}, b) = -\sum_{i=1}^N \left[y^{(i)}\log p(y^{(i)}=1\mid \mathbf{x}^{(i)}) + (1-y^{(i)}) \log p(y^{(i)}=0\mid \mathbf{x}^{(i)})\right] + \lambda R(\mathbf{w}),$$

where $R$ is the penalty induced by the chosen solver (e.g., $\ell_2$ for
`liblinear`).
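
As a concrete illustration of the objective, a minimal pure-Python sketch (independent of the actual module; `sigmoid` and `nll_l2` are hypothetical helper names) evaluates the posterior and the regularised negative log-likelihood on a toy dataset, assuming $R(\mathbf{w}) = \lVert\mathbf{w}\rVert_2^2$:

```python
import math

def sigmoid(u: float) -> float:
    """Logistic function sigma(u) = 1 / (1 + e^{-u})."""
    return 1.0 / (1.0 + math.exp(-u))

def nll_l2(w, b, X, y, lam=1.0):
    """Regularised negative log-likelihood for binary logistic regression.

    w, b: weights and bias; X: feature vectors; y: 0/1 labels;
    lam: l2 regularisation strength (assumed penalty form).
    """
    loss = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        p1 = sigmoid(z)  # p(y=1 | x)
        loss -= y_i * math.log(p1) + (1 - y_i) * math.log(1.0 - p1)
    return loss + lam * sum(w_j ** 2 for w_j in w)
```

At $\mathbf{w} = \mathbf{0}$, $b = 0$ each example contributes $\log 2$, a useful sanity check when debugging a training loop.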

## Multinomial Naive Bayes (`naive_bayes_model.py`)

**Feature map.** Character $n$-gram counts
$\phi(\mathbf{x}) \in \mathbb{N}^d$ derived with
`CountVectorizer(analyzer="char", ngram_range=(2,4))` by default.【F:src/ners/research/models/naive_bayes_model.py†L15-L38】

**Generative model.** For each class $c \in \{0,1\}$, the class prior is
$\pi_c = \frac{N_c}{N}$. Conditional feature probabilities are estimated with
Laplace smoothing (parameter $\alpha$):

$$\theta_{cj} = \frac{N_{cj} + \alpha}{\sum_{k=1}^d (N_{ck} + \alpha)},$$

where $N_{cj}$ counts the total occurrences of feature $j$ among examples of
class $c$. The likelihood of an input with counts $\phi_j(\mathbf{x})$ is

$$p(\phi(\mathbf{x})\mid y=c) = \prod_{j=1}^d \theta_{cj}^{\phi_j(\mathbf{x})}.$$

**Inference.** Predict with the maximum a posteriori (MAP) decision
$\hat{y} = \arg\max_c \left\{ \log \pi_c + \sum_{j=1}^d \phi_j(\mathbf{x}) \log \theta_{cj} \right\}$.
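
The smoothing and MAP equations can be exercised on toy count vectors; the sketch below (hypothetical helper names, not the module's API) estimates $\pi_c$ and $\theta_{cj}$ and applies the log-space decision rule:

```python
import math

def fit_nb(count_matrix, labels, alpha=1.0):
    """Estimate priors pi_c and Laplace-smoothed theta_{cj}."""
    classes = sorted(set(labels))
    n, d = len(count_matrix), len(count_matrix[0])
    pi, theta = {}, {}
    for c in classes:
        rows = [x for x, y in zip(count_matrix, labels) if y == c]
        pi[c] = len(rows) / n
        # N_{cj} + alpha for each feature j, then normalise.
        col = [sum(r[j] for r in rows) + alpha for j in range(d)]
        total = sum(col)
        theta[c] = [v / total for v in col]
    return pi, theta

def predict_map(x, pi, theta):
    """MAP decision: argmax_c { log pi_c + sum_j x_j log theta_{cj} }."""
    def score(c):
        return math.log(pi[c]) + sum(
            xj * math.log(t) for xj, t in zip(x, theta[c]))
    return max(pi, key=score)
```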

## Random Forest (`random_forest_model.py`)

**Feature map.** Concatenation of engineered numerical features and
label-encoded categorical attributes produced on demand in
`prepare_features`.【F:src/ners/research/models/random_forest_model.py†L28-L71】

**Model.** An ensemble of $T$ decision trees $\{T_t\}_{t=1}^T$, each trained on
a bootstrap sample of the data with random feature sub-sampling. Each tree
outputs a class prediction $T_t(\mathbf{x}) \in \{0,1\}$. The forest prediction
is the mode of individual votes:

$$\hat{y} = \operatorname{mode}\{T_t(\mathbf{x}) : t = 1,\dots,T\}.$$

**Class probability.** For soft outputs,
$p(y=1\mid \mathbf{x}) = \frac{1}{T} \sum_{t=1}^T p_t(y=1\mid \mathbf{x})$ where
$p_t$ is the class distribution estimated at the leaf reached by
$\mathbf{x}$ in tree $t$.
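
Both aggregation rules can be sketched over toy per-tree leaf distributions $(p_t(y=0\mid\mathbf{x}), p_t(y=1\mid\mathbf{x}))$; `forest_predict` is a hypothetical name, not the module's API:

```python
from collections import Counter

def forest_predict(tree_leaf_dists):
    """Aggregate per-tree leaf class distributions p_t(y | x).

    Soft output averages the distributions over the T trees; the hard
    label is the mode of the per-tree argmax votes.
    """
    T = len(tree_leaf_dists)
    p1 = sum(p[1] for p in tree_leaf_dists) / T
    votes = [max((0, 1), key=lambda c: p[c]) for p in tree_leaf_dists]
    hard = Counter(votes).most_common(1)[0][0]
    return {0: 1.0 - p1, 1: p1}, hard
```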

## LightGBM (`lightgbm_model.py`)

**Feature map.** Hybrid of numeric inputs, categorical label encodings, and
character $n$-gram counts expanded into dense columns and assembled into a
feature matrix persisted in `self.feature_columns`.【F:src/ners/research/models/lightgbm_model.py†L38-L118】

**Model.** Gradient boosted decision trees forming an additive function
$F_M(\mathbf{x}) = F_0(\mathbf{x}) + \sum_{m=1}^{M} \eta\, h_m(\mathbf{x})$, where
$F_0$ is the constant initial prediction, $h_m$ denotes the $m$-th tree, and
$\eta$ is the learning rate.

**Training objective.** LightGBM minimises

$$\mathcal{L}^{(m)} = \sum_{i=1}^N \ell\big(y^{(i)}, F_{m-1}(\mathbf{x}^{(i)}) + h_m(\mathbf{x}^{(i)})\big) + \Omega(h_m),$$

using second-order Taylor approximations of the loss $\ell$ (binary log-loss by
default) and regulariser $\Omega$ determined by tree complexity constraints.
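
The additive prediction can be checked numerically; in the sketch below (hypothetical helper, not the LightGBM API) plain callables stand in for fitted trees $h_m$, and the raw score is mapped through the sigmoid for the binary objective:

```python
import math

def boosted_posterior(x, trees, f0=0.0, eta=0.1):
    """p(y=1|x) = sigma(F_M(x)) with F_M(x) = F_0 + eta * sum_m h_m(x).

    `trees` is a list of callables standing in for the fitted trees.
    """
    F = f0 + eta * sum(h(x) for h in trees)
    return 1.0 / (1.0 + math.exp(-F))
```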

## XGBoost (`xgboost_model.py`)

**Feature map.** Combination of numeric metadata, categorical label encodings,
and character $n$-gram counts as described in `prepare_features`.【F:src/ners/research/models/xgboost_model.py†L41-L113】

**Model.** Additive ensemble of regression trees
$F_M(\mathbf{x}) = \sum_{m=1}^M f_m(\mathbf{x})$ with $f_m \in \mathcal{F}$, the
space of regression trees.

**Training objective.** At boosting iteration $m$, minimise the regularised
objective

$$\mathcal{L}^{(m)} = \sum_{i=1}^N \ell\big(y^{(i)}, F_{m-1}(\mathbf{x}^{(i)}) + f_m(\mathbf{x}^{(i)})\big) + \Omega(f_m),$$

where $\Omega(f) = \gamma T_f + \tfrac{1}{2} \lambda \sum_{j=1}^{T_f} w_j^2$ penalises the
number of leaves $T_f$ and their scores $w_j$. The optimal leaf weights follow

$$w_j^{\star} = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda},$$

with $g_i$ and $h_i$ denoting first- and second-order gradients of the loss for
sample $i$.
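
For the binary log-loss, $g_i = p_i - y^{(i)}$ and $h_i = p_i(1 - p_i)$ with $p_i = \sigma(F_{m-1}(\mathbf{x}^{(i)}))$, so the leaf-weight formula can be verified on toy values (hypothetical helper names, for illustration only):

```python
import math

def logloss_grad_hess(y, F):
    """First- and second-order statistics of binary log-loss at raw score F."""
    p = 1.0 / (1.0 + math.exp(-F))  # sigma(F)
    return p - y, p * (1.0 - p)

def leaf_weight(g, h, lam=1.0):
    """Optimal leaf score w*_j = -sum_{i in I_j} g_i / (sum_{i in I_j} h_i + lambda)."""
    return -sum(g) / (sum(h) + lam)
```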

## Convolutional Neural Network (`cnn_model.py`)

**Input encoding.** Character-level token sequences padded to length
$L$ using `Tokenizer(char_level=True)` followed by `pad_sequences`.【F:src/ners/research/models/cnn_model.py†L23-L64】

**Architecture.** Embedding layer producing $X \in \mathbb{R}^{L \times d}$,
followed by two convolutional blocks:

1. $H^{(1)} = \operatorname{ReLU}(\operatorname{Conv1D}_{k_1}(X))$ with kernel size
   $k_1$ and $F$ filters, then temporal max-pooling.
2. $H^{(2)} = \operatorname{ReLU}(\operatorname{Conv1D}_{k_2}(H^{(1)}))$ with kernel
   size $k_2$ and $F$ filters.

Global max-pooling yields $h = \max_{t} H^{(2)}_{t,:}$, which passes through a
dense layer and dropout before the softmax layer producing
$p(y\mid \mathbf{x}) = \operatorname{softmax}(W h + b)$.

**Loss.** Cross-entropy between softmax output and the ground-truth label.
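
The convolution-plus-pooling path can be sketched in NumPy to make the shapes concrete; this is a hypothetical helper with "valid" padding, not the Keras implementation:

```python
import numpy as np

def conv1d_relu(X, W, b):
    """Valid 1D convolution over time followed by ReLU.

    X: (L, d) embedded sequence, W: (k, d, F) filter bank, b: (F,).
    Returns H of shape (L - k + 1, F).
    """
    k = W.shape[0]
    L_out = X.shape[0] - k + 1
    H = np.stack([
        np.tensordot(X[t:t + k], W, axes=([0, 1], [0, 1])) + b
        for t in range(L_out)
    ])
    return np.maximum(H, 0.0)

def global_max_pool(H):
    """h = max_t H_{t,:} over the temporal axis."""
    return H.max(axis=0)
```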

## Bidirectional GRU (`bigru_model.py`)

**Input encoding.** Word-level sequences padded to length $L$ with
`Tokenizer(char_level=False)` and `pad_sequences`.【F:src/ners/research/models/bigru_model.py†L47-L69】

**Recurrent dynamics.** A stacked bidirectional GRU computes forward and
backward hidden states according to

\[
\begin{aligned}
\mathbf{z}_t &= \sigma(W_z \mathbf{x}_t + U_z \mathbf{h}_{t-1} + \mathbf{b}_z),\\
\mathbf{r}_t &= \sigma(W_r \mathbf{x}_t + U_r \mathbf{h}_{t-1} + \mathbf{b}_r),\\
\tilde{\mathbf{h}}_t &= \tanh\big(W_h \mathbf{x}_t + U_h(\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{b}_h\big),\\
\mathbf{h}_t &= (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t.
\end{aligned}
\]

The final representation concatenates the last forward and backward states
before passing through dense layers and a softmax classifier.
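
A single forward step of this recurrence can be written directly from the gate equations; this is a NumPy sketch with a hypothetical parameter dict `P`, not the Keras `GRU` layer:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def gru_step(x_t, h_prev, P):
    """One GRU update; P maps "Wz", "Uz", "bz", ... to parameter arrays."""
    z = sigmoid(P["Wz"] @ x_t + P["Uz"] @ h_prev + P["bz"])  # update gate
    r = sigmoid(P["Wr"] @ x_t + P["Ur"] @ h_prev + P["br"])  # reset gate
    h_tilde = np.tanh(P["Wh"] @ x_t + P["Uh"] @ (r * h_prev) + P["bh"])
    return (1.0 - z) * h_prev + z * h_tilde
```

With all parameters zero, every gate is $\sigma(0) = 0.5$ and $\tilde{\mathbf{h}}_t = \mathbf{0}$, so the step halves the previous state, a quick consistency check against the equations above.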

## Bidirectional LSTM (`lstm_model.py`)

**Input encoding.** Word-level sequences padded to length $L$ using the same
pipeline as the BiGRU model.【F:src/ners/research/models/lstm_model.py†L45-L67】

**Recurrent dynamics.** At each timestep, the LSTM updates its memory cell via

\[
\begin{aligned}
\mathbf{i}_t &= \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i),\\
\mathbf{f}_t &= \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f),\\
\mathbf{o}_t &= \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + \mathbf{b}_o),\\
\tilde{\mathbf{c}}_t &= \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c),\\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t,\\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t).
\end{aligned}
\]

Bidirectional aggregation concatenates terminal forward/backward hidden vectors
before the dense-softmax head.
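
As with the GRU, one timestep of the cell update follows directly from the equations; a NumPy sketch with a hypothetical parameter dict `P`, not the Keras `LSTM` layer:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM update; P maps "Wi", "Ui", "bi", ... to parameter arrays."""
    i = sigmoid(P["Wi"] @ x_t + P["Ui"] @ h_prev + P["bi"])  # input gate
    f = sigmoid(P["Wf"] @ x_t + P["Uf"] @ h_prev + P["bf"])  # forget gate
    o = sigmoid(P["Wo"] @ x_t + P["Uo"] @ h_prev + P["bo"])  # output gate
    c_tilde = np.tanh(P["Wc"] @ x_t + P["Uc"] @ h_prev + P["bc"])
    c = f * c_prev + i * c_tilde
    return o * np.tanh(c), c
```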

## Transformer Encoder (`transformer_model.py`)

**Input encoding.** Token sequences padded to a fixed length with positional
indices $\{0, \ldots, L-1\}$ added through a learned positional embedding.
`Tokenizer` initialises the vocabulary; padding uses `pad_sequences`.【F:src/ners/research/models/transformer_model.py†L25-L77】

**Architecture.** For hidden dimension $d$, the encoder block computes

\[
\begin{aligned}
Z^{(0)} &= X + P,\\
Z^{(1)} &= \operatorname{LayerNorm}\big(Z^{(0)} + \operatorname{Dropout}(\operatorname{MHAttn}(Z^{(0)}))\big),\\
Z^{(2)} &= \operatorname{LayerNorm}\big(Z^{(1)} + \operatorname{Dropout}(\phi(Z^{(1)} W_1 + b_1) W_2 + b_2)\big),
\end{aligned}
\]

where $\operatorname{MHAttn}$ is multi-head self-attention with
$H$ heads. Global average pooling produces a fixed-length vector for the dense
and dropout layers before the final softmax classifier.
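
The core of $\operatorname{MHAttn}$ is scaled dot-product attention, $\operatorname{softmax}(Q K^\top / \sqrt{d_k})\,V$ per head; a single-head NumPy sketch (hypothetical helper, not the Keras `MultiHeadAttention` layer):

```python
import numpy as np

def softmax(A, axis=-1):
    """Numerically stable softmax along the given axis."""
    E = np.exp(A - A.max(axis=axis, keepdims=True))
    return E / E.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one head.

    Q: (Lq, d_k), K: (Lk, d_k), V: (Lk, d_v).
    Returns the attended values and the attention weights.
    """
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))  # rows sum to 1
    return A @ V, A
```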

## Ensemble Voting (`ensemble_model.py`)

**Base learners.** A configurable set of pipelines that include character
$n$-gram vectorisers and classical classifiers (logistic regression,
random forest, naive Bayes).【F:src/ners/research/models/ensemble_model.py†L29-L96】

**Aggregation.** Given model posteriors $\mathbf{p}_j(\mathbf{x})$ and non-negative
weights $w_j$, the soft-voting ensemble predicts

$$p(y=c\mid \mathbf{x}) = \frac{1}{\sum_j w_j} \sum_j w_j \, p_j(y=c\mid \mathbf{x}).$$

Hard voting instead returns
$\hat{y} = \operatorname{mode}\{\arg\max_c p_j(y=c\mid \mathbf{x})\}$.
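
Both aggregation rules can be checked on toy posteriors (hypothetical helper names, not the module's API):

```python
from collections import Counter

def soft_vote(posteriors, weights):
    """Weighted average: p(y=c|x) = sum_j w_j p_j(y=c|x) / sum_j w_j."""
    Z = sum(weights)
    n_classes = len(posteriors[0])
    return [sum(w * p[c] for w, p in zip(weights, posteriors)) / Z
            for c in range(n_classes)]

def hard_vote(posteriors):
    """Mode of the per-model argmax labels."""
    labels = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    return Counter(labels).most_common(1)[0][0]
```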
@@ -1,93 +0,0 @@

# Model Notation Reference

This document summarises the mathematical formulation and notation behind the models available in `research/models`. In all cases, the input example is represented by a feature vector $\mathbf{x}$ (after any feature-extraction or vectorisation steps) and the target label belongs to a finite set of classes $\mathcal{Y}$.

## Logistic Regression

- Decision function: $z = \mathbf{w}^\top \mathbf{x} + b$.
- Binary posterior: $p(y=1\mid \mathbf{x}) = \sigma(z) = \frac{1}{1 + e^{-z}}$ and $p(y=0\mid \mathbf{x}) = 1 - \sigma(z)$.
- Multi-class (one-vs-rest or softmax): $p(y=c\mid \mathbf{x}) = \frac{\exp(\mathbf{w}_c^\top \mathbf{x} + b_c)}{\sum_{k \in \mathcal{Y}} \exp(\mathbf{w}_k^\top \mathbf{x} + b_k)}$.
- Loss: negative log-likelihood $\mathcal{L} = -\sum_i \log p(y_i\mid \mathbf{x}_i)$ plus regularisation when configured.
- Gender prediction rationale: linear decision boundaries over character n-gram counts provide a strong, interpretable baseline for name-based gender attribution.
- Implementation notes: uses character n-grams via `CountVectorizer`; `solver='liblinear'` with optional `class_weight` and `n_jobs` to speed up sparse optimization.

## Multinomial Naive Bayes

- Class prior: $p(y=c) = \frac{N_c}{N}$ where $N_c$ counts training instances in class $c$.
- Conditional likelihood (bag-of-ngrams): $p(\mathbf{x}\mid y=c) = \prod_{j=1}^{d} p(x_j\mid y=c)^{x_j}$ with categorical parameters estimated via Laplace smoothing.
- Posterior up to normalisation: $\log p(y=c\mid \mathbf{x}) \propto \log p(y=c) + \sum_{j=1}^{d} x_j \log p(x_j\mid y=c)$.
- Gender prediction rationale: captures the relative frequency of character patterns associated with each gender, giving a fast and robust probabilistic baseline for sparse n-gram features.
- Implementation notes: character n-gram counts with Laplace smoothing; extremely fast to train and deploy.

## Support Vector Machine (RBF Kernel)

- Dual-form decision function: $f(\mathbf{x}) = \operatorname{sign}\Big( \sum_{i=1}^{M} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \Big)$.
- RBF kernel: $K(\mathbf{x}_i, \mathbf{x}) = \exp\big(-\gamma \lVert \mathbf{x}_i - \mathbf{x} \rVert_2^2\big)$.
- Soft-margin optimisation: $\min_{\mathbf{w}, \xi} \frac{1}{2}\lVert \mathbf{w} \rVert_2^2 + C \sum_i \xi_i$ s.t. $y_i(\mathbf{w}^\top \phi(\mathbf{x}_i) + b) \geq 1 - \xi_i$, $\xi_i \geq 0$.
- Gender prediction rationale: non-linear kernels model subtle character-pattern interactions beyond linear baselines, improving separability when male and female names share prefixes but diverge in internal structure.
- Implementation notes: TF–IDF character features; increased `cache_size` and optional `class_weight` for stability on imbalanced data.

## Random Forest

- Ensemble of $T$ decision trees: $\hat{y} = \operatorname{mode}\{ T_t(\mathbf{x}) : t=1, \dots, T \}$ for classification.
- Each tree draws a bootstrap sample of the training set and a random subset of features at each split.
- Feature importance (used in implementation): mean decrease in impurity aggregated over splits per feature.
- Gender prediction rationale: handles heterogeneous engineered features (length, province, endings) without heavy preprocessing, while delivering interpretable feature-importance signals.
- Implementation notes: enables `n_jobs=-1` for parallel trees; persistent label encoders ensure stable categorical mappings.

## LightGBM (Gradient Boosted Trees)

- Additive model: $F_0(\mathbf{x}) = \hat{p}$ (initial prediction), $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta h_m(\mathbf{x})$.
- Each weak learner $h_m$ is a decision tree grown with a leaf-wise strategy under a depth constraint.
- Optimises a differentiable loss (default: logistic) using first- and second-order gradients over the data in each boosting iteration.
- Gender prediction rationale: excels with sparse categorical encodings and numerous engineered features, offering strong accuracy with manageable inference cost.
- Implementation notes: `objective='binary'`, `n_jobs=-1` for throughput; works well with compact character-gram features plus metadata.

## XGBoost

- Objective: $\mathcal{L}^{(t)} = \sum_{i} \ell(y_i, \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)) + \Omega(f_t)$ with regulariser $\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_j w_j^2$.
- Tree score expansion via second-order Taylor approximation; optimal leaf weight $w_j = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$ where $g_i$ and $h_i$ are gradient and Hessian statistics.
- Final prediction: $\hat{y}(\mathbf{x}) = \sum_{t=1}^{M} \eta f_t(\mathbf{x})$.
- Gender prediction rationale: strong regularisation and gradient-informed splits capture interactions between textual and metadata features; suited to high-stakes deployment when tuned carefully.
- Implementation notes: `tree_method='hist'`, `n_jobs=-1` for efficient CPU training; integrates engineered categorical encodings.

## Convolutional Neural Network (1D)

- Token/character embeddings produce $X \in \mathbb{R}^{L \times d}$.
- Convolution layer: $H^{(k)} = \operatorname{ReLU}(X * W^{(k)} + b^{(k)})$ where $*$ denotes 1D convolution with filter $W^{(k)}$.
- Pooling summarises the temporal dimension (max or global max); dense layers map the pooled vector to logits $\mathbf{z}$.
- Output probabilities: $p(y=c\mid \mathbf{x}) = \operatorname{softmax}_c(\mathbf{z})$; loss via cross-entropy.
- Gender prediction rationale: convolutional filters learn discriminative prefixes, suffixes, and intra-name motifs directly from characters, accommodating mixed-language inputs.
- Implementation notes: adds `SpatialDropout1D` on embeddings and `padding='same'` in conv layers for stability and length-invariance.

## Bidirectional GRU

- Forward GRU recursion: $\begin{aligned}
  &\mathbf{z}_t = \sigma(W_z \mathbf{x}_t + U_z \mathbf{h}_{t-1} + \mathbf{b}_z),\\
  &\mathbf{r}_t = \sigma(W_r \mathbf{x}_t + U_r \mathbf{h}_{t-1} + \mathbf{b}_r),\\
  &\tilde{\mathbf{h}}_t = \tanh(W_h \mathbf{x}_t + U_h(\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{b}_h),\\
  &\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t.
  \end{aligned}$
- Backward GRU mirrors the recurrence from $t=L$ to $1$; final representation concatenates $[\mathbf{h}_L^{\rightarrow}; \mathbf{h}_1^{\leftarrow}]$ before dense layers and softmax output.
- Gender prediction rationale: bidirectional context processes character sequences in both directions, learning gender-specific morphemes appearing at any position within the name.
- Implementation notes: `Embedding(mask_zero=True)` propagates masks to GRUs, ignoring padding; optional `recurrent_dropout` reduces overfitting.

## LSTM

- Gates per timestep: $\begin{aligned}
  &\mathbf{i}_t = \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i),\\
  &\mathbf{f}_t = \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f),\\
  &\mathbf{o}_t = \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + \mathbf{b}_o),\\
  &\tilde{\mathbf{c}}_t = \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c),\\
  &\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t,\\
  &\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t).
  \end{aligned}$
- Bidirectional stacking concatenates final hidden vectors before classification via softmax.
- Gender prediction rationale: long short-term memory cells model long-range dependencies within names, capturing compound structures common in multilingual gendered naming conventions.
- Implementation notes: `Embedding(mask_zero=True)` and `recurrent_dropout` regularise sequence modeling across padded batches.

## Transformer Encoder (Single Block)

- Input embeddings $X \in \mathbb{R}^{L \times d}$ plus positional embeddings $P$ produce $Z^{(0)} = X + P$.
- Multi-head self-attention: $\operatorname{MHAttn}(Z) = \operatorname{Concat}(\text{head}_1, \dots, \text{head}_H) W^O$ where $\text{head}_h = \operatorname{softmax}\big(\frac{Q_h K_h^\top}{\sqrt{d_k}}\big) V_h$ and $(Q_h, K_h, V_h) = (Z W_h^Q, Z W_h^K, Z W_h^V)$.
- Add & norm: $Z^{(1)} = \operatorname{LayerNorm}(Z^{(0)} + \operatorname{Dropout}(\operatorname{MHAttn}(Z^{(0)})))$.
- Position-wise feed-forward: $Z^{(2)} = \operatorname{LayerNorm}(Z^{(1)} + \operatorname{Dropout}(\phi(Z^{(1)} W_1 + b_1) W_2 + b_2))$, with activation $\phi(\cdot)$ (ReLU).
- Sequence pooling (global average) feeds dense layers and the softmax classifier.
- Gender prediction rationale: self-attention captures global dependencies and shared subword units across names, outperforming recurrent models when sufficient labelled data is available; otherwise the risk of overfitting should be monitored.
- Implementation notes: `Embedding(mask_zero=True)` with learned positional embeddings; attention dropout (`attn_dropout`) and classifier dropout improve generalisation.

## Ensemble (Soft Voting)

- Base learners indexed by $j$ output probability vectors $\mathbf{p}_j(\mathbf{x})$.
- Aggregated prediction with weights $w_j$: $p(y=c\mid \mathbf{x}) = \frac{1}{\sum_j w_j} \sum_j w_j \, p_j(y=c\mid \mathbf{x})$.
- Hard voting variant predicts $\hat{y} = \operatorname{mode}\{ \hat{y}_j \}$, where $\hat{y}_j = \arg\max_c p_j(y=c\mid \mathbf{x})$.
- Gender prediction rationale: blends complementary inductive biases (linear, tree-based, neural) to reduce variance on ambiguous names; remains suitable provided individual members are well-calibrated.