Formal Model Specifications

This document formalises the statistical models implemented in src/ners/research/models. Throughout, the training set is \mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^N with labels y^{(i)} \in \{0,1\} for the binary gender classes. Feature vectors \mathbf{x}^{(i)} combine

  • character $n$-gram count representations of name strings produced by CountVectorizer or TfidfVectorizer, and
  • engineered scalar or categorical metadata (e.g., name length, province) that are either used directly or encoded by LabelEncoder.

For neural architectures, character or token sequences are converted into integer index sequences using a Tokenizer before being padded to a maximum length specified in the configuration. Predictions are returned as class posterior probabilities via a softmax layer unless otherwise noted.
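
As a minimal sketch of this encoding step (assuming a TensorFlow/Keras backend; the padded length of 20 and the example names are illustrative, not taken from the project configuration):

```python
# Illustrative character-level encoding: tokenise, index, and pad name strings.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

names = ["Kabila", "Tshisekedi"]              # toy inputs
tokenizer = Tokenizer(char_level=True)        # character-level vocabulary
tokenizer.fit_on_texts(names)
sequences = tokenizer.texts_to_sequences(names)  # lists of integer indices
padded = pad_sequences(sequences, maxlen=20)     # shape (2, 20), left-padded
```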

Logistic Regression (logistic_regression_model.py)

Feature map. Character $n$-gram counts \phi(\mathbf{x}) \in \mathbb{R}^d obtained with CountVectorizer(analyzer="char", ngram_range=(2,4)) (default configuration).【F:src/ners/research/models/logistic_regression_model.py†L16-L46】

Model. The linear logit for class 1 is z = \mathbf{w}^\top \phi(\mathbf{x}) + b. The class posteriors are p(y=1\mid \mathbf{x}) = \sigma(z) and p(y=0\mid \mathbf{x}) = 1 - \sigma(z) with \sigma(u) = (1 + e^{-u})^{-1}.

Training objective. Minimise the regularised negative log-likelihood

\mathcal{L}(\mathbf{w}, b) = -\sum_{i=1}^N \left[y^{(i)}\log p(y^{(i)}=1\mid \mathbf{x}^{(i)}) + (1-y^{(i)}) \log p(y^{(i)}=0\mid \mathbf{x}^{(i)})\right] + \lambda R(\mathbf{w}),

where R is the regularisation penalty configured for the model (scikit-learn defaults to \ell_2); in scikit-learn, the strength \lambda corresponds to the inverse of the C parameter.
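
A minimal scikit-learn sketch of this pipeline, with illustrative hyperparameters and toy data:

```python
# Sketch of the logistic-regression pipeline under the default configuration;
# C and the training examples are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 4)),  # phi(x)
    LogisticRegression(solver="liblinear", C=1.0),         # lambda = 1 / C
)
clf.fit(["Marie", "Jean"], [1, 0])           # toy training data
proba = clf.predict_proba(["Mariette"])      # [p(y=0 | x), p(y=1 | x)]
```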

Multinomial Naive Bayes (naive_bayes_model.py)

Feature map. Character $n$-gram counts \phi(\mathbf{x}) \in \mathbb{N}^d derived with CountVectorizer(analyzer="char", ngram_range=(2,4)) by default.【F:src/ners/research/models/naive_bayes_model.py†L15-L38】

Generative model. For each class c \in \{0,1\}, the class prior is \pi_c = \frac{N_c}{N}, where N_c is the number of training examples with label c. Conditional feature probabilities are estimated with Laplace smoothing (parameter \alpha):

\theta_{cj} = \frac{N_{cj} + \alpha}{\sum_{k=1}^d (N_{ck} + \alpha)},

where N_{cj} counts the total occurrences of feature j among examples of class c. Up to a multinomial coefficient that does not depend on the class, the likelihood of an input with counts \phi_j(\mathbf{x}) is

p(\phi(\mathbf{x})\mid y=c) \propto \prod_{j=1}^d \theta_{cj}^{\phi_j(\mathbf{x})}.

Inference. Predict with the maximum a posteriori (MAP) decision \hat{y} = \arg\max_c \left\{ \log \pi_c + \sum_{j=1}^d \phi_j(\mathbf{x}) \log \theta_{cj} \right\}.
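
A minimal sketch with illustrative values; alpha is the Laplace smoothing parameter from the equations above:

```python
# Sketch of the multinomial naive Bayes pipeline; data are toy examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

nb = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 4)),
    MultinomialNB(alpha=1.0),  # Laplace smoothing
)
nb.fit(["Marie", "Jean"], [1, 0])
# MAP decision: argmax_c [ log pi_c + sum_j phi_j(x) log theta_cj ]
print(nb.predict(["Jeanne"]))
```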

Random Forest (random_forest_model.py)

Feature map. Concatenation of engineered numerical features and label-encoded categorical attributes produced on demand in prepare_features.【F:src/ners/research/models/random_forest_model.py†L28-L71】

Model. An ensemble of T decision trees \{T_t\}_{t=1}^T, each trained on a bootstrap sample of the data with random feature sub-sampling. Each tree outputs a class prediction T_t(\mathbf{x}) \in \{0,1\}. The forest prediction is the mode of individual votes:

\hat{y} = \operatorname{mode}\{T_t(\mathbf{x}) : t = 1,\dots,T\}.

Class probability. For soft outputs, p(y=1\mid \mathbf{x}) = \frac{1}{T} \sum_{t=1}^T p_t(y=1\mid \mathbf{x}) where p_t is the class distribution estimated at the leaf reached by \mathbf{x} in tree t.
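
A minimal sketch of the engineered-feature input, assuming illustrative columns (name length, province) rather than the exact set built by prepare_features:

```python
# Sketch of label-encoded categorical plus numeric features for the forest;
# column choices and data values are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

provinces = ["Kinshasa", "Katanga", "Kinshasa"]
name_lengths = [5, 4, 8]

province_codes = LabelEncoder().fit_transform(provinces)
X = np.column_stack([name_lengths, province_codes])  # engineered feature matrix
y = [1, 0, 1]

forest = RandomForestClassifier(n_estimators=100).fit(X, y)
forest.predict_proba(X)  # averaged leaf distributions, (1/T) * sum_t p_t
```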

LightGBM (lightgbm_model.py)

Feature map. Hybrid of numeric inputs, categorical label encodings, and character $n$-gram counts expanded into dense columns and assembled into a feature matrix persisted in self.feature_columns.【F:src/ners/research/models/lightgbm_model.py†L38-L118】

Model. Gradient boosted decision trees forming an additive function F_M(\mathbf{x}) = F_0 + \sum_{m=1}^M \eta\, h_m(\mathbf{x}), where F_0 is the initial constant prediction, h_m denotes the $m$-th tree, and \eta is the learning rate.

Training objective. LightGBM minimises

\mathcal{L}^{(m)} = \sum_{i=1}^N \ell\big(y^{(i)}, F_{m-1}(\mathbf{x}^{(i)}) + h_m(\mathbf{x}^{(i)})\big) + \Omega(h_m),

using second-order Taylor approximations of the loss \ell (binary log-loss by default) and regulariser \Omega determined by tree complexity constraints.
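
A minimal sketch with illustrative hyperparameters and synthetic data standing in for the hybrid feature matrix:

```python
# Sketch of the boosted model under the binary log-loss objective; the
# hyperparameters and synthetic data are illustrative, not the project config.
import numpy as np
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))        # stand-in for the hybrid feature matrix
y = rng.integers(0, 2, size=200)

gbm = LGBMClassifier(
    objective="binary",
    learning_rate=0.1,   # eta in F_M(x)
    n_estimators=100,    # M boosting rounds
)
gbm.fit(X, y)
```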

XGBoost (xgboost_model.py)

Feature map. Combination of numeric metadata, categorical label encodings, and character $n$-gram counts as described in prepare_features.【F:src/ners/research/models/xgboost_model.py†L41-L113】

Model. Additive ensemble of regression trees F_M(\mathbf{x}) = \sum_{m=1}^M f_m(\mathbf{x}) with f_m \in \mathcal{F}, the space of regression trees (CART).

Training objective. At boosting iteration m, minimise the regularised objective

\mathcal{L}^{(m)} = \sum_{i=1}^N \ell\big(y^{(i)}, F_{m-1}(\mathbf{x}^{(i)}) + f_m(\mathbf{x}^{(i)})\big) + \Omega(f_m),

where \Omega(f) = \gamma T_f + \tfrac{1}{2} \lambda \sum_{j=1}^{T_f} w_j^2 penalises the number of leaves T_f and their scores w_j. The optimal leaf weights follow

w_j^{\star} = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda},

with I_j the index set of samples routed to leaf j, and g_i and h_i the first- and second-order derivatives of the loss \ell with respect to the prediction, evaluated at F_{m-1}(\mathbf{x}^{(i)}).
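
As a worked check of the leaf-weight formula under the binary log-loss, for which g_i = p_i - y^{(i)} and h_i = p_i(1 - p_i) with p_i = \sigma(F_{m-1}(\mathbf{x}^{(i)})); the numbers are illustrative:

```python
# Numeric check of w_j* = -(sum g_i) / (sum h_i + lambda) for one leaf.
import numpy as np

y = np.array([1.0, 0.0, 1.0])   # labels of the samples routed to leaf j
p = np.array([0.6, 0.3, 0.8])   # current predictions sigma(F_{m-1}(x_i)), illustrative
lam = 1.0                       # L2 regularisation strength lambda

g = p - y                       # first-order derivatives of the log-loss
h = p * (1.0 - p)               # second-order derivatives of the log-loss
w_star = -g.sum() / (h.sum() + lam)
print(w_star)                   # optimal score w_j* for leaf j
```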

Convolutional Neural Network (cnn_model.py)

Input encoding. Character-level token sequences padded to length L using Tokenizer(char_level=True) followed by pad_sequences.【F:src/ners/research/models/cnn_model.py†L23-L64】

Architecture. Embedding layer producing X \in \mathbb{R}^{L \times d}, followed by two convolutional blocks:

  1. H^{(1)} = \operatorname{ReLU}(\operatorname{Conv1D}_{k_1}(X)) with kernel size k_1 and F filters, then temporal max-pooling.
  2. H^{(2)} = \operatorname{ReLU}(\operatorname{Conv1D}_{k_2}(H^{(1)})) with kernel size k_2 and F filters.

Global max-pooling yields h = \max_{t} H^{(2)}_{t,:}, which passes through a dense layer and dropout before the softmax layer producing p(y\mid \mathbf{x}) = \operatorname{softmax}(W h + b).

Loss. Cross-entropy between softmax output and the ground-truth label.
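
A minimal Keras sketch of this architecture; vocabulary size, embedding width, filter count, and kernel sizes are illustrative, not the project's configuration:

```python
# Sketch of the two-block character CNN described above.
from tensorflow.keras import layers, models

L = 20  # padded sequence length
model = models.Sequential([
    layers.Input(shape=(L,), dtype="int32"),
    layers.Embedding(input_dim=64, output_dim=32),  # X in R^{L x d}
    layers.Conv1D(128, 3, activation="relu"),       # first block, kernel k1
    layers.MaxPooling1D(2),                         # temporal max-pooling
    layers.Conv1D(128, 5, activation="relu"),       # second block, kernel k2
    layers.GlobalMaxPooling1D(),                    # h = max_t H^(2)_{t,:}
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(2, activation="softmax"),          # p(y | x)
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```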

Bidirectional GRU (bigru_model.py)

Input encoding. Word-level sequences padded to length L with Tokenizer(char_level=False) and pad_sequences.【F:src/ners/research/models/bigru_model.py†L47-L69】

Recurrent dynamics. A stacked bidirectional GRU computes forward and backward hidden states according to

\begin{aligned}
\mathbf{z}_t &= \sigma(W_z \mathbf{x}_t + U_z \mathbf{h}_{t-1} + \mathbf{b}_z), \\
\mathbf{r}_t &= \sigma(W_r \mathbf{x}_t + U_r \mathbf{h}_{t-1} + \mathbf{b}_r), \\
\tilde{\mathbf{h}}_t &= \tanh\big(W_h \mathbf{x}_t + U_h(\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{b}_h\big), \\
\mathbf{h}_t &= (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t.
\end{aligned}

The final representation concatenates the last forward and backward states before passing through dense layers and a softmax classifier.
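
A minimal Keras sketch of the stacked bidirectional GRU; all sizes are illustrative:

```python
# Sketch of the stacked BiGRU classifier described above.
from tensorflow.keras import layers, models

L = 20
model = models.Sequential([
    layers.Input(shape=(L,), dtype="int32"),
    layers.Embedding(input_dim=5000, output_dim=64),
    layers.Bidirectional(layers.GRU(64, return_sequences=True)),  # stacked layer 1
    layers.Bidirectional(layers.GRU(32)),   # emits [h_T^fwd ; h_T^bwd] concatenated
    layers.Dense(32, activation="relu"),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```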

Bidirectional LSTM (lstm_model.py)

Input encoding. Word-level sequences padded to length L using the same pipeline as the BiGRU model.【F:src/ners/research/models/lstm_model.py†L45-L67】

Recurrent dynamics. At each timestep, the LSTM updates its memory cell via

\begin{aligned}
\mathbf{i}_t &= \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i), \\
\mathbf{f}_t &= \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f), \\
\mathbf{o}_t &= \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + \mathbf{b}_o), \\
\tilde{\mathbf{c}}_t &= \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c), \\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t, \\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t).
\end{aligned}

Bidirectional aggregation concatenates terminal forward/backward hidden vectors before the dense-softmax head.
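
A minimal Keras sketch; it mirrors the BiGRU sketch above with the recurrent cell swapped for an LSTM, and all sizes remain illustrative:

```python
# Sketch of the stacked BiLSTM classifier described above.
from tensorflow.keras import layers, models

L = 20
model = models.Sequential([
    layers.Input(shape=(L,), dtype="int32"),
    layers.Embedding(input_dim=5000, output_dim=64),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(32)),  # concatenated terminal fwd/bwd states
    layers.Dense(32, activation="relu"),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```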

Transformer Encoder (transformer_model.py)

Input encoding. Token sequences padded to a fixed length with positional indices \{0, \ldots, L-1\} added through a learned positional embedding. Tokenizer initialises the vocabulary; padding uses pad_sequences.【F:src/ners/research/models/transformer_model.py†L25-L77】

Architecture. For hidden dimension d, the encoder block computes

\begin{aligned}
Z^{(0)} &= X + P, \\
Z^{(1)} &= \operatorname{LayerNorm}\big(Z^{(0)} + \operatorname{Dropout}(\operatorname{MHAttn}(Z^{(0)}))\big), \\
Z^{(2)} &= \operatorname{LayerNorm}\big(Z^{(1)} + \operatorname{Dropout}(\phi(Z^{(1)} W_1 + b_1) W_2 + b_2)\big),
\end{aligned}

where \operatorname{MHAttn} is multi-head self-attention with H heads and \phi is the position-wise feed-forward activation (typically ReLU). Global average pooling produces a fixed-length vector for the dense and dropout layers before the final softmax classifier.
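
A minimal Keras sketch of a single encoder block following these equations; head count, widths, and dropout rate are illustrative:

```python
# Sketch of one Transformer encoder block with learned positional embeddings.
import tensorflow as tf
from tensorflow.keras import layers


class TokenAndPositionEmbedding(layers.Layer):
    """Computes Z^(0) = X + P from integer token indices."""

    def __init__(self, seq_len, vocab_size, d):
        super().__init__()
        self.tok = layers.Embedding(vocab_size, d)
        self.pos = layers.Embedding(seq_len, d)

    def call(self, x):
        positions = tf.range(start=0, limit=tf.shape(x)[-1], delta=1)
        return self.tok(x) + self.pos(positions)


L, d = 20, 64
inputs = layers.Input(shape=(L,), dtype="int32")
z0 = TokenAndPositionEmbedding(L, 5000, d)(inputs)
attn = layers.MultiHeadAttention(num_heads=4, key_dim=d // 4)(z0, z0)  # MHAttn
z1 = layers.LayerNormalization()(z0 + layers.Dropout(0.1)(attn))
ffn = layers.Dense(128, activation="relu")(z1)   # phi(Z W1 + b1)
ffn = layers.Dense(d)(ffn)                       # ... W2 + b2
z2 = layers.LayerNormalization()(z1 + layers.Dropout(0.1)(ffn))
pooled = layers.GlobalAveragePooling1D()(z2)     # fixed-length vector
outputs = layers.Dense(2, activation="softmax")(layers.Dropout(0.1)(pooled))
model = tf.keras.Model(inputs, outputs)
```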

Ensemble Voting (ensemble_model.py)

Base learners. A configurable set of pipelines that include character $n$-gram vectorisers and classical classifiers (logistic regression, random forest, naive Bayes).【F:src/ners/research/models/ensemble_model.py†L29-L96】

Aggregation. Given model posteriors \mathbf{p}_j(\mathbf{x}) and non-negative weights w_j, the soft-voting ensemble predicts

p(y=c\mid \mathbf{x}) = \frac{1}{\sum_j w_j} \sum_j w_j \, p_j(y=c\mid \mathbf{x}).

Hard voting instead returns the majority vote \hat{y} = \operatorname{mode}\{\arg\max_c p_j(y=c\mid \mathbf{x}) : j = 1, \dots, J\} over the J base learners.
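
A minimal scikit-learn sketch of soft voting over character $n$-gram pipelines; the base learners and weights shown are illustrative:

```python
# Sketch of the soft-voting ensemble; weights w_j and toy data are illustrative.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def char_pipeline(clf):
    """Wrap a classifier behind a character n-gram vectoriser."""
    return make_pipeline(CountVectorizer(analyzer="char", ngram_range=(2, 4)), clf)


ensemble = VotingClassifier(
    estimators=[
        ("lr", char_pipeline(LogisticRegression(max_iter=1000))),
        ("rf", char_pipeline(RandomForestClassifier(n_estimators=100))),
        ("nb", char_pipeline(MultinomialNB())),
    ],
    voting="soft",        # weighted average of posteriors
    weights=[2, 1, 1],    # w_j in the equation above
)
ensemble.fit(["Marie", "Jean", "Anne", "Paul"], [1, 0, 1, 0])
```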