Formal Model Specifications
This document formalises the statistical models implemented in
src/ners/research/models. Throughout, the training set is
\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^N with labels
y^{(i)} \in \{0,1\} for the binary gender classes. Feature vectors
\mathbf{x}^{(i)} combine
- character $n$-gram count representations of name strings produced by
  CountVectorizer or TfidfVectorizer, and
- engineered scalar or categorical metadata (e.g., name length, province)
  that are either used directly or encoded by LabelEncoder.
For neural architectures, character or token sequences are converted into
integer index sequences using a Tokenizer before being padded to a
maximum length specified in the configuration. Predictions are returned as
class posterior probabilities via a softmax layer unless otherwise noted.
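The sequence-encoding step described above can be sketched in pure Python; `build_char_index` and `encode_and_pad` are hypothetical stand-ins for the Tokenizer/pad_sequences pipeline, not the project's actual helpers.

```python
# Minimal sketch of character-level integer encoding with padding.
# Index 0 is reserved for padding and out-of-vocabulary characters.

def build_char_index(names):
    """Assign each distinct character a positive integer id."""
    chars = sorted({ch for name in names for ch in name})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode_and_pad(name, char_index, max_len):
    """Convert a name to ids, truncate to max_len, right-pad with zeros."""
    ids = [char_index.get(ch, 0) for ch in name][:max_len]
    return ids + [0] * (max_len - len(ids))

index = build_char_index(["ana", "ben"])
print(encode_and_pad("ana", index, 5))
```

A real Tokenizer also lowercases and filters characters according to its configuration; this sketch only shows the id-mapping and padding mechanics.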
Logistic Regression (logistic_regression_model.py)
Feature map. Character $n$-gram counts
\phi(\mathbf{x}) \in \mathbb{R}^d obtained with
CountVectorizer(analyzer="char", ngram_range=(2,4)) (default configuration).【F:src/ners/research/models/logistic_regression_model.py†L16-L46】
Model. The linear logit for class 1 is
z = \mathbf{w}^\top \phi(\mathbf{x}) + b. The class posteriors are
p(y=1\mid \mathbf{x}) = \sigma(z) and p(y=0\mid \mathbf{x}) = 1 - \sigma(z)
with \sigma(u) = (1 + e^{-u})^{-1}.
Training objective. Minimise the regularised negative log-likelihood
\mathcal{L}(\mathbf{w}, b) = -\sum_{i=1}^N \left[y^{(i)}\log p(y=1\mid \mathbf{x}^{(i)}) + (1-y^{(i)}) \log p(y=0\mid \mathbf{x}^{(i)})\right] + \lambda R(\mathbf{w}),
where R is the penalty induced by the chosen solver (e.g., \ell_2 for
liblinear).
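The posterior computation above reduces to a sigmoid of a sparse dot product. A minimal sketch, with toy bigram counts and illustrative (not fitted) weights:

```python
import math

def sigmoid(u):
    """Logistic function sigma(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def posterior(w, b, phi):
    """p(y=1|x) for a sparse feature-count dict phi and weight dict w."""
    z = b + sum(w.get(j, 0.0) * count for j, count in phi.items())
    return sigmoid(z)

# Toy character-bigram counts for a name; weights are illustrative only.
phi = {"an": 2, "na": 1}
w = {"an": 0.8, "na": -0.3}
p1 = posterior(w, 0.1, phi)
print(p1, 1 - p1)  # p(y=1|x) and p(y=0|x)
```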
Multinomial Naive Bayes (naive_bayes_model.py)
Feature map. Character $n$-gram counts
\phi(\mathbf{x}) \in \mathbb{N}^d derived with
CountVectorizer(analyzer="char", ngram_range=(2,4)) by default.【F:src/ners/research/models/naive_bayes_model.py†L15-L38】
Generative model. For each class c \in \{0,1\}, the class prior is
\pi_c = \frac{N_c}{N}. Conditional feature probabilities are estimated with
Laplace smoothing (parameter \alpha):
\theta_{cj} = \frac{N_{cj} + \alpha}{\sum_{k=1}^d (N_{ck} + \alpha)},
where N_{cj} counts the total occurrences of feature j among examples of
class c. The likelihood of an input with counts \phi_j(\mathbf{x}) is
p(\phi(\mathbf{x})\mid y=c) = \prod_{j=1}^d \theta_{cj}^{\phi_j(\mathbf{x})}.
Inference. Predict with the maximum a posteriori (MAP) decision
\hat{y} = \arg\max_c \left\{ \log \pi_c + \sum_{j=1}^d \phi_j(\mathbf{x}) \log \theta_{cj} \right\}.
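The smoothing and MAP equations above can be exercised on toy count data; the per-class count vectors and uniform priors below are illustrative, not drawn from the project's corpus.

```python
import math

def fit_nb(counts_by_class, priors, alpha=1.0):
    """Laplace-smoothed log-parameters: log pi_c and log theta_cj."""
    log_theta = {}
    for c, counts in counts_by_class.items():
        total = sum(n + alpha for n in counts)  # sum_k (N_ck + alpha)
        log_theta[c] = [math.log((n + alpha) / total) for n in counts]
    log_prior = {c: math.log(p) for c, p in priors.items()}
    return log_prior, log_theta

def map_predict(phi, log_prior, log_theta):
    """argmax_c { log pi_c + sum_j phi_j log theta_cj }."""
    def score(c):
        return log_prior[c] + sum(f * t for f, t in zip(phi, log_theta[c]))
    return max(log_prior, key=score)

# Toy corpus: 3 features, two equally likely classes.
log_prior, log_theta = fit_nb({0: [4, 1, 1], 1: [1, 1, 4]}, {0: 0.5, 1: 0.5})
print(map_predict([2, 0, 0], log_prior, log_theta))
```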
Random Forest (random_forest_model.py)
Feature map. Concatenation of engineered numerical features and label
encoded categorical attributes produced on demand in
prepare_features.【F:src/ners/research/models/random_forest_model.py†L28-L71】
Model. An ensemble of T decision trees \{T_t\}_{t=1}^T, each trained on
a bootstrap sample of the data with random feature sub-sampling. Each tree
outputs a class prediction T_t(\mathbf{x}) \in \{0,1\}. The forest prediction
is the mode of individual votes:
\hat{y} = \operatorname{mode}\{T_t(\mathbf{x}) : t = 1,\dots,T\}.
Class probability. For soft outputs,
p(y=1\mid \mathbf{x}) = \frac{1}{T} \sum_{t=1}^T p_t(y=1\mid \mathbf{x}) where
p_t is the class distribution estimated at the leaf reached by
\mathbf{x} in tree t.
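Both aggregation rules are simple to state in code; the votes and leaf distributions below are hypothetical values for a five-tree forest, not outputs of the project's model.

```python
from collections import Counter

def forest_mode(tree_preds):
    """Mode of individual tree votes, as in the hard prediction rule."""
    return Counter(tree_preds).most_common(1)[0][0]

def forest_prob(leaf_p1):
    """Average of per-tree leaf estimates p_t(y=1|x): the soft output."""
    return sum(leaf_p1) / len(leaf_p1)

votes = [1, 0, 1, 1, 0]              # T_t(x) for T = 5 hypothetical trees
leaf_p1 = [0.9, 0.4, 0.7, 0.8, 0.2]  # p_t(y=1|x) at the reached leaves
print(forest_mode(votes), forest_prob(leaf_p1))
```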
LightGBM (lightgbm_model.py)
Feature map. Hybrid of numeric inputs, categorical label encodings, and
character $n$-gram counts expanded into dense columns and assembled into a
feature matrix persisted in self.feature_columns.【F:src/ners/research/models/lightgbm_model.py†L38-L118】
Model. Gradient boosted decision trees forming an additive function
F_M(\mathbf{x}) = F_0(\mathbf{x}) + \sum_{m=1}^M \eta\, h_m(\mathbf{x}), where h_m denotes the
$m$-th tree, \eta is the learning rate, and F_0 is a constant initial score.
Training objective. LightGBM minimises
\mathcal{L}^{(m)} = \sum_{i=1}^N \ell\big(y^{(i)}, F_{m-1}(\mathbf{x}^{(i)}) + h_m(\mathbf{x}^{(i)})\big) + \Omega(h_m),
using second-order Taylor approximations of the loss \ell (binary log-loss by
default) and regulariser \Omega determined by tree complexity constraints.
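For the default binary log-loss, the first- and second-order terms of that Taylor expansion have closed forms: with p = \sigma(F), g = p - y and h = p(1-p). A sketch that checks g against a central finite difference (values are arbitrary):

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def grad_hess(y, f):
    """g, h of the binary log-loss w.r.t. the raw score f: p - y, p(1-p)."""
    p = sigmoid(f)
    return p - y, p * (1.0 - p)

def logloss(y, f):
    p = sigmoid(f)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

y, f, eps = 1, 0.3, 1e-6
g, h = grad_hess(y, f)
g_num = (logloss(y, f + eps) - logloss(y, f - eps)) / (2 * eps)
print(g, g_num)  # the analytic and numeric gradients should agree
```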
XGBoost (xgboost_model.py)
Feature map. Combination of numeric metadata, categorical label encodings,
and character $n$-gram counts as described in prepare_features.【F:src/ners/research/models/xgboost_model.py†L41-L113】
Model. Additive ensemble of regression trees
F_M(\mathbf{x}) = \sum_{m=1}^M f_m(\mathbf{x}) with f_m \in \mathcal{F}, the
space of trees with fixed structure.
Training objective. At boosting iteration m, minimise the regularised
objective
\mathcal{L}^{(m)} = \sum_{i=1}^N \ell\big(y^{(i)}, F_{m-1}(\mathbf{x}^{(i)}) + f_m(\mathbf{x}^{(i)})\big) + \Omega(f_m),
where \Omega(f) = \gamma T_f + \tfrac{1}{2} \lambda \sum_{j=1}^{T_f} w_j^2 penalises the
number of leaves T_f and their scores w_j. The optimal leaf weights follow
w_j^{\star} = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda},
with g_i and h_i denoting first- and second-order gradients of the loss for
sample i.
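The optimal leaf weight formula is a one-liner; the gradient statistics below are hypothetical values for three samples routed to one leaf.

```python
def optimal_leaf_weight(grads, hess, lam):
    """w* = -sum(g_i) / (sum(h_i) + lambda) over samples in the leaf."""
    return -sum(grads) / (sum(hess) + lam)

# Hypothetical per-sample gradients/hessians in a leaf, lambda = 1.
g = [0.4, -0.2, 0.3]
h = [0.24, 0.21, 0.25]
print(optimal_leaf_weight(g, h, 1.0))
```

Larger \lambda shrinks the weight toward zero, which is exactly how the \tfrac{1}{2}\lambda \sum_j w_j^2 term regularises the trees.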
Convolutional Neural Network (cnn_model.py)
Input encoding. Character-level token sequences padded to length
L using Tokenizer(char_level=True) followed by pad_sequences.【F:src/ners/research/models/cnn_model.py†L23-L64】
Architecture. Embedding layer producing X \in \mathbb{R}^{L \times d},
followed by two convolutional blocks:
H^{(1)} = \operatorname{ReLU}(\operatorname{Conv1D}_{k_1}(X)) with kernel size k_1 and F filters, followed by temporal max-pooling;
H^{(2)} = \operatorname{ReLU}(\operatorname{Conv1D}_{k_2}(H^{(1)})) with kernel size k_2 and F filters.
Global max-pooling yields h = \max_{t} H^{(2)}_{t,:}, which passes through a
dense layer and dropout before the softmax layer producing
p(y\mid \mathbf{x}) = \operatorname{softmax}(W h + b).
Loss. Cross-entropy between softmax output and the ground-truth label.
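The softmax head and its cross-entropy loss can be sketched in a few lines; the logits below stand in for Wh + b and are illustrative values only.

```python
import math

def softmax(z):
    """Numerically stable softmax: subtract the max logit before exp."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(probs, y):
    """-log p(y|x) for the ground-truth class index y."""
    return -math.log(probs[y])

# Hypothetical logits from the dense head for the two gender classes.
probs = softmax([2.0, 0.5])
print(probs, cross_entropy(probs, 0))
```

For two classes, the softmax reduces to a sigmoid of the logit difference, which connects this head to the logistic model above.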
Bidirectional GRU (bigru_model.py)
Input encoding. Word-level sequences padded to length L with
Tokenizer(char_level=False) and pad_sequences.【F:src/ners/research/models/bigru_model.py†L47-L69】
Recurrent dynamics. A stacked bidirectional GRU computes forward and backward hidden states according to
[ \begin{aligned} \mathbf{z}_t &= \sigma(W_z \mathbf{x}_t + U_z \mathbf{h}_{t-1} + \mathbf{b}_z),\\ \mathbf{r}_t &= \sigma(W_r \mathbf{x}_t + U_r \mathbf{h}_{t-1} + \mathbf{b}_r),\\ \tilde{\mathbf{h}}_t &= \tanh\big(W_h \mathbf{x}_t + U_h(\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{b}_h\big),\\ \mathbf{h}_t &= (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t. \end{aligned} ]
The final representation concatenates the last forward and backward states before passing through dense layers and a softmax classifier.
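For a single hidden unit the gate equations reduce to scalar arithmetic, which makes them easy to trace by hand. A sketch with arbitrary parameter values (a real layer uses weight matrices, one row per hidden unit):

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def gru_step(x, h_prev, P):
    """One scalar GRU update following the gate equations above."""
    z = sigmoid(P["Wz"] * x + P["Uz"] * h_prev + P["bz"])  # update gate
    r = sigmoid(P["Wr"] * x + P["Ur"] * h_prev + P["br"])  # reset gate
    h_tilde = math.tanh(P["Wh"] * x + P["Uh"] * (r * h_prev) + P["bh"])
    return (1 - z) * h_prev + z * h_tilde

# Arbitrary illustrative parameters for a one-unit GRU.
P = dict(Wz=0.5, Uz=0.1, bz=0.0, Wr=0.4, Ur=0.2, br=0.0,
         Wh=0.9, Uh=0.3, bh=0.0)
h = 0.0
for x in [1.0, -0.5, 0.25]:
    h = gru_step(x, h, P)
print(h)
```

A bidirectional layer runs this recurrence once left-to-right and once right-to-left with separate parameters, then concatenates the two terminal states.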
Bidirectional LSTM (lstm_model.py)
Input encoding. Word-level sequences padded to length L using the same
pipeline as the BiGRU model.【F:src/ners/research/models/lstm_model.py†L45-L67】
Recurrent dynamics. At each timestep, the LSTM updates its memory cell via
[ \begin{aligned} \mathbf{i}_t &= \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i),\\ \mathbf{f}_t &= \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f),\\ \mathbf{o}_t &= \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + \mathbf{b}_o),\\ \tilde{\mathbf{c}}_t &= \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c),\\ \mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t,\\ \mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t). \end{aligned} ]
Bidirectional aggregation concatenates terminal forward/backward hidden vectors before the dense-softmax head.
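As with the GRU, the LSTM update can be traced for a single hidden unit; the parameter values below are arbitrary and purely illustrative.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def lstm_step(x, h_prev, c_prev, P):
    """One scalar LSTM update following the gate equations above."""
    i = sigmoid(P["Wi"] * x + P["Ui"] * h_prev + P["bi"])  # input gate
    f = sigmoid(P["Wf"] * x + P["Uf"] * h_prev + P["bf"])  # forget gate
    o = sigmoid(P["Wo"] * x + P["Uo"] * h_prev + P["bo"])  # output gate
    c_tilde = math.tanh(P["Wc"] * x + P["Uc"] * h_prev + P["bc"])
    c = f * c_prev + i * c_tilde
    return o * math.tanh(c), c

# Arbitrary illustrative parameters for a one-unit LSTM.
P = dict(Wi=0.5, Wf=0.5, Wo=0.5, Wc=0.5,
         Ui=0.1, Uf=0.1, Uo=0.1, Uc=0.1,
         bi=0.0, bf=0.0, bo=0.0, bc=0.0)
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, P)
print(h, c)
```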
Transformer Encoder (transformer_model.py)
Input encoding. Token sequences padded to a fixed length with positional
indices \{0, \ldots, L-1\} added through a learned positional embedding.
Tokenizer initialises the vocabulary; padding uses pad_sequences.【F:src/ners/research/models/transformer_model.py†L25-L77】
Architecture. For hidden dimension d, the encoder block computes
[ \begin{aligned} Z^{(0)} &= X + P,\\ Z^{(1)} &= \operatorname{LayerNorm}\big(Z^{(0)} + \operatorname{Dropout}(\operatorname{MHAttn}(Z^{(0)}))\big),\\ Z^{(2)} &= \operatorname{LayerNorm}\big(Z^{(1)} + \operatorname{Dropout}(\phi(Z^{(1)} W_1 + b_1) W_2 + b_2)\big), \end{aligned} ]
where \operatorname{MHAttn} is multi-head self-attention with
H heads. Global average pooling produces a fixed-length vector for the dense
and dropout layers before the final softmax classifier.
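The core of \operatorname{MHAttn} is scaled dot-product attention. A deliberately simplified single-head sketch with identity projections (Q = K = V = X), which omits the learned projection matrices, masking, and multi-head splitting:

```python
import math

def softmax(row):
    """Numerically stable softmax over one score row."""
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def self_attention(X):
    """softmax(X X^T / sqrt(d)) X for a list-of-lists matrix X (L x d)."""
    d = len(X[0])
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in X]
              for qi in X]
    A = [softmax(row) for row in scores]
    # Each output row is a convex combination of the input rows.
    return [[sum(a * v for a, v in zip(arow, col)) for col in zip(*X)]
            for arow in A]

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy L = 3, d = 2 embeddings
print(self_attention(X))
```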
Ensemble Voting (ensemble_model.py)
Base learners. A configurable set of pipelines that include character $n$-gram vectorisers and classical classifiers (logistic regression, random forest, naive Bayes).【F:src/ners/research/models/ensemble_model.py†L29-L96】
Aggregation. Given base-learner posteriors \mathbf{p}_j(\mathbf{x}), j = 1,\dots,J, and
non-negative weights w_j, the soft-voting ensemble predicts
p(y=c\mid \mathbf{x}) = \frac{1}{\sum_{j=1}^J w_j} \sum_{j=1}^J w_j \, p_j(y=c\mid \mathbf{x}).
Hard voting instead returns
\hat{y} = \operatorname{mode}\{\arg\max_c p_j(y=c\mid \mathbf{x}) : j = 1,\dots,J\}.
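Both rules are straightforward to implement; the three base-learner posteriors and weights below are hypothetical values for illustration.

```python
from collections import Counter

def soft_vote(posteriors, weights):
    """Weight-normalised average of base-learner class distributions."""
    total = sum(weights)
    n_classes = len(posteriors[0])
    return [sum(w * p[c] for w, p in zip(weights, posteriors)) / total
            for c in range(n_classes)]

def hard_vote(posteriors):
    """Mode of the base learners' argmax predictions."""
    preds = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    return Counter(preds).most_common(1)[0][0]

# Hypothetical posteriors from three base learners over the two classes.
posteriors = [[0.7, 0.3], [0.4, 0.6], [0.9, 0.1]]
print(soft_vote(posteriors, [1.0, 1.0, 2.0]), hard_vote(posteriors))
```

Note that soft and hard voting can disagree: a confident minority can dominate the weighted average even when it loses the vote count.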