feat: document models
@@ -0,0 +1,93 @@
# Model Notation Reference

This document summarises the mathematical formulation and notation behind the models available in `research/models`. In all cases, the input example is represented by a feature vector $\mathbf{x}$ (after any feature-extraction or vectorisation steps) and the target label belongs to a finite set of classes $\mathcal{Y}$.

## Logistic Regression
- Decision function: $z = \mathbf{w}^\top \mathbf{x} + b$.
- Binary posterior: $p(y=1\mid \mathbf{x}) = \sigma(z) = \frac{1}{1 + e^{-z}}$ and $p(y=0\mid \mathbf{x}) = 1 - \sigma(z)$.
- Multi-class (one-vs-rest or softmax): $p(y=c\mid \mathbf{x}) = \frac{\exp(\mathbf{w}_c^\top \mathbf{x} + b_c)}{\sum_{k \in \mathcal{Y}} \exp(\mathbf{w}_k^\top \mathbf{x} + b_k)}$.
- Loss: negative log-likelihood $\mathcal{L} = -\sum_i \log p(y_i\mid \mathbf{x}_i)$ plus regularisation when configured.
- Gender prediction rationale: linear decision boundaries over character n-gram counts provide a strong, interpretable baseline for name-based gender attribution.
- Implementation notes: uses character n-grams via `CountVectorizer`; `solver='liblinear'` with optional `class_weight` for imbalanced classes and `n_jobs` to parallelise one-vs-rest fits (no effect in the binary case). A short sketch follows this list.

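A minimal scikit-learn sketch of this setup is shown below; the sample names, the label encoding, and the hyperparameter values are illustrative only, not the project's configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Illustrative toy data; real training uses the project's name corpus.
names = ["maria", "johan", "fatima", "pieter"]
labels = [1, 0, 1, 0]  # 1 = female, 0 = male (example encoding)

pipeline = Pipeline([
    ("vectorizer", CountVectorizer(analyzer="char", ngram_range=(2, 5), max_features=10000)),
    ("classifier", LogisticRegression(max_iter=1000, solver="liblinear")),
])
pipeline.fit(names, labels)
# Columns follow sorted class labels: [p(y=0), p(y=1)], i.e. sigma(z) for class 1.
print(pipeline.predict_proba(["anika"]))
```
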
## Multinomial Naive Bayes
- Class prior: $p(y=c) = \frac{N_c}{N}$ where $N_c$ counts training instances in class $c$.
- Conditional likelihood (bag-of-ngrams): $p(\mathbf{x}\mid y=c) = \prod_{j=1}^{d} p(x_j\mid y=c)^{x_j}$ with categorical parameters estimated via Laplace smoothing.
- Posterior up to normalisation: $\log p(y=c\mid \mathbf{x}) = \log p(y=c) + \sum_{j=1}^{d} x_j \log p(x_j\mid y=c) + \mathrm{const}$.
- Gender prediction rationale: captures the relative frequency of character patterns associated with each gender, giving a fast and robust probabilistic baseline for sparse n-gram features.
- Implementation notes: character n-gram counts with Laplace smoothing; extremely fast to train and deploy. A worked posterior example follows this list.

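The posterior computation above can be traced with a small NumPy example; all counts, priors, and feature values below are made-up illustrative numbers.

```python
import numpy as np

# Toy bag-of-ngrams setup: 2 classes, 3 n-gram features (illustrative numbers).
log_prior = np.log(np.array([0.5, 0.5]))             # log p(y=c)
counts = np.array([[3.0, 1.0, 0.0],                  # class 0 n-gram counts
                   [1.0, 2.0, 4.0]])                 # class 1 n-gram counts
alpha = 1.0                                          # Laplace smoothing
log_likelihood = np.log((counts + alpha) / (counts + alpha).sum(axis=1, keepdims=True))

x = np.array([2.0, 0.0, 1.0])                        # n-gram counts of one name
log_posterior = log_prior + log_likelihood @ x       # log p(y=c) + sum_j x_j log p(x_j|c)
probs = np.exp(log_posterior - log_posterior.max())
probs /= probs.sum()                                 # normalise over classes
print(probs)
```
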
## Support Vector Machine (RBF Kernel)
- Dual-form decision function: $f(\mathbf{x}) = \operatorname{sign}\Big( \sum_{i=1}^{M} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \Big)$.
- RBF kernel: $K(\mathbf{x}_i, \mathbf{x}) = \exp\big(-\gamma \lVert \mathbf{x}_i - \mathbf{x} \rVert_2^2\big)$.
- Soft-margin optimisation: $\min_{\mathbf{w}, \xi} \frac{1}{2}\lVert \mathbf{w} \rVert_2^2 + C \sum_i \xi_i$ s.t. $y_i(\mathbf{w}^\top \phi(\mathbf{x}_i) + b) \geq 1 - \xi_i$, $\xi_i \geq 0$.
- Gender prediction rationale: non-linear kernels model subtle character-pattern interactions beyond linear baselines, improving separability when male and female names share prefixes but diverge in internal structure.
- Implementation notes: TF–IDF character features; increased `cache_size` and optional `class_weight` for stability on imbalanced data. A kernel sketch follows this list.

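The dual decision function can be evaluated directly in NumPy; the support vectors, labels, dual coefficients, and $\gamma$ below are illustrative stand-ins rather than fitted values.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Toy support vectors, labels in {-1, +1}, and dual coefficients (illustrative).
support_vectors = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([1, -1, 1])
alpha = np.array([0.7, 0.7, 0.3])
b = 0.1

x = np.array([0.9, 0.8])
score = sum(a_i * y_i * rbf_kernel(sv, x)
            for a_i, y_i, sv in zip(alpha, y, support_vectors)) + b
print(score, np.sign(score))  # signed margin and predicted class
```
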
## Random Forest
- Ensemble of $T$ decision trees: $\hat{y} = \operatorname{mode}\{ h_t(\mathbf{x}) : t=1, \dots, T \}$ for classification, where $h_t$ is the $t$-th tree.
- Each tree draws a bootstrap sample of the training set and a random subset of features at each split.
- Feature importance (used in implementation): mean decrease in impurity aggregated over splits per feature.
- Gender prediction rationale: handles heterogeneous engineered features (length, province, endings) without heavy preprocessing, while delivering interpretable feature-importance signals.
- Implementation notes: enables `n_jobs=-1` for parallel trees; persistent label encoders ensure stable categorical mappings. A short sketch follows this list.

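A compact scikit-learn sketch of the ensemble and its impurity-based importances; the engineered features and their values are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy engineered features: [name_length, ends_with_vowel, province_code] (illustrative).
X = np.array([[5, 1, 2], [6, 0, 1], [4, 1, 2], [7, 0, 3], [5, 1, 1], [6, 0, 3]])
y = np.array([1, 0, 1, 0, 1, 0])

forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
forest.fit(X, y)

# Mean-decrease-in-impurity importances, as referenced above.
for name, importance in zip(["length", "ends_with_vowel", "province"], forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```
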
## LightGBM (Gradient Boosted Trees)
- Additive model: $F_0(\mathbf{x}) = \hat{p}$ (initial prediction), $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta h_m(\mathbf{x})$.
- Each weak learner $h_m$ is a decision tree grown with a leaf-wise strategy and depth constraint.
- Optimises a differentiable loss (default: logistic) using first- and second-order gradients over the data in each boosting iteration.
- Gender prediction rationale: excels with sparse categorical encodings and numerous engineered features, offering strong accuracy with manageable inference cost.
- Implementation notes: `objective='binary'`, `n_jobs=-1` for throughput; works well with compact character-gram features plus metadata. A short sketch follows this list.

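A hedged sketch of the classifier configuration described above, fitted on synthetic stand-in features; the data and most parameter values are illustrative rather than the project's settings.

```python
import numpy as np
from lightgbm import LGBMClassifier

# Synthetic stand-in for character-gram counts plus metadata (illustrative only).
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 20)).astype(float)
y = (X[:, 0] + X[:, 3] > 4).astype(int)

clf = LGBMClassifier(
    objective="binary",   # logistic loss, as in the implementation notes
    n_estimators=100,
    learning_rate=0.1,    # eta in F_m = F_{m-1} + eta * h_m
    n_jobs=-1,
)
clf.fit(X, y)
print(clf.predict_proba(X[:3]))
```
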
## XGBoost
- Objective: $\mathcal{L}^{(t)} = \sum_{i} \ell(y_i, \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)) + \Omega(f_t)$ with regulariser $\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_j w_j^2$.
- Tree score expansion via second-order Taylor approximation; optimal leaf weight $w_j = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$ where $g_i$ and $h_i$ are gradient and Hessian statistics.
- Final prediction: $\hat{y}(\mathbf{x}) = \sum_{t=1}^{M} \eta f_t(\mathbf{x})$.
- Gender prediction rationale: strong regularisation and gradient-informed splits capture interactions between textual and metadata features; suited to high-stakes deployment when tuned carefully.
- Implementation notes: `tree_method='hist'`, `n_jobs=-1` for efficient CPU training; integrates engineered categorical encodings. A worked leaf-weight example follows this list.

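The optimal leaf-weight formula can be checked with a few lines of NumPy; the gradients and Hessians below come from a hypothetical previous boosting round with logistic loss.

```python
import numpy as np

# Logistic loss: g_i = p_i - y_i, h_i = p_i * (1 - p_i), computed from the
# previous round's predictions (values below are illustrative).
p_prev = np.array([0.6, 0.3, 0.8, 0.5])   # predicted probabilities for leaf members
y = np.array([1, 0, 1, 0])

g = p_prev - y                 # first-order gradients
h = p_prev * (1 - p_prev)      # second-order (Hessian) statistics
lam = 1.0                      # L2 regularisation lambda

# Optimal weight for a leaf containing these four examples.
w_leaf = -g.sum() / (h.sum() + lam)
print(w_leaf)
```
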
## Convolutional Neural Network (1D)
- Token/character embeddings produce $X \in \mathbb{R}^{L \times d}$.
- Convolution layer: $H^{(k)} = \operatorname{ReLU}(X * W^{(k)} + b^{(k)})$ where $*$ denotes 1D convolution with filter $W^{(k)}$.
- Pooling summarises the temporal dimension (max or global max); dense layers map the pooled vector to logits $\mathbf{z}$.
- Output probabilities: $p(y=c\mid \mathbf{x}) = \operatorname{softmax}_c(\mathbf{z})$; loss via cross-entropy.
- Gender prediction rationale: convolutional filters learn discriminative prefixes, suffixes, and intra-name motifs directly from characters, accommodating mixed-language inputs.
- Implementation notes: adds `SpatialDropout1D` on embeddings and `padding='same'` in conv layers for stability and length-invariance.

## Bidirectional GRU
- Forward GRU recursion: $\begin{aligned} &\mathbf{z}_t = \sigma(W_z \mathbf{x}_t + U_z \mathbf{h}_{t-1} + \mathbf{b}_z),\\ &\mathbf{r}_t = \sigma(W_r \mathbf{x}_t + U_r \mathbf{h}_{t-1} + \mathbf{b}_r),\\ &\tilde{\mathbf{h}}_t = \tanh(W_h \mathbf{x}_t + U_h(\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{b}_h),\\ &\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t. \end{aligned}$
- Backward GRU mirrors the recurrence from $t=L$ to $1$; the final representation concatenates $[\mathbf{h}_L^{\rightarrow}; \mathbf{h}_1^{\leftarrow}]$ before dense layers and softmax output.
- Gender prediction rationale: bidirectional context processes character sequences in both directions, learning gender-specific morphemes appearing at any position within the name.
- Implementation notes: `Embedding(mask_zero=True)` propagates masks to GRUs, ignoring padding; optional `recurrent_dropout` reduces overfitting. A single-step example follows this list.

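For intuition, one forward GRU update can be written directly from the equations above; the parameter shapes and random values are illustrative, while the actual model uses Keras `GRU` layers as in the implementation notes.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h):
    """One forward GRU update, following the equations above."""
    z = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)            # update gate
    r = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)            # reset gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r * h_prev) + b_h)
    return (1 - z) * h_prev + z * h_tilde

# Tiny random parameters: 4-dim character embedding, 3-dim hidden state (illustrative).
rng = np.random.default_rng(0)
d, k = 4, 3
params = [rng.normal(size=s) for s in [(k, d), (k, k), (k,)] * 3]
h = np.zeros(k)
for x_t in rng.normal(size=(5, d)):   # 5 timesteps of embedded characters
    h = gru_step(x_t, h, *params)
print(h)
```
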
## LSTM
- Gates per timestep: $\begin{aligned} &\mathbf{i}_t = \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i),\\ &\mathbf{f}_t = \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f),\\ &\mathbf{o}_t = \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + \mathbf{b}_o),\\ &\tilde{\mathbf{c}}_t = \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c),\\ &\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t,\\ &\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t). \end{aligned}$
- Bidirectional stacking concatenates the final hidden vectors before classification via softmax.
- Gender prediction rationale: long short-term memory cells model long-range dependencies within names, capturing compound structures common in multilingual gendered naming conventions.
- Implementation notes: `Embedding(mask_zero=True)` and `recurrent_dropout` regularise sequence modelling across padded batches.

## Transformer Encoder (Single Block)
- Input embeddings $X \in \mathbb{R}^{L \times d}$ plus positional embeddings $P$ produce $Z^{(0)} = X + P$.
- Multi-head self-attention: $\operatorname{MHAttn}(Z) = \operatorname{Concat}(\text{head}_1, \dots, \text{head}_H) W^O$ where $\text{head}_h = \operatorname{softmax}\big(\frac{Q_h K_h^\top}{\sqrt{d_k}}\big) V_h$ and $(Q_h, K_h, V_h) = (Z W_h^Q, Z W_h^K, Z W_h^V)$.
- Add & norm: $Z^{(1)} = \operatorname{LayerNorm}(Z^{(0)} + \operatorname{Dropout}(\operatorname{MHAttn}(Z^{(0)})))$.
- Position-wise feed-forward: $Z^{(2)} = \operatorname{LayerNorm}(Z^{(1)} + \operatorname{Dropout}(\phi(Z^{(1)} W_1 + b_1) W_2 + b_2))$, with activation $\phi(\cdot)$ (ReLU).
- Sequence pooling (global average) feeds dense layers and a softmax classifier.
- Gender prediction rationale: self-attention captures global dependencies and shared subword units across names, outperforming recurrent models when sufficient labelled data is available; otherwise the risk of overfitting should be monitored.
- Implementation notes: `Embedding(mask_zero=True)` with learned positional embeddings; attention dropout (`attn_dropout`) and classifier dropout improve generalisation. A single-head attention sketch follows this list.

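A single attention head, written out in NumPy to mirror the formula above; the sequence length, dimensions, and random weights are illustrative.

```python
import numpy as np

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(Z, W_q, W_k, W_v):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V

# L = 6 character positions, d = 8 embedding dims, d_k = 4 (illustrative sizes).
rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 8))
head = single_head_attention(Z, *[rng.normal(size=(8, 4)) for _ in range(3)])
print(head.shape)  # (6, 4): one contextualised vector per position
```
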
## Ensemble (Soft Voting)
- Base learners indexed by $j$ output probability vectors $\mathbf{p}_j(\mathbf{x})$.
- Aggregated prediction with weights $w_j$: $p(y=c\mid \mathbf{x}) = \frac{1}{\sum_j w_j} \sum_j w_j \, p_j(y=c\mid \mathbf{x})$.
- Hard voting variant predicts $\hat{y} = \operatorname{mode}\{ \hat{y}_j \}$, where $\hat{y}_j = \arg\max_c p_j(y=c\mid \mathbf{x})$.
- Gender prediction rationale: blends complementary inductive biases (linear, tree-based, neural) to reduce variance on ambiguous names; remains suitable provided individual members are well-calibrated (see the voting sketch below).

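Soft and hard voting reduce to a few lines of NumPy; the per-model probabilities and weights below are illustrative.

```python
import numpy as np

# Per-model class probabilities for one name: rows = base learners, cols = classes.
probs = np.array([
    [0.70, 0.30],   # e.g. logistic regression
    [0.55, 0.45],   # e.g. random forest
    [0.80, 0.20],   # e.g. naive Bayes
])
weights = np.array([1.0, 1.0, 2.0])   # illustrative per-model weights w_j

soft_vote = weights @ probs / weights.sum()              # weighted average of probabilities
hard_vote = np.bincount(probs.argmax(axis=1)).argmax()   # majority over per-model argmax labels

print(soft_vote, soft_vote.argmax(), hard_vote)
```
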
@@ -17,17 +17,38 @@ class BiGRUModel(NeuralNetworkModel):
        params = kwargs
        model = Sequential(
            [
                Embedding(input_dim=vocab_size, output_dim=params.get("embedding_dim", 64)),
                # Mask padding tokens so recurrent layers ignore them; fix input length
                # for better shape inference and to support masking through the stack.
                Embedding(
                    input_dim=vocab_size,
                    output_dim=params.get("embedding_dim", 64),
                    input_length=max_len,
                    mask_zero=True,
                ),
                # First recurrent block returns full sequences to allow stacking.
                # Moderate dropout + optional recurrent_dropout to reduce overfitting
                # on short names while retaining temporal signal.
                Bidirectional(
                    GRU(
                        params.get("gru_units", 32),
                        return_sequences=True,
                        dropout=params.get("dropout", 0.2),
                        recurrent_dropout=params.get("recurrent_dropout", 0.0),
                    )
                ),
                Bidirectional(GRU(params.get("gru_units", 32), dropout=params.get("dropout", 0.2))),
                # Second GRU summarizes to the last hidden state (no return_sequences),
                # capturing bidirectional context efficiently for classification.
                Bidirectional(
                    GRU(
                        params.get("gru_units", 32),
                        dropout=params.get("dropout", 0.2),
                        recurrent_dropout=params.get("recurrent_dropout", 0.0),
                    )
                ),
                # Small dense head; ReLU + dropout for capacity and regularization.
                Dense(64, activation="relu"),
                Dropout(params.get("dropout", 0.5)),
                # Two-way softmax for binary gender classification.
                Dense(2, activation="softmax"),
            ]
        )
@@ -38,19 +59,13 @@ class BiGRUModel(NeuralNetworkModel):
        return model

    def prepare_features(self, X: pd.DataFrame) -> np.ndarray:
        text_data = []
        for feature_type in self.config.features:
            if feature_type.value in X.columns:
                text_data.extend(X[feature_type.value].astype(str).tolist())

        if not text_data:
            raise ValueError("No text data found in the provided DataFrame.")
        text_data = self._collect_text_corpus(X)

        if self.tokenizer is None:
            self.tokenizer = Tokenizer(char_level=False, lower=True, oov_token="<OOV>")
            self.tokenizer.fit_on_texts(text_data)

        sequences = self.tokenizer.texts_to_sequences(text_data[: len(X)])
        sequences = self.tokenizer.texts_to_sequences(text_data)
        max_len = self.config.model_params.get("max_len", 6)

        return pad_sequences(sequences, maxlen=max_len, padding="post")
@@ -9,6 +9,7 @@ from tensorflow.keras.layers import (
    GlobalMaxPooling1D,
    Dense,
    Dropout,
    SpatialDropout1D,
)
from tensorflow.keras.models import Sequential
@@ -24,21 +25,33 @@ class CNNModel(NeuralNetworkModel):
        params = kwargs
        model = Sequential(
            [
                # Learn char/subword embeddings; spatial dropout regularizes across channels
                # to make the model robust to noisy characters and transliteration.
                Embedding(input_dim=vocab_size, output_dim=params.get("embedding_dim", 64)),
                SpatialDropout1D(rate=params.get("embedding_dropout", 0.1)),
                # Small kernels capture short n-gram like patterns; padding='same' keeps
                # sequence length stable for simpler pooling behavior.
                Conv1D(
                    filters=params.get("filters", 64),
                    kernel_size=params.get("kernel_size", 3),
                    activation="relu",
                    padding="same",
                ),
                # Downsample to gain some position invariance and reduce computation.
                MaxPooling1D(pool_size=2),
                # Second conv layer to compose higher-level motifs (e.g., suffix+vowel).
                Conv1D(
                    filters=params.get("filters", 64),
                    kernel_size=params.get("kernel_size", 3),
                    activation="relu",
                    padding="same",
                ),
                # Global max pooling picks strongest motif evidence anywhere in the name.
                GlobalMaxPooling1D(),
                # Compact dense head with dropout to control overfitting.
                Dense(64, activation="relu"),
                Dropout(params.get("dropout", 0.5)),
                # Two-way softmax for binary classification.
                Dense(2, activation="softmax"),
            ]
        )
@@ -55,21 +68,14 @@ class CNNModel(NeuralNetworkModel):
        from tensorflow.keras.preprocessing.sequence import pad_sequences

        # Get text data from extracted features - use character level for CNN
        text_data = []
        for feature_type in self.config.features:
            if feature_type.value in X.columns:
                text_data.extend(X[feature_type.value].astype(str).tolist())

        if not text_data:
            # Fallback - should not happen if FeatureExtractor is properly configured
            text_data = [""] * len(X)
        text_data = self._collect_text_corpus(X)

        # Initialize character-level tokenizer
        if self.tokenizer is None:
            self.tokenizer = Tokenizer(char_level=True, lower=True, oov_token="<OOV>")
            self.tokenizer.fit_on_texts(text_data)

        sequences = self.tokenizer.texts_to_sequences(text_data[: len(X)])
        sequences = self.tokenizer.texts_to_sequences(text_data)
        max_len = self.config.model_params.get("max_len", 20)  # Longer for character level

        return pad_sequences(sequences, maxlen=max_len, padding="post")
@@ -31,7 +31,8 @@ class EnsembleModel(TraditionalModel):
            "base_models", ["logistic_regression", "random_forest", "naive_bayes"]
        )

        # Create base models with simplified configs
        # Create base models with simplified configs; diverse vectorizers/classifiers
        # encourage complementary errors that voting can average out.
        estimators = []
        for model_type in base_model_types:
            if model_type == "logistic_regression":
@@ -78,8 +79,10 @@ class EnsembleModel(TraditionalModel):
                )
                estimators.append((f"nb", model))

        # Soft voting averages probabilities (preferred when members are calibrated);
        # hard voting uses majority class. Parallelize member predictions.
        voting_type = params.get("voting", "soft")  # 'hard' or 'soft'
        return VotingClassifier(estimators=estimators, voting=voting_type)
        return VotingClassifier(estimators=estimators, voting=voting_type, n_jobs=params.get("n_jobs", -1))

    def prepare_features(self, X: pd.DataFrame) -> np.ndarray:
        text_features = []
@@ -20,6 +20,8 @@ class LightGBMModel(TraditionalModel):
    def build_model(self) -> BaseEstimator:
        params = self.config.model_params

        # Leaf-wise boosted trees excel on sparse/categorical mixes; binary objective
        # and parallelism improve training speed for this task.
        return lgb.LGBMClassifier(
            n_estimators=params.get("n_estimators", 100),
            max_depth=params.get("max_depth", -1),
@@ -28,6 +30,8 @@ class LightGBMModel(TraditionalModel):
            subsample=params.get("subsample", 0.8),
            colsample_bytree=params.get("colsample_bytree", 0.8),
            random_state=self.config.random_seed,
            objective=params.get("objective", "binary"),
            n_jobs=params.get("n_jobs", -1),
            verbose=2,
        )
@@ -13,14 +13,23 @@ class LogisticRegressionModel(TraditionalModel):

    def build_model(self) -> BaseEstimator:
        params = self.config.model_params
        # Character n-grams are strong signals for names; (2,5) balances
        # capturing prefixes/suffixes with tractable feature size.
        vectorizer = CountVectorizer(
            analyzer="char",
            ngram_range=params.get("ngram_range", (2, 5)),
            max_features=params.get("max_features", 10000),
        )

        # liblinear handles sparse, small-to-medium problems well; n_jobs parallelizes
        # OvR across classes (no effect for binary). class_weight can mitigate imbalance.
        classifier = LogisticRegression(
            max_iter=params.get("max_iter", 1000), random_state=self.config.random_seed, verbose=2
            max_iter=params.get("max_iter", 1000),
            random_state=self.config.random_seed,
            verbose=2,
            solver=params.get("solver", "liblinear"),
            n_jobs=params.get("n_jobs", -1),
            class_weight=params.get("class_weight", None),
        )

        return Pipeline([("vectorizer", vectorizer), ("classifier", classifier)])
@@ -2,7 +2,7 @@ from typing import Any

import numpy as np
import pandas as pd
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
@@ -17,10 +17,35 @@ class LSTMModel(NeuralNetworkModel):
        params = kwargs
        model = Sequential(
            [
                Embedding(input_dim=vocab_size, output_dim=params.get("embedding_dim", 64)),
                Bidirectional(LSTM(params.get("lstm_units", 32), return_sequences=True)),
                Bidirectional(LSTM(params.get("lstm_units", 32))),
                # Mask padding tokens; required for LSTM to ignore padded timesteps.
                Embedding(
                    input_dim=vocab_size,
                    output_dim=params.get("embedding_dim", 64),
                    input_length=max_len,
                    mask_zero=True,
                ),
                # Stacked bidirectional LSTMs: first returns sequences to feed the next.
                # Dropout/recurrent_dropout mitigate overfitting on short sequences.
                Bidirectional(
                    LSTM(
                        params.get("lstm_units", 32),
                        return_sequences=True,
                        dropout=params.get("dropout", 0.2),
                        recurrent_dropout=params.get("recurrent_dropout", 0.0),
                    )
                ),
                # Second LSTM condenses sequence to a fixed vector for classification.
                Bidirectional(
                    LSTM(
                        params.get("lstm_units", 32),
                        dropout=params.get("dropout", 0.2),
                        recurrent_dropout=params.get("recurrent_dropout", 0.0),
                    )
                ),
                # Compact dense head with dropout; sufficient capacity for name signals.
                Dense(64, activation="relu"),
                Dropout(params.get("dropout", 0.5)),
                # Two-way softmax for binary classification.
                Dense(2, activation="softmax"),
            ]
        )
@@ -31,14 +56,7 @@ class LSTMModel(NeuralNetworkModel):
        return model

    def prepare_features(self, X: pd.DataFrame) -> np.ndarray:
        text_data = []

        for feature_type in self.config.features:
            if feature_type.value in X.columns:
                text_data.extend(X[feature_type.value].astype(str).tolist())

        if not text_data:
            raise ValueError("No text data found in the provided DataFrame.")
        text_data = self._collect_text_corpus(X)

        # Initialize tokenizer if needed
        if self.tokenizer is None:
@@ -46,7 +64,7 @@ class LSTMModel(NeuralNetworkModel):
            self.tokenizer.fit_on_texts(text_data)

        # Convert to sequences
        sequences = self.tokenizer.texts_to_sequences(text_data[: len(X)])
        sequences = self.tokenizer.texts_to_sequences(text_data)
        max_len = self.config.model_params.get("max_len", 6)

        return pad_sequences(sequences, maxlen=max_len, padding="post")
@@ -13,12 +13,15 @@ class NaiveBayesModel(TraditionalModel):

    def build_model(self) -> BaseEstimator:
        params = self.config.model_params
        # Bag-of-character-ngrams aligns with Multinomial NB assumptions; (1,4)
        # includes unigrams for coverage and higher n for suffix/prefix cues.
        vectorizer = CountVectorizer(
            analyzer="char",
            ngram_range=params.get("ngram_range", (1, 4)),
            max_features=params.get("max_features", 8000),
        )

        # Laplace smoothing (alpha) counters zero counts for rare n-grams.
        classifier = MultinomialNB(alpha=params.get("alpha", 1.0))

        return Pipeline([("vectorizer", vectorizer), ("classifier", classifier)])
@@ -1,3 +1,5 @@
from typing import Dict

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
@@ -10,15 +12,23 @@ from research.traditional_model import TraditionalModel
class RandomForestModel(TraditionalModel):
    """Random Forest with engineered features"""

    def __init__(self, config):
        super().__init__(config)
        # Persist encoders so categorical mappings stay consistent.
        self.label_encoders: Dict[str, LabelEncoder] = {}

    def build_model(self) -> BaseEstimator:

        params = self.config.model_params

        # Tree ensemble is robust to mixed numeric/categorical encodings; parallelize
        # across trees for speed. Keep depth moderate for generalisation.
        return RandomForestClassifier(
            n_estimators=params.get("n_estimators", 100),
            max_depth=params.get("max_depth", None),
            random_state=self.config.random_seed,
            verbose=2,
            n_jobs=params.get("n_jobs", -1),
        )

    def prepare_features(self, X: pd.DataFrame) -> np.ndarray:
@@ -33,9 +43,24 @@ class RandomForestModel(TraditionalModel):
                # Numerical features
                features.append(column.fillna(0).values.reshape(-1, 1))
            else:
                # Categorical features (encode them)
                le = LabelEncoder()
                encoded = le.fit_transform(column.fillna("unknown").astype(str))
                # Categorical features (encode them persistently)
                feature_key = f"encoder_{feature_type.value}"

                if feature_key not in self.label_encoders:
                    self.label_encoders[feature_key] = LabelEncoder()
                    encoded = self.label_encoders[feature_key].fit_transform(
                        column.fillna("unknown").astype(str)
                    )
                else:
                    encoder = self.label_encoders[feature_key]
                    column_clean = column.fillna("unknown").astype(str)
                    known_classes = set(encoder.classes_)
                    default_class = "unknown" if "unknown" in known_classes else encoder.classes_[0]
                    column_mapped = column_clean.apply(
                        lambda value: value if value in known_classes else default_class
                    )
                    encoded = encoder.transform(column_mapped)

                features.append(encoded.reshape(-1, 1))

        return np.hstack(features) if features else np.array([]).reshape(len(X), 0)
@@ -13,17 +13,23 @@ class SVMModel(TraditionalModel):

    def build_model(self) -> BaseEstimator:
        params = self.config.model_params
        # TF-IDF downweights very common patterns; char n-grams (2,4) are effective
        # for distinguishing name morphology under RBF kernels.
        vectorizer = TfidfVectorizer(
            analyzer="char",
            ngram_range=params.get("ngram_range", (2, 4)),
            max_features=params.get("max_features", 5000),
        )

        # RBF kernel captures non-linear interactions between n-grams; probability=True
        # adds calibration at some cost. Larger cache helps speed kernel computations.
        classifier = SVC(
            kernel=params.get("kernel", "rbf"),
            C=params.get("C", 1.0),
            gamma=params.get("gamma", "scale"),
            probability=True,  # Enable probability prediction
            class_weight=params.get("class_weight", None),
            cache_size=params.get("cache_size", 1000),
            random_state=self.config.random_seed,
            verbose=2,
        )
@@ -27,7 +27,12 @@ class TransformerModel(NeuralNetworkModel):

        # Build Transformer model
        inputs = Input(shape=(max_len,))
        x = Embedding(input_dim=vocab_size, output_dim=params.get("embedding_dim", 64))(inputs)
        x = Embedding(
            input_dim=vocab_size,
            output_dim=params.get("embedding_dim", 64),
            input_length=max_len,
            mask_zero=True,
        )(inputs)

        # Add positional encoding
        positions = tf.range(start=0, limit=max_len, delta=1)
@@ -39,6 +44,7 @@ class TransformerModel(NeuralNetworkModel):
        x = self._transformer_encoder(x, params)
        x = GlobalAveragePooling1D()(x)
        x = Dense(32, activation="relu")(x)
        x = Dropout(params.get("dropout", 0.1))(x)
        outputs = Dense(2, activation="softmax")(x)

        model = Model(inputs, outputs)
@@ -54,6 +60,7 @@ class TransformerModel(NeuralNetworkModel):
        attn = MultiHeadAttention(
            num_heads=cfg_params.get("transformer_num_heads", 2),
            key_dim=cfg_params.get("transformer_head_size", 64),
            dropout=cfg_params.get("attn_dropout", 0.1),
        )(x, x)
        x = LayerNormalization(epsilon=1e-6)(x + Dropout(cfg_params.get("dropout", 0.1))(attn))
@@ -62,13 +69,7 @@ class TransformerModel(NeuralNetworkModel):
        return LayerNormalization(epsilon=1e-6)(x + Dropout(cfg_params.get("dropout", 0.1))(ff))

    def prepare_features(self, X: pd.DataFrame) -> np.ndarray:
        text_data = []
        for feature_type in self.config.features:
            if feature_type.value in X.columns:
                text_data.extend(X[feature_type.value].astype(str).tolist())

        if not text_data:
            raise ValueError("No text data found in the provided DataFrame.")
        text_data = self._collect_text_corpus(X)

        # Initialize tokenizer if needed
        if self.tokenizer is None:
@@ -76,7 +77,7 @@ class TransformerModel(NeuralNetworkModel):
            self.tokenizer.fit_on_texts(text_data)

        # Convert to sequences
        sequences = self.tokenizer.texts_to_sequences(text_data[: len(X)])
        sequences = self.tokenizer.texts_to_sequences(text_data)
        max_len = self.config.model_params.get("max_len", 6)

        return pad_sequences(sequences, maxlen=max_len, padding="post")
@@ -20,6 +20,8 @@ class XGBoostModel(TraditionalModel):
    def build_model(self) -> BaseEstimator:
        params = self.config.model_params

        # Histogram-based trees and parallelism provide fast training; default
        # logloss metric suits binary classification of gender.
        return xgb.XGBClassifier(
            n_estimators=params.get("n_estimators", 100),
            max_depth=params.get("max_depth", 6),
@@ -28,6 +30,8 @@ class XGBoostModel(TraditionalModel):
            colsample_bytree=params.get("colsample_bytree", 0.8),
            random_state=self.config.random_seed,
            eval_metric="logloss",
            n_jobs=params.get("n_jobs", -1),
            tree_method=params.get("tree_method", "hist"),
            verbosity=2,
        )
@@ -82,6 +82,25 @@ class NeuralNetworkModel(BaseModel):
        self.is_fitted = True
        return self

    def _collect_text_corpus(self, X: pd.DataFrame) -> List[str]:
        """Combine configured textual features into one string per record."""

        column_names = [feature.value for feature in self.config.features if feature.value in X.columns]
        if not column_names:
            raise ValueError("No configured text features found in the provided DataFrame.")

        text_frame = X[column_names].fillna("").astype(str)

        if len(column_names) == 1:
            return text_frame.iloc[:, 0].tolist()

        combined_rows = []
        for row in text_frame.itertuples(index=False):
            tokens = [value for value in row if value]
            combined_rows.append(" ".join(tokens))

        return combined_rows

    def cross_validate(
        self, X: pd.DataFrame, y: pd.Series, cv_folds: int = 5
    ) -> dict[str, np.floating[Any]]:
@@ -145,6 +164,9 @@ class NeuralNetworkModel(BaseModel):
        """Generate learning curve data for the model"""
        logging.info(f"Generating learning curve for {self.__class__.__name__}")

        if train_sizes is None:
            train_sizes = [0.1, 0.3, 0.5, 0.7, 1.0]

        learning_curve_data = {
            "train_sizes": [],
            "train_scores": [],