diff --git a/model_notation.md b/model_notation.md
new file mode 100644
index 0000000..a15018c
--- /dev/null
+++ b/model_notation.md
@@ -0,0 +1,93 @@
+# Model Notation Reference
+
+This document summarises the mathematical formulation and notation behind the models available in `research/models`. In all cases, the input example is represented by a feature vector $\mathbf{x}$ (after any feature-extraction or vectorisation steps) and the target label belongs to a finite set of classes $\mathcal{Y}$.
+
+## Logistic Regression
+- Decision function: $z = \mathbf{w}^\top \mathbf{x} + b$.
+- Binary posterior: $p(y=1\mid \mathbf{x}) = \sigma(z) = \frac{1}{1 + e^{-z}}$ and $p(y=0\mid \mathbf{x}) = 1 - \sigma(z)$.
+- Multi-class (softmax shown; one-vs-rest instead fits one binary classifier per class): $p(y=c\mid \mathbf{x}) = \frac{\exp(\mathbf{w}_c^\top \mathbf{x} + b_c)}{\sum_{k \in \mathcal{Y}} \exp(\mathbf{w}_k^\top \mathbf{x} + b_k)}$.
+- Loss: negative log-likelihood $\mathcal{L} = -\sum_i \log p(y_i\mid \mathbf{x}_i)$ plus regularisation when configured.
+- Gender prediction rationale: linear decision boundaries over character n-gram counts provide a strong, interpretable baseline for name-based gender attribution.
+- Implementation notes: uses character n-grams via `CountVectorizer`; `solver='liblinear'` is efficient on sparse, small-to-medium problems, with optional `class_weight` for imbalanced data (liblinear ignores `n_jobs`).
+
+## Multinomial Naive Bayes
+- Class prior: $p(y=c) = \frac{N_c}{N}$ where $N_c$ counts training instances in class $c$.
+- Conditional likelihood (bag-of-ngrams): $p(\mathbf{x}\mid y=c) \propto \prod_{j=1}^{d} \theta_{c,j}^{x_j}$, where $\theta_{c,j} = p(\text{feature } j \mid y=c)$ is estimated with Laplace smoothing.
+- Posterior up to normalisation: $\log p(y=c\mid \mathbf{x}) \propto \log p(y=c) + \sum_{j=1}^{d} x_j \log \theta_{c,j}$.
+- Gender prediction rationale: captures the relative frequency of character patterns associated with each gender, giving a fast and robust probabilistic baseline for sparse n-gram features.
+- Implementation notes: character n-gram counts with Laplace smoothing; extremely fast to train and deploy.
+
+## Support Vector Machine (RBF Kernel)
+- Dual-form decision function: $f(\mathbf{x}) = \operatorname{sign}\Big( \sum_{i=1}^{M} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \Big)$, where $\alpha_i > 0$ only for support vectors.
+- RBF kernel: $K(\mathbf{x}_i, \mathbf{x}) = \exp\big(-\gamma \lVert \mathbf{x}_i - \mathbf{x} \rVert_2^2\big)$.
+- Soft-margin optimisation: $\min_{\mathbf{w}, \xi} \frac{1}{2}\lVert \mathbf{w} \rVert_2^2 + C \sum_i \xi_i$ s.t. $y_i(\mathbf{w}^\top \phi(\mathbf{x}_i) + b) \geq 1 - \xi_i$, $\xi_i \geq 0$.
+- Gender prediction rationale: non-linear kernels model subtle character-pattern interactions beyond linear baselines, improving separability when male and female names share prefixes but diverge in internal structure.
+- Implementation notes: TF–IDF character features; increased `cache_size` and optional `class_weight` for stability on imbalanced data.
+
+## Random Forest
+- Ensemble of $T$ decision trees $h_1, \dots, h_T$: $\hat{y} = \operatorname{mode}\{ h_t(\mathbf{x}) : t=1, \dots, T \}$ for classification.
+- Each tree draws a bootstrap sample of the training set and a random subset of features at each split.
+- Feature importance (used in implementation): mean decrease in impurity aggregated over splits per feature.
+- Gender prediction rationale: handles heterogeneous engineered features (length, province, endings) without heavy preprocessing, while delivering interpretable feature-importance signals.
+- Implementation notes: enables `n_jobs=-1` for parallel trees; persistent label encoders ensure stable categorical mappings. + +## LightGBM (Gradient Boosted Trees) +- Additive model: $F_0(\mathbf{x}) = \hat{p}$ (initial prediction), $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta h_m(\mathbf{x})$. +- Each weak learner $h_m$ is a decision tree grown with leaf-wise strategy and depth constraint. +- Optimises differentiable loss (default: logistic) using first- and second-order gradients over data in each boosting iteration. +- Gender prediction rationale: excels with sparse categorical encodings and numerous engineered features, offering strong accuracy with manageable inference cost. +- Implementation notes: `objective='binary'`, `n_jobs=-1` for throughput; works well with compact character-gram features plus metadata. + +## XGBoost +- Objective: $\mathcal{L}^{(t)} = \sum_{i} \ell(y_i, \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)) + \Omega(f_t)$ with regulariser $\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_j w_j^2$. +- Tree score expansion via second-order Taylor approximation; optimal leaf weight $w_j = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$ where $g_i$ and $h_i$ are gradient and Hessian statistics. +- Final prediction: $\hat{y}(\mathbf{x}) = \sum_{t=1}^{M} \eta f_t(\mathbf{x})$. +- Gender prediction rationale: strong regularisation and gradient-informed splits capture interactions between textual and metadata features; suited to high-stakes deployment when tuned carefully. +- Implementation notes: `tree_method='hist'`, `n_jobs=-1` for efficient CPU training; integrates engineered categorical encodings. + +## Convolutional Neural Network (1D) +- Token/character embeddings produce $X \in \mathbb{R}^{L \times d}$. +- Convolution layer: $H^{(k)} = \operatorname{ReLU}(X * W^{(k)} + b^{(k)})$ where $*$ denotes 1D convolution with filter $W^{(k)}$. +- Pooling summarises temporal dimension (max or global max); dense layers map pooled vector to logits $\mathbf{z}$. +- Output probabilities: $p(y=c\mid \mathbf{x}) = \operatorname{softmax}_c(\mathbf{z})$; loss via cross-entropy. +- Gender prediction rationale: convolutional filters learn discriminative prefixes, suffixes, and intra-name motifs directly from characters, accommodating mixed-language inputs. +- Implementation notes: adds `SpatialDropout1D` on embeddings and `padding='same'` in conv layers for stability and length-invariance. + +## Bidirectional GRU +- Forward GRU recursion: $\begin{aligned} +&\mathbf{z}_t = \sigma(W_z \mathbf{x}_t + U_z \mathbf{h}_{t-1} + \mathbf{b}_z),\\ +&\mathbf{r}_t = \sigma(W_r \mathbf{x}_t + U_r \mathbf{h}_{t-1} + \mathbf{b}_r),\\ +&\tilde{\mathbf{h}}_t = \tanh(W_h \mathbf{x}_t + U_h(\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{b}_h),\\ +&\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t. +\end{aligned}$ +- Backward GRU mirrors the recurrence from $t=L$ to $1$; final representation concatenates $[\mathbf{h}_L^{\rightarrow}; \mathbf{h}_1^{\leftarrow}]$ before dense layers and softmax output. +- Gender prediction rationale: bidirectional context processes character sequences in both directions, learning gender-specific morphemes appearing at any position within the name. +- Implementation notes: `Embedding(mask_zero=True)` propagates masks to GRUs, ignoring padding; optional `recurrent_dropout` reduces overfitting. 
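+
+To make the recurrence concrete, here is a minimal NumPy sketch of a single forward-GRU step; shapes and names are illustrative only, not the Keras internals:
+
+```python
+import numpy as np
+
+def sigmoid(a):
+    return 1.0 / (1.0 + np.exp(-a))
+
+def gru_step(x_t, h_prev, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
+    """One GRU step: x_t has shape (input_dim,), h_prev (units,); W* map input to units, U* units to units."""
+    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)              # update gate z_t
+    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)              # reset gate r_t
+    h_cand = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)   # candidate state
+    return (1.0 - z) * h_prev + z * h_cand                # h_t interpolates previous and candidate states
+```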
+ +## LSTM +- Gates per timestep: $\begin{aligned} +&\mathbf{i}_t = \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i),\\ +&\mathbf{f}_t = \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f),\\ +&\mathbf{o}_t = \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + \mathbf{b}_o),\\ +&\tilde{\mathbf{c}}_t = \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c),\\ +&\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t,\\ +&\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t). +\end{aligned}$ +- Bidirectional stacking concatenates final hidden vectors before classification via softmax. +- Gender prediction rationale: long short-term memory cells model long-range dependencies within names, capturing compound structures common in multilingual gendered naming conventions. +- Implementation notes: `Embedding(mask_zero=True)` and `recurrent_dropout` regularise sequence modeling across padded batches. + +## Transformer Encoder (Single Block) +- Input embeddings $X \in \mathbb{R}^{L \times d}$ plus positional embeddings $P$ produce $Z^{(0)} = X + P$. +- Multi-head self-attention: $\operatorname{MHAttn}(Z) = \operatorname{Concat}(\text{head}_1, \dots, \text{head}_H) W^O$ where $\text{head}_h = \operatorname{softmax}\big(\frac{Q_h K_h^\top}{\sqrt{d_k}}\big) V_h$ and $(Q_h, K_h, V_h) = (Z W_h^Q, Z W_h^K, Z W_h^V)$. +- Add & norm: $Z^{(1)} = \operatorname{LayerNorm}(Z^{(0)} + \operatorname{Dropout}(\operatorname{MHAttn}(Z^{(0)})))$. +- Position-wise feed-forward: $Z^{(2)} = \operatorname{LayerNorm}(Z^{(1)} + \operatorname{Dropout}(\phi(Z^{(1)} W_1 + b_1) W_2 + b_2))$, with activation $\phi(\cdot)$ (ReLU). +- Sequence pooling (global average) feeds dense layers and softmax classifier. +- Gender prediction rationale: self-attention captures global dependencies and shared subword units across names, outperforming recurrent models when sufficient labelled data is available; otherwise risk of overfitting should be monitored. +- Implementation notes: `Embedding(mask_zero=True)` with learned positional embeddings; attention dropout (`attn_dropout`) and classifier dropout improve generalisation. + +## Ensemble (Soft Voting) +- Base learners indexed by $j$ output probability vectors $\mathbf{p}_j(\mathbf{x})$. +- Aggregated prediction with weights $w_j$: $p(y=c\mid \mathbf{x}) = \frac{1}{\sum_j w_j} \sum_j w_j \, p_j(y=c\mid \mathbf{x})$. +- Hard voting variant predicts $\hat{y} = \operatorname{mode}\{ \hat{y}_j \}$, where $\hat{y}_j = \arg\max_c p_j(y=c\mid \mathbf{x})$. +- Gender prediction rationale: blends complementary inductive biases (linear, tree-based, neural) to reduce variance on ambiguous names; remains suitable provided individual members are well-calibrated. diff --git a/research/models/bigru_model.py b/research/models/bigru_model.py index b89229d..7cbc21f 100644 --- a/research/models/bigru_model.py +++ b/research/models/bigru_model.py @@ -17,17 +17,38 @@ class BiGRUModel(NeuralNetworkModel): params = kwargs model = Sequential( [ - Embedding(input_dim=vocab_size, output_dim=params.get("embedding_dim", 64)), + # Mask padding tokens so recurrent layers ignore them; fix input length + # for better shape inference and to support masking through the stack. + Embedding( + input_dim=vocab_size, + output_dim=params.get("embedding_dim", 64), + input_length=max_len, + mask_zero=True, + ), + # First recurrent block returns full sequences to allow stacking. 
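+                # return_sequences=True yields (batch, timesteps, 2*units) for the next GRU.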
+ # Moderate dropout + optional recurrent_dropout to reduce overfitting + # on short names while retaining temporal signal. Bidirectional( GRU( params.get("gru_units", 32), return_sequences=True, dropout=params.get("dropout", 0.2), + recurrent_dropout=params.get("recurrent_dropout", 0.0), ) ), - Bidirectional(GRU(params.get("gru_units", 32), dropout=params.get("dropout", 0.2))), + # Second GRU summarizes to the last hidden state (no return_sequences), + # capturing bidirectional context efficiently for classification. + Bidirectional( + GRU( + params.get("gru_units", 32), + dropout=params.get("dropout", 0.2), + recurrent_dropout=params.get("recurrent_dropout", 0.0), + ) + ), + # Small dense head; ReLU + dropout for capacity and regularization. Dense(64, activation="relu"), Dropout(params.get("dropout", 0.5)), + # Two-way softmax for binary gender classification. Dense(2, activation="softmax"), ] ) @@ -38,19 +59,13 @@ class BiGRUModel(NeuralNetworkModel): return model def prepare_features(self, X: pd.DataFrame) -> np.ndarray: - text_data = [] - for feature_type in self.config.features: - if feature_type.value in X.columns: - text_data.extend(X[feature_type.value].astype(str).tolist()) - - if not text_data: - raise ValueError("No text data found in the provided DataFrame.") + text_data = self._collect_text_corpus(X) if self.tokenizer is None: self.tokenizer = Tokenizer(char_level=False, lower=True, oov_token="") self.tokenizer.fit_on_texts(text_data) - sequences = self.tokenizer.texts_to_sequences(text_data[: len(X)]) + sequences = self.tokenizer.texts_to_sequences(text_data) max_len = self.config.model_params.get("max_len", 6) return pad_sequences(sequences, maxlen=max_len, padding="post") diff --git a/research/models/cnn_model.py b/research/models/cnn_model.py index 9cae44b..eca4594 100644 --- a/research/models/cnn_model.py +++ b/research/models/cnn_model.py @@ -9,6 +9,7 @@ from tensorflow.keras.layers import ( GlobalMaxPooling1D, Dense, Dropout, + SpatialDropout1D, ) from tensorflow.keras.models import Sequential @@ -24,21 +25,33 @@ class CNNModel(NeuralNetworkModel): params = kwargs model = Sequential( [ + # Learn char/subword embeddings; spatial dropout regularizes across channels + # to make the model robust to noisy characters and transliteration. Embedding(input_dim=vocab_size, output_dim=params.get("embedding_dim", 64)), + SpatialDropout1D(rate=params.get("embedding_dropout", 0.1)), + # Small kernels capture short n-gram like patterns; padding='same' keeps + # sequence length stable for simpler pooling behavior. Conv1D( filters=params.get("filters", 64), kernel_size=params.get("kernel_size", 3), activation="relu", + padding="same", ), + # Downsample to gain some position invariance and reduce computation. MaxPooling1D(pool_size=2), + # Second conv layer to compose higher-level motifs (e.g., suffix+vowel). Conv1D( filters=params.get("filters", 64), kernel_size=params.get("kernel_size", 3), activation="relu", + padding="same", ), + # Global max pooling picks strongest motif evidence anywhere in the name. GlobalMaxPooling1D(), + # Compact dense head with dropout to control overfitting. Dense(64, activation="relu"), Dropout(params.get("dropout", 0.5)), + # Two-way softmax for binary classification. 
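+                # (For two classes this is equivalent to a single sigmoid unit; softmax keeps the head uniform.)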
                 Dense(2, activation="softmax"),
             ]
         )
@@ -55,21 +68,14 @@
         from tensorflow.keras.preprocessing.sequence import pad_sequences
 
         # Get text data from extracted features - use character level for CNN
-        text_data = []
-        for feature_type in self.config.features:
-            if feature_type.value in X.columns:
-                text_data.extend(X[feature_type.value].astype(str).tolist())
-
-        if not text_data:
-            # Fallback - should not happen if FeatureExtractor is properly configured
-            text_data = [""] * len(X)
+        text_data = self._collect_text_corpus(X)
 
         # Initialize character-level tokenizer
         if self.tokenizer is None:
             self.tokenizer = Tokenizer(char_level=True, lower=True, oov_token="")
             self.tokenizer.fit_on_texts(text_data)
 
-        sequences = self.tokenizer.texts_to_sequences(text_data[: len(X)])
+        sequences = self.tokenizer.texts_to_sequences(text_data)
         max_len = self.config.model_params.get("max_len", 20)  # Longer for character level
         return pad_sequences(sequences, maxlen=max_len, padding="post")
diff --git a/research/models/ensemble_model.py b/research/models/ensemble_model.py
index 227f1a2..0114532 100644
--- a/research/models/ensemble_model.py
+++ b/research/models/ensemble_model.py
@@ -31,7 +31,8 @@ class EnsembleModel(TraditionalModel):
             "base_models", ["logistic_regression", "random_forest", "naive_bayes"]
         )
 
-        # Create base models with simplified configs
+        # Create base models with simplified configs; diverse vectorizers/classifiers
+        # encourage complementary errors that voting can average out.
         estimators = []
         for model_type in base_model_types:
             if model_type == "logistic_regression":
@@ -78,8 +79,10 @@
             )
             estimators.append((f"nb", model))
 
+        # Soft voting averages probabilities (preferred when members are calibrated);
+        # hard voting takes a majority vote. n_jobs parallelizes member fitting.
         voting_type = params.get("voting", "soft")  # 'hard' or 'soft'
-        return VotingClassifier(estimators=estimators, voting=voting_type)
+        return VotingClassifier(estimators=estimators, voting=voting_type, n_jobs=params.get("n_jobs", -1))
 
     def prepare_features(self, X: pd.DataFrame) -> np.ndarray:
         text_features = []
diff --git a/research/models/lightgbm_model.py b/research/models/lightgbm_model.py
index c0fa692..8b242ad 100644
--- a/research/models/lightgbm_model.py
+++ b/research/models/lightgbm_model.py
@@ -20,6 +20,8 @@ class LightGBMModel(TraditionalModel):
 
     def build_model(self) -> BaseEstimator:
         params = self.config.model_params
+        # Leaf-wise boosted trees excel on sparse/categorical mixes; binary objective
+        # and parallelism improve training speed for this task.
         return lgb.LGBMClassifier(
             n_estimators=params.get("n_estimators", 100),
             max_depth=params.get("max_depth", -1),
@@ -28,6 +30,8 @@
             subsample=params.get("subsample", 0.8),
             colsample_bytree=params.get("colsample_bytree", 0.8),
             random_state=self.config.random_seed,
+            objective=params.get("objective", "binary"),
+            n_jobs=params.get("n_jobs", -1),
             verbose=2,
         )
diff --git a/research/models/logistic_regression_model.py b/research/models/logistic_regression_model.py
index 9a280b6..dc2e54b 100644
--- a/research/models/logistic_regression_model.py
+++ b/research/models/logistic_regression_model.py
@@ -13,14 +13,23 @@ class LogisticRegressionModel(TraditionalModel):
 
     def build_model(self) -> BaseEstimator:
         params = self.config.model_params
+        # Character n-grams are strong signals for names; (2,5) balances
+        # capturing prefixes/suffixes with tractable feature size.
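+        # e.g. "ana" yields {"an", "na", "ana"}; the resulting counts feed the classifier.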
         vectorizer = CountVectorizer(
             analyzer="char",
             ngram_range=params.get("ngram_range", (2, 5)),
             max_features=params.get("max_features", 10000),
         )
 
+        # liblinear suits sparse, small-to-medium problems; it ignores n_jobs (only
+        # other solvers parallelize OvR). class_weight can mitigate class imbalance.
         classifier = LogisticRegression(
-            max_iter=params.get("max_iter", 1000), random_state=self.config.random_seed, verbose=2
+            max_iter=params.get("max_iter", 1000),
+            random_state=self.config.random_seed,
+            verbose=2,
+            solver=params.get("solver", "liblinear"),
+            n_jobs=params.get("n_jobs", None),
+            class_weight=params.get("class_weight", None),
         )
 
         return Pipeline([("vectorizer", vectorizer), ("classifier", classifier)])
diff --git a/research/models/lstm_model.py b/research/models/lstm_model.py
index 3b59834..28e378c 100644
--- a/research/models/lstm_model.py
+++ b/research/models/lstm_model.py
@@ -2,7 +2,7 @@ from typing import Any
 
 import numpy as np
 import pandas as pd
-from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
+from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout
 from tensorflow.keras.models import Sequential
 from tensorflow.keras.preprocessing.sequence import pad_sequences
 from tensorflow.keras.preprocessing.text import Tokenizer
@@ -17,10 +17,35 @@
         params = kwargs
         model = Sequential(
             [
-                Embedding(input_dim=vocab_size, output_dim=params.get("embedding_dim", 64)),
-                Bidirectional(LSTM(params.get("lstm_units", 32), return_sequences=True)),
-                Bidirectional(LSTM(params.get("lstm_units", 32))),
+                # Mask padding tokens; required for LSTM to ignore padded timesteps.
+                Embedding(
+                    input_dim=vocab_size,
+                    output_dim=params.get("embedding_dim", 64),
+                    input_length=max_len,
+                    mask_zero=True,
+                ),
+                # Stacked bidirectional LSTMs: first returns sequences to feed the next.
+                # Dropout/recurrent_dropout mitigate overfitting on short sequences.
+                Bidirectional(
+                    LSTM(
+                        params.get("lstm_units", 32),
+                        return_sequences=True,
+                        dropout=params.get("dropout", 0.2),
+                        recurrent_dropout=params.get("recurrent_dropout", 0.0),
+                    )
+                ),
+                # Second LSTM condenses the sequence to a fixed vector for classification.
+                Bidirectional(
+                    LSTM(
+                        params.get("lstm_units", 32),
+                        dropout=params.get("dropout", 0.2),
+                        recurrent_dropout=params.get("recurrent_dropout", 0.0),
+                    )
+                ),
+                # Compact dense head with dropout; sufficient capacity for name signals.
                 Dense(64, activation="relu"),
+                Dropout(params.get("dropout", 0.5)),
+                # Two-way softmax for binary classification.
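+                # Output probabilities sum to one, so argmax gives the predicted label directly.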
Dense(2, activation="softmax"), ] ) @@ -31,14 +56,7 @@ class LSTMModel(NeuralNetworkModel): return model def prepare_features(self, X: pd.DataFrame) -> np.ndarray: - text_data = [] - - for feature_type in self.config.features: - if feature_type.value in X.columns: - text_data.extend(X[feature_type.value].astype(str).tolist()) - - if not text_data: - raise ValueError("No text data found in the provided DataFrame.") + text_data = self._collect_text_corpus(X) # Initialize tokenizer if needed if self.tokenizer is None: @@ -46,7 +64,7 @@ class LSTMModel(NeuralNetworkModel): self.tokenizer.fit_on_texts(text_data) # Convert to sequences - sequences = self.tokenizer.texts_to_sequences(text_data[: len(X)]) + sequences = self.tokenizer.texts_to_sequences(text_data) max_len = self.config.model_params.get("max_len", 6) return pad_sequences(sequences, maxlen=max_len, padding="post") diff --git a/research/models/naive_bayes_model.py b/research/models/naive_bayes_model.py index d542658..becad50 100644 --- a/research/models/naive_bayes_model.py +++ b/research/models/naive_bayes_model.py @@ -13,12 +13,15 @@ class NaiveBayesModel(TraditionalModel): def build_model(self) -> BaseEstimator: params = self.config.model_params + # Bag-of-character-ngrams aligns with Multinomial NB assumptions; (1,4) + # includes unigrams for coverage and higher n for suffix/prefix cues. vectorizer = CountVectorizer( analyzer="char", ngram_range=params.get("ngram_range", (1, 4)), max_features=params.get("max_features", 8000), ) + # Laplace smoothing (alpha) counters zero counts for rare n-grams. classifier = MultinomialNB(alpha=params.get("alpha", 1.0)) return Pipeline([("vectorizer", vectorizer), ("classifier", classifier)]) diff --git a/research/models/random_forest_model.py b/research/models/random_forest_model.py index 4c274ff..ee4764b 100644 --- a/research/models/random_forest_model.py +++ b/research/models/random_forest_model.py @@ -1,3 +1,5 @@ +from typing import Dict + import numpy as np import pandas as pd from sklearn.base import BaseEstimator @@ -10,15 +12,23 @@ from research.traditional_model import TraditionalModel class RandomForestModel(TraditionalModel): """Random Forest with engineered features""" + def __init__(self, config): + super().__init__(config) + # Persist encoders so categorical mappings stay consistent. + self.label_encoders: Dict[str, LabelEncoder] = {} + def build_model(self) -> BaseEstimator: params = self.config.model_params + # Tree ensemble is robust to mixed numeric/categorical encodings; parallelize + # across trees for speed. Keep depth moderate for generalisation. 
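+        # Bootstrap sampling (sklearn's default) decorrelates the trees that vote.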
return RandomForestClassifier( n_estimators=params.get("n_estimators", 100), max_depth=params.get("max_depth", None), random_state=self.config.random_seed, verbose=2, + n_jobs=params.get("n_jobs", -1), ) def prepare_features(self, X: pd.DataFrame) -> np.ndarray: @@ -33,9 +43,24 @@ class RandomForestModel(TraditionalModel): # Numerical features features.append(column.fillna(0).values.reshape(-1, 1)) else: - # Categorical features (encode them) - le = LabelEncoder() - encoded = le.fit_transform(column.fillna("unknown").astype(str)) + # Categorical features (encode them persistently) + feature_key = f"encoder_{feature_type.value}" + + if feature_key not in self.label_encoders: + self.label_encoders[feature_key] = LabelEncoder() + encoded = self.label_encoders[feature_key].fit_transform( + column.fillna("unknown").astype(str) + ) + else: + encoder = self.label_encoders[feature_key] + column_clean = column.fillna("unknown").astype(str) + known_classes = set(encoder.classes_) + default_class = "unknown" if "unknown" in known_classes else encoder.classes_[0] + column_mapped = column_clean.apply( + lambda value: value if value in known_classes else default_class + ) + encoded = encoder.transform(column_mapped) + features.append(encoded.reshape(-1, 1)) return np.hstack(features) if features else np.array([]).reshape(len(X), 0) diff --git a/research/models/svm_model.py b/research/models/svm_model.py index e90ec21..b980f36 100644 --- a/research/models/svm_model.py +++ b/research/models/svm_model.py @@ -13,17 +13,23 @@ class SVMModel(TraditionalModel): def build_model(self) -> BaseEstimator: params = self.config.model_params + # TF-IDF downweights very common patterns; char n-grams (2,4) are effective + # for distinguishing name morphology under RBF kernels. vectorizer = TfidfVectorizer( analyzer="char", ngram_range=params.get("ngram_range", (2, 4)), max_features=params.get("max_features", 5000), ) + # RBF kernel captures non-linear interactions between n-grams; probability=True + # adds calibration at some cost. Larger cache helps speed kernel computations. 
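+        # probability=True fits Platt scaling via an internal five-fold CV, which slows training noticeably.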
classifier = SVC( kernel=params.get("kernel", "rbf"), C=params.get("C", 1.0), gamma=params.get("gamma", "scale"), probability=True, # Enable probability prediction + class_weight=params.get("class_weight", None), + cache_size=params.get("cache_size", 1000), random_state=self.config.random_seed, verbose=2, ) diff --git a/research/models/transformer_model.py b/research/models/transformer_model.py index 1c6f1b9..aef751d 100644 --- a/research/models/transformer_model.py +++ b/research/models/transformer_model.py @@ -27,7 +27,12 @@ class TransformerModel(NeuralNetworkModel): # Build Transformer model inputs = Input(shape=(max_len,)) - x = Embedding(input_dim=vocab_size, output_dim=params.get("embedding_dim", 64))(inputs) + x = Embedding( + input_dim=vocab_size, + output_dim=params.get("embedding_dim", 64), + input_length=max_len, + mask_zero=True, + )(inputs) # Add positional encoding positions = tf.range(start=0, limit=max_len, delta=1) @@ -39,6 +44,7 @@ class TransformerModel(NeuralNetworkModel): x = self._transformer_encoder(x, params) x = GlobalAveragePooling1D()(x) x = Dense(32, activation="relu")(x) + x = Dropout(params.get("dropout", 0.1))(x) outputs = Dense(2, activation="softmax")(x) model = Model(inputs, outputs) @@ -54,6 +60,7 @@ class TransformerModel(NeuralNetworkModel): attn = MultiHeadAttention( num_heads=cfg_params.get("transformer_num_heads", 2), key_dim=cfg_params.get("transformer_head_size", 64), + dropout=cfg_params.get("attn_dropout", 0.1), )(x, x) x = LayerNormalization(epsilon=1e-6)(x + Dropout(cfg_params.get("dropout", 0.1))(attn)) @@ -62,13 +69,7 @@ class TransformerModel(NeuralNetworkModel): return LayerNormalization(epsilon=1e-6)(x + Dropout(cfg_params.get("dropout", 0.1))(ff)) def prepare_features(self, X: pd.DataFrame) -> np.ndarray: - text_data = [] - for feature_type in self.config.features: - if feature_type.value in X.columns: - text_data.extend(X[feature_type.value].astype(str).tolist()) - - if not text_data: - raise ValueError("No text data found in the provided DataFrame.") + text_data = self._collect_text_corpus(X) # Initialize tokenizer if needed if self.tokenizer is None: @@ -76,7 +77,7 @@ class TransformerModel(NeuralNetworkModel): self.tokenizer.fit_on_texts(text_data) # Convert to sequences - sequences = self.tokenizer.texts_to_sequences(text_data[: len(X)]) + sequences = self.tokenizer.texts_to_sequences(text_data) max_len = self.config.model_params.get("max_len", 6) return pad_sequences(sequences, maxlen=max_len, padding="post") diff --git a/research/models/xgboost_model.py b/research/models/xgboost_model.py index 07e4be0..28093ee 100644 --- a/research/models/xgboost_model.py +++ b/research/models/xgboost_model.py @@ -20,6 +20,8 @@ class XGBoostModel(TraditionalModel): def build_model(self) -> BaseEstimator: params = self.config.model_params + # Histogram-based trees and parallelism provide fast training; default + # logloss metric suits binary classification of gender. 
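+        # tree_method='hist' pre-bins feature values into discrete buckets before split search.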
         return xgb.XGBClassifier(
             n_estimators=params.get("n_estimators", 100),
             max_depth=params.get("max_depth", 6),
@@ -28,6 +30,8 @@
             colsample_bytree=params.get("colsample_bytree", 0.8),
             random_state=self.config.random_seed,
             eval_metric="logloss",
+            n_jobs=params.get("n_jobs", -1),
+            tree_method=params.get("tree_method", "hist"),
             verbosity=2,
         )
diff --git a/research/neural_network_model.py b/research/neural_network_model.py
index 6baf1dc..b48b47a 100644
--- a/research/neural_network_model.py
+++ b/research/neural_network_model.py
@@ -82,6 +82,25 @@
         self.is_fitted = True
         return self
 
+    def _collect_text_corpus(self, X: pd.DataFrame) -> list[str]:
+        """Combine configured textual features into one string per record."""
+
+        column_names = [feature.value for feature in self.config.features if feature.value in X.columns]
+        if not column_names:
+            raise ValueError("No configured text features found in the provided DataFrame.")
+
+        text_frame = X[column_names].fillna("").astype(str)
+
+        if len(column_names) == 1:
+            return text_frame.iloc[:, 0].tolist()
+
+        combined_rows = []
+        for row in text_frame.itertuples(index=False):
+            tokens = [value for value in row if value]
+            combined_rows.append(" ".join(tokens))
+
+        return combined_rows
+
     def cross_validate(
         self, X: pd.DataFrame, y: pd.Series, cv_folds: int = 5
     ) -> dict[str, np.floating[Any]]:
@@ -145,6 +164,9 @@
         """Generate learning curve data for the model"""
         logging.info(f"Generating learning curve for {self.__class__.__name__}")
 
+        if train_sizes is None:
+            train_sizes = [0.1, 0.3, 0.5, 0.7, 1.0]
+
         learning_curve_data = {
             "train_sizes": [],
             "train_scores": [],