feat: document models
@@ -0,0 +1,93 @@
# Model Notation Reference

This document summarises the mathematical formulation and notation behind the models available in `research/models`. In all cases, the input example is represented by a feature vector $\mathbf{x}$ (after any feature-extraction or vectorisation steps) and the target label belongs to a finite set of classes $\mathcal{Y}$.

## Logistic Regression
- Decision function: $z = \mathbf{w}^\top \mathbf{x} + b$.
- Binary posterior: $p(y=1\mid \mathbf{x}) = \sigma(z) = \frac{1}{1 + e^{-z}}$ and $p(y=0\mid \mathbf{x}) = 1 - \sigma(z)$.
- Multi-class (one-vs-rest or softmax): $p(y=c\mid \mathbf{x}) = \frac{\exp(\mathbf{w}_c^\top \mathbf{x} + b_c)}{\sum_{k \in \mathcal{Y}} \exp(\mathbf{w}_k^\top \mathbf{x} + b_k)}$.
- Loss: negative log-likelihood $\mathcal{L} = -\sum_i \log p(y_i\mid \mathbf{x}_i)$ plus regularisation when configured.
- Gender prediction rationale: linear decision boundaries over character n-gram counts provide a strong, interpretable baseline for name-based gender attribution.
- Implementation notes: uses character n-grams via `CountVectorizer`; `solver='liblinear'` with optional `class_weight` for imbalanced classes and `n_jobs` to parallelise one-vs-rest fits (no effect in the binary case). A short sketch follows this list.

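A minimal scikit-learn sketch of this setup is shown below; the sample names, the label encoding, and the hyperparameter values are illustrative only, not the project's configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Illustrative toy data; real training uses the project's name corpus.
names = ["maria", "johan", "fatima", "pieter"]
labels = [1, 0, 1, 0]  # 1 = female, 0 = male (example encoding)

pipeline = Pipeline([
    ("vectorizer", CountVectorizer(analyzer="char", ngram_range=(2, 5), max_features=10000)),
    ("classifier", LogisticRegression(max_iter=1000, solver="liblinear")),
])
pipeline.fit(names, labels)
# Columns follow sorted class labels: [p(y=0), p(y=1)], i.e. sigma(z) for class 1.
print(pipeline.predict_proba(["anika"]))
```
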
## Multinomial Naive Bayes
- Class prior: $p(y=c) = \frac{N_c}{N}$ where $N_c$ counts training instances in class $c$.
- Conditional likelihood (bag-of-ngrams): $p(\mathbf{x}\mid y=c) = \prod_{j=1}^{d} p(x_j\mid y=c)^{x_j}$ with categorical parameters estimated via Laplace smoothing.
- Posterior up to normalisation: $\log p(y=c\mid \mathbf{x}) = \log p(y=c) + \sum_{j=1}^{d} x_j \log p(x_j\mid y=c) + \mathrm{const}$.
- Gender prediction rationale: captures the relative frequency of character patterns associated with each gender, giving a fast and robust probabilistic baseline for sparse n-gram features.
- Implementation notes: character n-gram counts with Laplace smoothing; extremely fast to train and deploy. A worked posterior example follows this list.

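The posterior computation above can be traced with a small NumPy example; all counts, priors, and feature values below are made-up illustrative numbers.

```python
import numpy as np

# Toy bag-of-ngrams setup: 2 classes, 3 n-gram features (illustrative numbers).
log_prior = np.log(np.array([0.5, 0.5]))             # log p(y=c)
counts = np.array([[3.0, 1.0, 0.0],                  # class 0 n-gram counts
                   [1.0, 2.0, 4.0]])                 # class 1 n-gram counts
alpha = 1.0                                          # Laplace smoothing
log_likelihood = np.log((counts + alpha) / (counts + alpha).sum(axis=1, keepdims=True))

x = np.array([2.0, 0.0, 1.0])                        # n-gram counts of one name
log_posterior = log_prior + log_likelihood @ x       # log p(y=c) + sum_j x_j log p(x_j|c)
probs = np.exp(log_posterior - log_posterior.max())
probs /= probs.sum()                                 # normalise over classes
print(probs)
```
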
## Support Vector Machine (RBF Kernel)
- Dual-form decision function: $f(\mathbf{x}) = \operatorname{sign}\Big( \sum_{i=1}^{M} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \Big)$.
- RBF kernel: $K(\mathbf{x}_i, \mathbf{x}) = \exp\big(-\gamma \lVert \mathbf{x}_i - \mathbf{x} \rVert_2^2\big)$.
- Soft-margin optimisation: $\min_{\mathbf{w}, \xi} \frac{1}{2}\lVert \mathbf{w} \rVert_2^2 + C \sum_i \xi_i$ s.t. $y_i(\mathbf{w}^\top \phi(\mathbf{x}_i) + b) \geq 1 - \xi_i$, $\xi_i \geq 0$.
- Gender prediction rationale: non-linear kernels model subtle character-pattern interactions beyond linear baselines, improving separability when male and female names share prefixes but diverge in internal structure.
- Implementation notes: TF–IDF character features; increased `cache_size` and optional `class_weight` for stability on imbalanced data. A kernel sketch follows this list.

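The dual decision function can be evaluated directly in NumPy; the support vectors, labels, dual coefficients, and $\gamma$ below are illustrative stand-ins rather than fitted values.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Toy support vectors, labels in {-1, +1}, and dual coefficients (illustrative).
support_vectors = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([1, -1, 1])
alpha = np.array([0.7, 0.7, 0.3])
b = 0.1

x = np.array([0.9, 0.8])
score = sum(a_i * y_i * rbf_kernel(sv, x)
            for a_i, y_i, sv in zip(alpha, y, support_vectors)) + b
print(score, np.sign(score))  # signed margin and predicted class
```
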
## Random Forest
- Ensemble of $T$ decision trees: $\hat{y} = \operatorname{mode}\{ h_t(\mathbf{x}) : t=1, \dots, T \}$ for classification, where $h_t$ is the $t$-th tree.
- Each tree draws a bootstrap sample of the training set and a random subset of features at each split.
- Feature importance (used in implementation): mean decrease in impurity aggregated over splits per feature.
- Gender prediction rationale: handles heterogeneous engineered features (length, province, endings) without heavy preprocessing, while delivering interpretable feature-importance signals.
- Implementation notes: enables `n_jobs=-1` for parallel trees; persistent label encoders ensure stable categorical mappings. A short sketch follows this list.

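A compact scikit-learn sketch of the ensemble and its impurity-based importances; the engineered features and their values are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy engineered features: [name_length, ends_with_vowel, province_code] (illustrative).
X = np.array([[5, 1, 2], [6, 0, 1], [4, 1, 2], [7, 0, 3], [5, 1, 1], [6, 0, 3]])
y = np.array([1, 0, 1, 0, 1, 0])

forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
forest.fit(X, y)

# Mean-decrease-in-impurity importances, as referenced above.
for name, importance in zip(["length", "ends_with_vowel", "province"], forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```
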
## LightGBM (Gradient Boosted Trees)
- Additive model: $F_0(\mathbf{x}) = \hat{p}$ (initial prediction), $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta h_m(\mathbf{x})$.
- Each weak learner $h_m$ is a decision tree grown with a leaf-wise strategy and depth constraint.
- Optimises a differentiable loss (default: logistic) using first- and second-order gradients over the data in each boosting iteration.
- Gender prediction rationale: excels with sparse categorical encodings and numerous engineered features, offering strong accuracy with manageable inference cost.
- Implementation notes: `objective='binary'`, `n_jobs=-1` for throughput; works well with compact character-gram features plus metadata. A short sketch follows this list.

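A hedged sketch of the classifier configuration described above, fitted on synthetic stand-in features; the data and most parameter values are illustrative rather than the project's settings.

```python
import numpy as np
from lightgbm import LGBMClassifier

# Synthetic stand-in for character-gram counts plus metadata (illustrative only).
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 20)).astype(float)
y = (X[:, 0] + X[:, 3] > 4).astype(int)

clf = LGBMClassifier(
    objective="binary",   # logistic loss, as in the implementation notes
    n_estimators=100,
    learning_rate=0.1,    # eta in F_m = F_{m-1} + eta * h_m
    n_jobs=-1,
)
clf.fit(X, y)
print(clf.predict_proba(X[:3]))
```
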
## XGBoost
- Objective: $\mathcal{L}^{(t)} = \sum_{i} \ell(y_i, \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)) + \Omega(f_t)$ with regulariser $\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_j w_j^2$.
- Tree score expansion via second-order Taylor approximation; optimal leaf weight $w_j = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$ where $g_i$ and $h_i$ are gradient and Hessian statistics.
- Final prediction: $\hat{y}(\mathbf{x}) = \sum_{t=1}^{M} \eta f_t(\mathbf{x})$.
- Gender prediction rationale: strong regularisation and gradient-informed splits capture interactions between textual and metadata features; suited to high-stakes deployment when tuned carefully.
- Implementation notes: `tree_method='hist'`, `n_jobs=-1` for efficient CPU training; integrates engineered categorical encodings. A worked leaf-weight example follows this list.

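The optimal leaf-weight formula can be checked with a few lines of NumPy; the gradients and Hessians below come from a hypothetical previous boosting round with logistic loss.

```python
import numpy as np

# Logistic loss: g_i = p_i - y_i, h_i = p_i * (1 - p_i), computed from the
# previous round's predictions (values below are illustrative).
p_prev = np.array([0.6, 0.3, 0.8, 0.5])   # predicted probabilities for leaf members
y = np.array([1, 0, 1, 0])

g = p_prev - y                 # first-order gradients
h = p_prev * (1 - p_prev)      # second-order (Hessian) statistics
lam = 1.0                      # L2 regularisation lambda

# Optimal weight for a leaf containing these four examples.
w_leaf = -g.sum() / (h.sum() + lam)
print(w_leaf)
```
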
## Convolutional Neural Network (1D)
- Token/character embeddings produce $X \in \mathbb{R}^{L \times d}$.
- Convolution layer: $H^{(k)} = \operatorname{ReLU}(X * W^{(k)} + b^{(k)})$ where $*$ denotes 1D convolution with filter $W^{(k)}$.
- Pooling summarises the temporal dimension (max or global max); dense layers map the pooled vector to logits $\mathbf{z}$.
- Output probabilities: $p(y=c\mid \mathbf{x}) = \operatorname{softmax}_c(\mathbf{z})$; loss via cross-entropy.
- Gender prediction rationale: convolutional filters learn discriminative prefixes, suffixes, and intra-name motifs directly from characters, accommodating mixed-language inputs.
- Implementation notes: adds `SpatialDropout1D` on embeddings and `padding='same'` in conv layers for stability and length-invariance.

## Bidirectional GRU
- Forward GRU recursion: $\begin{aligned} &\mathbf{z}_t = \sigma(W_z \mathbf{x}_t + U_z \mathbf{h}_{t-1} + \mathbf{b}_z),\\ &\mathbf{r}_t = \sigma(W_r \mathbf{x}_t + U_r \mathbf{h}_{t-1} + \mathbf{b}_r),\\ &\tilde{\mathbf{h}}_t = \tanh(W_h \mathbf{x}_t + U_h(\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{b}_h),\\ &\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t. \end{aligned}$
- Backward GRU mirrors the recurrence from $t=L$ to $1$; the final representation concatenates $[\mathbf{h}_L^{\rightarrow}; \mathbf{h}_1^{\leftarrow}]$ before dense layers and softmax output.
- Gender prediction rationale: bidirectional context processes character sequences in both directions, learning gender-specific morphemes appearing at any position within the name.
- Implementation notes: `Embedding(mask_zero=True)` propagates masks to GRUs, ignoring padding; optional `recurrent_dropout` reduces overfitting. A single-step example follows this list.

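For intuition, one forward GRU update can be written directly from the equations above; the parameter shapes and random values are illustrative, while the actual model uses Keras `GRU` layers as in the implementation notes.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h):
    """One forward GRU update, following the equations above."""
    z = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)            # update gate
    r = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)            # reset gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r * h_prev) + b_h)
    return (1 - z) * h_prev + z * h_tilde

# Tiny random parameters: 4-dim character embedding, 3-dim hidden state (illustrative).
rng = np.random.default_rng(0)
d, k = 4, 3
params = [rng.normal(size=s) for s in [(k, d), (k, k), (k,)] * 3]
h = np.zeros(k)
for x_t in rng.normal(size=(5, d)):   # 5 timesteps of embedded characters
    h = gru_step(x_t, h, *params)
print(h)
```
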
## LSTM
- Gates per timestep: $\begin{aligned} &\mathbf{i}_t = \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i),\\ &\mathbf{f}_t = \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f),\\ &\mathbf{o}_t = \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + \mathbf{b}_o),\\ &\tilde{\mathbf{c}}_t = \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c),\\ &\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t,\\ &\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t). \end{aligned}$
- Bidirectional stacking concatenates the final hidden vectors before classification via softmax.
- Gender prediction rationale: long short-term memory cells model long-range dependencies within names, capturing compound structures common in multilingual gendered naming conventions.
- Implementation notes: `Embedding(mask_zero=True)` and `recurrent_dropout` regularise sequence modelling across padded batches.

## Transformer Encoder (Single Block)
- Input embeddings $X \in \mathbb{R}^{L \times d}$ plus positional embeddings $P$ produce $Z^{(0)} = X + P$.
- Multi-head self-attention: $\operatorname{MHAttn}(Z) = \operatorname{Concat}(\text{head}_1, \dots, \text{head}_H) W^O$ where $\text{head}_h = \operatorname{softmax}\big(\frac{Q_h K_h^\top}{\sqrt{d_k}}\big) V_h$ and $(Q_h, K_h, V_h) = (Z W_h^Q, Z W_h^K, Z W_h^V)$.
- Add & norm: $Z^{(1)} = \operatorname{LayerNorm}(Z^{(0)} + \operatorname{Dropout}(\operatorname{MHAttn}(Z^{(0)})))$.
- Position-wise feed-forward: $Z^{(2)} = \operatorname{LayerNorm}(Z^{(1)} + \operatorname{Dropout}(\phi(Z^{(1)} W_1 + b_1) W_2 + b_2))$, with activation $\phi(\cdot)$ (ReLU).
- Sequence pooling (global average) feeds dense layers and a softmax classifier.
- Gender prediction rationale: self-attention captures global dependencies and shared subword units across names, outperforming recurrent models when sufficient labelled data is available; otherwise the risk of overfitting should be monitored.
- Implementation notes: `Embedding(mask_zero=True)` with learned positional embeddings; attention dropout (`attn_dropout`) and classifier dropout improve generalisation. A single-head attention sketch follows this list.

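A single attention head, written out in NumPy to mirror the formula above; the sequence length, dimensions, and random weights are illustrative.

```python
import numpy as np

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(Z, W_q, W_k, W_v):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V

# L = 6 character positions, d = 8 embedding dims, d_k = 4 (illustrative sizes).
rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 8))
head = single_head_attention(Z, *[rng.normal(size=(8, 4)) for _ in range(3)])
print(head.shape)  # (6, 4): one contextualised vector per position
```
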
## Ensemble (Soft Voting)
- Base learners indexed by $j$ output probability vectors $\mathbf{p}_j(\mathbf{x})$.
- Aggregated prediction with weights $w_j$: $p(y=c\mid \mathbf{x}) = \frac{1}{\sum_j w_j} \sum_j w_j \, p_j(y=c\mid \mathbf{x})$.
- Hard voting variant predicts $\hat{y} = \operatorname{mode}\{ \hat{y}_j \}$, where $\hat{y}_j = \arg\max_c p_j(y=c\mid \mathbf{x})$.
- Gender prediction rationale: blends complementary inductive biases (linear, tree-based, neural) to reduce variance on ambiguous names; remains suitable provided individual members are well-calibrated (see the voting sketch below).

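Soft and hard voting reduce to a few lines of NumPy; the per-model probabilities and weights below are illustrative.

```python
import numpy as np

# Per-model class probabilities for one name: rows = base learners, cols = classes.
probs = np.array([
    [0.70, 0.30],   # e.g. logistic regression
    [0.55, 0.45],   # e.g. random forest
    [0.80, 0.20],   # e.g. naive Bayes
])
weights = np.array([1.0, 1.0, 2.0])   # illustrative per-model weights w_j

soft_vote = weights @ probs / weights.sum()              # weighted average of probabilities
hard_vote = np.bincount(probs.argmax(axis=1)).argmax()   # majority over per-model argmax labels

print(soft_vote, soft_vote.argmax(), hard_vote)
```
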
@@ -17,17 +17,38 @@ class BiGRUModel(NeuralNetworkModel):
        params = kwargs
        model = Sequential(
            [
                Embedding(input_dim=vocab_size, output_dim=params.get("embedding_dim", 64)),
                # Mask padding tokens so recurrent layers ignore them; fix input length
                # for better shape inference and to support masking through the stack.
                Embedding(
                    input_dim=vocab_size,
                    output_dim=params.get("embedding_dim", 64),
                    input_length=max_len,
                    mask_zero=True,
                ),
                # First recurrent block returns full sequences to allow stacking.
                # Moderate dropout + optional recurrent_dropout to reduce overfitting
                # on short names while retaining temporal signal.
                Bidirectional(
                    GRU(
                        params.get("gru_units", 32),
                        return_sequences=True,
                        dropout=params.get("dropout", 0.2),
                        recurrent_dropout=params.get("recurrent_dropout", 0.0),
                    )
                ),
                Bidirectional(GRU(params.get("gru_units", 32), dropout=params.get("dropout", 0.2))),
                # Second GRU summarizes to the last hidden state (no return_sequences),
                # capturing bidirectional context efficiently for classification.
                Bidirectional(
                    GRU(
                        params.get("gru_units", 32),
                        dropout=params.get("dropout", 0.2),
                        recurrent_dropout=params.get("recurrent_dropout", 0.0),
                    )
                ),
                # Small dense head; ReLU + dropout for capacity and regularization.
                Dense(64, activation="relu"),
                Dropout(params.get("dropout", 0.5)),
                # Two-way softmax for binary gender classification.
                Dense(2, activation="softmax"),
            ]
        )
@@ -38,19 +59,13 @@ class BiGRUModel(NeuralNetworkModel):
        return model

    def prepare_features(self, X: pd.DataFrame) -> np.ndarray:
        text_data = []
        for feature_type in self.config.features:
            if feature_type.value in X.columns:
                text_data.extend(X[feature_type.value].astype(str).tolist())

        if not text_data:
            raise ValueError("No text data found in the provided DataFrame.")
        text_data = self._collect_text_corpus(X)

        if self.tokenizer is None:
            self.tokenizer = Tokenizer(char_level=False, lower=True, oov_token="<OOV>")
            self.tokenizer.fit_on_texts(text_data)

        sequences = self.tokenizer.texts_to_sequences(text_data[: len(X)])
        sequences = self.tokenizer.texts_to_sequences(text_data)
        max_len = self.config.model_params.get("max_len", 6)

        return pad_sequences(sequences, maxlen=max_len, padding="post")
@@ -9,6 +9,7 @@ from tensorflow.keras.layers import (
    GlobalMaxPooling1D,
    Dense,
    Dropout,
    SpatialDropout1D,
)
from tensorflow.keras.models import Sequential
@@ -24,21 +25,33 @@ class CNNModel(NeuralNetworkModel):
        params = kwargs
        model = Sequential(
            [
                # Learn char/subword embeddings; spatial dropout regularizes across channels
                # to make the model robust to noisy characters and transliteration.
                Embedding(input_dim=vocab_size, output_dim=params.get("embedding_dim", 64)),
                SpatialDropout1D(rate=params.get("embedding_dropout", 0.1)),
                # Small kernels capture short n-gram like patterns; padding='same' keeps
                # sequence length stable for simpler pooling behavior.
                Conv1D(
                    filters=params.get("filters", 64),
                    kernel_size=params.get("kernel_size", 3),
                    activation="relu",
                    padding="same",
                ),
                # Downsample to gain some position invariance and reduce computation.
                MaxPooling1D(pool_size=2),
                # Second conv layer to compose higher-level motifs (e.g., suffix+vowel).
                Conv1D(
                    filters=params.get("filters", 64),
                    kernel_size=params.get("kernel_size", 3),
                    activation="relu",
                    padding="same",
                ),
                # Global max pooling picks strongest motif evidence anywhere in the name.
                GlobalMaxPooling1D(),
                # Compact dense head with dropout to control overfitting.
                Dense(64, activation="relu"),
                Dropout(params.get("dropout", 0.5)),
                # Two-way softmax for binary classification.
                Dense(2, activation="softmax"),
            ]
        )
@@ -55,21 +68,14 @@ class CNNModel(NeuralNetworkModel):
        from tensorflow.keras.preprocessing.sequence import pad_sequences

        # Get text data from extracted features - use character level for CNN
        text_data = []
        for feature_type in self.config.features:
            if feature_type.value in X.columns:
                text_data.extend(X[feature_type.value].astype(str).tolist())

        if not text_data:
            # Fallback - should not happen if FeatureExtractor is properly configured
            text_data = [""] * len(X)
        text_data = self._collect_text_corpus(X)

        # Initialize character-level tokenizer
        if self.tokenizer is None:
            self.tokenizer = Tokenizer(char_level=True, lower=True, oov_token="<OOV>")
            self.tokenizer.fit_on_texts(text_data)

        sequences = self.tokenizer.texts_to_sequences(text_data[: len(X)])
        sequences = self.tokenizer.texts_to_sequences(text_data)
        max_len = self.config.model_params.get("max_len", 20)  # Longer for character level

        return pad_sequences(sequences, maxlen=max_len, padding="post")
@@ -31,7 +31,8 @@ class EnsembleModel(TraditionalModel):
            "base_models", ["logistic_regression", "random_forest", "naive_bayes"]
        )

        # Create base models with simplified configs
        # Create base models with simplified configs; diverse vectorizers/classifiers
        # encourage complementary errors that voting can average out.
        estimators = []
        for model_type in base_model_types:
            if model_type == "logistic_regression":
@@ -78,8 +79,10 @@ class EnsembleModel(TraditionalModel):
                )
                estimators.append((f"nb", model))

        # Soft voting averages probabilities (preferred when members are calibrated);
        # hard voting uses majority class. Parallelize member predictions.
        voting_type = params.get("voting", "soft")  # 'hard' or 'soft'
        return VotingClassifier(estimators=estimators, voting=voting_type)
        return VotingClassifier(estimators=estimators, voting=voting_type, n_jobs=params.get("n_jobs", -1))

    def prepare_features(self, X: pd.DataFrame) -> np.ndarray:
        text_features = []
@@ -20,6 +20,8 @@ class LightGBMModel(TraditionalModel):
    def build_model(self) -> BaseEstimator:
        params = self.config.model_params

        # Leaf-wise boosted trees excel on sparse/categorical mixes; binary objective
        # and parallelism improve training speed for this task.
        return lgb.LGBMClassifier(
            n_estimators=params.get("n_estimators", 100),
            max_depth=params.get("max_depth", -1),
@@ -28,6 +30,8 @@ class LightGBMModel(TraditionalModel):
            subsample=params.get("subsample", 0.8),
            colsample_bytree=params.get("colsample_bytree", 0.8),
            random_state=self.config.random_seed,
            objective=params.get("objective", "binary"),
            n_jobs=params.get("n_jobs", -1),
            verbose=2,
        )
@@ -13,14 +13,23 @@ class LogisticRegressionModel(TraditionalModel):

    def build_model(self) -> BaseEstimator:
        params = self.config.model_params
        # Character n-grams are strong signals for names; (2,5) balances
        # capturing prefixes/suffixes with tractable feature size.
        vectorizer = CountVectorizer(
            analyzer="char",
            ngram_range=params.get("ngram_range", (2, 5)),
            max_features=params.get("max_features", 10000),
        )

        # liblinear handles sparse, small-to-medium problems well; n_jobs parallelizes
        # OvR across classes (no effect for binary). class_weight can mitigate imbalance.
        classifier = LogisticRegression(
            max_iter=params.get("max_iter", 1000), random_state=self.config.random_seed, verbose=2
            max_iter=params.get("max_iter", 1000),
            random_state=self.config.random_seed,
            verbose=2,
            solver=params.get("solver", "liblinear"),
            n_jobs=params.get("n_jobs", -1),
            class_weight=params.get("class_weight", None),
        )

        return Pipeline([("vectorizer", vectorizer), ("classifier", classifier)])
@@ -2,7 +2,7 @@ from typing import Any

import numpy as np
import pandas as pd
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
@@ -17,10 +17,35 @@ class LSTMModel(NeuralNetworkModel):
        params = kwargs
        model = Sequential(
            [
                Embedding(input_dim=vocab_size, output_dim=params.get("embedding_dim", 64)),
                Bidirectional(LSTM(params.get("lstm_units", 32), return_sequences=True)),
                Bidirectional(LSTM(params.get("lstm_units", 32))),
                # Mask padding tokens; required for LSTM to ignore padded timesteps.
                Embedding(
                    input_dim=vocab_size,
                    output_dim=params.get("embedding_dim", 64),
                    input_length=max_len,
                    mask_zero=True,
                ),
                # Stacked bidirectional LSTMs: first returns sequences to feed the next.
                # Dropout/recurrent_dropout mitigate overfitting on short sequences.
                Bidirectional(
                    LSTM(
                        params.get("lstm_units", 32),
                        return_sequences=True,
                        dropout=params.get("dropout", 0.2),
                        recurrent_dropout=params.get("recurrent_dropout", 0.0),
                    )
                ),
                # Second LSTM condenses sequence to a fixed vector for classification.
                Bidirectional(
                    LSTM(
                        params.get("lstm_units", 32),
                        dropout=params.get("dropout", 0.2),
                        recurrent_dropout=params.get("recurrent_dropout", 0.0),
                    )
                ),
                # Compact dense head with dropout; sufficient capacity for name signals.
                Dense(64, activation="relu"),
                Dropout(params.get("dropout", 0.5)),
                # Two-way softmax for binary classification.
                Dense(2, activation="softmax"),
            ]
        )
@@ -31,14 +56,7 @@ class LSTMModel(NeuralNetworkModel):
        return model

    def prepare_features(self, X: pd.DataFrame) -> np.ndarray:
        text_data = []

        for feature_type in self.config.features:
            if feature_type.value in X.columns:
                text_data.extend(X[feature_type.value].astype(str).tolist())

        if not text_data:
            raise ValueError("No text data found in the provided DataFrame.")
        text_data = self._collect_text_corpus(X)

        # Initialize tokenizer if needed
        if self.tokenizer is None:
@@ -46,7 +64,7 @@ class LSTMModel(NeuralNetworkModel):
            self.tokenizer.fit_on_texts(text_data)

        # Convert to sequences
        sequences = self.tokenizer.texts_to_sequences(text_data[: len(X)])
        sequences = self.tokenizer.texts_to_sequences(text_data)
        max_len = self.config.model_params.get("max_len", 6)

        return pad_sequences(sequences, maxlen=max_len, padding="post")
@@ -13,12 +13,15 @@ class NaiveBayesModel(TraditionalModel):

    def build_model(self) -> BaseEstimator:
        params = self.config.model_params
        # Bag-of-character-ngrams aligns with Multinomial NB assumptions; (1,4)
        # includes unigrams for coverage and higher n for suffix/prefix cues.
        vectorizer = CountVectorizer(
            analyzer="char",
            ngram_range=params.get("ngram_range", (1, 4)),
            max_features=params.get("max_features", 8000),
        )

        # Laplace smoothing (alpha) counters zero counts for rare n-grams.
        classifier = MultinomialNB(alpha=params.get("alpha", 1.0))

        return Pipeline([("vectorizer", vectorizer), ("classifier", classifier)])
@@ -1,3 +1,5 @@
from typing import Dict

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
@@ -10,15 +12,23 @@ from research.traditional_model import TraditionalModel
class RandomForestModel(TraditionalModel):
    """Random Forest with engineered features"""

    def __init__(self, config):
        super().__init__(config)
        # Persist encoders so categorical mappings stay consistent.
        self.label_encoders: Dict[str, LabelEncoder] = {}

    def build_model(self) -> BaseEstimator:

        params = self.config.model_params

        # Tree ensemble is robust to mixed numeric/categorical encodings; parallelize
        # across trees for speed. Keep depth moderate for generalisation.
        return RandomForestClassifier(
            n_estimators=params.get("n_estimators", 100),
            max_depth=params.get("max_depth", None),
            random_state=self.config.random_seed,
            verbose=2,
            n_jobs=params.get("n_jobs", -1),
        )

    def prepare_features(self, X: pd.DataFrame) -> np.ndarray:
@@ -33,9 +43,24 @@ class RandomForestModel(TraditionalModel):
                # Numerical features
                features.append(column.fillna(0).values.reshape(-1, 1))
            else:
                # Categorical features (encode them)
                le = LabelEncoder()
                encoded = le.fit_transform(column.fillna("unknown").astype(str))
                # Categorical features (encode them persistently)
                feature_key = f"encoder_{feature_type.value}"

                if feature_key not in self.label_encoders:
                    self.label_encoders[feature_key] = LabelEncoder()
                    encoded = self.label_encoders[feature_key].fit_transform(
                        column.fillna("unknown").astype(str)
                    )
                else:
                    encoder = self.label_encoders[feature_key]
                    column_clean = column.fillna("unknown").astype(str)
                    known_classes = set(encoder.classes_)
                    default_class = "unknown" if "unknown" in known_classes else encoder.classes_[0]
                    column_mapped = column_clean.apply(
                        lambda value: value if value in known_classes else default_class
                    )
                    encoded = encoder.transform(column_mapped)

                features.append(encoded.reshape(-1, 1))

        return np.hstack(features) if features else np.array([]).reshape(len(X), 0)
@@ -13,17 +13,23 @@ class SVMModel(TraditionalModel):

    def build_model(self) -> BaseEstimator:
        params = self.config.model_params
        # TF-IDF downweights very common patterns; char n-grams (2,4) are effective
        # for distinguishing name morphology under RBF kernels.
        vectorizer = TfidfVectorizer(
            analyzer="char",
            ngram_range=params.get("ngram_range", (2, 4)),
            max_features=params.get("max_features", 5000),
        )

        # RBF kernel captures non-linear interactions between n-grams; probability=True
        # adds calibration at some cost. Larger cache helps speed kernel computations.
        classifier = SVC(
            kernel=params.get("kernel", "rbf"),
            C=params.get("C", 1.0),
            gamma=params.get("gamma", "scale"),
            probability=True,  # Enable probability prediction
            class_weight=params.get("class_weight", None),
            cache_size=params.get("cache_size", 1000),
            random_state=self.config.random_seed,
            verbose=2,
        )
@@ -27,7 +27,12 @@ class TransformerModel(NeuralNetworkModel):

        # Build Transformer model
        inputs = Input(shape=(max_len,))
        x = Embedding(input_dim=vocab_size, output_dim=params.get("embedding_dim", 64))(inputs)
        x = Embedding(
            input_dim=vocab_size,
            output_dim=params.get("embedding_dim", 64),
            input_length=max_len,
            mask_zero=True,
        )(inputs)

        # Add positional encoding
        positions = tf.range(start=0, limit=max_len, delta=1)
@@ -39,6 +44,7 @@ class TransformerModel(NeuralNetworkModel):
        x = self._transformer_encoder(x, params)
        x = GlobalAveragePooling1D()(x)
        x = Dense(32, activation="relu")(x)
        x = Dropout(params.get("dropout", 0.1))(x)
        outputs = Dense(2, activation="softmax")(x)

        model = Model(inputs, outputs)
@@ -54,6 +60,7 @@ class TransformerModel(NeuralNetworkModel):
        attn = MultiHeadAttention(
            num_heads=cfg_params.get("transformer_num_heads", 2),
            key_dim=cfg_params.get("transformer_head_size", 64),
            dropout=cfg_params.get("attn_dropout", 0.1),
        )(x, x)
        x = LayerNormalization(epsilon=1e-6)(x + Dropout(cfg_params.get("dropout", 0.1))(attn))
@@ -62,13 +69,7 @@ class TransformerModel(NeuralNetworkModel):
        return LayerNormalization(epsilon=1e-6)(x + Dropout(cfg_params.get("dropout", 0.1))(ff))

    def prepare_features(self, X: pd.DataFrame) -> np.ndarray:
        text_data = []
        for feature_type in self.config.features:
            if feature_type.value in X.columns:
                text_data.extend(X[feature_type.value].astype(str).tolist())

        if not text_data:
            raise ValueError("No text data found in the provided DataFrame.")
        text_data = self._collect_text_corpus(X)

        # Initialize tokenizer if needed
        if self.tokenizer is None:
@@ -76,7 +77,7 @@ class TransformerModel(NeuralNetworkModel):
            self.tokenizer.fit_on_texts(text_data)

        # Convert to sequences
        sequences = self.tokenizer.texts_to_sequences(text_data[: len(X)])
        sequences = self.tokenizer.texts_to_sequences(text_data)
        max_len = self.config.model_params.get("max_len", 6)

        return pad_sequences(sequences, maxlen=max_len, padding="post")
@@ -20,6 +20,8 @@ class XGBoostModel(TraditionalModel):
    def build_model(self) -> BaseEstimator:
        params = self.config.model_params

        # Histogram-based trees and parallelism provide fast training; default
        # logloss metric suits binary classification of gender.
        return xgb.XGBClassifier(
            n_estimators=params.get("n_estimators", 100),
            max_depth=params.get("max_depth", 6),
@@ -28,6 +30,8 @@ class XGBoostModel(TraditionalModel):
            colsample_bytree=params.get("colsample_bytree", 0.8),
            random_state=self.config.random_seed,
            eval_metric="logloss",
            n_jobs=params.get("n_jobs", -1),
            tree_method=params.get("tree_method", "hist"),
            verbosity=2,
        )
@@ -82,6 +82,25 @@ class NeuralNetworkModel(BaseModel):
        self.is_fitted = True
        return self

    def _collect_text_corpus(self, X: pd.DataFrame) -> List[str]:
        """Combine configured textual features into one string per record."""

        column_names = [feature.value for feature in self.config.features if feature.value in X.columns]
        if not column_names:
            raise ValueError("No configured text features found in the provided DataFrame.")

        text_frame = X[column_names].fillna("").astype(str)

        if len(column_names) == 1:
            return text_frame.iloc[:, 0].tolist()

        combined_rows = []
        for row in text_frame.itertuples(index=False):
            tokens = [value for value in row if value]
            combined_rows.append(" ".join(tokens))

        return combined_rows

    def cross_validate(
        self, X: pd.DataFrame, y: pd.Series, cv_folds: int = 5
    ) -> dict[str, np.floating[Any]]:
@@ -145,6 +164,9 @@ class NeuralNetworkModel(BaseModel):
        """Generate learning curve data for the model"""
        logging.info(f"Generating learning curve for {self.__class__.__name__}")

        if train_sizes is None:
            train_sizes = [0.1, 0.3, 0.5, 0.7, 1.0]

        learning_curve_data = {
            "train_sizes": [],
            "train_scores": [],