ML Workflows, Prediction & Model Errors
Faculty of Humanities, Education and Social Sciences (FHSE), University of Luxembourg
The familiar model as a supervised ML algorithm.
The linear regression model is a useful starting point because:
The model specification:
\[ y_i = \beta + \sum_{j=1}^{P} \omega_j x_{ij} + \varepsilon_i \]
Quadratic loss (\(L_2\))
\[L_2 = \sum_{i=1}^{n_1} (\psi_i - y_i)^2\]
Foundation of ordinary least squares.
Penalizes large errors more heavily.
Absolute loss (\(L_1\))
\[L_1 = \sum_{i=1}^{n_1} |\psi_i - y_i|\]
More robust to outliers.
Used in median regression (quantile regression).
The learning process selects weights \(\boldsymbol{\omega}\) that minimize the chosen loss function over the \(n_1\) training instances, where \(\psi_i\) denotes the predicted label for instance \(i\).
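The difference between the two losses shows up even in an intercept-only model. The sketch below uses made-up labels with one outlier and hypothetical helper names; it illustrates that the mean minimizes quadratic loss while the median minimizes absolute loss.

```python
from statistics import mean, median

def quadratic_loss(y_pred, y_true):
    """L2 loss: sum of squared differences between predictions and labels."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true))

def absolute_loss(y_pred, y_true):
    """L1 loss: sum of absolute differences between predictions and labels."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true))

# Toy labels with one outlier (illustrative values, not real data).
y = [1.0, 2.0, 3.0, 4.0, 100.0]

def constant_prediction(value):
    """Intercept-only model: every instance receives the same prediction."""
    return [value] * len(y)

l2_at_mean = quadratic_loss(constant_prediction(mean(y)), y)
l2_at_median = quadratic_loss(constant_prediction(median(y)), y)
l1_at_mean = absolute_loss(constant_prediction(mean(y)), y)
l1_at_median = absolute_loss(constant_prediction(median(y)), y)

print(l2_at_mean, l2_at_median)  # quadratic loss is lower at the mean
print(l1_at_mean, l1_at_median)  # absolute loss is lower at the median
```

The outlier pulls the mean far from the bulk of the data, which is why the absolute loss is described as more robust.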
Three standard metrics, shown here for the test set of size \(n_2\):
\[\text{MAE} = \frac{1}{n_2} \sum_{i=1}^{n_2} |\psi_i - y_i|\]
\[\text{RMSE} = \sqrt{\frac{1}{n_2} \sum_{i=1}^{n_2} (\psi_i - y_i)^2}\]
\[R^2 = r_{\psi, y}^2\]
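As a minimal sketch, the three metrics can be computed directly from their definitions. The function names and the toy numbers are assumptions made for illustration; \(R^2\) is computed as the squared Pearson correlation between predictions and labels, as in the formula above.

```python
from math import sqrt

def mae(y_pred, y_true):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)

def rmse(y_pred, y_true):
    """Root mean squared error."""
    return sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true))

def r_squared(y_pred, y_true):
    """Squared Pearson correlation between predictions and labels."""
    n = len(y_true)
    mean_p = sum(y_pred) / n
    mean_t = sum(y_true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(y_pred, y_true))
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    return cov ** 2 / (var_p * var_t)

# Toy test-set labels and predictions (illustrative values only).
y_test = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.0, 6.0, 6.0, 10.0]

print(mae(y_hat, y_test), rmse(y_hat, y_test), r_squared(y_hat, y_test))
```

Running the same functions on the training set gives the comparison needed to diagnose over- and under-fitting.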
Important
We can compute metrics on both training and test sets, but we ultimately care about the test set. Comparing training vs. test performance diagnoses over- and under-fitting.
After training, we want to understand which features drive predictions and how.
A partial dependence plot (PDP) isolates the relationship between a focal feature \(\boldsymbol{x}_f\) and the predicted label, averaging over all other (control) features \(\boldsymbol{x}_c\):
\[\hat{f}(\boldsymbol{x}_f) = \mathbb{E}_{\boldsymbol{x}_c}\left[f(\boldsymbol{x}_c, \boldsymbol{x}_f, \boldsymbol{q})\right] = \int f(\boldsymbol{x}_c, \boldsymbol{x}_f, \boldsymbol{q})\, dp(\boldsymbol{x}_c)\]
This is the ML equivalent of the ceteris paribus condition.
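In practice the expectation is approximated by an average over the observed instances. The following is a sketch under stated assumptions: `partial_dependence` and `toy_model` are hypothetical names, and the "fitted model" is a hand-written linear function rather than a trained one.

```python
def partial_dependence(predict, rows, focal_index, grid):
    """Average prediction when the focal feature is fixed at each grid value.

    For every grid value, the focal feature is overwritten in all rows while
    the control features keep their observed values; predictions are averaged.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)          # copy so the data stay untouched
            modified[focal_index] = value
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

def toy_model(x):
    """Stand-in for a fitted model (coefficients chosen for illustration)."""
    return 1.0 + 2.0 * x[0] + 0.5 * x[1]

# Each row is (focal feature, control feature); values are made up.
rows = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]]
pdp = partial_dependence(toy_model, rows, focal_index=0, grid=[0.0, 1.0, 2.0])
print(pdp)
```

For this linear stand-in the curve rises by 2 per unit of the focal feature, which recovers its coefficient; for a flexible learner the curve can take any shape.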
Note
PDPs can be misleading when focal features are correlated with controls. Alternatives such as accumulated local effects (ALE) exist but are less standard in social-science applications.
How does the algorithm learn?
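Most ML problems are too complex for closed-form solutions, so we turn to gradient descent, which repeatedly moves the weights a small step against the gradient of the loss:
\[\boldsymbol{\omega}^{(t+1)} = \boldsymbol{\omega}^{(t)} - \eta \nabla L(\boldsymbol{\omega}^{(t)})\]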
\(\eta\) (eta) is the learning rate — a tuning parameter controlling step size.
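A minimal sketch of the idea: at each step the weights move against the gradient of the loss, scaled by \(\eta\). The example assumes a one-feature linear model and uses the mean (rather than the sum) of squared errors so that the step size does not depend on sample size; the function name, starting values, and toy data are illustrative choices.

```python
def gradient_descent(x, y, eta=0.1, steps=2000):
    """Fit y = beta + omega * x by full-batch gradient descent on mean squared error."""
    beta, omega = 0.0, 0.0  # arbitrary starting values
    n = len(x)
    for _ in range(steps):
        residuals = [beta + omega * xi - yi for xi, yi in zip(x, y)]
        # Partial derivatives of the mean squared error.
        grad_beta = 2.0 * sum(residuals) / n
        grad_omega = 2.0 * sum(r * xi for r, xi in zip(residuals, x)) / n
        # Step against the gradient; eta controls the step size.
        beta -= eta * grad_beta
        omega -= eta * grad_omega
    return beta, omega

# Noise-free toy data generated from y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
beta_hat, omega_hat = gradient_descent(x, y)
print(round(beta_hat, 4), round(omega_hat, 4))
```

A learning rate that is too large makes the updates diverge, while one that is too small needs many more steps, which is why \(\eta\) is treated as a tuning parameter.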
Note
Linear regression has a closed-form OLS solution, so gradient descent is not needed. Iterative optimization becomes essential, however, for penalized models (lasso, elastic net) and neural networks.
We want optimizers that are:
Stochastic gradient descent (SGD): updates weights using a random mini-batch of instances at each step — much faster than full-batch gradient descent on large datasets.
Modern variants (Adam, RMSProp, AdaGrad) adapt the learning rate automatically and are standard in deep learning.
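The mini-batch idea can be sketched in a few lines. This is an illustration under assumptions, not a production optimizer: the function name, batch size, learning rate, and seed are arbitrary, and the toy data are noise-free so the result can be checked against the known answer.

```python
import random

def sgd(x, y, eta=0.05, batch_size=2, steps=5000, seed=0):
    """Fit y = beta + omega * x with mini-batch stochastic gradient descent."""
    rng = random.Random(seed)
    beta, omega = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        # Draw a random mini-batch instead of using all n instances.
        batch = rng.sample(range(n), batch_size)
        residuals = [(beta + omega * x[i] - y[i], x[i]) for i in batch]
        grad_beta = 2.0 * sum(r for r, xi in residuals) / batch_size
        grad_omega = 2.0 * sum(r * xi for r, xi in residuals) / batch_size
        beta -= eta * grad_beta
        omega -= eta * grad_omega
    return beta, omega

# Noise-free toy data generated from y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
beta_sgd, omega_sgd = sgd(x, y)
print(round(beta_sgd, 3), round(omega_sgd, 3))
```

Each step touches only `batch_size` instances, so the cost per update stays constant as the dataset grows. With noisy data the estimates would fluctuate around the optimum rather than settle exactly on it.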
Understanding and managing the three sources of error.
Every predictive model produces error. There are exactly three sources:
\[\text{Total Error} = \underbrace{B^2}_{\text{Bias}^2} + \underbrace{V}_{\text{Variance}} + \underbrace{I}_{\text{Irreducible}}\]
Irreducible Error \(I\)
Cannot be eliminated by any algorithm.
Sources: missing features, measurement error, random shocks.
Remedy: collect more and better data.
Bias Error \(B\)
Systematic under- or over-prediction.
Causes: stopping training too early; model too simple (underfitting).
Remedy: increase model complexity; allow training to converge.
Variance Error \(V\)
Model fails to generalize; overly sensitive to the specific training sample.
Causes: overfitting (model too complex, capitalizing on noise).
Remedy: reduce model complexity; regularize.
Bias and variance pull in opposite directions:
The goal is to find the sweet spot — the complexity level at which total error is minimized.
Important
Variance error should not be underestimated. It is a primary reason why algorithms fail when deployed on new data. Always evaluate on a held-out test set that the model has never seen.
Compare training and test performance:
| Scenario | Training performance | Test performance | Diagnosis |
|---|---|---|---|
| Underfit | Poor | Poor | Bias error dominant; model too simple |
| Good fit | Good | Good | Balanced bias–variance |
| Overfit | Excellent | Poor | Variance error dominant; model too complex |
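The overfit row of the table can be reproduced with a deliberately over-flexible learner. The sketch below assumes a toy data-generating process (\(y = 2x\) plus noise) and a one-nearest-neighbour predictor that memorizes the training set; all names and settings are illustrative.

```python
import random
from math import sqrt

rng = random.Random(0)

def simulate(n):
    """Draw n instances from y = 2x + noise (an assumed toy process)."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

def rmse(y_pred, y_true):
    return sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true))

x_train, y_train = simulate(50)
x_test, y_test = simulate(50)

def one_nn_predict(x_new):
    """Very flexible model: copy the label of the nearest training instance."""
    nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x_new))
    return y_train[nearest]

train_rmse = rmse([one_nn_predict(x) for x in x_train], y_train)
test_rmse = rmse([one_nn_predict(x) for x in x_test], y_test)
print(train_rmse, test_rmse)
```

The training error is zero because every training instance is its own nearest neighbour, while the test error stays well above zero. That gap between the two is the signature of dominant variance error.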
Adding more to the train/test split.
A tuning parameter affects how an algorithm operates but cannot be estimated from the data — it must be set by the researcher.
Examples:
Although the researcher sets the final value, we can use re-sampling to let the data inform that decision.
Re-sampling provides out-of-sample estimates of performance for each candidate value, which makes it superior to re-substitution (evaluating the model on the same data used to fit it).
Validation set
Further split training into training-proper and validation.
Disadvantage: only one performance estimate; can be atypical.
\(k\)-fold cross-validation
Randomly divide training into \(k\) folds. Each fold is used once for validation; remaining \(k-1\) folds for training.
\(k = 10\) is standard. Provides \(k\) performance estimates.
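The fold assignment itself is simple bookkeeping. A minimal sketch, with a hypothetical function name and an arbitrary seed, that returns index lists rather than data:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Return k (training, validation) index pairs for n instances."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)        # random assignment to folds
    folds = [indices[i::k] for i in range(k)]   # k roughly equal-sized folds
    splits = []
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits

splits = k_fold_indices(n=20, k=5)
for training, validation in splits:
    print(len(training), len(validation))
```

Fitting the model on each `training` list and scoring it on the matching `validation` list yields the \(k\) performance estimates, which are then averaged for each candidate tuning value.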
The bootstrap
Repeatedly sample \(n_1\) instances with replacement. Out-of-bag instances (\(\approx 36.8\%\)) serve as validation.
Particularly useful with small datasets.
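The out-of-bag share follows from the probability that an instance is never drawn in \(n_1\) draws with replacement, \((1 - 1/n_1)^{n_1} \approx e^{-1} \approx 0.368\). A small simulation, with arbitrary sample size, repetition count, and seed, shows the same figure empirically:

```python
import random

def bootstrap_split(n, rng):
    """Draw n indices with replacement; return in-bag and out-of-bag indices."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag

rng = random.Random(1)
n = 1000
oob_fractions = []
for _ in range(200):
    in_bag, oob = bootstrap_split(n, rng)
    oob_fractions.append(len(oob) / n)

mean_oob = sum(oob_fractions) / len(oob_fractions)
print(round(mean_oob, 3))  # close to the theoretical value of about 0.368
```

The in-bag indices play the role of the training set and the out-of-bag indices the role of the validation set in each repetition.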
When tuning is needed, we divide data into three parts:
\[ \underbrace{\text{Training proper}}_{\text{fit the model}} + \underbrace{\text{Validation}}_{\text{tune hyperparameters}} + \underbrace{\text{Test}}_{\text{final evaluation}} \]
Important
The test set is untouched until the very end. The validation set informs hyperparameter choices; the test set gives the honest estimate of generalization performance.
Once the model has been evaluated on the test set, no further changes to the model are permitted.
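A three-way split can be sketched as a single shuffle followed by slicing. The 60/20/20 shares, the function name, and the seed are assumptions for illustration; other proportions are equally valid.

```python
import random

def three_way_split(n, val_share=0.2, test_share=0.2, seed=0):
    """Split n instance indices into training-proper, validation, and test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_share)
    n_val = round(n * val_share)
    test = indices[:n_test]                        # set aside until the very end
    validation = indices[n_test:n_test + n_val]    # used to tune hyperparameters
    training = indices[n_test + n_val:]            # used to fit the model
    return training, validation, test

training, validation, test = three_way_split(100)
print(len(training), len(validation), len(test))
```

Because the test indices are fixed before any tuning takes place, nothing learned from the validation set can leak into the final evaluation.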
Penalizing complexity to prevent over-fitting.
Reminder: variance error arises when a model is too complex and overfits the training data. One powerful remedy is regularization.
Regularization adds a penalty for model complexity to the loss function, which serves as a constraint on the optimization problem.
This can be understood as a shrinkage estimator, i.e. coefficients that contribute little to prediction are shrunk toward zero, reducing model complexity automatically.
The penalized optimization problem takes the general form:
\[\min_{\boldsymbol{\omega}} \underbrace{\sum_{i=1}^{n_1} (\beta + \boldsymbol{x}_i^\top \boldsymbol{\omega} - y_i)^2}_{\text{fit to training data}} + \underbrace{\lambda \cdot \text{Penalty}(\boldsymbol{\omega})}_{\text{complexity penalty}}\]
\(\lambda\) is a tuning parameter controlling the strength of regularization — selected by cross-validation.
Lasso = Least Absolute Shrinkage and Selection Operator (Tibshirani, 1996).
The penalty is the \(L_1\)-norm of the weight vector:
\[\min_{\boldsymbol{\omega}} \sum_{i=1}^{n_1} (\beta + \boldsymbol{x}_i^\top \boldsymbol{\omega} - y_i)^2 + \lambda \sum_{j=1}^{P} |\omega_j|\]
Key property: the lasso shrinks some coefficients exactly to zero, performing automatic feature selection.
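Why exact zeros occur is easiest to see with a single centered feature and no intercept, a simplifying assumption made only for this sketch. Minimizing \(\sum_i (x_i\omega - y_i)^2 + \lambda|\omega|\) then gives a soft-thresholding rule: \(\omega = \operatorname{sign}(s)\max(|s| - \lambda/2,\, 0) / \sum_i x_i^2\) with \(s = \sum_i x_i y_i\). The function name and data are illustrative.

```python
def lasso_one_feature(x, y, lam):
    """Lasso weight for one centered feature and no intercept (soft-thresholding)."""
    s = sum(xi * yi for xi, yi in zip(x, y))   # association between feature and label
    ss = sum(xi * xi for xi in x)
    shrunk = max(abs(s) - lam / 2.0, 0.0)      # shrink toward zero, stop at zero
    sign = 1.0 if s > 0 else -1.0
    return sign * shrunk / ss

# Centered toy data with y = 2x exactly.
x = [-1.0, 0.0, 1.0]
y = [-2.0, 0.0, 2.0]

for lam in [0.0, 4.0, 8.0, 10.0]:
    print(lam, lasso_one_feature(x, y, lam))
```

Once \(\lambda/2\) reaches \(|s|\), the weight is exactly zero and stays there for any larger \(\lambda\); this is the feature-selection behaviour described above.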
Note
The lasso assumes all predictors are on the same scale. Always standardize features before applying regularization. This is handled automatically by step_normalize() in a recipe.
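What standardization does can be sketched by hand: subtract the mean and divide by the standard deviation, using statistics estimated on the training data only. The helper below is a rough stand-in written for illustration, not the recipe step itself, and it assumes the sample standard deviation.

```python
from statistics import mean, stdev

def standardize(train_values, new_values=None):
    """Center and scale values using the training mean and sample SD."""
    center = mean(train_values)
    scale = stdev(train_values)
    target = train_values if new_values is None else new_values
    return [(v - center) / scale for v in target]

# Made-up feature values.
x_train = [2.0, 4.0, 6.0]
x_test = [8.0]

z_train = standardize(x_train)
z_test = standardize(x_train, x_test)   # test data reuse the training mean and SD
print(z_train, z_test)
```

After this transformation a one-unit change means one standard deviation for every feature, so the penalty treats all coefficients on a comparable footing.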
Ridge regression: the penalty is the squared \(L_2\)-norm of the weight vector (the sum of squared coefficients):
\[\min_{\boldsymbol{\omega}} \sum_{i=1}^{n_1} (\beta + \boldsymbol{x}_i^\top \boldsymbol{\omega} - y_i)^2 + \lambda \sum_{j=1}^{P} \omega_j^2\]
Key difference from lasso: ridge shrinks coefficients toward zero but rarely sets them exactly to zero — it retains all features.
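The same single-feature simplification (one centered feature, no intercept; an assumption made for this sketch) shows why ridge does not produce exact zeros. Minimizing \(\sum_i (x_i\omega - y_i)^2 + \lambda\omega^2\) gives \(\omega = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)\), which shrinks proportionally as \(\lambda\) grows.

```python
def ridge_one_feature(x, y, lam):
    """Ridge weight for one centered feature and no intercept."""
    s = sum(xi * yi for xi, yi in zip(x, y))
    ss = sum(xi * xi for xi in x)
    return s / (ss + lam)   # lam inflates the denominator, shrinking the weight

# Centered toy data with y = 2x exactly.
x = [-1.0, 0.0, 1.0]
y = [-2.0, 0.0, 2.0]

for lam in [0.0, 2.0, 6.0, 1000.0]:
    print(lam, ridge_one_feature(x, y, lam))
```

Even at a very large \(\lambda\) the weight is small but nonzero, in contrast to the lasso.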
[Figure panels: Lasso (\(q = 1\)) and Ridge (\(q = 2\))]
Elastic nets combine lasso and ridge, offering a mixture tuned by \(\alpha \in [0,1]\):
\[\text{Penalty}(\alpha) = \sum_{j=1}^{P} \left[\frac{1}{2}(1-\alpha)\omega_j^2 + \alpha|\omega_j|\right]\]
\[\min_{\boldsymbol{\omega}} \sum_{i=1}^{n_1} (\beta + \boldsymbol{x}_i^\top \boldsymbol{\omega} - y_i)^2 + \lambda \cdot \text{Penalty}(\alpha)\]
Elastic nets are a flexible default when it is unclear whether lasso or ridge is more appropriate.
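The mixing role of \(\alpha\) can be checked by evaluating the penalty formula directly. A minimal sketch with a hypothetical function name and made-up weights:

```python
def elastic_net_penalty(weights, alpha):
    """Elastic net penalty: a mix of the ridge term and the lasso term."""
    return sum(0.5 * (1.0 - alpha) * w ** 2 + alpha * abs(w) for w in weights)

# Made-up weight vector.
weights = [0.5, -2.0, 0.0]

print(elastic_net_penalty(weights, 1.0))   # pure lasso: sum of absolute weights
print(elastic_net_penalty(weights, 0.0))   # pure ridge: half the sum of squared weights
print(elastic_net_penalty(weights, 0.5))   # equal mixture of the two
```

Both \(\alpha\) and \(\lambda\) are tuning parameters here, so a grid over the two is typically evaluated by re-sampling.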
Tip
A coefficient shrunk to exactly zero means the lasso component removed that feature. Compare this to the unpenalized regression from Session 3 — regularization often substantially simplifies the model.