Theory of Regression

The familiar model as a supervised ML algorithm.

Preliminary Remarks

The linear regression model is a useful starting point because:

  • It is familiar from classical statistics.
  • It illustrates all core ML concepts (loss, training, evaluation, interpretation).
  • It is somewhat atypical for ML: the functional form \(\phi(\cdot)\) is imposed rather than learnt; there are no tuning parameters and no automatic feature selection.

The model specification:

\[ y_i = \beta + \sum_{j=1}^{P} \omega_j x_{ij} + \varepsilon_i \]

  • \(\beta\): bias (intercept)
  • \(\omega_j\): weight (partial slope) for feature \(j\)
  • \(P\): number of predictive features
  • \(\varepsilon_i\): irreducible error

Loss Functions for Regression

Quadratic loss (\(L_2\))

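\[L_2 = \sum_{i=1}^{n_1} (\psi_i - y_i)^2\]

Here \(\psi_i\) is the predicted label for instance \(i\), \(y_i\) is the observed label, and \(n_1\) is the number of training instances.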

Foundation of ordinary least squares.

Penalizes large errors more heavily.

Absolute loss (\(L_1\))

\[L_1 = \sum_{i=1}^{n_1} |\psi_i - y_i|\]

More robust to outliers.

Used in median regression, a special case of quantile regression.

The learning process selects weights \(\boldsymbol{\omega}\) that minimize the chosen loss function over the training set.
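To make the contrast between the two losses concrete, here is a minimal Python sketch. The function names and the toy numbers are illustrative assumptions, not part of the course material; the last label is a deliberate outlier.

```python
def quadratic_loss(predictions, labels):
    """Sum of squared errors (L2 loss)."""
    return sum((p - y) ** 2 for p, y in zip(predictions, labels))


def absolute_loss(predictions, labels):
    """Sum of absolute errors (L1 loss)."""
    return sum(abs(p - y) for p, y in zip(predictions, labels))


# Hypothetical toy data: three perfect predictions and one outlying label.
labels = [1.0, 2.0, 3.0, 20.0]
predictions = [1.0, 2.0, 3.0, 4.0]

l2 = quadratic_loss(predictions, labels)  # (4 - 20)^2 = 256
l1 = absolute_loss(predictions, labels)   # |4 - 20| = 16
print(l2, l1)
```

The single outlier contributes 256 to the quadratic loss but only 16 to the absolute loss, which is why weights chosen under \(L_1\) are pulled less strongly toward outlying instances.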

Performance Metrics (test set)

Three standard metrics, each computed here over the \(n_2\) test instances:

\[\text{MAE} = \frac{1}{n_2} \sum_{i=1}^{n_2} |\psi_i - y_i|\]

\[\text{RMSE} = \sqrt{\frac{1}{n_2} \sum_{i=1}^{n_2} (\psi_i - y_i)^2}\]

\[R^2 = r_{\psi, y}^2\]

Here \(r_{\psi, y}\) is the correlation between the predicted and the observed labels.
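The three metrics can be sketched in plain Python as follows. The helper names and the four test instances are hypothetical; \(R^2\) is computed as the squared Pearson correlation, following the definition used in these notes.

```python
import math


def mae(predictions, labels):
    """Mean absolute error."""
    return sum(abs(p - y) for p, y in zip(predictions, labels)) / len(labels)


def rmse(predictions, labels):
    """Root mean squared error."""
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels))


def r_squared(predictions, labels):
    """Squared Pearson correlation between predictions and labels."""
    n = len(labels)
    mean_p = sum(predictions) / n
    mean_y = sum(labels) / n
    cov = sum((p - mean_p) * (y - mean_y) for p, y in zip(predictions, labels))
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    var_y = sum((y - mean_y) ** 2 for y in labels)
    return cov ** 2 / (var_p * var_y)


# Hypothetical test-set labels and predictions.
test_labels = [3.0, 5.0, 7.0, 9.0]
test_predictions = [2.0, 6.0, 6.0, 10.0]

print(mae(test_predictions, test_labels))        # every error has size 1
print(rmse(test_predictions, test_labels))
print(r_squared(test_predictions, test_labels))
```

MAE and RMSE are in the units of the label; \(R^2\) is unit-free and lies between 0 and 1 under this definition.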

Important

We can compute metrics on both training and test sets, but we ultimately care about the test set. Comparing training vs. test performance diagnoses over- and under-fitting.

Interpretation: Partial Dependence Plots

After training, we want to understand which features drive predictions and how.

A partial dependence plot (PDP) isolates the relationship between a focal feature \(\boldsymbol{x}_f\) and the predicted label, averaging over all other (control) features \(\boldsymbol{x}_c\):

\[\hat{f}(\boldsymbol{x}_f) = \mathbb{E}_{\boldsymbol{x}_c}\left[f(\boldsymbol{x}_c, \boldsymbol{x}_f, \boldsymbol{q})\right] = \int f(\boldsymbol{x}_c, \boldsymbol{x}_f, \boldsymbol{q})\, dp(\boldsymbol{x}_c)\]

This is the ML equivalent of the ceteris paribus condition.
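In practice the expectation is usually approximated by an average over the observed data. The sketch below shows that empirical version under simple assumptions: `predict` stands in for any fitted model (here a hypothetical linear one), and `partial_dependence` is an illustrative helper, not a library function.

```python
def partial_dependence(predict, rows, focal_index, grid):
    """For each grid value, overwrite the focal feature in every row
    and average the resulting predictions."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)           # copy so the data stay unchanged
            modified[focal_index] = value  # fix the focal feature
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def predict(row):
    """Hypothetical fitted model: bias 1, weights 2 and 0.5."""
    return 1.0 + 2.0 * row[0] + 0.5 * row[1]


rows = [[0.0, 2.0], [1.0, 4.0], [2.0, 6.0]]
grid = [0.0, 1.0, 2.0]
pdp = partial_dependence(predict, rows, focal_index=0, grid=grid)
print(pdp)  # [3.0, 5.0, 7.0]
```

For a linear model the curve is a straight line whose slope equals the weight of the focal feature (2 here), so the PDP reproduces the partial slope. For nonlinear learners the same procedure traces out whatever shape the model has learnt.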

Note

PDPs can be misleading when focal features are correlated with controls. Alternatives such as accumulated local effects (ALE) exist but are less standard in social-science applications.

Optimization

How does the algorithm learn?

Gradient Descent

Most ML problems are too complex for closed-form solutions, so we turn to gradient descent:

  1. Initialize weights \(\boldsymbol{\omega}\).
  2. Compute the gradient of the loss function with respect to \(\boldsymbol{\omega}\): \[\nabla_{\boldsymbol{\omega}} L\]
  3. Update weights in the direction of steepest descent: \[\boldsymbol{\omega} \leftarrow \boldsymbol{\omega} - \eta \nabla_{\boldsymbol{\omega}} L\]
  4. Repeat until convergence (loss ceases to decrease meaningfully).

\(\eta\) (eta) is the learning rate — a tuning parameter controlling step size.
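The four steps can be sketched for a one-feature linear regression. This is an illustration under stated assumptions, not a production optimizer: the loss is the mean (rather than the sum) of squared errors so that the step size does not depend on the sample size, and the data, starting values, and tolerance are hypothetical.

```python
def gradient_descent(xs, ys, eta=0.1, max_steps=5000, tol=1e-12):
    """Fit y = beta + omega * x by full-batch gradient descent on mean squared error."""
    beta, omega = 0.0, 0.0                      # step 1: initialize
    n = len(xs)
    previous_loss = float("inf")
    for _ in range(max_steps):
        residuals = [beta + omega * x - y for x, y in zip(xs, ys)]
        loss = sum(r * r for r in residuals) / n
        if previous_loss - loss < tol:          # step 4: stop when the loss stalls
            break
        previous_loss = loss
        # step 2: gradient of the loss with respect to beta and omega
        grad_beta = 2.0 * sum(residuals) / n
        grad_omega = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        # step 3: move against the gradient, scaled by the learning rate
        beta -= eta * grad_beta
        omega -= eta * grad_omega
    return beta, omega


# Hypothetical noise-free data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
beta, omega = gradient_descent(xs, ys)
print(round(beta, 3), round(omega, 3))
```

With a learning rate that is too large for the data at hand, the updates overshoot and the loss grows instead of shrinking, which is why \(\eta\) is treated as a tuning parameter.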

Note

Linear regression has a closed-form OLS solution, so gradient descent is not needed. Iterative optimization does become essential, however, for penalized models (lasso, elastic net) and neural networks.

Why a Good Optimizer Matters

We want optimizers that are:

  • Simple in terms of calculus — avoid complex second-order derivatives
  • Numerically evaluable — can be computed on real data
  • Mini-batch capable — process small subsets of data at a time (important for large datasets)

Stochastic gradient descent (SGD): updates weights using a random mini-batch of instances at each step — much faster than full-batch gradient descent on large datasets.

Modern variants (Adam, RMSProp, AdaGrad) adapt the learning rate automatically and are standard in deep learning.
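A mini-batch version of the update can be sketched as follows. It is plain SGD with a fixed learning rate, not Adam or RMSProp; the function name, batch size, number of epochs, and data are illustrative assumptions.

```python
import random


def sgd(xs, ys, eta=0.05, batch_size=2, epochs=500, seed=0):
    """Fit y = beta + omega * x with mini-batch stochastic gradient descent."""
    rng = random.Random(seed)
    beta, omega = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random batches in every pass over the data
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            residuals = [beta + omega * xs[i] - ys[i] for i in batch]
            # gradient of the mean squared error on this mini-batch only
            grad_beta = 2.0 * sum(residuals) / len(batch)
            grad_omega = 2.0 * sum(r * xs[i] for r, i in zip(residuals, batch)) / len(batch)
            beta -= eta * grad_beta
            omega -= eta * grad_omega
    return beta, omega


# Hypothetical noise-free data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
beta, omega = sgd(xs, ys)
print(round(beta, 3), round(omega, 3))
```

Each update touches only `batch_size` instances, so the cost per step stays constant as the dataset grows. With noisy data the estimates would fluctuate around the optimum rather than settle exactly, which is one motivation for adaptive or decaying learning rates.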

Machine Learning Errors

Understanding and managing the three sources of error.

The three sources

Every predictive model produces error. There are exactly three sources:

\[\text{Total Error} = \underbrace{B^2}_{\text{Bias}^2} + \underbrace{V}_{\text{Variance}} + \underbrace{I}_{\text{Irreducible}}\]
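For squared-error loss the decomposition can be written out explicitly. The following is a sketch under assumptions that are not stated in these notes: labels follow \(y = \mu(\boldsymbol{x}) + \varepsilon\) with \(\mathbb{E}[\varepsilon] = 0\) and \(\text{Var}(\varepsilon) = \sigma^2\), and \(\psi(\boldsymbol{x})\) is the prediction of a model fitted on a random training sample that is independent of the new \(\varepsilon\). The symbols \(\mu\) and \(\sigma^2\) are introduced here for illustration only.

\[\mathbb{E}\left[(y - \psi(\boldsymbol{x}))^2\right] = \underbrace{\left(\mathbb{E}[\psi(\boldsymbol{x})] - \mu(\boldsymbol{x})\right)^2}_{B^2} + \underbrace{\mathbb{E}\left[\left(\psi(\boldsymbol{x}) - \mathbb{E}[\psi(\boldsymbol{x})]\right)^2\right]}_{V} + \underbrace{\sigma^2}_{I}\]

The expectation runs over training samples and over the noise. The cross terms drop out because \(\varepsilon\) has mean zero and is independent of \(\psi(\boldsymbol{x})\), and because \(\psi(\boldsymbol{x}) - \mathbb{E}[\psi(\boldsymbol{x})]\) has mean zero by construction.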

Irreducible Error \(I\)

Cannot be eliminated by any algorithm.

Sources: missing features, measurement error, random shocks.

Remedy: collect more and better data.

Bias Error \(B\)

Systematic under- or over-prediction.

Causes: stopping training too early; model too simple (underfitting).

Remedy: increase model complexity; allow training to converge.

Variance Error \(V\)

Model fails to generalize; overly sensitive to the specific training sample.

Causes: overfitting (model too complex, capitalizing on noise).

Remedy: reduce model complexity; regularize.

The Bias-Variance Trade-Off

Bias and variance pull in opposite directions:

  • To reduce bias: increase model complexity (add features, interactions, nonlinearities)
  • To reduce variance: decrease model complexity (penalize, simplify, reduce features)

The goal is to find the sweet spot — the complexity level at which total error is minimized.

Important

Variance error should not be underestimated. It is a primary reason why algorithms fail when deployed on new data. Always evaluate on a held-out test set that the model has never seen.

Diagnosing Over- and Under-Fitting

Compare training and test performance:

| Scenario | Training performance | Test performance | Diagnosis |
|----------|----------------------|------------------|-----------|
| Underfit | Poor | Poor | Bias error dominant; model too simple |
| Good fit | Good | Good | Balanced bias–variance |
| Overfit | Excellent | Poor | Variance error dominant; model too complex |

Tuning Parameters & Cross-Validation

Adding more to the train/test split.

Tuning Parameters

A tuning parameter affects how an algorithm operates but is not estimated by the fitting procedure itself; it must be set by the researcher.

Examples:

  • \(M\): number of principal components to retain
  • \(\lambda\): regularization strength in lasso/ridge
  • \(k\): number of neighbors in kNN
  • Learning rate \(\eta\) in neural networks

Although the researcher sets the final value, we can use re-sampling to let the data inform that decision.

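Re-sampling provides out-of-sample estimates of performance for each candidate value. This makes it superior to re-substitution (scoring the model on the same data used to fit it), which gives overly optimistic estimates.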

Three Re-Sampling Methods

Validation set

Further split training into training-proper and validation.

Disadvantage: only one performance estimate; can be atypical.

\(k\)-fold cross-validation

Randomly divide training into \(k\) folds. Each fold is used once for validation; remaining \(k-1\) folds for training.

\(k = 10\) is standard. Provides \(k\) performance estimates.
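The fold construction can be sketched with index lists only. `k_fold_indices` is a hypothetical helper written for illustration; the sizes are arbitrary.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Return k (training, validation) index pairs for n instances."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)                         # random assignment to folds
    folds = [indices[i::k] for i in range(k)]    # k roughly equal folds
    splits = []
    for i in range(k):
        validation = folds[i]                    # fold i is held out once
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits


splits = k_fold_indices(n=20, k=5)
for training, validation in splits:
    print(len(training), len(validation))  # 16 4 on each of the five lines
```

Fitting the model on each `training` list and scoring it on the matching `validation` list yields one performance estimate per fold; their average is the cross-validated estimate for a given tuning-parameter value.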

The bootstrap

Repeatedly sample \(n_1\) instances with replacement. Out-of-bag instances (\(\approx 36.8\%\)) serve as validation.

Particularly useful with small datasets.
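The 36.8% figure follows from the sampling scheme: the probability that a given instance is never drawn in \(n_1\) draws with replacement is \((1 - 1/n_1)^{n_1}\), which approaches \(e^{-1} \approx 0.368\) as \(n_1\) grows. A small simulation sketch (the helper name, sample size, and number of replications are illustrative assumptions) checks this:

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement; return in-bag and out-of-bag indices."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


rng = random.Random(0)
n = 1000
fractions = []
for _ in range(200):  # 200 bootstrap replications
    in_bag, out_of_bag = bootstrap_split(n, rng)
    fractions.append(len(out_of_bag) / n)

mean_oob = sum(fractions) / len(fractions)
print(round(mean_oob, 2))
```

Each replication trains on the in-bag instances (with duplicates) and validates on the out-of-bag ones, so no instance is ever used for both within one replication.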

The Three-Way Split of Data

When tuning is needed, we divide data into three parts:

\[ \underbrace{\text{Training proper}}_{\text{fit the model}} + \underbrace{\text{Validation}}_{\text{tune hyperparameters}} + \underbrace{\text{Test}}_{\text{final evaluation}} \]

Important

The test set is untouched until the very end. The validation set informs hyperparameter choices; the test set gives the honest estimate of generalization performance.

Once the model has been evaluated on the test set, no further changes to the model are permitted.
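A three-way split can be sketched with shuffled indices. The 60/20/20 proportions and the helper name are assumptions chosen for illustration; the notes do not prescribe specific shares.

```python
import random


def three_way_split(n, validation_share=0.2, test_share=0.2, seed=0):
    """Shuffle n indices and cut them into training, validation, and test parts."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = round(n * test_share)
    n_validation = round(n * validation_share)
    test_idx = indices[:n_test]                          # set aside first, used last
    validation_idx = indices[n_test:n_test + n_validation]
    training_idx = indices[n_test + n_validation:]
    return training_idx, validation_idx, test_idx


training_idx, validation_idx, test_idx = three_way_split(100)
print(len(training_idx), len(validation_idx), len(test_idx))  # 60 20 20
```

Setting the test indices aside before any tuning takes place is one simple way to enforce the rule that the test set stays untouched until the end.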

Regularization

Penalizing complexity to prevent over-fitting.

What is Regularization?

Reminder: variance error arises when a model is too complex and overfits the training data. One powerful remedy is regularization.

Regularization adds a penalty on model complexity to the loss function, which acts as a constraint on the optimization problem.

This can be understood as a shrinkage estimator: coefficients that contribute little to prediction are shrunk toward zero, which reduces model complexity automatically.

The penalized optimization problem takes the general form:

\[\min_{\boldsymbol{\omega}} \underbrace{\sum_{i=1}^{n_1} (\beta + \boldsymbol{x}_i^\top \boldsymbol{\omega} - y_i)^2}_{\text{fit to training data}} + \underbrace{\lambda \cdot \text{Penalty}(\boldsymbol{\omega})}_{\text{complexity penalty}}\]

\(\lambda\) is a tuning parameter controlling the strength of regularization — selected by cross-validation.
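The general form translates directly into code. In this sketch the penalty is passed in as a function, so the same objective covers different penalties; all names and numbers are hypothetical.

```python
def penalized_loss(beta, omega, rows, labels, lam, penalty):
    """Sum of squared errors plus lam times a penalty on the weights."""
    fit = 0.0
    for row, y in zip(rows, labels):
        prediction = beta + sum(w * x for w, x in zip(omega, row))
        fit += (prediction - y) ** 2
    return fit + lam * penalty(omega)


def l1_penalty(omega):
    """Sum of absolute weights."""
    return sum(abs(w) for w in omega)


def l2_penalty(omega):
    """Sum of squared weights."""
    return sum(w * w for w in omega)


# Hypothetical training data and candidate weights.
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [2.0, 1.0, 3.0]
beta, omega = 0.0, [2.0, -1.0]

unpenalized = penalized_loss(beta, omega, rows, labels, lam=0.0, penalty=l1_penalty)
with_l1 = penalized_loss(beta, omega, rows, labels, lam=0.5, penalty=l1_penalty)
with_l2 = penalized_loss(beta, omega, rows, labels, lam=0.5, penalty=l2_penalty)
print(unpenalized, with_l1, with_l2)  # 8.0 9.5 10.5
```

Note that the bias \(\beta\) is not passed to the penalty: only the weights are shrunk. Setting `lam=0.0` recovers the ordinary least-squares objective.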

The Lasso

Lasso = Least Absolute Shrinkage and Selection Operator (Tibshirani, 1996).

The penalty is the \(L_1\)-norm of the weight vector:

\[\min_{\boldsymbol{\omega}} \sum_{i=1}^{n_1} (\beta + \boldsymbol{x}_i^\top \boldsymbol{\omega} - y_i)^2 + \lambda \sum_{j=1}^{P} |\omega_j|\]

Key property: the lasso shrinks some coefficients exactly to zero, performing automatic feature selection.
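Why exact zeros arise can be seen in a textbook special case, sketched here under strong assumptions: a single feature scaled so that \(\sum_i x_i^2 = 1\), and no bias term. With the objective above, the lasso solution is then the OLS coefficient moved toward zero by \(\lambda/2\) and set to zero if it would cross zero (soft-thresholding). The function names and coefficient values are illustrative.

```python
def soft_threshold(z, threshold):
    """Move z toward zero by threshold; return 0.0 if |z| <= threshold."""
    if z > threshold:
        return z - threshold
    if z < -threshold:
        return z + threshold
    return 0.0


def lasso_single_feature(ols_coefficient, lam):
    """Minimizer of (omega - z)^2 + lam * |omega| for OLS coefficient z."""
    return soft_threshold(ols_coefficient, lam / 2.0)


# Hypothetical OLS coefficients, each treated as its own one-feature problem.
ols_coefficients = [3.0, -0.4, 0.25, -2.0]
lasso_coefficients = [lasso_single_feature(z, lam=1.0) for z in ols_coefficients]
print(lasso_coefficients)  # [2.5, 0.0, 0.0, -1.5]
```

Coefficients smaller in magnitude than \(\lambda/2\) are removed entirely, while larger ones are shrunk by a constant amount. General lasso solvers handle correlated features iteratively, but the same thresholding logic is what produces the zeros.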

Note

The lasso penalty depends on the scale of each predictor, so features measured in large units are penalized differently from features measured in small units. Always standardize features before applying regularization. In a recipe, adding step_normalize() handles this.
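The idea behind standardization can be sketched in plain Python. This is an illustration of the principle (centre on the mean, divide by the standard deviation, both estimated on the training data), not the implementation used by any particular package; the helper names and values are hypothetical.

```python
import statistics


def fit_scaler(training_column):
    """Estimate mean and sample standard deviation on the training data only."""
    return statistics.mean(training_column), statistics.stdev(training_column)


def apply_scaler(column, mean, sd):
    """Standardize a column with previously estimated mean and sd."""
    return [(value - mean) / sd for value in column]


training_column = [2.0, 4.0, 6.0]
new_column = [8.0]  # e.g. a test-set value

mean, sd = fit_scaler(training_column)
print(apply_scaler(training_column, mean, sd))  # [-1.0, 0.0, 1.0]
print(apply_scaler(new_column, mean, sd))       # [2.0]
```

New data are scaled with the training mean and standard deviation rather than their own, so that no information from the validation or test data leaks into the fitted model.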

Tikhonov Regularization (Ridge Regression)

The penalty is the squared \(L_2\)-norm of the weight vector (the sum of squared coefficients):

\[\min_{\boldsymbol{\omega}} \sum_{i=1}^{n_1} (\beta + \boldsymbol{x}_i^\top \boldsymbol{\omega} - y_i)^2 + \lambda \sum_{j=1}^{P} \omega_j^2\]

Key difference from lasso: ridge shrinks coefficients toward zero but rarely sets them exactly to zero — it retains all features.
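The contrast with exact zeros can be illustrated in a simple special case, again under strong assumptions: a single feature scaled so that \(\sum_i x_i^2 = 1\), and no bias term. Minimizing \((\omega - z)^2 + \lambda \omega^2\), where \(z\) is the OLS coefficient, gives \(\omega = z / (1 + \lambda)\). The function name and values below are illustrative.

```python
def ridge_single_feature(ols_coefficient, lam):
    """Minimizer of (omega - z)^2 + lam * omega^2 for OLS coefficient z."""
    return ols_coefficient / (1.0 + lam)


# Hypothetical OLS coefficients, each treated as its own one-feature problem.
ols_coefficients = [3.0, -0.4, 0.25, -2.0]
ridge_coefficients = [ridge_single_feature(z, lam=1.0) for z in ols_coefficients]
print(ridge_coefficients)  # [1.5, -0.2, 0.125, -1.0]
```

Every coefficient is scaled by the same factor \(1/(1+\lambda)\), so a nonzero OLS coefficient stays nonzero for any finite \(\lambda\). Shrinkage is proportional rather than by a fixed amount.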

Lasso (penalty exponent \(q = 1\))

  • Sparse solutions
  • Built-in feature selection
  • Preferred when many features are irrelevant

Ridge (penalty exponent \(q = 2\))

  • Dense solutions (all features retained)
  • Better when many features are modestly relevant
  • Stable under multicollinearity

Elastic Nets

Elastic nets combine lasso and ridge, offering a mixture tuned by \(\alpha \in [0,1]\):

\[\text{Penalty}(\alpha) = \sum_{j=1}^{P} \left[\frac{1}{2}(1-\alpha)\omega_j^2 + \alpha|\omega_j|\right]\]

\[\min_{\boldsymbol{\omega}} \sum_{i=1}^{n_1} (\beta + \boldsymbol{x}_i^\top \boldsymbol{\omega} - y_i)^2 + \lambda \cdot \text{Penalty}(\alpha)\]

  • \(\alpha = 1\): reduces to the lasso
  • \(\alpha = 0\): reduces to Tikhonov/ridge
  • Both \(\lambda\) and \(\alpha\) are tuning parameters

Elastic nets are a flexible default when it is unclear whether lasso or ridge is more appropriate.
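The mixing role of \(\alpha\) can be checked numerically. The sketch below implements only the penalty term as defined above; the function name and weight vector are hypothetical.

```python
def elastic_net_penalty(omega, alpha):
    """Sum over weights of 0.5 * (1 - alpha) * w^2 + alpha * |w|."""
    return sum(0.5 * (1.0 - alpha) * w * w + alpha * abs(w) for w in omega)


omega = [2.0, -1.0, 0.0]  # hypothetical weights

lasso_end = elastic_net_penalty(omega, alpha=1.0)  # sum of |w| = 3.0
ridge_end = elastic_net_penalty(omega, alpha=0.0)  # half the sum of w^2 = 2.5
mixed = elastic_net_penalty(omega, alpha=0.5)      # 0.25 * 5 + 0.5 * 3 = 2.75
print(lasso_end, ridge_end, mixed)
```

At \(\alpha = 0\) the penalty equals half the ridge penalty because of the factor \(\frac{1}{2}\), so the same amount of shrinkage corresponds to a \(\lambda\) twice as large as in the ridge formula.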

Tip

A coefficient shrunk to exactly zero means the lasso component removed that feature. Compare this to the unpenalized regression from Session 3 — regularization often substantially simplifies the model.

Reference List

Aydede, Y. (2023). Machine learning toolbox for social scientists: Applied predictive analytics with R (1st ed.). Chapman & Hall/CRC.
Biecek, P. (2018). DALEX: Explainers for complex predictive models in R. Journal of Machine Learning Research, 19(84), 1–5.
Breiman, L. (1984). Classification and regression trees. Wadsworth International Group.
Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199–231. https://doi.org/10.1214/ss/1009213726
Cimentada, J. (2020). Machine learning for social scientists. https://cimentadaj.github.io/ml_socsci/index.html
De Cock, D. (2011). Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education, 19. https://doi.org/10.1080/10691898.2011.11889627
Jacobucci, R., Grimm, K. J., & Zhang, Z. (2023). Machine learning for social and behavioral research. The Guilford Press.
Loh, W.-Y., & Shih, Y.-S. (1997). Split selection methods for classification trees. Statistica Sinica, 7(4), 815–840.
Mingers, J. (1989). An empirical comparison of selection measures for decision-tree induction. Machine Learning, 3(4), 319–342. https://doi.org/10.1023/A:1022645801436
Mitchell, T. M. (1997). Machine learning. McGraw-Hill.
Steenbergen, M. (2025). Introduction to machine learning. Course in 29th Summer School in Social Sciences Methods, Università della Svizzera italiana.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
Xu, Q.-S., & Liang, Y.-Z. (2001). Monte Carlo cross validation. Chemometrics and Intelligent Laboratory Systems, 56(1), 1–11. https://doi.org/10.1016/S0169-7439(00)00122-2