1 The Idea of Machine Learning

Machine Learning is a field of Artificial Intelligence that deals with statistical algorithms to analyze data. The name of Machine Learning comes from the fact that the algorithms estimate functions from the data, rather than having a human specify the parameters directly. In a rule-based system, classical approach, a programmer encodes decisions in the program (e.g.  “if temperature is greater than 38 mark patient as sick”). In Machine Learning the parameters of the functions are obtained from the available data. For example, the slope and intercept of a regression line are computed from the data, with a given formula, but it is not included hard-coded in the program. A human experts decides the type of model to be considered, and the learning algorithm to be used; the data determine the specific values of the model.

In general, data consist of \(n\) points, each with \(p\) measured features (predictors) \(X = (X_1,\ldots,X_p)\). What the algorithm does with those features depends on the type of data and study we are interested on. In some kind of problems, called supervised learning (see Section 5), besides the feature values we also have a response \(Y\) for each observation, and the goal is to get a function that approximates the values of \(Y\) for any given values of \(X\). In other situations, known as unsupervised learning (Section 5), there is no response, and the purpose of the algorithms is to discover structure within the data \(X\).

Why fit a model to data? Two main reasons: prediction and inference (see Section 3).

Thus, the goal of Machine learning is to get models/functions that help us to better understand the data and to predict. For that, we have an algorithm that we apply to a data, and obtain a model. Some times we have extra data, or we split the data into two sets, one used to fit the model, the other one to evaluate how well the model generalises to data not used in the training. These are the terms used:

  • Training data: the observations used to fit (learn) the model.
  • Test data: new observations used to evaluate how well the model generalises.
  • Model/learner: the estimated function that can be used to analyze the data.

The figure below simultes all the process of Machine Learning. We have made up a funcion (sine of x plus x), added a little error, and fit a degree 5 polynomial model. The figure plots the data, the true relationship (by this we mean the function sin(x)+x, not the error), and the model.

set.seed(42)
x <- seq(0, 10, length.out = 100)
f_true <- sin(x) + 0.5 * x
y <- f_true + rnorm(100, sd = 0.8)
datos <- data.frame(x,y)

fit     <- lm(y ~ poly(x, 5), data = datos)
y_hat   <- predict(fit, newdata = datos)

df_pts  <- data.frame(x = x, y = y)
df_line <- data.frame(
  x     = rep(x, 2),
  value = c(f_true, y_hat),
  curve = rep(c("True", "Estimated"), each = length(x))
)

ggplot() +
  geom_point(data = df_pts, aes(x = x, y = y), colour = "grey60", size = 1.5) +
  geom_line(data = df_line, aes(x = x, y = value, colour = curve, linetype = curve),
            linewidth = 0.9) +
  scale_colour_manual(values = c("True" = "steelblue",
                                 "Estimated" = "firebrick")) +
  scale_linetype_manual(values = c("True" = "solid",
                                   "Estimated" = "dashed")) +
  labs(x = "X", y = "Y", colour = NULL, linetype = NULL) +
  theme_bw() +
  theme(legend.position = "top")
Observed data, true relationship, and estimated function

Figure 1.1: Observed data, true relationship, and estimated function

One important fact to take into account is that the data we observe comes with an error. What do we mean by that, how come we have an error if we have observed data, won’t be that part of the data? By this we mean that there is a true relationship in the data, for example, the time it takes for a stone to fall from the top of a building is given by the gravity formula, but our measures are not always precise: some times we will make small errors in the measure, or the instruments used are not always equally precise, or we forget to measure an observation, some one does not answer a question, etc. We write \(Y = f(X) + \varepsilon\), for the case of data with a response value, that is, supervised learning. In the case we do not have a response, we do not know what to expect, then, of course, there is no error as such.


2 Data Science as an Art

How do we start a data analysis problem? What shall we do, computing statistics and see where we reach, or shall we start modelling right away? Which model? How do we measure if our model is good? So many question and, nevertheless, they are all wrong questions at this stage. In a beautiful book, “The Art of Data Science” by Roger D. Peng and Elizabeth Matsui (available online free here), the authors explain that the main question is not about what to do, but rather what goals we want to reach. Means, modes and models are easy to do nowadays with so much computer power and languages available to any one, and even more so in this times of Artificial Intelligence.

So, what are the characteristics of a good question? Most important of all, the question should go to the core of the problem we want to study, it should make us think what we want to get, and what the consecuences of an answer will be. It is not good to wonder if the linear model is better than the quadratic model for a particular case of a two-variable data. It is more important to wonder whether it makes sense for the two variables to be related, and what will happen if we find a strong (or weak) relationship.

According to Peng-Matsui, the main characteristics of a good question are the following:

  • Of interest, to you and to your audience. Academics, business, or technology are very different questions, and what interests on group will not be of any importance to the other. A question has to be adapted to the audience. And most important, it should be interesting to you, it should wake the desire of answering it within you; a bored researcher is not a good researcher.
  • Not answered: a good question should try to widen knowledge, not just reproduce it. Doing what other people have done before you is a good way to learn, but becoming an expert requires to solve problems that people have not thought about, or have not been able to solve.
  • Plausible framework: it should be reasonable and likely to be true, one has to have a feeling that it can be answered. There is no point on asking questions that could be solved but lead us nowhere: Do salmon sales correlate to beans sales? Unless you suspect there could be a possible relation, or you have observe some data in that direction, it makes no sense as a good question, as hardly any one eats salmon and beans at the same time.
  • Translatable to data: You are doing Data Science, so your question has to be solved by doing Data Analysis, and one has to be sure that there is data that could be used to answer the question.
  • Concise and specific: Finally, a good question is clear, it is stated in proper terms and it is quite clear. A question of the type “Is eating health good for you?” is not precise, it is not clear what “healthy” means, or what we expect by “good”, we should be more precise.

3 Inference and Prediction

When we build models to study data there are two points of view we can take: how variables affect the results, and what we can say about unseen data. We call these two aspects inference and prediction, and there are different on the approach. For example, if we want to study the relation between unemployment and inflation, we can draw a model where the independent variable is unemployment and the dependent variable is inflation. If we are interested on inference, our study will try to see the influence of unemployment on inflation: What will be the change in inflation if unemployment increases by 1%? On the other hand, if we look for prediction, we will be interested on estimating the value of inflation for a value of unemployment: How much will be the inflation if unemployment is equal to 6%? Inference has its own issues, for example, if we have two (or more) predictors we will be interested on finding whether there is a relation between them. Doing a model with related predictors will give us the wrong influence of one of them on the dependent variable. On the other hand, such problem does not occur if our interest is in the prediction of the dependent variable.

Thus, if \(\widehat{f}\) is the function (model) that gives us the relationship between features and response, we can use it for both goals, inference and prediction the same model \(\widehat{f}\) underlies both goals, but what we do with it differs.

Inference Prediction
Question How does \(Y\) change with \(X_j\)? What is \(\widehat{Y}\) for a new \(X\)?
Focus Interpretability, significance Accuracy of \(\widehat{Y}\)
\(\widehat{f}\) needs to be… Interpretable Accurate (black-box OK)
Example Does unemployment decrease inflation? What will inflation be for a given value of unemployment?

This course focuses on prediction. You have seen inference in your previous statistics courses (confidence intervals, hypothesis testing, linear model interpretation).

For prediction, the expected prediction error for a new observation \(x_0\) decomposes as:

\[\mathbb{E}\left[(Y - \widehat{f}(x_0))^2\right] = \underbrace{\text{Var}(\varepsilon)}_{\text{irreducible}} + \underbrace{\left[\text{Bias}(\widehat{f}(x_0))\right]^2 + \text{Var}(\widehat{f}(x_0))}_{\text{reducible}}\]

We can reduce the reducible error by choosing a better model; the irreducible error \(\text{Var}(\varepsilon)\) is the part of the error we cannot change, so the error of a model is, at least, as much as \(\text{Var}(\varepsilon)\).

# Inference example: interpret coefficients
data(mtcars)
lm_fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(lm_fit)$coefficients
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 37.22727012 1.59878754 23.284689 2.565459e-20
## wt          -3.87783074 0.63273349 -6.128695 1.119647e-06
## hp          -0.03177295 0.00902971 -3.518712 1.451229e-03
# Prediction example: predict for a new car
new_car <- data.frame(wt = 3.0, hp = 120)
predict(lm_fit, newdata = new_car, interval = "prediction")
##        fit      lwr      upr
## 1 21.78102 16.38174 27.18031

The model we get is \[\widehat{\text{mpg}} = 37.22727012 -3.87783074 * \text{wt} -0.03177295 * \text{hp}\]

In the inference framing we care about the coefficient on wt and its p-value: is it true that increasing 1 unit the value of wt will decrease the value of mpg by approximately 3.9 units? In the prediction framing we care about the predicted mpg and its interval: what is the possible range of values of mpg when wt = 3 and hp = 120?


4 Accuracy vs. Interpretability

Once we have a model, we have trained it over data, we want to be able to use it for either prediction or inference: we would like to be able to estimate values for data that we do not have, or we want to say something about the influence of a predictor on the response. In the unsupervised case these two points do not occur, as we have left the model to find the structure of the data on its own, but we might still be interested on whether such structure has a meaning that throws light on our problem, or if it is possible to generalise the structure to other data. In all cases, we find a need to balance between a good use of a model, and its interpretability. For example, a quadratic model that relates unemployment rate to inflation could be more accurate (in prediction terms) than a linear model but more difficult to interpret. In this example, if \(X\) stands for unemployment and \(Y\) for inflation, the linear model will be something like \(\widehat{Y} = a_0 + a_1 X\), while the quadratic model will look like \(\widehat{Y} = b_0 + b_1 X + b_2 X^2\). But we pay a price for better prediction: what is the meaning if “unemployment squared”? In the linear model we have that, for a change of 1 unit of \(X\) (change of 1% in the unemployment rate), inflation changes by \(a_1\) units; everything is clear, the relation between the variables as well as the effect. But squaring the unemployment rate is not something that has economic sense. And the situation would be even worse if the coefficient \(b_1 = 0\), because the whole explanation of the model depends on the non-economic concept of unemployment squared.

The more flexible a model is (for example, a formula that has many parameters to adjust, instead of just a couple of parameters like the linear case), the better it could fit the data, and do better predictions (but see Section 8). But, on the other hand, more parameters make models more difficult to interpret. The following figure is taken from the ISLR book, figure 2.7.

knitr::include_graphics("flexible-interpretable.png")
Flexibility vs. interpretability.

Figure 4.1: Flexibility vs. interpretability.


5 Supervised vs. Unsupervised Learning

As we mentioned in the basic concepts of Machine Learning, data comes with a set of features (some times called predictors), and then data points could have or not a response, which is another variable that tells us what we want to study. Coming back to the unemployment-inflation case, the unemployment rate is the predictor, and the predicted variable or response is the inflation. We say that the data are labelled. The response could be numeric (as in this example) or categorical (a label of “dog” or “cat” for images).

There are other cases where none of the data values is considered as a response we will be interested on trying to find some structure within the data. For example, we could be looking at different characteristics of a car (miles per gallon, weight, power, etc) and try to see if these variables form groups, and what those groups mean.

5.1 Supervised Learning

This is the case of labelled data: each observation has both predictors \(X\) and a response \(Y\). We could consider two kinds of study:

  • Regression: \(Y\) is quantitative (e.g., house price, temperature).
  • Classification: \(Y\) is categorical (e.g., dog/cat, spam/not-spam).

The model is “supervised” because the true \(Y\) values guide (supervise) the learning.

set.seed(7)
n <- 80
dat <- rbind(
  data.frame(x1 = rnorm(n, 2, 1), x2 = rnorm(n, 2, 1), label = "A"),
  data.frame(x1 = rnorm(n, 5, 1), x2 = rnorm(n, 5, 1), label = "B")
)

ggplot(dat, aes(x = x1, y = x2, colour = label)) +
  geom_point(size = 2) +
  scale_colour_manual(values = c("A" = "steelblue", "B" = "firebrick")) +
  labs(x = "X1", y = "X2", colour = "Class") +
  theme_bw() +
  theme(legend.position = "top")
Supervised: labelled training data

Figure 5.1: Supervised: labelled training data

5.2 Unsupervised Learning

We have no response \(Y\) — only predictors \(X\). The goal is to discover structure (clusters, dimensions) in the data. Some of the methods we will see are: K-means, hierarchical clustering, linear discriminant analysis and principal component analysis (PCA).

data(iris)
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

df_iris    <- data.frame(iris, cluster = factor(km$cluster))
df_centers <- data.frame(
  Petal.Length = km$centers[, 3],
  Petal.Width  = km$centers[, 4],
  cluster      = factor(1:3)
)

ggplot(df_iris, aes(x = Petal.Length, y = Petal.Width, colour = cluster)) +
  geom_point(size = 2) +
  geom_point(data = df_centers, shape = 8, size = 4, stroke = 1.5) +
  labs(x = "Petal Length", y = "Petal Width", colour = "Cluster") +
  theme_bw() +
  theme(legend.position = "top")
Unsupervised: K-means clusters (k = 3)

Figure 5.2: Unsupervised: K-means clusters (k = 3)

Unsupervised learning has an issue: we cannot evaluate the performace of the model against some known truth (response), as there are no values of \(Y\) to compare to. Evaluate becomes hard, though not impossible. The interpretability issue becomes quite central in unsupervised learning since, as we do not have a measure of true values, at least we should be able to interpret the results in a way that makes sense.


6 The Bias–Variance Trade-off

This is one of the central ideas of supervised Machine Learning. We start considering a point \(x_0\) and then we split the mean squared error of \(\widehat{f}(x_0)\) as:

\[\text{MSE}(x_0) = \text{Bias}^2(\widehat{f}(x_0)) + \text{Var}(\widehat{f}(x_0)) + \text{Var}(\varepsilon)\]

The MSE is the (squared of the) error we would make is we model on a large number of training sets. The above formula splits into three components:

  • Bias: error from wrong assumptions in the model. A very simple model (e.g. straight line through curved data) has high bias — it easily misses the truth relation in the data, as it lacks complexity to record anything beyond simple relations (careful: a simple model does not mean it is a bad model!).
  • Variance: sensitivity to fluctuations in the training set. A very complex model (e.g. interpolating every data point) has high variance — it changes a lot across different training samples. You can check it yourself by running the code below with different seed values, or no seed at all.
  • Irreducible error: noise in \(Y\) itself; cannot be reduced.

The trade-off: decreasing bias tends to increase variance, and vice versa. The optimal model balances the two.

set.seed(1)
x <- seq(0, 2 * pi, length.out = 60)
y <- sin(x) + rnorm(60, sd = 0.4)

fit_lin  <- lm(y ~ x)
fit_med  <- lm(y ~ poly(x, 5))
fit_over <- lm(y ~ poly(x, 20))

x_seq <- seq(0, 2 * pi, length.out = 300)
df_lines <- data.frame(
  x     = rep(x_seq, 4),
  value = c(sin(x_seq),
            predict(fit_lin,  newdata = data.frame(x = x_seq)),
            predict(fit_med,  newdata = data.frame(x = x_seq)),
            predict(fit_over, newdata = data.frame(x = x_seq))),
  model = rep(c("True f", "Linear (high bias)",
                "Degree 5 (balanced)", "Degree 20 (high variance)"),
              each = length(x_seq))
)
df_lines$model <- factor(df_lines$model,
                         levels = c("True f", "Linear (high bias)",
                                    "Degree 5 (balanced)", "Degree 20 (high variance)"))

ggplot() +
  geom_point(data = data.frame(x = x, y = y),
             aes(x = x, y = y), colour = "grey60", size = 1.5) +
  geom_line(data = df_lines, aes(x = x, y = value, colour = model, linetype = model),
            linewidth = 0.9) +
  scale_colour_manual(values = c("True f"                 = "black",
                                 "Linear (high bias)"     = "steelblue",
                                 "Degree 5 (balanced)"    = "forestgreen",
                                 "Degree 20 (high variance)" = "firebrick")) +
  scale_linetype_manual(values = c("True f"                 = "solid",
                                   "Linear (high bias)"     = "dashed",
                                   "Degree 5 (balanced)"    = "dashed",
                                   "Degree 20 (high variance)" = "dashed")) +
  labs(x = "X", y = "Y", colour = NULL, linetype = NULL) +
  theme_bw() +
  theme(legend.position = "top",
        legend.text = element_text(size = 8))
Bias-variance trade-off

Figure 6.1: Bias-variance trade-off

6.1 Model Complexity and the U-Curve

As model complexity increases:

  • Training error always decreases as more complicated models can adjust better to the particularities of the data.
  • Test error follows a U-shape: first decreases (bias falls faster than variance rises), then increases (variance dominates).

We are interested on the test error, because that tells us whether our model can adjust decently to unseen data. To put it in the setting of learning a language: you can learn by heart a book on a new language, and completely understand what is written in the book, or if some one tells you a sentence very similar to the ones in the book. But what is the use of that, if you cannot ask for food in a new country? The test is to be able to move and eat talking a language you just learn, because memorizing a book is of now use, not matter how well you do it. The minimum of the test error curve is the point you are looking for: a balance between complexity and accuracy.

# Simulated illustration of train vs test error
complexity <- 1:15
df_curve <- data.frame(
  complexity = rep(complexity, 2),
  error      = c(10 * exp(-0.4 * complexity) + 0.5,
                 10 * exp(-0.4 * complexity) + 0.5 + 0.08 * (complexity - 5)^2),
  type       = rep(c("Training error", "Test error"), each = length(complexity))
)
opt <- complexity[which.min(10 * exp(-0.4 * complexity) + 0.5 + 0.08 * (complexity - 5)^2)]

ggplot(df_curve, aes(x = complexity, y = error, colour = type)) +
  geom_line(linewidth = 0.9) +
  geom_vline(xintercept = opt, linetype = "dotted", colour = "grey40") +
  scale_colour_manual(values = c("Training error" = "steelblue",
                                 "Test error"     = "firebrick")) +
  labs(x = "Model Complexity", y = "Error", colour = NULL) +
  theme_bw() +
  theme(legend.position = "top")
Typical U-shaped test error curve as a function of model complexity.

Figure 6.2: Typical U-shaped test error curve as a function of model complexity.


7 Evaluation Metrics

How do we measure how well \(\widehat{f}\) performs on test data? That depends on the type of problem. First of all, unsupervised problems do not have a measure of success, as we do not have a label, a “true” value to compare the model with data. With respect to supervised models, the evaluation of the performance depends on whether we are dealing with a regression problem (numerci response) or a classification one (categorical response). Here we give a short introduction to different evaluation metrics, and in the sections later we will explain in more detail each metric adjusted to different problems.

7.1 Regression Metrics

Let \(y_i\) be the true value and \(\widehat{y}_i\) the predicted value for observation \(i\).

  • Mean Squared Error (MSE): \[\text{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \widehat{y}_i)^2\]

  • Root Mean Squared Error (RMSE): the square root of \(\text{MSE}\) — same units as \(Y\), easier to interpret.

  • Mean Absolute Error (MAE): \[\text{MAE} = \frac{1}{n}\sum_{i=1}^n |y_i - \widehat{y}_i|\]

  • \(R^2\) (coefficient of determination): \[R^2 = 1 - \frac{\sum(y_i - \widehat{y}_i)^2}{\sum(y_i - \bar{y})^2}\] Proportion of variance explained. \(R^2 = 1\) is perfect; \(R^2 = 0\) means the model does no better than predicting the mean of \(Y\), \(\bar{Y}\).

set.seed(10)
# Simulate a train/test split on mtcars
n      <- nrow(mtcars)
idx    <- sample(1:n, size = floor(0.7 * n))
train  <- mtcars[idx, ]
test   <- mtcars[-idx, ]

fit    <- lm(mpg ~ wt + hp + cyl, data = train)
y_hat  <- predict(fit, newdata = test)
y_true <- test$mpg

MSE  <- mean((y_true - y_hat)^2)
RMSE <- sqrt(MSE)
MAE  <- mean(abs(y_true - y_hat))
SS_res <- sum((y_true - y_hat)^2)
SS_tot <- sum((y_true - mean(y_true))^2)
R2   <- 1 - SS_res / SS_tot

cat(sprintf("MSE  = %.3f\nRMSE = %.3f\nMAE  = %.3f\nR^2  = %.3f\n",
            MSE, RMSE, MAE, R2))
## MSE  = 9.802
## RMSE = 3.131
## MAE  = 2.637
## R^2  = 0.708

7.2 Classification Metrics

For a binary classifier, that is, \(Y\) takes two values that we assume, for simplicity, that are \(0\) and \(1\), we have the following metrics:

  • Confusion matrix — the foundation of all classification metrics, it checks true value versus predicted ones:
Predicted 0 Predicted 1
True 0 TN = True Negative FP = False Positive
True 1 FN = False Negative TP = True Positive

Observe that the words “Negative” and “Positive” refer to whether the prediction is \(0\) or \(1\). On the other hand, “True” and “False” stands for whether the prediction was wrong or right.

  • Accuracy: \(\frac{TP + TN}{n}\) — proportion correctly classified.

  • Precision (Positive Predictive Value): \(\frac{TP}{TP + FP}\) — of predicted positives, how many are correct?

  • Recall (Sensitivity): \(\frac{TP}{TP + FN}\) — of true positives, how many did we catch?

  • Specificity: \(\frac{TN}{TN+FP}\) - how many negative values where predicted correctly?

  • Negative Predicted Value: \(\frac{TN}{TN+FN}\) - how many predicted negatives are correct?

Conditions on predicted Conditions on actual
Positive Precision \(= \frac{TP}{TP+FP}\) Sensitivity \(= \frac{TP}{TP+FN}\)
Negative NPV \(= \frac{TN}{TN+FN}\) Specificity \(= \frac{TN}{TN+FP}\)
  • \(F_1\) score: \(2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\) — harmonic mean of precision and recall.

Accuracy can mislead when classes are imbalanced (e.g., 95% negatives: always predicting 0 gives 95% accuracy but is useless). Use precision/recall or \(F_1\) in such cases.

# Binary classification: setosa vs. rest
iris_bin <- iris
iris_bin$setosa <- ifelse(iris$Species == "setosa", 1, 0)

set.seed(5)
idx2   <- sample(1:150, 105)
tr     <- iris_bin[idx2, ]
te     <- iris_bin[-idx2, ]

log_fit <- glm(setosa ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
               data = tr, family = binomial)
probs   <- predict(log_fit, newdata = te, type = "response")
preds   <- ifelse(probs > 0.5, 1, 0)

# Confusion matrix
cm <- table(Predicted = preds, Actual = te$setosa)
print(cm)
##          Actual
## Predicted  0  1
##         0 30  0
##         1  0 15
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]

accuracy  <- (TP + TN) / sum(cm)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

cat(sprintf("\nAccuracy  = %.3f\nPrecision = %.3f\nRecall    = %.3f\nF1 Score  = %.3f\n",
            accuracy, precision, recall, f1))
## 
## Accuracy  = 1.000
## Precision = 1.000
## Recall    = 1.000
## F1 Score  = 1.000

7.3 Train/Test Split and Cross-Validation

All metrics above must be computed on held-out test data, not the training data. Training error is optimistic and will underestimate the true (generalisation) error.

A common approach for small datasets is \(k\)-fold cross-validation: split the data into \(k\) folds, train on \(k-1\) and evaluate on the remaining fold, repeat \(k\) times, and average the metric.

set.seed(99)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
cv_mse <- numeric(k)

for (fold in 1:k) {
  tr_cv   <- mtcars[folds != fold, ]
  te_cv   <- mtcars[folds == fold, ]
  fit_cv  <- lm(mpg ~ wt + hp + cyl, data = tr_cv)
  pred_cv <- predict(fit_cv, newdata = te_cv)
  cv_mse[fold] <- mean((te_cv$mpg - pred_cv)^2)
}

df_cv <- data.frame(fold = paste0("Fold ", 1:k), mse = cv_mse)

ggplot(df_cv, aes(x = fold, y = mse)) +
  geom_col(fill = "steelblue") +
  geom_hline(yintercept = mean(cv_mse), colour = "firebrick",
             linetype = "dashed", linewidth = 0.9) +
  annotate("text", x = 4.5, y = mean(cv_mse) + 0.5,
           label = sprintf("Mean CV MSE = %.2f", mean(cv_mse)),
           colour = "firebrick", size = 3.5) +
  labs(title = "5-Fold CV MSE", x = NULL, y = "MSE") +
  theme_bw()
5-fold cross-validation MSE for a linear model on mtcars.

Figure 7.1: 5-fold cross-validation MSE for a linear model on mtcars.


8 Overfitting and Underfitting

As mentioned above, a very flexible model will adjust very well to the training data. However, this has the problem of not being able to perform well on new, unseen data. This over adjustment to training data is called overfitting. For example, if we fit an \(n-1\) degree polynomial to a data set with \(n\) points we will get a perfect fit, but it will not work for a new data point that does not lie in the graph of the polynomial.

set.seed(1)

x <- 1:4
y <- x^3 - x^2 - x + 1
df <- data.frame(x = x, y = y)

# Fine grid for the smooth curve
x_seq <- seq(1, 4, length.out = 200)
df_curve <- data.frame(x = x_seq, y = x_seq^3 - x_seq^2 - x_seq + 1)

ggplot() +
  geom_line(data = df_curve, aes(x = x, y = y), colour = "steelblue", linewidth = 1) +
  geom_point(data = df, aes(x = x, y = y), colour = "grey40", size = 3) +
  geom_point(data = data.frame(x = 2.5, y = 20), aes(x = x, y = y), colour = "blue", size = 3) +
  labs(x = "X", y = "Y") +
  theme_bw()
Overfitting with a degree 3 polynomial

Figure 8.1: Overfitting with a degree 3 polynomial

On the other hand, if the model has very few parameters, for example, a line that has only the \(y\)-intercept and the slope, we might not be able to capture the intricacies of a complex data set, we are in a situation of underfitting.


9 Summary

Concept Key idea
Machine learning Estimate \(f\) in \(Y = f(X) + \varepsilon\) or the structure of \(X\) from data
Inference Understand how \(X\) affects \(Y\)
Prediction Accurate \(\widehat{Y}\) for new \(X\)
Supervised Labelled data: learn \(X \to Y\)
Unsupervised No labels: find structure in \(X\)
Bias–variance Complexity ↑: bias ↓, variance ↑
Evaluation MSE/RMSE/R² (regression); accuracy/F₁ (classification)

10 References

  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2023). An Introduction to Statistical Learning with Applications in R (2nd ed.). Springer. Chapters 1–2.