Machine Learning is a field of Artificial Intelligence that deals with statistical algorithms to analyze data. The name of Machine Learning comes from the fact that the algorithms estimate functions from the data, rather than having a human specify the parameters directly. In a rule-based system, classical approach, a programmer encodes decisions in the program (e.g. “if temperature is greater than 38 mark patient as sick”). In Machine Learning the parameters of the functions are obtained from the available data. For example, the slope and intercept of a regression line are computed from the data, with a given formula, but it is not included hard-coded in the program. A human experts decides the type of model to be considered, and the learning algorithm to be used; the data determine the specific values of the model.
In general, data consist of \(n\) points, each with \(p\) measured features (predictors) \(X = (X_1,\ldots,X_p)\). What the algorithm does with those features depends on the type of data and study we are interested on. In some kind of problems, called supervised learning (see Section 5), besides the feature values we also have a response \(Y\) for each observation, and the goal is to get a function that approximates the values of \(Y\) for any given values of \(X\). In other situations, known as unsupervised learning (Section 5), there is no response, and the purpose of the algorithms is to discover structure within the data \(X\).
Why fit a model to data? Two main reasons: prediction and inference (see Section 3).
Thus, the goal of Machine learning is to get models/functions that help us to better understand the data and to predict. For that, we have an algorithm that we apply to a data, and obtain a model. Some times we have extra data, or we split the data into two sets, one used to fit the model, the other one to evaluate how well the model generalises to data not used in the training. These are the terms used:
The figure below simultes all the process of Machine Learning. We have made up a funcion (sine of x plus x), added a little error, and fit a degree 5 polynomial model. The figure plots the data, the true relationship (by this we mean the function sin(x)+x, not the error), and the model.
set.seed(42)
x <- seq(0, 10, length.out = 100)
f_true <- sin(x) + 0.5 * x
y <- f_true + rnorm(100, sd = 0.8)
datos <- data.frame(x,y)
fit <- lm(y ~ poly(x, 5), data = datos)
y_hat <- predict(fit, newdata = datos)
df_pts <- data.frame(x = x, y = y)
df_line <- data.frame(
x = rep(x, 2),
value = c(f_true, y_hat),
curve = rep(c("True", "Estimated"), each = length(x))
)
ggplot() +
geom_point(data = df_pts, aes(x = x, y = y), colour = "grey60", size = 1.5) +
geom_line(data = df_line, aes(x = x, y = value, colour = curve, linetype = curve),
linewidth = 0.9) +
scale_colour_manual(values = c("True" = "steelblue",
"Estimated" = "firebrick")) +
scale_linetype_manual(values = c("True" = "solid",
"Estimated" = "dashed")) +
labs(x = "X", y = "Y", colour = NULL, linetype = NULL) +
theme_bw() +
theme(legend.position = "top")Figure 1.1: Observed data, true relationship, and estimated function
One important fact to take into account is that the data we observe comes with an error. What do we mean by that, how come we have an error if we have observed data, won’t be that part of the data? By this we mean that there is a true relationship in the data, for example, the time it takes for a stone to fall from the top of a building is given by the gravity formula, but our measures are not always precise: some times we will make small errors in the measure, or the instruments used are not always equally precise, or we forget to measure an observation, some one does not answer a question, etc. We write \(Y = f(X) + \varepsilon\), for the case of data with a response value, that is, supervised learning. In the case we do not have a response, we do not know what to expect, then, of course, there is no error as such.
How do we start a data analysis problem? What shall we do, computing statistics and see where we reach, or shall we start modelling right away? Which model? How do we measure if our model is good? So many question and, nevertheless, they are all wrong questions at this stage. In a beautiful book, “The Art of Data Science” by Roger D. Peng and Elizabeth Matsui (available online free here), the authors explain that the main question is not about what to do, but rather what goals we want to reach. Means, modes and models are easy to do nowadays with so much computer power and languages available to any one, and even more so in this times of Artificial Intelligence.
So, what are the characteristics of a good question? Most important of all, the question should go to the core of the problem we want to study, it should make us think what we want to get, and what the consecuences of an answer will be. It is not good to wonder if the linear model is better than the quadratic model for a particular case of a two-variable data. It is more important to wonder whether it makes sense for the two variables to be related, and what will happen if we find a strong (or weak) relationship.
According to Peng-Matsui, the main characteristics of a good question are the following:
When we build models to study data there are two points of view we can take: how variables affect the results, and what we can say about unseen data. We call these two aspects inference and prediction, and there are different on the approach. For example, if we want to study the relation between unemployment and inflation, we can draw a model where the independent variable is unemployment and the dependent variable is inflation. If we are interested on inference, our study will try to see the influence of unemployment on inflation: What will be the change in inflation if unemployment increases by 1%? On the other hand, if we look for prediction, we will be interested on estimating the value of inflation for a value of unemployment: How much will be the inflation if unemployment is equal to 6%? Inference has its own issues, for example, if we have two (or more) predictors we will be interested on finding whether there is a relation between them. Doing a model with related predictors will give us the wrong influence of one of them on the dependent variable. On the other hand, such problem does not occur if our interest is in the prediction of the dependent variable.
Thus, if \(\widehat{f}\) is the function (model) that gives us the relationship between features and response, we can use it for both goals, inference and prediction the same model \(\widehat{f}\) underlies both goals, but what we do with it differs.
| Inference | Prediction | |
|---|---|---|
| Question | How does \(Y\) change with \(X_j\)? | What is \(\widehat{Y}\) for a new \(X\)? |
| Focus | Interpretability, significance | Accuracy of \(\widehat{Y}\) |
| \(\widehat{f}\) needs to be… | Interpretable | Accurate (black-box OK) |
| Example | Does unemployment decrease inflation? | What will inflation be for a given value of unemployment? |
This course focuses on prediction. You have seen inference in your previous statistics courses (confidence intervals, hypothesis testing, linear model interpretation).
For prediction, the expected prediction error for a new observation \(x_0\) decomposes as:
\[\mathbb{E}\left[(Y - \widehat{f}(x_0))^2\right] = \underbrace{\text{Var}(\varepsilon)}_{\text{irreducible}} + \underbrace{\left[\text{Bias}(\widehat{f}(x_0))\right]^2 + \text{Var}(\widehat{f}(x_0))}_{\text{reducible}}\]
We can reduce the reducible error by choosing a better model; the irreducible error \(\text{Var}(\varepsilon)\) is the part of the error we cannot change, so the error of a model is, at least, as much as \(\text{Var}(\varepsilon)\).
# Inference example: interpret coefficients
data(mtcars)
lm_fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(lm_fit)$coefficients## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.22727012 1.59878754 23.284689 2.565459e-20
## wt -3.87783074 0.63273349 -6.128695 1.119647e-06
## hp -0.03177295 0.00902971 -3.518712 1.451229e-03
# Prediction example: predict for a new car
new_car <- data.frame(wt = 3.0, hp = 120)
predict(lm_fit, newdata = new_car, interval = "prediction")## fit lwr upr
## 1 21.78102 16.38174 27.18031
The model we get is \[\widehat{\text{mpg}} = 37.22727012 -3.87783074 * \text{wt} -0.03177295 * \text{hp}\]
In the inference framing we care about the coefficient on wt and its p-value: is it true that
increasing 1 unit the value of wt will decrease the value of mpg by approximately 3.9 units?
In the prediction framing we care about the predicted mpg and its interval: what is the
possible range of values of mpg when wt = 3 and hp = 120?
Once we have a model, we have trained it over data, we want to be able to use it for either prediction or inference: we would like to be able to estimate values for data that we do not have, or we want to say something about the influence of a predictor on the response. In the unsupervised case these two points do not occur, as we have left the model to find the structure of the data on its own, but we might still be interested on whether such structure has a meaning that throws light on our problem, or if it is possible to generalise the structure to other data. In all cases, we find a need to balance between a good use of a model, and its interpretability. For example, a quadratic model that relates unemployment rate to inflation could be more accurate (in prediction terms) than a linear model but more difficult to interpret. In this example, if \(X\) stands for unemployment and \(Y\) for inflation, the linear model will be something like \(\widehat{Y} = a_0 + a_1 X\), while the quadratic model will look like \(\widehat{Y} = b_0 + b_1 X + b_2 X^2\). But we pay a price for better prediction: what is the meaning if “unemployment squared”? In the linear model we have that, for a change of 1 unit of \(X\) (change of 1% in the unemployment rate), inflation changes by \(a_1\) units; everything is clear, the relation between the variables as well as the effect. But squaring the unemployment rate is not something that has economic sense. And the situation would be even worse if the coefficient \(b_1 = 0\), because the whole explanation of the model depends on the non-economic concept of unemployment squared.
The more flexible a model is (for example, a formula that has many parameters to adjust, instead of just a couple of parameters like the linear case), the better it could fit the data, and do better predictions (but see Section 8). But, on the other hand, more parameters make models more difficult to interpret. The following figure is taken from the ISLR book, figure 2.7.
Figure 4.1: Flexibility vs. interpretability.
As we mentioned in the basic concepts of Machine Learning, data comes with a set of features (some times called predictors), and then data points could have or not a response, which is another variable that tells us what we want to study. Coming back to the unemployment-inflation case, the unemployment rate is the predictor, and the predicted variable or response is the inflation. We say that the data are labelled. The response could be numeric (as in this example) or categorical (a label of “dog” or “cat” for images).
There are other cases where none of the data values is considered as a response we will be interested on trying to find some structure within the data. For example, we could be looking at different characteristics of a car (miles per gallon, weight, power, etc) and try to see if these variables form groups, and what those groups mean.
This is the case of labelled data: each observation has both predictors \(X\) and a response \(Y\). We could consider two kinds of study:
The model is “supervised” because the true \(Y\) values guide (supervise) the learning.
set.seed(7)
n <- 80
dat <- rbind(
data.frame(x1 = rnorm(n, 2, 1), x2 = rnorm(n, 2, 1), label = "A"),
data.frame(x1 = rnorm(n, 5, 1), x2 = rnorm(n, 5, 1), label = "B")
)
ggplot(dat, aes(x = x1, y = x2, colour = label)) +
geom_point(size = 2) +
scale_colour_manual(values = c("A" = "steelblue", "B" = "firebrick")) +
labs(x = "X1", y = "X2", colour = "Class") +
theme_bw() +
theme(legend.position = "top")Figure 5.1: Supervised: labelled training data
We have no response \(Y\) — only predictors \(X\). The goal is to discover structure (clusters, dimensions) in the data. Some of the methods we will see are: K-means, hierarchical clustering, linear discriminant analysis and principal component analysis (PCA).
data(iris)
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
df_iris <- data.frame(iris, cluster = factor(km$cluster))
df_centers <- data.frame(
Petal.Length = km$centers[, 3],
Petal.Width = km$centers[, 4],
cluster = factor(1:3)
)
ggplot(df_iris, aes(x = Petal.Length, y = Petal.Width, colour = cluster)) +
geom_point(size = 2) +
geom_point(data = df_centers, shape = 8, size = 4, stroke = 1.5) +
labs(x = "Petal Length", y = "Petal Width", colour = "Cluster") +
theme_bw() +
theme(legend.position = "top")Figure 5.2: Unsupervised: K-means clusters (k = 3)
Unsupervised learning has an issue: we cannot evaluate the performace of the model against some known truth (response), as there are no values of \(Y\) to compare to. Evaluate becomes hard, though not impossible. The interpretability issue becomes quite central in unsupervised learning since, as we do not have a measure of true values, at least we should be able to interpret the results in a way that makes sense.
This is one of the central ideas of supervised Machine Learning. We start considering a point \(x_0\) and then we split the mean squared error of \(\widehat{f}(x_0)\) as:
\[\text{MSE}(x_0) = \text{Bias}^2(\widehat{f}(x_0)) + \text{Var}(\widehat{f}(x_0)) + \text{Var}(\varepsilon)\]
The MSE is the (squared of the) error we would make is we model on a large number of training sets. The above formula splits into three components:
The trade-off: decreasing bias tends to increase variance, and vice versa. The optimal model balances the two.
set.seed(1)
x <- seq(0, 2 * pi, length.out = 60)
y <- sin(x) + rnorm(60, sd = 0.4)
fit_lin <- lm(y ~ x)
fit_med <- lm(y ~ poly(x, 5))
fit_over <- lm(y ~ poly(x, 20))
x_seq <- seq(0, 2 * pi, length.out = 300)
df_lines <- data.frame(
x = rep(x_seq, 4),
value = c(sin(x_seq),
predict(fit_lin, newdata = data.frame(x = x_seq)),
predict(fit_med, newdata = data.frame(x = x_seq)),
predict(fit_over, newdata = data.frame(x = x_seq))),
model = rep(c("True f", "Linear (high bias)",
"Degree 5 (balanced)", "Degree 20 (high variance)"),
each = length(x_seq))
)
df_lines$model <- factor(df_lines$model,
levels = c("True f", "Linear (high bias)",
"Degree 5 (balanced)", "Degree 20 (high variance)"))
ggplot() +
geom_point(data = data.frame(x = x, y = y),
aes(x = x, y = y), colour = "grey60", size = 1.5) +
geom_line(data = df_lines, aes(x = x, y = value, colour = model, linetype = model),
linewidth = 0.9) +
scale_colour_manual(values = c("True f" = "black",
"Linear (high bias)" = "steelblue",
"Degree 5 (balanced)" = "forestgreen",
"Degree 20 (high variance)" = "firebrick")) +
scale_linetype_manual(values = c("True f" = "solid",
"Linear (high bias)" = "dashed",
"Degree 5 (balanced)" = "dashed",
"Degree 20 (high variance)" = "dashed")) +
labs(x = "X", y = "Y", colour = NULL, linetype = NULL) +
theme_bw() +
theme(legend.position = "top",
legend.text = element_text(size = 8))Figure 6.1: Bias-variance trade-off
As model complexity increases:
We are interested on the test error, because that tells us whether our model can adjust decently to unseen data. To put it in the setting of learning a language: you can learn by heart a book on a new language, and completely understand what is written in the book, or if some one tells you a sentence very similar to the ones in the book. But what is the use of that, if you cannot ask for food in a new country? The test is to be able to move and eat talking a language you just learn, because memorizing a book is of now use, not matter how well you do it. The minimum of the test error curve is the point you are looking for: a balance between complexity and accuracy.
# Simulated illustration of train vs test error
complexity <- 1:15
df_curve <- data.frame(
complexity = rep(complexity, 2),
error = c(10 * exp(-0.4 * complexity) + 0.5,
10 * exp(-0.4 * complexity) + 0.5 + 0.08 * (complexity - 5)^2),
type = rep(c("Training error", "Test error"), each = length(complexity))
)
opt <- complexity[which.min(10 * exp(-0.4 * complexity) + 0.5 + 0.08 * (complexity - 5)^2)]
ggplot(df_curve, aes(x = complexity, y = error, colour = type)) +
geom_line(linewidth = 0.9) +
geom_vline(xintercept = opt, linetype = "dotted", colour = "grey40") +
scale_colour_manual(values = c("Training error" = "steelblue",
"Test error" = "firebrick")) +
labs(x = "Model Complexity", y = "Error", colour = NULL) +
theme_bw() +
theme(legend.position = "top")Figure 6.2: Typical U-shaped test error curve as a function of model complexity.
How do we measure how well \(\widehat{f}\) performs on test data? That depends on the type of problem. First of all, unsupervised problems do not have a measure of success, as we do not have a label, a “true” value to compare the model with data. With respect to supervised models, the evaluation of the performance depends on whether we are dealing with a regression problem (numerci response) or a classification one (categorical response). Here we give a short introduction to different evaluation metrics, and in the sections later we will explain in more detail each metric adjusted to different problems.
Let \(y_i\) be the true value and \(\widehat{y}_i\) the predicted value for observation \(i\).
Mean Squared Error (MSE): \[\text{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \widehat{y}_i)^2\]
Root Mean Squared Error (RMSE): the square root of \(\text{MSE}\) — same units as \(Y\), easier to interpret.
Mean Absolute Error (MAE): \[\text{MAE} = \frac{1}{n}\sum_{i=1}^n |y_i - \widehat{y}_i|\]
\(R^2\) (coefficient of determination): \[R^2 = 1 - \frac{\sum(y_i - \widehat{y}_i)^2}{\sum(y_i - \bar{y})^2}\] Proportion of variance explained. \(R^2 = 1\) is perfect; \(R^2 = 0\) means the model does no better than predicting the mean of \(Y\), \(\bar{Y}\).
set.seed(10)
# Simulate a train/test split on mtcars
n <- nrow(mtcars)
idx <- sample(1:n, size = floor(0.7 * n))
train <- mtcars[idx, ]
test <- mtcars[-idx, ]
fit <- lm(mpg ~ wt + hp + cyl, data = train)
y_hat <- predict(fit, newdata = test)
y_true <- test$mpg
MSE <- mean((y_true - y_hat)^2)
RMSE <- sqrt(MSE)
MAE <- mean(abs(y_true - y_hat))
SS_res <- sum((y_true - y_hat)^2)
SS_tot <- sum((y_true - mean(y_true))^2)
R2 <- 1 - SS_res / SS_tot
cat(sprintf("MSE = %.3f\nRMSE = %.3f\nMAE = %.3f\nR^2 = %.3f\n",
MSE, RMSE, MAE, R2))## MSE = 9.802
## RMSE = 3.131
## MAE = 2.637
## R^2 = 0.708
For a binary classifier, that is, \(Y\) takes two values that we assume, for simplicity, that are \(0\) and \(1\), we have the following metrics:
| Predicted 0 | Predicted 1 | |
|---|---|---|
| True 0 | TN = True Negative | FP = False Positive |
| True 1 | FN = False Negative | TP = True Positive |
Observe that the words “Negative” and “Positive” refer to whether the prediction is \(0\) or \(1\). On the other hand, “True” and “False” stands for whether the prediction was wrong or right.
Accuracy: \(\frac{TP + TN}{n}\) — proportion correctly classified.
Precision (Positive Predictive Value): \(\frac{TP}{TP + FP}\) — of predicted positives, how many are correct?
Recall (Sensitivity): \(\frac{TP}{TP + FN}\) — of true positives, how many did we catch?
Specificity: \(\frac{TN}{TN+FP}\) - how many negative values where predicted correctly?
Negative Predicted Value: \(\frac{TN}{TN+FN}\) - how many predicted negatives are correct?
| Conditions on predicted | Conditions on actual | |
|---|---|---|
| Positive | Precision \(= \frac{TP}{TP+FP}\) | Sensitivity \(= \frac{TP}{TP+FN}\) |
| Negative | NPV \(= \frac{TN}{TN+FN}\) | Specificity \(= \frac{TN}{TN+FP}\) |
Accuracy can mislead when classes are imbalanced (e.g., 95% negatives: always predicting 0 gives 95% accuracy but is useless). Use precision/recall or \(F_1\) in such cases.
# Binary classification: setosa vs. rest
iris_bin <- iris
iris_bin$setosa <- ifelse(iris$Species == "setosa", 1, 0)
set.seed(5)
idx2 <- sample(1:150, 105)
tr <- iris_bin[idx2, ]
te <- iris_bin[-idx2, ]
log_fit <- glm(setosa ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data = tr, family = binomial)
probs <- predict(log_fit, newdata = te, type = "response")
preds <- ifelse(probs > 0.5, 1, 0)
# Confusion matrix
cm <- table(Predicted = preds, Actual = te$setosa)
print(cm)## Actual
## Predicted 0 1
## 0 30 0
## 1 0 15
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]
accuracy <- (TP + TN) / sum(cm)
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
f1 <- 2 * precision * recall / (precision + recall)
cat(sprintf("\nAccuracy = %.3f\nPrecision = %.3f\nRecall = %.3f\nF1 Score = %.3f\n",
accuracy, precision, recall, f1))##
## Accuracy = 1.000
## Precision = 1.000
## Recall = 1.000
## F1 Score = 1.000
All metrics above must be computed on held-out test data, not the training data. Training error is optimistic and will underestimate the true (generalisation) error.
A common approach for small datasets is \(k\)-fold cross-validation: split the data into \(k\) folds, train on \(k-1\) and evaluate on the remaining fold, repeat \(k\) times, and average the metric.
set.seed(99)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
cv_mse <- numeric(k)
for (fold in 1:k) {
tr_cv <- mtcars[folds != fold, ]
te_cv <- mtcars[folds == fold, ]
fit_cv <- lm(mpg ~ wt + hp + cyl, data = tr_cv)
pred_cv <- predict(fit_cv, newdata = te_cv)
cv_mse[fold] <- mean((te_cv$mpg - pred_cv)^2)
}
df_cv <- data.frame(fold = paste0("Fold ", 1:k), mse = cv_mse)
ggplot(df_cv, aes(x = fold, y = mse)) +
geom_col(fill = "steelblue") +
geom_hline(yintercept = mean(cv_mse), colour = "firebrick",
linetype = "dashed", linewidth = 0.9) +
annotate("text", x = 4.5, y = mean(cv_mse) + 0.5,
label = sprintf("Mean CV MSE = %.2f", mean(cv_mse)),
colour = "firebrick", size = 3.5) +
labs(title = "5-Fold CV MSE", x = NULL, y = "MSE") +
theme_bw()Figure 7.1: 5-fold cross-validation MSE for a linear model on mtcars.
As mentioned above, a very flexible model will adjust very well to the training data. However, this has the problem of not being able to perform well on new, unseen data. This over adjustment to training data is called overfitting. For example, if we fit an \(n-1\) degree polynomial to a data set with \(n\) points we will get a perfect fit, but it will not work for a new data point that does not lie in the graph of the polynomial.
set.seed(1)
x <- 1:4
y <- x^3 - x^2 - x + 1
df <- data.frame(x = x, y = y)
# Fine grid for the smooth curve
x_seq <- seq(1, 4, length.out = 200)
df_curve <- data.frame(x = x_seq, y = x_seq^3 - x_seq^2 - x_seq + 1)
ggplot() +
geom_line(data = df_curve, aes(x = x, y = y), colour = "steelblue", linewidth = 1) +
geom_point(data = df, aes(x = x, y = y), colour = "grey40", size = 3) +
geom_point(data = data.frame(x = 2.5, y = 20), aes(x = x, y = y), colour = "blue", size = 3) +
labs(x = "X", y = "Y") +
theme_bw()Figure 8.1: Overfitting with a degree 3 polynomial
On the other hand, if the model has very few parameters, for example, a line that has only the \(y\)-intercept and the slope, we might not be able to capture the intricacies of a complex data set, we are in a situation of underfitting.
| Concept | Key idea |
|---|---|
| Machine learning | Estimate \(f\) in \(Y = f(X) + \varepsilon\) or the structure of \(X\) from data |
| Inference | Understand how \(X\) affects \(Y\) |
| Prediction | Accurate \(\widehat{Y}\) for new \(X\) |
| Supervised | Labelled data: learn \(X \to Y\) |
| Unsupervised | No labels: find structure in \(X\) |
| Bias–variance | Complexity ↑: bias ↓, variance ↑ |
| Evaluation | MSE/RMSE/R² (regression); accuracy/F₁ (classification) |