Understand the error metrics used for regression and the pros/cons of each.
We start by loading all the libraries needed.
if (!require(pacman)) {
  install.packages("pacman")
  library(pacman)
}
pacman::p_load(ggplot2, bbplot, latex2exp, lattice, caret, tikzDevice, nortest, DescTools,
               ggpubr, BiocManager, ggrepel, magrittr, reshape2, ggthemes)
## Loading required package: pacman
Let’s review the formulas of the main regression error metrics:
\[ ME = \frac{1}{N}\sum_{i}(y_{i}-\hat{y}_{i}) \]
\[ MSE = \frac{1}{N}\sum_{i}(y_{i}-\hat{y}_{i})^{2} \] \[ RMSE = \sqrt{\frac{1}{N}\sum_{i}(y_{i}-\hat{y}_{i})^{2}} \] \[ MAE = \frac{1}{N}\sum_{i}|y_{i}-\hat{y}_{i}|\] \[ MRE = \frac{1}{N}\sum_{i}\left|\frac{y_{i}-\hat{y}_{i}}{y_{i}}\right|\]
\[ R^{2} = 1-\frac{\frac{1}{N}\sum_{i}(y_{i}-\hat{y}_{i})^2}{\frac{1}{N}\sum_{i}(y_{i}-\bar{y})^{2}}\] \[ R_{adj}^{2} = 1-\left[ \frac{(1-R^{2})(N-1)}{(N-k-1)} \right ] \] where \(N\) is the total number of observations, \(k\) is the number of predictors, \(\hat{y}_{i}\) are the predictions of our model for each observation \(i\), and \(\bar{y}\) is the mean of the observed values.
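For reference, these formulas translate almost directly into R. Below is a minimal sketch, assuming two vectors y (observations) and y_hat (predictions) plus the number of predictors k; all names are chosen here purely for illustration.

regression_metrics <- function(y, y_hat, k) {
  # y: observed values, y_hat: model predictions, k: number of predictors (illustrative helper)
  N     <- length(y)
  error <- y - y_hat
  ME    <- mean(error)
  MSE   <- mean(error^2)
  RMSE  <- sqrt(MSE)
  MAE   <- mean(abs(error))
  MRE   <- mean(abs(error / y))
  R2    <- 1 - MSE / mean((y - mean(y))^2)
  R2adj <- 1 - (1 - R2) * (N - 1) / (N - k - 1)
  c(ME = ME, MSE = MSE, RMSE = RMSE, MAE = MAE, MRE = MRE, R2 = R2, R2adj = R2adj)
}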
The residuals are the differences between the real values and the predicted ones. Therefore, we have:
\[ Residuals\equiv(y_{i}-\hat{y}_{i})\equiv error_{i}\]
In order to understand the limitations of each error metric, it’s useful to write them again as functions of our errors:
\[ ME = \frac{1}{N}\sum_{i}(error_{i}) \]
\[ MSE = \frac{1}{N}\sum_{i}(error_{i})^{2} \] \[ RMSE = \sqrt{\frac{1}{N}\sum_{i}(error_{i})^{2}} \] \[ MAE = \frac{1}{N}\sum_{i}|error_{i}|\] \[ MRE = \frac{1}{N}\sum_{i}\left|\frac{error_{i}}{y_{i}}\right|\]
\[ R^{2} = 1-\frac{\frac{1}{N}\sum_{i}(error_{i})^2}{\frac{1}{N}\sum_{i}(y_{i}-\bar{y})^{2}}\]
If, in the last equation, we consider \(\bar{y}\) as the prediction of a “model”, we can clearly see that \(R^{2}\) compares how well our model performs against another model that predicts every point to be the mean \(\bar{y}\). Thus, if the variance of our model’s errors is much smaller than the variance of the data, \(R^{2}\) will be close to one; if the two are comparable, \(R^{2}\) will be close to zero.
This can be clearly seen in the figure for one set of fake sample data, where a linear model (shown in green) was fit to the data together with the mean of the \(y\) values (shown in red).
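The figure can be reproduced along the following lines. This is only a sketch with made-up data, not necessarily the code behind the original figure.

# Fake data, a linear fit (green) and the mean of y (red), purely for illustration
set.seed(123)
x_fake <- runif(50, 0, 10)
y_fake <- 2 * x_fake + rnorm(50, 0, 2)
ggplot(data.frame(x = x_fake, y = y_fake), aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "green") +
  geom_hline(yintercept = mean(y_fake), size = 1, colour = "red") +
  theme_economist()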
A small transformation of the numerator in the last equation shows that \(R^{2}\) and \(RMSE\) are closely related:
\[ R^{2} = 1-\frac{\frac{1}{N}\sum_{i}(error_{i})^2}{\frac{1}{N}\sum_{i}(y_{i}-\bar{y})^{2}}=1-\frac{\left(\sqrt{\frac{1}{N}\sum_{i}(error_{i})^2}\right)^{2}}{\frac{1}{N}\sum_{i}(y_{i}-\bar{y})^{2}} = 1-\frac{RMSE^{2}}{\frac{1}{N}\sum_{i}(y_{i}-\bar{y})^{2}}=1-\frac{RMSE^{2}}{\sigma_{y}^{2}}=1-\left(\frac{RMSE}{\sigma_{y}}\right)^{2}\] where the denominator is a fixed term, as it doesn’t contain predictions from any model: it is simply the variance of the \(y\) values. Thus, \(R^{2}\) grows linearly as \(RMSE^{2}\) decreases. In fact, if we make the replacements \(y=R^{2}\) and \(x=RMSE^{2}\), we obtain a linear relationship, \(y=1-\beta x\), where \(\beta\) is the inverse of the variance of the \(y\) values.
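We can check this identity numerically on toy data (all names and numbers below are illustrative). Note that \(\sigma_{y}^{2}\) here is the population variance, i.e. divided by \(N\) rather than \(N-1\).

# Numerical check of R^2 = 1 - (RMSE / sigma_y)^2 on made-up observations and predictions
set.seed(1)
y_obs    <- rnorm(100, 5, 2)
y_pred   <- y_obs + rnorm(100, 0, 1)             # a "model" with some error
rmse_toy <- sqrt(mean((y_obs - y_pred)^2))
sigma_y  <- sqrt(mean((y_obs - mean(y_obs))^2))  # population sd (divide by N)
r2_def   <- 1 - mean((y_obs - y_pred)^2) / mean((y_obs - mean(y_obs))^2)
all.equal(r2_def, 1 - (rmse_toy / sigma_y)^2)    # should be TRUE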
From the last equation, we can have three possible scenarios depending on the relationship between the \(RMSE\) and \(\sigma_{y}\):
In the first case, where \(RMSE < \sigma_{y}\), \(R^{2}\) falls within the \([0,1]\) range, as shown in the figure (values in green).
In the second case, where \(RMSE = 0\) (a perfect fit), we get exactly \(R^{2}=1\), as shown in the figure (dot shown in green).
In the last case, where \(RMSE > \sigma_{y}\) (the model performs worse than simply predicting the mean), \(R^{2}\) becomes negative, as shown in the figure (values shown in green).
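These three regimes follow directly from \(R^{2}=1-(RMSE/\sigma_{y})^{2}\). A minimal sketch of that curve (not the original figure) makes them easy to see:

# R^2 as a function of RMSE / sigma_y: R^2 = 1 at RMSE = 0, R^2 = 0 at RMSE = sigma_y,
# and R^2 turns negative once RMSE exceeds sigma_y
ratio <- seq(0, 2, length.out = 200)
ggplot(data.frame(ratio = ratio, r2 = 1 - ratio^2), aes(x = ratio, y = r2)) +
  geom_line(colour = "green", size = 1) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab(TeX('$RMSE / \\sigma_{y}$')) +
  ylab(TeX('$R^2$')) +
  theme_economist()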
Moreover, \(R^{2}\) is not a good error metric for comparing models, because the more features you add to your model, the larger \(R^{2}\) becomes. Therefore, if you want to know to what extent an additional feature is providing useful information to your model, the metric to use is \(R_{adj}^{2}\).
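A quick illustration of this point with made-up data: adding a pure-noise predictor to a linear model typically nudges \(R^{2}\) up slightly, while \(R_{adj}^{2}\) does not improve.

# Compare R^2 and adjusted R^2 before and after adding a useless (pure noise) predictor
set.seed(42)
df <- data.frame(x = runif(100), noise = rnorm(100))
df$y <- 3 * df$x + rnorm(100, 0, 0.5)
fit1 <- lm(y ~ x, data = df)
fit2 <- lm(y ~ x + noise, data = df)
c(R2_fit1 = summary(fit1)$r.squared, R2_fit2 = summary(fit2)$r.squared,
  R2adj_fit1 = summary(fit1)$adj.r.squared, R2adj_fit2 = summary(fit2)$adj.r.squared)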
We can expect that any model \(\hat{y}\) will be able to describe most of our data, while something is left behind, leading to some error \(\epsilon\). Therefore, we can always write:
\[y=\hat{y}+\epsilon\]
where we expect \(\epsilon\) to be as small as possible. One desirable condition that we expect from our model is that the errors \(\epsilon\) are normally distributed with some sample deviation, namely \(\epsilon \sim N(0,\hat{\sigma})\). In other words, we expect that our model will not make more overestimations than underestimations or the other way around. This is a strong condition, and it will not always be met. Failure to achieve it tells us that, for some reason, our model is not capturing relevant information from our data, although we don’t know what information is missing. If our dataset is big enough, we can carry out a formal statistical test to check the hypothesis that the errors are normally distributed around zero.
Let’s generate fake sample data \(X=\{x_{1},\dots,x_{N}\}\), normally distributed, assuming that they represent the errors of our linear model.
set.seed(375)
errors <- rnorm(500, 0, 1)    # fake errors drawn from N(0, 1)
errors2 <- errors * errors    # squared errors
abserrors <- abs(errors)      # absolute errors
k <- data.frame(errors, abserrors, errors2)  # data frame used by the plots below
Now let’s look at the means, medians and modes, as this is very instructive:
mean_errors <- mean(errors)
median_errors <- median(errors)
mode_errors <- Mode(errors)
mean_errors2 <- mean(errors2)
median_errors2 <- median(errors2)
mode_errors2 <- Mode(errors2)
rmse <- sqrt(mean(errors2))
mean_abserrors <- mean(abs(errors))
median_abserrors <- median(abs(errors))
mode_abserrors <- Mode(abs(errors))
However, instead of finishing our work once we have our model’s error metrics, one of the most important steps in order to gain insight about our model is to plot the error distribution and check visually whether it follows a normal distribution.
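The code behind that figure is not shown above; a sketch that performs a similar visual check, overlaying the normal density the errors should follow if the assumption holds, could be:

# Density of the errors (light blue) against a normal density with the same mean and sd (red)
ggplot(k, aes(x = errors)) +
  geom_density(fill = "lightblue") +
  stat_function(fun = dnorm, args = list(mean = mean(errors), sd = sd(errors)), colour = "red") +
  xlab(TeX('$\\epsilon$')) +
  ggtitle(TeX('Distribution of $\\epsilon$')) +
  theme_economist()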
The figure shows a distribution which doesn’t seem to follow a perfect normal distribution. However, given the small size of the sample we can’t be sure, and this visual check is highly subjective.
Another visual aid to check whether our model’s errors could come from a normal distribution is to compare the quantiles of our errors with the quantiles of the normal distribution. This can be inspected by creating a scatter plot of the quantiles of both distributions; this kind of plot is called a Q-Q plot. If our errors follow a normal distribution, the dots will be arranged on a straight line with slope one. Conversely, if our errors don’t fit a normal distribution, the Q-Q plot will show some dots deviating from that straight line.
ggqqplot(errors,title = TeX('Normal Q-Q plot of $\\epsilon$') )+
theme_economist()
According to the Q-Q plot shown in the figure, the errors from our model seem to follow a normal distribution, as only a few points are not perfectly on the straight line of slope one. However, again, this is very qualitative, and if we want to be more quantitative we must perform a statistical test.
There are several statistical tests to check whether sample data follow a normal distribution. As our sample size is small (five hundred values), we will use the Shapiro-Wilk test, because it is well suited for small sample sizes. In this test, the null hypothesis is that our values (our errors) follow a normal distribution.
shapiro.test(errors)
##
## Shapiro-Wilk normality test
##
## data: errors
## W = 0.99812, p-value = 0.8644
As the p-value (0.8644) is well above 0.05 (the usual significance level), we cannot reject the null hypothesis: the data are compatible with our errors following a normal distribution.
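For contrast, running the same test on a clearly non-normal sample, for example the squared errors (which are strongly right skewed), should yield a very small p-value.

# The squared errors are far from normal, so we expect the null hypothesis to be rejected
shapiro.test(errors2)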
Let’s see where the ME (mean error) is located in our error distribution.
ggplot(k,aes(x=errors)) +
geom_density(fill="lightblue")+
geom_vline(xintercept = mean(errors), size = 1, colour="red") +
annotate("text", x = 0.8, y = 0.45, label = "ME (mean error)",colour="red") +
xlab(TeX('$\\epsilon$')) +
ggtitle(TeX('Distribution of $\\epsilon$'))+
theme_economist()
As the figure shows, our model IS making mistakes, but on average their size is close to zero. Thus, having \(ME=0\) doesn’t mean that our model is not making mistakes: sometimes our model overestimates the real values, while at other times it underestimates them, and the two effects cancel out in the mean. Therefore this error metric can be misleading.
Now, let’s see what happens with the MAE (mean absolute error). In order to do so, we need to remember that we are computing the mean of the absolute errors, so we need to look at the distribution of the absolute errors instead of the distribution of the errors. Let’s see what this distribution looks like:
ggplot(k,aes(x=abserrors)) +
geom_density(fill="lightblue")+
geom_vline(xintercept = mean(abserrors), size = 1, colour="red") +
annotate("text", x = 1.5, y = 0.75, label = "MAE (mean absolute error)",colour="red") +
xlab(TeX('|$\\epsilon$|')) +
ggtitle(TeX('Distribution of |$\\epsilon$|'))+
theme_economist()
As shown in the figure, the absolute errors have a different distribution from the errors. In fact, this distribution has a long tail to the right (it is right skewed). Therefore, only when the distribution shows negligible tails and becomes more peaked will the MAE (mean absolute error) be close to the ME (mean error).
The MSE and the RMSE use yet another distribution: the distribution of squared errors. Therefore, to understand these error metrics we should look at it.
ggplot(k,aes(x=errors2)) +
geom_density(fill="lightblue")+
geom_vline(xintercept = mean(errors2), size = 1, colour="red") +
annotate("text", x = 3.5, y = 0.75, label = "MSE (mean squared error)",colour="red") +
geom_vline(xintercept = sqrt(mean(errors2)), size = 1, colour="blue") +
annotate("text", x = 3.95, y = 0.63, label = "RMSE (root mean squared error)",colour="blue") +
xlab(TeX('$\\epsilon^2$')) +
ggtitle(TeX('Distribution of $\\epsilon^2$'))+
theme_economist()
The last figure shows yet another distribution, different from the previous ones and with the longest tail. The reason for this long tail is that squaring magnifies the large errors a lot. Thus, a good way to check for the existence of outliers is to look at the tails of this distribution, which in turn explains why \(RMSE\) and \(MSE\) are so sensitive to them. Moreover, the figure shows that \(RMSE<MSE\) here; note that this ordering only holds when \(MSE>1\), since taking the square root of a number smaller than one makes it larger.
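A quick, purely illustrative check of that last remark: if we scale the errors down so that the MSE drops below one, the ordering flips.

# When MSE < 1, taking the square root increases the value, so RMSE > MSE
small_errors <- errors * 0.1
c(MSE = mean(small_errors^2), RMSE = sqrt(mean(small_errors^2)))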
Therefore, when comparing these error metrics we are looking at very different things. Now that we know that each error metric works with a different distribution, let’s plot them all together in the same graph to get a better comparison and a better understanding of the error metrics.
x <- data.frame(v1=errors,v2=abserrors,v3=errors2)
colnames(x) <- c("errors","abs(errors)","errors^2")
data <- melt(x)
## No id variables; using all as measure variables
ggplot(data,aes(x=value,fill=variable)) +
geom_density(alpha=0.2) +
labs(fill="") +
geom_vline(xintercept = mean(errors), size = 1, colour="red")+
annotate("text", x = 4.3, y = 0.52, label = "ME (mean error)",colour="red") +
geom_vline(xintercept = mean(abserrors), size = 1, colour="limegreen") +
annotate("text", x = 5.25, y = 0.73, label = "MAE (mean absolute error)",colour="limegreen") +
geom_vline(xintercept = mean(errors2), size = 1, colour="steelblue3") +
annotate("text", x = 5.23, y = 0.83, label = "MSE (mean squared error)",colour="steelblue3") +
geom_vline(xintercept = sqrt(mean(errors2)), size = 1, colour="snow4") +
annotate("text", x = 5.8, y = 0.63, label = "RMSE (root mean squared error)",colour="snow4") +
ggtitle("Distributions and error metrics") +
theme_economist()
In the last figure, the \(x\) axis represents simultaneously the values of \(\epsilon\), \(|\epsilon|\) and \(\epsilon^2\). Several facts can be observed from the figure. First, the distribution of the errors \(\epsilon\) is pretty symmetrical in this case (something desirable when building a model). Second, the distribution of \(|\epsilon|\) has lost that symmetry, as all negative errors become positive, and it has a longer tail. Third, the distribution of \(\epsilon^2\) has also lost the symmetry and has the longest tail, as the errors become amplified when squared.
Error metrics are just a number which summarizes everything, but they don’t tell us the whole story. One number can’t show us whether the corresponding distribution is symmetrical or not, or how long its tails are. Let’s consider a few examples: the first will be our earlier Gaussian errors; the others will be normally distributed errors with different spreads. Now we’re going to generate the fake data:
set.seed(347)
k2 <- rnorm(500,mean=-1,sd=0.3)+rnorm(500,mean=1,sd=0.3)
k3 <- rnorm(500,0,3)
k4 <- rnorm(500,0,0.5)
y <- data.frame(v1=errors,v2=k2,v3=k3,v4=k4)
colnames(y) <- c("M1","M2","M3","M4")
data2 <- melt(y)
## No id variables; using all as measure variables
Let’s compare the distributions of errors of these four “models”:
ggplot(data2,aes(x=value,fill=variable)) +
labs(fill="") +
geom_density(alpha=0.2) +
ggtitle("Distributions with similar ME (mean error)") +
theme_economist()
As shown in the figure, in this fake scenario all the models (M1, M2, M3, M4) have very different error distributions; however, the ME (mean error) is similar for all of them (except for model M3).
apply(y,2,mean)
## M1 M2 M3 M4
## 0.012446673 0.020431054 -0.258960506 0.002404615
Thus, the decision to select one model over the others can’t be based only on the ME; it requires a deeper understanding of the problem. Why does this happen? Because the mean is only one piece of the cake.
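For instance, just one additional summary, the standard deviation of each set of errors (computed from the data generated above), already tells the four models apart:

# The spread (related to the second central moment) differs a lot between the models
apply(y, 2, sd)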
In mathematical terms, the mean is the first moment of the distribution:
\[ \langle x^{n}\rangle=\int_{-\infty}^{\infty}x^{n}g(x)\,dx \]
where \(g(x)\) is the distribution’s density function and the mean corresponds to \(n=1\); the other moments \(\langle x^{n}\rangle\) with \(n>1\) remain unknown. Two simple examples will help to understand this:
For the first example, suppose you tell us that you own a car (first moment). Without more details, we can’t know which car you own, as there are millions of them on the roads. If you also provide the color (red, for example, which could play the role of the second moment), we have more information and we can discard some cars. If you then provide the brand (third moment), we can discard even more cars, and so on. Clearly, one question (moment) alone is not enough.
The second example is the kids’ game “Who is who?”. In this game you try to guess the opponent’s character by asking questions about the hair color, the size of the mouth, etc. (here, the answers to the questions play the role of the moments of the distribution) until you know who is who. But again, one question (moment) alone is not enough.
Providing the error metrics of your model is only the starting point for understanding your problem and the limitations of your model. Deeper understanding can be gained by inspecting the shape of the error distributions of your models. However, this inspection will not always tell you what your model is lacking. Bear in mind that the errors of your model can come from (among others):