In many cases, it is almost impossible to fit a perfect linear model to the available data. After fitting a model, the remaining variation in the dependent variable left unexplained by the model is captured by the residuals. In other words, the residuals are the deviations of the observed outcomes from the predicted outcomes. Mathematically, this is defined as:
\[e_i = Y_i - \hat{Y_i}\]
Though they are closely related, residuals are fundamentally different from the errors, which are the unobservable deviations of the observations from the true underlying relationship. Residuals are reasonable estimates of the errors in a linear regression model.
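As a quick numerical check of this definition, here is a minimal sketch (using the mpg ~ wt fit on mtcars that we return to later) comparing manually computed residuals with those returned by resid():
fit <- lm(mpg ~ wt, data = mtcars)
manual_resid <- mtcars$mpg - fitted(fit)              # Y_i - Y_hat_i
all.equal(unname(manual_resid), unname(resid(fit)))   # should be TRUE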
In addition to the usual summary measures such as R2, the p-value or the RMSE, residuals can be very important in investigating a model’s fit. If the mean of the outcome served as the model (intercept only), we would have a lot of variation in the dependent variable around this mean value. By adding one or more predictors, we are able to reduce this variation but not completely eliminate it. The leftover variation not explained by the model fit consists of the residuals. This is also where the idea for the R2 formula comes from: if the residuals are very small, so that the model is a near perfect fit, R2 will be close to 1.
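As a small illustration of that link, here is a sketch (using the mpg ~ wt fit discussed below) computing R2 directly from the residual and total sums of squares:
fit <- lm(mpg ~ wt, data = mtcars)
ss_res <- sum(resid(fit)^2)                        # variation left unexplained
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variation around the mean
1 - ss_res / ss_tot                                # manual R2
summary(fit)$r.squared                             # should match (roughly 0.75 here)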
For example, using the mtcars data set, we get a model coefficient of 20.09 for an intercept-only fit.
lm(mpg ~ 1, mtcars)
##
## Call:
## lm(formula = mpg ~ 1, data = mtcars)
##
## Coefficients:
## (Intercept)
## 20.09
This is the mean of the mpg variable in the data set, and illustrates that an intercept-only model will simply yield the mean of the dependent variable. The dependent outcome will also have a lot of variation around this mean value.
mean(mtcars$mpg)
## [1] 20.09062
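Since the fitted value of an intercept-only model is the mean for every observation, its residuals are simply the deviations of mpg from its mean; a quick sketch confirming this:
fit0 <- lm(mpg ~ 1, data = mtcars)
all.equal(unname(resid(fit0)), mtcars$mpg - mean(mtcars$mpg))  # should be TRUE
var(resid(fit0))   # the full variation of mpg around its mean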
By adding a new predictor, we could significantly reduce the variation around this mean value:
lm(mpg ~ wt, mtcars)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept) wt
## 37.285 -5.344
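To quantify how much of the leftover variation the new predictor removes, we can compare the residual sums of squares of the two fits (a quick sketch):
sum(resid(lm(mpg ~ 1,  mtcars))^2)   # intercept only: roughly 1126
sum(resid(lm(mpg ~ wt, mtcars))^2)   # with wt added: roughly 278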
We can examine these variations in a dotplot to get an idea of what the respective distributions look like. We use the broom package to get an extended data set.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
broom::augment(lm(mpg ~ 1, mtcars)) %>% head
## # A tibble: 6 x 9
## .rownames mpg .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 20.1 1.07 0.909 0.0313 6.12 7.58e-4 0.153
## 2 Mazda RX4 Wag 21 20.1 1.07 0.909 0.0312 6.12 7.58e-4 0.153
## 3 Datsun 710 22.8 20.1 1.07 2.71 0.0312 6.11 6.73e-3 0.457
## 4 Hornet 4 Drive 21.4 20.1 1.07 1.31 0.0312 6.12 1.57e-3 0.221
## 5 Hornet Sportabo~ 18.7 20.1 1.07 -1.39 0.0312 6.12 1.77e-3 -0.234
## 6 Valiant 18.1 20.1 1.07 -1.99 0.0312 6.12 3.63e-3 -0.336
mod_intercept_only <- broom::augment(lm(mpg ~ 1, mtcars)) %>%
  select(.rownames, .resid) %>%
  mutate(mod = as.factor(rep(0, times = length(.resid))),
         .resid_1 = .resid,
         x = .rownames) %>%
  select(-.resid, -.rownames)

mod_with_one_predictor <- broom::augment(lm(mpg ~ wt, mtcars)) %>%
  select(.rownames, .resid) %>%
  mutate(mod = as.factor(rep(1, times = length(.resid))),
         .resid_1 = .resid,
         x = .rownames) %>%
  select(-.resid, -.rownames)

bind_rows(mod_with_one_predictor, mod_intercept_only) %>%
  transform(mod = as.factor(ifelse(mod == 0, "Intercept Only", "With Predictor"))) %>%
  select(-x) -> mods_tidy
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
## Warning in bind_rows_(x, .id): binding character and factor vector, coercing
## into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector, coercing
## into character vector
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
mods_tidy %>%
  ggplot(data = ., aes(x = 1, y = .resid_1, fill = mod)) +
  geom_dotplot(position = position_dodge(width = 1), binaxis = "y",
               binwidth = 0.75, stackdir = "center") +
  labs(fill = "Models", x = "", y = "Residuals") +
  theme_bw() +
  theme(axis.text.x.bottom = element_blank(),
        axis.ticks.x = element_blank(),
        legend.position = "bottom")
This shows that there is less variation in the residuals when a predictor is added.
However, that’s not all there is to residuals. R generates several diagnostic plots after fitting a linear regression model, and these plots can indicate whether or not the model is a good fit, based on the properties the residuals should have. Some of these properties include: the residuals should show no distinct pattern against the fitted values, they should be approximately normally distributed, they should have roughly constant variance (homoscedasticity), and no single observation should exert undue influence on the fit.
Taking a look at the plot of mpg as a function of wt and examining the residuals, we see that they look pretty evenly distributed about the regression line.
lm(mpg ~ wt, mtcars) %>%
  broom::augment() %>%
  ggplot(data = ., aes(x = wt, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_pointrange(aes(ymin = mpg, ymax = .fitted), color = "blue") +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'
However, that doesn’t say enough. We would need to examine the additional diagnostic plots provided in R in order to judge the distribution of the residuals.
mod <- lm(mpg ~ wt, mtcars)
plot(mod, which = 1)
The first plot is the Residuals vs Fitted plot. It shows the residuals (the differences between the observed outcomes and the predicted outcomes) against the fitted values (the predicted outcomes). This plot should show no distinct non-linear pattern. If there is an additional non-linear component in the predictor and the model fails to capture this non-linear relationship, the plot will typically show a distinct pattern.
As an example of a Residuals vs Fitted plot which exhibits a non-linear pattern, we can fit a model to some arbitrary data that carries, for example, a cosine component, and then examine what the plot looks like.
library(magrittr)  # for the tee operator %T>%

x <- seq(0, 10, length = 50)
y <- 3 * x + cos(2 * x) + rnorm(50, 0, 0.02)
df <- data.frame(x = x, y = y)

lm(y ~ x, data = df) %T>% plot(which = 1) %>% summary
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10696 -0.69333 0.06333 0.68767 1.07219
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.04280 0.20108 -0.213 0.832
## x 3.02072 0.03465 87.176 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7216 on 48 degrees of freedom
## Multiple R-squared: 0.9937, Adjusted R-squared: 0.9936
## F-statistic: 7600 on 1 and 48 DF, p-value: < 2.2e-16
We see that, even though the regression coefficients are statistically significant and the linear term is accurately estimated, this is not enough to guarantee that the model is a good fit. The model misses the sinusoidal component in the observed outcome. The pattern present in the Residuals vs Fitted plot suggests that there is still a significant amount of variation unexplained by the model. While this model is not entirely a good one, that does not mean it cannot serve some purpose; the evaluation typically depends on a combination of other factors.
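One way to confirm this interpretation (a sketch, not part of the original fit) is to add the missing cosine term to the model and check that the pattern disappears from the Residuals vs Fitted plot:
fit_cos <- lm(y ~ x + cos(2 * x), data = df)   # hypothetical extended model
plot(fit_cos, which = 1)                       # residuals should now look patternless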
plot(mod, which = 2)
The second plot is the Q-Q plot, which shows whether the residuals are normally distributed. It compares the standardized residuals to a theoretical normal distribution and checks whether they follow a straight line or deviate from it significantly. This plot from the mtcars model obtained earlier looks quite satisfactory. Though there is some slight deviation at the edges, the Q-Q plot looks generally okay. This suggests that the residuals are perhaps adequately normally distributed.
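The same check can be reproduced manually from the model object if desired; a minimal sketch using base R's qqnorm() on the standardized residuals:
qqnorm(rstandard(mod), main = "Normal Q-Q plot of standardized residuals")
qqline(rstandard(mod), col = "blue")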
Another step is to simply compare a plot of the residuals to that of a normal distribution and examine visually whether there are any differences. This is the same check we performed with the Q-Q plot, except here it is a density plot. Again, we see that the curves deviate a little at both ends.
plot(density(resid(mod)), ylim = c(0, 0.15), main = "Density plot of the residuals")
# Overlay a normal density with mean 0 and sd 3 (roughly the residual standard error of mod)
curve(dnorm(x, 0, 3), from = -10, to = 10, col = "blue", add = TRUE)
Though it is not standard practice, we could conduct a normality test such as the Shapiro-Wilk test, whose null hypothesis is that the input data are normally distributed. When conducted, we obtain a result that is not statistically significant, so we fail to reject the null hypothesis that the residuals are normally distributed.
shapiro.test(resid(mod))
##
## Shapiro-Wilk normality test
##
## data: resid(mod)
## W = 0.94508, p-value = 0.1044
If the Shapiro-Wilk test were to return a significant result, that would not automatically mean we should abandon the assumption of normality of the residuals. Exploratory graphics such as Q-Q plots are often enough to gain an idea of the distribution of the residuals. Modelling is often just one step in a long process of attempting to understand the data and gain key insights.
Next, we examine an alternative data set and observe what an “abnormal” Q-Q plot may look like. Here, we use the diamonds data set.
lm(price ~ carat, diamonds) %>% plot(which = 2)
Unlike that from the mtcars model, this Q-Q plot shows significant deviation at both edges. This means that a linear regression model may not be entirely suitable for this data set, since the residuals are not normally distributed. Though some inferences may still be possible even from this linear model, a more suitable model would perhaps explain the data set even better.
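As a rough sketch of what such a more suitable model might look like (one possible choice among many, not a definitive recommendation), fitting price on a log-log scale and inspecting its Q-Q plot allows a direct comparison with the plot above:
lm(log(price) ~ log(carat), data = diamonds) %>% plot(which = 2)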
plot(mod, which = 3)
The third plot is called the Scale-Location plot, and it plots the square root of the absolute standardized residuals against the fitted values. For a good fit, it should show the residuals spread equally along the range of fitted values. This is very much like the first plot, except the y-axis shows a different quantity. This plot allows a check of equal variance, or homoscedasticity. The red line (a smooth of the plotted values) should be roughly horizontal, and the spread of the points should not vary with the fitted values.
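The same quantities can be reconstructed by hand from the model object; a minimal sketch of what the Scale-Location plot contains:
plot(fitted(mod), sqrt(abs(rstandard(mod))),
     xlab = "Fitted values", ylab = "sqrt(|standardized residuals|)",
     main = "Manual Scale-Location plot")
lines(lowess(fitted(mod), sqrt(abs(rstandard(mod)))), col = "red")  # trend line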
If in doubt, an additional test for heteroscedasticity, the Breusch-Pagan test, can be conducted to check whether this condition is satisfied. This test, developed by Trevor Breusch and Adrian Pagan in 1979, checks whether the variance of the errors from a regression depends on the values of the independent variables.
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.6.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.6.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(mod)
##
## studentized Breusch-Pagan test
##
## data: mod
## BP = 0.040438, df = 1, p-value = 0.8406
The null hypothesis here is that homoscedasticity is present. Since the p-value is large, we fail to reject the null hypothesis.
Again, we examine the Scale-Location plot from the diamonds data set in comparison.
lm(price ~ carat, diamonds) %>% plot(which = 3)
Here, the square root of the absolute standardized residuals shows a clear pattern. The points are concentrated in one region and their spread varies with the fitted values. In addition, the red trend line is not horizontal but shows a clear slope.
lm(price ~ carat, diamonds) %>% bptest
##
## studentized Breusch-Pagan test
##
## data: .
## BP = 9131.2, df = 1, p-value < 2.2e-16
An additional check with the Breusch-Pagan test produces a statistically significant result suggesting that heteroscedasticity is present. Here we reject the null hypothesis of homoscedasticity.
plot(mod, which = 5)
The final plot is the Residuals vs Leverage plot, which helps to find influential cases, if there are any. The concepts of leverage and influence are important here. Leverage is simply a function of the distance between an observation’s value of the explanatory variable and the mean of the explanatory variable. This means observations that are far from the mean of the explanatory variables have high leverage and may be outliers in the predictor space. However, not all points with high leverage are influential. Observations that significantly affect the slope of the regression line when removed or added are influential observations. Such observations may combine both leverage and influence.
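For a simple regression such as ours, leverage has a closed form based on the distance of wt from its mean; a quick sketch comparing it with R's hatvalues():
h_manual <- 1 / nrow(mtcars) +
  (mtcars$wt - mean(mtcars$wt))^2 / sum((mtcars$wt - mean(mtcars$wt))^2)
all.equal(unname(hatvalues(mod)), h_manual)  # should be TRUE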
Unlike in the other plots, patterns are not what matters here. Instead, we look for outlying values in the upper right or lower right corners, since removing or including such points will affect the results of the regression model. In this case, there is one observation (Chrysler Imperial) located at the top right corner which appears to be an outlier with high influence. This point lies beyond the Cook’s distance lines and is likely to have a significant impact on the regression results. We can observe additional details about this point using augment from the broom package.
broom::augment(mod) %>% select(.rownames, .hat, .cooksd) %>% arrange(desc(.hat))
## # A tibble: 32 x 3
## .rownames .hat .cooksd
## <chr> <dbl> <dbl>
## 1 Lincoln Continental 0.195 0.0719
## 2 Chrysler Imperial 0.184 0.532
## 3 Cadillac Fleetwood 0.170 0.0184
## 4 Lotus Europa 0.129 0.0132
## 5 Honda Civic 0.118 0.0249
## 6 Toyota Corolla 0.0956 0.260
## 7 Fiat X1-9 0.0866 0.000711
## 8 Porsche 914-2 0.0704 0.000101
## 9 Fiat 128 0.0661 0.193
## 10 Datsun 710 0.0584 0.0154
## # ... with 22 more rows
We can use the .hat variable to identify observations with high leverage. These appear to be the first three or four listed in descending order of .hat.
broom::augment(mod) %>% arrange(desc(.hat)) %>% .[1:4,] -> sample
sample
## # A tibble: 4 x 10
## .rownames mpg wt .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Lincoln Co~ 10.4 5.42 8.30 1.35 2.10 0.195 3.07 0.0719 0.770
## 2 Chrysler I~ 14.7 5.34 8.72 1.31 5.98 0.184 2.84 0.532 2.17
## 3 Cadillac F~ 10.4 5.25 9.23 1.26 1.17 0.170 3.09 0.0184 0.423
## 4 Lotus Euro~ 30.4 1.51 29.2 1.09 1.20 0.129 3.09 0.0132 0.423
broom::augment(mod) %>%
  arrange(desc(.hat)) %>%
  ggplot(data = ., aes(x = wt, y = mpg)) +
  geom_point() +
  geom_text(data = sample, aes(label = .rownames), size = 2) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'
The plot above shows the four main points which seem to have high leverage. Of these, Chrysler Imperial appears to have the highest influence, as we can tell from the Residuals vs Leverage plot. The .cooksd column in the augment output combines the residual size and the leverage into a single measure of influence. The results given below show that Chrysler Imperial has the highest value of .cooksd.
broom::augment(mod) %>% arrange(desc(.cooksd)) %>% select(.rownames, .cooksd)
## # A tibble: 32 x 2
## .rownames .cooksd
## <chr> <dbl>
## 1 Chrysler Imperial 0.532
## 2 Toyota Corolla 0.260
## 3 Fiat 128 0.193
## 4 Lincoln Continental 0.0719
## 5 Ford Pantera L 0.0371
## 6 Camaro Z28 0.0313
## 7 Duster 360 0.0313
## 8 Merc 240D 0.0311
## 9 AMC Javelin 0.0263
## 10 Honda Civic 0.0249
## # ... with 22 more rows
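The .cooksd column returned by augment() is the same quantity as base R's cooks.distance(), which combines the standardized residual and the leverage; a sketch of the relationship:
# D_i = (r_i^2 / p) * h_i / (1 - h_i), with p the number of estimated coefficients
d_manual <- (rstandard(mod)^2 / length(coef(mod))) *
  hatvalues(mod) / (1 - hatvalues(mod))
all.equal(d_manual, cooks.distance(mod))  # should be TRUE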
Our original model has coefficients:
coef(mod)
## (Intercept) wt
## 37.285126 -5.344472
With the outlier removed, the coefficients become:
outlier <- c("Chrysler Imperial")
lm(mpg ~ wt, data = mtcars[-which(rownames(mtcars) %in% outlier),]) %>% coef
## (Intercept) wt
## 38.746334 -5.869829
Plotting these two lines reveals just how much of an effect the point of high leverage and influence has on the regression results. We can use base plot to achieve this. The point of high influence is colored red in the plot. With the point excluded, we obtain the blue regression line. Adding the red point pulls the regression line upwards, and we obtain the black dashed line.
mtcars_wo <- mtcars[-which(rownames(mtcars) %in% outlier),]
plot(mpg ~ wt, mtcars, pch = 19, main = "Plot of mpg against wt showing regression lines")
#Without the outlier
abline(lm(mpg ~ wt, mtcars_wo), col = "blue")
#With the outlier added to the plot, the regression line is pulled upwards as shown by the dashed line.
abline(lm(mpg ~ wt, mtcars),col = "black", lty = "dashed")
sample_wo <- mtcars[which(rownames(mtcars) %in% outlier),]
points(sample_wo$wt, sample_wo$mpg, col = "red", pch = 19)
grid(nx = 8, ny = 11, col = "lightgray", lty = "dotted", lwd = par("lwd"), equilogs = TRUE)
These four major plots tell a story about the residuals from a regression model and should be examined in order to gain better insight into the model. However, they should not, on their own, determine whether the model is accepted or rejected. Models without a good fit may still serve some purpose in certain cases. The ultimate decision about which models to use lies with the practitioner. Linear models that do not fit the data well may still offer unique insights into how to proceed with the modelling, for example by adding a quadratic term, removing some outliers or applying a transformation. Every modelling exercise has its own unique steps. However, examining a regression model’s residuals is always good practice.