In many cases, it is almost impossible to fit a perfect linear model to the available data. After fitting a model, the remaining variation in the dependent variable left unexplained by the model is captured by the residuals. In other words, the residuals are the deviations of the observed outcomes from the predicted outcomes. Mathematically, this is defined as:
\[e_i = Y_i - \hat{Y_i}\]
Though they are closely related, residuals are fundamentally different from the errors, which are the unobservable deviations of the observations from the true underlying relationship. Residuals are reasonable estimates of the errors in a linear regression model.
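As a quick numerical check of this definition, here is a minimal sketch (using the mpg ~ wt fit on mtcars that we return to later) comparing manually computed residuals with those returned by resid():
fit <- lm(mpg ~ wt, data = mtcars)
manual_resid <- mtcars$mpg - fitted(fit)              # Y_i - Y_hat_i
all.equal(unname(manual_resid), unname(resid(fit)))   # should be TRUE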
In addition to the usual summary measures such as R2, the p-value or the RMSE, residuals can be very important in investigating a model’s fit. If the mean of the outcome served as the model (intercept only), we would have a lot of variation in the dependent variable around this mean value. By adding one or more predictors, we are able to reduce this variation but not completely eliminate it. The leftover variation not explained by the model fit consists of the residuals. This is also where the idea for the R2 formula comes from: if the residuals are very small, so that the model is a near perfect fit, R2 will be close to 1.
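As a small illustration of that link, here is a sketch (using the mpg ~ wt fit discussed below) computing R2 directly from the residual and total sums of squares:
fit <- lm(mpg ~ wt, data = mtcars)
ss_res <- sum(resid(fit)^2)                        # variation left unexplained
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variation around the mean
1 - ss_res / ss_tot                                # manual R2
summary(fit)$r.squared                             # should match (roughly 0.75 here)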
For example, using the mtcars data set, we get a model coefficient of 20.09 for an intercept-only fit.
lm(mpg ~ 1, mtcars)
##
## Call:
## lm(formula = mpg ~ 1, data = mtcars)
##
## Coefficients:
## (Intercept)
## 20.09
This is the mean of the mpg variable in the data set, and illustrates that an intercept-only model will simply yield the mean of the dependent variable. The dependent outcome will also have a lot of variation around this mean value.
mean(mtcars$mpg)
## [1] 20.09062
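Since the fitted value of an intercept-only model is the mean for every observation, its residuals are simply the deviations of mpg from its mean; a quick sketch confirming this:
fit0 <- lm(mpg ~ 1, data = mtcars)
all.equal(unname(resid(fit0)), mtcars$mpg - mean(mtcars$mpg))  # should be TRUE
var(resid(fit0))   # the full variation of mpg around its mean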
By adding a new predictor, we could significantly reduce the variation around this mean value:
lm(mpg ~ wt, mtcars)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept) wt
## 37.285 -5.344
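To quantify how much of the leftover variation the new predictor removes, we can compare the residual sums of squares of the two fits (a quick sketch):
sum(resid(lm(mpg ~ 1,  mtcars))^2)   # intercept only: roughly 1126
sum(resid(lm(mpg ~ wt, mtcars))^2)   # with wt added: roughly 278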
We can examine these variations in a dotplot to get an idea of what the respective distributions look like. We use the broom package to get an extended data set.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
broom::augment(lm(mpg ~ 1, mtcars)) %>% head
## # A tibble: 6 x 9
## .rownames mpg .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 20.1 1.07 0.909 0.0313 6.12 7.58e-4 0.153
## 2 Mazda RX4 Wag 21 20.1 1.07 0.909 0.0312 6.12 7.58e-4 0.153
## 3 Datsun 710 22.8 20.1 1.07 2.71 0.0312 6.11 6.73e-3 0.457
## 4 Hornet 4 Drive 21.4 20.1 1.07 1.31 0.0312 6.12 1.57e-3 0.221
## 5 Hornet Sportabo~ 18.7 20.1 1.07 -1.39 0.0312 6.12 1.77e-3 -0.234
## 6 Valiant 18.1 20.1 1.07 -1.99 0.0312 6.12 3.63e-3 -0.336
mod_intercept_only <- broom::augment(lm(mpg ~ 1, mtcars)) %>%
  select(.rownames, .resid) %>%
  mutate(mod = as.factor(rep(0, times = length(.resid))),
         .resid_1 = .resid,
         x = .rownames) %>%
  select(-.resid, -.rownames)

mod_with_one_predictor <- broom::augment(lm(mpg ~ wt, mtcars)) %>%
  select(.rownames, .resid) %>%
  mutate(mod = as.factor(rep(1, times = length(.resid))),
         .resid_1 = .resid,
         x = .rownames) %>%
  select(-.resid, -.rownames)

bind_rows(mod_with_one_predictor, mod_intercept_only) %>%
  transform(mod = as.factor(ifelse(mod == 0, "Intercept Only", "With Predictor"))) %>%
  select(-x) -> mods_tidy
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
## Warning in bind_rows_(x, .id): binding character and factor vector, coercing
## into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector, coercing
## into character vector
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
mods_tidy %>%
  ggplot(data = ., aes(x = 1, y = .resid_1, fill = mod)) +
  geom_dotplot(position = position_dodge(width = 1), binaxis = "y",
               binwidth = 0.75, stackdir = "center") +
  labs(fill = "Models", x = "", y = "Residuals") +
  theme_bw() +
  theme(axis.text.x.bottom = element_blank(),
        axis.ticks.x = element_blank(),
        legend.position = "bottom")
This shows that there is less variation in the residuals when a predictor is added.
However, that’s not all there is to residuals. R generates several diagnostic plots after fitting a linear regression model, and these plots can indicate whether or not the model is a good fit, based on the properties the residuals should have. Some of these properties include: the residuals should show no distinct pattern against the fitted values, they should be approximately normally distributed, they should have roughly constant variance (homoscedasticity), and no single observation should exert undue influence on the fit.
Taking a look at the plot of mpg as a function of wt and examining the residuals, we see that they look pretty evenly distributed about the regression line.
lm(mpg ~ wt, mtcars) %>%
  broom::augment() %>%
  ggplot(data = ., aes(x = wt, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_pointrange(aes(ymin = mpg, ymax = .fitted), color = "blue") +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'
However, that doesn’t say enough. We would need to examine the additional diagnostic plots provided in R in order to judge the distribution of the residuals.
mod <- lm(mpg ~ wt, mtcars)
plot(mod, which = 1)
The first plot is the Residuals vs Fitted plot. It shows the residuals (the differences between the observed outcomes and the predicted outcomes) against the fitted values (the predicted outcomes). This plot should show no distinct non-linear pattern. If there is an additional non-linear component in the predictor and the model fails to capture this non-linear relationship, the plot will typically show a distinct pattern.
As an example of a Residuals vs Fitted plot which exhibits a non-linear pattern, we can fit a model to some arbitrary data that carries, for example, a cosine component, and then examine what the plot looks like.
library(magrittr)  # for the tee operator %T>%

x <- seq(0, 10, length = 50)
y <- 3 * x + cos(2 * x) + rnorm(50, 0, 0.02)
df <- data.frame(x = x, y = y)

lm(y ~ x, data = df) %T>% plot(which = 1) %>% summary
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10696 -0.69333 0.06333 0.68767 1.07219
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.04280 0.20108 -0.213 0.832
## x 3.02072 0.03465 87.176 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7216 on 48 degrees of freedom
## Multiple R-squared: 0.9937, Adjusted R-squared: 0.9936
## F-statistic: 7600 on 1 and 48 DF, p-value: < 2.2e-16
We see that, even though the regression coefficients are statistically significant and the linear term is accurately estimated, this is not enough to guarantee that the model is a good fit. The model misses the sinusoidal component in the observed outcome. The pattern present in the Residuals vs Fitted plot suggests that there is still a significant amount of variation unexplained by the model. While this model is not entirely a good one, that does not mean it cannot serve some purpose; the evaluation typically depends on a combination of other factors.
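One way to confirm this interpretation (a sketch, not part of the original fit) is to add the missing cosine term to the model and check that the pattern disappears from the Residuals vs Fitted plot:
fit_cos <- lm(y ~ x + cos(2 * x), data = df)   # hypothetical extended model
plot(fit_cos, which = 1)                       # residuals should now look patternless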
plot(mod, which = 2)
The second plot is the Q-Q plot, which shows whether the residuals are normally distributed. It compares the standardized residuals to a theoretical normal distribution and checks whether they follow a straight line or deviate from it significantly. This plot from the mtcars model obtained earlier looks quite satisfactory. Though there is some slight deviation at the edges, the Q-Q plot looks generally okay. This suggests that the residuals are perhaps adequately normally distributed.
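The same check can be reproduced manually from the model object if desired; a minimal sketch using base R's qqnorm() on the standardized residuals:
qqnorm(rstandard(mod), main = "Normal Q-Q plot of standardized residuals")
qqline(rstandard(mod), col = "blue")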
Another step is to simply compare a plot of the residuals to that of a normal distribution and examine visually whether there are any differences. This is the same check we performed with the Q-Q plot, except here it is a density plot. Again, we see that the curves deviate a little at both ends.
plot(density(resid(mod)), ylim = c(0, 0.15), main = "Density plot of the residuals")
# Overlay a normal density with mean 0 and sd 3 (roughly the residual standard error of mod)
curve(dnorm(x, 0, 3), from = -10, to = 10, col = "blue", add = TRUE)
Though it is not standard practice, we could conduct a normality test such as the Shapiro-Wilk test, whose null hypothesis is that the input data are normally distributed. When conducted, we obtain a result that is not statistically significant, so we fail to reject the null hypothesis that the residuals are normally distributed.
shapiro.test(resid(mod))
##
## Shapiro-Wilk normality test
##
## data: resid(mod)
## W = 0.94508, p-value = 0.1044
If the Shapiro-Wilk test were to return a significant result, that would not automatically mean we should abandon the assumption of normality of the residuals. Exploratory graphics such as Q-Q plots are often enough to gain an idea of the distribution of the residuals. Modelling is often just one step in a long process of attempting to understand the data and gain key insights.
Next, we examine an alternative data set and observe what an “abnormal” Q-Q plot may look like. Here, we use the diamonds data set.
lm(price ~ carat, diamonds) %>% plot(which = 2)
Unlike that from the mtcars model, this Q-Q plot shows significant deviation at both edges. This means that a linear regression model may not be entirely suitable for this data set, since the residuals are not normally distributed. Though some inferences may still be possible even from this linear model, a more suitable model would perhaps explain the data set even better.
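As a rough sketch of what such a more suitable model might look like (one possible choice among many, not a definitive recommendation), fitting price on a log-log scale and inspecting its Q-Q plot allows a direct comparison with the plot above:
lm(log(price) ~ log(carat), data = diamonds) %>% plot(which = 2)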
plot(mod, which = 3)
The third plot is called the Scale-Location plot, and it plots the square root of the absolute standardized residuals against the fitted values. For a good fit, it should show the residuals spread equally along the range of fitted values. This is very much like the first plot, except the y-axis shows a different quantity. This plot allows a check of equal variance, or homoscedasticity. The red line (a smooth of the plotted values) should be roughly horizontal, and the spread of the points should not vary with the fitted values.
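The same quantities can be reconstructed by hand from the model object; a minimal sketch of what the Scale-Location plot contains:
plot(fitted(mod), sqrt(abs(rstandard(mod))),
     xlab = "Fitted values", ylab = "sqrt(|standardized residuals|)",
     main = "Manual Scale-Location plot")
lines(lowess(fitted(mod), sqrt(abs(rstandard(mod)))), col = "red")  # trend line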
If in doubt, an additional test for heteroscedasticity, the Breusch-Pagan test, can be conducted to check whether this condition is satisfied. This test, developed by Trevor Breusch and Adrian Pagan in 1979, checks whether the variance of the errors from a regression depends on the values of the independent variables.
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.6.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.6.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(mod)
##
## studentized Breusch-Pagan test
##
## data: mod
## BP = 0.040438, df = 1, p-value = 0.8406
The null hypothesis here is that homoscedasticity is present. Since the p-value is large, we fail to reject the null hypothesis.
Again, we examine the Scale-Location plot from the diamonds data set in comparison.
lm(price ~ carat, diamonds) %>% plot(which = 3)
Here, the square root of the absolute standardized residuals shows a clear pattern. The points are concentrated in one region and their spread varies with the fitted values. In addition, the red trend line is not horizontal but shows a clear slope.
lm(price ~ carat, diamonds) %>% bptest
##
## studentized Breusch-Pagan test
##
## data: .
## BP = 9131.2, df = 1, p-value < 2.2e-16
An additional check with the Breusch-Pagan test produces a statistically significant result suggesting that heteroscedasticity is present. Here we reject the null hypothesis of homoscedasticity.
plot(mod, which = 5)
The final plot is the Residuals vs Leverage plot, which helps to find influential cases, if there are any. The concepts of leverage and influence are important here. Leverage is simply a function of the distance between an observation’s value of the explanatory variable and the mean of the explanatory variable. This means observations that are far from the mean of the explanatory variables have high leverage and may be outliers in the predictor space. However, not all points with high leverage are influential. Observations that significantly affect the slope of the regression line when removed or added are influential observations. Such observations may combine both leverage and influence.
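For a simple regression such as ours, leverage has a closed form based on the distance of wt from its mean; a quick sketch comparing it with R's hatvalues():
h_manual <- 1 / nrow(mtcars) +
  (mtcars$wt - mean(mtcars$wt))^2 / sum((mtcars$wt - mean(mtcars$wt))^2)
all.equal(unname(hatvalues(mod)), h_manual)  # should be TRUE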
Unlike in the other plots, patterns are not what matters here. Instead, we look for outlying values in the upper right or lower right corners, since removing or including such points will affect the results of the regression model. In this case, there is one observation (Chrysler Imperial) located at the top right corner which appears to be an outlier with high influence. This point lies beyond the Cook’s distance lines and is likely to have a significant impact on the regression results. We can observe additional details about this point using augment from the broom package.
broom::augment(mod) %>% select(.rownames, .hat, .cooksd) %>% arrange(desc(.hat))
## # A tibble: 32 x 3
## .rownames .hat .cooksd
## <chr> <dbl> <dbl>
## 1 Lincoln Continental 0.195 0.0719
## 2 Chrysler Imperial 0.184 0.532
## 3 Cadillac Fleetwood 0.170 0.0184
## 4 Lotus Europa 0.129 0.0132
## 5 Honda Civic 0.118 0.0249
## 6 Toyota Corolla 0.0956 0.260
## 7 Fiat X1-9 0.0866 0.000711
## 8 Porsche 914-2 0.0704 0.000101
## 9 Fiat 128 0.0661 0.193
## 10 Datsun 710 0.0584 0.0154
## # ... with 22 more rows
We can use the .hat variable to identify observations with high leverage. These appear to be the first three or four listed in descending order of .hat.
broom::augment(mod) %>% arrange(desc(.hat)) %>% .[1:4,] -> sample
sample
## # A tibble: 4 x 10
## .rownames mpg wt .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Lincoln Co~ 10.4 5.42 8.30 1.35 2.10 0.195 3.07 0.0719 0.770
## 2 Chrysler I~ 14.7 5.34 8.72 1.31 5.98 0.184 2.84 0.532 2.17
## 3 Cadillac F~ 10.4 5.25 9.23 1.26 1.17 0.170 3.09 0.0184 0.423
## 4 Lotus Euro~ 30.4 1.51 29.2 1.09 1.20 0.129 3.09 0.0132 0.423
broom::augment(mod) %>%
  arrange(desc(.hat)) %>%
  ggplot(data = ., aes(x = wt, y = mpg)) +
  geom_point() +
  geom_text(data = sample, aes(label = .rownames), size = 2) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'
The plot above shows the four main points which seem to have high leverage. Of these, Chrysler Imperial appears to have the highest influence, as we can tell from the Residuals vs Leverage plot. The .cooksd column in the augment output combines the residual size and the leverage into a single measure of influence. The results given below show that Chrysler Imperial has the highest value of .cooksd.
broom::augment(mod) %>% arrange(desc(.cooksd)) %>% select(.rownames, .cooksd)
## # A tibble: 32 x 2
## .rownames .cooksd
## <chr> <dbl>
## 1 Chrysler Imperial 0.532
## 2 Toyota Corolla 0.260
## 3 Fiat 128 0.193
## 4 Lincoln Continental 0.0719
## 5 Ford Pantera L 0.0371
## 6 Camaro Z28 0.0313
## 7 Duster 360 0.0313
## 8 Merc 240D 0.0311
## 9 AMC Javelin 0.0263
## 10 Honda Civic 0.0249
## # ... with 22 more rows
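The .cooksd column returned by augment() is the same quantity as base R's cooks.distance(), which combines the standardized residual and the leverage; a sketch of the relationship:
# D_i = (r_i^2 / p) * h_i / (1 - h_i), with p the number of estimated coefficients
d_manual <- (rstandard(mod)^2 / length(coef(mod))) *
  hatvalues(mod) / (1 - hatvalues(mod))
all.equal(d_manual, cooks.distance(mod))  # should be TRUE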
Our original model has coefficients:
coef(mod)
## (Intercept) wt
## 37.285126 -5.344472
With the outlier removed, the coefficients become:
outlier <- c("Chrysler Imperial")
lm(mpg ~ wt, data = mtcars[-which(rownames(mtcars) %in% outlier),]) %>% coef
## (Intercept) wt
## 38.746334 -5.869829
Plotting these two lines reveals just how much of an effect the point of high leverage and influence has on the regression results. We can use base plot to achieve this. The point of high influence is colored red in the plot. With the point excluded, we obtain the blue regression line. Adding the red point pulls the regression line upwards, and we obtain the black dashed line.
mtcars_wo <- mtcars[-which(rownames(mtcars) %in% outlier),]
plot(mpg ~ wt, mtcars, pch = 19, main = "Plot of mpg against wt showing regression lines")
#Without the outlier
abline(lm(mpg ~ wt, mtcars_wo), col = "blue")
#With the outlier added to the plot, the regression line is pulled upwards as shown by the dashed line.
abline(lm(mpg ~ wt, mtcars),col = "black", lty = "dashed")
sample_wo <- mtcars[which(rownames(mtcars) %in% outlier),]
points(sample_wo$wt, sample_wo$mpg, col = "red", pch = 19)
grid(nx = 8, ny = 11, col = "lightgray", lty = "dotted", lwd = par("lwd"), equilogs = TRUE)
These four major plots tell a story about the residuals from a regression model and should be examined in order to gain better insight into the model. However, they should not, on their own, determine whether the model is accepted or rejected. Models without a good fit may still serve some purpose in certain cases. The ultimate decision about which models to use lies with the practitioner. Linear models that do not fit the data well may still offer unique insights into how to proceed with the modelling, for example by adding a quadratic term, removing some outliers or applying a transformation. Every modelling exercise has its own unique steps. However, examining a regression model’s residuals is always good practice.