Residuals represent the variation left unexplained by our model. We emphasize the difference between residuals and errors: the errors are the unobservable true deviations from the line defined by the known coefficients, while the residuals are the observable deviations from the line defined by the estimated coefficients. In a sense, the residuals are estimates of the errors.
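To make the distinction concrete, here is a small simulated sketch (the coefficients 1 and 2, the error standard deviation 0.5, and the names xsim and ysim are made up purely for illustration) in which the true errors are known, so we can compare them to the residuals:
set.seed(1)
xsim<-runif(50)
eps<-rnorm(50, sd=0.5)# the true errors, which we would normally never observe
ysim<-1+2*xsim+eps# data simulated from a known line
cor(resid(lm(ysim~xsim)), eps)# the residuals closely track the true errors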
Let’s start with our motivating example using the diamond data set. Remember, in this data set the outcome is the diamond price in Singapore dollars, and the explanatory variable is the weight of the diamond in carats.
Generating a Plot
library(UsingR)
data(diamond)
library(ggplot2)
g= ggplot(diamond, aes(x=carat,y=price))
g= g + xlab("Mass(Carats)")
g= g + ylab("Price(SIN $)")
g= g + geom_point(size=8, colour="black", alpha=0.5)
g= g + geom_point(size=5, colour="red", alpha=0.2)
g= g + geom_smooth(method="lm", color="blue")
g
## `geom_smooth()` using formula 'y ~ x'
Looking at the plot, we can see that not all of the points lie on the regression line. The variation around the regression line is residual variation: the variation left unexplained after accounting for mass. So we start out with a lot of variation, we explain a substantial fraction of it by the linear relationship with mass, and some variation is left over, which we call residual variation. The distances that make up the residual variation are called the residuals. The residuals are quite useful for diagnosing many things, including poor model fit.
Let’s remind ourselves of the model we’re considering. The outcome in our example is the price, Yi, which we assume follows a line plus Gaussian error: Yi = β0 + β1 Xi + εi, where Xi is the mass. The errors εi are assumed to be normally distributed with mean zero and variance σ². Remember, Yi is the outcome (price) for the predictor value Xi (mass) at index i, and the point on the line corresponding to Xi is the fitted value, Yi hat.
We can compute Yi hat just by plugging in: Yi hat = β0 hat + β1 hat Xi. The residual is nothing other than the vertical distance between the observed outcome Yi and the fitted value Yi hat, so ei = Yi − Yi hat. Remember, the least squares criterion minimizes the sum of the squared vertical distances, so in essence it minimizes the sum of the squared residuals, Σ ei². One way to think about the residuals is as estimates of the εi, though you have to be careful with that interpretation, because as we’ll see later on, we can decrease the residuals just by adding irrelevant regressors into the model. Let’s talk about some aspects of residuals that will help us interpret them.
First of all, their population expected value is zero. Also, their empirical sum, and hence their empirical mean, is zero whenever the model includes an intercept; if you don’t include an intercept, this property doesn’t have to hold. The generalization of this property is that if you include any regressor in the model, the sum of the residuals times that regressor is zero.
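We will verify the first two properties on the diamond data below. As a quick preview of the no-intercept caveat, here is a small simulated sketch (the data and the names xs and ys are made up; the -1 in the formula removes the intercept):
set.seed(2)
xs<-runif(20); ys<-3+2*xs+rnorm(20)
sum(resid(lm(ys~xs)))# essentially zero: the model includes an intercept
sum(resid(lm(ys~xs-1)))# need not be near zero: the intercept is removed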
library(UsingR)
library(ggplot2)
library(dplyr)
data(diamond)
y<-diamond$price;
x<-diamond$carat;
n<-length(y)
fit<-lm(y~x)
e<-resid(fit)# the easiest way to get the residuals
e
## 1 2 3 4 5 6
## -17.9483176 -7.7380691 -22.9483176 -85.1585661 -28.6303057 6.2619309
## 7 8 9 10 11 12
## 23.4721795 37.6311854 -38.7893116 24.4721795 51.8414339 40.7389488
## 13 14 15 16 17 18
## 0.2619309 13.4209369 -1.2098087 40.5287002 36.1029250 -44.8405542
## 19 20 21 22 23 24
## 79.3696943 -25.0508027 57.8414339 9.2619309 -20.9483176 -3.7380691
## 25 26 27 28 29 30
## -19.9483176 27.8414339 -54.9483176 8.8414339 -26.9483176 16.4721795
## 31 32 33 34 35 36
## -22.9483176 -13.1020453 -12.1020453 -0.5278205 3.2619309 2.2619309
## 37 38 39 40 41 42
## -1.2098087 -43.2098087 -27.9483176 -23.3122938 -15.6303057 43.2672091
## 43 44 45 46 47 48
## 32.8414339 7.3696943 4.3696943 -11.5278205 -14.8405542 17.4721795
yhat<-predict(fit)# if we don't give predict() new data and just give it the lm fit, it returns the fitted values at the observed x values
max(abs(e-(y-yhat)))# another way of getting the residuals: y minus the fitted values agrees with resid() up to numerical error
## [1] 9.485746e-13
max(abs(e-(y-coef(fit)[1]-coef(fit)[2]*x)))# manually computing the fitted values as coef(fit)[1] + coef(fit)[2]*x gives exactly the same numbers
## [1] 9.485746e-13
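As one more check, fitted() returns the same values for an lm fit as predict() with no new data, so it reproduces the residuals too:
max(abs(e-(y-fitted(fit))))# fitted values via fitted() give the same residuals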
So the way we want to get the residuals is resid(), but hopefully the other calculations illustrate what resid() is doing in the background.
sum(e)
## [1] -1.865175e-14
sum(e*x)# the sum of the residuals times the x variable should also be zero
## [1] 6.959711e-15
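Relatedly, because the least squares estimates minimize the sum of the squared residuals, perturbing the fitted coefficients can only make that sum larger. A quick sketch using the objects defined above (the perturbation of 100 is an arbitrary choice):
sum(e^2)# sum of squared residuals at the least squares estimates
sum((y-(coef(fit)[1]+100)-coef(fit)[2]*x)^2)# larger when the intercept is perturbed
sum((y-coef(fit)[1]-(coef(fit)[2]+100)*x)^2)# larger when the slope is perturbed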
So the residuals are the signed lengths of the red lines shown in the plots below. These plots are drawn with base R graphics; sometimes base R graphics are handy alongside ggplot2.
plot(diamond$carat, diamond$price,xlab="Mass(carats)",ylab="Price(SIN $)",bg="lightblue",col="black", cex=1.1, pch=21, frame=FALSE)
We then add the fitted line. In base R, we can pass the lm fit object directly to abline(), which adds the regression line, and then draw each residual as a vertical red segment from the observed point to the fitted line.
plot(diamond$carat, diamond$price,xlab="Mass(carats)",ylab="Price(SIN $)",bg="lightblue",col="black", cex=1.1, pch=21, frame=FALSE)
abline(fit, lwd=2)
for(i in 1:n)
lines(c(x[i],x[i]),c(y[i],yhat[i]),col="red", lwd=2)
This scatter plot isn’t particularly useful for assessing residual variation because there is a lot of white space above and below the fitted line. We can make it more useful by plotting the residuals on the vertical axis against mass on the horizontal axis.
plot(x, e,
xlab="Mass(Carats)",
ylab="Residuals(SIN$)",
bg="lightblue",
col="black", cex=2, pch=21, frame= FALSE)
abline(h=0, lwd=2)
for(i in 1:n)
lines(c(x[i],x[i]),c(e[i],0),col="red", lwd=2)
Now we can see the residual variation much more clearly. When we look at a residual plot, we’re looking for any kind of pattern; the residuals should be mostly patternless. Also, remember that if we’ve included an intercept, the residuals have to sum to zero, so they have to lie both above and below the horizontal line at zero, and we’d like them to be spread in a random-looking fashion around it. In this particular plot we can see some interesting patterns; for example, there were lots of diamonds of exactly the same mass.
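We can confirm the repeated masses directly from the data (a quick check; showing the six most common carat values is an arbitrary choice):
head(sort(table(diamond$carat), decreasing=TRUE))# the most frequently occurring masses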
Next, we’re going to go through some pathological residual plots, just to highlight what residual plots can do for us. So, we’ve concocted some examples that help us understand how residuals can highlight model fit.
x=runif(100,-3,3);#x is going to be 100 different numbers between -3 and +3
y=x+sin(x)+rnorm(100,sd=0.2);#y is equal to x, an identity line plus 'sin(x)' which makes it oscillate around it a little bit plus some normal noise.
library(ggplot2)
g=ggplot(data.frame(x=x,y=y),aes(x=x,y=y))
g=g+geom_smooth(method="lm",colour="black")
g=g+geom_point(size=7,colour="black", alpha=0.4)
g=g+geom_point(size=5,colour="red",alpha=0.4)
g
## `geom_smooth()` using formula 'y ~ x'
It’s a little difficult to see the non-linearity; the sin(x) term is barely apparent. Looking at the plot, we might notice some pattern, but nonetheless it’s hard to see. The fitted linear model is not the correct model for these data, and this can happen in practice. That doesn’t mean the model is unimportant: there is a linear trend and the model accounts for it; it just doesn’t account for the secondary variation from the sin(x) term. The fitted line is close to the identity line, represents the average relationship between y and x, and explains a lot of the variation. So it is essential to remember that in regression, having exactly the right model is not always the primary goal; we can get meaningful information about trends from incorrect models.
Now we are going to plot the residuals versus x and see whether the sin(x) term becomes more apparent.
g=ggplot(data.frame(x=x, y=resid(lm(y~x))),aes(x=x, y=y))# the x aesthetic is still x, but the y aesthetic is now the residual from the linear model fit rather than the original y
g=g+geom_hline(yintercept=0,size=2);#We want to put the horizontal line at zero
g=g+geom_point(size=7,colour="black",alpha=0.4)#And add the points
g=g+geom_point(size=5,colour="red",alpha=0.4)
g=g+xlab("X")+ylab("Residual")
g
The above plot shows the sin(x) term extremely clearly. That’s what the residual plot has done: it has zoomed in on this aspect of the model’s inadequacy and highlighted it, and that’s one thing residual plots are especially good at.
We are going to create another plot where, by all appearances, the points fall perfectly on a line. But when we zoom in on the residuals, it looks quite different.
x<-runif(100, 0, 6);#x variable is a bunch of uniform random variables
y<-x+rnorm(100, mean=0, sd=0.001*x);#y is x (an identity line) plus errors whose standard deviation grows with x; this is how we generate data with heteroscedasticity
g=ggplot(data.frame(x=x,y=y),aes(x=x,y=y))
g=g+geom_smooth(method="lm", colour="black")
g=g+geom_point(size=7,colour="black", alpha=0.4)
g=g+geom_point(size=5,colour="red",alpha=0.4)
g
## `geom_smooth()` using formula 'y ~ x'
So there’s the plot; it seems like the points fall exactly on an identity line. When we zoom in on the residuals (in the figure below), we see a trend toward greater variability as we move along the x axis. That property, where the variability increases with x, is called heteroscedasticity. Heteroscedasticity is one of those things that residual plots are quite good at diagnosing: looking back at the plot above we can’t see it at all, but zooming in on the residuals makes it obvious.
g=ggplot(data.frame(x=x,y=resid(lm(y~x))),aes(x=x,y=y))
g=g+geom_hline(yintercept=0, size=2);
g=g+geom_point(size=7,colour="black", alpha=0.4)
g=g+geom_point(size=5,colour="red",alpha=0.4)
g=g+xlab("X")+ylab("Residual")
g
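As a rough numerical check of the heteroscedasticity (a sketch using the simulated x and y from above; the name e2 and the split at the median are arbitrary choices), we can compare the spread of the residuals for small versus large x:
e2<-resid(lm(y~x))
c(lower=sd(e2[x<median(x)]), upper=sd(e2[x>=median(x)]))# the residual spread is larger for larger x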
diamond$e<-resid(lm(price~carat, data=diamond))
head(diamond)#Checking the dataframe
## carat price e
## 1 0.17 355 -17.948318
## 2 0.16 328 -7.738069
## 3 0.17 350 -22.948318
## 4 0.18 325 -85.158566
## 5 0.25 642 -28.630306
## 6 0.16 342 6.261931
g=ggplot(diamond,aes(x=carat,y=e))
g=g+xlab("Mass(carats)")
g=g+ylab("Residual Price(SIN$)")
g=g+geom_hline(yintercept=0, size=2);
g=g+geom_point(size=7,colour="black", alpha=0.5)
g=g+geom_point(size=5,colour="blue",alpha=0.2)
g
There doesn’t appear to be much of a pattern in this plot, which is good; it seems like a pretty good fit.
Now, let’s illustrate something about variability in the diamond data set that will help set the stage for defining some new properties of our regression model fit. We are going to create two residual vectors. The first comes from a model where we fit only an intercept, so the residuals are just the deviations around the average price. The second comes from the model that also includes mass, so the residuals are the deviations around the regression line. In other words, the first is the variation around the average price, and the second is the variation around the regression line with carat as the explanatory variable and price as the outcome.
e<-c(resid(lm(price~1,data=diamond)),
resid(lm(price~carat, data=diamond)))
Next, we create a factor variable: the first set of residuals is labeled as coming from the intercept-only model, and the second set from the model with both an intercept and a slope.
fit=factor(c(rep("Itc",nrow(diamond)),
rep("Itc,slope", nrow(diamond))))
Finally, we create a ggplot where the y variable is the residuals and the x variable is the fitting approach: either the linear model with just an intercept, or the linear model with carat (mass) as the predictor.
g=ggplot(data.frame(e=e, fit=fit),aes(y=e,x=fit, fill=fit))
g=g+xlab("Fitting Approach")
g=g+ylab("Residual Price(SIN$)")
g=g+geom_dotplot(binaxis="y", stackdir="center",binwidth=30);
g
What we see in the left-hand plot, with just the intercept, is the variation in diamond prices around the average diamond price: basically the variation of the diamond prices in the sample.
What we see in the right-hand plot is the variation around the regression line. By accounting for the relationship with mass, we’ve explained a lot of that variation.
What we’re going to talk about next is R squared, which says that we can decompose the total variation in the y’s around their mean into the variation explained by the regression model and the variation that’s left over after accounting for the regression model. The right-hand residuals show the leftover variation, and the difference between the two is the variation explained by the regression. There’s a formula for that, and we’re going to dive into it next.
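As a quick preview of that decomposition (a sketch; the model is refit and the names sstot, ssres, and ssreg are introduced here just so the chunk stands on its own):
fit<-lm(price~carat, data=diamond)
sstot<-sum((diamond$price-mean(diamond$price))^2)# total variation around the mean price
ssres<-sum(resid(fit)^2)# variation left over after the regression
ssreg<-sstot-ssres# variation explained by the regression
c(R.squared=ssreg/sstot, from.summary=summary(fit)$r.squared)# the two agree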