Residuals and Diagnostics for Linear Regression
Residual and diagnostic plots are tools for spotting problems in a model. We will use some simulations and R's built-in datasets to illustrate how to generate diagnostic plots and how to interpret them.
To evaluate the fit and residuals of a linear model generated by R, we can use plot(model) to produce a series of four diagnostic plots: residuals vs. fitted, the Normal Q-Q plot, the scale-location plot, and residuals vs. leverage.
Regression analysis is based on certain assumptions, and regression diagnostics are used to evaluate those assumptions and to investigate whether there are observations with a large, undue influence on the analysis. Again, the assumptions for linear regression are: the relationship between the outcome and the predictors is linear, the errors have constant variance (homoskedasticity), and the errors are independent and Normally distributed.
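A minimal sketch of how one call produces all four diagnostics, using the built-in cars data (also used in a later example):
fit <- lm(dist ~ speed, data = cars)  # any fitted lm object works here
par(mfrow = c(2, 2))                  # arrange the four plots in a 2x2 grid
plot(fit)                             # residuals vs fitted, Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))                  # restore the default layout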
The fitted vs. residuals plot is mainly useful for investigating whether the linearity assumption holds and whether the spread of the residuals is constant.
Simulated example: polynomial function, to illustrate how violations of linearity appear.
x <- rnorm(500)             # 500 standard Normal draws
y <- 1 + x + x^2 + x^3      # a third-order polynomial in x, no noise
plot(lm(y ~ x), which = 1)  # residuals vs fitted
We can see a clear pattern in the graph, which indicates that the data violate linearity: the equation for y is a third-order polynomial in x. If the relationship between x and y is non-linear, the residuals will be a non-linear function of the fitted values.
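As a quick follow-up sketch (an addition, not part of the original example), refitting with the matching polynomial terms makes the pattern disappear:
fitpoly <- lm(y ~ poly(x, 3))  # third-order fit matches the true function
plot(fitpoly, which = 1)       # no systematic curvature left (here essentially zero, since y has no noise)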
Example: linear function. The built-in cars dataset records the speed of cars and the distances taken to stop, so we regress stopping distance (dist) on speed and inspect the residuals.
data("cars")
model <- lm(dist~speed,data=cars)
plot(model,which = 1)
Linearity seems to hold, as the red line stays close to the dashed zero line. There is some heteroskedasticity, however: as we move to the right on the x-axis, the spread of the residuals seems to increase. Notice points 23, 35, and 49: they lie far from the rest of the data and may be outliers, as they have large residuals.
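A quick follow-up (an assumed step, not in the original) is to pull out the flagged rows and their residuals:
cars[c(23, 35, 49), ]            # the observations flagged in the plot
residuals(model)[c(23, 35, 49)]  # their residuals are among the largest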
The scale-location plot is very similar to residuals vs. fitted, but it plots the square root of the absolute standardized residuals against the fitted values, which makes it well suited to verifying the homoskedasticity assumption. We want to look at whether the red line is approximately horizontal and whether the spread of points around it is roughly constant across all fitted values.
We can also verify our finding by using bptest from the lmtest library. The bptest function tests the null hypothesis of homoskedasticity; by checking the p-value of the test we can verify our finding: normally, if the p-value is less than .05, we reject the hypothesis.
Example with the mtcars data, deliberately not including all relevant predictors so that the homoskedasticity assumption is violated:
modelmt <- lm(disp ~ cyl + hp, data = mtcars)  # displacement on cylinders and horsepower
plot(modelmt, which = 3)                       # scale-location plot
In this plot the red line is rising rather than horizontal, which means the average size of the residuals is not approximately constant: it is constant only for fitted values around 210 to 340, and for other values it increases. That is a violation of the homoskedasticity assumption.
library(lmtest)  # provides bptest
bptest(modelmt)
##
## studentized Breusch-Pagan test
##
## data: modelmt
## BP = 6.1726, df = 2, p-value = 0.04567
The p-value is less than .05, so the assumption of homoskedasticity is violated, as we expected.
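If we want the decision in code rather than by eye, a small sketch (assumed, not from the original) extracts the p-value from the htest object that bptest returns:
bp <- bptest(modelmt)
bp$p.value < .05  # TRUE here (p = 0.04567): reject homoskedasticity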
Simulated example with polynomial function:
x <- 1:200                  # deterministic x
y <- 4*x + x^2 + x^4        # strongly non-linear in x
modelpoly <- lm(y ~ x)
plot(modelpoly, which = 3)  # scale-location plot
In this plot the red line is approximately horizontal, but the variance around the line changes, which is also a violation of the homoskedasticity assumption.
bptest(modelpoly)
##
## studentized Breusch-Pagan test
##
## data: modelpoly
## BP = 26.561, df = 1, p-value = 2.553e-07
Again the p-value is less than .05, so the assumption of homoskedasticity is violated, as we expected.
The Q-Q plot, or quantile-quantile plot, is a graphical tool to help us assess whether a set of data plausibly came from some theoretical distribution such as a Normal or exponential. For example, if we run a statistical analysis that assumes our dependent variable is Normally distributed, we can use a Normal Q-Q plot to check that assumption. It is just a visual check, not an air-tight proof, so it is somewhat subjective. But it allows us to see at a glance whether our assumption is plausible and, if not, how the assumption is violated and which data points contribute to the violation. A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, we should see the points forming a roughly straight line; if not, the residuals cannot be considered Gaussian (normal) IID.
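For reference, here is a minimal sketch (an assumed illustration) of what a Q-Q plot looks like when the data really are Normal:
set.seed(1)         # for reproducibility
z <- rnorm(100)     # a genuinely Normal sample
qqnorm(z)           # sample quantiles vs theoretical Normal quantiles
qqline(z, lty = 2)  # the points should hug this reference line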
Example with the swiss data:
data("swiss")
modelsw <- lm(Fertility ~.,data=swiss)
plot(modelsw ,which = 2)
For Gaussian (normal) IID values, 99.7% of observations fall within three standard deviations of the mean. The quantiles of the standardized residuals should therefore theoretically match the quantiles of the standard Normal distribution, which is the dashed line in the plot.
In the plot the data points form a roughly straight line, which supports the assumption that the residuals are Gaussian (normal) IID.
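As a complement to the visual check (an addition, not in the original), the Shapiro-Wilk test gives a formal p-value for normality of the residuals:
shapiro.test(modelsw$residuals)  # a large p-value gives no evidence against normality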
We can also check the distribution of the residuals with two more plots: the PDF and the CDF of the residuals.
resdulpdf <- density(modelsw$residuals)  # kernel density estimate of the residuals
plot(resdulpdf, main = 'Residual pdf shape Plot', xlab = 'Residuals')
The shape of the estimated probability density function is approximately bell-shaped.
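To make the comparison explicit (an assumed embellishment), we can overlay a Normal density with the residuals' mean and standard deviation:
plot(resdulpdf, main = 'Residual pdf shape Plot', xlab = 'Residuals')
curve(dnorm(x, mean = mean(modelsw$residuals), sd = sd(modelsw$residuals)),
      add = TRUE, col = 'red', lty = 2)  # reference Normal curve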
resdulcdf <- ecdf(modelsw$residuals)  # empirical CDF of the residuals
plot(resdulcdf, main = 'Residuals cdf', xlab = 'Residuals')
For Gaussian (normal) IID values the CDF accumulates to one and is symmetric around the mean; as is well known, the residuals have mean zero. The plot is approximately symmetric around the mean, and the x-axis shows a range from about -15 to 15.
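Both claims are easy to confirm numerically (an assumed check, not in the original):
mean(modelsw$residuals)   # essentially zero, as least squares guarantees
range(modelsw$residuals)  # roughly -15 to 15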
The residuals vs. leverage plot helps to identify influential data points in the model. Outliers can be influential, though they do not have to be, and some points within the normal range of the data can be very influential.
To illustrate these cases, we will simulate some examples below:
Example: a data point with low leverage and low influence.
# low leverage, low influence
x <- .5*(1:30) + .7*rnorm(30)  # x without the outlier
y <- 2 + x + 1.2*rnorm(30)     # y without the outlier
xout <- c(x, 8)                # x with an outlier near the mean of x
yout <- c(y, 19)               # y with an outlier far above the line
plot(xout, yout)
points(x = 8, y = 19, col = "orange", lwd = 2)  # highlight the outlier
abline(v = mean(x), col = "purple3", lwd = 3)   # vertical line at the mean of x
abline(lm(y ~ x), col = "red")                  # regression line without the outlier
abline(lm(yout ~ xout), col = "blue")           # regression line with the outlier
From the plot: the outlier is close to the mean of x, so it has low leverage and consequently low influence, even though its residual is large; the two fitted regression lines have nearly the same slope, which confirms the low influence.
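As a hedged numeric check (not in the original), hatvalues reports each point's leverage directly:
fitout <- lm(yout ~ xout)
tail(hatvalues(fitout), 1)  # the added point's hat value is small: low leverage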
Example: a data point with high leverage and low influence.
# high leverage, low influence
x1 <- .5*(1:30) + .5*rnorm(30)  # x without the outlier
y1 <- 2 + x1 + 1.2*rnorm(30)    # y without the outlier
xout1 <- c(x1, 22)              # x with an outlier far beyond the range of x1
yout1 <- c(y1, 24)              # y chosen so the outlier lies on the underlying line
plot(xout1, yout1)
points(x = 22, y = 24, col = "orange", lwd = 2)  # highlight the outlier
abline(v = mean(x1), col = "purple3", lwd = 3)   # vertical line at the mean of x1
abline(lm(y1 ~ x1), col = "red")                 # regression line without the outlier
abline(lm(yout1 ~ xout1), col = "blue")          # regression line with the outlier
From the plot: the outlier is far from the mean of x, so it has high leverage, but it lies on the line implied by the rest of the data, so it has low influence; the two fitted regression lines have nearly the same slope, which confirms the low influence.
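Again we can verify numerically (a hedged check, not in the original):
fitout1 <- lm(yout1 ~ xout1)
tail(hatvalues(fitout1), 1)       # large hat value: high leverage
tail(cooks.distance(fitout1), 1)  # small Cook's distance: low influence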
Example: data points with high leverage and high influence.
# high leverage, high influence
x2 <- .2*(1:30) + .5*rnorm(30)  # x without the outliers
y2 <- 2 + x2 + 1.2*rnorm(30)    # y without the outliers
xout2 <- c(x2, 9, 10)           # x with two outliers far beyond the range of x2
yout2 <- c(y2, 14, 16)          # y values well off the underlying line
plot(xout2, yout2)
points(x = c(9, 10), y = c(14, 16), col = "orange", lwd = 2)  # highlight the outliers
abline(v = mean(x2), col = "purple3", lwd = 3)  # vertical line at the mean of x2
abline(lm(y2 ~ x2), col = "red")                # regression line without the outliers
abline(lm(yout2 ~ xout2), col = "blue")         # regression line with the outliers
From the plot: the outliers are far from the mean of x, so they have high leverage, and the two fitted regression lines have clearly different slopes, which confirms their high influence. We can use the residuals vs. leverage plot to verify that:
plot(lm(yout2 ~ xout2), which = 5)  # residuals vs leverage
The two outliers appear in the plot with leverage values (on the x-axis) greater than all the other data; these are points with both high residuals and high leverage. Points that lie close to or outside the dashed red curves are worth investigating further, although since this is simulated data we already know they are the outliers we added. The red dashed curves mark Cook's distances of 0.5 and 1, so let us look at the Cook's distance plot to see the Cook's distances of the outliers.
plot(lm(yout2 ~ xout2), which = 4)  # Cook's distance
From the plot we see the outlier points 32 and 31 with Cook's distances of about 0.6 and 0.2 respectively. That is what we expected from the previous residuals vs. leverage plot, where point 32 had high leverage and sat beyond the red dashed curve marking a Cook's distance of 0.5.
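Finally, as a sketch of a one-stop numeric summary (an addition, not from the original), influence.measures flags influential observations by several criteria at once:
summary(influence.measures(lm(yout2 ~ xout2)))  # points 31 and 32 should be flagged here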