Diagnostic plots show how well the model fits the data and whether its assumptions hold.
Residuals vs fitted values
Residual plot helps us decide…
…if the “mean function” is appropriate - don’t want a trend in the residuals
…if there is heteroscedasticity
…if we have outliers - are there points for which the model is not appropriate?
Quantiles of residuals vs quantiles of normal
Uses studentized residuals (each residual is divided by its estimated standard error, which accounts for how extreme its x value is)
Square root of absolute studentized residuals vs fitted values
Taking the square root reduces skewness - easier to see trends in the spread of the residuals
Want the red line to be straight across - no trend
Cook’s distance (residuals vs leverage)
Measures each data point’s influence on the regression coefficients
Want all points to be in between the two red lines
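The quantities behind these four plots can also be computed directly, which helps when reading the panels. The following is a minimal sketch on simulated data; the data frame `d` and its columns `x` and `y` are made up for illustration and are not part of these notes. It uses only functions from base R.

```r
# Simulated stand-in data (an assumption for illustration only)
set.seed(1)
d <- data.frame(x = runif(50, 1, 10))
d$y <- 3 + 2 * d$x + rnorm(50)

fit <- lm(y ~ x, data = d)

# 1. Residuals vs fitted values
res  <- residuals(fit)
fitv <- fitted(fit)

# 2. Normal QQ plot: standardized residuals are compared with normal quantiles
std_res <- rstandard(fit)        # residual / (sigma * sqrt(1 - leverage))

# 3. Scale-location: square root of |standardized residual| vs fitted values
scale_loc <- sqrt(abs(std_res))

# 4. Residuals vs leverage, with Cook's distance measuring influence
lev   <- hatvalues(fit)
cooks <- cooks.distance(fit)

head(data.frame(fitv, res, std_res, scale_loc, lev, cooks))
```

Looking at these columns next to the plots shows which number each point on a panel corresponds to.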
Goal: improve diagnostic plots
Common transformations:
Square Root - helps with heteroscedasticity
Logarithm - helps equalize the error variance (helps with heteroscedasticity) and helps with curvature in the residuals
Inverse
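As a rough illustration of how these transformations are written in R, the sketch below simulates data with multiplicative error, a situation where a log-log fit is the natural choice. The data and the names `d`, `x`, `y`, `fits` and `spread_trend` are hypothetical. Note that `sqrt()` and `log()` can be used directly in a formula, while arithmetic such as an inverse should be wrapped in `I()`.

```r
# Simulated data with multiplicative error (an assumption for illustration only)
set.seed(42)
x <- exp(runif(200, 0, 5))
y <- exp(1 + 0.75 * log(x) + rnorm(200, sd = 0.3))
d <- data.frame(x, y)

fits <- list(
  raw     = lm(y ~ x, data = d),
  sqrt    = lm(sqrt(y) ~ sqrt(x), data = d),
  log     = lm(log(y) ~ log(x), data = d),
  inverse = lm(I(1 / y) ~ I(1 / x), data = d)  # I() keeps "/" as arithmetic
)

# Crude numeric check for heteroscedasticity:
# does the size of the residuals change with the fitted values?
spread_trend <- sapply(fits, function(f) cor(abs(residuals(f)), fitted(f)))
round(spread_trend, 2)

# The log-log fit should recover a slope near the simulated value of 0.75
coef(fits$log)
```

The correlation is only a quick numeric summary; the diagnostic plots remain the main tool for judging each transformation.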
library(alr3)
## Loading required package: car
data(brains)
attach(brains)
The two variables are BrainWt and BodyWt. BrainWt will be the response and BodyWt will be the predictor.
What we want the plots to look like:
Residual plot: points all over - no trend to the data
QQ norm plot: all points on the y=x line
Scale location plot: points all over - no trend to the data
Cook’s distance plot: all points in between the red lines
You can use the transformations on the response, the predictor or both. I only show each transformation applied to both.
mod = lm(BrainWt ~ BodyWt)
plot(mod)
Residual plot: there is a clear trend to the data
QQ norm plot: points do not match the line - three clear outliers
Scale location plot: there is a clear trend to the data
Cook’s distance plot: 2 points are outside of the red lines
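Viewing the four panels side by side makes this kind of check easier. A minimal sketch, assuming a simulated data frame `d` in place of `brains` (the names `d`, `x`, `y` and `old_par` are hypothetical), and passing `data =` to `lm()` rather than relying on `attach()`:

```r
# Simulated stand-in data (an assumption for illustration only)
set.seed(7)
d <- data.frame(x = rexp(60, rate = 0.1))
d$y <- 5 + 0.8 * d$x + rnorm(60, sd = 2)

fit <- lm(y ~ x, data = d)

old_par <- par(mfrow = c(2, 2))  # arrange the panels in a 2 x 2 grid
plot(fit)                        # residual, QQ, scale-location, leverage panels
par(old_par)                     # restore the previous layout
```

Using `data =` keeps each model tied to its data frame, which avoids name clashes when several data sets are loaded in one session.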
mod7 = lm(sqrt(BrainWt) ~ sqrt(BodyWt))
plot(mod7)
Residual plot: there is a clear trend to the data
QQ norm plot: points do not match the line - three clear outliers
Scale location plot: there is a clear trend to the data
Cook’s distance plot: 2 points are outside of the red lines
mod8 = lm(log(BrainWt) ~ log(BodyWt))
plot(mod8)
Residual plot: no real trend in the data
QQ norm plot: points almost match the normal quantiles
Scale location plot: no real trend in the data
Cook’s distance plot: all points are inside of the red lines
This looks like the best one.
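mod9 = lm(1/BrainWt ~ 1/BodyWt)
plot(mod9)
## hat values (leverages) are all = 0.01612903
## and there are no factor predictors; no plot no. 5
Residual plot: no real trend in the data
QQ norm plot: points almost match the normal quantiles
Scale location plot: no real trend in the data
Cook’s distance plot: none - R does not draw it, as the message above says. Inside a formula, 1/BodyWt is not arithmetic division, so BodyWt drops out and this call fits an intercept-only model. Every point then has the same leverage (0.01612903 = 1/62), and the three comments above describe that intercept-only fit rather than an inverse-transformed regression. To fit the intended model, wrap the terms in I(): lm(I(1/BrainWt) ~ I(1/BodyWt)).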
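The message printed above is a symptom of how R parses formulas. The sketch below, on simulated data (the names `d`, `x`, `y`, `bad` and `good` are hypothetical), compares the formula style used above with a version that wraps the inverse in `I()` so that it is evaluated as arithmetic.

```r
# Simulated stand-in data (an assumption for illustration only)
set.seed(3)
d <- data.frame(x = runif(40, 1, 20))
d$y <- 1 / (0.5 + 2 / d$x + rnorm(40, sd = 0.02))

bad  <- lm(1 / y ~ 1 / x, data = d)        # right-hand side is NOT "1 divided by x"
good <- lm(I(1 / y) ~ I(1 / x), data = d)  # I() protects the arithmetic

coef(bad)    # intercept only: x never enters the model
coef(good)   # intercept plus a slope for 1/x

# With an intercept-only fit every point has the same leverage, 1/n,
# which is why plot() cannot draw the residuals vs leverage panel
range(hatvalues(bad))
```

Checking `coef()` right after fitting is a cheap way to confirm that the formula produced the terms that were intended.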
library(alr3)
data(stopping)
Variables: Distance and Speed
The square root of Distance is the best transformation
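A sketch of that kind of fit, for readers without `alr3` installed: base R ships a similar but not identical data set, `cars`, with columns `speed` and `dist`. Using it here is an assumption made so the example runs as-is; with `alr3` loaded, the analogous call would be `lm(sqrt(Distance) ~ Speed, data = stopping)`.

```r
# `cars` is a built-in stand-in for `stopping` (assumption for illustration only)
fit_raw  <- lm(dist ~ speed, data = cars)
fit_sqrt <- lm(sqrt(dist) ~ speed, data = cars)  # transform the response only

# Diagnostic panels for the transformed fit, in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(fit_sqrt)
par(old_par)

coef(fit_sqrt)
```

Running `plot(fit_raw)` the same way gives the untransformed panels for comparison.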
Outlier: a point that’s well separated from the rest of the data
Identify from plot(model)
Residual (1st)
Scale Location (3rd)
Sloppy rule: if abs(studentized residual) > 2, then that point is an outlier in the y direction
Studentized residual = residual / standard error of that residual
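The rule above can be sketched in a few lines. This example uses simulated data with one planted outlier (the names `d`, `by_hand`, `std_res` and `flagged` are hypothetical). It uses `rstandard()`, which is the version R plots in `plot(model)`; `rstudent()` is a closely related alternative that leaves each point out when estimating its standard error.

```r
# Simulated data with one planted outlier (an assumption for illustration only)
set.seed(11)
d <- data.frame(x = seq(1, 10, length.out = 50))
d$y <- 1 + 2 * d$x + rnorm(50)
d$y[25] <- d$y[25] + 15          # outlier in the y direction

fit <- lm(y ~ x, data = d)

# residual / standard error of that residual, computed by hand
h <- hatvalues(fit)
by_hand <- residuals(fit) / (sigma(fit) * sqrt(1 - h))

std_res <- rstandard(fit)        # the same quantity from base R

flagged <- which(abs(std_res) > 2)  # the "sloppy rule"
d[flagged, ]
```

Because roughly 5% of well-behaved points exceed 2 in absolute value by chance, flagged points are candidates to investigate rather than automatic deletions.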
How do we deal with them? - depends on context
Was the data point recorded incorrectly?
Why is this point an outlier?
Are we missing a predictor that could explain the trend?
Influential Point: an outlier in the x direction (a high-leverage point) that pulls the fitted line toward itself
Identify from plot(model)
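Leverage and influence can also be read off numerically. A minimal sketch on simulated data with one planted point that is extreme in x and far off the trend (the names `d`, `lev` and `cooks` are hypothetical):

```r
# Simulated data with one planted influential point (an assumption for illustration only)
set.seed(5)
d <- data.frame(x = c(1:20, 60))
d$y <- 2 + 0.5 * d$x + rnorm(21)
d$y[21] <- 2                     # extreme x value, far below the trend

fit <- lm(y ~ x, data = d)

lev   <- hatvalues(fit)          # leverage: how extreme each x value is
cooks <- cooks.distance(fit)     # influence on the fitted coefficients

which.max(lev)                   # the point with the most extreme x
which(cooks > 0.5)               # 0.5 and 1 are the contour levels plot() draws by default
```

Refitting without the flagged point and comparing `coef()` shows directly how much that one observation moves the line.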
The diagnostic plots show whether the model’s assumptions hold and how well it fits the data. The residual plot is the most important one to check because it shows whether the mean function is appropriate. Transformations can improve the fit and fix problems such as heteroscedasticity. Outliers and influential points also affect the model, so they have to be identified and dealt with, depending on context.