Diagnostic Plots

Diagnostic plots show how well the model fits the data.

Residual Plot

Residuals vs fitted values

Residual plot helps us decide…

…if the “mean function” is appropriate - don’t want a trend in the residuals

…if there is heteroscedasticity

…if we have outliers - are there points for which the model is not appropriate?
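As a rough illustration of the first point, here is a minimal sketch on simulated data (the variables x and y and the model name fit_line are made up for this example, not taken from the notes). The true mean function is curved but a straight line is fitted, so the residuals should show a trend.

```r
# Simulated (hypothetical) data: curved mean function, straight-line fit
set.seed(1)
x <- seq(1, 10, length.out = 50)
y <- 2 + 0.5 * x^2 + rnorm(50, sd = 1)

fit_line <- lm(y ~ x)

# Residuals vs fitted values, the same quantities as the first panel of plot(fit_line)
plot(fitted(fit_line), resid(fit_line),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Least squares forces the residuals to be uncorrelated with x itself ...
cor_linear <- cor(resid(fit_line), x)
# ... but they still follow the curvature that the line missed
curvature <- cor(resid(fit_line), (x - mean(x))^2)
```

A curvature value far from 0 is a numeric version of the U-shaped trend you would see in the plot.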

QQnorm Plot

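Quantiles of residuals vs quantiles of the normal distribution

Uses studentized (standardized) residuals: each residual is divided by its estimated standard error, which accounts for the leverage of extreme x values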
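The QQ plot can be built by hand to see what it compares. This is only a sketch on simulated data (x, y and fit are hypothetical names); it uses rstandard() for the residuals and draws the y = x line as the reference.

```r
# Simulated (hypothetical) data with normal errors
set.seed(2)
x <- runif(40, 0, 10)
y <- 1 + 2 * x + rnorm(40)
fit <- lm(y ~ x)

r_std <- rstandard(fit)                 # standardized residuals

# Normal quantiles matched to the sorted residuals
theo <- qnorm(ppoints(length(r_std)))
samp <- sort(r_std)

plot(theo, samp,
     xlab = "Theoretical quantiles", ylab = "Standardized residuals")
abline(0, 1, lty = 2)                   # points near this line suggest normal errors

qq_cor <- cor(theo, samp)               # close to 1 when the quantiles match
```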

Scale Location Plot

Square root of the absolute studentized residuals vs fitted values

Reduces skewness of the data - easier to see trends in the spread of the residuals

Want the red line to be straight across - no trend
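A minimal sketch of what the scale location plot computes, on simulated data where the spread of the errors grows with x (all names and numbers here are assumptions for illustration). The red line is drawn with lowess() as a stand-in for the smoother that plot() adds.

```r
# Simulated (hypothetical) data: error spread increases with x
set.seed(3)
x <- seq(1, 10, length.out = 60)
y <- 1 + 2 * x + rnorm(60, sd = 0.5 * x)
fit <- lm(y ~ x)

# Vertical axis of the scale location plot
scale_loc <- sqrt(abs(rstandard(fit)))

plot(fitted(fit), scale_loc,
     xlab = "Fitted values", ylab = "sqrt(|standardized residuals|)")
lines(lowess(fitted(fit), scale_loc), col = "red")

# An upward trend here points to heteroscedasticity
trend <- cor(fitted(fit), scale_loc)
```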

Cook’s Distance

Measures each data point’s influence on the regression coefficients

Want all points to be in between the two red lines
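Cook's distance can also be pulled out as numbers with cooks.distance(). The sketch below uses simulated data with one planted point that is far out in x and far from the line (the data and names are hypothetical).

```r
# Simulated (hypothetical) data: 30 ordinary points plus one unusual point
set.seed(4)
x <- c(runif(30, 0, 10), 30)
y <- c(1 + 2 * x[1:30] + rnorm(30), 20)   # the line would predict about 61 at x = 30
fit <- lm(y ~ x)

cd <- cooks.distance(fit)                 # one value per observation

# Residuals vs leverage, with Cook's distance contours
plot(fit, which = 5)

most_influential <- unname(which.max(cd)) # index of the point with the largest distance
```

Here the planted point (number 31) has by far the largest Cook's distance, which is what falling outside the contour lines looks like numerically.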

Transformation

Goal: improve diagnostic plots

Common transformations: square root, logarithm, inverse

Example 1

library(alr3)
## Loading required package: car
data(brains)
attach(brains)

The two variables are BrainWt and BodyWt. BrainWt will be the response and BodyWt will be the predictor.

What we want the plots to look like:

  1. Residual plot: points all over - no trend to the data

  2. QQ norm plot: all points on the y=x line

  3. Scale location plot: points all over - no trend to the data

  4. Cook’s distance plot: all points in between the red lines

You can use the transformations on the response, the predictor or both. Here each transformation is applied to both.
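For reference, the three ways of applying a transformation look like this in a formula. This is a sketch on simulated data (x, y and the model names are hypothetical, and the power-law relationship is an assumption chosen so that the log of both variables is the right choice).

```r
# Simulated (hypothetical) data: y is roughly proportional to x^0.75
set.seed(5)
x <- exp(runif(80, 0, 8))                    # predictor spans several orders of magnitude
y <- 3 * x^0.75 * exp(rnorm(80, sd = 0.3))   # multiplicative noise

fit_resp <- lm(log(y) ~ x)        # transform the response only
fit_pred <- lm(y ~ log(x))        # transform the predictor only
fit_both <- lm(log(y) ~ log(x))   # transform both

# With both logged, the slope estimates the exponent (0.75 in this simulation)
slope_both <- unname(coef(fit_both)[2])
```

Function calls such as log() and sqrt() can be written directly in the formula; each of the three fits can then be passed to plot() to compare the diagnostics.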

Without Transformations

mod = lm(BrainWt ~ BodyWt)
plot(mod)

  1. there is a clear trend to the data

  2. not matching - three clear outliers

  3. there is a clear trend to the data

  4. 2 points are outside of the red lines

Square Root

mod7 = lm(sqrt(BrainWt) ~ sqrt(BodyWt))
plot(mod7)

  1. there is a clear trend to the data

  2. not matching - three clear outliers

  3. there is a clear trend to the data

  4. 2 points are outside of the red lines

Logarithm

mod8 = lm(log(BrainWt) ~ log(BodyWt))
plot(mod8)

  1. no real trend in the data

  2. almost matching the quantiles

  3. no real trend in the data

  4. all points are inside of red lines

This looks like the best one.

Inverse

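mod9 = lm(1/BrainWt ~ I(1/BodyWt))  # I() is needed: a bare / on the right of ~ is a formula operator, not division
plot(mod9)

Written without I(), as lm(1/BrainWt ~ 1/BodyWt), the predictor drops out of the model, R fits only an intercept, and plot() reports:

## hat values (leverages) are all = 0.01612903
##  and there are no factor predictors; no plot no. 5

The notes below were taken from that intercept-only fit (every leverage is 1/62 = 0.01612903), so they do not yet describe the inverse transformation:

  1. no real trend in the data

  2. almost matching the quantiles

  3. no real trend in the data

  4. none - all leverages are equal, so R skips the fourth plot (its plot no. 5).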
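The constant hat values reported above come from how R reads 1/BodyWt inside a formula. A small sketch on simulated data (x, y and the model names are hypothetical) shows the difference between writing the reciprocal with and without I().

```r
# Simulated (hypothetical) data where 1/y is linear in 1/x
set.seed(6)
x <- runif(25, 1, 5)
y <- 1 / (0.5 + 2 / x + rnorm(25, sd = 0.02))

# Without I(): the / on the right-hand side is formula syntax, so x is not used
fit_wrong <- lm(1/y ~ 1/x)
hat_wrong <- hatvalues(fit_wrong)   # every leverage equals 1/n

# With I(): the reciprocal of x is computed and used as the predictor
fit_right <- lm(1/y ~ I(1/x))
slope_right <- unname(coef(fit_right)[2])   # should be near 2 in this simulation
```

Equal leverages of 1/n are the same symptom as the hat values message: with no real predictor in the model, every point has the same leverage.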

Example 2

library(alr3)

data(stopping)

Variables: Distance and Speed

The square root of Distance is the best transformation
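The comparison for this example follows the same pattern as Example 1. The sketch below uses R's built-in cars data (columns speed and dist) purely as a stand-in so that it runs without alr3; it is not the stopping data, and the model names are made up. For the stopping data the formula would use Distance and Speed instead.

```r
# Stand-in data: built-in `cars` (speed, dist), NOT the alr3 `stopping` data
fit_raw  <- lm(dist ~ speed, data = cars)         # no transformation
fit_sqrt <- lm(sqrt(dist) ~ speed, data = cars)   # square root of the response

# Four diagnostic plots per model, shown in a 2 x 2 grid
op <- par(mfrow = c(2, 2))
plot(fit_raw)
plot(fit_sqrt)
par(op)
```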

Outliers and Influential Points

Outliers

Outlier: a point that’s well separated from the rest of the data

Identify from plot(model)

  • Residual (1st)

  • Scale Location (3rd)

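Rough rule of thumb: if abs(studentized residual) > 2, then that point is an outlier in the y direction

Studentized residual = residual / standard error of that residual = e_i / (s * sqrt(1 - h_i)), where s is the residual standard error and h_i is the leverage (hat value) of point i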
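The rule of thumb can be checked in code. This sketch, on simulated data with one planted outlier (all names and numbers are hypothetical), computes the residuals by hand from the leverages and compares them with rstandard().

```r
# Simulated (hypothetical) data with one outlier in the y direction
set.seed(7)
x <- runif(30, 0, 10)
y <- 1 + 2 * x + rnorm(30)
y[10] <- y[10] + 8                  # planted outlier
fit <- lm(y ~ x)

e <- resid(fit)                     # raw residuals
h <- hatvalues(fit)                 # leverages
s <- summary(fit)$sigma             # residual standard error

r_manual <- e / (s * sqrt(1 - h))   # residual / its standard error

# Same values as R's built-in function
same_as_builtin <- isTRUE(all.equal(unname(r_manual), unname(rstandard(fit))))

# Points flagged by the |r| > 2 rule of thumb
flagged <- which(abs(r_manual) > 2)
```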

How do we deal with them? - depends on context

  1. Was the data point recorded incorrectly?

  2. Why is this point an outlier?

  3. Are we missing a predictor that could explain the trend?

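Influential Points

Influential Point: a point whose removal would noticeably change the fitted model; typically a point with high leverage (an outlier in the x direction) that also does not follow the trend of the rest of the data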

Identify from plot(model)

  • Cook’s Distance (4th)
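One direct way to see influence is to refit the model without the suspicious point and compare the coefficients. The sketch below does this on simulated data (the data frame dat, the planted point and the model names are all hypothetical).

```r
# Simulated (hypothetical) data: point 31 is extreme in x and off the trend
set.seed(8)
x <- c(runif(30, 0, 10), 25)
y <- c(1 + 2 * x[1:30] + rnorm(30), 10)
dat <- data.frame(x = x, y = y)

fit_all  <- lm(y ~ x, data = dat)          # all points
fit_drop <- lm(y ~ x, data = dat[-31, ])   # same model without point 31

slope_all  <- unname(coef(fit_all)[2])
slope_drop <- unname(coef(fit_drop)[2])

# A large change in slope means the point is influential
slope_change <- slope_all - slope_drop
```

In this simulation the true slope is 2; the single planted point pulls the fitted slope well below that, and dropping it brings the slope back.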

Importance

The diagnostic plots show how well the model's assumptions hold for the data. The residual plot is the most important one to check because it shows whether the mean function is appropriate. Transformations can improve the fit when the plots reveal a problem. Outliers and influential points can also distort the fit, so they have to be investigated and dealt with.