Diagnostic Plots

Diagnostic plots show how well the model fits the data.

Residual Plot

Residuals vs fitted values

Residual plot helps us decide…

…if the “mean function” is appropriate - don’t want a trend in the residuals

…if there is heteroscedasticity

…if we have outliers - are there points for which the model is not appropriate?
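The first check above (is the mean function appropriate?) can be sketched with simulated data. The data, seed, and names `fit_line` and `fit_quad` below are assumptions made for illustration, not part of the course examples.

```r
# Simulated data (an assumption): the true relationship is curved.
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x^2 + rnorm(100)

fit_line <- lm(y ~ x)             # straight line: wrong mean function
fit_quad <- lm(y ~ x + I(x^2))    # quadratic: matches how the data were made

# which = 1 draws only the residuals vs fitted plot
plot(fit_line, which = 1)         # expect a U-shaped trend
plot(fit_quad, which = 1)         # expect no trend

# Numeric version of the same idea: do the residuals still track curvature?
curv <- (x - mean(x))^2
trend_line <- cor(residuals(fit_line), curv)
trend_quad <- cor(residuals(fit_quad), curv)
```

In this sketch `trend_line` should be far from 0 while `trend_quad` is 0 up to rounding, because the quadratic fit has already absorbed the curvature.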

QQnorm Plot

Quantiles of residuals vs quantiles of normal

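Uses studentized residuals (each residual is divided by its estimated standard error, which accounts for the leverage of extreme x values); R labels the axis “Standardized residuals”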
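A minimal sketch of what goes into the QQ plot, assuming simulated data. It computes the scaled residuals by hand and compares them with `rstandard()`, which is the quantity `plot()` uses for a linear model. The names `fit` and `r_manual` are illustrative.

```r
# Simulated data (an assumption) with normal errors
set.seed(2)
x <- runif(50, 0, 10)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

e <- residuals(fit)       # raw residuals
h <- hatvalues(fit)       # leverage of each point
s <- summary(fit)$sigma   # residual standard error

# Each residual divided by its own estimated standard error
r_manual <- e / (s * sqrt(1 - h))

# QQ plot by hand, then the built-in version (second diagnostic plot)
qqnorm(r_manual)
qqline(r_manual)
plot(fit, which = 2)
```

`r_manual` agrees with `rstandard(fit)`. R also has `rstudent()`, which leaves each point out when estimating the error variance; the built-in plot does not use that version.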

Scale Location Plot

Square root of the absolute studentized residuals vs fitted values

Taking the square root reduces the skewness of the absolute residuals - easier to see trends in their spread

Want the red line to be straight across (roughly horizontal) - no trend
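The scale location plot can be rebuilt by hand to see what the red line is tracking. This sketch assumes simulated data whose error spread grows with x, so an upward trend is expected; the names `fit` and `sl` are illustrative.

```r
# Simulated heteroscedastic data (an assumption): error sd grows with x
set.seed(3)
x <- runif(200, 1, 10)
y <- 1 + 2 * x + rnorm(200, sd = 0.5 * x)
fit <- lm(y ~ x)

# y-axis of the scale location plot
sl <- sqrt(abs(rstandard(fit)))

# Hand-made version with a smoother, then the built-in third plot
plot(fitted(fit), sl,
     xlab = "Fitted values", ylab = "sqrt(|standardized residuals|)")
lines(lowess(fitted(fit), sl), col = "red")
plot(fit, which = 3)

# Positive correlation = spread increases with the fitted value
spread_trend <- cor(fitted(fit), sl)
```

A clearly positive `spread_trend` is the numeric counterpart of a red line that slopes upward.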

Cook’s Distance

Measures each data point’s influence on regression coefficients

Want all points to be in between the two red lines (the Cook’s distance contours)
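Cook’s distance combines the size of a point’s residual with its leverage. The sketch below assumes simulated data with one planted point that is extreme in x and off the line, and checks the hand formula against `cooks.distance()`. The names `fit` and `d_manual` are illustrative.

```r
# Simulated data (an assumption); point 31 is extreme in x and off the line
set.seed(4)
x <- c(runif(30, 0, 10), 25)
y <- 1 + 2 * x + rnorm(31)
y[31] <- y[31] - 30
fit <- lm(y ~ x)

r <- rstandard(fit)      # standardized residuals
h <- hatvalues(fit)      # leverages
p <- length(coef(fit))   # number of coefficients (2 here)

# Cook's distance: large residual and/or large leverage gives a large value
d_manual <- (r^2 / p) * h / (1 - h)

# Residuals vs leverage plot with Cook's distance contours
plot(fit, which = 5)

most_influential <- unname(which.max(d_manual))
```

By default the dashed contours in the built-in plot mark Cook’s distances of 0.5 and 1, so a point outside them has a comparatively large influence on the coefficients.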

Transformation

Goal: improve diagnostic plots

Common transformations:

Square Root - helps with heteroscedasticity
Logarithm - helps equalize the error variance (heteroscedasticity) and reduce curvature in the residuals
Inverse
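A minimal sketch of comparing transformations, assuming simulated data in which the errors are multiplicative, so the log scale should behave best. The data and the `spread_trend` indicator are assumptions made for illustration; they do not replace looking at the plots.

```r
# Simulated data (an assumption): power law with multiplicative error
set.seed(5)
n <- 500
x <- exp(runif(n, 0, 6))
y <- 3 * x^0.75 * exp(rnorm(n, sd = 0.5))

fits <- list(
  raw  = lm(y ~ x),
  sqrt = lm(sqrt(y) ~ sqrt(x)),
  log  = lm(log(y) ~ log(x))
)

# Crude indicator of heteroscedasticity:
# correlation between |residual| and fitted value (near 0 is good)
spread_trend <- sapply(fits, function(f) cor(abs(residuals(f)), fitted(f)))
round(spread_trend, 2)

# Diagnostic plots for the log-log fit
par(mfrow = c(2, 2))
plot(fits$log)
par(mfrow = c(1, 1))
```

The same pattern works for any candidate transformation: fit each version, then compare the diagnostic plots side by side.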

Example 1

library(alr3)

## Loading required package: car

data(brains)
attach(brains)

The two variables are BrainWt and BodyWt. BrainWt will be the response and BodyWt will be the predictor.

What we want the plots to look like:

Residual plot: points all over - no trend to the data
QQ norm plot: all points on the y=x line
Scale location plot: points all over - no trend to the data
Cook’s distance plot: all points in between the red lines

You can use the transformations on the response, the predictor or both. I only showed each of the transformations on both.

Without Transformations

mod = lm(BrainWt ~ BodyWt)
plot(mod)

Residual plot: there is a clear trend to the data
QQ norm plot: not matching - three clear outliers
Scale location plot: there is a clear trend to the data
Cook’s distance plot: 2 points are outside of the red lines

Square Root

mod7 = lm(sqrt(BrainWt) ~ sqrt(BodyWt))
plot(mod7)

Residual plot: there is a clear trend to the data
QQ norm plot: not matching - three clear outliers
Scale location plot: there is a clear trend to the data
Cook’s distance plot: 2 points are outside of the red lines

Logarithm

mod8 = lm(log(BrainWt) ~ log(BodyWt))
plot(mod8)

Residual plot: no real trend in the data
QQ norm plot: almost matching the quantiles
Scale location plot: no real trend in the data
Cook’s distance plot: all points are inside of red lines

This looks like the best one.

Inverse

mod9 = lm(1/BrainWt ~ 1/BodyWt)
plot(mod9)

## hat values (leverages) are all = 0.01612903
##  and there are no factor predictors; no plot no. 5

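Residual plot: no real trend in the data
QQ norm plot: almost matching the quantiles
Scale location plot: no real trend in the data
Cook’s distance plot: none - R skips this plot because every leverage is identical (see the message above), not because the values are practically 0. Inside a formula, 1/BodyWt is read as formula syntax rather than arithmetic, so this call fits an intercept-only model and the comments above describe that fit. Writing the terms as I(1/BrainWt) and I(1/BodyWt) is needed to fit the inverse transformation.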
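The message about identical hat values is what R prints when a model has only an intercept: every leverage is then 1/n, and 1/62 = 0.01612903. The sketch below uses simulated data of the same size (an assumption, not the brains data) to show how `1/x` behaves with and without `I()` in a formula. The names `bad` and `good` are illustrative.

```r
# Simulated data (an assumption): 1/y is linear in 1/x, 62 rows
set.seed(6)
x <- runif(62, 1, 10)
inv_y <- 0.5 + 2 / x + rnorm(62, sd = 0.05)
y <- 1 / inv_y

# Without I(): the right-hand side is treated as formula syntax
bad <- lm(1/y ~ 1/x)

# With I(): the division is ordinary arithmetic
good <- lm(I(1/y) ~ I(1/x))

coef(bad)               # intercept only
coef(good)              # intercept and slope for 1/x
range(hatvalues(bad))   # every value equals 1/62
```

Checking `coef()` right after fitting is a quick way to confirm that a transformed predictor actually made it into the model.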

Example 2

library(alr3)

data(stopping)

variables: Distance and Speed

square root of Distance is the best transformation
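A sketch of how a response-only square root transformation is fitted. It uses the `cars` data that ships with base R (columns `speed` and `dist`) as a stand-in, which is an assumption: it is not the alr3 `stopping` data, so the plots will differ from the ones for this example.

```r
# Stand-in data (an assumption): base R's cars, not alr3's stopping
mod_raw  <- lm(dist ~ speed, data = cars)
mod_sqrt <- lm(sqrt(dist) ~ speed, data = cars)   # transform the response only

# All four diagnostic plots for each fit, for side-by-side comparison
par(mfrow = c(2, 2))
plot(mod_raw)
plot(mod_sqrt)
par(mfrow = c(1, 1))
```

With the alr3 package installed, the equivalent call for this example would use `Distance` and `Speed` with `data = stopping`.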

Outliers and Influential Points

Outliers

Outlier: a point that’s well separated from the rest of the data

Identify from plot(model)

Residual (1st)
Scale Location (3rd)

Sloppy rule: if abs(studentized residuals) > 2, then that point is an outlier in the y direction

Studentized residual = residual/standard error(residual), using each point’s own estimated standard error
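The sloppy rule can be applied directly in R. This sketch assumes simulated data with one planted outlier in the y direction and uses `rstandard()`, the scaled residuals shown in the diagnostic plots. The names `fit` and `flagged` are illustrative.

```r
# Simulated data (an assumption) with a planted outlier at row 10
set.seed(7)
x <- runif(40, 0, 10)
y <- 3 + 1.5 * x + rnorm(40)
y[10] <- y[10] + 8
fit <- lm(y ~ x)

# Residual divided by its estimated standard error
r <- rstandard(fit)

# Sloppy rule: flag points with |scaled residual| > 2
flagged <- which(abs(r) > 2)
flagged

# Look at the flagged rows before deciding what to do with them
data.frame(x = x, y = y, r = r)[flagged, ]
```

With normal errors about 5% of points exceed 2 by chance, so a flag is a prompt to investigate, not a reason to delete the point.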

How do we deal with them? - depends on context

Was the data point recorded incorrectly?
Why is this point an outlier?
Are we missing a predictor that could explain the trend?

Influential Points

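Influential Point: an outlier in the x direction (a high-leverage point); it is influential when removing it would noticeably change the fitted coefficients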
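How extreme a point is in the x direction is measured by its hat value (leverage). The sketch below assumes simulated data with one extreme x value. The cutoff of 2p/n is a common rule of thumb and an assumption here, not something stated in these notes.

```r
# Simulated data (an assumption); point 31 has an extreme x value
set.seed(8)
x <- c(runif(30, 0, 10), 40)
y <- 1 + 2 * x + rnorm(31)
fit <- lm(y ~ x)

h <- hatvalues(fit)
n <- length(h)
p <- length(coef(fit))

# For one predictor, leverage depends only on distance from the mean of x
h_manual <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)

# Rule of thumb (an assumption): leverage above 2p/n is worth a look
high_leverage <- which(h > 2 * p / n)
high_leverage

# Effect on the coefficients of dropping the flagged points
coef(fit)
coef(lm(y ~ x, subset = -high_leverage))
```

If the two sets of coefficients are close, the point has high leverage but little influence; a large change indicates an influential point.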