Residual analysis with diagnostic plots

We know how to fit a linear regression with mod <- lm() and how to plot it with plot(mod). Now we are going to use that information to spot outliers and influential points and to find transformations that give us a linear trend.

There are four diagnostic plots that help us judge how well the fitted model captures the relationship between the response and predictor variables (a short sketch after this list shows how to request each panel in R).

  1. Residuals vs. fitted values plot

plot(fitted(mod), resid(mod), xlab = "fitted values", ylab = "residuals"); abline(0, 0)

Ideal: no trend and equal spread

Look out for:

Curvature, which indicates our mean function \(X\beta\) is missing something (try polynomial regression and/or an interaction)

Heteroscedasticity

Outliers

  2. Normal Q-Q plot

    Assumption: the errors \(\epsilon\) are normally distributed

    Uses studentized residuals (which weight extreme x values lower)

  3. Scale-location plot

    Plots the square root of the absolute standardized residuals against the fitted values

Look out for:

A trend in the spread of the residuals (a sign of heteroscedasticity)

Taking the square root of the absolute residuals reduces their skewness, which makes such a trend easier to spot
  4. Cook’s distance

Measures the influence that each data point has on the regression coefficients \(\hat\beta\).

Shown in plot(mod) as the standardized residuals vs. leverage panel.

The red dashed contour lines mark Cook’s distance values of 0.5 and 1 and form a sideways ">" shape at the right edge of the plot.

Points far out to the right are not good: it means the data point has too much influence on \(\hat\beta\).
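
As mentioned above, here is a minimal sketch of how to request each of these panels in R using the which argument of plot.lm(); mod can be any fitted lm object.

plot(mod, which = 1)   # residuals vs. fitted values
plot(mod, which = 2)   # normal Q-Q plot of standardized residuals
plot(mod, which = 3)   # scale-location plot
plot(mod, which = 4)   # Cook's distance for each observation
plot(mod, which = 5)   # standardized residuals vs. leverage

# The underlying quantities are also available directly:
rstandard(mod)         # standardized residuals
rstudent(mod)          # studentized residuals
cooks.distance(mod)    # Cook's distances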

Let’s use this information on the brains dataset from the alr3 package.

library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data(brains)
attach(brains)
names(brains)
## [1] "BrainWt" "BodyWt"

As we can tell from the data, the response variable is BrainWt and the predictor variable is BodyWt.

mod<-lm(BrainWt~BodyWt, data=brains)
plot(mod)

The outliers are Human, Asian elephant, and African elephant. The elephants fall outside the red dashed Cook’s distance lines in the residuals vs. leverage panel, which means they have too much influence.
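
A minimal sketch of how to confirm this numerically (assuming, as in alr3, that the species names are stored as the row names of brains):

d <- cooks.distance(mod)
sort(d, decreasing = TRUE)[1:3]   # the three most influential observations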

Now we need to transform the data so that the relationship becomes linear. The three transformations below can do this.

  1. \(\sqrt{y} = X\beta + \epsilon\)

    Needs \(y > 0\)

    Helps with heteroscedasticity

    Could help with the mean function

  2. \(\log(y) = X\beta + \epsilon\)

    Needs \(y > 0\)

    Helps with heteroscedasticity

    Could help with the mean function

  3. \(1/y = X\beta + \epsilon\)

    Could work well if some understanding of the topic tells you x and y are inversely related

Our goal now is to improve the mean function and maybe reduce the heteroscedasticity.

par(mfrow=c(2,2))
mod2 <- lm(sqrt(BrainWt) ~ BodyWt, data = brains)    # square root of the response
plot(mod2)

mod3 <- lm(log(BrainWt) ~ BodyWt, data = brains)     # log of the response
plot(mod3)

mod4 <- lm((1/BrainWt) ~ BodyWt, data = brains)      # inverse of the response
plot(mod4)

mod5 <- lm(sqrt(BrainWt) ~ sqrt(BodyWt), data = brains)   # square root of both sides
plot(mod5)

mod6 <- lm(log(BrainWt) ~ log(BodyWt), data = brains)     # log of both sides
plot(mod6)

As the diagnostic plots show, taking the log of both the response and the predictor (mod6) gives the best fit for the brains data.
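
A minimal sketch of how predictions from mod6 can be put back on the original scale: since the model was fit on the log scale, exponentiate the predictions (the BodyWt values below are made up for illustration).

new_body <- data.frame(BodyWt = c(1, 100, 5000))
exp(predict(mod6, newdata = new_body))   # predicted BrainWt in the original units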

Now we can do this again using the built-in cars dataset.

attach(cars)
par(mfrow=c(2,2))
cmod2 <- lm(sqrt(dist) ~ speed)        # square root of the response
plot(cmod2)

cmod3 <- lm(log(dist) ~ speed)         # log of the response
plot(cmod3)

cmod4 <- lm((1/dist) ~ speed)          # inverse of the response
plot(cmod4)

cmod5 <- lm(sqrt(dist) ~ sqrt(speed))  # square root of both sides
plot(cmod5)

Taking the square root of dist (cmod2) gives the best fit for the cars dataset.
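
As a quick check, here is a minimal sketch of how the chosen square-root model can be drawn back on the original scale by squaring its fitted values (cars is already ordered by speed, so lines() can be used directly).

par(mfrow=c(1,1))
plot(dist ~ speed, data = cars)
lines(cars$speed, fitted(cmod2)^2, col = "red")   # back-transformed fitted curve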

5.4 Outliers and influential points

Cook’s distance helped us find data points with leverage. An outlier in the x direction alone is not necessarily an influential point, but an outlier in the y direction (far from the fitted line) can pull the fit and be influential. Not all outliers will cause problems.

How to find outliers in the y direction

The studentized residual is \(r_i = e_i / \mathrm{SE}(e_i)\), the residual divided by its standard error. If \(|r_i| > 2\), we need to pay extra attention to that point.
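
A minimal sketch of this rule in R, for any fitted model mod:

r <- rstudent(mod)    # studentized residuals
which(abs(r) > 2)     # observations that deserve extra attention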

How to deal with outliers: you can delete a point if it was clearly recorded wrong. Don’t delete it if it is not clearly wrong. If it was correctly measured and entered, figure out what happened (e.g., car manufacturer X has problems with its brakes). Also ask yourself whether you are missing a predictor; if so, that might be what is causing the outliers.