Residual analysis with diagnostic plots

We know how to fit a linear regression with mod <- lm() and how to plot it with plot(mod). Now we are going to use that information to spot outliers and influential points and to find transformations that give us a linear trend.

There are four diagnostic plots that help us judge how well the fitted model captures the relationship between the response and predictor variables (a short sketch after this list shows how to request each panel in R).

  1. Residuals vs. fitted values plot

plot(fitted(mod), resid(mod), xlab = "fitted values", ylab = "residuals"); abline(0, 0)

Ideal: no trend and equal spread

Look out for:

Curvature, which indicates our mean function \(X\beta\) is missing something (try polynomial regression and/or an interaction)

Heteroscedasticity

Outliers

  2. Normal Q-Q plot

    Assumption: the errors \(\epsilon\) are normally distributed

    Uses studentized residuals (which weight extreme x values lower)

  3. Scale-location plot

    Plots the square root of the absolute standardized residuals against the fitted values

Look out for:

A trend in the spread of the residuals (a sign of heteroscedasticity)

Taking the square root of the absolute residuals reduces their skewness, which makes such a trend easier to spot
  4. Cook’s distance

Measures the influence that each data point has on the regression coefficients \(\hat\beta\).

Shown in plot(mod) as the standardized residuals vs. leverage panel.

The red dashed contour lines mark Cook’s distance values of 0.5 and 1 and form a sideways ">" shape at the right edge of the plot.

Points far out to the right are not good: it means the data point has too much influence on \(\hat\beta\).
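
As mentioned above, here is a minimal sketch of how to request each of these panels in R using the which argument of plot.lm(); mod can be any fitted lm object.

plot(mod, which = 1)   # residuals vs. fitted values
plot(mod, which = 2)   # normal Q-Q plot of standardized residuals
plot(mod, which = 3)   # scale-location plot
plot(mod, which = 4)   # Cook's distance for each observation
plot(mod, which = 5)   # standardized residuals vs. leverage

# The underlying quantities are also available directly:
rstandard(mod)         # standardized residuals
rstudent(mod)          # studentized residuals
cooks.distance(mod)    # Cook's distances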

Let’s use this information on the brains dataset from the alr3 package.

library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data(brains)
attach(brains)
names(brains)
## [1] "BrainWt" "BodyWt"

As we can tell from the data, the response variable is BrainWt and the predictor variable is BodyWt.

mod<-lm(BrainWt~BodyWt, data=brains)
plot(mod)

The outliers are Human, Asian elephant, and African elephant. The elephants fall outside the red dashed Cook’s distance lines in the residuals vs. leverage panel, which means they have too much influence.
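
A minimal sketch of how to confirm this numerically (assuming, as in alr3, that the species names are stored as the row names of brains):

d <- cooks.distance(mod)
sort(d, decreasing = TRUE)[1:3]   # the three most influential observations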

Now we need to transform the data so that the relationship becomes linear. The three transformations below can do this.

  1. \(\sqrt{y} = X\beta + \epsilon\)

    Needs \(y > 0\)

    Helps with heteroscedasticity

    Could help with the mean function

  2. \(\log(y) = X\beta + \epsilon\)

    Needs \(y > 0\)

    Helps with heteroscedasticity

    Could help with the mean function

  3. \(1/y = X\beta + \epsilon\)

    Could work well if some understanding of the topic tells you x and y are inversely related

Our goal now is to improve the mean function and maybe reduce the heteroscedasticity.

par(mfrow=c(2,2))
mod2 <- lm(sqrt(BrainWt) ~ BodyWt, data = brains)    # square root of the response
plot(mod2)

mod3 <- lm(log(BrainWt) ~ BodyWt, data = brains)     # log of the response
plot(mod3)

mod4 <- lm((1/BrainWt) ~ BodyWt, data = brains)      # inverse of the response
plot(mod4)

mod5 <- lm(sqrt(BrainWt) ~ sqrt(BodyWt), data = brains)   # square root of both sides
plot(mod5)

mod6 <- lm(log(BrainWt) ~ log(BodyWt), data = brains)     # log of both sides
plot(mod6)

As the diagnostic plots show, taking the log of both the response and the predictor (mod6) gives the best fit for the brains data.
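
A minimal sketch of how predictions from mod6 can be put back on the original scale: since the model was fit on the log scale, exponentiate the predictions (the BodyWt values below are made up for illustration).

new_body <- data.frame(BodyWt = c(1, 100, 5000))
exp(predict(mod6, newdata = new_body))   # predicted BrainWt in the original units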

Now we can do this again using the built-in cars dataset.

attach(cars)
par(mfrow=c(2,2))
cmod2 <- lm(sqrt(dist) ~ speed)        # square root of the response
plot(cmod2)

cmod3 <- lm(log(dist) ~ speed)         # log of the response
plot(cmod3)

cmod4 <- lm((1/dist) ~ speed)          # inverse of the response
plot(cmod4)

cmod5 <- lm(sqrt(dist) ~ sqrt(speed))  # square root of both sides
plot(cmod5)

Taking the square root of dist (cmod2) gives the best fit for the cars dataset.
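
As a quick check, here is a minimal sketch of how the chosen square-root model can be drawn back on the original scale by squaring its fitted values (cars is already ordered by speed, so lines() can be used directly).

par(mfrow=c(1,1))
plot(dist ~ speed, data = cars)
lines(cars$speed, fitted(cmod2)^2, col = "red")   # back-transformed fitted curve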

5.4 Outliers and influential points

Cook’s distance helped us find data points with leverage. An outlier in the x direction alone is not necessarily an influential point, but an outlier in the y direction (far from the fitted line) can pull the fit and be influential. Not all outliers will cause problems.

How to find outliers in the y direction

The studentized residual is \(r_i = e_i / \mathrm{SE}(e_i)\), the residual divided by its standard error. If \(|r_i| > 2\), we need to pay extra attention to that point.
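
A minimal sketch of this rule in R, for any fitted model mod:

r <- rstudent(mod)    # studentized residuals
which(abs(r) > 2)     # observations that deserve extra attention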

How to deal with outliers: you can delete a point if it was clearly recorded wrong. Don’t delete it if it is not clearly wrong. If it was correctly measured and entered, figure out what happened (e.g., car manufacturer X has problems with its brakes). Also ask yourself whether you are missing a predictor; if so, that might be what is causing the outliers.