Sections 5.2-5.4

Today in class we covered diagnostic plots, transformations of variables, and outliers.

Diagnostic Plots

There are four different diagnostic plots R spits out when we call the plot command on a fitted lm object.
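As a minimal sketch, assuming a hypothetical fitted model fit (and hypothetical data mydata), we can view all four at once by setting up a 2-by-2 plotting grid first:

fit <- lm(y ~ x, data = mydata)  # hypothetical model and data
par(mfrow = c(2, 2))             # arrange the four plots in a 2x2 grid
plot(fit)                        # draws all four diagnostic plots
par(mfrow = c(1, 1))             # reset the plotting grid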

Residuals vs. Fitted Values

The first is the residuals versus the fitted values. Ideally there is no trend in this plot, and the points show roughly equal vertical spread.

This plot needs to be checked first, because it is where we can assess whether we have the correct “mean function.” If there is a lot of curvature, our mean function \(X\beta\) is missing something, like a polynomial or an interaction term. We can also transform either of the variables to counteract this - we will discuss transformations later in the learning log.
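For instance, here is a sketch (with hypothetical variables y, x1, and x2) of how we could add a polynomial or interaction term to the mean function:

fit_poly <- lm(y ~ x1 + I(x1^2))  # adds a quadratic term for x1
fit_int <- lm(y ~ x1 * x2)        # main effects plus the x1:x2 interaction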

On this plot, we can also check for heteroscedasticity, as well as whether there are any outliers.

Normal Q-Q Plot

This checks our assumption that the error terms follow a normal distribution. The plot uses standardized (studentized) residuals, which account for the fact that points with extreme x-values (high leverage) naturally have smaller raw residuals.
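We can draw this plot by itself, or rebuild it by hand; a sketch, again assuming a hypothetical fitted model fit:

plot(fit, which = 2)    # just the Q-Q plot
qqnorm(rstandard(fit))  # or construct it from the standardized residuals
qqline(rstandard(fit))  # reference line for a normal distribution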

Scale-Location Plot

This plots the square root of the absolute standardized residuals against the fitted values, which allows us to look for a trend in the size of our residuals - an upward trend suggests non-constant variance. Taking the square root also reduces the skewness of the plotted values, making any trend easier to see.
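A sketch of what this plot shows, assuming a hypothetical fitted model fit:

plot(fit, which = 3)                          # just the Scale-Location plot
plot(fitted(fit), sqrt(abs(rstandard(fit))))  # or construct it by hand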

Cook’s Distance

Lastly, the fourth plot (which R labels “Residuals vs Leverage”) uses Cook’s distance to measure the influence each data point has on the coefficient estimates \(\hat{\beta}\). If a point falls outside the red dashed Cook’s distance contour, it has too much influence on the coefficient estimates (which is not good).
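The influence measures behind this plot can also be pulled out directly; a sketch with a hypothetical fitted model fit:

plot(fit, which = 5)  # residuals vs. leverage, with Cook's distance contours
cooks.distance(fit)   # one Cook's distance per observation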

Transformations

We learned about three different transformations we can apply to our variables to help with some of the issues we may run into. A square root transformation can help with heteroscedasticity. A log transformation can also help with heteroscedasticity, and in some cases it can help the mean function too, but we have to make sure that the dependent variable is always greater than zero. Finally, an inverse transformation might help, although not always; it works well when some understanding of the topic tells us that x and y are inversely related.
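A sketch of all three transformations, using hypothetical variables y and x (the example below applies the first two to real data):

lm(sqrt(y) ~ x)  # square root of the response
lm(log(y) ~ x)   # log of the response (requires y > 0)
lm(I(1/y) ~ x)   # inverse of the response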

Example

I will use the stopping data set to illustrate the different diagnostic plots, and how different transformations might help us.

This data set includes information on both the speed (mph) and stopping distance (in feet) of cars.

library(alr3)  # provides the stopping data set
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data(stopping)
attach(stopping)  # so we can refer to Speed and Distance directly
plot(Distance ~ Speed)

We can see from plotting the initial data that there seems to be an upward trend that is not perfectly linear, as well as possibly some increasing variance near the upper speeds.

mod1 <- lm(Distance ~ Speed)  # simple linear model on the raw data
plot(mod1)                    # the four diagnostic plots

Just by looking at the first plot, we can see that there is an obvious curve, as well as evidence of heteroscedasticity, since the vertical spread of the residuals gets wider as the fitted values increase.

Because of this, we will play around with some transformations in an attempt to get the correct mean function for our model.

Normally we would not even look at the other three plots once we decide we have the wrong mean function, but we will do it here to demonstrate the concepts discussed earlier. The second plot checks our normality assumption: it looks great for the middle values, but is a little off near the extremes. The Scale-Location plot shows an upward trend in the residuals, which again points to the heteroscedasticity in our model. Lastly, the Cook’s distance plot shows that none of the points fall outside the red dashed line - which is good! However, there is a bit of an upward trend.

Let’s now play around with the transformations. We will look at square root and log transformations on both the dependent and independent variables.

Square root on the response:

mod2 <- lm(sqrt(Distance) ~ Speed)
plot(mod2)

Square root on both response and predictor:

mod3 <- lm(sqrt(Distance) ~ sqrt(Speed))
plot(mod3)

Log on response:

mod4 <- lm(log(Distance) ~ Speed)
plot(mod4)

Log on response and predictor:

mod5 <- lm(log(Distance) ~ log(Speed))
plot(mod5)

We can clearly see that our first transformation, square root on the response, had the best effect. The curves are all straightened out - everything looks nice!

By using this model instead of our original model, we will more accurately be able to model the relationship between distance and speed.

Outliers and Influential Points

A point can be an outlier without being influential. If a point is extreme only in the x direction - in other words, it is still on trend - it has high leverage but does not change the fit much. The point becomes influential when it is also an outlier in the y direction, off the trend.

We can flag these points by calculating studentized residuals. If the absolute value of a studentized residual is greater than 2, then that point needs extra attention.
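In R, a sketch using mod1, the original model fit above:

rstudent(mod1)                  # one studentized residual per observation
which(abs(rstudent(mod1)) > 2)  # points that deserve extra attention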

Once we’ve identified our outliers, there are a few things we can do to deal with them. If they’re clearly wrong (say, a data-entry error), we can delete them. If they were correctly measured and entered, then we need to figure out what happened with those observations.

Or maybe we are missing a predictor that would account for the outlier’s difference!