Today in class we focused on 3 things: Diagnostic Plots, Transformations, and (briefly) Outliers.
I will talk about how to interpret these plots later.
The goal of transformations is to improve our diagnostic plots. There are 3 common transformations:
1. Square root
2. Log
3. Inverse
When doing transformations, start at the top of this list and work your way down. Something to remember: simpler is better. If the regression looks just as good after taking the square root of one side as it does after taking the square root of both sides, choose the model that transforms only one side. The simpler the model is, the easier it is to explain and interpret for other people. You can also apply a different transformation to each side, but for the same reason you want to stay away from that option unless it is really necessary: different transformations on each side make the model hard to explain and interpret.
We will use brain weight/body weight data.
NEW: we will use the function with() to make plots. Sometimes a script works with multiple data sets, and attaching each one can mask variables that share a name. with() avoids that problem by evaluating an expression inside a single data set without attaching it. I'll use a different coding style for each transformation to show the different options.
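To see why with() helps, here is a minimal sketch using two made-up data frames (df_a and df_b are hypothetical and not part of the class data). Both have columns named x and y, and with() keeps them apart without attaching anything.

```r
# Two toy data frames (hypothetical) that share the column names x and y.
df_a <- data.frame(x = c(1, 2, 3), y = c(2, 4, 6))
df_b <- data.frame(x = c(10, 20, 30), y = c(5, 3, 1))

# with() evaluates the expression using the columns of the named data frame,
# so x is unambiguous even though both data frames define it.
mean_a <- with(df_a, mean(x))
mean_b <- with(df_b, mean(x))

# The same idea works for plotting with a formula.
with(df_a, plot(y ~ x))

# Nothing was attached, so no variable called x was created in the workspace.
x_is_global <- exists("x", envir = globalenv(), inherits = FALSE)
```

If we had attached both data frames instead, the second attach would have masked x and y from the first.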
library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data("brains")
head(brains)
## BrainWt BodyWt
## Arctic_fox 44.500 3.385
## Owl_monkey 15.499 0.480
## Beaver 8.100 1.350
## Cow 423.012 464.983
## Gray_wolf 119.498 36.328
## Goat 114.996 27.660
with(brains, plot(BrainWt ~ BodyWt))
This original plot has a couple of outliers and a big clump of points at small values on both the x and y axes. We can start applying transformations!
First, we will take the square root of just the left-hand side, and then of both sides, to see if that helps.
with(brains, plot(sqrt(BrainWt)~BodyWt))
with(brains, plot(sqrt(BrainWt)~sqrt(BodyWt)))
The square root of the y axis helped a tad, but we see a better result by taking the square root of both sides. However, the relationship is still not as linear as we want. We can continue with more transformations; next up is the log.
logbrain <- with(brains, log(BrainWt))
logbody <- with(brains, log(BodyWt))
mod1 <- lm(logbrain ~ BodyWt, data = brains)
mod2 <- lm(logbrain ~ logbody)
plot(mod1)
The first graph is the residuals vs fitted plot for mod1 (log of the left-hand side only). This is not good: the residuals are not evenly spread out, and there are still a lot of outliers. We don't even have to look at the other plots; we can reject this model right away.
plot(mod2)
plot(logbrain ~ logbody)
Looking at the residuals vs fitted plot for mod2 (log of both sides), this looks so much better! There aren't really any outliers, and there is little evidence of curvature in our data (seen in the red line). Our normal Q-Q plot looks pretty good; for the most part our data points stay along the line. Our scale-location plot has a slight curve, but it is still not bad. Lastly, our residuals vs leverage plot (the one with the Cook's distance lines) looks good as well. If a data point were outside the dotted lines, it would have too much influence on the fitted equation. Here, however, all the data points are inside the dotted lines.
Because our diagnostic plots looked so good, I plotted logbrain against logbody. Looking at our data points, they finally follow a linear pattern! The outliers have been reined in and the data points all follow the same trend.
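The four diagnostic plots can also be put on one page, and the residuals vs leverage check can be done with numbers instead of by eye. This is a minimal sketch on simulated data (sim is a made-up stand-in for the brains data, and the coefficients 2 and 0.75 are arbitrary). It relies on the fact that the contour lines R draws in that plot mark Cook's distances of 0.5 and 1.

```r
set.seed(1)
# Simulated stand-in for the brains data (an assumption, not the real data set).
body  <- exp(rnorm(60, mean = 1, sd = 2))
brain <- exp(2 + 0.75 * log(body) + rnorm(60, sd = 0.5))
sim   <- data.frame(BodyWt = body, BrainWt = brain)

fit <- lm(log(BrainWt) ~ log(BodyWt), data = sim)

# Show all four diagnostic plots on one page, then restore the old layout.
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)

# Numeric version of the residuals vs leverage check:
# list any points whose Cook's distance is beyond the 0.5 contour.
cd <- cooks.distance(fit)
influential <- names(cd)[cd > 0.5]
```

An empty influential vector means every point sits inside the 0.5 contour.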
For the inverse transformation, we will attach the data set instead of using with().
attach(brains)
inverbr <- 1/BrainWt
inverbod <- 1/BodyWt
mod3 <- lm(inverbr ~ BodyWt)
plot(mod3)
Applying the inverse to just the y value does not help our residuals. We can move right on and see if taking the inverse of both sides helps.
mod4 <- lm(inverbr ~ inverbod)
plot(mod4)
Taking the inverse of both sides is a little better than taking it of just one side, but it is still not good.
After looking at all the different models, taking the log of both sides produced the best results!
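One way to make this comparison easier is to draw the residuals vs fitted plot for each both-sides transformation side by side. The sketch below does that on simulated data (sim and the list fits are names I made up for illustration, and the data are not the real brains data). It writes the transformations inside the model formula, which is just another coding option and not how the models above were fit.

```r
set.seed(42)
# Simulated stand-in for the brains data (an assumption, not the real data set).
body  <- exp(rnorm(60, mean = 1, sd = 2))
brain <- exp(2 + 0.75 * log(body) + rnorm(60, sd = 0.5))
sim   <- data.frame(BodyWt = body, BrainWt = brain)

# Candidate models, in the order we tried them: square root, log, inverse.
# I() protects the arithmetic 1 / x inside a formula.
fits <- list(
  sqrt    = lm(sqrt(BrainWt) ~ sqrt(BodyWt), data = sim),
  log     = lm(log(BrainWt) ~ log(BodyWt), data = sim),
  inverse = lm(I(1 / BrainWt) ~ I(1 / BodyWt), data = sim)
)

# which = 1 asks plot() for only the residuals vs fitted plot.
old_par <- par(mfrow = c(1, 3))
for (nm in names(fits)) {
  plot(fits[[nm]], which = 1, main = nm)
}
par(old_par)
```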
As we know, outliers are data points that are well separated from the rest of the data. In our diagnostic plots, they can be seen in the first (residuals vs fitted) and third (scale-location) plots. There is a rule of thumb for classifying outliers: if the absolute value of a point's studentized residual is greater than 2, it is an outlier. Remember: a studentized residual is the residual divided by its standard error, residual / SE(residual).
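The rule can be applied in a few lines. This is a minimal sketch on simulated data (not the real brains data). One naming wrinkle: the residual / SE(residual) quantity is what R's rstandard() returns, while rstudent() gives a leave-one-out version, so I am assuming rstandard() is the one that matches the formula above.

```r
set.seed(7)
# Simulated stand-in for the brains data (an assumption, not the real data set).
body  <- exp(rnorm(60, mean = 1, sd = 2))
brain <- exp(2 + 0.75 * log(body) + rnorm(60, sd = 0.5))
fit   <- lm(log(brain) ~ log(body))

# rstandard() divides each residual by its estimated standard error,
# sigma_hat * sqrt(1 - h_i), where h_i is the leverage of point i.
stud <- rstandard(fit)

# The same quantity computed by hand, to show what the function does.
by_hand <- resid(fit) / (summary(fit)$sigma * sqrt(1 - hatvalues(fit)))

# Apply the |studentized residual| > 2 rule.
flagged <- which(abs(stud) > 2)
```

The vector flagged holds the row numbers of the points the rule calls outliers; it can be empty.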
There are 3 questions when dealing with outliers: