Covered Today

Today in class we focused on 3 things: Diagnostic Plots, Transformations, and (briefly) Outliers.

Diagnostic Plots

  1. First we have the Residuals vs Fitted plot. It has our fitted values on the x-axis and the residuals on the y-axis. This is the most important diagnostic plot to look at. It will tell us if we are correct in fitting a linear model, or if we should try a quadratic one. It also tells us if there is heteroscedasticity and if there are outliers. If there are issues with this plot, we don’t even have to spend the time to look at the other ones; we can go ahead and try a transformation/quadratic to get a better fit.
  2. Secondly we have the Normal Q-Q plot. R plots the quantiles of our standardized residuals against the quantiles we would expect from a normal distribution, and draws a reference line. The goal is to have the data points on the line; that would fulfill our assumption that the residuals follow a normal distribution.
  3. Next we have the Scale-Location plot. It has our fitted values on the x-axis and the square root of the absolute value of our standardized residuals on the y-axis. The square root reduces the skew of the data and makes it easier to see trends in the spread.
  4. Lastly we have the Residuals vs Leverage plot, which shows Cook’s distance. Cook’s distance measures how much influence each data point has on the fitted model, that is, how much the estimated beta coefficients would change if that point were left out.

I will talk about how to interpret these plots later.
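
As a quick sketch of how to get all four plots on one page, here is an example on R’s built-in cars data (a stand-in, not a data set from class). Calling plot() on a fitted lm object draws the four diagnostic plots in the order listed above, and par(mfrow = c(2, 2)) arranges them in a 2-by-2 grid.

```r
# Stand-in example: the built-in cars data, not the class data
fit <- lm(dist ~ speed, data = cars)

# Show all four diagnostic plots in a 2 x 2 grid
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)  # restore the previous plotting layout

# To look at just one plot, pick it with `which`
plot(fit, which = 1)  # 1 = residuals vs fitted
```

Seeing all four side by side makes it easier to compare models before and after a transformation.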

Transformations

The goal of transformations is to improve our diagnostic plots. There are 3 common transformations:

  1. Square root of one or both sides of the equation. This helps with heteroscedasticity.
  2. Log of one or both sides. This helps equalize the error variance (reduce heteroscedasticity) and straighten out curved trends.
  3. Inverse of one or both sides.

When doing transformations, you want to start at the top of this list and work your way down. Something to remember: simpler is always better. This means that if the regression looks just as good taking the square root of one side instead of both sides, choose the model that transforms only one side. The simpler the model is, the easier it is to interpret and to explain to other people. You can also apply different transformations to each side, but for the same reason you want to stay away from that option unless it is really necessary. Having different transformations on each side makes the model really hard to explain and interpret.
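
Here is a minimal sketch of how the three transformations can be written directly inside a model formula. The data are simulated (made up for illustration, not from class), and the names d, fit_sqrt, fit_log, and fit_inv are my own. One thing to watch: on the right-hand side of a formula, / is read as a formula operator rather than division, so the inverse of a predictor needs to be wrapped in I().

```r
# Simulated data, made up for this sketch (not class data)
set.seed(1)
x <- runif(50, min = 1, max = 100)
y <- exp(0.5 + 0.75 * log(x) + rnorm(50, sd = 0.2))
d <- data.frame(x = x, y = y)

# 1. Square root of the response only
fit_sqrt <- lm(sqrt(y) ~ x, data = d)

# 2. Log of both sides
fit_log <- lm(log(y) ~ log(x), data = d)

# 3. Inverse of both sides; I() makes R treat 1/x as arithmetic
fit_inv <- lm(I(1 / y) ~ I(1 / x), data = d)

# Compare the residuals vs fitted plot for each candidate
op <- par(mfrow = c(1, 3))
plot(fit_sqrt, which = 1)
plot(fit_log, which = 1)
plot(fit_inv, which = 1)
par(op)
```

Writing the transformation in the formula keeps the original columns untouched, so it is easy to move down the list one step at a time.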

Example of Transformations!

We will use brain weight/body weight data.

NEW: we will use the function with() to make plots. Sometimes a script works with multiple data sets, so instead of attaching each one and possibly masking some variables, with() lets us refer to the columns of one data set for a single call. I’ll use a different style of code for each transformation to show the different options.
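
Before the real data, here is a tiny sketch of what with() does. The data frame toy and its columns are hypothetical, made up just for this illustration.

```r
# Hypothetical toy data set, made up for this sketch
toy <- data.frame(height = c(150, 160, 170, 180),
                  weight = c(50, 58, 69, 80))

# with() evaluates the expression using the columns of `toy`,
# so we can write `weight` instead of `toy$weight`
bmi <- with(toy, weight / (height / 100)^2)

# Nothing was attached, so `toy` is not on the search path
"toy" %in% search()  # FALSE
```

Because nothing stays attached afterwards, a second data set with the same column names cannot be masked by accident.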

library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data("brains")
head(brains)
##            BrainWt  BodyWt
## Arctic_fox  44.500   3.385
## Owl_monkey  15.499   0.480
## Beaver       8.100   1.350
## Cow        423.012 464.983
## Gray_wolf  119.498  36.328
## Goat       114.996  27.660
with(brains, plot(BrainWt ~ BodyWt))

This original plot has a couple of outliers and a big clump at the smaller values on both the x and y axes. We can start applying transformations!

Square Root

First, we will take the square root of just the left-hand side, and then of both sides to see if that helps.

with(brains, plot(sqrt(BrainWt)~BodyWt))

with(brains, plot(sqrt(BrainWt)~sqrt(BodyWt)))

The square root of the y axis helped a tad, but we see a better result by taking the square root of both sides. However, the relationship is still not as linear as we want. We can continue to do more transformations.

Log

logbrain <- with(brains, log(BrainWt))
logbody <- with(brains, log(BodyWt))
mod1 <- lm(logbrain ~ BodyWt, data = brains)
mod2 <- lm(logbrain ~ logbody)
plot(mod1)

The first graph is the residuals vs fitted. This is not good; the residuals are not evenly spread out and there are still a lot of outliers. We don’t even have to look at the other plots; we can reject this model right away.

plot(mod2)

plot(logbrain ~ logbody)

Looking at the residuals vs fitted, this looks so much better! There aren’t really outliers, and there is little evidence of curvature in our data (seen by the red line). Our Q-Q plot looks pretty good; for the most part our data points stay along the line. Our scale-location has a slight curve, but still not bad. Lastly, our Cook’s plot looks good as well. If a data point were outside the dotted lines (the Cook’s distance contours), it would have too much influence on the fitted equation. Here, however, we have all the data points in between the dotted lines.

Because our diagnostic plots looked so good, I plotted logbrain against logbody. Looking at our data points, they finally follow a linear trend! The outliers have been pulled in and the data points are all going in the same direction.
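
The check on the Cook’s plot can also be done with numbers. This is a minimal sketch on the built-in cars data (a stand-in, since not everyone will have alr3 installed); the dotted contours that plot() draws for an lm object sit at Cook’s distances of 0.5 and 1 by default, so the sketch looks for points past 0.5.

```r
# Stand-in example: the built-in cars data, not the brains data
fit <- lm(dist ~ speed, data = cars)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# The three most influential points
head(sort(cd, decreasing = TRUE), 3)

# Observations beyond the first dotted contour, if any
which(cd > 0.5)
```

If which() returns nothing, every point sits inside the dotted lines, which is the situation described above for the log-log model.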

Inverse

For this method we attach the data set, so its columns can be used directly by name.

attach(brains)
inverbr <- 1/BrainWt
inverbod <- 1/BodyWt
mod3 <- lm(inverbr ~ BodyWt)
plot(mod3)

When just applying the inverse to the y value, it does not help our residuals. We can move right on, and see if the inverse of both sides helps.

mod4 <- lm(inverbr ~ inverbod)
plot(mod4)

Taking the inverse of both sides is a little better than just the one, but still not good.

Concluding Transformations

After looking at all the different models, taking the log of both sides produced the best results!

Outliers

As we know, outliers are data points that are well separated from the rest of the data. Running our diagnostic plots, they can be seen in the first (residuals vs fitted) and third (scale-location) plots. There is a rule of thumb to categorize outliers: if the absolute value of the studentized residual is greater than 2, then the point is an outlier. Remember: a studentized residual is the residual divided by its standard error, residual / SE(residual).
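
Here is a minimal sketch of that rule, again on the built-in cars data as a stand-in. I am assuming rstandard() is the version of the residual we want: it divides each residual by its own estimated standard error. (R also has rstudent(), which re-estimates the error with each point left out.)

```r
# Stand-in example: the built-in cars data, not class data
fit <- lm(dist ~ speed, data = cars)

# Residual divided by its estimated standard error
r <- rstandard(fit)

# The same thing by hand: SE(residual_i) = s * sqrt(1 - h_ii)
s <- summary(fit)$sigma
r_by_hand <- resid(fit) / (s * sqrt(1 - hatvalues(fit)))
all.equal(r, r_by_hand)

# Apply the rule of thumb: |r| > 2 flags a possible outlier
flagged <- which(abs(r) > 2)
cars[flagged, ]
```

The flagged rows are candidates to check against the questions below, not points to delete automatically.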

There are 3 questions when dealing with outliers:

  1. Was the data point recorded incorrectly?
  2. If it is correct, why is this an outlier?
  3. If the point is correct, are we missing a predictor that could explain the trend?