In our most recent class, we learned about diagnostic plots. After we fit a model, we can create plots to decide whether or not the model is a good fit.
The first plot we look at is a residual plot (residuals vs. fitted values). This helps us decide 1) if the mean function E(Y|X = x) = x'beta is appropriate, 2) if there is homoscedasticity (constant error variance), and 3) if we have outliers.
The second plot we create is a qqnorm plot. This plots the quantiles of the standardized residuals vs. the theoretical quantiles of the normal distribution, so we can check the normality assumption. Standardized residuals rescale each raw residual by its estimated standard deviation, r_i = e_i / (sigma_hat * sqrt(1 - h_ii)); this corrects for the fact that raw residuals at extreme x-values (high leverage h_ii) have smaller variance.
The third plot we create is the scale-location plot, which plots the square root of the absolute standardized residuals vs. the fitted values. Taking the square root reduces the skew of the values and makes trends in the spread of the residuals (i.e., non-constant variance) easier to see.
The last plot we look at uses Cook's distance, which measures each data point's influence on the estimated coefficients beta hat; R draws it as residuals vs. leverage with Cook's distance contours.
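To make these quantities concrete, here is a minimal sketch (my own example using R's built-in cars data, not our class data) of what each plot is actually computing:
fit <- lm(dist ~ speed, data = cars)
e  <- residuals(fit)       # raw residuals (plot 1: residuals vs. fitted values)
r  <- rstandard(fit)       # standardized residuals (plot 2: qqnorm)
sl <- sqrt(abs(r))         # sqrt of |standardized residuals| (plot 3: scale-location)
d  <- cooks.distance(fit)  # Cook's distance (plot 4: influence on beta hat)
head(cbind(fitted = fitted(fit), e, r, sl, d))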
If we want to automatically create all of these plots, we first create a model and then use the plot function on it. In this example we will use a data set containing the body weights and brain weights of a sample of animals.
library(alr3)  # provides the brains data set
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data(brains)    # brain weight and body weight for a sample of animals
attach(brains)  # so we can refer to BrainWt and BodyWt directly
head(brains)
## BrainWt BodyWt
## Arctic_fox 44.500 3.385
## Owl_monkey 15.499 0.480
## Beaver 8.100 1.350
## Cow 423.012 464.983
## Gray_wolf 119.498 36.328
## Goat 114.996 27.660
mymod <- lm(BrainWt ~ BodyWt)  # simple linear regression of brain weight on body weight
plot(mymod)                    # produces the diagnostic plots described above
The plot function spits out all four of the previously mentioned plots in order. In the first plot, we want the red line to stay close to the horizontal gray dotted line at zero. In the second plot, we want our data points to follow the dotted line. In the third plot, we want the red line to be roughly horizontal (a flat line means constant variance). In the fourth plot, we want all of our data points to fall inside the dotted curved lines (the Cook's distance contours).
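By default R draws these plots one at a time; we can also put them all on one screen, or request a single plot with the which argument (this is standard plot.lm behavior, not specific to our data):
par(mfrow = c(2, 2))    # 2 x 2 grid so all four diagnostics show at once
plot(mymod)
par(mfrow = c(1, 1))    # back to one plot per screen
plot(mymod, which = 4)  # Cook's distance by itself; which = 1, 2, 3, 5 pick the other plots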
If we want to try to improve our plots, we can use transformations. A few common transformations are the square root (helps with heteroscedasticity), the log (helps straighten a curved trend), and the inverse, as follows:
tmod <- lm(sqrt(BrainWt) ~ sqrt(BodyWt))  # square-root transformation
plot(tmod)
tmod2 <- lm(log(BrainWt) ~ log(BodyWt))  # log transformation
plot(tmod2)
# note the I() wrapper: "/" has a special (nesting) meaning inside an R formula,
# so without I() the predictor drops out and only an intercept gets fit
tmod3 <- lm(I(1/BrainWt) ~ I(1/BodyWt))  # inverse transformation
plot(tmod3)
Looking at these plots while keeping the desired goals for each in mind, it looks like the log transformation gave us the best results.
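Beyond eyeballing the plots, we can also flag potential outliers and influential points numerically. A minimal sketch using the log model from above (the cutoffs are common rules of thumb, not something we covered in class):
r <- rstandard(tmod2)       # standardized residuals
d <- cooks.distance(tmod2)  # Cook's distances
which(abs(r) > 2)           # points with unusually large residuals
which(d > 4 / length(d))    # points with unusually large influence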
We also began Section 5.4 in class, which covers outliers and influential points. To deal with outliers, we first want to identify them from the plots of our model. We then ask these questions to decide what to do with them: