Sections 5.2-5.4

On Tuesday we covered 3 main topics from Chapter 5:

-Diagnostic plots

-Transformation of variables

-Outliers

Diagnostic Plots

We started off by covering diagnostic plots, which are used for residual analysis in linear regression. In class we used the brains dataset from the alr3 package, which has brain and body weights (BrainWt and BodyWt respectively) for 62 species of animals. Before creating the diagnostic plots, we can plot BrainWt against BodyWt. This will help us establish the trend between the two variables.

library(alr3)
## Loading required package: car
data(brains)
names(brains)
## [1] "BrainWt" "BodyWt"
attach(brains)
with(brains,plot(BrainWt~BodyWt))

We can then set up the model and use the plot function to obtain the four diagnostic plots. The plots are as follows:

  1. Residuals vs. Fitted Values: With this plot, it is ideal to have no trend and an equal spread of the residuals around zero. What we want to look out for is curvature, which could indicate that our “mean function” is missing something, such as an interaction or a polynomial term. We also want to look out for outliers and heteroscedasticity, but keep in mind that curvature is the most important issue.

  2. Normal QQ plot: The assumption checked by this plot is that the errors follow a normal distribution. It uses standardized residuals, which divide each residual by its estimated standard deviation so that residuals at extreme x values (high leverage) are on the same scale as the rest. We want the points to follow the dotted line, since that line represents the normal distribution.

  3. Scale Location Plot: The Scale Location plot plots the square root of the absolute standardized residuals against the fitted values. It is used to look for a trend in the spread of our residuals (heteroscedasticity); taking the square root reduces the skewness of the absolute residuals.

  4. Cook’s Distance: Our final plot is the Residuals vs. Leverage plot, which shows Cook’s Distance, a measure of the influence each data point has on the estimated regression coefficients, \(\widehat{\beta}\). It draws dashed contour lines of Cook’s Distance (at 0.5 and 1 by default), and we want the points to fall inside them. A point outside the contours has too much influence on \(\widehat{\beta}\), which is not what we want.

mod<-lm(BrainWt~BodyWt,data=brains)
plot(mod)
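The quantities behind the four plots can also be pulled out of a fitted model directly, which helps when a point needs a closer look. The following is a minimal sketch on made-up data (the names sim, fit and diag_vals are hypothetical, not from class), using only base R extractor functions:

```r
# Made-up data (not the brains dataset): a straight-line trend plus noise.
set.seed(1)
sim <- data.frame(x = runif(40, 0, 10))
sim$y <- 3 + 2 * sim$x + rnorm(40)

fit <- lm(y ~ x, data = sim)

diag_vals <- data.frame(
  fitted    = fitted(fit),                # x-axis of plots 1 and 3
  resid     = resid(fit),                 # y-axis of plot 1
  std_resid = rstandard(fit),             # used in plots 2 and 4
  scale_loc = sqrt(abs(rstandard(fit))),  # y-axis of plot 3
  leverage  = hatvalues(fit),             # x-axis of plot 4
  cooks_d   = cooks.distance(fit)         # contours in plot 4
)

head(diag_vals)
```

The same calls should work on mod from above. A single panel can also be drawn on its own with the which argument, for example plot(mod, which = 1) for Residuals vs. Fitted.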

For the residuals vs. fitted plot, most of the points are clumped together around zero, but there are a few outliers (Human, Asian elephant, and African elephant). Among the rest of the points there doesn’t seem to be any noticeable curvature or trend. For the Normal QQ plot, most of the points follow the line that indicates a normal distribution, except for the same few outliers as in the plot above. On the scale location plot, it is hard to determine a trend due to the presence of outliers. Finally, for Cook’s Distance, the Asian elephant and African elephant are the two data points that are outside of the bounds of the dashed lines. This indicates that these two points influence the beta coefficient too much.
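Reading influential points off the plot can be double-checked numerically. Here is a small sketch on made-up data (the data frame toy and the 0.5 cutoff are assumptions for illustration; 0.5 is the first contour that plot() draws by default):

```r
# Made-up data: 20 points on a clear line, plus one planted point that has
# an extreme x value and does not follow the trend.
x <- c(1:20, 100)
y <- c(2 * (1:20) + 0.5 * sin(1:20), 0)
toy <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = toy)
d <- cooks.distance(fit)

# Rows whose Cook's distance exceeds 0.5.
flagged <- which(d > 0.5)
flagged
round(d[flagged], 2)
```

Applied to the brains model, which(cooks.distance(mod) > 0.5) should list the rows flagged on the plot, assuming the row names of brains are the species names.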

Transformations

In order to improve the diagnostic plots, we can transform the variables in the regression. The overall goal is to improve the mean function and potentially reduce heteroscedasticity.

We went over 3 different types of transformations:

  1. Square Root transformation

  2. Log transformation

  3. Inverse transformation

The log transformation requires that \(y>0\) and the square root transformation requires that \(y \geq 0\). Both help with heteroscedasticity and could also help with the mean function. The inverse transformation could work well if some understanding of the topic tells you x and y are inversely related. Keep in mind that the square root and log transformations are generally preferred over the inverse transformation, since the inverse is more complicated.
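To see how a transformation can help with heteroscedasticity, here is a minimal sketch on made-up data where the scatter grows with x. Everything in it is an assumption for illustration: the data frame toy, the helper spread_trend, and the use of the correlation between absolute residuals and fitted values as a rough measure of a fan shape.

```r
# Made-up data with multiplicative error: the scatter around the trend grows
# with x. The "error" alternates between two fixed values so that the example
# is fully reproducible.
x <- 1:50
err <- exp(0.3 * rep(c(-1, 1), times = 25))
toy <- data.frame(x = x, y = 2 * x * err)

fit_raw <- lm(y ~ x, data = toy)
fit_log <- lm(log(y) ~ log(x), data = toy)

# Crude check: do the absolute residuals grow with the fitted values?
# A correlation near 1 means a strong fan shape.
spread_trend <- function(fit) cor(abs(resid(fit)), fitted(fit))

spread_trend(fit_raw)
spread_trend(fit_log)

# log() needs y > 0, sqrt() needs y >= 0.
stopifnot(all(toy$y > 0))
```

The first value should come out much closer to 1 than the second, which is the numerical counterpart of a fan shape disappearing from the residuals vs. fitted plot.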

We can see how the results change under these transformations with our brains dataset. First, we can look at transforming just the response variable. We can use par(mfrow = c(2,2)) to get our diagnostic plots organized in 2 rows and 2 columns.

mod1<-lm(sqrt(BrainWt)~BodyWt,data=brains)
par(mfrow = c(2,2))
plot(mod1)

It looks like there was no real improvement over the original model. There still appears to be no specific trend, but outliers still seem to be an issue.

mod2<-lm(log(BrainWt)~BodyWt,data=brains)
par(mfrow = c(2,2))
plot(mod2)

The residuals vs. fitted and scale location plots show the points in nearly a straight vertical line, which means that there really isn’t an equal spread, so this isn’t very helpful. There are fewer points outside the bounds for Cook’s distance, so that part is better, but overall this transformation doesn’t seem helpful.

mod3<-lm(1/(BrainWt)~BodyWt,data=brains)
par(mfrow = c(2,2))
plot(mod3)

The residuals vs. fitted and scale location plot also have points that appear to create a vertical straight line, which means that there is still a lack of equal spread. Also, the normal QQ plot is much farther off the normal distribution than prior to the transformation, which poses a problem (assumption of normality is not met). Most of the points lie in between the bounds of Cook’s distance, so that improved a bit.

Now we can try transforming the predictor variable as well to see if that helps.

mod11<-lm(sqrt(BrainWt)~sqrt(BodyWt),data=brains)
par(mfrow = c(2,2))
plot(mod11)

For this one the spread looks a lot better for the residuals, although the Normal QQ plot still has several points that are a ways off of the normal line, and there is one clear point outside of the bounds for Cook’s Distance.

mod12<-lm(log(BrainWt)~log(BodyWt),data=brains)
par(mfrow = c(2,2))
plot(mod12)

This one looks much better. The residuals have a much more even spread, and in the Normal QQ plot the data follows the normal line very closely. Also, all of the points lie in the bounds for Cook’s Distance, so this is a great improvement.
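One way to see why the log-log model can fit a relationship like this so well is to back-transform it. As a sketch, writing \(\beta_0\) and \(\beta_1\) for the intercept and slope of mod12 and \(\varepsilon\) for the error term (notation assumed here, not from class):

\[
\log(\text{BrainWt}) = \beta_0 + \beta_1 \log(\text{BodyWt}) + \varepsilon
\quad\Longleftrightarrow\quad
\text{BrainWt} = e^{\beta_0}\,\text{BodyWt}^{\beta_1}\,e^{\varepsilon}
\]

So a straight line on the log-log scale corresponds to a power-law relationship on the original scale, with an error that multiplies the mean rather than adding to it. Errors of that kind grow with the size of the animal, which would show up as unequal spread in the untransformed model.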

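mod13<-lm(1/(BrainWt)~I(1/(BodyWt)),data=brains)
par(mfrow = c(2,2))
plot(mod13)

The predictor has to be wrapped in I() here. Written as lm(1/(BrainWt)~1/(BodyWt)), R reads the / on the right-hand side as a formula operator rather than division and fits an intercept-only model, which is what produced this message:

## hat values (leverages) are all = 0.01612903
##  and there are no factor predictors; no plot no. 5

Every leverage equals 1/62 = 0.01612903, and the leverage plot is skipped. The plots from that fit were problematic because there is a lack of equal spread, and the Normal QQ is far off of the line, more so than the original model.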
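A quick check on made-up data shows how R reads 1/x on the right-hand side of a formula, with and without I(). The data frame toy and the model names are hypothetical:

```r
# Made-up data for illustration only.
toy <- data.frame(x = 1:10)
toy$y <- 5 / toy$x + 0.1 * cos(toy$x)

fit_slash <- lm(y ~ 1/x, data = toy)     # "/" read as a formula operator
fit_I     <- lm(y ~ I(1/x), data = toy)  # "/" read as ordinary division

names(coef(fit_slash))   # only the intercept
names(coef(fit_I))       # intercept and I(1/x)

# With only an intercept, every point has the same leverage, 1/n.
unique(round(hatvalues(fit_slash), 8))
```

On the left-hand side the response is evaluated as ordinary arithmetic, so 1/(BrainWt) there does not need I().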

Overall, mod12<-lm(log(BrainWt)~log(BodyWt)) is the best model since it reduced the outliers, created a more even spread, and left no data point with too much influence on the beta coefficient.
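Candidate transformations can also be compared side by side with a couple of numeric summaries instead of only by eye. This is a rough sketch on simulated data, not the brains dataset. The data frame sim, the power-law form used to generate it, and the two summaries chosen are all assumptions for illustration.

```r
# Made-up data following a power law with multiplicative noise.
set.seed(42)
n <- 62
sim <- data.frame(body = exp(runif(n, -3, 8)))
sim$brain <- 10 * sim$body^0.75 * exp(rnorm(n, sd = 0.3))

candidates <- list(
  raw  = brain ~ body,
  sqrt = sqrt(brain) ~ sqrt(body),
  log  = log(brain) ~ log(body)
)

# Fit one candidate and return its largest Cook's distance and its largest
# absolute standardized residual.
summarise_fit <- function(f) {
  fit <- lm(f, data = sim)
  c(max_cooks_d   = max(cooks.distance(fit)),
    max_abs_resid = max(abs(rstandard(fit))))
}

comparison <- t(sapply(candidates, summarise_fit))
round(comparison, 2)
```

Smaller values in both columns point in the same direction as the plots, but summaries like these are a supplement to the diagnostic plots, not a replacement for them.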

Outliers and Influential Points

Finally, we discussed outliers and influential points. An outlier can be influential, but not necessarily. A point that is unusual in the x direction has high leverage, but it won’t necessarily affect the fitted regression line if its response follows the trend of the other points. A point that is unusual in the y direction has a large residual, and it becomes influential when it also has enough leverage to pull the regression line toward itself. How influential a point is can be determined using Cook’s Distance.
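Cook’s Distance makes the link between the two directions explicit. In the usual textbook form (the notation is an assumption here: \(r_i\) is the standardized residual of point \(i\), \(h_{ii}\) is its leverage, and \(p\) is the number of regression coefficients, which is 2 for a simple linear regression):

\[
D_i = \frac{r_i^2}{p}\cdot\frac{h_{ii}}{1-h_{ii}}
\]

The first factor is large for a point that is far from the line in the y direction, and the second is large for a point that is unusual in the x direction, so \(D_i\) is largest when both happen at once.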

Not all outliers cause problems or need attention. To judge the effect of an outlier, we look at studentized residuals. A studentized residual is calculated by dividing the residual by its standard error. If the absolute value of the studentized residual is greater than 2, then we need to pay special attention to that point.
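The rule of thumb can be sketched in R as follows. This assumes the internally studentized version of the residual, which is what rstandard() returns; rstudent() gives the externally studentized version instead. The data are made up, with one response deliberately shifted away from the trend.

```r
# Made-up data with one planted outlier in the y direction.
toy <- data.frame(x = 1:30)
toy$y <- 1 + 0.5 * toy$x + 0.5 * sin(toy$x)
toy$y[15] <- toy$y[15] + 5   # shift one response far from the trend

fit <- lm(y ~ x, data = toy)

e     <- resid(fit)
h     <- hatvalues(fit)
sigma <- summary(fit)$sigma   # residual standard error

# Residual divided by its estimated standard error.
r_by_hand <- e / (sigma * sqrt(1 - h))

# Agrees with R's built-in internally studentized residuals.
all.equal(r_by_hand, rstandard(fit))

# Points that deserve a closer look under the |r| > 2 rule of thumb.
which(abs(r_by_hand) > 2)
```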

In terms of dealing with outliers, we can delete them if they are clearly wrong. However, if the data were correctly measured and entered, then we need to figure out what caused the response to be so far from the rest. One thing we need to question is whether or not a predictor is missing from the model, which could explain the presence of outliers.