Sections 5.2-5.4

Diagnostic Plots

At the beginning of class today we talked about diagnostic plots for linear regression models. In R, when we create a model and pass its name to the plot command, R outputs four different plots that help us analyze our model.

The first plot is the residual plot, which plots the fitted values of our model on the x-axis and the residuals on the y-axis. When looking at the residual plot, the most important thing to look for is curvature. We don’t want to see any curvature (or trend) in our residuals, because a trend indicates that our “mean function” is missing something, such as an interaction term. Polynomial regression can also be used to fix this issue. It’s important that there is no trend because if a trend exists, our model will not be useful for making point predictions. We can also use this plot to check for heteroscedasticity and outliers, which affect our confidence and prediction intervals, but a trend in the residuals is the most important thing to look out for.
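As a quick sketch (assuming we’ve already fit a model and named it mod), we can ask R for just this plot by using the which argument of the plot command:

plot(mod, which = 1)  # residuals vs. fitted values only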

The second plot is the normal Q-Q plot, which tells us whether our residuals are normally distributed. If the residuals are normally distributed, they will follow the straight line in the plot. This plot uses standardized (studentized) residuals, which scale each residual by its standard error and account for the leverage of extreme x-values.
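Again as a sketch (with a fitted model named mod), we can view just the Q-Q plot, or build one by hand from the standardized residuals:

plot(mod, which = 2)    # normal Q-Q plot only
qqnorm(rstandard(mod))  # same idea, constructed directly
qqline(rstandard(mod))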

The third plot is the scale-location plot. This plot is similar to the residual plot, but all of the residual values are positive: you can think of the residuals below 0 in the first plot as being reflected across the line y = 0 (what is actually plotted is the square root of the absolute standardized residuals). As with the first plot, we want to check whether there is any trend in the residuals; this version is especially useful for spotting non-constant variance. Taking the square root also reduces the skewness of the plotted values.
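As before (assuming a fitted model named mod), this plot can be viewed on its own:

plot(mod, which = 3)  # scale-location plot only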

The last plot shows the standardized residuals against leverage, along with contours of Cook’s Distance, which measures the influence that each data point has on the regression coefficients. Ideally we want this value to be low for every point, because we want each point to carry roughly equal weight in determining the coefficients. Leverage is on the x-axis, the standardized residuals are on the y-axis, and Cook’s distance is drawn as dashed red lines. It’s good if the data points fall inside the red lines; if any points fall outside them, they are having too much of an impact on the coefficients.
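As a sketch (again assuming a fitted model named mod), we can view this plot on its own and also pull out the Cook’s distances and leverages directly:

plot(mod, which = 5)  # residuals vs. leverage, with Cook's distance contours
cooks.distance(mod)   # Cook's distance for each observation
hatvalues(mod)        # leverages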

Transformations

If we notice a trend in our residuals, heteroscedasticity, or other issues with our data, we can try transforming one or more of our variables to produce a better model.

Square Root

One way we can transform our model is by using the square-root function. We can take the square-root of our response and/or one or more of our predictors. Then we can use the diagnostic plots to see if any of these transformations improved our model.

Log Function

Another way we can transform our model is by using the log function. We can take the log of our response and/or one or more of our predictors. Then we can use the diagnostic plots to see if any of these transformations improved our model.

Inverse Function

Finally, we can transform our model by taking the inverse of our response and/or one or more of our predictors. Then we can use the diagnostic plots to see if any of these transformations improved our model. However, we should try the square-root and log functions before we use the inverse function, because the first two transformations are easier to interpret and explain.
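As a rough sketch of what these calls might look like (assuming a generic response y and predictor x stored in a data frame named dat):

mod_sqrt <- lm(sqrt(y) ~ x, data = dat)      # square root of the response
mod_log  <- lm(log(y) ~ log(x), data = dat)  # log of the response and predictor
mod_inv  <- lm(I(1/y) ~ I(1/x), data = dat)  # inverse; I() keeps / as arithmetic division on the right-hand side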

In the final section of this R guide there is an example of model transformations using the stopping dataset from the alr3 library.

Outliers and Influential Points

At the end of class we discussed outliers and influential points. We learned that outliers in the x-direction may not be influential because they could still align with our model. However, outliers in the y-direction are influential because they will have a higher weight on our coefficients. In other words, they will have a high Cook’s distance.

To find outliers in the y-direction we compute the studentized residuals, which are equal to each residual divided by its estimated standard error.
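As a sketch (assuming a fitted model named mod), we can compute these in R and flag points whose studentized residuals are large in absolute value; a cutoff of about 2 is a common rule of thumb:

rstudent(mod)                  # studentized residuals
which(abs(rstudent(mod)) > 2)  # possible outliers in the y-direction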

Once we’ve identified our outliers, there are several ways in which we can deal with them. In general, we can’t just delete outliers, because they may carry important information about our data. However, if a value for a data point was clearly measured or entered incorrectly (a negative weight, an 8-year-old child with 5 children, etc.), then we can safely delete that point from our data. If the data was correctly measured and entered, though, we need to use our knowledge of the subject we’re studying to try to understand why the outlier exists. One potential issue is that we could be missing a predictor in our model. It’s also possible that the outlier doesn’t actually belong to the proper sample for your data; if that’s the case, you might be able to delete the point and see how doing so affects your model.

Example of Transformations

Using this dataset we will create a model relating speed and stopping distance. Speed will be our predictor, and stopping distance will be our response. The argument is that cars traveling at a higher speed will require more distance to stop.

Let’s load and attach our data.

library(alr3)
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data(stopping)
attach(stopping)

Let’s create our model. Use the par(mfrow = c(2,2)) command to see all four diagnostic plots in a 2x2 grid.

mod11 <- lm(Distance ~ Speed)
par(mfrow = c(2,2))
plot(mod11)

We see that our model has a trend in the residuals and is heteroscedastic. Also, the residuals aren’t totally normally distributed.

Transformations

Let’s try some transformations to improve our model.

Square-root of our response:

mod12 <- lm(sqrt(Distance) ~ Speed)
par(mfrow = c(2,2))
plot(mod12)

Pretty good! The trend in our residuals is practically nonexistent, and both the homoscedasticity and the normality of the residuals improved.

Square-root of our response and predictor:

mod12a <- lm(sqrt(Distance) ~ sqrt(Speed))
par(mfrow = c(2,2))
plot(mod12a)

This model is worse than just the square-root of our response.

Log of our response:

mod13 <- lm(log(Distance) ~ Speed)
par(mfrow = c(2,2))
plot(mod13)

This is also not as good as our first transformation.

Log of our predictor:

mod13a <- lm(Distance ~ log(Speed))
par(mfrow = c(2,2))
plot(mod13a)

This model has a really strong trend in the residuals, and they aren’t very close to normally distributed.

Inverse of the response:

mod14 <- lm((1/Distance) ~ Speed)
par(mfrow = c(2,2))
plot(mod14)

This model still isn’t as good as the very first transformation.

Inverse of the response and predictor:

mod14a <- lm((1/Distance) ~ (1/Speed))
par(mfrow = c(2,2))
plot(mod14a)
## hat values (leverages) are all = 0.01612903
##  and there are no factor predictors; no plot no. 5

This model clearly does not work as intended. One reason is that the / operator has a special meaning inside a model formula, so (1/Speed) on the right-hand side is not treated as arithmetic division and Speed is never actually inverted; the model ends up fitting essentially just an intercept, which is why R reports that all of the hat values (leverages) are equal and skips the residuals-vs-leverage plot. To use the inverse of the predictor we would need to write I(1/Speed).

Overall, transforming the model by taking the square root of the response produced the best model for this dataset.