Concepts

Today we discussed three different topics. The first was diagnostic plots and their interpretation. These diagnostic plots included the residual plot, the QQ plot, the scale-location plot, and the Cook’s distance plot. Together, these plots help us judge how well our mean function, or the \(X\beta\) portion of our regression equation, fits our data and whether the model’s other assumptions are reasonable.

The second topic we discussed today was transformations, that is, transforming our response variable in order to create a more appropriate mean function. We discussed three main transformations: square root, log, and inverse.

The final topic discussed today was the impact of outliers and influential points and how we can deal with them. An outlier is any point that is well-separated from the rest. However, how to deal with an outlier depends heavily on the data and on subject-matter knowledge, so it is difficult to give general advice. Nevertheless, I will briefly discuss what can be done about outliers later on.

Using the Concepts

To illustrate these concepts, I’ll be presenting an example using the data set stopping from the R package alr3. We will be predicting a car’s stopping distance from its speed.

Let’s start out by creating the untransformed model.

library(alr3)  # provides the stopping data set
mod.orig <- lm(Distance ~ Speed, data = stopping)

Diagnostic Plots

Now, we’ll need to look at the model’s diagnostic plots to determine whether or not we need to transform the data in any way. To do this, we can use the function plot(model).

plot(mod.orig)

The very first plot we see is the residual plot. This is the most important plot of the four we will see because it helps us determine whether or not the mean function is appropriate for our data. Depending on what the residual plot tells us, we may or may not continue to look at the other three diagnostic plots. We can see from the plot that our mean function is probably not appropriate because there is a trend in our residuals, illustrated by the red line. Consequently, we will not look at the other three diagnostic plots and, instead, we will begin transforming our data.
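If we only want the residual plot, we can ask for it directly instead of paging through all four. The snippet below is a minimal sketch, not the code from class: so that it runs without extra packages, it assumes R’s built-in cars data set (columns speed and dist) as a stand-in for stopping, and the object names are my own.

```r
# Assumption: the built-in `cars` data stands in for alr3's `stopping`.
fit <- lm(dist ~ speed, data = cars)

# which = 1 requests only the residuals-vs-fitted panel of plot.lm()
plot(fit, which = 1)

# The same picture drawn by hand, to show what the panel contains
res <- resid(fit)
fv <- fitted(fit)
plot(fv, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)               # reference line at zero
lines(lowess(fv, res), col = "red")  # smooth line that reveals any trend
```

A smooth line that stays close to zero suggests the mean function is adequate; a clear bend suggests it is not.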

Transformations

Usually, one would begin with either a square root or a log transformation, since those are the easiest to interpret, and would try an inverse transformation only if there was evidence that the predictor and response were inversely related. However, for the sake of demonstrating all three kinds of transformations, I will put them in a different order.

The first transformation we will make is an inverse transformation on our response variable.

inv.mod <- lm((1/Distance) ~ Speed, data = stopping)
plot(inv.mod)

Looking at our first plot, the residual plot, we again see that our residuals have a trend so we can conclude that the inverse transformation is not appropriate for improving our mean function. Thus, we can skip the rest of the diagnostic plots and move on to our next transformation.

Our next transformation will be the log transformation.

log.mod <- lm(log(Distance) ~ Speed, data = stopping)
plot(log.mod)

Once again, we see from our residual plot that the residuals have a trend so a log transformation does not improve our mean function. We will skip the rest of the diagnostic plots and move on to our last transformation.

Our final transformation is the square root transformation.

sqrt.mod <- lm(sqrt(Distance) ~ Speed, data = stopping)
plot(sqrt.mod)

Here, our residual plot shows no trend, so it seems that the square root transformation is appropriate for improving our mean function. This means that we should continue interpreting our diagnostic plots.
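Comparing the candidate transformations is easier when their residual plots sit side by side. The following is a sketch under assumptions: it uses the built-in cars data (speed, dist) as a stand-in for stopping, and the list name fits is my own. It makes no claim about which transformation wins for that data.

```r
# Assumption: the built-in `cars` data stands in for alr3's `stopping`.
fits <- list(
  original = lm(dist ~ speed, data = cars),
  inverse  = lm(I(1 / dist) ~ speed, data = cars),
  log      = lm(log(dist) ~ speed, data = cars),
  sqrt     = lm(sqrt(dist) ~ speed, data = cars)
)

# Draw the four residuals-vs-fitted panels in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
for (name in names(fits)) {
  plot(fits[[name]], which = 1, main = name)
}
par(old_par)  # restore the previous plotting layout
```

The transformation whose smooth line stays flattest around zero is the one whose mean function is most appropriate.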

Diagnostic Plots (Continued)

Our residual plot tells us a few other things besides whether there is a trend in our residuals, that is, whether our mean function is appropriate for our data. The residual plot also allows us to detect heteroscedasticity. It appears that the range of our residuals does not change with the fitted value, so we do not see heteroscedasticity. Finally, the residual plot can help us find outliers in a data set. There are no points well-separated from the rest on the plot, so we can say that there appear to be no outliers.

Our next plot is the QQ plot. This plot helps us determine whether our residuals come from a normal distribution. With the exception of a few points in the lower left corner and upper right corner of the plot, our residuals appear to follow a normal distribution so our QQ plot actually looks very good.
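To see what goes into the QQ plot, we can build it ourselves from the standardized residuals. This is a minimal sketch assuming the built-in cars data (speed, dist) in place of stopping; the object names are my own.

```r
# Assumption: the built-in `cars` data stands in for alr3's `stopping`.
fit <- lm(sqrt(dist) ~ speed, data = cars)

# plot.lm() uses standardized residuals in its QQ panel
std_res <- rstandard(fit)

qqnorm(std_res)  # sample quantiles against normal quantiles
qqline(std_res)  # reference line through the first and third quartiles

# The built-in version of the same panel
plot(fit, which = 2)
```

Points that hug the reference line are consistent with normally distributed residuals; systematic departures in the tails are what we look for.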

The third plot is the scale-location plot. This is similar to the residual plot, but it plots the square root of the absolute standardized residuals instead of the raw residuals. This transformation makes it easier to detect trends in the spread of the residuals, that is, heteroscedasticity. Though there does seem to be a very slight upward trend in this plot, it is slight enough that we won’t worry too much about it.
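The quantity on the vertical axis of the scale-location plot can be computed by hand. The sketch below assumes the built-in cars data (speed, dist) as a stand-in for stopping, and the name scale_loc is my own.

```r
# Assumption: the built-in `cars` data stands in for alr3's `stopping`.
fit <- lm(sqrt(dist) ~ speed, data = cars)

# Square root of the absolute standardized residuals
scale_loc <- sqrt(abs(rstandard(fit)))

plot(fitted(fit), scale_loc,
     xlab = "Fitted values",
     ylab = "sqrt(|standardized residuals|)")
lines(lowess(fitted(fit), scale_loc), col = "red")  # trend in the spread

# The built-in version of the same panel
plot(fit, which = 3)
```

A roughly flat smooth line suggests constant variance; a steady rise or fall suggests the spread changes with the fitted value.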

The final plot we see is the Cook’s distance plot. This graphs the standardized residuals against their leverage. Contour lines of Cook’s distance are also drawn on the plot as boundaries, and points falling outside them are considered influential. On our plot, we cannot even see the boundaries because no point comes close to them, so the data have no issues with influential outliers.
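Leverage and Cook’s distance can also be pulled out as numbers, which is handy when the plot is crowded. This is a sketch assuming the built-in cars data (speed, dist) in place of stopping; the object names are my own.

```r
# Assumption: the built-in `cars` data stands in for alr3's `stopping`.
fit <- lm(sqrt(dist) ~ speed, data = cars)

lev  <- hatvalues(fit)       # leverage of each observation
cook <- cooks.distance(fit)  # Cook's distance of each observation

# The built-in residuals-vs-leverage panel
plot(fit, which = 5)

# The three observations with the largest Cook's distance
head(sort(cook, decreasing = TRUE), 3)
```

The leverages always sum to the number of estimated coefficients (two here), which is a quick sanity check on the output.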

Outliers

Now, we get to the topic of outliers. First, we define an outlier very generally as any point well-separated from the rest, but a common rule of thumb also flags any point with a studentized residual greater than two or less than negative two. This more specific definition indicates that the point is an outlier in the y direction. A point that is unusual in the x direction instead shows up as having high leverage. Cook’s distance combines residual size and leverage, so a point outside the boundaries in the Cook’s distance plot discussed above would be considered influential.
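The studentized-residual rule is easy to apply in code. The following sketch assumes the built-in cars data (speed, dist) as a stand-in for stopping; the names stud and flagged are my own, and the cutoff of two is only the rule of thumb described above.

```r
# Assumption: the built-in `cars` data stands in for alr3's `stopping`.
fit <- lm(sqrt(dist) ~ speed, data = cars)

# Externally studentized residuals
stud <- rstudent(fit)

# Rule of thumb: flag observations with |studentized residual| > 2
flagged <- which(abs(stud) > 2)

# Inspect the flagged rows (possibly none) alongside their residuals
cbind(cars[flagged, ], studentized = stud[flagged])
```

With many observations, a few values beyond two are expected by chance, so a flag is a prompt to look closer rather than a verdict.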

So, what should we do if we have an outlier?

First, check to see if the data were recorded incorrectly. If they were, correct the observation if possible, or otherwise remove it. However, if the data were recorded correctly, you need to ask yourself why the observation is an outlier. One possible explanation is that the model is missing a predictor that could explain the outliers seen. However, there are many other possible explanations depending on the context of the data.
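If an observation really was recorded incorrectly and cannot be corrected, we can refit without it and see how much the estimates move. This is only a sketch: it assumes the built-in cars data (speed, dist) in place of stopping, and the row number in bad_row is purely hypothetical, chosen for illustration rather than because that row is actually wrong.

```r
# Assumption: the built-in `cars` data stands in for alr3's `stopping`.
fit_all <- lm(sqrt(dist) ~ speed, data = cars)

# Hypothetical: pretend row 49 was found to be a recording error
bad_row <- 49
fit_drop <- lm(sqrt(dist) ~ speed, data = cars[-bad_row, ])

# Compare the coefficient estimates with and without that row
round(cbind(all = coef(fit_all), dropped = coef(fit_drop)), 3)
```

If the coefficients barely change, the observation had little influence on the fit either way.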

Importance of the Topics

These topics are all very important with respect to the other topics we have covered this semester. You can create many models, but if you don’t check their assumptions, you will never know whether the information you get from your model is accurate. This is why understanding diagnostic plots and outliers is so crucial. Additionally, the mean function is the most important part of the regression equation, and a proper model depends on getting it right, so understanding transformations is important.