Today we covered plots, transformations and outliers.

Diagnostic Plots

If we give the plot function a simple linear regression model as the only argument, the output consists of four plots.

The first one plots the residuals against the fitted values, and includes a red line that shows any trends in the residuals. In this plot, we want to see a fairly even spread of the points, and we do not want any trends. If the residual plot shows a trend, then our mean function is incorrect; we may need to add an interaction term or use polynomial regression. This is the most important plot to look at; if we don’t have the correct mean function, our model is not very useful.

The second plot is a normal QQ plot, which plots the standardized residuals against the quantiles of a normal distribution. This is used to check for the normality condition. We want to see the points following a straight line. Note that some deviation is acceptable.

The third plot is the Scale-Location plot. This plots the square root of the absolute value of the standardized residuals against the fitted values of the model. Taking the square root reduces the weight of extreme values, making it easier to spot any trend in the spread of the residuals, which would indicate unequal variance.

The final plot shows the standardized residuals against leverage, with Cook’s distance drawn as dashed contour lines. Cook’s distance is a measure of how much influence each observation in our dataset has on the regression coefficients. We don’t want to have points with a large value for Cook’s distance because it would mean those points carry more weight than the rest. In the plot, we want all of our points to be inside the region bounded by the dashed Cook’s distance contours; any points that are outside of this region have a large influence on the coefficients.
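As a rough sketch of what sits behind the four panels, the same quantities can be pulled out of a fitted model directly. The data here are simulated, and the names x, y, and fit are made up for illustration; they are not from class.

```r
# Simulated data for illustration only (not from class)
set.seed(1)
x <- runif(50, 0, 10)
y <- 3 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

# The four default diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(fit)

# Quantities behind each panel
fv   <- fitted(fit)          # x-axis of panels 1 and 3
res  <- resid(fit)           # panel 1: residuals vs fitted
sres <- rstandard(fit)       # panel 2: standardized residuals for the Q-Q plot
sl   <- sqrt(abs(sres))      # panel 3: scale-location
lev  <- hatvalues(fit)       # panel 4: leverage on the x-axis
cd   <- cooks.distance(fit)  # panel 4: Cook's distance (drawn as contours)
```

Looking at head(sort(cd, decreasing = TRUE)) is one way to list the observations with the largest Cook’s distance instead of reading them off the plot.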

It is worth noting that creating scatterplots of our data should be the very first step; if there is a non-linear relationship between the predictor and the response, we shouldn’t be creating a linear model to begin with.

The residual plot and the normal Q-Q plot are both familiar. The R functions covered in class are all ones we’ve used before.

Transformations

Transformations are useful if we run into a situation where our plots show that the normality and/or equal variance assumptions do not hold. In the past we were able to check for normality and equal variance, but if those assumptions were violated, there wasn’t a whole lot we could do. We covered the square root, log, and inverse transformations (though there are many more).

Both the square root and log transforms can help with heteroscedasticity and lack of normality. The inverse transform is useful if we know from other studies or background information that the two variables we are working with have an inverse relationship with one another.
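A small simulation can show how a log transform evens out the spread. The data below are made up: the noise multiplies the mean, so the variability of the response grows with the predictor on the raw scale but should be roughly constant on the log scale.

```r
# Simulated data with multiplicative errors (made up for illustration)
set.seed(42)
x <- rep(1:3, each = 100)                # three predictor levels, 100 points each
y <- exp(1 + x + rnorm(300, sd = 0.2))   # noise multiplies the mean

# Standard deviation of the response at each level of x
rawSd <- tapply(y, x, sd)        # raw scale
logSd <- tapply(log(y), x, sd)   # log scale
print(rbind(rawSd, logSd))
```

If the raw-scale standard deviations climb with x while the log-scale ones stay close to each other, the log transform is doing its job for the equal variance condition.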

We can take a look at transformations using the stopping dataset from the alr3 library.

library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data(stopping)
attach(stopping)
plot(Distance ~ Speed)

This dataset deals with stopping distances vs speed for cars. The predictor is speed, and the response is stopping distance. In the scatterplot of the data, we can see the trend is not linear. We should try to use a transformation to improve it.

Square Root Transforms

Let’s apply the square root transformation and see if it helps. The command par(mfrow = c(2,2)) allows us to display all 4 scatterplots at the same time in a 2x2 matrix. We will also use this when creating our diagnostic plots.

par(mfrow = c(2,2))
plot(Distance ~ Speed, main = "Original Model")
plot(sqrt(Distance) ~ Speed, main = "Square Root of Distance Only")
plot(Distance ~ sqrt(Speed), main = "Square Root of Speed Only")
plot(sqrt(Distance) ~ sqrt(Speed), main = "Square Root of Dist and Speed")

It looks like taking the square root of both variables and taking the square root of distance only give us the most linear scatterplots. Taking the square root of speed only results in a curve, so we won’t consider using that model.
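Since par(mfrow = ...) changes a setting that stays in effect for later plots, one common pattern is to save the old settings and restore them afterwards. This sketch uses made-up data, not the stopping dataset.

```r
# Made-up data for illustration
x <- 1:20
y <- x^2

op <- par(mfrow = c(2, 2))   # par() returns the previous settings
plot(y ~ x, main = "Original")
plot(sqrt(y) ~ x, main = "Square Root of y Only")
plot(y ~ sqrt(x), main = "Square Root of x Only")
plot(sqrt(y) ~ sqrt(x), main = "Square Root of Both")
par(op)                      # restore the previous layout
```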

Checking our Models

We were able to get the scatterplots to show a more linear pattern, but we also need to check the other conditions. First let’s fit the model for the square root of both variables, and take a look at the diagnostic plots.

sqrtBothMod <- lm(sqrt(Distance) ~ sqrt(Speed))
par(mfrow = c(2,2))
plot(sqrtBothMod)

Normality and leverage both look pretty good here. The points in the Normal Q-Q plot tend to follow a straight line. None of the points have an excessively large influence on the regression coefficients, which is good. However, the residual plot shows a trend. We should probably look at a different model.

Now let’s fit the model with the square root of distance only.

sqrtDistMod <- lm(sqrt(Distance) ~ Speed)
par(mfrow = c(2,2))
plot(sqrtDistMod)

That residual plot is much better: the red line is nearly flat instead of curved as in the previous model. Normality and leverage look good too. The scale-location plot has a very minor trend that isn’t cause for concern.
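One practical consequence of modeling the square root of the response is that predictions come out on the square root scale and need to be squared to get back to the original units. The sketch below uses R’s built-in cars dataset (columns speed and dist) as a stand-in because it ships with R; it is not the stopping data used above, and the speed of 20 is an arbitrary choice.

```r
# Stand-in data: built-in cars dataset, not alr3's stopping
sqrtMod <- lm(sqrt(dist) ~ speed, data = cars)

# Predict at an arbitrary speed; the result is on the square root scale
newData  <- data.frame(speed = 20)
predSqrt <- predict(sqrtMod, newdata = newData)

# Square the prediction to return to the original distance units
predDist <- predSqrt^2
```

Passing data = cars to lm() also avoids the need for attach(), since the column names are looked up in the data frame.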

The log Transformation

Now let’s try the log transformation and see if it can give us a better model than what we have so far. As with the square root transformations, we can take the log of the response only, predictor only, or both.

par(mfrow = c(2,2))
plot(Distance ~ Speed, main = "Original Model")
plot(log(Distance) ~ Speed, main = "Log of Distance Only")
plot(Distance ~ log(Speed), main = "Log of Speed Only")
plot(log(Distance) ~ log(Speed), main = "Log of Both")

None of the log transformations give us satisfying results. The best out of these is definitely taking the log of both variables, but we can clearly see there is heteroscedasticity; the data points are very spread out for low speeds, and get close together for higher speeds. There’s no point in creating linear models for any of the log transforms and comparing them.

The Inverse Transformations

It doesn’t make sense for stopping distances to have an inverse relationship with speed, so the inverse transformation likely isn’t going to be useful here.

par(mfrow = c(2,2))
plot(Distance ~ Speed, main = "Original Model")
plot(1/Distance ~ Speed, main = "Inverse of Distance Only")
# I() makes R treat 1/Speed as arithmetic rather than as a formula operator
plot(Distance ~ I(1/Speed), main = "Inverse of Speed Only")
plot(1/Distance ~ I(1/Speed), main = "Inverse of Both")

The inverse transformations don’t help at all; each one results in a scatterplot with a non-linear trend. Our best model has the square root of stopping distance as the response and speed as the predictor.
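One detail to watch with inverse transformations: on the right-hand side of a formula, / is treated as a formula operator (nesting) rather than division, so the arithmetic is usually wrapped in I(). A minimal check on made-up data with an exact inverse relationship:

```r
# Made-up data: y depends exactly on 1/x
x <- 1:10
y <- 2 + 3 / x

# I() tells R to treat 1/x as arithmetic inside the formula
invMod <- lm(y ~ I(1/x))
print(coef(invMod))   # intercept and the coefficient on I(1/x)

# The same wrapper works in plot()
plot(y ~ I(1/x), main = "Inverse of x Only")
```

The response side of a formula is evaluated as ordinary arithmetic, so 1/y on the left does not need the wrapper.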

Outliers

Outliers are data points that don’t fit in with the rest. However, not all outliers are influential points. We can use Cook’s distance to determine whether an outlier is an influential point. Outliers in the x-direction are usually not a concern, since they still follow the trend somewhat. Outliers in the y-direction are a problem though.
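As a sketch of using Cook’s distance to check influence, the example below plants one far-out, off-trend point in made-up data, finds the observation with the largest Cook’s distance, and refits without it to see how much the coefficients move. All names and numbers are assumptions for illustration.

```r
# Made-up data: 20 points near a line plus one far-out, off-trend point
set.seed(7)
x <- c(1:20, 40)
y <- 1 + 2 * x + rnorm(21)
y[21] <- y[21] - 30           # pull the last point off the trend

fit <- lm(y ~ x)
cd  <- cooks.distance(fit)
worst <- which.max(cd)        # index of the most influential point

# Refit without that point and compare the coefficients
fitDrop <- lm(y ~ x, subset = -worst)
print(rbind(all = coef(fit), dropped = coef(fitDrop)))
```

The refit here is only a way to measure influence; it is not a recommendation to delete the point.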

We use the studentized residual to determine whether a certain data point is cause for concern. The studentized residual is the point’s residual divided by an estimate of that residual’s standard deviation, which depends on the point’s leverage. If the absolute value of a point’s studentized residual is greater than 2, we flag it as a potential outlier and look at it more closely.
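The calculation can be sketched by hand and compared with R’s built-in functions. R distinguishes rstandard() (residual scaled using the overall residual standard error) from rstudent() (the same idea, but with the point left out when estimating the error); which of the two the class definition refers to is an assumption here. The data are made up, with one planted outlier.

```r
# Made-up data with one planted outlier in the response
set.seed(3)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30)
y[15] <- y[15] + 10

fit <- lm(y ~ x)

# By hand: e_i / (s * sqrt(1 - h_ii))
e <- resid(fit)           # raw residuals
s <- summary(fit)$sigma   # residual standard error
h <- hatvalues(fit)       # leverage of each point
byHand <- e / (s * sqrt(1 - h))

# Built-in equivalents
rStd  <- rstandard(fit)   # matches byHand
rStud <- rstudent(fit)    # leaves point i out when estimating s

flagged <- which(abs(rStud) > 2)   # points past the cutoff of 2
```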

Dealing with Outliers

We have lots of tools and methods to detect outliers, but how do we deal with them? Unfortunately, there is no easy answer to this question. Dealing with outliers involves a lot of gray area.

Generally, we want to avoid outright deleting outlier points because they can contain useful information. The exception to this rule is when we know for certain the outlier was the result of a bad measurement or we can see that it is clearly wrong. An example would be a negative value for a person’s weight. These points can be deleted from the dataset.
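Removing a clearly impossible value can be as simple as filtering on a validity rule. A minimal sketch with a made-up data frame (the column names and values are hypothetical):

```r
# Made-up measurements; the negative weight is an obvious recording error
people <- data.frame(
  height = c(160, 172, 181, 168, 175),
  weight = c(55, 70, -82, 64, 77)
)

# Keep only rows with physically possible weights
cleaned <- subset(people, weight > 0)
print(nrow(people) - nrow(cleaned))   # number of rows dropped
```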

In cases where the outlier is a valid data point, we need to look into it further in the context of the data we’re working with, and figure out what is causing the outlier. Our model could also be missing a predictor. In these cases, the outlier has potentially useful information we would lose if we were to simply delete it.