Summary

Today our main focus was on looking at diagnostic plots and using them to assess our model, as well as using transformations to achieve a better fit for our model.

Diagnostic Plots

When looking at our diagnostic plots, we found that we could use the plot(model) command to quickly produce four different plots of our model. The four plots and their use are as follows:

  1. Residual plot (resids vs fitted), which helps us decide whether the mean function \(X\beta\) is appropriate (most important!) by checking whether there is a trend. We can also use it to answer other questions, such as: Is there heteroscedasticity? Do we have outliers? Are there points for which the model is not appropriate?

  2. QQNorm, which plots quantiles of the residuals vs quantiles of a normal distribution. We want our points to follow a linear trend. This plot uses standardized residuals, which divide each residual by its estimated standard deviation so that all points are on a comparable scale.

  3. Scale-location plot (sqrt(|standardized resids|) vs fitted values), which reduces the skewness of the absolute residuals and makes it easier to see trends in their spread, such as heteroscedasticity.

  4. Residuals vs leverage plot, with Cook’s distance contours. Cook’s distance measures each data point’s influence on the estimated regression coefficients \(\hat{\beta}\).
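The plots above can also be drawn one at a time with the which argument of plot(). The following is a minimal sketch, assuming R's built-in cars data set as a stand-in (it is not the data from class, and the names mod, std.res and cooks are only illustrative). It also extracts the standardized residuals and Cook's distances that the plots are built from.

```r
# Stand-in model on the built-in 'cars' data (an assumption for illustration).
mod <- lm(dist ~ speed, data = cars)

# All four default diagnostic plots in a 2 x 2 grid.
old.par <- par(mfrow = c(2, 2))
plot(mod)
par(old.par)

# One plot at a time: 1 = residuals vs fitted, 2 = normal Q-Q,
# 3 = scale-location, 5 = residuals vs leverage (the default fourth plot),
# 4 = Cook's distance for each observation.
plot(mod, which = 1)
plot(mod, which = 4)

# Quantities behind the plots.
std.res <- rstandard(mod)       # standardized residuals (used in plots 2 and 3)
cooks   <- cooks.distance(mod)  # Cook's distance for each observation

# The three observations with the largest Cook's distance.
head(sort(cooks, decreasing = TRUE), 3)
```

Drawing a single plot with which is handy when one panel of the grid is too small to read.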

Transformations

The next topic that we looked at was using transformations to improve the fit of our model and improve our diagnostic plots. Usually we use these to try to correct any trends we see in our first residual plot, as we want our residuals to look random and without pattern. We can also use transformations to correct for heteroscedasticity. We focused on three main transformations:

  1. \(\sqrt{y} = X\beta + \epsilon\): helps with heteroscedasticity

  2. \(\log(y) = X\beta + \epsilon\): helps with heteroscedasticity as well as straightening out a curved trend in residuals

  3. \(1/y = X\beta + \epsilon\): can help if we know from experience there is an inverse relationship between our variables
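As a rough illustration of the three transformations, the sketch below fits each one and draws the residuals vs fitted plot for comparison. It assumes R's built-in cars data set as a stand-in (response dist, predictor speed); the model names are only illustrative. Note that the log and inverse transformations require a strictly positive response, which holds for this data.

```r
# Stand-in data: built-in 'cars' (an assumption for illustration).
mod.raw  <- lm(dist ~ speed, data = cars)          # untransformed response
mod.sqrt <- lm(sqrt(dist) ~ speed, data = cars)    # square root of y
mod.log  <- lm(log(dist) ~ speed, data = cars)     # log of y
mod.inv  <- lm(I(1 / dist) ~ speed, data = cars)   # inverse of y

# Residuals vs fitted for each model in a 2 x 2 grid,
# titled with the model formula.
old.par <- par(mfrow = c(2, 2))
for (m in list(mod.raw, mod.sqrt, mod.log, mod.inv)) {
  plot(m, which = 1, main = deparse(formula(m)))
}
par(old.par)
```

The I() wrapper is needed for the inverse so that the division is treated as arithmetic rather than as formula syntax.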

Outliers and Influential Points

Finally we looked at outliers and influential points. An outlier is a point that is well separated from the rest of the data. Unfortunately, there is no single great answer about what to do with outliers. In MLR, when we use plot(mod), the first and third plots are most helpful in identifying outliers. The sloppy rule is that if |studentized residual| > 2, we call that point an outlier in the y direction (use the third plot, where the vertical axis is the square root of the absolute standardized residual, so the cutoff of 2 sits at about 1.41).
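The sloppy rule can also be checked directly instead of reading it off a plot. This is a minimal sketch, assuming R's built-in cars data set as a stand-in and illustrative names (mod, stud.res, flagged). It uses rstudent(), which gives externally studentized residuals; rstandard() gives the standardized residuals that plot() displays, and either could be used with the same cutoff.

```r
# Stand-in model on the built-in 'cars' data (an assumption for illustration).
mod <- lm(dist ~ speed, data = cars)

# Externally studentized residuals, one per observation.
stud.res <- rstudent(mod)

# Indices of observations flagged by the |studentized residual| > 2 rule.
flagged <- which(abs(stud.res) > 2)

# Inspect the flagged rows alongside their studentized residuals.
cbind(cars[flagged, ], stud.res = stud.res[flagged])
```

A flagged point is only a candidate for a closer look, not something to delete automatically.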

When dealing with outlier points, we have to think about each point and what may have caused it. Several of the questions we may ask are:

  1. Was the data point recorded incorrectly, causing the outlier?

  2. If it was recorded correctly, why is this point an outlier?

  3. If this point is correct, are we potentially missing a predictor that could explain it?

Example

Here I will use the stopping data set from the alr3 package to provide an example of how we use diagnostic plots and transformations to improve our model.

We will start by creating a basic regression model like we’ve previously done.

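# Load the stopping data (columns Speed and Distance) from the alr3 package
data(stopping, package = "alr3")
# Simple linear regression of stopping distance on speed
stop.mod1 <- lm(Distance ~ Speed, data = stopping)
# Draw the default diagnostic plots
plot(stop.mod1)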

In this first basic model, we notice that the first plot produced by plot(stop.mod1) has a bit of a trend to it: the model under-predicts at our low and high speed values (positive residuals) and over-predicts at our middle speed values (negative residuals). We want to remove this trend, so we will try a transformation on one or both of our variables to correct it. Often we will start with either a square root or log transformation. In this case, the residuals follow a roughly parabolic pattern, which suggests trying a square root transformation first.

# Same model with a square root transformation of the response
stop.mod2 <- lm(sqrt(Distance) ~ Speed, data = stopping)
plot(stop.mod2)

Looking at that same first plot, we can see that after taking the square root of stopping distance, we have almost no trend in our residuals, as indicated by the red line. This is a good thing, and after looking at all the other plots I feel that this model has been improved, so we will keep the square root in our final model.
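To make the before and after comparison easier, the two residual plots can be drawn side by side. This is a sketch that assumes the alr3 package is installed, as in the example; the requireNamespace() guard and the plot titles are additions for illustration and are not part of the original analysis.

```r
# Only run if the alr3 package (which provides the stopping data) is available.
if (requireNamespace("alr3", quietly = TRUE)) {
  data(stopping, package = "alr3")

  # Refit both models from the example.
  stop.mod1 <- lm(Distance ~ Speed, data = stopping)
  stop.mod2 <- lm(sqrt(Distance) ~ Speed, data = stopping)

  # Residuals vs fitted for each model, side by side.
  old.par <- par(mfrow = c(1, 2))
  plot(stop.mod1, which = 1, main = "Distance ~ Speed")
  plot(stop.mod2, which = 1, main = "sqrt(Distance) ~ Speed")
  par(old.par)
} else {
  message("Package 'alr3' is not installed; skipping the comparison.")
}
```

With both panels on one page, it is easier to judge whether the red smoothing line has flattened after the transformation.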