Learning Log 14

Diagnostic Plots

In the last class, we assessed whether a model was a good fit by looking at diagnostic plots.

The first plot we can look at is the residual plot, which shows the residuals against the fitted values. This plot helps us answer questions like a) is the “mean function” Xβ appropriate? b) is there heteroscedasticity? and c) do we have any outliers?

The second plot we can look at is the normal Q-Q plot. It plots the quantiles of the standardized residuals against the quantiles of a normal distribution, so we can check whether the errors look normally distributed.

The third plot is a scale-location plot. This plots the square root of the absolute standardized residuals against the fitted values. Taking the square root reduces the skewness of the absolute residuals, so it's easier to see trends in the spread of the residuals.

Finally, the fourth plot shows Cook’s distance, which measures each data point’s influence on \(\hat{β}\). In R's default output it appears as contour lines on the residuals vs. leverage plot.
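As a rough sketch of what each of the four plots draws, the quantities behind them can be computed by hand. The data here are simulated and the names x, y, and mod are made up for illustration; they are not from class.

```r
set.seed(1)
x <- runif(50, 1, 10)
y <- 2 + 3 * x + rnorm(50)       # simulated straight-line data
mod <- lm(y ~ x)

res <- resid(mod)                # plot 1: residuals vs. fitted values
rs  <- rstandard(mod)            # plot 2: standardized residuals for the Q-Q plot
sl  <- sqrt(abs(rs))             # plot 3: scale-location values
cd  <- cooks.distance(mod)       # plot 4: Cook's distance for each observation

par(mfrow = c(2, 2))
plot(fitted(mod), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(rs)
qqline(rs)
plot(fitted(mod), sl, xlab = "Fitted values",
     ylab = "sqrt(|standardized residuals|)")
plot(cd, type = "h", xlab = "Observation", ylab = "Cook's distance")
```

Because these data were simulated from a straight line with constant error spread, none of the four panels should show a strong pattern.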

First, we must fit a model before we can create these plots. I will use the brains data set to do this.

library(alr3)
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data(brains)
attach(brains)

brmod <- lm(BrainWt~BodyWt)
plot(brmod)

As you can see from the output, plot() automatically gives us the four plots. Looking at the last plot, take note of the cone formed by the red dashed lines, which are Cook's distance contours. Our goal is to have all data points inside this cone. The data points inside the cone don’t have too much influence on the fit, but we need to pay attention to any points outside it. We can try a transformation to get them inside.
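The same check can be done numerically. This is a minimal sketch on simulated data (not the brains data), where one point is deliberately placed far from the others. It assumes the cone is drawn at R's default Cook's distance levels of 0.5 and 1, and the cutoff of 0.5 is just one common choice.

```r
set.seed(3)
x <- c(runif(30, 1, 10), 30)                 # point 31 is far from the other x values
y <- c(2 + 3 * x[1:30] + rnorm(30), 20)      # and far below the trend of the rest
mod <- lm(y ~ x)

d <- cooks.distance(mod)                     # influence of each point on the fit
flagged <- which(d > 0.5)                    # points beyond the inner contour
flagged

plot(mod, which = 5)                         # residuals vs. leverage with the contours
```

Any index in flagged corresponds to a point that sits outside the inner dashed line in the plot.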

Transformation

If we see problems in these plots, such as skew, an inappropriate “mean function”, or heteroscedasticity, we might want to use a transformation to address them. Be sure to start with a single transformation of one variable, because simpler is better.

Some common transformations:

  1. sqrt(Y) = Xβ + ε
     1. helps with heteroscedasticity
  2. log(Y) = Xβ + ε
     1. helps equalize the error variance (helps with heteroscedasticity)
     2. helps straighten the trend and reduce curvature in the residuals
  3. 1/Y = Xβ + ε
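To see how these transformations behave, here is a small sketch on simulated data. The data are generated so that the log scale is the right one, which is an assumption built in for illustration only, and the names m_raw, m_sqrt, m_log, and m_inv are made up.

```r
set.seed(2)
x <- runif(100, 0, 5)
y <- exp(1 + 0.5 * x + rnorm(100, sd = 0.3))   # spread of y grows with its mean

fits <- list(
  "Y"       = lm(y ~ x),
  "sqrt(Y)" = lm(sqrt(y) ~ x),
  "log(Y)"  = lm(log(y) ~ x),
  "1/Y"     = lm(I(1 / y) ~ x)
)
m_log <- fits[["log(Y)"]]

# One residual plot per response scale, to compare curvature and spread
par(mfrow = c(2, 2))
for (nm in names(fits)) {
  m <- fits[[nm]]
  plot(fitted(m), resid(m), main = nm,
       xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)
}

coef(m_log)   # should be close to the simulated intercept 1 and slope 0.5
```

Comparing the panels side by side is the same kind of check we do in class: we look for the scale where the residuals show no curve and no fan shape.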

Practice

Goal: Find a transformation that gives a linear trend.

We will use the brains data set to try these transformations.

brmod1 <- lm(sqrt(BrainWt) ~ sqrt(BodyWt))
plot(brmod1)

brmod2 <- lm(log(BrainWt) ~ log(BodyWt))
plot(brmod2)

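brmod3 <- lm(I(1/BrainWt) ~ I(1/BodyWt))
plot(brmod3)

The reciprocals are wrapped in I() so that R reads "/" as division. Written as (1/BodyWt), the predictor drops out of the formula, lm() fits an intercept-only model, and plot() reports:

## hat values (leverages) are all = 0.01612903
##  and there are no factor predictors; no plot no. 5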
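A message about constant hat values is what plot() gives when the model has only an intercept, because every leverage is then 1/n. The sketch below shows how this can happen with a reciprocal predictor: inside a formula, "/" is a formula operator rather than division unless it is wrapped in I(). The data are simulated and the names bad and good are made up.

```r
set.seed(4)
x <- runif(40, 1, 10)
y <- 5 + 2 / x + rnorm(40, sd = 0.1)

bad  <- lm(y ~ (1/x))     # "/" is read as formula syntax: only the intercept is fit
good <- lm(y ~ I(1/x))    # I() protects the arithmetic: intercept and slope on 1/x

coef(bad)
coef(good)
range(hatvalues(bad))     # every leverage equals 1/n when only an intercept is fit
```

With 40 observations, each leverage from bad is 1/40, in the same way that 0.01612903 is 1/62.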

Looking at these plots, it appears the log transformation gave us the best model.

Outliers and Influential Points

Outliers

An outlier is a point that is well separated from the rest of the data. Before we can address outliers, we need to identify them through a residual plot or tests based on studentized residuals.
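One way to run such a test is sketched below on simulated data, where one response is shifted on purpose. It compares each studentized residual to a t distribution and applies a Bonferroni correction. The 0.05 cutoff and the names t_i, p_raw, and p_bonf are assumptions for illustration.

```r
set.seed(5)
x <- runif(40, 1, 10)
y <- 2 + 3 * x + rnorm(40)
y[40] <- y[40] + 15                          # shift one response far off the trend
mod <- lm(y ~ x)

t_i <- rstudent(mod)                         # externally studentized residuals
n <- length(t_i)
p <- length(coef(mod))

p_raw  <- 2 * pt(-abs(t_i), df = n - p - 1)  # two-sided p-value for each point
p_bonf <- pmin(1, n * p_raw)                 # Bonferroni adjustment for n tests

which(p_bonf < 0.05)                         # points flagged as outliers
```

The Bonferroni step matters because we are testing every point at once, so without it some points would be flagged just by chance.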

Before we decide how to deal with them, we must first ask ourselves these three questions.

  1. Was the data point recorded incorrectly?

  2. If it is correct, why is this point an outlier?

  3. If the point is correct, are we missing a predictor that could explain the trend?