We covered a lot in class today. We started with diagnostic plots for analyzing our residuals. To bring up the diagnostic plots, just call plot() on your model.

Diagnostic Plots

library(fivethirtyeight)
data(avengers)
attach(avengers)
mod <- lm(appearances~years_since_joining,data=avengers)
par(mfrow=c(2,2))
plot(mod)

The first plot is the residuals vs. fitted plot that we have been using so far. In this plot you want to see no curvature. If there is curvature, you may be missing a predictor, a quadratic term, or an interaction term. Also look out for heteroscedasticity and outliers.
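
To see what curvature from a missing quadratic term looks like, here is a rough sketch on made-up data (not the avengers data). The variables x and y and the model names are my own assumptions for illustration.

```r
# Simulated data with a true quadratic relationship (an assumption, not the avengers data)
set.seed(6)
x <- runif(120, -3, 3)
y <- 1 + x + 2 * x^2 + rnorm(120)

fit_lin  <- lm(y ~ x)            # leaves out the quadratic term
fit_quad <- lm(y ~ x + I(x^2))   # includes it

# Residuals vs. fitted for each model, side by side
par(mfrow = c(1, 2))
plot(fit_lin, which = 1)    # should show a clear curve
plot(fit_quad, which = 1)   # should look like a flat band

# Residual standard error of each fit
c(linear = summary(fit_lin)$sigma, quadratic = summary(fit_quad)$sigma)
```

The point is only that the curve in the first panel goes away once the missing term is in the model.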

We have also seen the normal QQ plot before. You want the points to be close to the line. This data does not pass this test, meaning the residuals are not normally distributed.
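
If you want to look at the QQ plot on its own, something like the sketch below should work. It uses simulated right-skewed data as a stand-in (an assumption, not the avengers data), and the Shapiro-Wilk test is an extra check we did not cover in class.

```r
# Simulated, right-skewed response (an assumption, not the avengers data)
set.seed(1)
x <- runif(200, 0, 50)
y <- exp(rnorm(200, mean = 5, sd = 1.5))   # log-normal, heavy right tail
fit <- lm(y ~ x)

res <- residuals(fit)

# Normal QQ plot of the residuals, with a reference line
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(res)
sw$p.value
```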

The third plot is called a Scale-Location plot. It is another way to look at the residuals. It plots the square root of the absolute standardized residuals against the fitted values, so the negative residuals are flipped and you can focus on the magnitude of the residuals.
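
Here is a minimal sketch of building that plot by hand, assuming simulated data where the spread grows with x (not the avengers data). The names fit and sl are mine.

```r
# Simulated data whose error spread grows with x (an assumption for illustration)
set.seed(2)
x <- runif(100, 1, 10)
y <- 3 + 2 * x + rnorm(100, sd = 0.5 * x)
fit <- lm(y ~ x)

# Square root of the absolute standardized residuals
sl <- sqrt(abs(rstandard(fit)))

# Plot against the fitted values and add a smoother to show the trend
plot(fitted(fit), sl,
     xlab = "Fitted values",
     ylab = "sqrt(|standardized residuals|)")
lines(lowess(fitted(fit), sl), col = "red")
```

An upward trend in the smoother is a sign of heteroscedasticity.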

The fourth plot is the Residuals vs Leverage plot, with Cook’s distance drawn as dashed red contour lines. The red lines are not entirely visible here, but the goal is to have all of your points inside them. This data actually passes this test. Cook’s distance measures the influence that each individual data point has on the fitted regression coefficients.
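
Cook’s distance can also be pulled out as numbers, which helps when the contour lines are hard to see. This is a sketch on simulated data with one planted high-leverage point (my assumption, not the avengers data), and the 0.5 cutoff is used here only because it is one of the contour levels on the plot.

```r
# Simulated data: 50 well-behaved points plus one far-out point that breaks the trend
set.seed(3)
x <- c(runif(50, 0, 10), 30)
y <- c(2 + 0.5 * x[1:50] + rnorm(50), 0)
fit <- lm(y ~ x)

cd  <- cooks.distance(fit)   # influence of each point on the fitted coefficients
lev <- hatvalues(fit)        # leverage of each point

# Points beyond the 0.5 contour
which(cd > 0.5)

# The Residuals vs Leverage plot on its own
plot(fit, which = 5)
```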

This is the data I used in my R Guide. I want to use it here because it was not especially nice data. Perfect for residual analysis and transformations!

Transformations

We looked at three types of transformations in class: square root, logarithmic, and inverse. These can be applied to the response, the predictor, or both, in order to make the relationship more linear. I would like to see if any of these can make my avengers scatterplot look any better.

sqmod <- lm(sqrt(appearances)~years_since_joining, data=avengers)
logmod <- lm(log(appearances)~years_since_joining, data=avengers)
invmod <- lm((1/appearances)~years_since_joining, data=avengers)
par(mfrow=c(2,2))
plot(appearances~years_since_joining)
plot(sqrt(appearances)~years_since_joining)
plot(log(appearances)~years_since_joining)
plot((1/appearances)~years_since_joining)

The log transform has potential. Let’s try transforming the predictor as well.

sqsqmod <- lm(sqrt(appearances)~sqrt(years_since_joining), data=avengers)
#loglogmod <- lm((log(appearances))~(log(years_since_joining)), data=avengers)
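# On the right-hand side of a formula, / is a formula operator, so the inverse needs I().
# Note: this call will stop with an error if any years_since_joining is 0, because 1/0 is Inf.
invinvmod <- lm((1/appearances)~I(1/years_since_joining), data = avengers)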
par(mfrow=c(2,2))
plot(appearances~years_since_joining)
plot(sqrt(appearances)~sqrt(years_since_joining))
#plot(log(appearances),log(years_since_joining)
#plot((1/appearances),1/(years_since_joining))

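The commented-out code gave me some errors. The log plot call is missing a closing parenthesis, and if any years_since_joining value is 0, then log() gives -Inf and the inverse gives Inf, which lm() cannot fit. Based on the look of my other plots, I don’t think these would’ve been too helpful anyway.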
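
Here is a small sketch of what happens when a predictor contains a zero, using a tiny made-up vector (an assumption, I have not checked which avengers rows are affected). The log1p() workaround and the subset on x > 0 are just two possible options, not something we covered in class.

```r
# Tiny made-up example with a zero in the predictor
x <- c(0, 1, 2, 4, 8)
y <- c(3, 5, 6, 9, 12)

log(x)   # first element is -Inf
1 / x    # first element is Inf

# lm() stops when a transformed predictor contains Inf or -Inf;
# tryCatch() captures the error message instead of halting the script
log_error <- tryCatch(lm(y ~ log(x)), error = function(e) conditionMessage(e))
log_error

# One possible workaround: log1p(x) = log(1 + x) is defined at zero
fit_log1p <- lm(y ~ log1p(x))

# For the inverse, I() makes / mean arithmetic division inside the formula;
# here the zero row is dropped first
keep <- x > 0
fit_inv <- lm(y ~ I(1 / x), subset = keep)
names(coef(fit_inv))
```

Keep in mind that dropping rows or shifting the predictor changes what the model is saying, so either choice needs a reason behind it.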

Let’s compare the log model to the base. Was it an improvement?

summary(mod)
## 
## Call:
## lm(formula = appearances ~ years_since_joining, data = avengers)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -544.9 -344.7 -273.7  108.3 3921.3 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          374.166     68.599   5.454  1.7e-07 ***
## years_since_joining    1.502      1.703   0.882    0.379    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 678.4 on 171 degrees of freedom
## Multiple R-squared:  0.004528,   Adjusted R-squared:  -0.001293 
## F-statistic: 0.7778 on 1 and 171 DF,  p-value: 0.379
summary(logmod)
## 
## Call:
## lm(formula = log(appearances) ~ years_since_joining, data = avengers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3915 -0.9576 -0.1297  1.2163  3.3726 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.9783258  0.1547525  32.170   <2e-16 ***
## years_since_joining 0.0009249  0.0038420   0.241     0.81    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.53 on 171 degrees of freedom
## Multiple R-squared:  0.0003388,  Adjusted R-squared:  -0.005507 
## F-statistic: 0.05796 on 1 and 171 DF,  p-value: 0.81

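I was imagining things. The log model is no improvement on the base model. The slope’s already large p-value more than doubled, from 0.379 to 0.81, and R-squared went from 0.0045 to 0.0003. Neither model shows a meaningful linear relationship.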
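
Instead of reading the numbers off each summary, you can pull them into one small table. This sketch uses simulated data where the response is unrelated to x by construction (an assumption, not the avengers data), and the names fits and comparison are mine.

```r
# Simulated data: skewed response with no real relationship to x
set.seed(4)
x <- runif(150, 0, 60)
y <- exp(rnorm(150, mean = 5, sd = 1.5))

fits <- list(base = lm(y ~ x),
             sqrt = lm(sqrt(y) ~ x),
             log  = lm(log(y) ~ x))

# Slope p-value and R-squared for each fit, one column per model
comparison <- sapply(fits, function(f) {
  s <- summary(f)
  c(slope_p = coef(s)["x", "Pr(>|t|)"], r_squared = s$r.squared)
})
round(comparison, 4)
```

The residual standard errors are left out on purpose: they are in different units (raw, square root, log), so they cannot be compared directly across these models.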

Outliers

The last thing we touched on in class was outliers and influential points. Basically, don’t worry about outliers if they don’t influence the regression line. Only worry about outliers that skew your regression line. Even then, don’t go deleting outliers for that reason alone. Think about why the outlier exists. If it can be explained, you may be missing a predictor.
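
One way to check whether an outlier actually skews the line is to fit the model with and without it and compare the coefficients. This is a rough sketch on simulated data with one planted outlier (an assumption, not the avengers data). Picking the point with the largest Cook’s distance is my own shortcut, and the refit is only a sensitivity check, not a reason to delete the point.

```r
# Simulated data: 40 points on a line plus one far-out point that breaks the trend
set.seed(5)
x <- c(runif(40, 0, 10), 25)
y <- c(1 + 2 * x[1:40] + rnorm(40), 5)
dat <- data.frame(x = x, y = y)

fit_all <- lm(y ~ x, data = dat)

# Flag the single most influential point by Cook's distance
flagged <- which.max(cooks.distance(fit_all))

# Refit without it, purely to see how much the line moves
fit_drop <- lm(y ~ x, data = dat[-flagged, ])

rbind(all_points = coef(fit_all), without_flagged = coef(fit_drop))
```

If the two rows are close, the outlier is not influencing the line much and can be left alone.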