Assignment 11

We are using the “cars” dataset to make a linear model for stopping distance as a function of speed.

library(datasets)
library(ggplot2)
library(tidyverse)
library(lmtest)
library(nortest)
cars <- as.data.frame(cars)

Cars Model

First, we will make sure the data follows a linear pattern:

ggplot(cars) + geom_point(aes(x = speed, y = dist))

The data seems to follow a linear pattern, though it does seem to fan out a bit. We will need to keep an eye out for violations of the constant variance assumption.

lm_cars <- lm(dist ~ speed, data = cars)
summary(lm_cars)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

ggplot(cars) + geom_point(aes(x = speed, y = dist)) + geom_smooth(aes(x = speed, y = dist),method = 'lm')

The summary of the model checks all the boxes we’re looking for. The residuals are fairly symmetric, with the 1st quartile and 3rd quartile being about equal, and the median being near 0. The coefficient for speed is easily different from 0 with a t value near 10. Lastly, the R squared value indicates a strong fit.

Diagnostics

plot(lm_cars)

There are two issues that are not concerning, but should be noted, and one that is concerning. First, the small deviations from normality are not really a concern with only 50 observations. There is one point (49) that is somewhat influential, so it would be nice to know some other features of this car if we had access to them. Finally, the residuals exhibit non constant variance and some curvature. This will need to be addressed, espcially because if we use some domain knowledge, we know that momentum varies quadratically with velocity.

diagnostics <- data.frame(fitted = lm_cars$fitted.values, residual_abs = abs(lm_cars$residuals),
                          residuals = lm_cars$residuals)
ggplot(diagnostics) + geom_point(aes(x = fitted, y = residual_abs)) + 
  labs(x= 'Fitted value', y = 'Absolute Value of Residuals')

ggplot(diagnostics) + geom_histogram(aes( x = residuals), color = 'black', fill = 'darkblue', binwidth = 4)

bptest(lm_cars)

## 
##  studentized Breusch-Pagan test
## 
## data:  lm_cars
## BP = 3.2149, df = 1, p-value = 0.07297

ad.test(lm_cars$residuals)

## 
##  Anderson-Darling normality test
## 
## data:  lm_cars$residuals
## A = 0.79406, p-value = 0.0369

If normality were the only issue, we would not change or reject the model because of the skew. We would merely note that the standard error of the beta coefficient is should be higher. However, the heteroskedacity is more of a concern, even given the higher p value from its test. This is because of our domain knowledge, and our likely desire to use our model to exrapolate stopping distance.

Remedial Measures

We will transform the response variable as a remedial measure.

cars <- cars %>%
  mutate(
    dist_sqrt = sqrt(dist)
  )

ggplot(cars) + geom_point(aes(x = speed, y = dist_sqrt)) + geom_smooth(aes(x = speed, y = dist_sqrt),method = 'lm')

lm_cars_sqrt <- lm(dist_sqrt ~ speed, data = cars)
summary(lm_cars_sqrt)

## 
## Call:
## lm(formula = dist_sqrt ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0684 -0.6983 -0.1799  0.5909  3.1534 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.27705    0.48444   2.636   0.0113 *  
## speed        0.32241    0.02978  10.825 1.77e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.102 on 48 degrees of freedom
## Multiple R-squared:  0.7094, Adjusted R-squared:  0.7034 
## F-statistic: 117.2 on 1 and 48 DF,  p-value: 1.773e-14

plot(lm_cars_sqrt)

bptest(lm_cars_sqrt)

## 
##  studentized Breusch-Pagan test
## 
## data:  lm_cars_sqrt
## BP = 0.011192, df = 1, p-value = 0.9157

ad.test(lm_cars_sqrt$residuals)

## 
##  Anderson-Darling normality test
## 
## data:  lm_cars_sqrt$residuals
## A = 0.39752, p-value = 0.3551

The transformation fixes both issues, and improves the fit. This remedial measure appears to have been a success.

Assignment 11

Peter Goodridge

April 22, 2018

Cars Model

Diagnostics

Remedial Measures