Load in the cars data and view basic summary

attach(cars)
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Visualize the relationsip between the two variables

plot(cars$speed, cars$dist, xlab='Speed (mph)', ylab='Stopping Distance (ft)', 
     main='Stopping Distance vs. Speed')

Create a simple regression model to model distance as a function of speed

lmod <- lm(dist ~ speed, data=cars)
summary(lmod)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Plot the linear model on the scatterplot we produced earlier

plot(cars$speed, cars$dist, xlab='Speed (mph)', ylab='Stopping Distance (ft)', 
     main='Stopping Distance vs. Speed')
abline(lmod)

It seems that this model is a decent fit. It has an adjusted r squared of .6. Lets see how the distribution of our response variable holds against the normal distribution

x <- cars$dist 
h<-hist(x, breaks=10, col="red", xlab="Distance", 
    main="Histogram with Normal Curve") 
xfit<-seq(min(x),max(x),length=40) 
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x)) 
yfit <- yfit*diff(h$mids[1:2])*length(x) 
lines(xfit, yfit, col="blue", lwd=2)

Based on this chart, it seems as if though some sort of transform can help increase model accuracy.

Lets run diagnostics on our model Residuals~There is a skew in our residuals as the spread does not exactly follow a normal distribution

res1<-lmod$residuals
hist(res1)

QQ-Plots

qqnorm(lmod$residuals)
qqline(lmod$residuals)

QQ-Plots seem to tell the same story. While the model we built is not bad, it can be better.What does constant variance check say?

plot(fitted(lmod), residuals(lmod), xlab="fitted", ylab="residuals")
abline(h=0);

plot(fitted(lmod), sqrt(abs(residuals(lmod))), xlab="fitted", ylab=expression(sqrt(hat(epsilon))))

It looks like variance is constant but one can also make the argument that there is some sort of curve. We need a more robust test. The square root of the residuals make a stronger visual argument that there is no constant variance.

library(olsrr)
## Warning: package 'olsrr' was built under R version 3.4.4
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(lmod)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data               
##  --------------------------------
##  Response : dist 
##  Variables: fitted values of dist 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    4.650233 
##  Prob > Chi2   =    0.03104933

If we select alpha to be .05, then at the 95% level, we do not have constant variance according to this test. We need some sort of transformation on the response variable.

How do we even determine what transform to apply? Bring in Box-Cox

require(MASS)
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.4.4
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:olsrr':
## 
##     cement
boxcox(lmod, plotit=T, lambda=seq(0, 1.0, by=0.5))

We can transform our response variable by raising it to the power lambda. In this case, the Box-Cox produced a lambda of .05. Our newmodel should be distanced raised to the .05 power.

Does it improve our model? Lets see

lmod2 <- lm(dist^.05 ~ speed, data=cars)
summary(lmod2)
## 
## Call:
## lm(formula = dist^0.05 ~ speed, data = cars)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.078210 -0.012687 -0.000667  0.015227  0.060624 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.0851319  0.0112450   96.50  < 2e-16 ***
## speed       0.0070858  0.0006914   10.25 1.12e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02559 on 48 degrees of freedom
## Multiple R-squared:  0.6864, Adjusted R-squared:  0.6798 
## F-statistic:   105 on 1 and 48 DF,  p-value: 1.124e-13

Adjusted r squared is pretty much the same. How about diagotics plots?

res2<-lmod2$residuals
hist(res2);

qqnorm(lmod$residuals)
qqline(lmod$residuals);

plot(fitted(lmod2), residuals(lmod2), xlab="fitted", ylab="residuals")
abline(h=0);

ols_test_breusch_pagan(lmod2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                 Data                  
##  -------------------------------------
##  Response : dist^0.05 
##  Variables: fitted values of dist^0.05 
## 
##          Test Summary           
##  -------------------------------
##  DF            =    1 
##  Chi2          =    12.27235 
##  Prob > Chi2   =    0.0004597196

Residuals look much better when it comes to their distribution being normal and constant variance condition is met. A transform by Box-Cox was a good choice.