Load in the cars data and view basic summary
attach(cars)
summary(cars)
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
Visualize the relationship between the two variables
plot(cars$speed, cars$dist, xlab='Speed (mph)', ylab='Stopping Distance (ft)',
main='Stopping Distance vs. Speed')
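As a quick numeric complement to the scatterplot, the strength of the linear relationship can be checked directly; a small sketch, not part of the original output:
# Pearson correlation between speed and stopping distance
cor(cars$speed, cars$dist)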
Create a simple regression model to model distance as a function of speed
lmod <- lm(dist ~ speed, data=cars)
summary(lmod)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
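The quantities discussed below can also be extracted from the fitted model programmatically; a minimal sketch, not part of the original write-up:
# Coefficients, adjusted R-squared, and 95% confidence intervals for lmod
coef(lmod)
summary(lmod)$adj.r.squared
confint(lmod, level = 0.95)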
Plot the linear model on the scatterplot we produced earlier
plot(cars$speed, cars$dist, xlab='Speed (mph)', ylab='Stopping Distance (ft)',
main='Stopping Distance vs. Speed')
abline(lmod)
It seems that this model is a decent fit, with an adjusted R-squared of about 0.64. Let's see how the distribution of our response variable compares to a normal distribution.
# Histogram of stopping distance with a scaled normal curve overlaid
x <- cars$dist
h <- hist(x, breaks = 10, col = "red", xlab = "Distance",
          main = "Histogram with Normal Curve")
xfit <- seq(min(x), max(x), length.out = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
# Scale the normal density up to the histogram's count scale (bin width * n)
yfit <- yfit * diff(h$mids[1:2]) * length(x)
lines(xfit, yfit, col = "blue", lwd = 2)
Based on this chart, the response variable looks right-skewed, so some sort of transformation may help improve the model.
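To put a number on that visual impression, the skewness of the response can be computed directly; a minimal sketch using the usual moment-based formula, not part of the original analysis:
# Sample skewness of stopping distance; positive values indicate right skew
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3
skew(cars$dist)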
Let's run diagnostics on our model. Residuals: there is a skew in our residuals, as their spread does not exactly follow a normal distribution.
res1 <- lmod$residuals
hist(res1)
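As a formal complement to the histogram, a Shapiro-Wilk normality test can be run on the residuals; a sketch using base R's shapiro.test, which was not part of the original write-up:
# Small p-values argue against normally distributed residuals
shapiro.test(res1)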
QQ-Plots
qqnorm(lmod$residuals)
qqline(lmod$residuals)
The QQ-plot seems to tell the same story: while the model we built is not bad, it can be better. What does a constant-variance check say?
plot(fitted(lmod), residuals(lmod), xlab="fitted", ylab="residuals")
abline(h=0);
plot(fitted(lmod), sqrt(abs(residuals(lmod))), xlab="fitted", ylab=expression(sqrt(hat(epsilon))))
At first glance the variance looks roughly constant, though one can also argue there is some curvature, and the plot of the square root of the absolute residuals makes a stronger visual case that the variance is not constant. We need a more robust test.
library(olsrr)
## Warning: package 'olsrr' was built under R version 3.4.4
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(lmod)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## --------------------------------
## Response : dist
## Variables: fitted values of dist
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 4.650233
## Prob > Chi2 = 0.03104933
If we select alpha = 0.05, the p-value of about 0.031 means we reject the null hypothesis of constant variance at the 95% level. We need some sort of transformation on the response variable.
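As a cross-check on the Breusch-Pagan result, and assuming the car package is installed, its score test for non-constant variance can be run as well; a sketch, not part of the original analysis:
library(car)
# Score test of constant variance against the fitted values, as above
ncvTest(lmod)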
How do we even determine what transformation to apply? Bring in Box-Cox.
require(MASS)
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.4.4
##
## Attaching package: 'MASS'
## The following object is masked from 'package:olsrr':
##
## cement
boxcox(lmod, plotit=T, lambda=seq(0, 1.0, by=0.5))
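The numerically optimal lambda can also be read off the object returned by boxcox(); a minimal sketch, assuming a finer lambda grid than the one plotted above:
# bc$x holds the lambda grid, bc$y the profile log-likelihood
bc <- boxcox(lmod, plotit = FALSE, lambda = seq(0, 1, by = 0.01))
bc$x[which.max(bc$y)]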
We can transform our response variable by raising it to the power lambda. In this case, the Box-Cox procedure produced a lambda of 0.05, so our new model should use distance raised to the 0.05 power.
Does it improve our model? Let's see.
lmod2 <- lm(dist^.05 ~ speed, data=cars)
summary(lmod2)
##
## Call:
## lm(formula = dist^0.05 ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.078210 -0.012687 -0.000667 0.015227 0.060624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0851319 0.0112450 96.50 < 2e-16 ***
## speed 0.0070858 0.0006914 10.25 1.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02559 on 48 degrees of freedom
## Multiple R-squared: 0.6864, Adjusted R-squared: 0.6798
## F-statistic: 105 on 1 and 48 DF, p-value: 1.124e-13
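For a direct side-by-side look at fit, the adjusted R-squared of both models can be pulled from their summaries; a quick sketch:
# Original model vs. transformed-response model
c(original = summary(lmod)$adj.r.squared,
  transformed = summary(lmod2)$adj.r.squared)
Keep in mind that R-squared values for models with different response scales are not strictly comparable, so this is only a rough indication.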
Adjusted R-squared improves only slightly (from about 0.64 to 0.68). How about the diagnostic plots?
# Diagnostics for the transformed-response model
res2 <- lmod2$residuals
hist(res2)
qqnorm(res2)
qqline(res2)
plot(fitted(lmod2), residuals(lmod2), xlab="fitted", ylab="residuals")
abline(h=0)
ols_test_breusch_pagan(lmod2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -------------------------------------
## Response : dist^0.05
## Variables: fitted values of dist^0.05
##
## Test Summary
## -------------------------------
## DF = 1
## Chi2 = 12.27235
## Prob > Chi2 = 0.0004597196
The distribution of the residuals looks much closer to normal. Note, however, that the Breusch-Pagan test still rejects constant variance (p ≈ 0.0005), so the transformation improved the shape of the residuals more than it resolved the heteroskedasticity. Box-Cox gave us a principled starting point, but the variance structure deserves further attention.
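One practical consequence of the transformation, not covered above, is that predictions from lmod2 come out on the dist^0.05 scale. A minimal sketch of mapping them back to feet (the speed values here are hypothetical, and the naive back-transform ignores retransformation bias):
# Predict on the transformed scale, then undo the power transform (1 / 0.05 = 20)
new_speeds <- data.frame(speed = c(10, 15, 20))
pred_transformed <- predict(lmod2, newdata = new_speeds)
cbind(new_speeds, dist_ft = pred_transformed^(1 / 0.05))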