Load Packages
Load in the cars data and view fisrt 10 rows using head()
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
View basic summary
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
Visualize the relationsip between the two variables
plot(cars$speed, cars$dist, xlab='Speed (mph)', ylab='Stopping Distance (ft)',
main='Stopping Distance vs. Speed')Create a simple regression model to model distance as a function of speed
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Plot the linear model on the scatterplot we produced earlier
plot(cars$speed, cars$dist, xlab='Speed (mph)', ylab='Stopping Distance (ft)',
main='Stopping Distance vs. Speed')
abline(lmod)It seems that this model is a decent fit. It has an adjusted r squared of .6. Lets see how the distribution of our response variable holds against the normal distribution
x <- cars$dist
h<-hist(x, breaks=10, col="red", xlab="Distance",
main="Histogram with Normal Curve")
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)Based on this chart, it seems as if though some sort of transform can help increase model accuracy.
Lets run diagnostics on our model Residuals~There is a skew in our residuals as the spread does not exactly follow a normal distribution
QQ-Plots
QQ-Plots seem to tell the same story. While the model we built is not bad, it can be better.What does constant variance check say?
It looks like variance is constant but one can also make the argument that there is some sort of curve. We need a more robust test. The square root of the residuals make a stronger visual argument that there is no constant variance.
## Warning: package 'olsrr' was built under R version 3.6.3
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## --------------------------------
## Response : dist
## Variables: fitted values of dist
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 4.650233
## Prob > Chi2 = 0.03104933
If we select alpha to be .05, then at the 95% level, we do not have constant variance according to this test. We need some sort of transformation on the response variable.
How do we even determine what transform to apply? Bring in Box-Cox
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:olsrr':
##
## cement
We can transform our response variable by raising it to the power lambda. In this case, the Box-Cox produced a lambda of .05. Our newmodel should be distanced raised to the .05 power.
Does it improve our model? Lets see
##
## Call:
## lm(formula = dist^0.05 ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.078210 -0.012687 -0.000667 0.015227 0.060624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0851319 0.0112450 96.50 < 2e-16 ***
## speed 0.0070858 0.0006914 10.25 1.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02559 on 48 degrees of freedom
## Multiple R-squared: 0.6864, Adjusted R-squared: 0.6798
## F-statistic: 105 on 1 and 48 DF, p-value: 1.124e-13
Adjusted r squared is pretty much the same. How about diagotics plots?
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -------------------------------------
## Response : dist^0.05
## Variables: fitted values of dist^0.05
##
## Test Summary
## -------------------------------
## DF = 1
## Chi2 = 12.27235
## Prob > Chi2 = 0.0004597196
Residuals look much better when it comes to their distribution being normal and constant variance condition is met. A transform by Box-Cox was a good choice.