data_605_hw11
Using the cars dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).
First, let’s load the data and take a look at it.
attach(cars)
head(cars); dim(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
## [1] 50  2
What is the correlation of the two variables?
library(glue)
r = with(cars, cor(speed, dist))
print(glue("The correlation between the two variables is approximately {round(r, 3)}"))
## The correlation between the two variables is approximately 0.807
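As a quick optional check (not part of the original output), cor.test() would also attach a p-value and a confidence interval to this correlation; the call below is only a sketch and its output is not reproduced here.
# Optional sketch: significance test and 95% CI for the correlation
# (not part of the original analysis; output not shown)
with(cars, cor.test(speed, dist))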
with(cars,
plot(speed, dist,
xlab = 'Speed (mph)',
ylab = 'Stopping Distance (ft)',
main = 'Stopping Distance vs. Speed'))

Let’s build an lm model and fit a regression line to the chart.
mod = lm(dist ~ speed, data = cars)
with(cars,
plot(speed, dist,
xlab = 'Speed (mph)',
ylab = 'Stopping Distance (ft)',
main = 'Stopping Distance vs. Speed'))
abline(mod, col = "red", lty = "dashed")

Let’s look at the summary of the model.
summary(mod)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
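From these coefficients the fitted line is roughly dist ≈ -17.58 + 3.93 × speed. Purely as an illustration (not part of the original assignment), predict() could be used to estimate the stopping distance at an example speed; the 21 mph below is an arbitrary value.
# Illustrative sketch: predicted stopping distance at an arbitrary 21 mph,
# with a 95% prediction interval (not part of the original analysis)
predict(mod, newdata = data.frame(speed = 21), interval = "prediction")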
par(mfrow = c(2, 2))
plot(mod)

First of all, there is a decent fit between the two variables: the linear relationship is strongly positive (r ≈ 0.81) and the adjusted R-squared is approximately 0.64. However, the diagnostic plots reveal some skew in the residuals, which do not exactly follow a normal distribution. The Q-Q plot tells the same story, and we also need to take care of a few outliers.
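To back up these visual impressions, one possible (non-original) sketch is to run a Shapiro-Wilk test on the residuals and to flag influential observations with Cook's distance; the 4/n cutoff used below is only a common rule of thumb.
# Sketch, not in the original write-up:
# Shapiro-Wilk tests normality of the residuals;
# Cook's distance above ~4/n flags potentially influential points
shapiro.test(residuals(mod))
which(cooks.distance(mod) > 4 / nrow(cars))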
Let’s raise the independent variable to the power of 2.
cars$speed2 = cars$speed^2
mod2 = lm(dist ~ speed2, data = cars)
summary(mod2)
##
## Call:
## lm(formula = dist ~ speed2, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.448  -9.211  -3.594   5.076  45.862
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.86005    4.08633   2.168   0.0351 *  
## speed2       0.12897    0.01319   9.781  5.2e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.05 on 48 degrees of freedom
## Multiple R-squared: 0.6659, Adjusted R-squared: 0.6589
## F-statistic: 95.67 on 1 and 48 DF, p-value: 5.2e-13
with(cars,
plot(speed2, dist,
xlab = 'Speed (mph)^2',
ylab = 'Stopping Distance (ft)',
main = 'Stopping Distance vs. Speed^2'))
abline(mod2, col = "red", lty = "dashed")

par(mfrow = c(2, 2))
plot(mod2)

We see some slight improvement from the transformation: the adjusted R-squared rises from about 0.64 to about 0.66 and the residual standard error drops from 15.38 to 15.05.
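One way (not used in the original write-up) to make this comparison a bit more concrete would be to compare the two fits by AIC, where a lower value indicates a better trade-off between fit and complexity; a minimal sketch:
# Sketch: compare the two fits by AIC (lower is better)
AIC(mod, mod2)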
par(mfrow = c(1, 2))
res1 = mod$residuals
res2 = mod2$residuals
hist(res1, main = "residuals from model 1")
hist(res2, main = "residuals from model 2")

The residuals from the second model look closer to a normal distribution with roughly constant variance. A better strategy would be to apply a Box-Cox transformation for a more systematic assessment and evaluation, as sketched below.
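The Box-Cox step itself is not carried out in this write-up; a minimal sketch, assuming the MASS package is available and using an arbitrary lambda grid, might look like this:
# Sketch only: profile the Box-Cox log-likelihood for the original model
# and pull out the lambda that maximizes it
library(MASS)
bc = boxcox(mod, lambda = seq(-1, 2, by = 0.05))
(lambda_hat = bc$x[which.max(bc$y)])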