data_605_hw11

Using the cars dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

First, let’s load the data and take a look of it.

attach(cars)
head(cars); dim(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
## [1] 50  2

What is the correlation of the two variables?

r = with(cars, cor(speed, dist))
print(glue("The correlation between the two variables is approximately ", round(r, 3)))
## The correlation between the two variables is approximately 0.807
with(cars, 
     plot(speed, dist, 
          xlab = 'Speed (mph)', 
          ylab = 'Stopping Distance (ft)', 
          main = 'Stopping Distance vs. Speed'))

Let’s build a lm model and fit a regression line to the chart.

mod = lm(dist ~ speed, data = cars)

with(cars, 
     plot(speed, dist, 
          xlab = 'Speed (mph)', 
          ylab = 'Stopping Distance (ft)', 
          main = 'Stopping Distance vs. Speed'))

abline(mod, col = "red", lty = "dashed")

Let’s look at the summary of the model.

summary(mod)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
par(mfrow = c(2, 2))
plot(mod)

First of all, there’s a descent fit between the two variables. There’s a strong positive linear relationship, i.e. r = .8 while the adjusted r-squared is approximately .64. However, the diagnostic plots reveal that there’s a skew in our residuals as the speed does not exactly follow a normal distribution. The QQ-plot tells the same story and we also need to take care of the outliers.

Let’s raise the independent variable by power 2.

cars$speed2 = speed^2

mod2 = lm(dist ~ speed2, data = cars)
summary(mod2)
## 
## Call:
## lm(formula = dist ~ speed2, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.448  -9.211  -3.594   5.076  45.862 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.86005    4.08633   2.168   0.0351 *  
## speed2       0.12897    0.01319   9.781  5.2e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.05 on 48 degrees of freedom
## Multiple R-squared:  0.6659, Adjusted R-squared:  0.6589 
## F-statistic: 95.67 on 1 and 48 DF,  p-value: 5.2e-13
with(cars, 
     plot(speed2, dist, 
          xlab = 'Speed (mph)^2', 
          ylab = 'Stopping Distance (ft)', 
          main = 'Stopping Distance vs. Speed'))

abline(mod2, col = "red", lty = "dashed")

par(mfrow = c(2, 2))
plot(mod2)

We see some slight improvement from the transformation.

par(mfrow = c(1, 2))
res1 = mod$residuals
res2 = mod2$residuals

hist(res1, main = "residuals from model 1")
hist(res2, main = "residuals from model 2")

The residuals look much better with a look of normal distribution and constant variance. A better strategy would need to apply Box-Cox transformation for better accessment and evaluation.