DATA 605 - Homework 11

Using the cars dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

Solution

The cars dataset includes 50 observations of two variables, speed and dist. dist is the stopping distance in feet, and speed is the speed of the car in miles per hour before the brakes were applied.

## [1] "speed" "dist"
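
The chunk that produced the output above is not shown in this write-up. A minimal sketch that lists the same column names and confirms the size of the data, assuming only base R and the built-in cars dataset, might look like this:

```r
# cars ships with base R (package datasets), so no extra packages are needed
data(cars)

names(cars)   # column names: "speed" and "dist"
dim(cars)     # number of rows (observations) and columns
summary(cars) # five-number summary and mean for each variable
```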

Data Visualization

Plot the Stopping Distance vs Speed.

plot(cars$speed, cars$dist, xlab='Speed (mph)', ylab='Stopping Distance (ft)', 
     main='Stopping Distance vs. Speed')

Statistical Analysis

Build a linear regression model and calculate the correlation between speed and distance.

corr = round(cor(cars$speed, cars$dist),4)
print (paste0("Correlation = ",corr))
## [1] "Correlation = 0.8069"
cars_lm <- lm(cars$dist ~ cars$speed)
cars_lm
## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Coefficients:
## (Intercept)   cars$speed  
##     -17.579        3.932
plot(cars$speed, cars$dist, xlab='Speed (mph)', ylab='Stopping Distance (ft)', 
     main='Stopping Distance vs. Speed')
abline(cars_lm)
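
The fitted coefficients give the line dist = -17.579 + 3.932 * speed, so each additional mph adds about 3.9 ft of stopping distance. As a rough sketch of how the model could be used for prediction, the example below refits it with the `data =` form of the formula, which is an assumption made here so that `predict()` can accept new speeds. The name `cars_lm2` and the 20 mph example are illustrative only.

```r
# Same model as cars_lm, written with data = so predict() can take newdata
cars_lm2 <- lm(dist ~ speed, data = cars)

# Predicted stopping distance (ft) at a hypothetical speed of 20 mph
new_speed <- data.frame(speed = 20)
pred_20 <- predict(cars_lm2, newdata = new_speed)

# The same value computed by hand from intercept + slope * speed
by_hand <- coef(cars_lm2)[["(Intercept)"]] + coef(cars_lm2)[["speed"]] * 20

print(round(c(predict = unname(pred_20), by_hand = by_hand), 2))
```

With the coefficients reported above, both values work out to roughly 61 ft.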

The correlation of 0.8069 indicates a fairly strong positive linear relationship between the two variables.

Summarize the linear model.

summary(cars_lm)
## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
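
A few of the numbers in this summary can be cross-checked by hand, which helps when judging the quality of the model. The sketch below uses base R only and the illustrative name `fit`. It confirms that each t value is the estimate divided by its standard error and that \(R^2\) in a one-predictor model equals the squared correlation, and it adds 95% confidence intervals for the coefficients.

```r
fit <- lm(dist ~ speed, data = cars)  # same model, data = form (assumed here)
coefs <- summary(fit)$coefficients

# t value = estimate / standard error (e.g. 3.9324 / 0.4155 for the slope)
t_by_hand <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# With a single predictor, R-squared equals the squared correlation
r_squared <- summary(fit)$r.squared
r_squared_from_cor <- cor(cars$speed, cars$dist)^2

# 95% confidence intervals for intercept and slope
ci <- confint(fit, level = 0.95)

print(round(t_by_hand, 3))
print(round(c(r_squared, r_squared_from_cor), 4))
print(ci)
```

A slope interval that excludes zero agrees with the very small p-value reported for speed.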

Residual Analysis

Fitted Value vs Residuals

plot(cars_lm$fitted.values, cars_lm$residuals, xlab='Fitted Values', ylab='Residuals')
abline(0,0)

A few outlying points suggest that the variance of the residuals may not be constant across the fitted values, but the pattern is not clear. It is reasonable to continue with the analysis and assume roughly constant residual variance.
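
One way to put a rough number on this visual impression is to compare the spread of the residuals for low and high fitted values. This is only an informal check chosen here for illustration, not part of the textbook analysis. The split at the median and the names `half` and `spread` are assumptions.

```r
fit <- lm(dist ~ speed, data = cars)  # same model as cars_lm

# Label each observation by whether its fitted value is below or above the median
half <- ifelse(fitted(fit) <= median(fitted(fit)), "low", "high")

# Standard deviation of the residuals within each half
spread <- tapply(residuals(fit), half, sd)
print(round(spread, 2))

# A ratio far from 1 would point toward non-constant variance
print(round(unname(spread["high"] / spread["low"]), 2))
```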

qqnorm(cars_lm$residuals)
qqline(cars_lm$residuals)

The normal Q-Q plot of the residuals appears to follow the theoretical line. Residuals are reasonably normally distributed.
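
The visual check can be complemented with a formal test. The sketch below uses the Shapiro-Wilk test from base R, which is an addition for illustration and not part of the original analysis. A small p-value would indicate a departure from normality, and with only 50 observations the test and the plot may not fully agree.

```r
fit <- lm(dist ~ speed, data = cars)  # same model as cars_lm

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(fit))
print(sw)
```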

Quadratic Model

speed <- cars$speed
speed2 <- speed^2
dist <- cars$dist
  
cars_qm <- lm(dist ~ speed + speed2)
summary(cars_qm)
## 
## Call:
## lm(formula = dist ~ speed + speed2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.720  -9.184  -3.188   4.628  45.152 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.47014   14.81716   0.167    0.868
## speed        0.91329    2.03422   0.449    0.656
## speed2       0.09996    0.06597   1.515    0.136
## 
## Residual standard error: 15.18 on 47 degrees of freedom
## Multiple R-squared:  0.6673, Adjusted R-squared:  0.6532 
## F-statistic: 47.14 on 2 and 47 DF,  p-value: 5.852e-12
# Predict stopping distance over a grid of speeds and overlay the fitted curve
speedvalues <- seq(0, 25, 0.1)
predicted_dist <- predict(cars_qm, list(speed=speedvalues, speed2=speedvalues^2))
plot(speed, dist, pch=16, xlab='Speed (mph)', ylab='Stopping Distance (ft)')
lines(speedvalues, predicted_dist)

plot(cars_qm$fitted.values, cars_qm$residuals, xlab='Fitted Values', ylab='Residuals')
abline(0,0)

qqnorm(cars_qm$residuals)
qqline(cars_qm$residuals)
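
Because the linear model is nested inside the quadratic one, the two fits can be compared directly. The following sketch is one possible way to do it with base R. It writes the squared term as `I(speed^2)` instead of the separate `speed2` variable, and the names `fit_lin`, `fit_quad`, `cmp` and `quality` are illustrative.

```r
fit_lin  <- lm(dist ~ speed, data = cars)
fit_quad <- lm(dist ~ speed + I(speed^2), data = cars)

# F-test for whether the squared term significantly reduces residual error
cmp <- anova(fit_lin, fit_quad)
print(cmp)

# Side-by-side R-squared and adjusted R-squared for the two models
quality <- rbind(
  linear    = c(r2 = summary(fit_lin)$r.squared,
                adj_r2 = summary(fit_lin)$adj.r.squared),
  quadratic = c(r2 = summary(fit_quad)$r.squared,
                adj_r2 = summary(fit_quad)$adj.r.squared)
)
print(round(quality, 4))
```

Since only one term is added, the p-value of this F-test matches the p-value reported for speed2 in the summary above (0.136).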

Conclusion

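The linear model does a reasonably good job of explaining the data: speed is a highly significant predictor (p = 1.49e-12) and the model accounts for about 65% of the variation in stopping distance (\(R^2\) = 0.6511). The quadratic model does not improve on it by much: its Q-Q plot still shows some deviations, none of its coefficients is individually significant at the 0.05 level, and \(R^2\) rises only to 0.6673. The deviations in the Q-Q plot, together with roughly a third of the variation left unexplained, suggest that speed alone is not sufficient to explain the data. Other factors also need to be considered to predict the stopping distance accurately.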