Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis from chapter 3 of your textbook (visualization, quality evaluation of the model, and residual analysis).

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

2) Visualizations

# qplot() ignores method/formula, so use ggplot() to get a true lm fit line
ggplot(cars, aes(x = speed, y = dist)) +
   geom_point(color = "red") +
   geom_smooth(method = "lm", formula = y ~ x) +
   labs(title = "Regression Model", x = "speed", y = "dist")

3) Statistical Analysis

Building a linear model

# Bare column names with data = cars, so predict() can match new data by name
model <- lm(dist ~ speed, data = cars)
summary(model)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

4) Residual Analysis

plot(model)

coef(model)
## (Intercept)       speed 
##  -17.579095    3.932409
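
The coefficients above can also be used for prediction. The following is a minimal sketch and not part of the textbook analysis: `fit`, `new_speeds`, `pred` and `by_hand` are hypothetical names, and the speeds 10 and 20 mph are illustrative values chosen for this example. The sketch fits its own copy of the model with bare column names (dist ~ speed), because predict() matches the columns of newdata by name.

```r
# Fit with bare column names so predict() can match `newdata` by name.
fit <- lm(dist ~ speed, data = cars)

# Illustrative speeds in mph (assumed values, chosen only for this example).
new_speeds <- data.frame(speed = c(10, 20))

# Predicted stopping distance via predict().
pred <- predict(fit, newdata = new_speeds)

# The same prediction computed by hand: intercept + slope * speed.
b <- coef(fit)
by_hand <- b[["(Intercept)"]] + b[["speed"]] * new_speeds$speed

# Show both results side by side; the two columns should agree.
print(cbind(new_speeds, pred = unname(pred), by_hand = by_hand))
```

At 10 mph both approaches give about 21.7 ft, i.e. -17.5791 + 3.9324 * 10.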

Conclusion

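  1. The fitted equation for the cars dataset is: dist = -17.5791 + 3.9324 * speed. Each additional 1 mph of speed adds about 3.93 ft of stopping distance.
  2. The R-squared for our model is 0.6511, so speed explains about 65% of the variance in stopping distance. The residual standard error of 15.38 is large relative to the mean distance of 42.98, which indicates that the model is not a perfect fit.
  3. The residuals vs. fitted values plot shows that the residuals are not evenly scattered around zero. The residual summary agrees: the median is -2.272, and the largest positive residual (43.201) is much further from zero than the largest negative one (-29.069).
  4. If the residuals were normally distributed, we would expect the points in the normal Q-Q plot to follow a straight line. With our model, most of the points follow the line, but a few outliers deviate from it. This behavior indicates that the residuals are not completely normally distributed.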
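
The numbers behind conclusions 2 to 4 can be reproduced directly from the residuals. This is a rough sketch under a few assumptions: `fit`, `res`, `r2`, `rse` and `sw` are names introduced here, and the Shapiro-Wilk test is an addition that the original analysis does not use; it is included only as a numeric complement to the Q-Q plot.

```r
# Fit the same simple linear model and extract its residuals.
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

# R-squared from its definition: 1 - SS_res / SS_tot.
ss_res <- sum(res^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)
r2 <- 1 - ss_res / ss_tot

# Residual standard error: sqrt(SS_res / residual degrees of freedom).
rse <- sqrt(ss_res / df.residual(fit))

cat("R-squared:", round(r2, 4), "\n")
cat("Residual standard error:", round(rse, 2), "\n")

# Shapiro-Wilk test on the residuals (an assumption of this sketch, not in
# the original analysis); a small p-value suggests non-normal residuals.
sw <- shapiro.test(res)
print(sw)

# Draw only the two diagnostic panels discussed in the conclusion:
# which = 1 is residuals vs fitted, which = 2 is the normal Q-Q plot.
plot(fit, which = 1:2)
```

The hand-computed values should match the 0.6511 and 15.38 reported by summary(model).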