Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis in chapter 3 of your textbook (visualization, evaluation of model quality, and residual analysis).

library(dplyr)
library(ggplot2)
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Data Visualization

# Histogram of speed; bars are colored by their count.
# after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0).
ggplot(data = cars, aes(x = speed)) +
  geom_histogram(aes(fill = after_stat(count)), binwidth = 2) +
  scale_fill_gradient("Count", low = "green", high = "red") +
  labs(title = "Histogram - Speed", x = "Speed (mph)", y = "Count")

# Histogram of stopping distance; bars are colored by their count.
# after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0).
ggplot(data = cars, aes(x = dist)) +
  geom_histogram(aes(fill = after_stat(count)), binwidth = 10) +
  scale_fill_gradient("Count", low = "green", high = "red") +
  labs(title = "Histogram - Distance", x = "Stopping distance (ft)", y = "Count")

ggplot(cars, aes(x=speed, y=dist)) +
  geom_point(size=2, shape=23)

Statistical Analysis

Correlation:
cor(cars$speed, cars$dist)
## [1] 0.8068949
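
To check whether a correlation of this size could plausibly arise by chance, base R's `cor.test` can be applied to the same two columns. The sketch below is only an illustration; the names `ct`, `r_hat`, `p_value`, and `ci` are introduced here and do not appear in the original analysis.

```r
# Pearson correlation test between speed and stopping distance.
# `cars` ships with base R (datasets package); all other names are chosen for this sketch.
ct <- cor.test(cars$speed, cars$dist, method = "pearson")

r_hat   <- unname(ct$estimate)   # sample correlation coefficient
p_value <- ct$p.value            # two-sided p-value for H0: correlation = 0
ci      <- ct$conf.int           # 95% confidence interval for the correlation

print(r_hat)
print(p_value)
print(ci)
```

A small p-value together with a confidence interval that excludes zero would support treating the linear association as real rather than as sampling noise.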

Modeling

We use simple linear regression, \[ \hat{y} = a x + b \] where x is the speed, \hat{y} is the predicted stopping distance (dist), a is the slope of the line, and b is its y-intercept. The lm function fits this model:

car_model <- lm(cars$dist ~ cars$speed)
car_model
## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Coefficients:
## (Intercept)   cars$speed  
##     -17.579        3.932
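
For simple linear regression the least-squares estimates have a closed form: the slope is the sample covariance of speed and distance divided by the variance of speed, and the intercept makes the line pass through the point of means. The following sketch recomputes both by hand and compares them with `lm`; the names `a_hat`, `b_hat`, and `fit` are introduced only for this illustration.

```r
x <- cars$speed
y <- cars$dist

# Closed-form least-squares estimates for y = a * x + b
a_hat <- cov(x, y) / var(x)            # slope
b_hat <- mean(y) - a_hat * mean(x)     # intercept

# Same model fitted with lm(), using the data argument instead of cars$...
fit <- lm(dist ~ speed, data = cars)

print(c(intercept = b_hat, slope = a_hat))
print(coef(fit))
```

Both printouts should agree up to floating-point precision, since `lm` solves the same least-squares problem.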

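The fitted regression model is \[ \widehat{\text{dist}} = 3.932 \times \text{speed} - 17.579 \] so each additional 1 mph of speed adds about 3.93 feet to the predicted stopping distance.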

plot(x = cars$speed, y = cars$dist, main = "Cars Data - R package",
     xlab = "Speed (mph)", ylab = "Distance (feet)")
abline(car_model, col = "red")
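
As a worked example of using the fitted line, the sketch below predicts the stopping distance at 20 mph, a speed chosen here purely for illustration. It refits the model with the `data` argument (the name `fit` is an assumption of this sketch), because `predict` can only match `newdata` columns when the formula refers to plain column names rather than `cars$speed`.

```r
# Refit with column names so that predict() can use newdata
fit <- lm(dist ~ speed, data = cars)

new_speed <- data.frame(speed = 20)    # 20 mph is an arbitrary illustrative value

# Point prediction plus a 95% prediction interval for a single new observation
pred <- predict(fit, newdata = new_speed, interval = "prediction", level = 0.95)
print(pred)

# The point prediction agrees with plugging 20 into the fitted equation by hand
by_hand <- unname(coef(fit)[1] + coef(fit)[2] * 20)
print(by_hand)
```

With the rounded coefficients reported above, this is 3.932 × 20 - 17.579, or about 61.1 feet.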

Evaluating the Quality of the model

summary(car_model)
## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

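The slope estimate (3.9324, standard error 0.4155) is highly significant, with a p-value of 1.49e-12. The multiple R-squared value of 0.6511 means that this model explains about 65.11% of the variation in stopping distance, and the residual standard error is 15.38 on 48 degrees of freedom.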
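
To make the R-squared figure concrete, the following sketch recomputes it from the residual and total sums of squares and checks it against the squared correlation, which coincides with R-squared in simple linear regression. The names `fit`, `sse`, `sst`, and `rse` are introduced only for this example.

```r
fit <- lm(dist ~ speed, data = cars)

sse <- sum(resid(fit)^2)                        # residual sum of squares
sst <- sum((cars$dist - mean(cars$dist))^2)     # total sum of squares

r2_by_hand <- 1 - sse / sst
r2_from_lm <- summary(fit)$r.squared
r2_from_r  <- cor(cars$speed, cars$dist)^2      # squared correlation

print(c(r2_by_hand, r2_from_lm, r2_from_r))

# Residual standard error: sqrt(SSE / residual degrees of freedom)
rse <- sqrt(sse / df.residual(fit))
print(rse)
```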

Residual Analysis

qqnorm(resid(car_model))
qqline(resid(car_model))

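Based on the Q-Q plot of the residuals, we see that the two ends diverge from the Q-Q line. This indicates that the residuals are not exactly normally distributed, particularly in the tails. The residual summary points the same way: the largest positive residual (43.201) is considerably larger in magnitude than the largest negative one (-29.069).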
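
A Q-Q plot is usually read alongside a plot of residuals against fitted values, and a formal normality test can complement the visual impression. The sketch below shows one way to do both with base R; the names `fit`, `res`, and `sw` are assumptions of this example, and the Shapiro-Wilk test is offered as one common choice rather than the method used in the original analysis.

```r
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Residuals against fitted values: a well-specified model shows no clear pattern
plot(fitted(fit), res,
     xlab = "Fitted distance (ft)", ylab = "Residual (ft)",
     main = "Residuals vs fitted values")
abline(h = 0, lty = 2)

# Shapiro-Wilk test: H0 is that the residuals come from a normal distribution
sw <- shapiro.test(res)
print(sw)
```

A small p-value from the test would be evidence against normality of the residuals, while a funnel or curve in the scatter plot would point to non-constant variance or a missing nonlinear term.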

Conclusion

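The data show a correlation of 0.8069 between speed and stopping distance, and the model has a multiple R-squared of 65.11% (in simple linear regression, R-squared is the square of the correlation). Together with the Q-Q plot, whose ends depart from the reference line, this suggests that speed alone is not sufficient to explain stopping distance. We therefore suggest adding other factors to the model to make it more reliable.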
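
The built-in `cars` data contains only `speed` and `dist`, so no additional predictors are available in it. One option within the same data, offered only as a sketch and not as part of the original analysis, is to add a quadratic speed term and compare the two fits; the names `fit_lin` and `fit_quad` are assumptions of this example.

```r
fit_lin  <- lm(dist ~ speed, data = cars)
fit_quad <- lm(dist ~ speed + I(speed^2), data = cars)

# Adjusted R-squared penalizes the extra parameter
print(c(linear    = summary(fit_lin)$adj.r.squared,
        quadratic = summary(fit_quad)$adj.r.squared))

# F-test for whether the quadratic term improves on the straight line
print(anova(fit_lin, fit_quad))
```

If the adjusted R-squared barely changes and the F-test is not significant, the straight-line model is a reasonable summary and any further improvement would have to come from predictors outside this dataset.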