Assignment Prompt

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

library(ggplot2)
library(DT)

Dataset

?cars
datatable(cars, options = list(scrollX = TRUE))
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Speed and Stopping Distances of Cars

Description

The data give the speed of cars and the distances taken to stop. Note that the data were recorded in the 1920s.

Usage

cars

Format

A data frame with 50 observations on 2 variables.

[,1]  speed  numeric  Speed (mph)
[,2]  dist   numeric  Stopping distance (ft)

Source

Ezekiel, M. (1930) Methods of Correlation Analysis. Wiley.

References

McNeil, D. R. (1977) Interactive Data Analysis. Wiley.

Exploratory Analysis

Independent variable: speed (mph). Dependent variable: dist (stopping distance, ft).

Distribution of the Variables

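Both variables’ histograms appear nearly normal, although stopping distance is somewhat right-skewed: its mean (42.98) is above its median (36). In each plot, the red line marks the mean and the blue line marks the median.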

ggplot(data = cars, aes(x = speed)) + geom_histogram(bins = 10)  +  
  labs(x = "Speed" , y = "count") + 
  ggtitle("Speed Histogram") + theme(plot.title=element_text(hjust=0.5))+
  geom_vline(xintercept = mean(cars$speed), 
             color = "indianred") +
  geom_vline(xintercept = median(cars$speed), 
             color = "cornflowerblue")

ggplot(data = cars, aes(x = dist)) + geom_histogram(bins = 15)  +  
  labs(x = "Stopping Distance (ft)" , y = "count") + 
  ggtitle("Stopping Distance (ft) Histogram") + theme(plot.title=element_text(hjust=0.5)) +
  geom_vline(xintercept = mean(cars$dist), 
             color = "indianred") +
  geom_vline(xintercept = median(cars$dist), 
             color = "cornflowerblue")

Correlation

There is a strong positive relationship between the speed of a car and the distance it needs to stop: the faster the car, the longer the distance required for it to stop.

ggplot(data = cars, aes(x = speed, y = dist)) + geom_point()  +  geom_smooth() +
  labs(x = "Speed" , y = "Stopping Distance (ft)") + 
  ggtitle("Speed v. Stopping Distance (ft)") + theme(plot.title=element_text(hjust=0.5))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
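
To put a number on the relationship described above, Pearson's correlation coefficient can be computed with base R's cor(). This is a minimal sketch; the variable name r is introduced here for illustration only.

```r
# Pearson correlation between speed (mph) and stopping distance (ft)
r <- cor(cars$speed, cars$dist)
round(r, 4)

# For simple linear regression, r^2 equals the model's Multiple R-squared
round(r^2, 4)
```

A value close to 1 indicates a strong positive linear association. The squared value should agree with the Multiple R-squared reported in the model summary (0.6511).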

Linear Model 1

The linear function is:

\[ stopping \; distance = B_{0}+B_{1}\cdot speed\] where

\(B_{0}\) is the intercept.

\(B_{1}\) is the slope.

\[ stopping \; distance = -17.5791+3.9324\cdot speed\]

m1 <- lm(dist ~ speed, data = cars) 
summary(m1)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() + 
  stat_smooth(method = "lm", col = "indianred") +
  ggtitle("Speed v. Stopping Distance (ft)") + theme(plot.title=element_text(hjust=0.5))
## `geom_smooth()` using formula = 'y ~ x'
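
As a worked example of how the fitted equation is used, the sketch below predicts the stopping distance at 20 mph, first by hand from the coefficients and then with predict(). The speed of 20 mph and the names new_speed, by_hand, and by_predict are illustrative choices, not part of the original analysis.

```r
m1 <- lm(dist ~ speed, data = cars)

# Hypothetical input: a car travelling at 20 mph
new_speed <- data.frame(speed = 20)

# By hand: intercept + slope * speed
b <- coef(m1)
by_hand <- unname(b["(Intercept)"] + b["speed"] * new_speed$speed)

# Same calculation via predict(), with a 95% prediction interval
by_predict <- predict(m1, newdata = new_speed, interval = "prediction")

by_hand      # about 61.07 ft (-17.5791 + 3.9324 * 20)
by_predict   # columns: fit, lwr, upr
```

The slope says that each additional 1 mph adds roughly 3.93 ft of stopping distance. The negative intercept has no physical meaning (a stationary car does not stop in negative distance); it only anchors the line within the observed speed range of 4 to 25 mph.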

Model Diagnostics

Linearity and Constant Variability: A residual is the difference between an actual and a predicted value. We want the mean of the residuals to be close to 0 and their spread to be roughly constant across the fitted values. In the plot below, the points are scattered around 0 with no apparent pattern, so both linearity and constant variability appear to be met.

library(ggplot2)
ggplot(data = m1, aes(x = .fitted, y = .resid)) + geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") + xlab("Fitted values") +
  ylab("Residuals")  + 
  ggtitle("Residuals v. Fitted") + theme(plot.title=element_text(hjust=0.5))
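
The visual check above can be complemented with two quick numeric summaries using only base R. This is a rough sketch, not a formal test: the mean residual is zero by construction for a least-squares fit with an intercept, and the correlation between the absolute residuals and the fitted values gives an informal sense of whether the spread changes with the fitted value. The names res, fit, and spread_trend are illustrative.

```r
m1 <- lm(dist ~ speed, data = cars)
res <- residuals(m1)
fit <- fitted(m1)

# Mean of residuals: effectively zero for OLS with an intercept
mean(res)

# Informal check of constant variability: values near 0 suggest
# the residual spread does not change much with the fitted value
spread_trend <- cor(abs(res), fit)
spread_trend
```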

Normality of Residuals: In the Q-Q plot below, most of the points lie roughly on the line, so the assumption of normality appears to be met.

ggplot(data = m1, aes(x = .resid)) + geom_histogram(binwidth = 8) + xlab("Residuals")

ggplot(data = m1, aes(sample = .resid)) + stat_qq() + stat_qq_line(col = "red") + 
  xlab("Theoretical Quantiles") + ylab("Residuals") +
  ggtitle("Normal Q-Q") + theme(plot.title=element_text(hjust=0.5))
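
A formal test can sit alongside the visual normality check. The sketch below applies the Shapiro-Wilk test from base R's stats package to the model residuals; it is offered as one possible complement, and the name sw is illustrative.

```r
m1 <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(m1))
sw$statistic  # W, closer to 1 means closer to normality
sw$p.value    # a small p-value suggests a departure from normality
```

With only 50 observations, the test result is best read together with the Q-Q plot rather than in place of it.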

Conclusion

With all assumptions appearing to be met and an adjusted R-squared of 0.64 for our model, linear regression appears to be an appropriate approach for modeling this dataset.
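
The R-squared values cited from the model summary can be reproduced from their definitions. The sketch below is a minimal check, assuming one predictor (speed); the names n, p, rss, tss, r2, and adj_r2 are illustrative.

```r
m1 <- lm(dist ~ speed, data = cars)

n <- nrow(cars)   # 50 observations
p <- 1            # one predictor (speed)

rss <- sum(residuals(m1)^2)                   # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Should match summary(m1): 0.6511 and 0.6438
round(c(r2 = r2, adj_r2 = adj_r2), 4)
```

In other words, speed accounts for about 65% of the variation in stopping distance, and the adjustment for model size changes that figure only slightly because there is a single predictor.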