Question 1:

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

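library(ggplot2)  # ggplot(), geom_*(), theme()
library(ggpubr)   # ggarrange() for the combined panel (assumed to come from ggpubr)

fit <- lm(dist ~ speed, data = cars)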

Linearity

linfit <- ggplot(cars, aes(x = speed, y = dist))+
  geom_point(na.rm = TRUE, color = 'darkturquoise')+
  geom_smooth(color = 'darkseagreen')+
  geom_smooth(method = "lm", color = 'red', se = FALSE)+
  ggtitle('Relationship Between Speed and Distance')+
  xlab('Speed')+
  ylab('Distance')+
  theme(plot.title = element_text(hjust = 0.5, size = 7.5))

linfit
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Regression Analysis

summary(fit)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The multiple R-squared value is 0.6511 (adjusted: 0.6438), so speed explains about 65% of the variability in stopping distance. The p-value for the speed coefficient is \(1.49 \times 10^{-12}\), far below 0.05, so the slope differs significantly from zero. Together, these results indicate that speed is a useful predictor of stopping distance. From the visualization, the relationship between speed and stopping distance appears positive and roughly linear.
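The values quoted above can also be extracted from the fitted object instead of being read off the printed summary. The following is a minimal sketch in base R; the names `s`, `r2`, `adj_r2`, `slope_p`, and `ci` are illustrative and not part of the original analysis.

```r
# Refit the model so this block is self-contained
fit <- lm(dist ~ speed, data = cars)
s   <- summary(fit)

r2      <- s$r.squared                          # multiple R-squared
adj_r2  <- s$adj.r.squared                      # adjusted R-squared
slope_p <- coef(s)["speed", "Pr(>|t|)"]         # p-value for the speed slope
ci      <- confint(fit, "speed", level = 0.95)  # 95% confidence interval for the slope

print(round(c(r2 = r2, adj_r2 = adj_r2), 4))
print(slope_p)
print(ci)
```

A confidence interval for the slope that excludes zero tells the same story as the small p-value, and also shows the plausible range for the change in stopping distance per unit of speed.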

sct <- ggplot(data = fit, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = 'red') +
  xlab("Fitted values") +
  ylab("Residuals") +
  ggtitle("Linearity of Residuals")+
  theme(plot.title = element_text(hjust = 0.5))


sct

The residuals are not perfectly evenly distributed around the zero line, but there is no major skew above or below it and no noticeable pattern in the plot of residuals vs. fitted values. The constant variability condition is reasonably met.
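A numeric check can complement the visual judgment of constant variability. The sketch below computes a Breusch-Pagan style statistic by hand (Koenker's studentized form: the sample size times the R-squared of a regression of the squared residuals on the predictor). This test is not part of the original analysis, and the names `aux`, `bp_stat`, and `bp_p` are illustrative.

```r
# Refit the model so this block is self-contained
fit <- lm(dist ~ speed, data = cars)

# Auxiliary regression: squared residuals on the predictor
aux_data <- data.frame(sq_res = resid(fit)^2, speed = cars$speed)
aux      <- lm(sq_res ~ speed, data = aux_data)

# Test statistic: n * R^2 of the auxiliary fit, compared to chi-square with 1 df
n       <- nrow(cars)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(statistic = bp_stat, p_value = bp_p))
```

A small p-value would suggest that the residual variance changes with speed, while a large one is consistent with constant variability. With only 50 observations, the result is best read alongside the plot rather than in place of it.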

hst <- ggplot(data = fit, aes(x = .resid)) +
  geom_histogram(binwidth = 5) +
  xlab("Residuals") +
  ggtitle("Histogram of Residuals")+
  theme(plot.title = element_text(hjust = 0.5))
hst

The histogram of the residuals shows a nearly normal distribution.
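A histogram's shape depends on the chosen bin width, so a formal test can serve as a second opinion. The sketch below applies the Shapiro-Wilk test from base R to the residuals. This test is an addition and not part of the original analysis.

```r
# Refit the model so this block is self-contained
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed
sw <- shapiro.test(resid(fit))
print(sw)
```

A W statistic close to 1 indicates residuals close to normal. A small p-value would point to a departure from normality, which should then be weighed against what the histogram and the normal probability plot show.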

npp <- ggplot(data = fit, aes(sample = .resid)) +
  stat_qq()+
  ggtitle("Normal Probability Plot of Residuals")+
  theme(plot.title = element_text(hjust = 0.5))

npp

The normal probability plot is close to linear for roughly the first 75% of the points. It departs from a straight line at the largest theoretical quantiles. For the most part, the normal probability plot is linear.
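One way to put a number on how linear the normal probability plot is: compute the correlation between the theoretical and sample quantiles. This is a rough sketch in base R, not part of the original analysis; `qq` and `qq_cor` are illustrative names.

```r
# Refit the model so this block is self-contained
fit <- lm(dist ~ speed, data = cars)

# qqnorm() returns the plotted points: x = theoretical quantiles, y = sample quantiles
qq     <- qqnorm(resid(fit), plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)

print(qq_cor)
```

A correlation near 1 means the points lie close to a straight line. The statistic summarizes the whole plot, so it can still look high when only a few points in one tail deviate.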

plt <- ggarrange(linfit,sct,hst,npp,
                 nrow = 2, ncol = 2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
plt

The linear model is appropriate. The residuals show constant variability, the residuals are nearly normal, and the scatterplot of the data and the normal probability plot are nearly linear. The equation for the linear model is \(\text{Stopping Distance} = -17.5791 + 3.9324 \times \text{Speed}\).
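As a usage example, the fitted equation can be applied to a new speed value. The sketch below uses a hypothetical speed of 15 (the `cars` data records speed in mph and distance in ft) and compares `predict()` against the equation evaluated by hand; `new_speed`, `pred`, and `by_hand` are illustrative names.

```r
# Refit the model so this block is self-contained
fit <- lm(dist ~ speed, data = cars)

# Hypothetical new observation, chosen inside the observed speed range (4 to 25 mph)
new_speed <- data.frame(speed = 15)

# Point prediction with a 95% prediction interval for a single new stop
pred <- predict(fit, newdata = new_speed, interval = "prediction", level = 0.95)
print(pred)

# The same point estimate from the rounded coefficients in the equation
by_hand <- -17.5791 + 3.9324 * 15   # 41.4069
print(by_hand)
```

The prediction interval is much wider than the point estimate alone suggests, which reflects the residual standard error of 15.38 reported in the summary. Predictions for speeds outside the observed range would be extrapolation and are less reliable.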