Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis from chapter 3 of your textbook (visualization, quality evaluation of the model, and residual analysis).

library(tidyverse)
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Determining Linearity Between Variables

plot(cars[,"speed"],cars[,"dist"], main="Relationship Between Speed and Distance",
xlab="Speed", ylab="Distance")

ggplot(cars,aes(x = `speed`, y = `dist`)) + geom_point() + geom_smooth(method = "lm")

There appears to be a linear relationship between speed and distance. Distance increases as speed increases.
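The visual impression of linearity can also be quantified. The following is a minimal sketch, assuming the built-in cars dataset and base R only; the variable name r is introduced here for illustration and is not part of the original analysis.

```r
# Pearson correlation between speed and stopping distance.
# A value near +1 supports a strong positive linear relationship.
r <- cor(cars$speed, cars$dist)
print(r)

# In simple linear regression with one predictor, the squared correlation
# equals the Multiple R-squared reported by summary(lm(...)).
print(r^2)
```

A correlation coefficient only measures linear association, so it complements the scatter plots rather than replacing them.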

Building Linear Model

# lm() fits by ordinary least squares, which is deterministic, so no random seed is needed
model1 <- lm(dist ~ speed, data = cars)
summary(model1)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

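The p-value for speed (1.49e-12) is far below 0.05, so speed is a statistically significant predictor of stopping distance. The Multiple R-squared of 0.6511 means that speed explains about 65% of the variance in distance; the Adjusted R-squared of 0.6438, which penalizes for the number of predictors, is nearly identical because the model has only one predictor. This suggests that the linear model is a reasonably good fit. The median residual of -2.272 is fairly close to zero, and the first and third quartiles (-9.525 and 9.215) are roughly symmetric, although the maximum residual (43.201) is noticeably larger in magnitude than the minimum (-29.069). The t-value of 9.464 is the ratio of the speed coefficient to its standard error (3.9324 / 0.4155), so the standard error is about 9.5 times smaller than the coefficient. As a rule of thumb, a good model fit has standard errors that are five to ten times smaller than the corresponding coefficients, so the slope is well estimated. The intercept is less precise: its standard error is only about 2.6 times smaller than the estimate.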
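The quantities discussed above can be pulled out of the fitted model directly instead of being read off the printed summary. This is a small sketch using base R accessors; the names coefs and t_speed are illustrative and not part of the original analysis.

```r
# Refit the model so this block is self-contained
model1 <- lm(dist ~ speed, data = cars)

# Coefficient table: columns are Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(model1))

# The t-value is the coefficient estimate divided by its standard error
t_speed <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]
print(t_speed)

# Goodness-of-fit measures and the median residual
print(summary(model1)$r.squared)
print(summary(model1)$adj.r.squared)
print(median(resid(model1)))

# 95% confidence intervals for the intercept and slope
print(confint(model1, level = 0.95))
```

The confidence interval for speed gives a more direct sense of the slope's precision than the rule of thumb about standard errors.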

Residual Analysis

plot(model1)

ggplot(model1,aes(x = `speed`, y = `dist`)) + geom_point() +geom_smooth(method = "lm")

Residual vs. Fitted Values Plot

ggplot(model1, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title="Residual vs. Fitted Values Plot") +
  xlab("Fitted values") +
  ylab("Residuals")

plot(fitted(model1),resid(model1))

Plotting Original Data with Fitted Line

plot(dist ~ speed, data=cars)
abline(model1)

Histogram of Residual Values

ggplot(data = model1, aes(x = .resid)) +
  geom_histogram(binwidth = 10) +
  xlab("Residuals")

hist(model1$residuals, breaks = 10)

Q-Q Plot of Residual Values

qqnorm(resid(model1))
qqline(resid(model1))

par(mfrow=c(2,2))
plot(model1)

Conclusion

Taken together, the residual plots suggest that the model captures the overall trend in the data but does not fully satisfy the normality assumption. In the Residual vs. Fitted Values plot, the residuals are scattered around zero with no discernible pattern. The histogram is right-skewed, with its center to the left of zero, and the right tail of the Q-Q plot diverges from the reference line. Both indicate that the residuals are not normally distributed. So while the summary measures of the regression model show a promising fit, inferences that rely on normally distributed residuals, such as the p-values and confidence intervals, may be unreliable, particularly with only 50 observations. The regression model could possibly be improved by log-transforming the dependent variable dist, or by examining and, where justified, removing outliers from the dataset.
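The visual assessment of normality can be backed up with a formal test, and the log-transformation idea can be tried out directly. The sketch below uses shapiro.test() from base R; the names sw, model_log, and sw_log are hypothetical, and the log-transformed model is only one possible remedy, not a result of the original analysis.

```r
# Refit the original model so this block is self-contained
model1 <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test on the residuals; a small p-value is evidence
# against the hypothesis that the residuals are normally distributed
sw <- shapiro.test(resid(model1))
print(sw)

# Candidate alternative: model log(dist) instead of dist
model_log <- lm(log(dist) ~ speed, data = cars)
sw_log <- shapiro.test(resid(model_log))
print(sw_log)

# Note: R-squared values of model1 and model_log are not directly
# comparable because the response is on a different scale.
print(summary(model_log)$coefficients)
```

If the transformed model is adopted, its residual plots should be checked in the same way as those of model1 before drawing conclusions.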