Cars Analysis

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

library(dplyr)
library(ggplot2)
data(cars)

Let’s take a glimpse of the data.

head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Visualize the Data

ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  labs(title = "Speed vs. Stopping Distance",
       x = "Speed", y = "Stopping Distance") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

The Linear Model

The linear model function is \(\hat{y} = a_0 + a_1 x_1\), where \(\hat{y}\) is the predicted stopping distance and \(x_1\) is the speed.

cars.lm <- lm(dist ~ speed, data=cars)
cars.lm
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932

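The y-intercept is \(a_0 = -17.579\) and the slope is \(a_1 = 3.932\). In the cars dataset speed is recorded in mph and stopping distance in feet, so each additional 1 mph of speed is associated with about 3.93 ft of additional stopping distance.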

The final regression model is \(\hat{y} = -17.579 + 3.932 x_1\).
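As a quick illustration of how the fitted equation can be used, the sketch below predicts stopping distance for a few speeds with predict() and checks the result against the equation evaluated by hand. The speeds (10, 15, 20) and the names new_speeds, predicted, and by_hand are illustrative assumptions, not part of the original analysis; the model is refit inside the snippet only so that it runs on its own.

```r
# Refit the model so this snippet is self-contained
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)

# Illustrative speeds (assumed values, not from the original analysis)
new_speeds <- data.frame(speed = c(10, 15, 20))

# Predictions from the fitted model
predicted <- predict(cars.lm, newdata = new_speeds)

# The same values computed by hand as a_0 + a_1 * speed
a <- coef(cars.lm)
by_hand <- a["(Intercept)"] + a["speed"] * new_speeds$speed

# TRUE if predict() and the hand calculation agree
all.equal(unname(predicted), unname(by_hand))
```

All three illustrative speeds lie inside the observed range of the data (4 to 25); predictions outside that range would be extrapolations.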

We will now add the linear regression line to the plot:

ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  labs(title = "Speed vs. Stopping Distance",
       x = "Speed", y = "Stopping Distance") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method = "lm", formula = y~x, color = "red")  # Add linear regression line

Evaluate the Quality of the Model

In this section, we will use the summary() function to assess how well the model fits the data.

summary(cars.lm)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

With regard to the residuals, the median is near zero (-2.272), the first and third quartiles are roughly the same magnitude (-9.525 and 9.215), and the minimum and maximum are of comparable magnitude (-29.069 and 43.201), although the larger maximum hints at a slight right skew. Overall, the residuals are roughly symmetric about zero, which is a sign that the model is well fit.
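The residual summary discussed above can be reproduced directly from the model object. The following is a minimal sketch; the names res and skew_ratio are illustrative, and the model is refit only so that the snippet runs on its own.

```r
# Refit the model so this snippet is self-contained
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)
res <- residuals(cars.lm)

# Min, quartiles, and max of the residuals, as in the
# "Residuals" block of summary(cars.lm)
quantile(res)

# With an intercept in the model, least-squares residuals average to zero
mean(res)

# Largest positive residual relative to the largest negative one;
# a value near 1 would indicate symmetric tails
skew_ratio <- max(res) / abs(min(res))
skew_ratio
```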

The t value for speed is 9.464; it is the coefficient estimate divided by its standard error (3.9324 / 0.4155). For a well-fit model, we want to see a standard error that is at least 5 to 10 times smaller than the corresponding coefficient. Here the standard error for speed is about 9.5 times smaller than the coefficient, which is a good sign that the model is well fit. The intercept is estimated less precisely: its standard error (6.7584) is only about 2.6 times smaller than the estimate.
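To make the relationship between estimate, standard error, and t value concrete, the sketch below recomputes the t values from the coefficient table. The names coef_table and t_manual are illustrative; the model is refit only so that the snippet runs on its own.

```r
# Refit the model so this snippet is self-contained
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(cars.lm))

# t value = estimate / standard error
t_manual <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]

# Side-by-side comparison with the t values reported by summary()
cbind(manual = t_manual, reported = coef_table[, "t value"])
```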

The p-value for speed is very small (1.49e-12), so we can reject the null hypothesis that the slope is zero. In other words, there is strong evidence of a linear relationship between stopping distance and speed.
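The p-value follows from the t value and the residual degrees of freedom. As a minimal sketch (the names t_speed, df_resid, p_manual, and p_reported are illustrative), it can be recomputed as a two-sided tail probability of Student's t distribution:

```r
# Refit the model so this snippet is self-contained
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)

t_speed  <- coef(summary(cars.lm))["speed", "t value"]
df_resid <- df.residual(cars.lm)  # n - 2 = 48 for this model

# Two-sided p-value: probability of a |t| at least this large
# under the null hypothesis that the slope is zero
p_manual <- 2 * pt(abs(t_speed), df = df_resid, lower.tail = FALSE)

# Compare with the value reported by summary()
p_reported <- coef(summary(cars.lm))["speed", "Pr(>|t|)"]
all.equal(p_manual, p_reported)
```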

The multiple R-squared is 0.6511, which means that 65.11% of the variability in stopping distance is explained by speed. For a model with a single predictor, this is a good sign that the linear model is well fit.
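The "share of variability explained" reading of R-squared can be checked by hand. The sketch below computes it as one minus the ratio of the residual sum of squares to the total sum of squares, and also as the squared correlation between speed and distance, which is equivalent for a simple linear regression. The names ss_res, ss_tot, r2_manual, and r2_cor are illustrative.

```r
# Refit the model so this snippet is self-contained
data(cars)
cars.lm <- lm(dist ~ speed, data = cars)

# Residual and total sums of squares
ss_res <- sum(residuals(cars.lm)^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# R-squared = 1 - SS_res / SS_tot
r2_manual <- 1 - ss_res / ss_tot

# For a single predictor, R-squared equals the squared Pearson correlation
r2_cor <- cor(cars$speed, cars$dist)^2

c(manual = r2_manual, from_cor = r2_cor,
  reported = summary(cars.lm)$r.squared)
```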

Residual Analysis

Fitted Values vs Residuals

We will first look at the fitted values vs residuals plot to determine if there are any obvious trends.

fitted_values <- fitted(cars.lm)
residuals <- residuals(cars.lm)

plot_data <- data.frame(Fitted_Values = fitted_values, Residuals = residuals)

# fitted values vs residuals plot
ggplot(plot_data, aes(x = Fitted_Values, y = Residuals)) +
  geom_point() +
  labs(title = "Fitted Values vs. Residuals",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "blue")

From the above Fitted Values vs. Residuals plot, we can see that the points are scattered about 0 with no apparent trend. This is a good sign that the model is well fit.

QQ Plot

Next, we will look at the QQ plot.

ggplot(data.frame(Residuals = residuals), aes(sample = Residuals)) +
  stat_qq() +
  stat_qq_line() +  # Reference line through the first and third quartiles
  labs(title = "QQ Plot of Residuals",
       x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) 

From the above QQ plot, we can see that the majority of the points lie approximately on the reference line, with some departure in the tails. It is therefore reasonable to treat the residuals as approximately normal, which is a good sign that the linear model is well fit.

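# Base R diagnostic plots (residuals vs fitted, normal Q-Q,
# scale-location, residuals vs leverage) arranged in a 2x2 grid
par(mfrow=c(2,2))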
plot(cars.lm)

Conclusion

In conclusion, there are multiple signs that the linear model is well fit: the R-squared value is reasonably high at approximately 65.1%, the p-value for speed is very small, the majority of the points on the QQ plot fall along the reference line, and the fitted values vs. residuals plot shows points scattered about zero with no apparent trend.