PROBLEM:

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed.

Replicate the analysis from chapter 3 of your textbook:

- Visualization

- Quality evaluation of the model

- Residual analysis

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

data("cars")

head(cars)

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Create a scatter plot to look at the relationship between the two variables.

Because stopping distance is modeled as a function of speed, we plot speed on the x-axis and stopping distance on the y-axis.

cars %>% ggplot(aes(x = speed, y = dist)) +
  geom_point() +
  ggtitle("Speed vs. Stopping Distance") + 
  xlab("Speed") +
  ylab("Distance") +
  theme(legend.position = "none",
        axis.title.x = element_text(color="black",size=14),
        axis.title.y = element_text(color="black", size=14),
        axis.text.x = element_text(size=6),
        axis.text.y = element_text(size=6),
        plot.title = element_text(color = "black", size=24))

Build a model

In this step we want to build a model in which the predictor, speed, explains the response, stopping distance. We can use the lm() function to build a linear model.

cars_lm <- lm(dist ~ speed, data = cars)

summary(cars_lm)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

From the model summary, the line that best fits the data is

$\widehat{\text{dist}} = -17.579 + 3.932 \times \text{speed}$

We can interpret this as saying that for each 1 mph increase in speed, the predicted stopping distance increases by about 3.932 ft.
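To make the slope interpretation concrete, the sketch below refits the same model and asks it for predictions at a few speeds. The values in `new_speeds` are hypothetical, chosen only for illustration and kept inside the observed speed range; they are not part of the original analysis.

```r
# Refit the same model so this sketch runs on its own.
cars_lm <- lm(dist ~ speed, data = cars)

# Hypothetical speeds (mph), chosen for illustration only.
new_speeds <- data.frame(speed = c(10, 11, 20))

# Predicted stopping distances (ft) from the fitted line.
predicted <- predict(cars_lm, newdata = new_speeds)
print(predicted)

# The gap between the predictions at 10 and 11 mph equals the slope.
slope_check <- unname(predicted[2] - predicted[1])
print(slope_check)
```

Predictions for speeds far outside the observed range should be treated with caution, since the line was fitted only to the speeds in the data.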

We can visualize this line superimposed on our data below.

cars %>% ggplot(aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Speed vs. Stopping Distance") + 
  xlab("Speed") +
  ylab("Distance") +
  theme(legend.position = "none",
        axis.title.x = element_text(color="black",size=14),
        axis.title.y = element_text(color="black", size=14),
        axis.text.x = element_text(size=6),
        axis.text.y = element_text(size=6),
        plot.title = element_text(color = "black", size=24))

## `geom_smooth()` using formula = 'y ~ x'

Evaluating the quality of the model

Distribution of the residuals

When evaluating the quality of the model, we can look at the distribution of the residuals. From the model summary above, the residuals appear to be nearly normally distributed around a mean of 0: the median (-2.272) is close to 0, and the 1st and 3rd quartiles (-9.525 and 9.215) are about the same in magnitude. The min and max (-29.069 and 43.201) differ more, and the max lies more than 1.5 times the interquartile range above the 3rd quartile, which points to a longer right tail.
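The 1.5 × IQR comparison can be checked directly from the residuals. The following is a minimal sketch that refits the model and computes the usual boxplot-style fences; the variable names (`lower_fence`, `upper_fence`, `flagged`) are chosen here for illustration.

```r
# Refit the same model so this sketch runs on its own.
cars_lm <- lm(dist ~ speed, data = cars)
res <- residuals(cars_lm)

# First and third quartiles of the residuals and their spread.
q <- quantile(res, probs = c(0.25, 0.75))
iqr <- unname(q[2] - q[1])

# Boxplot-style fences: 1.5 * IQR beyond each quartile.
lower_fence <- unname(q[1] - 1.5 * iqr)
upper_fence <- unname(q[2] + 1.5 * iqr)
print(c(lower = lower_fence, upper = upper_fence))

# Residuals falling outside the fences.
flagged <- res[res < lower_fence | res > upper_fence]
print(flagged)
```

Any residual printed by the last line sits beyond a fence; comparing `min(res)` and `max(res)` with the fences shows on which side the extreme values fall.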

Standard error for the coefficients

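As a rule of thumb, a good model has standard errors 5-10 times smaller than the corresponding coefficients.

For the intercept, the standard error is about 2.6 times smaller than the coefficient in absolute value (-17.5791 / 6.7584 ≈ -2.601).

For speed, the standard error is about 9.46 times smaller than the coefficient (3.9324 / 0.4155 ≈ 9.464).

Based on the ratio of the speed coefficient to its standard error, we can be reasonably confident that there is relatively little variability in the slope estimate. The intercept falls short of the 5-10 guideline, so it is estimated less precisely.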
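The ratios above can be read straight from the coefficient table returned by `summary()`. As a small sketch (refitting the model so the snippet is self-contained), dividing each estimate by its standard error reproduces the `t value` column of the summary:

```r
# Refit the same model so this sketch runs on its own.
cars_lm <- lm(dist ~ speed, data = cars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- summary(cars_lm)$coefficients

# Ratio of each coefficient to its standard error.
ratio <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]
print(round(ratio, 3))
```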

R-squared

Using $R^2$, we can describe how closely the data are clustered around the linear fit. An R-squared of 0.6511 means that 65.11% of the variability in stopping distance can be explained by the model.
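As a cross-check, $R^2$ can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. For a simple linear regression with one predictor it also equals the squared correlation between the two variables. The sketch below, with variable names chosen for illustration, compares all three:

```r
# Refit the same model so this sketch runs on its own.
cars_lm <- lm(dist ~ speed, data = cars)

# Residual sum of squares and total sum of squares.
sse <- sum(residuals(cars_lm)^2)
sst <- sum((cars$dist - mean(cars$dist))^2)

# R-squared by definition, and via the squared correlation.
r_squared <- 1 - sse / sst
r_from_cor <- cor(cars$speed, cars$dist)^2

print(c(manual = r_squared,
        from_cor = r_from_cor,
        from_summary = summary(cars_lm)$r.squared))
```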

Residual analysis

Residual distribution

# Collect the model residuals in a tibble with a named column for plotting
resid_tbl <- tibble(resid = residuals(cars_lm))

ggplot(resid_tbl, aes(x = resid)) + 
  geom_histogram(bins = 10, fill = "lightgrey", color = "grey") +
  ggtitle("Residual Distribution") + 
  xlab("Residual") +
  ylab("Count") +
  theme(legend.position = "none",
        axis.title.x = element_text(color="black",size=14),
        axis.title.y = element_text(color="black", size=14),
        axis.text.x = element_text(size=6),
        axis.text.y = element_text(size=6),
        plot.title = element_text(color = "black", size=20))

Residual plots

par(mfrow = c(2,2))
plot(cars_lm)

Looking first at the Residuals vs. Fitted plot, we can see that the values are spread fairly evenly above and below 0 and do not follow any particular trend or pattern. Next, in the Q-Q plot, most of the values follow a straight line; however, a few points on the right deviate from the expected line. Despite this, the overall trend of the Q-Q plot indicates a nearly normal distribution.
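To see which observations sit farthest from the line in the Q-Q plot, one option is to rank the standardized residuals, which are the quantities `plot()` draws in that panel for an `lm` fit. This is a sketch rather than part of the original analysis, and the choice of the top three is arbitrary.

```r
# Refit the same model so this sketch runs on its own.
cars_lm <- lm(dist ~ speed, data = cars)

# Standardized residuals, as used in the Q-Q panel of plot(cars_lm).
std_res <- rstandard(cars_lm)

# Row indices of the three largest standardized residuals in absolute value.
top_idx <- order(abs(std_res), decreasing = TRUE)[1:3]

# Show those rows alongside their standardized residuals.
print(cbind(cars[top_idx, ], std_res = round(std_res[top_idx], 2)))
```

Positive values in the `std_res` column correspond to cars that took longer to stop than the fitted line predicts.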

Conclusions

Based on the residual analysis and the statistical output, the linear regression model is a reasonably valid tool for predicting stopping distance from speed. We do not know what other conditions might have affected the stopping distance, so extrapolating this model to other conditions or data may not be appropriate. Other factors, such as surface conditions or type of tires, likely influence stopping distance, so a better model might exist.

Week 11 Homework

Dirk Hartog

2024-04-03