Week 11 Assignment

Using the “cars” data set in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

The cars data set is built into R and can be accessed through the variable “cars”.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

Data Exploration

Inspecting the head of the data frame, we can see that the data set has two columns: speed and stopping distance.

head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Printing the dimensions of the data frame, we can see that there are 50 observations in total.

dim(cars)
## [1] 50  2

Plotting the histogram for the speed column, we can see that it is roughly symmetric, with at most a slight skew (the mean of 15.4 mph is very close to the median of 15 mph).

ggplot(cars, aes(x=speed)) +
  geom_histogram(binwidth = 5, fill="grey", color="black") +
  labs(title="Histogram of Car Speeds", x="Speed (mph)", y="Frequency") +
  theme_minimal()

The histogram of stopping distance reveals a possible outlier at 120 feet. The bulk of the distribution lies between roughly 20 and 40 feet, and it is right skewed.

ggplot(cars, aes(x=dist)) +
  geom_histogram(binwidth = 5, fill="grey", color="black") +
  labs(title="Histogram of Car Stopping Distance", x="Stopping Distance (ft)", y="Frequency") +
  theme_minimal()
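The visual impression of skew from a histogram depends on the bin width, so a quick numeric check can help. The sketch below uses a small moment-based `skewness` helper, which is a hypothetical function written here for illustration and not part of the original analysis; a negative value suggests left skew and a positive value suggests right skew.

```r
# Hypothetical helper: moment-based sample skewness (not from the original analysis)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Skewness of each column of the built-in cars data set
speed_skew <- skewness(cars$speed)
dist_skew <- skewness(cars$dist)
c(speed = speed_skew, dist = dist_skew)

# Comparing the mean and median of each column gives a similar picture
summary(cars)
```

A mean above the median, as for stopping distance, is another informal sign of right skew.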

The scatter plot below shows the relationship between speed and stopping distance. There appears to be a positive relationship: the faster the speed, the greater the stopping distance. This makes sense logically, because the faster a car is going, the more time it needs to stop and therefore the longer the distance it covers before stopping.

ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +  # Add points to scatter plot
  theme_minimal() + 
  labs(title = "Relationship Between Speed and Stopping Distance",
       x = "Speed (mph)",
       y = "Stopping Distance (ft)")

Linear Model

# Linear regression: dist ~ speed
cars_model <- lm(dist ~ speed, data = cars)

# Model summary
summary(cars_model)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

We have a y-intercept of -17.5791 and a slope of 3.9324. Taken literally, this means that a car going 0 mph would be expected to need about -17.6 feet to stop, and that each additional mph adds about 3.9 feet of stopping distance. Intuitively, a car going 0 mph needs 0 feet to stop, so a negative stopping distance does not make sense; the intercept is an extrapolation below the slowest speed in the data (4 mph) and should not be interpreted physically. The slope of about 3.9 feet per mph, however, supports the positive relationship between speed and stopping distance.
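As a worked example of how the coefficients translate into predictions, the sketch below refits the same model and predicts stopping distance at 10 and 20 mph. These two speeds are chosen arbitrarily for illustration, and both lie inside the observed speed range.

```r
# Refit the model so this example is self-contained
cars_model <- lm(dist ~ speed, data = cars)

# Predict stopping distance (ft) at two illustrative speeds (mph)
new_speeds <- data.frame(speed = c(10, 20))
predicted_dist <- predict(cars_model, newdata = new_speeds)
predicted_dist

# Observed speed range; predictions outside it are extrapolations
range(cars$speed)
```

Doubling the speed from 10 to 20 mph adds ten times the slope, roughly 39 feet, to the predicted stopping distance.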

Residual Analysis

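Investigating the rest of the summary output reveals that the residuals are roughly symmetric about a median of -2.272. Ideally, the median would be close to 0 (the mean of the residuals is 0 by construction for a least-squares fit with an intercept), and -2.272 is small relative to the residual standard error of 15.38. The first and third quartiles (-9.525 and 9.215) are similar in magnitude, but the min value of -29.069 and max value of 43.201 show that the residuals are slightly skewed to the right.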
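The Residuals block of `summary()` reports the minimum, quartiles, and maximum, but not the mean. A minimal check of both the mean and the median of the residuals might look like this:

```r
# Refit the model so this example is self-contained
cars_model <- lm(dist ~ speed, data = cars)
res <- resid(cars_model)

# For least squares with an intercept, the mean is zero up to floating-point error
res_mean <- mean(res)
# The median matches the value shown in the summary output
res_median <- median(res)

c(mean = res_mean, median = res_median)
```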

The multiple \(R^2\) value of 0.6511 shows that the model fits the data reasonably well. Ideally, we would like an \(R^2\) value closer to 1, but a value of 0.6511 means that about 65% of the variability in stopping distance can be explained by speed. While 65% is not 100%, the model still captures a substantial share of the variability.
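To make the "share of variability explained" interpretation concrete, \(R^2\) can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The variable names below are illustrative.

```r
# Refit the model so this example is self-contained
cars_model <- lm(dist ~ speed, data = cars)

# Residual sum of squares: variation left unexplained by the model
sse <- sum(resid(cars_model)^2)
# Total sum of squares: variation of dist around its mean
sst <- sum((cars$dist - mean(cars$dist))^2)

r_squared <- 1 - sse / sst
r_squared

# Should agree with the value reported by summary()
summary(cars_model)$r.squared
```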

We have a p-value of 1.49e-12, which is well below the significance threshold of 0.05, suggesting that there is a statistically significant relationship between car speed and stopping distance.
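The overall p-value comes from the F-statistic and its degrees of freedom. As a sketch, it can be recovered from the fitted model with `pf()`; with a single predictor it coincides with the p-value of the t-test on the slope.

```r
# Refit the model so this example is self-contained
cars_model <- lm(dist ~ speed, data = cars)
model_summary <- summary(cars_model)

# fstatistic holds the F value and its numerator/denominator degrees of freedom
f <- model_summary$fstatistic
p_from_f <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# p-value of the t-test on the speed coefficient
p_from_t <- model_summary$coefficients["speed", "Pr(>|t|)"]

c(F_test = p_from_f, t_test = p_from_t)
```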

ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Impact of Car Speed on Car Stopping Distance",
       x = "Speed (mph)",
       y = "Stopping Distance (ft)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

plot(fitted(cars_model), resid(cars_model))

There is no strong pattern in the residual plot above. The spread of the residuals is roughly constant for the most part, although it increases somewhat for the larger fitted values. Linear regression assumes that the errors are independent of one another (not that speed is independent of stopping distance, which the model shows is not the case), and the absence of an obvious trend in the residuals is consistent with that assumption. A few outliers are also present, and these may need to be investigated before deciding whether to remove them or adjust the model.
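To see which observations sit farthest from the fitted line, one option is to rank the residuals by absolute size. This is a minimal sketch; looking at the top three is an arbitrary choice made for illustration.

```r
# Refit the model so this example is self-contained
cars_model <- lm(dist ~ speed, data = cars)
res <- resid(cars_model)

# Row indices of the three largest residuals in absolute value
largest <- order(abs(res), decreasing = TRUE)[1:3]

# Show those observations next to their fitted values and residuals
cbind(cars[largest, ],
      fitted = fitted(cars_model)[largest],
      residual = res[largest])
```

The row with the 120-foot stopping distance appears at the top of this list, which matches the possible outlier seen in the histogram.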

qqnorm(resid(cars_model))
qqline(resid(cars_model))

The middle values of the plot follow the reference line, which is consistent with a normal distribution, but the tails do not. The right tail in particular shows more extreme values than would be expected if the residuals were perfectly normal. This indicates that the residuals are right skewed, although the center of the distribution is approximately normal.
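A formal test can complement the visual reading of the Q-Q plot. The sketch below applies the Shapiro-Wilk test from base R to the residuals. This test is not part of the original analysis, and with only 50 observations its result should be read alongside the plot rather than instead of it.

```r
# Refit the model so this example is self-contained
cars_model <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normally distributed
normality_test <- shapiro.test(resid(cars_model))
normality_test

# A small p-value is evidence against normality of the residuals
normality_test$p.value
```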

par(mfrow=c(2,2))
plot(cars_model)

Conclusion

In conclusion, there is a significant relationship between speed and stopping distance. The model has a significant p-value of 1.49e-12 and an \(R^2\) value of 0.6511. Although the \(R^2\) is not very close to 1, a value of 0.6511 still indicates a reasonably good fit. The residual plots showed that, while there were some outliers and the residuals were right skewed, the error variance was mostly constant. This analysis shows that speed is a significant factor in stopping distance.