Homework 11

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)

Let's first look at the structure of the dataset.

library(dplyr)  # provides glimpse()
drive <- cars
glimpse(drive)

## Rows: 50
## Columns: 2
## $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13…
## $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34…

Visualization

Let's use a scatter plot to see whether there is a linear relationship in the data.

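library(plotly)  # provides plot_ly() and layout()
# In the cars dataset, speed is recorded in mph and dist in ft
plot_ly(data = drive, x = ~speed, y = ~dist, type = "scatter", mode = "markers") %>%
  layout(
    title = "Speed vs. Stopping Distance",
    xaxis = list(title = "Speed (mph)"),
    yaxis = list(title = "Stopping Distance (ft)")
  )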

There appears to be a roughly linear, positive relationship between speed and stopping distance.
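As a rough numeric companion to the scatter plot, the Pearson correlation between the two columns can be computed. This is a minimal sketch that uses the built-in `cars` data directly; the variable name `r` is only illustrative.

```r
# Pearson correlation between speed and stopping distance
# (values near 1 indicate a strong positive linear association)
r <- cor(cars$speed, cars$dist)
print(r)  # about 0.81
```

A correlation this high is consistent with the linear trend seen in the plot, though it does not by itself show that a straight line is the best functional form.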

Quality evaluation of the model

Let's run a linear regression on the dataset. The intercept is omitted because, in the context of vehicles, a car that is not moving should have no stopping distance; with an intercept included, its estimate was -17.5791, which is not physically meaningful.

model <- lm(dist ~ speed+0, data = drive)
summary(model)

## 
## Call:
## lm(formula = dist ~ speed + 0, data = drive)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.183 -12.637  -5.455   4.590  50.181 
## 
## Coefficients:
##       Estimate Std. Error t value Pr(>|t|)    
## speed   2.9091     0.1414   20.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.26 on 49 degrees of freedom
## Multiple R-squared:  0.8963, Adjusted R-squared:  0.8942 
## F-statistic: 423.5 on 1 and 49 DF,  p-value: < 2.2e-16

The linear regression model predicts stopping distance from the speed of the car. The coefficient for speed is 2.9091, meaning that each one-unit increase in speed (1 mph) increases the predicted stopping distance by about 2.91 units (ft). The coefficient is highly statistically significant (t = 20.58, p < 2e-16), and the model as a whole is also highly significant (F-statistic = 423.5, p < 2.2e-16). The reported Multiple R-squared of 0.8963 suggests a good fit, but because the intercept was dropped, R computes it relative to zero rather than relative to the mean of dist, so it should not be compared directly with the R-squared of a model that includes an intercept. The residual standard error is 16.26 on 49 degrees of freedom.
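To make the slope estimate and the omitted intercept concrete, the sketch below recomputes the through-the-origin slope by hand and fits a comparison model that keeps the intercept. The names `slope_by_hand` and `model_int` are hypothetical and not part of the original analysis.

```r
drive <- cars

# No-intercept fit, as in the analysis above
model <- lm(dist ~ speed + 0, data = drive)

# For regression through the origin, the least-squares slope is sum(x*y) / sum(x^2)
slope_by_hand <- sum(drive$speed * drive$dist) / sum(drive$speed^2)
print(c(lm_slope = unname(coef(model)), by_hand = slope_by_hand))

# Comparison model that keeps the intercept (hypothetical name)
model_int <- lm(dist ~ speed, data = drive)
print(coef(model_int))               # intercept is about -17.5791
print(summary(model_int)$r.squared)  # R-squared measured against mean(dist)
```

Fitting both versions side by side shows how much the slope changes once the line is forced through the origin. The two R-squared values use different baselines, so the comparison is only indicative.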

Residual analysis

In the plot of residuals against fitted values, the residuals tend to increase as we move to the right, and they are not uniformly scattered above and below zero.

plot(fitted(model), resid(model))

The points do not follow a strong curved pattern, so a straight-line model seems reasonable, although the spread of the residuals is not constant.
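The visual impression from the residual plot can be backed up with a couple of simple numbers. The sketch below is one possible check, not part of the textbook procedure: it prints the mean of the residuals and compares their standard deviation in the lower and upper halves of the fitted values. The split at the median and the names `low_half`, `sd_low` and `sd_high` are assumptions made for illustration.

```r
model <- lm(dist ~ speed + 0, data = cars)
res <- resid(model)
fit <- fitted(model)

# Without an intercept, the residuals are not forced to average to zero
print(mean(res))

# Compare residual spread below and above the median fitted value;
# a clearly larger value in the upper half suggests non-constant variance
low_half <- fit <= median(fit)
print(c(sd_low = sd(res[low_half]), sd_high = sd(res[!low_half])))
```

Calling `abline(h = 0, lty = 2)` right after the `plot()` call adds a dashed reference line at zero, which makes any imbalance above and below the line easier to see.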

Q-Q plot

qqnorm(resid(model))
qqline(resid(model))

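The residuals appear to be approximately normally distributed around a mean of zero, which suggests that the model fits the data reasonably well. This follows page 24 of the Linear Regression book, which says “…if the model fits the data well, we would expect the residuals to be normally (Gaussian) distributed around a mean of zero.” The residual summary above (median -5.455, maximum 50.181) does hint at some right skew, so the upper tail of the Q-Q plot deserves a close look.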
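A Q-Q plot is judged by eye, so a formal normality test can serve as a supplementary check. The sketch below applies the Shapiro-Wilk test from base R's `stats` package to the residuals. Using this test is an assumption on our part and not part of the textbook chapter; the name `sw` is illustrative.

```r
model <- lm(dist ~ speed + 0, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed
sw <- shapiro.test(resid(model))
print(sw)
```

A small p-value would be evidence against normality of the residuals, while a large one only means the test found no clear departure. With 50 observations the result should be read alongside the Q-Q plot rather than in place of it.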