library(tidyverse)

Introduction

This analysis builds a simple linear regression model that predicts stopping distance from speed, using the “cars” dataset available in R. It follows the structured approach outlined in Chapter 3 of the textbook: exploratory data analysis, model building, evaluation of model quality, and residual analysis.

Methodology

Data Collection and Preparation

This analysis uses “cars”, a dataset built into R. It contains 50 observations of two variables: speed (in miles per hour) and dist, the stopping distance (in feet).
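Before plotting, it can help to confirm the shape of the data. The following is a minimal sketch using base R only; the variable name n_missing is introduced here for illustration and is not part of the original analysis.

```r
# Inspect the built-in cars data (datasets package, attached by default).
str(cars)        # column types and number of observations
summary(cars)    # minimum, quartiles, mean and maximum for each column

# Count missing values per column (n_missing is an illustrative name).
n_missing <- colSums(is.na(cars))
n_missing
```

If any count in n_missing were non-zero, the missing rows would need handling before fitting a model.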

data(cars)

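# Distribution of stopping distance
cars %>%
  ggplot(aes(dist)) +
  geom_histogram(bins = 20)

# Distribution of speed
cars %>%
  ggplot(aes(speed)) +
  geom_histogram(bins = 20)

# Speed versus stopping distance, with a loess smoother to reveal any curvature
cars %>%
  ggplot(aes(speed, dist)) +
  geom_point() +
  geom_smooth(formula = y ~ x, method = 'loess')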

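The histograms suggest that neither speed nor stopping distance is clearly normally distributed, although with only 50 observations spread over 20 bins the shapes are hard to judge. The scatterplot, with its loess smoother, shows an approximately linear, positive relationship between the two variables.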
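The visual impressions can be complemented with simple numerical checks. The sketch below is an addition to the original analysis: it assumes that the Pearson correlation and the Shapiro-Wilk test are acceptable summaries here, and the names r, sw_speed and sw_dist are illustrative.

```r
# Pearson correlation between speed and stopping distance.
r <- cor(cars$speed, cars$dist)
r
r^2  # in a simple linear regression of dist on speed, this equals R-squared

# Shapiro-Wilk tests; a small p-value is evidence against normality.
sw_speed <- shapiro.test(cars$speed)
sw_dist  <- shapiro.test(cars$dist)
c(speed = sw_speed$p.value, dist = sw_dist$p.value)
```

Normality of the individual variables is not itself a requirement of linear regression, so these tests describe the data rather than validate the model.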

Model Building

model <- lm(dist ~ speed, data = cars)

summary(model)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The linear regression model indicates that speed is a statistically significant predictor of stopping distance (\(p = 1.49 \times 10^{-12}\), well below 0.05).

The overall F-test is also significant (F = 89.57 on 1 and 48 degrees of freedom), and speed explains about 65% of the variance in stopping distance (\(R^2 = 0.6511\)). The speed coefficient indicates that each additional 1 mph is associated with an average increase of about 3.93 feet in stopping distance. The negative intercept (-17.58 feet) has no physical interpretation, since a speed of 0 mph lies outside the observed data.
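To make the coefficient interpretation concrete, the fitted model can be used to predict stopping distance at chosen speeds. This is a minimal sketch rather than part of the original analysis; the speeds of 10 and 20 mph and the names new_speeds and predicted are chosen purely for illustration.

```r
# Refit the model so this block is self-contained.
model <- lm(dist ~ speed, data = cars)

# Point estimates and 95% confidence intervals for intercept and slope.
coef(model)
confint(model, level = 0.95)

# Predicted stopping distance at two illustrative speeds.
# By hand, speed = 20 gives roughly -17.58 + 3.93 * 20, about 61.07 feet.
new_speeds <- data.frame(speed = c(10, 20))
predicted <- predict(model, newdata = new_speeds)
predicted

# The 10 mph difference corresponds to 10 times the slope.
diff(predicted)
```

Predictions are only trustworthy within the range of speeds observed in the data.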

Model Visualization

# Observed data with the fitted least-squares line and its confidence band
cars %>%
  ggplot(aes(speed, dist)) +
  geom_point() +
  geom_smooth(formula = y ~ x, method = 'lm', se = TRUE)

The scatterplot with the fitted regression line overlaid indicates a reasonable fit of predicted values to actual values.
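The shaded band that geom_smooth draws with se = TRUE is a confidence interval for the fitted mean, not a prediction interval for individual stops. The sketch below, which is an addition to the original analysis, compares the two using base R; the names grid, conf_band and pred_band are illustrative.

```r
# Refit the model so this block is self-contained.
model <- lm(dist ~ speed, data = cars)

# Evenly spaced speeds across the observed range.
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed), length.out = 50))

# 95% confidence band for the mean response and 95% prediction band
# for a single new observation; both return columns fit, lwr, upr.
conf_band <- predict(model, newdata = grid, interval = "confidence", level = 0.95)
pred_band <- predict(model, newdata = grid, interval = "prediction", level = 0.95)

# Plot the data, the fitted line, and both bands.
plot(dist ~ speed, data = cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
abline(model)
matlines(grid$speed, conf_band[, c("lwr", "upr")], lty = 2, col = "blue")
matlines(grid$speed, pred_band[, c("lwr", "upr")], lty = 3, col = "red")
```

The prediction band is wider at every speed because it also accounts for the scatter of individual observations around the line.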

Residual Analysis

plot(model)

Calling plot(model) produces the standard diagnostic plots. The residuals-versus-fitted plot shows residuals scattered around zero with no strong evidence of heteroscedasticity. The Q-Q plot indicates that the residuals are approximately normally distributed, with a few outlying points in the right tail. The residuals-versus-leverage plot suggests that no single observation exerts undue influence on the parameter estimates.
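The visual reading of the diagnostic plots can be backed up with a few numerical checks. The following is a sketch under stated assumptions, not the original author's method: it uses a Shapiro-Wilk test on the residuals, a hand-rolled studentized Breusch-Pagan style check (an auxiliary regression of squared residuals on speed), and Cook's distance. The names res, sw_res, aux, bp_stat, bp_p and cooks are illustrative.

```r
# Refit the model so this block is self-contained.
model <- lm(dist ~ speed, data = cars)
res <- residuals(model)

# Show the default diagnostic plots together in a 2 x 2 grid.
old_par <- par(mfrow = c(2, 2))
plot(model)
par(old_par)

# Normality of residuals: a small p-value is evidence against normality.
sw_res <- shapiro.test(res)
sw_res$p.value

# Constant variance: regress squared residuals on speed and compare
# n * R-squared with a chi-squared distribution on 1 degree of freedom.
aux <- lm(I(res^2) ~ speed, data = cars)
bp_stat <- nrow(cars) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
bp_p

# Influence: the three largest Cook's distances.
cooks <- cooks.distance(model)
head(sort(cooks, decreasing = TRUE), 3)
```

A small bp_p would point to residual variance that changes with speed, and unusually large Cook's distances would flag observations worth inspecting individually.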

Conclusion

The linear regression model provides a satisfactory fit for predicting stopping distance from speed, explaining about 65% of the variance. Although neither variable looks normally distributed, linear regression places its normality assumption on the residuals rather than on the variables themselves, and the residual diagnostics indicate that the key assumptions are reasonably met. Overall, the analysis supports speed as a useful predictor of stopping distance and underscores the importance of proper model evaluation and residual analysis in regression modeling.