Simple Linear Regression with Cars

2025-09-21

1) Inspecting the Data

## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

The dataset has 50 observations of speed (mph) and stopping distance (ft).
We expect a positive linear relationship between these two variables.
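
The structure and summary printed above are what `str()` and `summary()` produce. A minimal sketch, assuming the data are R's built-in `cars` data frame (the title and the 50 x 2 shape suggest so):

```r
# Assumption: the data are the `cars` data frame shipped with base R (package datasets).
data(cars)

str(cars)      # structure: 50 obs. of 2 numeric variables
summary(cars)  # min, quartiles, median, and mean for each column

# Quick numeric check of the expected positive association
r <- cor(cars$speed, cars$dist)
r
```

A correlation well above zero would be consistent with the positive relationship anticipated here.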

2) Feature Engineering

##   speed dist speed_kph  log_dist speed_c
## 1     4    2   6.43736 0.6931472   -11.4
## 2     4   10   6.43736 2.3025851   -11.4
## 3     7    4  11.26538 1.3862944    -8.4
## 4     7   22  11.26538 3.0910425    -8.4
## 5     8   16  12.87472 2.7725887    -7.4
## 6     9   10  14.48406 2.3025851    -6.4
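
The three derived columns can be reproduced roughly as follows. This is a sketch: the column names come from the printed output, the factor 1.60934 km per mile is inferred from the ratio `speed_kph / speed` in that output, and `cars_fe` is a hypothetical name for the augmented data frame.

```r
# Assumption: `cars_fe` is an illustrative name; the source does not show the object name.
cars_fe <- transform(
  cars,
  speed_kph = speed * 1.60934,     # mph -> km/h
  log_dist  = log(dist),           # natural log of stopping distance
  speed_c   = speed - mean(speed)  # centered speed, so its mean is 0
)

head(cars_fe)
```

Centering does not change the slope of a regression on speed; it only moves the intercept to the fitted value at the mean speed.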

3) Fitting Models

fit: raw linear regression
fit_log: log-transformed response
fit_cent: centered predictor
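
A sketch of how the three models might be fitted. The object names `fit`, `fit_log`, and `fit_cent` are from the list above; the formulas are assumptions inferred from the one-line descriptions, not taken from the source.

```r
# Assumption: formulas are inferred from the model descriptions.
cars_fe <- transform(cars, log_dist = log(dist), speed_c = speed - mean(speed))

fit      <- lm(dist ~ speed,     data = cars_fe)  # raw linear regression
fit_log  <- lm(log_dist ~ speed, data = cars_fe)  # log-transformed response
fit_cent <- lm(dist ~ speed_c,   data = cars_fe)  # centered predictor

# Centering leaves the slope unchanged; the intercept becomes the fitted
# distance at the mean speed, which for least squares equals mean(dist).
coef(fit)
coef(fit_cent)
```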

4) Model Summaries

## # A tibble: 2 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   -17.6      6.76      -2.60 1.23e- 2   -31.2      -3.99
## 2 speed           3.93     0.416      9.46 1.49e-12     3.10      4.77

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.651         0.644  15.4      89.6 1.49e-12     1  -207.  419.  425.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
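
Tables in this shape are what `tidy()` and `glance()` from the broom package return. A minimal sketch, assuming `fit` is the model `dist ~ speed` and that `conf.int = TRUE` was used to obtain the `conf.low` and `conf.high` columns:

```r
library(broom)  # tidy() and glance() come from the broom package

fit <- lm(dist ~ speed, data = cars)

co <- tidy(fit, conf.int = TRUE)  # coefficient table with 95% confidence limits
gl <- glance(fit)                 # one-row model summary (R^2, sigma, AIC, ...)

co
gl
```

Read together, the two tables say that each additional mph is associated with about 3.93 ft more stopping distance, and that the model explains roughly 65% of the variance in distance with a residual standard error of about 15.4 ft.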

5) Scatter Plot with Regression Line

## `geom_smooth()` using formula = 'y ~ x'
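
A sketch of how such a scatter plot could be built with ggplot2. Only the object name `plt_scatter` appears in the source; the axis labels and layer choices here are illustrative assumptions.

```r
library(ggplot2)

# Assumption: labels and styling are illustrative, not the author's exact plot.
plt_scatter <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm") +  # least-squares line with a 95% confidence band
  labs(x = "Speed (mph)", y = "Stopping distance (ft)")

plt_scatter
```

The `geom_smooth()` message shown above is informational: it reports the default formula `y ~ x` used when none is supplied.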

6) Residuals and Q–Q Plots
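
A sketch of how the two diagnostic plots could be produced, assuming `aug` is the result of `broom::augment()` on the raw model. The names `aug`, `plt_resid`, and `plt_qq` appear in the source; the plotting details are assumptions.

```r
library(broom)
library(ggplot2)

fit <- lm(dist ~ speed, data = cars)
aug <- augment(fit)  # adds .fitted, .resid, .std.resid, ... to the data

# Residuals vs fitted: a funnel shape would indicate non-constant variance
plt_resid <- ggplot(aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")

# Normal Q-Q plot of standardized residuals
plt_qq <- ggplot(aug, aes(sample = .std.resid)) +
  stat_qq() +
  stat_qq_line()
```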

7) Interactive Plot

library(plotly)  # provides ggplotly(), which converts a ggplot into an interactive widget
ggplotly(plt_scatter + geom_smooth(method="lm", se=FALSE, color="#e31a1c"))

## `geom_smooth()` using formula = 'y ~ x'

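8) Predictions with Confidence and Prediction Intervals

In the table, the first fit/lwr/upr triple is the 95% confidence interval for the mean stopping distance at each speed; the second triple is the 95% prediction interval for a single new observation.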

##   speed       fit      lwr      upr       fit        lwr       upr
## 1     5  2.082949 -7.64415 11.81005  2.082949 -30.333587  34.49948
## 2    10 21.744993 15.46192 28.02807 21.744993  -9.809601  53.29959
## 3    15 41.407036 37.02115 45.79292 41.407036  10.174821  72.63925
## 4    20 61.069080 55.24729 66.89088 61.069080  29.603089  92.53507
## 5    25 80.731124 71.59608 89.86617 80.731124  48.487298 112.97495
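
Both interval types come from `predict()` with different `interval` arguments. A minimal sketch, assuming `fit` is `lm(dist ~ speed, data = cars)`; the names `new_speeds`, `conf_int`, and `pred_int` are hypothetical.

```r
fit <- lm(dist ~ speed, data = cars)
new_speeds <- data.frame(speed = c(5, 10, 15, 20, 25))

# Confidence interval: uncertainty about the mean response at each speed
conf_int <- predict(fit, newdata = new_speeds, interval = "confidence")
# Prediction interval: uncertainty about a single new observation
pred_int <- predict(fit, newdata = new_speeds, interval = "prediction")

cbind(new_speeds, conf_int, pred_int)

# Interval widths: prediction intervals are always wider than confidence intervals
data.frame(speed      = new_speeds$speed,
           conf_width = conf_int[, "upr"] - conf_int[, "lwr"],
           pred_width = pred_int[, "upr"] - pred_int[, "lwr"])
```

The prediction interval adds the residual variance to the uncertainty in the fitted mean, which is why it is much wider at every speed.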

9) Inference

We test the null hypothesis:

\[ H_0: \beta_1 = 0 \quad \text{vs} \quad H_a: \beta_1 \neq 0 \]

The slope estimate is 3.93 with standard error 0.416, giving t = 9.46 on 48 degrees of freedom and p = 1.49e-12, so we reject the null hypothesis at any conventional significance level.
There is strong evidence that speed is associated with stopping distance.
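
The test statistic and p-value can be recomputed from the coefficient table. A sketch, assuming the model is `dist ~ speed` on the `cars` data; `t_stat`, `df`, and `p_val` are illustrative names.

```r
fit <- lm(dist ~ speed, data = cars)
co  <- summary(fit)$coefficients

# t statistic for H0: beta_1 = 0 is estimate / standard error
t_stat <- co["speed", "Estimate"] / co["speed", "Std. Error"]
df     <- df.residual(fit)               # n - 2 = 48
p_val  <- 2 * pt(-abs(t_stat), df = df)  # two-sided p-value

c(t = t_stat, df = df, p = p_val)
```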

10) Confidence Interval

For the slope coefficient, with df = n - 2 = 48:

\[ \hat{\beta}_1 \pm t_{\alpha/2, df} \cdot SE(\hat{\beta}_1) \]

The 95% confidence interval, (3.10, 4.77) ft per mph, lies entirely above 0, which is consistent with a positive association between speed and stopping distance.
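
The formula can be evaluated by hand and compared with `confint()`. A sketch, assuming the model `dist ~ speed` on the `cars` data; `b1`, `se_b1`, `t_crit`, and `ci_manual` are illustrative names.

```r
fit <- lm(dist ~ speed, data = cars)
co  <- summary(fit)$coefficients

b1     <- co["speed", "Estimate"]
se_b1  <- co["speed", "Std. Error"]
t_crit <- qt(1 - 0.05 / 2, df = df.residual(fit))  # t_{alpha/2, df} with alpha = 0.05

ci_manual <- b1 + c(-1, 1) * t_crit * se_b1  # estimate -/+ margin of error
ci_manual

confint(fit, "speed", level = 0.95)  # built-in equivalent
```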

11) Saving Analysis Artifacts

saveRDS(list(fit=fit, co=tidy(fit), gl=glance(fit), aug=aug,
             plt_scatter=plt_scatter, plt_resid=plt_resid, plt_qq=plt_qq),
        file = "cars_lm_artifacts.rds")
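
Saved artifacts can be restored later with `readRDS()`. A round-trip sketch using a temporary file and only the fitted model; the path and the reduced list are assumptions made to keep the example self-contained.

```r
fit <- lm(dist ~ speed, data = cars)

# Assumption: a temp file stands in for "cars_lm_artifacts.rds"
path <- tempfile(fileext = ".rds")
saveRDS(list(fit = fit), file = path)

art <- readRDS(path)  # restores the list exactly as it was saved
names(art)
coef(art$fit)
```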

12) Conclusion

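There is a clear positive linear relationship between speed and stopping distance: each additional mph is associated with about 3.93 ft more stopping distance (95% CI 3.10 to 4.77).
Residuals are roughly normal, though their spread increases at higher speeds.
Under the fitted model, predictions are most precise near the mean speed (15.4 mph) and become less precise toward either end of the observed range, as the interval widths in section 8 show.
Both static and interactive plots help visualize these results.
The slope test (t = 9.46, p = 1.49e-12) and R-squared of 0.651 support the conclusion that speed is a strong predictor of stopping distance.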