Fundamentals of Simple Linear Regression

Modeling the Relationship Between Car Speed and Stopping Distance

Author

Abdullah Al Shamim

Published

February 27, 2026

Introduction

Simple linear regression is a statistical method that models the relationship between a single independent variable (\(X\)) and a dependent variable (\(Y\)) as a straight line. In this lesson, we analyze R’s built-in cars dataset to see how a car’s speed relates to its stopping distance.


1. Data Visualization: The Regression Plot

Before diving into the numbers, we visualize the data points and the line of best fit.

Code
library(tidyverse)

# Scatter plot of the raw data with the least-squares line overlaid
cars %>%
  ggplot(aes(x = speed, y = dist)) +
  geom_point(size = 3, color = "#cc00cc") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "#5f008f") +
  theme_test() +
  labs(title = "Speed of Car vs. Stopping Distance",
       x = "Speed of Car (mph)",
       y = "Distance taken to Stop (ft)") +
  theme(plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 12, face = "bold")) +
  # The values in this label are copied by hand from the model summary in Section 2
  annotate("text", x = 10, y = 100,
           label = "Intercept = -17.58\nSlope = 3.93\np-value < 0.05\nR-squared = 0.65",
           color = "black", fontface = "bold", size = 4)


2. Data Exploration & Model Summary

We use the built-in cars dataset. It contains 50 observations of speed (mph) and stopping distance (ft).

Code
# Quick look at the data
head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10

Code
# Fit the linear model and print its summary statistics
cars %>% 
  lm(dist ~ speed, data = .) %>% 
  summary()

Call:
lm(formula = dist ~ speed, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Interpretation of the Summary:

  • R-squared (0.6511): Approximately 65% of the variation in stopping distance can be explained by the car’s speed.
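  • p-value: The extremely small p-value (\(1.49 \times 10^{-12}\)) is strong evidence against a slope of zero, so speed is a statistically significant predictor of distance. With a single predictor, the F-test at the bottom of the summary and the t-test for the slope are the same test, which is why both report the same p-value.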
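
To make the R-squared figure less of a black box, here is a minimal sketch that recomputes it by hand. It assumes only base R and the built-in cars data; the object names (fit, ss_res, ss_tot and so on) are chosen for illustration and are not part of the lesson's code.

```r
# Refit the same model as above (built-in `cars` data, base R only)
fit <- lm(dist ~ speed, data = cars)

# R-squared by hand: 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)
r2_manual <- 1 - ss_res / ss_tot

# With a single predictor, R-squared equals the squared correlation
r2_cor <- cor(cars$speed, cars$dist)^2

# Value stored in the summary object
r2_summary <- summary(fit)$r.squared

# All three should agree with the 0.6511 shown in the summary output
round(c(manual = r2_manual, correlation = r2_cor, summary = r2_summary), 4)
```

Seeing the three numbers agree shows that R-squared is just the share of the total variation in dist that the fitted line accounts for.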

3. Building the Linear Model

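The model is \(Y = \beta_0 + \beta_1X + \epsilon\), where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\epsilon\) is the random error. In R, we store the fitted model in an object called linear_model so that we can reuse it in the following steps.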

Code
# Building the linear model
linear_model <- lm(dist ~ speed, data = cars) 
summary(linear_model)

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

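In this model, speed is the independent variable (predictor) and distance is the dependent variable (outcome). The slope tells us that each additional 1 mph of speed is associated with an average increase of roughly 3.93 feet in stopping distance. The intercept (-17.58) is the predicted distance at 0 mph; a negative stopping distance is not physically meaningful, which is a reminder that the line should not be extrapolated below the speeds actually observed.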
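
For readers who want to see where the two coefficients come from, the sketch below recomputes them with the closed-form least-squares formulas \(\hat{\beta}_1 = \mathrm{cov}(X, Y) / \mathrm{var}(X)\) and \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\). It uses base R only; the names slope_manual and intercept_manual are illustrative and not part of the lesson.

```r
fit <- lm(dist ~ speed, data = cars)

# Least-squares estimates for one predictor:
#   slope     = cov(x, y) / var(x)
#   intercept = mean(y) - slope * mean(x)
slope_manual     <- cov(cars$speed, cars$dist) / var(cars$speed)
intercept_manual <- mean(cars$dist) - slope_manual * mean(cars$speed)

# Compare the hand-computed values with what lm() estimated
c(intercept = intercept_manual, slope = slope_manual)
coef(fit)

# 95% confidence intervals for both coefficients
confint(fit, level = 0.95)
```

The confidence interval is a useful companion to the point estimate: it shows the range of slopes that are reasonably consistent with these 50 observations.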


4. Examining Model Residuals

Residuals are the differences between the actual observed values and the values predicted by the model. Analyzing them is crucial for validating our model’s assumptions.
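
As a quick check of that definition, the sketch below subtracts the fitted values from the observed distances and compares the result with R's own residuals(). It uses base R only, and manual_resid is an illustrative name.

```r
fit <- lm(dist ~ speed, data = cars)

# Residual = observed value - value predicted by the model
manual_resid <- cars$dist - fitted(fit)

# Should be TRUE: the hand-computed residuals match residuals()
all.equal(unname(manual_resid), unname(residuals(fit)))

# For a model with an intercept, least-squares residuals sum to zero
# (up to floating-point error)
sum(residuals(fit))
```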

Code
# Base R alternative (uncomment to use):
# hist(residuals(linear_model), main = "Histogram of Residuals", xlab = "Residual Value", col = "#cc00cc")

# Plotting the distribution of residuals
ggplot(data.frame(residuals = residuals(linear_model)), aes(x = residuals)) +
  geom_histogram(bins = 10, fill = "#5f008f", color = "white", alpha = 0.7) +
  labs(title = "Distribution of Residuals", x = "Residuals", y = "Frequency") +
  theme_minimal()

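Why check residuals? Linear regression assumes that the errors are roughly normally distributed with a constant spread. If the residuals look approximately normal, are centered on zero, and show no systematic pattern, those assumptions are plausible and a straight line is a reasonable description of the data.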
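
A histogram speaks mainly to the shape of the residual distribution. To look for patterns, two other common diagnostics are a residuals-versus-fitted plot and a normal Q-Q plot. The sketch below draws both with base R graphics so that it runs without extra packages; this is one possible approach, not the only way to check the assumptions.

```r
fit <- lm(dist ~ speed, data = cars)

# Two plots side by side; remember the old settings so we can restore them
op <- par(mfrow = c(1, 2))

# Residuals vs fitted values: look for curvature or a funnel shape
plot(fitted(fit), residuals(fit),
     xlab = "Fitted stopping distance (ft)",
     ylab = "Residual (ft)",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line suggest approximate normality
qqnorm(residuals(fit), main = "Normal Q-Q Plot of Residuals")
qqline(residuals(fit))

par(op)
```

If the points in the first plot fan out as the fitted values grow, the spread of the errors is not constant; if the points in the second plot bend away from the line at the ends, the tails are heavier or lighter than a normal distribution would give.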


5. Making Predictions

One of the primary goals of regression is prediction. Let’s estimate the stopping distance for cars traveling at 10, 15, and 20 mph.

Code
# Prepare new data for predictions
new_speed <- data.frame(speed = c(10, 15, 20)) 

# Predicted stopping distances (ft) from the fitted model
predictions <- predict(linear_model, new_speed)
round(predictions, 1)
   1    2    3 
21.7 41.4 61.1 
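
Each prediction is simply the fitted line evaluated at the new speed; for 10 mph, for example, \(-17.5791 + 3.9324 \times 10 \approx 21.7\) ft. A point prediction says nothing about its uncertainty, so the sketch below also asks predict() for 95% prediction intervals through its interval argument. It uses base R only, and the names b, manual and pred_int are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)
new_speed <- data.frame(speed = c(10, 15, 20))

# Point predictions by hand: intercept + slope * speed
b <- coef(fit)
manual <- unname(b["(Intercept)"] + b["speed"] * new_speed$speed)

# Point predictions plus 95% prediction intervals for individual cars
pred_int <- predict(fit, new_speed, interval = "prediction", level = 0.95)

# Columns: fit = point prediction, lwr/upr = interval bounds
round(cbind(speed = new_speed$speed, pred_int), 1)
```

The three speeds lie inside the range observed in cars (4 to 25 mph), which is where the fitted line can be trusted most; predictions far outside that range are extrapolations.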

Quick Predictive Model (One-Step)

You can also perform the entire modeling and prediction process in a single piped command. Here round() is called without a digits argument, so the results are rounded to whole feet:

Code
cars %>% 
  lm(dist ~ speed, data = .) %>% 
  predict(data.frame(speed = c(10, 15, 20))) %>% 
  round()
 1  2  3 
22 41 61 

Simple Regression Toolkit (Cheat Sheet)

  • Model Function: lm(dependent ~ independent, data = df)
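  • Key Metric: R-squared (the proportion of variation in the outcome explained by the model, from 0 to 1).
  • Significance: Check whether the slope’s p-value is below your chosen threshold (commonly 0.05).
  • Coefficients: Intercept (predicted outcome when the predictor is 0) and Slope (change in the outcome per one-unit increase in the predictor).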
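
The items on this list can also be pulled out of a fitted model programmatically instead of being read off the printed summary. The sketch below wraps that in a small helper; regression_cheatsheet() is a hypothetical function written for this example, not part of base R or the lesson, and it assumes a model with exactly one predictor.

```r
# Hypothetical helper: collect the cheat-sheet quantities in one row.
# Assumes a simple regression (one predictor plus an intercept).
regression_cheatsheet <- function(formula, data) {
  fit <- lm(formula, data = data)
  s <- summary(fit)
  coefs <- coef(s)  # matrix with Estimate, Std. Error, t value, Pr(>|t|)
  data.frame(
    intercept     = coefs[1, "Estimate"],
    slope         = coefs[2, "Estimate"],
    slope_p_value = coefs[2, "Pr(>|t|)"],
    r_squared     = s$r.squared,
    residual_se   = s$sigma
  )
}

result <- regression_cheatsheet(dist ~ speed, data = cars)

# signif() keeps the tiny p-value readable instead of rounding it to 0
signif(result, 4)
```

Returning a data frame rather than printing makes it easy to compare several models side by side, for example by stacking the rows with rbind().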

Congratulations! You have mastered the basics of Linear Regression. You can now build a model, evaluate its strength, and use it to predict future outcomes.