2025-11-15

What is Simple Linear Regression?

Simple Linear Regression is a statistical method used to model the relationship between two continuous variables.

The Goal:

  • To understand how a single independent variable (predictor, X) influences a single dependent variable (response, Y).
  • To create a linear model that can be used to make predictions.

Key Questions it Answers:

  1. Is there a statistically significant relationship between the two variables?
  2. If so, what is the nature of this relationship (i.e., how much does Y change for a one-unit change in X)?
  3. Can we predict future values of Y based on a given value of X?

The Mathematical Model

The relationship in Simple Linear Regression is described by a mathematical equation. We assume that the true relationship can be modeled by a line, with some random error.

The model is:

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

Where:

  • \(Y\) is the dependent (response) variable.
  • \(X\) is the independent (predictor) variable.
  • \(\beta_0\) is the intercept of the line (the expected value of \(Y\) when \(X=0\)).
  • \(\beta_1\) is the slope of the line (the expected change in \(Y\) for a one-unit increase in \(X\)).
  • \(\epsilon\) is the random error term, representing variability not explained by the model.

Our goal is to estimate the unknown parameters, \(\beta_0\) and \(\beta_1\), from our sample data.
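
To make the roles of the parameters concrete, here is a minimal simulation sketch in R. The parameter values (intercept 2, slope 0.5, error standard deviation 1), the sample size, and the seed are arbitrary assumptions chosen for illustration; they do not come from any dataset used here.

```r
# Simulate data from Y = beta_0 + beta_1 * X + epsilon with known parameters.
# All numeric values below are illustrative assumptions.
set.seed(1)
n      <- 200
beta_0 <- 2
beta_1 <- 0.5
x       <- runif(n, min = 0, max = 10)
epsilon <- rnorm(n, mean = 0, sd = 1)
y       <- beta_0 + beta_1 * x + epsilon

# Estimate the parameters from the simulated sample
sim_fit <- lm(y ~ x)
sim_est <- coef(sim_fit)
print(sim_est)
```

The estimates should land close to, but not exactly on, the true values, because each sample carries its own random error.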

Example: Vehicle Stopping Distances

Let’s use the built-in R dataset cars, which records the speeds of 50 cars and the distances they took to stop. The data were recorded in the 1920s.

Question: Can we predict a car’s stopping distance based on its speed?

  • Independent Variable (X): Speed (mph)
  • Dependent Variable (Y): Stopping Distance (ft)

First, we visualize the data with a scatter plot to see if a linear relationship seems plausible.

The plot suggests a positive linear relationship: as speed increases, stopping distance tends to increase.
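
A scatter plot like the one described can be sketched as follows. This is one possible version, assuming ggplot2 is installed; the colors and theme are styling assumptions, and the correlation is added only as a rough numeric check on what the plot shows.

```r
library(ggplot2)

# Scatter plot of stopping distance against speed (built-in cars dataset)
scatter <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(color = "blue", size = 3, alpha = 0.6) +
  labs(x = "Speed (mph)", y = "Stopping Distance (ft)") +
  theme_minimal(base_size = 16)
print(scatter)

# Pearson correlation as a numeric check on the strength of the linear association
r <- cor(cars$speed, cars$dist)
print(r)
```

A correlation close to 1 is consistent with the upward trend in the plot, although it does not by itself confirm that a straight line is the right shape.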

Fitting the Model with R

We can use the lm() function in R to fit a linear model. The formula dist ~ speed means we are modeling dist as a function of speed.

Below is the R code to build the model and add the regression line to our plot.

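library(ggplot2)  # provides ggplot(), geom_point(), geom_smooth()

# Fit the simple linear regression of stopping distance on speed
car_model <- lm(dist ~ speed, data = cars)

# Scatter plot with the fitted least-squares line overlaid
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(color = "blue", size = 3, alpha = 0.6) +
  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "#8C1D40", linewidth = 1.5) +
  labs(
    title = "Fitted Regression Line",
    subtitle = "Stopping Distance = -17.58 + 3.93 * Speed",
    x = "Speed (mph)",
    y = "Stopping Distance (ft)"
  ) + theme_minimal(base_size = 16)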

The Estimated Regression Line

The lm() function uses a method called Ordinary Least Squares (OLS) to find the “best” estimates for \(\beta_0\) and \(\beta_1\). These estimates are denoted with a “hat” symbol.

The estimated regression equation is:

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]
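
For a single predictor, the least-squares estimates have a closed form: \(\hat{\beta}_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\). The sketch below computes them by hand for the cars data and compares them with lm(). The variable names (b0_hat, b1_hat, lm_est) are illustrative.

```r
x <- cars$speed
y <- cars$dist

# Closed-form OLS estimates for one predictor
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_hat <- mean(y) - b1_hat * mean(x)

# Same estimates from lm(), for comparison
lm_est <- coef(lm(dist ~ speed, data = cars))
print(c(b0_hat = b0_hat, b1_hat = b1_hat))
print(lm_est)
```

The two sets of numbers should agree up to floating-point error, since lm() solves the same least-squares problem.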

For our cars dataset, the R output from summary(car_model) gives us:

  • \(\hat{\beta}_0\) (Intercept) = -17.58
  • \(\hat{\beta}_1\) (Slope) = 3.93

So, our predictive model is:

Stopping Distance = -17.58 + 3.93 * Speed

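This means that for every 1 mph increase in speed, we expect the stopping distance to increase by approximately 3.93 feet on average. The negative intercept has no physical meaning: the observed speeds range from 4 to 25 mph, so the line should not be extrapolated down to 0 mph.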
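
As a sketch of how these numbers can be pulled out of the fitted model and used for prediction, the code below extracts the coefficient table and predicts the stopping distance at a single speed. The speed of 15 mph is a hypothetical input chosen from within the observed range, and the object names are illustrative.

```r
car_model <- lm(dist ~ speed, data = cars)

# Coefficient table: estimates, standard errors, t statistics, p-values
coef_table <- summary(car_model)$coefficients
print(round(coef_table, 4))

# p-value for the slope (test of H0: beta_1 = 0)
slope_p <- coef_table["speed", "Pr(>|t|)"]

# Predicted stopping distance at a hypothetical speed of 15 mph
new_car <- data.frame(speed = 15)
pred_15 <- predict(car_model, newdata = new_car)
print(pred_15)
```

A small p-value for the slope addresses the first key question (whether the relationship is statistically significant), and predict() addresses the third.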

Evaluating the Model: Residuals

How good is our model? A common way to check is by examining the residuals.

A residual is the difference between the observed value and the value predicted by the model:

\[ e_i = y_i - \hat{y}_i \]

A good model should have residuals that are randomly scattered around zero, with no obvious patterns. We can visualize this with a residual plot.

The plot shows slight curvature, which suggests that a straight line may not capture the relationship perfectly, but there is no glaring pattern.
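
A residual plot of this kind can be sketched as follows, assuming ggplot2 is installed. Plotting residuals against fitted values is one common choice and may differ from the exact figure discussed here; the styling is also an assumption.

```r
library(ggplot2)

car_model <- lm(dist ~ speed, data = cars)

# Residuals e_i = y_i - y_hat_i, computed by hand and via resid()
fitted_vals <- fitted(car_model)
res_manual  <- cars$dist - fitted_vals
res         <- resid(car_model)

# Residuals against fitted values, with a reference line at zero
res_df <- data.frame(fitted = fitted_vals, residual = res)
res_plot <- ggplot(res_df, aes(x = fitted, y = residual)) +
  geom_point(color = "blue", size = 3, alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted Stopping Distance (ft)", y = "Residual (ft)") +
  theme_minimal(base_size = 16)
print(res_plot)
```

With an intercept in the model, the least-squares residuals sum to zero (up to rounding), so the points are always centered on the dashed line. What matters is whether they show a trend or a changing spread.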

Visualizing Least Squares in 3D

The “best” line is the one that minimizes the Sum of Squared Errors (SSE). The SSE is the sum of the squared residuals, \(\text{SSE} = \sum_{i=1}^{n} e_i^2\).

We can visualize the SSE as a 3D surface, where the x and y axes are possible values for the intercept (\(\beta_0\)) and slope (\(\beta_1\)), and the z-axis is the resulting SSE. The goal is to find the lowest point on this surface.

The red ‘X’ marks the combination of slope and intercept that minimizes the SSE, giving us our best-fit line.
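
The surface can be approximated numerically by evaluating the SSE on a grid of candidate intercepts and slopes. The sketch below is not the code behind the interactive figure: the grid ranges are assumptions chosen to bracket the fitted values, and base R's persp() stands in for an interactive 3D plot.

```r
x <- cars$speed
y <- cars$dist

# SSE for a candidate intercept b0 and slope b1
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# Grid of candidate values (ranges are illustrative assumptions)
b0_grid  <- seq(-40, 5, length.out = 91)
b1_grid  <- seq(2, 6, length.out = 81)
sse_grid <- outer(b0_grid, b1_grid, Vectorize(sse))

# SSE at the least-squares estimates from lm()
ols     <- coef(lm(dist ~ speed, data = cars))
sse_ols <- sse(ols[["(Intercept)"]], ols[["speed"]])

# Lowest point found on the grid
best <- which(sse_grid == min(sse_grid), arr.ind = TRUE)
print(c(b0 = b0_grid[best[1, "row"]], b1 = b1_grid[best[1, "col"]], sse = min(sse_grid)))
print(sse_ols)

# Static 3D view with base graphics
persp(b0_grid, b1_grid, sse_grid,
      xlab = "Intercept", ylab = "Slope", zlab = "SSE",
      theta = 40, phi = 25)
```

The grid minimum can only approach the SSE at the lm() estimates, never beat it, because the least-squares solution is the exact minimizer of this surface.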

Conclusion

Summary of Simple Linear Regression:

  • It provides a simple and powerful way to model the linear relationship between two variables.
  • The model is defined by an intercept and a slope, which are estimated from data.
  • The Method of Least Squares is used to find the line that best fits the data by minimizing the sum of squared residuals.
  • Visual tools like scatter plots and residual plots are essential for building and evaluating the model.
  • R, with packages like ggplot2 and plotly, provides a comprehensive environment for performing regression analysis.

This method forms the foundation for more complex statistical models.