2026-04-08

What Is Simple Linear Regression?

Simple linear regression models the relationship between two continuous variables by fitting a straight line through observed data.

Goal: Predict a response variable \(Y\) using a single predictor variable \(X\).

Real-world examples:

Predicting house price from square footage, estimating fuel efficiency from vehicle weight, or forecasting sales from advertising spend.

We’ll go through the math, fit a model in R, and visualize the results.

The Mathematical Model

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, \dots, n\]

where

\(Y_i\) is the observed response for the \(i\)-th data point, \(X_i\) is the corresponding predictor value, \(\beta_0\) is the intercept (the expected value of \(Y\) when \(X = 0\)), \(\beta_1\) is the slope (the expected change in \(Y\) per one-unit increase in \(X\)), and the \(\varepsilon_i \sim N(0, \sigma^2)\) are independent error terms with constant variance \(\sigma^2\).

The fitted line is written as \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\).
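To make the model concrete, here is a minimal sketch that simulates data from it and then fits a line. The parameter values (`beta0`, `beta1`, `sigma`), the sample size, and the seed are made up for illustration and are not part of the cars example.

```r
# Simulate data from Y_i = beta0 + beta1 * X_i + eps_i (illustrative values only)
set.seed(42)                      # arbitrary seed, for reproducibility
n     <- 200
beta0 <- 2                        # assumed "true" intercept
beta1 <- 0.5                      # assumed "true" slope
sigma <- 1                        # assumed error standard deviation

x   <- runif(n, min = 0, max = 10)
eps <- rnorm(n, mean = 0, sd = sigma)   # independent N(0, sigma^2) errors
y   <- beta0 + beta1 * x + eps

# Least squares should recover values close to beta0 and beta1
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

The estimates will not match the true values exactly because of the random errors, but they should land near them.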

Estimating the Parameters

We pick \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the sum of squared residuals:

\[S(\beta_0, \beta_1) = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2\]

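Setting the partial derivatives \(\partial S / \partial \beta_0\) and \(\partial S / \partial \beta_1\) to zero and solving the resulting pair of equations gives us: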

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\]

These are the ordinary least squares (OLS) estimators.
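As a quick check, the closed-form estimators can be computed directly in R. This sketch uses R's built-in cars data (speed and stopping distance); the variable names `b0_hat` and `b1_hat` are just illustrative.

```r
# Closed-form OLS estimates computed directly from the formulas
x <- cars$speed
y <- cars$dist

b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_hat <- mean(y) - b1_hat * mean(x)

c(intercept = b0_hat, slope = b1_hat)

# Should agree with lm() up to floating point error
all.equal(unname(coef(lm(dist ~ speed, data = cars))), c(b0_hat, b1_hat))
```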

Our Example: Cars Dataset

We’re using R’s built-in cars dataset, which records speed (mph) and stopping distance (ft) for 50 cars. The data were recorded in the 1920s.

# load the built-in cars data
data(cars)
head(cars, 8)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
## 7    10   18
## 8    10   26

Question: Can we predict stopping distance from speed?

Fitting the Model in R

# fit a simple linear regression model
model <- lm(dist ~ speed, data = cars)
summary(model)$coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -17.579095  6.7584402 -2.601058 1.231882e-02
## speed         3.932409  0.4155128  9.463990 1.489836e-12

The fitted equation is:

\[\widehat{\text{dist}} = -17.58 + 3.93 \times \text{speed}\]

So for every 1 mph increase in speed, stopping distance goes up by about 3.9 feet on average.
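The fitted equation can also be used for prediction. Here is a small sketch using `predict()`; the speed of 15 mph is an arbitrary value chosen for illustration.

```r
model <- lm(dist ~ speed, data = cars)

# Predicted stopping distance (ft) at a hypothetical speed of 15 mph
new_car <- data.frame(speed = 15)
pred <- predict(model, newdata = new_car)
pred

# Same value computed from the fitted equation by hand
by_hand <- coef(model)[["(Intercept)"]] + coef(model)[["speed"]] * 15
by_hand
```

Predictions are most trustworthy inside the observed speed range. The negative intercept shows that the line should not be extrapolated down to speeds near zero.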

Scatter Plot with Regression Line (ggplot)

The shaded band is the 95% confidence interval for the regression line.
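The numbers behind a band like this can be reproduced with `predict()`. This is only a sketch in base R, since the plotting code itself isn't shown here. For reference, ggplot2's `geom_smooth(method = "lm")` draws a 95% confidence band of this kind by default.

```r
model <- lm(dist ~ speed, data = cars)

# Grid of speeds spanning the observed range
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed), length.out = 50))

# fit = point on the line, lwr/upr = 95% confidence limits for the mean response
band <- predict(model, newdata = grid, interval = "confidence", level = 0.95)
head(cbind(grid, band))

# The band is narrowest near the mean speed and widens toward the extremes
width <- band[, "upr"] - band[, "lwr"]
grid$speed[which.min(width)]
```

Note that this is a band for the mean stopping distance at each speed, not for individual cars. Using `interval = "prediction"` instead would give a wider band.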

Residual Analysis (ggplot)

The residuals spread out in a funnel shape, which suggests that the error variance increases with speed (heteroscedasticity), a possible violation of the constant-variance assumption.
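One rough way to back up the visual impression is to compare the residual spread in the slower and faster halves of the data. This is an informal check rather than a formal test, and splitting at the median speed is an arbitrary choice made for this sketch.

```r
model <- lm(dist ~ speed, data = cars)

res      <- resid(model)     # observed minus fitted stopping distance
fit_vals <- fitted(model)    # fitted stopping distance for each car

# Split the cars at the median speed and compare residual standard deviations
slow <- cars$speed <= median(cars$speed)
c(sd_slow = sd(res[slow]), sd_fast = sd(res[!slow]))
```

A noticeably larger spread in the faster group would be consistent with the funnel shape in the plot.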

Interactive 3D Plot (Plotly)

Here we map speed, distance, and residual magnitude in a 3D scatter plot.

Goodness of Fit: \(R^2\)

The coefficient of determination \(R^2\) tells us how much of the variability in \(Y\) is explained by the model:

\[R^2 = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}} = 1 - \frac{\sum(Y_i - \hat{Y}_i)^2}{\sum(Y_i - \bar{Y})^2}\]

# get the r squared value
summary(model)$r.squared
## [1] 0.6510794

An \(R^2\) of 0.651 means about 65.1% of the variation in stopping distance is explained by speed.
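The same number can be rebuilt from the formula above. This sketch computes the two sums of squares by hand; `ss_res` and `ss_tot` are just illustrative names.

```r
model <- lm(dist ~ speed, data = cars)

ss_res <- sum(resid(model)^2)                     # residual sum of squares
ss_tot <- sum((cars$dist - mean(cars$dist))^2)    # total sum of squares

r2 <- 1 - ss_res / ss_tot
r2

# In simple linear regression, R^2 equals the squared sample correlation
cor(cars$speed, cars$dist)^2
```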

Hypothesis Test on the Slope

We want to test whether speed is a statistically significant predictor of stopping distance:

\[H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \neq 0\]

Under \(H_0\), the test statistic follows a \(t\) distribution with \(n - 2\) degrees of freedom:

\[t = \frac{\hat{\beta}_1}{\text{SE}(\hat{\beta}_1)} \sim t_{n-2}\]

# look at the coefficients table for speed
coefs <- summary(model)$coefficients
coefs["speed", ]
##     Estimate   Std. Error      t value     Pr(>|t|) 
## 3.932409e+00 4.155128e-01 9.463990e+00 1.489836e-12

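Since \(p \approx 1.5 \times 10^{-12}\), far below \(0.001\), we reject \(H_0\) at any conventional significance level: speed is a statistically significant predictor of stopping distance.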
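The t value and p-value in the table can be reproduced from the estimate and its standard error. The sketch below does this by hand and also shows a 95% confidence interval for the slope via `confint()`; the names `t_stat`, `df_resid`, and `p_val` are illustrative.

```r
model <- lm(dist ~ speed, data = cars)
coefs <- summary(model)$coefficients

# t statistic: estimate divided by its standard error
t_stat <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]

# Two-sided p-value from a t distribution with n - 2 degrees of freedom
df_resid <- nrow(cars) - 2
p_val    <- 2 * pt(-abs(t_stat), df = df_resid)
c(t = t_stat, p = p_val)

# 95% confidence interval for the slope
confint(model, "speed", level = 0.95)
```

If the interval excludes zero, that agrees with rejecting \(H_0\) at the 5% level.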

Key Takeaways

Simple linear regression fits \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\) using least squares.

Always check residual plots for model assumptions like linearity, constant variance, and normality.

\(R^2\) tells you how much of the variation the model explains, and hypothesis tests tell you if the predictor is significant.

The cars example shows a clear positive relationship between speed and stopping distance.

Tools used: ggplot2 for static plots, plotly for the interactive 3D plot, and LaTeX for the math, all inside R Markdown ioslides.