Statistics

2026-04-06

What is Simple Linear Regression?

Simple Linear Regression models the linear relationship between:

A response variable \(Y\) (dependent)
A single predictor variable \(X\) (independent)

It is one of the most fundamental tools in statistics and machine learning, used both to describe how a response changes with a predictor and to predict new values of the response.

Goal: Estimate the best-fitting linear relationship between the variables.

The Model

The population regression model is:

\[Y = \beta_0 + \beta_1 X + \varepsilon\]

Where:

\(\beta_0\) = intercept: the expected value of \(Y\) when \(X = 0\)
\(\beta_1\) = slope: the expected change in \(Y\) for a one-unit increase in \(X\)
\(\varepsilon \sim \mathcal{N}(0, \sigma^2)\) = random error term, assumed independent across observations with mean zero and constant variance \(\sigma^2\)

The fitted (estimated) model is:

\[\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\]
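As a rough illustration of the difference between the population model and the fitted model, the sketch below simulates data from a line with known coefficients and then estimates them. The true values (intercept 2, slope 0.5, error standard deviation 1), the sample size, and the seed are arbitrary assumptions chosen for this example; they do not come from any dataset in this presentation.

```r
# Simulate from Y = beta0 + beta1 * X + epsilon, then fit a line to the sample.
set.seed(42)          # arbitrary seed, only for reproducibility

n     <- 200          # assumed sample size
beta0 <- 2            # assumed true intercept
beta1 <- 0.5          # assumed true slope
sigma <- 1            # assumed standard deviation of the error term

x   <- runif(n, min = 0, max = 10)       # predictor values
eps <- rnorm(n, mean = 0, sd = sigma)    # errors drawn from N(0, sigma^2)
y   <- beta0 + beta1 * x + eps           # response generated by the population model

sim_fit <- lm(y ~ x)                     # fitted model
coef(sim_fit)                            # estimates should land near beta0 and beta1
```

The estimates will not equal the true coefficients exactly, since they depend on the particular random sample, but they should be close for a sample of this size.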

Estimating the Coefficients

The Ordinary Least Squares (OLS) method minimizes the sum of squared residuals:

\[\text{SSE} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2\]
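A short sketch of the minimization step: setting the partial derivatives of SSE with respect to each coefficient to zero gives the two normal equations:

\[\frac{\partial\,\text{SSE}}{\partial \hat{\beta}_0} = -2\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0, \qquad \frac{\partial\,\text{SSE}}{\partial \hat{\beta}_1} = -2\sum_{i=1}^{n} x_i\,(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0\]

The first equation gives \(\hat{\beta}_0\) in terms of the sample means \(\bar{x}\) and \(\bar{y}\); substituting it into the second leaves a single equation in \(\hat{\beta}_1\).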

The closed-form solutions are:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]
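The formulas above can be checked numerically. The sketch below applies them directly to R's built-in cars dataset (speed as the predictor, dist as the response) and compares the result with lm(). The variable names x_bar, y_bar, b0, and b1 are illustrative choices, not part of any API.

```r
# Closed-form OLS estimates computed by hand on the built-in cars dataset.
x <- cars$speed
y <- cars$dist

x_bar <- mean(x)
y_bar <- mean(y)

# Slope: sum of cross-deviations divided by sum of squared deviations of x
b1 <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)

# Intercept: forces the fitted line through the point (x_bar, y_bar)
b0 <- y_bar - b1 * x_bar

c(intercept = b0, slope = b1)

# Cross-check against R's own least-squares fit
lm_coef <- unname(coef(lm(dist ~ speed, data = cars)))
all.equal(c(b0, b1), lm_coef)
```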

Example: Car Speed vs. Stopping Distance

This example uses R’s built-in cars dataset (the 3D plot uses the built-in mtcars dataset instead). In cars:

X = Speed of car (mph)
Y = Stopping distance (ft)

head(cars, 5)

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16

# Fit simple linear regression
model <- lm(dist ~ speed, data = cars)

Scatter Plot with Regression Line
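One way to draw such a plot with base R graphics is sketched below. The plotting library and styling of the original figure are not known, so the axis labels, colors, and point style here are assumptions.

```r
# Scatter plot of the cars data with the fitted regression line overlaid.
model <- lm(dist ~ speed, data = cars)

plot(cars$speed, cars$dist,
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     main = "Stopping distance vs. speed",
     pch  = 19)

# abline() accepts an lm object and draws its fitted line
abline(model, col = "blue", lwd = 2)
```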

Residuals Plot
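A residuals-versus-fitted plot can be sketched in base R as follows; as with the scatter plot, the styling is an assumption rather than a reproduction of the original figure. If the linear model is adequate, the points should scatter around the horizontal zero line without an obvious curve or funnel shape.

```r
# Residuals against fitted values for the cars regression.
model <- lm(dist ~ speed, data = cars)

fitted_vals <- fitted(model)   # predicted stopping distances
resid_vals  <- resid(model)    # observed minus predicted

plot(fitted_vals, resid_vals,
     xlab = "Fitted stopping distance (ft)",
     ylab = "Residual (ft)",
     main = "Residuals vs. fitted values",
     pch  = 19)

# Reference line at zero: OLS residuals sum to zero when an intercept is included
abline(h = 0, lty = 2)
```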

Weight, Horsepower & MPG (mtcars)

Fitting the Model in R

# Model summary
summary(model)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Interpreting the Results

From summary(model):

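\(\hat{\beta}_0 \approx -17.58\): the intercept is negative, which is not physically meaningful (a car at speed 0 needs no stopping distance); speed = 0 lies outside the observed range of 4 to 25 mph, so the intercept is an extrapolation
\(\hat{\beta}_1 \approx 3.93\): each 1 mph increase in speed is associated with an average increase of about 3.93 ft in stopping distance
\(R^2 \approx 0.651\): speed explains about 65% of the variation in stopping distance
The p-value for \(\hat{\beta}_1\) is very small (1.49e-12, well below 0.001), which is strong evidence against a zero slope and indicates a statistically significant linear association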
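The quantities interpreted above can also be pulled out of the fitted model directly. The sketch below extracts the coefficients, a 95% confidence interval for each, and a prediction at 15 mph; the speed of 15 mph is an arbitrary illustrative value inside the observed range.

```r
# Extract estimates, confidence intervals, and a prediction from the cars model.
model <- lm(dist ~ speed, data = cars)

coef(model)                          # point estimates of intercept and slope

ci <- confint(model, level = 0.95)   # 95% confidence intervals for both coefficients
ci

range(cars$speed)                    # observed speeds; predictions outside this range extrapolate

new_speed <- data.frame(speed = 15)  # arbitrary illustrative speed (mph)
pred <- predict(model, newdata = new_speed, interval = "confidence")
pred                                 # columns: fit, lwr, upr
```

The predicted mean stopping distance at 15 mph is roughly 41 ft, which matches plugging 15 into the fitted line by hand.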

Goodness of Fit: \(R^2\)

The coefficient of determination \(R^2\) measures the proportion of the variation in \(Y\) that the model explains:

\[R^2 = 1 - \frac{\text{SSE}}{\text{SST}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}\]

\(R^2 = 1\): perfect fit
\(R^2 = 0\): model explains none of the variability
Values closer to 1 indicate a stronger linear relationship

For our model: \(R^2 \approx 0.651\)
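This value can be reproduced from the definition. The sketch below computes SSE and SST by hand for the cars model and compares the result with the value reported by summary(); the names sse, sst, and r2 are illustrative.

```r
# R-squared computed from its definition, 1 - SSE / SST.
model <- lm(dist ~ speed, data = cars)
y     <- cars$dist

sse <- sum((y - fitted(model))^2)   # variation left unexplained by the line
sst <- sum((y - mean(y))^2)         # total variation around the mean

r2 <- 1 - sse / sst
r2

# Cross-checks: the value reported by summary(), and the squared correlation,
# which equals R-squared when there is a single predictor
all.equal(r2, summary(model)$r.squared)
all.equal(r2, cor(cars$speed, cars$dist)^2)
```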

Summary

Simple linear regression estimates the linear relationship \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\)
Coefficients are estimated via Ordinary Least Squares
Model fit is assessed using residual plots and \(R^2\)
The cars dataset shows a statistically significant positive relationship between speed and stopping distance
Reference: Sheldon Ross, A First Course in Probability, 10th Edition, Pearson.