What is Simple Linear Regression?

Simple Linear Regression models the linear relationship between:

  • A response variable \(Y\) (dependent)
  • A predictor variable \(X\) (independent)

It is one of the most fundamental tools in statistics and machine learning, used to:

  • Understand relationships between variables
  • Make predictions on new data
  • Quantify how much of the variation in \(Y\) is explained by \(X\)

The Model Equation

The simple linear regression model is written as:

\[Y = \beta_0 + \beta_1 X + \varepsilon\]

Where:

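  • \(\beta_0\) = intercept: the expected value of \(Y\) when \(X = 0\)
  • \(\beta_1\) = slope: the expected change in \(Y\) for a one-unit increase in \(X\)
  • \(\varepsilon \sim N(0, \sigma^2)\) = random error term, assumed independent across observations with mean 0 and constant variance \(\sigma^2\)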

The fitted (predicted) values are:

\[\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\]
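
To make the roles of \(\beta_0\), \(\beta_1\), and \(\varepsilon\) concrete, here is a minimal simulation sketch. The parameter values (`beta0 = 2`, `beta1 = 0.5`, `sigma = 1`), the sample size, and the seed are arbitrary choices for illustration and do not come from the cars example. The code draws data from the model and checks that `lm()` recovers estimates close to the true values.

```r
# Simulate data from Y = beta0 + beta1 * X + eps, with eps ~ N(0, sigma^2).
# All numeric values here are illustrative assumptions.
set.seed(42)
n     <- 200
beta0 <- 2
beta1 <- 0.5
sigma <- 1

x   <- runif(n, min = 0, max = 10)    # predictor values
eps <- rnorm(n, mean = 0, sd = sigma) # random errors
y   <- beta0 + beta1 * x + eps        # response generated by the model

# Fit the model and compare the estimates with the true parameters
sim_fit <- lm(y ~ x)
coef(sim_fit)

# Fitted values follow Y-hat = beta0-hat + beta1-hat * X
y_hat <- coef(sim_fit)[1] + coef(sim_fit)[2] * x
```

The estimates will not equal the true parameters exactly, because each simulated sample carries its own random error.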

Estimating the Coefficients

We use Ordinary Least Squares (OLS) to minimize the Residual Sum of Squares:

\[RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2\]

The closed-form OLS solutions are:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]
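
The closed-form expressions can be evaluated directly. The sketch below applies them to R's built-in cars data (speed in mph, dist in ft) and compares the result with `lm()`. The variable names `beta0_hat` and `beta1_hat` are chosen here for illustration.

```r
# Closed-form OLS estimates computed by hand on the built-in cars data
x <- cars$speed
y <- cars$dist

x_bar <- mean(x)
y_bar <- mean(y)

# Slope: sum of cross-deviations divided by sum of squared deviations of x
beta1_hat <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)

# Intercept: forces the fitted line through the point (x_bar, y_bar)
beta0_hat <- y_bar - beta1_hat * x_bar

c(intercept = beta0_hat, slope = beta1_hat)

# The same estimates from lm(), for comparison
coef(lm(dist ~ speed, data = cars))
```

Both approaches should agree up to floating-point error, since `lm()` solves the same least-squares problem.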

Example Dataset: Car Speed vs. Stopping Distance

We use R’s built-in cars dataset:

  • X = speed, the speed of the car (mph)
  • Y = dist, the stopping distance (ft)
  • 50 observations recorded in the 1920s

head(cars, 5)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16

Scatter Plot with Regression Line
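
A plot of this kind can be produced with base R graphics. This is a minimal sketch, not necessarily the code behind the original figure; the axis labels, colours, and the name `fit` are choices made here.

```r
# Scatter plot of stopping distance against speed
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     main = "Car speed vs. stopping distance",
     pch  = 19)

# Overlay the fitted least-squares line
fit <- lm(dist ~ speed, data = cars)
abline(fit, col = "red", lwd = 2)
```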

Residuals Plot
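
A residuals-versus-fitted plot is a common way to check the error assumptions. The sketch below uses base R graphics and is not necessarily how the original figure was drawn; the name `fit` is chosen here.

```r
# Residuals (observed minus fitted) plotted against fitted values
fit <- lm(dist ~ speed, data = cars)

plot(fitted(fit), resid(fit),
     xlab = "Fitted stopping distance (ft)",
     ylab = "Residual (ft)",
     main = "Residuals vs. fitted values",
     pch  = 19)

# Reference line at zero: residuals should scatter evenly around it
abline(h = 0, lty = 2)
```

A curved pattern or a funnel shape in this plot would suggest that the linearity or constant-variance assumption is questionable.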

R Code for the Model

# Fit the linear regression model
model <- lm(dist ~ speed, data = cars)

# View the summary
summary(model)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Model Interpretation

From the model output:

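  • \(\hat{\beta}_0 \approx\) -17.58: the fitted line's value at speed = 0. A negative stopping distance is not physically meaningful; the observed speeds range from 4 to 25 mph, so this value is an extrapolation.
  • \(\hat{\beta}_1 \approx\) 3.93: each additional mph is associated with about 3.93 ft more stopping distance, on average
  • \(R^2 \approx\) 0.651: the model explains about 65.1% of the variance in stopping distance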
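
The fitted model can also be used to predict stopping distance at new speeds. The sketch below uses `predict()` with a 95% prediction interval; the speeds 10, 15, and 20 mph are arbitrary values chosen here from within the observed range.

```r
# Refit the model so this block is self-contained
model <- lm(dist ~ speed, data = cars)

# Hypothetical new speeds (mph), all inside the observed range of 4 to 25
new_speeds <- data.frame(speed = c(10, 15, 20))

# Point predictions with 95% prediction intervals
pred <- predict(model, newdata = new_speeds,
                interval = "prediction", level = 0.95)

cbind(new_speeds, pred)
```

The `fit` column is just \(\hat{\beta}_0 + \hat{\beta}_1 \times\) speed, so at 10 mph it is roughly -17.58 + 3.93 × 10 ≈ 21.7 ft. The `lwr` and `upr` columns bound where a single new observation is expected to fall.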

The coefficient of determination \(R^2\) is defined as:

\[R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}\]
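
As a check on the definition, \(R^2\) can be computed from RSS and TSS directly and compared with the value reported by `summary()`. The names `rss`, `tss`, and `r2` are chosen here for illustration.

```r
# Compute R^2 = 1 - RSS / TSS by hand for the cars model
model <- lm(dist ~ speed, data = cars)

rss <- sum(resid(model)^2)                    # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares
r2  <- 1 - rss / tss

r2
summary(model)$r.squared   # value reported by summary(), for comparison
```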

3D Interactive Plot: Speed, Distance & Residuals
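
An interactive 3D view of speed, distance, and residuals can be sketched with the plotly package. Using plotly is an assumption made here, since the original figure's code is not shown. The block skips the plot if the package is not installed.

```r
# Assemble the three plotted quantities in one data frame
model <- lm(dist ~ speed, data = cars)
plot_data <- data.frame(speed    = cars$speed,
                        dist     = cars$dist,
                        residual = resid(model))

# plotly is an assumed dependency; fall back to a message if it is missing
if (requireNamespace("plotly", quietly = TRUE)) {
  fig <- plotly::plot_ly(plot_data,
                         x = ~speed, y = ~dist, z = ~residual,
                         type = "scatter3d", mode = "markers",
                         marker = list(size = 4))
  fig <- plotly::layout(fig, scene = list(
    xaxis = list(title = "Speed (mph)"),
    yaxis = list(title = "Stopping distance (ft)"),
    zaxis = list(title = "Residual (ft)")))
  if (interactive()) print(fig)   # display only in an interactive session
} else {
  message("Package 'plotly' is not installed; skipping the 3D plot.")
}
```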

Summary

Simple Linear Regression is a powerful yet interpretable model:

| Component | Description |
|---|---|
| Model | \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\) |
| Estimation | Ordinary Least Squares (OLS) |
| Fit measure | \(R^2 \in [0, 1]\) |
| Key assumption | \(\varepsilon \sim N(0, \sigma^2)\) |

Key takeaway: For every 1 mph increase in speed, stopping distance increases by approximately 3.9 feet on average, over the observed range of 4 to 25 mph. This is a clear, quantifiable relationship!