What is Simple Linear Regression?

Simple linear regression models the linear relationship between a response variable \(Y\) and a single predictor variable \(X\).

  • Widely used in prediction and inference
  • Foundation for more complex regression models
  • Assumes the mean of \(Y\) changes along a straight line in \(X\)

Example dataset: We use R’s built-in cars dataset, which records the speed (mph) and stopping distance (ft) of cars from the 1920s.
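
Before modeling, it can help to look at the raw data. The following is a minimal sketch using base R only; `cars` ships with the `datasets` package, which a standard R session loads by default.

```r
# cars is part of the datasets package that comes with base R
data(cars)

# Structure: 50 observations of two numeric variables, speed and dist
str(cars)

# First few rows and per-column summaries
head(cars)
summary(cars)
```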

The Model

The simple linear regression model is:

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\]

where:

  • \(Y_i\) is the response for observation \(i\)
  • \(X_i\) is the predictor for observation \(i\)
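  • \(\beta_0\) is the intercept (expected value of \(Y\) when \(X = 0\))
  • \(\beta_1\) is the slope (expected change in \(Y\) per one-unit increase in \(X\))
  • \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\) is the random error, assumed independent across observations with constant variance \(\sigma^2\)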
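
To make the role of each term concrete, the sketch below simulates data from this model and then fits a line to it. The parameter values, sample size, and seed are arbitrary choices for illustration; they are not estimates from the cars data.

```r
set.seed(1)                     # arbitrary seed, for reproducibility only

n     <- 100                    # hypothetical sample size
beta0 <- 2                      # hypothetical intercept
beta1 <- 0.5                    # hypothetical slope
sigma <- 1                      # hypothetical error standard deviation

x   <- runif(n, min = 0, max = 10)        # predictor values
eps <- rnorm(n, mean = 0, sd = sigma)     # errors drawn from N(0, sigma^2)
y   <- beta0 + beta1 * x + eps            # responses generated by the model

# Fitting a line to the simulated data should give estimates near beta0 and beta1
coef(lm(y ~ x))
```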

Estimating the Coefficients

The coefficients \(\beta_0\) and \(\beta_1\) are estimated by Ordinary Least Squares (OLS), which chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the residual sum of squares:

\[RSS = \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 = \sum_{i=1}^{n} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right)^2\]

The closed-form solutions are:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\]
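
As a check on these formulas, they can be evaluated directly on the cars data. This is a sketch in base R; the variable names `x`, `y`, `b0`, and `b1` are introduced here for illustration, and `lm()` is used only to confirm that both routes give the same estimates.

```r
x <- cars$speed
y <- cars$dist

# Slope: sum of cross-deviations over sum of squared deviations of x
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: forces the fitted line through (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)

# Compare with lm(); the estimates should agree up to floating-point error
all.equal(unname(coef(lm(dist ~ speed, data = cars))), c(b0, b1))
```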

Scatter Plot with Regression Line
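
One way to draw such a plot is sketched here with base R graphics. This is not necessarily how the original figure was produced; the axis labels and plotting symbols are illustrative choices.

```r
# Scatter plot of stopping distance against speed
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     pch = 19)

# Overlay the least-squares line
fit <- lm(dist ~ speed, data = cars)
abline(fit, lwd = 2)
```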

Fitting the Model in R

# Fit the linear model
model <- lm(dist ~ speed, data = cars)

# Summary of results
summary(model)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
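
The quantities in this summary can also be extracted programmatically. The sketch below uses standard accessor functions from base R's `stats` package; the speed of 15 mph is an arbitrary value picked for illustration.

```r
model <- lm(dist ~ speed, data = cars)

# Point estimates of the intercept and slope
coef(model)

# 95% confidence intervals for both coefficients
confint(model, level = 0.95)

# Predicted stopping distance at 15 mph, with a 95% prediction interval
predict(model, newdata = data.frame(speed = 15), interval = "prediction")
```

Using the coefficients printed above, the point prediction at 15 mph is \(-17.5791 + 3.9324 \times 15 \approx 41.4\) ft.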

Residual Diagnostics

In a plot of residuals against fitted values, a random scatter around zero, with no curvature or funnel shape, suggests that the linearity and constant-variance assumptions hold reasonably well.
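
A residuals-versus-fitted plot can be sketched with base R graphics as follows. The labels and styling are illustrative and may differ from the original figure.

```r
model <- lm(dist ~ speed, data = cars)

# Residuals on the vertical axis, fitted values on the horizontal axis
plot(fitted(model), resid(model),
     xlab = "Fitted values (ft)", ylab = "Residuals (ft)",
     pch = 19)

# Reference line at zero
abline(h = 0, lty = 2)
```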

QQ Plot of Residuals

In a normal Q-Q plot, residuals lying close to the reference line indicate that the normality assumption is approximately satisfied.
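
A minimal sketch of such a plot, using base R's `qqnorm()` and `qqline()`; the plot title is an illustrative choice.

```r
model <- lm(dist ~ speed, data = cars)
res   <- resid(model)

# Sample quantiles of the residuals against theoretical normal quantiles
qqnorm(res, main = "Normal Q-Q plot of residuals")

# Reference line through the first and third quartiles
qqline(res)
```

By default, `qqline()` draws the line through the first and third quartiles rather than a 45-degree diagonal, so it is the pattern of departures from that line that matters.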

3D View: Speed, Distance, and Residuals

Goodness of Fit: \(R^2\)

The coefficient of determination measures the proportion of variance in \(Y\) explained by \(X\):

\[R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum(Y_i - \hat{Y}_i)^2}{\sum(Y_i - \bar{Y})^2}\]

  • For an OLS fit with an intercept, \(R^2 \in [0, 1]\); values closer to 1 mean the line accounts for more of the variation in \(Y\)
  • For the cars model: \(R^2 \approx 0.65\), meaning speed explains about 65% of the variation in stopping distance
## R-squared: 0.6511
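
The reported value can be reproduced from the definition. The sketch below computes RSS and TSS by hand; the names `rss`, `tss`, and `r2` are introduced here for illustration.

```r
model <- lm(dist ~ speed, data = cars)

rss <- sum(resid(model)^2)                      # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)     # total sum of squares
r2  <- 1 - rss / tss

r2

# Should match the value reported by summary()
all.equal(r2, summary(model)$r.squared)

# In simple linear regression, R^2 equals the squared sample correlation
all.equal(r2, cor(cars$speed, cars$dist)^2)
```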

Summary

| Concept     | Description                                  |
|-------------|----------------------------------------------|
| Model       | \(Y = \beta_0 + \beta_1 X + \varepsilon\)    |
| Estimation  | OLS minimizes the residual sum of squares    |
| Diagnostics | Residual plots, Q-Q plot                     |
| Fit quality | \(R^2\) measures explained variance          |

Key takeaway: Simple linear regression is an interpretable tool for modeling the relationship between two continuous variables, and it is the foundation for more complex regression models.