What is Simple Linear Regression?

Simple Linear Regression models the relationship between:

  • A response variable \(Y\) (continuous)
  • A single predictor variable \(X\)

The goal is to find the straight line that best fits the data, in the sense of minimizing the sum of squared vertical distances between the observed and fitted values.

It is one of the most widely used statistical tools in science, engineering, economics, and data science.

The Model

The population model is:

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\]

where:

  • \(\beta_0\) = intercept (expected value of \(Y\) when \(X = 0\))
  • \(\beta_1\) = slope (expected change in \(Y\) for a one-unit increase in \(X\))
  • \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\) = random error term, assumed independent across observations with constant variance \(\sigma^2\)

We estimate \(\beta_0\) and \(\beta_1\) from data using Ordinary Least Squares (OLS).
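
To make the population model concrete, the sketch below simulates data from it and checks that OLS recovers the parameters. The values of `beta0`, `beta1`, `sigma`, and `n` are illustrative assumptions, not quantities from this deck.

```r
# Simulate from Y_i = beta0 + beta1 * X_i + eps_i (all parameter values are hypothetical)
set.seed(1)
n     <- 200
beta0 <- 2     # assumed true intercept
beta1 <- 0.5   # assumed true slope
sigma <- 1     # assumed error standard deviation

x   <- runif(n, min = 0, max = 10)      # predictor values
eps <- rnorm(n, mean = 0, sd = sigma)   # independent normal errors
y   <- beta0 + beta1 * x + eps          # response generated by the model

# Fit by OLS; the estimates should land close to beta0 and beta1
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

With a sample of this size the estimates typically fall within a few hundredths of the true slope, though the exact numbers depend on the seed.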

Least Squares Estimation

OLS minimizes the Residual Sum of Squares (RSS):

\[\text{RSS} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2\]

The closed-form solutions are:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}, \quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\]
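
The closed-form expressions can be checked numerically. The following is a minimal sketch on a small made-up dataset (the `x` and `y` values are hypothetical), comparing the hand-computed estimates with those returned by `lm()`.

```r
# Small made-up dataset (hypothetical values, for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Slope: sum of cross-deviations divided by sum of squared deviations of x
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: forces the fitted line through (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)

# The same estimates from lm(); all.equal() allows for floating-point error
all.equal(unname(coef(lm(y ~ x))), c(b0, b1))
```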

Example Dataset: airquality

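We model Ozone (in ppb) as a function of Temp (temperature in degrees Fahrenheit) using R’s built-in airquality dataset, which records daily air quality measurements in New York from May to September 1973.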

Fitted Regression Line
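
A scatter plot with the fitted line overlaid could be drawn along these lines. This is a sketch using base R graphics; the colors, labels, and the object names `aq` and `fit` are choices made here, not taken from the original slide.

```r
# Complete cases only, matching the model fitted later in the deck
aq  <- na.omit(airquality)
fit <- lm(Ozone ~ Temp, data = aq)

# Scatter plot of the raw data
plot(aq$Temp, aq$Ozone,
     xlab = "Temperature (degrees F)",
     ylab = "Ozone (ppb)",
     main = "Ozone vs. Temp with fitted OLS line",
     pch = 19, col = "grey40")

# abline() accepts an lm object and draws its intercept and slope
abline(fit, col = "red", lwd = 2)
```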

3D Visualization: Ozone, Temp, and Wind

R Code: Fitting the Model

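# Fit simple linear regression.
# na.omit() drops every row with a missing value in any column
# (including Solar.R), leaving 111 complete cases.
model <- lm(Ozone ~ Temp, data = na.omit(airquality))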

# View model summary
summary(model)
## 
## Call:
## lm(formula = Ozone ~ Temp, data = na.omit(airquality))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.922 -17.459  -0.874  10.444 118.078 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -147.6461    18.7553  -7.872 2.76e-12 ***
## Temp           2.4391     0.2393  10.192  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.92 on 109 degrees of freedom
## Multiple R-squared:  0.488,  Adjusted R-squared:  0.4833 
## F-statistic: 103.9 on 1 and 109 DF,  p-value: < 2.2e-16
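
Beyond `summary()`, the fitted object can be queried directly. The sketch below extracts the coefficients, computes 95% confidence intervals, and predicts Ozone for a hypothetical day at 80 degrees F (the `new_day` data frame is an illustrative assumption).

```r
aq    <- na.omit(airquality)
model <- lm(Ozone ~ Temp, data = aq)

# Point estimates of the intercept and slope
coef(model)

# 95% confidence intervals for both coefficients
confint(model, level = 0.95)

# Prediction for a hypothetical new observation at Temp = 80,
# with a 95% prediction interval for an individual day
new_day <- data.frame(Temp = 80)
predict(model, newdata = new_day, interval = "prediction")
```

Plugging into the fitted equation gives roughly -147.65 + 2.439 × 80 ≈ 47.5 ppb for the point prediction.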

Coefficient of Determination

The coefficient of determination \(R^2\) measures the proportion of variance in \(Y\) explained by \(X\):

\[R^2 = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}} = 1 - \frac{\sum (Y_i - \hat{Y}_i)^2}{\sum (Y_i - \bar{Y})^2}\]

For our model, \(R^2 = 0.488\), meaning approximately 48.8% of the variability in Ozone is explained by Temperature.
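
The formula can be reproduced by hand from the residuals. This is a minimal sketch, refitting the same model so the block is self-contained; the names `ss_res`, `ss_tot`, and `r2` are introduced here for illustration.

```r
aq    <- na.omit(airquality)
model <- lm(Ozone ~ Temp, data = aq)

# Residual and total sums of squares
ss_res <- sum(residuals(model)^2)
ss_tot <- sum((aq$Ozone - mean(aq$Ozone))^2)

# R-squared from its definition
r2 <- 1 - ss_res / ss_tot
r2

# Should agree with the value reported by summary()
all.equal(r2, summary(model)$r.squared)

# In simple linear regression, R-squared equals the squared sample correlation
cor(aq$Temp, aq$Ozone)^2
```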

Residual Diagnostics
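
Two common diagnostic plots, residuals against fitted values and a normal Q-Q plot, could be produced as follows. This is a sketch with base R graphics and is not necessarily the set of panels shown on the original slide.

```r
aq    <- na.omit(airquality)
model <- lm(Ozone ~ Temp, data = aq)

# Two panels side by side; restore the previous layout afterwards
op <- par(mfrow = c(1, 2))

# Residuals vs. fitted: look for curvature or a funnel shape
plot(fitted(model), resid(model),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points far from the line indicate non-normal residuals
qqnorm(resid(model))
qqline(resid(model))

par(op)
```

The summary output reports a largest residual of about 118 against a smallest of about -41, so some right skew in the Q-Q plot would not be surprising.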

Key Takeaways

  • Simple linear regression fits a straight line \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\) to data
  • OLS minimizes the sum of squared residuals
  • \(R^2\) quantifies how well the model explains variability
  • Always check residual plots: curvature, a funnel shape, or other systematic patterns suggest the linear model may be inadequate
  • Temperature is a statistically significant predictor of Ozone levels (\(p < 0.001\))