Simple Linear Regression

What is Simple Linear Regression?

Simple Linear Regression models the linear relationship between:

A response variable \(Y\) (outcome)
A single predictor variable \(X\) (input)

It is one of the foundational tools of statistics and machine learning.

Goal: find the line that best summarizes the relationship between \(X\) and \(Y\), and use it to make predictions.

The Model

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\]

Where:

Symbol	Meaning
\(Y_i\)	Observed response for observation \(i\)
\(\beta_0\)	Intercept (expected value of \(Y\) when \(X = 0\))
\(\beta_1\)	Slope (expected change in \(Y\) per one-unit increase in \(X\))
\(\varepsilon_i\)	Random error term

Assumption: \(\varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)\), that is, the errors are independent and normally distributed with mean zero and constant variance \(\sigma^2\).

Least Squares Estimation

We estimate \(\beta_0\) and \(\beta_1\) by minimizing the Residual Sum of Squares (RSS):

\[\text{RSS} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2\]

This yields the Ordinary Least Squares (OLS) estimators:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2}\]

\[\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\]
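As a quick check, the two closed-form estimators can be computed directly and compared with what lm() returns. This is a minimal sketch on simulated study-hours data of the same form as the example that follows; the names x_bar, y_bar, b0_hat, b1_hat and manual are illustrative, not part of any library.

```r
# Simulated data (same recipe as the study-hours example in these notes)
set.seed(42)
hours <- round(runif(50, 1, 10), 1)
score <- 45 + 5.2 * hours + rnorm(50, 0, 7)

# Closed-form OLS estimates from the formulas above
x_bar  <- mean(hours)
y_bar  <- mean(score)
b1_hat <- sum((hours - x_bar) * (score - y_bar)) / sum((hours - x_bar)^2)
b0_hat <- y_bar - b1_hat * x_bar

# Compare with the coefficients reported by lm()
fit    <- lm(score ~ hours)
manual <- c(b0_hat, b1_hat)
print(manual)
print(unname(coef(fit)))
```

The two printed vectors should agree up to floating-point error, since lm() solves the same least-squares problem.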

Example: Study Hours vs. Exam Score

R Code: Fitting the Model

# Simulate data
set.seed(42)
hours <- round(runif(50, 1, 10), 1)
score <- 45 + 5.2 * hours + rnorm(50, 0, 7)

df <- data.frame(hours = hours, score = score)

# Fit simple linear regression
model <- lm(score ~ hours, data = df)

# View summary
summary(model)

# Plot
library(ggplot2)
ggplot(df, aes(x = hours, y = score)) +
  geom_point(color = "#4A6FA5") +
  geom_smooth(method = "lm", color = "#E05C2A") +
  theme_minimal()

Model Output

## 
## Call:
## lm(formula = score ~ hours, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.102  -4.191   1.419   4.744   9.508 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.3314     2.3426   19.35   <2e-16 ***
## hours         5.0360     0.3377   14.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.463 on 48 degrees of freedom
## Multiple R-squared:  0.8225, Adjusted R-squared:  0.8188 
## F-statistic: 222.4 on 1 and 48 DF,  p-value: < 2.2e-16

Coefficient of Determination: \(R^2\)

\[R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum(Y_i - \hat{Y}_i)^2}{\sum(Y_i - \bar{Y})^2}\]

\(R^2\) measures the proportion of variance in \(Y\) explained by \(X\)
Ranges from 0 to 1 for a least-squares fit with an intercept; a higher value means a closer fit, though not necessarily a well-specified model
\(R^2 = 0\): the model explains nothing beyond the mean
\(R^2 = 1\): perfect fit
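The definition can be checked against the value that summary() reports. The sketch below refits the study-hours model on the simulated data and computes \(R^2\) from RSS and TSS; the variable names rss, tss, r2_manual, r2_lm and r2_cor are illustrative. It also relies on a standard fact for simple linear regression: \(R^2\) equals the squared sample correlation between \(X\) and \(Y\).

```r
# Simulated data and fitted model (same recipe as the example)
set.seed(42)
hours <- round(runif(50, 1, 10), 1)
score <- 45 + 5.2 * hours + rnorm(50, 0, 7)
df    <- data.frame(hours = hours, score = score)
model <- lm(score ~ hours, data = df)

# R^2 from its definition
rss       <- sum(residuals(model)^2)            # residual sum of squares
tss       <- sum((df$score - mean(df$score))^2) # total sum of squares
r2_manual <- 1 - rss / tss

# R^2 as reported by summary(), and as the squared correlation
r2_lm  <- summary(model)$r.squared
r2_cor <- cor(df$hours, df$score)^2

print(c(manual = r2_manual, lm = r2_lm, cor_squared = r2_cor))
```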

Residual Diagnostics

Good residuals should show no pattern: plotted against the fitted values, they should scatter randomly around zero with roughly constant spread.
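Two common diagnostic plots can be drawn with base R graphics. This is one possible sketch, refitting the study-hours model on the simulated data. The names res and fit_vals are illustrative, and the side-by-side layout is only a convenience.

```r
# Simulated data and fitted model (same recipe as the example)
set.seed(42)
hours <- round(runif(50, 1, 10), 1)
score <- 45 + 5.2 * hours + rnorm(50, 0, 7)
df    <- data.frame(hours = hours, score = score)
model <- lm(score ~ hours, data = df)

res      <- residuals(model)
fit_vals <- fitted(model)

# Two panels side by side
old_par <- par(mfrow = c(1, 2))

# Residuals vs. fitted values: look for curvature or a funnel shape
plot(fit_vals, res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points near the line are consistent with normal errors
qqnorm(res)
qqline(res)

par(old_par)
```

Curvature in the first panel would suggest a non-linear relationship, and a funnel shape would suggest non-constant variance. Strong departures from the line in the Q-Q plot would cast doubt on the normality assumption.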

3D Surface: RSS as a Function of \(\beta_0\) and \(\beta_1\)
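A surface of this kind can be sketched by evaluating RSS on a grid of candidate \((\beta_0, \beta_1)\) pairs and plotting it with base R's persp(). The helper rss_fn is hypothetical. The grid ranges (30 to 60 for the intercept, 3 to 7 for the slope) are assumptions chosen to bracket the simulation values of 45 and 5.2.

```r
# Simulated data (same recipe as the example)
set.seed(42)
hours <- round(runif(50, 1, 10), 1)
score <- 45 + 5.2 * hours + rnorm(50, 0, 7)

# RSS for one candidate pair (b0, b1); hypothetical helper
rss_fn <- function(b0, b1) sum((score - b0 - b1 * hours)^2)

# Grid of candidate intercepts and slopes (ranges are an assumption)
b0_grid <- seq(30, 60, length.out = 61)
b1_grid <- seq(3, 7, length.out = 61)

# Evaluate RSS at every grid point; Vectorize lets outer() call rss_fn pairwise
rss_grid <- outer(b0_grid, b1_grid, Vectorize(rss_fn))

# Draw the bowl-shaped surface
persp(b0_grid, b1_grid, rss_grid,
      theta = 40, phi = 25,
      xlab = "beta0", ylab = "beta1", zlab = "RSS")

# The OLS fit attains the smallest possible RSS
fit     <- lm(score ~ hours)
rss_ols <- sum(residuals(fit)^2)
print(c(grid_min = min(rss_grid), ols = rss_ols))
```

Because RSS is a convex quadratic in \((\beta_0, \beta_1)\), the surface is bowl-shaped and no grid point can have a smaller RSS than the OLS fit.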

Inference on the Slope

We can test whether \(X\) has a statistically significant linear effect on \(Y\):

\[H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \neq 0\]

The test statistic is:

\[t = \frac{\hat{\beta}_1}{\text{SE}(\hat{\beta}_1)} \sim t_{n-2} \quad \text{under } H_0\]

A 95% confidence interval for the slope is:

\[\hat{\beta}_1 \pm t^*_{n-2} \cdot \text{SE}(\hat{\beta}_1)\]

## 95% CI for β₁: (4.357,  5.715)

Since zero is not in this interval, we reject \(H_0\) at the 5% level: there is a statistically significant linear association between study hours and exam score.
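The test statistic and the interval can be reproduced from the fitted model. The sketch below builds both from the estimate and its standard error, then compares the interval with R's built-in confint(). The names est, se, t_stat, t_crit, ci_manual and ci_lm are illustrative.

```r
# Simulated data and fitted model (same recipe as the example)
set.seed(42)
hours <- round(runif(50, 1, 10), 1)
score <- 45 + 5.2 * hours + rnorm(50, 0, 7)
df    <- data.frame(hours = hours, score = score)
model <- lm(score ~ hours, data = df)

# Slope estimate and its standard error from the coefficient table
coefs <- coef(summary(model))
est   <- coefs["hours", "Estimate"]
se    <- coefs["hours", "Std. Error"]

# t statistic for H0: beta1 = 0
t_stat <- est / se

# 95% CI: estimate +/- critical value (n - 2 degrees of freedom) * SE
t_crit    <- qt(0.975, df = df.residual(model))
ci_manual <- c(est - t_crit * se, est + t_crit * se)

# Built-in equivalent
ci_lm <- confint(model, "hours", level = 0.95)

print(t_stat)
print(ci_manual)
print(ci_lm)
```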

Key Takeaways

SLR models a linear relationship: \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\)
OLS minimizes \(\sum(Y_i - \hat{Y}_i)^2\) to estimate parameters
\(R^2\) tells us how much variance is explained
Residual plots and Q-Q plots help check the model assumptions
The \(t\)-test on \(\hat{\beta}_1\) tests whether the slope differs significantly from zero

SLR is the building block for multiple regression, generalized linear models, and most predictive modeling pipelines.