What Is Simple Linear Regression?

Simple linear regression models the linear relationship between a response variable \(Y\) and a single predictor variable \(X\).

It answers: “As X changes by one unit, how much does Y tend to change?”

| Predictor \(X\) | Response \(Y\) |
|---|---|
| Study hours | Exam score |
| Temperature (°F) | Ice cream sales |
| Advertising ($) | Revenue ($) |
| Square footage | House price ($) |

Fit the best straight line through a cloud of data points.

The Model Equation

The population regression model:

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\]

| Symbol | Meaning |
|---|---|
| \(\beta_0\) | Intercept: expected value of \(Y\) when \(X = 0\) |
| \(\beta_1\) | Slope: expected change in \(Y\) per one-unit increase in \(X\) |
| \(\varepsilon_i\) | Error term: random noise with mean zero |

Estimated model using sample data:

\[\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\]

The difference \(e_i = Y_i - \hat{Y}_i\) is called a residual.
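To make the fitted values and residuals concrete, here is a minimal sketch on a tiny made-up data set (the numbers in `x` and `y` are hypothetical, chosen only for illustration). It computes \(\hat{Y}_i\) and \(e_i\) from the estimated coefficients and checks them against R's `fitted()` and `residuals()` accessors.

```r
# Toy data, made up for illustration (not from the original text)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)

b0 <- unname(coef(fit)[1])   # estimated intercept
b1 <- unname(coef(fit)[2])   # estimated slope

y_hat <- b0 + b1 * x         # fitted values
e     <- y - y_hat           # residuals e_i = Y_i - Y_hat_i

# Both comparisons should report TRUE
all.equal(unname(fitted(fit)), y_hat)
all.equal(unname(residuals(fit)), e)

# With an intercept in the model, OLS residuals sum to zero (up to rounding)
sum(e)
```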

Ordinary Least Squares (OLS) Estimation

Minimize the Residual Sum of Squares:

\[\text{RSS} = \sum_{i=1}^{n}(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2\]

Closed-form solutions:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\]

Key fact: The regression line always passes through \((\bar{X}, \bar{Y})\).
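The closed-form estimates can be checked numerically. The sketch below assumes simulated data (the seed, sample size, and true line `3 + 2 * x` are arbitrary choices, not from the original), computes \(\hat{\beta}_1\) and \(\hat{\beta}_0\) directly from the formulas, and compares them with `lm()`.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
x <- runif(30, 0, 10)              # hypothetical predictor
y <- 3 + 2 * x + rnorm(30, 0, 1)   # hypothetical response around a known line

x_bar <- mean(x)
y_bar <- mean(y)

# Closed-form OLS estimates
b1 <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
b0 <- y_bar - b1 * x_bar

c(intercept = b0, slope = b1)
coef(lm(y ~ x))                    # should agree with the line above

# The fitted line passes through (x_bar, y_bar): this difference is ~0
(b0 + b1 * x_bar) - y_bar
```

The last line follows directly from \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\): substituting \(X = \bar{X}\) into the fitted line returns \(\bar{Y}\).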

Model quality — \(R^2\):

\[R^2 = 1 - \frac{\sum(Y_i - \hat{Y}_i)^2}{\sum(Y_i - \bar{Y})^2}\]

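\(R^2\) ranges from 0 to 1 and gives the proportion of variance in \(Y\) explained by the linear fit on \(X\). In simple linear regression it equals the squared sample correlation between \(X\) and \(Y\).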
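As a quick check of the \(R^2\) formula, the following sketch computes it from the residual and total sums of squares and compares the result with the value reported by `summary()`. The simulated data are an assumption made for illustration only.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
x <- runif(30, 0, 10)              # hypothetical predictor
y <- 3 + 2 * x + rnorm(30, 0, 1)   # hypothetical response
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)       # sum of (Y_i - Y_hat_i)^2
tss <- sum((y - mean(y))^2)        # sum of (Y_i - Y_bar)^2
r2  <- 1 - rss / tss

r2
summary(fit)$r.squared             # same value, as reported by lm
cor(x, y)^2                        # squared correlation, also the same here
```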

Visualizing the Least Squares Fit

The Four KEY Assumptions (LINE)

  1. Linearity — true relationship between \(X\) and \(Y\) is linear
  2. Independence — observations are independent
  3. Normality — errors \(\varepsilon_i \sim N(0,\sigma^2)\)
  4. Equal variance — \(\text{Var}(\varepsilon_i) = \sigma^2\) (constant)
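One common way to examine these assumptions is through the residuals. The sketch below uses the same simulated study-hours data as the worked example and base R graphics. It is only one possible set of checks. Independence usually has to be judged from how the data were collected rather than from a plot.

```r
set.seed(123)
hours <- runif(60, 1, 12)
score <- 40 + 5 * hours + rnorm(60, 0, 8)
fit <- lm(score ~ hours)

res <- residuals(fit)

# Linearity and equal variance: residuals vs fitted values should show
# no curve and no funnel shape
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality: points should fall close to the reference line
qqnorm(res)
qqline(res)

# Optional formal test of normality (a small p-value suggests non-normal errors)
shapiro.test(res)
```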

Hypothesis Test for the Slope

Test whether \(X\) has a significant linear effect on \(Y\):

\[H_0: \beta_1 = 0 \quad \text{(no relationship)} \qquad H_a: \beta_1 \neq 0\]

Test statistic:

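\[t = \frac{\hat{\beta}_1}{\text{SE}(\hat{\beta}_1)} \sim t_{n-2} \quad \text{where} \quad \text{SE}(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{S_{XX}}}\]

Here \(S_{XX} = \sum_{i=1}^{n}(X_i - \bar{X})^2\) and \(\hat{\sigma}^2 = \text{RSS}/(n-2)\). Under \(H_0\), \(t\) follows a \(t\) distribution with \(n-2\) degrees of freedom.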

95% confidence interval for \(\beta_1\) (\(\alpha = 0.05\)):

\[\hat{\beta}_1 \pm t_{\alpha/2,\, n-2} \cdot \text{SE}(\hat{\beta}_1)\]

  • If the 95% CI does not contain 0 → reject \(H_0\) at the 5% level
  • p-value below 0.05 → statistically significant slope at the 5% level
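The test statistic and interval can be reproduced by hand. This sketch assumes the same simulated study-hours data as the worked example and takes \(\hat{\sigma}^2 = \text{RSS}/(n-2)\). The variable names (`se_b1`, `t_stat`, and so on) are illustrative.

```r
set.seed(123)
hours <- runif(60, 1, 12)
score <- 40 + 5 * hours + rnorm(60, 0, 8)
fit <- lm(score ~ hours)

n  <- length(hours)
b1 <- unname(coef(fit)["hours"])

sigma_hat <- sqrt(sum(residuals(fit)^2) / (n - 2))   # residual standard error
s_xx      <- sum((hours - mean(hours))^2)            # S_XX
se_b1     <- sigma_hat / sqrt(s_xx)                  # SE of the slope

t_stat  <- b1 / se_b1
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)  # two-sided

t_crit <- qt(0.975, df = n - 2)                      # t_{alpha/2, n-2}, alpha = 0.05
ci <- c(lower = b1 - t_crit * se_b1, upper = b1 + t_crit * se_b1)

# Compare with the built-in results
summary(fit)$coefficients["hours", ]
confint(fit, "hours", level = 0.95)
```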

Worked Example: R Code

# Simulate data
set.seed(123)
hours <- runif(60, 1, 12)
score <- 40 + 5 * hours + rnorm(60, 0, 8)
df_exam <- data.frame(hours, score)

# Fit model
model <- lm(score ~ hours, data = df_exam)
summary(model)

# 95% confidence intervals for the intercept and slope
confint(model, level = 0.95)

# Key values from summary(model):
## Intercept: 42.138  Slope: 4.798
## R-squared: 0.8423
## p-value: < 2.2e-16

Regression Plot with Confidence Band
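A plot of this kind could be drawn as follows. This is a sketch using base R graphics and `predict()` with `interval = "confidence"`, and not necessarily how the original figure was produced. The prediction grid (`grid`) and the styling are assumptions.

```r
set.seed(123)
hours <- runif(60, 1, 12)
score <- 40 + 5 * hours + rnorm(60, 0, 8)
df_exam <- data.frame(hours, score)
model <- lm(score ~ hours, data = df_exam)

# Evenly spaced predictor values at which to evaluate the fitted line
grid <- data.frame(hours = seq(1, 12, length.out = 100))

# Columns: fit (fitted line), lwr and upr (95% confidence limits for the mean)
band <- predict(model, newdata = grid, interval = "confidence", level = 0.95)

plot(score ~ hours, data = df_exam, pch = 16, col = "grey40",
     xlab = "Study hours", ylab = "Exam score")
lines(grid$hours, band[, "fit"], lwd = 2)
lines(grid$hours, band[, "lwr"], lty = 2)
lines(grid$hours, band[, "upr"], lty = 2)
```

The band describes uncertainty in the mean response at each value of `hours`. A prediction interval for a new observation (`interval = "prediction"`) would be wider.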

Interactive 3D Plotly View

```