2025-10-18

Simple Linear Regression

Understanding the linear relationship between a predictor \(x\) and a response \(y\).

Overview

What this presentation covers

  • Intuition: when simple linear regression should be used
  • The mathematical model and assumptions
  • Estimation: least squares (closed-form)
  • Inference: hypothesis tests and confidence intervals
  • A worked example (R code + plots)

The model (mathematical form)

We model the data as \[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0,\sigma^2) \]

Key goal: estimate \(\beta_0, \beta_1\) and quantify uncertainty (standard errors, \(t\)-tests, confidence intervals).
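
To make the model concrete, the following sketch simulates data from it and checks that least squares recovers the parameters. The values of `beta0`, `beta1`, `sigma`, and `n` are illustrative assumptions, not taken from the presentation.

```r
# Simulate data from y_i = beta0 + beta1 * x_i + eps_i
# (all parameter values below are hypothetical, chosen for illustration)
set.seed(1)                  # for reproducibility
n     <- 200
beta0 <- 50                  # hypothetical intercept
beta1 <- 2.5                 # hypothetical slope
sigma <- 3                   # hypothetical error standard deviation

x   <- runif(n, min = 0, max = 10)
eps <- rnorm(n, mean = 0, sd = sigma)   # iid N(0, sigma^2) errors
y   <- beta0 + beta1 * x + eps

sim_fit <- lm(y ~ x)
coef(sim_fit)                # estimates should land near beta0 and beta1
```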

Motivation Example

Scenario: A data analyst wants to predict student exam scores (\(y\)) based on the number of hours spent studying (\(x\)).

Question: Is there a linear relationship between the amount of time a student spends studying and their exam score?

Visualization idea: If we plot exam scores against study hours, we might expect the points to cluster roughly along an upward-sloping line. This would indicate that more study time is generally associated with higher exam scores.

Goal of regression:

  • Quantify the relationship between \(x\) and \(y\)
  • Use a fitted model to predict exam scores for new study times
  • Quantify the uncertainty in those predictions

Model Assumptions

For a simple linear regression model to produce valid estimates and valid inference, the following assumptions must hold:

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \]

  1. Linearity: The expected value of \(y\) is a linear function of \(x\): \[ E[Y|X=x] = \beta_0 + \beta_1 x \]

  2. Independence: Observations \((x_i, y_i)\) are independent of each other.

  3. Homoscedasticity: The variance of the errors is constant: \[ Var(\varepsilon_i) = \sigma^2 \quad \text{for all } i \]

  4. Normality of errors: The errors are normally distributed: \[ \varepsilon_i \sim N(0, \sigma^2) \]

Why it is important:

  • Violating these assumptions can make estimates biased or inefficient, and can invalidate standard errors, tests, and confidence intervals
  • Diagnostic checks and residual plots can help to verify these assumptions
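
As one possible check of the normality assumption, the sketch below draws a normal Q-Q plot of the residuals and runs a Shapiro-Wilk test. It assumes the exam data from the worked example; with only 10 observations such checks have little power, so treat the result as a rough guide.

```r
# Exam data from the worked example
hours  <- 1:10
scores <- c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75)
model  <- lm(scores ~ hours)

res <- resid(model)

# Normal Q-Q plot: points close to the reference line are consistent with normal errors
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
sw <- shapiro.test(res)
sw$p.value
```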

Estimation via Least Squares

In simple linear regression, we estimate the parameters \(\beta_0\) and \(\beta_1\) by minimizing the Residual Sum of Squares (RSS):

\[ RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \]
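
A short derivation sketch (standard calculus, not part of the original slides): setting the partial derivatives of RSS with respect to \(\beta_0\) and \(\beta_1\) to zero gives the normal equations

\[ \frac{\partial RSS}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0, \qquad \frac{\partial RSS}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0 \]

The first equation gives \(\beta_0 = \bar{y} - \beta_1 \bar{x}\). Substituting this into the second gives \(\sum_{i=1}^{n} x_i \left( (y_i - \bar{y}) - \beta_1 (x_i - \bar{x}) \right) = 0\), and because \(\sum_{i=1}^{n} \bar{x}(y_i - \bar{y}) = 0\) and \(\sum_{i=1}^{n} \bar{x}(x_i - \bar{x}) = 0\), the slope can be written as a ratio of centered sums.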

Closed-form solutions: \[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

Interpretation:

  • \(\hat{\beta}_1\) is the estimated change in \(y\) for a one-unit increase in \(x\)
  • \(\hat{\beta}_0\) is the predicted value of \(y\) when \(x = 0\)

Key point:

Least squares provides the line that best fits the data we observed by minimizing the squared vertical distances between the points and the regression line.
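
The closed-form estimates can be computed directly and compared with `lm()`. This is a minimal sketch using the exam data from the worked example; the names `b0` and `b1` are introduced here for illustration.

```r
# Exam data from the worked example
hours  <- 1:10
scores <- c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75)

x_bar <- mean(hours)
y_bar <- mean(scores)

# Slope: sum of cross-products over sum of squared deviations of x
b1 <- sum((hours - x_bar) * (scores - y_bar)) / sum((hours - x_bar)^2)
# Intercept: forces the fitted line through (x_bar, y_bar)
b0 <- y_bar - b1 * x_bar

c(intercept = b0, slope = b1)

# Cross-check against lm(); TRUE if the two agree up to numerical tolerance
all.equal(unname(coef(lm(scores ~ hours))), c(b0, b1))
```

For these data the sums work out to \(207 / 82.5 \approx 2.509\) for the slope and \(63.6 - 2.509 \times 5.5 = 49.8\) for the intercept.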

Worked Example: Linear Fit in R

We will use a simple linear regression model to predict exam scores based on hours studied.

# Sample dataset
hours <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scores <- c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75)
df <- data.frame(hours, scores)

# Fit linear regression model
model <- lm(scores ~ hours, data = df)

# View summary of model
summary(model)
## 
## Call:
## lm(formula = scores ~ hours, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3818 -0.3227  0.1182  0.1591  0.6545 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 49.80000    0.24371  204.34 3.68e-16 ***
## hours        2.50909    0.03928   63.88 4.01e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3568 on 8 degrees of freedom
## Multiple R-squared:  0.998,  Adjusted R-squared:  0.9978 
## F-statistic:  4081 on 1 and 8 DF,  p-value: 4.01e-12
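
The summary reports point estimates and standard errors. Confidence intervals for the coefficients are not part of that output, but base R's `confint()` computes them from the same fit. A minimal sketch, refitting the model so the block runs on its own:

```r
# Exam data and model from the worked example
df    <- data.frame(hours  = 1:10,
                    scores = c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75))
model <- lm(scores ~ hours, data = df)

# 95% confidence intervals: estimate +/- t critical value (8 df) * standard error
ci <- confint(model, level = 0.95)
ci
```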

Visualization with ggplot2

We can visualize the relationship between exam scores and hours studied:
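
The plotting code is not shown in this version of the slides. A plot of this kind could be produced with something like the following sketch; the axis labels and title are assumptions.

```r
library(ggplot2)

# Exam data from the worked example
df <- data.frame(hours  = 1:10,
                 scores = c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75))

p <- ggplot(df, aes(x = hours, y = scores)) +
  geom_point() +
  # Least squares line with a shaded 95% confidence band for the mean response
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, level = 0.95,
              color = "black") +
  labs(x = "Hours studied", y = "Exam score",
       title = "Exam score vs. hours studied")

print(p)
```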

Interpretation of the Plot

  • Each point represents a student’s observed exam score for a given number of study hours.
  • The black line is the fitted regression line: \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\).
  • The shaded region shows the 95% confidence interval for predicted mean scores.
  • The plot visually confirms a positive linear relationship: students who study more hours generally achieve higher exam scores.

Residual Plot

Interpretation of the Residual Plot

  • The residuals are the differences between the observed and predicted scores: \(\text{residual}_i = y_i - \hat{y}_i\).
  • Ideally, the residuals should be randomly scattered around 0.
  • Observations:
    • Points appear roughly randomly scattered → linearity assumption appears reasonable.
    • No obvious funnel shape → homoscedasticity assumption likely holds.
    • No extreme outliers → model is generally a good fit.
  • This plot helps to diagnose potential violations of linear regression assumptions.
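
The residual plot discussed above can be sketched as follows. The original plotting code is not shown, so the layout and labels here are assumptions.

```r
library(ggplot2)

# Exam data and model from the worked example
df    <- data.frame(hours  = 1:10,
                    scores = c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75))
model <- lm(scores ~ hours, data = df)

# One row per observation: fitted value and residual (observed minus fitted)
diag_df <- data.frame(fitted = fitted(model), residual = resid(model))

r <- ggplot(diag_df, aes(x = fitted, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +   # reference line at zero
  labs(x = "Fitted exam score", y = "Residual",
       title = "Residuals vs. fitted values")

print(r)
```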

3D Interactive Plot (Plotly)

Interpretation:

  • X-axis: Hours studied
  • Y-axis: Exam score
  • Z-axis: Hours slept (random noise for 3D effect)
  • The interactive plot allows rotating and zooming to explore relationships
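  • Because the hours-slept values are random noise, the plot only illustrates how a second variable such as rest could be explored alongside study hours; it shows no real effect of sleep on exam performance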
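
A 3D scatter plot along these lines could be built with plotly as sketched below. The `sleep` column, its range of 5 to 9 hours, and the seed are hypothetical stand-ins for the random noise described above.

```r
library(plotly)

set.seed(42)   # arbitrary seed so the random z-values are reproducible

# Exam data from the worked example
df <- data.frame(hours  = 1:10,
                 scores = c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75))

# Hypothetical "hours slept": pure random noise, unrelated to the scores
df$sleep <- runif(nrow(df), min = 5, max = 9)

fig <- plot_ly(df, x = ~hours, y = ~scores, z = ~sleep,
               type = "scatter3d", mode = "markers")
fig <- layout(fig, scene = list(
  xaxis = list(title = "Hours studied"),
  yaxis = list(title = "Exam score"),
  zaxis = list(title = "Hours slept")
))

fig
```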

Hypothesis Testing for the Slope

We want to test whether hours studied significantly affects exam scores:

\[ H_0: \beta_1 = 0 \quad \text{vs} \quad H_1: \beta_1 \neq 0 \]

  • Null hypothesis \(H_0\): There is no linear relationship between hours studied and exam scores.
  • Alternative hypothesis \(H_1\): There is a linear relationship.

Test statistic:

\[ t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} \]

  • \(\hat{\beta}_1\) = estimated slope
  • \(SE(\hat{\beta}_1)\) = standard error of \(\hat{\beta}_1\)
  • Degrees of freedom = \(n - 2\)
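
The test statistic and its two-sided p-value can be reproduced by hand from the fitted model. A minimal sketch, refitting the worked-example model so the block runs on its own:

```r
# Exam data and model from the worked example
df    <- data.frame(hours  = 1:10,
                    scores = c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75))
model <- lm(scores ~ hours, data = df)

b1     <- coef(model)[["hours"]]               # estimated slope
se_b1  <- sqrt(diag(vcov(model)))[["hours"]]   # standard error of the slope
df_res <- df.residual(model)                   # n - 2

t_stat <- b1 / se_b1
# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_val  <- 2 * pt(abs(t_stat), df = df_res, lower.tail = FALSE)

c(t = t_stat, df = df_res, p = p_val)
```

The resulting t value and p-value should match the `hours` row of the `summary(model)` output.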

Confidence Intervals for Prediction

Once the model is fitted, we can estimate confidence intervals for the expected exam score at a given number of study hours \(x_0\):

\[ \hat{y}_0 \pm t_{\alpha/2, n-2} \cdot SE(\hat{y}_0) \]

Where the standard error of the predicted mean is:

\[ SE(\hat{y}_0) = \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)} \]

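  • \(\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0\) = estimated mean exam score at \(x_0\)
  • \(t_{\alpha/2, n-2}\) = critical value from the \(t\)-distribution with \(n - 2\) degrees of freedom
  • \(\hat{\sigma}^2 = RSS / (n - 2)\) = estimated error variance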

In R:

predict(model, newdata = data.frame(hours = 7), interval = "confidence")
##        fit      lwr      upr
## 1 67.36364 67.07014 67.65713
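
As a cross-check, the same interval can be computed from the formula above. The sketch also requests a prediction interval for a single new student via `interval = "prediction"`, which is not covered in the slides; it is wider because it includes the error variance of one new observation.

```r
# Exam data and model from the worked example
df    <- data.frame(hours  = 1:10,
                    scores = c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75))
model <- lm(scores ~ hours, data = df)

x0 <- 7
n  <- nrow(df)

sigma2_hat <- sum(resid(model)^2) / (n - 2)      # estimated error variance
y0_hat     <- sum(coef(model) * c(1, x0))        # beta0_hat + beta1_hat * x0
sxx        <- sum((df$hours - mean(df$hours))^2)
se_mean    <- sqrt(sigma2_hat * (1 / n + (x0 - mean(df$hours))^2 / sxx))
t_crit     <- qt(0.975, df = n - 2)              # alpha = 0.05

# 95% confidence interval for the mean score at x0
ci <- c(fit = y0_hat,
        lwr = y0_hat - t_crit * se_mean,
        upr = y0_hat + t_crit * se_mean)
ci

# 95% prediction interval for one new observation at x0
pred_int <- predict(model, newdata = data.frame(hours = x0),
                    interval = "prediction")
pred_int
```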

Summary & Key Takeaways

  • Simple Linear Regression helps quantify and interpret the linear relationship between a predictor (\(x\)) and a response (\(y\)).
  • The least squares method estimates parameters \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize the sum of squared residuals.
  • Model assumptions (linearity, independence, homoscedasticity, normality) should always be checked through diagnostic plots.
  • Hypothesis testing of \(\beta_1\) determines if the predictor has a statistically significant effect on the response.
  • Confidence intervals provide a range for the expected (mean) response at a given \(x\) and help quantify the uncertainty in the fitted line.
  • Visualization tools (ggplot2, plotly) enhance interpretability and allow for interactive model exploration.

Big picture

Simple linear regression provides a foundation for understanding predictive relationships and is a first step toward more advanced models like multiple regression and logistic regression.