Simple Linear Regression: Concepts & Application

2025-10-18

Simple Linear Regression

Understanding the linear relationship between a predictor \(x\) and a response \(y\).

Overview

What this presentation covers

Intuition: when simple linear regression should be used
The mathematical model and assumptions
Estimation: least squares (closed-form)
Inference: hypothesis tests and confidence intervals
A worked example (R code + plots)

The model (mathematical form)

We model the data as \[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0,\sigma^2) \]

Key goal: estimate \(\beta_0, \beta_1\) and quantify uncertainty (standard errors, \(t\)-tests, confidence intervals).
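
To make the model concrete, the following sketch simulates data from it and refits a line. The parameter values (intercept 50, slope 2.5, error standard deviation 2), the sample size, and the seed are arbitrary assumptions chosen for illustration, not values taken from this presentation.

```r
# Simulate data from y_i = beta0 + beta1 * x_i + eps_i, eps_i ~ N(0, sigma^2)
set.seed(1)                       # arbitrary seed, for reproducibility only
n     <- 200                      # assumed sample size
beta0 <- 50                       # assumed true intercept
beta1 <- 2.5                      # assumed true slope
sigma <- 2                        # assumed error standard deviation

x   <- runif(n, min = 0, max = 10)        # predictor values
eps <- rnorm(n, mean = 0, sd = sigma)     # iid normal errors
y   <- beta0 + beta1 * x + eps            # response generated by the model

# Refit the line; the estimates should land near the assumed true values
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

Rerunning with a larger `sigma` or a smaller `n` shows how the estimates drift further from the values used to generate the data.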

Motivation Example

Scenario: A data analyst wants to predict student exam scores (\(y\)) based on the number of hours spent studying (\(x\)).

Question: Is there a linear relationship between the amount of time a student spends studying and their exam score?

Visualization idea: If we plot exam scores against study hours, we might expect the points to cluster roughly along an upward-sloping line. This would indicate that more study time is generally associated with higher exam scores.

Goal of regression:

Quantify the relationship between \(x\) and \(y\)
Use a fitted model to predict exam scores for new study times
Measure the uncertainty in those predictions

Model Assumptions

For a simple linear regression model to produce valid estimates and inference, the following assumptions must hold:

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \]

Linearity: The expected value of \(y\) is a linear function of \(x\): \[ E[Y|X=x] = \beta_0 + \beta_1 x \]
Independence: Observations \((x_i, y_i)\) are independent of each other.
Homoscedasticity: The variance of the errors is constant: \[ Var(\varepsilon_i) = \sigma^2 \quad \text{for all } i \]
Normality of errors: The errors are normally distributed: \[ \varepsilon_i \sim N(0, \sigma^2) \]

Why it is important:

Violating these assumptions can make estimates biased or inefficient, and can invalidate the standard errors, tests, and confidence intervals built on them
Diagnostic checks and residual plots can help to verify these assumptions
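
One way to examine the normality assumption is a normal Q-Q plot of the residuals together with a Shapiro-Wilk test. The sketch below uses the study-hours data from the worked example in this presentation. With only 10 observations the test has little power, so it is best read as a rough check rather than as proof that the assumption holds.

```r
# Study-hours data from the worked example
hours  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scores <- c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75)

model <- lm(scores ~ hours)
res   <- residuals(model)

# Normal Q-Q plot: points lying close to the reference line support normality
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
sw <- shapiro.test(res)
sw$p.value
```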

Estimation via Least Squares

In simple linear regression, we estimate the parameters \(\beta_0\) and \(\beta_1\) by minimizing the Residual Sum of Squares (RSS):

\[ RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \]

Closed-form solutions: \[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

Interpretation:

\(\hat{\beta}_1\) is the estimated change in \(y\) for a one-unit increase in \(x\)
\(\hat{\beta}_0\) is the predicted value of \(y\) when \(x = 0\)

Key point:

Least squares provides the line that best fits the data we observed by minimizing the squared vertical distances between the points and the regression line.
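
The closed-form solutions can be evaluated directly. The sketch below applies them to the study-hours data from the worked example in this presentation and compares the result with `lm()`; the variable names (`x_bar`, `beta1_hat`, and so on) are illustrative choices.

```r
# Study-hours data from the worked example
hours  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scores <- c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75)

x_bar <- mean(hours)
y_bar <- mean(scores)

# Slope: sum of cross-deviations divided by sum of squared x-deviations
beta1_hat <- sum((hours - x_bar) * (scores - y_bar)) / sum((hours - x_bar)^2)

# Intercept: forces the fitted line through the point (x_bar, y_bar)
beta0_hat <- y_bar - beta1_hat * x_bar

c(intercept = beta0_hat, slope = beta1_hat)

# lm() should return the same two numbers
coef(lm(scores ~ hours))
```

For these data the sums are 207 and 82.5, giving a slope of about 2.509 and an intercept of 49.8.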

Worked Example: Linear Fit in R

We will use a simple linear regression model to predict exam scores based on hours studied.

# Sample dataset
hours <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scores <- c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75)
df <- data.frame(hours, scores)

# Fit linear regression model
model <- lm(scores ~ hours, data = df)

# View summary of model
summary(model)

## 
## Call:
## lm(formula = scores ~ hours, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3818 -0.3227  0.1182  0.1591  0.6545 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 49.80000    0.24371  204.34 3.68e-16 ***
## hours        2.50909    0.03928   63.88 4.01e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3568 on 8 degrees of freedom
## Multiple R-squared:  0.998,  Adjusted R-squared:  0.9978 
## F-statistic:  4081 on 1 and 8 DF,  p-value: 4.01e-12

Visualization with ggplot2

We can visualize the relationship between exam scores and hours studied:
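
The plotting code is not reproduced in this text. A minimal sketch that would draw a comparable figure, assuming the ggplot2 package is installed, is given below; the axis labels and title are illustrative choices rather than the original ones.

```r
library(ggplot2)

# Study-hours data from the worked example
df <- data.frame(
  hours  = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  scores = c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75)
)

p <- ggplot(df, aes(x = hours, y = scores)) +
  geom_point() +                                   # observed scores
  geom_smooth(method = "lm", formula = y ~ x,      # fitted least squares line
              se = TRUE, level = 0.95,             # 95% confidence band
              colour = "black") +
  labs(x = "Hours studied", y = "Exam score",
       title = "Exam score vs. hours studied")

print(p)
```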

Interpretation of the Plot

Each point represents a student’s observed exam score for a given number of study hours.
The black line is the fitted regression line: \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\).
The shaded region shows the 95% confidence interval for predicted mean scores.
The plot visually confirms a positive linear relationship: students who study more hours generally achieve higher exam scores.

Residual Plot

Interpretation of the Residual Plot

The residuals are the differences between the observed and predicted scores: \(\text{residual}_i = y_i - \hat{y}_i\).
Ideally, the residuals should be randomly scattered around 0.
Observations:
- Points appear roughly randomly scattered → linearity assumption appears reasonable.
- No obvious funnel shape → homoscedasticity assumption likely holds.
- No extreme outliers → model is generally a good fit.
This plot helps to diagnose potential violations of linear regression assumptions.
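
The code behind the residual plot is not shown in this text. A minimal base R sketch of such a plot, using the study-hours data from the worked example, might look like this; the labels are illustrative.

```r
# Study-hours data from the worked example
hours  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scores <- c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75)

model <- lm(scores ~ hours)
res   <- resid(model)            # residual_i = y_i - y_hat_i

# Residuals against the predictor, with a dashed reference line at zero
plot(hours, res,
     xlab = "Hours studied", ylab = "Residual",
     main = "Residuals vs. hours studied")
abline(h = 0, lty = 2)
```

Because the model includes an intercept, the residuals sum to zero, so the points are balanced around the dashed line by construction; the diagnostic value lies in whether any pattern remains.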

3D Interactive Plot (Plotly)

Interpretation:

X-axis: Hours studied
Y-axis: Exam score
Z-axis: Hours slept (random noise for 3D effect)
The interactive plot allows rotating and zooming to explore relationships
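Illustrates how a second variable such as rest could be explored alongside study hours; because the sleep values here are random noise, the plot shows no real effect of sleep on exam performance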

Hypothesis Testing for the Slope

We want to test whether hours studied significantly affects exam scores:

\[ H_0: \beta_1 = 0 \quad \text{vs} \quad H_1: \beta_1 \neq 0 \]

Null hypothesis \(H_0\): There is no linear relationship between hours studied and exam scores.
Alternative hypothesis \(H_1\): There is a linear relationship.

Test statistic:

\[ t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} \]

\(\hat{\beta}_1\) = estimated slope
\(SE(\hat{\beta}_1)\) = standard error of \(\hat{\beta}_1\)
Degrees of freedom = \(n - 2\); under \(H_0\), \(t\) follows a \(t\)-distribution with \(n - 2\) degrees of freedom
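
The test statistic can be computed by hand and checked against the `hours` row of the `summary(model)` output shown earlier. The sketch below uses the standard result \(SE(\hat{\beta}_1) = \sqrt{\hat{\sigma}^2 / \sum (x_i - \bar{x})^2}\) with \(\hat{\sigma}^2 = RSS/(n-2)\), which this presentation does not state explicitly; variable names are illustrative.

```r
# Study-hours data from the worked example
hours  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scores <- c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75)
n      <- length(hours)

# Least squares estimates
sxx       <- sum((hours - mean(hours))^2)
beta1_hat <- sum((hours - mean(hours)) * (scores - mean(scores))) / sxx
beta0_hat <- mean(scores) - beta1_hat * mean(hours)

# Estimated error variance: RSS / (n - 2)
rss        <- sum((scores - beta0_hat - beta1_hat * hours)^2)
sigma2_hat <- rss / (n - 2)

# Standard error of the slope, t statistic, and two-sided p-value
se_beta1 <- sqrt(sigma2_hat / sxx)
t_stat   <- beta1_hat / se_beta1
p_value  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

c(se = se_beta1, t = t_stat, p = p_value)
```

The values agree with the reported standard error (0.03928) and \(t\) value (63.88), so \(H_0\) is rejected at any conventional significance level.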

Confidence Intervals for Prediction

Once the model is fitted, we can compute a confidence interval for the expected (mean) exam score at a given number of study hours \(x_0\):

\[ \hat{y}_0 \pm t_{\alpha/2, n-2} \cdot SE(\hat{y}_0) \]

Where the standard error of the predicted mean is:

\[ SE(\hat{y}_0) = \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right)} \]

\(\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0\) = estimated mean exam score at \(x_0\)
\(t_{\alpha/2, n-2}\) = critical value from the \(t\)-distribution with \(n - 2\) degrees of freedom
\(\hat{\sigma}^2 = RSS/(n-2)\) = estimated error variance

In R:

predict(model, newdata = data.frame(hours = 7), interval = "confidence")

##        fit      lwr      upr
## 1 67.36364 67.07014 67.65713
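
The interval returned by `predict()` can be reproduced from the formula above. This sketch recomputes each piece for \(x_0 = 7\) at the 95% level; the variable names are illustrative.

```r
# Study-hours data from the worked example
hours  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scores <- c(52, 55, 57, 60, 63, 65, 67, 70, 72, 75)
n      <- length(hours)
x0     <- 7

# Least squares estimates and estimated error variance
x_bar      <- mean(hours)
sxx        <- sum((hours - x_bar)^2)
beta1_hat  <- sum((hours - x_bar) * (scores - mean(scores))) / sxx
beta0_hat  <- mean(scores) - beta1_hat * x_bar
sigma2_hat <- sum((scores - beta0_hat - beta1_hat * hours)^2) / (n - 2)

# Estimated mean response and its standard error at x0
y0_hat <- beta0_hat + beta1_hat * x0
se_y0  <- sqrt(sigma2_hat * (1 / n + (x0 - x_bar)^2 / sxx))

# 95% interval: alpha = 0.05, so the critical value is the 0.975 quantile
t_crit    <- qt(0.975, df = n - 2)
ci_manual <- c(fit = y0_hat,
               lwr = y0_hat - t_crit * se_y0,
               upr = y0_hat + t_crit * se_y0)
ci_manual
```

Passing `interval = "prediction"` to `predict()` instead gives the wider interval for an individual new score, which adds the error variance of a single observation to the uncertainty in the mean.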

Summary & Key Takeaways

Simple Linear Regression helps quantify and interpret the linear relationship between a predictor (\(x\)) and a response (\(y\)).
The least squares method estimates parameters \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize the sum of squared residuals.
Model assumptions (linearity, independence, homoscedasticity, normality) should always be checked through diagnostic plots.
Hypothesis testing of \(\beta_1\) assesses whether there is statistically significant evidence of a linear relationship between the predictor and the response.
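Confidence intervals for the mean response give a range for the expected outcome at a given \(x\) and quantify the uncertainty in the fitted line; an interval for an individual new observation would be wider.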
Visualization tools (ggplot2, plotly) enhance interpretability and allow for interactive model exploration.

Big picture

Simple linear regression provides a foundation for understanding predictive relationships - a first step toward more advanced models like multiple regression and logistic regression.