2025-09-17

Slide 1: Introduction

Simple linear regression models the relationship between two variables by fitting a linear equation to observed data.
We use it to predict outcomes and understand associations. In this project, we apply it to real-world car data to see how weight impacts fuel efficiency.

Slide 2: Learning Objectives

  • Understand the simple linear regression model and OLS estimation
  • Fit and interpret a regression line in R
  • Produce ggplot diagnostics and interactive Plotly visuals (2D and 3D)
  • Show the R code used to generate the results

These objectives guide us through both the theory and the application of regression.

Slide 3: Model Equation

The simple linear regression model:
\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \qquad \epsilon_i \sim iid(0, \sigma^2) \]

Where:
- \(Y_i\) = dependent variable
- \(X_i\) = independent variable
- \(\beta_0\) = intercept
- \(\beta_1\) = slope
- \(\epsilon_i\) = error term

This equation describes how the outcome (Y) depends on the predictor (X): a straight-line trend plus a random error.
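
To make the model concrete, here is a minimal simulation sketch. The parameter values (`beta0 = 2`, `beta1 = 0.5`, `sigma = 1`), the sample size, and the use of normal errors are all assumptions chosen for illustration; the model itself only requires errors with mean 0 and variance \(\sigma^2\).

```r
# Simulate data from Y_i = beta0 + beta1 * X_i + eps_i (illustrative values only)
set.seed(1)
n     <- 200
beta0 <- 2     # hypothetical intercept
beta1 <- 0.5   # hypothetical slope
sigma <- 1     # hypothetical error standard deviation

x   <- runif(n, min = 0, max = 10)
eps <- rnorm(n, mean = 0, sd = sigma)  # normal errors are one choice with mean 0, variance sigma^2
y   <- beta0 + beta1 * x + eps

# Fitting a line to the simulated data should recover values near beta0 and beta1
sim_fit <- lm(y ~ x)
coef(sim_fit)
```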

Slide 4: OLS Estimators

Ordinary least squares (OLS) estimates:
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \]

Variance of the slope estimator:

\[ Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \]

These formulas show us how regression finds the best-fitting line by minimizing squared errors.
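
As a check on the formulas, the estimates can be computed by hand. The sketch below uses R's built-in `mtcars` data (the dataset this project works with), taking `wt` as X and `mpg` as Y. Because \(\sigma^2\) is unknown in practice, the sketch plugs in the usual residual-based estimate with \(n - 2\) degrees of freedom; the variable names are illustrative.

```r
# Hand-computed OLS estimates for mpg ~ wt on the built-in mtcars data
x <- mtcars$wt
y <- mtcars$mpg

sxx <- sum((x - mean(x))^2)                # sum of squared deviations of X
sxy <- sum((x - mean(x)) * (y - mean(y)))  # sum of cross-products

b1 <- sxy / sxx               # slope estimate
b0 <- mean(y) - b1 * mean(x)  # intercept estimate

# Estimate sigma^2 from the residuals (n - 2 degrees of freedom),
# then plug it into the variance formula for the slope
res        <- y - (b0 + b1 * x)
sigma2_hat <- sum(res^2) / (length(y) - 2)
se_b1      <- sqrt(sigma2_hat / sxx)

c(intercept = b0, slope = b1, se_slope = se_b1)
```

These values should agree with the coefficients and the slope's standard error that `lm(mpg ~ wt, data = mtcars)` reports.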

Slide 5.1: Simple Linear Regression Application

For this project, we will use the “mtcars” dataset to evaluate the relationship between car weight and fuel efficiency.
Weight (“wt”, in 1000 lbs) will be our independent variable and miles per gallon (“mpg”) will be our dependent variable.
We also keep horsepower (“hp”) for the 3D plot later on and add a “car” column to label points in interactive plots.

Slide 5.2: Load Data

df <- mtcars[, c("mpg", "wt", "hp")]
df$car <- rownames(mtcars)
head(df)
##                    mpg    wt  hp               car
## Mazda RX4         21.0 2.620 110         Mazda RX4
## Mazda RX4 Wag     21.0 2.875 110     Mazda RX4 Wag
## Datsun 710        22.8 2.320  93        Datsun 710
## Hornet 4 Drive    21.4 3.215 110    Hornet 4 Drive
## Hornet Sportabout 18.7 3.440 175 Hornet Sportabout
## Valiant           18.1 3.460 105           Valiant

Slide 6: Fit Linear Model

We fit a simple linear regression with “mpg” as the dependent variable and “wt” as the independent variable.
This lets us quantify the relationship between weight and fuel efficiency using the OLS method.

fit <- lm(mpg ~ wt, data = df)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ wt, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
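
Reading the output: the slope of -5.3445 means each additional 1000 lbs of weight is associated with about 5.3 fewer miles per gallon on average, and an R-squared of 0.7528 means weight accounts for roughly 75% of the variation in MPG. The sketch below shows one way to pull these quantities out of the fitted object programmatically; it refits the model on `mtcars` directly so that it runs on its own.

```r
# Extract key quantities from the fitted model (refit here so the block is self-contained)
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)                   # intercept and slope estimates
confint(fit, level = 0.95)  # 95% confidence intervals for both coefficients

r2 <- summary(fit)$r.squared
r2

# In simple linear regression, R-squared equals the squared correlation of X and Y
cor(mtcars$mpg, mtcars$wt)^2
```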

Slide 7.1: Scatterplot with Regression Line Code

This code produces a scatterplot with the regression line to visualize the model fit.

library(ggplot2)

ggplot(df, aes(x = wt, y = mpg)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, color = "blue") +
  labs(title = "MPG vs Weight with OLS Line",
       x = "Weight (1000 lbs)", y = "MPG") +
  theme_minimal(base_size = 14)

Slide 7.2: Scatterplot with Regression Line Plot

Slide 8.1: Residuals vs Fitted Code

Residual plots help check assumptions like linearity and constant variance.

fitted_vals <- fitted(fit)
residuals_vals <- resid(fit)

plot(fitted_vals, residuals_vals,
     main = "Residuals vs Fitted",
     xlab = "Fitted values", ylab = "Residuals",
     pch = 19)
abline(h = 0, lty = 2, col = "red")
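
The plot above uses base graphics. For consistency with the ggplot scatterplot, the same diagnostic could be drawn with ggplot2 along these lines. This is a sketch, not the code used for the slides; `diag_df` and `p_resid` are illustrative names, and the model is refit on `mtcars` so the block runs on its own.

```r
library(ggplot2)  # assumed to be installed

# Refit so the block is self-contained
fit <- lm(mpg ~ wt, data = mtcars)

# Collect fitted values and residuals in one data frame for plotting
diag_df <- data.frame(fitted = fitted(fit), resid = resid(fit))

p_resid <- ggplot(diag_df, aes(x = fitted, y = resid)) +
  geom_point(size = 2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs Fitted",
       x = "Fitted values", y = "Residuals") +
  theme_minimal(base_size = 14)
p_resid
```

A patternless band of points around the dashed line is what we hope to see; curvature suggests non-linearity and a funnel shape suggests non-constant variance.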

Slide 8.2: Residuals vs Fitted Plot

Slide 9: Histogram of Residuals

The histogram shows whether the residuals are approximately normally distributed.

hist(residuals(fit), breaks=10, col='steelblue', border='black',
     main='Histogram of Residuals', xlab='Residual', ylab='Count')

Slide 10: Q-Q Plot of Residuals

The Q-Q plot checks whether the residuals are consistent with the normality assumption: points close to the reference line suggest approximate normality.

qqnorm(residuals(fit), main='Q-Q Plot of Residuals', cex=0.6)
qqline(residuals(fit), col='red')

Slide 11: Interactive Scatter (Plotly)

The interactive version lets us hover over points to see each car's name, weight, and MPG.

library(plotly)

# Height is set in plot_ly(); passing it to layout() is deprecated in plotly for R
p_plotly <- plot_ly(df, x = ~wt, y = ~mpg, type = 'scatter', mode = 'markers',
                    text = ~paste("car:", car, "<br>wt:", wt, "<br>mpg:", mpg),
                    hoverinfo = 'text', marker = list(size=5),
                    height = 300) %>%
  add_lines(x = ~wt, y = fitted(fit), line = list(color = 'blue'))  # fitted OLS line
p_plotly

Slide 12.1: 3D Regression Plane Code

Here, we extend the model to multiple regression with two predictors: weight and horsepower.

# 3D regression plane code (shown as code only)
fit3 <- lm(mpg ~ wt + hp, data = df)

wt_seq <- seq(min(df$wt), max(df$wt), length.out = 30)
hp_seq <- seq(min(df$hp), max(df$hp), length.out = 30)
grid <- expand.grid(wt = wt_seq, hp = hp_seq)
grid$pred <- predict(fit3, newdata = grid)
# expand.grid() varies wt fastest, so rows of zmat index wt_seq and columns index hp_seq
zmat <- matrix(grid$pred, nrow = 30, ncol = 30)

Slide 12.2: 3D Regression Plane Plot

The 3D surface shows how weight and horsepower together affect fuel efficiency.
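
The plotting call itself is not included in the slides. One way the plane might be drawn with plotly is sketched below; this is an assumption about the approach, not the original code, and `p_3d` is an illustrative name. Note that `add_surface()` treats rows of `z` as the y-axis, so the prediction matrix is transposed before plotting.

```r
library(plotly)  # assumed to be installed

# Rebuild the model and prediction grid so the block is self-contained
df   <- mtcars[, c("mpg", "wt", "hp")]
fit3 <- lm(mpg ~ wt + hp, data = df)

wt_seq <- seq(min(df$wt), max(df$wt), length.out = 30)
hp_seq <- seq(min(df$hp), max(df$hp), length.out = 30)
grid   <- expand.grid(wt = wt_seq, hp = hp_seq)
zmat   <- matrix(predict(fit3, newdata = grid), nrow = 30, ncol = 30)

# zmat has wt along rows; add_surface() expects z[i, j] at (x[j], y[i]),
# so with x = wt and y = hp the matrix must be transposed
p_3d <- plot_ly() %>%
  add_surface(x = wt_seq, y = hp_seq, z = t(zmat),
              opacity = 0.6, showscale = FALSE) %>%
  add_markers(data = df, x = ~wt, y = ~hp, z = ~mpg,
              marker = list(size = 3)) %>%
  layout(scene = list(xaxis = list(title = "Weight (1000 lbs)"),
                      yaxis = list(title = "Horsepower"),
                      zaxis = list(title = "MPG")))
p_3d
```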

Slide 13: Hypothesis and Inference

Test significance of slope:

\[ H_0: \beta_1 = 0 \quad \text{vs.} \quad H_A: \beta_1 \neq 0 \]

Test statistic:

\[ t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} \]

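Compare \(|t|\) with the critical \(t\)-value on \(n - 2\) degrees of freedom, or compare the p-value with the significance level \(\alpha\). This tests whether car weight has a statistically significant linear association with MPG.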
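
The test can be carried out by hand from the fitted model. The sketch below refits the model on `mtcars` so that it runs on its own, and assumes a two-sided test at \(\alpha = 0.05\); it should reproduce the t value (-9.559) and p-value (1.29e-10) shown in the `summary(fit)` output on Slide 6.

```r
# Manual t-test for H0: beta1 = 0 (refit here so the block is self-contained)
fit <- lm(mpg ~ wt, data = mtcars)

coefs <- coef(summary(fit))
est   <- coefs["wt", "Estimate"]
se    <- coefs["wt", "Std. Error"]

t_stat <- (est - 0) / se      # test statistic
df_res <- df.residual(fit)    # n - 2 = 30 degrees of freedom

p_val  <- 2 * pt(-abs(t_stat), df = df_res)  # two-sided p-value
t_crit <- qt(0.975, df = df_res)             # critical value for alpha = 0.05 (assumed level)

c(t = t_stat, p = p_val, t_crit = t_crit)
abs(t_stat) > t_crit  # TRUE means we reject H0 at the 5% level
```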

Slide 14: Prediction Example

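We use the fitted model to predict MPG for cars of different weights and show both confidence and prediction intervals. The confidence interval covers the mean MPG at a given weight; the prediction interval covers the MPG of a single new car, so it is wider.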

newdata <- data.frame(wt = c(2.5, 3.0, 3.5))
conf_int <- predict(fit, newdata = newdata, interval = "confidence")
pred_int <- predict(fit, newdata = newdata, interval = "prediction")
list(confidence = conf_int, prediction = pred_int)
## $confidence
##        fit      lwr      upr
## 1 23.92395 22.55284 25.29506
## 2 21.25171 20.12444 22.37899
## 3 18.57948 17.43342 19.72553
## 
## $prediction
##        fit      lwr      upr
## 1 23.92395 17.55411 30.29378
## 2 21.25171 14.92987 27.57355
## 3 18.57948 12.25426 24.90469
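
To see where the two interval widths come from, the intervals for a 3000-lb car (`wt = 3.0`, row 2 above) can be rebuilt from the standard formulas: the standard error for the mean response is \(s\sqrt{1/n + (x_0 - \bar{X})^2 / \sum (X_i - \bar{X})^2}\), and the prediction interval adds 1 under the square root for the new observation's own error. A minimal sketch, with illustrative variable names and the model refit on `mtcars` so the block runs on its own:

```r
# Manual 95% confidence and prediction intervals at wt = 3.0
fit <- lm(mpg ~ wt, data = mtcars)

x   <- mtcars$wt
n   <- length(x)
x0  <- 3.0
s   <- sigma(fit)              # residual standard error
sxx <- sum((x - mean(x))^2)

y0_hat <- sum(coef(fit) * c(1, x0))  # point prediction: b0 + b1 * x0
t_crit <- qt(0.975, df = n - 2)

se_mean <- s * sqrt(1 / n + (x0 - mean(x))^2 / sxx)      # uncertainty in the mean response
se_pred <- s * sqrt(1 + 1 / n + (x0 - mean(x))^2 / sxx)  # adds the new observation's error

conf <- y0_hat + c(-1, 1) * t_crit * se_mean
pred <- y0_hat + c(-1, 1) * t_crit * se_pred

rbind(confidence = conf, prediction = pred)
```

The results should match the `lwr` and `upr` columns of row 2 in each table above.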

Slide 15: Takeaways

  • Always check linearity with scatterplots
  • Check homoscedasticity (residuals vs fitted)
  • Check residual normality (histogram & Q-Q plot)
  • Use ggplot and plotly to visualize data and regression
  • Interactive plots allow inspecting individual observations