Simple Linear Regression — Concepts & Example

Muhammad Arif

Agenda

What is SLR? (Math)

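We model a numerical response \(Y\) with a single predictor \(X\): \[ Y = \beta_0 + \beta_1 X + \varepsilon, \quad \varepsilon \sim N(0,\sigma^2). \] Here \(\beta_0\) is the intercept, \(\beta_1\) is the slope (the expected change in \(Y\) per one-unit increase in \(X\)), and \(\varepsilon\) is a random error term.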

Estimation (Math)

Ordinary Least Squares (OLS) chooses \(\hat{\beta}_0,\hat{\beta}_1\) to minimize the sum of squared errors: \[ \text{SSE} = \sum_{i=1}^n \bigl(y_i - \hat{y}_i\bigr)^2 = \sum_{i=1}^n \bigl(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\bigr)^2. \]
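
As a minimal sketch of what this minimization produces, the one-predictor case has the standard closed-form solution \(\hat{\beta}_1 = \sum (x_i-\bar{x})(y_i-\bar{y}) / \sum (x_i-\bar{x})^2\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\). The code below computes both by hand and compares them with lm(). It assumes the built-in trees data used in the example slides; the names x, y, b0, and b1 are illustrative only.

```r
# Sketch: closed-form OLS estimates for simple linear regression.
# Assumes the built-in trees data; x, y, b0, b1 are illustrative names.
data(trees)
x <- trees$Girth
y <- trees$Volume

# Slope: sum of cross-deviations divided by sum of squared x-deviations
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: makes the fitted line pass through (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

c(b0 = b0, b1 = b1)

# The same estimates from lm(), for comparison
coef(lm(Volume ~ Girth, data = trees))
```

Both calls should report the same intercept and slope up to floating-point rounding.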

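Model fit quality is summarized by the coefficient of determination \(R^2\) and the mean squared error (MSE), which estimates \(\sigma^2\): \[ R^2 = 1 - \frac{\text{SSE}}{\text{SST}}, \quad \text{SST} = \sum_{i=1}^n \bigl(y_i - \bar{y}\bigr)^2, \quad \text{MSE}=\frac{\text{SSE}}{n-2}. \]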
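
A short sketch of these two summaries computed directly from their definitions, taking SST to be the total sum of squares \(\sum (y_i - \bar{y})^2\). It assumes the trees data from the example slides; fit, sse, sst, r2, and mse are illustrative names. The square root of MSE is what summary() reports as the residual standard error.

```r
# Sketch: R^2 and MSE from their definitions (illustrative names throughout).
data(trees)
fit <- lm(Volume ~ Girth, data = trees)
n   <- nrow(trees)

sse <- sum(resid(fit)^2)                           # sum of squared errors
sst <- sum((trees$Volume - mean(trees$Volume))^2)  # total sum of squares

r2  <- 1 - sse / sst
mse <- sse / (n - 2)   # n - 2: two estimated coefficients

c(R2 = r2, MSE = mse, RSE = sqrt(mse))

# Cross-check against the values stored by summary()
c(R2 = summary(fit)$r.squared, RSE = summary(fit)$sigma)
```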

Assumptions (Plain Language)

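We will check residual plots to judge whether the model assumptions (a linear relationship, constant error variance, and normally distributed errors) are reasonable for the data.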

Example Data (From Notes)

Use the built-in trees data set to predict Volume from Girth.

# Only base R + ggplot2 + plotly as covered in notes
library(ggplot2)
library(plotly)
data(trees)
str(trees)
## 'data.frame':    31 obs. of  3 variables:
##  $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
##  $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
##  $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...

GGPlot 1: Scatter + Fitted Line (as in notes)

g <- ggplot(trees, aes(x = Girth, y = Volume)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "firebrick") +
  labs(title = "Tree Volume vs Girth",
       x = "Girth (inches)",
       y = "Volume (cubic ft)") +
  theme_bw()
g
## `geom_smooth()` using formula = 'y ~ x'

Fit the Model & Interpret (Code Slide)

mod <- lm(Volume ~ Girth, data = trees)
summary(mod)
## 
## Call:
## lm(formula = Volume ~ Girth, data = trees)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.065 -3.107  0.152  3.495  9.587 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
## Girth         5.0659     0.2474   20.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared:  0.9353, Adjusted R-squared:  0.9331 
## F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16
coef(mod)
## (Intercept)       Girth 
##  -36.943459    5.065856
# Quick predictions for the observed X values:
head(fitted(mod))
##         1         2         3         4         5         6 
##  5.103149  6.622906  7.636077 16.248033 17.261205 17.767790

GGPlot 2: Residual Plot (check linearity/variance)

res_df <- data.frame(
  Girth = trees$Girth,
  Residuals = resid(mod)
)
ggplot(res_df, aes(x = Girth, y = Residuals)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = 2) +
  labs(title = "Residuals vs Girth", x = "Girth", y = "Residuals") +
  theme_minimal()
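
The residual plot above speaks to linearity and constant variance. As a complementary sketch, a normal Q-Q plot of the residuals gives a rough visual check of the normal-error assumption in the model. This uses base R graphics only; fit and res are illustrative names, and the plot is a suggestion rather than part of the original notes.

```r
# Sketch: normal Q-Q plot of the residuals (base R graphics).
data(trees)
fit <- lm(Volume ~ Girth, data = trees)
res <- resid(fit)

# Points close to the reference line suggest roughly normal residuals
qqnorm(res, main = "Normal Q-Q Plot of Residuals")
qqline(res, lty = 2)  # line through the first and third quartiles
```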

Plotly: Interactive 3D (From ioslides & Plotly notes)

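To satisfy the plotly requirement, we include a simple interactive 3D scatter plot. It uses the built-in mtcars data for illustration and is separate from the trees model.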

# 3D scatter of mtcars: mpg vs. wt & hp (illustrative)
xax <- list(title = "wt")
yax <- list(title = "hp")
zax <- list(title = "mpg")

plot_ly(mtcars, x = ~wt, y = ~hp, z = ~mpg,
        type = "scatter3d", mode = "markers",
        color = ~as.factor(cyl)) %>%
  layout(title = "mpg vs (wt, hp) — plotly 3D",
         scene = list(xaxis = xax, yaxis = yax, zaxis = zax))

Inference Quick View (Math)

A \(100(1-\alpha)\%\) confidence interval for the slope \(\beta_1\): \[ \hat{\beta}_1 \pm t_{\alpha/2,\,n-2}\cdot \text{SE}(\hat{\beta}_1). \]
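
A minimal sketch of this interval for the trees model, built by hand from the estimate, its standard error, and the t quantile, then compared with confint(). The 95% level (alpha = 0.05) and the names fit, est, se, tcrit, and ci_manual are assumptions made for illustration.

```r
# Sketch: confidence interval for the slope, by hand and via confint().
data(trees)
fit <- lm(Volume ~ Girth, data = trees)

alpha <- 0.05                      # assumed level: 95% interval
n     <- nrow(trees)
est   <- coef(summary(fit))["Girth", "Estimate"]
se    <- coef(summary(fit))["Girth", "Std. Error"]
tcrit <- qt(1 - alpha / 2, df = n - 2)   # t quantile with n - 2 df

ci_manual <- c(lower = est - tcrit * se, upper = est + tcrit * se)
ci_manual

# Built-in equivalent
confint(fit, "Girth", level = 1 - alpha)
```

With the estimate 5.0659 and standard error 0.2474 from the summary output, the interval comes out at roughly 4.56 to 5.57.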

For a given \(x_0\), the point prediction for \(Y_0\) is: \[ \hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0. \]
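
A short sketch of this prediction in R, first from the formula and then with predict(). The value x0 = 12 inches is a hypothetical girth chosen only for illustration (it lies inside the observed Girth range); fit, new_tree, and y0_manual are illustrative names.

```r
# Sketch: point prediction at a new x value, by hand and via predict().
data(trees)
fit <- lm(Volume ~ Girth, data = trees)

x0 <- 12  # hypothetical girth in inches, for illustration only

# Direct use of the formula: beta0_hat + beta1_hat * x0
y0_manual <- unname(coef(fit)[1] + coef(fit)[2] * x0)
y0_manual

# predict() needs a data frame whose column name matches the predictor
new_tree <- data.frame(Girth = x0)
predict(fit, newdata = new_tree)

# Optional: 95% prediction interval for a single new tree
predict(fit, newdata = new_tree, interval = "prediction", level = 0.95)
```

The prediction interval is wider than a confidence interval for the mean response because it also accounts for the error term \(\varepsilon\) of an individual observation.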

Clean Reporting Checklist

Appendix: Minimal Reproducible Code (Copy/Paste)

# Data
data(trees)

# Model
mod <- lm(Volume ~ Girth, data = trees)

# Fitted values and residuals
head(fitted(mod)); head(resid(mod))
##         1         2         3         4         5         6 
##  5.103149  6.622906  7.636077 16.248033 17.261205 17.767790
##         1         2         3         4         5         6 
## 5.1968508 3.6770939 2.5639226 0.1519667 1.5387954 1.9322098
# Two ggplots
library(ggplot2)
ggplot(trees, aes(Girth, Volume)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

res_df <- data.frame(Girth = trees$Girth, Residuals = resid(mod))
ggplot(res_df, aes(Girth, Residuals)) + geom_point() + geom_hline(yintercept = 0, linetype = 2)

# One plotly
library(plotly)
plot_ly(mtcars, x = ~wt, y = ~hp, z = ~mpg, type = "scatter3d", mode = "markers", color = ~as.factor(cyl))
