Simple Linear Regression — Concepts & Example

Muhammad Arif

Agenda

What is Simple Linear Regression (SLR)?
Model, assumptions, and estimation
Interpreting coefficients and \(R^2\)
Example with trees dataset (base R)
Two ggplot visuals (scatter + fitted line, residual plot)
One plotly interactive (3D example)
How to present results clearly

What is SLR? (Math)

We model a numerical response \(Y\) with a single predictor \(X\): \[ Y = \beta_0 + \beta_1 X + \varepsilon, \quad \varepsilon \sim N(0,\sigma^2). \]

\(\beta_0\): intercept (expected \(Y\) when \(X=0\))
\(\beta_1\): slope (change in \(Y\) per one-unit change in \(X\))
\(\varepsilon\): random error, mean 0, constant variance
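
To make the model concrete, here is a minimal simulation sketch in base R. The parameter values and the names beta0_true, beta1_true, sigma_true, and sim_fit are hypothetical, chosen only for illustration; they do not come from the notes.

```r
# Simulate from Y = beta0 + beta1 * X + eps, with eps ~ N(0, sigma^2).
# All "true" values below are hypothetical choices for illustration.
set.seed(1)
n          <- 1000
beta0_true <- 2
beta1_true <- 0.5
sigma_true <- 1

x   <- runif(n, min = 0, max = 10)
eps <- rnorm(n, mean = 0, sd = sigma_true)
y   <- beta0_true + beta1_true * x + eps

# Fit SLR to the simulated data; estimates should land near the true values
sim_fit <- lm(y ~ x)
coef(sim_fit)
sigma(sim_fit)  # residual standard error, should be near sigma_true
```

With a large n the estimates should sit close to the chosen values, which is one way to see what the intercept, slope, and error term each contribute.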

Estimation (Math)

Ordinary Least Squares (OLS) chooses \(\hat{\beta}_0,\hat{\beta}_1\) to minimize the sum of squared errors: \[ \text{SSE} = \sum_{i=1}^n \bigl(y_i - \hat{y}_i\bigr)^2 = \sum_{i=1}^n \bigl(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\bigr)^2. \]
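
With a single predictor, the SSE minimizer has a closed form: \[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. \] A minimal sketch on the built-in trees data, checking the formulas against lm(). The names x, y, b0, and b1 are ours, not from the notes.

```r
data(trees)
x <- trees$Girth
y <- trees$Volume

# Closed-form OLS estimates for one predictor
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# The hand-computed values should match what lm() reports
c(b0 = b0, b1 = b1)
lm_coef <- coef(lm(Volume ~ Girth, data = trees))
lm_coef
```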

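Model fit quality is summarized by: \[ R^2 = 1 - \frac{\text{SSE}}{\text{SST}}, \quad \text{SST} = \sum_{i=1}^n \bigl(y_i - \bar{y}\bigr)^2, \quad \text{MSE}=\frac{\text{SSE}}{n-2}. \] Here SST is the total sum of squares around the mean \(\bar{y}\), and MSE estimates the error variance \(\sigma^2\).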
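
These quantities can be computed directly from residuals. A short sketch on the built-in trees data, assuming the Volume ~ Girth model used in the example; the names sse, sst, r2, and mse are ours.

```r
data(trees)
mod <- lm(Volume ~ Girth, data = trees)

sse <- sum(resid(mod)^2)                           # sum of squared errors
sst <- sum((trees$Volume - mean(trees$Volume))^2)  # total sum of squares
n   <- nrow(trees)

r2  <- 1 - sse / sst
mse <- sse / (n - 2)

# sqrt(MSE) is the "Residual standard error" printed by summary()
c(R2 = r2, MSE = mse, RSE = sqrt(mse))
```

The values should agree with summary(mod)$r.squared and summary(mod)$sigma.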

Assumptions (Plain Language)

Linearity: mean of \(Y\) changes linearly with \(X\)
Independence of observations
Constant variance (homoscedasticity)
Normality of errors (for inference)

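We will use a residual plot to check linearity and constant variance. Independence depends on how the data were collected and cannot be read from that plot.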

Example Data (From Notes)

Use the built-in trees data set to predict Volume from Girth.

Girth (inches), Volume (cubic feet)
We’ll fit: Volume ~ Girth

# Only base R + ggplot2 + plotly as covered in notes
library(ggplot2)
library(plotly)
data(trees)
str(trees)

## 'data.frame':    31 obs. of  3 variables:
##  $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
##  $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
##  $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...

GGPlot 1: Scatter + Fitted Line (as in notes)

g <- ggplot(trees, aes(x = Girth, y = Volume)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "firebrick") +
  labs(title = "Tree Volume vs Girth",
       x = "Girth (inches)",
       y = "Volume (cubic ft)") +
  theme_bw()
g

## `geom_smooth()` using formula = 'y ~ x'

Fit the Model & Interpret (Code Slide)

mod <- lm(Volume ~ Girth, data = trees)
summary(mod)

## 
## Call:
## lm(formula = Volume ~ Girth, data = trees)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.065 -3.107  0.152  3.495  9.587 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
## Girth         5.0659     0.2474   20.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared:  0.9353, Adjusted R-squared:  0.9331 
## F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16

coef(mod)

## (Intercept)       Girth 
##  -36.943459    5.065856

# Fitted values for the first six observations:
head(fitted(mod))

##         1         2         3         4         5         6 
##  5.103149  6.622906  7.636077 16.248033 17.261205 17.767790

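Slope \(\hat{\beta}_1 \approx 5.07\): expected Volume rises by about 5.07 cubic ft for each additional inch of Girth
Intercept \(\hat{\beta}_0 \approx -36.94\): expected Volume at Girth \(=0\); this is far below the smallest observed Girth (8.3 inches) and a negative volume is not physically meaningful, so do not interpret it literally
The slope's p-value (\(< 2\times 10^{-16}\)) indicates strong evidence of a linear association; \(R^2 = 0.9353\) means Girth explains about 94% of the variation in Volume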
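
The numbers needed for this interpretation can be pulled out of the summary object instead of being read off the printout. A minimal sketch; the names s, slope_est, slope_p, and r2 are ours.

```r
data(trees)
mod <- lm(Volume ~ Girth, data = trees)
s   <- summary(mod)

# The coefficient table is a matrix indexed by term and column name
slope_est <- s$coefficients["Girth", "Estimate"]
slope_p   <- s$coefficients["Girth", "Pr(>|t|)"]
r2        <- s$r.squared

cat(sprintf("Slope: %.2f cubic ft per inch of girth\n", slope_est))
cat(sprintf("p-value for slope: %.3g\n", slope_p))
cat(sprintf("R-squared: %.3f\n", r2))
```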

GGPlot 2: Residual Plot (check linearity/variance)

res_df <- data.frame(
  Girth = trees$Girth,
  Residuals = resid(mod)
)
ggplot(res_df, aes(x = Girth, y = Residuals)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = 2) +
  labs(title = "Residuals vs Girth", x = "Girth", y = "Residuals") +
  theme_minimal()

Residuals should be centered around 0 with no clear pattern
Spread should be roughly constant across Girth
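
The residual plot above addresses linearity and constant variance; normality of errors can be checked separately. A minimal base R sketch, assuming a normal Q-Q plot is an acceptable diagnostic here (it is not one of the plots from the notes); the names res and qq are ours.

```r
data(trees)
mod <- lm(Volume ~ Girth, data = trees)
res <- resid(mod)

# Normal Q-Q plot: points close to the reference line support normality
qq <- qqnorm(res, main = "Normal Q-Q Plot of Residuals")
qqline(res, lty = 2)
```

Marked departures from the dashed line, especially in the tails, would suggest treating the t-based inference with some caution.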

Plotly: Interactive 3D (From ioslides & Plotly notes)

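We include a simple 3D plot to satisfy the interactive-plot requirement. It uses the built-in mtcars data rather than trees and shows two predictors, so it is illustrative only and not part of the SLR fit.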

# 3D scatter of mtcars: mpg vs. wt & hp (illustrative)
xax <- list(title = "wt")
yax <- list(title = "hp")
zax <- list(title = "mpg")

plot_ly(mtcars, x = ~wt, y = ~hp, z = ~mpg,
        type = "scatter3d", mode = "markers",
        color = ~as.factor(cyl)) %>%
  layout(title = "mpg vs (wt, hp) — plotly 3D",
         scene = list(xaxis = xax, yaxis = yax, zaxis = zax))

Inference Quick View (Math)

A \(100(1-\alpha)\%\) confidence interval for the slope \(\beta_1\): \[ \hat{\beta}_1 \pm t_{\alpha/2,\,n-2}\cdot \text{SE}(\hat{\beta}_1). \]
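
A short sketch of this interval for the trees model, computed by hand and then with confint(). It assumes a 95% level (\(\alpha = 0.05\)); the names est, se, tcrit, ci_manual, and ci_builtin are ours.

```r
data(trees)
mod <- lm(Volume ~ Girth, data = trees)

est   <- coef(summary(mod))["Girth", "Estimate"]
se    <- coef(summary(mod))["Girth", "Std. Error"]
tcrit <- qt(0.975, df = df.residual(mod))  # t quantile with n - 2 = 29 df

# Interval from the formula
ci_manual <- c(lower = est - tcrit * se, upper = est + tcrit * se)
ci_manual

# Same interval from the built-in helper
ci_builtin <- confint(mod, "Girth", level = 0.95)
ci_builtin
```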

For a given \(x_0\), the point prediction of \(Y_0\) is: \[ \hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0. \]
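
In R, predict() evaluates this formula and can also attach interval estimates. A minimal sketch; the girth of 12 inches is a hypothetical \(x_0\) picked from inside the observed range, and the names new_tree, y0_hat, and y0_pi are ours.

```r
data(trees)
mod <- lm(Volume ~ Girth, data = trees)

# Hypothetical new tree with a girth inside the observed range
new_tree <- data.frame(Girth = 12)

# Point prediction: beta0_hat + beta1_hat * 12
y0_hat <- predict(mod, newdata = new_tree)
y0_hat

# Interval for the mean response at x0, and for a single new observation
predict(mod, newdata = new_tree, interval = "confidence")
y0_pi <- predict(mod, newdata = new_tree, interval = "prediction")
y0_pi
```

The prediction interval is wider than the confidence interval because it also accounts for the error \(\varepsilon\) of an individual tree.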

Clean Reporting Checklist

State the model and variables clearly
Show scatter + fitted line (with axis units)
Provide slope interpretation and \(R^2\)
Add a residual plot for diagnostics
Keep code short and reproducible

Appendix: Minimal Reproducible Code (Copy/Paste)

# Data
data(trees)

# Model
mod <- lm(Volume ~ Girth, data = trees)

# Fitted values and residuals
head(fitted(mod)); head(resid(mod))

##         1         2         3         4         5         6 
##  5.103149  6.622906  7.636077 16.248033 17.261205 17.767790

##         1         2         3         4         5         6 
## 5.1968508 3.6770939 2.5639226 0.1519667 1.5387954 1.9322098

# Two ggplots
library(ggplot2)
ggplot(trees, aes(Girth, Volume)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula = 'y ~ x'

res_df <- data.frame(Girth = trees$Girth, Residuals = resid(mod))
ggplot(res_df, aes(Girth, Residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2)

# One plotly
library(plotly)
plot_ly(mtcars, x = ~wt, y = ~hp, z = ~mpg,
        type = "scatter3d", mode = "markers",
        color = ~as.factor(cyl))

References (From Course Notes)

ggplot2 grammar and examples (scatter, smooth, theme)
plotly interactive 3D examples
SLR formulas for OLS, \(R^2\), and diagnostics