roadmap

  • SLR idea & notation
  • Example: trees (Girth → Volume)
  • Inference (CIs / tests)
  • Diagnostics (residual checks)
  • Plotly 3D demo
  • Reproducible workflow

what is simple linear regression?

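We model a response \(Y\) using a single predictor \(X\):

\[ Y_i=\beta_0+\beta_1 X_i+\varepsilon_i,\quad \varepsilon_i\stackrel{iid}{\sim}N(0,\sigma^2),\quad i=1,\dots,n \]

Here \(\beta_0\) is the intercept, \(\beta_1\) is the change in the mean of \(Y\) per unit increase in \(X\), and \(\sigma\) is the error standard deviation.

Assumptions: linearity, constant variance, independence, (approx.) normal errors.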
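To make the notation concrete, here is a minimal simulation sketch. The parameter values (intercept 2, slope 0.5, error SD 1), the sample size, the predictor range, and the seed are arbitrary illustrative choices, not values from these slides.

```r
# Simulate from Y_i = beta0 + beta1 * X_i + eps_i with iid normal errors.
# Every numeric setting below is an illustrative assumption.
set.seed(1)
n         <- 200
beta0     <- 2
beta1     <- 0.5
sigma_err <- 1

x   <- runif(n, min = 0, max = 10)          # predictor values
eps <- rnorm(n, mean = 0, sd = sigma_err)   # iid N(0, sigma^2) errors
y   <- beta0 + beta1 * x + eps              # response

sim_fit <- lm(y ~ x)
coef(sim_fit)    # estimates should land near beta0 and beta1
sigma(sim_fit)   # residual standard error, an estimate of sigma_err
```

Rerunning with a different seed gives slightly different estimates; that sampling variability is what the standard errors on the inference slide quantify.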

data peek: trees

##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7
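
The listing above matches what `head(trees)` prints for R's built-in `trees` data set; that the slide used exactly this call is an assumption. According to the data set's R documentation, Girth is a tree diameter in inches, Height is in feet, and Volume is in cubic feet. A short sketch for inspecting the data:

```r
# 'trees' ships with base R (package datasets): 31 felled black cherry trees.
data(trees)

head(trees)      # first six rows, as in the listing above
dim(trees)       # number of rows and columns
str(trees)       # all three columns are numeric
summary(trees)   # five-number summaries plus means
```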

ggplot: scatter + fitted line
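
The figure for this slide did not survive in the text. The sketch below would draw a comparable plot with ggplot2, assuming the package is installed; the axis labels and the object name `p` are illustrative choices rather than the slide's own.

```r
library(ggplot2)

# Scatter of Volume against Girth with the least-squares line.
# se = TRUE adds a pointwise 95% confidence band for the fitted mean.
p <- ggplot(trees, aes(x = Girth, y = Volume)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Girth (in)", y = "Volume (cubic ft)")

print(p)
```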

inference (closed-form & tests)

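Least-squares estimates:

\[ \hat\beta_1 = \frac{\sum (x_i-\bar x)(y_i-\bar y)}{\sum (x_i-\bar x)^2}, \quad \hat\beta_0 = \bar y - \hat\beta_1 \bar x \]

Standard error and test statistic, with \(SSE=\sum (y_i-\hat\beta_0-\hat\beta_1 x_i)^2\); the \(t_{n-2}\) reference distribution holds under \(H_0:\beta_1=0\):

\[ \widehat{SE}(\hat\beta_1) = \frac{\hat\sigma}{\sqrt{\sum (x_i-\bar x)^2}}, \quad \hat\sigma = \sqrt{\frac{SSE}{n-2}}, \quad t = \frac{\hat\beta_1}{\widehat{SE}(\hat\beta_1)} \sim t_{n-2} \]

\(100(1-\alpha)\%\) confidence interval for the slope:

\[ \hat\beta_1 \pm t_{n-2,\,1-\alpha/2}\cdot \widehat{SE}(\hat\beta_1) \]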
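
The formulas above can be checked by hand against `lm()`. This is a sketch on the `trees` data; the variable names (`sxx`, `b0`, `b1`, and so on) are chosen here for illustration.

```r
# Closed-form SLR quantities for Volume ~ Girth.
x <- trees$Girth
y <- trees$Volume
n <- length(x)

sxx <- sum((x - mean(x))^2)
b1  <- sum((x - mean(x)) * (y - mean(y))) / sxx   # slope estimate
b0  <- mean(y) - b1 * mean(x)                     # intercept estimate

sse       <- sum((y - (b0 + b1 * x))^2)           # residual sum of squares
sigma_hat <- sqrt(sse / (n - 2))                  # residual standard error
se_b1     <- sigma_hat / sqrt(sxx)                # standard error of the slope
t_stat    <- b1 / se_b1                           # statistic for H0: beta1 = 0
p_value   <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

alpha <- 0.05
ci_b1 <- b1 + c(-1, 1) * qt(1 - alpha / 2, df = n - 2) * se_b1

# Cross-check against lm(); both comparisons should be TRUE.
fit <- lm(Volume ~ Girth, data = trees)
all.equal(unname(coef(fit)), c(b0, b1))
all.equal(unname(confint(fit)["Girth", ]), ci_b1)
```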

model summary

## 
## Call:
## lm(formula = Volume ~ Girth, data = trees)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.065 -3.107  0.152  3.495  9.587 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
## Girth         5.0659     0.2474   20.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared:  0.9353, Adjusted R-squared:  0.9331 
## F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.935         0.933  4.25      419. 8.64e-19     1  -87.8  182.  186.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
## # A tibble: 2 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   -36.9      3.37      -11.0 7.62e-12   -43.8     -30.1 
## 2 Girth           5.07     0.247      20.5 8.64e-19     4.56      5.57
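
The output above looks like `summary()` on an `lm` fit followed by two tibbles in the format of `broom::glance()` and `broom::tidy(conf.int = TRUE)`. That the slides used broom is an inference from the column names, not something the text states. A sketch that would reproduce it:

```r
mod <- lm(Volume ~ Girth, data = trees)

summary(mod)                 # coefficient table, residual SE, R^2, F-test
confint(mod, level = 0.95)   # base-R confidence intervals for both coefficients

# Optional tidy summaries; skipped when broom is not installed.
if (requireNamespace("broom", quietly = TRUE)) {
  print(broom::glance(mod))                  # one-row model-level summary
  print(broom::tidy(mod, conf.int = TRUE))   # one row per coefficient, with CIs
}
```

With a single predictor, the F-statistic is the square of the slope's t-statistic (20.48² ≈ 419.4), so the two tests give the same p-value.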

diagnostics (residuals & Q–Q)
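
The diagnostic plots for this slide are missing from the text. A base-R sketch of the two checks named in the title, assuming the same `Volume ~ Girth` fit as in the model summary:

```r
mod      <- lm(Volume ~ Girth, data = trees)
res      <- resid(mod)
fit_vals <- fitted(mod)

op <- par(mfrow = c(1, 2))   # two panels side by side

# Residuals vs fitted: look for curvature or a funnel shape.
plot(fit_vals, res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line support approximate normality.
qqnorm(res, main = "Normal Q-Q")
qqline(res)

par(op)   # restore the previous graphics settings
```

Curvature in the left panel would call the linearity assumption into question, and a funnel shape would suggest non-constant variance.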

plotly (3D) intuition: mpg vs (wt, hp)
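
The interactive widget for this slide is not in the text. The variable names suggest the built-in `mtcars` data, which is an assumption here. With two predictors the fitted line becomes a fitted plane, so this goes one step beyond simple linear regression. A sketch of how such a view could be built; the grid size, opacity, and object names are illustrative, and the plotly step runs only if the package is installed.

```r
# Two-predictor fit on mtcars (assumed source of mpg, wt, hp).
fit3d <- lm(mpg ~ wt + hp, data = mtcars)

# Grid over the observed predictor ranges for drawing the fitted plane.
wt_seq <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 25)
hp_seq <- seq(min(mtcars$hp), max(mtcars$hp), length.out = 25)
grid   <- expand.grid(wt = wt_seq, hp = hp_seq)

# expand.grid varies wt fastest, so filling by row puts hp on rows and wt on
# columns, the orientation plotly surfaces expect (rows = y, columns = x).
z_mat <- matrix(predict(fit3d, newdata = grid),
                nrow = length(hp_seq), ncol = length(wt_seq), byrow = TRUE)

if (requireNamespace("plotly", quietly = TRUE)) {
  p3d <- plotly::plot_ly()
  p3d <- plotly::add_markers(p3d, data = mtcars, x = ~wt, y = ~hp, z = ~mpg)
  p3d <- plotly::add_surface(p3d, x = wt_seq, y = hp_seq, z = z_mat,
                             opacity = 0.5)
  print(p3d)
}
```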

reproducible workflow (plain text)

  • Fit: mod <- lm(Y ~ X, data=...)
  • Summarize: summary(mod) (\(\hat\beta\), SEs, p-values, \(R^2\))
  • Visualize: geom_point() + geom_smooth(method="lm")
  • Diagnose: residuals vs fitted + Q–Q plot
  • Communicate: CIs and slope interpretation in words
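
For the last step, the interpretation sentence can be generated from the fit so the numbers never drift from the model. A sketch for the `trees` example; the wording of the sentence and the name `msg` are illustrative choices.

```r
mod <- lm(Volume ~ Girth, data = trees)

slope <- coef(mod)[["Girth"]]
ci    <- confint(mod, "Girth", level = 0.95)   # 1 x 2 matrix: lower, upper

# Build a sentence a non-statistician could read.
msg <- sprintf(
  paste("Each additional inch of girth is associated with an estimated",
        "%.2f cubic feet more volume on average (95%% CI: %.2f to %.2f)."),
  slope, ci[1, 1], ci[1, 2]
)
cat(msg, "\n")
```

With the estimates shown on the model summary slide, the sentence reports a slope of about 5.07 with an interval from about 4.56 to 5.57.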