Overview

  • Topic: Simple Linear Regression
  • What you’ll see: intuition, math, examples, plots (ggplot + plotly), R code

What is Simple Linear Regression?

Simple linear regression models a response variable \(Y\) as a linear function of a single predictor \(X\) plus random error:

\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad \varepsilon_i \sim N(0,\sigma^2) \]

(assumptions shown on the next slide)

Assumptions (LaTeX)

  1. Linearity: \(E[Y|X]=\beta_0+\beta_1X\).
  2. Independence of errors: \(\text{Cov}(\varepsilon_i,\varepsilon_j)=0\) for \(i \neq j\).
  3. Homoscedasticity: \(\text{Var}(\varepsilon_i)=\sigma^2\).
  4. Normality of errors: \(\varepsilon_i\overset{iid}{\sim}N(0,\sigma^2)\).
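
To make the assumptions concrete, here is a minimal sketch that simulates data satisfying all four and then fits the model. The true values (beta0 = 1, beta1 = 2, sigma = 1), the sample size, and the range of x are hypothetical choices for illustration only.

```r
set.seed(1)                        # fixed seed so the sketch is reproducible

n     <- 200                       # hypothetical sample size
beta0 <- 1                         # hypothetical true intercept
beta1 <- 2                         # hypothetical true slope
sigma <- 1                         # hypothetical error standard deviation

x   <- runif(n, min = 0, max = 10)      # predictor values
eps <- rnorm(n, mean = 0, sd = sigma)   # iid normal errors with constant variance
y   <- beta0 + beta1 * x + eps          # linear mean function plus error

sim_fit <- lm(y ~ x)
coef(sim_fit)                      # estimates should land near beta0 and beta1
```

Because the data are generated exactly as the model states, the fitted coefficients should be close to the values used in the simulation.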

Estimation: least squares (math)

The ordinary least squares (OLS) estimates minimize the residual sum of squares \(\sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2\), which gives the closed-form solutions:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. \]
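
As a quick check, the closed-form estimates can be computed by hand and compared with lm(). This sketch uses mtcars with wt as \(x\) and mpg as \(y\); the variable names b0 and b1 are illustrative.

```r
x <- mtcars$wt    # predictor: weight
y <- mtcars$mpg   # response: miles per gallon

# Slope: sum of cross-deviations divided by sum of squared x-deviations
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: forces the fitted line through (x-bar, y-bar)
b0 <- mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)
coef(lm(mpg ~ wt, data = mtcars))   # should match the hand-computed values
```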

Example dataset: mtcars

We’ll use the built-in mtcars dataset to predict mpg from wt (weight).

ggplot example: scatter + regression
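
The code behind this plot is not shown in the slides. One way to draw a scatter plot with a fitted least-squares line, assuming the ggplot2 package is installed, is sketched below; the object name p_scatter and the axis labels are illustrative choices.

```r
library(ggplot2)

# Scatter of mpg against wt with the least-squares line and its confidence band
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

p_scatter
```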

ggplot example 2: boxplot by cylinders
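
The original plotting code is not shown here either. A minimal sketch of a boxplot of mpg grouped by number of cylinders, again assuming ggplot2 is installed, might look like this; cyl is converted to a factor so each cylinder count gets its own box.

```r
library(ggplot2)

# One box per cylinder count (4, 6, 8)
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  labs(x = "Number of cylinders", y = "Miles per gallon")

p_box
```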

plotly example: 3D scatter (interactive)

## Warning: `line.width` does not currently support multiple values.

Fitting the model (R code shown)

# Fit simple linear regression: mpg ~ wt
fit <- lm(mpg ~ wt, data=mtcars)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
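
The printed quantities can also be pulled out of the summary object for further use. The sketch below refits the model so it runs on its own, and checks a fact specific to simple linear regression: R-squared equals the squared sample correlation between wt and mpg.

```r
fit <- lm(mpg ~ wt, data = mtcars)
s   <- summary(fit)

s$coefficients   # matrix of estimates, standard errors, t values, p-values
s$sigma          # residual standard error
s$r.squared      # multiple R-squared

# With a single predictor, R-squared is the squared correlation of x and y
cor(mtcars$wt, mtcars$mpg)^2
```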

Diagnostics: Residuals vs Fitted
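
The slide's plotting code is not shown. One simple way to produce this diagnostic is base R's plot method for lm objects, where which = 1 selects the residuals-versus-fitted panel. A patternless band around zero supports linearity and constant variance; a curve or funnel shape suggests a violation.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# which = 1 selects the "Residuals vs Fitted" diagnostic
plot(fit, which = 1)
```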

Diagnostics: Q-Q Plot of Residuals
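
Again as a sketch rather than the slide's own code, a normal Q-Q plot of the residuals can be drawn with base R. Points close to the reference line are consistent with normally distributed errors.

```r
fit <- lm(mpg ~ wt, data = mtcars)
res <- resid(fit)

qqnorm(res)   # sample quantiles of the residuals vs. theoretical normal quantiles
qqline(res)   # reference line through the first and third quartiles
```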

Inference: Confidence interval for slope (math)

Under the assumptions above, a \(100(1-\alpha)\%\) confidence interval for the slope is:

\[ \hat{\beta}_1 \pm t_{n-2, 1-\alpha/2} \cdot SE(\hat{\beta}_1), \]

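where \(SE(\hat{\beta}_1)=\sqrt{\widehat{\sigma}^2/\sum_{i=1}^n (x_i-\bar{x})^2}\) and \(\widehat{\sigma}^2=\frac{1}{n-2}\sum_{i=1}^n (y_i-\hat{y}_i)^2\) is the residual variance estimate.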
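
The interval can be computed step by step and compared with R's built-in confint(). This sketch uses \(\alpha = 0.05\) and the mtcars model mpg ~ wt; the intermediate variable names are illustrative.

```r
fit <- lm(mpg ~ wt, data = mtcars)
x   <- mtcars$wt
n   <- nrow(mtcars)

sigma2_hat <- sum(resid(fit)^2) / (n - 2)               # residual variance estimate
se_b1      <- sqrt(sigma2_hat / sum((x - mean(x))^2))   # standard error of the slope
t_crit     <- qt(1 - 0.05 / 2, df = n - 2)              # t quantile for a 95% interval
b1         <- coef(fit)[["wt"]]

ci_manual <- c(lower = b1 - t_crit * se_b1, upper = b1 + t_crit * se_b1)
ci_manual
confint(fit, "wt", level = 0.95)   # should agree with ci_manual
```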

Prediction example: point & interval (R code)

new <- data.frame(wt = c(2.0, 3.0))
predict(fit, new, interval='confidence')  # interval for the mean mpg at each wt
##        fit      lwr      upr
## 1 26.59618 24.82389 28.36848
## 2 21.25171 20.12444 22.37899
predict(fit, new, interval='prediction')  # interval for an individual car's mpg at each wt
##        fit      lwr      upr
## 1 26.59618 20.12811 33.06425
## 2 21.25171 14.92987 27.57355
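
The prediction interval is wider because it accounts for the scatter of an individual observation around the mean, not only the uncertainty in the fitted line. With \(S_{xx}=\sum (x_i-\bar{x})^2\), the standard error at a new point \(x_0\) is \(\widehat{\sigma}\sqrt{1/n + (x_0-\bar{x})^2/S_{xx}}\) for the mean response and \(\widehat{\sigma}\sqrt{1 + 1/n + (x_0-\bar{x})^2/S_{xx}}\) for an individual response. A sketch that reproduces both intervals by hand (variable names are illustrative):

```r
fit <- lm(mpg ~ wt, data = mtcars)
x   <- mtcars$wt
n   <- length(x)
x0  <- c(2.0, 3.0)                       # same new weights as above

sigma_hat <- summary(fit)$sigma          # residual standard error
sxx       <- sum((x - mean(x))^2)
y0_hat    <- coef(fit)[["(Intercept)"]] + coef(fit)[["wt"]] * x0
t_crit    <- qt(0.975, df = n - 2)       # 95% intervals

se_mean <- sigma_hat * sqrt(1 / n + (x0 - mean(x))^2 / sxx)       # mean response
se_pred <- sigma_hat * sqrt(1 + 1 / n + (x0 - mean(x))^2 / sxx)   # individual response

manual <- cbind(
  fit      = y0_hat,
  conf_lwr = y0_hat - t_crit * se_mean,
  conf_upr = y0_hat + t_crit * se_mean,
  pred_lwr = y0_hat - t_crit * se_pred,
  pred_upr = y0_hat + t_crit * se_pred
)
manual   # columns should match the predict() output above
```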

Interpretation and takeaways

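  • Negative slope: each additional 1,000 lbs of weight is associated with about 5.3 fewer mpg on average, and weight explains roughly 75% of the variation in mpg.
  • Check the assumptions with a residuals-vs-fitted plot and a Q-Q plot before trusting the inference.
  • Use confidence intervals to quantify uncertainty in the slope and the mean response; use prediction intervals for an individual car's mpg.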