2025-04-13

Simple Linear Regression

Simple linear regression is a statistical tool for modeling the relationship between two variables. Its objective is “to predict the value of an output variable (or response) based on the value of an input (or predictor) variable” (Simple Linear Regression. JMP Statistical Discovery LLC, 2025. https://www.jmp.com/en/statistics-knowledge-portal/what-is-regression).

Why Use Simple Linear Regression?

  • It helps explain how the values of a response variable change with a predictor variable.
  • The fitted line minimizes the squared prediction error on the observed data, which gives a principled basis for predicting future values.

Mathematics of Simple Linear Regression

Simple linear regression models the relationship between a dependent variable \(y\) and an independent variable \(x\) using the following equation:

\(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\); \(\epsilon_i \sim N(0,\sigma^{2})\)

where \(\beta_0\) and \(\beta_1\) are the y-intercept and slope of the regression line, respectively, and the \(\epsilon_i\) are noise (error) terms drawn from a normal distribution with mean 0 and variance \(\sigma^2\) (Definition: Simple linear regression. The Book of Statistical Proofs. https://statproofbook.github.io/D/slr.html).

In practice, the true values of \(\beta_0\), \(\beta_1\), and \(\epsilon_i\) are unknown, so we use the fitted line \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\), where \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are estimates computed from the data.
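
Assuming the estimates are obtained by ordinary least squares (the method used by R's lm function), they have the following closed form, where \(\bar{x}\) and \(\bar{y}\) denote the sample means of the predictor and the response:

\(\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\); \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\)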

Mathematics of Simple Linear Regression (cont’d)

We use sums of squares to measure variability: how far the observed responses deviate from their mean and from the predicted values.

SST = \(\sum_{i=1}^{n}(y_i - \bar{y})^2\) : The total sum of squares, the sum of the squared differences between the observed values of \(y\) and the mean of \(y\).

SSR = \(\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2\) : The regression sum of squares, the sum of the squared differences between the predicted values of \(y\) and the mean of \(y\).

SSE = \(\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\) : The error sum of squares, the sum of the squared differences between the observed values of \(y\) and the predicted values of \(y\).

Since SST = SSR + SSE and SST is fixed for a given dataset, minimizing the SSE term yields the line that explains the largest possible share of the variability in \(y\).

(Reference: Sum of Squares: SST, SSR, SSE. 365 Data Science, 2025. https://365datascience.com/tutorials/statistics-tutorials/sum-squares/)
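
The decomposition can be checked numerically. The sketch below is illustrative only: it assumes R's built-in mtcars data and a qsec ~ wt model, and the variable names (fit, sst, ssr, sse, r_squared) are chosen here for the example.

```r
# Sketch: verify SST = SSR + SSE for a least-squares fit.
# Assumption: built-in mtcars data and the model qsec ~ wt, used for illustration.
data("mtcars")
fit <- lm(qsec ~ wt, data = mtcars)

y     <- mtcars$qsec
y_hat <- fitted(fit)  # predicted values
y_bar <- mean(y)      # mean of the observed response

sst <- sum((y - y_bar)^2)      # total variability
ssr <- sum((y_hat - y_bar)^2)  # variability explained by the model
sse <- sum((y - y_hat)^2)      # unexplained (residual) variability

# The decomposition holds up to floating-point error
all.equal(sst, ssr + sse)

# Share of the total variability explained by the model (R-squared)
r_squared <- ssr / sst
```

The ratio SSR / SST is the R-squared value that summary(fit) reports, so it is a convenient scale-free measure of fit.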

Linear Modeling Example

For this presentation, I will build linear regression models using R's built-in mtcars dataset. mod_1 will model the relationship between quarter-mile time (qsec) and weight (wt). mod_2 will model the relationship between gross horsepower (hp) and displacement (disp).

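# Load the built-in dataset and fit both models (response ~ predictor)
data("mtcars")
mod_1 = lm(qsec~wt, data = mtcars)  # quarter-mile time vs. weight
mod_2 = lm(hp~disp, data = mtcars)  # horsepower vs. displacement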

For mod_1, the estimates are \(\hat{\beta}_0\) = 18.88 and \(\hat{\beta}_1\) = -0.32. For mod_2, \(\hat{\beta}_0\) = 45.73 and \(\hat{\beta}_1\) = 0.44.

The signs of the estimated slopes imply a negative correlation between qsec and wt and a positive correlation between hp and disp.
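
The coefficients reported for mod_1 can be reproduced by hand. This is a minimal sketch assuming ordinary least squares, which is what lm fits; the names x, y, b0, b1, fit, and manual are introduced here for illustration.

```r
# Sketch: closed-form least-squares estimates for qsec ~ wt, compared with lm().
data("mtcars")
x <- mtcars$wt
y <- mtcars$qsec

# Slope: co-variation of x and y divided by the variation of x
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept: forces the line through the point (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

# Compare the hand-computed values with the coefficients from lm()
fit <- lm(qsec ~ wt, data = mtcars)
manual <- c(b0, b1)
all.equal(manual, unname(coef(fit)))

round(manual, 2)
```

The same calculation with x = mtcars$disp and y = mtcars$hp gives the mod_2 estimates.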

Plot of qsec vs. wt

Commentary: qsec varies widely at any given wt, which indicates a weak correlation between the two variables. The color of each point reflects the number of engine cylinders.

Plot of hp vs. disp

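Commentary: There is less variability in this relationship, with hp values grouped more closely around the regression line (i.e., the SSE is a smaller share of the SST in this plot than in the previous one). The raw SSE values are not directly comparable because hp and qsec are measured on different scales.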
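
Because the two responses are on different scales, one way to compare the fits is the unexplained share SSE / SST, which equals 1 minus R-squared. The sketch below refits both models so that it runs on its own; the helper unexplained is a hypothetical name introduced for this example.

```r
# Sketch: compare the two fits on a scale-free basis.
data("mtcars")
mod_1 <- lm(qsec ~ wt, data = mtcars)
mod_2 <- lm(hp ~ disp, data = mtcars)

# Fraction of the total variability left unexplained: SSE / SST = 1 - R^2
unexplained <- function(mod) 1 - summary(mod)$r.squared

shares <- c(mod_1 = unexplained(mod_1), mod_2 = unexplained(mod_2))
shares

# Raw SSE values, shown only to illustrate that they depend on the units of y
raw_sse <- c(mod_1 = sum(resid(mod_1)^2), mod_2 = sum(resid(mod_2)^2))
raw_sse
```

A smaller unexplained share means the points sit closer to the regression line relative to the overall spread of the response.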

Plot of hp vs. disp (cont’d)

The larger the marker in this plot, the slower the car with respect to qsec time. Since there is an obvious relationship between engine displacement and the number of engine cylinders, it makes sense that the four-cylinder cars are clustered in the lower left and the eight-cylinder cars are clustered in the upper right.

Code used to produce plots

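library(ggplot2)
library(plotly)

# plot 1: quarter-mile time vs. weight, colored by number of cylinders
ggplot(data = mtcars, aes(x=wt, y=qsec, color=cyl)) +
  geom_smooth(method = "lm", formula=y~x, se=FALSE, color="red") +
  geom_point() + ylab("Quarter Mile Time") +
  xlab("Weight (1000 lbs)")
# plot 2: horsepower vs. displacement, colored by number of carburetors
ggplot(data = mtcars, aes(x=disp, y=hp, color=carb)) +
  geom_smooth(method = "lm", formula=y~x, se=FALSE, color="red") +
  geom_point() + ylab("Horsepower") +
  xlab("Displacement")
# plot 3: interactive hp vs. disp, colored by cylinders, marker size by qsec
# (uses mod_2, the hp ~ disp model fitted earlier)
fig = plot_ly(mtcars, x=~disp, y=~hp, color=~cyl, size=~qsec) %>%
  add_lines(x=~disp, y=fitted(mod_2), name="fitted") %>%
  add_markers(name="markers") %>%
  layout(margin=list(b=40,t=40,l=40,r=40,pad=4))  # pad belongs inside margin
fig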