2026-04-11

Linear Regression: The Basics

  • Simple linear regression models the linear relationship between two variables that may be related.

  • One variable is considered the input and the other the output of the model.

  • It helps us predict the value of the output variable based on the input.

  • Examples of correlated variables: Mass -> Force, Age -> Height, and many more.

What Does Linear Regression Do?

  • The model derives the line of best fit for the two variables.
  • The fitted relationship has a strength and a direction: the correlation can be strong or weak, and the direction can be positive, negative, or neither (no linear trend).

Some example questions: How does the age of a tree affect its width? What is the relationship between the initial height and total horizontal distance covered? How does the number of hours studied affect exam score?

Estimated Model

\[ \hat{y} = b_0 + b_1 x \] This is the model behind basic linear regression.

  • \(b_0\): estimated intercept
  • \(b_1\): estimated slope
  • \(\hat{y}\): predicted value

The slope is the estimated change in the output for a one-unit increase in the input. The intercept is the predicted output when the input is 0.
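
For intuition, the least-squares estimates in simple linear regression have a closed form: \(b_1 = \mathrm{cov}(x, y) / \mathrm{var}(x)\) and \(b_0 = \bar{y} - b_1 \bar{x}\). The sketch below applies these formulas to made-up, noise-free data (the vectors `x` and `y` are hypothetical, chosen so that the points lie exactly on \(y = 2 + 3x\)), so the estimates should recover that line.

```r
# Hypothetical, noise-free data lying exactly on y = 2 + 3x
x <- 1:5
y <- 2 + 3 * x

# Closed-form least-squares estimates for simple linear regression
b1 <- cov(x, y) / var(x)        # estimated slope
b0 <- mean(y) - b1 * mean(x)    # estimated intercept

# Predicted values from the estimated model
y_hat <- b0 + b1 * x

c(intercept = b0, slope = b1)
```

With real, noisy data the points will not sit exactly on the line, and the same formulas give the line that minimizes the sum of squared vertical distances.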

Example Dataset: faithful

For the sake of an example we will use the built-in data set faithful.

Variables of concern:

  • eruptions = eruption duration (minutes)
  • waiting = waiting time until next eruption (minutes)

Question to Model: How does eruption time affect waiting time?
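
A quick look at the data before modeling might be done as sketched here. `faithful` ships with base R (the datasets package), so no extra packages are assumed; the preview that follows is presumably the output of `head(faithful)`.

```r
# faithful is part of R's built-in datasets package
data(faithful)

head(faithful)             # first six rows
n_obs <- nrow(faithful)    # number of observations
n_obs
summary(faithful)          # range and quartiles of both variables
```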

##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55

Scatterplot of Data
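
A scatterplot like this one can be drawn with base R graphics, as in the minimal sketch below. The axis labels, title, and colors are illustrative choices, and the original plot may have been made with a different plotting package.

```r
# Scatterplot of waiting time against eruption duration (base R graphics)
plot(waiting ~ eruptions, data = faithful,
     xlab = "Eruption duration (minutes)",
     ylab = "Waiting time until next eruption (minutes)",
     main = "Waiting time vs. eruption duration",
     pch = 19, col = "steelblue")
```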

Regression Line

This line is the regression line, which summarizes the linear relationship between eruption duration and waiting time.
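
One way to overlay a regression line on the scatterplot in base R is sketched below. The name `fit` and the styling are assumptions made for illustration; `abline()` accepts a fitted `lm` object and draws its line.

```r
# Fit the simple linear regression (fit is an illustrative name)
fit <- lm(waiting ~ eruptions, data = faithful)

# Redraw the scatterplot, then add the fitted line on top
plot(waiting ~ eruptions, data = faithful,
     xlab = "Eruption duration (minutes)",
     ylab = "Waiting time until next eruption (minutes)",
     pch = 19, col = "grey40")
abline(fit, col = "red", lwd = 2)
```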

Building the Model

model <- lm(waiting ~ eruptions, data = faithful)
summary(model)
## 
## Call:
## lm(formula = waiting ~ eruptions, data = faithful)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0796  -4.4831   0.2122   3.9246  15.9719 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.4744     1.1549   28.98   <2e-16 ***
## eruptions    10.7296     0.3148   34.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.914 on 270 degrees of freedom
## Multiple R-squared:  0.8115, Adjusted R-squared:  0.8108 
## F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16

This fits a regression model that predicts waiting time from eruption duration.
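
The strength and direction of the relationship can be read off the fitted model. As a sketch: the sign of the correlation gives the direction, and the `Multiple R-squared` value in the summary (0.8115) gives the share of variance in waiting time explained by the model. In simple linear regression, R-squared equals the squared correlation, which the last line checks.

```r
model <- lm(waiting ~ eruptions, data = faithful)

# Direction (sign) and strength (magnitude) of the linear relationship
r <- cor(faithful$eruptions, faithful$waiting)
r

# Proportion of variance in waiting explained by the model
r_squared <- summary(model)$r.squared
r_squared

# In simple linear regression, R-squared is the squared correlation
all.equal(r^2, r_squared)
```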

Interpreting the Equation

\[ \widehat{\text{waiting}} = 33.47 + 10.73 \times \text{eruptions} \]

  • Intercept = 33.47
  • Slope = 10.73

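Interpretation:

  • For each additional minute of eruption, the predicted waiting time increases by about 10.73 minutes.
  • The intercept (33.47 minutes) is the predicted waiting time for a 0-minute eruption, which lies outside the observed data and should not be read literally.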
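
To see the equation at work, the fitted model can be used to predict waiting times for new eruption durations. In this sketch the durations of 3 and 4 minutes are hypothetical inputs; by hand, a 3-minute eruption gives roughly 33.47 + 10.73 × 3 ≈ 65.7 minutes, and the gap between the two predictions should equal the slope.

```r
model <- lm(waiting ~ eruptions, data = faithful)

# Hypothetical new eruptions lasting 3 and 4 minutes
new_eruptions <- data.frame(eruptions = c(3, 4))
pred <- predict(model, newdata = new_eruptions)
pred

# One extra minute of eruption changes the prediction by the slope
slope <- unname(coef(model)["eruptions"])
unname(pred[2] - pred[1])
slope
```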

Interactive Plot

This is interactive.

Residuals

\[ e_i = y_i - \hat{y}_i \] This is the equation for a residual: the observed value minus the predicted value for observation \(i\). It measures the vertical distance between a point and the regression line. A good model has residuals that sit close to 0, meaning the equation predicts the data well.
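
The residual formula can be checked directly against the fitted model. The sketch below computes residuals by hand, compares them with `resid()`, and reproduces the residual standard error reported in the summary (5.914 on 270 degrees of freedom). The variable names are illustrative.

```r
model <- lm(waiting ~ eruptions, data = faithful)

# Residuals by hand: observed minus predicted waiting time
e_manual <- faithful$waiting - fitted(model)

# Residuals as stored by the model
e <- resid(model)
all.equal(unname(e_manual), unname(e))

# With an intercept in the model, the residuals sum to (numerically) zero
sum(e)

# Residual standard error: typical size of a residual, in minutes
rse <- sqrt(sum(e^2) / df.residual(model))
rse
```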

Residual Plot
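
A residual plot can be sketched in base R by plotting residuals against fitted values, with a dashed reference line at 0. This is one common layout and may differ from the original figure; points scattered evenly around the line, with no clear pattern, suggest the linear model is a reasonable fit.

```r
model <- lm(waiting ~ eruptions, data = faithful)

# Residuals against fitted values, with a reference line at zero
plot(fitted(model), resid(model),
     xlab = "Fitted waiting time (minutes)",
     ylab = "Residual (minutes)",
     pch = 19, col = "grey40")
abline(h = 0, lty = 2, col = "red")
```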

Why It Matters

Linear regression is awesome! It helps make predictions, reveals interesting relationships, and helps us understand trends. It is used in many fields, such as quantitative finance, engineering, and chemistry.