2025-10-25

Definitions

  • Simple linear regression is a statistical model that describes the linear relationship between two continuous variables.
  • It predicts a dependent variable (Y) from the value of an independent variable (X).
  • The model is a foundation of data modeling and machine learning.

The Model

The equation for a simple linear regression is:

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

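Where:
- \(Y\): Dependent variable
- \(X\): Independent variable
- \(\beta_0\): Intercept (expected value of Y when X = 0)
- \(\beta_1\): Slope (expected change in Y for a one-unit increase in X)
- \(\epsilon\): Error term (variation in Y not explained by X)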

Deriving the Least Squares Estimates

We estimate \(\beta_0\) and \(\beta_1\) by minimizing the sum of squared residuals:

\[ \text{RSS} = \sum_{i=1}^n (y_i - \hat{y_i})^2 \]

Here \(\hat{y_i} = \hat{\beta_0} + \hat{\beta_1} x_i\) is the fitted value for observation \(i\).

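Taking the partial derivatives of RSS with respect to \(\hat{\beta_0}\) and \(\hat{\beta_1}\), setting them to zero, and solving gives:

\[ \hat{\beta_1} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})} {\sum (x_i - \bar{x})^2}, \quad \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x} \]

This works because the minimum of RSS must lie where both partial derivatives are zero. RSS is a convex quadratic in the two coefficients, so that stationary point is the smallest sum of squared residuals.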
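
As a sketch of the intermediate steps, substituting \(\hat{y_i} = \hat{\beta_0} + \hat{\beta_1} x_i\) into RSS and setting both partial derivatives to zero gives:

\[ \frac{\partial\, \text{RSS}}{\partial \hat{\beta_0}} = -2 \sum_{i=1}^n (y_i - \hat{\beta_0} - \hat{\beta_1} x_i) = 0, \qquad \frac{\partial\, \text{RSS}}{\partial \hat{\beta_1}} = -2 \sum_{i=1}^n x_i (y_i - \hat{\beta_0} - \hat{\beta_1} x_i) = 0 \]

Dividing the first equation by \(-2n\) gives \(\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\). Substituting that into the second equation gives:

\[ \sum_{i=1}^n x_i \left( (y_i - \bar{y}) - \hat{\beta_1}(x_i - \bar{x}) \right) = 0 \quad \Rightarrow \quad \hat{\beta_1} = \frac{\sum x_i (y_i - \bar{y})}{\sum x_i (x_i - \bar{x})} \]

Because \(\sum (y_i - \bar{y}) = 0\) and \(\sum (x_i - \bar{x}) = 0\), replacing \(x_i\) with \(x_i - \bar{x}\) in the numerator and the denominator changes neither sum, which gives the centered form of \(\hat{\beta_1}\) shown above.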

Example Dataset

Let’s model the relationship between hours studied and exam score using simple linear regression on an artificial dataset.

Student     Hours Studied   Exam Score
Grug        0               11
Mikey       2               55
Janelle     3               67
Fabio       3               70
Christie    5               81
Squilliam   9               98
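
As a minimal sketch, the closed-form least squares estimates can be computed for this dataset directly in base R. The variable names (hours, score, beta1_hat, beta0_hat, rss) are illustrative choices, not part of the original slides.

```r
# Data from the example table
hours <- c(0, 2, 3, 3, 5, 9)
score <- c(11, 55, 67, 70, 81, 98)

# Slope: sum of cross-deviations divided by sum of squared deviations of x
beta1_hat <- sum((hours - mean(hours)) * (score - mean(score))) /
  sum((hours - mean(hours))^2)

# Intercept: makes the line pass through (mean(hours), mean(score))
beta0_hat <- mean(score) - beta1_hat * mean(hours)

# Residual sum of squares at the fitted coefficients
fitted_score <- beta0_hat + beta1_hat * hours
rss <- sum((score - fitted_score)^2)

print(c(intercept = beta0_hat, slope = beta1_hat, rss = rss))
```

On this data the slope comes out to about 8.61 exam points per additional hour studied, and the intercept to about 32.11.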

Figure: Scatterplot with Regression Line using ggplot2

How Can We Create This Plot?

- The code below was used to create the plot in RStudio:

    library(ggplot2)  # provides ggplot(), geom_point(), geom_smooth(), labs()

    x <- c(0, 2, 3, 3, 5, 9)                  # hours studied
    y <- c(11, 55, 67, 70, 81, 98)            # exam score
    data <- data.frame(Hours = x, Score = y)  # build the data frame

    ggplot(data, aes(x = Hours, y = Score)) +
      geom_point(color = "blue", size = 3) +                   # one point per student
      geom_smooth(method = "lm", se = FALSE, color = "red") +  # fitted regression line
      labs(title = "Hours Studied vs Exam Score",
           x = "Hours Studied", y = "Exam Score")

- Setting method = "lm" in geom_smooth() fits a linear model and draws its regression line; se = FALSE hides the confidence band.
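
geom_smooth() fits the model only in order to draw the line. To inspect the coefficients or make predictions, the same model can be fitted with lm() from base R. The following is a sketch: the object names fit, coefs, and pred, and the 4-hour query, are illustrative assumptions rather than part of the original slides.

```r
# Same data as in the plotting code
data <- data.frame(Hours = c(0, 2, 3, 3, 5, 9),
                   Score = c(11, 55, 67, 70, 81, 98))

# Fit Score = beta_0 + beta_1 * Hours by ordinary least squares
fit <- lm(Score ~ Hours, data = data)

# Estimated intercept ("(Intercept)") and slope ("Hours")
coefs <- coef(fit)
print(coefs)

# Predicted exam score for a hypothetical student who studies 4 hours
pred <- predict(fit, newdata = data.frame(Hours = 4))
print(pred)
```

The coefficients reported by lm() should match the closed-form least squares estimates, since both minimize the same sum of squared residuals.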

Why is Linear Regression so Important for Machine Learning?

  • Linear regression is widely used in machine learning.
  • As the use of machine learning and AI grows, so does the importance of linear regression models.
  • Many businesses use machine learning to try to predict future trends and outcomes.
  • Linear regression is a standard baseline when training machine learning models to predict future outcomes from past data.
  • Many important applications rely on more complex regression models, but simple linear regression remains a useful point of comparison for them.

Just How Important are Machine Learning and AI Becoming?

  • We can compile this plotly graph from various sources:

Does Linear Regression Get the Same Recognition?

  • Let’s use Google Trends data to compare.

An Underappreciated Statistical Method

  • As we can see, public interest in AI and machine learning continues to explode, while interest in linear regression stays flat. Linear regression is an important building block of the rapidly expanding AI/ML field, and it is badly underrated.
  • JUSTICE FOR LINEAR REGRESSION!!!

Sources for plotly Graph