2025-09-21

What is Linear Regression?

Linear regression is a method for modeling the linear relationship between two variables so that, for any given value of the X (predictor) variable, we can predict the corresponding value of the Y (response) variable.

library(ggplot2)

ggplot(iris, aes(Sepal.Length, Petal.Length)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) + # Draw the least squares regression line.
  labs(title = "Sepal Length vs Petal Length in Irises",
       subtitle = "Example of linear regression in blue")

Using Linear Regression

A linear regression model can be understood as a function that returns the predicted value of the Y variable for a given value of the X variable.

Example: Given an iris with a sepal length of 7 centimeters, our linear regression model predicts that, on average, such an iris will have a petal length of around 5.91 centimeters.

# Create our model
imodel <- lm(Petal.Length ~ Sepal.Length, iris)
# Predict the output of the model at a given value
predict(imodel, data.frame(Sepal.Length=c(7)))
##        1 
## 5.907587

Using Linear Regression Cont.

Graphing our predicted value shows that the predicted petal length lies close to the petal lengths of the adjacent data points.
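
As a minimal sketch of how this graph can be produced (the red point styling and the name new_pt are illustrative choices, not part of the original analysis):

# Predicted petal length at a sepal length of 7 cm
new_pt <- data.frame(Sepal.Length = 7)
new_pt$Petal.Length <- predict(imodel, new_pt)

ggplot(iris, aes(Sepal.Length, Petal.Length)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point(data = new_pt, colour = "red", size = 3) + # Our predicted value
  labs(title = "Predicted Petal Length at a 7 cm Sepal Length")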

Constructing a Linear Regression Model

The least squares method determines a linear regression model by solving an optimization problem: find the line for which the sum of the squared vertical distances (residuals) from every data point to the line is as small as possible.

The slope \(\beta\) of the least squares line is given by:

\[ \beta = \frac{n\left(\sum_{i=1}^n x_i y_i\right) - \left(\sum_{i=1}^n x_i\right)\left(\sum_{i=1}^n y_i\right)}{n\left(\sum_{i=1}^n x_i^2\right) - \left(\sum_{i=1}^n x_i\right)^2} \]
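
As a sketch, this formula can be evaluated directly in R and checked against the slope that lm() estimated above (the names x, y, and beta are only illustrative):

x <- iris$Sepal.Length
y <- iris$Petal.Length
n <- length(x)

# Slope computed directly from the least squares formula
beta <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
beta # should match coef(imodel)["Sepal.Length"]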

Least Squares Method Cont.

Given the slope \(\beta\), we then need to find the y-intercept \(\alpha\) of our line in order to write its equation in slope-intercept form.

\[ \alpha = \frac{\sum_{i=1}^n y_i - \beta \sum_{i=1}^n x_i}{n} \]
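
Continuing the sketch above, the intercept follows from the slope we just computed, and both hand-computed values can be compared with the coefficients reported by the fitted model:

# Intercept computed from the slope found above
alpha <- (sum(y) - beta * sum(x)) / n

# Compare the hand-computed coefficients with those estimated by lm()
c(intercept = alpha, slope = beta)
coef(imodel)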

Visualizing the Least Squares Method

For every data point, the squared vertical distance (residual) between the point and the line is calculated; the fitted line is the one that makes the sum of these squared distances as small as possible.
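
One way to sketch this picture is to draw each vertical residual with geom_segment(), using the fitted values from imodel defined earlier (the grey styling is an arbitrary choice):

iris_fit <- iris
iris_fit$Fitted <- fitted(imodel) # Fitted petal length for each observation

ggplot(iris_fit, aes(Sepal.Length, Petal.Length)) +
  geom_segment(aes(xend = Sepal.Length, yend = Fitted), colour = "grey60") + # Residuals
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Vertical Distances Minimized by Least Squares")

# The quantity least squares minimizes: the sum of squared residuals
sum(residuals(imodel)^2)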

Issues with Linear Regression

  • Not all data are linear. In the iris example, there is a cluster of points in the lower left that lies quite far from our line of best fit.
  • Because the fit minimizes squared errors, the estimated line can be strongly skewed by outliers.

Extensions to Linear Regression

So-called multiple regression extends the model to more than one predictor variable and may be used to analyze data on more than two axes, as in the sketch below.
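
As a small illustrative sketch, lm() accepts additional predictors on the right-hand side of the formula; the choice of Sepal.Width as a second predictor here is an assumption for demonstration, not part of the original analysis:

# Multiple regression: petal length modeled on two predictor variables
mmodel <- lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = iris)
summary(mmodel)

# Predict petal length for an iris with both measurements supplied
predict(mmodel, data.frame(Sepal.Length = 7, Sepal.Width = 3))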
