2025-10-14

What is it?

As many scatterplots show, it is rare for all the data points in a set to fall perfectly on a line. If they did, there would be little need for statistical analysis.

Linear regression is about choosing the line that best describes the overall behavior of the data.

Residuals

Key Terms:

  • \(y\) is the actual value for the given \(x\)
  • \(\widehat{y}\) is the predicted value (based on the linear regression model) for the given \(x\)

A residual, denoted \(e\), is the actual response minus the predicted response. In other words: \[ e = y - \widehat{y} \] Residuals can be positive or negative, but a best-fit line should keep the residuals, taken together, as small in magnitude as possible. The least squares method does this by minimizing the sum of the squared residuals.
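To make the definition concrete, here is a minimal sketch in R. The actual and predicted values are hypothetical numbers chosen for illustration, not taken from any dataset in these slides.

```r
# Hypothetical actual responses and model predictions (illustration only)
y     <- c(3, 5, 4)
y_hat <- c(2.5, 5.5, 4)

# Residual = actual minus predicted
e <- y - y_hat
print(e)
```

A positive residual means the point lies above the line, a negative one means it lies below, and zero means the line passes through the point.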

Least Squares Linear Regression

Linear regression models take the following form: \[ \widehat{y} = mx+b \] where \[ m = \frac{n \sum{x_i y_i} - (\sum{x_i})(\sum{y_i})}{n(\sum{x_i^2}) - (\sum{x_i})^2} \] and \[ b = \frac{1}{n}(\sum{y_i} - m \sum{x_i}) \] Here \(n\) is the number of data points.
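As a brief sketch of where these formulas come from (the symbol \(S\) is introduced here for illustration), least squares chooses \(m\) and \(b\) to minimize the sum of squared residuals: \[ S(m, b) = \sum{(y_i - m x_i - b)^2} \] Setting both partial derivatives to zero gives \[ \frac{\partial S}{\partial m} = -2 \sum{x_i (y_i - m x_i - b)} = 0, \qquad \frac{\partial S}{\partial b} = -2 \sum{(y_i - m x_i - b)} = 0 \] which rearrange to \[ m \sum{x_i^2} + b \sum{x_i} = \sum{x_i y_i}, \qquad m \sum{x_i} + n b = \sum{y_i} \] Solving the second equation for \(b\) gives the intercept, and substituting it into the first gives \[ m \left( n \sum{x_i^2} - (\sum{x_i})^2 \right) = n \sum{x_i y_i} - (\sum{x_i})(\sum{y_i}) \] which is the slope after dividing through.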

Least Squares Linear Regression Example

As an example, say we have a list of \(x\) values (\(2, 2, 10, 7, 5\)) and a list of \(y\) values (\(1, 0, 5, 2, 3\)). We can then compute \(m\) and \(b\) in R as follows.

x = c(2, 2, 10, 7, 5)
y = c(1, 0, 5, 2, 3)

sum_x = sum(x)
sum_y = sum(y)
sum_x_squared = sum(x^2)
sum_x_y = sum(x * y)
n = length(x)

m = ((n * sum_x_y) - (sum_x * sum_y))/(n * sum_x_squared - (sum_x)^2)
b = (1 / n) * (sum_y - (m * sum_x))

Least Squares Linear Regression Example

Computing that gives the value of \(m\) as 0.509 and the value of \(b\) as -0.444. That means that our linear regression equation is \(\widehat{y} = 0.509x - 0.444\).
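For reference, the same arithmetic can be written out with the intermediate sums from the example data: \[ n = 5, \quad \sum{x_i} = 26, \quad \sum{y_i} = 11, \quad \sum{x_i^2} = 182, \quad \sum{x_i y_i} = 81 \] so that \[ m = \frac{5(81) - (26)(11)}{5(182) - 26^2} = \frac{119}{234} \approx 0.509 \] and \[ b = \frac{1}{5}\left(11 - \frac{119}{234} \cdot 26\right) = -\frac{4}{9} \approx -0.444 \]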

Linear Regression in R

Instead of computing each of the summations ourselves, we can let R fit the model directly. We will use the built-in trees dataset for this example to construct a linear regression line for Volume versus Girth.

The following slide will show the plot.

library(ggplot2)

# Scatterplot of the trees data with a fitted linear regression line
ggplot(trees, aes(x = Girth, y = Volume)) +
  geom_point(aes(color = "Data Points")) +
  geom_smooth(aes(color = "Regression Line"),
              method = "lm", se = FALSE) +
  scale_color_manual(values = c("Data Points" = "blue",
                                "Regression Line" = "red")) +
  labs(title = "Volume vs Girth with Linear Regression Line")

Linear Regression in R

The method = "lm" argument passed to geom_smooth() is what fits the linear regression model.
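The plot draws the line but does not report its slope and intercept. A minimal sketch of getting those numbers is to fit the same kind of model directly with lm(). The Girth value of 12 used for the prediction is a hypothetical input chosen for illustration.

```r
# trees ships with base R (datasets package), so no extra packages are needed
fit <- lm(Volume ~ Girth, data = trees)

# Intercept (b) and slope (m) of the fitted line
print(coef(fit))

# Predicted Volume for a hypothetical Girth of 12 (illustration only)
print(predict(fit, newdata = data.frame(Girth = 12)))
```

The coefficient labeled (Intercept) plays the role of \(b\), and the one labeled Girth plays the role of \(m\).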

Effect of Outliers

Linear regression is a very useful tool when trying to predict a y-value for a given x-value. Outliers (data points that lie far from the rest of the data) can noticeably shift a linear regression line, though, because squaring the residuals gives large deviations extra weight.
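As a rough illustration of this effect, the sketch below refits the five-point example from the least squares slides after adding one extra point. The added point (1, 12) is hypothetical and was picked only to show how far a single outlier can move the line.

```r
# Data from the least squares example
x <- c(2, 2, 10, 7, 5)
y <- c(1, 0, 5, 2, 3)

# Same data plus one hypothetical outlier at (1, 12), for illustration only
x_out <- c(x, 1)
y_out <- c(y, 12)

# Slope of the fitted line without and with the outlier
slope_clean <- coef(lm(y ~ x))[["x"]]
slope_out   <- coef(lm(y_out ~ x_out))[["x_out"]]

print(c(without_outlier = slope_clean, with_outlier = slope_out))
```

With this one extra point the slope goes from about 0.509 to about -0.171, so the fitted trend changes direction entirely.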