2026-02-07

Simple linear regression is a statistical method for finding a linear function that represents the relationship between an independent variable and a dependent variable in a data set. This function acts as a line of best fit to the data, and is often used to predict values of the dependent variable for values of the independent variable that were not observed in the data set.

A simple linear regression model has the form:
\(y = \alpha + \beta x + \epsilon\)
where \(y\) is the dependent variable,
\(x\) is the independent variable,
\(\alpha\) is a constant representing the \(y\)-intercept,
\(\beta\) is a coefficient of \(x\) representing the slope of the line,
and \(\epsilon\) is a random error term.
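
As a concrete illustration, here is a minimal R sketch that simulates data from this model. The values chosen for \(\alpha\), \(\beta\), and the error scale are arbitrary, made up purely for demonstration:

# Simulate n observations from y = alpha + beta*x + epsilon
# (the parameter values below are arbitrary)
set.seed(42)
n     <- 50
x     <- runif(n, min = 0, max = 10)  # independent variable
alpha <- 2                            # y-intercept
beta  <- 0.5                          # slope
eps   <- rnorm(n, mean = 0, sd = 1)   # random error
y     <- alpha + beta * x + eps       # dependent variable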

To illustrate this, we can look at the ‘cars’ data set included with R as an example.
Though the data is about a century old, it is well suited to simple linear regression because it contains only two variables whose relationship is easy to understand intuitively. These variables are speed, the speed a car was traveling in miles per hour, and dist, the distance in feet the car took to come to a stop. On the next slide we will see a scatterplot of this data, along with the “line of best fit” produced by a simple linear regression.
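
A plot like the one on the next slide can be produced with the ggplot2 package (assuming it is installed); stat_smooth(method = "lm") overlays the regression line on the scatterplot:

library(ggplot2)

# Stopping distance against speed, with the line of best fit
ggplot(data = cars, mapping = aes(x = speed, y = dist)) +
  geom_point() +
  stat_smooth(method = "lm")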

Even if we had not shown the line in this plot, it is easy to see from the points alone that higher speeds tend to result in longer stopping distances. However, we can also see that there is some vertical distance between most points and the line, and sometimes this distance is large. How does this fit with our regression model?

For the line to be considered a “best fit” in the least-squares sense, the intercept and slope must be chosen so that the sum of the squared vertical distances between the points and the line is as small as it can possibly be. If this sum is represented by the function
\(f(a, b) = \sum_{i=1}^{n} [y_i - (a + bx_i)]^2\)
then the line of best fit is \(y = a + bx\) for the values of \(a\) and \(b\) that minimize this function. Once we have this line, we can get a sense of how strongly the two variables are related by examining the spread of the points around it. In our cars plot, most points cluster fairly close to the line, with a few outliers. This suggests that speed does indeed have a strong effect on stopping distance.
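
We can check this numerically. The sketch below minimizes \(f\) directly with R’s general-purpose optimizer, optim(), and compares the result to the coefficients lm() computes for the cars data; the two should agree up to optimizer tolerance:

# Sum of squared vertical distances for the cars data
f <- function(p) {
  a <- p[1]
  b <- p[2]
  sum((cars$dist - (a + b * cars$speed))^2)
}

# Minimize f over (a, b), starting from (0, 0)
optim(c(0, 0), f)$par

# lm() finds the same minimizing intercept and slope
coef(lm(dist ~ speed, data = cars))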

Here is another example, this time using R’s built-in iris data set to compare the sepal length and petal length of iris flowers. The points are clustered fairly closely around the line, but does that necessarily imply a cause-and-effect relationship? Could other factors be at play?
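
With ggplot2 already loaded, the plot for this slide can be generated the same way, using the iris columns Sepal.Length and Petal.Length:

# Petal length against sepal length, all species pooled
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point() +
  stat_smooth(method = "lm")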

In this plot, the three species of iris in the data set are shown in different colors, each with its own regression line. From this perspective, each species seems to have its own relationship between our two variables.
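
Mapping Species to color splits the data by species; because stat_smooth inherits the color grouping, each species gets its own regression line:

# One line of best fit per species
ggplot(data = iris,
       mapping = aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  stat_smooth(method = "lm")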

Simple linear regression is a useful tool, but as we’ve seen in the previous two slides, it can lead us to draw conclusions about data without seeing the bigger picture. Some data may not make sense to fit with a line; sometimes a curve is a better choice. There are many forms of regression, and it is important to consider the context of the data and what other factors may be involved in order to choose the best model. To illustrate this point, the last few slides show Anscombe’s quartet, a group of four data sets constructed specifically so that they share nearly identical linear fits despite very different-looking scatterplots.

library(ggplot2)

# Anscombe's quartet: four data sets with nearly identical linear fits.
# Each pair of columns (x1/y1 through x4/y4) gets its own plot.

# Data set I
ggplot(data = anscombe, mapping = aes(x = x1, y = y1)) +
  geom_point() +
  stat_smooth(method = "lm")

# Data set II
ggplot(data = anscombe, mapping = aes(x = x2, y = y2)) +
  geom_point() +
  stat_smooth(method = "lm")

# Data set III
ggplot(data = anscombe, mapping = aes(x = x3, y = y3)) +
  geom_point() +
  stat_smooth(method = "lm")

# Data set IV
ggplot(data = anscombe, mapping = aes(x = x4, y = y4)) +
  geom_point() +
  stat_smooth(method = "lm")
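
To verify that the four fits really are nearly identical, we can loop over the pairs and print the fitted coefficients; all four come out close to an intercept of 3 and a slope of 0.5:

# Fit each x/y pair in Anscombe's quartet and print the coefficients
for (i in 1:4) {
  d <- data.frame(x = anscombe[[paste0("x", i)]],
                  y = anscombe[[paste0("y", i)]])
  print(coef(lm(y ~ x, data = d)))
}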