Put simply, simple linear regression is a method of finding the relation between two different variables. This is done by attempting to fit a linear equation such that the line created is as close to all of the data points as possible.
2024-10-29
Here we have a simple scatter plot.
Notice how there are no observations from \(x = 4\) to \(x = 6\).
What if we wanted to approximate values from \(x = 4\) to \(x = 6\)?
This is where simple linear regression comes in!
As we have said before, the linear regression line should be as close to all of the data points as possible. Well, a line is just \(y = mx + b\), so to find the best fit line, we just need to find the \(m\) and \(b\) that best represent all of the data in the graph.
So how do we find \(m\) and \(b\)?
Well, they can be found with the following equations: \[ m = \frac{\Sigma(xy) - \frac{\Sigma(x)\Sigma(y)}{n}}{\Sigma(x^2) - \frac{(\Sigma(x))^2}{n}} \] \[ b = \frac{\Sigma(y)}{n} - m\left(\frac{\Sigma(x)}{n}\right) \]
Yes, it’s a lot, but don’t worry, we are going to dissect what the heck all of that means now.
Let’s go ahead and break down all of the components of \(m\) from the previous slide: \[ m = \frac{\Sigma(xy) - \frac{\Sigma(x)\Sigma(y)}{n}}{\Sigma(x^2) - \frac{(\Sigma(x))^2}{n}} \]
\(\Sigma(xy)\) : Multiply each point’s \(x\) by its \(y\), then add up all of those products
\(\frac{\Sigma(x)\Sigma(y)}{n}\) : The sum of all \(x\) values times the sum of all \(y\) values, divided by the number of observations \(n\)
\(\Sigma(x^2)\) : Square each \(x\) value, then add up all of those squares
\(\frac{(\Sigma(x))^2}{n}\) : Add up all \(x\) values first, square that total, then divide by the number of observations \(n\)
Now that we have our \(m\), we need to find our \(b\) to finish our line.
Fortunately, we have everything we need to find it already.
Realize that we are just finding the y-intercept of our line. Thus, we only need our slope \(m\) (which we are now experts on) and a given point.
But we can’t just use any one point, as we need to represent every point on our plot. So instead we use the average point \((\frac{\Sigma(x)}{n}, \frac{\Sigma(y)}{n})\), which the best fit line always passes through. That is why we get: \[b = \frac{\Sigma(y)}{n} - m\left(\frac{\Sigma(x)}{n}\right)\]
Note how it looks complicated, but if you really look at it, it is simply the average \(y\) value minus the slope times the average \(x\) value. That way, we can represent all of the points in our plot.
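To make the two formulas concrete, here is a minimal sketch of them in R. The function name `fit_line` and the sample points are made up for illustration (they are not from the slides); the points lie exactly on \(y = 2x + 1\), so we know what the answer should be.

```r
# Hypothetical helper (not from the slides): slope and intercept from the sums.
fit_line <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) >= 2)
  n <- length(x)
  # Slope: (sum(xy) - sum(x)sum(y)/n) / (sum(x^2) - (sum(x))^2/n)
  m <- (sum(x * y) - sum(x) * sum(y) / n) / (sum(x^2) - sum(x)^2 / n)
  # Intercept: average y minus slope times average x
  b <- sum(y) / n - m * (sum(x) / n)
  c(m = m, b = b)
}

# Made-up points that sit exactly on y = 2x + 1
x <- c(0, 1, 2, 3)
y <- c(1, 3, 5, 7)
fit <- fit_line(x, y)
print(fit)  # m should come out as 2 and b as 1
```

Because every made-up point is on the same line, the formulas recover that line exactly; with noisy data they return the closest line instead.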
So where do these equations come from? Well, let’s think about what we are trying to achieve:
A line as close to every data point as possible is the same as a line that minimizes the total distance from the line to the points.
We measure that total distance as the sum of the squared vertical distances between each point and the line:
\[ S = \Sigma(y_i - \hat{y}_i)^2 \] Where \(y_i\) is the actual \(y\) value of a point and \(\hat{y}_i = mx_i + b\) is the value the line predicts for it. We want to minimize \(S\), so we take its derivatives with respect to \(m\) and \(b\) and set them equal to zero.
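To get a feel for \(S\), here is a small sketch in R that computes it for a few candidate lines. The helper name `sse` and the nudged lines are assumptions chosen for illustration; the four points are the ones used in the worked example, for which \(y = \frac{2}{5}x + \frac{3}{2}\) turns out to be the best fit.

```r
# Sum of squared vertical distances between the points and the line y = m*x + b.
sse <- function(m, b, x, y) {
  y_hat <- m * x + b   # value the line predicts at each x
  sum((y - y_hat)^2)
}

# Four illustrative points
x <- c(1, 2, 3, 4)
y <- c(2, 2, 3, 3)

s_best  <- sse(0.4, 1.5, x, y)  # the best fit line for these points
s_steep <- sse(0.5, 1.5, x, y)  # slope nudged up
s_low   <- sse(0.4, 1.3, x, y)  # intercept nudged down
s_flat  <- sse(0.0, 2.5, x, y)  # flat line through the average y
print(c(best = s_best, steeper = s_steep, lower = s_low, flat = s_flat))
```

Every nudge away from the best fit line makes \(S\) larger, which is exactly what "minimizing \(S\)" means.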
And a lot of math, derivatives and tears later… we arrive at the linear regression model.
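For the curious, here is a brief sketch of the skipped math, assuming the predicted value is written as \(\hat{y}_i = mx_i + b\). Setting both partial derivatives of \(S\) to zero gives:
\[ \frac{\partial S}{\partial b} = -2\Sigma(y_i - mx_i - b) = 0 \quad\Rightarrow\quad b = \frac{\Sigma(y)}{n} - m\left(\frac{\Sigma(x)}{n}\right) \]
\[ \frac{\partial S}{\partial m} = -2\Sigma x_i(y_i - mx_i - b) = 0 \quad\Rightarrow\quad \Sigma(xy) = m\Sigma(x^2) + b\Sigma(x) \]
Substituting the expression for \(b\) into the second result and solving for \(m\):
\[ \Sigma(xy) = m\Sigma(x^2) + \frac{\Sigma(x)\Sigma(y)}{n} - m\frac{(\Sigma(x))^2}{n} \quad\Rightarrow\quad m = \frac{\Sigma(xy) - \frac{\Sigma(x)\Sigma(y)}{n}}{\Sigma(x^2) - \frac{(\Sigma(x))^2}{n}} \]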
For a simple example, let’s take this plot with points (1,2), (2,2), (3,3), (4,3).
To begin, let’s find our \(m\). Plugging our points in gives:
\[ m = \frac{(2 + 4 + 9 + 12) - \frac{(1 + 2 + 3 + 4)(2 + 2 + 3 + 3)}{4}}{(1 + 4 + 9 + 16) - \frac{(1 + 2 + 3 + 4)^2}{4}} \]
Which simplifies to \(m = \frac{27 - 25}{30 - 25} = \frac{2}{5}\).
Given that \(m\), we can now find our \(b\):
\[ b = \frac{2 + 2 + 3 + 3}{4} - \frac{2}{5}\frac{1 + 2 + 3 + 4}{4} \]
Which results in \(b = \frac{5}{2} - 1 = \frac{3}{2}\).
Making our final line \(y = \frac{2}{5}x + \frac{3}{2}\).
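If you would rather not take the arithmetic on faith, it can be checked in R. This is a minimal sketch: the variable names are made up for illustration, and `lm()` is base R’s linear model function, used here only as a cross-check.

```r
x <- c(1, 2, 3, 4)
y <- c(2, 2, 3, 3)
n <- length(x)

sum_xy <- sum(x * y)  # 2 + 4 + 9 + 12 = 27
sum_x  <- sum(x)      # 10
sum_y  <- sum(y)      # 10
sum_x2 <- sum(x^2)    # 1 + 4 + 9 + 16 = 30

m <- (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x^2 / n)  # (27 - 25) / (30 - 25)
b <- sum_y / n - m * (sum_x / n)                            # 2.5 - 0.4 * 2.5

# Cross-check against R's built-in linear model fit
fit <- lm(y ~ x)
print(c(m = m, b = b))
print(coef(fit))  # intercept first, then the slope for x
```

Both approaches should agree on a slope of 0.4 and an intercept of 1.5.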
Let’s plot this line on the graph!
We can see that the line represents the plot fairly well.
There is one fatal flaw using this method:
We humans are very lazy and doing all of this is a lot of work…
That’s why we have computers!
Let’s go back to our original plot
Plugging all of these points into the equation would take an amount of time that we can only quantify as “forever”.
Luckily, using R, we can do this in seconds!
The following is the code to plot our graph:
library(ggplot2)

# Simulate 100 noisy points along y = x
set.seed(123)
x_values <- seq(1, 10, length.out = 100)
y_values <- x_values + rnorm(100, mean = 0, sd = 0.5)
data <- data.frame(x = x_values, y = y_values)

# Drop the observations from x = 4 to x = 6 to create the gap
data <- data[!(data$x >= 4 & data$x <= 6), ]
y_range <- range(data$y, na.rm = TRUE)

ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  labs(title = "", x = "X", y = "Y") +
  xlim(1, 10) +
  ylim(y_range) +
  theme_minimal()
Adding the best fit line is as simple as adding this line of code to the plot:
geom_smooth(method = "lm", color = "red", se = FALSE)
## geom_smooth: na.rm = FALSE, orientation = NA, se = FALSE
## stat_smooth: na.rm = FALSE, orientation = NA, se = FALSE, method = lm
## position_identity
The following is what our code will look like after our addition:
ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "", x = "X", y = "Y") +
  xlim(1, 10) +
  ylim(y_range) +
  theme_minimal()
The following is the resulting plot:
A suitable best fit line with none of the excruciating effort!
We can now get an approximation of values from \(x=4\) to \(x=6\).
Please note that these are always going to be only approximations!
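To read actual numbers off the line rather than eyeballing the plot, one option is to fit the same model directly with base R’s `lm()` and ask it for predictions. This is a sketch under the assumption that the data is generated exactly as in the plotting code; the names `fit`, `new_x` and `approx_y` are made up for illustration.

```r
# Recreate the data from the plot, including the gap from x = 4 to x = 6
set.seed(123)
x_values <- seq(1, 10, length.out = 100)
y_values <- x_values + rnorm(100, mean = 0, sd = 0.5)
data <- data.frame(x = x_values, y = y_values)
data <- data[!(data$x >= 4 & data$x <= 6), ]

# Fit the best fit line; coef() returns the intercept (b) and the slope (m)
fit <- lm(y ~ x, data = data)
print(coef(fit))

# Approximate y at a few x values inside the gap
new_x <- data.frame(x = c(4, 5, 6))
approx_y <- predict(fit, newdata = new_x)
print(cbind(new_x, approx_y))
```

Since the simulated points follow \(y = x\) plus noise, the approximations should land close to 4, 5 and 6, but not exactly on them.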