Simple linear regression is a statistical technique for estimating a function that describes how two variables relate to one another, using a set of observations. The observations are assumed to deviate somewhat from the underlying relationship.
2023-11-12
First, we’ll generate a set of values that are related to one another, but with some random noise added.
x <- 1:100
y <- 1:100
dev <- runif(100, min = 0, max = 10)
y <- y + dev
data1 <- data.frame(X = x, Y = y)
So, for x values of 1 to 100, Y is equal to X plus a random number between 0 and 10.
The regression line will be a straight, non-vertical line, and so its formula is as simple as \(y = mx + b\) with \(m\) being the slope and \(b\) being the y-intercept. While we can intuitively see the relation in our simple scatter plot, generating the regression mathematically can be more complicated: \[\frac{y - \bar y}{s_y} = r_{xy}\frac{x - \bar x}{s_x}\] with \(\bar x\) and \(\bar y\) being the averages of those values across the set, \(s_x\) and \(s_y\) being their standard deviations, and \(r_{xy}\) being the correlation coefficient, which, scaled by \(s_y / s_x\), gives us our slope.
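To connect the standardized form back to \(y = mx + b\), it may help to solve it for \(y\). This is only a rearrangement of the equation above and introduces no new symbols. Multiplying both sides by \(s_y\) and adding \(\bar y\) gives \[y = r_{xy}\frac{s_y}{s_x}\,x + \left(\bar y - r_{xy}\frac{s_y}{s_x}\,\bar x\right)\] so that \[m = r_{xy}\frac{s_y}{s_x}, \qquad b = \bar y - m\,\bar x\] In other words, the line always passes through the point \((\bar x, \bar y)\), and the correlation coefficient equals the slope only when \(s_x = s_y\).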
\(r_{xy}\), or the correlation coefficient, can be found with the following formula, using the averages of various expressions across the set: \[r_{xy} = \frac{\overline{xy} - \bar x \bar y}{\sqrt{(\overline{x^2}-\bar x^2)(\overline{y^2}-\bar y^2)}}\] with, for example, \(\overline{xy}\) being the average of \(x \cdot y\) across the set, and \(\bar x \bar y\) being the product of the average of \(x\) and the average of \(y\). Likewise, \(\overline{x^2}\) is the average of the squared values, while \(\bar x^2\) is the square of the average.
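As a rough check, the formula can be evaluated directly in R and compared with the built-in `cor()` and `lm()` functions. This is a minimal sketch rather than the post's own code: the data are regenerated so the block runs on its own, and the seed and the names `r_xy`, `m`, `b`, and `fit` are assumptions chosen for illustration.

```r
# Assumption: regenerate the data so the sketch is self-contained.
# The seed is arbitrary and not part of the original post.
set.seed(42)
x <- 1:100
y <- x + runif(100, min = 0, max = 10)

# Correlation coefficient built from the averages in the formula above
r_xy <- (mean(x * y) - mean(x) * mean(y)) /
  sqrt((mean(x^2) - mean(x)^2) * (mean(y^2) - mean(y)^2))

# Slope and intercept of the regression line.
# sd() divides by n - 1, but the ratio s_y / s_x is the same either way.
m <- r_xy * sd(y) / sd(x)
b <- mean(y) - m * mean(x)

# Cross-check against R's built-in correlation and linear model
fit <- lm(y ~ x)
all.equal(r_xy, cor(x, y))
all.equal(c(b, m), unname(coef(fit)))
```

Both comparisons should agree up to floating-point tolerance, since `lm()` fits the same least-squares line that the formulas describe.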
Because we assume there is some deviation from the relationship, our regression line might not match the actual function behind the relation, due to the random noise. To account for this, we can calculate a confidence interval, which gives a range within which the true regression line is likely to fall. First, let’s increase the noise in our values to a random number between 0 and 50:
dev <- runif(100, min = 0, max = 50)
# Start from x so the earlier 0-10 noise is not added a second time
y <- x + dev
data1 <- data.frame(X = x, Y = y)
This graph shows the 95% confidence interval: we can be 95% confident that the true regression line falls within the shaded area, accounting for the fact that the deviation within the data set is random.
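The code behind the plot is not shown above, so the following is only one plausible way to obtain a 95% confidence band in base R, using `lm()`, `confint()`, and `predict()` with `interval = "confidence"`. The seed, the names `fit` and `band`, and the base-graphics plotting choices are assumptions made for illustration.

```r
# Assumption: regenerate the noisier data so the sketch is self-contained.
# The seed is arbitrary and not part of the original post.
set.seed(42)
x <- 1:100
y <- x + runif(100, min = 0, max = 50)
data1 <- data.frame(X = x, Y = y)

# Fit the simple linear regression Y = m * X + b
fit <- lm(Y ~ X, data = data1)

# 95% confidence intervals for the intercept and the slope
confint(fit, level = 0.95)

# 95% confidence band for the fitted line at each observed X
# (a matrix with columns fit, lwr, and upr)
band <- predict(fit, newdata = data1, interval = "confidence", level = 0.95)
head(band)

# Draw the band as a shaded region, then the points and the fitted line
plot(data1$X, data1$Y, type = "n", xlab = "X", ylab = "Y")
polygon(c(data1$X, rev(data1$X)),
        c(band[, "lwr"], rev(band[, "upr"])),
        col = "grey80", border = NA)
points(data1$X, data1$Y)
abline(fit)
```

The band is narrowest near \(\bar x\) and widens toward the ends of the range, because the uncertainty in the slope has more effect on the line the further it is from the centre of the data.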