Simple Linear Regression

2023-11-12

Introduction

Simple linear regression is a statistical technique for estimating a function that describes how two variables relate to one another, using a set of observations. The observations are assumed to deviate somewhat from the underlying relationship.

Example

First, we’ll generate a set of values that are related to one another, but with some random noise added.

x <- 1:100
y <- 1:100
dev <- runif(100, min = 0, max = 10)  # uniform noise between 0 and 10
y <- y + dev                          # Y is X plus the noise
data1 <- data.frame(X = x, Y = y)

So, for X values from 1 to 100, Y is equal to X plus a uniformly distributed random number between 0 and 10.

Plot showing our randomized relation

Generating the regression

The regression line will be a straight, non-vertical line, and so its formula is as simple as \(y = mx + b\) with \(m\) being the slope and \(b\) being the y-intercept. While we can intuitively see the relation in our simple scatter plot, generating the regression mathematically can be more complicated: \[\frac{y - \bar y}{s_y} = r_{xy}\frac{x - \bar x}{s_x}\] with \(\bar x\) and \(\bar y\) being the averages of those values across the set, \(s_x\) and \(s_y\) being their standard deviations, and \(r_{xy}\) being the correlation coefficient. Solving for \(y\) gives the slope \(m = r_{xy}\frac{s_y}{s_x}\) and the y-intercept \(b = \bar y - m \bar x\).
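As a rough check of the formula above, the slope and intercept can be computed by hand and compared with R's built-in `lm()`. This is only a sketch: the seed and the names `r_xy`, `m`, `b`, `by_hand`, and `from_lm` are assumptions added for illustration, and the data is regenerated here so the block runs on its own.

```r
set.seed(42)  # hypothetical seed, only so the sketch is reproducible

# Same construction as the example data: Y is X plus uniform noise in [0, 10]
x <- 1:100
y <- x + runif(100, min = 0, max = 10)

# Slope and intercept from the correlation and the standard deviations
r_xy <- cor(x, y)
m <- r_xy * sd(y) / sd(x)
b <- mean(y) - m * mean(x)
by_hand <- c(intercept = b, slope = m)

# The same fit using R's linear model function
from_lm <- coef(lm(y ~ x))

print(by_hand)
print(from_lm)
```

The two printed pairs should agree up to floating-point rounding.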

Generating the regression (continued)

\(r_{xy}\), or the correlation coefficient, can be found with the following formula, using the averages of various expressions across the set: \[r_{xy} = \frac{\overline{xy} - \bar x \bar y}{\sqrt{(\overline{x^2}-\bar x^2)(\overline{y^2}-\bar y^2)}}\] with, for example, \(\overline{xy}\) being the average of \(x \cdot y\) across the set, and \(\bar x \bar y\) being the product of the average of \(x\) and the average of \(y\).
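The averages in this formula translate almost directly into R. The sketch below evaluates it term by term and compares the result with the built-in `cor()`. The seed and the names `r_manual` and `r_builtin` are assumptions for illustration, and the data is regenerated so the block is self-contained.

```r
set.seed(42)  # hypothetical seed, only so the sketch is reproducible

x <- 1:100
y <- x + runif(100, min = 0, max = 10)

# Numerator: average of the products minus the product of the averages
numerator <- mean(x * y) - mean(x) * mean(y)

# Denominator: square root of the product of the two "mean of squares
# minus square of mean" terms
denominator <- sqrt((mean(x^2) - mean(x)^2) * (mean(y^2) - mean(y)^2))

r_manual <- numerator / denominator
r_builtin <- cor(x, y)  # Pearson correlation by default

print(c(manual = r_manual, builtin = r_builtin))
```

Although `cor()` works with sample (n - 1) variances internally, that factor cancels between numerator and denominator, so both values should match.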

Plotting the regression
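One way to draw the fitted line over the scatter plot is with base R graphics, as sketched below. The original figure may have been produced with a different plotting library. The seed, colours, and title used here are assumptions.

```r
set.seed(42)  # hypothetical seed, only so the sketch is reproducible

x <- 1:100
y <- x + runif(100, min = 0, max = 10)
data1 <- data.frame(X = x, Y = y)

# Fit Y as a linear function of X
fit <- lm(Y ~ X, data = data1)

# Scatter plot of the observations with the regression line on top
plot(data1$X, data1$Y, xlab = "X", ylab = "Y",
     main = "Observations with fitted regression line")
abline(fit, col = "red", lwd = 2)
```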

Confidence interval

Because we assume there is some deviation from the relationship, our regression line might not match the true function behind the relation, since the random noise shifts the fit. To account for this, we can calculate a confidence interval, which gives a range within which the true line is likely to fall. First, let’s increase the noise in our values to a random number between 0 and 50:

dev <- runif(100, min = 0, max = 50)  # uniform noise between 0 and 50
y <- x + dev                          # start again from x so the earlier 0-10 noise is not included
data1 <- data.frame(X = x, Y = y)

Confidence interval example

This graph shows the 95% confidence interval: the shaded area is the range in which we expect the true regression line to fall with 95% confidence, accounting for the fact that the deviation within the data set is random.
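A band like this can be computed with `predict()` on a fitted `lm` object, as in the minimal sketch below. It uses base R graphics and dashed lines instead of a shaded area. The seed and the name `band` are assumptions, and the original figure may have been drawn with a different library.

```r
set.seed(42)  # hypothetical seed, only so the sketch is reproducible

x <- 1:100
y <- x + runif(100, min = 0, max = 50)
data1 <- data.frame(X = x, Y = y)

fit <- lm(Y ~ X, data = data1)

# 95% confidence interval for the fitted line at each X value;
# the result is a matrix with columns "fit", "lwr", and "upr"
band <- predict(fit, newdata = data.frame(X = x),
                interval = "confidence", level = 0.95)

plot(data1$X, data1$Y, xlab = "X", ylab = "Y",
     main = "Regression line with 95% confidence band")
abline(fit, col = "red", lwd = 2)
lines(x, band[, "lwr"], lty = 2)  # lower bound
lines(x, band[, "upr"], lty = 2)  # upper bound
```

The band is narrowest near the mean of X and widens toward the ends of the range, where the fitted line is less certain.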