2024-03-22

What is Simple Linear Regression?

A simple linear regression (SLR) estimates a line of best fit between a dependent variable and a single independent variable.

The line of best fit formula might look very familiar:

\(\hat{y}=\alpha+\beta\text{x}\)

It’s just a line, with:

\(\alpha\) as the y intercept
\(\beta\) as the slope of the function
\(\hat{y}\) as the predicted value of y given an x

The goal of SLR is to find the line that minimizes the sum of squared vertical distances between the points and the line; we will calculate it later.
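To make "best fit" concrete, here is a minimal sketch in R. One common way to measure the distance between points and a line is the sum of squared vertical distances (residuals), which is what least squares minimizes. The toy data and the helper `ssr` are invented for illustration; they are not part of the original slides.

```r
# Hypothetical toy data, invented for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared residuals for a candidate line y = alpha + beta * x
ssr <- function(alpha, beta) {
  sum((y - (alpha + beta * x))^2)
}

# Least squares line from lm()
fit <- lm(y ~ x)
ssr_fit <- ssr(coef(fit)[1], coef(fit)[2])

# An arbitrary alternative line, for comparison
ssr_other <- ssr(0, 2.1)

# The fitted line should have the smaller sum of squared residuals
ssr_fit < ssr_other
```

Any other choice of intercept and slope should give a sum of squared residuals at least as large as the fitted line's.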

Why use SLR?

There are many reasons to use an SLR model:

  • Simplicity
    • It's easier to interpret data from a line function than from scattered data points
  • Determine Linearity
    • If your data has a near-linear relationship, an SLR model can expose that
  • Making Predictions
    • Allows you to extrapolate your data and estimate future outcomes from past data if you have a near-linear relationship
  • Regression Coefficients
    • The slope of your model indicates the direction of the relationship and how much y changes for each unit change in x

SLR Formula and how to find it

To calculate \(\hat{y}=\alpha+\beta\text{x}\), we will use the least squares regression formula.

You’ll need to find a few things before calculating this line:

\(\bar{x}\): The mean value of x
\(\bar{y}\): The mean value of y
\(\text{S}_y\): The standard deviation of y
\(\text{S}_x\): The standard deviation of x
\(\text{r}\): The correlation coefficient

Using these values you can calculate \(\beta\) with:
\(\beta = r*\frac{\text{S}_y}{\text{S}_x}\)

Then \(\alpha\) with:
\(\alpha=\bar{y}-\beta*\bar{x}\)
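As a rough check of these two formulas, the sketch below computes \(\beta\) and \(\alpha\) by hand and compares them with what `lm` returns. Using petal width as x and petal length as y from the built-in `iris` data is an assumption made for illustration; any pair of numeric vectors would do.

```r
data("iris")

# Assumed choice of variables for illustration
x <- iris$Petal.Width
y <- iris$Petal.Length

# The five ingredients listed above
x_bar <- mean(x)
y_bar <- mean(y)
s_x <- sd(x)
s_y <- sd(y)
r <- cor(x, y)

# Slope and intercept from the least squares formulas
beta <- r * s_y / s_x
alpha <- y_bar - beta * x_bar

# Hand-computed values next to the ones lm() estimates
c(alpha = alpha, beta = beta)
coef(lm(Petal.Length ~ Petal.Width, data = iris))
```

The two printed results should agree up to floating-point rounding.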

Line of best fit on a graph

On the left is a graph of the Iris dataset in R with petal width on the x axis and petal length on the y axis.
The red line is the simple linear regression line for this dataset.

What's the Coefficient of Determination?

The coefficient of determination, \(\text{r}^2\), represents how well the simple linear regression line fits the data points, or more formally, it tells us the proportion of variation in y that can be attributed to x.

An \(\text{r}^2\) value closer to 1 means the regression line closely fits the data points, and a value closer to 0 means it doesn’t fit the data well.

You can calculate \(\text{r}^2\) with the formula:

\(\text{r}^2_\text{xy} = \left( \frac{\overline{xy}-\bar{x}\bar{y}}{\sqrt{\left( \overline{x^2}-\bar{x}^2\right)\left(\overline{y^2}-\bar{y}^2\right)}} \right)^2\)
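The formula can be sketched directly in R, where each overlined term is a plain `mean`. The choice of sepal width as x and sepal length as y is an assumption (it matches the pair used in the `lm` example later in these slides); the result is compared against the square of R's built-in `cor`.

```r
data("iris")

# Assumed choice of variables for illustration
x <- iris$Sepal.Width
y <- iris$Sepal.Length

# Numerator: mean of the products minus the product of the means
numerator <- mean(x * y) - mean(x) * mean(y)

# Denominator: square root of the product of the two variances (divide-by-n form)
denominator <- sqrt((mean(x^2) - mean(x)^2) * (mean(y^2) - mean(y)^2))

r_squared <- (numerator / denominator)^2

# Formula result next to the built-in correlation, squared
r_squared
cor(x, y)^2
```

The formula uses divide-by-n moments while `cor` uses divide-by-(n-1), but the factor cancels in the ratio, so the two values should match.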

Coefficient of determination demo

[1] 0.01382265

You can see here that the \(\text{r}^2\) value is very low because the data points are widely dispersed around the line.

How to calculate SLR in R

You can calculate a simple linear regression using functions built into R.

Use the lm (linear model) function with a formula defining the relationship between your dependent variable y and your independent variable x (written y ~ x), and the data object where your variables are defined.

data("iris")
(model <- lm(Sepal.Length ~ Sepal.Width, data = iris))
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
## 
## Coefficients:
## (Intercept)  Sepal.Width  
##      6.5262      -0.2234

The model outputs your intercept \(\alpha\) (6.5262) and your slope \(\beta\) (-0.2234).
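Once the model is fitted, the coefficients can be pulled out by name and used to predict. The sketch below does this by hand and with `predict`; the sepal width of 3.0 is a hypothetical new observation chosen for illustration.

```r
data("iris")
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)

# Extract intercept and slope by name
alpha <- coef(fit)[["(Intercept)"]]
beta <- coef(fit)[["Sepal.Width"]]

# Hypothetical new observation, invented for illustration
new_width <- 3.0

# Prediction by hand: y_hat = alpha + beta * x
manual <- alpha + beta * new_width

# The same prediction using predict()
built_in <- predict(fit, newdata = data.frame(Sepal.Width = new_width))

manual
built_in
```

Both approaches apply the same line, so the two predictions should be identical up to rounding.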

How to calculate SLR in R

You can use those coefficients in a graphing package like ggplot2.
Define the object holding your data, then your x and y.

library(ggplot2)

data("iris")
model <- lm(Sepal.Length ~ Sepal.Width, data = iris)
# Scatter plot with the fitted regression line drawn from the model coefficients
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point() +
  geom_abline(intercept = coef(model)[1], slope = coef(model)[2])

Then use geom_abline to add a line to the graph: take the coefficients from your model and pass them to the intercept and slope arguments.

Results on next slide

How to calculate SLR in R

While this line doesn't fit the data points very well, this is how you calculate and plot a simple linear regression in R.