2024-09-22

Simple Linear Regression

Brief introduction and exploration into the relationship within data through the use of simple linear regression.

What is Linear Regression (In Simple Terms)

  • Linear regression (LR) aims to find out how a set of independent variables can affect a set of dependent variables
  • LR also aims to find a specific multiplier, commonly referred to as the beta,or the slope of the linear equation between the independent and dependent variables,
  • A fitted linear regression model aims to predict the dependent variable based off an input independent variable
  • The simplest and most commonly known method of linear regression is using the least squared method, which aims to minimize the total distance between the line of best fit derived, and each of the dependent variables based off of an existing dataset.

Assumptions for Simple Linear Regression

  • The relationship between the independent and dependent variable needs to be linear, this can be verified by looking at a scatter plot of the observations that you have
  • The dependent variables should not be correlated with each other, in layman terms, one value of the dependent variable should not be affected by another value of the dependent variable
  • The predictions from the model should be normally distributed
  • For all observations, the variance for each prediction must be the same.

The Linear Regression Equation

  • Below is the equation for simple linear regression \[ y = \beta_0 + \beta_1 x + \epsilon \]
  • In this equation, y is the target (the dependent variable)
  • beta 0 here is the y intercept, or what the target variable would be if the independent variable was 0 (not including error).
  • beta 1 is the slope of this equation, or how much the dependent variable would change if the independent variable changed by one unit.
  • x is the independent variable
  • epsilon is the error term which accounts for imperfections in the real world relationship between the independent and dependent variables.

How to calculate Beta 1 (Least Squares)

  • Below is the algorithm that would be used to calculate the beta 1 value, or the slope of the linear equation \[ \beta_1 = \frac{ \sum (x_i - \bar{x})(y_i - \bar{y}) }{ \sum (x_i - \bar{x})^2 } \]
  • In this equation, x_i and y_i are observations for the independent and dependent variables respectively
  • x_bar and y_bar represent the mean values of the independent and dependent variables respectively
  • The numerator represents covariance between x and y
  • The denominator represents the variance of x

Example: Predicting House Prices using Square Footage data (ggplot2)

Example Cont: Analyzing

  • Let’s generate some data to go through the steps of simple liner regression

  • Before going through the steps of linear regression, it is important to visualize your data to ensure it fits the assumptions for simple LR, and to give yourself a general idea of how the data fits together.

  • If you visualize the graph, you can clearly see a positive linear relation between the independent and dependent variables

  • In the next slide, we can see the fitted model graphed in red, as a line of best fit through the data

Example Cont: Fitting our Model

## `geom_smooth()` using formula = 'y ~ x'

Example 2: Differently scattered data (Plotly)

Example 2: Cont

  • We can see from visualizing this different set of data that we cannot see a linear relationship between the independent and dependent variable, so using simple linear regression in this case would not be wise.

Conclusion

  • Linear regression is a powerful tool that can be used to find a relationship between 2 variables (that have a linear relationship)
  • You can use this model to make predictions of your target variable based off an independent variable
  • Simple linear regression is widely used in various fields, from finance to biology, to analyze relationships between variables and make predictions.