2023-10-15

Linear Regression

To visualize and understand linear regression, we will look at the change in college tuition over time.

Topics Covered:

  • Independent & Dependent Variables
  • Linear Regression
  • Multiple Linear Regression
  • Mathematical Formula & Definition
  • Linear Regression Visualized
  • Conclusion of Visualization
  • Multiple Linear Regression Visualized
  • Conclusion of Visualization

Independent Variable

  • It’s called “independent” because its variation is not dependent on the variation of the dependent variable, but it might cause a change in the dependent variable.
  • It is typically plotted on the x-axis in a 2D plot, and on the x- and y-axes in a 3D plot
  • Generally, it is the variable that you suspect might have an effect on the dependent variable
  • In the context of linear regression it’s also known as the “predictor” or “explanatory” variable

source: https://www.khanacademy.org/math/statistics-probability

Dependent Variable

  • It is called “dependent” because it “depends” on the independent variable(s)
  • It is the main factor that you’re trying to understand or predict
  • In statistical modeling it is the variable whose changes you’re interested in observing
  • It is typically plotted on the y-axis in a 2D plot or the z-axis in a 3D plot
  • In the context of linear regression it’s also known as the “response” or “outcome” variable

source: https://www.khanacademy.org/math/statistics-probability

Linear Regression Definition

“Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. The method assumes that there’s a linear relationship between the dependent and independent variables. It predicts the response variable based on the linear relationship to the explanatory variables.”

source: https://www.itl.nist.gov/div898/handbook/pmd/section4/pmd41.htm

Linear Regression Formula

The formula for a simple linear regression is:

\[ Y = \beta_0 + \beta_1X + \epsilon \]

  • \(Y\): Dependent variable (what we want to predict)
  • \(\beta_0\): Intercept (constant term in the linear equation)
  • \(\beta_1\): Slope (effect of the independent variable on \(Y\))
  • \(X\): Independent variable (predictor)
  • \(\epsilon\): Error term (variation in \(Y\) not explained by \(X\))

This formula allows us to model the relationship between a dependent variable \(Y\) and an independent variable \(X\), where \(\beta_0\) and \(\beta_1\) are coefficients estimated from the data, and \(\epsilon\) is the error term.
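
As a concrete illustration, here is a minimal Python sketch that estimates \(\beta_0\) and \(\beta_1\) by least squares. The tuition figures are made-up placeholder values, not real data.

```python
import numpy as np
from scipy import stats

# Hypothetical data: year vs. average college tuition (placeholder values)
years = np.array([2000, 2005, 2010, 2015, 2020])
tuition = np.array([10_000, 13_500, 17_000, 21_000, 25_500])

# Fit Y = b0 + b1 * X by least squares
result = stats.linregress(years, tuition)
print(f"intercept (b0): {result.intercept:.2f}")
print(f"slope (b1):     {result.slope:.2f}")

# Predict tuition for a new year using the fitted line
year_new = 2023
prediction = result.intercept + result.slope * year_new
print(f"predicted tuition in {year_new}: {prediction:.2f}")
```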

Types of Linear Regression

  • Ordinary Least Squares (OLS):
    • OLS calculates the best-fitting line for the observed data by minimizing the sum of the squared vertical deviations from each data point to the line.
    • It is the most common method used for linear regression (see the sketch after this list).
  • Weighted Least Squares (WLS):
    • WLS is used when the residuals’ variance is not constant across observations, weighting each data point inversely to its variance.
  • Generalized Least Squares (GLS):
    • GLS is employed when the residuals are correlated (or have non-constant variance); it transforms the model to account for that known error covariance structure.
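
Below is a minimal sketch of how these three estimators can be invoked with the statsmodels Python library; the data, weights, and error covariance are illustrative placeholders.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data (placeholder values)
X = sm.add_constant(np.array([1.0, 2.0, 3.0, 4.0, 5.0]))  # adds the intercept column
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# OLS: minimizes the sum of squared residuals
ols_fit = sm.OLS(y, X).fit()

# WLS: down-weights observations with higher residual variance
weights = np.array([1.0, 1.0, 0.5, 0.5, 0.25])  # assumed known up to a constant
wls_fit = sm.WLS(y, X, weights=weights).fit()

# GLS: accounts for a known covariance structure among the errors
sigma = np.eye(5) + 0.3 * np.eye(5, k=1) + 0.3 * np.eye(5, k=-1)
gls_fit = sm.GLS(y, X, sigma=sigma).fit()

print(ols_fit.params, wls_fit.params, gls_fit.params)
```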

Types Continued

  • Ridge Regression (L2 regularization):
    • Ridge regression addresses multicollinearity by adding a degree of bias to the regression estimates, reducing the impact of correlated independent variables.
  • Lasso Regression (L1 regularization):
    • Lasso regression also addresses multicollinearity, but it can shrink some coefficient estimates exactly to zero, effectively performing feature selection.
  • Elastic Net Regression:
    • Combining Ridge and Lasso, Elastic Net adds both L1 and L2 penalties to the loss function; it is useful for datasets with many correlated features (all three are sketched below).
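
A minimal scikit-learn sketch of the three penalized estimators, assuming synthetic data with two nearly collinear predictors:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Illustrative data with two correlated predictors (placeholder values)
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=100)

# alpha controls the strength of the penalty in all three models
print(Ridge(alpha=1.0).fit(X, y).coef_)                      # shrinks both coefficients
print(Lasso(alpha=0.1).fit(X, y).coef_)                      # may zero one of them out
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)   # mixes L1 and L2
```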

Types Continued

  • Quantile Regression:
    • Quantile regression focuses on the median or other quantiles of the dependent variable, rather than the mean, providing a more comprehensive view of the relationship between variables.
  • Robust Regression:
    • Robust regression methods, like Huber regression or RANSAC, are designed to be less sensitive to outliers, providing more reliable estimates on noisy data (both approaches are sketched below).

source: https://hastie.su.domains/ISLRv2_website.pdf
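
A minimal scikit-learn sketch of quantile and robust (Huber) regression, using synthetic data with injected outliers:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, QuantileRegressor

# Illustrative data with a few outliers (placeholder values)
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(size=100)
y[:5] += 50  # inject outliers

# Huber regression: squared loss near the fit, linear loss for outliers
huber = HuberRegressor().fit(X, y)

# Quantile regression: fits the median (0.5 quantile) instead of the mean
median_fit = QuantileRegressor(quantile=0.5, alpha=0.0).fit(X, y)

print(huber.coef_, median_fit.coef_)
```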

Multiple Linear Regression

The formula for multiple linear regression is:

\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n + \epsilon \]

  • \(Y\): Dependent variable (what we want to predict)
  • \(\beta_0\): Intercept (constant term in the linear equation)
  • \(\beta_1, \beta_2, \ldots, \beta_n\): Coefficients of independent variables \(X_1, X_2, \ldots, X_n\)
  • \(X_1, X_2, \ldots, X_n\): Independent variables (predictors)
  • \(\epsilon\): Error term (variation in \(Y\) not explained by the predictors)

This formula allows us to model the relationship between a dependent variable \(Y\) and multiple independent variables \(X_1, X_2, \ldots, X_n\). The \(\beta\) coefficients are estimated from the data and represent the influence of each independent variable on the dependent variable.
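
As with the simple case, here is a minimal scikit-learn sketch; the year, enrollment, and tuition values are hypothetical placeholders, and enrollment is simply an assumed second predictor for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical predictors of tuition: year and enrollment (placeholder values)
X = np.array([
    [2000, 15_000],
    [2005, 16_200],
    [2010, 17_500],
    [2015, 18_100],
    [2020, 19_000],
])
y = np.array([10_000, 13_500, 17_000, 21_000, 25_500])  # tuition

model = LinearRegression().fit(X, y)
print("b0 (intercept):", model.intercept_)
print("b1, b2 (coefficients):", model.coef_)

# Predict tuition for a new (year, enrollment) pair
print(model.predict([[2023, 19_500]]))
```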

Linear Regression Visualized
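
A minimal matplotlib sketch of the kind of figure this section refers to: a scatter of hypothetical tuition values with the fitted regression line overlaid.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical tuition data (placeholder values)
years = np.array([2000, 2005, 2010, 2015, 2020])
tuition = np.array([10_000, 13_500, 17_000, 21_000, 25_500])

fit = stats.linregress(years, tuition)

plt.scatter(years, tuition, label="observed tuition")
plt.plot(years, fit.intercept + fit.slope * years, color="red", label="fitted line")
plt.xlabel("Year")
plt.ylabel("Tuition ($)")
plt.legend()
plt.show()
```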

Multiple Linear Regression Visualized
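
A minimal matplotlib sketch of the corresponding 3D view: with two predictors, the fitted model is a plane rather than a line. The data are the same hypothetical placeholders as above.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Hypothetical data: tuition vs. year and enrollment (placeholder values)
years = np.array([2000, 2005, 2010, 2015, 2020])
enrollment = np.array([15_000, 16_200, 17_500, 18_100, 19_000])
tuition = np.array([10_000, 13_500, 17_000, 21_000, 25_500])

X = np.column_stack([years, enrollment])
model = LinearRegression().fit(X, tuition)

# Evaluate the fitted plane on a grid for plotting
yy, ee = np.meshgrid(np.linspace(2000, 2020, 10), np.linspace(15_000, 19_000, 10))
plane = model.intercept_ + model.coef_[0] * yy + model.coef_[1] * ee

ax = plt.figure().add_subplot(projection="3d")
ax.scatter(years, enrollment, tuition, color="red")
ax.plot_surface(yy, ee, plane, alpha=0.3)
ax.set_xlabel("Year")
ax.set_ylabel("Enrollment")
ax.set_zlabel("Tuition ($)")
plt.show()
```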

Interactive Plot
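
A minimal sketch of how an interactive version could be built with Plotly Express; trendline="ols" overlays an ordinary-least-squares fit and requires statsmodels to be installed. The data are hypothetical placeholders.

```python
import pandas as pd
import plotly.express as px

# Hypothetical tuition data (placeholder values)
df = pd.DataFrame({
    "year": [2000, 2005, 2010, 2015, 2020],
    "tuition": [10_000, 13_500, 17_000, 21_000, 25_500],
})

# Scatter plot with an OLS trendline, rendered as an interactive figure
fig = px.scatter(df, x="year", y="tuition", trendline="ols",
                 title="College Tuition over Time")
fig.show()  # opens in the browser or notebook with hover, zoom, and pan
```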