2023-10-15

Linear Regression

To visualize and understand linear regression, we will look at the change in college tuition over time.

Topics Covered:

  • Independent & Dependent Variables
  • Linear Regression
  • Multiple Linear Regression
  • Mathematical Formula & Definition
  • Linear Regression Visualized
  • Conclusion of Visualization
  • Multiple Linear Regression Visualized
  • Conclusion of Visualization

Independent Variable

  • It’s called “independent” because its variation is not dependent on the variation of the dependent variable, but it might cause a change in the dependent variable.
  • It is typically plotted on the x-axis in a 2D plot, and on the x- and y-axes in a 3D plot
  • Generally, it is the variable that you suspect might have an effect on the dependent variable
  • In the context of linear regression it’s also known as the “predictor” or “explanatory” variable

source: https://www.khanacademy.org/math/statistics-probability

Dependent Variable

  • It is called “dependent” because it “depends” on the independent variable(s)
  • It is the main factor that you’re trying to understand or predict
  • In statistical modeling it is the variable whose changes you’re interested in observing
  • It is typically plotted on the y-axis in a 2D plot or the z-axis in a 3D plot
  • In the context of linear regression it’s also known as the “response” or “outcome” variable

source: https://www.khanacademy.org/math/statistics-probability

Linear Regression Definition

“Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. The method assumes that there’s a linear relationship between the dependent and independent variables. It predicts the response variable based on the linear relationship to the explanatory variables.”

source: https://www.itl.nist.gov/div898/handbook/pmd/section4/pmd41.htm

Linear Regression Formula

The formula for a simple linear regression is:

\[ Y = \beta_0 + \beta_1X + \epsilon \]

  • \(Y\): Dependent variable (what we want to predict)
  • \(\beta_0\): Intercept (constant term in the linear equation)
  • \(\beta_1\): Slope (effect of the independent variable on \(Y\))
  • \(X\): Independent variable (predictor)
  • \(\epsilon\): Error term (variation in \(Y\) not explained by \(X\))

This formula allows us to model the relationship between a dependent variable \(Y\) and an independent variable \(X\), where \(\beta_0\) and \(\beta_1\) are coefficients estimated from the data, and \(\epsilon\) is the error term.
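
As a concrete illustration, here is a minimal Python sketch that estimates \(\beta_0\) and \(\beta_1\) by least squares. The tuition figures are made-up placeholder values, not real data.

```python
import numpy as np
from scipy import stats

# Hypothetical data: year vs. average college tuition (placeholder values)
years = np.array([2000, 2005, 2010, 2015, 2020])
tuition = np.array([10_000, 13_500, 17_000, 21_000, 25_500])

# Fit Y = b0 + b1 * X by least squares
result = stats.linregress(years, tuition)
print(f"intercept (b0): {result.intercept:.2f}")
print(f"slope (b1):     {result.slope:.2f}")

# Predict tuition for a new year using the fitted line
year_new = 2023
prediction = result.intercept + result.slope * year_new
print(f"predicted tuition in {year_new}: {prediction:.2f}")
```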

Types of Linear Regression

  • Ordinary Least Squares (OLS):
    • OLS calculates the best-fitting line for the observed data by minimizing the sum of the squared vertical deviations from each data point to the line.
    • It is the most common method used for linear regression (see the sketch after this list).
  • Weighted Least Squares (WLS):
    • WLS is used when the residuals’ variance is not constant across observations, weighting each data point inversely to its variance.
  • Generalized Least Squares (GLS):
    • GLS is employed when the residuals are correlated (or have non-constant variance); it transforms the model to account for that known error covariance structure.
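
Below is a minimal sketch of how these three estimators can be invoked with the statsmodels Python library; the data, weights, and error covariance are illustrative placeholders.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data (placeholder values)
X = sm.add_constant(np.array([1.0, 2.0, 3.0, 4.0, 5.0]))  # adds the intercept column
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# OLS: minimizes the sum of squared residuals
ols_fit = sm.OLS(y, X).fit()

# WLS: down-weights observations with higher residual variance
weights = np.array([1.0, 1.0, 0.5, 0.5, 0.25])  # assumed known up to a constant
wls_fit = sm.WLS(y, X, weights=weights).fit()

# GLS: accounts for a known covariance structure among the errors
sigma = np.eye(5) + 0.3 * np.eye(5, k=1) + 0.3 * np.eye(5, k=-1)
gls_fit = sm.GLS(y, X, sigma=sigma).fit()

print(ols_fit.params, wls_fit.params, gls_fit.params)
```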

Types Continued

  • Ridge Regression (L2 regularization):
    • Ridge regression addresses multicollinearity by adding a degree of bias to the regression estimates, reducing the impact of correlated independent variables.
  • Lasso Regression (L1 regularization):
    • Lasso regression also addresses multicollinearity, but it can shrink some coefficient estimates exactly to zero, effectively performing feature selection.
  • Elastic Net Regression:
    • Combining Ridge and Lasso, Elastic Net adds both L1 and L2 penalties to the loss function; it is useful for datasets with many correlated features (all three are sketched below).
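
A minimal scikit-learn sketch of the three penalized estimators, assuming synthetic data with two nearly collinear predictors:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Illustrative data with two correlated predictors (placeholder values)
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=100)

# alpha controls the strength of the penalty in all three models
print(Ridge(alpha=1.0).fit(X, y).coef_)                      # shrinks both coefficients
print(Lasso(alpha=0.1).fit(X, y).coef_)                      # may zero one of them out
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)   # mixes L1 and L2
```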

Types Continued

  • Quantile Regression:
    • Quantile regression focuses on the median or other quantiles of the dependent variable, rather than the mean, providing a more comprehensive view of the relationship between variables.
  • Robust Regression:
    • Robust regression methods, like Huber regression or RANSAC, are designed to be less sensitive to outliers, providing more reliable estimates on noisy data (both approaches are sketched below).

source: https://hastie.su.domains/ISLRv2_website.pdf
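
A minimal scikit-learn sketch of quantile and robust (Huber) regression, using synthetic data with injected outliers:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, QuantileRegressor

# Illustrative data with a few outliers (placeholder values)
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(size=100)
y[:5] += 50  # inject outliers

# Huber regression: squared loss near the fit, linear loss for outliers
huber = HuberRegressor().fit(X, y)

# Quantile regression: fits the median (0.5 quantile) instead of the mean
median_fit = QuantileRegressor(quantile=0.5, alpha=0.0).fit(X, y)

print(huber.coef_, median_fit.coef_)
```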

Multiple Linear Regression

The formula for multiple linear regression is:

\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n + \epsilon \]

  • \(Y\): Dependent variable (what we want to predict)
  • \(\beta_0\): Intercept (constant term in the linear equation)
  • \(\beta_1, \beta_2, \ldots, \beta_n\): Coefficients of independent variables \(X_1, X_2, \ldots, X_n\)
  • \(X_1, X_2, \ldots, X_n\): Independent variables (predictors)
  • \(\epsilon\): Error term (variation in \(Y\) not explained by the predictors)

This formula allows us to model the relationship between a dependent variable \(Y\) and multiple independent variables \(X_1, X_2, \ldots, X_n\). The \(\beta\) coefficients are estimated from the data and represent the influence of each independent variable on the dependent variable.
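
As with the simple case, here is a minimal scikit-learn sketch; the year, enrollment, and tuition values are hypothetical placeholders, and enrollment is simply an assumed second predictor for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical predictors of tuition: year and enrollment (placeholder values)
X = np.array([
    [2000, 15_000],
    [2005, 16_200],
    [2010, 17_500],
    [2015, 18_100],
    [2020, 19_000],
])
y = np.array([10_000, 13_500, 17_000, 21_000, 25_500])  # tuition

model = LinearRegression().fit(X, y)
print("b0 (intercept):", model.intercept_)
print("b1, b2 (coefficients):", model.coef_)

# Predict tuition for a new (year, enrollment) pair
print(model.predict([[2023, 19_500]]))
```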

Linear Regression Visualized
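
A minimal matplotlib sketch of the kind of figure this section refers to: a scatter of hypothetical tuition values with the fitted regression line overlaid.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical tuition data (placeholder values)
years = np.array([2000, 2005, 2010, 2015, 2020])
tuition = np.array([10_000, 13_500, 17_000, 21_000, 25_500])

fit = stats.linregress(years, tuition)

plt.scatter(years, tuition, label="observed tuition")
plt.plot(years, fit.intercept + fit.slope * years, color="red", label="fitted line")
plt.xlabel("Year")
plt.ylabel("Tuition ($)")
plt.legend()
plt.show()
```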

Multiple Linear Regression Visualized
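
A minimal matplotlib sketch of the corresponding 3D view: with two predictors, the fitted model is a plane rather than a line. The data are the same hypothetical placeholders as above.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Hypothetical data: tuition vs. year and enrollment (placeholder values)
years = np.array([2000, 2005, 2010, 2015, 2020])
enrollment = np.array([15_000, 16_200, 17_500, 18_100, 19_000])
tuition = np.array([10_000, 13_500, 17_000, 21_000, 25_500])

X = np.column_stack([years, enrollment])
model = LinearRegression().fit(X, tuition)

# Evaluate the fitted plane on a grid for plotting
yy, ee = np.meshgrid(np.linspace(2000, 2020, 10), np.linspace(15_000, 19_000, 10))
plane = model.intercept_ + model.coef_[0] * yy + model.coef_[1] * ee

ax = plt.figure().add_subplot(projection="3d")
ax.scatter(years, enrollment, tuition, color="red")
ax.plot_surface(yy, ee, plane, alpha=0.3)
ax.set_xlabel("Year")
ax.set_ylabel("Enrollment")
ax.set_zlabel("Tuition ($)")
plt.show()
```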

Interactive Plot
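
A minimal sketch of how an interactive version could be built with Plotly Express; trendline="ols" overlays an ordinary-least-squares fit and requires statsmodels to be installed. The data are hypothetical placeholders.

```python
import pandas as pd
import plotly.express as px

# Hypothetical tuition data (placeholder values)
df = pd.DataFrame({
    "year": [2000, 2005, 2010, 2015, 2020],
    "tuition": [10_000, 13_500, 17_000, 21_000, 25_500],
})

# Scatter plot with an OLS trendline, rendered as an interactive figure
fig = px.scatter(df, x="year", y="tuition", trendline="ols",
                 title="College Tuition over Time")
fig.show()  # opens in the browser or notebook with hover, zoom, and pan
```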