October 14th, 2023

Overview

  • Linear regression is a statistical method used to model the relationship between a dependent variable (Y) and one or more independent variables (X).

  • It is often used to identify possible relationships and to understand how changes in the independent variables affect the dependent variable, and it has a wide variety of applications, especially in data science.

  • Linear regression can be broken down into two main types: Simple Linear Regression (SLR) and Multiple Linear Regression (MLR). SLR only involves one independent variable, whereas MLR involves two or more independent variables.

Mathematical Representation

The multiple linear regression model is as follows:

\[Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \dots + \beta_{n}X_{n} + \varepsilon \]

In the model above, \(Y\) represents the dependent variable, \(\beta_{0}\) represents the intercept, \(\beta_{1},\beta_{2},\dots,\beta_{n}\) represent the coefficients, \(X_{1},X_{2},\dots,X_{n}\) represent the independent variables, and \(\varepsilon\) represents the error term.

In essence, each coefficient describes the effect that the corresponding independent variable has on the dependent variable, and the intercept term \(\beta_{0}\) represents the baseline value of \(Y\) when all \(X\) values are zero.

The error term \(\varepsilon\) represents the variation in \(Y\) that the independent variables cannot account for.

For simple linear regression, there is only one independent variable (\(X_{1}\)).

Mathematical Representation (cont.)

The goal of linear regression is to estimate the coefficients that best fit the data. This is done using the method of least squares to minimize the sum of squared errors (residuals), or \(SSE\):

\[SSE = \sum_{i=1}^{n}(Y_i-(\beta_0+\beta_1X_{1i}+\beta_2X_{2i}+\dots+\beta_nX_{ni}))^2\]
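
In the simple linear regression case, minimizing \(SSE\) has a well-known closed-form solution for the two estimates:

\[\hat{\beta_1} = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^2}, \qquad \hat{\beta_0} = \bar{Y} - \hat{\beta_1}\bar{X}\]

where \(\bar{X}\) and \(\bar{Y}\) are the sample means. With multiple independent variables, the estimates come from the same minimization but are usually computed in matrix form by software.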

When this value is minimized, the coefficients are determined and the best-fit model can be expressed as:

\[\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X_1 + \hat{\beta_2}X_2 + \dots + \hat{\beta_n}X_n \]

In this model, \(\hat{Y}\) is the predicted value of \(Y\), and \(\hat{\beta_0},\hat{\beta_1},\hat{\beta_2},\dots,\hat{\beta_n}\) are the estimated coefficients.
This is the model we will be graphing to create best-fit lines for our data.

Data

As an example for this lesson, we will use the “mtcars” base R data set. Let’s take a look at the first few rows with head() to get a feel for the data before we dive in.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

This is a breakdown of each variable and its description:

  • mpg: Miles/gallon
  • cyl: Number of cylinders
  • disp: Displacement (cu.in.)
  • hp: Gross horsepower
  • drat: Rear axle ratio
  • wt: Weight (1000 lbs)
  • qsec: 1/4 mile time
  • vs: Engine shape (0 = V-shaped, 1 = straight)
  • am: Transmission (0 = automatic, 1 = manual)
  • gear: Number of forward gears
  • carb: Number of carburetors

Example Plot - SLR

Shown below is a scatter plot of MPG and Weight.

There seems to be an inverse relationship between the two. We can pin that relationship down more precisely with SLR if we first transform the data to a linear form.
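
As a sketch, the scatter plot can be reproduced in base R (the axis labels here are our own choice, not part of the data set):

# Scatter plot of MPG against weight
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles/gallon",
     main = "MPG vs. Weight")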

Example Plot (cont.)

The line above demonstrates a linear regression model with the relationship \(Y \sim \frac{1}{X}\). Note that even though the original relationship was not linear, we can transform the data first and still fit a linear model.
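
One way to draw such a curve, assuming the base-R scatter plot from the previous slide is already on screen, is to fit the transformed model and overlay its fitted values (a sketch, not necessarily how the original figure was produced):

# Fit mpg as a linear function of 1/wt; I() protects the transform in the formula
fit <- lm(mpg ~ I(1/wt), data = mtcars)
# Sort by weight so the overlaid curve is drawn smoothly from left to right
ord <- order(mtcars$wt)
lines(mtcars$wt[ord], fitted(fit)[ord], col = "red")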

Example Plot (cont.)

# Pair mpg (the response) with 1/wt (the predictor) in a data frame
df = data.frame(mtcars$mpg, 1/mtcars$wt)
# Given a data frame, lm() treats the first column as the response
lm(df)
## 
## Call:
## lm(formula = df)
## 
## Coefficients:
##  (Intercept)  X1.mtcars.wt  
##        4.386        45.829
lm() can be used in R to fit a linear model to data. As the output above shows, the model has an intercept of 4.386 and a coefficient of 45.829 on \(1/\text{wt}\). This means that if you take the reciprocal of the car's weight, multiply it by 45.829, and add 4.386, you get the estimated value of MPG.
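
The same fit can be written more idiomatically with a formula, which also makes prediction straightforward. A minimal sketch, where the weight of 3 (i.e., 3000 lbs) is just an illustrative input:

# Equivalent model using the formula interface
fit <- lm(mpg ~ I(1/wt), data = mtcars)
# Estimated MPG for a hypothetical car weighing 3000 lbs
predict(fit, newdata = data.frame(wt = 3))

The formula interface also keeps the original variable names, so its summary output is easier to read than the data-frame version above.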

Example Plot 1 (cont.) - summary(lm(df)) output

## 
## Call:
## lm(formula = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2767 -2.2586 -0.3398  1.1162  7.1822 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.386      1.536   2.855  0.00774 ** 
## X1.mtcars.wt   45.829      4.249  10.786 7.64e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.774 on 30 degrees of freedom
## Multiple R-squared:  0.795,  Adjusted R-squared:  0.7881 
## F-statistic: 116.3 on 1 and 30 DF,  p-value: 7.639e-12

Example Plot (cont.)

The summary values in the previous slide give us some insight into our linear model. We can discuss a few of the more relevant ones.

  • The R-squared represents the proportion of variance in the dependent variable that the model explains, ranging from 0 to 1. Values closer to 1 indicate a stronger fit, while values closer to 0 indicate a weaker one. Our R-squared of 0.795 is reasonably strong.

  • The p-value represents the probability of observing a relationship at least this strong if there were actually no relationship between the variables. A lower p-value means the observed correlation is less likely to be due to chance and thus more statistically significant; typically, p < 0.05 is used to mark a statistically significant relationship. In our case, the p-value is \(7.6 \times 10^{-12}\), which makes a purely random correlation extremely unlikely. (Both of these values can also be extracted in code, as shown below.)
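
A minimal sketch of pulling both values directly from the summary object, using the same model as before:

s <- summary(lm(mpg ~ I(1/wt), data = mtcars))
# Proportion of variance in mpg explained by the model
s$r.squared
# Coefficient table, including the t-statistics and p-values
s$coefficients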

Example Plot 2 - MLR

Now that we have a general understanding of linear regression with one IV, we can take a quick look at a model with multiple IVs. For example, shown below is a 3-dimensional scatter plot of 1/4 mile time as a function of the number of carburetors and horsepower. See if you can work out the relationship between the IVs and the DV!
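
As a starting point for that exercise, here is a minimal sketch of fitting the corresponding MLR model; whether carb and hp should enter linearly is exactly what the plot asks you to judge:

# Multiple linear regression: 1/4 mile time on carburetors and horsepower
fit_mlr <- lm(qsec ~ carb + hp, data = mtcars)
summary(fit_mlr)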