CUNYSPS DATA624 Presentation

Chapter 6 of Applied Predictive Modeling: Linear Regression and Its Cousins

Jonathan Hernandez

Introduction

This presentation is about linear regression and its alternatives, such as partial least squares and penalized regression models. I will go over each one in detail, starting with ordinary linear regression.

Linear Regression

\[ \begin{align} y_i = b_0 + b_1x_{i1} + b_2x_{i2} + \cdots + b_Px_{iP} + \epsilon_i \tag{1} \end{align} \]

where \(y_i\) is the response for the \(i\)th observation, \(x_{i1}, \ldots, x_{iP}\) are its \(P\) predictor values, \(b_0, b_1, \ldots, b_P\) are the regression coefficients, and \(\epsilon_i\) is the random error term.

Ordinary Least Squares

\[ \begin{align} SSE = \sum_{i=1}^n (y_i - \hat{y_i})^2 \tag{2} \end{align} \]

The coefficient estimates that minimize the SSE have the closed-form solution

\[ \begin{align} \hat{\beta} = (X^TX)^{-1}X^Ty \tag{3} \end{align} \]

where \(X\) is the matrix of predictor values and \(y\) is the vector of responses.

Multiple Linear Regression Caveats

Table Summary of Metrics When Fitting a Multiple Linear Regression Model

Source: http://r-statistics.co/Linear-Regression.html
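The linked page tabulates common fit metrics for a multiple linear regression model (R-squared, adjusted R-squared, the F-statistic, AIC, BIC, and others); the table itself is not reproduced here. As a rough companion sketch (not from the original slides), most of these metrics can be pulled directly from an lm object:

# extract several common fit metrics from a multiple linear regression model
fit <- lm(mpg ~ ., data = mtcars)  # same model as the example on the next slide
s <- summary(fit)
c(r_squared     = s$r.squared,
  adj_r_squared = s$adj.r.squared,
  f_statistic   = unname(s$fstatistic["value"]),
  AIC           = AIC(fit),
  BIC           = BIC(fit))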

Linear Regression Example: (good fit and not-so-good fit)

# list the columns of mtcars; mpg is the response, the rest are predictors
colnames(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
# fit a linear model of mpg on all remaining predictors
lm_mtcars <- lm(mpg ~ ., data = mtcars)
summary(lm_mtcars)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07
# OLS coefficients via the normal equations: (X^T X)^{-1} X^T y
# note: no intercept column is included here, so the estimates below
# differ from the lm() fit above, which does include an intercept
X <- as.matrix(mtcars[, 2:11]) # predictor matrix (all columns except mpg)
y <- as.matrix(mtcars$mpg)     # response vector
solve(t(X) %*% X) %*% (t(X) %*% y)
##             [,1]
## cyl   0.35082641
## disp  0.01354278
## hp   -0.02054767
## drat  1.24158213
## wt   -3.82613150
## qsec  1.19139689
## vs    0.18972068
## am    2.83222230
## gear  1.05426253
## carb -0.26321386
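The hand-computed coefficients differ from the lm() fit above because the matrix X contains no intercept column. A small sketch (not in the original slides): adding a column of ones recovers the intercept and reproduces the lm() estimates.

# add an intercept column of ones; the result should match coef(lm_mtcars)
X1 <- cbind(intercept = 1, X)
solve(t(X1) %*% X1) %*% (t(X1) %*% y)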

Summary of Results

\[ \begin{aligned} R^2 = 1 - SSE/SST = 1 - \sum{(y_i - \hat{y}_i)^2} \big/ \sum{(y_i - \bar{y})^2} \\ R^2_{adj} = 1 - (1 - R^2)\big( (n-1)/(n-k-1) \big) \end{aligned} \] with \(\hat{y}\) the fitted values from the multiple regression model, \(\bar{y}\) the mean of the observed \(y\) values, and \(n\) and \(k\) the number of observations and predictors, respectively.
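As a quick check (not in the original slides), both quantities can be computed by hand from the fitted model and compared against the summary(lm_mtcars) output above.

# R^2 and adjusted R^2 computed directly from the fitted values
sse <- sum((mtcars$mpg - fitted(lm_mtcars))^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2 <- 1 - sse / sst
n <- nrow(mtcars)  # 32 observations
k <- 10            # 10 predictors
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
c(r2 = r2, r2_adj = r2_adj)  # should match the Multiple and Adjusted R-squared above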

Residual Plots

# histogram of the residuals (should look roughly symmetric around zero)
hist(lm_mtcars$residuals)

# normal Q-Q plot of the residuals
qqnorm(lm_mtcars$residuals)
qqline(lm_mtcars$residuals)

# residuals against the observed mpg values, with a reference line at zero
plot(mtcars$mpg, lm_mtcars$residuals)
abline(0, 0)
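Base R also provides a standard set of diagnostic plots for lm objects (residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage); a one-call sketch:

# built-in diagnostic plots for the fitted model, shown in a 2x2 grid
par(mfrow = c(2, 2))
plot(lm_mtcars)
par(mfrow = c(1, 1))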

Penalized Models

Ridge Regression

\[ \begin{aligned} SSE_{L_2} = \sum_{i=1}^n (y_i - \hat{y_i})^2 + \lambda \sum_{j=1}^P \beta_j^2 \end{aligned} \]

Ridge Regression Example

library(glmnet)
## Loading required package: foreach
## Loaded glmnet 2.0-16
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# fit a ridge regression model using all features
# alpha = 0 gives ridge regression; alpha = 1 gives the LASSO
# variables X and y are from earlier slides regarding linear regression

# center the response variable mpg
y <- y %>% scale(center = TRUE, scale = FALSE)
ridge_mpg <- glmnet(X, y, alpha = 0, standardize = TRUE)
summary(ridge_mpg)
##           Length Class     Mode   
## a0         100   -none-    numeric
## beta      1000   dgCMatrix S4     
## df         100   -none-    numeric
## dim          2   -none-    numeric
## lambda     100   -none-    numeric
## dev.ratio  100   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         5   -none-    call   
## nobs         1   -none-    numeric
cv_fit_mpg <- cv.glmnet(X, y, alpha=0, standardize =  TRUE)
plot(cv_fit_mpg)

min_lambda <- cv_fit_mpg$lambda.min
min_lambda
## [1] 2.502772
coef(cv_fit_mpg, s = "lambda.min")
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  1.080660889
## cyl         -0.368057153
## disp        -0.005179902
## hp          -0.011713150
## drat         1.053216800
## wt          -1.264212474
## qsec         0.164975032
## vs           0.756163433
## am           1.655635458
## gear         0.546651086
## carb        -0.559817881
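cv.glmnet also reports lambda.1se, the largest lambda whose cross-validated error is within one standard error of the minimum. A quick comparison (not in the original slides) shows how this more heavily penalized choice shrinks the coefficients further:

# coefficients at the more conservative lambda.1se, for comparison with lambda.min
coef(cv_fit_mpg, s = "lambda.1se")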

LASSO (Least Absolute Shrinkage and Selection Operator)

\[ \begin{aligned} SSE_{L_1} = \sum_{i=1}^n (y_i - \hat{y_i})^2 + \lambda \sum_{j=1}^P |{\beta_j}| \end{aligned} \]

LASSO Example

# fit a LASSO regression model using all features
# alpha = 0 gives ridge regression; alpha = 1 gives the LASSO
# X and the already-centered y are carried over from the ridge regression slides

lasso_mpg <- glmnet(X, y, alpha = 1, standardize = TRUE)
summary(lasso_mpg)
##           Length Class     Mode   
## a0         79    -none-    numeric
## beta      790    dgCMatrix S4     
## df         79    -none-    numeric
## dim         2    -none-    numeric
## lambda     79    -none-    numeric
## dev.ratio  79    -none-    numeric
## nulldev     1    -none-    numeric
## npasses     1    -none-    numeric
## jerr        1    -none-    numeric
## offset      1    -none-    logical
## call        5    -none-    call   
## nobs        1    -none-    numeric
lasso_cv_fit_mpg <- cv.glmnet(X, y, alpha=1, standardize = TRUE)
plot(lasso_cv_fit_mpg)

lasso_min_lambda <- lasso_cv_fit_mpg$lambda.min
lasso_min_lambda
## [1] 0.8007036
coef(lasso_cv_fit_mpg, s = "lambda.min")
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept) 15.90939176
## cyl         -0.88608541
## disp         .         
## hp          -0.01168438
## drat         .         
## wt          -2.70814703
## qsec         .         
## vs           .         
## am           .         
## gear         .         
## carb         .
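As a quick follow-up (not in the original slides), predictions at the selected lambda come directly from the cv.glmnet object; since y was centered on the ridge regression slide, these predictions are on the centered mpg scale.

# predictions at lambda.min (centered scale, because y was centered earlier)
lasso_pred <- predict(lasso_cv_fit_mpg, newx = X, s = "lambda.min")
head(lasso_pred)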

Elastic Net (Best of Both Worlds)

\[ \begin{aligned} SSE_{E_{net}} = \sum_{i=1}^n (y_i - \hat{y_i})^2 + \lambda_1 \sum_{j=1}^P \beta_j^2 + \lambda_2 \sum_{j=1}^P |{\beta_j}| \end{aligned} \]

# Elastic Net: alpha = 0.5 gives an equal mix of the ridge and LASSO penalties

enet_mpg <- glmnet(X, y, alpha = 0.5, standardize = TRUE)
summary(enet_mpg)
enet_cv_fit_mpg <- cv.glmnet(X, y, alpha=0.5, standardize = TRUE)
plot(enet_cv_fit_mpg)

enet_min_lambda <- enet_cv_fit_mpg$lambda.min
enet_min_lambda
## [1] 0.6932109
coef(enet_cv_fit_mpg, s = "lambda.min")
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 11.217785623
## cyl         -0.640442781
## disp        -0.000866403
## hp          -0.014738425
## drat         0.624077142
## wt          -2.134845217
## qsec         .          
## vs           0.362323658
## am           1.272728484
## gear         .          
## carb        -0.335989062
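In practice alpha is a tuning parameter too. A minimal sketch (not from the original slides) of comparing a few alpha values by their minimum cross-validated error; for a strictly fair comparison, the same foldid argument should be passed to every cv.glmnet call.

# compare a handful of ridge/LASSO mixes by minimum cross-validated error
set.seed(624)  # arbitrary seed for reproducible folds
alphas <- c(0, 0.25, 0.5, 0.75, 1)
min_cv_error <- sapply(alphas, function(a) {
  min(cv.glmnet(X, y, alpha = a, standardize = TRUE)$cvm)
})
data.frame(alpha = alphas, min_cv_error = min_cv_error)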

Summary and Wrap Up

Thank you!!!