CUNYSPS DATA624 Presentation
Jonathan Hernandez
This presentation will be about linear regression and its alternatives, such as partial least squares and penalized regression models. I will go over each one in detail, starting with linear regression.
Linear regression is one of the most basic tools a data scientist or analyst can use for analyzing data. It is also a prerequisite for other, more advanced regression and machine learning techniques.
In its general form, the model is
\[ \begin{align} y_i = b_0 + b_1x_{i1} + b_2x_{i2} + \dots + b_Px_{iP} + \epsilon_i \tag{1} \end{align} \]
where \(y_i\) is the response for the \(i\)-th observation, \(b_0\) is the intercept, \(b_1, \dots, b_P\) are the coefficients for the \(P\) predictors \(x_{i1}, \dots, x_{iP}\), and \(\epsilon_i\) is the random error term.
When performing linear regression, the goal is to find coefficients that minimize the sum-of-squared errors (SSE) between the actual value of the response and the predicted response.
The SSE is defined as follows:
\[ \begin{align} SSE = \sum_{i=1}^n (y_i - \hat{y_i})^2 \tag{2} \end{align} \]
The coefficient estimates for each predictor can be collected into a vector \(\hat{\beta}\).
It can be shown that the coefficient vector minimizing the SSE is given by the matrix equation below (deriving it involves a combination of matrix algebra and solving a system of partial derivatives). There is a problem with this solution, which will be addressed later in the presentation.
\[ \begin{align} \hat{\beta} = (X^TX)^{-1}X^Ty \tag{3} \end{align} \]
Multiple linear regression has several flaws. For example, it assumes the response is a linear function of the predictors; if the data have some non-linear structure in the mix, the multiple linear regression technique may not be accurate.
There are many statistics that can tell you whether the multiple linear regression model you are using is a good one or not.
Large residuals from a multiple regression fit are a sign that the model is not capturing the data well.
If the data are shown to not have near-constant variance, this is also an indicator that the model needs to be adjusted and may need transformations, feature extraction, etc.
If the residuals do not follow a normal or near-normal distribution, that is also a bad sign.
Q-Q plots are also a useful diagnostic for checking the normality of the residuals.
If there are more predictors than observations, the matrix \(X^TX\) in Eq. (3) cannot be inverted (a tiny toy demonstration appears after this list).
Other metrics like \(R^2\), \(R^2_{adj}\), p-value and F-test come in handy as well. R’s lm() function gives you these statistics.
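To make the invertibility point concrete, here is a small toy check (this simulated matrix is not part of the mtcars example; the object names are mine): with more predictors than observations, \(X^TX\) is rank-deficient and cannot be inverted.
# toy example: 5 observations, 10 predictors
set.seed(1)
X_wide <- matrix(rnorm(5 * 10), nrow = 5, ncol = 10)
XtX <- t(X_wide) %*% X_wide   # 10 x 10, but its rank is at most 5
qr(XtX)$rank                  # should report 5, not 10
# solve(XtX) would fail here because XtX is singular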
# fit a linear model
colnames(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
lm_mtcars <- lm(mpg ~ ., data = mtcars)
summary(lm_mtcars)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
library(Matrix)
X <- as.matrix(mtcars[,2:11])
y <- as.matrix(mtcars$mpg)
solve(t(X) %*% X) %*%
(t(X) %*% y)
## [,1]
## cyl 0.35082641
## disp 0.01354278
## hp -0.02054767
## drat 1.24158213
## wt -3.82613150
## qsec 1.19139689
## vs 0.18972068
## am 2.83222230
## gear 1.05426253
## carb -0.26321386
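As a side note, the by-hand solution above does not include a column of ones for the intercept, which is why its estimates differ from the lm() coefficients on the earlier slide. A minimal sketch adding the intercept column (the object name X1 is just for illustration):
# prepend a column of ones so the normal equation also estimates the intercept
X1 <- cbind(Intercept = 1, X)
solve(t(X1) %*% X1) %*% (t(X1) %*% y)
# these estimates should now match coef(lm_mtcars) from the earlier slide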
We saw from the previous slide that R’s lm() function along with the summary() function gave us some insightful metrics for our model. Let’s review them in summary:
Residuals: These values summarize the residuals of our fitted model, showing the minimum, 1st quartile, median, 3rd quartile and maximum.
Coefficients: For each predictor in our model, this shows the estimated coefficient \(b_i\), the standard error of the coefficient, its t-value and its p-value. Note that a low p-value (< 0.05 at the 95% confidence level) means the predictor is significant in the model: we reject the null hypothesis (\(b_i = 0\)) in favor of the alternative (\(b_i \neq 0\)).
Significance codes: These characters indicate whether the p-value falls below certain thresholds. They are used mainly for readability.
Residual standard error: This is nothing more than the positive square root of the mean squared error, with the MSE computed using the residual degrees of freedom, that is \(RSE = \sqrt{SSE/(n-k-1)}\).
R-Squared and Adjusted R-Squared: The \(R^2\) and \(R^2_{adj}\) are the R-squared and Adjusted R-squared values respectively. Mathematically, they are calculated as
\[ \begin{aligned} R^2 &= 1 - \frac{SSE}{SST} = 1 - \frac{\sum{(y_i - \hat{y_i})^2}}{\sum{(y_i - \bar{y})^2}} \\ R^2_{adj} &= 1 - (1 - R^2)\frac{n-1}{n-k-1} \end{aligned} \] with \(\hat{y_i}\) = estimate of \(y_i\) (from the multiple regression model), \(\bar{y}\) = mean of the y values, and n and k the number of observations and predictors respectively. These quantities are recomputed by hand in the short sketch below.
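To connect these formulas to the summary() output shown earlier, here is a quick check that recomputes the fit statistics by hand for the mtcars model (the variable names here are mine, not from the original slides):
# recompute the fit statistics by hand for lm_mtcars
n   <- nrow(mtcars)                  # number of observations
k   <- length(coef(lm_mtcars)) - 1   # number of predictors (excluding the intercept)
sse <- sum(residuals(lm_mtcars)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2     <- 1 - sse / sst                          # should match Multiple R-squared (0.869)
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - k - 1)   # should match Adjusted R-squared (0.8066)
rse    <- sqrt(sse / (n - k - 1))                # should match Residual standard error (2.65)
c(R2 = r2, R2_adj = r2_adj, RSE = rse)
# p-values for each coefficient, as discussed above
summary(lm_mtcars)$coefficients[, "Pr(>|t|)"]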
Finally, let’s look at some plots to assess whether our model is a good fit for explaining the relationship to mpg (miles per gallon).
Residual plots (if the residuals follow a near-normal distribution, this is a good sign):
hist(lm_mtcars$residuals)
qqnorm(lm_mtcars$residuals)
qqline(lm_mtcars$residuals)
plot(mtcars$mpg, lm_mtcars$residuals)
abline(0, 0)
Now that we know about linear regression, let’s introduce the idea of penalized models.
Sometimes a model may overfit and may not be able to accurately predict future observations. Penalized models are also helpful when there are issues with collinearity.
Another issue is that large correlations between features make the variance of the coefficient estimates very large. Penalized methods/models help to reduce this variance while introducing some bias into the picture.
I will talk about 3 penalized models in this presentation
Ridge Regression
LASSO regression (Least Absolute Shrinkage and Selection Operator)
Elastic net (Ridge Regression and LASSO penalties combined)
These regularization methods impose penalties on the coefficients, and R has functions for applying these techniques and for finding the optimal value of \(\lambda\), which we’ll see in the next slide.
\[ \begin{aligned} SSE_{L_2} = \sum_{i=1}^n (y_i - \hat{y_i})^2 + \lambda \sum_{j=1}^P \beta_j^2 \end{aligned} \]
The \(L_2\) signifies a second-order penalty, hence the squared parameter estimates in the penalty term.
What the second term (the penalty) does is ensure that the coefficient estimates are only allowed to become large if there is a proportional reduction in the SSE.
A large \(\lambda\) shrinks the estimates toward 0. However, the parameters never become exactly zero, so ridge regression does not perform feature selection.
library(glmnet)
## Loading required package: foreach
## Loaded glmnet 2.0-16
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# fit a ridge regression model using all features
# alpha = 0 is the ridge regression method alpha =1 is LASSO
# variables X and y are from earlier slides regarding linear regression
# center the response variable mpg
y <- y %>% scale(center = TRUE, scale = FALSE)
ridge_mpg <- glmnet(X, y, alpha = 0, standardize = TRUE)
summary(ridge_mpg)
## Length Class Mode
## a0 100 -none- numeric
## beta 1000 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
cv_fit_mpg <- cv.glmnet(X, y, alpha=0, standardize = TRUE)
plot(cv_fit_mpg)
The two vertical lines represent the minimum \(\lambda\) value and the \(\lambda\) value whose cross-validated error is within one standard error of the minimum, respectively.
Min \(\lambda\) for our model:
min_lambda <- cv_fit_mpg$lambda.min
min_lambda
## [1] 2.502772
coef(cv_fit_mpg, s = "lambda.min")
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.080660889
## cyl -0.368057153
## disp -0.005179902
## hp -0.011713150
## drat 1.053216800
## wt -1.264212474
## qsec 0.164975032
## vs 0.756163433
## am 1.655635458
## gear 0.546651086
## carb -0.559817881
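To visualize the shrinkage just described (coefficients approaching, but never exactly reaching, zero as \(\lambda\) grows), one can draw the coefficient paths of the ridge fit; this plot is an extra illustration, not part of the original slides, and it reuses ridge_mpg and min_lambda from above.
# coefficient paths: each curve is one predictor's estimate as log(lambda) increases
plot(ridge_mpg, xvar = "lambda", label = TRUE)
abline(v = log(min_lambda), lty = 2)   # dashed line at the cross-validated lambda.min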
\[ \begin{aligned} SSE_{L_1} = \sum_{i=1}^n (y_i - \hat{y_i})^2 + \lambda \sum_{j=1}^P |{\beta_j}| \end{aligned} \]
Another neat feature is that LASSO may shrink a coefficient exactly to 0, so it also performs feature selection.
Much like ridge regression, we can get the \(\lambda\) value that minimizes the MSE as well as extract the coefficients at that value. Note that if a feature’s coefficient shows just a ‘.’, it is because LASSO penalized that coefficient to zero and has performed feature selection.
In R, when using glmnet() and cv.glmnet(), set the alpha parameter to 1 to let R know you’re doing the LASSO technique.
Let’s see an example in the next slide of LASSO in action.
# fit a LASSO regression model using all features
# alpha = 0 is the ridge regression method alpha = 1 is LASSO
# variables X and y are from earlier slides; y was already centered in the ridge regression example
lasso_mpg <- glmnet(X, y, alpha=1, standardize = TRUE)
summary(lasso_mpg)
## Length Class Mode
## a0 79 -none- numeric
## beta 790 dgCMatrix S4
## df 79 -none- numeric
## dim 2 -none- numeric
## lambda 79 -none- numeric
## dev.ratio 79 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
lasso_cv_fit_mpg <- cv.glmnet(X, y, alpha=1, standardize = TRUE)
plot(lasso_cv_fit_mpg)
lasso_min_lambda <- lasso_cv_fit_mpg$lambda.min
lasso_min_lambda
## [1] 0.8007036
coef(lasso_cv_fit_mpg, s = "lambda.min")
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 15.90939176
## cyl -0.88608541
## disp .
## hp -0.01168438
## drat .
## wt -2.70814703
## qsec .
## vs .
## am .
## gear .
## carb .
Using “lambda.min” as the \(\lambda\) value, LASSO kept only three predictors (cyl, hp and wt) in our model and penalized the rest to zero.
For either ridge regression or LASSO regression, one can also use “lambda.1se”, which is the largest \(\lambda\) whose cross-validated error is within one standard error of the minimum.
Using that gives a simpler, more heavily penalized model; its out-of-sample MSE may end up larger or smaller depending on the data (a short sketch follows).
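A minimal sketch of inspecting the “lambda.1se” alternative for the LASSO fit above (the exact values will vary from run to run because the cross-validation folds are random):
# largest lambda whose cross-validated error is within one standard error of the minimum
lasso_cv_fit_mpg$lambda.1se
# coefficients at that more heavily penalized lambda; expect as many or more '.' (dropped) entries
coef(lasso_cv_fit_mpg, s = "lambda.1se")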
I’ve talked about ridge and LASSO regression. Both are good penalization methods but what if you wanted to use both?
LASSO does well when there aren’t many parameters and some are already close to 0
Ridge regression works well if there are many parameters of almost the same value.
LASSO may also drop variables that actually have a large influence on the response variable.
Introducing elastic net, which combines both penalties; its objective function is as follows:
\[ \begin{aligned} SSE_{E_{net}} = \sum_{i=1}^n (y_i - \hat{y_i})^2 + \lambda_1 \sum_{j=1}^P \beta_j^2 + \lambda_2 \sum_{j=1}^P |{\beta_j}| \end{aligned} \]
With this, you get the feature selection power of LASSO and penalties of both methods in one.
Elastic Net example
# Elastic Net, use alpha = 0.5
enet_mpg <- glmnet(X, y, alpha=0.5, standardize = TRUE)
summary(enet_mpg)
## Length Class Mode
## a0 79 -none- numeric
## beta 790 dgCMatrix S4
## df 79 -none- numeric
## dim 2 -none- numeric
## lambda 79 -none- numeric
## dev.ratio 79 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
enet_cv_fit_mpg <- cv.glmnet(X, y, alpha=0.5, standardize = TRUE)
plot(enet_cv_fit_mpg)
enet_min_lambda <- enet_cv_fit_mpg$lambda.min
enet_min_lambda
## [1] 0.6932109
coef(enet_cv_fit_mpg, s = "lambda.min")
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 11.217785623
## cyl -0.640442781
## disp -0.000866403
## hp -0.014738425
## drat 0.624077142
## wt -2.134845217
## qsec .
## vs 0.362323658
## am 1.272728484
## gear .
## carb -0.335989062
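Note that glmnet parametrizes the elastic net with a single \(\lambda\) and a mixing weight alpha rather than two separate \(\lambda\) values as in the formula above; alpha = 0.5 weights the two penalties equally. A rough sketch of comparing a few alpha values by their minimum cross-validated MSE, reusing X and y from the earlier slides (the object names, seed and alpha grid are illustrative only):
# compare a small grid of alpha values by minimum cross-validated MSE
set.seed(123)                                      # CV folds are random
foldid <- sample(rep(1:10, length.out = nrow(X)))  # fix the folds so the alphas are compared fairly
alphas <- c(0, 0.25, 0.5, 0.75, 1)
cv_mse <- sapply(alphas, function(a) {
  min(cv.glmnet(X, y, alpha = a, foldid = foldid, standardize = TRUE)$cvm)
})
data.frame(alpha = alphas, min_cv_mse = cv_mse)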
Linear regression enables us to create linear models whose coefficients minimize the SSE.
There are various ways and diagnostics to see if your linear model is a good fit to your data.
There is a matrix solution for the coefficients that minimize the SSE, but it may not work if the data exhibit collinearity and/or the number of predictors exceeds the number of observations, in which case the matrix \(X^TX\) cannot be inverted.
Penalized methods allow us to decrease the variance while introducing some bias into our model. This also helps reduce the MSE of the model.
Ridge regression, LASSO and elastic net can accomplish this by picking the minimum \(\lambda\) or the one-standard-error \(\lambda\) (you can also supply your own range of \(\lambda\) values and select the number of cross-validation folds, as sketched after this list).
You can also compare the penalization methods (not shown on the slides) by computing each one’s \(R^2\) value, AIC or BIC.
You don’t have to select the \(\lambda\) that minimizes the MSE; you can select your own \(\lambda\). Experiment with different values and numbers of folds to tune your model.
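As mentioned above, you are not tied to the default \(\lambda\) sequence or fold count. A minimal sketch of supplying both to cv.glmnet(), again reusing X and y from earlier slides (the grid and fold count here are arbitrary choices for illustration):
# supply a custom lambda grid and use 5-fold cross-validation instead of the default 10
lambda_grid <- 10^seq(2, -3, length.out = 100)
custom_cv <- cv.glmnet(X, y, alpha = 1, lambda = lambda_grid, nfolds = 5, standardize = TRUE)
custom_cv$lambda.min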
Thank you!!!