This project aims to study Ridge Regression.
library(ISLR2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(magrittr)
##
## Attaching package: 'magrittr'
##
## The following object is masked from 'package:purrr':
##
## set_names
##
## The following object is masked from 'package:tidyr':
##
## extract
data("Hitters")
First, we omit the NA values in the Hitters data.
#Omit NA values
Hitters<-na.omit(Hitters)
#Structure of the Hitters data
str(Hitters)
## 'data.frame': 263 obs. of 20 variables:
## $ AtBat : int 315 479 496 321 594 185 298 323 401 574 ...
## $ Hits : int 81 130 141 87 169 37 73 81 92 159 ...
## $ HmRun : int 7 18 20 10 4 1 0 6 17 21 ...
## $ Runs : int 24 66 65 39 74 23 24 26 49 107 ...
## $ RBI : int 38 72 78 42 51 8 24 32 66 75 ...
## $ Walks : int 39 76 37 30 35 21 7 8 65 59 ...
## $ Years : int 14 3 11 2 11 2 3 2 13 10 ...
## $ CAtBat : int 3449 1624 5628 396 4408 214 509 341 5206 4631 ...
## $ CHits : int 835 457 1575 101 1133 42 108 86 1332 1300 ...
## $ CHmRun : int 69 63 225 12 19 1 0 6 253 90 ...
## $ CRuns : int 321 224 828 48 501 30 41 32 784 702 ...
## $ CRBI : int 414 266 838 46 336 9 37 34 890 504 ...
## $ CWalks : int 375 263 354 33 194 24 12 8 866 488 ...
## $ League : Factor w/ 2 levels "A","N": 2 1 2 2 1 2 1 2 1 1 ...
## $ Division : Factor w/ 2 levels "E","W": 2 2 1 1 2 1 2 2 1 1 ...
## $ PutOuts : int 632 880 200 805 282 76 121 143 0 238 ...
## $ Assists : int 43 82 11 40 421 127 283 290 0 445 ...
## $ Errors : int 10 14 3 4 25 7 9 19 0 22 ...
## $ Salary : num 475 480 500 91.5 750 ...
## $ NewLeague: Factor w/ 2 levels "A","N": 2 1 2 2 1 1 1 2 1 1 ...
## - attr(*, "na.action")= 'omit' Named int [1:59] 1 16 19 23 31 33 37 39 40 42 ...
## ..- attr(*, "names")= chr [1:59] "-Andy Allanson" "-Billy Beane" "-Bruce Bochte" "-Bob Boone" ...
#Dimension of Hitters data
dim(Hitters)
## [1] 263 20
Next, we transform the data into a matrix corresponding to the 19 predictors, with Salary as the response variable.
u<-model.matrix(Salary~.,data=Hitters)
#Dimension of new u
dim(u)
## [1] 263 20
#class u
class(u)
## [1] "matrix" "array"
#u is now a matrix
str(u)
## num [1:263, 1:20] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:263] "-Alan Ashby" "-Alvin Davis" "-Andre Dawson" "-Andres Galarraga" ...
## ..$ : chr [1:20] "(Intercept)" "AtBat" "Hits" "HmRun" ...
## - attr(*, "assign")= int [1:20] 0 1 2 3 4 5 6 7 8 9 ...
## - attr(*, "contrasts")=List of 3
## ..$ League : chr "contr.treatment"
## ..$ Division : chr "contr.treatment"
## ..$ NewLeague: chr "contr.treatment"
Create a new input x by dropping the first column of u (the intercept).
x<-u[,-1]
#Dependent variable y
y<-Hitters$Salary
Now, we apply ridge regression. The key parameter of ridge regression is its tuning parameter \(\lambda\).
In glmnet, the type of penalty is controlled by the alpha argument: when \(\alpha=0\) a ridge regression model is fit, and when \(\alpha=1\) a lasso model is fit.
Then, we set up a range for our \(\lambda\) from \(10^{7}\) to \(10^{-4}\).
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-6
grid<-10^seq(7,-4,length=100)
#Apply ridge regression
ridge_model<-glmnet(x,
y,
alpha=0, #Alpha=0 is ridge model, alpha=1 is lasso model
lambda=grid)
Associated with each value of \(\lambda\) is a vector of ridge regression coefficients, stored in a matrix that can be accessed by coef(). In this case, it is a 20x100 matrix, with 20 rows (one for each predictor, plus an intercept) and 100 columns (one for each value of \(\lambda\)).
dim(coef(ridge_model))
## [1] 20 100
We will store the coefficients of our ridge regression model in a new object.
coef<-coef(ridge_model)
#Each value of lambda has a corresponding column of coefficients (the intercept plus the predictors in x).
#Coefficients at the first lambda, lambda=10^7
ridge_model$lambda[1]
## [1] 1e+07
coef[,1]
## (Intercept) AtBat Hits HmRun Runs
## 5.357373e+02 5.443222e-05 1.974423e-04 7.955694e-04 3.338681e-04
## RBI Walks Years CAtBat CHits
## 3.526612e-04 4.150630e-04 1.697604e-03 4.673161e-06 1.719785e-05
## CHmRun CRuns CRBI CWalks LeagueN
## 1.296898e-04 3.449997e-05 3.560316e-05 3.766565e-05 -5.789555e-04
## DivisionW PutOuts Assists Errors NewLeagueN
## -7.806370e-03 2.179882e-05 3.560422e-06 -1.661300e-05 -1.144934e-04
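#L2 norm of the coefficients at the first lambda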
sqrt(sum(coef[-1,1]^2))
## [1] 0.008079254
#The 50th lambda in the grid
ridge_model$lambda[50]
## [1] 35.93814
coef[,50]
## (Intercept) AtBat Hits HmRun Runs
## 6.407070e+01 -5.026628e-01 2.320295e+00 -1.396185e+00 1.101672e+00
## RBI Walks Years CAtBat CHits
## 7.703019e-01 3.016126e+00 -7.692944e+00 2.554182e-03 1.199480e-01
## CHmRun CRuns CRBI CWalks LeagueN
## 6.607760e-01 2.553854e-01 2.387831e-01 -2.098318e-01 4.973103e+01
## DivisionW PutOuts Assists Errors NewLeagueN
## -1.209431e+02 2.577556e-01 1.444918e-01 -3.506943e+00 -1.386494e+01
#L2 norm corresponding to the 50th lambda of the model
sqrt(sum(coef[-1,50]^2))
## [1] 131.8458
We can see that for the first \(\lambda=10^7\), the L2 norm is nearly zero (about 0.008), while at \(\lambda=35.94\) the L2 norm is much larger than at the first \(\lambda\): the smaller the \(\lambda\), the less the coefficients are shrunk toward zero.
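To see the shrinkage over the whole path, and not just at two values of \(\lambda\), we can compute the L2 norm of the non-intercept coefficients for every column of coef. This is a small sketch reusing the coef and ridge_model objects created above; the norm should shrink toward zero as \(\lambda\) grows.
#A sketch: L2 norm of the non-intercept coefficients at every lambda in the grid
beta<-as.matrix(coef[-1,])
l2_norm<-apply(beta,2,function(b) sqrt(sum(b^2)))
plot(log10(ridge_model$lambda),l2_norm,type="l",
xlab="log10(lambda)",ylab="L2 norm of coefficients")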
We can use the predict() function to obtain the coefficients at any chosen \(\lambda\); with s = 0 the fit should be close to ordinary least squares (linear regression).
#at lambda=0
predict(ridge_model,
s=0,
type="coefficients")
## 20 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 163.95211942
## AtBat -1.97353462
## Hits 7.39015765
## HmRun 3.99693813
## Runs -2.22591655
## RBI -0.93581247
## Walks 6.20539522
## Years -3.57364393
## CAtBat -0.17711724
## CHits 0.20852299
## CHmRun 0.03249169
## CRuns 1.37945884
## CRBI 0.72075581
## CWalks -0.79765383
## LeagueN 63.39292928
## DivisionW -116.98446212
## PutOuts 0.28199547
## Assists 0.37406401
## Errors -3.41757649
## NewLeagueN -25.87187336
#Linear regression
ols<-lm(Salary~.,data=Hitters)
ols$coefficients
## (Intercept) AtBat Hits HmRun Runs RBI
## 163.1035878 -1.9798729 7.5007675 4.3308829 -2.3762100 -1.0449620
## Walks Years CAtBat CHits CHmRun CRuns
## 6.2312863 -3.4890543 -0.1713405 0.1339910 -0.1728611 1.4543049
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.8077088 -0.8115709 62.5994230 -116.8492456 0.2818925 0.3710692
## Errors NewLeagueN
## -3.3607605 -24.7623251
You can see that when \(\lambda=0\), the ridge regression model estimates approximately the same coefficients as the linear regression model.
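To see this agreement side by side, we can bind the two coefficient vectors into one table. This is a small sketch reusing the ridge_model and ols objects above; the small discrepancies come from glmnet's approximation at s = 0.
#A sketch: ridge coefficients at s=0 next to the OLS coefficients
ridge_at_zero<-as.numeric(predict(ridge_model,s=0,type="coefficients"))
round(data.frame(ridge=ridge_at_zero,ols=ols$coefficients),3)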
We will use the ridge regression model at \(\lambda=0\) to predict Salary and compare the predictions to the actual values in the data.
#Predicted values from the ridge regression model at lambda=0
predicted_ridge<-predict(ridge_model,
s=0,
newx=x)
head(predicted_ridge)
## s1
## -Alan Ashby 360.0936
## -Alvin Davis 712.5337
## -Andre Dawson 1170.1580
## -Andres Galarraga 556.4385
## -Alfredo Griffin 491.2903
## -Al Newman 248.7863
#Compare the results using an MSE helper function
mse<-function(x,y){
return(mean((x-y)^2))
}
mse(predicted_ridge,y)
## [1] 92024.73
Now, we compare this with the MSE of the values predicted by the linear regression model above.
mse(ols$fitted.values,y)
## [1] 92017.87
The difference between the two MSEs is small.
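To put "small" in perspective, we can express the gap as a fraction of the OLS training MSE (a quick sketch using the mse() helper defined above; from the two values printed above this ratio is roughly 7e-05).
#Relative gap between the two training MSEs
(mse(predicted_ridge,y)-mse(ols$fitted.values,y))/mse(ols$fitted.values,y)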
Now, we can plot how the coefficients change as \(\lambda\) changes.
plot(ridge_model, xvar = "lambda", label = T)
We use 10-fold cross-validation to find a suitable \(\lambda\) for our model.
k_folds<-cv.glmnet(x,y,
alpha=0,
lambda=grid,
nfolds=10)
mse_lambda <- function(ridge_cv, ...) {
mse_lam <- data.frame(Lambda = ridge_cv$lambda,
MSE = ridge_cv$cvm)
mse_lam %>% ggplot(aes(log(Lambda), MSE)) +
geom_line() + geom_point(color = "blue", alpha = 0.3) +
geom_point(data = mse_lam %>% filter(MSE == min(MSE)), color = "red", size = 3) +
labs(title = "MSE V.s Lambda",
subtitle = "Note: The red point corresponds to the smallest value of the MSE") +
theme_minimal()
}
mse_lambda(k_folds)
#The minimum MSE and the corresponding lambda
min(k_folds$cvm)
## [1] 114053.1
k_folds$lambda.min
## [1] 3.593814
Now we know that the \(\lambda\) giving the minimum cross-validated MSE is 3.593814. We will find the corresponding ridge regression coefficients.
#The ridge regression coefficients
coef[,which.min(k_folds$cvm)]
## (Intercept) AtBat Hits HmRun Runs
## 1.522216e+02 -1.665928e+00 5.831975e+00 8.813105e-01 -4.913888e-01
## RBI Walks Years CAtBat CHits
## -1.900933e-03 5.344774e+00 -1.000930e+01 -6.456691e-02 2.078165e-01
## CHmRun CRuns CRBI CWalks LeagueN
## 7.050186e-01 7.475146e-01 3.784916e-01 -6.280898e-01 6.166499e+01
## DivisionW PutOuts Assists Errors NewLeagueN
## -1.222984e+02 2.801035e-01 2.986216e-01 -3.736192e+00 -2.797838e+01
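Indexing coef by which.min(k_folds$cvm) works here only because cv.glmnet() was run with the same grid as ridge_model. An arguably safer way, sketched below, is to ask the cross-validation object directly for the coefficients at lambda.min; it should return the same values here.
#Equivalent: coefficients at lambda.min taken straight from the cv.glmnet object
coef(k_folds,s="lambda.min")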
#Predicted values at the lambda corresponding to the smallest cross-validated MSE
predicted_min<-predict(ridge_model,
s=k_folds$lambda.min,
newx=x)
mse(predicted_min,y)
## [1] 93154.09
mse(ols$fitted.values,y)
## [1] 92017.87
The MSE of the ridge regression is larger than the MSE of OLS, which suggests that OLS works better here. However, this is a comparison on only one dataset, evaluated on the same data the models were fit to. For a better comparison, we can apply the two models to 100 different random train/test splits.
MSE_ridge<-c()
MSE_ols<-c()
for (i in 1:100) {
set.seed(i)
index<-sample(2,nrow(Hitters),replace=TRUE,prob=c(0.5,0.5))
train<-Hitters[index==1,]
test<-Hitters[index==2,]
#Transform train data into matrix
x_ridge_train<-model.matrix(Salary~.,data=train)
x_ridge_train<-x_ridge_train[,-1]
y_ridge_train<-train$Salary
#Transform test data into matrix
x_ridge_test<-model.matrix(Salary~.,data=test)
x_ridge_test<-x_ridge_test[,-1]
y_ridge_test<-test$Salary
#Build the ridge model on the training data
ridge_model_1<-glmnet(x_ridge_train,
y_ridge_train,
alpha=0, #Alpha=0 is ridge model, alpha=1 is lasso model
lambda=k_folds$lambda.min)
#Predict the test data with the ridge model
predict_value<-predict(ridge_model_1,
s=k_folds$lambda.min,
newx=x_ridge_test)
mse_ridge_test<-mse(predict_value,y_ridge_test)
MSE_ridge<-c(MSE_ridge,mse_ridge_test)
#Fit the linear regression model on the training data
linear_model<-lm(Salary~.,data=train)
#Predict
predict_ols<-predict(linear_model,
test %>% select(-Salary))
mse_ols_test<-mse(predict_ols, test$Salary)
MSE_ols<-c(MSE_ols,mse_ols_test)
}
df_result<- data.frame(Model = c(rep("Ridge", 100), rep("OLS", 100)),
MSE = c(MSE_ridge, MSE_ols))
ggplot(data=df_result,aes(Model, MSE)) + geom_boxplot()
The plot shows that, when the two models are applied to 100 different train/test splits, the test MSE of the ridge regression model is considerably lower than that of the linear regression model.
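We can back up the visual impression with summary statistics of the two MSE distributions (a quick sketch using dplyr, which is already loaded via the tidyverse).
#Mean, median and standard deviation of the MSE for each model over the 100 splits
df_result %>%
group_by(Model) %>%
summarise(mean_MSE=mean(MSE),
median_MSE=median(MSE),
sd_MSE=sd(MSE))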
We can use a hypothesis test to check whether there is a difference between the average MSEs of the two models. Set up the hypotheses:
\(H_0:\mu_{ridge}=\mu_{ols}\)
\(H_a:\mu_{ridge}<\mu_{ols}\)
t.test(df_result$MSE~df_result$Model,
alternative="greater")
##
## Welch Two Sample t-test
##
## data: df_result$MSE by df_result$Model
## t = 18.486, df = 177.36, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group OLS and group Ridge is greater than 0
## 95 percent confidence interval:
## 48048.4 Inf
## sample estimates:
## mean in group OLS mean in group Ridge
## 134047.19 81278.86
Since the p-value is less than \(\alpha=0.05\), we reject the null hypothesis. The data therefore give enough evidence to support that the mean MSE of the ridge regression model is smaller than that of the linear regression model.
We will now apply ridge regression to predict Boston house prices.
library(mlbench)
data("BostonHousing")
str(BostonHousing)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : num 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ b : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
We will predict the median house value (medv) using the other predictors.
x<-model.matrix(medv~.,data=BostonHousing)
x<-x[,-1]
y<-BostonHousing$medv
#Create the sequence of lambda from 10^7 to 10^-4
grid<-10^seq(7,-4,length=100)
ridge_model_boston<-glmnet(x,y,
alpha=0,
lambda=grid)
#Use cross-validation to find an appropriate lambda
k_folds<-cv.glmnet(x,y,
alpha=0,
lambda=grid,
nfolds=10)
mse_lambda(k_folds)
#The minimum MSE
min(k_folds$cvm)
## [1] 23.59885
#The corresponding lambda
k_folds$lambda.min
## [1] 0.1
#Ridge coefficients with the optimal lambda
coef(ridge_model_boston)[,which.min(k_folds$cvm)]
## (Intercept) crim zn indus chas1
## 3.452557e+01 -1.029866e-01 4.316565e-02 3.548493e-03 2.751901e+00
## nox rm age dis rad
## -1.650679e+01 3.869711e+00 -4.085906e-04 -1.408433e+00 2.652875e-01
## tax ptratio b lstat
## -1.040403e-02 -9.324151e-01 9.281445e-03 -5.152642e-01
#Predict on the full dataset with the optimal lambda
predicted_boston<-predict(ridge_model_boston,s=k_folds$lambda.min,
newx=x)
mse(predicted_boston,y)
## [1] 21.92165
#Compare with prediction by linear regression
ols_model<-lm(medv~.,data=BostonHousing)
mse(ols_model$fitted.values,y)
## [1] 21.89483
The MSE of OLS is slightly smaller than that of the ridge regression. However, as the experiment above suggests, when the models are evaluated on held-out data across many splits, ridge regression usually gives better results than OLS.
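Note that both MSEs above are computed on the same data the models were fit to. For a more honest comparison we could hold out a test set; below is a minimal sketch of one such split (an arbitrary seed and a 70/30 split, reusing the mse() helper and the grid defined above).
#A minimal sketch: one train/test split on BostonHousing
set.seed(1)
train_id<-sample(nrow(BostonHousing),size=round(0.7*nrow(BostonHousing)))
train_b<-BostonHousing[train_id,]
test_b<-BostonHousing[-train_id,]
x_train<-model.matrix(medv~.,data=train_b)[,-1]
x_test<-model.matrix(medv~.,data=test_b)[,-1]
#Ridge: choose lambda by cross-validation on the training data only
cv_b<-cv.glmnet(x_train,train_b$medv,alpha=0,lambda=grid,nfolds=10)
ridge_b<-glmnet(x_train,train_b$medv,alpha=0,lambda=grid)
ridge_test_mse<-mse(predict(ridge_b,s=cv_b$lambda.min,newx=x_test),test_b$medv)
#OLS fit on the training data, evaluated on the test data
ols_b<-lm(medv~.,data=train_b)
ols_test_mse<-mse(predict(ols_b,newdata=test_b),test_b$medv)
c(ridge=ridge_test_mse,ols=ols_test_mse)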