This project aims to study Ridge Regression.
library(ISLR2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(magrittr)
##
## Attaching package: 'magrittr'
##
## The following object is masked from 'package:purrr':
##
## set_names
##
## The following object is masked from 'package:tidyr':
##
## extract
data("Hitters")
First, we omit the NA values in the Hitters data.
#Omit NA values
Hitters<-na.omit(Hitters)
#Structure of the Hitters data
str(Hitters)
## 'data.frame': 263 obs. of 20 variables:
## $ AtBat : int 315 479 496 321 594 185 298 323 401 574 ...
## $ Hits : int 81 130 141 87 169 37 73 81 92 159 ...
## $ HmRun : int 7 18 20 10 4 1 0 6 17 21 ...
## $ Runs : int 24 66 65 39 74 23 24 26 49 107 ...
## $ RBI : int 38 72 78 42 51 8 24 32 66 75 ...
## $ Walks : int 39 76 37 30 35 21 7 8 65 59 ...
## $ Years : int 14 3 11 2 11 2 3 2 13 10 ...
## $ CAtBat : int 3449 1624 5628 396 4408 214 509 341 5206 4631 ...
## $ CHits : int 835 457 1575 101 1133 42 108 86 1332 1300 ...
## $ CHmRun : int 69 63 225 12 19 1 0 6 253 90 ...
## $ CRuns : int 321 224 828 48 501 30 41 32 784 702 ...
## $ CRBI : int 414 266 838 46 336 9 37 34 890 504 ...
## $ CWalks : int 375 263 354 33 194 24 12 8 866 488 ...
## $ League : Factor w/ 2 levels "A","N": 2 1 2 2 1 2 1 2 1 1 ...
## $ Division : Factor w/ 2 levels "E","W": 2 2 1 1 2 1 2 2 1 1 ...
## $ PutOuts : int 632 880 200 805 282 76 121 143 0 238 ...
## $ Assists : int 43 82 11 40 421 127 283 290 0 445 ...
## $ Errors : int 10 14 3 4 25 7 9 19 0 22 ...
## $ Salary : num 475 480 500 91.5 750 ...
## $ NewLeague: Factor w/ 2 levels "A","N": 2 1 2 2 1 1 1 2 1 1 ...
## - attr(*, "na.action")= 'omit' Named int [1:59] 1 16 19 23 31 33 37 39 40 42 ...
## ..- attr(*, "names")= chr [1:59] "-Andy Allanson" "-Billy Beane" "-Bruce Bochte" "-Bob Boone" ...
#Dimension of Hitters data
dim(Hitters)
## [1] 263 20
Next, we transform the data into a matrix corresponding to the 19 predictors, with Salary as the response variable.
u<-model.matrix(Salary~.,data=Hitters)
#Dimension of new u
dim(u)
## [1] 263 20
#class u
class(u)
## [1] "matrix" "array"
#u is now a matrix
str(u)
## num [1:263, 1:20] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:263] "-Alan Ashby" "-Alvin Davis" "-Andre Dawson" "-Andres Galarraga" ...
## ..$ : chr [1:20] "(Intercept)" "AtBat" "Hits" "HmRun" ...
## - attr(*, "assign")= int [1:20] 0 1 2 3 4 5 6 7 8 9 ...
## - attr(*, "contrasts")=List of 3
## ..$ League : chr "contr.treatment"
## ..$ Division : chr "contr.treatment"
## ..$ NewLeague: chr "contr.treatment"
Create a new input x by dropping the first column of u (the intercept).
x<-u[,-1]
#Dependent variable y
y<-Hitters$Salary
Now, we apply ridge regression. The key parameter of ridge regression is its tuning parameter \(\lambda\).
In glmnet, the type of penalty is controlled by the alpha argument: when \(\alpha=0\) a ridge regression model is fit, and when \(\alpha=1\) a lasso model is fit.
Then, we set up a range for our \(\lambda\) from \(10^{7}\) to \(10^{-4}\).
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-6
grid<-10^seq(7,-4,length=100)
#Apply ridge regression
ridge_model<-glmnet(x,
y,
alpha=0, #Alpha=0 is ridge model, alpha=1 is lasso model
lambda=grid)
Associated with each value of \(\lambda\) is a vector of ridge regression coefficients, stored in a matrix that can be accessed by coef(). In this case, it is a 20x100 matrix, with 20 rows (one for each predictor, plus an intercept) and 100 columns (one for each value of \(\lambda\)).
dim(coef(ridge_model))
## [1] 20 100
We will store the coefficients of our ridge regression model in a new object.
coef<-coef(ridge_model)
#Each value of lambda has a corresponding column of coefficients (the intercept plus the predictors in x).
#Coefficients at the first lambda, lambda=10^7
ridge_model$lambda[1]
## [1] 1e+07
coef[,1]
## (Intercept) AtBat Hits HmRun Runs
## 5.357373e+02 5.443222e-05 1.974423e-04 7.955694e-04 3.338681e-04
## RBI Walks Years CAtBat CHits
## 3.526612e-04 4.150630e-04 1.697604e-03 4.673161e-06 1.719785e-05
## CHmRun CRuns CRBI CWalks LeagueN
## 1.296898e-04 3.449997e-05 3.560316e-05 3.766565e-05 -5.789555e-04
## DivisionW PutOuts Assists Errors NewLeagueN
## -7.806370e-03 2.179882e-05 3.560422e-06 -1.661300e-05 -1.144934e-04
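#L2 norm of the coefficients at the first lambda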
sqrt(sum(coef[-1,1]^2))
## [1] 0.008079254
#The 50th lambda in the grid
ridge_model$lambda[50]
## [1] 35.93814
coef[,50]
## (Intercept) AtBat Hits HmRun Runs
## 6.407070e+01 -5.026628e-01 2.320295e+00 -1.396185e+00 1.101672e+00
## RBI Walks Years CAtBat CHits
## 7.703019e-01 3.016126e+00 -7.692944e+00 2.554182e-03 1.199480e-01
## CHmRun CRuns CRBI CWalks LeagueN
## 6.607760e-01 2.553854e-01 2.387831e-01 -2.098318e-01 4.973103e+01
## DivisionW PutOuts Assists Errors NewLeagueN
## -1.209431e+02 2.577556e-01 1.444918e-01 -3.506943e+00 -1.386494e+01
#L2 norm corresponding to the 50th lambda of the model
sqrt(sum(coef[-1,50]^2))
## [1] 131.8458
We can see that for the first \(\lambda=10^7\), the L2 norm is nearly zero (about 0.008), while at \(\lambda=35.94\) the L2 norm is much larger than at the first \(\lambda\): the smaller the \(\lambda\), the less the coefficients are shrunk toward zero.
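To see the shrinkage over the whole path, and not just at two values of \(\lambda\), we can compute the L2 norm of the non-intercept coefficients for every column of coef. This is a small sketch reusing the coef and ridge_model objects created above; the norm should shrink toward zero as \(\lambda\) grows.
#A sketch: L2 norm of the non-intercept coefficients at every lambda in the grid
beta<-as.matrix(coef[-1,])
l2_norm<-apply(beta,2,function(b) sqrt(sum(b^2)))
plot(log10(ridge_model$lambda),l2_norm,type="l",
xlab="log10(lambda)",ylab="L2 norm of coefficients")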
We can use the predict() function to obtain the coefficients at any chosen \(\lambda\); with s = 0 the fit should be close to ordinary least squares (linear regression).
#at lambda=0
predict(ridge_model,
s=0,
type="coefficients")
## 20 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 163.95211942
## AtBat -1.97353462
## Hits 7.39015765
## HmRun 3.99693813
## Runs -2.22591655
## RBI -0.93581247
## Walks 6.20539522
## Years -3.57364393
## CAtBat -0.17711724
## CHits 0.20852299
## CHmRun 0.03249169
## CRuns 1.37945884
## CRBI 0.72075581
## CWalks -0.79765383
## LeagueN 63.39292928
## DivisionW -116.98446212
## PutOuts 0.28199547
## Assists 0.37406401
## Errors -3.41757649
## NewLeagueN -25.87187336
#Linear regression
ols<-lm(Salary~.,data=Hitters)
ols$coefficients
## (Intercept) AtBat Hits HmRun Runs RBI
## 163.1035878 -1.9798729 7.5007675 4.3308829 -2.3762100 -1.0449620
## Walks Years CAtBat CHits CHmRun CRuns
## 6.2312863 -3.4890543 -0.1713405 0.1339910 -0.1728611 1.4543049
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.8077088 -0.8115709 62.5994230 -116.8492456 0.2818925 0.3710692
## Errors NewLeagueN
## -3.3607605 -24.7623251
You can see that when \(\lambda=0\), the ridge regression model estimates approximately the same coefficients as the linear regression model.
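To see this agreement side by side, we can bind the two coefficient vectors into one table. This is a small sketch reusing the ridge_model and ols objects above; the small discrepancies come from glmnet's approximation at s = 0.
#A sketch: ridge coefficients at s=0 next to the OLS coefficients
ridge_at_zero<-as.numeric(predict(ridge_model,s=0,type="coefficients"))
round(data.frame(ridge=ridge_at_zero,ols=ols$coefficients),3)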
We will use the ridge regression model at \(\lambda=0\) to predict Salary and compare the predictions to the actual values in the data.
#Predicted values from the ridge regression model at lambda=0
predicted_ridge<-predict(ridge_model,
s=0,
newx=x)
head(predicted_ridge)
## s1
## -Alan Ashby 360.0936
## -Alvin Davis 712.5337
## -Andre Dawson 1170.1580
## -Andres Galarraga 556.4385
## -Alfredo Griffin 491.2903
## -Al Newman 248.7863
#Compare the results using an MSE helper function
mse<-function(x,y){
return(mean((x-y)^2))
}
mse(predicted_ridge,y)
## [1] 92024.73
Now, we compare this with the MSE of the values predicted by the linear regression model above.
mse(ols$fitted.values,y)
## [1] 92017.87
The difference between the two MSEs is small.
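To put "small" in perspective, we can express the gap as a fraction of the OLS training MSE (a quick sketch using the mse() helper defined above; from the two values printed above this ratio is roughly 7e-05).
#Relative gap between the two training MSEs
(mse(predicted_ridge,y)-mse(ols$fitted.values,y))/mse(ols$fitted.values,y)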
Now, we can plot how the coefficients change as \(\lambda\) changes.
plot(ridge_model, xvar = "lambda", label = T)
We use 10-fold cross-validation to find a suitable \(\lambda\) for our model.
k_folds<-cv.glmnet(x,y,
alpha=0,
lambda=grid,
nfolds=10)
mse_lambda <- function(ridge_cv, ...) {
mse_lam <- data.frame(Lambda = ridge_cv$lambda,
MSE = ridge_cv$cvm)
mse_lam %>% ggplot(aes(log(Lambda), MSE)) +
geom_line() + geom_point(color = "blue", alpha = 0.3) +
geom_point(data = mse_lam %>% filter(MSE == min(MSE)), color = "red", size = 3) +
labs(title = "MSE V.s Lambda",
subtitle = "Note: The red point corresponds to the smallest value of the MSE") +
theme_minimal()
}
mse_lambda(k_folds)
#The minimum MSE and the corresponding lambda
min(k_folds$cvm)
## [1] 114053.1
k_folds$lambda.min
## [1] 3.593814
Now we know that the \(\lambda\) giving the minimum cross-validated MSE is 3.593814. We will find the corresponding ridge regression coefficients.
#The ridge regression coefficients
coef[,which.min(k_folds$cvm)]
## (Intercept) AtBat Hits HmRun Runs
## 1.522216e+02 -1.665928e+00 5.831975e+00 8.813105e-01 -4.913888e-01
## RBI Walks Years CAtBat CHits
## -1.900933e-03 5.344774e+00 -1.000930e+01 -6.456691e-02 2.078165e-01
## CHmRun CRuns CRBI CWalks LeagueN
## 7.050186e-01 7.475146e-01 3.784916e-01 -6.280898e-01 6.166499e+01
## DivisionW PutOuts Assists Errors NewLeagueN
## -1.222984e+02 2.801035e-01 2.986216e-01 -3.736192e+00 -2.797838e+01
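Indexing coef by which.min(k_folds$cvm) works here only because cv.glmnet() was run with the same grid as ridge_model. An arguably safer way, sketched below, is to ask the cross-validation object directly for the coefficients at lambda.min; it should return the same values here.
#Equivalent: coefficients at lambda.min taken straight from the cv.glmnet object
coef(k_folds,s="lambda.min")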
#Predicted values at the lambda corresponding to the smallest cross-validated MSE
predicted_min<-predict(ridge_model,
s=k_folds$lambda.min,
newx=x)
mse(predicted_min,y)
## [1] 93154.09
mse(ols$fitted.values,y)
## [1] 92017.87
The MSE of the ridge regression is larger than the MSE of OLS, which suggests that OLS works better here. However, this is a comparison on only one dataset, evaluated on the same data the models were fit to. For a better comparison, we can apply the two models to 100 different random train/test splits.
MSE_ridge<-c()
MSE_ols<-c()
for (i in 1:100) {
set.seed(i)
index<-sample(2,nrow(Hitters),replace=TRUE,prob=c(0.5,0.5))
train<-Hitters[index==1,]
test<-Hitters[index==2,]
#Transform train data into matrix
x_ridge_train<-model.matrix(Salary~.,data=train)
x_ridge_train<-x_ridge_train[,-1]
y_ridge_train<-train$Salary
#Transform test data into matrix
x_ridge_test<-model.matrix(Salary~.,data=test)
x_ridge_test<-x_ridge_test[,-1]
y_ridge_test<-test$Salary
#Build the ridge model on the training data
ridge_model_1<-glmnet(x_ridge_train,
y_ridge_train,
alpha=0, #Alpha=0 is ridge model, alpha=1 is lasso model
lambda=k_folds$lambda.min)
#Predict the test data with the ridge model
predict_value<-predict(ridge_model_1,
s=k_folds$lambda.min,
newx=x_ridge_test)
mse_ridge_test<-mse(predict_value,y_ridge_test)
MSE_ridge<-c(MSE_ridge,mse_ridge_test)
#Fit the linear regression model on the training data
linear_model<-lm(Salary~.,data=train)
#Predict
predict_ols<-predict(linear_model,
test %>% select(-Salary))
mse_ols_test<-mse(predict_ols, test$Salary)
MSE_ols<-c(MSE_ols,mse_ols_test)
}
df_result<- data.frame(Model = c(rep("Ridge", 100), rep("OLS", 100)),
MSE = c(MSE_ridge, MSE_ols))
ggplot(data=df_result,aes(Model, MSE)) + geom_boxplot()
The plot shows that, when the two models are applied to 100 different train/test splits, the test MSE of the ridge regression model is considerably lower than that of the linear regression model.
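We can back up the visual impression with summary statistics of the two MSE distributions (a quick sketch using dplyr, which is already loaded via the tidyverse).
#Mean, median and standard deviation of the MSE for each model over the 100 splits
df_result %>%
group_by(Model) %>%
summarise(mean_MSE=mean(MSE),
median_MSE=median(MSE),
sd_MSE=sd(MSE))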
We can use a hypothesis test to check whether there is a difference between the average MSEs of the two models. Set up the hypotheses:
\(H_0:\mu_{ridge}=\mu_{ols}\)
\(H_a:\mu_{ridge}<\mu_{ols}\)
t.test(df_result$MSE~df_result$Model,
alternative="greater")
##
## Welch Two Sample t-test
##
## data: df_result$MSE by df_result$Model
## t = 18.486, df = 177.36, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group OLS and group Ridge is greater than 0
## 95 percent confidence interval:
## 48048.4 Inf
## sample estimates:
## mean in group OLS mean in group Ridge
## 134047.19 81278.86
Since the p-value is less than \(\alpha=0.05\), we reject the null hypothesis. The data therefore give enough evidence to support that the mean MSE of the ridge regression model is smaller than that of the linear regression model.
We will now apply ridge regression to predict Boston house prices.
library(mlbench)
data("BostonHousing")
str(BostonHousing)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : num 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ b : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
We will predict the median house value (medv) using the other predictors.
x<-model.matrix(medv~.,data=BostonHousing)
x<-x[,-1]
y<-BostonHousing$medv
#Create the sequence of lambda from 10^7 to 10^-4
grid<-10^seq(7,-4,length=100)
ridge_model_boston<-glmnet(x,y,
alpha=0,
lambda=grid)
#Use cross-validation to find an appropriate lambda
k_folds<-cv.glmnet(x,y,
alpha=0,
lambda=grid,
nfolds=10)
mse_lambda(k_folds)
#The minimum MSE
min(k_folds$cvm)
## [1] 23.59885
#The corresponding lambda
k_folds$lambda.min
## [1] 0.1
#Ridge coefficients with the optimal lambda
coef(ridge_model_boston)[,which.min(k_folds$cvm)]
## (Intercept) crim zn indus chas1
## 3.452557e+01 -1.029866e-01 4.316565e-02 3.548493e-03 2.751901e+00
## nox rm age dis rad
## -1.650679e+01 3.869711e+00 -4.085906e-04 -1.408433e+00 2.652875e-01
## tax ptratio b lstat
## -1.040403e-02 -9.324151e-01 9.281445e-03 -5.152642e-01
#Predict on the full dataset with the optimal lambda
predicted_boston<-predict(ridge_model_boston,s=k_folds$lambda.min,
newx=x)
mse(predicted_boston,y)
## [1] 21.92165
#Compare with prediction by linear regression
ols_model<-lm(medv~.,data=BostonHousing)
mse(ols_model$fitted.values,y)
## [1] 21.89483
The MSE of OLS is slightly smaller than that of the ridge regression. However, as the experiment above suggests, when the models are evaluated on held-out data across many splits, ridge regression usually gives better results than OLS.
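Note that both MSEs above are computed on the same data the models were fit to. For a more honest comparison we could hold out a test set; below is a minimal sketch of one such split (an arbitrary seed and a 70/30 split, reusing the mse() helper and the grid defined above).
#A minimal sketch: one train/test split on BostonHousing
set.seed(1)
train_id<-sample(nrow(BostonHousing),size=round(0.7*nrow(BostonHousing)))
train_b<-BostonHousing[train_id,]
test_b<-BostonHousing[-train_id,]
x_train<-model.matrix(medv~.,data=train_b)[,-1]
x_test<-model.matrix(medv~.,data=test_b)[,-1]
#Ridge: choose lambda by cross-validation on the training data only
cv_b<-cv.glmnet(x_train,train_b$medv,alpha=0,lambda=grid,nfolds=10)
ridge_b<-glmnet(x_train,train_b$medv,alpha=0,lambda=grid)
ridge_test_mse<-mse(predict(ridge_b,s=cv_b$lambda.min,newx=x_test),test_b$medv)
#OLS fit on the training data, evaluated on the test data
ols_b<-lm(medv~.,data=train_b)
ols_test_mse<-mse(predict(ols_b,newdata=test_b),test_b$medv)
c(ridge=ridge_test_mse,ols=ols_test_mse)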