With big data now commonplace, it is increasingly possible for the number of independent variables (predictors) to exceed the number of observations or instances. When this happens, shrinkage methods are appropriate: they counter overfitting and help separate important predictors from less important ones, which is the motivation for shrinking, or reducing the influence of, predictor variables. With too many predictors, collinearity can arise and make the model difficult to interpret, and prediction performance can also suffer.
Ridge regression is a shrinkage technique that reduces the size of the coefficient estimates rather than dropping predictors outright. It deliberately uses a biased estimator in exchange for lower variance. Ridge regression performs L2 regularization, which "shrinks" the parameters toward zero to counter the effects of multicollinearity. For ridge regression, all variables need to be standardized; the ridge coefficients are then transformed back to the original scale.
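To make the L2 idea concrete, here is a minimal sketch of the closed-form ridge estimator on standardized data; the ridge_coef function and its X, y, and lambda arguments are hypothetical placeholders, not part of the analysis below.
# Minimal sketch of the closed-form ridge estimator on standardized data
ridge_coef <- function(X, y, lambda) {
  Xs <- scale(X)                       # standardize the predictors
  yc <- y - mean(y)                    # center the response
  p  <- ncol(Xs)
  # L2-penalized least squares: (X'X + lambda * I)^(-1) X'y
  solve(t(Xs) %*% Xs + lambda * diag(p), t(Xs) %*% yc)
}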
Using the dataset found at https://www.kaggle.com/srvmig/seoul-bike-rent, I attempt to predict the rented bike count in a given hour from the predictor variables temperature, humidity, and wind speed. I recognize this dataset doesn't match the many-predictors situation described above, but I still wanted to try out the ridge regression functionality in R for practice and to see the results.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.0-2
bike_data <- read.csv("SeoulBikeData.csv")
summary(bike_data)
## Date Rented_Bike_Count Hour Temperature
## 01/01/2018: 24 Min. : 0.0 Min. : 0.00 Min. :-17.80
## 01/02/2018: 24 1st Qu.: 191.0 1st Qu.: 5.75 1st Qu.: 3.50
## 01/03/2018: 24 Median : 504.5 Median :11.50 Median : 13.70
## 01/04/2018: 24 Mean : 704.6 Mean :11.50 Mean : 12.88
## 01/05/2018: 24 3rd Qu.:1065.2 3rd Qu.:17.25 3rd Qu.: 22.50
## 01/06/2018: 24 Max. :3556.0 Max. :23.00 Max. : 39.40
## (Other) :8616
## Humidity Wind_speed Visibility Dew_point.temperature
## Min. : 0.00 Min. :0.000 Min. : 27 Min. :-30.600
## 1st Qu.:42.00 1st Qu.:0.900 1st Qu.: 940 1st Qu.: -4.700
## Median :57.00 Median :1.500 Median :1698 Median : 5.100
## Mean :58.23 Mean :1.725 Mean :1437 Mean : 4.074
## 3rd Qu.:74.00 3rd Qu.:2.300 3rd Qu.:2000 3rd Qu.: 14.800
## Max. :98.00 Max. :7.400 Max. :2000 Max. : 27.200
##
## Solar_Radiation Rainfall Snowfall Seasons
## Min. :0.0000 Min. : 0.0000 Min. :0.00000 Autumn:2184
## 1st Qu.:0.0000 1st Qu.: 0.0000 1st Qu.:0.00000 Spring:2208
## Median :0.0100 Median : 0.0000 Median :0.00000 Summer:2208
## Mean :0.5691 Mean : 0.1487 Mean :0.07507 Winter:2160
## 3rd Qu.:0.9300 3rd Qu.: 0.0000 3rd Qu.:0.00000
## Max. :3.5200 Max. :35.0000 Max. :8.80000
##
## Holiday Functioning_Day
## Holiday : 432 No : 295
## No Holiday:8328 Yes:8465
##
##
##
##
##
# from: https://rstatisticsblog.com/data-science-in-action/machine-learning/ridge-regression-in-r/
# Getting the independent variables
x_var <- data.matrix(bike_data[, c("Temperature", "Humidity", "Wind_speed")])
# Getting the dependent variable
y_var <- bike_data[, "Rented_Bike_Count"]
# Setting the range of lambda values
lambda_seq <- 10^seq(2, -2, by = -.1)
# Using the glmnet function to build the ridge regression in R
glmnet_fit <- glmnet(x_var, y_var, alpha = 0, lambda = lambda_seq)
# Checking the model
summary(glmnet_fit)
## Length Class Mode
## a0 41 -none- numeric
## beta 123 dgCMatrix S4
## df 41 -none- numeric
## dim 2 -none- numeric
## lambda 41 -none- numeric
## dev.ratio 41 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
Using the glmnet() function from the glmnet library with alpha = 0, I fit the ridge regression model across the sequence of lambda values. The next step is to find the best lambda value.
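Before cross-validating, one optional way to see the shrinkage in action is to plot the coefficient paths over the lambda sequence; a quick sketch using glmnet's built-in plot method on the glmnet_fit object from above:
# Coefficient paths versus log(lambda); each curve is one predictor
# shrinking as the penalty grows
plot(glmnet_fit, xvar = "lambda", label = TRUE)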
# Using cross validation glmnet
ridge_cv <- cv.glmnet(x_var, y_var, alpha = 0, lambda = lambda_seq)
# Best lambda value
best_lambda <- ridge_cv$lambda.min
best_lambda
## [1] 0.1995262
The resulting best lambda is 0.1995262.
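To see how that value is chosen, the cross-validation object can also be plotted; a short sketch (glmnet marks lambda.min and lambda.1se with vertical lines):
# Mean cross-validated error versus log(lambda)
plot(ridge_cv)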
best_fit <- ridge_cv$glmnet.fit
head(best_fit)
## $a0
## s0 s1 s2 s3 s4 s5 s6 s7
## 715.3139 721.4041 726.8698 731.6619 735.7844 739.2761 742.1966 744.6144
## s8 s9 s10 s11 s12 s13 s14 s15
## 746.5996 748.2169 749.5308 750.5923 751.4468 752.1329 752.6826 753.1221
## s16 s17 s18 s19 s20 s21 s22 s23
## 753.4732 753.7533 753.9765 754.1524 754.2943 754.4072 754.4970 754.5684
## s24 s25 s26 s27 s28 s29 s30 s31
## 754.6252 754.6704 754.7062 754.7347 754.7574 754.7754 754.7897 754.8010
## s32 s33 s34 s35 s36 s37 s38 s39
## 754.8101 754.8172 754.8229 754.8274 754.8310 754.8339 754.8362 754.8380
## s40
## 754.8394
##
## $beta
## 3 x 41 sparse Matrix of class "dgCMatrix"
## [[ suppressing 41 column names 's0', 's1', 's2' ... ]]
##
## Temperature 26.95319 27.78455 28.483177 29.064171 29.543199 29.935358 30.254528
## Humidity -7.15556 -7.43340 -7.670423 -7.870051 -8.036392 -8.173761 -8.286367
## Wind_speed 34.02715 33.66604 33.280424 32.901609 32.548951 32.232765 31.956968
##
## Temperature 30.513060 30.72167 30.889362 31.024006 31.131820 31.218009
## Humidity -8.378112 -8.45249 -8.512501 -8.560842 -8.599646 -8.630728
## Wind_speed 31.721324 31.52314 31.358739 31.223192 31.112465 31.022542
##
## Temperature 31.286821 31.341703 31.385439 31.420269 31.447992 31.470049
## Humidity -8.655583 -8.675432 -8.691265 -8.703885 -8.713935 -8.721936
## Wind_speed 30.949850 30.891300 30.844278 30.806599 30.776461 30.752390
##
## Temperature 31.487482 31.501452 31.512560 31.521388 31.528405 31.533981
## Humidity -8.728254 -8.733327 -8.737361 -8.740568 -8.743118 -8.745144
## Wind_speed 30.733456 30.718098 30.705861 30.696119 30.688368 30.682202
##
## Temperature 31.538411 31.541932 31.54473 31.546950 31.548715 31.550118
## Humidity -8.746754 -8.748034 -8.74905 -8.749858 -8.750499 -8.751009
## Wind_speed 30.677298 30.673400 30.67030 30.667839 30.665882 30.664327
##
## Temperature 31.551232 31.552116 31.552819 31.553378 31.553821 31.554174
## Humidity -8.751414 -8.751736 -8.751991 -8.752194 -8.752356 -8.752484
## Wind_speed 30.663092 30.662110 30.661330 30.660710 30.660218 30.659827
##
## Temperature 31.554454 31.554676 31.554852 31.554993
## Humidity -8.752586 -8.752666 -8.752731 -8.752782
## Wind_speed 30.659517 30.659270 30.659074 30.658918
##
## $df
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [39] 3 3 3
##
## $dim
## [1] 3 41
##
## $lambda
## [1] 100.00000000 79.43282347 63.09573445 50.11872336 39.81071706
## [6] 31.62277660 25.11886432 19.95262315 15.84893192 12.58925412
## [11] 10.00000000 7.94328235 6.30957344 5.01187234 3.98107171
## [16] 3.16227766 2.51188643 1.99526231 1.58489319 1.25892541
## [21] 1.00000000 0.79432823 0.63095734 0.50118723 0.39810717
## [26] 0.31622777 0.25118864 0.19952623 0.15848932 0.12589254
## [31] 0.10000000 0.07943282 0.06309573 0.05011872 0.03981072
## [36] 0.03162278 0.02511886 0.01995262 0.01584893 0.01258925
## [41] 0.01000000
##
## $dev.ratio
## [1] 0.3675782 0.3702973 0.3721710 0.3734416 0.3742917 0.3748542 0.3752231
## [8] 0.3754631 0.3756184 0.3757182 0.3757823 0.3758233 0.3758494 0.3758660
## [15] 0.3758765 0.3758832 0.3758874 0.3758901 0.3758918 0.3758929 0.3758936
## [22] 0.3758940 0.3758943 0.3758944 0.3758945 0.3758946 0.3758947 0.3758947
## [29] 0.3758947 0.3758947 0.3758947 0.3758947 0.3758947 0.3758947 0.3758947
## [36] 0.3758947 0.3758947 0.3758947 0.3758947 0.3758947 0.3758947
Now, to identify the best model, the glmnet.fit object is retrieved from the cross-validation object. In the output above, the full sequence of lambda values appears in the $lambda section, and the chosen best lambda (0.1995262) is among them.
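As a sanity check, the coefficients at lambda.min can also be read directly from the cross-validation object, which should agree with the rebuilt model in the next step; a brief sketch:
# Coefficients evaluated at the cross-validated lambda.min
coef(ridge_cv, s = "lambda.min")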
# Rebuilding the model with optimal lambda value
best_ridge <- glmnet(x_var, y_var, alpha = 0, lambda = best_lambda)
coef(best_ridge)
## 4 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 754.73967
## Temperature 31.54498
## Humidity -8.74917
## Wind_speed 30.66961
Using the best lambda, the regression model is rebuilt and the coefficients displayed.
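To get a rough sense of fit, the refit model can be used to predict on the training matrix and an R-squared computed by hand; a quick in-sample sketch (so it will be optimistic), with pred, sse, and sst as illustrative names:
# In-sample predictions from the refit ridge model
pred <- predict(best_ridge, newx = x_var)
# Hand-computed R-squared: 1 - SSE / SST
sse <- sum((y_var - pred)^2)
sst <- sum((y_var - mean(y_var))^2)
1 - sse / sst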
Trying the MASS package's version of ridge regression, I generate a model using the same predictor variables.
library(MASS)
rr_mod <- lm.ridge(Rented_Bike_Count ~ Temperature + Humidity + Wind_speed, data=bike_data, lambda = best_lambda)
coef(rr_mod)
## Temperature Humidity Wind_speed
## 754.83683 31.55474 -8.75269 30.65920
As the coefficients from the MASS package's lm.ridge() indicate, glmnet and lm.ridge produce very similar results for ridge regression.
As mentioned in the introduction, ridge regression is a technique used to counter multicollinearity in a dataset. Multicollinearity is common in observational studies, where the predictor variables cannot be controlled by the data scientist. That style of data collection often yields highly correlated variables, such as age and salary. High correlation between predictors makes it difficult to assess the impact of any individual predictor, and it can produce coefficients that do not appear to make sense given the full context of the dataset.
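As a quick practical check for this kind of collinearity, the pairwise correlations among the predictors can be inspected before deciding whether a penalized fit is needed; a small sketch on the same predictor matrix:
# Pairwise correlations among the predictors; values near +/-1 would
# signal the collinearity that ridge regression is meant to tame
cor(x_var)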