Ridge regression

Introduction

Given that big data has been a buzzword for years now, it is increasingly common for the number of independent variables (predictors) to rival or even exceed the number of observations. When this happens, shrinkage methods are appropriate: they counter overfitting and dampen the influence of less important predictors by pulling coefficient estimates toward zero. With too many predictors, collinearity can also arise, which makes the model difficult to interpret, and prediction performance can suffer as well.

Ridge regression is one such shrinkage technique. It performs L2 regularization, which "shrinks" the coefficient estimates toward zero; unlike the lasso, it never sets a coefficient exactly to zero, so it does not actually remove predictors from the model. The estimator is deliberately biased: accepting a small amount of bias buys a large reduction in variance, which stabilizes the coefficients in the presence of multicollinearity. For ridge regression, the predictor variables do need to be standardized before fitting (glmnet does this by default), and the coefficients are then reported back on the original scale.
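
To make the L2 penalty concrete, the textbook form of the ridge criterion (shown purely for reference; nothing in it is specific to the bike data) is

$$\hat{\beta}^{\text{ridge}} = \underset{\beta}{\arg\min}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

where lambda >= 0 controls the amount of shrinkage: lambda = 0 recovers ordinary least squares, and larger values pull the coefficients more strongly toward zero. This lambda is the same tuning parameter passed to glmnet() below.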

Implementation

Using the dataset found at https://www.kaggle.com/srvmig/seoul-bike-rent, I attempt to predict the rented bike count in a given hour from the predictor variables temperature, humidity, and wind speed. I recognize this dataset doesn't match the situation of many predictor variables, but I still wanted to try out the ridge regression functionality in R for practice and to see the results.

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.0-2
bike_data <- read.csv("SeoulBikeData.csv")

summary(bike_data)
##          Date      Rented_Bike_Count      Hour        Temperature    
##  01/01/2018:  24   Min.   :   0.0    Min.   : 0.00   Min.   :-17.80  
##  01/02/2018:  24   1st Qu.: 191.0    1st Qu.: 5.75   1st Qu.:  3.50  
##  01/03/2018:  24   Median : 504.5    Median :11.50   Median : 13.70  
##  01/04/2018:  24   Mean   : 704.6    Mean   :11.50   Mean   : 12.88  
##  01/05/2018:  24   3rd Qu.:1065.2    3rd Qu.:17.25   3rd Qu.: 22.50  
##  01/06/2018:  24   Max.   :3556.0    Max.   :23.00   Max.   : 39.40  
##  (Other)   :8616                                                     
##     Humidity       Wind_speed      Visibility   Dew_point.temperature
##  Min.   : 0.00   Min.   :0.000   Min.   :  27   Min.   :-30.600      
##  1st Qu.:42.00   1st Qu.:0.900   1st Qu.: 940   1st Qu.: -4.700      
##  Median :57.00   Median :1.500   Median :1698   Median :  5.100      
##  Mean   :58.23   Mean   :1.725   Mean   :1437   Mean   :  4.074      
##  3rd Qu.:74.00   3rd Qu.:2.300   3rd Qu.:2000   3rd Qu.: 14.800      
##  Max.   :98.00   Max.   :7.400   Max.   :2000   Max.   : 27.200      
##                                                                      
##  Solar_Radiation     Rainfall          Snowfall         Seasons    
##  Min.   :0.0000   Min.   : 0.0000   Min.   :0.00000   Autumn:2184  
##  1st Qu.:0.0000   1st Qu.: 0.0000   1st Qu.:0.00000   Spring:2208  
##  Median :0.0100   Median : 0.0000   Median :0.00000   Summer:2208  
##  Mean   :0.5691   Mean   : 0.1487   Mean   :0.07507   Winter:2160  
##  3rd Qu.:0.9300   3rd Qu.: 0.0000   3rd Qu.:0.00000                
##  Max.   :3.5200   Max.   :35.0000   Max.   :8.80000                
##                                                                    
##        Holiday     Functioning_Day
##  Holiday   : 432   No : 295       
##  No Holiday:8328   Yes:8465       
##                                   
##                                   
##                                   
##                                   
## 
# from: https://rstatisticsblog.com/data-science-in-action/machine-learning/ridge-regression-in-r/

# Getting the independent variable
x_var <- data.matrix(bike_data[, c("Temperature", "Humidity", "Wind_speed")])
# Getting the dependent variable
y_var <- bike_data[, "Rented_Bike_Count"]
 
# Setting the range of lambda values
lambda_seq <- 10^seq(2, -2, by = -.1)
# Using glmnet function to build the ridge regression in r
glmnet_fit <- glmnet(x_var, y_var, alpha = 0, lambda  = lambda_seq)
# Checking the model
summary(glmnet_fit)
##           Length Class     Mode   
## a0         41    -none-    numeric
## beta      123    dgCMatrix S4     
## df         41    -none-    numeric
## dim         2    -none-    numeric
## lambda     41    -none-    numeric
## dev.ratio  41    -none-    numeric
## nulldev     1    -none-    numeric
## npasses     1    -none-    numeric
## jerr        1    -none-    numeric
## offset      1    -none-    logical
## call        5    -none-    call   
## nobs        1    -none-    numeric

Using the glmnet() function from the glmnet library with alpha = 0, I'm able to fit the ridge regression model across the sequence of lambda values. The next step is to choose the best lambda value by cross-validation.
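
To see how the penalty acts before tuning, the coefficient paths can be plotted across the lambda sequence. This is just a minimal sketch using glmnet's built-in plot method, not a step from the original write-up:

# Sketch: coefficient paths versus log(lambda) for the fitted ridge model
plot(glmnet_fit, xvar = "lambda", label = TRUE)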

# Using cross validation glmnet
ridge_cv <- cv.glmnet(x_var, y_var, alpha = 0, lambda = lambda_seq)
# Best lambda value
best_lambda <- ridge_cv$lambda.min
best_lambda
## [1] 0.1995262

The resulting best lambda is 0.1995262.
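
As an optional check (a sketch I did not run as part of this post), the cross-validation curve can be plotted directly from the cv.glmnet object; the chosen lambda.min and the more conservative lambda.1se are marked with vertical dotted lines:

# Sketch: mean cross-validated error versus log(lambda)
plot(ridge_cv)
# a more conservative alternative to lambda.min
ridge_cv$lambda.1se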

best_fit <- ridge_cv$glmnet.fit
head(best_fit)
## $a0
##       s0       s1       s2       s3       s4       s5       s6       s7 
## 715.3139 721.4041 726.8698 731.6619 735.7844 739.2761 742.1966 744.6144 
##       s8       s9      s10      s11      s12      s13      s14      s15 
## 746.5996 748.2169 749.5308 750.5923 751.4468 752.1329 752.6826 753.1221 
##      s16      s17      s18      s19      s20      s21      s22      s23 
## 753.4732 753.7533 753.9765 754.1524 754.2943 754.4072 754.4970 754.5684 
##      s24      s25      s26      s27      s28      s29      s30      s31 
## 754.6252 754.6704 754.7062 754.7347 754.7574 754.7754 754.7897 754.8010 
##      s32      s33      s34      s35      s36      s37      s38      s39 
## 754.8101 754.8172 754.8229 754.8274 754.8310 754.8339 754.8362 754.8380 
##      s40 
## 754.8394 
## 
## $beta
## 3 x 41 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 41 column names 's0', 's1', 's2' ... ]]
##                                                                                
## Temperature 26.95319 27.78455 28.483177 29.064171 29.543199 29.935358 30.254528
## Humidity    -7.15556 -7.43340 -7.670423 -7.870051 -8.036392 -8.173761 -8.286367
## Wind_speed  34.02715 33.66604 33.280424 32.901609 32.548951 32.232765 31.956968
##                                                                       
## Temperature 30.513060 30.72167 30.889362 31.024006 31.131820 31.218009
## Humidity    -8.378112 -8.45249 -8.512501 -8.560842 -8.599646 -8.630728
## Wind_speed  31.721324 31.52314 31.358739 31.223192 31.112465 31.022542
##                                                                        
## Temperature 31.286821 31.341703 31.385439 31.420269 31.447992 31.470049
## Humidity    -8.655583 -8.675432 -8.691265 -8.703885 -8.713935 -8.721936
## Wind_speed  30.949850 30.891300 30.844278 30.806599 30.776461 30.752390
##                                                                        
## Temperature 31.487482 31.501452 31.512560 31.521388 31.528405 31.533981
## Humidity    -8.728254 -8.733327 -8.737361 -8.740568 -8.743118 -8.745144
## Wind_speed  30.733456 30.718098 30.705861 30.696119 30.688368 30.682202
##                                                                       
## Temperature 31.538411 31.541932 31.54473 31.546950 31.548715 31.550118
## Humidity    -8.746754 -8.748034 -8.74905 -8.749858 -8.750499 -8.751009
## Wind_speed  30.677298 30.673400 30.67030 30.667839 30.665882 30.664327
##                                                                        
## Temperature 31.551232 31.552116 31.552819 31.553378 31.553821 31.554174
## Humidity    -8.751414 -8.751736 -8.751991 -8.752194 -8.752356 -8.752484
## Wind_speed  30.663092 30.662110 30.661330 30.660710 30.660218 30.659827
##                                                    
## Temperature 31.554454 31.554676 31.554852 31.554993
## Humidity    -8.752586 -8.752666 -8.752731 -8.752782
## Wind_speed  30.659517 30.659270 30.659074 30.658918
## 
## $df
##  [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [39] 3 3 3
## 
## $dim
## [1]  3 41
## 
## $lambda
##  [1] 100.00000000  79.43282347  63.09573445  50.11872336  39.81071706
##  [6]  31.62277660  25.11886432  19.95262315  15.84893192  12.58925412
## [11]  10.00000000   7.94328235   6.30957344   5.01187234   3.98107171
## [16]   3.16227766   2.51188643   1.99526231   1.58489319   1.25892541
## [21]   1.00000000   0.79432823   0.63095734   0.50118723   0.39810717
## [26]   0.31622777   0.25118864   0.19952623   0.15848932   0.12589254
## [31]   0.10000000   0.07943282   0.06309573   0.05011872   0.03981072
## [36]   0.03162278   0.02511886   0.01995262   0.01584893   0.01258925
## [41]   0.01000000
## 
## $dev.ratio
##  [1] 0.3675782 0.3702973 0.3721710 0.3734416 0.3742917 0.3748542 0.3752231
##  [8] 0.3754631 0.3756184 0.3757182 0.3757823 0.3758233 0.3758494 0.3758660
## [15] 0.3758765 0.3758832 0.3758874 0.3758901 0.3758918 0.3758929 0.3758936
## [22] 0.3758940 0.3758943 0.3758944 0.3758945 0.3758946 0.3758947 0.3758947
## [29] 0.3758947 0.3758947 0.3758947 0.3758947 0.3758947 0.3758947 0.3758947
## [36] 0.3758947 0.3758947 0.3758947 0.3758947 0.3758947 0.3758947

Now, to inspect the fitted models, the glmnet.fit object is retrieved from the cross-validation object. In the $lambda component of the output above, the selected lambda of 0.1995262 can be seen among the values that were searched.
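
As a side note (a sketch, equivalent in spirit to the refit in the next chunk), the coefficients at the selected lambda can also be pulled straight from the cross-validation object without refitting:

# Sketch: coefficients at the CV-selected lambda, taken from ridge_cv directly
coef(ridge_cv, s = "lambda.min")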

# Rebuilding the model with optimal lambda value
best_ridge <- glmnet(x_var, y_var, alpha = 0, lambda = best_lambda)


coef(best_ridge)
## 4 x 1 sparse Matrix of class "dgCMatrix"
##                    s0
## (Intercept) 754.73967
## Temperature  31.54498
## Humidity     -8.74917
## Wind_speed   30.66961

Using the best lambda, the regression model is built again and the coefficients are displayed.
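
As a rough, unverified check on fit quality (a sketch assuming the x_var, y_var, and best_ridge objects defined above), in-sample predictions and an RMSE can be computed from the refitted model:

# Sketch: in-sample predictions and root mean squared error
pred <- predict(best_ridge, s = best_lambda, newx = x_var)
rmse <- sqrt(mean((y_var - pred)^2))
rmse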

Trying the MASS package's version of ridge regression, I generate a model using the same predictor variables.

library(MASS)
rr_mod <- lm.ridge(Rented_Bike_Count ~ Temperature + Humidity + Wind_speed, data=bike_data, lambda = best_lambda)

coef(rr_mod)
##             Temperature    Humidity  Wind_speed 
##   754.83683    31.55474    -8.75269    30.65920

As the coefficients from the MASS package's version of ridge regression indicate, glmnet and lm.ridge produce very similar output for ridge regression.

Conclusion

As mentioned in the introduction, ridge regression is a technique used to counter multicollinearity in a given dataset. Multicollinearity is common in observational studies, in which the predictor variables can't be controlled by the data scientist. This kind of data collection often yields highly correlated variables, such as age and salary. High correlation between predictors makes it hard to assess the impact of any individual predictor, and can produce coefficients that do not appear to make sense given the full context of the dataset. Ridge regression trades a small amount of bias for more stable, interpretable coefficient estimates.
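
For completeness, checking for this kind of correlation among the predictors used here is straightforward. This is a small sketch using the x_var matrix built earlier; I have not included its output:

# Sketch: pairwise correlations among temperature, humidity and wind speed
cor(x_var)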