Ridge regression is useful for building a model when the number of predictors exceeds the number of observations, or when the predictors are highly correlated with one another. It adds an L2 penalty on the magnitude of the coefficients, shrinking them toward zero; this shrinkage stabilizes the estimates and helps prevent overfitting.
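Formally, for a response $y$ and predictors $x_{ij}$, the ridge estimator minimizes the residual sum of squares plus an L2 penalty on the coefficients, with $\lambda \ge 0$ controlling the amount of shrinkage:

$$
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}
$$

At $\lambda = 0$ this reduces to ordinary least squares; increasing $\lambda$ shrinks the coefficients further toward zero (though, unlike the lasso, never exactly to zero).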
In the following example, I build a ridge regression model: I first prepare and split the dataset, then fit the model on the training data, and finally evaluate it on the test data. Tuning over lambda from 0 to 1 selected lambda = 1 (the largest cross-validated R-squared, 0.54), which corresponds to a cross-validated RMSE of 16.21 on the training resamples and an RMSE of 18.78 on the test set. Several fingerprint predictors, including X254, X253, X245 and X244, tie for the highest variable importance.
library(AppliedPredictiveModeling)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
data(permeability)
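Before screening, it is worth confirming the raw dimensions (a quick check I am adding here; the counts agree with the output further below, since 1107 - 719 = 388):

dim(fingerprints)  # 165 compounds x 1107 binary fingerprint predictors
dim(permeability)  # 165 x 1 matrix of permeability values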
# Preparation
low_frequency <- nearZeroVar(fingerprints) # low frequencies using nearZeroVar function
X <- fingerprints[,-low_frequency] # Removing the low frequencies
print(paste0(ncol(X), " columns remain after removing ", length(low_frequency), " near-zero-variance columns"))
## [1] "388 columns remain after removing 719 near-zero-variance columns"
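To see why so many columns were flagged, nearZeroVar can return its screening metrics instead of just the column indices (a sketch using caret's saveMetrics argument):

# Screening metrics: freqRatio (frequency of the most common value relative to
# the second most common) and percentUnique (percent of distinct values)
nzv_stats <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_stats[nzv_stats$nzv, ])  # a few of the 719 flagged columns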
# Splitting the data into training and test
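# Note: no seed was set for this run; calling set.seed() first would make the partition reproducible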
splitt <- createDataPartition(permeability, p=0.8, list=FALSE)
# Training
X_train <- X[splitt, ]
y_train <- permeability[splitt, ]
# Test
X_test <- X[-splitt, ]
y_test <- permeability[-splitt, ]
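As a quick sanity check (added here, not in the original run), the realized training fraction matches the requested p = 0.8, since createDataPartition stratifies on the outcome:

length(y_train) / length(permeability)  # ~0.81 (133 of 165 samples)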
# Ridge fit: 10-fold CV over lambda in [0, 1], selecting by cross-validated R-squared
ridge_fit <- train(X_train, y_train, method='ridge', metric='Rsquared',
tuneGrid = data.frame(.lambda= seq(0,1, by=0.1)),
trControl = trainControl(method = 'cv'), preProcess = c('center','scale'))
## Warning: model fit failed for Fold06: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning: model fit failed for Fold09: lambda=0.0 Error in if (zmin < gamhat) { : missing value where TRUE/FALSE needed
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
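These failures occur only at lambda = 0, which corresponds to an unpenalized least-squares fit; with 388 predictors and only 133 training samples that problem is ill-posed, which also explains the enormous RMSE in the lambda = 0 row below. A simple way to avoid the warnings (a sketch, not what was run above) is to start the grid just above zero:

# Alternative grid that skips the degenerate lambda = 0 (pure OLS) case
ridge_grid <- data.frame(.lambda = seq(0.05, 1, by = 0.05))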
ridge_fit
## Ridge Regression
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 120, 119, 120, 120, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0 2.719855e+15 0.2900293 7.543520e+14
## 0.1 1.239681e+01 0.4812332 9.491152e+00
## 0.2 1.223696e+01 0.5061374 9.436151e+00
## 0.3 1.241176e+01 0.5182994 9.646498e+00
## 0.4 1.275267e+01 0.5255242 9.947672e+00
## 0.5 1.317413e+01 0.5307187 1.029576e+01
## 0.6 1.368476e+01 0.5343161 1.067867e+01
## 0.7 1.425292e+01 0.5369683 1.109728e+01
## 0.8 1.486969e+01 0.5389557 1.155213e+01
## 0.9 1.552540e+01 0.5404674 1.209416e+01
## 1.0 1.621485e+01 0.5416396 1.269090e+01
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 1.
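Note that the largest R-squared lands on the boundary of the grid (lambda = 1), while the lowest cross-validated RMSE is reached at lambda = 0.2. The full resampling table is stored in the fit object, so the RMSE-optimal row can be inspected directly:

# Tuning row with the lowest cross-validated RMSE (lambda = 0.2 here)
ridge_fit$results[which.min(ridge_fit$results$RMSE), ]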
# Plot
plot(ridge_fit)
# Important variables
varImp(ridge_fit)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 388)
##
## Overall
## X254 100.00
## X253 100.00
## X240 100.00
## X239 100.00
## X246 100.00
## X244 100.00
## X157 100.00
## X245 100.00
## X129 76.47
## X266 66.08
## X262 66.08
## X260 63.21
## X265 63.21
## X255 61.69
## X247 61.69
## X235 55.56
## X372 55.30
## X373 55.30
## X138 53.79
## X133 53.79
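The tied scores of 100 come from caret's model-free loess importance, which is computed per predictor; plotting the scores makes the drop-off after the top group easier to see (using caret's plot method for varImp objects):

# Visualize the 20 highest-ranked predictors
plot(varImp(ridge_fit), top = 20)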
# Evaluating performance on the test set
postResample(predict(ridge_fit, X_test), obs=y_test)
## RMSE Rsquared MAE
## 18.7840977 0.4729294 15.4331342
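The test RMSE (18.78) is noticeably higher than the cross-validated estimate, so a visual check is worthwhile; a predicted-versus-observed plot (a sketch, not part of the original output) makes any systematic bias easy to spot:

# Predicted vs. observed permeability on the held-out test set
pred_test <- predict(ridge_fit, X_test)
plot(y_test, pred_test, xlab = "Observed permeability",
     ylab = "Predicted permeability")
abline(0, 1, lty = 2)  # dashed 45-degree reference line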