Previously I demonstrated the use of LASSO with Hastie's elastic net algorithm for predicting Parkinson's disease. Recall that the elastic net minimizes error (binomial deviance) according to the following objective function:

\[\min_{(\beta_0,\beta)\in {\Bbb R}^{p+1}} -\left[\frac{1}{N}\sum_{i=1}^N y_i(\beta_0 + x_i^T\beta) - \log\left(1 + e^{(\beta_0 + x_i^T\beta)}\right)\right] + \lambda\left[\frac{(1-\alpha)||\beta||_2^2}{2} + \alpha||\beta||_1\right]\]

The first bracketed term is the negative binomial log-likelihood to be minimized. The second bracket contains the penalty term (preceded by the \(\lambda\) coefficient). Note that with \(\alpha = 1\) the penalty reduces to the pure LASSO penalty: \[\lambda||\beta||_1\] When \(\alpha = 0\) the result is a pure ridge regression penalty: \[\frac{\lambda||\beta||_2^2}{2}\] This time I will set \(\alpha = 0\) to see how ridge regression compares with LASSO in terms of model performance.
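To make the role of \(\alpha\) concrete, here is a minimal sketch of how the mixing parameter maps onto the glmnet interface (x and y are placeholder objects for a predictor matrix and a binary outcome, not the actual data used below):
library(glmnet)
ridge.fit = glmnet(x = x, y = y, alpha = 0, family = "binomial") # alpha = 0: pure ridge penalty
lasso.fit = glmnet(x = x, y = y, alpha = 1, family = "binomial") # alpha = 1: pure LASSO penalty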
R (via the RStudio IDE) will be used for this analysis. Below is the list of libraries employed:
library(dplyr) # data manipulation
library(readr) # reading in data
library(tidyr) # data cleaning
library(stringr) # string manipulation
library(ggplot2) # data visualization
library(purrr) # functional programming
library(ROCR) # ROC curve generation
library(caret) # LOOCV
library(glmnet) # for ridge / elastic net modeling
library(knitr) # for online report generation
library(kableExtra) # for customizing tables
By way of reminder, this data set contains three groupings:
- a healthy control group
- a Parkinson's disease group
- an at-risk group of patients with REM sleep behavior disorder (RBD)
The control and disease groups will be used for model training/testing, and the at-risk RBD group will be used for validation. Data containing the control and Parkinson's groups were randomly divided into a training set (70%) and a testing set (30%); note that I used set.seed(35) in the sampling process so that this 'stochastic' step can be reproduced.
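A minimal sketch of how such a split might be coded is below; the data frame name dat is a placeholder rather than the exact object used in the analysis:
set.seed(35) # make the random split reproducible
train_idx = sample(seq_len(nrow(dat)), size = floor(0.7 * nrow(dat))) # sample 70% of row indices
train = dat[train_idx, ] # 70% training set
test = dat[-train_idx, ] # 30% testing set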
This data set contains 24 predictor variables: 12 speech attributes extracted from a recorded monologue and the same 12 attributes extracted a second time from a recorded reading passage.
I used the code below to prepare the data for training, testing and validation. The coefficients produced by the ridge method are not equivariant under scaling, so the predictors need to be standardized to a common scale prior to the ridge analysis. Below is the code I used to scale the data. I then ran a ridge model; the resulting plot of coefficients versus log-\(\lambda\) shows that nearly all coefficients gradually shrink toward zero as \(\lambda\) increases.
X.train = scale(x_train) # standardize training predictors to mean 0 / sd 1
X.test = scale(x_test) # standardize testing predictors
X.assess = scale(x_assess) # standardize validation (RBD) predictors
X = scale(X) # standardize the aggregated predictor matrix
alph = 0 # alpha = 0 selects the pure ridge penalty
lasso.fit = glmnet(x = X.train, y = y_train, alpha = alph, family = "binomial")
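The coefficient-path plot referenced above can be produced directly from the fitted object; a one-line sketch using glmnet's built-in plot method:
plot(lasso.fit, xvar = "lambda", label = TRUE) # ridge coefficient paths versus log-lambda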
How do we choose an optimal \(\lambda\) value? Conveniently, the glmnet package has built-in cross-validation functionality. Cross-validation partitions the training data into k parts; one part is held out while the remaining parts are used to fit the model at a given parameter value, the model is then tested on the held-out part, and this is repeated until each of the k parts has been held out once. For the \(\lambda\) tuning coefficient, a range of \(\lambda\) values is assessed in this way for its impact on model performance. Below is the single line of code for 10-fold cross-validation, followed by a plot of binomial deviance versus \(\lambda\). Binomial deviance measures how well the logistic model fits the data; lower deviance means a better fit, so we look for the \(\lambda\) region where the cross-validated deviance is smallest.
cv.lasso = cv.glmnet(x = X.train, y = y_train, alpha = alph, family = "binomial", nfolds = 10) # 10-fold CV with the ridge penalty
Binomial deviance versus lambda values
| Lambda minimum | Lambda 1 standard error |
|---|---|
| 0.07 | 0.135 |
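The plot above and the tuned values in the table can be obtained directly from the cv.glmnet object; a small sketch using the object created above:
plot(cv.lasso) # cross-validated binomial deviance versus log-lambda
cv.lasso$lambda.min # lambda giving the minimum cross-validated deviance
cv.lasso$lambda.1se # largest lambda within 1 standard error of the minimum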
Using this tuned \(\lambda\) value I refit the ridge model and used it to predict disease outcome on the training data. Unlike LASSO, ridge does not perform variable selection: every predictor is retained with a shrunken, non-zero coefficient, as shown below:
# 'lambda' holds the tuned value chosen from the cross-validation above
lasso.fit = glmnet(x = X.train, y = y_train, alpha = alph, lambda = lambda, family = "binomial")
pred.lasso = predict(lasso.fit, newx = X.train, s = lambda, type = "response") # in-sample predicted probabilities
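The coefficient table below can be pulled straight from the fitted object, for example:
coef(lasso.fit) # ridge coefficients (including the intercept) at the tuned lambda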
| Predictor | Ridge coefficient |
|---|---|
| (Intercept) | -0.464 |
| Entropy of speech timing (-) | -0.035 |
| Rate of speech timing (-/min) | -0.269 |
| Acceleration of speech timing (-/min2) | 0.042 |
| Duration of pause intervals (ms) | 0.464 |
| Duration of voiced intervals (ms) | 0.080 |
| Gaping in-between voiced intervals (-/min) | 0.229 |
| Duration of unvoiced stops (ms) | -0.063 |
| Decay of unvoiced fricatives (‰/min) | -0.028 |
| Relative loudness of respiration (dB) | 0.183 |
| Pause intervals per respiration (-) | -0.207 |
| Rate of speech respiration (-/min) | -0.225 |
| Latency of respiratory exchange (ms) | -0.285 |
| Entropy of speech timing (-)_1 | 0.003 |
| Rate of speech timing (-/min)_1 | -0.054 |
| Acceleration of speech timing (-/min2)_1 | 0.067 |
| Duration of pause intervals (ms)_1 | 0.325 |
| Duration of voiced intervals (ms)_1 | -0.047 |
| Gaping in-between voiced intervals (-/min)_1 | 0.122 |
| Duration of unvoiced stops (ms)_1 | 0.240 |
| Decay of unvoiced fricatives (‰/min)_1 | 0.123 |
| Relative loudness of respiration (dB)_1 | -0.012 |
| Pause intervals per respiration (-)_1 | 0.084 |
| Rate of speech respiration (-/min)_1 | 0.002 |
| Latency of respiratory exchange (ms)_1 | 0.095 |
Here again, I used an imbalanced cost (false negatives weighted twice as heavily as false positives) together with a grid search to determine the probability cut-off that minimizes the cost: pcut = 0.32.
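A minimal sketch of such a grid search, assuming pred.lasso holds the in-sample predicted probabilities and y_train the 0/1 disease labels (cost = 2 per false negative, 1 per false positive):
cuts = seq(0.01, 0.99, by = 0.01) # candidate probability cut-offs
cost = sapply(cuts, function(p) {
  pred.class = as.numeric(pred.lasso > p) # classify at cut-off p
  fn = sum(pred.class == 0 & y_train == 1) # false negatives
  fp = sum(pred.class == 1 & y_train == 0) # false positives
  2 * fn + fp # weighted misclassification cost
})
pcut = cuts[which.min(cost)] # cut-off minimizing the weighted cost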
| Accuracy | In-sample sensitivity | In-sample specificity | AUC |
|---|---|---|---|
| 0.77 | 0.91 | 0.68 | 0.79 |
The above ridge model was then tested on the testing data set with the below metrics:
| Accuracy | Out-of-sample sensitivity | Out-of-sample specificity | AUC |
|---|---|---|---|
| 0.71 | 0.88 | 0.62 | 0.75 |
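As a rough sketch, test-set metrics of this kind could be computed with a confusion table and the ROCR package; here pred.test is assumed to hold the predicted probabilities for the test set and y_test its 0/1 labels:
pred.class = as.numeric(pred.test > pcut) # apply the tuned probability cut-off
acc = mean(pred.class == y_test) # overall accuracy
sens = sum(pred.class == 1 & y_test == 1) / sum(y_test == 1) # sensitivity (true positive rate)
spec = sum(pred.class == 0 & y_test == 0) / sum(y_test == 0) # specificity (true negative rate)
rocr.pred = ROCR::prediction(pred.test, y_test)
auc = ROCR::performance(rocr.pred, "auc")@y.values[[1]] # area under the ROC curve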
Finally, a ridge model was fit to the aggregated training and testing data. Below is a visualization of the ridge coefficients versus \(\lambda\) values. Cross-validation was again performed to tune \(\lambda\), and a value of lambda.1se = 10.59 was selected, as visualized in the cross-validation plot below; a short code sketch of this step follows.
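This is a minimal sketch of the aggregated-data fit, assuming X is the scaled aggregate predictor matrix created earlier and Y is a placeholder name for the corresponding disease labels:
cv.full = cv.glmnet(x = X, y = Y, alpha = 0, family = "binomial") # tune lambda on the aggregated data
full.fit = glmnet(x = X, y = Y, alpha = 0, lambda = cv.full$lambda.1se, family = "binomial") # refit at lambda.1se
coef(full.fit) # ridge coefficients tabulated below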
| Predictor | Ridge coefficient |
|---|---|
| (Intercept) | -0.512 |
| Entropy of speech timing (-) | -0.001 |
| Rate of speech timing (-/min) | -0.014 |
| Acceleration of speech timing (-/min2) | 0.001 |
| Duration of pause intervals (ms) | 0.016 |
| Duration of voiced intervals (ms) | 0.009 |
| Gaping in-between voiced intervals (-/min) | -0.001 |
| Duration of unvoiced stops (ms) | 0.006 |
| Decay of unvoiced fricatives (‰/min) | 0.004 |
| Relative loudness of respiration (dB) | 0.003 |
| Pause intervals per respiration (-) | -0.005 |
| Rate of speech respiration (-/min) | -0.001 |
| Latency of respiratory exchange (ms) | -0.002 |
| Entropy of speech timing (-)_1 | -0.002 |
| Rate of speech timing (-/min)_1 | -0.014 |
| Acceleration of speech timing (-/min2)_1 | 0.003 |
| Duration of pause intervals (ms)_1 | 0.016 |
| Duration of voiced intervals (ms)_1 | 0.008 |
| Gaping in-between voiced intervals (-/min)_1 | -0.002 |
| Duration of unvoiced stops (ms)_1 | 0.011 |
| Decay of unvoiced fricatives (‰/min)_1 | 0.005 |
| Relative loudness of respiration (dB)_1 | -0.004 |
| Pause intervals per respiration (-)_1 | -0.007 |
| Rate of speech respiration (-/min)_1 | 0.008 |
| Latency of respiratory exchange (ms)_1 | 0.007 |
Again, a grid search was performed for the optimal probability cut-off (pcut), as visualized below:
This model was then used to predict in-sample data with the below metrics:
| Accuracy | In-sample sensitivity | In-sample specificity | AUC |
|---|---|---|---|
| 0.69 | 0.67 | 0.7 | 0.68 |
To fully validate the preceding ridge model, I tested it on the at-risk RBD group. Below are these out-of-sample metrics.
| Accuracy | Out-of-sample sensitivity | Out-of-sample specificity | AUC |
|---|---|---|---|
| 0.62 | 0.61 | 0.63 | 0.62 |
The validated ridge model was 62% accurate and correctly identified 14 of the 23 Parkinson's-positive patients in the at-risk group. This ridge model was more accurate than the baseline logistic model with 12 variables (52% accurate, also correctly identifying 14 of the 23 Parkinson's-positive patients). Ridge was comparable to LASSO in terms of accuracy (LASSO was 64% accurate) and sensitivity (both LASSO and ridge had a sensitivity of 61%).
In the next post I will begin looking at some advanced statistical models (e.g., support vector machines and random forests). I will make use of the feature-selection functionality of LASSO with these models.