Previously I demonstrated the use of LASSO with Hastie's elastic net algorithm for predicting Parkinson's disease. Recall that the elastic net minimizes error (binomial deviance) according to the following objective function:

\[\min_{(\beta_0,\beta)\in {\Bbb R}^{p+1}} -\left[\frac{1}{N}\sum_{i=1}^N y_i(\beta_0 + x_i^T\beta) - \log\left(1 + e^{(\beta_0 + x_i^T\beta)}\right)\right] + \lambda\left[\frac{(1-\alpha)||\beta||_2^2}{2} + \alpha||\beta||_1\right]\]

The first bracketed term is the negative binomial log-likelihood to be minimized. The second bracket contains the penalty term (preceded by the \(\lambda\) coefficient). Note that with \(\alpha = 1\) the penalty reduces to the pure LASSO penalty: \[\lambda||\beta||_1\] When \(\alpha = 0\) the result is a pure ridge regression penalty: \[\frac{\lambda||\beta||_2^2}{2}\] This time I will set \(\alpha = 0\) to see how ridge regression compares with LASSO in terms of model performance.
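To make the role of \(\alpha\) concrete, here is a minimal sketch of how the mixing parameter maps onto the glmnet interface (x and y are placeholder objects for a predictor matrix and a binary outcome, not the actual data used below):
library(glmnet)
ridge.fit = glmnet(x = x, y = y, alpha = 0, family = "binomial") # alpha = 0: pure ridge penalty
lasso.fit = glmnet(x = x, y = y, alpha = 1, family = "binomial") # alpha = 1: pure LASSO penalty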
R (via the RStudio IDE) will be used for this analysis. Below is the list of libraries employed:
library(dplyr) # data manipulation
library(readr) # reading in data
library(tidyr) # data cleaning
library(stringr) # string manipulation
library(ggplot2) # data visualization
library(purrr) # functional programming
library(ROCR) # ROC curve generation
library(caret) # LOOCV
library(glmnet) # for ridge / elastic net modeling
library(knitr) # for online report generation
library(kableExtra) # for customizing tables
By way of reminder, this data set contains three groupings:
- a healthy control group
- a Parkinson's disease group
- an at-risk group of patients with REM sleep behavior disorder (RBD)
The control and disease groups will be used for model training/testing, and the at-risk RBD group will be used for validation. Data containing the control and Parkinson's groups were randomly divided into a training set (70%) and a testing set (30%); note that I used set.seed(35) in the sampling process so that this 'stochastic' step can be reproduced.
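A minimal sketch of how such a split might be coded is below; the data frame name dat is a placeholder rather than the exact object used in the analysis:
set.seed(35) # make the random split reproducible
train_idx = sample(seq_len(nrow(dat)), size = floor(0.7 * nrow(dat))) # sample 70% of row indices
train = dat[train_idx, ] # 70% training set
test = dat[-train_idx, ] # 30% testing set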
This data set contains 24 predictor variables: 12 speech attributes extracted from a recorded monologue and the same 12 attributes extracted a second time from a recorded reading passage.
I used the code below to prepare the data for training, testing and validation. The coefficients produced by the ridge method are not equivariant under scaling, so the predictors need to be standardized to a common scale prior to the ridge analysis. Below is the code I used to scale the data. I then ran a ridge model; the resulting plot of coefficients versus log-\(\lambda\) shows that nearly all coefficients gradually shrink toward zero as \(\lambda\) increases.
X.train = scale(x_train) # standardize training predictors to mean 0 / sd 1
X.test = scale(x_test) # standardize testing predictors
X.assess = scale(x_assess) # standardize validation (RBD) predictors
X = scale(X) # standardize the aggregated predictor matrix
alph = 0 # alpha = 0 selects the pure ridge penalty
lasso.fit = glmnet(x = X.train, y = y_train, alpha = alph, family = "binomial")
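The coefficient-path plot referenced above can be produced directly from the fitted object; a one-line sketch using glmnet's built-in plot method:
plot(lasso.fit, xvar = "lambda", label = TRUE) # ridge coefficient paths versus log-lambda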
How do we choose an optimal \(\lambda\) value? Conveniently, the glmnet package has built-in cross-validation functionality. Cross-validation partitions the training data into k parts; one part is held out while the remaining parts are used to fit the model at a given parameter value, the model is then tested on the held-out part, and this is repeated until each of the k parts has been held out once. For the \(\lambda\) tuning coefficient, a range of \(\lambda\) values is assessed in this way for its impact on model performance. Below is the single line of code for 10-fold cross-validation, followed by a plot of binomial deviance versus \(\lambda\). Binomial deviance measures how well the logistic model fits the data; lower deviance means a better fit, so we look for the \(\lambda\) region where the cross-validated deviance is smallest.
cv.lasso = cv.glmnet(x = X.train, y = y_train, alpha = alph, family = "binomial", nfolds = 10) # 10-fold CV with the ridge penalty
Binomial deviance versus lambda values
| Lambda minimum | Lambda 1 standard error |
|---|---|
| 0.07 | 0.135 |
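The plot above and the tuned values in the table can be obtained directly from the cv.glmnet object; a small sketch using the object created above:
plot(cv.lasso) # cross-validated binomial deviance versus log-lambda
cv.lasso$lambda.min # lambda giving the minimum cross-validated deviance
cv.lasso$lambda.1se # largest lambda within 1 standard error of the minimum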
Using this tuned \(\lambda\) value I refit the ridge model and used it to predict disease outcome on the training data. Unlike LASSO, ridge does not perform variable selection: every predictor is retained with a shrunken, non-zero coefficient, as shown below:
# 'lambda' holds the tuned value chosen from the cross-validation above
lasso.fit = glmnet(x = X.train, y = y_train, alpha = alph, lambda = lambda, family = "binomial")
pred.lasso = predict(lasso.fit, newx = X.train, s = lambda, type = "response") # in-sample predicted probabilities
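The coefficient table below can be pulled straight from the fitted object, for example:
coef(lasso.fit) # ridge coefficients (including the intercept) at the tuned lambda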
| Predictor | Ridge coefficient |
|---|---|
| (Intercept) | -0.464 |
| Entropy of speech timing (-) | -0.035 |
| Rate of speech timing (-/min) | -0.269 |
| Acceleration of speech timing (-/min2) | 0.042 |
| Duration of pause intervals (ms) | 0.464 |
| Duration of voiced intervals (ms) | 0.080 |
| Gaping in-between voiced intervals (-/min) | 0.229 |
| Duration of unvoiced stops (ms) | -0.063 |
| Decay of unvoiced fricatives (‰/min) | -0.028 |
| Relative loudness of respiration (dB) | 0.183 |
| Pause intervals per respiration (-) | -0.207 |
| Rate of speech respiration (-/min) | -0.225 |
| Latency of respiratory exchange (ms) | -0.285 |
| Entropy of speech timing (-)_1 | 0.003 |
| Rate of speech timing (-/min)_1 | -0.054 |
| Acceleration of speech timing (-/min2)_1 | 0.067 |
| Duration of pause intervals (ms)_1 | 0.325 |
| Duration of voiced intervals (ms)_1 | -0.047 |
| Gaping in-between voiced intervals (-/min)_1 | 0.122 |
| Duration of unvoiced stops (ms)_1 | 0.240 |
| Decay of unvoiced fricatives (‰/min)_1 | 0.123 |
| Relative loudness of respiration (dB)_1 | -0.012 |
| Pause intervals per respiration (-)_1 | 0.084 |
| Rate of speech respiration (-/min)_1 | 0.002 |
| Latency of respiratory exchange (ms)_1 | 0.095 |
Here again, I used an imbalanced cost (false negatives weighted twice as heavily as false positives) together with a grid search to determine the probability cut-off that minimizes the cost: pcut = 0.32.
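A minimal sketch of such a grid search, assuming pred.lasso holds the in-sample predicted probabilities and y_train the 0/1 disease labels (cost = 2 per false negative, 1 per false positive):
cuts = seq(0.01, 0.99, by = 0.01) # candidate probability cut-offs
cost = sapply(cuts, function(p) {
  pred.class = as.numeric(pred.lasso > p) # classify at cut-off p
  fn = sum(pred.class == 0 & y_train == 1) # false negatives
  fp = sum(pred.class == 1 & y_train == 0) # false positives
  2 * fn + fp # weighted misclassification cost
})
pcut = cuts[which.min(cost)] # cut-off minimizing the weighted cost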
| Accuracy | In-sample sensitivity | In-sample specificity | AUC |
|---|---|---|---|
| 0.77 | 0.91 | 0.68 | 0.79 |
The above ridge model was then tested on the testing data set with the below metrics:
| Accuracy | Out-of-sample sensitivity | Out-of-sample specificity | AUC |
|---|---|---|---|
| 0.71 | 0.88 | 0.62 | 0.75 |
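As a rough sketch, test-set metrics of this kind could be computed with a confusion table and the ROCR package; here pred.test is assumed to hold the predicted probabilities for the test set and y_test its 0/1 labels:
pred.class = as.numeric(pred.test > pcut) # apply the tuned probability cut-off
acc = mean(pred.class == y_test) # overall accuracy
sens = sum(pred.class == 1 & y_test == 1) / sum(y_test == 1) # sensitivity (true positive rate)
spec = sum(pred.class == 0 & y_test == 0) / sum(y_test == 0) # specificity (true negative rate)
rocr.pred = ROCR::prediction(pred.test, y_test)
auc = ROCR::performance(rocr.pred, "auc")@y.values[[1]] # area under the ROC curve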
Finally, a ridge model was fit to the aggregated training and testing data. Below is a visualization of the ridge coefficients versus \(\lambda\) values. Cross-validation was again performed to tune \(\lambda\), and a value of lambda.1se = 10.59 was selected, as visualized in the cross-validation plot below; a short code sketch of this step follows.
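This is a minimal sketch of the aggregated-data fit, assuming X is the scaled aggregate predictor matrix created earlier and Y is a placeholder name for the corresponding disease labels:
cv.full = cv.glmnet(x = X, y = Y, alpha = 0, family = "binomial") # tune lambda on the aggregated data
full.fit = glmnet(x = X, y = Y, alpha = 0, lambda = cv.full$lambda.1se, family = "binomial") # refit at lambda.1se
coef(full.fit) # ridge coefficients tabulated below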
| Predictor | Ridge coefficient |
|---|---|
| (Intercept) | -0.512 |
| Entropy of speech timing (-) | -0.001 |
| Rate of speech timing (-/min) | -0.014 |
| Acceleration of speech timing (-/min2) | 0.001 |
| Duration of pause intervals (ms) | 0.016 |
| Duration of voiced intervals (ms) | 0.009 |
| Gaping in-between voiced intervals (-/min) | -0.001 |
| Duration of unvoiced stops (ms) | 0.006 |
| Decay of unvoiced fricatives (‰/min) | 0.004 |
| Relative loudness of respiration (dB) | 0.003 |
| Pause intervals per respiration (-) | -0.005 |
| Rate of speech respiration (-/min) | -0.001 |
| Latency of respiratory exchange (ms) | -0.002 |
| Entropy of speech timing (-)_1 | -0.002 |
| Rate of speech timing (-/min)_1 | -0.014 |
| Acceleration of speech timing (-/min2)_1 | 0.003 |
| Duration of pause intervals (ms)_1 | 0.016 |
| Duration of voiced intervals (ms)_1 | 0.008 |
| Gaping in-between voiced intervals (-/min)_1 | -0.002 |
| Duration of unvoiced stops (ms)_1 | 0.011 |
| Decay of unvoiced fricatives (‰/min)_1 | 0.005 |
| Relative loudness of respiration (dB)_1 | -0.004 |
| Pause intervals per respiration (-)_1 | -0.007 |
| Rate of speech respiration (-/min)_1 | 0.008 |
| Latency of respiratory exchange (ms)_1 | 0.007 |
Again, a grid search was performed for the optimal probability cut-off (pcut), as visualized below:
This model was then used to predict in-sample data with the below metrics:
| Accuracy | In-sample sensitivity | In-sample specificity | AUC |
|---|---|---|---|
| 0.69 | 0.67 | 0.7 | 0.68 |
To fully validate the preceding ridge model, I tested it on the at-risk RBD group. Below are these out-of-sample metrics.
| Accuracy | Out-of-sample sensitivity | Out-of-sample specificity | AUC |
|---|---|---|---|
| 0.62 | 0.61 | 0.63 | 0.62 |
The validated ridge model was 62% accurate and correctly identified 14 of the 23 Parkinson's-positive patients in the at-risk group. This ridge model was more accurate than the baseline logistic model with 12 variables (52% accurate, also correctly identifying 14 of the 23 Parkinson's-positive patients). Ridge was comparable to LASSO in terms of accuracy (LASSO was 64% accurate) and sensitivity (both LASSO and ridge had a sensitivity of 61%).
In the next post I will begin looking at some advanced statistical models (e.g., support vector machines and random forests). I will make use of the feature-selection functionality of LASSO with these models.