Previously I attempted to predict new-onset Parkinson’s disease from voice recordings using a basic logistic regression model with 12 predictor variables. Interestingly, reducing the number of predictors substantially improved accuracy: the model with 12 variables was 52% accurate (61% sensitive), while the model with only 2 variables was 64% accurate (74% sensitive). This curious phenomenon echoes what some call Ockham’s razor: “Don’t multiply entities beyond necessity.” Although I much prefer the poetry of complexity, Ockham’s razor carries weight across the disciplines of scientific inquiry and statistical analysis. In statistics, complex models (i.e., models with many variables) flexibly explain much of the variance within the training sample. However, that same flexibility lets the model overfit the training sample, and an overfit model loses predictive accuracy on new data. Inattention to this bias-variance tradeoff allows pernicious weeds to proliferate in the otherwise fruitful garden of educated guessing.
Enter regularization. The least absolute shrinkage and selection operator (LASSO) is a linear model fit with an added penalty term. In this post I will employ the elastic net algorithm of Zou and Hastie (as implemented in the glmnet package), which minimizes error (binomial deviance) according to the following objective function:
\[\min_{(\beta_0,\beta)\in {\Bbb R}^{p+1}} -\left[\frac{1}{n}\sum_{i=1}^{n} y_i(\beta_0 + x_i^T\beta) - \log\left(1 + e^{\beta_0 + x_i^T\beta}\right)\right] + \lambda\left[\frac{(1-\alpha)||\beta||_2^2}{2} + \alpha||\beta||_1\right]\] The first bracketed term is the negative binomial log-likelihood to be minimized. The second bracket contains the penalty term, scaled by the coefficient \(\lambda\). Note that with \(\alpha = 0\) the penalty reduces to the pure ridge regression penalty: \[\frac{\lambda||\beta||_2^2}{2}\] When \(\alpha = 1\) the penalty reduces to the pure LASSO penalty: \[\lambda||\beta||_1\] In this analysis, I will set \(\alpha = 1\) to take advantage of LASSO’s feature selection functionality. How does LASSO select features? The picture below visualizes coefficient estimation under the LASSO penalty (left) and the ridge penalty (right). The blue diamond and circle are the constraint regions; the estimated coefficients lie where the elliptical error contours first touch the constraint region. Because LASSO’s diamond-shaped region has corners, the contours often touch at a corner, shrinking a coefficient exactly to zero.
Taken from The Elements of Statistical Learning by Hastie et al.
Here, \(\lambda\) functions as a tuning coefficient: \(\lambda = 0\) results in no penalty (i.e., a basic logistic regression), whereas as \(\lambda\to \infty\) all coefficients shrink to zero. Cross-validation can be used to tune the \(\lambda\) parameter (more on this below).
R (via the RStudio IDE) will be used for this analysis. Below is the list of libraries employed:
library(dplyr) # data manipulation
library(readr) # reading in data
library(tidyr) # data cleaning
library(stringr) # string manipulation
library(ggplot2) # data visualization
library(purrr) # functional programming
library(ROCR) # ROC curve generation
library(caret) # LOOCV
library(glmnet) # for lasso modeling
library(knitr) # for online report generation
library(kableExtra) # for customizing tables
By way of reminder, this data set contains three groupings: a healthy control group, a group of patients with newly diagnosed Parkinson’s disease, and a group at risk of developing Parkinson’s due to REM sleep behavior disorder (RBD).
The control and disease groups will be used for model training/testing, and the at-risk RBD group will be used for validation. Data from the control and Parkinson’s groups were randomly divided, 70% to a training set and 30% to a testing set (note that I used set.seed(35) in the sampling process so this ‘stochastic’ process can be reproduced).
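The split itself takes only a few lines; below is a minimal sketch of how it might look, assuming the combined control/Parkinson’s rows live in the data frame training_testing (the object used later for the aggregated fit):

set.seed(35)  # makes the 'stochastic' split reproducible
n <- nrow(training_testing)
train_idx <- sample(seq_len(n), size = round(0.7 * n))  # 70% of rows
train <- training_testing[train_idx, ]                  # training set
test  <- training_testing[-train_idx, ]                 # testing set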
This data set contains 24 predictor variables: 12 speech attributes extracted from a recorded monologue and the same 12 attributes extracted a second time from recorded reading. For my previous baseline logistic analysis, I averaged these repeated measures into 12 variables to prevent the ill effects of collinearity. For the LASSO analysis, however, I will retain all 24 variables and allow the variable selection process to handle collinearity.
I used the code below to prepare the data for training, testing, and validation. I then ran an initial LASSO model and plotted the coefficients versus log-\(\lambda\). Notice how nearly all predictor coefficients are reduced to zero at every level of lambda. What happened?
# preparing component matrices/vectors for use with the glmnet package
x_train <- as.matrix(train[ , -25])       # predictors (columns 1-24)
y_train <- as_vector(train[ , 25])        # outcome (column 25)
x_test <- as.matrix(test[ , -25])
y_test <- as_vector(test[ , 25])
x_assess <- as.matrix(assess[ , -25])     # at-risk RBD validation set
y_assess <- as_vector(assess[ , 25])
X <- as.matrix(training_testing[ , -25])  # aggregated training + testing data
Y <- as_vector(training_testing[ , 25])
alph = 1  # alpha = 1 gives the pure LASSO penalty
lasso.fit = glmnet(x = x_train, y = y_train, alpha = alph, family = "binomial")
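The coefficient-path figure referenced above can be reproduced with glmnet’s built-in plot method; a one-line sketch:

# coefficient paths against log-lambda (label = TRUE tags each curve)
plot(lasso.fit, xvar = "lambda", label = TRUE)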
The coefficients produced by the LASSO method are not scale equivariant: a single penalty is applied to all coefficients, so predictors measured on larger scales are effectively penalized less. Consequently, the predictors need to be standardized to a common scale prior to analysis with LASSO. Below is the code I used to scale the data. I then ran a LASSO model and again plotted the coefficients versus log-\(\lambda\). Notice that nearly all coefficients now shrink gradually toward zero as \(\lambda\) increases.
X.train = scale(x_train)    # center and scale each predictor column
X.test = scale(x_test)
X.assess = scale(x_assess)
X = scale(X)
lasso.fit = glmnet(x = X.train, y = y_train, alpha = alph, family = "binomial")
How do we choose an optimal \(\lambda\) value? Conveniently, the glmnet package has built-in cross-validation functionality. Cross-validation partitions the training data into k folds; one fold is held out while the rest are used to fit the model at a given parameter value. The model is then tested on the held-out fold, and the process repeats until every fold has been held out once. For the \(\lambda\) tuning coefficient, a sequence of lambda values is assessed for its impact on model performance. Below is the single line of code for 10-fold cross-validation. The resulting plot shows binomial deviance versus lambda values; binomial deviance measures how poorly the logistic model fits the data, so lower is better. Notice that the deviance first decreases as \(\lambda\) increases, reaches a minimum, and then climbs as the penalty becomes too aggressive and the model begins to underfit.
cv.lasso = cv.glmnet(x = X.train, y = y_train, family = "binomial")  # 10-fold CV by default
Binomial deviance versus lambda values
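The deviance plot and the two candidate \(\lambda\) values reported below can be pulled straight from the cv.glmnet object; a brief sketch:

plot(cv.lasso)          # binomial deviance vs log(lambda), with error bars
cv.lasso$lambda.min     # lambda minimizing cross-validated deviance
cv.lasso$lambda.1se     # largest lambda within 1 SE of the minimum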
| Lambda minimum (lambda.min) | Lambda 1 standard error (lambda.1se) |
|---|---|
| 0.07 | 0.135 |
Using this tuned \(\lambda\) value, I ran the elastic net on the data again and used the model to predict disease outcome on the training data. The model selected the two variables with non-zero LASSO coefficients shown below:
lambda = cv.lasso$lambda.1se  # tuned value from CV (the 1-SE choice, matching the final model below)
lasso.fit = glmnet(x = X.train, y = y_train, alpha = alph, lambda = lambda, family = "binomial")
pred.lasso = predict(lasso.fit, newx = X.train, s = lambda, type = "response")
| Variable | LASSO coefficient |
|---|---|
| (Intercept) | -0.443 |
| Entropy of speech timing (-) | 0.000 |
| Rate of speech timing (-/min) | 0.000 |
| Acceleration of speech timing (-/min²) | 0.000 |
| Duration of pause intervals (ms) | 0.276 |
| Duration of voiced intervals (ms) | 0.000 |
| Gaping in-between voiced intervals (-/min) | 0.000 |
| Duration of unvoiced stops (ms) | 0.000 |
| Decay of unvoiced fricatives (‰/min) | 0.000 |
| Relative loudness of respiration (dB) | 0.000 |
| Pause intervals per respiration (-) | 0.000 |
| Rate of speech respiration (-/min) | 0.000 |
| Latency of respiratory exchange (ms) | 0.000 |
| Entropy of speech timing (-)_1 | 0.000 |
| Rate of speech timing (-/min)_1 | 0.000 |
| Acceleration of speech timing (-/min²)_1 | 0.000 |
| Duration of pause intervals (ms)_1 | 0.127 |
| Duration of voiced intervals (ms)_1 | 0.000 |
| Gaping in-between voiced intervals (-/min)_1 | 0.000 |
| Duration of unvoiced stops (ms)_1 | 0.000 |
| Decay of unvoiced fricatives (‰/min)_1 | 0.000 |
| Relative loudness of respiration (dB)_1 | 0.000 |
| Pause intervals per respiration (-)_1 | 0.000 |
| Rate of speech respiration (-/min)_1 | 0.000 |
| Latency of respiratory exchange (ms)_1 | 0.000 |
Here again, I used an imbalanced cost function (false negatives weighted twice as heavily as false positives) with a grid search to determine the probability cutoff that minimizes the cost: pcut = 0.38.
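A minimal sketch of that grid search follows, assuming the outcome is coded 0/1; the 2:1 weighting matches the text, while the search grid is my own choice:

# asymmetric cost: false negatives count double
cost <- function(actual, prob, pcut) {
  pred <- ifelse(prob > pcut, 1, 0)
  mean(2 * (actual == 1 & pred == 0) + (actual == 0 & pred == 1))
}

pcuts <- seq(0.05, 0.95, by = 0.01)                       # candidate cutoffs
costs <- map_dbl(pcuts, ~ cost(y_train, pred.lasso, .x))  # purrr grid search
pcut  <- pcuts[which.min(costs)]                          # ~0.38 here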
| Accuracy | In-sample sensitivity | In-sample specificity | AUC |
|---|---|---|---|
| 0.73 | 0.68 | 0.76 | 0.72 |
The above LASSO model was then tested on the testing data set with the below metrics:
| Accuracy | Out-of-sample sensitivity | Out-of-sample specificity | AUC |
|---|---|---|---|
| 0.67 | 0.62 | 0.69 | 0.66 |
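One plausible way to produce these test-set numbers, using the scaled test matrix and the tuned cutoff (a sketch, not my exact original code):

# score the held-out test set and classify at the tuned cutoff
pred.test  <- predict(lasso.fit, newx = X.test, s = lambda, type = "response")
class.test <- ifelse(pred.test > pcut, 1, 0)

# AUC via ROCR
roc.pred <- prediction(as.numeric(pred.test), y_test)
performance(roc.pred, "auc")@y.values[[1]]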
Finally, a LASSO model was fit on the aggregated training and testing data. Below is a visualization of the LASSO coefficients versus \(\lambda\) values. Cross-validation was again performed for \(\lambda\) tuning, and a value of lambda.1se = 0.13 was selected, as visualized below.
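A sketch of the refit on the aggregated matrices X and Y (names from the preparation code above; cv.final and lasso.final are hypothetical names for illustration). The coefficients it yields appear in the table below:

# re-tune lambda on the aggregated data, then refit at lambda.1se
cv.final    = cv.glmnet(x = X, y = Y, family = "binomial")
lasso.final = glmnet(x = X, y = Y, alpha = alph,
                     lambda = cv.final$lambda.1se, family = "binomial")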
| Variable | LASSO coefficient |
|---|---|
| (Intercept) | -0.518 |
| Duration of pause intervals (ms) | 0.178 |
| Duration of pause intervals (ms)_1 | 0.151 |
Again, a grid search was performed for the optimal probability cutoff (pcut), as visualized below:
This model was then used to predict the in-sample data, with the metrics below:
| Accuracy | In-sample sensitivity | In-sample specificity | AUC |
|---|---|---|---|
| 0.71 | 0.63 | 0.76 | 0.7 |
To fully validate the preceding LASSO model, I tested it on the at-risk RBD group. Below are these out-of-sample metrics.
| Accuracy | Out-of-sample sensitivity | Out-of-sample specificity | AUC |
|---|---|---|---|
| 0.64 | 0.61 | 0.67 | 0.64 |
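For completeness, a hedged sketch of how these validation metrics might be computed with the objects defined above (lasso.final is the hypothetical name from the refit sketch):

# predict the at-risk RBD group and classify at the tuned cutoff
pred.assess  <- predict(lasso.final, newx = X.assess, type = "response")
class.assess <- ifelse(pred.assess > pcut, 1, 0)
mean(class.assess == y_assess)  # out-of-sample accuracy (~0.64 per the table)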
I used the LASSO model for feature selection. The final model retained 2 of the original 24 variables and achieved 64% accuracy, correctly identifying 14 of the 23 Parkinson’s-positive patients in the at-risk group. The LASSO was superior to the baseline logistic model with 12 variables (52% accurate; it also identified 14 of the 23 Parkinson’s-positive patients) and comparable to the logistic model with basic feature selection (64% accurate; identified 17 of the 23). LASSO provides a statistically principled method for reducing features while improving model accuracy.
Having explored the elastic net with the LASSO penalty, I will next demonstrate the ridge penalty for coefficient regularization and compare ridge regression’s performance with that of the validated LASSO model above.