Previously I attempted to predict new-onset Parkinson’s disease from voice recordings using a basic logistic regression model with 12 predictor variables. Interestingly, reducing the number of predictors substantially improved accuracy: the model with 12 variables was 52% accurate (61% sensitive), while the model with only 2 variables was 64% accurate (74% sensitive). This curious phenomenon echoes what some call Ockham’s razor: “Don’t multiply entities beyond necessity.” Although I much prefer the poetry of complexity, Ockham’s razor carries weight across the disciplines of scientific inquiry and statistical analysis. In statistics, complex models (i.e., models with many variables) flexibly explain much of the variance within the training sample. However, that same flexibility lets the model overfit the training sample, and an overfit model loses predictive accuracy on new data. Inattention to this bias-variance tradeoff allows pernicious weeds to proliferate in the otherwise fruitful garden of educated guessing.
Enter regularization. The least absolute shrinkage and selection operator (LASSO) is a linear model fit with an added penalty term. In this post I will employ the elastic net algorithm of Zou and Hastie (as implemented in the glmnet package), which minimizes error (binomial deviance) according to the following objective function:
\[\min_{(\beta_0,\beta)\in {\Bbb R}^{p+1}} -\left[\frac{1}{n}\sum_{i=1}^{n} y_i(\beta_0 + x_i^T\beta) - \log\left(1 + e^{\beta_0 + x_i^T\beta}\right)\right] + \lambda\left[\frac{(1-\alpha)||\beta||_2^2}{2} + \alpha||\beta||_1\right]\] The first bracketed term is the negative binomial log-likelihood to be minimized. The second bracket contains the penalty term, scaled by the coefficient \(\lambda\). Note that with \(\alpha = 0\) the penalty reduces to the pure ridge regression penalty: \[\frac{\lambda||\beta||_2^2}{2}\] When \(\alpha = 1\) the penalty reduces to the pure LASSO penalty: \[\lambda||\beta||_1\] In this analysis, I will set \(\alpha = 1\) to take advantage of LASSO’s feature selection functionality. How does LASSO select features? The picture below visualizes coefficient estimation under the LASSO penalty (left) and the ridge penalty (right). The blue diamond and circle are the constraint regions; the estimated coefficients lie where the elliptical error contours first touch the constraint region. Because LASSO’s diamond-shaped region has corners, the contours often touch at a corner, shrinking a coefficient exactly to zero.
Taken from The Elements of Statistical Learning by Hastie et al.
Here, \(\lambda\) functions as a tuning coefficient: \(\lambda = 0\) results in no penalty (i.e., a basic logistic regression), whereas as \(\lambda\to \infty\) all coefficients shrink to zero. Cross-validation can be used to tune the \(\lambda\) parameter (more on this below).
R (via the RStudio IDE) will be used for this analysis. Below is the list of libraries employed:
library(dplyr) # data manipulation
library(readr) # reading in data
library(tidyr) # data cleaning
library(stringr) # string manipulation
library(ggplot2) # data visualization
library(purrr) # functional programming
library(ROCR) # ROC curve generation
library(caret) # LOOCV
library(glmnet) # for lasso modeling
library(knitr) # for online report generation
library(kableExtra) # for customizing tables
By way of reminder, this data set contains three groupings: a healthy control group, a group of patients with newly diagnosed Parkinson’s disease, and a group at risk of developing Parkinson’s due to REM sleep behavior disorder (RBD).
The control and disease groups will be used for model training/testing, and the at-risk RBD group will be used for validation. Data from the control and Parkinson’s groups were randomly divided, 70% to a training set and 30% to a testing set (note that I used set.seed(35) in the sampling process so this ‘stochastic’ process can be reproduced).
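The split itself takes only a few lines; below is a minimal sketch of how it might look, assuming the combined control/Parkinson’s rows live in the data frame training_testing (the object used later for the aggregated fit):

set.seed(35)  # makes the 'stochastic' split reproducible
n <- nrow(training_testing)
train_idx <- sample(seq_len(n), size = round(0.7 * n))  # 70% of rows
train <- training_testing[train_idx, ]                  # training set
test  <- training_testing[-train_idx, ]                 # testing set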
This data set contains 24 predictor variables: 12 speech attributes extracted from a recorded monologue and the same 12 attributes extracted a second time from recorded reading. For my previous baseline logistic analysis, I averaged these repeated measures into 12 variables to prevent the ill effects of collinearity. For the LASSO analysis, however, I will retain all 24 variables and allow the variable selection process to handle collinearity.
I used the code below to prepare the data for training, testing, and validation. I then ran an initial LASSO model and plotted the coefficients versus log-\(\lambda\). Notice how nearly all predictor coefficients are reduced to zero at every level of lambda. What happened?
# preparing component matrices/vectors for use with the glmnet package
x_train <- as.matrix(train[ , -25])       # predictors (columns 1-24)
y_train <- as_vector(train[ , 25])        # outcome (column 25)
x_test <- as.matrix(test[ , -25])
y_test <- as_vector(test[ , 25])
x_assess <- as.matrix(assess[ , -25])     # at-risk RBD validation set
y_assess <- as_vector(assess[ , 25])
X <- as.matrix(training_testing[ , -25])  # aggregated training + testing data
Y <- as_vector(training_testing[ , 25])
alph = 1  # alpha = 1 gives the pure LASSO penalty
lasso.fit = glmnet(x = x_train, y = y_train, alpha = alph, family = "binomial")
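The coefficient-path figure referenced above can be reproduced with glmnet’s built-in plot method; a one-line sketch:

# coefficient paths against log-lambda (label = TRUE tags each curve)
plot(lasso.fit, xvar = "lambda", label = TRUE)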
The coefficients produced by the LASSO method are not scale equivariant: a single penalty is applied to all coefficients, so predictors measured on larger scales are effectively penalized less. Consequently, the predictors need to be standardized to a common scale prior to analysis with LASSO. Below is the code I used to scale the data. I then ran a LASSO model and again plotted the coefficients versus log-\(\lambda\). Notice that nearly all coefficients now shrink gradually toward zero as \(\lambda\) increases.
X.train = scale(x_train)    # center and scale each predictor column
X.test = scale(x_test)
X.assess = scale(x_assess)
X = scale(X)
lasso.fit = glmnet(x = X.train, y = y_train, alpha = alph, family = "binomial")
How do we choose an optimal \(\lambda\) value? Conveniently, the glmnet package has built-in cross-validation functionality. Cross-validation partitions the training data into k folds; one fold is held out while the rest are used to fit the model at a given parameter value. The model is then tested on the held-out fold, and the process repeats until every fold has been held out once. For the \(\lambda\) tuning coefficient, a sequence of lambda values is assessed for its impact on model performance. Below is the single line of code for 10-fold cross-validation. The resulting plot shows binomial deviance versus lambda values; binomial deviance measures how poorly the logistic model fits the data, so lower is better. Notice that the deviance first decreases as \(\lambda\) increases, reaches a minimum, and then climbs as the penalty becomes too aggressive and the model begins to underfit.
cv.lasso = cv.glmnet(x = X.train, y = y_train, family = "binomial")  # 10-fold CV by default
Binomial deviance versus lambda values
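The deviance plot and the two candidate \(\lambda\) values reported below can be pulled straight from the cv.glmnet object; a brief sketch:

plot(cv.lasso)          # binomial deviance vs log(lambda), with error bars
cv.lasso$lambda.min     # lambda minimizing cross-validated deviance
cv.lasso$lambda.1se     # largest lambda within 1 SE of the minimum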
| Lambda minimum (lambda.min) | Lambda 1 standard error (lambda.1se) |
|---|---|
| 0.07 | 0.135 |
Using this tuned \(\lambda\) value, I ran the elastic net on the data again and used the model to predict disease outcome on the training data. The model selected the two variables with non-zero LASSO coefficients shown below:
lambda = cv.lasso$lambda.1se  # tuned value from CV (the 1-SE choice, matching the final model below)
lasso.fit = glmnet(x = X.train, y = y_train, alpha = alph, lambda = lambda, family = "binomial")
pred.lasso = predict(lasso.fit, newx = X.train, s = lambda, type = "response")
| Variable | LASSO coefficient |
|---|---|
| (Intercept) | -0.443 |
| Entropy of speech timing (-) | 0.000 |
| Rate of speech timing (-/min) | 0.000 |
| Acceleration of speech timing (-/min²) | 0.000 |
| Duration of pause intervals (ms) | 0.276 |
| Duration of voiced intervals (ms) | 0.000 |
| Gaping in-between voiced intervals (-/min) | 0.000 |
| Duration of unvoiced stops (ms) | 0.000 |
| Decay of unvoiced fricatives (‰/min) | 0.000 |
| Relative loudness of respiration (dB) | 0.000 |
| Pause intervals per respiration (-) | 0.000 |
| Rate of speech respiration (-/min) | 0.000 |
| Latency of respiratory exchange (ms) | 0.000 |
| Entropy of speech timing (-)_1 | 0.000 |
| Rate of speech timing (-/min)_1 | 0.000 |
| Acceleration of speech timing (-/min²)_1 | 0.000 |
| Duration of pause intervals (ms)_1 | 0.127 |
| Duration of voiced intervals (ms)_1 | 0.000 |
| Gaping in-between voiced intervals (-/min)_1 | 0.000 |
| Duration of unvoiced stops (ms)_1 | 0.000 |
| Decay of unvoiced fricatives (‰/min)_1 | 0.000 |
| Relative loudness of respiration (dB)_1 | 0.000 |
| Pause intervals per respiration (-)_1 | 0.000 |
| Rate of speech respiration (-/min)_1 | 0.000 |
| Latency of respiratory exchange (ms)_1 | 0.000 |
Here again, I used an imbalanced cost function (false negatives weighted twice as heavily as false positives) with a grid search to determine the probability cutoff that minimizes the cost: pcut = 0.38.
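A minimal sketch of that grid search follows, assuming the outcome is coded 0/1; the 2:1 weighting matches the text, while the search grid is my own choice:

# asymmetric cost: false negatives count double
cost <- function(actual, prob, pcut) {
  pred <- ifelse(prob > pcut, 1, 0)
  mean(2 * (actual == 1 & pred == 0) + (actual == 0 & pred == 1))
}

pcuts <- seq(0.05, 0.95, by = 0.01)                       # candidate cutoffs
costs <- map_dbl(pcuts, ~ cost(y_train, pred.lasso, .x))  # purrr grid search
pcut  <- pcuts[which.min(costs)]                          # ~0.38 here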
| Accuracy | In-sample sensitivity | In-sample specificity | AUC |
|---|---|---|---|
| 0.73 | 0.68 | 0.76 | 0.72 |
The above LASSO model was then tested on the testing data set with the below metrics:
| Accuracy | Out-of-sample sensitivity | Out-of-sample specificity | AUC |
|---|---|---|---|
| 0.67 | 0.62 | 0.69 | 0.66 |
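One plausible way to produce these test-set numbers, using the scaled test matrix and the tuned cutoff (a sketch, not my exact original code):

# score the held-out test set and classify at the tuned cutoff
pred.test  <- predict(lasso.fit, newx = X.test, s = lambda, type = "response")
class.test <- ifelse(pred.test > pcut, 1, 0)

# AUC via ROCR
roc.pred <- prediction(as.numeric(pred.test), y_test)
performance(roc.pred, "auc")@y.values[[1]]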
Finally, a LASSO model was fit on the aggregated training and testing data. Below is a visualization of the LASSO coefficients versus \(\lambda\) values. Cross-validation was again performed for \(\lambda\) tuning, and a value of lambda.1se = 0.13 was selected, as visualized below.
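A sketch of the refit on the aggregated matrices X and Y (names from the preparation code above; cv.final and lasso.final are hypothetical names for illustration). The coefficients it yields appear in the table below:

# re-tune lambda on the aggregated data, then refit at lambda.1se
cv.final    = cv.glmnet(x = X, y = Y, family = "binomial")
lasso.final = glmnet(x = X, y = Y, alpha = alph,
                     lambda = cv.final$lambda.1se, family = "binomial")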
| Variable | LASSO coefficient |
|---|---|
| (Intercept) | -0.518 |
| Duration of pause intervals (ms) | 0.178 |
| Duration of pause intervals (ms)_1 | 0.151 |
Again, a grid search was performed for the optimal probability cutoff (pcut), as visualized below:
This model was then used to predict the in-sample data, with the metrics below:
| Accuracy | In-sample sensitivity | In-sample specificity | AUC |
|---|---|---|---|
| 0.71 | 0.63 | 0.76 | 0.7 |
To fully validate the preceding LASSO model, I tested it on the at-risk RBD group. Below are these out-of-sample metrics.
| Accuracy | Out-of-sample sensitivity | Out-of-sample specificity | AUC |
|---|---|---|---|
| 0.64 | 0.61 | 0.67 | 0.64 |
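For completeness, a hedged sketch of how these validation metrics might be computed with the objects defined above (lasso.final is the hypothetical name from the refit sketch):

# predict the at-risk RBD group and classify at the tuned cutoff
pred.assess  <- predict(lasso.final, newx = X.assess, type = "response")
class.assess <- ifelse(pred.assess > pcut, 1, 0)
mean(class.assess == y_assess)  # out-of-sample accuracy (~0.64 per the table)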
I used the LASSO model for feature selection. The final model retained 2 of the original 24 variables and achieved 64% accuracy, correctly identifying 14 of the 23 Parkinson’s-positive patients in the at-risk group. The LASSO was superior to the baseline logistic model with 12 variables (52% accurate; it also identified 14 of the 23 Parkinson’s-positive patients) and comparable to the logistic model with basic feature selection (64% accurate; identified 17 of the 23). LASSO provides a statistically principled method for reducing features while improving model accuracy.
Having explored the elastic net with the LASSO penalty, I will next demonstrate the ridge penalty for coefficient regularization and compare ridge regression’s performance with that of the validated LASSO model above.