Credit: Diagram from original research by Hlavnicka et al., copyrighted and licensed under a Creative Commons Attribution 4.0 International License.
I analyzed a data set of audio recordings taken from Hlavnicka et al.'s original research, whose goal was to use these audio recordings to accurately predict a diagnosis of Parkinson’s disease in an at-risk population. The data set can be downloaded from the UCI Machine Learning Repository.
This post details analysis of the above data with a variety of baseline logistic models. The purpose is to examine model accuracy in predicting Parkinson’s disease as compared to 1) human voice screening, and 2) findings reported in the original research. I will compare the accuracy of models using all predictor variables as well as multiple subsets of predictors. Later posts will explore advanced statistical models and advanced feature selection methods (i.e., methods to systematically reduce the number of variables for enhanced prediction performance).
Three groupings were included in this data set:

- a Parkinson’s disease group (n=30)
- a healthy control group (n=50)
- an at-risk group with rapid eye movement sleep behavior disorder (RBD) (n=50)
Per UpToDate, rapid eye movement sleep behavior disorder “is a parasomnia characterized by dream-enactment behaviors that emerge during a loss of REM sleep atonia. RBD dream enactment ranges in severity from benign hand gestures to violent thrashing, punching, and kicking. Patients typically present to medical attention with a concern related to injurious or potentially injurious actions to themselves and/or their bed partner.” Unfortunately, RBD is often a prodromal syndrome of Parkinson’s disease. Development of an automatic audio screening tool may help general physicians and physician extenders recognize when Parkinsonian symptoms are developing in an at-risk population so that protective measures can begin.
Data from the original research were generated from two audio recordings of each participant: 1) reading a standardized 80-word text twice, and 2) a 90-second monologue about interests, job, family, current events, etc. The recordings were then processed into measurements of twelve motor features of speech, resulting in 24 predictor variables (twelve per recording). The original research employed quadratic discriminant function analysis to predict Parkinson’s disease from the audio recordings, using a leave-one-out cross-validation method to train the model on the optimally accurate subset of predictor variables. This model was 70.0% accurate when evaluated on the at-risk group.
I will use data from the control and disease groups as the training/testing sets (n=80) for “tuning” models. Final validation of each model will be performed on data from the at-risk RBD group (n=50). This approach is comparable to the original research’s analytic method.
Evaluation of the final models in the RBD group will be based on a binary variable derived from a 27-point clinical body movement examination (i.e., the Unified Parkinson’s Disease Rating Scale, or UPDRS). Prior to analyzing the data, I corresponded with one of the original research authors, Dr. Jan Rusz, to confirm my understanding of the outcome variable; Dr. Rusz was instrumental in helping me understand what follows. Per Postuma et al. (as cited by Hlavnicka et al.), UPDRS scores have optimal accuracy, sensitivity, and specificity for diagnosing Parkinson’s disease when the action tremor examination element is removed. What is an action tremor? Per the National Institute of Neurological Disorders and Stroke, an action tremor “occurs with the voluntary movement of a muscle.” Action tremors are frequently present in non-Parkinson’s disease processes and can consequently inflate scores artificially. Therefore, this element should be removed prior to tallying UPDRS scores to identify Parkinson’s motor positive patients.
All analysis will be performed in R (via RStudio) with the following libraries:
library(dplyr) #for data manipulation
library(readr) #for reading in data
library(tidyr) #for data cleaning
library(stringr) #for string manipulation
library(ggplot2) #for data visualization
library(purrr) #for functional programming
library(ROCR) #for ROC curve generation
library(knitr) #for online report generation
library(kableExtra) #for customizing tables
A binary response variable was created to represent a diagnosis of Parkinson’s disease (1 = disease present, 0 = disease absent). The variable was coded as ‘0’ in the control group (n=50) and ‘1’ in the Parkinson’s group (n=30). Coding for the RBD group required preparatory work. Action tremor elements were subtracted from the UPDRS total score in the RBD group and then the binary response variable was coded:
\(Response.variable = 1\) when \(\text{UPDRS score} > 3\), and
\(Response.variable = 0\) when \(\text{UPDRS score} \le 3\).
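As a minimal sketch of this coding step (assuming hypothetical column names `updrs_total` and `action_tremor` in an RBD data frame `rbd`), using the dplyr syntax loaded above:

```r
# Minimal sketch: code the binary response in the RBD group
# (updrs_total and action_tremor are hypothetical column names)
rbd <- rbd %>%
  mutate(
    updrs_adj     = updrs_total - action_tremor,  # remove action tremor elements
    dx_parkinsons = if_else(updrs_adj > 3, 1, 0)  # motor positive if adjusted score > 3
  )
```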
Prior to removing action tremor, 30 RBD group members were classified as motor positive. This number dropped to 23 after removal of action tremor, identical to the findings reported by Hlavnicka et al. The resulting classification counts are shown below:
| Parkinson’s? | Count |
|---|---|
| 0 | 27 |
| 1 | 23 |
The variables with the greatest between-group differences (control vs. Parkinson’s, by Welch two-sample t-test) are shown below:

| Variable | Mean of control group | Mean of Parkinson’s group | p-value (Welch two-sample t-test) |
|---|---|---|---|
| Rate of speech timing (-/min) | 327.7 | 290.2 | 0.0008 |
| Duration of pause intervals (ms) | 174.1 | 225.5 | 0.0004 |
| Duration of unvoiced stops (ms) | 27.1 | 30.3 | 0.0357 |
Below is a scatter plot of the two variables with the greatest between-group differences. It shows that Parkinson’s group observations tend toward the lower right, while controls tend toward the upper left. This initial analysis suggests that speech patterns in this early onset Parkinson’s group are characterized by a slower rate with longer pauses.
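A sketch of how such a plot can be drawn with the ggplot2 library loaded above (the data frame `dat` and the column names are hypothetical placeholders for the two variables and the group label):

```r
# Sketch: scatter plot of the two most discriminating variables
ggplot(dat, aes(x = rate_speech_timing_mean, y = dur_pause_mean, color = group)) +
  geom_point() +
  labs(x = "Rate of speech timing (-/min)",
       y = "Duration of pause intervals (ms)",
       color = "Group")
```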
What is the human capability of identifying speech impairment in patients with early onset Parkinson’s? The UPDRS includes a screening item for speech impairment. Human speech screening of the at-risk RBD group identified speech impairments in only 2 of the 23 patients who were Parkinson’s motor positive. We will see below how accurately statistical inference can detect Parkinsonian speech impairment.
Data containing the control and Parkinson’s groups were randomly divided into a training set (70%) and a testing set (30%), and a logistic model was then run on the training data with all 12 of the averaged predictor variables included (note that I used set.seed(35) in the sampling process so this ‘stochastic’ step can be reproduced).
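A sketch of the split and the full model fit, assuming a data frame `dat` that holds the control and Parkinson’s rows with the response `dx_parkinsons` and the 12 averaged predictors:

```r
# Sketch: 70/30 train/test split and baseline logistic model with all predictors
set.seed(35)  # makes the random split reproducible
train_idx <- sample(nrow(dat), size = floor(0.7 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]
fit_full <- glm(dx_parkinsons ~ ., data = train, family = binomial)
summary(fit_full)  # coefficient table shown below
```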
Below are the logistic model coefficients generated from the training data (they are listed in ascending order of p-value) and a histogram of the predicted probabilities of training data observations using this model:
| Variable | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| Duration of pause intervals (ms)_mean | 0.05 | 0.02 | 2.25 | 0.02 |
| Latency of respiratory exchange (ms)_mean | -0.03 | 0.01 | -2.29 | 0.02 |
| Duration of voiced intervals (ms)_mean | -0.05 | 0.03 | -1.66 | 0.10 |
| Rate of speech timing (-/min)_mean | -0.05 | 0.04 | -1.31 | 0.19 |
| Gaping in-between voiced intervals (-/min)_mean | 0.08 | 0.06 | 1.28 | 0.20 |
| Decay of unvoiced fricatives (‰/min)_mean | 0.29 | 0.34 | 0.84 | 0.40 |
| Pause intervals per respiration (-)_mean | 0.28 | 0.44 | 0.63 | 0.53 |
| Duration of unvoiced stops (ms)_mean | -0.04 | 0.08 | -0.53 | 0.60 |
| Acceleration of speech timing (-/min²)_mean | -0.03 | 0.06 | -0.49 | 0.62 |
| Entropy of speech timing (-)_mean | 13.00 | 58.07 | 0.22 | 0.82 |
| Relative loudness of respiration (dB)_mean | -0.01 | 0.18 | -0.08 | 0.94 |
| Rate of speech respiration (-/min)_mean | -0.01 | 0.17 | -0.07 | 0.95 |
How does this logistic model perform when used to predict data in the training sample? Let’s look at several measures of performance, beginning with the receiver operating characteristic (ROC) curve. The ROC curve plots the true positive rate (sensitivity, or recall) against the false positive rate (1 − specificity) and is shown below.
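A sketch of generating the in-sample predicted probabilities and the ROC curve with the ROCR library loaded above (`fit_full` and `train` are from the earlier sketch):

```r
# Sketch: in-sample predicted probabilities and ROC curve via ROCR
pred_prob <- predict(fit_full, newdata = train, type = "response")
pred <- prediction(pred_prob, train$dx_parkinsons)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)  # ROC curve
auc <- performance(pred, measure = "auc")@y.values[[1]]
```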
The area under the curve (AUC) of the ROC can be calculated to give an overall sense of model performance across every probability cut-off (0 to 1). ROC curves that hug the upper left corner of the plot indicate higher-performing models (an AUC of 1.0 would be a perfect classifier). Generally speaking, an out-of-sample AUC of 0.7 or greater is considered acceptable in terms of prediction. For this baseline logistic model, the in-sample AUC was 0.86. The ROC curve raises the question: which probability should be chosen as a cut-off value for classifying observations? One strategy is to use the sample prevalence (46% of training/testing observations were Parkinson’s positive). However, I chose the optimal probability cut by weighting misclassifications: false negatives weighted twice as much as false positives. Because this tool is intended as a screen to identify people who do have Parkinson’s, missing a true case is costlier than a false alarm. Below is the code used for a grid search and the plot of cost versus probability cuts.
# choose optimal p-cut with grid search
costfun <- function(obs, pred.prob, pcut){
  class.pred <- as.numeric(pred.prob > pcut)
  w1 <- (3 + 1/3)/10  # weight on false positives (1/3)
  w2 <- (6 + 2/3)/10  # weight on false negatives (2/3, i.e., twice the FP weight)
  FP <- (class.pred == 1 & obs == 0)  # false positives
  FN <- (class.pred == 0 & obs == 1)  # false negatives
  mean(w1*FP + w2*FN)                 # weighted misclassification cost
}
pcut.seq <- seq(0, 1, length.out = 100)
cost <- rep(0, 100)
for(i in 1:100){
  cost[i] <- costfun(train$dx_parkinsons, pred_prob, pcut.seq[i])
}
# plot cost versus probability cuts
plot(pcut.seq, cost, type = "l", xlab = "Probability cut-off", ylab = "Cost")
Note that multiple probabilities tied for the minimum cost. I used the median of these ties as the chosen probability: 0.3.
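The tie-breaking step can be sketched in one line:

```r
# Sketch: median of the probability cuts tied at the minimum cost (0.3 here)
optimal.pcut <- median(pcut.seq[cost == min(cost)])
```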
The in-sample accuracy of this model (using 0.3 as the probability cut-off) was 0.79 (a misclassification rate of 0.21). Two other important performance metrics are sensitivity (the percentage of people with the disease correctly identified) and specificity (the percentage of people without the disease correctly identified). Sensitivity for this model on the training data was 0.95, while specificity was 0.69. These numbers were calculated from the below contingency table:

| | Predicted negative | Predicted positive |
|---|---|---|
| Observed negative | 24 | 11 |
| Observed positive | 1 | 20 |
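A sketch of how these metrics fall out of the contingency table, using `pred_prob`, `train`, and `optimal.pcut` from above:

```r
# Sketch: classify at the chosen cut-off, then compute accuracy, sensitivity, specificity
class.pred <- as.numeric(pred_prob > optimal.pcut)
cm <- table(Observed = train$dx_parkinsons, Predicted = class.pred)
accuracy    <- mean(class.pred == train$dx_parkinsons)
sensitivity <- cm["1", "1"] / sum(cm["1", ])  # true positives / observed positives
specificity <- cm["0", "0"] / sum(cm["0", ])  # true negatives / observed negatives
```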
Out-of-sample performance on the testing data was weaker:

| AUC | Accuracy | Sensitivity | Specificity |
|---|---|---|---|
| 0.63 | 0.58 | 0.67 | 0.53 |
| | Predicted negative | Predicted positive |
|---|---|---|
| Observed negative | 8 | 7 |
| Observed positive | 3 | 6 |
A baseline logistic model was then run on the aggregated training and testing data and assessed on the RBD group’s data with the below performance results:
| AUC | Accuracy | Sensitivity | Specificity |
|---|---|---|---|
| 0.53 | 0.52 | 0.61 | 0.44 |
| | Predicted negative | Predicted positive |
|---|---|---|
| Observed negative | 12 | 15 |
| Observed positive | 9 | 14 |
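The refit-and-validate step might look like the following sketch, with `dat` and `rbd` as in the earlier sketches (the RBD frame is assumed to contain the same predictor columns):

```r
# Sketch: refit on all control + Parkinson's data, then validate on the RBD group
fit_all  <- glm(dx_parkinsons ~ ., data = dat, family = binomial)
rbd_prob <- predict(fit_all, newdata = rbd, type = "response")
rbd_pred <- as.numeric(rbd_prob > optimal.pcut)
table(Observed = rbd$dx_parkinsons, Predicted = rbd_pred)
```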
The logistic model was then run on the three-variable subset identified by the Welch t-tests above. Below are the model coefficients generated from the training data with this subset:
| Variable | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| Duration of pause intervals (ms)_mean | 0.03 | 0.01 | 2.26 | 0.02 |
| Duration of unvoiced stops (ms)_mean | -0.03 | 0.06 | -0.53 | 0.60 |
| Rate of speech timing (-/min)_mean | 0.01 | 0.01 | 0.44 | 0.66 |
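The subset fit itself is just a shorter model formula; a sketch with hypothetical column names corresponding to the three variables above:

```r
# Sketch: logistic model on the three t-test-identified variables
fit_sub3 <- glm(dx_parkinsons ~ dur_pause_mean + dur_unvoiced_stops_mean +
                  rate_speech_timing_mean,
                data = train, family = binomial)
```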
In-sample performance of this three-variable model:

| AUC | Accuracy | Sensitivity | Specificity |
|---|---|---|---|
| 0.73 | 0.77 | 0.57 | 0.89 |
These numbers were calculated from the below contingency table:
| | Predicted negative | Predicted positive |
|---|---|---|
| Observed negative | 31 | 4 |
| Observed positive | 9 | 12 |
Out-of-sample performance on the testing data:

| AUC | Accuracy | Sensitivity | Specificity |
|---|---|---|---|
| 0.7 | 0.62 | 0.44 | 0.73 |
| | Predicted negative | Predicted positive |
|---|---|---|
| Observed negative | 11 | 4 |
| Observed positive | 5 | 4 |
The three-variable model was then refit on the aggregated training and testing data and assessed on the RBD group’s data:

| AUC | Accuracy | Sensitivity | Specificity |
|---|---|---|---|
| 0.69 | 0.52 | 0.87 | 0.22 |
| | Predicted negative | Predicted positive |
|---|---|---|
| Observed negative | 6 | 21 |
| Observed positive | 3 | 20 |
The logistic model was then run on a subset of variables from the full baseline model; variables with a baseline model coefficient p-value <= 0.05 were included in this model. Below are the model coefficients generated from the training data with this subset:
| Variable | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| Duration of pause intervals (ms)_mean | 0.02 | 0.01 | 2.87 | 0.00 |
| Latency of respiratory exchange (ms)_mean | -0.01 | 0.01 | -0.93 | 0.35 |
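For reference, the p-value filter can also be applied programmatically; a sketch assuming `fit_full` from the earlier sketch and syntactic column names:

```r
# Sketch: keep predictors whose full-model coefficient p-value is <= 0.05
coefs <- summary(fit_full)$coefficients
keep  <- setdiff(rownames(coefs)[coefs[, "Pr(>|z|)"] <= 0.05], "(Intercept)")
fit_sub2 <- glm(reformulate(keep, response = "dx_parkinsons"),
                data = train, family = binomial)
```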
In-sample performance of this two-variable model:

| AUC | Accuracy | Sensitivity | Specificity |
|---|---|---|---|
| 0.74 | 0.73 | 0.62 | 0.8 |
These numbers were calculated from the below contingency table:
| | Predicted negative | Predicted positive |
|---|---|---|
| Observed negative | 28 | 7 |
| Observed positive | 8 | 13 |
Out-of-sample performance on the testing data:

| AUC | Accuracy | Sensitivity | Specificity |
|---|---|---|---|
| 0.8 | 0.62 | 0.67 | 0.6 |
| | Predicted negative | Predicted positive |
|---|---|---|
| Observed negative | 9 | 6 |
| Observed positive | 3 | 6 |
The two-variable model was then refit on the aggregated training and testing data and assessed on the RBD group’s data:

| AUC | Accuracy | Sensitivity | Specificity |
|---|---|---|---|
| 0.64 | 0.64 | 0.74 | 0.56 |
| | Predicted negative | Predicted positive |
|---|---|---|
| Observed negative | 15 | 12 |
| Observed positive | 6 | 17 |
The above analysis demonstrates the use of a logistic model and some basic feature selection approaches. Human speech screening of the at-risk group identified speech impairments in 2 of the 23 people with a Parkinson’s diagnosis; the most accurate logistic model above identified speech impairment in 17 of the 23 at-risk people with a Parkinson’s diagnosis. This supports the impression that speech impairments in early onset Parkinson’s disease are largely imperceptible to the human ear. Herein lies the business use for automated voice screening tools.
The original research (Hlavnicka et al.) employed a quadratic discriminant function analysis and leave-one-out cross-validation for feature selection. When validated on the at-risk group, this model was 70.0% accurate and correctly identified 17 of 23 people with Parkinson’s. I employed a variety of logistic models with a cost-weighted grid search to find the optimal probability cut. The two-variable logistic model above was 64% accurate and identified 17 of the 23 people with a Parkinson’s diagnosis. This suggests that a logistic model with an optimized probability cut and basic feature selection is not as accurate as the quadratic discriminant model with advanced feature selection. However, the sensitivity of this basic logistic model appears comparable to the quadratic model, owing to the use of an imbalanced cost.
Stay tuned. I will employ advanced feature selection techniques, including best subset selection and the least absolute shrinkage and selection operator (LASSO). I will also employ other advanced statistical models in combination with feature selection, including ridge regression, support vector machines, and random forests. The goal will be to explore whether prediction accuracy can be improved with these advanced methods.