Credit: Diagrams from original research by Hlavnicka et al., licensed under the Creative Commons Attribution 4.0 International License.

1 Introduction

I analyzed a data set of audio recordings taken from the original research of Hlavnicka et al., whose goal was to use these audio recordings to accurately predict a diagnosis of Parkinson’s disease in an at-risk population. The data set can be downloaded from the UCI Machine Learning Repository.

1.1 Purpose of this post and posts to come

This post details analysis of the above data with a variety of baseline logistic models. The purpose is to examine model accuracy in predicting Parkinson’s disease as compared to 1) human voice screening, and 2) findings reported in the original research. I will compare the accuracy of models fit with all predictor variables as well as with multiple subsets of predictors. Later posts will explore advanced statistical models and advanced feature selection methods (i.e., methods to systematically reduce the number of variables for enhanced prediction performance).

1.2 Study background

Three groups were included in this data set:

  • A control group of people without Parkinson’s disease (n=50).
  • A disease group of people with newly diagnosed Parkinson’s disease (n=30).
  • An at-risk group of people with rapid eye movement (REM) sleep behavior disorder (n=50).

Per UpToDate, REM sleep behavior disorder (RBD) “is a parasomnia characterized by dream-enactment behaviors that emerge during a loss of REM sleep atonia. RBD dream enactment ranges in severity from benign hand gestures to violent thrashing, punching, and kicking. Patients typically present to medical attention with a concern related to injurious or potentially injurious actions to themselves and/or their bed partner.” Unfortunately, RBD is often a prodromal syndrome of Parkinson’s disease. Development of an automatic audio screening tool may help general physicians and physician extenders recognize when Parkinsonian symptoms are developing in an at-risk population so that protective measures can begin.

Data from the original research were generated from two audio recordings of each participant: 1) reading a standardized 80-word text twice, and 2) a 90-second monologue about interests, job, family, current events, etc. The recordings were then processed into twelve motor features of speech, yielding 24 predictor variables (twelve per recording). The original research employed quadratic discriminant function analysis to predict Parkinson’s disease from the audio recordings, with a leave-one-out cross-validation method used to train the model on the optimally accurate subset of predictor variables. This model was 70.0% accurate when evaluated on the at-risk group.

1.3 Training and testing sets

I will use data from the control and disease groups as the training/testing sets (n=80) for “tuning” models. Final validation of each model will be on data from the at-risk RBD group (n=50). This approach is comparable to the original research’s analytic method.

1.4 Evaluation of models

Evaluation of the final models in the RBD group will be based on a binary variable derived from a 27-point clinical body movement examination (i.e., the Unified Parkinson’s Disease Rating Scale, or UPDRS for short). Prior to analyzing the data, I corresponded with one of the original research authors, Dr. Jan Rusz, to confirm my understanding of the outcome variable; Dr. Rusz was instrumental in helping me understand what follows. Per Postuma et al. (as cited by Hlavnicka et al.), UPDRS scores have optimal accuracy, sensitivity and specificity for diagnosing Parkinson’s disease when the action tremor examination element is removed. What is an action tremor? Per the National Institute of Neurological Disorders and Stroke, an action tremor “occurs with the voluntary movement of a muscle.” Action tremors are frequently present in non-Parkinson’s disease processes and consequently can artificially inflate UPDRS scores. Therefore, this element should be removed prior to tallying UPDRS scores to identify Parkinson’s motor positive patients.

2 Library of libraries

R (via RStudio) will be utilized exclusively in this analysis with the following libraries:

library(dplyr) #for data manipulation
library(readr) #for reading in data
library(tidyr) #for data cleaning
library(stringr) #for string manipulation
library(ggplot2) #for data visualization
library(purrr) #for functional programming
library(ROCR) #for ROC curve generation
library(knitr) #for online report generation
library(kableExtra) #for customizing tables

3 Data Preprocessing

A binary response variable was created to represent a diagnosis of Parkinson’s disease (1 = disease present, 0 = disease absent). The variable was coded as ‘0’ in the control group (n=50) and ‘1’ in the Parkinson’s group (n=30). Coding for the RBD group required preparatory work. Action tremor elements were subtracted from the UPDRS total score in the RBD group and then the binary response variable was coded:

  • \(Response.variable = 1\) when \(UPDRS\ score > 3\), and

  • \(Response.variable = 0\) when \(UPDRS\ score \leq 3\).

Prior to removing action tremor, 30 RBD group members were classified as motor positive. This number dropped to 23 after removal of action tremor, identical to the findings reported by Hlavnicka et al.
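As a minimal sketch of this preprocessing step (the column names updrs_total and action_tremor are assumptions; dx_parkinsons is the response variable name used throughout):

# subtract the action tremor elements from the UPDRS total score,
# then code the binary response variable (column names assumed)
rbd = rbd %>%
  mutate(updrs_adj = updrs_total - action_tremor,
         dx_parkinsons = if_else(updrs_adj > 3, 1, 0))

# frequency counts shown in the table below
table(rbd$dx_parkinsons)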

Response variable frequency counts in the RBD group

  Parkinson’s?   Count
  0              27
  1              23

4 Exploratory data analysis

4.1 Summary statistical information

To simplify exploratory data analysis, the two audio recordings were averaged, resulting in 12 predictor variables. Summary statistics for the Control and Parkinson’s groups indicated that the mean differences were statistically significant for three variables (see table below).
Control and Parkinson’s group means with p-values from Welch two-sample t-tests

  Variable                           Mean (control)   Mean (Parkinson’s)   p-value
  Rate of speech timing (-/min)      327.7            290.2                0.0008
  Duration of pause intervals (ms)   174.1            225.5                0.0004
  Duration of unvoiced stops (ms)    27.1             30.3                 0.0357
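As a rough sketch of this step (the data frame df and the _reading/_monologue column suffixes are assumptions):

# average the two recordings for one feature, then compare group
# means with a Welch two-sample t-test
df = df %>%
  mutate(pause_duration_mean =
           (pause_duration_reading + pause_duration_monologue) / 2)

t.test(pause_duration_mean ~ group,
       data = filter(df, group %in% c("control", "parkinsons")))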

4.2 Visualization

Below is a scatter plot of the two variables with the greatest between-group differences. It shows that Parkinson’s group observations tend toward the lower right while controls tend toward the upper left. This initial analysis suggests that speech patterns in this early onset Parkinson’s group are characterized by a slower rate with longer pauses.
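The plot can be reproduced with ggplot2 along these lines (column names again assumed):

# scatter plot of the two most discriminating averaged variables
ggplot(filter(df, group %in% c("control", "parkinsons")),
       aes(x = pause_duration_mean, y = rate_speech_timing_mean,
           color = group)) +
  geom_point() +
  labs(x = "Duration of pause intervals (ms)",
       y = "Rate of speech timing (-/min)")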

4.3 Human perception of mild speech impairment

How capable are humans of identifying speech impairment in patients with early onset Parkinson’s? The UPDRS includes a screening item for speech impairment. Human speech screening of the at-risk RBD group identified speech impairments in only 2 of the 23 patients who were Parkinson’s positive. We will see below how accurately statistical inference can detect Parkinsonian speech impairment.

5 Logistic modeling with 12 averaged variables

Data from the control and Parkinson’s groups were randomly divided into a training set (70%) and a testing set (30%), and a logistic model was then fit on the training data with all 12 of the averaged predictor variables included (note that I used set.seed(35) in the sampling process to facilitate reproduction of this ‘stochastic’ process).
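A minimal sketch of the split and fit (the data frame df, holding dx_parkinsons plus the 12 averaged predictors, is an assumption):

# 70/30 train/test split of the control and Parkinson's observations
set.seed(35)
train_idx = sample(nrow(df), size = floor(0.7 * nrow(df)))
train = df[train_idx, ]
test = df[-train_idx, ]

# baseline logistic model with all 12 averaged predictor variables
fit = glm(dx_parkinsons ~ ., data = train, family = binomial)
summary(fit)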

Below are the logistic model coefficients generated from the training data (they are listed in ascending order of p-value) and a histogram of the predicted probabilities of training data observations using this model:

Estimated coefficients, standard errors, z-values and p-values (Pr(>|z|))

  Variable                                          Estimate   Std. Error   z value   Pr(>|z|)
  Duration of pause intervals (ms)_mean                 0.05         0.02      2.25       0.02
  Latency of respiratory exchange (ms)_mean            -0.03         0.01     -2.29       0.02
  Duration of voiced intervals (ms)_mean               -0.05         0.03     -1.66       0.10
  Rate of speech timing (-/min)_mean                   -0.05         0.04     -1.31       0.19
  Gaping in-between voiced intervals (-/min)_mean       0.08         0.06      1.28       0.20
  Decay of unvoiced fricatives (‰/min)_mean             0.29         0.34      0.84       0.40
  Pause intervals per respiration (-)_mean              0.28         0.44      0.63       0.53
  Duration of unvoiced stops (ms)_mean                 -0.04         0.08     -0.53       0.60
  Acceleration of speech timing (-/min2)_mean          -0.03         0.06     -0.49       0.62
  Entropy of speech timing (-)_mean                    13.00        58.07      0.22       0.82
  Relative loudness of respiration (dB)_mean           -0.01         0.18     -0.08       0.94
  Rate of speech respiration (-/min)_mean              -0.01         0.17     -0.07       0.95

5.1 Logistic model in-sample performance

How does this logistic model perform when used to predict observations in the training sample? Let’s look at several measures of performance, beginning with the receiver operating characteristic (ROC) curve. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) and is shown below.

The area under the curve (AUC) of the ROC can be calculated to give an overall sense of model performance across every probability threshold (0 to 1). ROC curves that hug the upper left corner correspond to higher-performing models (an AUC of 1.0 being an inerrant model). Generally speaking, an out-of-sample AUC of 0.7 or greater is considered acceptable in terms of prediction. For this baseline logistic model, the in-sample AUC was 0.86. The ROC curve raises the question: which probability should be chosen as a cut-off value for classifying observations? One strategy is to use the training/testing sample prevalence (46% of observations were Parkinson’s positive). However, I chose the optimal probability cut by weighting misclassifications, with false negatives weighted twice as much as false positives, because this tool is intended as a screen to identify people who do have Parkinson’s. Below is the code used for a grid search and the plot of cost versus probability cuts.

# choose optimal p-cut with grid search
# pred_prob holds the fitted probabilities from the training model,
# e.g. pred_prob = predict(fit, type = "response")
costfun = function(obs, pred.prob, pcut){
  class.pred = pred.prob > pcut
  # weights sum to 1; false negatives cost twice as much as false positives
  w1 = (3 + 1/3)/10  # false positive weight (1/3)
  w2 = (6 + 2/3)/10  # false negative weight (2/3)
  # false positives: predicted positive but observed negative
  FP = (class.pred == 1 & obs == 0)
  # false negatives: predicted negative but observed positive
  FN = (class.pred == 0 & obs == 1)
  # mean weighted misclassification cost
  return(mean(w1*FP + w2*FN))
}

pcut.seq = seq(0, 1, length.out = 100)
cost = rep(0, 100)
for(i in 1:100){
  cost[i] = costfun(train$dx_parkinsons, pred_prob, pcut.seq[i])
}

Note that multiple probabilities tied for the minimum cost. I used the median of these ties as the chosen probability: 0.3.
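Expressed in code, that tie-breaking step is simply:

# take the median of the probability cuts tied for minimum cost
pcut.opt = median(pcut.seq[cost == min(cost)])  # 0.3 here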

The in-sample accuracy of this model (using 0.3 as the probability cut-off) was 0.79 (a misclassification rate of 0.21). Two other important model performance metrics are sensitivity (the percentage of people with the disease correctly identified) and specificity (the percentage of people without the disease correctly identified). Sensitivity for this model on the training data was 0.95 while specificity was 0.69. These numbers were calculated from the below contingency table:
Contingency table for observed versus predicted classifications on training data

                      Predicted negative   Predicted positive
  Observed negative   24                   11
  Observed positive   1                    20
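These in-sample metrics can be computed along the following lines with the ROCR package (object names carried over from the sketches above):

# in-sample AUC via ROCR
pred_obj = prediction(pred_prob, train$dx_parkinsons)
auc = performance(pred_obj, "auc")@y.values[[1]]

# accuracy, sensitivity and specificity at the 0.3 probability cut
class.pred = as.numeric(pred_prob > 0.3)
tab = table(Observed = train$dx_parkinsons, Predicted = class.pred)
accuracy = sum(diag(tab)) / sum(tab)
sensitivity = tab["1", "1"] / sum(tab["1", ])
specificity = tab["0", "0"] / sum(tab["0", ])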

5.2 Out of sample logistic model performance on test data

Recall that 30% of the control/Parkinson’s observations were randomly chosen and set aside to assess the model’s predictive performance. The baseline logistic model above was tested on this test data set with its optimized probability cut, as sketched below.
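A minimal sketch of scoring the test set (objects carried over from the earlier sketches):

# predict on the held-out test set and classify at the optimized cut
pred_prob_test = predict(fit, newdata = test, type = "response")
class_test = as.numeric(pred_prob_test > pcut.opt)
table(Observed = test$dx_parkinsons, Predicted = class_test)

Below are the out-of-sample performance metrics for this baseline model: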
Out-of-sample logistic model performance metrics

  AUC    Accuracy   Sensitivity   Specificity
  0.63   0.58       0.67          0.53

Out-of-sample contingency table for observed versus predicted classifications

                      Predicted negative   Predicted positive
  Observed negative   8                    7
  Observed positive   3                    6

5.3 Logistic model performance on RBD group

The baseline logistic model was then refit on the aggregated training and testing data and assessed on the RBD group’s data, as sketched below.
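In outline (objects carried over from the earlier sketches):

# refit the full model on all control/Parkinson's observations (n=80),
# then score the at-risk RBD group at the optimized probability cut
fit_full = glm(dx_parkinsons ~ ., data = df, family = binomial)
pred_rbd = predict(fit_full, newdata = rbd, type = "response")
table(Observed = rbd$dx_parkinsons,
      Predicted = as.numeric(pred_rbd > pcut.opt))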

Logistic model performance metrics on RBD group

  AUC    Accuracy   Sensitivity   Specificity
  0.53   0.52       0.61          0.44

RBD group contingency table for observed versus predicted classifications

                      Predicted negative   Predicted positive
  Observed negative   12                   15
  Observed positive   9                    14

6 Logistic modeling with variable subsets

6.1 Including variables with significantly different group means (Parkinson’s group vs control group)
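This subset comprises the three variables with significantly different group means from section 4.1: rate of speech timing, duration of pause intervals, and duration of unvoiced stops. A minimal sketch of the fit (the predictor column names are assumptions):

# logistic model restricted to the three significant variables
fit_sub = glm(dx_parkinsons ~ pause_duration_mean +
                unvoiced_stop_duration_mean + rate_speech_timing_mean,
              data = train, family = binomial)
summary(fit_sub)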

Below are the model coefficients generated from the training data with this subset:

Estimated coefficients, standard errors, z-values and p-values (Pr(>|z|))

  Variable                                Estimate   Std. Error   z value   Pr(>|z|)
  Duration of pause intervals (ms)_mean       0.03         0.01      2.26       0.02
  Duration of unvoiced stops (ms)_mean       -0.03         0.06     -0.53       0.60
  Rate of speech timing (-/min)_mean          0.01         0.01      0.44       0.66

In-sample logistic model performance metrics

  AUC    Accuracy   Sensitivity   Specificity
  0.73   0.77       0.57          0.89

These numbers were calculated from the below contingency table:

In-sample contingency table for observed versus predicted classifications

                      Predicted negative   Predicted positive
  Observed negative   31                   4
  Observed positive   9                    12

6.1.1 Reduced variable logistic model out-of-sample performance

This reduced logistic model was then tested on the test data, with the below performance results:
Out-of-sample logistic model performance metrics

  AUC    Accuracy   Sensitivity   Specificity
  0.70   0.62       0.44          0.73

Out-of-sample contingency table for observed versus predicted classifications

                      Predicted negative   Predicted positive
  Observed negative   11                   4
  Observed positive   5                    4

6.1.2 Reduced variable logistic model performance on RBD group

The reduced variable logistic model was then refit on the aggregated training/testing data and assessed on the RBD group, with the below performance results. Note the improved AUC and sensitivity attained by feature reduction:
Reduced logistic model performance metrics on RBD group

  AUC    Accuracy   Sensitivity   Specificity
  0.69   0.52       0.87          0.22

RBD group contingency table for observed versus predicted classifications

                      Predicted negative   Predicted positive
  Observed negative   6                    21
  Observed positive   3                    20

6.2 Including baseline model significant coefficients

The logistic model was then refit using a subset of variables from the full baseline model; variables with a baseline-model coefficient p-value <= 0.05 were included, namely duration of pause intervals and latency of respiratory exchange.
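A minimal sketch of this fit (predictor column names are assumptions):

# model restricted to predictors with p <= 0.05 in the full baseline model
fit_p = glm(dx_parkinsons ~ pause_duration_mean + respiratory_latency_mean,
            data = train, family = binomial)
summary(fit_p)

Below are the model coefficients generated from the training data with this subset: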

Estimated coefficients, standard errors, z-values and p-values (Pr(>|z|))

  Variable                                    Estimate   Std. Error   z value   Pr(>|z|)
  Duration of pause intervals (ms)_mean           0.02         0.01      2.87       0.00
  Latency of respiratory exchange (ms)_mean      -0.01         0.01     -0.93       0.35

In-sample logistic model performance metrics

  AUC    Accuracy   Sensitivity   Specificity
  0.74   0.73       0.62          0.80

These numbers were calculated from the below contingency table:

In-sample contingency table for observed versus predicted classifications

                      Predicted negative   Predicted positive
  Observed negative   28                   7
  Observed positive   8                    13

6.2.1 Reduced variable logistic model out-of-sample performance

This reduced logistic model was then tested on the test data, with the below performance results:
Out-of-sample logistic model performance metrics

  AUC    Accuracy   Sensitivity   Specificity
  0.80   0.62       0.67          0.60

Out-of-sample contingency table for observed versus predicted classifications

                      Predicted negative   Predicted positive
  Observed negative   9                    6
  Observed positive   3                    6

6.2.2 Reduced variable logistic model performance on RBD group

The reduced variable logistic model was then refit on the aggregated training/testing data and assessed on the RBD group, with the below performance results. Note the improvement over the full 12-variable model on every metric:
Reduced logistic model performance metrics on RBD group

  AUC    Accuracy   Sensitivity   Specificity
  0.64   0.64       0.74          0.56

RBD group contingency table for observed versus predicted classifications

                      Predicted negative   Predicted positive
  Observed negative   15                   12
  Observed positive   6                    17

7 Summary and next steps

7.1 Humans versus machines

The above analysis demonstrates the use of a logistic model and some basic feature selection approaches. Human speech screening of the at-risk group identified speech impairments in 2 of the 23 people with a Parkinson’s diagnosis; the most accurate logistic model above identified speech impairment in 17 of the 23 at-risk people with a Parkinson’s diagnosis. This supports the sense that speech impairments in early onset Parkinson’s disease are largely imperceptible to the human ear. Herein lies the business use for automated voice screening tools.

7.2 Quadratic discriminant model vs logistic model

The original research (Hlavnicka et al.) employed a quadratic discriminant function analysis and leave-one-out cross-validation for feature selection. When validated on the at-risk group, that model was 70.0% accurate and correctly identified 17 of 23 people with Parkinson’s. I employed a variety of logistic models with a cost-weighted grid search to find the optimal probability cut. The 2-variable logistic model above was 64% accurate and identified 17 of the 23 people with a Parkinson’s diagnosis. This suggests that a logistic model with an optimized probability cut and basic feature selection is not as accurate as the quadratic discriminant model with advanced feature selection. However, the sensitivity of this basic logistic model appears comparable to that of the quadratic model, owing to the use of an imbalanced misclassification cost.

7.3 Next steps

Stay tuned. I will employ advanced feature selection techniques, including best subset selection and the least absolute shrinkage and selection operator (LASSO). I will also employ other advanced statistical models in combination with feature selection, including ridge regression, support vector machines, and random forests. The goal will be to explore whether prediction accuracy can be improved with these advanced methods.