Model-based and Model-free Machine Learning Techniques for Diagnostic Prediction and Classification of Clinical Outcomes in Parkinson’s Disease

Chao Gao, Hanbo Sun, Ivo Dinov et al.

1/10/2018

Paper Link: https://www.nature.com/articles/s41598-018-24783-4

1 Introduction

PD clinical characteristics, current state-of-the-art techniques, societal impact

Parkinson’s disease (PD) is a common neurodegenerative disorder that affects over 10 million people worldwide. PD affects about 1% of people over 60 years of age and the prevalence increases with age. People with PD experience a range of motor and non-motor symptoms that include tremor, rigidity, bradykinesia, postural instability, gait disturbances such as freezing of gait (FoG), autonomic disturbances, affective disorders, sleep disturbances, and cognitive deficits. These symptoms markedly impact and curtail health related quality of life. Freezing of gait and associated falls represent one of the most serious consequences of PD. Falls are much more common in patients with PD than in age-matched controls and falls often lead to reduced functional independence, increased morbidity, and higher mortality4. The ability to better identify future fallers from non-fallers could inform more effective treatment and personalized medicine planning.

The hallmark pathology of PD is loss of dopamine in the striatum secondary to progressive degeneration of dopaminergic cells in the substantia nigra pars compacta, accompanied by the formation of Lewy bodies5. A variable combination of tremor, rigidity, and bradykinesia symptoms may present along with postural instability and gait difficulty (PIGD) features. Because of primary involvement of the basal ganglia in PD, it has often been asserted that these motor features are mainly attributable to nigrostriatal dopaminergic loss. A common dopamine replacement therapy to ameliorate PD motor symptom is levodopa (L-DOPA). A recent study from Vu et al. showed that L-DOPA potency was lowest for PIGD features compared to other cardinal motor features. In the Sydney Multicenter Study of PD, patients have been followed for about two decades. Results of this study indicate that dopamine non-responsive problems dominate 15 years after initial assessments and include frequent falls, which occurs in 81% of the patients. Similar findings were recently reported by López et al. after following de novo PD patients for 10 years. These authors reported good responses to dopaminergic treatment in the first year with a progressive decline, becoming more manifest especially after 3 years. Significant PIGD motor disabilities arose at 10 years in 71% of patients that were mainly caused by non-dopamine-responsive features such as freezing of gait (FoG). The L-DOPA resistance of PIGD motor features has been proposed to include non-dopaminergic structures in widespread brain regions. As axial motor impairments, in particular falls, do not respond well to dopaminergic medications there is a need to identify early predictors of falls. Such predictors may provide potential clues about underlying mechanism of falls that may more effectively inform future treatment interventions. The main goal of this study was to identify clinical and MR imaging predictors of falls from two independent archives containing clinical and imaging data of PD patients.

Machine Learning methods for prediction, classification, forecasting and data-mining

Both model-based and model-free techniques may be employed for prediction of specific clinical outcomes or diagnostic phenotypes. The application of model-based approaches heavily depends on the a priori statistical statements, such as specification of relationship between variables (e.g. independence) and the model-specific assumptions regarding the process probability distributions (e.g., the outcome variable may be required to be binomial). Examples of model-based methods include generalized linear models. Logistic regression is one of the most commonly used model-based tools, which is applicable when the outcome variables are measured on a binary scale (e.g., success/failure) and follow Bernoulli distribution10. Hence, the classification process can be carried out based on the estimated probabilities. Investigators have to carefully examine and confirm the model assumptions and choose appropriate link functions. Since the statistical assumptions do not always hold in real life problems, especially for big incongruent data, the model-based methods may not be applicable or may generate biased results.

In contrast, model-free methods adapt to the intrinsic data characteristics without the use of any a priori models and with fewer assumptions. Given complicated information, model-free techniques are able to construct non-parametric representations, which may also be referred as (non-parametric) models, using machine learning algorithms or ensembles of multiple base learners without simplification of the problem. In the present study, several model-free methods are utilized, e.g., Random Forest, AdaBoost, XGBoost, Support Vector Machines, Neural Network, and SuperLearner. These algorithms benefit from constant learning, or retraining, as they do not guarantee optimized classification/regression results. However, when trained, maintained and reinforced properly and effectively, model-free machine learning methods have great potential in solving real-world problems (prediction and data-mining). The morphometric biomarkers that were identified and reported here may be useful for clinical decision support and assist with diagnosis and monitoring of Parkinson’s disease.

There are prior reports of using model-free machine-learning techniques to diagnose Parkinson’s disease. For instance, Abos et al. explored connection-wise patterns of functional connectivity to discriminate PD patients according to their cognitive status. They reported an accuracy of 80.0% for classifying a validation sample independent of the training dataset. Dinesh and colleagues employed (boosted) decision trees to forecast PD. Their approach was based on analyzing variations in voice patterns of PD patients and unaffected subjects and reported average prediction accuracy of 91–95%. Peng et al. used machine learning method for detection of morphometric biomarkers in Parkinson’s disease. Their multi-kernel support vector machine classifier performed well with average accuracy=86%, specificity=88%, and sensitivity=88%. Another group of researchers developed a novel feature selection technique to predict PD based on multi-modal neuroimaging data and using support vector classification. Their cross-validation results of predicting three types of patients, normal controls, subjects without evidence of dopaminergic denervation (SWEDDs), and PD patients reported classification accuracy about 89–90%. Bernad-Elazari et al. applied a machine learning approach to distinguish between subjects with and without PD. Their objective characterization of daily living transitions in patients with PD used a single body-fixed sensor, successfully distinguishing mild patients from healthy older adults with an accuracy of 86%. Previously identified biomarkers, as well as the salient features determined in our study, may be useful for improving the diagnosis, prognosticating the course, and tracking the progression of the disease over time.

Study Goals This study aims to address four complementary challenges. To address the need for effective data management and reliable data accumulation, Challenge 1 involves designing a protocol for harmonizing and aggregating complex, multisource, and multi-site Parkinson’s disease data. We applied machine learning techniques and controlled variable selection, e.g., knockoff filtering, to address Challenge 2, identify salient predictive features associated with specific clinical traits, e.g., patient falls. Challenge 3 involves forecasting patient falls using alternative techniques based on the selected features and evaluating the classification performance using internal (statistical) and external (prospective data) validation. Finally, Challenge 4, addresses the need to forecast other clinically relevant traits like Parkinson’s phenotypes, e.g., tremor dominance (TD) vs. posture instability and gait difficulty (PIGD).

Predictive Analytic Strategy The datasets used in this study were collected independently at two sites – the University of Michigan Udall Center of Excellence in Parkinson’s Disease Research (Michigan data) and the Sourasky Medical Center, Israel (Tel-Aviv data). Both the datasets include high dimensional data consisting of several hundred demographic and clinical features for about a couple of hundred PD patients. This research is focused primarily on the prediction of patients’ falls, although alternative clinical outcomes and diagnostic phenotypes can be explored using the same approach. As not all of the features in the clinical record are strongly associated with each specific response, our goal is to identify some important critical features, build the simplest statistical models, and demonstrate reproducible computational classifies that produce higher prediction accuracy while avoiding overfitting. Figure 1 shows a high-level schematic of the study-design, including the complementary training and testing strategies.

In general, model-free statistical learning methods (e.g. Random Forest, Support Vector Machines) make fewer assumptions and often outperform model-based statistical techniques like logistic regression, which is often considered a baseline method, on large and complex biomedical data. To quantify the forecasting results, we used established evaluation metrics such as overall accuracy, sensitivity, specificity, positive and negative predictive power, and log odds ratio. For clinical datasets with a large number of features, it is difficult to avoid the multi-collinearity problem, which causes problems with maximum likelihood estimation of model-based techniques. As the machine learning techniques have minimal statistical assumptions, they may provide more flexible and reliable predictions.

2 EDA

2.1 Overall Summaries

Heatmaps illustrates correlation of some core data features. There are some differences between the paired correlations between features and across data archives. For instance, gait-speed is strongly negatively correlated with tremor score, PIGD score, BMI, Hoehn and Yahr scale (H&Y), and GDS-SF (Geriatric Depression Scale - short form), whereas PIGD (MDS_PIGD) is strongly-positively correlated with TUG (Timed Up and Go test), GDS-SF, BMI, and Hoehn and Yahr scale. We also found that gait speed is negatively correlated with postural stability (pos_stab). The presence of more severe postural instability and gait difficulties is not robustly correlated with the non-motor experiences of daily living in the patient. The non-motor experiences of daily living reflect impairments of cognition, mood, sleep and autonomic functions. Although axial impairments are generally associated with cognitive impairments in PD, the lack of significant associations with overall non-motor experiences of daily living may be due to the heterogeneous (cognitive and non-cognitive) nature of this MDS UPDRS subscale.

Pair correlations of some features, separately for each of the three datasets used in the study. (A) Michigan data boxplots illustrating significant differences in MDS_TREM (p = 0.5465), MDS_PIGD (p < 0.001), H and Y scale (p < 0.001), gaitSpeed_Off (p < 0.001) between PD patients with and without a history of falls, based on MWW test. (No = 0, Yes = 1). (B) Tel-Aviv data boxplots illustrating significant differences in Tremor_score (p = 0.01094), PIGD_score (p < 0.001), H and Y scale (p < 0.001) and FOG_Q (p < 0.001) between PD patients with and without a history of falls, based on MWW test. (No = 0, Yes = 1).

A B C

EDA Plots for Michigan and Tel-Aviv Data

These figure demonstrates exploratory data analytics (EDA) including univariate and multivariate distributions contrasting the Michigan and Tel-Aviv populations.

Results of multidimensional scaling (MDS) projection of the original data points on the first two MDS dimensions, suggesting that the Falls/No-fall classification is challenging (Tel-Aviv data).

2.2 Missing Data Plots

This figure illustrates the missing data patterns for both, the Michigan and the Tel-Aviv datasets. This lower dimensional projection suggests that the two cohorts are quite entangled, which may present a challenge in classification of falls/no-fall.

3 Results

3.1 Challenge 1. Harmonizing and aggregating complex multi-source and multisite Parkinson’s disease data

Data Aggregation: Since the data were acquired in independent studies at two separate institutions, not all the features collected were homologous. Even common features contained in both archives had some with substantially different distributions, according to Kolmogorov–Smirnov test. This figure shows the Kolmogorov–Smirnov tests carried out on all the numeric features (126 in total) that were common in both, Michigan and Tel-Aviv, datasets. Some extremely small 𝑝-values were slightly transformed, i.e., replaced by the minimum of the other non-zero 𝑝-values, to ensure that the logarithmic y-axis scale is correctly plotted.False Discovery Rate (FDR) was used to control the false-positive rate at the level of 0.01. Thus, among the set of rejected null hypotheses, the expected proportion of false discoveries is limited to 1%. Assuming the tests are independent, the FDR control is achieved by calculating 𝑞-values (Benjamini/Hochberg FDR adjusted 𝑝-val36 for each test and rejecting those with 𝑞-value <0.01. The red line represents the −log(0.01) cutoff value.

This figure includes examples of feature distributions in these two datasets showing some similarity and some differences.

As the study subjects in both Udall and Tel-Aviv datasets represent Parkinson’s disease patients, an aggregate dataset was generated to increase the number of training and testing cases and examine the performance of the predictive analytics on the complete data. We used normalization (centering and scaling) of the data elements prior to their aggregation.

This graph shows batch effects on the aggregate dataset using two alternative standardization techniques – normalize two data sets separately prior to aggregation vs. aggregate and normalize the combined data. To illustrate the similarities and differences between the pair of standardization techniques we show 2D projections of the data in each paradigm (top and bottom) using both multidimensional scaling (MDS, left) and t-distributed Stochastic Neighbor Embedding (tSNE, right).

Batch effects do not represent underlying biological variability. Rather, they reflect technical sources of data variation due to handling of the samples. To untangle batch technical variation from intrinsic biomedical process variability we need to carefully select the data harmonization, normalization and aggregation strategies to avoid unintended bias. In this case, we chose to normalize each of the two datasets separately prior to their aggregation into the combined Udall+TelAviv dataset.

3.2 Challenge 2. Identification of predictors highly associated with patients’ falls

In this part, we aim to identify for the strongest predictors for patients’ falls for each of the three datasets, Udall, Tel-Aviv, and the aggregated Udall+TelAviv. We carry out feature selection using two different methods: random forest (RF) and Knockoff filtering (KO). For each dataset, both feature selection techniques identify the top 20 selected variables. MWW test and KS test are used to compare the distributions of these features between patient subgroups (Falls vs. No-falls). We try to identify commonly selected features by both techniques that also show significant differences on the MWW and KS tests.

Michigan dataset: We consider common variables selected by both LASSO and Knockoff (FDR=0.35) as the “potentially falls-associated features”. In addition, candidate features that are significantly different on both MWW and KS tests across two cohorts (“fall” and “non-fall”) are considered “falls-associated features”. Regularized (LASSO) linear modeling rejects all genetic features, the only set of multi-level categorical features in Udall dataset. This fact facilitates our implementation of Knockoff filtering, which is not directly applicable for multi-level categorical variables. Excluding all genetic variables, we apply Random Forest (RF) and Knockoff (KO) variable selections on all other numeric or binary features. The feature selection results are shown on with a corresponding variable importance plots.

Density plots are showing the top six clinical features with significantly different distributions between falls and no-fall cohorts within the Udall study.

Further, this table shows the results comparing the distributions between fallers and no-fallers in the Michigan data, using the top six common features identified by RF and KO controlled feature selection.

Tel-Aviv dataset: illustrates the top features selected by RF and KO methods solely on the Tel-Aviv dataset. Again, commonly selected features by both strategies are highlighted.

Aggregated (Michigan+TelAviv): Similar results, corresponding to the separate Michigan and Tel-Aviv results shown above, are included below for the aggregate Michigan+TelAviv dataset

3.3 Challenge 3. Classification of patients’ falls.

Below, we report the prediction results for the model-based logistic regression, used as a reference method, and machine learning classification using the normalized datasets. The results are reported separately for the Michigan only, Tel-Aviv only, and the aggregate Michigan+TelAviv datasets.

Michigan data This table shows the binary classification of fall/no-fall (5-fold CV) using all features. The columns represent seven complementary performance estimating measures: accuracy (acc), sensitivity (sens), specificity (spec), positive and negative predictive values (ppv and npv), and area under the receiver operating curve (auc). This table shows the binary classification of fall/no-fall (5-fold CV) using only the top 6 selected features (MDS_PIGD, gaitSpeed_Off, MOT_EDL, NON_MOTOR_EDL, walk, pos_stab).

Tel Aviv data This table illustrates the results of the binary classification of fall/no-fall (5-fold CV) using all features.

This table shows the binary classification of fall/no-fall (5-fold CV) using top 10 selected features (gaitSpeed_Off, ABC, BMI, PIGD_score, X2.11, partII_sum, Attention, DGI, FOG_Q, H_and_Y_OFF). Improving Classification Sensitivity: We attempted to further improve the classification sensitivity, which is important in this clinical setting. As Random Forest outperforms the other methods, we focused our performance tuning on RF classification. By optimizing the RF parameters, using grant weights, setting cut off points for two classes and the number of features used for each decision tree branch split, we obtained a classification model with higher sensitivity and LOR. Although, there is more room to further improvement of the sensitivity, it is also important to keep specificity within a reasonable range. Table below shows the best RF results on the Tel-Aviv data. Note that improving the classifier sensitivity trades off with (compromising) it’s sensitivity. Fall prediction with a subset of important features: We applied a logit model for a low dimensional case-study. Our results show 74% prediction accuracy using four variables, table below. Prior work by Paul, et al. reported accuracy about 80% using three variables, including “fall in the previous year” as an additional predictor, which may be very strongly associated with the clinical outcome of interest—whether a patient is expected to fall or not.

This table and ROC curves show the areas under the ROC curve of the Random Forest classification using several different study-designs. The results suggest that four features provide sufficient predictive power to forecast fall sin PD patients (area under the ROC curve is approximately 0.8).

Truncated classification of multiple-falls vs. no-falls (5-fold CV): A natural consideration is that some patients with prior falls might be attributed to unrelated accidents. Therefore, we tried to accurately identify patients with multiple falls. Further, for patients who had a history of falls, including one or more falls, the observations who had presence of fall by accident could mask the key demographic/clinical predictors, associated with falls. Table 17 shows the proportion of participants with two or more falls vs. no falls and table below shows the classification results using all features.

Finally, this table shows the classification using only the commonly selected features. The best results were obtained using adaptive boosting (Adaboost) and SVM with Gaussian kernel.

Aggregate Michigan + TelAviv Data This table shows the binary falls/non-fall classification of the mixed/aggregated data using all features (5-fold CV). This table illustrates the results of the mixed/aggregated data (5-fold CV) classification using only the seven commonly selected features: gaitSpeed_Off, PIGD_score, partII_sum, BMI, X2.11, H_and_Y_OFF, X3.10gait_off. Train on Michigan and Test on Tel-Aviv Data: this table shows the falls/no-fall classification (training on Michigan and testing on Tel-Aviv data) results using the selected features.

Train on Tel-Aviv and Test on Michigan Data This table shows the opposite falls/no-fall classification (training on Tel-Aviv and testing on Michigan data) results using only the commonly selected features.

3.4 Challenge 4. Morbidity phenotype (TD/PIGD) Classification

Next, ignoring the UPDRS subitems, we performed predictive analytics of tremor dominant (TD) vs. posture instability and gait disorder (PIGD) classification using only the demographic and clinical information (neuroimaging features were excluded).

Michigan Data This table shows that compared to prediction of falls using all features, the overall accuracy for both logistic regression and AdaBoost TD/PIGD classification is improved. Tel-Aviv Data This table demonstrated improved sensitivity of TD/PIGD classification, as compared to prediction of falls using all features. This indicates TD/PIGD classification may also be an important predictor of patients’ falls.

Aggregated Michigan+TelAviv Data This table shows a slightly higher sensitivity for random forest and AdaBoost TD/PIGD classification, compared to falls prediction using all features. Yet, compared to within archive training with internal CV assessment, the performance of both classifiers on the aggregated dataset is less impressive, which may be explained by the heterogeneity of the sets discussed in Challenge 1.

4 Conclusion

Regarding Challenge 1 (data compilation), we carefully examined, harmonized and aggregated the two independently acquired PD datasets. The merged dataset was used to retrain the algorithms and validate their classification accuracy using internal statistical cross validation. The substantial biomedical variability in the data may explain the fact that the predictive accuracy of the falls/no-fall classification results were lower in the merged aggregated data compared to training and testing the forecasting methods on each dataset separately.

Challenge 2 (feature selection for prediction of falls) showed that three variables appear to be consistently chosen in the feature selection process across Michigan, Tel-Aviv and aggregated datasets – the MDS-UPDRS PIGD subscore (MDS_PIGD), gait speed in the off state, and sum score for MDS-Part II: Motor Aspects of Experiences of Daily Living (M-EDL). This is consistent with expectations as PIGD has been previously related to fall risk in PD.

In the third Challenge (prediction of falls), we found some differences between the classification results obtained by training of the three different datasets. For instance, training on the Michigan data, the highest overall classification accuracy was about 78%, with a lower sensitivity, ~47%. Whereas, training on the Tel-Aviv data, the accuracy and sensitivity rates reached 80% and 68%, respectively. For the Tel-Aviv data, the prediction model can be tuned to yield a sensitivity of 81% and accuracy of 77%. Furthermore, training on the Tel-Aviv data yields better results when the classification outcome corresponds to discriminating PD patients with multiple falls from those without falls. When training on the aggregated dataset, the falls/no-fall classification accuracy is about 70% with sensitivity around 55%. The most realistic, yet difficult, case involves external out-of-bag validation, training on one of the datasets and testing on the other. For instance, training an RF classifier on the Tel-Aviv dataset and tested it out of-bag on the Michigan dataset yields accuracy of 71% and sensitivity of 67%.

The results of the last Challenge (TD/PIGD) suggest that tremor dominant (TD) vs. postural instability and gait difficulty (PIGD) classification is reliable. For example, training and statistically validating on the Tel-Aviv data yields accuracy of 74%, sensitivity of 61% and specificity of 82%.

The classification performance of different machine learning methods varies with respect to the testing and training datasets. Overall, the random forests classifier works best on most combinations of training/testing datasets and feature selection strategies. The boosting method also showed high predictive classification accuracy on Tel-Aviv data. When the number of features is small, logistic regression may provide a viable model for predicting patient falls and it has always the benefit of easy intuitive interpretation within the scope of the problem.

The reported variable importance results may be useful for selecting features that may be important biomarkers helping clinicians quantify the risk of falls in PD patients. This study may have some potential pitfalls and limitations. For instance, the sample sizes are relatively small, Michigan (N1 = 148) and Tel-Aviv (N2 = 103). There was significant heterogeneity of the feature distributions between the Michigan and Tel-Aviv datasets. It is not clear if there were underlying biological, clinical, physiological, or technological reasons for the observed variation. This is a common challenge in all Big data analytic studies relying on multisource heterogeneous data. Features that were completely incongruent between the two data archives were removed from the subsequent analyses and were not included in the aggregated dataset. Finally, the classifiers trained on one of the datasets (Tel-Aviv) performed better when tested either via internal statistical cross-validation or via external out-of-bag valuation (using the Michigan test data). Our study of falls primarily focused on the binary indicator of falls. The frequency of falls, or the severity of falls, were not examined due to lack of sufficient information in either data archive. However, both frequency and severity of falls require further examination.