# clear-up the environment
rm(list = ls())
# chunk options
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  fig.align = "center",
  comment = "#>"
)

This article was written to complete the assignment for the Algoritma Machine Learning Course, Classification 2, Theia Batch 2022.
In this article, we will try to predict whether someone is likely to have Parkinson's disease based on a number of variables extracted from their speech recordings. The dataset is the output of voice-frequency analysis of people who do and do not have Parkinson's.
Parkinson’s
According to MayoClinic.org, Parkinson's disease is a progressive disorder that affects the nervous system and the parts of the body controlled by the nerves. Symptoms start slowly. The first symptom may be a barely noticeable tremor in just one hand. Tremors are common, but the disorder may also cause stiffness or slowing of movement.
In the early stages of Parkinson’s disease, your face may show little or no expression. Your arms may not swing when you walk. Your speech may become soft or slurred. Parkinson’s disease symptoms worsen as your condition progresses over time.
One of the main symptoms, as described above, is a change in speech: there are slight differences in the speech of people who are developing Parkinson's. These differences can be used to predict that someone may have Parkinson's before more symptoms appear, so that therapy can start earlier and slow the progression of the disease.
One example of a person with Parkinson's is my favorite actor, Michael J. Fox, who played Marty in the Back to the Future films. You can see how he speaks in this video.
Michael J. Fox as Marty in Back to the Future
Feel bad for him though :(
The data used in this study were gathered from 188 patients with PD (107 men and 81 women) with ages ranging from 33 to 87 (65.1±10.9) at the Department of Neurology in Cerrahpaşa Faculty of Medicine, Istanbul University. The control group consists of 64 healthy individuals (23 men and 41 women) with ages varying between 41 and 82 (61.1±8.9). During the data collection process, the microphone was set to 44.1 kHz and, following the physician's examination, the sustained phonation of the vowel /a/ was collected from each subject with three repetitions.
Various speech signal processing algorithms, including Time Frequency Features, Mel Frequency Cepstral Coefficients (MFCCs), Wavelet Transform based Features, Vocal Fold Features and TQWT features, have been applied to the speech recordings of Parkinson's Disease (PD) patients to extract clinically useful information for PD assessment.
options(scipen = 99) # turn off scientific notation (e.g., 1e-1)
library(tidyverse)   # core data wrangling and plotting packages
library(dplyr)       # grammar of data manipulation
library(ggplot2)     # data visualization
library(scales)      # axis and scale helpers
library(GGally)      # correlation plots (ggcorr)

# load dataset
parkinsons <- read.csv("data-input/parkinsons-data.csv")
glimpse(parkinsons)
#> Rows: 195
#> Columns: 24
#> $ name <chr> "phon_R01_S01_1", "phon_R01_S01_2", "phon_R01_S01_3",…
#> $ MDVP.Fo.Hz. <dbl> 119.992, 122.400, 116.682, 116.676, 116.014, 120.552,…
#> $ MDVP.Fhi.Hz. <dbl> 157.302, 148.650, 131.111, 137.871, 141.781, 131.162,…
#> $ MDVP.Flo.Hz. <dbl> 74.997, 113.819, 111.555, 111.366, 110.655, 113.787, …
#> $ MDVP.Jitter... <dbl> 0.00784, 0.00968, 0.01050, 0.00997, 0.01284, 0.00968,…
#> $ MDVP.Jitter.Abs. <dbl> 0.00007, 0.00008, 0.00009, 0.00009, 0.00011, 0.00008,…
#> $ MDVP.RAP <dbl> 0.00370, 0.00465, 0.00544, 0.00502, 0.00655, 0.00463,…
#> $ MDVP.PPQ <dbl> 0.00554, 0.00696, 0.00781, 0.00698, 0.00908, 0.00750,…
#> $ Jitter.DDP <dbl> 0.01109, 0.01394, 0.01633, 0.01505, 0.01966, 0.01388,…
#> $ MDVP.Shimmer <dbl> 0.04374, 0.06134, 0.05233, 0.05492, 0.06425, 0.04701,…
#> $ MDVP.Shimmer.dB. <dbl> 0.426, 0.626, 0.482, 0.517, 0.584, 0.456, 0.140, 0.13…
#> $ Shimmer.APQ3 <dbl> 0.02182, 0.03134, 0.02757, 0.02924, 0.03490, 0.02328,…
#> $ Shimmer.APQ5 <dbl> 0.03130, 0.04518, 0.03858, 0.04005, 0.04825, 0.03526,…
#> $ MDVP.APQ <dbl> 0.02971, 0.04368, 0.03590, 0.03772, 0.04465, 0.03243,…
#> $ Shimmer.DDA <dbl> 0.06545, 0.09403, 0.08270, 0.08771, 0.10470, 0.06985,…
#> $ NHR <dbl> 0.02211, 0.01929, 0.01309, 0.01353, 0.01767, 0.01222,…
#> $ HNR <dbl> 21.033, 19.085, 20.651, 20.644, 19.649, 21.378, 24.88…
#> $ status <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ RPDE <dbl> 0.414783, 0.458359, 0.429895, 0.434969, 0.417356, 0.4…
#> $ DFA <dbl> 0.815285, 0.819521, 0.825288, 0.819235, 0.823484, 0.8…
#> $ spread1 <dbl> -4.813031, -4.075192, -4.443179, -4.117501, -3.747787…
#> $ spread2 <dbl> 0.266482, 0.335590, 0.311173, 0.334147, 0.234513, 0.2…
#> $ D2 <dbl> 2.301442, 2.486855, 2.342259, 2.405554, 2.332180, 2.1…
#> $ PPE <dbl> 0.284654, 0.368674, 0.332634, 0.368975, 0.410335, 0.3…
Matrix column entries (attributes):
- Name - ASCII subject name and recording number
- MDVP:Fo(Hz) - Average vocal fundamental frequency
- MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
- MDVP:Flo(Hz) - Minimum vocal fundamental frequency
- MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP - Several measures of variation in fundamental frequency
- MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA - Several measures of variation in amplitude
- NHR, HNR - Two measures of the ratio of noise to tonal components in the voice
- Status - Health status of the subject: one = Parkinson's, zero = healthy
- RPDE, D2 - Two nonlinear dynamical complexity measures
- DFA - Signal fractal scaling exponent
- spread1, spread2, PPE - Three nonlinear measures of fundamental frequency variation

summary(parkinsons)
#>     name           MDVP.Fo.Hz.     MDVP.Fhi.Hz.    MDVP.Flo.Hz.
#> Length:195 Min. : 88.33 Min. :102.1 Min. : 65.48
#> Class :character 1st Qu.:117.57 1st Qu.:134.9 1st Qu.: 84.29
#> Mode :character Median :148.79 Median :175.8 Median :104.31
#> Mean :154.23 Mean :197.1 Mean :116.32
#> 3rd Qu.:182.77 3rd Qu.:224.2 3rd Qu.:140.02
#> Max. :260.11 Max. :592.0 Max. :239.17
#> MDVP.Jitter... MDVP.Jitter.Abs. MDVP.RAP MDVP.PPQ
#> Min. :0.001680 Min. :0.00000700 Min. :0.000680 Min. :0.000920
#> 1st Qu.:0.003460 1st Qu.:0.00002000 1st Qu.:0.001660 1st Qu.:0.001860
#> Median :0.004940 Median :0.00003000 Median :0.002500 Median :0.002690
#> Mean :0.006220 Mean :0.00004396 Mean :0.003306 Mean :0.003446
#> 3rd Qu.:0.007365 3rd Qu.:0.00006000 3rd Qu.:0.003835 3rd Qu.:0.003955
#> Max. :0.033160 Max. :0.00026000 Max. :0.021440 Max. :0.019580
#> Jitter.DDP MDVP.Shimmer MDVP.Shimmer.dB. Shimmer.APQ3
#> Min. :0.002040 Min. :0.00954 Min. :0.0850 Min. :0.004550
#> 1st Qu.:0.004985 1st Qu.:0.01650 1st Qu.:0.1485 1st Qu.:0.008245
#> Median :0.007490 Median :0.02297 Median :0.2210 Median :0.012790
#> Mean :0.009920 Mean :0.02971 Mean :0.2823 Mean :0.015664
#> 3rd Qu.:0.011505 3rd Qu.:0.03789 3rd Qu.:0.3500 3rd Qu.:0.020265
#> Max. :0.064330 Max. :0.11908 Max. :1.3020 Max. :0.056470
#> Shimmer.APQ5 MDVP.APQ Shimmer.DDA NHR
#> Min. :0.00570 Min. :0.00719 Min. :0.01364 Min. :0.000650
#> 1st Qu.:0.00958 1st Qu.:0.01308 1st Qu.:0.02474 1st Qu.:0.005925
#> Median :0.01347 Median :0.01826 Median :0.03836 Median :0.011660
#> Mean :0.01788 Mean :0.02408 Mean :0.04699 Mean :0.024847
#> 3rd Qu.:0.02238 3rd Qu.:0.02940 3rd Qu.:0.06080 3rd Qu.:0.025640
#> Max. :0.07940 Max. :0.13778 Max. :0.16942 Max. :0.314820
#> HNR status RPDE DFA
#> Min. : 8.441 Min. :0.0000 Min. :0.2566 Min. :0.5743
#> 1st Qu.:19.198 1st Qu.:1.0000 1st Qu.:0.4213 1st Qu.:0.6748
#> Median :22.085 Median :1.0000 Median :0.4960 Median :0.7223
#> Mean :21.886 Mean :0.7538 Mean :0.4985 Mean :0.7181
#> 3rd Qu.:25.076 3rd Qu.:1.0000 3rd Qu.:0.5876 3rd Qu.:0.7619
#> Max. :33.047 Max. :1.0000 Max. :0.6852 Max. :0.8253
#> spread1 spread2 D2 PPE
#> Min. :-7.965 Min. :0.006274 Min. :1.423 Min. :0.04454
#> 1st Qu.:-6.450 1st Qu.:0.174350 1st Qu.:2.099 1st Qu.:0.13745
#> Median :-5.721 Median :0.218885 Median :2.362 Median :0.19405
#> Mean :-5.684 Mean :0.226510 Mean :2.382 Mean :0.20655
#> 3rd Qu.:-5.046 3rd Qu.:0.279234 3rd Qu.:2.636 3rd Qu.:0.25298
#> Max. :-2.434 Max. :0.450493 Max. :3.671 Max. :0.52737
Based on the summary above, several columns look like they may contain outliers. However, since these values are the output of speech-signal analysis, we do not know what a reasonable range actually is, so we keep all observations rather than removing potentially useful data points.
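As a quick visual check (a minimal sketch; the three columns picked here are only examples), we can draw boxplots of a few numeric features to see how extreme those values look:

# sketch: boxplots of a few example features to eyeball potential outliers
parkinsons %>%
  select(MDVP.Fhi.Hz., NHR, MDVP.Shimmer) %>%   # example columns only
  pivot_longer(everything(), names_to = "feature", values_to = "value") %>%
  ggplot(aes(x = feature, y = value)) +
  geom_boxplot() +
  facet_wrap(~feature, scales = "free") +
  labs(title = "Distribution of selected features", x = NULL, y = NULL)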
We will now look at how the columns correlate with each other and with our target variable status.
ggcorr(parkinsons, hjust = 1, layout.exp = 3, label = TRUE, label_size = 2.5)

parkinsons_dataset <- parkinsons %>%
  select(-c(MDVP.Jitter..., MDVP.Jitter.Abs., MDVP.RAP, MDVP.PPQ, MDVP.Shimmer, MDVP.Shimmer.dB., Shimmer.APQ3, Shimmer.APQ5, MDVP.APQ, Jitter.DDP, Shimmer.DDA, spread1, DFA, HNR))

We remove columns that are highly correlated with another column (correlation above 0.7 or below -0.7) to reduce the potential for multicollinearity in the data.
ggcorr(parkinsons_dataset, hjust = 1, layout.exp = 3, label = TRUE, label_size = 2.5)
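The columns above were chosen by reading the correlation plot. As an automated alternative (a sketch only; the flagged columns may differ slightly from the manual selection), caret::findCorrelation can list predictors whose pairwise correlation exceeds the 0.7 threshold:

# sketch: flag highly correlated numeric predictors automatically
library(caret)

num_cols <- parkinsons %>%
  select(where(is.numeric)) %>%   # numeric columns only
  select(-status)                 # drop the target variable

cor_mat  <- cor(num_cols)
high_cor <- findCorrelation(cor_mat, cutoff = 0.7) # indices of columns with |correlation| > 0.7
colnames(cor_mat)[high_cor]                        # names of the flagged predictors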
Next, we look at the class proportions in the data.
hist(parkinsons_dataset$status)
We can see that the classes are not balanced; we will deal with that later.
# change column status into a factor and remove the name column
parkinsons_dataset <- parkinsons_dataset %>%
  mutate(status = as.factor(ifelse(status == 1, "parkinson", "healthy"))) %>%
  select(-name)

# check for missing values
colSums(is.na(parkinsons_dataset))
#> MDVP.Fo.Hz. MDVP.Fhi.Hz. MDVP.Flo.Hz. NHR status RPDE
#> 0 0 0 0 0 0
#> spread2 D2 PPE
#> 0 0 0
Split the data: 75% for training, 25% for testing.
RNGkind(sample.kind = "Rounding")
set.seed(100)
library(rsample)
# Stratified Random Sampling
index_n <- initial_split(data = parkinsons_dataset, # dataset that will be split
                         prop = 0.75,               # data train proportion
                         strata = status)           # target variable
parkinsons_train <- training(index_n) # training function from rsample
parkinsons_test  <- testing(index_n)

prop.table(table(parkinsons_train$status))
#>
#> healthy parkinson
#> 0.2465753 0.7534247
prop.table(table(parkinsons_test$status))
#>
#> healthy parkinson
#> 0.244898 0.755102
Because our data is not balanced, we will upsample the training set. Why upsample rather than downsample? There are far fewer samples of healthy people than of people with Parkinson's, so upsampling the minority class balances the data without throwing away any information.
# upsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)
library(caret)
parkinsons_train_n <- upSample(x = parkinsons_train %>% select(-status), # predictor variables only
                               y = parkinsons_train$status,              # target variable only
                               yname = "status")                         # name of the target variable column

prop.table(table(parkinsons_train_n$status))
#>
#> healthy parkinson
#> 0.5 0.5
We also create a copy of the test set without the target variable, since the models only need the predictors when making predictions.
# create dataset without target variable
parkinsons_test_n <- parkinsons_test %>% select(-status)

library(e1071)

# create model
naive_model <- naiveBayes(status ~ ., data = parkinsons_train_n)

# create predictions
naive_pred <- predict(naive_model, newdata = parkinsons_test_n, type = "class") # predicted class
naive_prob <- predict(naive_model, newdata = parkinsons_test_n, type = "raw")   # class probabilities

# confusion matrix and interpretation of the model
naive_conf <- table(naive_pred, parkinsons_test$status)
confusionMatrix(naive_conf)
#> Confusion Matrix and Statistics
#>
#>
#> naive_pred healthy parkinson
#> healthy 9 11
#> parkinson 3 26
#>
#> Accuracy : 0.7143
#> 95% CI : (0.5674, 0.8342)
#> No Information Rate : 0.7551
#> P-Value [Acc > NIR] : 0.79940
#>
#> Kappa : 0.3695
#>
#> Mcnemar's Test P-Value : 0.06137
#>
#> Sensitivity : 0.7500
#> Specificity : 0.7027
#> Pos Pred Value : 0.4500
#> Neg Pred Value : 0.8966
#> Prevalence : 0.2449
#> Detection Rate : 0.1837
#> Detection Prevalence : 0.4082
#> Balanced Accuracy : 0.7264
#>
#> 'Positive' Class : healthy
#>
It seems our model is not accurate enough if we set our target at 80% accuracy (it reaches only about 71%). It also has low precision for the positive class, with a Positive Predictive Value of only 45%.
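To make the link between the confusion matrix and these numbers explicit, here is a minimal sketch that recomputes accuracy, sensitivity, and precision by hand from naive_conf (with healthy as the positive class, as in the output above):

# sketch: recompute the headline metrics from the confusion matrix by hand
# rows of naive_conf are predictions, columns are the actual status
TP <- naive_conf["healthy", "healthy"]     # predicted healthy, actually healthy
FP <- naive_conf["healthy", "parkinson"]   # predicted healthy, actually parkinson
FN <- naive_conf["parkinson", "healthy"]   # predicted parkinson, actually healthy
TN <- naive_conf["parkinson", "parkinson"] # predicted parkinson, actually parkinson

(TP + TN) / (TP + TN + FP + FN) # accuracy, 0.7143 above
TP / (TP + FN)                  # sensitivity (recall), 0.75 above
TP / (TP + FP)                  # precision (Pos Pred Value), 0.45 above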
# convert the actual labels to 1/0, with parkinson as the positive class
parkinsons_test$actual <- ifelse(parkinsons_test$status == "parkinson", 1, 0)

library(ROCR)

# prediction object
naive_roc <- prediction(predictions = naive_prob[,1],      # probability of one class (column 1 is the first factor level, "healthy")
                        labels = parkinsons_test$actual)   # actual labels converted to 1 and 0

# ROC curve
plot(performance(prediction.obj = naive_roc, # naive Bayes prediction object
                 measure = "tpr",            # y axis
                 x.measure = "fpr"))         # x axis
abline(0, 1, lty = 2) # diagonal line

# AUC
naive_auc <- performance(prediction.obj = naive_roc,
                         measure = "auc")
naive_auc@y.values
#> [[1]]
#> [1] 0.1891892
Oops! Based on the plot above, the AUC is far below 0.5, so the model looks extremely bad at predicting the correct answer. What this really tells us is that the probability column we scored (column 1, the probability of healthy) was paired with the opposite positive label (parkinson): flipped around, the curve is actually a reasonably good classifier. So from this point on, we set the positive class to healthy, i.e., we recode healthy to "1".

# change the actual labels: healthy is now the positive class ("1")
parkinsons_test$actual <- ifelse(parkinsons_test$status == "healthy", 1, 0)

naive_roc <- prediction(predictions = naive_prob[,1],    # probability of "healthy"
                        labels = parkinsons_test$actual) # healthy coded as 1
plot(performance(prediction.obj = naive_roc,
                 measure = "tpr",
                 x.measure = "fpr"))
abline(0, 1, lty = 2)

# AUC
naive_auc <- performance(prediction.obj = naive_roc,
                         measure = "auc")
naive_auc@y.values
#> [[1]]
#> [1] 0.8108108
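A safer way to build the ROC object (a small sketch, not part of the original workflow) is to select the probability column by class name rather than by position, so the score and the positive label cannot get out of sync again:

# sketch: pick the probability column by its class name instead of by position
naive_roc <- prediction(predictions = naive_prob[, "healthy"], # P(healthy), matching healthy = 1
                        labels = parkinsons_test$actual)
performance(naive_roc, measure = "auc")@y.values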
We will try another model for comparison.
library(partykit)
dt_model <- ctree(status ~ .,
                  data = parkinsons_train_n,
                  control = ctree_control(mincriterion = 0.95)) # a node splits only if the split's p-value is < 0.05 (1 - mincriterion)

plot(dt_model, type = "simple")

# model interpretation
dt_model
#>
#> Model formula:
#> status ~ MDVP.Fo.Hz. + MDVP.Fhi.Hz. + MDVP.Flo.Hz. + NHR + RPDE +
#> spread2 + D2 + PPE
#>
#> Fitted party:
#> [1] root
#> | [2] PPE <= 0.13387
#> | | [3] RPDE <= 0.39661
#> | | | [4] MDVP.Fo.Hz. <= 186.163: parkinson (n = 9, err = 0.0%)
#> | | | [5] MDVP.Fo.Hz. > 186.163: healthy (n = 24, err = 0.0%)
#> | | [6] RPDE > 0.39661: healthy (n = 61, err = 1.6%)
#> | [7] PPE > 0.13387
#> | | [8] spread2 <= 0.21888: parkinson (n = 62, err = 41.9%)
#> | | [9] spread2 > 0.21888: parkinson (n = 64, err = 0.0%)
#>
#> Number of inner nodes: 4
#> Number of terminal nodes: 5
# create predictions
dt_pred <- predict(dt_model, parkinsons_test_n, type = "response") # predicted class
dt_prob <- predict(dt_model, parkinsons_test_n, type = "prob")     # class probabilities

# confusion matrix
dt_conf <- table(dt_pred, parkinsons_test$status)
confusionMatrix(dt_conf)
#> Confusion Matrix and Statistics
#>
#>
#> dt_pred healthy parkinson
#> healthy 7 2
#> parkinson 5 35
#>
#> Accuracy : 0.8571
#> 95% CI : (0.7276, 0.9406)
#> No Information Rate : 0.7551
#> P-Value [Acc > NIR] : 0.06174
#>
#> Kappa : 0.5781
#>
#> Mcnemar's Test P-Value : 0.44969
#>
#> Sensitivity : 0.5833
#> Specificity : 0.9459
#> Pos Pred Value : 0.7778
#> Neg Pred Value : 0.8750
#> Prevalence : 0.2449
#> Detection Rate : 0.1429
#> Detection Prevalence : 0.1837
#> Balanced Accuracy : 0.7646
#>
#> 'Positive' Class : healthy
#>
It seems the Decision Tree model performs better: accuracy improved to about 86%.
# prediction object
dt_roc <- prediction(predictions = dt_prob[,1],        # probability of the positive class 'healthy'
                     labels = parkinsons_test$actual)  # actual labels converted to 0 and 1

# ROC curve
plot(performance(prediction.obj = dt_roc, # ROC prediction object
                 measure = "tpr",         # y axis
                 x.measure = "fpr"))      # x axis
abline(0, 1, lty = 2) # diagonal line

# AUC
dt_auc <- performance(prediction.obj = dt_roc,
                      measure = "auc")
dt_auc@y.values
#> [[1]]
#> [1] 0.8153153
However, the AUC does not improve much with this model.
# create random forest model
set.seed(150)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3) # 5-fold CV, repeated 3 times
rf_model <- train(status ~ ., data = parkinsons_train_n, method = "rf", trControl = ctrl)
saveRDS(rf_model, file = "rf_model.rds") # save the model so it does not have to be retrained later

# reload the saved random forest model
rf_model <- readRDS("rf_model.rds")

# create predictions
rf_pred <- predict(rf_model, parkinsons_test_n, type = "raw")  # predicted class
rf_prob <- predict(rf_model, parkinsons_test_n, type = "prob") # class probabilities

rf_conf <- table(rf_pred, parkinsons_test$status)
confusionMatrix(rf_conf)
#> Confusion Matrix and Statistics
#>
#>
#> rf_pred healthy parkinson
#> healthy 9 3
#> parkinson 3 34
#>
#> Accuracy : 0.8776
#> 95% CI : (0.7523, 0.9537)
#> No Information Rate : 0.7551
#> P-Value [Acc > NIR] : 0.02761
#>
#> Kappa : 0.6689
#>
#> Mcnemar's Test P-Value : 1.00000
#>
#> Sensitivity : 0.7500
#> Specificity : 0.9189
#> Pos Pred Value : 0.7500
#> Neg Pred Value : 0.9189
#> Prevalence : 0.2449
#> Detection Rate : 0.1837
#> Detection Prevalence : 0.2449
#> Balanced Accuracy : 0.8345
#>
#> 'Positive' Class : healthy
#>
Wow, this model produces significantly better accuracy than the other models: about 88%, above our target of 80%.
# prediction object
rf_roc <- prediction(predictions = rf_prob[,1],        # probability of the positive class 'healthy'
                     labels = parkinsons_test$actual)

# ROC curve
plot(performance(prediction.obj = rf_roc, # ROC prediction object
                 measure = "tpr",         # y axis
                 x.measure = "fpr"))      # x axis
abline(0, 1, lty = 2) # diagonal line

# AUC
rf_auc <- performance(prediction.obj = rf_roc,
                      measure = "auc")
rf_auc@y.values
#> [[1]]
#> [1] 0.9594595
Based on the plot above, this model produces a better ROC curve and AUC value, noticeably higher than the other models.
varImp(rf_model)
#> rf variable importance
#>
#> Overall
#> PPE 100.000
#> NHR 53.145
#> MDVP.Fo.Hz. 52.737
#> MDVP.Fhi.Hz. 28.459
#> spread2 23.590
#> MDVP.Flo.Hz. 5.487
#> D2 3.325
#> RPDE 0.000
Based on the varImp output above, PPE is the most important predictor and RPDE is the least important one. PPE is one of the three nonlinear measures of fundamental frequency variation, and it separates healthy and Parkinson's patients well in this model. Do not rely on it as a single diagnostic measure, though, because abnormalities in speech can have many causes.
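For a quicker read of that ranking, a small optional sketch: the varImp object from caret can be plotted directly, which sorts the predictors by importance.

# sketch: visualize the variable importance ranking
plot(varImp(rf_model))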
plot(rf_model$finalModel)
legend("topright", colnames(rf_model$finalModel$err.rate),
col=1:6,cex=0.8,fill=1:6)
Evaluating the Random Forest model, we can see that the error rate stops improving noticeably after around 100 trees; the plot also suggests that roughly 300 trees gives the smallest error rate for this model.
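To read the exact tree count off the error-rate matrix instead of the plot, a minimal sketch (the final model from method = "rf" is a randomForest object, whose err.rate matrix has an OOB column):

# sketch: find the number of trees with the lowest out-of-bag (OOB) error
oob_err <- rf_model$finalModel$err.rate[, "OOB"] # OOB error after each tree is added
which.min(oob_err)                               # number of trees with the smallest OOB error
min(oob_err)                                     # the corresponding error rate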
Because we built this model with healthy as the positive class, it predicts whether a recording should be classified as healthy. In other words, this model is not designed to diagnose Parkinson's, but to act as a screening test: if someone passes the test, they are predicted to be healthy, and if they do not, they should go through further procedures for a closer look.
The best result comes from the Random Forest model. It tends to perform better because it aggregates many trees (and was tuned with repeated cross-validation), whereas Naive Bayes and a single Decision Tree are much simpler models; in this case that extra work produces a clearly better result than the other two.
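As a recap (a small sketch reusing the objects created above; the numbers should match the confusionMatrix and AUC outputs reported earlier), we can collect the test-set accuracy and AUC of the three models into one data frame:

# sketch: summarize test-set accuracy and AUC for the three models
model_summary <- data.frame(
  model    = c("Naive Bayes", "Decision Tree", "Random Forest"),
  accuracy = c(confusionMatrix(naive_conf)$overall["Accuracy"],
               confusionMatrix(dt_conf)$overall["Accuracy"],
               confusionMatrix(rf_conf)$overall["Accuracy"]),
  auc      = c(naive_auc@y.values[[1]],
               dt_auc@y.values[[1]],
               rf_auc@y.values[[1]])
)
model_summary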
The best accuracy we achieved is about 88%, which still means more than 12% of predictions can be wrong. That is why, in medicine, a disease is usually diagnosed with several tests: combining tests significantly increases the chance of getting the diagnosis right.
Take a look at the HOUSE MD series: he never tests a patient with only one method, and he always doubts the result of a test (because the chance of a wrong result is high when the test procedure is not carried out correctly).