Overview

In “Data 110 Project 2 - Parkinson’s Speech Analysis” [3] I approximated a diagram from the article “Automated analysis of connected speech reveals early biomarkers of Parkinson’s disease in patients with rapid eye movement sleep behaviour disorder” [1] using the data associated with the study [2].

In this extension of Project 2 I’ll approximate the prediction model used to diagnose RBD subjects in the study and create an updated version of the diagram.

Refresher of terms and data

Dysarthria is a speech disorder present in close to 90% of people with Parkinson’s disease (PD). The subject of the study is the development of a fully automated method to detect early, distinctive patterns of dysarthria using only acoustic data from connected speech, such as reading a short passage or giving a monologue.

Early biomarkers of Parkinson’s Disease dataset

The data include 30 patients with early, untreated Parkinson’s disease, 50 patients with REM sleep behaviour disorder (RBD), who are at high risk of developing Parkinson’s disease, and 50 healthy controls (HC). All PD and RBD patients were scored for movement disorders. All subjects were examined and scored for dysarthria during a single session with a speech specialist, and all were recorded reading a standardized, phonetically balanced text and giving a monologue for approximately 90 seconds each. Speech features were automatically analyzed in the study, resulting in 24 speech attribute measurements: each of the 12 attributes below is measured once for the reading task and once for the monologue (a short sketch of the naming convention follows the table).

Label Exam measurement
RST rate of speech timing
AST acceleration of speech timing
DPI duration of pause intervals
EST entropy of speech timing
DUS duration of unvoiced stops
DUF decay of unvoiced fricatives
DVI duration of voiced intervals
GVI gaping in-between voiced intervals
RSR rate of speech respiration
PIR pause intervals per respiration
RLR relative loudness of respiration
LRE latency of respiratory exchange
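
The 24 measurement columns in the cleaned data follow a simple naming convention. A minimal sketch (my own illustration, not project code) that reconstructs the column names from the 12 labels, using an _r suffix for the reading task and _m for the monologue:

# the 12 attribute labels, lower-cased, in the order they appear in the data
labels <- c("est", "rst", "ast", "dpi", "dvi", "gvi",
            "dus", "duf", "rlr", "pir", "rsr", "lre")
c(paste0(labels, "_r"), paste0(labels, "_m"))   # the 24 measurement columns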

Read cleaned data

I cleaned and organized the original dataset and saved it previously for Project 2. I’ll read the data and look at it.

pacman::p_load(tidyverse, dplyr, readr, stringr, knitr, kableExtra, ggplot2, plotly, GGally, corrplot)
pacman::p_load(InformationValue)

pdata <- read.csv("C:/Users/Owner/Desktop/My Documents/School/DataFiles/Parkinsons/pdata.csv")
pdata$motor_score <- as.integer(pdata$motor_score)

pdata$group <- pdata$group %>% factor(levels = c("PD", "RBD", "controls"))   # factor group to order 
glimpse(pdata)
## Rows: 130
## Columns: 29
## $ id          <chr> "PD01", "PD02", "PD03", "PD04", "PD05", "PD06", "PD07", "P~
## $ age         <int> 58, 68, 68, 75, 61, 58, 79, 59, 73, 66, 71, 37, 43, 70, 75~
## $ motor_score <int> 8, 22, 19, 24, 54, 29, 29, 32, 6, 22, 15, 36, 11, 18, 25, ~
## $ dysarthia   <int> 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0~
## $ est_r       <dbl> 1.564, 1.564, 1.550, 1.519, 1.543, 1.553, 1.563, 1.568, 1.~
## $ rst_r       <int> 354, 340, 211, 140, 269, 317, 269, 338, 374, 281, 234, 365~
## $ ast_r       <dbl> 6.05, 27.52, 11.97, -2.49, 6.72, 24.19, 2.47, -4.36, 11.10~
## $ dpi_r       <int> 146, 173, 377, 360, 211, 186, 214, 145, 117, 213, 198, 129~
## $ dvi_r       <int> 264, 253, 322, 663, 328, 286, 310, 272, 265, 323, 414, 261~
## $ gvi_r       <dbl> 58.65, 48.26, 47.54, 13.72, 42.90, 43.83, 60.30, 61.31, 52~
## $ dus_r       <dbl> 31.38, 22.38, 38.12, 44.88, 47.12, 33.63, 20.12, 25.75, 23~
## $ duf_r       <dbl> -2.101, -1.745, 2.657, -0.934, -0.973, 0.921, -5.347, 1.08~
## $ rlr_r       <dbl> -22.47, -24.59, -16.89, -25.54, -22.61, -25.00, -26.20, -2~
## $ pir_r       <dbl> 4.50, 7.00, 3.00, 1.00, 5.00, 2.75, 5.25, 6.50, 9.25, 4.50~
## $ rsr_r       <dbl> 21.14, 15.28, 20.76, 18.71, 16.26, 27.07, 16.33, 13.42, 12~
## $ lre_r       <int> 167, 163, 372, 119, 78, 124, 288, 128, 87, 112, 44, 170, 1~
## $ est_m       <dbl> 1.564, 1.569, 1.550, 1.539, 1.560, 1.518, 1.566, 1.573, 1.~
## $ rst_m       <int> 333, 285, 247, 112, 230, 181, 289, 370, 288, 258, 174, 277~
## $ ast_m       <dbl> -2.82, 8.20, 4.71, -9.09, 11.77, 13.38, -1.53, -4.12, 5.25~
## $ dpi_m       <int> 158, 295, 280, 397, 206, 611, 251, 118, 194, 246, 433, 214~
## $ dvi_m       <int> 318, 264, 317, 800, 480, 398, 278, 269, 317, 356, 519, 313~
## $ gvi_m       <dbl> 49.01, 40.56, 48.97, 18.69, 33.54, 18.18, 58.90, 68.65, 53~
## $ dus_m       <dbl> 22.37, 26.88, 22.37, 49.37, 26.87, 49.37, 22.38, 17.88, 31~
## $ duf_m       <dbl> 0.588, -0.825, -0.955, 0.791, 0.075, 1.488, 0.174, 0.554, ~
## $ rlr_m       <dbl> -19.77, -23.26, -13.29, -25.08, -22.32, -25.08, -23.38, -1~
## $ pir_m       <dbl> 6.0, 4.0, 4.0, 2.0, 5.0, 2.0, 5.0, 13.5, 4.0, 3.5, 2.0, 4.~
## $ rsr_m       <dbl> 13.81, 21.77, 22.52, 14.37, 14.61, 18.21, 17.85, 8.21, 21.~
## $ lre_m       <int> 127, 313, 201, 151, 151, 593, 203, 116, 81, 69, 40, 179, 1~
## $ group       <fct> PD, PD, PD, PD, PD, PD, PD, PD, PD, PD, PD, PD, PD, PD, PD~

Data distribution and correlation

The following diagram shows distributions and correlations of speech attribute measurements.

require(psych)

pairs.panels(pdata[5:28],   # plot distributions and correlations for all the data
             gap = 0,
             pch = 21,
             lm = TRUE)

Many of the variables are skewed, and the dot plots show outliers. Even though the entire sample of 130 subjects is reasonably large, and the Parkinson’s disease group has 30 subjects, a linear regression is probably not the best choice for a prediction model here.

The study used quadratic discriminant function analysis, in which the multivariate normal distribution of each group, characterized by its means and covariances, was estimated using an expectation-maximization algorithm.
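
For reference, a standard form of the quadratic discriminant score (my notation, not taken from the study): an observation $x$ is assigned to the class $k$ that maximizes

$$\delta_k(x) = -\tfrac{1}{2}\ln\lvert\Sigma_k\rvert - \tfrac{1}{2}(x - \mu_k)^{\top}\Sigma_k^{-1}(x - \mu_k) + \ln\pi_k$$

where $\mu_k$, $\Sigma_k$ and $\pi_k$ are the mean vector, covariance matrix and prior probability of class $k$.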

Even so, since a diagnosis of dysarthria is a binary outcome, I’ll start with logistic regression and see how it fits the data. I’ll use ggcorr to find the most strongly correlated pairs of variables.
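
As a reminder of what the logit link does, logistic regression models the log-odds of the outcome as a linear combination of the predictors,

$$\ln\frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$

so the values returned by predict(fit, type = "response") are the fitted probabilities $p$.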

# filter for the PD and controls groups
# (select is defined in both the dplyr and MASS packages, so qualify it with dplyr::)

pcor <- pdata %>% filter(group != "RBD") %>% dplyr::select(est_r:lre_m)

# create a correlation plot

pdcorr <- ggcorr(pcor, nbreaks = 5, palette = "RdGy",                                   # set the number of levels and the colors       
                 hjust = .65, size = 4,                                                 # nudge and format the axis labels
                 label = TRUE, label_size = 4, label_color = "white",                   # format the coefficient labels 
                 name = "Correlation") +                                                # legend title 
                 labs(title = "PD / Control Correlation",                               # plot title
                 caption = "source: https://www.nature.com/articles/s41598-017-00047-5") +
          theme(plot.title.position = "plot", 
                plot.title = element_text(hjust = .5, vjust = 0, size = rel(2)),        # position and format the title and caption
                plot.caption.position = "plot")

pdcorr

There are 16 variables that appear in pairs with absolute correlation coefficients greater than 0.6, and 10 that appear in pairs with absolute coefficients greater than 0.7.

# find the variables with strong correlations

pdcorr$data %>% filter(abs(coefficient) > 0.6) %>%             # filter the correlation table for strong pairs
  dplyr::select(1:3) %>%                                       # get the variable names and coefficients
  arrange(coefficient) %>% 
  kable(caption = "Strongly correlated variable pairs in the PD and control groups (|coefficient| > 0.6)", 
        digits = 2, col.names = c("X", "Y", "Coefficient")) %>% 
  kable_minimal()
Strongly correlated variable pairs in the PD and control groups (|coefficient| > 0.6)
X Y Coefficient
dvi_r rst_r -0.85
dpi_r rst_r -0.80
dvi_m rst_m -0.79
dpi_m rst_m -0.77
dvi_m rst_r -0.65
rsr_m pir_m -0.64
dus_m rst_m -0.64
rst_m dvi_r -0.62
rsr_r pir_r -0.61
rst_m rst_r 0.63
dvi_r dpi_r 0.64
lre_m dpi_m 0.66
dus_m dpi_m 0.67
gvi_m est_m 0.72
rlr_m rlr_r 0.74
dvi_m dvi_r 0.84
t <- pdcorr$data %>% filter(abs(coefficient) > 0.6 )
c(t$x,as.character(t$y)) %>% unique() %>% kable(col.names = "Variables w/correlation > 0.6")
Variables w/correlation > 0.6
dpi_r
dvi_r
rst_m
dvi_m
rlr_m
rsr_r
gvi_m
dpi_m
dus_m
lre_m
rsr_m
rst_r
rlr_r
pir_r
est_m
pir_m

t <- pdcorr$data %>% filter(abs(coefficient) > 0.7 )
c(t$x,as.character(t$y)) %>% unique() %>% kable(col.names = "Variables w/correlation > 0.7")
Variables w/correlation > 0.7
dpi_r
dvi_r
dvi_m
rlr_m
gvi_m
dpi_m
rst_r
rlr_r
est_m
rst_m

In the study the most distinctive disturbed speech patterns between the PD and control groups were found for a combination of seven attributes:

  • RST reading
  • DVI monologue
  • DPI reading
  • DPI monologue
  • DUS reading
  • DUS monologue
  • PIR monologue
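
For later reference, here is my own mapping of the label/task pairs above to the pdata column names (the study names the attributes; the column names come from the cleaned dataset):

study_vars <- c("rst_r", "dvi_m", "dpi_r", "dpi_m", "dus_r", "dus_m", "pir_m")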

Logistic regression

I’ll use a generalized linear model (glm) with the logit link function to create the models, then plot each model to evaluate its fit to the data.

  • fit24 - using all 24 acoustic variables
  • fit16 - using the 16 variables with absolute correlation coefficients > 0.6
  • fit10 - using the 10 variables with absolute correlation coefficients > 0.7
  • fit7 - using the seven variables the study found most distinctive

pd <- filter(pdata, group == "PD")   # get the Parkinson's subjects

# model all 24 variables
fit24 <- glm(dysarthia ~ est_r + rst_r + ast_r + dpi_r + dvi_r + gvi_r +
                         dus_r + duf_r + rlr_r + pir_r + rsr_r + lre_r +
                         est_m + rst_m + ast_m + dpi_m + dvi_m + gvi_m +
                         dus_m + duf_m + rlr_m + pir_m + rsr_m + lre_m, 
             data = pd, family = binomial(link = "logit"))

# model the 16 variables with absolute correlation > 0.6
fit16 <- glm(dysarthia ~ dpi_m + dpi_r + dus_m + dvi_m + dvi_r + est_m + 
                         gvi_m + lre_m + pir_m + pir_r + rlr_m + rlr_r + 
                         rsr_m + rsr_r + rst_m + rst_r, 
             data = pd, family = binomial(link = "logit"))

# model the 10 variables with absolute correlation > 0.7
fit10 <- glm(dysarthia ~ dpi_r + dvi_r + dvi_m + rlr_m + gvi_m + 
                         dpi_m + rst_r + rlr_r + est_m + rst_m, 
             data = pd, family = binomial(link = "logit"))

# model the seven variables identified in the study
# (note: pir_r is used here, where the study's list has PIR monologue)
fit7 <- glm(dysarthia ~ dpi_m + dpi_r + dus_m + dus_r + dvi_m + pir_r + rst_r,    
            data = pd, family = binomial(link = "logit"))

# plot each model to look at the residuals

par(mfrow = c(2,2))
plot(fit24)
mtext("fit24", side = 1)

par(mfrow = c(2,2), bg = "aliceblue")
plot(fit16)
mtext("fit16", side = 1)

par(mfrow = c(2,2), bg = "white")
plot(fit10)
mtext("fit10", side = 1)

par(mfrow = c(2,2), bg = "aliceblue")
plot(fit7)
mtext("fit7", side = 1)

Each of the models shows high-leverage outliers. Removing the outliers and rerunning the models doesn’t improve them; each refit shows a new set of high-leverage points. This is more evidence that logistic regression is not the best fit for this data. The following plots give a closer look at the residuals for Model 7 and Model 10.
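
One way to flag the high-leverage points, sketched here with common rules of thumb (the 2·mean(h) and 4/n cutoffs are conventions, not thresholds from the study):

# hat values and Cook's distances for one of the fitted models
h  <- hatvalues(fit7)
cd <- cooks.distance(fit7)
which(h > 2 * mean(h) | cd > 4 / nrow(pd))   # rows worth inspecting before refitting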

# Model 7
# select the data for residuals 
d <- pd %>% dplyr::select(dysarthia, dpi_m, dpi_r, dus_m, dus_r, dvi_m, pir_r,  rst_r)

# add predicted and residuals
d$predicted <- predict(fit7, type="response")
d$residuals <- residuals(fit7, type = "response")

# plot the results
 d %>% 
   gather(key = "iv", value = "x", -dysarthia, -predicted, -residuals) %>%     # pull all the predictors together
   ggplot(aes(x = x, y = dysarthia)) +
   geom_segment(aes(xend = x, yend = predicted), alpha = .4) +
   geom_point(aes(color = residuals, size = abs(residuals)/5), alpha = .7) +
   scale_color_gradient2(low = "darkblue", mid = "lavender", high = "darkred") +
   guides(color = FALSE, size = FALSE) +
   geom_point(aes(y = predicted), shape = 1) +
   facet_wrap(~ iv, scales = "free_x") +
   labs(title = "Residuals for Model 7",                                         # titles and labels
                x = "") +
   theme_bw() 

# Model 10
# select the data for residuals 
d <- pd %>% dplyr::select(dysarthia, dpi_r, dvi_r, dvi_m, rlr_m, gvi_m, dpi_m, rst_r, rlr_r, est_m, rst_m)

# add predicted and residuals (from fit10)
d$predicted <- predict(fit10, type = "response")
d$residuals <- residuals(fit10, type = "response")

# plot the results
 d %>% 
   gather(key = "iv", value = "x", -dysarthia, -predicted, -residuals) %>%     # pull all the predictors together
   ggplot(aes(x = x, y = dysarthia)) +
   geom_segment(aes(xend = x, yend = predicted), alpha = .4) +
   geom_point(aes(color = residuals, size = abs(residuals)/5), alpha = .7) +
   scale_color_gradient2(low = "darkblue", mid = "lavender", high = "darkred") +
   guides(color = FALSE, size = FALSE) +
   geom_point(aes(y = predicted), shape = 1) +
   facet_wrap(~ iv, scales = "free_x") +
   labs(title = "Residuals for Model 10",                                         # titles and labels
                x = "") +
   theme_bw() 

These plots show a good number of large residuals. I’ll continue with logistic regression despite these indications of poor fit.

Predictions and evaluations

Both PD and RBD subjects were scored on the Unified Parkinson’s Disease Rating Scale motor subscore (UPDRS III), which ranges from 0 (no motor manifestations) to 108 (severe motor distortion). A score >3 has previously been shown to be a very good indicator of initial parkinsonism. RBD subjects with a motor score ≤3 (the asymptomatic RBD subgroup) are labeled motor negative, and RBD subjects with a motor score >3 (the symptomatic RBD subgroup) are labeled motor positive. This categorization is used to create the confusion matrices evaluating model performance.

# evaluate the motor score

pdata <- pdata %>% mutate("motor_true" = if_else(motor_score > 3, 1, 0))

pd <- pdata %>% filter(group == "PD")                     # get the PD group data  

# look at the initial dysarthria diagnosis, motor score and motor_true
pd %>% dplyr::select(id, dysarthia, motor_score, motor_true) %>%  
  head() %>% 
  kable(caption = "Motor Score evaluation for PD subjects") %>% kable_minimal()
Motor Score evaluation for PD subjects
id dysarthia motor_score motor_true
PD01 0 8 1
PD02 1 22 1
PD03 0 19 1
PD04 0 24 1
PD05 1 54 1
PD06 1 29 1
rbd <- pdata %>% filter(group == "RBD")                     # get the RBD group data                     

# look at the initial dysarthria diagnosis, motor score and motor_true
rbd %>% dplyr::select(id, dysarthia, motor_score, motor_true) %>%  
  head() %>% 
  kable(caption = "Motor Score evaluation for RBD subjects") %>% kable_minimal()
Motor Score evaluation for RBD subjects
id dysarthia motor_score motor_true
RBD01 1 3 0
RBD02 0 2 0
RBD03 0 7 1
RBD04 0 1 0
RBD05 0 10 1
RBD06 0 12 1

For now, I won’t create test data to evaluate the models. Instead, I’ll go straight to running predictions on the RBD data for each model and compare the results, using the motor score evaluation as the test criterion.

require(caret)   

etbl <- tibble("Model" = c("fit24", "fit16", "fit10", "fit7"),           # tibble to compare fit evals
               "Accuracy" = c(0,0,0,0),
               "True Pos" = c(0,0,0,0),
               "True Neg" = c(0,0,0,0),
               "False Pos" = c(0,0,0,0),
               "False Neg" = c(0,0,0,0),
               "Sensitivity" = c(0,0,0,0),
               "Specificity" = c(0,0,0,0)
               )

motor_positive <- factor(rbd$motor_true, c(0,1))                # factor required by confusionMatrix

predicted24 <- predict(fit24, rbd, type = "response") %>% round()        # prediction using fit24
predictedc <- predicted24 %>% factor(levels = c(0,1)) 
c24 <- confusionMatrix(motor_positive, predictedc)
etbl$Accuracy[1] = c24$overall["Accuracy"]
etbl$`True Pos`[1] = c24$table[2,2]
etbl$`True Neg`[1] = c24$table[1,1]
etbl$`False Pos`[1] = c24$table[1,2]
etbl$`False Neg`[1] = c24$table[2,1]
etbl$Sensitivity[1] = c24$byClass["Sensitivity"]
etbl$Specificity[1] = c24$byClass["Specificity"]

predicted16 <- predict(fit16, rbd, type = "response") %>% round()        # prediction using fit16
predictedc <- predicted16 %>% factor(levels = c(0,1)) 
c16  <- confusionMatrix(motor_positive, predictedc)
etbl$Accuracy[2] = c16$overall["Accuracy"]
etbl$`True Pos`[2] = c16$table[2,2]
etbl$`True Neg`[2] = c16$table[1,1]
etbl$`False Pos`[2] = c16$table[1,2]
etbl$`False Neg`[2] = c16$table[2,1]
etbl$Sensitivity[2] = c16$byClass["Sensitivity"]
etbl$Specificity[2] = c16$byClass["Specificity"]

predicted10 <- predict(fit10, rbd, type = "response") %>% round()        # prediction using fit10
predictedc <- predicted10 %>% factor(levels = c(0,1)) 
c10  <- confusionMatrix(motor_positive, predictedc)
etbl$Accuracy[3] = c10$overall["Accuracy"]
etbl$`True Pos`[3] = c10$table[2,2]
etbl$`True Neg`[3] = c10$table[1,1]
etbl$`False Pos`[3] = c10$table[1,2]
etbl$`False Neg`[3] = c10$table[2,1]
etbl$Sensitivity[3] = c10$byClass["Sensitivity"]
etbl$Specificity[3] = c10$byClass["Specificity"]

predicted7 <- predict(fit7, rbd, type="response") %>% round()            # prediction using fit7   
predictedc <- predicted7 %>% factor(levels = c(0,1)) 
c7  <- confusionMatrix(motor_positive, predictedc)
etbl$Accuracy[4] = c7$overall["Accuracy"]
etbl$`True Pos`[4] = c7$table[2,2]
etbl$`True Neg`[4] = c7$table[1,1]
etbl$`False Pos`[4] = c7$table[1,2]
etbl$`False Neg`[4] = c7$table[2,1]
etbl$Sensitivity[4] = c7$byClass["Sensitivity"]
etbl$Specificity[4] = c7$byClass["Specificity"]

etbl %>% kable(caption = "Prediction results from logistic glm", digits = 3) %>% kable_minimal()
Prediction results from logistic glm
Model Accuracy True Pos True Neg False Pos False Neg Sensitivity Specificity
fit24 0.46 11 12 8 19 0.387 0.579
fit16 0.50 14 11 9 16 0.407 0.609
fit10 0.48 12 12 8 18 0.400 0.600
fit7 0.42 10 11 9 20 0.355 0.526
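
An aside on the code: the four evaluation blocks above repeat the same steps. A sketch of a helper (my refactor, not the original code) that would collapse them:

eval_fit <- function(fit, newdata, truth) {               # truth: the factored motor_positive
  pred <- predict(fit, newdata, type = "response") %>% 
    round() %>% factor(levels = c(0, 1))
  cm <- confusionMatrix(truth, pred)
  tibble("Accuracy" = cm$overall["Accuracy"],
         "True Pos" = cm$table[2, 2], "True Neg" = cm$table[1, 1],
         "False Pos" = cm$table[1, 2], "False Neg" = cm$table[2, 1],
         "Sensitivity" = cm$byClass["Sensitivity"],
         "Specificity" = cm$byClass["Specificity"])
}

# e.g. bind_rows(lapply(list(fit24, fit16, fit10, fit7), eval_fit, rbd, motor_positive))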

All of the models produce a good number of false positives, and each has more false negatives than true negatives. Model 16 has the best, although not stellar, performance. I’ll plot the ROC curves to see what they look like.

require(plotROC)
cols <- c('darkred', 'mediumvioletred', 'darkgreen', 'blue') 

# add the predictions to the data and pivot longer

cmat <-  tibble ("motor_true" = rbd$motor_true, "Model24" = round(predicted24), "Model16" = round(predicted16), 
                       "Model10" = round(predicted10), "Model7" = round(predicted7)) %>% 

             pivot_longer(Model24:Model7,
                          names_to = "model",
                          values_to = "speech_pos"
                          )
cmat$model <- factor(cmat$model, levels = c("Model7", "Model10", "Model16", "Model24") ) # factor by model name

roc <- ggplot(cmat) +
  geom_roc(aes(d=motor_true, m = speech_pos, color = model), size = 2, na.rm = TRUE, labels = TRUE) +
            scale_color_manual(name = "Model", values = cols) +
           labs(title = "ROC for logistic glm models",                      # titles and labels
                x = "Speech positive / Motor negative (False positive fraction)",
                y = "Speech positive / Motor positive (True positive fraction)") +
           theme_light()
roc

cr <- calc_auc(roc) %>% rename("Model" = group)
cr$Model <- c("Model7", "Model10", "Model16", "Model24")
cr %>% dplyr::select(Model, AUC) %>% kable(caption = "Area under ROC", align = c("l", "l")) %>% kable_minimal()
Area under ROC
Model AUC
Model7 0.4416667
Model10 0.5000000
Model16 0.5083333
Model24 0.4833333

Again, Model 16, at just over 50%, is marginally better than the other models. (One caveat: the curves here are built from the rounded 0/1 predictions, so each ROC has only a single threshold point; passing the raw predicted probabilities to geom_roc would trace fuller curves.)

Quadratic discriminant analysis

The study used quadratic discriminant analysis. I’ll try qda to see if it gives a better fit to the data. First I’ll create training and test data.

pdc <- pdata %>% filter(group == "PD")                                    # use the PD data for training
tr <- pdc %>% dplyr::select(-id, -age, -motor_score, -group)              # drop unwanted columns
tr$dysarthia <- factor(tr$dysarthia, levels = c(0, 1))                    # factor the response variable

set.seed(1)                   

# use roughly 70% of the data as the training set
sample <- sample(c(TRUE, FALSE), nrow(tr), replace = TRUE, prob = c(0.7, 0.3))
train <- tr[sample, ]

test <- tr[!sample, ]    # create a test dataset from the remaining rows

Now I’ll run qda models on pairs of variables and test them.

require(MASS)

vnames <- names(pdata[5:28])    # get the names of measurement variables
vnames <- combn(vnames,2)       # create combinations of 2 names

mtest <- tibble("v1" = vnames[1,], "v2"= vnames[2,] ,"mean" = 0)   # make a tibble with the names in columns 

# loop through all the combinations and perform qda
for (i in seq_along(mtest$v1)) {
  tr2 <- dplyr::select(train, dysarthia, all_of(c(mtest$v1[i], mtest$v2[i])))  # select the two columns from train
  model <- qda(dysarthia ~ ., data = tr2)                             # create the qda model  
  predicted <- predict(model, test)                                   # use the model to predict the test data
  mtest$mean[i] <- mean(predicted$class == test$dysarthia)            # proportion of correct test predictions
}

mtest <- mtest %>% arrange(desc(mean))                                # look at the test results
head(mtest, 10) %>% kable(caption = "High QDA model test results", align = c("l","l","l"),
                          col.names = c("Variable 1", "Variable 2", "Test Mean")) %>% kable_minimal()
High QDA model test results
Variable 1 Variable 2 Test Mean
dpi_r gvi_r 0.8888889
dpi_r lre_r 0.8888889
dpi_r dvi_m 0.8888889
dpi_r gvi_m 0.8888889
dpi_r lre_m 0.8888889
dvi_r lre_r 0.8888889
est_r rst_r 0.7777778
est_r ast_r 0.7777778
est_r dpi_r 0.7777778
est_r dvi_r 0.7777778

There are a number of test results with means of 0.889, and the lowest result is 0.222. Seven distinct variables appear among the high-scoring pairs, although they’re not the same seven variables identified in the study.

QDA pairs models Study variables
dpi_r dpi_m
dvi_m dpi_r
dvi_r dus_m
gvi_m dus_r
gvi_r dvi_m
lre_m pir_m
lre_r rst_r

To create their prediction model, the study used repeated-measures analysis with between-group and within-group factors, Fisher’s least significant difference tests, Pearson correlation and Bonferroni corrections. Again I’m running up against statistical analysis beyond my current abilities, or at least beyond the time I have to figure it out. I’ll run the qda model with the seven variables from the study (Model 7a) and then with the seven high-scoring qda test variables (Model 7b).

pacman::p_load(klaR)

# Model 7a: the seven study variables
train1 <- train %>% dplyr::select(dysarthia, rst_r, dvi_m, dpi_r, dpi_m, dus_r, dus_m, pir_m)
model7a <- qda(dysarthia ~ ., data = train1)
predictedt <- predict(model7a, test)
m7a <- mean(predictedt$class == test$dysarthia)

# Model 7b: the seven high-scoring variables from the pairs qda
train2 <- train %>% dplyr::select(dysarthia, dpi_r, dvi_m, dvi_r, gvi_m, gvi_r, lre_m, lre_r)
model7b <- qda(dysarthia ~ ., data = train2)
predictedt <- predict(model7b, test)
m7b <- mean(predictedt$class == test$dysarthia)

The two models test as equivalent: the mean of the qda test for the seven study variables is 0.667 (Model 7a), and the mean for the seven variables from the pairs qda is 0.667 (Model 7b). A partition matrix of the training data shows the misclassification errors for Model 7a.

# create a partition matrix for the study variables training data
# haven't found a way to add diagram title in partimat

partimat(dysarthia ~ ., data = train1, method = "qda",            # perform qda
         plot.matrix = TRUE, 
         col.correct='blue', col.wrong='red', col.mean = 'purple',
         image.colors = c("white","grey"), main = "QDA")

Predict dysarthria for the RBD group

Now I’ll use qda Model7a to predict dysarthria for the RBD subjects and look at the results.

# predict dysarthia for RBD group using 7 study variable model
predicted <- predict(model7a, rbd)         

rbd$speech_pos <- predicted$class     # add speech positive prediction to rbd data

# print the scores for rbd data

rbd %>% dplyr::select(id, dysarthia, motor_score, motor_true, speech_pos) %>% 
  filter(dysarthia ==1 | speech_pos ==1) %>% 
  kable(caption = "RBD subjects tagged as Speech Positive by the qda model") %>% kable_minimal()
RBD subjects tagged as Speech Positive by the qda model
id dysarthia motor_score motor_true speech_pos
RBD01 1 3 0 1
RBD05 0 10 1 1
RBD14 0 11 1 1
RBD18 0 3 0 1
RBD21 1 13 1 0
RBD27 0 5 1 1
RBD28 0 0 0 1
RBD29 0 1 0 1
RBD36 0 4 1 1
RBD38 1 10 1 0
RBD44 0 4 1 1
c7 <- confusionMatrix(motor_positive, predicted$class)
etbl <- etbl %>% add_row(Model = "Model7a", 
                         Accuracy = c7$overall["Accuracy"],
                         `True Pos` = c7$table[2,2],
                         `True Neg` = c7$table[1,1],
                         `False Pos` = c7$table[1,2],
                         `False Neg` = c7$table[2,1],
                         Sensitivity = c7$byClass["Sensitivity"],
                         Specificity = c7$byClass["Specificity"])

filter(etbl, Model == "Model7a") %>% kable(caption = "Prediction results from qda", digits = 3) %>% kable_minimal()
Prediction results from qda
Model Accuracy True Pos True Neg False Pos False Neg Sensitivity Specificity
Model7a 0.42 5 16 4 25 0.39 0.556

The qda model of the seven variables detected dysarthria in 9 of the RBD subjects, including one of the three tagged with dysarthria by the speech specialist. Five of the nine subjects flagged by the model are motor positives. Its performance is comparable to the logistic regression model for the seven variables. As I noted above, there are factors and corrections that need to be applied to create an effective prediction model.

The study diagnosed RBD subjects with dysarthia compared to their motor scores with 70.0% accuracy, 73.9% sensitivity and 66.7% specificity. It used an iterative leave-one-subject-out model validation scheme to find the best combination of acoustic measurements for the predictive model.
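
A minimal sketch of a leave-one-out loop (a simplification: the study’s scheme left out one subject at a time while also searching over combinations of measurements):

# leave-one-out accuracy for the seven study variables on the PD training data
loo_hits <- sapply(seq_len(nrow(train1)), function(i) {
  m <- qda(dysarthia ~ ., data = train1[-i, ])          # fit without subject i
  predict(m, train1[i, ])$class == train1$dysarthia[i]  # predict the held-out subject
})
mean(loo_hits)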

I’ll create an interactive plot of the Model7a predictions, extending the diagram from Project 2 that plots dpi_m against rst_r. I’ll add points for the RBD subjects diagnosed with dysarthria by the model.

The diagram in the study has a curved solid gray line representing the border of discrimination between RBD speech-negative and speech-positive subjects.

I haven’t found a way (yet!) to replicate this curve. It looks similar to the partition curves in the partimat diagram, but my experiments with ggplot, ggplotly and highcharter weren’t successful. I’ll approximate it with a loess line fitting dpi_m to rst_r for the RBD speech-positive group.
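
One common way to draw such a border, sketched below (not the study’s method): fit a two-variable qda on the plotted attributes, predict the posterior over a grid, and draw the 0.5 contour. Model7a uses seven variables, so its boundary can’t be drawn directly in two dimensions; this assumes a reduced model on rst_r and dpi_m only.

# a 2-variable qda purely for plotting the discrimination border
m2 <- qda(dysarthia ~ rst_r + dpi_m, data = train)
grid <- expand.grid(rst_r = seq(min(pdata$rst_r), max(pdata$rst_r), length.out = 200),
                    dpi_m = seq(min(pdata$dpi_m), max(pdata$dpi_m), length.out = 200))
grid$p <- predict(m2, grid)$posterior[, "1"]            # posterior probability of dysarthria
ggplot(grid, aes(rst_r, dpi_m, z = p)) +
  geom_contour(breaks = 0.5, color = "grey50")          # the 0.5 contour is the border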

pacman::p_load(broom, highcharter)

p2 <- pdata %>% mutate(px = NA, py = NA)       # set up the data for creating fitted line 
p2$group <- as.character(p2$group)

for (i in 1:50) {
  if (rbd$speech_pos[i] == 1) {           # RBD rows follow the 30 PD rows in pdata, hence the i + 30 offset
    p2$px[i+30] <- p2$rst_r[i+30]         # get x, y variables for the fitted line
    p2$py[i+30] <- p2$dpi_m[i+30]
    p2$group[i+30] <- "RBD_DYS"           # assign a new group for speech positives
  }
}

# factor by the new set of groups

p2$group <- p2$group %>% factor(levels = c("PD", "RBD", "RBD_DYS", "controls"))
                              
# create a model for the fitted line and add the residuals to the data
modlss <- loess(py ~ px, data = filter(p2, group == "RBD_DYS"))
fitd <- arrange(augment(modlss, filter(p2, group == "RBD_DYS")), px) %>%
    rename(residual = `.resid`)

#relabel the factors for the plot legend
p2$group <- p2$group %>% factor(labels = c("Parkinson's", 
                                           "RBD speech negative", "RBD speech positive",
                                           "controls"))

cols <- c('red', 'royalblue', 'purple', 'darkgreen')   # set colors for the subject groups

highchart() %>% 
  hc_add_theme(hc_theme_smpl()) %>% 
      hc_add_series(p2,                                  # scatter plot of the two variables
                  type = "scatter", 
                  hcaes(x=rst_r, y = dpi_m, group = group),
                  opacity = 0.75,
                  marker = list(radius = 6),
                  tooltip = list(pointFormat = 
                 "ID: {point.id}<br> 
                   Dysarthria: {point.dysarthia}<br>
                   Motor Score: {point.motor_score}")
                  ) %>% 
    hc_colors(cols) %>%
    hc_xAxis(title = list(text = "Reading RST (-/min)")) %>% 
    hc_yAxis(title = list(text = "Monologue DPI (ms)")) %>% 
    hc_add_series(fitd, type = "line", hcaes(x = px, y = .fitted),   # add the fitted line
                name = "RBD speech positive fit", id = "fit",
                color = "grey",
                lineWidth = 5,
                opacity = 0.5,
                tooltip = list(pointFormat = 
                 "ID: {point.id}<br> 
                   Residual: {point.residual}<br>
                   Dysarthria: {point.dysarthia}<br>
                   Motor Score: {point.motor_score}")
                ) %>% 
    hc_legend(align = "right", verticalAlign = "middle", layout = "vertical") %>% 
    hc_title(text = "Parkinson's Disease Diagnosis: Speech Pattern Monologue DPI ~ Reading RST ",
           margin = 20, align = "center", style = list(color = "#22A883")) %>% 
    hc_subtitle(text = "Tool tip: Dysarthia = score by speech specialist") %>% 
    hc_caption(text = "source: https://www.nature.com/articles/s41598-017-00047-5", align = "right")

This project presented many challenges, and I learned a lot: first, understanding what the study was doing in order to try to replicate the analysis; then learning how to use the glm and qda model functions and interpret the results; and finally, finding ways to build visualizations to demonstrate the results. Even though I wasn’t able to fully implement all the methods from the study, and my models performed poorly, I feel I’ve gained some understanding of what’s involved in replicating the data analysis.

References

  1. Hlavnička, J., Čmejla, R., Tykalová, T. et al. (2017). Automated analysis of connected speech reveals early biomarkers of Parkinson’s disease in patients with rapid eye movement sleep behaviour disorder. https://www.nature.com/articles/s41598-017-00047-5

  2. Early Biomarkers of Parkinson’s Disease dataset. https://www.kaggle.com/ruslankl/early-biomarkers-of-parkinsons-disease

  3. Macy, Marilyn (2021). Parkinson’s Speech Analysis, Data 110 - Project 2. https://rpubs.com/sopranomax/756877