In “Data 110 Project 2 - Parkinson’s Speech Analysis” [3], I approximated a diagram from the article “Automated analysis of connected speech reveals early biomarkers of Parkinson’s disease in patients with rapid eye movement sleep behaviour disorder” [1] using the data associated with the study [2].
In this extension of Project 2, I’ll approximate the prediction model used to diagnose RBD subjects in the study and create an updated version of the diagram.
Dysarthria is a speech disorder present in close to 90% of people with Parkinson’s disease (PD). The subject of the study is the development of a fully automated method to detect early, distinctive patterns of dysarthria using only acoustic data from connected speech, such as reading a short passage or delivering a monologue.
The data include 30 patients with early, untreated Parkinson’s disease, 50 patients with REM sleep behaviour disorder (RBD), who are at high risk of developing Parkinson’s disease, and 50 healthy controls (HC). All PD and RBD patients were scored for movement disorders. All subjects were examined and scored for dysarthria during a single session with a speech specialist, and all were recorded reading a standardized, phonetically balanced text and delivering a monologue for approximately 90 seconds each. The speech features were analyzed automatically in the study, resulting in 24 speech attribute measurements: the 12 attributes below, each measured for both the reading (_r) and monologue (_m) tasks. (Note: the dataset spells the dysarthria column `dysarthia`, so that spelling appears in the code.)
| Label | Exam measurement |
|---|---|
| RST | rate of speech timing |
| AST | acceleration of speech timing |
| DPI | duration of pause intervals |
| EST | entropy of speech timing |
| DUS | duration of unvoiced stops |
| DUF | decay of unvoiced fricatives |
| DVI | duration of voiced intervals |
| GVI | gaping in-between voiced intervals |
| RSR | rate of speech respiration |
| PIR | pause intervals per respiration |
| RLR | relative loudness of respiration |
| LRE | latency of respiratory exchange |
I cleaned, organized, and saved the original dataset for Project 2. Here I’ll read it back in and take a look.
pacman::p_load(tidyverse, dplyr, readr, stringr, knitr, kableExtra, ggplot2, plotly, GGally, corrplot)
pacman::p_load(InformationValue)
pdata <- read.csv("C:/Users/Owner/Desktop/My Documents/School/DataFiles/Parkinsons/pdata.csv")
pdata$motor_score <- as.integer(pdata$motor_score)
pdata$group <- pdata$group %>% factor(levels = c("PD", "RBD", "controls")) # factor group to order
glimpse(pdata)
## Rows: 130
## Columns: 29
## $ id <chr> "PD01", "PD02", "PD03", "PD04", "PD05", "PD06", "PD07", "P~
## $ age <int> 58, 68, 68, 75, 61, 58, 79, 59, 73, 66, 71, 37, 43, 70, 75~
## $ motor_score <int> 8, 22, 19, 24, 54, 29, 29, 32, 6, 22, 15, 36, 11, 18, 25, ~
## $ dysarthia <int> 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0~
## $ est_r <dbl> 1.564, 1.564, 1.550, 1.519, 1.543, 1.553, 1.563, 1.568, 1.~
## $ rst_r <int> 354, 340, 211, 140, 269, 317, 269, 338, 374, 281, 234, 365~
## $ ast_r <dbl> 6.05, 27.52, 11.97, -2.49, 6.72, 24.19, 2.47, -4.36, 11.10~
## $ dpi_r <int> 146, 173, 377, 360, 211, 186, 214, 145, 117, 213, 198, 129~
## $ dvi_r <int> 264, 253, 322, 663, 328, 286, 310, 272, 265, 323, 414, 261~
## $ gvi_r <dbl> 58.65, 48.26, 47.54, 13.72, 42.90, 43.83, 60.30, 61.31, 52~
## $ dus_r <dbl> 31.38, 22.38, 38.12, 44.88, 47.12, 33.63, 20.12, 25.75, 23~
## $ duf_r <dbl> -2.101, -1.745, 2.657, -0.934, -0.973, 0.921, -5.347, 1.08~
## $ rlr_r <dbl> -22.47, -24.59, -16.89, -25.54, -22.61, -25.00, -26.20, -2~
## $ pir_r <dbl> 4.50, 7.00, 3.00, 1.00, 5.00, 2.75, 5.25, 6.50, 9.25, 4.50~
## $ rsr_r <dbl> 21.14, 15.28, 20.76, 18.71, 16.26, 27.07, 16.33, 13.42, 12~
## $ lre_r <int> 167, 163, 372, 119, 78, 124, 288, 128, 87, 112, 44, 170, 1~
## $ est_m <dbl> 1.564, 1.569, 1.550, 1.539, 1.560, 1.518, 1.566, 1.573, 1.~
## $ rst_m <int> 333, 285, 247, 112, 230, 181, 289, 370, 288, 258, 174, 277~
## $ ast_m <dbl> -2.82, 8.20, 4.71, -9.09, 11.77, 13.38, -1.53, -4.12, 5.25~
## $ dpi_m <int> 158, 295, 280, 397, 206, 611, 251, 118, 194, 246, 433, 214~
## $ dvi_m <int> 318, 264, 317, 800, 480, 398, 278, 269, 317, 356, 519, 313~
## $ gvi_m <dbl> 49.01, 40.56, 48.97, 18.69, 33.54, 18.18, 58.90, 68.65, 53~
## $ dus_m <dbl> 22.37, 26.88, 22.37, 49.37, 26.87, 49.37, 22.38, 17.88, 31~
## $ duf_m <dbl> 0.588, -0.825, -0.955, 0.791, 0.075, 1.488, 0.174, 0.554, ~
## $ rlr_m <dbl> -19.77, -23.26, -13.29, -25.08, -22.32, -25.08, -23.38, -1~
## $ pir_m <dbl> 6.0, 4.0, 4.0, 2.0, 5.0, 2.0, 5.0, 13.5, 4.0, 3.5, 2.0, 4.~
## $ rsr_m <dbl> 13.81, 21.77, 22.52, 14.37, 14.61, 18.21, 17.85, 8.21, 21.~
## $ lre_m <int> 127, 313, 201, 151, 151, 593, 203, 116, 81, 69, 40, 179, 1~
## $ group <fct> PD, PD, PD, PD, PD, PD, PD, PD, PD, PD, PD, PD, PD, PD, PD~
The following diagram shows distributions and correlations of speech attribute measurements.
require(psych)
pairs.panels(pdata[5:28], # plot distributions and correlations for all the data
gap = 0,
pch = 21,
lm = TRUE)
Many of the variables are skewed, and the dot plots show outliers. Even though the full sample of 130 subjects is reasonably large, the Parkinson’s disease group has only 30 subjects, so a linear regression is probably not the best choice for finding a prediction model.
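As a rough numeric check of that skewness (my addition, using psych::skew, since the psych package is already loaded for the pairs plot):
# compute skewness for each measurement column
pdata[5:28] %>%
summarise(across(everything(), psych::skew)) %>%
pivot_longer(everything(), names_to = "variable", values_to = "skewness") %>%
arrange(desc(abs(skewness)))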
The study used quadratic discriminant function analysis, in which the multivariate normal distribution of each group, characterized by its means and covariances, was estimated using an expectation-maximization algorithm.
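For context (this is standard QDA, stated here for reference rather than taken from the paper), the classifier assigns a subject with feature vector $x$ to the group $k$ that maximizes the quadratic discriminant score

$$\delta_k(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert - \tfrac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k) + \log\pi_k$$

where $\mu_k$, $\Sigma_k$, and $\pi_k$ are the group’s mean vector, covariance matrix, and prior probability.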
Even so, since a dysarthria diagnosis is a binary outcome, I’ll start with logistic regression and see how it fits the data. I’ll use ggcorr to find pairs of highly correlated variables.
# filter for the PD and controls groups; select() is defined in both dplyr and MASS
pcor <- pdata %>% filter(group != "RBD") %>% dplyr::select(est_r:lre_m)
# create a correlation plot
pdcorr <- ggcorr(pcor, nbreaks = 5, palette = "RdGy", # set the number of levels and the colors
hjust = .65, size = 4, # nudge and format the axis labels
label = TRUE, label_size = 4, label_color = "white", # format the coefficient labels
name = "Correlation") + # legend title
labs(title = "PD / Control Correlation", # plot title
caption = "source: https://www.nature.com/articles/s41598-017-00047-5") +
theme(plot.title.position = "plot",
plot.title = element_text(hjust = .5, vjust = 0, size = rel(2)), # position and format the title and caption
plot.caption.position = "plot")
pdcorr
There are 16 variables involved in pairwise correlations with |coefficient| > 0.6, and 10 variables in pairs with |coefficient| > 0.7.
# find the variables in highly correlated pairs
pdcorr$data %>% filter(abs(coefficient) > 0.6) %>% # filter the correlation table
dplyr::select(1:3) %>% # get the variable names and coefficients
arrange(coefficient) %>%
kable(caption = "Highly correlated variable pairs in the PD and control groups (|coefficient| > 0.6)",
digits = 2, col.names = c("X", "Y", "Coefficient")) %>%
kable_minimal()
| X | Y | Coefficient |
|---|---|---|
| dvi_r | rst_r | -0.85 |
| dpi_r | rst_r | -0.80 |
| dvi_m | rst_m | -0.79 |
| dpi_m | rst_m | -0.77 |
| dvi_m | rst_r | -0.65 |
| rsr_m | pir_m | -0.64 |
| dus_m | rst_m | -0.64 |
| rst_m | dvi_r | -0.62 |
| rsr_r | pir_r | -0.61 |
| rst_m | rst_r | 0.63 |
| dvi_r | dpi_r | 0.64 |
| lre_m | dpi_m | 0.66 |
| dus_m | dpi_m | 0.67 |
| gvi_m | est_m | 0.72 |
| rlr_m | rlr_r | 0.74 |
| dvi_m | dvi_r | 0.84 |
t <- pdcorr$data %>% filter(abs(coefficient) > 0.6 )
c(t$x,as.character(t$y)) %>% unique() %>% kable(col.names = "Variables w/correlation > 0.6")
| Variables w/correlation > 0.6 |
|---|
| dpi_r |
| dvi_r |
| rst_m |
| dvi_m |
| rlr_m |
| rsr_r |
| gvi_m |
| dpi_m |
| dus_m |
| lre_m |
| rsr_m |
| rst_r |
| rlr_r |
| pir_r |
| est_m |
| pir_m |
t <- pdcorr$data %>% filter(abs(coefficient) > 0.7 )
c(t$x,as.character(t$y)) %>% unique() %>% kable(col.names = "Variables w/correlation > 0.7")
| Variables w/correlation > 0.7 |
|---|
| dpi_r |
| dvi_r |
| dvi_m |
| rlr_m |
| gvi_m |
| dpi_m |
| rst_r |
| rlr_r |
| est_m |
| rst_m |
In the study, the most distinctive disturbed speech patterns between the PD and control groups were found for a combination of seven attributes: dpi_m, dpi_r, dus_m, dus_r, dvi_m, pir_m, and rst_r (see the comparison table later in this post).
I’ll use a generalized linear model (glm) with the logit link function to create the models, then plot each model to evaluate its fit to the data.
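For context, the logit link models the log-odds of a dysarthria diagnosis as a linear combination of the predictors:

$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$

so each fitted coefficient acts on the log-odds scale.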
pd <- filter(pdata, group == "PD") # get the Parkinson's subjects
# model all 24 variables
fit24 <- glm(dysarthia ~ est_r + rst_r + ast_r + dpi_r + dvi_r + gvi_r +
dus_r + duf_r + rlr_r + pir_r + rsr_r + lre_r +
est_m + rst_m + ast_m + dpi_m + dvi_m + gvi_m +
dus_m + duf_m + rlr_m + pir_m + rsr_m + lre_m,
family = binomial(link = "logit"), data = pd)
# model the 16 variables with |correlation| > 0.6
fit16 <- glm(dysarthia ~ dpi_m + dpi_r + dus_m + dvi_m + dvi_r + est_m +
gvi_m + lre_m + pir_m + pir_r + rlr_m + rlr_r +
rsr_m + rsr_r + rst_m + rst_r,
family = binomial(link = "logit"), data = pd)
# model the 10 variables with |correlation| > 0.7
fit10 <- glm(dysarthia ~ dpi_r + dvi_r + dvi_m + rlr_m + gvi_m +
dpi_m + rst_r + rlr_r + est_m + rst_m,
family = binomial(link = "logit"), data = pd)
# model the 7 significant variables found in the study
fit7 <- glm(dysarthia ~ dpi_m + dpi_r + dus_m + dus_r + dvi_m + pir_r + rst_r,
family = binomial(link = "logit"), data = pd)
# plot each model to look at the residuals
par(mfrow = c(2,2))
plot(fit24)
mtext("fit24", side = 1)
par(mfrow = c(2,2), bg = "aliceblue")
plot(fit16)
mtext("fit16", side = 1)
par(mfrow = c(2,2), bg = "white")
plot(fit10)
mtext("fit10", side = 1)
par(mfrow = c(2,2), bg = "aliceblue")
plot(fit7)
mtext("fit7", side = 1)
Each of the models shows high-leverage outliers. Removing outliers and rerunning the models doesn’t improve them; each refit shows a new set of high-leverage outliers. This is more evidence that logistic regression is not the best fit for these data.
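Here’s a minimal sketch of that outlier-removal experiment; the 4/n Cook’s distance cutoff is my assumption, not a rule from the original analysis:
# flag and drop high-influence subjects, then refit Model 7
cd <- cooks.distance(fit7) # influence of each subject on fit7
pd_trim <- pd[cd < 4 / nrow(pd), ] # drop subjects above the 4/n rule of thumb
fit7_trim <- glm(dysarthia ~ dpi_m + dpi_r + dus_m + dus_r + dvi_m + pir_r + rst_r,
family = binomial(link = "logit"), data = pd_trim)
par(mfrow = c(2, 2))
plot(fit7_trim) # the refit diagnostics surface a new set of high-leverage points
The following plots give a closer look at the residuals for Model 7 and Model 10.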
# Model 7
# select the data for residuals
d <- pd %>% dplyr::select(dysarthia, dpi_m, dpi_r, dus_m, dus_r, dvi_m, pir_r, rst_r)
# add predicted and residuals
d$predicted <- predict(fit7, type="response")
d$residuals <- residuals(fit7, type = "response")
# plot the results
d %>%
gather(key = "iv", value = "x", -dysarthia, -predicted, -residuals) %>% # pull all the predictors together
ggplot(aes(x = x, y = dysarthia)) +
geom_segment(aes(xend = x, yend = predicted), alpha = .4) +
geom_point(aes(color = residuals, size = abs(residuals)/5), alpha = .7) +
scale_color_gradient2(low = "darkblue", mid = "lavender", high = "darkred") +
guides(color = "none", size = "none") +
geom_point(aes(y = predicted), shape = 1) +
facet_wrap(~ iv, scales = "free_x") +
labs(title = "Residuals for Model 7", # titles and labels
x = "") +
theme_bw()
# Model 10
# select the data for residuals
d <- pd %>% dplyr::select(dysarthia, dpi_r, dvi_r, dvi_m, rlr_m, gvi_m, dpi_m, rst_r, rlr_r, est_m, rst_m)
# add the predicted values and residuals
d$predicted <- predict(fit10, type = "response")
d$residuals <- residuals(fit10, type = "response")
# plot the results
d %>%
gather(key = "iv", value = "x", -dysarthia, -predicted, -residuals) %>% # pull all the predictors together
ggplot(aes(x = x, y = dysarthia)) +
geom_segment(aes(xend = x, yend = predicted), alpha = .4) +
geom_point(aes(color = residuals, size = abs(residuals)/5), alpha = .7) +
scale_color_gradient2(low = "darkblue", mid = "lavender", high = "darkred") +
guides(color = "none", size = "none") +
geom_point(aes(y = predicted), shape = 1) +
facet_wrap(~ iv, scales = "free_x") +
labs(title = "Residuals for Model 10", # titles and labels
x = "") +
theme_bw()
These plots show a good number of large residuals. I’ll continue with logistic regression despite these indications of poor fit.
Both PD and RBD subjects were scored according to the Unified Parkinson’s Disease Rating Scale motor subscore (UPDRS III), which ranges from 0 (no motor manifestations) to 108 (severe motor distortion). A score >3 has previously been shown to be a very good indicator of initial parkinsonism. RBD subjects with a motor score ≤3 (the asymptomatic RBD subgroup) are labeled motor negative, and RBD subjects with a motor score >3 (the symptomatic RBD subgroup) are labeled motor positive. This categorization is used to create the confusion matrices evaluating model performance.
# evaluate the motor score
pdata <- pdata %>% mutate("motor_true" = if_else(motor_score > 3, 1, 0))
pd <- pdata %>% filter(group == "PD") # get the PD group data
# look at the initial dysarthia diagnosis, motor score and motor true
pd %>% dplyr::select(id, dysarthia, motor_score, motor_true) %>%
head() %>%
kable(caption = "Motor Score evaluation for PD subjects") %>% kable_minimal()
| id | dysarthia | motor_score | motor_true |
|---|---|---|---|
| PD01 | 0 | 8 | 1 |
| PD02 | 1 | 22 | 1 |
| PD03 | 0 | 19 | 1 |
| PD04 | 0 | 24 | 1 |
| PD05 | 1 | 54 | 1 |
| PD06 | 1 | 29 | 1 |
rbd <- pdata %>% filter(group == "RBD") # get the RBD group data
# look at the initial dysarthia diagnosis, motor score and motor true
rbd %>% dplyr::select(id, dysarthia, motor_score, motor_true) %>%
head() %>%
kable(caption = "Motor Score evaluation for RBD subjects") %>% kable_minimal()
| id | dysarthia | motor_score | motor_true |
|---|---|---|---|
| RBD01 | 1 | 3 | 0 |
| RBD02 | 0 | 2 | 0 |
| RBD03 | 0 | 7 | 1 |
| RBD04 | 0 | 1 | 0 |
| RBD05 | 0 | 10 | 1 |
| RBD06 | 0 | 12 | 1 |
For now, I won’t create test data to evaluate the models. Instead, I’ll go straight to running predictions on the RBD data for each of the models and compare the results, using the motor score evaluation as the test criterion.
require(caret)
etbl <- tibble("Model" = c("fit24", "fit16", "fit10", "fit7"), # tibble to compare fit evals
"Accuracy" = c(0,0,0,0),
"True Pos" = c(0,0,0,0),
"True Neg" = c(0,0,0,0),
"False Pos" = c(0,0,0,0),
"False Neg" = c(0,0,0,0),
"Sensitivity" = c(0,0,0,0),
"Specificity" = c(0,0,0,0)
)
motor_positive <- factor(rbd$motor_true, c(0,1)) # factor required by confusionMatrix
predicted24 <- predict(fit24, rbd, type = "response") %>% round() # prediction using fit24
predictedc <- predicted24 %>% factor(levels = c(0,1))
c24 <- confusionMatrix(motor_positive, predictedc)
etbl$Accuracy[1] = c24$overall["Accuracy"]
etbl$`True Pos`[1] = c24$table[2,2]
etbl$`True Neg`[1] = c24$table[1,1]
etbl$`False Pos`[1] = c24$table[1,2]
etbl$`False Neg`[1] = c24$table[2,1]
etbl$Sensitivity[1] = c24$byClass["Sensitivity"]
etbl$Specificity[1] = c24$byClass["Specificity"]
predicted16 <- predict(fit16, rbd, type = "response") %>% round() # prediction using fit16
predictedc <- predicted16 %>% factor(levels = c(0,1))
c16 <- confusionMatrix(motor_positive, predictedc)
etbl$Accuracy[2] = c16$overall["Accuracy"]
etbl$`True Pos`[2] = c16$table[2,2]
etbl$`True Neg`[2] = c16$table[1,1]
etbl$`False Pos`[2] = c16$table[1,2]
etbl$`False Neg`[2] = c16$table[2,1]
etbl$Sensitivity[2] = c16$byClass["Sensitivity"]
etbl$Specificity[2] = c16$byClass["Specificity"]
predicted10 <- predict(fit10, rbd, type = "response") %>% round() # prediction using fit10
predictedc <- predicted10 %>% factor(levels = c(0,1))
c10 <- confusionMatrix(motor_positive, predictedc)
etbl$Accuracy[3] = c10$overall["Accuracy"]
etbl$`True Pos`[3] = c10$table[2,2]
etbl$`True Neg`[3] = c10$table[1,1]
etbl$`False Pos`[3] = c10$table[1,2]
etbl$`False Neg`[3] = c10$table[2,1]
etbl$Sensitivity[3] = c10$byClass["Sensitivity"]
etbl$Specificity[3] = c10$byClass["Specificity"]
predicted7 <- predict(fit7, rbd, type="response") %>% round() # prediction using fit7
predictedc <- predicted7 %>% factor(levels = c(0,1))
c7 <- confusionMatrix(motor_positive, predictedc)
etbl$Accuracy[4] = c7$overall["Accuracy"]
etbl$`True Pos`[4] = c7$table[2,2]
etbl$`True Neg`[4] = c7$table[1,1]
etbl$`False Pos`[4] = c7$table[1,2]
etbl$`False Neg`[4] = c7$table[2,1]
etbl$Sensitivity[4] = c7$byClass["Sensitivity"]
etbl$Specificity[4] = c7$byClass["Specificity"]
etbl %>% kable(caption = "Prediction results from logistic glm", digits = 3) %>% kable_minimal()
| Model | Accuracy | True Pos | True Neg | False Pos | False Neg | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|
| fit24 | 0.46 | 11 | 12 | 8 | 19 | 0.387 | 0.579 |
| fit16 | 0.50 | 14 | 11 | 9 | 16 | 0.407 | 0.609 |
| fit10 | 0.48 | 12 | 12 | 8 | 18 | 0.400 | 0.600 |
| fit7 | 0.42 | 10 | 11 | 9 | 20 | 0.355 | 0.526 |
All of the models produce a good number of false positives, and each has more false negatives than true negatives. Model 16 has the best, although not stellar, performance. (One caution: caret’s confusionMatrix treats the first factor level, 0 here, as the positive class by default, so the Sensitivity and Specificity columns are computed with respect to the motor-negative class; passing positive = "1" would report them for the motor positives.) I’ll plot the ROC curves to see what they look like.
require(plotROC)
cols <- c('darkred', 'mediumvioletred', 'darkgreen', 'blue')
# add the predictions to the data and pivot longer
cmat <- tibble ("motor_true" = rbd$motor_true, "Model24" = round(predicted24), "Model16" = round(predicted16),
"Model10" = round(predicted10), "Model7" = round(predicted7)) %>%
pivot_longer(Model24:Model7,
names_to = "model",
values_to = "speech_pos"
)
cmat$model <- factor(cmat$model, levels = c("Model7", "Model10", "Model16", "Model24") ) # factor by model name
roc <- ggplot(cmat) +
geom_roc(aes(d=motor_true, m = speech_pos, color = model), size = 2, na.rm = TRUE, labels = TRUE) +
scale_color_manual(name = "Model", values = cols) +
labs(title = "ROC for logistic glm models", # titles and labels
x = "Speech positive / Motor negative (False positive fraction)",
y = "Speech positive / Motor positive (True positive fraction)") +
theme_light()
roc
cr <- calc_auc(roc) %>% rename("Model" = group)
cr$Model <- c("Model7", "Model10", "Model16", "Model24")
cr %>% dplyr::select(Model, AUC) %>% kable(caption = "Area under ROC", align = c("l", "l")) %>% kable_minimal()
| Model | AUC |
|---|---|
| Model7 | 0.4416667 |
| Model10 | 0.5000000 |
| Model16 | 0.5083333 |
| Model24 | 0.4833333 |
Again, Model 16, at an AUC just over 0.5, is marginally better than the other models, though none do much better than chance.
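One caveat: the curves above are built from the rounded 0/1 predictions, so each ROC reduces to a single bend. A variation using the raw predicted probabilities (my addition, not part of the original analysis) would trace a fuller curve:
# ROC from predicted probabilities rather than hard 0/1 classes
cmat_prob <- tibble("motor_true" = rbd$motor_true,
"Model24" = predict(fit24, rbd, type = "response"),
"Model16" = predict(fit16, rbd, type = "response"),
"Model10" = predict(fit10, rbd, type = "response"),
"Model7" = predict(fit7, rbd, type = "response")) %>%
pivot_longer(Model24:Model7, names_to = "model", values_to = "speech_prob")
ggplot(cmat_prob) +
geom_roc(aes(d = motor_true, m = speech_prob, color = model), na.rm = TRUE) +
theme_light()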
The study used quadratic discriminant analysis. I’ll try qda to see if it gives a better fit to the data. First I’ll create training and test data.
pdc <- pdata %>% filter(group == "PD") # use the PD data for training
tr <- pdc %>% dplyr::select(-id, -age, -motor_score, -group) # drop unwanted columns
tr$dysarthia <- factor(pdc$dysarthia, levels = c(0, 1)) # factor the response variable
set.seed(1)
#Use 70% of data as training set
sample <- sample(c(TRUE, FALSE), nrow(tr), replace=TRUE, prob=c(0.7,0.3))
train <- tr[sample, ]
test <- tr[!sample, ] #create a test dataset with the remaining
Now I’ll run qda models on pairs of variables and test them.
require(MASS)
vnames <- names(pdata[5:28]) # get the names of measurement variables
vnames <- combn(vnames,2) # create combinations of 2 names
mtest <- tibble("v1" = vnames[1,], "v2"= vnames[2,] ,"mean" = 0) # make a tibble with the names in columns
# loop through all the combinations and perform qda
for (i in seq_len(nrow(mtest))) {
tr <- dplyr::select(train, dysarthia, all_of(c(mtest$v1[i], mtest$v2[i]))) # select the two columns from train
model <- qda(dysarthia ~ ., data = tr) # create a qda model for the pair
predicted <- predict(model, test) # use the model to predict the test data
mtest$mean[i] <- mean(predicted$class == test$dysarthia) # proportion of correct test predictions
}
mtest <- mtest %>% arrange(desc(mean)) # look at the test results
head(mtest, 10) %>% kable(caption = "High QDA model test results", align = c("l","l","l"),
col.names = c("Variable 1", "Variable 2", "Test Mean")) %>% kable_minimal()
| Variable 1 | Variable 2 | Test Mean |
|---|---|---|
| dpi_r | gvi_r | 0.8888889 |
| dpi_r | lre_r | 0.8888889 |
| dpi_r | dvi_m | 0.8888889 |
| dpi_r | gvi_m | 0.8888889 |
| dpi_r | lre_m | 0.8888889 |
| dvi_r | lre_r | 0.8888889 |
| est_r | rst_r | 0.7777778 |
| est_r | ast_r | 0.7777778 |
| est_r | dpi_r | 0.7777778 |
| est_r | dvi_r | 0.7777778 |
A number of pairs have test means of 0.889, and the lowest result is 0.222. Seven variables appear among the high-mean pairs (a sketch for extracting them follows the comparison table below), although they’re not the same seven variables identified in the study.
| QDA pairs models | Study variables |
|---|---|
| dpi_r | dpi_m |
| dvi_m | dpi_r |
| dvi_r | dus_m |
| gvi_m | dus_r |
| gvi_r | dvi_m |
| lre_m | pir_m |
| lre_r | rst_r |
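Here’s one way to pull those seven variables out of the pair test results (my reconstruction; this step isn’t shown in the original):
# extract the unique variables appearing in the top-scoring qda pairs
top <- mtest %>% filter(mean == max(mean)) # the pairs with test mean 0.889
c(top$v1, top$v2) %>% unique() %>% sort()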
To create their prediction model, the study used repeated-measures analysis with between-group and within-group factors, Fisher’s least squares, Pearson correlation, and Bonferroni corrections. Again, I’m running up against statistical analysis beyond my current abilities and the time I have to figure it out. Instead, I’ll run the qda model with the seven variables from the study (Model 7a) and then with the seven high-scoring variables from the pairwise qda tests (Model 7b).
train1 <- train %>% dplyr::select(dysarthia, rst_r, dvi_m, dpi_r, dpi_m, dus_r, dus_m, pir_m)
pacman::p_load(klaR)
model7a <- qda(dysarthia ~ ., data = train1)
predictedt <- predict(model7a, test)
m7a <- mean(predictedt$class == test$dysarthia)
train2 <- train %>% dplyr::select(dysarthia, dpi_r, dvi_m, dvi_r, gvi_m, gvi_r, lre_m, lre_r)
model7b <- qda(dysarthia ~ ., data = train2)
predictedt <- predict(model7b, test)
m7b <- mean(predictedt$class == test$dysarthia)
The two models test equivalently: the qda test mean for the seven study variables (Model 7a) is 0.667, and the mean for the seven variables from the pairwise qda (Model 7b) is also 0.667. A partition matrix of the training data shows the misclassification errors for Model 7a.
# create a partition matrix for the study variables training data
# haven't found a way to add diagram title in partimat
partimat(dysarthia ~ ., data = train1, method = "qda", # perform qda
plot.matrix = TRUE,
col.correct='blue', col.wrong='red', col.mean = 'purple',
image.colors = c("white","grey"), main = "QDA")
Now I’ll use qda Model 7a to predict dysarthria for the RBD subjects and look at the results.
# predict dysarthia for RBD group using 7 study variable model
predicted <- predict(model7a, rbd)
rbd$speech_pos <- predicted$class # add speech positive prediction to rbd data
# print the scores for rbd data
rbd %>% dplyr::select(id, dysarthia, motor_score, motor_true, speech_pos) %>%
filter(dysarthia ==1 | speech_pos ==1) %>%
kable(caption = "RBD subjects tagged as Speech Positive by the qda model") %>% kable_minimal()
| id | dysarthia | motor_score | motor_true | speech_pos |
|---|---|---|---|---|
| RBD01 | 1 | 3 | 0 | 1 |
| RBD05 | 0 | 10 | 1 | 1 |
| RBD14 | 0 | 11 | 1 | 1 |
| RBD18 | 0 | 3 | 0 | 1 |
| RBD21 | 1 | 13 | 1 | 0 |
| RBD27 | 0 | 5 | 1 | 1 |
| RBD28 | 0 | 0 | 0 | 1 |
| RBD29 | 0 | 1 | 0 | 1 |
| RBD36 | 0 | 4 | 1 | 1 |
| RBD38 | 1 | 10 | 1 | 0 |
| RBD44 | 0 | 4 | 1 | 1 |
c7 <- confusionMatrix(motor_positive, predicted$class)
etbl <- etbl %>% add_row(Model = "Model7a",
Accuracy = c7$overall["Accuracy"],
`True Pos` = c7$table[2,2],
`True Neg` = c7$table[1,1],
`False Pos` = c7$table[1,2],
`False Neg` = c7$table[2,1],
Sensitivity = c7$byClass["Sensitivity"],
Specificity = c7$byClass["Specificity"])
filter(etbl, Model == "Model7a") %>% kable(caption = "Prediction results from qda", digits = 3) %>% kable_minimal()
| Model | Accuracy | True Pos | True Neg | False Pos | False Neg | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|
| Model7a | 0.42 | 5 | 16 | 4 | 25 | 0.39 | 0.556 |
The qda model of the seven variables flagged 9 of the RBD subjects as speech positive, but only one of the three subjects tagged with dysarthria by the speech specialist. Five of the nine subjects flagged by the model are motor positives. Its performance is comparable to the logistic regression model for the seven variables. As I noted above, there are factors and corrections that would need to be applied to create an effective prediction model.
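A quick cross-tabulation of the predictions against the motor-score labels (my addition, just to verify the counts above):
# cross-tab of model predictions against the motor-score ground truth
table(speech_pos = rbd$speech_pos, motor_true = rbd$motor_true)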
The study diagnosed dysarthria in RBD subjects with 70.0% accuracy, 73.9% sensitivity, and 66.7% specificity relative to their motor scores. It used an iterative leave-one-subject-out model validation scheme to find the best combination of acoustic measurements for the predictive model.
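For reference, here’s a minimal sketch of a leave-one-subject-out loop over the PD training data using the seven study variables; this is my approximation of the idea, not the study’s actual validation scheme:
# leave-one-subject-out accuracy for the seven-variable qda model
pd7 <- pdc %>% dplyr::select(dysarthia, rst_r, dvi_m, dpi_r, dpi_m, dus_r, dus_m, pir_m)
pd7$dysarthia <- factor(pd7$dysarthia, levels = c(0, 1))
loo_pred <- character(nrow(pd7))
for (i in seq_len(nrow(pd7))) {
m <- qda(dysarthia ~ ., data = pd7[-i, ]) # train on all but subject i (qda needs enough subjects per class)
loo_pred[i] <- as.character(predict(m, pd7[i, ])$class) # predict the held-out subject
}
mean(loo_pred == as.character(pd7$dysarthia)) # leave-one-out accuracy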
I’ll create an interactive plot for the Model 7a prediction, extending the diagram from Project 2 that shows the attributes dpi_m and rst_r, and adding the points for RBD subjects diagnosed with dysarthria by the model.
The diagram in the study has a curved solid gray line representing the border of discrimination between RBD speech-negative and speech-positive subjects.
I haven’t found a way (yet!) to replicate this curve. It looks similar to the partition curves in the partimat diagram, but my experiments with ggplot, ggplotly, and highcharter weren’t successful. I’ll approximate it with a loess line fitting dpi_m to rst_r for the RBD speech-positive group.
pacman::p_load(broom, highcharter)
p2 <- pdata %>% mutate(px = NA, py = NA) # set up the data for creating fitted line
p2$group <- as.character(p2$group)
# the 50 RBD subjects follow the 30 PD subjects in pdata, hence the +30 offset
for (i in 1:50) {
if(rbd$speech_pos[i] == 1) { # get x, y variables for fitted line
p2$px[i+30] <- p2$rst_r[i+30]
p2$py[i+30] <- p2$dpi_m[i+30]
p2$group[i+30] <- "RBD_DYS" # assign new group for speech positives
}
}
# factor by the new set of groups
p2$group <- p2$group %>% factor(levels = c("PD", "RBD", "RBD_DYS", "controls"))
# create a model for the fitted line and add the residuals to the data
modlss <- loess(py ~ px, data = filter(p2, group == "RBD_DYS"))
fitd <- arrange(augment(modlss, filter(p2, group == "RBD_DYS")), px) %>%
rename(residual = `.resid`)
#relabel the factors for the plot legend
p2$group <- p2$group %>% factor(labels = c("Parkinson's",
"RBD speech negative", "RBD speech positive",
"controls"))
cols <- c('red', 'royalblue', 'purple', 'darkgreen') # set colors for the subject groups
highchart() %>%
hc_add_theme(hc_theme_smpl()) %>%
hc_add_series(p2, # scatter plot of the two variables
type = "scatter",
hcaes(x=rst_r, y = dpi_m, group = group),
opacity = 0.75,
marker = list(radius = 6),
tooltip = list(pointFormat =
"ID: {point.id}<br>
Dysarthia: {point.dysarthia}<br>
Motor Score: {point.motor_score}")
) %>%
hc_colors(cols) %>%
hc_xAxis(title = list(text = "Reading RST (-/min)")) %>%
hc_yAxis(title = list(text = "Monologue DPI (ms)")) %>%
hc_add_series(fitd, type = "line", hcaes(x = px, y = .fitted), # add the fitted line
name = "RBD speech positive fit", id = "fit",
color = "grey",
lineWidth = 5,
opacity = 0.5,
tooltip = list(pointFormat =
"ID: {point.id}<br>
Residual: {point.residual}<br>
Dysarthia: {point.dysarthia}<br>
Motor Score: {point.motor_score}")
) %>%
hc_legend(align = "right", verticalAlign = "middle", layout = "vertical") %>%
hc_title(text = "Parkinson's Disease Diagnosis: Speech Pattern Monologue DPI ~ Reading RST ",
margin = 20, align = "center", style = list(color = "#22A883")) %>%
hc_subtitle(text = "Tool tip: Dysarthia = score by speech specialist") %>%
hc_caption(text = "source: https://www.nature.com/articles/s41598-017-00047-5", align = "right")
This project presented many challenges, and I learned a lot: first, understanding what the study was doing in order to try to replicate the analysis; then, learning how to use the glm and qda model functions and interpret their results; and finally, finding ways to build visualizations that demonstrate the results. Even though I wasn’t able to fully implement all the methods from the study, and my models performed poorly, I feel I’ve gained some understanding of what’s involved in replicating the data analysis.
1. Hlavnička, J., Čmejla, R., Tykalová, T., et al. (2017). Automated analysis of connected speech reveals early biomarkers of Parkinson’s disease in patients with rapid eye movement sleep behaviour disorder. Scientific Reports. https://www.nature.com/articles/s41598-017-00047-5
2. Early biomarkers of Parkinson’s disease dataset. Kaggle. https://www.kaggle.com/ruslankl/early-biomarkers-of-parkinsons-disease
3. Macy, Marilyn (2021). Parkinson’s Speech Analysis, Data 110 - Project 2. RPubs. https://rpubs.com/sopranomax/756877