A study of speech analysis measurements for untreated Parkinson’s disease patients and subjects with REM sleep behavior disorder reveals that an automated tool for early diagnosis of Parkinson’s Disease can be reliable. The study is presented in the article “Automated analysis of connected speech reveals early biomarkers of Parkinson’s disease in patients with rapid eye movement sleep behavior disorder”2. Both my father and my father-in-law had Parkinson’s disease, so I found this study interesting. In this paper I’ll approximate the statistical analysis from the study and create a prediction model for diagnosing Parkinson’s from the speech data.
Dysarthria is a speech disorder where the muscles used to produce speech are damaged, paralyzed, or weakened. Close to 90% of people with Parkinson’s Disease (PD) present with hypokinetic dysarthria, evidenced by reduced vocal loudness, monotone, reduced fundamental frequency range, consonant and vowel imprecision, breathiness and irregular pauses.1 Evaluation of dysarthia in disorders such as PD has been limited to perceptual tests or user-controlled laboratory analysis based upon rather small samples of human vocalizations. The development of a fully-automated method to detect early, distinctive patterns of neuro-degeneration using only acoustic data from connected speech such as reading a short passage or monologue would have the potential to revolutionize the diagnostic process for diseases manifesting motor speech disorders.2
The dataset associated with the article is available as an excel file via the article on Nature.com. It’s also on Kaggle as a csv file.3
The data include 30 patients with early untreated Parkinson’s disease (PD), 50 patients with REM sleep behavior disorder (RBD), which are at high risk for developing Parkinson’s disease, and 50 healthy controls (controls). All patients were scored clinically by a neurologist with experience in movement disorders. All subjects were examined during a single session with a speech specialist and designated as having dysarthia or not.
All the subjects were recorded twice the first recording reading a standardized, phonetically-balanced text of 80 words and the second a monologue for approximately 90 seconds. Twelve speech features were automatically analyzed by Jan Hlavnička et al. (2017)3 for each recording, resulting in 24 measurement variables.
Read and view the .csv file of speech data.
setwd("C:/Users/Owner/Desktop/My Documents/School/DataFiles/Parkinsons")
peb <- tibble(read_csv("parkinsonsEarlyBiomarkers.csv")) # read Parkinsons Early Biomarkers dataset
glimpse(peb) # look at the peb tibble
## Rows: 130
## Columns: 65
## $ `Participant code` <chr> "PD01"~
## $ `Age (years)` <dbl> 58, 68~
## $ Gender <chr> "F", "~
## $ `Positive history of Parkinson disease in family` <chr> "No", ~
## $ `Age of disease onset (years)` <chr> "56", ~
## $ `Duration of disease from first symptoms (years)` <chr> "2", "~
## $ `Antidepressant therapy` <chr> "No", ~
## $ `Antiparkinsonian medication` <chr> "No", ~
## $ `Antipsychotic medication` <chr> "No", ~
## $ `Benzodiazepine medication` <chr> "No", ~
## $ `Levodopa equivalent (mg/day)` <dbl> 0, 0, ~
## $ `Clonazepam (mg/day)` <dbl> 0.0, 0~
## $ `Overview of motor examination: Hoehn & Yahr scale (-)` <chr> "1.5",~
## $ `Overview of motor examination: UPDRS III total (-)` <chr> "8", "~
## $ `18. Speech` <chr> "0", "~
## $ `19. Facial Expression` <chr> "1", "~
## $ `20. Tremor at Rest - head` <chr> "0", "~
## $ `20. Tremor at Rest - RUE` <chr> "0", "~
## $ `20. Tremor at Rest - LUE` <chr> "2", "~
## $ `20. Tremor at Rest - RLE` <chr> "0", "~
## $ `20. Tremor at Rest - LLE` <chr> "2", "~
## $ `21. Action or Postural Tremor - RUE` <chr> "0", "~
## $ `21. Action or Postural Tremor - LUE` <chr> "0", "~
## $ `22. Rigidity - neck` <chr> "0", "~
## $ `22. Rigidity - RUE` <chr> "0", "~
## $ `22. Rigidity - LUE` <chr> "1", "~
## $ `22. Rigidity - RLE` <chr> "0", "~
## $ `22. Rigidity - LLE` <chr> "0", "~
## $ `23.Finger Taps - RUE` <chr> "0", "~
## $ `23.Finger Taps - LUE` <chr> "1", "~
## $ `24. Hand Movements - RUE` <chr> "0", "~
## $ `24. Hand Movements - LUE` <chr> "0", "~
## $ `25. Rapid Alternating Movements - RUE` <chr> "0", "~
## $ `25. Rapid Alternating Movements - LUE` <chr> "1", "~
## $ `26. Leg Agility - RLE` <chr> "0", "~
## $ `26. Leg Agility - LLE` <chr> "0", "~
## $ `27. Arising from Chair` <chr> "0", "~
## $ `28. Posture` <chr> "0", "~
## $ `29. Gait` <chr> "0", "~
## $ `30. Postural Stability` <chr> "0", "~
## $ `31. Body Bradykinesia and Hypokinesia` <chr> "0", "~
## $ `Entropy of speech timing (-)` <dbl> 1.564,~
## $ `Rate of speech timing (-/min)` <dbl> 354, 3~
## $ `Acceleration of speech timing (-/min2)` <dbl> 6.05, ~
## $ `Duration of pause intervals (ms)` <dbl> 146, 1~
## $ `Duration of voiced intervals (ms)` <dbl> 264, 2~
## $ `Gaping in-between voiced intervals (-/min)` <dbl> 58.65,~
## $ `Duration of unvoiced stops (ms)` <dbl> 31.38,~
## $ `Decay of unvoiced fricatives (‰/min)` <dbl> -2.101~
## $ `Relative loudness of respiration (dB)` <dbl> -22.47~
## $ `Pause intervals per respiration (-)` <dbl> 4.50, ~
## $ `Rate of speech respiration (-/min)` <dbl> 21.14,~
## $ `Latency of respiratory exchange (ms)` <dbl> 167, 1~
## $ `Entropy of speech timing (-)_1` <dbl> 1.564,~
## $ `Rate of speech timing (-/min)_1` <dbl> 333, 2~
## $ `Acceleration of speech timing (-/min2)_1` <dbl> -2.82,~
## $ `Duration of pause intervals (ms)_1` <dbl> 158, 2~
## $ `Duration of voiced intervals (ms)_1` <dbl> 318, 2~
## $ `Gaping in-between voiced intervals (-/min)_1` <dbl> 49.01,~
## $ `Duration of unvoiced stops (ms)_1` <dbl> 22.37,~
## $ `Decay of unvoiced fricatives (‰/min)_1` <dbl> 0.588,~
## $ `Relative loudness of respiration (dB)_1` <dbl> -19.77~
## $ `Pause intervals per respiration (-)_1` <dbl> 6.0, 4~
## $ `Rate of speech respiration (-/min)_1` <dbl> 13.81,~
## $ `Latency of respiratory exchange (ms)_1` <dbl> 127, 3~
The data file is tidy dataset, however it has long, descriptive variable names. To effectively use the data it will be easier to create concise variables names by eliminating spaces, extraneous words and creating acronyms.
units <- names(peb[42:53]) %>% str_extract("(\\(.*\\))") # pull the exam measurement units off of the names
names(peb) <- gsub("(of) | (per) | (in-between) | (\\(.*\\))"," ", names(peb)) # clean out extraneous words
names(peb) <- str_to_title(names(peb)) # capitalize each word
names(peb) <- gsub(" ","",names(peb)) # eliminate spaces
names(peb) # look at the names
## [1] "ParticipantCode"
## [2] "Age"
## [3] "Gender"
## [4] "PositiveHistoryParkinsonDiseaseInFamily"
## [5] "AgeDiseaseOnset"
## [6] "DurationDiseaseFromFirstSymptoms"
## [7] "AntidepressantTherapy"
## [8] "AntiparkinsonianMedication"
## [9] "AntipsychoticMedication"
## [10] "BenzodiazepineMedication"
## [11] "LevodopaEquivalent"
## [12] "Clonazepam"
## [13] "OverviewMotorExamination:Hoehn&YahrScale"
## [14] "OverviewMotorExamination:UpdrsIiiTotal"
## [15] "18.Speech"
## [16] "19.FacialExpression"
## [17] "20.TremorAtRest-Head"
## [18] "20.TremorAtRest-Rue"
## [19] "20.TremorAtRest-Lue"
## [20] "20.TremorAtRest-Rle"
## [21] "20.TremorAtRest-Lle"
## [22] "21.ActionOrPosturalTremor-Rue"
## [23] "21.ActionOrPosturalTremor-Lue"
## [24] "22.Rigidity-Neck"
## [25] "22.Rigidity-Rue"
## [26] "22.Rigidity-Lue"
## [27] "22.Rigidity-Rle"
## [28] "22.Rigidity-Lle"
## [29] "23.FingerTaps-Rue"
## [30] "23.FingerTaps-Lue"
## [31] "24.HandMovements-Rue"
## [32] "24.HandMovements-Lue"
## [33] "25.RapidAlternatingMovements-Rue"
## [34] "25.RapidAlternatingMovements-Lue"
## [35] "26.LegAgility-Rle"
## [36] "26.LegAgility-Lle"
## [37] "27.ArisingFromChair"
## [38] "28.Posture"
## [39] "29.Gait"
## [40] "30.PosturalStability"
## [41] "31.BodyBradykinesiaAndHypokinesia"
## [42] "EntropySpeechTiming"
## [43] "RateSpeechTiming"
## [44] "AccelerationSpeechTiming"
## [45] "DurationPauseIntervals"
## [46] "DurationVoicedIntervals"
## [47] "GapingVoicedIntervals"
## [48] "DurationUnvoicedStops"
## [49] "DecayUnvoicedFricatives"
## [50] "RelativeLoudnessRespiration"
## [51] "PauseIntervalsRespiration"
## [52] "RateSpeechRespiration"
## [53] "LatencyRespiratoryExchange"
## [54] "EntropySpeechTiming_1"
## [55] "RateSpeechTiming_1"
## [56] "AccelerationSpeechTiming_1"
## [57] "DurationPauseIntervals_1"
## [58] "DurationVoicedIntervals_1"
## [59] "GapingVoicedIntervals_1"
## [60] "DurationUnvoicedStops_1"
## [61] "DecayUnvoicedFricatives_1"
## [62] "RelativeLoudnessRespiration_1"
## [63] "PauseIntervalsRespiration_1"
## [64] "RateSpeechRespiration_1"
## [65] "LatencyRespiratoryExchange_1"
Create a set of acronym labels for the exam measurements.
| Label | Exam measurement |
|---|---|
| RST | rate of speech timing |
| AST | acceleration of speech timing |
| DPI | duration of pause intervals |
| EST | entropy of speech timing |
| DUS | duration of unvoiced stops |
| DUF | decay of unvoiced fricatives |
| DVI | duration of voiced intervals |
| GVI | gaping in-between voiced intervals |
| RSR | rate of speech respiration |
| PIR | pause intervals per respiration |
| RLR | relative loudness of respiration |
| LRE | latency of respiratory exchange |
tLabels <- gsub("[a-z]", "", names(peb[42:53])) # grab the capital letters to create acronyms
for (i in 1:12) {
tLabels[i] <- str_glue(tLabels[i], " ", units[i]) # concatenate the acronyms with units
}
tLabels # look at the labels
## [1] "EST (-)" "RST (-/min)" "AST (-/min2)" "DPI (ms)" "DVI (ms)"
## [6] "GVI (-/min)" "DUS (ms)" "DUF (‰/min)" "RLR (dB)" "PIR (-)"
## [11] "RSR (-/min)" "LRE (ms)"
To obtain the desired set of data, leave out the medication variables and the 26 motor skill scores which are numbered 19 through 31 in the variable labels. Create a variable for the test groups and convert character data to numeric.
pdata <- select(peb, id = ParticipantCode, # create a tibble with the needed variables
age = Age,
motor_score = `OverviewMotorExamination:UpdrsIiiTotal`, # motor skills score
dysarthia = `18.Speech`, # dysarthia score
est_r = EntropySpeechTiming, # reading speech measurements
rst_r = RateSpeechTiming,
ast_r = AccelerationSpeechTiming,
dpi_r = DurationPauseIntervals,
dvi_r = DurationVoicedIntervals,
gvi_r = GapingVoicedIntervals,
dus_r = DurationUnvoicedStops,
duf_r = DecayUnvoicedFricatives,
rlr_r = RelativeLoudnessRespiration,
pir_r = PauseIntervalsRespiration,
rsr_r = RateSpeechRespiration,
lre_r = LatencyRespiratoryExchange,
est_m = EntropySpeechTiming_1, # monologue speech measurements
rst_m = RateSpeechTiming_1,
ast_m = AccelerationSpeechTiming_1,
dpi_m = DurationPauseIntervals_1,
dvi_m = DurationVoicedIntervals_1,
gvi_m = GapingVoicedIntervals_1,
dus_m = DurationUnvoicedStops_1,
duf_m = DecayUnvoicedFricatives_1,
rlr_m = RelativeLoudnessRespiration_1,
pir_m = PauseIntervalsRespiration_1,
rsr_m = RateSpeechRespiration_1,
lre_m = LatencyRespiratoryExchange_1
)
pdata$group <- str_extract(peb$ParticipantCode, "(PD|RBD|HC)") # pull out the group categories from the ids
pdata$group <- pdata$group %>% str_replace("HC", "controls") %>% # factor group to order and label like the article diagrams
factor(levels = c("PD", "RBD", "controls"))
# assuming control group does not have dysarthia, change hyphen to 0 to prevent NA
pdata$dysarthia <- pdata$dysarthia %>% str_replace('-','0') %>% as.numeric()
pdata$motor_score <- as.numeric(pdata$motor_score)
head(pdata) %>% kable() %>% kable_minimal()
| id | age | motor_score | dysarthia | est_r | rst_r | ast_r | dpi_r | dvi_r | gvi_r | dus_r | duf_r | rlr_r | pir_r | rsr_r | lre_r | est_m | rst_m | ast_m | dpi_m | dvi_m | gvi_m | dus_m | duf_m | rlr_m | pir_m | rsr_m | lre_m | group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PD01 | 58 | 8 | 0 | 1.564 | 354 | 6.05 | 146 | 264 | 58.65 | 31.38 | -2.101 | -22.47 | 4.50 | 21.14 | 167 | 1.564 | 333 | -2.82 | 158 | 318 | 49.01 | 22.37 | 0.588 | -19.77 | 6 | 13.81 | 127 | PD |
| PD02 | 68 | 22 | 1 | 1.564 | 340 | 27.52 | 173 | 253 | 48.26 | 22.38 | -1.745 | -24.59 | 7.00 | 15.28 | 163 | 1.569 | 285 | 8.20 | 295 | 264 | 40.56 | 26.88 | -0.825 | -23.26 | 4 | 21.77 | 313 | PD |
| PD03 | 68 | 19 | 0 | 1.550 | 211 | 11.97 | 377 | 322 | 47.54 | 38.12 | 2.657 | -16.89 | 3.00 | 20.76 | 372 | 1.550 | 247 | 4.71 | 280 | 317 | 48.97 | 22.37 | -0.955 | -13.29 | 4 | 22.52 | 201 | PD |
| PD04 | 75 | 24 | 0 | 1.519 | 140 | -2.49 | 360 | 663 | 13.72 | 44.88 | -0.934 | -25.54 | 1.00 | 18.71 | 119 | 1.539 | 112 | -9.09 | 397 | 800 | 18.69 | 49.37 | 0.791 | -25.08 | 2 | 14.37 | 151 | PD |
| PD05 | 61 | 54 | 1 | 1.543 | 269 | 6.72 | 211 | 328 | 42.90 | 47.12 | -0.973 | -22.61 | 5.00 | 16.26 | 78 | 1.560 | 230 | 11.77 | 206 | 480 | 33.54 | 26.87 | 0.075 | -22.32 | 5 | 14.61 | 151 | PD |
| PD06 | 58 | 29 | 1 | 1.553 | 317 | 24.19 | 186 | 286 | 43.83 | 33.63 | 0.921 | -25.00 | 2.75 | 27.07 | 124 | 1.518 | 181 | 13.38 | 611 | 398 | 18.18 | 49.37 | 1.488 | -25.08 | 2 | 18.21 | 593 | PD |
The percentage of each group tagged as having dysarthia by a speech specialist using perceptual tests is shown in the following table.
dpct <- group_by(pdata, group) %>% # create a table of percentages by group
summarize(num = n(), dysarthia = sum(dysarthia), dysarthiaPct = (sum(dysarthia) / n()) * 100)
kable(dpct, caption = "Percentage of participants tagged with dysarthia", digits = 2, # print the table
col.names = c("Group", "#Participants", "Dysarthia", "%")) %>% kable_material()
| Group | #Participants | Dysarthia | % |
|---|---|---|---|
| PD | 30 | 13 | 43.33 |
| RBD | 50 | 3 | 6.00 |
| controls | 50 | 0 | 0.00 |
Six percent of the RBD group are tagged with dysarthia by the perceptual evaluation, while 43% of the group with untreated Parkinson’s disease are tagged.
Age itself can have an impact on speech and vocal quality. The following graph shows the distribution of participant ages for each test group.
# create a quick plot of the participant ages in each group
qplot(age, data = pdata, fill=group, geom = "dotplot", facets = "group") +
scale_fill_manual(name = "Group", values=c('red', 'powderblue', 'darkgreen')) + # colors for each group
theme_bw()
It isn’t common to see Parkinson’s disease in people younger than 50, but for a small subset the disease strikes early. People are diagnosed with Parkinson’s at an average age of 60. My father was diagnosed at 52, and my father-in-law in his seventies. Diagnosis in a person younger than 50 is considered young-onset Parkinson’s, or YOPD. Rarely, PD may be diagnosed in people younger than 40 — about 2 percent of the 1 million people with Parkinson’s were diagnosed earlier than age 40.4 One of the most recognizable is Michael J. Fox, who was diagnosed at age 29.5
The samples show three YOPD subjects and a reasonable balance of subjects across age ranges within the three groups.
Pivot the data longer to combine the measurements for the two exam recordings into a single column for measurement type, a column for the recording type (reading or monologue) and a column for the measurement value.
pd1 <- pdata %>% pivot_longer(est_r:lre_m,
names_to = c("meastyp", "examtyp"),
names_pattern = "(.*)_(.)",
values_to = "amt"
)
getMlabel <- sort(tLabels) # create a lookup table of the labels
nmes <- unique(pd1$meastyp) %>% sort # create the lookup names
names(getMlabel) <- sort(nmes)
pd1$mlabel <- getMlabel[pd1$meastyp] # add labels to pd1
pd1$meastyp <-factor(pd1$meastyp, levels=c("est", # order the types as they are in the dataset
"rst",
"ast",
"dpi",
"dvi",
"gvi",
"dus",
"duf",
"rlr",
"pir",
"rsr",
"lre")
)
pd1$examtyp <- if_else(pd1$examtyp == "m", "Monologue",
if(pd1$examtyp == "r") "Reading")
Create a faceted boxplot to look at the range for each of the variables within each group and exam type.
ggplot(pd1, aes(meastyp, amt, fill = meastyp)) + # create a faceted boxplot
geom_boxplot(outlier.color = "red") +
coord_cartesian(ylim = c(-35, 615)) + # expand the y-axis scale
facet_grid(cols = vars(group),
rows = vars(examtyp), scales = "free") + # facet by the group
theme_bw() + scale_x_discrete(breaks=NULL) + # don't show labels on the x-axis
scale_fill_discrete(name = "Speech measurement",
labels = tLabels) +
labs(x = NULL, y = NULL)
The units as shown in the legend for each measurement type are quite different, dB, ms, min, etc. Outliers are evident in each group for most of the variables. In the study the speech data is split into four speech categories, Respiration, Timing, Phonation, and Articulation. Dividing the data by these categories and creating boxplots for each gives a better view of the ranges for each measurement type.
# Add speech categories
pd1 <- pd1 %>% mutate(spchcat = if_else(meastyp %in% c("rlr", "lre", "pir", "rsr"), "Respiration",
if_else(meastyp %in% c("rst", "ast", "dpi", "est"), "Timing",
if_else(meastyp %in% c("gvi", "dvi"), "Phonation", "Articulation")
))
)
gcat <- "Respiration" # initialize the category variable
pcolor <- "green" # initialize the color variable
pd1$group <- factor(pd1$group, levels = c("PD", "RBD", "controls")) # factor group to display in specified order
# create a function with the plot template
pt <- function (gcat, pcolor) {
dcat <- pd1 %>% filter(spchcat == gcat) # create a faceted box plot
ggplot(dcat) +
geom_boxplot(aes(group, amt, fill = examtyp), outlier.color = "red") +
facet_wrap(vars(mlabel), scales = "free") +
theme_bw() +
scale_fill_grey(name = "Legend:", labels = c("Reading passage", "Monologue")) +
labs(title = gcat, x = NULL, y = NULL) +
theme(plot.title.position = "plot",
plot.title = element_text(hjust = .5, vjust = 0, size = rel(2)), # add & position title
legend.position = "bottom", legend.direction = "horizontal", # legend on the bottom
plot.background = element_rect(color = pcolor, size = 6), # rectangle around plot
plot.margin = margin(.5, .5, .5, .5, "cm"),
strip.text.x = element_text(face = "bold", size = rel(1.5) )) # bold the facet labels
}
# execute the template for each category
par(mfrow = c(2,2))
pt("Respiration", "dodgerblue")
pt("Phonation", "firebrick2")
pt("Timing", "seagreen3")
pt("Articulation", "yellow3")
The following diagram shows distributions and correlations of exam measurements for the subjects in the Parkinson’s group. Many of the variables are skewed. Even though the entire sample size of 130 is reasonably large, and the Parkinson’s disease group has 30 subjects, a linear regression model is probably not the best choice for finding a predictor.
The study used quadratic discriminant function analysis where the multivariate normal distribution characterized by the mean and covariance matrix was estimated using an expectation maximization algorithm. Logistic regression will be here used with the understanding that the models are probably not representative of the data.
require(psych)
pdc <- pdata %>% filter(group == "PD") # get the Parkinson's group data
pairs.panels(pdc[5:28], # plot distributions and correlations
gap = 0,
pch = 20,
lm = TRUE)
Working through all of the permutations of 24 variables would be time consuming, so a few models will be developed to continue. The study found seven variables that showed the most distinctive disturbed speech patterns between the PD group and the control group. Three models will be created to use in logistic regression.
The following diagram highlights significant correlations of exam measurements for the Parkinson’s subjects.
pacman::p_load(GGally)
pcor <- pdata %>% filter(group == "PD") %>% select(est_r:lre_m) # filter for the PD group
# create a correlation plot
pdcorr <- ggcorr(pcor, nbreaks = 5, palette = "RdGy", # set the number of levels and the colors
hjust = .65, size = 4, # nudge and format the axis labels
label = TRUE, label_size = 4, label_color = "white", # format the coefficient
name = "Significance") + # legend title
labs(title = "PD Correlation", # plot title
caption = "source: https://www.nature.com/articles/s41598-017-00047-5") +
theme(plot.title.position = "plot",
plot.title = element_text(hjust = .5, vjust = 0, size = rel(2)), # position and format the title and caption
plot.caption.position = "plot")
pdcorr
There are sixteen unique variables with significant coefficients. This table shows the coefficients with absolute values greater than 0.6.
pdcorr$data %>% filter(abs(coefficient) > 0.6 ) %>%
select(1:3) %>%
arrange(coefficient) %>%
kable(caption = "Significant correlation of PD group", digits = 2,
col.names = c("X", "Y", "Coefficient"), row.names = TRUE) %>% kable_minimal()
| X | Y | Coefficient | |
|---|---|---|---|
| 1 | dpi_r | rst_r | -0.86 |
| 2 | dvi_r | rst_r | -0.85 |
| 3 | dvi_m | rst_m | -0.81 |
| 4 | dpi_m | rst_m | -0.79 |
| 5 | gvi_m | dpi_m | -0.72 |
| 6 | dvi_m | rst_r | -0.71 |
| 7 | rsr_r | pir_r | -0.70 |
| 8 | dus_m | rst_m | -0.70 |
| 9 | gvi_r | dvi_r | -0.67 |
| 10 | rst_m | dvi_r | -0.64 |
| 11 | pir_m | dpi_m | -0.63 |
| 12 | dvi_m | gvi_r | -0.62 |
| 13 | dus_m | gvi_m | -0.62 |
| 14 | dus_m | dvi_m | 0.60 |
| 15 | pir_m | rst_m | 0.61 |
| 16 | dvi_m | dus_r | 0.61 |
| 17 | lre_m | dpi_m | 0.63 |
| 18 | gvi_r | rst_r | 0.64 |
| 19 | rlr_m | gvi_m | 0.64 |
| 20 | pir_m | gvi_m | 0.65 |
| 21 | gvi_m | rst_m | 0.65 |
| 22 | dvi_r | dpi_r | 0.67 |
| 23 | dus_m | dpi_m | 0.70 |
| 24 | gvi_m | gvi_r | 0.73 |
| 25 | gvi_m | est_m | 0.77 |
| 26 | rlr_m | rlr_r | 0.78 |
| 27 | dvi_m | dvi_r | 0.90 |
A diagnosis of dysarthia is a binomial result which lends itself to logistic regression. General linear models with the logit function are used to create the three models. Then the models are plotted to evaluate their fit to the data.
attach(pdc) # use the Parkinsons data
fit24 <- glm(dysarthia ~ est_r + rst_r + ast_r + dpi_r + dvi_r + gvi_r + # model all 24 variables
dus_r + duf_r + rlr_r + pir_r + rsr_r + lre_r +
est_m + rst_m + ast_m + dpi_m + dvi_m + gvi_m +
dus_m + duf_m + rlr_m + pir_m + rsr_m + lre_m,
family=binomial(link="logit")
)
# model 16 variables with significant correlation
fit16 <- glm(dysarthia ~ dpi_m + dpi_r + dus_m + dvi_m + dvi_r + est_m +
gvi_m + lre_m + pir_m + pir_r + rlr_m + rlr_r +
rsr_m + rsr_r + rst_m + rst_r,
family=binomial(link="logit")
)
# model 7 significant variables found in the study
fit7 <- glm(dysarthia ~ dpi_m + dpi_r + dus_m + dus_r + dvi_m + pir_r + rst_r,
family=binomial(link="logit")
)
detach(pdc)
par(mfrow = c(2,2))
plot(fit24)
mtext("fit24", side = 1)
par(mfrow = c(2,2), bg = "aliceblue")
plot(fit16)
mtext("fit16", side = 1)
par(mfrow = c(2,2), bg = "white")
plot(fit7)
mtext("fit7", side = 1)
Consistent with the boxplots, each of the models show high leverage outliers. Removing the outliers and rerunning the models doesn’t improve them, each model shows a new set of high leverage outliers, as shown below. Which is further evidence that logistic regression is not the best fit for this data.
pdo <- pdc %>% filter(!id %in% c("PD04", "PD08", "PD27")) # remove high leverage outliers for fit24
attach(pdo)
fito <- glm(dysarthia ~ est_r + rst_r + ast_r + dpi_r + dvi_r + gvi_r + # model all 24 variables
dus_r + duf_r + rlr_r + pir_r + rsr_r + lre_r +
est_m + rst_m + ast_m + dpi_m + dvi_m + gvi_m +
dus_m + duf_m + rlr_m + pir_m + rsr_m + lre_m,
family=binomial(link="logit")
)
par(mfrow = c(2,2))
plot(fito)
mtext("fit24", side = 1)
pdo <- pdc %>% filter(!id %in% c("PD03", "PD08", "PD10")) # remove high leverage outliers for fit16
fito <- glm(dysarthia ~ dpi_m + dpi_r + dus_m + dvi_m + dvi_r + est_m + # model 16 variables with significant correlation
gvi_m + lre_m + pir_m + pir_r + rlr_m + rlr_r +
rsr_m + rsr_r + rst_m + rst_r,
family=binomial(link="logit")
)
par(mfrow = c(2,2), bg = "aliceblue")
plot(fito)
mtext("fit16", side = 1)
pdo <- pdc %>% filter(!id %in% c("PD06", "PD09", "PD27")) # remove high leverage outliers for fit7
fito <- glm(dysarthia ~ dpi_m + dpi_r + dvi_m + dvi_r + est_m + gvi_m + # model 7 significant variables found in the study
rlr_m + rlr_r + rst_m + rst_r,
family=binomial(link="logit")
)
par(mfrow = c(2,2), bg = "white")
plot(fito)
mtext("fit7", side = 1)
detach(pdo)
Both PD and RBD subjects were scored according to the Unified Parkinson’s Disease Rating Scale motor subscore (UPDRS III15). The scores range from 0 to 108, with 0 for no motor manifestations and 108 representing severe motor distortion. A score > 3 was is considered a very good indicator of initial parkinsonism.
rbd <- pdata %>% filter(group == "RBD") # get the RBD group data
The RBD subject have motor scores ranging from 0 to 21. RBD subjects with a motor score ≤ 3 (asymptomatic RBD subgroup) are labeled motor negatives and RBD subjects with a motor score >3 (symptomatic RBD subgroup) were labeled motor positives. This categorization is used to create the confusion matrices for evaluating the model performance by creating a logical variable from the motor evaluation score.
rbd <- rbd %>%
mutate(motor_true = if_else(motor_score > 3, 1, 0)) # add the motor true categorization
rbd %>% select(id, dysarthia, motor_score, motor_true) %>% # look at the initial dysarthia diagnosis, motor score and motor true
head() %>%
kable() %>% kable_minimal()
| id | dysarthia | motor_score | motor_true |
|---|---|---|---|
| RBD01 | 1 | 3 | 0 |
| RBD02 | 0 | 2 | 0 |
| RBD03 | 0 | 7 | 1 |
| RBD04 | 0 | 1 | 0 |
| RBD05 | 0 | 10 | 1 |
| RBD06 | 0 | 12 | 1 |
Creating the predictions for each of the models show that none of them have very good performance compared to the motor score. They all have more false negatives than true negatives, and each has a good number of false positives. The model with the best performance is fit16, but it’s performance at 50% accuracy with 41% sensitivity and 61% specificity is still not great.
require(caret)
## Warning: package 'caret' was built under R version 4.0.5
etbl <- tibble("Model" = c("fit24", "fit16", "fit7"), # tibble to compare fit evals
"Accuracy" = c(0,0,0),
"True Pos" = c(0,0,0),
"True Neg" = c(0,0,0),
"False Pos" = c(0,0,0),
"False Neg" = c(0,0,0),
"Sensitivity" = c(0,0,0),
"Specificity" = c(0,0,0)
)
motor_positive <- factor(rbd$motor_true, c(0,1)) # factor required by confusionMatrix
predicted24 <- predict(fit24, rbd, type = "response") %>% round() # prediction using fit24
predictedc <- predicted24 %>% factor(levels = c(0,1))
c24 <- confusionMatrix(motor_positive, predictedc)
etbl$Accuracy[1] = c24$overall["Accuracy"]
etbl$`True Pos`[1] = c24$table[2,2]
etbl$`True Neg`[1] = c24$table[1,1]
etbl$`False Pos`[1] = c24$table[1,2]
etbl$`False Neg`[1] = c24$table[2,1]
etbl$Sensitivity[1] = c24$byClass["Sensitivity"]
etbl$Specificity[1] = c24$byClass["Specificity"]
predicted16 <- predict(fit16, rbd, type = "response") %>% round() # prediction using fit16
predictedc <- predicted16 %>% factor(levels = c(0,1))
c16 <- confusionMatrix(motor_positive, predictedc)
etbl$Accuracy[2] = c16$overall["Accuracy"]
etbl$`True Pos`[2] = c16$table[2,2]
etbl$`True Neg`[2] = c16$table[1,1]
etbl$`False Pos`[2] = c16$table[1,2]
etbl$`False Neg`[2] = c16$table[2,1]
etbl$Sensitivity[2] = c16$byClass["Sensitivity"]
etbl$Specificity[2] = c16$byClass["Specificity"]
predicted7 <- predict(fit7, rbd, type="response") %>% round() # prediction using fit7
predictedc <- predicted7 %>% factor(levels = c(0,1))
c7 <- confusionMatrix(motor_positive, predictedc)
etbl$Accuracy[3] = c7$overall["Accuracy"]
etbl$`True Pos`[3] = c7$table[2,2]
etbl$`True Neg`[3] = c7$table[1,1]
etbl$`False Pos`[3] = c7$table[1,2]
etbl$`False Neg`[3] = c7$table[2,1]
etbl$Sensitivity[3] = c7$byClass["Sensitivity"]
etbl$Specificity[3] = c7$byClass["Specificity"]
etbl %>% kable(digits = 3) %>% kable_minimal()
| Model | Accuracy | True Pos | True Neg | False Pos | False Neg | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|
| fit24 | 0.46 | 11 | 12 | 8 | 19 | 0.387 | 0.579 |
| fit16 | 0.50 | 14 | 11 | 9 | 16 | 0.407 | 0.609 |
| fit7 | 0.42 | 10 | 11 | 9 | 20 | 0.355 | 0.526 |
The following plot of the ROC curves also shows the relatively low accuracy of the models.
require(plotROC)
cmat <- rbd %>% mutate("Model24" = predicted24, "Model16" = predicted16,
"Model7" = predicted7) %>%
select(motor_true, Model24, Model16, Model7) %>%
pivot_longer(Model24:Model7,
names_to = "model",
values_to = "diag"
)
cmat$model <- factor(cmat$model, levels = c("Model7", "Model16", "Model24") )
ggplot(cmat) +
geom_roc(aes(d=motor_true, m = diag, color = model), size = 2, labels = FALSE) +
scale_color_discrete(name = "Model") +
labs(title = "ROC for logistic regression models", # titles and labels
x = "Speech positive / Motor negative (False positive)",
y = "Speech positive / Motor positive (True positive)") +
theme_light()
Using a QDA model the study diagnosed RBD subjects with 70.0% accuracy, 73.9% sensitivity and 66.7% specificity. The three logistic regression models here did not perform as well. There’s not much reason to have confidence in these logistic regression models.
Moya-Galé G, Levy ES (2018). Parkinson’s disease-associated dysarthria: prevalence, impact and management strategies. https://www.dovepress.com/parkinsons-disease-associated-dysarthria-prevalence-impact-and-managem-peer-reviewed-article-JPRLS
Hlavnička, J., Čmejla, R., Tykalová, T. et al. (2017). Automated analysis of connected speech reveals early biomarkers of Parkinson’s disease in patients with rapid eye movement sleep behaviour disorder. https://www.nature.com/articles/s41598-017-00047-5
Early biomarkers of parkinsons disease dataset. https://www.kaggle.com/ruslankl/early-biomarkers-of-parkinsons-disease
Young-Onset Parkinson’s Disease. https://www.hopkinsmedicine.org/health/conditions-and-diseases/parkinsons-disease/youngonset-parkinsons-disease
Michael J. Fox interview on NBC’s Today Sunday Sitdown, Nov. 22, 2020. https://www.today.com/health/michael-j-fox-recalls-wife-tracy-pollan-s-reaction-parkinson-t199866