Parkinson’s Disease Speech Data

Abstract:

A study of speech analysis measurements for untreated Parkinson’s disease patients and subjects with REM sleep behavior disorder reveals that an automated tool for early diagnosis of Parkinson’s Disease can be reliable. The study is presented in the article “Automated analysis of connected speech reveals early biomarkers of Parkinson’s disease in patients with rapid eye movement sleep behavior disorder”2. Both my father and my father-in-law had Parkinson’s disease, so I found this study interesting. In this paper I’ll approximate the statistical analysis from the study and create a prediction model for diagnosing Parkinson’s from the speech data.

Automating Parkinson’s Disease diagnosis using speech pattern analysis

Dysarthria is a speech disorder where the muscles used to produce speech are damaged, paralyzed, or weakened. Close to 90% of people with Parkinson’s Disease (PD) present with hypokinetic dysarthria, evidenced by reduced vocal loudness, monotone, reduced fundamental frequency range, consonant and vowel imprecision, breathiness and irregular pauses.1 Evaluation of dysarthia in disorders such as PD has been limited to perceptual tests or user-controlled laboratory analysis based upon rather small samples of human vocalizations. The development of a fully-automated method to detect early, distinctive patterns of neuro-degeneration using only acoustic data from connected speech such as reading a short passage or monologue would have the potential to revolutionize the diagnostic process for diseases manifesting motor speech disorders.2

Early biomarkers of Parkinson’s Disease dataset

The dataset associated with the article is available as an excel file via the article on Nature.com. It’s also on Kaggle as a csv file.3

The data include 30 patients with early untreated Parkinson’s disease (PD), 50 patients with REM sleep behavior disorder (RBD), which are at high risk for developing Parkinson’s disease, and 50 healthy controls (controls). All patients were scored clinically by a neurologist with experience in movement disorders. All subjects were examined during a single session with a speech specialist and designated as having dysarthia or not.

All the subjects were recorded twice the first recording reading a standardized, phonetically-balanced text of 80 words and the second a monologue for approximately 90 seconds. Twelve speech features were automatically analyzed by Jan Hlavnička et al. (2017)3 for each recording, resulting in 24 measurement variables.

Read the dataset

Read and view the .csv file of speech data.

setwd("C:/Users/Owner/Desktop/My Documents/School/DataFiles/Parkinsons")

peb <- tibble(read_csv("parkinsonsEarlyBiomarkers.csv"))    # read Parkinsons Early Biomarkers dataset

glimpse(peb)                                                # look at the peb tibble
## Rows: 130
## Columns: 65
## $ `Participant  code`                                             <chr> "PD01"~
## $ `Age  (years)`                                                  <dbl> 58, 68~
## $ Gender                                                          <chr> "F", "~
## $ `Positive  history  of  Parkinson  disease  in  family`         <chr> "No", ~
## $ `Age  of  disease  onset  (years)`                              <chr> "56", ~
## $ `Duration  of  disease  from  first  symptoms  (years)`         <chr> "2", "~
## $ `Antidepressant  therapy`                                       <chr> "No", ~
## $ `Antiparkinsonian  medication`                                  <chr> "No", ~
## $ `Antipsychotic  medication`                                     <chr> "No", ~
## $ `Benzodiazepine  medication`                                    <chr> "No", ~
## $ `Levodopa  equivalent  (mg/day)`                                <dbl> 0, 0, ~
## $ `Clonazepam  (mg/day)`                                          <dbl> 0.0, 0~
## $ `Overview  of  motor  examination:  Hoehn  &  Yahr  scale  (-)` <chr> "1.5",~
## $ `Overview  of  motor  examination:  UPDRS  III  total  (-)`     <chr> "8", "~
## $ `18.  Speech`                                                   <chr> "0", "~
## $ `19.  Facial  Expression`                                       <chr> "1", "~
## $ `20.  Tremor  at  Rest  -  head`                                <chr> "0", "~
## $ `20.  Tremor  at  Rest  -  RUE`                                 <chr> "0", "~
## $ `20.  Tremor  at  Rest  -  LUE`                                 <chr> "2", "~
## $ `20.  Tremor  at  Rest  -  RLE`                                 <chr> "0", "~
## $ `20.  Tremor  at  Rest  -  LLE`                                 <chr> "2", "~
## $ `21.  Action  or  Postural  Tremor  -  RUE`                     <chr> "0", "~
## $ `21.  Action  or  Postural  Tremor  -  LUE`                     <chr> "0", "~
## $ `22.  Rigidity  -  neck`                                        <chr> "0", "~
## $ `22.  Rigidity  -  RUE`                                         <chr> "0", "~
## $ `22.  Rigidity  -  LUE`                                         <chr> "1", "~
## $ `22.  Rigidity  -  RLE`                                         <chr> "0", "~
## $ `22.  Rigidity  -  LLE`                                         <chr> "0", "~
## $ `23.Finger  Taps  -  RUE`                                       <chr> "0", "~
## $ `23.Finger  Taps  -  LUE`                                       <chr> "1", "~
## $ `24.  Hand  Movements  -  RUE`                                  <chr> "0", "~
## $ `24.  Hand  Movements  -  LUE`                                  <chr> "0", "~
## $ `25.  Rapid  Alternating  Movements  -  RUE`                    <chr> "0", "~
## $ `25.  Rapid  Alternating  Movements  -  LUE`                    <chr> "1", "~
## $ `26.  Leg  Agility  -  RLE`                                     <chr> "0", "~
## $ `26.  Leg  Agility  -  LLE`                                     <chr> "0", "~
## $ `27.  Arising  from  Chair`                                     <chr> "0", "~
## $ `28.  Posture`                                                  <chr> "0", "~
## $ `29.  Gait`                                                     <chr> "0", "~
## $ `30.  Postural  Stability`                                      <chr> "0", "~
## $ `31.  Body  Bradykinesia  and  Hypokinesia`                     <chr> "0", "~
## $ `Entropy  of  speech  timing  (-)`                              <dbl> 1.564,~
## $ `Rate  of  speech  timing  (-/min)`                             <dbl> 354, 3~
## $ `Acceleration  of  speech  timing  (-/min2)`                    <dbl> 6.05, ~
## $ `Duration  of  pause  intervals  (ms)`                          <dbl> 146, 1~
## $ `Duration  of  voiced  intervals  (ms)`                         <dbl> 264, 2~
## $ `Gaping  in-between  voiced  intervals  (-/min)`                <dbl> 58.65,~
## $ `Duration  of  unvoiced  stops  (ms)`                           <dbl> 31.38,~
## $ `Decay  of  unvoiced  fricatives  (‰/min)`                      <dbl> -2.101~
## $ `Relative  loudness  of  respiration  (dB)`                     <dbl> -22.47~
## $ `Pause  intervals  per  respiration  (-)`                       <dbl> 4.50, ~
## $ `Rate  of  speech  respiration  (-/min)`                        <dbl> 21.14,~
## $ `Latency  of  respiratory  exchange  (ms)`                      <dbl> 167, 1~
## $ `Entropy  of  speech  timing  (-)_1`                            <dbl> 1.564,~
## $ `Rate  of  speech  timing  (-/min)_1`                           <dbl> 333, 2~
## $ `Acceleration  of  speech  timing  (-/min2)_1`                  <dbl> -2.82,~
## $ `Duration  of  pause  intervals  (ms)_1`                        <dbl> 158, 2~
## $ `Duration  of  voiced  intervals  (ms)_1`                       <dbl> 318, 2~
## $ `Gaping  in-between  voiced  intervals  (-/min)_1`              <dbl> 49.01,~
## $ `Duration  of  unvoiced  stops  (ms)_1`                         <dbl> 22.37,~
## $ `Decay  of  unvoiced  fricatives  (‰/min)_1`                    <dbl> 0.588,~
## $ `Relative  loudness  of  respiration  (dB)_1`                   <dbl> -19.77~
## $ `Pause  intervals  per  respiration  (-)_1`                     <dbl> 6.0, 4~
## $ `Rate  of  speech  respiration  (-/min)_1`                      <dbl> 13.81,~
## $ `Latency  of  respiratory  exchange  (ms)_1`                    <dbl> 127, 3~

Data cleaning and wrangling

The data file is tidy dataset, however it has long, descriptive variable names. To effectively use the data it will be easier to create concise variables names by eliminating spaces, extraneous words and creating acronyms.

units <- names(peb[42:53]) %>% str_extract("(\\(.*\\))")                              # pull the exam measurement units off of the names
names(peb) <-  gsub("(of) | (per) | (in-between) | (\\(.*\\))"," ", names(peb))       # clean out extraneous words   
names(peb) <-  str_to_title(names(peb))                                               # capitalize each word
names(peb) <- gsub(" ","",names(peb))                                                 # eliminate spaces

names(peb)                                                                            # look at the names
##  [1] "ParticipantCode"                         
##  [2] "Age"                                     
##  [3] "Gender"                                  
##  [4] "PositiveHistoryParkinsonDiseaseInFamily" 
##  [5] "AgeDiseaseOnset"                         
##  [6] "DurationDiseaseFromFirstSymptoms"        
##  [7] "AntidepressantTherapy"                   
##  [8] "AntiparkinsonianMedication"              
##  [9] "AntipsychoticMedication"                 
## [10] "BenzodiazepineMedication"                
## [11] "LevodopaEquivalent"                      
## [12] "Clonazepam"                              
## [13] "OverviewMotorExamination:Hoehn&YahrScale"
## [14] "OverviewMotorExamination:UpdrsIiiTotal"  
## [15] "18.Speech"                               
## [16] "19.FacialExpression"                     
## [17] "20.TremorAtRest-Head"                    
## [18] "20.TremorAtRest-Rue"                     
## [19] "20.TremorAtRest-Lue"                     
## [20] "20.TremorAtRest-Rle"                     
## [21] "20.TremorAtRest-Lle"                     
## [22] "21.ActionOrPosturalTremor-Rue"           
## [23] "21.ActionOrPosturalTremor-Lue"           
## [24] "22.Rigidity-Neck"                        
## [25] "22.Rigidity-Rue"                         
## [26] "22.Rigidity-Lue"                         
## [27] "22.Rigidity-Rle"                         
## [28] "22.Rigidity-Lle"                         
## [29] "23.FingerTaps-Rue"                       
## [30] "23.FingerTaps-Lue"                       
## [31] "24.HandMovements-Rue"                    
## [32] "24.HandMovements-Lue"                    
## [33] "25.RapidAlternatingMovements-Rue"        
## [34] "25.RapidAlternatingMovements-Lue"        
## [35] "26.LegAgility-Rle"                       
## [36] "26.LegAgility-Lle"                       
## [37] "27.ArisingFromChair"                     
## [38] "28.Posture"                              
## [39] "29.Gait"                                 
## [40] "30.PosturalStability"                    
## [41] "31.BodyBradykinesiaAndHypokinesia"       
## [42] "EntropySpeechTiming"                     
## [43] "RateSpeechTiming"                        
## [44] "AccelerationSpeechTiming"                
## [45] "DurationPauseIntervals"                  
## [46] "DurationVoicedIntervals"                 
## [47] "GapingVoicedIntervals"                   
## [48] "DurationUnvoicedStops"                   
## [49] "DecayUnvoicedFricatives"                 
## [50] "RelativeLoudnessRespiration"             
## [51] "PauseIntervalsRespiration"               
## [52] "RateSpeechRespiration"                   
## [53] "LatencyRespiratoryExchange"              
## [54] "EntropySpeechTiming_1"                   
## [55] "RateSpeechTiming_1"                      
## [56] "AccelerationSpeechTiming_1"              
## [57] "DurationPauseIntervals_1"                
## [58] "DurationVoicedIntervals_1"               
## [59] "GapingVoicedIntervals_1"                 
## [60] "DurationUnvoicedStops_1"                 
## [61] "DecayUnvoicedFricatives_1"               
## [62] "RelativeLoudnessRespiration_1"           
## [63] "PauseIntervalsRespiration_1"             
## [64] "RateSpeechRespiration_1"                 
## [65] "LatencyRespiratoryExchange_1"

Create a set of acronym labels for the exam measurements.

Label Exam measurement
RST rate of speech timing
AST acceleration of speech timing
DPI duration of pause intervals
EST entropy of speech timing
DUS duration of unvoiced stops
DUF decay of unvoiced fricatives
DVI duration of voiced intervals
GVI gaping in-between voiced intervals
RSR rate of speech respiration
PIR pause intervals per respiration
RLR relative loudness of respiration
LRE latency of respiratory exchange
tLabels <- gsub("[a-z]", "", names(peb[42:53]))             # grab the capital letters to create acronyms

for (i in 1:12) {
 tLabels[i] <- str_glue(tLabels[i], " ", units[i])          # concatenate the acronyms with units   
}
 
tLabels                                                     # look at the labels
##  [1] "EST (-)"      "RST (-/min)"  "AST (-/min2)" "DPI (ms)"     "DVI (ms)"    
##  [6] "GVI (-/min)"  "DUS (ms)"     "DUF (‰/min)"  "RLR (dB)"     "PIR (-)"     
## [11] "RSR (-/min)"  "LRE (ms)"

To obtain the desired set of data, leave out the medication variables and the 26 motor skill scores which are numbered 19 through 31 in the variable labels. Create a variable for the test groups and convert character data to numeric.

pdata <- select(peb, id = ParticipantCode,    # create a tibble with the needed variables
         age = Age,
         motor_score = `OverviewMotorExamination:UpdrsIiiTotal`, # motor skills score
         dysarthia = `18.Speech`,                                # dysarthia score  
         est_r = EntropySpeechTiming,                            # reading speech measurements
         rst_r = RateSpeechTiming,
         ast_r = AccelerationSpeechTiming,
         dpi_r = DurationPauseIntervals,
         dvi_r = DurationVoicedIntervals,
         gvi_r = GapingVoicedIntervals,
         dus_r = DurationUnvoicedStops,
         duf_r = DecayUnvoicedFricatives,
         rlr_r = RelativeLoudnessRespiration,
         pir_r = PauseIntervalsRespiration,
         rsr_r = RateSpeechRespiration,
         lre_r = LatencyRespiratoryExchange,
         est_m = EntropySpeechTiming_1,                          # monologue speech measurements
         rst_m = RateSpeechTiming_1,
         ast_m = AccelerationSpeechTiming_1,
         dpi_m = DurationPauseIntervals_1,
         dvi_m = DurationVoicedIntervals_1,
         gvi_m = GapingVoicedIntervals_1,
         dus_m = DurationUnvoicedStops_1,
         duf_m = DecayUnvoicedFricatives_1,
         rlr_m = RelativeLoudnessRespiration_1,
         pir_m = PauseIntervalsRespiration_1,
         rsr_m = RateSpeechRespiration_1,
         lre_m = LatencyRespiratoryExchange_1
         )

pdata$group <- str_extract(peb$ParticipantCode, "(PD|RBD|HC)")          # pull out the group categories from the ids

pdata$group <- pdata$group %>% str_replace("HC", "controls") %>%        # factor group to order and label like the article diagrams
               factor(levels = c("PD", "RBD", "controls"))

# assuming control group does not have dysarthia, change hyphen to 0 to prevent NA

pdata$dysarthia <- pdata$dysarthia %>% str_replace('-','0') %>% as.numeric() 
pdata$motor_score <- as.numeric(pdata$motor_score)

head(pdata) %>% kable() %>% kable_minimal()
id age motor_score dysarthia est_r rst_r ast_r dpi_r dvi_r gvi_r dus_r duf_r rlr_r pir_r rsr_r lre_r est_m rst_m ast_m dpi_m dvi_m gvi_m dus_m duf_m rlr_m pir_m rsr_m lre_m group
PD01 58 8 0 1.564 354 6.05 146 264 58.65 31.38 -2.101 -22.47 4.50 21.14 167 1.564 333 -2.82 158 318 49.01 22.37 0.588 -19.77 6 13.81 127 PD
PD02 68 22 1 1.564 340 27.52 173 253 48.26 22.38 -1.745 -24.59 7.00 15.28 163 1.569 285 8.20 295 264 40.56 26.88 -0.825 -23.26 4 21.77 313 PD
PD03 68 19 0 1.550 211 11.97 377 322 47.54 38.12 2.657 -16.89 3.00 20.76 372 1.550 247 4.71 280 317 48.97 22.37 -0.955 -13.29 4 22.52 201 PD
PD04 75 24 0 1.519 140 -2.49 360 663 13.72 44.88 -0.934 -25.54 1.00 18.71 119 1.539 112 -9.09 397 800 18.69 49.37 0.791 -25.08 2 14.37 151 PD
PD05 61 54 1 1.543 269 6.72 211 328 42.90 47.12 -0.973 -22.61 5.00 16.26 78 1.560 230 11.77 206 480 33.54 26.87 0.075 -22.32 5 14.61 151 PD
PD06 58 29 1 1.553 317 24.19 186 286 43.83 33.63 0.921 -25.00 2.75 27.07 124 1.518 181 13.38 611 398 18.18 49.37 1.488 -25.08 2 18.21 593 PD

Data Exploration

Dysarthia tagging by perceptual tests

The percentage of each group tagged as having dysarthia by a speech specialist using perceptual tests is shown in the following table.

dpct <- group_by(pdata, group) %>%                                                                  # create a table of percentages by group
  summarize(num = n(), dysarthia = sum(dysarthia), dysarthiaPct = (sum(dysarthia) / n()) * 100)

kable(dpct, caption = "Percentage of participants tagged with dysarthia", digits = 2,               # print the table
      col.names = c("Group", "#Participants", "Dysarthia", "%")) %>% kable_material()
Percentage of participants tagged with dysarthia
Group #Participants Dysarthia %
PD 30 13 43.33
RBD 50 3 6.00
controls 50 0 0.00

Six percent of the RBD group are tagged with dysarthia by the perceptual evaluation, while 43% of the group with untreated Parkinson’s disease are tagged.

Age of participants in each test group

Age itself can have an impact on speech and vocal quality. The following graph shows the distribution of participant ages for each test group.

# create a quick plot of the participant ages in each group

qplot(age, data = pdata, fill=group, geom = "dotplot",  facets = "group") +
  scale_fill_manual(name = "Group", values=c('red', 'powderblue', 'darkgreen')) +  # colors for each group
               theme_bw()  

It isn’t common to see Parkinson’s disease in people younger than 50, but for a small subset the disease strikes early. People are diagnosed with Parkinson’s at an average age of 60. My father was diagnosed at 52, and my father-in-law in his seventies. Diagnosis in a person younger than 50 is considered young-onset Parkinson’s, or YOPD. Rarely, PD may be diagnosed in people younger than 40 — about 2 percent of the 1 million people with Parkinson’s were diagnosed earlier than age 40.4 One of the most recognizable is Michael J. Fox, who was diagnosed at age 29.5

The samples show three YOPD subjects and a reasonable balance of subjects across age ranges within the three groups.

Statistical Analysis

Accoustic speech analysis measurement ranges

Pivot the data longer to combine the measurements for the two exam recordings into a single column for measurement type, a column for the recording type (reading or monologue) and a column for the measurement value.

pd1 <- pdata %>% pivot_longer(est_r:lre_m,
                     names_to = c("meastyp", "examtyp"),
                     names_pattern = "(.*)_(.)",
                     values_to = "amt"
                     )

getMlabel <- sort(tLabels)                           # create a lookup table of the labels

nmes <- unique(pd1$meastyp) %>% sort                 # create the lookup names
names(getMlabel) <- sort(nmes)

pd1$mlabel <- getMlabel[pd1$meastyp]                 # add labels to pd1   

pd1$meastyp <-factor(pd1$meastyp, levels=c("est",    # order the types as they are in the dataset
                                           "rst",
                                           "ast",
                                           "dpi",
                                           "dvi",
                                           "gvi",
                                           "dus",
                                           "duf",
                                           "rlr",
                                           "pir",
                                           "rsr",
                                           "lre")
                     )

pd1$examtyp <- if_else(pd1$examtyp == "m", "Monologue", 
                        if(pd1$examtyp == "r") "Reading")

Create a faceted boxplot to look at the range for each of the variables within each group and exam type.

 ggplot(pd1, aes(meastyp, amt, fill = meastyp)) +                   # create a faceted boxplot
             geom_boxplot(outlier.color = "red") + 
             coord_cartesian(ylim = c(-35, 615)) +                  # expand the y-axis scale
             facet_grid(cols = vars(group), 
                        rows = vars(examtyp), scales = "free") +      # facet by the group
             theme_bw() + scale_x_discrete(breaks=NULL) +           # don't show labels on the x-axis
             scale_fill_discrete(name = "Speech measurement", 
                                 labels = tLabels) +
             labs(x = NULL, y = NULL)

The units as shown in the legend for each measurement type are quite different, dB, ms, min, etc. Outliers are evident in each group for most of the variables. In the study the speech data is split into four speech categories, Respiration, Timing, Phonation, and Articulation. Dividing the data by these categories and creating boxplots for each gives a better view of the ranges for each measurement type.

# Add speech categories
pd1 <- pd1 %>% mutate(spchcat = if_else(meastyp %in% c("rlr", "lre", "pir", "rsr"), "Respiration",
                                  if_else(meastyp %in% c("rst", "ast", "dpi", "est"), "Timing",
                                  if_else(meastyp %in% c("gvi", "dvi"), "Phonation", "Articulation")
                                ))
                      )
gcat <- "Respiration"                                                # initialize the category variable
pcolor <- "green"                                                    # initialize the color variable
pd1$group <- factor(pd1$group, levels = c("PD", "RBD", "controls"))  # factor group to display in specified order

# create a function with the plot template

pt <- function (gcat, pcolor) {
  dcat <- pd1 %>% filter(spchcat == gcat)                            # create a faceted box plot
  ggplot(dcat) +      
             geom_boxplot(aes(group, amt, fill = examtyp), outlier.color = "red") +
             facet_wrap(vars(mlabel), scales = "free") +      
             theme_bw() +  
             scale_fill_grey(name = "Legend:", labels = c("Reading passage", "Monologue")) +
             labs(title = gcat, x = NULL, y = NULL) +
             theme(plot.title.position = "plot", 
                   plot.title = element_text(hjust = .5, vjust = 0, size = rel(2)),   # add & position title
                   legend.position = "bottom", legend.direction = "horizontal",       # legend on the bottom
                   plot.background = element_rect(color = pcolor, size = 6),          # rectangle around plot
                   plot.margin = margin(.5, .5, .5, .5, "cm"),
                   strip.text.x = element_text(face = "bold", size = rel(1.5) ))      # bold the facet labels
}

# execute the template for each category

par(mfrow = c(2,2))

pt("Respiration", "dodgerblue")

pt("Phonation", "firebrick2")

pt("Timing", "seagreen3")

pt("Articulation", "yellow3")

Distributions and Correlations

The following diagram shows distributions and correlations of exam measurements for the subjects in the Parkinson’s group. Many of the variables are skewed. Even though the entire sample size of 130 is reasonably large, and the Parkinson’s disease group has 30 subjects, a linear regression model is probably not the best choice for finding a predictor.

The study used quadratic discriminant function analysis where the multivariate normal distribution characterized by the mean and covariance matrix was estimated using an expectation maximization algorithm. Logistic regression will be here used with the understanding that the models are probably not representative of the data.

require(psych)

pdc <- pdata %>% filter(group == "PD")                        # get the Parkinson's group data
pairs.panels(pdc[5:28],                                       # plot distributions and correlations
             gap = 0,
             pch = 20,
             lm = TRUE)

Working through all of the permutations of 24 variables would be time consuming, so a few models will be developed to continue. The study found seven variables that showed the most distinctive disturbed speech patterns between the PD group and the control group. Three models will be created to use in logistic regression.

  • All 24 variables together
  • Seven variables identified in the study
  • Variables with significant correlation

The following diagram highlights significant correlations of exam measurements for the Parkinson’s subjects.

pacman::p_load(GGally)

pcor <- pdata %>% filter(group == "PD") %>% select(est_r:lre_m)    # filter for the PD group                                                

# create a correlation plot

pdcorr <- ggcorr(pcor, nbreaks = 5, palette = "RdGy",                                   # set the number of levels and the colors       
                 hjust = .65, size = 4,                                                 # nudge and format the axis labels
                 label = TRUE, label_size = 4, label_color = "white",                   # format the coefficient 
                 name = "Significance") +                                               # legend title 
                 labs(title = "PD Correlation",                                         # plot title
                 caption = "source: https://www.nature.com/articles/s41598-017-00047-5") +
          theme(plot.title.position = "plot", 
                plot.title = element_text(hjust = .5, vjust = 0, size = rel(2)),        # position and format the title and caption
                plot.caption.position = "plot")

pdcorr

There are sixteen unique variables with significant coefficients. This table shows the coefficients with absolute values greater than 0.6.

pdcorr$data %>% filter(abs(coefficient) > 0.6 ) %>%  
 select(1:3) %>% 
 arrange(coefficient) %>% 
  kable(caption = "Significant correlation of PD group", digits = 2,
      col.names = c("X", "Y", "Coefficient"), row.names = TRUE) %>% kable_minimal()
Significant correlation of PD group
X Y Coefficient
1 dpi_r rst_r -0.86
2 dvi_r rst_r -0.85
3 dvi_m rst_m -0.81
4 dpi_m rst_m -0.79
5 gvi_m dpi_m -0.72
6 dvi_m rst_r -0.71
7 rsr_r pir_r -0.70
8 dus_m rst_m -0.70
9 gvi_r dvi_r -0.67
10 rst_m dvi_r -0.64
11 pir_m dpi_m -0.63
12 dvi_m gvi_r -0.62
13 dus_m gvi_m -0.62
14 dus_m dvi_m 0.60
15 pir_m rst_m 0.61
16 dvi_m dus_r 0.61
17 lre_m dpi_m 0.63
18 gvi_r rst_r 0.64
19 rlr_m gvi_m 0.64
20 pir_m gvi_m 0.65
21 gvi_m rst_m 0.65
22 dvi_r dpi_r 0.67
23 dus_m dpi_m 0.70
24 gvi_m gvi_r 0.73
25 gvi_m est_m 0.77
26 rlr_m rlr_r 0.78
27 dvi_m dvi_r 0.90

Logistic regression models

A diagnosis of dysarthia is a binomial result which lends itself to logistic regression. General linear models with the logit function are used to create the three models. Then the models are plotted to evaluate their fit to the data.

  • fit24 - using all 24 acoustic variables
  • fit16 - using the 16 variables with significant coefficients from the correlation plot
  • fit7 - the seven variables the study found to distinctive speech pattern difference
attach(pdc)                                                                 # use the Parkinsons data

fit24 <- glm(dysarthia ~ est_r + rst_r + ast_r + dpi_r + dvi_r + gvi_r +    # model all 24 variables             
                         dus_r + duf_r + rlr_r + pir_r + rsr_r + lre_r +
                         est_m + rst_m + ast_m + dpi_m + dvi_m + gvi_m +
                         dus_m + duf_m + rlr_m + pir_m + rsr_m + lre_m, 
              family=binomial(link="logit")
             )

# model 16 variables with significant correlation
fit16 <- glm(dysarthia ~ dpi_m + dpi_r + dus_m + dvi_m + dvi_r + est_m +    
                         gvi_m + lre_m + pir_m + pir_r + rlr_m + rlr_r + 
                         rsr_m + rsr_r + rst_m + rst_r, 
              family=binomial(link="logit")
             )

# model 7 significant variables found in the study 
fit7 <- glm(dysarthia ~ dpi_m + dpi_r + dus_m + dus_r + dvi_m + pir_r +  rst_r,   
             family=binomial(link="logit")
            ) 
detach(pdc)

par(mfrow = c(2,2))
plot(fit24)
mtext("fit24", side = 1)

par(mfrow = c(2,2), bg = "aliceblue")
plot(fit16)
mtext("fit16", side = 1)

par(mfrow = c(2,2), bg = "white")
plot(fit7)
mtext("fit7", side = 1)

Consistent with the boxplots, each of the models show high leverage outliers. Removing the outliers and rerunning the models doesn’t improve them, each model shows a new set of high leverage outliers, as shown below. Which is further evidence that logistic regression is not the best fit for this data.

pdo <- pdc %>% filter(!id %in% c("PD04", "PD08", "PD27"))                          # remove high leverage outliers for fit24
attach(pdo)
fito <- glm(dysarthia ~ est_r + rst_r + ast_r + dpi_r + dvi_r + gvi_r +    # model all 24 variables             
                         dus_r + duf_r + rlr_r + pir_r + rsr_r + lre_r +
                         est_m + rst_m + ast_m + dpi_m + dvi_m + gvi_m +
                         dus_m + duf_m + rlr_m + pir_m + rsr_m + lre_m, 
              family=binomial(link="logit")
             )
par(mfrow = c(2,2))
plot(fito)
mtext("fit24", side = 1)

pdo <- pdc %>% filter(!id %in% c("PD03", "PD08", "PD10"))                  # remove high leverage outliers for fit16
fito <- glm(dysarthia ~ dpi_m + dpi_r + dus_m + dvi_m + dvi_r + est_m +    # model 16 variables with significant correlation
                         gvi_m + lre_m + pir_m + pir_r + rlr_m + rlr_r + 
                         rsr_m + rsr_r + rst_m + rst_r, 
              family=binomial(link="logit")
             )
par(mfrow = c(2,2), bg = "aliceblue")
plot(fito)
mtext("fit16", side = 1)

pdo <- pdc %>% filter(!id %in% c("PD06", "PD09", "PD27"))                  # remove high leverage outliers for fit7
fito <- glm(dysarthia ~ dpi_m + dpi_r + dvi_m + dvi_r + est_m + gvi_m +     # model 7 significant variables found in the study 
                        rlr_m + rlr_r + rst_m + rst_r,
             family=binomial(link="logit")
            ) 

par(mfrow = c(2,2), bg = "white")
plot(fito)
mtext("fit7", side = 1)

detach(pdo)

Predictions and model evaluations

Both PD and RBD subjects were scored according to the Unified Parkinson’s Disease Rating Scale motor subscore (UPDRS III15). The scores range from 0 to 108, with 0 for no motor manifestations and 108 representing severe motor distortion. A score > 3 was is considered a very good indicator of initial parkinsonism.

rbd <- pdata %>% filter(group == "RBD")                     # get the RBD group data  

The RBD subject have motor scores ranging from 0 to 21. RBD subjects with a motor score ≤ 3 (asymptomatic RBD subgroup) are labeled motor negatives and RBD subjects with a motor score >3 (symptomatic RBD subgroup) were labeled motor positives. This categorization is used to create the confusion matrices for evaluating the model performance by creating a logical variable from the motor evaluation score.

rbd <- rbd %>% 
  mutate(motor_true = if_else(motor_score > 3, 1, 0))       # add the motor true categorization

rbd %>% select(id, dysarthia, motor_score, motor_true) %>%  # look at the initial dysarthia diagnosis, motor score and motor true
  head() %>% 
  kable() %>% kable_minimal()
id dysarthia motor_score motor_true
RBD01 1 3 0
RBD02 0 2 0
RBD03 0 7 1
RBD04 0 1 0
RBD05 0 10 1
RBD06 0 12 1

Creating the predictions for each of the models show that none of them have very good performance compared to the motor score. They all have more false negatives than true negatives, and each has a good number of false positives. The model with the best performance is fit16, but it’s performance at 50% accuracy with 41% sensitivity and 61% specificity is still not great.

require(caret)   
## Warning: package 'caret' was built under R version 4.0.5
etbl <- tibble("Model" = c("fit24", "fit16", "fit7"),           # tibble to compare fit evals
               "Accuracy" = c(0,0,0),
               "True Pos" = c(0,0,0),
               "True Neg" = c(0,0,0),
               "False Pos" = c(0,0,0),
               "False Neg" = c(0,0,0),
               "Sensitivity" = c(0,0,0),
               "Specificity" = c(0,0,0)
               )

motor_positive <- factor(rbd$motor_true, c(0,1))                # factor required by confusionMatrix

predicted24 <- predict(fit24, rbd, type = "response") %>% round()        # prediction using fit24
predictedc <- predicted24 %>% factor(levels = c(0,1)) 
c24 <- confusionMatrix(motor_positive, predictedc)
etbl$Accuracy[1] = c24$overall["Accuracy"]
etbl$`True Pos`[1] = c24$table[2,2]
etbl$`True Neg`[1] = c24$table[1,1]
etbl$`False Pos`[1] = c24$table[1,2]
etbl$`False Neg`[1] = c24$table[2,1]
etbl$Sensitivity[1] = c24$byClass["Sensitivity"]
etbl$Specificity[1] = c24$byClass["Specificity"]

predicted16 <- predict(fit16, rbd, type = "response") %>% round()        # prediction using fit16
predictedc <- predicted16 %>% factor(levels = c(0,1)) 
c16  <- confusionMatrix(motor_positive, predictedc)
etbl$Accuracy[2] = c16$overall["Accuracy"]
etbl$`True Pos`[2] = c16$table[2,2]
etbl$`True Neg`[2] = c16$table[1,1]
etbl$`False Pos`[2] = c16$table[1,2]
etbl$`False Neg`[2] = c16$table[2,1]
etbl$Sensitivity[2] = c16$byClass["Sensitivity"]
etbl$Specificity[2] = c16$byClass["Specificity"]

predicted7 <- predict(fit7, rbd, type="response") %>% round()            # prediction using fit7   
predictedc <- predicted7 %>% factor(levels = c(0,1)) 
c7  <- confusionMatrix(motor_positive, predictedc)
etbl$Accuracy[3] = c7$overall["Accuracy"]
etbl$`True Pos`[3] = c7$table[2,2]
etbl$`True Neg`[3] = c7$table[1,1]
etbl$`False Pos`[3] = c7$table[1,2]
etbl$`False Neg`[3] = c7$table[2,1]
etbl$Sensitivity[3] = c7$byClass["Sensitivity"]
etbl$Specificity[3] = c7$byClass["Specificity"]

etbl %>% kable(digits = 3) %>% kable_minimal()
Model Accuracy True Pos True Neg False Pos False Neg Sensitivity Specificity
fit24 0.46 11 12 8 19 0.387 0.579
fit16 0.50 14 11 9 16 0.407 0.609
fit7 0.42 10 11 9 20 0.355 0.526

The following plot of the ROC curves also shows the relatively low accuracy of the models.

require(plotROC)

cmat <- rbd %>% mutate("Model24" = predicted24, "Model16" = predicted16, 
                        "Model7" = predicted7) %>%
             select(motor_true, Model24, Model16, Model7) %>% 
             pivot_longer(Model24:Model7,
                          names_to = "model",
                          values_to = "diag"
                          ) 
cmat$model <- factor(cmat$model, levels = c("Model7", "Model16", "Model24") )

ggplot(cmat) + 
  geom_roc(aes(d=motor_true, m = diag, color = model), size = 2, labels = FALSE) +
   scale_color_discrete(name = "Model") +
  labs(title = "ROC for logistic regression models",                                         # titles and labels
       x = "Speech positive / Motor negative (False positive)",
       y = "Speech positive / Motor positive (True positive)") +
             theme_light() 

Conclusion

Using a QDA model the study diagnosed RBD subjects with 70.0% accuracy, 73.9% sensitivity and 66.7% specificity. The three logistic regression models here did not perform as well. There’s not much reason to have confidence in these logistic regression models.

References

  1. Moya-Galé G, Levy ES (2018). Parkinson’s disease-associated dysarthria: prevalence, impact and management strategies. https://www.dovepress.com/parkinsons-disease-associated-dysarthria-prevalence-impact-and-managem-peer-reviewed-article-JPRLS

  2. Hlavnička, J., Čmejla, R., Tykalová, T. et al. (2017). Automated analysis of connected speech reveals early biomarkers of Parkinson’s disease in patients with rapid eye movement sleep behaviour disorder. https://www.nature.com/articles/s41598-017-00047-5

  3. Early biomarkers of parkinsons disease dataset. https://www.kaggle.com/ruslankl/early-biomarkers-of-parkinsons-disease

  4. Young-Onset Parkinson’s Disease. https://www.hopkinsmedicine.org/health/conditions-and-diseases/parkinsons-disease/youngonset-parkinsons-disease

  5. Michael J. Fox interview on NBC’s Today Sunday Sitdown, Nov. 22, 2020. https://www.today.com/health/michael-j-fox-recalls-wife-tracy-pollan-s-reaction-parkinson-t199866