Parkinson’s Disease Speech Data

Abstract:

A study of speech analysis measurements for untreated Parkinson’s disease patients and subjects with REM sleep behavior disorder reveals that an automated tool for early diagnosis of Parkinson’s Disease can be reliable. The study is presented in the article “Automated analysis of connected speech reveals early biomarkers of Parkinson’s disease in patients with rapid eye movement sleep behavior disorder”². Both my father and my father-in-law had Parkinson’s disease, so I found this study interesting. In this paper I’ll approximate the statistical analysis from the study and create a prediction model for diagnosing Parkinson’s from the speech data.

Automating Parkinson’s Disease diagnosis using speech pattern analysis

Dysarthria is a speech disorder where the muscles used to produce speech are damaged, paralyzed, or weakened. Close to 90% of people with Parkinson’s Disease (PD) present with hypokinetic dysarthria, evidenced by reduced vocal loudness, monotone, reduced fundamental frequency range, consonant and vowel imprecision, breathiness and irregular pauses.¹ Evaluation of dysarthia in disorders such as PD has been limited to perceptual tests or user-controlled laboratory analysis based upon rather small samples of human vocalizations. The development of a fully-automated method to detect early, distinctive patterns of neuro-degeneration using only acoustic data from connected speech such as reading a short passage or monologue would have the potential to revolutionize the diagnostic process for diseases manifesting motor speech disorders.²

Early biomarkers of Parkinson’s Disease dataset

The dataset associated with the article is available as an excel file via the article on Nature.com. It’s also on Kaggle as a csv file.³

The data include 30 patients with early untreated Parkinson’s disease (PD), 50 patients with REM sleep behavior disorder (RBD), which are at high risk for developing Parkinson’s disease, and 50 healthy controls (controls). All patients were scored clinically by a neurologist with experience in movement disorders. All subjects were examined during a single session with a speech specialist and designated as having dysarthia or not.

All the subjects were recorded twice the first recording reading a standardized, phonetically-balanced text of 80 words and the second a monologue for approximately 90 seconds. Twelve speech features were automatically analyzed by Jan Hlavnička et al. (2017)³ for each recording, resulting in 24 measurement variables.

Read the dataset

Read and view the .csv file of speech data.

setwd("C:/Users/Owner/Desktop/My Documents/School/DataFiles/Parkinsons")

peb <- tibble(read_csv("parkinsonsEarlyBiomarkers.csv"))    # read Parkinsons Early Biomarkers dataset

glimpse(peb)                                                # look at the peb tibble

## Rows: 130
## Columns: 65
## $ `Participant  code`                                             <chr> "PD01"~
## $ `Age  (years)`                                                  <dbl> 58, 68~
## $ Gender                                                          <chr> "F", "~
## $ `Positive  history  of  Parkinson  disease  in  family`         <chr> "No", ~
## $ `Age  of  disease  onset  (years)`                              <chr> "56", ~
## $ `Duration  of  disease  from  first  symptoms  (years)`         <chr> "2", "~
## $ `Antidepressant  therapy`                                       <chr> "No", ~
## $ `Antiparkinsonian  medication`                                  <chr> "No", ~
## $ `Antipsychotic  medication`                                     <chr> "No", ~
## $ `Benzodiazepine  medication`                                    <chr> "No", ~
## $ `Levodopa  equivalent  (mg/day)`                                <dbl> 0, 0, ~
## $ `Clonazepam  (mg/day)`                                          <dbl> 0.0, 0~
## $ `Overview  of  motor  examination:  Hoehn  &  Yahr  scale  (-)` <chr> "1.5",~
## $ `Overview  of  motor  examination:  UPDRS  III  total  (-)`     <chr> "8", "~
## $ `18.  Speech`                                                   <chr> "0", "~
## $ `19.  Facial  Expression`                                       <chr> "1", "~
## $ `20.  Tremor  at  Rest  -  head`                                <chr> "0", "~
## $ `20.  Tremor  at  Rest  -  RUE`                                 <chr> "0", "~
## $ `20.  Tremor  at  Rest  -  LUE`                                 <chr> "2", "~
## $ `20.  Tremor  at  Rest  -  RLE`                                 <chr> "0", "~
## $ `20.  Tremor  at  Rest  -  LLE`                                 <chr> "2", "~
## $ `21.  Action  or  Postural  Tremor  -  RUE`                     <chr> "0", "~
## $ `21.  Action  or  Postural  Tremor  -  LUE`                     <chr> "0", "~
## $ `22.  Rigidity  -  neck`                                        <chr> "0", "~
## $ `22.  Rigidity  -  RUE`                                         <chr> "0", "~
## $ `22.  Rigidity  -  LUE`                                         <chr> "1", "~
## $ `22.  Rigidity  -  RLE`                                         <chr> "0", "~
## $ `22.  Rigidity  -  LLE`                                         <chr> "0", "~
## $ `23.Finger  Taps  -  RUE`                                       <chr> "0", "~
## $ `23.Finger  Taps  -  LUE`                                       <chr> "1", "~
## $ `24.  Hand  Movements  -  RUE`                                  <chr> "0", "~
## $ `24.  Hand  Movements  -  LUE`                                  <chr> "0", "~
## $ `25.  Rapid  Alternating  Movements  -  RUE`                    <chr> "0", "~
## $ `25.  Rapid  Alternating  Movements  -  LUE`                    <chr> "1", "~
## $ `26.  Leg  Agility  -  RLE`                                     <chr> "0", "~
## $ `26.  Leg  Agility  -  LLE`                                     <chr> "0", "~
## $ `27.  Arising  from  Chair`                                     <chr> "0", "~
## $ `28.  Posture`                                                  <chr> "0", "~
## $ `29.  Gait`                                                     <chr> "0", "~
## $ `30.  Postural  Stability`                                      <chr> "0", "~
## $ `31.  Body  Bradykinesia  and  Hypokinesia`                     <chr> "0", "~
## $ `Entropy  of  speech  timing  (-)`                              <dbl> 1.564,~
## $ `Rate  of  speech  timing  (-/min)`                             <dbl> 354, 3~
## $ `Acceleration  of  speech  timing  (-/min2)`                    <dbl> 6.05, ~
## $ `Duration  of  pause  intervals  (ms)`                          <dbl> 146, 1~
## $ `Duration  of  voiced  intervals  (ms)`                         <dbl> 264, 2~
## $ `Gaping  in-between  voiced  intervals  (-/min)`                <dbl> 58.65,~
## $ `Duration  of  unvoiced  stops  (ms)`                           <dbl> 31.38,~
## $ `Decay  of  unvoiced  fricatives  (‰/min)`                      <dbl> -2.101~
## $ `Relative  loudness  of  respiration  (dB)`                     <dbl> -22.47~
## $ `Pause  intervals  per  respiration  (-)`                       <dbl> 4.50, ~
## $ `Rate  of  speech  respiration  (-/min)`                        <dbl> 21.14,~
## $ `Latency  of  respiratory  exchange  (ms)`                      <dbl> 167, 1~
## $ `Entropy  of  speech  timing  (-)_1`                            <dbl> 1.564,~
## $ `Rate  of  speech  timing  (-/min)_1`                           <dbl> 333, 2~
## $ `Acceleration  of  speech  timing  (-/min2)_1`                  <dbl> -2.82,~
## $ `Duration  of  pause  intervals  (ms)_1`                        <dbl> 158, 2~
## $ `Duration  of  voiced  intervals  (ms)_1`                       <dbl> 318, 2~
## $ `Gaping  in-between  voiced  intervals  (-/min)_1`              <dbl> 49.01,~
## $ `Duration  of  unvoiced  stops  (ms)_1`                         <dbl> 22.37,~
## $ `Decay  of  unvoiced  fricatives  (‰/min)_1`                    <dbl> 0.588,~
## $ `Relative  loudness  of  respiration  (dB)_1`                   <dbl> -19.77~
## $ `Pause  intervals  per  respiration  (-)_1`                     <dbl> 6.0, 4~
## $ `Rate  of  speech  respiration  (-/min)_1`                      <dbl> 13.81,~
## $ `Latency  of  respiratory  exchange  (ms)_1`                    <dbl> 127, 3~

Data cleaning and wrangling

The data file is tidy dataset, however it has long, descriptive variable names. To effectively use the data it will be easier to create concise variables names by eliminating spaces, extraneous words and creating acronyms.

units <- names(peb[42:53]) %>% str_extract("(\\(.*\\))")                              # pull the exam measurement units off of the names
names(peb) <-  gsub("(of) | (per) | (in-between) | (\\(.*\\))"," ", names(peb))       # clean out extraneous words   
names(peb) <-  str_to_title(names(peb))                                               # capitalize each word
names(peb) <- gsub(" ","",names(peb))                                                 # eliminate spaces

names(peb)                                                                            # look at the names

##  [1] "ParticipantCode"                         
##  [2] "Age"                                     
##  [3] "Gender"                                  
##  [4] "PositiveHistoryParkinsonDiseaseInFamily" 
##  [5] "AgeDiseaseOnset"                         
##  [6] "DurationDiseaseFromFirstSymptoms"        
##  [7] "AntidepressantTherapy"                   
##  [8] "AntiparkinsonianMedication"              
##  [9] "AntipsychoticMedication"                 
## [10] "BenzodiazepineMedication"                
## [11] "LevodopaEquivalent"                      
## [12] "Clonazepam"                              
## [13] "OverviewMotorExamination:Hoehn&YahrScale"
## [14] "OverviewMotorExamination:UpdrsIiiTotal"  
## [15] "18.Speech"                               
## [16] "19.FacialExpression"                     
## [17] "20.TremorAtRest-Head"                    
## [18] "20.TremorAtRest-Rue"                     
## [19] "20.TremorAtRest-Lue"                     
## [20] "20.TremorAtRest-Rle"                     
## [21] "20.TremorAtRest-Lle"                     
## [22] "21.ActionOrPosturalTremor-Rue"           
## [23] "21.ActionOrPosturalTremor-Lue"           
## [24] "22.Rigidity-Neck"                        
## [25] "22.Rigidity-Rue"                         
## [26] "22.Rigidity-Lue"                         
## [27] "22.Rigidity-Rle"                         
## [28] "22.Rigidity-Lle"                         
## [29] "23.FingerTaps-Rue"                       
## [30] "23.FingerTaps-Lue"                       
## [31] "24.HandMovements-Rue"                    
## [32] "24.HandMovements-Lue"                    
## [33] "25.RapidAlternatingMovements-Rue"        
## [34] "25.RapidAlternatingMovements-Lue"        
## [35] "26.LegAgility-Rle"                       
## [36] "26.LegAgility-Lle"                       
## [37] "27.ArisingFromChair"                     
## [38] "28.Posture"                              
## [39] "29.Gait"                                 
## [40] "30.PosturalStability"                    
## [41] "31.BodyBradykinesiaAndHypokinesia"       
## [42] "EntropySpeechTiming"                     
## [43] "RateSpeechTiming"                        
## [44] "AccelerationSpeechTiming"                
## [45] "DurationPauseIntervals"                  
## [46] "DurationVoicedIntervals"                 
## [47] "GapingVoicedIntervals"                   
## [48] "DurationUnvoicedStops"                   
## [49] "DecayUnvoicedFricatives"                 
## [50] "RelativeLoudnessRespiration"             
## [51] "PauseIntervalsRespiration"               
## [52] "RateSpeechRespiration"                   
## [53] "LatencyRespiratoryExchange"              
## [54] "EntropySpeechTiming_1"                   
## [55] "RateSpeechTiming_1"                      
## [56] "AccelerationSpeechTiming_1"              
## [57] "DurationPauseIntervals_1"                
## [58] "DurationVoicedIntervals_1"               
## [59] "GapingVoicedIntervals_1"                 
## [60] "DurationUnvoicedStops_1"                 
## [61] "DecayUnvoicedFricatives_1"               
## [62] "RelativeLoudnessRespiration_1"           
## [63] "PauseIntervalsRespiration_1"             
## [64] "RateSpeechRespiration_1"                 
## [65] "LatencyRespiratoryExchange_1"

Create a set of acronym labels for the exam measurements.

Label	Exam measurement
RST	rate of speech timing
AST	acceleration of speech timing
DPI	duration of pause intervals
EST	entropy of speech timing
DUS	duration of unvoiced stops
DUF	decay of unvoiced fricatives
DVI	duration of voiced intervals
GVI	gaping in-between voiced intervals
RSR	rate of speech respiration
PIR	pause intervals per respiration
RLR	relative loudness of respiration
LRE	latency of respiratory exchange

tLabels <- gsub("[a-z]", "", names(peb[42:53]))             # grab the capital letters to create acronyms

for (i in 1:12) {
 tLabels[i] <- str_glue(tLabels[i], " ", units[i])          # concatenate the acronyms with units   
}
 
tLabels                                                     # look at the labels

##  [1] "EST (-)"      "RST (-/min)"  "AST (-/min2)" "DPI (ms)"     "DVI (ms)"    
##  [6] "GVI (-/min)"  "DUS (ms)"     "DUF (‰/min)"  "RLR (dB)"     "PIR (-)"     
## [11] "RSR (-/min)"  "LRE (ms)"

To obtain the desired set of data, leave out the medication variables and the 26 motor skill scores which are numbered 19 through 31 in the variable labels. Create a variable for the test groups and convert character data to numeric.

pdata <- select(peb, id = ParticipantCode,    # create a tibble with the needed variables
         age = Age,
         motor_score = `OverviewMotorExamination:UpdrsIiiTotal`, # motor skills score
         dysarthia = `18.Speech`,                                # dysarthia score  
         est_r = EntropySpeechTiming,                            # reading speech measurements
         rst_r = RateSpeechTiming,
         ast_r = AccelerationSpeechTiming,
         dpi_r = DurationPauseIntervals,
         dvi_r = DurationVoicedIntervals,
         gvi_r = GapingVoicedIntervals,
         dus_r = DurationUnvoicedStops,
         duf_r = DecayUnvoicedFricatives,
         rlr_r = RelativeLoudnessRespiration,
         pir_r = PauseIntervalsRespiration,
         rsr_r = RateSpeechRespiration,
         lre_r = LatencyRespiratoryExchange,
         est_m = EntropySpeechTiming_1,                          # monologue speech measurements
         rst_m = RateSpeechTiming_1,
         ast_m = AccelerationSpeechTiming_1,
         dpi_m = DurationPauseIntervals_1,
         dvi_m = DurationVoicedIntervals_1,
         gvi_m = GapingVoicedIntervals_1,
         dus_m = DurationUnvoicedStops_1,
         duf_m = DecayUnvoicedFricatives_1,
         rlr_m = RelativeLoudnessRespiration_1,
         pir_m = PauseIntervalsRespiration_1,
         rsr_m = RateSpeechRespiration_1,
         lre_m = LatencyRespiratoryExchange_1
         )

pdata$group <- str_extract(peb$ParticipantCode, "(PD|RBD|HC)")          # pull out the group categories from the ids

pdata$group <- pdata$group %>% str_replace("HC", "controls") %>%        # factor group to order and label like the article diagrams
               factor(levels = c("PD", "RBD", "controls"))

# assuming control group does not have dysarthia, change hyphen to 0 to prevent NA

pdata$dysarthia <- pdata$dysarthia %>% str_replace('-','0') %>% as.numeric() 
pdata$motor_score <- as.numeric(pdata$motor_score)

head(pdata) %>% kable() %>% kable_minimal()

id	age	motor_score	dysarthia	est_r	rst_r	ast_r	dpi_r	dvi_r	gvi_r	dus_r	duf_r	rlr_r	pir_r	rsr_r	lre_r	est_m	rst_m	ast_m	dpi_m	dvi_m	gvi_m	dus_m	duf_m	rlr_m	pir_m	rsr_m	lre_m	group
PD01	58	8	0	1.564	354	6.05	146	264	58.65	31.38	-2.101	-22.47	4.50	21.14	167	1.564	333	-2.82	158	318	49.01	22.37	0.588	-19.77	6	13.81	127	PD
PD02	68	22	1	1.564	340	27.52	173	253	48.26	22.38	-1.745	-24.59	7.00	15.28	163	1.569	285	8.20	295	264	40.56	26.88	-0.825	-23.26	4	21.77	313	PD
PD03	68	19	0	1.550	211	11.97	377	322	47.54	38.12	2.657	-16.89	3.00	20.76	372	1.550	247	4.71	280	317	48.97	22.37	-0.955	-13.29	4	22.52	201	PD
PD04	75	24	0	1.519	140	-2.49	360	663	13.72	44.88	-0.934	-25.54	1.00	18.71	119	1.539	112	-9.09	397	800	18.69	49.37	0.791	-25.08	2	14.37	151	PD
PD05	61	54	1	1.543	269	6.72	211	328	42.90	47.12	-0.973	-22.61	5.00	16.26	78	1.560	230	11.77	206	480	33.54	26.87	0.075	-22.32	5	14.61	151	PD
PD06	58	29	1	1.553	317	24.19	186	286	43.83	33.63	0.921	-25.00	2.75	27.07	124	1.518	181	13.38	611	398	18.18	49.37	1.488	-25.08	2	18.21	593	PD

Data Exploration

Dysarthia tagging by perceptual tests

The percentage of each group tagged as having dysarthia by a speech specialist using perceptual tests is shown in the following table.

dpct <- group_by(pdata, group) %>%                                                                  # create a table of percentages by group
  summarize(num = n(), dysarthia = sum(dysarthia), dysarthiaPct = (sum(dysarthia) / n()) * 100)

kable(dpct, caption = "Percentage of participants tagged with dysarthia", digits = 2,               # print the table
      col.names = c("Group", "#Participants", "Dysarthia", "%")) %>% kable_material()

Percentage of participants tagged with dysarthia
Group	#Participants	Dysarthia	%
PD	30	13	43.33
RBD	50	3	6.00
controls	50	0	0.00

Six percent of the RBD group are tagged with dysarthia by the perceptual evaluation, while 43% of the group with untreated Parkinson’s disease are tagged.

Age of participants in each test group

Age itself can have an impact on speech and vocal quality. The following graph shows the distribution of participant ages for each test group.

# create a quick plot of the participant ages in each group

qplot(age, data = pdata, fill=group, geom = "dotplot",  facets = "group") +
  scale_fill_manual(name = "Group", values=c('red', 'powderblue', 'darkgreen')) +  # colors for each group
               theme_bw()

It isn’t common to see Parkinson’s disease in people younger than 50, but for a small subset the disease strikes early. People are diagnosed with Parkinson’s at an average age of 60. My father was diagnosed at 52, and my father-in-law in his seventies. Diagnosis in a person younger than 50 is considered young-onset Parkinson’s, or YOPD. Rarely, PD may be diagnosed in people younger than 40 — about 2 percent of the 1 million people with Parkinson’s were diagnosed earlier than age 40.⁴ One of the most recognizable is Michael J. Fox, who was diagnosed at age 29.⁵

The samples show three YOPD subjects and a reasonable balance of subjects across age ranges within the three groups.

Statistical Analysis

Accoustic speech analysis measurement ranges

Pivot the data longer to combine the measurements for the two exam recordings into a single column for measurement type, a column for the recording type (reading or monologue) and a column for the measurement value.

pd1 <- pdata %>% pivot_longer(est_r:lre_m,
                     names_to = c("meastyp", "examtyp"),
                     names_pattern = "(.*)_(.)",
                     values_to = "amt"
                     )

getMlabel <- sort(tLabels)                           # create a lookup table of the labels

nmes <- unique(pd1$meastyp) %>% sort                 # create the lookup names
names(getMlabel) <- sort(nmes)

pd1$mlabel <- getMlabel[pd1$meastyp]                 # add labels to pd1   

pd1$meastyp <-factor(pd1$meastyp, levels=c("est",    # order the types as they are in the dataset
                                           "rst",
                                           "ast",
                                           "dpi",
                                           "dvi",
                                           "gvi",
                                           "dus",
                                           "duf",
                                           "rlr",
                                           "pir",
                                           "rsr",
                                           "lre")
                     )

pd1$examtyp <- if_else(pd1$examtyp == "m", "Monologue", 
                        if(pd1$examtyp == "r") "Reading")

Create a faceted boxplot to look at the range for each of the variables within each group and exam type.

 ggplot(pd1, aes(meastyp, amt, fill = meastyp)) +                   # create a faceted boxplot
             geom_boxplot(outlier.color = "red") + 
             coord_cartesian(ylim = c(-35, 615)) +                  # expand the y-axis scale
             facet_grid(cols = vars(group), 
                        rows = vars(examtyp), scales = "free") +      # facet by the group
             theme_bw() + scale_x_discrete(breaks=NULL) +           # don't show labels on the x-axis
             scale_fill_discrete(name = "Speech measurement", 
                                 labels = tLabels) +
             labs(x = NULL, y = NULL)

The units as shown in the legend for each measurement type are quite different, dB, ms, min, etc. Outliers are evident in each group for most of the variables. In the study the speech data is split into four speech categories, Respiration, Timing, Phonation, and Articulation. Dividing the data by these categories and creating boxplots for each gives a better view of the ranges for each measurement type.

# Add speech categories
pd1 <- pd1 %>% mutate(spchcat = if_else(meastyp %in% c("rlr", "lre", "pir", "rsr"), "Respiration",
                                  if_else(meastyp %in% c("rst", "ast", "dpi", "est"), "Timing",
                                  if_else(meastyp %in% c("gvi", "dvi"), "Phonation", "Articulation")
                                ))
                      )

gcat <- "Respiration"                                                # initialize the category variable
pcolor <- "green"                                                    # initialize the color variable
pd1$group <- factor(pd1$group, levels = c("PD", "RBD", "controls"))  # factor group to display in specified order

# create a function with the plot template

pt <- function (gcat, pcolor) {
  dcat <- pd1 %>% filter(spchcat == gcat)                            # create a faceted box plot
  ggplot(dcat) +      
             geom_boxplot(aes(group, amt, fill = examtyp), outlier.color = "red") +
             facet_wrap(vars(mlabel), scales = "free") +      
             theme_bw() +  
             scale_fill_grey(name = "Legend:", labels = c("Reading passage", "Monologue")) +
             labs(title = gcat, x = NULL, y = NULL) +
             theme(plot.title.position = "plot", 
                   plot.title = element_text(hjust = .5, vjust = 0, size = rel(2)),   # add & position title
                   legend.position = "bottom", legend.direction = "horizontal",       # legend on the bottom
                   plot.background = element_rect(color = pcolor, size = 6),          # rectangle around plot
                   plot.margin = margin(.5, .5, .5, .5, "cm"),
                   strip.text.x = element_text(face = "bold", size = rel(1.5) ))      # bold the facet labels
}

# execute the template for each category

par(mfrow = c(2,2))

pt("Respiration", "dodgerblue")

pt("Phonation", "firebrick2")

pt("Timing", "seagreen3")

pt("Articulation", "yellow3")

Distributions and Correlations

The following diagram shows distributions and correlations of exam measurements for the subjects in the Parkinson’s group. Many of the variables are skewed. Even though the entire sample size of 130 is reasonably large, and the Parkinson’s disease group has 30 subjects, a linear regression model is probably not the best choice for finding a predictor.

The study used quadratic discriminant function analysis where the multivariate normal distribution characterized by the mean and covariance matrix was estimated using an expectation maximization algorithm. Logistic regression will be here used with the understanding that the models are probably not representative of the data.

require(psych)

pdc <- pdata %>% filter(group == "PD")                        # get the Parkinson's group data
pairs.panels(pdc[5:28],                                       # plot distributions and correlations
             gap = 0,
             pch = 20,
             lm = TRUE)

Working through all of the permutations of 24 variables would be time consuming, so a few models will be developed to continue. The study found seven variables that showed the most distinctive disturbed speech patterns between the PD group and the control group. Three models will be created to use in logistic regression.

All 24 variables together
Seven variables identified in the study
Variables with significant correlation

The following diagram highlights significant correlations of exam measurements for the Parkinson’s subjects.

pacman::p_load(GGally)

pcor <- pdata %>% filter(group == "PD") %>% select(est_r:lre_m)    # filter for the PD group                                                

# create a correlation plot

pdcorr <- ggcorr(pcor, nbreaks = 5, palette = "RdGy",                                   # set the number of levels and the colors       
                 hjust = .65, size = 4,                                                 # nudge and format the axis labels
                 label = TRUE, label_size = 4, label_color = "white",                   # format the coefficient 
                 name = "Significance") +                                               # legend title 
                 labs(title = "PD Correlation",                                         # plot title
                 caption = "source: https://www.nature.com/articles/s41598-017-00047-5") +
          theme(plot.title.position = "plot", 
                plot.title = element_text(hjust = .5, vjust = 0, size = rel(2)),        # position and format the title and caption
                plot.caption.position = "plot")

pdcorr

There are sixteen unique variables with significant coefficients. This table shows the coefficients with absolute values greater than 0.6.

pdcorr$data %>% filter(abs(coefficient) > 0.6 ) %>%  
 select(1:3) %>% 
 arrange(coefficient) %>% 
  kable(caption = "Significant correlation of PD group", digits = 2,
      col.names = c("X", "Y", "Coefficient"), row.names = TRUE) %>% kable_minimal()

Significant correlation of PD group
	X	Y	Coefficient
1	dpi_r	rst_r	-0.86
2	dvi_r	rst_r	-0.85
3	dvi_m	rst_m	-0.81
4	dpi_m	rst_m	-0.79
5	gvi_m	dpi_m	-0.72
6	dvi_m	rst_r	-0.71
7	rsr_r	pir_r	-0.70
8	dus_m	rst_m	-0.70
9	gvi_r	dvi_r	-0.67
10	rst_m	dvi_r	-0.64
11	pir_m	dpi_m	-0.63
12	dvi_m	gvi_r	-0.62
13	dus_m	gvi_m	-0.62
14	dus_m	dvi_m	0.60
15	pir_m	rst_m	0.61
16	dvi_m	dus_r	0.61
17	lre_m	dpi_m	0.63
18	gvi_r	rst_r	0.64
19	rlr_m	gvi_m	0.64
20	pir_m	gvi_m	0.65
21	gvi_m	rst_m	0.65
22	dvi_r	dpi_r	0.67
23	dus_m	dpi_m	0.70
24	gvi_m	gvi_r	0.73
25	gvi_m	est_m	0.77
26	rlr_m	rlr_r	0.78
27	dvi_m	dvi_r	0.90

Logistic regression models

A diagnosis of dysarthia is a binomial result which lends itself to logistic regression. General linear models with the logit function are used to create the three models. Then the models are plotted to evaluate their fit to the data.

fit24 - using all 24 acoustic variables
fit16 - using the 16 variables with significant coefficients from the correlation plot
fit7 - the seven variables the study found to distinctive speech pattern difference

attach(pdc)                                                                 # use the Parkinsons data

fit24 <- glm(dysarthia ~ est_r + rst_r + ast_r + dpi_r + dvi_r + gvi_r +    # model all 24 variables             
                         dus_r + duf_r + rlr_r + pir_r + rsr_r + lre_r +
                         est_m + rst_m + ast_m + dpi_m + dvi_m + gvi_m +
                         dus_m + duf_m + rlr_m + pir_m + rsr_m + lre_m, 
              family=binomial(link="logit")
             )

# model 16 variables with significant correlation
fit16 <- glm(dysarthia ~ dpi_m + dpi_r + dus_m + dvi_m + dvi_r + est_m +    
                         gvi_m + lre_m + pir_m + pir_r + rlr_m + rlr_r + 
                         rsr_m + rsr_r + rst_m + rst_r, 
              family=binomial(link="logit")
             )

# model 7 significant variables found in the study 
fit7 <- glm(dysarthia ~ dpi_m + dpi_r + dus_m + dus_r + dvi_m + pir_r +  rst_r,   
             family=binomial(link="logit")
            ) 
detach(pdc)

par(mfrow = c(2,2))
plot(fit24)
mtext("fit24", side = 1)

par(mfrow = c(2,2), bg = "aliceblue")
plot(fit16)
mtext("fit16", side = 1)

par(mfrow = c(2,2), bg = "white")
plot(fit7)
mtext("fit7", side = 1)

Consistent with the boxplots, each of the models show high leverage outliers. Removing the outliers and rerunning the models doesn’t improve them, each model shows a new set of high leverage outliers, as shown below. Which is further evidence that logistic regression is not the best fit for this data.

pdo <- pdc %>% filter(!id %in% c("PD04", "PD08", "PD27"))                          # remove high leverage outliers for fit24
attach(pdo)
fito <- glm(dysarthia ~ est_r + rst_r + ast_r + dpi_r + dvi_r + gvi_r +    # model all 24 variables             
                         dus_r + duf_r + rlr_r + pir_r + rsr_r + lre_r +
                         est_m + rst_m + ast_m + dpi_m + dvi_m + gvi_m +
                         dus_m + duf_m + rlr_m + pir_m + rsr_m + lre_m, 
              family=binomial(link="logit")
             )
par(mfrow = c(2,2))
plot(fito)
mtext("fit24", side = 1)

pdo <- pdc %>% filter(!id %in% c("PD03", "PD08", "PD10"))                  # remove high leverage outliers for fit16
fito <- glm(dysarthia ~ dpi_m + dpi_r + dus_m + dvi_m + dvi_r + est_m +    # model 16 variables with significant correlation
                         gvi_m + lre_m + pir_m + pir_r + rlr_m + rlr_r + 
                         rsr_m + rsr_r + rst_m + rst_r, 
              family=binomial(link="logit")
             )
par(mfrow = c(2,2), bg = "aliceblue")
plot(fito)
mtext("fit16", side = 1)

pdo <- pdc %>% filter(!id %in% c("PD06", "PD09", "PD27"))                  # remove high leverage outliers for fit7
fito <- glm(dysarthia ~ dpi_m + dpi_r + dvi_m + dvi_r + est_m + gvi_m +     # model 7 significant variables found in the study 
                        rlr_m + rlr_r + rst_m + rst_r,
             family=binomial(link="logit")
            ) 

par(mfrow = c(2,2), bg = "white")
plot(fito)
mtext("fit7", side = 1)

detach(pdo)

Predictions and model evaluations

Both PD and RBD subjects were scored according to the Unified Parkinson’s Disease Rating Scale motor subscore (UPDRS III15). The scores range from 0 to 108, with 0 for no motor manifestations and 108 representing severe motor distortion. A score > 3 was is considered a very good indicator of initial parkinsonism.

rbd <- pdata %>% filter(group == "RBD")                     # get the RBD group data

The RBD subject have motor scores ranging from 0 to 21. RBD subjects with a motor score ≤ 3 (asymptomatic RBD subgroup) are labeled motor negatives and RBD subjects with a motor score >3 (symptomatic RBD subgroup) were labeled motor positives. This categorization is used to create the confusion matrices for evaluating the model performance by creating a logical variable from the motor evaluation score.

rbd <- rbd %>% 
  mutate(motor_true = if_else(motor_score > 3, 1, 0))       # add the motor true categorization

rbd %>% select(id, dysarthia, motor_score, motor_true) %>%  # look at the initial dysarthia diagnosis, motor score and motor true
  head() %>% 
  kable() %>% kable_minimal()

id	dysarthia	motor_score	motor_true
RBD01	1	3	0
RBD02	0	2	0
RBD03	0	7	1
RBD04	0	1	0
RBD05	0	10	1
RBD06	0	12	1

Creating the predictions for each of the models show that none of them have very good performance compared to the motor score. They all have more false negatives than true negatives, and each has a good number of false positives. The model with the best performance is fit16, but it’s performance at 50% accuracy with 41% sensitivity and 61% specificity is still not great.

require(caret)

## Warning: package 'caret' was built under R version 4.0.5

etbl <- tibble("Model" = c("fit24", "fit16", "fit7"),           # tibble to compare fit evals
               "Accuracy" = c(0,0,0),
               "True Pos" = c(0,0,0),
               "True Neg" = c(0,0,0),
               "False Pos" = c(0,0,0),
               "False Neg" = c(0,0,0),
               "Sensitivity" = c(0,0,0),
               "Specificity" = c(0,0,0)
               )

motor_positive <- factor(rbd$motor_true, c(0,1))                # factor required by confusionMatrix

predicted24 <- predict(fit24, rbd, type = "response") %>% round()        # prediction using fit24
predictedc <- predicted24 %>% factor(levels = c(0,1)) 
c24 <- confusionMatrix(motor_positive, predictedc)
etbl$Accuracy[1] = c24$overall["Accuracy"]
etbl$`True Pos`[1] = c24$table[2,2]
etbl$`True Neg`[1] = c24$table[1,1]
etbl$`False Pos`[1] = c24$table[1,2]
etbl$`False Neg`[1] = c24$table[2,1]
etbl$Sensitivity[1] = c24$byClass["Sensitivity"]
etbl$Specificity[1] = c24$byClass["Specificity"]

predicted16 <- predict(fit16, rbd, type = "response") %>% round()        # prediction using fit16
predictedc <- predicted16 %>% factor(levels = c(0,1)) 
c16  <- confusionMatrix(motor_positive, predictedc)
etbl$Accuracy[2] = c16$overall["Accuracy"]
etbl$`True Pos`[2] = c16$table[2,2]
etbl$`True Neg`[2] = c16$table[1,1]
etbl$`False Pos`[2] = c16$table[1,2]
etbl$`False Neg`[2] = c16$table[2,1]
etbl$Sensitivity[2] = c16$byClass["Sensitivity"]
etbl$Specificity[2] = c16$byClass["Specificity"]

predicted7 <- predict(fit7, rbd, type="response") %>% round()            # prediction using fit7   
predictedc <- predicted7 %>% factor(levels = c(0,1)) 
c7  <- confusionMatrix(motor_positive, predictedc)
etbl$Accuracy[3] = c7$overall["Accuracy"]
etbl$`True Pos`[3] = c7$table[2,2]
etbl$`True Neg`[3] = c7$table[1,1]
etbl$`False Pos`[3] = c7$table[1,2]
etbl$`False Neg`[3] = c7$table[2,1]
etbl$Sensitivity[3] = c7$byClass["Sensitivity"]
etbl$Specificity[3] = c7$byClass["Specificity"]

etbl %>% kable(digits = 3) %>% kable_minimal()

Model	Accuracy	True Pos	True Neg	False Pos	False Neg	Sensitivity	Specificity
fit24	0.46	11	12	8	19	0.387	0.579
fit16	0.50	14	11	9	16	0.407	0.609
fit7	0.42	10	11	9	20	0.355	0.526

The following plot of the ROC curves also shows the relatively low accuracy of the models.

require(plotROC)

cmat <- rbd %>% mutate("Model24" = predicted24, "Model16" = predicted16, 
                        "Model7" = predicted7) %>%
             select(motor_true, Model24, Model16, Model7) %>% 
             pivot_longer(Model24:Model7,
                          names_to = "model",
                          values_to = "diag"
                          ) 
cmat$model <- factor(cmat$model, levels = c("Model7", "Model16", "Model24") )

ggplot(cmat) + 
  geom_roc(aes(d=motor_true, m = diag, color = model), size = 2, labels = FALSE) +
   scale_color_discrete(name = "Model") +
  labs(title = "ROC for logistic regression models",                                         # titles and labels
       x = "Speech positive / Motor negative (False positive)",
       y = "Speech positive / Motor positive (True positive)") +
             theme_light()

Conclusion

Using a QDA model the study diagnosed RBD subjects with 70.0% accuracy, 73.9% sensitivity and 66.7% specificity. The three logistic regression models here did not perform as well. There’s not much reason to have confidence in these logistic regression models.

References

Moya-Galé G, Levy ES (2018). Parkinson’s disease-associated dysarthria: prevalence, impact and management strategies. https://www.dovepress.com/parkinsons-disease-associated-dysarthria-prevalence-impact-and-managem-peer-reviewed-article-JPRLS
Hlavnička, J., Čmejla, R., Tykalová, T. et al. (2017). Automated analysis of connected speech reveals early biomarkers of Parkinson’s disease in patients with rapid eye movement sleep behaviour disorder. https://www.nature.com/articles/s41598-017-00047-5
Early biomarkers of parkinsons disease dataset. https://www.kaggle.com/ruslankl/early-biomarkers-of-parkinsons-disease
Young-Onset Parkinson’s Disease. https://www.hopkinsmedicine.org/health/conditions-and-diseases/parkinsons-disease/youngonset-parkinsons-disease
Michael J. Fox interview on NBC’s Today Sunday Sitdown, Nov. 22, 2020. https://www.today.com/health/michael-j-fox-recalls-wife-tracy-pollan-s-reaction-parkinson-t199866

Parkinson’s Disease Speech Analysis Data

Data 101, Spring Semester 2021 - Final Project

Marilyn Macy

4/16/2021