Data 110 - Project 2

Automating Parkinson’s Disease diagnosis using Speech Recognition

Dysarthria is a speech disorder where the muscles used to produce speech are damaged, paralyzed, or weakened. Close to 90% of people with Parkinson’s Disease (PD) present with hypokinetic dysarthria, evidenced by reduced vocal loudness, monotone, reduced fundamental frequency range, consonant and vowel imprecision, breathiness and irregular pauses.1 Evaluation of dysarthia in disorders such as PD has been limited to perceptual tests or user-controlled laboratory analysis based upon rather small samples of human vocalizations. The development of a fully-automated method to detect early, distinctive patterns of neuro-degeneration using only acoustic data from connected speech such as reading a short passage or monologue would have the potential to revolutionize the diagnostic process for diseases manifesting motor speech disorders.2

Early biomarkers of Parkinson’s Disease dataset

The dataset associated with the article “Automated analysis of connected speech reveals early biomarkers of Parkinson’s disease in patients with rapid eye movement sleep behaviour disorder”2 is available as an excel file via the article on Nature.com. It’s also on Kaggle as a csv file.3

The data include 30 patients with early untreated Parkinson’s, 50 patients with REM sleep behavior disorder (RBD), which are at high risk developing Parkinson’s disease, and 50 healthy controls (HC). All patients were scored clinically by a neurologist with experience in movement disorders. All subjects were examined during a single session with a speech specialist. All subjects were recorded reading a standardized, phonetically-balanced text of 80 words and a monologue about their interests, job, family or current activities for approximately 90 seconds. Speech features were automatically analyzed by Jan Hlavnička et al. (2017).3

Both my father and my father-in-law had Parkinson’s disease, so I found this dataset and paper about developing automated diagnosis from speech recordings interesting. The diagrams in the article have plots that are simple and clear.

I’ll try to approximate a couple of the plots, but first I’ll get the data, clean it up and do a bit of analysis.

setwd("C:/Users/Owner/Desktop/My Documents/School/DataFiles/Parkinsons")

peb <- tibble(read_csv("parkinsonsEarlyBiomarkers.csv"))    # read Parkinsons Early Biomarkers dataset

spec(peb)                                                   # look at the peb tibble
## cols(
##   `Participant  code` = col_character(),
##   `Age  (years)` = col_double(),
##   Gender = col_character(),
##   `Positive  history  of  Parkinson  disease  in  family` = col_character(),
##   `Age  of  disease  onset  (years)` = col_character(),
##   `Duration  of  disease  from  first  symptoms  (years)` = col_character(),
##   `Antidepressant  therapy` = col_character(),
##   `Antiparkinsonian  medication` = col_character(),
##   `Antipsychotic  medication` = col_character(),
##   `Benzodiazepine  medication` = col_character(),
##   `Levodopa  equivalent  (mg/day)` = col_double(),
##   `Clonazepam  (mg/day)` = col_double(),
##   `Overview  of  motor  examination:  Hoehn  &  Yahr  scale  (-)` = col_character(),
##   `Overview  of  motor  examination:  UPDRS  III  total  (-)` = col_character(),
##   `18.  Speech` = col_character(),
##   `19.  Facial  Expression` = col_character(),
##   `20.  Tremor  at  Rest  -  head` = col_character(),
##   `20.  Tremor  at  Rest  -  RUE` = col_character(),
##   `20.  Tremor  at  Rest  -  LUE` = col_character(),
##   `20.  Tremor  at  Rest  -  RLE` = col_character(),
##   `20.  Tremor  at  Rest  -  LLE` = col_character(),
##   `21.  Action  or  Postural  Tremor  -  RUE` = col_character(),
##   `21.  Action  or  Postural  Tremor  -  LUE` = col_character(),
##   `22.  Rigidity  -  neck` = col_character(),
##   `22.  Rigidity  -  RUE` = col_character(),
##   `22.  Rigidity  -  LUE` = col_character(),
##   `22.  Rigidity  -  RLE` = col_character(),
##   `22.  Rigidity  -  LLE` = col_character(),
##   `23.Finger  Taps  -  RUE` = col_character(),
##   `23.Finger  Taps  -  LUE` = col_character(),
##   `24.  Hand  Movements  -  RUE` = col_character(),
##   `24.  Hand  Movements  -  LUE` = col_character(),
##   `25.  Rapid  Alternating  Movements  -  RUE` = col_character(),
##   `25.  Rapid  Alternating  Movements  -  LUE` = col_character(),
##   `26.  Leg  Agility  -  RLE` = col_character(),
##   `26.  Leg  Agility  -  LLE` = col_character(),
##   `27.  Arising  from  Chair` = col_character(),
##   `28.  Posture` = col_character(),
##   `29.  Gait` = col_character(),
##   `30.  Postural  Stability` = col_character(),
##   `31.  Body  Bradykinesia  and  Hypokinesia` = col_character(),
##   `Entropy  of  speech  timing  (-)` = col_double(),
##   `Rate  of  speech  timing  (-/min)` = col_double(),
##   `Acceleration  of  speech  timing  (-/min2)` = col_double(),
##   `Duration  of  pause  intervals  (ms)` = col_double(),
##   `Duration  of  voiced  intervals  (ms)` = col_double(),
##   `Gaping  in-between  voiced  intervals  (-/min)` = col_double(),
##   `Duration  of  unvoiced  stops  (ms)` = col_double(),
##   `Decay  of  unvoiced  fricatives  (‰/min)` = col_double(),
##   `Relative  loudness  of  respiration  (dB)` = col_double(),
##   `Pause  intervals  per  respiration  (-)` = col_double(),
##   `Rate  of  speech  respiration  (-/min)` = col_double(),
##   `Latency  of  respiratory  exchange  (ms)` = col_double(),
##   `Entropy  of  speech  timing  (-)_1` = col_double(),
##   `Rate  of  speech  timing  (-/min)_1` = col_double(),
##   `Acceleration  of  speech  timing  (-/min2)_1` = col_double(),
##   `Duration  of  pause  intervals  (ms)_1` = col_double(),
##   `Duration  of  voiced  intervals  (ms)_1` = col_double(),
##   `Gaping  in-between  voiced  intervals  (-/min)_1` = col_double(),
##   `Duration  of  unvoiced  stops  (ms)_1` = col_double(),
##   `Decay  of  unvoiced  fricatives  (‰/min)_1` = col_double(),
##   `Relative  loudness  of  respiration  (dB)_1` = col_double(),
##   `Pause  intervals  per  respiration  (-)_1` = col_double(),
##   `Rate  of  speech  respiration  (-/min)_1` = col_double(),
##   `Latency  of  respiratory  exchange  (ms)_1` = col_double()
## )

Clean dataset

For my analysis of the data, I’ll deselect the medication and the 26 motor skill scores (numbered 19. through 31. in the field labels). Then I’ll give the variables concise names.

units <- names(peb[42:53]) %>% str_extract("(\\(.*\\))")                              # pull the exam measurement units off of the names
names(peb) <-  gsub("(of) | (per) | (in-between) | (\\(.*\\))"," ", names(peb))       # clean out extraneous words   
names(peb) <-  str_to_title(names(peb))                                               # capitalize each word
names(peb) <- gsub(" ","",names(peb))                                                 # eliminate spaces

names(peb)                                                                            # look at the names
##  [1] "ParticipantCode"                         
##  [2] "Age"                                     
##  [3] "Gender"                                  
##  [4] "PositiveHistoryParkinsonDiseaseInFamily" 
##  [5] "AgeDiseaseOnset"                         
##  [6] "DurationDiseaseFromFirstSymptoms"        
##  [7] "AntidepressantTherapy"                   
##  [8] "AntiparkinsonianMedication"              
##  [9] "AntipsychoticMedication"                 
## [10] "BenzodiazepineMedication"                
## [11] "LevodopaEquivalent"                      
## [12] "Clonazepam"                              
## [13] "OverviewMotorExamination:Hoehn&YahrScale"
## [14] "OverviewMotorExamination:UpdrsIiiTotal"  
## [15] "18.Speech"                               
## [16] "19.FacialExpression"                     
## [17] "20.TremorAtRest-Head"                    
## [18] "20.TremorAtRest-Rue"                     
## [19] "20.TremorAtRest-Lue"                     
## [20] "20.TremorAtRest-Rle"                     
## [21] "20.TremorAtRest-Lle"                     
## [22] "21.ActionOrPosturalTremor-Rue"           
## [23] "21.ActionOrPosturalTremor-Lue"           
## [24] "22.Rigidity-Neck"                        
## [25] "22.Rigidity-Rue"                         
## [26] "22.Rigidity-Lue"                         
## [27] "22.Rigidity-Rle"                         
## [28] "22.Rigidity-Lle"                         
## [29] "23.FingerTaps-Rue"                       
## [30] "23.FingerTaps-Lue"                       
## [31] "24.HandMovements-Rue"                    
## [32] "24.HandMovements-Lue"                    
## [33] "25.RapidAlternatingMovements-Rue"        
## [34] "25.RapidAlternatingMovements-Lue"        
## [35] "26.LegAgility-Rle"                       
## [36] "26.LegAgility-Lle"                       
## [37] "27.ArisingFromChair"                     
## [38] "28.Posture"                              
## [39] "29.Gait"                                 
## [40] "30.PosturalStability"                    
## [41] "31.BodyBradykinesiaAndHypokinesia"       
## [42] "EntropySpeechTiming"                     
## [43] "RateSpeechTiming"                        
## [44] "AccelerationSpeechTiming"                
## [45] "DurationPauseIntervals"                  
## [46] "DurationVoicedIntervals"                 
## [47] "GapingVoicedIntervals"                   
## [48] "DurationUnvoicedStops"                   
## [49] "DecayUnvoicedFricatives"                 
## [50] "RelativeLoudnessRespiration"             
## [51] "PauseIntervalsRespiration"               
## [52] "RateSpeechRespiration"                   
## [53] "LatencyRespiratoryExchange"              
## [54] "EntropySpeechTiming_1"                   
## [55] "RateSpeechTiming_1"                      
## [56] "AccelerationSpeechTiming_1"              
## [57] "DurationPauseIntervals_1"                
## [58] "DurationVoicedIntervals_1"               
## [59] "GapingVoicedIntervals_1"                 
## [60] "DurationUnvoicedStops_1"                 
## [61] "DecayUnvoicedFricatives_1"               
## [62] "RelativeLoudnessRespiration_1"           
## [63] "PauseIntervalsRespiration_1"             
## [64] "RateSpeechRespiration_1"                 
## [65] "LatencyRespiratoryExchange_1"

Create a set of acronym labels for the exam measurements.

Label Exam measurement
RST rate of speech timing
AST acceleration of speech timing
DPI duration of pause intervals
EST entropy of speech timing
DUS duration of unvoiced stops
DUF decay of unvoiced fricatives
DVI duration of voiced intervals
GVI gaping in-between voiced intervals
RSR rate of speech respiration
PIR pause intervals per respiration
RLR relative loudness of respiration
LRE latency of respiratory exchange
tLabels <- gsub("[a-z]", "", names(peb[42:53]))             # grab the capital letters to create acronyms

for (i in 1:12) {
 tLabels[i] <- str_glue(tLabels[i], " ", units[i])          # concatenate the acronyms with units   
}
 
tLabels                                                     # look at the labels
##  [1] "EST (-)"      "RST (-/min)"  "AST (-/min2)" "DPI (ms)"     "DVI (ms)"    
##  [6] "GVI (-/min)"  "DUS (ms)"     "DUF (‰/min)"  "RLR (dB)"     "PIR (-)"     
## [11] "RSR (-/min)"  "LRE (ms)"
pdata <- select(peb, id = ParticipantCode,    # create a tibble with the needed variables
         age = Age,
         dysarthia = `18.Speech`,
         est_r = EntropySpeechTiming,
         rst_r = RateSpeechTiming,
         ast_r = AccelerationSpeechTiming,
         dpi_r = DurationPauseIntervals,
         dvi_r = DurationVoicedIntervals,
         gvi_r = GapingVoicedIntervals,
         dus_r = DurationUnvoicedStops,
         duf_r = DecayUnvoicedFricatives,
         rlr_r = RelativeLoudnessRespiration,
         pir_r = PauseIntervalsRespiration,
         rsr_r = RateSpeechRespiration,
         lre_r = LatencyRespiratoryExchange,
         est_m = EntropySpeechTiming_1,
         rst_m = RateSpeechTiming_1,
         ast_m = AccelerationSpeechTiming_1,
         dpi_m = DurationPauseIntervals_1,
         dvi_m = DurationVoicedIntervals_1,
         gvi_m = GapingVoicedIntervals_1,
         dus_m = DurationUnvoicedStops_1,
         duf_m = DecayUnvoicedFricatives_1,
         rlr_m = RelativeLoudnessRespiration_1,
         pir_m = PauseIntervalsRespiration_1,
         rsr_m = RateSpeechRespiration_1,
         lre_m = LatencyRespiratoryExchange_1
         )

pdata$group <- str_extract(peb$ParticipantCode, "(PD|RBD|HC)")          # pull out the group categories from the ids

pdata$group <- pdata$group %>% str_replace("HC", "controls") %>%        # factor group to order and label like the article diagrams
               factor(levels = c("PD", "RBD", "controls"))

# assuming control group does not have dysarthia, change hyphen to 0 to prevent NA

pdata$dysarthia <- pdata$dysarthia %>% str_replace('-','0') %>% as.numeric() 

head(pdata) %>% kable() %>% kable_minimal()
id age dysarthia est_r rst_r ast_r dpi_r dvi_r gvi_r dus_r duf_r rlr_r pir_r rsr_r lre_r est_m rst_m ast_m dpi_m dvi_m gvi_m dus_m duf_m rlr_m pir_m rsr_m lre_m group
PD01 58 0 1.564 354 6.05 146 264 58.65 31.38 -2.101 -22.47 4.50 21.14 167 1.564 333 -2.82 158 318 49.01 22.37 0.588 -19.77 6 13.81 127 PD
PD02 68 1 1.564 340 27.52 173 253 48.26 22.38 -1.745 -24.59 7.00 15.28 163 1.569 285 8.20 295 264 40.56 26.88 -0.825 -23.26 4 21.77 313 PD
PD03 68 0 1.550 211 11.97 377 322 47.54 38.12 2.657 -16.89 3.00 20.76 372 1.550 247 4.71 280 317 48.97 22.37 -0.955 -13.29 4 22.52 201 PD
PD04 75 0 1.519 140 -2.49 360 663 13.72 44.88 -0.934 -25.54 1.00 18.71 119 1.539 112 -9.09 397 800 18.69 49.37 0.791 -25.08 2 14.37 151 PD
PD05 61 1 1.543 269 6.72 211 328 42.90 47.12 -0.973 -22.61 5.00 16.26 78 1.560 230 11.77 206 480 33.54 26.87 0.075 -22.32 5 14.61 151 PD
PD06 58 1 1.553 317 24.19 186 286 43.83 33.63 0.921 -25.00 2.75 27.07 124 1.518 181 13.38 611 398 18.18 49.37 1.488 -25.08 2 18.21 593 PD

Explore the data

What percentage of each group is tagged as having dysarthia by perceptual tests?

dpct <- group_by(pdata, group) %>%                                                                  # create a table of percentages by group
  summarize(num = n(), dysarthia = sum(dysarthia), dysarthiaPct = (sum(dysarthia) / n()) * 100)

kable(dpct, caption = "Percentage of participants tagged with dysarthia", digits = 2,               # print the table
      col.names = c("Group", "#Participants", "Dysarthia", "%")) %>% kable_material()
Percentage of participants tagged with dysarthia
Group #Participants Dysarthia %
PD 30 13 43.33
RBD 50 3 6.00
controls 50 0 0.00

Only 6% of the RBD group are tagged with dysarthia by the perceptual evaluation, while 43% of the group with untreated Parkinson’s disease are tagged.

Age itself can have an impact on speech and vocal quality. I’ll look at the distribution of participant ages to see how they compare between the test groups.

# create a quick plot of the participant ages in each group

qplot(age, data = pdata, fill=group, geom = "dotplot",  facets = "group")   

It’s not common to see Parkinson’s disease in people younger than 50, but for a small subset the disease strikes early. People are diagnosed with Parkinson’s at an average age of 60. My father was diagnosed at 52, and my father-in-law in his seventies. Anything younger than 50 is considered young-onset Parkinson’s, or YOPD. Rarely, PD may be diagnosed in people younger than 40 — about 2 percent of the 1 million people with Parkinson’s were diagnosed earlier than age 40.4 One of the most recognizable is Michael J. Fox, who was diagnosed at age 29.5

The samples show three YOPD subjects and a reasonable balance of subjects in the three groups across the age ranges.

Correlation

Now I’ll look at the correlations of the exam measurements for the two test groups against the controls. First I’ll compare the Parkinson’s group with the control group.

pacman::p_load(GGally)

pcor <- pdata %>% filter(group != "RBD") %>% select(est_r:lre_m)    # filter for the PD and controls groups                                                

# create a correlation plot

pdcorr <- ggcorr(pcor, nbreaks = 5, palette = "RdGy",                                   # set the number of levels and the colors       
                 hjust = .65, size = 4,                                                 # nudge and format the axis labels
                 label = TRUE, label_size = 4, label_color = "white",                   # format the coefficient 
                 name = "Significance") +                                               # legend title 
                 labs(title = "PD / Control Correlation",                               # plot title
                 caption = "source: https://www.nature.com/articles/s41598-017-00047-5") +
          theme(plot.title.position = "plot", 
                plot.title = element_text(hjust = .5, vjust = 0, size = rel(2)),        # position and format the title and caption
                plot.caption.position = "plot")

pdcorr

Next I’ll look at the correlations for the REM sleep disorder group and the control group.

pcor <- pdata %>% filter(group != "PD") %>% select(est_r:lre_m)                   # filter for the RBD and controls groups

rbdcorr <- ggcorr(pcor, nbreaks = 5, palette = "RdGy",                                     # set the number of levels and the colors       
                 hjust = .65, size = 4,                                                 # nudge and format the axis labels
                 label = TRUE, label_size = 4, label_color = "white",                   # format the coefficient 
                 name = "Significance") +                                               # legend title 
                 labs(title = "RBD / Control Correlation",                               # plot title
                 caption = "source: https://www.nature.com/articles/s41598-017-00047-5") +
          theme(plot.title.position = "plot", 
                plot.title = element_text(hjust = .5, vjust = 0, size = rel(2)),        # position and format the title and caption
                plot.caption.position = "plot")
rbdcorr

# create another version of the plot with just the significant coefficients highlighted

ggcorr(pcor, geom = "blank", label = TRUE, hjust = 0.65, size = 4) +
  geom_point(size = 10, aes(color = coefficient > 0, alpha = abs(coefficient) > 0.5)) +
  scale_alpha_manual(values = c("TRUE" = 0.25, "FALSE" = 0)) +
  guides(alpha = FALSE) +
  scale_color_discrete(name = "Significance", labels = c("<= -0.05", ">= 0.5")) +
  labs(title = "RBD / Control Correlation",
       subtitle = "Another view of the correlation table",
       caption = "source: https://www.nature.com/articles/s41598-017-00047-5") +
  theme(plot.title.position = "plot", plot.title = element_text(hjust = .5, vjust = 0, size = rel(2)),
        plot.subtitle = element_text(hjust = .5), plot.caption.position = "plot")

The correlations for PD and RBD are similar. There are a few significant coefficients that are different, e.g. dpi_m to lre_m and rst_r to dvi_m. Looking at tables of the significant correlations shows that the RBD group has five coefficients in common with the PD group.

# create and print tables of the coefficients for the PD and controls groups

pdcorr$data %>% filter(abs(coefficient) > 0.7 ) %>%                # filter for significant values
 select(1:3) %>%                                                   # select the columns to print
 arrange(coefficient) %>% 
  kable(caption = "Significant correlation of Parkinson's group to control group", digits = 2,   # format and print the table
      col.names = c("PD", "Control", "Coefficient")) %>% kable_minimal()                         # use a kable theme
Significant correlation of Parkinson’s group to control group
PD Control Coefficient
dvi_r rst_r -0.85
dpi_r rst_r -0.80
dvi_m rst_m -0.79
dpi_m rst_m -0.77
gvi_m est_m 0.72
rlr_m rlr_r 0.74
dvi_m dvi_r 0.84
# do the same thing for the RBD to controls groups

rbdcorr$data %>% filter(abs(coefficient) > 0.7 ) %>%  
 select(1:3) %>% 
 arrange(coefficient) %>% 
  kable(caption = "Significant correlation of RBD group to control group", digits = 2,
      col.names = c("RBD", "Control", "Coefficient")) %>% kable_minimal()
Significant correlation of RBD group to control group
RBD Control Coefficient
dvi_r rst_r -0.88
dvi_m rst_m -0.78
dpi_m rst_m -0.78
dpi_r rst_r -0.75
gvi_m est_m 0.73

Approximate the diagrams from the paper

A blinded experiment was executed by quadratic discriminant function analysis that showed the most distinctive disturbed speech patterns between the PD and control groups were found for a combination of

  • RST in reading passage - rst_r
  • DVI in monologue - dvi_m
  • DPI in reading passage - dpi_r
  • DPI in monologue - dpi_m
  • DUS in reading passage - dus_r
  • DUS in monologue - dus_m
  • PIR in monologue - pir_m

Figure 5 from the paper shows plots of two pairings of exam measurements:

Figure 5.Figure 5. Hlavnička, J., Čmejla, R., Tykalová, T. et al. Automated analysis of connected speech reveals early biomarkers of Parkinson’s disease in patients with rapid eye movement sleep behaviour disorder.2

The solid gray line represents the border of discrimination between speech negatives (RBD subjects with asymptomatic speech performance) and speech positives (subjects with speech performance closer to PD speakers). Speech negatives and speech positives were determined in the Index Test during the analysis of the experiment. That data isn’t available and I need more experience with statistics to reproduce the Index Test. Such as:

  • quadratic discriminant function analysis
  • multivariate normal distribution with
  • Expectation Maximization algorithm
  • Bayes discriminant

So I’ll create the plots with a regression line.

cols <- c('red', 'powderblue', 'darkgreen')       # colors for the points
shps <- c(20, 16, 4)                              # shapes for the points

g <- guide_legend("Group")                        # initalize the legend 

# Graph A - x = RST reading task, y = DPI monologue task 

gA <- pdata %>% ggplot() +
             geom_smooth(aes(rst_r, dpi_m), method = lm, formula=y~x, se = FALSE, color = "grey", size = 2) +   # regression line
             geom_point(aes(rst_r, dpi_m, color = group, shape = group, size = group)) +                        # point plot on top
             scale_fill_discrete(name = "Group") +
             coord_cartesian(xlim = c(100, 500), ylim = c(100, 700)) +                                          # expand the axes scales
             labs(title = "Graph A - Acoustic features of distinctive speech patterns",                         # titles and labels
                  x = "Reading RST (-/min)",  
                  y = "Monologue DPI (ms)") +
             theme_light() +
             scale_shape_manual(values = shps, guide = g) +                                                     # use the specified shapes
             scale_color_manual(values=cols, guide = g) +                                                       # colors for each shape
             scale_size_discrete(range = c(3,5), guide = g)                                                     # sizes for each shape 
gA

gB <- pdata %>% ggplot() +
             geom_smooth(aes(dpi_r, dus_m), method = lm, formula=y~x, se = FALSE, color = "grey", size = 2) +   # regression line
             geom_point(aes(dpi_r, dus_m, color = group, shape = group, size = group)) +                        # point plot on top
             scale_fill_discrete(name = "Group") +                                                              # name the legend 
             labs(title = "Graph B - Acoustic features of distinctive speech patterns",                         # add titles and labels 
                  x = "Reading DPI (ms)",
                  y = "Monologue DUS (ms)") +
             theme_light() +
             scale_shape_manual(values = shps, guide = g) +                                                     # use the specified shapes
             scale_color_manual(values=cols, guide = g) +                                                       # colors for each shape
             scale_size_discrete(range = c(3,5), guide = g)                                                     # sizes for each shape  
gB

Here is an interactive version of Graph A built using plot_ly:

shps <- c("circle", "circle", "x")                    # shapes for the scatter plot markers
cols <- c('red', 'lightblue', 'darkgreen')            # colors for the scatter plot markers

szs <-  pdata$group %>% str_replace("PD", "7") %>%    # sizes for the scatter plot markers  
       str_replace("RBD", "20") %>%
       str_replace("controls", "14") %>% as.numeric() 


fit <- lm(dpi_m ~ rst_r, data = pdata)                # formula for the regression line

p2 <- plot_ly() %>%                                   # start an interactive plot
           add_trace(data = pdata,                    # add a scatter plot layer
              x=~rst_r,
              y=~dpi_m,
              type="scatter",
              mode= "markers",
              marker = list(size = szs),              # specify the marker attributes
              color = pdata$group,
              colors = cols,
              symbol = pdata$group,
              symbols = shps,
              hoverinfo= "text",                      # create the popup text
              text = paste("Participant: ", pdata$id,
                           "<br>",
                           "Group: ", pdata$group,
                           "<br>",
                           "Rate of speech intervals: ", pdata$rst_r,
                           "<br>",
                           "Duration of pause intervals: ", pdata$dpi_m
                           )
                            ) %>%
               add_trace(data = pdata, x = ~rst_r, y = fitted(fit), 
                        type = "scatter",                              # add the regression line
                       mode = "lines",                                                   
                       line = list(color = 'grey', width = 4), 
                       opacity = 0.5, 
                       showlegend = FALSE,                             # don't show line on legend
                       inherit = FALSE)                                # don't inherit popup or aesthetics from scatter plot layer



p2 <- p2 %>% layout(title = "Graph A - Acoustic features of distinctive speech patterns",      # add titles
                    xaxis = list(title = "Reading RST (-/min)"),
                    yaxis = list(title = "Monologue DPI (ms)" ),
                    legend = list(title = list(text="Group")),
                    margin = list(t = 90))
p2

Replicate accoustic speech analysis diagram

The following figure from the paper shows the results of the acoustic speech analysis. The 12 exam measurements are grouped into four categories and plotted to show the ranges.

Figure 3.Figure 3. Hlavnička, J., Čmejla, R., Tykalová, T. et al. Automated analysis of connected speech reveals early biomarkers of Parkinson’s disease in patients with rapid eye movement sleep behaviour disorder.2

To replicate these plots I’ll rearrange the data to combine the measurements for the two exam recordings for each participant into a single column for variable type and a column for it’s associated value.

pd1 <- pdata %>% pivot_longer(est_r:lre_m,
                     names_to = c("meastyp", "examtyp"),
                     names_pattern = "(.*)_(.)",
                     values_to = "amt"
                     )

getMlabel <- sort(tLabels)                           # create a lookup table of the labels

nmes <- unique(pd1$meastyp) %>% sort                 # create the lookup names
names(getMlabel) <- sort(nmes)

pd1$mlabel <- getMlabel[pd1$meastyp]                 # add labels to pd1   

pd1$meastyp <-factor(pd1$meastyp, levels=c("est",    # order the types as they are in the dataset
                                           "rst",
                                           "ast",
                                           "dpi",
                                           "dvi",
                                           "gvi",
                                           "dus",
                                           "duf",
                                           "rlr",
                                           "pir",
                                           "rsr",
                                           "lre")
                     )

Create a faceted boxplot to look at the spread for each of the variables within each group.

 ggplot(pd1, aes(meastyp, amt, fill = meastyp)) +                   # create a faceted boxplot
             geom_boxplot() + 
             coord_cartesian(ylim = c(-35, 615)) +                  # expand the y-axis scale
             facet_grid(cols = vars(group), scales = "free") +      # facet by the group
             theme_bw() + scale_x_discrete(breaks=NULL) +           # don't show labels on the x-axis
             scale_fill_discrete(name = "Speech measurement", 
                                 labels = tLabels) +
             labs(x = NULL, y = NULL)

The units as shown in the legend for each type are quite different. High outliers for PD subjects are evident, speech rate and voiced duration are similar for PD and RBD but differ from the control group. Relative loudness of respiration has all negative values.

I’ll split the data by the four categories in figure 3, and plot each category.

# Add speech categories
pd1 <- pd1 %>% mutate(spchcat = if_else(meastyp %in% c("rlr", "lre", "pir", "rsr"), "Respiration",
                                  if_else(meastyp %in% c("rst", "ast", "dpi", "est"), "Timing",
                                  if_else(meastyp %in% c("gvi", "dvi"), "Phonation", "Articulation")
                                ))
                      )
gcat <- "Respiration"                                                # initialize the category variable
pcolor <- "green"                                                    # initialize the color variable
pd1$group <- factor(pd1$group, levels = c("PD", "RBD", "controls"))  # factor group to display in specified order

# create a function with the plot template

pt <- function (gcat, pcolor) {
  dcat <- pd1 %>% filter(spchcat == gcat)                            # create a faceted box plot
  ggplot(dcat) +      
             geom_boxplot(aes(group, amt, fill = examtyp), outlier.shape = NA) +
             facet_wrap(vars(mlabel), scales = "free") +      
             theme_minimal() +  
             scale_fill_grey(name = "Legend:", labels = c("Reading passage", "Monologue")) +
             labs(title = gcat, x = NULL, y = NULL) +
             theme(plot.title.position = "plot", plot.title = element_text(hjust = .5, vjust = 0, size = rel(2)),   # add and position title
                   legend.position = "bottom", legend.direction = "horizontal",                                     # put the legend on the bottom
                   plot.background = element_rect(color = pcolor, size = 6),                                        # put a rectangle around it
                   plot.margin = margin(.5, .5, .5, .5, "cm"),
                   strip.text.x = element_text(face = "bold", size = rel(1.5) ))                                    # bold the facet labels
}

# execute the template for each category

pt("Respiration", "dodgerblue")

pt("Phonation", "firebrick2")

pt("Timing", "seagreen3")

pt("Articulation", "yellow3")

I think these are a reasonably close approximation to the plots in the diagram. I could probably get closer using layers or different functions.

This project was a challenge, I did a lot of research to find commands and methods to get the results I wanted. I would find new ways to do things I’d already coded, so I re-factored my code quite a bit. I tried three different ways to create the correlation plots and settled on using ggcorr as the most legible for the large number variables. And I feel that dropping out the insignificant coefficients or highlighting the significant coefficients helps to focus on the areas of interest.

References

  1. Moya-Galé G, Levy ES (2018). Parkinson’s disease-associated dysarthria: prevalence, impact and management strategies. https://www.dovepress.com/parkinsons-disease-associated-dysarthria-prevalence-impact-and-managem-peer-reviewed-article-JPRLS

  2. Hlavnička, J., Čmejla, R., Tykalová, T. et al. (2017). Automated analysis of connected speech reveals early biomarkers of Parkinson’s disease in patients with rapid eye movement sleep behaviour disorder. https://www.nature.com/articles/s41598-017-00047-5

  3. Early biomarkers of parkinsons disease dataset. https://www.kaggle.com/ruslankl/early-biomarkers-of-parkinsons-disease

  4. Young-Onset Parkinson’s Disease. https://www.hopkinsmedicine.org/health/conditions-and-diseases/parkinsons-disease/youngonset-parkinsons-disease

  5. Michael J. Fox interview on NBC’s Today Sunday Sitdown, Nov. 22, 2020. https://www.today.com/health/michael-j-fox-recalls-wife-tracy-pollan-s-reaction-parkinson-t199866