Data 110 - Project 2

Automating Parkinson’s Disease diagnosis using Speech Recognition

Dysarthria is a speech disorder where the muscles used to produce speech are damaged, paralyzed, or weakened. Close to 90% of people with Parkinson’s Disease (PD) present with hypokinetic dysarthria, evidenced by reduced vocal loudness, monotone, reduced fundamental frequency range, consonant and vowel imprecision, breathiness and irregular pauses.¹ Evaluation of dysarthia in disorders such as PD has been limited to perceptual tests or user-controlled laboratory analysis based upon rather small samples of human vocalizations. The development of a fully-automated method to detect early, distinctive patterns of neuro-degeneration using only acoustic data from connected speech such as reading a short passage or monologue would have the potential to revolutionize the diagnostic process for diseases manifesting motor speech disorders.²

Early biomarkers of Parkinson’s Disease dataset

The dataset associated with the article “Automated analysis of connected speech reveals early biomarkers of Parkinson’s disease in patients with rapid eye movement sleep behaviour disorder”² is available as an excel file via the article on Nature.com. It’s also on Kaggle as a csv file.³

The data include 30 patients with early untreated Parkinson’s, 50 patients with REM sleep behavior disorder (RBD), which are at high risk developing Parkinson’s disease, and 50 healthy controls (HC). All patients were scored clinically by a neurologist with experience in movement disorders. All subjects were examined during a single session with a speech specialist. All subjects were recorded reading a standardized, phonetically-balanced text of 80 words and a monologue about their interests, job, family or current activities for approximately 90 seconds. Speech features were automatically analyzed by Jan Hlavnička et al. (2017).³

Both my father and my father-in-law had Parkinson’s disease, so I found this dataset and paper about developing automated diagnosis from speech recordings interesting. The diagrams in the article have plots that are simple and clear.

I’ll try to approximate a couple of the plots, but first I’ll get the data, clean it up and do a bit of analysis.

setwd("C:/Users/Owner/Desktop/My Documents/School/DataFiles/Parkinsons")

peb <- tibble(read_csv("parkinsonsEarlyBiomarkers.csv"))    # read Parkinsons Early Biomarkers dataset

spec(peb)                                                   # look at the peb tibble

## cols(
##   `Participant  code` = col_character(),
##   `Age  (years)` = col_double(),
##   Gender = col_character(),
##   `Positive  history  of  Parkinson  disease  in  family` = col_character(),
##   `Age  of  disease  onset  (years)` = col_character(),
##   `Duration  of  disease  from  first  symptoms  (years)` = col_character(),
##   `Antidepressant  therapy` = col_character(),
##   `Antiparkinsonian  medication` = col_character(),
##   `Antipsychotic  medication` = col_character(),
##   `Benzodiazepine  medication` = col_character(),
##   `Levodopa  equivalent  (mg/day)` = col_double(),
##   `Clonazepam  (mg/day)` = col_double(),
##   `Overview  of  motor  examination:  Hoehn  &  Yahr  scale  (-)` = col_character(),
##   `Overview  of  motor  examination:  UPDRS  III  total  (-)` = col_character(),
##   `18.  Speech` = col_character(),
##   `19.  Facial  Expression` = col_character(),
##   `20.  Tremor  at  Rest  -  head` = col_character(),
##   `20.  Tremor  at  Rest  -  RUE` = col_character(),
##   `20.  Tremor  at  Rest  -  LUE` = col_character(),
##   `20.  Tremor  at  Rest  -  RLE` = col_character(),
##   `20.  Tremor  at  Rest  -  LLE` = col_character(),
##   `21.  Action  or  Postural  Tremor  -  RUE` = col_character(),
##   `21.  Action  or  Postural  Tremor  -  LUE` = col_character(),
##   `22.  Rigidity  -  neck` = col_character(),
##   `22.  Rigidity  -  RUE` = col_character(),
##   `22.  Rigidity  -  LUE` = col_character(),
##   `22.  Rigidity  -  RLE` = col_character(),
##   `22.  Rigidity  -  LLE` = col_character(),
##   `23.Finger  Taps  -  RUE` = col_character(),
##   `23.Finger  Taps  -  LUE` = col_character(),
##   `24.  Hand  Movements  -  RUE` = col_character(),
##   `24.  Hand  Movements  -  LUE` = col_character(),
##   `25.  Rapid  Alternating  Movements  -  RUE` = col_character(),
##   `25.  Rapid  Alternating  Movements  -  LUE` = col_character(),
##   `26.  Leg  Agility  -  RLE` = col_character(),
##   `26.  Leg  Agility  -  LLE` = col_character(),
##   `27.  Arising  from  Chair` = col_character(),
##   `28.  Posture` = col_character(),
##   `29.  Gait` = col_character(),
##   `30.  Postural  Stability` = col_character(),
##   `31.  Body  Bradykinesia  and  Hypokinesia` = col_character(),
##   `Entropy  of  speech  timing  (-)` = col_double(),
##   `Rate  of  speech  timing  (-/min)` = col_double(),
##   `Acceleration  of  speech  timing  (-/min2)` = col_double(),
##   `Duration  of  pause  intervals  (ms)` = col_double(),
##   `Duration  of  voiced  intervals  (ms)` = col_double(),
##   `Gaping  in-between  voiced  intervals  (-/min)` = col_double(),
##   `Duration  of  unvoiced  stops  (ms)` = col_double(),
##   `Decay  of  unvoiced  fricatives  (‰/min)` = col_double(),
##   `Relative  loudness  of  respiration  (dB)` = col_double(),
##   `Pause  intervals  per  respiration  (-)` = col_double(),
##   `Rate  of  speech  respiration  (-/min)` = col_double(),
##   `Latency  of  respiratory  exchange  (ms)` = col_double(),
##   `Entropy  of  speech  timing  (-)_1` = col_double(),
##   `Rate  of  speech  timing  (-/min)_1` = col_double(),
##   `Acceleration  of  speech  timing  (-/min2)_1` = col_double(),
##   `Duration  of  pause  intervals  (ms)_1` = col_double(),
##   `Duration  of  voiced  intervals  (ms)_1` = col_double(),
##   `Gaping  in-between  voiced  intervals  (-/min)_1` = col_double(),
##   `Duration  of  unvoiced  stops  (ms)_1` = col_double(),
##   `Decay  of  unvoiced  fricatives  (‰/min)_1` = col_double(),
##   `Relative  loudness  of  respiration  (dB)_1` = col_double(),
##   `Pause  intervals  per  respiration  (-)_1` = col_double(),
##   `Rate  of  speech  respiration  (-/min)_1` = col_double(),
##   `Latency  of  respiratory  exchange  (ms)_1` = col_double()
## )

Clean dataset

For my analysis of the data, I’ll deselect the medication and the 26 motor skill scores (numbered 19. through 31. in the field labels). Then I’ll give the variables concise names.

units <- names(peb[42:53]) %>% str_extract("(\\(.*\\))")                              # pull the exam measurement units off of the names
names(peb) <-  gsub("(of) | (per) | (in-between) | (\\(.*\\))"," ", names(peb))       # clean out extraneous words   
names(peb) <-  str_to_title(names(peb))                                               # capitalize each word
names(peb) <- gsub(" ","",names(peb))                                                 # eliminate spaces

names(peb)                                                                            # look at the names

##  [1] "ParticipantCode"                         
##  [2] "Age"                                     
##  [3] "Gender"                                  
##  [4] "PositiveHistoryParkinsonDiseaseInFamily" 
##  [5] "AgeDiseaseOnset"                         
##  [6] "DurationDiseaseFromFirstSymptoms"        
##  [7] "AntidepressantTherapy"                   
##  [8] "AntiparkinsonianMedication"              
##  [9] "AntipsychoticMedication"                 
## [10] "BenzodiazepineMedication"                
## [11] "LevodopaEquivalent"                      
## [12] "Clonazepam"                              
## [13] "OverviewMotorExamination:Hoehn&YahrScale"
## [14] "OverviewMotorExamination:UpdrsIiiTotal"  
## [15] "18.Speech"                               
## [16] "19.FacialExpression"                     
## [17] "20.TremorAtRest-Head"                    
## [18] "20.TremorAtRest-Rue"                     
## [19] "20.TremorAtRest-Lue"                     
## [20] "20.TremorAtRest-Rle"                     
## [21] "20.TremorAtRest-Lle"                     
## [22] "21.ActionOrPosturalTremor-Rue"           
## [23] "21.ActionOrPosturalTremor-Lue"           
## [24] "22.Rigidity-Neck"                        
## [25] "22.Rigidity-Rue"                         
## [26] "22.Rigidity-Lue"                         
## [27] "22.Rigidity-Rle"                         
## [28] "22.Rigidity-Lle"                         
## [29] "23.FingerTaps-Rue"                       
## [30] "23.FingerTaps-Lue"                       
## [31] "24.HandMovements-Rue"                    
## [32] "24.HandMovements-Lue"                    
## [33] "25.RapidAlternatingMovements-Rue"        
## [34] "25.RapidAlternatingMovements-Lue"        
## [35] "26.LegAgility-Rle"                       
## [36] "26.LegAgility-Lle"                       
## [37] "27.ArisingFromChair"                     
## [38] "28.Posture"                              
## [39] "29.Gait"                                 
## [40] "30.PosturalStability"                    
## [41] "31.BodyBradykinesiaAndHypokinesia"       
## [42] "EntropySpeechTiming"                     
## [43] "RateSpeechTiming"                        
## [44] "AccelerationSpeechTiming"                
## [45] "DurationPauseIntervals"                  
## [46] "DurationVoicedIntervals"                 
## [47] "GapingVoicedIntervals"                   
## [48] "DurationUnvoicedStops"                   
## [49] "DecayUnvoicedFricatives"                 
## [50] "RelativeLoudnessRespiration"             
## [51] "PauseIntervalsRespiration"               
## [52] "RateSpeechRespiration"                   
## [53] "LatencyRespiratoryExchange"              
## [54] "EntropySpeechTiming_1"                   
## [55] "RateSpeechTiming_1"                      
## [56] "AccelerationSpeechTiming_1"              
## [57] "DurationPauseIntervals_1"                
## [58] "DurationVoicedIntervals_1"               
## [59] "GapingVoicedIntervals_1"                 
## [60] "DurationUnvoicedStops_1"                 
## [61] "DecayUnvoicedFricatives_1"               
## [62] "RelativeLoudnessRespiration_1"           
## [63] "PauseIntervalsRespiration_1"             
## [64] "RateSpeechRespiration_1"                 
## [65] "LatencyRespiratoryExchange_1"

Create a set of acronym labels for the exam measurements.

Label	Exam measurement
RST	rate of speech timing
AST	acceleration of speech timing
DPI	duration of pause intervals
EST	entropy of speech timing
DUS	duration of unvoiced stops
DUF	decay of unvoiced fricatives
DVI	duration of voiced intervals
GVI	gaping in-between voiced intervals
RSR	rate of speech respiration
PIR	pause intervals per respiration
RLR	relative loudness of respiration
LRE	latency of respiratory exchange

tLabels <- gsub("[a-z]", "", names(peb[42:53]))             # grab the capital letters to create acronyms

for (i in 1:12) {
 tLabels[i] <- str_glue(tLabels[i], " ", units[i])          # concatenate the acronyms with units   
}
 
tLabels                                                     # look at the labels

##  [1] "EST (-)"      "RST (-/min)"  "AST (-/min2)" "DPI (ms)"     "DVI (ms)"    
##  [6] "GVI (-/min)"  "DUS (ms)"     "DUF (‰/min)"  "RLR (dB)"     "PIR (-)"     
## [11] "RSR (-/min)"  "LRE (ms)"

pdata <- select(peb, id = ParticipantCode,    # create a tibble with the needed variables
         age = Age,
         dysarthia = `18.Speech`,
         est_r = EntropySpeechTiming,
         rst_r = RateSpeechTiming,
         ast_r = AccelerationSpeechTiming,
         dpi_r = DurationPauseIntervals,
         dvi_r = DurationVoicedIntervals,
         gvi_r = GapingVoicedIntervals,
         dus_r = DurationUnvoicedStops,
         duf_r = DecayUnvoicedFricatives,
         rlr_r = RelativeLoudnessRespiration,
         pir_r = PauseIntervalsRespiration,
         rsr_r = RateSpeechRespiration,
         lre_r = LatencyRespiratoryExchange,
         est_m = EntropySpeechTiming_1,
         rst_m = RateSpeechTiming_1,
         ast_m = AccelerationSpeechTiming_1,
         dpi_m = DurationPauseIntervals_1,
         dvi_m = DurationVoicedIntervals_1,
         gvi_m = GapingVoicedIntervals_1,
         dus_m = DurationUnvoicedStops_1,
         duf_m = DecayUnvoicedFricatives_1,
         rlr_m = RelativeLoudnessRespiration_1,
         pir_m = PauseIntervalsRespiration_1,
         rsr_m = RateSpeechRespiration_1,
         lre_m = LatencyRespiratoryExchange_1
         )

pdata$group <- str_extract(peb$ParticipantCode, "(PD|RBD|HC)")          # pull out the group categories from the ids

pdata$group <- pdata$group %>% str_replace("HC", "controls") %>%        # factor group to order and label like the article diagrams
               factor(levels = c("PD", "RBD", "controls"))

# assuming control group does not have dysarthia, change hyphen to 0 to prevent NA

pdata$dysarthia <- pdata$dysarthia %>% str_replace('-','0') %>% as.numeric() 

head(pdata) %>% kable() %>% kable_minimal()

id	age	dysarthia	est_r	rst_r	ast_r	dpi_r	dvi_r	gvi_r	dus_r	duf_r	rlr_r	pir_r	rsr_r	lre_r	est_m	rst_m	ast_m	dpi_m	dvi_m	gvi_m	dus_m	duf_m	rlr_m	pir_m	rsr_m	lre_m	group
PD01	58	0	1.564	354	6.05	146	264	58.65	31.38	-2.101	-22.47	4.50	21.14	167	1.564	333	-2.82	158	318	49.01	22.37	0.588	-19.77	6	13.81	127	PD
PD02	68	1	1.564	340	27.52	173	253	48.26	22.38	-1.745	-24.59	7.00	15.28	163	1.569	285	8.20	295	264	40.56	26.88	-0.825	-23.26	4	21.77	313	PD
PD03	68	0	1.550	211	11.97	377	322	47.54	38.12	2.657	-16.89	3.00	20.76	372	1.550	247	4.71	280	317	48.97	22.37	-0.955	-13.29	4	22.52	201	PD
PD04	75	0	1.519	140	-2.49	360	663	13.72	44.88	-0.934	-25.54	1.00	18.71	119	1.539	112	-9.09	397	800	18.69	49.37	0.791	-25.08	2	14.37	151	PD
PD05	61	1	1.543	269	6.72	211	328	42.90	47.12	-0.973	-22.61	5.00	16.26	78	1.560	230	11.77	206	480	33.54	26.87	0.075	-22.32	5	14.61	151	PD
PD06	58	1	1.553	317	24.19	186	286	43.83	33.63	0.921	-25.00	2.75	27.07	124	1.518	181	13.38	611	398	18.18	49.37	1.488	-25.08	2	18.21	593	PD

Explore the data

What percentage of each group is tagged as having dysarthia by perceptual tests?

dpct <- group_by(pdata, group) %>%                                                                  # create a table of percentages by group
  summarize(num = n(), dysarthia = sum(dysarthia), dysarthiaPct = (sum(dysarthia) / n()) * 100)

kable(dpct, caption = "Percentage of participants tagged with dysarthia", digits = 2,               # print the table
      col.names = c("Group", "#Participants", "Dysarthia", "%")) %>% kable_material()

Percentage of participants tagged with dysarthia
Group	#Participants	Dysarthia	%
PD	30	13	43.33
RBD	50	3	6.00
controls	50	0	0.00

Only 6% of the RBD group are tagged with dysarthia by the perceptual evaluation, while 43% of the group with untreated Parkinson’s disease are tagged.

Age itself can have an impact on speech and vocal quality. I’ll look at the distribution of participant ages to see how they compare between the test groups.

# create a quick plot of the participant ages in each group

qplot(age, data = pdata, fill=group, geom = "dotplot",  facets = "group")

It’s not common to see Parkinson’s disease in people younger than 50, but for a small subset the disease strikes early. People are diagnosed with Parkinson’s at an average age of 60. My father was diagnosed at 52, and my father-in-law in his seventies. Anything younger than 50 is considered young-onset Parkinson’s, or YOPD. Rarely, PD may be diagnosed in people younger than 40 — about 2 percent of the 1 million people with Parkinson’s were diagnosed earlier than age 40.⁴ One of the most recognizable is Michael J. Fox, who was diagnosed at age 29.⁵

The samples show three YOPD subjects and a reasonable balance of subjects in the three groups across the age ranges.

Correlation

Now I’ll look at the correlations of the exam measurements for the two test groups against the controls. First I’ll compare the Parkinson’s group with the control group.

pacman::p_load(GGally)

pcor <- pdata %>% filter(group != "RBD") %>% select(est_r:lre_m)    # filter for the PD and controls groups                                                

# create a correlation plot

pdcorr <- ggcorr(pcor, nbreaks = 5, palette = "RdGy",                                   # set the number of levels and the colors       
                 hjust = .65, size = 4,                                                 # nudge and format the axis labels
                 label = TRUE, label_size = 4, label_color = "white",                   # format the coefficient 
                 name = "Significance") +                                               # legend title 
                 labs(title = "PD / Control Correlation",                               # plot title
                 caption = "source: https://www.nature.com/articles/s41598-017-00047-5") +
          theme(plot.title.position = "plot", 
                plot.title = element_text(hjust = .5, vjust = 0, size = rel(2)),        # position and format the title and caption
                plot.caption.position = "plot")

pdcorr

Next I’ll look at the correlations for the REM sleep disorder group and the control group.

pcor <- pdata %>% filter(group != "PD") %>% select(est_r:lre_m)                   # filter for the RBD and controls groups

rbdcorr <- ggcorr(pcor, nbreaks = 5, palette = "RdGy",                                     # set the number of levels and the colors       
                 hjust = .65, size = 4,                                                 # nudge and format the axis labels
                 label = TRUE, label_size = 4, label_color = "white",                   # format the coefficient 
                 name = "Significance") +                                               # legend title 
                 labs(title = "RBD / Control Correlation",                               # plot title
                 caption = "source: https://www.nature.com/articles/s41598-017-00047-5") +
          theme(plot.title.position = "plot", 
                plot.title = element_text(hjust = .5, vjust = 0, size = rel(2)),        # position and format the title and caption
                plot.caption.position = "plot")
rbdcorr

# create another version of the plot with just the significant coefficients highlighted

ggcorr(pcor, geom = "blank", label = TRUE, hjust = 0.65, size = 4) +
  geom_point(size = 10, aes(color = coefficient > 0, alpha = abs(coefficient) > 0.5)) +
  scale_alpha_manual(values = c("TRUE" = 0.25, "FALSE" = 0)) +
  guides(alpha = FALSE) +
  scale_color_discrete(name = "Significance", labels = c("<= -0.05", ">= 0.5")) +
  labs(title = "RBD / Control Correlation",
       subtitle = "Another view of the correlation table",
       caption = "source: https://www.nature.com/articles/s41598-017-00047-5") +
  theme(plot.title.position = "plot", plot.title = element_text(hjust = .5, vjust = 0, size = rel(2)),
        plot.subtitle = element_text(hjust = .5), plot.caption.position = "plot")

The correlations for PD and RBD are similar. There are a few significant coefficients that are different, e.g. dpi_m to lre_m and rst_r to dvi_m. Looking at tables of the significant correlations shows that the RBD group has five coefficients in common with the PD group.

# create and print tables of the coefficients for the PD and controls groups

pdcorr$data %>% filter(abs(coefficient) > 0.7 ) %>%                # filter for significant values
 select(1:3) %>%                                                   # select the columns to print
 arrange(coefficient) %>% 
  kable(caption = "Significant correlation of Parkinson's group to control group", digits = 2,   # format and print the table
      col.names = c("PD", "Control", "Coefficient")) %>% kable_minimal()                         # use a kable theme

Significant correlation of Parkinson’s group to control group
PD	Control	Coefficient
dvi_r	rst_r	-0.85
dpi_r	rst_r	-0.80
dvi_m	rst_m	-0.79
dpi_m	rst_m	-0.77
gvi_m	est_m	0.72
rlr_m	rlr_r	0.74
dvi_m	dvi_r	0.84

# do the same thing for the RBD to controls groups

rbdcorr$data %>% filter(abs(coefficient) > 0.7 ) %>%  
 select(1:3) %>% 
 arrange(coefficient) %>% 
  kable(caption = "Significant correlation of RBD group to control group", digits = 2,
      col.names = c("RBD", "Control", "Coefficient")) %>% kable_minimal()

Significant correlation of RBD group to control group
RBD	Control	Coefficient
dvi_r	rst_r	-0.88
dvi_m	rst_m	-0.78
dpi_m	rst_m	-0.78
dpi_r	rst_r	-0.75
gvi_m	est_m	0.73

Approximate the diagrams from the paper

A blinded experiment was executed by quadratic discriminant function analysis that showed the most distinctive disturbed speech patterns between the PD and control groups were found for a combination of

RST in reading passage - rst_r
DVI in monologue - dvi_m
DPI in reading passage - dpi_r
DPI in monologue - dpi_m
DUS in reading passage - dus_r
DUS in monologue - dus_m
PIR in monologue - pir_m

Figure 5 from the paper shows plots of two pairings of exam measurements:

Figure 5. Hlavnička, J., Čmejla, R., Tykalová, T. et al. Automated analysis of connected speech reveals early biomarkers of Parkinson’s disease in patients with rapid eye movement sleep behaviour disorder.²

The solid gray line represents the border of discrimination between speech negatives (RBD subjects with asymptomatic speech performance) and speech positives (subjects with speech performance closer to PD speakers). Speech negatives and speech positives were determined in the Index Test during the analysis of the experiment. That data isn’t available and I need more experience with statistics to reproduce the Index Test. Such as:

quadratic discriminant function analysis
multivariate normal distribution with
Expectation Maximization algorithm
Bayes discriminant

So I’ll create the plots with a regression line.

cols <- c('red', 'powderblue', 'darkgreen')       # colors for the points
shps <- c(20, 16, 4)                              # shapes for the points

g <- guide_legend("Group")                        # initalize the legend 

# Graph A - x = RST reading task, y = DPI monologue task 

gA <- pdata %>% ggplot() +
             geom_smooth(aes(rst_r, dpi_m), method = lm, formula=y~x, se = FALSE, color = "grey", size = 2) +   # regression line
             geom_point(aes(rst_r, dpi_m, color = group, shape = group, size = group)) +                        # point plot on top
             scale_fill_discrete(name = "Group") +
             coord_cartesian(xlim = c(100, 500), ylim = c(100, 700)) +                                          # expand the axes scales
             labs(title = "Graph A - Acoustic features of distinctive speech patterns",                         # titles and labels
                  x = "Reading RST (-/min)",  
                  y = "Monologue DPI (ms)") +
             theme_light() +
             scale_shape_manual(values = shps, guide = g) +                                                     # use the specified shapes
             scale_color_manual(values=cols, guide = g) +                                                       # colors for each shape
             scale_size_discrete(range = c(3,5), guide = g)                                                     # sizes for each shape 
gA

gB <- pdata %>% ggplot() +
             geom_smooth(aes(dpi_r, dus_m), method = lm, formula=y~x, se = FALSE, color = "grey", size = 2) +   # regression line
             geom_point(aes(dpi_r, dus_m, color = group, shape = group, size = group)) +                        # point plot on top
             scale_fill_discrete(name = "Group") +                                                              # name the legend 
             labs(title = "Graph B - Acoustic features of distinctive speech patterns",                         # add titles and labels 
                  x = "Reading DPI (ms)",
                  y = "Monologue DUS (ms)") +
             theme_light() +
             scale_shape_manual(values = shps, guide = g) +                                                     # use the specified shapes
             scale_color_manual(values=cols, guide = g) +                                                       # colors for each shape
             scale_size_discrete(range = c(3,5), guide = g)                                                     # sizes for each shape  
gB

Here is an interactive version of Graph A built using plot_ly:

shps <- c("circle", "circle", "x")                    # shapes for the scatter plot markers
cols <- c('red', 'lightblue', 'darkgreen')            # colors for the scatter plot markers

szs <-  pdata$group %>% str_replace("PD", "7") %>%    # sizes for the scatter plot markers  
       str_replace("RBD", "20") %>%
       str_replace("controls", "14") %>% as.numeric() 


fit <- lm(dpi_m ~ rst_r, data = pdata)                # formula for the regression line

p2 <- plot_ly() %>%                                   # start an interactive plot
           add_trace(data = pdata,                    # add a scatter plot layer
              x=~rst_r,
              y=~dpi_m,
              type="scatter",
              mode= "markers",
              marker = list(size = szs),              # specify the marker attributes
              color = pdata$group,
              colors = cols,
              symbol = pdata$group,
              symbols = shps,
              hoverinfo= "text",                      # create the popup text
              text = paste("Participant: ", pdata$id,
                           "<br>",
                           "Group: ", pdata$group,
                           "<br>",
                           "Rate of speech intervals: ", pdata$rst_r,
                           "<br>",
                           "Duration of pause intervals: ", pdata$dpi_m
                           )
                            ) %>%
               add_trace(data = pdata, x = ~rst_r, y = fitted(fit), 
                        type = "scatter",                              # add the regression line
                       mode = "lines",                                                   
                       line = list(color = 'grey', width = 4), 
                       opacity = 0.5, 
                       showlegend = FALSE,                             # don't show line on legend
                       inherit = FALSE)                                # don't inherit popup or aesthetics from scatter plot layer



p2 <- p2 %>% layout(title = "Graph A - Acoustic features of distinctive speech patterns",      # add titles
                    xaxis = list(title = "Reading RST (-/min)"),
                    yaxis = list(title = "Monologue DPI (ms)" ),
                    legend = list(title = list(text="Group")),
                    margin = list(t = 90))
p2

Replicate accoustic speech analysis diagram

The following figure from the paper shows the results of the acoustic speech analysis. The 12 exam measurements are grouped into four categories and plotted to show the ranges.

Figure 3. Hlavnička, J., Čmejla, R., Tykalová, T. et al. Automated analysis of connected speech reveals early biomarkers of Parkinson’s disease in patients with rapid eye movement sleep behaviour disorder.²

To replicate these plots I’ll rearrange the data to combine the measurements for the two exam recordings for each participant into a single column for variable type and a column for it’s associated value.

pd1 <- pdata %>% pivot_longer(est_r:lre_m,
                     names_to = c("meastyp", "examtyp"),
                     names_pattern = "(.*)_(.)",
                     values_to = "amt"
                     )

getMlabel <- sort(tLabels)                           # create a lookup table of the labels

nmes <- unique(pd1$meastyp) %>% sort                 # create the lookup names
names(getMlabel) <- sort(nmes)

pd1$mlabel <- getMlabel[pd1$meastyp]                 # add labels to pd1   

pd1$meastyp <-factor(pd1$meastyp, levels=c("est",    # order the types as they are in the dataset
                                           "rst",
                                           "ast",
                                           "dpi",
                                           "dvi",
                                           "gvi",
                                           "dus",
                                           "duf",
                                           "rlr",
                                           "pir",
                                           "rsr",
                                           "lre")
                     )

Create a faceted boxplot to look at the spread for each of the variables within each group.

 ggplot(pd1, aes(meastyp, amt, fill = meastyp)) +                   # create a faceted boxplot
             geom_boxplot() + 
             coord_cartesian(ylim = c(-35, 615)) +                  # expand the y-axis scale
             facet_grid(cols = vars(group), scales = "free") +      # facet by the group
             theme_bw() + scale_x_discrete(breaks=NULL) +           # don't show labels on the x-axis
             scale_fill_discrete(name = "Speech measurement", 
                                 labels = tLabels) +
             labs(x = NULL, y = NULL)

The units as shown in the legend for each type are quite different. High outliers for PD subjects are evident, speech rate and voiced duration are similar for PD and RBD but differ from the control group. Relative loudness of respiration has all negative values.

I’ll split the data by the four categories in figure 3, and plot each category.

# Add speech categories
pd1 <- pd1 %>% mutate(spchcat = if_else(meastyp %in% c("rlr", "lre", "pir", "rsr"), "Respiration",
                                  if_else(meastyp %in% c("rst", "ast", "dpi", "est"), "Timing",
                                  if_else(meastyp %in% c("gvi", "dvi"), "Phonation", "Articulation")
                                ))
                      )

gcat <- "Respiration"                                                # initialize the category variable
pcolor <- "green"                                                    # initialize the color variable
pd1$group <- factor(pd1$group, levels = c("PD", "RBD", "controls"))  # factor group to display in specified order

# create a function with the plot template

pt <- function (gcat, pcolor) {
  dcat <- pd1 %>% filter(spchcat == gcat)                            # create a faceted box plot
  ggplot(dcat) +      
             geom_boxplot(aes(group, amt, fill = examtyp), outlier.shape = NA) +
             facet_wrap(vars(mlabel), scales = "free") +      
             theme_minimal() +  
             scale_fill_grey(name = "Legend:", labels = c("Reading passage", "Monologue")) +
             labs(title = gcat, x = NULL, y = NULL) +
             theme(plot.title.position = "plot", plot.title = element_text(hjust = .5, vjust = 0, size = rel(2)),   # add and position title
                   legend.position = "bottom", legend.direction = "horizontal",                                     # put the legend on the bottom
                   plot.background = element_rect(color = pcolor, size = 6),                                        # put a rectangle around it
                   plot.margin = margin(.5, .5, .5, .5, "cm"),
                   strip.text.x = element_text(face = "bold", size = rel(1.5) ))                                    # bold the facet labels
}

# execute the template for each category

pt("Respiration", "dodgerblue")

pt("Phonation", "firebrick2")

pt("Timing", "seagreen3")

pt("Articulation", "yellow3")

I think these are a reasonably close approximation to the plots in the diagram. I could probably get closer using layers or different functions.

This project was a challenge, I did a lot of research to find commands and methods to get the results I wanted. I would find new ways to do things I’d already coded, so I re-factored my code quite a bit. I tried three different ways to create the correlation plots and settled on using ggcorr as the most legible for the large number variables. And I feel that dropping out the insignificant coefficients or highlighting the significant coefficients helps to focus on the areas of interest.

References

Moya-Galé G, Levy ES (2018). Parkinson’s disease-associated dysarthria: prevalence, impact and management strategies. https://www.dovepress.com/parkinsons-disease-associated-dysarthria-prevalence-impact-and-managem-peer-reviewed-article-JPRLS
Hlavnička, J., Čmejla, R., Tykalová, T. et al. (2017). Automated analysis of connected speech reveals early biomarkers of Parkinson’s disease in patients with rapid eye movement sleep behaviour disorder. https://www.nature.com/articles/s41598-017-00047-5
Early biomarkers of parkinsons disease dataset. https://www.kaggle.com/ruslankl/early-biomarkers-of-parkinsons-disease
Young-Onset Parkinson’s Disease. https://www.hopkinsmedicine.org/health/conditions-and-diseases/parkinsons-disease/youngonset-parkinsons-disease
Michael J. Fox interview on NBC’s Today Sunday Sitdown, Nov. 22, 2020. https://www.today.com/health/michael-j-fox-recalls-wife-tracy-pollan-s-reaction-parkinson-t199866

Parkinsons Speech Analysis

Marilyn Macy

4/16/2021