#Student Confirmation

#1. I hereby confirm that I have only used generative Artificial Intelligence (AI), such as ChatGPT, according to the rules set out in the instructions of this assessment. I acknowledge that using generative AI in a manner different to that stipulated will constitute an examination offence and will be formally investigated as such. 

#2. I hereby confirm that I have read the generative AI tools guidance and the academic misconduct policy (specifically, point 14i) and have used AI appropriately according to these guidelines and the rules stipulated by this assessment. 

Aims

To explore the suitability of six dysphonia measurements for telemonitoring of Parkinson’s disease: NHR (Noise-to-Harmonics Ratio), HNR (Harmonics-to-Noise Ratio), RPDE (Recurrence Period Density Entropy), spread2, D2 and PPE (Pitch Period Entropy).

Introduction

Parkinson’s is a disease that affects over 1 million people in North America alone and has no known cure (Little et al., 2008). However, it can be treated, particularly in its earlier stages. This makes telemonitoring of dysphonia critical, and the investigation of new telemonitoring methods imperative for achieving the best results. In this study we look at six of the measurements (NHR, HNR, RPDE, spread2, D2 and PPE).

Data analysis and results

How well do the variables associate with each other?

parkinsons<-read.csv("parkinsons.csv", row.names = 1)
#assess visual correlation between each variable
plot(parkinsons)

  • The scatterplot matrix shows at least some correlation between every pair of measurements, which suggests that all are viable options. However, some plots show non-linear patterns (such as the NHR-HNR plots) as opposed to linear ones (such as the HNR-PPE plots), and some show negative correlations as opposed to positive ones.
  • The difference between negative and positive correlations is of little interest because it reflects trends that we would expect, e.g. NHR and HNR are inverses of each other and therefore necessarily have a negative correlation.
  • More interesting is the difference between the linear and non-linear plots. A measurement with a non-linear relationship may be less useful for monitoring a patient because, once it reaches the plateau of the curve, it barely changes as the other measurement changes. A linear relationship would be better suited to covering the entire range of results a person with Parkinson’s (PWP) may have, and therefore to monitoring them to a greater extent.
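One way to put a number on the linear versus non-linear distinction is to compare the Pearson correlation (which measures linear association) with the Spearman rank correlation (which measures any monotonic association). The following is a minimal sketch on synthetic data, not on parkinsons.csv; the variables x and y are hypothetical and simply imitate a curve that rises and then plateaus.

```r
# Hypothetical plateau-shaped relationship (illustration only, not the study data)
x <- seq(0.01, 1, length.out = 100)
y <- 1 - exp(-8 * x)   # rises quickly, then flattens out

# Pearson measures linear association; Spearman measures monotonic association
pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

# A Spearman value clearly above the Pearson value hints at a curved but monotonic pattern
print(c(pearson = pearson, spearman = spearman))
```

Applied to the real data, the same comparison could be made with cor(parkinsons, method = "spearman"), assuming all columns are numeric as they are in the output of cor(parkinsons).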

What are the correlations between each variable and which variables have the strongest correlation?

cor(parkinsons)
##                NHR        HNR       RPDE    spread2         D2        PPE
## NHR      1.0000000 -0.7140724  0.3708905  0.3180990  0.4709488  0.5525913
## HNR     -0.7140724  1.0000000 -0.5987363 -0.4315637 -0.6014010 -0.6928759
## RPDE     0.3708905 -0.5987363  1.0000000  0.4799045  0.2369314  0.5458857
## spread2  0.3180990 -0.4315637  0.4799045  1.0000000  0.5235317  0.6447110
## D2       0.4709488 -0.6014010  0.2369314  0.5235317  1.0000000  0.4805845
## PPE      0.5525913 -0.6928759  0.5458857  0.6447110  0.4805845  1.0000000
  • The strongest correlation is between NHR and HNR, with a correlation coefficient of -0.71, but it is a negative and non-linear relationship. NHR’s correlation coefficients with the remaining variables are relatively weak (0.37, 0.32, 0.47 and 0.55), with only one exceeding 0.5, suggesting that NHR is not strongly correlated with the other variables.
  • HNR’s correlations with the remaining variables are all negative and moderate, ranging in magnitude from 0.43 to 0.69, the strongest being with PPE (-0.69). RPDE and D2 have low to moderate correlations with all other variables: in magnitude, 0.37, 0.60, 0.48, 0.24 and 0.55 for RPDE and 0.47, 0.60, 0.24, 0.52 and 0.48 for D2.
  • PPE has the highest correlations overall, with magnitudes of 0.55 (NHR), 0.69 (HNR), 0.55 (RPDE), 0.64 (spread2) and 0.48 (D2).
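The strongest pair and the ranking for PPE can also be extracted programmatically instead of being read off the table. The sketch below re-enters the correlation matrix printed above by hand (so that it runs without the CSV file); with the data loaded, cm could simply be cor(parkinsons). The names cm, off, strongest and ppe_rank are my own.

```r
vars <- c("NHR", "HNR", "RPDE", "spread2", "D2", "PPE")

# Correlation matrix copied from the cor(parkinsons) output above
cm <- matrix(c(
   1.0000000, -0.7140724,  0.3708905,  0.3180990,  0.4709488,  0.5525913,
  -0.7140724,  1.0000000, -0.5987363, -0.4315637, -0.6014010, -0.6928759,
   0.3708905, -0.5987363,  1.0000000,  0.4799045,  0.2369314,  0.5458857,
   0.3180990, -0.4315637,  0.4799045,  1.0000000,  0.5235317,  0.6447110,
   0.4709488, -0.6014010,  0.2369314,  0.5235317,  1.0000000,  0.4805845,
   0.5525913, -0.6928759,  0.5458857,  0.6447110,  0.4805845,  1.0000000
), nrow = 6, byrow = TRUE, dimnames = list(vars, vars))

# Ignore the diagonal (each variable with itself) and find the largest |r|
off <- cm
diag(off) <- NA
idx <- which(abs(off) == max(abs(off), na.rm = TRUE), arr.ind = TRUE)[1, ]
strongest <- c(vars[idx["row"]], vars[idx["col"]])
print(strongest)

# Rank the other variables by the strength of their correlation with PPE
ppe_rank <- sort(abs(cm["PPE", vars != "PPE"]), decreasing = TRUE)
print(round(ppe_rank, 2))
```

Working with absolute values means that a strong negative correlation (such as HNR with PPE) is ranked by its strength rather than its sign.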

How well is PPE explained by each variable?

#plot each plot for a visual representation with line of best fit
plot(PPE~NHR ,data=parkinsons)
fit.NHR<-lm(PPE~NHR, data=parkinsons)
abline(fit.NHR, col = "red")

plot(PPE~HNR ,data=parkinsons)
fit.HNR<-lm(PPE~HNR, data=parkinsons)
abline(fit.HNR, col = "red")

plot(PPE~RPDE ,data=parkinsons)
fit.RPDE <- lm(PPE~RPDE, data=parkinsons)
abline(fit.RPDE, col = "red")

plot(PPE~spread2 ,data=parkinsons)
fit.spread2 <- lm(PPE~spread2, data=parkinsons)
abline(fit.spread2, col = "red")

plot(PPE~D2 ,data=parkinsons)
fit.D2<-lm(PPE~D2, data=parkinsons)
abline(fit.D2, col = "red")

  • Visually, PPE seems to be relatively well explained by HNR, RPDE, spread2 and D2, with the only obvious misfit being NHR. To determine which variable best explains PPE, I should compare r-squared values.

Use linear regression, and report the slope, intercept and r-squared values and plot residuals to assess the goodness of fit for each linear regression model.

#finding r, slope, intercept and r-squared for NHR
fit.NHR<-lm(PPE~NHR, data=parkinsons)
plot(fit.NHR$residuals ~ fit.NHR$fitted.values)
abline(h=0, col = "red")

summary(fit.NHR)
## 
## Call:
## lm(formula = PPE ~ NHR, data = parkinsons)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.19286 -0.05393 -0.00577  0.04402  0.21888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.175938   0.006335  27.774   <2e-16 ***
## NHR         1.232090   0.133764   9.211   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0753 on 193 degrees of freedom
## Multiple R-squared:  0.3054, Adjusted R-squared:  0.3018 
## F-statistic: 84.84 on 1 and 193 DF,  p-value: < 2.2e-16
  • slope: 1.232090
  • intercept: 0.175938
  • adjusted r-squared value: 0.3018 (multiple r-squared: 0.3054)
  • median residual: -0.00577
#finding r, slope, intercept and r-squared for HNR
plot(fit.HNR$residuals ~ fit.HNR$fitted.values)
abline(h=0, col = "red")

summary(fit.HNR)
## 
## Call:
## lm(formula = PPE ~ HNR, data = parkinsons)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.146053 -0.046084 -0.006357  0.038739  0.202933 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.515333   0.023596   21.84   <2e-16 ***
## HNR         -0.014109   0.001057  -13.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06515 on 193 degrees of freedom
## Multiple R-squared:  0.4801, Adjusted R-squared:  0.4774 
## F-statistic: 178.2 on 1 and 193 DF,  p-value: < 2.2e-16
  • slope: -0.014109
  • intercept: 0.515333
  • adjusted r-squared value: 0.4774 (multiple r-squared: 0.4801)
  • median residual: -0.006357
#finding r, slope, intercept and r-squared for RPDE
plot(fit.RPDE$residuals ~ fit.RPDE$fitted.values)
abline(h=0, col = "red")

summary(fit.RPDE)
## 
## Call:
## lm(formula = PPE ~ RPDE, data = parkinsons)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.17308 -0.04932 -0.01385  0.03222  0.26055 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.02940    0.02663  -1.104    0.271    
## RPDE         0.47329    0.05229   9.051   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0757 on 193 degrees of freedom
## Multiple R-squared:  0.298,  Adjusted R-squared:  0.2944 
## F-statistic: 81.93 on 1 and 193 DF,  p-value: < 2.2e-16
  • slope: 0.47329
  • intercept: -0.02940
  • adjusted r-squared value: 0.2944 (multiple r-squared: 0.298)
  • median residual: -0.01385
#finding r, slope, intercept and r-squared for spread2
plot(fit.spread2$residuals ~ fit.spread2$fitted.values)
abline(h=0, col = "red")

summary(fit.spread2)
## 
## Call:
## lm(formula = PPE ~ spread2, data = parkinsons)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.146147 -0.045599 -0.008829  0.039457  0.207481 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.04876    0.01435   3.399 0.000822 ***
## spread2      0.69661    0.05945  11.717  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06907 on 193 degrees of freedom
## Multiple R-squared:  0.4157, Adjusted R-squared:  0.4126 
## F-statistic: 137.3 on 1 and 193 DF,  p-value: < 2.2e-16
  • slope: 0.69661
  • intercept: 0.04876
  • adjusted r-squared value: 0.4126 (multiple r-squared: 0.4157)
  • median residual: -0.008829
#finding r, slope, intercept and r-squared for D2
plot(fit.D2$residuals ~ fit.D2$fitted.values)
abline(h=0, col = "red")

summary(fit.D2)
## 
## Call:
## lm(formula = PPE ~ D2, data = parkinsons)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.15632 -0.05571 -0.01118  0.04392  0.24191 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.06293    0.03585  -1.755   0.0808 .  
## D2           0.11314    0.01486   7.613 1.16e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07923 on 193 degrees of freedom
## Multiple R-squared:  0.231,  Adjusted R-squared:  0.227 
## F-statistic: 57.96 on 1 and 193 DF,  p-value: 1.158e-12
  • slope: 0.11314
  • intercept: -0.06293
  • adjusted r-squared value: 0.227 (multiple r-squared: 0.231)
  • median residual: -0.01118
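Rather than copying each value out of summary() by hand, the slope, intercept and r-squared values could be collected into one table. This is only a sketch under stated assumptions: demo is a small synthetic data frame (x1, x2 and y are hypothetical columns, not the Parkinson’s measurements) and summarise_fit is a helper name I have made up.

```r
# Synthetic stand-in data: y depends on x1 but not on x2 (illustration only)
set.seed(42)
demo <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
demo$y <- 0.5 + 2 * demo$x1 + rnorm(50, sd = 0.5)

# Hypothetical helper: fit response ~ predictor and return the key statistics
summarise_fit <- function(predictor, response, data) {
  fit <- lm(reformulate(predictor, response = response), data = data)
  s <- summary(fit)
  data.frame(
    predictor     = predictor,
    intercept     = unname(coef(fit)[1]),
    slope         = unname(coef(fit)[2]),
    r.squared     = s$r.squared,
    adj.r.squared = s$adj.r.squared,
    sigma         = s$sigma          # residual standard error
  )
}

# One row per single-predictor model
results <- do.call(rbind, lapply(c("x1", "x2"), summarise_fit,
                                 response = "y", data = demo))
print(results)
```

With the real data, the call would presumably use c("NHR", "HNR", "RPDE", "spread2", "D2") as predictors, "PPE" as the response and parkinsons as the data.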

Which variable would you use to predict the value of PPE?

  • I would use HNR as it has the highest adjusted R-squared value at 0.4774, although I did also consider using spread2, which has an adjusted R-squared value of 0.4126. NHR and RPDE have similar mid-range values (0.3018 and 0.2944) and D2 explains the least variance (adjusted R-squared = 0.227).
  • Furthermore, all models had median residuals close to zero (between -0.006 and -0.014), and HNR had the lowest residual standard error (0.06515), once again making it the most suitable for explaining PPE.
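As a cross-check on this ranking, in a regression with a single predictor the multiple R-squared equals the square of the Pearson correlation, and the adjusted value follows from R-squared and the sample size. The sketch below uses only numbers reported earlier in this report; n = 195 is inferred from the 193 residual degrees of freedom plus the two estimated coefficients.

```r
# Correlations with PPE, copied from the cor(parkinsons) output
r_with_ppe <- c(NHR = 0.5525913, HNR = -0.6928759, RPDE = 0.5458857,
                spread2 = 0.6447110, D2 = 0.4805845)

# Values reported by summary() for each model
reported_r2  <- c(NHR = 0.3054, HNR = 0.4801, RPDE = 0.298,
                  spread2 = 0.4157, D2 = 0.231)
reported_adj <- c(NHR = 0.3018, HNR = 0.4774, RPDE = 0.2944,
                  spread2 = 0.4126, D2 = 0.227)

n <- 195  # inferred: 193 residual degrees of freedom + 2 coefficients

# Simple linear regression: R-squared is the squared correlation
r2 <- r_with_ppe^2

# Adjusted R-squared with one predictor: 1 - (1 - R2) * (n - 1) / (n - 2)
adj <- 1 - (1 - r2) * (n - 1) / (n - 2)

print(round(rbind(r2, reported_r2, adj, reported_adj), 4))
print(names(which.max(r2)))  # variable explaining the most variance in PPE
```

Because all five models have one predictor and the same n, ranking by adjusted R-squared, by multiple R-squared or by the absolute correlation gives the same order.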

Predict the equivalent value of PPE in the new parkinson_new.csv data, using HNR.

#open the parkinson_new.csv file
parkinsons_new<-read.csv("parkinsons_new.csv", row.names = 1)
predicted_ppe <- predict(fit.HNR, newdata = parkinsons_new, interval = "confidence")
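The interval argument used above deserves a note: interval = "confidence" describes the uncertainty in the mean PPE at a given HNR, whereas interval = "prediction" describes the uncertainty for a single new observation and is always wider. A minimal sketch on synthetic data follows; demo, new_x and the simulated coefficients are assumptions for illustration, not the study data.

```r
# Synthetic data loosely shaped like a negative linear trend (illustration only)
set.seed(7)
demo <- data.frame(x = runif(60, min = 10, max = 30))
demo$y <- 0.5 - 0.014 * demo$x + rnorm(60, sd = 0.06)

fit <- lm(y ~ x, data = demo)
new_x <- data.frame(x = c(15, 20, 25))  # hypothetical new values

# Both calls return a matrix with columns "fit", "lwr" and "upr"
conf_int <- predict(fit, newdata = new_x, interval = "confidence")
pred_int <- predict(fit, newdata = new_x, interval = "prediction")

# Same fitted values, but the prediction interval is wider
conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]
print(cbind(fit = conf_int[, "fit"], conf_width, pred_width))
```

If the aim is to judge how far an individual patient’s PPE might fall from the line, the prediction interval is arguably the more relevant of the two.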

Create a new plot of PPE and HNR, and add the predicted PPE values using the points function.

#plot the new plot of PPE and HNR
plot(PPE~HNR ,data=parkinsons)
#add the values of the predicted PPE using the points function
points(parkinsons_new$HNR, predicted_ppe[, "fit"], col = "red", pch = 16)

  • This plot shows that the predicted values lie along the line of best fit through the observed data, which is to be expected because they are calculated from the fitted PPE~HNR model. The moderately strong correlation between PPE and HNR (r = -0.69, p < 2.2e-16) is very unlikely to have occurred by chance, which suggests that both are informative measurements of dysphonia.
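The predictions can be reproduced by hand from the fitted coefficients reported earlier, PPE = 0.515333 - 0.014109 × HNR. In the sketch below the coefficients come from summary(fit.HNR), while the three HNR values are hypothetical examples rather than rows of the new data file, and predict_ppe is a name of my own.

```r
# Coefficients reported by summary(fit.HNR)
intercept <- 0.515333
slope     <- -0.014109

# Hypothetical helper: PPE predicted from HNR using the fitted line
predict_ppe <- function(hnr) intercept + slope * hnr

hnr_example <- c(15, 20, 25)          # hypothetical HNR values
ppe_example <- predict_ppe(hnr_example)
print(ppe_example)                    # 0.303698 0.233153 0.162608
```

Every predicted point therefore sits exactly on the regression line, and a higher HNR always gives a lower predicted PPE.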

Discussion and conclusion

#DO NOT DELETE THIS CHUNK
#word count chunk - the number under "words" is your official word count

#The first time you run this, you will need to install the package on your device

#remove the # from the following line of code if you have not used wordcount on this device before
#install.packages("rmdwc")
library(rmdwc)
rmdcountAddin()
## $code
## [1] "data.frame(file='STAT_3parkinsons.Rmd', lines=112, words=856, bytes=5616, chars=5616, nonws=4571)"
#BIBLIOGRAPHY

#1. Little M, McSharry P, Hunter E, Spielman J, Ramig L. Suitability of dysphonia measurements for telemonitoring of Parkinson’s disease. Nature Precedings. 2008 Sep 12.