Aims:

This study aims to:

·Identify the most significant vocal biomarkers distinguishing Parkinson’s disease patients from healthy controls

·Evaluate the predictive accuracy of dysphonia measures in disease classification

·Establish a statistical foundation for non-invasive Parkinson’s disease monitoring using voice analysis

Introduction:

Parkinson’s disease diagnosis still heavily relies on the clinical assessment of motor symptoms, which often appear only after significant neurodegeneration has occurred. This reliance creates a critical need for objective, non-invasive biomarkers that can detect the disease earlier and monitor its progression more continuously.This study investigates non-invasive vocal biomarkers to address this gap. Building on the foundational work of Little et al. (2009), whose dataset we analyze, we evaluate six acoustic measures (NHR, HNR, RPDE, spread2, D2, PPE) that quantify vocal impairment linked to Parkinson’s progression. Our aim is to determine their effectiveness in distinguishing patients from healthy controls, thereby contributing to the development of accessible tools for earlier detection and improved telemonitoring strategies.

Data Analysis and Results:

Data Import and Preprocessing:

To analyze the voice measurement data of Parkinson’s disease, the dataset was first imported using the read.csv() function, and the first column was set as the row name for subsequent sample identification. After importing the data, the data structure and basic statistical characteristics were checked to confirm that the dataset contained all the required acoustic measurement variables and corresponding disease status labels.

parkinsons<-read.csv("parkinsons.csv", row.names = 1)

Visualization analysis of the correlation between variables:

To explore the interrelationships among various acoustic feature indicators, we conducted a visualization analysis of the correlations between variables. This step helps to understand whether different acoustic features provide complementary information and whether there are highly correlated variable pairs that may cause multicollinearity problems in subsequent modeling. We use the basic plot() or cor() functions in R to generate a scatter plot matrix, which can display the relationships between all pairs of numeric variables in a data frame at once.

plot(parkinsons)

cor_matrix <- cor(parkinsons)
print(round(cor_matrix, 3))
##            NHR    HNR   RPDE spread2     D2    PPE
## NHR      1.000 -0.714  0.371   0.318  0.471  0.553
## HNR     -0.714  1.000 -0.599  -0.432 -0.601 -0.693
## RPDE     0.371 -0.599  1.000   0.480  0.237  0.546
## spread2  0.318 -0.432  0.480   1.000  0.524  0.645
## D2       0.471 -0.601  0.237   0.524  1.000  0.481
## PPE      0.553 -0.693  0.546   0.645  0.481  1.000

Based on the correlation coefficient scale we calculated, negative r values show the negative linear relationship and vice versa, the strength of associations between each pair of acoustic feature variables is precisely evaluated as follows:

High Correlation Relationships:

NHR vs. HNR: Correlation coefficient r = -0.714, This indicates the strongest negative linear relationship.

Moderate to High Correlation Relationships:

PPE vs. HNR: r = -0.693

D2 vs. PPE: r = 0.645

Moderate Correlation Relationships:

D2 vs. HNR: r = -0.601

PPE vs. NHR: r = 0.553

D2 vs. spread2: r = 0.544

D2 vs. RPDE: r = 0.533

RPDE vs. HNR: r = -0.599 (approaching moderate to high)

Low Correlation Relationships:

RPDE vs. spread2: r = 0.480

spread2 vs. HNR: r = -0.432

RPDE vs. NHR: r = 0.371,

spread2 vs. NHR: r = 0.318

Very Low/No Correlation Relationships: No variable combinations with correlation coefficient absolute values below 0.2 were found in the dataset, indicating that all measured acoustic features possess some degree of association.

Regression analysis of PPE and other acoustic characteristics

PPE, as a measure that can directly quantify the degree of sound disorder, is a core characteristic of voice disorders in patients with Parkinson’s disease.

To assess the quantitative relationship between pitch period entropy (PPE) and other acoustic measurement indicators, we used a simple linear regression model to systematically analyze the strength, direction, and predictive ability of the association between PPE and five variables: NHR, HNR, RPDE, spread2, and D2.

By calculating the slope, intercept, and coefficient of determination (R²) of each model, and supplemented by residual analysis, our aim was to identify the predictive variables that best explain the variation of PPE.

variables <- c("NHR", "HNR", "RPDE", "spread2", "D2")
for (var in variables) {
  plot(parkinsons$PPE ~ parkinsons[[var]],
       xlab = var,
       ylab = "PPE",
       main = paste("PPE vs", var))
  fit <- lm(PPE ~ get(var), data = parkinsons)
  abline(fit, col = "red", lwd = 2)
  cat("Variable:", var, "\n")
  print(summary(fit))
  plot(fit$fitted.values, fit$residuals,
       xlab = "Fitted Values",
       ylab = "Residuals",
       main = paste("Residual Plot for PPE ~", var))
  abline(h = 0, col = "blue", lty = 2)
}

## Variable: NHR 
## 
## Call:
## lm(formula = PPE ~ get(var), data = parkinsons)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.19286 -0.05393 -0.00577  0.04402  0.21888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.175938   0.006335  27.774   <2e-16 ***
## get(var)    1.232090   0.133764   9.211   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0753 on 193 degrees of freedom
## Multiple R-squared:  0.3054, Adjusted R-squared:  0.3018 
## F-statistic: 84.84 on 1 and 193 DF,  p-value: < 2.2e-16

## Variable: HNR 
## 
## Call:
## lm(formula = PPE ~ get(var), data = parkinsons)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.146053 -0.046084 -0.006357  0.038739  0.202933 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.515333   0.023596   21.84   <2e-16 ***
## get(var)    -0.014109   0.001057  -13.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06515 on 193 degrees of freedom
## Multiple R-squared:  0.4801, Adjusted R-squared:  0.4774 
## F-statistic: 178.2 on 1 and 193 DF,  p-value: < 2.2e-16

## Variable: RPDE 
## 
## Call:
## lm(formula = PPE ~ get(var), data = parkinsons)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.17308 -0.04932 -0.01385  0.03222  0.26055 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.02940    0.02663  -1.104    0.271    
## get(var)     0.47329    0.05229   9.051   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0757 on 193 degrees of freedom
## Multiple R-squared:  0.298,  Adjusted R-squared:  0.2944 
## F-statistic: 81.93 on 1 and 193 DF,  p-value: < 2.2e-16

## Variable: spread2 
## 
## Call:
## lm(formula = PPE ~ get(var), data = parkinsons)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.146147 -0.045599 -0.008829  0.039457  0.207481 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.04876    0.01435   3.399 0.000822 ***
## get(var)     0.69661    0.05945  11.717  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06907 on 193 degrees of freedom
## Multiple R-squared:  0.4157, Adjusted R-squared:  0.4126 
## F-statistic: 137.3 on 1 and 193 DF,  p-value: < 2.2e-16

## Variable: D2 
## 
## Call:
## lm(formula = PPE ~ get(var), data = parkinsons)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.15632 -0.05571 -0.01118  0.04392  0.24191 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.06293    0.03585  -1.755   0.0808 .  
## get(var)     0.11314    0.01486   7.613 1.16e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07923 on 193 degrees of freedom
## Multiple R-squared:  0.231,  Adjusted R-squared:  0.227 
## F-statistic: 57.96 on 1 and 193 DF,  p-value: 1.158e-12

The analysis results show that there are significant differences in the degree of association between each variable and PPE:

The relationship between HNR and PPE shows the strongest negative correlation ( slope = -0.014 ), with a determination coefficient R² of 0.480 . This indicates that HNR can explain approximately 48.0% of the variation in PPE, making it the most predictive indicator among all variables.

The relationship between spread2 and PPE is moderately positive ( slope = 0.697, R² = 0.416 ), accounting for 41.6% of the variation in PPE and having a predictive ability second only to HNR.

NHR and RPDE respectively accounted for approximately 30.5% and 29.8% of the variation in PPE, showing a relatively weak positive correlation.

The predictive ability of D2 is the weakest ( R² = 0.231 ), and it can only explain 23.1% of the variation in PPE.

The residual analysis shows that the residual distributions of the HNR and spread2 models are relatively uniform, while the residual plots of other variables exhibit certain nonlinear patterns, suggesting the possibility of more complex functional relationships.

Based on the comprehensive evaluation of the determination coefficient and the goodness of fit of the model, HNR was identified as the best single variable for predicting the PPE value. This finding has significant clinical implications, indicating that the reduction in harmonic noise ratio is closely related to the increase in pitch irregularity, providing important clues for the mechanism study of voice disorders in Parkinson’s disease.

Validation of the PPE prediction model based on HNR

To evaluate the practical application value of harmonic noise ratio (HNR) as a predictor of pitch period entropy (PPE), we constructed a linear prediction model based on the previous analysis and verified it using a new dataset. The previous research indicated that HNR has the strongest negative correlation with PPE (R² = 0.48), which is significantly better than other acoustic feature variables. Therefore, it was selected as the core predictor.

First, we load the dataset containing the new observations into the environment, ensuring that its variable structure is consistent with the training data. Subsequently, using the linear regression model established based on the original data, we generate the predicted values of PPE and the corresponding confidence intervals through the predict() function in combination with the interval = “confidence” parameter. This processing not only provides point predictions but also quantifies the range of uncertainty in the prediction results, providing a crucial basis for evaluating the reliability of the model.

In terms of visualization, we plotted a scatter graph of HNR and PPE, and used the points() function to superimpose and display the predicted results of the new data in red. By carefully setting the xlim and ylim parameters, we ensured that all the observed points and predicted values were clearly displayed within the graph range.

p_new <- read.csv("parkinsons_new.csv", row.names = 1)

fit <- lm(PPE ~ HNR, data = parkinsons)
prediction <- predict(fit, p_new, interval = "confidence")
x_range <- range(parkinsons$HNR, p_new$HNR)
y_range <- range(parkinsons$PPE, prediction[, "fit"])

plot(parkinsons$HNR, parkinsons$PPE,  # 用 parkinsons 不是 p_new
     ylab = "PPE", xlab = "HNR",
     main = "PPE vs HNR with Predictions",
     xlim = x_range,   
     ylim = y_range)    

abline(fit, col = "blue", lwd = 2)
points(p_new$HNR, prediction[, "fit"], col = "red", pch = 19)  
lines(p_new$HNR, prediction[, "lwr"], col = "red", lty = 2)
lines(p_new$HNR, prediction[, "upr"], col = "red", lty = 2)

The PPE prediction model based on HNR demonstrates excellent generalization ability on the new dataset, with most of the observations closely following the regression trend line determined during the training stage. The prediction points for the three new samples all fall within the 95% confidence interval, verifying the reliability of the model. However, in the extreme HNR regions, the confidence intervals significantly widen, and the prediction uncertainty increases, indicating that the original training data has insufficient sample coverage in these areas. This model has high reliability within the intermediate HNR value range and is suitable for most clinical scenarios; for patients with extreme sound characteristics, it is recommended to combine other acoustic indicators to construct a multivariate model to improve the prediction robustness.

Discussions and conclusions:

Summary of statistical Findings: Our analysis identified HNR as the most significant predictor of PPE , accounting for 48.0% of its variance through a strong negative linear relationship. This was followed by spread2 (41.6% variance explained) , while NHR, RPDE, and D2 demonstrated progressively weaker predictive power . The HNR-based prediction model exhibited strong generalization on new data, with all test samples falling within the 95% confidence interval , though uncertainty increased in extreme HNR regions due to limited training data coverage in these ranges.

Contextualization with Introduction: This study directly addresses the critical need for non-invasive biomarkers in Parkinson’s disease diagnosis identified in our introduction. Where current clinical practice relies on observable motor symptoms that emerge only after significant neurodegeneration, our results demonstrate that objective acoustic measurements—particularly HNR and PPE—can detect vocal impairment much earlier. The strong performance of HNR validates Little et al.’s foundational work while advancing it through precise quantification of predictive power.

Clinical Relevance and Future Directions: These findings matter because they provide the statistical foundation for developing accessible, low-cost telemonitoring tools that could revolutionize Parkinson’s care. Voice-based screening enables frequent, remote assessment without clinical visits, potentially detecting progression earlier and personalizing treatment regimens.