The project seeks to classify Parkinson's Disease (PD) patients against a control group using speech features, applying PLS-DA, sPLS-DA and cluster analysis.
The project finds that conventional methods such as cluster analysis and PLS-DA perform worse than boosting methods such as xgboost, which builds trees sequentially with a focus on previously misclassified data.
The data were gathered from 188 patients with PD (107 men and 81 women) aged 33 to 87 at the Department of Neurology, Cerrahpaşa Faculty of Medicine, Istanbul University. The control group consists of 64 healthy individuals (23 men and 41 women) aged 41 to 82 (mean 61.1 ± 8.9). During data collection, the microphone was set to 44.1 kHz and, following the physicians' examination, the sustained phonation of the vowel /a/ was collected from each subject with three repetitions. Various speech signal processing algorithms, including Time-Frequency Features, Mel-Frequency Cepstral Coefficients (MFCCs), Wavelet Transform based Features, Vocal Fold Features and TQWT features, have been applied to the speech recordings of Parkinson's Disease (PD) patients to extract clinically useful information for PD assessment. The data is submitted separately and has been cleaned.
The original dataset is a 756 x 755 matrix, with an id column and two binary variables for gender and PD status (pd$class); the remaining columns are all continuous variables representing speech features. The variables are measured in many different units, requiring scaling before the analysis. The dataset does not contain NAs.
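These properties can be verified with a quick check (a minimal sketch, assuming the pd data frame read in the appendix):

# quick structural checks
dim(pd)          # number of recordings and variables
anyNA(pd)        # FALSE: no missing values
table(pd$class)  # class balance: healthy (0) vs PD (1) recordings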
With such high-dimensional data, many analytical methods that require an invertible covariance matrix are not applicable. Instead of the originally proposed LDA, the project uses PLS-DA, sPLS-DA (with lasso-type variable selection) and k-means cluster analysis. The rest of the report is divided into two parts: starting with distance-based tests for group differences, the first part covers k-means cluster analysis. The second part covers PLS-DA and sPLS-DA, including the tuning. In each part, classification measures such as the confusion matrix and the false negative rate are presented, followed by a discussion of classification power.
Here is a glimpse of the dataset:
| id | gender | class | PPE | DFA | RPDE | numPulses | numPeriodsPulses | meanPeriodPulses | stdDevPeriodPulses |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0.85247 | 0.71826 | 0.57227 | 240 | 239 | 0.0080635 | 8.68e-05 |
| 0 | 1 | 1 | 0.76686 | 0.69481 | 0.53966 | 234 | 233 | 0.0082583 | 7.31e-05 |
| 0 | 1 | 1 | 0.85083 | 0.67604 | 0.58982 | 232 | 231 | 0.0083396 | 6.04e-05 |
The project performs non-parametric PERMANOVA and ANOSIM tests based on the Manhattan distance. In high dimensions, the widely used Euclidean distance tends to become less meaningful due to the "curse of dimensionality", which refers to the increasing sparsity and the loss of discriminative power of distances in high-dimensional spaces.
The first two axes do not separate the scaled PD data well, especially the second axis. PD patients lie slightly to the left along the first axis, and the healthy controls slightly to the right.
Both tests yield a p-value of 0.001 after 999 permutations, suggesting the group difference is significant. The results can be viewed in the appendix.
We will empirically choose 2 clusters, reflecting the classification nature of the data: patients and non-patients. More discussion follows in the appendix; in fact, 3-means performs better, with a special note.
K-means clustering suggests two clusters can separate the data well along PC1. However, the confusion matrix shows that the clusters do not coincide with the groups in the data:
##         Actual
## Cluster   0   1
##       1  37 319
##       2 155 245
Although the healthy are predominantly in cluster 2, the patients appear in both clusters in comparable numbers (319 vs 245), indicating the poor performance of the clustering.
PCA uses the linear combinations of the variables with the largest variance and is often a preliminary data-exploration step; it is therefore used here.
Using the "elbow rule", three PCs explain only about 30.6% of the variability, a very moderate result. Projecting the data onto the first two components, a large proportion of the two groups overlap, suggesting that two components poorly distinguish PD patients from healthy people.
PLS-DA refers to Partial Least Squares Discriminant Analysis. The key idea is to model the relationship between the predictors and the class labels by constructing latent variables known as components. PLS-DA finds these components by maximising the covariance between the predictors and the class labels.
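As a point of reference (a standard formulation of PLS, not taken from this report), let $Y$ be the dummy-coded class matrix. The first pair of weight vectors solves

$$
(w_1, c_1) \;=\; \underset{\lVert w \rVert = \lVert c \rVert = 1}{\arg\max}\; \operatorname{Cov}(Xw,\, Yc),
$$

and the first component (score vector) is $t_1 = X w_1$; subsequent components are obtained in the same way after deflating $X$ on the previous scores.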
The initial PLS-DA is fitted with 10 components in order to select the best number of components.
The result is slightly better than PCA: the first axis seems to divide the two groups.
The tuning process can be read in appendix. Here only the final result is presented.
The variables selected by sPLS-DA can be seen in the following table:
##      comp1                  comp2                        comp3
## [1,] "tqwt_minValue_dec_12" "app_det_TKEO_mean_1_coef"   "app_det_TKEO_mean_1_coef"
## [2,] "mean_MFCC_2nd_coef"   "app_det_TKEO_mean_10_coef"  "app_det_TKEO_mean_10_coef"
## [3,] "tqwt_maxValue_dec_12" "app_det_TKEO_mean_2_coef"   "app_det_TKEO_mean_2_coef"
## [4,] "tqwt_stdValue_dec_11" "app_det_TKEO_mean_3_coef"   "app_det_TKEO_mean_3_coef"
## [5,] "tqwt_stdValue_dec_12" "app_det_TKEO_mean_4_coef"   "app_det_TKEO_mean_4_coef"
The first component, which has the strongest separating power, mainly picks up TQWT (tunable Q-factor wavelet transform) features. This conclusion is in line with many recent papers that use these features to classify PD.
Here I present a comparison of the original model and the tuned model in terms of prediction accuracy. First, a split into training and testing sets is needed.
##         Actual
## sPLS-DA  0  1
##       0 28 30
##       1 10 84
##        Actual
## PLS-DA  0  1
##      0 27 23
##      1 11 91
A summary of the performance of the methods is attached below (2-means is omitted because it deviates too far from the true grouping). Xgboost was added as a comparison to the methods used here; it is not discussed in the main part since it is outside the scope of the project. More discussion of it follows below.
| Measure | 3-means | PLS-DA | sPLS-DA | Xgboost |
|---|---|---|---|---|
| Sensitivity | 82.27% | 79.82% | 73.68% | 96.49% |
| Specificity | 54.69% | 71.05% | 73.68% | 34.21% |
| Positive Predictive Value | 84.21% | 89.22% | 89.36% | 81.48% |
| Negative Predictive Value | 51.72% | 54% | 48.28% | 76.47% |
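For reference, these metrics can be computed directly from a 2x2 confusion matrix with predicted classes in rows and actual classes in columns; a minimal sketch using the sPLS-DA counts above:

# sensitivity, specificity, PPV and NPV from the sPLS-DA confusion matrix
cm <- matrix(c(28, 10, 30, 84), nrow = 2,
             dimnames = list(Predicted = c(0, 1), Actual = c(0, 1)))
TP <- cm["1","1"]; TN <- cm["0","0"]; FP <- cm["1","0"]; FN <- cm["0","1"]
c(sensitivity = TP/(TP+FN),   # 84/114 = 73.68%
  specificity = TN/(TN+FP),   # 28/38  = 73.68%
  PPV         = TP/(TP+FP),   # 84/94  = 89.36%
  NPV         = TN/(TN+FN))   # 28/58  = 48.28%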
Xgboost demonstrates the highest sensitivity among all the methods, indicating a superior ability to identify PD patients. However, it has the lowest specificity, suggesting a higher chance of misclassifying control-group individuals as PD patients. PLS-DA and sPLS-DA exhibit better specificity than 3-means and xgboost, indicating a lower chance of false positives and a better ability to correctly identify the control group, although xgboost still has the highest negative predictive value.
Considering the context of identifying PD patients from the control group, the choice of the best method would depend on the specific requirements and priorities. If the focus is on minimising false negatives and capturing as many PD patients as possible (high sensitivity), xgboost could be the preferred option. If the emphasis is on minimising false positives and correctly identifying the control group (high specificity), PLS-DA or sPLS-DA may be more suitable.
In terms of method, the tuned sPLS-DA does not show major improvements over the original PLS-DA, which contains more components. This may point to a limit of sPLS-DA: the lasso regularisation may overlook nonlinearity in the predictors, and the unbalanced data may also affect the lasso. The non-parametric machine-learning model xgboost performs better in classification because it includes a learning process: each new tree focuses on the errors of the current model, and the process is repeated until no further improvement. Hence it copes better with the unbalanced nature of the data. A simple model and its confusion matrix can be found here.
The project conducted distance-based tests, cluster analysis and PLS-DA on the speech features of PD data characterised by high dimensionality. Using the Manhattan distance matrix, both ANOSIM and PERMANOVA reject the null hypothesis of no group difference. The project also performed cluster analysis, PLS-DA and sPLS-DA, followed by a discussion of the performance and suitability of each method, with an additional xgboost model for comparison. While a machine-learning method like xgboost performs well in identifying PD patients, PLS-DA and sPLS-DA perform better in identifying the control group.
pd<-read.csv("E:/Rwork/STATS767/pd_speech_features/pd_speech_features.csv",skip = 1)
knitr::kable(head(pd[,1:10],n=3))
# Distance Matrix ----
X <- pd[,-c(1:3)]                        # speech features only
D <- dist(scale(X), method="manhattan")
library(MASS)
# classical MDS (principal coordinates) on the Manhattan distance matrix
pd.cmd <- cmdscale(D, eig=TRUE, k=7)
plot(pd.cmd$points[,1:2], col=as.numeric(as.factor(pd$class)),
     main='MDS (Manhattan dist)',
     xlab=paste(round((pd.cmd$eig/sum(abs(pd.cmd$eig)))[1]*100), "% of total variability"),
     ylab=paste(round((pd.cmd$eig/sum(abs(pd.cmd$eig)))[2]*100), "% of total variability"))
legend("bottomright", legend=unique(pd$class),
       col=unique(as.numeric(as.factor(pd$class))), pch=1, cex=0.8)
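The ANOSIM and PERMANOVA output below was produced with the vegan package; the calls, as echoed in the output, are:

# distance-based tests for group differences (vegan)
library(vegan)
anosim(x = D, grouping = pd$class, permutations = 999)
adonis2(formula = D ~ pd$class, permutations = 999)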
##
## Call:
## anosim(x = D, grouping = pd$class, permutations = 999)
## Dissimilarity: manhattan
##
## ANOSIM statistic R: 0.1626
## Significance: 0.001
##
## Permutation: free
## Number of permutations: 999
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = D ~ pd$class, permutations = 999)
## Df SumOfSqs R2 F Pr(>F)
## pd$class 1 6736550 0.04304 33.91 0.001 ***
## Residual 754 149788813 0.95696
## Total 755 156525363 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here I present the WSS (elbow) and Calinski-Harabasz (C-H) index selection methods as well.
# WSS & HC index
WSS<-rep(NA, 10)
for(i in 1:10){
WSS[i-1]<-kmeans(scale(pd[,4:ncol(pd)]), centers=i,
nstart=50)$tot.withinss
}
plot(1:10, WSS, xlab="Number of Clusters", main="PD data")
BSS<-rep(NA,10)
for(i in 2:11){
BSS[i-1]<-kmeans(scale(pd[,4:ncol(pd)]), centers=i, nstart=50)$betweenss
}
NminusK<-nrow(pd)-c(2:11)
Kminus1<-c(1:10)
CH<-BSS*NminusK/(WSS*(Kminus1))
plot(2:11, CH, xlab="number of clusters",main = "C-H index by k centres")
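For reference, the index computed above is the Calinski-Harabasz index (stated here in its standard form):

$$
\mathrm{CH}(k) \;=\; \frac{B(k)/(k-1)}{W(k)/(n-k)},
$$

where $B(k)$ and $W(k)$ are the between- and within-cluster sums of squares for $k$ clusters and $n$ is the number of observations; larger values indicate better-separated clusters.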
We can see that the WSS curve is very smooth, while the C-H index favours 2 or 3 clusters. If we use 3 clusters instead, the result follows:
pca<-prcomp(X, scale=TRUE,center = TRUE)
eigs <- pca$sdev^2
kmod3<-kmeans(scale(X), centers=3, algorithm="Lloyd",nstart=50)
plot(pca$x[,1:2], col=kmod3$cluster, main="PD Speech Features with 3 means",
xlab=paste(round((eigs[1] / sum(eigs))*100), "% of total variability") ,
ylab= paste(round((eigs[2] / sum(eigs))*100), "% of total variability") )
Both PC1 and PC2 separate the data well. I also want to check how the other PCs represent the data:
# note: kmod (the 2-means fit) is created in the 2-means chunk below
pairs(pca$x[,1:4], col=kmod$cluster, main="2 Clusters")
pairs(pca$x[,1:4], col=kmod3$cluster, main="3 Clusters")
Generally, PC1 performs the best. The confusion matrix for 3 clusters:
##         Actual
## Cluster   0   1
##       1 105 100
##       2  16 239
##       3  71 225
Most of the healthy fall in the first cluster, while the patients are spread mainly between clusters 2 and 3. If clusters 2 and 3 are considered as the patient group, then the sensitivity is 82.27% and the specificity is 54.69%. The positive predictive value is 84.21%. The sensitivity is slightly higher than that of the PLS-DA methods.
# the graph
pca <- prcomp(pd[,4:ncol(pd)], scale=TRUE, center=TRUE)
kmod <- kmeans(scale(pd[,4:ncol(pd)]), centers=2, algorithm="Lloyd", nstart=50)
eigs <- pca$sdev^2
plot(pca$x[,1:2], col=kmod$cluster, main="PD Speech Features",
     xlab=paste(round((eigs[1]/sum(eigs))*100), "% of total variability"),
     ylab=paste(round((eigs[2]/sum(eigs))*100), "% of total variability"))
# Confusion matrix ----
table(kmod$cluster, pd$class, dnn=c("Cluster","Actual"))
# PCA----
library(mixOmics)
pd.pca = pca(X, ncomp = 10, center = TRUE, scale = TRUE)
plot(pd.pca)
# PCA plotting
plotIndiv(pd.pca, group = pd$class, ind.names = FALSE,
legend = TRUE, title = 'PCA on PD, comp 1 - 2')
# PLS-DA----
# the default plsda is scale=TRUE
pd_pls<-plsda(X,pd$class,ncomp=10)
plotIndiv(pd_pls, comp=1:2, group= pd$class, ind.names=FALSE, legend=TRUE)
The appropriate number of components is crucial for the model. I will use the perf() function to implement 10-fold cross-validation with 100 repeats to reduce the effect of random fold allocation. Various error rates will be plotted to measure the performance of different numbers of components.
perf.pd<- perf(pd_pls, validation = "Mfold",
folds = 10, nrepeat = 100,
progressBar = FALSE, auc = TRUE)
# plot the outcome of performance evaluation across all ten components
legend_size <- 0.7
par(cex = legend_size)
plot(perf.pd, sd = TRUE,
legend.position = "horizontal",legend=TRUE)
The balanced error rate (BER) generally reaches the lowest point at the 6th component. It can be summarised in the table:
perf.pd$choice.ncomp
## max.dist centroids.dist mahalanobis.dist
## overall 4 6 4
## BER 4 6 4
I will choose 6 components, the largest value suggested (under the centroids distance), so that no potentially useful component is discarded before tuning.
The next step is to perform M-fold cross-validation to find the optimal number of variables to keep, based on error rates. The function tune.splsda() measures the performance of splsda(), where lasso penalisation is applied to the loadings vectors of the predictors only. It is therefore very useful when p is close to (or exceeds) n.
# a list of X to consider
list.keepX <- c(1:10, seq(20, 300, 10))
t.pd_pls <- tune.splsda(X,pd$class,ncomp=6,
validation = 'Mfold',
folds = 10, nrepeat = 100,
dist = 'max.dist',
measure = "BER",
test.keepX = list.keepX,
cpus = 2)
plot(t.pd_pls, col = color.jet(6)) # plot output of variable number tuning
t.pd_pls$choice.keepX
## comp1 comp2 comp3 comp4 comp5 comp6
## 130 300 7 1 1 1
The result also suggests that 3 components would be appropriate; the final model is fitted with ncomp = 3 and keepX = 6.
t.pd_pls$choice.ncomp$ncomp
## [1] 3
# Final Model----
f.pd_pls <- splsda(X, pd$class, ncomp = 3,keepX = 6)
# performance assessment of the final model
perf.f.pd_pls <- perf(f.pd_pls, folds = 10, nrepeat = 100,
                      validation = "Mfold", dist = "max.dist",
                      progressBar = FALSE)
# Variable importance----
# five most frequently selected (stable) variables per component
sapply(perf.f.pd_pls$features$stable,
       function(A){ names(head(sort(A, decreasing=TRUE), 5)) })
# Split of training and testing set
set.seed(767)
train_idx <- sample(1:nrow(pd), floor(0.8*nrow(pd)))
train_data <- pd[train_idx,]
test_data <- pd[-train_idx,]
# keep the same columns as X (speech features only) so predict() sees matching predictors
train_X <- train_data[,-c(1:3)]
test_X  <- test_data[,-c(1:3)]
# Predictions for final and initial model
pd.pred<-predict(pd_pls,newdata=test_X)
f.pd.pred<-predict(f.pd_pls,newdata=test_X)
# Confusion matrices (rows: predicted class, columns: actual class)
table(f.pd.pred$class$centroids.dist[,3], test_data$class,
      dnn=c("sPLS-DA","Actual"))
table(pd.pred$class$mahalanobis.dist[,3], test_data$class,
      dnn=c("PLS-DA","Actual"))
Xgboost stands for extreme gradient boosting. It belongs to the family of gradient boosting algorithms and uses a second-order Taylor approximation of the loss function when fitting each new tree. In addition, it adds penalty terms to the objective to control overfitting. It is known for its efficiency and accuracy. Here a simple xgboost model will be fitted:
library(xgboost)
library(Matrix)   # provides sparse.model.matrix()
# Split of the testing and training
set.seed(767)
train_idx<-sample(1:nrow(pd),floor(0.8*nrow(pd)))
train_data<-pd[train_idx,]
train_X<-train_data[,c(2,4:ncol(train_data))]
test_data<-pd[-train_idx,]
test_X<-test_data[,c(2,4:ncol(test_data))]
train_mat<-sparse.model.matrix(class ~ ., data = train_data[,-1])
# Building matrix
test_mat<-sparse.model.matrix(class ~ ., data = test_data[,-1])
# for classification purpose
set.seed(767)
bst<-xgboost(data=train_mat,label=train_data$class,nrounds = 5, max_depth = 1, colsample_bytree=0.75,
verbose = 0,objective="binary:logistic")
yhat_xgb <- predict(bst, newdata=test_mat)
# Classify as 1 when the predicted probability exceeds 0.5
prediction <- as.numeric(yhat_xgb > 0.5)
table(prediction, test_data$class, dnn=c("xgboost","Actual"))
##         Actual
## xgboost   0   1
##       0  13   4
##       1  25 110
We immediately notice that the sensitivity is very high (96.49%), while the specificity (34.21%) is low. The positive predictive value is 81.48%, and the negative predictive value (76.47%) is considerably higher than that of sPLS-DA. The sensitivity is also well above that of 3-means clustering. We believe that, with further tuning, xgboost would perform very well.
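As an illustration of what such tuning could look like (a sketch over a small hypothetical grid, not part of the analysis above):

# 5-fold CV over a small grid of tree depths (hypothetical settings)
set.seed(767)
for (d in c(1, 2, 3)) {
  cv <- xgb.cv(data = train_mat, label = train_data$class,
               nrounds = 50, nfold = 5, max_depth = d,
               eta = 0.1, colsample_bytree = 0.75,
               objective = "binary:logistic", metrics = "error",
               verbose = 0)
  best <- which.min(cv$evaluation_log$test_error_mean)
  cat("max_depth =", d, ": best nrounds =", best,
      "CV error =", cv$evaluation_log$test_error_mean[best], "\n")
}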