Case study X2: PCA - A simple but forgotten tool for data mining

Foreword: About the Machine Learning in Medicine (MLM) project

The MLM project was initiated in 2016 and aims to:

1. Encourage the use of Machine Learning techniques in medical research in Vietnam.
2. Promote the use of the R statistical programming language, an open-source and leading tool for practicing data science.

Background

In this study, the question is how to classify two newly developed opioid pain relievers into the correct therapeutic regimens. To answer this question, the new drugs were assessed for their similarity to 25 standard opioids. The studied profiles include scores (0-10) for the Analgesia, Antitussive, Constipation and Respiratory depression effects and for Abuse liability, as well as Elimination time (half-life) and Duration of the analgesia effect.

The original dataset was provided in Chapter 1 of Machine Learning in Medicine - Cookbook 2 (T. J. Cleophas and A. H. Zwinderman, SpringerBriefs in Statistics, 2014). You can get the original data in SPSS SAV format (chap1nearestneighbors.sav) from the publisher's website: extras.springer.com.

library(foreign)  # read.spss()
library(dplyr)    # %>% pipe

df <- read.spss("chap1nearestneighbors.sav",
                use.value.labels = TRUE,
                to.data.frame = TRUE) %>%
  .[, -9]  # drop the unused 9th column

# keep the drug names as row names: FactoMineR uses them as labels
# (a tibble is avoided here because tibbles do not keep row names)
row.names(df) <- df$drug

knitr::kable(
  head(df,30),booktabs = TRUE,
  caption = 'Opioids dataset'
)
Opioids dataset

| drug | analgesia | antitussive | constipation | respiratorydepression | abuseliability | elimination | analgesiaduration |
|---|---|---|---|---|---|---|---|
| buprenorphine | 7 | 4 | 5 | 7 | 4 | 5.0 | 9.0 |
| butorphanol | 7 | 3 | 4 | 7 | 4 | 2.7 | 4.0 |
| codeine | 5 | 6 | 6 | 5 | 4 | 2.9 | 7.0 |
| heroine | 8 | 6 | 8 | 8 | 10 | 9.0 | 15.0 |
| hydromorphone | 8 | 6 | 6 | 8 | 8 | 2.6 | 5.0 |
| levorphanol | 8 | 6 | 6 | 8 | 8 | 11.0 | 20.0 |
| mepriridine | 7 | 2 | 4 | 8 | 6 | 3.2 | 14.0 |
| methadone | 9 | 6 | 6 | 8 | 6 | 25.0 | 5.0 |
| morphine | 8 | 6 | 8 | 8 | 8 | 3.1 | 5.0 |
| nalbuphine | 7 | 2 | 4 | 7 | 4 | 5.1 | 4.5 |
| oxycodone | 6 | 6 | 6 | 6 | 8 | 5.0 | 4.0 |
| oxymorphine | 8 | 5 | 6 | 8 | 8 | 5.2 | 3.5 |
| pentazocine | 7 | 2 | 4 | 7 | 5 | 2.9 | 3.0 |
| propoxyphene | 5 | 2 | 4 | 5 | 5 | 3.3 | 2.0 |
| nalorphine | 2 | 3 | 6 | 8 | 1 | 1.4 | 3.2 |
| levallorphan | 3 | 2 | 5 | 4 | 1 | 11.0 | 5.0 |
| cyclazocine | 2 | 3 | 6 | 3 | 2 | 1.6 | 2.8 |
| naloxone | 1 | 2 | 5 | 8 | 1 | 1.2 | 3.0 |
| naltrexone | 1 | 3 | 5 | 8 | 0 | 9.7 | 14.0 |
| alfentanil | 7 | 6 | 7 | 4 | 6 | 1.6 | 0.5 |
| alphaprodine | 6 | 5 | 6 | 3 | 5 | 2.2 | 2.0 |
| fentanyl | 6 | 5 | 7 | 5 | 4 | 3.7 | 0.5 |
| meptazinol | 4 | 3 | 5 | 5 | 3 | 1.6 | 2.0 |
| norpropoxyphene | 8 | 6 | 8 | 5 | 7 | 6.0 | 4.0 |
| sufentanil | 7 | 6 | 8 | 6 | 8 | 2.6 | 5.0 |
| newdrug1 | 5 | 5 | 4 | 3 | 6 | 5.0 | 12.0 |
| newdrug2 | 8 | 6 | 3 | 4 | 5 | 7.0 | 16.0 |

The dataframe consists of 27 rows corresponding to 27 drugs, with 1 categorical variable indicating the drug name and 7 quantitative variables. This study design suggests two complementary approaches to data exploration: studying the individuals and studying the variables. Indeed, any dataframe can be explored in two ways, through its rows (individuals) or its columns (variables).

Individuals study

Classifying individuals means identifying the similarity (or dissimilarity) between them from the point of view of our features (variables), i.e. identifying which individuals are most alike and which differ most. In our study, two drugs are considered similar if they have similar profiles across the 7 quantitative variables. If so, these 2 drugs lie along the same dimension of variability, and a new drug can be classified into the treatment group of the standard drugs that share its profile. In general, looking at individuals may reveal a pattern that supports the development of a classification rule or model.

The study of individuals can be illustrated by a scatter-plot matrix (Figure 1). This type of graph describes all individuals (a cloud of points) in terms of pairs of variables (two dimensions at a time). Each panel in the matrix represents one point of view, and some panels may reveal an organised structure of individuals with clear distances between subgroups. Such a presentation allows a straightforward interpretation of the data, as each component graph is two-dimensional.

However, when there are many variables in the data, we need a better way to characterise the multi-dimensional space in which those individuals are located.
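To make "similarity" concrete before any plotting, we can compute pairwise distances directly. A minimal sketch (reusing the df built above; not part of the original analysis): after standardizing the 7 profiles, the Euclidean distance between two rows measures how dissimilar two drugs are.

# hedged sketch: Euclidean distances between standardized drug profiles
d_mat <- as.matrix(dist(scale(df[, -1])))
round(d_mat[1:5, 1:5], 2)  # e.g. the first five drugs; small values = similar drugs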

library(GGally)  # scatter-plot matrix; also attaches ggplot2

# custom lower panel: points coloured by drug, plus a linear fit
plotfuncLow <- function(data, mapping){
  ggplot(data = data, mapping = mapping) +
    geom_point(aes(fill = df$drug), shape = 21, color = "black") +
    geom_smooth(method = "lm", alpha = 0.5)
}

ggpairs(df[, -1], lower = list(continuous = plotfuncLow), diag = NULL)

Variables study

An alternative approach is to explore the data by looking at the variables (features). The variable study aims to reveal the links between all variables in the dataset. Many kinds of relationship can exist (logarithmic, exponential, polynomial…), but in practice we are usually interested in linear correlation. Such correlation can be evaluated by plotting a correlation matrix between our quantitative variables (Figure 2).

# Pairwise Pearson correlation tests: returns the matrix of p-values and
# the lower/upper confidence bounds for every pair of variables
cor.mtest <- function(mat, conf.level = 0.95){
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- lowCI.mat <- uppCI.mat <- matrix(NA, n, n)
  diag(p.mat) <- 0
  diag(lowCI.mat) <- diag(uppCI.mat) <- 1
  for(i in 1:(n - 1)){
    for(j in (i + 1):n){
      tmp <- cor.test(mat[, i], mat[, j], conf.level = conf.level)
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
      lowCI.mat[i, j] <- lowCI.mat[j, i] <- tmp$conf.int[1]
      uppCI.mat[i, j] <- uppCI.mat[j, i] <- tmp$conf.int[2]
    }
  }
  return(list(p.mat, lowCI.mat, uppCI.mat))
}

cormat <- df[, -1] %>% cor.mtest(., 0.95)

library(corrplot)
library(viridis)

df[, -1] %>%
  cor(., method = "pearson") %>%
  corrplot(., p.mat = cormat[[1]], sig.level = 0.05, order = "hclust",
           type = "lower", method = "pie", tl.col = "black", tl.srt = 45,
           col = viridis::plasma(n = 100, begin = 0.9, end = 0.4))

The correlation matrix indicates a significant linear relationship between some features, while for other pairs the relationship is weaker and less significant. Note that studying variables works much like studying individuals: in both cases the data can be visually reorganised into clusters (groups of variables or groups of individuals). As with individuals, the variable study aims to identify a pattern, i.e. groups of strongly related variables. In this way the dimensionality of the original dataset can be reduced: each group of well-correlated variables, which vary together, can be represented by a single synthetic variable.

However, the scatter-plot matrix and the correlogram alone do not answer our main question.

In addition, the correlation matrix and scatter-plot matrix are only practical for datasets with a small number of variables. When exploring larger datasets, drawing conclusions from these graphs becomes difficult.

Therefore we need a more efficient method that can visually represent the important relationships between the variables. Principal component analysis (PCA) is our solution: it reduces the dimensionality of the original dataset by generating synthetic variables called "principal components".
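For intuition, here is a hedged sketch with base R's prcomp() (the analysis below uses FactoMineR instead): each principal component is simply a linear combination of the centered, scaled original variables.

# sketch only: the rotation matrix holds the weights (loadings) that
# define each component as a combination of the original variables
pc <- prcomp(df[, -1], center = TRUE, scale. = TRUE)
round(pc$rotation[, 1:2], 2)  # weights defining PC1 and PC2
summary(pc)                   # proportion of variance per component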

Materials and method

Data analysis was performed in the R statistical programming language. The FactoMineR and factoextra packages will be used to conduct the PCA and to plot the results as ggplot2 graphical objects.

library(FactoMineR)
library(factoextra)

The first thing to do is to determine the active and supplementary features. Only active variables contribute to the construction of the principal components, while supplementary variables are used exclusively for illustrative or interpretive purposes. Active elements can be specified for both individuals and variables.

When using PCA as an exploratory tool before model training, we usually choose the explanatory variables (features) of the model as active variables for the PCA and project the target (dependent variable) as a supplementary variable. This helps to discover the relationship between them and may anticipate the model's performance.

How do we select the active elements? It is easy to decide in this case: all the individuals will be included as active individuals, and all 7 quantitative profiles will be used as active variables. In the PCA function, we specify that the drug name is to be treated as a categorical illustrative variable.

Should we standardize our variables? As there are 2 types of metrics (scores and times), measured in different units and scales, the variables should be standardized before PCA. After standardization, each variable plays a balanced relative role.
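As a quick sketch of what standardization does (PCA performs this internally when scale.unit = TRUE; base R's scale() differs only in the denominator convention used for the standard deviation):

# center each variable and divide by its standard deviation
df_std <- scale(df[, -1])
round(colMeans(df_std), 10)       # all means are (numerically) zero
round(apply(df_std, 2, sd), 10)   # all standard deviations are one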

Though the PCA function provides graphs for the individual and variable analyses, as well as other outputs contained in the result object, we do not need any graphical output here, as we will draw the plots later with the factoextra package.

res.pca <- PCA(df, quali.sup = 1, scale.unit = TRUE, graph = FALSE)

print(res.pca)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 27 individuals, described by 8 variables
## *The results are available in the following objects:
## 
##    name               
## 1  "$eig"             
## 2  "$var"             
## 3  "$var$coord"       
## 4  "$var$cor"         
## 5  "$var$cos2"        
## 6  "$var$contrib"     
## 7  "$ind"             
## 8  "$ind$coord"       
## 9  "$ind$cos2"        
## 10 "$ind$contrib"     
## 11 "$quali.sup"       
## 12 "$quali.sup$coord" 
## 13 "$quali.sup$v.test"
## 14 "$call"            
## 15 "$call$centre"     
## 16 "$call$ecart.type" 
## 17 "$call$row.w"      
## 18 "$call$col.w"      
##    description                                          
## 1  "eigenvalues"                                        
## 2  "results for the variables"                          
## 3  "coord. for the variables"                           
## 4  "correlations variables - dimensions"                
## 5  "cos2 for the variables"                             
## 6  "contributions of the variables"                     
## 7  "results for the individuals"                        
## 8  "coord. for the individuals"                         
## 9  "cos2 for the individuals"                           
## 10 "contributions of the individuals"                   
## 11 "results for the supplementary categorical variables"
## 12 "coord. for the supplementary categories"            
## 13 "v-test of the supplementary categories"             
## 14 "summary statistics"                                 
## 15 "mean of the variables"                              
## 16 "standard error of the variables"                    
## 17 "weights for the individuals"                        
## 18 "weights for the variables"

How many dimensions are required for the analysis?

The number of principal components is always less than or equal to the number of active variables. Examining the eigenvalues, i.e. the proportion of variance (inertia) explained by each principal component, allows us to determine the adequate number of components to retain for interpretation. This proportion can be read as a measure of the quality of the data representation. As the components are orthogonal, their proportions can be added together.

res.pca$eig %>% round(., 3) %>% knitr::kable()

|        | eigenvalue | percentage of variance | cumulative percentage of variance |
|--------|------------|------------------------|-----------------------------------|
| comp 1 | 2.884 | 41.196 | 41.196 |
| comp 2 | 1.517 | 21.676 | 62.872 |
| comp 3 | 0.913 | 13.038 | 75.910 |
| comp 4 | 0.798 | 11.401 | 87.311 |
| comp 5 | 0.622 | 8.892 | 96.203 |
| comp 6 | 0.170 | 2.422 | 98.625 |
| comp 7 | 0.096 | 1.375 | 100.000 |
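A quick arithmetic check on this table: with 7 standardized active variables the eigenvalues sum to 7, and each percentage is simply eigenvalue / 7 × 100 (e.g. 2.884 / 7 × 100 ≈ 41.2%).

sum(res.pca$eig[, "eigenvalue"])        # ~7, the number of active variables
res.pca$eig[, "eigenvalue"] / 7 * 100   # reproduces the percentage column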

This information can be plotted as a bar chart (scree plot):

fviz_screeplot(res.pca) +
  geom_line(size = 1) +
  geom_point(shape = 21, size = 5, color = "black", fill = "gold") +
  geom_text(aes(label = round(eig, 2)), nudge_y = 3, nudge_x = 0.1)

The cumulative percentages indicate that about 63% of the total variability of the individual (or variable) cloud can be represented by the first two components alone. However, the variability of the profiles cannot be fully summarized by these 2 dimensions: the cumulative percentage only reaches 87% when the 3rd and 4th principal components are also considered.

A) Variables

res.pca$var$cos2 %>% round(., 3) %>% knitr::kable(caption = 'cosine squared')

cosine squared

|                       | Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 |
|-----------------------|-------|-------|-------|-------|-------|
| analgesia             | 0.703 | 0.002 | 0.011 | 0.130 | 0.116 |
| antitussive           | 0.721 | 0.070 | 0.044 | 0.045 | 0.027 |
| constipation          | 0.326 | 0.290 | 0.139 | 0.162 | 0.047 |
| respiratorydepression | 0.068 | 0.273 | 0.623 | 0.019 | 0.003 |
| abuseliability        | 0.815 | 0.013 | 0.001 | 0.104 | 0.000 |
| elimination           | 0.150 | 0.345 | 0.007 | 0.338 | 0.154 |
| analgesiaduration     | 0.101 | 0.524 | 0.088 | 0.000 | 0.276 |
res.pca$var$contrib %>% round(., 3) %>% knitr::kable(caption = 'Contribution')

Contribution

|                       | Dim.1  | Dim.2  | Dim.3  | Dim.4  | Dim.5  |
|-----------------------|--------|--------|--------|--------|--------|
| analgesia             | 24.378 | 0.161  | 1.214  | 16.292 | 18.557 |
| antitussive           | 25.014 | 4.630  | 4.782  | 5.673  | 4.316  |
| constipation          | 11.290 | 19.080 | 15.201 | 20.278 | 7.542  |
| respiratorydepression | 2.366  | 17.998 | 68.263 | 2.369  | 0.402  |
| abuseliability        | 28.245 | 0.864  | 0.086  | 12.979 | 0.006  |
| elimination           | 5.208  | 22.710 | 0.818  | 42.351 | 24.797 |
| analgesiaduration     | 3.499  | 34.558 | 9.635  | 0.058  | 44.379 |
d1 <- fviz_contrib(res.pca, choice = "var", axes = 1, fill = "#c170d8", color = "black") +
  labs(title = "Contributions to Dim 1") +
  geom_text(aes(label = round(contrib, 2)), nudge_y = 2, color = "blue") +
  coord_flip()

d2 <- fviz_contrib(res.pca, choice = "var", axes = 2, fill = "gold", color = "black") +
  labs(title = "Contributions to Dim 2") +
  geom_text(aes(label = round(contrib, 2)), nudge_y = 2, color = "blue") +
  coord_flip()

library(gridExtra)
grid.arrange(d1, d2)

fviz_pca_var(res.pca, col.var = "cos2", alpha.var = 0.8) +
  scale_color_gradient(low = "blue", high = "red") +
  theme_minimal()

fviz_pca_var(res.pca, axes = c(3, 4), col.var = "cos2", alpha.var = 0.8) +
  scale_color_gradient(low = "blue", high = "red") +
  theme_minimal()

The Analgesia, Abuse liability and Antitussive scores are positively correlated, and these 3 metrics contribute the most to the construction of the 1st component (dimension 1), while Analgesia duration, Elimination and Respiratory depression contribute the most to the 2nd component.

The cosine squared (cos²) results indicate that the Analgesia, Abuse liability and Antitussive scores are well correlated with dimension 1. Only weak correlations were found between the features and dimension 2, and Respiratory depression is better correlated with dimension 3.
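Because the components are orthogonal, the cos² values of a variable can be added across dimensions to measure how well that variable is represented on a given plane; a short sketch:

# quality of representation of each variable on the (Dim 1, Dim 2) plane
round(rowSums(res.pca$var$cos2[, 1:2]), 3)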

Thus, the 1st component mainly reflects the pharmacological effects, including the Analgesia effect, Abuse liability and Antitussive (cough-preventing) effect, whilst the 2nd dimension is mainly driven by pharmacokinetic features such as Analgesia duration and Elimination time. The 3rd component is mainly due to the Respiratory depression profile.
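FactoMineR can largely automate this interpretation: dimdesc() lists, for each dimension, the variables most significantly correlated with it (shown here as a sketch; output omitted).

# variables ranked by their correlation with each of the first three components
dimdesc(res.pca, axes = 1:3)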

B) Individuals

res.pca$ind$cos2 %>% round(., 3) %>% knitr::kable(caption = 'cosine squared')

cosine squared

|                 | Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 |
|-----------------|-------|-------|-------|-------|-------|
| buprenorphine   | 0.006 | 0.541 | 0.003 | 0.144 | 0.001 |
| butorphanol     | 0.271 | 0.059 | 0.022 | 0.432 | 0.157 |
| codeine         | 0.000 | 0.259 | 0.164 | 0.100 | 0.267 |
| heroine         | 0.790 | 0.038 | 0.023 | 0.005 | 0.087 |
| hydromorphone   | 0.576 | 0.031 | 0.115 | 0.184 | 0.000 |
| levorphanol     | 0.520 | 0.369 | 0.009 | 0.002 | 0.097 |
| mepriridine     | 0.016 | 0.511 | 0.007 | 0.380 | 0.028 |
| methadone       | 0.295 | 0.178 | 0.004 | 0.233 | 0.286 |
| morphine        | 0.622 | 0.124 | 0.223 | 0.006 | 0.021 |
| nalbuphine      | 0.289 | 0.178 | 0.024 | 0.234 | 0.248 |
| oxycodone       | 0.472 | 0.222 | 0.002 | 0.002 | 0.000 |
| oxymorphine     | 0.527 | 0.007 | 0.220 | 0.134 | 0.079 |
| pentazocine     | 0.243 | 0.041 | 0.043 | 0.448 | 0.214 |
| propoxyphene    | 0.574 | 0.005 | 0.023 | 0.144 | 0.148 |
| nalorphine      | 0.556 | 0.003 | 0.321 | 0.034 | 0.062 |
| levallorphan    | 0.563 | 0.028 | 0.052 | 0.250 | 0.053 |
| cyclazocine     | 0.609 | 0.215 | 0.048 | 0.078 | 0.023 |
| naloxone        | 0.739 | 0.004 | 0.207 | 0.004 | 0.028 |
| naltrexone      | 0.301 | 0.366 | 0.027 | 0.189 | 0.113 |
| alfentanil      | 0.085 | 0.857 | 0.039 | 0.001 | 0.006 |
| alphaprodine    | 0.018 | 0.684 | 0.283 | 0.002 | 0.011 |
| fentanyl        | 0.001 | 0.789 | 0.008 | 0.116 | 0.028 |
| meptazinol      | 0.857 | 0.130 | 0.002 | 0.005 | 0.006 |
| norpropoxyphene | 0.532 | 0.355 | 0.001 | 0.058 | 0.003 |
| sufentanil      | 0.515 | 0.385 | 0.030 | 0.001 | 0.054 |
| newdrug1        | 0.012 | 0.011 | 0.863 | 0.006 | 0.054 |
| newdrug2        | 0.029 | 0.193 | 0.674 | 0.033 | 0.008 |
res.pca$ind$contrib %>% round(., 3) %>% knitr::kable(caption = 'Contribution')

Contribution

|                 | Dim.1  | Dim.2  | Dim.3  | Dim.4  | Dim.5  |
|-----------------|--------|--------|--------|--------|--------|
| buprenorphine   | 0.009  | 1.496  | 0.016  | 0.756  | 0.006  |
| butorphanol     | 1.083  | 0.448  | 0.273  | 6.242  | 2.903  |
| codeine         | 0.000  | 1.316  | 1.381  | 0.963  | 3.301  |
| heroine         | 13.117 | 1.192  | 1.205  | 0.328  | 6.722  |
| hydromorphone   | 3.419  | 0.347  | 2.164  | 3.948  | 0.013  |
| levorphanol     | 8.518  | 11.469 | 0.445  | 0.101  | 7.366  |
| mepriridine     | 0.144  | 8.915  | 0.190  | 12.615 | 1.178  |
| methadone       | 8.018  | 9.176  | 0.324  | 22.914 | 36.026 |
| morphine        | 5.872  | 2.226  | 6.660  | 0.200  | 0.905  |
| nalbuphine      | 1.521  | 1.779  | 0.405  | 4.459  | 6.053  |
| oxycodone       | 1.576  | 1.406  | 0.024  | 0.022  | 0.004  |
| oxymorphine     | 2.513  | 0.062  | 3.313  | 2.316  | 1.748  |
| pentazocine     | 1.390  | 0.447  | 0.772  | 9.255  | 5.680  |
| propoxyphene    | 3.536  | 0.057  | 0.451  | 3.196  | 4.222  |
| nalorphine      | 5.877  | 0.065  | 10.729 | 1.307  | 3.017  |
| levallorphan    | 6.719  | 0.646  | 1.967  | 10.775 | 2.949  |
| cyclazocine     | 7.181  | 4.811  | 1.774  | 3.331  | 1.261  |
| naloxone        | 10.913 | 0.119  | 9.647  | 0.211  | 1.948  |
| naltrexone      | 5.222  | 12.088 | 1.455  | 11.874 | 9.135  |
| alfentanil      | 0.613  | 11.722 | 0.880  | 0.034  | 0.192  |
| alphaprodine    | 0.099  | 7.339  | 5.046  | 0.041  | 0.276  |
| fentanyl        | 0.004  | 5.956  | 0.100  | 1.659  | 0.514  |
| meptazinol      | 4.283  | 1.233  | 0.027  | 0.086  | 0.132  |
| norpropoxyphene | 4.053  | 5.146  | 0.026  | 1.604  | 0.122  |
| sufentanil      | 3.835  | 5.446  | 0.716  | 0.024  | 1.861  |
| newdrug1        | 0.096  | 0.163  | 21.345 | 0.156  | 1.973  |
| newdrug2        | 0.392  | 4.931  | 28.664 | 1.583  | 0.492  |
fviz_pca_ind(res.pca, repel = TRUE, col.ind = "cos2", alpha.ind = 0.7) +
  scale_color_gradient(low = "blue", high = "red") +
  theme_minimal()

fviz_pca_ind(res.pca, axes = c(3, 4), repel = TRUE, col.ind = "cos2", alpha.ind = 0.7) +
  scale_color_gradient(low = "blue", high = "red") +
  theme_minimal()

The 1st principal component indicates the principal axis of variability between the drugs, for example between Naloxone and Oxymorphine, which lie at opposite ends of this axis. Returning to the original dataframe, we can see that these 2 drugs are among the most extreme in terms of analgesic effect: the Analgesia score of Oxymorphine (8) is much higher than that of Naloxone (1).

The 2nd principal component represents the characteristic that best separates the drugs once the 1st component is set aside. It singles out naltrexone and alfentanil, which have, respectively, a long and a very short duration of analgesia.
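This reading can be verified directly from the individual coordinates (tabulated below): the most extreme drugs on dimension 2 are indeed alfentanil on one side and naltrexone (with levorphanol) on the other.

dim2 <- sort(res.pca$ind$coord[, 2])
head(dim2, 2)   # most negative on Dim 2: alfentanil, alphaprodine
tail(dim2, 2)   # most positive on Dim 2: levorphanol, naltrexone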

Our study question directly implies a study of individuals. A drug can obviously be located by its coordinates on the PCA plot. For example, on the projection plane of PC1 and PC2, Morphine, Hydromorphone and Oxymorphine show similar profiles, while Naloxone and Oxymorphine are opposed. Naltrexone and Heroin appear to be rather original drugs, since their profiles are relatively isolated from the others.

Therefore, new drugs 1 and 2 can be classified by their projected coordinates on a chosen component. The coordinates of the 27 drugs can be extracted as follows:

res.pca$ind$coord %>% round(., 3) %>% knitr::kable(caption = 'Coordinates')

Coordinates

|                 | Dim.1  | Dim.2  | Dim.3  | Dim.4  | Dim.5  |
|-----------------|--------|--------|--------|--------|--------|
| buprenorphine   | -0.084 | 0.783  | 0.063  | -0.404 | 0.031  |
| butorphanol     | -0.918 | 0.428  | 0.259  | -1.160 | -0.698 |
| codeine         | -0.013 | -0.734 | -0.583 | 0.455  | 0.745  |
| heroine         | 3.196  | 0.699  | 0.545  | 0.266  | 1.063  |
| hydromorphone   | 1.631  | -0.377 | 0.730  | -0.922 | 0.047  |
| levorphanol     | 2.575  | 2.168  | -0.331 | 0.147  | 1.113  |
| mepriridine     | -0.334 | 1.911  | 0.216  | -1.649 | 0.445  |
| methadone       | 2.499  | 1.939  | 0.283  | 2.222  | -2.461 |
| morphine        | 2.138  | -0.955 | 1.281  | -0.207 | 0.390  |
| nalbuphine      | -1.088 | 0.854  | 0.316  | -0.980 | -1.009 |
| oxycodone       | 1.108  | -0.759 | -0.077 | -0.069 | -0.026 |
| oxymorphine     | 1.399  | -0.160 | 0.904  | -0.706 | -0.542 |
| pentazocine     | -1.040 | 0.428  | 0.436  | -1.412 | -0.977 |
| propoxyphene    | -1.659 | -0.153 | -0.333 | -0.830 | -0.842 |
| nalorphine      | -2.139 | -0.163 | 1.626  | 0.531  | 0.712  |
| levallorphan    | -2.287 | 0.514  | -0.696 | 1.524  | -0.704 |
| cyclazocine     | -2.365 | -1.404 | -0.661 | 0.847  | 0.460  |
| naloxone        | -2.915 | 0.221  | 1.542  | 0.213  | 0.572  |
| naltrexone      | -2.016 | 2.225  | 0.599  | 1.600  | 1.239  |
| alfentanil      | 0.691  | -2.191 | -0.466 | 0.086  | -0.180 |
| alphaprodine    | -0.278 | -1.734 | -1.115 | 0.095  | -0.215 |
| fentanyl        | -0.056 | -1.562 | 0.157  | 0.598  | -0.294 |
| meptazinol      | -1.826 | -0.711 | -0.082 | -0.136 | -0.149 |
| norpropoxyphene | 1.776  | -1.452 | -0.080 | 0.588  | -0.143 |
| sufentanil      | 1.728  | -1.494 | 0.420  | 0.072  | 0.559  |
| newdrug1        | -0.273 | 0.258  | -2.293 | -0.184 | 0.576  |
| newdrug2        | 0.552  | 1.421  | -2.658 | -0.584 | 0.288  |
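One simple way to turn these coordinates into an answer to the study question is a nearest-neighbour reading of the reduced space. A hedged sketch, not part of the original analysis: compute the distances between drugs on the first three components (about 76% of the total variance) and list the standard opioids closest to each new drug.

# nearest standard opioids to the new drugs in PC space
coords <- res.pca$ind$coord[, 1:3]
d_pc <- as.matrix(dist(coords))
standards <- setdiff(rownames(d_pc), c("newdrug1", "newdrug2"))
sort(d_pc["newdrug1", standards])[1:3]  # closest standards to new drug 1
sort(d_pc["newdrug2", standards])[1:3]  # closest standards to new drug 2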

The biplot is a graph that represents two sets of objects of different natures, for example individuals and variables, on the same plane.

fviz_pca_biplot(res.pca, label = "var", addEllipses = TRUE, ellipse.level = 0.95) +
  geom_point(shape = 21, aes(fill = df$drug), color = "black", size = 3) +
  theme(legend.position = "none") +
  geom_text(aes(label = name, color = name), alpha = 0.5, nudge_y = 0.2)

The contribution of each variable or individual to the construction of the principal components can be explored with contribution plots, as follows:

fviz_contrib(res.pca, choice="ind", axes = 2,fill="gold",color="black")+geom_text(aes(label=round(contrib,2)),nudge_y=1,color="blue")+labs(title = "Contributions to Dim 2")+coord_flip()

fviz_contrib(res.pca, choice="ind", axes = 3,fill="gold",color="black")+geom_text(aes(label=round(contrib,2)),nudge_y=1,color="blue")+labs(title = "Contributions to Dim 3")+coord_flip()

fviz_contrib(res.pca, choice="var", axes = 1:3,fill="gold",color="black")+geom_text(aes(label=round(contrib,2)),nudge_y=1,color="blue")+labs(title = "Contributions to Dim 1-3")

Conclusion: Principal component analysis (PCA) is a simple but powerful tool for exploring large datasets with many quantitative variables. The goal of PCA is to represent the data in a space of reduced dimensionality from the best viewpoint, thus enabling an optimal visualization of the information hidden in multivariate data with only a few synthetic components. The PCA results can be interpreted from either the individual or the variable viewpoint.

END