Foreword: About the Machine Learning in Medicine (MLM) project
The MLM project has been initialized in 2016 and aims to:
1.Encourage using Machine Learning techniques in medical research in Vietnam 2.Promote the use of R statistical programming language, an open source and leading tool for practicing data science.
Background
In this study, the question is how to classify two newly developed opioid pain relievers into correct therapeutic regimens. To answer this question, these new drugs were assessed for their similarities with 25 other standard opioids. Studied profiles include the score (0-10) of Analgesia, Antitussive, Constipation, Respiratory depression effects, Abuse liability, as well as Eliminate time (half-life) and Duration of analgesia effect.
The original dataset was provided in Chaper 1, Machine Learning in Medicine-Cookbook 2 (T. J. Cleophas and A. H. Zwinderman, SpringerBriefs in Statistics 2014). You can get the original data in SPSS SAV format chap1nearestneighbors.sav) from their website: extras.springer.com.
df=read.spss("chap1nearestneighbors.sav",use.value.labels = TRUE,to.data.frame = TRUE)%>%as_tibble()%>%.[,-9]
row.names(df)<-df$drug
knitr::kable(
head(df,30),booktabs = TRUE,
caption = 'Opioids dataset'
)
| drug | analgesia | antitussive | constipation | respiratorydepression | abuseliability | elimination | analgesiaduration |
|---|---|---|---|---|---|---|---|
| buprenorphine | 7 | 4 | 5 | 7 | 4 | 5.0 | 9.0 |
| butorphanol | 7 | 3 | 4 | 7 | 4 | 2.7 | 4.0 |
| codeine | 5 | 6 | 6 | 5 | 4 | 2.9 | 7.0 |
| heroine | 8 | 6 | 8 | 8 | 10 | 9.0 | 15.0 |
| hydromorphone | 8 | 6 | 6 | 8 | 8 | 2.6 | 5.0 |
| levorphanol | 8 | 6 | 6 | 8 | 8 | 11.0 | 20.0 |
| mepriridine | 7 | 2 | 4 | 8 | 6 | 3.2 | 14.0 |
| methadone | 9 | 6 | 6 | 8 | 6 | 25.0 | 5.0 |
| morphine | 8 | 6 | 8 | 8 | 8 | 3.1 | 5.0 |
| nalbuphine | 7 | 2 | 4 | 7 | 4 | 5.1 | 4.5 |
| oxycodone | 6 | 6 | 6 | 6 | 8 | 5.0 | 4.0 |
| oxymorphine | 8 | 5 | 6 | 8 | 8 | 5.2 | 3.5 |
| pentazocine | 7 | 2 | 4 | 7 | 5 | 2.9 | 3.0 |
| propoxyphene | 5 | 2 | 4 | 5 | 5 | 3.3 | 2.0 |
| nalorphine | 2 | 3 | 6 | 8 | 1 | 1.4 | 3.2 |
| levallorphan | 3 | 2 | 5 | 4 | 1 | 11.0 | 5.0 |
| cyclazocine | 2 | 3 | 6 | 3 | 2 | 1.6 | 2.8 |
| naloxone | 1 | 2 | 5 | 8 | 1 | 1.2 | 3.0 |
| naltrexone | 1 | 3 | 5 | 8 | 0 | 9.7 | 14.0 |
| alfentanil | 7 | 6 | 7 | 4 | 6 | 1.6 | 0.5 |
| alphaprodine | 6 | 5 | 6 | 3 | 5 | 2.2 | 2.0 |
| fentanyl | 6 | 5 | 7 | 5 | 4 | 3.7 | 0.5 |
| meptazinol | 4 | 3 | 5 | 5 | 3 | 1.6 | 2.0 |
| norpropoxyphene | 8 | 6 | 8 | 5 | 7 | 6.0 | 4.0 |
| sufentanil | 7 | 6 | 8 | 6 | 8 | 2.6 | 5.0 |
| newdrug1 | 5 | 5 | 4 | 3 | 6 | 5.0 | 12.0 |
| newdrug2 | 8 | 6 | 3 | 4 | 5 | 7.0 | 16.0 |
The dataframe consists of 27 rows that correspond to 27 drug names, 1 categorical variable that indicates drugs names and 7 quantitative variables. This study design implicates two approaches of data exploration: individuals and variables studying. As we know, any dataframe could be explored in two ways, using either rows (individuals) or columns (variables).
Individuals study
Classifying individuals means identifying the similarity (or dissimilarity) between them from the point of view of our features (variables), i.e identifying which individuals are the most similar/dissimilar? In our study two drugs are considered similar if they have the similar profile based on 9 variables. If so, these 2 drugs present the same dimension of variability. Thus, we can classify a new drug into appropriate treatment group as they are sharing the same profile. In general, looking at individuals might help to identify a pattern that allows developing classification rule or models.
The study on individuals could be illustrated by a scatter-dot matrix graph (Figure 1). This type of graph allows to describe all individuals (a cloud of points i) in terms of pair of variables (two dimensions). Each combination in the matrix represents a situation, or point of view and some of them can eventually show an organised structure of individuals with a clear distance between the subgroups. Such presentation allows a straightforward interpretation of data, as each component graph is two-dimensional.
However, when there are many variables in the data, we will need a better way to identify the multi-dimensional space in which those individuals locate.
plotfuncLow <- function(data,mapping){
p <- ggplot(data = data,mapping=mapping)+geom_point(aes(fill=df$drug),shape=21,color="black")+geom_smooth(method="lm",alpha=0.5)
p
}
library(GGally)
ggpairs(df[,-1],lower=list(continuous=plotfuncLow),diag=NULL)
Variables study
An alternative approach is to explore the data by looking at the variables (features). Variable study aims to reveal the link between all variables in the dataset. Though many kinds of relationship could exist, like logarithmic, exponential or polynomial… but in practice we are usually interested in linear correlation. Such correlation could be evaluated by plotting a correlation matrix between our quantitative variables (Figure 2).
cor.mtest <- function(mat, conf.level = 0.95){
mat <- as.matrix(mat)
n <- ncol(mat)
p.mat <- lowCI.mat <- uppCI.mat <- matrix(NA, n, n)
diag(p.mat) <- 0
diag(lowCI.mat) <- diag(uppCI.mat) <- 1
for(i in 1:(n-1)){
for(j in (i+1):n){
tmp <- cor.test(mat[,i], mat[,j], conf.level = conf.level)
p.mat[i,j] <- p.mat[j,i] <- tmp$p.value
lowCI.mat[i,j] <- lowCI.mat[j,i] <- tmp$conf.int[1]
uppCI.mat[i,j] <- uppCI.mat[j,i] <- tmp$conf.int[2]
}
}
return(list(p.mat, lowCI.mat, uppCI.mat))
}
cormat<-df[,-1]%>%cor.mtest(.,0.95)
library("corrplot")
library(viridis)
df[,-1]%>%cor(.,method="pearson")%>%corrplot(.,p.mat=cormat[[1]],sig.level=0.05,order="hclust",type="lower",method="pie",tl.col="black", tl.srt=45,col=viridis::plasma(n=100,begin =0.9, end = 0.4))
The correlation matrix indicated a significant linear relationship between some features whilst in other pairs the relationship is weaker and less significant. We could see that studying variable works in the same way as studying individuals: in both cases the data could be visually reorganized in clusters (variable groups or individual groups). As for the individuals, the variable study aims to identify a variable pattern, with variable groups that represent a strong relationship among them. By doing this, the number of dimension of original dataset could be reduced so each group could be represented by a single synthetic variable instead of many variables, as they are well correlated and would vary in the same way.
However, the scatter dot matrix and the correlogram alone did not provide a good answer for our main question.
In addition, correlation matrix and scatter dot matrix are only practical for dataset with small number of variables. When exploring larger dataset with too many variables, drawing conclusion from those graphs is more difficult.
Therefore we will need a more efficient method with ability to represent visually the important relationships between the variables. PCA (Principal component analysis) is our solution. This method allows reducing the dimension of original dataset by generating the synthetic variables called “principal components”.
Materials and method
Data analysis was performed in R statistical programming language. The FactoMineR and Factoextra packages will be used for conducting PCA and plotting the results as ggplot2 graphical objects.
library(FactoMineR)
library(factoextra)
First thing to do is determining the active and supplementary features. Only active variables contribute to the construction of principal components while supplementary variables are exclusively used for illustrative or interpretive purposes. Active elements can be determined for either individuals and variables.
When using PCA as an exploratory tool before model training, we usually choose the explanatory variable (features) of the model as active variables for PCA, and project the target (dependent variable) as supplementary variables. This could help to discover the relationship between them and eventually predict the model performance.
How we select the active elements ? It’s easy to decide in this case because all the individuals will be included as active individuals, and all 7 quantitative profiles will be used as active variables. In the PCA function, we specify that the drug names will be treated as categorical illustrative variable
Should we standardize our variables ? As there are 2 types of metrics (Score and Time), they were measured in different units and scales, those variables should be standardized before PCA. The relative role of each variable will be balanced after standardization.
Though the PCA provides the graph for individual and variable analysis, as well as other outputs, contained in the output object, we don’t need any graphic output as we will do that later using factorextra package.
res.pca <- PCA(df,quali.sup=1,scale.unit = T,graph = F)
print(res.pca)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 27 individuals, described by 8 variables
## *The results are available in the following objects:
##
## name
## 1 "$eig"
## 2 "$var"
## 3 "$var$coord"
## 4 "$var$cor"
## 5 "$var$cos2"
## 6 "$var$contrib"
## 7 "$ind"
## 8 "$ind$coord"
## 9 "$ind$cos2"
## 10 "$ind$contrib"
## 11 "$quali.sup"
## 12 "$quali.sup$coord"
## 13 "$quali.sup$v.test"
## 14 "$call"
## 15 "$call$centre"
## 16 "$call$ecart.type"
## 17 "$call$row.w"
## 18 "$call$col.w"
## description
## 1 "eigenvalues"
## 2 "results for the variables"
## 3 "coord. for the variables"
## 4 "correlations variables - dimensions"
## 5 "cos2 for the variables"
## 6 "contributions of the variables"
## 7 "results for the individuals"
## 8 "coord. for the individuals"
## 9 "cos2 for the individuals"
## 10 "contributions of the individuals"
## 11 "results for the supplementary categorical variables"
## 12 "coord. for the supplementary categories"
## 13 "v-test of the supplementary categories"
## 14 "summary statistics"
## 15 "mean of the variables"
## 16 "standard error of the variables"
## 17 "weights for the individuals"
## 18 "weights for the variables"
How many dimension would be required for the analysis ?
The number of principal components is always inferior or equal to the number of active variables. Examining the eigen values or inertia of variance explained associated to each principal components allows to determine the adequate number of components that should be used for interpretation. This ratio can be interpreted as a measure of quality of data representation. As the two components are orthogonal, their ratios could be added together.
res.pca$eig%>%round(.,3)%>%knitr::kable()
| eigenvalue | percentage of variance | cumulative percentage of variance | |
|---|---|---|---|
| comp 1 | 2.884 | 41.196 | 41.196 |
| comp 2 | 1.517 | 21.676 | 62.872 |
| comp 3 | 0.913 | 13.038 | 75.910 |
| comp 4 | 0.798 | 11.401 | 87.311 |
| comp 5 | 0.622 | 8.892 | 96.203 |
| comp 6 | 0.170 | 2.422 | 98.625 |
| comp 7 | 0.096 | 1.375 | 100.000 |
Those information can be plotted on a bar chart:
fviz_screeplot(res.pca)+geom_line(size=1)+geom_point(shape=21,size=5,color="black",fill="gold")+geom_text(aes(label=round(eig,2)),nudge_y =3,nudge_x=0.1)
The sum of relative importance indicate that up to 62% of total variability of the individual cloud / variable can be represented by only two components 1 and 2. However, the variability of profiles cannot be summarized by those 2 dimension alone, as the cumulative sum of percentage can reach 87% when considering 3rd and 4th principal components.
A) Variables
res.pca$var$cos2%>%round(.,3)%>%knitr::kable(caption = 'cosine squared')
| Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 | |
|---|---|---|---|---|---|
| analgesia | 0.703 | 0.002 | 0.011 | 0.130 | 0.116 |
| antitussive | 0.721 | 0.070 | 0.044 | 0.045 | 0.027 |
| constipation | 0.326 | 0.290 | 0.139 | 0.162 | 0.047 |
| respiratorydepression | 0.068 | 0.273 | 0.623 | 0.019 | 0.003 |
| abuseliability | 0.815 | 0.013 | 0.001 | 0.104 | 0.000 |
| elimination | 0.150 | 0.345 | 0.007 | 0.338 | 0.154 |
| analgesiaduration | 0.101 | 0.524 | 0.088 | 0.000 | 0.276 |
res.pca$var$contrib%>%round(.,3)%>%knitr::kable(caption = 'Contribution')
| Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 | |
|---|---|---|---|---|---|
| analgesia | 24.378 | 0.161 | 1.214 | 16.292 | 18.557 |
| antitussive | 25.014 | 4.630 | 4.782 | 5.673 | 4.316 |
| constipation | 11.290 | 19.080 | 15.201 | 20.278 | 7.542 |
| respiratorydepression | 2.366 | 17.998 | 68.263 | 2.369 | 0.402 |
| abuseliability | 28.245 | 0.864 | 0.086 | 12.979 | 0.006 |
| elimination | 5.208 | 22.710 | 0.818 | 42.351 | 24.797 |
| analgesiaduration | 3.499 | 34.558 | 9.635 | 0.058 | 44.379 |
d1=fviz_contrib(res.pca, choice="var", axes = 1,fill="#c170d8",color="black")+labs(title = "Contributions to Dim 1")+geom_text(aes(label=round(contrib,2)),nudge_y=2,color="blue")+coord_flip()
d2=fviz_contrib(res.pca, choice="var", axes = 2,fill="gold",color="black")+labs(title = "Contributions to Dim 2")+geom_text(aes(label=round(contrib,2)),nudge_y=2,color="blue")+coord_flip()
library(gridExtra)
grid.arrange(d1,d2)
fviz_pca_var(res.pca, col.var="cos2",alpha.var = 0.8)+scale_color_gradient(low="blue",high="red")+theme_minimal()
fviz_pca_var(res.pca, axes=c(3:4),col.var="cos2",alpha.var = 0.8)+scale_color_gradient(low="blue",high="red")+theme_minimal()
The Analgesia, Abuse liability and Antitusive scores are positively correlated, and those 3 metrics contribute the most to the construction of 1st component (dimension 1); while Alangesia duration, elimination and respiratory depression contribute the most to 2nd component.
The cosine square results indicate that Analgesia, Abuse liability and Antitusive score well correlated to the Dimension 1. Weak correlation was found between all features and the dimension 2. The respiratory depression was better correlated to Dimension 3.
Thus, the 1st component mainly implies the pharmacological effect, including Analgegia effect, abuse liability, antitussive (preventing coughs) whilst the 2nd dimension is mainly due to pharmacokinetic features, such as analgesia duration and elimination time. The 3rd component is mainly due to Respiratory depression profile.
A) Individuals
res.pca$ind$cos2%>%round(.,3)%>%knitr::kable(caption = 'cosine squared')
| Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 | |
|---|---|---|---|---|---|
| buprenorphine | 0.006 | 0.541 | 0.003 | 0.144 | 0.001 |
| butorphanol | 0.271 | 0.059 | 0.022 | 0.432 | 0.157 |
| codeine | 0.000 | 0.259 | 0.164 | 0.100 | 0.267 |
| heroine | 0.790 | 0.038 | 0.023 | 0.005 | 0.087 |
| hydromorphone | 0.576 | 0.031 | 0.115 | 0.184 | 0.000 |
| levorphanol | 0.520 | 0.369 | 0.009 | 0.002 | 0.097 |
| mepriridine | 0.016 | 0.511 | 0.007 | 0.380 | 0.028 |
| methadone | 0.295 | 0.178 | 0.004 | 0.233 | 0.286 |
| morphine | 0.622 | 0.124 | 0.223 | 0.006 | 0.021 |
| nalbuphine | 0.289 | 0.178 | 0.024 | 0.234 | 0.248 |
| oxycodone | 0.472 | 0.222 | 0.002 | 0.002 | 0.000 |
| oxymorphine | 0.527 | 0.007 | 0.220 | 0.134 | 0.079 |
| pentazocine | 0.243 | 0.041 | 0.043 | 0.448 | 0.214 |
| propoxyphene | 0.574 | 0.005 | 0.023 | 0.144 | 0.148 |
| nalorphine | 0.556 | 0.003 | 0.321 | 0.034 | 0.062 |
| levallorphan | 0.563 | 0.028 | 0.052 | 0.250 | 0.053 |
| cyclazocine | 0.609 | 0.215 | 0.048 | 0.078 | 0.023 |
| naloxone | 0.739 | 0.004 | 0.207 | 0.004 | 0.028 |
| naltrexone | 0.301 | 0.366 | 0.027 | 0.189 | 0.113 |
| alfentanil | 0.085 | 0.857 | 0.039 | 0.001 | 0.006 |
| alphaprodine | 0.018 | 0.684 | 0.283 | 0.002 | 0.011 |
| fentanyl | 0.001 | 0.789 | 0.008 | 0.116 | 0.028 |
| meptazinol | 0.857 | 0.130 | 0.002 | 0.005 | 0.006 |
| norpropoxyphene | 0.532 | 0.355 | 0.001 | 0.058 | 0.003 |
| sufentanil | 0.515 | 0.385 | 0.030 | 0.001 | 0.054 |
| newdrug1 | 0.012 | 0.011 | 0.863 | 0.006 | 0.054 |
| newdrug2 | 0.029 | 0.193 | 0.674 | 0.033 | 0.008 |
res.pca$ind$contrib%>%round(.,3)%>%knitr::kable(caption = 'Contribution')
| Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 | |
|---|---|---|---|---|---|
| buprenorphine | 0.009 | 1.496 | 0.016 | 0.756 | 0.006 |
| butorphanol | 1.083 | 0.448 | 0.273 | 6.242 | 2.903 |
| codeine | 0.000 | 1.316 | 1.381 | 0.963 | 3.301 |
| heroine | 13.117 | 1.192 | 1.205 | 0.328 | 6.722 |
| hydromorphone | 3.419 | 0.347 | 2.164 | 3.948 | 0.013 |
| levorphanol | 8.518 | 11.469 | 0.445 | 0.101 | 7.366 |
| mepriridine | 0.144 | 8.915 | 0.190 | 12.615 | 1.178 |
| methadone | 8.018 | 9.176 | 0.324 | 22.914 | 36.026 |
| morphine | 5.872 | 2.226 | 6.660 | 0.200 | 0.905 |
| nalbuphine | 1.521 | 1.779 | 0.405 | 4.459 | 6.053 |
| oxycodone | 1.576 | 1.406 | 0.024 | 0.022 | 0.004 |
| oxymorphine | 2.513 | 0.062 | 3.313 | 2.316 | 1.748 |
| pentazocine | 1.390 | 0.447 | 0.772 | 9.255 | 5.680 |
| propoxyphene | 3.536 | 0.057 | 0.451 | 3.196 | 4.222 |
| nalorphine | 5.877 | 0.065 | 10.729 | 1.307 | 3.017 |
| levallorphan | 6.719 | 0.646 | 1.967 | 10.775 | 2.949 |
| cyclazocine | 7.181 | 4.811 | 1.774 | 3.331 | 1.261 |
| naloxone | 10.913 | 0.119 | 9.647 | 0.211 | 1.948 |
| naltrexone | 5.222 | 12.088 | 1.455 | 11.874 | 9.135 |
| alfentanil | 0.613 | 11.722 | 0.880 | 0.034 | 0.192 |
| alphaprodine | 0.099 | 7.339 | 5.046 | 0.041 | 0.276 |
| fentanyl | 0.004 | 5.956 | 0.100 | 1.659 | 0.514 |
| meptazinol | 4.283 | 1.233 | 0.027 | 0.086 | 0.132 |
| norpropoxyphene | 4.053 | 5.146 | 0.026 | 1.604 | 0.122 |
| sufentanil | 3.835 | 5.446 | 0.716 | 0.024 | 1.861 |
| newdrug1 | 0.096 | 0.163 | 21.345 | 0.156 | 1.973 |
| newdrug2 | 0.392 | 4.931 | 28.664 | 1.583 | 0.492 |
fviz_pca_ind(res.pca, repel = TRUE, col.ind = "cos2",alpha.ind=0.7)+scale_color_gradient(low="blue",high="red")+theme_minimal()
fviz_pca_ind(res.pca, axes=c(3:4), repel = TRUE, col.ind = "cos2",alpha.ind=0.7)+scale_color_gradient(low="blue",high="red")+theme_minimal()
The 1st principal component indicates the principal axis of variability between drugs, such as Naloxone and Oxymorphine. Returning to the original dataframe, we can see that those 2 drugs are the most extreme in terms of analgesia (or antitussive effects). Indeed, the Analgesia effect of Oxymorphine is stronger than that of Naloxone.
The 2nd principal component represents the characteristic that separate the drugs most significantly once the 1st component is not considered. This identified naltrexon and alfentanil, which has respectively the longest and shortest analgesia duration.
Our study question implicates directly a study of individuals. Obviously, a drug could be located by their coordinates on the PCA plot. For example; according to the projection plane of PC1 and PC2, Morphine, Hydromorphine and oxymorphine would represent similar profile, while Naloxone and Oxymorphine are opposed. Naltrexone or Heroin seem to be original drugs since their profile are relatively isolated from other drugs.
Therefore the New drugs 1 and 2 could be classified by their relative projected coordinate on a choosen component The coordinate of 27 drugs could be extracted as follow:
res.pca$ind$coord%>%round(.,3)%>%knitr::kable(caption = 'cosine squared')
| Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 | |
|---|---|---|---|---|---|
| buprenorphine | -0.084 | 0.783 | 0.063 | -0.404 | 0.031 |
| butorphanol | -0.918 | 0.428 | 0.259 | -1.160 | -0.698 |
| codeine | -0.013 | -0.734 | -0.583 | 0.455 | 0.745 |
| heroine | 3.196 | 0.699 | 0.545 | 0.266 | 1.063 |
| hydromorphone | 1.631 | -0.377 | 0.730 | -0.922 | 0.047 |
| levorphanol | 2.575 | 2.168 | -0.331 | 0.147 | 1.113 |
| mepriridine | -0.334 | 1.911 | 0.216 | -1.649 | 0.445 |
| methadone | 2.499 | 1.939 | 0.283 | 2.222 | -2.461 |
| morphine | 2.138 | -0.955 | 1.281 | -0.207 | 0.390 |
| nalbuphine | -1.088 | 0.854 | 0.316 | -0.980 | -1.009 |
| oxycodone | 1.108 | -0.759 | -0.077 | -0.069 | -0.026 |
| oxymorphine | 1.399 | -0.160 | 0.904 | -0.706 | -0.542 |
| pentazocine | -1.040 | 0.428 | 0.436 | -1.412 | -0.977 |
| propoxyphene | -1.659 | -0.153 | -0.333 | -0.830 | -0.842 |
| nalorphine | -2.139 | -0.163 | 1.626 | 0.531 | 0.712 |
| levallorphan | -2.287 | 0.514 | -0.696 | 1.524 | -0.704 |
| cyclazocine | -2.365 | -1.404 | -0.661 | 0.847 | 0.460 |
| naloxone | -2.915 | 0.221 | 1.542 | 0.213 | 0.572 |
| naltrexone | -2.016 | 2.225 | 0.599 | 1.600 | 1.239 |
| alfentanil | 0.691 | -2.191 | -0.466 | 0.086 | -0.180 |
| alphaprodine | -0.278 | -1.734 | -1.115 | 0.095 | -0.215 |
| fentanyl | -0.056 | -1.562 | 0.157 | 0.598 | -0.294 |
| meptazinol | -1.826 | -0.711 | -0.082 | -0.136 | -0.149 |
| norpropoxyphene | 1.776 | -1.452 | -0.080 | 0.588 | -0.143 |
| sufentanil | 1.728 | -1.494 | 0.420 | 0.072 | 0.559 |
| newdrug1 | -0.273 | 0.258 | -2.293 | -0.184 | 0.576 |
| newdrug2 | 0.552 | 1.421 | -2.658 | -0.584 | 0.288 |
The Biplot is a graph for representing two sets of objects of different nature , for example: individual and variable
fviz_pca_biplot(res.pca,label="var",addEllipses=TRUE,ellipse.level= 0.95)+geom_point(shape=21,aes(fill=df$drug),color="black",size=3)+theme(legend.position="none")+geom_text(aes(label=name,color=name),alpha=0.5,nudge_y=0.2)
The contribution of each variable or individual to construction of principal components could be explored by the Contribution plot as follow:
fviz_contrib(res.pca, choice="ind", axes = 2,fill="gold",color="black")+geom_text(aes(label=round(contrib,2)),nudge_y=1,color="blue")+labs(title = "Contributions to Dim 2")+coord_flip()
fviz_contrib(res.pca, choice="ind", axes = 3,fill="gold",color="black")+geom_text(aes(label=round(contrib,2)),nudge_y=1,color="blue")+labs(title = "Contributions to Dim 3")+coord_flip()
fviz_contrib(res.pca, choice="var", axes = 1:3,fill="gold",color="black")+geom_text(aes(label=round(contrib,2)),nudge_y=1,color="blue")+labs(title = "Contributions to Dim 1-3")
Conclusion: Principal component analysis (PCA) is a simple but powerful tool for exploring the large dataset with many quantitative variables. The goal of PCA is to represent the data in a space with reduced dimensionality with the best viewpoint, thus enabling an optimal visualization of the hidden information in the multivariate data with only a few synthetic components. The PCA results could be interpreted from either individual or variable viewpoints.
END