Read in the data
setwd("C:/Users/Jesse/Desktop/Grad/Thesis/Data/Analysis/Datasets/Multivariate")
alldata = read.table("af.bug.test.1.csv", T, sep=",")
alldataPCA = prcomp(alldata[, c(13:47)], center = T,scale=T)
First here is the clustering. This is basically the code that you initally sent me with a few adjustments made to the dataset omitting outliers.
alldata = read.table("af.bug.test.1.csv", T, sep=",")
###################################################
# Create prcomp, omitting categorical data and column 40 (all zero)
###################################################
alldataPCA = prcomp(alldata[, c(13:47)], center = T,scale=T)
PCA.scores = data.frame(alldata$sample, alldata$wp, alldata$depth,
alldata$macrophyte, alldata$codom, alldata$mixed, alldata$position,
alldata$complexity,alldata$abundance,alldata$deep.rank,
alldata$cwd, alldata$location,
round(alldataPCA$x, 3))
write.table(PCA.scores, "alldataPCA.csv", quote=F, row.names=F, col.names=T, sep=",")
PCA.variables = data.frame(round(alldataPCA$rotation, 3))
write.table(PCA.variables, "PCA_variables.csv", quote=F, row.names=T, col.names=NA, sep=",")
newdata = read.table("alldataPCA.csv", T, sep=",")
distances = dist(newdata[, c(13:15)], method="euclidean")
eward = hclust(distances, method="ward.D")
plot(eward, labels=alldata$sample, hang=-0.1, cex=0.65, xlab=" ", sub=" ",
main="PC1-3", ylab="Euclidean Distance")
We see the two strong groups as we’ve discussded many times.
Now I will run a Kruskal Wallis test between the two groups using the variables of the top six variables in PCA1 (and then PCA2). There is a summary table following the code.
No stastical difference can be seen between the group medians of the top 6 variables of PCA1.
setwd("C:/Users/Jesse/Desktop/Grad/Thesis/Data/Analysis/Datasets/Multivariate")
PCA1.Compare<-read.csv("PCA1_top6.csv")
kruskal.test(pl_fe ~ group, data = PCA1.Compare)
##
## Kruskal-Wallis rank sum test
##
## data: pl_fe by group
## Kruskal-Wallis chi-squared = 0.025791, df = 1, p-value = 0.8724
kruskal.test(ap_rh ~ group, data = PCA1.Compare)
##
## Kruskal-Wallis rank sum test
##
## data: ap_rh by group
## Kruskal-Wallis chi-squared = 0.51151, df = 1, p-value = 0.4745
kruskal.test(di_06 ~ group, data = PCA1.Compare)
##
## Kruskal-Wallis rank sum test
##
## data: di_06 by group
## Kruskal-Wallis chi-squared = 0.76, df = 1, p-value = 0.3833
kruskal.test(di_07 ~ group, data = PCA1.Compare)
##
## Kruskal-Wallis rank sum test
##
## data: di_07 by group
## Kruskal-Wallis chi-squared = 0.026928, df = 1, p-value = 0.8697
kruskal.test(di_ce ~ group, data = PCA1.Compare)
##
## Kruskal-Wallis rank sum test
##
## data: di_ce by group
## Kruskal-Wallis chi-squared = 0.81915, df = 1, p-value = 0.3654
kruskal.test(di_02 ~ group, data = PCA1.Compare)
##
## Kruskal-Wallis rank sum test
##
## data: di_02 by group
## Kruskal-Wallis chi-squared = 1.837, df = 1, p-value = 0.1753
Here are the comparison between the two groups based on PCA2. Stastical difference can be seen between the group medians of tr_01, pl_gy, hy_01, and co_ex.The differences tested are the medians of the top 6 variable of PCA2.
PCA2.Compare<-read.csv("PCA2_top6.csv")
kruskal.test(tr_01 ~ group, data = PCA2.Compare)
##
## Kruskal-Wallis rank sum test
##
## data: tr_01 by group
## Kruskal-Wallis chi-squared = 6.1366, df = 1, p-value = 0.01324
kruskal.test(pl_gy ~ group, data = PCA2.Compare)
##
## Kruskal-Wallis rank sum test
##
## data: pl_gy by group
## Kruskal-Wallis chi-squared = 25.408, df = 1, p-value = 4.639e-07
kruskal.test(ep_ba ~ group, data = PCA2.Compare)
##
## Kruskal-Wallis rank sum test
##
## data: ep_ba by group
## Kruskal-Wallis chi-squared = 1.5553, df = 1, p-value = 0.2123
kruskal.test(hy_01 ~ group, data = PCA2.Compare)
##
## Kruskal-Wallis rank sum test
##
## data: hy_01 by group
## Kruskal-Wallis chi-squared = 5.4081, df = 1, p-value = 0.02004
kruskal.test(co_ex ~ group, data = PCA2.Compare)
##
## Kruskal-Wallis rank sum test
##
## data: co_ex by group
## Kruskal-Wallis chi-squared = 5.1804, df = 1, p-value = 0.02284
kruskal.test(he_01 ~ group, data = PCA2.Compare)
##
## Kruskal-Wallis rank sum test
##
## data: he_01 by group
## Kruskal-Wallis chi-squared = 0.76, df = 1, p-value = 0.3833
Summary of PCA 1 compairisons.
PCA 1 Top Loadings | Compairison | Result |
---|---|---|
pl_fe | Group 1 vs Group 2 | Not Sig |
ap_rh | Group 1 vs Group 2 | Not Sig |
di_06 | Group 1 vs Group 2 | Not Sig |
di_07 | Group 1 vs Group 2 | Not Sig |
di_ce | Group 1 vs Group 2 | Not Sig |
di_02 | Group 1 vs Group 2 | Not Sig |
Summary of PCA 2 compairisons.
PCA 2 Top Loadings | Compairison | Result |
---|---|---|
tr_01 | Group 1 vs Group 2 | Significant |
pl_gy | Group 1 vs Group 2 | Significant |
ep_ba | Group 1 vs Group 2 | Not Sig |
hy_01 | Group 1 vs Group 2 | Significant |
co_ex | Group 1 vs Group 2 | Significant |
he_01 | Group 1 vs Group 2 | Not Sig |
The Kruskal Wallis test results tell me that the taxa that generate the variability in PCA2 are more influential on the the two clustering groups than are any of the taxa indicated by PCA1.
I have used the package factoextra
for the PCA plots. I don’t think you will need to run any of the following but I included the plots in case you need to reference any of the PCA plots to help explain anyting.
Here is the sample ordinations on PCA1 and PCA2. This is only included just in case it is needed. Note that the numbers in the plot do not match the dendrogram because they are based on the row number and not the actual sample number.
###################################################
#install.packages("factoextra") is required
library("factoextra")
labels<-alldata$sample
fviz_pca_ind(alldataPCA,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
legend.title = "Contribution",
repel = FALSE # Avoid text overlapping
)
Here is the variable loadings on PCA 1 and PCA 2. The relative contribution to the PCA is given as a color gradient which I thought was neat.
fviz_pca_var(alldataPCA,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
legend.title = "Contribution",
repel = F, # Avoid text overlapping
title = "Variable Loadings"
)
Here is the top 12 variable loadings on PCA 1 and PCA 2. Using this package was the only way I could find that would easily allow me to control which loadings were displayed. I think this shows what is driving PCA 1 and 2 much more clearly.
fviz_pca_var(alldataPCA,
select.var = list(cos2 = 12),
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
legend.title = "Contribution",
repel = F, # Avoid text overlapping
title = "Top 10 Variable Loadings"
)