Read in the data

setwd("C:/Users/Jesse/Desktop/Grad/Thesis/Data/Analysis/Datasets/Multivariate")
alldata = read.table("af.bug.test.1.csv", T, sep=",")
alldataPCA = prcomp(alldata[, c(13:47)], center = T,scale=T)

Clustering

First here is the clustering. This is basically the code that you initally sent me with a few adjustments made to the dataset omitting outliers.

alldata = read.table("af.bug.test.1.csv", T, sep=",")

###################################################
# Create prcomp, omitting categorical data and column 40 (all zero)
###################################################
alldataPCA = prcomp(alldata[, c(13:47)], center = T,scale=T)

PCA.scores = data.frame(alldata$sample, alldata$wp, alldata$depth,
             alldata$macrophyte, alldata$codom, alldata$mixed, alldata$position,
             alldata$complexity,alldata$abundance,alldata$deep.rank,
             alldata$cwd, alldata$location,
             round(alldataPCA$x, 3))

write.table(PCA.scores, "alldataPCA.csv", quote=F, row.names=F, col.names=T, sep=",")

PCA.variables = data.frame(round(alldataPCA$rotation, 3))
write.table(PCA.variables, "PCA_variables.csv", quote=F, row.names=T, col.names=NA, sep=",")

newdata = read.table("alldataPCA.csv", T, sep=",")

distances = dist(newdata[, c(13:15)], method="euclidean")
eward = hclust(distances, method="ward.D")
plot(eward, labels=alldata$sample, hang=-0.1, cex=0.65, xlab=" ", sub=" ",
      main="PC1-3", ylab="Euclidean Distance")

We see the two strong groups as we’ve discussded many times.

Kruskal Wallis Tests

Now I will run a Kruskal Wallis test between the two groups using the variables of the top six variables in PCA1 (and then PCA2). There is a summary table following the code.

PCA1

No stastical difference can be seen between the group medians of the top 6 variables of PCA1.

setwd("C:/Users/Jesse/Desktop/Grad/Thesis/Data/Analysis/Datasets/Multivariate")

PCA1.Compare<-read.csv("PCA1_top6.csv")
kruskal.test(pl_fe ~ group, data = PCA1.Compare)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  pl_fe by group
## Kruskal-Wallis chi-squared = 0.025791, df = 1, p-value = 0.8724

kruskal.test(ap_rh ~ group, data = PCA1.Compare)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  ap_rh by group
## Kruskal-Wallis chi-squared = 0.51151, df = 1, p-value = 0.4745

kruskal.test(di_06 ~ group, data = PCA1.Compare)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  di_06 by group
## Kruskal-Wallis chi-squared = 0.76, df = 1, p-value = 0.3833

kruskal.test(di_07 ~ group, data = PCA1.Compare)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  di_07 by group
## Kruskal-Wallis chi-squared = 0.026928, df = 1, p-value = 0.8697

kruskal.test(di_ce ~ group, data = PCA1.Compare)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  di_ce by group
## Kruskal-Wallis chi-squared = 0.81915, df = 1, p-value = 0.3654

kruskal.test(di_02 ~ group, data = PCA1.Compare)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  di_02 by group
## Kruskal-Wallis chi-squared = 1.837, df = 1, p-value = 0.1753

PCA2

Here are the comparison between the two groups based on PCA2. Stastical difference can be seen between the group medians of tr_01, pl_gy, hy_01, and co_ex.The differences tested are the medians of the top 6 variable of PCA2.

PCA2.Compare<-read.csv("PCA2_top6.csv")
kruskal.test(tr_01 ~ group, data = PCA2.Compare)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  tr_01 by group
## Kruskal-Wallis chi-squared = 6.1366, df = 1, p-value = 0.01324

kruskal.test(pl_gy ~ group, data = PCA2.Compare)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  pl_gy by group
## Kruskal-Wallis chi-squared = 25.408, df = 1, p-value = 4.639e-07

kruskal.test(ep_ba ~ group, data = PCA2.Compare)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  ep_ba by group
## Kruskal-Wallis chi-squared = 1.5553, df = 1, p-value = 0.2123

kruskal.test(hy_01 ~ group, data = PCA2.Compare)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  hy_01 by group
## Kruskal-Wallis chi-squared = 5.4081, df = 1, p-value = 0.02004

kruskal.test(co_ex ~ group, data = PCA2.Compare)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  co_ex by group
## Kruskal-Wallis chi-squared = 5.1804, df = 1, p-value = 0.02284

kruskal.test(he_01 ~ group, data = PCA2.Compare)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  he_01 by group
## Kruskal-Wallis chi-squared = 0.76, df = 1, p-value = 0.3833

Summary of PCA 1 compairisons.

PCA 1 Top Loadings	Compairison	Result
pl_fe	Group 1 vs Group 2	Not Sig
ap_rh	Group 1 vs Group 2	Not Sig
di_06	Group 1 vs Group 2	Not Sig
di_07	Group 1 vs Group 2	Not Sig
di_ce	Group 1 vs Group 2	Not Sig
di_02	Group 1 vs Group 2	Not Sig

Summary of PCA 2 compairisons.

PCA 2 Top Loadings	Compairison	Result
tr_01	Group 1 vs Group 2	Significant
pl_gy	Group 1 vs Group 2	Significant
ep_ba	Group 1 vs Group 2	Not Sig
hy_01	Group 1 vs Group 2	Significant
co_ex	Group 1 vs Group 2	Significant
he_01	Group 1 vs Group 2	Not Sig

Question

The Kruskal Wallis test results tell me that the taxa that generate the variability in PCA2 are more influential on the the two clustering groups than are any of the taxa indicated by PCA1.

PCA Plots (if needed)

I have used the package factoextra for the PCA plots. I don’t think you will need to run any of the following but I included the plots in case you need to reference any of the PCA plots to help explain anyting.

Here is the sample ordinations on PCA1 and PCA2. This is only included just in case it is needed. Note that the numbers in the plot do not match the dendrogram because they are based on the row number and not the actual sample number.

###################################################
#install.packages("factoextra") is required
library("factoextra")
labels<-alldata$sample
fviz_pca_ind(alldataPCA,
             col.ind = "cos2", # Color by the quality of representation
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             legend.title = "Contribution",
             repel = FALSE     # Avoid text overlapping
)

Here is the variable loadings on PCA 1 and PCA 2. The relative contribution to the PCA is given as a color gradient which I thought was neat.

fviz_pca_var(alldataPCA,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             legend.title = "Contribution",
             repel = F,     # Avoid text overlapping
             title = "Variable Loadings"
)

Here is the top 12 variable loadings on PCA 1 and PCA 2. Using this package was the only way I could find that would easily allow me to control which loadings were displayed. I think this shows what is driving PCA 1 and 2 much more clearly.

fviz_pca_var(alldataPCA,
             select.var = list(cos2 = 12),
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             legend.title = "Contribution",
             repel = F,     # Avoid text overlapping
             title = "Top 10 Variable Loadings"
)

PCA questions