Click here for other works of the author on RPubs
library(knitr)
library(fpc)
library(vegan)
library(cluster)
library(purrr)
library(factoextra)
library(pander)
For this assignment, k-means clustering and hierarchical clustering will be performed to analyze the structure of the stations according to dominant copepode species.
#import copepod data
copdata = read.table("copepod_composition.txt", header = T)
#import species name
species_name <- read.table("copepodSPlist.txt", sep = "\t")
#assign copepod species name
rownames(copdata) <- as.character(unlist(species_name))
Dominant species are defined as species with a percentage of 2% or more in any given station.
#convert species frequency into percentage
copdata = copdata / 100
#find dominant species
dom = apply(copdata >= 0.02, 2, which)
dom = sort(unique(unlist(dom)))
dom = t(copdata[dom, ])
#show a part of the data
kable(dom[1:6, 1:6], caption = "Station data with percent composition of dominant species (partial)")
Acartia negligence | Acartia pacifica | Calanus sinicus | Canthocalanus pauper | Cosmocalanus darwinii | Undinula vulgaris | |
---|---|---|---|---|---|---|
p1 | 0.0000 | 0 | 0.1426 | 0.0000 | 0 | 0 |
p3 | 0.0000 | 0 | 0.0701 | 0.0000 | 0 | 0 |
p4 | 0.0000 | 0 | 0.0238 | 0.0000 | 0 | 0 |
p6 | 0.0000 | 0 | 0.0904 | 0.0000 | 0 | 0 |
p13 | 0.0000 | 0 | 0.0556 | 0.0038 | 0 | 0 |
p16 | 0.0022 | 0 | 0.0176 | 0.0022 | 0 | 0 |
Using distance measures including euclidean, manhattan, Bray-Curtis and Jaccard to calculate distance matrix.
dist_meth = c("euclidean", "manhattan", "bray", "jaccard")
dist_m = map(as.list(dist_meth), vegdist, x = dom)
Define function find_k
for plotting scree plot to show how within groups sum of square (WSS) and average Silhouette width (Si) change with different number of clusters used.
find_k <- function(data, plot_title = ""){
#create data for within group sum of square
wss = numeric(15)
for (k in 1:15) wss[k] <- kmeans(data, k, nstart = 20)$tot.withinss
#create data for average silhouette distance
asw <- numeric(15)
for (k in 2:15) asw[k] <- pam(data, k)$silinfo$avg.width
#create s cree plot
par(mar=c(5, 4, 4, 6))
plot(1:15, wss, type = "b", main = plot_title, xlab = "Number of Clusters", ylab = "Within groups sum of squares")
par(new = T)
plot(1:15, asw, type = "l", lty = 2, col = "red", axes = F, xlab = NA, ylab = NA)
axis(side = 4)
mtext("Average Silhouette width", side = 4, line = 3)
legend("topright", legend = c("WSS", "Si"), lty = c(1,2), col = c("black", "red"))
}
Define plot titles.
dist_meas = c("Euclidean", "Manhattan", "Bray-Curtis", "Jaccard")
Create scree plots to determine the number of clusters to use (k). Click on the tabs to show results for different distance methods.
for(i in 1:4){
cat("\n")
cat("#### ", dist_meas[i], "\n")
find_k(dist_m[[i]], plot_title = dist_meas[i])
cat("\n")
}
From the result of scree plots, 2 clusters seems to be the most reasonable choice when euclidean distance is used and 3 clusters is optimal if other distance methods are used. In order to compare the effects of different distance measures, 3 clusters will be used in later analysis.
#conduct k-means to find clusters
km = map(dist_m, pam, k = 3)
# extract cluster data
km_result = transpose(km)
km_cluster = as.data.frame(km_result$cluster)
colnames(km_cluster) <- dist_meas
Plot the Silhouette plots for k-means clustering. Click on the tabs to see results for different distance methods.
for(i in 1:4){
cat("\n")
cat("#### ", dist_meas[i], "\n")
plot(silhouette(km[[i]], dist_m[[i]]), main = paste("Silhoulette plot using ", dist_meas[i], " distance") , col = 1:3, border = NA, cex = 0.6)
cat("\n")
}
No significant differences exists between the clusters when different distance measures are used. However, it should be noted that stations p23 and wC are assigned to a different cluster when euclidean distance is used comparing with other methods.
Using the bootstrap method, assess the stability of clusters when using k-means. Click on different tabs to see results for different distance methods.
for(i in 1:4){
cat("\n")
cat("#### ", dist_meas[i], "\n")
stab = clusterboot(dist_m[[i]], B = 1000, bootmethod = "boot", clustermethod = claraCBI, k = 3, count = F)
stab_results = cbind(stab$bootmean, stab$bootbrd, stab$bootrecover)
print(kable(stab_results, col.names = c("Clusterwise Jaccard bootstrap mean", "dissolved", "recovered"), caption = "Cluster stability assessment for k-means"))
cat("\n")
}
Clusterwise Jaccard bootstrap mean | dissolved | recovered |
---|---|---|
0.6566714 | 291 | 302 |
0.5284333 | 581 | 184 |
0.7354016 | 84 | 458 |
Clusterwise Jaccard bootstrap mean | dissolved | recovered |
---|---|---|
0.8639887 | 55 | 784 |
0.8598780 | 48 | 771 |
0.9489945 | 24 | 932 |
Clusterwise Jaccard bootstrap mean | dissolved | recovered |
---|---|---|
0.8379067 | 82 | 716 |
0.8413992 | 49 | 730 |
0.9479493 | 24 | 918 |
Clusterwise Jaccard bootstrap mean | dissolved | recovered |
---|---|---|
0.8688582 | 58 | 794 |
0.8601801 | 51 | 767 |
0.9441388 | 23 | 913 |
Based on stability assessments, it seems that using euclidean distance would result in unstable clusters. Other distance methods all yield reasonably stable clusters.
#perform hierarchical clustering on different distance measures and linkage methods
hc_agg = list()
m1 = list(method = list("single", "complete", "average", "ward"))
m2 = list(method = list("single", "complete", "average", "ward.D"))
for(i in 1:4){
hc_agg[[i]] = pmap(m1, agnes, x = dist_m[[i]])
}
Define plot titles
link_meth = c("Single", "Complete", "Average", "Ward's")
Plot the dendrograms for hierarchical clustering. Click on different tabs to see results for different distance methods with different linkage methods.
for(i in 1:4){
cat("#### ", dist_meas[i], "{.tabset results='asis'}", "\n")
for(j in 1:4){
cat("\n")
cat("##### ", link_meth[j], "\n")
print(fviz_dend(hc_agg[[i]][[j]], cex = 0.5, k = 3, lwd = 0.8, main = paste(link_meth[j], "linkage Dendrogram"), horiz = T, ylab = paste("Height", "\n", "\n", "Agglomerative coefficient=", round(hc_agg[[i]][[j]]$ac, 2), "\n", "Cophenetic correlation=", round(cor(dist_m[[i]], cophenetic(hc_agg[[i]][[j]])), 2))))
cat("\n")
}
cat("\n")
}
From the dendrograms, we can see that using euclidean distance or single linkage would give large clusters with many stations and small clusters with only a few stations (sometimes only 1) and have a chaining effect. This is undesirable for our analysis. Additionally, single linkage generally result in a lower cophenetic correlation while Ward’s linkage yields a higher agglomerative coefficient.
Using the bootstrap method, assess the stability of clusters when using hierarchical clustering. Click on different tabs to see results for different distance methods with different linkage methods.
for(i in 1:4){
cat("#### ", dist_meas[i], "{.tabset results='asis'}", "\n")
for(j in 1:4){
cat("\n")
cat("##### ", link_meth[j], "\n")
stab = clusterboot(dist_m[[i]], B = 1000, bootmethod = "boot", clustermethod = hclustCBI, k = 3, method = m2[[1]][[j]], count = F)
stab_results = cbind(stab$bootmean, stab$bootbrd, stab$bootrecover)
print(kable(stab_results, col.names = c("Clusterwise Jaccard bootstrap mean", "dissolved", "recovered"), caption = "Cluster stability assessment for hierarchical clustering"))
cat("\n")
}
cat("\n")
}
Clusterwise Jaccard bootstrap mean | dissolved | recovered |
---|---|---|
0.4627839 | 574 | 426 |
0.9006884 | 0 | 947 |
0.6099281 | 391 | 609 |
Clusterwise Jaccard bootstrap mean | dissolved | recovered |
---|---|---|
0.8475470 | 121 | 644 |
0.9387263 | 10 | 930 |
0.8511922 | 160 | 747 |
Clusterwise Jaccard bootstrap mean | dissolved | recovered |
---|---|---|
0.8019406 | 154 | 592 |
0.9406904 | 0 | 967 |
0.8119281 | 244 | 670 |
Clusterwise Jaccard bootstrap mean | dissolved | recovered |
---|---|---|
0.6894428 | 301 | 350 |
0.5641925 | 542 | 235 |
0.7465700 | 91 | 463 |
Clusterwise Jaccard bootstrap mean | dissolved | recovered |
---|---|---|
0.6278480 | 412 | 290 |
0.7784895 | 202 | 599 |
0.6071241 | 525 | 340 |
Clusterwise Jaccard bootstrap mean | dissolved | recovered |
---|---|---|
0.6309667 | 411 | 317 |
0.5613357 | 464 | 116 |
0.3183909 | 807 | 171 |
Clusterwise Jaccard bootstrap mean | dissolved | recovered |
---|---|---|
0.7346667 | 237 | 490 |
0.9010478 | 68 | 864 |
0.7038296 | 283 | 442 |
Clusterwise Jaccard bootstrap mean | dissolved | recovered |
---|---|---|
0.6686884 | 410 | 401 |
0.7588880 | 102 | 453 |
0.9535567 | 3 | 963 |
Clusterwise Jaccard bootstrap mean | dissolved | recovered |
---|---|---|
0.6115445 | 429 | 248 |
0.7932883 | 183 | 630 |
0.6068739 | 513 | 325 |
Clusterwise Jaccard bootstrap mean | dissolved | recovered |
---|---|---|
0.6938350 | 157 | 300 |
0.7888723 | 94 | 615 |
0.3637434 | 749 | 218 |
Clusterwise Jaccard bootstrap mean | dissolved | recovered |
---|---|---|
0.7761627 | 169 | 571 |
0.9233389 | 27 | 904 |
0.7510716 | 229 | 516 |
Clusterwise Jaccard bootstrap mean | dissolved | recovered |
---|---|---|
0.8033594 | 150 | 594 |
0.9607506 | 3 | 974 |
0.8152986 | 90 | 615 |
Clusterwise Jaccard bootstrap mean | dissolved | recovered |
---|---|---|
0.6391961 | 405 | 300 |
0.8096528 | 160 | 637 |
0.6625609 | 445 | 421 |
Clusterwise Jaccard bootstrap mean | dissolved | recovered |
---|---|---|
0.6743126 | 164 | 257 |
0.8066948 | 79 | 670 |
0.3128567 | 801 | 162 |
Clusterwise Jaccard bootstrap mean | dissolved | recovered |
---|---|---|
0.8223877 | 76 | 697 |
0.9579048 | 13 | 965 |
0.7869375 | 161 | 644 |
Clusterwise Jaccard bootstrap mean | dissolved | recovered |
---|---|---|
0.8503456 | 72 | 694 |
0.9580631 | 7 | 957 |
0.8569549 | 65 | 719 |
According to the stability assessment, using single linkage and complete linkage generally yields unstable clusters. On the other hand, Ward’s distance gives the most stable clusters.
The result of our analysis vary significantly with different distance measure and linkage methods. As mentioned above, using euclidean distance or single linkage yields large clusters with many stations and small clusters with only a few stations. Consequently, euclidean distance and single linkage are inappropriate for analyzing the clusters of stations in this research.
All other methods return similar clustering results and also have similar performances on the agglomerative coefficient, Cophenetic correlation and stability assessment. Among other methods, both Bray-Curtis and Jaccard distance along with Ward’s linakge have better performance in the three criterions with Jaccard distance method giving slightly more stable clusters. Therefore, the results from Jaccard distance and Ward’s linkage will be used for further analysis.
#extract indice for cluster from result of heirarchical clustering using Jaccard distance and Ward's linkage method
clust = cutree(hc_agg[[4]][[4]], 3)
#subset station data according to the clusters
g = list()
clust_name = c()
for(i in 1:3){
g[[i]] = subset(dom, clust == i)
clust_name[i] = paste(rownames(g[[i]]), collapse = " ")
}
#display the clusters
clust_name = data.frame(clust_name)
colnames(clust_name) <- "Stations"
rownames(clust_name) <- c("Group 1", "Group 2", "Group 3")
pander(clust_name, split.cell = 35)
Stations | |
---|---|
Group 1 | p1 p3 p4 p6 p13 p16 p19 p21 p23 p25 |
Group 2 | s18 s19 s20 s22 s23 s25 s27 s29 sA sB sC sD sE sF sG |
Group 3 | w22 w23 w25 w27 w29 wA wB wC wD |
Based on the results of cluster analysis, it is evident that the groups of stations are highly related to the season in which copepod data are collected. Since groupings are based on the composition of different copepod species, it is very likely that season have a strong influence on the composition of copepod species.
#compute average composition percentage for every copepod for each group
g_mean = map(g, apply, MARGIN = 2, FUN = mean)
kable(data.frame(g_mean), col.names = c("Group 1 (Spring)", "Group 2 (Summer)", "Group 3 (Winter)"))
Group 1 (Spring) | Group 2 (Summer) | Group 3 (Winter) | |
---|---|---|---|
Acartia negligence | 0.00669 | 0.0084800 | 0.0047556 |
Acartia pacifica | 0.00000 | 0.0157533 | 0.0950778 |
Calanus sinicus | 0.07033 | 0.0008200 | 0.0964333 |
Canthocalanus pauper | 0.00268 | 0.0519533 | 0.0233778 |
Cosmocalanus darwinii | 0.00352 | 0.0011267 | 0.0046222 |
Undinula vulgaris | 0.00098 | 0.0113000 | 0.0040889 |
Clausocalanus furcatus | 0.00996 | 0.0191667 | 0.0218444 |
Clausocalanus minor | 0.00900 | 0.0091133 | 0.0198000 |
Subeucalanus pileatus | 0.00000 | 0.0070733 | 0.0145778 |
Subeucalanus subcrassus | 0.00000 | 0.0014200 | 0.0155222 |
Subeucalanus copepodid | 0.00467 | 0.0072200 | 0.0028444 |
Euchaeta concinna | 0.00026 | 0.0002467 | 0.0106222 |
Euchaeta copepodid | 0.01975 | 0.0129933 | 0.1773000 |
Acrocalanus gibber | 0.00823 | 0.0427600 | 0.0186222 |
Acrocalanus gracilis | 0.00032 | 0.0069133 | 0.0022111 |
Calocalanus gracilis | 0.00413 | 0.0013800 | 0.0000000 |
Calocalanus pavoninus | 0.00108 | 0.0065933 | 0.0000000 |
Calocalanus plumulosus | 0.00290 | 0.0021267 | 0.0000000 |
Calocalanus styliremis | 0.00134 | 0.0040467 | 0.0002111 |
Paracalanus aculeatus | 0.03864 | 0.0063333 | 0.1055889 |
Paracalanus pavus | 0.46160 | 0.1267933 | 0.1606556 |
Paracalanus serrulus | 0.00000 | 0.0248200 | 0.0006778 |
Parvocalanus crassirostris | 0.00883 | 0.1511467 | 0.0009000 |
Scolecithricella longispinosa | 0.00048 | 0.0067400 | 0.0334222 |
Temora stylifera | 0.00871 | 0.0008067 | 0.0002111 |
Temora turbinata | 0.00276 | 0.0656267 | 0.0276333 |
Oithona attenuata | 0.00038 | 0.0419800 | 0.0000000 |
Oithona brevicornis | 0.00019 | 0.0173733 | 0.0000000 |
Oithona fallax | 0.00130 | 0.0071800 | 0.0118000 |
Oithona plumifera | 0.02300 | 0.0250733 | 0.0153111 |
Oithona similis | 0.09122 | 0.0103867 | 0.0000000 |
Euterpina acutifrons | 0.00652 | 0.0537000 | 0.0000000 |
Corycaeus (Ditrichocorycaeus) affinis | 0.12334 | 0.0018933 | 0.0000000 |
Corycaeus (Ditrichocorycaeus) dahli | 0.00145 | 0.0139533 | 0.0029000 |
Corycaeus (Ditrichocorycaeus) lubbocki | 0.00000 | 0.0237400 | 0.0056556 |
Corycaeus (Ditrichocorycaeus) subtilis | 0.00072 | 0.0052933 | 0.0006000 |
Corycaeus (Onychocorycaeus) giesbrechti | 0.00000 | 0.0101067 | 0.0023556 |
Farranula gibbula | 0.00253 | 0.0053000 | 0.0004222 |
Corycaeidae copepodid | 0.01598 | 0.0000000 | 0.0000000 |
Oncaea conifera | 0.00097 | 0.0303400 | 0.0070778 |
Oncaea media | 0.00146 | 0.0023533 | 0.0039778 |
Oncaea mediterranea | 0.00268 | 0.0014733 | 0.0053444 |
Oncaea venusta | 0.02656 | 0.0889667 | 0.0300444 |
#extract copepod species that are prominant in each group (season). Prominant species are those with a higher composition percentage than the sum of composition percentage in other groups
dom_species_name <- rownames(data.frame(g_mean))
dom_g1 = dom_species_name[g_mean[[1]] > g_mean[[2]] + g_mean[[3]]]
dom_g2 = dom_species_name[g_mean[[2]] > g_mean[[1]] + g_mean[[3]]]
dom_g3 = dom_species_name[g_mean[[3]] > g_mean[[1]] + g_mean[[2]]]
As we can see, copepod species like Calocalanus gracilis, Calocalanus plumulosus, Paracalanus pavus, Temora stylifera, Oithona similis, Corycaeus (Ditrichocorycaeus) affinis, Corycaeidae copepodid are more prominent in spring, copepod species like Canthocalanus pauper, Undinula vulgaris, Acrocalanus gibber, Acrocalanus gracilis, Calocalanus pavoninus, Calocalanus styliremis, Paracalanus serrulus, Parvocalanus crassirostris, Temora turbinata, Oithona attenuata, Oithona brevicornis, Euterpina acutifrons, Corycaeus (Ditrichocorycaeus) dahli, Corycaeus (Ditrichocorycaeus) lubbocki, Corycaeus (Ditrichocorycaeus) subtilis, Corycaeus (Onychocorycaeus) giesbrechti, Farranula gibbula, Oncaea conifera, Oncaea venusta are more prominent in summer, and copepod species such as Acartia pacifica, Calanus sinicus, Clausocalanus minor, Subeucalanus pileatus, Subeucalanus subcrassus, Euchaeta concinna, Euchaeta copepodid, Paracalanus aculeatus, Scolecithricella longispinosa, Oithona fallax, Oncaea media, Oncaea mediterranea are more prominent in winter.
It is therefore very likely that different season have significant impact on the abundance of these copepod species. In the next section, we would look deeper into the variables that may be affected by different seasons.
envdata = read.table("enviANDdensity.txt", header = T)
rownames(envdata) <- envdata[, 1]
envdata = envdata[,-1]
season = list()
for(i in 1:3){
season[[i]] = subset(envdata, clust == i)
}
env_mean = map(season, apply, MARGIN = 2, FUN = mean)
kable(data.frame(env_mean), col.names = c("Spring", "Summer", "Winter"), digits = 2)
Spring | Summer | Winter | |
---|---|---|---|
Depth | 54.10 | 82.23 | 96.06 |
Temperature | 16.36 | 27.32 | 21.92 |
Salinity | 32.48 | 33.57 | 32.75 |
Fluorescence | 0.26 | 0.78 | 0.30 |
DissolvedOxygen | 8.09 | 6.55 | 7.23 |
maxT | 17.35 | 27.38 | 22.37 |
maxS | 33.37 | 34.01 | 33.49 |
maxF | 0.43 | 0.95 | 0.34 |
MaxDO | 8.11 | 6.97 | 7.50 |
FishDensity | 271.12 | 540.23 | 16.53 |
CopepodDensity | 1318.20 | 3530.60 | 102.19 |
It can be observed that spring has a lower depth, temperature and higher dissolved oxygen level. Next, summer comes with higher temperature, fluorescence, fish density, copepod density and lower dissolved oxygen level. Finally, winter results in higher depth, lower fish and copepod density. The levels of salinity do not vary much across different seasons.
How exactly does these environmental variables affect each specific dominant copepod species would require other data or experiments to determine.