Click here for other works of the author on RPubs

Load packages

library(knitr)
library(fpc)
library(vegan)
library(cluster)
library(purrr)
library(factoextra)
library(pander)

Q. Use the “dominant” copepod species data (from HW1). Perform cluster analysis of stations based on percent composition data of the dominant species and tell your story about these copepod data.

For this assignment, k-means clustering and hierarchical clustering will be performed to analyze the structure of the stations according to dominant copepode species.

Import data

#import copepod data
copdata = read.table("copepod_composition.txt", header = T)

#import species name
species_name <- read.table("copepodSPlist.txt", sep = "\t")

#assign copepod species name
rownames(copdata) <- as.character(unlist(species_name))

Extract dominant species data

Dominant species are defined as species with a percentage of 2% or more in any given station.

#convert species frequency into percentage
copdata = copdata / 100

#find dominant species
dom = apply(copdata >= 0.02, 2, which)
dom = sort(unique(unlist(dom)))
dom = t(copdata[dom, ])

#show a part of the data
kable(dom[1:6, 1:6], caption = "Station data with percent composition of dominant species (partial)")
Station data with percent composition of dominant species (partial)
Acartia negligence Acartia pacifica Calanus sinicus Canthocalanus pauper Cosmocalanus darwinii Undinula vulgaris
p1 0.0000 0 0.1426 0.0000 0 0
p3 0.0000 0 0.0701 0.0000 0 0
p4 0.0000 0 0.0238 0.0000 0 0
p6 0.0000 0 0.0904 0.0000 0 0
p13 0.0000 0 0.0556 0.0038 0 0
p16 0.0022 0 0.0176 0.0022 0 0

Compute distance matrix with differnt distance measures

Using distance measures including euclidean, manhattan, Bray-Curtis and Jaccard to calculate distance matrix.

dist_meth = c("euclidean", "manhattan", "bray", "jaccard")
dist_m = map(as.list(dist_meth), vegdist, x = dom)

Determine the optimal number of clusters to use

Define function find_k for plotting scree plot to show how within groups sum of square (WSS) and average Silhouette width (Si) change with different number of clusters used.

find_k <- function(data, plot_title = ""){
    #create data for within group sum of square
    wss = numeric(15)
    for (k in 1:15) wss[k] <- kmeans(data, k, nstart = 20)$tot.withinss
    
    #create data for average silhouette distance
    asw <- numeric(15)
    for (k in 2:15) asw[k] <- pam(data, k)$silinfo$avg.width
    
    #create s cree plot
    par(mar=c(5, 4, 4, 6))
    plot(1:15, wss, type = "b", main = plot_title, xlab = "Number of Clusters", ylab = "Within groups sum of squares")
    par(new = T)
    plot(1:15, asw, type = "l", lty = 2, col = "red", axes = F, xlab = NA, ylab = NA)
    axis(side = 4)
    mtext("Average Silhouette width", side = 4, line = 3)
    legend("topright", legend = c("WSS", "Si"), lty = c(1,2), col = c("black", "red"))
}

Scree plots

Define plot titles.

dist_meas = c("Euclidean", "Manhattan", "Bray-Curtis", "Jaccard")

Create scree plots to determine the number of clusters to use (k). Click on the tabs to show results for different distance methods.

for(i in 1:4){
    cat("\n")
    cat("#### ", dist_meas[i], "\n")
    find_k(dist_m[[i]], plot_title = dist_meas[i])
    cat("\n")
}

Euclidean

Manhattan

Bray-Curtis

Jaccard

From the result of scree plots, 2 clusters seems to be the most reasonable choice when euclidean distance is used and 3 clusters is optimal if other distance methods are used. In order to compare the effects of different distance measures, 3 clusters will be used in later analysis.

Perform k-means clustering to analyze the clusters of stations

#conduct k-means to find clusters
km = map(dist_m, pam, k = 3)

# extract cluster data
km_result = transpose(km)
km_cluster = as.data.frame(km_result$cluster)
colnames(km_cluster) <- dist_meas

Silhoulette plots: results for k-means clustering

Plot the Silhouette plots for k-means clustering. Click on the tabs to see results for different distance methods.

for(i in 1:4){
    cat("\n")
    cat("#### ", dist_meas[i], "\n")
    plot(silhouette(km[[i]], dist_m[[i]]), main = paste("Silhoulette plot using ", dist_meas[i], " distance") , col = 1:3, border = NA, cex = 0.6)
    cat("\n")
}

Euclidean

Manhattan

Bray-Curtis

Jaccard

No significant differences exists between the clusters when different distance measures are used. However, it should be noted that stations p23 and wC are assigned to a different cluster when euclidean distance is used comparing with other methods.

Clusters stability: k-means

Using the bootstrap method, assess the stability of clusters when using k-means. Click on different tabs to see results for different distance methods.

for(i in 1:4){
    cat("\n")
    cat("#### ", dist_meas[i], "\n")
    stab = clusterboot(dist_m[[i]], B = 1000, bootmethod = "boot", clustermethod = claraCBI, k = 3, count = F)
    stab_results = cbind(stab$bootmean, stab$bootbrd, stab$bootrecover)
    print(kable(stab_results, col.names = c("Clusterwise Jaccard bootstrap mean", "dissolved", "recovered"), caption = "Cluster stability assessment for k-means"))
    cat("\n")
}

Euclidean

Cluster stability assessment for k-means
Clusterwise Jaccard bootstrap mean dissolved recovered
0.6566714 291 302
0.5284333 581 184
0.7354016 84 458

Manhattan

Cluster stability assessment for k-means
Clusterwise Jaccard bootstrap mean dissolved recovered
0.8639887 55 784
0.8598780 48 771
0.9489945 24 932

Bray-Curtis

Cluster stability assessment for k-means
Clusterwise Jaccard bootstrap mean dissolved recovered
0.8379067 82 716
0.8413992 49 730
0.9479493 24 918

Jaccard

Cluster stability assessment for k-means
Clusterwise Jaccard bootstrap mean dissolved recovered
0.8688582 58 794
0.8601801 51 767
0.9441388 23 913

Based on stability assessments, it seems that using euclidean distance would result in unstable clusters. Other distance methods all yield reasonably stable clusters.

Perform hierarchical clustering to analyze the clusters of stations

#perform hierarchical clustering on different distance measures and linkage methods
hc_agg = list()
m1 = list(method = list("single", "complete", "average", "ward"))
m2 = list(method = list("single", "complete", "average", "ward.D"))
for(i in 1:4){
    hc_agg[[i]] = pmap(m1, agnes, x = dist_m[[i]])
}

Dendrogram: results for hierarchical clustering

Define plot titles

link_meth = c("Single", "Complete", "Average",  "Ward's")

Plot the dendrograms for hierarchical clustering. Click on different tabs to see results for different distance methods with different linkage methods.

for(i in 1:4){
    cat("#### ", dist_meas[i], "{.tabset results='asis'}", "\n")
    for(j in 1:4){
        cat("\n")
        cat("##### ", link_meth[j], "\n")
        print(fviz_dend(hc_agg[[i]][[j]], cex = 0.5, k = 3, lwd = 0.8, main = paste(link_meth[j], "linkage Dendrogram"), horiz = T, ylab = paste("Height", "\n", "\n", "Agglomerative coefficient=", round(hc_agg[[i]][[j]]$ac, 2), "\n", "Cophenetic correlation=", round(cor(dist_m[[i]], cophenetic(hc_agg[[i]][[j]])), 2))))
        cat("\n")
    }
    cat("\n")
}

Euclidean

Single

Complete

Average

Ward’s

Manhattan

Single

Complete

Average

Ward’s

Bray-Curtis

Single

Complete

Average

Ward’s

Jaccard

Single

Complete

Average

Ward’s

From the dendrograms, we can see that using euclidean distance or single linkage would give large clusters with many stations and small clusters with only a few stations (sometimes only 1) and have a chaining effect. This is undesirable for our analysis. Additionally, single linkage generally result in a lower cophenetic correlation while Ward’s linkage yields a higher agglomerative coefficient.

Clusters stability: hierarchical clustering

Using the bootstrap method, assess the stability of clusters when using hierarchical clustering. Click on different tabs to see results for different distance methods with different linkage methods.

for(i in 1:4){
    cat("#### ", dist_meas[i], "{.tabset results='asis'}", "\n")
    for(j in 1:4){
        cat("\n")
        cat("##### ", link_meth[j], "\n")
        stab = clusterboot(dist_m[[i]], B = 1000, bootmethod = "boot", clustermethod = hclustCBI, k = 3, method = m2[[1]][[j]], count = F)
        stab_results = cbind(stab$bootmean, stab$bootbrd, stab$bootrecover)
        print(kable(stab_results, col.names = c("Clusterwise Jaccard bootstrap mean", "dissolved", "recovered"), caption = "Cluster stability assessment for hierarchical clustering"))
        cat("\n")
    }
    cat("\n")
}

Euclidean

Single
Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean dissolved recovered
0.4627839 574 426
0.9006884 0 947
0.6099281 391 609
Complete
Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean dissolved recovered
0.8475470 121 644
0.9387263 10 930
0.8511922 160 747
Average
Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean dissolved recovered
0.8019406 154 592
0.9406904 0 967
0.8119281 244 670
Ward’s
Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean dissolved recovered
0.6894428 301 350
0.5641925 542 235
0.7465700 91 463

Manhattan

Single
Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean dissolved recovered
0.6278480 412 290
0.7784895 202 599
0.6071241 525 340
Complete
Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean dissolved recovered
0.6309667 411 317
0.5613357 464 116
0.3183909 807 171
Average
Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean dissolved recovered
0.7346667 237 490
0.9010478 68 864
0.7038296 283 442
Ward’s
Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean dissolved recovered
0.6686884 410 401
0.7588880 102 453
0.9535567 3 963

Bray-Curtis

Single
Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean dissolved recovered
0.6115445 429 248
0.7932883 183 630
0.6068739 513 325
Complete
Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean dissolved recovered
0.6938350 157 300
0.7888723 94 615
0.3637434 749 218
Average
Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean dissolved recovered
0.7761627 169 571
0.9233389 27 904
0.7510716 229 516
Ward’s
Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean dissolved recovered
0.8033594 150 594
0.9607506 3 974
0.8152986 90 615

Jaccard

Single
Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean dissolved recovered
0.6391961 405 300
0.8096528 160 637
0.6625609 445 421
Complete
Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean dissolved recovered
0.6743126 164 257
0.8066948 79 670
0.3128567 801 162
Average
Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean dissolved recovered
0.8223877 76 697
0.9579048 13 965
0.7869375 161 644
Ward’s
Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean dissolved recovered
0.8503456 72 694
0.9580631 7 957
0.8569549 65 719

According to the stability assessment, using single linkage and complete linkage generally yields unstable clusters. On the other hand, Ward’s distance gives the most stable clusters.

Cluster analysis result

The result of our analysis vary significantly with different distance measure and linkage methods. As mentioned above, using euclidean distance or single linkage yields large clusters with many stations and small clusters with only a few stations. Consequently, euclidean distance and single linkage are inappropriate for analyzing the clusters of stations in this research.

All other methods return similar clustering results and also have similar performances on the agglomerative coefficient, Cophenetic correlation and stability assessment. Among other methods, both Bray-Curtis and Jaccard distance along with Ward’s linakge have better performance in the three criterions with Jaccard distance method giving slightly more stable clusters. Therefore, the results from Jaccard distance and Ward’s linkage will be used for further analysis.

Report stations in different group

#extract indice for cluster from result of heirarchical clustering using Jaccard distance and Ward's linkage method
clust = cutree(hc_agg[[4]][[4]], 3)

#subset station data according to the clusters
g = list()
clust_name = c()
for(i in 1:3){
    g[[i]] = subset(dom, clust == i)
    clust_name[i] = paste(rownames(g[[i]]), collapse = " ")
}

#display the clusters
clust_name = data.frame(clust_name)
colnames(clust_name) <- "Stations"
rownames(clust_name) <- c("Group 1", "Group 2", "Group 3")
pander(clust_name, split.cell = 35)
  Stations
Group 1 p1 p3 p4 p6 p13 p16 p19 p21 p23 p25
Group 2 s18 s19 s20 s22 s23 s25 s27 s29 sA sB sC sD sE sF sG
Group 3 w22 w23 w25 w27 w29 wA wB wC wD

Based on the results of cluster analysis, it is evident that the groups of stations are highly related to the season in which copepod data are collected. Since groupings are based on the composition of different copepod species, it is very likely that season have a strong influence on the composition of copepod species.

Compare copepod composition in the 3 clusters

#compute average composition percentage for every copepod for each group
g_mean = map(g, apply, MARGIN = 2, FUN = mean)
kable(data.frame(g_mean), col.names = c("Group 1 (Spring)", "Group 2 (Summer)", "Group 3 (Winter)"))
Group 1 (Spring) Group 2 (Summer) Group 3 (Winter)
Acartia negligence 0.00669 0.0084800 0.0047556
Acartia pacifica 0.00000 0.0157533 0.0950778
Calanus sinicus 0.07033 0.0008200 0.0964333
Canthocalanus pauper 0.00268 0.0519533 0.0233778
Cosmocalanus darwinii 0.00352 0.0011267 0.0046222
Undinula vulgaris 0.00098 0.0113000 0.0040889
Clausocalanus furcatus 0.00996 0.0191667 0.0218444
Clausocalanus minor 0.00900 0.0091133 0.0198000
Subeucalanus pileatus 0.00000 0.0070733 0.0145778
Subeucalanus subcrassus 0.00000 0.0014200 0.0155222
Subeucalanus copepodid 0.00467 0.0072200 0.0028444
Euchaeta concinna 0.00026 0.0002467 0.0106222
Euchaeta copepodid 0.01975 0.0129933 0.1773000
Acrocalanus gibber 0.00823 0.0427600 0.0186222
Acrocalanus gracilis 0.00032 0.0069133 0.0022111
Calocalanus gracilis 0.00413 0.0013800 0.0000000
Calocalanus pavoninus 0.00108 0.0065933 0.0000000
Calocalanus plumulosus 0.00290 0.0021267 0.0000000
Calocalanus styliremis 0.00134 0.0040467 0.0002111
Paracalanus aculeatus 0.03864 0.0063333 0.1055889
Paracalanus pavus 0.46160 0.1267933 0.1606556
Paracalanus serrulus 0.00000 0.0248200 0.0006778
Parvocalanus crassirostris 0.00883 0.1511467 0.0009000
Scolecithricella longispinosa 0.00048 0.0067400 0.0334222
Temora stylifera 0.00871 0.0008067 0.0002111
Temora turbinata 0.00276 0.0656267 0.0276333
Oithona attenuata 0.00038 0.0419800 0.0000000
Oithona brevicornis 0.00019 0.0173733 0.0000000
Oithona fallax 0.00130 0.0071800 0.0118000
Oithona plumifera 0.02300 0.0250733 0.0153111
Oithona similis 0.09122 0.0103867 0.0000000
Euterpina acutifrons 0.00652 0.0537000 0.0000000
Corycaeus (Ditrichocorycaeus) affinis 0.12334 0.0018933 0.0000000
Corycaeus (Ditrichocorycaeus) dahli 0.00145 0.0139533 0.0029000
Corycaeus (Ditrichocorycaeus) lubbocki 0.00000 0.0237400 0.0056556
Corycaeus (Ditrichocorycaeus) subtilis 0.00072 0.0052933 0.0006000
Corycaeus (Onychocorycaeus) giesbrechti 0.00000 0.0101067 0.0023556
Farranula gibbula 0.00253 0.0053000 0.0004222
Corycaeidae copepodid 0.01598 0.0000000 0.0000000
Oncaea conifera 0.00097 0.0303400 0.0070778
Oncaea media 0.00146 0.0023533 0.0039778
Oncaea mediterranea 0.00268 0.0014733 0.0053444
Oncaea venusta 0.02656 0.0889667 0.0300444
#extract copepod species that are prominant in each group (season). Prominant species are those with a higher composition percentage than the sum of composition percentage in other groups
dom_species_name <- rownames(data.frame(g_mean))
dom_g1 = dom_species_name[g_mean[[1]] > g_mean[[2]] + g_mean[[3]]]
dom_g2 = dom_species_name[g_mean[[2]] > g_mean[[1]] + g_mean[[3]]]
dom_g3 = dom_species_name[g_mean[[3]] > g_mean[[1]] + g_mean[[2]]]

As we can see, copepod species like Calocalanus gracilis, Calocalanus plumulosus, Paracalanus pavus, Temora stylifera, Oithona similis, Corycaeus (Ditrichocorycaeus) affinis, Corycaeidae copepodid are more prominent in spring, copepod species like Canthocalanus pauper, Undinula vulgaris, Acrocalanus gibber, Acrocalanus gracilis, Calocalanus pavoninus, Calocalanus styliremis, Paracalanus serrulus, Parvocalanus crassirostris, Temora turbinata, Oithona attenuata, Oithona brevicornis, Euterpina acutifrons, Corycaeus (Ditrichocorycaeus) dahli, Corycaeus (Ditrichocorycaeus) lubbocki, Corycaeus (Ditrichocorycaeus) subtilis, Corycaeus (Onychocorycaeus) giesbrechti, Farranula gibbula, Oncaea conifera, Oncaea venusta are more prominent in summer, and copepod species such as Acartia pacifica, Calanus sinicus, Clausocalanus minor, Subeucalanus pileatus, Subeucalanus subcrassus, Euchaeta concinna, Euchaeta copepodid, Paracalanus aculeatus, Scolecithricella longispinosa, Oithona fallax, Oncaea media, Oncaea mediterranea are more prominent in winter.

It is therefore very likely that different season have significant impact on the abundance of these copepod species. In the next section, we would look deeper into the variables that may be affected by different seasons.

Load and extract environmental data for stations in each season

envdata = read.table("enviANDdensity.txt", header = T)
rownames(envdata) <- envdata[, 1]
envdata = envdata[,-1]
season = list()
for(i in 1:3){
    season[[i]] = subset(envdata, clust == i)
}
env_mean = map(season, apply, MARGIN = 2, FUN = mean)

Show environmental variables according to different season

kable(data.frame(env_mean), col.names = c("Spring", "Summer", "Winter"), digits = 2)
Spring Summer Winter
Depth 54.10 82.23 96.06
Temperature 16.36 27.32 21.92
Salinity 32.48 33.57 32.75
Fluorescence 0.26 0.78 0.30
DissolvedOxygen 8.09 6.55 7.23
maxT 17.35 27.38 22.37
maxS 33.37 34.01 33.49
maxF 0.43 0.95 0.34
MaxDO 8.11 6.97 7.50
FishDensity 271.12 540.23 16.53
CopepodDensity 1318.20 3530.60 102.19

It can be observed that spring has a lower depth, temperature and higher dissolved oxygen level. Next, summer comes with higher temperature, fluorescence, fish density, copepod density and lower dissolved oxygen level. Finally, winter results in higher depth, lower fish and copepod density. The levels of salinity do not vary much across different seasons.

How exactly does these environmental variables affect each specific dominant copepod species would require other data or experiments to determine.