HW #8 Computer Intensive Statistics in Ecology

Click here for other works of the author on RPubs

Load packages

library(knitr)
library(fpc)
library(vegan)
library(cluster)
library(purrr)
library(factoextra)
library(pander)

Q. Use the “dominant” copepod species data (from HW1). Perform cluster analysis of stations based on percent composition data of the dominant species and tell your story about these copepod data.

For this assignment, k-means clustering and hierarchical clustering will be performed to analyze the structure of the stations according to dominant copepode species.

Import data

#import copepod data
copdata = read.table("copepod_composition.txt", header = T)

#import species name
species_name <- read.table("copepodSPlist.txt", sep = "\t")

#assign copepod species name
rownames(copdata) <- as.character(unlist(species_name))

Extract dominant species data

Dominant species are defined as species with a percentage of 2% or more in any given station.

#convert species frequency into percentage
copdata = copdata / 100

#find dominant species
dom = apply(copdata >= 0.02, 2, which)
dom = sort(unique(unlist(dom)))
dom = t(copdata[dom, ])

#show a part of the data
kable(dom[1:6, 1:6], caption = "Station data with percent composition of dominant species (partial)")

Station data with percent composition of dominant species (partial)
	Acartia negligence	Calanus sinicus	Canthocalanus pauper
p1	0.0000	0.1426	0.0000
p3	0.0000	0.0701	0.0000
p4	0.0000	0.0238	0.0000
p6	0.0000	0.0904	0.0000
p13	0.0000	0.0556	0.0038
p16	0.0022	0.0176	0.0022

Compute distance matrix with differnt distance measures

Using distance measures including euclidean, manhattan, Bray-Curtis and Jaccard to calculate distance matrix.

dist_meth = c("euclidean", "manhattan", "bray", "jaccard")
dist_m = map(as.list(dist_meth), vegdist, x = dom)

Determine the optimal number of clusters to use

Define function find_k for plotting scree plot to show how within groups sum of square (WSS) and average Silhouette width (Si) change with different number of clusters used.

find_k <- function(data, plot_title = ""){
    #create data for within group sum of square
    wss = numeric(15)
    for (k in 1:15) wss[k] <- kmeans(data, k, nstart = 20)$tot.withinss
    
    #create data for average silhouette distance
    asw <- numeric(15)
    for (k in 2:15) asw[k] <- pam(data, k)$silinfo$avg.width
    
    #create s cree plot
    par(mar=c(5, 4, 4, 6))
    plot(1:15, wss, type = "b", main = plot_title, xlab = "Number of Clusters", ylab = "Within groups sum of squares")
    par(new = T)
    plot(1:15, asw, type = "l", lty = 2, col = "red", axes = F, xlab = NA, ylab = NA)
    axis(side = 4)
    mtext("Average Silhouette width", side = 4, line = 3)
    legend("topright", legend = c("WSS", "Si"), lty = c(1,2), col = c("black", "red"))
}

Scree plots

Define plot titles.

dist_meas = c("Euclidean", "Manhattan", "Bray-Curtis", "Jaccard")

Create scree plots to determine the number of clusters to use (k). Click on the tabs to show results for different distance methods.

for(i in 1:4){
    cat("\n")
    cat("#### ", dist_meas[i], "\n")
    find_k(dist_m[[i]], plot_title = dist_meas[i])
    cat("\n")
}

Euclidean

Manhattan

Bray-Curtis

Jaccard

From the result of scree plots, 2 clusters seems to be the most reasonable choice when euclidean distance is used and 3 clusters is optimal if other distance methods are used. In order to compare the effects of different distance measures, 3 clusters will be used in later analysis.

Perform k-means clustering to analyze the clusters of stations

#conduct k-means to find clusters
km = map(dist_m, pam, k = 3)

# extract cluster data
km_result = transpose(km)
km_cluster = as.data.frame(km_result$cluster)
colnames(km_cluster) <- dist_meas

Silhoulette plots: results for k-means clustering

Plot the Silhouette plots for k-means clustering. Click on the tabs to see results for different distance methods.

for(i in 1:4){
    cat("\n")
    cat("#### ", dist_meas[i], "\n")
    plot(silhouette(km[[i]], dist_m[[i]]), main = paste("Silhoulette plot using ", dist_meas[i], " distance") , col = 1:3, border = NA, cex = 0.6)
    cat("\n")
}

Euclidean

Manhattan

Bray-Curtis

Jaccard

No significant differences exists between the clusters when different distance measures are used. However, it should be noted that stations p23 and wC are assigned to a different cluster when euclidean distance is used comparing with other methods.

Clusters stability: k-means

Using the bootstrap method, assess the stability of clusters when using k-means. Click on different tabs to see results for different distance methods.

for(i in 1:4){
    cat("\n")
    cat("#### ", dist_meas[i], "\n")
    stab = clusterboot(dist_m[[i]], B = 1000, bootmethod = "boot", clustermethod = claraCBI, k = 3, count = F)
    stab_results = cbind(stab$bootmean, stab$bootbrd, stab$bootrecover)
    print(kable(stab_results, col.names = c("Clusterwise Jaccard bootstrap mean", "dissolved", "recovered"), caption = "Cluster stability assessment for k-means"))
    cat("\n")
}

Euclidean

Cluster stability assessment for k-means
Clusterwise Jaccard bootstrap mean	dissolved	recovered
0.6566714	291	302
0.5284333	581	184
0.7354016	84	458

Manhattan

Cluster stability assessment for k-means
Clusterwise Jaccard bootstrap mean	dissolved	recovered
0.8639887	55	784
0.8598780	48	771
0.9489945	24	932

Bray-Curtis

Cluster stability assessment for k-means
Clusterwise Jaccard bootstrap mean	dissolved	recovered
0.8379067	82	716
0.8413992	49	730
0.9479493	24	918

Jaccard

Cluster stability assessment for k-means
Clusterwise Jaccard bootstrap mean	dissolved	recovered
0.8688582	58	794
0.8601801	51	767
0.9441388	23	913

Based on stability assessments, it seems that using euclidean distance would result in unstable clusters. Other distance methods all yield reasonably stable clusters.

Perform hierarchical clustering to analyze the clusters of stations

#perform hierarchical clustering on different distance measures and linkage methods
hc_agg = list()
m1 = list(method = list("single", "complete", "average", "ward"))
m2 = list(method = list("single", "complete", "average", "ward.D"))
for(i in 1:4){
    hc_agg[[i]] = pmap(m1, agnes, x = dist_m[[i]])
}

Dendrogram: results for hierarchical clustering

Define plot titles

link_meth = c("Single", "Complete", "Average",  "Ward's")

Plot the dendrograms for hierarchical clustering. Click on different tabs to see results for different distance methods with different linkage methods.

for(i in 1:4){
    cat("#### ", dist_meas[i], "{.tabset results='asis'}", "\n")
    for(j in 1:4){
        cat("\n")
        cat("##### ", link_meth[j], "\n")
        print(fviz_dend(hc_agg[[i]][[j]], cex = 0.5, k = 3, lwd = 0.8, main = paste(link_meth[j], "linkage Dendrogram"), horiz = T, ylab = paste("Height", "\n", "\n", "Agglomerative coefficient=", round(hc_agg[[i]][[j]]$ac, 2), "\n", "Cophenetic correlation=", round(cor(dist_m[[i]], cophenetic(hc_agg[[i]][[j]])), 2))))
        cat("\n")
    }
    cat("\n")
}

Euclidean

Single

Complete

Average

Ward’s

Manhattan

Single

Complete

Average

Ward’s

Bray-Curtis

Single

Complete

Average

Ward’s

Jaccard

Single

Complete

Average

Ward’s

From the dendrograms, we can see that using euclidean distance or single linkage would give large clusters with many stations and small clusters with only a few stations (sometimes only 1) and have a chaining effect. This is undesirable for our analysis. Additionally, single linkage generally result in a lower cophenetic correlation while Ward’s linkage yields a higher agglomerative coefficient.

Clusters stability: hierarchical clustering

Using the bootstrap method, assess the stability of clusters when using hierarchical clustering. Click on different tabs to see results for different distance methods with different linkage methods.

for(i in 1:4){
    cat("#### ", dist_meas[i], "{.tabset results='asis'}", "\n")
    for(j in 1:4){
        cat("\n")
        cat("##### ", link_meth[j], "\n")
        stab = clusterboot(dist_m[[i]], B = 1000, bootmethod = "boot", clustermethod = hclustCBI, k = 3, method = m2[[1]][[j]], count = F)
        stab_results = cbind(stab$bootmean, stab$bootbrd, stab$bootrecover)
        print(kable(stab_results, col.names = c("Clusterwise Jaccard bootstrap mean", "dissolved", "recovered"), caption = "Cluster stability assessment for hierarchical clustering"))
        cat("\n")
    }
    cat("\n")
}

Euclidean

Single

Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean	dissolved	recovered
0.4627839	574	426
0.9006884	0	947
0.6099281	391	609

Complete

Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean	dissolved	recovered
0.8475470	121	644
0.9387263	10	930
0.8511922	160	747

Average

Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean	dissolved	recovered
0.8019406	154	592
0.9406904	0	967
0.8119281	244	670

Ward’s

Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean	dissolved	recovered
0.6894428	301	350
0.5641925	542	235
0.7465700	91	463

Manhattan

Single

Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean	dissolved	recovered
0.6278480	412	290
0.7784895	202	599
0.6071241	525	340

Complete

Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean	dissolved	recovered
0.6309667	411	317
0.5613357	464	116
0.3183909	807	171

Average

Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean	dissolved	recovered
0.7346667	237	490
0.9010478	68	864
0.7038296	283	442

Ward’s

Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean	dissolved	recovered
0.6686884	410	401
0.7588880	102	453
0.9535567	3	963

Bray-Curtis

Single

Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean	dissolved	recovered
0.6115445	429	248
0.7932883	183	630
0.6068739	513	325

Complete

Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean	dissolved	recovered
0.6938350	157	300
0.7888723	94	615
0.3637434	749	218

Average

Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean	dissolved	recovered
0.7761627	169	571
0.9233389	27	904
0.7510716	229	516

Ward’s

Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean	dissolved	recovered
0.8033594	150	594
0.9607506	3	974
0.8152986	90	615

Jaccard

Single

Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean	dissolved	recovered
0.6391961	405	300
0.8096528	160	637
0.6625609	445	421

Complete

Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean	dissolved	recovered
0.6743126	164	257
0.8066948	79	670
0.3128567	801	162

Average

Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean	dissolved	recovered
0.8223877	76	697
0.9579048	13	965
0.7869375	161	644

Ward’s

Cluster stability assessment for hierarchical clustering
Clusterwise Jaccard bootstrap mean	dissolved	recovered
0.8503456	72	694
0.9580631	7	957
0.8569549	65	719

According to the stability assessment, using single linkage and complete linkage generally yields unstable clusters. On the other hand, Ward’s distance gives the most stable clusters.

Cluster analysis result

The result of our analysis vary significantly with different distance measure and linkage methods. As mentioned above, using euclidean distance or single linkage yields large clusters with many stations and small clusters with only a few stations. Consequently, euclidean distance and single linkage are inappropriate for analyzing the clusters of stations in this research.

All other methods return similar clustering results and also have similar performances on the agglomerative coefficient, Cophenetic correlation and stability assessment. Among other methods, both Bray-Curtis and Jaccard distance along with Ward’s linakge have better performance in the three criterions with Jaccard distance method giving slightly more stable clusters. Therefore, the results from Jaccard distance and Ward’s linkage will be used for further analysis.

Report stations in different group

#extract indice for cluster from result of heirarchical clustering using Jaccard distance and Ward's linkage method
clust = cutree(hc_agg[[4]][[4]], 3)

#subset station data according to the clusters
g = list()
clust_name = c()
for(i in 1:3){
    g[[i]] = subset(dom, clust == i)
    clust_name[i] = paste(rownames(g[[i]]), collapse = " ")
}

#display the clusters
clust_name = data.frame(clust_name)
colnames(clust_name) <- "Stations"
rownames(clust_name) <- c("Group 1", "Group 2", "Group 3")
pander(clust_name, split.cell = 35)

	Stations
Group 1	p1 p3 p4 p6 p13 p16 p19 p21 p23 p25
Group 2	s18 s19 s20 s22 s23 s25 s27 s29 sA sB sC sD sE sF sG
Group 3	w22 w23 w25 w27 w29 wA wB wC wD

Based on the results of cluster analysis, it is evident that the groups of stations are highly related to the season in which copepod data are collected. Since groupings are based on the composition of different copepod species, it is very likely that season have a strong influence on the composition of copepod species.

Compare copepod composition in the 3 clusters

#compute average composition percentage for every copepod for each group
g_mean = map(g, apply, MARGIN = 2, FUN = mean)
kable(data.frame(g_mean), col.names = c("Group 1 (Spring)", "Group 2 (Summer)", "Group 3 (Winter)"))

	Group 1 (Spring)	Group 2 (Summer)	Group 3 (Winter)
Acartia negligence	0.00669	0.0084800	0.0047556
Acartia pacifica	0.00000	0.0157533	0.0950778
Calanus sinicus	0.07033	0.0008200	0.0964333
Canthocalanus pauper	0.00268	0.0519533	0.0233778
Cosmocalanus darwinii	0.00352	0.0011267	0.0046222
Undinula vulgaris	0.00098	0.0113000	0.0040889
Clausocalanus furcatus	0.00996	0.0191667	0.0218444
Clausocalanus minor	0.00900	0.0091133	0.0198000
Subeucalanus pileatus	0.00000	0.0070733	0.0145778
Subeucalanus subcrassus	0.00000	0.0014200	0.0155222
Subeucalanus copepodid	0.00467	0.0072200	0.0028444
Euchaeta concinna	0.00026	0.0002467	0.0106222
Euchaeta copepodid	0.01975	0.0129933	0.1773000
Acrocalanus gibber	0.00823	0.0427600	0.0186222
Acrocalanus gracilis	0.00032	0.0069133	0.0022111
Calocalanus gracilis	0.00413	0.0013800	0.0000000
Calocalanus pavoninus	0.00108	0.0065933	0.0000000
Calocalanus plumulosus	0.00290	0.0021267	0.0000000
Calocalanus styliremis	0.00134	0.0040467	0.0002111
Paracalanus aculeatus	0.03864	0.0063333	0.1055889
Paracalanus pavus	0.46160	0.1267933	0.1606556
Paracalanus serrulus	0.00000	0.0248200	0.0006778
Parvocalanus crassirostris	0.00883	0.1511467	0.0009000
Scolecithricella longispinosa	0.00048	0.0067400	0.0334222
Temora stylifera	0.00871	0.0008067	0.0002111
Temora turbinata	0.00276	0.0656267	0.0276333
Oithona attenuata	0.00038	0.0419800	0.0000000
Oithona brevicornis	0.00019	0.0173733	0.0000000
Oithona fallax	0.00130	0.0071800	0.0118000
Oithona plumifera	0.02300	0.0250733	0.0153111
Oithona similis	0.09122	0.0103867	0.0000000
Euterpina acutifrons	0.00652	0.0537000	0.0000000
Corycaeus (Ditrichocorycaeus) affinis	0.12334	0.0018933	0.0000000
Corycaeus (Ditrichocorycaeus) dahli	0.00145	0.0139533	0.0029000
Corycaeus (Ditrichocorycaeus) lubbocki	0.00000	0.0237400	0.0056556
Corycaeus (Ditrichocorycaeus) subtilis	0.00072	0.0052933	0.0006000
Corycaeus (Onychocorycaeus) giesbrechti	0.00000	0.0101067	0.0023556
Farranula gibbula	0.00253	0.0053000	0.0004222
Corycaeidae copepodid	0.01598	0.0000000	0.0000000
Oncaea conifera	0.00097	0.0303400	0.0070778
Oncaea media	0.00146	0.0023533	0.0039778
Oncaea mediterranea	0.00268	0.0014733	0.0053444
Oncaea venusta	0.02656	0.0889667	0.0300444

#extract copepod species that are prominant in each group (season). Prominant species are those with a higher composition percentage than the sum of composition percentage in other groups
dom_species_name <- rownames(data.frame(g_mean))
dom_g1 = dom_species_name[g_mean[[1]] > g_mean[[2]] + g_mean[[3]]]
dom_g2 = dom_species_name[g_mean[[2]] > g_mean[[1]] + g_mean[[3]]]
dom_g3 = dom_species_name[g_mean[[3]] > g_mean[[1]] + g_mean[[2]]]

As we can see, copepod species like Calocalanus gracilis, Calocalanus plumulosus, Paracalanus pavus, Temora stylifera, Oithona similis, Corycaeus (Ditrichocorycaeus) affinis, Corycaeidae copepodid are more prominent in spring, copepod species like Canthocalanus pauper, Undinula vulgaris, Acrocalanus gibber, Acrocalanus gracilis, Calocalanus pavoninus, Calocalanus styliremis, Paracalanus serrulus, Parvocalanus crassirostris, Temora turbinata, Oithona attenuata, Oithona brevicornis, Euterpina acutifrons, Corycaeus (Ditrichocorycaeus) dahli, Corycaeus (Ditrichocorycaeus) lubbocki, Corycaeus (Ditrichocorycaeus) subtilis, Corycaeus (Onychocorycaeus) giesbrechti, Farranula gibbula, Oncaea conifera, Oncaea venusta are more prominent in summer, and copepod species such as Acartia pacifica, Calanus sinicus, Clausocalanus minor, Subeucalanus pileatus, Subeucalanus subcrassus, Euchaeta concinna, Euchaeta copepodid, Paracalanus aculeatus, Scolecithricella longispinosa, Oithona fallax, Oncaea media, Oncaea mediterranea are more prominent in winter.

It is therefore very likely that different season have significant impact on the abundance of these copepod species. In the next section, we would look deeper into the variables that may be affected by different seasons.

Load and extract environmental data for stations in each season

envdata = read.table("enviANDdensity.txt", header = T)
rownames(envdata) <- envdata[, 1]
envdata = envdata[,-1]
season = list()
for(i in 1:3){
    season[[i]] = subset(envdata, clust == i)
}
env_mean = map(season, apply, MARGIN = 2, FUN = mean)

Show environmental variables according to different season

kable(data.frame(env_mean), col.names = c("Spring", "Summer", "Winter"), digits = 2)

	Spring	Summer	Winter
Depth	54.10	82.23	96.06
Temperature	16.36	27.32	21.92
Salinity	32.48	33.57	32.75
Fluorescence	0.26	0.78	0.30
DissolvedOxygen	8.09	6.55	7.23
maxT	17.35	27.38	22.37
maxS	33.37	34.01	33.49
maxF	0.43	0.95	0.34
MaxDO	8.11	6.97	7.50
FishDensity	271.12	540.23	16.53
CopepodDensity	1318.20	3530.60	102.19

It can be observed that spring has a lower depth, temperature and higher dissolved oxygen level. Next, summer comes with higher temperature, fluorescence, fish density, copepod density and lower dissolved oxygen level. Finally, winter results in higher depth, lower fish and copepod density. The levels of salinity do not vary much across different seasons.

How exactly does these environmental variables affect each specific dominant copepod species would require other data or experiments to determine.

HW #8 Computer Intensive Statistics in Ecology

蘇主弦 Eric Su

2017/4/16

Load packages

Q. Use the “dominant” copepod species data (from HW1). Perform cluster analysis of stations based on percent composition data of the dominant species and tell your story about these copepod data.

Import data

Extract dominant species data

Compute distance matrix with differnt distance measures

Determine the optimal number of clusters to use

Scree plots

Euclidean

Manhattan

Bray-Curtis

Jaccard

Perform k-means clustering to analyze the clusters of stations

Silhoulette plots: results for k-means clustering

Euclidean

Manhattan

Bray-Curtis

Jaccard

Clusters stability: k-means

Euclidean

Manhattan

Bray-Curtis

Jaccard

Perform hierarchical clustering to analyze the clusters of stations

Dendrogram: results for hierarchical clustering

Euclidean

Single

Complete

Average

Ward’s

Manhattan

Single

Complete

Average

Ward’s

Bray-Curtis

Single

Complete

Average

Ward’s

Jaccard

Single

Complete

Average

Ward’s

Clusters stability: hierarchical clustering

Euclidean

Single

Complete

Average

Ward’s

Manhattan

Single

Complete

Average

Ward’s

Bray-Curtis

Single

Complete

Average

Ward’s

Jaccard

Single

Complete

Average

Ward’s

Cluster analysis result

Report stations in different group

Compare copepod composition in the 3 clusters

Load and extract environmental data for stations in each season

Show environmental variables according to different season