Outline

In this script we will learn how to run cluster and ordination analyses on a floristic dataset from the Atacama Desert in Chile. For this purpose we will use functions provided by base R and the package ‘vegan’. Furthermore, we will learn different techniques for comparing and validating our results. Additionally, we will see how to wrap our scripts into functions and how to use them. Finally, we will illustrate our results using the packages ‘ggplot2’ and ‘factoextra’. Throughout the script you will find different tasks which are written in bold. Have fun!

1) Prepare your script

At the beginning of each script you should clear your ‘Global Environment’ and install and load all required packages.

# Start by removing all Objects from the 'Global Environment'
rm(list = ls())

# Check if you are in the right working directory using getwd().
# If not, please use setwd() to choose and set the right folder as your working directory
getwd()
setwd("...") # replace ... with the path to your folder

# Load Packages
library(vegan)
library(factoextra)
library(ggpubr)
library(ggplot2)
library(fpc)
library(dendextend)
library(rcartocolor)
library(RColorBrewer)
library(purrr)
library(tinytex)

Oh damn, a warning pops up right away! What’s wrong? Please fix the problem!
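If the message complains about a package that is not installed yet, install.packages() is the usual fix. A minimal sketch, where the package name in quotes is just a placeholder for whichever package your message mentions:

# install.packages("packagename") # placeholder: replace with the package named in the message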

2) Load the Community Data Matrix

Load the dataset and store it in the object ‘Flor’. Our dataset is so-called ‘binary’ or ‘presence/absence’ data, which means that it consists of many ones and zeros indicating whether a species is present in a certain locality or not. Use different functions (dim(), class() etc.) to inspect your data.

Flor <- 
  read.csv2("D:/Sciebo/02_Uni/02_04_Biogeography_Peru_Chile/01_Data/02_Floristic_Data_PeruChile/01_Datamatrix/Datamatrix.csv",
           header = TRUE,
           dec = ",",
           row.names = 1) 

# Check the structure of your Data

dim(Flor)
class(Flor)
head(Flor, 10)
str(Flor)
View(Flor)

What is the difference between read.csv() and read.csv2()? Use the help() function or the Help button. To become a little more familiar with the arguments of these functions, change header = FALSE and row.names = 2. What happens? Please change back to the original settings before proceeding.

3) Cluster Analysis

Cluster analysis is a method that identifies groups of similar objects based on certain criteria/variables (here the criteria or variables are our 1087 species). Cluster analysis helps us to identify patterns inside complex structures by categorizing. However, it only produces discrete categories (clusters) of our continuous world and nature and fails to capture what lies in between.

3.1) Dissimilarity Matrix

Before running a cluster analysis we need to transform our data into a distance matrix. The function vegdist() of the ‘vegan’ package provides many different dissimilarity indices useful in vegetation ecology. For our data we want to compute the Sørensen dissimilarity, which is the same as the Bray-Curtis index for binary data.

Note: the function dist() included in base R also creates distance matrices but does not offer the Bray-Curtis index.
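To see what the Sørensen index actually measures, here is a small toy calculation on two made-up presence/absence vectors (these sites and species are hypothetical and not part of ‘Flor’):

site1 <- c(1, 1, 1, 0, 0)               # species present at the first toy site
site2 <- c(1, 0, 1, 1, 1)               # species present at the second toy site
shared <- sum(site1 == 1 & site2 == 1)  # species occurring at both sites (a)
only1  <- sum(site1 == 1 & site2 == 0)  # species only at site 1 (b)
only2  <- sum(site1 == 0 & site2 == 1)  # species only at site 2 (c)
(only1 + only2) / (2 * shared + only1 + only2)  # Soerensen dissimilarity = 3/7 (about 0.43)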

Something is wrong with the distance method. Use the help page to define the method argument so that vegdist() uses the Bray-Curtis index.

bray.dist <- vegdist(Flor,
                     method = "...",
                     binary = TRUE)

3.2) Let’s cluster

Now we are ready to compute our first cluster analysis! hclust() is a base R function and runs hierarchical clustering based on a distance matrix. Here we use single linkage clustering, defined in the argument ‘method’.

Note: Below you find some additional and useful functions to check your R objects.

single <- hclust(bray.dist, method = "single")

# some useful functions in between: typeof() shows you what type of data you have
# the is...() functions give you a logical output (TRUE or FALSE)

typeof(single)
is.list(single)
is.vector(single)
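An hclust result is itself a list, so you can also look directly at the components it stores, for example:

names(single)        # the components of an hclust object
single$method        # the linkage method that was used ("single")
head(single$height)  # dissimilarities at which clusters were merged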

Extra Exercise: In R you can use functions inside of functions. Knowing this, are you able to write a one-liner which computes a dissimilarity matrix and the clustering at once? Store the result in ‘allinone’.

allinone <- 

4) Visualize your Results

Results of hierarchical cluster analyses are typically illustrated as dendrograms. There are different ways and packages to draw and modify dendrograms in R. For a first inspection use plot() on your hclust results.

plot(as.dendrogram(single), horiz = T)

This is not very beautiful. To make a good-looking dendrogram there are nice solutions based on plot(). However, from now on we will work with the package ‘factoextra’, which provides the function fviz_dend(). It works in a ggplot2 environment and can be modified as we have already learned with ggplot. So let’s try the fviz_dend() function.

Note: you do not need to use as.dendrogram() because fviz_dend() already expects an hclust object.

fviz_dend(single)
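Because fviz_dend() returns a ggplot object, you can already modify it with the ggplot2 functions you know. A minimal sketch (the title and theme below are chosen just as an example):

fviz_dend(single) +
  labs(title = "Single linkage clustering") +
  theme_minimal()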

Now let’s use some additional arguments in fviz_dend() to make our dendrogram more beautiful. Use the help function to find the argument controlling the number of groups you want to split your dendrogram into and set it to 2. Furthermore, plot your dendrogram horizontally, as you have seen before when using the plot() function.

fviz_dend(single,...)

Ah, this looks better! Still there is a lot to do… however, going through each argument would be too much for now. Below you find a code template you should use from now on. Of course there are some problems that need to be fixed! ;)

Find the arguments controlling the tip labels and the line width of the branches of the dendrogram and set them to 0.6 and 1.2, respectively. Additionally, change the color of the boxes around each cluster. You can inspect all color palettes provided by the package ‘RColorBrewer’ using display.brewer.all(). Use display.brewer.all(colorblindFriendly = TRUE, type = "qual") to select a colorblind-friendly palette. Change the name to your desired palette in the rect_border argument of the template below.

display.brewer.all()
display.brewer.all(colorblindFriendly = TRUE, type= "qual")


fviz_dend(single, 
          cex = 0.6, 
          horiz = TRUE, 
          k = 2,
          k_colors = "black", 
          label_cols = "black", 
          lwd = 1.2, 
          rect_fill = TRUE, 
          rect = TRUE,
          lower_rect = -0.02,
          show_labels = FALSE,
          rect_border = brewer.pal(2, name= "Set2")[1:2],
          type = "rectangle", 
          labels_track_height = 2, 
          main = "",
          ylab = "Floristic Dissimilarity")+

We already know that there are different clustering methods available (complete linkage, Ward’s method, etc.). Ward’s method and average linkage clustering are usually preferred as their metrics were found to produce the most reasonable clusterings. Note: there is no cluster method which is per se ‘better’ than another. They all produce different images of the same reality but none of them captures it completely. So even if Ward’s method is in general a good idea, you should compare it to the results of other methods.

Run a cluster analysis using Ward’s method (ward.D2) in the same manner as you have seen above. Use the help function to inspect the possible settings of the argument ‘method’. Then use the template for the dendrogram and change it so that it illustrates the results of the Ward clustering.

ward <- 

5) Compare Clustering Methods and Choose the (optimal) number of Clusters

Until now we simply assumed that there are 2 clusters in our dendrogram. But aren’t there actually more? And what is the right number of clusters for my dendrogram? There are several ways and packages which help to decide on the cluster number. Unfortunately they are of little help in deciding which method (in this case: single or Ward) is to be preferred. Still, the cluster.stats() function from the package ‘fpc’ provides us with a variety of statistics which can be compared. cluster.stats() needs a distance matrix and the cluster identities (the assignment of an element to a cluster) of our elements (localities) as input. To get the cluster identities, cutree() cuts the dendrogram resulting from hclust() into a desired number of clusters k. As we only work with 2 clusters, the result of cutree() is a vector of ones and twos.

k <- 2
id.ward <- cutree(ward, k = k)
id.single <- cutree(single, k = k)

ward_stats <- cluster.stats(bray.dist, id.ward)
single_stats <- cluster.stats(bray.dist, id.single)
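If you want to see what the cluster identities look like before comparing the statistics, a quick sketch:

head(id.ward)     # cluster membership (1 or 2) of the first localities
table(id.ward)    # number of localities assigned to each Ward cluster
table(id.single)  # the same for single linkage, for comparison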

The objects ward_stats and single_stats are now available in the Global Environment. They are lists containing 34 different statistics. Have a look and compare ‘within.cluster.ss’ (within-cluster sum of squares) and ‘avg.silwidth’ (average silhouette width). The within-cluster sum of squares should be as low as possible, the average silhouette width as high as possible. Which method (single or Ward) performs better according to these two statistics?

Note: The within-cluster sum of squares measures the variability within each cluster, i.e. how compact or spread out a cluster is. Compact clusters are preferred! I know the name is terrifying, but it simply measures the distance of every cluster element to the cluster centroid and sums these distances up. It takes the square of every distance to avoid negative values. And that’s it. The silhouette width measures whether an element is well clustered or not. This is done by measuring the mean distance of the element to its assigned cluster (dist_A) and the mean distance to the nearest other cluster (dist_B). The difference dist_B - dist_A is then divided by the larger of the two distances, so silhouette values range from -1 to 1. High values close to 1 indicate that an object is well clustered, negative values might indicate that an object is grouped in the wrong cluster, and values close to zero indicate that an element might lie in between two clusters.
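To make the silhouette calculation concrete, here is a tiny sketch with made-up distances for a single element:

dist_A <- 0.2                            # mean distance to the element's own cluster
dist_B <- 0.5                            # mean distance to the nearest other cluster
(dist_B - dist_A) / max(dist_A, dist_B)  # silhouette width = 0.6 -> well clustered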

Open ward_stats and single_stats and compare the statistics. Also try k = 4 in cutree() and compare the statistics when four clusters are considered. Do you see differences?

Below you see the code for a small data frame containing the values of the within-cluster sum of squares (WCSS) and the average silhouette width (SIL) for single and Ward clustering. Try to understand what the code does and proceed to the next task.

names <- c("WCSS", "SIL") # create a vector with the names of the statistics
ward1 <- c(ward_stats$within.cluster.ss, ward_stats$avg.silwidth) # values of the statistics for Ward clustering
single1 <- c(single_stats$within.cluster.ss, single_stats$avg.silwidth) # values of the statistics for single linkage clustering
df1 <- data.frame(t(cbind(ward1, single1))) # bind them into a data frame which is transposed by t()
colnames(df1) <- names # set the column names using 'names'

Use ggplot to create one barplot comparing the WCSS and one comparing the SIL of single and Ward clustering. HINT: x = rownames(df1), geom_bar(stat = "identity").