1. Purpose

This memo organizes and briefly presents the main simulations and assumptions of a cluster analysis developed for 159 gauged catchments located throughout southeastern Australia. A conceptual rainfall-runoff model is calibrated for each catchment, and hierarchical clustering is performed to classify the catchments based on their calibrated model parameters. Two clustering approaches are assessed: (1) assuming equal parameter weights and (2) assuming parameter weights determined by sensitivity analysis.

Most of the code and text employed here comes from the book Practical Guide To Cluster Analysis in R: Unsupervised Machine Learning, by Alboukadel Kassambara, 1st edition (2017); for the sake of simplicity, individual references to it are omitted.

In addition, the books Regionalization of Watersheds: An Approach Based on Cluster Analysis, by A. Ramachandra Rao and V. V. Srinivas (2008), and Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei, 3rd edition (2012), were also used as references.

2. Hydrologic Model

The adopted hydrologic model was the GR4J model. A description of the GR4J model can be found in Perrin et al. (2003). GR4J has four parameters: X1 (production store capacity, mm), X2 (groundwater exchange coefficient, mm), X3 (routing store capacity, mm), and X4 (unit hydrograph time base, days).

Figure 1: Diagram of the GR4J rainfall-runoff model (Perrin et al. 2003)


Part I: Assuming parameter weights are equal

3. Accessing the optimum parameter files

This code chunk accesses the optimized GR4J parameters for all the simulated catchments, saves them together with the KGE efficiency values in a matrix, and writes the matrix to a txt file in this memo's folder. The optimization simulations were performed separately, in a MATLAB code.

#################################################
#################################################
#Part I - Accessing the optimal parameters files
#################################################
#################################################
#Description
# This first part of the code accesses the folder
# of each watershed and extracts from it the ".mat"
# file in which the optimal parameters were saved.
# All parameters are then saved in a matrix in which
# each row represents a watershed and the columns are
# the optimum parameter set and the KGE efficiency
# value. The first 4 elements are the GR4J model
# parameters (X1, X2, X3, X4) and the last element
# is the KGE.
##################################################


#Clean R memory
rm(list=ls(all=TRUE))

#Loading package to read .mat files
library(R.matlab)
## Registered S3 method overwritten by 'R.oo':
##   method        from       
##   throw.default R.methodsS3
## R.matlab v3.6.2 (2018-09-26) successfully loaded. See ?R.matlab for help.
## 
## Attaching package: 'R.matlab'
## The following objects are masked from 'package:base':
## 
##     getOption, isOpen
#Loading clustering packages
library(cluster)
library(devtools)
## Loading required package: usethis
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
#Setting the base working directory where all the sub-folders
# are saved

setwd("C:/Users/rampincg/Desktop/Clarkson_Pc/cassio_PC/AGU_2019/dds_seaus240_gr4j/_gr4j_kge")

# Saving the path of the base directory
base.Directory<-getwd()

# Save the list of all the sub-folders names
folders<-dir('.')[file.info(dir('.',full.names=T))$isdir]

# Save the folder names in the variable folder.name
folder.name <- folders


# Create the matrix to save the optimum parameters for each
# watershed
optimum.Paramters.kge<-matrix(NA,nrow=length(folder.name),ncol=5)
mat.file=0

#Loop to access the folders and extract the optimum parameters
#set for each watershed and save them in a matrix.

for(i in 1:length(folder.name)){
  
  #Set the working directory as function of the folder name
  setwd(paste0(base.Directory,"/",folder.name[i]))
  
  #List of all files in the directory
  files<-list.files()
  
  #Access the .mat file that is the sixth element in the list
  mat.file<-readMat(files[6])
  
  #Record the first element of mat.file (a vector with the optimum
  #parameters) and the third element (the KGE value)
  par.optimal<-c(mat.file[[1]],mat.file[[3]])
  
  #Save the optimum parameters set of the watershed in a line
  optimum.Paramters.kge[i,]<-par.optimal
  
  
  #re-set the dummy variable
  par.optimal=0
  
}


#Create matrix to record parameters
data.Set.kge<-matrix(NA,nrow=length(folder.name),ncol=5)
rownames(data.Set.kge)=folder.name

##Saving optimum parameters#####################
################################################

#Setting work directory
setwd("C:/Users/rampincg/Desktop/Clarkson_Pc/cassio_PC/AGU_2019/Reports/Memo_01")

write.table(optimum.Paramters.kge,file="optimumParamtersKge_MEMO_01.txt")

4. Parameters densities

This R code chunk presents the parameter density plots covering all simulated sites.

#################################################
#################################################
#Part II - Parameters densities
#################################################
#################################################

#Saving row names
names.temp=rownames(data.Set.kge)

#Removing the 5th column (KGE values)
data.Set.kge<-optimum.Paramters.kge[,-5]

#Reinserting the row names after removing the 5th column
rownames(data.Set.kge)=names.temp


#Rename the columns with the model parameters names
colnames(data.Set.kge)<-c("X1","X2","X3","X4")

#Presenting a summary of the parameters statistics
summary(data.Set.kge)
##        X1                X2                X3                 X4        
##  Min.   :  29.37   Min.   :-5.0000   Min.   :  0.0133   Min.   :0.5000  
##  1st Qu.: 148.63   1st Qu.:-3.6216   1st Qu.: 11.3438   1st Qu.:0.5000  
##  Median : 216.29   Median :-1.8079   Median : 22.3868   Median :0.9453  
##  Mean   : 281.72   Mean   :-1.9399   Mean   : 39.4130   Mean   :0.9378  
##  3rd Qu.: 333.79   3rd Qu.:-0.6901   3rd Qu.: 45.6590   3rd Qu.:1.0695  
##  Max.   :1000.00   Max.   : 5.0000   Max.   :300.0000   Max.   :5.0000
#################################################
# Parameters Densities with GGPLOT2
#################################################


#Loading required packages
library(reshape2)
library(ggplot2)
library(cowplot)
## 
## ********************************************************
## Note: As of version 1.0.0, cowplot does not change the
##   default ggplot2 theme anymore. To recover the previous
##   behavior, execute:
##   theme_set(theme_cowplot())
## ********************************************************
#Setting up the variables

index=seq(from=1,to=nrow(data.Set.kge),by=1)

dataX1<-cbind(index,data.Set.kge[,1])
dataX1<-as.data.frame(dataX1)
cnames=c("index","X1")
colnames(dataX1)=cnames
dataX1<-reshape2::melt(dataX1, id.var='index')


dataX2<-cbind(index,data.Set.kge[,2])
dataX2<-as.data.frame(dataX2)
cnames=c("index","X2")
colnames(dataX2)=cnames
dataX2<-reshape2::melt(dataX2, id.var='index')


dataX3<-cbind(index,data.Set.kge[,3])
dataX3<-as.data.frame(dataX3)
cnames=c("index","X3")
colnames(dataX3)=cnames
dataX3<-reshape2::melt(dataX3, id.var='index')


dataX4<-cbind(index,data.Set.kge[,4])
dataX4<-as.data.frame(dataX4)
cnames=c("index","X4")
colnames(dataX4)=cnames
dataX4<-reshape2::melt(dataX4, id.var='index')


#Preparing the plots variables for ggplot2
plot1<-ggplot(dataX1) + stat_density(aes(x=value, y=..scaled..,color=variable,linetype=variable),lwd=1,  geom="line",position = "identity")+
  ylab("Density")+
  xlab("X1")+
  xlim(28,1100) + 
  theme_bw() +
  theme(legend.position="top")+
  theme(legend.title = element_blank())+
  theme(axis.title.x=element_text(size=14,face="bold"),axis.title.y=element_text(size=14))+
  theme(legend.text=element_text(size=14))+
  theme(axis.text.x = element_text(size = 16),axis.text.y = element_text(size = 16))


plot2<-ggplot(dataX2) + stat_density(aes(x=value, y=..scaled..,color=variable,linetype=variable),lwd=1,  geom="line",position = "identity")+
  ylab("Density")+
  xlab("X2")+
  xlim(-5,5) +
  theme_bw() +
  theme(legend.position="top")+
  theme(legend.title = element_blank())+
  theme(axis.title.x=element_text(size=14,face="bold"),axis.title.y=element_text(size=14))+
  theme(legend.text=element_text(size=14))+
  theme(axis.text.x = element_text(size = 16),axis.text.y = element_text(size = 16))


plot3<-ggplot(dataX3) + stat_density(aes(x=value, y=..scaled..,color=variable,linetype=variable),lwd=1,  geom="line",position = "identity")+
  ylab("Density")+
  xlab("X3")+
  xlim(0,300) +
  theme_bw() +
  theme(legend.position="top")+
  theme(legend.title = element_blank())+
  theme(axis.title.x=element_text(size=14,face="bold"),axis.title.y=element_text(size=14))+
  theme(legend.text=element_text(size=14))+
  theme(axis.text.x = element_text(size = 16),axis.text.y = element_text(size = 16))


plot4<-ggplot(dataX4) + stat_density(aes(x=value, y=..scaled..,color=variable,linetype=variable),lwd=1,  geom="line",position = "identity")+
  ylab("Density")+
  xlab("X4")+
  xlim(0.4,5) +
  theme_bw() +
  theme(legend.position="top")+
  theme(legend.title = element_blank())+
  theme(axis.title.x=element_text(size=14,face="bold"),axis.title.y=element_text(size=14))+
  theme(legend.text=element_text(size=14))+
  theme(axis.text.x = element_text(size = 16),axis.text.y = element_text(size = 16))


#Function plot
plot_grid(plot1,plot2,plot3,plot4, labels = "AUTO")

5. Scaling Model Parameters

The parameter values were standardized because they are on different scales; otherwise, the dissimilarity measures obtained would be severely affected. The standardization rescales each parameter to the range 0-1 (min-max scaling). The R code chunk below applies this transformation to each parameter (column) so that all values fall between 0 and 1. Next, the matrix with the scaled parameters is saved to a txt file.
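For reference, the same min-max transformation can be written more compactly; this is a sketch only, not the memo's original code (the chunk actually used follows).

#Sketch only: 0-1 min-max scaling of all columns at once
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
df.scaled.check <- as.data.frame(apply(data.Set.kge, 2, min_max))
#df.scaled.check should equal the df.scaled.kge built column by column below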

#####################################################################
#Scaling the variables into the range 0-1
#####################################################################
df.scaled.kge<-as.data.frame(data.Set.kge)

#Variables used to save the scaled columns
col1=col2=col3=col4=0

col1<-(df.scaled.kge[,1]-min(df.scaled.kge[,1]))/(max(df.scaled.kge[,1])-min(df.scaled.kge[,1]))
col2<-(df.scaled.kge[,2]-min(df.scaled.kge[,2]))/(max(df.scaled.kge[,2])-min(df.scaled.kge[,2]))
col3<-(df.scaled.kge[,3]-min(df.scaled.kge[,3]))/(max(df.scaled.kge[,3])-min(df.scaled.kge[,3]))
col4<-(df.scaled.kge[,4]-min(df.scaled.kge[,4]))/(max(df.scaled.kge[,4])-min(df.scaled.kge[,4]))

df.scaled.kge[,1]=col1
df.scaled.kge[,2]=col2
df.scaled.kge[,3]=col3
df.scaled.kge[,4]=col4

#Saving a txt file with the optimum parameters data set matrix
write.table(data.Set.kge,file="data.Set.kge_MEMO_01.txt")


#Saving a txt file with the scaled optimum parameters data set matrix
write.table(df.scaled.kge,file="df.scaled.kge_MEMO_01.txt")

#Naming the columns of the scaled matrix
colnames(df.scaled.kge)<-c("X1","X2","X3","X4")

6. Heat map of the scaled variables

A heat map plot of the scaled parameters is presented as follows.

Figure 2: Heatmap of the scaled parameters

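The code used to generate Figure 2 is not included in this memo. A minimal sketch of how a comparable heat map could be drawn from the scaled matrix with base R's heatmap() is shown below; the original figure may have been produced with a different function.

#Sketch only: one possible way to draw a heat map of the scaled parameters
#(the original figure may have been generated differently)
heatmap(as.matrix(df.scaled.kge), scale="none",
        xlab="GR4J parameter", ylab="Catchment")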

7. Distance Matrix/Dissimilarity

Three types of distance measures were computed over the data matrix: one that does not take into account the correlation between the variables (the Euclidean distance) and two others that do account for correlation (the Pearson and Mahalanobis distances).

The choice of distance measure is crucial in cluster analysis, as it has a strong influence on the clustering results. Most clustering analyses employ the Euclidean distance, but depending on the type of data a correlation-based distance should be considered. A correlation-based distance considers two objects to be similar if their features are highly correlated, even though the observed values may be far apart in terms of Euclidean distance.

For a correlation-based distance, the distance between two objects is 0 when they are perfectly correlated. If we want to identify clusters of observations with the same overall profiles regardless of their magnitudes, then we should use a correlation-based distance as the dissimilarity measure. If the Euclidean distance is adopted, then observations with uniformly high (or uniformly low) feature values will be clustered together.
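A toy illustration of this point (hypothetical values, not catchment data): two parameter profiles with the same shape but very different magnitudes are far apart in Euclidean terms yet have a correlation-based distance of zero.

#Toy example (hypothetical values): same profile shape, different magnitude
a <- c(1, 2, 3, 4)
b <- 10 * a
dist(rbind(a, b), method="euclidean")  #large Euclidean distance (about 49.3)
1 - cor(a, b)                          #correlation-based distance = 0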

The distance is computed over the data matrix. In general, the data matrix has the following structure, in which n is the number of objects (here, the number of gauging stations), p is the number of features or attributes (here, the number of hydrologic model parameters), and f is used to index the p attributes. For this simulation, n = 159 and p = 4.

\[\mathbf{Data \ Matrix} = \left[\begin{array} {rrrrrr} x_{11} & ... & x_{1f} & ... & x_{1p} \\ ... & ... & ... & ... & ... \\ x_{i1} & ... & x_{if} & ... & x_{ip} \\ ... & ... & ... & ... & ... \\ x_{n1} & ... & x_{nf} & ... & x_{np} \end{array}\right] \]

In this simulation the data matrix is named df.scaled.kge; each row represents a watershed and each column a parameter of the GR4J model. The initial rows of the data matrix are presented as follows.

head(df.scaled.kge)
##                  X1        X2         X3         X4
## seaus_01 0.41944476 0.5371352 0.05987039 0.10844667
## seaus_03 0.18154127 0.5026927 0.05428914 0.11780311
## seaus_04 0.21506588 0.5355217 0.05159455 0.00000000
## seaus_05 0.07593944 0.3816126 0.04074144 0.07548193
## seaus_06 0.18552642 0.1232940 0.13082233 0.18141533
## seaus_07 0.06683656 0.6840576 0.21576537 0.00000000

The distance matrix, also called the dissimilarity matrix, stores a collection of proximities for all pairs of the n objects (gauging stations) and has the following general structure.

\[\mathbf{Dissimilarity \ Matrix} = \left[\begin{array} {rrrrr} 0 & & & & \\ d_{2,1} & 0 & & & \\ d_{3,1} & d_{3,2} & 0 & & \\ ... & ... & ... & ... & \\ d_{n,1} & d_{n,2} & ... & ... & 0 \end{array}\right] \]

Where \(d_{i,j}\) is the measured dissimilarity, distance, or difference between objects (gauging stations) i,j. In general, \(d_{i,j}\) is a non-negative number that is close to 0 when objects i and j are highly similar or “near” each other, and becomes larger the more they differ.
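As a small illustration (a toy 3 x 2 data set, not the catchment data), the R function dist() stores only the lower triangle of this matrix as a "dist" object, which as.matrix() expands to the full n x n form.

#Toy illustration (hypothetical 3 x 2 data matrix)
toy <- matrix(c(0.1, 0.2, 0.4, 0.5, 0.9, 0.7), nrow=3)
d.toy <- dist(toy)
d.toy             #lower-triangular entries d21, d31, d32
as.matrix(d.toy)  #full symmetric 3 x 3 matrix with zeros on the diagonal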

There are many R functions for computing distances between observations. The ones used in this simulation are:

  • dist() R base function [stats package] to compute Euclidean distance

  • get_dist() function [factoextra package] to compute Pearson distance, and

  • mahalanobis.dist() function [StatMatch package] to compute the Mahalanobis distance.

A way of visualizing the distance matrices in R is to use the function fviz_dist() [factoextra package]. The color level is proportional to the value of the dissimilarity between observations: pure \(\color{red}{\text{red}}\) if \(d_{i,j}=0\) and pure \(\color{blue}{\text{blue}}\) if \(d_{i,j}=1\). Objects belonging to the same cluster are displayed in consecutive order.

The following sub-sections present the results for each of the distance matrices mentioned above.

7.1 Euclidean Distance

The Euclidean distance is given by:

\[ d_{i,j}= \sqrt{\sum^p_{k=1}(x_{i,k}-x_{j,k})^2} \]

In which:

\(p =\) number of attributes (model parameters)

\(k =\) attribute index (model parameter)

\(i,j =\) indices of the gauging stations (rows of the data matrix)
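As a quick check of this formula against R's dist() (a minimal sketch with two hypothetical scaled parameter vectors, not actual catchments):

#Sketch only: the Euclidean formula applied directly vs. dist()
xi <- c(0.42, 0.54, 0.06, 0.11)
xj <- c(0.18, 0.50, 0.05, 0.12)
sqrt(sum((xi - xj)^2))   #formula above applied directly
dist(rbind(xi, xj))      #same value from dist()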

The R code used to compute the Euclidean distance is presented below, together with the dissimilarity matrix:

#################################################
#################################################
#Part III - Euclidean Distance computation
#################################################
#################################################

#Euclidean Distance
dist.eucl.kge<-dist(df.scaled.kge,method="euclidean")

#Visualizing the distance matrices
fviz_dist(dist.eucl.kge)

7.1.1 Defining the Number of Clusters

Two R functions will be employed for determining the optimal number of clusters:

  • fviz_nbclust() function [factoextra R package] was used to compute the suggested number of clusters based on three different methods (elbow, silhouette, and gap statistic) for the partitioning clustering methods k-means and k-medoids (PAM)

    • K-Means
#################################################
#################################################
#Part IV - Number of Clusters-K-Means
#################################################
#################################################
#Loading package
library(factoextra)

#Elbow method
fviz_nbclust(df.scaled.kge,kmeans,method="wss")+geom_vline(xintercept=3,linetype=2)+labs(subtitle="Elbow method")

#Silhouette method
fviz_nbclust(df.scaled.kge,kmeans,method="silhouette")+labs(subtitle="Silhouette method")

#Gap Statistic
#nboot=50 to keep the function speedy.
#recommended value: nboot=500 for your analysis.
#Use verbose=FALSE to hide computing progression.
set.seed(123)
fviz_nbclust(df.scaled.kge,kmeans,method="gap_stat",nboot=50)+labs(subtitle="Gap statistic method")

Elbow method: 3 clusters suggested

#Compute k-means k=3
set.seed(123)
km.res<-kmeans(df.scaled.kge,3,nstart=15)

#Plotting Cluster
fviz_cluster(km.res,data=df.scaled.kge,palette="jco",repel=TRUE,ggtheme=theme_classic(),show.clust.cent = FALSE,stand=FALSE,geom="point")

Silhouette method: 2 clusters suggested

#Compute k-means k=2
set.seed(123)
km.res<-kmeans(df.scaled.kge,2,nstart=15)

#Plotting Cluster
fviz_cluster(km.res,data=df.scaled.kge,palette="jco",repel=TRUE,ggtheme=theme_classic(),show.clust.cent = FALSE,stand=FALSE,geom="point")

Gap Statistic Method: 1 cluster suggested

#Compute k-means k=1
set.seed(123)
km.res<-kmeans(df.scaled.kge,1,nstart=15)

#Plotting Cluster
fviz_cluster(km.res,data=df.scaled.kge,palette="jco",repel=TRUE,ggtheme=theme_classic(),show.clust.cent = FALSE,stand=FALSE,geom="point")

  • K-Medoids (PAM)
#################################################
#################################################
#Part IV - Number of Clusters-K-Medoids (PAM)
#################################################
#################################################
#Loading package
library(factoextra)

#Elbow method
fviz_nbclust(df.scaled.kge,pam,method="wss")+geom_vline(xintercept=3,linetype=2)+labs(subtitle="Elbow method")

#Silhouette method
fviz_nbclust(df.scaled.kge,pam,method="silhouette")+labs(subtitle="Silhouette method")

#Gap Statistic
#nboot=50 to keep the function speedy.
#recommended value: nboot=500 for your analysis.
#Use verbose=FALSE to hide computing progression.
set.seed(123)
fviz_nbclust(df.scaled.kge,pam,method="gap_stat",nboot=50)+labs(subtitle="Gap statistic method")

Elbow method: 3 clusters suggested

#Compute k-medoids k=3
set.seed(123)
pam.res<-pam(df.scaled.kge,3)

#Plotting Cluster
fviz_cluster(pam.res,palette="jco",repel=TRUE,ggtheme=theme_classic(),show.clust.cent = FALSE,stand=FALSE,geom="point")

Silhouette method: 5 clusters suggested

#Compute k-medoids k=5
set.seed(123)
pam.res<-pam(df.scaled.kge,5)

#Plotting Cluster
fviz_cluster(pam.res,palette="jco",repel=TRUE,ggtheme=theme_classic(),show.clust.cent = FALSE,stand=FALSE,geom="point")

Gap Statistic Method: 1 cluster suggested

#Compute k-medoids k=1
set.seed(123)
pam.res<-pam(df.scaled.kge,1)

#Plotting Cluster
fviz_cluster(pam.res,palette="jco",repel=TRUE,ggtheme=theme_classic(),show.clust.cent = FALSE,stand=FALSE,geom="point")

  • NbClust() function [NbClust R package] was used to compute the suggested number of clusters based on the results obtained by varying all combinations of number of clusters, distance measures, and clustering methods. It computes all the indices simultaneously and determines the number of clusters in a single function call.
#Loading required package
library("NbClust")

#Computing the suggested number of clusters with kmeans
nb<-NbClust(df.scaled.kge,distance="euclidean",min.nc=2,max.nc=10,method="kmeans")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 7 proposed 3 as the best number of clusters 
## * 3 proposed 5 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 4 proposed 8 as the best number of clusters 
## * 1 proposed 9 as the best number of clusters 
## * 3 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
fviz_nbclust(nb)
## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 4 proposed  2 as the best number of clusters
## * 7 proposed  3 as the best number of clusters
## * 3 proposed  5 as the best number of clusters
## * 1 proposed  7 as the best number of clusters
## * 4 proposed  8 as the best number of clusters
## * 1 proposed  9 as the best number of clusters
## * 3 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  3 .

#Computing the suggested number of clusters with ward.D
nb<-NbClust(df.scaled.kge,distance="euclidean",min.nc=2,max.nc=10,method="ward.D")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 8 proposed 2 as the best number of clusters 
## * 4 proposed 3 as the best number of clusters 
## * 3 proposed 4 as the best number of clusters 
## * 2 proposed 6 as the best number of clusters 
## * 5 proposed 7 as the best number of clusters 
## * 2 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
fviz_nbclust(nb)
## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 8 proposed  2 as the best number of clusters
## * 4 proposed  3 as the best number of clusters
## * 3 proposed  4 as the best number of clusters
## * 2 proposed  6 as the best number of clusters
## * 5 proposed  7 as the best number of clusters
## * 2 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .

#Computing the suggested number of clusters with ward.D2
nb<-NbClust(df.scaled.kge,distance="euclidean",min.nc=2,max.nc=10,method="ward.D2")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 9 proposed 2 as the best number of clusters 
## * 2 proposed 3 as the best number of clusters 
## * 5 proposed 5 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 5 proposed 9 as the best number of clusters 
## * 2 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
fviz_nbclust(nb)
## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 9 proposed  2 as the best number of clusters
## * 2 proposed  3 as the best number of clusters
## * 5 proposed  5 as the best number of clusters
## * 1 proposed  6 as the best number of clusters
## * 5 proposed  9 as the best number of clusters
## * 2 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .

  • Hierarchical Clustering (Agglomerative)

The linkage function takes the distance information returned by the function dist() and groups pairs of objects into clusters based on their similarity. Next, these newly formed clusters are linked to each other to create bigger clusters. This process is iterated until all the objects in the original data set are linked together in a hierarchical tree. Hence, given the distance matrix, the R base function hclust() can be used to create the hierarchical tree. There are many cluster agglomeration (linkage) methods. In this study we consider two of them: average linkage and Ward's minimum variance method. The former defines the distance between two clusters as the average distance between the elements in cluster 1 and the elements in cluster 2. The latter minimizes the total within-cluster variance: at each step the pair of clusters with minimum between-cluster distance is merged.

The dendrograms generated by the function fviz_dend() [factoextra R package] correspond to the graphical representation of the hierarchical tree. In the dendrogram, each leaf corresponds to one object (gauging station). As we move up the tree, objects (gauging stations) that are similar to each other are combined into branches, which are themselves fused at a greater height.

The height of the fusion, provided on the vertical axis, indicates the dissimilarity/distance between two objects/clusters. The higher the height of the fusion, the less similar the objects are. This height is known as the cophenetic distance between the two objects. In order to identify sub-groups, we can cut the dendrogram at a certain height (a minimal cutree() sketch is given after the cluster tree verification below).

  • Ward’s linkage
res.hc.ward<-hclust(d=dist.eucl.kge,method="ward.D2")
fviz_dend(res.hc.ward,cex=0.2)

  • Average linkage
res.hc.average<-hclust(d=dist.eucl.kge,method="average")
fviz_dend(res.hc.average,cex=0.2)

  • Verifying Cluster Tree - Ward's vs. Average linkage

One way to measure how well the cluster tree generated by the hclust() function reflects the data is to compute the correlation between the cophenetic distances and the original distances generated by the dist() function. If the clustering is valid, the linking of objects in the cluster tree should have a strong correlation with the distances between objects in the original distance matrix.

The closer the value of the correlation coefficient is to 1, the more accurately the clustering solution reflects the data. Values above 0.75 are generally considered good.

The correlation between the cophenetic distance and the original distance is computed for Ward's linkage and average linkage as follows.

  • Ward’s linkage
#Computing cophenetic distance
res.coph.ward<-cophenetic(res.hc.ward)
#Correlation between cophenetic distance and the original distance
cor(dist.eucl.kge,res.coph.ward)
## [1] 0.6054076
  • Average linkage
#Computing cophenetic distance
res.coph.average<-cophenetic(res.hc.average)
#Correlation between cophenetic distance and the original distance
cor(dist.eucl.kge,res.coph.average)
## [1] 0.8716902

The average linkage method appears to produce a higher value of this statistic.
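To identify sub-groups from the hierarchical tree, the dendrogram can be cut into k groups with the base function cutree(). The sketch below is illustrative only and is not part of the original analysis; k = 3 is an arbitrary choice and res.hc.ward is the Ward tree built above.

#Minimal sketch (illustrative only): cut the Ward tree into k = 3 groups
grp <- cutree(res.hc.ward, k = 3)
table(grp)                                #number of catchments per group
fviz_dend(res.hc.ward, k = 3, cex = 0.2)  #dendrogram colored by the 3 groups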

7.1.2 Checking Clustering Tendency

Assessing clustering tendency consists of verifying whether the data set contains meaningful clusters, in other words, checking that they are not random structures, since clustering methods will return clusters even if the data do not contain any.

  • Principal Component Analysis (PCA) - Visual Inspection

Since the data contain more than two variables, the dimensionality of the data is reduced in order to draw a scatter plot, using the principal component analysis (PCA) algorithm implemented in the R function prcomp(). Then, the function fviz_pca_ind() [factoextra package] is applied to visualize the scatter plot.

Basically, the original data set is compared with one derived from a random data set. The R code is presented as follows.

#Random data generated from the original data set
random_df<-apply(df.scaled.kge,2,function(x){runif(length(x),min(x),max((x)))})
random_df<-as.data.frame(random_df)

#Plot of the original data set
fviz_pca_ind(prcomp(df.scaled.kge),title="PCA-Original Data",pallete="jco",geom="point",ggtheme=theme_classic(),legend="bottom")

#Plot of the random data set
fviz_pca_ind(prcomp(random_df),title="PCA-Random Data",pallete="jco",geom="point",ggtheme=theme_classic(),legend="bottom")

  • Hopkins Statistic (H)

The Hopkins statistic is used to assess the clustering tendency of a data set by measuring the probability that the data set was generated by a uniform data distribution. In other words, it tests the spatial randomness of the data. If the value of the H statistic is close to zero, then we can reject the null hypothesis (the data set is uniformly distributed, i.e., contains no meaningful clusters) and conclude that the data set is significantly clusterable.

The R function hopkins() [clustertend package] can be used to statistically evaluate the clustering tendency in R. The R code is presented below.

#Loading package
library(clustertend)

#Compute Hopkins statistic for the original dataset
hopkins(df.scaled.kge,n=nrow(df.scaled.kge)-1)
## $H
## [1] 0.170625
#Compute Hopkins statistic for a random dataset
hopkins(random_df,n=nrow(random_df)-1)
## $H
## [1] 0.4916442
  • Visual Assessment of Cluster Tendency (VAT)

For the visual assessment of clustering tendency, the dissimilarity matrices of the original data and the random data are compared.

#Function used to display the dissimilarity matrix for the original data
fviz_dist(dist(df.scaled.kge),show_labels = F)+labs(title="Original Data")

#Function used to display the dissimilarity matrix for the random data
fviz_dist(dist(random_df),show_labels = F)+labs(title="Random Data")

Based on all clustering tendency methods one can conclude that the data set is clusterable.

7.1.3 Selecting the Best Clustering Algorithms

To define the best clustering method, the R package clValid was used to compare multiple clustering algorithms simultaneously in a single function call, identifying the best clustering approach and the optimal number of clusters. The R code and outputs are presented below.

The clValid package compares clustering algorithms using two types of cluster validation measures:

  1. Internal measures, which use intrinsic information in the data to assess the quality of the clustering. Internal measures include the connectivity, the silhouette coefficient, and the Dunn index.

The R code is presented as follows.

#Loading package
library(clValid)

#Compute clValid
clmethods<-c("hierarchical","kmeans","pam")
intern<-clValid(df.scaled.kge,nClust=2:10,clMethods=clmethods,validation="internal")

#Summary
summary(intern)
## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  2 3 4 5 6 7 8 9 10 
## 
## Validation Measures:
##                                   2        3        4        5        6        7        8        9       10
##                                                                                                            
## hierarchical Connectivity    3.8579  19.6496  27.5734  29.4623  41.4294  50.6139  54.2222  54.6889  57.1429
##              Dunn            0.4136   0.1127   0.1305   0.1305   0.1598   0.1413   0.1413   0.1413   0.1413
##              Silhouette      0.6500   0.4951   0.4400   0.4210   0.3793   0.3388   0.3243   0.3118   0.2898
## kmeans       Connectivity   21.2865  30.1786  39.0679  51.4873  74.3754  72.5179  78.0829  81.9377  93.3433
##              Dunn            0.1083   0.0811   0.1058   0.0473   0.0688   0.0462   0.0665   0.0795   0.0462
##              Silhouette      0.4643   0.4020   0.4162   0.3227   0.2897   0.3296   0.3261   0.3055   0.3039
## pam          Connectivity   46.7452  50.8270  69.1456  66.1341  93.6694  91.6909  95.7893 118.5155 130.1103
##              Dunn            0.0236   0.0181   0.0309   0.0301   0.0341   0.0341   0.0486   0.0319   0.0344
##              Silhouette      0.2255   0.2608   0.2458   0.2856   0.1818   0.2103   0.2218   0.2142   0.2079
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 3.8579 hierarchical 2       
## Dunn         0.4136 hierarchical 2       
## Silhouette   0.6500 hierarchical 2

Based on these internal measures, hierarchical clustering with two clusters performed the best in all three tests.

  2. Stability measures, a special version of internal measures, which evaluate the consistency of a clustering result by comparing it with the clusters obtained after each column is removed, one at a time. Cluster stability measures include:
  • The average proportion of non-overlap (APN);
  • The average distance (AD);
  • The average distance between means (ADM);
  • The figure of merit (FOM).

    The values of APN, ADM and FOM range from 0 to 1, with smaller values corresponding to more consistent clustering results. AD takes values between 0 and infinity, and smaller values are preferred.

The R code is presented as follows.

#Compute clValid
clmethods<-c("hierarchical","kmeans","pam")
intern<-clValid(df.scaled.kge,nClust=2:10,clMethods=clmethods,validation="stability")

#Summary
summary(intern)
## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  2 3 4 5 6 7 8 9 10 
## 
## Validation Measures:
##                        2      3      4      5      6      7      8      9     10
##                                                                                 
## hierarchical APN  0.0378 0.0412 0.0354 0.0655 0.0992 0.0778 0.1184 0.1268 0.1216
##              AD   0.4309 0.3953 0.3781 0.3688 0.3625 0.3353 0.3236 0.3198 0.3152
##              ADM  0.0344 0.0521 0.0584 0.0646 0.0940 0.0933 0.0866 0.0852 0.0828
##              FOM  0.1741 0.1739 0.1723 0.1685 0.1678 0.1677 0.1672 0.1665 0.1670
## kmeans       APN  0.1672 0.2153 0.3442 0.4012 0.4289 0.2700 0.3679 0.2821 0.2597
##              AD   0.4134 0.3895 0.3777 0.3556 0.3394 0.3043 0.2992 0.2804 0.2711
##              ADM  0.1121 0.1068 0.1517 0.1967 0.1913 0.1455 0.1525 0.1274 0.1251
##              FOM  0.1787 0.1744 0.1748 0.1710 0.1675 0.1653 0.1657 0.1639 0.1641
## pam          APN  0.2843 0.2902 0.3410 0.4010 0.3922 0.4048 0.4137 0.4778 0.5338
##              AD   0.4201 0.3615 0.3479 0.3315 0.3192 0.2997 0.2904 0.2883 0.2862
##              ADM  0.1659 0.1240 0.1476 0.1562 0.1505 0.1437 0.1433 0.1521 0.1639
##              FOM  0.1818 0.1713 0.1707 0.1690 0.1684 0.1656 0.1688 0.1685 0.1687
## 
## Optimal Scores:
## 
##     Score  Method       Clusters
## APN 0.0354 hierarchical 4       
## AD  0.2711 kmeans       10      
## ADM 0.0344 hierarchical 2       
## FOM 0.1639 kmeans       9

Based on the APN and ADM measures, hierarchical clustering had the best score, with four and two clusters respectively. Based on the AD measure, k-means clustering with ten clusters had the best score, while for the FOM measure k-means with nine clusters had the best score.

From here on, the same structure is repeated for the Pearson and Mahalanobis distances. For the sake of simplicity, most of the description is omitted since the basic process is equivalent to that used for the Euclidean distance.

7.2 Pearson Distance

The Pearson distance is given by:

\[ d_{i,j}= 1-\frac{\sum^p_{k=1}(x_{i,k}-\bar{x}_{i})(x_{j,k}-\bar{x}_j)}{\sqrt{\sum^p_{k=1}(x_{i,k}-\bar{x}_{i})^2}\sqrt{\sum^p_{k=1}(x_{j,k}-\bar{x}_{j})^2}} \]

In which:

\(p =\) number of attributes (model parameters)

\(k =\) attribute index (model parameter)

\(i,j =\) indices of the gauging stations (rows of the data matrix)

\(\bar{x}_i =\) mean of the parameters of row i

\(\bar{x}_j =\) mean of the parameters of row j
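As a quick check of this definition (a minimal sketch; it assumes that get_dist() with method="pearson" implements exactly the 1 minus correlation form given above), the distance between the first two catchments can be computed directly from cor().

#Sketch only: Pearson distance between the first two rows as 1 - correlation
xi <- as.numeric(df.scaled.kge[1, ])
xj <- as.numeric(df.scaled.kge[2, ])
1 - cor(xi, xj)  #should match the corresponding entry of the Pearson
                 #distance matrix computed in the chunk below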

The R code used to compute the Pearson distance is presented below, together with the dissimilarity matrix:

#################################################
#################################################
#Part III - Pearson Distance computation
#################################################
#################################################

#Pearson Distance
dist.pearson.kge<-get_dist(df.scaled.kge,method="pearson")

#Visualizing the distance matrices
fviz_dist(dist.pearson.kge)

  • Hierarchical Clustering (Agglomerative)

  • Ward’s linkage

res.hc.ward<-hclust(d=dist.pearson.kge,method="ward.D2")
fviz_dend(res.hc.ward,cex=0.2)

  • Average linkage
res.hc.average<-hclust(d=dist.pearson.kge,method="average")
fviz_dend(res.hc.average,cex=0.2)

  • Verifying Cluster Tree - Ward's vs. Average linkage

    • Ward’s linkage
#Computing cophenetic distance
res.coph.ward<-cophenetic(res.hc.ward)
#Correlation between cophenetic distance and the original distance
cor(dist.pearson.kge,res.coph.ward)
## [1] 0.7184322
  • Average linkage
#Computing cophenetic distance
res.coph.average<-cophenetic(res.hc.average)
#Correlation between cophenetic distance and the original distance
cor(dist.pearson.kge,res.coph.average)
## [1] 0.7800221

7.3 Mahalanobis Distance

The Mahalanobis distance is given by:

\[ d_{i,j}= \sqrt{(X_i-X_j)^T \, \Sigma^{-1}(X_i-X_j)} \]

In which:

\(X_i, X_j =\) vectors of attributes (model parameters) of objects i and j

\(\Sigma =\) covariance matrix of the attributes
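As a quick check of this definition (a minimal sketch; it assumes that mahalanobis.dist() estimates the covariance matrix \(\Sigma\) from the data when no covariance is supplied, as the StatMatch documentation describes), the distance between the first two catchments can be computed directly.

#Sketch only: Mahalanobis distance between the first two catchments,
#computed from the definition with the sample covariance of the data
S  <- cov(df.scaled.kge)
xi <- as.numeric(df.scaled.kge[1, ])
xj <- as.numeric(df.scaled.kge[2, ])
sqrt(t(xi - xj) %*% solve(S) %*% (xi - xj))  #should match the corresponding
                                             #entry of the matrix computed below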

The R code used to compute the Mahalanobis distance is presented below, together with the dissimilarity matrix:

#################################################
#################################################
#Part III - Mahalanobis Distance computation
#################################################
#################################################
#Package to Mahalanobis Distance
library(StatMatch)
## Loading required package: proxy
## 
## Attaching package: 'proxy'
## The following objects are masked from 'package:stats':
## 
##     as.dist, dist
## The following object is masked from 'package:base':
## 
##     as.matrix
## Loading required package: clue
## Loading required package: survey
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
## 
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
## 
##     dotchart
## Loading required package: RANN
## Loading required package: lpSolve
#Computing Mahalanobis distance
dist.maha.kge<-mahalanobis.dist(df.scaled.kge)

#Converting the Mahalanobis distance to a dist class
dist.maha.kge=as.dist(dist.maha.kge)

#Visualizing the distance matrices
fviz_dist(dist.maha.kge)

  • Hierarchical Clustering (Agglomerative)

  • Ward’s linkage

res.hc.ward<-hclust(d=dist.maha.kge,method="ward.D2")
fviz_dend(res.hc.ward,cex=0.2)

  • Average linkage
res.hc.average<-hclust(d=dist.maha.kge,method="average")
fviz_dend(res.hc.average,cex=0.2)

  • Verifying Cluster Tree - Ward's vs. Average linkage

    • Ward’s linkage
#Computing cophenetic distance
res.coph.ward<-cophenetic(res.hc.ward)
#Correlation between cophenetic distance and the original distance
cor(dist.maha.kge,res.coph.ward)
## [1] 0.607091
  • Average linkage
#Computing cophenetic distance
res.coph.average<-cophenetic(res.hc.average)
#Correlation between cophenetic distance and the original distance
cor(dist.maha.kge,res.coph.average)
## [1] 0.8443855

Part II: Assuming parameter weights determined by sensitivity analysis

8. Accessing the Parameter Weights Files
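#################################################
#Part II - Accessing the parameter weights files
#################################################
#Description
# For each watershed, this chunk reads the .mat file holding the
# VARS sensitivity-analysis output and extracts the I-VARS based
# sensitivity measure of each GR4J parameter. These measures are
# normalized further below so that the four weights of each
# watershed sum to 1.
#################################################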

setwd("C:/Users/rampincg/Desktop/Clarkson_Pc/cassio_PC/AGU_2019/vars_seaus240_gr4j/_gr4j_kge")

# Saving the path of the base directory
base.Directory<-getwd()

# Save the list of all the sub-folders names
folders<-dir('.')[file.info(dir('.',full.names=T))$isdir]

# Save the folder names in the variable folder.name
folder.name <- folders


# Create the matrix to save I-VARS weight for each parameter

weight.Paramters.kge<-matrix(NA,nrow=length(folder.name),ncol=4)


#Loop to access the folders, extract the parameter weights
#for each watershed, and save them in a matrix.

for(i in 1:length(folder.name)){
  
  #Set the working directory as function of the folder name
  setwd(paste0(base.Directory,"/",folder.name[i]))
  
  #List of all files in the directory
  files<-list.files()
  
  #Access the .mat file that is the fourth element in the list
  mat.file<-readMat(files[4])
  
  temp1<-mat.file[[2]]
  temp2<-temp1[[7]]
  temp3<-temp2[[100]]
  temp3<-temp3[[1]]
  
  #Record the extracted vector with the sensitivity-based
  #weight of each parameter
  par.optimal<-temp3
  
  #Save the weight set of the watershed in a row
  weight.Paramters.kge[i,]<-par.optimal
  
  
  #re-set the dummy variable
  par.optimal=0
  
}


rownames(weight.Paramters.kge)=rownames(df.scaled.kge)

weights.kge<-weight.Paramters.kge/rowSums(weight.Paramters.kge)



##Saving weights parameters#####################
################################################

#Setting work directory
setwd("C:/Users/rampincg/Desktop/Clarkson_Pc/cassio_PC/AGU_2019/Reports/Memo_01")

write.table(weights.kge,file="weight.Paramters.kge_MEMO_01.txt")

9. Weights Densities

################################
##Accessing the parameter weights densities
################################
##############################################################
###DENSITIES WITH GGPLOT2#####################################
##############################################################
library(reshape2)
library(ggplot2)
library(cowplot)

index=seq(from=1,to=nrow(weights.kge),by=1)

dataX1<-cbind(index,weights.kge[,1])
dataX1<-as.data.frame(dataX1)
cnames=c("index","Weight-X1")
colnames(dataX1)=cnames
dataX1<-reshape2::melt(dataX1, id.var='index')


dataX2<-cbind(index,weights.kge[,2])
dataX2<-as.data.frame(dataX2)
cnames=c("index","Weight-X2")
colnames(dataX2)=cnames
dataX2<-reshape2::melt(dataX2, id.var='index')


dataX3<-cbind(index,weights.kge[,3])
dataX3<-as.data.frame(dataX3)
cnames=c("index","Weight-X3")
colnames(dataX3)=cnames
dataX3<-reshape2::melt(dataX3, id.var='index')


dataX4<-cbind(index,weights.kge[,4])
dataX4<-as.data.frame(dataX4)
cnames=c("index","Weight-X4")
colnames(dataX4)=cnames
dataX4<-reshape2::melt(dataX4, id.var='index')


#Together with ggplot2
plot1<-ggplot(dataX1) + stat_density(aes(x=value, y=..scaled..,color=variable,linetype=variable),lwd=1,  geom="line",position = "identity")+
  ylab("Density")+
  xlab("Weight-X1")+
  xlim(0,1) + 
  theme_bw() +
  theme(legend.position="top")+
  theme(legend.title = element_blank())+
  theme(axis.title.x=element_text(size=14,face="bold"),axis.title.y=element_text(size=14))+
  theme(legend.text=element_text(size=14))+
  theme(axis.text.x = element_text(size = 16),axis.text.y = element_text(size = 16))


plot2<-ggplot(dataX2) + stat_density(aes(x=value, y=..scaled..,color=variable,linetype=variable),lwd=1,  geom="line",position = "identity")+
  ylab("Density")+
  xlab("Weight-X2")+
  xlim(0,1) +
  theme_bw() +
  theme(legend.position="top")+
  theme(legend.title = element_blank())+
  theme(axis.title.x=element_text(size=14,face="bold"),axis.title.y=element_text(size=14))+
  theme(legend.text=element_text(size=14))+
  theme(axis.text.x = element_text(size = 16),axis.text.y = element_text(size = 16))


plot3<-ggplot(dataX3) + stat_density(aes(x=value, y=..scaled..,color=variable,linetype=variable),lwd=1,  geom="line",position = "identity")+
  ylab("Density")+
  xlab("Weight-X3")+
  xlim(0,1) +
  theme_bw() +
  theme(legend.position="top")+
  theme(legend.title = element_blank())+
  theme(axis.title.x=element_text(size=14,face="bold"),axis.title.y=element_text(size=14))+
  theme(legend.text=element_text(size=14))+
  theme(axis.text.x = element_text(size = 16),axis.text.y = element_text(size = 16))


plot4<-ggplot(dataX4) + stat_density(aes(x=value, y=..scaled..,color=variable,linetype=variable),lwd=1,  geom="line",position = "identity")+
  ylab("Density")+
  xlab("Weight-X4")+
  xlim(0,1) +
  theme_bw() +
  theme(legend.position="top")+
  theme(legend.title = element_blank())+
  theme(axis.title.x=element_text(size=14,face="bold"),axis.title.y=element_text(size=14))+
  theme(legend.text=element_text(size=14))+
  theme(axis.text.x = element_text(size = 16),axis.text.y = element_text(size = 16))


plot_grid(plot1,plot2,plot3,plot4, labels = "AUTO")

10. Distance Matrix - Euclidean
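The weighting applied below multiplies each catchment's scaled parameters, element by element, by that catchment's normalized sensitivity weights before the Euclidean distance is computed. Written out (a reading of the code chunk that follows, with \(w_{i,k}\) denoting the normalized weight of parameter k for catchment i), the resulting dissimilarity is:

\[ d_{i,j}= \sqrt{\sum^p_{k=1}\left(w_{i,k}\,x_{i,k}-w_{j,k}\,x_{j,k}\right)^2} \]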

#Weighting the scaled parameters by the sensitivity weights
df.scaled.kge.weight<-0
df.scaled.kge.weight<-df.scaled.kge*weights.kge
#Euclidean Distance

dist.eucl.kge.weight<-dist(df.scaled.kge.weight,method="euclidean")


#Visualizing the distance matrices
fviz_dist(dist.eucl.kge.weight)

11. Hierarchical Clustering

  • Ward’s linkage
res.hc.ward<-hclust(d=dist.eucl.kge.weight,method="ward.D2")
fviz_dend(res.hc.ward,cex=0.2)

  • Average linkage
res.hc.average<-hclust(d=dist.eucl.kge.weight,method="average")
fviz_dend(res.hc.average,cex=0.2)

  • Verifying Cluster Tree - Ward's vs. Average linkage

    • Ward’s linkage
#Computing cophenetic distance
res.coph.ward<-cophenetic(res.hc.ward)
#Correlation between cophenetic distance and the original distance
cor(dist.eucl.kge.weight,res.coph.ward)
## [1] 0.6340643
  • Average linkage
#Computing cophenetic distance
res.coph.average<-cophenetic(res.hc.average)
#Correlation between cophenetic distance and the original distance
cor(dist.eucl.kge.weight,res.coph.average)
## [1] 0.8696955