I would like to cluster funds by their 1-year, 2-year and 3-year returns and, for exploratory purposes, determine how many clusters the data supports.
setwd("D:/DATACAMP/interim/FUNDS_YEAR_2020")
CART_data <- read.csv("data.csv")
myvars <- names(CART_data) %in% c("Price", "Date", "X5.YR", "X10.YR", "X")
newdata <- CART_data[!myvars]
#remove NA
newdata <- newdata[complete.cases(newdata),]
#remove fund name
myvars <- names(newdata) %in% c("Fund.Name")
newdata <- newdata[!myvars]
#keep only the 1-, 2- and 3-year return columns, which are the ones to cluster
myvars <- names(newdata) %in% c("X1.YR", "X2.YR", "X3.YR")
newdata <- newdata[myvars]
#create a dataset of returns
returnsdata <- newdata
head(returnsdata)
## X2.YR X1.YR X3.YR
## 1 12.49 7.96 3.54
## 3 9.41 0.13 9.04
## 4 2.77 -3.33 3.89
## 5 18.83 8.47 13.81
## 6 7.04 4.65 3.43
## 7 8.81 3.98 6.01
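The preparation steps above can be tried on a small made-up data frame before running them on the real file. This is only a sketch: `funds`, `return_cols` and `toy_returns` are hypothetical names, the values are invented, and the column selection is done in one step rather than three. It gives the same rows as the original sequence provided `Fund.Name` has no missing values.

```r
# Hypothetical stand-in for data.csv; all values are made up for illustration
funds <- data.frame(
  Fund.Name = c("A", "B", "C", "D"),
  Price     = c(1.2, 0.9, 1.5, 2.1),
  X1.YR     = c(7.96, NA, 0.13, -3.33),
  X2.YR     = c(12.49, 5.10, 9.41, 2.77),
  X3.YR     = c(3.54, 4.00, 9.04, 3.89),
  X5.YR     = c(NA, 6.2, 7.1, 5.5)
)

# Keep only the three return columns used for clustering
return_cols <- c("X1.YR", "X2.YR", "X3.YR")
toy_returns <- funds[, return_cols]

# Drop rows with a missing value in any of the remaining columns
toy_returns <- toy_returns[complete.cases(toy_returns), ]
toy_returns
```

Fund B is dropped because its 1-year return is missing, while fund A is kept even though its 5-year return is missing, since that column is removed before the NA check. The surviving row names skip the dropped row, which is why the `head()` output above jumps from 1 to 3.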
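Let’s use the NbClust package to quickly estimate the number of clusters. It computes a range of validity indices and reports the number of clusters that most of them favour.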
library("NbClust")
set.seed(123)
res.nbclust <- NbClust(returnsdata, distance = "euclidean", min.nc = 2, max.nc = 10, method = "complete", index ="all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 6 proposed 2 as the best number of clusters
## * 8 proposed 3 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
## * 6 proposed 6 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
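The majority rule in the summary above is a plain tally of the votes cast by the individual indices. The sketch below reproduces it from the counts printed above; `votes`, `tally` and `best_k` are hypothetical names. In a live session the per-index proposals should be available in `res.nbclust$Best.nc` (I am assuming the usual NbClust return structure here, so check `str(res.nbclust)` first).

```r
# Votes transcribed from the NbClust summary printed above
votes <- c(rep(2, 6), rep(3, 8), rep(5, 3), rep(6, 6), 9)

# Count how many indices proposed each number of clusters
tally <- table(votes)
tally

# The majority rule picks the most frequently proposed value
best_k <- as.integer(names(tally)[which.max(tally)])
best_k
```

The margin is narrow: 3 clusters win with 8 votes, but 2 and 6 clusters each receive 6, so those are worth comparing as well in an exploratory analysis.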
library("factoextra")
## Loading required package: ggplot2
#visualize
factoextra::fviz_nbclust(res.nbclust) + theme_minimal()
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 6 proposed 2 as the best number of clusters
## * 8 proposed 3 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
## * 6 proposed 6 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 3 .
#PAM (Partitioning Around Medoids) clustering: a robust alternative to k-means that is less sensitive to outliers.
# Compute PAM with k = 3, the number of clusters suggested by the majority rule
library(cluster)
pam.res <- pam(returnsdata, k = 3)
# Visualize
fviz_cluster(pam.res)
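Besides the plot, the fitted object can be inspected directly. The sketch below uses made-up data rather than the fund returns, and `toy` and `toy.pam` are hypothetical names. It shows the property that sets PAM apart from k-means: each cluster centre (medoid) is an actual observation, not an average.

```r
library(cluster)  # provides pam()

set.seed(123)
# Hypothetical returns: three loose groups plus one extreme observation
toy <- data.frame(
  X1.YR = c(rnorm(10, 0, 1), rnorm(10, 8, 1), rnorm(10, 15, 1), 60),
  X2.YR = c(rnorm(10, 2, 1), rnorm(10, 9, 1), rnorm(10, 18, 1), 70),
  X3.YR = c(rnorm(10, 3, 1), rnorm(10, 6, 1), rnorm(10, 12, 1), 80)
)

toy.pam <- pam(toy, k = 3)

# Medoids are rows of the input data; id.med gives their row indices
toy.pam$medoids
toy.pam$id.med

# Number of observations assigned to each cluster
table(toy.pam$clustering)

# Average silhouette width, a rough measure of how well separated the clusters are
toy.pam$silinfo$avg.width
```

The same components (`medoids`, `id.med`, `clustering`, `silinfo`) can be read from `pam.res` to see which funds act as cluster representatives and how many funds fall into each group.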