#setting seed so that results are reproducible
set.seed(123)
#loading the USArrests dataset into the data variable
data <- USArrests
#before scaling
#computing the Euclidean distance matrix to feed into hclust
d <- dist(data, method = "euclidean")
hfit <- hclust(d, method = "complete")
#drawing the cluster dendrogram
plot(hfit)
#cutting the hierarchical tree into k = 3 clusters
grps <- cutree(hfit, k = 3)
rect.hclust(hfit, k = 3, border = "green")
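
#a quick check: table() counts how many states fall in each cluster
table(grps)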

#using cbind to see which state falls in which cluster
c2 <- cbind(grps)
library(cluster)
clusplot(data, grps, main = "2D representation of the clusters", shade = TRUE, labels = 2, lines = 0)
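
#listing the state names in each cluster with split(); grps_unscaled (a helper name used
#only here) keeps a copy of these assignments for the before/after comparison further down
split(rownames(data), grps)
grps_unscaled <- grps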

#Cluster 1 is the biggest; in the 2D projection it overlaps both cluster 2 and cluster 3

#after scaling
#scaling the dataset so every variable has mean 0 and SD 1
df <- scale(data)
#computing the Euclidean distance matrix on the scaled data
d <- dist(df, method = "euclidean")
#hclust's "ward" method was renamed "ward.D" in R >= 3.1.0; the newer "ward.D2" additionally squares the dissimilarities
hfit <- hclust(d, method = "ward.D")
#drawing the cluster dendrogram
plot(hfit)
#cutting the hierarchical tree into k = 3 clusters
grps <- cutree(hfit, k = 3)
rect.hclust(hfit, k = 3, border = "green")
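
#cross-tabulating the saved unscaled assignments (grps_unscaled from above) against the
#scaled ones shows how many states changed cluster
table(unscaled = grps_unscaled, scaled = grps)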

#Scaling keeps the variables balanced: no single variable overpowers the others just because it has larger values.
#In this dataset, cluster 1 contains 16 states before scaling and 17 after;
#Georgia and Texas move to the 1st cluster, and Delaware, Alaska, and Illinois to the 3rd cluster.
#Yes, the data should be scaled first. Basic geometry tells us to scale the variables before
#computing Euclidean distances.
#If sales values are in thousands and the other factors are in much smaller units, the sales variable would
#dominate the other variables in the clustering algorithm unless we scale the variables first.
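
#a small made-up example of that last point: sales in thousands next to a rating on a 1-10 scale
toy <- data.frame(sales = c(1000, 2000, 1050), rating = c(1, 1.2, 9))
dist(toy)        #raw distances are driven almost entirely by sales
dist(scale(toy)) #after scaling, both variables contribute to the distances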