Cluster Analysis in R

Getting Data
mydata <- read.csv("D:/MARV BS MATH/4th year, 2nd sem/Multivariate/Discriminant data.txt", header=T)
str(mydata)
'data.frame':   22 obs. of  9 variables:
 $ Company     : chr  "Arizona " "Boston " "Central " "Commonwealth" ...
 $ Fixed_charge: num  1.06 0.89 1.43 1.02 1.49 1.32 1.22 1.1 1.34 1.12 ...
 $ RoR         : num  9.2 10.3 15.4 11.2 8.8 13.5 12.2 9.2 13 12.4 ...
 $ Cost        : int  151 202 113 168 192 111 175 245 168 197 ...
 $ Load        : num  54.4 57.9 53 56 51.2 60 67.6 57 60.4 53 ...
 $ D.Demand    : num  1.6 2.2 3.4 0.3 1 -2.2 2.2 3.3 7.2 2.7 ...
 $ Sales       : int  9077 5088 9212 6423 3300 11127 7642 13082 8406 6455 ...
 $ Nuclear     : num  0 25.3 0 34.3 15.6 22.5 0 0 0 39.2 ...
 $ Fuel_Cost   : num  0.628 1.555 1.058 0.7 2.044 ...
library(rmarkdown)   # paged_table() comes from the rmarkdown package
a <- head(mydata)
paged_table(a)
pairs(mydata[2:9])

A pairs plot provides a scatter plot for every pairwise combination of the variables.

Scatter Plot

If you want to look at one scatter plot on its own:

plot(Fuel_Cost ~ Sales, data = mydata)
with(mydata, text(Fuel_Cost ~ Sales, labels = Company, pos = 4))

Normalize

Normalization is essential for distance-based cluster analysis: variables on large scales (such as Sales) would otherwise dominate the distance calculations.

z <- mydata[,-1]   # drop the non-numeric Company column
paged_table(z)
means <- apply(z,2,mean)
means
Fixed_charge          RoR         Cost         Load     D.Demand        Sales 
    1.114091    10.736364   168.181818    56.977273     3.240909  8914.045455 
     Nuclear    Fuel_Cost 
   12.000000     1.102727 
sds <- apply(z,2,sd)
sds
Fixed_charge          RoR         Cost         Load     D.Demand        Sales 
   0.1845112    2.2440494   41.1913495    4.4611478    3.1182503 3549.9840305 
     Nuclear    Fuel_Cost 
  16.7919198    0.5560981 
nor <- scale(z,center=means,scale=sds)
nor
      Fixed_charge         RoR         Cost        Load    D.Demand       Sales
 [1,]  -0.29315791 -0.68463896 -0.417122002 -0.57771516 -0.52622751  0.04590290
 [2,]  -1.21451134 -0.19445367  0.821002037  0.20683629 -0.33381191 -1.07776413
 [3,]   1.71214073  2.07822360 -1.339645796 -0.89153574  0.05101929  0.08393124
 [4,]  -0.50994695  0.20660702 -0.004413989 -0.21906307 -0.94312798 -0.70170610
 [5,]   2.03732429 -0.86288816  0.578232617 -1.29501935 -0.71864311 -1.58142837
 [6,]   1.11597086  1.23153991 -1.388199680  0.67756716 -1.74485965  0.62337028
 [7,]   0.57399826  0.65223002  0.165524604  2.38116460 -0.33381191 -0.35832428
 [8,]  -0.07636887 -0.68463896  1.864910540  0.00509449  0.01895002  1.17407698
 [9,]   1.22436538  1.00872841 -0.004413989  0.76723019  1.26965142 -0.14311204
[10,]   0.03202565  0.74135462  0.699617327 -0.89153574 -0.17346558 -0.69269198
[11,]  -1.97327298 -1.44219805  0.116970720 -1.22777208  1.04516655  2.40196983
[12,]   0.08622291  0.07292013  0.238355430  1.12588228  0.14722709 -0.77748109
[13,]   0.19461744  0.87504152  0.748171211 -0.73462545  1.01309729 -0.48874740
[14,]  -0.13056613  0.56310542 -1.752353809 -1.60883993 -0.59036605  0.21379097
[15,]  -0.83513051 -1.39763576 -0.101521757  1.17071379 -1.07140505 -0.68902999
[16,]   0.24881470 -0.37270287  2.034849134 -0.21906307  1.91103676  1.99351729
[17,]  -1.91907572 -1.93238335 -0.781276132  1.10346652  1.84689822 -0.90142531
[18,]  -0.34735517  0.83047922 -0.441398944 -0.06215278 -0.17346558  0.34534086
[19,]   0.24881470  0.42941852 -1.558138274 -0.66737818 -1.71279038  1.29379583
[20,]   0.46560374  0.47398082 -0.489952828  0.65515141  0.08308855 -0.45832473
[21,]  -0.40155243 -0.95201276  0.869555920  0.90172472  0.08308855 -0.63776215
[22,]  -0.23896065 -0.64007666  0.141247662 -0.60013092  0.85275095  0.33210137
         Nuclear   Fuel_Cost
 [1,] -0.7146294 -0.85367545
 [2,]  0.7920476  0.81329670
 [3,] -0.7146294 -0.08043055
 [4,]  1.3280197 -0.72420189
 [5,]  0.2143888  1.69263800
 [6,]  0.6253007  0.24864810
 [7,] -0.7146294  0.98772637
 [8,] -0.7146294 -1.42731528
 [9,] -0.7146294 -0.43288637
[10,]  1.6198267 -0.86266667
[11,] -0.7146294 -0.60192130
[12,] -0.7146294  1.42829614
[13,]  2.2749037 -1.03529809
[14,] -0.7146294 -0.92560521
[15,] -0.6610322  0.53456889
[16,] -0.7146294 -0.86806140
[17,] -0.2203441  1.46965575
[18,] -0.7146294  0.00948165
[19,] -0.7146294 -0.83928950
[20,]  1.7329764 -0.72060540
[21,] -0.7146294  1.82211157
[22,]  0.8694658  0.36553395
attr(,"scaled:center")
Fixed_charge          RoR         Cost         Load     D.Demand        Sales 
    1.114091    10.736364   168.181818    56.977273     3.240909  8914.045455 
     Nuclear    Fuel_Cost 
   12.000000     1.102727 
attr(,"scaled:scale")
Fixed_charge          RoR         Cost         Load     D.Demand        Sales 
   0.1845112    2.2440494   41.1913495    4.4611478    3.1182503 3549.9840305 
     Nuclear    Fuel_Cost 
  16.7919198    0.5560981 
Calculate Distance Matrix
distance = dist(nor)   # Euclidean distance by default
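
dist also supports other metrics through its method argument; for example (not used further below):

dist(nor, method = "manhattan")   # city-block distance instead of Euclidean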
Hierarchical Agglomerative Clustering
mydata.hclust = hclust(distance)   # complete linkage by default
plot(mydata.hclust)

plot(mydata.hclust,labels=mydata$Company,main='Default from hclust')

plot(mydata.hclust, hang=-1, labels=mydata$Company, main='Labelled dendrogram with hang = -1')

Hierarchical Agglomerative Clustering Using “average” Linkage
mydata.hclust<-hclust(distance,method="average") 
plot(mydata.hclust,hang=-1) 

Cluster Membership
member = cutree(mydata.hclust,3)
table(member)
member
 1  2  3 
18  1  3 
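
To see which companies fall into each cluster, we can split the company names by the member vector from cutree:

split(as.character(mydata$Company), member)   # company names grouped by cluster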
Characterizing Clusters
b <- aggregate(nor, list(member), mean)
paged_table(b)
cluster_means <- aggregate(mydata[,-1], list(member), mean)   # avoid naming an object c(), which masks base::c
paged_table(cluster_means)
Silhouette Plot
library(cluster)
plot(silhouette(cutree(mydata.hclust,3), distance))

In the plot above, any bar that extends to the negative side marks an observation that fits its assigned cluster poorly; such observations can be treated as potential outliers and removed from the analysis.
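
As a small sketch of that clean-up step (using the member and distance objects created above):

sil <- silhouette(member, distance)
outliers <- which(sil[, "sil_width"] < 0)   # observations with negative silhouette width
mydata$Company[outliers]                    # inspect the flagged companies
# mydata_clean <- mydata[-outliers, ]       # optionally drop them and re-cluster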

Scree Plot

A scree plot lets us see the variability within clusters: as we increase the number of clusters, the within-group sum of squares comes down.

set.seed(123)   # kmeans uses random starts, so fix the seed for reproducibility
wss <- (nrow(nor)-1)*sum(apply(nor,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(nor, centers=i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")

Based on the elbow in this plot, a reasonable number of clusters for these data is 3, 4, or 5.

K-means Clustering
set.seed(123)
kc<-kmeans(nor,3)
kc
K-means clustering with 3 clusters of sizes 7, 5, 10

Cluster means:
  Fixed_charge         RoR       Cost       Load    D.Demand      Sales
1  -0.23896065 -0.65917479  0.2556961  0.7992527 -0.05435116 -0.8604593
2   0.51980100  1.02655333 -1.2959473 -0.5104679 -0.83409247  0.5120458
3  -0.09262805 -0.05185431  0.4689864 -0.3042429  0.45509205  0.3462986
     Nuclear  Fuel_Cost
1 -0.2884040  1.2497562
2 -0.4466434 -0.3174391
3  0.4252045 -0.7161098

Clustering vector:
 [1] 3 1 2 3 1 2 1 3 3 3 3 1 3 2 1 3 1 2 2 3 1 3

Within cluster sum of squares by cluster:
[1] 34.16481 15.15613 57.53424
 (between_SS / total_SS =  36.4 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
ot <- nor
datadistshortset <- dist(ot, method = "euclidean")
hc1 <- hclust(datadistshortset, method = "complete")
pamvshortset <- pam(datadistshortset, 4, diss = TRUE)   # the input is a dissimilarity, so diss must be TRUE
clusplot(pamvshortset, shade = FALSE, labels = 2, col.clus = "blue",
         col.p = "red", span = FALSE, main = "Cluster Mapping", cex = 1.2)

K-means with Multiple Starts
k2 <- kmeans(nor, centers = 3, nstart = 25)

We can run k-means in R with the kmeans function.

kmeans has an nstart option that runs the algorithm from multiple random initial configurations and keeps the best result (the one with the lowest total within-cluster sum of squares).

For example, nstart = 25 tries 25 initial configurations. Because k-means is sensitive to its starting centroids, this approach is often recommended.
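
A small sketch of why multiple starts help: compare the objective from a single random start with the best of 25 (exact values depend on the seed):

set.seed(1)
single <- kmeans(nor, centers = 3, nstart = 1)$tot.withinss
multi  <- kmeans(nor, centers = 3, nstart = 25)$tot.withinss
c(single_start = single, best_of_25 = multi)   # the multi-start fit is typically at least as good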

str(k2)
List of 9
 $ cluster     : int [1:22] 3 2 3 3 2 3 2 1 3 3 ...
 $ centers     : num [1:3, 1:8] -0.6 -0.239 0.289 -0.833 -0.659 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "1" "2" "3"
  .. ..$ : chr [1:8] "Fixed_charge" "RoR" "Cost" "Load" ...
 $ totss       : num 168
 $ withinss    : num [1:3] 9.53 34.16 58.01
 $ tot.withinss: num 102
 $ betweenss   : num 66.3
 $ size        : int [1:3] 3 7 12
 $ iter        : int 2
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"

We can also view the kmeans results with fviz_cluster from the factoextra package, which provides a nice visualization of the clusters.

library(factoextra)   # provides fviz_cluster() and fviz_nbclust()
fviz_cluster(k2, data = nor)

If there are more than two dimensions (variables), fviz_cluster performs principal component analysis (PCA) and plots the data points according to the first two principal components, which explain the majority of the variance.
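
To see what that amounts to, here is a minimal base-R sketch of the same idea using prcomp (assuming the nor and k2 objects from above):

pc <- prcomp(nor)   # PCA on the standardized data
plot(pc$x[, 1:2], col = k2$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Clusters on the first two PCs")
text(pc$x[, 1:2], labels = mydata$Company, pos = 4, cex = 0.7)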

Optimal Clusters

We can find the optimal number of clusters in R with the following code. The results suggest that 4 is the optimal number of clusters, as that appears to be the elbow (the bend in the curve).

This is the same elbow analysis we coded by hand above.

fviz_nbclust(nor, kmeans, method = "wss")

Average Silhouette Method

The average silhouette approach measures the quality of a clustering. It determines how well each observation lies within its cluster.

A high average silhouette width indicates a good clustering. The average silhouette method computes the average silhouette of observations for different values of k.

We can compute it with the following code:

fviz_nbclust(nor, kmeans, method = "silhouette")
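
For comparison with the manual wss loop earlier, the same average-silhouette curve can be computed by hand (a sketch assuming the nor and distance objects from above, with the cluster package loaded):

set.seed(123)
avg_sil <- sapply(2:10, function(k) {
  km <- kmeans(nor, centers = k, nstart = 25)
  mean(silhouette(km$cluster, distance)[, "sil_width"])
})
plot(2:10, avg_sil, type = "b",
     xlab = "Number of Clusters", ylab = "Average silhouette width")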

Gap Statistic Method

This approach can be used with any clustering method (e.g., k-means clustering, hierarchical clustering).

The gap statistic compares the total intra-cluster variation for different values of k with the values expected under a null reference distribution of the data.

We can compute it with the following code:

gap_stat <- clusGap(nor, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50)
fviz_gap_stat(gap_stat)
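
The clusGap object can also be inspected directly; printing it reports the gap value for each k along with the k selected by the chosen rule (a sketch; the method names follow the cluster package's maxSE rules):

print(gap_stat, method = "Tibs2001SEmax")   # gap table plus the suggested number of clusters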

By this method, the optimal number of clusters is 5.

Conclusion

K-means clustering is a simple and fast algorithm that can efficiently handle very large data sets.

K-means clustering requires the number of clusters as an input. Hierarchical clustering is an alternative approach that does not require us to commit to a particular number of clusters in advance.

Hierarchical clustering also has an added advantage over k-means clustering: it produces an attractive tree-based representation of the observations (the dendrogram).