Hierarchical Clustering

library(cluster)

Read data

olive <- read.csv("~/Desktop/碩一下/多變量/olive.csv", header = TRUE)
head(olive)

##   Region Area Palmitic Palmitoleic Stearic Oleic Linoleic Linolenic
## 1      1    1     1075          75     226  7823      672        36
## 2      1    1     1088          73     224  7709      781        31
## 3      1    1      911          54     246  8113      549        31
## 4      1    1      966          57     240  7952      619        50
## 5      1    1     1051          67     259  7771      672        50
## 6      1    1      911          49     268  7924      678        51
##   Arachidic Eicosenoic
## 1        60         29
## 2        61         29
## 3        63         29
## 4        78         35
## 5        80         46
## 6        70         44

newolive <- olive[, 3:10]            # keep the eight fatty-acid columns
x <- daisy(newolive, stand = TRUE)   # dissimilarities on standardized variables
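
For reference, `stand = TRUE` in `daisy()` centers each column on its mean and divides by its mean absolute deviation (not the standard deviation); with all-numeric columns the default metric is Euclidean. The sketch below reproduces this by hand on simulated data. The `toy` matrix is a made-up stand-in for `newolive`, not part of the original analysis.

```r
library(cluster)

set.seed(42)
# Hypothetical stand-in for newolive: 20 rows, 3 numeric columns on different scales
toy <- cbind(a = rnorm(20, 1000, 100), b = rnorm(20, 50, 10), c = rnorm(20, 5, 1))

# Standardize by hand: subtract the column mean, divide by the mean absolute deviation
centered <- sweep(toy, 2, colMeans(toy))
mad_mean <- colMeans(abs(centered))
z <- sweep(centered, 2, mad_mean, "/")

d_daisy  <- daisy(toy, stand = TRUE)  # default metric is "euclidean" for numeric data
d_manual <- dist(z)                   # Euclidean distances on the hand-standardized data

all.equal(as.vector(d_daisy), as.vector(d_manual))
```

If the two computations agree, `all.equal()` returns `TRUE`. Standardizing matters here because Oleic is measured in the thousands while Eicosenoic is in the tens, so unscaled distances would be dominated by a few columns.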

Single

# x is already a dissimilarity object, so agnes() ignores the metric argument
agn <- agnes(x, metric="euclidean", method="single")

# plot() on an agnes object draws the dendrogram (which.plots = 2) and the banner plot (which.plots = 1):


plot(agn,which.plots=2)

# Check for outliers (e.g. observations 522 and 79); outliers affect where the tree should be cut.

plot(agn,which.plots=1)

# The banner plot is hard to read here because there are more than 500 objects
# (it is essentially a horizontal version of the dendrogram).

# The output still reports the AC (agglomerative coefficient), which is 0.73 here
# and suggests a fairly strong clustering structure.
# The AC can also be read directly from the fitted object:

agn$ac

## [1] 0.7346398
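
As a reminder of what this number measures: for each observation, take the height at which it first joins a cluster, divide by the height of the final merge, and average one minus that ratio over all observations. The sketch below recomputes the coefficient from the merge history on simulated data and compares it with the `ac` component. The names `toy` and `agn_toy` are illustrative and not part of the original analysis.

```r
library(cluster)

set.seed(1)
toy <- matrix(rnorm(60), ncol = 3)   # hypothetical data: 20 observations
agn_toy <- agnes(daisy(toy, stand = TRUE), method = "single")

hc <- as.hclust(agn_toy)             # merge matrix plus sorted merge heights
# Negative entries in a merge row are single observations joining at that step
singles_per_merge <- rowSums(hc$merge < 0)
ac_manual <- 1 - sum(singles_per_merge * hc$height) / (max(hc$height) * nrow(toy))

c(manual = ac_manual, agnes = agn_toy$ac)
```

Values near 1 mean most observations merge at heights far below the final merge; values near 0 mean the merges are spread out up to the top of the tree.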

# An AC of about 0.73 indicates a fairly good clustering structure.
# Check whether the resulting grouping agrees with the original “Regions”:
  
olive[,1][agn$order]  # labels follow the dendrogram from left to right: the leftmost group is all 1, but a few outlying 1s appear among the 3s

##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [246] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [281] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [316] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [351] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [386] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3
## [421] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [456] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [491] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [526] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [561] 3 3 3 3 3 3 3 3 1 1 1 3

# I would say “yes”, except for the three region-1 observations in the last line.
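
A long label vector like the one above is easier to judge as runs of equal values. The sketch below rebuilds the printed region labels from their run lengths (counted by hand from the output above, so treat the counts as a transcription) and applies `rle()`. On the real data the equivalent call would be `rle(olive[, 1][agn$order])`.

```r
# Region labels in dendrogram order, transcribed from the printed output:
# 320 ones, 98 twos, 150 threes, then three 1s and a final 3
region_in_order <- rep(c(1, 2, 3, 1, 3), times = c(320, 98, 150, 3, 1))

# Collapse the vector into runs of identical labels
runs <- rle(region_in_order)
data.frame(region = runs$values, run_length = runs$lengths)
```

Five runs for three regions makes the verdict concrete: each region forms one long block, and the only exceptions are the short runs at the right end of the tree.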
# Check whether the resulting grouping agrees with the original “Areas”:

olive[,2][agn$order]  # again, labels are listed from left to right

##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 1 1 1 4 4 1 1 1 1 1 1 2 2 2 3 3 3 3 3
##  [36] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [71] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 3 3 3 2 2 2 2 2 2 4 2
## [176] 2 2 2 2 2 2 2 4 2 2 2 2 2 3 3 3 2 2 2 3 3 3 3 4 4 4 4 3 3 3 3 2 3 3 2
## [211] 2 2 2 2 2 2 3 3 4 3 3 3 3 3 3 2 2 2 3 3 2 3 3 3 3 3 2 3 4 3 3 3 3 3 4
## [246] 4 2 2 2 3 3 4 4 4 3 3 2 2 3 3 4 4 3 3 3 3 4 4 4 4 3 3 3 3 4 3 4 4 3 4
## [281] 4 4 4 3 3 3 2 2 4 4 3 3 3 1 2 3 3 3 3 3 3 1 4 3 2 1 3 4 3 3 2 2 2 2 3
## [316] 3 3 3 3 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## [351] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 5 5 5 5 6
## [386] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 5 6 5 6 5 9 9
## [421] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [456] 9 9 9 9 9 9 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 9 9 9 7 7 7 7 7 7 7
## [491] 8 8 8 8 8 8 8 8 8 8 8 8 8 7 8 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [526] 8 8 8 8 7 8 7 7 8 8 8 8 7 7 8 7 7 8 9 9 9 9 9 7 7 7 8 7 7 7 7 7 8 8 8
## [561] 8 8 8 8 8 8 7 7 3 3 2 7

# I would say “no” here.
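
Reading agreement off the ordered labels only shows whether like labels sit next to each other. A more direct check is to cut the tree into a fixed number of clusters and cross-tabulate them against the known labels. This is a minimal sketch on simulated data, assuming three well-separated groups; `toy`, `truth`, and `agn_toy` are hypothetical names. For the olive data one would pass `agn` and compare with `olive[, 1]` or `olive[, 2]`.

```r
library(cluster)

set.seed(7)
# Hypothetical data: three groups of 30 points with well-separated means
truth <- rep(1:3, each = 30)
toy <- matrix(rnorm(90 * 2, mean = rep(c(0, 10, 20), each = 30)), ncol = 2)

agn_toy <- agnes(daisy(toy, stand = TRUE), method = "complete")

# Cut the tree into three clusters and compare with the known labels
cl <- cutree(as.hclust(agn_toy), k = 3)
tab <- table(cluster = cl, truth = truth)
tab
```

A table with a single non-zero cell per row and column means the clusters match the labels. With single linkage and outliers, a small `k` may instead peel off isolated observations as their own clusters, which is one way outliers affect the choice of cut point.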

Q: How about using other linkages?

Complete

agn<-agnes(x,metric="euclidean",method="complete") 
plot(agn,which.plots=2)

# This gives a stronger clustering structure, with AC = 0.93.
# The AC is higher than with single linkage, so complete linkage looks like the better choice here.

Ward

agn<-agnes(x,metric="euclidean",method="ward")
plot(agn,which.plots=2)

# This gives an even larger AC of 0.99.
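
To compare linkages side by side, the coefficients can be collected in one pass. The sketch below does this on simulated data; `toy` is a made-up stand-in for the standardized olive dissimilarities `x`. The `cluster` documentation cautions that the AC grows with the number of observations, so it is better suited to comparing methods on one dataset than to comparing datasets of very different sizes.

```r
library(cluster)

set.seed(3)
toy <- daisy(matrix(rnorm(200), ncol = 4), stand = TRUE)  # hypothetical data

methods <- c("single", "complete", "ward")
# Fit agnes once per linkage and keep only the agglomerative coefficient
ac_by_method <- sapply(methods, function(m) agnes(toy, method = m)$ac)
round(ac_by_method, 3)
```

On the olive data, replacing `toy` with `x` should give the three coefficients reported above in a single named vector.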

Diana – divisive method (splitting method)

di<-diana(x,metric="euclidean")
plot(di, which.plots=2)

plot(di, which.plots=1)

di$dc

## [1] 0.924267

Note that DC = 0.924267 (the divisive coefficient) indicates a fairly strong clustering structure. Check whether the resulting grouping agrees with the original “Regions”:

olive[,1][di$order]

##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [36] 3 3 3 3 3 3 3 3 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [71] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [211] 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [246] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [281] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [316] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [351] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [386] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [421] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [456] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
## [491] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [526] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [561] 2 2 2 2 2 2 2 2 2 2 2 2

Clustering

Lin Chian Hung

2018/5/17
