Hierarchical clustering repeatedly splits or merges the data within a hierarchical framework to produce a final tree structure. Two approaches are common: agglomerative (bottom-up) and divisive (top-down).
Agglomerative: start at the bottom of the tree and merge data points or clusters step by step. Initially every observation is treated as its own cluster, so n observations give n clusters; the two closest clusters are then merged repeatedly until all observations belong to one cluster, which yields the agglomerative tree.
Divisive: start at the top of the tree and split clusters step by step. Initially all observations are treated as a single cluster, which is then split repeatedly until every observation forms its own cluster, which yields the divisive tree.
Because the agglomerative approach is easier to implement, it is the one introduced below. Merging requires a definition of the distance between two clusters; four commonly used definitions follow, with a short computational sketch after the formulas.
Single linkage (minimum distance between members):
\[ d(C_i, C_j) = \min_{a \in C_i, b \in C_j}{d(a,b)} \]
Complete linkage (maximum distance between members):
\[ d(C_i, C_j) = \max_{a \in C_i, b \in C_j}{d(a,b)} \]
Average linkage (mean of all pairwise distances):
\[ d(C_i, C_j) = \sum_{a \in C_i, b \in C_j}{\frac{d(a,b)}{|C_i||C_j|}} \]
Centroid linkage (distance between the cluster centroids, where \( \mu_i = \frac{1}{|C_i|}\sum_{a \in C_i} a \)):
\[ d(C_i, C_j) = \lVert \mu_i - \mu_j \rVert \]
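As a concrete illustration (not part of the original walkthrough), the snippet below computes all four cluster distances between two tiny hand-made 2-D clusters using only base R; the names clusterA, clusterB and pairDist are invented for this sketch.
# two tiny hand-made clusters, three 2-D points each
clusterA <- matrix(c(0, 0, 1, 0, 0, 1), ncol = 2, byrow = TRUE)
clusterB <- matrix(c(5, 5, 6, 5, 5, 6), ncol = 2, byrow = TRUE)
# all pairwise Euclidean distances between points of the two clusters
pairDist <- as.matrix(dist(rbind(clusterA, clusterB)))[1:3, 4:6]
min(pairDist)    # single linkage: minimum pairwise distance
max(pairDist)    # complete linkage: maximum pairwise distance
mean(pairDist)   # average linkage: mean pairwise distance
sqrt(sum((colMeans(clusterA) - colMeans(clusterB))^2))   # centroid linkage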
packageName <- c("stats")
for(i in 1:length(packageName)) {
if(!(packageName[i] %in% rownames(installed.packages()))) {
install.packages(packageName[i])
}
}
lapply(packageName, require, character.only = TRUE)
## [[1]]
## [1] TRUE
# generate 30 observations in 5 dimensions (with a few NAs) as demo data
set.seed(123)
getOriData <- matrix(c(
  dim1 = rnorm(30, mean = 100, sd = 5),
  dim2 = cos(rnorm(30, mean = 200, sd = 100)),
  dim3 = sin(rnorm(30, mean = 500, sd = 300)),
  dim4 = sample(c(-100:100, NA), size = 30, replace = TRUE),
  dim5 = sample(c(500:1000, NA), size = 30, replace = TRUE)
), ncol = 5)
# set the column and row names of the matrix
colnames(getOriData) <- c("dim1","dim2","dim3","dim4","dim5")
rowNameList <- c()
for(i in 1:nrow(getOriData)) {
  rowNameList <- c(rowNameList, paste("data", toString(i), sep = ""))
}
rownames(getOriData) <- rowNameList
# get rid of NAs
getOriData[is.na(getOriData)] <- 0
head(getOriData, 5)
## dim1 dim2 dim3 dim4 dim5
## data1 97.19762 -0.7359092 -0.9583813 69 741
## data2 98.84911 0.6623476 -0.5531824 -37 626
## data3 107.79354 0.8841887 -0.8701556 43 608
## data4 100.35254 0.3500467 -0.3441745 -47 838
## data5 100.64644 0.8335847 0.5713495 20 523
# the prototype of hclust
# d: a distance matrix produced by dist, or a matrix converted with as.dist
# method: the agglomeration method, e.g. average, complete, single, median, ...
hclust(d, method = "complete")
# the prototype of dist
# method: the distance measure, e.g. "euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski"
dist(x, method = "euclidean", ...)
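As a quick illustration of the method argument, two of these measures are compared on the same small matrix; the object demoMat is invented for this example.
# compare two distance measures on the same 3 x 2 matrix
demoMat <- matrix(c(0, 0, 3, 4, 6, 8), ncol = 2, byrow = TRUE)
dist(demoMat, method = "euclidean")   # straight-line distance
dist(demoMat, method = "manhattan")   # sum of absolute coordinate differences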
Below, the Euclidean distance is used as the distance between two observations; correlation can also be computed between observations and turned into a distance (the code uses 1 - correlation). The hclust function with the average or complete method then produces the hierarchical clustering results.
# distance matrix based on correlation (1 - correlation between observations)
getOriData.dist.cor <- as.dist(1 - cor(t(getOriData)))
# distance matrix based on Euclidean distance
getOriData.dist.euc <- dist(getOriData)
hc_ave <- hclust(getOriData.dist.euc, "average")
hc_cmp <- hclust(getOriData.dist.euc, "complete")
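# note: the correlation-based distance above is not used further in this
# walkthrough; it can be clustered in exactly the same way (hc_cor is a new
# name introduced only for this sketch)
hc_cor <- hclust(getOriData.dist.cor, "average")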
summary(hc_ave)
## Length Class Mode
## merge 58 -none- numeric
## height 29 -none- numeric
## order 30 -none- numeric
## labels 30 -none- character
## method 1 -none- character
## call 3 -none- call
## dist.method 1 -none- character
# draw the dendrograms
par(mar = c(3, 1, 1, 3))
plot(
  as.dendrogram(hc_ave), labels = NULL,
  axes = TRUE, frame.plot = FALSE, ann = TRUE,
  main = "Cluster Dendrogram - average",
  sub = NULL, cex = .6, horiz = TRUE
)
par(mar = c(3, 1, 1, 3))
plot(
  as.dendrogram(hc_cmp), labels = NULL,
  axes = TRUE, frame.plot = FALSE, ann = TRUE,
  main = "Cluster Dendrogram - complete",
  sub = NULL, cex = .6, horiz = TRUE
)
After hierarchical clustering finishes, the result is a tree like the ones shown above. The height attribute gives the height at which each merge occurs in the tree (i.e. the dissimilarity between the merged clusters), and this height can later be used to cut the tree into groups.
hc_ave$height
## [1] 15.86985 17.69218 18.54191 24.14177 24.54019 26.13129 26.16949
## [8] 28.33717 30.00059 35.96070 36.89931 39.65064 45.39283 49.07034
## [15] 55.03064 56.11457 69.47920 71.85406 73.06058 73.62122 77.13075
## [22] 79.92296 84.61022 90.20449 112.25442 125.47618 140.09498 159.39778
## [29] 250.41246
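One common heuristic is to look for a large jump between consecutive merge heights and cut the tree just below it; a minimal sketch using the heights above:
# plot the merge heights in order; a sharp jump suggests a natural cut point
plot(hc_ave$height, type = "b", xlab = "merge step", ylab = "merge height")
# index of the largest gap between consecutive merge heights
which.max(diff(hc_ave$height))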
The merge attribute records the order in which the clusters were merged.
head(hc_ave$merge, 10)
## [,1] [,2]
## [1,] -18 -23
## [2,] -5 -24
## [3,] -2 -11
## [4,] -25 3
## [5,] -22 -26
## [6,] -8 -16
## [7,] -4 -19
## [8,] -28 -30
## [9,] -6 -14
## [10,] -27 4
From the output above: the 1st merge joins observation 18 (written as -18) with observation 23 (-23); the 2nd joins observation 5 (-5) with observation 24 (-24); the 3rd joins observation 2 (-2) with observation 11 (-11); the 4th joins observation 25 (-25) with the cluster created at merge step 3 (written as the positive index 3); and so on.
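To make this encoding concrete, the small recursive helper below (clusterMembersAt is written only for this illustration, not part of hclust) lists the observations contained in the cluster created at a given merge step:
# negative entries in merge are single observations, positive entries
# refer to the cluster created at that earlier merge step
clusterMembersAt <- function(hc, step) {
  members <- c()
  for (k in hc$merge[step, ]) {
    if (k < 0) {
      members <- c(members, -k)                       # a single observation
    } else {
      members <- c(members, clusterMembersAt(hc, k))  # an earlier cluster
    }
  }
  members
}
clusterMembersAt(hc_ave, 4)   # observations 25, 2 and 11, matching the text above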
# cut the tree into sub-trees (groups)
# by the number of groups to generate
getSubtreeGroup <- cutree(hc_ave, k = 15)
getSubtreeGroup
## data1 data2 data3 data4 data5 data6 data7 data8 data9 data10
## 1 2 3 4 5 6 2 7 8 8
## data11 data12 data13 data14 data15 data16 data17 data18 data19 data20
## 2 9 10 6 11 7 7 12 4 13
## data21 data22 data23 data24 data25 data26 data27 data28 data29 data30
## 4 1 12 5 2 1 2 14 15 14
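To inspect the result, the group labels can be tabulated, and rect.hclust can outline the groups on a standard (vertical) dendrogram; a short sketch:
# size of each of the 15 groups
table(getSubtreeGroup)
# outline the 15 groups on a vertical dendrogram
plot(hc_ave)
rect.hclust(hc_ave, k = 15, border = "red")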
# by the cutting height in the tree
getSubtreeHeight <- cutree(hc_ave, h = 100)
getSubtreeHeight
## data1 data2 data3 data4 data5 data6 data7 data8 data9 data10
## 1 2 2 3 4 3 2 1 5 5
## data11 data12 data13 data14 data15 data16 data17 data18 data19 data20
## 2 6 5 3 4 1 1 1 3 6
## data21 data22 data23 data24 data25 data26 data27 data28 data29 data30
## 3 1 1 4 2 1 2 6 2 6
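The number of groups from a height cut can also be read off the merge heights directly: each merge that occurs above the cut height leaves one extra group. A quick consistency check with the objects above:
# merges above h = 100 each leave one additional group
sum(hc_ave$height > 100) + 1   # 6, the same as the number of groups below
max(getSubtreeHeight)          # 6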