This is R code for practicing hierarchical clustering.
Reference from THIS BLOG
#set seed to make example reproducible
set.seed(123)
data <- data.frame(x=sample(1:10000,7),
y=sample(1:10000,7),
z=sample(1:10000,7))
data
## x y z
## 1 2876 8925 1030
## 2 7883 5514 8998
## 3 4089 4566 2461
## 4 8828 9566 421
## 5 9401 4532 3278
## 6 456 6773 9541
## 7 5278 5723 8891
library(scatterplot3d)
s3d <- scatterplot3d(data, color=1:7, pch=19, type="p")
s3d.coords <- s3d$xyz.convert(data)
# label the data points (1 to 7)
text(s3d.coords$x, s3d.coords$y, labels=row.names(data), cex=1, pos=4)
R has a built-in function, dist, for calculating distances, but how do we know it is correct?
One way is to write our own function and compare its result with the built-in one.
Not a bad idea, right?
So let's do it!
# create our own function from the Euclidean distance formula
euclidean_distance <- function(p,q){
sqrt(sum((p - q)^2))
}
# check points 4 and 6
euclidean_distance(data[4,],data[6,]) #my own function
## [1] 12691.16
dist(data, method="euclidean") # defined function in R
## 1 2 3 4 5 6
## 2 10009.695
## 3 4745.525 7617.448
## 4 6017.314 9532.925 7184.687
## 5 8180.928 5998.921 5374.569 5816.523
## 6 9106.296 7552.500 8258.083 12691.164 11147.209
## 7 8821.436 2615.560 6640.578 9955.503 7065.648 4977.618
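The check above compares a single pair of points. As a minimal sketch, the same comparison can be run over every pair at once. The data frame is retyped here from the printed output so the block stands alone, and the names `own` and `builtin` are introduced only for this illustration.

```r
# Data retyped from the printed output above (assumed identical to `data`)
data <- data.frame(x = c(2876, 7883, 4089, 8828, 9401, 456, 5278),
                   y = c(8925, 5514, 4566, 9566, 4532, 6773, 5723),
                   z = c(1030, 8998, 2461, 421, 3278, 9541, 8891))

euclidean_distance <- function(p, q) {
  sqrt(sum((p - q)^2))
}

n <- nrow(data)
# Fill a full n x n matrix using our own function
own <- matrix(0, n, n)
for (i in 1:n) {
  for (j in 1:n) {
    own[i, j] <- euclidean_distance(data[i, ], data[j, ])
  }
}

# as.matrix() expands the lower-triangle "dist" object into a full matrix
builtin <- as.matrix(dist(data, method = "euclidean"))

# Compare the values only, ignoring row and column names
all.equal(own, builtin, check.attributes = FALSE)
```

If the two agree for every pair, `all.equal` returns TRUE; otherwise it reports the size of the mismatch.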
# create our own function from the Manhattan distance formula
manhattan_distance <- function(p,q){
sum(abs(p-q))
}
# check points 6 and 7
manhattan_distance(data[6,],data[7,])
## [1] 6522
dist(data, method="manhattan")
## 1 2 3 4 5 6
## 2 16386
## 3 7003 11279
## 4 7202 13574 11779
## 5 13166 8220 6163 8464
## 6 13083 9229 12920 20285 17449
## 7 13465 2921 8776 15863 10927 6522
dist also supports the Minkowski distance, which generalizes both of the above. With p=1 it is the Manhattan distance:
# p=1 means Manhattan distance
a <- dist(data[1:2,], method="minkowski", p=1)
b <- dist(data[1:2,], method="manhattan")
c(a, b)
## [1] 16386 16386
With the Minkowski method, p=2 gives the Euclidean distance:
#p=2 means Euclidean distance
a <- dist(data[1:2,], method="minkowski", p=2)
b <- dist(data[1:2,], method="euclidean")
c(a, b)
## [1] 10009.7 10009.7
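In the same spirit as the hand-written Euclidean and Manhattan functions, here is a minimal sketch of a hand-written Minkowski distance. The function name `minkowski_distance` and its `power` argument are made up for this illustration, and the data frame is retyped from the printed output so the block runs on its own.

```r
# Data retyped from the printed output above (assumed identical to `data`)
data <- data.frame(x = c(2876, 7883, 4089, 8828, 9401, 456, 5278),
                   y = c(8925, 5514, 4566, 9566, 4532, 6773, 5723),
                   z = c(1030, 8998, 2461, 421, 3278, 9541, 8891))

# Hypothetical helper: Minkowski distance of order `power`
minkowski_distance <- function(p, q, power) {
  sum(abs(p - q)^power)^(1 / power)
}

# power = 1 should reproduce Manhattan, power = 2 Euclidean
m1 <- minkowski_distance(data[1, ], data[2, ], power = 1)
m2 <- minkowski_distance(data[1, ], data[2, ], power = 2)
c(m1, m2)

# Any other order can be checked against dist() as well, e.g. p = 3
m3 <- minkowski_distance(data[1, ], data[2, ], power = 3)
d3 <- as.numeric(dist(data[1:2, ], method = "minkowski", p = 3))
all.equal(m3, d3)
```

The first two values should match the 16386 and 10009.7 printed above for points 1 and 2.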
R provides a function, hclust, for hierarchical clustering.
The basic algorithmic steps are:
1. find the closest two things
2. put them together
3. find the next closest
And it requires two arguments:
* A defined distance (similarity)
* A merging approach
Did you notice?
This is the bottom-up (agglomerative) approach to clustering: every point starts as its own cluster, and clusters are merged step by step.
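The three steps listed above can be sketched as a naive loop. This is only an illustration under stated assumptions, not how hclust is implemented internally: it uses Euclidean distance with single linkage (the distance between two clusters is the smallest distance between any of their members) as the merging approach, and the names `clusters` and `merge_heights` are invented for the example.

```r
# Data retyped from the printed output above (assumed identical to `data`)
data <- data.frame(x = c(2876, 7883, 4089, 8828, 9401, 456, 5278),
                   y = c(8925, 5514, 4566, 9566, 4532, 6773, 5723),
                   z = c(1030, 8998, 2461, 421, 3278, 9541, 8891))

d <- as.matrix(dist(data, method = "euclidean"))

# Every point starts in its own cluster
clusters <- as.list(seq_len(nrow(data)))
merge_heights <- numeric(0)

while (length(clusters) > 1) {
  best <- c(NA, NA)
  best_dist <- Inf
  # 1. find the closest two clusters (single linkage)
  for (i in 1:(length(clusters) - 1)) {
    for (j in (i + 1):length(clusters)) {
      dij <- min(d[clusters[[i]], clusters[[j]]])
      if (dij < best_dist) {
        best_dist <- dij
        best <- c(i, j)
      }
    }
  }
  # 2. put them together
  cat("merge {", clusters[[best[1]]], "} and {", clusters[[best[2]]],
      "} at distance", round(best_dist, 3), "\n")
  clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
  clusters[[best[2]]] <- NULL
  merge_heights <- c(merge_heights, best_dist)
  # 3. the loop repeats to find the next closest pair
}

# The merge distances should agree with the heights reported by hclust
all.equal(merge_heights, hclust(dist(data), method = "single")$height)
```

Swapping `min` for `max` in the inner loop would give complete linkage instead, which is one way to see why the merging approach matters.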
Let’s try different methods with the same distance, and see what happens:
par(mfrow=c(2,3))
plot(hclust(dist(data, method="euclidean"), method="single"))
plot(hclust(dist(data, method="euclidean"), method="complete"))
plot(hclust(dist(data, method="euclidean"), method="average"))
plot(hclust(dist(data, method="euclidean"), method="centroid"))
plot(hclust(dist(data, method="euclidean"), method="ward.D2"))
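Since all five dendrograms get the same default title, it can be hard to tell them apart on the page. As a rough complement to the plots, this sketch collects the merge heights of each method into one table; the variable names `methods` and `heights` are made up for this example, and the data frame is retyped from the printed output.

```r
# Data retyped from the printed output above (assumed identical to `data`)
data <- data.frame(x = c(2876, 7883, 4089, 8828, 9401, 456, 5278),
                   y = c(8925, 5514, 4566, 9566, 4532, 6773, 5723),
                   z = c(1030, 8998, 2461, 421, 3278, 9541, 8891))

methods <- c("single", "complete", "average", "centroid", "ward.D2")
d <- dist(data, method = "euclidean")

# One column per method, one row per merge step (n - 1 = 6 merges)
heights <- sapply(methods, function(m) hclust(d, method = m)$height)
round(heights)
```

Each column lists the heights at which that method joins clusters, so differences between columns show where the methods start to disagree.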