R Example: Implementing Hierarchical Clustering

This is R code for practicing hierarchical clustering.
Reference: THIS BLOG

The original data:

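#set seed to make example reproducible
#(the values printed below come from R older than 3.6.0; newer versions
# use a different default sampler, so sample() returns different numbers)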
set.seed(123)

data <- data.frame(x=sample(1:10000,7), 
                   y=sample(1:10000,7), 
                   z=sample(1:10000,7))
data

##      x    y    z
## 1 2876 8925 1030
## 2 7883 5514 8998
## 3 4089 4566 2461
## 4 8828 9566  421
## 5 9401 4532 3278
## 6  456 6773 9541
## 7 5278 5723 8891

library(scatterplot3d)
s3d <- scatterplot3d(data, color=1:7, pch=19, type="p")
s3d.coords <- s3d$xyz.convert(data)
#label data points(1~7)
text(s3d.coords$x, s3d.coords$y, labels=row.names(data), cex=1, pos=4)

R has a built-in function, dist, for computing distances, but how do we know it's correct?

One way is to write our own function and compare its result with the built-in one!
Sounds good, right?
So let's do it!

Euclidean distance :

# create own function according to Euclidean distance formula
euclidean_distance <- function(p,q){
    sqrt(sum((p - q)^2))
}

# check points 4 and 6 
euclidean_distance(data[4,],data[6,]) #my own function

## [1] 12691.16

dist(data, method="euclidean") # defined function in R

##           1         2         3         4         5         6
## 2 10009.695                                                  
## 3  4745.525  7617.448                                        
## 4  6017.314  9532.925  7184.687                              
## 5  8180.928  5998.921  5374.569  5816.523                    
## 6  9106.296  7552.500  8258.083 12691.164 11147.209          
## 7  8821.436  2615.560  6640.578  9955.503  7065.648  4977.618
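Checking one pair still leaves the other twenty untested. Here is a minimal sketch of a fuller check. It assumes the data is re-entered by hand from the printed table above (so it does not depend on the random seed), and pairwise_matrix is a hypothetical helper name, not something from base R:

```r
# data copied from the printed table above
data <- data.frame(x=c(2876, 7883, 4089, 8828, 9401, 456, 5278),
                   y=c(8925, 5514, 4566, 9566, 4532, 6773, 5723),
                   z=c(1030, 8998, 2461, 421, 3278, 9541, 8891))

euclidean_distance <- function(p, q) {
    sqrt(sum((p - q)^2))
}

# apply a distance function to every pair of rows (hypothetical helper)
pairwise_matrix <- function(df, f) {
    m <- as.matrix(df)
    n <- nrow(m)
    out <- matrix(0, n, n)
    for (i in seq_len(n)) {
        for (j in seq_len(n)) {
            out[i, j] <- f(m[i, ], m[j, ])
        }
    }
    out
}

own <- pairwise_matrix(data, euclidean_distance)
builtin <- as.matrix(dist(data, method="euclidean"))

# compare values only; the two matrices carry different dimnames
all.equal(own, builtin, check.attributes=FALSE)
```

If the two implementations agree, all.equal returns TRUE; otherwise it describes the difference.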

Manhattan distance :

# create own function according to Manhattan distance formula
manhattan_distance <- function(p,q){
    sum(abs(p-q))
}
# check points 6 and 7
manhattan_distance(data[6,],data[7,])

## [1] 6522

dist(data, method="manhattan")

##       1     2     3     4     5     6
## 2 16386                              
## 3  7003 11279                        
## 4  7202 13574 11779                  
## 5 13166  8220  6163  8464            
## 6 13083  9229 12920 20285 17449      
## 7 13465  2921  8776 15863 10927  6522
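To see where the 6522 comes from, the sum can be broken down axis by axis. This is only a sketch, with the coordinates of points 6 and 7 typed in from the printed table; the names p6, p7 and per_axis are made up for the illustration:

```r
# coordinates of points 6 and 7, copied from the printed table
p6 <- c(x=456, y=6773, z=9541)
p7 <- c(x=5278, y=5723, z=8891)

# absolute difference along each axis
per_axis <- abs(p6 - p7)
per_axis       # x: 4822, y: 1050, z: 650

# Manhattan distance is the sum of those per-axis differences
sum(per_axis)  # 6522
```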

Minkowski distance :

p=1 means Manhattan distance:

# p=1 means Manhattan distance
a <- dist(data[1:2,], method="minkowski", p=1)
b <- dist(data[1:2,], method="manhattan")
c(a, b)

## [1] 16386 16386

p=2 means Euclidean distance:

#p=2 means Euclidean distance
a <- dist(data[1:2,], method="minkowski", p=2)
b <- dist(data[1:2,], method="euclidean")
c(a, b)

## [1] 10009.7 10009.7
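The same "write it ourselves and compare" trick works for Minkowski distance too. A minimal sketch, assuming points 1 and 2 are copied from the printed table; minkowski_distance is our own function name, and p=3 is just an arbitrary order picked for the comparison:

```r
# create own function according to Minkowski distance formula
minkowski_distance <- function(u, v, p) {
    sum(abs(u - v)^p)^(1/p)
}

# points 1 and 2, copied from the printed table
p1 <- c(2876, 8925, 1030)
p2 <- c(7883, 5514, 8998)

minkowski_distance(p1, p2, p=1)  # 16386, same as Manhattan
minkowski_distance(p1, p2, p=2)  # about 10009.7, same as Euclidean

# an order that is neither Manhattan nor Euclidean, compared against dist
own <- minkowski_distance(p1, p2, p=3)
builtin <- dist(rbind(p1, p2), method="minkowski", p=3)
all.equal(own, as.numeric(builtin))
```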

Hierarchical Clustering :

R has a built-in function, hclust, that performs hierarchical clustering.

The basic algorithmic steps are:
1. find the closest two things
2. put them together
3. find the next closest, and repeat until everything is in one cluster

And hclust requires two things:
* A defined distance (dissimilarity)
* A merging (linkage) approach

Did you notice?
This is actually the bottom-up (agglomerative) approach to clustering, right?
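The bottom-up steps can be written out by hand and compared with hclust, in the same spirit as the distance checks earlier. This is a rough sketch rather than how hclust is implemented: it assumes single linkage as the merging approach, re-enters the data from the printed table, and naive_single_linkage is a hypothetical name of our own.

```r
# data copied from the printed table above
data <- data.frame(x=c(2876, 7883, 4089, 8828, 9401, 456, 5278),
                   y=c(8925, 5514, 4566, 9566, 4532, 6773, 5723),
                   z=c(1030, 8998, 2461, 421, 3278, 9541, 8891))

# naive bottom-up clustering with single linkage (illustrative, not optimized)
# returns the distance at which each merge happens
naive_single_linkage <- function(d) {
    m <- as.matrix(d)
    diag(m) <- Inf                 # a cluster is never merged with itself
    heights <- numeric(0)
    while (nrow(m) > 1) {
        # step 1: find the closest two clusters
        idx <- which(m == min(m), arr.ind=TRUE)[1, ]
        i <- min(idx)
        j <- max(idx)
        heights <- c(heights, m[i, j])
        # step 2: put them together; with single linkage the distance from
        # the merged cluster to any other is the smaller of the two distances
        merged <- pmin(m[i, ], m[j, ])
        m[i, ] <- merged
        m[, i] <- merged
        m[i, i] <- Inf
        # step 3: drop the absorbed cluster and look for the next closest
        m <- m[-j, -j, drop=FALSE]
    }
    heights
}

d <- dist(data, method="euclidean")
own <- naive_single_linkage(d)
builtin <- hclust(d, method="single")$height
all.equal(own, builtin)
```

Other merging approaches would only change how the merged row is computed in step 2.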

Let’s try different methods with the same distance, and see what happens:

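# draw the dendrograms on a 2 x 3 grid, one per merging method
par(mfrow=c(2,3))
plot(hclust(dist(data, method="euclidean"), method="single"))
plot(hclust(dist(data, method="euclidean"), method="complete"))
plot(hclust(dist(data, method="euclidean"), method="average"))
# note: centroid linkage is usually applied to squared Euclidean distances;
# plain distances are kept here so all methods share the same input
plot(hclust(dist(data, method="euclidean"), method="centroid"))
plot(hclust(dist(data, method="euclidean"), method="ward.D2"))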
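Dendrograms are easy to compare by eye, but the groupings can also be put side by side as numbers. A small sketch, assuming we cut every tree into two clusters (k=2 is an arbitrary choice for illustration) and re-enter the data from the printed table:

```r
# data copied from the printed table above
data <- data.frame(x=c(2876, 7883, 4089, 8828, 9401, 456, 5278),
                   y=c(8925, 5514, 4566, 9566, 4532, 6773, 5723),
                   z=c(1030, 8998, 2461, 421, 3278, 9541, 8891))

d <- dist(data, method="euclidean")
methods <- c("single", "complete", "average", "centroid", "ward.D2")

# one column per merging method, one row per data point;
# each entry is the cluster (1 or 2) that the point falls into
groups <- sapply(methods, function(m) cutree(hclust(d, method=m), k=2))
groups
```

Rows that show the same number within a column ended up in the same cluster under that method.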