Introduction

This is Part IV of the hierarchical clustering analysis. In this part, I compare each hierarchical clustering result with k-means.

Note that you can find the other parts at the links below:

Part I covers question 1, Part II covers question 2, Part III covers question 3, and Part IV covers question 4.

Data:

The data file ramusbonelength.txt records the length of the ramus bone for 20 boys at ages 8, 8.5, 9, and 9.5 years.

Objective:

Carry out different hierarchical cluster analysis techniques on the ramus bone length data set, cluster the boys based on the given growth measurements, and compare the results.

Research Questions

  1. Perform hierarchical clustering using the single, average, and complete methods to obtain k = 4 groups. Compare the results. For each unique clustering, make a plot of the first two principal components of each individual and indicate the cluster to which each point belongs.

  2. One way in which hierarchical methods are compared is their robustness to error. Add noise to the bone data by adding independent normal random variables with mean 0 and standard deviation 0.25 to each measurement. Repeat the analyses from part (1). Which of the methods, if any, had their clustering change due to the added noise?

  3. Another way in which hierarchical methods are compared is their robustness to outliers. Add the observation x21 = (47.7, 48.8, 45.7, 45.3)^T to the data set and rerun the analyses from part (1). Which of the methods, if any, had their clustering change due to the outlier?

  4. Run a k-means analysis for k = 4 groups on the ramus bone data. Perform the analysis three times with random seeds to begin the procedure. Do the three k-means analyses return the same grouping? How do the groupings from the k-means procedure compare to those from the hierarchical methods?

PART IV

Data Preparation
library(tidyverse)   # Data manipulation. 
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(factoextra)  # Clustering Visualization
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Raw (Original) Dataset

# Clear the workspace: 
rm(list = ls())

# Load data
ramus <- read.table("ramusbonelength.txt", header = TRUE)

head(ramus)
##   Individual X8yr X8.5yr X9yr X9.5yr
## 1          1 47.8   48.8 49.0   49.7
## 2          2 46.4   47.3 47.7   48.4
## 3          3 46.3   46.8 47.8   48.5
## 4          4 45.1   45.3 46.1   47.2
## 5          5 47.6   48.5 48.9   49.3
## 6          6 52.5   53.2 53.3   53.7
# Normalize the data set for the raw data:
scaled <- function(x, na.rm = FALSE) (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)

ramusdf <- ramus %>% mutate_at(c("X8yr", "X8.5yr","X9yr","X9.5yr"), scaled, na.rm=TRUE)

head(ramusdf)
##   Individual       X8yr     X8.5yr       X9yr     X9.5yr
## 1          1 -0.3398327 -0.3248600 -0.5969107 -0.6400435
## 2          2 -0.8962839 -0.9155145 -1.0911680 -1.1168668
## 3          3 -0.9360304 -1.1123994 -1.0531482 -1.0801881
## 4          4 -1.4129886 -1.7030539 -1.6994846 -1.5570114
## 5          5 -0.4193257 -0.4429909 -0.6349305 -0.7867584
## 6          6  1.5282535  1.4077267  1.0379403  0.8271050
rdf <- ramusdf[, 2:5]  # Dropping "Individual" column
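
As a quick sanity check (a sketch, not part of the original output), the hand-rolled scaled() helper should agree with base R's scale(), which applies the same z-score standardization column by column:

# Sketch: the largest absolute difference between the two standardizations
# should be numerically zero.
max(abs(as.matrix(rdf) - scale(ramus[, 2:5])))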




Noise-added Dataset

# Add an independent N(0, sd = 0.25) draw to every measurement (20 boys x 4 ages):
newramus <- ramus[, 2:5] + matrix(rnorm(20 * 4, mean = 0, sd = 0.25), ncol = 4)

noisedf <- cbind(Individual = ramus[, 1], newramus)

head(noisedf)
##   Individual     X8yr   X8.5yr     X9yr   X9.5yr
## 1          1 47.52187 48.52187 48.72187 49.42187
## 2          2 46.12405 47.02405 47.42405 48.12405
## 3          3 46.40522 46.90522 47.90522 48.60522
## 4          4 45.15813 45.35813 46.15813 47.25813
## 5          5 47.78449 48.68449 49.08449 49.48449
## 6          6 52.57645 53.27645 53.37645 53.77645

Noise was added to the bone data by adding independent normal random variables with mean 0 and standard deviation 0.25 to each measurement.
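
Before standardizing, it is worth sanity-checking the perturbation itself (a sketch, not part of the original output): with 20 boys and 4 ages, the noise matrix holds 20 x 4 = 80 independent draws, so its empirical mean and standard deviation should sit close to 0 and 0.25.

# Sketch: inspect a freshly drawn noise matrix of the same shape.
noise <- matrix(rnorm(20 * 4, mean = 0, sd = 0.25), ncol = 4)
dim(noise)                  # 20 rows, 4 columns: one draw per measurement
c(mean(noise), sd(noise))   # roughly 0 and 0.25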

# Normalize the data set:
scaled <- function(x, na.rm = FALSE) (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)

newnoisedf <- noisedf %>% mutate_at(c("X8yr", "X8.5yr","X9yr","X9.5yr"), scaled, na.rm=TRUE)

head(newnoisedf)
##   Individual       X8yr     X8.5yr      X9yr     X9.5yr
## 1          1 -0.4667505 -0.4548520 -0.709811 -0.7427000
## 2          2 -1.0192874 -1.0468656 -1.195731 -1.2081683
## 3          3 -0.9081458 -1.0938342 -1.015576 -1.0355951
## 4          4 -1.4011024 -1.7053230 -1.669710 -1.5187354
## 5          5 -0.3629384 -0.3905741 -0.574039 -0.7202391
## 6          6  1.5312549  1.4244029  1.032930  0.8190933
ndf <- newnoisedf[, 2:5]  # Drops "Individual" column




Outlier-added Dataset

new <- c(21, 47.7, 48.8, 45.7, 45.3)

newramus <- rbind(ramus, new)

nrow(newramus)
## [1] 21

Observation x21 = (47.7, 48.8, 45.7, 45.3)^T has been added as an outlier.
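
What makes x21 anomalous is its growth pattern: in the rows shown earlier, ramus length increases with age, whereas x21 drops sharply after age 8.5. A quick check along these lines (a sketch, not part of the original analysis) flags individuals whose measurements ever decrease:

# Sketch: row indices whose measurements are not non-decreasing with age;
# the new observation 21 should be flagged.
which(apply(newramus[, 2:5], 1, function(r) any(diff(as.numeric(r)) < 0)))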

# Normalize the data set:
scaled <- function(x, na.rm = FALSE) (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)

outdf <- newramus %>% mutate_at(c("X8yr", "X8.5yr","X9yr","X9.5yr"), scaled, na.rm=TRUE)

head(outdf)
##   Individual       X8yr     X8.5yr       X9yr     X9.5yr
## 1          1 -0.3289306 -0.3165917 -0.4821698 -0.4879474
## 2          2 -0.8977870 -0.9209939 -0.9506124 -0.9247003
## 3          3 -0.9384196 -1.1224614 -0.9145784 -0.8911040
## 4          4 -1.4260108 -1.7268636 -1.5271571 -1.3278569
## 5          5 -0.4101958 -0.4374721 -0.5182039 -0.6223329
## 6          6  1.5808017  1.4563217  1.0672941  0.8559078
odf <- outdf[, 2:5]  # Drops "Individual" column




Distance matrix computation

# Compute distances for raw data: 
dr <- dist(rdf, method = "euclidean")

# Compute distances for noisy data: 
dn <- dist(ndf, method = "euclidean")

# Compute distances for outlier data: 
do <- dist(odf, method = "euclidean")
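
Each entry of these objects is a plain Euclidean distance between two standardized profiles; for instance, the distance between individuals 1 and 2 can be reproduced by hand (a sketch for illustration):

# Sketch: the Euclidean formula reproduces the (1, 2) entry of dr.
sqrt(sum((rdf[1, ] - rdf[2, ])^2))
as.matrix(dr)[1, 2]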


  4. Run a k-means analysis for k = 4 groups on the ramus bone data. Perform the analysis three times with random seeds to begin the procedure. Do the three k-means analyses return the same grouping? How do the groupings from the k-means procedure compare to those from the hierarchical methods?
set.seed(333)  # one seed for reproducibility; each call below still uses fresh random starts
out1 <- kmeans(rdf, 4, iter.max = 20, nstart = 25)
out2 <- kmeans(rdf, 4, iter.max = 20, nstart = 25)
out3 <- kmeans(rdf, 4, iter.max = 20, nstart = 25)


out1$cluster  
##  [1] 1 3 3 3 1 4 4 2 2 3 2 2 4 1 2 2 1 4 3 1
out2$cluster  
##  [1] 1 3 3 3 1 2 2 4 4 3 4 4 2 1 4 4 1 2 3 1
out3$cluster  
##  [1] 2 1 1 1 2 4 4 3 3 1 3 3 4 2 3 3 2 4 1 2

Yes: although the three runs print different label numbers, all three k-means analyses return the same grouping; only the arbitrary cluster labels differ.
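
A cross-tabulation verifies this (a sketch): if every row and every column of the table contains exactly one non-zero cell, the two runs induce identical partitions.

# Sketch: one non-zero cell per row and column confirms identical partitions.
table(out1$cluster, out2$cluster)
table(out1$cluster, out3$cluster)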




Comparison over Raw Data

Single Linkage

singledr <- hclust(dr, method = "single")
sgroup <- cutree(singledr, k = 4)
sgroup
##  [1] 1 1 1 1 1 2 2 2 3 1 2 3 2 1 2 2 1 2 1 4

Complete Linkage

completedr <- hclust(dr, method = "complete")
cgroup <- cutree(completedr, k = 4)
cgroup
##  [1] 1 2 2 2 1 3 3 4 4 2 4 4 3 1 4 4 1 3 2 1

Average Linkage

averagedr<- hclust(dr, method = "average")
agroup <- cutree(averagedr, k = 4)
agroup
##  [1] 1 1 1 1 1 2 2 3 3 1 3 3 2 1 3 3 1 2 1 4

K-Means

set.seed(333)

out1 <- kmeans(rdf, 4, iter.max = 20, nstart = 25)

out1$cluster  
##  [1] 1 3 3 3 1 4 4 2 2 3 2 2 4 1 2 2 1 4 3 1


RESULT: Up to a relabeling of the clusters, k-means returns the same grouping as complete linkage (complete-linkage clusters 1, 2, 3, 4 correspond to k-means clusters 1, 3, 4, 2). Single and average linkage return different groupings.
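
The same cross-tabulation trick verifies this claim (a sketch): k-means against complete linkage gives one non-zero cell per row and column, while the single-linkage table mixes labels across cells.

# Sketch: cross-tabulate the k-means labels against each linkage's labels.
table(out1$cluster, cgroup)   # complete linkage: identical up to relabeling
table(out1$cluster, sgroup)   # single linkage: the partitions differ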





Comparison over Noisy Data

Single Linkage

singledn <- hclust(dn, method = "single")

sgroup <- cutree(singledn, k = 4)

sgroup
##  [1] 1 1 1 1 1 2 2 2 3 1 2 3 2 1 2 2 1 2 1 4

Complete Linkage

completedn <- hclust(dn, method = "complete")

cgroup <- cutree(completedn, k = 4)

cgroup
##  [1] 1 2 2 2 1 3 3 4 4 2 4 4 3 1 4 3 1 3 2 1

Average Linkage

averagedn <- hclust(dn, method = "average")

agroup <- cutree(averagedn, k = 4)

agroup
##  [1] 1 1 1 1 1 2 2 3 3 1 3 3 2 1 3 2 1 2 1 4

K-Means

set.seed(333)

out1 <- kmeans(ndf, 4, iter.max = 20, nstart = 25)

out1$cluster  
##  [1] 1 3 3 3 1 4 4 2 2 3 2 2 4 1 2 4 1 4 3 1


RESULT: On the noisy data, k-means again matches the complete-linkage grouping up to relabeling (complete-linkage clusters 1, 2, 3, 4 correspond to k-means clusters 1, 3, 4, 2). Single and average linkage return different groupings.
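
A label-invariant way to score such comparisons is the adjusted Rand index (ARI), which equals 1 for identical partitions and is near 0 for unrelated ones. A minimal sketch using the mclust package (an assumption here; any ARI implementation would do):

library(mclust)   # for adjustedRandIndex()
adjustedRandIndex(out1$cluster, cgroup)   # complete vs k-means: should be 1
adjustedRandIndex(out1$cluster, sgroup)   # single vs k-means
adjustedRandIndex(out1$cluster, agroup)   # average vs k-means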



Comparison over Data with Outlier

Single Linkage

singleDo <- hclust(do, method = "single")

sgroup <- cutree(singleDo, k = 4)

sgroup
##  [1] 1 1 1 1 1 2 2 2 2 1 2 2 2 1 2 2 1 2 1 3 4

Complete Linkage

completeDo <- hclust(do, method = "complete")

cgroup <- cutree(completeDo, k = 4)

cgroup
##  [1] 1 2 2 2 1 3 3 4 4 2 4 4 3 1 4 4 1 3 2 1 2

Average Linkage

averageDo <- hclust(do, method = "average")

agroup <- cutree(averageDo, k = 4)

agroup
##  [1] 1 1 1 1 1 2 2 3 3 1 3 3 2 1 3 3 1 2 1 1 4

K-Means

set.seed(333)

out3 <- kmeans(odf, 4, iter.max = 20, nstart = 25)

out3$cluster  
##  [1] 1 3 3 3 1 4 4 2 2 3 2 2 4 1 2 2 1 4 3 1


RESULT: K-means did not return the same grouping as any of the linkage methods.



K-means suggested number of clusters for each data set

Raw data

# Enhanced k-means clustering
res.km <- eclust(rdf, "kmeans", nstart = 25)

# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)

# Optimal number of clusters according to the gap statistic
res.km$nbclust
## [1] 1

Note that the gap statistic suggests that a single cluster fits the raw data best.
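
The gap statistic that eclust() reports can also be computed directly with clusGap() from the cluster package; a minimal sketch (the K.max and B values here are illustrative, not taken from the original analysis):

library(cluster)   # for clusGap() and maxSE()
set.seed(333)
gap <- clusGap(rdf, FUNcluster = kmeans, nstart = 25, K.max = 8, B = 100)
# Smallest k within one standard error of the first local maximum of the gap:
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")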



Noise-added data

# Enhanced k-means clustering
res.km <- eclust(ndf, "kmeans", nstart = 25)

# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)

# Optimal number of clusters according to the gap statistic
res.km$nbclust
## [1] 1

Note that the gap statistic suggests that a single cluster also fits the noise-added data best.



Outlier-added data

# Enhanced k-means clustering
res.km <- eclust(odf, "kmeans", nstart = 25)

# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)

# Optimal number of clusters according to the gap statistic
res.km$nbclust
## [1] 2

Note that the gap statistic suggests that two clusters fit the outlier-added data best.




**************