This is Part IV of the hierarchical clustering analysis. In this part, I am going to compare the each result with K-means.
Note that,you can find the other parts from the links below:
Data:
The data file Ramusbone length records the length of the Ramusbone from 20 boys ages 8, 8.5, 9, and 9.5.
Objective:
Carry out different hierarchical cluster analysis techniques on the Ramusbone length data set and cluster the kids based on given growth in height and compare the results.
Research Questions
Perform hierarchical clustering using the single, average, and complete methods to obtain k = 4 groups. Compare the results. For each unique clustering, make a plot of the first two principal components of each individual and indicate the cluster to which each point belongs.
One way in which hierarchical methods are compared is their robustness to error. Add noise to the bone data by adding independent normal random variables with mean 0 and standard deviation 0.25 to each measurement. Repeat the analyses from part (1). Which of the methods, if any, had their clustering change due to the added noise?
Another way in which hierarchical models are compared is their robustness to outliers. Add the observation x21 = (47:7; 48:8; 45:7; 45:3)T to the data set and rerun the analyses from part (1). Which of the methods, if any, had their clustering change due to the outlier?
Run a k-means analysis for k = 4 groups on the ramus bone data. Perform the analysis three times with random seeds to begin the procedure. Do the three k-means analyses return the same grouping? How do the groupings from the k-means procedure compare to those from the hierarchical methods?
library(tidyverse) # Data manipulation.
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(factoextra) # Clustering Visualization
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Raw(Original) Dataset
# Clear the workspace:
rm(list = ls())
# Load data
ramus<- read.table("ramusbonelength.txt", head = T)
head(ramus)
## Individual X8yr X8.5yr X9yr X9.5yr
## 1 1 47.8 48.8 49.0 49.7
## 2 2 46.4 47.3 47.7 48.4
## 3 3 46.3 46.8 47.8 48.5
## 4 4 45.1 45.3 46.1 47.2
## 5 5 47.6 48.5 48.9 49.3
## 6 6 52.5 53.2 53.3 53.7
# Normalize the data set for the raw data:
scaled <- function(x,na.rm=FALSE) ((x - mean(x)) / sd(x))
ramusdf <- ramus %>% mutate_at(c("X8yr", "X8.5yr","X9yr","X9.5yr"), scaled, na.rm=TRUE)
head(ramusdf)
## Individual X8yr X8.5yr X9yr X9.5yr
## 1 1 -0.3398327 -0.3248600 -0.5969107 -0.6400435
## 2 2 -0.8962839 -0.9155145 -1.0911680 -1.1168668
## 3 3 -0.9360304 -1.1123994 -1.0531482 -1.0801881
## 4 4 -1.4129886 -1.7030539 -1.6994846 -1.5570114
## 5 5 -0.4193257 -0.4429909 -0.6349305 -0.7867584
## 6 6 1.5282535 1.4077267 1.0379403 0.8271050
rdf<-ramusdf[,2:5] # Dropping "Individual" column
Noise added Dataset
newramus <- ramus[,2:5] + matrix(rnorm(20, mean = 0, sd = .25), ncol = 4)
noisedf<-cbind(ramus[,1], newramus)
head(noisedf)
## ramus[, 1] X8yr X8.5yr X9yr X9.5yr
## 1 1 47.52187 48.52187 48.72187 49.42187
## 2 2 46.12405 47.02405 47.42405 48.12405
## 3 3 46.40522 46.90522 47.90522 48.60522
## 4 4 45.15813 45.35813 46.15813 47.25813
## 5 5 47.78449 48.68449 49.08449 49.48449
## 6 6 52.57645 53.27645 53.37645 53.77645
Noise was added to the bone data by adding independent normal random variables with mean 0 and standard deviation 0.25 to each measurement.
# Normalize the data set :
scaled <- function(x,na.rm=FALSE) ((x - mean(x)) / sd(x))
newnoisedf <- noisedf %>% mutate_at(c("X8yr", "X8.5yr","X9yr","X9.5yr"), scaled, na.rm=TRUE)
head(newnoisedf)
## ramus[, 1] X8yr X8.5yr X9yr X9.5yr
## 1 1 -0.4667505 -0.4548520 -0.709811 -0.7427000
## 2 2 -1.0192874 -1.0468656 -1.195731 -1.2081683
## 3 3 -0.9081458 -1.0938342 -1.015576 -1.0355951
## 4 4 -1.4011024 -1.7053230 -1.669710 -1.5187354
## 5 5 -0.3629384 -0.3905741 -0.574039 -0.7202391
## 6 6 1.5312549 1.4244029 1.032930 0.8190933
ndf<-newnoisedf[,2:5] # Drops "Individual" column
Outlier added Dataset
new<-c(21,47.7,48.8,45.7,45.3)
newramus <- rbind(ramus,new)
nrow(newramus)
## [1] 21
Observation x21 = (47:7; 48:8; 45:7; 45:3) has been added as an outlier.
# Normalize the data set :
scaled <- function(x,na.rm=FALSE) ((x - mean(x)) / sd(x))
outdf <- newramus %>% mutate_at(c("X8yr", "X8.5yr","X9yr","X9.5yr"), scaled, na.rm=TRUE)
head(outdf)
## Individual X8yr X8.5yr X9yr X9.5yr
## 1 1 -0.3289306 -0.3165917 -0.4821698 -0.4879474
## 2 2 -0.8977870 -0.9209939 -0.9506124 -0.9247003
## 3 3 -0.9384196 -1.1224614 -0.9145784 -0.8911040
## 4 4 -1.4260108 -1.7268636 -1.5271571 -1.3278569
## 5 5 -0.4101958 -0.4374721 -0.5182039 -0.6223329
## 6 6 1.5808017 1.4563217 1.0672941 0.8559078
odf<-outdf[,2:5] # Drops "Individual" column
Distance matrix computation
# Compute distances for raw data:
dr <- dist(rdf, method = "euclidean")
# Compute distances for raw data:
dn <- dist(ndf, method = "euclidean")
# Compute distances for raw data:
do <- dist(odf, method = "euclidean")
set.seed(333)
out1 <- kmeans(rdf, 4, iter.max=20, nstart=25)
out2 <- kmeans(rdf, 4, iter.max=20, nstart=25)
out3 <- kmeans(rdf, 4, iter.max=20, nstart=25)
out1$cluster
## [1] 1 3 3 3 1 4 4 2 2 3 2 2 4 1 2 2 1 4 3 1
out2$cluster
## [1] 1 3 3 3 1 2 2 4 4 3 4 4 2 1 4 4 1 2 3 1
out3$cluster
## [1] 2 1 1 1 2 4 4 3 3 1 3 3 4 2 3 3 2 4 1 2
Yes, K-Means analysis does return the same grouping.
Single Linkage
singledr <- hclust(dr, method = "single")
sgroup<- cutree(singledr, k = 4)
sgroup
## [1] 1 1 1 1 1 2 2 2 3 1 2 3 2 1 2 2 1 2 1 4
Complete Linkage
completedr <- hclust(dr, method = "complete")
sgroup<- cutree(completedr, k = 4)
sgroup
## [1] 1 2 2 2 1 3 3 4 4 2 4 4 3 1 4 4 1 3 2 1
Average Linkage
averagedr<- hclust(dr, method = "average")
sgroup<- cutree(averagedr, k = 4)
sgroup
## [1] 1 1 1 1 1 2 2 3 3 1 3 3 2 1 3 3 1 2 1 4
K-Means
set.seed(333)
out1 <- kmeans(rdf, 4, iter.max=20, nstart=25)
out1$cluster
## [1] 1 3 3 3 1 4 4 2 2 3 2 2 4 1 2 2 1 4 3 1
RESULT: K-means returns the same grouping as Complete linkage does. Single Linkage and Average linkage return slightly different grouping.
Single Linkage
singledn <- hclust(dn, method = "single")
sgroup<- cutree(singledn, k = 4)
sgroup
## [1] 1 1 1 1 1 2 2 2 3 1 2 3 2 1 2 2 1 2 1 4
Complete Linkage
completedn <- hclust(dn, method = "complete")
cgroup<- cutree(completedn, k = 4)
cgroup
## [1] 1 2 2 2 1 3 3 4 4 2 4 4 3 1 4 3 1 3 2 1
Average Linkage
averagedn <- hclust(dn, method = "average")
agroup<- cutree(averagedn, k = 4)
agroup
## [1] 1 1 1 1 1 2 2 3 3 1 3 3 2 1 3 2 1 2 1 4
K-Means
set.seed(333)
out1 <- kmeans(ndf, 4, iter.max=20, nstart=25)
out1$cluster
## [1] 1 3 3 3 1 4 4 2 2 3 2 2 4 1 2 4 1 4 3 1
RESULT: K-means did not return the same grouping with any Linkage method.
Single Linkage
singleDo <- hclust(do, method = "single")
sgroup<- cutree(singleDo, k = 4)
sgroup
## [1] 1 1 1 1 1 2 2 2 2 1 2 2 2 1 2 2 1 2 1 3 4
Complete Linkage
completeDo <- hclust(do, method = "complete")
cgroup<- cutree(completeDo, k = 4)
cgroup
## [1] 1 2 2 2 1 3 3 4 4 2 4 4 3 1 4 4 1 3 2 1 2
Average Linkage
averageDo <- hclust(do, method = "average")
agroup<- cutree(averageDo, k = 4)
agroup
## [1] 1 1 1 1 1 2 2 3 3 1 3 3 2 1 3 3 1 2 1 1 4
K-Means
set.seed(333)
out3 <- kmeans(rdf, 4, iter.max=20, nstart=25)
out3$cluster
## [1] 1 3 3 3 1 4 4 2 2 3 2 2 4 1 2 2 1 4 3 1
RESULT: K-means did not return the same grouping with any Linkage method.
Raw data
# Enhanced k-means clustering
res.km <- eclust(rdf, "kmeans", nstart = 25)
# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)
# Optimal number of clusters using gap statistics
res.km$nbclust
## [1] 1
Note that K-means suggest that 1 cluster works best for the raw data.
Noise added data
# Enhanced k-means clustering
res.km <- eclust(ndf, "kmeans", nstart = 25)
# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)
# Optimal number of clusters using gap statistics
res.km$nbclust
## [1] 1
Note that K-means suggest that 1 cluster works best for the noise added data.
Outlier added data
# Enhanced k-means clustering
res.km <- eclust(odf, "kmeans", nstart = 25)
# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)
# Optimal number of clusters using gap statistics
res.km$nbclust
## [1] 2
Note that K-means suggest that 2 cluster works best for the outlier added data.