Introduction

This is Part II of the hierarchical clustering analysis. In this part, I add noise to the data set and check whether it changes the clustering results.

Note that you can find the other parts via the links below:

Part I covers question 1. Part II covers question 2. Part III covers question 3. Part IV covers question 4.

Data:

The data file ramusbonelength.txt records the length of the ramus bone for 20 boys at ages 8, 8.5, 9, and 9.5 years.

Objective:

Carry out different hierarchical clustering techniques on the ramus bone length data set, cluster the boys based on the recorded bone growth, and compare the results.

Research Questions

  1. Perform hierarchical clustering using the single, average, and complete methods to obtain k = 4 groups. Compare the results. For each unique clustering, make a plot of the first two principal components of each individual and indicate the cluster to which each point belongs.

  2. One way in which hierarchical methods are compared is their robustness to error. Add noise to the bone data by adding independent normal random variables with mean 0 and standard deviation 0.25 to each measurement. Repeat the analyses from part (1). Which of the methods, if any, had their clustering change due to the added noise?

  3. Another way in which hierarchical methods are compared is their robustness to outliers. Add the observation x21 = (47.7, 48.8, 45.7, 45.3)^T to the data set and rerun the analyses from part (1). Which of the methods, if any, had their clustering change due to the outlier?

  4. Run a k-means analysis for k = 4 groups on the ramus bone data. Perform the analysis three times, using a different random seed each time to begin the procedure. Do the three k-means analyses return the same grouping? How do the groupings from the k-means procedure compare to those from the hierarchical methods?

PART II

Data Preparation

library(tidyverse)   # Data manipulation. 
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(factoextra)  # Clustering Visualization
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Load the Dataset

# Clear the workspace: 
rm(list = ls())

# Load data
ramus <- read.table("ramusbonelength.txt", header = TRUE)

head(ramus)
##   Individual X8yr X8.5yr X9yr X9.5yr
## 1          1 47.8   48.8 49.0   49.7
## 2          2 46.4   47.3 47.7   48.4
## 3          3 46.3   46.8 47.8   48.5
## 4          4 45.1   45.3 46.1   47.2
## 5          5 47.6   48.5 48.9   49.3
## 6          6 52.5   53.2 53.3   53.7
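
As a quick check that the file matches the description (20 boys measured at four ages), the dimensions of the data frame can be inspected:

# Expect 20 rows (boys) and 5 columns (Individual plus the four ages)
dim(ramus)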

Add Noise and Create a New Data Set

# Add independent normal noise (mean 0, sd 0.25) to every measurement.
# Note: one draw is needed per measurement (20 rows x 4 columns = 80);
# rnorm(20) would be recycled, giving each boy the same noise at all four ages.
newramus <- ramus[, 2:5] + matrix(rnorm(20 * 4, mean = 0, sd = 0.25), ncol = 4)

# Re-attach the Individual column
newramus <- cbind(ramus[, 1], newramus)

head(newramus)
##   ramus[, 1]     X8yr   X8.5yr     X9yr   X9.5yr
## 1          1 48.06815 49.06815 49.26815 49.96815
## 2          2 46.03164 46.93164 47.33164 48.03164
## 3          3 46.16947 46.66947 47.66947 48.36947
## 4          4 45.32939 45.52939 46.32939 47.42939
## 5          5 47.38829 48.28829 48.68829 49.08829
## 6          6 52.21681 52.91681 53.01681 53.41681

Noise was added to the bone data by adding independent normal random variables with mean 0 and standard deviation 0.25 to each measurement.
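
Because the noise is random and no seed is fixed, these values, and potentially the clusters below, will differ from run to run. Setting a seed before drawing the noise makes the analysis reproducible; a minimal sketch (the seed value 123 is arbitrary):

# Fix the RNG state so the same noise is drawn on every run
set.seed(123)
newramus <- cbind(ramus[, 1], ramus[, 2:5] + matrix(rnorm(20 * 4, mean = 0, sd = 0.25), ncol = 4))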

Normalize the data set

# Create a standardizing (z-score) function; pass na.rm through to mean() and sd()
scaled <- function(x, na.rm = FALSE) (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)

# Apply over the measurement columns
newramusdf <- newramus %>% mutate_at(c("X8yr", "X8.5yr", "X9yr", "X9.5yr"), scaled, na.rm = TRUE)

head(newramusdf)
##   ramus[, 1]       X8yr     X8.5yr       X9yr     X9.5yr
## 1          1 -0.2360764 -0.2220488 -0.5022948 -0.5467915
## 2          2 -1.0560147 -1.0747639 -1.2497485 -1.2640114
## 3          3 -1.0005194 -1.1793976 -1.1193509 -1.1388883
## 4          4 -1.3387517 -1.6344214 -1.6365942 -1.4870629
## 5          5 -0.5098027 -0.5333048 -0.7261105 -0.8726641
## 6          6  1.4342526  1.3140115  0.9446108  0.7304780
# Drop the Individual column
newdf <- newramusdf[, 2:5]
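
For reference, base R's scale() performs the same column-wise standardization, so it can serve as a quick cross-check (assuming no missing values):

# scale() centers each column and divides by its standard deviation
all.equal(as.data.frame(scale(newramus[, 2:5])), newdf, check.attributes = FALSE)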

Create functions for data visualizations

# Create a function for dendrogram plots:
dend_func <- function(x) { fviz_dend(x, 
          k = 4,   
          cex = 0.5, 
          rect = TRUE, 
          rect_fill = TRUE, 
          horiz = FALSE, 
          palette = "jco", 
          rect_border = "jco", 
          color_labels_by_k = TRUE) }
# Create a theme for dendrogram plots: 
theme1 <- theme_gray() + 
          theme(plot.margin = unit(rep(0.7, 4), "cm"))  
# Create a function for cluster plots:
clust_func <- function(x) { fviz_cluster(list(data = newdf, cluster = paste0("Group", x)), 
                         alpha = 1, 
                         colors = x, 
                         labelsize = 9, 
                         ellipse.type = "norm") }
# Create a theme for cluster plots: 
theme2 <- theme(legend.position = c(0.1, 0.8)) + 
          theme(plot.margin = unit(rep(0.5, 4), "cm"))

Distance matrix computation and visualization

# Compute distances: 
dd <- dist(newdf, method = "euclidean")

# Visualize the dissimilarity: 
fviz_dist(dd, lab_size = 6)
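
The most dissimilar pair of observations can be read directly off the distance matrix, which is useful for checking the plots below; a small sketch:

# Convert to a full matrix and locate the largest pairwise distance
dmat <- as.matrix(dd)
which(dmat == max(dmat), arr.ind = TRUE)  # indices of the farthest pair (listed twice, once per symmetric entry)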

Eigenvalues for PC Analysis

# Correlation matrix of the standardized data
S <- cor(newdf)

# Eigenvalues and eigenvectors
eig <- eigen(S)

# Cumulative proportion of variance explained
cumsum(eig$values)/sum(eig$values)
## [1] 0.9219378 0.9872995 0.9956753 1.0000000
The first two eigenvalues account for about 98.7% of the total variance, so a plot of the first two principal components loses very little information.

# PC scores: project the standardized data onto the first two eigenvectors
pc1 <- as.matrix(newdf) %*% eig$vectors[, 1]
pc2 <- as.matrix(newdf) %*% eig$vectors[, 2]
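
As a sanity check, these scores can be compared against prcomp(). Because newdf is already standardized, its covariance and correlation matrices coincide, so prcomp() on newdf should reproduce the scores up to sign flips; a sketch:

# prcomp() on the standardized data; centering is a no-op here
pca <- prcomp(newdf)
all.equal(abs(as.numeric(pc1)), abs(unname(pca$x[, 1])))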




Single Linkage

Dendrogram Plot

singleNoise <- hclust(dd, method = "single")

dend_func(singleNoise) -> basic_plot

basic_plot + theme1 + 
  labs(title = "Hierarchical Clustering with Single Linkage Method")

PCA Plot

sgroup <- cutree(singleNoise, k = 4)

plot(pc1, pc2, ylim = c(-2,1))

text(x = pc1, y = pc2+.1, labels = rownames(ramus))

points(pc1, pc2, col = sgroup, pch = 19)

Cluster Plot

clust_func(sgroup)-> cluster_plot

cluster_plot + theme2 + 
  labs(title = "Cluster based on Hierarchical Clustering") 
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

These plots confirm that observations 18 and 4 are separated by the largest distance.





Complete Linkage

Dendrogram Plot

completeNoise <- hclust(dd, method = "complete")

dend_func(completeNoise) -> basic_plot

basic_plot + theme1 + 
  labs(title = "Hierarchical Clustering with Complete Linkage Method")

PCA Plot

cgroup <- cutree(completeNoise, k = 4)

plot(pc1, pc2, ylim = c(-2,1))

text(x = pc1, y = pc2+.1, labels = rownames(ramus))

points(pc1, pc2, col = cgroup, pch = 19)

Cluster Plot

clust_func(cgroup)-> cluster_plot

cluster_plot + theme2 + 
  labs(title = "Cluster based on Hierarchical Clustering") 
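
Since question (1) asks for a comparison across linkage methods, the complete-linkage groups can be cross-tabulated against the single-linkage groups. Because cutree() assigns arbitrary group numbers, agreement shows up as one non-zero entry per row rather than as identical labels:

# Cross-tabulate the two clusterings of the noisy data
table(Single = sgroup, Complete = cgroup)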





Average Linkage

Dendrogram Plot

aveNoise <- hclust(dd, method = "average")

dend_func(aveNoise) -> basic_plot

basic_plot + theme1 + 
  labs(title = "Hierarchical Clustering with Average Linkage Method")

PCA Plot

agroup <- cutree(aveNoise, k = 4)

plot(pc1, pc2, ylim = c(-2,1))

text(x = pc1, y = pc2+.1, labels = rownames(ramus))

points(pc1, pc2, col = agroup, pch = 19)

Cluster Plot

# Cluster plot for the average-linkage groups:
clust_func(agroup)-> cluster_plot

cluster_plot + theme2 + 
  labs(title = "Cluster based on Hierarchical Clustering") 
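
To answer the research question directly, each noisy clustering can be compared with its counterpart from the original data. A sketch, assuming the Part I groupings are recomputed here in the same way (standardized measurements, Euclidean distances) on the un-noised data:

# Standardize the original (noise-free) measurements and recluster
olddf <- as.data.frame(scale(ramus[, 2:5]))
dd0 <- dist(olddf, method = "euclidean")

# Cross-tabulate original vs. noisy labels for each linkage method;
# one non-zero entry per row means the clustering survived the noise intact
table(Original = cutree(hclust(dd0, "single"), k = 4), Noisy = sgroup)
table(Original = cutree(hclust(dd0, "complete"), k = 4), Noisy = cgroup)
table(Original = cutree(hclust(dd0, "average"), k = 4), Noisy = agroup)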


**************