This month's /r/dataisbeautiful data visualization challenge involved a clean and simple dataset consisting of the mass, resting heart rate, and longevity of 15 animals.

Using hierarchical clustering, we can explore whether these animals group together meaningfully based on the provided variables.


1. Loading required packages

library('cluster')
library('dendextend')
library('factoextra')
library('ggplot2')
library('ggrepel')
library('knitr')
library('kableExtra')
library('purrr')
library('tidyverse')

2. Import the data

df <- read.csv("august_data_vis.csv")

df <- df %>% 
  rename(
    Mass_grams = Mass..grams.,
    Heart_Rate_BPM = Resting.Heart.Rate..BPM.,
    Longevity_years = Longevity..Years.)

df
##       Creature Mass_grams Heart_Rate_BPM Longevity_years
## 1        Human      90000             60              70
## 2          Cat       2000            150              15
## 3    Small dog       2000            100              10
## 4   Medium dog       5000             90              15
## 5    Large dog       8000             75              17
## 6      Hamster         60            450               3
## 7      Chicken       1500            275              15
## 8       Monkey       5000            190              15
## 9        Horse    1200000             44              40
## 10         Cow     800000             65              22
## 11         Pig     150000             70              25
## 12      Rabbit       1000            205               9
## 13    Elephant    5000000             30              70
## 14     Giraffe     900000             65              20
## 15 Large whale  120000000             20              80

3. Hierarchical Clustering

# Move the creature names into row names and standardize each variable
df_indexed <- data.frame(df[,-1], row.names = df[,1])
scaled <- scale(df_indexed)

# Euclidean distances between animals, then average-linkage clustering
dist_scaled <- dist(scaled, method = 'euclidean')
hc_df <- hclust(dist_scaled, method = 'average')
dend_df <- as.dendrogram(hc_df)

# Plot the dendrogram
plot(dend_df, main="Dendrogram: Average Linkage, Euclidean Distance", cex.main=0.8)

At first glance there appears to be some meaningful grouping, but we must first decide how many clusters are appropriate.
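One way to get a feel for this choice is to cut the tree at a few candidate values of k and compare the resulting cluster sizes. The following is a minimal sketch rather than part of the original analysis: it re-enters the data from the table in section 2, and the names `animals`, `hc` and `sizes_by_k` are introduced here purely for illustration. It relies only on `hclust()` and `cutree()` from base R.

```r
# Data re-entered from the table printed in section 2
animals <- data.frame(
  Mass_grams = c(90000, 2000, 2000, 5000, 8000, 60, 1500, 5000,
                 1200000, 800000, 150000, 1000, 5000000, 900000, 120000000),
  Heart_Rate_BPM = c(60, 150, 100, 90, 75, 450, 275, 190,
                     44, 65, 70, 205, 30, 65, 20),
  Longevity_years = c(70, 15, 10, 15, 17, 3, 15, 15,
                      40, 22, 25, 9, 70, 20, 80),
  row.names = c("Human", "Cat", "Small dog", "Medium dog", "Large dog",
                "Hamster", "Chicken", "Monkey", "Horse", "Cow", "Pig",
                "Rabbit", "Elephant", "Giraffe", "Large whale")
)

# Same recipe as above: standardize, Euclidean distance, average linkage
hc <- hclust(dist(scale(animals), method = "euclidean"), method = "average")

# Cut the tree at k = 2..5 and count how many animals land in each cluster
sizes_by_k <- lapply(2:5, function(k) table(cutree(hc, k = k)))
names(sizes_by_k) <- paste0("k=", 2:5)
print(sizes_by_k)
```

Very lopsided cluster sizes at small k are a hint that one or two extreme animals are driving the top of the tree.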

4. Elbow Plot

# Use map_dbl to run k-means on the scaled data for each value of k (centers)
tot_withinss <- map_dbl(1:14,  function(k){
  model <- kmeans(x = scaled, centers = k)
  model$tot.withinss})

# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
  k = 1:14,
  tot_withinss = tot_withinss)

# Plot the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:14) +
  theme_bw() +
  labs(title = "Optimal Number of Clusters: Elbow Plot")

Unfortunately, this elbow curve is smooth, with no clear bend, so it does not point to an appropriate value of k.
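Because `kmeans()` starts from randomly chosen centers, the curve can also change slightly from run to run. A hedged sketch of a more repeatable version is shown here, assuming the standardized variables are the intended input; the seed value, `nstart = 25`, and the names `animals`, `k_values` and `tot_withinss_fixed` are choices made for this illustration, not settings from the original analysis.

```r
# Data re-entered from the table printed in section 2
animals <- data.frame(
  Mass_grams = c(90000, 2000, 2000, 5000, 8000, 60, 1500, 5000,
                 1200000, 800000, 150000, 1000, 5000000, 900000, 120000000),
  Heart_Rate_BPM = c(60, 150, 100, 90, 75, 450, 275, 190,
                     44, 65, 70, 205, 30, 65, 20),
  Longevity_years = c(70, 15, 10, 15, 17, 3, 15, 15,
                      40, 22, 25, 9, 70, 20, 80),
  row.names = c("Human", "Cat", "Small dog", "Medium dog", "Large dog",
                "Hamster", "Chicken", "Monkey", "Horse", "Cow", "Pig",
                "Rabbit", "Elephant", "Giraffe", "Large whale")
)
scaled_animals <- scale(animals)

set.seed(42)  # arbitrary seed, used only so repeated runs give the same curve
k_values <- 1:14

# For each k, keep the best of 25 random starts and record total within-cluster SS
tot_withinss_fixed <- vapply(k_values, function(k) {
  kmeans(scaled_animals, centers = k, nstart = 25)$tot.withinss
}, numeric(1))

# Base-graphics version of the elbow plot
plot(k_values, tot_withinss_fixed, type = "b",
     xlab = "k", ylab = "Total within-cluster sum of squares",
     main = "Elbow plot (fixed seed, nstart = 25)")
```

With three standardized variables and 15 animals, the value at k = 1 is (15 - 1) * 3 = 42, which gives a convenient sanity check on the input.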

5. Silhouette analysis

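# Run PAM for each k and record the average silhouette width
# (pam() is applied to the unscaled data here, so mass dominates the distances)
sil_width <- map_dbl(2:14,  function(k){
  model <- pam(df_indexed, k = k)
  model$silinfo$avg.width})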

# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
  k = 2:14,
  sil_width = sil_width)

# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:14) +
  theme_bw() +
  geom_vline(xintercept = 2, 
             color = "red", linetype = "dashed", size = 1) +
  labs(title= "Optimal Number of Clusters: Silhouette Plot")

The silhouette plot suggests that k = 2 is the most appropriate number of clusters.
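The silhouette widths above come from PAM, which is a different algorithm from the average-linkage tree built in section 3. As a cross-check, the same statistic can be computed for partitions cut directly from the tree. This is a minimal sketch under that assumption, using `silhouette()` from the cluster package; the data are re-entered from section 2 and the names `animals`, `d`, `hc` and `sil_hc` are illustrative.

```r
library(cluster)

# Data re-entered from the table printed in section 2
animals <- data.frame(
  Mass_grams = c(90000, 2000, 2000, 5000, 8000, 60, 1500, 5000,
                 1200000, 800000, 150000, 1000, 5000000, 900000, 120000000),
  Heart_Rate_BPM = c(60, 150, 100, 90, 75, 450, 275, 190,
                     44, 65, 70, 205, 30, 65, 20),
  Longevity_years = c(70, 15, 10, 15, 17, 3, 15, 15,
                      40, 22, 25, 9, 70, 20, 80),
  row.names = c("Human", "Cat", "Small dog", "Medium dog", "Large dog",
                "Hamster", "Chicken", "Monkey", "Horse", "Cow", "Pig",
                "Rabbit", "Elephant", "Giraffe", "Large whale")
)

# Standardized Euclidean distances and the average-linkage tree
d <- dist(scale(animals), method = "euclidean")
hc <- hclust(d, method = "average")

# Average silhouette width of the tree cut at each k
k_values <- 2:14
sil_hc <- vapply(k_values, function(k) {
  sil <- silhouette(cutree(hc, k = k), d)
  mean(sil[, "sil_width"])
}, numeric(1))
names(sil_hc) <- k_values

print(round(sil_hc, 2))
```

If the k with the largest value here agrees with the PAM result, the choice of k does not hinge on which clustering method is used.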

6. Final Diagrams
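
The next section uses a data frame called `clust_df` that holds the original columns plus a `cluster` label in fifth position. A minimal sketch of how such an object and a final dendrogram could be produced is given here, assuming the average-linkage tree is cut at k = 2. The data are re-entered from section 2, and `rect.hclust()` from base R is a stand-in chosen for this illustration, not necessarily the plotting function originally used.

```r
# Data re-entered from the table printed in section 2
df <- data.frame(
  Creature = c("Human", "Cat", "Small dog", "Medium dog", "Large dog",
               "Hamster", "Chicken", "Monkey", "Horse", "Cow", "Pig",
               "Rabbit", "Elephant", "Giraffe", "Large whale"),
  Mass_grams = c(90000, 2000, 2000, 5000, 8000, 60, 1500, 5000,
                 1200000, 800000, 150000, 1000, 5000000, 900000, 120000000),
  Heart_Rate_BPM = c(60, 150, 100, 90, 75, 450, 275, 190,
                     44, 65, 70, 205, 30, 65, 20),
  Longevity_years = c(70, 15, 10, 15, 17, 3, 15, 15,
                      40, 22, 25, 9, 70, 20, 80)
)

# Rebuild the tree exactly as in section 3
df_indexed <- data.frame(df[, -1], row.names = df[, 1])
hc_df <- hclust(dist(scale(df_indexed), method = "euclidean"), method = "average")

# Assumption: cut the tree into k = 2 clusters, following the silhouette result
clusters <- cutree(hc_df, k = 2)

# Attach the cluster label as a fifth column
clust_df <- df
clust_df$cluster <- unname(clusters)

# Dendrogram with a box drawn around each of the two clusters
plot(hc_df, main = "Dendrogram with k = 2 clusters",
     xlab = "", sub = "", cex.main = 0.8)
rect.hclust(hc_df, k = 2, border = c("steelblue", "firebrick"))

# List the members of each cluster
print(split(as.character(clust_df$Creature), clust_df$cluster))
```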

7. Interpretation of clusters

# Keep the three variables plus the cluster label, and convert mass to kg
clust_int <- clust_df[c(2:5)]
clust_int$Mass_grams <- clust_int$Mass_grams / 1000

# Calculate the mean of each variable per cluster
# (funs() is deprecated in current dplyr, so pass mean directly)
dt <- clust_int %>% 
  group_by(cluster) %>% 
  summarise_all(mean)

colnames(dt) <- c("Cluster", "Mass (kg)", "Resting Heart Rate (BPM)", "Longevity (years)")
dt <- round(dt[, 1:4], 1)

kable(dt) %>%
  kable_styling(position = "center", full_width = F) %>%
  add_header_above(c(" " = 1, "Average" = 3))

                                  Average
Cluster   Mass (kg)   Resting Heart Rate (BPM)   Longevity (years)
1             583.2                      133.5                24.7
2          120000.0                       20.0                80.0

Cluster 1 is characterized by, on average, a smaller mass, a higher resting heart rate, and a shorter lifespan.

Cluster 2 is characterized by a much larger mass, a lower resting heart rate, and a longer lifespan. This cluster consists of only one animal, the large whale.

8. Does it make sense?

The physiological differences between the whale and the other animals are likely a product of their different habitats (water vs. land).

Therefore, the two clusters displayed through this analysis do make biological sense.

This dataset included 14 mammals and one bird. Perhaps if it included other animal groups, such as reptiles or more birds, additional clusters would emerge.