This month the /r/dataisbeautiful data visualization challenge involved a clean and simple dataset consisting of the mass, resting heart rate, and longevity of 15 animals.
Using hierarchical clustering, we can explore whether these animals group together meaningfully based on the provided variables.
library('cluster')
library('dendextend')
library("factoextra")
library('ggplot2')
library('ggrepel')
library('knitr')
library('kableExtra')
library('purrr')
library('tidyverse')
df <- read.csv("august_data_vis.csv")
df <- df %>%
  rename(
    Mass_grams = Mass..grams.,
    Heart_Rate_BPM = Resting.Heart.Rate..BPM.,
    Longevity_years = Longevity..Years.)
df
##       Creature Mass_grams Heart_Rate_BPM Longevity_years
## 1        Human      90000             60              70
## 2          Cat       2000            150              15
## 3    Small dog       2000            100              10
## 4   Medium dog       5000             90              15
## 5    Large dog       8000             75              17
## 6      Hamster         60            450               3
## 7      Chicken       1500            275              15
## 8       Monkey       5000            190              15
## 9        Horse    1200000             44              40
## 10         Cow     800000             65              22
## 11         Pig     150000             70              25
## 12      Rabbit       1000            205               9
## 13    Elephant    5000000             30              70
## 14     Giraffe     900000             65              20
## 15 Large whale  120000000             20              80
# Hierarchical Clustering
df_indexed <- data.frame(df[,-1], row.names = df[,1])
scaled <- scale(df_indexed)
dist_scaled <- dist(scaled, method = 'euclidean')
hc_df <- hclust(dist_scaled, method = 'average')
dend_df <- as.dendrogram(hc_df)
# Plot the dendrogram
plot(dend_df, main = "Dendrogram: Average Linkage, Euclidean Distance", cex.main = 0.8)
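For readers less familiar with average linkage: the height at which two branches join in the dendrogram is the mean of all pairwise distances between the members of the two branches. The sketch below checks this for the topmost merge only. It re-enters the data from the printed table so that it runs on its own, and the names `hc`, `groups`, `manual_height`, and `top_height` are illustrative rather than taken from the original analysis.

```r
# Data re-entered from the printed table above
df <- data.frame(
  Creature = c("Human", "Cat", "Small dog", "Medium dog", "Large dog",
               "Hamster", "Chicken", "Monkey", "Horse", "Cow", "Pig",
               "Rabbit", "Elephant", "Giraffe", "Large whale"),
  Mass_grams = c(90000, 2000, 2000, 5000, 8000, 60, 1500, 5000,
                 1200000, 800000, 150000, 1000, 5000000, 900000, 120000000),
  Heart_Rate_BPM = c(60, 150, 100, 90, 75, 450, 275, 190, 44, 65, 70,
                     205, 30, 65, 20),
  Longevity_years = c(70, 15, 10, 15, 17, 3, 15, 15, 40, 22, 25, 9, 70,
                      20, 80)
)

# Same preparation as in the post: scale, then Euclidean distances
scaled <- scale(data.frame(df[, -1], row.names = df[, 1]))
dist_scaled <- dist(scaled, method = "euclidean")
hc <- hclust(dist_scaled, method = "average")

# The two groups joined by the final (topmost) merge
groups <- cutree(hc, k = 2)
in_first <- groups == 1

# Mean of all pairwise distances between members of the two groups
d <- as.matrix(dist_scaled)
manual_height <- mean(d[in_first, !in_first])

# Height of the topmost merge as recorded by hclust()
top_height <- max(hc$height)

# TRUE when the two values agree up to floating-point tolerance
isTRUE(all.equal(manual_height, top_height))
```

Other linkage methods (for example `"complete"` or `"single"`) replace the mean with the maximum or minimum pairwise distance, which can change the shape of the tree noticeably on a dataset this small.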
At first glance there appears to be some meaningful grouping, but we must first decide how many clusters to use.
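# Use map_dbl to run many models with varying value of k (centers)
# Note: kmeans() coerces the dist object to a full 15 x 15 matrix, so each
# animal is described by its distances to all animals; pass `scaled` instead
# to cluster on the standardized variables directly
set.seed(42)  # k-means uses random starting centers; fix the seed for reproducibility
tot_withinss <- map_dbl(1:14, function(k){
  model <- kmeans(x = dist_scaled, centers = k)
  model$tot.withinss})
# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
  k = 1:14,
  tot_withinss = tot_withinss)
# Plot the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:14) +
  theme_bw() +
  labs(title = "Optimal Number of Clusters: Elbow Plot")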
Unfortunately, the elbow curve is smooth, so it does not point to a clear choice of k. The average silhouette width offers a second check.
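# Use map_dbl to run PAM for each k and extract the average silhouette width
# Note: pam() receives the unscaled data here, so Euclidean distances are
# dominated by mass in grams; use `scaled` to weight the variables equally
sil_width <- map_dbl(2:14, function(k){
  model <- pam(df_indexed, k = k)
  model$silinfo$avg.width})
# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
  k = 2:14,
  sil_width = sil_width)
# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:14) +
  theme_bw() +
  # linewidth replaces size for lines in ggplot2 3.4.0 and later
  geom_vline(xintercept = 2,
             color = "red", linetype = "dashed", linewidth = 1) +
  labs(title = "Optimal Number of Clusters: Silhouette Plot")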
The silhouette plot suggests that k = 2 is the most appropriate number of clusters.
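The summary code that follows works on a data frame called `clust_df`, which holds the original columns plus a `cluster` column, but its construction is not shown in the post. Here is a minimal sketch of one way to build it. It assumes the labels come from cutting the average-linkage tree into two groups; the original may instead have used the `pam()` assignments. The data is re-entered from the printed table so the block runs on its own, and `cluster_assignments` is an illustrative name.

```r
# Data re-entered from the printed table above
df <- data.frame(
  Creature = c("Human", "Cat", "Small dog", "Medium dog", "Large dog",
               "Hamster", "Chicken", "Monkey", "Horse", "Cow", "Pig",
               "Rabbit", "Elephant", "Giraffe", "Large whale"),
  Mass_grams = c(90000, 2000, 2000, 5000, 8000, 60, 1500, 5000,
                 1200000, 800000, 150000, 1000, 5000000, 900000, 120000000),
  Heart_Rate_BPM = c(60, 150, 100, 90, 75, 450, 275, 190, 44, 65, 70,
                     205, 30, 65, 20),
  Longevity_years = c(70, 15, 10, 15, 17, 3, 15, 15, 40, 22, 25, 9, 70,
                      20, 80)
)

# Rebuild the tree exactly as in the hierarchical clustering step
df_indexed <- data.frame(df[, -1], row.names = df[, 1])
hc_df <- hclust(dist(scale(df_indexed), method = "euclidean"),
                method = "average")

# Assumption: cluster labels come from cutting the tree into k = 2 groups
cluster_assignments <- cutree(hc_df, k = 2)

# Append the labels as a fifth column, so that clust_df[c(2:5)] selects
# the three measurements plus the cluster label
clust_df <- df
clust_df$cluster <- unname(cluster_assignments)

# List which animals fall into each cluster
split(clust_df$Creature, clust_df$cluster)
```

If the `pam()` assignments are preferred, `pam(df_indexed, k = 2)$clustering` returns a label vector in the same row order and can be assigned to `clust_df$cluster` in the same way.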
# Calculate the mean of each variable per cluster
# Keep the three measurements plus the cluster label (columns 2 to 5)
clust_int <- clust_df[c(2:5)]
# Convert mass from grams to kilograms
clust_int$Mass_grams <- clust_int$Mass_grams / 1000
dt <- clust_int %>%
  group_by(cluster) %>%
  summarise_all(mean)  # funs() is deprecated; pass the function directly
colnames(dt) <- c("Cluster", "Mass (kg)", "Resting Heart Rate (BPM)", "Longevity (years)")
dt <- round(dt[, 1:4], 1)
kable(dt) %>%
  kable_styling(position = "center", full_width = FALSE) %>%
  add_header_above(c(" " = 1, "Average" = 3))
| Cluster | Mass (kg) | Resting Heart Rate (BPM) | Longevity (years) |
|---|---|---|---|
| 1 | 583.2 | 133.5 | 24.7 |
| 2 | 120000.0 | 20.0 | 80.0 |
Cluster 1 is characterized by smaller average mass, higher heart rates, and shorter lifespans.
Cluster 2 is characterized by larger mass, a lower heart rate, and a longer lifespan. This cluster consists of only one animal, the large whale.
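To get a feel for how far the whale sits from the other animals, the standardized values (z-scores) and the raw spread of each variable can be inspected. This is a small illustrative sketch, not part of the original analysis. It re-enters the data from the printed table, and `z` and `raw_sd` are names chosen here.

```r
# Data re-entered from the printed table above
df <- data.frame(
  Creature = c("Human", "Cat", "Small dog", "Medium dog", "Large dog",
               "Hamster", "Chicken", "Monkey", "Horse", "Cow", "Pig",
               "Rabbit", "Elephant", "Giraffe", "Large whale"),
  Mass_grams = c(90000, 2000, 2000, 5000, 8000, 60, 1500, 5000,
                 1200000, 800000, 150000, 1000, 5000000, 900000, 120000000),
  Heart_Rate_BPM = c(60, 150, 100, 90, 75, 450, 275, 190, 44, 65, 70,
                     205, 30, 65, 20),
  Longevity_years = c(70, 15, 10, 15, 17, 3, 15, 15, 40, 22, 25, 9, 70,
                      20, 80)
)
df_indexed <- data.frame(df[, -1], row.names = df[, 1])

# z-scores: each column centered on its mean and divided by its standard deviation
z <- scale(df_indexed)

# Standardized values for the whale across the three variables
round(z["Large whale", ], 2)

# Animal with the largest standardized mass
rownames(z)[which.max(z[, "Mass_grams"])]

# Standard deviation of each variable in its raw units; a variable with a
# much larger raw spread dominates unscaled Euclidean distances
raw_sd <- sapply(df_indexed, sd)
raw_sd
```

Because mass is recorded in grams, its raw spread is orders of magnitude larger than that of heart rate or longevity, so any clustering run on unscaled values is driven almost entirely by mass.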
The physiological differences between the whale and the other animals are likely a product of their different habitats (water vs. land).
Therefore, the two clusters identified by this analysis make biological sense.
The dataset includes 14 mammals and one bird. If it included other animal groups, such as reptiles or more birds, additional clusters might emerge.