Comparative Analysis of Hierarchical Clustering and Silhouette Analysis in Market Segmentation and Dietary Pattern Recognition
Author
Saurabh C Srivastava
Published
February 14, 2025
Part 1: Hierarchical Clustering Analysis of Car Characteristics for Market Segmentation
Objective of the Analysis
To segment the car market based on the mtcars dataset using hierarchical cluster analysis, visualizing the resulting groupings with a phylogenetic dendrogram.
1. Loading Required Libraries and Exploring the mtcars Dataset
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
d <- dist(mtcars) # note the default method is Euclidean distance
h <- hclust(d)
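The two calls above rely on defaults. The following is a minimal sketch, using only base R, that spells those defaults out (Euclidean distance for dist(), complete linkage for hclust()). It also shows an optional variation that is not part of the original analysis: standardizing the columns first, so that large-valued variables such as disp and hp do not dominate the distances. The object names are illustrative.

```r
# Defaults made explicit: dist() uses Euclidean distance,
# hclust() uses complete linkage
d_default  <- dist(mtcars)
d_explicit <- dist(mtcars, method = "euclidean")
h_explicit <- hclust(d_explicit, method = "complete")

# Both distance objects hold the same values
all.equal(as.vector(d_default), as.vector(d_explicit))

# Optional variation (an assumption, not the author's method):
# standardize each column to mean 0 and sd 1 before computing distances
d_scaled <- dist(scale(mtcars))
h_scaled <- hclust(d_scaled, method = "complete")

# Cross-tabulate the four-group cut of the raw and standardized trees
table(raw = cutree(h_explicit, k = 4), scaled = cutree(h_scaled, k = 4))
```

If the cross-tabulation shows many cars changing groups, the segmentation is sensitive to variable scale, which is worth knowing before interpreting the clusters.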
4. Visualize the Hierarchical Clustering as a Phylogenetic Dendrogram
library(factoextra) # provides fviz_dend()
# Visualize the tree
fviz_dend(h, k = 4, repel = TRUE, type = "phylogenic") +
  ggtitle("Motor Trend Car Road Tests", subtitle = "Groupings") +
  labs(caption = "Saurabh's Work") +
  theme(legend.position = "none")
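The dendrogram shows the groups visually. To list which cars fall into each of the four groups, the same tree can be cut with cutree() from base R. This is a sketch that rebuilds the tree from mtcars; the names groups and group are illustrative, and the group numbers need not match the colors in the plot.

```r
# Rebuild the tree exactly as in the text (base R only)
h <- hclust(dist(mtcars))

# Cut the tree into the same number of groups used for the dendrogram (k = 4)
groups <- cutree(h, k = 4)

# Number of cars in each group
table(groups)

# Car names per group
split(names(groups), groups)

# Group means of a few variables on their original scale
aggregate(cbind(mpg, hp, wt, disp) ~ group,
          data = transform(mtcars, group = groups), FUN = mean)
```

The per-group means give a quick profile of each segment, for example whether a group is characterized by high horsepower or by high fuel efficiency.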
Part 2: Hierarchical Clustering and Silhouette Analysis of European Protein Consumption Patterns
Objective of the Analysis
The objective of this analysis is to evaluate the validity of clustering in European protein consumption patterns using silhouette analysis, which includes reporting the average silhouette width for assessing cluster quality. Additionally, a phylogenetic dendrogram is utilized to visualize hierarchical relationships among countries based on their protein consumption profiles, ensuring both quantitative validation and structured representation of clustering outcomes.
1. Load Required Libraries
library(tidyverse)
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
model_rf <- randomForest(cluster ~ ., data = mydata_df, importance = TRUE) # fit the random forest with default parameters
Warning in randomForest.default(m, y, ...): The response has five or fewer
unique values. Are you sure you want to do regression?
model_rf
Call:
randomForest(formula = cluster ~ ., data = mydata_df, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 3
Mean of squared residuals: 0.366756
% Var explained: 64.95
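The warning above appears because cluster is stored as a number, so randomForest() fits a regression. If the intent is to predict cluster membership, one option is to convert the response to a factor so that a classifier is fitted instead. The sketch below assumes cluster holds integer cluster labels. Because mydata_df is not available here, it builds a stand-in from mtcars, and the names demo_df and model_rf_cls are hypothetical.

```r
library(randomForest)

set.seed(123)  # random forests are stochastic; fix the seed for reproducibility

# Stand-in data (an assumption): mtcars plus integer labels from a four-group cut
demo_df <- mtcars
demo_df$cluster <- cutree(hclust(dist(mtcars)), k = 4)

# Converting the response to a factor makes randomForest() fit a classifier
demo_df$cluster <- as.factor(demo_df$cluster)
model_rf_cls <- randomForest(cluster ~ ., data = demo_df, importance = TRUE)

model_rf_cls$type         # "classification"
importance(model_rf_cls)  # per-variable importance measures
```

With a factor response, the printed model reports an out-of-bag error rate and a confusion matrix rather than mean squared residuals and percent variance explained.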
Creates a hierarchical dendrogram to visualize how countries are grouped based on protein consumption.
d <- dist(mydata_df) # note the default method is Euclidean distance
h <- hclust(d)
# Visualize the tree
g2 <- fviz_dend(h, k = 4, repel = TRUE, type = "phylogenic") +
  ggtitle("Hierarchical Clustering", subtitle = "Groupings") +
  labs(caption = "Saurabh's Work") +
  theme(legend.position = "none")
g2
Combines the silhouette plot and dendrogram into a single visualization.
pacman::p_load(cowplot)
cowplot::plot_grid(g1, g2)
Key Takeaways
1. Phylogenetic Tree (Motor Trend Car Road Tests)
The hierarchical clustering dendrogram groups car models by the similarity of their specifications, with the tree cut into four clusters (k = 4). Vehicles such as the AMC Javelin, Cadillac Fleetwood, and Camaro Z28 appear in distinct clusters, suggesting that cars with similar performance metrics (horsepower, weight, fuel efficiency, etc.) are grouped together. The phylogenetic layout shows how car models branch out from a common root, revealing relationships between brands and models based on their mechanical attributes.
One standout vehicle in the analysis is the Maserati Bora, a high-performance sports car. The clustering places it in a distinct group, highlighting its exceptional horsepower (335 hp, the highest in mtcars), speed, and power-to-weight ratio, which set it apart from standard sedans and fuel-efficient vehicles.
2. Silhouette Analysis for Hierarchical Clustering (Protein Consumption Data)
The silhouette analysis measures the cohesion and separation of clusters, with an average silhouette width of 0.38. This indicates that while clustering is present, some observations may not be optimally assigned. Countries such as Albania, Austria, and Belgium show varied silhouette widths, indicating differences in their dietary protein consumption patterns.
On the silhouette scale of -1 to 1, an average width of 0.38 suggests that the clusters are only moderately separated, so dietary patterns likely overlap among some countries.
Countries with similar diets are clustered together, which is consistent with regional or cultural similarities in protein consumption.
Some countries may sit on the border between clusters (silhouette widths near zero), which may reflect dietary transitions or diverse food consumption habits.
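For reference, the silhouette width of an observation i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other members of its own cluster and b(i) is the smallest mean distance from i to the members of any other cluster. The sketch below computes the average and per-cluster widths with silhouette() from the cluster package, then reproduces one value by hand. The protein data is not available here, so mtcars is used as a stand-in (an assumption); the numbers it produces are not those of the analysis above.

```r
library(cluster)  # provides silhouette()

# Stand-in data (an assumption): hierarchical clustering of mtcars, four groups
d  <- dist(mtcars)
cl <- cutree(hclust(d), k = 4)

sil <- silhouette(cl, d)

# Average silhouette width over all observations, then per cluster
mean(sil[, "sil_width"])
tapply(sil[, "sil_width"], sil[, "cluster"], mean)

# Reproduce the silhouette width of the first observation by hand
dm  <- as.matrix(d)
i   <- 1
own <- cl == cl[i]
own[i] <- FALSE                       # exclude the observation itself
a_i <- mean(dm[i, own])               # mean distance within its own cluster
b_i <- min(sapply(setdiff(unique(cl), cl[i]),
                  function(k) mean(dm[i, cl == k])))  # nearest other cluster
s_i <- (b_i - a_i) / max(a_i, b_i)

all.equal(s_i, unname(sil[i, "sil_width"]))
```

Observations with widths near zero lie between two clusters, and negative widths indicate observations that are, on average, closer to a neighboring cluster than to their own.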