Comparative Analysis of Hierarchical Clustering and Silhouette Analysis in Market Segmentation and Dietary Pattern Recognition

Author

Saurabh C Srivastava

Published

February 14, 2025

Part 1: Hierarchical Clustering Analysis of Car Characteristics for Market Segmentation

Objective of the Analysis

To segment the car market based on the mtcars dataset using hierarchical cluster analysis, visualizing the resulting groupings with a phylogenic dendrogram.

1. Loading Required Libraries and Exploring the mtcars Dataset

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(NbClust)
library(cluster)
library(factoextra)
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
data(mtcars)
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

2. Determine the Optimal Number of Clusters

NbClust(mtcars, method = "complete", index = 'hartigan')$Best.nc
Number_clusters     Value_Index 
         3.0000         34.1696 
fviz_nbclust(mtcars, pam)

3. Compute Distance Matrix & Perform Hierarchical Clustering

d <- dist(mtcars) # note the default method is euclidian distance
h <- hclust(d)

4. Visualize the Hierarchical Clustering as a Phylogenetic Dendrogram

# Visualize the tree
fviz_dend(h, k = 4, repel = TRUE,  type = "phylogenic")+       
  ggtitle("Motor Trend Car Road Tests", subtitle = "Groupings")+
  labs( caption = "Saurabh's Work") +
  theme(legend.position = "NULL")

Part 2: Hierarchical Clustering and Silhouette Analysis of European Protein Consumption Patterns

Objective of the Analysis

The objective of this analysis is to evaluate the validity of clustering in European protein consumption patterns using silhouette analysis, which includes reporting the average silhouette width for assessing cluster quality. Additionally, a phylogenetic dendrogram is utilized to visualize hierarchical relationships among countries based on their protein consumption profiles, ensuring both quantitative validation and structured representation of clustering outcomes.

1. Load Required Libraries

library(tidyverse)
library(janitor)

Attaching package: 'janitor'
The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(NbClust)
library(cluster)
library(factoextra)
library(flexclust)
Loading required package: grid
Loading required package: lattice
Loading required package: modeltools
Loading required package: stats4
library(FeatureImpCluster)
Loading required package: data.table

Attaching package: 'data.table'
The following objects are masked from 'package:lubridate':

    hour, isoweek, mday, minute, month, quarter, second, wday, week,
    yday, year
The following objects are masked from 'package:dplyr':

    between, first, last
The following object is masked from 'package:purrr':

    transpose
library(ClusterR)
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'
The following object is masked from 'package:dplyr':

    combine
The following object is masked from 'package:ggplot2':

    margin
library(caret)

Attaching package: 'caret'
The following object is masked from 'package:purrr':

    lift
library(ggplot2)
library(cowplot)

Attaching package: 'cowplot'
The following object is masked from 'package:lubridate':

    stamp
library(pacman)

2. Data Preprocessing

Uses janitor::clean_names() to format column names in CamelCase.

mydata = read.csv("protein-2.csv", header = TRUE) %>% 
  janitor::clean_names(case = "upper_camel")
head(mydata, 3)
  Country RedMeat WhiteMeat Eggs Milk Fish Cereals Starch Nuts FrVeg
1 Albania    10.1       1.4  0.5  8.9  0.2    42.3    0.6  5.5   1.7
2 Austria     8.9      14.0  4.3 19.9  2.1    28.0    3.6  1.3   4.3
3 Belgium    13.5       9.3  4.1 17.5  4.5    26.6    5.7  2.1   4.0

Stores country names separately and converts the dataframe so that countries become row names.

country = mydata$Country # To be used in plot 'RedMeat' and 'WhiteMeat'
mydata_df = mydata %>% column_to_rownames("Country")
head(mydata_df,3)
        RedMeat WhiteMeat Eggs Milk Fish Cereals Starch Nuts FrVeg
Albania    10.1       1.4  0.5  8.9  0.2    42.3    0.6  5.5   1.7
Austria     8.9      14.0  4.3 19.9  2.1    28.0    3.6  1.3   4.3
Belgium    13.5       9.3  4.1 17.5  4.5    26.6    5.7  2.1   4.0

3. Determining the Optimal Number of Clusters

# k-means
NbClust(mydata_df, index = "hartigan",method = "complete")$Best.nc
Number_clusters     Value_Index 
         3.0000         11.3132 

4. K-Means Clustering & Feature Importance Analysis

#Feature Importance: Review #
# Which features define individual clusters?
res <- kcca(mydata_df,k=3,family=kccaFamily("kmeans") )
FeatureImp_res <- FeatureImpCluster(res,as.data.table(mydata_df))
plot(FeatureImp_res)

barplot(res)

5. K-Means Clustering on the Dataset

# Fitting K-Means clustering Model
# to training dataset
set.seed(42) # Setting seed
head(mydata_df)
               RedMeat WhiteMeat Eggs Milk Fish Cereals Starch Nuts FrVeg
Albania           10.1       1.4  0.5  8.9  0.2    42.3    0.6  5.5   1.7
Austria            8.9      14.0  4.3 19.9  2.1    28.0    3.6  1.3   4.3
Belgium           13.5       9.3  4.1 17.5  4.5    26.6    5.7  2.1   4.0
Bulgaria           7.8       6.0  1.6  8.3  1.2    56.7    1.1  3.7   4.2
Czechoslovakia     9.7      11.4  2.8 12.5  2.0    34.3    5.0  1.1   4.0
Denmark           10.6      10.8  3.7 25.0  9.9    21.9    4.8  0.7   2.4
kmeans.re <- kmeans(mydata_df, 4)
kmeans.re$cluster |> sort()
     E Germany       Portugal          Spain        Austria        Belgium 
             1              1              1              2              2 
       Denmark        Finland         France        Ireland    Netherlands 
             2              2              2              2              2 
        Norway         Sweden    Switzerland             UK      W Germany 
             2              2              2              2              2 
      Bulgaria        Romania     Yugoslavia        Albania Czechoslovakia 
             3              3              3              4              4 
        Greece        Hungary          Italy         Poland           USSR 
             4              4              4              4              4 

6. Silhouette Analysis for Cluster Validation

Computes silhouette scores to evaluate how well data points fit in their assigned clusters.

D <- dist(mydata_df)
s_mod1 = silhouette(kmeans.re$cluster, D) # from the package cluster
s_mod1
      cluster neighbor  sil_width
 [1,]       4        3 0.03874373
 [2,]       2        4 0.31463165
 [3,]       2        1 0.30414085
 [4,]       3        4 0.66877444
 [5,]       4        1 0.19339910
 [6,]       2        1 0.51099005
 [7,]       1        2 0.06315997
 [8,]       2        4 0.41300874
 [9,]       2        1 0.35467967
[10,]       4        3 0.31323457
[11,]       4        3 0.23973894
[12,]       2        1 0.58333546
[13,]       4        1 0.36329940
[14,]       2        1 0.52404430
[15,]       2        1 0.39874775
[16,]       4        2 0.26109731
[17,]       1        4 0.44094417
[18,]       3        4 0.43374158
[19,]       1        4 0.32672060
[20,]       2        1 0.51339649
[21,]       2        4 0.54537799
[22,]       2        1 0.43109847
[23,]       4        3 0.23710861
[24,]       2        1 0.38298762
[25,]       3        4 0.69680448
attr(,"Ordered")
[1] FALSE
attr(,"call")
silhouette.default(x = kmeans.re$cluster, dist = D)
attr(,"class")
[1] "silhouette"

7. Feature Importance Using Random Forest

# Random Forests for Variable Importance & Partial Plots
set.seed(42)
mydata_df$cluster = kmeans.re$cluster
names(mydata_df)
 [1] "RedMeat"   "WhiteMeat" "Eggs"      "Milk"      "Fish"      "Cereals"  
 [7] "Starch"    "Nuts"      "FrVeg"     "cluster"  
model_rf =  randomForest(cluster ~ ., data = mydata_df, importance=TRUE) # fit the random forest with default parameter
Warning in randomForest.default(m, y, ...): The response has five or fewer
unique values.  Are you sure you want to do regression?
model_rf

Call:
 randomForest(formula = cluster ~ ., data = mydata_df, importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 3

          Mean of squared residuals: 0.366756
                    % Var explained: 64.95
caret::varImp((model_rf))
             Overall
RedMeat    4.9794031
WhiteMeat -2.2005975
Eggs       8.6080141
Milk       4.2762591
Fish       6.2456921
Cereals   18.1708299
Starch     2.6437096
Nuts      -1.1677532
FrVeg      0.4564572
c1= caret::varImp((model_rf))[1] %>% rownames_to_column("Country")
c1 %>% arrange(desc(c1[,2]))
    Country    Overall
1   Cereals 18.1708299
2      Eggs  8.6080141
3      Fish  6.2456921
4   RedMeat  4.9794031
5      Milk  4.2762591
6    Starch  2.6437096
7     FrVeg  0.4564572
8      Nuts -1.1677532
9 WhiteMeat -2.2005975

varImpPlot() visualizes feature importance.

varImpPlot(model_rf, type = 1)

Partial dependence plots to analyze how individual food categories influence clustering.

mydata_df %>% rownames_to_column("Country") %>%
  filter(Country == "W Germany")
    Country RedMeat WhiteMeat Eggs Milk Fish Cereals Starch Nuts FrVeg cluster
1 W Germany    11.4      12.5  4.1 18.8  3.4    18.6    5.2  1.5   3.8       2
head(mydata_df)                
               RedMeat WhiteMeat Eggs Milk Fish Cereals Starch Nuts FrVeg
Albania           10.1       1.4  0.5  8.9  0.2    42.3    0.6  5.5   1.7
Austria            8.9      14.0  4.3 19.9  2.1    28.0    3.6  1.3   4.3
Belgium           13.5       9.3  4.1 17.5  4.5    26.6    5.7  2.1   4.0
Bulgaria           7.8       6.0  1.6  8.3  1.2    56.7    1.1  3.7   4.2
Czechoslovakia     9.7      11.4  2.8 12.5  2.0    34.3    5.0  1.1   4.0
Denmark           10.6      10.8  3.7 25.0  9.9    21.9    4.8  0.7   2.4
               cluster
Albania              4
Austria              2
Belgium              2
Bulgaria             3
Czechoslovakia       4
Denmark              2
partialPlot(model_rf, mydata_df, RedMeat)

partialPlot(model_rf, mydata_df, WhiteMeat)

partialPlot(model_rf, mydata_df, Cereals)

partialPlot(model_rf, mydata_df, Fish)

8. Visualizing Clustering Results

Silhouette plot shows the quality of clustering, indicating how well each country fits into its assigned cluster.

myrf = varImp(model_rf)
avg_sil_width <- mean(fviz_silhouette(s_mod1, show.labels = TRUE)$data$sil_width)
  cluster size ave.sil.width
1       1    3          0.28
2       2   12          0.44
3       3    3          0.60
4       4    7          0.24
g1 = fviz_silhouette(s_mod1, show.labels = T) + coord_flip() +
     geom_text(aes(label = rownames(mydata_df)), col = "black") +
     theme(legend.position = "none")
  cluster size ave.sil.width
1       1    3          0.28
2       2   12          0.44
3       3    3          0.60
4       4    7          0.24
g1

Creates a hierarchical dendrogram to visualize how countries are grouped based on protein consumption.

d <- dist(mydata_df) # note the default method is euclidian distance
h <- hclust(d) 

# Visualize the tree
g2 = fviz_dend(h, k = 4, repel = TRUE,  type = "phylogenic")+       
     ggtitle("Hierarchical Clustering", subtitle = "Groupings")+
     labs( caption = "Saurabh's Work") +
     theme(legend.position = "NULL")

g2

Combines the silhouette plot and dendrogram into a single visualization.

pacman::p_load(cowplot)
cowplot::plot_grid(g1,g2)

Key Takeaways

1. Phylogenetic Tree (Motor Trend Car Road Tests)

The hierarchical clustering dendrogram effectively groups different car models based on their similarities in specifications, forming clear clusters. Vehicles such as AMC Javelin, Cadillac Fleetwood, and Camaro Z28 appear in distinct clusters, suggesting that cars with similar performance metrics (horsepower, weight, fuel efficiency, etc.) are grouped together. The phylogenetic structure shows how car models branch out from a common hierarchy, revealing relationships between different brands and models based on mechanical attributes.

One standout vehicle in the analysis is the Maserati Bora, a high-performance sports car. The clustering places it in a distinct group, highlighting its exceptional horsepower, speed, and weight-to-power ratio, which set it apart from standard sedans and fuel-efficient vehicles.

2. Silhouette Analysis for Hierarchical Clustering (Protein Consumption Data)

The silhouette analysis measures the cohesion and separation of clusters, with an average silhouette width of 0.38. This indicates that while clustering is present, some observations may not be optimally assigned. Countries such as Albania, Austria, and Belgium show varied silhouette widths, indicating differences in their dietary protein consumption patterns.

The average silhouette width of 0.38 suggests that clusters are moderately well-separated, but there could be overlapping dietary patterns among some countries.

Countries with similar diets are clustered together, confirming regional or cultural similarities in protein consumption.

Some countries may be borderline between clusters, indicating dietary transitions or diverse food consumption habits.