Introduction

Fast food has become an integral part of modern society, offering convenient, affordable and tasty meal options. With the rapid growth of the fast-food industry, consumers are now faced with an overwhelming variety of menu items from different brands, each varying in nutritional content, ingredients and portion sizes. In this project, I will conduct a clustering analysis on a dataset featuring nutritional information from a variety of fast-food items at Pizza Hut resaturants. The goal is to uncover significant patterns and gain deeper insights into the nutritional characteristics of fast-food offerings.

Clustering algorithms

Clustering in general is an unsupervised machine learning technique used to group data points into distinct categories or clusters based on their similarities. It helps identify hidden patterns, relationships or structures within a dataset without prior knowledge of the group labels. By measuring the similarity or distance between data points clustering creates meaningful groups where items within the same cluster are more alike than those in different clusters.

Project overview

The dataset used for the project is “Fast Food Nutrition” from Kaggle (https://www.kaggle.com/datasets/joebeachcapital/fast-food). It focuses on nutritional values, including calories and micro-nutrients from six of the largest and most popular fast food restaurants: McDonald’s, Burger King, Wendy’s, Kentucky Fried Chicken (KFC), Taco Bell and Pizza Hut. Despite the dataset covering six companies, I chose to focus only on Pizza Hut, as hierarchical clustering with data from all companies would be overly complex and visually unintelligible. Attributes included are calories, calories from fat, total fat, saturated fat, trans fat, cholesterol, sodium, carbs, fiber, sugars, protein, and weight watchers points (where available).

Preprocessing

First of all, the file needs to be read and inspected.

food <- read.csv("FastFoodNutritionMenuV3.csv")
food <- food[1074:1147, c(1:3, 5:13)]
head(food,1)
##        Company                              Item Calories Total.Fat..g.
## 1074 Pizza Hut Detroit Double Cheesy Pizza Slice      280            12
##      Saturated.Fat..g. Trans.Fat..g. Cholesterol..mg. Sodium...mg. Carbs..g.
## 1074                 6             0               30          560        31
##      Fiber..g. Sugars..g. Protein..g.
## 1074         2          2          13
str(food)
## 'data.frame':    74 obs. of  12 variables:
##  $ Company          : chr  "Pizza Hut" "Pizza Hut" "Pizza Hut" "Pizza Hut" ...
##  $ Item             : chr  "Detroit Double Cheesy Pizza Slice" "Detroit Double Pepperoni Pizza Slice" "Detroit Meaty Pizza Slice" "Detroit Supremo Pizza Slice" ...
##  $ Calories         : chr  "280" "330" "320" "290" ...
##  $ Total.Fat..g.    : chr  "12" "17" "16" "13" ...
##  $ Saturated.Fat..g.: chr  "6" "7" "6" "6" ...
##  $ Trans.Fat..g.    : chr  "0" "0" "0" "0" ...
##  $ Cholesterol..mg. : chr  "30" "40" "35" "30" ...
##  $ Sodium...mg.     : chr  "560" "720" "640" "570" ...
##  $ Carbs..g.        : chr  "31" "30" "31" "31" ...
##  $ Fiber..g.        : chr  "2" "2" "2" "2" ...
##  $ Sugars..g.       : chr  "2" "2" "2" "2" ...
##  $ Protein..g.      : chr  "13" "14" "14" "13" ...
colnames(food) <- c("Brand","Item", "V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10")
library(corrplot)
## Warning: pakiet 'corrplot' został zbudowany w wersji R 4.4.2
## corrplot 0.95 loaded
food_data = food[,-c(1, 2)]
str(food_data)
## 'data.frame':    74 obs. of  10 variables:
##  $ V1 : chr  "280" "330" "320" "290" ...
##  $ V2 : chr  "12" "17" "16" "13" ...
##  $ V3 : chr  "6" "7" "6" "6" ...
##  $ V4 : chr  "0" "0" "0" "0" ...
##  $ V5 : chr  "30" "40" "35" "30" ...
##  $ V6 : chr  "560" "720" "640" "570" ...
##  $ V7 : chr  "31" "30" "31" "31" ...
##  $ V8 : chr  "2" "2" "2" "2" ...
##  $ V9 : chr  "2" "2" "2" "2" ...
##  $ V10: chr  "13" "14" "14" "13" ...

The structure of dataset show the “character” variables, it should be “numeric”, so it has to be fixed.

food_data$V1 <- as.numeric(gsub("[^0-9.]", "", food_data$V1))
food_data$V2 <- as.numeric(gsub("[^0-9.]", "", food_data$V2))
food_data$V3 <- as.numeric(gsub("[^0-9.]", "", food_data$V3))
food_data$V4 <- as.numeric(gsub("[^0-9.]", "", food_data$V4))
food_data$V5 <- as.numeric(gsub("[^0-9.]", "", food_data$V5))
food_data$V6 <- as.numeric(gsub("[^0-9.]", "", food_data$V6))
food_data$V7 <- as.numeric(gsub("[^0-9.]", "", food_data$V7))
food_data$V8 <- as.numeric(gsub("[^0-9.]", "", food_data$V8))
food_data$V9 <- as.numeric(gsub("[^0-9.]", "", food_data$V9))
food_data$V10 <- as.numeric(gsub("[^0-9.]", "", food_data$V10))
str(food_data)
## 'data.frame':    74 obs. of  10 variables:
##  $ V1 : num  280 330 320 290 180 270 380 240 350 160 ...
##  $ V2 : num  12 17 16 13 6 10 16 10 16 5 ...
##  $ V3 : num  6 7 6 6 2 4 6 4.5 6 2 ...
##  $ V4 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ V5 : num  30 40 35 30 15 25 35 15 25 15 ...
##  $ V6 : num  560 720 640 570 370 450 650 470 680 570 ...
##  $ V7 : num  31 30 31 31 25 33 43 29 38 22 ...
##  $ V8 : num  2 2 2 2 1 2 2 2 3 1 ...
##  $ V9 : num  2 2 2 2 8 6 7 1 2 2 ...
##  $ V10: num  13 14 14 13 8 11 16 10 14 7 ...

Since the data is provided in different units, it needs to be standarized to ensure that the results obtained can be considered reliable.

food_data_scaled <- scale(food_data)
food_data_scaled <- na.omit(food_data_scaled)

The next step involves analyzing the relationships between the variables.

food_matrix <- data.matrix(food_data_scaled, rownames.force = NA)
F <- cor(food_matrix)
corrplot(F, method = "number", number.cex = 0.75, order="hclust")

According to the correlation matrix, the data mostly seem to be highly correlated.

Test for clusterability of data - Hopkins statistics

The Hopkins statistic is a valuable tool in the context of clustering, as it helps to assess the clustering tendency of a dataset before applying any clustering algorithms.In simple words it tells how well the data can be clustered. Hopkins statistic close to 0 indicates that the data is highly random and not suitable for clustering. A value close to 0.5 suggests that the data is likely randomly distributed, with no clear clustering structure. In contrast, a value near 1 indicates a strong clustering tendency, meaning the data points are more likely to be grouped together, making it suitable for clustering analysis.

library(hopkins)
## Warning: pakiet 'hopkins' został zbudowany w wersji R 4.4.2
library(factoextra)
## Warning: pakiet 'factoextra' został zbudowany w wersji R 4.4.2
## Ładowanie wymaganego pakietu: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
data <- food_data_scaled
get_clust_tendency(data, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"))
## $hopkins_stat
## [1] 0.9477007
## 
## $plot

The obtained result indicates a high tendency of the data to cluster.

Optimal number of clusters

After confirming the clustering tendency of the dataset using, the next crucial step is to determine the optimal number of clusters.

Silhouette analysis can be used to evaluate how well-separated the resulting clusters are from each other. It measures the distance between data points within the same cluster and compares it to the distance to points in neighbouring clusters, providing insight into the quality of the clustering structure.The silhouette score ranges from -1 to 1. Value close to 1 indicates that the data points are well-clustered, value close to 0 indicates the data points are on or near the decision boundary between clusters and value close to -1 indicates that the data points might be misclassified into incorrect clusters.

library(gridExtra)
a <- fviz_nbclust(data, FUNcluster = kmeans, method = "silhouette") + theme_classic() 
b <- fviz_nbclust(data, FUNcluster = cluster::pam, method = "silhouette") + theme_classic() 
c <- fviz_nbclust(data, FUNcluster = cluster::clara, method = "silhouette") + theme_classic() 
d <- fviz_nbclust(data, FUNcluster = hcut, method = "silhouette") + theme_classic()
grid.arrange(a, b, c, d, ncol=2)

The silhouette index indicates that the most optimal number of clusters for k-means is either 2 or 3, as the values are close and it is challenging to clearly determine the better option. For PAM, CLARA and hierarchical clustering it’s 2.

After evaluating the optimal number of clusters using the silhouette statistic, the WSS (within-cluster sum of squares) statistic will be applied to verify if it provides the same results or reveals slight differences.

library(cluster)
a <- fviz_nbclust(data, kmeans, method = "wss") + ggtitle("k-means")
b <- fviz_nbclust(data, pam, method = "wss") + ggtitle("pam")
c <- fviz_nbclust(data, clara, method = "wss") +ggtitle("clara")
d <- fviz_nbclust(data, hcut, method = "wss") +ggtitle("hierarchical clustering")
grid.arrange(a,b,c,d, ncol=2, top = "Optimal number of clusters")

Based on both the silhouette and WSS statistics, the optimal number of clusters is consistently identified as 2 for all clustering methods.

Additionaly, the clValid function in R is used for validating clustering methods, helping to identify the most suitable one for a given dataset. This function evaluates clustering quality using metrics such as the Silhouette width and Dunn’s Index.Both indices help in assessing the performance of clustering methods.

food_numeric <- food_data[, -c(1, 2)]
library(clValid)
## Warning: pakiet 'clValid' został zbudowany w wersji R 4.4.2
clmethods <- c("hierarchical","kmeans","pam","clara")
internal <- clValid(food_data, nClust = 2:30, clMethods = clmethods, validation = "internal", maxitems = 100000)
summary(internal)
## 
## Clustering Methods:
##  hierarchical kmeans pam clara 
## 
## Cluster sizes:
##  2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 
## 
## Validation Measures:
##                                   2        3        4        5        6        7        8        9       10       11       12       13       14       15       16       17       18       19       20       21       22       23       24       25       26       27       28       29       30
##                                                                                                                                                                                                                                                                                                
## hierarchical Connectivity    3.1060   6.8893  11.8401  18.3353  20.8115  24.9155  26.2250  31.4560  37.9587  40.6972  43.1972  44.6972  48.8579  51.9119  56.0921  60.3631  61.8131  63.8131  66.5877  68.5877  72.0829  79.0567  84.0663  90.8710  97.1698  98.7532 103.0365 109.3690 111.7619
##              Dunn            0.1065   0.0661   0.1053   0.1417   0.1417   0.1432   0.1432   0.1501   0.1501   0.2467   0.2467   0.2467   0.2467   0.2467   0.2918   0.3599   0.3625   0.3625   0.3625   0.3625   0.3557   0.4418   0.5194   0.4862   0.4814   0.4814   0.4814   0.5329   0.5329
##              Silhouette      0.5213   0.5074   0.4924   0.4860   0.4696   0.4761   0.4908   0.4648   0.4532   0.4642   0.4568   0.4458   0.4636   0.4531   0.4350   0.4533   0.4488   0.4408   0.4268   0.4113   0.3812   0.3830   0.3775   0.3758   0.3630   0.3543   0.3554   0.3566   0.3363
## kmeans       Connectivity    4.4246  10.8897  15.8405  18.3393  20.8155  24.9194  28.1067  31.6893  38.1921  41.1083  43.6083  45.1083  49.2690  52.1980  61.5401  66.1155  70.8488  72.8488  75.6234  77.6234  83.9806  87.0298  88.7560  95.5607 101.8595 103.4429 107.7262 110.4619 113.6881
##              Dunn            0.0602   0.0432   0.0767   0.1286   0.1286   0.1354   0.1375   0.1701   0.2283   0.2260   0.2429   0.2429   0.2730   0.2867   0.2955   0.2370   0.2404   0.2404   0.2492   0.2492   0.1676   0.2122   0.2357   0.2701   0.2704   0.2704   0.2704   0.2704   0.2704
##              Silhouette      0.5640   0.5103   0.5055   0.5002   0.4832   0.4869   0.5011   0.4911   0.4765   0.4795   0.4721   0.4611   0.4754   0.4672   0.4447   0.4560   0.4536   0.4456   0.4234   0.4071   0.3887   0.3922   0.3846   0.3829   0.3701   0.3614   0.3624   0.3556   0.3489
## pam          Connectivity    4.8508  11.7278  14.6056  19.5563  20.0448  28.3151  35.1619  39.3861  41.5290  44.0052  45.5230  51.9258  59.6972  67.0401  73.3389  74.8389  78.8190  81.3190  83.8190  88.7583  90.7583  92.2083  97.3607  99.3607 102.1353 108.9008 112.3103 115.3710 118.2210
##              Dunn            0.0621   0.0388   0.0433   0.0902   0.1404   0.0925   0.0974   0.0974   0.0994   0.1124   0.1625   0.1625   0.1625   0.1625   0.1625   0.1625   0.1625   0.2115   0.2370   0.3142   0.3142   0.3678   0.4707   0.4707   0.4776   0.3687   0.3687   0.3687   0.3687
##              Silhouette      0.5639   0.4899   0.4580   0.4917   0.4930   0.4646   0.4463   0.4491   0.4832   0.4741   0.4750   0.4841   0.4579   0.4420   0.4080   0.4001   0.3979   0.3905   0.3974   0.3786   0.3706   0.3734   0.3787   0.3629   0.3650   0.3502   0.3544   0.3490   0.3258
## clara        Connectivity    4.4246  11.7278  14.6056  18.3393  20.0448  28.3151  38.8115  35.5909  41.4250  42.3528  45.5230  51.9258  59.6972  66.9044  73.3389  77.2520  81.6044  85.5845  88.0845  90.5845  95.5238  97.5238  98.9738 104.1262 106.9008 108.9008 112.3103 115.3710 118.2210
##              Dunn            0.0602   0.0388   0.0433   0.1286   0.1404   0.0925   0.0651   0.1322   0.0994   0.1124   0.1625   0.1625   0.1625   0.1625   0.1625   0.1625   0.1625   0.1625   0.2115   0.2370   0.2426   0.2426   0.2839   0.3634   0.3677   0.3687   0.3687   0.3687   0.3687
##              Silhouette      0.5640   0.4899   0.4580   0.5002   0.4930   0.4646   0.4301   0.4901   0.4853   0.4807   0.4750   0.4841   0.4579   0.4430   0.4080   0.3759   0.3820   0.3830   0.3756   0.3824   0.3637   0.3557   0.3584   0.3638   0.3658   0.3502   0.3544   0.3490   0.3258
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 3.1060 hierarchical 2       
## Dunn         0.5329 hierarchical 29      
## Silhouette   0.5640 kmeans       2

K-means

The k-means algorithm is a popular clustering method that partitions data into a predefined number of clusters by minimizing the variance within each cluster. It assigns each data point to the nearest centroid, then recalculates centroids iteratively until convergence.

2 clusters

library(factoextra)
km2 <- eclust(data, k=2 , FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
c2 <- fviz_cluster(km2, data=food_data, elipse.type="convex", geom=c("point")) + ggtitle("K-means")
s2 <- fviz_silhouette(km2)
##   cluster size ave.sil.width
## 1       1   37          0.35
## 2       2   37          0.50
grid.arrange(c2, s2, ncol=2)

3 clusters

Just to be sure, let’s check 3 clusters.

km3 <- eclust(data, k=3 , FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
c3 <- fviz_cluster(km3, data=food_data, elipse.type="convex", geom=c("point")) + ggtitle("K-means")
s3 <- fviz_silhouette(km3)
##   cluster size ave.sil.width
## 1       1   29          0.30
## 2       2   23          0.53
## 3       3   22          0.21
grid.arrange(c3, s3, ncol=2)

It seems that the 2-cluster solution offers better-defined clusters, as indicated by the higher Silhouette score. 3 clusters may lead to overfitting or misclassification.

PAM

The Partitioning Around Medoids (PAM) method is a clustering algorithm that aims to divide data into clusters by selecting representative data points, known as medoids, as the centers of each cluster. Unlike k-means, PAM is more robust to outliers since it uses actual data points as cluster centers instead of computing centroids.

pam2 <- eclust(data, k=2 , FUNcluster="pam", hc_metric="euclidean", graph=FALSE)
cp2 <- fviz_cluster(pam2, data=DATA, elipse.type="convex", geom=c("point")) + ggtitle("PAM")
sp2 <- fviz_silhouette(pam2)
##   cluster size ave.sil.width
## 1       1   46          0.31
## 2       2   28          0.57
grid.arrange(cp2, sp2, ncol=2)

The Silhouette scores for both K-means and PAM are quite close, indicating that both methods performed similarly in terms of how well the points are grouped within their clusters and separated from others.

CLARA

CLARA is a clustering method designed for large datasets, where it selects multiple samples, applies the k-medoids algorithm to each, and uses the best result to handle data efficiently while maintaining accuracy.

clara2 <- eclust(data, k=2, FUNcluster="clara", hc_metric="euclidean", graph=FALSE)
cc2 <- fviz_cluster(clara2, data=DATA, elipse.type="convex", geom=c("point")) + ggtitle("CLARA")
sc2 <- fviz_silhouette(clara2)
##   cluster size ave.sil.width
## 1       1   44          0.33
## 2       2   30          0.54
grid.arrange(cc2, sc2, ncol=2)

Hierarchical clustering

Hierarchical clustering is a method of grouping data into a hierarchy of clusters, either by successively merging smaller clusters (agglomerative) or dividing larger clusters (divisive).

Agglomerative approach

Single linkeage:

hc <- eclust(data, k=2, FUNcluster="hclust", hc_metric="euclidean", hc_method = "single")
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
plot(hc, cex=0.6, hang=-1, main = "Dendrogram of agglomerative hierarchical clustering")
rect.hclust(hc, k=2)

The numbers are displayed on the graph instead of the food names because the food names were too long, making the graph unreadable.

Complete linkeage:

hc1 <- eclust(data, k=2, FUNcluster="hclust", hc_metric="euclidean", hc_method = "complete")
plot(hc1, cex=0.6, hang=-1, main = "Dendrogram of agglomerative hierarchical clustering")
rect.hclust(hc1, k=2)

Average linkeage:

hc2 <- eclust(data, k=2, FUNcluster="hclust", hc_metric="euclidean", hc_method = "average")
plot(hc2, cex=0.6, hang=-1, main = "Dendrogram of agglomerative hierarchical clustering")
rect.hclust(hc2, k=2)

Divisive approach

hc3 <- eclust(data, k=2, FUNcluster="diana")
pltree(hc3, cex = 0.6, hang = -1, main = "Dendrogram of DIANA")
rect.hclust(hc3, k=2)

Next, there will be assessed the quality of the obtained divisions by calculating the inertia ratio.

inertion_hclust <- matrix(0, nrow = 4, ncol = 4)
colnames(inertion_hclust) <- c("Single Linkage", "Complete Linkage", "Average Linkage", "Divisive")
rownames(inertion_hclust) <- c("Intra-clust", "Total", "Percentage", "Q")

Single Linkage

library(ClustGeo)
## Warning: pakiet 'ClustGeo' został zbudowany w wersji R 4.4.2
hclust_single <- cutree(hc, k = 2) 
inertion_hclust[1, 1] <- withindiss(dist(data), part = hclust_single)  
inertion_hclust[2, 1] <- inertdiss(dist(data))                        
inertion_hclust[3, 1] <- inertion_hclust[1, 1] / inertion_hclust[2, 1] 
inertion_hclust[4, 1] <- 1 - inertion_hclust[3, 1]                     

Complete Linkage

hclust_complete <- cutree(hc1, k = 2) 
inertion_hclust[1, 2] <- withindiss(dist(data), part = hclust_complete)  
inertion_hclust[2, 2] <- inertdiss(dist(data))                            
inertion_hclust[3, 2] <- inertion_hclust[1, 2] / inertion_hclust[2, 2]   
inertion_hclust[4, 2] <- 1 - inertion_hclust[3, 2]                       

Average Linkage

hclust_average <- cutree(hc2, k = 2)  
inertion_hclust[1, 3] <- withindiss(dist(data), part = hclust_average)  
inertion_hclust[2, 3] <- inertdiss(dist(data))                         
inertion_hclust[3, 3] <- inertion_hclust[1, 3] / inertion_hclust[2, 3] 
inertion_hclust[4, 3] <- 1 - inertion_hclust[3, 3]                    

Divisive (DIANA)

diana_clusters <- cutree(hc3, k = 2) 
inertion_hclust[1, 4] <- withindiss(dist(data), part = diana_clusters)  
inertion_hclust[2, 4] <- inertdiss(dist(data))                         
inertion_hclust[3, 4] <- inertion_hclust[1, 4] / inertion_hclust[2, 4] 
inertion_hclust[4, 4] <- 1 - inertion_hclust[3, 4]                     

inertion_hclust
##             Single Linkage Complete Linkage Average Linkage  Divisive
## Intra-clust      8.3790277        8.3790277       8.3790277 8.3790277
## Total            9.8648649        9.8648649       9.8648649 9.8648649
## Percentage       0.8493809        0.8493809       0.8493809 0.8493809
## Q                0.1506191        0.1506191       0.1506191 0.1506191

The results for all methods are identical what suggests that the data might not exhibit strong enough differences for the methods to produce varying results.The dataset might not contain enough distinct clusters or variation, leading to similar results regardless of the method.Other explanation can be that the chosen number of clusters might not be optimal for distinguishing between different patterns in the data. After checking behind the project the values for different numbers of clusters, such as 3, the results varied, and the best method turned out to be single linkage for that case.

Coclusions

In summary, the text data were clustered using various methods. Preprocessing played a crucial role, and multiple datasets were generated to achieve improved clustering outcomes. After evaluating the clustering methods using various metrics, it was found that k-means with two clusters produced the best results overall.