Fast food has become an integral part of modern society, offering convenient, affordable and tasty meal options. With the rapid growth of the fast-food industry, consumers are now faced with an overwhelming variety of menu items from different brands, each varying in nutritional content, ingredients and portion sizes. In this project, I will conduct a clustering analysis on a dataset featuring nutritional information from a variety of fast-food items at Pizza Hut resaturants. The goal is to uncover significant patterns and gain deeper insights into the nutritional characteristics of fast-food offerings.
Clustering in general is an unsupervised machine learning technique used to group data points into distinct categories or clusters based on their similarities. It helps identify hidden patterns, relationships or structures within a dataset without prior knowledge of the group labels. By measuring the similarity or distance between data points clustering creates meaningful groups where items within the same cluster are more alike than those in different clusters.
The dataset used for the project is “Fast Food Nutrition” from Kaggle (https://www.kaggle.com/datasets/joebeachcapital/fast-food). It focuses on nutritional values, including calories and micro-nutrients from six of the largest and most popular fast food restaurants: McDonald’s, Burger King, Wendy’s, Kentucky Fried Chicken (KFC), Taco Bell and Pizza Hut. Despite the dataset covering six companies, I chose to focus only on Pizza Hut, as hierarchical clustering with data from all companies would be overly complex and visually unintelligible. Attributes included are calories, calories from fat, total fat, saturated fat, trans fat, cholesterol, sodium, carbs, fiber, sugars, protein, and weight watchers points (where available).
First of all, the file needs to be read and inspected.
food <- read.csv("FastFoodNutritionMenuV3.csv")
food <- food[1074:1147, c(1:3, 5:13)]
head(food,1)
## Company Item Calories Total.Fat..g.
## 1074 Pizza Hut Detroit Double Cheesy Pizza Slice 280 12
## Saturated.Fat..g. Trans.Fat..g. Cholesterol..mg. Sodium...mg. Carbs..g.
## 1074 6 0 30 560 31
## Fiber..g. Sugars..g. Protein..g.
## 1074 2 2 13
str(food)
## 'data.frame': 74 obs. of 12 variables:
## $ Company : chr "Pizza Hut" "Pizza Hut" "Pizza Hut" "Pizza Hut" ...
## $ Item : chr "Detroit Double Cheesy Pizza Slice" "Detroit Double Pepperoni Pizza Slice" "Detroit Meaty Pizza Slice" "Detroit Supremo Pizza Slice" ...
## $ Calories : chr "280" "330" "320" "290" ...
## $ Total.Fat..g. : chr "12" "17" "16" "13" ...
## $ Saturated.Fat..g.: chr "6" "7" "6" "6" ...
## $ Trans.Fat..g. : chr "0" "0" "0" "0" ...
## $ Cholesterol..mg. : chr "30" "40" "35" "30" ...
## $ Sodium...mg. : chr "560" "720" "640" "570" ...
## $ Carbs..g. : chr "31" "30" "31" "31" ...
## $ Fiber..g. : chr "2" "2" "2" "2" ...
## $ Sugars..g. : chr "2" "2" "2" "2" ...
## $ Protein..g. : chr "13" "14" "14" "13" ...
colnames(food) <- c("Brand","Item", "V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10")
library(corrplot)
## Warning: pakiet 'corrplot' został zbudowany w wersji R 4.4.2
## corrplot 0.95 loaded
food_data = food[,-c(1, 2)]
str(food_data)
## 'data.frame': 74 obs. of 10 variables:
## $ V1 : chr "280" "330" "320" "290" ...
## $ V2 : chr "12" "17" "16" "13" ...
## $ V3 : chr "6" "7" "6" "6" ...
## $ V4 : chr "0" "0" "0" "0" ...
## $ V5 : chr "30" "40" "35" "30" ...
## $ V6 : chr "560" "720" "640" "570" ...
## $ V7 : chr "31" "30" "31" "31" ...
## $ V8 : chr "2" "2" "2" "2" ...
## $ V9 : chr "2" "2" "2" "2" ...
## $ V10: chr "13" "14" "14" "13" ...
The structure of dataset show the “character” variables, it should be “numeric”, so it has to be fixed.
food_data$V1 <- as.numeric(gsub("[^0-9.]", "", food_data$V1))
food_data$V2 <- as.numeric(gsub("[^0-9.]", "", food_data$V2))
food_data$V3 <- as.numeric(gsub("[^0-9.]", "", food_data$V3))
food_data$V4 <- as.numeric(gsub("[^0-9.]", "", food_data$V4))
food_data$V5 <- as.numeric(gsub("[^0-9.]", "", food_data$V5))
food_data$V6 <- as.numeric(gsub("[^0-9.]", "", food_data$V6))
food_data$V7 <- as.numeric(gsub("[^0-9.]", "", food_data$V7))
food_data$V8 <- as.numeric(gsub("[^0-9.]", "", food_data$V8))
food_data$V9 <- as.numeric(gsub("[^0-9.]", "", food_data$V9))
food_data$V10 <- as.numeric(gsub("[^0-9.]", "", food_data$V10))
str(food_data)
## 'data.frame': 74 obs. of 10 variables:
## $ V1 : num 280 330 320 290 180 270 380 240 350 160 ...
## $ V2 : num 12 17 16 13 6 10 16 10 16 5 ...
## $ V3 : num 6 7 6 6 2 4 6 4.5 6 2 ...
## $ V4 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ V5 : num 30 40 35 30 15 25 35 15 25 15 ...
## $ V6 : num 560 720 640 570 370 450 650 470 680 570 ...
## $ V7 : num 31 30 31 31 25 33 43 29 38 22 ...
## $ V8 : num 2 2 2 2 1 2 2 2 3 1 ...
## $ V9 : num 2 2 2 2 8 6 7 1 2 2 ...
## $ V10: num 13 14 14 13 8 11 16 10 14 7 ...
Since the data is provided in different units, it needs to be standarized to ensure that the results obtained can be considered reliable.
food_data_scaled <- scale(food_data)
food_data_scaled <- na.omit(food_data_scaled)
The next step involves analyzing the relationships between the variables.
food_matrix <- data.matrix(food_data_scaled, rownames.force = NA)
F <- cor(food_matrix)
corrplot(F, method = "number", number.cex = 0.75, order="hclust")
According to the correlation matrix, the data mostly seem to be highly
correlated.
The Hopkins statistic is a valuable tool in the context of clustering, as it helps to assess the clustering tendency of a dataset before applying any clustering algorithms.In simple words it tells how well the data can be clustered. Hopkins statistic close to 0 indicates that the data is highly random and not suitable for clustering. A value close to 0.5 suggests that the data is likely randomly distributed, with no clear clustering structure. In contrast, a value near 1 indicates a strong clustering tendency, meaning the data points are more likely to be grouped together, making it suitable for clustering analysis.
library(hopkins)
## Warning: pakiet 'hopkins' został zbudowany w wersji R 4.4.2
library(factoextra)
## Warning: pakiet 'factoextra' został zbudowany w wersji R 4.4.2
## Ładowanie wymaganego pakietu: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
data <- food_data_scaled
get_clust_tendency(data, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"))
## $hopkins_stat
## [1] 0.9477007
##
## $plot
The obtained result indicates a high tendency of the data to
cluster.
After confirming the clustering tendency of the dataset using, the next crucial step is to determine the optimal number of clusters.
Silhouette analysis can be used to evaluate how well-separated the resulting clusters are from each other. It measures the distance between data points within the same cluster and compares it to the distance to points in neighbouring clusters, providing insight into the quality of the clustering structure.The silhouette score ranges from -1 to 1. Value close to 1 indicates that the data points are well-clustered, value close to 0 indicates the data points are on or near the decision boundary between clusters and value close to -1 indicates that the data points might be misclassified into incorrect clusters.
library(gridExtra)
a <- fviz_nbclust(data, FUNcluster = kmeans, method = "silhouette") + theme_classic()
b <- fviz_nbclust(data, FUNcluster = cluster::pam, method = "silhouette") + theme_classic()
c <- fviz_nbclust(data, FUNcluster = cluster::clara, method = "silhouette") + theme_classic()
d <- fviz_nbclust(data, FUNcluster = hcut, method = "silhouette") + theme_classic()
grid.arrange(a, b, c, d, ncol=2)
The silhouette index indicates that the most optimal number of clusters
for k-means is either 2 or 3, as the values are close and it is
challenging to clearly determine the better option. For PAM, CLARA and
hierarchical clustering it’s 2.
After evaluating the optimal number of clusters using the silhouette statistic, the WSS (within-cluster sum of squares) statistic will be applied to verify if it provides the same results or reveals slight differences.
library(cluster)
a <- fviz_nbclust(data, kmeans, method = "wss") + ggtitle("k-means")
b <- fviz_nbclust(data, pam, method = "wss") + ggtitle("pam")
c <- fviz_nbclust(data, clara, method = "wss") +ggtitle("clara")
d <- fviz_nbclust(data, hcut, method = "wss") +ggtitle("hierarchical clustering")
grid.arrange(a,b,c,d, ncol=2, top = "Optimal number of clusters")
Based on both the silhouette and WSS statistics, the optimal number of
clusters is consistently identified as 2 for all clustering methods.
Additionaly, the clValid function in R is used for validating clustering methods, helping to identify the most suitable one for a given dataset. This function evaluates clustering quality using metrics such as the Silhouette width and Dunn’s Index.Both indices help in assessing the performance of clustering methods.
food_numeric <- food_data[, -c(1, 2)]
library(clValid)
## Warning: pakiet 'clValid' został zbudowany w wersji R 4.4.2
clmethods <- c("hierarchical","kmeans","pam","clara")
internal <- clValid(food_data, nClust = 2:30, clMethods = clmethods, validation = "internal", maxitems = 100000)
summary(internal)
##
## Clustering Methods:
## hierarchical kmeans pam clara
##
## Cluster sizes:
## 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
##
## Validation Measures:
## 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
##
## hierarchical Connectivity 3.1060 6.8893 11.8401 18.3353 20.8115 24.9155 26.2250 31.4560 37.9587 40.6972 43.1972 44.6972 48.8579 51.9119 56.0921 60.3631 61.8131 63.8131 66.5877 68.5877 72.0829 79.0567 84.0663 90.8710 97.1698 98.7532 103.0365 109.3690 111.7619
## Dunn 0.1065 0.0661 0.1053 0.1417 0.1417 0.1432 0.1432 0.1501 0.1501 0.2467 0.2467 0.2467 0.2467 0.2467 0.2918 0.3599 0.3625 0.3625 0.3625 0.3625 0.3557 0.4418 0.5194 0.4862 0.4814 0.4814 0.4814 0.5329 0.5329
## Silhouette 0.5213 0.5074 0.4924 0.4860 0.4696 0.4761 0.4908 0.4648 0.4532 0.4642 0.4568 0.4458 0.4636 0.4531 0.4350 0.4533 0.4488 0.4408 0.4268 0.4113 0.3812 0.3830 0.3775 0.3758 0.3630 0.3543 0.3554 0.3566 0.3363
## kmeans Connectivity 4.4246 10.8897 15.8405 18.3393 20.8155 24.9194 28.1067 31.6893 38.1921 41.1083 43.6083 45.1083 49.2690 52.1980 61.5401 66.1155 70.8488 72.8488 75.6234 77.6234 83.9806 87.0298 88.7560 95.5607 101.8595 103.4429 107.7262 110.4619 113.6881
## Dunn 0.0602 0.0432 0.0767 0.1286 0.1286 0.1354 0.1375 0.1701 0.2283 0.2260 0.2429 0.2429 0.2730 0.2867 0.2955 0.2370 0.2404 0.2404 0.2492 0.2492 0.1676 0.2122 0.2357 0.2701 0.2704 0.2704 0.2704 0.2704 0.2704
## Silhouette 0.5640 0.5103 0.5055 0.5002 0.4832 0.4869 0.5011 0.4911 0.4765 0.4795 0.4721 0.4611 0.4754 0.4672 0.4447 0.4560 0.4536 0.4456 0.4234 0.4071 0.3887 0.3922 0.3846 0.3829 0.3701 0.3614 0.3624 0.3556 0.3489
## pam Connectivity 4.8508 11.7278 14.6056 19.5563 20.0448 28.3151 35.1619 39.3861 41.5290 44.0052 45.5230 51.9258 59.6972 67.0401 73.3389 74.8389 78.8190 81.3190 83.8190 88.7583 90.7583 92.2083 97.3607 99.3607 102.1353 108.9008 112.3103 115.3710 118.2210
## Dunn 0.0621 0.0388 0.0433 0.0902 0.1404 0.0925 0.0974 0.0974 0.0994 0.1124 0.1625 0.1625 0.1625 0.1625 0.1625 0.1625 0.1625 0.2115 0.2370 0.3142 0.3142 0.3678 0.4707 0.4707 0.4776 0.3687 0.3687 0.3687 0.3687
## Silhouette 0.5639 0.4899 0.4580 0.4917 0.4930 0.4646 0.4463 0.4491 0.4832 0.4741 0.4750 0.4841 0.4579 0.4420 0.4080 0.4001 0.3979 0.3905 0.3974 0.3786 0.3706 0.3734 0.3787 0.3629 0.3650 0.3502 0.3544 0.3490 0.3258
## clara Connectivity 4.4246 11.7278 14.6056 18.3393 20.0448 28.3151 38.8115 35.5909 41.4250 42.3528 45.5230 51.9258 59.6972 66.9044 73.3389 77.2520 81.6044 85.5845 88.0845 90.5845 95.5238 97.5238 98.9738 104.1262 106.9008 108.9008 112.3103 115.3710 118.2210
## Dunn 0.0602 0.0388 0.0433 0.1286 0.1404 0.0925 0.0651 0.1322 0.0994 0.1124 0.1625 0.1625 0.1625 0.1625 0.1625 0.1625 0.1625 0.1625 0.2115 0.2370 0.2426 0.2426 0.2839 0.3634 0.3677 0.3687 0.3687 0.3687 0.3687
## Silhouette 0.5640 0.4899 0.4580 0.5002 0.4930 0.4646 0.4301 0.4901 0.4853 0.4807 0.4750 0.4841 0.4579 0.4430 0.4080 0.3759 0.3820 0.3830 0.3756 0.3824 0.3637 0.3557 0.3584 0.3638 0.3658 0.3502 0.3544 0.3490 0.3258
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 3.1060 hierarchical 2
## Dunn 0.5329 hierarchical 29
## Silhouette 0.5640 kmeans 2
The k-means algorithm is a popular clustering method that partitions data into a predefined number of clusters by minimizing the variance within each cluster. It assigns each data point to the nearest centroid, then recalculates centroids iteratively until convergence.
library(factoextra)
km2 <- eclust(data, k=2 , FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
c2 <- fviz_cluster(km2, data=food_data, elipse.type="convex", geom=c("point")) + ggtitle("K-means")
s2 <- fviz_silhouette(km2)
## cluster size ave.sil.width
## 1 1 37 0.35
## 2 2 37 0.50
grid.arrange(c2, s2, ncol=2)
Just to be sure, let’s check 3 clusters.
km3 <- eclust(data, k=3 , FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
c3 <- fviz_cluster(km3, data=food_data, elipse.type="convex", geom=c("point")) + ggtitle("K-means")
s3 <- fviz_silhouette(km3)
## cluster size ave.sil.width
## 1 1 29 0.30
## 2 2 23 0.53
## 3 3 22 0.21
grid.arrange(c3, s3, ncol=2)
It seems that the 2-cluster solution offers better-defined clusters, as
indicated by the higher Silhouette score. 3 clusters may lead to
overfitting or misclassification.
The Partitioning Around Medoids (PAM) method is a clustering algorithm that aims to divide data into clusters by selecting representative data points, known as medoids, as the centers of each cluster. Unlike k-means, PAM is more robust to outliers since it uses actual data points as cluster centers instead of computing centroids.
pam2 <- eclust(data, k=2 , FUNcluster="pam", hc_metric="euclidean", graph=FALSE)
cp2 <- fviz_cluster(pam2, data=DATA, elipse.type="convex", geom=c("point")) + ggtitle("PAM")
sp2 <- fviz_silhouette(pam2)
## cluster size ave.sil.width
## 1 1 46 0.31
## 2 2 28 0.57
grid.arrange(cp2, sp2, ncol=2)
The Silhouette scores for both K-means and PAM are quite close, indicating that both methods performed similarly in terms of how well the points are grouped within their clusters and separated from others.
CLARA is a clustering method designed for large datasets, where it selects multiple samples, applies the k-medoids algorithm to each, and uses the best result to handle data efficiently while maintaining accuracy.
clara2 <- eclust(data, k=2, FUNcluster="clara", hc_metric="euclidean", graph=FALSE)
cc2 <- fviz_cluster(clara2, data=DATA, elipse.type="convex", geom=c("point")) + ggtitle("CLARA")
sc2 <- fviz_silhouette(clara2)
## cluster size ave.sil.width
## 1 1 44 0.33
## 2 2 30 0.54
grid.arrange(cc2, sc2, ncol=2)
Hierarchical clustering is a method of grouping data into a hierarchy of clusters, either by successively merging smaller clusters (agglomerative) or dividing larger clusters (divisive).
Single linkeage:
hc <- eclust(data, k=2, FUNcluster="hclust", hc_metric="euclidean", hc_method = "single")
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
plot(hc, cex=0.6, hang=-1, main = "Dendrogram of agglomerative hierarchical clustering")
rect.hclust(hc, k=2)
The numbers are displayed on the graph instead of the food names because
the food names were too long, making the graph unreadable.
Complete linkeage:
hc1 <- eclust(data, k=2, FUNcluster="hclust", hc_metric="euclidean", hc_method = "complete")
plot(hc1, cex=0.6, hang=-1, main = "Dendrogram of agglomerative hierarchical clustering")
rect.hclust(hc1, k=2)
Average linkeage:
hc2 <- eclust(data, k=2, FUNcluster="hclust", hc_metric="euclidean", hc_method = "average")
plot(hc2, cex=0.6, hang=-1, main = "Dendrogram of agglomerative hierarchical clustering")
rect.hclust(hc2, k=2)
hc3 <- eclust(data, k=2, FUNcluster="diana")
pltree(hc3, cex = 0.6, hang = -1, main = "Dendrogram of DIANA")
rect.hclust(hc3, k=2)
Next, there will be assessed the quality of the obtained divisions by calculating the inertia ratio.
inertion_hclust <- matrix(0, nrow = 4, ncol = 4)
colnames(inertion_hclust) <- c("Single Linkage", "Complete Linkage", "Average Linkage", "Divisive")
rownames(inertion_hclust) <- c("Intra-clust", "Total", "Percentage", "Q")
Single Linkage
library(ClustGeo)
## Warning: pakiet 'ClustGeo' został zbudowany w wersji R 4.4.2
hclust_single <- cutree(hc, k = 2)
inertion_hclust[1, 1] <- withindiss(dist(data), part = hclust_single)
inertion_hclust[2, 1] <- inertdiss(dist(data))
inertion_hclust[3, 1] <- inertion_hclust[1, 1] / inertion_hclust[2, 1]
inertion_hclust[4, 1] <- 1 - inertion_hclust[3, 1]
Complete Linkage
hclust_complete <- cutree(hc1, k = 2)
inertion_hclust[1, 2] <- withindiss(dist(data), part = hclust_complete)
inertion_hclust[2, 2] <- inertdiss(dist(data))
inertion_hclust[3, 2] <- inertion_hclust[1, 2] / inertion_hclust[2, 2]
inertion_hclust[4, 2] <- 1 - inertion_hclust[3, 2]
Average Linkage
hclust_average <- cutree(hc2, k = 2)
inertion_hclust[1, 3] <- withindiss(dist(data), part = hclust_average)
inertion_hclust[2, 3] <- inertdiss(dist(data))
inertion_hclust[3, 3] <- inertion_hclust[1, 3] / inertion_hclust[2, 3]
inertion_hclust[4, 3] <- 1 - inertion_hclust[3, 3]
Divisive (DIANA)
diana_clusters <- cutree(hc3, k = 2)
inertion_hclust[1, 4] <- withindiss(dist(data), part = diana_clusters)
inertion_hclust[2, 4] <- inertdiss(dist(data))
inertion_hclust[3, 4] <- inertion_hclust[1, 4] / inertion_hclust[2, 4]
inertion_hclust[4, 4] <- 1 - inertion_hclust[3, 4]
inertion_hclust
## Single Linkage Complete Linkage Average Linkage Divisive
## Intra-clust 8.3790277 8.3790277 8.3790277 8.3790277
## Total 9.8648649 9.8648649 9.8648649 9.8648649
## Percentage 0.8493809 0.8493809 0.8493809 0.8493809
## Q 0.1506191 0.1506191 0.1506191 0.1506191
The results for all methods are identical what suggests that the data might not exhibit strong enough differences for the methods to produce varying results.The dataset might not contain enough distinct clusters or variation, leading to similar results regardless of the method.Other explanation can be that the chosen number of clusters might not be optimal for distinguishing between different patterns in the data. After checking behind the project the values for different numbers of clusters, such as 3, the results varied, and the best method turned out to be single linkage for that case.
In summary, the text data were clustered using various methods. Preprocessing played a crucial role, and multiple datasets were generated to achieve improved clustering outcomes. After evaluating the clustering methods using various metrics, it was found that k-means with two clusters produced the best results overall.