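# Packages that provide the functions used in this analysis
library(pastecs)    # stat.desc
library(Hmisc)      # rcorr
library(factoextra) # get_dist, fviz_dist, get_clust_tendency, fviz_dend, hkmeans, fviz_cluster
library(NbClust)    # NbClust
library(dplyr)      # %>%
library(tidyr)      # pivot_longer
library(ggplot2)    # ggplot
library(psych)      # describeBy
mydata <- read.csv("~/Desktop/IMB/2. SEMESTER/MULTIVARIATE ANALYSIS/HOMEWORKS/HW4/menu.csv")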
mydata1 <- mydata[c(-3,-5,-7,-8,-9,-10,-11,-12,-13,-14,-16,-18,-21,-22,-23,-24)]
mydata1$ID <- 1:nrow(mydata1)
mydata1 <- mydata1[c(9,1,2,3,4,5,6,7,8)]
colnames(mydata1) <- c("ID", "Category", "Name", "Calories", "Fats", "Carbs", "Fiber", "Sugar", "Protein")
mydata1[mydata1 == 'Beef & Pork'] <- 'Main'
mydata1[mydata1 == 'Chicken & Fish'] <- 'Main'
mydata1[mydata1 == 'Snacks & Sides'] <- 'Sides_&_Salads'
mydata1[mydata1 == 'Salads'] <- 'Sides_&_Salads'
mydata1[mydata1 == 'Coffee & Tea'] <- 'Beverages'
mydata1[mydata1 == 'Smoothies & Shakes'] <- 'Desserts'
mydata1$CategoryF <- factor(mydata1$Category,
levels = c("Breakfast", "Main", "Sides_&_Salads", "Desserts", "Beverages"),
labels = c("Breakfast", "Main", "Sides_&_Salads", "Desserts", "Beverages"))
head(mydata1,10)
## ID Category Name
## 1 1 Breakfast Egg McMuffin
## 2 2 Breakfast Egg White Delight
## 3 3 Breakfast Sausage McMuffin
## 4 4 Breakfast Sausage McMuffin with Egg
## 5 5 Breakfast Sausage McMuffin with Egg Whites
## 6 6 Breakfast Steak & Egg McMuffin
## 7 7 Breakfast Bacon, Egg & Cheese Biscuit (Regular Biscuit)
## 8 8 Breakfast Bacon, Egg & Cheese Biscuit (Large Biscuit)
## 9 9 Breakfast Bacon, Egg & Cheese Biscuit with Egg Whites (Regular Biscuit)
## 10 10 Breakfast Bacon, Egg & Cheese Biscuit with Egg Whites (Large Biscuit)
## Calories Fats Carbs Fiber Sugar Protein CategoryF
## 1 300 13 31 4 3 17 Breakfast
## 2 250 8 30 4 3 18 Breakfast
## 3 370 23 29 4 2 14 Breakfast
## 4 450 28 30 4 2 21 Breakfast
## 5 400 23 30 4 2 21 Breakfast
## 6 430 23 31 4 3 26 Breakfast
## 7 460 26 38 2 3 19 Breakfast
## 8 520 30 43 3 4 19 Breakfast
## 9 410 20 36 2 3 20 Breakfast
## 10 470 25 42 3 4 20 Breakfast
The unit of observation in my sample is an item on the McDonald’s menu. The original data set contained 260 units of observation (McDonald’s menu items) and 24 variables. I removed the columns whose variables did not hold potential for further hypothesis testing, which left the data set (mydata1) with 8 of the original variables. An ID column and the factor CategoryF were then added. The remaining original variables are:
Category: The menu category of each item. The nine original categories (Breakfast, Beef & Pork, Chicken & Fish, Salads, Snacks & Sides, Desserts, Beverages, Coffee & Tea, Smoothies & Shakes) were merged into five groups: Breakfast, Main, Sides_&_Salads, Desserts and Beverages.
Name: The name of each food item on the McDonald’s menu.
Calories: Amount of kilocalories in each food item on the McDonald’s menu.
Fats: Amount of fat in grams in each food item on the McDonald’s menu.
Carbs: Amount of carbohydrates in grams in each food item on the McDonald’s menu.
Fiber: Amount of fiber in grams in each food item on the McDonald’s menu.
Sugar: Amount of sugar in grams in each food item on the McDonald’s menu.
Protein: Amount of protein in grams in each food item on the McDonald’s menu.
The data was obtained from the Kaggle website, where McDonald’s is listed as the author. Retrieved January 9th, 2023, from https://www.kaggle.com/datasets/mcdonalds/nutrition-facts.
The goal is to use cluster analysis to classify the McDonald’s menu items into 2 or more groups based on their nutritional characteristics (Fats, Carbs, Fiber, Sugar and Protein). Calories are kept aside and used later to validate the clusters.
Using stat.desc to present the general descriptive statistics of this data set and the sapply function to find the minimum and maximum values of each variable.
round(stat.desc((mydata1[c(-1,-2,-3,-10)]), basic = FALSE), 2)
## Calories Fats Carbs Fiber Sugar Protein
## median 340.00 11.00 44.00 1.00 17.50 12.00
## mean 368.27 14.17 47.35 1.63 29.42 13.34
## SE.mean 14.90 0.88 1.75 0.10 1.78 0.71
## CI.mean.0.95 29.34 1.73 3.45 0.19 3.50 1.40
## var 57729.62 201.81 798.19 2.46 822.53 130.56
## std.dev 240.27 14.21 28.25 1.57 28.68 11.43
## coef.var 0.65 1.00 0.60 0.96 0.97 0.86
sapply(mydata1[c(-1,-2,-3,-10)], FUN = min)
## Calories Fats Carbs Fiber Sugar Protein
## 0 0 0 0 0 0
sapply(mydata1[c(-1,-2,-3,-10)], FUN = max)
## Calories Fats Carbs Fiber Sugar Protein
## 1880 118 141 7 128 87
Checking the variance in the data of each individual variable in order to confirm that standardization of the variables is required.
var(mydata1$Fats)
## [1] 201.8104
var(mydata1$Carbs)
## [1] 798.1886
var(mydata1$Fiber)
## [1] 2.457737
var(mydata1$Sugar)
## [1] 822.5307
var(mydata1$Protein)
## [1] 130.5568
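Since the variances have a very wide spread (from over 800 to just above 2), the variables have to be standardized before clustering. Otherwise the variables with the largest variances (Sugar and Carbs) would dominate the Euclidean distances between menu items. The correlation matrix itself is not affected by standardization.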
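A minimal sketch of what the scale function does and why it matters for distances. The toy values below are hypothetical and are not taken from menu.csv. The sketch shows that the z-scores have variance 1, that the Pearson correlation is unchanged by standardization, and that the distance between two items changes once both variables are on the same scale.

```r
# Hypothetical toy values (not from menu.csv)
toy <- data.frame(Sugar = c(3, 17, 48, 120, 0),
                  Fiber = c(4, 1, 0, 2, 3))

# scale() subtracts the column mean and divides by the column standard deviation
z <- as.data.frame(scale(toy))

# The same z-score computed by hand for one variable
manual <- (toy$Sugar - mean(toy$Sugar)) / sd(toy$Sugar)
all.equal(z$Sugar, manual)

# After standardization every variable has variance 1
sapply(z, var)

# The correlation does not depend on the scale of the variables
all.equal(cor(toy$Sugar, toy$Fiber), cor(z$Sugar, z$Fiber))

# Euclidean distance between the first two items, raw versus standardized:
# on the raw scale the distance is driven almost entirely by Sugar
dist(toy[1:2, ])
dist(z[1:2, ])
```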
Standardizing the variables using the scale function and adding them into the data set in their dedicated columns.
mydata1$FatsZ <- scale(mydata1$Fats)
mydata1$CarbsZ <- scale(mydata1$Carbs)
mydata1$FiberZ <- scale(mydata1$Fiber)
mydata1$SugarZ <- scale(mydata1$Sugar)
mydata1$ProteinZ <- scale(mydata1$Protein)
rcorr(as.matrix(mydata1[c(11,12,13,14,15)]),
type = "pearson")
## FatsZ CarbsZ FiberZ SugarZ ProteinZ
## FatsZ 1.00 0.46 0.58 -0.12 0.81
## CarbsZ 0.46 1.00 0.22 0.76 0.35
## FiberZ 0.58 0.22 1.00 -0.30 0.64
## SugarZ -0.12 0.76 -0.30 1.00 -0.18
## ProteinZ 0.81 0.35 0.64 -0.18 1.00
##
## n= 260
##
##
## P
## FatsZ CarbsZ FiberZ SugarZ ProteinZ
## FatsZ 0.0000 0.0000 0.0631 0.0000
## CarbsZ 0.0000 0.0003 0.0000 0.0000
## FiberZ 0.0000 0.0003 0.0000 0.0000
## SugarZ 0.0631 0.0000 0.0000 0.0036
## ProteinZ 0.0000 0.0000 0.0000 0.0036
Since the p-value of the correlation between sugar and fats (r = -0.12, p = 0.0631) is above 0.05, the hypothesis that these two variables are uncorrelated cannot be rejected at the 5% level. All other pairs are significantly correlated, and some of them quite strongly (for example fats and protein, r = 0.81, and carbs and sugar, r = 0.76), which is not ideal for clustering. If less correlated variables were available it would be best to use them; however, the data does not contain any other variables that could be used for clustering.
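The P block of the rcorr output holds p-values, not correlations. As a rough check, the p-value for one pair can be approximated by hand with the usual t-test for a Pearson correlation. The sketch below assumes the rounded r = -0.12 and n = 260 from the output above, so the result will differ slightly from the 0.0631 that rcorr computes from the unrounded coefficient.

```r
# Values read off the rcorr output above (r is rounded to two decimals)
r <- -0.12
n <- 260

# Test statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = n - 2)
p_value

# The decision at the 5% level: TRUE means H0 cannot be rejected
p_value > 0.05
```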
Introducing an additional variable named Dissimilarity and ordering it in order to find potential outliers.
mydata1$Dissimilarity <- sqrt(mydata1$FatsZ^2 + mydata1$CarbsZ^2 + mydata1$FiberZ^2 + mydata1$SugarZ^2 + mydata1$ProteinZ^2)
head(mydata1[order(-mydata1$Dissimilarity),],10)
## ID Category Name
## 83 83 Main Chicken McNuggets (40 piece)
## 33 33 Breakfast Big Breakfast with Hotcakes (Large Biscuit)
## 35 35 Breakfast Big Breakfast with Hotcakes and Egg Whites (Large Biscuit)
## 32 32 Breakfast Big Breakfast with Hotcakes (Regular Biscuit)
## 254 254 Desserts McFlurry with M&M’s Candies (Medium)
## 247 247 Desserts Strawberry Shake (Large)
## 250 250 Desserts Chocolate Shake (Large)
## 34 34 Breakfast Big Breakfast with Hotcakes and Egg Whites (Regular Biscuit)
## 252 252 Desserts Shamrock Shake (Large)
## 82 82 Main Chicken McNuggets (20 piece)
## Calories Fats Carbs Fiber Sugar Protein CategoryF FatsZ CarbsZ
## 83 1880 118 118 6 1 87 Main 7.3092095 2.5008235
## 33 1150 60 116 7 17 36 Breakfast 3.2264270 2.4300327
## 35 1050 50 115 7 18 35 Breakfast 2.5224990 2.3946372
## 32 1090 56 111 6 17 36 Breakfast 2.9448558 2.2530555
## 254 930 33 139 2 128 20 Desserts 1.3258213 3.2441277
## 247 850 24 140 0 123 18 Desserts 0.6922861 3.2795231
## 250 850 23 141 2 120 19 Desserts 0.6218933 3.3149185
## 34 990 46 110 6 17 35 Breakfast 2.2409278 2.2176601
## 252 820 23 135 0 115 18 Desserts 0.6218933 3.1025459
## 82 940 59 59 3 0 44 Main 3.1560342 0.4124929
## FiberZ SugarZ ProteinZ Dissimilarity
## 83 2.7870021 -0.9910488 6.4467527 10.487556
## 33 3.4248723 -0.4331647 1.9833055 5.671488
## 35 3.4248723 -0.3982970 1.8957869 5.251637
## 32 2.7870021 -0.4331647 1.9833055 5.063294
## 254 0.2355213 3.4371556 0.5830083 4.948896
## 247 -1.0402191 3.2628168 0.4079712 4.809262
## 250 0.2355213 3.1582136 0.4954898 4.653034
## 34 2.7870021 -0.4331647 1.8957869 4.635601
## 252 -1.0402191 2.9838748 0.4079712 4.490495
## 82 0.8733915 -1.0259165 2.6834541 4.375722
Removing the outliers found in the previous step (units separated from the rest by a large gap in dissimilarity).
mydata1 <- mydata1[c(-83,-33,-35,-82,-34,-32),]
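The Dissimilarity variable is the Euclidean distance of each item from the origin of the standardized space, that is, from the average menu item. A minimal sketch on hypothetical toy values (column names mirror the ones used here, the numbers are made up) shows the same calculation with rowSums and a removal by ID instead of by row position. Negative positions give the same result as long as the rows are still in their original order, which is the case in this analysis.

```r
# Hypothetical toy data with two standardized variables
toy <- data.frame(ID     = 1:6,
                  FatsZ  = c(0.1, -0.4, 7.3, 0.6, -0.5, 3.2),
                  CarbsZ = c(-0.6, -0.6, 2.5, -0.6, 0.2, 2.4))

z_cols <- c("FatsZ", "CarbsZ")

# Euclidean distance from the origin (the average item in z-space)
toy$Dissimilarity <- sqrt(rowSums(toy[z_cols]^2))

# Largest dissimilarities first
toy[order(-toy$Dissimilarity), ]

# Remove the outliers by ID rather than by row position
outlier_ids <- c(3, 6)
toy_clean <- toy[!toy$ID %in% outlier_ids, ]
toy_clean
```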
Calculating the Euclidean distances that will be used to present the dissimilarity matrix.
distance <- get_dist(mydata1[c("FatsZ", "CarbsZ", "FiberZ", "SugarZ", "ProteinZ")],
method = "euclidean")
distance2 <- distance^2
fviz_dist(distance2)
The matrix above suggests that the data is clusterable. To further confirm this assumption, the Hopkins statistic is calculated in the next step.
get_clust_tendency(mydata1[c("FatsZ", "CarbsZ", "FiberZ", "SugarZ", "ProteinZ")],
n = nrow(mydata1) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.8710226
##
## $plot
## NULL
The Hopkins statistic (0.87) is well above 0.5 (the advised level for clustering), meaning the assumption is confirmed and the data is clusterable.
WARD <- mydata1[c("FatsZ", "CarbsZ", "FiberZ", "SugarZ", "ProteinZ")] %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
Presenting the cluster dendrogram of the data using the Ward method for clustering.
Using the NbClust function to find the optimal number of clusters for the data.
OptNumberClusters <- mydata1[c("FatsZ", "CarbsZ", "FiberZ", "SugarZ", "ProteinZ")] %>%
NbClust(distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "ward.D2",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 2 proposed 2 as the best number of clusters
## * 8 proposed 3 as the best number of clusters
## * 5 proposed 4 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
Based on the output above, the majority rule favors 3 clusters: 8 of the 23 indices proposed this number, more than for any other solution. Therefore k = 3 will be used.
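The majority rule can be retraced from the vote counts printed by NbClust. The sketch below simply re-enters those counts by hand (it does not call NbClust) and computes the winning number of clusters and its share of the votes.

```r
# Votes per number of clusters, copied from the NbClust output above
votes <- c("2" = 2, "3" = 8, "4" = 5, "5" = 3, "6" = 2, "9" = 1, "10" = 2)

# Number of clusters with the most votes
best_k <- names(votes)[which.max(votes)]
best_k

# Total number of indices that voted and the share of the winner
total_votes <- sum(votes)
share <- unname(votes[best_k]) / total_votes
round(100 * share, 1)
```

The winner is a plurality rather than an absolute majority, so the dendrogram and the interpretability of the clusters remain useful additional checks.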
fviz_dend(WARD,
k = 3,
cex = 0.5,
palette = c("#FFCC00", "#429698", "#5D2689"),
color_labels_by_k = TRUE,
rect = TRUE)
Assigning each unit to one of the three formed clusters according to the Ward method.
mydata1$ClusterWARD <- cutree(WARD,
k = 3)
head(mydata1,10)
## ID Category Name
## 1 1 Breakfast Egg McMuffin
## 2 2 Breakfast Egg White Delight
## 3 3 Breakfast Sausage McMuffin
## 4 4 Breakfast Sausage McMuffin with Egg
## 5 5 Breakfast Sausage McMuffin with Egg Whites
## 6 6 Breakfast Steak & Egg McMuffin
## 7 7 Breakfast Bacon, Egg & Cheese Biscuit (Regular Biscuit)
## 8 8 Breakfast Bacon, Egg & Cheese Biscuit (Large Biscuit)
## 9 9 Breakfast Bacon, Egg & Cheese Biscuit with Egg Whites (Regular Biscuit)
## 10 10 Breakfast Bacon, Egg & Cheese Biscuit with Egg Whites (Large Biscuit)
## Calories Fats Carbs Fiber Sugar Protein CategoryF FatsZ CarbsZ
## 1 300 13 31 4 3 17 Breakfast -0.08203469 -0.5785792
## 2 250 8 30 4 3 18 Breakfast -0.43399870 -0.6139746
## 3 370 23 29 4 2 14 Breakfast 0.62189333 -0.6493701
## 4 450 28 30 4 2 21 Breakfast 0.97385733 -0.6139746
## 5 400 23 30 4 2 21 Breakfast 0.62189333 -0.6139746
## 6 430 23 31 4 3 26 Breakfast 0.62189333 -0.5785792
## 7 460 26 38 2 3 19 Breakfast 0.83307173 -0.3308112
## 8 520 30 43 3 4 19 Breakfast 1.11464294 -0.1538340
## 9 410 20 36 2 3 20 Breakfast 0.41071492 -0.4016020
## 10 470 25 42 3 4 20 Breakfast 0.76267893 -0.1892294
## FiberZ SugarZ ProteinZ Dissimilarity ClusterWARD
## 1 1.5112617 -0.9213133 0.3204526 1.891270 1
## 2 1.5112617 -0.9213133 0.4079712 1.965831 1
## 3 1.5112617 -0.9561810 0.0578969 2.002493 1
## 4 1.5112617 -0.9561810 0.6705269 2.230059 1
## 5 1.5112617 -0.9561810 0.6705269 2.100361 1
## 6 1.5112617 -0.9213133 1.1081198 2.254366 1
## 7 0.2355213 -0.9213133 0.4954898 1.397585 1
## 8 0.8733915 -0.8864455 0.4954898 1.749343 1
## 9 0.2355213 -0.9213133 0.5830083 1.254655 1
## 10 0.8733915 -0.8864455 0.5830083 1.583030 1
Finding the initial leaders, i.e. the means of the clustering variables within each of the 3 clusters formed by the Ward method. These serve as starting points for the K-Means clustering.
Initial_leaders <- aggregate(mydata1[c("FatsZ", "CarbsZ", "FiberZ", "SugarZ", "ProteinZ")],
by = list(mydata1$ClusterWARD),
FUN = mean)
Initial_leaders
## Group.1 FatsZ CarbsZ FiberZ SugarZ ProteinZ
## 1 1 0.7617277 -0.1504858 0.8302922 -0.7648796 0.9673804
## 2 2 -0.5451678 -0.2418909 -0.4217962 0.1296093 -0.5552668
## 3 3 0.7274825 2.4123350 -0.5219496 2.4412454 0.1618252
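To make the link between the initial leaders and K-Means explicit, here is a minimal base R sketch of the two-step idea: cut a Ward tree, take the cluster means as leaders, and pass them to kmeans as starting centers. It runs on simulated toy data (hypothetical, not the menu data) and is only an approximation of what a hierarchical K-Means routine does internally, not a description of its exact implementation.

```r
# Simulated toy data: three groups of 20 points in two dimensions
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2),
             matrix(rnorm(40, mean = 8), ncol = 2))

# Step 1: hierarchical clustering with the Ward method, cut into 3 clusters
hc <- hclust(dist(toy), method = "ward.D2")
ward_cluster <- cutree(hc, k = 3)

# Step 2: initial leaders = mean of every variable within each Ward cluster
leaders <- aggregate(as.data.frame(toy),
                     by = list(ward_cluster),
                     FUN = mean)

# Step 3: K-Means started from the leaders instead of random centers
km <- kmeans(toy, centers = as.matrix(leaders[, -1]))

# Cross table of the Ward solution against the optimized solution
table(Ward = ward_cluster, KMeans = km$cluster)
```

Starting from the Ward leaders makes the K-Means result reproducible, since no random starting centers are involved.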
K_MEANS <- hkmeans(mydata1[c("FatsZ", "CarbsZ", "FiberZ", "SugarZ", "ProteinZ")],
k = 3,
hc.metric = "euclidean",
hc.method = "ward.D2")
K_MEANS
## Hierarchical K-means clustering with 3 clusters of sizes 82, 113, 59
##
## Cluster means:
## FatsZ CarbsZ FiberZ SugarZ ProteinZ
## 1 0.67039568 -0.1283666 0.9200649 -0.7414467 0.8039394
## 2 -0.65670159 -0.6427922 -0.6450783 -0.2063701 -0.6693592
## 3 -0.03669695 1.2025910 -0.3158580 1.4886984 -0.1215903
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 21 22 23 24 25 26 27 28 29 30 31 36 37 38 39 40 41 42 43 44
## 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1
## 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 84 85 86
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 1
## 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106
## 1 1 1 1 2 2 2 2 2 2 2 1 1 2 2 2 2 1 2 2
## 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## 2 2 2 2 2 2 3 2 2 2 2 2 2 2 3 2 2 2 2 2
## 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146
## 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2
## 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166
## 2 2 2 2 2 2 2 3 2 2 3 2 2 3 2 2 1 2 2 2
## 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186
## 2 2 3 2 2 3 2 2 3 2 2 2 2 3 3 2 3 3 2 3
## 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206
## 3 2 3 3 2 3 3 2 3 3 2 2 2 2 2 2 2 2 2 2
## 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226
## 2 2 2 2 2 2 2 3 2 2 3 2 2 3 2 2 3 3 3 3
## 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246
## 3 3 3 3 3 3 2 3 3 2 3 3 2 3 3 3 3 3 3 3
## 247 248 249 250 251 252 253 254 255 256 257 258 259 260
## 3 3 3 3 3 3 3 3 3 3 3 2 3 3
##
## Within cluster sum of squares by cluster:
## [1] 149.5657 148.7184 134.2054
## (between_SS / total_SS = 59.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
Visual representation of the three formed clusters using the K-Means clustering method.
fviz_cluster(K_MEANS,
palette = c("#FFCC00", "#429698", "#5D2689"),
repel = FALSE,
ggtheme = theme_classic())
In an earlier run of the visualization above, some additional outliers could be seen (IDs 82, 34 and 32). These were therefore removed along with the outliers already identified by the dissimilarity variable, and the visualization was re-run.
mydata1$ClusterK_Means <- K_MEANS$cluster
head(mydata1[c("ID", "ClusterWARD", "ClusterK_Means")])
## ID ClusterWARD ClusterK_Means
## 1 1 1 1
## 2 2 1 1
## 3 3 1 1
## 4 4 1 1
## 5 5 1 1
## 6 6 1 1
table(mydata1$ClusterWARD)
##
## 1 2 3
## 74 164 16
table(mydata1$ClusterK_Means)
##
## 1 2 3
## 82 113 59
table(mydata1$ClusterWARD, mydata1$ClusterK_Means)
##
## 1 2 3
## 1 74 0 0
## 2 8 113 43
## 3 0 0 16
From the last table (rows show the Ward clusters, columns the K-Means clusters) it can be seen that no items were re-classified from cluster 1 to either cluster 2 or 3. From cluster 2, 8 menu items were re-classified into cluster 1 and 43 items were re-classified into cluster 3. Furthermore, no items were re-classified from cluster 3 to either cluster 1 or 2. Cluster sizes are now 82, 113 and 59 respectively.
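The number of re-classified items can be read directly from the cross table: the diagonal holds the items that kept their cluster and everything off the diagonal moved. A short sketch with the counts re-entered by hand from the table above:

```r
# Cross table from above: rows = Ward clusters, columns = K-Means clusters
cross <- matrix(c(74,   8,  0,
                   0, 113,  0,
                   0,  43, 16),
                nrow = 3,
                dimnames = list(Ward   = as.character(1:3),
                                KMeans = as.character(1:3)))

stayed <- sum(diag(cross))      # items that kept their cluster
moved  <- sum(cross) - stayed   # items re-classified by K-Means

stayed
moved
round(100 * moved / sum(cross), 1)  # share of re-classified items in percent

# Cluster sizes before (Ward) and after (K-Means)
rowSums(cross)
colSums(cross)
```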
Finding the centroids of the newly formed clusters.
Centroids <- K_MEANS$centers
Centroids
## FatsZ CarbsZ FiberZ SugarZ ProteinZ
## 1 0.67039568 -0.1283666 0.9200649 -0.7414467 0.8039394
## 2 -0.65670159 -0.6427922 -0.6450783 -0.2063701 -0.6693592
## 3 -0.03669695 1.2025910 -0.3158580 1.4886984 -0.1215903
Figure <- as.data.frame(Centroids)
Figure$id <- 1:nrow(Figure)
Figure <- pivot_longer(Figure, cols = c(FatsZ, CarbsZ, FiberZ, SugarZ, ProteinZ))
Figure$Groups <- factor(Figure$id,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
Figure$nameFactor <- factor(Figure$name,
levels = c("FatsZ", "CarbsZ", "FiberZ", "SugarZ", "ProteinZ"),
labels = c("FatsZ", "CarbsZ", "FiberZ", "SugarZ", "ProteinZ"))
ggplot(Figure, aes(x = nameFactor, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Groups, col = Groups), size = 3) +
geom_line(aes(group = id), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables")+
ylim(-2, 2)
From the centroids and the chart above it can be stated that menu items in cluster 3 are approximately average in fats, high in carbohydrates and sugar, and below average in fiber and protein. The 1st cluster contains menu items that are above average in fats, fiber and protein, but below average in carbohydrates and sugar. The 2nd cluster is below average on all five variables, with sugar coming closest to the menu average.
Checking the average calories of menu items by clusters.
describeBy(mydata1$Calories, mydata1$ClusterK_Means)
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 82 476.95 131.49 465 474.7 111.19 140 800 660 0.15 -0.08
## se
## X1 14.52
## ------------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 113 181.15 106.05 190 183.52 103.78 0 360 360 -0.29 -0.88
## se
## X1 9.98
## ------------------------------------------------------------
## group: 3
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 59 492.71 178.96 450 479.18 163.09 250 930 680 0.63 -0.63 23.3
aggregate(mydata1$Calories,
by = list(mydata1$ClusterK_Means),
FUN = "mean")
## Group.1 x
## 1 1 476.9512
## 2 2 181.1504
## 3 3 492.7119
In order to test whether the average calories of the clusters differ from each other, we perform an ANOVA test.
H0: µ(cluster1, Calories) = µ(cluster2, Calories) = µ(cluster3, Calories)
H1: At least one µ(cluster, Calories) differs from the others.
fit_calories <- aov(Calories ~ as.factor(ClusterK_Means),
data = mydata1)
summary(fit_calories)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 5744600 2872300 159.6 <2e-16 ***
## Residuals 251 4517704 17999
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value of the ANOVA test is below 0.001, so the null hypothesis is rejected: at least one cluster mean of calories differs from the others. Since Calories was not used as a clustering variable, this result supports the validity of the clusters.
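The ANOVA only indicates that at least one mean differs, not which ones. One possible follow-up, which is not part of this analysis, is a Tukey HSD post-hoc test. The sketch below uses simulated toy data (hypothetical values loosely resembling the cluster means above) and assumes the usual ANOVA conditions hold.

```r
# Simulated toy data: 30 items per cluster, hypothetical means and spreads
set.seed(42)
toy <- data.frame(
  Calories = c(rnorm(30, mean = 480, sd = 130),
               rnorm(30, mean = 180, sd = 100),
               rnorm(30, mean = 490, sd = 180)),
  Cluster  = factor(rep(1:3, each = 30))
)

# One-way ANOVA, as in the analysis above
fit <- aov(Calories ~ Cluster, data = toy)
summary(fit)

# Pairwise comparisons with Tukey's honest significant differences
tukey <- TukeyHSD(fit)
tukey$Cluster   # one row per pair: diff, lwr, upr, p adj
```

In the real data, such a test could show whether clusters 1 and 3, whose average calories are very close, actually differ from each other or only from cluster 2.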
Of the original 260 items, 254 remained after removing the outliers. The clustering was performed on 5 standardized variables (FatsZ, CarbsZ, FiberZ, SugarZ and ProteinZ).
Hierarchical clustering with the Ward method was performed first, and the dendrogram and the NbClust function were used to find the optimal number of clusters, which turned out to be 3 according to the majority rule. The final clusters were then optimized with K-Means clustering and contain 82, 113 and 59 menu items respectively.
Cluster 1 therefore contains approximately 32.3 % of all menu items, cluster 2 contains approximately 44.5 % of all menu items and the final cluster contains approximately 23.2 % of all menu items.
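The shares follow directly from the K-Means cluster sizes reported earlier (82, 113 and 59 out of 254 items). A short check with the sizes re-entered by hand:

```r
# Cluster sizes from the K-Means output
sizes <- c("1" = 82, "2" = 113, "3" = 59)

# Share of each cluster in percent, rounded to one decimal
shares <- round(100 * prop.table(sizes), 1)
shares

sum(sizes)  # number of items after removing the outliers
```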
The average item in cluster 2 contains approximately 181 calories, the lowest of all three clusters. It also scores the lowest on 4 of the 5 clustering variables (the exception is sugar, where cluster 1 scores even lower) and is below average on all 5. Menu items in cluster 3 are approximately average in fats, high in carbohydrates and sugar, and below average in fiber and protein, and have on average approximately 493 calories. The 1st cluster contains menu items that are above average in fats, fiber and protein, but below average in carbohydrates and sugar. Cluster 1 menu items have, on average, approximately 477 calories.
Cluster 1 contains more menu items than expected from the breakfast category (alpha 0.1%) and from the main items category (alpha 0.1%). However, it contains fewer items than expected from the desserts (alpha 1%) and beverages (alpha 0.1%) categories.
Cluster 2 contains fewer menu items than expected from both the breakfast and main items categories (both alpha 1%). It does, however, contain more beverages than expected (alpha 1%).
Finally, cluster 3 contains fewer menu items than expected from the breakfast, main and sides & salads categories (alpha values of 1%, 1% and 5% respectively). This cluster also contains more menu items than expected from the desserts category (alpha 0.1%).
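Statements of the form "more than expected (alpha 1%)" are typically based on the standardized residuals of a chi-square test of association between cluster and category. That test is not shown in this report, so the following is only a sketch of how such residuals can be obtained. The counts are hypothetical and do not reproduce the actual cross table; the cut-offs 1.96, 2.58 and 3.29 are the two-sided standard normal critical values for 5%, 1% and 0.1%.

```r
# Hypothetical counts of menu items by cluster and category (not the real data)
counts <- matrix(c(30,  5,  2,
                   25, 10,  3,
                   10, 25,  5,
                    2, 20, 30,
                    3, 40, 15),
                 nrow = 3,
                 dimnames = list(
                   Cluster  = c("1", "2", "3"),
                   Category = c("Breakfast", "Main", "Sides_&_Salads",
                                "Desserts", "Beverages")))

# Chi-square test of association between cluster and category
test <- chisq.test(counts)
test

# Standardized residuals: positive = more items than expected, negative = fewer
res <- round(test$stdres, 2)
res

# Significance level reached by each cell
flag <- ifelse(abs(res) > 3.29, "0.1%",
        ifelse(abs(res) > 2.58, "1%",
        ifelse(abs(res) > 1.96, "5%", "n.s.")))
flag
```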