CLUSTERING/CLUSTER ANALYSIS

The data set and data manipulations

mydata <- read.csv("~/Desktop/IMB/2. SEMESTER/MULTIVARIATE ANALYSIS/HOMEWORKS/HW4/menu.csv")

mydata1 <- mydata[c(-3,-5,-7,-8,-9,-10,-11,-12,-13,-14,-16,-18,-21,-22,-23,-24)]

mydata1$ID <- 1:nrow(mydata1)

mydata1 <- mydata1[c(9,1,2,3,4,5,6,7,8)]

colnames(mydata1) <- c("ID", "Category", "Name", "Calories", "Fats", "Carbs", "Fiber", "Sugar", "Protein")
mydata1[mydata1 == 'Beef & Pork'] <- 'Main'
mydata1[mydata1 == 'Chicken & Fish'] <- 'Main'
mydata1[mydata1 == 'Snacks & Sides'] <- 'Sides_&_Salads'
mydata1[mydata1 == 'Salads'] <- 'Sides_&_Salads'
mydata1[mydata1 == 'Coffee & Tea'] <- 'Beverages' 
mydata1[mydata1 == 'Smoothies & Shakes'] <- 'Desserts'

mydata1$CategoryF <- factor(mydata1$Category,
                            levels = c("Breakfast", "Main", "Sides_&_Salads", "Desserts", "Beverages"),
                            labels = c("Breakfast", "Main", "Sides_&_Salads", "Desserts", "Beverages"))

head(mydata1,10)

##    ID  Category                                                          Name
## 1   1 Breakfast                                                  Egg McMuffin
## 2   2 Breakfast                                             Egg White Delight
## 3   3 Breakfast                                              Sausage McMuffin
## 4   4 Breakfast                                     Sausage McMuffin with Egg
## 5   5 Breakfast                              Sausage McMuffin with Egg Whites
## 6   6 Breakfast                                          Steak & Egg McMuffin
## 7   7 Breakfast                 Bacon, Egg & Cheese Biscuit (Regular Biscuit)
## 8   8 Breakfast                   Bacon, Egg & Cheese Biscuit (Large Biscuit)
## 9   9 Breakfast Bacon, Egg & Cheese Biscuit with Egg Whites (Regular Biscuit)
## 10 10 Breakfast   Bacon, Egg & Cheese Biscuit with Egg Whites (Large Biscuit)
##    Calories Fats Carbs Fiber Sugar Protein CategoryF
## 1       300   13    31     4     3      17 Breakfast
## 2       250    8    30     4     3      18 Breakfast
## 3       370   23    29     4     2      14 Breakfast
## 4       450   28    30     4     2      21 Breakfast
## 5       400   23    30     4     2      21 Breakfast
## 6       430   23    31     4     3      26 Breakfast
## 7       460   26    38     2     3      19 Breakfast
## 8       520   30    43     3     4      19 Breakfast
## 9       410   20    36     2     3      20 Breakfast
## 10      470   25    42     3     4      20 Breakfast

Explaining the data

The unit of observation in my sample is an item on the McDonald’s menu. The original size of the data set 260 units of observation (McDonald’s Menu Items) with 24 variables. I removed columns that included variables that did not hold potential for further hypothesis testing. After this the data set (mydata1) was left with 8 of the original variables. The variables that were left are:

Category: The subcategory of each food item that each Menu Item falls into (Breakfast, Main, Salads, Sides, Desserts, Beverages, Coffee & Tea & Smoothies & Shakes).
Menu Item: The name of each food item on the McDonald’s menu.
Calories: Amount of kilo calories of each food item on the McDonald’s menu.
Fats: Amount of fat in grams in each food item on the McDonald’s menu.
Carbs: Amount of carbohydrates in grams in each food item on the McDonald’s menu.
Fiber: Amount of fiber in grams in each food item on the McDonald’s menu.
Sugar: Amount of sugar in grams in each food item on the McDonald’s menu.
Protein: Amount of protein in grams in each food item on the McDonald’s menu.

The source of the above data was found on the Kaggle website, the author is McDonald’s. Retrieved January 9th, 2023, from https://www.kaggle.com/datasets/mcdonalds/nutrition-facts.

Using cluster analysis to classify the McDonald’s menu items into 2 or more groups based on their nutritional characteristics (Calories, Fats, Carbs, Fiber, Sugar and Protein).

Using stat.desc to present the general descriptive statistic of this data set and the sapply function to find the minimum and maximum values of each variable.

round(stat.desc((mydata1[c(-1,-2,-3,-10)]), basic = FALSE), 2)

##              Calories   Fats  Carbs Fiber  Sugar Protein
## median         340.00  11.00  44.00  1.00  17.50   12.00
## mean           368.27  14.17  47.35  1.63  29.42   13.34
## SE.mean         14.90   0.88   1.75  0.10   1.78    0.71
## CI.mean.0.95    29.34   1.73   3.45  0.19   3.50    1.40
## var          57729.62 201.81 798.19  2.46 822.53  130.56
## std.dev        240.27  14.21  28.25  1.57  28.68   11.43
## coef.var         0.65   1.00   0.60  0.96   0.97    0.86

sapply(mydata1[c(-1,-2,-3,-10)], FUN = min)

## Calories     Fats    Carbs    Fiber    Sugar  Protein 
##        0        0        0        0        0        0

sapply(mydata1[c(-1,-2,-3,-10)], FUN = max)

## Calories     Fats    Carbs    Fiber    Sugar  Protein 
##     1880      118      141        7      128       87

Checking the variance in the data of each individual variable in order to confirm that standardization of the variables is required.

var(mydata1$Fats)

## [1] 201.8104

var(mydata1$Carbs)

## [1] 798.1886

var(mydata1$Fiber)

## [1] 2.457737

var(mydata1$Sugar)

## [1] 822.5307

var(mydata1$Protein)

## [1] 130.5568

Since the variances have a very wide spread (from over 800 to just above 2), standardization has to be used in order to compare these variables and plot them in the correlation matrix.

Standardizing the variables using the scale function and adding them into the data set in their dedicated columns.

mydata1$FatsZ <- scale(mydata1$Fats)
mydata1$CarbsZ <- scale(mydata1$Carbs)
mydata1$FiberZ <- scale(mydata1$Fiber)
mydata1$SugarZ <- scale(mydata1$Sugar)
mydata1$ProteinZ <- scale(mydata1$Protein)

rcorr(as.matrix(mydata1[c(11,12,13,14,15)]),
      type = "pearson")

##          FatsZ CarbsZ FiberZ SugarZ ProteinZ
## FatsZ     1.00   0.46   0.58  -0.12     0.81
## CarbsZ    0.46   1.00   0.22   0.76     0.35
## FiberZ    0.58   0.22   1.00  -0.30     0.64
## SugarZ   -0.12   0.76  -0.30   1.00    -0.18
## ProteinZ  0.81   0.35   0.64  -0.18     1.00
## 
## n= 260 
## 
## 
## P
##          FatsZ  CarbsZ FiberZ SugarZ ProteinZ
## FatsZ           0.0000 0.0000 0.0631 0.0000  
## CarbsZ   0.0000        0.0003 0.0000 0.0000  
## FiberZ   0.0000 0.0003        0.0000 0.0000  
## SugarZ   0.0631 0.0000 0.0000        0.0036  
## ProteinZ 0.0000 0.0000 0.0000 0.0036

Since the correlation between sugar and fats is above 0.05 these variables are not correlated. However, the other variables seem to be quite highly correlated, which is not ideal for clustering. If other variables would be possible to select, that were less correlated, it would be best to do so, however, the data does not contain any other variables that could be used for clustering.

Introducing an additional variable named Dissimilarity and ordering it in order to find potential outliers.

mydata1$Dissimilarity <- sqrt(mydata1$FatsZ^2 + mydata1$CarbsZ^2 + mydata1$FiberZ^2 + mydata1$SugarZ^2 + mydata1$ProteinZ^2)
head(mydata1[order(-mydata1$Dissimilarity),],10)

##      ID  Category                                                         Name
## 83   83      Main                                 Chicken McNuggets (40 piece)
## 33   33 Breakfast                  Big Breakfast with Hotcakes (Large Biscuit)
## 35   35 Breakfast   Big Breakfast with Hotcakes and Egg Whites (Large Biscuit)
## 32   32 Breakfast                Big Breakfast with Hotcakes (Regular Biscuit)
## 254 254  Desserts                         McFlurry with M&M’s Candies (Medium)
## 247 247  Desserts                                     Strawberry Shake (Large)
## 250 250  Desserts                                      Chocolate Shake (Large)
## 34   34 Breakfast Big Breakfast with Hotcakes and Egg Whites (Regular Biscuit)
## 252 252  Desserts                                       Shamrock Shake (Large)
## 82   82      Main                                 Chicken McNuggets (20 piece)
##     Calories Fats Carbs Fiber Sugar Protein CategoryF     FatsZ    CarbsZ
## 83      1880  118   118     6     1      87      Main 7.3092095 2.5008235
## 33      1150   60   116     7    17      36 Breakfast 3.2264270 2.4300327
## 35      1050   50   115     7    18      35 Breakfast 2.5224990 2.3946372
## 32      1090   56   111     6    17      36 Breakfast 2.9448558 2.2530555
## 254      930   33   139     2   128      20  Desserts 1.3258213 3.2441277
## 247      850   24   140     0   123      18  Desserts 0.6922861 3.2795231
## 250      850   23   141     2   120      19  Desserts 0.6218933 3.3149185
## 34       990   46   110     6    17      35 Breakfast 2.2409278 2.2176601
## 252      820   23   135     0   115      18  Desserts 0.6218933 3.1025459
## 82       940   59    59     3     0      44      Main 3.1560342 0.4124929
##         FiberZ     SugarZ  ProteinZ Dissimilarity
## 83   2.7870021 -0.9910488 6.4467527     10.487556
## 33   3.4248723 -0.4331647 1.9833055      5.671488
## 35   3.4248723 -0.3982970 1.8957869      5.251637
## 32   2.7870021 -0.4331647 1.9833055      5.063294
## 254  0.2355213  3.4371556 0.5830083      4.948896
## 247 -1.0402191  3.2628168 0.4079712      4.809262
## 250  0.2355213  3.1582136 0.4954898      4.653034
## 34   2.7870021 -0.4331647 1.8957869      4.635601
## 252 -1.0402191  2.9838748 0.4079712      4.490495
## 82   0.8733915 -1.0259165 2.6834541      4.375722

Removing outliers found in previous step (those with a big gap in the dissimilarity).

mydata1 <- mydata1[c(-83,-33,-35,-82,-34,-32),]

Calculating the Euclidean distances that will be used to present the dissimilarity matrix.

distance <- get_dist(mydata1[c("FatsZ", "CarbsZ", "FiberZ", "SugarZ", "ProteinZ")], 
                     method = "euclidian")

distance2 <- distance^2 

fviz_dist(distance2)

The matrix above suggests that the data is clusterable, to further confirm the assumption, the Hopkins statistic is calculated in the next step.

get_clust_tendency(mydata1[c("FatsZ", "CarbsZ", "FiberZ", "SugarZ", "ProteinZ")],
                   n = nrow(mydata1) - 1, 
                   graph = FALSE)

## $hopkins_stat
## [1] 0.8710226
## 
## $plot
## NULL

The Hopkins statistic is above 0.5 (the advised level for clustering), meaning the assumption is confirmed and the data is clusterable.

WARD <- mydata1[c("FatsZ", "CarbsZ", "FiberZ", "SugarZ", "ProteinZ")] %>% 
  get_dist(method = "euclidean") %>% 
  hclust(method = "ward.D2")

fviz_dend(WARD)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <]8;;https://github.com/kassambara/factoextra/issueshttps://github.com/kassambara/factoextra/issues]8;;>.

Presenting the cluster dendrogram of the data using the Ward method for clustering.

Using the NbClust function to find the optimal number of clusters for the data.

OptNumberClusters <- mydata1[c("FatsZ", "CarbsZ", "FiberZ", "SugarZ", "ProteinZ")] %>%
  NbClust(distance = "euclidean",
          min.nc = 2, max.nc = 10, 
          method = "ward.D2", 
          index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 2 proposed 2 as the best number of clusters 
## * 8 proposed 3 as the best number of clusters 
## * 5 proposed 4 as the best number of clusters 
## * 3 proposed 5 as the best number of clusters 
## * 2 proposed 6 as the best number of clusters 
## * 1 proposed 9 as the best number of clusters 
## * 2 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

Based on the presented data, the overwhelming result of the majority rule states that 3 clusters should be used in order to cluster the data, therefore k = 3 will be used.

fviz_dend(WARD, 
          k = 3,
          cex = 0.5, 
          palette = c("#FFCC00", "#429698", "#5D2689"),
          color_labels_by_k = TRUE, 
          rect = TRUE)

Assigning each unit to one of the three formed clusters according to the Ward method.

mydata1$ClusterWARD <- cutree(WARD,
                              k =3)
head(mydata1,10)

##    ID  Category                                                          Name
## 1   1 Breakfast                                                  Egg McMuffin
## 2   2 Breakfast                                             Egg White Delight
## 3   3 Breakfast                                              Sausage McMuffin
## 4   4 Breakfast                                     Sausage McMuffin with Egg
## 5   5 Breakfast                              Sausage McMuffin with Egg Whites
## 6   6 Breakfast                                          Steak & Egg McMuffin
## 7   7 Breakfast                 Bacon, Egg & Cheese Biscuit (Regular Biscuit)
## 8   8 Breakfast                   Bacon, Egg & Cheese Biscuit (Large Biscuit)
## 9   9 Breakfast Bacon, Egg & Cheese Biscuit with Egg Whites (Regular Biscuit)
## 10 10 Breakfast   Bacon, Egg & Cheese Biscuit with Egg Whites (Large Biscuit)
##    Calories Fats Carbs Fiber Sugar Protein CategoryF       FatsZ     CarbsZ
## 1       300   13    31     4     3      17 Breakfast -0.08203469 -0.5785792
## 2       250    8    30     4     3      18 Breakfast -0.43399870 -0.6139746
## 3       370   23    29     4     2      14 Breakfast  0.62189333 -0.6493701
## 4       450   28    30     4     2      21 Breakfast  0.97385733 -0.6139746
## 5       400   23    30     4     2      21 Breakfast  0.62189333 -0.6139746
## 6       430   23    31     4     3      26 Breakfast  0.62189333 -0.5785792
## 7       460   26    38     2     3      19 Breakfast  0.83307173 -0.3308112
## 8       520   30    43     3     4      19 Breakfast  1.11464294 -0.1538340
## 9       410   20    36     2     3      20 Breakfast  0.41071492 -0.4016020
## 10      470   25    42     3     4      20 Breakfast  0.76267893 -0.1892294
##       FiberZ     SugarZ  ProteinZ Dissimilarity ClusterWARD
## 1  1.5112617 -0.9213133 0.3204526      1.891270           1
## 2  1.5112617 -0.9213133 0.4079712      1.965831           1
## 3  1.5112617 -0.9561810 0.0578969      2.002493           1
## 4  1.5112617 -0.9561810 0.6705269      2.230059           1
## 5  1.5112617 -0.9561810 0.6705269      2.100361           1
## 6  1.5112617 -0.9213133 1.1081198      2.254366           1
## 7  0.2355213 -0.9213133 0.4954898      1.397585           1
## 8  0.8733915 -0.8864455 0.4954898      1.749343           1
## 9  0.2355213 -0.9213133 0.5830083      1.254655           1
## 10 0.8733915 -0.8864455 0.5830083      1.583030           1

Finding the initial leaders using the already mentioned Ward method in the 3 original clusters.

Initial_leaders <- aggregate(mydata1[c("FatsZ", "CarbsZ", "FiberZ", "SugarZ", "ProteinZ")],
                            by = list(mydata1$ClusterWARD), 
                            FUN = mean)

Initial_leaders

##   Group.1      FatsZ     CarbsZ     FiberZ     SugarZ   ProteinZ
## 1       1  0.7617277 -0.1504858  0.8302922 -0.7648796  0.9673804
## 2       2 -0.5451678 -0.2418909 -0.4217962  0.1296093 -0.5552668
## 3       3  0.7274825  2.4123350 -0.5219496  2.4412454  0.1618252

Performing K-Means clustering

K_MEANS <- hkmeans(mydata1[c("FatsZ", "CarbsZ", "FiberZ", "SugarZ", "ProteinZ")], 
                   k = 3,
                   hc.metric = "euclidean",
                   hc.method = "ward.D2")

K_MEANS

## Hierarchical K-means clustering with 3 clusters of sizes 82, 113, 59
## 
## Cluster means:
##         FatsZ     CarbsZ     FiberZ     SugarZ   ProteinZ
## 1  0.67039568 -0.1283666  0.9200649 -0.7414467  0.8039394
## 2 -0.65670159 -0.6427922 -0.6450783 -0.2063701 -0.6693592
## 3 -0.03669695  1.2025910 -0.3158580  1.4886984 -0.1215903
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  21  22  23  24  25  26  27  28  29  30  31  36  37  38  39  40  41  42  43  44 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   2   2   1   1   1   1   1 
##  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 
##   1   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  84  85  86 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2   2   1   1   2   1 
##  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 
##   1   1   1   1   2   2   2   2   2   2   2   1   1   2   2   2   2   1   2   2 
## 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 
##   2   2   2   2   2   2   3   2   2   2   2   2   2   2   3   2   2   2   2   2 
## 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 
##   2   2   3   2   2   2   2   2   2   3   2   2   2   2   2   2   2   2   2   2 
## 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 
##   2   2   2   2   2   2   2   3   2   2   3   2   2   3   2   2   1   2   2   2 
## 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 
##   2   2   3   2   2   3   2   2   3   2   2   2   2   3   3   2   3   3   2   3 
## 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 
##   3   2   3   3   2   3   3   2   3   3   2   2   2   2   2   2   2   2   2   2 
## 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 
##   2   2   2   2   2   2   2   3   2   2   3   2   2   3   2   2   3   3   3   3 
## 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 
##   3   3   3   3   3   3   2   3   3   2   3   3   2   3   3   3   3   3   3   3 
## 247 248 249 250 251 252 253 254 255 256 257 258 259 260 
##   3   3   3   3   3   3   3   3   3   3   3   2   3   3 
## 
## Within cluster sum of squares by cluster:
## [1] 149.5657 148.7184 134.2054
##  (between_SS / total_SS =  59.0 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "data"        
## [11] "hclust"

Visual representation of the three formed clusters using the K-Means clustering method.

fviz_cluster(K_MEANS, 
             palette = c("#FFCC00", "#429698", "#5D2689"), 
             repel = FALSE,
             ggtheme = theme_classic())

On the above visualization, some additional outliers could be seen, (ID’s 82, 34 and 32) therefore these were also removed along with the outliers already noticed by the dissimilarity variable and the visualization was re-run.

mydata1$ClusterK_Means <- K_MEANS$cluster
head(mydata1[c("ID", "ClusterWARD", "ClusterK_Means")])

##   ID ClusterWARD ClusterK_Means
## 1  1           1              1
## 2  2           1              1
## 3  3           1              1
## 4  4           1              1
## 5  5           1              1
## 6  6           1              1

Checking original classifications and the re-classifications between the two methods

table(mydata1$ClusterWARD)

## 
##   1   2   3 
##  74 164  16

table(mydata1$ClusterK_Means)

## 
##   1   2   3 
##  82 113  59

table(mydata1$ClusterWARD, mydata1$ClusterK_Means)

##    
##       1   2   3
##   1  74   0   0
##   2   8 113  43
##   3   0   0  16

From the last table it can be seen that no items were re-classified from cluster 1 to either cluster 2 or 3. From cluster 2 8 menu items were re-classified into cluster 1 and 43 items were re-classified into cluster 3. Furthermore, no items were re-classified from cluster 3 to either cluster 1 or 2. Cluster sizes are now 82, 113 and 59 respectively.

Finding the centroids of the newly formed clusters.

Centroids <- K_MEANS$centers
Centroids

##         FatsZ     CarbsZ     FiberZ     SugarZ   ProteinZ
## 1  0.67039568 -0.1283666  0.9200649 -0.7414467  0.8039394
## 2 -0.65670159 -0.6427922 -0.6450783 -0.2063701 -0.6693592
## 3 -0.03669695  1.2025910 -0.3158580  1.4886984 -0.1215903

Figure <- as.data.frame(Centroids)
Figure$id <- 1:nrow(Figure)
Figure <- pivot_longer(Figure, cols = c(FatsZ, CarbsZ, FiberZ, SugarZ, ProteinZ))

Figure$Groups <- factor(Figure$id, 
                        levels = c(1, 2, 3), 
                        labels = c("1", "2", "3"))

Figure$nameFactor <- factor(Figure$name, 
                            levels = c("FatsZ", "CarbsZ", "FiberZ", "SugarZ", "ProteinZ"), 
                            labels = c("FatsZ", "CarbsZ", "FiberZ", "SugarZ", "ProteinZ"))

ggplot(Figure, aes(x = nameFactor, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Groups, col = Groups), size = 3) +
  geom_line(aes(group = id), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables")+
  ylim(-2, 2)

From the above found centroids and chart it can be stated that menu items found in cluster 3 are approximately average on fats, high in carbohydrate and sugar nutritional values and below average in fiber and protein values. The 1st cluster contains menu items that are above average in fat values, fiber values as well as protein, however these items are below average in carbohydrates and sugar nutritional values. The 2nd cluster is below average in all categories, with the sugar values coming the closest to the average of the menu.

Testing the clustering variables using ANOVA to determine which is best at separating menu items

H0: µ(var, cluster1) = µ(var, cluster2) = µ(var, cluster3)

H1:At least one µ(var, cluster) is different

fit <- aov(cbind(FatsZ, CarbsZ, FiberZ, SugarZ, ProteinZ) ~ as.factor(ClusterK_Means),
           data = mydata1)

summary(fit)

##  Response 1 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   2 83.862  41.931  132.41 < 2.2e-16 ***
## Residuals                 251 79.483   0.317                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 2 :
##                            Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   2 132.781  66.391  170.78 < 2.2e-16 ***
## Residuals                 251  97.574   0.389                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 3 :
##                            Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   2 121.305  60.652  170.76 < 2.2e-16 ***
## Residuals                 251  89.153   0.355                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 4 :
##                            Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   2 180.594  90.297  299.82 < 2.2e-16 ***
## Residuals                 251  75.595   0.301                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 5 :
##                            Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   2 103.376  51.688  143.06 < 2.2e-16 ***
## Residuals                 251  90.684   0.361                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the above ANOVA test output, it can be stated that the null hypothesis can be rejected for all clustering variables at p < 0.001. Meaning for all of the used variables the mean value (significantly) differs between the 3 clusters. From the F-statistic of the above results, the best separator of menu items is, by far, the sugar content. The worst (however, still significant) separator would be the fat content of menu items.

Checking for the differences in calories between the clusters

Checking the average calories of menu items by clusters.

describeBy(mydata1$Calories, mydata1$ClusterK_Means)

## 
##  Descriptive statistics by group 
## group: 1
##    vars  n   mean     sd median trimmed    mad min max range skew kurtosis
## X1    1 82 476.95 131.49    465   474.7 111.19 140 800   660 0.15    -0.08
##       se
## X1 14.52
## ------------------------------------------------------------ 
## group: 2
##    vars   n   mean     sd median trimmed    mad min max range  skew kurtosis
## X1    1 113 181.15 106.05    190  183.52 103.78   0 360   360 -0.29    -0.88
##      se
## X1 9.98
## ------------------------------------------------------------ 
## group: 3
##    vars  n   mean     sd median trimmed    mad min max range skew kurtosis   se
## X1    1 59 492.71 178.96    450  479.18 163.09 250 930   680 0.63    -0.63 23.3

aggregate(mydata1$Calories,
          by = list(mydata1$ClusterK_Means),
          FUN = "mean")

##   Group.1        x
## 1       1 476.9512
## 2       2 181.1504
## 3       3 492.7119

In order to make sure whether the average calories of each cluster are different from each other, we perform another ANOVA test.

H0: µ(cluster1, Calories) = µ(cluster2, Calories) = µ(cluster3, Calories)

H1: µ(cluster1, Calories) ≠ µ(cluster2, Calories) ≠ µ(cluster3, Calories)

fit_calories <- aov(Calories ~ as.factor(ClusterK_Means),
                    data = mydata1)
summary(fit_calories)

##                            Df  Sum Sq Mean Sq F value Pr(>F)    
## as.factor(ClusterK_Means)   2 5744600 2872300   159.6 <2e-16 ***
## Residuals                 251 4517704   17999                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value of the ANOVA test is below 0.001, meaning the null hypothesis is rejected and at least one mean is different, and the result is validated.

Checking the association between categories of menu items

Testing for the category into which one item falls (categorical variable), we cannot do an anova, so thepearson Chi2 test is performed.

H0: There is no association between clusters of the categorical variable.

H1: There is association between clusters of the categorical variable.

chisquare_category <- chisq.test(mydata1$CategoryF, as.factor(mydata1$ClusterK_Means))

## Warning in chisq.test(mydata1$CategoryF, as.factor(mydata1$ClusterK_Means)):
## Chi-squared approximation may be incorrect

chisquare_category

## 
##  Pearson's Chi-squared test
## 
## data:  mydata1$CategoryF and as.factor(mydata1$ClusterK_Means)
## X-squared = 240.3, df = 8, p-value < 2.2e-16

addmargins(chisquare_category$observed)

##                  
## mydata1$CategoryF   1   2   3 Sum
##    Breakfast       36   2   0  38
##    Main            37   3   0  40
##    Sides_&_Salads   7  12   0  19
##    Desserts         1  10  24  35
##    Beverages        1  86  35 122
##    Sum             82 113  59 254

round(chisquare_category$expected, 2)

##                  
## mydata1$CategoryF     1     2     3
##    Breakfast      12.27 16.91  8.83
##    Main           12.91 17.80  9.29
##    Sides_&_Salads  6.13  8.45  4.41
##    Desserts       11.30 15.57  8.13
##    Beverages      39.39 54.28 28.34

round(chisquare_category$residuals, 2)

##                  
## mydata1$CategoryF     1     2     3
##    Breakfast       6.78 -3.63 -2.97
##    Main            6.70 -3.51 -3.05
##    Sides_&_Salads  0.35  1.22 -2.10
##    Desserts       -3.06 -1.41  5.57
##    Beverages      -6.12  4.31  1.25

From the above results of the Chi2 test we can reject the null hypothesis (p < 0.001), meaning the is at least some association between the category into which a menu item is classified by McDonald’s and the cluster it will then be classified into. The additional information gathered from this test will be used as additional information in the final description of the clusters.

Final description of the clusters

The clustering of the original 260 items, after removing the outliers 254, was performed based on 5 standardized variables (FatsZ, CarbsZ, FiberZ, SugarZ and “ProteinZ).

Originally using the Ward method in order to perform hierarchical clustering and using the dendrogram and nbclust function to find the optimal number of clusters, which turned out to be 3 according to the majority rule. The final clusters were then optimized by the use of K-Means clustering and were sized as follows:

Cluster 1: 82 menu items
Cluster 2: 113 menu items
Cluster 3: 59 menu items

Cluster 1 therefore contains approximately 32.3 % of all menu items, cluster 2 contains approximately 44.5 % of all menu items and the final cluster contains approximately 23.2 % of all menu items.

The average item in cluster 2 contains approximately 181 calories, the lowest of all three clusters. It also scores the lowest in 4 categories (except the sugar content, where cluster group 1 scores even lower) and is below average in all 5 determining variables. Menu items in cluster 3 are approximately average on fats, high in carbohydrate and sugar nutritional values and below average in fiber and protein values and have on average approximately 493 calories. The 1st cluster contains menu items that are above average in fat values, fiber values as well as protein, however these items are below average in carbohydrates and sugar nutritional values. Cluster 1 menu items have, on average, approximately 477 calories.

Cluster 1 contains more than expected number of menu items from the breakfast category (alpha 0.1%) and more than expected number of menu items from the main items category (alpha 0.1%). However, it contains below expected number of items from the deserts (alpha 1%) and beverages (alpha 0.1%) categories.

Cluster 2 contains more less than expected number of menu items from both the breakfast and main items category (both alpha 1%). It does however contain more than expected number of Beverages (alpha 1%).

Finally, cluster 3 contains less than expected number of menu items from breakfast, main and sides & salads categories (alpha values of 1%, 1% and 5% respectively). This cluster also contains more than expected number of menu items from the desserts category (aplha 0.1%).

Homework4_Ziga_Zalar

Žiga Zalar

2023-02-03