INTRODUCTION

McDonald’s Corporation is an American fast food company, founded in 1940 as a restaurant operated by Richard and Maurice McDonald, in San Bernardino, California, United States.

This dataset provides a nutrition analysis of every menu item on the US McDonald’s menu, including breakfast, beef burgers, chicken and fish sandwiches, fries, salads, soda, coffee and tea, milkshakes, and desserts. The dataset is taken from Kaggle: https://www.kaggle.com/datasets/mcdonalds/nutrition-facts

Our goal is clustering: dividing the menu items into groups that reveal information in the dataset, so that each group ends up with its own nutrition characteristics.

We will use unsupervised learning: PCA and K-Means.

IMPORT LIBRARY

library(dplyr)        # data wrangling
library(FactoMineR)   # PCA
library(factoextra)   # PCA and cluster visualizations
library(tidyverse)    # ggplot2, tidyr, and friends
library(ggiraphExtra) # extra ggplot-based charts

IMPORT DATA

df <- read.csv("menu.csv")
str(df)
## 'data.frame':    260 obs. of  22 variables:
##  $ Item                         : chr  "Egg McMuffin" "Egg White Delight" "Sausage McMuffin" "Sausage McMuffin with Egg" ...
##  $ Calories                     : int  300 250 370 450 400 430 460 520 410 470 ...
##  $ Calories.from.Fat            : int  120 70 200 250 210 210 230 270 180 220 ...
##  $ Total.Fat                    : num  13 8 23 28 23 23 26 30 20 25 ...
##  $ Total.Fat....Daily.Value.    : int  20 12 35 43 35 36 40 47 32 38 ...
##  $ Saturated.Fat                : num  5 3 8 10 8 9 13 14 11 12 ...
##  $ Saturated.Fat....Daily.Value.: int  25 15 42 52 42 46 65 68 56 59 ...
##  $ Trans.Fat                    : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ Cholesterol                  : int  260 25 45 285 50 300 250 250 35 35 ...
##  $ Cholesterol....Daily.Value.  : int  87 8 15 95 16 100 83 83 11 11 ...
##  $ Sodium                       : int  750 770 780 860 880 960 1300 1410 1300 1420 ...
##  $ Sodium....Daily.Value.       : int  31 32 33 36 37 40 54 59 54 59 ...
##  $ Carbohydrates                : int  31 30 29 30 30 31 38 43 36 42 ...
##  $ Carbohydrates....Daily.Value.: int  10 10 10 10 10 10 13 14 12 14 ...
##  $ Dietary.Fiber                : int  4 4 4 4 4 4 2 3 2 3 ...
##  $ Dietary.Fiber....Daily.Value.: int  17 17 17 17 17 18 7 12 7 12 ...
##  $ Sugars                       : int  3 3 2 2 2 3 3 4 3 4 ...
##  $ Protein                      : int  17 18 14 21 21 26 19 19 20 20 ...
##  $ Vitamin.A....Daily.Value.    : int  10 6 8 15 6 15 10 15 2 6 ...
##  $ Vitamin.C....Daily.Value.    : int  0 0 0 0 0 2 8 8 8 8 ...
##  $ Calcium....Daily.Value.      : int  25 25 25 30 25 30 15 20 15 15 ...
##  $ Iron....Daily.Value.         : int  15 8 10 15 10 20 15 20 10 15 ...

As we can see, our data contains a character column (Item), which we have to remove, because these unsupervised learning methods work on numeric data only.

DATA CLEANING

# Check Missing Value
colSums(is.na(df))
##                          Item                      Calories 
##                             0                             0 
##             Calories.from.Fat                     Total.Fat 
##                             0                             0 
##     Total.Fat....Daily.Value.                 Saturated.Fat 
##                             0                             0 
## Saturated.Fat....Daily.Value.                     Trans.Fat 
##                             0                             0 
##                   Cholesterol   Cholesterol....Daily.Value. 
##                             0                             0 
##                        Sodium        Sodium....Daily.Value. 
##                             0                             0 
##                 Carbohydrates Carbohydrates....Daily.Value. 
##                             0                             0 
##                 Dietary.Fiber Dietary.Fiber....Daily.Value. 
##                             0                             0 
##                        Sugars                       Protein 
##                             0                             0 
##     Vitamin.A....Daily.Value.     Vitamin.C....Daily.Value. 
##                             0                             0 
##       Calcium....Daily.Value.          Iron....Daily.Value. 
##                             0                             0

We don’t have any NA values here, so we are good to continue.

df_num <- df %>% 
  select(-c(Total.Fat....Daily.Value., Saturated.Fat....Daily.Value.,
            Cholesterol....Daily.Value., Sodium....Daily.Value.,
            Carbohydrates....Daily.Value., Dietary.Fiber....Daily.Value.)) %>% 
  select_if(is.numeric)

df_num

We delete the “…Daily.Value.” columns because their information duplicates the raw nutrient columns; the exceptions are Vitamins, Calcium, and Iron, which are only reported as daily values.
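
As a quick sanity check (plain base R, nothing beyond what is already loaded), the cleaned frame should now hold the 15 numeric features used in the rest of the analysis:

# 22 original columns - Item - 6 duplicated "...Daily.Value." columns = 15 features
ncol(df_num)
names(df_num)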

EXPLORATORY DATA ANALYSIS

summary(df_num)
##     Calories      Calories.from.Fat   Total.Fat       Saturated.Fat   
##  Min.   :   0.0   Min.   :   0.0    Min.   :  0.000   Min.   : 0.000  
##  1st Qu.: 210.0   1st Qu.:  20.0    1st Qu.:  2.375   1st Qu.: 1.000  
##  Median : 340.0   Median : 100.0    Median : 11.000   Median : 5.000  
##  Mean   : 368.3   Mean   : 127.1    Mean   : 14.165   Mean   : 6.008  
##  3rd Qu.: 500.0   3rd Qu.: 200.0    3rd Qu.: 22.250   3rd Qu.:10.000  
##  Max.   :1880.0   Max.   :1060.0    Max.   :118.000   Max.   :20.000  
##    Trans.Fat       Cholesterol         Sodium       Carbohydrates   
##  Min.   :0.0000   Min.   :  0.00   Min.   :   0.0   Min.   :  0.00  
##  1st Qu.:0.0000   1st Qu.:  5.00   1st Qu.: 107.5   1st Qu.: 30.00  
##  Median :0.0000   Median : 35.00   Median : 190.0   Median : 44.00  
##  Mean   :0.2038   Mean   : 54.94   Mean   : 495.8   Mean   : 47.35  
##  3rd Qu.:0.0000   3rd Qu.: 65.00   3rd Qu.: 865.0   3rd Qu.: 60.00  
##  Max.   :2.5000   Max.   :575.00   Max.   :3600.0   Max.   :141.00  
##  Dietary.Fiber       Sugars          Protein      Vitamin.A....Daily.Value.
##  Min.   :0.000   Min.   :  0.00   Min.   : 0.00   Min.   :  0.00           
##  1st Qu.:0.000   1st Qu.:  5.75   1st Qu.: 4.00   1st Qu.:  2.00           
##  Median :1.000   Median : 17.50   Median :12.00   Median :  8.00           
##  Mean   :1.631   Mean   : 29.42   Mean   :13.34   Mean   : 13.43           
##  3rd Qu.:3.000   3rd Qu.: 48.00   3rd Qu.:19.00   3rd Qu.: 15.00           
##  Max.   :7.000   Max.   :128.00   Max.   :87.00   Max.   :170.00           
##  Vitamin.C....Daily.Value. Calcium....Daily.Value. Iron....Daily.Value.
##  Min.   :  0.000           Min.   : 0.00           Min.   : 0.000      
##  1st Qu.:  0.000           1st Qu.: 6.00           1st Qu.: 0.000      
##  Median :  0.000           Median :20.00           Median : 4.000      
##  Mean   :  8.535           Mean   :20.97           Mean   : 7.735      
##  3rd Qu.:  4.000           3rd Qu.:30.00           3rd Qu.:15.000      
##  Max.   :240.000           Max.   :70.00           Max.   :40.000

Above is the summary of the data before standardization (scaling). The variance of each variable differs greatly because the range/scale of each variable is different, and the same goes for the covariance. Variance and covariance are affected by the scale of the data: the larger the scale, the larger the variance and covariance.

Data with large differences in scale between variables is not good for direct PCA analysis because it biases the components: PC1 appears to capture almost all of the variance, while the subsequent PCs appear to provide no information.

# variance captured by each PC on the unscaled data
plot(prcomp(x = df_num))

As we can see, most of the information in the data is absorbed by Sodium, because Sodium is measured in hundreds (of mg) and therefore has a much higher variance than the other variables. To fix this, we scale our data.
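
We can check this directly by comparing the raw column variances (a quick base-R check, not part of the original workflow):

# raw variances: Sodium, measured in hundreds of mg, dwarfs everything else
sort(apply(df_num, 2, var), decreasing = TRUE)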

DATA PRE-PROCESSING

# scaling 
df_scaled <- scale(df_num)
head(df_scaled)
##          Calories Calories.from.Fat   Total.Fat Saturated.Fat  Trans.Fat
## [1,] -0.284135610        -0.0554925 -0.08203469    -0.1893492 -0.4750187
## [2,] -0.492234930        -0.4464965 -0.43399870    -0.5651567 -0.4750187
## [3,]  0.007203438         0.5701140  0.62189333     0.3743621 -0.4750187
## [4,]  0.340162350         0.9611180  0.97385733     0.7501696 -0.4750187
## [5,]  0.132063030         0.6483148  0.62189333     0.3743621 -0.4750187
## [6,]  0.256922622         0.6483148  0.62189333     0.5622659  1.8552618
##      Cholesterol    Sodium Carbohydrates Dietary.Fiber     Sugars   Protein
## [1,]  2.34971281 0.4406211    -0.5785792      1.511262 -0.9213133 0.3204526
## [2,] -0.34310258 0.4752816    -0.6139746      1.511262 -0.9213133 0.4079712
## [3,] -0.11392681 0.4926118    -0.6493701      1.511262 -0.9561810 0.0578969
## [4,]  2.63618253 0.6312537    -0.6139746      1.511262 -0.9561810 0.6705269
## [5,] -0.05663286 0.6659142    -0.6139746      1.511262 -0.9561810 0.6705269
## [6,]  2.80806437 0.8045560    -0.5785792      1.511262 -0.9213133 1.1081198
##      Vitamin.A....Daily.Value. Vitamin.C....Daily.Value.
## [1,]               -0.14064145                -0.3239491
## [2,]               -0.30480206                -0.3239491
## [3,]               -0.22272176                -0.3239491
## [4,]                0.06455932                -0.3239491
## [5,]               -0.30480206                -0.3239491
## [6,]                0.06455932                -0.2480350
##      Calcium....Daily.Value. Iron....Daily.Value.
## [1,]               0.2366001           0.83287462
## [2,]               0.2366001           0.03042263
## [3,]               0.2366001           0.25969463
## [4,]               0.5303730           0.83287462
## [5,]               0.2366001           0.25969463
## [6,]               0.5303730           1.40605462
# plot the variance captured by each PC
plot(prcomp(x=df_scaled))

Now the information is spread across several PCs and every variable is on the same scale, which is much better than before.
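
As a quick sanity check on the scaling, every column of df_scaled should now have mean ~0 and standard deviation 1:

# after scale(): column means ~0 (up to floating point) and sd exactly 1
round(colMeans(df_scaled), 3)
apply(df_scaled, 2, sd)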

PRINCIPAL COMPONENT ANALYSIS

To perform principal component analysis in R, we can use the PCA() function from FactoMineR.

pca <- PCA(df_scaled, scale.unit = FALSE)  # data is already scaled, so skip re-scaling

# Check PCA result
pca$eig
##           eigenvalue percentage of variance cumulative percentage of variance
## comp 1  7.2931170935           48.808505645                          48.80851
## comp 2  2.8361845557           18.980900502                          67.78941
## comp 3  1.2625379084            8.449417148                          76.23882
## comp 4  0.9793778199            6.554394677                          82.79322
## comp 5  0.7039996252            4.711451803                          87.50467
## comp 6  0.5507470055            3.685822945                          91.19049
## comp 7  0.4782480875            3.200630701                          94.39112
## comp 8  0.3639024398            2.435383124                          96.82651
## comp 9  0.2330115948            1.559408356                          98.38591
## comp 10 0.1398754984            0.936103721                          99.32202
## comp 11 0.0462923357            0.309807138                          99.63183
## comp 12 0.0396311241            0.265227600                          99.89705
## comp 13 0.0147668428            0.098825717                          99.99588
## comp 14 0.0003353829            0.002244519                          99.99812
## comp 15 0.0002803780            0.001876404                         100.00000

Insight 📌 :

The biplot above shows what our data looks like in PC space. The arrows show how each variable (column) contributes to the PCs: a near-horizontal arrow contributes mainly to PC1 (Dim1), while a near-vertical arrow contributes mainly to PC2 (Dim2).

The numbers are the observations (rows of the data), and we can see that observation 83 stands out as an outlier.

So PCA can be used to detect outliers too.
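
One way to confirm this numerically (a small sketch of our own, not part of the original workflow) is to rank observations by their distance from the origin in the Dim1–Dim2 plane; observation 83 should appear at the top:

# distance of each observation from the origin in the Dim1-Dim2 plane
dist_pc <- sqrt(rowSums(pca$ind$coord[, 1:2]^2))
head(sort(dist_pc, decreasing = TRUE))  # the most extreme observations first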

fviz_eig(pca, ncp = 15, addlabels = T, main = "Variance explained by each dimensions")

Almost 68% of the variance can be explained using only the first 2 dimensions, with the first dimension alone explaining 48.8% of the total variance.

We can keep around 80% of the information in our data by using only 4 dimensions. This means we can reduce the number of features in our dataset from 15 to just 4.
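
Rather than reading the cumulative percentages off the table, we can also pick that number programmatically from the pca$eig matrix shown above (a one-liner sketch):

# smallest number of PCs whose cumulative explained variance reaches 80%
which(pca$eig[, "cumulative percentage of variance"] >= 80)[1]  # should be 4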

We can extract the values of PC1 to PC4 for all observations and put them into a new data frame, like below. This data frame can later be analyzed with a supervised learning classification technique, or used for other purposes.

df_80 <- as.data.frame(pca$ind$coord)[,1:4]
head(df_80)
fviz_contrib(
  X = pca,        # the PCA model
  choice = "var", # show variable contributions
  axes = 1        # for the 1st PC
)

Above we can see the variables (columns) that contribute most to PC1/Dim1; changing the axes argument shows the same for the other PCs.

CLUSTERING: K-Means

Now we apply the K-Means method to cluster our data. The algorithm works as follows:

  • Randomly assign a number, from 1 to K, to each of the observations. These serve as the initial cluster assignments.

  • Iterate until the cluster assignments stop changing: for each of the K clusters, compute the cluster centroid (the vector of the p feature means for the observations in the kth cluster), then assign each observation to the cluster whose centroid is closest (using euclidean distance or any other distance measure). A minimal sketch of this loop is shown after the list.
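
To make those two steps concrete, below is a minimal, illustrative sketch of the loop in R. It is our own sketch, not the author’s code: it assumes euclidean distance and that no cluster ever becomes empty; in practice we use the optimized built-in kmeans().

manual_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  cl <- sample(seq_len(k), nrow(x), replace = TRUE)    # step 1: random initial labels
  for (i in seq_len(max_iter)) {
    # step 2a: each centroid is the mean of its members' features
    centers <- t(sapply(seq_len(k), function(j) colMeans(x[cl == j, , drop = FALSE])))
    # step 2b: distance from every observation to every centroid
    d <- as.matrix(dist(rbind(centers, x)))[-seq_len(k), seq_len(k)]
    new_cl <- max.col(-d)                              # nearest centroid wins
    if (all(new_cl == cl)) break                       # assignments stopped changing
    cl <- new_cl
  }
  cl
}

# e.g. table(manual_kmeans(df_scaled, k = 6))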

DATA CLEANING

# keep the menu names as row names so items stay identifiable after clustering
rownames(df) <- df$Item

df_clean <- df %>% 
  select(-c(Item, Total.Fat....Daily.Value., Saturated.Fat....Daily.Value.,
            Cholesterol....Daily.Value., Sodium....Daily.Value.,
            Carbohydrates....Daily.Value., Dietary.Fiber....Daily.Value.))

df_num_scale <- scale(df_clean)

First, we turn Item into the row index, because we want to keep seeing the menu names after the clustering; K-Means, too, works on numeric data only.

Then we delete the Item column, along with the same “…Daily.Value.” columns we removed during the first cleaning above.

K-Means requires scaled data because the method is built on a distance function; if we skip scaling, the result will be dominated by the variables with the largest scales.
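
A tiny illustration of why (a two-item, two-column comparison of our own, not part of the original analysis): on the raw scale, the Sodium difference between two items swamps the Total.Fat difference in the euclidean distance, while after scaling Sodium no longer dominates.

# distance between the first two menu items, raw vs scaled
dist(df_clean[1:2, c("Total.Fat", "Sodium")])      # dominated by Sodium (raw mg)
dist(df_num_scale[1:2, c("Total.Fat", "Sodium")])  # Sodium no longer dominates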

summary(df_num_scale)
##     Calories       Calories.from.Fat   Total.Fat       Saturated.Fat    
##  Min.   :-1.5327   Min.   :-0.9939   Min.   :-0.9971   Min.   :-1.1289  
##  1st Qu.:-0.6587   1st Qu.:-0.8375   1st Qu.:-0.8300   1st Qu.:-0.9410  
##  Median :-0.1177   Median :-0.2119   Median :-0.2228   Median :-0.1893  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5483   3rd Qu.: 0.5701   3rd Qu.: 0.5691   3rd Qu.: 0.7502  
##  Max.   : 6.2918   Max.   : 7.2954   Max.   : 7.3092   Max.   : 2.6292  
##    Trans.Fat       Cholesterol          Sodium        Carbohydrates    
##  Min.   :-0.475   Min.   :-0.6296   Min.   :-0.8591   Min.   :-1.6758  
##  1st Qu.:-0.475   1st Qu.:-0.5723   1st Qu.:-0.6728   1st Qu.:-0.6140  
##  Median :-0.475   Median :-0.2285   Median :-0.5299   Median :-0.1184  
##  Mean   : 0.000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.475   3rd Qu.: 0.1152   3rd Qu.: 0.6399   3rd Qu.: 0.4479  
##  Max.   : 5.351   Max.   : 5.9592   Max.   : 5.3797   Max.   : 3.3149  
##  Dietary.Fiber         Sugars           Protein       
##  Min.   :-1.0402   Min.   :-1.0259   Min.   :-1.1674  
##  1st Qu.:-1.0402   1st Qu.:-0.8254   1st Qu.:-0.8173  
##  Median :-0.4023   Median :-0.4157   Median :-0.1171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.8734   3rd Qu.: 0.6477   3rd Qu.: 0.4955  
##  Max.   : 3.4249   Max.   : 3.4372   Max.   : 6.4468  
##  Vitamin.A....Daily.Value. Vitamin.C....Daily.Value. Calcium....Daily.Value.
##  Min.   :-0.55104          Min.   :-0.3239           Min.   :-1.23226       
##  1st Qu.:-0.46896          1st Qu.:-0.3239           1st Qu.:-0.87974       
##  Median :-0.22272          Median :-0.3239           Median :-0.05717       
##  Mean   : 0.00000          Mean   : 0.0000           Mean   : 0.00000       
##  3rd Qu.: 0.06456          3rd Qu.:-0.1721           3rd Qu.: 0.53037       
##  Max.   : 6.42578          Max.   : 8.7858           Max.   : 2.88056       
##  Iron....Daily.Value.
##  Min.   :-0.8867     
##  1st Qu.:-0.8867     
##  Median :-0.4281     
##  Mean   : 0.0000     
##  3rd Qu.: 0.8329     
##  Max.   : 3.6988

ELBOW METHOD

Choosing the number of clusters with the elbow method is somewhat arbitrary. The rule of thumb is to pick the number of clusters around the “bend of the elbow”, where the total within-cluster sum of squares starts to stagnate as the number of clusters increases.
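
What the elbow plot reports is just kmeans()$tot.withinss over a range of k; a hand-rolled sketch of the same curve (nstart is our own addition, for stability):

# total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) kmeans(df_num_scale, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")

The fviz_nbclust() call below produces the same curve with a nicer ggplot theme.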

fviz_nbclust(
  x = df_num_scale,
  FUNcluster = kmeans,
  method = "wss"
)+ labs(subtitle = "Elbow method")

Using the elbow method, 3 clusters looks good enough, since there is no significant decline in the total within-cluster sum of squares at higher numbers of clusters. This method alone may not be enough, though, because the location of the elbow is vague.

SILHOUETTE METHOD

The silhouette method measures the silhouette coefficient, calculated from the mean intra-cluster distance and the mean nearest-cluster distance for each observation. We get the optimal number of clusters by choosing the number with the highest average silhouette score (the peak).
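
Under the hood, the score plotted for each k is the average silhouette width. A sketch for a single candidate k, assuming the cluster package (which provides silhouette()) is installed:

library(cluster)  # for silhouette()

km6  <- kmeans(df_num_scale, centers = 6, nstart = 25)
sil6 <- silhouette(km6$cluster, dist(df_num_scale))
mean(sil6[, "sil_width"])  # fviz_nbclust plots this value for each k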

fviz_nbclust(df_num_scale, kmeans, "silhouette") + labs(subtitle = "Silhouette method")

In the silhouette plot above, the number of clusters with the maximum score is taken as the optimal k. The graph shows that the optimal number of clusters is 6.

CLUSTERING

Now let’s cluster the data with the K-Means method.

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)

df_kmeans <- kmeans(df_num_scale, centers=6)

df_kmeans$withinss
## [1] 251.3838 179.1372 189.7807 110.3061 295.0833 364.4706
df_kmeans$tot.withinss
## [1] 1390.162
df_kmeans$size
## [1] 11 68 71 30 13 67

We get 6 clusters (groups); the size output shows how many menu items fall into each cluster.

df_kmeans$centers
##     Calories Calories.from.Fat  Total.Fat Saturated.Fat  Trans.Fat Cholesterol
## 1  2.5460151         2.8805924  2.8872617     1.9971673  0.9019652   2.8497327
## 2 -0.2859718        -0.5413725 -0.5349296    -0.4200839 -0.4407499  -0.3784900
## 3 -0.9362778        -0.7268079 -0.7244929    -0.8139307 -0.4750187  -0.5117566
## 4  1.1031932         0.5284069  0.5209970     1.3702521  1.6610717   0.1687233
## 5 -0.5674708        -0.5397360 -0.5558324    -0.7964229 -0.4750187  -0.4092110
## 6  0.4805517         0.7148438  0.7111976     0.5019684  0.1510268   0.4624332
##       Sodium Carbohydrates Dietary.Fiber     Sugars    Protein
## 1  2.5706642    1.17188589     1.9751673 -0.6867484  2.4049859
## 2 -0.5584158    0.07779788    -0.1959791  0.4723712 -0.2072329
## 3 -0.6513055   -0.78397243    -0.7347601 -0.3521341 -0.9232972
## 4 -0.4637281    1.75162019    -0.5086606  1.8901897  0.0170549
## 5 -0.4252239   -0.45061110     0.9224584 -0.3956149 -0.3662316
## 6  1.1250360   -0.13745701     0.7020234 -0.7631073  0.8573203
##   Vitamin.A....Daily.Value. Vitamin.C....Daily.Value. Calcium....Daily.Value.
## 1               -0.17048883               -0.09620661             0.006923178
## 2               -0.03381634               -0.24189484             0.780079879
## 3               -0.43370283               -0.20366259            -1.003866034
## 4                0.28344014               -0.32394913             1.382314200
## 5                1.92399396                3.36956726            -0.712512126
## 6                0.02168155               -0.03162292            -0.209759204
##   Iron....Daily.Value.
## 1            2.3960928
## 2           -0.5343282
## 3           -0.6977582
## 4           -0.3555186
## 5           -0.1635768
## 6            1.0792565
df_clean$cluster <- as.factor(df_kmeans$cluster)

Now we attach the cluster labels to the df_clean data frame as a factor column.

head(df_clean)

GROUPING THE DATA BASED ON LABEL

We group the data by the clusters that were formed, to find out the character of each one.

df_clean %>% 
  group_by(cluster) %>% 
  summarise_all(mean)

Now we have 6 clusters, each with its own nutrition characteristics. For example, the cluster with high Calories is cluster 1, so if we want a menu item high in calories, we choose one from cluster 1.

FILTERING DATA BASED ON CLUSTER LABEL

We can filter the data by cluster to make it easier for the owner/cashier to find menu items similar to a given one.

df_clean[df_clean$cluster==2,]

For example, suppose we want to order the Hotcakes, but they are sold out. In that case we can just look within cluster 2 to find other menu items with nutrition similar to the Hotcakes.
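
A small hypothetical helper (the function name and its reliance on the menu-name row names are our own, not part of the original code) makes that lookup direct:

# list every menu item that shares a cluster with the given item
similar_menu <- function(item, data = df_clean) {
  rownames(data)[data$cluster == data[item, "cluster"]]
}

head(similar_menu("Hotcakes"))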

df_centroid <- df_clean %>% 
  group_by(cluster) %>% 
  summarise_all(mean)

df_centroid %>% 
  pivot_longer(-cluster) %>% 
  group_by(name) %>% 
  summarize(
    group_min = which.min(value),
    group_max = which.max(value))

To make it easier to find menus with similar nutrition, the table above reports two groups per nutrient: the cluster with the minimum mean value and the cluster with the maximum.

How do we read it?

  • If we want food/drink with high Calories, we choose a menu item from cluster 1.

  • If we want food/drink with high Carbohydrates, we choose a menu item from cluster 4.

  • If we want food/drink with less Sugars, we choose a menu item from cluster 6.

  • If we want food/drink that is neither high nor low in Calories, we choose a menu item from cluster 2.

  • Cluster 2 does not appear in the table above, which means cluster 2 sits in the middle: neither the highest nor the lowest in any nutrient.

COMBINING CLUSTER WITH PCA

# project the observations onto the first two PCs and color them by cluster
fviz_cluster(object = df_kmeans, 
             data = df_num_scale,
             labelsize = 0)

SUMMARY

From the unsupervised learning analysis above, we can summarize that:

  • Using K-Means we can cluster (divide into groups) our data. We obtained 6 clusters, each with its own nutrition characteristics.
  • Using PCA, we can reduce the dimensionality from 15 features to just 4 dimensions while keeping about 80% of the information in the data.
  • The reduced data set obtained from unsupervised learning (e.g. PCA) can be utilized further for supervised learning (classification) or for better visualization of high-dimensional data, with various insights.