McDonald’s Corporation is an American fast food company, founded in 1940 as a restaurant operated by Richard and Maurice McDonald, in San Bernardino, California, United States.
This dataset provides a nutrition analysis of every menu item on the US McDonald’s menu, including breakfast, beef burgers, chicken and fish sandwiches, fries, salads, soda, coffee and tea, milkshakes, and desserts. The dataset is taken from Kaggle: https://www.kaggle.com/datasets/mcdonalds/nutrition-facts
Our goal is clustering: dividing the menu items into groups in a way that surfaces information from the dataset, so that each group ends up with its own nutritional character.
We will use unsupervised learning: PCA and K-Means.
library(dplyr)
library(FactoMineR)
library(factoextra)
library(tidyverse)
library(ggiraphExtra)

df <- read.csv("menu.csv")
str(df)

## 'data.frame': 260 obs. of 22 variables:
## $ Item : chr "Egg McMuffin" "Egg White Delight" "Sausage McMuffin" "Sausage McMuffin with Egg" ...
## $ Calories : int 300 250 370 450 400 430 460 520 410 470 ...
## $ Calories.from.Fat : int 120 70 200 250 210 210 230 270 180 220 ...
## $ Total.Fat : num 13 8 23 28 23 23 26 30 20 25 ...
## $ Total.Fat....Daily.Value. : int 20 12 35 43 35 36 40 47 32 38 ...
## $ Saturated.Fat : num 5 3 8 10 8 9 13 14 11 12 ...
## $ Saturated.Fat....Daily.Value.: int 25 15 42 52 42 46 65 68 56 59 ...
## $ Trans.Fat : num 0 0 0 0 0 1 0 0 0 0 ...
## $ Cholesterol : int 260 25 45 285 50 300 250 250 35 35 ...
## $ Cholesterol....Daily.Value. : int 87 8 15 95 16 100 83 83 11 11 ...
## $ Sodium : int 750 770 780 860 880 960 1300 1410 1300 1420 ...
## $ Sodium....Daily.Value. : int 31 32 33 36 37 40 54 59 54 59 ...
## $ Carbohydrates : int 31 30 29 30 30 31 38 43 36 42 ...
## $ Carbohydrates....Daily.Value.: int 10 10 10 10 10 10 13 14 12 14 ...
## $ Dietary.Fiber : int 4 4 4 4 4 4 2 3 2 3 ...
## $ Dietary.Fiber....Daily.Value.: int 17 17 17 17 17 18 7 12 7 12 ...
## $ Sugars : int 3 3 2 2 2 3 3 4 3 4 ...
## $ Protein : int 17 18 14 21 21 26 19 19 20 20 ...
## $ Vitamin.A....Daily.Value. : int 10 6 8 15 6 15 10 15 2 6 ...
## $ Vitamin.C....Daily.Value. : int 0 0 0 0 0 2 8 8 8 8 ...
## $ Calcium....Daily.Value. : int 25 25 25 30 25 30 15 20 15 15 ...
## $ Iron....Daily.Value. : int 15 8 10 15 10 20 15 20 10 15 ...
As we can see, the data contains a chr column, which we have to remove because these unsupervised learning methods work on numeric data only.
# Check Missing Value
colSums(is.na(df))

## Item Calories
## 0 0
## Calories.from.Fat Total.Fat
## 0 0
## Total.Fat....Daily.Value. Saturated.Fat
## 0 0
## Saturated.Fat....Daily.Value. Trans.Fat
## 0 0
## Cholesterol Cholesterol....Daily.Value.
## 0 0
## Sodium Sodium....Daily.Value.
## 0 0
## Carbohydrates Carbohydrates....Daily.Value.
## 0 0
## Dietary.Fiber Dietary.Fiber....Daily.Value.
## 0 0
## Sugars Protein
## 0 0
## Vitamin.A....Daily.Value. Vitamin.C....Daily.Value.
## 0 0
## Calcium....Daily.Value. Iron....Daily.Value.
## 0 0
There are no NA values here, so we are good to continue.
df_num <- df %>%
select(-c(Total.Fat....Daily.Value.,Saturated.Fat....Daily.Value.,Cholesterol....Daily.Value.,Sodium....Daily.Value.,Carbohydrates....Daily.Value.,Dietary.Fiber....Daily.Value.)) %>%
select_if(is.numeric)
df_num

We drop the “…Daily.Value.” columns because they duplicate the information in the corresponding raw nutrient columns; the exceptions are “Vitamins”, “Calcium” and “Iron”, which have no raw counterpart.
summary(df_num)

## Calories Calories.from.Fat Total.Fat Saturated.Fat
## Min. : 0.0 Min. : 0.0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 210.0 1st Qu.: 20.0 1st Qu.: 2.375 1st Qu.: 1.000
## Median : 340.0 Median : 100.0 Median : 11.000 Median : 5.000
## Mean : 368.3 Mean : 127.1 Mean : 14.165 Mean : 6.008
## 3rd Qu.: 500.0 3rd Qu.: 200.0 3rd Qu.: 22.250 3rd Qu.:10.000
## Max. :1880.0 Max. :1060.0 Max. :118.000 Max. :20.000
## Trans.Fat Cholesterol Sodium Carbohydrates
## Min. :0.0000 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.: 5.00 1st Qu.: 107.5 1st Qu.: 30.00
## Median :0.0000 Median : 35.00 Median : 190.0 Median : 44.00
## Mean :0.2038 Mean : 54.94 Mean : 495.8 Mean : 47.35
## 3rd Qu.:0.0000 3rd Qu.: 65.00 3rd Qu.: 865.0 3rd Qu.: 60.00
## Max. :2.5000 Max. :575.00 Max. :3600.0 Max. :141.00
## Dietary.Fiber Sugars Protein Vitamin.A....Daily.Value.
## Min. :0.000 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.:0.000 1st Qu.: 5.75 1st Qu.: 4.00 1st Qu.: 2.00
## Median :1.000 Median : 17.50 Median :12.00 Median : 8.00
## Mean :1.631 Mean : 29.42 Mean :13.34 Mean : 13.43
## 3rd Qu.:3.000 3rd Qu.: 48.00 3rd Qu.:19.00 3rd Qu.: 15.00
## Max. :7.000 Max. :128.00 Max. :87.00 Max. :170.00
## Vitamin.C....Daily.Value. Calcium....Daily.Value. Iron....Daily.Value.
## Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 6.00 1st Qu.: 0.000
## Median : 0.000 Median :20.00 Median : 4.000
## Mean : 8.535 Mean :20.97 Mean : 7.735
## 3rd Qu.: 4.000 3rd Qu.:30.00 3rd Qu.:15.000
## Max. :240.000 Max. :70.00 Max. :40.000
Above is the distribution of the data before standardization (scaling). The variance of each variable differs greatly because the range/scale of each variable differs, and the same goes for the covariance: variance and covariance values are affected by the scale of the data, so the larger the scale, the larger the variance or covariance.
Data with big scale differences between variables are not good for direct PCA analysis because the result is biased: PC1 appears to capture nearly all of the variance, and the subsequent PCs appear to provide no information.
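We can see this directly by listing the per-column variances of the unscaled data (a quick check on the df_num data frame built above):

# variances of the unscaled columns, largest first --
# Sodium dwarfs everything else
sort(apply(df_num, 2, var), decreasing = TRUE)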
plot(prcomp(x=df_num))

As we can see, most of the variance collapses into a single component dominated by Sodium, because Sodium is on a scale of hundreds (very high variance) compared to the other variables. To put the data in good condition, we scale it.
# scaling
df_scaled <- scale(df_num)
head(df_scaled)

## Calories Calories.from.Fat Total.Fat Saturated.Fat Trans.Fat
## [1,] -0.284135610 -0.0554925 -0.08203469 -0.1893492 -0.4750187
## [2,] -0.492234930 -0.4464965 -0.43399870 -0.5651567 -0.4750187
## [3,] 0.007203438 0.5701140 0.62189333 0.3743621 -0.4750187
## [4,] 0.340162350 0.9611180 0.97385733 0.7501696 -0.4750187
## [5,] 0.132063030 0.6483148 0.62189333 0.3743621 -0.4750187
## [6,] 0.256922622 0.6483148 0.62189333 0.5622659 1.8552618
## Cholesterol Sodium Carbohydrates Dietary.Fiber Sugars Protein
## [1,] 2.34971281 0.4406211 -0.5785792 1.511262 -0.9213133 0.3204526
## [2,] -0.34310258 0.4752816 -0.6139746 1.511262 -0.9213133 0.4079712
## [3,] -0.11392681 0.4926118 -0.6493701 1.511262 -0.9561810 0.0578969
## [4,] 2.63618253 0.6312537 -0.6139746 1.511262 -0.9561810 0.6705269
## [5,] -0.05663286 0.6659142 -0.6139746 1.511262 -0.9561810 0.6705269
## [6,] 2.80806437 0.8045560 -0.5785792 1.511262 -0.9213133 1.1081198
## Vitamin.A....Daily.Value. Vitamin.C....Daily.Value.
## [1,] -0.14064145 -0.3239491
## [2,] -0.30480206 -0.3239491
## [3,] -0.22272176 -0.3239491
## [4,] 0.06455932 -0.3239491
## [5,] -0.30480206 -0.3239491
## [6,] 0.06455932 -0.2480350
## Calcium....Daily.Value. Iron....Daily.Value.
## [1,] 0.2366001 0.83287462
## [2,] 0.2366001 0.03042263
## [3,] 0.2366001 0.25969463
## [4,] 0.5303730 0.83287462
## [5,] 0.2366001 0.25969463
## [6,] 0.5303730 1.40605462
# look at the variance summarized by each PC (plot)
plot(prcomp(x=df_scaled))

Now the information is spread across the other PCs and everything is on the same scale, which is better than before.
To perform principal component analysis in R we can use the
PCA() function from FactoMineR.

pca <- PCA(df_scaled, scale.unit = FALSE)

# Check PCA result
pca$eig

## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 7.2931170935 48.808505645 48.80851
## comp 2 2.8361845557 18.980900502 67.78941
## comp 3 1.2625379084 8.449417148 76.23882
## comp 4 0.9793778199 6.554394677 82.79322
## comp 5 0.7039996252 4.711451803 87.50467
## comp 6 0.5507470055 3.685822945 91.19049
## comp 7 0.4782480875 3.200630701 94.39112
## comp 8 0.3639024398 2.435383124 96.82651
## comp 9 0.2330115948 1.559408356 98.38591
## comp 10 0.1398754984 0.936103721 99.32202
## comp 11 0.0462923357 0.309807138 99.63183
## comp 12 0.0396311241 0.265227600 99.89705
## comp 13 0.0147668428 0.098825717 99.99588
## comp 14 0.0003353829 0.002244519 99.99812
## comp 15 0.0002803780 0.001876404 100.00000
Insight 📌 :
This is what our data looks like in PC space. The arrows show how each variable (column) contributes to the PCs: an arrow lying roughly horizontal contributes mainly to PC1 (Dim1), and one pointing vertically contributes mainly to PC2 (Dim2).
The numbers are the observations (rows of the data). Observation 83 sits far from all the others, so we can call it an outlier.
So PCA can be used to detect outliers too.
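A quick way to check which menu item observation 83 actually is, and where it sits on the first two PCs (a small check, assuming the row order of df is unchanged):

# the outlying item and its first two PC coordinates
df$Item[83]
pca$ind$coord[83, 1:2]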
fviz_eig(pca, ncp = 15, addlabels = T, main = "Variance explained by each dimensions")

Almost 68% of the variance can be explained using only the first 2 dimensions, with the first dimension alone explaining 48.8% of the total variance.
We can keep around 80% of the information in the data using only 4 dimensions. This means we can reduce our dataset from 15 numeric features to just 4.
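We can read the same thing straight off the cumulative percentages in pca$eig (a one-liner; the column name is exactly as printed in the table above):

# smallest number of components whose cumulative variance reaches 80%
min(which(pca$eig[, "cumulative percentage of variance"] >= 80))
# -> 4 (comp 4 reaches 82.79%, per the table above)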
We can extract the values of PC1 to PC4 for all observations and put them into a new data frame. This data frame can later be analyzed with a supervised classification technique or used for other purposes.
So we take those coordinates and build the data frame below.
df_80 <- as.data.frame(pca$ind$coord)[,1:4]
head(df_80)

fviz_contrib(
  X = pca,        # PCA model
  choice = "var", # show variable contributions
  axes = 1        # refers to PC1
)

Here we can see which variables (columns) contribute most to each PC; the plot above shows PC1/Dim1.
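The same information is available numerically in pca$var$contrib, so we can list the top contributors to PC1 without a plot (a small sketch using the fitted pca object):

# top contributors to PC1, in percent of the component's variance
head(sort(pca$var$contrib[, 1], decreasing = TRUE))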
Now we use the K-Means method to cluster our data. The algorithm works in two steps (a minimal sketch of them follows below):
1. Randomly assign a number, from 1 to K, to each of the observations. These serve as the initial cluster assignments.
2. Iterate until the cluster assignments stop changing: for each of the K clusters, compute the cluster centroid (the vector of the p feature means for the observations in the kth cluster), then assign each observation to the cluster whose centroid is closest (using Euclidean distance or any other distance measure).
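For intuition only, here is a minimal base-R sketch of those two steps (Lloyd-style, with random initial assignments and no empty-cluster handling; for the real analysis we use kmeans() below):

naive_kmeans <- function(X, K, iters = 100) {
  X <- as.matrix(X)
  # step 1: random initial cluster assignments
  cl <- sample(1:K, nrow(X), replace = TRUE)
  for (i in 1:iters) {
    # step 2a: centroids = per-feature means of each cluster (p x K matrix)
    centers <- sapply(1:K, function(k) colMeans(X[cl == k, , drop = FALSE]))
    # step 2b: squared Euclidean distance of every point to every centroid
    d <- sapply(1:K, function(k) rowSums(sweep(X, 2, centers[, k])^2))
    new_cl <- max.col(-d)         # nearest centroid per observation
    if (all(new_cl == cl)) break  # stop: assignments no longer change
    cl <- new_cl
  }
  cl
}

# e.g. table(naive_kmeans(df_scaled, K = 6))

Now, back to the actual workflow: we prepare the data for kmeans().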
rownames(df) <- df$Item
df_clean <- df %>%
select(-c(Item, Total.Fat....Daily.Value.,Saturated.Fat....Daily.Value.,Cholesterol....Daily.Value.,Sodium....Daily.Value.,Carbohydrates....Daily.Value.,Dietary.Fiber....Daily.Value.))
df_num_scale <- scale(df_clean)

First, we turn Item into the row names (an index), because we want to still be able to see the menu names after clustering. Like PCA, K-Means works on numeric data only.
Then we drop the “Item” column along with the same “…Daily.Value.” columns we removed during the first cleaning above.
K-Means also requires scaled data, because the method is built on a distance function. If we skip scaling, the result will be poor.
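A quick illustration of why the scaling matters, comparing the distance between the first two menu items before and after (just a sanity check on the objects built above):

# raw distance is dominated by the large-scale columns (Sodium, Calories)
dist(df_clean[1:2, ])
# after scaling, every nutrient contributes on a comparable footing
dist(df_num_scale[1:2, ])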
summary(df_num_scale)

## Calories Calories.from.Fat Total.Fat Saturated.Fat
## Min. :-1.5327 Min. :-0.9939 Min. :-0.9971 Min. :-1.1289
## 1st Qu.:-0.6587 1st Qu.:-0.8375 1st Qu.:-0.8300 1st Qu.:-0.9410
## Median :-0.1177 Median :-0.2119 Median :-0.2228 Median :-0.1893
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5483 3rd Qu.: 0.5701 3rd Qu.: 0.5691 3rd Qu.: 0.7502
## Max. : 6.2918 Max. : 7.2954 Max. : 7.3092 Max. : 2.6292
## Trans.Fat Cholesterol Sodium Carbohydrates
## Min. :-0.475 Min. :-0.6296 Min. :-0.8591 Min. :-1.6758
## 1st Qu.:-0.475 1st Qu.:-0.5723 1st Qu.:-0.6728 1st Qu.:-0.6140
## Median :-0.475 Median :-0.2285 Median :-0.5299 Median :-0.1184
## Mean : 0.000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.475 3rd Qu.: 0.1152 3rd Qu.: 0.6399 3rd Qu.: 0.4479
## Max. : 5.351 Max. : 5.9592 Max. : 5.3797 Max. : 3.3149
## Dietary.Fiber Sugars Protein
## Min. :-1.0402 Min. :-1.0259 Min. :-1.1674
## 1st Qu.:-1.0402 1st Qu.:-0.8254 1st Qu.:-0.8173
## Median :-0.4023 Median :-0.4157 Median :-0.1171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.8734 3rd Qu.: 0.6477 3rd Qu.: 0.4955
## Max. : 3.4249 Max. : 3.4372 Max. : 6.4468
## Vitamin.A....Daily.Value. Vitamin.C....Daily.Value. Calcium....Daily.Value.
## Min. :-0.55104 Min. :-0.3239 Min. :-1.23226
## 1st Qu.:-0.46896 1st Qu.:-0.3239 1st Qu.:-0.87974
## Median :-0.22272 Median :-0.3239 Median :-0.05717
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.06456 3rd Qu.:-0.1721 3rd Qu.: 0.53037
## Max. : 6.42578 Max. : 8.7858 Max. : 2.88056
## Iron....Daily.Value.
## Min. :-0.8867
## 1st Qu.:-0.8867
## Median :-0.4281
## Mean : 0.0000
## 3rd Qu.: 0.8329
## Max. : 3.6988
Choosing the number of clusters with the elbow method is somewhat arbitrary. The rule of thumb is to pick the number of clusters in the area of the “bend of the elbow”, where the total within-cluster sum of squares starts to stagnate as the number of clusters increases.
fviz_nbclust(
x = df_num_scale,
FUNcluster = kmeans,
method = "wss"
) + labs(subtitle = "Elbow method")

Using the elbow method, 3 clusters looks good enough, since there is no significant decline in the total within-cluster sum of squares at higher numbers of clusters. This method alone may not be enough, though, because the optimal number of clusters is still vague.
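The curve that fviz_nbclust draws can also be computed by hand, which makes it clear what is being plotted (a sketch; the seed is only for reproducibility):

# total within-cluster sum of squares for k = 1..10
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(df_num_scale, centers = k)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total WSS")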
The silhouette method measures the silhouette coefficient, calculated from the mean intra-cluster distance and the mean nearest-cluster distance of each observation. We get the optimal number of clusters by choosing the number of clusters with the highest average silhouette score (the peak).
fviz_nbclust(df_num_scale, kmeans, "silhouette") + labs(subtitle = "Silhouette method")

In the silhouette plot above, the number of clusters with the maximum score is taken as the optimum. The graph shows that the optimal number of clusters is 6.
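We can verify that score for k = 6 with the cluster package (a quick check; cluster ships with base R):

library(cluster)
set.seed(123)
km6 <- kmeans(df_num_scale, centers = 6)
# average silhouette width over all observations for k = 6
mean(silhouette(km6$cluster, dist(df_num_scale))[, "sil_width"])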
Now let’s cluster our data with the K-Means method.
RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)
df_kmeans <- kmeans(df_num_scale, centers=6)
df_kmeans$withinss

## [1] 251.3838 179.1372 189.7807 110.3061 295.0833 364.4706

df_kmeans$tot.withinss

## [1] 1390.162

df_kmeans$size

## [1] 11 68 71 30 13 67
We get 6 clusters (groups), together with the size of each: the number of menu items that fall into each cluster.
df_kmeans$centers

## Calories Calories.from.Fat Total.Fat Saturated.Fat Trans.Fat Cholesterol
## 1 2.5460151 2.8805924 2.8872617 1.9971673 0.9019652 2.8497327
## 2 -0.2859718 -0.5413725 -0.5349296 -0.4200839 -0.4407499 -0.3784900
## 3 -0.9362778 -0.7268079 -0.7244929 -0.8139307 -0.4750187 -0.5117566
## 4 1.1031932 0.5284069 0.5209970 1.3702521 1.6610717 0.1687233
## 5 -0.5674708 -0.5397360 -0.5558324 -0.7964229 -0.4750187 -0.4092110
## 6 0.4805517 0.7148438 0.7111976 0.5019684 0.1510268 0.4624332
## Sodium Carbohydrates Dietary.Fiber Sugars Protein
## 1 2.5706642 1.17188589 1.9751673 -0.6867484 2.4049859
## 2 -0.5584158 0.07779788 -0.1959791 0.4723712 -0.2072329
## 3 -0.6513055 -0.78397243 -0.7347601 -0.3521341 -0.9232972
## 4 -0.4637281 1.75162019 -0.5086606 1.8901897 0.0170549
## 5 -0.4252239 -0.45061110 0.9224584 -0.3956149 -0.3662316
## 6 1.1250360 -0.13745701 0.7020234 -0.7631073 0.8573203
## Vitamin.A....Daily.Value. Vitamin.C....Daily.Value. Calcium....Daily.Value.
## 1 -0.17048883 -0.09620661 0.006923178
## 2 -0.03381634 -0.24189484 0.780079879
## 3 -0.43370283 -0.20366259 -1.003866034
## 4 0.28344014 -0.32394913 1.382314200
## 5 1.92399396 3.36956726 -0.712512126
## 6 0.02168155 -0.03162292 -0.209759204
## Iron....Daily.Value.
## 1 2.3960928
## 2 -0.5343282
## 3 -0.6977582
## 4 -0.3555186
## 5 -0.1635768
## 6 1.0792565
df_clean$cluster <- as.factor(df_kmeans$cluster)

Now we attach the cluster labels to the df_clean data frame, as a factor column.
head(df_clean)

Next we group by the clusters that were formed, to find out the character of each one.
df_clean %>%
group_by(cluster) %>%
summarise_all(mean)

Now we have 6 clusters, each with its own characteristics. For example, the cluster with high Calories is cluster 1, so if we want to order a high-calorie menu item, we choose from cluster 1.
We can filter the data by cluster to make it easier for the owner/cashier to find menus that are similar to other menus.
df_clean[df_clean$cluster==2,]

For example, suppose we want to order the Hotcakes, but they are sold out. In that case we just look within cluster 2 to find another menu item with nutrition similar to the Hotcakes.
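We could wrap this lookup in a small helper. similar_menus below is a hypothetical convenience function, not part of the original analysis; it assumes the menu names are the row names of df_clean, as set above:

# hypothetical helper: other items in the same cluster as `item`
similar_menus <- function(item) {
  k <- df_clean[item, "cluster"]
  setdiff(rownames(df_clean)[df_clean$cluster == k], item)
}
similar_menus("Hotcakes")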
df_centroid <- df_clean %>%
group_by(cluster) %>%
summarise_all(mean)
df_centroid %>%
pivot_longer(-cluster) %>%
group_by(name) %>%
summarize(
group_min = which.min(value),
group_max = which.max(value))

To make it easier to find menus with similar nutrition, we summarize each variable into two indicators: the cluster with the minimum mean and the cluster with the maximum mean.
How do we read it?
If we want food/drink high in Calories, we choose a menu item in cluster 1.
If we want food/drink high in Carbohydrates, we choose a menu item in cluster 4.
If we want food/drink low in Sugars, we choose a menu item in cluster 6.
If we want food/drink that is neither high nor low in Calories, we choose a menu item in cluster 2.
Cluster 2 does not appear in the table above, which means cluster 2 sits in the middle: not the highest or the lowest in any nutrient.
fviz_cluster(object = df_kmeans,
             data = df_num_scale,
             labelsize = 0)

From the unsupervised learning analysis above, we can summarize that: