Hello everyone! We are back with yet another project. This is my submission for the Unsupervised Learning project: we are going to build recommendations for the McDonald's menu based on nutrition facts, using PCA and clustering.
The dataset can be obtained from https://www.kaggle.com/mcdonalds/nutrition-facts .
This dataset provides a nutrition analysis of every menu item on the US McDonald’s menu, including breakfast, beef burgers, chicken and fish sandwiches, fries, salads, soda, coffee and tea, milkshakes, and desserts.
Let’s get started!
Before we start the analysis, we need to load the required packages.
library(tidyverse)
library(ggplot2)
library(ggpubr)
library(FactoMineR)
library(factoextra) # fviz_pca_ind(), fviz_eig()
library(tictoc)
library(scales)
library(plotly)
library(GGally)
library(esquisse)
With the packages loaded, the next step is to read in the dataset.
mcd <- read.csv("menu.csv")Select all the columns needed then check the columns data types. Then change the data types for each columns’ data type that needs to be changed
menu <- mcd %>%
mutate(Vitamin.A = Vitamin.A....Daily.Value.,
Vitamin.C = Vitamin.C....Daily.Value.,
Calcium = Calcium....Daily.Value.,
Iron = Iron....Daily.Value.,
Fat = Total.Fat) %>%
select(-Calories.from.Fat,-Total.Fat....Daily.Value.,
-Saturated.Fat....Daily.Value.,-Trans.Fat,
-Cholesterol....Daily.Value.,-Sodium....Daily.Value.,
-Carbohydrates....Daily.Value.,
-Dietary.Fiber....Daily.Value.,
-Vitamin.A....Daily.Value.,
-Vitamin.C....Daily.Value.,
-Calcium....Daily.Value.,
-Iron....Daily.Value.,
-Serving.Size,
-Saturated.Fat,
-Total.Fat) %>%
mutate_if(is.character,as.factor)
glimpse(menu)
#> Rows: 260
#> Columns: 14
#> $ Category <fct> Breakfast, Breakfast, Breakfast, Breakfast, Breakfast, B~
#> $ Item <fct> "Egg McMuffin", "Egg White Delight", "Sausage McMuffin",~
#> $ Calories <int> 300, 250, 370, 450, 400, 430, 460, 520, 410, 470, 430, 4~
#> $ Cholesterol <int> 260, 25, 45, 285, 50, 300, 250, 250, 35, 35, 30, 30, 250~
#> $ Sodium <int> 750, 770, 780, 860, 880, 960, 1300, 1410, 1300, 1420, 10~
#> $ Carbohydrates <int> 31, 30, 29, 30, 30, 31, 38, 43, 36, 42, 34, 39, 36, 42, ~
#> $ Dietary.Fiber <int> 4, 4, 4, 4, 4, 4, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2,~
#> $ Sugars <int> 3, 3, 2, 2, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 3, 3, 4, 3,~
#> $ Protein <int> 17, 18, 14, 21, 21, 26, 19, 19, 20, 20, 11, 11, 18, 18, ~
#> $ Vitamin.A <int> 10, 6, 8, 15, 6, 15, 10, 15, 2, 6, 0, 4, 6, 10, 0, 4, 0,~
#> $ Vitamin.C <int> 0, 0, 0, 0, 0, 2, 8, 8, 8, 8, 0, 0, 0, 0, 0, 0, 2, 2, 2,~
#> $ Calcium <int> 25, 25, 25, 30, 25, 30, 15, 20, 15, 15, 6, 8, 10, 10, 8,~
#> $ Iron <int> 15, 8, 10, 15, 10, 20, 15, 20, 10, 15, 15, 15, 20, 20, 1~
#> $ Fat <dbl> 13, 8, 23, 28, 23, 23, 26, 30, 20, 25, 27, 31, 33, 37, 2~
head(menu)
colnames(menu)
#> [1] "Category" "Item" "Calories" "Cholesterol"
#> [5] "Sodium" "Carbohydrates" "Dietary.Fiber" "Sugars"
#> [9] "Protein" "Vitamin.A" "Vitamin.C" "Calcium"
#> [13] "Iron" "Fat"
menu_num <- menu %>% select_if(is.numeric)
We also have to check whether there are any missing values in the data set.
colSums(is.na(menu))
#> Category Item Calories Cholesterol Sodium
#> 0 0 0 0 0
#> Carbohydrates Dietary.Fiber Sugars Protein Vitamin.A
#> 0 0 0 0 0
#> Vitamin.C Calcium Iron Fat
#> 0 0 0 0
There are no missing values in the data set, so we are ready to move on to the analysis.
To get to know more about our data, let us check the summary.
summary(menu)
#> Category Item
#> Coffee & Tea :95 1% Low Fat Milk Jug : 1
#> Breakfast :42 Apple Slices : 1
#> Smoothies & Shakes:28 Bacon Buffalo Ranch McChicken : 1
#> Beverages :27 Bacon Cheddar McChicken : 1
#> Chicken & Fish :27 Bacon Clubhouse Burger : 1
#> Beef & Pork :15 Bacon Clubhouse Crispy Chicken Sandwich: 1
#> (Other) :26 (Other) :254
#> Calories Cholesterol Sodium Carbohydrates
#> Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.00
#> 1st Qu.: 210.0 1st Qu.: 5.00 1st Qu.: 107.5 1st Qu.: 30.00
#> Median : 340.0 Median : 35.00 Median : 190.0 Median : 44.00
#> Mean : 368.3 Mean : 54.94 Mean : 495.8 Mean : 47.35
#> 3rd Qu.: 500.0 3rd Qu.: 65.00 3rd Qu.: 865.0 3rd Qu.: 60.00
#> Max. :1880.0 Max. :575.00 Max. :3600.0 Max. :141.00
#>
#> Dietary.Fiber Sugars Protein Vitamin.A
#> Min. :0.000 Min. : 0.00 Min. : 0.00 Min. : 0.00
#> 1st Qu.:0.000 1st Qu.: 5.75 1st Qu.: 4.00 1st Qu.: 2.00
#> Median :1.000 Median : 17.50 Median :12.00 Median : 8.00
#> Mean :1.631 Mean : 29.42 Mean :13.34 Mean : 13.43
#> 3rd Qu.:3.000 3rd Qu.: 48.00 3rd Qu.:19.00 3rd Qu.: 15.00
#> Max. :7.000 Max. :128.00 Max. :87.00 Max. :170.00
#>
#> Vitamin.C Calcium Iron Fat
#> Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.000
#> 1st Qu.: 0.000 1st Qu.: 6.00 1st Qu.: 0.000 1st Qu.: 2.375
#> Median : 0.000 Median :20.00 Median : 4.000 Median : 11.000
#> Mean : 8.535 Mean :20.97 Mean : 7.735 Mean : 14.165
#> 3rd Qu.: 4.000 3rd Qu.:30.00 3rd Qu.:15.000 3rd Qu.: 22.250
#> Max. :240.000 Max. :70.00 Max. :40.000 Max. :118.000
#>
ggplot(gather(menu %>% select_if(is.numeric)), aes(value)) +
geom_histogram(bins = 10,fill="firebrick") +
facet_wrap(~key, scales = 'free_x',nrow=3) +
theme_bw()
Looking at the summary and histograms above, the maximum value of every nutrient sits far above its mean and median, so the data clearly contain outliers. We will use PCA later to locate them.
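As a quick numeric confirmation before PCA, we can flag items that sit several standard deviations from the mean on any nutrient (the cutoff of 3 below is an arbitrary choice of ours, not part of the original workflow):
# flag items whose z-score exceeds 3 on at least one nutrient
z <- scale(menu_num)
menu$Item[rowSums(abs(z) > 3) > 0]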
Now let’s take a look at the boxplots of each nutrient per category below.
Each nutrient gets the same boxplot, so we can wrap the plot in a small helper function instead of writing it twelve times:
plot_nutrient <- function(var) {
  ggplot(data = menu, aes(x = .data[[var]], y = Category, fill = Category)) +
    geom_boxplot(show.legend = F, alpha = 0.3) +
    theme_grey() +
    labs(title = var)
}
nutrients <- c("Calcium", "Calories", "Carbohydrates", "Cholesterol",
               "Dietary.Fiber", "Fat", "Iron", "Protein",
               "Sodium", "Sugars", "Vitamin.A", "Vitamin.C")
ggpubr::ggarrange(plotlist = lapply(nutrients, plot_nutrient),
                  ncol = 2, nrow = 6)
Summary from the plot above: the sugar-heavy distributions belong to the drink and dessert categories, while sodium, protein, and fat are concentrated in the sandwich categories; the PCA results below will confirm this pattern.
Now let us check the correlation between the numerical variables.
ggcorr(menu, label = T)
Based on this result, we can see that several variables have strong positive correlations with one another. We will try to reduce the dimensionality using PCA.
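If we want the exact numbers behind the heatmap, we can list the most strongly correlated pairs directly (an inspection sketch of ours, not part of the original analysis):
# list the most strongly correlated nutrient pairs
cor_tab <- as.data.frame(as.table(cor(menu_num)))
cor_tab %>%
  filter(as.character(Var1) < as.character(Var2)) %>% # keep each pair once
  arrange(desc(abs(Freq))) %>%
  rename(correlation = Freq) %>%
  head()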
PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. In other words, we convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
The objective of PCA is to find the matrix Q that makes such a linear transformation possible. A filter like caret::nearZeroVar() pursues a similar goal by dropping low-information predictors outright; PCA does it differently: it looks for correlation within our data and uses that redundancy to create a new matrix Z = XQ with just enough dimensions to explain most of the variance in the original data. The columns of Z are called the principal components.
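To make this concrete: with standardized data X, Q is the matrix of eigenvectors of the correlation matrix, and Z = XQ holds the component scores. A minimal base-R sketch (the FactoMineR call below does all of this for us):
# PCA by hand: Z = XQ, where Q holds the eigenvectors of the correlation matrix
eig <- eigen(cor(menu_num))            # eig$vectors is Q
Z <- scale(menu_num) %*% eig$vectors   # principal component scores
round(100 * eig$values / sum(eig$values), 2) # % of variance per PC: 46.30, 21.53, ...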
# index vectors for the quantitative and qualitative variables
quanti_var <- c(3:14)
quali_var <- c(1:2)
We use the PCA() function from the FactoMineR package:
pca_menu <- PCA(X = menu,              # the data
                scale.unit = T,        # scale variables to unit variance
                quali.sup = quali_var, # indices of the categorical columns
                graph = F)             # F = do not draw the plot
                # ncp = 10             # number of components to keep (default = 5)
pca_menu$eig
#> eigenvalue percentage of variance cumulative percentage of variance
#> comp 1 5.5565359443 46.304466202 46.30447
#> comp 2 2.5841338691 21.534448910 67.83892
#> comp 3 1.1941186002 9.950988335 77.78990
#> comp 4 0.9714291848 8.095243207 85.88515
#> comp 5 0.5254357947 4.378631622 90.26378
#> comp 6 0.4667872027 3.889893356 94.15367
#> comp 7 0.3494208991 2.911840826 97.06551
#> comp 8 0.1824119663 1.520099720 98.58561
#> comp 9 0.0933776984 0.778147487 99.36376
#> comp 10 0.0550715037 0.458929197 99.82269
#> comp 11 0.0209874981 0.174895818 99.99758
#> comp 12 0.0002898385 0.002415321 100.00000
fviz_eig(pca_menu, ncp = 12,
         addlabels = T, main = "Variance explained by each dimension")
summary(pca_menu)
#>
#> Call:
#> PCA(X = menu, scale.unit = T, quali.sup = quali_var, graph = F)
#>
#>
#> Eigenvalues
#> Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
#> Variance 5.557 2.584 1.194 0.971 0.525 0.467 0.349
#> % of var. 46.304 21.534 9.951 8.095 4.379 3.890 2.912
#> Cumulative % of var. 46.304 67.839 77.790 85.885 90.264 94.154 97.066
#> Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
#> Variance 0.182 0.093 0.055 0.021 0.000
#> % of var. 1.520 0.778 0.459 0.175 0.002
#> Cumulative % of var. 98.586 99.364 99.823 99.998 100.000
#>
#> Individuals (the 10 first)
#> Dist Dim.1 ctr cos2 Dim.2 ctr cos2
#> 1 | 3.207 | 1.620 0.182 0.255 | -1.359 0.275 0.180 |
#> 2 | 2.173 | 0.288 0.006 0.018 | -1.165 0.202 0.287 |
#> 3 | 2.136 | 0.922 0.059 0.186 | -1.137 0.193 0.284 |
#> 4 | 3.684 | 2.617 0.474 0.504 | -1.171 0.204 0.101 |
#> 5 | 2.284 | 1.292 0.116 0.320 | -1.149 0.197 0.253 |
#> 6 | 4.008 | 2.954 0.604 0.543 | -1.322 0.260 0.109 |
#> 7 | 3.148 | 2.203 0.336 0.490 | -1.398 0.291 0.197 |
#> 8 | 3.606 | 2.986 0.617 0.686 | -1.332 0.264 0.137 |
#> 9 | 2.006 | 0.966 0.065 0.232 | -1.275 0.242 0.404 |
#> 10 | 2.497 | 1.753 0.213 0.493 | -1.338 0.266 0.287 |
#> Dim.3 ctr cos2
#> 1 -0.256 0.021 0.006 |
#> 2 0.137 0.006 0.004 |
#> 3 -0.054 0.001 0.001 |
#> 4 -0.370 0.044 0.010 |
#> 5 -0.131 0.006 0.003 |
#> 6 -0.297 0.028 0.005 |
#> 7 -0.758 0.185 0.058 |
#> 8 -0.462 0.069 0.016 |
#> 9 -0.504 0.082 0.063 |
#> 10 -0.258 0.021 0.011 |
#>
#> Variables (the 10 first)
#> Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
#> Calories | 0.901 14.605 0.812 | 0.362 5.062 0.131 | -0.065
#> Cholesterol | 0.735 9.711 0.540 | -0.074 0.210 0.005 | -0.179
#> Sodium | 0.903 14.687 0.816 | -0.330 4.226 0.109 | -0.097
#> Carbohydrates | 0.500 4.497 0.250 | 0.793 24.361 0.630 | 0.048
#> Dietary.Fiber | 0.750 10.118 0.562 | -0.255 2.523 0.065 | 0.359
#> Sugars | -0.111 0.221 0.012 | 0.960 35.640 0.921 | 0.072
#> Protein | 0.912 14.985 0.833 | -0.042 0.068 0.002 | 0.019
#> Vitamin.A | 0.205 0.758 0.042 | 0.067 0.174 0.004 | 0.755
#> Vitamin.C | -0.049 0.042 0.002 | -0.188 1.375 0.036 | 0.640
#> Calcium | 0.272 1.327 0.074 | 0.769 22.869 0.591 | 0.043
#> ctr cos2
#> Calories 0.349 0.004 |
#> Cholesterol 2.670 0.032 |
#> Sodium 0.786 0.009 |
#> Carbohydrates 0.194 0.002 |
#> Dietary.Fiber 10.767 0.129 |
#> Sugars 0.432 0.005 |
#> Protein 0.030 0.000 |
#> Vitamin.A 47.776 0.570 |
#> Vitamin.C 34.299 0.410 |
#> Calcium 0.151 0.002 |
#>
#> Supplementary categories (the 10 first)
#> Dist Dim.1 cos2 v.test Dim.2 cos2 v.test
#> Beef & Pork | 2.745 | 2.255 0.675 3.810 | -1.051 0.147 -2.604 |
#> Beverages | 2.855 | -2.655 0.865 -6.171 | -0.650 0.052 -2.215 |
#> Breakfast | 2.728 | 2.402 0.776 7.199 | -1.090 0.160 -4.788 |
#> Chicken & Fish | 2.758 | 2.428 0.775 5.642 | -1.056 0.147 -3.600 |
#> Coffee & Tea | 1.450 | -1.196 0.680 -6.196 | 0.684 0.222 5.193 |
#> Desserts | 1.712 | -1.599 0.873 -1.816 | -0.365 0.045 -0.608 |
#> Salads | 6.013 | 1.006 0.028 1.056 | -1.434 0.057 -2.206 |
#> Smoothies & Shakes | 2.647 | 0.101 0.001 0.238 | 2.565 0.939 8.922 |
#> Snacks & Sides | 1.844 | -0.970 0.277 -1.519 | -1.386 0.565 -3.183 |
#> 1% Low Fat Milk Jug | 2.641 | -2.184 0.683 -0.926 | -0.464 0.031 -0.289 |
#> Dim.3 cos2 v.test
#> Beef & Pork -0.348 0.016 -1.270 |
#> Beverages -0.061 0.000 -0.304 |
#> Breakfast -0.366 0.018 -2.365 |
#> Chicken & Fish 0.180 0.004 0.903 |
#> Coffee & Tea -0.219 0.023 -2.446 |
#> Desserts -0.283 0.027 -0.693 |
#> Salads 4.777 0.631 10.813 |
#> Smoothies & Shakes 0.317 0.014 1.621 |
#> Snacks & Sides 0.200 0.012 0.676 |
#> 1% Low Fat Milk Jug -0.284 0.012 -0.260 |
Through PCA, we can retain a few informative PCs (those with high cumulative variance) to perform dimensionality reduction. This shrinks the number of variables while retaining as much information as possible.
Based on the plot and summary above, we want to retain at least 80% of the variance, so we choose Dim 1-4 (85.89% cumulative variance).
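If we want to carry the reduced data forward, the coordinates of each menu item on those dimensions are stored in the FactoMineR result; a minimal extraction sketch (we could, for instance, cluster on these instead of the raw nutrients):
# item coordinates on the first four principal components
menu_pca4 <- as.data.frame(pca_menu$ind$coord[, 1:4])
head(menu_pca4)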
plot.PCA(x = pca_menu,
         choix = "ind",       # plot the individuals
         invisible = "quali", # hide the category labels
         select = "contrib5", # highlight the 5 highest-contributing individuals
         habillage = 1)       # color the points by the first qualitative variable (Category)
From the individual plot, Dim 1 alone covers 46.30% of the variance in the data.
The plot also highlights the five individuals with the largest contributions, which we treat as outliers (colored by their menu Category).
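To read those five items by name rather than off the plot, we can sort the individual contributions stored in the PCA object (a sketch of ours; it approximates what select = "contrib5" highlights on the Dim 1-Dim 2 plane):
# the five menu items contributing most to the Dim 1-Dim 2 plane
contrib_12 <- rowSums(pca_menu$ind$contrib[, 1:2])
menu$Item[order(contrib_12, decreasing = TRUE)[1:5]]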
To represent more than two components, the variables are positioned inside the circle of correlations. A variable close to the edge of the circle is well reconstructed from the first two components; a variable close to the center is less important for those two components.
fviz_pca_var(pca_menu, select.var = list(contrib = 20), col.var = "contrib",
             gradient.cols = c("cyan", "gold", "maroon"), repel = TRUE)
fviz_contrib(X = pca_menu,
             choice = "var", # show contributions per variable
             axes = 1)       # which PC to display
Fat, Protein, Sodium, Calories, Iron, Dietary Fiber, and Cholesterol are the nutrients that contribute most to dimension 1.
fviz_contrib(X = pca_menu,
             choice = "var", # show contributions per variable
             axes = 2)       # which PC to display
Sugars, Carbohydrates, and Calcium are the nutrients that contribute most to dimension 2.
fviz_contrib(X = pca_menu,
             choice = "var", # show contributions per variable
             axes = 3)       # which PC to display
Vitamin A, Vitamin C, and Dietary Fiber are the nutrients that contribute most to dimension 3.
fviz_contrib(X = pca_menu,
             choice = "var", # show contributions per variable
             axes = 4)       # which PC to display
Vitamin C is the nutrient that contributes most to dimension 4.
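The four contribution plots can also be condensed into one line by reading the contribution matrix directly (a quick cross-check of the statements above):
# the single biggest contributor to each of the first four dimensions
apply(pca_menu$var$contrib[, 1:4], 2, function(x) names(which.max(x)))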
Clustering is an unsupervised machine learning task: it automatically discovers natural groupings in the data. Unlike supervised learning (such as predictive modeling), clustering algorithms only interpret the input data and find natural groups, or clusters, in feature space.
K-means clustering can only use the numerical variables, so we work with menu_num from here on.
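For intuition, k-means alternates two steps until convergence: assign every observation to its nearest centroid, then recompute each centroid as the mean of its members. A bare-bones sketch of a single iteration (illustrative only; we rely on the built-in kmeans() below):
# one k-means iteration; kmeans() repeats this until assignments stop changing
one_kmeans_step <- function(X, centers) {
  k <- nrow(centers)
  # distances from every observation (rows of X) to every centroid
  d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
  cluster <- apply(d, 1, which.min)             # assignment step
  centers <- apply(X, 2, tapply, cluster, mean) # update step: new centroids
  list(cluster = cluster, centers = centers)
}
# demo with k = 3, seeding the centroids with the first three menu items
step1 <- one_kmeans_step(scale(menu_num), scale(menu_num)[1:3, ])
table(step1$cluster)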
head(menu_num)
We need to scale the data: the nutrients live on very different ranges (Sodium goes up to 3600 while Dietary.Fiber tops out at 7), and unscaled variables with large ranges would dominate the distance calculations.
menu_num_scale <- scale(menu_num)
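A quick check that the scaling behaved as expected: every column should now have mean ~0 and standard deviation 1.
# each scaled nutrient has mean ~0 and standard deviation 1
round(colMeans(menu_num_scale), 3)
apply(menu_num_scale, 2, sd)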
Next we need to find the optimum k for k-means clustering, where k is the number of clusters in our model. There are several methods for finding the optimum k; in this project we will use the elbow method.
Choosing the number of clusters with the elbow method is somewhat arbitrary. The rule of thumb is to pick the number of clusters around the "bend of the elbow", the point where the total within sum of squares starts to stagnate as the number of clusters increases.
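The fviz_nbclust() call below automates this, but the same curve can be computed by hand, which makes explicit what the y-axis shows (a sketch of ours; nstart = 5 stabilizes each fit):
# total within-cluster sum of squares for k = 1..30, computed manually
set.seed(999)
wss <- sapply(1:30, function(k) kmeans(menu_num_scale, centers = k, nstart = 5)$tot.withinss)
ggplot(data.frame(k = 1:30, wss = wss), aes(k, wss)) +
  geom_line() + geom_point() +
  labs(y = "Total within sum of squares") + theme_bw()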
fviz_nbclust(menu_num_scale, kmeans, method = "wss", k.max = 30) +
  labs(subtitle = "Elbow method") + theme_bw()
Looking at the plot above, the "elbow" appears to be at k = 11.
Now we will use the optimum k (k = 11) in the clustering process. We will also create a new cluster column so we can inspect each group.
RNGkind(sample.kind = "Rounding")
set.seed(999)
# k-means clustering
menu_clust <- kmeans(menu_num_scale, centers = 11)
The number of observations in each cluster:
menu_clust$size
#> [1] 40 14 44 27 25 7 6 3 45 15 34
The quality of a clustering result can be judged through three quantities: the Within Sum of Squares (withinss), the Between Sum of Squares (betweenss), and the Total Sum of Squares (totss).
Within Sum of Squares (withinss): the sum of squared distances from each observation to the centroid of its cluster.
menu_clust$withinss
#> [1] 35.49918 47.08081 43.80337 118.72423 56.04666 41.18269 31.45488
#> [8] 48.05747 96.11475 29.18924 42.92216
Between Sum of Squares (betweenss): the sum of squared distances from each cluster centroid to the global average, weighted by the number of observations in the cluster.
menu_clust$betweenss
#> [1] 2517.925
Total Sum of Squares (totss): the sum of squared distances from each observation to the global average.
menu_clust$totss
#> [1] 3108
A good clustering should have a BSS/TSS ratio close to 1.
menu_clust$betweenss/menu_clust$totss
#> [1] 0.810143
The BSS/TSS ratio is about 0.81, meaning 81% of the total variation is captured between clusters, which indicates a good clustering.
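These three quantities are tied together by the identity totss = betweenss + sum(withinss), which we can verify directly:
# sanity check: total SS decomposes into between- plus within-cluster SS
menu_clust$betweenss + sum(menu_clust$withinss) # = 3108, matches menu_clust$totss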
Let us determine the characteristics of each cluster.
menu$cluster <- menu_clust$cluster
menu1 <- menu[,3:15]
menu1 %>%
group_by(cluster) %>%
summarise_all(mean) %>%
head(11)
The table above gives the mean of every nutrient per cluster; these averages are the profile of each cluster.
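One way to turn that table into quick labels is to ask which cluster ranks highest on each nutrient (an exploratory sketch of ours):
# which cluster has the highest mean value for each nutrient?
profile <- menu1 %>% group_by(cluster) %>% summarise_all(mean)
sapply(profile %>% select(-cluster),
       function(x) profile$cluster[which.max(x)])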
Below is the plot of the clusters.
fviz_cluster(ggtheme = theme_bw() ,
object=menu_clust,
data = menu_num_scale,ellipse.alpha = 0.1,
labelsize = 10)
If a customer wants a food high in vitamin C, what should we recommend?
From the profiling result, we can see that the foods with very high vitamin C are in cluster 6,
so we can filter for all the foods in cluster 6.
menu %>% select(-Category) %>%
filter(cluster==6)
So if customers want foods high in vitamin C, we can recommend the Fruit and Maple Oatmeal, Apple Slices, and Minute Maid juice.
If a customer wants other burger options besides the Big Mac, what should we recommend?
We can first check which cluster the Big Mac is in.
menu %>% filter(Item=="Big Mac")Now we know that Big Mac is in cluster 4. Now we can check all the foods in cluster 4
menu %>% filter(cluster==4)
After filtering cluster 4, we have plenty of options to recommend. To be more specific, we can propose foods in cluster 4 from the same category as the Big Mac, such as the Quarter Pound Burger, Double Quarter Pounder Burger, or Bacon Clubhouse Burger.
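The same lookup generalizes to any item on the menu; a small illustrative helper (the function name is our own):
# recommend other items from the same cluster as a given menu item
recommend_similar <- function(item_name, data = menu) {
  target <- data$cluster[data$Item == item_name]
  data %>%
    filter(cluster == target, Item != item_name) %>%
    pull(Item)
}
recommend_similar("Big Mac")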
After exploring our dataset with PCA and k-means clustering, we are able to conclude that:
- PCA can compress the 12 nutrient variables into 4 principal components while retaining about 86% of the variance, and it exposes the outlying menu items.
- K-means with k = 11, chosen via the elbow method, achieves a BSS/TSS ratio of about 0.81, indicating a good clustering.
- The cluster profiles can drive simple menu recommendations, such as cluster 6 for high-vitamin-C items and cluster 4 for Big Mac-like burgers.