1 Introduction

Hello everyone! We are back with another project. This is my submission for the Unsupervised Learning project: we are going to provide recommendations for McDonald's menu items based on their nutrition facts, using PCA and clustering.

The dataset can be obtained from https://www.kaggle.com/mcdonalds/nutrition-facts.

This dataset provides a nutrition analysis of every menu item on the US McDonald’s menu, including breakfast, beef burgers, chicken and fish sandwiches, fries, salads, soda, coffee and tea, milkshakes, and desserts.

Let’s get started!

2 Data Preparation

2.1 Library Setup

Before we begin the analysis, we need to load the required packages.

library(tidyverse)
library(ggplot2)
library(ggpubr)
library(FactoMineR)
library(factoextra) # fviz_pca_ind(), fviz_eig()
library(tictoc)
library(scales)
library(plotly)
library(GGally)
library(esquisse)

2.2 Import Data

Next, we load the dataset.

mcd <- read.csv("menu.csv")

3 Exploratory Data Analysis

3.1 Check Data Types

We keep one column per nutrient (renaming the "% Daily Value" vitamin and mineral columns to shorter names and dropping the redundant columns), check the data types, and convert character columns to factors.

menu <- mcd %>% 
        mutate(Vitamin.A = Vitamin.A....Daily.Value.,
               Vitamin.C = Vitamin.C....Daily.Value.,
               Calcium = Calcium....Daily.Value.,
               Iron = Iron....Daily.Value.,
               Fat = Total.Fat) %>% 
        select(-Calories.from.Fat,-Total.Fat....Daily.Value.,
               -Saturated.Fat....Daily.Value.,-Trans.Fat,
               -Cholesterol....Daily.Value.,-Sodium....Daily.Value.,
               -Carbohydrates....Daily.Value.,
               -Dietary.Fiber....Daily.Value.,
               -Vitamin.A....Daily.Value.,
               -Vitamin.C....Daily.Value.,
               -Calcium....Daily.Value.,
               -Iron....Daily.Value.,
               -Serving.Size,
               -Saturated.Fat,
               -Total.Fat) %>% 
            mutate_if(is.character,as.factor)
  
glimpse(menu)
#> Rows: 260
#> Columns: 14
#> $ Category      <fct> Breakfast, Breakfast, Breakfast, Breakfast, Breakfast, B~
#> $ Item          <fct> "Egg McMuffin", "Egg White Delight", "Sausage McMuffin",~
#> $ Calories      <int> 300, 250, 370, 450, 400, 430, 460, 520, 410, 470, 430, 4~
#> $ Cholesterol   <int> 260, 25, 45, 285, 50, 300, 250, 250, 35, 35, 30, 30, 250~
#> $ Sodium        <int> 750, 770, 780, 860, 880, 960, 1300, 1410, 1300, 1420, 10~
#> $ Carbohydrates <int> 31, 30, 29, 30, 30, 31, 38, 43, 36, 42, 34, 39, 36, 42, ~
#> $ Dietary.Fiber <int> 4, 4, 4, 4, 4, 4, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2,~
#> $ Sugars        <int> 3, 3, 2, 2, 2, 3, 3, 4, 3, 4, 2, 3, 2, 3, 3, 3, 3, 4, 3,~
#> $ Protein       <int> 17, 18, 14, 21, 21, 26, 19, 19, 20, 20, 11, 11, 18, 18, ~
#> $ Vitamin.A     <int> 10, 6, 8, 15, 6, 15, 10, 15, 2, 6, 0, 4, 6, 10, 0, 4, 0,~
#> $ Vitamin.C     <int> 0, 0, 0, 0, 0, 2, 8, 8, 8, 8, 0, 0, 0, 0, 0, 0, 2, 2, 2,~
#> $ Calcium       <int> 25, 25, 25, 30, 25, 30, 15, 20, 15, 15, 6, 8, 10, 10, 8,~
#> $ Iron          <int> 15, 8, 10, 15, 10, 20, 15, 20, 10, 15, 15, 15, 20, 20, 1~
#> $ Fat           <dbl> 13, 8, 23, 28, 23, 23, 26, 30, 20, 25, 27, 31, 33, 37, 2~
head(menu)
colnames(menu)
#>  [1] "Category"      "Item"          "Calories"      "Cholesterol"  
#>  [5] "Sodium"        "Carbohydrates" "Dietary.Fiber" "Sugars"       
#>  [9] "Protein"       "Vitamin.A"     "Vitamin.C"     "Calcium"      
#> [13] "Iron"          "Fat"
menu_num <- menu %>% select_if(is.numeric)

3.2 Check Missing Values

We have to check whether there are any missing values in our dataset.

colSums(is.na(menu))
#>      Category          Item      Calories   Cholesterol        Sodium 
#>             0             0             0             0             0 
#> Carbohydrates Dietary.Fiber        Sugars       Protein     Vitamin.A 
#>             0             0             0             0             0 
#>     Vitamin.C       Calcium          Iron           Fat 
#>             0             0             0             0

There are no missing values in our dataset, so we are ready to move on to the analysis.

3.3 Data Analysis

To get to know more about our data, let us check the summary.

summary(menu)
#>                Category                                       Item    
#>  Coffee & Tea      :95   1% Low Fat Milk Jug                    :  1  
#>  Breakfast         :42   Apple Slices                           :  1  
#>  Smoothies & Shakes:28   Bacon Buffalo Ranch McChicken          :  1  
#>  Beverages         :27   Bacon Cheddar McChicken                :  1  
#>  Chicken & Fish    :27   Bacon Clubhouse Burger                 :  1  
#>  Beef & Pork       :15   Bacon Clubhouse Crispy Chicken Sandwich:  1  
#>  (Other)           :26   (Other)                                :254  
#>     Calories       Cholesterol         Sodium       Carbohydrates   
#>  Min.   :   0.0   Min.   :  0.00   Min.   :   0.0   Min.   :  0.00  
#>  1st Qu.: 210.0   1st Qu.:  5.00   1st Qu.: 107.5   1st Qu.: 30.00  
#>  Median : 340.0   Median : 35.00   Median : 190.0   Median : 44.00  
#>  Mean   : 368.3   Mean   : 54.94   Mean   : 495.8   Mean   : 47.35  
#>  3rd Qu.: 500.0   3rd Qu.: 65.00   3rd Qu.: 865.0   3rd Qu.: 60.00  
#>  Max.   :1880.0   Max.   :575.00   Max.   :3600.0   Max.   :141.00  
#>                                                                     
#>  Dietary.Fiber       Sugars          Protein        Vitamin.A     
#>  Min.   :0.000   Min.   :  0.00   Min.   : 0.00   Min.   :  0.00  
#>  1st Qu.:0.000   1st Qu.:  5.75   1st Qu.: 4.00   1st Qu.:  2.00  
#>  Median :1.000   Median : 17.50   Median :12.00   Median :  8.00  
#>  Mean   :1.631   Mean   : 29.42   Mean   :13.34   Mean   : 13.43  
#>  3rd Qu.:3.000   3rd Qu.: 48.00   3rd Qu.:19.00   3rd Qu.: 15.00  
#>  Max.   :7.000   Max.   :128.00   Max.   :87.00   Max.   :170.00  
#>                                                                   
#>    Vitamin.C          Calcium           Iron             Fat         
#>  Min.   :  0.000   Min.   : 0.00   Min.   : 0.000   Min.   :  0.000  
#>  1st Qu.:  0.000   1st Qu.: 6.00   1st Qu.: 0.000   1st Qu.:  2.375  
#>  Median :  0.000   Median :20.00   Median : 4.000   Median : 11.000  
#>  Mean   :  8.535   Mean   :20.97   Mean   : 7.735   Mean   : 14.165  
#>  3rd Qu.:  4.000   3rd Qu.:30.00   3rd Qu.:15.000   3rd Qu.: 22.250  
#>  Max.   :240.000   Max.   :70.00   Max.   :40.000   Max.   :118.000  
#> 
ggplot(gather(menu %>% select_if(is.numeric)), aes(value)) + 
    geom_histogram(bins = 10, fill = "firebrick") + 
    facet_wrap(~key, scales = "free_x", nrow = 3) +
    theme_bw()

Looking at the summary and the histograms above, the maximum value of each nutrient is far above its mean and median, so the data contain outliers. We will identify them later using PCA.
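As a quick sanity check (a sketch using the menu data we already prepared), we can list the items with the most extreme calorie counts:

# items with the highest Calories, to illustrate the outliers
menu %>% 
  arrange(desc(Calories)) %>% 
  select(Item, Calories) %>% 
  head(3)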

Now let's take a look at the visualization of nutrients per category below.

# one boxplot per nutrient, colored by menu Category
plot_nutrient <- function(var) {
  ggplot(data = menu, aes(x = .data[[var]], y = Category, fill = Category)) + 
    geom_boxplot(show.legend = F, alpha = 0.3) + 
    theme_grey() + 
    labs(title = var)
}

nutrients <- c("Calcium", "Calories", "Carbohydrates", "Cholesterol",
               "Dietary.Fiber", "Fat", "Iron", "Protein",
               "Sodium", "Sugars", "Vitamin.A", "Vitamin.C")

ggpubr::ggarrange(plotlist = lapply(nutrients, plot_nutrient),
                  ncol = 2, nrow = 6)

Summary from the plot above:

  • Salads have the highest Vitamin A, Vitamin C, and Dietary Fiber.
  • Smoothies & Shakes have the highest Sugars, Calcium, and Carbohydrates.
  • Chicken & Fish, Breakfast, and Beef & Pork have the highest Sodium.
  • Almost all food categories have roughly the same average Protein, Iron, and Cholesterol values.

Now let us check the correlation between the numerical variables.

ggcorr(menu_num, label = T)

Based on this result, we can see that some of the variables have a strong positive correlation. We will try to reduce the dimensionality using PCA.

4 Unsupervised Learning using: PCA

PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. In other words, we convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

The objective of PCA is to find a rotation matrix Q that makes such a linear transformation possible: the transformed data Z = XQ has uncorrelated columns, ordered by variance. Like caret's nearZeroVar(), PCA can be used to eliminate roughly half of the original predictors while still retaining enough information, but it works differently: it looks for correlation within our data and uses that redundancy to create a new matrix Z with just enough dimensions to explain most of the variance in the original data. The columns of Z are called the principal components.
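As a minimal sketch of this linear-transformation view (base R on menu_num, separate from the analysis below), the principal components are simply the standardized data projected onto the eigenvectors of the correlation matrix:

X <- scale(menu_num)               # standardize the 12 nutrient columns
Q <- eigen(cor(menu_num))$vectors  # rotation matrix Q (the loadings)
Z <- X %*% Q                       # Z: the principal component scores
# prcomp() produces the same scores, up to the sign of each column
all.equal(abs(unname(Z)), abs(unname(prcomp(X)$x)))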

4.1 Data Pre-processing (Scaling)

We do not need to scale the data manually here: PCA() will standardize the variables for us through its scale.unit argument. We only record which column indices hold the quantitative and the qualitative variables.

# indices of the quantitative and qualitative variables
quanti_var <- c(3:14)
quali_var <- c(1:2)

4.2 Choosing the PC

We use the PCA() function from the FactoMineR package.

pca_menu <- PCA(X = menu, # data
                 scale.unit = T, # scale variables to unit variance
                 quali.sup = quali_var, # indices of the categorical columns
                 graph = F) # F = do not show the plot
                 # ncp = 10) # number of components to keep, default = 5
pca_menu$eig
#>           eigenvalue percentage of variance cumulative percentage of variance
#> comp 1  5.5565359443           46.304466202                          46.30447
#> comp 2  2.5841338691           21.534448910                          67.83892
#> comp 3  1.1941186002            9.950988335                          77.78990
#> comp 4  0.9714291848            8.095243207                          85.88515
#> comp 5  0.5254357947            4.378631622                          90.26378
#> comp 6  0.4667872027            3.889893356                          94.15367
#> comp 7  0.3494208991            2.911840826                          97.06551
#> comp 8  0.1824119663            1.520099720                          98.58561
#> comp 9  0.0933776984            0.778147487                          99.36376
#> comp 10 0.0550715037            0.458929197                          99.82269
#> comp 11 0.0209874981            0.174895818                          99.99758
#> comp 12 0.0002898385            0.002415321                         100.00000
fviz_eig(pca_menu, ncp = 12, 
         addlabels = T, main = "Variance explained by each dimension")

summary(pca_menu)
#> 
#> Call:
#> PCA(X = menu, scale.unit = T, quali.sup = quali_var, graph = F) 
#> 
#> 
#> Eigenvalues
#>                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
#> Variance               5.557   2.584   1.194   0.971   0.525   0.467   0.349
#> % of var.             46.304  21.534   9.951   8.095   4.379   3.890   2.912
#> Cumulative % of var.  46.304  67.839  77.790  85.885  90.264  94.154  97.066
#>                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
#> Variance               0.182   0.093   0.055   0.021   0.000
#> % of var.              1.520   0.778   0.459   0.175   0.002
#> Cumulative % of var.  98.586  99.364  99.823  99.998 100.000
#> 
#> Individuals (the 10 first)
#>                         Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
#> 1                   |  3.207 |  1.620  0.182  0.255 | -1.359  0.275  0.180 |
#> 2                   |  2.173 |  0.288  0.006  0.018 | -1.165  0.202  0.287 |
#> 3                   |  2.136 |  0.922  0.059  0.186 | -1.137  0.193  0.284 |
#> 4                   |  3.684 |  2.617  0.474  0.504 | -1.171  0.204  0.101 |
#> 5                   |  2.284 |  1.292  0.116  0.320 | -1.149  0.197  0.253 |
#> 6                   |  4.008 |  2.954  0.604  0.543 | -1.322  0.260  0.109 |
#> 7                   |  3.148 |  2.203  0.336  0.490 | -1.398  0.291  0.197 |
#> 8                   |  3.606 |  2.986  0.617  0.686 | -1.332  0.264  0.137 |
#> 9                   |  2.006 |  0.966  0.065  0.232 | -1.275  0.242  0.404 |
#> 10                  |  2.497 |  1.753  0.213  0.493 | -1.338  0.266  0.287 |
#>                      Dim.3    ctr   cos2  
#> 1                   -0.256  0.021  0.006 |
#> 2                    0.137  0.006  0.004 |
#> 3                   -0.054  0.001  0.001 |
#> 4                   -0.370  0.044  0.010 |
#> 5                   -0.131  0.006  0.003 |
#> 6                   -0.297  0.028  0.005 |
#> 7                   -0.758  0.185  0.058 |
#> 8                   -0.462  0.069  0.016 |
#> 9                   -0.504  0.082  0.063 |
#> 10                  -0.258  0.021  0.011 |
#> 
#> Variables (the 10 first)
#>                        Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
#> Calories            |  0.901 14.605  0.812 |  0.362  5.062  0.131 | -0.065
#> Cholesterol         |  0.735  9.711  0.540 | -0.074  0.210  0.005 | -0.179
#> Sodium              |  0.903 14.687  0.816 | -0.330  4.226  0.109 | -0.097
#> Carbohydrates       |  0.500  4.497  0.250 |  0.793 24.361  0.630 |  0.048
#> Dietary.Fiber       |  0.750 10.118  0.562 | -0.255  2.523  0.065 |  0.359
#> Sugars              | -0.111  0.221  0.012 |  0.960 35.640  0.921 |  0.072
#> Protein             |  0.912 14.985  0.833 | -0.042  0.068  0.002 |  0.019
#> Vitamin.A           |  0.205  0.758  0.042 |  0.067  0.174  0.004 |  0.755
#> Vitamin.C           | -0.049  0.042  0.002 | -0.188  1.375  0.036 |  0.640
#> Calcium             |  0.272  1.327  0.074 |  0.769 22.869  0.591 |  0.043
#>                        ctr   cos2  
#> Calories             0.349  0.004 |
#> Cholesterol          2.670  0.032 |
#> Sodium               0.786  0.009 |
#> Carbohydrates        0.194  0.002 |
#> Dietary.Fiber       10.767  0.129 |
#> Sugars               0.432  0.005 |
#> Protein              0.030  0.000 |
#> Vitamin.A           47.776  0.570 |
#> Vitamin.C           34.299  0.410 |
#> Calcium              0.151  0.002 |
#> 
#> Supplementary categories (the 10 first)
#>                         Dist    Dim.1   cos2 v.test    Dim.2   cos2 v.test  
#> Beef & Pork         |  2.745 |  2.255  0.675  3.810 | -1.051  0.147 -2.604 |
#> Beverages           |  2.855 | -2.655  0.865 -6.171 | -0.650  0.052 -2.215 |
#> Breakfast           |  2.728 |  2.402  0.776  7.199 | -1.090  0.160 -4.788 |
#> Chicken & Fish      |  2.758 |  2.428  0.775  5.642 | -1.056  0.147 -3.600 |
#> Coffee & Tea        |  1.450 | -1.196  0.680 -6.196 |  0.684  0.222  5.193 |
#> Desserts            |  1.712 | -1.599  0.873 -1.816 | -0.365  0.045 -0.608 |
#> Salads              |  6.013 |  1.006  0.028  1.056 | -1.434  0.057 -2.206 |
#> Smoothies & Shakes  |  2.647 |  0.101  0.001  0.238 |  2.565  0.939  8.922 |
#> Snacks & Sides      |  1.844 | -0.970  0.277 -1.519 | -1.386  0.565 -3.183 |
#> 1% Low Fat Milk Jug |  2.641 | -2.184  0.683 -0.926 | -0.464  0.031 -0.289 |
#>                      Dim.3   cos2 v.test  
#> Beef & Pork         -0.348  0.016 -1.270 |
#> Beverages           -0.061  0.000 -0.304 |
#> Breakfast           -0.366  0.018 -2.365 |
#> Chicken & Fish       0.180  0.004  0.903 |
#> Coffee & Tea        -0.219  0.023 -2.446 |
#> Desserts            -0.283  0.027 -0.693 |
#> Salads               4.777  0.631 10.813 |
#> Smoothies & Shakes   0.317  0.014  1.621 |
#> Snacks & Sides       0.200  0.012  0.676 |
#> 1% Low Fat Milk Jug -0.284  0.012 -0.260 |

Through PCA, we can retain only the most informative PCs (those with high cumulative variance) to perform dimensionality reduction, keeping as much information as possible in far fewer variables.

Based on the plot and summary above, we want to retain at least 80% of the variance. Therefore we choose Dim 1-4 (85.89% cumulative variance).
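As a sketch of what this reduction yields (menu_pc is just an illustrative name), we can pull the first four component scores out of the PCA object:

# keep the first four principal components (~85.9% of the variance)
menu_pc <- as.data.frame(pca_menu$ind$coord[, 1:4])
head(menu_pc)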

4.3 PCA Visualization

4.3.1 Individual Factor Map

plot.PCA(x = pca_menu,
         choix = "ind", # individual factor map
         invisible = "quali", # hide the supplementary category points
         select = "contrib5", # label the 5 highest-contributing individuals (outliers)
         habillage = 1)  # color the points by column 1 (Category)

In the individual factor map, Dim 1 alone covers 46.30% of the variance in the data.

We also found the five outliers to be (by menu Category):

  • Four from the Breakfast menu: rows 35, 32, 33, and 29
  • One from the Chicken & Fish menu: row 83
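To see which items these are, we can look the flagged rows up in the data (a sketch; the row numbers are taken from the plot above):

menu[c(29, 32, 33, 35, 83), c("Category", "Item")]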

4.3.2 Variable Factor Map

In the variable factor map, the variables are positioned inside the circle of correlations. The closer a variable is to the edge of the circle, the better it can be reconstructed from the first two components; the closer it is to the center of the plot, the less important it is for those two components.

fviz_pca_var(pca_menu , select.var = list(contrib = 20), col.var = "contrib", 
             gradient.cols = c("cyan", "gold", "maroon"), repel = TRUE)

fviz_contrib(X = pca_menu, 
             choice = "var", # show contributions by variable
             axes = 1) # the PC to display

Fat, Protein, Sodium, Calories, Iron, Dietary Fiber, and Cholesterol are the nutrients that contribute most to Dimension 1.

fviz_contrib(X = pca_menu, 
             choice = "var", # show contributions by variable
             axes = 2) # the PC to display

Sugars, Carbohydrates, and Calcium are the nutrients that contribute most to Dimension 2.

fviz_contrib(X = pca_menu, 
             choice = "var", # show contributions by variable
             axes = 3) # the PC to display

Vitamin A, Vitamin C, and Dietary Fiber are the nutrients that contribute most to Dimension 3.

fviz_contrib(X = pca_menu, 
             choice = "var", # show contributions by variable
             axes = 4) # the PC to display

Vitamin C is the nutrient that contributes most to Dimension 4.

5 Unsupervised Learning using: Clustering

Clustering is an unsupervised machine learning task that automatically discovers natural groupings in data. Unlike supervised learning (e.g., predictive modeling), clustering algorithms only interpret the input data and find natural groups, or clusters, in feature space.

5.1 Scaling

Clustering should use only the numerical variables.

head(menu_num)

We also need to scale the data so that variables with larger ranges do not dominate the distance calculations and bias the result.

menu_num_scale <- scale(menu_num)

5.2 Finding Optimum K

We need to find the optimum K for K-Means clustering, where K is the number of clusters in our model.

There are several methods available for finding the optimum K; we will use the Elbow Method for this project.

Choosing the number of clusters with the elbow method is somewhat arbitrary. The rule of thumb is to pick the number of clusters around the "bend of the elbow", where the total within-cluster sum of squares starts to flatten out as the number of clusters increases.

fviz_nbclust(menu_num_scale, kmeans, method = "wss", k.max = 30) + 
  labs(subtitle = "Elbow method") + 
  theme_bw()

Looking at the plot above, the "elbow" appears to be at k = 11.

5.3 K-Means Clustering

Now we will apply our optimum K (k = 11) in the clustering process. Later we will also create a new cluster column so that we can profile each cluster.

RNGkind(sample.kind = "Rounding")
set.seed(999)

# k-means clustering
menu_clust <- kmeans(menu_num_scale, centers = 11)

The number of observations in each cluster is:

menu_clust$size
#>  [1] 40 14 44 27 25  7  6  3 45 15 34

5.4 Goodness of Fit

The quality of a clustering result can be assessed with three quantities: Within Sum of Squares (withinss), Between Sum of Squares (betweenss), and Total Sum of Squares (totss).

Within Sum of Squares (withinss): the sum of squared distances from each observation to the centroid of its cluster.

menu_clust$withinss
#>  [1]  35.49918  47.08081  43.80337 118.72423  56.04666  41.18269  31.45488
#>  [8]  48.05747  96.11475  29.18924  42.92216

Between Sum of Squares (betweenss): the sum of squared distances from each cluster centroid to the global mean, weighted by the number of observations in the cluster.

menu_clust$betweenss
#> [1] 2517.925

Total Sum of Squares (totss): the sum of squared distances from each observation to the global mean.

menu_clust$totss
#> [1] 3108
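As a quick sanity check (a sketch using the objects above), the three quantities satisfy the identity totss = betweenss + tot.withinss:

# tot.withinss = sum(withinss); the identity should hold exactly
all.equal(menu_clust$totss,
          menu_clust$betweenss + menu_clust$tot.withinss)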

A good clustering should have a BSS/TSS ratio close to 1.

menu_clust$betweenss/menu_clust$totss
#> [1] 0.810143

The clustering reaches a BSS/TSS ratio of about 81%, which indicates that the clusters are well separated.

5.5 Clustering Analysis

5.5.1 Cluster Profiling

Let us determine the characteristics of each cluster.

menu$cluster <- menu_clust$cluster

menu1 <- menu[,3:15] # the numeric nutrient columns plus the new cluster column

menu1 %>% 
  group_by(cluster) %>% 
  summarise_all(mean) %>% 
  head(11)

Based on the results above, each cluster can be profiled as follows (see the standardized cluster centers after this list to verify these patterns):

  • Cluster 1 foods have almost no nutrients or calories.
  • Cluster 2 foods have very high Cholesterol, fairly high Iron, and moderate Sodium and Calories.
  • Cluster 3 foods have high Calcium but are low in Vitamin C and Dietary Fiber.
  • Cluster 4 foods have fairly high Calories, moderate Sodium and Dietary Fiber, good Protein, and fairly high Iron and Fat.
  • Cluster 5 foods have fairly high Sugars and Carbohydrates but very low Protein, Sodium, Iron, and Fat.
  • Cluster 6 foods have very high Vitamin C but very low Vitamin A, Cholesterol, and Sodium.
  • Cluster 7 foods have very high Vitamin A and fairly high Dietary Fiber.
  • Cluster 8 foods have very high Calories, Cholesterol, Sodium, Carbohydrates, Dietary Fiber, Protein, Iron, and Fat.
  • Cluster 9 foods are moderate in everything.
  • Cluster 10 foods have very high Sugars and Calcium, high Fat and Calories, low Dietary Fiber, and zero Vitamin C.
  • Cluster 11 foods have fairly high Sugars, high Calcium, and zero Vitamin C.
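As noted above, these profiles can be checked against the standardized cluster centers, where positive values mean above the overall average and negative values below it (a sketch using the kmeans object from before):

# standardized centers: each row is one cluster's nutrient profile
round(menu_clust$centers, 2)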

5.5.2 Clustering Plot

Below is a plot of the clusters.

fviz_cluster(object = menu_clust,
             data = menu_num_scale,
             ellipse.alpha = 0.1,
             labelsize = 10,
             ggtheme = theme_bw())

5.5.3 Recommendation Example

5.5.3.1 Example 1

If a customer wants food high in Vitamin C, what should we recommend?

From the profiling result above, we can see that foods with very high Vitamin C are in Cluster 6, so we can filter for all the foods in that cluster.

menu %>% select(-Category) %>% 
  filter(cluster==6)

So if a customer wants food high in Vitamin C, we can recommend the Fruit and Maple Oatmeal, Apple Slices, or a Minute Maid juice.

5.5.3.2 Example 2

If a customer wants a burger option besides the Big Mac, what should we recommend?

We first check which cluster the Big Mac belongs to.

menu %>% filter(Item=="Big Mac")

Now we know that the Big Mac is in Cluster 4, so we can check all the foods in that cluster.

menu %>% filter(cluster==4)

After filtering Cluster 4, we find plenty of options to recommend. If we want to be more specific, we can propose foods in Cluster 4 from the same category as the Big Mac, such as the Quarter Pounder, Double Quarter Pounder, or Bacon Clubhouse Burger; a sketch of that filter follows below.
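A sketch of the more specific filter (assuming, as the Big Mac row above shows, that its Category is Beef & Pork):

# items in the Big Mac's cluster and category
menu %>% 
  filter(cluster == 4, Category == "Beef & Pork") %>% 
  select(Item)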

6 Conclusion

After exploring our dataset with PCA and K-Means clustering, we can conclude that:

  • The McDonald's menu items in this dataset can be separated into 11 clusters.
  • We can reduce the 12 nutrient features to 4 principal dimensions while retaining at least 80% of the variance.
  • Using the cluster profiles, we can recommend foods to customers based on the nutritional characteristics recognized by the model.