A plant-based diet has become a new phenomenon in this decade, and deciding which combination of fruits and vegetables to include in a daily diet has become more important than ever. From a dataset of nutritional values of fruits and vegetables, we can cluster these foods into categories based on their nutrition. To solve this problem, we will perform a clustering analysis using the K-means method. We will also see whether we can reduce the dimensionality of the data using Principal Component Analysis (PCA).
The dataset consists of the nutritional values of a number of fruits and vegetables: macronutrients as well as some vitamins and minerals. The dataset was acquired from Kaggle.
library(tidyverse)
library(tidymodels)
library(stringr)
library(FactoMineR)
library(factoextra)
library(gridExtra)
library(plotly)
library(GGally)

f <- read.csv("fruits.csv") %>% mutate(type = as.factor("fruit"))
v <- read.csv("vegetables.csv") %>% mutate(type = as.factor("vegetable"))
# combine 2 dataset
fruit.vegs <- rbind(f,v)
glimpse(fruit.vegs)

## Rows: 148
## Columns: 23
## $ name <chr> "Apple nutrition facts", "Apricot nutrition facts", ~
## $ energy..kcal.kJ. <chr> "48/200", "48/201", "160/670", "89/371", "43/181", "~
## $ water..g. <dbl> 86.70, 86.40, 73.23, 74.91, 88.15, 84.21, 91.38, 79.~
## $ protein..g. <dbl> 0.27, 1.40, 2.00, 1.09, 1.39, 0.74, 1.04, 1.57, 1.06~
## $ total.fat..g. <dbl> 0.13, 0.39, 14.70, 0.33, 0.49, 0.33, 0.33, 0.68, 0.2~
## $ carbohydrates..g. <dbl> 12.70, 11.12, 8.53, 22.84, 9.61, 14.49, 6.73, 17.71,~
## $ fiber..g. <chr> "1.3", "2", "6.7", "2.6", "5.3", "2.4", "2.8", "3", ~
## $ sugars..g. <chr> "10.1", "9.24", "0.66", "12.23", "4.88", "9.96", "3.~
## $ calcium..mg. <int> 5, 13, 12, 5, 29, 6, 3, 10, 13, 30, 8, 55, 33, 39, 6~
## $ iron..mg. <dbl> 0.07, 0.39, 0.55, 0.26, 0.62, 0.28, 0.08, 0.27, 0.36~
## $ magnessium..mg. <chr> "4", "10", "29", "27", "20", "6", "10", "17", "11", ~
## $ phosphorus..mg. <int> 11, 23, 52, 22, 22, 12, 12, 26, 21, 21, 13, 59, 44, ~
## $ potassium..mg. <chr> "90", "259", "485", "358", "162", "77", "133", "287"~
## $ sodium..g. <chr> "0", "1", "7", "1", "1", "1", "2", "7", "0", "1", "2~
## $ vitamin.A..IU. <chr> "38", "1926", "146", "64", "214", "54", "61", "5", "~
## $ vitamin.C..mg. <dbl> 4.0, 10.0, 10.0, 8.7, 21.0, 9.7, 34.4, 12.6, 7.0, 48~
## $ vitamin.B1..mg. <chr> "0.019", "0.03", "0.067", "0.031", "0.02", "0.037", ~
## $ vitamin.B2..mg. <chr> "0.028", "0.04", "0.13", "0.073", "0.026", "0.041", ~
## $ viatmin.B3..mg. <chr> "0.091", "0.6", "1.738", "0.665", "0.646", "0.418", ~
## $ vitamin.B5..mg. <chr> "0.071", "0.24", "1.389", "0.334", "0.276", "0.124",~
## $ vitamin.B6..mg. <chr> "0.037", "0.054", "0.257", "0.367", "0.03", "0.052",~
## $ vitamin.E..mg. <chr> "0.05", "0.89", "2.07", "0.1", "1.17", "0.57", "0.15~
## $ type <fct> fruit, fruit, fruit, fruit, fruit, fruit, fruit, fru~
We will conduct data cleaning and wrangling to make the data easier to process.
levels(fruit.vegs$name %>% as.factor()) %>% head()

## [1] "Apple nutrition facts" "Apricot nutrition facts"
## [3] "Artichokes, cooked" "Asparagus, cooked"
## [5] "Avocado nutrition" "Banana nutrition facts"
levels(fruit.vegs$name %>% as.factor()) %>% tail()

## [1] "Tomatoes, red, ripe, stewed" "Turnips or swede, cooked"
## [3] "Turnips or swede, raw" "Wasabi, root, raw"
## [5] "Watermelon nutrition facts" "Yam, cooked"
Notice how the names of the foods are redundant. We should strip the words other than the plants' names. We will also remove the cooked version of each vegetable and focus on the raw version, unless only a cooked version is available.
# remove unnecessarily redundant words
fruit.vegs$name <- fruit.vegs$name %>% str_remove_all(pattern = "nutrition|facts") %>%
str_trim()
# remove the cooked version only if it has both cooked and raw version.
is.cooked <- fruit.vegs$name[fruit.vegs$name %>% str_detect("cooked|baked|stewed")] %>%
  str_remove("cooked|baked|stewed") %>%
  str_trim()
is.raw <- fruit.vegs$name[fruit.vegs$name %>% str_detect("raw")] %>%
  str_remove("raw") %>%
  str_trim()
# a food has two versions if its name matches both a cooked and a raw base name
has.two.ver <- fruit.vegs$name[(fruit.vegs$name %>% str_detect(paste(is.cooked, collapse = '|'))) &
                                 (fruit.vegs$name %>% str_detect(paste(is.raw, collapse = '|')))]
fruit.vegs.clean <- fruit.vegs %>% filter(!(name %in% has.two.ver) | str_detect(name, "raw"))

The name variable represents the name of each food. Therefore, it would be wise to assign the values of the name column to the row names.
# check and remove duplicate name
fruit.vegs.clean <- fruit.vegs.clean[!(fruit.vegs.clean$name %>% duplicated()),]
rownames(fruit.vegs.clean) <- fruit.vegs.clean$name
fruit.vegs.clean <- fruit.vegs.clean %>% select(-name)

Then, we keep only the kcal part of the energy variable, because kcal (calories) is the unit most people use as the basis for measuring the amount of energy in a food.
fruit.vegs.clean <- fruit.vegs.clean %>%
mutate(energy..kcal.kJ. = sub("\\/.*", "", energy..kcal.kJ.),
       energy..kcal.kJ. = as.numeric(energy..kcal.kJ.))

Next, we convert inappropriate data types. As we can see above, several columns that should be double or integer were read as character; we will convert them to numeric.
fruit.vegs.clean <- fruit.vegs.clean %>% mutate_if(is.character, as.numeric)

We also check the number of missing values in the dataset.
# check missing values
colSums(is.na(fruit.vegs.clean)) %>% sum()

## [1] 54
There are 54 missing values. They come from entries such as "-" in the source data, which represent zero and were coerced to NA by the numeric conversion, so we will replace them with 0.
fruit.vegs.clean[is.na(fruit.vegs.clean)] <- 0

We will inspect the differences in the numeric variables between the two food types. Since there are 21 variables, we will only look at some of the most basic and important information: calories, water, carbohydrate, protein, vitamin C, and total fat.
library(cowplot)
p1 <- ggplot(fruit.vegs.clean, aes(type, energy..kcal.kJ., fill = type)) +
geom_boxplot(show.legend = F) +
theme_minimal() +
labs(title = "Calories")
p2 <- ggplot(fruit.vegs.clean, aes(type, water..g., fill = type)) +
geom_boxplot(show.legend = F) +
theme_minimal() +
labs(title = "Water")
p3 <- ggplot(fruit.vegs.clean, aes(type, carbohydrates..g., fill = type)) +
geom_boxplot(show.legend = F) +
theme_minimal() +
labs(title = "Carbohydrate")
p4 <- ggplot(fruit.vegs.clean, aes(type, protein..g., fill = type)) +
geom_boxplot(show.legend = F) +
theme_minimal() +
labs(title = "Protein")
p5 <- ggplot(fruit.vegs.clean, aes(type, vitamin.C..mg., fill = type)) +
geom_boxplot(show.legend = F) +
theme_minimal() +
labs(title = "Vit.C")
p6 <- ggplot(fruit.vegs.clean, aes(type, total.fat..g., fill = type)) +
geom_boxplot() +
theme_minimal() +
theme(legend.position = "bottom") + labs(title = "Total Fat")
plot_grid(p1, p2, p3, p4, p5, p6)

Vegetables show slightly more variance in the macronutrients compared to fruits.
Based on our exploratory process, we can segment or cluster these plant based food based on their specific traits, such as the macro & micro nutrients. To find more interesting and undiscovered pattern in the data, we will use clustering method using the K-means.
Here we will explore the distribution of each numeric variable using density plots and the correlations between variables using scatterplots, both of which are provided by the ggpairs function from the GGally package.
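The ggpairs call itself does not appear in the rendered output; a minimal sketch of what it might look like, restricted to a few macronutrients (the full 21-variable matrix would be unreadable):

# density plots on the diagonal, scatterplots below, colored by food type
fruit.vegs.clean %>%
  select(energy..kcal.kJ., water..g., protein..g.,
         total.fat..g., carbohydrates..g., type) %>%
  ggpairs(mapping = aes(color = type, alpha = 0.5))

For a more compact view of the correlations alone, we use a correlation matrix plot from the corrplot package: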
library(corrplot)
cor.matrix <- fruit.vegs.clean %>%
na.omit() %>%
select(-type) %>%
cor()
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FAFAFA", "#77AADD", "#4477AA"))
corrplot(cor.matrix, method = "color", col = col(200),
type = "upper", order = "alphabet", number.cex = .6,
addCoef.col = "black",
tl.col = "black",
         diag = FALSE)

It can be seen that there are strong correlations between some of the variables. This indicates that the dataset has multicollinearity and might not be suitable for various algorithms that assume the absence of multicollinearity.
Principal Component Analysis can be performed for this data to produce non-multicollinearity data, while also reducing the dimension of the data and retaining as much as information possible. The result of this analysis can be utilized further for classification purpose with lower computation.
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
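To make the transformation concrete, here is a minimal sketch of PCA computed by hand as an eigendecomposition of the correlation matrix (the variable names assume the cleaned data from above):

# PCA by hand: eigendecomposition of the correlation matrix
X <- scale(fruit.vegs.clean %>% select(-type))  # center and scale each variable
X[is.na(X)] <- 0                                # guard against NAs from zero-variance columns
eig <- eigen(cor(X))

round(eig$values / sum(eig$values) * 100, 2)    # % of variance captured by each PC

scores <- X %*% eig$vectors                     # project observations onto the eigenvectors
head(scores[, 1:2])                             # coordinates of each food on PC1 and PC2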
PCA is very useful for retaining information while reducing the dimension of the data. However, we need to make sure that our data is properly scaled in order to get a useful PCA. You may use the scale() function to scale the numeric variables and store the result as fruit.vegs.scaled.
fruit.vegs.scaled <- scale(fruit.vegs.clean %>% select(-type)) %>%
  replace_na(0)

Here we will run PCA on the cleaned dataset and look at the eigenvalues and the percentage of variance explained by each dimension. The eigenvalues measure the amount of variation retained by each principal component. They are large for the first PCs and small for the subsequent ones; that is, the first PC corresponds to the direction with the maximum amount of variation in the data set.
qualivar <- c(22) # column index of the qualitative variable (type)
pca.fruit.vegies <- PCA(X = fruit.vegs.clean,
scale.unit = T,
quali.sup = qualivar,
graph = F)
summary(pca.fruit.vegies)

##
## Call:
## PCA(X = fruit.vegs.clean, scale.unit = T, quali.sup = qualivar,
## graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 8.147 3.106 2.084 1.451 1.235 0.927 0.754
## % of var. 38.797 14.790 9.924 6.909 5.883 4.416 3.591
## Cumulative % of var. 38.797 53.587 63.512 70.421 76.303 80.719 84.311
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.613 0.458 0.415 0.393 0.362 0.277 0.235
## % of var. 2.921 2.181 1.974 1.872 1.722 1.318 1.120
## Cumulative % of var. 87.232 89.412 91.386 93.259 94.981 96.299 97.419
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
## Variance 0.161 0.134 0.104 0.080 0.047 0.015 0.001
## % of var. 0.768 0.637 0.496 0.380 0.225 0.069 0.006
## Cumulative % of var. 98.187 98.824 99.320 99.701 99.925 99.994 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2
## Apple | 3.195 | -2.535 0.680 0.629 | -1.339 0.498 0.176 |
## Apricot | 1.884 | -1.225 0.159 0.423 | -0.702 0.137 0.139 |
## Avocado | 6.953 | 3.783 1.514 0.296 | -1.408 0.550 0.041 |
## Banana | 2.893 | -0.026 0.000 0.000 | -1.432 0.569 0.245 |
## Blackberries | 2.241 | -0.896 0.085 0.160 | -0.630 0.110 0.079 |
## Blueberry | 2.677 | -1.837 0.357 0.471 | -1.304 0.472 0.237 |
## Carambola or Starfruit | 2.819 | -2.122 0.477 0.567 | -1.110 0.342 0.155 |
## Cherimoya fruit | 2.114 | -0.063 0.000 0.001 | -0.991 0.273 0.220 |
## Cherry fruit | 2.266 | -1.483 0.233 0.428 | -1.404 0.547 0.384 |
## Clementine | 2.077 | -1.539 0.251 0.549 | -0.667 0.123 0.103 |
## Dim.3 ctr cos2
## Apple -0.361 0.054 0.013 |
## Apricot -0.518 0.111 0.076 |
## Avocado -2.065 1.763 0.088 |
## Banana 1.077 0.480 0.139 |
## Blackberries -0.753 0.235 0.113 |
## Blueberry -0.495 0.101 0.034 |
## Carambola or Starfruit -1.070 0.474 0.144 |
## Cherimoya fruit 0.532 0.117 0.063 |
## Cherry fruit 0.230 0.022 0.010 |
## Clementine -0.345 0.049 0.028 |
##
## Variables (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## energy..kcal.kJ. | 0.892 9.761 0.795 | -0.400 5.152 0.160 | 0.013
## water..g. | -0.830 8.450 0.688 | 0.445 6.371 0.198 | -0.258
## protein..g. | 0.881 9.518 0.775 | 0.048 0.075 0.002 | -0.150
## total.fat..g. | 0.772 7.313 0.596 | -0.128 0.529 0.016 | -0.542
## carbohydrates..g. | 0.398 1.946 0.159 | -0.481 7.436 0.231 | 0.614
## fiber..g. | 0.625 4.798 0.391 | -0.184 1.090 0.034 | 0.262
## sugars..g. | 0.123 0.185 0.015 | -0.575 10.662 0.331 | 0.535
## calcium..mg. | 0.408 2.041 0.166 | 0.591 11.235 0.349 | 0.353
## iron..mg. | 0.617 4.671 0.381 | 0.473 7.208 0.224 | 0.230
## magnessium..mg. | 0.857 9.018 0.735 | 0.245 1.934 0.060 | -0.011
## ctr cos2
## energy..kcal.kJ. 0.008 0.000 |
## water..g. 3.191 0.066 |
## protein..g. 1.085 0.023 |
## total.fat..g. 14.080 0.293 |
## carbohydrates..g. 18.093 0.377 |
## fiber..g. 3.297 0.069 |
## sugars..g. 13.751 0.287 |
## calcium..mg. 5.986 0.125 |
## iron..mg. 2.530 0.053 |
## magnessium..mg. 0.006 0.000 |
##
## Supplementary categories
## Dist Dim.1 cos2 v.test Dim.2 cos2 v.test
## fruit | 1.380 | -0.791 0.328 -2.678 | -1.036 0.563 -5.680 |
## vegetable | 1.121 | 0.642 0.328 2.678 | 0.841 0.563 5.680 |
## Dim.3 cos2 v.test
## fruit 0.040 0.001 0.269 |
## vegetable -0.033 0.001 -0.269 |
Let's visualize the percentage of variance captured by each dimension.
fviz_eig(pca.fruit.vegies, addlabels = T, main = "Variance explained by each dimension")

If we tolerate no more than 20% of information loss, we can use 6 principal components (PCs), namely dim 1 to dim 6, which together explain about 80.7% of the variance.
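The numbers behind this plot can be read directly from the PCA object, which stores the eigenvalue table in $eig:

# eigenvalues, % of variance, and cumulative % for the first 6 dimensions
head(as.data.frame(pca.fruit.vegies$eig), 6)

# first dimension at which the cumulative variance reaches 80%
which(pca.fruit.vegies$eig[, "cumulative percentage of variance"] >= 80)[1]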
plot.PCA(x = pca.fruit.vegies,
choix = "ind", # plot individual (observasi)
select = "contrib10",
         habillage = 1)

From the graph above, we can see that dates and peanuts are outlier foods with a high amount of calories.
Now, we will try to remove these outliers.
outliers <- c("Date : Dates, deglet noor","Dates, medjool","Peanuts, dry-roasted")
fruit.vegs.clean <- fruit.vegs.clean %>%
rownames_to_column('name') %>%
filter(!name %in% outliers) %>%
column_to_rownames('name')
# scale the dataframe
fruit.vegs.scaled <- scale(fruit.vegs.clean %>% select(-type)) %>%
  replace_na(0)

cos2, the squared cosine value, shows the importance of a principal component for a given observation (a vector of the original variables). The cos2 values can help find the components that are important for interpreting observations.
The individual observations map shows where each observation is positioned in terms of PC1 and PC2. Using only the first 2 PCs, we can see that there are a number of outliers in our data with a high cos2 on PC1. Further analysis can be done to check them.
fviz_pca_ind(pca.fruit.vegies,
habillage = 22,
addEllipses = T,
geom.ind = c("point","text"),
             repel = T)

From the plot, we can see that vegetables have more varied PC values, indicated by the bigger ellipse, and higher PC1 scores compared to fruits. Note, however, that the first two dimensions capture only about 54% of the total variance, so this two-dimensional map is a useful but incomplete representation of the data.
If the observations are represented by their projections, the variables are represented by their correlations. When more than two components are needed to represent the data perfectly, the variables will be positioned inside the circle of correlations. The closer a variable is to the circle of correlations, the better we can reconstruct this variable from the first two components. The closer to the center of the plot a variable is, the less important it is for the first two components.
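fviz_pca_var below draws this circle of correlations; since the PCA was run on standardized variables, the plotted coordinates are exactly the correlations between each variable and the components, and they can also be read directly from the PCA object:

# arrow tips on the correlation circle: correlations between
# each variable and the first two components
round(pca.fruit.vegies$var$coord[, 1:2], 3)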
fviz_pca_var(pca.fruit.vegies, select.var = list(contrib = 22), col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE)

One insight we can draw from the graph above is that calories and water are negatively correlated, which means that plant-based foods high in calories will have a low amount of water.
We can also check the quality of representation, or cos2, of each variable. A high cos2 indicates a good representation of the variable on the principal component; in this case the variable is positioned close to the circumference of the correlation circle. A low cos2 indicates that the variable is not perfectly represented by the PCs; in this case the variable is close to the center of the circle.
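Under the hood, for a standardized PCA a variable's cos2 is just its squared coordinate (i.e., its squared correlation with the component); factoextra reads these values from the PCA object:

# cos2 of each variable on the first two components ...
head(pca.fruit.vegies$var$cos2[, 1:2])

# ... which equals the squared coordinates
head(pca.fruit.vegies$var$coord[, 1:2]^2)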
fviz_cos2(pca.fruit.vegies, choice = "var", fill = "cos2") + scale_fill_viridis_c(option = "B") +
  theme(legend.position = "top")

We can consider removing the sodium and vitamin.A variables, since they contribute too little to the first principal components. The following table shows the correlation of each variable with PC1.
dd <- dimdesc(pca.fruit.vegies)
as.data.frame(dd[[1]]$quanti)

Before we do the cluster analysis, we first need to determine the optimal number of clusters. In clustering, we seek to minimize the total within-cluster sum of squares (meaning that the distance between observations in the same cluster is minimal). To find the optimal number of clusters, we can use 3 methods: the elbow method, the silhouette method, and the gap statistic. We will decide the number of clusters based on majority voting.
Choosing the number of clusters using the elbow method is a little subjective. The rule of thumb is to choose the number of clusters in the area of the "bend of an elbow", where the total within-cluster sum of squares starts to stagnate as the number of clusters increases.
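fviz_nbclust below wraps this computation; an equivalent manual sketch makes the quantity explicit:

# total within-cluster sum of squares for k = 1..15
set.seed(100)
wss <- sapply(1:15, function(k) kmeans(fruit.vegs.scaled, centers = k, nstart = 25)$tot.withinss)
plot(1:15, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")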
fviz_nbclust(fruit.vegs.scaled, kmeans, method = "wss", k.max = 15) + labs(subtitle = "Elbow method")

Using the elbow method, 5 clusters looks good enough, since there is no significant decline in the total within-cluster sum of squares at higher numbers of clusters. This method alone may not be enough, since the location of the elbow is somewhat vague.
The silhouette method measures the silhouette coefficient, calculated from the mean intra-cluster distance and the mean nearest-cluster distance for each observation. We get the optimal number of clusters by choosing the number of clusters with the highest average silhouette score (the peak).
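For each observation i, the silhouette width is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other members of its own cluster and b(i) the mean distance to the nearest other cluster. For a single k it can be computed directly with the cluster package; fviz_nbclust below sweeps the average over k = 2..15:

library(cluster)

# average silhouette width for a single clustering, here k = 2
set.seed(100)
km2 <- kmeans(fruit.vegs.scaled, centers = 2)
sil <- silhouette(km2$cluster, dist(fruit.vegs.scaled))
mean(sil[, "sil_width"])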
fviz_nbclust(fruit.vegs.scaled, kmeans, "silhouette", k.max = 15) + labs(subtitle = "Silhouette method") Based on the silhouette method, number of clusters with maximum score is considered as the optimum k-clusters. The graph shows that the optimum number of cluster is 2.
The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimated optimal number of clusters is the value that maximizes the gap statistic.
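fviz_nbclust computes this via cluster::clusGap; a direct sketch (B is the number of bootstrap reference datasets, kept small here for speed):

library(cluster)

# gap statistic for k = 1..15 with 50 bootstrap reference datasets
set.seed(100)
gap <- clusGap(fruit.vegs.scaled, FUNcluster = kmeans, K.max = 15, B = 50)
head(gap$Tab)  # logW, E.logW, gap, and its standard error per k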
fviz_nbclust(fruit.vegs.scaled, kmeans, "gap_stat", k.max = 15) + labs(subtitle = "Gap Statistic method")There are 2 possibilities to use.
2. If we use k=2, because our data is combination of two dataset which is fruits and vegetables. it seems that two cluster will automatically differentiate the data into fruit and vegetable. so our model will not be useful. 3. So, in this case we will try to use k=5
Here is the algorithm behind K-means clustering:

1. Randomly pick k observations as the initial centroids.
2. Assign each observation to its nearest centroid (by Euclidean distance).
3. Recompute each centroid as the mean of the observations assigned to it.
4. Repeat steps 2-3 until the assignments no longer change.
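For illustration only, a bare-bones sketch of these steps in R (it assumes no cluster ever becomes empty):

# toy implementation of Lloyd's algorithm, for illustration only
simple_kmeans <- function(x, k, max.iter = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]      # step 1: random initial centroids
  for (i in seq_len(max.iter)) {
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k] # distances point -> each centroid
    cluster <- max.col(-d)                               # step 2: assign to nearest centroid
    new.centers <- apply(x, 2, function(col) tapply(col, cluster, mean)) # step 3: recompute
    if (all(abs(new.centers - centers) < 1e-10)) break   # step 4: stop when centroids settle
    centers <- new.centers
  }
  list(cluster = cluster, centers = centers)
}

set.seed(100)
table(simple_kmeans(fruit.vegs.scaled, k = 5)$cluster)   # cluster sizes

In practice we rely on the optimized implementation in stats::kmeans: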
RNGkind(sample.kind = "Rounding")
set.seed(100)
km <- kmeans(fruit.vegs.scaled, centers = 5)
km$centers

## energy..kcal.kJ. water..g. protein..g. total.fat..g. carbohydrates..g.
## 1 0.24444418 -0.310834476 0.08246849 -0.07917622 0.51708444
## 2 -0.01379815 0.002669229 -0.46905648 -0.16978179 0.09204301
## 3 -0.73670213 0.611050828 0.14623284 -0.14888393 -0.72609810
## 4 -0.80286171 0.835371619 -0.33700809 -0.23494687 -0.66569506
## 5 2.22932865 -2.122484947 2.28956091 1.44386535 0.95656982
## fiber..g. sugars..g. calcium..mg. iron..mg. magnessium..mg.
## 1 0.21791070 -0.2548542 0.10526343 0.2797499 0.3915553
## 2 -0.25912976 0.9826422 -0.46292238 -0.5743972 -0.5372306
## 3 -0.02716787 -1.0442741 2.30437663 2.0095761 2.4217100
## 4 -0.45623719 -0.5494061 -0.09974286 -0.2622191 -0.4081390
## 5 1.78068292 -0.6269204 0.79526902 1.3995358 1.3679295
## phosphorus..mg. potassium..mg. sodium..g. vitamin.A..IU. vitamin.C..mg.
## 1 0.3740437 0.6378397 0.13590610 0.55292882 0.43303789
## 2 -0.5904493 -0.6085555 -0.43566074 -0.34697251 -0.06804930
## 3 0.1575768 2.0261995 3.78664799 1.88348262 -0.17505384
## 4 -0.3561028 -0.3633323 0.07083675 -0.03966096 -0.00026426
## 5 2.1686220 1.1191637 -0.35562059 -0.46987861 -0.55535578
## vitamin.B1..mg. vitamin.B2..mg. viatmin.B3..mg. vitamin.B5..mg.
## 1 0.2756666 0.0576733 1.00437696 0.4433993
## 2 -0.4597908 -0.2956316 -0.38796554 -0.2163089
## 3 0.5820921 3.3800259 -0.01080528 -0.5913291
## 4 -0.3371075 -0.2329397 -0.36288617 -0.2106058
## 5 1.7449350 0.3977701 0.39576196 0.6640565
## vitamin.B6..mg. vitamin.E..mg.
## 1 0.2833132 0.08557683
## 2 -0.4109613 -0.13592573
## 3 0.2492036 2.72649605
## 4 -0.1409598 -0.17972836
## 5 1.0981722 -0.10323770
km$betweenss / km$totss * 100

## [1] 44.70045
The ratio of the between-cluster sum of squares to the total sum of squares is 44.7%, meaning that less than half of the total sum of squares comes from the distance between clusters. The closer this ratio is to 1, the better the clustering model. Thus, we can conclude that our data has not been clustered very tightly, since the ratio is below 50%.
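This ratio relies on the standard decomposition of the total sum of squares; a quick sanity check on the fitted object:

# totss = tot.withinss + betweenss, so the difference should be ~0
km$totss - (km$tot.withinss + km$betweenss)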
We have already obtained the cluster of each observation. Let's join the cluster vector to the dataset.
cluster <- as.factor(km$cluster)
fruit.vegs.clust <- fruit.vegs.clean %>% bind_cols(cluster = cluster) %>% na.omit()

PCA can also be combined with the result of the K-means clustering to help visualize our data in fewer dimensions than the original features.
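Roughly, fviz_cluster projects the scaled data onto its first two principal components and colors the points by cluster; a manual sketch of the same idea using prcomp:

# project onto the first two PCs and color by cluster assignment
pc <- prcomp(fruit.vegs.scaled)
data.frame(pc$x[, 1:2], cluster = as.factor(km$cluster)) %>%
  ggplot(aes(PC1, PC2, color = cluster)) +
  geom_point() +
  theme_minimal()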
fviz_cluster(object = km, data = fruit.vegs.scaled, labelsize = 0) + theme_minimal()

From the graph above, it seems that even with only two PCs we can see a distinct separation between the clusters.
We will analyze the characteristics of each cluster and see whether there are differences or specific traits in each cluster. Since we have a lot of features (21), we might not be able to explore all of them.
library(ggforce)
fruit.vegs.clust %>%
  ggplot(aes(energy..kcal.kJ., phosphorus..mg., color = cluster)) +
  geom_point(alpha = 0.5) +
  geom_mark_hull() +
  scale_color_brewer(palette = "Set1") +
  theme_minimal() +
  theme(legend.position = "top")

We might also check each cluster's centroid:
fruit.vegs.clust %>%
  group_by(cluster) %>%
  summarise_if(is.numeric, mean) %>%
  mutate_if(is.numeric, round, digits = 2)

Some interesting findings that we can take from the centroids include:
- Cluster 1 has the highest vitamin C and is also quite high in carbs and sugar.
- Cluster 2 has the highest sugar, while its protein is the lowest.
- Cluster 3 is the highest in vitamins A and E, and its minerals are also quite high, while its sugar and carbs are the lowest.
- Cluster 4 has the lowest calories, fat, and fiber.
- Cluster 5 is the opposite of cluster 4 and is also high in protein, but has the lowest vitamins A and C.

The conclusions that we can draw from our model and analysis are:
Cluster 4 is the best for a person who is on a diet, and so is cluster 3, which is not only low in carbs and sugar but also rich in vitamins and minerals. On the other hand, cluster 5 should be avoided by those on a diet.