Open Food Facts Clustering

Introduction

Open Food Facts is a freely accessible database of food products that contains comprehensive details about the ingredients, allergens, nutritional facts and all the information that can be found on product labels. It is a non-profit association of volunteers, which is said to be made by everyone, for everyone. It serves as reference for individuals to compare and modify their existing food choices in a more health-conscious manner. This is because it helps people to make more informed food choices by making the knowledge on food accessible as easy as by just scanning the food label using their app, food additives labels, allergens, etc. can be decoded, to find a certain product that match your criteria. Additionally, the nutritional quality of each product is evaluated using a grading system comprising the letters A, B, C, D and E. It should be noted that only a limited number of items are currently labelled with their respective nutritional grades on their packaging in stores.

The data from Open Food Facts is available on the official website (https://world.openfoodfacts.org/). However, the original dataset is too large to process (10 GB), hence in this project, the TSV (tab-separated values) file from Open Food Facts Kaggle is used. This file can be accessed via the following link: https://www.kaggle.com/datasets/openfoodfacts/world-food-facts.

Objectives

As a consumer myself, I am very concerned with the food choices I made. I will always read the ingredient list and try to understand or even translate the ingredients which made up this product when necessary. This is to ensure I do not consume undesired and unnecessary additives, and to have the food ingredients as close to raw food as possible depending on the products.

Objective 1: Use clustering to explore the relationship between nutritional attributes and consumer choices. Objective 2: Analyze geographic trends and validate the nutritional quality in food products.

# read packages
library(tidyverse)
library(readr)
library(dplyr)
library(factoextra)
library(fpc)
library(NbClust)
library(ggplot2)
library(cluster)
library(flexclust)
library(corrplot)

Data - Open Food Facts

Let’s have a look at the summary of the data:

To clarify, this dataset has data from France so the nutrition_grade_fr is nutrition grade of products in France.

open_food_facts <- read.csv("trimmed_open_food_facts_dataset.csv", sep = ",")
head(open_food_facts)

##                  product_name          brands manufacturing_places
## 1 Roasted & Salted Pistachios Paramount Farms                     
## 2                   Panettone            Coop                     
## 3   Creme legere semi epaisse       Carrefour                     
## 4     Organic Cheese Crackers     Full Circle                     
## 5            Miel de Montagne        Eric Bur                     
## 6        Gummies, Peach Rings          Kroger                     
##   purchase_places allergens traces additives_n ingredients_from_palm_oil_n
## 1                                            0                           0
## 2                                           NA                          NA
## 3                                            0                           0
## 4                                            3                           0
## 5     Lyon,France                            0                           0
## 6                                            7                           0
##   energy.from.fat_100g trans.fat_100g sugars_100g fiber_100g proteins_100g
## 1                   NA              0         3.9        5.2         11.69
## 2                   NA             NA        29.0        2.0          7.00
## 3                   NA             NA         3.1         NA          2.80
## 4                   NA              0         0.0        3.3          6.67
## 5                   NA             NA        80.0        0.0          0.40
## 6                   NA              0        50.0        0.0          5.00
##   sodium_100g alcohol_100g fruits.vegetables.nuts_100g cocoa_100g
## 1 0.286000000           NA                          NA         NA
## 2 0.157480315           NA                          NA         NA
## 3 0.055118110           NA                          NA         NA
## 4 0.900000000           NA                          NA         NA
## 5 0.002362205           NA                          NA         NA
## 6 0.012000000           NA                          NA         NA
##   nutrition_grade_fr
## 1                  a
## 2                  d
## 3                  d
## 4                  d
## 5                  d
## 6                  d

str(open_food_facts)

## 'data.frame':    1500 obs. of  18 variables:
##  $ product_name               : chr  "Roasted & Salted Pistachios" "Panettone" "Creme legere semi epaisse" "Organic Cheese Crackers" ...
##  $ brands                     : chr  "Paramount Farms" "Coop" "Carrefour" "Full Circle" ...
##  $ manufacturing_places       : chr  "" "" "" "" ...
##  $ purchase_places            : chr  "" "" "" "" ...
##  $ allergens                  : chr  "" "" "" "" ...
##  $ traces                     : chr  "" "" "" "" ...
##  $ additives_n                : num  0 NA 0 3 0 7 0 5 1 NA ...
##  $ ingredients_from_palm_oil_n: num  0 NA 0 0 0 0 0 0 0 NA ...
##  $ energy.from.fat_100g       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ trans.fat_100g             : num  0 NA NA 0 NA 0 NA NA 0 NA ...
##  $ sugars_100g                : num  3.9 29 3.1 0 80 50 30 1.5 2.27 4 ...
##  $ fiber_100g                 : num  5.2 2 NA 3.3 0 0 NA 2 2.3 NA ...
##  $ proteins_100g              : num  11.69 7 2.8 6.67 0.4 ...
##  $ sodium_100g                : num  0.286 0.15748 0.05512 0.9 0.00236 ...
##  $ alcohol_100g               : num  NA NA NA NA NA NA NA 0 NA NA ...
##  $ fruits.vegetables.nuts_100g: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ cocoa_100g                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ nutrition_grade_fr         : chr  "a" "d" "d" "d" ...

summary(open_food_facts)

##  product_name          brands          manufacturing_places purchase_places   
##  Length:1500        Length:1500        Length:1500          Length:1500       
##  Class :character   Class :character   Class :character     Class :character  
##  Mode  :character   Mode  :character   Mode  :character     Mode  :character  
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##   allergens            traces           additives_n    
##  Length:1500        Length:1500        Min.   : 0.000  
##  Class :character   Class :character   1st Qu.: 0.000  
##  Mode  :character   Mode  :character   Median : 1.000  
##                                        Mean   : 2.041  
##                                        3rd Qu.: 3.000  
##                                        Max.   :22.000  
##                                        NA's   :146     
##  ingredients_from_palm_oil_n energy.from.fat_100g trans.fat_100g  
##  Min.   :0.00000             Min.   :  0.0        Min.   :0.0000  
##  1st Qu.:0.00000             1st Qu.: 65.0        1st Qu.:0.0000  
##  Median :0.00000             Median :130.0        Median :0.0000  
##  Mean   :0.03323             Mean   :193.3        Mean   :0.0251  
##  3rd Qu.:0.00000             3rd Qu.:290.0        3rd Qu.:0.0000  
##  Max.   :2.00000             Max.   :450.0        Max.   :4.5500  
##  NA's   :146                 NA's   :1497         NA's   :746     
##   sugars_100g       fiber_100g    proteins_100g     sodium_100g     
##  Min.   :  0.00   Min.   : 0.00   Min.   : 0.000   Min.   : 0.0000  
##  1st Qu.:  1.23   1st Qu.: 0.00   1st Qu.: 1.820   1st Qu.: 0.0330  
##  Median :  5.00   Median : 1.40   Median : 5.710   Median : 0.2362  
##  Mean   : 14.92   Mean   : 2.79   Mean   : 7.834   Mean   : 0.4476  
##  3rd Qu.: 23.00   3rd Qu.: 3.60   3rd Qu.:11.000   3rd Qu.: 0.5118  
##  Max.   :100.00   Max.   :51.60   Max.   :55.000   Max.   :23.7500  
##  NA's   :1        NA's   :254     NA's   :1        NA's   :1        
##   alcohol_100g  fruits.vegetables.nuts_100g   cocoa_100g    nutrition_grade_fr
##  Min.   :0      Min.   :  0.00              Min.   :28.00   Length:1500       
##  1st Qu.:0      1st Qu.:  7.75              1st Qu.:29.50   Class :character  
##  Median :0      Median : 45.00              Median :38.50   Mode  :character  
##  Mean   :0      Mean   : 42.19              Mean   :43.75                     
##  3rd Qu.:0      3rd Qu.: 53.75              3rd Qu.:52.75                     
##  Max.   :0      Max.   :100.00              Max.   :70.00                     
##  NA's   :1489   NA's   :1486                NA's   :1496

dim(open_food_facts)

## [1] 1500   18

# Get the total number of rows in the dataset
print(paste("Total number of rows: ", nrow(open_food_facts)))

## [1] "Total number of rows:  1500"

# Count rows that have at least one NA value
print(paste("Number of empty rows: ", sum(!complete.cases(open_food_facts))))

## [1] "Number of empty rows:  1500"

# Count the number of NA values in each column
print(paste("Number of NA values in each column", colSums(is.na(open_food_facts))))

##  [1] "Number of NA values in each column 0"   
##  [2] "Number of NA values in each column 0"   
##  [3] "Number of NA values in each column 0"   
##  [4] "Number of NA values in each column 0"   
##  [5] "Number of NA values in each column 0"   
##  [6] "Number of NA values in each column 0"   
##  [7] "Number of NA values in each column 146" 
##  [8] "Number of NA values in each column 146" 
##  [9] "Number of NA values in each column 1497"
## [10] "Number of NA values in each column 746" 
## [11] "Number of NA values in each column 1"   
## [12] "Number of NA values in each column 254" 
## [13] "Number of NA values in each column 1"   
## [14] "Number of NA values in each column 1"   
## [15] "Number of NA values in each column 1489"
## [16] "Number of NA values in each column 1486"
## [17] "Number of NA values in each column 1496"
## [18] "Number of NA values in each column 0"

# Data Factorization for non-numeric column
open_food_facts <- open_food_facts %>%
  mutate(across(where(is.character), as.factor))
# type = factor, with level = "a" "b" "c" "d" "e"
levels(open_food_facts$nutrition_grade_fr)

## [1] "a" "b" "c" "d" "e"

# Convert factor columns to numeric (label encoding)
factor_columns <- sapply(open_food_facts, is.factor)
open_food_facts[factor_columns] <- lapply(open_food_facts[factor_columns], function(x) as.numeric(x))
# nutrition_grade_fr, "a" = 1 and so on

# Replace NA with 0 in numeric columns
open_food_facts[sapply(open_food_facts, is.numeric)] <- lapply(open_food_facts[sapply(open_food_facts, is.numeric)], function(x) {
  ifelse(is.na(x), 0, x)
})

str(open_food_facts)

## 'data.frame':    1500 obs. of  18 variables:
##  $ product_name               : num  1165 985 370 932 826 ...
##  $ brands                     : num  797 220 157 385 327 557 340 607 165 880 ...
##  $ manufacturing_places       : num  1 1 1 1 1 1 1 98 1 1 ...
##  $ purchase_places            : num  1 1 1 1 79 1 1 1 1 1 ...
##  $ allergens                  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ traces                     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ additives_n                : num  0 0 0 3 0 7 0 5 1 0 ...
##  $ ingredients_from_palm_oil_n: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ energy.from.fat_100g       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ trans.fat_100g             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ sugars_100g                : num  3.9 29 3.1 0 80 50 30 1.5 2.27 4 ...
##  $ fiber_100g                 : num  5.2 2 0 3.3 0 0 0 2 2.3 0 ...
##  $ proteins_100g              : num  11.69 7 2.8 6.67 0.4 ...
##  $ sodium_100g                : num  0.286 0.15748 0.05512 0.9 0.00236 ...
##  $ alcohol_100g               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ fruits.vegetables.nuts_100g: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cocoa_100g                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ nutrition_grade_fr         : num  1 4 4 4 4 4 3 1 2 2 ...

# All columns are of type numeric now now

# Verify that no column has NA
print(paste("Number of NA values in each column", colSums(is.na(open_food_facts))))

##  [1] "Number of NA values in each column 0"
##  [2] "Number of NA values in each column 0"
##  [3] "Number of NA values in each column 0"
##  [4] "Number of NA values in each column 0"
##  [5] "Number of NA values in each column 0"
##  [6] "Number of NA values in each column 0"
##  [7] "Number of NA values in each column 0"
##  [8] "Number of NA values in each column 0"
##  [9] "Number of NA values in each column 0"
## [10] "Number of NA values in each column 0"
## [11] "Number of NA values in each column 0"
## [12] "Number of NA values in each column 0"
## [13] "Number of NA values in each column 0"
## [14] "Number of NA values in each column 0"
## [15] "Number of NA values in each column 0"
## [16] "Number of NA values in each column 0"
## [17] "Number of NA values in each column 0"
## [18] "Number of NA values in each column 0"

# For better clustering result quality:
# Identify the constant columns
constant_columns <- sapply(open_food_facts, function(x) length(unique(x)) == 1)

# Remove constant columns and create a new dataframe
open_food_facts_cleaned <- open_food_facts[, !constant_columns]
str(open_food_facts_cleaned)

## 'data.frame':    1500 obs. of  17 variables:
##  $ product_name               : num  1165 985 370 932 826 ...
##  $ brands                     : num  797 220 157 385 327 557 340 607 165 880 ...
##  $ manufacturing_places       : num  1 1 1 1 1 1 1 98 1 1 ...
##  $ purchase_places            : num  1 1 1 1 79 1 1 1 1 1 ...
##  $ allergens                  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ traces                     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ additives_n                : num  0 0 0 3 0 7 0 5 1 0 ...
##  $ ingredients_from_palm_oil_n: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ energy.from.fat_100g       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ trans.fat_100g             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ sugars_100g                : num  3.9 29 3.1 0 80 50 30 1.5 2.27 4 ...
##  $ fiber_100g                 : num  5.2 2 0 3.3 0 0 0 2 2.3 0 ...
##  $ proteins_100g              : num  11.69 7 2.8 6.67 0.4 ...
##  $ sodium_100g                : num  0.286 0.15748 0.05512 0.9 0.00236 ...
##  $ fruits.vegetables.nuts_100g: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cocoa_100g                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ nutrition_grade_fr         : num  1 4 4 4 4 4 3 1 2 2 ...

# Columns product_name and brands were pruned for a more focused interpretation on nutrients and geographic locations
open_food_facts_cleaned_trimmed <- open_food_facts_cleaned[, !(colnames(open_food_facts_cleaned) %in% c("product_name", "brands"))]

# Check the trimmed dataset with columns product_name and brands removed
head(open_food_facts_cleaned_trimmed)

##   manufacturing_places purchase_places allergens traces additives_n
## 1                    1               1         1      1           0
## 2                    1               1         1      1           0
## 3                    1               1         1      1           0
## 4                    1               1         1      1           3
## 5                    1              79         1      1           0
## 6                    1               1         1      1           7
##   ingredients_from_palm_oil_n energy.from.fat_100g trans.fat_100g sugars_100g
## 1                           0                    0              0         3.9
## 2                           0                    0              0        29.0
## 3                           0                    0              0         3.1
## 4                           0                    0              0         0.0
## 5                           0                    0              0        80.0
## 6                           0                    0              0        50.0
##   fiber_100g proteins_100g sodium_100g fruits.vegetables.nuts_100g cocoa_100g
## 1        5.2         11.69 0.286000000                           0          0
## 2        2.0          7.00 0.157480315                           0          0
## 3        0.0          2.80 0.055118110                           0          0
## 4        3.3          6.67 0.900000000                           0          0
## 5        0.0          0.40 0.002362205                           0          0
## 6        0.0          5.00 0.012000000                           0          0
##   nutrition_grade_fr
## 1                  1
## 2                  4
## 3                  4
## 4                  4
## 5                  4
## 6                  4

# Have a view on final dataset
str(open_food_facts_cleaned_trimmed)

## 'data.frame':    1500 obs. of  15 variables:
##  $ manufacturing_places       : num  1 1 1 1 1 1 1 98 1 1 ...
##  $ purchase_places            : num  1 1 1 1 79 1 1 1 1 1 ...
##  $ allergens                  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ traces                     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ additives_n                : num  0 0 0 3 0 7 0 5 1 0 ...
##  $ ingredients_from_palm_oil_n: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ energy.from.fat_100g       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ trans.fat_100g             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ sugars_100g                : num  3.9 29 3.1 0 80 50 30 1.5 2.27 4 ...
##  $ fiber_100g                 : num  5.2 2 0 3.3 0 0 0 2 2.3 0 ...
##  $ proteins_100g              : num  11.69 7 2.8 6.67 0.4 ...
##  $ sodium_100g                : num  0.286 0.15748 0.05512 0.9 0.00236 ...
##  $ fruits.vegetables.nuts_100g: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cocoa_100g                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ nutrition_grade_fr         : num  1 4 4 4 4 4 3 1 2 2 ...

# Data scaling
# to keep data originality, I decided not to scale the data
#open_food_facts_scaled <- scale(open_food_facts_cleaned_trimmed)

Clustering

Optimal number of clusters

First of all, the optimal number of clusters are determined using Silhouette Score and Calinski-Harabasz Index.

#Determine number of clusters using Silhouette Score
fviz_nbclust(open_food_facts_cleaned_trimmed, kmeans, method = "silhouette")

# Silhoutte width shows that optimal number of clusters = 2

# Just in case, Calinski-Harabasz Index is used to find out the optimal number of cluster as well
ch_result <- NbClust(open_food_facts_cleaned_trimmed, 
                     distance = "euclidean", 
                     method = "kmeans", 
                     index = "ch")
#print(ch_result)
plot(ch_result$All.index, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Calinski-Harabasz Index",
     main = "Calinski-Harabasz Index for K-Means Clustering")

Calinski-Harabasz Index shows that the most optimal number of clusters is 1, and second optimal number of clusters is 3. Since Silhouette Score shows that the most optimal number of cluster is 2, and clustering is the main objective here, both 2 and 3 clusters are used to cluster the data for comparison.

K-means Clustering

Clustering using K-means with 2 and 3 clusters is carried out.

set.seed(122)
# optimal cluster = 2
kmeans_result1 <- eclust(open_food_facts_cleaned_trimmed, "kmeans", hc_metric = "euclidean", k = 2)

# another option to visualise clustering result for more customisation
fviz_cluster(kmeans_result1, 
             geom = "point",
             show.clust.cent = TRUE,
             ellipse.type = "convex", 
             ggtheme = theme_minimal(), 
             main="k-means with 2 clusters")

set.seed(125)
#optimal cluster = 3
kmeans_result2 <- eclust(open_food_facts_cleaned_trimmed, "kmeans", hc_metric = "euclidean", k = 3)

K-medoids (PAM) Clustering

For comparison, Partition around medoids (PAM) is used to carry out clustering. Clustering with 2 medoids are carried out with the reference of average Silhouette width for K-means Clustering.

pam1 <- eclust(open_food_facts_cleaned_trimmed, "pam", k = 2)

Experiment to investigate clustering by nutrition grade

Since the nutrition grade has 5 distinct categories, i.e. A, B, C, D and E, clustering based on nutrition_grade_fr is carried out to explore the potential of products’ characteristics for each nutritional grade although most nutritional values usually exist in most of the products.

# Check for constant columns (those with zero variance) and remove them if needed
# Perform K-means clustering on the transposed data
set.seed(221)
nutrition_data_transposed <- t(open_food_facts_cleaned_trimmed)
fviz_nbclust(nutrition_data_transposed, kmeans, method = "silhouette")

kmeans_result1_1 <- eclust(nutrition_data_transposed, "kmeans", hc_metric = "euclidean", k = 5)

# kmeans_result1_1$cluster shows the cluster assigned for each feature
kmeans_result1_1$cluster

##        manufacturing_places             purchase_places 
##                           4                           5 
##                   allergens                      traces 
##                           2                           4 
##                 additives_n ingredients_from_palm_oil_n 
##                           3                           3 
##        energy.from.fat_100g              trans.fat_100g 
##                           3                           3 
##                 sugars_100g                  fiber_100g 
##                           1                           3 
##               proteins_100g                 sodium_100g 
##                           3                           3 
## fruits.vegetables.nuts_100g                  cocoa_100g 
##                           3                           3 
##          nutrition_grade_fr 
##                           3

Quality Validation of Clustering

To validate the quality of cluster, we can use Silhouette scores to find out the average Silhouette Width to determine whether the points are well defined for the clusters.

This is also a good reference to find out whether nutrition_grade_fr is redundant as it might have been associated with relevant nutrition components.

# Interpretation of Silhouette Scores:
# Scores near 1: Points are well-clustered.
# Scores near 0: Points are on the cluster boundary.
# Scores near -1: Points are likely to be misclassified.
# Calculate Silhouette Scores for all clustering results

# Silhouette_scores for k-means with 2 clusters
silhouette_scores <- silhouette(kmeans_result1$cluster, dist(open_food_facts_cleaned_trimmed))
# Silhouette_scores for k-means with 3 clusters
silhouette_scores1 <- silhouette(kmeans_result2$cluster, dist(open_food_facts_cleaned_trimmed))
# Silhouette_scores for PAM with 2 clusters
silhouette_scores_pam <- silhouette(pam1$cluster, dist(open_food_facts_cleaned_trimmed))
# Silhouette_scores for clustering by nutrition grade
silhouette_scores_clustering_nutrition_grade <- silhouette(kmeans_result1_1$cluster, dist(t(open_food_facts_cleaned_trimmed)))

# Print Silhouette Scores Summary of k-means with 2 clusters
summary(silhouette_scores)

## Silhouette of 1500 units in 2 clusters from silhouette.default(x = kmeans_result1$cluster, dist = dist(open_food_facts_cleaned_trimmed)) :
##  Cluster sizes and average silhouette widths:
##      1342       158 
## 0.6939548 0.2114234 
## Individual silhouette widths:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.1906  0.5899  0.7568  0.6431  0.7838  0.7974

# Print Silhouette Scores Summary of k-means with 3 clusters
summary(silhouette_scores1)

## Silhouette of 1500 units in 3 clusters from silhouette.default(x = kmeans_result2$cluster, dist = dist(open_food_facts_cleaned_trimmed)) :
##  Cluster sizes and average silhouette widths:
##       167       129      1204 
## 0.1435185 0.2552822 0.6985189 
## Individual silhouette widths:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2361  0.4657  0.7292  0.5986  0.7693  0.7900

# Print Silhouette Scores Summary of PAM with 2 clusters
summary(silhouette_scores_pam)

## Silhouette of 1500 units in 2 clusters from silhouette.default(x = pam1$cluster, dist = dist(open_food_facts_cleaned_trimmed)) :
##  Cluster sizes and average silhouette widths:
##       1247        253 
## 0.73860660 0.03263965 
## Individual silhouette widths:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.4202  0.5706  0.7764  0.6195  0.8080  0.8238

# Print Silhouette Scores Summary of clustering by nutrition grade
summary(silhouette_scores_clustering_nutrition_grade)

## Silhouette of 15 units in 5 clusters from silhouette.default(x = kmeans_result1_1$cluster, dist = dist(t(open_food_facts_cleaned_trimmed))) :
##  Cluster sizes and average silhouette widths:
##          1          1         10          2          1 
##  0.0000000  0.0000000  0.6537860 -0.1485547  0.0000000 
## Individual silhouette widths:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2229  0.0000  0.6369  0.4161  0.7116  0.7521

# Visualize Silhouette Plot for each method
fviz_silhouette(silhouette_scores, main = "Silhouette Plot for K-means with 2 clusters")

##   cluster size ave.sil.width
## 1       1 1342          0.69
## 2       2  158          0.21

fviz_silhouette(silhouette_scores1, main = "Silhouette Plot for K-means with 3 clusters")

##   cluster size ave.sil.width
## 1       1  167          0.14
## 2       2  129          0.26
## 3       3 1204          0.70

fviz_silhouette(silhouette_scores_pam, main = "Silhouette Plot for PAM with 2 clusters")

##   cluster size ave.sil.width
## 1       1 1247          0.74
## 2       2  253          0.03

fviz_silhouette(silhouette_scores_clustering_nutrition_grade, main = "Silhouette Plot for clustering by nutrition grade")

##   cluster size ave.sil.width
## 1       1    1          0.00
## 2       2    1          0.00
## 3       3   10          0.65
## 4       4    2         -0.15
## 5       5    1          0.00

# Average Silhouette Width
avg_sil_width <- mean(silhouette_scores[, "sil_width"])
avg_sil_width1 <- mean(silhouette_scores1[, "sil_width"])
avg_sil_width2 <- mean(silhouette_scores_pam[, "sil_width"])
avg_sil_width3 <- mean(silhouette_scores_clustering_nutrition_grade[, "sil_width"])

print(paste("Average Silhouette Width for K-means with 2 clusters:", round(avg_sil_width, 2)))

## [1] "Average Silhouette Width for K-means with 2 clusters: 0.64"

print(paste("Average Silhouette Width for K-means with 3 clusters:", round(avg_sil_width1, 2)))

## [1] "Average Silhouette Width for K-means with 3 clusters: 0.6"

print(paste("Average Silhouette Width for PAM with 2 clusters:", round(avg_sil_width2, 2)))

## [1] "Average Silhouette Width for PAM with 2 clusters: 0.62"

print(paste("Average Silhouette Width for Silhouette Plot - clustering by nutrition grade:", round(avg_sil_width3, 2)))

## [1] "Average Silhouette Width for Silhouette Plot - clustering by nutrition grade: 0.42"

Since the Average Silhouette Width for both k-means and PAM are relatively closer to 1. It is considered to be relatively well-clustered. However, according to the plotted graph, it was observed that the positive values are represented as tall bars, indicating well-clustered points, whereas the negative values, suggesting that some points are near cluster boundaries. The negative values of the plots of k-means with 2 clusters are the closest to zero among all methods, it could suggest that it has a relatively better clustering performance among all.

It is also observed that the average Silhouette Width for K-means with 2 clusters is higher than the average Silhouette Width for K-means with 3 clusters. Thus, it is inferred that 2 clusters are more suitable for this clustering.

On the other hand, for clustering by nutrition_grade_fr, the Average Silhouette Width is relatively lower and closer to 0. Its dataframe shows that there are 3 clusters with Average Silhouette Width of 0.00. A silhouette width of 0 means that a data point lies equally close to its assigned cluster and the nearest neighboring cluster, which also implies that it is at the boundary between clusters and is not strongly affiliated with its assigned cluster.

Statistics in clustered groups

In earlier parts, Silhouette scores was utilised to find out the average Silhouette Width to determine whether the points are well defined. It is still good to analyst the statistics in clustered groups.

The information of the clusters can be viewed in the clustering vectors to exploit the clustering observation for each clustering.

The available components in the clustering vector are cluster, centers, tots, withinss, tot.withinss, betweenss, size, iter, ifault, clust_plot, silinfo, ngclust and data. In this project, tot.withinss, betweenss and size are exploited to analyse the quality of clustering.

Result Interpretation

K-means clustering result interpretation

For K-means clustering result with 2 clusters:

# Extract relevant metrics
total_withinss <- kmeans_result1$tot.withinss
betweenss <- kmeans_result1$betweenss
cluster_sizes <- kmeans_result1$size

cat("Total Within-Cluster Sum of Squares (Compactness):", total_withinss, "\n")

## Total Within-Cluster Sum of Squares (Compactness): 3276764

cat("Between-Cluster Sum of Squares (Variance):", betweenss, "\n")

## Between-Cluster Sum of Squares (Variance): 1866360

# Calculate the explained variance ratio
# A higher ratio indicates better-defined clusters
explained_variance_ratio <- betweenss / (betweenss + total_withinss)
cat("Explained Variance Ratio:", explained_variance_ratio, "\n")

## Explained Variance Ratio: 0.3628845

# Visualize Cluster Sizes
cluster_size <- data.frame(
  Cluster = factor(seq_along(cluster_sizes)), 
  Size = cluster_sizes
)

# Bar plot of cluster sizes
ggplot(cluster_size, aes(x = Cluster, y = Size, fill = Cluster)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  labs(
    title = "Cluster Sizes",
    x = "Cluster",
    y = "Number of Observations"
  ) +
  theme_minimal(base_size = 14) +
  scale_fill_brewer(palette = "Set2")

# To have a view on the data distribution of kmeans clustering
groupBWplot(open_food_facts_cleaned_trimmed, kmeans_result1$cluster, alpha=0.05)

PAM Clustering result interpretation

For PAM clustering result with k = 2, The nutritional density of the products in each cluster are shown in the table of medoids values. For instance, cluster 1 has the highest protein concentration, highest sodium and fiber per 100g of product, whereas cluster 2 has the highest concentration of sugar per 100g of product and cluster 3 has highest possibilities to find allergens in the product.

# Table of medoids values
print(pam1$medoids)

##      manufacturing_places purchase_places allergens traces additives_n
## [1,]                    1               1         1      1           0
## [2,]                   45              51        59     26           3
##      ingredients_from_palm_oil_n energy.from.fat_100g trans.fat_100g
## [1,]                           0                    0              0
## [2,]                           1                    0              0
##      sugars_100g fiber_100g proteins_100g sodium_100g
## [1,]        7.69        1.9          5.77   0.4620000
## [2,]       28.00        0.0          6.00   0.4606299
##      fruits.vegetables.nuts_100g cocoa_100g nutrition_grade_fr
## [1,]                           0          0                  3
## [2,]                           0          0                  5

# To have a view on the data distribution of PAM clustering
groupBWplot(open_food_facts_cleaned_trimmed, 
            as.factor(pam1$clustering), 
            alpha=0.05,
            outlier.size=6,
            outlier.colour="red",
            line.col="blue",
            box.col="lightblue")

Correlation of features

# Calculate the correlation matrix
cor_matrix <- cor(open_food_facts_cleaned_trimmed, use = "complete.obs", method = "pearson")

# Plot the correlation matrix as a heatmap
corrplot(cor_matrix, method = "color", type = "upper", 
         tl.cex = 0.7, # Text label size
         tl.col = "black", # Text label color
         addCoef.col = "black", # Display correlation coefficient on the heatmap
         number.cex = 0.5, # Size of the correlation coefficient numbers
         col = colorRampPalette(c("blue", "white", "red"))(200))

According to the correlation heatmap, the most obvious result to take note is that manufacturing_places is moderately correlated to purchase_places (0.51); Sugars_100g is moderately correlated to nutrition_grade_fr (0.42) whereas it is negatively correlated to protein_100g (-0.27).

Statistically but not necessarily realistically, it could be inferred that manufacturing_places might affect the purchase_places. On the other hand, a higher sugar content might indicate that the product has higher nutritional grade but lower protein content.

Clustering

Clarice Eng (Kah Hui Eng)

2024-12-23

Open Food Facts Clustering

Introduction

Objectives

Data - Open Food Facts

Clustering

Optimal number of clusters

K-means Clustering

K-medoids (PAM) Clustering

Experiment to investigate clustering by nutrition grade

Quality Validation of Clustering

Statistics in clustered groups

Result Interpretation

K-means clustering result interpretation

PAM Clustering result interpretation

Correlation of features