Open Food Facts is a freely accessible database of food products that contains comprehensive details about the ingredients, allergens, nutritional facts and all the information that can be found on product labels. It is a non-profit association of volunteers, which is said to be made by everyone, for everyone. It serves as reference for individuals to compare and modify their existing food choices in a more health-conscious manner. This is because it helps people to make more informed food choices by making the knowledge on food accessible as easy as by just scanning the food label using their app, food additives labels, allergens, etc. can be decoded, to find a certain product that match your criteria. Additionally, the nutritional quality of each product is evaluated using a grading system comprising the letters A, B, C, D and E. It should be noted that only a limited number of items are currently labelled with their respective nutritional grades on their packaging in stores.
The data from Open Food Facts is available on the official website (https://world.openfoodfacts.org/). However, the original dataset is too large to process (10 GB), hence in this project, the TSV (tab-separated values) file from Open Food Facts Kaggle is used. This file can be accessed via the following link: https://www.kaggle.com/datasets/openfoodfacts/world-food-facts.
As a consumer myself, I am very concerned with the food choices I made. I will always read the ingredient list and try to understand or even translate the ingredients which made up this product when necessary. This is to ensure I do not consume undesired and unnecessary additives, and to have the food ingredients as close to raw food as possible depending on the products.
Objective 1: Use clustering to explore the relationship between nutritional attributes and consumer choices. Objective 2: Analyze geographic trends and validate the nutritional quality in food products.
Let’s have a look at the summary of the data:
To clarify, this dataset has data from France so the nutrition_grade_fr is nutrition grade of products in France.
## product_name brands manufacturing_places
## 1 Roasted & Salted Pistachios Paramount Farms
## 2 Panettone Coop
## 3 Creme legere semi epaisse Carrefour
## 4 Organic Cheese Crackers Full Circle
## 5 Miel de Montagne Eric Bur
## 6 Gummies, Peach Rings Kroger
## purchase_places allergens traces additives_n ingredients_from_palm_oil_n
## 1 0 0
## 2 NA NA
## 3 0 0
## 4 3 0
## 5 Lyon,France 0 0
## 6 7 0
## energy.from.fat_100g trans.fat_100g sugars_100g fiber_100g proteins_100g
## 1 NA 0 3.9 5.2 11.69
## 2 NA NA 29.0 2.0 7.00
## 3 NA NA 3.1 NA 2.80
## 4 NA 0 0.0 3.3 6.67
## 5 NA NA 80.0 0.0 0.40
## 6 NA 0 50.0 0.0 5.00
## sodium_100g alcohol_100g fruits.vegetables.nuts_100g cocoa_100g
## 1 0.286000000 NA NA NA
## 2 0.157480315 NA NA NA
## 3 0.055118110 NA NA NA
## 4 0.900000000 NA NA NA
## 5 0.002362205 NA NA NA
## 6 0.012000000 NA NA NA
## nutrition_grade_fr
## 1 a
## 2 d
## 3 d
## 4 d
## 5 d
## 6 d
## 'data.frame': 1500 obs. of 18 variables:
## $ product_name : chr "Roasted & Salted Pistachios" "Panettone" "Creme legere semi epaisse" "Organic Cheese Crackers" ...
## $ brands : chr "Paramount Farms" "Coop" "Carrefour" "Full Circle" ...
## $ manufacturing_places : chr "" "" "" "" ...
## $ purchase_places : chr "" "" "" "" ...
## $ allergens : chr "" "" "" "" ...
## $ traces : chr "" "" "" "" ...
## $ additives_n : num 0 NA 0 3 0 7 0 5 1 NA ...
## $ ingredients_from_palm_oil_n: num 0 NA 0 0 0 0 0 0 0 NA ...
## $ energy.from.fat_100g : num NA NA NA NA NA NA NA NA NA NA ...
## $ trans.fat_100g : num 0 NA NA 0 NA 0 NA NA 0 NA ...
## $ sugars_100g : num 3.9 29 3.1 0 80 50 30 1.5 2.27 4 ...
## $ fiber_100g : num 5.2 2 NA 3.3 0 0 NA 2 2.3 NA ...
## $ proteins_100g : num 11.69 7 2.8 6.67 0.4 ...
## $ sodium_100g : num 0.286 0.15748 0.05512 0.9 0.00236 ...
## $ alcohol_100g : num NA NA NA NA NA NA NA 0 NA NA ...
## $ fruits.vegetables.nuts_100g: num NA NA NA NA NA NA NA NA NA NA ...
## $ cocoa_100g : num NA NA NA NA NA NA NA NA NA NA ...
## $ nutrition_grade_fr : chr "a" "d" "d" "d" ...
## product_name brands manufacturing_places purchase_places
## Length:1500 Length:1500 Length:1500 Length:1500
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## allergens traces additives_n
## Length:1500 Length:1500 Min. : 0.000
## Class :character Class :character 1st Qu.: 0.000
## Mode :character Mode :character Median : 1.000
## Mean : 2.041
## 3rd Qu.: 3.000
## Max. :22.000
## NA's :146
## ingredients_from_palm_oil_n energy.from.fat_100g trans.fat_100g
## Min. :0.00000 Min. : 0.0 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.: 65.0 1st Qu.:0.0000
## Median :0.00000 Median :130.0 Median :0.0000
## Mean :0.03323 Mean :193.3 Mean :0.0251
## 3rd Qu.:0.00000 3rd Qu.:290.0 3rd Qu.:0.0000
## Max. :2.00000 Max. :450.0 Max. :4.5500
## NA's :146 NA's :1497 NA's :746
## sugars_100g fiber_100g proteins_100g sodium_100g
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 1.23 1st Qu.: 0.00 1st Qu.: 1.820 1st Qu.: 0.0330
## Median : 5.00 Median : 1.40 Median : 5.710 Median : 0.2362
## Mean : 14.92 Mean : 2.79 Mean : 7.834 Mean : 0.4476
## 3rd Qu.: 23.00 3rd Qu.: 3.60 3rd Qu.:11.000 3rd Qu.: 0.5118
## Max. :100.00 Max. :51.60 Max. :55.000 Max. :23.7500
## NA's :1 NA's :254 NA's :1 NA's :1
## alcohol_100g fruits.vegetables.nuts_100g cocoa_100g nutrition_grade_fr
## Min. :0 Min. : 0.00 Min. :28.00 Length:1500
## 1st Qu.:0 1st Qu.: 7.75 1st Qu.:29.50 Class :character
## Median :0 Median : 45.00 Median :38.50 Mode :character
## Mean :0 Mean : 42.19 Mean :43.75
## 3rd Qu.:0 3rd Qu.: 53.75 3rd Qu.:52.75
## Max. :0 Max. :100.00 Max. :70.00
## NA's :1489 NA's :1486 NA's :1496
## [1] 1500 18
# Get the total number of rows in the dataset
print(paste("Total number of rows: ", nrow(open_food_facts)))## [1] "Total number of rows: 1500"
# Count rows that have at least one NA value
print(paste("Number of empty rows: ", sum(!complete.cases(open_food_facts))))## [1] "Number of empty rows: 1500"
# Count the number of NA values in each column
print(paste("Number of NA values in each column", colSums(is.na(open_food_facts))))## [1] "Number of NA values in each column 0"
## [2] "Number of NA values in each column 0"
## [3] "Number of NA values in each column 0"
## [4] "Number of NA values in each column 0"
## [5] "Number of NA values in each column 0"
## [6] "Number of NA values in each column 0"
## [7] "Number of NA values in each column 146"
## [8] "Number of NA values in each column 146"
## [9] "Number of NA values in each column 1497"
## [10] "Number of NA values in each column 746"
## [11] "Number of NA values in each column 1"
## [12] "Number of NA values in each column 254"
## [13] "Number of NA values in each column 1"
## [14] "Number of NA values in each column 1"
## [15] "Number of NA values in each column 1489"
## [16] "Number of NA values in each column 1486"
## [17] "Number of NA values in each column 1496"
## [18] "Number of NA values in each column 0"
# Data Factorization for non-numeric column
open_food_facts <- open_food_facts %>%
mutate(across(where(is.character), as.factor))
# type = factor, with level = "a" "b" "c" "d" "e"
levels(open_food_facts$nutrition_grade_fr)## [1] "a" "b" "c" "d" "e"
# Convert factor columns to numeric (label encoding)
factor_columns <- sapply(open_food_facts, is.factor)
open_food_facts[factor_columns] <- lapply(open_food_facts[factor_columns], function(x) as.numeric(x))
# nutrition_grade_fr, "a" = 1 and so on
# Replace NA with 0 in numeric columns
open_food_facts[sapply(open_food_facts, is.numeric)] <- lapply(open_food_facts[sapply(open_food_facts, is.numeric)], function(x) {
ifelse(is.na(x), 0, x)
})
str(open_food_facts)## 'data.frame': 1500 obs. of 18 variables:
## $ product_name : num 1165 985 370 932 826 ...
## $ brands : num 797 220 157 385 327 557 340 607 165 880 ...
## $ manufacturing_places : num 1 1 1 1 1 1 1 98 1 1 ...
## $ purchase_places : num 1 1 1 1 79 1 1 1 1 1 ...
## $ allergens : num 1 1 1 1 1 1 1 1 1 1 ...
## $ traces : num 1 1 1 1 1 1 1 1 1 1 ...
## $ additives_n : num 0 0 0 3 0 7 0 5 1 0 ...
## $ ingredients_from_palm_oil_n: num 0 0 0 0 0 0 0 0 0 0 ...
## $ energy.from.fat_100g : num 0 0 0 0 0 0 0 0 0 0 ...
## $ trans.fat_100g : num 0 0 0 0 0 0 0 0 0 0 ...
## $ sugars_100g : num 3.9 29 3.1 0 80 50 30 1.5 2.27 4 ...
## $ fiber_100g : num 5.2 2 0 3.3 0 0 0 2 2.3 0 ...
## $ proteins_100g : num 11.69 7 2.8 6.67 0.4 ...
## $ sodium_100g : num 0.286 0.15748 0.05512 0.9 0.00236 ...
## $ alcohol_100g : num 0 0 0 0 0 0 0 0 0 0 ...
## $ fruits.vegetables.nuts_100g: num 0 0 0 0 0 0 0 0 0 0 ...
## $ cocoa_100g : num 0 0 0 0 0 0 0 0 0 0 ...
## $ nutrition_grade_fr : num 1 4 4 4 4 4 3 1 2 2 ...
# All columns are of type numeric now now
# Verify that no column has NA
print(paste("Number of NA values in each column", colSums(is.na(open_food_facts))))## [1] "Number of NA values in each column 0"
## [2] "Number of NA values in each column 0"
## [3] "Number of NA values in each column 0"
## [4] "Number of NA values in each column 0"
## [5] "Number of NA values in each column 0"
## [6] "Number of NA values in each column 0"
## [7] "Number of NA values in each column 0"
## [8] "Number of NA values in each column 0"
## [9] "Number of NA values in each column 0"
## [10] "Number of NA values in each column 0"
## [11] "Number of NA values in each column 0"
## [12] "Number of NA values in each column 0"
## [13] "Number of NA values in each column 0"
## [14] "Number of NA values in each column 0"
## [15] "Number of NA values in each column 0"
## [16] "Number of NA values in each column 0"
## [17] "Number of NA values in each column 0"
## [18] "Number of NA values in each column 0"
# For better clustering result quality:
# Identify the constant columns
constant_columns <- sapply(open_food_facts, function(x) length(unique(x)) == 1)
# Remove constant columns and create a new dataframe
open_food_facts_cleaned <- open_food_facts[, !constant_columns]
str(open_food_facts_cleaned)## 'data.frame': 1500 obs. of 17 variables:
## $ product_name : num 1165 985 370 932 826 ...
## $ brands : num 797 220 157 385 327 557 340 607 165 880 ...
## $ manufacturing_places : num 1 1 1 1 1 1 1 98 1 1 ...
## $ purchase_places : num 1 1 1 1 79 1 1 1 1 1 ...
## $ allergens : num 1 1 1 1 1 1 1 1 1 1 ...
## $ traces : num 1 1 1 1 1 1 1 1 1 1 ...
## $ additives_n : num 0 0 0 3 0 7 0 5 1 0 ...
## $ ingredients_from_palm_oil_n: num 0 0 0 0 0 0 0 0 0 0 ...
## $ energy.from.fat_100g : num 0 0 0 0 0 0 0 0 0 0 ...
## $ trans.fat_100g : num 0 0 0 0 0 0 0 0 0 0 ...
## $ sugars_100g : num 3.9 29 3.1 0 80 50 30 1.5 2.27 4 ...
## $ fiber_100g : num 5.2 2 0 3.3 0 0 0 2 2.3 0 ...
## $ proteins_100g : num 11.69 7 2.8 6.67 0.4 ...
## $ sodium_100g : num 0.286 0.15748 0.05512 0.9 0.00236 ...
## $ fruits.vegetables.nuts_100g: num 0 0 0 0 0 0 0 0 0 0 ...
## $ cocoa_100g : num 0 0 0 0 0 0 0 0 0 0 ...
## $ nutrition_grade_fr : num 1 4 4 4 4 4 3 1 2 2 ...
# Columns product_name and brands were pruned for a more focused interpretation on nutrients and geographic locations
open_food_facts_cleaned_trimmed <- open_food_facts_cleaned[, !(colnames(open_food_facts_cleaned) %in% c("product_name", "brands"))]
# Check the trimmed dataset with columns product_name and brands removed
head(open_food_facts_cleaned_trimmed)## manufacturing_places purchase_places allergens traces additives_n
## 1 1 1 1 1 0
## 2 1 1 1 1 0
## 3 1 1 1 1 0
## 4 1 1 1 1 3
## 5 1 79 1 1 0
## 6 1 1 1 1 7
## ingredients_from_palm_oil_n energy.from.fat_100g trans.fat_100g sugars_100g
## 1 0 0 0 3.9
## 2 0 0 0 29.0
## 3 0 0 0 3.1
## 4 0 0 0 0.0
## 5 0 0 0 80.0
## 6 0 0 0 50.0
## fiber_100g proteins_100g sodium_100g fruits.vegetables.nuts_100g cocoa_100g
## 1 5.2 11.69 0.286000000 0 0
## 2 2.0 7.00 0.157480315 0 0
## 3 0.0 2.80 0.055118110 0 0
## 4 3.3 6.67 0.900000000 0 0
## 5 0.0 0.40 0.002362205 0 0
## 6 0.0 5.00 0.012000000 0 0
## nutrition_grade_fr
## 1 1
## 2 4
## 3 4
## 4 4
## 5 4
## 6 4
## 'data.frame': 1500 obs. of 15 variables:
## $ manufacturing_places : num 1 1 1 1 1 1 1 98 1 1 ...
## $ purchase_places : num 1 1 1 1 79 1 1 1 1 1 ...
## $ allergens : num 1 1 1 1 1 1 1 1 1 1 ...
## $ traces : num 1 1 1 1 1 1 1 1 1 1 ...
## $ additives_n : num 0 0 0 3 0 7 0 5 1 0 ...
## $ ingredients_from_palm_oil_n: num 0 0 0 0 0 0 0 0 0 0 ...
## $ energy.from.fat_100g : num 0 0 0 0 0 0 0 0 0 0 ...
## $ trans.fat_100g : num 0 0 0 0 0 0 0 0 0 0 ...
## $ sugars_100g : num 3.9 29 3.1 0 80 50 30 1.5 2.27 4 ...
## $ fiber_100g : num 5.2 2 0 3.3 0 0 0 2 2.3 0 ...
## $ proteins_100g : num 11.69 7 2.8 6.67 0.4 ...
## $ sodium_100g : num 0.286 0.15748 0.05512 0.9 0.00236 ...
## $ fruits.vegetables.nuts_100g: num 0 0 0 0 0 0 0 0 0 0 ...
## $ cocoa_100g : num 0 0 0 0 0 0 0 0 0 0 ...
## $ nutrition_grade_fr : num 1 4 4 4 4 4 3 1 2 2 ...
First of all, the optimal number of clusters are determined using Silhouette Score and Calinski-Harabasz Index.
#Determine number of clusters using Silhouette Score
fviz_nbclust(open_food_facts_cleaned_trimmed, kmeans, method = "silhouette")# Silhoutte width shows that optimal number of clusters = 2
# Just in case, Calinski-Harabasz Index is used to find out the optimal number of cluster as well
ch_result <- NbClust(open_food_facts_cleaned_trimmed,
distance = "euclidean",
method = "kmeans",
index = "ch")
#print(ch_result)
plot(ch_result$All.index, type = "b",
xlab = "Number of Clusters",
ylab = "Calinski-Harabasz Index",
main = "Calinski-Harabasz Index for K-Means Clustering")Calinski-Harabasz Index shows that the most optimal number of clusters is 1, and second optimal number of clusters is 3. Since Silhouette Score shows that the most optimal number of cluster is 2, and clustering is the main objective here, both 2 and 3 clusters are used to cluster the data for comparison.
Clustering using K-means with 2 and 3 clusters is carried out.
set.seed(122)
# optimal cluster = 2
kmeans_result1 <- eclust(open_food_facts_cleaned_trimmed, "kmeans", hc_metric = "euclidean", k = 2)# another option to visualise clustering result for more customisation
fviz_cluster(kmeans_result1,
geom = "point",
show.clust.cent = TRUE,
ellipse.type = "convex",
ggtheme = theme_minimal(),
main="k-means with 2 clusters")set.seed(125)
#optimal cluster = 3
kmeans_result2 <- eclust(open_food_facts_cleaned_trimmed, "kmeans", hc_metric = "euclidean", k = 3)For comparison, Partition around medoids (PAM) is used to carry out clustering. Clustering with 2 medoids are carried out with the reference of average Silhouette width for K-means Clustering.
Since the nutrition grade has 5 distinct categories, i.e. A, B, C, D and E, clustering based on nutrition_grade_fr is carried out to explore the potential of products’ characteristics for each nutritional grade although most nutritional values usually exist in most of the products.
# Check for constant columns (those with zero variance) and remove them if needed
# Perform K-means clustering on the transposed data
set.seed(221)
nutrition_data_transposed <- t(open_food_facts_cleaned_trimmed)
fviz_nbclust(nutrition_data_transposed, kmeans, method = "silhouette")## manufacturing_places purchase_places
## 4 5
## allergens traces
## 2 4
## additives_n ingredients_from_palm_oil_n
## 3 3
## energy.from.fat_100g trans.fat_100g
## 3 3
## sugars_100g fiber_100g
## 1 3
## proteins_100g sodium_100g
## 3 3
## fruits.vegetables.nuts_100g cocoa_100g
## 3 3
## nutrition_grade_fr
## 3
To validate the quality of cluster, we can use Silhouette scores to find out the average Silhouette Width to determine whether the points are well defined for the clusters.
This is also a good reference to find out whether nutrition_grade_fr is redundant as it might have been associated with relevant nutrition components.
# Interpretation of Silhouette Scores:
# Scores near 1: Points are well-clustered.
# Scores near 0: Points are on the cluster boundary.
# Scores near -1: Points are likely to be misclassified.
# Calculate Silhouette Scores for all clustering results
# Silhouette_scores for k-means with 2 clusters
silhouette_scores <- silhouette(kmeans_result1$cluster, dist(open_food_facts_cleaned_trimmed))
# Silhouette_scores for k-means with 3 clusters
silhouette_scores1 <- silhouette(kmeans_result2$cluster, dist(open_food_facts_cleaned_trimmed))
# Silhouette_scores for PAM with 2 clusters
silhouette_scores_pam <- silhouette(pam1$cluster, dist(open_food_facts_cleaned_trimmed))
# Silhouette_scores for clustering by nutrition grade
silhouette_scores_clustering_nutrition_grade <- silhouette(kmeans_result1_1$cluster, dist(t(open_food_facts_cleaned_trimmed)))
# Print Silhouette Scores Summary of k-means with 2 clusters
summary(silhouette_scores)## Silhouette of 1500 units in 2 clusters from silhouette.default(x = kmeans_result1$cluster, dist = dist(open_food_facts_cleaned_trimmed)) :
## Cluster sizes and average silhouette widths:
## 1342 158
## 0.6939548 0.2114234
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1906 0.5899 0.7568 0.6431 0.7838 0.7974
## Silhouette of 1500 units in 3 clusters from silhouette.default(x = kmeans_result2$cluster, dist = dist(open_food_facts_cleaned_trimmed)) :
## Cluster sizes and average silhouette widths:
## 167 129 1204
## 0.1435185 0.2552822 0.6985189
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2361 0.4657 0.7292 0.5986 0.7693 0.7900
## Silhouette of 1500 units in 2 clusters from silhouette.default(x = pam1$cluster, dist = dist(open_food_facts_cleaned_trimmed)) :
## Cluster sizes and average silhouette widths:
## 1247 253
## 0.73860660 0.03263965
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.4202 0.5706 0.7764 0.6195 0.8080 0.8238
# Print Silhouette Scores Summary of clustering by nutrition grade
summary(silhouette_scores_clustering_nutrition_grade)## Silhouette of 15 units in 5 clusters from silhouette.default(x = kmeans_result1_1$cluster, dist = dist(t(open_food_facts_cleaned_trimmed))) :
## Cluster sizes and average silhouette widths:
## 1 1 10 2 1
## 0.0000000 0.0000000 0.6537860 -0.1485547 0.0000000
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2229 0.0000 0.6369 0.4161 0.7116 0.7521
# Visualize Silhouette Plot for each method
fviz_silhouette(silhouette_scores, main = "Silhouette Plot for K-means with 2 clusters")## cluster size ave.sil.width
## 1 1 1342 0.69
## 2 2 158 0.21
## cluster size ave.sil.width
## 1 1 167 0.14
## 2 2 129 0.26
## 3 3 1204 0.70
## cluster size ave.sil.width
## 1 1 1247 0.74
## 2 2 253 0.03
fviz_silhouette(silhouette_scores_clustering_nutrition_grade, main = "Silhouette Plot for clustering by nutrition grade")## cluster size ave.sil.width
## 1 1 1 0.00
## 2 2 1 0.00
## 3 3 10 0.65
## 4 4 2 -0.15
## 5 5 1 0.00
# Average Silhouette Width
avg_sil_width <- mean(silhouette_scores[, "sil_width"])
avg_sil_width1 <- mean(silhouette_scores1[, "sil_width"])
avg_sil_width2 <- mean(silhouette_scores_pam[, "sil_width"])
avg_sil_width3 <- mean(silhouette_scores_clustering_nutrition_grade[, "sil_width"])
print(paste("Average Silhouette Width for K-means with 2 clusters:", round(avg_sil_width, 2)))## [1] "Average Silhouette Width for K-means with 2 clusters: 0.64"
## [1] "Average Silhouette Width for K-means with 3 clusters: 0.6"
## [1] "Average Silhouette Width for PAM with 2 clusters: 0.62"
print(paste("Average Silhouette Width for Silhouette Plot - clustering by nutrition grade:", round(avg_sil_width3, 2)))## [1] "Average Silhouette Width for Silhouette Plot - clustering by nutrition grade: 0.42"
Since the Average Silhouette Width for both k-means and PAM are relatively closer to 1. It is considered to be relatively well-clustered. However, according to the plotted graph, it was observed that the positive values are represented as tall bars, indicating well-clustered points, whereas the negative values, suggesting that some points are near cluster boundaries. The negative values of the plots of k-means with 2 clusters are the closest to zero among all methods, it could suggest that it has a relatively better clustering performance among all.
It is also observed that the average Silhouette Width for K-means with 2 clusters is higher than the average Silhouette Width for K-means with 3 clusters. Thus, it is inferred that 2 clusters are more suitable for this clustering.
On the other hand, for clustering by nutrition_grade_fr, the Average Silhouette Width is relatively lower and closer to 0. Its dataframe shows that there are 3 clusters with Average Silhouette Width of 0.00. A silhouette width of 0 means that a data point lies equally close to its assigned cluster and the nearest neighboring cluster, which also implies that it is at the boundary between clusters and is not strongly affiliated with its assigned cluster.
In earlier parts, Silhouette scores was utilised to find out the average Silhouette Width to determine whether the points are well defined. It is still good to analyst the statistics in clustered groups.
The information of the clusters can be viewed in the clustering vectors to exploit the clustering observation for each clustering.
The available components in the clustering vector are cluster, centers, tots, withinss, tot.withinss, betweenss, size, iter, ifault, clust_plot, silinfo, ngclust and data. In this project, tot.withinss, betweenss and size are exploited to analyse the quality of clustering.
For K-means clustering result with 2 clusters:
# Extract relevant metrics
total_withinss <- kmeans_result1$tot.withinss
betweenss <- kmeans_result1$betweenss
cluster_sizes <- kmeans_result1$size
cat("Total Within-Cluster Sum of Squares (Compactness):", total_withinss, "\n")## Total Within-Cluster Sum of Squares (Compactness): 3276764
## Between-Cluster Sum of Squares (Variance): 1866360
# Calculate the explained variance ratio
# A higher ratio indicates better-defined clusters
explained_variance_ratio <- betweenss / (betweenss + total_withinss)
cat("Explained Variance Ratio:", explained_variance_ratio, "\n")## Explained Variance Ratio: 0.3628845
# Visualize Cluster Sizes
cluster_size <- data.frame(
Cluster = factor(seq_along(cluster_sizes)),
Size = cluster_sizes
)
# Bar plot of cluster sizes
ggplot(cluster_size, aes(x = Cluster, y = Size, fill = Cluster)) +
geom_bar(stat = "identity", show.legend = FALSE) +
labs(
title = "Cluster Sizes",
x = "Cluster",
y = "Number of Observations"
) +
theme_minimal(base_size = 14) +
scale_fill_brewer(palette = "Set2")# To have a view on the data distribution of kmeans clustering
groupBWplot(open_food_facts_cleaned_trimmed, kmeans_result1$cluster, alpha=0.05)For PAM clustering result with k = 2, The nutritional density of the products in each cluster are shown in the table of medoids values. For instance, cluster 1 has the highest protein concentration, highest sodium and fiber per 100g of product, whereas cluster 2 has the highest concentration of sugar per 100g of product and cluster 3 has highest possibilities to find allergens in the product.
## manufacturing_places purchase_places allergens traces additives_n
## [1,] 1 1 1 1 0
## [2,] 45 51 59 26 3
## ingredients_from_palm_oil_n energy.from.fat_100g trans.fat_100g
## [1,] 0 0 0
## [2,] 1 0 0
## sugars_100g fiber_100g proteins_100g sodium_100g
## [1,] 7.69 1.9 5.77 0.4620000
## [2,] 28.00 0.0 6.00 0.4606299
## fruits.vegetables.nuts_100g cocoa_100g nutrition_grade_fr
## [1,] 0 0 3
## [2,] 0 0 5
# To have a view on the data distribution of PAM clustering
groupBWplot(open_food_facts_cleaned_trimmed,
as.factor(pam1$clustering),
alpha=0.05,
outlier.size=6,
outlier.colour="red",
line.col="blue",
box.col="lightblue")# Calculate the correlation matrix
cor_matrix <- cor(open_food_facts_cleaned_trimmed, use = "complete.obs", method = "pearson")
# Plot the correlation matrix as a heatmap
corrplot(cor_matrix, method = "color", type = "upper",
tl.cex = 0.7, # Text label size
tl.col = "black", # Text label color
addCoef.col = "black", # Display correlation coefficient on the heatmap
number.cex = 0.5, # Size of the correlation coefficient numbers
col = colorRampPalette(c("blue", "white", "red"))(200))According to the correlation heatmap, the most obvious result to take note is that manufacturing_places is moderately correlated to purchase_places (0.51); Sugars_100g is moderately correlated to nutrition_grade_fr (0.42) whereas it is negatively correlated to protein_100g (-0.27).
Statistically but not necessarily realistically, it could be inferred that manufacturing_places might affect the purchase_places. On the other hand, a higher sugar content might indicate that the product has higher nutritional grade but lower protein content.