With so many food choices available, it’s easy to feel overwhelmed - not just by what to eat, but also by the nutritional information that comes with it. When looking at food labels, we are often presented with a long list of nutrients, and it’s not always clear which ones matter most. Additionally, a survey conducted by Michigan State University found that while most Americans are influenced by food labels when grocery shopping, the most commonly used terms on those labels are also the ones that cause the most confusion.
In this project, I want to explore how dimensionality reduction and clustering methods can help simplify and improve the analysis of nutritional data. The goal is to take a complex dataset with 34 different nutrients and find a way to extract the most important patterns, making it easier to categorize and understand different types of food.
The dataset contains 34 numerical attributes representing various aspects of food composition, ranging from macronutrients (fats, carbohydrates, proteins, etc.) to micronutrients (vitamins, minerals, etc.). While these features offer a detailed breakdown of food properties, they also introduce redundancy and complexity, making direct analysis difficult. The details are as follows:
Link to the dataset: https://www.kaggle.com/datasets/utsavdey1410/food-nutrition-dataset
I chose the 5th dataset from the link above since it has the most observations and food categories, e.g. “natural food” vs. “processed food.”
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)
library(dplyr)
library(dbscan)
##
## Attaching package: 'dbscan'
##
## The following object is masked from 'package:stats':
##
## as.dendrogram
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(ggplot2)
library(ggcorrplot)
library(pdp)
##
## Attaching package: 'pdp'
##
## The following object is masked from 'package:purrr':
##
## partial
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(ggpubr)
library(psych)
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(corpcor)
data <- read.csv("FOOD-DATA-GROUP5.csv")
str(data)
## 'data.frame': 722 obs. of 37 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Unnamed..0 : int 0 1 2 3 4 5 6 7 8 9 ...
## $ food : chr "margarine with yoghurt" "sunflower seed butter" "hazelnut oil" "menhaden fish oil" ...
## $ Caloric.Value : int 88 99 120 1966 123 123 120 115 94 59 ...
## $ Fat : num 9.8 8.8 13.6 218 13.6 13.6 13.6 12.8 8.1 6.7 ...
## $ Saturated.Fats : num 1.9 0.7 1 66.3 3.1 4.1 7.2 5 1.1 1.2 ...
## $ Monounsaturated.Fats: num 5.6 6.2 10.6 58.2 6.4 4.6 5.3 5.8 3.1 2.3 ...
## $ Polyunsaturated.Fats: num 2 1.6 1.4 74.5 3.1 4.3 0.5 1.4 3.6 2.8 ...
## $ Carbohydrates : num 0.073 3.7 0 0 0 0 0 0 3.8 0.1 ...
## $ Sugars : num 0 1.7 0 0 0 0 0 0 0 0 ...
## $ Protein : num 0.058 2.8 0 0 0 0 0 0 2.9 0.082 ...
## $ Dietary.Fiber : num 0 0.9 0 0 0 0 0 0 0.9 0 ...
## $ Cholesterol : num 0 0 0 1135.8 77.5 ...
## $ Sodium : num 0.018 0.065 0 0 0 0 0 0 0.031 0.019 ...
## $ Water : num 3.9 0.1 0 0 0 0 0 0 0.3 7 ...
## $ Vitamin.A : num 0.1 0 0 0 4.1 0 0 0 0 0 ...
## $ Vitamin.B1 : num 0.074 0.021 0 0 0 0 0 0 0.004 0 ...
## $ Vitamin.B11 : num 0.086 0.005 0 0 0 0 0 0 0.002 0.067 ...
## $ Vitamin.B12 : num 0.073 0 0 0 0 0 0 0 0 0 ...
## $ Vitamin.B2 : num 0.005 0.075 0 0 0 0 0 0 0.099 0.084 ...
## $ Vitamin.B3 : num 0.097 1.1 0 0 0 0 0 0 1.1 0 ...
## $ Vitamin.B5 : num 0 0.2 0 0 0 0 0 0 0.077 0.051 ...
## $ Vitamin.B6 : num 0.075 0.013 0 0 0 0 0 0 0.1 0.049 ...
## $ Vitamin.C : num 0.082 0.4 0 0 0 0 0 0 0 0 ...
## $ Vitamin.D : num 0 0 0 0 0.004 0.02 0 0.025 0 0 ...
## $ Vitamin.E : num 0.7 3.7 6.4 0 0 0 0.035 0.078 0 0.6 ...
## $ Vitamin.K : num 0.002 0 0 0 0 0 0 0 0 0.073 ...
## $ Calcium : num 2.8 10.2 0 0 0 ...
## $ Copper : num 0.001 0.3 0 0 0 0 0 0 0.7 0.045 ...
## $ Iron : num 0.027 0.7 0 0 0 0 0 0 3.1 0.078 ...
## $ Magnesium : num 0.3 49.8 0 0 0 0 0 0 57.9 0.1 ...
## $ Manganese : num 0 0.3 0 0 0 0 0 0 0.4 0 ...
## $ Phosphorus : num 2.2 106.6 0 0 0 ...
## $ Potassium : num 3.5 92.2 0 0 0 0 0 0 93.1 5 ...
## $ Selenium : num 0 0.075 0 0 0 0 0 0.012 0.088 0 ...
## $ Zinc : num 0.008 0.8 0 0 0 0 0 0.089 1.2 0.054 ...
## $ Nutrition.Density : num 13 27.5 13.6 218 17.7 ...
Before moving on, I want to include a basic description of the numerical variables in the data.
summary(data)
## X Unnamed..0 food Caloric.Value
## Min. : 0.0 Min. : 0.0 Length:722 Min. : 0.00
## 1st Qu.:180.2 1st Qu.:180.2 Class :character 1st Qu.: 24.25
## Median :360.5 Median :360.5 Mode :character Median : 68.50
## Mean :360.5 Mean :360.5 Mean : 124.49
## 3rd Qu.:540.8 3rd Qu.:540.8 3rd Qu.: 123.00
## Max. :721.0 Max. :721.0 Max. :1966.00
## Fat Saturated.Fats Monounsaturated.Fats
## Min. : 0.000 Min. : 0.00000 Min. : 0.000
## 1st Qu.: 0.200 1st Qu.: 0.04325 1st Qu.: 0.039
## Median : 0.700 Median : 0.10000 Median : 0.100
## Mean : 5.592 Mean : 1.82955 Mean : 1.855
## 3rd Qu.: 4.375 3rd Qu.: 0.87500 3rd Qu.: 1.400
## Max. :218.000 Max. :196.20000 Max. :95.900
## Polyunsaturated.Fats Carbohydrates Sugars Protein
## Min. : 0.00000 Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.06025 1st Qu.: 2.025 1st Qu.: 0.000 1st Qu.: 0.30
## Median : 0.20000 Median : 8.150 Median : 0.700 Median : 1.80
## Mean : 1.69686 Mean : 15.770 Mean : 3.097 Mean : 3.52
## 3rd Qu.: 1.00000 3rd Qu.: 19.775 3rd Qu.: 3.000 3rd Qu.: 3.70
## Max. :115.80000 Max. :190.100 Max. :105.000 Max. :67.90
## Dietary.Fiber Cholesterol Sodium Water
## Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. : 0.00
## 1st Qu.: 0.079 1st Qu.: 0.000 1st Qu.: 0.0300 1st Qu.: 3.45
## Median : 0.950 Median : 0.000 Median : 0.0825 Median : 15.15
## Mean : 2.078 Mean : 4.893 Mean : 0.2744 Mean : 61.84
## 3rd Qu.: 2.800 3rd Qu.: 0.000 3rd Qu.: 0.2000 3rd Qu.:109.62
## Max. :50.200 Max. :1135.800 Max. :49.4000 Max. :690.40
## Vitamin.A Vitamin.B1 Vitamin.B11 Vitamin.B12
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.01400 1st Qu.:0.00000
## Median :0.00000 Median :0.03100 Median :0.04800 Median :0.00000
## Mean :0.02584 Mean :0.10181 Mean :0.05967 Mean :0.02022
## 3rd Qu.:0.00000 3rd Qu.:0.09575 3rd Qu.:0.08200 3rd Qu.:0.03575
## Max. :4.10000 Max. :6.30000 Max. :0.70000 Max. :0.30000
## Vitamin.B2 Vitamin.B3 Vitamin.B5 Vitamin.B6
## Min. :0.00000 Min. : 0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.: 0.04725 1st Qu.:0.02925 1st Qu.:0.0230
## Median :0.04150 Median : 0.20000 Median :0.10000 Median :0.0795
## Mean :0.08128 Mean : 0.96785 Mean :0.40663 Mean :0.1790
## 3rd Qu.:0.09600 3rd Qu.: 1.00000 3rd Qu.:0.50000 3rd Qu.:0.2000
## Max. :1.60000 Max. :16.60000 Max. :8.00000 Max. :5.2000
## Vitamin.C Vitamin.D Vitamin.E Vitamin.K
## Min. : 0.0000 Min. : 0.00 Min. : 0.0000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.00 1st Qu.: 0.0000 1st Qu.:0.0000
## Median : 0.0955 Median : 0.00 Median : 0.0110 Median :0.0310
## Mean : 8.8780 Mean : 3.04 Mean : 0.4996 Mean :0.1808
## 3rd Qu.: 2.6750 3rd Qu.: 0.00 3rd Qu.: 0.2750 3rd Qu.:0.0810
## Max. :348.4000 Max. :181.70 Max. :41.6000 Max. :6.1000
## Calcium Copper Iron Magnesium
## Min. : 0.000 Min. : 0.0000 Min. : 0.00000 Min. : 0.00
## 1st Qu.: 0.074 1st Qu.: 0.0322 1st Qu.: 0.06925 1st Qu.: 0.70
## Median : 8.150 Median : 0.0890 Median : 0.40000 Median : 5.85
## Mean : 37.684 Mean : 9.7540 Mean : 1.01508 Mean : 20.92
## 3rd Qu.: 35.225 3rd Qu.: 0.5000 3rd Qu.: 1.10000 3rd Qu.: 23.20
## Max. :868.000 Max. :348.0000 Max. :29.20000 Max. :520.80
## Manganese Phosphorus Potassium Selenium
## Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.042 1st Qu.: 0.30 1st Qu.: 17.48 1st Qu.: 0.018
## Median : 0.200 Median : 14.15 Median : 63.00 Median : 0.058
## Mean : 5.474 Mean : 53.20 Mean : 200.92 Mean : 63.613
## 3rd Qu.: 0.900 3rd Qu.: 58.15 3rd Qu.: 191.35 3rd Qu.: 0.096
## Max. :156.600 Max. :1309.40 Max. :4053.00 Max. :1309.500
## Zinc Nutrition.Density
## Min. :0.0000 Min. : 0.10
## 1st Qu.:0.0350 1st Qu.: 13.60
## Median :0.0995 Median : 35.30
## Mean :0.3707 Mean : 74.56
## 3rd Qu.:0.4000 3rd Qu.: 83.45
## Max. :9.1000 Max. :918.80
Looking at the dataset, I noticed that it consists of 722 food items with a wide range of nutritional attributes, many of which vary significantly in scale. Some features, like Caloric.Value, Potassium, and Phosphorus, have extremely high values, while others, such as Vitamin.B12 and Selenium, have many zero entries. There are also redundant index columns (X, Unnamed..0) and a categorical food column that I need to remove before performing PCA or clustering. Many features, especially macronutrients like Fat, Saturated.Fats, and Carbohydrates, are heavily skewed, while others, such as vitamins and minerals, have sparse distributions with many zeros.
Given this, I need to standardize the dataset to ensure that no single variable dominates the analysis. Additionally, some variables with excessive zero values may not contribute meaningfully to dimensionality reduction, so filtering may be necessary. With such a high variance across features, PCA should be useful in reducing dimensionality while preserving key patterns.
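As a quick sketch of how that sparsity could be checked (an illustrative check on the raw data frame loaded above, not part of the main pipeline), the share of zero entries per numeric column can be tabulated:
# Illustrative check: proportion of zero entries in each numeric column
zero_share <- sapply(data[, sapply(data, is.numeric)], function(x) mean(x == 0, na.rm = TRUE))
head(sort(zero_share, decreasing = TRUE), 10)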
Since the data has unnecessary columns, I have to remove them, along with any empty observations.
data <- data[,-c(1,2)]
data <- na.omit(data)
Checking the data’s suitability for dimensionality reduction
dim(data)
## [1] 722 35
The data has 35 columns and is suited for dimensionality reduction. Before moving on, I want to check the class of each variable (feature) in the data, since most dimensionality reduction methods only work with numeric data types.
sapply(data, class)
## food Caloric.Value Fat
## "character" "integer" "numeric"
## Saturated.Fats Monounsaturated.Fats Polyunsaturated.Fats
## "numeric" "numeric" "numeric"
## Carbohydrates Sugars Protein
## "numeric" "numeric" "numeric"
## Dietary.Fiber Cholesterol Sodium
## "numeric" "numeric" "numeric"
## Water Vitamin.A Vitamin.B1
## "numeric" "numeric" "numeric"
## Vitamin.B11 Vitamin.B12 Vitamin.B2
## "numeric" "numeric" "numeric"
## Vitamin.B3 Vitamin.B5 Vitamin.B6
## "numeric" "numeric" "numeric"
## Vitamin.C Vitamin.D Vitamin.E
## "numeric" "numeric" "numeric"
## Vitamin.K Calcium Copper
## "numeric" "numeric" "numeric"
## Iron Magnesium Manganese
## "numeric" "numeric" "numeric"
## Phosphorus Potassium Selenium
## "numeric" "numeric" "numeric"
## Zinc Nutrition.Density
## "numeric" "numeric"
data_dimrec_numeric <- data[, sapply(data, is.numeric)]
scaled_data <- scale(data_dimrec_numeric)
I conducted a correlation matrix analysis to assess the relationships between variables. The reason for this step is that PCA works by transforming correlated variables into new uncorrelated principal components. If the dataset consists mostly of uncorrelated variables, PCA may not be effective because it relies on capturing shared variance. By computing the correlation matrix, I was able to identify which variables were highly correlated and which were not.
cor_matrix <- cor(scaled_data)
ggcorrplot(cor_matrix,
hc.order = FALSE,
type = "upper",
lab = F,
lab_size = 0.00005,
colors = c("blue", "white", "red")) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, face = "plain"),
axis.text.y = element_text(face = "plain"))
Within this dataset, we can see that some features are highly correlated, which creates redundancy:
- Fat, Saturated Fats, Monounsaturated Fats, and Polyunsaturated Fats show strong correlations.
- Carbohydrates, Sugars, and Caloric Value show strong correlations.
- Multiple B vitamins (e.g., B1, B2, B3, B5, B6, B12) also appear moderately correlated.
Thus we can apply dimensionality reduction to compress these features while maintaining variability, keeping the highly correlated variables that PCA can exploit. Since PCA benefits from strong correlations, I filtered out weakly correlated variables, keeping only those with at least one absolute correlation above 0.5 with another variable. Starting from 34 variables, this leaves 26, which are better suited for dimensionality reduction.
cor_matrix_abs <- abs(cor_matrix)
high_cor_vars <- colnames(cor_matrix)[apply(cor_matrix_abs, 2, function(x) any(x > 0.5 & x < 1))]
filtered_data <- scaled_data[, high_cor_vars]
filtered_data_ds <- as.data.frame(filtered_data)
head(filtered_data_ds,5)
## Caloric.Value Fat Saturated.Fats Monounsaturated.Fats
## 1 -0.167287868 0.2119669 0.006326839 0.6412986
## 2 -0.116858951 0.1615928 -0.101440260 0.7440570
## 3 -0.020585564 0.4033884 -0.074498485 1.4976184
## 4 8.442303583 10.6998513 5.789827813 9.6497826
## 5 -0.006832223 0.4033884 0.114093938 0.7783098
## Polyunsaturated.Fats Carbohydrates Protein Dietary.Fiber Vitamin.B1
## 1 0.04734267 -0.6793511 -0.5316697 -0.5927538 -0.09012576
## 2 -0.01512735 -0.5223814 -0.1105874 -0.3359922 -0.26191058
## 3 -0.04636236 -0.6825104 -0.5405766 -0.5927538 -0.32997627
## 4 11.37003390 -0.6825104 -0.5405766 -0.5927538 -0.32997627
## 5 0.21913523 -0.6825104 -0.5405766 -0.5927538 -0.32997627
## Vitamin.B12 Vitamin.B2 Vitamin.B3 Vitamin.B5 Vitamin.B6 Vitamin.D
## 1 1.4811222 -0.57111824 -0.45271556 -0.5387004 -0.2933482 -0.2297057
## 2 -0.5675873 -0.04699862 0.06870105 -0.2737411 -0.4683080 -0.2297057
## 3 -0.5675873 -0.60855536 -0.50314170 -0.5387004 -0.5049931 -0.2297057
## 4 -0.5675873 -0.60855536 -0.50314170 -0.5387004 -0.5049931 -0.2297057
## 5 -0.5675873 -0.60855536 -0.50314170 -0.5387004 -0.5049931 -0.2294034
## Vitamin.K Calcium Copper Iron Magnesium Manganese Phosphorus
## 1 -0.2828768 -0.4152301 -0.2861763 -0.4819763 -0.4457247 -0.3552616 -0.4646120
## 2 -0.2860402 -0.3271462 -0.2774029 -0.1536941 0.6241230 -0.3357924 0.4864845
## 3 -0.2860402 -0.4485592 -0.2862057 -0.4951466 -0.4522086 -0.3552616 -0.4846543
## 4 -0.2860402 -0.4485592 -0.2862057 -0.4951466 -0.4522086 -0.3552616 -0.4846543
## 5 -0.2860402 -0.4485592 -0.2862057 -0.4951466 -0.4522086 -0.3552616 -0.4846543
## Potassium Selenium Zinc Nutrition.Density
## 1 -0.4618446 -0.3498440 -0.4680411 -0.5450870
## 2 -0.2543378 -0.3494315 0.5538686 -0.4165060
## 3 -0.4700326 -0.3498440 -0.4783634 -0.5395204
## 4 -0.4700326 -0.3498440 -0.4783634 1.2694098
## 5 -0.4700326 -0.3498440 -0.4783634 -0.5032356
Before proceeding with Principal Component Analysis (PCA), I needed to ensure that my dataset met the fundamental assumptions required for effective dimensionality reduction. Specifically, I conducted two key tests: the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy and Bartlett’s test of sphericity.
kmo_result <- KMO(filtered_data_ds)
print(kmo_result)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = filtered_data_ds)
## Overall MSA = 0.75
## MSA for each item =
## Caloric.Value Fat Saturated.Fats
## 0.55 0.55 0.37
## Monounsaturated.Fats Polyunsaturated.Fats Carbohydrates
## 0.47 0.45 0.47
## Protein Dietary.Fiber Vitamin.B1
## 0.64 0.93 0.92
## Vitamin.B12 Vitamin.B2 Vitamin.B3
## 0.88 0.96 0.90
## Vitamin.B5 Vitamin.B6 Vitamin.D
## 0.86 0.85 0.86
## Vitamin.K Calcium Copper
## 0.90 0.78 0.84
## Iron Magnesium Manganese
## 0.96 0.91 0.76
## Phosphorus Potassium Selenium
## 0.93 0.91 0.82
## Zinc Nutrition.Density
## 0.94 0.84
Initially, I found that many variables had weak correlations, which explained why my KMO test result was low (0.52). Since PCA benefits from strong correlations, I filtered out weakly correlated variables using a threshold, improving the KMO score to 0.75.
This step was crucial because keeping uncorrelated variables in PCA would have resulted in components that do not effectively capture underlying patterns in the data. By refining the dataset before PCA, I ensured that the dimensionality reduction process would extract meaningful structure rather than just noise.
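As a sketch of how that pre-filtering adequacy score can be reproduced (the original code for this step is not shown above), KMO can be run on the correlation matrix of the full scaled data computed earlier:
# Sketch: overall KMO on all scaled variables, before filtering out weak correlations
kmo_before <- KMO(cor_matrix)
kmo_before$MSA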
bartlett_result <- cortest.bartlett(cor_matrix, n = nrow(filtered_data_ds))
# Print results
print(bartlett_result)
## $chisq
## [1] 32002.29
##
## $p.value
## [1] 0
##
## $df
## [1] 561
The results from Bartlett’s test of sphericity confirm that applying PCA to my dataset makes sense. The chi-square statistic is extremely high (32,002.29), and the p-value is effectively zero, meaning I can confidently reject the null hypothesis that my correlation matrix is an identity matrix. With 561 degrees of freedom, this test shows that my variables have significant correlations and share common variance. This reassures me that dimensionality reduction is appropriate, as my dataset contains enough interdependencies for PCA to extract meaningful components.
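For reference, Bartlett’s statistic is computed from the determinant of the correlation matrix \(R\) of \(p\) variables and \(n\) observations, and the degrees of freedom follow as \(p(p-1)/2 = 34 \cdot 33/2 = 561\), matching the output above:
\[
\chi^2 = -\left(n - 1 - \frac{2p + 5}{6}\right)\ln\lvert R\rvert, \qquad df = \frac{p(p-1)}{2}
\]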
Principal Component Analysis (PCA) is a technique I used to reduce the dimensionality of my dataset while retaining as much meaningful information as possible. The main idea behind PCA is to transform my original variables into a new set of uncorrelated features called principal components (PCs). These components are ordered based on how much variance they explain in the data—PC1 captures the most variance, followed by PC2, PC3, and so on. This allows me to simplify complex datasets by focusing on just the most important patterns while removing redundancy and noise. By applying PCA, I transformed my high-dimensional nutritional data into a smaller set of components that still preserve its essential structure, making it easier to interpret and use for clustering.
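Formally, since the data are standardized, the principal components are the eigenvectors \(v_k\) of the correlation matrix \(R\), and the share of variance explained by component \(k\) comes directly from its eigenvalue \(\lambda_k\):
\[
R\,v_k = \lambda_k v_k, \qquad \text{Proportion of variance}_k = \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j}
\]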
pca_result <- prcomp(filtered_data_ds, center = TRUE, scale. = TRUE)
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.9261 2.0650 1.8304 1.13363 0.98648 0.88570 0.87531
## Proportion of Variance 0.3293 0.1640 0.1288 0.04943 0.03743 0.03017 0.02947
## Cumulative Proportion 0.3293 0.4933 0.6222 0.67159 0.70902 0.73919 0.76866
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.86302 0.83849 0.82717 0.76680 0.68339 0.66218 0.64984
## Proportion of Variance 0.02865 0.02704 0.02632 0.02261 0.01796 0.01686 0.01624
## Cumulative Proportion 0.79731 0.82435 0.85066 0.87328 0.89124 0.90811 0.92435
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.58475 0.55983 0.53442 0.51444 0.46262 0.39339 0.36877
## Proportion of Variance 0.01315 0.01205 0.01098 0.01018 0.00823 0.00595 0.00523
## Cumulative Proportion 0.93750 0.94955 0.96054 0.97072 0.97895 0.98490 0.99013
## PC22 PC23 PC24 PC25 PC26
## Standard deviation 0.3640 0.28237 0.1764 0.11107 0.03027
## Proportion of Variance 0.0051 0.00307 0.0012 0.00047 0.00004
## Cumulative Proportion 0.9952 0.99829 0.9995 0.99996 1.00000
fviz_eig(pca_result, addlabels = TRUE, barfill = "lightblue", barcolor = "black", linecolor = "black") +
theme_minimal()
Looking at my scree plot, PC1 explains 32.9% of the variance, followed by PC2 at 16.4% and PC3 at 12.9%. Together, these three components account for about 62.2% of the total variance. The variance explained drops significantly from PC4 onward, with PC4 contributing 4.9% and PC5 at 3.7%, after which the values stagnate around 3% or lower.
Given that the elbow point is clearly around PC4–PC5, I need to balance information retention with dimensionality reduction. If I stop at PC5, I capture approximately 70.9% of the total variance, and extending to PC6 would give me 73.9%. Since the additional variance gain beyond PC6 is minimal, I will retain the first six principal components for my analysis.
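The same cut-off can be checked programmatically from the PCA object; the short sketch below (not part of the original code) recomputes the cumulative variance and finds the smallest number of components reaching 70%:
# Sketch: cumulative variance explained and the 70% cut-off
explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
cum_explained <- cumsum(explained)
which(cum_explained >= 0.70)[1]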
pca_loadings <- as.data.frame(pca_result$rotation[, 1:6])
head(pca_loadings, 5)
## PC1 PC2 PC3 PC4 PC5
## Caloric.Value 0.2208772 0.002060511 -0.3975456 0.08046347 -0.042938509
## Fat 0.1195779 -0.052698448 -0.5006284 -0.03586913 -0.001046917
## Saturated.Fats 0.0582518 -0.055483996 -0.4257407 -0.06485705 -0.090640532
## Monounsaturated.Fats 0.1069255 -0.045589478 -0.4351612 -0.05041065 0.010850083
## Polyunsaturated.Fats 0.1325871 -0.030829280 -0.3463100 0.03689165 0.127029123
## PC6
## Caloric.Value 0.090318664
## Fat 0.004988606
## Saturated.Fats 0.061151746
## Monounsaturated.Fats -0.062504993
## Polyunsaturated.Fats -0.026785184
library(factoextra)
# Contribution of variables to PC1
fviz_contrib(pca_result, choice = "var", axes = 1, top = 10)
# Contribution of variables to PC2
fviz_contrib(pca_result, choice = "var", axes = 2, top = 10)
# Contribution of variables to PC3
fviz_contrib(pca_result, choice = "var", axes = 3, top = 10)
# Contribution of variables to PC4
fviz_contrib(pca_result, choice = "var", axes = 4, top = 10)
# Contribution of variables to PC5
fviz_contrib(pca_result, choice = "var", axes = 5, top = 10)
# Contribution of variables to PC6
fviz_contrib(pca_result, choice = "var", axes = 6, top = 10)
In my dataset, the first two principal components (PC1 and PC2) account for the largest variance, indicating that the primary way the data naturally separates is based on mineral content and nutrient density. PC3 and PC4 capture differences in fat composition and carbohydrate-rich sources, effectively distinguishing energy-dense foods. Meanwhile, PC5 and PC6 focus on trace minerals and vitamins, further differentiating food types based on their micronutrient profiles. Given this breakdown, retaining the first six principal components ensures that I preserve the most meaningful variation while reducing dimensionality.
pca_var <- get_pca_var(pca_result)
fviz_contrib(pca_result, "var", axes = 1:6, fill = "lightblue", color = "lightblue2")
The chart shown above illustrates the cumulative contributions of individual variables across the first six principal components (PCs). The red dashed line marks the threshold for significant contributions, helping me identify which variables play the most influential role in structuring my dataset. Fat and Caloric Value emerge as the top contributors, confirming that energy density is a primary factor in distinguishing different data points. Calcium, Nutrition Density, and Phosphorus also play a key role, reinforcing the importance of mineral content in food classification. Additionally, Zinc, Magnesium, and Vitamin B3 contribute substantially, indicating that micronutrient composition is another major driver of variation.
Other notable contributors include Selenium, Manganese, and Protein, suggesting that both trace minerals and macronutrient composition significantly shape the dataset’s structure. Monounsaturated and Saturated Fats appear relevant but are secondary compared to total fat content. Meanwhile, Carbohydrates and Polyunsaturated Fats contribute below the threshold but still provide some differentiation. The lowest-ranked contributors among the top variables are Vitamin B12, Vitamin D, and Vitamin K, confirming that while vitamins play a role, their variability is less influential than macronutrients and minerals.
This analysis validates my decision to retain six principal components, as they effectively capture the key sources of variation in my dataset. Now we have a clearer understanding that energy density, mineral content, and protein composition are the dominant factors shaping the structure of my data.
fviz_pca_var(pca_result,
col.var = "lightblue",
repel = TRUE,
title = "PCA Variable Factor Map",
axes = c(1,2))
Looking at the axes, PC1 (32.9%) explains most of the variance, while PC2 (16.4%) captures additional differentiation. Together, these two components account for 49.3% of the total variation in my dataset, which is quite substantial. The variables clustering in similar directions suggest strong correlations, meaning that foods high in one nutrient are likely to be high in others from the same group. For example, Phosphorus, Iron, Magnesium, and Zinc contribute heavily to PC1, while Selenium, Manganese, and Copper are more aligned with PC2. This tells me that PC1 is mainly driven by macronutrients like fat and protein, while PC2 captures the variation from micronutrients like vitamins and minerals.
I also notice that certain variables oppose each other, implying negative correlations. For instance, fat-related variables (Saturated, Monounsaturated, and Polyunsaturated Fats) pull in a different direction compared to mineral-heavy variables like Selenium and Copper. This makes sense because high-fat foods generally don’t have the same nutritional profile as mineral-rich foods.
Overall, this plot confirms that the structure of my dataset is largely driven by macronutrient content on PC1 and micronutrient density on PC2. It also justifies why I retained six principal components—since they capture distinct and meaningful patterns in the data.
reduced_data_six <- as.data.frame(pca_result$x[, 1:6])  # keep the six retained principal components
head(reduced_data_six,5)
## PC1 PC2 PC3 PC4 PC5
## 1 -1.6275821953 -0.5667021 -0.7801081 0.06324648 0.57860765
## 2 -0.0009151555 -1.0881589 -0.1820802 -0.17120608 0.18404617
## 3 -1.5215270322 -1.2799019 -1.1831586 -0.13679361 -0.03798116
## 4 4.8119110872 -2.8891079 -19.6927146 -0.80080015 0.46508117
## 5 -1.5390768256 -1.2663423 -1.0477090 -0.11402869 -0.03229059
I’m using the Hopkins statistic as a quick check of clustering tendency before proceeding to clustering: values close to 1 indicate highly clusterable data, while values near 0.5 suggest essentially random structure, so the 0.94 obtained below supports moving ahead with clustering.
tendency <- get_clust_tendency(reduced_data_six, n = nrow(reduced_data_six) * 0.1)
cat("Hopkins statistic:", tendency$hopkins_stat, "\n")
## Hopkins statistic: 0.9440615
set.seed(123)
show_elbow<-fviz_nbclust(reduced_data_six, kmeans, method = "wss") +ggtitle("Elbow Method")
show_silhouette<- fviz_nbclust(reduced_data_six, kmeans, method = "silhouette") +ggtitle("Silhouette Method")
grid.arrange(show_elbow,show_silhouette, ncol=2, top="Optimal Number of Clusters")
To determine the optimal number of clusters for my dataset, I used both the Elbow Method and the Silhouette Method, as shown in the figure above. These two approaches provide complementary insights: the Elbow Method identifies the point where adding more clusters yields diminishing returns in variance reduction, while the Silhouette Method assesses how well-separated the clusters are.
From the Elbow Method, I observed a clear inflection point around k = 3 or k = 4, meaning that beyond this point, the reduction in within-cluster variance slows down significantly. Meanwhile, the Silhouette Score peaks at k = 2, indicating the best cluster separation at that level. However, considering the trade-off between separation and meaningful grouping, k = 3 emerges as the best choice. At k = 3, the Silhouette Score remains relatively high while ensuring a reasonable division of my data into distinct groups.
With this justification, I decided to proceed with K-Means Clustering using k = 3 to segment my dataset effectively. If the results are unsatisfactory or show overlapping clusters, I may explore Hierarchical Clustering as an alternative approach.
set.seed(123)
# Run K-Means for different k values
kmeans_k3 <- kmeans(reduced_data_six, centers = 3)
kmeans_k4 <- kmeans(reduced_data_six, centers = 4)
kmeans_k5 <- kmeans(reduced_data_six, centers = 5)
reduced_data_six$Cluster_k3 <- as.factor(kmeans_k3$cluster)
reduced_data_six$Cluster_k4 <- as.factor(kmeans_k4$cluster)
reduced_data_six$Cluster_k5 <- as.factor(kmeans_k5$cluster)
hull_data <- reduced_data_six %>%
group_by(Cluster_k3) %>%
slice(chull(PC1, PC2))
p1 <- ggplot(reduced_data_six, aes(x = PC1, y = PC2, color = Cluster_k3, shape = Cluster_k3)) +
  geom_point(size = 3) +
  geom_polygon(data = hull_data, aes(x = PC1, y = PC2, fill = Cluster_k3, group = Cluster_k3), alpha = 0.3) +
  theme_minimal() +
  labs(title = "K-Means Clustering Result (k = 3)", color = "Cluster_k3", fill = "Cluster_k3")
hull_data <- reduced_data_six %>%
group_by(Cluster_k4) %>%
slice(chull(PC1, PC2))
p2 <- ggplot(reduced_data_six, aes(x = PC1, y = PC2, color = Cluster_k4, shape = Cluster_k4)) +
  geom_point(size = 3) +
  geom_polygon(data = hull_data, aes(x = PC1, y = PC2, fill = Cluster_k4, group = Cluster_k4), alpha = 0.3) +
  theme_minimal() +
  labs(title = "K-Means Clustering Result (k = 4)", color = "Cluster_k4", fill = "Cluster_k4")
hull_data <- reduced_data_six %>%
group_by(Cluster_k5) %>%
slice(chull(PC1, PC2))
p3 <- ggplot(reduced_data_six, aes(x = PC1, y = PC2, color = Cluster_k5, shape = Cluster_k5)) +
  geom_point(size = 3) +
  geom_polygon(data = hull_data, aes(x = PC1, y = PC2, fill = Cluster_k5, group = Cluster_k5), alpha = 0.3) +
  theme_minimal() +
  labs(title = "K-Means Clustering Result (k = 5)", color = "Cluster_k5", fill = "Cluster_k5")
ggarrange(p1, p2, p3, ncol = 2, nrow = 2)
sil_k3 <- silhouette(kmeans_k3$cluster, dist(reduced_data_six))
sil_k4 <- silhouette(kmeans_k4$cluster, dist(reduced_data_six))
sil_k5 <- silhouette(kmeans_k5$cluster, dist(reduced_data_six))
p_sil3 <- fviz_silhouette(sil_k3) + ggtitle("Silhouette Plot (K=3)")
## cluster size ave.sil.width
## 1 1 76 0.44
## 2 2 601 0.60
## 3 3 45 0.08
p_sil4 <- fviz_silhouette(sil_k4) + ggtitle("Silhouette Plot (K=4)")
## cluster size ave.sil.width
## 1 1 150 0.42
## 2 2 77 0.41
## 3 3 33 0.01
## 4 4 462 0.72
p_sil5 <- fviz_silhouette(sil_k5) + ggtitle("Silhouette Plot (K=5)")
## cluster size ave.sil.width
## 1 1 149 0.42
## 2 2 33 0.01
## 3 3 449 0.72
## 4 4 73 0.40
## 5 5 18 0.12
p_sil3
p_sil4
p_sil5
To evaluate the clustering quality, I generated silhouette plots for different values of \(K\) (3, 4, and 5), following the results of the elbow and silhouette plots and my own intuition. The silhouette width measures how well-separated the clusters are, with higher values indicating better-defined clusters.
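As a reminder, the silhouette width of an observation \(i\) compares its average distance to members of its own cluster, \(a(i)\), with its average distance to the nearest other cluster, \(b(i)\):
\[
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
\]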
For \(K=3\), the silhouette plot showed a reasonably high average silhouette width, suggesting that the clusters were fairly well-formed. However, upon increasing \(K\) to 4 and 5, the results did not significantly improve, and in some cases, new clusters had noticeably lower silhouette scores. This suggests that while some finer-grained separation is possible, it may not contribute meaningfully to better clustering.
Given these findings, I will proceed with hierarchical clustering as an alternative approach. Unlike K-Means, hierarchical clustering does not require pre-specifying the number of clusters and allows for a more flexible visualization of how data points group together at different levels. This method will help validate whether \(K=3\) remains the most reasonable choice or if another structure emerges naturally from the data.
dist_matrix <- dist(reduced_data_six, method = "euclidean")
hclust_result <- hclust(dist_matrix, method = "ward.D2")
fviz_dend(hclust_result, k = 3, rect = TRUE, rect_border = "jco", rect_fill = TRUE)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
clusters_hierarchical <- cutree(hclust_result, k = 3)
The dendrogram shown above helps visualize how hierarchical clustering groups the dataset based on similarity. The height represents the dissimilarity between clusters, and the colored segments indicate potential natural cluster separations.
From the visualization:
- Four distinct clusters emerge, suggesting K=4 might be a valid choice.
- The blue and green-yellow-ish clusters are more compact, whereas the purple cluster is more distinct.
- The yellow cluster is smaller, indicating a niche subgroup.
By cutting the dendrogram at an appropriate height, I can choose K=3 or K=4 as the final number of clusters. This structure aligns well with the previous silhouette results, confirming that hierarchical clustering captures meaningful patterns in the dataset.
To evaluate the quality of hierarchical clustering, I first plotted the clusters in the PC1 vs PC2 space. If clusters are well-separated, this suggests the method captured meaningful structure. However, if clusters overlap, it might indicate a need for re-evaluating K or the clustering method.
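The code for that scatter plot is not included above, so the following is a minimal sketch of how it could be drawn from the objects already defined (pca_result and clusters_hierarchical):
# Sketch: hierarchical cluster assignments in PC1-PC2 space
hier_plot_df <- data.frame(PC1 = pca_result$x[, 1],
                           PC2 = pca_result$x[, 2],
                           Cluster = factor(clusters_hierarchical))
ggplot(hier_plot_df, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(size = 2, alpha = 0.7) +
  theme_minimal() +
  labs(title = "Hierarchical Clusters in PC1-PC2 Space")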
silhouette_hierarchical <- silhouette(clusters_hierarchical, dist(reduced_data_six))
fviz_silhouette(silhouette_hierarchical) +
labs(title = "Silhouette Plot for Hierarchical Clustering")
## cluster size ave.sil.width
## 1 1 462 0.77
## 2 2 183 0.08
## 3 3 77 0.41
## Representation for Each Cluster
reduced_data_six$Hierarchical_Cluster <- as.factor(clusters_hierarchical)
cluster_summary <- aggregate(filtered_data_ds[, -c(1,2)],
by = list(Cluster = reduced_data_six$Hierarchical_Cluster),
FUN = mean)
head(cluster_summary,5)
## Cluster Saturated.Fats Monounsaturated.Fats Polyunsaturated.Fats
## 1 1 -0.08859568 -0.1080424 -0.1369963
## 2 2 0.28093272 0.3568432 0.4055340
## 3 3 -0.13609718 -0.1998276 -0.1418239
## Carbohydrates Protein Dietary.Fiber Vitamin.B1 Vitamin.B12 Vitamin.B2
## 1 -0.3734764 -0.35272956 -0.3897078 -0.2032879 -0.1424281 -0.2909557
## 2 0.8492512 0.85143384 0.7131916 0.6401944 -0.2458422 0.9858157
## 3 0.2225081 0.09283981 0.6432588 -0.3017734 1.4388432 -0.5971784
## Vitamin.B3 Vitamin.B5 Vitamin.B6 Vitamin.D Vitamin.K Calcium Copper
## 1 -0.3122326 -0.3292908 -0.3072257 -0.1811763 -0.1444004 -0.2853551 -0.1953250
## 2 0.9709229 0.1797736 0.2543536 -0.2295636 -0.1996869 0.9087797 -0.2768837
## 3 -0.4341222 1.5484906 1.2388516 1.6326443 1.3409828 -0.4476963 1.8299984
## Iron Magnesium Manganese Phosphorus Potassium Selenium Zinc
## 1 -0.3106060 -0.2796763 -0.2605709 -0.3115312 -0.3034565 -0.2630288 -0.2775306
## 2 0.9447645 0.8818620 -0.3132085 0.9886213 0.8745299 -0.3496061 0.8753835
## 3 -0.3817134 -0.4177961 2.3078039 -0.4803936 -0.2576889 2.4090546 -0.4152732
## Nutrition.Density
## 1 -0.3913460
## 2 1.1466233
## 3 -0.3770159
From the clustering results, I can identify three distinct food groups based on their nutrient composition (mean standardized values in the table above):
- Cluster 1 sits below average on nearly every nutrient, representing foods with low overall nutrient density.
- Cluster 2 is above average in carbohydrates, protein, fiber, most B vitamins, and major minerals such as calcium, iron, magnesium, phosphorus, potassium, and zinc, making it the most nutrient-dense group.
- Cluster 3 stands out for vitamins B5, B6, B12, D, and K together with trace minerals like copper, manganese, and selenium, so it is differentiated mainly by its micronutrient profile.
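To sanity-check these labels, a few example foods from each group can be pulled from the retained food column (a sketch; output omitted here):
# Sketch: a few example foods from each hierarchical cluster
data %>%
  mutate(Cluster = factor(clusters_hierarchical)) %>%
  group_by(Cluster) %>%
  slice_head(n = 5) %>%
  select(Cluster, food)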
These results confirm that PCA successfully reduced the complexity of the dataset while clustering uncovered meaningful groups that reflect different dietary patterns.
In this project, I explored how dimensionality reduction and clustering methods can simplify and improve the analysis of nutritional data. By applying Principal Component Analysis (PCA), I successfully reduced the dataset from 34 dimensions to 6, capturing the most important patterns while filtering out redundancy. Clustering using hierarchical methods then revealed three meaningful food categories, making it easier to understand the structure within the dataset.
These clusters provide a data-driven way to classify foods, offering a simpler basis for analyzing dietary patterns than working with all 34 nutrient values at once.
This project confirmed that nutritional datasets exhibit natural structure, and unsupervised learning methods are powerful tools for uncovering meaningful insights.
Michigan State University. (2018, October). Americans pay attention to food labels, but are confused by what information matters. https://www.canr.msu.edu/news/americans-pay-attention-to-food-labels-but-are-confused-by-what-information-matters
National Institutes of Health. (2021). Macronutrients overview. National Center for Biotechnology Information, U.S. National Library of Medicine. https://www.ncbi.nlm.nih.gov/books/NBK594226/
MD Anderson Cancer Center. (n.d.). What are macronutrients? https://www.mdanderson.org/publications/focused-on-health/what-are-macronutrients-.h15-1593780.html