With so many food choices available, it’s easy to feel overwhelmed - not just by what to eat, but also by the nutritional information that comes with it. When looking at food labels, we are often presented with a long list of nutrients, and it’s not always clear which ones matter most. Additionally, a survey conducted by Michigan State University found that while most Americans are influenced by food labels when grocery shopping, the most commonly used terms on those labels are also the ones that cause the most confusion.
In this project, I want to explore how dimensionality reduction and clustering methods can help simplify and improve the analysis of nutritional data. The goal is to take a complex dataset with 34 different nutrients and find a way to extract the most important patterns, making it easier to categorize and understand different types of food.
The dataset contains 34 numerical attributes representing various aspects of food composition, ranging from macronutrients (fats, carbohydrates, proteins, etc.) to micronutrients (vitamins, minerals, etc.). While these features offer a detailed breakdown of food properties, they also introduce redundancy and complexity, making direct analysis difficult. The details are as follows:
Link to the dataset: https://www.kaggle.com/datasets/utsavdey1410/food-nutrition-dataset
I chose the 5th dataset from the link above since it has the most observations and food categories, e.g. “natural food” vs. “processed food.”
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)
library(dplyr)
library(dbscan)
##
## Attaching package: 'dbscan'
##
## The following object is masked from 'package:stats':
##
## as.dendrogram
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(ggplot2)
library(ggcorrplot)
library(pdp)
##
## Attaching package: 'pdp'
##
## The following object is masked from 'package:purrr':
##
## partial
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(ggpubr)
library(psych)
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(corpcor)
data <- read.csv("FOOD-DATA-GROUP5.csv")
str(data)
## 'data.frame': 722 obs. of 37 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Unnamed..0 : int 0 1 2 3 4 5 6 7 8 9 ...
## $ food : chr "margarine with yoghurt" "sunflower seed butter" "hazelnut oil" "menhaden fish oil" ...
## $ Caloric.Value : int 88 99 120 1966 123 123 120 115 94 59 ...
## $ Fat : num 9.8 8.8 13.6 218 13.6 13.6 13.6 12.8 8.1 6.7 ...
## $ Saturated.Fats : num 1.9 0.7 1 66.3 3.1 4.1 7.2 5 1.1 1.2 ...
## $ Monounsaturated.Fats: num 5.6 6.2 10.6 58.2 6.4 4.6 5.3 5.8 3.1 2.3 ...
## $ Polyunsaturated.Fats: num 2 1.6 1.4 74.5 3.1 4.3 0.5 1.4 3.6 2.8 ...
## $ Carbohydrates : num 0.073 3.7 0 0 0 0 0 0 3.8 0.1 ...
## $ Sugars : num 0 1.7 0 0 0 0 0 0 0 0 ...
## $ Protein : num 0.058 2.8 0 0 0 0 0 0 2.9 0.082 ...
## $ Dietary.Fiber : num 0 0.9 0 0 0 0 0 0 0.9 0 ...
## $ Cholesterol : num 0 0 0 1135.8 77.5 ...
## $ Sodium : num 0.018 0.065 0 0 0 0 0 0 0.031 0.019 ...
## $ Water : num 3.9 0.1 0 0 0 0 0 0 0.3 7 ...
## $ Vitamin.A : num 0.1 0 0 0 4.1 0 0 0 0 0 ...
## $ Vitamin.B1 : num 0.074 0.021 0 0 0 0 0 0 0.004 0 ...
## $ Vitamin.B11 : num 0.086 0.005 0 0 0 0 0 0 0.002 0.067 ...
## $ Vitamin.B12 : num 0.073 0 0 0 0 0 0 0 0 0 ...
## $ Vitamin.B2 : num 0.005 0.075 0 0 0 0 0 0 0.099 0.084 ...
## $ Vitamin.B3 : num 0.097 1.1 0 0 0 0 0 0 1.1 0 ...
## $ Vitamin.B5 : num 0 0.2 0 0 0 0 0 0 0.077 0.051 ...
## $ Vitamin.B6 : num 0.075 0.013 0 0 0 0 0 0 0.1 0.049 ...
## $ Vitamin.C : num 0.082 0.4 0 0 0 0 0 0 0 0 ...
## $ Vitamin.D : num 0 0 0 0 0.004 0.02 0 0.025 0 0 ...
## $ Vitamin.E : num 0.7 3.7 6.4 0 0 0 0.035 0.078 0 0.6 ...
## $ Vitamin.K : num 0.002 0 0 0 0 0 0 0 0 0.073 ...
## $ Calcium : num 2.8 10.2 0 0 0 ...
## $ Copper : num 0.001 0.3 0 0 0 0 0 0 0.7 0.045 ...
## $ Iron : num 0.027 0.7 0 0 0 0 0 0 3.1 0.078 ...
## $ Magnesium : num 0.3 49.8 0 0 0 0 0 0 57.9 0.1 ...
## $ Manganese : num 0 0.3 0 0 0 0 0 0 0.4 0 ...
## $ Phosphorus : num 2.2 106.6 0 0 0 ...
## $ Potassium : num 3.5 92.2 0 0 0 0 0 0 93.1 5 ...
## $ Selenium : num 0 0.075 0 0 0 0 0 0.012 0.088 0 ...
## $ Zinc : num 0.008 0.8 0 0 0 0 0 0.089 1.2 0.054 ...
## $ Nutrition.Density : num 13 27.5 13.6 218 17.7 ...
Before moving on, I want to include a basic description of the numerical variables in the data.
summary(data)
## X Unnamed..0 food Caloric.Value
## Min. : 0.0 Min. : 0.0 Length:722 Min. : 0.00
## 1st Qu.:180.2 1st Qu.:180.2 Class :character 1st Qu.: 24.25
## Median :360.5 Median :360.5 Mode :character Median : 68.50
## Mean :360.5 Mean :360.5 Mean : 124.49
## 3rd Qu.:540.8 3rd Qu.:540.8 3rd Qu.: 123.00
## Max. :721.0 Max. :721.0 Max. :1966.00
## Fat Saturated.Fats Monounsaturated.Fats
## Min. : 0.000 Min. : 0.00000 Min. : 0.000
## 1st Qu.: 0.200 1st Qu.: 0.04325 1st Qu.: 0.039
## Median : 0.700 Median : 0.10000 Median : 0.100
## Mean : 5.592 Mean : 1.82955 Mean : 1.855
## 3rd Qu.: 4.375 3rd Qu.: 0.87500 3rd Qu.: 1.400
## Max. :218.000 Max. :196.20000 Max. :95.900
## Polyunsaturated.Fats Carbohydrates Sugars Protein
## Min. : 0.00000 Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.06025 1st Qu.: 2.025 1st Qu.: 0.000 1st Qu.: 0.30
## Median : 0.20000 Median : 8.150 Median : 0.700 Median : 1.80
## Mean : 1.69686 Mean : 15.770 Mean : 3.097 Mean : 3.52
## 3rd Qu.: 1.00000 3rd Qu.: 19.775 3rd Qu.: 3.000 3rd Qu.: 3.70
## Max. :115.80000 Max. :190.100 Max. :105.000 Max. :67.90
## Dietary.Fiber Cholesterol Sodium Water
## Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. : 0.00
## 1st Qu.: 0.079 1st Qu.: 0.000 1st Qu.: 0.0300 1st Qu.: 3.45
## Median : 0.950 Median : 0.000 Median : 0.0825 Median : 15.15
## Mean : 2.078 Mean : 4.893 Mean : 0.2744 Mean : 61.84
## 3rd Qu.: 2.800 3rd Qu.: 0.000 3rd Qu.: 0.2000 3rd Qu.:109.62
## Max. :50.200 Max. :1135.800 Max. :49.4000 Max. :690.40
## Vitamin.A Vitamin.B1 Vitamin.B11 Vitamin.B12
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.01400 1st Qu.:0.00000
## Median :0.00000 Median :0.03100 Median :0.04800 Median :0.00000
## Mean :0.02584 Mean :0.10181 Mean :0.05967 Mean :0.02022
## 3rd Qu.:0.00000 3rd Qu.:0.09575 3rd Qu.:0.08200 3rd Qu.:0.03575
## Max. :4.10000 Max. :6.30000 Max. :0.70000 Max. :0.30000
## Vitamin.B2 Vitamin.B3 Vitamin.B5 Vitamin.B6
## Min. :0.00000 Min. : 0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.: 0.04725 1st Qu.:0.02925 1st Qu.:0.0230
## Median :0.04150 Median : 0.20000 Median :0.10000 Median :0.0795
## Mean :0.08128 Mean : 0.96785 Mean :0.40663 Mean :0.1790
## 3rd Qu.:0.09600 3rd Qu.: 1.00000 3rd Qu.:0.50000 3rd Qu.:0.2000
## Max. :1.60000 Max. :16.60000 Max. :8.00000 Max. :5.2000
## Vitamin.C Vitamin.D Vitamin.E Vitamin.K
## Min. : 0.0000 Min. : 0.00 Min. : 0.0000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.00 1st Qu.: 0.0000 1st Qu.:0.0000
## Median : 0.0955 Median : 0.00 Median : 0.0110 Median :0.0310
## Mean : 8.8780 Mean : 3.04 Mean : 0.4996 Mean :0.1808
## 3rd Qu.: 2.6750 3rd Qu.: 0.00 3rd Qu.: 0.2750 3rd Qu.:0.0810
## Max. :348.4000 Max. :181.70 Max. :41.6000 Max. :6.1000
## Calcium Copper Iron Magnesium
## Min. : 0.000 Min. : 0.0000 Min. : 0.00000 Min. : 0.00
## 1st Qu.: 0.074 1st Qu.: 0.0322 1st Qu.: 0.06925 1st Qu.: 0.70
## Median : 8.150 Median : 0.0890 Median : 0.40000 Median : 5.85
## Mean : 37.684 Mean : 9.7540 Mean : 1.01508 Mean : 20.92
## 3rd Qu.: 35.225 3rd Qu.: 0.5000 3rd Qu.: 1.10000 3rd Qu.: 23.20
## Max. :868.000 Max. :348.0000 Max. :29.20000 Max. :520.80
## Manganese Phosphorus Potassium Selenium
## Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.042 1st Qu.: 0.30 1st Qu.: 17.48 1st Qu.: 0.018
## Median : 0.200 Median : 14.15 Median : 63.00 Median : 0.058
## Mean : 5.474 Mean : 53.20 Mean : 200.92 Mean : 63.613
## 3rd Qu.: 0.900 3rd Qu.: 58.15 3rd Qu.: 191.35 3rd Qu.: 0.096
## Max. :156.600 Max. :1309.40 Max. :4053.00 Max. :1309.500
## Zinc Nutrition.Density
## Min. :0.0000 Min. : 0.10
## 1st Qu.:0.0350 1st Qu.: 13.60
## Median :0.0995 Median : 35.30
## Mean :0.3707 Mean : 74.56
## 3rd Qu.:0.4000 3rd Qu.: 83.45
## Max. :9.1000 Max. :918.80
Looking at the dataset, I noticed that it consists of 722 food items with a wide range of nutritional attributes, many of which vary significantly in scale. Some features, like Caloric.Value, Potassium, and Phosphorus, have extremely high values, while others, such as Vitamin.B12 and Selenium, have many zero entries. There are also redundant index columns (X, Unnamed..0) and a categorical food column that I need to remove before performing PCA or clustering. Many features, especially macronutrients like Fat, Saturated.Fats, and Carbohydrates, are heavily skewed, while others, such as vitamins and minerals, have sparse distributions with many zeros.
Given this, I need to standardize the dataset to ensure that no single variable dominates the analysis. Additionally, some variables with excessive zero values may not contribute meaningfully to dimensionality reduction, so filtering may be necessary. With such a high variance across features, PCA should be useful in reducing dimensionality while preserving key patterns.
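As a quick sketch of how that sparsity could be checked (an illustrative check on the raw data frame loaded above, not part of the main pipeline), the share of zero entries per numeric column can be tabulated:
# Illustrative check: proportion of zero entries in each numeric column
zero_share <- sapply(data[, sapply(data, is.numeric)], function(x) mean(x == 0, na.rm = TRUE))
head(sort(zero_share, decreasing = TRUE), 10)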
Since the data has unnecessary columns, I have to remove them, along with any empty observations.
data <- data[,-c(1,2)]
data <- na.omit(data)
Checking the data’s suitability for dimensionality reduction
dim(data)
## [1] 722 35
The data has 35 columns and is suited for dimensionality reduction. Before moving on, I want to check the class of each variable (feature) in the data, since most dimensionality reduction methods only work with numeric data types.
sapply(data, class)
## food Caloric.Value Fat
## "character" "integer" "numeric"
## Saturated.Fats Monounsaturated.Fats Polyunsaturated.Fats
## "numeric" "numeric" "numeric"
## Carbohydrates Sugars Protein
## "numeric" "numeric" "numeric"
## Dietary.Fiber Cholesterol Sodium
## "numeric" "numeric" "numeric"
## Water Vitamin.A Vitamin.B1
## "numeric" "numeric" "numeric"
## Vitamin.B11 Vitamin.B12 Vitamin.B2
## "numeric" "numeric" "numeric"
## Vitamin.B3 Vitamin.B5 Vitamin.B6
## "numeric" "numeric" "numeric"
## Vitamin.C Vitamin.D Vitamin.E
## "numeric" "numeric" "numeric"
## Vitamin.K Calcium Copper
## "numeric" "numeric" "numeric"
## Iron Magnesium Manganese
## "numeric" "numeric" "numeric"
## Phosphorus Potassium Selenium
## "numeric" "numeric" "numeric"
## Zinc Nutrition.Density
## "numeric" "numeric"
data_dimrec_numeric <- data[, sapply(data, is.numeric)]
scaled_data <- scale(data_dimrec_numeric)
I conducted a correlation matrix analysis to assess the relationships between variables. The reason for this step is that PCA works by transforming correlated variables into new uncorrelated principal components. If the dataset consists mostly of uncorrelated variables, PCA may not be effective because it relies on capturing shared variance. By computing the correlation matrix, I was able to identify which variables were highly correlated and which were not.
cor_matrix <- cor(scaled_data)
ggcorrplot(cor_matrix,
hc.order = FALSE,
type = "upper",
lab = F,
lab_size = 0.00005,
colors = c("blue", "white", "red")) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, face = "plain"),
axis.text.y = element_text(face = "plain"))
Within this dataset, we can see that some features are highly correlated, which creates redundancy:
- Fat, Saturated Fats, Monounsaturated Fats, and Polyunsaturated Fats show strong correlations.
- Carbohydrates, Sugars, and Caloric Value show strong correlations.
- Multiple B vitamins (e.g., B1, B2, B3, B5, B6, B12) also appear moderately correlated.
Thus we can apply dimensionality reduction to compress these features while maintaining variability, keeping the highly correlated variables that PCA can exploit. Since PCA benefits from strong correlations, I filtered out weakly correlated variables, keeping only those with at least one absolute correlation above 0.5 with another variable. Starting from 34 variables, this leaves 26, which are better suited for dimensionality reduction.
cor_matrix_abs <- abs(cor_matrix)
high_cor_vars <- colnames(cor_matrix)[apply(cor_matrix_abs, 2, function(x) any(x > 0.5 & x < 1))]
filtered_data <- scaled_data[, high_cor_vars]
filtered_data_ds <- as.data.frame(filtered_data)
head(filtered_data_ds,5)
## Caloric.Value Fat Saturated.Fats Monounsaturated.Fats
## 1 -0.167287868 0.2119669 0.006326839 0.6412986
## 2 -0.116858951 0.1615928 -0.101440260 0.7440570
## 3 -0.020585564 0.4033884 -0.074498485 1.4976184
## 4 8.442303583 10.6998513 5.789827813 9.6497826
## 5 -0.006832223 0.4033884 0.114093938 0.7783098
## Polyunsaturated.Fats Carbohydrates Protein Dietary.Fiber Vitamin.B1
## 1 0.04734267 -0.6793511 -0.5316697 -0.5927538 -0.09012576
## 2 -0.01512735 -0.5223814 -0.1105874 -0.3359922 -0.26191058
## 3 -0.04636236 -0.6825104 -0.5405766 -0.5927538 -0.32997627
## 4 11.37003390 -0.6825104 -0.5405766 -0.5927538 -0.32997627
## 5 0.21913523 -0.6825104 -0.5405766 -0.5927538 -0.32997627
## Vitamin.B12 Vitamin.B2 Vitamin.B3 Vitamin.B5 Vitamin.B6 Vitamin.D
## 1 1.4811222 -0.57111824 -0.45271556 -0.5387004 -0.2933482 -0.2297057
## 2 -0.5675873 -0.04699862 0.06870105 -0.2737411 -0.4683080 -0.2297057
## 3 -0.5675873 -0.60855536 -0.50314170 -0.5387004 -0.5049931 -0.2297057
## 4 -0.5675873 -0.60855536 -0.50314170 -0.5387004 -0.5049931 -0.2297057
## 5 -0.5675873 -0.60855536 -0.50314170 -0.5387004 -0.5049931 -0.2294034
## Vitamin.K Calcium Copper Iron Magnesium Manganese Phosphorus
## 1 -0.2828768 -0.4152301 -0.2861763 -0.4819763 -0.4457247 -0.3552616 -0.4646120
## 2 -0.2860402 -0.3271462 -0.2774029 -0.1536941 0.6241230 -0.3357924 0.4864845
## 3 -0.2860402 -0.4485592 -0.2862057 -0.4951466 -0.4522086 -0.3552616 -0.4846543
## 4 -0.2860402 -0.4485592 -0.2862057 -0.4951466 -0.4522086 -0.3552616 -0.4846543
## 5 -0.2860402 -0.4485592 -0.2862057 -0.4951466 -0.4522086 -0.3552616 -0.4846543
## Potassium Selenium Zinc Nutrition.Density
## 1 -0.4618446 -0.3498440 -0.4680411 -0.5450870
## 2 -0.2543378 -0.3494315 0.5538686 -0.4165060
## 3 -0.4700326 -0.3498440 -0.4783634 -0.5395204
## 4 -0.4700326 -0.3498440 -0.4783634 1.2694098
## 5 -0.4700326 -0.3498440 -0.4783634 -0.5032356
Before proceeding with Principal Component Analysis (PCA), I needed to ensure that my dataset met the fundamental assumptions required for effective dimensionality reduction. Specifically, I conducted two key tests: the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy and Bartlett’s test of sphericity.
kmo_result <- KMO(filtered_data_ds)
print(kmo_result)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = filtered_data_ds)
## Overall MSA = 0.75
## MSA for each item =
## Caloric.Value Fat Saturated.Fats
## 0.55 0.55 0.37
## Monounsaturated.Fats Polyunsaturated.Fats Carbohydrates
## 0.47 0.45 0.47
## Protein Dietary.Fiber Vitamin.B1
## 0.64 0.93 0.92
## Vitamin.B12 Vitamin.B2 Vitamin.B3
## 0.88 0.96 0.90
## Vitamin.B5 Vitamin.B6 Vitamin.D
## 0.86 0.85 0.86
## Vitamin.K Calcium Copper
## 0.90 0.78 0.84
## Iron Magnesium Manganese
## 0.96 0.91 0.76
## Phosphorus Potassium Selenium
## 0.93 0.91 0.82
## Zinc Nutrition.Density
## 0.94 0.84
Initially, I found that many variables had weak correlations, which explained why my KMO test result was low (0.52). Since PCA benefits from strong correlations, I filtered out weakly correlated variables using a threshold, improving the KMO score to 0.75.
This step was crucial because keeping uncorrelated variables in PCA would have resulted in components that do not effectively capture underlying patterns in the data. By refining the dataset before PCA, I ensured that the dimensionality reduction process would extract meaningful structure rather than just noise.
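As a sketch of how that pre-filtering adequacy score can be reproduced (the original code for this step is not shown above), KMO can be run on the correlation matrix of the full scaled data computed earlier:
# Sketch: overall KMO on all scaled variables, before filtering out weak correlations
kmo_before <- KMO(cor_matrix)
kmo_before$MSA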
bartlett_result <- cortest.bartlett(cor_matrix, n = nrow(filtered_data_ds))
# Print results
print(bartlett_result)
## $chisq
## [1] 32002.29
##
## $p.value
## [1] 0
##
## $df
## [1] 561
The results from Bartlett’s test of sphericity confirm that applying PCA to my dataset makes sense. The chi-square statistic is extremely high (32,002.29), and the p-value is effectively zero, meaning I can confidently reject the null hypothesis that my correlation matrix is an identity matrix. With 561 degrees of freedom, this test shows that my variables have significant correlations and share common variance. This reassures me that dimensionality reduction is appropriate, as my dataset contains enough interdependencies for PCA to extract meaningful components.
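For reference, Bartlett’s statistic is computed from the determinant of the correlation matrix \(R\) of \(p\) variables and \(n\) observations, and the degrees of freedom follow as \(p(p-1)/2 = 34 \cdot 33/2 = 561\), matching the output above:
\[
\chi^2 = -\left(n - 1 - \frac{2p + 5}{6}\right)\ln\lvert R\rvert, \qquad df = \frac{p(p-1)}{2}
\]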
Principal Component Analysis (PCA) is a technique I used to reduce the dimensionality of my dataset while retaining as much meaningful information as possible. The main idea behind PCA is to transform my original variables into a new set of uncorrelated features called principal components (PCs). These components are ordered based on how much variance they explain in the data—PC1 captures the most variance, followed by PC2, PC3, and so on. This allows me to simplify complex datasets by focusing on just the most important patterns while removing redundancy and noise. By applying PCA, I transformed my high-dimensional nutritional data into a smaller set of components that still preserve its essential structure, making it easier to interpret and use for clustering.
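Formally, since the data are standardized, the principal components are the eigenvectors \(v_k\) of the correlation matrix \(R\), and the share of variance explained by component \(k\) comes directly from its eigenvalue \(\lambda_k\):
\[
R\,v_k = \lambda_k v_k, \qquad \text{Proportion of variance}_k = \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j}
\]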
pca_result <- prcomp(filtered_data_ds, center = TRUE, scale. = TRUE)
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.9261 2.0650 1.8304 1.13363 0.98648 0.88570 0.87531
## Proportion of Variance 0.3293 0.1640 0.1288 0.04943 0.03743 0.03017 0.02947
## Cumulative Proportion 0.3293 0.4933 0.6222 0.67159 0.70902 0.73919 0.76866
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.86302 0.83849 0.82717 0.76680 0.68339 0.66218 0.64984
## Proportion of Variance 0.02865 0.02704 0.02632 0.02261 0.01796 0.01686 0.01624
## Cumulative Proportion 0.79731 0.82435 0.85066 0.87328 0.89124 0.90811 0.92435
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.58475 0.55983 0.53442 0.51444 0.46262 0.39339 0.36877
## Proportion of Variance 0.01315 0.01205 0.01098 0.01018 0.00823 0.00595 0.00523
## Cumulative Proportion 0.93750 0.94955 0.96054 0.97072 0.97895 0.98490 0.99013
## PC22 PC23 PC24 PC25 PC26
## Standard deviation 0.3640 0.28237 0.1764 0.11107 0.03027
## Proportion of Variance 0.0051 0.00307 0.0012 0.00047 0.00004
## Cumulative Proportion 0.9952 0.99829 0.9995 0.99996 1.00000
fviz_eig(pca_result, addlabels = TRUE, barfill = "lightblue", barcolor = "black", linecolor = "black") +
theme_minimal()
Looking at my scree plot, PC1 explains 32.9% of the variance, followed by PC2 at 16.4% and PC3 at 12.9%. Together, these three components account for about 62.2% of the total variance. The variance explained drops significantly from PC4 onward, with PC4 contributing 4.9% and PC5 at 3.7%, after which the values stagnate around 3% or lower.
Given that the elbow point is clearly around PC4–PC5, I need to balance information retention with dimensionality reduction. If I stop at PC5, I capture approximately 70.9% of the total variance, and extending to PC6 would give me 73.9%. Since the additional variance gain beyond PC6 is minimal, I will retain the first six principal components for my analysis.
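The same cut-off can be checked programmatically from the PCA object; the short sketch below (not part of the original code) recomputes the cumulative variance and finds the smallest number of components reaching 70%:
# Sketch: cumulative variance explained and the 70% cut-off
explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
cum_explained <- cumsum(explained)
which(cum_explained >= 0.70)[1]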
pca_loadings <- as.data.frame(pca_result$rotation[, 1:6])
head(pca_loadings, 5)
## PC1 PC2 PC3 PC4 PC5
## Caloric.Value 0.2208772 0.002060511 -0.3975456 0.08046347 -0.042938509
## Fat 0.1195779 -0.052698448 -0.5006284 -0.03586913 -0.001046917
## Saturated.Fats 0.0582518 -0.055483996 -0.4257407 -0.06485705 -0.090640532
## Monounsaturated.Fats 0.1069255 -0.045589478 -0.4351612 -0.05041065 0.010850083
## Polyunsaturated.Fats 0.1325871 -0.030829280 -0.3463100 0.03689165 0.127029123
## PC6
## Caloric.Value 0.090318664
## Fat 0.004988606
## Saturated.Fats 0.061151746
## Monounsaturated.Fats -0.062504993
## Polyunsaturated.Fats -0.026785184
library(factoextra)
# Contribution of variables to PC1
fviz_contrib(pca_result, choice = "var", axes = 1, top = 10)
# Contribution of variables to PC2
fviz_contrib(pca_result, choice = "var", axes = 2, top = 10)
# Contribution of variables to PC3
fviz_contrib(pca_result, choice = "var", axes = 3, top = 10)
# Contribution of variables to PC4
fviz_contrib(pca_result, choice = "var", axes = 4, top = 10)
# Contribution of variables to PC5
fviz_contrib(pca_result, choice = "var", axes = 5, top = 10)
# Contribution of variables to PC6
fviz_contrib(pca_result, choice = "var", axes = 6, top = 10)
In my dataset, the first two principal components (PC1 and PC2) account for the largest variance, indicating that the primary way the data naturally separates is based on mineral content and nutrient density. PC3 and PC4 capture differences in fat composition and carbohydrate-rich sources, effectively distinguishing energy-dense foods. Meanwhile, PC5 and PC6 focus on trace minerals and vitamins, further differentiating food types based on their micronutrient profiles. Given this breakdown, retaining the first six principal components ensures that I preserve the most meaningful variation while reducing dimensionality.
pca_var <- get_pca_var(pca_result)
fviz_contrib(pca_result, "var", axes = 1:6, fill = "lightblue", color = "lightblue2")
The chart shown above illustrates the cumulative contributions of individual variables across the first six principal components (PCs). The red dashed line marks the threshold for significant contributions, helping me identify which variables play the most influential role in structuring my dataset. Fat and Caloric Value emerge as the top contributors, confirming that energy density is a primary factor in distinguishing different data points. Calcium, Nutrition Density, and Phosphorus also play a key role, reinforcing the importance of mineral content in food classification. Additionally, Zinc, Magnesium, and Vitamin B3 contribute substantially, indicating that micronutrient composition is another major driver of variation.
Other notable contributors include Selenium, Manganese, and Protein, suggesting that both trace minerals and macronutrient composition significantly shape the dataset’s structure. Monounsaturated and Saturated Fats appear relevant but are secondary compared to total fat content. Meanwhile, Carbohydrates and Polyunsaturated Fats contribute below the threshold but still provide some differentiation. The lowest-ranked contributors among the top variables are Vitamin B12, Vitamin D, and Vitamin K, confirming that while vitamins play a role, their variability is less influential than macronutrients and minerals.
This analysis validates my decision to retain six principal components, as they effectively capture the key sources of variation in my dataset. Now we have a clearer understanding that energy density, mineral content, and protein composition are the dominant factors shaping the structure of my data.
fviz_pca_var(pca_result,
col.var = "lightblue",
repel = TRUE,
title = "PCA Variable Factor Map",
axes = c(1,2))
Looking at the axes, PC1 (32.9%) explains most of the variance, while PC2 (16.4%) captures additional differentiation. Together, these two components account for 49.3% of the total variation in my dataset, which is quite substantial. The variables clustering in similar directions suggest strong correlations, meaning that foods high in one nutrient are likely to be high in others from the same group. For example, Phosphorus, Iron, Magnesium, and Zinc contribute heavily to PC1, while Selenium, Manganese, and Copper are more aligned with PC2. This tells me that PC1 is mainly driven by macronutrients like fat and protein, while PC2 captures the variation from micronutrients like vitamins and minerals.
I also notice that certain variables oppose each other, implying negative correlations. For instance, fat-related variables (Saturated, Monounsaturated, and Polyunsaturated Fats) pull in a different direction compared to mineral-heavy variables like Selenium and Copper. This makes sense because high-fat foods generally don’t have the same nutritional profile as mineral-rich foods.
Overall, this plot confirms that the structure of my dataset is largely driven by macronutrient content on PC1 and micronutrient density on PC2. It also justifies why I retained six principal components—since they capture distinct and meaningful patterns in the data.
reduced_data_six <- as.data.frame(pca_result$x[, 1:6])  # keep the six retained principal components
head(reduced_data_six,5)
## PC1 PC2 PC3 PC4 PC5
## 1 -1.6275821953 -0.5667021 -0.7801081 0.06324648 0.57860765
## 2 -0.0009151555 -1.0881589 -0.1820802 -0.17120608 0.18404617
## 3 -1.5215270322 -1.2799019 -1.1831586 -0.13679361 -0.03798116
## 4 4.8119110872 -2.8891079 -19.6927146 -0.80080015 0.46508117
## 5 -1.5390768256 -1.2663423 -1.0477090 -0.11402869 -0.03229059
I’m using the Hopkins statistic as a quick check of clustering tendency before proceeding to clustering: values close to 1 indicate highly clusterable data, while values near 0.5 suggest essentially random structure, so the 0.94 obtained below supports moving ahead with clustering.
tendency <- get_clust_tendency(reduced_data_six, n = nrow(reduced_data_six) * 0.1)
cat("Hopkins statistic:", tendency$hopkins_stat, "\n")
## Hopkins statistic: 0.9440615
set.seed(123)
show_elbow<-fviz_nbclust(reduced_data_six, kmeans, method = "wss") +ggtitle("Elbow Method")
show_silhouette<- fviz_nbclust(reduced_data_six, kmeans, method = "silhouette") +ggtitle("Silhouette Method")
grid.arrange(show_elbow,show_silhouette, ncol=2, top="Optimal Number of Clusters")
To determine the optimal number of clusters for my dataset, I used both the Elbow Method and the Silhouette Method, as shown in the figure above. These two approaches provide complementary insights: the Elbow Method identifies the point where adding more clusters yields diminishing returns in variance reduction, while the Silhouette Method assesses how well-separated the clusters are.
From the Elbow Method, I observed a clear inflection point around k = 3 or k = 4, meaning that beyond this point, the reduction in within-cluster variance slows down significantly. Meanwhile, the Silhouette Score peaks at k = 2, indicating the best cluster separation at that level. However, considering the trade-off between separation and meaningful grouping, k = 3 emerges as the best choice. At k = 3, the Silhouette Score remains relatively high while ensuring a reasonable division of my data into distinct groups.
With this justification, I decided to proceed with K-Means Clustering using k = 3 to segment my dataset effectively. If the results are unsatisfactory or show overlapping clusters, I may explore Hierarchical Clustering as an alternative approach.
set.seed(123)
# Run K-Means for different k values
kmeans_k3 <- kmeans(reduced_data_six, centers = 3)
kmeans_k4 <- kmeans(reduced_data_six, centers = 4)
kmeans_k5 <- kmeans(reduced_data_six, centers = 5)
reduced_data_six$Cluster_k3 <- as.factor(kmeans_k3$cluster)
reduced_data_six$Cluster_k4 <- as.factor(kmeans_k4$cluster)
reduced_data_six$Cluster_k5 <- as.factor(kmeans_k5$cluster)
hull_data <- reduced_data_six %>%
group_by(Cluster_k3) %>%
slice(chull(PC1, PC2))
p1 <- ggplot(reduced_data_six, aes(x = PC1, y = PC2, color = Cluster_k3, shape = Cluster_k3)) +
  geom_point(size = 3) +
  geom_polygon(data = hull_data, aes(x = PC1, y = PC2, fill = Cluster_k3, group = Cluster_k3), alpha = 0.3) +
  theme_minimal() +
  labs(title = "K-Means Clustering Result (k = 3)", color = "Cluster_k3", fill = "Cluster_k3")
hull_data <- reduced_data_six %>%
group_by(Cluster_k4) %>%
slice(chull(PC1, PC2))
p2 <- ggplot(reduced_data_six, aes(x = PC1, y = PC2, color = Cluster_k4, shape = Cluster_k4)) +
  geom_point(size = 3) +
  geom_polygon(data = hull_data, aes(x = PC1, y = PC2, fill = Cluster_k4, group = Cluster_k4), alpha = 0.3) +
  theme_minimal() +
  labs(title = "K-Means Clustering Result (k = 4)", color = "Cluster_k4", fill = "Cluster_k4")
hull_data <- reduced_data_six %>%
group_by(Cluster_k5) %>%
slice(chull(PC1, PC2))
p3 <- ggplot(reduced_data_six, aes(x = PC1, y = PC2, color = Cluster_k5, shape = Cluster_k5)) +
  geom_point(size = 3) +
  geom_polygon(data = hull_data, aes(x = PC1, y = PC2, fill = Cluster_k5, group = Cluster_k5), alpha = 0.3) +
  theme_minimal() +
  labs(title = "K-Means Clustering Result (k = 5)", color = "Cluster_k5", fill = "Cluster_k5")
ggarrange(p1, p2, p3, ncol = 2, nrow = 2)
sil_k3 <- silhouette(kmeans_k3$cluster, dist(reduced_data_six))
sil_k4 <- silhouette(kmeans_k4$cluster, dist(reduced_data_six))
sil_k5 <- silhouette(kmeans_k5$cluster, dist(reduced_data_six))
p_sil3 <- fviz_silhouette(sil_k3) + ggtitle("Silhouette Plot (K=3)")
## cluster size ave.sil.width
## 1 1 76 0.44
## 2 2 601 0.60
## 3 3 45 0.08
p_sil4 <- fviz_silhouette(sil_k4) + ggtitle("Silhouette Plot (K=4)")
## cluster size ave.sil.width
## 1 1 150 0.42
## 2 2 77 0.41
## 3 3 33 0.01
## 4 4 462 0.72
p_sil5 <- fviz_silhouette(sil_k5) + ggtitle("Silhouette Plot (K=5)")
## cluster size ave.sil.width
## 1 1 149 0.42
## 2 2 33 0.01
## 3 3 449 0.72
## 4 4 73 0.40
## 5 5 18 0.12
p_sil3
p_sil4
p_sil5
To evaluate the clustering quality, I generated silhouette plots for different values of \(K\) (3, 4, and 5), following the results of the elbow and silhouette plots and my own intuition. The silhouette width measures how well-separated the clusters are, with higher values indicating better-defined clusters.
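As a reminder, the silhouette width of an observation \(i\) compares its average distance to members of its own cluster, \(a(i)\), with its average distance to the nearest other cluster, \(b(i)\):
\[
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
\]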
For \(K=3\), the silhouette plot showed a reasonably high average silhouette width, suggesting that the clusters were fairly well-formed. However, upon increasing \(K\) to 4 and 5, the results did not significantly improve, and in some cases, new clusters had noticeably lower silhouette scores. This suggests that while some finer-grained separation is possible, it may not contribute meaningfully to better clustering.
Given these findings, I will proceed with hierarchical clustering as an alternative approach. Unlike K-Means, hierarchical clustering does not require pre-specifying the number of clusters and allows for a more flexible visualization of how data points group together at different levels. This method will help validate whether \(K=3\) remains the most reasonable choice or if another structure emerges naturally from the data.
dist_matrix <- dist(reduced_data_six, method = "euclidean")
hclust_result <- hclust(dist_matrix, method = "ward.D2")
fviz_dend(hclust_result, k = 3, rect = TRUE, rect_border = "jco", rect_fill = TRUE)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
clusters_hierarchical <- cutree(hclust_result, k = 3)
The dendrogram shown above helps visualize how hierarchical clustering groups the dataset based on similarity. The height represents the dissimilarity between clusters, and the colored segments indicate potential natural cluster separations.
From the visualization:
- Four distinct clusters emerge, suggesting K=4 might be a valid choice.
- The blue and green-yellow-ish clusters are more compact, whereas the purple cluster is more distinct.
- The yellow cluster is smaller, indicating a niche subgroup.
By cutting the dendrogram at an appropriate height, I can choose K=3 or K=4 as the final number of clusters. This structure aligns well with the previous silhouette results, confirming that hierarchical clustering captures meaningful patterns in the dataset.
To evaluate the quality of hierarchical clustering, I first plotted the clusters in the PC1 vs PC2 space. If clusters are well-separated, this suggests the method captured meaningful structure. However, if clusters overlap, it might indicate a need for re-evaluating K or the clustering method.
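The code for that scatter plot is not included above, so the following is a minimal sketch of how it could be drawn from the objects already defined (pca_result and clusters_hierarchical):
# Sketch: hierarchical cluster assignments in PC1-PC2 space
hier_plot_df <- data.frame(PC1 = pca_result$x[, 1],
                           PC2 = pca_result$x[, 2],
                           Cluster = factor(clusters_hierarchical))
ggplot(hier_plot_df, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(size = 2, alpha = 0.7) +
  theme_minimal() +
  labs(title = "Hierarchical Clusters in PC1-PC2 Space")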
silhouette_hierarchical <- silhouette(clusters_hierarchical, dist(reduced_data_six))
fviz_silhouette(silhouette_hierarchical) +
labs(title = "Silhouette Plot for Hierarchical Clustering")
## cluster size ave.sil.width
## 1 1 462 0.77
## 2 2 183 0.08
## 3 3 77 0.41
## Representation for Each Cluster
reduced_data_six$Hierarchical_Cluster <- as.factor(clusters_hierarchical)
cluster_summary <- aggregate(filtered_data_ds[, -c(1,2)],
by = list(Cluster = reduced_data_six$Hierarchical_Cluster),
FUN = mean)
head(cluster_summary,5)
## Cluster Saturated.Fats Monounsaturated.Fats Polyunsaturated.Fats
## 1 1 -0.08859568 -0.1080424 -0.1369963
## 2 2 0.28093272 0.3568432 0.4055340
## 3 3 -0.13609718 -0.1998276 -0.1418239
## Carbohydrates Protein Dietary.Fiber Vitamin.B1 Vitamin.B12 Vitamin.B2
## 1 -0.3734764 -0.35272956 -0.3897078 -0.2032879 -0.1424281 -0.2909557
## 2 0.8492512 0.85143384 0.7131916 0.6401944 -0.2458422 0.9858157
## 3 0.2225081 0.09283981 0.6432588 -0.3017734 1.4388432 -0.5971784
## Vitamin.B3 Vitamin.B5 Vitamin.B6 Vitamin.D Vitamin.K Calcium Copper
## 1 -0.3122326 -0.3292908 -0.3072257 -0.1811763 -0.1444004 -0.2853551 -0.1953250
## 2 0.9709229 0.1797736 0.2543536 -0.2295636 -0.1996869 0.9087797 -0.2768837
## 3 -0.4341222 1.5484906 1.2388516 1.6326443 1.3409828 -0.4476963 1.8299984
## Iron Magnesium Manganese Phosphorus Potassium Selenium Zinc
## 1 -0.3106060 -0.2796763 -0.2605709 -0.3115312 -0.3034565 -0.2630288 -0.2775306
## 2 0.9447645 0.8818620 -0.3132085 0.9886213 0.8745299 -0.3496061 0.8753835
## 3 -0.3817134 -0.4177961 2.3078039 -0.4803936 -0.2576889 2.4090546 -0.4152732
## Nutrition.Density
## 1 -0.3913460
## 2 1.1466233
## 3 -0.3770159
From the clustering results, I can identify three distinct food groups based on their nutrient composition (mean standardized values in the table above):
- Cluster 1 sits below average on nearly every nutrient, representing foods with low overall nutrient density.
- Cluster 2 is above average in carbohydrates, protein, fiber, most B vitamins, and major minerals such as calcium, iron, magnesium, phosphorus, potassium, and zinc, making it the most nutrient-dense group.
- Cluster 3 stands out for vitamins B5, B6, B12, D, and K together with trace minerals like copper, manganese, and selenium, so it is differentiated mainly by its micronutrient profile.
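To sanity-check these labels, a few example foods from each group can be pulled from the retained food column (a sketch; output omitted here):
# Sketch: a few example foods from each hierarchical cluster
data %>%
  mutate(Cluster = factor(clusters_hierarchical)) %>%
  group_by(Cluster) %>%
  slice_head(n = 5) %>%
  select(Cluster, food)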
These results confirm that PCA successfully reduced the complexity of the dataset while clustering uncovered meaningful groups that reflect different dietary patterns.
In this project, I explored how dimensionality reduction and clustering methods can simplify and improve the analysis of nutritional data. By applying Principal Component Analysis (PCA), I successfully reduced the dataset from 34 dimensions to 6, capturing the most important patterns while filtering out redundancy. Clustering using hierarchical methods then revealed three meaningful food categories, making it easier to understand the structure within the dataset.
These clusters provide a data-driven way to classify foods, offering a simpler basis for analyzing dietary patterns than working with all 34 nutrient values at once.
This project confirmed that nutritional datasets exhibit natural structure, and unsupervised learning methods are powerful tools for uncovering meaningful insights.
Michigan State University. (2018, October). Americans pay attention to food labels, but are confused by what information matters. https://www.canr.msu.edu/news/americans-pay-attention-to-food-labels-but-are-confused-by-what-information-matters
National Institutes of Health. (2021). Macronutrients overview. National Center for Biotechnology Information, U.S. National Library of Medicine. https://www.ncbi.nlm.nih.gov/books/NBK594226/
MD Anderson Cancer Center. (n.d.). What are macronutrients? https://www.mdanderson.org/publications/focused-on-health/what-are-macronutrients-.h15-1593780.html