0.1 1. Introduction

The objective of this paper is to identify global “culinary archetypes” using unsupervised learning. While traditional economic analysis often relies on GDP or geography, this study uses high-dimensional food supply data (kcal/capita/day) to group countries based on actual consumption patterns.

0.2 Research Questions:

Can countries be grouped into distinct clusters based on their food supply structure?

How do non-linear dimension reduction techniques (t-SNE, UMAP) compare to linear methods (PCA) in visualizing these patterns?

0.3 2. Data Preparation

The data is sourced from the FAO Food Balance Sheets (FAOSTAT).

Preprocessing: I filtered for the year 2019 to avoid pandemic-related disruptions.

Feature Selection: 11 primary food categories were selected: Cereals, Starchy Roots, Sugar & Sweeteners, Pulses, Treenuts, Oilcrops, Vegetable Oils, Vegetables, Fruits, Stimulants, and Animal Products.

Scaling: Because caloric values vary drastically across categories (e.g., 1500 kcal for cereals vs. 10 kcal for spices), all variables were z-score standardized to ensure equal weighting.
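To make the scaling step concrete, here is a minimal sketch on made-up numbers (the three countries and their values are hypothetical, not FAO data). It shows that `scale()` brings a large-magnitude column such as cereals and a small-magnitude column such as spices to the same mean of 0 and standard deviation of 1.

```r
# Toy data: hypothetical kcal/capita/day values, for illustration only
toy <- data.frame(
  Cereals = c(1500, 1200, 900),
  Spices  = c(10, 4, 7),
  row.names = c("Country A", "Country B", "Country C")
)

# scale() centers each column on its mean and divides by its standard deviation
toy_scaled <- scale(toy)

# After scaling, both columns share the same mean and spread
print(round(colMeans(toy_scaled), 10))
print(apply(toy_scaled, 2, sd))
```

Without this step, Euclidean distances between countries would be driven almost entirely by the largest-valued categories.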

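# Packages used throughout the report
library(dplyr)       # filter(), select(), %>%
library(tidyr)       # pivot_wider()
library(tibble)      # column_to_rownames()
library(ggplot2)     # ggplot(), theme_minimal()
library(cluster)     # pam()
library(factoextra)  # fviz_pca_ind(), fviz_cluster(), fviz_silhouette()
library(hopkins)     # hopkins()
library(seriation)   # dissplot()
library(Rtsne)       # Rtsne()
library(umap)        # umap()
library(cowplot)     # plot_grid()

raw_data <- read.csv("C:/Users/mukun/Downloads/RMarkdown_Full_Project/FoodBalanceSheets_E_All_Data.csv")

# Transliterate accented names (e.g. "Côte d'Ivoire") to ASCII so they can serve as row names
raw_data$Area <- iconv(raw_data$Area, from = "UTF-8", to = "ASCII//TRANSLIT")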

# CLEANING & ROW NAME PREPARATION 
clean_df <- raw_data %>%
  filter(Element == "Food supply (kcal/capita/day)") %>%
  select(Area, Item, Y2019) %>%
  # Remove rows where Area became NA/Empty after encoding to satisfy row.names requirements
  filter(!is.na(Area) & Area != "") %>%
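  # One column per FAO Item; no Item filter is applied here, so every item
  # in the file becomes a feature. Duplicate Area/Item pairs are summed.
  pivot_wider(names_from = Item, values_from = Y2019, 
              values_fill = 0, values_fn = sum) %>%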
  column_to_rownames("Area")

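# FILTER AGGREGATES 
# NOTE: this list is not exhaustive. Aggregates under other labels, e.g.
# "European Union (27)" and "Net Food Importing Developing Countries",
# are not matched and stay in the data.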
aggregates <- c("World", "Africa", "Europe", "South America", "Asia", "Oceania", 
                "European Union", "Least Developed Countries", "Middle Africa", 
                "Western Africa", "Southern Africa", "Northern Africa", 
                "Eastern Africa", "Central America")
clean_df <- clean_df[!rownames(clean_df) %in% aggregates, ]

clean_df[is.na(clean_df)] <- 0
clean_df <- clean_df[, colSums(clean_df != 0) > 0]

# FINAL SCALING (z-score each column)
df_matrix <- scale(clean_df)
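A hand-maintained list of aggregate names is easy to leave incomplete. One possible alternative, sketched below on toy rows, is to filter on a numeric area code instead. This rests on an assumption that should be checked against the actual file: that the FAOSTAT bulk download carries an `Area.Code` column in which aggregate regions use codes of 5000 and above. The rows and values here are illustrative only.

```r
# Toy stand-in for the FAOSTAT bulk file; codes and values are illustrative only.
# ASSUMPTION: `Area.Code` exists and aggregate regions carry codes >= 5000.
toy_raw <- data.frame(
  Area.Code = c(68, 100, 5000, 5707),
  Area      = c("France", "India", "World", "European Union (27)"),
  Y2019     = c(3500, 2500, 2900, 3400),
  stringsAsFactors = FALSE
)

# Keep only rows whose code falls in the assumed country range
countries_only <- toy_raw[toy_raw$Area.Code < 5000, ]
print(countries_only$Area)
```

If the assumption holds, applying the same condition to `raw_data` before pivoting would remove regional totals without naming each one.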

0.4 3. Dimension Reduction (Comparison of Methods)

I compared three different approaches to dimensionality reduction.

A. PCA (Principal Component Analysis): PCA provides a linear baseline. PC1 typically explains the “Total Caloric Volume,” while PC2 separates “Plant-based” vs. “Animal-based” economies.

B. t-SNE (t-Distributed Stochastic Neighbor Embedding): t-SNE is a non-linear method that excels at finding local clusters. It reveals tight “islands” of countries with nearly identical diets (e.g., the “Rice-Belt” of SE Asia).

C. UMAP (Uniform Manifold Approximation and Projection), extra method: Unlike t-SNE, UMAP preserves more of the global structure. It shows the “bridges” between dietary cultures, such as how North African diets sit between the Mediterranean and Middle Eastern groups.
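Interpretations of PC1 and PC2 such as those above can be checked against the share of variance each component explains and against its loadings. The following is a minimal sketch on simulated data (the three columns and the shared "volume" factor are hypothetical, not FAO values); it only illustrates what a "total volume" axis looks like in `prcomp()` output.

```r
set.seed(1)
n <- 60

# Simulated stand-in for the food matrix (hypothetical, not FAO data):
# a shared "volume" factor pushes every category up or down together
volume <- rnorm(n)
toy <- cbind(
  Cereals = volume + rnorm(n, sd = 0.5),
  Oils    = volume + rnorm(n, sd = 0.5),
  Animal  = volume + rnorm(n, sd = 0.5)
)

pca_toy <- prcomp(scale(toy))

# Share of total variance captured by each component
var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
print(round(var_explained, 3))

# PC1 loadings: the same sign on every category points to an overall "volume" axis
print(round(pca_toy$rotation[, 1], 3))
```

On the real data, the same two printouts applied to `pca_res` would show whether PC1 loads with one sign across categories and whether PC2 contrasts plant and animal items.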

# Statistical validation: Hopkins Statistic
set.seed(123)
h_val <- hopkins(df_matrix, m = 50)
print(paste("Your Hopkins Statistic is:", round(h_val, 4)))
## [1] "Your Hopkins Statistic is: 1"
# Visual validation: VAT Plot

set.seed(123)
# Draw up to 100 rows so the dissimilarity plot stays readable
sample_idx <- sample(seq_len(nrow(df_matrix)), min(100, nrow(df_matrix)))
dist_mat <- dist(df_matrix[sample_idx, ])
dissplot(dist_mat, method = "VAT", main = "VAT Plot: Dietary Cluster Tendency")
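A Hopkins value of exactly 1 deserves a sanity check. In one common formulation of the statistic, nearest-neighbor distances are raised to the power of the number of dimensions, so with many columns the ratio can saturate near 1 even when the data contain no clusters. The sketch below is a hand-written base R approximation under that assumption; `hopkins_sketch` is a hypothetical helper and is not the implementation in the `hopkins` package. It compares two unclustered data sets, one in 2 dimensions and one in 50.

```r
# Hypothetical helper: simplified Hopkins statistic,
# H = sum(u^d) / (sum(u^d) + sum(w^d)), with d = number of columns
hopkins_sketch <- function(X, m) {
  X <- as.matrix(X)
  n <- nrow(X)
  d <- ncol(X)

  # w: distance from m sampled real points to their nearest other real point
  idx <- sample(n, m)
  D <- as.matrix(dist(X))
  diag(D) <- Inf
  w <- apply(D[idx, , drop = FALSE], 1, min)

  # u: distance from m uniform points (drawn in the bounding box) to the nearest real point
  lo <- apply(X, 2, min)
  hi <- apply(X, 2, max)
  U <- sapply(seq_len(d), function(j) runif(m, lo[j], hi[j]))
  u <- apply(U, 1, function(p) sqrt(min(colSums((t(X) - p)^2))))

  sum(u^d) / (sum(u^d) + sum(w^d))
}

set.seed(123)
# Neither data set contains clusters by construction
flat_2d  <- matrix(runif(200 * 2), ncol = 2)    # uniform noise, 2 dimensions
blob_50d <- matrix(rnorm(200 * 50), ncol = 50)  # a single Gaussian blob, 50 dimensions

h_low  <- hopkins_sketch(flat_2d, m = 20)
h_high <- hopkins_sketch(blob_50d, m = 20)
print(c(low_dim = h_low, high_dim = h_high))
```

If the high-dimensional value comes out close to 1 despite the absence of clusters, a similar effect may be contributing to the result above, and it would be worth repeating the test on a few leading principal components.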

1 4. DIMENSION REDUCTION COMPARISON

# A. PCA (Linear Baseline) 
pca_res <- prcomp(df_matrix)
p1 <- fviz_pca_ind(pca_res, 
                   geom = "point", 
                   title = "PCA (Linear)", 
                   col.ind = "cos2",               # Continuous variable
                   gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), # Continuous gradient
                   legend.title = "Representation") + 
  theme_minimal()

# B. t-SNE (Non-linear local structure)
df_unique <- unique(df_matrix)
set.seed(123)
tsne_out <- Rtsne(df_unique, perplexity = 15, check_duplicates = FALSE)
tsne_df <- data.frame(X = tsne_out$Y[,1], Y = tsne_out$Y[,2])
p2 <- ggplot(tsne_df, aes(x=X, y=Y)) + 
  geom_point(color = "#E7B800", alpha = 0.7) + 
  theme_minimal() + 
  labs(title = "t-SNE (Local)")

# C. UMAP ( Non-linear global structure)
set.seed(123)
umap_out <- umap(df_matrix)
umap_df <- data.frame(X = umap_out$layout[,1], Y = umap_out$layout[,2])
p3 <- ggplot(umap_df, aes(x=X, y=Y)) + 
  geom_point(color = "#00AFBB", alpha = 0.7) + 
  theme_minimal() + 
  labs(title = "UMAP (Global)")

# Show side-by-side comparison
plot_grid(p1, p2, p3, labels = "AUTO", ncol = 3)

## Results

The comparison of dimension reduction techniques shows that while PCA captures broad economic variance (total calories), t-SNE and UMAP are superior at identifying cultural “islands” and non-linear relationships that linear models fail to visualize.

1.1 5. ROBUST CLUSTERING: PAM

set.seed(123)
pam_final <- pam(df_matrix, k = 4)

# Assigning Cluster Names
cluster_names <- c("1" = "Western High-Protein", 
                   "2" = "Cereal-based", 
                   "3" = "Starchy Roots", 
                   "4" = "Mediterranean/Diverse")

# Final Cluster Visualization
fviz_cluster(list(data = df_matrix, cluster = pam_final$clustering),
             palette = "jco", 
             repel = TRUE, 
             star.plot = TRUE, 
             ggtheme = theme_minimal(),
             main = "Final Dietary Archetypes (PAM Clustering)")

# Silhouette analysis to assess cluster quality (widths near 0 indicate overlapping clusters)
fviz_silhouette(pam_final)
##   cluster size ave.sil.width
## 1       1   77          0.09
## 2       2   32          0.09
## 3       3   37          0.03
## 4       4   59          0.12
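Average silhouette widths close to 0, as in the table above, usually indicate overlapping clusters, so it can be worth checking whether a different k separates the data better. The following is a minimal sketch of that check on simulated data with three well-separated groups (not the FAO data), assuming the `cluster` package is installed.

```r
library(cluster)  # pam()

set.seed(42)
# Simulated data: three clearly separated 2-D groups of 30 points each
toy <- rbind(
  matrix(rnorm(60, mean = 0),  ncol = 2),
  matrix(rnorm(60, mean = 10), ncol = 2),
  matrix(rnorm(60, mean = 20), ncol = 2)
)

# Average silhouette width for each candidate number of clusters
k_values <- 2:6
avg_sil <- sapply(k_values, function(k) pam(toy, k = k)$silinfo$avg.width)
names(avg_sil) <- k_values
print(round(avg_sil, 3))

# The k with the highest average width is the best-supported partition here
best_k <- k_values[which.max(avg_sil)]
print(best_k)
```

Running the same loop over `df_matrix` would show whether k = 4 is supported by the data or whether all candidate values give similarly low widths.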

1.2 Discussion of Results

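The analysis partitions the data into four dietary archetypes based on the 118 food variables analyzed. The Hopkins statistic of 1.0 indicates strong cluster tendency, meaning the data are far from uniformly random; it does not by itself validate this particular four-cluster partition. The average silhouette widths (0.03 to 0.12) show that the clusters overlap considerably, so the archetypes below are best read as broad tendencies rather than sharply separated groups.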

Cluster 1 (Western High-Protein): Represented by medoid countries with high caloric intake from animal products, dairy, and sweeteners.

Cluster 2 (Cereal-based): Characterized by high consumption of rice, wheat, and pulses, typical of South and Southeast Asia.

Cluster 3 (Starchy Roots): Defined by a high reliance on cassava, yams, and plantains, a signature of many Sub-Saharan African diets.

Cluster 4 (Mediterranean/Diverse): A balanced profile with high vegetable oil, fruit, and vegetable consumption.

1.3 Limitations

The FAO dataset measures “Food Supply” (food available for consumption) rather than actual caloric intake, which may be lower due to household waste.

Using country-level data masks significant internal inequalities between urban and rural populations or different socioeconomic groups.

The data represents 2019 only, so it does not capture the rapid “dietary westernization” occurring in developing nations since the pandemic.

1.4 Conclusion

  1. Statistical Validity The analysis produced a Hopkins statistic of 1, indicating that the global food supply data is not randomly distributed but structured into cultural “islands”. The VAT (Visual Assessment of Cluster Tendency) plot visually corroborated this with clear dark blocks along the diagonal, signaling well-defined density clusters.

  2. Comparative Dimension Reduction The “Triple Projection” revealed that:

PCA successfully isolated the primary variance related to total caloric volume and economic development.

t-SNE and UMAP provided superior localized groupings, identifying culinary “neighborhoods” where countries share specific dietary signatures (e.g., the reliance on starchy roots in Cluster 3) that linear methods might collapse into a single dimension.

  3. Dietary Archetypes The PAM (Partitioning Around Medoids) algorithm was chosen for its robustness against dietary outliers. The medoid for Cluster 1, the European Union (27), represents the Industrialized/Western archetype characterized by high animal protein and dairy. In contrast, Cluster 4, centered on Net Food Importing Developing Countries, reflects a diet more dependent on starchy roots and cereals. Both medoids are FAO aggregate groupings rather than individual countries, which indicates that the aggregate filter in Section 2 did not remove every regional total.

1.5 AI Usage Statement

I hereby declare that Gemini (AI) was utilized as a technical partner for this project. Specifically, the AI assisted in:

Code Debugging: Resolving character encoding errors (e.g., “Côte d’Ivoire”) and mathematical constraints in PCA related to zero-variance columns.

Methodological Enhancement: Suggesting the use of the Hopkins Statistic for cluster tendency validation and the inclusion of UMAP and PAM clustering as advanced methods.

Visualization Logic: Aiding in the implementation of the cowplot grid for multi-method comparison.

All data interpretations, cluster naming, and final report drafting were reviewed and finalized by the author.

1.6 References & Bibliography

Cluster Analysis: Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., & Hornik, K. (2022). cluster: Cluster Analysis Basics and Extensions. R package version 2.1.4.

Data Visualization & Clustering: Kassambara, A., & Mundt, F. (2020). factoextra: Extract and Visualize the Results of Multivariate Data Analyses. R package version 1.0.7.

Dimension Reduction (UMAP): Konopka, T. (2022). umap: Uniform Manifold Approximation and Projection. R package version 0.2.9.0.

Dimension Reduction (t-SNE): Krijthe, J. H. (2015). Rtsne: T-Distributed Stochastic Neighbor Embedding using Barnes-Hut Implementation. R package version 0.16.

Data Wrangling: Wickham, H., et al. (2019). Welcome to the Tidyverse. Journal of Open Source Software, 4(43), 1686.

Cluster Tendency: Wright, K. (2022). hopkins: Calculation of the Hopkins Statistic of Cluster Tendency. R package version 1.1.

Plot Arrangement: Wilke, C. O. (2020). cowplot: Streamlined Plot Theme and Plot Annotations for ‘ggplot2’. R package version 1.1.1.