Dimension reduction and clustering in analysis of instant noodles consumption

Introduction

In recent years, the global consumption patterns of instant noodles have become a subject of interest, reflecting changes in lifestyles, dietary preferences, and economic factors. Instant noodles, a convenient and quick food option, are not only popular but also hold significance as an economic indicator. This study explores the consumption trends of instant noodles across different countries and regions, shedding light on the broader implications for economics and societal well-being. Additionally, it touches upon the relevance of instant food consumption in the broader context, emphasizing its economic and cultural dimensions.

Relevance in economics

Understanding the patterns and factors influencing instant noodle consumption is crucial for understanding today’s market and social behaviors. Instant noodles are often considered a cost-effective and easily accessible food option, making them a staple for individuals from varying socio-economic backgrounds. Their popularity has grown significantly in recent decades, driven by convenience and affordability (Park, Lee, et al., 2011). Analyzing consumption trends can provide insights into economic disparities and access to food resources, and shifts in instant food consumption can serve as an economic indicator, reflecting changes in consumer behavior and overall economic health. Dietary patterns associated with adverse health effects, such as frequent instant noodle consumption, also have implications for the food industry and policymakers: they may prompt discussions on food regulations, labeling requirements, and public health campaigns to promote healthier eating habits. Among college students in Korea, a positive correlation was identified between regular consumption of instant noodles and elevated levels of plasma triglycerides, diastolic blood pressure, and fasting blood glucose. Individuals who ate instant noodles more frequently were more likely to have multiple cardiometabolic risk factors, and the odds ratio for hypertriglyceridemia was notably higher in those who consumed instant noodles three or more times per week than in those with lower consumption frequencies (Huh, Kim, et al., 2017).

Data Description

The dataset encompasses information about instant noodle consumption across different countries and regions. It includes consumption values in millions of US dollars for the years 2018-2022, the country and region, the ranking of the country based on population in 2022, the country code, country and territory information, the capital, the continent, and the 2022 population. The dataset’s richness allows for a comprehensive analysis of the factors influencing instant noodle consumption, providing a nuanced understanding of the economic and cultural dynamics at play. The dataset can be found at: https://www.kaggle.com/datasets/fortuneuwha/world-instant-noodles-consumption-2022

##   Country.Region X2018 X2019 X2020 X2021 X2022 Rank CCA3 Country.Territory
## 1          China 40250 41450 46360 43990 45070    1  CHN             China
## 2      Indonesia 12540 12520 12640 13270 14260    4  IDN         Indonesia
## 3          India  6060  6730  6730  7560  7580    2  IND             India
## 4          Japan  5780  5630  5970  5850  5980   11  JPN             Japan
## 5    Philippines  3980  3850  4470  4440  4290   13  PHL       Philippines
## 6    South Korea  3820  3900  4130  3790  3950   29  KOR       South Korea
##     Capital Continent X2022.Population
## 1   Beijing      Asia       1425887337
## 2   Jakarta      Asia        275501339
## 3 New Delhi      Asia       1417173173
## 4     Tokyo      Asia        123951692
## 5    Manila      Asia        115559009
## 6     Seoul      Asia         51815810

First, the share of missing values in each column is checked.

missing_val <- sapply(instant_noodles_df, function(x) sum(is.na(x))/nrow(instant_noodles_df))
print(missing_val)

##    Country.Region             X2018             X2019             X2020 
##        0.00000000        0.01886792        0.01886792        0.01886792 
##             X2021             X2022              Rank              CCA3 
##        0.00000000        0.01886792        0.00000000        0.00000000 
## Country.Territory           Capital         Continent  X2022.Population 
##        0.00000000        0.00000000        0.00000000        0.00000000

na_indices <- which(is.na(instant_noodles_df), arr.ind = TRUE)

print(na_indices)

##      row col
## [1,]  43   2
## [2,]  43   3
## [3,]  43   4
## [4,]  53   6

# Verifying which countries have NA values
print(instant_noodles_df$Country.Region[43])

## [1] "Serbia"

print(instant_noodles_df$Country.Region[53])

## [1] "Ukraine"

As Serbia and Ukraine have missing values, these can be replaced with the mean of the available values in the same row. Using the mean of the whole column would not be realistic, because some countries, especially in Asia, differ significantly from the rest.

head(instant_noodles_df)

##   Country.Region X2018 X2019 X2020 X2021 X2022 Rank CCA3 Country.Territory
## 1          China 40250 41450 46360 43990 45070    1  CHN             China
## 2      Indonesia 12540 12520 12640 13270 14260    4  IDN         Indonesia
## 3          India  6060  6730  6730  7560  7580    2  IND             India
## 4          Japan  5780  5630  5970  5850  5980   11  JPN             Japan
## 5    Philippines  3980  3850  4470  4440  4290   13  PHL       Philippines
## 6    South Korea  3820  3900  4130  3790  3950   29  KOR       South Korea
##     Capital Continent X2022.Population
## 1   Beijing      Asia       1425887337
## 2   Jakarta      Asia        275501339
## 3 New Delhi      Asia       1417173173
## 4     Tokyo      Asia        123951692
## 5    Manila      Asia        115559009
## 6     Seoul      Asia         51815810

tail(instant_noodles_df)

##    Country.Region X2018 X2019 X2020 X2021 X2022 Rank CCA3 Country.Territory
## 48        Denmark    20    20    10    10    20  115  DNK           Denmark
## 49        Finland    10    20    20    20    20  118  FIN           Finland
## 50    Switzerland    10    10    10    10    20  101  CHE       Switzerland
## 51      Argentina    10    10     0    20    10   33  ARG         Argentina
## 52     Costa Rica    10    10    10    20    10  124  CRI        Costa Rica
## 53        Ukraine   320   340   320   350    NA   38  UKR           Ukraine
##         Capital     Continent X2022.Population
## 48   Copenhagen        Europe          5882261
## 49     Helsinki        Europe          5540745
## 50         Bern        Europe          8740472
## 51 Buenos Aires South America         45510318
## 52     San José North America          5180829
## 53         Kiev        Europe         39701739

As the first step, let’s verify the information for Serbia.

target_row_S <- instant_noodles_df$Country.Region == "Serbia"
target_columns_S <- c("X2018","X2019","X2020", "X2021","X2022")
print(instant_noodles_df[target_row_S, target_columns_S])

##    X2018 X2019 X2020 X2021 X2022
## 43    NA    NA    NA    50    50

As all of the available values are equal to 50, the years with no information are filled with this value.

instant_noodles_df <- instant_noodles_df %>%
  mutate(
    X2018 = ifelse(row_number() == 43, coalesce(X2018, 50), X2018),
    X2019 = ifelse(row_number() == 43, coalesce(X2019, 50), X2019),
    X2020 = ifelse(row_number() == 43, coalesce(X2020, 50), X2020)
  )

Now, let’s verify information for Ukraine.

target_row_U <- instant_noodles_df$Country.Region == "Ukraine"
target_columns_U <- c("X2018","X2019","X2020", "X2021","X2022")
print(instant_noodles_df[target_row_U, target_columns_U])

##    X2018 X2019 X2020 X2021 X2022
## 53   320   340   320   350    NA

As only one year is missing, it is replaced with the mean of the other four observations.

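# mean() expects a single vector; mean(320, 340, 320, 350) would treat the
# extra numbers as other arguments and return 320
mean_U <- mean(c(320, 340, 320, 350))
instant_noodles_df$X2022[53] <- mean_U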
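
The two manual replacements for Serbia and Ukraine follow the same rule: fill a missing year with the mean of the years observed for that country. A minimal sketch of that rule as a reusable function is shown here. The helper `impute_row_mean` and the toy data frame `toy_df` are hypothetical names introduced only for illustration, and the values are made up to resemble the two cases in the text.

```r
# Toy data frame mirroring the layout of the consumption columns
# (the values and the name `toy_df` are illustrative only).
toy_df <- data.frame(
  Country.Region = c("A", "B", "C"),
  X2018 = c(NA, 320, 10),
  X2019 = c(NA, 340, 10),
  X2020 = c(NA, 320, 0),
  X2021 = c(50, 350, 20),
  X2022 = c(50, NA, 10)
)

year_cols <- c("X2018", "X2019", "X2020", "X2021", "X2022")

# Hypothetical helper: replace each NA with the mean of the observed
# values in the same row (i.e. for the same country).
impute_row_mean <- function(df, cols) {
  values <- as.matrix(df[, cols])
  row_means <- rowMeans(values, na.rm = TRUE)
  na_pos <- which(is.na(values), arr.ind = TRUE)
  values[na_pos] <- row_means[na_pos[, "row"]]
  df[, cols] <- values
  df
}

toy_imputed <- impute_row_mean(toy_df, year_cols)
print(toy_imputed)
```

Working row by row keeps a small market from being filled with a value dominated by the largest consumers, which is the concern raised about column means.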

With the missing values filled in, the next step is to select the numerical variables relevant for this study.

instant_noodles <- select(instant_noodles_df, -c(Rank, CCA3, Country.Territory, Capital))

instant_noodles_numeric <- instant_noodles[, c(2:6)]
str(instant_noodles_numeric)

## 'data.frame':    53 obs. of  5 variables:
##  $ X2018: num  40250 12540 6060 5780 3980 ...
##  $ X2019: num  41450 12520 6730 5630 3850 ...
##  $ X2020: num  46360 12640 6730 5970 4470 ...
##  $ X2021: int  43990 13270 7560 5850 4440 3790 3630 2850 2620 2100 ...
##  $ X2022: num  45070 14260 7580 5980 4290 ...

instant_noodles_numeric$X2021 <- as.numeric(instant_noodles_numeric$X2021)

All columns are now numeric, as required by the methods used in the rest of the study.

Dimension reduction

PCA (Principal Component Analysis)

Principal Component Analysis is a dimensionality reduction technique that transforms the original variables into a new set of uncorrelated variables, known as principal components. This method allows for simplifying the dataset while retaining the most important information, making it suitable for identifying patterns and trends in instant noodle consumption across different dimensions.

Here, PCA is applied to the yearly consumption values (2018-2022) to examine how consumption varies across years.

pca <- prcomp(instant_noodles_numeric)
pca$rotation

##              PC1         PC2        PC3        PC4        PC5
## X2018 -0.4153831  0.12463223 -0.7419234  0.3072575  0.4087373
## X2019 -0.4268055  0.05931286 -0.2983563 -0.7232140 -0.4497379
## X2020 -0.4728055 -0.83377411  0.1539094  0.2143048 -0.1079864
## X2021 -0.4529543  0.18472313  0.5025249 -0.3207473  0.6366315
## X2022 -0.4653830  0.50164419  0.2903684  0.4834760 -0.4622867

summary(pca)

## Importance of components:
##                              PC1       PC2       PC3      PC4      PC5
## Standard deviation     1.396e+04 225.35105 149.30117 63.19075 46.54961
## Proportion of Variance 9.996e-01   0.00026   0.00011  0.00002  0.00001
## Cumulative Proportion  9.996e-01   0.99985   0.99997  0.99999  1.00000

Standard Deviation: The spread of the data along each principal component. A higher standard deviation means that the data points are more dispersed along that component.
Proportion of Variance: The share of the total variance in the data explained by each principal component. A high proportion for one component means that it captures most of the information in the data and that the subsequent components contribute little.
Cumulative Proportion: The running sum of the proportions of variance, i.e. the share of variance explained by all components up to and including the current one.
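
The three rows of the summary are linked by simple arithmetic, which can be sketched as follows. The example uses simulated toy data rather than the noodle dataset, and the variable names (`signal`, `toy`, `toy_pca`) are assumptions made for illustration only.

```r
set.seed(1)
# Toy data: two strongly correlated columns and one independent column
# (illustrative only, not the noodle data)
signal <- rnorm(50)
toy <- data.frame(
  a = signal + rnorm(50, sd = 0.1),
  b = 2 * signal + rnorm(50, sd = 0.1),
  c = rnorm(50)
)

toy_pca <- prcomp(toy)

# The variance along each component is the squared standard deviation
variance <- toy_pca$sdev^2
prop_var <- variance / sum(variance)  # "Proportion of Variance"
cum_prop <- cumsum(prop_var)          # "Cumulative Proportion"

print(round(rbind(sdev = toy_pca$sdev, prop_var, cum_prop), 4))
```

The printed rows should match what `summary(toy_pca)` reports, up to rounding.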

fviz_pca_var(pca, col.var = "purple")

PC1 dominates the variance (about 99.96%) and represents the overall level of consumption: all five years load on it with the same sign and similar magnitude (between -0.42 and -0.47), so a country's PC1 score mainly reflects how much it consumes across the whole period. The subsequent principal components capture much smaller, year-specific deviations; PC2, for instance, is driven mostly by 2020 (loading -0.83). The cumulative proportion shows the share of variance explained by all components up to that point.

install.packages("pdp")

## Warning: package 'pdp' is in use and will not be installed

library(pdp)
pca_var <- get_pca_var(pca)

fviz_contrib(pca, "var", axes = 1:5, fill = "lightblue", color = "tomato")

Based on the PCA results and the contributions of each year (2018 to 2022) to the five principal components, 2020 stands out as the most influential year, contributing over 20%. The year 2022 follows closely, slightly below 2020 but still above the 20% threshold, and 2021 is just above 20%, indicating a moderate influence. The years 2019 and 2018 contribute less than 20%, with 2019 slightly ahead of 2018. The 20% level corresponds to the contribution each of the five variables would make if all contributed equally. The lower contributions of the earlier years suggest that consumption patterns may have changed over time, although the differences between years are small.
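
The contribution of a variable to a single component can be reproduced by hand from the loadings. The sketch below does this for PC1 only, using the loadings printed in the rotation matrix. Because PC1 carries almost all of the variance, these values should be close to the contributions over all five axes shown in the plot, but that closeness is an approximation and not the exact computation performed by `fviz_contrib`.

```r
# PC1 loadings copied from the rotation matrix printed earlier
pc1_loadings <- c(
  X2018 = -0.4153831,
  X2019 = -0.4268055,
  X2020 = -0.4728055,
  X2021 = -0.4529543,
  X2022 = -0.4653830
)

# Contribution (%) of each variable to one component:
# squared loading divided by the sum of squared loadings
contrib_pc1 <- 100 * pc1_loadings^2 / sum(pc1_loadings^2)

# Reference level if all five variables contributed equally
expected_uniform <- 100 / length(pc1_loadings)

print(round(sort(contrib_pc1, decreasing = TRUE), 2))
print(expected_uniform)
```

The ordering produced here (2020, 2022, 2021, 2019, 2018) agrees with the ranking described in the text.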

Clustering

CLARA (Clustering Large Applications)

CLARA is a clustering algorithm designed to handle large datasets efficiently. It draws representative samples from the dataset, performs k-medoids clustering on each sample, and then assigns the rest of the data to the clusters found in the best sample. This approach makes it suitable for large-scale applications where traditional clustering methods might be computationally expensive.
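
The sampling idea can be illustrated on a small simulated example. This is a minimal sketch on toy numeric data, not the noodle dataset; the object names and the chosen `samples` and `sampsize` values are assumptions picked for illustration.

```r
library(cluster)

set.seed(42)
# Toy numeric data: two well-separated groups of 100 points each
# (illustrative only)
toy <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 8), ncol = 2)
)

# samples  = number of sub-samples drawn from the data
# sampsize = number of observations in each sub-sample
# k-medoids runs on each sub-sample; all points are then assigned to the
# nearest medoid and the best-performing sub-sample is kept.
toy_clara <- clara(toy, k = 2, samples = 10, sampsize = 50)

print(toy_clara$medoids)            # one representative observation per cluster
print(table(toy_clara$clustering))  # cluster sizes over the full data
```

Note that `clara()` works on numeric data, so for a data frame with text columns it is usually safer to pass only the numeric columns of interest.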

library(cluster)
library(factoextra)


for (i in 2:10) {
  cl <- clara(instant_noodles_df, i)
  print(paste('Average silhouette for',i,'clusters:',cl$silinfo$avg.width)) 
}

## [1] "Average silhouette for 2 clusters: 0.934338067785307"
## [1] "Average silhouette for 3 clusters: 0.64989627607545"
## [1] "Average silhouette for 4 clusters: 0.571684061013459"
## [1] "Average silhouette for 5 clusters: 0.612482760382871"
## [1] "Average silhouette for 6 clusters: 0.56930648529619"
## [1] "Average silhouette for 7 clusters: 0.57922499591618"
## [1] "Average silhouette for 8 clusters: 0.623826391890962"
## [1] "Average silhouette for 9 clusters: 0.653681379553354"
## [1] "Average silhouette for 10 clusters: 0.650406448529437"

The average silhouette width is a measure of how well-separated clusters are in a clustering solution. It ranges from -1 to 1, where a higher value indicates better-defined clusters.
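
For a single observation, the silhouette width compares a, the mean distance to the other members of its own cluster, with b, the mean distance to the members of the nearest other cluster: s = (b - a) / max(a, b). A minimal hand computation on toy data is sketched below. The helper `silhouette_width` is a hypothetical name, and the sketch does not handle clusters with a single member.

```r
# Toy one-dimensional data with an obvious two-group split (illustrative only)
x <- c(1, 2, 3, 10, 11, 12)
cluster_id <- c(1, 1, 1, 2, 2, 2)
d <- as.matrix(dist(x))

# Hypothetical helper: silhouette width of observation i,
# s(i) = (b - a) / max(a, b)
silhouette_width <- function(i, d, cluster_id) {
  own <- cluster_id == cluster_id[i]
  own[i] <- FALSE
  a <- mean(d[i, own])  # mean distance to the rest of the own cluster
  others <- setdiff(unique(cluster_id), cluster_id[i])
  # mean distance to each other cluster; keep the nearest one
  b <- min(sapply(others, function(cl) mean(d[i, cluster_id == cl])))
  (b - a) / max(a, b)
}

widths <- sapply(seq_along(x), silhouette_width, d = d, cluster_id = cluster_id)
print(round(widths, 3))
print(mean(widths))  # average silhouette width of the clustering
```

For the first point, a = 1.5 and b = 10, which gives a width of 0.85; values this close to 1 indicate that the point sits well inside its own cluster.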

In this case, the 2-cluster solution has a very high average silhouette width of 0.93, which indicates well-defined and well-separated clusters. With 3 clusters the width drops to 0.65, which still points to reasonable separation, though less distinct than with 2 clusters. For 4 to 7 clusters the width falls further and fluctuates between 0.57 and 0.61, indicating some overlap between clusters. For 8 to 10 clusters it rises again to between 0.62 and 0.65, but it never comes close to the 2-cluster value, so the additional clusters do not improve separation. On this criterion, 2 clusters is the preferred choice.

Silhouette method

The Silhouette Method is a technique for choosing the optimal number of clusters in a dataset. The idea is to calculate the average silhouette width for different numbers of clusters and choose the number that maximizes this score. It can be applied to various clustering algorithms, such as K-Means.

set.seed(12345)
silhouette_values <- numeric(10)
distance_matrix <- dist(instant_noodles_numeric)

for (k in 2:10) {
  kmeans_model <- kmeans(instant_noodles_numeric, centers = k)
  # silhouette() returns one row per observation; average the sil_width column
  sil <- silhouette(kmeans_model$cluster, distance_matrix)
  silhouette_values[k] <- mean(sil[, "sil_width"])
}

plot(2:10, silhouette_values[2:10], type = "b", main = "Silhouette Method",
     xlab = "Number of Clusters (k)", ylab = "Average Silhouette Score")

On the plot, the number of clusters with the highest average silhouette width is taken as optimal. In the context of instant noodle consumption, the Silhouette Method suggests that the dataset could naturally be grouped into several clusters, each representing a distinct consumption pattern. This segmentation could be driven by factors such as geographical location, economic conditions, or cultural preferences, some of which (continent and population) are present in the dataset.

Conclusions

The study examined global consumption patterns of instant noodles and their economic and societal implications. The relevance of instant noodle consumption in economics was highlighted, emphasizing its role as an economic indicator and its impact on various aspects of well-being. Principal Component Analysis (PCA) was applied to identify trends in instant noodle consumption over the years 2018-2022. The results showed that 2020, followed by 2022 and 2021, contributed most to explaining the variance in consumption patterns. CLARA and the Silhouette method were applied to select the optimal number of clusters. Clustering analysis applied to instant noodle consumption data can offer valuable insights into distinct patterns, trends, and consumer behaviors.
The overall findings emphasize the importance of recent years in shaping consumption trends, suggesting potential shifts in dietary preferences and economic factors. The observed trends in instant noodle consumption could be indicative of broader lifestyle changes, reflecting a preference for quick, convenient, and appetizing food options. The claim that these mass-produced foods help consumers make use of limited time aligns with the fast-paced, restless lifestyle that many individuals lead. The convenience and time-saving aspects of instant noodles make them an attractive choice, especially for students and those with busy schedules (Tran, Nguyen, 2015). This opens avenues for future research exploring the connections between food consumption, economic indicators, and societal well-being.

Bibliography

Huh I.S., Kim H., Jo H.K., Lim C.S., Kim J.S., Kim S.J., Kwon O., Oh B., Chang N. (2017). Instant noodle consumption is associated with cardiometabolic risk factors among college students in Seoul, Nutrition Research and Practice.
Park J., Lee J.S., Jang Y.A., Chung H.R., Kim J. (2011). A comparison of food and nutrient intake between instant noodle consumers and non-instant noodle consumers in Korean adults, Nutrition Research and Practice.
Tran H., Nguyen T. (2015). The effect of instant foods, Academia.