INTRODUCTION

Understanding cross-country differences in health and development is challenging because each country is described by many indicators at once: life expectancy, mortality, income, schooling, and more. Unsupervised learning methods, such as clustering and dimension reduction, help summarize these patterns.

In this project, I used the WHO Life Expectancy & Socioeconomic data set from Kaggle and focused on Asian countries in 2014.

My goals are to:

  • Group Asian countries into clusters with similar health and socioeconomic profiles.

  • Compare several clustering and visualization techniques: k-means, PAM, hierarchical clustering, MDS, t-SNE, and UMAP.

DATA LOADING AND PREPARATION

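# Packages used throughout the analysis
library(dplyr)       # %>%, filter(), group_by(), summarise()
library(ggplot2)     # plots of the 2D embeddings
library(cluster)     # pam()
library(factoextra)  # fviz_cluster()
library(Rtsne)       # Rtsne()
library(umap)        # umap()

life <- read.csv("C:/Users/mukun/Downloads/RMarkdown_Full_Project/Life Expectancy Data.csv", stringsAsFactors = FALSE)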

str(life)
## 'data.frame':    2938 obs. of  22 variables:
##  $ Country                        : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Year                           : int  2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
##  $ Status                         : chr  "Developing" "Developing" "Developing" "Developing" ...
##  $ Life.expectancy                : num  65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
##  $ Adult.Mortality                : int  263 271 268 272 275 279 281 287 295 295 ...
##  $ infant.deaths                  : int  62 64 66 69 71 74 77 80 82 84 ...
##  $ Alcohol                        : num  0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
##  $ percentage.expenditure         : num  71.3 73.5 73.2 78.2 7.1 ...
##  $ Hepatitis.B                    : int  65 62 64 67 68 66 63 64 63 64 ...
##  $ Measles                        : int  1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
##  $ BMI                            : num  19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
##  $ under.five.deaths              : int  83 86 89 93 97 102 106 110 113 116 ...
##  $ Polio                          : int  6 58 62 67 68 66 63 64 63 58 ...
##  $ Total.expenditure              : num  8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
##  $ Diphtheria                     : int  65 62 64 67 68 66 63 64 63 58 ...
##  $ HIV.AIDS                       : num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##  $ GDP                            : num  584.3 612.7 631.7 670 63.5 ...
##  $ Population                     : num  33736494 327582 31731688 3696958 2978599 ...
##  $ thinness..1.19.years           : num  17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
##  $ thinness.5.9.years             : num  17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
##  $ Income.composition.of.resources: num  0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
##  $ Schooling                      : num  10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...
head(life)
##       Country Year     Status Life.expectancy Adult.Mortality infant.deaths
## 1 Afghanistan 2015 Developing            65.0             263            62
## 2 Afghanistan 2014 Developing            59.9             271            64
## 3 Afghanistan 2013 Developing            59.9             268            66
## 4 Afghanistan 2012 Developing            59.5             272            69
## 5 Afghanistan 2011 Developing            59.2             275            71
## 6 Afghanistan 2010 Developing            58.8             279            74
##   Alcohol percentage.expenditure Hepatitis.B Measles  BMI under.five.deaths
## 1    0.01              71.279624          65    1154 19.1                83
## 2    0.01              73.523582          62     492 18.6                86
## 3    0.01              73.219243          64     430 18.1                89
## 4    0.01              78.184215          67    2787 17.6                93
## 5    0.01               7.097109          68    3013 17.2                97
## 6    0.01              79.679367          66    1989 16.7               102
##   Polio Total.expenditure Diphtheria HIV.AIDS       GDP Population
## 1     6              8.16         65      0.1 584.25921   33736494
## 2    58              8.18         62      0.1 612.69651     327582
## 3    62              8.13         64      0.1 631.74498   31731688
## 4    67              8.52         67      0.1 669.95900    3696958
## 5    68              7.87         68      0.1  63.53723    2978599
## 6    66              9.20         66      0.1 553.32894    2883167
##   thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## 1                 17.2               17.3                           0.479
## 2                 17.5               17.5                           0.476
## 3                 17.7               17.7                           0.470
## 4                 17.9               18.0                           0.463
## 5                 18.2               18.2                           0.454
## 6                 18.4               18.4                           0.448
##   Schooling
## 1      10.1
## 2      10.0
## 3       9.9
## 4       9.8
## 5       9.5
## 6       9.2

I focus on Asian countries in 2014 because, in this data set, that year has the most complete information for the region.
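One way to check completeness by year is to compute the share of non-missing values per indicator and year. The sketch below is only an illustration: it uses a small made-up data frame (`toy`) in place of `life`, and the values are not from the WHO data.

```r
# Toy stand-in for the `life` data frame (values are made up for illustration)
toy <- data.frame(
  Country   = rep(c("A", "B", "C"), times = 2),
  Year      = rep(c(2014, 2015), each = 3),
  Alcohol   = c(0.5, 1.2, 3.4, NA, NA, 2.9),
  Schooling = c(10.1, 12.3, 9.8, 10.2, NA, 9.9)
)

# Share of non-missing values per year, one column per indicator
indicator_cols <- c("Alcohol", "Schooling")
completeness <- sapply(indicator_cols, function(v) {
  tapply(!is.na(toy[[v]]), toy$Year, mean)
})

completeness
```

Rows of `completeness` are years and columns are indicators, so a year with values close to 1 across the board is a good candidate for a cross-sectional analysis.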

asia_countries <- c(
  "Afghanistan", "Armenia", "Azerbaijan", "Bahrain", "Bangladesh", "Bhutan", 
  "Brunei Darussalam", "Cambodia", "China", "Cyprus", "Georgia", "India", 
  "Indonesia", "Iran (Islamic Republic of)", "Iraq", "Israel", "Japan", 
  "Jordan", "Kazakhstan", "Kuwait", "Kyrgyzstan", "Lao People's Democratic Republic", 
  "Lebanon", "Malaysia", "Maldives", "Mongolia", "Myanmar", "Nepal", "Oman", 
  "Pakistan", "Philippines", "Qatar", "Republic of Korea", "Saudi Arabia", 
  "Singapore", "Sri Lanka", "Syrian Arab Republic", "Tajikistan", "Thailand", 
  "Timor-Leste", "Turkey", "Turkmenistan", "United Arab Emirates", "Uzbekistan", 
  "Viet Nam", "Yemen"
)

life_asia_2014 <- life %>%
  filter(Country %in% asia_countries & Year == 2014)

# Confirming number of countries
n_distinct(life_asia_2014$Country)
## [1] 46

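Next, I selected a set of health and socioeconomic variables. For Asia 2014, variables like Alcohol and Schooling are nearly complete. Any country with a missing value in the selected variables is dropped by na.omit() below, which leaves 45 of the 46 countries (Republic of Korea does not appear in the final cluster table).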

# Using the variables identified as most complete for Asia 2014
target_cols <- c(
  "Life.expectancy",
  "Adult.Mortality",
  "Alcohol",
  "BMI",
  "HIV.AIDS",
  "Income.composition.of.resources",
  "Schooling"
)

data_clust_raw <- life_asia_2014[, target_cols]
data_clean <- na.omit(data_clust_raw)

# Syncing the labels with the cleaned data rows
life_clean <- life_asia_2014[rownames(data_clean), ]

# Scaling for clustering
data_clust <- scale(data_clean)
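To make the three preparation steps concrete (dropping incomplete rows, keeping the country labels in sync, and scaling), here is a minimal sketch on a made-up data frame. The names `toy`, `num_clean`, and `labels_clean` are illustrative and not part of the analysis.

```r
# Toy data with one missing value (row 3); values are made up
toy <- data.frame(
  Country  = c("A", "B", "C", "D"),
  life_exp = c(60, 70, NA, 80),
  school   = c(8, 12, 10, 15)
)

# Keep only numeric indicators and drop incomplete rows
num       <- toy[, c("life_exp", "school")]
num_clean <- na.omit(num)

# na.omit() keeps the original row names, so they can be used
# to pull the matching labels from the full data frame
labels_clean <- toy[rownames(num_clean), "Country"]

# Standardize: each column ends up with mean 0 and standard deviation 1
num_scaled <- scale(num_clean)

labels_clean
colMeans(num_scaled)
apply(num_scaled, 2, sd)
```

Because every column has unit variance after scaling, each indicator contributes on the same scale to the Euclidean distances used by the clustering methods.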

MY METHODS

I used a few different ways to look at the data:

K-means and PAM: These are my main tools for splitting the countries into groups. I included PAM because it uses actual countries (medoids) as cluster centers, which usually makes it more robust to outliers than k-means.

Hierarchical Clustering: I used this to create a family tree of countries so I could see which ones are the most similar to each other.

Dimension Reduction (MDS, t-SNE, UMAP): Since I can’t visualize 7 dimensions at once, I used these to squish the data down into a 2D map.
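The point about PAM and outliers can be illustrated with a one-dimensional toy example: a mean (what k-means uses as a center) is pulled toward an extreme value, while a medoid (what PAM uses) must be one of the observations. The numbers below are made up.

```r
# Four typical values and one extreme outlier
x <- c(1, 2, 3, 4, 100)

# k-means style center: the arithmetic mean
centroid <- mean(x)

# PAM style center: the observation with the smallest
# total distance to all other observations
d      <- as.matrix(dist(x))
medoid <- x[which.min(rowSums(d))]

centroid
medoid
```

The mean lands far from the bulk of the data, whereas the medoid stays among the typical values.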

RESULTS

After checking the elbow plot, I decided that 3 clusters gave the best balance between being simple and actually describing the data well.

k_opt <- 3
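The elbow plot itself is not shown in this report. As a rough sketch of how such a plot can be produced with base R, the example below computes the total within-cluster sum of squares for several values of k on synthetic data (three made-up blobs, not the WHO indicators).

```r
# Synthetic stand-in for the scaled indicator matrix (not the WHO data)
set.seed(123)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 8, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is where adding clusters stops reducing WSS by much
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

With factoextra loaded, a similar plot for the real data could likely be obtained with fviz_nbclust(data_clust, kmeans, method = "wss").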

K-means Clustering

if(nrow(data_clean) >= 3) {
  set.seed(123)
  km_res <- kmeans(data_clust, centers = k_opt, nstart = 25)
  
  # Syncing cluster results to the clean data
  life_clean$kmeans_cluster <- factor(km_res$cluster)
  
  fviz_cluster(km_res, data = data_clust,
               geom = "point",
               main = "K-means clustering of Asian Countries (2014)")
}

PAM Clustering

pam_res <- pam(data_clust, k = k_opt)
life_clean$pam_cluster <- factor(pam_res$clustering)

fviz_cluster(pam_res, data = data_clust, geom = "point", 
             main = "PAM clustering of Asian countries (2014)")

Hierarchical Clustering

dist_mat <- dist(data_clust)
hc_ward <- hclust(dist_mat, method = "ward.D2")

# Using cex = 0.6 to keep the country labels readable
plot(hc_ward, labels = life_clean$Country, cex = 0.6, 
     main = "Dendrogram of Asian Countries (2014)")
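To compare the dendrogram with the k-means and PAM groupings, the tree can be cut into a fixed number of groups with cutree(). The sketch below uses synthetic data with hypothetical row names, so it only demonstrates the mechanics.

```r
# Three made-up groups of 10 "countries" each
set.seed(123)
toy <- rbind(
  matrix(rnorm(20, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(20, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.5), ncol = 2)
)
rownames(toy) <- paste0("Country_", seq_len(nrow(toy)))

# Same linkage as in the analysis above
hc_toy <- hclust(dist(toy), method = "ward.D2")

# Cut the tree into 3 groups; the result is a named integer vector
hc_groups <- cutree(hc_toy, k = 3)

table(hc_groups)
```

For the real data, cutree(hc_ward, k = k_opt) would give a membership vector in the same row order as life_clean.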

MDS Visualisation

mds_res <- cmdscale(dist_mat, k = 2)
mds_df <- data.frame(
  Dim1 = mds_res[,1],
  Dim2 = mds_res[,2],
  Country = life_clean$Country,
  cluster = life_clean$kmeans_cluster
)

ggplot(mds_df, aes(Dim1, Dim2, label = Country, colour = cluster)) +
  geom_point() +
  geom_text(vjust = -0.5, size = 2.5, check_overlap = TRUE, show.legend = FALSE) +
  theme_minimal() +
  ggtitle("MDS Map of Asian Countries (2014)")
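A 2D MDS map always loses some information, and cmdscale() can report how much of the distance structure the two dimensions retain. The sketch below shows this on random synthetic data with the same shape as the analysis (45 rows, 7 columns); the resulting numbers say nothing about the WHO data.

```r
# Random stand-in with the same shape as the scaled indicator matrix
set.seed(123)
toy <- scale(matrix(rnorm(45 * 7), nrow = 45, ncol = 7))

# eig = TRUE returns a list with the coordinates, the eigenvalues,
# and two goodness-of-fit measures
mds_toy <- cmdscale(dist(toy), k = 2, eig = TRUE)

dim(mds_toy$points)  # coordinates of each row in 2D
mds_toy$GOF          # values near 1 mean little information is lost
```

Applied to dist_mat, a GOF value well below 1 would suggest reading the MDS map with some caution.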

t-SNE

set.seed(123)
# Perplexity adjusted for the smaller Asian dataset size
tsne_res <- Rtsne(data_clust, dims = 2, perplexity = 10, max_iter = 1000)

tsne_df <- data.frame(
  Dim1 = tsne_res$Y[,1],
  Dim2 = tsne_res$Y[,2],
  Country = life_clean$Country,
  cluster = life_clean$kmeans_cluster
)

ggplot(tsne_df, aes(Dim1, Dim2, colour = cluster, label = Country)) +
  geom_point(size = 3) +
  geom_text(vjust = -0.7, size = 2.5, check_overlap = TRUE, show.legend = FALSE) +
  theme_minimal() +
  ggtitle("t-SNE Embedding: Asian Health Clusters (2014)")
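The perplexity has to be small for a data set of this size because Rtsne() stops with an error when 3 * perplexity exceeds the number of rows minus one. The helper below, max_perplexity(), is a hypothetical convenience function (not part of Rtsne) that computes this upper bound.

```r
# Hypothetical helper: largest perplexity that passes Rtsne's size check,
# which requires 3 * perplexity <= (number of rows - 1)
max_perplexity <- function(n_rows) {
  floor((n_rows - 1) / 3)
}

# 45 countries remain after dropping incomplete rows
n_countries <- 45
upper_bound <- max_perplexity(n_countries)

upper_bound
10 <= upper_bound  # the perplexity used above is within the limit
```

Since t-SNE maps can change noticeably with perplexity, it may be worth rerunning the embedding with a few values below this bound to check that the picture is stable.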

UMAP

set.seed(123)
umap_res <- umap(data_clust)

umap_df <- data.frame(
  Dim1 = umap_res$layout[,1],
  Dim2 = umap_res$layout[,2],
  Country = life_clean$Country,
  cluster = life_clean$kmeans_cluster
)
ggplot(umap_df, aes(Dim1, Dim2, colour = cluster)) +
  geom_point(size = 3) +
  theme_minimal() +
  ggtitle("UMAP Embedding of Asian countries (2014)")

Cluster Profiling

# --- Cluster Profiling ---

cluster_profile <- life_clean %>%
  group_by(kmeans_cluster) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

cluster_profile
## # A tibble: 3 × 21
##   kmeans_cluster  Year Life.expectancy Adult.Mortality infant.deaths Alcohol
##   <fct>          <dbl>           <dbl>           <dbl>         <dbl>   <dbl>
## 1 1               2014            73.6           110           33.9    4.94 
## 2 2               2014            67.3           185.         105      0.365
## 3 3               2014            76.9            68.8          3.25   0.668
## # ℹ 15 more variables: percentage.expenditure <dbl>, Hepatitis.B <dbl>,
## #   Measles <dbl>, BMI <dbl>, under.five.deaths <dbl>, Polio <dbl>,
## #   Total.expenditure <dbl>, Diphtheria <dbl>, HIV.AIDS <dbl>, GDP <dbl>,
## #   Population <dbl>, thinness..1.19.years <dbl>, thinness.5.9.years <dbl>,
## #   Income.composition.of.resources <dbl>, Schooling <dbl>

COUNTRY CLUSTER TABLE

# I'm creating a table to show exactly which cluster each country belongs to.
# I also sorted it by cluster so it's easier to read.

country_assignment <- life_clean %>%
  select(Country, kmeans_cluster) %>%
  arrange(kmeans_cluster, Country)

# Using kable to make the table readable
knitr::kable(country_assignment, 
             col.names = c("Country Name", "Assigned Cluster"),
             caption = "Table: Final Cluster Membership for Asian Countries (2014)")
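Table: Final Cluster Membership for Asian Countries (2014)

| Country Name                     | Assigned Cluster |
|:---------------------------------|:-----------------|
| Armenia                          | 1                |
| China                            | 1                |
| Georgia                          | 1                |
| Kazakhstan                       | 1                |
| Philippines                      | 1                |
| Sri Lanka                        | 1                |
| Thailand                         | 1                |
| Viet Nam                         | 1                |
| Afghanistan                      | 2                |
| Bangladesh                       | 2                |
| Bhutan                           | 2                |
| Cambodia                         | 2                |
| India                            | 2                |
| Indonesia                        | 2                |
| Iraq                             | 2                |
| Lao People’s Democratic Republic | 2                |
| Myanmar                          | 2                |
| Nepal                            | 2                |
| Pakistan                         | 2                |
| Syrian Arab Republic             | 2                |
| Tajikistan                       | 2                |
| Timor-Leste                      | 2                |
| Turkmenistan                     | 2                |
| Uzbekistan                       | 2                |
| Yemen                            | 2                |
| Azerbaijan                       | 3                |
| Bahrain                          | 3                |
| Brunei Darussalam                | 3                |
| Cyprus                           | 3                |
| Iran (Islamic Republic of)       | 3                |
| Israel                           | 3                |
| Japan                            | 3                |
| Jordan                           | 3                |
| Kuwait                           | 3                |
| Kyrgyzstan                       | 3                |
| Lebanon                          | 3                |
| Malaysia                         | 3                |
| Maldives                         | 3                |
| Mongolia                         | 3                |
| Oman                             | 3                |
| Qatar                            | 3                |
| Saudi Arabia                     | 3                |
| Singapore                        | 3                |
| Turkey                           | 3                |
| United Arab Emirates             | 3                |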

Discussion

Across all methods, three distinct tiers of Asian countries emerge:

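High-performing tier (cluster 3): Countries like Japan, Singapore, and Israel. This cluster has the highest mean life expectancy (76.9 years), the lowest adult mortality (68.8 per 1,000), and the most years of schooling.

Developing tier (cluster 1): Countries like China, Thailand, and Viet Nam. These sit in the middle of the pack, with a mean life expectancy of 73.6 years and adult mortality of 110 per 1,000.

Challenged tier (cluster 2): Countries like Afghanistan and Yemen. This cluster has much higher adult mortality (about 185 per 1,000) and a lower mean life expectancy (67.3 years).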

K-means and PAM provided highly consistent groupings. Non-linear methods like t-SNE and UMAP were particularly effective at separating the most advanced economies from the rest of the region.
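The agreement between k-means and PAM can be quantified with a simple cross-tabulation of the two membership vectors. The sketch below demonstrates the idea on synthetic data; `matched_share` is an illustrative summary I am assuming here, not a result from the analysis.

```r
library(cluster)  # pam()

# Three made-up, well-separated groups of 15 points each
set.seed(123)
toy <- rbind(
  matrix(rnorm(30, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(30, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(30, mean = 10, sd = 0.5), ncol = 2)
)

km_toy  <- kmeans(toy, centers = 3, nstart = 25)
pam_toy <- pam(toy, k = 3)

# Rows: k-means clusters, columns: PAM clusters
agreement <- table(kmeans = km_toy$cluster, pam = pam_toy$clustering)
agreement

# Cluster numbers are arbitrary, so compare via the table, not raw labels:
# share of observations that fall in the best-matching PAM cluster
matched_share <- sum(apply(agreement, 1, max)) / sum(agreement)
matched_share
```

For the report itself, the equivalent check would be table(life_clean$kmeans_cluster, life_clean$pam_cluster), where a table with one dominant cell per row indicates consistent groupings.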

Limitations

Data availability for 2015 was too low for several variables, necessitating a focus on 2014.

Countries like Syria may have skewed data due to conflict during this period.

Standardizing variables gives them equal weight, which may not always reflect their actual impact on public health.

AI Usage Statement

I used AI to assist in structuring the R Markdown code and identifying data completeness patterns for Asian countries. All final interpretation and analysis decisions were made by me.