INTRODUCTION
Understanding cross-country differences in health and development is challenging because each country is described by many indicators at once: life expectancy, mortality, income, schooling, and more. Unsupervised learning methods, such as clustering and dimension reduction, help summarize these patterns.
In this project, I used the WHO Life Expectancy & Socioeconomic data set from Kaggle and focused on Asian countries in 2014.
My goals are to:
Group Asian countries into clusters with similar health and socioeconomic profiles.
Compare several clustering and visualization techniques: k-means, PAM, hierarchical clustering, MDS, t-SNE, and UMAP.
DATA LOADING AND PREPARATION
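# Packages used throughout the report
library(dplyr)       # %>%, filter(), group_by(), summarise()
library(ggplot2)     # plots of the 2D embeddings
library(cluster)     # pam()
library(factoextra)  # fviz_cluster()
library(Rtsne)       # Rtsne()
library(umap)        # umap()
life <- read.csv("C:/Users/mukun/Downloads/RMarkdown_Full_Project/Life Expectancy Data.csv", stringsAsFactors = FALSE)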
str(life)
## 'data.frame': 2938 obs. of 22 variables:
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Year : int 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
## $ Status : chr "Developing" "Developing" "Developing" "Developing" ...
## $ Life.expectancy : num 65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
## $ Adult.Mortality : int 263 271 268 272 275 279 281 287 295 295 ...
## $ infant.deaths : int 62 64 66 69 71 74 77 80 82 84 ...
## $ Alcohol : num 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
## $ percentage.expenditure : num 71.3 73.5 73.2 78.2 7.1 ...
## $ Hepatitis.B : int 65 62 64 67 68 66 63 64 63 64 ...
## $ Measles : int 1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
## $ BMI : num 19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
## $ under.five.deaths : int 83 86 89 93 97 102 106 110 113 116 ...
## $ Polio : int 6 58 62 67 68 66 63 64 63 58 ...
## $ Total.expenditure : num 8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
## $ Diphtheria : int 65 62 64 67 68 66 63 64 63 58 ...
## $ HIV.AIDS : num 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
## $ GDP : num 584.3 612.7 631.7 670 63.5 ...
## $ Population : num 33736494 327582 31731688 3696958 2978599 ...
## $ thinness..1.19.years : num 17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
## $ thinness.5.9.years : num 17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
## $ Income.composition.of.resources: num 0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
## $ Schooling : num 10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...
head(life)
## Country Year Status Life.expectancy Adult.Mortality infant.deaths
## 1 Afghanistan 2015 Developing 65.0 263 62
## 2 Afghanistan 2014 Developing 59.9 271 64
## 3 Afghanistan 2013 Developing 59.9 268 66
## 4 Afghanistan 2012 Developing 59.5 272 69
## 5 Afghanistan 2011 Developing 59.2 275 71
## 6 Afghanistan 2010 Developing 58.8 279 74
## Alcohol percentage.expenditure Hepatitis.B Measles BMI under.five.deaths
## 1 0.01 71.279624 65 1154 19.1 83
## 2 0.01 73.523582 62 492 18.6 86
## 3 0.01 73.219243 64 430 18.1 89
## 4 0.01 78.184215 67 2787 17.6 93
## 5 0.01 7.097109 68 3013 17.2 97
## 6 0.01 79.679367 66 1989 16.7 102
## Polio Total.expenditure Diphtheria HIV.AIDS GDP Population
## 1 6 8.16 65 0.1 584.25921 33736494
## 2 58 8.18 62 0.1 612.69651 327582
## 3 62 8.13 64 0.1 631.74498 31731688
## 4 67 8.52 67 0.1 669.95900 3696958
## 5 68 7.87 68 0.1 63.53723 2978599
## 6 66 9.20 66 0.1 553.32894 2883167
## thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## 1 17.2 17.3 0.479
## 2 17.5 17.5 0.476
## 3 17.7 17.7 0.470
## 4 17.9 18.0 0.463
## 5 18.2 18.2 0.454
## 6 18.4 18.4 0.448
## Schooling
## 1 10.1
## 2 10.0
## 3 9.9
## 4 9.8
## 5 9.5
## 6 9.2
I focus on Asian countries in 2014 because, in this data set, that year has the most complete records for the region.
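One way to back up the choice of year is to count missing values per year. The sketch below is only an illustration on made-up toy data: `missing_by_year()` is a hypothetical helper that is not part of the original analysis. In the report it would be applied to the Asian subset of `life` with the clustering variables as `cols`.

```r
# Hypothetical helper: total number of missing cells per year across
# the chosen columns, sorted from most complete to least complete.
missing_by_year <- function(df, cols, year_col = "Year") {
  n_missing <- rowSums(is.na(df[, cols, drop = FALSE]))
  sort(tapply(n_missing, df[[year_col]], sum))
}

# Toy data (values are invented for illustration only)
toy <- data.frame(
  Year      = c(2013, 2013, 2014, 2014, 2015, 2015),
  Alcohol   = c(1.2, NA, 0.5, 0.7, NA, NA),
  Schooling = c(10, 11, 9, 12, NA, 10)
)

res <- missing_by_year(toy, c("Alcohol", "Schooling"))
print(res)  # the first entry is the year with the fewest missing cells
```

The year that sorts first is the natural candidate for a cross-sectional analysis.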
asia_countries <- c(
"Afghanistan", "Armenia", "Azerbaijan", "Bahrain", "Bangladesh", "Bhutan",
"Brunei Darussalam", "Cambodia", "China", "Cyprus", "Georgia", "India",
"Indonesia", "Iran (Islamic Republic of)", "Iraq", "Israel", "Japan",
"Jordan", "Kazakhstan", "Kuwait", "Kyrgyzstan", "Lao People's Democratic Republic",
"Lebanon", "Malaysia", "Maldives", "Mongolia", "Myanmar", "Nepal", "Oman",
"Pakistan", "Philippines", "Qatar", "Republic of Korea", "Saudi Arabia",
"Singapore", "Sri Lanka", "Syrian Arab Republic", "Tajikistan", "Thailand",
"Timor-Leste", "Turkey", "Turkmenistan", "United Arab Emirates", "Uzbekistan",
"Viet Nam", "Yemen"
)
life_asia_2014 <- life %>%
  filter(Country %in% asia_countries & Year == 2014)
# Confirming number of countries
n_distinct(life_asia_2014$Country)
## [1] 46
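Next, I selected seven health and socioeconomic variables. For Asia in 2014, variables such as Alcohol and Schooling are well populated. Rows with any remaining missing value are removed with na.omit(), which is why the final cluster table lists 45 of the 46 countries (the Republic of Korea drops out).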
# Using the variables identified as most complete for Asia 2014
target_cols <- c(
"Life.expectancy",
"Adult.Mortality",
"Alcohol",
"BMI",
"HIV.AIDS",
"Income.composition.of.resources",
"Schooling"
)
data_clust_raw <- life_asia_2014[, target_cols]
data_clean <- na.omit(data_clust_raw)
# Syncing the labels with the cleaned data rows
life_clean <- life_asia_2014[rownames(data_clean), ]
# Scaling for clustering
data_clust <- scale(data_clean)
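To make the cleaning and scaling steps concrete, here is a minimal sketch on invented toy data (the country labels "A" to "D" and all values are assumptions for illustration). It shows how the `na.action` attribute left by `na.omit()` identifies which rows were dropped, and that `scale()` turns each column into z-scores with mean 0 and standard deviation 1.

```r
# Toy data with one incomplete row (values are made up)
toy <- data.frame(
  Country         = c("A", "B", "C", "D"),
  Life.expectancy = c(60, 70, NA, 80),
  Schooling       = c(8, 12, 10, 16)
)

toy_num   <- toy[, c("Life.expectancy", "Schooling")]
toy_clean <- na.omit(toy_num)

# na.omit() records the positions of the removed rows
dropped <- toy$Country[as.integer(attr(toy_clean, "na.action"))]
print(dropped)

# z-scores: (value - column mean) / column standard deviation
z <- scale(toy_clean)
print(round(colMeans(z), 10))  # column means after scaling
print(apply(z, 2, sd))         # column standard deviations after scaling
```

In the report itself, `setdiff(life_asia_2014$Country, life_clean$Country)` would list the countries removed before clustering.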
MY METHODS
I looked at the data in three ways:
K-means and PAM: These are my main tools for splitting the countries into groups. I included PAM because it uses actual countries (medoids) as cluster centers, which makes it more robust to outliers than k-means.
Hierarchical Clustering: I used this to build a family tree (dendrogram) of countries so I could see which ones are most similar to each other.
Dimension Reduction (MDS, t-SNE, UMAP): Since I can’t visualize 7 dimensions at once, I used these to project the data onto a 2D map.
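The difference between a k-means center and a PAM center can be seen in a tiny one-dimensional example. This is a sketch on invented numbers, not the project data: the mean is pulled toward the outlier, while the medoid (the observation with the smallest total distance to all others) stays with the bulk of the points.

```r
x <- c(1, 2, 3, 4, 100)             # toy data with one outlier

centroid <- mean(x)                  # the kind of center k-means uses

d <- as.matrix(dist(x))              # pairwise distances
medoid <- x[which.min(rowSums(d))]   # the kind of center PAM uses

print(c(centroid = centroid, medoid = medoid))
```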
RESULTS
After checking the elbow plot, I decided that 3 clusters gave the best balance between being simple and actually describing the data well.
k_opt <- 3
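The elbow plot itself is not shown in this report. A minimal base R sketch of how such a plot can be produced is given below; it runs on synthetic data with three well-separated groups (an assumption for illustration), whereas the report would use `data_clust` in place of `X`.

```r
set.seed(123)

# Synthetic 2D data with three groups (illustration only)
X <- rbind(
  matrix(rnorm(40, mean = 0),  ncol = 2),
  matrix(rnorm(40, mean = 6),  ncol = 2),
  matrix(rnorm(40, mean = 12), ncol = 2)
)

# Total within-cluster sum of squares for k = 1, ..., 6
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(X, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which the curve flattens out
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

With factoextra loaded, `fviz_nbclust(data_clust, kmeans, method = "wss")` should give a comparable plot in one call.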
K-means Clustering
if (nrow(data_clean) >= 3) {
  set.seed(123)
  km_res <- kmeans(data_clust, centers = k_opt, nstart = 25)
  # Syncing cluster results to the clean data
  life_clean$kmeans_cluster <- factor(km_res$cluster)
  fviz_cluster(km_res, data = data_clust,
               geom = "point",
               main = "K-means clustering of Asian Countries (2014)")
}
PAM Clustering
pam_res <- pam(data_clust, k = k_opt)
life_clean$pam_cluster <- factor(pam_res$clustering)
fviz_cluster(pam_res, data = data_clust, geom = "point",
main = "PAM clustering of Asian countries (2014)")
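Agreement between the k-means and PAM partitions can be checked with a cross-tabulation and the adjusted Rand index (1 means identical partitions, values near 0 mean chance-level agreement). The sketch below uses invented label vectors, and `adjusted_rand()` is a hypothetical helper written from the standard formula, not a function from any package used in this report.

```r
# Hypothetical helper: adjusted Rand index between two label vectors
adjusted_rand <- function(a, b) {
  tab <- table(a, b)
  comb2 <- function(x) sum(choose(x, 2))
  sum_ij   <- comb2(tab)            # pairs placed together in both partitions
  sum_a    <- comb2(rowSums(tab))   # pairs placed together in partition a
  sum_b    <- comb2(colSums(tab))   # pairs placed together in partition b
  expected <- sum_a * sum_b / choose(length(a), 2)
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}

# Toy labels: same grouping, different cluster numbers
labels_kmeans <- c(1, 1, 2, 2, 3, 3)
labels_pam    <- c(2, 2, 3, 3, 1, 1)

print(table(kmeans = labels_kmeans, pam = labels_pam))
ari <- adjusted_rand(labels_kmeans, labels_pam)
print(ari)
```

In the report, the two inputs would be `km_res$cluster` and `pam_res$clustering`. Because cluster numbers are arbitrary, the index is a safer comparison than checking whether the labels match one to one.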
Hierarchical Clustering
dist_mat <- dist(data_clust)
hc_ward <- hclust(dist_mat, method = "ward.D2")
# Using cex = 0.6 to keep the country labels readable
plot(hc_ward, labels = life_clean$Country, cex = 0.6,
     main = "Dendrogram of Asian Countries (2014)")
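A dendrogram can also be turned into explicit cluster labels by cutting the tree at a chosen number of groups. The following sketch does this on synthetic data (three tight groups, invented for illustration). In the report, the equivalent call would be `cutree(hc_ward, k = k_opt)`, whose result could then be compared with the k-means labels.

```r
set.seed(42)

# Synthetic data: three tight, well-separated groups of 10 points each
X <- rbind(
  matrix(rnorm(20, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.3), ncol = 2)
)

hc <- hclust(dist(X), method = "ward.D2")

# Cut the tree into three groups and count the members of each
groups <- cutree(hc, k = 3)
print(table(groups))
```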
MDS Visualisation
mds_res <- cmdscale(dist_mat, k = 2)
mds_df <- data.frame(
  Dim1 = mds_res[, 1],
  Dim2 = mds_res[, 2],
  Country = life_clean$Country,
  cluster = life_clean$kmeans_cluster
)
ggplot(mds_df, aes(Dim1, Dim2, label = Country, colour = cluster)) +
  geom_point() +
  geom_text(vjust = -0.5, size = 2.5, check_overlap = TRUE, show.legend = FALSE) +
  theme_minimal() +
  ggtitle("MDS Map of Asian Countries (2014)")
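How much of the original structure a 2D MDS map retains can be read from the goodness-of-fit value that `cmdscale()` returns when called with `eig = TRUE`. The sketch below uses random synthetic data (an assumption for illustration) and also shows that classical MDS on Euclidean distances reproduces the PCA scores up to sign, which is why the MDS map can be read much like a PCA plot.

```r
set.seed(1)

# Synthetic standardized data: 20 observations, 3 variables
X <- scale(matrix(rnorm(60), ncol = 3))

mds <- cmdscale(dist(X), k = 2, eig = TRUE)
pca <- prcomp(X)

# Share of the total variation captured by the two MDS dimensions
gof <- mds$GOF[1]
print(gof)

# First MDS axis versus first principal component (sign may flip)
cor1 <- abs(cor(mds$points[, 1], pca$x[, 1]))
print(cor1)
```

In the report, `cmdscale(dist_mat, k = 2, eig = TRUE)$GOF` would give the corresponding value for the country data.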
t-SNE
set.seed(123)
# Perplexity adjusted for the smaller Asian dataset size
tsne_res <- Rtsne(data_clust, dims = 2, perplexity = 10, max_iter = 1000)
tsne_df <- data.frame(
  Dim1 = tsne_res$Y[, 1],
  Dim2 = tsne_res$Y[, 2],
  Country = life_clean$Country,
  cluster = life_clean$kmeans_cluster
)
ggplot(tsne_df, aes(Dim1, Dim2, colour = cluster, label = Country)) +
  geom_point(size = 3) +
  geom_text(vjust = -0.7, size = 2.5, check_overlap = TRUE) +
  theme_minimal() +
  ggtitle("t-SNE Embedding: Asian Health Clusters (2014)")
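The perplexity had to be lowered because it is bounded by the sample size. As far as I understand the input check in Rtsne, it requires 3 * perplexity to be at most nrow(X) - 1. The helper below, `max_perplexity()`, is a hypothetical convenience function based on that assumption, not part of the package.

```r
# Hypothetical helper: largest perplexity allowed for n_rows observations,
# assuming the rule 3 * perplexity <= n_rows - 1
max_perplexity <- function(n_rows) floor((n_rows - 1) / 3)

# 45 countries remain after removing incomplete rows
limit <- max_perplexity(45)
print(limit)
```

Under that rule, the value of 10 used above sits comfortably below the limit for 45 countries, while the package default of 30 would be rejected.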
UMAP
set.seed(123)
umap_res <- umap(data_clust)
umap_df <- data.frame(
  Dim1 = umap_res$layout[, 1],
  Dim2 = umap_res$layout[, 2],
  Country = life_clean$Country,
  cluster = life_clean$kmeans_cluster
)
ggplot(umap_df, aes(Dim1, Dim2, colour = cluster)) +
  geom_point(size = 3) +
  theme_minimal() +
  ggtitle("UMAP Embedding of Asian countries (2014)")
Cluster Profiling
# --- Cluster Profiling ---
cluster_profile <- life_clean %>%
  group_by(kmeans_cluster) %>%
  # An anonymous function avoids the deprecated across(..., mean, na.rm = TRUE) form
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
cluster_profile
## # A tibble: 3 × 21
## kmeans_cluster Year Life.expectancy Adult.Mortality infant.deaths Alcohol
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2014 73.6 110 33.9 4.94
## 2 2 2014 67.3 185. 105 0.365
## 3 3 2014 76.9 68.8 3.25 0.668
## # ℹ 15 more variables: percentage.expenditure <dbl>, Hepatitis.B <dbl>,
## # Measles <dbl>, BMI <dbl>, under.five.deaths <dbl>, Polio <dbl>,
## # Total.expenditure <dbl>, Diphtheria <dbl>, HIV.AIDS <dbl>, GDP <dbl>,
## # Population <dbl>, thinness..1.19.years <dbl>, thinness.5.9.years <dbl>,
## # Income.composition.of.resources <dbl>, Schooling <dbl>
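Cluster numbers from k-means are arbitrary, so it helps to order the profile by a meaningful variable before naming the groups. The sketch below does this in base R on invented toy values; the tier names are illustrative labels that would be attached only after inspecting the profile.

```r
# Toy data: cluster labels plus two of the clustering variables
# (values are invented for illustration only)
toy <- data.frame(
  cluster         = factor(c(1, 1, 2, 2, 3, 3)),
  Life.expectancy = c(73, 74, 66, 68, 76, 78),
  Adult.Mortality = c(100, 120, 180, 190, 60, 80)
)

# Mean of each variable per cluster
profile <- aggregate(cbind(Life.expectancy, Adult.Mortality) ~ cluster,
                     data = toy, FUN = mean)

# Order clusters from highest to lowest life expectancy, then name the tiers
profile <- profile[order(-profile$Life.expectancy), ]
profile$tier <- c("High-performing", "Developing", "Challenged")
print(profile)
```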
COUNTRY CLUSTER TABLE
# I'm creating a table to show exactly which cluster each country belongs to.
# I also sorted it by cluster so it's easier to read.
country_assignment <- life_clean %>%
  select(Country, kmeans_cluster) %>%
  arrange(kmeans_cluster, Country)
# Using kable to make the table readable
# (kable adds the "Table:" prefix itself, so the caption omits it)
knitr::kable(country_assignment,
             col.names = c("Country Name", "Assigned Cluster"),
             caption = "Final Cluster Membership for Asian Countries (2014)")
| Country Name | Assigned Cluster |
|---|---|
| Armenia | 1 |
| China | 1 |
| Georgia | 1 |
| Kazakhstan | 1 |
| Philippines | 1 |
| Sri Lanka | 1 |
| Thailand | 1 |
| Viet Nam | 1 |
| Afghanistan | 2 |
| Bangladesh | 2 |
| Bhutan | 2 |
| Cambodia | 2 |
| India | 2 |
| Indonesia | 2 |
| Iraq | 2 |
| Lao People’s Democratic Republic | 2 |
| Myanmar | 2 |
| Nepal | 2 |
| Pakistan | 2 |
| Syrian Arab Republic | 2 |
| Tajikistan | 2 |
| Timor-Leste | 2 |
| Turkmenistan | 2 |
| Uzbekistan | 2 |
| Yemen | 2 |
| Azerbaijan | 3 |
| Bahrain | 3 |
| Brunei Darussalam | 3 |
| Cyprus | 3 |
| Iran (Islamic Republic of) | 3 |
| Israel | 3 |
| Japan | 3 |
| Jordan | 3 |
| Kuwait | 3 |
| Kyrgyzstan | 3 |
| Lebanon | 3 |
| Malaysia | 3 |
| Maldives | 3 |
| Mongolia | 3 |
| Oman | 3 |
| Qatar | 3 |
| Saudi Arabia | 3 |
| Singapore | 3 |
| Turkey | 3 |
| United Arab Emirates | 3 |
DISCUSSION
Across all methods, three distinct tiers of Asian countries emerge:
High-performing tier (k-means cluster 3): Countries like Japan, Singapore, and Israel. They have the highest life expectancy (76.9 years on average) and the most years of schooling.
Developing tier (k-means cluster 1): Countries like China, Thailand, and Viet Nam. These have decent health metrics and sit in the middle of the pack, with a mean life expectancy of 73.6 years.
Challenged tier (k-means cluster 2): Countries like Afghanistan and Yemen. They have much higher mortality (mean adult mortality of about 185, versus 68.8 in the top tier) and the lowest life expectancy (67.3 years on average).
K-means and PAM produced highly consistent groupings. In the t-SNE and UMAP maps, the most advanced economies separated most clearly from the rest of the region.
LIMITATIONS
Data availability for 2015 was too low for several variables, necessitating a focus on 2014.
Countries like Syria may have skewed data due to conflict during this period.
Standardizing variables gives them equal weight, which may not always reflect their actual impact on public health.
AI USAGE STATEMENT
I used AI to assist in structuring the R Markdown code and identifying data completeness patterns for Asian countries. All final interpretation and analysis decisions were made by me.