Introduction
This activity aims to explore and analyze the relationship between a country’s development level and its health outcomes using Principal Component Analysis (PCA). The focus is on understanding how various health indicators (such as mortality rates, life expectancy, and disease prevalence) vary across countries with different stages of development.
About the dataset
The Global Health Observatory (GHO) data repository under World Health Organization (WHO) keeps track of the health status as well as many other related factors for all countries. The dataset related to life expectancy, health factors for 193 countries has been collected from the same WHO data repository website and its corresponding economic data was collected from United Nation website.
You can read more about the dataset in here: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who?resource=download
library(broom)
library(tidyverse)
setwd("C:/Users/Jacob/Downloads")
life_data<- read_csv("life_expectancy_who.csv")
# Fix the columns' names
names(life_data) <- tolower(names(life_data)) # Convert all column names to lowercase
# Replace spaces and slashes with underscores to improve readability
names(life_data) <- gsub(" ", "_", names(life_data))
names(life_data) <- gsub("/", "_", names(life_data))
Data Preparation:
The first step is to clean and filter the dataset to retain relevant variables. The code is set up, but you need to add other indicators that you believe can be relevant to this analysis. We focus on numerical health and development indicators and remove any missing values.
# Clean & filter: keep only numeric columns + identifiers
life_clean <- life_data |>
select(country, status, life_expectancy, adult_mortality, infant_deaths,
gdp, hiv_aids, alcohol, polio) |>
drop_na()
Principal Component Analysis (PCA) Setup:
Use PCA to reduce the dimensionality of the dataset while retaining as much variance as possible.
Ensure that the dataset is scaled before performing PCA.
Perform PCA on the numerical columns and extract the principal components (PC1 and PC2). What do the first two principal components explain about the data?
pca <- life_clean |>
select(where(is.numeric)) |>
scale() |>
prcomp()
# View PCA result
pca
## Standard deviations (1, .., p=7):
## [1] 1.6710788 1.0972634 0.9776635 0.8783445 0.7945853 0.6562576 0.4627676
##
## Rotation (n x k) = (7 x 7):
## PC1 PC2 PC3 PC4 PC5
## life_expectancy 0.5502554 -0.06614842 0.009273361 -0.00677278 0.03612004
## adult_mortality -0.4629009 0.33568234 -0.008342506 -0.10970949 -0.01763082
## infant_deaths -0.1457455 -0.42732276 -0.771267275 -0.44040941 -0.04976892
## gdp 0.3525191 0.32619014 -0.375828279 0.27606801 -0.70310172
## hiv_aids -0.3798086 0.54223629 -0.109715729 -0.13677379 -0.16147470
## alcohol 0.2915424 0.49720657 -0.417936566 0.04810399 0.67542205
## polio 0.3289053 0.22957011 0.277553215 -0.83469929 -0.13887583
## PC6 PC7
## life_expectancy -0.030823975 -0.83093724
## adult_mortality 0.728588485 -0.36025313
## infant_deaths -0.006403893 -0.06943991
## gdp 0.175014213 0.16397504
## hiv_aids -0.654122654 -0.27754292
## alcohol 0.061726909 0.17549516
## polio 0.076492217 0.20055583
Visualize PCA Results:
Create a scatter plot of the first two principal components (PC1 and PC2) to visualize how countries cluster based on their health and development indicators.
Add color coding or labels based on the development status (e.g., developed vs. developing countries) to better understand how these groups are positioned in the PCA space.
Add the arrows of PC1 and PC2, make sure they are labeled.
pca |>
# add PCs to the original dataset
augment(life_clean) |>
ggplot(aes(.fittedPC1, .fittedPC2)) +
geom_point(aes(color = status)) +
# Add the PCA1 and PCA2 arrows
geom_segment(aes(x = -4.9, y = 0, xend = 5, yend = 0),
arrow = arrow(type = "closed", length = unit(0.1, "inches")),
color = "black") + # PC1 arrow
geom_segment(aes(x = 0, y = -2.5, xend = 0, yend = 4),
arrow = arrow(type = "closed", length = unit(0.1, "inches")),
color = "black") + # PC2 arrow
# Add text labels- Positioning and color of the label
geom_text(aes(x = 5, y = 0, label = "PC1"),
vjust = -0.5, color = "black") +
geom_text(aes(x = 0, y = 4, label = "PC2"),
hjust = -0.5, color = "black") +
xlim(-5, 5) +
ylim(-4, 4) +
xlab("PC1") +
ylab("PC2") +
guides(color = guide_legend(title = NULL)) + # No title for the legend
scale_color_manual(values = c("Developed" = "skyblue", "Developing" = "darkred")) +
theme_minimal()
## Warning in geom_segment(aes(x = -4.9, y = 0, xend = 5, yend = 0), arrow = arrow(type = "closed", : All aesthetics have length 1, but the data has 2321 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
## a single row.
## Warning in geom_segment(aes(x = 0, y = -2.5, xend = 0, yend = 4), arrow = arrow(type = "closed", : All aesthetics have length 1, but the data has 2321 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
## a single row.
## Warning in geom_text(aes(x = 5, y = 0, label = "PC1"), vjust = -0.5, color = "black"): All aesthetics have length 1, but the data has 2321 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
## a single row.
## Warning in geom_text(aes(x = 0, y = 4, label = "PC2"), hjust = -0.5, color = "black"): All aesthetics have length 1, but the data has 2321 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
## a single row.
## Warning: Removed 39 rows containing missing values or values outside the scale range
## (`geom_point()`).
Create the Rotation Arrows:
arrow_style <- arrow(
angle = 20, length = grid::unit(8, "pt"),
ends = "first", type = "closed"
)
pca |>
tidy(matrix = "rotation") |> # extract rotation matrix
pivot_wider(
names_from = "PC", values_from = "value",
names_prefix = "PC"
) |>
ggplot(aes(PC1, PC2)) +
geom_segment(
xend = 0, yend = 0,
arrow = arrow_style
) +
geom_text(aes(label = column)) +
coord_fixed()
Summarize the findings of the PCA.
What does PC1 show? There are more developed countries to the right which are associated with higher rates of gdp, alcohol, polio, and life expectancy. The left has less developed countries which are associated with higher rates of infant and adult deaths and HIV.
What does PC2 show? The higher you go the more devevloped countries there are. The highest rates are alcohol and HIV while the lowest are infant deaths.
What does the analysis reveal about the relationship between development level and health outcomes in different countries? Developed countries have higher rates of gdp, alcohol, polio, and life expectancy while less developed countries have higher rates of infant and adult deaths and HIV.
Plot the variance explained by each PC- bargraph
pca |>
# extract eigenvalues
tidy(matrix = "eigenvalues") |>
ggplot(aes(PC, percent)) +
geom_col(fill= "lightblue") +
scale_x_continuous(
# create one axis tick per PC
breaks = 1:7
) +
scale_y_continuous(
name = "variance explained",
breaks = seq(0, 1, by = 0.1),
# format y axis ticks as percent values
label = scales::label_percent(accuracy = 1)
)+
xlab("Principal Component (PC)") +
theme_minimal()
How much variation is explained by PC1 and PC2?
pca |>
tidy(matrix = "eigenvalues")
## # A tibble: 7 × 4
## PC std.dev percent cumulative
## <dbl> <dbl> <dbl> <dbl>
## 1 1 1.67 0.399 0.399
## 2 2 1.10 0.172 0.571
## 3 3 0.978 0.137 0.707
## 4 4 0.878 0.110 0.818
## 5 5 0.795 0.0902 0.908
## 6 6 0.656 0.0615 0.969
## 7 7 0.463 0.0306 1
pc1 explains about 40% and pc2 explains about 17%. Combined they explain about 57% of the variation.