Introduction

This activity aims to explore and analyze the relationship between a country’s development level and its health outcomes using Principal Component Analysis (PCA). The focus is on understanding how various health indicators (such as mortality rates, life expectancy, and disease prevalence) vary across countries with different stages of development.

About the dataset

The Global Health Observatory (GHO) data repository under World Health Organization (WHO) keeps track of the health status as well as many other related factors for all countries. The dataset related to life expectancy, health factors for 193 countries has been collected from the same WHO data repository website and its corresponding economic data was collected from United Nation website.

You can read more about the dataset in here: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who?resource=download

library(tidyverse)
library(broom)
setwd("C:/Users/hanle/Desktop/Han Le - Intro to Data/intro2r")

life_data<- read_csv("life_expectancy_who.csv")

# Fix the columns' names
names(life_data) <- tolower(names(life_data))  # Convert all column names to lowercase 

# Replace spaces and slashes with underscores to improve readability 
names(life_data) <- gsub(" ", "_", names(life_data))
names(life_data) <- gsub("/", "_", names(life_data)) 

Data Preparation:

The first step is to clean and filter the dataset to retain relevant variables. The code is set up, but you need to add other indicators that you believe can be relevant to this analysis. We focus on numerical health and development indicators and remove any missing values.

# Clean & filter: keep only numeric columns + identifiers
life_clean <- life_data |>
  select(country, status, life_expectancy, adult_mortality, infant_deaths, alcohol, bmi, gdp, measles, schooling) |> 
  drop_na()

Principal Component Analysis (PCA) Setup:

Use PCA to reduce the dimensionality of the dataset while retaining as much variance as possible.

Ensure that the dataset is scaled before performing PCA.

Perform PCA on the numerical columns and extract the principal components (PC1 and PC2). What do the first two principal components explain about the data?

# Perform PCA on numeric variables
pca_fit <- life_clean |>
  select(where(is.numeric)) |>
  scale() |>
  prcomp()


# View PCA result
pca_fit 
## Standard deviations (1, .., p=8):
## [1] 1.8481272 1.1995253 0.9554386 0.8327058 0.7259899 0.6981672 0.5948895
## [8] 0.4134136
## 
## Rotation (n x k) = (8 x 8):
##                        PC1         PC2           PC3         PC4         PC5
## life_expectancy  0.4785663 -0.10773558  0.2561416792  0.00941281  0.09996169
## adult_mortality -0.3530590  0.21363375 -0.6052063870 -0.09378009 -0.40028555
## infant_deaths   -0.1819082 -0.66017890  0.0005035297 -0.01022769 -0.19480705
## alcohol          0.3244417 -0.11716562 -0.6764571800 -0.27905509  0.40621425
## bmi              0.3974504  0.03452042  0.0891152347 -0.42504374 -0.73782197
## gdp              0.3318061 -0.11093595 -0.2845589158  0.83380179 -0.28053402
## measles         -0.1429924 -0.68951039 -0.0497212285 -0.11399176 -0.01944292
## schooling        0.4684674 -0.06531991 -0.1383278615 -0.15575174  0.05797165
##                         PC6         PC7         PC8
## life_expectancy  0.09145628  0.29873511  0.76555270
## adult_mortality  0.03973467  0.42372097  0.33659122
## infant_deaths    0.69957714 -0.05827857 -0.01462991
## alcohol          0.05170738 -0.40710892  0.11010176
## bmi             -0.10241538 -0.30961694 -0.03879377
## gdp             -0.09408269 -0.10729047 -0.04833821
## measles         -0.68173226  0.14904366  0.03621288
## schooling        0.11779536  0.65764421 -0.53211472

Visualize PCA Results:

Create a scatter plot of the first two principal components (PC1 and PC2) to visualize how countries cluster based on their health and development indicators.

Add color coding or labels based on the development status (e.g., developed vs. developing countries) to better understand how these groups are positioned in the PCA space.

Add the arrows of PC1 and PC2, make sure they are labeled.

pca_fit |>
  # add PCs to the original dataset
  augment(life_clean) |>
  ggplot(aes(.fittedPC1, .fittedPC2)) +
  geom_point(aes(color = status)) +
  
  # Add the PCA1 and PCA2 arrows
  geom_segment(aes(x = -4.9, y = 0, xend = 5, yend = 0), 
               arrow = arrow(type = "closed", length = unit(0.1, "inches")),
               color = "black") +  # PC1 arrow
  geom_segment(aes(x = 0, y = -2.5, xend = 0, yend = 4), 
               arrow = arrow(type = "closed", length = unit(0.1, "inches")),
               color = "black") +  # PC2 arrow
  
   # Add text labels- Positioning and color of the label
  geom_text(aes(x = 5, y = 0, label = "PC1"), 
            vjust = -0.5, color = "black") +  
   
  geom_text(aes(x = 0, y = 4, label = "PC2"), 
            hjust = -0.5, color = "black") +  
  xlim(-5, 5) +
  ylim(-4, 4) +
  xlab("PC1") +
  ylab("PC2") +
  guides(color = guide_legend(title = NULL)) +  # No title for the legend
  theme_minimal()

Create the Rotation Arrows:

# define an arrow style
arrow_style <- arrow(
  angle = 20, length = grid::unit(8, "pt"),
  ends = "first", type = "closed"
)

pca_fit |>
  tidy(matrix = "rotation") |>  # extract rotation matrix
  pivot_wider(
    names_from = "PC", values_from = "value",
    names_prefix = "PC"
  ) |>
  ggplot(aes(PC1, PC2)) +
  geom_segment(
    xend = 0, yend = 0,
    arrow = arrow_style
  ) +
  geom_text(aes(label = column), hjust = 1) +
  xlim(-1.5, 1) + ylim(-1, 1) +
  coord_fixed()

Summarize the findings of the PCA.

What does PC1 show?

PC1 captures information related to the countries’ development level; the scatterplot shows that “Developed” and “Developing” separate along PC1. Most developed countries are on the positive side of PC1 and most of developing countries are on the negative side of PC1.

What does PC2 show? There’s no big distinction between development level in terms of PC2. Additionally, with PC2, some arrows point up while others point down. This means PC2 show differences between health issues (adult morality, infant deaths, measles). You can also see that infant deaths correlate with measles rate.

What does the analysis reveal about the relationship between development level and health outcomes in different countries? BMI, life expectancy, GDP, schooling, and alcohol point to the right, meaning if you increase them, the country’s position moves to the right in the plot. Since developed countries are more on the right, you can infer that developed countries have better life expectancy, education, etc. Because developed and developing countries don’t separate much along PC2, you can infer that adult morality, infant deaths, and measles don’t strongly relate to development level. In conclusion, both developing and developed countries can have different mixes of health issues.

Plot the variance explained by each PC- bargraph

pca_fit |>
  # extract eigenvalues
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(PC, percent)) + 
  geom_col(fill= "pink") + 
  scale_x_continuous(
    # create one axis tick per PC
    breaks = 1:9
  ) +
  scale_y_continuous(
    name = "variance explained",
    breaks = seq(0, 1, by = 0.1), 
    # format y axis ticks as percent values
    label = scales::label_percent(accuracy = 1)
  )+
  xlab("Principal Component (PC)") +
  theme_minimal()

How much variation is explained by PC1 and PC2?

pca_fit |>
  tidy(matrix = "eigenvalues")
## # A tibble: 8 × 4
##      PC std.dev percent cumulative
##   <dbl>   <dbl>   <dbl>      <dbl>
## 1     1   1.85   0.427       0.427
## 2     2   1.20   0.180       0.607
## 3     3   0.955  0.114       0.721
## 4     4   0.833  0.0867      0.808
## 5     5   0.726  0.0659      0.873
## 6     6   0.698  0.0609      0.934
## 7     7   0.595  0.0442      0.979
## 8     8   0.413  0.0214      1

PC1 explains around 43% and PC2 explains around 18%. In total, PC1 and PC2 explain around 61% of the variation.