Introduction

This activity aims to explore and analyze the relationship between a country’s development level and its health outcomes using Principal Component Analysis (PCA). The focus is on understanding how various health indicators (such as mortality rates, life expectancy, and disease prevalence) vary across countries with different stages of development.

About the dataset

The Global Health Observatory (GHO) data repository under World Health Organization (WHO) keeps track of the health status as well as many other related factors for all countries. The dataset related to life expectancy, health factors for 193 countries has been collected from the same WHO data repository website and its corresponding economic data was collected from United Nation website.

You can read more about the dataset in here: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who?resource=download

library(tidyverse)
library(broom)

life_data<- read_csv("life_expectancy_who.csv")
head(life_data)
## # A tibble: 6 × 22
##   Country      Year Status   `Life expectancy` `Adult Mortality` `infant deaths`
##   <chr>       <dbl> <chr>                <dbl>             <dbl>           <dbl>
## 1 Afghanistan  2015 Develop…              65                 263              62
## 2 Afghanistan  2014 Develop…              59.9               271              64
## 3 Afghanistan  2013 Develop…              59.9               268              66
## 4 Afghanistan  2012 Develop…              59.5               272              69
## 5 Afghanistan  2011 Develop…              59.2               275              71
## 6 Afghanistan  2010 Develop…              58.8               279              74
## # ℹ 16 more variables: Alcohol <dbl>, `percentage expenditure` <dbl>,
## #   `Hepatitis B` <dbl>, Measles <dbl>, BMI <dbl>, `under-five deaths` <dbl>,
## #   Polio <dbl>, `Total expenditure` <dbl>, Diphtheria <dbl>, `HIV/AIDS` <dbl>,
## #   GDP <dbl>, Population <dbl>, `thinness  1-19 years` <dbl>,
## #   `thinness 5-9 years` <dbl>, `Income composition of resources` <dbl>,
## #   Schooling <dbl>
# Fix the columns' names
names(life_data) <- tolower(names(life_data))  # Convert all column names to lowercase 

# Replace spaces and slashes with underscores to improve readability 
names(life_data) <- gsub(" ", "_", names(life_data))
names(life_data) <- gsub("/", "_", names(life_data)) 

head(life_data)
## # A tibble: 6 × 22
##   country      year status life_expectancy adult_mortality infant_deaths alcohol
##   <chr>       <dbl> <chr>            <dbl>           <dbl>         <dbl>   <dbl>
## 1 Afghanistan  2015 Devel…            65               263            62    0.01
## 2 Afghanistan  2014 Devel…            59.9             271            64    0.01
## 3 Afghanistan  2013 Devel…            59.9             268            66    0.01
## 4 Afghanistan  2012 Devel…            59.5             272            69    0.01
## 5 Afghanistan  2011 Devel…            59.2             275            71    0.01
## 6 Afghanistan  2010 Devel…            58.8             279            74    0.01
## # ℹ 15 more variables: percentage_expenditure <dbl>, hepatitis_b <dbl>,
## #   measles <dbl>, bmi <dbl>, `under-five_deaths` <dbl>, polio <dbl>,
## #   total_expenditure <dbl>, diphtheria <dbl>, hiv_aids <dbl>, gdp <dbl>,
## #   population <dbl>, `thinness__1-19_years` <dbl>, `thinness_5-9_years` <dbl>,
## #   income_composition_of_resources <dbl>, schooling <dbl>

Data Preparation:

The first step is to clean and filter the dataset to retain relevant variables. The code is set up, but you need to add other indicators that you believe can be relevant to this analysis. We focus on numerical health and development indicators and remove any missing values.

# Clean & filter: keep only numeric columns + identifiers
life_clean <- life_data |>
  select(country, status, life_expectancy, adult_mortality, infant_deaths, hepatitis_b, measles, polio, hiv_aids, gdp)|> 
  drop_na()

Principal Component Analysis (PCA) Setup:

Use PCA to reduce the dimensionality of the dataset while retaining as much variance as possible.

Ensure that the dataset is scaled before performing PCA.

Perform PCA on the numerical columns and extract the principal components (PC1 and PC2). What do the first two principal components explain about the data?

# Perform PCA on numeric variables
LCpca_fit <- life_clean |> 
  select(where(is.numeric)) |> 
  scale() |>                   
  prcomp()                     

# View PCA result
LCpca_fit
## Standard deviations (1, .., p=8):
## [1] 1.6555745 1.2661279 1.0718721 0.9491823 0.7173510 0.6698868 0.6430729
## [8] 0.4788035
## 
## Rotation (n x k) = (8 x 8):
##                        PC1         PC2        PC3         PC4         PC5
## life_expectancy  0.5344423 -0.16477535  0.1049686  0.04053914 -0.06961659
## adult_mortality -0.4659712  0.27481309 -0.1568497  0.12520327  0.16246779
## infant_deaths   -0.2103954 -0.59185580 -0.2699163  0.07038576  0.03096162
## hepatitis_b      0.2807830  0.28873227 -0.5798968 -0.16168086  0.65808735
## measles         -0.1415068 -0.58205466 -0.4039493  0.09010430  0.06713094
## polio            0.3341892  0.17366594 -0.5809920  0.05903679 -0.64794562
## hiv_aids        -0.3881862  0.30550119 -0.2022601  0.44179067 -0.21896461
## gdp              0.3028892 -0.03726634  0.1166566  0.86301955  0.24986912
##                         PC6         PC7           PC8
## life_expectancy -0.11802175  0.04441094 -0.8085856748
## adult_mortality  0.26756498 -0.59496125 -0.4637944019
## infant_deaths    0.61813266  0.35976802 -0.1230929665
## hepatitis_b     -0.01508458  0.20510244  0.0001685303
## measles         -0.60225229 -0.31769442  0.0418361861
## polio            0.19689155 -0.20475021  0.1288341767
## hiv_aids        -0.33550122  0.55582922 -0.2245877253
## gdp              0.13464452 -0.14404938  0.2171264484

Visualize PCA Results:

Create a scatter plot of the first two principal components (PC1 and PC2) to visualize how countries cluster based on their health and development indicators.

Add color coding or labels based on the development status (e.g., developed vs. developing countries) to better understand how these groups are positioned in the PCA space.

Add the arrows of PC1 and PC2, make sure they are labeled.

LCpca_fit |>
  augment(life_clean) |>
  ggplot(aes(.fittedPC1, .fittedPC2)) +
  geom_point(aes(color= status)) +
  theme_light() +
  xlab ("PC 1") +
  ylab ("PC 2") +
  geom_segment(aes(x = -8.5, y = 0, xend = 4.7, yend = 0), 
               arrow = arrow(type = "closed", length = unit(0.1, "inches")),
               color = "black") +  
  geom_segment(aes(x = 0, y = -12.85, xend = 0, yend = 4.8), 
               arrow = arrow(type = "closed", length = unit(0.1, "inches")),
               color = "black") +  
  geom_text(aes(x = 1, y = -7.5, label = "PC1"), 
            vjust = -0.5, color = "black") +  
  geom_text(aes(x = -7.5, y = 1, label = "PC2"), 
            hjust = -0.5, color = "black")  +
  guides(color = guide_legend(title = NULL)) +  
  scale_color_manual(values = c("Developed" = "hotpink3", "Developing" = "lightblue3"))

Create the Rotation Arrows:

LCarrow_style <- arrow(
  angle = 20, length = grid::unit(9, "pt"),
  ends = "first", type = "closed"
)
LCpca_fit |>
  tidy(matrix = "rotation") |>
  pivot_wider(
    names_from = "PC", values_from = "value",
    names_prefix = "PC"
  ) |>
  ggplot(aes(PC1, PC2)) +
  geom_segment(
    xend = 0, yend = 0,
    arrow = LCarrow_style
  ) +
  geom_text(aes(label = column), hjust = 1) +
  xlim(-1, 0.5) + ylim(-0.75, 0.5) + 
  coord_fixed()+
  theme_minimal()

Summarize the findings of the PCA.

What does PC1 show? Difference between developed and developing countries.

What does PC2 show? Adult mortality correlation with HIV/Aids and infant death correlation with measles in relation to developing countries.

What does the analysis reveal about the relationship between development level and health outcomes in different countries? More developed countries tend to have less issues with death rates as affected by illness.

Plot the variance explained by each PC- bargraph

LCpca_fit |>
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(PC, percent)) + 
  geom_col(fill= "hotpink3") + 
  scale_x_continuous(breaks = 1:8) +
  scale_y_continuous(
    name = "variance explained",
    breaks = seq(0, 1, by = 0.1), 
    label = scales::label_percent(accuracy = 1)
  )+
  xlab("Principal Component (PC)") +
  theme_minimal()

LCpca_fit |>
  tidy(matrix = "eigenvalues")
## # A tibble: 8 × 4
##      PC std.dev percent cumulative
##   <dbl>   <dbl>   <dbl>      <dbl>
## 1     1   1.66   0.343       0.343
## 2     2   1.27   0.200       0.543
## 3     3   1.07   0.144       0.687
## 4     4   0.949  0.113       0.799
## 5     5   0.717  0.0643      0.864
## 6     6   0.670  0.0561      0.920
## 7     7   0.643  0.0517      0.971
## 8     8   0.479  0.0287      1

How much variation is explained by PC1 and PC2? PC2 cumulative - 54.3%