Introduction

This activity aims to explore and analyze the relationship between a country’s development level and its health outcomes using Principal Component Analysis (PCA). The focus is on understanding how various health indicators (such as mortality rates, life expectancy, and disease prevalence) vary across countries with different stages of development.

About the dataset

The Global Health Observatory (GHO) data repository under World Health Organization (WHO) keeps track of the health status as well as many other related factors for all countries. The dataset related to life expectancy, health factors for 193 countries has been collected from the same WHO data repository website and its corresponding economic data was collected from United Nation website.

You can read more about the dataset in here: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who?resource=download

library(tidyverse)
setwd("C:/Users/mayss/Desktop/Data 110/datasets")

life_data<- read_csv("life_expectancy_who.csv")

# Fix the columns' names
names(life_data) <- tolower(names(life_data))  # Convert all column names to lowercase 

# Replace spaces and slashes with underscores to improve readability 
names(life_data) <- gsub(" ", "_", names(life_data))
names(life_data) <- gsub("/", "_", names(life_data)) 

Data Preparation:

The first step is to clean and filter the dataset to retain relevant variables. The code is set up, but you need to add other indicators that you believe can be relevant to this analysis. We focus on numerical health and development indicators and remove any missing values.

# Clean & filter: keep only numeric columns + identifiers
life_clean <- life_data |>
  select(country, status, life_expectancy, adult_mortality, infant_deaths, alcohol, hepatitis_b, measles, hiv_aids, gdp, schooling) |> 
  drop_na()

Principal Component Analysis (PCA) Setup:

Use PCA to reduce the dimensionality of the dataset while retaining as much variance as possible.

Ensure that the dataset is scaled before performing PCA.

Perform PCA on the numerical columns and extract the principal components (PC1 and PC2). What do the first two principal components explain about the data?

# Perform PCA on numeric variables
pca_fit <- life_clean |>
  select(where(is.numeric)) |>
  scale() |>
  prcomp()

# View PCA result
pca_fit
## Standard deviations (1, .., p=9):
## [1] 1.7828643 1.2575121 1.1176473 0.9385045 0.8406321 0.6866960 0.6610838
## [8] 0.5900722 0.3830156
## 
## Rotation (n x k) = (9 x 9):
##                        PC1          PC2        PC3         PC4          PC5
## life_expectancy  0.5153682 -0.137327257 -0.0806698  0.03595171 -0.044096533
## adult_mortality -0.4111848  0.261450611  0.2851393 -0.10022718  0.004035623
## infant_deaths   -0.1790959 -0.616422387  0.1815040 -0.13171655  0.049625220
## alcohol          0.2825354  0.070585711  0.5992859 -0.05351549 -0.458903459
## hepatitis_b      0.1861880  0.246976958 -0.2076391 -0.89651072  0.149920791
## measles         -0.1183190 -0.612333885  0.2076885 -0.34782758  0.031045139
## hiv_aids        -0.3166868  0.299110010  0.4863812 -0.11945605  0.136551053
## gdp              0.3168744  0.007258577  0.3277570  0.15797085  0.841296671
## schooling        0.4517254  0.048538748  0.2997680 -0.06714761 -0.187397541
##                        PC6         PC7         PC8         PC9
## life_expectancy -0.1084257 -0.03618042  0.19352018  0.80949180
## adult_mortality  0.3106923  0.34675822  0.62750102  0.24632498
## infant_deaths    0.4434372 -0.56902532  0.11370247  0.04287020
## alcohol          0.3586016  0.19597343 -0.41822996  0.02597157
## hepatitis_b      0.1611796 -0.06143529 -0.07127553 -0.01346562
## measles         -0.4417684  0.49985160 -0.02119437 -0.02247931
## hiv_aids        -0.4877649 -0.42945985 -0.20327813  0.27764669
## gdp              0.1660832  0.14598668 -0.05302034 -0.08758719
## schooling       -0.2825704 -0.23849637  0.57557388 -0.44281916

Visualize PCA Results:

Create a scatter plot of the first two principal components (PC1 and PC2) to visualize how countries cluster based on their health and development indicators.

Add color coding or labels based on the development status (e.g., developed vs. developing countries) to better understand how these groups are positioned in the PCA space.

Add the arrows for PC1 and PC2, make sure they are labeled.

library(broom)
## Warning: package 'broom' was built under R version 4.3.3
pca_fit |>
  # Add PCs back to original data
  augment(life_clean) |>
  ggplot(aes(.fittedPC1, .fittedPC2)) +
  geom_point(aes(color = status)) +
  xlab("PC1") +
  ylab("PC2") +
  theme_minimal()

pca_fit |>
  # Add PCs to the original dataset
  augment(life_clean) |>
  ggplot(aes(.fittedPC1, .fittedPC2)) +
  geom_point(aes(color = status)) +  # Use status for coloring

  # Add the PCA1 and PCA2 arrows
  geom_segment(aes(x = -4.9, y = 0, xend = 5, yend = 0), 
               arrow = arrow(type = "closed", length = unit(0.1, "inches")),
               color = "black") +  # PC1 arrow
  geom_segment(aes(x = 0, y = -2.5, xend = 0, yend = 4), 
               arrow = arrow(type = "closed", length = unit(0.1, "inches")),
               color = "black") +  # PC2 arrow

  # Add text labels
  geom_text(aes(x = 5, y = 0, label = "PC1"), 
            vjust = -0.5, color = "black") +  

  geom_text(aes(x = 0, y = 4, label = "PC2"), 
            hjust = -0.5, color = "black") +  

  # Customize the plot
  xlab("PC1") +
  ylab("PC2") +
  theme_minimal()
## Warning in geom_segment(aes(x = -4.9, y = 0, xend = 5, yend = 0), arrow = arrow(type = "closed", : All aesthetics have length 1, but the data has 1865 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.
## Warning in geom_segment(aes(x = 0, y = -2.5, xend = 0, yend = 4), arrow = arrow(type = "closed", : All aesthetics have length 1, but the data has 1865 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.
## Warning in geom_text(aes(x = 5, y = 0, label = "PC1"), vjust = -0.5, color = "black"): All aesthetics have length 1, but the data has 1865 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.
## Warning in geom_text(aes(x = 0, y = 4, label = "PC2"), hjust = -0.5, color = "black"): All aesthetics have length 1, but the data has 1865 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.

Create the Rotation Arrows:

#Rotation Matrix

arrow_style <- arrow(
  angle = 20, length = grid::unit(8, "pt"),
  ends = "first", type = "closed"
)
pca_fit |>
  # extract rotation matrix
  tidy(matrix = "rotation") |>
  pivot_wider(
    names_from = "PC", values_from = "value",
    names_prefix = "PC"
  ) |>
  ggplot(aes(PC1, PC2)) +
  geom_segment(
    xend = 0, yend = 0,
    arrow = arrow_style
  ) +
  geom_text(aes(label = column), hjust = 1) +
  xlim(-1, 0.5) + ylim(-1, 0.7) + 
  coord_fixed()+
  theme_minimal()

Summarize the findings of the PCA.

What does PC1 show? PC1 represents a gradient of development level—from developing to developed countries. High PC1 values indicate higher GDP, life expectancy, and healthcare investment, while low PC1 values are associated with poorer health outcomes and lower economic development.

What does PC2 show? PC2 captures variation in the type of health challenges countries face. Negative PC2 values are associated with high child mortality and measles, whereas positive PC2 values reflect higher adult mortality and HIV prevalence.

What does the analysis reveal about the relationship between development level and health outcomes in different countries?

Developed countries tend to cluster in the positive PC1 (high wealth, more spending on the healthcare system, etc.) and are positioned in both the positive and negative values on PC2, depending on their adult health outcomes.

Developing countries are more likely to have negative PC1 values (lower GDP, lower life expectancy), though we see that some may appear on the positive side of PC1. As for PC2, they are positioned either negatively (if child mortality or measles is high) or positively (if adult mortality and HIV rates are more problematic).

The rotation of arrows represents how each health-related factor contributes to the PC2 axis. Infant deaths and measles contribute to the negative direction, indicating that these factors are associated with lower development levels, especially concerning childhood health. On the other hand, adult mortality and HIV are positioned in the positive direction, suggesting that as a country improves in some aspects, these adult-related health issues persist in some developing countries.

Additionally, HIV and adult mortality are positioned exactly in the same direction and overlap in the plot, which suggests that HIV is a significant contributor to adult mortality in these countries.

Plot the variance explained by each PC- bar-graph

#Plot the variance explained by each PC- bar-graph

pca_fit |>
  # extract eigenvalues
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(PC, percent)) + 
  geom_col(fill= "lightgreen") + 
  scale_x_continuous(
    # create one axis tick per PC
    breaks = 1:9
  ) +
  scale_y_continuous(
    name = "variance explained",
    breaks = seq(0, 1, by = 0.1), 
    # format y axis ticks as percent values
    label = scales::label_percent(accuracy = 1)
  )+
  xlab("Principal Component (PC)") +
  theme_minimal()

How much variation is explained by PC1 and PC2?

pca_fit |>
  tidy(matrix = "eigenvalues")
## # A tibble: 9 × 4
##      PC std.dev percent cumulative
##   <dbl>   <dbl>   <dbl>      <dbl>
## 1     1   1.78   0.353       0.353
## 2     2   1.26   0.176       0.529
## 3     3   1.12   0.139       0.668
## 4     4   0.939  0.0979      0.766
## 5     5   0.841  0.0785      0.844
## 6     6   0.687  0.0524      0.896
## 7     7   0.661  0.0486      0.945
## 8     8   0.590  0.0387      0.984
## 9     9   0.383  0.0163      1

About 53%