Abstract

The dataset used in this study combines health data from the Global Health Observatory (GHO) data repository of the World Health Organization (WHO) with economic data from the United Nations. It covers key health and economic indicators for 193 countries over the period 2000–2015. The primary objective is to examine the critical factors influencing life expectancy and to identify relationships among variables through a dimensionality reduction technique, Principal Component Analysis (PCA).

The dataset originally contained missing values, particularly for population, Hepatitis B vaccination coverage, and GDP in lesser-known countries, which were subsequently excluded to ensure data quality. The final dataset comprises 22 columns and 2,938 rows, with 20 predictor variables categorized into immunization, mortality, economic, and social factors.

PCA was employed to reduce dimensionality and uncover latent structures in the data, simplifying the interpretation of relationships between life expectancy and its predictors. This analysis aims to answer key questions, such as the impact of healthcare expenditure, mortality rates, lifestyle factors, schooling, and immunization coverage on life expectancy. Insights from PCA will highlight the variables most influential in determining life expectancy, offering valuable implications for policymakers in addressing health disparities and improving global health outcomes.

Breakdown of Variables:

  • Country: Name of the country.
  • Year: The year of data collection (2000–2015).
  • Status: Economic classification of the country (Developing or Developed).
  • Life expectancy: Average lifespan (in years) of individuals in the country.
  • Adult Mortality: Probability of dying between ages 15 and 60 (per 1000 population).
  • Infant deaths: Number of deaths of infants below one year of age per 1000 live births.
  • Alcohol: Alcohol consumption per capita (in liters).
  • Percentage expenditure: Expenditure on health as a percentage of GDP per capita.
  • Hepatitis B: Immunization coverage for Hepatitis B (% of children aged 1 year).
  • Measles: Number of reported cases of measles (per 1000 population).
  • BMI: Average Body Mass Index for the population.
  • Under-five deaths: Number of deaths of children under five per 1000 live births.
  • Polio: Immunization coverage for Polio (% of children aged 1 year).
  • Total expenditure: General government expenditure on health as a percentage of total government expenditure.
  • Diphtheria: Immunization coverage for Diphtheria (% of children aged 1 year).
  • HIV/AIDS: Deaths per 1000 live births due to HIV/AIDS.
  • GDP: Gross Domestic Product per capita (in USD).
  • Population: Total population of the country.
  • Thinness 1-19 years: Percentage of thin individuals aged 1–19 years.
  • Thinness 5-9 years: Percentage of thin individuals aged 5–9 years.
  • Income composition of resources: Human Development Index (HDI)-based index that measures income distribution.
  • Schooling: Average number of years of schooling.

PCA

We begin by loading the data set into our working environment and removing any columns that contain missing values or non-numerical data. We then take a quick look at the summary statistics of the remaining columns to see what we are working with.

##       Year      Life expectancy Adult Mortality infant deaths  
##  Min.   :2015   Min.   :51.00   Min.   :  1.0   Min.   :  0.0  
##  1st Qu.:2015   1st Qu.:65.75   1st Qu.: 74.0   1st Qu.:  0.0  
##  Median :2015   Median :73.90   Median :138.0   Median :  2.0  
##  Mean   :2015   Mean   :71.62   Mean   :152.9   Mean   : 23.8  
##  3rd Qu.:2015   3rd Qu.:76.95   3rd Qu.:213.0   3rd Qu.: 17.0  
##  Max.   :2015   Max.   :88.00   Max.   :484.0   Max.   :910.0  
##  percentage expenditure    Measles      under-five deaths     Polio      
##  Min.   :  0.000        Min.   :    0   Min.   :   0.00   Min.   : 5.00  
##  1st Qu.:  0.000        1st Qu.:    0   1st Qu.:   0.00   1st Qu.:83.00  
##  Median :  0.000        Median :   17   Median :   3.00   Median :93.00  
##  Mean   :  2.384        Mean   : 1503   Mean   :  31.61   Mean   :83.21  
##  3rd Qu.:  0.000        3rd Qu.:  202   3rd Qu.:  21.00   3rd Qu.:97.00  
##  Max.   :364.975        Max.   :90387   Max.   :1100.00   Max.   :99.00  
##    Diphtheria       HIV/AIDS     
##  Min.   : 6.00   Min.   :0.1000  
##  1st Qu.:83.50   1st Qu.:0.1000  
##  Median :93.00   Median :0.1000  
##  Mean   :84.63   Mean   :0.6607  
##  3rd Qu.:97.00   3rd Qu.:0.4000  
##  Max.   :99.00   Max.   :9.3000
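The cleaning step described above could be sketched roughly as follows. This is only an illustration on a made-up toy table: the object names `raw` and `LE_toy` are hypothetical, and the filter to 2015 is an assumption inferred from the summary, where every Year value is 2015.

```r
# Toy stand-in for the raw table; every value here is made up for illustration.
raw <- data.frame(
  Country = c("A", "B", "C", "A", "B", "C"),
  Year = c(2014, 2014, 2014, 2015, 2015, 2015),
  `Life expectancy` = c(60.1, 80.2, 70.3, 60.5, 80.4, 70.9),
  GDP = c(500, NA, 3000, 520, NA, 3100),
  Polio = c(70, 95, 88, 72, 96, 90),
  check.names = FALSE  # keep column names with spaces as they are
)

# Assumption: keep only the most recent year, as the summary suggests.
latest <- raw[raw$Year == 2015, ]

# Keep columns that have no missing values and are numeric.
is_complete <- colSums(is.na(latest)) == 0
is_numeric <- vapply(latest, is.numeric, logical(1))
LE_toy <- latest[, is_complete & is_numeric, drop = FALSE]

summary(LE_toy)
```

In this toy table, Country is dropped for being non-numerical and GDP for containing missing values, leaving Year, Life expectancy, and Polio.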

Feature Engineering

Upon initial inspection, we can see two columns in our filtered data set that would negatively affect our principal component analysis: Year and percentage expenditure. Every row has the same year (2015), and all but two countries have a value of 0 for percentage expenditure. A constant column such as Year carries no variance and cannot be scaled to unit variance, and a nearly constant column contributes little beyond a couple of extreme points. Let us remove both and perform an initial analysis.
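Columns like these can also be flagged programmatically before they are dropped. The sketch below uses a made-up toy table; the 80% threshold for "near constant" is an arbitrary assumption, not a value from our analysis.

```r
# Toy table (made-up values): one constant column, one nearly constant column.
toy <- data.frame(
  year = rep(2015, 5),
  spend = c(0, 0, 0, 0, 7),
  polio = c(83, 93, 97, 99, 5)
)

# Columns with zero standard deviation are constant.
col_sd <- vapply(toy, sd, numeric(1))
constant_cols <- names(col_sd)[col_sd == 0]

# Share of rows taken by the most common value in each column.
top_share <- vapply(toy, function(x) max(table(x)) / length(x), numeric(1))
near_constant_cols <- names(top_share)[top_share >= 0.8]  # assumed threshold

# prcomp() refuses to scale a constant column, so we capture the error message.
scaled_fit <- tryCatch(
  prcomp(toy, scale. = TRUE),
  error = function(e) conditionMessage(e)
)

constant_cols
near_constant_cols
scaled_fit
```

Here `year` is flagged as constant, and both `year` and `spend` exceed the near-constant threshold.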

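# Feature Engineering: drop the constant Year column and the near-constant
# percentage expenditure column before running PCA.
library(dplyr)       # select(), %>%
library(factoextra)  # fviz_*() plotting helpers

pca_table <- LE_final %>%
  select(-Year, -`percentage expenditure`)

# Standardize each variable to unit variance so that scale differences do not dominate.
pca_result <- prcomp(pca_table, scale. = TRUE)
summary(pca_result)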
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5    PC6     PC7
## Standard deviation     1.8554 1.5504 0.9643 0.66423 0.57401 0.5114 0.43342
## Proportion of Variance 0.4303 0.3005 0.1162 0.05515 0.04119 0.0327 0.02348
## Cumulative Proportion  0.4303 0.7308 0.8470 0.90215 0.94334 0.9760 0.99951
##                            PC8
## Standard deviation     0.06234
## Proportion of Variance 0.00049
## Cumulative Proportion  1.00000
fviz_eig(pca_result, addlabels = TRUE,
         barfill = "seagreen3", barcolor = "black",
         main = "Explained Variance by PCA Components") +
  labs(subtitle = "Figure 4") +
  theme_minimal()
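The "Proportion of Variance" row in the summary above can be reproduced from the standard deviations alone, since each component's variance is its squared standard deviation. The short sketch below copies the standard deviations from the printed summary; the variable names are our own.

```r
# Standard deviations copied from the summary(pca_result) output above.
sdev <- c(1.8554, 1.5504, 0.9643, 0.66423, 0.57401, 0.5114, 0.43342, 0.06234)

eigenvalues <- sdev^2                       # variance carried by each component
prop_var <- eigenvalues / sum(eigenvalues)  # share of total variance
cum_var <- cumsum(prop_var)                 # running total

round(prop_var[1:2], 4)  # approximately 0.43 and 0.30
round(cum_var[2], 4)     # approximately 0.73
sum(eigenvalues)         # close to 8, the number of standardized variables
```

Because the variables were standardized, the total variance is close to the number of variables (8), so each proportion is roughly the squared standard deviation divided by 8.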

Our initial PCA shows that the first two principal components together account for 73% of the total variance. We can directly observe how each of our variables contributes to these components to make initial guesses as to what our two factors of interest could be:

library(gridExtra)  # grid.arrange()

# Contribution of each variable (in %) to PC1.
p1 <- fviz_contrib(pca_result, choice = "var", axes = 1,
                   fill = "seagreen3", color = "black") +
  ggtitle("Contributions to PC1") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) +
  labs(x = "Column Name")

# Contribution of each variable (in %) to PC2.
p2 <- fviz_contrib(pca_result, choice = "var", axes = 2,
                   fill = "seagreen3", color = "black") +
  ggtitle("Contributions to PC2") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) +
  labs(x = "Column Name")

# Show both contribution plots side by side.
grid.arrange(p1, p2, ncol = 2)
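For readers curious about where these contribution percentages come from, they can be computed by hand from the PCA loadings. The sketch below uses R's built-in `USArrests` data as a stand-in for our table, and it reflects our understanding of how contributions are defined rather than a quote of the plotting function's internals.

```r
# Stand-in data: the built-in USArrests table (4 numeric variables).
pca_toy <- prcomp(USArrests, scale. = TRUE)

# Each column of the rotation matrix is a unit vector, so the squared
# loadings in a column sum to 1; times 100 gives contributions in percent.
contrib <- 100 * pca_toy$rotation^2
round(contrib[, 1:2], 1)

# If every variable contributed equally, each would contribute 100 / p percent.
expected <- 100 / nrow(contrib)

# Variables contributing more than their equal share to PC1.
names(which(contrib[, "PC1"] > expected))
```

A variable whose bar rises above the equal-share level (here 25%) can reasonably be read as an important contributor to that component.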

With this quick visualization, we can see that variables like Life expectancy and Adult Mortality play very heavy roles in PC1, while variables such as Measles and Infant deaths play heavy roles in PC2. Before drawing conclusions, we can gain more insight into how these variables shape our PCA with a bi-plot and some factor analysis:

Bi-Plot of variables

fviz_pca_var(pca_result, col.var = "contrib")+
  labs(title = 'PCA Variable Bi-Plot')+
  scale_color_gradient(low="mediumslateblue", high="seagreen3") +
  theme_minimal()
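The arrow coordinates in a variable plot like this one can be understood as correlations between each original variable and each principal component. The sketch below checks that relationship on the built-in `USArrests` data, used here only as a stand-in; it assumes the PCA was run on standardized variables, as ours was.

```r
# Stand-in data: the built-in USArrests table.
pca_toy <- prcomp(USArrests, scale. = TRUE)

# Variable coordinates: loadings scaled by each component's standard deviation.
var_coord <- sweep(pca_toy$rotation, 2, pca_toy$sdev, `*`)

# The same numbers obtained as plain correlations with the component scores.
var_cor <- cor(USArrests, pca_toy$x)
all.equal(var_coord, var_cor, check.attributes = FALSE)

# Arrow length in the PC1-PC2 plane; values near 1 mean the variable
# is well represented by the first two components.
arrow_length <- sqrt(rowSums(var_coord[, 1:2]^2))
round(arrow_length, 3)
```

Arrows pointing in similar directions indicate positively correlated variables, while arrows pointing in opposite directions indicate negatively correlated ones.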

Our bi-plot provides more insight into the overall positioning of our variables in the dimensionally reduced space. The arrows, colored by contribution level, give us a good foundation for interpreting the graphs that follow, such as the one below.

fviz_pca_biplot(pca_result,
  col.ind = "cos2",  # Color individuals by quality of representation
  gradient.cols = c("mediumslateblue", "seagreen3"),  # Gradient from slate blue to sea green
  repel = TRUE,      # Avoid overlapping labels
  geom = "point"     # Display individuals as points
)

We can see a dense cluster of data points around the center of our PCA, with much of the variation lying along dimension 1. We can also observe a few very strong outliers with respect to our principal components in the bottom-left quadrant, relating heavily to under-five/infant deaths and measles.
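The cos2 coloring and the notion of an outlier in this plane can both be made concrete with a few lines of base R. The sketch below again uses the built-in `USArrests` data as a stand-in, and the choice to list the three most distant points is arbitrary.

```r
# Stand-in data: the built-in USArrests table.
pca_toy <- prcomp(USArrests, scale. = TRUE)
scores <- pca_toy$x

# Quality of representation (cos2) in the PC1-PC2 plane: the share of each
# point's squared distance from the origin captured by the first two components.
cos2_plane <- rowSums(scores[, 1:2]^2) / rowSums(scores^2)

# Distance from the origin within the plane; large values are candidate outliers.
dist_plane <- sqrt(rowSums(scores[, 1:2]^2))
head(sort(dist_plane, decreasing = TRUE), 3)
```

A point far from the origin with a cos2 close to 1 is a genuine outlier along the first two components, whereas a low cos2 suggests its position is mostly explained by components not shown in the plot.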

Factor Analysis

A quick factor analysis can give us an even stronger grasp of the components we are working with. Upon initial inspection, one could assume Factor 1 relates to general health and longevity and Factor 2 to child health and disease. Let us see if our factor analysis supports this:

x.f <- factanal(pca_table, 2, scores="Bartlett", rotation="varimax")
x.f
## 
## Call:
## factanal(x = pca_table, factors = 2, scores = "Bartlett", rotation = "varimax")
## 
## Uniquenesses:
##   Life expectancy   Adult Mortality     infant deaths           Measles 
##             0.155             0.324             0.005             0.357 
## under-five deaths             Polio        Diphtheria          HIV/AIDS 
##             0.006             0.677             0.694             0.509 
## 
## Loadings:
##                   Factor1 Factor2
## Life expectancy    0.912  -0.115 
## Adult Mortality   -0.819         
## infant deaths     -0.139   0.988 
## Measles                    0.802 
## under-five deaths -0.174   0.982 
## Polio              0.566         
## Diphtheria         0.551         
## HIV/AIDS          -0.701         
## 
##                Factor1 Factor2
## SS loadings      2.668   2.605
## Proportion Var   0.334   0.326
## Cumulative Var   0.334   0.659
## 
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 130.03 on 13 degrees of freedom.
## The p-value is 2.06e-21
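The pieces of the output above are linked by simple arithmetic: a variable's uniqueness is one minus the sum of its squared loadings, and each factor's proportion of variance is its SS loadings divided by the number of variables. The sketch below checks this using numbers copied from the printed output. Only rows with both loadings printed are used, because the printout hides small loadings, and small rounding differences are expected.

```r
# Loadings copied from the printed output (rows with both factors shown).
loadings_printed <- rbind(
  "Life expectancy"   = c(0.912, -0.115),
  "infant deaths"     = c(-0.139, 0.988),
  "under-five deaths" = c(-0.174, 0.982)
)

# Communality: variance of each variable explained by the two factors.
communality <- rowSums(loadings_printed^2)

# Uniqueness: the remaining, unexplained share.
uniqueness <- 1 - communality
round(uniqueness, 3)  # close to the printed 0.155, 0.005, 0.006

# Proportion of variance per factor: SS loadings divided by 8 variables.
ss_loadings <- c(Factor1 = 2.668, Factor2 = 2.605)
prop_var <- ss_loadings / 8
round(prop_var, 3)
```

Low uniqueness values, as for infant deaths and under-five deaths, mean the two factors explain nearly all of the variance in those variables.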

Based on the factor analysis output, a general interpretation of the loadings supports our initial suspicions:

Factor 1

  • Positive loadings:
  • Life expectancy (0.912)
  • Polio (0.566)
  • Diphtheria (0.551)
  • Strong negative loadings:
  • Adult Mortality (-0.819)
  • HIV/AIDS (-0.701)

This suggests Factor 1 is associated with general health conditions and longevity, with higher life expectancy, polio immunization, and diphtheria immunization contributing positively, while adult mortality and HIV/AIDS have a negative relationship.

Factor 2

  • Strong positive loadings:
  • Infant deaths (0.988)
  • Under-five deaths (0.982)
  • Measles (0.802)

This suggests Factor 2 is related to child health and disease impact, with variables linked to child mortality and preventable diseases contributing strongly.

Cumulative Variance

  • Factor 1 explains 33.4% of the variance.
  • Factor 2 explains 32.6% of the variance.

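Cumulatively, the two factors explain 65.9% of the variance in the data, indicating that the model captures a substantial portion of the overall variability. Note, however, that the chi-square test in the output rejects the hypothesis that two factors are sufficient (p = 2.06e-21), so the two-factor solution should be read as a useful simplification rather than a complete description. With our components identified, we can display our countries on a single graph and draw a general conclusion from our analysis.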

Conclusion

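# Plot each country in the PC1-PC2 plane, labeled by name and colored by life expectancy.
# `countries` and `life_exp` are assumed to be vectors defined earlier, aligned with the rows of pca_table.
# PC1 is negated; the sign of a principal component is arbitrary, so this only mirrors the axis.
data.frame(z1 = -pca_result$x[, 1], z2 = pca_result$x[, 2]) %>%
  ggplot(aes(z1, z2, label = countries, color = life_exp)) +
  geom_point(size = 0) +
  labs(title = "PC Distribution with Life Expectancy", x = "PC1", y = "PC2") +
  theme_bw() +
  scale_color_gradient(low = "mediumslateblue", high = "seagreen3") +
  theme(legend.position = "bottom") +
  geom_text(size = 2, hjust = 0.6, vjust = 0, check_overlap = TRUE)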

Through Principal Component Analysis, this study successfully reduced the dimensionality of a diverse data set containing health and socioeconomic factors across 183 countries.

The two major components identified are:

  • Factor 1: General Health and Longevity – Represents variables associated with overall health and well-being, particularly related to adult health and life expectancy.
  • Factor 2: Child Health and Disease Prevalence – Captures variables associated with child mortality and the prevalence of preventable diseases.

The results highlighted relationships between life expectancy and immunization coverage, mortality rates, and disease burden, offering valuable insights for global health policy. Variables such as healthcare spending, education, and lifestyle factors were dropped during cleaning, so their effects are not captured here. We can also see two extreme outliers, Nigeria and India, in our distribution graph above, prompting further research into the health policies of these two countries and how they contrast with the rest of the world.