The dataset utilized in this study is derived from the Global Health Observatory (GHO) data repository under the World Health Organization (WHO) and economic data from the United Nations. It encompasses key health and economic indicators for 193 countries over the period 2000–2015. The primary objective is to examine critical factors influencing life expectancy and identify relationships among variables through a dimensionality reduction technique, Principal Component Analysis (PCA).
The dataset originally contained missing values, particularly for population, Hepatitis B vaccination coverage, and GDP from lesser-known countries, which were subsequently excluded to ensure data quality. The final dataset comprises 22 columns and 2,938 rows, with 20 predictor variables categorized into immunization, mortality, economic, and social factors.
PCA was employed to reduce dimensionality and uncover latent structures in the data, simplifying the interpretation of relationships between life expectancy and its predictors. This analysis aims to answer key questions, such as the impact of healthcare expenditure, mortality rates, lifestyle factors, schooling, and immunization coverage on life expectancy. Insights from PCA will highlight the variables most influential in determining life expectancy, offering valuable implications for policymakers in addressing health disparities and improving global health outcomes.
We begin by loading the data set into our working environment and removing any columns that contain missing values or non-numeric data. We then take a quick look at the summary statistics of the remaining columns to see what we are working with.
## Year Life expectancy Adult Mortality infant deaths
## Min. :2015 Min. :51.00 Min. : 1.0 Min. : 0.0
## 1st Qu.:2015 1st Qu.:65.75 1st Qu.: 74.0 1st Qu.: 0.0
## Median :2015 Median :73.90 Median :138.0 Median : 2.0
## Mean :2015 Mean :71.62 Mean :152.9 Mean : 23.8
## 3rd Qu.:2015 3rd Qu.:76.95 3rd Qu.:213.0 3rd Qu.: 17.0
## Max. :2015 Max. :88.00 Max. :484.0 Max. :910.0
## percentage expenditure Measles under-five deaths Polio
## Min. : 0.000 Min. : 0 Min. : 0.00 Min. : 5.00
## 1st Qu.: 0.000 1st Qu.: 0 1st Qu.: 0.00 1st Qu.:83.00
## Median : 0.000 Median : 17 Median : 3.00 Median :93.00
## Mean : 2.384 Mean : 1503 Mean : 31.61 Mean :83.21
## 3rd Qu.: 0.000 3rd Qu.: 202 3rd Qu.: 21.00 3rd Qu.:97.00
## Max. :364.975 Max. :90387 Max. :1100.00 Max. :99.00
## Diphtheria HIV/AIDS
## Min. : 6.00 Min. :0.1000
## 1st Qu.:83.50 1st Qu.:0.1000
## Median :93.00 Median :0.1000
## Mean :84.63 Mean :0.6607
## 3rd Qu.:97.00 3rd Qu.:0.4000
## Max. :99.00 Max. :9.3000
Upon initial inspection, we can see two columns in our filtered data set that would negatively affect our principal component analysis: Year and percentage expenditure. Every country has the same year, and all but two countries have a percentage expenditure of 0. Columns that are constant or nearly constant carry almost no genuine variation between countries and would distort the analysis, so let us remove them and perform an initial analysis.
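One way to spot such columns programmatically is to check each column's variance before running PCA. The sketch below is only an illustration: the `toy` data frame and its values are made up for the example and are not taken from the study's data.

```r
# Toy data (made-up values, NOT from the study): Year is constant
toy <- data.frame(
  Year    = rep(2015, 5),
  Polio   = c(83, 93, 97, 99, 75),
  Measles = c(0, 17, 202, 5, 1200)
)

# A column with zero variance cannot be rescaled to unit variance,
# so prcomp(..., scale. = TRUE) would stop with an error on it
col_var <- sapply(toy, var)
constant_cols <- names(col_var)[col_var == 0]

# Drop the constant columns, then run the standardized PCA
toy_clean <- toy[, !(names(toy) %in% constant_cols), drop = FALSE]
toy_pca <- prcomp(toy_clean, scale. = TRUE)
```

A nearly constant column such as percentage expenditure would pass this exact-zero check, so in practice a small variance threshold or a look at the quartiles, as done above, is still needed.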
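# Feature engineering: drop the constant Year column and the near-constant
# percentage expenditure column before running PCA
pca_table <- LE_final %>%
  select(-Year, -`percentage expenditure`)
# Standardize each variable to unit variance so that no single scale dominates
pca_result <- prcomp(pca_table, scale. = TRUE)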
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.8554 1.5504 0.9643 0.66423 0.57401 0.5114 0.43342
## Proportion of Variance 0.4303 0.3005 0.1162 0.05515 0.04119 0.0327 0.02348
## Cumulative Proportion 0.4303 0.7308 0.8470 0.90215 0.94334 0.9760 0.99951
## PC8
## Standard deviation 0.06234
## Proportion of Variance 0.00049
## Cumulative Proportion 1.00000
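The proportions in the summary above follow directly from the component standard deviations: each component's variance is its squared standard deviation, divided by the total. A minimal sketch of that arithmetic, using the rounded values printed above (so results agree only to within rounding):

```r
# Standard deviations copied from the summary output above (rounded as printed)
sdev <- c(1.8554, 1.5504, 0.9643, 0.66423, 0.57401, 0.5114, 0.43342, 0.06234)

# Each component's variance is its squared standard deviation
variance <- sdev^2

# With 8 standardized variables, the total variance is approximately 8
total_variance <- sum(variance)

# Proportion of variance per component and its running total
prop_var <- variance / total_variance
cum_var  <- cumsum(prop_var)

round(rbind(Proportion = prop_var, Cumulative = cum_var)[, 1:3], 4)
```

The first two cumulative values are where the roughly 73% figure for PC1 and PC2 comes from.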
fviz_eig(pca_result, addlabels = TRUE, main = "Explained Variance by PCA Components")+
geom_bar(stat = "identity", fill = "seagreen3", color = "black") +
labs(subtitle = "Figure 4")+
theme_minimal()
The initial PCA shows that the first two principal components together explain about 73% of the total variance (43.0% and 30.1% respectively). We can inspect how much each variable contributes to these components to form an initial guess at what our two factors of interest could be:
p1 = fviz_contrib(pca_result, choice = "var", axes = 1) +
ggtitle("Contributions to PC1")+
geom_bar(stat = "identity", fill = "seagreen3", color = "black") +
theme_minimal() + theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 10)
)+
labs(x = "Column Name")
p2 = fviz_contrib(pca_result, choice = "var", axes = 2) +
ggtitle("Contributions to PC2")+
geom_bar(stat = "identity", fill = "seagreen3", color = "black") +
theme_minimal() + theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 10)
)+
labs(x = "Column Name")
grid.arrange(p1, p2, ncol = 2)
This quick visualization shows that variables like Life expectancy and Adult Mortality contribute heavily to PC1, while variables such as Measles and infant deaths contribute heavily to PC2. Before drawing conclusions, we can gain more insight into how these variables shape our PCA with a biplot and a factor analysis:
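As an aside on the contribution plots, the percentages they display can be approximated by hand from a `prcomp` object. The sketch below assumes that, for a single component, a variable's contribution is its squared loading times 100. It uses the built-in `USArrests` data as a stand-in, since the study's `pca_table` is not reproduced here.

```r
# Stand-in data: the built-in USArrests data set, NOT the study's pca_table
demo_pca <- prcomp(USArrests, scale. = TRUE)

# Each column of `rotation` is a unit-length loading vector, so the squared
# loadings of one component sum to 1; multiplying by 100 gives percentages
contrib <- demo_pca$rotation^2 * 100

round(contrib[, 1:2], 1)

# Contribution each variable would have if all contributed equally;
# variables above this level are the ones usually singled out
expected <- 100 / nrow(contrib)
```

Comparing each bar against the equal-contribution level is a simple way to decide which variables characterize a component.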
fviz_pca_var(pca_result, col.var = "contrib")+
labs(title = 'PCA Variable Bi-Plot')+
scale_color_gradient(low="mediumslateblue", high="seagreen3") +
theme_minimal()
The biplot provides more insight into the overall positioning of our variables in the dimension-reduced space. The arrows, colored by contribution level, give us a solid foundation for interpreting the graphs that follow, such as the biplot of individual countries below.
fviz_pca_biplot(pca_result,
col.ind = "cos2", # Color by quality of representation
gradient.cols = c("mediumslateblue", "seagreen3"), # Gradient from slate blue to sea green
repel = TRUE, # Avoid overlapping labels
geom = "point" # Display individuals as points
)
We can see a dense cluster of data points around the center of the plot, with much of the variation falling along dimension 1. We can also observe a few very strong outliers in the bottom-left quadrant, relating heavily to under-five/infant deaths and measles.
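The cos2 value used to color the points measures how well each observation is represented on the plotted plane. A minimal sketch of that calculation, again on the built-in `USArrests` data as a stand-in for the study's table, and assuming cos2 is each squared score divided by the observation's total squared distance from the origin:

```r
# Stand-in data: the built-in USArrests data set, NOT the study's pca_table
demo_pca <- prcomp(USArrests, scale. = TRUE)
scores <- demo_pca$x

# Share of each observation's squared distance from the origin
# that is captured by each component; every row sums to 1
cos2 <- scores^2 / rowSums(scores^2)

# Quality of representation on the PC1-PC2 plane shown in a biplot
cos2_plane <- rowSums(cos2[, 1:2])

# Observations least well represented by the first two components
head(sort(cos2_plane), 3)
```

Points with a low value on the plane lie mostly along later components, so their position in the biplot should be read with caution.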
A quick factor analysis can give us an even stronger grasp of the components we are working with. Upon initial inspection, one could assume Factor 1 relates to general health and longevity and Factor 2 to child health and disease. Let us see if our factor analysis supports this:
##
## Call:
## factanal(x = pca_table, factors = 2, scores = "Bartlett", rotation = "varimax")
##
## Uniquenesses:
## Life expectancy Adult Mortality infant deaths Measles
## 0.155 0.324 0.005 0.357
## under-five deaths Polio Diphtheria HIV/AIDS
## 0.006 0.677 0.694 0.509
##
## Loadings:
## Factor1 Factor2
## Life expectancy 0.912 -0.115
## Adult Mortality -0.819
## infant deaths -0.139 0.988
## Measles 0.802
## under-five deaths -0.174 0.982
## Polio 0.566
## Diphtheria 0.551
## HIV/AIDS -0.701
##
## Factor1 Factor2
## SS loadings 2.668 2.605
## Proportion Var 0.334 0.326
## Cumulative Var 0.334 0.659
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 130.03 on 13 degrees of freedom.
## The p-value is 2.06e-21
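To help read the output above, the uniquenesses and variance proportions can be recovered from the loadings by simple arithmetic. The sketch below uses values copied from the printout (rounded to three decimals, so agreement is only to within rounding). The variable count of 8 is taken from the number of rows in the loadings table.

```r
# Loadings copied from the printed output above for two of the variables
loadings <- rbind(
  "Life expectancy" = c(0.912, -0.115),
  "infant deaths"   = c(-0.139, 0.988)
)
colnames(loadings) <- c("Factor1", "Factor2")

# Communality: variance of a variable explained by the factors together
communality <- rowSums(loadings^2)

# Uniqueness: the remainder not explained by the factors
uniqueness <- 1 - communality
round(uniqueness, 3)

# "Proportion Var" is the SS loadings divided by the number of variables
ss_loadings <- c(Factor1 = 2.668, Factor2 = 2.605)
n_vars <- 8
prop_var <- ss_loadings / n_vars
cum_var <- cumsum(prop_var)
```

The computed uniquenesses match the printed 0.155 and 0.005, and the cumulative proportion comes out at about 0.659.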
Based on the factor analysis output, a general interpretation of the loadings supports our initial suspicions:
Factor 1
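Life expectancy (0.912), Polio (0.566), and Diphtheria (0.551) load positively on Factor 1, while Adult Mortality (-0.819) and HIV/AIDS (-0.701) load negatively. This suggests Factor 1 is associated with general health conditions and longevity: higher life expectancy and higher polio and diphtheria immunization coverage move with the factor, while adult mortality and HIV/AIDS move against it.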
Factor 2
Infant deaths (0.988), under-five deaths (0.982), and Measles (0.802) load strongly on Factor 2. This suggests Factor 2 is related to child health and disease impact, with variables linked to child mortality and preventable diseases contributing strongly.
Cumulative Variance
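Cumulatively, the two factors explain 65.9% of the variance in the data, indicating that the model captures a substantial portion of the overall variability. Note, however, that the chi-square test (p = 2.06e-21) rejects the hypothesis that two factors are sufficient, so the two-factor model is best read as a useful simplification rather than a complete description. With our components identified, we can display our countries on a single graph and draw a general conclusion from our analysis.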
# Note: the sign of PC1 is flipped here (the sign of a principal component is arbitrary)
data.frame(z1 = -pca_result$x[, 1], z2 = pca_result$x[, 2]) %>%
  ggplot(aes(z1, z2, label = countries, color = life_exp)) +
  geom_point(size = 0) +
  geom_text(size = 2, hjust = 0.6, vjust = 0, check_overlap = TRUE) +
  labs(title = "PC Distribution with Life Expectancy", x = "PC1", y = "PC2") +
  scale_color_gradient(low = "mediumslateblue", high = "seagreen3") +
  theme_bw() + theme(legend.position = "bottom")
Through Principal Component Analysis, this study successfully reduced the dimensionality of a diverse data set containing health and socioeconomic factors across 183 countries.
The two major components identified were general health and longevity (PC1) and child health and disease (PC2).
Among the variables retained in the final analysis, the results highlighted relationships between life expectancy and immunization coverage, mortality rates, and disease burden, offering valuable insights for global health policy. Nigeria and India appear as extreme outliers in the distribution graph above, which calls for further research into the health policies of these countries and how they contrast with the rest of the world.