From the begining of data-driven approaches to the problem of explaining variance in healthcare expenditures, GDP is thought to be the most significant determinant of their level, the predominant method chosen to search for factors associated with levels of healthcare expenditure being panel data regression.
This projects presents an unsupervised learning approach to studying divergence of healthcare expenditures worldwide, with particular focus on non-economic factors describing health needs of populations. Used dataset consist of data registered for 116 countries in 2016 and comes from two open sources: World Bank Data and World Health Organisation database.
Utilised Unsupervised Learning Methods: - Hierarchical Clustering for creating groups of countries demonstrating similar characteristics - Principal Component Analysis (PCA) and Multidimensional Scaling (MDS) for dimension reduction
Health/Demographic Variables Included in the Dataset:
health_expend - out-of-pocket health expenditures per capita (World Bank Data)
gdp_per_cap - GDP per capita (World Bank Data)
alcohol_consump – consumption of pure alcohol per capita in liters (WHO database)
elderly_percent - percentage of population aged over 65 years (World Bank Data)
obes_preval - prevalence of obesity in the population (WHO database)
mri_dens - number of MRI scanners per million citizens (a measure of technological advancement of healthcare) (WHO database)
insuff_phys_act - percentage of population aged over 18, whose physical activity did not satisfy WHO standards (WHO database)
tobb_preval - percentage of active tobacco users aged over 15 (WHO database)
List of used external packages: Visualisation: ggplot2 - general purpose visualisation package plotly - package used for creating interactive plot (enabling to zoom in/out and identify countries by hovering cursor over datapoints) Handling data and statistical analysis: dplyr stats psych dendextend
#Installing and loading necessary packages
packages <- c("ggplot2", "dplyr", "plotly", "stats", "psych", "dendextend")
not_installed <- packages[!(packages %in% installed.packages()[ , "Package"])]
if(length(not_installed)) install.packages(not_installed)
library(ggplot2)
library(dplyr)
library(plotly)
library(stats)
library(psych)
library(dendextend)
# Load the data
df <- read.csv("HealthData.csv")
The main nominal determinant of health expenditures is of course the GDP per capita, whose affect will be later disconsidered to search for less economic-related dependencies in data. The strength of their relationship can be observed both on the histograms and by calculating Pearson Correlation Coefficient reaching as high as 0.89. Next cell produces both histograms and Pearson Correlation Coefficient value.
# Calculate the correlation between the standardized GDP per capita and health expenditure variables
rho <- cor(df$gdp_per_cap, df$health_expend)
print(paste("Correlation coefficient for standardized GDP per capita and health expenditure:", rho))
## [1] "Correlation coefficient for standardized GDP per capita and health expenditure: 0.894860464181384"
# Plot histograms of the GDP per capita and health expenditure variables
h1 <- plot_ly(x = ~df$gdp_per_cap,
type = "histogram",
histnorm = "probability", color="red", width = 500, height = 500) %>%
layout(title = "Distribution of health expenditure per capita",
yaxis = list(title = "Frequency",
zeroline = FALSE),
xaxis = list(title = "Value",
zeroline = FALSE))
h2 <- plot_ly(x = ~df$health_expend,
type = "histogram",
histnorm = "probability", width = 500, height = 500) %>%
layout(title = "Distribution of GDP per capita",
yaxis = list(title = "Frequency",
zeroline = FALSE),
xaxis = list(title = "Value",
zeroline = FALSE))
h1; h2
The first used method of analysis of the data is clustering which allows for group-specific investigations. Dendrogram generated by the next cell informs about the most sensible number of clusters to choose for the purpose of hierarchical clustering.This method of clustering was chosen because of the meaningfulness of the obtained results and its reproducibilty (as it belongs to the class of deterministic algorithms). The chosen measure of distance was most commonly used Ward’s metric, which guarantees minimisation of the intra-group differences.
# Preprocess the data for clustering
#Erase the GDP influence, and remove non-numerical columns:
X <- df[, -c(1,5)]
#data scaling
X_scaled <- scale(X)
#creating dendrogram
dend <- X_scaled %>% dist %>%
hclust(method = "ward.D") %>% as.dendrogram %>%
set("branches_k_color", k=3) %>% set("branches_lwd", 2)
# Define nodePar
nodePar <- list(lab.cex = 0.6, pch = c(NA, 19),
cex = 0.7, col = "black")
#plotting using base plot function
plot(dend, main = "Countries Dendrogram",nodePar = nodePar, leaflab = "none" )
The shape of the dendrogram advocates for the
meaningfulness of splitting the sample into 3 groups. The next cell
demonstrates mean values of variables in clusters.
# Apply hierarchical clustering to the data
cluster <- hclust(dist(X_scaled), method = "ward.D")
# Assign cluster labels to the data
df_cl <- df
df_cl$cluster <- cutree(cluster, k = 3)
# Print a summary of the data by cluster
df_cl %>% select(where(is.numeric)) %>% group_by(cluster) %>% summarize_all(list(mean=mean)) %>% t() %>% print()
## [,1] [,2] [,3]
## cluster 1.0000000 2.000000 3.000000
## health_expend_mean 280.6769142 597.623285 3107.461663
## alcohol_consump_mean 4.6960784 5.138710 10.576471
## elderly_percent_mean 5.0515764 7.878365 17.936073
## gdp_per_cap_mean 5569.3052624 10806.162091 33450.584391
## obes_preval_mean 11.0862745 28.590323 25.323529
## tobb_preval_mean 18.0235294 23.954839 28.164706
## mri_dens_mean 0.8907843 1.449677 9.330588
## insuff_phys_act_mean 19.1145098 32.713226 32.426176
We can see that despite standarisation and omitting the role of GDP per capita in clustering it is still one of the most differentiating variable. Demonstrating the means allow for rough classification of clusters. The third being the one with wealthiest, long-living population with highest consumption of alcohol and cigarettes and highest level of technological advancement, the second being the cluster with most alarming health statistics and the third which is least obese and wealthy, but cannot boast of higher life expectancy.
The next cell runs the principal component analysis which allows for reducing the number of dimension of data, by calculating the eigenvectors explaining most of the variance between observations. Three first principal components explain around 73% of variance.
# Scaling
col_names <- colnames(df)
col_names <- col_names[-1]
X_scaled <- as.data.frame(X_scaled, col.names = col_names)
X_scaled$cluster <- cutree(cluster, k = 3)
X$cluster <- cutree(cluster, k = 3)
# Perform PCA
pca_health <- prcomp(X_scaled[,-8], center = TRUE, scale. = FALSE)
print(summary(pca_health))
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7237 1.0953 0.9703 0.83631 0.80300 0.55133 0.48950
## Proportion of Variance 0.4244 0.1714 0.1345 0.09992 0.09212 0.04342 0.03423
## Cumulative Proportion 0.4244 0.5958 0.7303 0.83023 0.92235 0.96577 1.00000
From the output of the next cell printing loadings of principal components it is possible to find the most discriminating variables in the dataset(that is to say variables that with the strongest correlation with principal components). Loadings also allow for interpretation of meaning of principal components. The most discriminating variables are: percentage of population aged 65+, prevalence of insufficient physical activity and prevalence of active smokers. When it comes to interpreting the meaning of principal components, the first can be seen as the “vector of health and longevity”, the second as its opposite the “vector of unhealthy habits”. The third PC cannot be easily interpreted. It is also worth mentioning that consumption of alcohol is positively correlated with the “vector of health and longevity” and negatively with the “vector of unhealthy habits”. The reason for it might be that the statistic responsible for alcohol consumption does not take account of the difference between high and low percentage alcohol beverages.
#printing loadings
print(pca_health$rotation)
## PC1 PC2 PC3 PC4 PC5
## health_expend 0.4524968 -0.24724678 0.36486581 -0.06206318 0.20460719
## alcohol_consump 0.3827226 -0.32402293 -0.34400627 0.58806225 -0.17925641
## elderly_percent 0.5110954 -0.08207203 -0.20476588 0.11697534 -0.09579318
## obes_preval 0.3248559 0.49398135 0.09741711 0.18547574 0.71940927
## tobb_preval 0.2640124 0.43906407 -0.63408899 -0.44630444 -0.14294943
## mri_dens 0.4010858 -0.30481712 0.24064819 -0.60516389 -0.08127108
## insuff_phys_act 0.2268211 0.54537514 0.48696311 0.19192572 -0.61011267
## PC6 PC7
## health_expend 0.68575058 -0.29157981
## alcohol_consump -0.26185486 -0.42857229
## elderly_percent 0.12454311 0.80730793
## obes_preval -0.29717511 -0.02640018
## tobb_preval 0.20978090 -0.26799461
## mri_dens -0.55849459 -0.05967094
## insuff_phys_act -0.03781408 -0.05901029
Following plot demonstrates how precisely the world splits into 3 groups if we take health-related factors into consideration. All of the plots are interactive so it is possible to search for a given country among the points.
# Convert the principal components to a data frame
principal_health_Df <- as.data.frame(pca_health$x)
colnames(principal_health_Df) <- c("principal_component_1", "principal_component_2")
principal_health_Df$cluster <- X$cluster
principal_health_Df$Country <- df$Country
# Plot the clusters on the first two principal components
#fig = plot_ly(data =principal_health_Df, x=~principal_component_1, y=~principal_component_2, type="scatter", mode="markers", color=~cluster, colors=c("red","green","blue"))
fig <- plot_ly()%>%
add_trace(data =principal_health_Df, x=~principal_component_1, y=~principal_component_2, type="scatter", text = ~Country, hovertemplate = paste('<b>%{text}<extra></extra></b>'),
showlegend = FALSE,
mode="markers", color=~cluster, colors=c("red","green","blue"))%>%
layout(title="Clusters of Countries in PC Space", legend=list(title=list(text='Country')))
fig
Now let’s compare PCA with MDS results. Both methods were devised for dimension reduction purpose, but instead of calculating eigenvectors, MDS algorythm translates the values of the points into 2-dimensional space by minimising the STRESS function taking account of the difference between original distances between datapoints and those obtained in the 2-dimensional space. As a standard linear method PCA is believed to be less applicable for data with less correlated features. Nevertheless, in case of this dataset the results of both methods are very similar.
# Perform MDS on the data
mds <- cmdscale(dist(X_scaled[-c(1,5)]))
# Create a data frame with the coordinates from MDS
mds_df <- data.frame(x = mds[, 1], y = mds[, 2], Country = df$Country)
mds_df$cluster <- X$cluster
fig <- plot_ly()%>%
add_trace(data =mds_df, x=~x, y=~y, type="scatter", text = ~Country, hovertemplate = paste('<b>%{text}<extra></extra></b>'),
showlegend = FALSE,
mode="markers", color=~cluster, colors=c("red","green","blue"))%>%
layout(title="Multidimensional Scaling (MDS)", legend=list(title=list(text='Country')))
fig
Next cell plots the data in a 3-dimensional space of 3 most discriminating features (as the analysis of PCA loadings suggests). These are: percentage of population aged 65+, prevalence of insufficient physical activity and prevalence of active smokers
#Adding Country labels to X
X$Country <- df$Country
# Create a scatter plot
fig <- plot_ly()%>%
add_trace(data = X, x=~elderly_percent, y=~tobb_preval, z=~insuff_phys_act, type="scatter3d" ,
text = ~Country, hovertemplate = paste('<b>%{text}<extra></extra></b>'),
showlegend = FALSE,
mode="markers", marker = list(size = 5), color=~cluster, colors=c("red","green","blue"))%>%
layout(title="Most discriminating variables", legend=list(title=list(text='Country')))
fig
Two next plots demonstrate alarming relationships between logarithms of healthcare expenditures and two variables: prevalence of obesity (Pearson Correlation Coefficient = 0.67) and levels of insufficient physical activity. Logarithmic transformation allows for interpretation of relationship in terms of percentage changes. On the first plot we may note that the level of obesity observed within population is highest in case of the second cluster of countries and on average 3% lower for 3 cluster. The dependency between healthcare expenditures and prevalence of insufficient physical activity is not as straightforward and the Pearson Correlation Coefficient in this case cannot be calculated, but the plot also suggests positive correlation.
# Create a scatter plot
fig <- plot_ly() %>%
add_trace(data = X, x=~log(health_expend), y=~log(obes_preval), text = ~Country, hovertemplate = paste('<b>%{text}<extra></extra></b>'),type="scatter", mode="markers", color=~cluster, colors=c("red","green","blue"))%>% layout(title="Prevalence of obesity vs Health expenditures",legend=list(title=list(text='Country')))
fig
# Calculate the correlation between the prevalence of obesity and health expenditure variables
rho <- cor(log(X$obes_preval), log(X$health_expend))
print(paste("Correlation coefficient for prevalence of obesity and health expenditure:", rho))
## [1] "Correlation coefficient for prevalence of obesity and health expenditure: 0.671869318977173"
# Create a scatter plot with the most discriminating variables (ARE THEY REALLY THE MOST DISCRIMINATING, WHICH ONE DIFFERENTIATES YELLOW/GREEN)
fig <- plot_ly()%>%
add_trace(data = X, x=~log(health_expend), y=~log(insuff_phys_act), text = ~Country, hovertemplate = paste('<b>%{text}<extra></extra></b>'),
type="scatter", mode="markers", color=~cluster, colors=c("red","green","blue"))%>%
layout(title="Insufficient physical activity vs Health expenditures ", legend=list(title=list(text='Country')))
fig
By seeing the heads of df groups we can anticipate the geographical distribution of clusters.
# Print the first few rows of each cluster, COUNTRIES??
df_cl %>% filter(cluster == 1) %>% head() %>% select(Country) %>% print() #1 CLUSTER
## Country
## 1 Albania
## 2 Azerbaijan
## 3 Bangladesh
## 4 Benin
## 5 Brunei Darussalam
## 6 Burkina Faso
df_cl %>% filter(cluster == 2) %>% head() %>% select(Country) %>% print() #2 CLUSTER
## Country
## 1 Argentina
## 2 Armenia
## 3 Barbados
## 4 Belarus
## 5 Botswana
## 6 Brazil
df_cl %>% filter(cluster == 3) %>% head() %>% select(Country) %>% print() #3 CLUSTER
## Country
## 1 Australia
## 2 Austria
## 3 Belgium
## 4 Bosnia and Herzegovina
## 5 Bulgaria
## 6 Canada
Although the project was not designed with such an intention, the results of the study reveal a health-related version of the twentieth century division for three worlds, which vary highly when it comes to longevity and standards of living.The first, blue cluster consists mainly of developed OECD countries, green cluster represents post-communist and Arabic countries, and the third, red cluster is composed of the developing countries. Amid interesting conclusions from this study, it should be noted that tobacco and alcohol consumption obtain the highest level in the first, blue cluster, but the populations of the second, green cluster struggle with highest level of obesity and insufficient physical activity in the world.
# Read in the country codes data
df1 <- read.csv("CountryCodes.csv")
colnames(df1) <- c("Country","Code")
# Merge the data frames
df_merge_cl <- merge(df_cl, df1, by="Country")
# Create a world map
fig <- plot_ly(df_merge_cl, type='choropleth', locations=df_merge_cl$Code, z=df_merge_cl$cluster, text=df$COUNTRY,
color=df_merge_cl$cluster, colors=c("red","green","blue"), width = 900, height = 700)
fig <- fig %>% colorbar(title = "cluster") %>% layout(title = 'World Map Indicating Clusters of Countries ')
fig