Introduction

From the begining of data-driven approaches to the problem of explaining variance in healthcare expenditures, GDP is thought to be the most significant determinant of their level, the predominant method chosen to search for factors associated with levels of healthcare expenditure being panel data regression.

This projects presents an unsupervised learning approach to studying divergence of healthcare expenditures worldwide, with particular focus on non-economic factors describing health needs of populations. Used dataset consist of data registered for 116 countries in 2016 and comes from two open sources: World Bank Data and World Health Organisation database.



Utilised Unsupervised Learning Methods:
- Hierarchical Clustering for creating groups of countries demonstrating similar characteristics
- Principal Component Analysis (PCA) and Multidimensional Scaling (MDS) for dimension reduction



Health/Demographic Variables Included in the Dataset:

health_expend - out-of-pocket health expenditures per capita (World Bank Data)

gdp_per_cap - GDP per capita (World Bank Data)

alcohol_consump – consumption of pure alcohol per capita in liters (WHO database)

elderly_percent - percentage of population aged over 65 years (World Bank Data)

obes_preval - prevalence of obesity in the population (WHO database)

mri_dens - number of MRI scanners per million citizens (a measure of technological advancement of healthcare) (WHO database)

insuff_phys_act - percentage of population aged over 18, whose physical activity did not satisfy WHO standards (WHO database)

tobb_preval - percentage of active tobacco users aged over 15 (WHO database)



List of used external packages:

Visualisation:

ggplot2 - general purpose visualisation package
plotly - package used for creating interactive plot (enabling to zoom in/out and identify countries by hovering cursor over datapoints)

Handling data and statistical analysis:

dplyr
stats
psych
dendextend

#Installing and loading necessary packages
packages <- c("ggplot2", "dplyr", "plotly", "stats", "psych", "dendextend")                                   
not_installed <- packages[!(packages %in% installed.packages()[ , "Package"])]    
if(length(not_installed)) install.packages(not_installed) 

library(ggplot2) 
library(dplyr) 
library(plotly) 
library(stats) 
library(psych) 
library(dendextend)
# Load the data
df <- read.csv("HealthData.csv")



The main nominal determinant of health expenditures is of course the GDP per capita, whose affect will be later disconsidered to search for less economic-related dependencies in data. The strength of their relationship can be observed both on the histograms and by calculating Pearson Correlation Coefficient reaching as high as 0.89. Next cell produces both histograms and Pearson Correlation Coefficient value.

# Calculate the correlation between the standardized GDP per capita and health expenditure variables
rho <- cor(df$gdp_per_cap, df$health_expend)
print(paste("Correlation coefficient for standardized GDP per capita and health expenditure:", rho))
## [1] "Correlation coefficient for standardized GDP per capita and health expenditure: 0.894860464181384"
# Plot histograms of the GDP per capita and health expenditure variables

h1 <- plot_ly(x = ~df$gdp_per_cap,
              type = "histogram",
              histnorm = "probability", color="red", width = 500, height = 500) %>% 
  layout(title = "Distribution of health expenditure per capita",
         yaxis = list(title = "Frequency",
                      zeroline = FALSE),
         xaxis = list(title = "Value",
                      zeroline = FALSE))
h2 <- plot_ly(x = ~df$health_expend,
              type = "histogram",
              histnorm = "probability", width = 500, height = 500) %>% 
  layout(title = "Distribution of GDP per capita",
         yaxis = list(title = "Frequency",
                      zeroline = FALSE),
         xaxis = list(title = "Value",
                      zeroline = FALSE))
h1; h2

Clustering

The first used method of analysis of the data is clustering which allows for group-specific investigations. Dendrogram generated by the next cell informs about the most sensible number of clusters to choose for the purpose of hierarchical clustering.This method of clustering was chosen because of the meaningfulness of the obtained results and its reproducibilty (as it belongs to the class of deterministic algorithms). The chosen measure of distance was most commonly used Ward’s metric, which guarantees minimisation of the intra-group differences.

# Preprocess the data for clustering
#Erase the GDP influence, and remove non-numerical columns:
X <- df[, -c(1,5)]

#data scaling
X_scaled <- scale(X)

#creating dendrogram
dend <- X_scaled %>% dist %>% 
   hclust(method = "ward.D")  %>% as.dendrogram %>%
   set("branches_k_color", k=3) %>% set("branches_lwd", 2) 

# Define nodePar
nodePar <- list(lab.cex = 0.6, pch = c(NA, 19), 
                cex = 0.7, col = "black")

#plotting using base plot function
plot(dend, main = "Countries Dendrogram",nodePar = nodePar, leaflab = "none"  )



The shape of the dendrogram advocates for the meaningfulness of splitting the sample into 3 groups. The next cell demonstrates mean values of variables in clusters.


# Apply hierarchical clustering to the data
cluster <- hclust(dist(X_scaled), method = "ward.D")

# Assign cluster labels to the data
df_cl <- df
df_cl$cluster <- cutree(cluster, k = 3)

# Print a summary of the data by cluster
df_cl %>% select(where(is.numeric)) %>% group_by(cluster) %>% summarize_all(list(mean=mean)) %>% t() %>% print()
##                              [,1]         [,2]         [,3]
## cluster                 1.0000000     2.000000     3.000000
## health_expend_mean    280.6769142   597.623285  3107.461663
## alcohol_consump_mean    4.6960784     5.138710    10.576471
## elderly_percent_mean    5.0515764     7.878365    17.936073
## gdp_per_cap_mean     5569.3052624 10806.162091 33450.584391
## obes_preval_mean       11.0862745    28.590323    25.323529
## tobb_preval_mean       18.0235294    23.954839    28.164706
## mri_dens_mean           0.8907843     1.449677     9.330588
## insuff_phys_act_mean   19.1145098    32.713226    32.426176



We can see that despite standarisation and omitting the role of GDP per capita in clustering it is still one of the most differentiating variable. Demonstrating the means allow for rough classification of clusters. The third being the one with wealthiest, long-living population with highest consumption of alcohol and cigarettes and highest level of technological advancement, the second being the cluster with most alarming health statistics and the third which is least obese and wealthy, but cannot boast of higher life expectancy.

PCA

The next cell runs the principal component analysis which allows for reducing the number of dimension of data, by calculating the eigenvectors explaining most of the variance between observations. Three first principal components explain around 73% of variance.

# Scaling
col_names <- colnames(df)
col_names <- col_names[-1]
X_scaled <- as.data.frame(X_scaled, col.names = col_names)
X_scaled$cluster <- cutree(cluster, k = 3)
X$cluster <- cutree(cluster, k = 3)


# Perform PCA
pca_health <- prcomp(X_scaled[,-8], center = TRUE, scale. = FALSE)
print(summary(pca_health))
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.7237 1.0953 0.9703 0.83631 0.80300 0.55133 0.48950
## Proportion of Variance 0.4244 0.1714 0.1345 0.09992 0.09212 0.04342 0.03423
## Cumulative Proportion  0.4244 0.5958 0.7303 0.83023 0.92235 0.96577 1.00000



From the output of the next cell printing loadings of principal components it is possible to find the most discriminating variables in the dataset(that is to say variables that with the strongest correlation with principal components). Loadings also allow for interpretation of meaning of principal components.

The most discriminating variables are: percentage of population aged 65+, prevalence of insufficient physical activity and prevalence of active smokers. When it comes to interpreting the meaning of principal components, the first can be seen as the “vector of health and longevity”, the second as its opposite the “vector of unhealthy habits”. The third PC cannot be easily interpreted. It is also worth mentioning that consumption of alcohol is positively correlated with the “vector of health and longevity” and negatively with the “vector of unhealthy habits”. The reason for it might be that the statistic responsible for alcohol consumption does not take account of the difference between high and low percentage alcohol beverages.

#printing loadings
print(pca_health$rotation)
##                       PC1         PC2         PC3         PC4         PC5
## health_expend   0.4524968 -0.24724678  0.36486581 -0.06206318  0.20460719
## alcohol_consump 0.3827226 -0.32402293 -0.34400627  0.58806225 -0.17925641
## elderly_percent 0.5110954 -0.08207203 -0.20476588  0.11697534 -0.09579318
## obes_preval     0.3248559  0.49398135  0.09741711  0.18547574  0.71940927
## tobb_preval     0.2640124  0.43906407 -0.63408899 -0.44630444 -0.14294943
## mri_dens        0.4010858 -0.30481712  0.24064819 -0.60516389 -0.08127108
## insuff_phys_act 0.2268211  0.54537514  0.48696311  0.19192572 -0.61011267
##                         PC6         PC7
## health_expend    0.68575058 -0.29157981
## alcohol_consump -0.26185486 -0.42857229
## elderly_percent  0.12454311  0.80730793
## obes_preval     -0.29717511 -0.02640018
## tobb_preval      0.20978090 -0.26799461
## mri_dens        -0.55849459 -0.05967094
## insuff_phys_act -0.03781408 -0.05901029



Following plot demonstrates how precisely the world splits into 3 groups if we take health-related factors into consideration. All of the plots are interactive so it is possible to search for a given country among the points.

# Convert the principal components to a data frame
principal_health_Df <- as.data.frame(pca_health$x)
colnames(principal_health_Df) <- c("principal_component_1", "principal_component_2")
principal_health_Df$cluster <- X$cluster
principal_health_Df$Country <- df$Country

# Plot the clusters on the first two principal components
#fig = plot_ly(data =principal_health_Df, x=~principal_component_1, y=~principal_component_2, type="scatter", mode="markers", color=~cluster, colors=c("red","green","blue"))
fig <- plot_ly()%>%
  add_trace(data =principal_health_Df, x=~principal_component_1, y=~principal_component_2, type="scatter", text = ~Country, hovertemplate = paste('<b>%{text}<extra></extra></b>'), 
            showlegend = FALSE,
            mode="markers", color=~cluster, colors=c("red","green","blue"))%>%
  layout(title="Clusters of Countries in PC Space",  legend=list(title=list(text='Country')))
fig

MDS

Now let’s compare PCA with MDS results. Both methods were devised for dimension reduction purpose, but instead of calculating eigenvectors, MDS algorythm translates the values of the points into 2-dimensional space by minimising the STRESS function taking account of the difference between original distances between datapoints and those obtained in the 2-dimensional space. As a standard linear method PCA is believed to be less applicable for data with less correlated features. Nevertheless, in case of this dataset the results of both methods are very similar.

# Perform MDS on the data
mds <- cmdscale(dist(X_scaled[-c(1,5)]))

# Create a data frame with the coordinates from MDS
mds_df <- data.frame(x = mds[, 1], y = mds[, 2], Country = df$Country)
mds_df$cluster <- X$cluster

fig <- plot_ly()%>%
  add_trace(data =mds_df, x=~x, y=~y, type="scatter", text = ~Country, hovertemplate = paste('<b>%{text}<extra></extra></b>'), 
            showlegend = FALSE,
            mode="markers", color=~cluster, colors=c("red","green","blue"))%>%
  layout(title="Multidimensional Scaling (MDS)",  legend=list(title=list(text='Country')))
fig



Next cell plots the data in a 3-dimensional space of 3 most discriminating features (as the analysis of PCA loadings suggests). These are: percentage of population aged 65+, prevalence of insufficient physical activity and prevalence of active smokers

#Adding Country labels to X
X$Country <- df$Country
 
# Create a scatter plot 
fig <- plot_ly()%>%
  add_trace(data = X, x=~elderly_percent, y=~tobb_preval, z=~insuff_phys_act, type="scatter3d" ,
            text = ~Country, hovertemplate = paste('<b>%{text}<extra></extra></b>'), 
            showlegend = FALSE,
            mode="markers", marker = list(size = 5), color=~cluster, colors=c("red","green","blue"))%>%
  layout(title="Most discriminating variables",  legend=list(title=list(text='Country')))
fig



Two next plots demonstrate alarming relationships between logarithms of healthcare expenditures and two variables: prevalence of obesity (Pearson Correlation Coefficient = 0.67) and levels of insufficient physical activity. Logarithmic transformation allows for interpretation of relationship in terms of percentage changes.

On the first plot we may note that the level of obesity observed within population is highest in case of the second cluster of countries and on average 3% lower for 3 cluster. The dependency between healthcare expenditures and prevalence of insufficient physical activity is not as straightforward and the Pearson Correlation Coefficient in this case cannot be calculated, but the plot also suggests positive correlation.

# Create a scatter plot
fig <- plot_ly() %>%
  add_trace(data = X, x=~log(health_expend), y=~log(obes_preval), text = ~Country, hovertemplate = paste('<b>%{text}<extra></extra></b>'),type="scatter", mode="markers", color=~cluster, colors=c("red","green","blue"))%>% layout(title="Prevalence of obesity vs Health expenditures",legend=list(title=list(text='Country'))) 
fig
# Calculate the correlation between the prevalence of obesity and health expenditure variables
rho <- cor(log(X$obes_preval), log(X$health_expend))
print(paste("Correlation coefficient for prevalence of obesity and health expenditure:", rho))
## [1] "Correlation coefficient for prevalence of obesity and health expenditure: 0.671869318977173"
# Create a scatter plot with the most discriminating variables (ARE THEY REALLY THE MOST DISCRIMINATING, WHICH ONE DIFFERENTIATES YELLOW/GREEN)
fig <- plot_ly()%>%
  add_trace(data = X, x=~log(health_expend), y=~log(insuff_phys_act), text = ~Country, hovertemplate = paste('<b>%{text}<extra></extra></b>'),
               type="scatter", mode="markers", color=~cluster, colors=c("red","green","blue"))%>%
  layout(title="Insufficient physical activity vs Health expenditures ",  legend=list(title=list(text='Country')))
fig



By seeing the heads of df groups we can anticipate the geographical distribution of clusters.

# Print the first few rows of each cluster, COUNTRIES??
df_cl %>% filter(cluster == 1) %>% head() %>% select(Country) %>% print() #1 CLUSTER
##             Country
## 1           Albania
## 2        Azerbaijan
## 3        Bangladesh
## 4             Benin
## 5 Brunei Darussalam
## 6      Burkina Faso
df_cl %>% filter(cluster == 2) %>% head() %>% select(Country) %>% print() #2 CLUSTER
##     Country
## 1 Argentina
## 2   Armenia
## 3  Barbados
## 4   Belarus
## 5  Botswana
## 6    Brazil
df_cl %>% filter(cluster == 3) %>% head() %>% select(Country) %>% print() #3 CLUSTER
##                  Country
## 1              Australia
## 2                Austria
## 3                Belgium
## 4 Bosnia and Herzegovina
## 5               Bulgaria
## 6                 Canada

World Map indicating Clusters of Countries + Conclusion

Although the project was not designed with such an intention, the results of the study reveal a health-related version of the twentieth century division for three worlds, which vary highly when it comes to longevity and standards of living.The first, blue cluster consists mainly of developed OECD countries, green cluster represents post-communist and Arabic countries, and the third, red cluster is composed of the developing countries. Amid interesting conclusions from this study, it should be noted that tobacco and alcohol consumption obtain the highest level in the first, blue cluster, but the populations of the second, green cluster struggle with highest level of obesity and insufficient physical activity in the world.

# Read in the country codes data
df1 <- read.csv("CountryCodes.csv")
colnames(df1) <- c("Country","Code")

# Merge the data frames
df_merge_cl <- merge(df_cl, df1, by="Country")


# Create a world map
fig <- plot_ly(df_merge_cl, type='choropleth', locations=df_merge_cl$Code, z=df_merge_cl$cluster, text=df$COUNTRY,
               color=df_merge_cl$cluster, colors=c("red","green","blue"), width = 900, height = 700)
fig <- fig %>% colorbar(title = "cluster") %>% layout(title = 'World Map Indicating Clusters of Countries ')

fig