This project aims to analyze the relationship between universal health coverage and health outcomes across the world, focusing on certain key indicators:
Households with more than 10% of expenditure on healthcare (%)
Households with more than 25% of expenditure on healthcare (%)
Government budget spent on healthcare (%)
Life expectancy at birth (years)
Neonatal mortality (%)
By using data from the World Health Organization (WHO), the project seeks to explore how varying levels of government healthcare investment are associated with public health, contributing to the understanding of health interventions globally.
I chose this topic and dataset because I care about public health and the way governments spend their money. When I found the dataset on the WHO website, I was intrigued and decided that it would make a good project topic.
The data was collected by WHO from each country’s own reporting organization. Details can be found in each indicator’s .csv file.
Cleaning consisted mainly of narrowing the 26 indicators down to the ones that would be useful. The remaining steps were removing unnecessary columns (all but the year, region, country, and measured value), renaming columns, replacing country codes with common names, converting the neonatal mortality rate from deaths per 1,000 live births to a percentage, and finally combining the data into one dataset.
The UHC index measures coverage of essential health services on a scale from 0 to 100, with higher values indicating better coverage.
“Universal health coverage (UHC) means that all people have access to the full range of quality health services they need, when and where they need them, without financial hardship. It covers the full continuum of essential health services, from health promotion to prevention, treatment, rehabilitation and palliative care.”
World Health Organization. “Universal Health Coverage.” World Health Organization, www.who.int/health-topics/universal-health-coverage. Accessed 16 Dec. 2024.
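# Load dplyr (pipes, filter, select, rename, joins) and purrr (reduce)
library(dplyr)
library(purrr)
# Create a vector containing the names of all the variables
# (the six indicator data frames must already be loaded under these names)
variables <- c("health_expenditure_10", "health_expenditure_25", "gov_health_expenditure", "uhc_coverage", "life_expectancy", "neonatal_mortality")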
# Remove non-country rows of each dataset
# Use a for loop to simplify the code (used ChatGPT for unfamiliar function)
for (var in variables) {
  # Use get() to retrieve the dataset
  dataset <- get(var)
  # Filter out non-country rows and use select to remove unnecessary columns
  cleaned_data <- dataset %>%
    filter(SpatialDimension == "COUNTRY") %>%
    # Remove columns that are not useful
    select(
      -SpatialDimension, -Id, -IndicatorCode, -ParentLocationCode, -TimeDimension,
      -DisaggregatingDimension1, -DisaggregatingDimension1ValueCode,
      -DisaggregatingDimension2, -DisaggregatingDimension2ValueCode,
      -DisaggregatingDimension3, -DisaggregatingDimension3ValueCode,
      -DataSourceDimension, -DataSourceDimensionValueCode,
      -Value, -Low, -High, -Date,
      -TimeDimensionBegin, -TimeDimensionEnd, -TimeDimensionValue, -Comments
    )
  # Reassign the cleaned data to the same variable name
  assign(var, cleaned_data, envir = .GlobalEnv)
}
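The get()/assign() loop works, but it depends on every data frame living in the global environment. As a minimal sketch of an alternative, the same filter-and-select step can be applied to a named list with lapply(). The two toy data frames below are hypothetical stand-ins for the WHO indicator tables: they use only a few of the real column names and contain made-up values.

```r
library(dplyr)

# Hypothetical stand-ins for two indicator tables (made-up values)
life_expectancy <- data.frame(
  SpatialDimension = c("COUNTRY", "REGION", "COUNTRY"),
  SpatialDimensionValueCode = c("FRA", "EUR", "KEN"),
  TimeDim = c(2019, 2019, 2019),
  NumericValue = c(80, 75, 65),
  Comments = NA
)
uhc_coverage <- data.frame(
  SpatialDimension = c("COUNTRY", "COUNTRY", "REGION"),
  SpatialDimensionValueCode = c("FRA", "KEN", "AFR"),
  TimeDim = c(2019, 2019, 2019),
  NumericValue = c(85, 50, 45),
  Comments = NA
)

# Collect the tables in a named list instead of looking them up with get()
raw <- list(life_expectancy = life_expectancy, uhc_coverage = uhc_coverage)

# Apply the same cleaning step to every table in the list
cleaned <- lapply(raw, function(df) {
  df %>%
    filter(SpatialDimension == "COUNTRY") %>%
    select(-SpatialDimension, -Comments)
})

# Number of country rows left in each table
sapply(cleaned, nrow)
```

With this layout the later steps can keep operating on the list, so nothing needs to be written back to the global environment.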
# Rename columns for readability
# Use a for loop to simplify the code (used ChatGPT for unfamiliar function)
for (var in variables) {
  # Use get() to retrieve the dataset
  dataset <- get(var)
  # Create the new column name using the dataset's name
  new_column_name <- var
  # Rename columns, including changing 'NumericValue' to the name of the dataset
  renamed_data <- dataset %>%
    rename(
      country = SpatialDimensionValueCode,
      region = ParentLocation,
      year = TimeDim,
      !!new_column_name := NumericValue # Dynamically rename NumericValue
    )
  # Reassign the renamed data to the same variable name
  assign(var, renamed_data, envir = .GlobalEnv)
}
# Get rid of renamed_data, now that it is not needed
rm(renamed_data)
# Use a for loop to simplify the code (used ChatGPT for unfamiliar function)
for (var in variables) {
  # Use get() to retrieve the dataset
  dataset <- get(var)
  # Replace country codes with country names using the country_codes mapping
  dataset$country <- country_codes[dataset$country]
  # Reassign the updated dataset
  assign(var, dataset, envir = .GlobalEnv)
}
# Get rid of dataset, now that it is not needed
rm(dataset)
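The lookup relies on indexing a named character vector by name. The definition of country_codes is not shown here, so the three-entry mapping below is a hypothetical stand-in. The sketch only illustrates the mechanism, including what happens to a code that is missing from the mapping.

```r
# Hypothetical mapping from country codes to common names
country_codes <- c(FRA = "France", KEN = "Kenya", PAK = "Pakistan")

# Toy data with one code ("XKX") that is not in the mapping
dataset <- data.frame(
  country = c("KEN", "FRA", "XKX"),
  year = c(2019, 2019, 2019),
  stringsAsFactors = FALSE
)

# Indexing by name returns the matching value for each code;
# unname() drops the leftover code names from the result
dataset$country <- unname(country_codes[dataset$country])

dataset$country
# "Kenya" "France" NA
```

Codes that are absent from the mapping become NA, so a quick anyNA(dataset$country) after the replacement is a cheap way to catch unmapped countries.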
# Give NA to incorrect data in health_expenditure_10
health_expenditure_10[c(288, 4138), "health_expenditure_10"] <- NA
# Convert neonatal mortality rate from per 1000 to percentage
neonatal_mortality$neonatal_mortality <- neonatal_mortality$neonatal_mortality / 10
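The division by 10 follows from the units: a rate of x per 1,000 equals (x / 1000) * 100 percent, which simplifies to x / 10. A small worked example with made-up rates:

```r
# Made-up neonatal mortality rates, in deaths per 1,000 live births
per_1000 <- c(2, 25, 40)

# per 1,000 -> percent: (x / 1000) * 100 = x / 10
percent <- per_1000 / 10

percent
# 0.2 2.5 4.0
```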
# Get all the datasets in a list
datasets <- lapply(variables, get)
# Ensure unique country-year combinations for each dataset by removing duplicates
datasets_cleaned <- lapply(datasets, function(df) {
  df %>%
    distinct(country, year, .keep_all = TRUE) # Keep only unique country-year pairs
})
# Combine the datasets by 'country' and 'year'
combined_data <- reduce(datasets_cleaned, function(x, y) {
  left_join(x, y, by = c("country", "year"), relationship = "many-to-many")
})
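To make the de-duplicate-then-join step concrete, here is a minimal sketch on three made-up tables. The table contents are hypothetical. Because distinct() leaves each country-year pair unique in every table, the joins in this sketch are one-to-one and the relationship argument is left out.

```r
library(dplyr)
library(purrr)

# Made-up indicator tables keyed by country and year
life_expectancy <- data.frame(
  country = c("A", "A", "B"),
  year = c(2019, 2019, 2019), # country A is duplicated on purpose
  life_expectancy = c(70, 71, 80)
)
uhc_coverage <- data.frame(
  country = c("A", "B"),
  year = c(2019, 2019),
  uhc_coverage = c(55, 82)
)
neonatal_mortality <- data.frame(
  country = "A",
  year = 2019,
  neonatal_mortality = 2.1
)

tables <- list(life_expectancy, uhc_coverage, neonatal_mortality)

# Keep the first row for each country-year pair
tables <- lapply(tables, function(df) {
  distinct(df, country, year, .keep_all = TRUE)
})

# Fold the list into one table; rows missing from a later table get NA
combined <- reduce(tables, function(x, y) {
  left_join(x, y, by = c("country", "year"))
})

combined
```

Country B has no neonatal mortality row, so its value in the combined table is NA. The first table in the list decides which country-year pairs survive, because every join is a left join.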
# Remove the extra region columns
combined_data <- combined_data %>%
  # Temporarily rename the leftmost region column
  rename(column = region.x) %>%
  # Remove any columns containing "region"
  select(-matches("region")) %>%
  # Rename the leftmost region column back to region
  rename(region = column)
# Give Kosovo the Europe region
combined_data <- combined_data %>%
  mutate(region = if_else(country == "Kosovo", "Europe", region))
# Get rid of now unnecessary variables
rm(new_column_name, var)
# Perform multiple linear regression, with life expectancy as the dependent variable
model <- lm(life_expectancy ~ health_expenditure_10 +
              health_expenditure_25 +
              gov_health_expenditure +
              uhc_coverage, data = combined_data)
# Display the summary of the model
summary(model)
Call:
lm(formula = life_expectancy ~ health_expenditure_10 + health_expenditure_25 +
    gov_health_expenditure + uhc_coverage, data = combined_data)

Residuals:
     Min       1Q   Median       3Q      Max
-18.4235  -3.0380   0.2466   3.4471  11.3627

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)            49.48631    1.23340  40.122   <2e-16 ***
health_expenditure_10  -0.05077    0.05233  -0.970   0.3328
health_expenditure_25   0.35165    0.23595   1.490   0.1374
gov_health_expenditure  0.15082    0.08468   1.781   0.0761 .
uhc_coverage            0.33304    0.02256  14.764   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.917 on 258 degrees of freedom
  (725 observations deleted due to missingness)
Multiple R-squared:  0.6142, Adjusted R-squared:  0.6082
F-statistic: 102.7 on 4 and 258 DF,  p-value: < 2.2e-16
A median residual of 0.25 suggests that the errors are roughly centered on zero, although the largest negative residual (-18.4) is larger in magnitude than the largest positive one (11.4).
The intercept of 49.49 is the predicted life expectancy when all four predictors equal zero. No country is in that situation, so the value is an extrapolation rather than a realistic estimate. It is statistically significant, with a very low p-value.
Neither household expenditure variable is a statistically significant predictor of life expectancy in this model. The coefficient for the percentage of households that spend more than 10% on healthcare is -0.05 (p = 0.33), and the coefficient for the percentage that spend more than 25% is 0.35 (p = 0.14).
The coefficient for gov_health_expenditure is 0.1508, indicating that each additional percentage point of government health expenditure is associated with a 0.15-year increase in life expectancy, holding the other predictors constant. With a p-value of 0.076, the relationship is significant at the 10% level but not at the 5% level.
The coefficient for uhc_coverage is 0.333, meaning that each one-unit increase in the UHC index is associated with a 0.33-year increase in life expectancy, holding the other predictors constant. This variable is highly statistically significant, with a very low p-value, indicating a strong relationship between the UHC index and life expectancy.
The adjusted R-squared value is 0.608, so the four predictors explain about 61% of the variation in life expectancy, which is a moderate fit. Because 725 observations were dropped for missing values, the model is based on only 263 complete rows.
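P-values alone say little about the plausible size of each effect. The following is a minimal sketch, using simulated data rather than the WHO dataset, of how confint() reports a range for each coefficient and how a coefficient translates into a predicted difference. The data frame toy, the model toy_model, and the coefficients used to simulate the data are all hypothetical. On the real data the same calls would take model instead.

```r
set.seed(1)

# Simulated stand-in for combined_data (not the WHO data)
n <- 200
toy <- data.frame(
  uhc_coverage = runif(n, 30, 90),
  gov_health_expenditure = runif(n, 2, 20)
)
toy$life_expectancy <- 50 + 0.3 * toy$uhc_coverage +
  0.1 * toy$gov_health_expenditure + rnorm(n, sd = 5)

# Fit a smaller version of the regression on the simulated data
toy_model <- lm(life_expectancy ~ uhc_coverage + gov_health_expenditure, data = toy)

# 95% confidence intervals for each coefficient
ci <- confint(toy_model, level = 0.95)
ci

# Predicted change in life expectancy for a 10-point rise in the UHC index,
# holding government health expenditure fixed
ten_point_effect <- 10 * coef(toy_model)[["uhc_coverage"]]
ten_point_effect
```

An interval that excludes zero corresponds to significance at the 5% level, and its width shows how precisely the effect is estimated.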
Figure: Universal health coverage and life expectancy by country, colored by region, with a fitted trend line. Source: WHO
Coloring by region makes it easy to see that African countries tend to have both low universal health coverage and low life expectancy. In contrast, the countries with the highest life expectancy and universal health coverage are in Europe, the Western Pacific region, and the Americas.
Especially in Europe and the Western Pacific region, higher universal health coverage is strongly correlated with higher life expectancy. This correlation does not establish a causal relationship, but one seems plausible.
Figure: Hexbin plot of universal health coverage and neonatal mortality rate, with a fitted trend line. Source: WHO
This hexbin plot shows a strong negative correlation between neonatal mortality rates and universal health coverage: countries with better UHC tend to have lower neonatal mortality rates. The plot also shows density, and the observations are concentrated in the area of high UHC and low neonatal mortality.
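A hexbin plot of this kind can be sketched with ggplot2 as follows. The plotting code itself is not shown in this report, so the axis assignment, bin count, labels, and the simulated data frame toy are all assumptions. The sketch uses after_stat(count), which is the current replacement for the deprecated ..count.. notation.

```r
library(ggplot2) # geom_hex() also needs the hexbin package installed

set.seed(1)

# Simulated stand-in for combined_data (not the WHO data)
toy <- data.frame(uhc_coverage = runif(500, 20, 90))
toy$neonatal_mortality <- pmax(
  0,
  5 - 0.05 * toy$uhc_coverage + rnorm(500, sd = 0.5)
)

p <- ggplot(toy, aes(x = uhc_coverage, y = neonatal_mortality)) +
  # Fill each hexagon by the number of observations it contains
  geom_hex(aes(fill = after_stat(count)), bins = 20) +
  # Overlay a straight-line fit to show the overall trend
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    x = "UHC index",
    y = "Neonatal mortality (%)",
    fill = "Observations",
    caption = "Source: simulated data"
  )

p
```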
This project examined the relationship between universal health coverage and public health outcomes, including life expectancy and neonatal mortality rates. I was surprised to see that the United States had such a high universal health coverage index, but it makes sense if one considers that UHC does not necessarily indicate government-funded healthcare.
I was slightly concerned that each country can appear in multiple years for each indicator, which can be seen in the first graph, where some countries are plotted several times. These repeated observations are not independent, so the regression's standard errors may be understated. Still, the plots are useful for visualizing and conceptualizing the relationships between UHC and both life expectancy and neonatal mortality rate.
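One way to address the repeated years, sketched here as a possible follow-up rather than something this project did, is to keep a single observation per country: the most recent year with a non-missing value. The data frame toy and its values are made up for illustration.

```r
library(dplyr)

# Made-up rows with repeated years per country
toy <- data.frame(
  country = c("A", "A", "B", "B", "B"),
  year = c(2015, 2019, 2010, 2015, 2019),
  uhc_coverage = c(50, 55, 70, 75, NA)
)

# Keep the most recent year with a non-missing value for each country
latest <- toy %>%
  filter(!is.na(uhc_coverage)) %>%
  group_by(country) %>%
  slice_max(year, n = 1, with_ties = FALSE) %>%
  ungroup()

latest
```

Country B keeps its 2015 row because its 2019 value is missing. The trade-off is that countries end up being compared across different years.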
What surprised me most was the WHO definition of the Eastern Mediterranean region, which includes countries such as Pakistan and Somalia that are geographically far from the Mediterranean Sea.
Data source: World Health Organization