Project 3

Author

Ryan Seabold

Using WHO Data to Compare Universal Health Coverage and Public Health Indicators

World Health Organization logo

Source: United Nations

This project aims to analyze the relationship between universal health coverage and health outcomes across the world, focusing on certain key indicators:

Households with more than 10% of expenditure on healthcare (%)
Households with more than 25% of expenditure on healthcare (%)
Government budget spent on healthcare (%)
Life expectancy at birth (years)
Neonatal mortality (%)

By using data from the World Health Organization (WHO), the project seeks to explore how varying levels of government healthcare investment are associated with public health, contributing to the understanding of health interventions globally.

I chose this topic and dataset because I care about public health and the way governments spend their money. When I found the dataset on the WHO website, I was intrigued and decided that it would make a good project topic.

The data was collected by WHO from each country’s own reporting organization. Details can be found in each indicator’s .csv file.

Cleaning consisted mainly of narrowing the 26 available indicators down to the ones useful for this analysis. Beyond that, I removed unnecessary columns (keeping only the year, region, country, and measured value), renamed columns, replaced country codes with common country names, converted the neonatal mortality rate from deaths per 1,000 live births to a percentage, and combined everything into one dataset.

Definitions of Universal Health Coverage

The UHC index (the WHO's UHC service coverage index) scores coverage of essential health services on a scale from 0 to 100; higher values indicate broader coverage.

“Universal health coverage (UHC) means that all people have access to the full range of quality health services they need, when and where they need them, without financial hardship. It covers the full continuum of essential health services, from health promotion to prevention, treatment, rehabilitation and palliative care.”

World Health Organization. “Universal Health Coverage.” World Health Organization, www.who.int/health-topics/universal-health-coverage. Accessed 16 Dec. 2024.

Prepare the files and working environment

Set the working directory

Import the necessary libraries

Warning: package 'plotly' was built under R version 4.4.2

Warning: package 'hexbin' was built under R version 4.4.2

Import the data

Remove unnecessary columns

# Create a vector containing the names of all the variables
variables <- c("health_expenditure_10", "health_expenditure_25", "gov_health_expenditure", "uhc_coverage", "life_expectancy", "neonatal_mortality")

# Remove non-country rows of each dataset
# Use a for loop to simplify the code (used ChatGPT for unfamiliar function)
for (var in variables) {
  # Use get() to retrieve the dataset
  dataset <- get(var)
  
  # Filter out non-country rows and use select to remove unnecessary columns
  cleaned_data <- dataset %>%
                  filter(SpatialDimension == "COUNTRY") %>%
                  # Remove columns that are not useful
                  select(
                    -SpatialDimension, -Id, -IndicatorCode, -ParentLocationCode,
                    -TimeDimension,
                    -DisaggregatingDimension1, -DisaggregatingDimension1ValueCode,
                    -DisaggregatingDimension2, -DisaggregatingDimension2ValueCode,
                    -DisaggregatingDimension3, -DisaggregatingDimension3ValueCode,
                    -DataSourceDimension, -DataSourceDimensionValueCode,
                    -Value, -Low, -High, -Date,
                    -TimeDimensionBegin, -TimeDimensionEnd, -TimeDimensionValue,
                    -Comments
                  )
  
  # Reassign the cleaned data to the same variable name
  assign(var, cleaned_data, envir = .GlobalEnv)
}

Rename the columns for readability

# Rename columns for readability
# Use a for loop to simplify the code (used ChatGPT for unfamiliar function)
for (var in variables) {
  # Use get() to retrieve the dataset
  dataset <- get(var)
  
  # Create the new column name using the dataset's name
  new_column_name <- var
  
  # Rename columns, including changing 'NumericValue' to the name of the dataset
  renamed_data <- dataset %>%
                  rename(
                    country = SpatialDimensionValueCode,
                    region = ParentLocation,
                    year = TimeDim,
                    !!new_column_name := NumericValue # Dynamically rename NumericValue
                  )
  
  # Reassign the renamed data to the same variable name
  assign(var, renamed_data, envir = .GlobalEnv)
}

# Get rid of renamed_data, now that it is not needed
rm(renamed_data)

Rename the countries from codes to names

# Use a for loop to simplify the code (used ChatGPT for unfamiliar function)
for (var in variables) {
  # Use get() to retrieve the dataset
  dataset <- get(var)
  
  # Replace country codes with country names using the country_codes mapping
  dataset$country <- country_codes[dataset$country]
  
  # Reassign the updated dataset
  assign(var, dataset, envir = .GlobalEnv)
}

# Get rid of dataset, now that it is not needed
rm(dataset)
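
The lookup above relies on indexing a named character vector by name. The sketch below shows how that behaves on toy input; the three-entry `country_codes` here is a hypothetical excerpt (the real mapping is defined elsewhere in the project), and `XKX` stands in for any code that is absent from the mapping.

```r
# Hypothetical excerpt of the mapping; the project's real `country_codes` is larger.
country_codes <- c(USA = "United States", FRA = "France", PAK = "Pakistan")

# Toy column of country codes, including one that is not in the mapping.
codes <- c("FRA", "USA", "XKX", "PAK")

# Indexing a named vector by name returns the matching values in order.
# unname() drops the leftover code names so the result is a plain character vector.
names_out <- unname(country_codes[codes])
print(names_out)

# Codes missing from the mapping come back as NA, so it helps to list them.
unmapped <- unique(codes[is.na(names_out)])
print(unmapped)
```

Checking `unmapped` before overwriting the column is one way to catch countries that would otherwise silently become `NA`.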

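Correct bad values and convert rates to percentages for consistency

# Set two known-incorrect values in health_expenditure_10 to NA
# (the row positions refer to the data frame as it stands after the filtering step)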
health_expenditure_10[c(288, 4138), "health_expenditure_10"] <- NA

# Convert neonatal mortality rate from per 1000 to percentage
neonatal_mortality$neonatal_mortality <- neonatal_mortality$neonatal_mortality / 10
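
As a quick check of the arithmetic: a rate per 1,000 becomes a rate per 100 when divided by 10. The helper name `per_1000_to_percent` and the input values below are illustrative only, not taken from the WHO data.

```r
# Convert a rate per 1,000 live births to a percentage (a rate per 100).
per_1000_to_percent <- function(rate_per_1000) rate_per_1000 / 10

# Hypothetical rates, not taken from the WHO data.
rates <- c(2.5, 18, 40)
percentages <- per_1000_to_percent(rates)
print(percentages)
```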

Combine the data

# Get all the datasets in a list
datasets <- lapply(variables, get)

# Ensure unique country-year combinations for each dataset by removing duplicates
datasets_cleaned <- lapply(datasets, function(df) {
  df %>%
    distinct(country, year, .keep_all = TRUE)  # Keep only unique country-year pairs
})

# Combine the datasets by 'country' and 'year'
combined_data <- reduce(datasets_cleaned, function(x, y) {
  left_join(x, y, by = c("country", "year"), relationship = "many-to-many")
})
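
Because `left_join()` keeps only the rows of its left-hand table, country-year pairs that appear in a later dataset but not in the first one do not reach the combined data. The sketch below, on toy data frames with hypothetical values, shows one way to confirm that the join keys are unique and to see which pairs a left join leaves out.

```r
library(dplyr)

# Toy stand-ins for two indicator tables (hypothetical values).
first <- data.frame(country = c("A", "A", "B"),
                    year = c(2000, 2005, 2000),
                    x = c(1, 2, 3))
second <- data.frame(country = c("A", "B", "C"),
                     year = c(2000, 2000, 2000),
                     y = c(10, 20, 30))

# After distinct(), each country-year pair should appear at most once per table.
stopifnot(anyDuplicated(first[c("country", "year")]) == 0)
stopifnot(anyDuplicated(second[c("country", "year")]) == 0)

# The left join keeps every row of `first` and fills unmatched values with NA.
joined <- left_join(first, second, by = c("country", "year"))
print(joined)

# anti_join() lists the rows of `second` that have no partner in `first`.
dropped <- anti_join(second, first, by = c("country", "year"))
print(dropped)
```

If the pairs in `dropped` matter for the analysis, `full_join()` would retain them, at the cost of more missing values in the other columns.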

Do final cleaning on the combined data

# Remove the extra region columns
combined_data <- combined_data %>%
  # Temporarily rename the leftmost region column
  rename(column = region.x) %>%
  # Remove any columns containing "region"
  select(-matches("region")) %>%
  # Rename the leftmost region column back to region
  rename(region = column)

# Give Kosovo the Europe region
combined_data <- combined_data %>%
  mutate(region = if_else(country == "Kosovo", "Europe", region))

# Get rid of now unnecessary variables
rm(new_column_name, var)

Analyze UHC and health outcomes with statistics

Perform a multiple linear regression with life expectancy as the dependent variable

# Perform multiple linear regression, with life expectancy as the dependent variable
model <- lm(life_expectancy ~ health_expenditure_10 +
                              health_expenditure_25 +
                              gov_health_expenditure +
                              uhc_coverage,
            data = combined_data)

# Display the summary of the model
summary(model)


Call:
lm(formula = life_expectancy ~ health_expenditure_10 + health_expenditure_25 + 
    gov_health_expenditure + uhc_coverage, data = combined_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.4235  -3.0380   0.2466   3.4471  11.3627 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            49.48631    1.23340  40.122   <2e-16 ***
health_expenditure_10  -0.05077    0.05233  -0.970   0.3328    
health_expenditure_25   0.35165    0.23595   1.490   0.1374    
gov_health_expenditure  0.15082    0.08468   1.781   0.0761 .  
uhc_coverage            0.33304    0.02256  14.764   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.917 on 258 degrees of freedom
  (725 observations deleted due to missingness)
Multiple R-squared:  0.6142,    Adjusted R-squared:  0.6082 
F-statistic: 102.7 on 4 and 258 DF,  p-value: < 2.2e-16

Interpret the analysis

A median residual of 0.25 suggests that the errors are roughly centered on zero, although the residuals range from -18.42 to 11.36, so the largest misses are overestimates of life expectancy.

The intercept of 49.49 is the predicted life expectancy, in years, when all four predictors equal zero: no households above either spending threshold, no government budget spent on healthcare, and a UHC index of zero. Such a combination is unlikely to occur in practice, so the value is an extrapolation rather than a realistic estimate. It is statistically significant with a very low p-value.

Neither household-spending variable is statistically significant. The coefficient for households spending more than 10% on healthcare (-0.05, p = 0.33) and the coefficient for households spending more than 25% (0.35, p = 0.14) cannot be distinguished from zero once the other predictors are in the model.

The coefficient for gov_health_expenditure is 0.1508, indicating that each additional percentage point of the government budget spent on healthcare is associated with 0.15 more years of life expectancy, holding the other predictors constant. With p = 0.076, the relationship is significant at the 10% level but not at the conventional 5% level.

The coefficient for uhc_coverage is 0.333, meaning that each one-point increase in the UHC index is associated with 0.33 more years of life expectancy, holding the other predictors constant. This variable is highly statistically significant (p < 2e-16), making it by far the most reliable predictor in the model.

The adjusted R-squared value is 0.608, so the four predictors together explain about 61% of the variance in life expectancy, which is a moderate fit. Only 263 observations were used, because 725 were dropped for missing values.
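
To make the coefficients concrete, the sketch below plugs the estimates and standard errors from the summary output into a prediction and into approximate 95% confidence intervals. The predictor values for the example country-year are hypothetical, chosen only to illustrate the arithmetic.

```r
# Estimates and standard errors copied from the summary(model) output.
est <- c(intercept = 49.48631,
         health_expenditure_10 = -0.05077,
         health_expenditure_25 = 0.35165,
         gov_health_expenditure = 0.15082,
         uhc_coverage = 0.33304)
se <- c(intercept = 1.23340,
        health_expenditure_10 = 0.05233,
        health_expenditure_25 = 0.23595,
        gov_health_expenditure = 0.08468,
        uhc_coverage = 0.02256)

# Hypothetical country-year; the leading 1 multiplies the intercept.
x <- c(1, 10, 2, 10, 70)
predicted <- sum(est * x)
print(round(predicted, 2))

# 95% intervals use the t distribution with the model's 258 residual degrees of freedom.
t_crit <- qt(0.975, df = 258)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 3))
```

The interval for gov_health_expenditure spans zero while the interval for uhc_coverage does not, which matches the p-values reported in the summary.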

Universal health coverage and life expectancy

Warning: Removed 722 rows containing non-finite outside the scale range
(`stat_smooth()`).

Source: WHO

Coloring by region makes it easy to see that African countries tend to have both low UHC index values and low life expectancy. In contrast, the countries with the highest life expectancy and UHC index values are in Europe, the Western Pacific, and the Americas.

Especially in Europe and the Western Pacific, the UHC index is strongly and positively correlated with life expectancy. Correlation alone does not establish causation, but a causal link seems plausible.
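
A per-region correlation coefficient is one way to put a number on this visual impression. The sketch below uses a toy data frame with hypothetical values in place of `combined_data`; only the column names are taken from the project.

```r
# Toy data standing in for combined_data (hypothetical values).
toy <- data.frame(
  region = rep(c("Europe", "Africa"), each = 4),
  uhc_coverage = c(70, 75, 80, NA, 35, 40, 45, 50),
  life_expectancy = c(76, 78, 81, 82, 58, 57, 62, 63)
)

# Pearson correlation within each region, skipping rows with missing values.
by_region <- sapply(split(toy, toy$region), function(df) {
  cor(df$uhc_coverage, df$life_expectancy, use = "complete.obs")
})
print(round(by_region, 2))
```

Run on the real data, the same pattern would show whether Europe and the Western Pacific really have the strongest within-region correlations.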

Universal health coverage and neonatal mortality rates

Warning: Removed 721 rows containing non-finite outside the scale range
(`stat_binhex()`).

Warning: Removed 721 rows containing non-finite outside the scale range
(`stat_smooth()`).

Source: WHO

This hexbin plot shows a clear negative association between neonatal mortality and the UHC index: countries with a higher UHC index tend to have lower neonatal mortality. The plot also shows density, and the observations cluster most heavily where UHC is high and neonatal mortality is low.

Conclusion

This project examined the relationship between universal health coverage and public health outcomes, including life expectancy and neonatal mortality rates. I was surprised to see that the United States had such a high universal health coverage index, but it makes sense if one considers that UHC does not necessarily indicate government-funded healthcare.

One concern is that each country appears in multiple years for each indicator, which is visible in the first graph: some countries are plotted several times. Those repeated observations are not independent of one another, so the regression's p-values may be more optimistic than they should be. Still, the plots are useful for visualizing the relationships between UHC and both life expectancy and neonatal mortality.
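
One possible way to address the repeated years, not used in this project, is to keep a single row per country, for example the most recent year with a non-missing value. The sketch below demonstrates the idea on a toy data frame with hypothetical values.

```r
# Toy data with several years per country (hypothetical values).
toy <- data.frame(
  country = c("A", "A", "B", "B", "B"),
  year = c(2010, 2015, 2000, 2010, 2019),
  uhc_coverage = c(60, 65, 40, 45, NA)
)

# Drop rows with a missing value, then sort so the newest year comes first per country.
complete <- toy[!is.na(toy$uhc_coverage), ]
ordered <- complete[order(complete$country, -complete$year), ]

# duplicated() flags every row after the first per country, so negating it keeps the latest.
latest <- ordered[!duplicated(ordered$country), ]
print(latest)
```

Keeping one row per country trades sample size for independence between observations; averaging across years would be another option.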

What surprised me most was the WHO definition of the Eastern Mediterranean region, which includes countries such as Pakistan and Somalia that are geographically far from the Mediterranean Sea.

Data source: World Health Organization