library(readxl)
library(dplyr)
library(readr)
library(tidyr)
library(tidyverse)
library(janitor)
library(sf)
#library(tmap)
library(skimr)
library(jtools)
library(huxtable)
library(rmapshaper)
library(WDI)
library(MASS)
library(plotly)
##Our Final Project
For our final project, we have decided to research the relationship between health spending and life expectancy. In recent years, rising life expectancy has been a key indicator of national well being and health, as well as how well the health system is performing within a country. Therefore, understanding what factors influence and determine life expectancy is of importance to policy makers. One of the most prominent factors believed by many to be influencing this relationship is the health expenditure. However, the relationship between spending and life expectancy is not straightforward—spending more does not always guarantee better outcomes. We will be then investigating this relationship using data from the World Bank, and controlling for socioeconomic and environmental factors (such as GDP per capita, access to clean water, immunization rate against measles in children, and the balance between public and private health funding). To do this, we perform a comprehensive data analysis using R: cleaning and merging multiple international datasets, generating summary statistics and visualizations, running regression models, and applying matching techniques to explore causal relationships. To conclude, we will summarise our main findings from this analysis, as well as go over limitations and reflect on causality and randomisation.
indicators <- c(
health_expenditure= "SH.XPD.CHEX.PC.CD",
life_expectancy = "SP.DYN.LE00.IN",
gdp_percapita = "NY.GDP.PCAP.CD",
clean_water_access = "SH.H2O.BASW.ZS",
public_health_expenditure = "SH.XPD.GHED.CH.ZS",
immunization = "SH.IMM.MEAS",
population = "SP.POP.TOTL"
)
data_raw <- WDI(indicator=indicators, start = 2022, end=2022, extra= TRUE)
data <- data_raw |>
dplyr::filter(region!= "Aggregates") |>
dplyr::select(iso2c, country, health_expenditure, life_expectancy, gdp_percapita, clean_water_access, public_health_expenditure, immunization, population) |>
tidyr::drop_na()
head(data,10)
| iso2c | country | health_expenditure | life_expectancy | gdp_percapita | clean_water_access | public_health_expenditure | immunization | population |
|---|---|---|---|---|---|---|---|---|
| AF | Afghanistan | 80.7 | 65.6 | 357 | 82.2 | 0.786 | 56 | 4.06e+07 |
| AL | Albania | 414 | 78.8 | 6.85e+03 | 95.1 | 45.2 | 86 | 2.78e+06 |
| DZ | Algeria | 180 | 76.1 | 4.96e+03 | 94.7 | 47.4 | 79 | 4.55e+07 |
| AD | Andorra | 3.19e+03 | 84 | 4.24e+04 | 100 | 73.5 | 98 | 7.97e+04 |
| AO | Angola | 101 | 64.2 | 2.93e+03 | 57.7 | 51.6 | 37 | 3.56e+07 |
| AG | Antigua and Barbuda | 1.08e+03 | 77.5 | 2.01e+04 | 98.4 | 57.4 | 99 | 9.28e+04 |
| AM | Armenia | 675 | 74.8 | 6.57e+03 | 100 | 17.2 | 95 | 2.97e+06 |
| AU | Australia | 6.73e+03 | 83.2 | 6.5e+04 | 100 | 74.1 | 96 | 2.6e+07 |
| AT | Austria | 5.85e+03 | 81.3 | 5.22e+04 | 100 | 77.5 | 95 | 9.04e+06 |
| AZ | Azerbaijan | 304 | 74.1 | 7.77e+03 | 97.6 | 30.3 | 93 | 1.01e+07 |
Now that we have successfully merged our data for all variables into one single united called “data”, we can compute our statistics to understand the data further.
summary(data)
## iso2c country health_expenditure life_expectancy
## Length:172 Length:172 Min. : 15.35 Min. :18.82
## Class :character Class :character 1st Qu.: 90.43 1st Qu.:67.40
## Mode :character Mode :character Median : 430.10 Median :73.61
## Mean : 1382.11 Mean :72.56
## 3rd Qu.: 1558.81 3rd Qu.:77.93
## Max. :12434.43 Max. :85.75
## gdp_percapita clean_water_access public_health_expenditure
## Min. : 250.6 Min. : 35.12 Min. : 0.786
## 1st Qu.: 2291.6 1st Qu.: 84.71 1st Qu.:36.689
## Median : 6513.8 Median : 96.82 Median :54.480
## Mean : 18183.3 Mean : 89.23 Mean :52.541
## 3rd Qu.: 21508.1 3rd Qu.: 99.75 3rd Qu.:72.053
## Max. :226052.0 Max. :100.00 Max. :92.342
## immunization population
## Min. :33.00 Min. :9.992e+03
## 1st Qu.:78.50 1st Qu.:2.610e+06
## Median :90.00 Median :1.011e+07
## Mean :84.47 Mean :4.503e+07
## 3rd Qu.:96.00 3rd Qu.:3.323e+07
## Max. :99.00 Max. :1.425e+09
#calculate specific statistics
data %>%
summarise(across(where(is.numeric),
.fns= list(
mean = ~mean(.),
median = ~median(.),
sd = ~sd(.),
min = ~min(.),
max= ~max(.))))
| health_expenditure_mean | health_expenditure_median | health_expenditure_sd | health_expenditure_min | health_expenditure_max | life_expectancy_mean | life_expectancy_median | life_expectancy_sd | life_expectancy_min | life_expectancy_max | gdp_percapita_mean | gdp_percapita_median | gdp_percapita_sd | gdp_percapita_min | gdp_percapita_max | clean_water_access_mean | clean_water_access_median | clean_water_access_sd | clean_water_access_min | clean_water_access_max | public_health_expenditure_mean | public_health_expenditure_median | public_health_expenditure_sd | public_health_expenditure_min | public_health_expenditure_max | immunization_mean | immunization_median | immunization_sd | immunization_min | immunization_max | population_mean | population_median | population_sd | population_min | population_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.38e+03 | 430 | 2.2e+03 | 15.3 | 1.24e+04 | 72.6 | 73.6 | 8.25 | 18.8 | 85.7 | 1.82e+04 | 6.51e+03 | 2.86e+04 | 251 | 2.26e+05 | 89.2 | 96.8 | 15 | 35.1 | 100 | 52.5 | 54.5 | 22.4 | 0.786 | 92.3 | 84.5 | 90 | 15.4 | 33 | 99 | 4.5e+07 | 1.01e+07 | 1.58e+08 | 9.99e+03 | 1.43e+09 |
Now that we have the basic idea of what the variables look like, we can visualise them to understand even further.
histogram_health_expenditure <-
ggplot(data, aes(x = health_expenditure)) +
geom_histogram() +
labs(title= "Health Expenditure per capita",
x= "Health Expenditure",
y= "Count") +
theme_minimal()
histogram_health_expenditure
histogram_health_expenditure <-
ggplot(data, aes(x = health_expenditure)) +
geom_histogram() +
labs(title= "Health Expenditure per capita",
x= "Health Expenditure",
y= "Count") +
theme_minimal()
histogram_health_expenditure
histogram_life_expectancy <-
ggplot(data, aes(x = life_expectancy)) +
geom_histogram() +
labs(title= "Life Expectancy (average) across Countries",
x= "Life expectancy (in years)",
y= "Count") +
theme_minimal()
histogram_life_expectancy
The
histograms above confirm the data skew that we imagined when we looked
at the statistical summary: indeed, the health expenditure is skewed to
the left and the life expectancy is skewed to the right. But these
visualizations were just concerning just a single variable, what about
the visualizations showing the relationship between the variables? For
that, we will create a scatter plot.
data$density <- with(data, densCols(health_expenditure, life_expectancy, colramp = colorRampPalette(c("green", "darkgreen"))))
scatterplot1 <- ggplot(data, aes(x = health_expenditure, y = life_expectancy)) +
geom_point(aes(color = density)) +
labs(
title = "Health Spending vs Life Expectancy",
x = "Health Spending Per Capita (USD)",
y = "Life Expectancy (years)",
color = "Density"
) +
scale_color_identity() + # Use exact colors returned by densCols
theme_minimal()
scatterplot1
Now, we
can see how there seems to be a somewhat positive relationship between
health spending and life expectancy. But the relationship seems to be
skewed and non linear, so let’s apply the log transformation to better
visualise the data.
data$density <- with(data, densCols(log(health_expenditure), life_expectancy, colramp = colorRampPalette(c("green", "darkgreen"))))
scatterplot2 <- ggplot(data,
aes(x = log(health_expenditure), y = life_expectancy)) +
geom_point(aes(color = density)) +
labs(title = "Health Spending vs Life Expectancy",
x = "Log of Health Spending Per Capita (USD)",
y = "Life Expectancy (years)",
color = "Density") +
scale_color_identity() +
theme_minimal()
scatterplot2
scatterplot2trace <- ggplot( data,
aes(x= log(health_expenditure), y= life_expectancy))+
geom_point(aes(color=density))+
geom_smooth(method="lm", se = FALSE, color = "black", linewidth= 1)+
labs(title= "Health Spending vs Life Expectancy Traced",
x= "Log of Health Spending Per Capita (USD)",
y= "Life Expectancy (years)",
color= "Density")+
scale_color_identity()+
theme_minimal()
scatterplot2trace
Now,
looking at the second scatterplot, we can see a much clearer and
linearised relationship. We can additionally see in the second graph a
clear positive correlation matching a linear relationship between health
spending and life expectancy.
##Regression analysis
In the scatter plot we have created in the previous section, there seems to clearly be a positive relationship between the two: as health expenditure increases, so does life expectancy. But are there extraneous variables that may be influencing this? For this, we return back to the controlled variables we merged into our dataset at the start. This includes GDP per capita (accounting for the fact wealthier countries may influence life expectancy), clean water access (accounting for infrastructure reasons that may influence weather countries), public health expenditure (accounting for the fact that the overall effectiveness and scope of a country’s healthcare system may be influenced by government spending, not just by private health expenditure), and immunization to measles (accounting for fact that immunization programmes and lifestyle choices may influence life expectancy).
We will now calculate the regression coefficients accounting for our controlled variables to understand the true relationship between just health expenditure and life expectancy.
model <- lm(life_expectancy ~ health_expenditure + gdp_percapita +
clean_water_access + public_health_expenditure + immunization, data = data)
summary(model)
##
## Call:
## lm(formula = life_expectancy ~ health_expenditure + gdp_percapita +
## clean_water_access + public_health_expenditure + immunization,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.58 -2.09 0.34 2.52 9.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.742e+01 2.363e+00 15.837 < 2e-16 ***
## health_expenditure 5.720e-04 2.959e-04 1.933 0.0549 .
## gdp_percapita 3.810e-05 2.317e-05 1.644 0.1020
## clean_water_access 2.845e-01 3.214e-02 8.854 1.23e-15 ***
## public_health_expenditure 4.336e-02 2.079e-02 2.085 0.0386 *
## immunization 7.099e-02 2.749e-02 2.582 0.0107 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.418 on 166 degrees of freedom
## Multiple R-squared: 0.7216, Adjusted R-squared: 0.7133
## F-statistic: 86.07 on 5 and 166 DF, p-value: < 2.2e-16
Firstly, the multiple R-squared calculated is 0.7216, meaning that this model explains for around 72% of the life expectancy variation, which indicates a strong fit of the chosen predictors collectively explaining a strong portion of variability in life expectancy. Additionally, the adjusted R squared (0.7133) is quite close to the multiple R-squared, indicating that the control variables have improved the model without overfitting.
The intercept is calculated at 37.42 years, meaning that (while being unrealistic and/or arbitrary) without all these variables, the average life expectancy would be 37.42 years. The estimate of health expenditure calculates that with an increase in 1 dollar per capita in health expenditure will result in life expectancy being increased also by 0.000572 years (AKA 0.21 days) with marginal significance.
As for our controlled variables:
For every 1 unit increase in GDP per capita, life expectancy is predicted to increase by 0.0000381 years (or about 0.014 days)
For every 1% increase in the population with access to clean water, life expectancy is predicted to increase by 0.2845 years (or approximately 104 days)
For every 1% increase in public health expenditure as a proportion of total health spending, life expectancy is predicted to increase by 0.04336 years (or about 16 days)
For every 1% increase in measles immunization coverage, life expectancy is predicted to increase by 0.07099 years (or about 26 days)
So overall, when looking at the regression analysis, we can see that increased health expenditure has a somewhat positive effect on life expectancy with marginal significance.
After our regression analysis, we have identified associations between our variables, but not causation. Just because countries that spend more on health tend to have higher life expectancy doesn’t mean that increasing health spending causes life expectancy to increase. To explore causality, we will be randomly matching. This is because in our current data set, the countries are not randomly assigned to different levels of health spending. Richer countries might both spend more and have better outcomes due to other factors (like education, infrastructure, governance, etc.). In this respect, confounding variables may be creating biases in our conclusions. By matching, this will allow us to treat this data set as a randomised experiment. It will help us compare similar countries (matching through our controlled variables) but with different levels of health spending. This will allow us to have a more “fair” estimate of the causation of health spending and life expectancy (because in this way we are now holding other factors constant but with randomization).
To do this, we will first create a treatment variable. This will be high vs low spenders on health using the median. Afterwards we will match countries using the MatchIt package. Finally, a t-test will be performed to determine whether the difference in life expectancy is significant.
data$high_spending <- ifelse(data$health_expenditure > median(data$health_expenditure), 1, 0)
options(repos = c(CRAN = "https://cran.rstudio.com/"))
install.packages("MatchIt")
##
## The downloaded binary packages are in
## /var/folders/lq/jwrt1s2519d4djhsllwp0zkh0000gn/T//RtmptBjj56/downloaded_packages
library(MatchIt)
match <- matchit(high_spending ~ gdp_percapita + clean_water_access +
immunization + public_health_expenditure, data = data, method = "nearest")
data_match<- match.data(match)
t.test(life_expectancy ~ high_spending, data = data_match)
##
## Welch Two Sample t-test
##
## data: life_expectancy by high_spending
## t = -10.349, df = 138.49, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -12.182180 -8.273955
## sample estimates:
## mean in group 0 mean in group 1
## 67.44759 77.67566
The t-test performed calculated the mean for group 0 (i.e. low spenders) to be 67.45 years and the mean for group 1 (i.e. high spending) to be 77.68 years. This indicates that countries with high levels of spending live for on average 10.2 years longer than matched countries with low levels of spending. The p-value < 2.2e-16 also indicates that this result is extremely significantly significant, and is unlikely to be due to chance. In this calculation, as mentioned previously, we used nearest neighbor matching to pair each low-spending country with a high-spending country of similar controlled variables (i.e. GDP, access to clean water, public health expenditure levels, and childhood immunisation of measles rate). This calculation supports our hypothesis that higher health spending causes higher life expectancy, and on average by 10 years. This claim is also stronger than our claim with just reliance on our regression model and analysis, since we have simulated a random experiment. However, caution should still be taken with regard to these calculations and results due to potential unobserved confounding, which will be explained further in our conclusion.
To get even more in depth, we are going to look closer into spending on a state by state basis in the USA. For this analysis we will be using data from the IHME and the official united states government data site.
setwd("~/Desktop/R")
life<- read.csv("U.S._Life_Expectancy_at_Birth_by_State_and_Census_Tract_-_2010-2015.csv")
spending<- read.csv("IHME_USA_STATE_HEALTH_SPENDING_2003_2019_DATA_Y2022M08D01.CSV")
colnames(life)
## [1] "State" "County"
## [3] "Census.Tract.Number" "Life.Expectancy"
## [5] "Life.Expectancy.Range" "Life.Expectancy.Standard.Error"
colnames(spending)
## [1] "year" "region" "division" "state" "population"
## [6] "group" "subgroup" "metric" "val" "upper"
## [11] "lower"
With the IHME data they have differenciated between private and public spending, for our purposes we will only take into account total spending so we will exclude those values while we get rid of any NAs. Additionally in the Data.gov data they have included life expectancy range and standard error, for our purposes we are only concerned with the life expectancy. The US life expectancy at birth by State is for the period of 2010-2015 while the health expenditure by state payer and type of care ranges from 2003-2019, we will also be focusing on the standardized spending per capita which is adjusted for differences like health needs or cost of care per state. We will also limit the health expenditure to only concern 2010-2015.
spending_clean <- spending %>%
as_tibble() %>%
filter(group == "Total",
subgroup == "Total",
!is.na(val),
year >= 2010, year<= 2015,
metric== "Spending per capita") %>%
dplyr::select(state, population, val)%>%
group_by(state)%>%
summarize(average_expenditure= mean(val, na.rm= TRUE),
population= first(population))
life_clean<- life%>%
dplyr::select(State, Life.Expectancy)%>%
group_by(State)%>%
drop_na()%>%
summarize(Life.Expectancy= mean(Life.Expectancy, na.rm=T))
Now that we’ve cleaned the two data sets we will combine them.
Life_spending <- left_join(life_clean, spending_clean, by=c("State"="state"))
Life_spending
| State | Life.Expectancy | average_expenditure | population |
|---|---|---|---|
| Alabama | 74.8 | 7.77e+03 | 4.78e+06 |
| Alaska | 78.9 | 1.17e+04 | 7.1e+05 |
| Arizona | 78.4 | 7.06e+03 | 6.39e+06 |
| Arkansas | 75.6 | 7.88e+03 | 2.91e+06 |
| California | 80.2 | 8.12e+03 | 3.72e+07 |
| Colorado | 79.5 | 7.32e+03 | 5.03e+06 |
| Connecticut | 80.1 | 1.06e+04 | 3.57e+06 |
| Delaware | 77.8 | 1.09e+04 | 8.98e+05 |
| District of Columbia | 76.4 | 1.33e+04 | 6.02e+05 |
| Florida | 78.4 | 8.73e+03 | 1.88e+07 |
| Georgia | 76.6 | 6.93e+03 | 9.69e+06 |
| Hawaii | 81.3 | 7.85e+03 | 1.36e+06 |
| Idaho | 79.1 | 7.26e+03 | 1.57e+06 |
| Illinois | 78.2 | 8.83e+03 | 1.28e+07 |
| Indiana | 76.8 | 8.78e+03 | 6.49e+06 |
| Iowa | 79.1 | 8.77e+03 | 3.05e+06 |
| Kansas | 78.1 | 8.36e+03 | 2.85e+06 |
| Kentucky | 75.6 | 8.5e+03 | 4.34e+06 |
| Louisiana | 75.4 | 8.56e+03 | 4.54e+06 |
| Maine | 78.4 | 1.03e+04 | 1.33e+06 |
| Maryland | 78.7 | 9.34e+03 | 5.77e+06 |
| Massachusetts | 80.3 | 1.15e+04 | 6.55e+06 |
| Michigan | 77.3 | 8.69e+03 | 9.89e+06 |
| Minnesota | 80.5 | 9.45e+03 | 5.31e+06 |
| Mississippi | 75 | 8.25e+03 | 2.97e+06 |
| Missouri | 77 | 8.75e+03 | 5.99e+06 |
| Montana | 78.8 | 8.74e+03 | 9.9e+05 |
| Nebraska | 79 | 9.12e+03 | 1.83e+06 |
| Nevada | 77.6 | 6.99e+03 | 2.7e+06 |
| New Hampshire | 80 | 1.03e+04 | 1.32e+06 |
| New Jersey | 79.6 | 9.47e+03 | 8.79e+06 |
| New Mexico | 78.5 | 7.76e+03 | 2.06e+06 |
| New York | 80.3 | 1.06e+04 | 1.94e+07 |
| North Carolina | 77.4 | 7.93e+03 | 9.53e+06 |
| North Dakota | 79.4 | 1.07e+04 | 6.74e+05 |
| Ohio | 76.6 | 9.22e+03 | 1.15e+07 |
| Oklahoma | 75.9 | 8.22e+03 | 3.75e+06 |
| Oregon | 79.1 | 8.36e+03 | 3.83e+06 |
| Pennsylvania | 78.1 | 9.92e+03 | 1.27e+07 |
| Rhode Island | 79.1 | 1.02e+04 | 1.05e+06 |
| South Carolina | 76.6 | 7.87e+03 | 4.62e+06 |
| South Dakota | 79.8 | 9.65e+03 | 8.15e+05 |
| Tennessee | 75.5 | 7.93e+03 | 6.35e+06 |
| Texas | 77.9 | 7.53e+03 | 2.51e+07 |
| Utah | 79.4 | 6.33e+03 | 2.76e+06 |
| Vermont | 81 | 1.07e+04 | 6.26e+05 |
| Virginia | 78.3 | 8.17e+03 | 8e+06 |
| Washington | 79.7 | 8.48e+03 | 6.73e+06 |
| West Virginia | 75.7 | 1e+04 | 1.85e+06 |
| Wisconsin | 79 | 9.3e+03 | 5.68e+06 |
| Wyoming | 80.1 | 9.04e+03 | 5.63e+05 |
Now that we have our data clean and all together we can show this relationship in a graph
library(ggplot2)
library(plotly)
Life_spending$tooltip_text <- paste(
"State: ", Life_spending$State, "<br>",
"Health Expenditure per Capita: $", round(Life_spending$average_expenditure, 2), "<br>",
"Life Expectancy: ", round(Life_spending$Life.Expectancy, 2), " years", "<br>",
"Population: ", format(Life_spending$population, big.mark = ",")
)
p <- ggplot(Life_spending, aes(x = average_expenditure, y = Life.Expectancy, size = population, text = tooltip_text)) +
geom_point(alpha = 0.7, color = "blue") +
labs(title = "Life Expectancy vs Health Expenditure by State",
x = "Health Expenditure per Capita",
y = "Life Expectancy (Years)") +
theme_minimal()
interactive_plot <- ggplotly(p, tooltip = "text")
interactive_plot
Based on the graph, it seems that the relationship isn’t quite as strong in this data as it was in the global data. To make sure we will take a regression.
summary (lm(Life.Expectancy ~ average_expenditure+ population, data = Life_spending))
##
## Call:
## lm(formula = Life.Expectancy ~ average_expenditure + population,
## data = Life_spending)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0268 -1.1539 0.2833 1.2886 3.5397
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.536e+01 1.624e+00 46.395 <2e-16 ***
## average_expenditure 3.034e-04 1.738e-04 1.746 0.0872 .
## population 2.553e-08 3.514e-08 0.727 0.4710
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.665 on 48 degrees of freedom
## Multiple R-squared: 0.06273, Adjusted R-squared: 0.02368
## F-statistic: 1.606 on 2 and 48 DF, p-value: 0.2112
These findings show a relatively small relationship between average expenditure and life span with the estimate that every additional dollar spent on healthcare per capita increases the expectencay by about 0.0003034 years, or about 0.110741 days, or by about 2.66 hours.
These findings highlight that health spending does not account for all differences in life expectancy as we previously found when looking into the global data.
##Redaction
On a global scale the scatterplots seem to show a strong relationship between the two variables of Health expenditure and life expectancy, the regression analysis showed that other variables are more statistically significant. Additionally, when we further limited the scope to isolate state by state spending and life expectancy in the US, where the infrastrucutre and preventative care would in theory be more standardized than when comparing with different countries, the impact of health expenditure was not statistically significant on life expectancy, though still more significant than population. We found that GDP per capita does not have a statistically significant effect on life expectancy in our global model which suggests that after accounting for other health-related factors, wealth alone may not directly explain life expectancy differences.