Final Project

library(readxl)
library(dplyr)
library(readr)
library(tidyr)
library(tidyverse)
library(janitor)
library(sf)
#library(tmap)
library(skimr)
library(jtools)
library(huxtable)
library(rmapshaper)
library(WDI)
library(MASS)
library(plotly)

##Our Final Project

For our final project, we have decided to research the relationship between health spending and life expectancy. In recent years, rising life expectancy has been a key indicator of national well being and health, as well as how well the health system is performing within a country. Therefore, understanding what factors influence and determine life expectancy is of importance to policy makers. One of the most prominent factors believed by many to be influencing this relationship is the health expenditure. However, the relationship between spending and life expectancy is not straightforward—spending more does not always guarantee better outcomes. We will be then investigating this relationship using data from the World Bank, and controlling for socioeconomic and environmental factors (such as GDP per capita, access to clean water, immunization rate against measles in children, and the balance between public and private health funding). To do this, we perform a comprehensive data analysis using R: cleaning and merging multiple international datasets, generating summary statistics and visualizations, running regression models, and applying matching techniques to explore causal relationships. To conclude, we will summarise our main findings from this analysis, as well as go over limitations and reflect on causality and randomisation.

indicators <- c(
  health_expenditure= "SH.XPD.CHEX.PC.CD",     
  life_expectancy = "SP.DYN.LE00.IN",                    
  gdp_percapita = "NY.GDP.PCAP.CD",                    
  clean_water_access = "SH.H2O.BASW.ZS",        
  public_health_expenditure = "SH.XPD.GHED.CH.ZS",       
  immunization = "SH.IMM.MEAS",
  population = "SP.POP.TOTL"
)

data_raw <- WDI(indicator=indicators, start = 2022, end=2022, extra= TRUE)
  
data <- data_raw |>
  dplyr::filter(region!= "Aggregates") |>
  dplyr::select(iso2c, country, health_expenditure, life_expectancy, gdp_percapita, clean_water_access, public_health_expenditure, immunization, population) |>
  tidyr::drop_na()
head(data,10)

iso2c	country	health_expenditure	life_expectancy	gdp_percapita	clean_water_access	public_health_expenditure	immunization	population
AF	Afghanistan	80.7	65.6	357	82.2	0.786	56	4.06e+07
AL	Albania	414	78.8	6.85e+03	95.1	45.2	86	2.78e+06
DZ	Algeria	180	76.1	4.96e+03	94.7	47.4	79	4.55e+07
AD	Andorra	3.19e+03	84	4.24e+04	100	73.5	98	7.97e+04
AO	Angola	101	64.2	2.93e+03	57.7	51.6	37	3.56e+07
AG	Antigua and Barbuda	1.08e+03	77.5	2.01e+04	98.4	57.4	99	9.28e+04
AM	Armenia	675	74.8	6.57e+03	100	17.2	95	2.97e+06
AU	Australia	6.73e+03	83.2	6.5e+04	100	74.1	96	2.6e+07
AT	Austria	5.85e+03	81.3	5.22e+04	100	77.5	95	9.04e+06
AZ	Azerbaijan	304	74.1	7.77e+03	97.6	30.3	93	1.01e+07

Summary Statistics

Now that we have successfully merged our data for all variables into one single united called “data”, we can compute our statistics to understand the data further.

summary(data)

##     iso2c             country          health_expenditure life_expectancy
##  Length:172         Length:172         Min.   :   15.35   Min.   :18.82  
##  Class :character   Class :character   1st Qu.:   90.43   1st Qu.:67.40  
##  Mode  :character   Mode  :character   Median :  430.10   Median :73.61  
##                                        Mean   : 1382.11   Mean   :72.56  
##                                        3rd Qu.: 1558.81   3rd Qu.:77.93  
##                                        Max.   :12434.43   Max.   :85.75  
##  gdp_percapita      clean_water_access public_health_expenditure
##  Min.   :   250.6   Min.   : 35.12     Min.   : 0.786           
##  1st Qu.:  2291.6   1st Qu.: 84.71     1st Qu.:36.689           
##  Median :  6513.8   Median : 96.82     Median :54.480           
##  Mean   : 18183.3   Mean   : 89.23     Mean   :52.541           
##  3rd Qu.: 21508.1   3rd Qu.: 99.75     3rd Qu.:72.053           
##  Max.   :226052.0   Max.   :100.00     Max.   :92.342           
##   immunization     population       
##  Min.   :33.00   Min.   :9.992e+03  
##  1st Qu.:78.50   1st Qu.:2.610e+06  
##  Median :90.00   Median :1.011e+07  
##  Mean   :84.47   Mean   :4.503e+07  
##  3rd Qu.:96.00   3rd Qu.:3.323e+07  
##  Max.   :99.00   Max.   :1.425e+09

#calculate specific statistics
data %>%
  summarise(across(where(is.numeric), 
                   .fns= list(
                     mean = ~mean(.), 
                     median = ~median(.), 
                     sd = ~sd(.),
                     min = ~min(.),
                     max= ~max(.))))

health_expenditure_mean	health_expenditure_median	health_expenditure_sd	health_expenditure_min	health_expenditure_max	life_expectancy_mean	life_expectancy_median	life_expectancy_sd	life_expectancy_min	life_expectancy_max	gdp_percapita_mean	gdp_percapita_median	gdp_percapita_sd	gdp_percapita_min	gdp_percapita_max	clean_water_access_mean	clean_water_access_median	clean_water_access_sd	clean_water_access_min	clean_water_access_max	public_health_expenditure_mean	public_health_expenditure_median	public_health_expenditure_sd	public_health_expenditure_min	public_health_expenditure_max	immunization_mean	immunization_median	immunization_sd	immunization_min	immunization_max	population_mean	population_median	population_sd	population_min	population_max
1.38e+03	430	2.2e+03	15.3	1.24e+04	72.6	73.6	8.25	18.8	85.7	1.82e+04	6.51e+03	2.86e+04	251	2.26e+05	89.2	96.8	15	35.1	100	52.5	54.5	22.4	0.786	92.3	84.5	90	15.4	33	99	4.5e+07	1.01e+07	1.58e+08	9.99e+03	1.43e+09

Data Visualisation

Now that we have the basic idea of what the variables look like, we can visualise them to understand even further.

histogram_health_expenditure <- 
  ggplot(data, aes(x = health_expenditure)) +
  geom_histogram() +
  labs(title= "Health Expenditure per capita",
       x= "Health Expenditure",
       y= "Count") +
  theme_minimal()
histogram_health_expenditure

histogram_health_expenditure <- 
  ggplot(data, aes(x = health_expenditure)) +
  geom_histogram() +
  labs(title= "Health Expenditure per capita",
       x= "Health Expenditure",
       y= "Count") +
  theme_minimal()
histogram_health_expenditure

histogram_life_expectancy <- 
  ggplot(data, aes(x = life_expectancy)) +
  geom_histogram() +
  labs(title= "Life Expectancy (average) across Countries",
       x= "Life expectancy (in years)",
       y= "Count") +
  theme_minimal()
histogram_life_expectancy

The histograms above confirm the data skew that we imagined when we looked at the statistical summary: indeed, the health expenditure is skewed to the left and the life expectancy is skewed to the right. But these visualizations were just concerning just a single variable, what about the visualizations showing the relationship between the variables? For that, we will create a scatter plot.

data$density <- with(data, densCols(health_expenditure, life_expectancy, colramp = colorRampPalette(c("green", "darkgreen"))))
scatterplot1 <- ggplot(data, aes(x = health_expenditure, y = life_expectancy)) +
  geom_point(aes(color = density)) +
  labs(
    title = "Health Spending vs Life Expectancy", 
    x = "Health Spending Per Capita (USD)", 
    y = "Life Expectancy (years)",
    color = "Density"
  ) +
  scale_color_identity() +  # Use exact colors returned by densCols
  theme_minimal()
scatterplot1

Now, we can see how there seems to be a somewhat positive relationship between health spending and life expectancy. But the relationship seems to be skewed and non linear, so let’s apply the log transformation to better visualise the data.

data$density <- with(data, densCols(log(health_expenditure), life_expectancy, colramp = colorRampPalette(c("green", "darkgreen"))))

scatterplot2 <- ggplot(data, 
                       aes(x = log(health_expenditure), y = life_expectancy)) +
  geom_point(aes(color = density)) + 
  labs(title = "Health Spending vs Life Expectancy", 
       x = "Log of Health Spending Per Capita (USD)", 
       y = "Life Expectancy (years)",
       color = "Density") +
  scale_color_identity() +  
  theme_minimal()

scatterplot2

scatterplot2trace <- ggplot( data, 
                             aes(x= log(health_expenditure), y= life_expectancy))+
  geom_point(aes(color=density))+
  geom_smooth(method="lm", se = FALSE, color = "black", linewidth= 1)+
  labs(title= "Health Spending vs Life Expectancy Traced",
       x= "Log of Health Spending Per Capita (USD)",
       y= "Life Expectancy (years)",
       color= "Density")+
  scale_color_identity()+
  theme_minimal()
scatterplot2trace

Now, looking at the second scatterplot, we can see a much clearer and linearised relationship. We can additionally see in the second graph a clear positive correlation matching a linear relationship between health spending and life expectancy.

##Regression analysis

In the scatter plot we have created in the previous section, there seems to clearly be a positive relationship between the two: as health expenditure increases, so does life expectancy. But are there extraneous variables that may be influencing this? For this, we return back to the controlled variables we merged into our dataset at the start. This includes GDP per capita (accounting for the fact wealthier countries may influence life expectancy), clean water access (accounting for infrastructure reasons that may influence weather countries), public health expenditure (accounting for the fact that the overall effectiveness and scope of a country’s healthcare system may be influenced by government spending, not just by private health expenditure), and immunization to measles (accounting for fact that immunization programmes and lifestyle choices may influence life expectancy).

We will now calculate the regression coefficients accounting for our controlled variables to understand the true relationship between just health expenditure and life expectancy.

model <- lm(life_expectancy ~ health_expenditure + gdp_percapita +
clean_water_access + public_health_expenditure + immunization, data = data)
summary(model)

## 
## Call:
## lm(formula = life_expectancy ~ health_expenditure + gdp_percapita + 
##     clean_water_access + public_health_expenditure + immunization, 
##     data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32.58  -2.09   0.34   2.52   9.77 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               3.742e+01  2.363e+00  15.837  < 2e-16 ***
## health_expenditure        5.720e-04  2.959e-04   1.933   0.0549 .  
## gdp_percapita             3.810e-05  2.317e-05   1.644   0.1020    
## clean_water_access        2.845e-01  3.214e-02   8.854 1.23e-15 ***
## public_health_expenditure 4.336e-02  2.079e-02   2.085   0.0386 *  
## immunization              7.099e-02  2.749e-02   2.582   0.0107 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.418 on 166 degrees of freedom
## Multiple R-squared:  0.7216, Adjusted R-squared:  0.7133 
## F-statistic: 86.07 on 5 and 166 DF,  p-value: < 2.2e-16

Firstly, the multiple R-squared calculated is 0.7216, meaning that this model explains for around 72% of the life expectancy variation, which indicates a strong fit of the chosen predictors collectively explaining a strong portion of variability in life expectancy. Additionally, the adjusted R squared (0.7133) is quite close to the multiple R-squared, indicating that the control variables have improved the model without overfitting.

The intercept is calculated at 37.42 years, meaning that (while being unrealistic and/or arbitrary) without all these variables, the average life expectancy would be 37.42 years. The estimate of health expenditure calculates that with an increase in 1 dollar per capita in health expenditure will result in life expectancy being increased also by 0.000572 years (AKA 0.21 days) with marginal significance.

As for our controlled variables:

For every 1 unit increase in GDP per capita, life expectancy is predicted to increase by 0.0000381 years (or about 0.014 days)
For every 1% increase in the population with access to clean water, life expectancy is predicted to increase by 0.2845 years (or approximately 104 days)
For every 1% increase in public health expenditure as a proportion of total health spending, life expectancy is predicted to increase by 0.04336 years (or about 16 days)
For every 1% increase in measles immunization coverage, life expectancy is predicted to increase by 0.07099 years (or about 26 days)

So overall, when looking at the regression analysis, we can see that increased health expenditure has a somewhat positive effect on life expectancy with marginal significance.

Exploring causality

After our regression analysis, we have identified associations between our variables, but not causation. Just because countries that spend more on health tend to have higher life expectancy doesn’t mean that increasing health spending causes life expectancy to increase. To explore causality, we will be randomly matching. This is because in our current data set, the countries are not randomly assigned to different levels of health spending. Richer countries might both spend more and have better outcomes due to other factors (like education, infrastructure, governance, etc.). In this respect, confounding variables may be creating biases in our conclusions. By matching, this will allow us to treat this data set as a randomised experiment. It will help us compare similar countries (matching through our controlled variables) but with different levels of health spending. This will allow us to have a more “fair” estimate of the causation of health spending and life expectancy (because in this way we are now holding other factors constant but with randomization).

To do this, we will first create a treatment variable. This will be high vs low spenders on health using the median. Afterwards we will match countries using the MatchIt package. Finally, a t-test will be performed to determine whether the difference in life expectancy is significant.

data$high_spending <- ifelse(data$health_expenditure > median(data$health_expenditure), 1, 0)
options(repos = c(CRAN = "https://cran.rstudio.com/"))
install.packages("MatchIt")

## 
## The downloaded binary packages are in
##  /var/folders/lq/jwrt1s2519d4djhsllwp0zkh0000gn/T//RtmptBjj56/downloaded_packages

library(MatchIt)

match <- matchit(high_spending ~ gdp_percapita + clean_water_access + 
                       immunization + public_health_expenditure, data = data, method = "nearest")
data_match<- match.data(match)

t.test(life_expectancy ~ high_spending, data = data_match)

## 
##  Welch Two Sample t-test
## 
## data:  life_expectancy by high_spending
## t = -10.349, df = 138.49, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -12.182180  -8.273955
## sample estimates:
## mean in group 0 mean in group 1 
##        67.44759        77.67566

The t-test performed calculated the mean for group 0 (i.e. low spenders) to be 67.45 years and the mean for group 1 (i.e. high spending) to be 77.68 years. This indicates that countries with high levels of spending live for on average 10.2 years longer than matched countries with low levels of spending. The p-value < 2.2e-16 also indicates that this result is extremely significantly significant, and is unlikely to be due to chance. In this calculation, as mentioned previously, we used nearest neighbor matching to pair each low-spending country with a high-spending country of similar controlled variables (i.e. GDP, access to clean water, public health expenditure levels, and childhood immunisation of measles rate). This calculation supports our hypothesis that higher health spending causes higher life expectancy, and on average by 10 years. This claim is also stronger than our claim with just reliance on our regression model and analysis, since we have simulated a random experiment. However, caution should still be taken with regard to these calculations and results due to potential unobserved confounding, which will be explained further in our conclusion.

To get even more in depth, we are going to look closer into spending on a state by state basis in the USA. For this analysis we will be using data from the IHME and the official united states government data site.

setwd("~/Desktop/R")
life<- read.csv("U.S._Life_Expectancy_at_Birth_by_State_and_Census_Tract_-_2010-2015.csv")
spending<- read.csv("IHME_USA_STATE_HEALTH_SPENDING_2003_2019_DATA_Y2022M08D01.CSV")
colnames(life)

## [1] "State"                          "County"                        
## [3] "Census.Tract.Number"            "Life.Expectancy"               
## [5] "Life.Expectancy.Range"          "Life.Expectancy.Standard.Error"

colnames(spending)

##  [1] "year"       "region"     "division"   "state"      "population"
##  [6] "group"      "subgroup"   "metric"     "val"        "upper"     
## [11] "lower"

With the IHME data they have differenciated between private and public spending, for our purposes we will only take into account total spending so we will exclude those values while we get rid of any NAs. Additionally in the Data.gov data they have included life expectancy range and standard error, for our purposes we are only concerned with the life expectancy. The US life expectancy at birth by State is for the period of 2010-2015 while the health expenditure by state payer and type of care ranges from 2003-2019, we will also be focusing on the standardized spending per capita which is adjusted for differences like health needs or cost of care per state. We will also limit the health expenditure to only concern 2010-2015.

spending_clean <- spending %>%
  as_tibble() %>%
  filter(group == "Total", 
         subgroup == "Total",
         !is.na(val),
         year >= 2010, year<= 2015,
         metric== "Spending per capita") %>%
  dplyr::select(state, population, val)%>%
  group_by(state)%>%
  summarize(average_expenditure= mean(val, na.rm= TRUE),
            population= first(population))



life_clean<- life%>%
  dplyr::select(State, Life.Expectancy)%>%
  group_by(State)%>%
  drop_na()%>%
  summarize(Life.Expectancy= mean(Life.Expectancy, na.rm=T))

Now that we’ve cleaned the two data sets we will combine them.

Life_spending <- left_join(life_clean, spending_clean, by=c("State"="state"))
Life_spending

State	Life.Expectancy	average_expenditure	population
Alabama	74.8	7.77e+03	4.78e+06
Alaska	78.9	1.17e+04	7.1e+05
Arizona	78.4	7.06e+03	6.39e+06
Arkansas	75.6	7.88e+03	2.91e+06
California	80.2	8.12e+03	3.72e+07
Colorado	79.5	7.32e+03	5.03e+06
Connecticut	80.1	1.06e+04	3.57e+06
Delaware	77.8	1.09e+04	8.98e+05
District of Columbia	76.4	1.33e+04	6.02e+05
Florida	78.4	8.73e+03	1.88e+07
Georgia	76.6	6.93e+03	9.69e+06
Hawaii	81.3	7.85e+03	1.36e+06
Idaho	79.1	7.26e+03	1.57e+06
Illinois	78.2	8.83e+03	1.28e+07
Indiana	76.8	8.78e+03	6.49e+06
Iowa	79.1	8.77e+03	3.05e+06
Kansas	78.1	8.36e+03	2.85e+06
Kentucky	75.6	8.5e+03	4.34e+06
Louisiana	75.4	8.56e+03	4.54e+06
Maine	78.4	1.03e+04	1.33e+06
Maryland	78.7	9.34e+03	5.77e+06
Massachusetts	80.3	1.15e+04	6.55e+06
Michigan	77.3	8.69e+03	9.89e+06
Minnesota	80.5	9.45e+03	5.31e+06
Mississippi	75	8.25e+03	2.97e+06
Missouri	77	8.75e+03	5.99e+06
Montana	78.8	8.74e+03	9.9e+05
Nebraska	79	9.12e+03	1.83e+06
Nevada	77.6	6.99e+03	2.7e+06
New Hampshire	80	1.03e+04	1.32e+06
New Jersey	79.6	9.47e+03	8.79e+06
New Mexico	78.5	7.76e+03	2.06e+06
New York	80.3	1.06e+04	1.94e+07
North Carolina	77.4	7.93e+03	9.53e+06
North Dakota	79.4	1.07e+04	6.74e+05
Ohio	76.6	9.22e+03	1.15e+07
Oklahoma	75.9	8.22e+03	3.75e+06
Oregon	79.1	8.36e+03	3.83e+06
Pennsylvania	78.1	9.92e+03	1.27e+07
Rhode Island	79.1	1.02e+04	1.05e+06
South Carolina	76.6	7.87e+03	4.62e+06
South Dakota	79.8	9.65e+03	8.15e+05
Tennessee	75.5	7.93e+03	6.35e+06
Texas	77.9	7.53e+03	2.51e+07
Utah	79.4	6.33e+03	2.76e+06
Vermont	81	1.07e+04	6.26e+05
Virginia	78.3	8.17e+03	8e+06
Washington	79.7	8.48e+03	6.73e+06
West Virginia	75.7	1e+04	1.85e+06
Wisconsin	79	9.3e+03	5.68e+06
Wyoming	80.1	9.04e+03	5.63e+05

Now that we have our data clean and all together we can show this relationship in a graph

library(ggplot2)
library(plotly)


Life_spending$tooltip_text <- paste(
  "State: ", Life_spending$State, "<br>",
  "Health Expenditure per Capita: $", round(Life_spending$average_expenditure, 2), "<br>",
  "Life Expectancy: ", round(Life_spending$Life.Expectancy, 2), " years", "<br>",
  "Population: ", format(Life_spending$population, big.mark = ",")
)

p <- ggplot(Life_spending, aes(x = average_expenditure, y = Life.Expectancy, size = population, text = tooltip_text)) +
  geom_point(alpha = 0.7, color = "blue") + 
  labs(title = "Life Expectancy vs Health Expenditure by State",
       x = "Health Expenditure per Capita",
       y = "Life Expectancy (Years)") +
  theme_minimal()


interactive_plot <- ggplotly(p, tooltip = "text") 

interactive_plot

Based on the graph, it seems that the relationship isn’t quite as strong in this data as it was in the global data. To make sure we will take a regression.

summary (lm(Life.Expectancy ~ average_expenditure+ population, data = Life_spending))

## 
## Call:
## lm(formula = Life.Expectancy ~ average_expenditure + population, 
##     data = Life_spending)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0268 -1.1539  0.2833  1.2886  3.5397 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         7.536e+01  1.624e+00  46.395   <2e-16 ***
## average_expenditure 3.034e-04  1.738e-04   1.746   0.0872 .  
## population          2.553e-08  3.514e-08   0.727   0.4710    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.665 on 48 degrees of freedom
## Multiple R-squared:  0.06273,    Adjusted R-squared:  0.02368 
## F-statistic: 1.606 on 2 and 48 DF,  p-value: 0.2112

These findings show a relatively small relationship between average expenditure and life span with the estimate that every additional dollar spent on healthcare per capita increases the expectencay by about 0.0003034 years, or about 0.110741 days, or by about 2.66 hours.

These findings highlight that health spending does not account for all differences in life expectancy as we previously found when looking into the global data.

##Redaction

On a global scale the scatterplots seem to show a strong relationship between the two variables of Health expenditure and life expectancy, the regression analysis showed that other variables are more statistically significant. Additionally, when we further limited the scope to isolate state by state spending and life expectancy in the US, where the infrastrucutre and preventative care would in theory be more standardized than when comparing with different countries, the impact of health expenditure was not statistically significant on life expectancy, though still more significant than population. We found that GDP per capita does not have a statistically significant effect on life expectancy in our global model which suggests that after accounting for other health-related factors, wealth alone may not directly explain life expectancy differences.

Final Project

Authors: Teah Smith and Isabelle Jackson-Cameron

Instructor: Magloire Seshie

Summary Statistics

Data Visualisation

Exploring causality