library(readxl)
library(dplyr)
library(readr)
library(tidyr)
library(tidyverse)
library(janitor)
library(sf)
#library(tmap)
library(skimr)
library(jtools)
library(huxtable)
library(rmapshaper)
library(WDI)
library(MASS)
library(plotly)
For our final project, we have decided to research the relationship between health spending and life expectancy. In recent years, rising life expectancy has been a key indicator of national well being and health, as well as how well the health system is performing within a country. Therefore, understanding what factors influence and determine life expectancy is of importance to policy makers. One of the most prominent factors believed by many to be influencing this relationship is the health expenditure. However, the relationship between spending and life expectancy is not straightforward—spending more does not always guarantee better outcomes. We will be then investigating this relationship using data from the World Bank, and controlling for socioeconomic and environmental factors (such as GDP per capita, access to clean water, immunization rate against measles in children, and the balance between public and private health funding). To do this, we perform a comprehensive data analysis using R: cleaning and merging multiple international datasets, generating summary statistics and visualizations, running regression models, and applying matching techniques to explore causal relationships. To conclude, we will summarise our main findings from this analysis, as well as go over limitations and reflect on causality and randomisation.
indicators <- c(
health_expenditure= "SH.XPD.CHEX.PC.CD",
life_expectancy = "SP.DYN.LE00.IN",
gdp_percapita = "NY.GDP.PCAP.CD",
clean_water_access = "SH.H2O.BASW.ZS",
public_health_expenditure = "SH.XPD.GHED.CH.ZS",
immunization = "SH.IMM.MEAS",
population = "SP.POP.TOTL"
)
data_raw <- WDI(indicator=indicators, start = 2022, end=2022, extra= TRUE)
data <- data_raw |>
dplyr::filter(region!= "Aggregates") |>
dplyr::select(iso2c, country, health_expenditure, life_expectancy, gdp_percapita, clean_water_access, public_health_expenditure, immunization, population) |>
tidyr::drop_na()
head(data,10)
| iso2c | country | health_expenditure | life_expectancy | gdp_percapita | clean_water_access | public_health_expenditure | immunization | population |
|---|---|---|---|---|---|---|---|---|
| AF | Afghanistan | 80.7 | 65.6 | 357 | 82.2 | 0.786 | 56 | 4.06e+07 |
| AL | Albania | 414 | 78.8 | 6.85e+03 | 95.1 | 45.2 | 86 | 2.78e+06 |
| DZ | Algeria | 180 | 76.1 | 4.96e+03 | 94.7 | 47.4 | 79 | 4.55e+07 |
| AD | Andorra | 3.19e+03 | 84 | 4.24e+04 | 100 | 73.5 | 98 | 7.97e+04 |
| AO | Angola | 101 | 64.2 | 2.93e+03 | 57.7 | 51.6 | 37 | 3.56e+07 |
| AG | Antigua and Barbuda | 1.08e+03 | 77.5 | 2.01e+04 | 98.4 | 57.4 | 99 | 9.28e+04 |
| AM | Armenia | 675 | 74.8 | 6.57e+03 | 100 | 17.2 | 95 | 2.97e+06 |
| AU | Australia | 6.73e+03 | 83.2 | 6.5e+04 | 100 | 74.1 | 96 | 2.6e+07 |
| AT | Austria | 5.85e+03 | 81.3 | 5.22e+04 | 100 | 77.5 | 95 | 9.04e+06 |
| AZ | Azerbaijan | 304 | 74.1 | 7.77e+03 | 97.6 | 30.3 | 93 | 1.01e+07 |
Now that we have successfully merged our data for all variables into one single united called “data”, we can compute our statistics to understand the data further.
summary(data)
## iso2c country health_expenditure life_expectancy
## Length:172 Length:172 Min. : 15.35 Min. :18.82
## Class :character Class :character 1st Qu.: 90.43 1st Qu.:67.40
## Mode :character Mode :character Median : 430.10 Median :73.61
## Mean : 1382.11 Mean :72.56
## 3rd Qu.: 1558.81 3rd Qu.:77.93
## Max. :12434.43 Max. :85.75
## gdp_percapita clean_water_access public_health_expenditure
## Min. : 250.6 Min. : 35.12 Min. : 0.786
## 1st Qu.: 2291.6 1st Qu.: 84.71 1st Qu.:36.689
## Median : 6513.8 Median : 96.82 Median :54.480
## Mean : 18183.3 Mean : 89.23 Mean :52.541
## 3rd Qu.: 21508.1 3rd Qu.: 99.75 3rd Qu.:72.053
## Max. :226052.0 Max. :100.00 Max. :92.342
## immunization population
## Min. :33.00 Min. :9.992e+03
## 1st Qu.:78.50 1st Qu.:2.610e+06
## Median :90.00 Median :1.011e+07
## Mean :84.47 Mean :4.503e+07
## 3rd Qu.:96.00 3rd Qu.:3.323e+07
## Max. :99.00 Max. :1.425e+09
#calculate specific statistics
data %>%
summarise(across(where(is.numeric),
.fns= list(
mean = ~mean(.),
median = ~median(.),
sd = ~sd(.),
min = ~min(.),
max= ~max(.))))
| health_expenditure_mean | health_expenditure_median | health_expenditure_sd | health_expenditure_min | health_expenditure_max | life_expectancy_mean | life_expectancy_median | life_expectancy_sd | life_expectancy_min | life_expectancy_max | gdp_percapita_mean | gdp_percapita_median | gdp_percapita_sd | gdp_percapita_min | gdp_percapita_max | clean_water_access_mean | clean_water_access_median | clean_water_access_sd | clean_water_access_min | clean_water_access_max | public_health_expenditure_mean | public_health_expenditure_median | public_health_expenditure_sd | public_health_expenditure_min | public_health_expenditure_max | immunization_mean | immunization_median | immunization_sd | immunization_min | immunization_max | population_mean | population_median | population_sd | population_min | population_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.38e+03 | 430 | 2.2e+03 | 15.3 | 1.24e+04 | 72.6 | 73.6 | 8.25 | 18.8 | 85.7 | 1.82e+04 | 6.51e+03 | 2.86e+04 | 251 | 2.26e+05 | 89.2 | 96.8 | 15 | 35.1 | 100 | 52.5 | 54.5 | 22.4 | 0.786 | 92.3 | 84.5 | 90 | 15.4 | 33 | 99 | 4.5e+07 | 1.01e+07 | 1.58e+08 | 9.99e+03 | 1.43e+09 |
From this, we can see the basic statistics about our data. For example: - The mean spending on health per capita in countries is 1382 USD, with the minimum being 15.35USD and the maximum being 12434.43 USD. - The median however is 430.10USD, leading us to believe that the data is skewed to the left. - The mean life expectancy is 72 years old, with the minimum being 18.82 years and the maximum being 85.73 years. The median life expectancy however is 73.61, leading us to believe that the data is more skewed to the right. - The average GDP per capita is 18183USD and the median is 6513USD. - The average access to clean water is 89% of the population, and the median access is 96%. - The average health system in a country sources 52% of their expenditures from the public spending (opposed to private spending), and the median health system sources 54% of their expenditures from public spending. - Immunizations records for measles in children on average are at 84% and have a median of 90%. These statistics provide information to us about both the outcome and predictor variables (i.e. life expectancy and health expenditure) as well as the control variables (i.e. GDP per capita, access to clean water, private vs. public health spending, and childhood measle immunization).
Now that we have the basic idea of what the variables look like, we can visualise them to understand even further.
histogram_health_expenditure <-
ggplot(data, aes(x = health_expenditure)) +
geom_histogram() +
labs(title= "Health Expenditure per capita",
x= "Health Expenditure",
y= "Count") +
theme_minimal()
histogram_health_expenditure
The histogram above is asymmetrical, with the majority of the data being concentrated on the left side. The range is from 0 USD to approximately 1250 USD. The asymmetricalness denotes that the majority of the sampled countries spend less than 500 USD per capita on health expenditure. We can also see that both the median and mode are lower than the mean (which was confirmed in the previous section), which confirms our hypothesis that the data would be positively skewed.
histogram_life_expectancy <-
ggplot(data, aes(x = life_expectancy)) +
geom_histogram() +
labs(title= "Life Expectancy (average) across Countries",
x= "Life expectancy (in years)",
y= "Count") +
theme_minimal()
histogram_life_expectancy
In the graph above, we can firstly notice an extreme outlier, with a life expectancy of around 18 years old (which is extremely low). The median and mode are both located on the right of the mean, confirming that the data is negatively skewed. Looking at the data past the anomaly however, overall there is a quasi-uniform distribution, with the mode and median being somewhat close to the mean. There also appears to be three peaks in the data, signalling potentially other variables that may influence the pooling of certain countries in one of the three peaks.
As mentioned just before, the histograms above confirm the data skew that we imagined when we looked at the statistical summary: indeed, the health expenditure is skewed to the left and the life expectancy is skewed to the right. But these visualizations were just concerning just a single variable, what about the visualizations showing the relationship between the variables? For that, we will create a scatter plot.
data$density <- with(data, densCols(health_expenditure, life_expectancy, colramp = colorRampPalette(c("green", "darkgreen"))))
scatterplot1 <- ggplot(data, aes(x = health_expenditure, y = life_expectancy)) +
geom_point(aes(color = density)) +
labs(
title = "Health Spending vs Life Expectancy",
x = "Health Spending Per Capita (USD)",
y = "Life Expectancy (years)",
color = "Density"
) +
scale_color_identity() + # Use exact colors returned by densCols
theme_minimal()
scatterplot1
Now, we can see how there seems to be a somewhat positive relationship between health spending and life expectancy. Lower health spendings is pooled with lower life expectancy. But as we understand, the relationship seems to be skewed and non linear (and extremely concentrated on the left side), so let’s apply the log transformation to better visualise the data.
data$density <- with(data, densCols(log(health_expenditure), life_expectancy, colramp = colorRampPalette(c("green", "darkgreen"))))
scatterplot2 <- ggplot(data,
aes(x = log(health_expenditure), y = life_expectancy)) +
geom_point(aes(color = density)) +
labs(title = "Health Spending vs Life Expectancy",
x = "Log of Health Spending Per Capita (USD)",
y = "Life Expectancy (years)",
color = "Density") +
scale_color_identity() +
theme_minimal()
scatterplot2
Now, looking at the second scatterplot, we can see a much clearer and linearised relationship. The positive relationship that we observed in the previous graph now appears more strongly. We can also see that there is more variation on the left side of the graph. Countries with lower spending demonstrate more variation in life expectancy. This suggests to us that other factors (such as public policy, lifestyle choices etc) may play a role in this. On the right side however, we can see how there is a tighter cluster in higher spending on health and a higher life expectancy. This suggests that there is a more consistent return on health investment in wealthier countries.
scatterplot2trace <- ggplot( data,
aes(x= log(health_expenditure), y= life_expectancy))+
geom_point(aes(color=density))+
geom_smooth(method="lm", se = FALSE, color = "black", linewidth= 1)+
labs(title= "Health Spending vs Life Expectancy Traced",
x= "Log of Health Spending Per Capita (USD)",
y= "Life Expectancy (years)",
color= "Density")+
scale_color_identity()+
theme_minimal()
scatterplot2trace
We can additionally see in the second graph a clear positive correlation matching a linear relationship between health spending and life expectancy. Once again, since there is a wider dispersion of countries on the left side of the trend line, this signals that there may be other structural factors that may play a role in this relationship.
In the scatter plot we have created in the previous section, there seems to clearly be a positive relationship between the two: as health expenditure increases, so does life expectancy. But are there extraneous variables that may be influencing this? For this, we return back to the controlled variables we merged into our dataset at the start. This includes GDP per capita (accounting for the fact wealthier countries may influence life expectancy), clean water access (accounting for infrastructure reasons that may influence weather countries), public health expenditure (accounting for the fact that the overall effectiveness and scope of a country’s healthcare system may be influenced by government spending, not just by private health expenditure), and immunization to measles (accounting for fact that immunization programmes and lifestyle choices may influence life expectancy).
We will now calculate the regression coefficients accounting for our controlled variables to understand the true relationship between just health expenditure and life expectancy.
model.1<- lm(life_expectancy~ health_expenditure, data = data)
model.2<-lm(life_expectancy~ health_expenditure+ gdp_percapita, data = data)
model.3<- lm(life_expectancy~ health_expenditure+ gdp_percapita+ clean_water_access, data = data)
model.4<-lm(life_expectancy~ health_expenditure+ gdp_percapita+ clean_water_access+public_health_expenditure, data = data)
model <- lm(life_expectancy ~ health_expenditure + gdp_percapita +
clean_water_access + public_health_expenditure + immunization, data = data)
summary(model.1)
##
## Call:
## lm(formula = life_expectancy ~ health_expenditure, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.865 -2.975 1.331 4.284 9.016
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.958e+01 6.103e-01 114.015 <2e-16 ***
## health_expenditure 2.158e-03 2.355e-04 9.161 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.77 on 170 degrees of freedom
## Multiple R-squared: 0.3305, Adjusted R-squared: 0.3266
## F-statistic: 83.92 on 1 and 170 DF, p-value: < 2.2e-16
When we regress health expenditure alone on life expectancy, we find that the intercept is estimated at 69.58 years, meaning that without health expenditure that is what to average life expectancy would be. The impact of health expenditure is 0.002158, which means that for every one dollar per cpaita increase in health expenditure, the average life span will increase by 0.002158 years (or 0.78767 days). However, the r squared is only 0.3305, which means that approximately 33% of the variance in life expectancy is explained by the health expenditure. This is not a very strong fit so we can continue analyzing the impact by adding additional variables that we have observed.
summary(model.2)
##
## Call:
## lm(formula = life_expectancy ~ health_expenditure + gdp_percapita,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.615 -3.551 1.558 4.138 9.021
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.934e+01 6.092e-01 113.806 < 2e-16 ***
## health_expenditure 1.225e-03 4.417e-04 2.773 0.00619 **
## gdp_percapita 8.436e-05 3.399e-05 2.482 0.01404 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.67 on 169 degrees of freedom
## Multiple R-squared: 0.3541, Adjusted R-squared: 0.3464
## F-statistic: 46.32 on 2 and 169 DF, p-value: < 2.2e-16
summary(model.3)
##
## Call:
## lm(formula = life_expectancy ~ health_expenditure + gdp_percapita +
## clean_water_access, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.216 -2.065 0.183 2.702 9.442
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.909e+01 2.201e+00 17.763 <2e-16 ***
## health_expenditure 6.154e-04 3.041e-04 2.023 0.0446 *
## gdp_percapita 5.112e-05 2.328e-05 2.195 0.0295 *
## clean_water_access 3.552e-01 2.538e-02 13.996 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.545 on 168 degrees of freedom
## Multiple R-squared: 0.7018, Adjusted R-squared: 0.6964
## F-statistic: 131.8 on 3 and 168 DF, p-value: < 2.2e-16
summary(model.4)
##
## Call:
## lm(formula = life_expectancy ~ health_expenditure + gdp_percapita +
## clean_water_access + public_health_expenditure, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.506 -1.798 0.131 2.905 9.236
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.986e+01 2.202e+00 18.103 <2e-16 ***
## health_expenditure 5.972e-04 3.007e-04 1.986 0.0487 *
## gdp_percapita 3.989e-05 2.355e-05 1.693 0.0922 .
## clean_water_access 3.214e-01 2.928e-02 10.975 <2e-16 ***
## public_health_expenditure 4.719e-02 2.109e-02 2.238 0.0266 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.492 on 167 degrees of freedom
## Multiple R-squared: 0.7105, Adjusted R-squared: 0.7035
## F-statistic: 102.4 on 4 and 167 DF, p-value: < 2.2e-16
summary(model)
##
## Call:
## lm(formula = life_expectancy ~ health_expenditure + gdp_percapita +
## clean_water_access + public_health_expenditure + immunization,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.58 -2.09 0.34 2.52 9.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.742e+01 2.363e+00 15.837 < 2e-16 ***
## health_expenditure 5.720e-04 2.959e-04 1.933 0.0549 .
## gdp_percapita 3.810e-05 2.317e-05 1.644 0.1020
## clean_water_access 2.845e-01 3.214e-02 8.854 1.23e-15 ***
## public_health_expenditure 4.336e-02 2.079e-02 2.085 0.0386 *
## immunization 7.099e-02 2.749e-02 2.582 0.0107 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.418 on 166 degrees of freedom
## Multiple R-squared: 0.7216, Adjusted R-squared: 0.7133
## F-statistic: 86.07 on 5 and 166 DF, p-value: < 2.2e-16
As we increase the number of variables that we are regressing on Life expectancy, we see that the multiple r squared increases with each variable. The most noticeable jump occurs when we go from a multiple r square of 0.3541 to 0.7018 when we add the variable of clean water access. This implies that clean water access plays a very important role in explaining the average life span.
Additionally, with every additional variable the intercept decreases, with once again the most noticeable jump being due to clean water access. When accounting for clean water access we see the intercept life span drop from 69.34 years to 39.09 years. This reinforces that access to clean water seems to have a very important effect. Access to clean water would be an interesting variable to continue analyzing in future research.
In our final model, when accounting for all variables available in our data set, the multiple R-squared calculated is 0.7216, meaning that this model explains for around 72% of the life expectancy variation, which indicates a strong fit of the chosen predictors collectively explaining a strong portion of variability in life expectancy. Additionally, the adjusted R squared (0.7133) is quite close to the multiple R-squared, indicating that the control variables have improved the model without overfitting.
The intercept is calculated at 37.42 years, meaning that (while being unrealistic and/or arbitrary) without all these variables, the average life expectancy would be 37.42 years. The estimate of health expenditure calculates that with an increase in 1 dollar per capita in health expenditure will result in life expectancy being increased also by 0.000572 years (AKA 0.21 days) with marginal significance.
As for our controlled variables:
For every 1 unit increase in GDP per capita, life expectancy is predicted to increase by 0.0000381 years (or about 0.014 days)
For every 1% increase in the population with access to clean water, life expectancy is predicted to increase by 0.2845 years (or approximately 104 days)
For every 1% increase in public health expenditure as a proportion of total health spending, life expectancy is predicted to increase by 0.04336 years (or about 16 days)
For every 1% increase in measles immunization coverage, life expectancy is predicted to increase by 0.07099 years (or about 26 days)
So overall, when looking at the regression analysis, we can see that increased health expenditure has a somewhat positive effect on life expectancy with marginal significance.
After our regression analysis, we have identified associations between our variables, but not causation. Just because countries that spend more on health tend to have higher life expectancy doesn’t mean that increasing health spending causes life expectancy to increase. To explore causality, we will be randomly matching. This is because in our current data set, the countries are not randomly assigned to different levels of health spending. Richer countries might both spend more and have better outcomes due to other factors (like education, infrastructure, governance, etc.). In this respect, confounding variables may be creating biases in our conclusions. By matching, this will allow us to treat this data set as a randomised experiment. It will help us compare similar countries (matching through our controlled variables) but with different levels of health spending. This will allow us to have a more “fair” estimate of the causation of health spending and life expectancy (because in this way we are now holding other factors constant but with randomization).
To do this, we will first create a treatment variable. This will be high vs low spenders on health using the median. Afterwards we will match countries using the MatchIt package. Finally, a t-test will be performed to determine whether the difference in life expectancy is significant.
data$high_spending <- ifelse(data$health_expenditure > median(data$health_expenditure), 1, 0)
options(repos = c(CRAN = "https://cran.rstudio.com/"))
install.packages("MatchIt")
##
## The downloaded binary packages are in
## /var/folders/lq/jwrt1s2519d4djhsllwp0zkh0000gn/T//RtmpMD5B0m/downloaded_packages
library(MatchIt)
match <- matchit(high_spending ~ gdp_percapita + clean_water_access +
immunization + public_health_expenditure, data = data, method = "nearest")
data_match<- match.data(match)
t.test(life_expectancy ~ high_spending, data = data_match)
##
## Welch Two Sample t-test
##
## data: life_expectancy by high_spending
## t = -10.349, df = 138.49, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -12.182180 -8.273955
## sample estimates:
## mean in group 0 mean in group 1
## 67.44759 77.67566
The t-test performed calculated the mean for group 0 (i.e. low spenders) to be 67.45 years and the mean for group 1 (i.e. high spending) to be 77.68 years. This indicates that countries with high levels of spending live for on average 10.2 years longer than matched countries with low levels of spending. The p-value < 2.2e-16 also indicates that this result is extremely significantly significant, and is unlikely to be due to chance. In this calculation, as mentioned previously, we used nearest neighbor matching to pair each low-spending country with a high-spending country of similar controlled variables (i.e. GDP, access to clean water, public health expenditure levels, and childhood immunisation of measles rate). This calculation supports our hypothesis that higher health spending causes higher life expectancy, and on average by 10 years. This claim is also stronger than our claim with just reliance on our regression model and analysis, since we have simulated a random experiment. However, caution should still be taken with regard to these calculations and results due to potential unobserved confounding, which will be explained further in our conclusion.
In order to understand limit the impact of structural country by country differences that might have a large impact on life expectancy as we have found above, we are going to look closer into spending on a state by state basis in the USA. By analyzing the data within one country we will, in theory, have less variation in structural differences (such as access to clean drinking water), theoretically allowing for a clearer observation of the impact of health expenditure. For this analysis we will be using data from the IHME [ https://ghdx.healthdata.org/record/ihme-data/united-states-health-spending-by-state-payer-type-service-2003-2019] and the official United States government data site [https://catalog.data.gov/dataset/u-s-life-expectancy-at-birth-by-state-and-census-tract-2010-2015].
setwd("~/Desktop/R")
life<- read.csv("U.S._Life_Expectancy_at_Birth_by_State_and_Census_Tract_-_2010-2015.csv")
spending<- read.csv("IHME_USA_STATE_HEALTH_SPENDING_2003_2019_DATA_Y2022M08D01.CSV")
colnames(life)
## [1] "State" "County"
## [3] "Census.Tract.Number" "Life.Expectancy"
## [5] "Life.Expectancy.Range" "Life.Expectancy.Standard.Error"
colnames(spending)
## [1] "year" "region" "division" "state" "population"
## [6] "group" "subgroup" "metric" "val" "upper"
## [11] "lower"
With the IHME data they have differenciated between private and public spending, for our purposes we will only take into account total spending so we will exclude those values while we get rid of any NAs. Additionally in the Data.gov data they have included life expectancy range and standard error, for our purposes we are only concerned with the life expectancy. The US life expectancy at birth by State is for the period of 2010-2015 while the health expenditure by state payer and type of care ranges from 2003-2019, we will also be focusing on the standardized spending per capita which is adjusted for differences like health needs or cost of care per state. We will also limit the health expenditure to only concern the average health expenditure for state over the period of 2010-2015.
spending_clean <- spending %>%
as_tibble() %>%
filter(group == "Total",
subgroup == "Total",
!is.na(val),
year >= 2010, year<= 2015,
metric== "Spending per capita") %>%
dplyr::select(state, population, val)%>%
group_by(state)%>%
summarize(average_expenditure= mean(val, na.rm= TRUE),
population= first(population))
life_clean<- life%>%
dplyr::select(State, Life.Expectancy)%>%
group_by(State)%>%
drop_na()%>%
summarize(Life.Expectancy= mean(Life.Expectancy, na.rm=T))
Now that we’ve cleaned the two data sets, we will combine them.
Life_spending <- left_join(life_clean, spending_clean, by=c("State"="state"))
Life_spending
| State | Life.Expectancy | average_expenditure | population |
|---|---|---|---|
| Alabama | 74.8 | 7.77e+03 | 4.78e+06 |
| Alaska | 78.9 | 1.17e+04 | 7.1e+05 |
| Arizona | 78.4 | 7.06e+03 | 6.39e+06 |
| Arkansas | 75.6 | 7.88e+03 | 2.91e+06 |
| California | 80.2 | 8.12e+03 | 3.72e+07 |
| Colorado | 79.5 | 7.32e+03 | 5.03e+06 |
| Connecticut | 80.1 | 1.06e+04 | 3.57e+06 |
| Delaware | 77.8 | 1.09e+04 | 8.98e+05 |
| District of Columbia | 76.4 | 1.33e+04 | 6.02e+05 |
| Florida | 78.4 | 8.73e+03 | 1.88e+07 |
| Georgia | 76.6 | 6.93e+03 | 9.69e+06 |
| Hawaii | 81.3 | 7.85e+03 | 1.36e+06 |
| Idaho | 79.1 | 7.26e+03 | 1.57e+06 |
| Illinois | 78.2 | 8.83e+03 | 1.28e+07 |
| Indiana | 76.8 | 8.78e+03 | 6.49e+06 |
| Iowa | 79.1 | 8.77e+03 | 3.05e+06 |
| Kansas | 78.1 | 8.36e+03 | 2.85e+06 |
| Kentucky | 75.6 | 8.5e+03 | 4.34e+06 |
| Louisiana | 75.4 | 8.56e+03 | 4.54e+06 |
| Maine | 78.4 | 1.03e+04 | 1.33e+06 |
| Maryland | 78.7 | 9.34e+03 | 5.77e+06 |
| Massachusetts | 80.3 | 1.15e+04 | 6.55e+06 |
| Michigan | 77.3 | 8.69e+03 | 9.89e+06 |
| Minnesota | 80.5 | 9.45e+03 | 5.31e+06 |
| Mississippi | 75 | 8.25e+03 | 2.97e+06 |
| Missouri | 77 | 8.75e+03 | 5.99e+06 |
| Montana | 78.8 | 8.74e+03 | 9.9e+05 |
| Nebraska | 79 | 9.12e+03 | 1.83e+06 |
| Nevada | 77.6 | 6.99e+03 | 2.7e+06 |
| New Hampshire | 80 | 1.03e+04 | 1.32e+06 |
| New Jersey | 79.6 | 9.47e+03 | 8.79e+06 |
| New Mexico | 78.5 | 7.76e+03 | 2.06e+06 |
| New York | 80.3 | 1.06e+04 | 1.94e+07 |
| North Carolina | 77.4 | 7.93e+03 | 9.53e+06 |
| North Dakota | 79.4 | 1.07e+04 | 6.74e+05 |
| Ohio | 76.6 | 9.22e+03 | 1.15e+07 |
| Oklahoma | 75.9 | 8.22e+03 | 3.75e+06 |
| Oregon | 79.1 | 8.36e+03 | 3.83e+06 |
| Pennsylvania | 78.1 | 9.92e+03 | 1.27e+07 |
| Rhode Island | 79.1 | 1.02e+04 | 1.05e+06 |
| South Carolina | 76.6 | 7.87e+03 | 4.62e+06 |
| South Dakota | 79.8 | 9.65e+03 | 8.15e+05 |
| Tennessee | 75.5 | 7.93e+03 | 6.35e+06 |
| Texas | 77.9 | 7.53e+03 | 2.51e+07 |
| Utah | 79.4 | 6.33e+03 | 2.76e+06 |
| Vermont | 81 | 1.07e+04 | 6.26e+05 |
| Virginia | 78.3 | 8.17e+03 | 8e+06 |
| Washington | 79.7 | 8.48e+03 | 6.73e+06 |
| West Virginia | 75.7 | 1e+04 | 1.85e+06 |
| Wisconsin | 79 | 9.3e+03 | 5.68e+06 |
| Wyoming | 80.1 | 9.04e+03 | 5.63e+05 |
Now that we have our data clean and all together we can show this relationship in a graph to observe if it follows the same pattern as our previous global graph.
library(ggplot2)
library(plotly)
Life_spending$tooltip_text <- paste(
"State: ", Life_spending$State, "<br>",
"Health Expenditure per Capita: $", round(Life_spending$average_expenditure, 2), "<br>",
"Life Expectancy: ", round(Life_spending$Life.Expectancy, 2), " years", "<br>",
"Population: ", format(Life_spending$population, big.mark = ",")
)
p <- ggplot(Life_spending, aes(x = average_expenditure, y = Life.Expectancy, size = population, text = tooltip_text)) +
geom_point(alpha = 0.7, color = "blue") +
labs(title = "Life Expectancy vs Health Expenditure by State",
x = "Health Expenditure per Capita",
y = "Life Expectancy (Years)") +
theme_minimal()
interactive_plot <- ggplotly(p, tooltip = "text")
interactive_plot
Based on the graph, it seems that the relationship isn’t quite as strong in this data as it was in the global data. To make sure, we will take a regression.
summary (lm(Life.Expectancy ~ average_expenditure+ population, data = Life_spending))
##
## Call:
## lm(formula = Life.Expectancy ~ average_expenditure + population,
## data = Life_spending)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0268 -1.1539 0.2833 1.2886 3.5397
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.536e+01 1.624e+00 46.395 <2e-16 ***
## average_expenditure 3.034e-04 1.738e-04 1.746 0.0872 .
## population 2.553e-08 3.514e-08 0.727 0.4710
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.665 on 48 degrees of freedom
## Multiple R-squared: 0.06273, Adjusted R-squared: 0.02368
## F-statistic: 1.606 on 2 and 48 DF, p-value: 0.2112
These findings show a relatively small relationship between average expenditure and life span with the estimate that every additional dollar spent on healthcare per capita increases the life expectancy by about 0.0003034 years, or about 0.110741 days, or by about 2.66 hours.
These findings highlight that health spending does not account for all differences in life expectancy, upholding what we previously found when looking into the global data.
On a global scale, the scatterplots seem to show a strong relationship between the two variables of Health expenditure and life expectancy, the regression analysis showed that other variables are more statistically significant. Additionally, when we further limited the scope to isolate state by state spending and life expectancy in the US, where the infrastructure and preventative care would in theory be more standardized than when comparing with different countries, the impact of health expenditure was not statistically significant on life expectancy, though still more significant than population. We also found that GDP per capita does not have a statistically significant effect on life expectancy in our global model which suggests that after accounting for other health-related factors, wealth alone may not directly explain life expectancy differences.
This project also highlighted the role that randomisation plays in the differentiation between correlation and causation. While our regression analysis illustrated a marginally significant relationship between the chosen variables, it still remained unclear if longer life expectancy was actually caused by higher health spending or if it was just a mere correlation, and had other factors driving it. The matching allowed us to come to a more credible conclusion in our project. The choice of methodology in causal interpretations is important in the social sciences due to policy making being reliant on results like these. If we mistake correlation for causation, we risk making the wrong decisions leading to a waste of resources or even causing harm. However, there are limitations encountered in this project that should be kept in mind when evaluating our conclusion. Firstly, despite matching of key variables to establish a causal relationship, there still may be other confounding variables that we did not account for (such as education level, income inequality etc.) that could influence both life expectancy and health expenditure. Secondly, defining “high” and “low” spending on health solely on the median may obscure complex effects in the relationship and simplify it. Lastly, especially concerning the worldwide focus, the analysis is focused on only a singular year, thus limiting our ability to assess the dynamic effects over time. Overall, this study strengthens the case for a causal interpretation between higher health spending and better health outcomes, but future research using longitudinal data, additional covariates, and more advanced causal inference techniques would be valuable to deepen and validate these findings.