Project 2

Author

Jonathan RH

Does a Country’s Spending on Health Have an Effect on Life Expectancy?

In Project 2, I wanted to explore whether the share of a country’s GDP spent on health affects the life expectancy of its people. I found a data set on Kaggle that collects socioeconomic indicators for nineteen countries. It contains 209 observations and ninety-five columns. Although the data set is hosted on Kaggle, the underlying source is the World Development Indicators (WDI).

The variables I will be using for this project are as follows: Time, TotalPopulation, CountryName, currentHealthSpendingPercentOfGDP, survivalTo65MalePercentOfPop, and survivalTo65FemalePercentOfPop. All of the variables are numerical except CountryName, which holds the names of the nineteen countries; for the plots I will be using Germany, Japan, Finland, United Kingdom, and Denmark. Time is the year, from 2011 to 2021. TotalPopulation is the total population of the country in that year, which I show in millions in the plots. currentHealthSpendingPercentOfGDP is the percentage of GDP that is spent on health. Finally, survivalTo65MalePercentOfPop and survivalTo65FemalePercentOfPop are the percentages of males and females who survive to age sixty-five.

Once I loaded the data, I renamed the columns to make them easier to work with. Then I created a new data set with the countries I want using filter(). I also filtered out the NAs in the columns I am using with filter(!is.na()) so I could draw my scatter plot. I chose this topic mostly out of curiosity about whether a country’s health spending has a positive effect on life expectancy. In addition, health is important and is always a good topic to spread information about.

Loading Libraries

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

Loading in Data

Raw_Data <- read_csv("WDI_Indicators_MainData.csv")

Rows: 209 Columns: 95
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Time Code, Country Name, Country Code
dbl (92): Time, Current education expenditure, primary (% of total expenditu...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Changing the Names of the Columns

colnames(Raw_Data)[74] <- 'survivalTo65FemalePercentOfPop'

colnames(Raw_Data)[75] <- 'survivalTo65MalePercentOfPop'

colnames(Raw_Data)[9] <- 'currentHealthSpendingPercentOfGDP'

colnames(Raw_Data)[3] <- 'CountryName'

colnames(Raw_Data)[64] <- 'TotalPopulation'
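Renaming by position works, but it silently breaks if the column order of the CSV ever changes. As a minimal sketch of an alternative, dplyr's rename() matches columns by name instead. The toy tibble below stands in for the raw data, and "Population, total" is a hypothetical original column name used only for illustration; only "Country Name" is confirmed by the read_csv() output.

```r
library(dplyr)

# Toy stand-in for the raw data; "Population, total" is a hypothetical
# original column name, not taken from the real CSV.
toy <- tibble(
  `Country Name` = c("Denmark", "Finland"),
  `Population, total` = c(5.8e6, 5.5e6)
)

# rename(new_name = old_name) does not depend on column positions
toy_renamed <- toy |>
  rename(
    CountryName = `Country Name`,
    TotalPopulation = `Population, total`
  )

names(toy_renamed)
```

The same pattern would apply to Raw_Data once the exact original column names are checked with names(Raw_Data).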

Statistical Analysis

cor(Raw_Data$TotalPopulation, Raw_Data$currentHealthSpendingPercentOfGDP)

[1] 0.6432977

I wanted to see the correlation between total population and the share of GDP a country spends on health. Running cor() gave a correlation of about 0.64, which indicates a moderate positive relationship between the two variables.

cor(Raw_Data$currentHealthSpendingPercentOfGDP, Raw_Data[c('survivalTo65MalePercentOfPop','survivalTo65FemalePercentOfPop')])

     survivalTo65MalePercentOfPop survivalTo65FemalePercentOfPop
[1,]                   -0.6976651                     -0.7774658

This comparison is the main reason I wanted to do this project. When I started the assignment, I initially thought that there would be a positive relationship between health spending as a share of GDP and survival to 65. To my surprise, the correlation test shows a moderately strong negative relationship for both males (-0.70) and females (-0.78). The relationship is stronger for female survival.
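cor() reports only the size and direction of a relationship. As a rough sketch, base R's cor.test() also gives a p-value and a confidence interval for the same correlation. The data below are simulated for illustration only (the variable names spending and survival are made up) and are not the WDI values.

```r
set.seed(1)

# Simulated data for illustration only; not the WDI values
spending <- runif(50, min = 5, max = 15)
survival <- 95 - 0.8 * spending + rnorm(50, sd = 2)

# Pearson correlation with a test of H0: correlation = 0
result <- cor.test(spending, survival)

result$estimate   # sample correlation
result$p.value    # p-value for the test
result$conf.int   # 95% confidence interval for the correlation
```

On the real data, the two vectors would be the health spending column and one of the survival columns.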

lm <- lm(cbind(survivalTo65MalePercentOfPop, survivalTo65FemalePercentOfPop) ~ currentHealthSpendingPercentOfGDP , data = Raw_Data)
lm


Call:
lm(formula = cbind(survivalTo65MalePercentOfPop, survivalTo65FemalePercentOfPop) ~ 
    currentHealthSpendingPercentOfGDP, data = Raw_Data)

Coefficients:
                                   survivalTo65MalePercentOfPop
(Intercept)                        95.7437                     
currentHealthSpendingPercentOfGDP  -0.8304                     
                                   survivalTo65FemalePercentOfPop
(Intercept)                        98.6437                       
currentHealthSpendingPercentOfGDP  -0.5907

After all the correlation tests were completed, it was time to find the linear equations relating male and female survival to 65 to health spending as a share of GDP. I used lm() for the linear model and cbind() to fit the male and female responses against health spending at the same time. The output above gives the coefficients.

Based on the coefficients the formula is:

survivalTo65MalePercentOfPop = -0.8304(currentHealthSpendingPercentOfGDP) + 95.7437

The formula suggests that for every one percentage point increase in health spending as a share of GDP, “survivalTo65MalePercentOfPop” decreases by 0.8304 percentage points. The intercept, 95.74, is the percentage of males the model predicts would live to 65 if no spending were done. This is interesting because I didn’t expect the percentage to be so high with no spending, although zero spending lies outside the range of the data, so the intercept should be read with caution. Next up is the females:

survivalTo65FemalePercentOfPop = -0.5907(currentHealthSpendingPercentOfGDP) + 98.6437

The formula suggests that for every one percentage point increase in health spending as a share of GDP, “survivalTo65FemalePercentOfPop” decreases by 0.5907 percentage points, and 98.64 is the percentage of females the model predicts would live to 65 if no spending were done.
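To make the two formulas concrete, here is a small worked example that plugs a spending level into each equation. The coefficients are copied from the lm() output above; the spending value of 10% of GDP is a hypothetical input chosen only for illustration.

```r
# Coefficients copied from the lm() output above
male_intercept   <- 95.7437
male_slope       <- -0.8304
female_intercept <- 98.6437
female_slope     <- -0.5907

spending <- 10  # hypothetical health spending, % of GDP

# Predicted percentage surviving to 65 at that spending level
male_pred   <- male_intercept + male_slope * spending
female_pred <- female_intercept + female_slope * spending

c(male = male_pred, female = female_pred)
```

At 10% of GDP the model predicts about 87.4% for males and 92.7% for females.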

sum_lm <- summary(lm)
sum_lm

Response survivalTo65MalePercentOfPop :

Call:
lm(formula = survivalTo65MalePercentOfPop ~ currentHealthSpendingPercentOfGDP, 
    data = Raw_Data)

Residuals:
   Min     1Q Median     3Q    Max 
-6.571 -1.260 -0.095  1.257  4.424 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       95.74373    0.62184  153.97   <2e-16 ***
currentHealthSpendingPercentOfGDP -0.83040    0.05927  -14.01   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.018 on 207 degrees of freedom
Multiple R-squared:  0.4867,    Adjusted R-squared:  0.4843 
F-statistic: 196.3 on 1 and 207 DF,  p-value: < 2.2e-16


Response survivalTo65FemalePercentOfPop :

Call:
lm(formula = survivalTo65FemalePercentOfPop ~ currentHealthSpendingPercentOfGDP, 
    data = Raw_Data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8028 -0.9092 -0.0228  0.6312  2.6658 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       98.64373    0.34845  283.09   <2e-16 ***
currentHealthSpendingPercentOfGDP -0.59068    0.03321  -17.79   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.131 on 207 degrees of freedom
Multiple R-squared:  0.6045,    Adjusted R-squared:  0.6025 
F-statistic: 316.3 on 1 and 207 DF,  p-value: < 2.2e-16

Based on the summary of the linear model, the results are as follows:

Males: The p-value is less than 2.2e-16, which makes the relationship statistically significant. The adjusted R-squared is 0.4843, which means that about 48% of the variation in male survival to 65 can be explained by health spending.

Females: The p-value is less than 2.2e-16, which makes the relationship statistically significant. The adjusted R-squared is 0.6025, which means that about 60% of the variation in female survival to 65 can be explained by health spending.
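Rather than reading these numbers off the printed summary, they can be pulled out programmatically. The sketch below assumes a model fitted with cbind() on simulated data; the names toy, y_male, y_female, and x are made up for illustration and the values are not the WDI data.

```r
set.seed(2)

# Simulated data for illustration only
x <- runif(40, min = 5, max = 15)
toy <- data.frame(
  x = x,
  y_male   = 95 - 0.8 * x + rnorm(40, sd = 2),
  y_female = 98 - 0.6 * x + rnorm(40, sd = 1)
)

fit <- lm(cbind(y_male, y_female) ~ x, data = toy)

# summary() on a multi-response fit returns one summary per response
fit_summaries <- summary(fit)

# Adjusted R-squared and slope p-value for each response
adj_r2  <- sapply(fit_summaries, function(s) s$adj.r.squared)
slope_p <- sapply(fit_summaries, function(s) s$coefficients["x", "Pr(>|t|)"])

round(adj_r2, 3)
slope_p
```

Each element of the summary list behaves like an ordinary summary of a single-response lm(), so the usual fields such as adj.r.squared and coefficients are available.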

Filtering the Data

CountriesIWant <- Raw_Data |>
  filter(CountryName %in% c('Germany', 'France','Finland',"United Kingdom", "Denmark"))|>
  filter(!is.na(currentHealthSpendingPercentOfGDP)) |>
  filter(!is.na(Time))
options(scipen = 999)

Now that the statistical analysis has been completed, I cleaned my data by filtering it down to the countries that I wanted. I also used filter(!is.na()) to remove any NAs from the numerical variables that I am going to use for my scatter plot. Lastly, I used options(scipen = 999) so that total population is displayed in standard rather than scientific notation.
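The same cleaning steps can be written a little more compactly with tidyr's drop_na(), which removes rows with missing values in the listed columns. This is a sketch on a made-up tibble; the column name spending and all values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Made-up data for illustration only
toy <- tibble(
  CountryName = c("Denmark", "Denmark", "Finland", "Italy"),
  Time        = c(2011, 2012, 2011, 2011),
  spending    = c(10.1, NA, 9.2, 8.8)
)

cleaned <- toy |>
  # keep only the countries of interest
  filter(CountryName %in% c("Denmark", "Finland")) |>
  # drop rows where either listed column is missing
  drop_na(spending, Time)

nrow(cleaned)
```

Here the Italy row is removed by filter() and the Denmark 2012 row is removed by drop_na(), leaving two rows.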

Female and Male Life Expectancy Facet Wraps

I wanted to get a visualization of the relationships between male and female survival beyond 65 and GDP spending on health. So, I created facet wraps for both genders, grouping by country and including a smoothed trend line.

Female_EXPECT <- CountriesIWant |>
  ggplot(aes(x = currentHealthSpendingPercentOfGDP,
             y = survivalTo65FemalePercentOfPop)) +
  geom_point() +
  geom_line() +
  geom_smooth(color = "hotpink") +
  labs(x = "Health Spending (% of GDP)",
       y = "Life Expectancy up to 65 [FEMALE] (% of Population)") +
  facet_wrap(~ CountryName)
Female_EXPECT

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

table(Raw_Data$CountryName)


           Australia              Austria              Belgium 
                  11                   11                   11 
              Canada              Denmark              Finland 
                  11                   11                   11 
              France              Germany                Italy 
                  11                   11                   11 
               Japan          Korea, Rep.          Netherlands 
                  11                   11                   11 
              Norway                Spain               Sweden 
                  11                   11                   11 
         Switzerland United Arab Emirates       United Kingdom 
                  11                   11                   11 
       United States 
                  11

Male_EXPECT <- CountriesIWant |>
  ggplot(aes(x = currentHealthSpendingPercentOfGDP,
             y = survivalTo65MalePercentOfPop)) +
  geom_point() +
  geom_line() +
  geom_smooth(color = "darkblue") +
  labs(x = "Health Spending (% of GDP)",
       y = "Life Expectancy up to 65 [MALE] (% of Population)") +
  facet_wrap(~ CountryName)
Male_EXPECT

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
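As the message shows, geom_smooth() defaults to a loess curve when no method is given. If a straight regression line per country is wanted instead, the method can be set explicitly. The sketch below uses simulated data with made-up names (toy, country, spending, survival), not the WDI values.

```r
library(ggplot2)

set.seed(3)

# Simulated data for illustration only
toy <- data.frame(
  country  = rep(c("A", "B"), each = 11),
  spending = runif(22, min = 8, max = 12)
)
toy$survival <- 95 - 0.5 * toy$spending + rnorm(22, sd = 0.5)

p <- ggplot(toy, aes(x = spending, y = survival)) +
  geom_point() +
  # method = "lm" fits a separate straight line within each facet
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ country)
p
```

Setting se = FALSE hides the confidence band; leaving it at the default would show the uncertainty around each line.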

Scatter Plot

SP <- CountriesIWant |>
  ggplot(aes(x = Time, 
             y = currentHealthSpendingPercentOfGDP, 
             color = CountryName, 
             fill = TotalPopulation/1e6,
             text = paste0(
               "Year: ", Time, "<br>",
               "Country: ", CountryName, "<br>",
               "Population (Millions): ", round(TotalPopulation/1e6, 2), "<br>",
               "Health Spending (% GDP): ", round(currentHealthSpendingPercentOfGDP, 2), "<br>",
               "Life Expectancy up to 65 [FEMALE] (% of Population): ", round(survivalTo65FemalePercentOfPop, 2), "<br>",
               "Life Expectancy up to 65 [MALE] (% of Population): ", round(survivalTo65MalePercentOfPop, 2), "<br>"))) +
  geom_line(aes(group = CountryName), linewidth = 1) +  
  geom_point(size = 3, shape = 23, stroke = 1) +
  labs(title = "Life Expectancy and Country Health Spending Over Time \n (2011 - 2021)",
       x = "Years",
       y = "Health Spending (% of GDP)",
       color = "Country",
       fill = "Population (Millions)",
       caption = "Source: World Development Indicators (WDI)") +
  theme_bw() +
  scale_fill_gradient(low = "red", high = "blue") +
  scale_color_brewer(palette = "Set2")

SP

For my final visualization, I created a scatter plot to show the percentage of GDP each country spent on health from 2011 to 2021. The focus is on Denmark, Finland, Germany, Japan, and the United Kingdom, which I distinguished by color. Population was used for the fill. Finally, I added the appropriate axis labels, legend titles, citation, and title.

Once the scatter plot was complete, I made an interactive plot using plotly. I had a couple of issues that were out of my hands: the line did not transfer, and neither did the citation. Besides those headaches, I used the text aesthetic to customize the labels and add more information to the tooltip. The variables that I included were the percentages of females and males surviving to sixty-five.

ggplotly(SP, tooltip = "text")

When approaching this project, I expected the relationship between health spending and the percentage of females and males surviving to sixty-five to be positive. However, after conducting correlation tests and fitting a linear model, the results showed that spending had a negative relationship with the percentage of females and males surviving past sixty-five. The R-squared values show that 48% of the variation in the male data and 60% of the variation in the female data can be explained by the share of GDP spent on health.

I then created a scatter plot to show the percentage of GDP spent on health over the years. One thing that caught my attention was the increase in health spending after 2019. My guess was that the sudden rise was due to COVID-19, and after some research this appears to be the case. A report published by the World Health Organization (WHO) in 2022 focused on government spending from 2019 to 2020. According to the report, from 2000 to 2019 health spending became increasingly unequal in per capita terms, with some countries spending very little on health. In addition, there is a major difference between low- and high-income countries: low-income countries spend less on health and high-income countries spend more (WHO). This caused problems when it came to dealing with the COVID-19 pandemic that started at the end of 2019. To combat the pandemic, countries worldwide focused more on health. The pandemic may also be one reason why the current relationship is negative.

Overall, this project was a very educational experience, both in coding and in learning about countries’ health spending habits. Hopefully, countries continue to put more money into health so that if a new pandemic arises there will be a cushion against the impact. One thing I would have liked to incorporate was more data visualizations and a way to fix the line in plotly.

Bibliography

World Health Organization. “Global Spending on Health: Rising to the Pandemic’s Challenges.” www.who.int, 8 Dec. 2022, www.who.int/publications/i/item/9789240064911.

Sources:

For multiple correlations - https://stackoverflow.com/questions/38548943/correlation-between-multiple-variables-of-a-data-frame

Adding variables to the tooltip - https://stackoverflow.com/questions/36325154/how-to-choose-variable-to-display-in-tooltip-when-using-ggplotly

cbind - https://www.programmingr.com/examples/r-dataframe/cbind-in-r/

https://www.statology.org/cbind-in-r/