Project 2

Life Expectancy and Socioeconomic Variables

The dataset that I chose for this project was the dataset provided in our google drive which contains variables relating to life expectancy and various socio-economic variables. This dataset is derived from The World Bank, a global partnership that combines five institutions with the goal of eliminating world poverty. As the institutions involved with The World Bank are major international financial corporations, it makes sense that they are able to provide enough information to compile a dataset regarding the generalized socio-economic status of various countries.

The World Bank is associated with the United Nations and is one of the largest sources of financial assistance to developing countries. The loans and grants that are given to the governments of developing countries are aimed to help these countries grow. An interesting point I found is that the governing body of The World Bank is based off of a country’s capital subscription, so those that are more wealthy and developed have more influence.

This dataset includes a plethora of variables, but the quantitative ones that I have chosen to focus on are: year, life expectancy, undernourishment, CO2 levels, health expenditures, education expenditures, unemployment, sanitation, and injuries. I chose this dataset because I was interested in how third-world countries were affected by such variables in comparison to other countries. It is important to know how people around the world are affected by matters which may seem insignificant to us, such as easily treatable injuries or having the privilege of high government education expenditures.

Load Libraries

suppressWarnings({
library(corrplot)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(sf)
library(rnaturalearth)
library(highcharter)
library(leaflet)
library(countrycode)
library(tmaptools)
library(readr)})

corrplot 0.92 loaded

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

Load Dataset

suppressWarnings({
setwd("C:/Users/rafiz/Downloads")
data <- read_csv("life_exp_kaggle_full.csv")
head(data)})

Rows: 3306 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): Country Name, Country Code, Region, IncomeGroup
dbl (12): Year, Life Expectancy World Bank, Prevelance of Undernourishment, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 6 × 16
  `Country Name`  `Country Code` Region IncomeGroup  Year Life Expectancy Worl…¹
  <chr>           <chr>          <chr>  <chr>       <dbl>                  <dbl>
1 Afghanistan     AFG            South… Low income   2001                   56.3
2 Angola          AGO            Sub-S… Lower midd…  2001                   47.1
3 Albania         ALB            Europ… Upper midd…  2001                   74.3
4 Andorra         AND            Europ… High income  2001                   NA  
5 United Arab Em… ARE            Middl… High income  2001                   74.5
6 Argentina       ARG            Latin… Upper midd…  2001                   73.8
# ℹ abbreviated name: ¹`Life Expectancy World Bank`
# ℹ 10 more variables: `Prevelance of Undernourishment` <dbl>, CO2 <dbl>,
#   `Health Expenditure %` <dbl>, `Education Expenditure %` <dbl>,
#   Unemployment <dbl>, Corruption <dbl>, Sanitation <dbl>, Injuries <dbl>,
#   Communicable <dbl>, NonCommunicable <dbl>

Clean the data

#removing all NA data
data_clean <- data |>
  filter(!is.na(`Life Expectancy World Bank`)) |>
  filter(!is.na(`Prevelance of Undernourishment`)) |>
  filter(!is.na(CO2)) |>
  filter(!is.na(`Health Expenditure %`)) |>
  filter(!is.na(`Education Expenditure %`)) |>
  filter(!is.na(Unemployment)) |>
  filter(!is.na(Sanitation)) |>
  filter(!is.na(Injuries)) 

#getting rid of columns i dont want/deem unnecessary  
data_clean2 <- data_clean[, -c(12, 15, 16)]

#removing the non-numeric columns for the correlation plot
cor_data_clean3 <- data_clean2[, -c(1, 2, 3, 4)]

#renaming for ease
cor_data_clean4 <- cor_data_clean3 |>
  rename(lifeExp = `Life Expectancy World Bank`, undernourished = `Prevelance of Undernourishment`, healthExp = `Health Expenditure %`, eduExp = `Education Expenditure %`, unemployment = Unemployment, sanitation = Sanitation, injuries = Injuries)

#country names + renamed data
data_clean3 <- data_clean2 |>
  rename(lifeExp = `Life Expectancy World Bank`, undernourished = `Prevelance of Undernourishment`, healthExp = `Health Expenditure %`, eduExp = `Education Expenditure %`, unemployment = Unemployment, sanitation = Sanitation, injuries = Injuries, coName = `Country Name`, coCode = `Country Code`, region = Region, incGroup = IncomeGroup, year = Year)

Correlation Plot

cor <- cor(cor_data_clean4)
corrplot(cor, 
         method = "number",
         tl.cex = 0.6,
         number.cex = 0.8,
         bg = "lightgray",
         title = "Correlation Plot",
         tl.srt = 45
         )

From this plot, it is easier to see that variables such as sanitation and malnourishment have an effect specifically on life expectancy. It is also easier to visualize which variables should be taken into account when considering how to create the visualizations.

Scatter Plot

ggplot(cor_data_clean4, aes(x = `undernourished`, y = `lifeExp`)) +
  labs(title = "Undernourishment vs Life Expectancy", x = "Undernourishment", y = "Life Expectancy", caption = "The World Bank") +
  geom_point(color = "pink") +
  geom_smooth(method = lm) +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

Here I started with a simple scatterplot with a linear regression model to show how undernourishment affects life expectancy. There is a very clear line that shows that higher rates of malnourishment lead to a lower life expectancy.

Linear Regression Models

fit1 <- lm(data = cor_data_clean4, lifeExp ~ undernourished + CO2 + healthExp + eduExp + unemployment + sanitation + injuries)
summary(fit1)


Call:
lm(formula = lifeExp ~ undernourished + CO2 + healthExp + eduExp + 
    unemployment + sanitation + injuries, data = cor_data_clean4)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.8491  -2.0877   0.3807   2.9323  20.0466 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     7.171e+01  6.239e-01 114.929  < 2e-16 ***
undernourished -5.634e-01  1.796e-02 -31.375  < 2e-16 ***
CO2            -2.036e-07  1.803e-07  -1.129   0.2592    
healthExp       5.543e-01  5.960e-02   9.301  < 2e-16 ***
eduExp         -1.641e-01  8.107e-02  -2.025   0.0431 *  
unemployment   -1.986e-01  2.460e-02  -8.072 1.57e-15 ***
sanitation      7.684e-02  5.962e-03  12.889  < 2e-16 ***
injuries        5.695e-08  2.657e-08   2.144   0.0323 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.576 on 1288 degrees of freedom
Multiple R-squared:  0.7246,    Adjusted R-squared:  0.7231 
F-statistic: 484.2 on 7 and 1288 DF,  p-value: < 2.2e-16

From this model, you can see that the R-squared value is at 72.31% which is already very high, but lets see if removing the variables with the next highest p-values makes a difference:

fit2 <- lm(data = cor_data_clean4, lifeExp ~ undernourished + healthExp + unemployment + sanitation + injuries)
summary(fit2)


Call:
lm(formula = lifeExp ~ undernourished + healthExp + unemployment + 
    sanitation + injuries, data = cor_data_clean4)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.9908  -2.0431   0.3706   2.9597  19.9203 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     7.113e+01  5.489e-01 129.582  < 2e-16 ***
undernourished -5.561e-01  1.768e-02 -31.451  < 2e-16 ***
healthExp       5.257e-01  5.830e-02   9.018  < 2e-16 ***
unemployment   -2.027e-01  2.444e-02  -8.293 2.74e-16 ***
sanitation      7.597e-02  5.956e-03  12.755  < 2e-16 ***
injuries        3.794e-08  1.871e-08   2.027   0.0428 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.582 on 1290 degrees of freedom
Multiple R-squared:  0.7235,    Adjusted R-squared:  0.7225 
F-statistic: 675.2 on 5 and 1290 DF,  p-value: < 2.2e-16

Here we can see that the R-squared value actually decreased so I chose to stick with the first fit.

Diagnostic Plots

par(mfrow = c(2,2))
plot(fit1)

There aren’t many outliers on this plot and it would seem that a linear plot would work well.

Equation

The equation for my model is: 71.71 + (-0.5634)undernourished + (-2.036e-07)CO2 + (0.5543)healthExp + (-0.1641)eduExp + (-0.1986)unemployment + (7.684e-02)sanitation + (5.695e-08)injuries

Final Visualizaton

#focusing on the year 2003
hc_data <- data_clean3 |>
  filter(year == 2003)

#using the countries in the plot and converting to find the respective long and lat
geo=read_csv("https://gist.github.com/tadast/8827699/raw/61b2107766d6fd51e2bd02d9f78f6be081340efc/countries_codes_and_coordinates.csv") %>% 
  select(3,5:6)

Rows: 262 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Country, Alpha-2 code, Alpha-3 code
dbl (3): Numeric code, Latitude (average), Longitude (average)

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

output <- hc_data %>% left_join(y = geo,by = join_by(coCode==`Alpha-3 code`))

vis_map <- output |>
  rename(long = `Longitude (average)`, lat = `Latitude (average)`)

#interactivity data
popup_data <- paste0(
  "<b>Year:  </b>", vis_map$year, "<br>",
  "<b>Country:  </b>", vis_map$coName, "<br>",
  "<b>Life Expectancy:  </b>", vis_map$lifeExp, "<br>",
  "<b>Education Expenditures in %:  </b>", vis_map$eduExp, "<br>"
  
)


leaflet() |>
  addProviderTiles("Esri.NatGeoWorldMap") |>
  addCircles(
    data = vis_map,
    radius = vis_map$eduExp*100000, #to make the visualizations easier to see for the viewer
    color = "#e93323",
    popup = popup_data
  )

Assuming "long" and "lat" are longitude and latitude, respectively

#trying other years just for comparison
hc_data2 <- data_clean3 |>
  filter(year == 2010)

hc_data3 <- data_clean3 |>
  filter(year == 2008)

#using the countries in the plot and converting to find the respective long and lat
geo=read_csv("https://gist.github.com/tadast/8827699/raw/61b2107766d6fd51e2bd02d9f78f6be081340efc/countries_codes_and_coordinates.csv") %>% 
  select(3,5:6)

Rows: 262 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Country, Alpha-2 code, Alpha-3 code
dbl (3): Numeric code, Latitude (average), Longitude (average)

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

output <- hc_data2 %>% left_join(y = geo,by = join_by(coCode==`Alpha-3 code`))

vis_map2 <- output |>
  rename(long = `Longitude (average)`, lat = `Latitude (average)`)

#using the countries in the plot and converting to find the respective long and lat
geo=read_csv("https://gist.github.com/tadast/8827699/raw/61b2107766d6fd51e2bd02d9f78f6be081340efc/countries_codes_and_coordinates.csv") %>% 
  select(3,5:6)

Rows: 262 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Country, Alpha-2 code, Alpha-3 code
dbl (3): Numeric code, Latitude (average), Longitude (average)

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

output <- hc_data3 %>% left_join(y = geo,by = join_by(coCode==`Alpha-3 code`))

vis_map3 <- output |>
  rename(long = `Longitude (average)`, lat = `Latitude (average)`)

#interactivity data
popup_data <- paste0(
  "<b>Year:  </b>", vis_map2$year, "<br>",
  "<b>Country:  </b>", vis_map2$coName, "<br>",
  "<b>Life Expectancy:  </b>", vis_map2$lifeExp, "<br>",
  "<b>Education Expenditures in %:  </b>", vis_map2$eduExp, "<br>"
  
)


leaflet() |>
  addProviderTiles("Esri.NatGeoWorldMap") |>
  addCircles(
    data = vis_map2,
    radius = vis_map2$eduExp*100000, #to make the visualizations easier to see for the viewer
    color = "#e93323",
    popup = popup_data
  )

Assuming "long" and "lat" are longitude and latitude, respectively

#interactivity data
popup_data <- paste0(
  "<b>Year:  </b>", vis_map3$year, "<br>",
  "<b>Country:  </b>", vis_map3$coName, "<br>",
  "<b>Life Expectancy:  </b>", vis_map3$lifeExp, "<br>",
  "<b>Education Expenditures in %:  </b>", vis_map3$eduExp, "<br>"
  
)


leaflet() |>
  addProviderTiles("Esri.NatGeoWorldMap") |>
  addCircles(
    data = vis_map3,
    radius = vis_map3$eduExp*100000, #to make the visualizations easier to see for the viewer
    color = "#e93323",
    popup = popup_data
  )

Assuming "long" and "lat" are longitude and latitude, respectively

Conclusion

This visualization includes a colored world map with various sized red points on it. The size of each point represents the percentage each country’s government spends on education expenditures. I wanted to see if education had an impact on the general life expectancy of citizens. I also decided to multiply the information from my dataset by 100,000 for ease of viewership for the user, without compromising the quality of the data. I have also changed the year two other times just to compare and see the differences between countries through the years. It was interesting to see countries with massive bubbles having extremely large life expectency which would seem accurate however, there are also many countries with smaller bubbles that seem to have just as high rates.

I attempted to add additional coding so that the user themselves would be able to input a year from 2000-2019, the parameters of my dataset, however, the issue came when attempting to gather the longitude and latitude for each country because that information is not included in the dataset. It just did not work out in the way that I hoped it would so I kept the year manual. I also attempted to manipulate the colors of the bubbles so that the user could be made visually aware of the life expectancy of each country however, that also did not work out.