Life Expectancy and Socioeconomic Variables Project

The dataset I chose for this project relates life expectancy to various socioeconomic variables. It is derived from the World Bank, a global partnership of five institutions with the goal of eliminating world poverty. Because the institutions that make up the World Bank are major international financial organizations, it makes sense that they are able to gather enough information to compile a dataset on the general socioeconomic status of many countries.

The World Bank is associated with the United Nations and is one of the largest sources of financial assistance to developing countries. The loans and grants given to the governments of developing countries are intended to help these countries grow. This dataset includes many variables, but the quantitative ones that I have chosen to focus on are: year, life expectancy, undernourishment, CO2 levels, health expenditures, education expenditures, unemployment, sanitation, and injuries. I chose this dataset because I was interested in how developing countries are affected by these variables in comparison to other countries. It is important to know how people around the world are affected by matters that may seem insignificant to us, such as easily treatable injuries or having the privilege of high government education expenditures.

Load Libraries

suppressWarnings({
library(corrplot)
library(tidyverse)
library(ggplot2)
library(dplyr)})

## corrplot 0.92 loaded

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load Dataset

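suppressWarnings({
# The working directory is specific to the author's machine; change it to the folder that holds the CSV
setwd("C:/Users/rafiz/Downloads")
data <- read_csv("life_exp_kaggle_full.csv")
head(data)})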

## Rows: 3306 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): Country Name, Country Code, Region, IncomeGroup
## dbl (12): Year, Life Expectancy World Bank, Prevelance of Undernourishment, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 6 × 16
##   `Country Name`  `Country Code` Region IncomeGroup  Year Life Expectancy Worl…¹
##   <chr>           <chr>          <chr>  <chr>       <dbl>                  <dbl>
## 1 Afghanistan     AFG            South… Low income   2001                   56.3
## 2 Angola          AGO            Sub-S… Lower midd…  2001                   47.1
## 3 Albania         ALB            Europ… Upper midd…  2001                   74.3
## 4 Andorra         AND            Europ… High income  2001                   NA  
## 5 United Arab Em… ARE            Middl… High income  2001                   74.5
## 6 Argentina       ARG            Latin… Upper midd…  2001                   73.8
## # ℹ abbreviated name: ¹`Life Expectancy World Bank`
## # ℹ 10 more variables: `Prevelance of Undernourishment` <dbl>, CO2 <dbl>,
## #   `Health Expenditure %` <dbl>, `Education Expenditure %` <dbl>,
## #   Unemployment <dbl>, Corruption <dbl>, Sanitation <dbl>, Injuries <dbl>,
## #   Communicable <dbl>, NonCommunicable <dbl>

Clean the data

#removing rows with a missing value in any of the variables used in the analysis
data_clean <- data |>
  drop_na(
    `Life Expectancy World Bank`,
    `Prevelance of Undernourishment`,
    CO2,
    `Health Expenditure %`,
    `Education Expenditure %`,
    Unemployment,
    Sanitation,
    Injuries
  )

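#dropping the columns not used in the analysis: Corruption (12), Communicable (15), NonCommunicable (16)
data_clean2 <- data_clean[, -c(12, 15, 16)]

#removing the non-numeric columns (Country Name, Country Code, Region, IncomeGroup) for the correlation plot
cor_data_clean3 <- data_clean2[, -c(1, 2, 3, 4)]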

#renaming for ease
cor_data_clean4 <- cor_data_clean3 |>
  rename(
    lifeExp = `Life Expectancy World Bank`,
    undernourished = `Prevelance of Undernourishment`,
    healthExp = `Health Expenditure %`,
    eduExp = `Education Expenditure %`,
    unemployment = Unemployment,
    sanitation = Sanitation,
    injuries = Injuries
  )
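
The cleaning steps above drop columns by position, which works only as long as the column order of the CSV never changes. As a rough sketch of the same idea written by column name, the example below runs on a small made-up tibble. The `toy` data and its values are invented for illustration and are not taken from the World Bank file.

```r
library(dplyr)

# Toy stand-in for the real dataset; every value here is invented for illustration
toy <- tibble(
  `Country Name` = c("A", "B", "C"),
  Year = c(2001, 2001, 2001),
  `Life Expectancy World Bank` = c(60.1, 72.4, NA),
  Corruption = c(2.5, NA, 3.0),
  Communicable = c(100, 200, 300)
)

toy_numeric <- toy |>
  filter(!is.na(`Life Expectancy World Bank`)) |>  # drop rows missing the outcome
  select(-Corruption, -Communicable) |>            # drop unwanted columns by name
  select(where(is.numeric)) |>                     # keep only the numeric columns
  rename(lifeExp = `Life Expectancy World Bank`)   # shorter name for modelling

names(toy_numeric)
```

Selecting by name keeps the code readable and makes it fail loudly if a column is missing, rather than silently dropping the wrong one.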

Correlation Plot

#storing the matrix under its own name so it does not mask the cor() function
cor_matrix <- cor(cor_data_clean4)
corrplot(cor_matrix, 
         method = "number",
         tl.cex = 0.6,
         number.cex = 0.8,
         bg = "lightgray",
         title = "Correlation Plot",
         tl.srt = 45
         )

Linear Regression Models

fit1 <- lm(data = cor_data_clean4, lifeExp ~ undernourished)
summary(fit1)

## 
## Call:
## lm(formula = lifeExp ~ undernourished, data = cor_data_clean4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.754  -2.525   0.424   3.768  23.172 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    78.77530    0.19290  408.38   <2e-16 ***
## undernourished -0.73642    0.01556  -47.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.266 on 1294 degrees of freedom
## Multiple R-squared:  0.6337, Adjusted R-squared:  0.6334 
## F-statistic:  2239 on 1 and 1294 DF,  p-value: < 2.2e-16

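My null hypothesis was that the nourishment of a country's citizens has no relationship with life expectancy. However, the observed p-value (less than 2e-16) shows that the result is statistically significant, so we can reject the null hypothesis and conclude that life expectancy is associated with whether or not people are receiving adequate food. The estimated slope of about -0.74 means that each one-unit increase in the undernourishment measure is associated with roughly 0.74 fewer years of life expectancy, and undernourishment alone accounts for about 63% of the variation in life expectancy (R-squared of 0.6337).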
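
To make the fitted line concrete, the sketch below plugs the two coefficients from the `summary(fit1)` output into the regression equation. The helper `predict_life_exp()` and the input values 0, 10, and 30 are hypothetical and chosen only for illustration. With the fitted model in memory, `predict(fit1, newdata = ...)` would normally be used instead.

```r
# Coefficients copied from the summary(fit1) output above
intercept <- 78.77530
slope <- -0.73642

# Hypothetical helper (not part of the original analysis):
# predicted life expectancy for a given undernourishment value
predict_life_exp <- function(undernourished) {
  intercept + slope * undernourished
}

# Illustrative inputs only
predictions <- predict_life_exp(c(0, 10, 30))
predictions
```

Each 10-unit step in undernourishment lowers the predicted life expectancy by about 7.4 years, which is simply ten times the slope.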

fit2 <- lm(data = cor_data_clean4, lifeExp ~ undernourished + CO2 + healthExp + eduExp + unemployment + sanitation + injuries)
summary(fit2)

## 
## Call:
## lm(formula = lifeExp ~ undernourished + CO2 + healthExp + eduExp + 
##     unemployment + sanitation + injuries, data = cor_data_clean4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.8491  -2.0877   0.3807   2.9323  20.0466 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.171e+01  6.239e-01 114.929  < 2e-16 ***
## undernourished -5.634e-01  1.796e-02 -31.375  < 2e-16 ***
## CO2            -2.036e-07  1.803e-07  -1.129   0.2592    
## healthExp       5.543e-01  5.960e-02   9.301  < 2e-16 ***
## eduExp         -1.641e-01  8.107e-02  -2.025   0.0431 *  
## unemployment   -1.986e-01  2.460e-02  -8.072 1.57e-15 ***
## sanitation      7.684e-02  5.962e-03  12.889  < 2e-16 ***
## injuries        5.695e-08  2.657e-08   2.144   0.0323 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.576 on 1288 degrees of freedom
## Multiple R-squared:  0.7246, Adjusted R-squared:  0.7231 
## F-statistic: 484.2 on 7 and 1288 DF,  p-value: < 2.2e-16
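
One way to ask whether the six extra predictors in `fit2` improve on `fit1` is a partial F-test. The sketch below rebuilds that statistic from the rounded R-squared values printed in the two summaries, so it is only approximate. With both models in memory, `anova(fit1, fit2)` would give the exact version. The variable names are my own.

```r
# Values copied from the two model summaries above (rounded as printed)
r2_reduced    <- 0.6337  # fit1: undernourished only
r2_full       <- 0.7246  # fit2: all seven predictors
extra_terms   <- 6       # predictors added going from fit1 to fit2
df_resid_full <- 1288    # residual degrees of freedom of fit2

# Partial F statistic: gain in R-squared per added term,
# relative to the unexplained variation per residual degree of freedom
f_stat <- ((r2_full - r2_reduced) / extra_terms) /
  ((1 - r2_full) / df_resid_full)

# Upper-tail probability under the F distribution
p_value <- pf(f_stat, extra_terms, df_resid_full, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```

The statistic comes out at roughly 71 on 6 and 1288 degrees of freedom, so the additional variables jointly explain significantly more variation than undernourishment alone.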

Scatter Plot

ggplot(cor_data_clean4, aes(x = `undernourished`, y = `lifeExp`)) +
  labs(title = "Undernourished vs Life Expectancy", x = "Undernourished", y = "Life Expectancy", caption = "The World Bank") +
  geom_point(color = "pink") +
  geom_smooth(method = lm) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Conclusion and Future Direction

Using the significance tests reported by the regression (t-tests, which are practically equivalent to a z-test at this sample size), I found that undernourishment had a significant association with life expectancy; however, it does not act alone. In the multiple regression, health expenditure, sanitation, and unemployment were also highly significant, education expenditure and injuries were significant at the 0.05 level, and CO2 was not significant. It may seem obvious that the amount of food one has access to would affect one's life expectancy, but I was also able to identify other variables that matter.
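
As a small check on how the p-values in the regression tables relate to a z-test, the sketch below recomputes a two-sided p-value for the `eduExp` row of the second model, once from the t distribution and once from the standard normal. The t value and degrees of freedom are copied from the printed output, so the result is subject to rounding.

```r
# t value and residual degrees of freedom copied from the summary(fit2) output (eduExp row)
t_value <- -2.025
df_resid <- 1288

# Two-sided p-value from the t distribution (what lm() reports)
p_t <- 2 * pt(-abs(t_value), df = df_resid)

# The same statistic treated as a z-score under a standard normal
p_z <- 2 * pnorm(-abs(t_value))

round(c(t = p_t, z = p_z), 4)
```

With more than a thousand residual degrees of freedom the two values differ only in the fourth decimal place, and the t-based one should land close to the 0.0431 shown in the table.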

While my scatter plot displays only one explanatory variable, in the future I would like to implement multivariable plots and potentially take the geographic location of the data into account, since my dataset includes country codes. This data is extremely useful for allocating resources to different countries in the areas where it could help them thrive.
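
A first step toward a multivariable plot could be to color the points by income group. The sketch below uses a small made-up data frame, so its values are invented and are not World Bank figures. In the real analysis this would need a data frame that still contains `IncomeGroup`, such as `data_clean2`, rather than the numeric-only `cor_data_clean4`.

```r
library(ggplot2)

# Toy data for illustration only; these numbers are invented
toy <- data.frame(
  undernourished = c(3, 5, 12, 20, 28, 35),
  lifeExp = c(79, 76, 70, 64, 58, 54),
  IncomeGroup = c("High income", "High income", "Upper middle income",
                  "Lower middle income", "Low income", "Low income")
)

p <- ggplot(toy, aes(x = undernourished, y = lifeExp)) +
  geom_point(aes(color = IncomeGroup)) +                        # third variable shown as color
  geom_smooth(method = lm, formula = y ~ x, se = FALSE,
              color = "gray40") +                               # one overall trend line
  labs(title = "Undernourished vs Life Expectancy by Income Group",
       x = "Undernourished", y = "Life Expectancy", color = "Income Group") +
  theme_minimal()

p
```

Mapping color inside `geom_point()` rather than in the top-level `aes()` keeps a single regression line for all countries instead of one line per income group.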

References

Chen, James. “Z-Test Definition: Its Uses in Statistics Simply Explained with Example.” Investopedia, Investopedia, www.investopedia.com/terms/z/z-test.asp. Accessed 30 Apr. 2024.

“World Bank.” Encyclopædia Britannica, Encyclopædia Britannica, inc., 18 Apr. 2024, www.britannica.com/topic/World-Bank.

Proj2

Rafiza Rahman

2024-04-23