Data Preparation

library(Hmisc)
library(tidyverse)

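# Raw CSV of the life expectancy data, hosted on GitHub.
# Named data_url so it does not shadow base R's url() function.
data_url <- "https://raw.githubusercontent.com/peterphung2043/DATA-606---Final-Project/main/Life%20Expectancy%20Data.csv"

life_expectancy_data <- read.csv(data_url)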

parsed_life_expectancy_data <- life_expectancy_data %>%
  select(Life.expectancy, Schooling)

Research question

Is the number of years a country's population spends in school associated with higher or lower life expectancy?

Cases

Each case is a country in a given year. The dataset covers 193 countries, with about 15 years of data for each. That works out to roughly 2,900 country-year rows; the summary statistics below show 2938 rows in total (2928 observed plus 10 missing values for life expectancy). I plan to focus on the most recent year for each country (2015), and may also compare the earliest year with the latest year for each country.
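The two approaches described above (keeping only the most recent year per country, and comparing the earliest year with the latest) could be sketched as follows. This is a minimal illustration on a made-up data frame named `toy`; it assumes the full dataset has `Country` and `Year` columns, which the code above has not shown, and the values are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for the full dataset; values are made up for illustration.
# Assumes the real data has Country and Year columns.
toy <- data.frame(
  Country = c("A", "A", "B", "B"),
  Year = c(2000, 2015, 2000, 2015),
  Life.expectancy = c(60.1, 65.3, 70.2, 74.8),
  Schooling = c(8.0, 10.1, 11.5, 13.0)
)

# Keep only the most recent year available for each country.
latest <- toy %>%
  group_by(Country) %>%
  slice_max(Year, n = 1, with_ties = FALSE) %>%
  ungroup()

# Change from each country's earliest year to its latest year.
change <- toy %>%
  group_by(Country) %>%
  arrange(Year, .by_group = TRUE) %>%
  summarise(
    first_year = first(Year),
    last_year = last(Year),
    life_expectancy_change = last(Life.expectancy) - first(Life.expectancy),
    schooling_change = last(Schooling) - first(Schooling)
  )
```

Using `slice_max()` rather than filtering on `Year == 2015` keeps countries whose latest record is from an earlier year, which may or may not be what the final analysis wants.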

Data collection

The World Health Organization (WHO) keeps track of the health status of every country in the world. This data was collected from the WHO and United Nations websites and uploaded to Kaggle.

Type of study

This is an observational study.

Data Source

KumarRajarshi. (2018, February 10). Life expectancy (WHO). Kaggle. Retrieved October 31, 2021, from https://www.kaggle.com/kumarajarshi/life-expectancy-who.

Dependent Variable

The dependent variable is life expectancy, which is quantitative.

Independent Variable

The independent variable is the number of years of schooling, which is quantitative.

Relevant summary statistics

describe(parsed_life_expectancy_data$Life.expectancy)
## parsed_life_expectancy_data$Life.expectancy 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     2928       10      362        1    69.22    10.62     51.4     54.8 
##      .25      .50      .75      .90      .95 
##     63.1     72.1     75.7     79.7     82.0 
## 
## lowest : 36.3 39.0 41.0 41.5 42.3, highest: 85.0 86.0 87.0 88.0 89.0
describe(parsed_life_expectancy_data$Schooling)
## parsed_life_expectancy_data$Schooling 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     2775      163      173        1    11.99    3.713      5.8      7.7 
##      .25      .50      .75      .90      .95 
##     10.1     12.3     14.3     15.9     16.8 
## 
## lowest :  0.0  2.8  2.9  3.0  3.1, highest: 20.3 20.4 20.5 20.6 20.7
parsed_life_expectancy_data %>%
  pivot_longer(cols = c(Life.expectancy, Schooling)) %>%
  ggplot(mapping = aes(x = value)) +
  geom_histogram() +
  facet_wrap(~name, scales = "free_x")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 173 rows containing non-finite values (stat_bin).
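The 173 rows removed in the warning above match the missing counts reported by `describe()` (10 for life expectancy plus 163 for schooling). One way to make that handling explicit, and to silence the `bins = 30` message, is sketched below. The data frame `toy` is a hypothetical stand-in with made-up values, and the bin width of 1 year is an assumption, not a tuned choice.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for parsed_life_expectancy_data; values are made up.
toy <- data.frame(
  Life.expectancy = c(65.0, 59.9, NA, 72.4),
  Schooling = c(10.1, NA, 9.8, 12.3)
)

# Count missing values per column before plotting.
missing_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Reshape to long form and drop missing values explicitly,
# instead of letting stat_bin() remove them with a warning.
long <- toy %>%
  pivot_longer(cols = c(Life.expectancy, Schooling)) %>%
  drop_na(value)

# Both variables are measured in years, so a 1-year bin width is assumed here.
p <- ggplot(long, mapping = aes(x = value)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~name, scales = "free_x")
```

Printing `p` draws the same faceted histogram as above, without the message or the warning.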

parsed_life_expectancy_data %>%
  ggplot(mapping = aes(x = Schooling, y = Life.expectancy)) +
  geom_point()
## Warning: Removed 170 rows containing missing values (geom_point).
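To put a number on the relationship shown in the scatter plot, one option is a Pearson correlation and a simple linear regression, restricted to rows where both variables are present. The sketch below is not part of the original analysis; it runs on a hypothetical data frame `toy` with made-up values, so its results say nothing about the real data.

```r
# Hypothetical stand-in for parsed_life_expectancy_data; values are made up.
toy <- data.frame(
  Schooling = c(8.0, 10.1, 11.5, 13.0, NA),
  Life.expectancy = c(60.1, 65.3, 70.2, 74.8, 68.0)
)

# Pearson correlation using only rows where both variables are present.
r <- cor(toy$Schooling, toy$Life.expectancy, use = "complete.obs")

# Simple linear regression of life expectancy on schooling.
# lm() drops incomplete rows by default (na.action = na.omit).
fit <- lm(Life.expectancy ~ Schooling, data = toy)
slope <- coef(fit)[["Schooling"]]
```

On the real data, the same calls would take `parsed_life_expectancy_data` in place of `toy`. Because this is an observational study, any association found this way would not by itself show that schooling causes longer life.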