Project Proposal - Predicting Life Expectancy

Process

This proposal walks through the steps leading to the final project outcome, which is a regression model.

We will do some basic data cleanup and some exploratory data analysis (EDA).

Data Preparation

We will use the tidyverse library, which bundles most of the packages needed in this proposal and project. We will also use the reshape library for its melt function.

rm(list=ls())
library(tidyverse)
library(reshape)

The dataset came from Kaggle, but Kaggle requires account registration to download it.

For reproducibility I decided to launch a Linux server in AWS dedicated to this project and placed the dataset there. This way the code will work for everyone, regardless of whether they want to create a Kaggle account.

life_data <- read.csv("http://3.86.40.38/life_expectancy_data.csv")
head(life_data)
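Since the code depends on a single server being reachable, a local cache can make reruns more robust. The following is a minimal sketch in base R; read_cached_csv and the local file name are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: read a local copy if one exists,
# otherwise download the file first and then read it.
read_cached_csv <- function(url, path) {
  if (!file.exists(path)) {
    # mode = "wb" avoids line-ending changes on Windows
    download.file(url, destfile = path, mode = "wb", quiet = TRUE)
  }
  read.csv(path)
}

# Possible usage (the local file name is an assumption):
# life_data <- read_cached_csv("http://3.86.40.38/life_expectancy_data.csv",
#                              "life_expectancy_data.csv")
```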

Research questions

Using linear regression only, can we find a model, built from any combination of the metrics provided in the WHO dataset, that gives a reasonable level of predictability for life expectancy? That is, Life.expectancy is the response variable, and our job is to find the best predictor variables.
Do we see temporal variations (by year) in Life.expectancy that allow us to make statements about global or regional trends?
We suspect that a big driver of Life.expectancy is the Country itself. Can we remove the Country variable and maintain reasonable predictive power?
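The first and third questions come down to comparing candidate models. As a minimal sketch, adjusted R-squared could serve as the comparison metric; adj_r2 is a hypothetical helper, and the toy data frame below contains made-up values that only stand in for life_data.

```r
# Hypothetical helper: adjusted R-squared of a linear model for a formula.
adj_r2 <- function(formula, data) {
  summary(lm(formula, data = data))$adj.r.squared
}

# Toy data standing in for life_data (values invented for illustration only).
toy <- data.frame(
  Country = rep(c("A", "B", "C"), each = 4),
  Schooling = c(8, 9, 10, 11, 12, 13, 14, 15, 10, 11, 12, 13),
  Life.expectancy = c(60, 61.5, 62, 64, 70, 71, 73, 74, 66, 66.5, 68, 70)
)

# Same predictor, with and without the Country factor
with_country    <- adj_r2(Life.expectancy ~ Schooling + Country, toy)
without_country <- adj_r2(Life.expectancy ~ Schooling, toy)
```

The gap between the two values indicates how much explanatory power is lost when Country is dropped. Other criteria, such as AIC or cross-validated error, may turn out to be more appropriate in the actual project.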

Cases

What are the cases, and how many are there?

Let’s run a simple table command first.

head(table(life_data$Country, life_data$Year))

##                      
##                       2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
##   Afghanistan            1    1    1    1    1    1    1    1    1    1    1
##   Albania                1    1    1    1    1    1    1    1    1    1    1
##   Algeria                1    1    1    1    1    1    1    1    1    1    1
##   Angola                 1    1    1    1    1    1    1    1    1    1    1
##   Antigua and Barbuda    1    1    1    1    1    1    1    1    1    1    1
##   Argentina              1    1    1    1    1    1    1    1    1    1    1
##                      
##                       2011 2012 2013 2014 2015
##   Afghanistan            1    1    1    1    1
##   Albania                1    1    1    1    1
##   Algeria                1    1    1    1    1
##   Angola                 1    1    1    1    1
##   Antigua and Barbuda    1    1    1    1    1
##   Argentina              1    1    1    1    1

We see that the data is organized by country and by year: each case is one country in one year, and all other metrics are recorded for that combination.
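The table only shows the first six countries, so it may be worth confirming that no country-year pair is repeated anywhere. A minimal sketch in base R follows; is_unique_key is a hypothetical helper name.

```r
# Hypothetical helper: TRUE when no (Country, Year) pair appears more than once.
is_unique_key <- function(data, keys = c("Country", "Year")) {
  !any(duplicated(data[, keys]))
}

# Possible usage: is_unique_key(life_data)
```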

Let’s do some counts to check how much data we have in the dataset.

# Total number of rows
life_data %>%
  summarise(total = n())

# Number of distinct countries
life_data %>%
  summarise(n_countries = n_distinct(Country))

# Number of distinct years
life_data %>%
  summarise(n_year = n_distinct(Year))

We have a total of 2,938 records covering 193 countries over 16 years. Since 193 x 16 is 3,088 and we only have 2,938 rows, 150 country-year combinations are absent, so there must be some countries with missing data for some years.
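To see which countries account for the shortfall, the rows per country can be compared with the number of distinct years. This is a minimal sketch in base R; incomplete_countries is a hypothetical helper, and it assumes each country-year pair appears at most once.

```r
# Hypothetical helper: countries with fewer rows than there are distinct years.
incomplete_countries <- function(data) {
  n_years <- length(unique(data$Year))
  rows_per_country <- table(data$Country)
  short <- rows_per_country[rows_per_country < n_years]
  data.frame(Country = names(short),
             missing_years = n_years - as.integer(short))
}

# Possible usage: incomplete_countries(life_data)
```

Under that assumption, the sum of the missing_years column should equal the gap between the full country-by-year grid and the actual row count.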

Data collection

The Global Health Observatory (GHO) data repository under the World Health Organization (WHO) keeps track of the health status as well as many other related factors for all countries. The datasets are made available to the public for the purpose of health data analysis. The dataset on life expectancy and health factors for 193 countries was collected from the same WHO data repository website, and the corresponding economic data was collected from the United Nations website. It has been observed that in the past 15 years there has been huge development in the health sector, resulting in improved human mortality rates, especially in developing nations, in comparison to the past 30 years. Therefore this dataset considers data from 2000 to 2015 for 193 countries.

Acknowledgements

The data was collected from the WHO and United Nations websites with the help of Deeksha Russell and Duan Wang.

Type of study

What type of study is this (observational/experiment)?

This is an observational study: the data were recorded as they occurred, with no random assignment of countries to treatment and control groups. We will use linear regression to derive a set of predictors.

Data Source

The dataset came from Kaggle, which requires account registration to download it. For reproducibility I decided to launch a Linux server in AWS dedicated to this project and placed the dataset there. This way the code will work for everyone, regardless of whether they want to create a Kaggle account.

This is the original link of the dataset in Kaggle if you still want to access through Kaggle.

https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who/download

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

For this regression our dependent (response) variable is Life.expectancy, the average life expectancy at birth in a specific country and year. It is a quantitative variable.

Independent Variable

You should have two independent variables, one quantitative and one qualitative.

Let’s see all column names

names(life_data)

##  [1] "Country"                         "Year"                           
##  [3] "Status"                          "Life.expectancy"                
##  [5] "Adult.Mortality"                 "infant.deaths"                  
##  [7] "Alcohol"                         "percentage.expenditure"         
##  [9] "Hepatitis.B"                     "Measles"                        
## [11] "BMI"                             "under.five.deaths"              
## [13] "Polio"                           "Total.expenditure"              
## [15] "Diphtheria"                      "HIV.AIDS"                       
## [17] "GDP"                             "Population"                     
## [19] "thinness..1.19.years"            "thinness.5.9.years"             
## [21] "Income.composition.of.resources" "Schooling"

The dataset has 22 columns: the response plus 21 variables that will be our potential independent variables for this project. We will test them to derive the optimal regression model, which may end up using just 1, all 21, or some number in between. Most are quantitative, but we do have two qualitative variables, Country and Status.
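The split into quantitative and qualitative predictors can also be derived from the column types instead of being read off by eye. A minimal sketch in base R follows; split_by_type is a hypothetical helper, and it treats every numeric column as quantitative, which may not be the right call for a column such as Year.

```r
# Hypothetical helper: group candidate predictors by column type.
split_by_type <- function(data, response = "Life.expectancy") {
  predictors <- setdiff(names(data), response)
  is_num <- vapply(data[predictors], is.numeric, logical(1))
  list(quantitative = predictors[is_num],
       qualitative  = predictors[!is_num])
}

# Possible usage: split_by_type(life_data)
```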

First, some data observation to assess how clean the data is.

glimpse(life_data)

## Rows: 2,938
## Columns: 22
## $ Country                         <chr> "Afghanistan", "Afghanistan", "Afghani~
## $ Year                            <int> 2015, 2014, 2013, 2012, 2011, 2010, 20~
## $ Status                          <chr> "Developing", "Developing", "Developin~
## $ Life.expectancy                 <dbl> 65.0, 59.9, 59.9, 59.5, 59.2, 58.8, 58~
## $ Adult.Mortality                 <int> 263, 271, 268, 272, 275, 279, 281, 287~
## $ infant.deaths                   <int> 62, 64, 66, 69, 71, 74, 77, 80, 82, 84~
## $ Alcohol                         <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.~
## $ percentage.expenditure          <dbl> 71.279624, 73.523582, 73.219243, 78.18~
## $ Hepatitis.B                     <int> 65, 62, 64, 67, 68, 66, 63, 64, 63, 64~
## $ Measles                         <int> 1154, 492, 430, 2787, 3013, 1989, 2861~
## $ BMI                             <dbl> 19.1, 18.6, 18.1, 17.6, 17.2, 16.7, 16~
## $ under.five.deaths               <int> 83, 86, 89, 93, 97, 102, 106, 110, 113~
## $ Polio                           <int> 6, 58, 62, 67, 68, 66, 63, 64, 63, 58,~
## $ Total.expenditure               <dbl> 8.16, 8.18, 8.13, 8.52, 7.87, 9.20, 9.~
## $ Diphtheria                      <int> 65, 62, 64, 67, 68, 66, 63, 64, 63, 58~
## $ HIV.AIDS                        <dbl> 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1~
## $ GDP                             <dbl> 584.25921, 612.69651, 631.74498, 669.9~
## $ Population                      <dbl> 33736494, 327582, 31731688, 3696958, 2~
## $ thinness..1.19.years            <dbl> 17.2, 17.5, 17.7, 17.9, 18.2, 18.4, 18~
## $ thinness.5.9.years              <dbl> 17.3, 17.5, 17.7, 18.0, 18.2, 18.4, 18~
## $ Income.composition.of.resources <dbl> 0.479, 0.476, 0.470, 0.463, 0.454, 0.4~
## $ Schooling                       <dbl> 10.1, 10.0, 9.9, 9.8, 9.5, 9.2, 8.9, 8~

Let’s check for missing values (NA’s) and nonsensical values.

The easiest way to check all variables for NA’s and nonsensical values is a summary call on our life_data data frame. This way we can look at the Max, Min and NA’s of each variable.

summary(life_data)

##    Country               Year         Status          Life.expectancy
##  Length:2938        Min.   :2000   Length:2938        Min.   :36.30  
##  Class :character   1st Qu.:2004   Class :character   1st Qu.:63.10  
##  Mode  :character   Median :2008   Mode  :character   Median :72.10  
##                     Mean   :2008                      Mean   :69.22  
##                     3rd Qu.:2012                      3rd Qu.:75.70  
##                     Max.   :2015                      Max.   :89.00  
##                                                       NA's   :10     
##  Adult.Mortality infant.deaths       Alcohol        percentage.expenditure
##  Min.   :  1.0   Min.   :   0.0   Min.   : 0.0100   Min.   :    0.000     
##  1st Qu.: 74.0   1st Qu.:   0.0   1st Qu.: 0.8775   1st Qu.:    4.685     
##  Median :144.0   Median :   3.0   Median : 3.7550   Median :   64.913     
##  Mean   :164.8   Mean   :  30.3   Mean   : 4.6029   Mean   :  738.251     
##  3rd Qu.:228.0   3rd Qu.:  22.0   3rd Qu.: 7.7025   3rd Qu.:  441.534     
##  Max.   :723.0   Max.   :1800.0   Max.   :17.8700   Max.   :19479.912     
##  NA's   :10                       NA's   :194                             
##   Hepatitis.B       Measles              BMI        under.five.deaths
##  Min.   : 1.00   Min.   :     0.0   Min.   : 1.00   Min.   :   0.00  
##  1st Qu.:77.00   1st Qu.:     0.0   1st Qu.:19.30   1st Qu.:   0.00  
##  Median :92.00   Median :    17.0   Median :43.50   Median :   4.00  
##  Mean   :80.94   Mean   :  2419.6   Mean   :38.32   Mean   :  42.04  
##  3rd Qu.:97.00   3rd Qu.:   360.2   3rd Qu.:56.20   3rd Qu.:  28.00  
##  Max.   :99.00   Max.   :212183.0   Max.   :87.30   Max.   :2500.00  
##  NA's   :553                        NA's   :34                       
##      Polio       Total.expenditure   Diphtheria       HIV.AIDS     
##  Min.   : 3.00   Min.   : 0.370    Min.   : 2.00   Min.   : 0.100  
##  1st Qu.:78.00   1st Qu.: 4.260    1st Qu.:78.00   1st Qu.: 0.100  
##  Median :93.00   Median : 5.755    Median :93.00   Median : 0.100  
##  Mean   :82.55   Mean   : 5.938    Mean   :82.32   Mean   : 1.742  
##  3rd Qu.:97.00   3rd Qu.: 7.492    3rd Qu.:97.00   3rd Qu.: 0.800  
##  Max.   :99.00   Max.   :17.600    Max.   :99.00   Max.   :50.600  
##  NA's   :19      NA's   :226       NA's   :19                      
##       GDP              Population        thinness..1.19.years
##  Min.   :     1.68   Min.   :3.400e+01   Min.   : 0.10       
##  1st Qu.:   463.94   1st Qu.:1.958e+05   1st Qu.: 1.60       
##  Median :  1766.95   Median :1.387e+06   Median : 3.30       
##  Mean   :  7483.16   Mean   :1.275e+07   Mean   : 4.84       
##  3rd Qu.:  5910.81   3rd Qu.:7.420e+06   3rd Qu.: 7.20       
##  Max.   :119172.74   Max.   :1.294e+09   Max.   :27.70       
##  NA's   :448         NA's   :652         NA's   :34          
##  thinness.5.9.years Income.composition.of.resources   Schooling    
##  Min.   : 0.10      Min.   :0.0000                  Min.   : 0.00  
##  1st Qu.: 1.50      1st Qu.:0.4930                  1st Qu.:10.10  
##  Median : 3.30      Median :0.6770                  Median :12.30  
##  Mean   : 4.87      Mean   :0.6276                  Mean   :11.99  
##  3rd Qu.: 7.20      3rd Qu.:0.7790                  3rd Qu.:14.30  
##  Max.   :28.60      Max.   :0.9480                  Max.   :20.70  
##  NA's   :34         NA's   :167                     NA's   :163
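The NA counts are spread across the summary output, so a compact per-column view may be easier to scan. This is a minimal sketch in base R; na_summary is a hypothetical helper name.

```r
# Hypothetical helper: number and percentage of NA's per column,
# sorted so the columns with the most missing values come first.
na_summary <- function(data) {
  n_missing <- colSums(is.na(data))
  out <- data.frame(
    variable    = names(n_missing),
    n_missing   = as.integer(n_missing),
    pct_missing = round(100 * as.numeric(n_missing) / nrow(data), 1)
  )
  out[order(-out$n_missing), ]
}

# Possible usage: na_summary(life_data)
```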

We see NA’s in 14 of the 22 columns, with the most in Population (652), Hepatitis.B (553) and GDP (448). For the project we will have to decide what to do with them.

sum(is.na(life_data))

## [1] 2563

The data contains 2,563 NA values in total, about 4% of all cells, which is quite high. We may need to do some imputation later during the actual project work.
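One possible imputation approach, shown here only as a sketch and not as a decision of this proposal, is to fill each NA with the median of the same variable within the same country. impute_group_median is a hypothetical helper written in base R.

```r
# Hypothetical helper: replace NA's in x by the median of x within each group.
impute_group_median <- function(x, group) {
  # Median of the observed values in each group, aligned to every row
  group_med <- ave(x, group, FUN = function(v) median(v, na.rm = TRUE))
  # Fall back to the overall median when a whole group is missing
  group_med[is.na(group_med)] <- median(x, na.rm = TRUE)
  x[is.na(x)] <- group_med[is.na(x)]
  x
}

# Possible usage:
# life_data$GDP <- impute_group_median(life_data$GDP, life_data$Country)
```

Median imputation shrinks the variance of the imputed variable, so its effect on the regression would need to be checked before relying on it.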

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plots, boxplots, etc.). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

Let’s do a first Boxplot of the dependent variable Life Expectancy, to see the range of values globally.

par(mfrow= c(1,1))
boxplot(life_data$Life.expectancy, 
        main = "Boxplot", 
        ylab = "Life Expectancy Years")

Now let’s plot all other independent variables to see their distribution and outliers.

# Columns 5 to 22 hold the 18 numeric predictors (column 4 is the response)
par(mfrow = c(2,3))
invisible(lapply(5:10, function(i) boxplot(life_data[, i],
                                           main = colnames(life_data)[i])))

par(mfrow = c(2,3))
invisible(lapply(11:16, function(i) boxplot(life_data[, i],
                                            main = colnames(life_data)[i])))

par(mfrow = c(2,3))
invisible(lapply(17:22, function(i) boxplot(life_data[, i],
                                            main = colnames(life_data)[i])))
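The points drawn beyond the whiskers can also be counted, which makes it easier to compare variables than reading the plots. A minimal sketch in base R follows; count_outliers is a hypothetical helper that relies on boxplot.stats, the function that computes the statistics behind boxplot with the default 1.5 x IQR rule.

```r
# Hypothetical helper: number of points a default boxplot would draw as outliers.
count_outliers <- function(x) {
  # boxplot.stats ignores NA's and returns the outlying values in $out
  length(boxplot.stats(x)$out)
}

# Possible usage across the numeric predictors (columns 5 to 22):
# sapply(life_data[, 5:22], count_outliers)
```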

Let’s now look at scatter plots. The first one that comes to mind is life expectancy vs GDP.

par(mfrow = c(1,1))
life_data %>% 
  ggplot(aes(x=GDP, y=Life.expectancy)) + geom_point()

## Warning: Removed 453 rows containing missing values (geom_point).

The plot doesn’t show a linear relationship, so we may need to do some transformations if we want to use it.
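A log transformation of GDP is one candidate. The sketch below compares the R-squared of a simple linear fit on the raw and on the log-transformed predictor; compare_log_fit is a hypothetical helper, and the choice of the log transformation is an assumption to be checked against the data.

```r
# Hypothetical helper: R-squared of y ~ x versus y ~ log(x).
compare_log_fit <- function(x, y) {
  # Drop missing pairs; log() also needs strictly positive values
  keep <- !is.na(x) & !is.na(y) & x > 0
  x <- x[keep]
  y <- y[keep]
  c(raw = summary(lm(y ~ x))$r.squared,
    log = summary(lm(y ~ log(x)))$r.squared)
}

# Possible usage: compare_log_fit(life_data$GDP, life_data$Life.expectancy)
```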

Let’s create a matrix of scatter plots of the response and all numeric predictors.

pairs(life_data[,4:22])

It doesn’t look pretty since we have almost 20 variables, but it gives us an idea of how correlated the variables are with each other.

This time let’s plot just a few to get a taste of what the pairs command is doing.

pairs(life_data[,4:7])

Instead, let’s plot a correlation heatmap.

life_cormat <- round(cor(life_data[,4:22],use = "complete.obs"),2)
melted_cormat <- melt(life_cormat)

## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE

## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE

ggplot(data = melted_cormat, aes(x = X1, y = X2, fill = value)) +
  geom_tile() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1,
                                   size = 8, hjust = 1)) +
  coord_fixed()
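Note that use = "complete.obs" keeps only the rows that have no NA in any of the selected columns, so the correlations are computed on a subset of the data. The sketch below, with the hypothetical helper complete_case_share, shows how to check how many rows survive.

```r
# Hypothetical helper: how many rows have no NA in any column.
complete_case_share <- function(data) {
  n_complete <- sum(complete.cases(data))
  c(rows = nrow(data),
    complete = n_complete,
    share = n_complete / nrow(data))
}

# Possible usage: complete_case_share(life_data[, 4:22])
```

If the share turns out to be low, use = "pairwise.complete.obs" is an alternative that uses all rows available for each pair of variables, at the cost of each correlation being based on a different subset.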

par(mfrow=c(1,1))
library(corrplot)

## corrplot 0.92 loaded

# png(file="corr.png", res=300, width=1200, height=1200)
corrplot(life_cormat, method="number",
         tl.cex = 0.3,
         number.cex = 0.3,
         cl.cex = 0.3)

# dev.off()
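With this many variables the numbers in the plot are hard to read, so the correlations with the response can also be listed directly. This is a minimal sketch in base R; rank_correlations is a hypothetical helper, and it uses pairwise complete observations, so its values may differ slightly from the matrix computed with "complete.obs".

```r
# Hypothetical helper: correlation of each numeric column with the response,
# sorted by absolute value (strongest first).
rank_correlations <- function(data, response = "Life.expectancy") {
  num <- data[vapply(data, is.numeric, logical(1))]
  predictors <- setdiff(names(num), response)
  r <- vapply(predictors, function(p) {
    cor(num[[p]], num[[response]], use = "pairwise.complete.obs")
  }, numeric(1))
  r[order(-abs(r))]
}

# Possible usage: head(rank_correlations(life_data))
```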

We see some strong correlations with life expectancy, for example with Schooling. Let’s plot them to take a closer look.

life_data %>% 
  ggplot(aes(x=Schooling, y=Life.expectancy)) + geom_point()

## Warning: Removed 170 rows containing missing values (geom_point).