Introduction

Although there have been lot of studies undertaken in the past on factors affecting life expectancy considering demographic variables, income composition and mortality rates. It was found that affect of immunization and human development index was not taken into account in the past. As a result, this study tries to perform different regression techniques in order to get some insights on the dataset made available by Deeksha Russell and Duan Wang, who gathered the data from the WHO and United Nations websites.

Importing the required packages and libraries

#install.packages('ggplot2')
#install.packages('corrplot')
#install.packages('dplyr')
#install.packages('caret')
#install.packages('superml')
#install.packages('rpart')
#install.packages('rpart.plot')
#install.packages('Metrics')
#install.packages('randomForest')

library(ggplot2)
library(corrplot)
library(dplyr)
library(caret)
library(superml)
library(rpart)
library(rpart.plot)
library(Metrics)
library(randomForest)

Importing the dataset(CSV format)

data <- read.csv('Life Expectancy Data-regression.csv')

About the Dataset

The dataset includes 2938 observations and 22 variables. It contains information about the following variables:

  1. Country - Names of the countries
  2. Year - Year of observations
  3. Status - whether developed or developing
  4. Life Expectancy - Average time a citizen of any country is expected to live(in years)
  5. Adult Mortality - Probability of dying between 15 and 60 years per 1000 population
  6. Infant deaths - Number of Infant Deaths per 1000 population
  7. Alcohol - Alcohol, recorded per capita (15+) consumption (in litres)
  8. Percentage expenditure - Expenditure on health as a percentage of GDP per capita (%)
  9. Hepatitis B - Immunization coverage among 1-year olds (%)
  10. Measles - Number of reported cases per 1000 population
  11. BMI - Average Body Mass Index of entire population
  12. Under-five deaths - Number of under-five deaths per 1000 population
  13. Polio - Immunization coverage among 1-year olds (%)
  14. Total expenditure - Government expenditure on health industry as a percentage of total government expenditure(%)
  15. Diphtheria - Immunization coverage among 1-year olds (%)
  16. HIV/AIDS - Deaths per 1 000 live births HIV/AIDS (0-4 years)
  17. GDP - Gross Domestic Product per capita (in USD)
  18. Population - Population of the country
  19. Thinness 10-19 years - Prevalence of thinness among children and adolescents for Age 10 to 19 (% )
  20. Thinness 5-9 years - Prevalence of thinness among children for Age 5 to 9(%)
  21. Income composition of resources - Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
  22. Schooling - Number of years of Schooling

The data related to life expectancy, health factors for 193 countries has been collected from the Global Health Observatory (GHO) data repository under the World Health Organization (WHO) and its corresponding economic data was collected from the United Nation website for a period of 16 years(2000-2015).

The dataset is made available on Kaggle- https://www.kaggle.com/kumarajarshi/life-expectancy-who

Basic Exploratory Data Analysis

Viewing some initial observations

head(data)
##       Country Year     Status Life.expectancy Adult.Mortality infant.deaths
## 1 Afghanistan 2015 Developing            65.0             263            62
## 2 Afghanistan 2014 Developing            59.9             271            64
## 3 Afghanistan 2013 Developing            59.9             268            66
## 4 Afghanistan 2012 Developing            59.5             272            69
## 5 Afghanistan 2011 Developing            59.2             275            71
## 6 Afghanistan 2010 Developing            58.8             279            74
##   Alcohol percentage.expenditure Hepatitis.B Measles  BMI under.five.deaths
## 1    0.01              71.279624          65    1154 19.1                83
## 2    0.01              73.523582          62     492 18.6                86
## 3    0.01              73.219243          64     430 18.1                89
## 4    0.01              78.184215          67    2787 17.6                93
## 5    0.01               7.097109          68    3013 17.2                97
## 6    0.01              79.679367          66    1989 16.7               102
##   Polio Total.expenditure Diphtheria HIV.AIDS       GDP Population
## 1     6              8.16         65      0.1 584.25921   33736494
## 2    58              8.18         62      0.1 612.69651     327582
## 3    62              8.13         64      0.1 631.74498   31731688
## 4    67              8.52         67      0.1 669.95900    3696958
## 5    68              7.87         68      0.1  63.53723    2978599
## 6    66              9.20         66      0.1 553.32894    2883167
##   thinness..10.19.years thinness.5.9.years Income.composition.of.resources
## 1                  17.2               17.3                           0.479
## 2                  17.5               17.5                           0.476
## 3                  17.7               17.7                           0.470
## 4                  17.9               18.0                           0.463
## 5                  18.2               18.2                           0.454
## 6                  18.4               18.4                           0.448
##   Schooling
## 1      10.1
## 2      10.0
## 3       9.9
## 4       9.8
## 5       9.5
## 6       9.2

Checking the structure of the data

str(data)
## 'data.frame':    2938 obs. of  22 variables:
##  $ Country                        : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Year                           : int  2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
##  $ Status                         : chr  "Developing" "Developing" "Developing" "Developing" ...
##  $ Life.expectancy                : num  65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
##  $ Adult.Mortality                : int  263 271 268 272 275 279 281 287 295 295 ...
##  $ infant.deaths                  : int  62 64 66 69 71 74 77 80 82 84 ...
##  $ Alcohol                        : num  0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
##  $ percentage.expenditure         : num  71.3 73.5 73.2 78.2 7.1 ...
##  $ Hepatitis.B                    : int  65 62 64 67 68 66 63 64 63 64 ...
##  $ Measles                        : int  1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
##  $ BMI                            : num  19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
##  $ under.five.deaths              : int  83 86 89 93 97 102 106 110 113 116 ...
##  $ Polio                          : int  6 58 62 67 68 66 63 64 63 58 ...
##  $ Total.expenditure              : num  8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
##  $ Diphtheria                     : int  65 62 64 67 68 66 63 64 63 58 ...
##  $ HIV.AIDS                       : num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##  $ GDP                            : num  584.3 612.7 631.7 670 63.5 ...
##  $ Population                     : num  33736494 327582 31731688 3696958 2978599 ...
##  $ thinness..10.19.years          : num  17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
##  $ thinness.5.9.years             : num  17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
##  $ Income.composition.of.resources: num  0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
##  $ Schooling                      : num  10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...

Viewing the data type of each column of the data

sapply(data, FUN=class)
##                         Country                            Year 
##                     "character"                       "integer" 
##                          Status                 Life.expectancy 
##                     "character"                       "numeric" 
##                 Adult.Mortality                   infant.deaths 
##                       "integer"                       "integer" 
##                         Alcohol          percentage.expenditure 
##                       "numeric"                       "numeric" 
##                     Hepatitis.B                         Measles 
##                       "integer"                       "integer" 
##                             BMI               under.five.deaths 
##                       "numeric"                       "integer" 
##                           Polio               Total.expenditure 
##                       "integer"                       "numeric" 
##                      Diphtheria                        HIV.AIDS 
##                       "integer"                       "numeric" 
##                             GDP                      Population 
##                       "numeric"                       "numeric" 
##           thinness..10.19.years              thinness.5.9.years 
##                       "numeric"                       "numeric" 
## Income.composition.of.resources                       Schooling 
##                       "numeric"                       "numeric"

Checking values in status Column with frequency table

statusTable <- table(data$Status)
statusTable <- as.data.frame(statusTable)
head(statusTable)
##         Var1 Freq
## 1  Developed  512
## 2 Developing 2426
prop.table(table(data$Status))
## 
##  Developed Developing 
##  0.1742682  0.8257318

Most of the nations are with the “Developing” status(nearly 82.6%) as per the Data.

Checking the proportion of countries for which life expectancy is more than 70

table(data$Status[data$Life.expectancy>70.0])
## 
##  Developed Developing 
##        511       1109

Plotting the ratio of Developing and Developed countries

ggplot(data = statusTable, aes(x=Var1, y=Freq))+
  geom_bar(stat='identity')

Plotting the correlation graph between variables

nums <- unlist(lapply(data, is.numeric))  
data[ , nums]
cormat <- cor(data[, nums],use='pairwise.complete.obs')
corrplot(cormat,type = 'lower')

Analysis of Correlation:

  1. Adult Mortality is negatively Correlated to Life Expectancy on a significant scale.
  2. Under-5 deaths is positively correlated to infants deaths on a significant scale.
  3. HIV AIDS is negatively correlated to Life Expectancy on a significant scale.
  4. GDP is positively correlated with percentage expenditure on a significant scale.
  5. Thinness is negatively correlated to Life Expectancy on a significant scale.
  6. Schooling is positively correlated to Life Expectancy on a significant scale.

Life Expectancy is the target variable for our modeling purpose. The features are expected to be independent. High Correlation between the features is a sign of multicollinearity. We should treat it.

Data Cleaning

Identifying the total NA Values in each column

sapply(data, function(x) sum(is.na(x)))
##                         Country                            Year 
##                               0                               0 
##                          Status                 Life.expectancy 
##                               0                              10 
##                 Adult.Mortality                   infant.deaths 
##                              10                               0 
##                         Alcohol          percentage.expenditure 
##                             194                               0 
##                     Hepatitis.B                         Measles 
##                             553                               0 
##                             BMI               under.five.deaths 
##                              34                               0 
##                           Polio               Total.expenditure 
##                              19                             226 
##                      Diphtheria                        HIV.AIDS 
##                              19                               0 
##                             GDP                      Population 
##                             448                             652 
##           thinness..10.19.years              thinness.5.9.years 
##                              34                              34 
## Income.composition.of.resources                       Schooling 
##                             167                             163

Imputing NA values:

For column ‘Life Expectancy’
data$Life.expectancy[is.na(data$Life.expectancy)] <- 
  mean(data$Life.expectancy, na.rm=TRUE)

sapply(data, function(x) sum(is.na(x)))
##                         Country                            Year 
##                               0                               0 
##                          Status                 Life.expectancy 
##                               0                               0 
##                 Adult.Mortality                   infant.deaths 
##                              10                               0 
##                         Alcohol          percentage.expenditure 
##                             194                               0 
##                     Hepatitis.B                         Measles 
##                             553                               0 
##                             BMI               under.five.deaths 
##                              34                               0 
##                           Polio               Total.expenditure 
##                              19                             226 
##                      Diphtheria                        HIV.AIDS 
##                              19                               0 
##                             GDP                      Population 
##                             448                             652 
##           thinness..10.19.years              thinness.5.9.years 
##                              34                              34 
## Income.composition.of.resources                       Schooling 
##                             167                             163
For column ‘Adult Mortality’
data$Adult.Mortality[is.na(data$Adult.Mortality)] <- mean(data$Adult.Mortality, na.rm=TRUE)

Now, instead of imputing the mean values of the whole column, we can impute the values of mean of range to make it more precise. The pairs of variables chosen below are selected after trying out various other pairs. The following variable pairs were found to be the most relevant ones for our studies, based on solely the Visualization.

Imputing Alcohol Values in relation to the Schooling Values

For this, lets first create a scatterplot and analyze.

ggplot(data=data, aes(x=Alcohol, y=Schooling))+
  geom_point()

impute_alcohol <- function(values){
  alcoholValue <- as.numeric(values[1])
  schoolingValue <- as.numeric(values[2])
  if(is.na(alcoholValue) == TRUE && is.na(schoolingValue) == FALSE){
    if(schoolingValue <= 2.5){
      return(4.0)
    } else if(2.5<schoolingValue && schoolingValue<=5.0){
      return(2.0)
    } else if(5.0<schoolingValue && schoolingValue<=7.5){
      return(2.5)
    } else if(7.5<schoolingValue && schoolingValue<=10.0){
      return(3.0)
    } else if(10.0<schoolingValue && schoolingValue<=15.0){
      return(4.0)
    } else if(schoolingValue>15.0){
      return(10.0)
    }
  }
  else{
    return(alcoholValue)
  }
}
school_alcohol <- data[c('Alcohol','Schooling')]
data$Alcohol <- apply(school_alcohol, FUN=impute_alcohol, MARGIN=1)
sapply(data, function(x) sum(is.na(x)))
##                         Country                            Year 
##                               0                               0 
##                          Status                 Life.expectancy 
##                               0                               0 
##                 Adult.Mortality                   infant.deaths 
##                               0                               0 
##                         Alcohol          percentage.expenditure 
##                               9                               0 
##                     Hepatitis.B                         Measles 
##                             553                               0 
##                             BMI               under.five.deaths 
##                              34                               0 
##                           Polio               Total.expenditure 
##                              19                             226 
##                      Diphtheria                        HIV.AIDS 
##                              19                               0 
##                             GDP                      Population 
##                             448                             652 
##           thinness..10.19.years              thinness.5.9.years 
##                              34                              34 
## Income.composition.of.resources                       Schooling 
##                             167                             163

As there are still 9 NA’s in the Alcohol variable, lets further Impute values with Mean of the whole column.

data$Alcohol[is.na(data$Alcohol)] <- mean(data$Alcohol, na.rm=TRUE)
Imputing Polio in regards to the Life Expectancy
ggplot(data,aes(x=Polio, y=Life.expectancy))+
  geom_point()

impute_polio <- function(values){
  polio=as.numeric(values[1])
  le=as.numeric(values[2])
  if(is.na(polio) == TRUE ){
    if(le <= 45.0){
      return(72.0)
    } else if(45.0<le && le<=50.0){
      return(57.0)
    } else if(50.0<le && le<=60.0){
      return(70.0)
    } else if(60.0<le && le<= 70.0){
      return(80.0)
    } else if(70.0<le && le<=80.0){
      return(87.0)
    } else if(le > 80.0){
      return(95.0)
    }
  }
  else{
    return(polio)
  }
}

polio_lifeExpect <- data[c('Polio','Life.expectancy')]
data$Polio<- apply(polio_lifeExpect,FUN=impute_polio, MARGIN=1)
Imputing diphtheria with regards to Polio
ggplot(data, aes(x=Polio, y=Diphtheria))+
  geom_point(na.rm=TRUE)

impute_diphtheria <- function(values){
  d <- as.numeric(values[1])
  p <- as.numeric(values[2])
  if(is.na(d) == TRUE){
    if( p<=10){
      return(75.0)
    }
  else if( 10<p && p<=40){
    return(27.0)
  }
  else if (40<p && p<=45){
    return(33.0)
  }
  else if (45<p && p<=50){
    return(37.0)
  }
  else if( 50<p && p<=60){
    return(39.0)
  }
  else if( 60<p && p<=80){
    return(60.0)
  }
  else if (p>80){
    return(80.0)
    }
  }
  else{
    return(d)
  }
}

dip_polio <- data[c('Diphtheria','Polio')]
data$Diphtheria <- apply(dip_polio, FUN=impute_diphtheria, MARGIN=1)
Imputing Hepatitis with diphtheria
ggplot(data, aes(x=Hepatitis.B, y=Diphtheria))+
  geom_point(na.rm=TRUE)

impute_hep <- function(values){
  hep<- as.numeric(values[1])
  dip <- as.numeric(values[2])
  if(is.na(hep) ==TRUE){
    if (dip<=15){
    return (65.0)}
  else if (15<dip && dip<=30){
    return (28.0)}
  else if (30<dip && dip<=45){
    return (38.0)}
  else if (45<dip && dip<=60){
    return(50.0)}
  else if (60<dip && dip<=80){
    return (65.0)
  }
  else if(dip>80){
    return (88.4)
    }
  }
  else{
    return (hep)
  }
}

hep_dip <- data[c('Hepatitis.B','Diphtheria')]
data$Hepatitis.B <- apply(hep_dip, FUN=impute_hep, MARGIN=1)
Imputing BMI with life expectancy
ggplot(data, aes(x=Life.expectancy, y=BMI))+
  geom_point()

impute_bmi <- function(values){
  l <- as.numeric(values[1])
  b <- as.numeric(values[2])
  if (is.na(b)){
    if (l<=50){
      return(18.0)
    }
    else if (50<l && l<=60){
      return (23.0)
    }
    else if (60<l && l<=70){
      return (35.0)
    }
    else if (70<l && l<=80){
      return( 46.8)
    }
    else if (80<l && l<=100){
      return (55.0)
    }
  }
  else{
    return(b)
    }
}

life_bmi <- data[c('Life.expectancy','BMI')]
data$BMI <- apply(life_bmi, FUN=impute_bmi, MARGIN=1)
Imputing Total Expenditure using Alcohol
ggplot(data, aes(x= Total.expenditure, y=Alcohol))+
  geom_point(na.rm=TRUE)

impute_totalexp <- function(values){
  t<- as.numeric(values[1])
  a<- as.numeric(values[2])
  if( a<=2.5){
    return (5.0)}
  else if (2.5<a && a<=5.0){
    return (6.0)
  }
  else if (5.0<a && a<=10.0){
    return (6.25)
  }
  else if (10.0<a && a<=12.5){
    return (7.0)
  }
  else if (a>12.5){
    return (7.5)
  }
  else{
    return (t)
  }
}

alcohol_exp <- data[c('Total.expenditure','Alcohol')]
data$Total.expenditure <- apply(alcohol_exp, FUN=impute_totalexp, MARGIN=1)
Imputing GDP with percentage expenditure
ggplot(data,aes(x=percentage.expenditure, y=GDP))+
  geom_point(na.rm=TRUE)

impute_gdp<- function(values){
  p <- values[1]
  g <-values[2]
  if (is.na(g)){
    if (p<=1250){
    return (1100.0)}
  else if (1250<p && p<=2500){
    return (1800.0)}
  else if (2500<p && p<=3750){
    return (2900.0)}
  else if (3750<p && p<=7500){
    return (3500.0)}
  else if (7500<p && p<=8750){
    return (4500.0)}
  else if (8750<p && p<=10000){
    return (5000.0)}
  else if (10000<p && p<=11250){
    return (5700.0)}
  else if (11250<p && p<=12500){
    return (7000.0)}
  else if (12500<p && p<=15000){
    return (8000.0)}
  else if (15000<p && p<=17500){
    return (9000.0)}
  else if (p>17500){
    return (8500.0)}}
  else{
    return (g)}
}

percent_gdp <- data[c('percentage.expenditure','GDP')]
data$GDP <- apply(percent_gdp, FUN= impute_gdp, MARGIN=1)
Imputing Population with infant deaths
ggplot(data, aes(x=infant.deaths, y=Population))+
  geom_point(na.rm=TRUE)

impute_pop <- function(values){
  i<- as.numeric(values[1])
  p<- as.numeric(values[2])
  if (is.na(p)){
    if (i<=100){
    return( 0.19*((10)**9))}
  else if (100<i && i<=250){
    return( 0.18*((10)**9))}
  else if (250<i && i<=350){
    return( 0.02*((10)**9))}
  else if (350<i && i<=900){
    return( 0.1*((10)**9))}
  else if (900<i && i<=1100){
    return( 0.18*((10)**9))}
  else if (1100<i && i<=1250){
    return( 0.05*((10)**9))}
  else if (1250<i && i<=1500){
    return( 0.19*((10)**9))}
  else if (1500<i && i<=1750){
    return (0.05*((10)**9))}
  else if (i>1750){
    return (0.1*((10)**9))}}
  else{
    return (p)}
}
infant_pop <- data[c('infant.deaths','Population')]
data$Population <- apply(infant_pop, FUN=impute_pop, MARGIN=1)
Imputing Thinness10-19 with BMI
ggplot(data, aes(x=BMI, y=thinness..10.19.years))+
  geom_point(na.rm=TRUE)

impute_thinness <- function(values){
  b<- as.numeric(values[1])
  t<- as.numeric(values[2])
  if (is.na(t)){
    if (b<=10){
    return (5.0)}
  else if (10<b && b<=20){
    return (10.0)}
  else if (20<b && b<=30){
    return (8.0)}
  else if (30<b && b<=40){
    return (6.0)}
  else if (40<b && b<=50){
    return (3.0)}
  else if (50<b && b<=70){
    return (4.0)}
  else if (b>70){
    return (1.0)}}
  else{
    return (t)}
}

bmi_thin <- data[c('BMI','thinness..10.19.years')]
data$thinness..10.19.years <- apply(bmi_thin, FUN=impute_thinness, MARGIN=1)
Imputing thinness 5-9 with BMI
ggplot(data, aes(x=BMI, y=thinness.5.9.years))+
  geom_point(na.rm=TRUE)

impute_thin1 <- function(values){
  b <- as.numeric(values[1])
  t <- as.numeric(values[2])
  if(is.na(t)){
    if (b<=10){
      return( 5.0)}
    else if (10<b && b<=20){
      return (10.0)}
    else if (20<b && b<=30){
      return (8.0)}
    else if (30<b && b<=40){
      return (6.0)}
    else if (40<b && b<=50){
      return (3.0)}
    else if (50<b && b<=70){
      return (4.0)}
    else if (b>70){
      return (1.0)}
  }
  else{
    return(t)
  }
}

bmi_thin1 <- data[c('BMI','thinness.5.9.years')]
data$thinness.5.9.years <- apply(bmi_thin1, FUN = impute_thin1, MARGIN = 1)
Imputing Income Composition of Resources with life expectancy
ggplot(data, aes(x=Life.expectancy, y=Income.composition.of.resources))+
  geom_point(na.rm=TRUE)

impute_income <- function(values){
  l <- as.numeric( values[1])
  i <- as.numeric(values[2])
  if(is.na(i) == TRUE){
    if (l<=40){
      return (0.4)}
    else if (40<l && l<=50){
      return (0.42)}
    else if (50<l && l<=60){
      return (0.402)}
    else if (60<l && l<=70){
      return (0.54)}
    else if (70<l && l<=80){
      return (0.71)}
    else if (l>80){
      return (0.88)}
  }
  else{
    return(i)
  }
}

life_income <- data[c('Life.expectancy','Income.composition.of.resources')]
data$Income.composition.of.resources <- apply(life_income, FUN = impute_income, MARGIN = 1)
Imputing Schooling with life expectancy
ggplot(data, aes(x=Life.expectancy, y=Schooling))+
  geom_point(na.rm=TRUE)

impute_schooling <- function(values){
  l <- as.numeric(values[1])
  s <- as.numeric(values[2])
  if(is.na(s) == TRUE){
    if (l<= 40){
      return (8.0)}
    else if (40<l && l<=44 ){
      return (7.5)}
    else if (44<l && l<50 ){
      return (8.1)}
    else if (50<l && l<=60 ){
      return (8.2)}
    else if (60<l && l<=70 ){
      return (10.5)}
    else if (70<l && l<=80 ){
      return (13.4)}
    else if (l>80 ){
      return (16.5)}
  }
  else{
    return(s)
  }
}

life_school <- data[c('Life.expectancy', 'Schooling')]
data$Schooling <- apply(life_school, FUN=impute_schooling, MARGIN=1)
Checking if there are any missing values after imputing
sapply(data, FUN=function(x)(sum(is.na(x))))
##                         Country                            Year 
##                               0                               0 
##                          Status                 Life.expectancy 
##                               0                               0 
##                 Adult.Mortality                   infant.deaths 
##                               0                               0 
##                         Alcohol          percentage.expenditure 
##                               0                               0 
##                     Hepatitis.B                         Measles 
##                               0                               0 
##                             BMI               under.five.deaths 
##                               0                               0 
##                           Polio               Total.expenditure 
##                               0                               0 
##                      Diphtheria                        HIV.AIDS 
##                               0                               0 
##                             GDP                      Population 
##                               0                               0 
##           thinness..10.19.years              thinness.5.9.years 
##                               0                               0 
## Income.composition.of.resources                       Schooling 
##                               0                               0

Plotting different histograms to show the distribution of the variables along with any skewness they might have.

par(mfrow=c(3,3))
par(mar=c(3,2,1,1))
hist(data$infant.deaths)
hist(data$Adult.Mortality)
hist(data$Alcohol)
hist(data$Hepatitis.B)
hist(data$Measles)
hist(data$Polio)
hist(data$Diphtheria)
hist(data$HIV.AIDS)
hist(data$BMI)

Clearly, all the variables are skewed either way. However, as we are going to use those models which do not require normality assumption, we can just move ahead with our analysis.

Data Pre-processing:

One Hot Encoding - allows the representation of categorical data to be more expressive. Many machine learning algorithms cannot work with categorical data directly. The categories must be converted into numbers.

Label Encoding - refers to converting the labels into numeric form so as to convert it into the machine-readable form. Machine learning algorithms can then decide in a better way on how those labels must be operated.

Initializing label encoder for “Country” variable

encoder <- LabelEncoder$new()

Transforming the Country Variable

data$Country <- encoder$fit_transform(data$Country)

Excluding “Country” column and making “Status” a one Hot Encoded Column

dummy_var <- dummyVars(formula = '~.', data = data[-1])
oneHotData <- as.data.frame(predict(dummy_var, newdata=data[-1]))
colnames(oneHotData)
##  [1] "Year"                            "StatusDeveloped"                
##  [3] "StatusDeveloping"                "Life.expectancy"                
##  [5] "Adult.Mortality"                 "infant.deaths"                  
##  [7] "Alcohol"                         "percentage.expenditure"         
##  [9] "Hepatitis.B"                     "Measles"                        
## [11] "BMI"                             "under.five.deaths"              
## [13] "Polio"                           "Total.expenditure"              
## [15] "Diphtheria"                      "HIV.AIDS"                       
## [17] "GDP"                             "Population"                     
## [19] "thinness..10.19.years"           "thinness.5.9.years"             
## [21] "Income.composition.of.resources" "Schooling"

Appending Country again in the one hot encoded dataset

oneHotData$Country <- data$Country
colnames(oneHotData)
##  [1] "Year"                            "StatusDeveloped"                
##  [3] "StatusDeveloping"                "Life.expectancy"                
##  [5] "Adult.Mortality"                 "infant.deaths"                  
##  [7] "Alcohol"                         "percentage.expenditure"         
##  [9] "Hepatitis.B"                     "Measles"                        
## [11] "BMI"                             "under.five.deaths"              
## [13] "Polio"                           "Total.expenditure"              
## [15] "Diphtheria"                      "HIV.AIDS"                       
## [17] "GDP"                             "Population"                     
## [19] "thinness..10.19.years"           "thinness.5.9.years"             
## [21] "Income.composition.of.resources" "Schooling"                      
## [23] "Country"

Lets now split the data into Train & Test datasets

Creating sample size of 80% :
set.seed(123) # For reproducibility of the same results
sample_size <- floor(0.8*nrow(oneHotData))

Creating train index by unbiased random sampling of data

train_index <- sample(seq_len(nrow(oneHotData)), size=sample_size)

Creating train dataset

train<- oneHotData[train_index,]

Creating Test dataset

x_test<- oneHotData[-train_index,]

Creating different models for the dataset and checking which would be the best choice.

We are rating our models on the basis of Root Mean Square Error (RMSE) score. We believe that depending on a specific problem, outliers can be significant in the overall analysis. So, that is why we chose RMSE over MAE.

The thumb-rule is that Lesser the score, better the model is.

Creating Linear Regression Model

train_lm <- lm(Life.expectancy~., data=train)
summary(train_lm)
## 
## Call:
## lm(formula = Life.expectancy ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.8327  -2.2263  -0.1125   2.2033  18.0778 
## 
## Coefficients: (1 not defined because of singularities)
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      8.959e+01  3.760e+01   2.383  0.01725 *  
## Year                            -1.737e-02  1.882e-02  -0.923  0.35605    
## StatusDeveloped                  1.515e+00  2.908e-01   5.210 2.06e-07 ***
## StatusDeveloping                        NA         NA      NA       NA    
## Adult.Mortality                 -1.708e-02  8.471e-04 -20.163  < 2e-16 ***
## infant.deaths                    9.975e-02  9.312e-03  10.712  < 2e-16 ***
## Alcohol                          8.545e-02  6.522e-02   1.310  0.19029    
## percentage.expenditure           1.527e-04  1.005e-04   1.519  0.12898    
## Hepatitis.B                     -5.719e-04  4.421e-03  -0.129  0.89708    
## Measles                         -1.174e-05  8.573e-06  -1.369  0.17105    
## BMI                              3.342e-02  5.249e-03   6.368 2.30e-10 ***
## under.five.deaths               -7.524e-02  6.876e-03 -10.943  < 2e-16 ***
## Polio                            2.593e-02  4.844e-03   5.353 9.52e-08 ***
## Total.expenditure               -3.153e-01  3.258e-01  -0.968  0.33315    
## Diphtheria                       2.678e-02  5.264e-03   5.087 3.94e-07 ***
## HIV.AIDS                        -4.761e-01  1.806e-02 -26.363  < 2e-16 ***
## GDP                              2.552e-05  1.526e-05   1.673  0.09455 .  
## Population                       2.549e-09  9.633e-10   2.646  0.00819 ** 
## thinness..10.19.years           -5.538e-02  5.428e-02  -1.020  0.30770    
## thinness.5.9.years              -2.163e-02  5.371e-02  -0.403  0.68726    
## Income.composition.of.resources  7.036e+00  6.937e-01  10.143  < 2e-16 ***
## Schooling                        7.683e-01  4.634e-02  16.581  < 2e-16 ***
## Country                          3.708e-03  1.447e-03   2.562  0.01048 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.858 on 2328 degrees of freedom
## Multiple R-squared:  0.8372, Adjusted R-squared:  0.8357 
## F-statistic:   570 on 21 and 2328 DF,  p-value: < 2.2e-16
Plotting the inferences of Linear regression model
par(mfrow=c(2,2))
plot(train_lm)

Predicting Classes using the model
predictions <- predict(train_lm, newdata=x_test)
RMSE score of Linear Regression Model
rmse.lm <- rmse(predictions,x_test$Life.expectancy)
rmse.lm
## [1] 3.916231

Creating Decision Tree model

model <- rpart(Life.expectancy~., data=train,method='anova')
Plotting the decision Tree
par(mfrow=c(1,1))
rpart.plot(model)

Viewing the inferences of model with the node split details
summary(model)
## Call:
## rpart(formula = Life.expectancy ~ ., data = train, method = "anova")
##   n= 2350 
## 
##           CP nsplit rel error    xerror        xstd
## 1 0.57851430      0 1.0000000 1.0007399 0.027577056
## 2 0.12226618      1 0.4214857 0.4392424 0.013529014
## 3 0.04595534      2 0.2992195 0.2946733 0.010446933
## 4 0.03984852      3 0.2532642 0.2634445 0.010082613
## 5 0.02788668      4 0.2134157 0.2333384 0.008571159
## 6 0.01952901      5 0.1855290 0.1989316 0.007572385
## 7 0.01756969      6 0.1660000 0.1810374 0.007029801
## 8 0.01000000      7 0.1484303 0.1602394 0.006030408
## 
## Variable importance
##                        HIV.AIDS Income.composition.of.resources 
##                              26                              19 
##                 Adult.Mortality                       Schooling 
##                              18                              14 
##           thinness..10.19.years              thinness.5.9.years 
##                               6                               6 
##                             BMI                         Alcohol 
##                               3                               2 
##               Total.expenditure          percentage.expenditure 
##                               2                               2 
##                             GDP               under.five.deaths 
##                               2                               1 
## 
## Node number 1: 2350 observations,    complexity param=0.5785143
##   mean=69.27591, MSE=90.56297 
##   left son=2 (619 obs) right son=3 (1731 obs)
##   Primary splits:
##       HIV.AIDS                        < 0.65     to the right, improve=0.5785143, (0 missing)
##       Income.composition.of.resources < 0.5875   to the left,  improve=0.5717741, (0 missing)
##       Schooling                       < 11.45    to the left,  improve=0.5139798, (0 missing)
##       Adult.Mortality                 < 237.5    to the right, improve=0.4752723, (0 missing)
##       BMI                             < 39.85    to the left,  improve=0.3418446, (0 missing)
##   Surrogate splits:
##       Adult.Mortality                 < 254.5    to the right, agree=0.880, adj=0.544, (0 split)
##       Income.composition.of.resources < 0.5345   to the left,  agree=0.844, adj=0.407, (0 split)
##       Schooling                       < 10.05    to the left,  agree=0.825, adj=0.336, (0 split)
##       thinness..10.19.years           < 6.45     to the right, agree=0.793, adj=0.213, (0 split)
##       thinness.5.9.years              < 6.45     to the right, agree=0.788, adj=0.194, (0 split)
## 
## Node number 2: 619 observations,    complexity param=0.03984852
##   mean=57.17173, MSE=48.31389 
##   left son=4 (468 obs) right son=5 (151 obs)
##   Primary splits:
##       Income.composition.of.resources < 0.5285   to the left,  improve=0.2835751, (0 missing)
##       BMI                             < 33.25    to the left,  improve=0.2675741, (0 missing)
##       Adult.Mortality                 < 331.5    to the right, improve=0.2631155, (0 missing)
##       under.five.deaths               < 1.5      to the right, improve=0.2283233, (0 missing)
##       infant.deaths                   < 1.5      to the right, improve=0.2282347, (0 missing)
##   Surrogate splits:
##       Schooling              < 11.05    to the left,  agree=0.914, adj=0.649, (0 split)
##       percentage.expenditure < 153.9527 to the left,  agree=0.866, adj=0.450, (0 split)
##       GDP                    < 1641.038 to the left,  agree=0.863, adj=0.437, (0 split)
##       BMI                    < 29.75    to the left,  agree=0.843, adj=0.358, (0 split)
##       under.five.deaths      < 4.5      to the right, agree=0.842, adj=0.351, (0 split)
## 
## Node number 3: 1731 observations,    complexity param=0.1222662
##   mean=73.60433, MSE=34.54393 
##   left son=6 (706 obs) right son=7 (1025 obs)
##   Primary splits:
##       Income.composition.of.resources < 0.7005   to the left,  improve=0.4351671, (0 missing)
##       Adult.Mortality                 < 161.5    to the right, improve=0.3814791, (0 missing)
##       Schooling                       < 11.45    to the left,  improve=0.3768480, (0 missing)
##       StatusDeveloping                < 0.5      to the right, improve=0.2941991, (0 missing)
##       StatusDeveloped                 < 0.5      to the left,  improve=0.2941991, (0 missing)
##   Surrogate splits:
##       Schooling         < 12.45    to the left,  agree=0.879, adj=0.703, (0 split)
##       Adult.Mortality   < 143.5    to the right, agree=0.760, adj=0.411, (0 split)
##       BMI               < 51.05    to the left,  agree=0.757, adj=0.404, (0 split)
##       Alcohol           < 5.105    to the left,  agree=0.750, adj=0.387, (0 split)
##       Total.expenditure < 6.125    to the left,  agree=0.748, adj=0.381, (0 split)
## 
## Node number 4: 468 observations,    complexity param=0.01756969
##   mean=55.06923, MSE=30.7844 
##   left son=8 (122 obs) right son=9 (346 obs)
##   Primary splits:
##       Adult.Mortality   < 379.5    to the right, improve=0.25954110, (0 missing)
##       HIV.AIDS          < 9.35     to the right, improve=0.24252910, (0 missing)
##       under.five.deaths < 39.5     to the right, improve=0.12430700, (0 missing)
##       infant.deaths     < 22.5     to the right, improve=0.12077190, (0 missing)
##       Year              < 2006.5   to the left,  improve=0.09574549, (0 missing)
##   Surrogate splits:
##       HIV.AIDS               < 6.55     to the right, agree=0.827, adj=0.336, (0 split)
##       Country                < 187      to the right, agree=0.754, adj=0.057, (0 split)
##       percentage.expenditure < 392.5857 to the right, agree=0.746, adj=0.025, (0 split)
##       Alcohol                < 10.32    to the right, agree=0.744, adj=0.016, (0 split)
##       Population             < 7115     to the left,  agree=0.741, adj=0.008, (0 split)
## 
## Node number 5: 151 observations,    complexity param=0.01952901
##   mean=63.68808, MSE=46.48026 
##   left son=10 (76 obs) right son=11 (75 obs)
##   Primary splits:
##       HIV.AIDS                        < 2.85     to the right, improve=0.5921793, (0 missing)
##       Adult.Mortality                 < 314.5    to the right, improve=0.5497381, (0 missing)
##       Income.composition.of.resources < 0.6475   to the left,  improve=0.3197414, (0 missing)
##       Measles                         < 0.5      to the right, improve=0.2738616, (0 missing)
##       under.five.deaths               < 1.5      to the right, improve=0.2640346, (0 missing)
##   Surrogate splits:
##       Adult.Mortality   < 267.5    to the right, agree=0.861, adj=0.720, (0 split)
##       Polio             < 83.5     to the left,  agree=0.781, adj=0.560, (0 split)
##       Diphtheria        < 83.5     to the left,  agree=0.768, adj=0.533, (0 split)
##       under.five.deaths < 2.5      to the right, agree=0.748, adj=0.493, (0 split)
##       BMI               < 34.8     to the left,  agree=0.742, adj=0.480, (0 split)
## 
## Node number 6: 706 observations,    complexity param=0.02788668
##   mean=68.93265, MSE=21.59911 
##   left son=12 (185 obs) right son=13 (521 obs)
##   Primary splits:
##       Adult.Mortality                 < 211.5    to the right, improve=0.3892017, (0 missing)
##       Schooling                       < 10.35    to the left,  improve=0.3174602, (0 missing)
##       Income.composition.of.resources < 0.5565   to the left,  improve=0.2950295, (0 missing)
##       Polio                           < 84.5     to the left,  improve=0.1690747, (0 missing)
##       Diphtheria                      < 86.5     to the left,  improve=0.1648257, (0 missing)
##   Surrogate splits:
##       Income.composition.of.resources < 0.4855   to the left,  agree=0.775, adj=0.141, (0 split)
##       Schooling                       < 9.05     to the left,  agree=0.758, adj=0.076, (0 split)
##       Country                         < 0.5      to the left,  agree=0.754, adj=0.059, (0 split)
##       HIV.AIDS                        < 0.55     to the right, agree=0.745, adj=0.027, (0 split)
##       infant.deaths                   < 1550     to the right, agree=0.742, adj=0.016, (0 split)
## 
## Node number 7: 1025 observations,    complexity param=0.04595534
##   mean=76.8221, MSE=18.07368 
##   left son=14 (708 obs) right son=15 (317 obs)
##   Primary splits:
##       Income.composition.of.resources < 0.8365   to the left,  improve=0.5279394, (0 missing)
##       thinness.5.9.years              < 2.05     to the right, improve=0.4275270, (0 missing)
##       thinness..10.19.years           < 1.85     to the right, improve=0.4215412, (0 missing)
##       Adult.Mortality                 < 105      to the right, improve=0.3752846, (0 missing)
##       Schooling                       < 15.65    to the left,  improve=0.2613797, (0 missing)
##   Surrogate splits:
##       Schooling              < 15.65    to the left,  agree=0.853, adj=0.524, (0 split)
##       thinness.5.9.years     < 1.35     to the right, agree=0.826, adj=0.438, (0 split)
##       thinness..10.19.years  < 1.25     to the right, agree=0.819, adj=0.413, (0 split)
##       GDP                    < 22219.63 to the left,  agree=0.815, adj=0.401, (0 split)
##       percentage.expenditure < 3032.965 to the left,  agree=0.811, adj=0.388, (0 split)
## 
## Node number 8: 122 observations
##   mean=50.30902, MSE=17.68033 
## 
## Node number 9: 346 observations
##   mean=56.74769, MSE=24.59787 
## 
## Node number 10: 76 observations
##   mean=58.47632, MSE=20.34233 
## 
## Node number 11: 75 observations
##   mean=68.96933, MSE=17.55039 
## 
## Node number 12: 185 observations
##   mean=64.06703, MSE=12.44091 
## 
## Node number 13: 521 observations
##   mean=70.66036, MSE=13.45965 
## 
## Node number 14: 708 observations
##   mean=74.75516, MSE=8.259095 
## 
## Node number 15: 317 observations
##   mean=81.43849, MSE=9.141106
Printing the Standard Error
printcp(model)
## 
## Regression tree:
## rpart(formula = Life.expectancy ~ ., data = train, method = "anova")
## 
## Variables actually used in tree construction:
## [1] Adult.Mortality                 HIV.AIDS                       
## [3] Income.composition.of.resources
## 
## Root node error: 212823/2350 = 90.563
## 
## n= 2350 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.578514      0   1.00000 1.00074 0.0275771
## 2 0.122266      1   0.42149 0.43924 0.0135290
## 3 0.045955      2   0.29922 0.29467 0.0104469
## 4 0.039849      3   0.25326 0.26344 0.0100826
## 5 0.027887      4   0.21342 0.23334 0.0085712
## 6 0.019529      5   0.18553 0.19893 0.0075724
## 7 0.017570      6   0.16600 0.18104 0.0070298
## 8 0.010000      7   0.14843 0.16024 0.0060304
Plotting the standard error graph at per node split
plotcp(model)

Making Predictions using Decision Tree model
predictions.tree <- predict(model, newdata = x_test)
Checking the rmse score of Decision Tree
rmse.tree <- rmse(predictions.tree,x_test$Life.expectancy)
rmse.tree
## [1] 3.743962

Creating Random Forest Model

forest <- randomForest(Life.expectancy~., data=train, ntree=100, mtry=2, 
                       importance=TRUE)
summary(forest)
##                 Length Class  Mode     
## call               6   -none- call     
## type               1   -none- character
## predicted       2350   -none- numeric  
## mse              100   -none- numeric  
## rsq              100   -none- numeric  
## oob.times       2350   -none- numeric  
## importance        44   -none- numeric  
## importanceSD      22   -none- numeric  
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            11   -none- list     
## coefs              0   -none- NULL     
## y               2350   -none- numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL     
## terms              3   terms  call
plot(forest)

Feature Importance according to random forest
importance(forest)
##                                   %IncMSE IncNodePurity
## Year                             8.958095      1936.080
## StatusDeveloped                  4.767448      3937.985
## StatusDeveloping                 5.081020      6882.608
## Adult.Mortality                 13.052102     23821.922
## infant.deaths                   12.007491      7074.949
## Alcohol                         10.904265      4705.035
## percentage.expenditure           8.212917      4035.761
## Hepatitis.B                      8.450515      5645.523
## Measles                          8.792795      2004.794
## BMI                              9.218450     12221.867
## under.five.deaths               10.488890      7264.541
## Polio                            8.262369     12967.633
## Total.expenditure                6.646072      1604.009
## Diphtheria                       7.638524      8181.647
## HIV.AIDS                        12.024780     31763.257
## GDP                              8.616361      6277.536
## Population                      11.009760      1788.260
## thinness..10.19.years           10.010494     12097.180
## thinness.5.9.years              11.408010     12759.559
## Income.composition.of.resources  9.652734     21983.642
## Schooling                       10.958745     17886.536
## Country                         16.012205      2458.688
Plotting Variable Importance
varImpPlot(forest)

Predictions
predictions.forest <- predict(forest,newdata=x_test)
Evaluating RMSE Score
rmse.forest <- rmse(predictions.forest, x_test$Life.expectancy)
rmse.forest
## [1] 1.950322

XGBoost

ctrl_cv3 <- trainControl(method = "cv", number = 3)

parameters_xgb <- expand.grid(nrounds = seq(20, 80, 10),
                              max_depth = c(8),
                              eta = c(0.25), 
                              gamma = 1,
                              colsample_bytree = c(0.2),
                              min_child_weight = c(150),
                              subsample = 0.8)

set.seed(123456789)

# smoted_training_set$Bankrupt <- as.factor(smoted_training_set$Bankrupt)

# XGBoost model on training data

x_train_xgb <- train(Life.expectancy ~ .,
                      data = train,
                      method = "xgbTree",
                      trControl = ctrl_cv3,
                      tuneGrid  = parameters_xgb)
x_train_xgb
## eXtreme Gradient Boosting 
## 
## 2350 samples
##   22 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 1567, 1565, 1568 
## Resampling results across tuning parameters:
## 
##   nrounds  RMSE      Rsquared   MAE     
##   20       3.576405  0.8616218  2.630030
##   30       3.268725  0.8827739  2.383526
##   40       3.143173  0.8915580  2.291453
##   50       3.045902  0.8979337  2.218238
##   60       2.972017  0.9026987  2.163592
##   70       2.912872  0.9064610  2.110570
##   80       2.867746  0.9092864  2.072611
## 
## Tuning parameter 'max_depth' was held constant at a value of 8
## Tuning
##  parameter 'min_child_weight' was held constant at a value of 150
## 
## Tuning parameter 'subsample' was held constant at a value of 0.8
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 80, max_depth = 8, eta
##  = 0.25, gamma = 1, colsample_bytree = 0.2, min_child_weight = 150
##  and subsample = 0.8.
# plotting the model result
#  as it is seen, we have least RMSE value after 80 iteration 
plot(x_train_xgb)

#  XGBoost model on test data 

x_test_xgb <- train(Life.expectancy ~ .,
                    data = x_test,
                    method = "xgbTree",
                    trControl = ctrl_cv3,
                    tuneGrid  = parameters_xgb)
x_test_xgb
## eXtreme Gradient Boosting 
## 
## 588 samples
##  22 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 393, 392, 391 
## Resampling results across tuning parameters:
## 
##   nrounds  RMSE      Rsquared   MAE     
##   20       5.503532  0.6643759  4.274236
##   30       5.443069  0.6709637  4.224592
##   40       5.421155  0.6732309  4.208020
##   50       5.396903  0.6763660  4.203635
##   60       5.386832  0.6774605  4.192104
##   70       5.360919  0.6805100  4.167311
##   80       5.335165  0.6834915  4.149924
## 
## Tuning parameter 'max_depth' was held constant at a value of 8
## Tuning
##  parameter 'min_child_weight' was held constant at a value of 150
## 
## Tuning parameter 'subsample' was held constant at a value of 0.8
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 80, max_depth = 8, eta
##  = 0.25, gamma = 1, colsample_bytree = 0.2, min_child_weight = 150
##  and subsample = 0.8.
# plotting the result on test data 
# the RMSE metrics is 
plot(x_test_xgb)

prediction.xgb <-predict(x_test_xgb,data = x_test)
rmse.xgb<- rmse(prediction.xgb,x_test$Life.expectancy)

Conclusion:

  1. Random Forest produced least error implying the greatest accuracy among the three models used.
  2. Income Composition, HIV.Aids, Adult.Mortality are important variables.
  3. From Analysis, it is clear that most people’s Life Expectancy lies between 40-65 years
  4. An Average person lives upto 69 years of age.
  5. Almost for all the Developed countries, average life expectancy is more than 70 years, whereas it is less than 70 years for more than half of the developing nations.