Synopsis

What is the importance of knowing Life Expectancy?

Life expectancy, or the average number of years a person is expected to live, can be used as an overall indicator of community health. Low life expectancies can result from high infant mortality rates, high rates of drug overdose or suicide, barriers to high quality healthcare, and other factors. There are many uses for it in the financial world, including life insurance, pension planning, and U.S. Social Security benefits.

About our Dataset

The dataset related to life expectancy, health factors for 193 countries has been collected from WHO data repository website and its corresponding economic data was collected from United Nation website. Among all categories of health-related factors only those critical factors were chosen which are more representative.

In this project we have considered data from year 2000-2015 for 193 countries for further analysis. The individual data files have been merged together into a single dataset. On initial visual inspection of the data showed some missing values. As the datasets were from WHO, we found no evident errors. Missing data was handled in R software by using Missmap command. The result indicated that most of the missing data was for population, Hepatitis B and GDP.

The goal of this project is to build a Linear Regression Model to predict the likelihood of Life Expectancy in different countries of the world.

The link to original dataset can be found here.

image

Packages required

  • library(ggplot2)
  • library(GGally)
  • library(ggcorrplot)
  • library(plotly)
  • library(tidyverse)
  • library(cowplot)
  • library(psych)
  • library(lattice)
  • library(xtable)
  • library(plyr); library(dplyr)
  • library(gridExtra)
  • library(WVPlots)

The above packages were installed and used in this project.

Data Dictionary

There are 22 columns in our dataset. These columns’s label are listed below.

Country - Country
Year - data is collected from 2000 - 2015 years
Status - Developed or Developing status
Life expectancy - Life Expectancy in age
Adult Mortality - Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
infant deaths - Number of Infant Deaths per 1000 population
Alcohol - Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
percentage expenditure - Expenditure on health as a percentage of Gross Domestic Product per capita(%)
Hepatitis B - Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
Measles - number of reported cases per 1000 population
BMI - Average Body Mass Index of entire population
under-five deaths - Number of under-five deaths per 1000 population
Polio - (Pol3) immunization coverage among 1-year-olds (%)
Total expenditure - General government expenditure on health as a percentage of total government expenditure (%)
Diphtheria - Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
HIV/AIDS - Deaths per 1 000 live births HIV/AIDS (0-4 years)
GDP - Gross Domestic Product per capita (in USD)
Population - Population of the country
thinness 1-19 years - Prevalence of thinness among children and adolescents for Age 10 to 19 (% )
thinness 5-9 years - Prevalence of thinness among children for Age 5 to 9(%)
Income composition of resources - Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
Schooling - Number of years of Schooling(years)

Data Preparation

#Reading csv file
life_expectancy_data <- read.csv("C://Users/raghu/Desktop/Daya Courses/Projects/Life-Expectancy-Prediction/Life Expectancy Data.csv")
head(life_expectancy_data, 5)
##       Country Year     Status Life.expectancy Adult.Mortality infant.deaths
## 1 Afghanistan 2015 Developing            65.0             263            62
## 2 Afghanistan 2014 Developing            59.9             271            64
## 3 Afghanistan 2013 Developing            59.9             268            66
## 4 Afghanistan 2012 Developing            59.5             272            69
## 5 Afghanistan 2011 Developing            59.2             275            71
##   Alcohol percentage.expenditure Hepatitis.B Measles  BMI under.five.deaths
## 1    0.01              71.279624          65    1154 19.1                83
## 2    0.01              73.523582          62     492 18.6                86
## 3    0.01              73.219243          64     430 18.1                89
## 4    0.01              78.184215          67    2787 17.6                93
## 5    0.01               7.097109          68    3013 17.2                97
##   Polio Total.expenditure Diphtheria HIV.AIDS       GDP Population
## 1     6              8.16         65      0.1 584.25921   33736494
## 2    58              8.18         62      0.1 612.69651     327582
## 3    62              8.13         64      0.1 631.74498   31731688
## 4    67              8.52         67      0.1 669.95900    3696958
## 5    68              7.87         68      0.1  63.53723    2978599
##   thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## 1                 17.2               17.3                           0.479
## 2                 17.5               17.5                           0.476
## 3                 17.7               17.7                           0.470
## 4                 17.9               18.0                           0.463
## 5                 18.2               18.2                           0.454
##   Schooling
## 1      10.1
## 2      10.0
## 3       9.9
## 4       9.8
## 5       9.5

The dataset contains 2938 rows and 22 columns

#Dimensions : Gives numbers of rows and columns
dim(life_expectancy_data)
## [1] 2938   22

We can understand more about the structure of the dataset by using the str() function.

# Structure of dataset
str(life_expectancy_data)
## 'data.frame':    2938 obs. of  22 variables:
##  $ Country                        : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Year                           : int  2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
##  $ Status                         : chr  "Developing" "Developing" "Developing" "Developing" ...
##  $ Life.expectancy                : num  65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
##  $ Adult.Mortality                : int  263 271 268 272 275 279 281 287 295 295 ...
##  $ infant.deaths                  : int  62 64 66 69 71 74 77 80 82 84 ...
##  $ Alcohol                        : num  0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
##  $ percentage.expenditure         : num  71.3 73.5 73.2 78.2 7.1 ...
##  $ Hepatitis.B                    : int  65 62 64 67 68 66 63 64 63 64 ...
##  $ Measles                        : int  1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
##  $ BMI                            : num  19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
##  $ under.five.deaths              : int  83 86 89 93 97 102 106 110 113 116 ...
##  $ Polio                          : int  6 58 62 67 68 66 63 64 63 58 ...
##  $ Total.expenditure              : num  8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
##  $ Diphtheria                     : int  65 62 64 67 68 66 63 64 63 58 ...
##  $ HIV.AIDS                       : num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##  $ GDP                            : num  584.3 612.7 631.7 670 63.5 ...
##  $ Population                     : num  33736494 327582 31731688 3696958 2978599 ...
##  $ thinness..1.19.years           : num  17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
##  $ thinness.5.9.years             : num  17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
##  $ Income.composition.of.resources: num  0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
##  $ Schooling                      : num  10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...

There are 13 variables that are taken as indicators from this dataset.

#statistical summary of the variables
summary(life_expectancy_data)
##    Country               Year         Status          Life.expectancy
##  Length:2938        Min.   :2000   Length:2938        Min.   :36.30  
##  Class :character   1st Qu.:2004   Class :character   1st Qu.:63.10  
##  Mode  :character   Median :2008   Mode  :character   Median :72.10  
##                     Mean   :2008                      Mean   :69.22  
##                     3rd Qu.:2012                      3rd Qu.:75.70  
##                     Max.   :2015                      Max.   :89.00  
##                                                       NA's   :10     
##  Adult.Mortality infant.deaths       Alcohol        percentage.expenditure
##  Min.   :  1.0   Min.   :   0.0   Min.   : 0.0100   Min.   :    0.000     
##  1st Qu.: 74.0   1st Qu.:   0.0   1st Qu.: 0.8775   1st Qu.:    4.685     
##  Median :144.0   Median :   3.0   Median : 3.7550   Median :   64.913     
##  Mean   :164.8   Mean   :  30.3   Mean   : 4.6029   Mean   :  738.251     
##  3rd Qu.:228.0   3rd Qu.:  22.0   3rd Qu.: 7.7025   3rd Qu.:  441.534     
##  Max.   :723.0   Max.   :1800.0   Max.   :17.8700   Max.   :19479.912     
##  NA's   :10                       NA's   :194                             
##   Hepatitis.B       Measles              BMI        under.five.deaths
##  Min.   : 1.00   Min.   :     0.0   Min.   : 1.00   Min.   :   0.00  
##  1st Qu.:77.00   1st Qu.:     0.0   1st Qu.:19.30   1st Qu.:   0.00  
##  Median :92.00   Median :    17.0   Median :43.50   Median :   4.00  
##  Mean   :80.94   Mean   :  2419.6   Mean   :38.32   Mean   :  42.04  
##  3rd Qu.:97.00   3rd Qu.:   360.2   3rd Qu.:56.20   3rd Qu.:  28.00  
##  Max.   :99.00   Max.   :212183.0   Max.   :87.30   Max.   :2500.00  
##  NA's   :553                        NA's   :34                       
##      Polio       Total.expenditure   Diphtheria       HIV.AIDS     
##  Min.   : 3.00   Min.   : 0.370    Min.   : 2.00   Min.   : 0.100  
##  1st Qu.:78.00   1st Qu.: 4.260    1st Qu.:78.00   1st Qu.: 0.100  
##  Median :93.00   Median : 5.755    Median :93.00   Median : 0.100  
##  Mean   :82.55   Mean   : 5.938    Mean   :82.32   Mean   : 1.742  
##  3rd Qu.:97.00   3rd Qu.: 7.492    3rd Qu.:97.00   3rd Qu.: 0.800  
##  Max.   :99.00   Max.   :17.600    Max.   :99.00   Max.   :50.600  
##  NA's   :19      NA's   :226       NA's   :19                      
##       GDP              Population        thinness..1.19.years
##  Min.   :     1.68   Min.   :3.400e+01   Min.   : 0.10       
##  1st Qu.:   463.94   1st Qu.:1.958e+05   1st Qu.: 1.60       
##  Median :  1766.95   Median :1.387e+06   Median : 3.30       
##  Mean   :  7483.16   Mean   :1.275e+07   Mean   : 4.84       
##  3rd Qu.:  5910.81   3rd Qu.:7.420e+06   3rd Qu.: 7.20       
##  Max.   :119172.74   Max.   :1.294e+09   Max.   :27.70       
##  NA's   :448         NA's   :652         NA's   :34          
##  thinness.5.9.years Income.composition.of.resources   Schooling    
##  Min.   : 0.10      Min.   :0.0000                  Min.   : 0.00  
##  1st Qu.: 1.50      1st Qu.:0.4930                  1st Qu.:10.10  
##  Median : 3.30      Median :0.6770                  Median :12.30  
##  Mean   : 4.87      Mean   :0.6276                  Mean   :11.99  
##  3rd Qu.: 7.20      3rd Qu.:0.7790                  3rd Qu.:14.30  
##  Max.   :28.60      Max.   :0.9480                  Max.   :20.70  
##  NA's   :34         NA's   :167                     NA's   :163

Data Cleaning

#Check for missing values
colSums(is.na(life_expectancy_data))
##                         Country                            Year 
##                               0                               0 
##                          Status                 Life.expectancy 
##                               0                              10 
##                 Adult.Mortality                   infant.deaths 
##                              10                               0 
##                         Alcohol          percentage.expenditure 
##                             194                               0 
##                     Hepatitis.B                         Measles 
##                             553                               0 
##                             BMI               under.five.deaths 
##                              34                               0 
##                           Polio               Total.expenditure 
##                              19                             226 
##                      Diphtheria                        HIV.AIDS 
##                              19                               0 
##                             GDP                      Population 
##                             448                             652 
##            thinness..1.19.years              thinness.5.9.years 
##                              34                              34 
## Income.composition.of.resources                       Schooling 
##                             167                             163

We will now impute the missing values with the mean to avoid errors in our analysis.

# Select numeric variables for calculating mean
life_expectancy_data_num <- select(life_expectancy_data,-c(1,2,3))

#Calculate means of all the numeric variables
colMeans(life_expectancy_data_num, na.rm = TRUE)
##                 Life.expectancy                 Adult.Mortality 
##                    6.922493e+01                    1.647964e+02 
##                   infant.deaths                         Alcohol 
##                    3.030395e+01                    4.602861e+00 
##          percentage.expenditure                     Hepatitis.B 
##                    7.382513e+02                    8.094046e+01 
##                         Measles                             BMI 
##                    2.419592e+03                    3.832125e+01 
##               under.five.deaths                           Polio 
##                    4.203574e+01                    8.255019e+01 
##               Total.expenditure                      Diphtheria 
##                    5.938190e+00                    8.232408e+01 
##                        HIV.AIDS                             GDP 
##                    1.742103e+00                    7.483158e+03 
##                      Population            thinness..1.19.years 
##                    1.275338e+07                    4.839704e+00 
##              thinness.5.9.years Income.composition.of.resources 
##                    4.870317e+00                    6.275511e-01 
##                       Schooling 
##                    1.199279e+01
# Impute missing values in numeric variables with mean
for(i in 4:ncol(life_expectancy_data)) {
  life_expectancy_data[ , i][is.na(life_expectancy_data[ , i])] <- mean(life_expectancy_data[ , i], na.rm=TRUE)
}
summary(life_expectancy_data) 
##    Country               Year         Status          Life.expectancy
##  Length:2938        Min.   :2000   Length:2938        Min.   :36.30  
##  Class :character   1st Qu.:2004   Class :character   1st Qu.:63.20  
##  Mode  :character   Median :2008   Mode  :character   Median :72.00  
##                     Mean   :2008                      Mean   :69.22  
##                     3rd Qu.:2012                      3rd Qu.:75.60  
##                     Max.   :2015                      Max.   :89.00  
##  Adult.Mortality infant.deaths       Alcohol       percentage.expenditure
##  Min.   :  1.0   Min.   :   0.0   Min.   : 0.010   Min.   :    0.000     
##  1st Qu.: 74.0   1st Qu.:   0.0   1st Qu.: 1.093   1st Qu.:    4.685     
##  Median :144.0   Median :   3.0   Median : 4.160   Median :   64.913     
##  Mean   :164.8   Mean   :  30.3   Mean   : 4.603   Mean   :  738.251     
##  3rd Qu.:227.0   3rd Qu.:  22.0   3rd Qu.: 7.390   3rd Qu.:  441.534     
##  Max.   :723.0   Max.   :1800.0   Max.   :17.870   Max.   :19479.912     
##   Hepatitis.B       Measles              BMI        under.five.deaths
##  Min.   : 1.00   Min.   :     0.0   Min.   : 1.00   Min.   :   0.00  
##  1st Qu.:80.94   1st Qu.:     0.0   1st Qu.:19.40   1st Qu.:   0.00  
##  Median :87.00   Median :    17.0   Median :43.00   Median :   4.00  
##  Mean   :80.94   Mean   :  2419.6   Mean   :38.32   Mean   :  42.04  
##  3rd Qu.:96.00   3rd Qu.:   360.2   3rd Qu.:56.10   3rd Qu.:  28.00  
##  Max.   :99.00   Max.   :212183.0   Max.   :87.30   Max.   :2500.00  
##      Polio       Total.expenditure   Diphtheria       HIV.AIDS     
##  Min.   : 3.00   Min.   : 0.370    Min.   : 2.00   Min.   : 0.100  
##  1st Qu.:78.00   1st Qu.: 4.370    1st Qu.:78.00   1st Qu.: 0.100  
##  Median :93.00   Median : 5.938    Median :93.00   Median : 0.100  
##  Mean   :82.55   Mean   : 5.938    Mean   :82.32   Mean   : 1.742  
##  3rd Qu.:97.00   3rd Qu.: 7.330    3rd Qu.:97.00   3rd Qu.: 0.800  
##  Max.   :99.00   Max.   :17.600    Max.   :99.00   Max.   :50.600  
##       GDP              Population        thinness..1.19.years
##  Min.   :     1.68   Min.   :3.400e+01   Min.   : 0.10       
##  1st Qu.:   580.49   1st Qu.:4.189e+05   1st Qu.: 1.60       
##  Median :  3116.56   Median :3.676e+06   Median : 3.40       
##  Mean   :  7483.16   Mean   :1.275e+07   Mean   : 4.84       
##  3rd Qu.:  7483.16   3rd Qu.:1.275e+07   3rd Qu.: 7.10       
##  Max.   :119172.74   Max.   :1.294e+09   Max.   :27.70       
##  thinness.5.9.years Income.composition.of.resources   Schooling    
##  Min.   : 0.10      Min.   :0.0000                  Min.   : 0.00  
##  1st Qu.: 1.60      1st Qu.:0.5042                  1st Qu.:10.30  
##  Median : 3.40      Median :0.6620                  Median :12.10  
##  Mean   : 4.87      Mean   :0.6276                  Mean   :11.99  
##  3rd Qu.: 7.20      3rd Qu.:0.7720                  3rd Qu.:14.10  
##  Max.   :28.60      Max.   :0.9480                  Max.   :20.70
# We can see that now the data set has no missing values
colSums(is.na(life_expectancy_data))
##                         Country                            Year 
##                               0                               0 
##                          Status                 Life.expectancy 
##                               0                               0 
##                 Adult.Mortality                   infant.deaths 
##                               0                               0 
##                         Alcohol          percentage.expenditure 
##                               0                               0 
##                     Hepatitis.B                         Measles 
##                               0                               0 
##                             BMI               under.five.deaths 
##                               0                               0 
##                           Polio               Total.expenditure 
##                               0                               0 
##                      Diphtheria                        HIV.AIDS 
##                               0                               0 
##                             GDP                      Population 
##                               0                               0 
##            thinness..1.19.years              thinness.5.9.years 
##                               0                               0 
## Income.composition.of.resources                       Schooling 
##                               0                               0
dim(life_expectancy_data)
## [1] 2938   22

Outlier analysis

While predicting life expectancy there could be few outliers that we need to ignore. Detection of outliers is important as, it increases the error variance and reduces the power of statistical tests. They can cause bias and/or influence estimates. They can also impact the basic assumption of regression as well as other statistical models.

#Plotting box plots of life expectancy to understand outliers
boxplot(life_expectancy_data$Life.expectancy, xlab="Life Expectancy")

From the box plot we can see that age below 45 is Outlier. Our analysis is not applicable for these records.

Removing outliers

outliers <- boxplot(life_expectancy_data$Life.expectancy, plot=FALSE)$out

life_expectancy_data<- life_expectancy_data[-which(life_expectancy_data$Life.expectancy %in% outliers),]

dim(life_expectancy_data)
## [1] 2921   22

Graphical analysis

Now we will perform correlation analysis to identify how each factor is related to the life expectancy of a person.

Typically, for each of the independent variables (predictors), the following plots are drawn to visualize the following behavior:
1. Scatter plot: Visualize the linear relationship between the predictor and response.
2. Box plot: To spot any outlier observations in the variable. Having outliers in your predictor can drastically affect the predictions as they can easily affect the direction/slope of the line of best fit.
3. Density plot: To see the distribution of the predictor variable. Ideally, a close to normal distribution (a bell shaped curve), without being skewed to the left or right is preferred.

We will be using mostly scatter plots to identify any relationship between variables as they are easy to detect visually detect any correlation.

1. correlation between Health expenditure and life expectancy.
#correlation between percentage expenditure and life expectancy
life_expectancy_vs_percenntage_expenditure <-  ggplot(life_expectancy_data, aes(percentage.expenditure, Life.expectancy)) + 
                                      geom_jitter(color = "yellow", alpha = 0.5) + theme_light()

life_expectancy_vs_Total_expenditure  <- ggplot(life_expectancy_data, aes(Total.expenditure, Life.expectancy)) +
                                      geom_jitter(color = "Light green", alpha = 0.5) + theme_light()

p <- plot_grid(life_expectancy_vs_percenntage_expenditure, life_expectancy_vs_Total_expenditure) 
title <- ggdraw() + draw_label("Correlation between Health expenditure and life expectancy", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1))

We can see from the above graph that the concentration of life expectancy is more when expenditure varies from 5k - 20k. Similar Analysis could be done for other variables. Lets find out if there is any effect of immunization coverage on life expectancy.

2. Life expectancy vs Immunizations
library(plotly)
life_expectancy_vs_Hepatitis_B <- ggplot(life_expectancy_data, aes(Hepatitis.B, Life.expectancy)) + 
                                      geom_jitter(color = "purple", alpha = 0.5) + theme_light()

life_expectancy_vs_Diphtheria  <- ggplot(life_expectancy_data, aes(Diphtheria, Life.expectancy)) +
                                       geom_jitter(color = "orange", alpha = 0.5) + theme_light()
                              
life_expectancy_vs_Polio  <- ggplot(life_expectancy_data, aes(Polio, Life.expectancy)) + geom_jitter(color = "pink", alpha = 0.5) + theme_grey()

p <- plot_grid(life_expectancy_vs_Hepatitis_B, life_expectancy_vs_Diphtheria, life_expectancy_vs_Polio ) 
title <- ggdraw() + draw_label("Correlation between Immunizations and life expectancy", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1))


Data points for Immunizations ( Hepatitis.B and Diphtheria) are concentrated in the age range from 60 - 80 years old, this means getting imunized is definitely positive for better life expectancy.

3. Life expectancy vs Measles

Measles represents the number of reported cases per 1000 population

#correlation between measles and life expectancy
life_expectancy_vs_Measles  <- plot_ly(data = life_expectancy_data, x = ~Measles , y = ~Life.expectancy,
                                      marker = list(size = 10,
                                                    color =  'rgba(221,160,221, .3)',
                                                     line = list(color = 'rgba(255, 0, 38, 0.2)',
                                                                 width = 2)))
life_expectancy_vs_Measles  <- life_expectancy_vs_Measles  %>% layout(title = 'Scatter Plot: Life Expectancy vs Measles',
                                                                    yaxis = list(zeroline = FALSE),
                                                                    xaxis = list(zeroline = FALSE))

life_expectancy_vs_Measles 
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

We can see that the concentration of reported cases of Measles is between 0 - 50k and is varied across all the age groups. We can also see some values in the 50k - 250k range that may have influenced the Life expectancy.

4. Life expectancy vs Alcohol

Alcohol represents Gross Domestic Product per capita (in USD)

#correlation between alcohol and life expectancy
life_expectancy_vs_Alcohol  <- plot_ly(data = life_expectancy_data, x = ~Alcohol , y = ~Life.expectancy,
                                      marker = list(size = 10,
                                                 color = 'rgba(152, 215, 182, .5)',
                                                 line = list(color = 'rgba(0, 0, 0, 0)',
                                                             width = 2)))
life_expectancy_vs_Alcohol  <- life_expectancy_vs_Alcohol  %>% layout(title = 'Scatter Plot: Life Expectancy vs Alcohol ',
                                                                    yaxis = list(zeroline = FALSE),
                                                                    xaxis = list(zeroline = FALSE))

life_expectancy_vs_Alcohol 
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode


5. Life expectancy vs BMI
#correlation between BMI and life expectancy
life_expectancy_vs_BMI <- plot_ly(data = life_expectancy_data, x = ~BMI, y = ~Life.expectancy,
                                      marker = list(size = 10,
                                                   color = 'rgba(255,182,193, .9)',
                                            line = list(color = 'rgba(255, 0, 38, 0.2)',
                                                        width = 2)))
life_expectancy_vs_BMI <- life_expectancy_vs_BMI %>% layout(title = 'Scatter Plot: Life Expectancy vs BMI',
                                                                    yaxis = list(zeroline = FALSE),
                                                                    xaxis = list(zeroline = FALSE))

life_expectancy_vs_BMI
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

We can see that life expectancy of the population decrease as the BMI increase.

6. Life expectancy vs under five deaths

We want to analyse if lesser deaths are positively related to life expectancy.
Under five deaths represents the number of under-five deaths per 1000 population

library(plotly)
life_expectancy_vs_under_five_deaths  <- ggplot(life_expectancy_data, aes(under.five.deaths, Life.expectancy)) + geom_jitter(color = "pink", alpha = 0.5) + theme_grey()

p <- plot_grid(life_expectancy_vs_under_five_deaths)
title <- ggdraw() + draw_label("Correlation between Under five deaths and life expectancy", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1))


7. Life expectancy vs GDP

GDP Gross Domestic Product per capita (in USD)

#correlation between GDP and lif expectancy
life_expectancy_vs_GDP  <- ggplot(life_expectancy_data, aes(GDP, Life.expectancy)) +
                                      geom_jitter(color = "dark green", alpha = 0.5) + theme_light()

p <- plot_grid(life_expectancy_vs_GDP) 
title <- ggdraw() + draw_label("Correlation between GDP vs Life expectancy", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1))


8. Life expectancy vs thinness

thinness 1 to 19 years Prevalence of thinness among children and adolescents for Age 10 to 19 (% )

#correlation between thinness and life expectancy
life_expectancy_vs_thinness_1_19_years  <- ggplot(life_expectancy_data, aes(thinness..1.19.years, Life.expectancy)) +
                                      geom_jitter(color = "blue", alpha = 0.5) + theme_light()

life_expectancy_vs_thinness_5_9_years  <- ggplot(life_expectancy_data, aes(thinness.5.9.years, Life.expectancy)) +
                                      geom_jitter(color = "orange", alpha = 0.5) + theme_light()
                                                  
p <- plot_grid(life_expectancy_vs_thinness_1_19_years, life_expectancy_vs_thinness_5_9_years) 
title <- ggdraw() + draw_label("Correlation between Thinness vs Life expectancy", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1))


9. Life expectancy vs Income composition of resources

Income composition of resources Human Development Index in terms of income composition of resources (index ranging from 0 to 1)

library(plotly)
life_expectancy_vs_Income_composition_of_resources <- plot_ly(data = life_expectancy_data, x = ~Income.composition.of.resources , y = ~Life.expectancy,
                                      marker = list(size = 10,
                                                      color = 'rgba(181, 201, 253, .9)',
                                                  line = list(color = 'rgba(255, 0, 38, 0.2)',
                                                              width = 2)))
life_expectancy_vs_Income_composition_of_resources  <- life_expectancy_vs_Income_composition_of_resources %>% layout(title = 'Scatter Plot: Life Expectancy vs Income composition of resources',
                                                                    yaxis = list(zeroline = FALSE),
                                                                    xaxis = list(zeroline = FALSE))

life_expectancy_vs_Income_composition_of_resources
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

We can see Human development positively influences Life expectancy.

Correlation and Variances

We will use ggcorr function to check the correlation between independent numerical variables and the dependent variable.

## Warning in ggcorr(life_expectancy_data, label = T, label_size = 2, label_round =
## 2, : data in column(s) 'Country', 'Status' are not numeric and were ignored

We observe that the Life.Expectancy variable has a strong positive correlation with schooling and Income.composition.of.resources. On the other hand, Adult.Mortality has a strong negative correlation with Life.Expectancy and is a valid finding since when the Adult mortality is high, the life expectancy will definitely be low.

Preparing Data for modeling

#we will split the data into train and test for model building
n_train <- round(0.8 * nrow(life_expectancy_data))
train_indices <- sample(1:nrow(life_expectancy_data), n_train)
train_data <- life_expectancy_data[train_indices, ]
test_data <- life_expectancy_data[-train_indices, ]

Regression analysis

Now that we have seen correlation showing the relationship of Life Expectancy with each independent variables. Let’s analyse the data set using Multiple linear regression.

The aim of linear regression is to model a continuous variable Y as a mathematical function of one or more X variable(s), so that we can use this regression model to predict the Y when only the X is known.

Let’s build our initial Linear regression model.

Outcome: life.expectancy (reasons for high or low life expectancy?) Predictors: Alcohol, percentage.expenditure, Hepatitis.B, Measles, BMI, under.five.deaths, Polio, GDP, Total.expenditure, Diphtheria, thinness..1.19.years, thinness.5.9.years, Income.composition.of.resources

#first model
formula1 <- as.formula("Life.expectancy ~ Alcohol + percentage.expenditure + Hepatitis.B + Measles +  BMI + under.five.deaths + Polio+ Total.expenditure + Diphtheria  + thinness..1.19.years + thinness.5.9.years + GDP + Income.composition.of.resources")
model1 <- lm(formula1, data = train_data)
summary(model1)
## 
## Call:
## lm(formula = formula1, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.9221  -3.0859   0.1629   3.1554  22.9896 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      4.643e+01  7.399e-01  62.744  < 2e-16 ***
## Alcohol                          6.435e-02  3.590e-02   1.793  0.07317 .  
## percentage.expenditure           1.565e-04  1.312e-04   1.193  0.23308    
## Hepatitis.B                     -1.352e-02  6.070e-03  -2.228  0.02600 *  
## Measles                         -1.919e-05  1.274e-05  -1.506  0.13226    
## BMI                              9.389e-02  7.399e-03  12.689  < 2e-16 ***
## under.five.deaths                2.807e-04  9.539e-04   0.294  0.76858    
## Polio                            5.381e-02  6.883e-03   7.819 8.03e-15 ***
## Total.expenditure                9.183e-02  5.324e-02   1.725  0.08467 .  
## Diphtheria                       6.084e-02  7.219e-03   8.428  < 2e-16 ***
## thinness..1.19.years            -1.038e-01  7.900e-02  -1.314  0.18911    
## thinness.5.9.years              -1.301e-01  7.753e-02  -1.678  0.09355 .  
## GDP                              5.920e-05  2.077e-05   2.850  0.00441 ** 
## Income.composition.of.resources  1.714e+01  7.424e-01  23.088  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.64 on 2323 degrees of freedom
## Multiple R-squared:  0.6327, Adjusted R-squared:  0.6306 
## F-statistic: 307.8 on 13 and 2323 DF,  p-value: < 2.2e-16
r_sq1 <- summary(model1)$r.squared
prediction1 <- predict(model1, newdata = test_data)
residuals1 <- test_data$Life.expectancy - prediction1
rmse1 <- sqrt(mean(residuals1^2, na.rm=TRUE))

The first step in interpreting the regression is to examine the F-statistic and the associated p-value, at the bottom of model summary.

In our example, it can be seen that p-value of the F-statistic is < 2.2e-16, which is highly significant. This means that, at least, one of the predictor variables is significantly related to the outcome variable.

To see which predictor variables are significant, you can examine the coefficients table, which shows the estimate of regression beta coefficients and the associated t-statitic p-values:

This is very interesting to see the effect of all other variables - Alcohol, Immunizations (Polio, Diptheria), BMI and Human development have a statistically significant effect on the outcome of Life expectancy.

summary(model1)$coefficient
##                                      Estimate   Std. Error   t value
## (Intercept)                      4.642729e+01 7.399437e-01 62.744351
## Alcohol                          6.435314e-02 3.590015e-02  1.792559
## percentage.expenditure           1.565036e-04 1.312087e-04  1.192784
## Hepatitis.B                     -1.352081e-02 6.069735e-03 -2.227578
## Measles                         -1.918537e-05 1.274108e-05 -1.505788
## BMI                              9.388895e-02 7.399244e-03 12.688993
## under.five.deaths                2.806923e-04 9.538888e-04  0.294261
## Polio                            5.381452e-02 6.882801e-03  7.818695
## Total.expenditure                9.183487e-02 5.323976e-02  1.724930
## Diphtheria                       6.084419e-02 7.219329e-03  8.427957
## thinness..1.19.years            -1.037799e-01 7.900391e-02 -1.313605
## thinness.5.9.years              -1.300648e-01 7.752717e-02 -1.677667
## GDP                              5.920481e-05 2.077122e-05  2.850329
## Income.composition.of.resources  1.714154e+01 7.424478e-01 23.087872
##                                      Pr(>|t|)
## (Intercept)                      0.000000e+00
## Alcohol                          7.317358e-02
## percentage.expenditure           2.330759e-01
## Hepatitis.B                      2.600415e-02
## Measles                          1.322575e-01
## BMI                              1.013461e-35
## under.five.deaths                7.685847e-01
## Polio                            8.027681e-15
## Total.expenditure                8.467309e-02
## Diphtheria                       6.079207e-17
## thinness..1.19.years             1.891089e-01
## thinness.5.9.years               9.354661e-02
## GDP                              4.405924e-03
## Income.composition.of.resources 2.381008e-106

For a given the predictor, the t-statistic evaluates whether or not there is significant association between the predictor and the outcome variable, that is whether the beta coefficient of the predictor is significantly different from zero.

It can be seen that, change in the Alcohol,BMI,Polio, Total expenditure,Diphtheria, Thinness 1- 19 years,Income composition of resources are significantly associated to life expectancy of a person.

For a given predictor variable, the coefficient (b) can be interpreted as the average effect on y of a one unit increase in predictor, holding all other predictors fixed.

We found that Measles, percentage expenditure, Hepatitis B, under five deaths variables are not significant in the multiple regression model. We can remove these variables from our analysis.

#second model
#dropping insignificant variables like measles, percentage.expenditure, Hepatitis.B, Under.five.deaths, Thinness.5.9.years
formula2 <- as.formula("Life.expectancy ~  Alcohol +  Diphtheria  +  BMI +  Polio + Total.expenditure + thinness..1.19.years +  Income.composition.of.resources")
model2 <- lm(formula2, data = train_data)
summary(model2)
## 
## Call:
## lm(formula = formula2, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.0275  -3.0852   0.1366   3.0732  23.3538 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     44.780807   0.692541  64.662  < 2e-16 ***
## Alcohol                          0.111912   0.035692   3.135  0.00174 ** 
## Diphtheria                       0.054351   0.006880   7.899 4.29e-15 ***
## BMI                              0.097621   0.007428  13.143  < 2e-16 ***
## Polio                            0.053603   0.006934   7.731 1.58e-14 ***
## Total.expenditure                0.122170   0.053438   2.286  0.02233 *  
## thinness..1.19.years            -0.237843   0.034424  -6.909 6.26e-12 ***
## Income.composition.of.resources 18.911779   0.720720  26.240  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.731 on 2329 degrees of freedom
## Multiple R-squared:  0.6198, Adjusted R-squared:  0.6187 
## F-statistic: 542.4 on 7 and 2329 DF,  p-value: < 2.2e-16
r_sq2 <- summary(model2)$r.squared
prediction2 <- predict(model2, newdata = test_data)
residuals2 <- test_data$Life.expectancy - prediction2
rmse2 <- sqrt(mean(residuals2^2, na.rm=TRUE))

Comparing the models

print(paste0("R-squared for first model:", round(r_sq1, 4)))
## [1] "R-squared for first model:0.6327"
print(paste0("R-squared for second model: ", round(r_sq2, 4)))
## [1] "R-squared for second model: 0.6198"
print(paste0("RMSE for first model: ", round(rmse1, 2)))
## [1] "RMSE for first model: 6.12"
print(paste0("RMSE for second model: ", round(rmse2, 2)))
## [1] "RMSE for second model: 6.17"

The confidence interval of the model coefficient can be extracted as follow:

confint(model2, level=0.95)
##                                       2.5 %      97.5 %
## (Intercept)                     43.42274699 46.13886800
## Alcohol                          0.04191988  0.18190398
## Diphtheria                       0.04085837  0.06784335
## BMI                              0.08305466  0.11218642
## Polio                            0.04000570  0.06720047
## Total.expenditure                0.01737986  0.22696006
## thinness..1.19.years            -0.30534709 -0.17033909
## Income.composition.of.resources 17.49845990 20.32509864

Prediction

#prediction
test_data$prediction <- predict(model2, newdata = test_data)
ggplot(test_data, aes(x = prediction, y = Life.expectancy)) + 
  geom_point(color = "blue", alpha = 0.7) + 
  geom_abline(color = "red") +
  ggtitle("Prediction vs. Real values")

Model evaluation / Validation

Diagnostic Plots are used to evaluate the model assumptions and understand whether or not there are observations that can strongly have influence on the analysis.

Residuals are the measure of how far from the regression line the Data points are. Fitted values are models prediction of mean response value when you input the values of the predictors, factor levels or components into the model.

We have plotted the following graphs:
1. Residuals Vs Fitted values graph 2. Histogram of residuals

#residuals vs linear model prediction
test_data$residuals <- test_data$Life.expectancy - test_data$prediction
ggplot(data = test_data, aes(x = prediction, y = residuals)) +
  geom_pointrange(aes(ymin = 0, ymax = residuals), color = "purple", alpha = 0.7) + geom_hline(yintercept = 0, linetype = 4, color = "red") +
  ggtitle("Residuals vs. Linear model prediction")

#histogram for residuals
ggplot(test_data, aes(x = residuals)) + 
  geom_histogram(bins = 15, fill = "light blue") +
  ggtitle("Histogram of residuals")

Residual: It is an error between a predicted value and the observed actual value.

Assumptions for a Residual Plot: 1. The most important assumption of a linear regression model is that errors are independent and normally distributed. 2. It has high quality of points close to the origin and low density of points away from the origin. 3. It is symmetric about the origin.

To validate the regression model, you must use redual plot to visually confirm the validity of your model.

GainCurvePlot

The use case for this visualization is to compare a predictive model score to an actual outcome (either binary (0/1) or continuous). In this case the gain curve plot measures how well the model score sorts the data compared to the true outcome value.

GainCurvePlot(test_data, "prediction", "Life.expectancy", "Model2")



Finally our model can be written as follow:

Life_Expectancy = 45.342414 + 0.051514 x Alcohol + 0.037775 x Diphtheria + 0.094218 x BMI + 0.053829 x Polio + 0.020896 x Total.expenditure + -0.174295 x Thinness..1.19.years + 21.336983 x Income.composition.of.resources

Model accuracy assessment

As we have seen in our Linear regression model, the overall quality of the model can be assessed by examining the R-squared (R2) and Residual Standard Error (RSE).

R-squared:
In multiple linear regression, the R2 represents the correlation coefficient between the observed values of the outcome variable (y) and the fitted (i.e., predicted) values of y. For this reason, the value of R will always be positive and will range from zero to one.

R2 represents the proportion of variance, in the outcome variable y, that may be predicted by knowing the value of the x variables. An R2 value close to 1 indicates that the model explains a large portion of the variance in the outcome variable.

A problem with the R2, is that, it will always increase when more variables are added to the model, even if those variables are only weakly associated with the response (James et al. 2014). A solution is to adjust the R2 by taking into account the number of predictor variables.

The adjustment in the “Adjusted R Square” value in the summary output is a correction for the number of x variables included in the prediction model.

Residual Standard Error (RSE), or sigma:
The RSE estimate gives a measure of error of prediction. The lower the RSE, the more accurate the model (on the data in hand). The error rate can be estimated by dividing the RSE by the mean outcome variable:

sigma(model2)/mean(life_expectancy_data$Life.expectancy)
## [1] 0.08260357


In our regression example, the RSE is 5.860291 corresponding to 8.4% error rate which is pretty good.

Testing the final Model

We can test our model by giving the input values for the model to predict the approx. life expectancy of a person.

#test1
XYZ <- data.frame(  Country = "India",
                     Alcohol = 5.28,
                     Diphtheria = 86,
                     BMI = 38.9,
                     Polio = 98,
                     Total.expenditure = 11.14,
                     thinness..1.19.years = 2.1,
                     Income.composition.of.resources = 0.741)
print(paste0("Life expectancy for XYZ: ", round(predict(model2, XYZ), 2)))
## [1] "Life expectancy for XYZ: 73.97"

Conclusion

It is deduced from the above observations that, there is a high chance of increased Life expectancy by being immunized. We were successfully able to meet the objective and figure out variables that play important role in Life expectancy prediction model, that can be used in companies and as well as by customers. Data set with more features will provide more accurate outputs.