Although there have been lot of studies undertaken in the past on factors affecting life expectancy considering demographic variables, income composition and mortality rates. It was found that affect of immunization and human development index was not taken into account in the past. As a result, this study tries to perform different regression techniques in order to get some insights on the dataset made available by Deeksha Russell and Duan Wang, who gathered the data from the WHO and United Nations websites.
#install.packages('ggplot2')
#install.packages('corrplot')
#install.packages('dplyr')
#install.packages('caret')
#install.packages('superml')
#install.packages('rpart')
#install.packages('rpart.plot')
#install.packages('Metrics')
#install.packages('randomForest')
library(ggplot2)
library(corrplot)
library(dplyr)
library(caret)
library(superml)
library(rpart)
library(rpart.plot)
library(Metrics)
library(randomForest)
data <- read.csv('Life Expectancy Data-regression.csv')
The dataset includes 2938 observations and 22 variables. It contains information about the following variables:
The data related to life expectancy, health factors for 193 countries has been collected from the Global Health Observatory (GHO) data repository under the World Health Organization (WHO) and its corresponding economic data was collected from the United Nation website for a period of 16 years(2000-2015).
The dataset is made available on Kaggle- https://www.kaggle.com/kumarajarshi/life-expectancy-who
head(data)
## Country Year Status Life.expectancy Adult.Mortality infant.deaths
## 1 Afghanistan 2015 Developing 65.0 263 62
## 2 Afghanistan 2014 Developing 59.9 271 64
## 3 Afghanistan 2013 Developing 59.9 268 66
## 4 Afghanistan 2012 Developing 59.5 272 69
## 5 Afghanistan 2011 Developing 59.2 275 71
## 6 Afghanistan 2010 Developing 58.8 279 74
## Alcohol percentage.expenditure Hepatitis.B Measles BMI under.five.deaths
## 1 0.01 71.279624 65 1154 19.1 83
## 2 0.01 73.523582 62 492 18.6 86
## 3 0.01 73.219243 64 430 18.1 89
## 4 0.01 78.184215 67 2787 17.6 93
## 5 0.01 7.097109 68 3013 17.2 97
## 6 0.01 79.679367 66 1989 16.7 102
## Polio Total.expenditure Diphtheria HIV.AIDS GDP Population
## 1 6 8.16 65 0.1 584.25921 33736494
## 2 58 8.18 62 0.1 612.69651 327582
## 3 62 8.13 64 0.1 631.74498 31731688
## 4 67 8.52 67 0.1 669.95900 3696958
## 5 68 7.87 68 0.1 63.53723 2978599
## 6 66 9.20 66 0.1 553.32894 2883167
## thinness..10.19.years thinness.5.9.years Income.composition.of.resources
## 1 17.2 17.3 0.479
## 2 17.5 17.5 0.476
## 3 17.7 17.7 0.470
## 4 17.9 18.0 0.463
## 5 18.2 18.2 0.454
## 6 18.4 18.4 0.448
## Schooling
## 1 10.1
## 2 10.0
## 3 9.9
## 4 9.8
## 5 9.5
## 6 9.2
str(data)
## 'data.frame': 2938 obs. of 22 variables:
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Year : int 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
## $ Status : chr "Developing" "Developing" "Developing" "Developing" ...
## $ Life.expectancy : num 65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
## $ Adult.Mortality : int 263 271 268 272 275 279 281 287 295 295 ...
## $ infant.deaths : int 62 64 66 69 71 74 77 80 82 84 ...
## $ Alcohol : num 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
## $ percentage.expenditure : num 71.3 73.5 73.2 78.2 7.1 ...
## $ Hepatitis.B : int 65 62 64 67 68 66 63 64 63 64 ...
## $ Measles : int 1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
## $ BMI : num 19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
## $ under.five.deaths : int 83 86 89 93 97 102 106 110 113 116 ...
## $ Polio : int 6 58 62 67 68 66 63 64 63 58 ...
## $ Total.expenditure : num 8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
## $ Diphtheria : int 65 62 64 67 68 66 63 64 63 58 ...
## $ HIV.AIDS : num 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
## $ GDP : num 584.3 612.7 631.7 670 63.5 ...
## $ Population : num 33736494 327582 31731688 3696958 2978599 ...
## $ thinness..10.19.years : num 17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
## $ thinness.5.9.years : num 17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
## $ Income.composition.of.resources: num 0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
## $ Schooling : num 10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...
sapply(data, FUN=class)
## Country Year
## "character" "integer"
## Status Life.expectancy
## "character" "numeric"
## Adult.Mortality infant.deaths
## "integer" "integer"
## Alcohol percentage.expenditure
## "numeric" "numeric"
## Hepatitis.B Measles
## "integer" "integer"
## BMI under.five.deaths
## "numeric" "integer"
## Polio Total.expenditure
## "integer" "numeric"
## Diphtheria HIV.AIDS
## "integer" "numeric"
## GDP Population
## "numeric" "numeric"
## thinness..10.19.years thinness.5.9.years
## "numeric" "numeric"
## Income.composition.of.resources Schooling
## "numeric" "numeric"
statusTable <- table(data$Status)
statusTable <- as.data.frame(statusTable)
head(statusTable)
## Var1 Freq
## 1 Developed 512
## 2 Developing 2426
prop.table(table(data$Status))
##
## Developed Developing
## 0.1742682 0.8257318
Most of the nations are with the “Developing” status(nearly 82.6%) as per the Data.
table(data$Status[data$Life.expectancy>70.0])
##
## Developed Developing
## 511 1109
ggplot(data = statusTable, aes(x=Var1, y=Freq))+
geom_bar(stat='identity')
nums <- unlist(lapply(data, is.numeric))
data[ , nums]
cormat <- cor(data[, nums],use='pairwise.complete.obs')
corrplot(cormat,type = 'lower')
Life Expectancy is the target variable for our modeling purpose. The features are expected to be independent. High Correlation between the features is a sign of multicollinearity. We should treat it.
sapply(data, function(x) sum(is.na(x)))
## Country Year
## 0 0
## Status Life.expectancy
## 0 10
## Adult.Mortality infant.deaths
## 10 0
## Alcohol percentage.expenditure
## 194 0
## Hepatitis.B Measles
## 553 0
## BMI under.five.deaths
## 34 0
## Polio Total.expenditure
## 19 226
## Diphtheria HIV.AIDS
## 19 0
## GDP Population
## 448 652
## thinness..10.19.years thinness.5.9.years
## 34 34
## Income.composition.of.resources Schooling
## 167 163
data$Life.expectancy[is.na(data$Life.expectancy)] <-
mean(data$Life.expectancy, na.rm=TRUE)
sapply(data, function(x) sum(is.na(x)))
## Country Year
## 0 0
## Status Life.expectancy
## 0 0
## Adult.Mortality infant.deaths
## 10 0
## Alcohol percentage.expenditure
## 194 0
## Hepatitis.B Measles
## 553 0
## BMI under.five.deaths
## 34 0
## Polio Total.expenditure
## 19 226
## Diphtheria HIV.AIDS
## 19 0
## GDP Population
## 448 652
## thinness..10.19.years thinness.5.9.years
## 34 34
## Income.composition.of.resources Schooling
## 167 163
data$Adult.Mortality[is.na(data$Adult.Mortality)] <- mean(data$Adult.Mortality, na.rm=TRUE)
Now, instead of imputing the mean values of the whole column, we can impute the values of mean of range to make it more precise. The pairs of variables chosen below are selected after trying out various other pairs. The following variable pairs were found to be the most relevant ones for our studies, based on solely the Visualization.
For this, lets first create a scatterplot and analyze.
ggplot(data=data, aes(x=Alcohol, y=Schooling))+
geom_point()
impute_alcohol <- function(values){
alcoholValue <- as.numeric(values[1])
schoolingValue <- as.numeric(values[2])
if(is.na(alcoholValue) == TRUE && is.na(schoolingValue) == FALSE){
if(schoolingValue <= 2.5){
return(4.0)
} else if(2.5<schoolingValue && schoolingValue<=5.0){
return(2.0)
} else if(5.0<schoolingValue && schoolingValue<=7.5){
return(2.5)
} else if(7.5<schoolingValue && schoolingValue<=10.0){
return(3.0)
} else if(10.0<schoolingValue && schoolingValue<=15.0){
return(4.0)
} else if(schoolingValue>15.0){
return(10.0)
}
}
else{
return(alcoholValue)
}
}
school_alcohol <- data[c('Alcohol','Schooling')]
data$Alcohol <- apply(school_alcohol, FUN=impute_alcohol, MARGIN=1)
sapply(data, function(x) sum(is.na(x)))
## Country Year
## 0 0
## Status Life.expectancy
## 0 0
## Adult.Mortality infant.deaths
## 0 0
## Alcohol percentage.expenditure
## 9 0
## Hepatitis.B Measles
## 553 0
## BMI under.five.deaths
## 34 0
## Polio Total.expenditure
## 19 226
## Diphtheria HIV.AIDS
## 19 0
## GDP Population
## 448 652
## thinness..10.19.years thinness.5.9.years
## 34 34
## Income.composition.of.resources Schooling
## 167 163
As there are still 9 NA’s in the Alcohol variable, lets further Impute values with Mean of the whole column.
data$Alcohol[is.na(data$Alcohol)] <- mean(data$Alcohol, na.rm=TRUE)
ggplot(data,aes(x=Polio, y=Life.expectancy))+
geom_point()
impute_polio <- function(values){
polio=as.numeric(values[1])
le=as.numeric(values[2])
if(is.na(polio) == TRUE ){
if(le <= 45.0){
return(72.0)
} else if(45.0<le && le<=50.0){
return(57.0)
} else if(50.0<le && le<=60.0){
return(70.0)
} else if(60.0<le && le<= 70.0){
return(80.0)
} else if(70.0<le && le<=80.0){
return(87.0)
} else if(le > 80.0){
return(95.0)
}
}
else{
return(polio)
}
}
polio_lifeExpect <- data[c('Polio','Life.expectancy')]
data$Polio<- apply(polio_lifeExpect,FUN=impute_polio, MARGIN=1)
ggplot(data, aes(x=Polio, y=Diphtheria))+
geom_point(na.rm=TRUE)
impute_diphtheria <- function(values){
d <- as.numeric(values[1])
p <- as.numeric(values[2])
if(is.na(d) == TRUE){
if( p<=10){
return(75.0)
}
else if( 10<p && p<=40){
return(27.0)
}
else if (40<p && p<=45){
return(33.0)
}
else if (45<p && p<=50){
return(37.0)
}
else if( 50<p && p<=60){
return(39.0)
}
else if( 60<p && p<=80){
return(60.0)
}
else if (p>80){
return(80.0)
}
}
else{
return(d)
}
}
dip_polio <- data[c('Diphtheria','Polio')]
data$Diphtheria <- apply(dip_polio, FUN=impute_diphtheria, MARGIN=1)
ggplot(data, aes(x=Hepatitis.B, y=Diphtheria))+
geom_point(na.rm=TRUE)
impute_hep <- function(values){
hep<- as.numeric(values[1])
dip <- as.numeric(values[2])
if(is.na(hep) ==TRUE){
if (dip<=15){
return (65.0)}
else if (15<dip && dip<=30){
return (28.0)}
else if (30<dip && dip<=45){
return (38.0)}
else if (45<dip && dip<=60){
return(50.0)}
else if (60<dip && dip<=80){
return (65.0)
}
else if(dip>80){
return (88.4)
}
}
else{
return (hep)
}
}
hep_dip <- data[c('Hepatitis.B','Diphtheria')]
data$Hepatitis.B <- apply(hep_dip, FUN=impute_hep, MARGIN=1)
ggplot(data, aes(x=Life.expectancy, y=BMI))+
geom_point()
impute_bmi <- function(values){
l <- as.numeric(values[1])
b <- as.numeric(values[2])
if (is.na(b)){
if (l<=50){
return(18.0)
}
else if (50<l && l<=60){
return (23.0)
}
else if (60<l && l<=70){
return (35.0)
}
else if (70<l && l<=80){
return( 46.8)
}
else if (80<l && l<=100){
return (55.0)
}
}
else{
return(b)
}
}
life_bmi <- data[c('Life.expectancy','BMI')]
data$BMI <- apply(life_bmi, FUN=impute_bmi, MARGIN=1)
ggplot(data, aes(x= Total.expenditure, y=Alcohol))+
geom_point(na.rm=TRUE)
impute_totalexp <- function(values){
t<- as.numeric(values[1])
a<- as.numeric(values[2])
if( a<=2.5){
return (5.0)}
else if (2.5<a && a<=5.0){
return (6.0)
}
else if (5.0<a && a<=10.0){
return (6.25)
}
else if (10.0<a && a<=12.5){
return (7.0)
}
else if (a>12.5){
return (7.5)
}
else{
return (t)
}
}
alcohol_exp <- data[c('Total.expenditure','Alcohol')]
data$Total.expenditure <- apply(alcohol_exp, FUN=impute_totalexp, MARGIN=1)
ggplot(data,aes(x=percentage.expenditure, y=GDP))+
geom_point(na.rm=TRUE)
impute_gdp<- function(values){
p <- values[1]
g <-values[2]
if (is.na(g)){
if (p<=1250){
return (1100.0)}
else if (1250<p && p<=2500){
return (1800.0)}
else if (2500<p && p<=3750){
return (2900.0)}
else if (3750<p && p<=7500){
return (3500.0)}
else if (7500<p && p<=8750){
return (4500.0)}
else if (8750<p && p<=10000){
return (5000.0)}
else if (10000<p && p<=11250){
return (5700.0)}
else if (11250<p && p<=12500){
return (7000.0)}
else if (12500<p && p<=15000){
return (8000.0)}
else if (15000<p && p<=17500){
return (9000.0)}
else if (p>17500){
return (8500.0)}}
else{
return (g)}
}
percent_gdp <- data[c('percentage.expenditure','GDP')]
data$GDP <- apply(percent_gdp, FUN= impute_gdp, MARGIN=1)
ggplot(data, aes(x=infant.deaths, y=Population))+
geom_point(na.rm=TRUE)
impute_pop <- function(values){
i<- as.numeric(values[1])
p<- as.numeric(values[2])
if (is.na(p)){
if (i<=100){
return( 0.19*((10)**9))}
else if (100<i && i<=250){
return( 0.18*((10)**9))}
else if (250<i && i<=350){
return( 0.02*((10)**9))}
else if (350<i && i<=900){
return( 0.1*((10)**9))}
else if (900<i && i<=1100){
return( 0.18*((10)**9))}
else if (1100<i && i<=1250){
return( 0.05*((10)**9))}
else if (1250<i && i<=1500){
return( 0.19*((10)**9))}
else if (1500<i && i<=1750){
return (0.05*((10)**9))}
else if (i>1750){
return (0.1*((10)**9))}}
else{
return (p)}
}
infant_pop <- data[c('infant.deaths','Population')]
data$Population <- apply(infant_pop, FUN=impute_pop, MARGIN=1)
ggplot(data, aes(x=BMI, y=thinness..10.19.years))+
geom_point(na.rm=TRUE)
impute_thinness <- function(values){
b<- as.numeric(values[1])
t<- as.numeric(values[2])
if (is.na(t)){
if (b<=10){
return (5.0)}
else if (10<b && b<=20){
return (10.0)}
else if (20<b && b<=30){
return (8.0)}
else if (30<b && b<=40){
return (6.0)}
else if (40<b && b<=50){
return (3.0)}
else if (50<b && b<=70){
return (4.0)}
else if (b>70){
return (1.0)}}
else{
return (t)}
}
bmi_thin <- data[c('BMI','thinness..10.19.years')]
data$thinness..10.19.years <- apply(bmi_thin, FUN=impute_thinness, MARGIN=1)
ggplot(data, aes(x=BMI, y=thinness.5.9.years))+
geom_point(na.rm=TRUE)
impute_thin1 <- function(values){
b <- as.numeric(values[1])
t <- as.numeric(values[2])
if(is.na(t)){
if (b<=10){
return( 5.0)}
else if (10<b && b<=20){
return (10.0)}
else if (20<b && b<=30){
return (8.0)}
else if (30<b && b<=40){
return (6.0)}
else if (40<b && b<=50){
return (3.0)}
else if (50<b && b<=70){
return (4.0)}
else if (b>70){
return (1.0)}
}
else{
return(t)
}
}
bmi_thin1 <- data[c('BMI','thinness.5.9.years')]
data$thinness.5.9.years <- apply(bmi_thin1, FUN = impute_thin1, MARGIN = 1)
ggplot(data, aes(x=Life.expectancy, y=Income.composition.of.resources))+
geom_point(na.rm=TRUE)
impute_income <- function(values){
l <- as.numeric( values[1])
i <- as.numeric(values[2])
if(is.na(i) == TRUE){
if (l<=40){
return (0.4)}
else if (40<l && l<=50){
return (0.42)}
else if (50<l && l<=60){
return (0.402)}
else if (60<l && l<=70){
return (0.54)}
else if (70<l && l<=80){
return (0.71)}
else if (l>80){
return (0.88)}
}
else{
return(i)
}
}
life_income <- data[c('Life.expectancy','Income.composition.of.resources')]
data$Income.composition.of.resources <- apply(life_income, FUN = impute_income, MARGIN = 1)
ggplot(data, aes(x=Life.expectancy, y=Schooling))+
geom_point(na.rm=TRUE)
impute_schooling <- function(values){
l <- as.numeric(values[1])
s <- as.numeric(values[2])
if(is.na(s) == TRUE){
if (l<= 40){
return (8.0)}
else if (40<l && l<=44 ){
return (7.5)}
else if (44<l && l<50 ){
return (8.1)}
else if (50<l && l<=60 ){
return (8.2)}
else if (60<l && l<=70 ){
return (10.5)}
else if (70<l && l<=80 ){
return (13.4)}
else if (l>80 ){
return (16.5)}
}
else{
return(s)
}
}
life_school <- data[c('Life.expectancy', 'Schooling')]
data$Schooling <- apply(life_school, FUN=impute_schooling, MARGIN=1)
sapply(data, FUN=function(x)(sum(is.na(x))))
## Country Year
## 0 0
## Status Life.expectancy
## 0 0
## Adult.Mortality infant.deaths
## 0 0
## Alcohol percentage.expenditure
## 0 0
## Hepatitis.B Measles
## 0 0
## BMI under.five.deaths
## 0 0
## Polio Total.expenditure
## 0 0
## Diphtheria HIV.AIDS
## 0 0
## GDP Population
## 0 0
## thinness..10.19.years thinness.5.9.years
## 0 0
## Income.composition.of.resources Schooling
## 0 0
par(mfrow=c(3,3))
par(mar=c(3,2,1,1))
hist(data$infant.deaths)
hist(data$Adult.Mortality)
hist(data$Alcohol)
hist(data$Hepatitis.B)
hist(data$Measles)
hist(data$Polio)
hist(data$Diphtheria)
hist(data$HIV.AIDS)
hist(data$BMI)
Clearly, all the variables are skewed either way. However, as we are going to use those models which do not require normality assumption, we can just move ahead with our analysis.
One Hot Encoding - allows the representation of categorical data to be more expressive. Many machine learning algorithms cannot work with categorical data directly. The categories must be converted into numbers.
Label Encoding - refers to converting the labels into numeric form so as to convert it into the machine-readable form. Machine learning algorithms can then decide in a better way on how those labels must be operated.
encoder <- LabelEncoder$new()
data$Country <- encoder$fit_transform(data$Country)
dummy_var <- dummyVars(formula = '~.', data = data[-1])
oneHotData <- as.data.frame(predict(dummy_var, newdata=data[-1]))
colnames(oneHotData)
## [1] "Year" "StatusDeveloped"
## [3] "StatusDeveloping" "Life.expectancy"
## [5] "Adult.Mortality" "infant.deaths"
## [7] "Alcohol" "percentage.expenditure"
## [9] "Hepatitis.B" "Measles"
## [11] "BMI" "under.five.deaths"
## [13] "Polio" "Total.expenditure"
## [15] "Diphtheria" "HIV.AIDS"
## [17] "GDP" "Population"
## [19] "thinness..10.19.years" "thinness.5.9.years"
## [21] "Income.composition.of.resources" "Schooling"
oneHotData$Country <- data$Country
colnames(oneHotData)
## [1] "Year" "StatusDeveloped"
## [3] "StatusDeveloping" "Life.expectancy"
## [5] "Adult.Mortality" "infant.deaths"
## [7] "Alcohol" "percentage.expenditure"
## [9] "Hepatitis.B" "Measles"
## [11] "BMI" "under.five.deaths"
## [13] "Polio" "Total.expenditure"
## [15] "Diphtheria" "HIV.AIDS"
## [17] "GDP" "Population"
## [19] "thinness..10.19.years" "thinness.5.9.years"
## [21] "Income.composition.of.resources" "Schooling"
## [23] "Country"
set.seed(123) # For reproducibility of the same results
sample_size <- floor(0.8*nrow(oneHotData))
train_index <- sample(seq_len(nrow(oneHotData)), size=sample_size)
train<- oneHotData[train_index,]
x_test<- oneHotData[-train_index,]
We are rating our models on the basis of Root Mean Square Error (RMSE) score. We believe that depending on a specific problem, outliers can be significant in the overall analysis. So, that is why we chose RMSE over MAE.
train_lm <- lm(Life.expectancy~., data=train)
summary(train_lm)
##
## Call:
## lm(formula = Life.expectancy ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8327 -2.2263 -0.1125 2.2033 18.0778
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.959e+01 3.760e+01 2.383 0.01725 *
## Year -1.737e-02 1.882e-02 -0.923 0.35605
## StatusDeveloped 1.515e+00 2.908e-01 5.210 2.06e-07 ***
## StatusDeveloping NA NA NA NA
## Adult.Mortality -1.708e-02 8.471e-04 -20.163 < 2e-16 ***
## infant.deaths 9.975e-02 9.312e-03 10.712 < 2e-16 ***
## Alcohol 8.545e-02 6.522e-02 1.310 0.19029
## percentage.expenditure 1.527e-04 1.005e-04 1.519 0.12898
## Hepatitis.B -5.719e-04 4.421e-03 -0.129 0.89708
## Measles -1.174e-05 8.573e-06 -1.369 0.17105
## BMI 3.342e-02 5.249e-03 6.368 2.30e-10 ***
## under.five.deaths -7.524e-02 6.876e-03 -10.943 < 2e-16 ***
## Polio 2.593e-02 4.844e-03 5.353 9.52e-08 ***
## Total.expenditure -3.153e-01 3.258e-01 -0.968 0.33315
## Diphtheria 2.678e-02 5.264e-03 5.087 3.94e-07 ***
## HIV.AIDS -4.761e-01 1.806e-02 -26.363 < 2e-16 ***
## GDP 2.552e-05 1.526e-05 1.673 0.09455 .
## Population 2.549e-09 9.633e-10 2.646 0.00819 **
## thinness..10.19.years -5.538e-02 5.428e-02 -1.020 0.30770
## thinness.5.9.years -2.163e-02 5.371e-02 -0.403 0.68726
## Income.composition.of.resources 7.036e+00 6.937e-01 10.143 < 2e-16 ***
## Schooling 7.683e-01 4.634e-02 16.581 < 2e-16 ***
## Country 3.708e-03 1.447e-03 2.562 0.01048 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.858 on 2328 degrees of freedom
## Multiple R-squared: 0.8372, Adjusted R-squared: 0.8357
## F-statistic: 570 on 21 and 2328 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(train_lm)
predictions <- predict(train_lm, newdata=x_test)
rmse.lm <- rmse(predictions,x_test$Life.expectancy)
rmse.lm
## [1] 3.916231
model <- rpart(Life.expectancy~., data=train,method='anova')
par(mfrow=c(1,1))
rpart.plot(model)
summary(model)
## Call:
## rpart(formula = Life.expectancy ~ ., data = train, method = "anova")
## n= 2350
##
## CP nsplit rel error xerror xstd
## 1 0.57851430 0 1.0000000 1.0007399 0.027577056
## 2 0.12226618 1 0.4214857 0.4392424 0.013529014
## 3 0.04595534 2 0.2992195 0.2946733 0.010446933
## 4 0.03984852 3 0.2532642 0.2634445 0.010082613
## 5 0.02788668 4 0.2134157 0.2333384 0.008571159
## 6 0.01952901 5 0.1855290 0.1989316 0.007572385
## 7 0.01756969 6 0.1660000 0.1810374 0.007029801
## 8 0.01000000 7 0.1484303 0.1602394 0.006030408
##
## Variable importance
## HIV.AIDS Income.composition.of.resources
## 26 19
## Adult.Mortality Schooling
## 18 14
## thinness..10.19.years thinness.5.9.years
## 6 6
## BMI Alcohol
## 3 2
## Total.expenditure percentage.expenditure
## 2 2
## GDP under.five.deaths
## 2 1
##
## Node number 1: 2350 observations, complexity param=0.5785143
## mean=69.27591, MSE=90.56297
## left son=2 (619 obs) right son=3 (1731 obs)
## Primary splits:
## HIV.AIDS < 0.65 to the right, improve=0.5785143, (0 missing)
## Income.composition.of.resources < 0.5875 to the left, improve=0.5717741, (0 missing)
## Schooling < 11.45 to the left, improve=0.5139798, (0 missing)
## Adult.Mortality < 237.5 to the right, improve=0.4752723, (0 missing)
## BMI < 39.85 to the left, improve=0.3418446, (0 missing)
## Surrogate splits:
## Adult.Mortality < 254.5 to the right, agree=0.880, adj=0.544, (0 split)
## Income.composition.of.resources < 0.5345 to the left, agree=0.844, adj=0.407, (0 split)
## Schooling < 10.05 to the left, agree=0.825, adj=0.336, (0 split)
## thinness..10.19.years < 6.45 to the right, agree=0.793, adj=0.213, (0 split)
## thinness.5.9.years < 6.45 to the right, agree=0.788, adj=0.194, (0 split)
##
## Node number 2: 619 observations, complexity param=0.03984852
## mean=57.17173, MSE=48.31389
## left son=4 (468 obs) right son=5 (151 obs)
## Primary splits:
## Income.composition.of.resources < 0.5285 to the left, improve=0.2835751, (0 missing)
## BMI < 33.25 to the left, improve=0.2675741, (0 missing)
## Adult.Mortality < 331.5 to the right, improve=0.2631155, (0 missing)
## under.five.deaths < 1.5 to the right, improve=0.2283233, (0 missing)
## infant.deaths < 1.5 to the right, improve=0.2282347, (0 missing)
## Surrogate splits:
## Schooling < 11.05 to the left, agree=0.914, adj=0.649, (0 split)
## percentage.expenditure < 153.9527 to the left, agree=0.866, adj=0.450, (0 split)
## GDP < 1641.038 to the left, agree=0.863, adj=0.437, (0 split)
## BMI < 29.75 to the left, agree=0.843, adj=0.358, (0 split)
## under.five.deaths < 4.5 to the right, agree=0.842, adj=0.351, (0 split)
##
## Node number 3: 1731 observations, complexity param=0.1222662
## mean=73.60433, MSE=34.54393
## left son=6 (706 obs) right son=7 (1025 obs)
## Primary splits:
## Income.composition.of.resources < 0.7005 to the left, improve=0.4351671, (0 missing)
## Adult.Mortality < 161.5 to the right, improve=0.3814791, (0 missing)
## Schooling < 11.45 to the left, improve=0.3768480, (0 missing)
## StatusDeveloping < 0.5 to the right, improve=0.2941991, (0 missing)
## StatusDeveloped < 0.5 to the left, improve=0.2941991, (0 missing)
## Surrogate splits:
## Schooling < 12.45 to the left, agree=0.879, adj=0.703, (0 split)
## Adult.Mortality < 143.5 to the right, agree=0.760, adj=0.411, (0 split)
## BMI < 51.05 to the left, agree=0.757, adj=0.404, (0 split)
## Alcohol < 5.105 to the left, agree=0.750, adj=0.387, (0 split)
## Total.expenditure < 6.125 to the left, agree=0.748, adj=0.381, (0 split)
##
## Node number 4: 468 observations, complexity param=0.01756969
## mean=55.06923, MSE=30.7844
## left son=8 (122 obs) right son=9 (346 obs)
## Primary splits:
## Adult.Mortality < 379.5 to the right, improve=0.25954110, (0 missing)
## HIV.AIDS < 9.35 to the right, improve=0.24252910, (0 missing)
## under.five.deaths < 39.5 to the right, improve=0.12430700, (0 missing)
## infant.deaths < 22.5 to the right, improve=0.12077190, (0 missing)
## Year < 2006.5 to the left, improve=0.09574549, (0 missing)
## Surrogate splits:
## HIV.AIDS < 6.55 to the right, agree=0.827, adj=0.336, (0 split)
## Country < 187 to the right, agree=0.754, adj=0.057, (0 split)
## percentage.expenditure < 392.5857 to the right, agree=0.746, adj=0.025, (0 split)
## Alcohol < 10.32 to the right, agree=0.744, adj=0.016, (0 split)
## Population < 7115 to the left, agree=0.741, adj=0.008, (0 split)
##
## Node number 5: 151 observations, complexity param=0.01952901
## mean=63.68808, MSE=46.48026
## left son=10 (76 obs) right son=11 (75 obs)
## Primary splits:
## HIV.AIDS < 2.85 to the right, improve=0.5921793, (0 missing)
## Adult.Mortality < 314.5 to the right, improve=0.5497381, (0 missing)
## Income.composition.of.resources < 0.6475 to the left, improve=0.3197414, (0 missing)
## Measles < 0.5 to the right, improve=0.2738616, (0 missing)
## under.five.deaths < 1.5 to the right, improve=0.2640346, (0 missing)
## Surrogate splits:
## Adult.Mortality < 267.5 to the right, agree=0.861, adj=0.720, (0 split)
## Polio < 83.5 to the left, agree=0.781, adj=0.560, (0 split)
## Diphtheria < 83.5 to the left, agree=0.768, adj=0.533, (0 split)
## under.five.deaths < 2.5 to the right, agree=0.748, adj=0.493, (0 split)
## BMI < 34.8 to the left, agree=0.742, adj=0.480, (0 split)
##
## Node number 6: 706 observations, complexity param=0.02788668
## mean=68.93265, MSE=21.59911
## left son=12 (185 obs) right son=13 (521 obs)
## Primary splits:
## Adult.Mortality < 211.5 to the right, improve=0.3892017, (0 missing)
## Schooling < 10.35 to the left, improve=0.3174602, (0 missing)
## Income.composition.of.resources < 0.5565 to the left, improve=0.2950295, (0 missing)
## Polio < 84.5 to the left, improve=0.1690747, (0 missing)
## Diphtheria < 86.5 to the left, improve=0.1648257, (0 missing)
## Surrogate splits:
## Income.composition.of.resources < 0.4855 to the left, agree=0.775, adj=0.141, (0 split)
## Schooling < 9.05 to the left, agree=0.758, adj=0.076, (0 split)
## Country < 0.5 to the left, agree=0.754, adj=0.059, (0 split)
## HIV.AIDS < 0.55 to the right, agree=0.745, adj=0.027, (0 split)
## infant.deaths < 1550 to the right, agree=0.742, adj=0.016, (0 split)
##
## Node number 7: 1025 observations, complexity param=0.04595534
## mean=76.8221, MSE=18.07368
## left son=14 (708 obs) right son=15 (317 obs)
## Primary splits:
## Income.composition.of.resources < 0.8365 to the left, improve=0.5279394, (0 missing)
## thinness.5.9.years < 2.05 to the right, improve=0.4275270, (0 missing)
## thinness..10.19.years < 1.85 to the right, improve=0.4215412, (0 missing)
## Adult.Mortality < 105 to the right, improve=0.3752846, (0 missing)
## Schooling < 15.65 to the left, improve=0.2613797, (0 missing)
## Surrogate splits:
## Schooling < 15.65 to the left, agree=0.853, adj=0.524, (0 split)
## thinness.5.9.years < 1.35 to the right, agree=0.826, adj=0.438, (0 split)
## thinness..10.19.years < 1.25 to the right, agree=0.819, adj=0.413, (0 split)
## GDP < 22219.63 to the left, agree=0.815, adj=0.401, (0 split)
## percentage.expenditure < 3032.965 to the left, agree=0.811, adj=0.388, (0 split)
##
## Node number 8: 122 observations
## mean=50.30902, MSE=17.68033
##
## Node number 9: 346 observations
## mean=56.74769, MSE=24.59787
##
## Node number 10: 76 observations
## mean=58.47632, MSE=20.34233
##
## Node number 11: 75 observations
## mean=68.96933, MSE=17.55039
##
## Node number 12: 185 observations
## mean=64.06703, MSE=12.44091
##
## Node number 13: 521 observations
## mean=70.66036, MSE=13.45965
##
## Node number 14: 708 observations
## mean=74.75516, MSE=8.259095
##
## Node number 15: 317 observations
## mean=81.43849, MSE=9.141106
printcp(model)
##
## Regression tree:
## rpart(formula = Life.expectancy ~ ., data = train, method = "anova")
##
## Variables actually used in tree construction:
## [1] Adult.Mortality HIV.AIDS
## [3] Income.composition.of.resources
##
## Root node error: 212823/2350 = 90.563
##
## n= 2350
##
## CP nsplit rel error xerror xstd
## 1 0.578514 0 1.00000 1.00074 0.0275771
## 2 0.122266 1 0.42149 0.43924 0.0135290
## 3 0.045955 2 0.29922 0.29467 0.0104469
## 4 0.039849 3 0.25326 0.26344 0.0100826
## 5 0.027887 4 0.21342 0.23334 0.0085712
## 6 0.019529 5 0.18553 0.19893 0.0075724
## 7 0.017570 6 0.16600 0.18104 0.0070298
## 8 0.010000 7 0.14843 0.16024 0.0060304
plotcp(model)
predictions.tree <- predict(model, newdata = x_test)
rmse.tree <- rmse(predictions.tree,x_test$Life.expectancy)
rmse.tree
## [1] 3.743962
forest <- randomForest(Life.expectancy~., data=train, ntree=100, mtry=2,
importance=TRUE)
summary(forest)
## Length Class Mode
## call 6 -none- call
## type 1 -none- character
## predicted 2350 -none- numeric
## mse 100 -none- numeric
## rsq 100 -none- numeric
## oob.times 2350 -none- numeric
## importance 44 -none- numeric
## importanceSD 22 -none- numeric
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 2350 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
plot(forest)
importance(forest)
## %IncMSE IncNodePurity
## Year 8.958095 1936.080
## StatusDeveloped 4.767448 3937.985
## StatusDeveloping 5.081020 6882.608
## Adult.Mortality 13.052102 23821.922
## infant.deaths 12.007491 7074.949
## Alcohol 10.904265 4705.035
## percentage.expenditure 8.212917 4035.761
## Hepatitis.B 8.450515 5645.523
## Measles 8.792795 2004.794
## BMI 9.218450 12221.867
## under.five.deaths 10.488890 7264.541
## Polio 8.262369 12967.633
## Total.expenditure 6.646072 1604.009
## Diphtheria 7.638524 8181.647
## HIV.AIDS 12.024780 31763.257
## GDP 8.616361 6277.536
## Population 11.009760 1788.260
## thinness..10.19.years 10.010494 12097.180
## thinness.5.9.years 11.408010 12759.559
## Income.composition.of.resources 9.652734 21983.642
## Schooling 10.958745 17886.536
## Country 16.012205 2458.688
varImpPlot(forest)
predictions.forest <- predict(forest,newdata=x_test)
rmse.forest <- rmse(predictions.forest, x_test$Life.expectancy)
rmse.forest
## [1] 1.950322
ctrl_cv3 <- trainControl(method = "cv", number = 3)
parameters_xgb <- expand.grid(nrounds = seq(20, 80, 10),
max_depth = c(8),
eta = c(0.25),
gamma = 1,
colsample_bytree = c(0.2),
min_child_weight = c(150),
subsample = 0.8)
set.seed(123456789)
# smoted_training_set$Bankrupt <- as.factor(smoted_training_set$Bankrupt)
# XGBoost model on training data
x_train_xgb <- train(Life.expectancy ~ .,
data = train,
method = "xgbTree",
trControl = ctrl_cv3,
tuneGrid = parameters_xgb)
x_train_xgb
## eXtreme Gradient Boosting
##
## 2350 samples
## 22 predictor
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 1567, 1565, 1568
## Resampling results across tuning parameters:
##
## nrounds RMSE Rsquared MAE
## 20 3.576405 0.8616218 2.630030
## 30 3.268725 0.8827739 2.383526
## 40 3.143173 0.8915580 2.291453
## 50 3.045902 0.8979337 2.218238
## 60 2.972017 0.9026987 2.163592
## 70 2.912872 0.9064610 2.110570
## 80 2.867746 0.9092864 2.072611
##
## Tuning parameter 'max_depth' was held constant at a value of 8
## Tuning
## parameter 'min_child_weight' was held constant at a value of 150
##
## Tuning parameter 'subsample' was held constant at a value of 0.8
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 80, max_depth = 8, eta
## = 0.25, gamma = 1, colsample_bytree = 0.2, min_child_weight = 150
## and subsample = 0.8.
# plotting the model result
# as it is seen, we have least RMSE value after 80 iteration
plot(x_train_xgb)
# XGBoost model on test data
x_test_xgb <- train(Life.expectancy ~ .,
data = x_test,
method = "xgbTree",
trControl = ctrl_cv3,
tuneGrid = parameters_xgb)
x_test_xgb
## eXtreme Gradient Boosting
##
## 588 samples
## 22 predictor
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 393, 392, 391
## Resampling results across tuning parameters:
##
## nrounds RMSE Rsquared MAE
## 20 5.503532 0.6643759 4.274236
## 30 5.443069 0.6709637 4.224592
## 40 5.421155 0.6732309 4.208020
## 50 5.396903 0.6763660 4.203635
## 60 5.386832 0.6774605 4.192104
## 70 5.360919 0.6805100 4.167311
## 80 5.335165 0.6834915 4.149924
##
## Tuning parameter 'max_depth' was held constant at a value of 8
## Tuning
## parameter 'min_child_weight' was held constant at a value of 150
##
## Tuning parameter 'subsample' was held constant at a value of 0.8
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 80, max_depth = 8, eta
## = 0.25, gamma = 1, colsample_bytree = 0.2, min_child_weight = 150
## and subsample = 0.8.
# plotting the result on test data
# the RMSE metrics is
plot(x_test_xgb)
prediction.xgb <-predict(x_test_xgb,data = x_test)
rmse.xgb<- rmse(prediction.xgb,x_test$Life.expectancy)