This project creates a prediction model, based on data provided by the World Health Organization (WHO), to evaluate the life expectancy of different countries. The data covers the period from 2000 to 2015. The resulting algorithms are tested on whether they maintain their accuracy when predicting life expectancy for data they have not been trained on.
2.1 Load the packages
The first step is to load all the required packages.
library(tidyverse)    # data manipulation and plotting
library(readr)        # reading the csv file
library(fastDummies)  # creating dummy variables
library(glmnet)       # lasso regression
library(leaps)        # best subset selection
library(boot)         # cross-validation utilities
2.2 Load the dataset
The data is read from the csv file and the life_exp dataset is created.
life_exp <- read_csv("Data/Life_Expectancy_Data.csv")
A prerequisite for the analysis is to preprocess the data so that it is suitable for modelling.
3.1 Removing the incomplete data
All the rows with missing or null values are removed.
life_exp <- life_exp[complete.cases(life_exp), ]
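As a quick sanity check, the number of remaining rows can be printed:
nrow(life_exp)  # 1636 complete rows remain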
Below, the first six rows are shown. After removing the incomplete cases, the data contains 22 columns and 1636 rows.
head(life_exp)
## # A tibble: 6 x 22
## Country Year Status `Life expectanc~ `Adult Mortalit~ `infant deaths`
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 Afghan~ 2015 Devel~ 65 263 62
## 2 Afghan~ 2014 Devel~ 59.9 271 64
## 3 Afghan~ 2013 Devel~ 59.9 268 66
## 4 Afghan~ 2012 Devel~ 59.5 272 69
## 5 Afghan~ 2011 Devel~ 59.2 275 71
## 6 Afghan~ 2010 Devel~ 58.8 279 74
## # ... with 16 more variables: Alcohol <dbl>, `percentage
## # expenditure` <dbl>, `Hepatitis B` <dbl>, Measles <dbl>, BMI <dbl>,
## # `under-five deaths` <dbl>, Polio <dbl>, `Total expenditure` <dbl>,
## # Diphtheria <dbl>, `HIV/AIDS` <dbl>, GDP <dbl>, Population <dbl>,
## # `thinness 1-19 years` <dbl>, `thinness 5-9 years` <dbl>, `Income
## # composition of resources` <dbl>, Schooling <dbl>
3.2 Dropping the Year column
The Year column is dropped as it will not be used in the analysis.
life_exp <- subset(life_exp, select = -c(Year))
3.3 Creating a dummy variable for Status
Except for the Country and Status (developed or developing) columns, all of the data is numeric. The values are in percentages, years, or, in the case of Gross Domestic Product (GDP), millions of dollars.
For regression models it is convenient to convert the qualitative variables to quantitative ones. The Status of the country is converted to numeric with the dummy_cols function from the fastDummies package, which adds a 0/1 indicator column per level (a toy illustration follows the code below). The original Status column is then dropped.
life_exp <- fastDummies::dummy_cols(life_exp, select_columns = "Status")
life_exp <- subset(life_exp, select = -c(Status))
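As a minimal illustration of what dummy_cols does (toy data, not the WHO dataset):
toy <- data.frame(Status = c("Developed", "Developing", "Developing"))
fastDummies::dummy_cols(toy, select_columns = "Status")
# returns the original column plus Status_Developed and Status_Developing 0/1 indicators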
3.4 Grouping by country
The data is grouped by country and the mean value of each variable over the 2000-2015 period is computed.
life_exp <- aggregate(life_exp[, 2:21], list(life_exp$Country), mean)
life_exp <- subset(life_exp, select = -c(Group.1))
After the grouping, there are 133 rows.
3.5 Renaming the columns
It is convenient if the column names do not contain spaces or other special characters, so these are removed from the names.
names(life_exp)<-str_replace_all(names(life_exp), c(" " = "", "-" = "", "/"=""))
Before proceeding with the creation of the prediction model, it is important to split the data into separate sets for training, validation and testing. Out of the 133 records, 77 are used for training, 28 for validation and the remaining 28 for testing.
splits <- c(rep("train", 77), rep("validation", 28), rep("test", 28))
# note: sample() is not seeded here, so the exact split differs between runs
life_exp <- life_exp %>% mutate(splits = sample(splits))
life_exp_train <- life_exp %>% filter(splits == "train")
life_exp_train <- life_exp_train %>% dplyr::select(-splits)
life_exp_validation <- life_exp %>% filter(splits == "validation")
life_exp_validation <- life_exp_validation %>% dplyr::select(-splits)
life_exp_test <- life_exp %>% filter(splits == "test")
life_exp_test <- life_exp_test %>% dplyr::select(-splits)
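A quick tabulation confirms the split sizes:
table(life_exp$splits)  # expected counts: test 28, train 77, validation 28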
The response variable in the data set is Life expectancy. The predictor variables are Country Status, Adult Mortality, Alcohol, percentage expenditure, Hepatitis B, Measles, BMI, under-five deaths, Polio, Total expenditure, Diphtheria, HIV/AIDS, GDP, Population, thinness 1-19 years, thinness 5-9 years, Income composition of resources and Schooling. Some important questions that could be addressed are: Is there a relationship between the predictors and Life expectancy? Which predictors are significant in predicting Life expectancy? Does alcohol consumption reduce Life expectancy?
A multiple linear regression can be used to answer the first question. The null hypothesis is that there is no relationship between the predictors and the response; the alternative hypothesis is that there is a relationship between Life expectancy and at least one of the predictors. This hypothesis test can be performed by computing the F-statistic of the linear regression. When there is no relationship between the response and the predictors, the F-statistic should take a value close to 1. If the alternative hypothesis is true, the F-statistic should be greater than 1.
The linear regression is fitted with these predictors and the summary is displayed below.
lm_model <- lm(Lifeexpectancy ~ ., data = life_exp_train)
summary(lm_model)
##
## Call:
## lm(formula = Lifeexpectancy ~ ., data = life_exp_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.0016 -1.3553 0.0736 1.6967 4.4597
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.025e+01 3.798e+00 15.864 < 2e-16 ***
## AdultMortality -4.330e-02 5.749e-03 -7.532 4.13e-10 ***
## infantdeaths -9.306e-02 1.110e-01 -0.838 0.40532
## Alcohol -2.596e-01 1.575e-01 -1.648 0.10479
## percentageexpenditure 9.620e-04 5.740e-04 1.676 0.09924 .
## HepatitisB 3.698e-03 3.767e-02 0.098 0.92213
## Measles -1.160e-04 2.019e-04 -0.574 0.56791
## BMI 6.029e-02 3.816e-02 1.580 0.11959
## underfivedeaths 5.481e-02 7.000e-02 0.783 0.43684
## Polio -2.637e-02 4.841e-02 -0.545 0.58815
## Totalexpenditure -2.347e-01 2.153e-01 -1.090 0.28017
## Diphtheria 3.457e-02 5.562e-02 0.622 0.53672
## HIVAIDS -1.720e-01 7.705e-02 -2.232 0.02957 *
## GDP -7.775e-05 1.184e-04 -0.657 0.51408
## Population 5.119e-08 3.650e-08 1.403 0.16615
## thinness119years 6.149e-02 2.864e-01 0.215 0.83079
## thinness59years 3.853e-02 2.952e-01 0.131 0.89663
## Incomecompositionofresources 9.938e+00 4.065e+00 2.445 0.01760 *
## Schooling 8.354e-01 2.890e-01 2.890 0.00544 **
## Status_Developing -1.420e+00 1.666e+00 -0.852 0.39775
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.582 on 57 degrees of freedom
## Multiple R-squared: 0.9399, Adjusted R-squared: 0.9199
## F-statistic: 46.95 on 19 and 57 DF, p-value: < 2.2e-16
The F-statistic (46.95) is significantly greater than 1, and the number of observations is much larger than the number of predictors, so the null hypothesis can be rejected. The p-value associated with this F-statistic is also very close to 0. We can therefore conclude that at least one of the predictors is related to Life expectancy. The linear regression predictions for the test dataset are plotted in the figure below.
tibble(predicted = predict(lm_model, newdata = life_exp_test),
       observed = life_exp_test$Lifeexpectancy) %>%
  ggplot(aes(x = predicted, y = observed)) +
  geom_point() +
  theme_minimal() +
  geom_abline(slope = 1) +
  labs(title = "Predicted versus observed life expectancy using linear regression",
       x = "Predicted value", y = "True value")
Since the number of predictors is not high, the best subset selection approach can be used to identify the important predictors. The regsubsets() function (part of the leaps library) performs best subset selection by identifying the best model for each number of predictors, where best is quantified using RSS. The summary() command outputs the best set of variables for each model size.
dim(life_exp_train)
## [1] 77 20
regfit.best <- regsubsets(Lifeexpectancy ~ ., data = life_exp_train, nvmax = 19)
reg.summary <- summary(regfit.best)
reg.summary
## Subset selection object
## Call: regsubsets.formula(Lifeexpectancy ~ ., data = life_exp_train,
## nvmax = 19)
## 19 Variables (and intercept)
## Forced in Forced out
## AdultMortality FALSE FALSE
## infantdeaths FALSE FALSE
## Alcohol FALSE FALSE
## percentageexpenditure FALSE FALSE
## HepatitisB FALSE FALSE
## Measles FALSE FALSE
## BMI FALSE FALSE
## underfivedeaths FALSE FALSE
## Polio FALSE FALSE
## Totalexpenditure FALSE FALSE
## Diphtheria FALSE FALSE
## HIVAIDS FALSE FALSE
## GDP FALSE FALSE
## Population FALSE FALSE
## thinness119years FALSE FALSE
## thinness59years FALSE FALSE
## Incomecompositionofresources FALSE FALSE
## Schooling FALSE FALSE
## Status_Developing FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: exhaustive
## AdultMortality infantdeaths Alcohol percentageexpenditure
## 1 ( 1 ) "*" " " " " " "
## 2 ( 1 ) "*" " " " " " "
## 3 ( 1 ) "*" " " " " " "
## 4 ( 1 ) "*" " " " " " "
## 5 ( 1 ) "*" " " " " " "
## 6 ( 1 ) "*" " " " " "*"
## 7 ( 1 ) "*" " " "*" "*"
## 8 ( 1 ) "*" " " "*" "*"
## 9 ( 1 ) "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*"
## HepatitisB Measles BMI underfivedeaths Polio Totalexpenditure
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " "
## 5 ( 1 ) " " "*" " " " " " " " "
## 6 ( 1 ) " " " " " " "*" " " " "
## 7 ( 1 ) " " " " "*" " " " " " "
## 8 ( 1 ) " " " " "*" "*" " " " "
## 9 ( 1 ) " " " " "*" " " " " " "
## 10 ( 1 ) " " " " "*" " " " " "*"
## 11 ( 1 ) "*" " " "*" " " " " "*"
## 12 ( 1 ) "*" " " "*" " " " " "*"
## 13 ( 1 ) "*" " " "*" "*" " " "*"
## 14 ( 1 ) " " " " "*" "*" " " "*"
## 15 ( 1 ) " " " " "*" "*" " " "*"
## 16 ( 1 ) " " "*" "*" "*" " " "*"
## 17 ( 1 ) " " "*" "*" "*" "*" "*"
## 18 ( 1 ) " " "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*"
## Diphtheria HIVAIDS GDP Population thinness119years
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " "
## 5 ( 1 ) " " "*" " " " " " "
## 6 ( 1 ) " " "*" " " " " " "
## 7 ( 1 ) " " "*" " " " " " "
## 8 ( 1 ) " " "*" " " " " " "
## 9 ( 1 ) " " "*" " " "*" " "
## 10 ( 1 ) " " "*" " " "*" " "
## 11 ( 1 ) " " "*" " " "*" " "
## 12 ( 1 ) " " "*" " " "*" "*"
## 13 ( 1 ) " " "*" " " "*" "*"
## 14 ( 1 ) " " "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*"
## thinness59years Incomecompositionofresources Schooling
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " "*"
## 3 ( 1 ) " " "*" "*"
## 4 ( 1 ) " " "*" "*"
## 5 ( 1 ) " " "*" "*"
## 6 ( 1 ) " " "*" "*"
## 7 ( 1 ) " " "*" "*"
## 8 ( 1 ) " " "*" "*"
## 9 ( 1 ) " " "*" "*"
## 10 ( 1 ) " " "*" "*"
## 11 ( 1 ) " " "*" "*"
## 12 ( 1 ) " " "*" "*"
## 13 ( 1 ) " " "*" "*"
## 14 ( 1 ) " " "*" "*"
## 15 ( 1 ) " " "*" "*"
## 16 ( 1 ) " " "*" "*"
## 17 ( 1 ) " " "*" "*"
## 18 ( 1 ) "*" "*" "*"
## 19 ( 1 ) "*" "*" "*"
## Status_Developing
## 1 ( 1 ) " "
## 2 ( 1 ) " "
## 3 ( 1 ) " "
## 4 ( 1 ) " "
## 5 ( 1 ) " "
## 6 ( 1 ) " "
## 7 ( 1 ) " "
## 8 ( 1 ) " "
## 9 ( 1 ) " "
## 10 ( 1 ) " "
## 11 ( 1 ) " "
## 12 ( 1 ) " "
## 13 ( 1 ) " "
## 14 ( 1 ) "*"
## 15 ( 1 ) "*"
## 16 ( 1 ) "*"
## 17 ( 1 ) "*"
## 18 ( 1 ) "*"
## 19 ( 1 ) "*"
The above summary of the best subset selection approach provides the best subset for each model size. The best model size can be decided by estimating the test MSE for each size and selecting the size with the lowest estimate. A 10-fold cross-validation is performed below to estimate the test MSE for each model size; since regsubsets() does not provide a predict() method, a small helper function is defined first. Model size 3 is found to be the one with the lowest estimated test MSE.
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])  # formula used in the regsubsets() call
  mat <- model.matrix(form, newdata)    # design matrix for the new data
  coefi <- coef(object, id = id)        # coefficients of the model of size id
  xvars <- names(coefi)
  mat[, xvars] %*% coefi                # predictions
}
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(life_exp_validation), replace = TRUE)
cv.errors <- matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))
for (j in 1:k) {
  best.fit <- regsubsets(Lifeexpectancy ~ ., data = life_exp_validation[folds != j, ], nvmax = 19)
  for (i in 1:19) {
    pred <- predict(best.fit, life_exp_validation[folds == j, ], id = i)
    cv.errors[j, i] <- mean((life_exp_validation$Lifeexpectancy[folds == j] - pred)^2)
  }
}
mean.cv.errors.bestsubset <- apply(cv.errors, 2, mean)
mean.cv.errors.bestsubset
## 1 2 3 4 5 6 7
## 21.06960 30.19506 15.15802 15.84608 19.06236 18.18506 24.63374
## 8 9 10 11 12 13 14
## 177.80886 423.23436 447.83239 599.44951 452.43634 387.21358 538.07409
## 15 16 17 18 19
## 515.68543 529.76588 522.27302 491.09122 507.90516
min(mean.cv.errors.bestsubset)
## [1] 15.15802
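The winning model size and its variables can be extracted directly from the objects created above (regfit.best was fitted on the training set):
best_size <- which.min(mean.cv.errors.bestsubset)  # model size with the lowest CV error
coef(regfit.best, id = best_size)                  # coefficients of the best model of that size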
The cross-validation therefore selects the three-variable model, which, per the best subset output above, consists of AdultMortality, Incomecompositionofresources and Schooling.
So it can be concluded that these factors are significant for predicting the response (i.e. Life expectancy). A linear regression is fitted below with two of them, AdultMortality and Incomecompositionofresources, and the plot for the test dataset is displayed.
lm_bestfit <- lm(Lifeexpectancy ~ AdultMortality+
Incomecompositionofresources, data = life_exp_train)
summary(lm_bestfit)
##
## Call:
## lm(formula = Lifeexpectancy ~ AdultMortality + Incomecompositionofresources,
## data = life_exp_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9391 -1.3799 0.2768 1.4957 9.6980
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.608816 1.903651 34.990 < 2e-16 ***
## AdultMortality -0.058066 0.003719 -15.614 < 2e-16 ***
## Incomecompositionofresources 19.615970 2.341953 8.376 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.857 on 74 degrees of freedom
## Multiple R-squared: 0.9045, Adjusted R-squared: 0.902
## F-statistic: 350.6 on 2 and 74 DF, p-value: < 2.2e-16
tibble(predicted = predict(lm_bestfit, newdata = life_exp_test),
observed = life_exp_test$Lifeexpectancy) %>%
ggplot(aes(x = predicted, y = observed)) +
geom_point() +
theme_minimal() +
geom_abline(slope = 1) +
labs(title = "Predicted versus observed life expectancy using Best subset selection")
The answer to the previous question provides an interesting insight: Alcohol consumption is not among the factors selected as important for Life expectancy. This goes against the popular notion that alcohol consumption reduces Life expectancy. The null hypothesis is that there is no relationship between the two; the alternative hypothesis is that alcohol consumption reduces life expectancy.
A simple linear regression with only Alcohol as the predictor is as follows.
lm_alcohol <- lm(Lifeexpectancy ~ Alcohol, data = life_exp_train)
summary(lm_alcohol)
##
## Call:
## lm(formula = Lifeexpectancy ~ Alcohol, data = life_exp_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.232 -5.365 1.669 6.605 11.481
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.943 1.456 43.242 < 2e-16 ***
## Alcohol 1.045 0.285 3.667 0.000455 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.459 on 75 degrees of freedom
## Multiple R-squared: 0.1521, Adjusted R-squared: 0.1408
## F-statistic: 13.45 on 1 and 75 DF, p-value: 0.0004554
The F-statistic is well above 1 and the p-value is close to 0, so the null hypothesis can be rejected: Alcohol consumption and Life expectancy are correlated. Note, however, that the estimated coefficient is positive (1.045), so in this simple regression higher alcohol consumption is associated with higher, not lower, life expectancy; country-level confounding (e.g. wealthier countries tend to drink more and live longer) is a plausible explanation.
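A scatter plot of the training data (a minimal sketch using ggplot2) makes this positive association visible:
ggplot(life_exp_train, aes(x = Alcohol, y = Lifeexpectancy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(title = "Life expectancy versus alcohol consumption",
       x = "Alcohol consumption", y = "Life expectancy")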
An alternative to best subset selection is the lasso. It performs variable selection and regularization to improve the prediction accuracy and interpretability of the model. The lasso can be fitted with the glmnet() function.
x_train <- model.matrix(Lifeexpectancy ~ ., data = life_exp_train)
result <- glmnet(x = x_train[, -1],                  # X matrix without intercept
                 y = life_exp_train$Lifeexpectancy,  # Life expectancy as response
                 family = "gaussian",                # Normally distributed errors
                 alpha = 1,                          # LASSO penalty
                 lambda = 1)                         # Penalty value
rownames(coef(result))[which(coef(result) != 0)]
## [1] "(Intercept)" "AdultMortality"
## [3] "BMI" "HIVAIDS"
## [5] "Incomecompositionofresources" "Schooling"
Assuming a value of 1 for lambda results in only five predictors being selected, as shown in the above output.
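For comparison (a sketch; the penalty value of 5 is an arbitrary choice), a larger lambda shrinks more coefficients to exactly zero:
result_strict <- glmnet(x = x_train[, -1], y = life_exp_train$Lifeexpectancy,
                        family = "gaussian", alpha = 1, lambda = 5)
sum(coef(result_strict) != 0) - 1  # number of surviving predictors, excluding the intercept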
A plot of predicted versus observed values on the test dataset for the lambda = 1 model is shown below.
x_test <- model.matrix(Lifeexpectancy ~ ., data=life_exp_test)[, -1]
y_pred_lasso <- as.numeric(predict(result, newx = x_test))
tibble(Predicted = y_pred_lasso, Observed = life_exp_test$Lifeexpectancy) %>%
ggplot(aes(x = Predicted, y = Observed)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, lty = 2) +
theme_minimal() +
labs(title = "Predicted versus observed life expectancy using Lasso")
The above lasso regression used a lambda value (the tuning parameter) of 1. The lambda parameter controls the strength of the shrinkage in glmnet(); changing it changes the predictions, and thus the MSE. A good value for lambda can be selected with the cv.glmnet() function, which uses k-fold cross-validation to find the lambda for which the out-of-sample MSE is lowest.
x_cv <- model.matrix(Lifeexpectancy ~ .,
                     bind_rows(life_exp_train, life_exp_validation))[, -1]
result_cv <- cv.glmnet(x = x_cv,
                       y = c(life_exp_train$Lifeexpectancy,
                             life_exp_validation$Lifeexpectancy),
                       nfolds = 15)
best_lambda <- result_cv$lambda.min
best_lambda
## [1] 0.1690499
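The full cross-validation curve can be inspected with the built-in plot method:
plot(result_cv)  # CV MSE across the lambda grid; dotted lines mark lambda.min and lambda.1se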
A plot of predicted values versus observed values using test dataset with tuned lambda is shown below.
x_test <- model.matrix(Lifeexpectancy ~ ., data=life_exp_test)[, -1]
y_pred_cv_tuning <- as.numeric(predict(result_cv, newx = x_test))
tibble(Predicted = y_pred_cv_tuning, Observed = life_exp_test$Lifeexpectancy) %>%
ggplot(aes(x = Predicted, y = Observed)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, lty = 2) +
theme_minimal() +
labs(title = "Predicted versus observed life expectancy with tuned lambda")
A bar plot comparing the test MSE of the prediction models described above follows. Interestingly, the test MSE of the lasso with CV tuning is the lowest among the four prediction models.
# lm
mse <- function(y_true, y_pred) {
  mean((y_true - y_pred)^2)
}
y_pred <- predict(lm_model, newdata = life_exp_test)
mse(life_exp_test$Lifeexpectancy, y_pred)
## [1] 18.39668
# Best subset selection
k <- 10
set.seed(2)
folds <- sample(1:k, nrow(life_exp_test), replace = TRUE)
cv.errors <- matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))
for (j in 1:k) {
  best.fit <- regsubsets(Lifeexpectancy ~ ., data = life_exp_test[folds != j, ], nvmax = 19)
  for (i in 1:19) {
    pred <- predict(best.fit, life_exp_test[folds == j, ], id = i)
    cv.errors[j, i] <- mean((life_exp_test$Lifeexpectancy[folds == j] - pred)^2)
  }
}
mean.cv.errors.bestsubset <- apply(cv.errors, 2, mean)
min(mean.cv.errors.bestsubset)
## [1] 5.248014
# Lasso
mse(life_exp_test$Lifeexpectancy, y_pred_lasso)
## [1] 6.272871
# Lasso with CV tuning
mse(life_exp_test$Lifeexpectancy, y_pred_cv_tuning)
## [1] 5.01991
# Adding to vector
mses <- c(
mse(life_exp_test$Lifeexpectancy, y_pred),
min(mean.cv.errors.bestsubset),
mse(life_exp_test$Lifeexpectancy, y_pred_lasso),
mse(life_exp_test$Lifeexpectancy, y_pred_cv_tuning)
)
# Creating a plot
tibble(Method = as_factor(c("lm", "subset", "lasso", "cv_las")), MSE = mses) %>%
ggplot(aes(x = Method, y = MSE, fill = Method)) +
geom_bar(stat = "identity", col = "black") +
theme_minimal() +
theme(legend.position = "none") +
labs(title = "Comparison of test set MSE for different prediction methods")
This paper started with an introduction of the dataset, then explained the preprocessing steps and followed them with a linear regression model. The best subset selection method was applied to filter the significant predictors. Several analytical questions were answered with the linear prediction model and with the model built from the variables chosen by the best subset selection approach. A lasso regression model was created with glmnet(), and the best lambda was then selected through cross-validation. As expected, the test MSE of the lasso model with tuned lambda is lower than that of the model with an arbitrarily chosen lambda. Another observation is that all the variables selected by the best subset selection approach are also selected by the lasso model.