1 Global Life Expectancy (WHO)

1.1 Context

The World Health Organization collects data annually for all countries on indicators (metrics) it believes are important factors for human development. The purpose of this analysis is to determine, through regression modeling, which indicators best predict life expectancy.

1.2 Process

This project will build and compare several regression algorithms to derive a prediction model: linear regression, lasso-regularized regression, random forest, and XGBoost.

We will also do some basic data cleanup and exploratory data analysis (EDA).

2 Let’s get started

Let’s load some of the packages we will be using for this analysis.

library(tidyverse)
library(tidymodels)
library(skimr)
library(reshape)
# library(mlbench)
rm(list = ls())
getwd()
## [1] "C:/Users/jrfal/OneDrive/Documents/Master and Doctor/CUNY MSDS/Courses/Data_606_ProbStats/R_folder/Project"
theme_set(theme_light())
#setwd("./Project")

2.1 Load and Process data

We set up a Linux server on AWS for this project. The .CSV file with the life expectancy data is stored on the server, so you can easily download it as well.

life_data <- read.csv("http://3.86.40.38/life_expectancy_data.csv")
head(life_data)
life_data %>%
  summarize(across(.cols = everything(),
                   ~sum(is.na(.x))))

We have 10 records where the response variable is NA; we will remove those 10 records. The NAs in the predictors will be imputed later with a recipe.

We still have plenty of NAs among the predictors. We can impute them with the mean, median, or mode, or even apply an ML algorithm such as KNN to impute the values.

life_data <- life_data %>%
  filter(!is.na(Life.expectancy))
sum(is.na(life_data))
## [1] 2513
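As an illustration of the simpler option, a minimal median-imputation recipe might look like this (a sketch only; the actual pipeline below uses KNN imputation inside the model recipes):

# Sketch: impute numeric predictors with their medians.
median_recipe <- recipe(Life.expectancy ~ ., data = life_data) %>%
  step_impute_median(all_numeric(), -all_outcomes())

life_imputed <- median_recipe %>% prep() %>% bake(new_data = NULL)
sum(is.na(life_imputed))  # any remaining NAs would be in non-numeric columns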

Let’s convert Country to a factor for prediction purposes.

life_data$Country <- as_factor(life_data$Country)

2.2 Some Data visualizations

Let’s take a look at our data and some basic stats, starting with the variable we want to predict, Life.expectancy.

par(mfrow= c(1,1))
boxplot(life_data$Life.expectancy, 
        main = "Life Expectancy Globally", 
        ylab = "Life Expectancy Years")

There is a very wide range of values, which will make prediction more challenging.
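To put numbers on that spread (the skimr package loaded earlier would also work here):

# Five-number summary and standard deviation of the response.
summary(life_data$Life.expectancy)
sd(life_data$Life.expectancy)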

Let’s take a look at our predictor variables

par(mfrow = c(2,3))

invisible(lapply(4:9, function(i) boxplot(life_data[, i],
                                          main = colnames(life_data)[i])))

par(mfrow = c(2,3))
invisible(lapply(10:15, function(i) boxplot(life_data[, i],
                                          main = colnames(life_data)[i])))

par(mfrow = c(2,3))

invisible(lapply(16:21, function(i) boxplot(life_data[, i],
                                          main = colnames(life_data)[i])))

Now it is important to check the correlation between variables. Let’s plot a correlation heatmap.

life_cormat <- round(cor(life_data[,4:22],use = "complete.obs"),2)
melted_cormat <- melt(life_cormat)
ggplot(data = melted_cormat, aes(x=X1, y=X2, fill=value)) + 
  geom_tile() +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 8, hjust = 1))+
 coord_fixed()

In case the heatmap is not clear enough, we also plot a correlation matrix, which is less visually appealing but gives us what we need.

par(mfrow=c(1,1))
library(corrplot)
# png(file="corr.png", res=300, width=1200, height=1200)
corrplot(life_cormat, method="number",
         tl.cex = 0.3,
         number.cex = 0.3,
         cl.cex = 0.3)

# dev.off()

One more matrix… just for fun.

library(GGally)
life_data %>%
  ggscatmat(columns = 3:12, color="Status",alpha = 0.7)

Let’s plot the top 50 countries by average life expectancy. As expected, most of them are among the richest countries in the world.

life_data %>%
  group_by(Country) %>%
  summarise(avg_life = mean(Life.expectancy)) %>%
  arrange(desc(avg_life)) %>%
  na.omit() %>%
  slice(1:50) %>%
  ggplot(aes(x = reorder(Country, avg_life),avg_life)) +
  geom_bar(stat="identity") +
  coord_flip()

Now we will plot the bottom 50 countries. As expected, we find here some of the poorest countries in the world.

life_data %>%
  group_by(Country) %>%
  summarise(avg_life = mean(Life.expectancy)) %>%
  arrange(desc(avg_life)) %>%
  na.omit() %>%
  slice_tail(n = 50) %>%  # last 50 of the ranked list, i.e. the bottom 50
  ggplot(aes(x = reorder(Country, -avg_life),avg_life)) +
  geom_bar(stat="identity") +
  coord_flip()

3 Let’s build our Tidymodels

We will make use of the Tidymodels framework.

We will first split the data into training and testing sets. All model changes and adjustments will be made only on the training data; once done, we will evaluate the final model on the unseen test data.

3.1 Separate The Data

set.seed(888)
life_split <- initial_split(life_data)
life_train <- training(life_split)
life_test <- testing(life_split)

#Training set
life_train %>%
  summarize(no_rows = n())
#Testing set
life_test %>%
  summarize(no_rows = n())

The function divided our data set into a training set with 2,196 records and a test set with 732 rows. All modelling will be done only with the training set.

The final review of performance will be done with the unseen test set.

We will also make use of cross-validation. This technique splits the training data into folds; in each round one fold is held out for validation while the model is fit on the remaining folds, and the process repeats until every fold has served as the validation set. Here we define 10 folds, which means the model will be fit and validated 10 times, each time training on roughly 1,976 rows and validating on roughly 220.

## Cross Validation Split of Training Data
set.seed(888)
life_folds <- vfold_cv(
  data = life_train, 
  v = 10
  ) 

life_folds

3.2 Recipe for pre-processing

Tidymodels has the construct of recipes: pre-defined preprocessing steps you can apply to your data. Recipes let you do the feature engineering needed to get the data ready for the model.

For this project we will define TWO RECIPES. Both normalize the data, but the first includes the dummy-encoded countries as predictors while the second does not.
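As a tiny illustration of what step_dummy() does, here is the Status column on its own: each level after the first becomes a 0/1 indicator column (a sketch, separate from the real recipes below):

# Dummy-encode Status alone to see the indicator column(s) it produces.
recipe(~ Status, data = life_train) %>%
  step_dummy(Status) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  head()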

life_recipe <- recipe(Life.expectancy ~ ., data = life_train) %>%
  step_impute_knn(all_predictors(), neighbors = 10) %>%
  # step_dummy(Country,one_hot = TRUE) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_numeric()) %>%
  step_normalize(all_numeric()) %>%
  prep()

life_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         21
## 
## Training data contained 2196 data points and 977 incomplete rows. 
## 
## Operations:
## 
## K-nearest neighbor imputation for Year, Status, Adult.Mortality, infant.deaths, ... [trained]
## Dummy variables from Country, Status [trained]
## Zero variance filter removed <none> [trained]
## Centering and scaling for Year, Adult.Mortality, infant.deaths, Alcohol, ... [trained]

We will build another recipe called life_recipe_nc, for “no country”: we want to try excluding Country from the predictors, for several reasons I will explain later.

life_recipe_nc <- recipe(Life.expectancy ~ ., data = life_train) %>%
  step_impute_knn(all_predictors(), neighbors = 10) %>%
  update_role(Country, new_role = "id") %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_numeric()) %>%
  step_normalize(all_numeric()) %>%
  prep()

life_recipe_nc
## Recipe
## 
## Inputs:
## 
##       role #variables
##         id          1
##    outcome          1
##  predictor         20
## 
## Training data contained 2196 data points and 977 incomplete rows. 
## 
## Operations:
## 
## K-nearest neighbor imputation for Status, Adult.Mortality, infant.deaths, Alcoho... [trained]
## Dummy variables from Country, Status [trained]
## Zero variance filter removed <none> [trained]
## Centering and scaling for Year, Adult.Mortality, infant.deaths, Alcohol, ... [trained]

3.3 Linear Regression

We will now define the second part of tidymodels, the model specification. In this case it will be a linear regression.

lm_spec <- linear_reg() %>% 
            set_engine('lm') %>% # adds lm implementation of linear regression
            set_mode('regression')

# View object properties
lm_spec
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

3.4 Lasso Regression

For part 1 of the project we will compare the simple linear regression to a regularized regression using a lasso penalty. Our goal is to check whether some regularization benefits the linear regression.
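For reference, with mixture = 1 the glmnet engine fits the pure lasso, which minimizes a penalized least-squares objective of the form below, where the penalty argument plays the role of λ (glmnet scales the loss term by 1/(2n), which only rescales λ). Because the penalty uses absolute values, it can shrink coefficients exactly to zero, so the lasso also performs variable selection.

$$\hat{\beta}^{\text{lasso}} = \underset{\beta}{\arg\min}\left\{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert\right\}$$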

lasso_spec <- linear_reg(mode = "regression",
                        penalty = tune(),
                        mixture = 1) %>% 
  set_engine("glmnet")

lasso_spec
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = 1
## 
## Computational engine: glmnet

3.5 Define the workflows

Tidymodels makes use of workflows, which act as containers that bundle the recipe and the model to use.

In this case we will define two workflows: one for the plain linear regression and one for the regularized one.

### Define Workflow
lm_workflow <- workflow() %>%
  add_recipe(life_recipe) %>%
  add_model(lm_spec)

lm_workflow
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: linear_reg()
## 
## -- Preprocessor ----------------------------------------------------------------
## 4 Recipe Steps
## 
## * step_impute_knn()
## * step_dummy()
## * step_zv()
## * step_normalize()
## 
## -- Model -----------------------------------------------------------------------
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
### Define Workflow
lasso_workflow <- workflow() %>%
  add_recipe(life_recipe) %>%
  add_model(lasso_spec)

lasso_workflow
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: linear_reg()
## 
## -- Preprocessor ----------------------------------------------------------------
## 4 Recipe Steps
## 
## * step_impute_knn()
## * step_dummy()
## * step_zv()
## * step_normalize()
## 
## -- Model -----------------------------------------------------------------------
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = 1
## 
## Computational engine: glmnet

3.6 Let’s fit the models

lm_fit <- lm_workflow %>%
  fit_resamples(life_folds)
lm_fit %>%
  collect_metrics()

Results are an RMSE of 0.417 and an R-squared of 0.827. Not bad. Note that because our recipe normalizes all numeric columns, including the outcome, this RMSE is expressed in standard-deviation units rather than years.
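A rough back-of-the-envelope conversion to years (a sketch, assuming the recipe standardized the outcome with the training-set mean and sd):

# Rescale the cross-validated RMSE by the sd of the outcome to get years.
0.417 * sd(life_train$Life.expectancy)

But let’s see if we can do better: next we test a regression with some penalty for regularization.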

3.7 Tuning the penalty parameter

Tidymodels has a feature that allows us to test many hyperparameter values and then choose the model that gives us the best performance in training.

### Tune regression parameters
penalty_grid <- grid_regular(penalty(), levels = 20)
# life_boot <- bootstraps(life_train)
set.seed(888)

doParallel::registerDoParallel()

regression_grid <- tune_grid(
  lasso_workflow,
  resamples = life_folds,
  grid = penalty_grid,
  control = control_grid(verbose = FALSE, save_pred = TRUE),
            metrics = metric_set(rmse)
)
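As a quick sanity check, grid_regular(penalty(), levels = 20) spaces its candidates on a log10 scale over the default range, so the values run from 1e-10 up to 1:

# Inspect the 20 candidate penalty values (log10-spaced).
penalty_grid
range(penalty_grid$penalty)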

We will select the hyperparameter value that gives us the best RMSE.

best_rmse <- regression_grid %>% select_best("rmse")
best_rmse

The result is a very small penalty of 1e-10, which is essentially zero. This means regularization will not help us get better performance here.
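To see this visually, we can plot the cross-validated RMSE against the candidate penalties (a quick check; performance should stay flat until the penalty grows large enough to shrink useful coefficients):

# RMSE as a function of the penalty value.
autoplot(regression_grid)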

Let’s continue and evaluate the model on the test data.

final_regression <- finalize_workflow(lasso_workflow, best_rmse) 

final_regression
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: linear_reg()
## 
## -- Preprocessor ----------------------------------------------------------------
## 4 Recipe Steps
## 
## * step_impute_knn()
## * step_dummy()
## * step_zv()
## * step_normalize()
## 
## -- Model -----------------------------------------------------------------------
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = 1e-10
##   mixture = 1
## 
## Computational engine: glmnet
lasso_final <- last_fit(final_regression, life_split)

lasso_final %>%
  collect_metrics()
lm_final <- last_fit(lm_workflow, split = life_split)

lm_final %>%
  collect_metrics()

As expected, both models performed almost identically, with the same R-squared and almost the same RMSE.

4 Compare Multiple Models with a Workflow Set

We will make use of the workflow sets feature in tidymodels, which allows us to put multiple models and recipes in a list and process them together, so we can compare them and select the best model.

We will test the following 3 algorithms:

  1. Linear Regression
  2. Random Forest
  3. XGBoost Regression

combined with our TWO recipes. This gives us 6 model-recipe combinations. We will choose the best one in training and see how it performs on the unseen test data.

4.1 Define the model specs for Random Forest and XGBoost

As before, we will test several hyperparameter settings, making use of the tidymodels tune() function.

# Random Forest
randomf_spec <- rand_forest(
    mtry = tune(),
    trees = tune(),
    min_n = tune()
    ) %>%
  set_mode("regression") %>%
  set_engine("ranger")

# XGBoost
xgbboost_spec <- boost_tree(
  trees = tune(),
  mtry = tune(),
  tree_depth = tune(),
  learn_rate = .01
  ) %>%
  set_mode("regression") %>% 
  set_engine("xgboost",importance = TRUE)

4.2 Define the Workflow Set

Just as before we will define a workflow, but in this case it will be a set of model specs and recipes.

workflow_set <-workflow_set(
  preproc = list(life_recipe, life_recipe_nc),
  models = list(lm_spec, randomf_spec, xgbboost_spec),
  cross = TRUE
  )

workflow_set

You can see we have 6 model-recipe combinations, one for each pairing of the 3 model specs with the 2 recipes.

Let’s fit our models, recipes, and hyperparameters. This will take a long time; on my computer it took 15 minutes. If you want to run it, just change RUN to TRUE.

For this project I ran it once and saved the fitted object, so it can simply be reloaded without re-running the fitting process.

# Took 15 minutes on an i7 computer
RUN <- FALSE
if (RUN) {
  doParallel::registerDoParallel()

  # Tune all 6 workflows over 20-point grids using the same CV folds
  fit_workflows <- workflow_set %>%
    workflow_map(
      seed = 888,
      fn = "tune_grid",
      grid = 20,
      resamples = life_folds
    )

  doParallel::stopImplicitCluster()
}

If you didn’t run the fitting, you can always reload the model fit I ran and saved to disk. I will make the file available on GitHub.

if (RUN) {
  saved_life_modelset <- fit_workflows
  saveRDS(saved_life_modelset, "saved_life_modelset2.rds")
}

if (!RUN) {
  fit_workflows <- readRDS("saved_life_modelset2.rds")
}
fit_workflows

4.3 Evaluate model fits

Let’s see how our 6 models performed and choose a winner.

autoplot(fit_workflows)

collect_metrics(fit_workflows)
rank_results(fit_workflows, rank_metric = "rmse", select_best = TRUE)

We can see in the plots that the XGBoost models performed better than random forest and linear regression, providing both a better RMSE and a better R-squared. Random forest came second and linear regression last, except for a couple of runs where XGBoost had the worst performance of all the models.

4.4 Extract the best workflow

We will select the best workflow based on RMSE. The clear winner is XGBoost with recipe 1, which includes the countries.

metric <- "rmse"

best_workflow_id <- fit_workflows %>% 
  rank_results(
    rank_metric = metric,
    select_best = TRUE
  ) %>% 
  slice(1) %>% 
  pull(wflow_id)

workflow_best <- extract_workflow(fit_workflows, id = best_workflow_id)

best_workflow_id
## [1] "recipe_1_boost_tree"

4.5 Extract tuning results from workflowset

We know our best model in training; now let’s extract the tuning parameters used to generate it.

# Pull the tuning results of the winning workflow out of the workflow set
# (equivalently: extract_workflow_set_result(fit_workflows, best_workflow_id))
workflow_best_tuned <- fit_workflows[fit_workflows$wflow_id == best_workflow_id, "result"][[1]][[1]]

workflow_best_tuned
collect_metrics(workflow_best_tuned)
autoplot(workflow_best_tuned)

select_best(workflow_best_tuned, "rmse")

You can see the hyperparameters: mtry = 12, trees = 1959, tree_depth = 8.

4.6 Fit the final model

Now let’s test with the unseen test data.

workflow_best_final <- finalize_workflow(workflow_best, select_best(workflow_best_tuned, "rmse"))

doParallel::registerDoParallel()

workflow_best_final_fit <- workflow_best_final %>% 
  last_fit(
    split = life_split
  )

doParallel::stopImplicitCluster()

workflow_best_final_fit
workflow_best_final_fit %>% 
  collect_metrics()

The results are not bad at all: an RMSE of 0.187 and an R-squared of 0.967 (again in standardized units), much better than the linear regression models we ran before.

4.6.1 Some visualizations of the winning model

fit_test <- workflow_best_final_fit %>% 
  collect_predictions()

fit_test
fit_test %>%
  ggplot(aes(x=Life.expectancy, y=.pred)) +
  geom_abline(slope=1,intercept=0) +
  geom_point(alpha=0.3)

We can see that the predicted points lie very close to the 45-degree diagonal, telling us this is a good model for prediction.

Which variables were the most important in our selected model?

By using the vip package we can plot the top 10 variables by their importance to the prediction.

Income.composition.of.resources was the most important, followed by HIV.AIDS, Adult.Mortality, and Schooling. The remaining six are of lesser importance.

library(vip)
extract_workflow(workflow_best_final_fit) %>%
  extract_fit_parsnip() %>%
  vip(geom = "col")

boxplot_data <- fit_test %>%
  mutate(difference = .pred - Life.expectancy)
boxplot_data$difference %>%
  boxplot(main = "Prediction Errors of the Selected Model",
          ylab = "Predicted minus Actual",
          horizontal = FALSE)

5 Thank you!

As a bonus, here is a quick dumbbell plot of the change in life expectancy from 1952 to 2007 in the Americas, built with the gapminder dataset.

gapminder <- read_csv("https://raw.githubusercontent.com/datavizpyr/data/master/gapminder-FiveYearData.csv")
head(gapminder)
# Keep the first and last survey years, for countries in the Americas

df <- gapminder %>%
  filter(year %in% c(1952,2007)) %>%
  filter(continent=="Americas")

# Pair the two rows of each country so we can connect them with a line
df <- df %>%
  mutate(paired = rep(1:(n()/2),each=2),
         year=factor(year))

# Dumbbell plot: life expectancy per country, 1952 vs 2007

df %>% 
  ggplot(aes(x= lifeExp, y= reorder(country,lifeExp))) +
  geom_line(aes(group = paired))+
    geom_point(aes(color=year), size=4) +
  labs(y="country")

# Same plot, polished: grey connectors, larger theme, legend on top

df %>% 
  ggplot(aes(x= lifeExp, y= reorder(country,lifeExp))) +
  geom_line(aes(group = paired),color="grey")+
    geom_point(aes(color=year), size=4) +
  labs(y="country")+
  theme_classic(20)+
  theme(legend.position="top") +
  scale_color_brewer(palette="Accent", direction=-1)