The World Health Organization (WHO) annually collects data for all countries on indicators (metrics) it believes are important factors for human development. The purpose of this analysis is to determine, through a linear regression model, which indicators are the best predictors of life expectancy.
This project will use regression models to derive a prediction model. We will build and test several regression algorithms: linear regression, lasso-regularized regression, random forest, and XGBoost.
We will do some basic data cleanup and some exploratory data analysis (EDA).
Let’s load some of the packages we will be using for this analysis.
library(tidyverse)   # data wrangling and ggplot2
library(tidymodels)  # modelling framework: recipes, workflows, tuning
library(skimr)       # quick data summaries
library(reshape)     # melt() for the correlation heatmap
# library(mlbench)
rm(list = ls())
getwd()
## [1] "C:/Users/jrfal/OneDrive/Documents/Master and Doctor/CUNY MSDS/Courses/Data_606_ProbStats/R_folder/Project"
theme_set(theme_light())
#setwd("./Project")
We set up a Linux server on AWS for this project and stored the life expectancy .CSV file on the server, so you can easily download it as well.
life_data <- read.csv("http://3.86.40.38/life_expectancy_data.csv")
head(life_data)
life_data %>%
summarize(across(.cols = everything(),
~sum(is.na(.x))))
I see we have 10 records where the response variable is NA; we will remove those 10 records. The predictors also have plenty of NA’s. These can be imputed with the mean, median, or mode, or even with an ML algorithm like KNN; we will impute them later with a recipe.
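As an illustration only (a minimal sketch; sketch_recipe is a hypothetical name, not used elsewhere in this analysis), simple mean/median/mode imputation can be expressed as recipe steps:
# Sketch: median imputation for numeric columns and mode imputation
# for categorical ones; we will use step_impute_knn() instead later.
sketch_recipe <- recipe(Life.expectancy ~ ., data = life_data) %>%
  step_impute_median(all_numeric(), -all_outcomes()) %>%
  step_impute_mode(all_nominal())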
life_data <- life_data %>%
filter(!is.na(Life.expectancy))
sum(is.na(life_data))
## [1] 2513
Let’s convert Country to a factor for prediction purposes.
life_data$Country <- as_factor(life_data$Country)
Let’s take a look at our data and some basic stats, starting with the variable we want to predict, Life.expectancy.
par(mfrow= c(1,1))
boxplot(life_data$Life.expectancy,
main = "Life Expectancy Globally",
ylab = "Life Expectancy Years")
There is a very wide range of values, which will make the prediction more challenging.
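To quantify that spread, here is a quick numeric summary (base R only):
# Five-number summary plus mean of the response variable
summary(life_data$Life.expectancy)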
Let’s take a look at our predictor variables.
par(mfrow = c(2,3))
invisible(lapply(4:9, function(i) boxplot(life_data[, i],
main = colnames(life_data)[i])))
par(mfrow = c(2,3))
invisible(lapply(10:15, function(i) boxplot(life_data[, i],
main = colnames(life_data)[i])))
par(mfrow = c(2,3))
invisible(lapply(16:21, function(i) boxplot(life_data[, i],
main = colnames(life_data)[i])))
Now it is important to check the correlation between variables. Let’s plot a correlation heatmap.
life_cormat <- round(cor(life_data[,4:22],use = "complete.obs"),2)
melted_cormat <- melt(life_cormat)
ggplot(data = melted_cormat, aes(x=X1, y=X2, fill=value)) +
geom_tile() +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 8, hjust = 1))+
coord_fixed()
In case the heatmap is not clear enough, we also plot a correlation matrix, which is less visually appealing but gives us the exact values.
par(mfrow=c(1,1))
library(corrplot)
# png(file="corr.png", res=300, width=1200, height=1200)
corrplot(life_cormat, method="number",
tl.cex = 0.3,
number.cex = 0.3,
cl.cex = 0.3)
# dev.off()
One more matrix, just for fun.
library(GGally)
life_data %>%
ggscatmat(columns = 3:12, color="Status",alpha = 0.7)
Let’s plot the top 50 countries by average life expectancy. As expected, most of them are among the richest countries in the world.
life_data %>%
group_by(Country) %>%
summarise(avg_life = mean(Life.expectancy)) %>%
arrange(desc(avg_life)) %>%
na.omit() %>%
slice(1:50) %>%
ggplot(aes(x = reorder(Country, avg_life),avg_life)) +
geom_bar(stat="identity") +
coord_flip()
Now we will plot the bottom 50 countries. As expected, we find here some of the poorest countries in the world.
life_data %>%
group_by(Country) %>%
summarise(avg_life = mean(Life.expectancy)) %>%
arrange(desc(avg_life)) %>%
na.omit() %>%
slice(134:183) %>%
ggplot(aes(x = reorder(Country, -avg_life),avg_life)) +
geom_bar(stat="identity") +
coord_flip()
We will make use of the Tidymodels framework.
set.seed(888)
life_split <- initial_split(life_data)
life_train <- training(life_split)
life_test <- testing(life_split)
#Training set
life_train %>%
summarize(no_rows = n())
#Testing set
life_test %>%
summarize(no_rows = n())
The function divided our data set into a training set with 2,196 records and a test set with 732 records. All modelling will be done only with the training set; the final review of performance will be done with the unseen test set.
We will make use of cross-validation. This technique splits the training data into folds: in each round one fold is held out for validation while the remaining folds are used for fitting, and the process repeats until every fold has served as the validation set. Here we define 10 folds, which means each model will be trained and validated 10 times.
## Cross Validation Split of Training Data
set.seed(888)
life_folds <- vfold_cv(
data = life_train,
v = 10
)
life_folds
Tidymodels has the construct of recipes: pre-defined preprocessing steps which you can apply to your data. Recipes allow you to do the feature engineering needed to get the data ready for our model.
For our project we will define two recipes. Both normalize the data, but one includes Country (one-hot encoded as dummy variables) in the model, while the second does not.
life_recipe <- recipe(Life.expectancy ~ ., data = life_train) %>%
step_impute_knn(all_predictors(), neighbors = 10) %>%
# step_dummy(Country,one_hot = TRUE) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_zv(all_numeric()) %>%
step_normalize(all_numeric()) %>%
prep()
life_recipe
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 21
##
## Training data contained 2196 data points and 977 incomplete rows.
##
## Operations:
##
## K-nearest neighbor imputation for Year, Status, Adult.Mortality, infant.deaths, ... [trained]
## Dummy variables from Country, Status [trained]
## Zero variance filter removed <none> [trained]
## Centering and scaling for Year, Adult.Mortality, infant.deaths, Alcohol, ... [trained]
We will build another recipe, called life_recipe_nc for “no country”: we want to try excluding Country from the predictors, for several reasons I will explain later.
life_recipe_nc <- recipe(Life.expectancy ~ ., data = life_train) %>%
step_impute_knn(all_predictors(), neighbors = 10) %>%
update_role(Country, new_role = "id") %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_zv(all_numeric()) %>%
step_normalize(all_numeric()) %>%
prep()
life_recipe_nc
## Recipe
##
## Inputs:
##
## role #variables
## id 1
## outcome 1
## predictor 20
##
## Training data contained 2196 data points and 977 incomplete rows.
##
## Operations:
##
## K-nearest neighbor imputation for Status, Adult.Mortality, infant.deaths, Alcoho... [trained]
## Dummy variables from Country, Status [trained]
## Zero variance filter removed <none> [trained]
## Centering and scaling for Year, Adult.Mortality, infant.deaths, Alcohol, ... [trained]
Now we will define the second part of tidymodels, which is the model specification. In this case it will be a linear regression.
lm_spec <- linear_reg() %>%
set_engine('lm') %>% # adds lm implementation of linear regression
set_mode('regression')
# View object properties
lm_spec
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
For part 1 of the project we will compare the simple linear regression to a regularized regression using a lasso penalty. Our goal is to check whether some regularization benefits the linear regression.
lasso_spec <- linear_reg(mode = "regression",
penalty = tune(),
mixture = 1) %>%
set_engine("glmnet")
lasso_spec
## Linear Regression Model Specification (regression)
##
## Main Arguments:
## penalty = tune()
## mixture = 1
##
## Computational engine: glmnet
Tidymodels makes use of workflows, which act as structures that bundle the recipe and the model to use.
In this case we will define two workflows: one for the plain linear regression and one for the regularized regression.
### Define Workflow
lm_workflow <- workflow() %>%
add_recipe(life_recipe) %>%
add_model(lm_spec)
lm_workflow
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: linear_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 4 Recipe Steps
##
## * step_impute_knn()
## * step_dummy()
## * step_zv()
## * step_normalize()
##
## -- Model -----------------------------------------------------------------------
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
### Define Workflow
lasso_workflow <- workflow() %>%
add_recipe(life_recipe) %>%
add_model(lasso_spec)
lasso_workflow
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: linear_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 4 Recipe Steps
##
## * step_impute_knn()
## * step_dummy()
## * step_zv()
## * step_normalize()
##
## -- Model -----------------------------------------------------------------------
## Linear Regression Model Specification (regression)
##
## Main Arguments:
## penalty = tune()
## mixture = 1
##
## Computational engine: glmnet
lm_fit <- lm_workflow %>%
fit_resamples(life_folds)
lm_fit %>%
collect_metrics()
Results are an RMSE of 0.417 and an R-squared of 0.827. Not too bad, but let’s see if we can do better. (Note that because the recipe normalizes all numeric columns, including the outcome, these metrics are on the standardized scale rather than in years.) Let’s run the regression again, this time testing a regression with some penalty for regularization.
Tidymodels has a feature which allows us to test many hyperparameter values; we can then choose the model which gives us the best performance in training.
### Tune regression parameters
penalty_grid <- grid_regular(penalty(), levels = 20)
# life_boot <- bootstraps(life_train)
set.seed(888)
doParallel::registerDoParallel()
regression_grid <- tune_grid(
lasso_workflow,
resamples = life_folds,
grid = penalty_grid,
control = control_grid(verbose = FALSE, save_pred = TRUE),
metrics = metric_set(rmse)
)
We will select the hyperparameter value which gave us the best RMSE.
best_rmse <- regression_grid %>% select_best("rmse")
best_rmse
The result is a very small penalty of 1e-10, which is essentially zero. This means regularization will not help us get better performance.
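As a quick visual check (a sketch using the regression_grid object from above), we can plot RMSE across the penalty grid; the curve stays essentially flat until the penalty becomes large enough to hurt performance:
# Mean cross-validated RMSE as a function of the lasso penalty
autoplot(regression_grid)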
Let’s continue and test the model with the test data.
final_regression <- finalize_workflow(lasso_workflow, best_rmse)
final_regression
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: linear_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 4 Recipe Steps
##
## * step_impute_knn()
## * step_dummy()
## * step_zv()
## * step_normalize()
##
## -- Model -----------------------------------------------------------------------
## Linear Regression Model Specification (regression)
##
## Main Arguments:
## penalty = 1e-10
## mixture = 1
##
## Computational engine: glmnet
lasso_final <- last_fit(final_regression, life_split)
lasso_final %>%
collect_metrics()
lm_final <- last_fit(lm_workflow, split = life_split)
lm_final %>%
collect_metrics()
As expected, both models performed almost identically, with the same R-squared and almost the same RMSE.
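To see this side by side, a small sketch using the lasso_final and lm_final objects from above:
# Stack the test-set metrics of both models for a direct comparison
bind_rows(
  lasso_final %>% collect_metrics() %>% mutate(model = "lasso"),
  lm_final %>% collect_metrics() %>% mutate(model = "lm")
)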
Next, we will make use of a feature in tidymodels which allows us to put multiple models and recipes in a list and process them together, so we can compare them and select the best model.
We will test three algorithms (linear regression, random forest, and XGBoost) paired with the two recipes. This gives us six model-recipe combinations. We will choose the best one in training and see how it performs with the unseen test data.
As before, we will also test several hyperparameter settings, making use of the tidymodels tune() function.
# Random Forest
randomf_spec <- rand_forest(
mtry = tune(),
trees = tune(),
min_n = tune()
) %>%
set_mode("regression") %>%
set_engine("ranger")
# XGBoost
xgbboost_spec <- boost_tree(
trees = tune(),
mtry = tune(),
tree_depth = tune(),
learn_rate = .01
) %>%
set_mode("regression") %>%
set_engine("xgboost",importance = TRUE)
Just as before, we will define a workflow, but in this case it will be a set of model specs and recipes.
workflow_set <- workflow_set(
preproc = list(life_recipe, life_recipe_nc),
models = list(lm_spec, randomf_spec, xgbboost_spec),
cross = TRUE
)
workflow_set
You can see we have six model-recipe combinations, one for each pairing of the three model specs with the two recipes.
Let’s fit our models, recipes, and hyperparameters. This will take a long time; on my i7 computer it took 15 minutes. If you want to run it, just change RUN = TRUE.
For this project I ran it once and saved the fitted models, so they can be reloaded without re-running the fitting process.
# Took 15 minutes on i7 computer
RUN = FALSE
if (RUN) {
doParallel::registerDoParallel()
fit_workflows <- workflow_set %>%
workflow_map(
seed = 888,
fn = "tune_grid",
grid = 20,
resamples = life_folds
)
doParallel::stopImplicitCluster()
}
If you didn’t run the fitting, you can always reload the model fit I ran and saved to disk. I will make the file available on GitHub.
if (RUN) {
saved_life_modelset <- fit_workflows
saveRDS(saved_life_modelset, "saved_life_modelset2.rds")
}
########
if (!RUN) {
fit_workflows <- readRDS("saved_life_modelset2.rds")
}
fit_workflows
Let’s see how our six model-recipe combinations performed and choose a winner.
autoplot(fit_workflows)
collect_metrics(fit_workflows)
rank_results(fit_workflows, rank_metric = "rmse", select_best = TRUE)
We can see in the plots that the XGBoost model performed better than random forest and linear regression, providing both a lower RMSE and a higher R-squared. Random forest came second and linear regression last, except for a couple of runs where XGBoost had the worst performance of all models.
We will select the best model based on RMSE. The clear winner is XGBoost with recipe 1, which includes Country.
metric <- "rmse"
best_workflow_id <- fit_workflows %>%
rank_results(
rank_metric = metric,
select_best = TRUE
) %>%
slice(1) %>%
pull(wflow_id)
workflow_best <- extract_workflow(fit_workflows, id = best_workflow_id)
best_workflow_id
## [1] "recipe_1_boost_tree"
We know our best model in training; now let’s extract the tuning parameters used to generate it.
workflow_best_tuned <- extract_workflow_set_result(fit_workflows, id = best_workflow_id)
workflow_best_tuned
collect_metrics(workflow_best_tuned)
autoplot(workflow_best_tuned)
select_best(workflow_best_tuned, "rmse")
You can see the hyperparameters: mtry = 12, trees = 1959, tree_depth = 8.
Now let’s test it with the unseen test data.
workflow_best_final <- finalize_workflow(workflow_best, select_best(workflow_best_tuned, "rmse"))
doParallel::registerDoParallel()
workflow_best_final_fit <- workflow_best_final %>%
last_fit(
split = life_split
)
doParallel::stopImplicitCluster()
workflow_best_final_fit
workflow_best_final_fit %>%
collect_metrics()
The results are not bad at all: an RMSE of 0.187 and an R-squared of 0.967, much better than the linear regression models we ran before.
fit_test <- workflow_best_final_fit %>%
collect_predictions()
fit_test
fit_test %>%
ggplot(aes(x=Life.expectancy, y=.pred)) +
geom_abline(slope=1,intercept=0) +
geom_point(alpha=0.3)
We can see that the predicted points are very close to the 45-degree diagonal, telling us this is a good model for prediction.
Which variables were the most important in our selected model?
Using the vip package, we can plot the top 10 variables by prediction importance.
Income.composition.of.resources was the most important, followed by HIV.AIDS, Adult.Mortality, and Schooling. The remaining six are of lesser importance.
library(vip)
extract_workflow(workflow_best_final_fit) %>%
extract_fit_parsnip() %>%
vip(geom = "col")
boxplot_data <- fit_test %>%
mutate(difference = .pred - Life.expectancy)
boxplot_data$difference %>%
  boxplot(main = "Prediction Errors of the Selected Model",
          ylab = "Predicted minus Actual",
          horizontal = FALSE)
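For a numeric companion to the boxplot, a quick summary of the errors (a sketch using the boxplot_data object defined above):
# Prediction errors (predicted minus actual) on the normalized scale
summary(boxplot_data$difference)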
To close, let’s use the Gapminder data set to compare life expectancy in the Americas between 1952 and 2007 with a dumbbell-style plot.
gapminder <- read_csv("https://raw.githubusercontent.com/datavizpyr/data/master/gapminder-FiveYearData.csv")
head(gapminder)
# Keep only the Americas in 1952 and 2007
df <- gapminder %>%
filter(year %in% c(1952,2007)) %>%
filter(continent=="Americas")
# Pair each country's two observations so they can be joined by a line
df <- df %>%
mutate(paired = rep(1:(n()/2),each=2),
year=factor(year))
# Basic dumbbell plot: one point per year, connected per country
df %>%
ggplot(aes(x= lifeExp, y= reorder(country,lifeExp))) +
geom_line(aes(group = paired))+
geom_point(aes(color=year), size=4) +
labs(y="country")
# The same plot with refined styling
df %>%
ggplot(aes(x= lifeExp, y= reorder(country,lifeExp))) +
geom_line(aes(group = paired),color="grey")+
geom_point(aes(color=year), size=4) +
labs(y="country")+
theme_classic(20)+
theme(legend.position="top") +
scale_color_brewer(palette="Accent", direction=-1)