Ensembling is a common and powerful technique for improving accuracy and overall performance on classification problems. Given a binary outcome, we want to classify whether an event will occur based on a set of quantitative and qualitative variables. Many algorithms can do the trick; in this blog post, we combine a set of classification algorithms to strengthen our predictions, using a public dataset to classify loan status from various socio-demographic variables.
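To make the idea concrete before we dive in, here is a toy sketch of the simplest form of ensembling, a majority vote across three classifiers. The predictions below are made up for illustration and have nothing to do with the dataset we use later.
# toy illustration - three classifiers each predict four cases; the majority label wins
pred1 <- c("Y", "N", "Y", "N")
pred2 <- c("Y", "Y", "Y", "N")
pred3 <- c("N", "Y", "Y", "Y")
votes <- data.frame(pred1, pred2, pred3)
apply(votes, 1, function(row) names(which.max(table(row))))
## [1] "Y" "Y" "Y" "N"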
# load packages
if(!require(pacman)){install.packages("pacman"); require(pacman)}
## Loading required package: pacman
## Warning: package 'pacman' was built under R version 3.6.2
packages <- c('tidyverse', 'caret', 'broom', 'Hmisc', 'kableExtra', 'janitor')
pacman::p_load(char = packages)
# read data
df <- read.csv(url('https://datahack-prod.s3.ap-south-1.amazonaws.com/train_file/train_u6lujuX_CVtuZ9i.csv'))
# check missing
tempDf <- data.frame(fields = names(df),
missing = colSums(is.na(df)))
tempDf
## fields missing
## Loan_ID Loan_ID 0
## Gender Gender 0
## Married Married 0
## Dependents Dependents 0
## Education Education 0
## Self_Employed Self_Employed 0
## ApplicantIncome ApplicantIncome 0
## CoapplicantIncome CoapplicantIncome 0
## LoanAmount LoanAmount 22
## Loan_Amount_Term Loan_Amount_Term 14
## Credit_History Credit_History 50
## Property_Area Property_Area 0
## Loan_Status Loan_Status 0
# impute missing values, then center and scale the numeric variables
preProcValues <- preProcess(df, method = c("medianImpute", "center", "scale"))
dfProcessed <- predict(preProcValues, df)
# check again
tempDf2 <- data.frame(fields = names(dfProcessed),
missing = colSums(is.na(dfProcessed)))
tempDf2
## fields missing
## Loan_ID Loan_ID 0
## Gender Gender 0
## Married Married 0
## Dependents Dependents 0
## Education Education 0
## Self_Employed Self_Employed 0
## ApplicantIncome ApplicantIncome 0
## CoapplicantIncome CoapplicantIncome 0
## LoanAmount LoanAmount 0
## Loan_Amount_Term Loan_Amount_Term 0
## Credit_History Credit_History 0
## Property_Area Property_Area 0
## Loan_Status Loan_Status 0
# clean names
dfProcessed <- janitor::clean_names(dfProcessed)
names(dfProcessed)
## [1] "loan_id" "gender" "married"
## [4] "dependents" "education" "self_employed"
## [7] "applicant_income" "coapplicant_income" "loan_amount"
## [10] "loan_amount_term" "credit_history" "property_area"
## [13] "loan_status"
# check distribution
broom::glance(dfProcessed)
## Warning: 'glance.data.frame' is deprecated.
## See help("Deprecated")
## # A tibble: 1 x 4
## nrow ncol complete.obs na.fraction
## <int> <int> <int> <dbl>
## 1 614 13 614 0
table(dfProcessed$loan_status)
##
## N Y
## 192 422
prop.table(ftable(dfProcessed$loan_status)) %>% round(., 1)
## N Y
##
## 0.3 0.7
# head df
head(dfProcessed) %>% kable(.)
| loan_id | gender | married | dependents | education | self_employed | applicant_income | coapplicant_income | loan_amount | loan_amount_term | credit_history | property_area | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LP001002 | Male | No | 0 | Graduate | No | 0.0729314 | -0.5540356 | -0.2151272 | 0.276411 | 0.4324768 | Urban | Y |
| LP001003 | Male | Yes | 1 | Graduate | No | -0.1343025 | -0.0387000 | -0.2151272 | 0.276411 | 0.4324768 | Rural | N |
| LP001005 | Male | Yes | 0 | Graduate | Yes | -0.3934266 | -0.5540356 | -0.9395335 | 0.276411 | 0.4324768 | Urban | Y |
| LP001006 | Male | Yes | 0 | Not Graduate | No | -0.4616860 | 0.2517743 | -0.3085990 | 0.276411 | 0.4324768 | Urban | Y |
| LP001008 | Male | No | 0 | Graduate | No | 0.0976488 | -0.5540356 | -0.0632356 | 0.276411 | 0.4324768 | Urban | Y |
| LP001011 | Male | Yes | 2 | Graduate | Yes | 0.0022165 | 0.8798823 | 1.4089450 | 0.276411 | 0.4324768 | Urban | Y |
# str(), dim()
str(dfProcessed)
## 'data.frame': 614 obs. of 13 variables:
## $ loan_id : Factor w/ 614 levels "LP001002","LP001003",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
## $ married : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
## $ dependents : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
## $ education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
## $ self_employed : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
## $ applicant_income : num 0.0729 -0.1343 -0.3934 -0.4617 0.0976 ...
## $ coapplicant_income: num -0.554 -0.0387 -0.554 0.2518 -0.554 ...
## $ loan_amount : num -0.2151 -0.2151 -0.9395 -0.3086 -0.0632 ...
## $ loan_amount_term : num 0.276 0.276 0.276 0.276 0.276 ...
## $ credit_history : num 0.432 0.432 0.432 0.432 0.432 ...
## $ property_area : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
## $ loan_status : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
dim(dfProcessed)
## [1] 614 13
# relabel the "loan_status" factor levels
levels(dfProcessed$loan_status)
## [1] "N" "Y"
dfProcessed <- dfProcessed %>%
dplyr::mutate(loan_status = loan_status %>%
factor(., levels = c("Y", "N"), labels = c("Yes", "No")))
levels(dfProcessed$loan_status)
## [1] "Yes" "No"
Let’s split the data and set up 10-fold cross-validation; besides giving a more reliable accuracy estimate, it produces the out-of-fold predictions we will need for stacking later.
# set seed
set.seed(1234)
# split data
index <- caret::createDataPartition(dfProcessed$loan_status, p = 0.7, list = FALSE)
trainSet <- dfProcessed[index, ]
testSet <- dfProcessed[-index, ]
# define the training control - let's do 10-fold cross-validation
train.control <- caret::trainControl(
method = "cv",
number = 10,
savePredictions = 'final',
classProbs = TRUE)
# define IVs, DV
IV <- names(dfProcessed)[names(dfProcessed) %nin% c("loan_id", "loan_status")]; IV
## [1] "gender" "married" "dependents"
## [4] "education" "self_employed" "applicant_income"
## [7] "coapplicant_income" "loan_amount" "loan_amount_term"
## [10] "credit_history" "property_area"
DV <- "loan_status"
We will now ensemble our models: glm (generalized linear model), nb (naive Bayes) and rf (random forest) form the base layer, and we stack on top of it by training a top layer of glm, nb and gbm (gradient boosting machine) models on the base layer's out-of-fold predictions. We will evaluate the result of each model.
# set seed
set.seed(1234)
# methods
baseTrainMethods <- c("glm", "nb", "rf")
topTrainMethods <- c("glm", "nb", "gbm")
# output
modelSummaryList <- vector(mode = "list")
# train base layer
for(baseLayer in baseTrainMethods){
  # set parameters
  ml = baseLayer
  model = paste0("model_base_", ml)
  OOF_prediction = paste0("OOF_pred_", ml)
  prediction = paste0("pred_", ml)
  result = paste0("result_", ml)
  # model
  assign(model, caret::train(trainSet[, IV], trainSet[, DV],
                             method = ml,
                             trControl = train.control,
                             trace = FALSE))
  # Out-Of-Fold probability predictions - trainSet
  # (the probability column is named after the positive factor level, "Yes", since we relabeled above)
  if(ml == "glm"){trainSet$OOF_pred_glm = get(model)$pred$Yes[order(get(model)$pred$rowIndex)]}
  if(ml == "nb"){trainSet$OOF_pred_nb = get(model)$pred$Yes[order(get(model)$pred$rowIndex)]}
  if(ml == "rf"){trainSet$OOF_pred_rf = get(model)$pred$Yes[order(get(model)$pred$rowIndex)]}
  # probability predictions - testSet (kept under the OOF_pred_* names for consistency with trainSet)
  assign(OOF_prediction, predict(get(model), testSet[, IV], type = "prob")$Yes)
  if(ml == "glm"){testSet$OOF_pred_glm = get(OOF_prediction)}
  if(ml == "nb"){testSet$OOF_pred_nb = get(OOF_prediction)}
  if(ml == "rf"){testSet$OOF_pred_rf = get(OOF_prediction)}
  # Yes/No class predictions for the confusion matrix - testSet
  assign(prediction, predict(get(model), testSet[, IV]))
  if(ml == "glm"){testSet$pred_glm = get(prediction)}
  if(ml == "nb"){testSet$pred_nb = get(prediction)}
  if(ml == "rf"){testSet$pred_rf = get(prediction)}
  # output
  assign(result, broom::tidy(caret::confusionMatrix(testSet[, prediction], testSet[, DV])) %>%
           dplyr::mutate(trainMethod = ml) %>%
           dplyr::select(trainMethod, everything()))
  # store output into a list (a plain `<-` suffices; for loops run in the current environment)
  tempModelList <- list(get(result))
  modelSummaryList <- c(modelSummaryList, tempModelList)
}
# train top layer
for(topLayer in topTrainMethods){
  # set parameters
  ml = topLayer
  model = paste0("model_top_", ml)
  OOF_predictors_top = c("OOF_pred_glm", "OOF_pred_nb", "OOF_pred_rf")
  OOF_prediction_top = paste0("OOF_pred_top_", ml)
  prediction_top = paste0("pred_top_", ml)
  result = paste0("result_top_", ml)
  # model - trained on the base layer's out-of-fold predictions
  assign(model, caret::train(trainSet[, OOF_predictors_top], trainSet[, DV],
                             method = ml,
                             trControl = train.control))
  # probability predictions - testSet
  assign(OOF_prediction_top, predict(get(model), testSet[, OOF_predictors_top], type = "prob")$Yes)
  if(ml == "glm"){testSet$OOF_pred_top_glm = get(OOF_prediction_top)}
  if(ml == "nb"){testSet$OOF_pred_top_nb = get(OOF_prediction_top)}
  if(ml == "gbm"){testSet$OOF_pred_top_gbm = get(OOF_prediction_top)}
  # Yes/No class predictions for the confusion matrix - testSet
  assign(prediction_top, predict(get(model), testSet[, OOF_predictors_top]))
  if(ml == "glm"){testSet$pred_top_glm = get(prediction_top)}
  if(ml == "nb"){testSet$pred_top_nb = get(prediction_top)}
  if(ml == "gbm"){testSet$pred_top_gbm = get(prediction_top)}
  # output
  assign(result, broom::tidy(caret::confusionMatrix(testSet[, prediction_top], testSet[, DV])) %>%
           dplyr::mutate(trainMethod = paste0(ml, " - top layer")) %>%
           dplyr::select(trainMethod, everything()))
  # store output into a list
  tempModelList <- list(get(result))
  modelSummaryList <- c(modelSummaryList, tempModelList)
}
# put together - average the top layer's predicted probabilities of "Yes"
testSet <- testSet %>%
  dplyr::mutate(pred_final_prob_avg = (OOF_pred_top_glm + OOF_pred_top_nb + OOF_pred_top_gbm) / length(topTrainMethods),
                pred_final = ifelse(pred_final_prob_avg > 0.5, "Yes", "No") %>%
                  factor(., levels = c("Yes", "No")))
finalResult <- broom::tidy(caret::confusionMatrix(testSet$pred_final, testSet$loan_status)) %>%
  dplyr::mutate(trainMethod = "final - averaging") %>%
  dplyr::select(trainMethod, everything())
# store finalResult output into a list
tempModelList <- list(finalResult)
modelSummaryList <- c(modelSummaryList, tempModelList)
# let's look at the result
modelSummaryDf <- modelSummaryList %>%
dplyr::bind_rows() %>%
dplyr::select(trainMethod, term, estimate) %>%
tidyr::spread(term, estimate) %>%
arrange(desc(f1))
modelSummaryDf %>% kable()
| trainMethod | accuracy | balanced_accuracy | detection_prevalence | detection_rate | f1 | kappa | neg_pred_value | pos_pred_value | precision | prevalence | recall | sensitivity | specificity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| glm | 0.8524590 | 0.7631579 | 0.8360656 | 0.6885246 | 0.9032258 | 0.6047516 | 1.0000000 | 0.8235294 | 0.8235294 | 0.6885246 | 1.0000000 | 1.0000000 | 0.5263158 |
| rf | 0.8524590 | 0.7631579 | 0.8360656 | 0.6885246 | 0.9032258 | 0.6047516 | 1.0000000 | 0.8235294 | 0.8235294 | 0.6885246 | 1.0000000 | 1.0000000 | 0.5263158 |
| final - averaging | 0.8469945 | 0.7591896 | 0.8306011 | 0.6830601 | 0.8992806 | 0.5923628 | 0.9677419 | 0.8223684 | 0.8223684 | 0.6885246 | 0.9920635 | 0.9920635 | 0.5263158 |
| gbm - top layer | 0.8469945 | 0.7591896 | 0.8306011 | 0.6830601 | 0.8992806 | 0.5923628 | 0.9677419 | 0.8223684 | 0.8223684 | 0.6885246 | 0.9920635 | 0.9920635 | 0.5263158 |
| glm - top layer | 0.8469945 | 0.7591896 | 0.8306011 | 0.6830601 | 0.8992806 | 0.5923628 | 0.9677419 | 0.8223684 | 0.8223684 | 0.6885246 | 0.9920635 | 0.9920635 | 0.5263158 |
| nb - top layer | 0.8415301 | 0.7600251 | 0.8142077 | 0.6721311 | 0.8945455 | 0.5846443 | 0.9117647 | 0.8255034 | 0.8255034 | 0.6885246 | 0.9761905 | 0.9761905 | 0.5438596 |
| nb | 0.8360656 | 0.7560568 | 0.8087432 | 0.6666667 | 0.8905109 | 0.5726296 | 0.8857143 | 0.8243243 | 0.8243243 | 0.6885246 | 0.9682540 | 0.9682540 | 0.5438596 |
In summary, we first built a base layer (glm, nb and rf) to predict loan_status from all the predictors in the train set. Second, we built a top layer (glm, nb and gbm) to predict loan_status from the outcomes of the base layer (its OOF, or Out-Of-Fold, predictions); in other words, we predict the target variable from the predictions of our base models. Finally, we simply average the outcome probabilities of the top layer to get our final probability and decision, using 0.5 as the cut-off.
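As an aside, the same base-layer/stacking workflow can be written more compactly with the caretEnsemble package, which is not used in this post. A minimal sketch, assuming caretEnsemble is installed (note that the exact return format of predict() varies across caretEnsemble versions):
# sketch only - caretList() fits the base models on shared resampling folds,
# and caretStack() fits a meta-model on their out-of-fold predictions
if(!require(caretEnsemble)){install.packages("caretEnsemble"); require(caretEnsemble)}
set.seed(1234)
baseModels <- caretEnsemble::caretList(x = trainSet[, IV], y = trainSet[, DV],
                                       trControl = train.control,
                                       methodList = c("glm", "nb", "rf"))
stackedModel <- caretEnsemble::caretStack(baseModels, method = "glm")
head(predict(stackedModel, newdata = testSet[, IV], type = "prob"))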
Taking a look at our results in modelSummaryDf, surprisingly the plain glm and rf models perform best! They achieve the highest F1 score, followed by the "final - averaging" of the three top-layer models. All models achieve very high recall (sensitivity) and above 80% precision. However, specificity barely exceeds 50%, a weakness that deserves more attention in future work.
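One straightforward follow-up is to raise the probability cut-off for predicting "Yes", trading a little recall for specificity. A quick sketch on the averaged probabilities from above (the 0.7 cut-off is a hypothetical value and should really be tuned on a validation set):
# hypothetical stricter cut-off - predict "Yes" only when the averaged probability is high
cutoff <- 0.7
testSet$pred_strict <- ifelse(testSet$pred_final_prob_avg > cutoff, "Yes", "No") %>%
  factor(., levels = c("Yes", "No"))
caret::confusionMatrix(testSet$pred_strict, testSet$loan_status)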