Ensembling is a common and powerful technique for improving accuracy and overall performance on classification problems. Given a binary outcome, we want to classify whether an event will occur based on a set of quantitative and qualitative variables. Many algorithms can do the trick; in this blog post, we combine a set of classification algorithms to strengthen our predictions, using a public dataset to classify loan status from various socio-demographic variables.
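To make the idea concrete before we dive in, here is a toy sketch of the simplest form of ensembling, a majority vote across three classifiers. The predictions below are made up for illustration and have nothing to do with the dataset we use later.
# toy illustration - three classifiers each predict four cases; the majority label wins
pred1 <- c("Y", "N", "Y", "N")
pred2 <- c("Y", "Y", "Y", "N")
pred3 <- c("N", "Y", "Y", "Y")
votes <- data.frame(pred1, pred2, pred3)
apply(votes, 1, function(row) names(which.max(table(row))))
## [1] "Y" "Y" "Y" "N"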
# load packages
if(!require(pacman)){install.packages("pacman"); require(pacman)}
## Loading required package: pacman
## Warning: package 'pacman' was built under R version 3.6.2
packages <- c('tidyverse', 'caret', 'broom', 'Hmisc', 'kableExtra', 'janitor')
pacman::p_load(char = packages)
# read data
df <- read.csv(url('https://datahack-prod.s3.ap-south-1.amazonaws.com/train_file/train_u6lujuX_CVtuZ9i.csv'))
# check missing
tempDf <- data.frame(fields = names(df),
missing = colSums(is.na(df)))
tempDf
## fields missing
## Loan_ID Loan_ID 0
## Gender Gender 0
## Married Married 0
## Dependents Dependents 0
## Education Education 0
## Self_Employed Self_Employed 0
## ApplicantIncome ApplicantIncome 0
## CoapplicantIncome CoapplicantIncome 0
## LoanAmount LoanAmount 22
## Loan_Amount_Term Loan_Amount_Term 14
## Credit_History Credit_History 50
## Property_Area Property_Area 0
## Loan_Status Loan_Status 0
# impute missing values, then center and scale the numeric variables
preProcValues <- preProcess(df, method = c("medianImpute", "center", "scale"))
dfProcessed <- predict(preProcValues, df)
# check again
tempDf2 <- data.frame(fields = names(dfProcessed),
missing = colSums(is.na(dfProcessed)))
tempDf2
## fields missing
## Loan_ID Loan_ID 0
## Gender Gender 0
## Married Married 0
## Dependents Dependents 0
## Education Education 0
## Self_Employed Self_Employed 0
## ApplicantIncome ApplicantIncome 0
## CoapplicantIncome CoapplicantIncome 0
## LoanAmount LoanAmount 0
## Loan_Amount_Term Loan_Amount_Term 0
## Credit_History Credit_History 0
## Property_Area Property_Area 0
## Loan_Status Loan_Status 0
# clean names
dfProcessed <- janitor::clean_names(dfProcessed)
names(dfProcessed)
## [1] "loan_id" "gender" "married"
## [4] "dependents" "education" "self_employed"
## [7] "applicant_income" "coapplicant_income" "loan_amount"
## [10] "loan_amount_term" "credit_history" "property_area"
## [13] "loan_status"
# check distribution
broom::glance(dfProcessed)
## Warning: 'glance.data.frame' is deprecated.
## See help("Deprecated")
## # A tibble: 1 x 4
## nrow ncol complete.obs na.fraction
## <int> <int> <int> <dbl>
## 1 614 13 614 0
table(dfProcessed$loan_status)
##
## N Y
## 192 422
prop.table(ftable(dfProcessed$loan_status)) %>% round(., 1)
## N Y
##
## 0.3 0.7
# head df
head(dfProcessed) %>% kable(.)
| loan_id | gender | married | dependents | education | self_employed | applicant_income | coapplicant_income | loan_amount | loan_amount_term | credit_history | property_area | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LP001002 | Male | No | 0 | Graduate | No | 0.0729314 | -0.5540356 | -0.2151272 | 0.276411 | 0.4324768 | Urban | Y |
| LP001003 | Male | Yes | 1 | Graduate | No | -0.1343025 | -0.0387000 | -0.2151272 | 0.276411 | 0.4324768 | Rural | N |
| LP001005 | Male | Yes | 0 | Graduate | Yes | -0.3934266 | -0.5540356 | -0.9395335 | 0.276411 | 0.4324768 | Urban | Y |
| LP001006 | Male | Yes | 0 | Not Graduate | No | -0.4616860 | 0.2517743 | -0.3085990 | 0.276411 | 0.4324768 | Urban | Y |
| LP001008 | Male | No | 0 | Graduate | No | 0.0976488 | -0.5540356 | -0.0632356 | 0.276411 | 0.4324768 | Urban | Y |
| LP001011 | Male | Yes | 2 | Graduate | Yes | 0.0022165 | 0.8798823 | 1.4089450 | 0.276411 | 0.4324768 | Urban | Y |
# str(), dim()
str(dfProcessed)
## 'data.frame': 614 obs. of 13 variables:
## $ loan_id : Factor w/ 614 levels "LP001002","LP001003",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 3 ...
## $ married : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
## $ dependents : Factor w/ 5 levels "","0","1","2",..: 2 3 2 2 2 4 2 5 4 3 ...
## $ education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
## $ self_employed : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
## $ applicant_income : num 0.0729 -0.1343 -0.3934 -0.4617 0.0976 ...
## $ coapplicant_income: num -0.554 -0.0387 -0.554 0.2518 -0.554 ...
## $ loan_amount : num -0.2151 -0.2151 -0.9395 -0.3086 -0.0632 ...
## $ loan_amount_term : num 0.276 0.276 0.276 0.276 0.276 ...
## $ credit_history : num 0.432 0.432 0.432 0.432 0.432 ...
## $ property_area : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
## $ loan_status : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...
dim(dfProcessed)
## [1] 614 13
# relabel the "loan_status" factor levels
levels(dfProcessed$loan_status)
## [1] "N" "Y"
dfProcessed <- dfProcessed %>%
dplyr::mutate(loan_status = loan_status %>%
factor(., levels = c("Y", "N"), labels = c("Yes", "No")))
levels(dfProcessed$loan_status)
## [1] "Yes" "No"
Let’s split the data and set up 10-fold cross-validation; besides giving a more reliable accuracy estimate, it produces the out-of-fold predictions we will need for stacking later.
# set seed
set.seed(1234)
# split data
index <- caret::createDataPartition(dfProcessed$loan_status, p = 0.7, list = FALSE)
trainSet <- dfProcessed[index, ]
testSet <- dfProcessed[-index, ]
# define the training control - let's do 10-fold cross-validation
train.control <- caret::trainControl(
method = "cv",
number = 10,
savePredictions = 'final',
classProbs = TRUE)
# define IVs, DV
IV <- names(dfProcessed)[names(dfProcessed) %nin% c("loan_id", "loan_status")]; IV
## [1] "gender" "married" "dependents"
## [4] "education" "self_employed" "applicant_income"
## [7] "coapplicant_income" "loan_amount" "loan_amount_term"
## [10] "credit_history" "property_area"
DV <- "loan_status"
We will now ensemble our models: glm (generalized linear model), nb (naive Bayes) and rf (random forest) form the base layer, and we stack on top of it by training a top layer of glm, nb and gbm (gradient boosting machine) models on the base layer's out-of-fold predictions. We will evaluate the result of each model.
# set seed
set.seed(1234)
# methods
baseTrainMethods <- c("glm", "nb", "rf")
topTrainMethods <- c("glm", "nb", "gbm")
# output
modelSummaryList <- vector(mode = "list")
# train base layer
for(baseLayer in baseTrainMethods){
  # set parameters
  ml = baseLayer
  model = paste0("model_base_", ml)
  OOF_prediction = paste0("OOF_pred_", ml)
  prediction = paste0("pred_", ml)
  result = paste0("result_", ml)
  # model
  assign(model, caret::train(trainSet[, IV], trainSet[, DV],
                             method = ml,
                             trControl = train.control,
                             trace = FALSE))
  # Out-Of-Fold probability predictions - trainSet
  # (the probability column is named after the positive factor level, "Yes", since we relabeled above)
  if(ml == "glm"){trainSet$OOF_pred_glm = get(model)$pred$Yes[order(get(model)$pred$rowIndex)]}
  if(ml == "nb"){trainSet$OOF_pred_nb = get(model)$pred$Yes[order(get(model)$pred$rowIndex)]}
  if(ml == "rf"){trainSet$OOF_pred_rf = get(model)$pred$Yes[order(get(model)$pred$rowIndex)]}
  # probability predictions - testSet (kept under the OOF_pred_* names for consistency with trainSet)
  assign(OOF_prediction, predict(get(model), testSet[, IV], type = "prob")$Yes)
  if(ml == "glm"){testSet$OOF_pred_glm = get(OOF_prediction)}
  if(ml == "nb"){testSet$OOF_pred_nb = get(OOF_prediction)}
  if(ml == "rf"){testSet$OOF_pred_rf = get(OOF_prediction)}
  # Yes/No class predictions for the confusion matrix - testSet
  assign(prediction, predict(get(model), testSet[, IV]))
  if(ml == "glm"){testSet$pred_glm = get(prediction)}
  if(ml == "nb"){testSet$pred_nb = get(prediction)}
  if(ml == "rf"){testSet$pred_rf = get(prediction)}
  # output
  assign(result, broom::tidy(caret::confusionMatrix(testSet[, prediction], testSet[, DV])) %>%
           dplyr::mutate(trainMethod = ml) %>%
           dplyr::select(trainMethod, everything()))
  # store output into a list (a plain `<-` suffices; for loops run in the current environment)
  tempModelList <- list(get(result))
  modelSummaryList <- c(modelSummaryList, tempModelList)
}
# train top layer
for(topLayer in topTrainMethods){
  # set parameters
  ml = topLayer
  model = paste0("model_top_", ml)
  OOF_predictors_top = c("OOF_pred_glm", "OOF_pred_nb", "OOF_pred_rf")
  OOF_prediction_top = paste0("OOF_pred_top_", ml)
  prediction_top = paste0("pred_top_", ml)
  result = paste0("result_top_", ml)
  # model - trained on the base layer's out-of-fold predictions
  assign(model, caret::train(trainSet[, OOF_predictors_top], trainSet[, DV],
                             method = ml,
                             trControl = train.control))
  # probability predictions - testSet
  assign(OOF_prediction_top, predict(get(model), testSet[, OOF_predictors_top], type = "prob")$Yes)
  if(ml == "glm"){testSet$OOF_pred_top_glm = get(OOF_prediction_top)}
  if(ml == "nb"){testSet$OOF_pred_top_nb = get(OOF_prediction_top)}
  if(ml == "gbm"){testSet$OOF_pred_top_gbm = get(OOF_prediction_top)}
  # Yes/No class predictions for the confusion matrix - testSet
  assign(prediction_top, predict(get(model), testSet[, OOF_predictors_top]))
  if(ml == "glm"){testSet$pred_top_glm = get(prediction_top)}
  if(ml == "nb"){testSet$pred_top_nb = get(prediction_top)}
  if(ml == "gbm"){testSet$pred_top_gbm = get(prediction_top)}
  # output
  assign(result, broom::tidy(caret::confusionMatrix(testSet[, prediction_top], testSet[, DV])) %>%
           dplyr::mutate(trainMethod = paste0(ml, " - top layer")) %>%
           dplyr::select(trainMethod, everything()))
  # store output into a list
  tempModelList <- list(get(result))
  modelSummaryList <- c(modelSummaryList, tempModelList)
}
# put together - average the top layer's predicted probabilities of "Yes"
testSet <- testSet %>%
  dplyr::mutate(pred_final_prob_avg = (OOF_pred_top_glm + OOF_pred_top_nb + OOF_pred_top_gbm) / length(topTrainMethods),
                pred_final = ifelse(pred_final_prob_avg > 0.5, "Yes", "No") %>%
                  factor(., levels = c("Yes", "No")))
finalResult <- broom::tidy(caret::confusionMatrix(testSet$pred_final, testSet$loan_status)) %>%
  dplyr::mutate(trainMethod = "final - averaging") %>%
  dplyr::select(trainMethod, everything())
# store finalResult output into a list
tempModelList <- list(finalResult)
modelSummaryList <- c(modelSummaryList, tempModelList)
# let's look at the result
modelSummaryDf <- modelSummaryList %>%
dplyr::bind_rows() %>%
dplyr::select(trainMethod, term, estimate) %>%
tidyr::spread(term, estimate) %>%
arrange(desc(f1))
modelSummaryDf %>% kable()
| trainMethod | accuracy | balanced_accuracy | detection_prevalence | detection_rate | f1 | kappa | neg_pred_value | pos_pred_value | precision | prevalence | recall | sensitivity | specificity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| glm | 0.8524590 | 0.7631579 | 0.8360656 | 0.6885246 | 0.9032258 | 0.6047516 | 1.0000000 | 0.8235294 | 0.8235294 | 0.6885246 | 1.0000000 | 1.0000000 | 0.5263158 |
| rf | 0.8524590 | 0.7631579 | 0.8360656 | 0.6885246 | 0.9032258 | 0.6047516 | 1.0000000 | 0.8235294 | 0.8235294 | 0.6885246 | 1.0000000 | 1.0000000 | 0.5263158 |
| final - averaging | 0.8469945 | 0.7591896 | 0.8306011 | 0.6830601 | 0.8992806 | 0.5923628 | 0.9677419 | 0.8223684 | 0.8223684 | 0.6885246 | 0.9920635 | 0.9920635 | 0.5263158 |
| gbm - top layer | 0.8469945 | 0.7591896 | 0.8306011 | 0.6830601 | 0.8992806 | 0.5923628 | 0.9677419 | 0.8223684 | 0.8223684 | 0.6885246 | 0.9920635 | 0.9920635 | 0.5263158 |
| glm - top layer | 0.8469945 | 0.7591896 | 0.8306011 | 0.6830601 | 0.8992806 | 0.5923628 | 0.9677419 | 0.8223684 | 0.8223684 | 0.6885246 | 0.9920635 | 0.9920635 | 0.5263158 |
| nb - top layer | 0.8415301 | 0.7600251 | 0.8142077 | 0.6721311 | 0.8945455 | 0.5846443 | 0.9117647 | 0.8255034 | 0.8255034 | 0.6885246 | 0.9761905 | 0.9761905 | 0.5438596 |
| nb | 0.8360656 | 0.7560568 | 0.8087432 | 0.6666667 | 0.8905109 | 0.5726296 | 0.8857143 | 0.8243243 | 0.8243243 | 0.6885246 | 0.9682540 | 0.9682540 | 0.5438596 |
In summary, we first built a base layer (glm, nb and rf) to predict loan_status from all the predictors in the train set. Second, we built a top layer (glm, nb and gbm) to predict loan_status from the outcomes of the base layer (its OOF, or Out-Of-Fold, predictions); in other words, we predict the target variable from the predictions of our base models. Finally, we simply average the outcome probabilities of the top layer to get our final probability and decision, using 0.5 as the cut-off.
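As an aside, the same base-layer/stacking workflow can be written more compactly with the caretEnsemble package, which is not used in this post. A minimal sketch, assuming caretEnsemble is installed (note that the exact return format of predict() varies across caretEnsemble versions):
# sketch only - caretList() fits the base models on shared resampling folds,
# and caretStack() fits a meta-model on their out-of-fold predictions
if(!require(caretEnsemble)){install.packages("caretEnsemble"); require(caretEnsemble)}
set.seed(1234)
baseModels <- caretEnsemble::caretList(x = trainSet[, IV], y = trainSet[, DV],
                                       trControl = train.control,
                                       methodList = c("glm", "nb", "rf"))
stackedModel <- caretEnsemble::caretStack(baseModels, method = "glm")
head(predict(stackedModel, newdata = testSet[, IV], type = "prob"))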
Taking a look at our results in modelSummaryDf, surprisingly the plain glm and rf models perform best! They achieve the highest F1 score, followed by the "final - averaging" of the three top-layer models. All models achieve very high recall (sensitivity) and above 80% precision. However, specificity barely exceeds 50%, a weakness that deserves more attention in future work.
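One straightforward follow-up is to raise the probability cut-off for predicting "Yes", trading a little recall for specificity. A quick sketch on the averaged probabilities from above (the 0.7 cut-off is a hypothetical value and should really be tuned on a validation set):
# hypothetical stricter cut-off - predict "Yes" only when the averaged probability is high
cutoff <- 0.7
testSet$pred_strict <- ifelse(testSet$pred_final_prob_avg > cutoff, "Yes", "No") %>%
  factor(., levels = c("Yes", "No"))
caret::confusionMatrix(testSet$pred_strict, testSet$loan_status)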