Introduction

This document is a tested version of the Analytics Vidhya blog article "How to use XGBoost algorithm in R in easy steps". I have fixed errors in the original article so that the results can be reproduced. The data are from the Loan Prediction Challenge. You can download them from here: train, test.

What is XGBoost?

The XGBoost algorithm is one of the popular winning recipes in data science. Technically, "XGBoost" is short for Extreme Gradient Boosting. It gained popularity in data science after the famous Kaggle competition, the Otto Classification Challenge. The latest implementation of xgboost for R was released in August 2015.

Extreme Gradient Boosting (xgboost) is similar to the gradient boosting framework but more efficient. It has both a linear model solver and tree learning algorithms. What makes it fast is its capacity to do parallel computation on a single machine.

This makes xgboost at least 10 times faster than existing gradient boosting implementations. It supports various objective functions, including regression, classification and ranking.

Preparation of Data for using XGBoost

XGBoost only works with numeric vectors. Yes, you need to work on data types here.

Therefore, you need to convert all other forms of data into numeric vectors. A simple method to convert a categorical variable into a numeric vector is one-hot encoding. The term comes from digital circuit design, where it means an array of binary signals in which the only legal values are 0s and 1s.

In R, one-hot encoding is quite easy. The step shown below essentially builds a sparse matrix using flags for every possible value of the variable. A sparse matrix is a matrix where most of the values are zeros; conversely, a dense matrix is a matrix where most of the values are non-zeros.

library(Matrix)
sparse_matrix <- sparse.model.matrix(response ~ .-1, data = Data)
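
As a toy illustration (the small data frame below is made up for this document and is not taken from the article), each level of a categorical column becomes its own 0/1 indicator column:

# Hypothetical toy data: one categorical predictor and a 0/1 response
Data <- data.frame(response = c(1, 0, 1, 0),
                   Property_Area = c("Urban", "Rural", "Semiurban", "Urban"))
sparse_matrix <- sparse.model.matrix(response ~ . - 1, data = Data)
# Each level of Property_Area gets its own 0/1 column in the sparse matrix
print(sparse_matrix)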

Building the Model Using XGBoost in R

Step 1: Load all the libraries

library(xgboost)
library(readr)
library(stringr)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(car)

Step 2: Load the dataset

Load the data

set.seed(100)
df_train = read_csv("train_users_2.csv")
df_test = read_csv("test_users.csv")

Load labels of the train data

labels = df_train['Loan_Status']
df_train = df_train[-grep('Loan_Status', colnames(df_train))]

Combine train and test data

df_all = rbind(df_train,df_test)

Step 3: Data Cleaning & Feature Engineering

The main purpose here is to deal with all the missing values in the data, and also to create dummy variables for the categorical features. (The missing value imputation here is arbitrary and has not been validated.)

df_all$Gender[is.na(df_all$Gender)] = "Male"
df_all$Married[is.na(df_all$Married)] = "No"
df_all$Self_Employed[is.na(df_all$Self_Employed)] = "No"

df_all$LoanAmount[is.na(df_all$LoanAmount)] = mean(df_all$LoanAmount, na.rm = TRUE)
df_all$Loan_Amount_Term[is.na(df_all$Loan_Amount_Term)] = 360
df_all$Credit_History[is.na(df_all$Credit_History)] = 1
df_all$Dependents[is.na(df_all$Dependents)] = 0
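
As a quick sanity check (a small addition, not in the original article), you can count how many missing values remain in each column before encoding:

# Count remaining NAs per column of the combined data
sapply(df_all, function(x) sum(is.na(x)))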

One-hot-encoding categorical features

ohe_feats = c("Gender", "Married", "Education", "Self_Employed", "Property_Area")
dummies <- dummyVars(~ Gender + Married + Education + Self_Employed + Property_Area, data = df_all)
df_all_ohe <- as.data.frame(predict(dummies, newdata = df_all))
df_all_combined <- cbind(df_all[,-c(which(colnames(df_all) %in% ohe_feats))],df_all_ohe)

Split the train and test data sets

X = df_all_combined[df_all_combined$Loan_ID %in% df_train$Loan_ID,]
y <- recode(labels$Loan_Status,"'Y'=1; 'N'=0")
X_test = df_all_combined[df_all_combined$Loan_ID %in% df_test$Loan_ID,]

Step 4: Tune and Run the model

set.seed(100)
xgb <- xgboost(data = data.matrix(X[,-1]),
               label = y,
               eta = 0.1,
               max_depth = 15,
               nround = 25,
               subsample = 0.5,
               colsample_bytree = 0.5,
               eval_metric = "merror",
               objective = "multi:softprob",
               num_class = 12,
               nthread = 3
)
## [0]  train-merror:0.190554
## [1]  train-merror:0.190554
## [2]  train-merror:0.190554
## [3]  train-merror:0.190554
## [4]  train-merror:0.190554
## [5]  train-merror:0.190554
## [6]  train-merror:0.190554
## [7]  train-merror:0.188925
## [8]  train-merror:0.188925
## [9]  train-merror:0.187296
## [10] train-merror:0.187296
## [11] train-merror:0.185668
## [12] train-merror:0.187296
## [13] train-merror:0.185668
## [14] train-merror:0.184039
## [15] train-merror:0.180782
## [16] train-merror:0.175896
## [17] train-merror:0.174267
## [18] train-merror:0.174267
## [19] train-merror:0.167752
## [20] train-merror:0.164495
## [21] train-merror:0.162866
## [22] train-merror:0.161238
## [23] train-merror:0.159609
## [24] train-merror:0.154723
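
The article does not show an explicit tuning step. As a hedged addition (not part of the original article), the same settings could be cross-validated with xgb.cv from the same package to help choose the number of rounds:

# 5-fold cross-validation with the same parameters used above
cv <- xgb.cv(data = data.matrix(X[,-1]), label = y, nfold = 5,
             eta = 0.1, max_depth = 15, nrounds = 25,
             subsample = 0.5, colsample_bytree = 0.5,
             eval_metric = "merror", objective = "multi:softprob",
             num_class = 12, nthread = 3)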

Step 5: Score the Test Set

y_pred <- predict(xgb, data.matrix(X_test[,-1]))
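
Because the model was trained with objective = "multi:softprob", predict() returns one probability per class for each test row as a single flat vector. A short sketch (an addition, not in the original article) of turning this into hard class predictions:

# Reshape the flat probability vector: one row per test case, one column per class
pred_matrix <- matrix(y_pred, ncol = 12, byrow = TRUE)  # 12 = num_class used above
pred_class <- max.col(pred_matrix) - 1                  # xgboost class labels are 0-indexed
head(pred_class)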

Step 6: Check the model

Print the top 10 nodes of the model

model <- xgb.dump(xgb, with.stats = T)
model[1:10] 
##  [1] "booster[0]"                                                           
##  [2] "0:[f5<-1.00136e-005] yes=1,no=2,missing=1,gain=94.0304,cover=45.375"  
##  [3] "1:leaf=0.4875,cover=6.11111"                                          
##  [4] "2:[f2<5299] yes=3,no=4,missing=4,gain=9.00819,cover=39.2639"          
##  [5] "3:[f14<-1.00136e-005] yes=5,no=6,missing=5,gain=3.27528,cover=20.0139"
##  [6] "5:[f2<1512.5] yes=9,no=10,missing=10,gain=1.07562,cover=13.9028"      
##  [7] "9:leaf=-0.0393822,cover=2.59722"                                      
##  [8] "10:[f2<1638] yes=13,no=14,missing=14,gain=0.496044,cover=11.3056"     
##  [9] "13:leaf=0.0684564,cover=1.06944"                                      
## [10] "14:[f2<1895] yes=17,no=18,missing=18,gain=0.795448,cover=10.2361"

Compute feature importance matrix

names <- dimnames(data.matrix(X[,-1]))[[2]]
importance_matrix <- xgb.importance(names, model = xgb)
importance_matrix
##                    Feature        Gain       Cover   Frequence
##  1:         Credit_History 0.304734783 0.089193779 0.020753267
##  2:             LoanAmount 0.195705858 0.237091088 0.292851653
##  3:        ApplicantIncome 0.163190711 0.181815029 0.194465796
##  4:      CoapplicantIncome 0.148589732 0.178707407 0.177555726
##  5:             Dependents 0.025580740 0.051382810 0.058416603
##  6: Property_AreaSemiurban 0.024779798 0.034225070 0.027671022
##  7:     Property_AreaRural 0.020434777 0.017755815 0.029976941
##  8:       Loan_Amount_Term 0.018980493 0.054200458 0.031514220
##  9:     Property_AreaUrban 0.017538969 0.020874618 0.039200615
## 10:              MarriedNo 0.016337118 0.027031661 0.026133743
## 11:      EducationGraduate 0.012750047 0.027566370 0.019215988
## 12:  EducationNot Graduate 0.010606598 0.012295685 0.014604151
## 13:        Self_EmployedNo 0.010480755 0.017324122 0.017678709
## 14:       Self_EmployedYes 0.008902681 0.009810301 0.009992314
## 15:           GenderFemale 0.008671492 0.022642470 0.019984627
## 16:             MarriedYes 0.007289995 0.010997628 0.012298232
## 17:             GenderMale 0.005425454 0.007085690 0.007686395

Generate the importance graph

xgb.plot.importance(importance_matrix[1:10,])

The following sections are adapted from the blog article.

Parameters used in Xgboost

  • General parameters refer to which booster we are using to do boosting. The commonly used boosters are the tree model and the linear model.
  • Booster parameters depend on which booster you have chosen.
  • Learning task parameters decide on the learning scenario; for example, regression tasks may use different parameters than ranking tasks.

General Parameters

  • silent : The default value is 0. You need to specify 0 for printing running messages, 1 for silent mode.
  • booster : The default value is gbtree. You need to specify the booster to use: gbtree (tree based) or gblinear (linear function). A minimal sketch of setting these appears after this list.
  • num_pbuffer : This is set automatically by xgboost, no need to be set by user. Read documentation of xgboost for more details.
  • num_feature : This is set automatically by xgboost, no need to be set by user.
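
Putting the general parameters together, here is a minimal sketch (an addition, not from the original article) that reuses the X and y objects built in Step 3; the binary:logistic objective is an illustrative assumption based on the 0/1 label, not the objective used in Step 4:

# Choose the booster and silence the running messages via the params list
params <- list(booster = "gbtree", silent = 1, objective = "binary:logistic")
dtrain <- xgb.DMatrix(data.matrix(X[,-1]), label = y)
bst_general <- xgb.train(params = params, data = dtrain, nrounds = 10)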

Booster Parameters

  • eta : The default value is set to 0.3. You need to specify the step size shrinkage used in updates to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative. The range is 0 to 1. A lower eta makes the model more robust to overfitting.
  • gamma : The default value is set to 0. You need to specify the minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the gamma, the more conservative the algorithm will be. The range is 0 to \(\infty\).
  • max_depth : The default value is set to 6. You need to specify the maximum depth of a tree. The range is 1 to \(\infty\).
  • min_child_weight : The default value is set to 1. You need to specify the minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, then the building process gives up further partitioning. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger the value, the more conservative the algorithm will be. The range is 0 to \(\infty\).
  • max_delta_step : The default value is set to 0. This is the maximum delta step we allow each tree's weight estimation to be. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when classes are extremely imbalanced. Setting it to a value of 1-10 might help control the update. The range is 0 to \(\infty\).
  • subsample : The default value is set to 1. You need to specify the subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the data instances to grow trees, which helps prevent overfitting. The range is 0 to 1.
  • colsample_bytree : The default value is set to 1. You need to specify the subsample ratio of columns when constructing each tree. The range is 0 to 1. A sketch combining these tree booster parameters follows this list.
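
A hedged sketch of the tree booster parameters described above, again reusing X and y from Step 3; the values and the binary:logistic objective are illustrative assumptions, not tuned settings from the article:

# Tree booster parameters collected in a params list
params <- list(eta = 0.1, gamma = 0, max_depth = 6, min_child_weight = 1,
               max_delta_step = 0, subsample = 0.5, colsample_bytree = 0.5,
               objective = "binary:logistic")
bst_tree <- xgboost(data = data.matrix(X[,-1]), label = y,
                    params = params, nrounds = 25)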

Linear Booster Specific Parameters

  • lambda and alpha : These are regularization terms on the weights. Lambda (L2) has a default value of 1 and alpha (L1) has a default value of 0.
  • lambda_bias : L2 regularization term on the bias, with a default value of 0. A sketch of the linear booster follows this list.
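
Similarly, a hedged sketch (not from the original article) of switching to the linear booster with its regularization terms, under the same illustrative assumptions:

# Linear booster with L1/L2 regularization on the weights and L2 on the bias
params <- list(booster = "gblinear", lambda = 1, alpha = 0, lambda_bias = 0,
               objective = "binary:logistic")
bst_linear <- xgboost(data = data.matrix(X[,-1]), label = y,
                      params = params, nrounds = 25)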

Learning Task Parameters

  • base_score : The default value is set to 0.5 . You need to specify the initial prediction score of all instances, global bias.
  • objective : The default value is set to reg:linear. You need to specify the type of learner you want, which includes linear regression, logistic regression, Poisson regression, etc.
  • eval_metric : You need to specify the evaluation metric for validation data; a default metric is assigned according to the objective (rmse for regression, error for classification, and mean average precision for ranking). A sketch of these learning task parameters follows this list.
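
Finally, a hedged sketch of the learning task parameters (again an illustrative addition, not code from the article), using a watchlist so the chosen eval_metric is reported during training:

# Learning task parameters: objective, global bias, and evaluation metric
dtrain <- xgb.DMatrix(data.matrix(X[,-1]), label = y)
bst_task <- xgb.train(params = list(objective = "binary:logistic",
                                    base_score = 0.5,
                                    eval_metric = "error"),
                      data = dtrain, nrounds = 10,
                      watchlist = list(train = dtrain))
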
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] car_2.1-1       caret_6.0-62    ggplot2_2.0.0   lattice_0.20-33
## [5] stringr_1.0.0   readr_0.2.2     xgboost_0.4-2  
## 
## loaded via a namespace (and not attached):
##  [1] Ckmeans.1d.dp_3.3.1 Rcpp_0.12.2         formatR_1.2.1      
##  [4] nloptr_1.0.4        plyr_1.8.3          iterators_1.0.8    
##  [7] tools_3.2.2         digest_0.6.8        lme4_1.1-10        
## [10] evaluate_0.8        gtable_0.1.2        nlme_3.1-122       
## [13] mgcv_1.8-10         Matrix_1.2-3        foreach_1.4.3      
## [16] yaml_2.1.13         parallel_3.2.2      SparseM_1.7        
## [19] knitr_1.11          MatrixModels_0.4-1  stats4_3.2.2       
## [22] grid_3.2.2          nnet_7.3-11         data.table_1.9.6   
## [25] rmarkdown_0.9       minqa_1.2.4         reshape2_1.4.1     
## [28] magrittr_1.5        scales_0.3.0        codetools_0.2-14   
## [31] htmltools_0.2.6     MASS_7.3-45         splines_3.2.2      
## [34] pbkrtest_0.4-4      colorspace_1.2-6    labeling_0.3       
## [37] quantreg_5.19       stringi_1.0-1       munsell_0.4.2      
## [40] chron_2.3-47