This document is a tested version of the Analytics Vidhya blog article "How to use XGBoost algorithm in R in easy steps". I have fixed errors in the original article so that the results can be reproduced. The data are from the Loan Prediction Challenge; you can download them from here: train, test.
The XGBoost algorithm is one of the popular winning recipes of data science. Technically, "XGBoost" is short for Extreme Gradient Boosting. It gained popularity in data science after the famous Kaggle Otto Group Product Classification Challenge. The latest implementation of "xgboost" for R was released in August 2015.
Extreme Gradient Boosting (xgboost) is similar to the gradient boosting framework but more efficient. It has both a linear model solver and tree learning algorithms, and what makes it fast is its capacity to do parallel computation on a single machine. This makes xgboost at least 10 times faster than existing gradient boosting implementations. It supports various objective functions, including regression, classification and ranking.
XGBoost only works with numeric vectors. Yes, you need to work on data types here. All other forms of data must therefore be converted into numeric vectors. A simple method to convert a categorical variable into a numeric vector is one-hot encoding. The term comes from digital circuit design, where it refers to an array of binary signals in which the only legal values are 0s and 1s.
In R, one-hot encoding is quite easy. The step shown below essentially builds a sparse matrix using a flag column for every possible value of the variable. A sparse matrix is a matrix where most of the values are zeros; conversely, a dense matrix is one where most of the values are non-zero.
library(Matrix)
# One 0/1 column per factor level; "response" and "Data" are placeholders for your own target and data frame
sparse_matrix <- sparse.model.matrix(response ~ .-1, data = Data)
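To see what this produces, here is a minimal toy illustration (the data frame and its columns are made up for demonstration and are not part of this analysis):

# Toy example: a two-level factor expands into one 0/1 indicator column per level
toy <- data.frame(y = c(1, 0, 1),
                  color = factor(c("red", "blue", "red")))
sparse.model.matrix(y ~ . - 1, data = toy)
# Result: a sparse matrix with indicator columns colorblue and colorred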
library(xgboost)
library(readr)
library(stringr)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(car)
Load the data
set.seed(100)
df_train = read_csv("train_users_2.csv")
df_test = read_csv("test_users.csv")
Load labels of the train data
labels = df_train['Loan_Status']
df_train = df_train[-grep('Loan_Status', colnames(df_train))]
Combine train and test data
df_all = rbind(df_train,df_test)
The main purpose here is to deal with all the missing values in the data, and also to create dummy variables for the categorical features. (The missing value imputation here is arbitrary, without any claim of validity.)
df_all$Gender[is.na(df_all$Gender)] = "Male"
df_all$Married[is.na(df_all$Married)] = "No"
df_all$Self_Employed[is.na(df_all$Self_Employed)] = "No"
df_all$LoanAmount[is.na(df_all$LoanAmount)] = mean(df_all$LoanAmount, na.rm = TRUE)
df_all$Loan_Amount_Term[is.na(df_all$Loan_Amount_Term)] = 360
df_all$Credit_History[is.na(df_all$Credit_History)] = 1
df_all$Dependents[is.na(df_all$Dependents)] = 0
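As a quick sanity check (my addition, not in the original article), you can confirm that no missing values remain after the imputation:

# Count remaining NAs per column; every entry should now be 0
colSums(is.na(df_all))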
One-hot-encoding categorical features
ohe_feats = c("Gender", "Married", "Education", "Self_Employed", "Property_Area")
dummies <- dummyVars(~ Gender + Married + Education + Self_Employed + Property_Area, data = df_all)
df_all_ohe <- as.data.frame(predict(dummies, newdata = df_all))
df_all_combined <- cbind(df_all[,-c(which(colnames(df_all) %in% ohe_feats))],df_all_ohe)
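To verify the encoding (again an added check), inspect the dimensions of the combined frame and the names of the generated indicator columns:

# Each categorical feature is now represented by one 0/1 column per level
dim(df_all_combined)
colnames(df_all_ohe)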
Split the combined data back into train and test sets
X = df_all_combined[df_all_combined$Loan_ID %in% df_train$Loan_ID,]
# Recode the target from "Y"/"N" to 1/0 using car::recode
y <- recode(labels$Loan_Status,"'Y'=1; 'N'=0")
X_test = df_all_combined[df_all_combined$Loan_ID %in% df_test$Loan_ID,]
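A quick check (added here) that the split by Loan_ID recovers the original row counts, assuming the IDs are unique:

# Both comparisons should return TRUE
nrow(X) == nrow(df_train)
nrow(X_test) == nrow(df_test)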
set.seed(100)
xgb <- xgboost(data = data.matrix(X[,-1]),   # drop the Loan_ID column
               label = y,
               eta = 0.1,                    # learning rate
               max_depth = 15,               # maximum tree depth
               nround = 25,                  # number of boosting rounds
               subsample = 0.5,              # fraction of rows sampled per tree
               colsample_bytree = 0.5,       # fraction of columns sampled per tree
               eval_metric = "merror",       # multiclass classification error rate
               objective = "multi:softprob", # outputs a probability per class
               num_class = 12,               # kept from the original article; the target is binary, so 2 would suffice
               nthread = 3                   # number of parallel threads
)
## [0] train-merror:0.190554
## [1] train-merror:0.190554
## [2] train-merror:0.190554
## [3] train-merror:0.190554
## [4] train-merror:0.190554
## [5] train-merror:0.190554
## [6] train-merror:0.190554
## [7] train-merror:0.188925
## [8] train-merror:0.188925
## [9] train-merror:0.187296
## [10] train-merror:0.187296
## [11] train-merror:0.185668
## [12] train-merror:0.187296
## [13] train-merror:0.185668
## [14] train-merror:0.184039
## [15] train-merror:0.180782
## [16] train-merror:0.175896
## [17] train-merror:0.174267
## [18] train-merror:0.174267
## [19] train-merror:0.167752
## [20] train-merror:0.164495
## [21] train-merror:0.162866
## [22] train-merror:0.161238
## [23] train-merror:0.159609
## [24] train-merror:0.154723
y_pred <- predict(xgb, data.matrix(X_test[,-1]))
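Note that with the "multi:softprob" objective, predict() returns a flat vector with one probability per class for each row. A sketch of how you could reshape it and recover hard class labels (this step is my addition and was not in the original article):

# Reshape into an n-by-12 matrix: one row per applicant, one column per class
pred_matrix <- matrix(y_pred, ncol = 12, byrow = TRUE)
# Take the most probable class; subtract 1 to get back to 0/1 labels
pred_label <- max.col(pred_matrix) - 1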
Print the top 10 nodes of the model dump
model <- xgb.dump(xgb, with.stats = TRUE)
model[1:10]
## [1] "booster[0]"
## [2] "0:[f5<-1.00136e-005] yes=1,no=2,missing=1,gain=94.0304,cover=45.375"
## [3] "1:leaf=0.4875,cover=6.11111"
## [4] "2:[f2<5299] yes=3,no=4,missing=4,gain=9.00819,cover=39.2639"
## [5] "3:[f14<-1.00136e-005] yes=5,no=6,missing=5,gain=3.27528,cover=20.0139"
## [6] "5:[f2<1512.5] yes=9,no=10,missing=10,gain=1.07562,cover=13.9028"
## [7] "9:leaf=-0.0393822,cover=2.59722"
## [8] "10:[f2<1638] yes=13,no=14,missing=14,gain=0.496044,cover=11.3056"
## [9] "13:leaf=0.0684564,cover=1.06944"
## [10] "14:[f2<1895] yes=17,no=18,missing=18,gain=0.795448,cover=10.2361"
Compute the feature importance matrix. Gain measures the improvement in accuracy contributed by the splits on a feature, Cover the relative number of observations those splits concern, and Frequence the relative number of times the feature is used in the trees.
# Extract feature names from the training matrix (same order as the columns fed to xgboost)
names <- dimnames(data.matrix(X[,-1]))[[2]]
importance_matrix <- xgb.importance(names, model = xgb)
importance_matrix
## Feature Gain Cover Frequence
## 1: Credit_History 0.304734783 0.089193779 0.020753267
## 2: LoanAmount 0.195705858 0.237091088 0.292851653
## 3: ApplicantIncome 0.163190711 0.181815029 0.194465796
## 4: CoapplicantIncome 0.148589732 0.178707407 0.177555726
## 5: Dependents 0.025580740 0.051382810 0.058416603
## 6: Property_AreaSemiurban 0.024779798 0.034225070 0.027671022
## 7: Property_AreaRural 0.020434777 0.017755815 0.029976941
## 8: Loan_Amount_Term 0.018980493 0.054200458 0.031514220
## 9: Property_AreaUrban 0.017538969 0.020874618 0.039200615
## 10: MarriedNo 0.016337118 0.027031661 0.026133743
## 11: EducationGraduate 0.012750047 0.027566370 0.019215988
## 12: EducationNot Graduate 0.010606598 0.012295685 0.014604151
## 13: Self_EmployedNo 0.010480755 0.017324122 0.017678709
## 14: Self_EmployedYes 0.008902681 0.009810301 0.009992314
## 15: GenderFemale 0.008671492 0.022642470 0.019984627
## 16: MarriedYes 0.007289995 0.010997628 0.012298232
## 17: GenderMale 0.005425454 0.007085690 0.007686395
Generate the importance graph for the top 10 features (xgb.plot.importance requires the Ckmeans.1d.dp package)
xgb.plot.importance(importance_matrix[1:10,])
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] car_2.1-1 caret_6.0-62 ggplot2_2.0.0 lattice_0.20-33
## [5] stringr_1.0.0 readr_0.2.2 xgboost_0.4-2
##
## loaded via a namespace (and not attached):
## [1] Ckmeans.1d.dp_3.3.1 Rcpp_0.12.2 formatR_1.2.1
## [4] nloptr_1.0.4 plyr_1.8.3 iterators_1.0.8
## [7] tools_3.2.2 digest_0.6.8 lme4_1.1-10
## [10] evaluate_0.8 gtable_0.1.2 nlme_3.1-122
## [13] mgcv_1.8-10 Matrix_1.2-3 foreach_1.4.3
## [16] yaml_2.1.13 parallel_3.2.2 SparseM_1.7
## [19] knitr_1.11 MatrixModels_0.4-1 stats4_3.2.2
## [22] grid_3.2.2 nnet_7.3-11 data.table_1.9.6
## [25] rmarkdown_0.9 minqa_1.2.4 reshape2_1.4.1
## [28] magrittr_1.5 scales_0.3.0 codetools_0.2-14
## [31] htmltools_0.2.6 MASS_7.3-45 splines_3.2.2
## [34] pbkrtest_0.4-4 colorspace_1.2-6 labeling_0.3
## [37] quantreg_5.19 stringi_1.0-1 munsell_0.4.2
## [40] chron_2.3-47