The purpose of this report is to show how (relatively) simple it is to implement xgboost with the MLR package in R. MLR supports a wide range of learning algorithms, which can also be swapped out easily.
For much more information on the MLR package, see the tutorial here: https://mlr-org.github.io/mlr-tutorial/release/html/index.html
library(tidyverse) # data manipulation
library(mlr) # ML package (also some data manipulation)
library(knitr) # just using this for kable() to make pretty tables
library(xgboost)
# The 'xgboost' library must be installed - doesn't need to be loaded
train_orig <- read_csv("xgboost_train.csv")
test_orig <- read_csv("xgboost_test.csv")
First, I combine the train and test data together so that cleaning only needs to be done once and any averages (or other stats) are more accurate. Creating a new column to mark each data set allows me to easily separate them again after cleaning.
train <- train_orig %>%
mutate(dataset = "train")
test <- test_orig %>%
mutate(dataset = "test")
combined <- bind_rows(train, test)
MLR is a library for machine learning, and it contains many useful tools for data science in general. For example, the summarizeColumns() function is great for getting a quick overview of the data.
summarizeColumns(combined) %>%
kable(digits = 2)
| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| PassengerId | integer | 0 | 655.00 | 378.02 | 655.00 | 484.81 | 1.00 | 1309.00 | 0 |
| Survived | integer | 418 | 0.38 | 0.49 | 0.00 | 0.00 | 0.00 | 1.00 | 0 |
| Pclass | integer | 0 | 2.29 | 0.84 | 3.00 | 0.00 | 1.00 | 3.00 | 0 |
| Name | character | 0 | NA | 1.00 | NA | NA | 1.00 | 2.00 | 1307 |
| Sex | character | 0 | NA | 0.36 | NA | NA | 466.00 | 843.00 | 2 |
| Age | numeric | 263 | 29.88 | 14.41 | 28.00 | 11.86 | 0.17 | 80.00 | 0 |
| SibSp | integer | 0 | 0.50 | 1.04 | 0.00 | 0.00 | 0.00 | 8.00 | 0 |
| Parch | integer | 0 | 0.39 | 0.87 | 0.00 | 0.00 | 0.00 | 9.00 | 0 |
| Ticket | character | 0 | NA | 0.99 | NA | NA | 1.00 | 11.00 | 929 |
| Fare | numeric | 1 | 33.30 | 51.76 | 14.45 | 10.24 | 0.00 | 512.33 | 0 |
| Cabin | character | 1014 | NA | NA | NA | NA | 1.00 | 6.00 | 186 |
| Embarked | character | 2 | NA | NA | NA | NA | 123.00 | 914.00 | 3 |
| dataset | character | 0 | NA | 0.32 | NA | NA | 418.00 | 891.00 | 2 |
A number of columns are simply dropped here. PassengerId, for example, is just a unique identifier and isn't helpful for making predictions. While it might be possible to impute the missing values for Cabin, here I'm just dropping the field.
combined <- combined %>%
select(-c(PassengerId, Name, Ticket, Cabin))
A number of the features/columns in the data set are categorical variables. For example, Pclass describes each person as belonging to one of three classes. These variables are converted to factors here. Character values are not allowed in the data sets passed to the learning algorithms, so all of them must be handled (either dropped or converted). The 'dataset' column is the one we added, and we'll drop it later.
combined <- combined %>%
mutate_at(
.vars = vars("Survived", "Pclass", "Sex", "Embarked"),
.funs = funs(as.factor(.))
)
The "na" column in our summary table lets us know that several columns have many missing entries. Lots of top-rated kernels go deep into data imputation; I'm going to use the MLR package to do it faster (but not as well). The missing values in the Survived column come from the test data set (the records we need to predict). Impute them for now - the exact values aren't important.
The mlr package can impute values for all integer fields, numeric fields, factor fields, etc. without having to list each one. After the imputation, there are no missing values left in our features. Two more data-processing steps remain, which let us show a few more of mlr's functions.
# Impute missing values by field type
imp <- impute(
combined,
classes = list(
factor = imputeMode(),
integer = imputeMean(),
numeric = imputeMean()
)
)
combined <- imp$data
# Show column summary
summarizeColumns(combined) %>%
kable(digits = 2)
| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| Survived | factor | 0 | NA | 0.26 | NA | NA | 342.00 | 967.00 | 2 |
| Pclass | factor | 0 | NA | 0.46 | NA | NA | 277.00 | 709.00 | 3 |
| Sex | factor | 0 | NA | 0.36 | NA | NA | 466.00 | 843.00 | 2 |
| Age | numeric | 0 | 29.88 | 12.88 | 29.88 | 9.07 | 0.17 | 80.00 | 0 |
| SibSp | numeric | 0 | 0.50 | 1.04 | 0.00 | 0.00 | 0.00 | 8.00 | 0 |
| Parch | numeric | 0 | 0.39 | 0.87 | 0.00 | 0.00 | 0.00 | 9.00 | 0 |
| Fare | numeric | 0 | 33.30 | 51.74 | 14.45 | 10.24 | 0.00 | 512.33 | 0 |
| Embarked | factor | 0 | NA | 0.30 | NA | NA | 123.00 | 916.00 | 3 |
| dataset | character | 0 | NA | 0.32 | NA | NA | 418.00 | 891.00 | 2 |
## Feature Normalization
Many models fit better when the explanatory features are on a common scale (tree-based learners such as xgboost are largely insensitive to scaling, but it does no harm). This can be done quickly for all numeric columns. Note that afterwards the mean of every numeric column is 0 and the variance is standardized as well.
In the various data processing functions of the mlr package, you can specify the target (predicted) variable in the data set. This will prevent the data processing functions from modifying that variable.
combined <- normalizeFeatures(combined, target = "Survived")
summarizeColumns(combined) %>%
kable(digits = 2)
| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| Survived | factor | 0 | NA | 0.26 | NA | NA | 342.00 | 967.00 | 2 |
| Pclass | factor | 0 | NA | 0.46 | NA | NA | 277.00 | 709.00 | 3 |
| Sex | factor | 0 | NA | 0.36 | NA | NA | 466.00 | 843.00 | 2 |
| Age | numeric | 0 | 0 | 1.00 | 0.00 | 0.7 | -2.31 | 3.89 | 0 |
| SibSp | numeric | 0 | 0 | 1.00 | -0.48 | 0.0 | -0.48 | 7.20 | 0 |
| Parch | numeric | 0 | 0 | 1.00 | -0.44 | 0.0 | -0.44 | 9.95 | 0 |
| Fare | numeric | 0 | 0 | 1.00 | -0.36 | 0.2 | -0.64 | 9.26 | 0 |
| Embarked | factor | 0 | NA | 0.30 | NA | NA | 123.00 | 916.00 | 3 |
| dataset | character | 0 | NA | 0.32 | NA | NA | 418.00 | 891.00 | 2 |
All factors must be expanded into numeric dummy columns (one-hot encoding). The MLR package will warn you if you haven’t completed this step. After conversion, note that Pclass now has three fields: Pclass.1-3. Each one is filled with 0 or 1 depending on class.
combined <- createDummyFeatures(
combined, target = "Survived",
cols = c(
"Pclass",
"Sex",
"Embarked"
)
)
summarizeColumns(combined) %>%
kable(digits = 2)
| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| Survived | factor | 0 | NA | 0.26 | NA | NA | 342.00 | 967.00 | 2 |
| Age | numeric | 0 | 0.00 | 1.00 | 0.00 | 0.7 | -2.31 | 3.89 | 0 |
| SibSp | numeric | 0 | 0.00 | 1.00 | -0.48 | 0.0 | -0.48 | 7.20 | 0 |
| Parch | numeric | 0 | 0.00 | 1.00 | -0.44 | 0.0 | -0.44 | 9.95 | 0 |
| Fare | numeric | 0 | 0.00 | 1.00 | -0.36 | 0.2 | -0.64 | 9.26 | 0 |
| dataset | character | 0 | NA | 0.32 | NA | NA | 418.00 | 891.00 | 2 |
| Pclass.1 | numeric | 0 | 0.25 | 0.43 | 0.00 | 0.0 | 0.00 | 1.00 | 0 |
| Pclass.2 | numeric | 0 | 0.21 | 0.41 | 0.00 | 0.0 | 0.00 | 1.00 | 0 |
| Pclass.3 | numeric | 0 | 0.54 | 0.50 | 1.00 | 0.0 | 0.00 | 1.00 | 0 |
| Sex.female | numeric | 0 | 0.36 | 0.48 | 0.00 | 0.0 | 0.00 | 1.00 | 0 |
| Sex.male | numeric | 0 | 0.64 | 0.48 | 1.00 | 0.0 | 0.00 | 1.00 | 0 |
| Embarked.C | numeric | 0 | 0.21 | 0.40 | 0.00 | 0.0 | 0.00 | 1.00 | 0 |
| Embarked.Q | numeric | 0 | 0.09 | 0.29 | 0.00 | 0.0 | 0.00 | 1.00 | 0 |
| Embarked.S | numeric | 0 | 0.70 | 0.46 | 1.00 | 0.0 | 0.00 | 1.00 | 0 |
We are done processing the input data. We need to split the data frame back into train and test data frames. Again note that a lot more could be done, but the point of this kernel is to show how to perform some tasks with MLR.
train <- combined %>%
filter(dataset == "train") %>%
select(-dataset)
test <- combined %>%
filter(dataset == "test") %>%
select(-dataset)
Xgboost is a boosting algorithm built on decision trees. The first step is to create a task, which is essentially a data set plus some metadata, such as which column is the target. Create one for each of the train and test data sets. The target argument says which column to predict; every other column is assumed to be an explanatory feature.
trainTask <- makeClassifTask(data = train, target = "Survived", positive = 1)
testTask <- makeClassifTask(data = test, target = "Survived")
Note: the mlr processing functions used above also work on these task objects. If used, you don’t have to specify the target column since that information is contained in the task object.
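For example, a quick sketch (not actually needed here, since the underlying data frame has already been processed): normalization and dummy encoding can be applied straight to a task, and the target column is protected automatically.
# Processing functions also accept a task; the target ("Survived") does not
# need to be specified and is left untouched.
trainTask_norm <- normalizeFeatures(trainTask)
testTask_dummy <- createDummyFeatures(testTask)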
Now create a learner and a model. A learner specifies an algorithm (and its settings), while a model is the result of training that learner on a task (data set).
set.seed(1)
# Create an xgboost learner that is classification based and outputs
# labels (as opposed to probabilities)
xgb_learner <- makeLearner(
"classif.xgboost",
predict.type = "response",
par.vals = list(
objective = "binary:logistic",
eval_metric = "error",
nrounds = 200
)
)
# Create a model
xgb_model <- train(xgb_learner, task = trainTask)
Now we can make a prediction. The mlr package assumes that the "Survived" column in the test data set holds the correct answers and places these in the "truth" column. In our case they are just the values we imputed earlier (all zeros), so the truth column is meaningless here.
result <- predict(xgb_model, testTask)
head(result$data) %>%
kable()
| id | truth | response |
|---|---|---|
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 0 | 0 |
| 4 | 0 | 1 |
| 5 | 0 | 0 |
| 6 | 0 | 0 |
Create a submission file for Kaggle.
prediction <- result$data %>%
select(PassengerID = id, Survived = response) %>%
# Put back the original passenger IDs. No sorting has happened, so
# everything still matches up.
mutate(PassengerID = test_orig$PassengerId)
#write_csv(prediction, "initial_prediction.csv")
This scored 0.73206 and ranked 6785 out of 7567 on the leaderboard, which is pretty much the bottom. However, we achieved that result with a trivial amount of effort; we didn't even tune our model's hyper-parameters.
We can improve on that performance by tuning the hyper-parameters. You can view all the parameters of the xgboost algorithm using the mlr package, and a simple approach would be to adjust them manually and see how performance changes.
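For instance, a minimal sketch of manual tuning (the values below are arbitrary examples, not tuned):
# Override a couple of hyper-parameters on the existing learner and re-train.
manual_learner <- setHyperPars(xgb_learner, par.vals = list(eta = 0.1, max_depth = 4))
manual_model <- train(manual_learner, task = trainTask)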
To read up on the parameters, see here:
http://xgboost.readthedocs.io/en/latest/parameter.html
# To see all the parameters of the xgboost classifier
getParamSet("classif.xgboost")
## Type len Def Constr
## booster discrete - gbtree gbtree,gblinear,dart
## silent integer - 0 -Inf to Inf
## eta numeric - 0.3 0 to 1
## gamma numeric - 0 0 to Inf
## max_depth integer - 6 1 to Inf
## min_child_weight numeric - 1 0 to Inf
## subsample numeric - 1 0 to 1
## colsample_bytree numeric - 1 0 to 1
## colsample_bylevel numeric - 1 0 to 1
## num_parallel_tree integer - 1 1 to Inf
## lambda numeric - 0 0 to Inf
## lambda_bias numeric - 0 0 to Inf
## alpha numeric - 0 0 to Inf
## objective untyped - binary:logistic -
## eval_metric untyped - error -
## base_score numeric - 0.5 -Inf to Inf
## max_delta_step numeric - 0 0 to Inf
## missing numeric - <NULL> -Inf to Inf
## nthread integer - - 1 to Inf
## nrounds integer - 1 1 to Inf
## feval untyped - <NULL> -
## verbose integer - 1 0 to 2
## print_every_n integer - 1 1 to Inf
## early_stopping_rounds integer - <NULL> 1 to Inf
## maximize logical - <NULL> -
## sample_type discrete - uniform uniform,weighted
## normalize_type discrete - tree tree,forest
## rate_drop numeric - 0 0 to 1
## skip_drop numeric - 0 0 to 1
## Req Tunable Trafo
## booster - TRUE -
## silent - FALSE -
## eta - TRUE -
## gamma - TRUE -
## max_depth - TRUE -
## min_child_weight - TRUE -
## subsample - TRUE -
## colsample_bytree - TRUE -
## colsample_bylevel - TRUE -
## num_parallel_tree - TRUE -
## lambda - TRUE -
## lambda_bias - TRUE -
## alpha - TRUE -
## objective - FALSE -
## eval_metric - FALSE -
## base_score - FALSE -
## max_delta_step - TRUE -
## missing - FALSE -
## nthread - FALSE -
## nrounds - TRUE -
## feval - FALSE -
## verbose - FALSE -
## print_every_n Y FALSE -
## early_stopping_rounds - FALSE -
## maximize - FALSE -
## sample_type Y TRUE -
## normalize_type Y TRUE -
## rate_drop Y TRUE -
## skip_drop Y TRUE -
The summary table above also lists which parameters can be tuned automatically in the “Tunable” column. The following code will perform this automated tuning. The first step is to define which terms to tune/optimize.
One note about the search space: when drawing uniformly at random from, say, 0.01 to 1 for lambda, you won't adequately search the low end of the range (e.g. 0.01 to 0.02). You can increase the number of samples taken in that region using the trafo (transformation) argument when defining the parameter in makeParamSet. I have done this in the code below; more on it after the code block.
xgb_params <- makeParamSet(
# The number of trees in the model (each one built sequentially)
makeIntegerParam("nrounds", lower = 100, upper = 500),
# number of splits in each tree
makeIntegerParam("max_depth", lower = 1, upper = 10),
# "shrinkage" - prevents overfitting
makeNumericParam("eta", lower = .1, upper = .5),
# L2 regularization - prevents overfitting
makeNumericParam("lambda", lower = -1, upper = 0, trafo = function(x) 10^x)
)
For lambda, a random value will be chosen (uniformly) between -1 and 0. That value is then transformed to the parameter using 10 ^ x, meaning that the parameter range is 10^-1 (.1) to 10^0 (1). This will increase the number of samples taken for small values of the parameter, and is effectively sampling on the log scale of the range.
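To make the effect concrete (simple arithmetic, just for illustration):
# A value drawn uniformly on [-1, 0] is passed through 10^x before being
# handed to the learner, e.g. a draw of -0.5 becomes:
10^(-0.5)   # about 0.316
# Half of the draws fall in [-1, -0.5], i.e. lambda roughly in [0.1, 0.32],
# so small lambda values are sampled far more densely than a plain uniform
# draw over [0.1, 1] would give.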
The next step is to define how we will search the parameter space (random, grid, etc.). We will use random search. On my laptop, I ran 50 iterations in under a minute, but it takes longer than 20 minutes in the Kaggle kernel, so I have reduced it here. The rest of the script (and the final score I report) assumes 50 iterations.
control <- makeTuneControlRandom(maxit = 1)
The last setup step is to define how we will evaluate the different sets of randomly-chosen parameters. Here, I am going to use 4-fold cross-validation. In this approach, our training data is split into 4 equal groups. The model is trained on 3 of the four groups and evaluated on the 4th. This process repeats until each of the four groups has been used as the validation set. Performance measures are then averaged into a final score.
# Create a description of the resampling plan
resample_desc <- makeResampleDesc("CV", iters = 4)
With all of our settings complete, we perform the tuning.
tuned_params <- tuneParams(
learner = xgb_learner,
task = trainTask,
resampling = resample_desc,
par.set = xgb_params,
control = control
)
After tuning, we can create a new xgboost model using the parameters that gave the best results. We then train and predict using that new model.
# Create a new model using tuned hyperparameters
xgb_tuned_learner <- setHyperPars(
learner = xgb_learner,
par.vals = tuned_params$x
)
# Re-train the model using the tuned hyperparameters (and the full training set)
xgb_model <- train(xgb_tuned_learner, trainTask)
# Make a new prediction
result <- predict(xgb_model, testTask)
prediction <- result$data %>%
select(PassengerID = id, Survived = response) %>%
# Put back the original passenger IDs. No sorting has happened, so
# everything still matches up.
mutate(PassengerID = test_orig$PassengerId)
#write_csv(prediction, "final_prediction.csv")
Tuning the hyper-parameters increased the score to .76077 with a rank of 5973 out of 7472. Further improvements could be gained by doing more refined imputation, feature creation, better tuning, etc.
XGBoost (Extreme Gradient Boosting) is an optimized distributed gradient boosting library.
Parallel Computing: It supports parallel processing (using OpenMP); i.e., when you run xgboost it will, by default, use all the cores of your machine.
Regularization: I believe this is the biggest advantage of xgboost. Standard GBM implementations have no provision for regularization, a technique used to avoid overfitting in linear and tree-based models.
Enabled Cross Validation: In R, we usually use external packages such as caret and mlr to obtain CV results. But, xgboost is enabled with internal CV function (we’ll see below).
Missing Values: XGBoost is designed to handle missing values internally. The missing values are treated in such a manner that if there exists any trend in missing values, it is captured by the model.
Flexibility: In addition to regression, classification, and ranking problems, it also supports user-defined objective functions. An objective function measures the performance of the model given a certain set of parameters. It supports user-defined evaluation metrics as well.
Availability: Currently, it is available for programming languages such as R, Python, Java, Julia, and Scala.
Save and Reload: XGBoost lets us save our data matrix and model and reload them later. If we have a large data set, we can simply save the model and reuse it in the future instead of repeating the computation (a minimal sketch of this follows the list below).
Tree Pruning: Unlike GBM, where tree growing stops as soon as a split with negative loss reduction is encountered, XGBoost grows the tree up to max_depth and then prunes backward, removing splits whose improvement in the loss function is below a threshold.
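As a minimal sketch of the save-and-reload feature (the file names are made up, and xgb1/dtrain refer to the trained booster and DMatrix built later in this section):
# Persist a trained booster and a prepared DMatrix, then reload them later.
# xgb.save()/xgb.load() handle the model; xgb.DMatrix.save() handles the data.
xgb.save(xgb1, "xgb_model.bin")
xgb1_reloaded <- xgb.load("xgb_model.bin")
xgb.DMatrix.save(dtrain, "dtrain.buffer")
dtrain_reloaded <- xgb.DMatrix("dtrain.buffer")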
XGBoost belongs to a family of boosting algorithms that convert weak learners into strong learners. A weak learner is one which is slightly better than random guessing. Let’s understand boosting first (in general).
Boosting is a sequential process: trees are grown one after the other, each using information from the previously grown trees. The process learns from the data gradually, trying to improve its predictions in each subsequent iteration. Let's look at a classic classification example:
Four classifiers (in 4 boxes), shown above, are trying hard to classify + and - classes as homogeneously as possible. Let’s understand this picture well.
Box 1: The first classifier creates a vertical line (split) at D1. It says anything to the left of D1 is + and anything to the right of D1 is -. However, this classifier misclassifies three + points.
Box 2: The next classifier says, 'Don't worry, I will correct your mistakes.' It gives more weight to the three misclassified + points (note their bigger size) and creates a vertical line at D2. Again, anything to the right of D2 is - and anything to the left is +. Still, it makes mistakes by incorrectly classifying three - points.
Box 3: The next classifier continues to lend support. Again, it gives more weight to the three misclassified - points and creates a horizontal line at D3. Still, this classifier fails to classify the circled points correctly. Remember that each of these classifiers has a misclassification error associated with it.
Boxes 1, 2, and 3 are weak classifiers. They will now be combined to create a strong classifier, Box 4.
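The reweighting idea in Boxes 1-3 can be sketched in a few lines of R. (This is AdaBoost-style reweighting, used here only to illustrate the general boosting intuition; it is not xgboost's exact mechanism.)
# Toy illustration: points the current weak classifier gets wrong receive
# larger weights before the next classifier is fit.
weights <- rep(1 / 10, 10)                            # ten points, equal weights
misclassified <- c(TRUE, TRUE, TRUE, rep(FALSE, 7))   # e.g. the three missed "+" points
err <- sum(weights[misclassified])                    # weighted error = 0.3
alpha <- 0.5 * log((1 - err) / err)                   # this classifier's vote strength
weights <- weights * exp(ifelse(misclassified, alpha, -alpha))
weights <- weights / sum(weights)                     # renormalize; missed points now weigh more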
That's the basic idea behind boosting algorithms: each new model capitalizes on the misclassifications/errors of the previous model and tries to reduce them. Now, let's come to XGBoost.
As we know, XGBoost can be used to solve both regression and classification problems, and it provides separate boosters for doing so. Let's see:
Classification Problems: To solve these, it uses the booster = gbtree parameter; i.e., trees are grown one after the other, and each iteration attempts to reduce the misclassification rate. The next tree is built by giving higher weight to the points the previous tree misclassified (as explained above).
Regression Problems: To solve these, we have two options: booster = gbtree and booster = gblinear. You already know gbtree. With gblinear, XGBoost builds a generalized linear model and optimizes it using regularization (L1, L2) and gradient descent; here, subsequent models are built on the residuals (actual - predicted) produced by previous iterations. Wondering what gradient descent is? A full treatment requires some math, but in simple words: gradient descent repeatedly adjusts the model's parameters in the direction that reduces the loss most quickly (the negative gradient), with a learning rate controlling the size of each step.
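A toy sketch of gradient descent itself (a made-up one-dimensional example, unrelated to xgboost or any data set):
# Minimize f(x) = (x - 3)^2 by repeatedly stepping against its gradient,
# 2 * (x - 3), scaled by a learning rate (step size).
x <- 0          # starting guess
eta <- 0.1      # learning rate
for (i in 1:50) {
  grad <- 2 * (x - 3)
  x <- x - eta * grad
}
x               # approximately 3, the minimizer of f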
Every parameter plays a significant role in the model's performance. Before tuning, let's first understand these parameters and their importance. In this article, I've only explained the most frequently used and tunable parameters; to see all of them, refer to the official documentation.
XGBoost parameters can be divided into three categories (as suggested by its authors):
General Parameters: Controls the booster type in the model which eventually drives overall functioning
Booster Parameters: Controls the performance of the selected booster
Learning Task Parameters: Sets and evaluates the learning process of the booster from the given data
1. General Parameters
2. Booster Parameters
As mentioned above, parameters for tree and linear boosters are different. Let’s understand each one of them:
Parameters for Tree Booster
Parameters for Linear Booster
The linear booster has relatively few parameters to tune, so it computes much faster than the gbtree booster.
3. Learning Task Parameters
These parameters specify methods for the loss function and model evaluation. In addition to the parameters listed below, you are free to use a customized objective / evaluation function.
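As a rough sketch of what a user-defined evaluation function looks like (the name misclass_eval is made up; the signature is the one xgb.train()/xgb.cv() expect for their feval argument):
# A custom evaluation metric receives the raw predictions and the DMatrix,
# and must return list(metric = <name>, value = <number>).
misclass_eval <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- mean(as.numeric(preds > 0.5) != labels)
  list(metric = "misclass", value = err)
}
# It would then be passed via the feval argument, e.g.:
# xgb.train(params, dtrain, nrounds = 10, feval = misclass_eval, maximize = FALSE)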
In this practical section, we'll learn to tune xgboost in two ways: using the xgboost package directly and using the MLR package.
I'll use the adult data set, which poses a classification problem: predict whether a given person's salary is <=50K or >50K.
I'll follow the most common and effective steps in parameter tuning:
#load libraries
library(data.table)
## Warning: package 'data.table' was built under R version 3.4.2
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(mlr)
#set variable names
setcol <- c("age",
"workclass",
"fnlwgt",
"education",
"education-num",
"marital-status",
"occupation",
"relationship",
"race",
"sex",
"capital-gain",
"capital-loss",
"hours-per-week",
"native-country",
"target")
#load data
train <- read.table("xgboost_adultdata.data",header = F,sep = ",",col.names = setcol,na.strings = c(" ?"),stringsAsFactors = F)
test <- read.table("xgboost_adulttest.test",header = F,sep = ",",col.names = setcol,skip = 1, na.strings = c(" ?"),stringsAsFactors = F)
#convert data frame to data table
setDT(train)
setDT(test)
#check missing values
table(is.na(train))
##
## FALSE TRUE
## 484153 4262
sapply(train, function(x) sum(is.na(x))/length(x))*100
## age workclass fnlwgt education education.num
## 0.000000 5.638647 0.000000 0.000000 0.000000
## marital.status occupation relationship race sex
## 0.000000 5.660146 0.000000 0.000000 0.000000
## capital.gain capital.loss hours.per.week native.country target
## 0.000000 0.000000 0.000000 1.790486 0.000000
table(is.na(test))
##
## FALSE TRUE
## 242012 2203
sapply(test, function(x) sum(is.na(x))/length(x))*100
## age workclass fnlwgt education education.num
## 0.000000 5.914870 0.000000 0.000000 0.000000
## marital.status occupation relationship race sex
## 0.000000 5.933296 0.000000 0.000000 0.000000
## capital.gain capital.loss hours.per.week native.country target
## 0.000000 0.000000 0.000000 1.682943 0.000000
#quick data cleaning
#remove extra character from target variable
library(stringr)
test[,target := substr(target,start = 1,stop = nchar(target)-1)]
#remove leading whitespaces
char_col <- colnames(train)[sapply(test,is.character)]
for(i in char_col)
set(train,j=i,value = str_trim(train[[i]],side = "left"))
for(i in char_col)
set(test,j=i,value = str_trim(test[[i]],side = "left"))
#set all missing value as "Missing"
train[is.na(train)] <- "Missing"
test[is.na(test)] <- "Missing"
To use the xgboost package, keep these things in mind:

+ Convert the categorical variables to numeric using one-hot encoding
+ For classification, if the dependent variable is a factor, convert it to numeric
R's base function model.matrix is a quick way to implement one-hot encoding. In the code below, ~.+0 encodes all categorical variables without producing an intercept. Alternatively, you can use the dummies package to accomplish the same task. Since the xgboost package accepts the target variable separately, we encode only the predictors:
#using one hot encoding
labels <- train$target
ts_label <- test$target
new_tr <- model.matrix(~.+0,data = train[,-c("target"),with=F])
new_ts <- model.matrix(~.+0,data = test[,-c("target"),with=F])
#convert factor to numeric
labels <- as.numeric(as.factor(labels))-1
ts_label <- as.numeric(as.factor(ts_label))-1
For xgboost, we'll use xgb.DMatrix to convert the data tables into the package's recommended matrix format:
#preparing matrix
dtrain <- xgb.DMatrix(data = new_tr,label = labels)
dtest <- xgb.DMatrix(data = new_ts,label=ts_label)
#default parameters
params <- list(
booster = "gbtree",
objective = "binary:logistic",
eta=0.3,
gamma=0,
max_depth=6,
min_child_weight=1,
subsample=1,
colsample_bytree=1
)
xgbcv <- xgb.cv(params = params
,data = dtrain
,nrounds = 100
,nfold = 5
,showsd = T
,stratified = T
,print.every.n = 10
,early.stop.round = 20
,maximize = F
)
## Warning: 'print.every.n' is deprecated.
## Use 'print_every_n' instead.
## See help("Deprecated") and help("xgboost-deprecated").
## Warning: 'early.stop.round' is deprecated.
## Use 'early_stopping_rounds' instead.
## See help("Deprecated") and help("xgboost-deprecated").
## [1] train-error:0.143607+0.001799 test-error:0.145450+0.004040
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [11] train-error:0.130901+0.001264 test-error:0.136298+0.003846
## [21] train-error:0.119975+0.001381 test-error:0.129971+0.004189
## [31] train-error:0.114570+0.000965 test-error:0.128129+0.004159
## [41] train-error:0.111038+0.000738 test-error:0.127084+0.004922
## [51] train-error:0.107990+0.000382 test-error:0.127422+0.004602
## [61] train-error:0.104987+0.001369 test-error:0.127422+0.004747
## Stopping. Best iteration:
## [47] train-error:0.109387+0.000237 test-error:0.126593+0.004664
##best iteration = 79
min(xgbcv$test.error.mean)
## Warning in min(xgbcv$test.error.mean): no non-missing arguments to min;
## returning Inf
## [1] Inf
#0.1263
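The warning above appears because recent versions of xgboost store the per-round CV results in the evaluation_log element rather than as top-level columns. A small sketch of reading the best test error from there (assuming a reasonably current xgboost release):
# The CV history lives in xgbcv$evaluation_log, a data.table with one row
# per boosting round.
min(xgbcv$evaluation_log$test_error_mean)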
#first default - model training
xgb1 <- xgb.train(
params = params
,data = dtrain
,nrounds = 79
,watchlist = list(val=dtest,train=dtrain)
,print.every.n = 10
,early.stop.round = 10
,maximize = F
,eval_metric = "error"
)
## Warning: 'print.every.n' is deprecated.
## Use 'print_every_n' instead.
## See help("Deprecated") and help("xgboost-deprecated").
## Warning: 'early.stop.round' is deprecated.
## Use 'early_stopping_rounds' instead.
## See help("Deprecated") and help("xgboost-deprecated").
## [1] val-error:0.143726 train-error:0.144805
## Multiple eval metrics are present. Will use train_error for early stopping.
## Will train until train_error hasn't improved in 10 rounds.
##
## [11] val-error:0.131073 train-error:0.131568
## [21] val-error:0.127879 train-error:0.122140
## [31] val-error:0.126589 train-error:0.115168
## [41] val-error:0.125914 train-error:0.111821
## [51] val-error:0.125914 train-error:0.110347
## [61] val-error:0.126466 train-error:0.109149
## [71] val-error:0.126466 train-error:0.106938
## [79] val-error:0.126282 train-error:0.105095
#model prediction
xgbpred <- predict(xgb1,dtest)
xgbpred <- ifelse(xgbpred > 0.5,1,0)
#confusion matrix
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:mlr':
##
## train
## The following object is masked from 'package:purrr':
##
## lift
confusionMatrix(xgbpred, ts_label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 11800 1421
## 1 635 2425
##
## Accuracy : 0.8737
## 95% CI : (0.8685, 0.8788)
## No Information Rate : 0.7638
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6235
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9489
## Specificity : 0.6305
## Pos Pred Value : 0.8925
## Neg Pred Value : 0.7925
## Prevalence : 0.7638
## Detection Rate : 0.7248
## Detection Prevalence : 0.8121
## Balanced Accuracy : 0.7897
##
## 'Positive' Class : 0
##
#Accuracy - 86.54%
#view variable importance plot
mat <- xgb.importance(feature_names = colnames(new_tr),model = xgb1)
xgb.plot.importance(importance_matrix = mat[1:20]) #first 20 variables
Let's proceed to the random/grid search procedure and attempt to find better accuracy. From here on, we'll be using the MLR package for model building. As a quick reminder, the MLR package wraps the data and the algorithm in its own objects (tasks and learners), as shown below. Also keep in mind that mlr's task functions don't accept character variables, so we need to convert them to factors before creating the tasks:
#convert characters to factors
fact_col <- colnames(train)[sapply(train,is.character)]
for(i in fact_col)
set(train,j=i,value = factor(train[[i]]))
for(i in fact_col)
set(test,j=i,value = factor(test[[i]]))
#create tasks
traintask <- makeClassifTask(data = train,target = "target")
## Warning in makeTask(type = type, data = data, weights = weights, blocking
## = blocking, : Provided data is not a pure data.frame but from class
## data.table, hence it will be converted.
testtask <- makeClassifTask(data = test,target = "target")
## Warning in makeTask(type = type, data = data, weights = weights, blocking
## = blocking, : Provided data is not a pure data.frame but from class
## data.table, hence it will be converted.
#do one hot encoding
traintask <- createDummyFeatures(obj = traintask)
testtask <- createDummyFeatures(obj = testtask)
#create learner
lrn <- makeLearner("classif.xgboost",predict.type = "response")
lrn$par.vals <- list(
objective="binary:logistic",
eval_metric="error",
nrounds=1L,
eta=0.1
)
#set parameter space
params <- makeParamSet(
makeDiscreteParam("booster",values = c("gbtree","gblinear")),
makeIntegerParam("max_depth",lower = 3L,upper = 10L),
makeNumericParam("min_child_weight",lower = 1L,upper = 10L),
makeNumericParam("subsample",lower = 0.5,upper = 1),
makeNumericParam("colsample_bytree",lower = 0.5,upper = 1)
)
#set resampling strategy
rdesc <- makeResampleDesc("CV",stratify = T,iters=5L)
#search strategy
ctrl <- makeTuneControlRandom(maxit = 5L)
#set parallel backend
#library(parallel)
#library(parallelMap)
#parallelStartSocket(cpus = 2)
#parameter tuning
mytune <- tuneParams(learner = lrn
,task = traintask
,resampling = rdesc
,measures = acc
,par.set = params
,control = ctrl
,show.info = T)
## [Tune] Started tuning learner classif.xgboost for parameter set:
## Type len Def Constr Req Tunable Trafo
## booster discrete - - gbtree,gblinear - TRUE -
## max_depth integer - - 3 to 10 - TRUE -
## min_child_weight numeric - - 1 to 10 - TRUE -
## subsample numeric - - 0.5 to 1 - TRUE -
## colsample_bytree numeric - - 0.5 to 1 - TRUE -
## With control class: TuneControlRandom
## Imputation value: -0
## [Tune-x] 1: booster=gblinear; max_depth=4; min_child_weight=9.89; subsample=0.773; colsample_bytree=0.52
## [1] train-error:0.239203
## [1] train-error:0.239126
## [1] train-error:0.239289
## [1] train-error:0.238397
## [1] train-error:0.239357
## [Tune-y] 1: acc.test.mean=0.761; time: 0.0 min
## [Tune-x] 2: booster=gbtree; max_depth=8; min_child_weight=6.76; subsample=0.872; colsample_bytree=0.651
## [1] train-error:0.143077
## [1] train-error:0.141003
## [1] train-error:0.144157
## [1] train-error:0.141003
## [1] train-error:0.144497
## [Tune-y] 2: acc.test.mean=0.854; time: 0.0 min
## [Tune-x] 3: booster=gbtree; max_depth=3; min_child_weight=7.23; subsample=0.694; colsample_bytree=0.805
## [1] train-error:0.155630
## [1] train-error:0.155591
## [1] train-error:0.156941
## [1] train-error:0.156666
## [1] train-error:0.156590
## [Tune-y] 3: acc.test.mean=0.844; time: 0.0 min
## [Tune-x] 4: booster=gbtree; max_depth=5; min_child_weight=6.51; subsample=0.993; colsample_bytree=0.528
## [1] train-error:0.153173
## [1] train-error:0.152981
## [1] train-error:0.147151
## [1] train-error:0.153864
## [1] train-error:0.152443
## [Tune-y] 4: acc.test.mean=0.847; time: 0.0 min
## [Tune-x] 5: booster=gbtree; max_depth=9; min_child_weight=5.68; subsample=0.679; colsample_bytree=0.62
## [1] train-error:0.142194
## [1] train-error:0.141656
## [1] train-error:0.140356
## [1] train-error:0.140312
## [1] train-error:0.139621
## [Tune-y] 5: acc.test.mean=0.854; time: 0.0 min
## [Tune] Result: booster=gbtree; max_depth=8; min_child_weight=6.76; subsample=0.872; colsample_bytree=0.651 : acc.test.mean=0.854
mytune$y #0.873069
## acc.test.mean
## 0.854427
#set hyperparameters
lrn_tune <- setHyperPars(lrn,par.vals = mytune$x)
#train model
xgmodel <- mlr::train(learner = lrn_tune,task = traintask)
## [1] train-error:0.143515
#predict model
xgpred <- predict(xgmodel,testtask)
confusionMatrix(xgpred$data$response,xgpred$data$truth)
## Confusion Matrix and Statistics
##
## Reference
## Prediction <=50K >50K
## <=50K 11852 1767
## >50K 583 2079
##
## Accuracy : 0.8557
## 95% CI : (0.8502, 0.861)
## No Information Rate : 0.7638
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5524
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9531
## Specificity : 0.5406
## Pos Pred Value : 0.8703
## Neg Pred Value : 0.7810
## Prevalence : 0.7638
## Detection Rate : 0.7280
## Detection Prevalence : 0.8365
## Balanced Accuracy : 0.7468
##
## 'Positive' Class : <=50K
##
#Accuracy : 0.8747
#stop parallelization
#parallelStop()