A tree can capture non-linear relationships. It finds groups in the data, reports the average value of the parameter of interest within each group, and so on. Several tree algorithms exist, such as CART (Classification and Regression Trees). In R there are several functions implementing them, for example rpart and xgboost. xgboost is more recent and more powerful, and we will use it throughout this session.
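For reference, before turning to xgboost, a single CART-style tree can be fit with rpart. This is only a minimal sketch on a built-in dataset (iris), not the Manhattan data we load below.
library(rpart)
# Fit one classification tree; each leaf reports the dominant class in that group
toyTree <- rpart(Species ~ ., data = iris, method = 'class')
toyTree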
## Warning: package 'glmnet' was built under R version 3.4.4
## Loading required package: Matrix
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.4.4
## Loaded glmnet 2.0-16
## Warning: package 'useful' was built under R version 3.4.4
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.4.4
## Warning: package 'coefplot' was built under R version 3.4.4
## Warning: package 'dygraphs' was built under R version 3.4.4
## Warning: package 'DiagrammeR' was built under R version 3.4.4
## Warning: package 'xgboost' was built under R version 3.4.4
## Parsed with column specification:
## cols(
## .default = col_integer(),
## TotalValue = col_double(),
## Borough = col_character(),
## SchoolDistrict = col_character(),
## FireService = col_character(),
## ZoneDist1 = col_character(),
## ZoneDist2 = col_character(),
## ZoneDist3 = col_character(),
## ZoneDist4 = col_character(),
## Class = col_character(),
## LandUse = col_character(),
## OwnerType = col_character(),
## NumFloors = col_double(),
## LotFront = col_double(),
## LotDepth = col_double(),
## BldgFront = col_double(),
## BldgDepth = col_double(),
## Extension = col_character(),
## Proximity = col_character(),
## IrregularLot = col_character(),
## LotType = col_character()
## # ... with 8 more columns
## )
## See spec(...) for full column specifications.
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 504 parsing failures.
## # A tibble: 5 x 5
##     row col     expected               actual file
##   <int> <chr>   <chr>                  <chr>  <chr>
## 1  3509 CommFAR no trailing characters .4     'data/manhattan_Train.csv'
## 2  3510 CommFAR no trailing characters .4     'data/manhattan_Train.csv'
## 3  3511 CommFAR no trailing characters .4     'data/manhattan_Train.csv'
## 4  3537 CommFAR no trailing characters .4     'data/manhattan_Train.csv'
## 5  3538 CommFAR no trailing characters .4     'data/manhattan_Train.csv'
## See problems(...) for more details.
dygraphs is a time-series plotting library; we will use it later to plot the model's evaluation log.
Before building the matrices we need a formula, and to write a formula we need to know what our response variable will be. The response variable should derive from the question we are trying to answer, so we start from the question.
## We will use HistoricDistrict as our response variable
table(land_train$HistoricDistrict)
##
## No Yes
## 23791 8459
histFormula <- HistoricDistrict ~ FireService +
ZoneDist1 + ZoneDist2 + Class + LandUse +
OwnerType + LotArea + BldgArea + ComArea +
ResArea + OfficeArea + RetailArea +
GarageArea + FactryArea + NumBldgs +
NumFloors + UnitsRes + UnitsTotal +
LotFront + LotDepth + BldgFront +
BldgDepth + LotType + Landmark + BuiltFAR +
Built + TotalValue - 1
landx_train <- build.x(histFormula, data = land_train, contrasts = TRUE, sparse = TRUE)
## No contrasts or sparse arguments in the build.y function
landy_train <- build.y(histFormula, data = land_train) %>% as.factor() %>% as.integer()-1
## Validation data matrices
landx_val <- build.x(histFormula, data = land_val, contrasts = TRUE, sparse = TRUE)
landy_val <- build.y(histFormula, data = land_val) %>% as.factor() %>% as.integer()-1
## Test data matrices
landx_test <- build.x(histFormula, data = land_test, contrasts = TRUE, sparse = TRUE)
landy_test <- build.y(histFormula, data = land_test) %>% as.factor() %>% as.integer()-1
### XGBoost needs its own special self-contained variables for the X and Y matrices
xgTrain <- xgboost::xgb.DMatrix(data = landx_train, label=landy_train)
xgVal <- xgboost::xgb.DMatrix(data = landx_val, label=landy_val)
The intercept is removed from the formula (the - 1 at the end) because a decision tree does not need one.
xgboost cannot handle categorical variables stored as text; it was not designed for that. It needs a numeric matrix, so we have to transform the levels of the categorical variables into numeric columns representing the levels, and then xgboost can handle them. Even if we could keep categorical variables as text, that would also be processor intensive.
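As a tiny illustration of what build.x is doing (a sketch on made-up toy data, not the Manhattan set), the text levels of a factor become numeric indicator columns that xgboost can consume:
toy <- data.frame(HistDist = c('Yes', 'No', 'Yes'),
                  LotType  = c('Corner', 'Inside', 'Corner'),
                  LotArea  = c(100, 80, 120))
# Factor levels become 0/1 indicator columns alongside the numeric column
useful::build.x(HistDist ~ LotType + LotArea - 1, data = toy, contrasts = TRUE, sparse = TRUE)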
We then coerce the response variable into a factor with as.factor() and convert it to integers with as.integer(); the output will be 1 and 2 since we have only two factor levels. But xgboost needs 0 or 1, so we simply subtract 1 at the end (the "- 1").
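A quick illustration of that trick on a toy vector (assumed levels No/Yes, matching our response):
y <- c('No', 'Yes', 'Yes', 'No')
as.integer(as.factor(y))        # 1 2 2 1 -- factor codes start at 1
as.integer(as.factor(y)) - 1    # 0 1 1 0 -- what xgboost expects for binary:logistic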
xgb.train is the model-fitting function for xgboost, in the same way lm is for linear models.
xg1 <- xgboost::xgb.train(
data = xgTrain,
objective = 'binary:logistic',
eval_metric = 'logloss',
booster = 'gbtree',
nrounds = 1
)
xg1
## ##### xgb.Booster
## raw: 3.9 Kb
## call:
## xgboost::xgb.train(data = xgTrain, nrounds = 1, objective = "binary:logistic",
## eval_metric = "logloss", booster = "gbtree")
## params (as set within xgb.train):
## objective = "binary:logistic", eval_metric = "logloss", booster = "gbtree", silent = "1"
## xgb.attributes:
## niter
## callbacks:
## cb.print.evaluation(period = print_every_n)
## # of features: 86
## niter: 1
## nfeatures : 86
xgboost::xgb.plot.multi.trees(xg1, feature_names = colnames(landx_train))
## Warning: package 'bindrcpp' was built under R version 3.4.4
Function parameters:
eval_metric = 'logloss' tells how right or how wrong each binary decision is, scored on the predicted (posterior-like) probability.
feature_names = colnames(landx_train) assigns the variable names as the feature names of the tree.
$$
\text{logloss} = y\log(p) + (1 - y)\log(1 - p)
$$
where p is the predicted probability from the model and y is the observed response value (0 or 1).
In the case of a binary classification with a predicted probability of 0.85 you get:
p=0.85
log(p)*1+log(1-p)*(1-1)
## [1] -0.1625189
log(p)*0+log(1-p)*(1-0)
## [1] -1.89712
With a predicted probability of 0.85, an observation whose true value is 1 is penalized by -0.1625189, while one whose true value is 0 is penalized by -1.89712. Using the log-loss in this way scores each of the decisions of our tree; the smaller the loss, the better.
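As a sanity check, here is a small helper, written for this note rather than taken from any package, that computes the mean log-loss the way xgboost reports it (the negative of the expression above, averaged over observations):
logloss <- function(y, p) {
  # y is the 0/1 response, p is the predicted probability of class 1
  -mean(y * log(p) + (1 - y) * log(1 - p))
}
logloss(y = c(1, 0), p = c(0.85, 0.85))  # averages the two penalties computed above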
xg2 <- xgboost::xgb.train(
data = xgTrain,
objective = 'binary:logistic',
eval_metric = 'logloss',
booster = 'gbtree',
nrounds = 1,
watchlist = list(train=xgTrain)
)
## [1] train-logloss:0.584465
The value reported is the training log-loss, i.e. the model's score.
Tree depth is highly variable depending on the dataset used and the number of features or rows. To deal with this variability, we can use the random forest principle.
Instead of building a single tree, you build many trees on random samples of the features and observations and finally average them. This worked well, but then came boosting.
A gradient boosted tree, instead of growing many separate trees and averaging them, grows the trees sequentially, each one fit on the errors left by the previous trees. This is called an "additive tree" model. In most cases a gradient boosted tree will be the most accurate tree-based model.
Gradient boosting also works with other models, not only trees, so it can be applied more widely. (Personal note: it likely outperforms BMA, Bayesian Model Averaging, since it does not fit all models and average them; instead it improves each new model based on the previous fitted one.)
A random forest fits, say, 500 models and averages them, whereas a gradient boosted tree fits the models sequentially and uses the previous one as the input to the next, as the small sketch below illustrates.
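To make the additive idea concrete, here is a miniature hand-rolled boosting loop for squared-error loss, using rpart trees on a built-in dataset (mtcars). This is only a sketch of the principle, not how xgboost is implemented:
library(rpart)
eta    <- 0.1                                   # learning rate
pred   <- rep(mean(mtcars$mpg), nrow(mtcars))   # start from the overall mean
fitDat <- mtcars
for (i in 1:50) {
  fitDat$resid <- mtcars$mpg - pred                            # what the ensemble still gets wrong
  stump <- rpart(resid ~ . - mpg, data = fitDat, maxdepth = 2) # small tree fit to the residuals
  pred  <- pred + eta * predict(stump, fitDat)                 # additive update
}
mean((mtcars$mpg - pred)^2)   # training MSE keeps shrinking as rounds are added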
xg3 <- xgboost::xgb.train(
data = xgTrain,
objective = 'binary:logistic',
eval_metric = 'logloss',
booster = 'gbtree',
nrounds = 100,
watchlist = list(train=xgTrain),
print_every_n = 1
)
## [1] train-logloss:0.584465
## [2] train-logloss:0.523592
## [3] train-logloss:0.481711
## [4] train-logloss:0.455022
## [5] train-logloss:0.434317
## [6] train-logloss:0.420081
## [7] train-logloss:0.407247
## [8] train-logloss:0.399616
## [9] train-logloss:0.392878
## [10] train-logloss:0.389032
## [11] train-logloss:0.384404
## [12] train-logloss:0.380103
## [13] train-logloss:0.371857
## [14] train-logloss:0.366398
## [15] train-logloss:0.364025
## [16] train-logloss:0.358663
## [17] train-logloss:0.355455
## [18] train-logloss:0.353247
## [19] train-logloss:0.349883
## [20] train-logloss:0.344666
## [21] train-logloss:0.342348
## [22] train-logloss:0.337295
## [23] train-logloss:0.331498
## [24] train-logloss:0.330276
## [25] train-logloss:0.328652
## [26] train-logloss:0.326020
## [27] train-logloss:0.320825
## [28] train-logloss:0.320375
## [29] train-logloss:0.317964
## [30] train-logloss:0.316191
## [31] train-logloss:0.314057
## [32] train-logloss:0.313288
## [33] train-logloss:0.309330
## [34] train-logloss:0.307408
## [35] train-logloss:0.306567
## [36] train-logloss:0.301784
## [37] train-logloss:0.300585
## [38] train-logloss:0.297599
## [39] train-logloss:0.294018
## [40] train-logloss:0.292622
## [41] train-logloss:0.292115
## [42] train-logloss:0.291807
## [43] train-logloss:0.290997
## [44] train-logloss:0.289898
## [45] train-logloss:0.289011
## [46] train-logloss:0.286688
## [47] train-logloss:0.285190
## [48] train-logloss:0.282824
## [49] train-logloss:0.282247
## [50] train-logloss:0.281675
## [51] train-logloss:0.281264
## [52] train-logloss:0.279706
## [53] train-logloss:0.279425
## [54] train-logloss:0.278365
## [55] train-logloss:0.275456
## [56] train-logloss:0.273506
## [57] train-logloss:0.271763
## [58] train-logloss:0.271414
## [59] train-logloss:0.270705
## [60] train-logloss:0.269869
## [61] train-logloss:0.268367
## [62] train-logloss:0.266514
## [63] train-logloss:0.265876
## [64] train-logloss:0.265632
## [65] train-logloss:0.264709
## [66] train-logloss:0.263314
## [67] train-logloss:0.261996
## [68] train-logloss:0.261721
## [69] train-logloss:0.261094
## [70] train-logloss:0.260565
## [71] train-logloss:0.259542
## [72] train-logloss:0.258185
## [73] train-logloss:0.257558
## [74] train-logloss:0.257038
## [75] train-logloss:0.256891
## [76] train-logloss:0.254984
## [77] train-logloss:0.254406
## [78] train-logloss:0.253683
## [79] train-logloss:0.253175
## [80] train-logloss:0.251948
## [81] train-logloss:0.250701
## [82] train-logloss:0.250029
## [83] train-logloss:0.248785
## [84] train-logloss:0.247689
## [85] train-logloss:0.246411
## [86] train-logloss:0.246245
## [87] train-logloss:0.245284
## [88] train-logloss:0.243597
## [89] train-logloss:0.240684
## [90] train-logloss:0.239559
## [91] train-logloss:0.238746
## [92] train-logloss:0.237811
## [93] train-logloss:0.237024
## [94] train-logloss:0.236848
## [95] train-logloss:0.236507
## [96] train-logloss:0.235101
## [97] train-logloss:0.233123
## [98] train-logloss:0.232634
## [99] train-logloss:0.231711
## [100] train-logloss:0.231034
The call fits 100 trees sequentially, each one using the previous ensemble as its input, and displays the training log-loss for every round.
We can see the models improving: the training log-loss keeps dropping as the rounds go on, from 0.58 to 0.23 over 100 rounds, which is a big improvement.
xg2 <- xgboost::xgb.train(
data = xgTrain,
objective = 'binary:logistic',
eval_metric = 'logloss',
booster = 'gbtree',
nrounds = 500,
watchlist = list(train=xgTrain),
print_every_n = 10
)
## [1] train-logloss:0.584465
## [11] train-logloss:0.384404
## [21] train-logloss:0.342348
## [31] train-logloss:0.314057
## [41] train-logloss:0.292115
## [51] train-logloss:0.281264
## [61] train-logloss:0.268367
## [71] train-logloss:0.259542
## [81] train-logloss:0.250701
## [91] train-logloss:0.238746
## [101] train-logloss:0.230196
## [111] train-logloss:0.221640
## [121] train-logloss:0.215480
## [131] train-logloss:0.209816
## [141] train-logloss:0.202680
## [151] train-logloss:0.194716
## [161] train-logloss:0.187202
## [171] train-logloss:0.181328
## [181] train-logloss:0.176260
## [191] train-logloss:0.169026
## [201] train-logloss:0.162839
## [211] train-logloss:0.157698
## [221] train-logloss:0.154665
## [231] train-logloss:0.150882
## [241] train-logloss:0.145548
## [251] train-logloss:0.141210
## [261] train-logloss:0.136176
## [271] train-logloss:0.131470
## [281] train-logloss:0.126410
## [291] train-logloss:0.122303
## [301] train-logloss:0.119627
## [311] train-logloss:0.116382
## [321] train-logloss:0.111622
## [331] train-logloss:0.108995
## [341] train-logloss:0.106465
## [351] train-logloss:0.104001
## [361] train-logloss:0.101634
## [371] train-logloss:0.098924
## [381] train-logloss:0.096887
## [391] train-logloss:0.093586
## [401] train-logloss:0.091510
## [411] train-logloss:0.088995
## [421] train-logloss:0.086757
## [431] train-logloss:0.084118
## [441] train-logloss:0.082258
## [451] train-logloss:0.079296
## [461] train-logloss:0.077742
## [471] train-logloss:0.076195
## [481] train-logloss:0.075108
## [491] train-logloss:0.073621
## [500] train-logloss:0.071645
nrounds = 500 says how many models we want to fit within the boosting. nrounds could be made arbitrarily large, but at some point it will overfit the model, so we need to guard against that.
xg3 <- xgboost::xgb.train(
data = xgTrain,
objective = 'binary:logistic',
eval_metric = 'logloss',
booster = 'gbtree',
nrounds = 500,
watchlist = list(train=xgTrain, validate=xgVal),
print_every_n = 10
)
## [1] train-logloss:0.584465 validate-logloss:0.588327
## [11] train-logloss:0.384404 validate-logloss:0.405349
## [21] train-logloss:0.342348 validate-logloss:0.379235
## [31] train-logloss:0.314057 validate-logloss:0.364513
## [41] train-logloss:0.292115 validate-logloss:0.354458
## [51] train-logloss:0.281264 validate-logloss:0.351119
## [61] train-logloss:0.268367 validate-logloss:0.348306
## [71] train-logloss:0.259542 validate-logloss:0.346874
## [81] train-logloss:0.250701 validate-logloss:0.345490
## [91] train-logloss:0.238746 validate-logloss:0.343428
## [101] train-logloss:0.230196 validate-logloss:0.341732
## [111] train-logloss:0.221640 validate-logloss:0.340093
## [121] train-logloss:0.215480 validate-logloss:0.338649
## [131] train-logloss:0.209816 validate-logloss:0.339353
## [141] train-logloss:0.202680 validate-logloss:0.339085
## [151] train-logloss:0.194716 validate-logloss:0.338262
## [161] train-logloss:0.187202 validate-logloss:0.338643
## [171] train-logloss:0.181328 validate-logloss:0.338940
## [181] train-logloss:0.176260 validate-logloss:0.339741
## [191] train-logloss:0.169026 validate-logloss:0.339345
## [201] train-logloss:0.162839 validate-logloss:0.338801
## [211] train-logloss:0.157698 validate-logloss:0.337890
## [221] train-logloss:0.154665 validate-logloss:0.337555
## [231] train-logloss:0.150882 validate-logloss:0.337162
## [241] train-logloss:0.145548 validate-logloss:0.336569
## [251] train-logloss:0.141210 validate-logloss:0.337049
## [261] train-logloss:0.136176 validate-logloss:0.337330
## [271] train-logloss:0.131470 validate-logloss:0.336994
## [281] train-logloss:0.126410 validate-logloss:0.336350
## [291] train-logloss:0.122303 validate-logloss:0.338129
## [301] train-logloss:0.119627 validate-logloss:0.338971
## [311] train-logloss:0.116382 validate-logloss:0.340108
## [321] train-logloss:0.111622 validate-logloss:0.338240
## [331] train-logloss:0.108995 validate-logloss:0.339800
## [341] train-logloss:0.106465 validate-logloss:0.340334
## [351] train-logloss:0.104001 validate-logloss:0.340527
## [361] train-logloss:0.101634 validate-logloss:0.341221
## [371] train-logloss:0.098924 validate-logloss:0.342078
## [381] train-logloss:0.096887 validate-logloss:0.343685
## [391] train-logloss:0.093586 validate-logloss:0.342940
## [401] train-logloss:0.091510 validate-logloss:0.344616
## [411] train-logloss:0.088995 validate-logloss:0.345296
## [421] train-logloss:0.086757 validate-logloss:0.345500
## [431] train-logloss:0.084118 validate-logloss:0.346314
## [441] train-logloss:0.082258 validate-logloss:0.347069
## [451] train-logloss:0.079296 validate-logloss:0.348758
## [461] train-logloss:0.077742 validate-logloss:0.348970
## [471] train-logloss:0.076195 validate-logloss:0.349495
## [481] train-logloss:0.075108 validate-logloss:0.350932
## [491] train-logloss:0.073621 validate-logloss:0.351846
## [500] train-logloss:0.071645 validate-logloss:0.351765
## Displaying the logloss
head(xg3$evaluation_log, 20)
## iter train_logloss validate_logloss
## 1: 1 0.584465 0.588327
## 2: 2 0.523592 0.528923
## 3: 3 0.481711 0.491124
## 4: 4 0.455022 0.466328
## 5: 5 0.434317 0.447768
## 6: 6 0.420081 0.435185
## 7: 7 0.407247 0.423834
## 8: 8 0.399616 0.417848
## 9: 9 0.392878 0.412354
## 10: 10 0.389032 0.409110
## 11: 11 0.384404 0.405349
## 12: 12 0.380103 0.400807
## 13: 13 0.371857 0.394605
## 14: 14 0.366398 0.390580
## 15: 15 0.364025 0.389653
## 16: 16 0.358663 0.386122
## 17: 17 0.355455 0.384233
## 18: 18 0.353247 0.382846
## 19: 19 0.349883 0.381358
## 20: 20 0.344666 0.379638
tail(xg3$evaluation_log, 20)
## iter train_logloss validate_logloss
## 1: 481 0.075108 0.350932
## 2: 482 0.074867 0.350788
## 3: 483 0.074849 0.350794
## 4: 484 0.074690 0.350639
## 5: 485 0.074416 0.350912
## 6: 486 0.074292 0.350999
## 7: 487 0.074018 0.350911
## 8: 488 0.073870 0.351238
## 9: 489 0.073783 0.351419
## 10: 490 0.073667 0.351676
## 11: 491 0.073621 0.351846
## 12: 492 0.073560 0.351919
## 13: 493 0.073519 0.351889
## 14: 494 0.073273 0.351843
## 15: 495 0.072955 0.351730
## 16: 496 0.072499 0.351818
## 17: 497 0.072235 0.351622
## 18: 498 0.071956 0.351510
## 19: 499 0.071777 0.351419
## 20: 500 0.071645 0.351765
By adding the validation data to the boosting call and displaying its log-loss, we can see that at some point, even though the training log-loss continues to decrease, the validation log-loss stops improving and starts to rise.
dygraph(xg3$evaluation_log)
On this plot, when the training and validation curves start diverging, it means we are overfitting.
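One way to read that crossover point straight from the log (a sketch using the evaluation_log data.table that xgb.train stores on the fitted model):
bestRound <- which.min(xg3$evaluation_log$validate_logloss)
bestRound
xg3$evaluation_log[bestRound]   # the round where the validation log-loss bottoms out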
xg4 <- xgboost::xgb.train(
data = xgTrain,
objective = 'binary:logistic',
eval_metric = 'logloss',
booster = 'gbtree',
nrounds = 1000,
watchlist = list(train=xgTrain, validate=xgVal),
print_every_n = 10,
early_stopping_rounds = 70
)
## [1] train-logloss:0.584465 validate-logloss:0.588327
## Multiple eval metrics are present. Will use validate_logloss for early stopping.
## Will train until validate_logloss hasn't improved in 70 rounds.
##
## [11] train-logloss:0.384404 validate-logloss:0.405349
## [21] train-logloss:0.342348 validate-logloss:0.379235
## [31] train-logloss:0.314057 validate-logloss:0.364513
## [41] train-logloss:0.292115 validate-logloss:0.354458
## [51] train-logloss:0.281264 validate-logloss:0.351119
## [61] train-logloss:0.268367 validate-logloss:0.348306
## [71] train-logloss:0.259542 validate-logloss:0.346874
## [81] train-logloss:0.250701 validate-logloss:0.345490
## [91] train-logloss:0.238746 validate-logloss:0.343428
## [101] train-logloss:0.230196 validate-logloss:0.341732
## [111] train-logloss:0.221640 validate-logloss:0.340093
## [121] train-logloss:0.215480 validate-logloss:0.338649
## [131] train-logloss:0.209816 validate-logloss:0.339353
## [141] train-logloss:0.202680 validate-logloss:0.339085
## [151] train-logloss:0.194716 validate-logloss:0.338262
## [161] train-logloss:0.187202 validate-logloss:0.338643
## [171] train-logloss:0.181328 validate-logloss:0.338940
## [181] train-logloss:0.176260 validate-logloss:0.339741
## [191] train-logloss:0.169026 validate-logloss:0.339345
## [201] train-logloss:0.162839 validate-logloss:0.338801
## [211] train-logloss:0.157698 validate-logloss:0.337890
## [221] train-logloss:0.154665 validate-logloss:0.337555
## [231] train-logloss:0.150882 validate-logloss:0.337162
## [241] train-logloss:0.145548 validate-logloss:0.336569
## [251] train-logloss:0.141210 validate-logloss:0.337049
## [261] train-logloss:0.136176 validate-logloss:0.337330
## [271] train-logloss:0.131470 validate-logloss:0.336994
## [281] train-logloss:0.126410 validate-logloss:0.336350
## [291] train-logloss:0.122303 validate-logloss:0.338129
## [301] train-logloss:0.119627 validate-logloss:0.338971
## [311] train-logloss:0.116382 validate-logloss:0.340108
## [321] train-logloss:0.111622 validate-logloss:0.338240
## [331] train-logloss:0.108995 validate-logloss:0.339800
## [341] train-logloss:0.106465 validate-logloss:0.340334
## [351] train-logloss:0.104001 validate-logloss:0.340527
## Stopping. Best iteration:
## [281] train-logloss:0.126410 validate-logloss:0.336350
dygraph(xg4$evaluation_log)
xg4$best_iteration
## [1] 281
xg4$best_score
## validate-logloss
## 0.33635
early_stopping_rounds = 70 makes training stop if the validation log-loss does not improve for 70 consecutive rounds.
xgb.plot.importance(
xgb.importance(
xg4, feature_names=colnames(landx_train)
)
)
The plot above shows the variables sorted by importance. The more a variable was used in fitting the model, the more important it is.
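If you want the numbers behind the plot, the importance table itself can be inspected; as a sketch, the Gain column reflects how much each feature's splits reduced the loss:
impTable <- xgboost::xgb.importance(feature_names = colnames(landx_train), model = xg4)
head(impTable, 10)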
xg6 <- xgb.train(
data = xgTrain,
objective = 'binary:logistic',
eval_metric = 'logloss',
booster = 'gblinear',
nrounds = 1000,
watchlist = list(train=xgTrain, validate=xgVal),
print_every_n = 10,
early_stopping_rounds = 70
)
## [1] train-logloss:0.520332 validate-logloss:0.522429
## Multiple eval metrics are present. Will use validate_logloss for early stopping.
## Will train until validate_logloss hasn't improved in 70 rounds.
##
## [11] train-logloss:0.469052 validate-logloss:0.479333
## [21] train-logloss:0.465704 validate-logloss:0.477129
## [31] train-logloss:0.464385 validate-logloss:0.476583
## [41] train-logloss:0.463600 validate-logloss:0.476285
## [51] train-logloss:0.463058 validate-logloss:0.476110
## [61] train-logloss:0.462615 validate-logloss:0.475877
## [71] train-logloss:0.462274 validate-logloss:0.475774
## [81] train-logloss:0.461992 validate-logloss:0.475669
## [91] train-logloss:0.461737 validate-logloss:0.475531
## [101] train-logloss:0.461495 validate-logloss:0.475364
## [111] train-logloss:0.461285 validate-logloss:0.475213
## [121] train-logloss:0.461088 validate-logloss:0.475037
## [131] train-logloss:0.460911 validate-logloss:0.474864
## [141] train-logloss:0.460743 validate-logloss:0.474693
## [151] train-logloss:0.460580 validate-logloss:0.474509
## [161] train-logloss:0.460434 validate-logloss:0.474340
## [171] train-logloss:0.460302 validate-logloss:0.474207
## [181] train-logloss:0.460182 validate-logloss:0.474085
## [191] train-logloss:0.460061 validate-logloss:0.473952
## [201] train-logloss:0.459951 validate-logloss:0.473822
## [211] train-logloss:0.459849 validate-logloss:0.473709
## [221] train-logloss:0.459754 validate-logloss:0.473591
## [231] train-logloss:0.459665 validate-logloss:0.473490
## [241] train-logloss:0.459580 validate-logloss:0.473395
## [251] train-logloss:0.459497 validate-logloss:0.473305
## [261] train-logloss:0.459418 validate-logloss:0.473171
## [271] train-logloss:0.459350 validate-logloss:0.473103
## [281] train-logloss:0.459286 validate-logloss:0.473029
## [291] train-logloss:0.459227 validate-logloss:0.472957
## [301] train-logloss:0.459170 validate-logloss:0.472887
## [311] train-logloss:0.459117 validate-logloss:0.472822
## [321] train-logloss:0.459062 validate-logloss:0.472737
## [331] train-logloss:0.459016 validate-logloss:0.472687
## [341] train-logloss:0.458971 validate-logloss:0.472628
## [351] train-logloss:0.458929 validate-logloss:0.472566
## [361] train-logloss:0.458891 validate-logloss:0.472519
## [371] train-logloss:0.458855 validate-logloss:0.472476
## [381] train-logloss:0.458823 validate-logloss:0.472439
## [391] train-logloss:0.458791 validate-logloss:0.472401
## [401] train-logloss:0.458762 validate-logloss:0.472369
## [411] train-logloss:0.458734 validate-logloss:0.472340
## [421] train-logloss:0.458707 validate-logloss:0.472313
## [431] train-logloss:0.458681 validate-logloss:0.472286
## [441] train-logloss:0.458656 validate-logloss:0.472247
## [451] train-logloss:0.458632 validate-logloss:0.472214
## [461] train-logloss:0.458611 validate-logloss:0.472193
## [471] train-logloss:0.458591 validate-logloss:0.472170
## [481] train-logloss:0.458572 validate-logloss:0.472148
## [491] train-logloss:0.458554 validate-logloss:0.472128
## [501] train-logloss:0.458536 validate-logloss:0.472105
## [511] train-logloss:0.458519 validate-logloss:0.472090
## [521] train-logloss:0.458504 validate-logloss:0.472071
## [531] train-logloss:0.458489 validate-logloss:0.472058
## [541] train-logloss:0.458473 validate-logloss:0.472038
## [551] train-logloss:0.458459 validate-logloss:0.472016
## [561] train-logloss:0.458447 validate-logloss:0.472003
## [571] train-logloss:0.458434 validate-logloss:0.471992
## [581] train-logloss:0.458422 validate-logloss:0.471977
## [591] train-logloss:0.458411 validate-logloss:0.471968
## [601] train-logloss:0.458400 validate-logloss:0.471958
## [611] train-logloss:0.458390 validate-logloss:0.471950
## [621] train-logloss:0.458380 validate-logloss:0.471939
## [631] train-logloss:0.458370 validate-logloss:0.471928
## [641] train-logloss:0.458360 validate-logloss:0.471922
## [651] train-logloss:0.458351 validate-logloss:0.471907
## [661] train-logloss:0.458342 validate-logloss:0.471898
## [671] train-logloss:0.458333 validate-logloss:0.471887
## [681] train-logloss:0.458325 validate-logloss:0.471880
## [691] train-logloss:0.458317 validate-logloss:0.471873
## [701] train-logloss:0.458310 validate-logloss:0.471865
## [711] train-logloss:0.458302 validate-logloss:0.471860
## [721] train-logloss:0.458295 validate-logloss:0.471854
## [731] train-logloss:0.458288 validate-logloss:0.471842
## [741] train-logloss:0.458281 validate-logloss:0.471836
## [751] train-logloss:0.458275 validate-logloss:0.471826
## [761] train-logloss:0.458269 validate-logloss:0.471816
## [771] train-logloss:0.458263 validate-logloss:0.471812
## [781] train-logloss:0.458257 validate-logloss:0.471807
## [791] train-logloss:0.458251 validate-logloss:0.471804
## [801] train-logloss:0.458246 validate-logloss:0.471799
## [811] train-logloss:0.458240 validate-logloss:0.471794
## [821] train-logloss:0.458235 validate-logloss:0.471789
## [831] train-logloss:0.458230 validate-logloss:0.471784
## [841] train-logloss:0.458225 validate-logloss:0.471779
## [851] train-logloss:0.458219 validate-logloss:0.471769
## [861] train-logloss:0.458215 validate-logloss:0.471764
## [871] train-logloss:0.458210 validate-logloss:0.471762
## [881] train-logloss:0.458206 validate-logloss:0.471757
## [891] train-logloss:0.458201 validate-logloss:0.471750
## [901] train-logloss:0.458197 validate-logloss:0.471746
## [911] train-logloss:0.458193 validate-logloss:0.471743
## [921] train-logloss:0.458188 validate-logloss:0.471738
## [931] train-logloss:0.458184 validate-logloss:0.471733
## [941] train-logloss:0.458180 validate-logloss:0.471731
## [951] train-logloss:0.458176 validate-logloss:0.471723
## [961] train-logloss:0.458173 validate-logloss:0.471720
## [971] train-logloss:0.458169 validate-logloss:0.471717
## [981] train-logloss:0.458166 validate-logloss:0.471711
## [991] train-logloss:0.458162 validate-logloss:0.471708
## [1000] train-logloss:0.458159 validate-logloss:0.471705
coefplot(xg6, sort='magnitude')
## Warning: Removed 86 rows containing missing values (geom_errorbarh).
## Warning: Removed 86 rows containing missing values (geom_errorbarh).
xg7 <- xgb.train(
data = xgTrain,
objective = 'binary:logistic',
eval_metric = 'logloss',
booster = 'gblinear',
nrounds = 1000,
watchlist = list(train=xgTrain, validate=xgVal),
print_every_n = 10,
early_stopping_rounds = 70,
alpha=1000, lambda=1250
)
## [1] train-logloss:0.591774 validate-logloss:0.593570
## Multiple eval metrics are present. Will use validate_logloss for early stopping.
## Will train until validate_logloss hasn't improved in 70 rounds.
##
## [11] train-logloss:0.574115 validate-logloss:0.576900
## [21] train-logloss:0.574119 validate-logloss:0.576912
## [31] train-logloss:0.574119 validate-logloss:0.576913
## [41] train-logloss:0.574119 validate-logloss:0.576913
## [51] train-logloss:0.574119 validate-logloss:0.576913
## [61] train-logloss:0.574119 validate-logloss:0.576913
## [71] train-logloss:0.574119 validate-logloss:0.576913
## Stopping. Best iteration:
## [6] train-logloss:0.574094 validate-logloss:0.576799
dygraph(xg6$evaluation_log)
xg8 <- xgb.train(
data = xgTrain,
objective = 'binary:logistic',
eval_metric = 'logloss',
booster = 'gbtree',
nrounds = 1000,
watchlist = list(train=xgTrain, validate=xgVal),
print_every_n = 10,
early_stopping_rounds = 70,
max_depth=3
)
## [1] train-logloss:0.603659 validate-logloss:0.604209
## Multiple eval metrics are present. Will use validate_logloss for early stopping.
## Will train until validate_logloss hasn't improved in 70 rounds.
##
## [11] train-logloss:0.448362 validate-logloss:0.454028
## [21] train-logloss:0.414391 validate-logloss:0.423773
## [31] train-logloss:0.395228 validate-logloss:0.409642
## [41] train-logloss:0.382617 validate-logloss:0.400707
## [51] train-logloss:0.375534 validate-logloss:0.395706
## [61] train-logloss:0.366840 validate-logloss:0.389380
## [71] train-logloss:0.361331 validate-logloss:0.386436
## [81] train-logloss:0.356170 validate-logloss:0.383251
## [91] train-logloss:0.351118 validate-logloss:0.379851
## [101] train-logloss:0.347111 validate-logloss:0.376924
## [111] train-logloss:0.342857 validate-logloss:0.374518
## [121] train-logloss:0.339350 validate-logloss:0.372461
## [131] train-logloss:0.335317 validate-logloss:0.369910
## [141] train-logloss:0.331713 validate-logloss:0.367460
## [151] train-logloss:0.328550 validate-logloss:0.366371
## [161] train-logloss:0.325832 validate-logloss:0.365652
## [171] train-logloss:0.321048 validate-logloss:0.362536
## [181] train-logloss:0.319305 validate-logloss:0.361419
## [191] train-logloss:0.317201 validate-logloss:0.360907
## [201] train-logloss:0.313792 validate-logloss:0.360295
## [211] train-logloss:0.311035 validate-logloss:0.359793
## [221] train-logloss:0.308473 validate-logloss:0.359058
## [231] train-logloss:0.306801 validate-logloss:0.358688
## [241] train-logloss:0.303780 validate-logloss:0.357695
## [251] train-logloss:0.301109 validate-logloss:0.356285
## [261] train-logloss:0.299059 validate-logloss:0.355513
## [271] train-logloss:0.296234 validate-logloss:0.354178
## [281] train-logloss:0.294480 validate-logloss:0.353587
## [291] train-logloss:0.292278 validate-logloss:0.352933
## [301] train-logloss:0.290543 validate-logloss:0.353068
## [311] train-logloss:0.288375 validate-logloss:0.352250
## [321] train-logloss:0.286994 validate-logloss:0.352266
## [331] train-logloss:0.285033 validate-logloss:0.351646
## [341] train-logloss:0.283074 validate-logloss:0.351565
## [351] train-logloss:0.281737 validate-logloss:0.351329
## [361] train-logloss:0.279561 validate-logloss:0.351502
## [371] train-logloss:0.277843 validate-logloss:0.351063
## [381] train-logloss:0.275062 validate-logloss:0.349784
## [391] train-logloss:0.273416 validate-logloss:0.349883
## [401] train-logloss:0.272227 validate-logloss:0.349157
## [411] train-logloss:0.270217 validate-logloss:0.348371
## [421] train-logloss:0.268144 validate-logloss:0.347567
## [431] train-logloss:0.266266 validate-logloss:0.346804
## [441] train-logloss:0.264554 validate-logloss:0.346860
## [451] train-logloss:0.263219 validate-logloss:0.346315
## [461] train-logloss:0.262100 validate-logloss:0.346076
## [471] train-logloss:0.260832 validate-logloss:0.345466
## [481] train-logloss:0.259166 validate-logloss:0.345340
## [491] train-logloss:0.257377 validate-logloss:0.345598
## [501] train-logloss:0.255632 validate-logloss:0.345221
## [511] train-logloss:0.253833 validate-logloss:0.345369
## [521] train-logloss:0.252070 validate-logloss:0.345507
## [531] train-logloss:0.250976 validate-logloss:0.345718
## [541] train-logloss:0.248805 validate-logloss:0.345667
## [551] train-logloss:0.247620 validate-logloss:0.345625
## [561] train-logloss:0.246340 validate-logloss:0.345535
## [571] train-logloss:0.245156 validate-logloss:0.345764
## Stopping. Best iteration:
## [502] train-logloss:0.255389 validate-logloss:0.345064
xg9 <- xgb.train(
data = xgTrain,
objective = 'binary:logistic',
eval_metric = 'logloss',
booster = 'gbtree',
nrounds = 1000,
watchlist = list(train=xgTrain, validate=xgVal),
print_every_n = 10,
early_stopping_rounds = 70,
max_depth=8
)
## [1] train-logloss:0.572140 validate-logloss:0.579273
## Multiple eval metrics are present. Will use validate_logloss for early stopping.
## Will train until validate_logloss hasn't improved in 70 rounds.
##
## [11] train-logloss:0.330669 validate-logloss:0.381459
## [21] train-logloss:0.274391 validate-logloss:0.353267
## [31] train-logloss:0.246988 validate-logloss:0.344600
## [41] train-logloss:0.234957 validate-logloss:0.343297
## [51] train-logloss:0.212794 validate-logloss:0.341149
## [61] train-logloss:0.200223 validate-logloss:0.338897
## [71] train-logloss:0.182952 validate-logloss:0.338597
## [81] train-logloss:0.172923 validate-logloss:0.339493
## [91] train-logloss:0.159571 validate-logloss:0.336634
## [101] train-logloss:0.149797 validate-logloss:0.335678
## [111] train-logloss:0.140420 validate-logloss:0.337240
## [121] train-logloss:0.132089 validate-logloss:0.338788
## [131] train-logloss:0.124389 validate-logloss:0.339475
## [141] train-logloss:0.116982 validate-logloss:0.339648
## [151] train-logloss:0.108052 validate-logloss:0.339492
## [161] train-logloss:0.102289 validate-logloss:0.339879
## Stopping. Best iteration:
## [99] train-logloss:0.151200 validate-logloss:0.335353
xg10 <- xgb.train(
data = xgTrain,
objective = 'binary:logistic',
eval_metric = 'logloss',
booster = 'gbtree',
nrounds = 2500,
watchlist = list(train=xgTrain, validate=xgVal),
print_every_n = 100,
early_stopping_rounds = 70,
max_depth=8, eta=0.1
)
## [1] train-logloss:0.648202 validate-logloss:0.650631
## Multiple eval metrics are present. Will use validate_logloss for early stopping.
## Will train until validate_logloss hasn't improved in 70 rounds.
##
## [101] train-logloss:0.258530 validate-logloss:0.345895
## [201] train-logloss:0.201911 validate-logloss:0.333866
## [301] train-logloss:0.160906 validate-logloss:0.330675
## [401] train-logloss:0.131328 validate-logloss:0.328852
## [501] train-logloss:0.107839 validate-logloss:0.329401
## Stopping. Best iteration:
## [453] train-logloss:0.118348 validate-logloss:0.327837
The smaller your eta (the learning rate), the longer your boosting takes to converge. So when eta decreases, nrounds should increase.
xg10 <- xgb.train(
data = xgTrain,
objective = 'binary:logistic',
eval_metric = 'logloss',
booster = 'gbtree',
nrounds = 2500,
watchlist = list(train=xgTrain, validate=xgVal),
print_every_n = 100,
early_stopping_rounds = 70,
max_depth=8, eta=0.1,
subsample=0.5
)
## [1] train-logloss:0.648970 validate-logloss:0.651668
## Multiple eval metrics are present. Will use validate_logloss for early stopping.
## Will train until validate_logloss hasn't improved in 70 rounds.
##
## [101] train-logloss:0.248650 validate-logloss:0.342919
## [201] train-logloss:0.184253 validate-logloss:0.334730
## [301] train-logloss:0.140025 validate-logloss:0.333791
## Stopping. Best iteration:
## [274] train-logloss:0.150292 validate-logloss:0.333250
The result is close to the previous one but has its own error margin; it is slightly different from the previous fit.
xg10 <- xgb.train(
data = xgTrain,
objective = 'binary:logistic',
eval_metric = 'logloss',
booster = 'gbtree',
nrounds = 2500,
watchlist = list(train=xgTrain, validate=xgVal),
print_every_n = 100,
early_stopping_rounds = 70,
max_depth=8, eta=0.1,
subsample=0.5, colsample_bytree=0.5
)
## [1] train-logloss:0.651462 validate-logloss:0.653666
## Multiple eval metrics are present. Will use validate_logloss for early stopping.
## Will train until validate_logloss hasn't improved in 70 rounds.
##
## [101] train-logloss:0.266239 validate-logloss:0.351175
## [201] train-logloss:0.206605 validate-logloss:0.339870
## [301] train-logloss:0.163161 validate-logloss:0.338172
## [401] train-logloss:0.132554 validate-logloss:0.336817
## Stopping. Best iteration:
## [426] train-logloss:0.125878 validate-logloss:0.336071
Subsampling both the rows (subsample) and the columns (colsample_bytree) makes this very similar to a random forest.
In the next fit we grow 100 trees in parallel (num_parallel_tree = 100) instead of cascading them, with only 1 boosting round.
xg11 <- xgb.train(
data = xgTrain,
objective = 'binary:logistic',
eval_metric = 'logloss',
booster = 'gbtree',
nrounds = 1,
watchlist = list(train=xgTrain, validate=xgVal),
print_every_n = 100,
early_stopping_rounds = 70,
max_depth=8, eta=0.1,
subsample=0.5, colsample_bytree=0.5,
num_parallel_tree=100
)
## [1] train-logloss:0.651068 validate-logloss:0.653224
## Multiple eval metrics are present. Will use validate_logloss for early stopping.
## Will train until validate_logloss hasn't improved in 70 rounds.
This time we grow 20 trees in parallel at each of 100 boosting rounds, combining the boosting and random-forest ideas.
xg11 <- xgb.train(
data = xgTrain,
objective = 'binary:logistic',
eval_metric = 'logloss',
booster = 'gbtree',
nrounds = 100,
watchlist = list(train=xgTrain, validate=xgVal),
print_every_n = 10,
early_stopping_rounds = 70,
max_depth=8, eta=0.1,
subsample=0.5, colsample_bytree=0.5,
num_parallel_tree=20
)
## [1] train-logloss:0.651106 validate-logloss:0.653145
## Multiple eval metrics are present. Will use validate_logloss for early stopping.
## Will train until validate_logloss hasn't improved in 70 rounds.
##
## [11] train-logloss:0.447023 validate-logloss:0.464848
## [21] train-logloss:0.380143 validate-logloss:0.409460
## [31] train-logloss:0.347951 validate-logloss:0.386400
## [41] train-logloss:0.328236 validate-logloss:0.373650
## [51] train-logloss:0.313041 validate-logloss:0.365136
## [61] train-logloss:0.300119 validate-logloss:0.358592
## [71] train-logloss:0.289302 validate-logloss:0.353505
## [81] train-logloss:0.278965 validate-logloss:0.349282
## [91] train-logloss:0.270137 validate-logloss:0.346180
## [100] train-logloss:0.262001 validate-logloss:0.343363
xg12 <- xgb.train(
data = xgTrain,
objective = 'binary:logistic',
eval_metric = 'logloss',
booster = 'gbtree',
nrounds = 2000,
watchlist = list(train=xgTrain, validate=xgVal),
print_every_n = 100,
early_stopping_rounds = 70,
max_depth=8, eta=0.1,
nthread=2
)
## [1] train-logloss:0.648202 validate-logloss:0.650631
## Multiple eval metrics are present. Will use validate_logloss for early stopping.
## Will train until validate_logloss hasn't improved in 70 rounds.
##
## [101] train-logloss:0.258530 validate-logloss:0.345895
## [201] train-logloss:0.201911 validate-logloss:0.333866
## [301] train-logloss:0.160906 validate-logloss:0.330675
## [401] train-logloss:0.131328 validate-logloss:0.328852
## [501] train-logloss:0.107839 validate-logloss:0.329401
## Stopping. Best iteration:
## [453] train-logloss:0.118348 validate-logloss:0.327837
This is a way to introduce parallel computation and make the most of multi-core hardware. Here nthread = 2 makes the tree-building work run on two threads in parallel.
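A small sketch for choosing nthread from the machine itself; parallel::detectCores() counts the available cores (how many of them to use is a judgment call):
nCores <- parallel::detectCores()
nCores
# e.g. pass nthread = nCores - 1 to xgb.train() to leave one core free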
xgPreds12 <- predict(xg11, newdata = landx_test, outputmargin = FALSE)
xgPreds12 %>% head(20) %>% round(2)
## [1] 0.15 0.16 0.22 0.13 0.18 0.19 0.20 0.05 0.28 0.05 0.09 0.22 0.31 0.49
## [15] 0.71 0.17 0.19 0.18 0.00 0.44
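To turn these probabilities into class labels we can apply a cutoff; 0.5 is assumed here for illustration, not tuned. A sketch comparing the predictions to the true test labels:
predClass <- as.integer(xgPreds12 > 0.5)            # 1 means predicted HistoricDistrict = Yes
table(predicted = predClass, actual = landy_test)   # simple confusion table
mean(predClass == landy_test)                       # test-set accuracy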