source('Head.R')

Testing Data Setup

X1 and X2 are two independent variables while Y is the dependent variable. The plot shows the data, and the fill color indicates the value of Y.

sqr_size = 50
rep_time = 10
# Build a sqr_size x sqr_size grid of (X1, X2) points, repeated rep_time times,
# shifted by 0.5 so each point sits at the center of its grid cell
d = data.table(X1 = rep(rep(1:sqr_size, sqr_size), rep_time) - 0.5,
               X2 = rep(sort(rep(1:sqr_size, sqr_size)), rep_time) - 0.5)
# Y = 1 inside the bottom band, the top band, or the diagonal band; 0 elsewhere
d[, `:=`(idx = 1:.N, Y = as.factor(as.numeric(X2<=5 | X2>=max(X2)-4 | (X2<=X1+2 & X2>=X1-3))))]

# Random 75/25 train/test split; the anti-join on the idx key yields the test set
d_train = d[sample(1:.N, round(0.75*.N))]
setkey(d, idx)
setkey(d_train, idx)
d_test = d[!d_train]

ggplot(data = unique(d), aes(x = X1, y = X2, fill = Y)) +
  ggtitle("Data Heat Map") + geom_raster() + theme_bw()

ggplot(data = d[, .(p = sum(Y==1)/.N), keyby = "X1"], aes(x = X1, y = p)) +
  geom_bar(stat="identity") + ggtitle("Y % at each X1") + ylab("Y %")

ggplot(data = d[, .(p = sum(Y==1)/.N), keyby = "X2"], aes(x = X2, y = p)) +
  geom_bar(stat="identity") + ggtitle("Y % at each X2") + ylab("Y %")

Build Random Forest Model

A random forest draws a random subset of the data to build each tree, repeating the sampling n times to build n trees. The final prediction is the majority vote of all the trees.
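
The model-fitting chunk is not shown in this section, so the following is a minimal sketch of how such a model could be fit, assuming the randomForest package; the names rf_fit and rf_pred and the choice ntree = 100 are illustrative, not taken from the original.

# Minimal sketch, assuming the randomForest package; names and ntree are illustrative
library(randomForest)
rf_fit = randomForest(Y ~ X1 + X2, data = d_train, ntree = 100)
rf_pred = predict(rf_fit, newdata = d_test)
mean(rf_pred != d_test$Y)  # test misclassification rate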

Build Gradient Boosted Trees (GBT)

GBT first builds tree model 1, then uses the errors of model 1 to build tree model 2, and so on. The result is a sequence of tree models, each one correcting the errors of those before it.
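
The fitting chunk is omitted here; the log below is consistent with an xgboost watchlist printed every 10 rounds over 100 rounds. A minimal sketch under that assumption follows; the objective, eval metric, and object names are illustrative choices, not taken from the original.

# Minimal sketch, assuming the xgboost package; parameter values are illustrative
library(xgboost)
dtrain = xgb.DMatrix(as.matrix(d_train[, .(X1, X2)]),
                     label = as.numeric(as.character(d_train$Y)))
dtest  = xgb.DMatrix(as.matrix(d_test[, .(X1, X2)]),
                     label = as.numeric(as.character(d_test$Y)))
gbt_fit = xgb.train(params = list(objective = "binary:logistic", eval_metric = "error"),
                    data = dtrain, nrounds = 100,
                    watchlist = list(train = dtrain, test = dtest),
                    print_every_n = 10)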

## [1]  train-error:0.075893    test-error:0.076320 
## [11] train-error:0.069173    test-error:0.067520 
## [21] train-error:0.058507    test-error:0.056480 
## [31] train-error:0.052960    test-error:0.051360 
## [41] train-error:0.046507    test-error:0.045440 
## [51] train-error:0.028853    test-error:0.026240 
## [61] train-error:0.002240    test-error:0.002080 
## [71] train-error:0.000693    test-error:0.001120 
## [81] train-error:0.000107    test-error:0.001600 
## [91] train-error:0.000000    test-error:0.001280 
## [100]    train-error:0.000000    test-error:0.001120