Gradient Boosted Model

A gradient boosted model (GBM) is an ensemble of either regression or classification tree models. Both are forward-learning ensemble methods that obtain predictive results through gradually improved estimates. Boosting is a flexible nonlinear regression procedure that helps improve the accuracy of trees: weak learners are applied sequentially to incrementally changed data, producing a series of decision trees that together form an ensemble of weak prediction models. While boosting trees increases their accuracy, it also decreases speed and interpretability. The gradient boosting method generalizes tree boosting to minimize these drawbacks. For more information, see Gradient Boosted Models with H2O.
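
The sequential idea described above can be illustrated with a minimal sketch in base R. It is not H2O's implementation: it assumes squared-error loss, one-split trees (stumps), and made-up toy data, and the helper names `fit_stump` and `predict_stump` are hypothetical.

```r
# Toy data: a nonlinear target with noise (illustrative only)
set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)

# Fit a one-split regression tree (stump) to the residuals r
fit_stump <- function(x, r) {
  best <- list(sse = Inf)
  for (s in unique(x)[-1]) {               # candidate split points
    left <- x < s
    pl <- mean(r[left])
    pr <- mean(r[!left])
    sse <- sum((r[left] - pl)^2) + sum((r[!left] - pr)^2)
    if (sse < best$sse) best <- list(sse = sse, split = s, left = pl, right = pr)
  }
  best
}
predict_stump <- function(stump, x) ifelse(x < stump$split, stump$left, stump$right)

learn_rate <- 0.1
n_trees <- 100
pred <- rep(mean(y), length(y))            # start from the overall mean
mse <- numeric(n_trees)
for (m in seq_len(n_trees)) {
  resid <- y - pred                        # negative gradient of squared error (up to a constant)
  stump <- fit_stump(x, resid)             # weak learner fitted to the current residuals
  pred <- pred + learn_rate * predict_stump(stump, x)
  mse[m] <- mean((y - pred)^2)             # training error after m trees
}
print(round(mse[c(1, 10, 100)], 4))
```

Each stump corrects only a little of the remaining error, and the learning rate shrinks that correction further, which is why many trees are needed.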

Strengths

Key parameters for the gradient boosted model

Loading libraries

library(data.table)  
library(h2o)
library(plyr)
library(readr)
set.seed(415)

Reading the files

test <- fread("C:/Users/6430/Desktop/Project/test.csv/test.csv")
train <- fread("C:/Users/6430/Desktop/Project/train.csv/train.csv")
store <- fread("C:/Users/6430/Desktop/Project/store.csv/store.csv")

Merge the train and test files with the store file. The store file holds store-level features that must be combined with the daily records in order to see the full effect of all features on sales.

train1 <- merge(train,store,by="Store")
test1 <- merge(test,store,by="Store")
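
One detail of merging is worth checking: base R's `merge()` sorts the result on the key column by default, and data.table's `merge` method has the same default, so the merged rows are generally not in the order of the input file. A minimal sketch with made-up toy tables (not the project data):

```r
# Toy stand-ins for the daily records and the store table (values are made up)
daily <- data.frame(Id = 1:4, Store = c(3, 1, 2, 1))
stores <- data.frame(Store = c(1, 2, 3), StoreType = c("a", "b", "c"))

merged <- merge(daily, stores, by = "Store")

print(merged$Store)  # sorted by the key column
print(merged$Id)     # no longer in the original 1, 2, 3, 4 order
```

Any column taken from the unmerged table afterwards (an Id column, for example) will therefore not line up row by row with the merged table.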

Replace every NA in the training data with zero. In the test data, store 622 has 11 missing values in the Open column; to predict those rows sensibly, I treat the store as open (Open = 1). Note that the second line below replaces every NA in test1 with 1, not only those in Open, so missing values in other test columns are filled with 1 as well.

train1[is.na(train1)]   <- 0
test1[is.na(test1)]   <- 1
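
If only the Open column should be set to 1, the imputation can be done per column. The following is a sketch on a made-up toy data frame, assuming that the remaining gaps should be filled with 0 as in the training data; it is an alternative, not the code that produced the results in this report.

```r
# Toy test data with gaps in two columns (values are made up)
toy <- data.frame(
  Store = c(622, 622, 10),
  Open = c(NA, 1, 0),
  CompetitionDistance = c(500, NA, 1200)
)

# Treat a missing Open flag as "open"
toy$Open[is.na(toy$Open)] <- 1

# Fill the remaining gaps with 0, mirroring the training data
toy[is.na(toy)] <- 0

print(toy)
```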

train1 and test1 both contain a Date column. We separate it into month, year, and day, and also derive the day of the year and the week number. These derived variables are easier for the model to use when predicting sales than the raw date.

train1$Date <- as.Date(train1$Date)
test1$Date <- as.Date(test1$Date)

train1$month <- as.integer(format(train1$Date, "%m"))
train1$year <- as.integer(format(train1$Date, "%y"))
train1$day <- as.integer(format(train1$Date, "%d"))
train1$DayOfYear <- as.integer(as.POSIXlt(train1$Date)$yday)
train1$week <- as.integer( format(train1$Date+3, "%U"))


test1$month <- as.integer(format(test1$Date, "%m"))
test1$year <- as.integer(format(test1$Date, "%y"))
test1$day <- as.integer(format(test1$Date, "%d"))
test1$DayOfYear <-  as.integer(as.POSIXlt(test1$Date)$yday)
test1$week <- as.integer( format(test1$Date+3, "%U"))
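
Because the same five features are derived for both data sets, the steps can be wrapped in one function. The sketch below uses a hypothetical helper name, `add_date_features`, and two made-up dates to show what each format code returns.

```r
# Hypothetical helper: derive the same date features as above for any data frame
add_date_features <- function(df) {
  df$Date <- as.Date(df$Date)
  df$month <- as.integer(format(df$Date, "%m"))
  df$year <- as.integer(format(df$Date, "%y"))          # two-digit year, e.g. 15; "%Y" would give 2015
  df$day <- as.integer(format(df$Date, "%d"))
  df$DayOfYear <- as.integer(as.POSIXlt(df$Date)$yday)  # 0-based: 1 January is 0
  df$week <- as.integer(format(df$Date + 3, "%U"))      # week number of the date shifted by 3 days
  df
}

toy <- add_date_features(
  data.frame(Date = c("2015-01-01", "2015-07-31"), stringsAsFactors = FALSE)
)
print(toy)
```

Using one function for train and test keeps the two feature sets from drifting apart if a feature is later added or changed.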

Start the H2O cluster with all available threads (nthreads=-1) and a memory limit of 8 GB

h2o.init(nthreads=-1,max_mem_size='8G')
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 50 minutes 
##     H2O cluster version:        3.8.2.6 
##     H2O cluster name:           H2O_started_from_R_6430_nqf493 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   6.97 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     R Version:                  R version 3.2.3 (2015-12-10)

Select the features relevant to the analysis by column position. The Sales column is left out because it is the value we are going to predict.

variable <- names(train1)[c(1,2,6,7,8:12,14:23)]
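
Selecting by position breaks silently if the column order changes. A sketch of selecting by name instead is shown below; the column list and the excluded names are assumptions for illustration, not the actual layout of train1.

```r
# Hypothetical column layout for illustration
cols <- c("Store", "DayOfWeek", "Date", "Sales", "Customers", "Open", "Promo")

# Columns that should not be used as predictors (assumed here)
excluded <- c("Date", "Sales", "Customers")

features <- setdiff(cols, excluded)
print(features)
```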

Apply a log transformation (log1p) to Sales so that the model is less sensitive to very high sales values

train1[,logSales:=log1p(Sales)]
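
`log1p(x)` computes log(1 + x), so it is defined for days with zero sales, and `expm1` is its exact inverse. A short round-trip check on made-up values:

```r
sales <- c(0, 100, 5000, 40000)   # made-up values; 0 stands for a day without sales
log_sales <- log1p(sales)         # log(1 + x), defined at 0
restored <- expm1(log_sales)      # exp(x) - 1, the inverse transformation

print(round(log_sales, 3))
print(restored)
```

Since the model is trained on the log scale, its predictions have to be passed through `expm1` before they can be read as sales.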

Convert the train and test data to H2O frames so that the H2O cluster can use them

trainGbm<-as.h2o(train1)
testGbm<-as.h2o(test1)

Training the model

resultGbm <- h2o.gbm(x=variable,
                   y="logSales",
                   training_frame=trainGbm,
                   model_id="introGBM",
                   nbins_cats=1115,
                   sample_rate = 0.5,
                   col_sample_rate = 0.5,
                   max_depth = 50,
                   learn_rate=0.05,
                   ntrees = 150
                   )
## Warning in .h2o.startModelJob(algo, params, h2oRestApiVersion): Dropping constant columns: [StoreType, Assortment, PromoInterval, StateHoliday].

Summary of the model and importance of variables

summary(resultGbm)
## Model Details:
## ==============
## 
## H2ORegressionModel: gbm
## Model Key:  introGBM 
## Model Summary: 
##   number_of_trees model_size_in_bytes min_depth max_depth mean_depth
## 1             150            64473990        39        50   46.68667
##   min_leaves max_leaves mean_leaves
## 1      32257      38885 37077.37500
## 
## H2ORegressionMetrics: gbm
## ** Reported on training data. **
## 
## MSE:  0.005430122
## R2 :  0.999505
## Mean Residual Deviance :  0.005430122
## 
## 
## 
## 
## 
## Scoring History: 
##             timestamp   duration number_of_trees training_MSE
## 1 2016-06-11 17:41:04  0.001 sec               0     10.96944
## 2 2016-06-11 17:41:09  4.502 sec               1      9.90345
## 3 2016-06-11 17:41:13  8.979 sec               2      8.94111
## 4 2016-06-11 17:41:18 13.238 sec               3      8.07207
## 5 2016-06-11 17:41:26 21.215 sec               5      6.57945
##   training_deviance
## 1          10.96944
## 2           9.90345
## 3           8.94111
## 4           8.07207
## 5           6.57945
## 
## ---
##               timestamp          duration number_of_trees training_MSE
## 144 2016-06-11 17:54:22 13 min 17.349 sec             145      0.00559
## 145 2016-06-11 17:54:28 13 min 23.202 sec             146      0.00556
## 146 2016-06-11 17:54:33 13 min 28.874 sec             147      0.00552
## 147 2016-06-11 17:54:39 13 min 34.686 sec             148      0.00549
## 148 2016-06-11 17:54:44 13 min 39.904 sec             149      0.00546
## 149 2016-06-11 17:54:50 13 min 45.622 sec             150      0.00543
##     training_deviance
## 144           0.00559
## 145           0.00556
## 146           0.00552
## 147           0.00549
## 148           0.00546
## 149           0.00543
## 
## Variable Importances: (Extract with `h2o.varimp`) 
## =================================================
## 
## Variable Importances: 
##                    variable relative_importance scaled_importance
## 1                      Open     59745932.000000          1.000000
## 2                 DayOfWeek     30399520.000000          0.508813
## 3                     Promo      2130498.500000          0.035659
## 4                       day       329629.687500          0.005517
## 5                 DayOfYear       260440.171875          0.004359
## 6       CompetitionDistance       222680.812500          0.003727
## 7                     Store       213892.906250          0.003580
## 8                     month       116612.734375          0.001952
## 9                      week        90319.132812          0.001512
## 10 CompetitionOpenSinceYear        57976.910156          0.000970
## 11            SchoolHoliday        42675.679688          0.000714
## 12          Promo2SinceWeek        40174.687500          0.000672
## 13          Promo2SinceYear        34182.785156          0.000572
## 14                     year        29265.328125          0.000490
## 15                   Promo2         6288.659668          0.000105
##    percentage
## 1    0.637493
## 2    0.324365
## 3    0.022733
## 4    0.003517
## 5    0.002779
## 6    0.002376
## 7    0.002282
## 8    0.001244
## 9    0.000964
## 10   0.000619
## 11   0.000455
## 12   0.000429
## 13   0.000365
## 14   0.000312
## 15   0.000067
variableimps = data.frame(h2o.varimp(resultGbm))
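
The metrics above are reported on training data, so they probably overstate how well the model predicts unseen days. One common remedy for time-ordered data is to hold out the most recent period for validation. The sketch below shows such a split on made-up toy data; the two-week window is an arbitrary assumption. In H2O, the held-out part could then be converted with `as.h2o` and passed to `h2o.gbm` through its `validation_frame` argument.

```r
# Toy daily records (made-up values)
set.seed(415)
toy <- data.frame(
  Date = seq(as.Date("2015-01-01"), by = "day", length.out = 100),
  Sales = round(runif(100, 3000, 9000))
)

cutoff <- max(toy$Date) - 14                 # hold out the last two weeks
train_part <- toy[toy$Date <= cutoff, ]
valid_part <- toy[toy$Date > cutoff, ]

print(c(train = nrow(train_part), valid = nrow(valid_part)))
```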

Generate the predictions: h2o.predict scores the test frame inside H2O, and as.data.frame brings the result back into R

predictions<-as.data.frame(h2o.predict(resultGbm,testGbm))

Return the predictions to the original scale of the Sales data

pred <- expm1(predictions[,1])

summary(pred)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    -0.015  4227.000  5859.000  5760.000  7620.000 30020.000
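# Take Id from test1, not test: merge() sorts rows by Store, and the
# predictions are in the row order of test1
Finalfile <- data.frame(Id=test1$Id, Sales=pred)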

write_csv(Finalfile,"C:/Users/6430/Desktop/Project/Salespredictionfinal.csv")
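
The summary of pred shows a minimum slightly below zero, which is not a meaningful sales value. A possible clean-up before writing the file is to clamp predictions at zero and check for missing values. This is a sketch on made-up numbers in the same range as the summary, not a step that was applied to the results above.

```r
pred_toy <- c(-0.015, 4227, 5859, 30020)   # made-up values in the range of the summary
pred_clean <- pmax(pred_toy, 0)            # negative sales are not meaningful

print(summary(pred_clean))
print(anyNA(pred_clean))                   # FALSE means no missing predictions
```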