Gradient Boosted Model
A GBM is an ensemble of either regression or classification tree models. Both are forward-learning ensemble methods that obtain predictive results through gradually improved estimations. Boosting is a flexible nonlinear regression procedure that improves the accuracy of trees: weak classifiers are applied sequentially to incrementally reweighted versions of the data, producing an ensemble of weak prediction models. While boosting trees increases their accuracy, it also decreases speed and user interpretability. The gradient boosting method generalizes tree boosting to minimize these drawbacks. For more information, see Gradient Boosted Models with H2O.
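To make the idea concrete, here is a minimal sketch of gradient boosting for squared loss, written against rpart instead of H2O; boost_sketch and its arguments are illustrative names, not part of any library. Each tree is fit to the pseudo-residuals of the current ensemble and added back with a small learning rate (shrinkage).
library(rpart)
# Toy gradient booster for squared loss. x: data.frame of predictors, y: numeric response.
boost_sketch <- function(x, y, ntrees = 50, learn_rate = 0.1, maxdepth = 3) {
  pred  <- rep(mean(y), length(y))            # initial model: the mean response
  trees <- vector("list", ntrees)
  for (i in seq_len(ntrees)) {
    dat <- cbind(x, .resid = y - pred)        # residuals = negative gradient of squared loss
    trees[[i]] <- rpart(.resid ~ ., data = dat,
                        control = rpart.control(maxdepth = maxdepth, cp = 0))
    pred <- pred + learn_rate * predict(trees[[i]], newdata = x)  # shrunken update
  }
  list(init = mean(y), trees = trees, learn_rate = learn_rate)
}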
**Key parameters for the gradient boosted model**
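The h2o.gbm call further below sets the parameters that matter most here. ntrees (150) and learn_rate (0.05) trade off against each other: a smaller shrinkage typically needs more trees but generalizes better. max_depth (50) permits very deep trees that capture high-order interactions. sample_rate and col_sample_rate (both 0.5) subsample rows and columns to decorrelate the trees. nbins_cats (1115) appears chosen to match the 1,115 stores, so a categorical Store column could split on any single store.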
Loading the libraries and fixing the random seed
library(data.table)
library(h2o)
library(plyr)
library(readr)
set.seed(415)
Reading the files
test  <- fread("C:/Users/6430/Desktop/Project/test.csv/test.csv")
train <- fread("C:/Users/6430/Desktop/Project/train.csv/train.csv")
store <- fread("C:/Users/6430/Desktop/Project/store.csv/store.csv")
## Merging the store data into the train and test sets: the files hold
## different features that must be combined to see their full effect on sales.
train1 <- merge(train,store,by="Store")
test1 <- merge(test,store,by="Store")
Converting all the NAs in the train data to zeros. In the test data, store 622 has 11 missing values in the "Open" column; to predict those days correctly, I decided to fill "Open" for store 622 with 1 (open), since treating them as closed would make the predictions wrong.
train1[is.na(train1)] <- 0
test1[is.na(test1)] <- 1
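Note that the line above fills every NA in the test set with 1, not only the Open column for store 622. A more targeted version (a sketch, using data.table's update-by-reference syntax) would be:
test1[Store == 622 & is.na(Open), Open := 1]  # fill only the 11 missing Open values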
The train1 and test1 data have a "Date" column. We separate the Date into month, year, and day, and also derive day-of-year and week-of-year. These new variables generated from the "Date" column give the model a better handle for predicting sales.
train1$Date <- as.Date(train1$Date)
test1$Date <- as.Date(test1$Date)
train1$month <- as.integer(format(train1$Date, "%m"))
train1$year <- as.integer(format(train1$Date, "%y"))
train1$day <- as.integer(format(train1$Date, "%d"))
train1$DayOfYear <- as.integer(as.POSIXlt(train1$Date)$yday)
train1$week <- as.integer( format(train1$Date+3, "%U"))
test1$month <- as.integer(format(test1$Date, "%m"))
test1$year <- as.integer(format(test1$Date, "%y"))
test1$day <- as.integer(format(test1$Date, "%d"))
test1$DayOfYear <- as.integer(as.POSIXlt(test1$Date)$yday)
test1$week <- as.integer( format(test1$Date+3, "%U"))
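A quick spot check of what these format codes produce (a sketch, using one example date):
d <- as.Date("2015-07-31")
format(d, "%m")      # "07" -> month
format(d, "%y")      # "15" -> two-digit year
format(d, "%d")      # "31" -> day of month
as.POSIXlt(d)$yday   # 211 -> 0-based day of year
format(d + 3, "%U")  # Sunday-based week number; the +3 presumably shifts the week boundary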
Initializing H2O: start a local cluster with all available threads
h2o.init(nthreads=-1,max_mem_size='8G')
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 hours 50 minutes
## H2O cluster version: 3.8.2.6
## H2O cluster name: H2O_started_from_R_6430_nqf493
## H2O cluster total nodes: 1
## H2O cluster total memory: 6.97 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## R Version: R version 3.2.3 (2015-12-10)
Selecting the features relevant to our analysis; the Sales column is left out because it is what we are going to predict.
variable <- names(train1)[c(1,2,6,7,8:12,14:23)]
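Selecting by position is brittle: if the merge ever changes the column order, these indices silently pick different features. A name-based equivalent (a sketch, assuming the standard Rossmann column layout after the merge and date steps above):
# Customers is excluded because it is not available in the test data.
variable <- setdiff(names(train1),
                    c("Date", "Sales", "Customers", "CompetitionOpenSinceMonth"))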
Applying a log transformation so the model is not overly sensitive to very high sales values
train1[,logSales:=log1p(Sales)]
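Why log1p rather than plain log: it maps the zero sales of closed days to 0 instead of -Inf, and expm1 (applied to the predictions below) is its exact inverse. A quick check:
log1p(0)             # 0: closed days stay finite
expm1(log1p(5263))   # recovers 5263 (up to floating point)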
Converting the data to H2O frames so the model can use the features selected above in "variable"
trainGbm <- as.h2o(train1)
testGbm <- as.h2o(test1)
Training the model
resultGbm <- h2o.gbm(x = variable,           # predictor columns selected above
                     y = "logSales",         # response: log-transformed sales
                     training_frame = trainGbm,
                     model_id = "introGBM",
                     nbins_cats = 1115,      # enough bins for every store level
                     sample_rate = 0.5,      # row subsampling per tree
                     col_sample_rate = 0.5,  # column subsampling
                     max_depth = 50,         # allow very deep trees
                     learn_rate = 0.05,      # shrinkage per tree
                     ntrees = 150)
## Warning in .h2o.startModelJob(algo, params, h2oRestApiVersion): Dropping constant columns: [StoreType, Assortment, PromoInterval, StateHoliday].
Summary of the model and the variable importances
summary(resultGbm)
## Model Details:
## ==============
##
## H2ORegressionModel: gbm
## Model Key: introGBM
## Model Summary:
## number_of_trees model_size_in_bytes min_depth max_depth mean_depth
## 1 150 64473990 39 50 46.68667
## min_leaves max_leaves mean_leaves
## 1 32257 38885 37077.37500
##
## H2ORegressionMetrics: gbm
## ** Reported on training data. **
##
## MSE: 0.005430122
## R2 : 0.999505
## Mean Residual Deviance : 0.005430122
##
##
##
##
##
## Scoring History:
## timestamp duration number_of_trees training_MSE
## 1 2016-06-11 17:41:04 0.001 sec 0 10.96944
## 2 2016-06-11 17:41:09 4.502 sec 1 9.90345
## 3 2016-06-11 17:41:13 8.979 sec 2 8.94111
## 4 2016-06-11 17:41:18 13.238 sec 3 8.07207
## 5 2016-06-11 17:41:26 21.215 sec 5 6.57945
## training_deviance
## 1 10.96944
## 2 9.90345
## 3 8.94111
## 4 8.07207
## 5 6.57945
##
## ---
## timestamp duration number_of_trees training_MSE
## 144 2016-06-11 17:54:22 13 min 17.349 sec 145 0.00559
## 145 2016-06-11 17:54:28 13 min 23.202 sec 146 0.00556
## 146 2016-06-11 17:54:33 13 min 28.874 sec 147 0.00552
## 147 2016-06-11 17:54:39 13 min 34.686 sec 148 0.00549
## 148 2016-06-11 17:54:44 13 min 39.904 sec 149 0.00546
## 149 2016-06-11 17:54:50 13 min 45.622 sec 150 0.00543
## training_deviance
## 144 0.00559
## 145 0.00556
## 146 0.00552
## 147 0.00549
## 148 0.00546
## 149 0.00543
##
## Variable Importances: (Extract with `h2o.varimp`)
## =================================================
##
## Variable Importances:
## variable relative_importance scaled_importance
## 1 Open 59745932.000000 1.000000
## 2 DayOfWeek 30399520.000000 0.508813
## 3 Promo 2130498.500000 0.035659
## 4 day 329629.687500 0.005517
## 5 DayOfYear 260440.171875 0.004359
## 6 CompetitionDistance 222680.812500 0.003727
## 7 Store 213892.906250 0.003580
## 8 month 116612.734375 0.001952
## 9 week 90319.132812 0.001512
## 10 CompetitionOpenSinceYear 57976.910156 0.000970
## 11 SchoolHoliday 42675.679688 0.000714
## 12 Promo2SinceWeek 40174.687500 0.000672
## 13 Promo2SinceYear 34182.785156 0.000572
## 14 year 29265.328125 0.000490
## 15 Promo2 6288.659668 0.000105
## percentage
## 1 0.637493
## 2 0.324365
## 3 0.022733
## 4 0.003517
## 5 0.002779
## 6 0.002376
## 7 0.002282
## 8 0.001244
## 9 0.000964
## 10 0.000619
## 11 0.000455
## 12 0.000429
## 13 0.000365
## 14 0.000312
## 15 0.000067
variableimps <- data.frame(h2o.varimp(resultGbm))
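For a visual read of the same table, recent H2O releases also provide h2o.varimp_plot (a sketch; availability depends on your H2O version):
h2o.varimp_plot(resultGbm, num_of_features = 15)  # bar chart of scaled importances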
Getting the predictions out: h2o.predict computes them in H2O, and as.data.frame brings them back into R.
predictions <- as.data.frame(h2o.predict(resultGbm, testGbm))
Return the predictions to the original scale of the Sales data
pred <- expm1(predictions[,1])
summary(pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.015 4227.000 5859.000 5760.000 7620.000 30020.000
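The minimum above is slightly negative, an artifact of back-transforming a prediction just below zero on the log scale; actual sales cannot be negative, so a simple guard (a sketch) clips it:
pred <- pmax(pred, 0)  # clip tiny negative artifacts to zero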
# Use Ids from test1 rather than test: merging by Store re-sorted the rows,
# so test1 (and hence the predictions) is no longer in test's row order.
Finalfile <- data.frame(Id = test1$Id, Sales = pred)
write_csv(Finalfile, "C:/Users/6430/Desktop/Project/Salespredictionfinal.csv")