Last Updated - 2016-06-08

Introduction

The previous model predicted 47536 for May 2016 Round 2, a close approximation of the actual May Round 2 premium of 47020, with a difference of $516. The model will now take in new data and attempt to forecast the June 2016 COE Premium for Category A, Round 1.

Data source is from here.

Determine COE Premium for NEW vehicle bid

## Warning in TentativeRoughFix(boruta.train): There are no Tentative attributes! Returning original
## object.

##        meanImp medianImp   minImp   maxImp normHits  decision
## PQP   48.33234  48.94760 44.55557 50.87981        1 Confirmed
## QUOTA 21.94294  21.93750 19.69719 23.26416        1 Confirmed
## BIDS  16.62142  16.60232 15.79419 17.41780        1 Confirmed

## [1] "PQP"   "QUOTA"

We will fit several linear regression models to forecast PREMIUM based on these variables.

## 
## Call:
## lm(formula = PREMIUM ~ PQP + QUOTA, data = traindata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -18872  -3359   -542   3819  17090 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.157e+04  2.040e+03   5.672 6.98e-08 ***
## PQP          8.563e-01  2.944e-02  29.082  < 2e-16 ***
## QUOTA       -3.628e+00  9.928e-01  -3.655 0.000355 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5874 on 151 degrees of freedom
## Multiple R-squared:  0.8654, Adjusted R-squared:  0.8637 
## F-statistic: 485.6 on 2 and 151 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = PREMIUM ~ PQP, data = traindata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17987.8  -3414.2   -592.2   3987.7  17867.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.289e+03  1.737e+03   4.197 4.58e-05 ***
## PQP         8.829e-01  2.967e-02  29.763  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6109 on 152 degrees of freedom
## Multiple R-squared:  0.8535, Adjusted R-squared:  0.8526 
## F-statistic: 885.8 on 1 and 152 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = PREMIUM ~ PQP + QUOTA + BIDS, data = traindata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18170.6  -3563.3   -347.1   3764.1  16716.2 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11494.9871  2039.4858   5.636 8.37e-08 ***
## PQP             0.8610     0.0297  28.991  < 2e-16 ***
## QUOTA          -0.4822     2.9180  -0.165    0.869    
## BIDS           -2.1524     1.8775  -1.146    0.253    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5868 on 150 degrees of freedom
## Multiple R-squared:  0.8666, Adjusted R-squared:  0.8639 
## F-statistic: 324.9 on 3 and 150 DF,  p-value: < 2.2e-16
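As a rough illustration of how summaries like the ones above are produced, here is a minimal sketch in base R. The `demo` data frame is a synthetic stand-in (its values are invented, not the COE data), and the check on adjusted R-squared uses the standard formula with n observations and p predictors. The sample size of 154 is inferred from the 151 residual degrees of freedom in the PQP + QUOTA model.

```r
# Synthetic stand-in for traindata; these values are invented for illustration.
set.seed(42)
n <- 154
demo <- data.frame(
  PQP   = runif(n, 40000, 90000),
  QUOTA = runif(n, 300, 2500)
)
demo$PREMIUM <- 11000 + 0.85 * demo$PQP - 3.5 * demo$QUOTA + rnorm(n, sd = 5000)

fit <- lm(PREMIUM ~ PQP + QUOTA, data = demo)
s <- summary(fit)

# Adjusted R-squared recomputed by hand: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
p <- 2
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
print(c(reported = s$adj.r.squared, manual = adj_manual))

# Same formula applied to the printed figures (R^2 = 0.8654, 151 residual df)
adj_source <- 1 - (1 - 0.8654) * (154 - 1) / 151
print(round(adj_source, 4))
```

The last value should land close to the 0.8637 shown in the summary; any small gap comes from the printed R-squared being rounded to four decimals.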

The PQP is a 3-month moving average; for June 2016 it is 46454 (figure given on the LTA site).
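To make the idea of a 3-month moving average concrete, here is a small base R sketch. The monthly figures are hypothetical placeholders, not LTA data, and the exact way LTA derives the PQP from individual bidding rounds is not reproduced here.

```r
# Hypothetical monthly average premiums (not real LTA figures)
monthly_premium <- c(Jan = 45000, Feb = 46000, Mar = 47000, Apr = 46500, May = 47500)

# Trailing 3-month moving average; the first two months have no full window
pqp <- stats::filter(monthly_premium, rep(1 / 3, 3), sides = 1)
pqp <- setNames(as.numeric(pqp), names(monthly_premium))
print(round(pqp))
```

With `sides = 1` the filter only looks backwards, so each value averages the current month and the two before it.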

* Adjusted R Square is 86.37% with PQP + QUOTA as predictors. Predicted COE Premium is 43308.

* Adjusted R Square is 85.26% with PQP as the only predictor. Predicted COE Premium is 48305.

* Adjusted R Square is 86.39% with all three predictors (PQP, QUOTA and BIDS). Predicted COE Premium is 51492.
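The predictions above can be partly re-derived from the printed coefficients. The sketch below is only an arithmetic check under the assumption that the rounded coefficients in the summaries are close enough to the fitted ones. The June QUOTA and BIDS values are not stated in the text, so only the PQP terms are evaluated.

```r
pqp_june <- 46454  # PQP for June 2016, as stated above

# PQP-only model: intercept and slope as printed in the summary
pred_pqp_only <- 7289 + 0.8829 * pqp_june
print(round(pred_pqp_only))

# Three-predictor model: intercept plus the PQP term only
# (QUOTA and BIDS terms are left out because their June values are not given)
partial_full <- 11494.9871 + 0.8610 * pqp_june
print(round(partial_full))
```

The first figure comes out within a few dollars of the reported 48305, the gap being due to coefficient rounding. The second, which ignores QUOTA and BIDS entirely, already lands at about 51492. That would be consistent with those two terms contributing nothing to the reported three-predictor forecast, though the text does not say which values were used.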

Using H2O with GBM (for data scientists only)

library(h2o)

## Loading required package: statmod

## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
## 
## ----------------------------------------------------------------------

## 
## Attaching package: 'h2o'

## The following objects are masked from 'package:stats':
## 
##     sd, var

## The following objects are masked from 'package:base':
## 
##     %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames, colnames<-, ifelse,
##     is.character, is.factor, is.numeric, log, log10, log1p, log2, round, signif, trunc

localH2O <- h2o.init(nthreads = -1)

## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     C:\Users\admin\AppData\Local\Temp\Rtmpwp73tI/h2o_admin_started_from_r.out
##     C:\Users\admin\AppData\Local\Temp\Rtmpwp73tI/h2o_admin_started_from_r.err
## 
## 
## Starting H2O JVM and connecting:  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 seconds 205 milliseconds 
##     H2O cluster version:        3.8.1.3 
##     H2O cluster name:           H2O_started_from_R_admin_gcz764 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   7.10 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     R Version:                  R version 3.3.0 (2016-05-03)

h2o.init()

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 seconds 485 milliseconds 
##     H2O cluster version:        3.8.1.3 
##     H2O cluster name:           H2O_started_from_R_admin_gcz764 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   7.10 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     R Version:                  R version 3.3.0 (2016-05-03)

# split data into training and testing data frames
samp <- sample(nrow(traindata), 0.7 * nrow(traindata))
training <- traindata[samp, ]
testing <- traindata[-samp, ]


# convert to H2O frames
# note: train.h2o is built from the full traindata, not the 70% training split
train.h2o <- as.h2o(traindata)
test.h2o  <- as.h2o(testing)

### column indices
y.dep <- 4        # response: the PREMIUM column
x.indep <- c(5:7) # predictors: the PQP, BIDS and QUOTA columns

#GBM
system.time(
 gbm.model <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 1000, max_depth = 4, learn_rate = 0.01, seed = 1122)
)

##    user  system elapsed 
##    0.14    0.00    4.70

h2o.varimp(gbm.model)

## Variable Importances: 
##   variable  relative_importance scaled_importance percentage
## 1      PQP 1612447219712.000000          1.000000   0.849386
## 2     BIDS  176865198080.000000          0.109687   0.093167
## 3    QUOTA  109056024576.000000          0.067634   0.057447
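The scaled and percentage columns in the table above follow directly from the relative importances. A short base R check, using only the numbers printed by `h2o.varimp()`:

```r
# Relative importances as printed by h2o.varimp() above
rel <- c(PQP = 1612447219712, BIDS = 176865198080, QUOTA = 109056024576)

scaled     <- rel / max(rel)   # importance relative to the top variable
percentage <- rel / sum(rel)   # share of total importance

print(round(cbind(scaled, percentage), 6))
```

Read this way, PQP accounts for roughly 85% of the total importance, with BIDS and QUOTA sharing the remainder.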

#h2o.performance(gbm.model)

# predict against test data
#predict.gbm <- as.data.frame(h2o.predict(gbm.model, test.h2o))


###############################################################
# Supply my own figure for the prediction: the June 2016 PQP.
# Only PQP is provided; BIDS and QUOTA are not included in this frame.
###############################################################

mypqpdata <- data.frame(PQP=46454)

# convert to H2O frame
result_premium <- as.h2o(mypqpdata)

predict.gbm <- as.data.frame(h2o.predict(gbm.model, result_premium))

* With H2O GBM, Adjusted R Square is 97.55%. Predicted COE Premium is 53552.
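Since the prediction against the test data is commented out in the script, one way to compare models more fairly would be to score them on held-out rows only. The following is a minimal sketch of that idea in base R, using a synthetic `demo` data frame (invented values) and a plain `lm` fit as a stand-in for the GBM. It is not the H2O workflow itself.

```r
# Synthetic stand-in for traindata (values invented for illustration)
set.seed(1122)
n <- 154
demo <- data.frame(PQP = runif(n, 40000, 90000))
demo$PREMIUM <- 7000 + 0.88 * demo$PQP + rnorm(n, sd = 6000)

# Reproducible 70/30 split
idx      <- sample(nrow(demo), floor(0.7 * nrow(demo)))
training <- demo[idx, ]
testing  <- demo[-idx, ]

# Fit on the training rows only, then score the held-out rows
fit  <- lm(PREMIUM ~ PQP, data = training)
pred <- predict(fit, newdata = testing)

rmse <- sqrt(mean((testing$PREMIUM - pred)^2))
r2   <- 1 - sum((testing$PREMIUM - pred)^2) /
            sum((testing$PREMIUM - mean(testing$PREMIUM))^2)
print(c(rmse = rmse, r2 = r2))
```

An R-squared measured this way is usually lower than one measured on the rows the model was trained on, which is worth keeping in mind when comparing a high in-sample figure against the linear regression results.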

Interim conclusion: This is the first time I am using H2O. Is H2O better than traditional linear regression? Hmmm… until next time.

COE JUNE 2016 for Category A Round 1

LIM KAH KHENG (jkklim@hotmail.com)