Last Updated - 2016-07-18

Introduction

The previous model, for June 2016 Round 2, predicted a premium of 54856 (using H2O).

The actual June Round 2 premium was 55200, a difference of 344.
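
As a quick check of the figures above, the miss can also be expressed as a percentage of the actual premium. This is a minimal sketch; the variable names are illustrative and not taken from the original analysis.

```r
# Figures quoted above for June 2016, Category A, Round 2
predicted <- 54856  # forecast from the previous model (H2O)
actual    <- 55200  # actual premium

abs_error <- actual - predicted        # absolute miss, same units as the premium
pct_error <- 100 * abs_error / actual  # miss as a percentage of the actual premium

cat("Absolute error:", abs_error, "\n")
cat("Percentage error:", round(pct_error, 2), "%\n")
```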

The model now adds the new data and attempts to forecast the COE premium for Category A in July 2016, Round 1.

Data source is from here.

Determine COE Premium for NEW vehicle bid

## Warning in TentativeRoughFix(boruta.train): There are no Tentative attributes! Returning original
## object.

##        meanImp medianImp   minImp   maxImp normHits  decision
## PQP   47.73317  47.84532 43.79855 50.44875        1 Confirmed
## QUOTA 21.96536  21.65856 20.89109 23.86468        1 Confirmed
## BIDS  16.41152  16.43338 15.44169 17.28222        1 Confirmed

## Warning: package 'plyr' was built under R version 3.3.1

## [1] "PQP"   "QUOTA"

We will create some linear regression model equations to forecast PREMIUM based on these variables.

## 
## Call:
## lm(formula = PREMIUM ~ PQP + QUOTA, data = traindata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18680.8  -3478.3   -520.5   3828.8  17230.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.102e+04  2.063e+03   5.342 3.27e-07 ***
## PQP          8.573e-01  2.992e-02  28.653  < 2e-16 ***
## QUOTA       -2.809e+00  9.582e-01  -2.931   0.0039 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5970 on 153 degrees of freedom
## Multiple R-squared:  0.8592, Adjusted R-squared:  0.8574 
## F-statistic: 466.9 on 2 and 153 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = PREMIUM ~ PQP, data = traindata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17993  -3448   -607   3914  17846 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.549e+03  1.730e+03   4.363 2.34e-05 ***
## PQP         8.798e-01  2.963e-02  29.695  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6116 on 154 degrees of freedom
## Multiple R-squared:  0.8513, Adjusted R-squared:  0.8504 
## F-statistic: 881.8 on 1 and 154 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = PREMIUM ~ PQP + QUOTA + BIDS, data = traindata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18596.0  -3433.8   -499.9   3827.1  17189.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.100e+04  2.074e+03   5.303 3.96e-07 ***
## PQP          8.579e-01  3.029e-02  28.326  < 2e-16 ***
## QUOTA       -2.429e+00  2.865e+00  -0.848    0.398    
## BIDS        -2.485e-01  1.768e+00  -0.141    0.888    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5989 on 152 degrees of freedom
## Multiple R-squared:  0.8592, Adjusted R-squared:  0.8565 
## F-statistic: 309.3 on 3 and 152 DF,  p-value: < 2.2e-16
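
The adjusted R-squared values in the three summaries above can be cross-checked from the reported multiple R-squared, the number of observations, and the number of predictors. In the sketch below, n = 156 is inferred from the residual degrees of freedom plus the number of estimated coefficients (for example 153 + 3), and the helper name `adjusted_r2` is illustrative. Because the inputs are the rounded values printed by `summary()`, the results can differ from the printed ones in the fourth decimal place.

```r
# Adjusted R-squared from R-squared (r2), sample size (n) and number of predictors (p)
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 156  # inferred: residual degrees of freedom + number of estimated coefficients

adj_pqp_quota <- adjusted_r2(0.8592, n, 2)  # PREMIUM ~ PQP + QUOTA
adj_pqp_only  <- adjusted_r2(0.8513, n, 1)  # PREMIUM ~ PQP
adj_all       <- adjusted_r2(0.8592, n, 3)  # PREMIUM ~ PQP + QUOTA + BIDS

print(round(c(adj_pqp_quota, adj_pqp_only, adj_all), 4))
```

The penalty term `(n - 1) / (n - p - 1)` grows with the number of predictors, which is why the three-variable model has a lower adjusted value than the two-variable model even though their multiple R-squared is the same to four decimals.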

The PQP is a 3-month moving average; it is 49519 for June 2016 (figure taken from the LTA site).
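
To illustrate what a trailing 3-month moving average looks like, here is a minimal sketch in base R. The premiums below are made-up numbers, not LTA data, and the exact set of bidding rounds that LTA averages is not stated here, so this shows only the general technique. The helper name `trailing_ma3` is hypothetical.

```r
# Hypothetical monthly premiums (NOT real LTA data), oldest first
monthly_premium <- c(45000, 46500, 48000, 47000, 49500)

# Trailing 3-month moving average: mean of the current month and the two before it.
# The first two entries are NA because fewer than three months are available.
trailing_ma3 <- function(x) {
  stopifnot(length(x) >= 3)
  as.numeric(stats::filter(x, rep(1 / 3, 3), sides = 1))
}

ma3 <- trailing_ma3(monthly_premium)
print(round(ma3, 2))
```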

* Adjusted R-squared is 85.74% with PQP + QUOTA. Predicted COE Premium is 47149.

* Adjusted R-squared is 85.04% with PQP only. Predicted COE Premium is 51116.

* Adjusted R-squared is 85.65% with all three variables (PQP + QUOTA + BIDS). Predicted COE Premium is 53481.
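
The PQP-only prediction can be reproduced by hand from the coefficients printed in the `lm(formula = PREMIUM ~ PQP)` summary and the June 2016 PQP of 49519. This is a minimal sketch using the rounded coefficients from the printed output; the variable names are illustrative.

```r
# Rounded coefficients from the PREMIUM ~ PQP summary
intercept <- 7549    # 7.549e+03
slope_pqp <- 0.8798  # 8.798e-01

pqp_june_2016 <- 49519  # PQP for June 2016, as quoted from the LTA site

# Linear model prediction: intercept + slope * PQP
predicted_premium <- intercept + slope_pqp * pqp_june_2016
print(round(predicted_premium))
```

With a fitted model object available, the same result would normally come from `predict()` with a `newdata` data frame containing a `PQP` column, which avoids the rounding in the printed coefficients.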

Using the H2O GBM algorithm (for data scientists only)

library(h2o)

## Warning: package 'h2o' was built under R version 3.3.1

## Loading required package: statmod

## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
## 
## ----------------------------------------------------------------------

## 
## Attaching package: 'h2o'

## The following objects are masked from 'package:stats':
## 
##     cor, sd, var

## The following objects are masked from 'package:base':
## 
##     %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames, colnames<-, ifelse,
##     is.character, is.factor, is.numeric, log, log10, log1p, log2, round, signif, trunc

localH2O <- h2o.init(nthreads = -1)

## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     C:\Users\admin\AppData\Local\Temp\RtmpiU8LTh/h2o_admin_started_from_r.out
##     C:\Users\admin\AppData\Local\Temp\RtmpiU8LTh/h2o_admin_started_from_r.err
## 
## 
## Starting H2O JVM and connecting:  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 seconds 141 milliseconds 
##     H2O cluster version:        3.8.3.3 
##     H2O cluster name:           H2O_started_from_R_admin_bwv427 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   7.10 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     R Version:                  R version 3.3.0 (2016-05-03)

h2o.init()

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 seconds 437 milliseconds 
##     H2O cluster version:        3.8.3.3 
##     H2O cluster name:           H2O_started_from_R_admin_bwv427 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   7.10 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     R Version:                  R version 3.3.0 (2016-05-03)

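# split the data into training and testing data frames (70/30)
samp <- sample(nrow(traindata), 0.7 * nrow(traindata))
training <- traindata[samp, ]
testing <- traindata[-samp, ]


# convert to H2O frames
# note: the model is trained on the full traindata, not on the 70% training split
train.h2o <- as.h2o(traindata)
test.h2o  <- as.h2o(testing)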

### column indices used below
y.dep <- 4 # dependent variable: the PREMIUM column
x.indep <- 5:7 # independent variables: the PQP, BIDS and QUOTA columns

#GBM

gbm.model <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 1000, max_depth = 4, learn_rate = 0.01, seed = 1122)

h2o.varimp(gbm.model)

## Variable Importances: 
##   variable  relative_importance scaled_importance percentage
## 1      PQP 1625082953728.000000          1.000000   0.858711
## 2     BIDS  173447249920.000000          0.106731   0.091651
## 3    QUOTA   93937188864.000000          0.057805   0.049637

#h2o.performance(gbm.model)

# predict against test data
#predict.gbm <- as.data.frame(h2o.predict(gbm.model, test.h2o))


###############################################################
# predict for a chosen input: supply my own PQP figure
# (only PQP is supplied; BIDS and QUOTA are left out of this frame)
###############################################################

mypqpdata <- data.frame(PQP=46454)

# convert to H2O frame
result_premium <- as.h2o(mypqpdata)

predict.gbm <- as.data.frame(h2o.predict(gbm.model, result_premium))

* Adjusted R-squared is 97.21%. Predicted COE Premium is 55013.
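
How the fit statistic above was obtained is not shown in the code. As a hedged illustration of how a model's fit might be checked on held-out rows, the sketch below computes RMSE and R-squared from a vector of actual premiums and a vector of predictions. Both vectors are made-up numbers, not output from the GBM model, and the helper names `rmse` and `r_squared` are hypothetical.

```r
# Hypothetical actual and predicted premiums (NOT output from the GBM model)
actual    <- c(50000, 52000, 54000, 56000)
predicted <- c(50500, 51500, 54500, 55500)

# Root mean squared error: typical size of a prediction miss
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# R-squared: 1 - (residual sum of squares / total sum of squares)
r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)
  ss_tot <- sum((actual - mean(actual))^2)
  1 - ss_res / ss_tot
}

holdout_rmse <- rmse(actual, predicted)
holdout_r2   <- r_squared(actual, predicted)
print(c(rmse = holdout_rmse, r2 = holdout_r2))
```

Computing these on rows the model has not seen gives a less optimistic picture than statistics computed on the training data itself.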

COE JULY 2016 for Category A Round 1

LIM KAH KHENG (jkklim@hotmail.com)