Last Updated - 2016-07-18

Introduction

The previous model for June Round 1 (using H2O) predicted a premium of 53552, a close approximation of the actual June Round 1 premium of 53694, a difference of $142. The model now adds new data and attempts to forecast the June 2016 COE Premium for Category A, Round 2.

Data source is from here.

Determine COE Premium for NEW vehicle bid

## Warning in TentativeRoughFix(boruta.train): There are no Tentative attributes! Returning original
## object.

##        meanImp medianImp   minImp   maxImp normHits  decision
## PQP   47.63295  47.39704 45.58051 52.05431        1 Confirmed
## QUOTA 21.72776  21.81886 20.16540 23.71655        1 Confirmed
## BIDS  15.93328  15.95814 14.93328 16.85962        1 Confirmed

## Warning: package 'plyr' was built under R version 3.3.1

## [1] "PQP"   "QUOTA"

We fit several linear regression models to forecast PREMIUM from these variables.

## 
## Call:
## lm(formula = PREMIUM ~ PQP + QUOTA, data = traindata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -18785  -3344   -519   3854  17153 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.131e+04  2.048e+03   5.523 1.41e-07 ***
## PQP          8.569e-01  2.963e-02  28.920  < 2e-16 ***
## QUOTA       -3.244e+00  9.732e-01  -3.333  0.00108 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5912 on 152 degrees of freedom
## Multiple R-squared:  0.8629, Adjusted R-squared:  0.8611 
## F-statistic: 478.2 on 2 and 152 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = PREMIUM ~ PQP, data = traindata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17991.8  -3444.7   -559.5   3957.4  17857.2 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.392e+03  1.731e+03    4.27 3.42e-05 ***
## PQP         8.817e-01  2.961e-02   29.78  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6104 on 153 degrees of freedom
## Multiple R-squared:  0.8528, Adjusted R-squared:  0.8519 
## F-statistic: 886.7 on 1 and 153 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = PREMIUM ~ PQP + QUOTA + BIDS, data = traindata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18517.2  -3427.2   -474.2   3756.1  17020.6 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.125e+04  2.056e+03   5.472 1.81e-07 ***
## PQP          8.587e-01  2.998e-02  28.646  < 2e-16 ***
## QUOTA       -2.043e+00  2.842e+00  -0.719    0.473    
## BIDS        -7.966e-01  1.770e+00  -0.450    0.653    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5927 on 151 degrees of freedom
## Multiple R-squared:  0.863,  Adjusted R-squared:  0.8603 
## F-statistic: 317.2 on 3 and 151 DF,  p-value: < 2.2e-16
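The three fits summarised above can be sketched as follows. This is a minimal illustration, not the original analysis code: the `traindata` built here is synthetic (the real COE data is not reproduced), and the coefficients used to generate it are assumptions chosen only to resemble the reported output. Only the three formulas and the row count of 155 come from the summaries.

```r
# Synthetic stand-in for traindata (assumption: the real data is not shown here)
set.seed(42)
n <- 155  # matches the residual degrees of freedom reported above
traindata <- data.frame(
  PQP   = runif(n, 10000, 90000),
  QUOTA = runif(n, 300, 2500)
)
traindata$BIDS <- traindata$QUOTA * runif(n, 1.2, 2.0)  # bids loosely track quota
traindata$PREMIUM <- 11000 + 0.85 * traindata$PQP - 3 * traindata$QUOTA +
  rnorm(n, sd = 6000)

# The three formulas used in the text
formulas <- list(
  pqp_quota      = PREMIUM ~ PQP + QUOTA,
  pqp_only       = PREMIUM ~ PQP,
  pqp_quota_bids = PREMIUM ~ PQP + QUOTA + BIDS
)
fits <- lapply(formulas, lm, data = traindata)

# Adjusted R-squared of each fit, for side-by-side comparison
adj_r2 <- sapply(fits, function(f) summary(f)$adj.r.squared)
print(round(adj_r2, 4))
```

Adjusted R-squared penalises extra predictors, which is why the BIDS model can score slightly lower than the PQP + QUOTA model even though it has more terms.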

The PQP is a 3-month moving average; it is 46454 for June 2016 (figure given by the LTA site).
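As a rough illustration of what a 3-month moving average means, the sketch below uses made-up monthly premiums. The numbers are hypothetical and are not LTA figures, and how LTA aggregates the bidding rounds within each month is not covered here.

```r
# Hypothetical monthly premiums (illustrative values only, not LTA data)
monthly_premium <- c(44000, 45000, 46000, 47000)

# Trailing 3-month moving average: each value averages the current month
# and the two months before it; the first two entries are NA
rolling <- as.numeric(stats::filter(monthly_premium, rep(1 / 3, 3), sides = 1))
print(rolling)

# Average of the most recent three months only
latest <- mean(tail(monthly_premium, 3))
print(latest)
```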

* Adjusted R-squared is 86.11% with PQP + QUOTA as predictors. Predicted COE Premium is 43920.

* Adjusted R-squared is 85.19% with PQP as the only predictor. Predicted COE Premium is 48351.

* Adjusted R-squared is 86.03% with all three predictors. Predicted COE Premium is 51144.
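The PQP-only prediction can be checked by hand from the printed coefficients. The sketch below plugs the June 2016 PQP into the fitted line. Because the summary prints coefficients rounded to four significant figures, the result should land within about a dollar of the reported 48351 rather than match it exactly.

```r
# Coefficients as printed in the PQP-only model summary (rounded)
intercept <- 7.392e+03
slope_pqp <- 8.817e-01
pqp_june  <- 46454  # PQP for June 2016, as quoted in the text

# Linear prediction: PREMIUM = intercept + slope * PQP
predicted_premium <- intercept + slope_pqp * pqp_june
print(predicted_premium)
```

With the fitted model object available, `predict()` with `newdata = data.frame(PQP = 46454)` would use the unrounded coefficients instead.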

Using the H2O GBM algorithm (for data scientists only)

library(h2o)

## Warning: package 'h2o' was built under R version 3.3.1

## Loading required package: statmod

## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
## 
## ----------------------------------------------------------------------

## 
## Attaching package: 'h2o'

## The following objects are masked from 'package:stats':
## 
##     cor, sd, var

## The following objects are masked from 'package:base':
## 
##     %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames, colnames<-, ifelse,
##     is.character, is.factor, is.numeric, log, log10, log1p, log2, round, signif, trunc

localH2O <- h2o.init(nthreads = -1)

## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     C:\Users\admin\AppData\Local\Temp\Rtmpu8A9Ks/h2o_admin_started_from_r.out
##     C:\Users\admin\AppData\Local\Temp\Rtmpu8A9Ks/h2o_admin_started_from_r.err
## 
## 
## Starting H2O JVM and connecting:  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 seconds 244 milliseconds 
##     H2O cluster version:        3.8.3.3 
##     H2O cluster name:           H2O_started_from_R_admin_cwt957 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   7.10 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     R Version:                  R version 3.3.0 (2016-05-03)

h2o.init()

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 seconds 401 milliseconds 
##     H2O cluster version:        3.8.3.3 
##     H2O cluster name:           H2O_started_from_R_admin_cwt957 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   7.10 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     R Version:                  R version 3.3.0 (2016-05-03)

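# split data into training (70%) and testing (30%) data frames
samp <- sample(nrow(traindata), 0.7 * nrow(traindata))
training <- traindata[samp, ]
testing <- traindata[-samp, ]


# convert to H2O frames
# note: train.h2o is built from the full traindata, not the 70% training split
train.h2o <- as.h2o(traindata)
test.h2o  <- as.h2o(testing)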

# column indices used below
y.dep <- 4 # response: the PREMIUM column
x.indep <- c(5:7) # predictors: the PQP, BIDS and QUOTA columns

#GBM

gbm.model <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 1000, max_depth = 4, learn_rate = 0.01, seed = 1122)

h2o.varimp(gbm.model)

## Variable Importances: 
##   variable  relative_importance scaled_importance percentage
## 1      PQP 1615321104384.000000          1.000000   0.851653
## 2     BIDS  175986163712.000000          0.108948   0.092786
## 3    QUOTA  105382166528.000000          0.065239   0.055561
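The two derived columns in this table follow from `relative_importance` alone: the printed numbers are consistent with `scaled_importance` being each value divided by the largest one, and `percentage` being each value divided by the total. A short check, using the values copied from the output above:

```r
# relative_importance values copied from the h2o.varimp() output
relative_importance <- c(PQP   = 1615321104384,
                         BIDS  = 175986163712,
                         QUOTA = 105382166528)

# Scaled so that the most important variable equals 1
scaled_importance <- relative_importance / max(relative_importance)

# Share of the total importance; sums to 1
percentage <- relative_importance / sum(relative_importance)

print(round(cbind(scaled_importance, percentage), 6))
```

On this reading, PQP accounts for roughly 85% of the total importance in the GBM, in line with its dominance in the linear models.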

#h2o.performance(gbm.model)

# predict against test data
#predict.gbm <- as.data.frame(h2o.predict(gbm.model, test.h2o))


###############################################################
# Supply my own figure to predict with: the June 2016 PQP
# (only PQP is supplied; BIDS and QUOTA are left out)
###############################################################

mypqpdata <- data.frame(PQP=46454)

# convert to H2O frame
result_premium <- as.h2o(mypqpdata)

predict.gbm <- as.data.frame(h2o.predict(gbm.model, result_premium))

* Adjusted R-squared is 97.44%. Predicted COE Premium is 54856.
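It is not stated which rows the 97.44% figure was measured on, and the 70/30 split created earlier is not used in the scoring shown here. As a hedged sketch of how a held-out check could look, the example below fits on the training rows and scores on the testing rows. It uses `lm` in place of the GBM to stay in base R, and the data is synthetic: the generating coefficients are assumptions, not values from the COE data.

```r
# Synthetic stand-in data (assumption: not the real COE data)
set.seed(1122)
n <- 155
dat <- data.frame(PQP = runif(n, 10000, 90000))
dat$PREMIUM <- 7000 + 0.88 * dat$PQP + rnorm(n, sd = 6000)

# 70/30 split, mirroring the split in the text
samp     <- sample(nrow(dat), floor(0.7 * nrow(dat)))
training <- dat[samp, ]
testing  <- dat[-samp, ]

# Fit on the training rows only, then predict rows the model has not seen
fit  <- lm(PREMIUM ~ PQP, data = training)
pred <- predict(fit, newdata = testing)

# Held-out error and R-squared
rmse <- sqrt(mean((testing$PREMIUM - pred)^2))
r2   <- 1 - sum((testing$PREMIUM - pred)^2) /
            sum((testing$PREMIUM - mean(testing$PREMIUM))^2)
print(c(rmse = rmse, r2 = r2))
```

A score computed this way is usually lower than one computed on the training rows, and is a fairer guide to how a forecast for the next bidding round might behave.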

COE JUNE 2016 for Category A Round 2

LIM KAH KHENG (jkklim@hotmail.com)