Last Updated - 2016-05-22

Introduction

I am doing this for a friend who ask me to forecast COE prices. There is no way to predict ACUTAL COE prices but what we want to achieve is an equation that plugs in the variables and then determine the forecasted COE prices. A good model is when the equation determine the projected COE price should NOT deviate much from actual COE price.

Both Premium and PQP can be a response, so what are the independent variables that them ? We will perform 2 algoritm and Linear tests for PREMIUM and PQP for new bids and renewal COE respectively. The data is for Cat A car from year 2010 to year 2016 latest.

Data source is from here.

Legend

Part 1 : Determine COE Premium for NEW vehicle bid

Algorithm 1 : Boruta (for data scientist only)

## Warning in TentativeRoughFix(boruta.train): There are no Tentative attributes! Returning original
## object.
##        meanImp medianImp   minImp   maxImp normHits  decision
## PQP   48.05860  47.99839 46.41306 49.54351        1 Confirmed
## QUOTA 22.06913  22.28534 20.33577 23.09287        1 Confirmed
## BIDS  16.38491  16.18743 15.63004 17.28816        1 Confirmed

Blue boxplots correspond to minimal, average and maximum Z score of a shadow attribute.

Red represent Z scores of rejected attributes.

yellow represent Z scores of tentative attributes.

Green represent Z scores of confirmed attributes.

Algorithm 2 : RFE (Recusive Feature Elimination) to determine variables (for data scientist)

## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables RMSE Rsquared RMSESD RsquaredSD Selected
##          1 4899   0.9060 1969.9    0.06821         
##          2 4466   0.9235  923.5    0.03214        *
##          3 4546   0.9247 1049.5    0.03222         
## 
## The top 2 variables (out of 2):
##    PQP, QUOTA

## [1] "PQP"   "QUOTA"

We will create a linear regression to forecast PREMIUM based on these variables and determine how the linear regression removes the un-necessary variables.

## 
## Call:
## lm(formula = PREMIUM ~ . - YEAR - MONTH - ROUND, data = traindata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18272.2  -3534.5   -281.6   3604.8  16571.4 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.191e+04  2.061e+03   5.779 4.34e-08 ***
## PQP          8.609e-01  2.976e-02  28.927  < 2e-16 ***
## QUOTA       -8.620e-01  2.943e+00  -0.293    0.770    
## BIDS        -2.342e+00  1.890e+00  -1.239    0.217    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5880 on 147 degrees of freedom
## Multiple R-squared:  0.8678, Adjusted R-squared:  0.8651 
## F-statistic: 321.7 on 3 and 147 DF,  p-value: < 2.2e-16

Adjusted R Square is 86.51%. We can see PQP variable has highest impact

fit1 <- lm(PREMIUM ~ PQP, data=traindata)
summary(fit1)
## 
## Call:
## lm(formula = PREMIUM ~ PQP, data = traindata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17988.2  -3504.7   -696.1   4007.8  17866.4 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.306e+03  1.768e+03   4.133 5.95e-05 ***
## PQP         8.827e-01  3.008e-02  29.348  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6169 on 149 degrees of freedom
## Multiple R-squared:  0.8525, Adjusted R-squared:  0.8515 
## F-statistic: 861.3 on 1 and 149 DF,  p-value: < 2.2e-16

Adjusted R Square is 85.15%.

Example : Let’s interpret Maths coefficients

25 E-01 = 25 * 10^(-1) = 2.5

25E2 = 25 * 10^2 = 2500

7.8E-02 = 7.8 * 10^(-2) = 0.078

7.8E3 = 7.8 * 10^3 = 7800

From equation 2, forecasted premium will be

coeffs <- coefficients(fit1); 
# interested in the 1st row of traindata, at column 5 which is PQP
PQP <- traindata[1,5]
projected_premium <- coeffs[1] + coeffs[2]*PQP
projected_premium
## (Intercept) 
##    47980.06

Alternatively, we can wrapped it with a predicted function

# interested in the 1st row of traindata, at column 5 which is PQP
newdata <- data.frame(PQP = traindata[1,5]) # wrap the parameter 
results_premium <- predict(fit1, newdata)    # apply predict 

Interim conclusion : Predicted PREMIUM 47980

Part 2 : Determine COE PQP (for renewal COE for existing vehicles)

## Warning in TentativeRoughFix(boruta.train): There are no Tentative attributes! Returning original
## object.
##          meanImp medianImp   minImp   maxImp normHits  decision
## PREMIUM 46.62102  46.65102 45.13697 48.34065        1 Confirmed
## QUOTA   23.61932  23.40635 22.37639 25.05192        1 Confirmed
## BIDS    14.66363  14.54352 13.85459 16.07888        1 Confirmed

## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables RMSE Rsquared RMSESD RsquaredSD Selected
##          1 6765   0.8371 1355.2    0.06955         
##          2 4713   0.9246  877.5    0.03012        *
##          3 5065   0.9151  980.1    0.03533         
## 
## The top 2 variables (out of 2):
##    PREMIUM, QUOTA

## [1] "PREMIUM" "QUOTA"
## 
## Call:
## lm(formula = PQP ~ . - YEAR - MONTH - ROUND, data = traindata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13701.3  -3470.7   -458.4   3116.1  21357.1 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.549e+03  2.437e+03  -1.046   0.2973    
## PREMIUM      9.880e-01  3.416e-02  28.927   <2e-16 ***
## QUOTA       -2.299e+00  3.148e+00  -0.730   0.4663    
## BIDS         3.627e+00  2.013e+00   1.802   0.0737 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6299 on 147 degrees of freedom
## Multiple R-squared:  0.8614, Adjusted R-squared:  0.8585 
## F-statistic: 304.4 on 3 and 147 DF,  p-value: < 2.2e-16

Adjusted R Square is 85.85%. We can see PREMIUM variable has highest impact

## 
## Call:
## lm(formula = PQP ~ PREMIUM, data = traindata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14383.1  -3862.1   -875.2   4095.8  21557.1 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.254e+03  1.949e+03   0.643    0.521    
## PREMIUM     9.658e-01  3.291e-02  29.348   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6453 on 149 degrees of freedom
## Multiple R-squared:  0.8525, Adjusted R-squared:  0.8515 
## F-statistic: 861.3 on 1 and 149 DF,  p-value: < 2.2e-16
coeffs <- coefficients(fit4); 

# interested in the 1st row of traindata, at column 4 which is PREMIUM
PREMIUM <- traindata[1,4]
projected_pqp <- coeffs[1] + coeffs[2]*PREMIUM
projected_pqp
## (Intercept) 
##    45687.99

Alternatively, we can wrapped it with a predicted function

# interested in the 1st row of traindata, at column 4 which is PREMIUM

newdata <- data.frame(PREMIUM = traindata[1,4]) # wrap the parameter 
results_pqp <- predict(fit4, newdata)    # apply predict