library(Quandl) # to get VIX and Oil Price data
library(randomForest) # to predict net sales
library(plotly) # better visualization
library(rpart) # Recursive Partitioning and Regression Trees (RPART)
library(rattle) # fancy rpart tree plots
library(caret) # Classification and Regression Training (CARET)
library(caretEnsemble) # ensembles of caret models
Show how the net sales of Hong Kong Equity Fund (HKEF) and Global Bond Fund (GBF) will change if the Oil Price and VIX increase or decrease by 5%.
Target Variables:
Historical data for the target variables were obtained from the Hong Kong Investment Funds Association (HKIFA).
Features:
Data for the features were conveniently obtained in R using the Quandl package (getsymbols).
Time Frame & Frequency
Change Instead of Absolute
head(RawData)
## Date HKEF GBF VIX Oil_Price
## 1 2010-12-31 60.56 -39.39 17.15 93.23
## 2 2011-01-31 128.34 -112.56 19.53 98.97
## 3 2011-02-28 100.69 -47.83 18.35 112.27
## 4 2011-03-31 66.58 12.43 17.74 116.94
## 5 2011-04-29 -33.78 17.25 14.75 126.59
## 6 2011-05-31 61.66 72.28 15.45 117.18
head(Data)
## Date HKEF GBF VIX Oil_Price
## 1 2011-01-31 1.11922061 1.8575781 0.13877551 0.06156817
## 2 2011-02-28 -0.21544335 -0.5750711 -0.06041987 0.13438416
## 3 2011-03-31 -0.33876254 -1.2598787 -0.03324251 0.04159615
## 4 2011-04-29 -1.50735957 0.3877715 -0.16854566 0.08252095
## 5 2011-05-31 -2.82534044 3.1901449 0.04745763 -0.07433447
## 6 2011-06-30 -0.02854363 1.9673492 0.06925566 -0.04668032
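The step from RawData to Data appears to be a period-over-period relative change, (current - previous) / previous, which can be checked against the rows printed above. The following is a minimal base R sketch of that calculation; `pct_change` is a hypothetical helper name, not the author's code.

```r
# Hypothetical helper: relative change from one period to the next
pct_change <- function(x) diff(x) / head(x, -1)

# First three rows of RawData, copied from the output above
raw <- data.frame(
  Date      = as.Date(c("2010-12-31", "2011-01-31", "2011-02-28")),
  HKEF      = c(60.56, 128.34, 100.69),
  GBF       = c(-39.39, -112.56, -47.83),
  VIX       = c(17.15, 19.53, 18.35),
  Oil_Price = c(93.23, 98.97, 112.27)
)

# The first row has no previous period, so it drops out
changes <- data.frame(
  Date = raw$Date[-1],
  lapply(raw[, -1], pct_change)
)
print(changes)
```

One caveat with this definition: when the previous value is negative, as with GBF here, the sign of the relative change is flipped relative to the direction of the move.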
Preliminary visualization helps us understand the data better before coming up with an estimation model.
HKEF net sales are plotted against the Oil Price and VIX to give an idea of their movements.
GBF net sales are plotted against the Oil Price and VIX to give an idea of their movements.
From visualizing our data, we have an idea that:
\(y = \beta_0 + \beta_1 X_1 + \beta_2 X_2\)
The preliminary visualization suggests that the relationship between the targets and the features is not linear. Linear regression models assume such a relationship, so they are best avoided here.
Tree-based models do not assume linearity in the data. A tree-based model maps observations about an item (branches) to conclusions about the item’s target value (leaves). Think of it as the mother lode of nested if-else statements.
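To make the nested if-else picture concrete, here is a small hand-written tree of depth 2 in base R. The split points and leaf values are invented purely for illustration and are not taken from any model fitted in this analysis; `predict_tree` is a hypothetical name.

```r
# Illustrative only: thresholds and leaf values below are made up
predict_tree <- function(vix, oil_price) {
  if (vix < 0) {
    # Branch for a falling VIX
    if (oil_price < 0.05) 0.4 else 0.1
  } else {
    # Branch for a flat or rising VIX
    if (oil_price < 0) -0.2 else -0.5
  }
}

predict_tree(vix = 0.05, oil_price = 0.05)    # lands in the last leaf
predict_tree(vix = -0.05, oil_price = -0.05)  # lands in the first leaf
```

A fitted regression tree has the same shape. The difference is that the algorithm chooses the thresholds and leaf values from the training data.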
Example of estimating employment with tree depth of 3:
The diagrams show a fit attempted with linear regression and with tree-based regression.
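The same comparison can be sketched in code. The example below uses synthetic, deliberately non-linear data (not the fund data) and assumes the rpart package loaded earlier is installed. It fits a straight line and a depth-3 tree to the same points and compares their mean absolute errors.

```r
library(rpart)

# Synthetic data: y is 1 in the middle of the range and 0 at both ends
x <- seq(-2, 2, length.out = 200)
y <- ifelse(abs(x) < 1, 1, 0)
toy <- data.frame(x = x, y = y)

# A straight line cannot follow the bump; a shallow tree can
linear_fit <- lm(y ~ x, data = toy)
tree_fit <- rpart(y ~ x, data = toy, control = rpart.control(maxdepth = 3))

mae_linear <- mean(abs(toy$y - predict(linear_fit, toy)))
mae_tree <- mean(abs(toy$y - predict(tree_fit, toy)))
c(linear = mae_linear, tree = mae_tree)
```

On data of this shape the tree should recover the two break points, while the line settles near the overall mean of y.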
Code preview and predicted change:
# Random forest for HKEF on the monthly changes in VIX and Oil Price
fit <- randomForest(HKEF ~ VIX + Oil_Price, data = Data, ntree = 2000, importance = TRUE)
# Scenarios: both features rise by 5%, then both fall by 5%.
# Column names must match the predictors in the model formula.
scenarios <- data.frame(VIX = c(0.05, -0.05), Oil_Price = c(0.05, -0.05))
# Results table with the display names used in the output below
change <- data.frame("Change in VIX" = scenarios$VIX, "Change in Oil_Price" = scenarios$Oil_Price)
change$Change.in.HKEF <- predict(fit, scenarios)
# Same model and scenarios for GBF
fit2 <- randomForest(GBF ~ VIX + Oil_Price, data = Data, ntree = 2000)
change$Change.in.GBF <- predict(fit2, scenarios)
## Change.in.VIX Change.in.Oil_Price Change.in.HKEF Change.in.GBF
## 1 0.05 0.05 0.22 -0.10
## 2 -0.05 -0.05 0.73 0.47
Calculating the Mean Absolute Error (MAE) for predicting HKEF & GBF
#HKEF
mae <- mean(abs(Data$HKEF - predict(fit, Data[, 4:5])))
mae
## [1] 2.672729
#GBF
mae <- mean(abs(Data$GBF - predict(fit2, Data[ , 4:5])))
mae
## [1] 0.886063
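The MAE figures above are computed on the same rows that were used to fit the models, so they are likely to understate the error on unseen months. As a hedged aside, the calculation can be wrapped in a small helper (`mae_of` is a hypothetical name). For a randomForest fit, calling `predict(fit)` without new data returns out-of-bag predictions, which give a less optimistic estimate.

```r
# Hypothetical helper: mean absolute error between actual and predicted values
mae_of <- function(actual, predicted) mean(abs(actual - predicted))

# Toy numbers to show the arithmetic: (0.5 + 0 + 1 + 1) / 4
actual <- c(1.0, -0.5, 2.0, 0.0)
predicted <- c(0.5, -0.5, 1.0, 1.0)
mae_of(actual, predicted)

# With the models above (assumes `fit`, `fit2` and `Data` exist):
# mae_of(Data$HKEF, predict(fit))   # out-of-bag MAE for HKEF
# mae_of(Data$GBF, predict(fit2))   # out-of-bag MAE for GBF
```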
Although the errors are large because the target variables are volatile:
## Date HKEF GBF VIX VIX_last2 VIX_last3
## 1 5/31/11 -2.82534044 3.19014493 0.04745763 -0.16854566 -0.03324251
## 2 6/30/11 -0.02854363 1.96734920 0.06925566 0.04745763 -0.16854566
## 3 7/29/11 -0.09899833 0.08373741 0.52845036 0.06925566 0.04745763
## 4 8/31/11 0.20511395 -0.48958871 0.25227723 0.52845036 0.06925566
## 5 9/30/11 -0.92220172 -0.14430209 0.35863378 0.25227723 0.52845036
## 6 10/31/11 -5.39328063 0.56648936 -0.30260708 0.35863378 0.25227723
## VIX_last4 VIX_last5 VIX_3months_average VIX_5months_average
## 1 -0.06041987 0.13877551 -0.05144351 -0.01519498
## 2 -0.03324251 -0.06041987 -0.01727746 -0.02909895
## 3 -0.16854566 -0.03324251 0.21505455 0.08867510
## 4 0.04745763 -0.16854566 0.28332775 0.14577904
## 5 0.06925566 0.04745763 0.37978712 0.25121493
## 6 0.52845036 0.06925566 0.10276798 0.18120199
## Oil_Price Oil_Price_last2 Oil_Price_last3 Oil_Price_last4
## 1 -0.074334466 0.082520951 0.041596152 0.13438416
## 2 -0.046680321 -0.074334466 0.082520951 0.04159615
## 3 0.037776385 -0.046680321 -0.074334466 0.08252095
## 4 0.004744242 0.037776385 -0.046680321 -0.07433447
## 5 -0.094951923 0.004744242 0.037776385 -0.04668032
## 6 0.028552457 -0.094951923 0.004744242 0.03777639
## Oil_Price_last5 Oil_Price_3months_average Oil_Price_5months_average
## 1 0.06156817 0.016594212 0.049146992
## 2 0.13438416 -0.012831279 0.027497295
## 3 0.04159615 -0.027746134 0.008175740
## 4 0.08252095 -0.001386565 0.000805358
## 5 -0.07433447 -0.017477099 -0.034689217
## 6 -0.04668032 -0.020551741 -0.014111832
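Judging from the values in the table, `_last2` to `_last5` hold the changes from one to four months earlier, and the averages are trailing means over the current and preceding months. Below is a minimal base R sketch that reproduces the VIX columns under that reading; `lag_by` and `add_lag_features` are hypothetical helper names, not the author's code.

```r
# Hypothetical helper: shift a series back by k periods, padding with NA
lag_by <- function(x, k) c(rep(NA, k), head(x, -k))

# Build the current value, four lags and two trailing averages for one series
add_lag_features <- function(x, name) {
  out <- data.frame(x)
  names(out) <- name
  for (k in 1:4) {
    out[[paste0(name, "_last", k + 1)]] <- lag_by(x, k)
  }
  # Columns 1:3 are the current and two previous months; 1:5 cover five months
  out[[paste0(name, "_3months_average")]] <- rowMeans(out[, 1:3])
  out[[paste0(name, "_5months_average")]] <- rowMeans(out[, 1:5])
  na.omit(out)  # drop the first rows, which lack a full history
}

# Monthly VIX changes for January to June 2011, copied from head(Data)
vix <- c(0.13877551, -0.06041987, -0.03324251, -0.16854566, 0.04745763, 0.06925566)
feats <- add_lag_features(vix, "VIX")
print(feats)
```

The two rows that survive correspond to May and June 2011, the first two rows of the table above. The same function applied to the Oil_Price changes would give the remaining feature columns.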
CARET stands for Classification and Regression Training.
## rf rf.1 xgbLinear xgbTree kknn extraTrees
## rf 1.00 0.43 0.58 0.57 -0.08 0.35
## rf.1 0.43 1.00 0.25 0.44 0.29 0.72
## xgbLinear 0.58 0.25 1.00 0.34 -0.23 0.35
## xgbTree 0.57 0.44 0.34 1.00 -0.27 0.71
## kknn -0.08 0.29 -0.23 -0.27 1.00 -0.02
## extraTrees 0.35 0.72 0.35 0.71 -0.02 1.00
## A glmnet ensemble of 2 base models: rf, rf, xgbLinear, xgbTree, kknn, extraTrees
##
## Ensemble results:
## glmnet
##
## 46 samples
## 6 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 42, 41, 41, 41, 42, 42, ...
## Resampling results across tuning parameters:
##
## alpha lambda MAE
## 0.10 0.0009849867 0.4325490
## 0.10 0.0098498665 0.4325490
## 0.10 0.0984986654 0.4437686
## 0.55 0.0009849867 0.5798329
## 0.55 0.0098498665 0.5798329
## 0.55 0.0984986654 0.5754898
## 1.00 0.0009849867 0.5794388
## 1.00 0.0098498665 0.5807361
## 1.00 0.0984986654 0.5276956
##
## MAE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda
## = 0.009849867.
## Change.in.VIX Change.in.Oil_Price Change.in.HKEF
## 1 0.05 0.05 -0.41790984
## 2 -0.05 -0.05 0.01335041
## rf rf.1 xgbLinear xgbTree kknn extraTrees
## rf 1.00 0.89 -0.16 0.74 0.37 0.60
## rf.1 0.89 1.00 0.07 0.56 0.58 0.79
## xgbLinear -0.16 0.07 1.00 0.11 0.09 0.12
## xgbTree 0.74 0.56 0.11 1.00 0.23 0.16
## kknn 0.37 0.58 0.09 0.23 1.00 0.73
## extraTrees 0.60 0.79 0.12 0.16 0.73 1.00
## A glmnet ensemble of 2 base models: rf, rf, xgbLinear, xgbTree, kknn, extraTrees
##
## Ensemble results:
## glmnet
##
## 46 samples
## 6 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 40, 42, 41, 41, 41, 42, ...
## Resampling results across tuning parameters:
##
## alpha lambda MAE
## 0.10 0.0007280638 0.3244459
## 0.10 0.0072806380 0.3244459
## 0.10 0.0728063801 0.3244459
## 0.55 0.0007280638 0.3177243
## 0.55 0.0072806380 0.3177243
## 0.55 0.0728063801 0.3092113
## 1.00 0.0007280638 0.3304099
## 1.00 0.0072806380 0.3292084
## 1.00 0.0728063801 0.2999794
##
## MAE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.07280638.
## Change.in.VIX Change.in.Oil_Price Change.in.GBF
## 1 0.05 0.05 -0.04026385
## 2 -0.05 -0.05 0.12993277
Although we managed to address the business objective: