Most of the data used for this analysis was downloaded directly from the EIA webpage; the link can be found below.

EIA Webpage

Data Codebook

Descriptions of the variables, their units of measurement, and the sources of the data can be found in the tables below:

Codebook: Variable Description

| Variable Name | Description |
|---------------|-------------|
| Date | Date of the observation |
| Month | Month of the observation |
| Year | Year of the observation |
| Active_Rigs | Number of active drilling rigs, onshore and offshore, recorded as working on that date |
| US_Population | US population estimate for the date of the observation |
| President_Party | Political party of the sitting US president on the date of the observation |
| Senate | Political party holding the majority of the US Senate on the date of the observation |
| House | Political party holding the majority of the US House of Representatives on the date of the observation |
| life_exp | US average life expectancy at the date of the observation |
| GDP | Gross Domestic Product of the US economy on the date of the observation |
| Oil_Production | US total oil production on the date of the observation |
| Oil_Stock | US inventory of oil in stock on the date of the observation |
| Oil_Price | US average domestic first purchase price, not corrected for inflation |
| HDD | Monthly US heating degree days: the number of degrees that the daily average temperature falls below 65 °F |
| CDD | Monthly US cooling degree days: the number of degrees that the daily average temperature rises above 65 °F |
| Energy_Consumption | Amount of energy consumed by the US market on the date of the observation |
| Net_imports_Energy | Net amount of energy imported by the US market on the date of the observation |
| Oil_Reserves | US proven oil reserves on the date of the observation |
| Active_Conflicts | Number of active military conflicts worldwide on the date of the observation, including civil wars |
| Drilled_Wells | Number of wells drilled in the US on the date of the observation |

Codebook: Variable Units of Measurement

| Variable Name | Units of measurement |
|---------------|----------------------|
| Date | Date in Year-Month-Day format |
| Month | Integer from 1-12 |
| Year | Four-digit integer |
| Active_Rigs | Integer; each unit represents one drilling rig |
| US_Population | Numerical, in millions of people |
| President_Party | Categorical, either Democrat or Republican |
| Senate | Categorical, either Democrat or Republican |
| House | Categorical, either Democrat or Republican |
| life_exp | Numerical, life expectancy in years |
| GDP | Numerical, GDP in trillions of dollars (10^12 USD) |
| Oil_Production | Numerical, oil production in thousands of barrels per day |
| Oil_Stock | Numerical, US oil inventory in millions of barrels |
| Oil_Price | Numerical, US dollars |
| HDD | Numerical, monthly US HDD in °F degree-days |
| CDD | Numerical, monthly US CDD in °F degree-days |
| Energy_Consumption | Numerical, energy consumed in quadrillion BTU (10^15) |
| Net_imports_Energy | Numerical, net energy imported in quadrillion BTU (10^15) |
| Oil_Reserves | Numerical, oil reserves in millions of barrels |
| Active_Conflicts | Numerical; each unit represents one active military conflict |
| Drilled_Wells | Numerical; each unit represents one drilled well |

Codebook: Variable Source of Data

| Variable Name | Source of data |
|---------------|----------------|
| Date | Included with other measurements |
| Month | Included with other measurements |
| Year | Included with other measurements |
| Active_Rigs | Data Link |
| US_Population | Data Link |
| President_Party | Data Link |
| Senate | Data Link |
| House | Data Link |
| life_exp | Data Link |
| GDP | Data Link |
| Oil_Production | Data Link |
| Oil_Stock | Data Link |
| Oil_Price | Data Link |
| HDD | Data Link |
| CDD | Data Link |
| Energy_Consumption | Data Link |
| Net_imports_Energy | Data Link |
| Oil_Reserves | Data Link |
| Active_Conflicts | Data Link |
| Active_Conflicts 2 | Data Link |
| Drilled_Wells | Data Link |

Objective

The main purpose of this work is to create models able to predict rig activity (Active_Rigs) in the US and the behavior of the oil price (Oil_Price), based on various explanatory variables.

Exploratory Data Analysis (EDA)

The Exploratory Data Analysis (EDA) performed on this dataset, as well as various statistical tests, can be found at the following Link.

Data

The working dataset is composed of 23 variables and 528 observations. There are no NAs, since the data was already wrangled in the previous part of this work (Link).

All the code related to model generation can be found at the following link: GIT.

'data.frame':   528 obs. of  23 variables:
 $ Date              : Date, format: "1974-01-01" "1974-02-01" ...
 $ Month             : num  1 2 3 4 5 6 7 8 9 10 ...
 $ Year              : num  1974 1974 1974 1974 1974 ...
 $ Active_Rigs       : num  1372 1355 1367 1381 1413 ...
 $ US_Population     : num  214 214 214 214 214 ...
 $ President_Party   : Factor w/ 2 levels "Democrat","Republican": 2 2 2 2 2 2 2 2 2 2 ...
 $ Senate            : Factor w/ 2 levels "Democrat","Republican": 1 1 1 1 1 1 1 1 1 1 ...
 $ House             : Factor w/ 2 levels "Democrat","Republican": 1 1 1 1 1 1 1 1 1 1 ...
 $ life_exp          : num  72 72 72 72 72 ...
 $ GDP               : num  1.55 1.55 1.55 1.55 1.55 ...
 $ Oil_Production    : num  10633 10870 10707 10650 10600 ...
 $ Oil_Stock         : num  233 240 244 256 269 268 268 264 266 269 ...
 $ Oil_Price         : num  6.95 6.87 6.77 6.77 6.87 6.85 6.8 6.71 6.7 6.97 ...
 $ HDD               : int  911 823 618 351 200 51 8 18 114 350 ...
 $ CDD               : int  16 5 22 23 88 149 296 242 97 30 ...
 $ Energy_Consumption: num  6.91 6.32 6.37 5.86 5.87 ...
 $ Net_imports_Energy: num  0.962 0.791 0.932 0.97 1.063 ...
 $ Oil_Reserves      : num  32250 32250 32250 32250 32250 ...
 $ Active_Conflicts  : num  56 56 56 56 56 56 56 56 56 56 ...
 $ Eco_Rec           : Factor w/ 2 levels "Normal","Recession": 2 2 2 2 2 2 2 2 2 2 ...
 $ Season            : Factor w/ 4 levels "Autumn","Spring",..: 4 4 2 2 2 3 3 3 1 1 ...
 $ Gov_ctrl_type     : Factor w/ 2 levels "shared","Total": 1 1 1 1 1 1 1 1 1 1 ...
 $ Gov_ctrl_party    : Factor w/ 3 levels "Democrat","Republican",..: 3 3 3 3 3 3 3 3 3 3 ...

Response Variable Transformation

The previous exploratory data analysis showed that oil prices follow a heavily right-skewed distribution.

Due to the severe skewness of the oil price, we will search for an adequate transformation using the BoxCox.lambda function from the forecast package in R, which returns the best lambda value and therefore suggests the most suitable transformation.

forecast::BoxCox.lambda(OIL_MODIFIED$Oil_Price)
[1] -0.0996917

The obtained lambda value is almost zero, which suggests a logarithmic transformation.

The transformation centered the distribution and reduced the right-side skewness.
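As a quick sanity check, the effect of the transformation can be verified by comparing the sample skewness before and after taking logs. A minimal sketch, assuming the e1071 package is available and OIL_MODIFIED still holds the untransformed prices:

library(e1071)

skewness(OIL_MODIFIED$Oil_Price)        # strongly positive for the raw prices
skewness(log(OIL_MODIFIED$Oil_Price))   # much closer to zero after the log transform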

Training/Test Splitting

Due to the intrinsic association of our response variable with time, the dataset will be divided into training and test sets following time series best practice, in which the order of the observations is preserved.

The training data set contains the observations from 1974 up to December 2015, and the test data set is composed of two years of monthly observations, up to December 2017. The training set will be used to train our different models, and the test set to evaluate their quality.

OIL_MODIFIED <-OIL_MODIFIED %>% mutate(Oil_Price = log(Oil_Price))

oil_train <- OIL_MODIFIED[1:504,]

oil_test_6M <- OIL_MODIFIED[505:510,]

oil_test_1Y <- OIL_MODIFIED[505:516,]

oil_test_1.5Y <- OIL_MODIFIED[505:522,]

oil_test_2Y<- OIL_MODIFIED[505:528,]

dim(oil_train)
[1] 504  23
dim(oil_test_2Y)
[1] 24 23

Another important aspect of machine learning models is model tuning. For this reason, partitions of the training data set will be used as validation sets to tune hyperparameters and find the best-optimized model, one that is able to generalize to unseen data rather than overfit the training set. The validation sets will be created following the concept of time series cross-validation, known in econometrics as “evaluation on a rolling forecasting origin”.
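To make the mechanics concrete, here is a minimal sketch of how such rolling-origin slices can be generated with caret's createTimeSlices (the window sizes match the ones used in the modeling sections below):

library(caret)

# Each slice trains on a fixed 48-month window and validates on the
# following 24 months, rolling the origin forward through the series
slices <- createTimeSlices(1:nrow(oil_train),
                           initialWindow = 48,
                           horizon = 24,
                           fixedWindow = TRUE)

head(slices$train[[1]])   # first training window
head(slices$test[[1]])    # its corresponding validation window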

Hyndman, R.J., & Athanasopoulos, G. (2018) Forecasting: principles and practice, 2nd edition, OTexts: Melbourne, Australia. OTexts.com/fpp2.

Performance Measurements and Evaluation

To evaluate the quality of the models, we selected two performance measures (a short computational sketch follows the reference list below):

  • Directional Accuracy Ratio (DAR): also known as Mean Directional Accuracy (MDA), this is a measure of the prediction accuracy of a forecasting method. It compares the forecast direction (upward or downward) to the actual realized direction, and is defined by the following formula:

\[\text{DAR} = \frac{1}{N}\sum_{t=1}^{N} \mathbf{1}\left(\operatorname{sign}(A_t - A_{t-1}) = \operatorname{sign}(F_t - A_{t-1})\right)\]

where \(A_t\) is the actual value at time \(t\), \(F_t\) the forecast, and \(\mathbf{1}\) the indicator function.

  • Root Mean Square Error (RMSE): a frequently used measure of the differences between the values predicted by a model or an estimator and the observed values:

\[\text{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(\hat{y}_t - y_t\right)^2}\]

  • Blaskowitz, O., & Herwartz, H. (2011). On economic evaluation of directional forecasts. International Journal of Forecasting, 27(4), 1058-1065.
  • Kulkarni, S., & Haidar, I. (2009). Forecasting model for crude oil price using artificial neural networks and commodity futures prices. International Journal of Computer Science and Information Security, 2(1).
  • Schnader, M. H., & Stekler, H. O. (1990). Evaluating predictions of change. Journal of Business, 99-107.
  • Sinclair, T. M., Stekler, H. O., & Kitzinger, L. (2010). Directional forecasts of GDP and inflation: a joint evaluation with an application to Federal Reserve predictions. Applied Economics, 42(18), 2289-2297.
  • Hyndman, R.J., & Athanasopoulos, G. (2018). Forecasting: principles and practice, 2nd edition. OTexts: Melbourne, Australia. OTexts.com/fpp2.
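For reference, a minimal sketch of how both metrics can be computed in R (the helper names are ours; the actual evaluation code lives in the linked repository):

dar <- function(actual, forecast) {
        # forecast direction relative to the last observed value,
        # compared against the realized direction of change
        pred_dir <- sign(forecast[-1] - actual[-length(actual)])
        real_dir <- sign(diff(actual))
        mean(pred_dir == real_dir) * 100
}

rmse <- function(actual, forecast) {
        # root mean squared error, in the same units as the series
        sqrt(mean((forecast - actual)^2))
}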

Time Series ARIMA Modeling

Thanks to the results obtained in the previously performed EDA and statistical analysis, we know that Oil_Price shows a time dependency in which data points are correlated with preceding and succeeding data points. A Chi-square goodness-of-fit test showed that monthly values are dependent on each other, and this can also be confirmed using an ACF plot of the month-to-month variation of the oil price.

The plot shows a strong autocorrelation at lags 1 and 2. To confirm this suspicion, we will use a Ljung-Box test with an alpha value of 5% (0.05).
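The output below was produced by a call along the following lines (the lag of 1 is inferred from df = 1 in the output):

Box.test(diff(OIL_MODIFIED_TS$Oil_Price), lag = 1, type = "Ljung-Box")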


    Box-Ljung test

data:  diff(OIL_MODIFIED_TS$Oil_Price)
X-squared = 116.52, df = 1, p-value < 2.2e-16

The low p-value obtained suggests that the Oil_Price time series is not “white noise” and that information can be extracted from it; a time series model might therefore be suitable. For this reason, the first chosen model will be a non-seasonal ARIMA model.
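The summary below is consistent with an automatic order search; a sketch, assuming forecast::auto.arima was used to select the ARIMA(2,1,1) specification (the object name arima_price is ours):

library(forecast)

# let auto.arima pick the (p, d, q) orders on the training portion of the series
arima_price <- auto.arima(OIL_MODIFIED_TS[1:504, "Oil_Price"], seasonal = FALSE)

summary(arima_price)
checkresiduals(arima_price)   # prints the residual Ljung-Box test shown below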

Series: OIL_MODIFIED_TS[1:504, "Oil_Price"] 
ARIMA(2,1,1) 

Coefficients:
         ar1      ar2      ma1
      1.1971  -0.4270  -0.6797
s.e.  0.1510   0.0676   0.1590

sigma^2 estimated as 0.004606:  log likelihood=640.77
AIC=-1273.53   AICc=-1273.45   BIC=-1256.65

Training set error measures:
                      ME       RMSE       MAE      MPE     MAPE      MASE
Training set 0.002166705 0.06760035 0.0479155 0.068042 1.544229 0.9194907
                    ACF1
Training set 0.001812794


    Ljung-Box test

data:  Residuals from ARIMA(2,1,1)
Q* = 5.0384, df = 7, p-value = 0.6553

Model df: 3.   Total lags used: 10

Residuals on the training data appear to be roughly normal and centered around 0, and the p-value of the Ljung-Box test on the residuals confirms that they are “white noise”. The next step is to evaluate the model on the test data. Before computing any result, however, we should first establish baselines for comparison: as the RMSE baseline we will use a model that predicts the mean oil price of the training data as its only value, and as the DAR baseline a model that assumes all differences between successive oil prices are positive (the oil price is always growing). Both baselines are calculated over the full test data (2 years).

# RMSE baseline: predict the mean training-period price (in dollars) for every test point
(RMSE_BASELINE<-sqrt(mean((mean(exp(oil_train$Oil_Price))-exp(oil_test_2Y$Oil_Price))^2)))
[1] 12.65433
# DAR baseline: share of positive month-to-month price changes in the training data
(DAR_BASELINE<-mean(diff(oil_train$Oil_Price)>0)*100)
[1] 56.06362

With the baselines in place, it is time to evaluate the performance of the ARIMA model.

Horizon (months) RMSE DAR (%) DAR_NUM
0 2.898197 59.84095 0
6 7.319842 100.00000 6
12 9.417773 72.72727 9
18 11.534491 58.82353 11
24 13.599467 69.56522 17

As the model forecasts further away from the training data, the RMSE increases; with the 24-month test data, the mean baseline actually yields a lower (better) RMSE. A key difference from a mean model, however, is that ARIMA provides DAR values, which are incredibly valuable for financial markets, especially for commodities like oil that have many different financial instruments and markets to trade them. On this metric, the ARIMA model outperforms the baseline model of a 56% growing trend. Another important point is that a mean model cannot produce a DAR at all, since all its predicted values are constant.
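As an illustration of how a row of the table above can be produced, the following sketch computes the 24-month RMSE, back-transforming the log-scale forecasts to dollars first (arima_price is the hypothetical model object from the sketch above):

fc_2Y <- forecast::forecast(arima_price, h = 24)
pred_2Y <- exp(fc_2Y$mean)                   # forecasts in US dollars
actual_2Y <- exp(oil_test_2Y$Oil_Price)      # realized prices in US dollars

sqrt(mean((pred_2Y - actual_2Y)^2))          # test RMSE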

Time Series Linear Modeling

The next model to test is a linear regression including all explanatory variables, using time series cross-validation (“evaluation on a rolling forecasting origin”) with an initial window of 48 months (4 years) for model training and successive validation sets of 24 months (2 years).

total_formula_price<-as.formula(Oil_Price ~ Month +
                                        Year +
                                        Active_Rigs +
                                        US_Population +
                                        President_Party +
                                        Senate +
                                        House +
                                        life_exp +
                                        GDP +
                                        Oil_Production +
                                        Oil_Stock +
                                        HDD +
                                        CDD +
                                        Energy_Consumption +
                                        Net_imports_Energy +
                                        Oil_Reserves +
                                        Active_Conflicts +
                                        Eco_Rec +
                                        Season +
                                        Gov_ctrl_type +
                                        Gov_ctrl_party)


set.seed(123)

time_control <- trainControl(method = "timeslice",
                             initialWindow = 48,
                             horizon = 24,
                             fixedWindow = TRUE,
                             verboseIter = TRUE,
                             allowParallel = TRUE)

With the setup ready, the next step is to create the linear model. For practicality we are going to use the caret library, which allows us to use timeslice cross-validation. Every model in caret expects hyperparameter tuning; however, an “lm” model doesn't have any hyperparameters to tune, so we will create a custom tuning grid with only the intercept as the hyperparameter, and we will set this parameter to FALSE.
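A sketch of the corresponding train() call (the object name lm_price is ours; printing it produces the summary shown below):

set.seed(123)

lm_price <- caret::train(total_formula_price,
                         data = oil_train,
                         method = "lm",
                         trControl = time_control,
                         tuneGrid = data.frame(intercept = FALSE))

lm_price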

Linear Regression 

504 samples
 21 predictor

No pre-processing
Resampling: Rolling Forecasting Origin Resampling (24 held-out with a fixed window) 
Summary of sample sizes: 48, 48, 48, 48, 48, 48, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  4631.625  0.2844531  4571.577

Tuning parameter 'intercept' was held constant at a value of FALSE

Call:
lm(formula = .outcome ~ 0 + ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.71120 -0.10446  0.01978  0.12215  0.46132 

Coefficients: (1 not defined because of singularities)
                            Estimate Std. Error t value Pr(>|t|)    
Month                      2.493e-04  3.527e-03   0.071  0.94367    
Year                       2.037e-01  7.750e-02   2.628  0.00886 ** 
Active_Rigs                4.276e-04  2.444e-05  17.496  < 2e-16 ***
US_Population             -9.492e-02  2.541e-02  -3.736  0.00021 ***
President_PartyRepublican -3.925e+02  1.456e+02  -2.696  0.00726 ** 
SenateRepublican           1.096e-02  5.029e-02   0.218  0.82761    
HouseRepublican           -3.926e+02  1.456e+02  -2.697  0.00725 ** 
life_exp                   1.960e-01  6.741e-02   2.907  0.00382 ** 
GDP                        2.394e-01  4.388e-02   5.456 7.79e-08 ***
Oil_Production            -1.597e-04  2.759e-05  -5.789 1.28e-08 ***
Oil_Stock                 -5.585e-04  4.057e-04  -1.377  0.16928    
HDD                       -1.851e-04  1.326e-04  -1.396  0.16321    
CDD                       -2.065e-04  3.208e-04  -0.644  0.52001    
Energy_Consumption         7.054e-02  5.052e-02   1.396  0.16329    
Net_imports_Energy        -4.868e-01  7.273e-02  -6.693 6.10e-11 ***
Oil_Reserves              -1.376e-05  1.037e-05  -1.327  0.18517    
Active_Conflicts          -2.630e-03  6.242e-04  -4.213 3.00e-05 ***
Eco_RecRecession          -1.317e-03  3.208e-02  -0.041  0.96727    
SeasonSpring               2.056e-02  3.371e-02   0.610  0.54227    
SeasonSummer               1.168e-02  4.955e-02   0.236  0.81375    
SeasonWinter              -5.428e-02  4.494e-02  -1.208  0.22778    
Gov_ctrl_typeTotal        -3.926e+02  1.456e+02  -2.696  0.00725 ** 
Gov_ctrl_partyRepublican   7.852e+02  2.912e+02   2.697  0.00724 ** 
Gov_ctrl_partyshared              NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1976 on 481 degrees of freedom
Multiple R-squared:  0.9965,    Adjusted R-squared:  0.9964 
F-statistic:  5987 on 23 and 481 DF,  p-value: < 2.2e-16

The model has a high coefficient of determination on the training set; however, the out-of-sample (validation) RMSE shows high error values. The next step is to compare the model against our testing data.

Horizon (months) RMSE DAR (%) DAR_NUM
0 7.41275 49.70179 0
6 13.49372 80.00000 5
12 11.58670 63.63636 8
18 16.37096 64.70588 12
24 19.53783 56.52174 14

Similar to ARIMA, as the linear model forecasts further away from the training data, the RMSE increases; with the 24-month test data, the mean baseline yields a lower (better) RMSE, and for the linear model this also holds at other horizons such as 1 and 1.5 years. The DAR values produced by the linear model outperform the 56%-growing-trend baseline: at horizons below one year it renders clearly superior results, but at 24 months the model barely matches the baseline DAR percentage.

Time Series Random Forest Modeling

The next model to test is a random forest including all explanatory variables, using time series cross-validation (“evaluation on a rolling forecasting origin”) with an initial window of 48 months (4 years) for model training and successive validation sets of 24 months (2 years). The training control parameters were already set up during the creation of the linear model, as was the formula to be used; the only thing remaining before modeling is to set up the model hyperparameters. For practicality we will use the caret package and ask caret to try 5 hyperparameter combinations; in the case of random forest, the main parameter to tune is mtry, the number of variables randomly sampled as candidates at each split.
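A sketch of the call (the splitrule and min.node.size values in the output below indicate the ranger implementation of random forest; the object name rf_price is ours):

set.seed(123)

rf_price <- caret::train(total_formula_price,
                         data = oil_train,
                         method = "ranger",
                         trControl = time_control,
                         tuneLength = 5)

rf_price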

Random Forest 

504 samples
 21 predictor

No pre-processing
Resampling: Rolling Forecasting Origin Resampling (24 held-out with a fixed window) 
Summary of sample sizes: 48, 48, 48, 48, 48, 48, ... 
Resampling results across tuning parameters:

  mtry  splitrule   RMSE       Rsquared   MAE      
   2    variance    0.3203021  0.2091753  0.2729190
   2    extratrees  0.3214654  0.2005519  0.2745532
   7    variance    0.3098552  0.1970000  0.2614745
   7    extratrees  0.3099570  0.1938968  0.2614333
  13    variance    0.3112874  0.2077682  0.2637568
  13    extratrees  0.3111779  0.1895608  0.2625623
  18    variance    0.3128114  0.2081044  0.2656446
  18    extratrees  0.3125834  0.1854089  0.2640673
  24    variance    0.3158450  0.2028856  0.2689051
  24    extratrees  0.3137600  0.1858963  0.2655273

Tuning parameter 'min.node.size' was held constant at a value of 5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 7, splitrule =
 variance and min.node.size = 5.

Compared with the linear model, the out-of-sample RMSE values are drastically lower. It is important to highlight that the RMSE values reported in the model summary are on the log scale, since we transformed the response variable; to obtain RMSE values in US dollars, the predictions must first be exponentiated back before computing the error. As the model plot shows, the tuned model uses 7 randomly selected predictors (mtry), which yields the best prediction results. The next step is to test the model against the testing data.

Horizon (months) RMSE DAR (%) DAR_NUM
0 2.330145 81.90855 0
6 11.799760 60.00000 4
12 8.980799 63.63636 8
18 7.656877 58.82353 11
24 7.530799 47.82609 11

As opposed to the previously evaluated models, the test-data RMSE of the random forest does not increase with the time horizon; in fact, the random forest outperforms the mean baseline RMSE at all testing horizons. In terms of DAR, however, the model only performs better than the baseline up to 18 months of testing data, after which it destabilizes and actually underperforms the 56% DAR baseline.

Time Series Extreme Gradient Boosting Modeling (XGBoost)

The next model to test is an extreme gradient boosted model including all explanatory variables, using time series cross-validation (“evaluation on a rolling forecasting origin”) with an initial window of 48 months (4 years) for model training and successive validation sets of 24 months (2 years). The training control parameters were already set up during the creation of the linear model; in terms of the formula, however, XGBoost requires the explanatory variables as a matrix and a special input for the response variable, so we will not use the formula notation used in previous models.

For practicality we will use the caret package to build the model; given its complexity, we will supply a customized grid of hyperparameters to search for the best combination. For this specific case, the combination used (the XGBoost defaults) is shown in the code below.

XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.

For more information on the algorithm, the following video provides a great description of its capabilities.

trainX<- data.matrix(select(oil_train,-Oil_Price,-Date))

set.seed(123)

tunegrid_xg_default <- expand.grid(
        nrounds = 100,
        max_depth = 6,
        eta = 0.3,
        gamma = 0,
        colsample_bytree = 1,
        min_child_weight = 1,
        subsample = 1)

With all parameters set up, the next step is to create the model.
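A sketch of the call, using the matrix interface described above (the object name xgb_price is ours):

set.seed(123)

xgb_price <- caret::train(x = trainX,
                          y = oil_train$Oil_Price,
                          method = "xgbTree",
                          trControl = time_control,
                          tuneGrid = tunegrid_xg_default)

xgb_price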

eXtreme Gradient Boosting 

504 samples
 21 predictor

No pre-processing
Resampling: Rolling Forecasting Origin Resampling (24 held-out with a fixed window) 
Summary of sample sizes: 48, 48, 48, 48, 48, 48, ... 
Resampling results:

  RMSE       Rsquared   MAE      
  0.3275638  0.1714746  0.2777964

Tuning parameter 'nrounds' was held constant at a value of 100
Tuning parameter 'max_depth' was held constant at a value of 6
Tuning parameter 'eta' was held constant at a value of 0.3
Tuning parameter 'gamma' was held constant at a value of 0
Tuning parameter 'colsample_bytree' was held constant at a value of 1
Tuning parameter 'min_child_weight' was held constant at a value of 1
Tuning parameter 'subsample' was held constant at a value of 1

Compared with random forest, the out-of-sample RMSE values are similar. Again, the RMSE values reported in the model summary are on the log scale, since we transformed the response variable; to obtain RMSE values in US dollars, the predictions must first be exponentiated back before computing the error. The next step is to test the model against the testing data.

Horizon (months) RMSE DAR (%) DAR_NUM
0 0.0647931 97.81312 0
6 7.7233566 60.00000 4
12 6.4759711 63.63636 8
18 6.3246999 58.82353 11
24 6.4187649 52.17391 13

As opposed to previously evaluated models such as the linear model and ARIMA, the test-data RMSE of XGBoost does not increase with the time horizon; in fact, XGBoost outperforms the mean baseline RMSE at all testing horizons, and the values seem to stabilize in the long run. XGBoost also provides lower RMSE values than random forest, which was already an improvement over “lm” and ARIMA. In terms of DAR, however, the model only performs better than the baseline up to 18 months of testing data, after which it destabilizes and actually underperforms the 56% DAR baseline.

Time Series Support Vector Machine Modeling (SVM)

The next model to test is a Support Vector Machine, specifically an SVM with an RBF (Radial Basis Function) kernel. We will use the model including all explanatory variables, with time series cross-validation (“evaluation on a rolling forecasting origin”) with an initial window of 48 months (4 years) for model training and successive validation sets of 24 months (2 years). The training control parameters were already set up during the creation of the linear model, as was the formula to be used; the only thing remaining before modeling is to set up the model hyperparameters. For practicality we will use the caret package; given the complexity of the model, we will create a customized grid of hyperparameters to find the best combination.

SVMgrid <- expand.grid(sigma = seq(0.005,0.02,by=0.005), C = c(100,200,300,400))

With everything set up, the next step is to train the model.
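A sketch of the call ("svmRadial" is caret's RBF-kernel SVM; the object name svm_price is ours):

set.seed(123)

svm_price <- caret::train(total_formula_price,
                          data = oil_train,
                          method = "svmRadial",
                          trControl = time_control,
                          tuneGrid = SVMgrid)

svm_price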

Support Vector Machines with Radial Basis Function Kernel 

504 samples
 21 predictor

No pre-processing
Resampling: Rolling Forecasting Origin Resampling (24 held-out with a fixed window) 
Summary of sample sizes: 48, 48, 48, 48, 48, 48, ... 
Resampling results across tuning parameters:

  sigma  C    RMSE       Rsquared   MAE      
  0.005  100  0.4329691  0.1823023  0.3911349
  0.005  200  0.4297423  0.1718971  0.3882421
  0.005  300  0.4285934  0.1686252  0.3872183
  0.005  400  0.4285501  0.1685428  0.3871764
  0.010  100  0.4201683  0.2389799  0.3800965
  0.010  200  0.4201683  0.2389799  0.3800965
  0.010  300  0.4201683  0.2389799  0.3800965
  0.010  400  0.4201683  0.2389799  0.3800965
  0.015  100  0.4171104  0.2438520  0.3775426
  0.015  200  0.4171104  0.2438520  0.3775426
  0.015  300  0.4171104  0.2438520  0.3775426
  0.015  400  0.4171104  0.2438520  0.3775426
  0.020  100  0.4169478  0.2037181  0.3775444
  0.020  200  0.4169478  0.2037181  0.3775444
  0.020  300  0.4169478  0.2037181  0.3775444
  0.020  400  0.4169478  0.2037181  0.3775444

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.02 and C = 100.

Compared with XGBoost and RF, the out-of-sample RMSE values are higher. Again, the RMSE values reported in the model summary are on the log scale, since we transformed the response variable; to obtain RMSE values in US dollars, the predictions must first be exponentiated back before computing the error. As the plot shows, the best combination of hyperparameters is a cost of 100 and a sigma of 0.02.

The next step is to test the model against the testing data.

Horizon (months) RMSE DAR (%) DAR_NUM
0 2.655426 68.78728 0
6 11.427578 80.00000 5
12 9.309312 72.72727 9
18 13.851903 70.58824 13
24 18.903295 56.52174 14

Similar to previously evaluated models such as the linear model and ARIMA, the test-data RMSE of the SVM increases with the time horizon; after one year the mean baseline starts to outperform the SVM. In terms of DAR, however, the SVM outperforms the 56% growing-trend baseline at all testing horizons, and it also performs better in the long term than the previous models.

Multilayer Perceptron Feedforward Neural Network Regression Modeling

A neural network can be thought of as a network of “neurons” which are organised in layers. The predictors (or inputs) form the bottom layer, and the forecasts (or outputs) form the top layer. There may also be intermediate layers containing “hidden neurons”.

The simplest networks contain no hidden layers and are equivalent to linear regressions. Figure 11.11 shows the neural network version of a linear regression with four predictors. The coefficients attached to these predictors are called “weights”. The forecasts are obtained by a linear combination of the inputs. The weights are selected in the neural network framework using a “learning algorithm” that minimises a “cost function” such as the MSE. Of course, in this simple example, we can use linear regression which is a much more efficient method of training the model.

Once we add an intermediate layer with hidden neurons, the neural network becomes non-linear.

This is known as a multilayer feed-forward network, where each layer of nodes receives inputs from the previous layers. The outputs of the nodes in one layer are inputs to the next layer. The inputs to each node are combined using a weighted linear combination. The result is then modified by a nonlinear function before being output

For example, the inputs into hidden neuron \(j\) in the previous figure are combined linearly to give

\[z_j = b_j + \sum_{i=1}^4 w_{i,j} x_i.\]

In the hidden layer, this is then modified using a nonlinear function such as a sigmoid (or ReLU),

\[s(z) = \frac{1}{1+e^{-z}},\]

to give the input for the next layer. This tends to reduce the effect of extreme input values, thus making the network somewhat robust to outliers.

The parameters \(b_1, b_2, b_3\) and \(w_{1,1}, \dots, w_{4,3}\) are “learned” from the data. The values of the weights are often restricted to prevent them from becoming too large. The parameter that restricts the weights is known as the “decay parameter”, and is often set to be equal to 0.1.

The weights take random values to begin with, and these are then updated using the observed data. Consequently, there is an element of randomness in the predictions produced by a neural network. Therefore, the network is usually trained several times using different random starting points, and the results are averaged.

The number of hidden layers, and the number of nodes in each hidden layer, must be specified in advance

Hyndman, R.J., & Athanasopoulos, G. (2018) Forecasting: principles and practice, 2nd edition, OTexts: Melbourne, Australia. OTexts.com/fpp2.

The next model to test in our work is a simple feedforward neural network (MLP) for regression, including all explanatory variables but not using time series cross-validation, since that would require a more advanced network, namely a recurrent neural network. The network requires the input data as a matrix with all-numeric, scaled predictors. We will build it using the Keras package, with Google's TensorFlow as the engine of the neural net.

For the initial architecture of our neural net we will use:

  • A flatten input layer
  • A dense layer with 128 units, ReLU activation and L2 weight regularization (0.01)
  • A dropout layer with a dropout rate of 0.2
  • A dense layer of one unit to produce the final result

As the neural network optimizer we will use “adam” instead of the classical “rmsprop”, and we will use the mean squared error (“mse”) as the loss function; we will train the network with a batch size of 1 sample over 20 epochs.

trainX_DL<-trainX

dimnames(trainX_DL) <- NULL

# column means and standard deviations of the training data, reused later
# to scale the test sets on the same basis
m<-apply(trainX_DL, 2, mean)

std<-apply(trainX_DL, 2, sd)

trainX_DL<-scale(trainX_DL,center = m,scale = std)

target<-data.matrix(oil_train$Oil_Price)

set.seed(123)

model_nnet_price <- keras_model_sequential() 

model_nnet_price %>% 
        layer_flatten(input_shape = dim(trainX_DL)[[2]]) %>%
        layer_dense(units = 128,activation = "relu",kernel_regularizer = regularizer_l2(0.01)) %>%
        layer_dropout(rate = 0.2) %>%
        layer_dense(units = 1)

model_nnet_price
Model
___________________________________________________________________________
Layer (type)                     Output Shape                  Param #     
===========================================================================
flatten (Flatten)                (None, 21)                    0           
___________________________________________________________________________
dense (Dense)                    (None, 128)                   2816        
___________________________________________________________________________
dropout (Dropout)                (None, 128)                   0           
___________________________________________________________________________
dense_1 (Dense)                  (None, 1)                     129         
===========================================================================
Total params: 2,945
Trainable params: 2,945
Non-trainable params: 0
___________________________________________________________________________
model_nnet_price %>% compile(
        optimizer = "adam",
        loss = "mse",
        metrics = c("mae")
)

net <- model_nnet_price %>%
        fit(trainX_DL,
            target,
            epochs = 20,
            batch_size=1
        )

# Helper: drop the response and date columns, strip names, scale with the
# training-set statistics, then predict and back-transform to US dollars
predict_dollars <- function(df) {
        x <- select(df, -Oil_Price, -Date) %>% data.matrix()
        dimnames(x) <- NULL
        x <- scale(x, center = m, scale = std)
        model_nnet_price %>% predict(x) %>% exp()
}

pred_train <- model_nnet_price %>% predict(trainX_DL) %>% exp()

pred6M <- predict_dollars(oil_test_6M)

pred1Y <- predict_dollars(oil_test_1Y)

pred1.5Y <- predict_dollars(oil_test_1.5Y)

pred2Y <- predict_dollars(oil_test_2Y)

Compared with previous models the in-sample errors are minimal; however, we can only confirm the validity of the model by using the testing data.

Horizon (months) RMSE DAR (%) DAR_NUM
0 9.219194 56.85885 0
6 3.662618 80.00000 5
12 6.226381 54.54545 7
18 8.271522 52.94118 10
24 9.740417 47.82609 11

Similar to previously evaluated models such as the linear model and ARIMA, the test-data RMSE of the neural network increases with the time horizon; even so, the network outperforms the mean baseline RMSE at all testing horizons, and the values seem to stabilize in the long run. In terms of DAR, however, the model mostly underperforms the 56% DAR baseline.

Time Series Neural Network Autoregression Modeling

With time series data, lagged values of the time series can be used as inputs to a neural network, just as we used lagged values in ARIMA models. We call this a neural network autoregression or NNAR model. To keep complexity manageable, we will only consider feed-forward networks with one hidden layer.

The previously modeled MLP neural network used all explanatory variables as inputs for a regression; in this case we will use only the oil price and its time dependency, taking previous values of the oil price as explanatory variables to predict its future behavior, as in autoregressive models, but with the architecture of a single-hidden-layer neural network. It is worth mentioning that an architecture considering both the time dependency of NNAR and the multivariate inputs of MLPs can be built; it is known as a recurrent neural network, a concept that will be tested in later sections of this work.

For the architecture of this network we will use a single densely connected hidden layer with 60 units and 12 inputs, i.e., one year of previous data points as explanatory variables.

set.seed(123)

NNETAr <- forecast::nnetar(y=OIL_MODIFIED_TS[1:504,"Oil_Price"],p=12,size = 60)

NNETAr
Series: OIL_MODIFIED_TS[1:504, "Oil_Price"] 
Model:  NNAR(12,60) 
Call:   forecast::nnetar(y = OIL_MODIFIED_TS[1:504, "Oil_Price"], p = 12, 
    size = 60)

Average of 20 networks, each of which is
a 12-60-1 network with 841 weights
options were - linear output units 

sigma^2 estimated as 0.002466

Compared with previous models the in-sample errors are minimal; however, we can only confirm the validity of the model by using the testing data.

Horizon (months) RMSE DAR (%) DAR_NUM
0 2.230611 66.59878 0
6 9.223506 60.00000 4
12 10.357155 63.63636 8
18 9.885619 70.58824 13
24 9.557852 69.56522 17

As opposed to previously evaluated models such as the linear model and ARIMA, the test-data RMSE of the NNAR does not increase with the time horizon; in fact, NNAR outperforms the mean baseline RMSE at all testing horizons, and the values seem to stabilize in the long run. In terms of DAR, the model performs better than the 56% DAR baseline at all testing horizons.

Model Comparison

With all models built, the next step is to compare their performance against each other.

Horizon Model RMSE DAR
0 Months lm 7.4127496 0.4970179
0 Months RF 2.3301449 0.8190855
0 Months XGboost 0.0647931 0.9781312
0 Months SVM_RBF 2.6554262 0.6878728
0 Months ARIMA 2.8981971 0.5984095
0 Months NNET 9.2191936 0.5685885
0 Months NNAR 2.2306106 0.6659878
6 Months lm 13.4937230 0.8000000
6 Months RF 11.7997596 0.6000000
6 Months XGboost 7.7233566 0.6000000
6 Months SVM_RBF 11.4275781 0.8000000
6 Months ARIMA 7.3198420 1.0000000
6 Months NNET 3.6626182 0.8000000
6 Months NNAR 9.2235058 0.6000000
12 Months lm 11.5867010 0.6363636
12 Months RF 8.9807994 0.6363636
12 Months XGboost 6.4759711 0.6363636
12 Months SVM_RBF 9.3093122 0.7272727
12 Months ARIMA 9.4177726 0.7272727
12 Months NNET 6.2263805 0.5454545
12 Months NNAR 10.3571553 0.6363636
18 Months lm 16.3709636 0.6470588
18 Months RF 7.6568769 0.5882353
18 Months XGboost 6.3246999 0.5882353
18 Months SVM_RBF 13.8519033 0.7058824
18 Months ARIMA 11.5344907 0.5882353
18 Months NNET 8.2715220 0.5294118
18 Months NNAR 9.8856189 0.7058824
24 Months lm 19.5378270 0.5652174
24 Months RF 7.5307989 0.4782609
24 Months XGboost 6.4187649 0.5217391
24 Months SVM_RBF 18.9032948 0.5652174
24 Months ARIMA 13.5994669 0.6956522
24 Months NNET 9.7404169 0.4782609
24 Months NNAR 9.5578516 0.6956522

Looking at the table and plots in more detail, we can notice that XGBoost provides the best results in terms of RMSE, although in terms of DAR, NNAR and SVM provide the best results. To obtain the best of both worlds, we can explore the concept of stacked models.

Machine Learning Stack Modeling

The concept of stacking is fairly simple: use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent algorithms alone. Stacking, a meta-modeling technique, was introduced by Wolpert in 1992. In stacking there are two types of learners, base learners and a meta-learner; both are ordinary machine learning algorithms such as random forests, SVMs, or perceptrons. Base learners fit the original dataset, whereas the meta-learner fits the predictions of the base learners.

For the specific case of this work, we will use as meta-learner a simple fixed linear combination of the obtained models:

\[Stack = 0.35\,NNAR + 0.4\,XG + 0.1\,ARIMA + 0.05\,RF + 0.05\,NNET + 0.025\,lm + 0.025\,SVM\]
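A minimal sketch of the combination, assuming each model's back-transformed 24-month test predictions are stored in vectors such as pred_nnar, pred_xg, and so on (the names are ours):

pred_stack <- 0.35 * pred_nnar + 0.40 * pred_xg + 0.10 * pred_arima +
              0.05 * pred_rf + 0.05 * pred_nnet +
              0.025 * pred_lm + 0.025 * pred_svm

# evaluated with the same metrics as the base models
sqrt(mean((pred_stack - exp(oil_test_2Y$Oil_Price))^2))   # RMSE in US dollars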

RMSE_Stack RMSE_NNAR RMSE_XG DAR_Stack DAR_NNAR DAR_XG
7.511742 9.557852 6.418765 0.6521739 0.6956522 0.5217391

As the theory suggests, the stacked model successfully picks the best of both worlds: it shows a low RMSE and also good performance in terms of DAR.

Horizon (months) RMSE DAR
6 6.391856 0.6000000
12 6.820706 0.6363636
18 6.945631 0.7058824
24 7.511742 0.6521739

By plotting our 2-year Active Rig predictions from the NNAR against the actual number of active rigs on a gain plot, we can obtain the Gini score, which in this case is 0.012; such a low Gini score suggests that the two curves (predictions vs. actuals) are nearly identical.

LSTM Recurrent Neural Networks

The stacked model gave us a hint of the benefits of combining time series autoregressive and structured-data approaches; therefore, the ideal model for this dataset might be a recurrent neural network (RNN), and more specifically a Long Short-Term Memory (LSTM) neural network.

LSTMs are quite useful in time series prediction tasks involving autocorrelation, the presence of correlation between the time series and lagged versions of itself, because of their ability to maintain state and recognize patterns over the length of the time series. The recurrent architecture enables the states to persist, or communicate between updates of the weights as each epoch progresses. Further, the LSTM cell architecture enhances the RNN by enabling long term persistence in addition to short term, which is fascinating!

LSTM Theory

Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.

The repeating module in a standard RNN contains a single layer.

LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

The repeating module in an LSTM contains four interacting layers.

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”

An LSTM has three of these gates, to protect and control the cell state.

source: “Understanding LSTM Networks”

Building an LSTM Recurrent Neural Network for Oil Price

The LSTM neural net will include all explanatory variables with the exception of Year and Month, since these date-related variables will be handled directly by the LSTM. The network requires the input data as a matrix with all-numeric, scaled predictors; it also requires a third dimension holding the time indexes, as well as the lag, which is the number of past time points used as explanatory variables to train the network. For this specific case, we decided to train/test split the dataset and choose the following parameters:

  • Lag: 12 months, equivalent to 1 year of data to be used as explanatory variables
  • Batch size: 4, to be used to train the model on each epoch
  • Train: 504 data points, which represents the dataset up until the end of 2015, as in previous models
  • Test: 24 months, the last two years of data

We will build the network using the Keras package, with Google's TensorFlow as the engine of the neural net.

For the initial architecture of our neural Net we will use:

  • An initial input layer
  • An LSTM layer with 100 units and a 3D input object
  • A dropout layer with a dropout rate of 0.5
  • An LSTM layer with 50 units
  • A dropout layer with a dropout rate of 0.5
  • A dense layer of one unit to produce the final result

As the neural network optimizer we will use “adam” instead of the classical “rmsprop”, and, as the compile call below shows, we will use the mean absolute error (“mae”) as the loss function.

Oil_LSTM <- select(OIL_MODIFIED,-Date,-Year,-Month)

Oil_LSTM <- Oil_LSTM %>% mutate_if(is.factor,as.numeric)

m2<-apply(Oil_LSTM, 2, mean)

m2[10]
Oil_Price 
 3.204852 
std2<-apply(Oil_LSTM, 2, sd)

std2[10]
Oil_Price 
0.7650267 
Oil_LSTM<-scale(Oil_LSTM,center = m2,scale = std2)

datalags = 12    # previous 12 months used as input lags
train = Oil_LSTM[seq(480 + datalags), ]                   # training window (plus lag rows)
test = Oil_LSTM[480 + datalags + seq(24 + datalags), ]    # test window (plus lag rows)
batch.size = 4

# all 20 predictor columns, lagged by datalags and reshaped into an
# array of (samples, timesteps, features)
x.train = array(data = lag(train[, 1:20], datalags)[-(1:datalags), ],
                dim = c(nrow(train) - datalags, datalags, 20))

y.train = array(data = train[,10][-(1:datalags)], dim = c(nrow(train)-datalags, 1))


x.test = array(data = lag(test[, 1:20], datalags)[-(1:datalags), ],
               dim = c(nrow(test) - datalags, datalags, 20))

y.test = array(data = test[,10][-(1:datalags)], dim = c(nrow(test) - datalags, 1))


set.seed(123)


LSTM <- keras_model_sequential()

LSTM %>%
        layer_lstm(units = 100,
                   input_shape = c(datalags, 20),
                   batch_size = batch.size,
                   return_sequences = TRUE,
                   stateful = TRUE) %>%
        layer_dropout(rate = 0.5) %>%
        layer_lstm(units = 50,
                   return_sequences = FALSE,
                   stateful = TRUE) %>%
        layer_dropout(rate = 0.5) %>%
        layer_dense(units = 1)

LSTM %>%
        compile(loss = 'mae', optimizer = 'adam')

LSTM
Model
___________________________________________________________________________
Layer (type)                     Output Shape                  Param #     
===========================================================================
lstm (LSTM)                      (4, 12, 100)                  48400       
___________________________________________________________________________
dropout (Dropout)                (4, 12, 100)                  0           
___________________________________________________________________________
lstm_1 (LSTM)                    (4, 50)                       30200       
___________________________________________________________________________
dropout_1 (Dropout)              (4, 50)                       0           
___________________________________________________________________________
dense (Dense)                    (4, 1)                        51          
===========================================================================
Total params: 78,651
Trainable params: 78,651
Non-trainable params: 0
___________________________________________________________________________
# Train the stateful LSTM manually: one epoch at a time, without shuffling,
# resetting the cell states after each full pass over the training data
for(i in 1:480){
        LSTM %>% fit(x = x.train,
                      y = y.train,
                      batch_size = batch.size,
                      epochs = 1,
                      verbose = 0,
                      shuffle = FALSE)
        LSTM %>% reset_states()
}
Horizon (months) RMSE DAR (%) DAR_NUM
0 40.396388 80.16701 0
6 4.546233 80.00000 5
12 5.837501 63.63636 8
18 7.104999 58.82353 11
24 8.816599 47.82609 11

In the short term (less than 12 months) the LSTM neural net provides incredibly accurate results, although after this period the performance of the model starts to decay, to the point where it can no longer beat the baselines.

Looking at the in-train RMSE, we might guess that there is still plenty of room for improvement in the architecture of the neural net; however, this would require more extensive empirical optimization since, as many deep learning experts have stated, the architecture of these kinds of models is more of an art than a strict science.

Conclusions