The majority of the data used for this analysis was downloaded directly from the EIA webpage; the links can be found below.
Descriptions of the variables, their units of measurement, and the sources of the data can be found in the tables below:
Codebook: Variable Description
| Variable Name | Description |
|---|---|
| Date | Date of the observation |
| Month | Month of the observation |
| Year | Year of the observation |
| Active_Rigs | Number of active drilling rigs, onshore and offshore, recorded as working on that date |
| US_Population | USA population estimates for the date of the observation |
| President_Party | Political party of the sitting US president on the date of the observation |
| Senate | Political party holding the majority of the US Senate on the date of the observation |
| House | Political party holding the majority of the US House of Representatives on the date of the observation |
| life_exp | US average life expectancy at the date of the observation |
| GDP | Gross Domestic Product of the US economy on the date of the observation |
| Oil_Production | Total US oil production on the date of the observation |
| Oil_Stock | US inventory of oil in stock on the date of the observation |
| Oil_Price | US average domestic first purchase price, not adjusted for inflation |
| HDD | Monthly US heating degree days: the number of degrees by which the daily average temperature falls below 65 °F |
| CDD | Monthly US cooling degree days: the number of degrees by which the daily average temperature rises above 65 °F |
| Energy_Consumption | The amount of energy consumed by the US market on the date of the observation |
| Net_imports_Energy | The Net amount of energy imported by the US market on the date of the observation |
| Oil_Reserves | US proven oil reserves on the date of the observation |
| Active_Conflicts | Number of active military conflicts worldwide on the date of the observation, including civil wars |
| Drilled_Wells | Number of wells drilled in the US on the date of the observation |
Codebook: Variable Units of Measurement
| Variable Name | Units of measurement |
|---|---|
| Date | Date in Year-Month-Day format |
| Month | Integer from 1 to 12 |
| Year | Four-digit integer |
| Active_Rigs | Integer; each unit represents one drilling rig |
| US_Population | Numerical, in millions of people |
| President_Party | Categorical: either Democrat or Republican |
| Senate | Categorical: either Democrat or Republican |
| House | Categorical: either Democrat or Republican |
| life_exp | Numerical, life expectancy in years |
| GDP | Numerical, in trillions of US dollars (10^12 USD) |
| Oil_Production | Numerical, in thousand barrels per day |
| Oil_Stock | Numerical, US oil inventory in million barrels |
| Oil_Price | Numerical, in US dollars per barrel |
| HDD | Numerical, monthly US HDD in degree days (°F) |
| CDD | Numerical, monthly US CDD in degree days (°F) |
| Energy_Consumption | Numerical, energy consumed in quadrillion Btu (10^15 Btu) |
| Net_imports_Energy | Numerical, net energy imported in quadrillion Btu (10^15 Btu) |
| Oil_Reserves | Numerical, oil reserves in million barrels |
| Active_Conflicts | Numerical count; each unit represents one active military conflict |
| Drilled_Wells | Numerical count; each unit represents one drilled well |
Codebook: Variable Source of Data
| Variable Name | Source of data |
|---|---|
| Date | Included with other measurements |
| Month | Included with other measurements |
| Year | Included with other measurements |
| Active_Rigs | Data Link |
| US_Population | Data Link |
| President_Party | Data Link |
| Senate | Data Link |
| House | Data Link |
| life_exp | Data Link |
| GDP | Data Link |
| Oil_Production | Data Link |
| Oil_Stock | Data Link |
| Oil_Price | Data Link |
| HDD | Data Link |
| CDD | Data Link |
| Energy_Consumption | Data Link |
| Net_imports_Energy | Data Link |
| Oil_Reserves | Data Link |
| Active_Conflicts | Data Link |
| Active_Conflicts 2 | Data Link |
| Drilled_Wells | Data Link |
The main purpose of this work is to create models able to predict rig activity (Active_Rigs) in the US and the behavior of oil prices (Oil_Price), based on various explanatory variables.
The Exploratory Data Analysis (EDA) performed on this dataset, as well as various statistical tests, can be found at the following Link.
The working dataset is composed of 23 variables and 528 observations; there are no NAs, since the data was already wrangled in the previous part of this work (Link).
All the code related to model generation can be found at the following link: GIT.
'data.frame': 528 obs. of 23 variables:
$ Date : Date, format: "1974-01-01" "1974-02-01" ...
$ Month : num 1 2 3 4 5 6 7 8 9 10 ...
$ Year : num 1974 1974 1974 1974 1974 ...
$ Active_Rigs : num 1372 1355 1367 1381 1413 ...
$ US_Population : num 214 214 214 214 214 ...
$ President_Party : Factor w/ 2 levels "Democrat","Republican": 2 2 2 2 2 2 2 2 2 2 ...
$ Senate : Factor w/ 2 levels "Democrat","Republican": 1 1 1 1 1 1 1 1 1 1 ...
$ House : Factor w/ 2 levels "Democrat","Republican": 1 1 1 1 1 1 1 1 1 1 ...
$ life_exp : num 72 72 72 72 72 ...
$ GDP : num 1.55 1.55 1.55 1.55 1.55 ...
$ Oil_Production : num 10633 10870 10707 10650 10600 ...
$ Oil_Stock : num 233 240 244 256 269 268 268 264 266 269 ...
$ Oil_Price : num 6.95 6.87 6.77 6.77 6.87 6.85 6.8 6.71 6.7 6.97 ...
$ HDD : int 911 823 618 351 200 51 8 18 114 350 ...
$ CDD : int 16 5 22 23 88 149 296 242 97 30 ...
$ Energy_Consumption: num 6.91 6.32 6.37 5.86 5.87 ...
$ Net_imports_Energy: num 0.962 0.791 0.932 0.97 1.063 ...
$ Oil_Reserves : num 32250 32250 32250 32250 32250 ...
$ Active_Conflicts : num 56 56 56 56 56 56 56 56 56 56 ...
$ Eco_Rec : Factor w/ 2 levels "Normal","Recession": 2 2 2 2 2 2 2 2 2 2 ...
$ Season : Factor w/ 4 levels "Autumn","Spring",..: 4 4 2 2 2 3 3 3 1 1 ...
$ Gov_ctrl_type : Factor w/ 2 levels "shared","Total": 1 1 1 1 1 1 1 1 1 1 ...
$ Gov_ctrl_party : Factor w/ 3 levels "Democrat","Republican",..: 3 3 3 3 3 3 3 3 3 3 ...
The previous exploratory data analysis showed that the rig count (Active_Rigs) follows a heavily right-skewed distribution.
Due to the severe skewness of Active_Rigs, we will search for an adequate transformation using the forecast::BoxCox.lambda function in R, which returns the lambda value that best stabilizes the series and therefore suggests the most suitable transformation.
forecast::BoxCox.lambda(OIL_MODIFIED$Active_Rigs)
[1] 0.1143755
The obtained lambda value is close to zero, which suggests a logarithmic transformation.
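For reference, the Box-Cox family of transformations is
\[w_t = \begin{cases} \log(y_t) & \text{if } \lambda = 0, \\ (y_t^\lambda - 1)/\lambda & \text{otherwise,} \end{cases}\]
so a fitted \(\lambda\) near zero corresponds (up to scale and shift) to taking logs of the series.
The transformation centered the distribution and reduced the right-side skewness.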
Due to the intrinsic association of our response variable with time, the dataset will be divided into training and test sets following time-series best practice, in which the order of the observations is respected.
The training set contains observations from 1974 through December 2015; the test set consists of two years of monthly observations, up to December 2017. The training set will be used to fit the different models, and the test set to evaluate their quality.
OIL_MODIFIED <- OIL_MODIFIED %>% mutate(Active_Rigs = log(Active_Rigs))  # log-transform the response
oil_train <- OIL_MODIFIED[1:504,]       # Jan 1974 - Dec 2015
oil_test_6M <- OIL_MODIFIED[505:510,]   # first 6 months of the test period
oil_test_1Y <- OIL_MODIFIED[505:516,]   # first year
oil_test_1.5Y <- OIL_MODIFIED[505:522,] # first 18 months
oil_test_2Y <- OIL_MODIFIED[505:528,]   # full 2-year test period
dim(oil_train)
[1] 504 23
dim(oil_test_2Y)
[1] 24 23
Another important aspect of machine learning models is model tuning; for this purpose, a partition of the training set will be used as a validation set to tune hyperparameters and find the most optimized model, one that is able to generalize to outside data without overfitting the training set. The validation set will be created following the concept of time-series cross-validation, known in econometrics as “evaluation on a rolling forecasting origin”.
Hyndman, R.J., & Athanasopoulos, G. (2018) Forecasting: principles and practice, 2nd edition, OTexts: Melbourne, Australia. OTexts.com/fpp2.
To evaluate the quality of the models, we selected two performance measurements: the Root Mean Squared Error (RMSE) and the Directional Accuracy Rate (DAR), the percentage of months in which the predicted direction of change matches the actual one.
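As a reference, a minimal sketch of how these two metrics can be computed (the helper names rmse and dar are ours, for illustration only, and are not the exact code used to produce the tables below):

# `actual` and `predicted` are numeric vectors on the original rig-count scale
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
# DAR: share of months where the predicted direction of change matches the actual one
dar <- function(actual, predicted) mean(sign(diff(predicted)) == sign(diff(actual))) * 100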
The previously performed EDA and statistical analysis showed that Oil_Price exhibits time dependency, in which each data point is correlated with preceding and succeeding points; a Chi-square goodness-of-fit test indicated that monthly values are dependent on one another, and this can also be confirmed with an ACF plot of the monthly changes in the series.
The ACF plot shows strong autocorrelation at lags 1 and 2, and at various other lags. To confirm this suspicion, we apply a Ljung-Box test, with an alpha of 5% (0.05), to the differenced Active_Rigs series.
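A minimal sketch of the test call, using Box.test from base R's stats package (lag = 1 matches the single degree of freedom reported below):

Box.test(diff(OIL_MODIFIED_TS$Active_Rigs), lag = 1, type = "Ljung-Box")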
Box-Ljung test
data: diff(OIL_MODIFIED_TS$Active_Rigs)
X-squared = 289.2, df = 1, p-value < 2.2e-16
The very low p-value indicates that the Active_Rigs time series is not “white noise” and that information can be extracted from it; a time-series model is therefore appropriate, and the first chosen model is a non-seasonal ARIMA.
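A minimal sketch of how such a model can be fitted with the forecast package (the object name ARIMA_Rigs is ours; auto.arima performs the order selection that produced the ARIMA(2,1,1) reported below):

library(forecast)
ARIMA_Rigs <- auto.arima(OIL_MODIFIED_TS[1:504, "Active_Rigs"], seasonal = FALSE)  # non-seasonal search
summary(ARIMA_Rigs)          # coefficients and training-set error measures shown below
checkresiduals(ARIMA_Rigs)   # residual diagnostics, including the Ljung-Box test shown below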
Series: OIL_MODIFIED_TS[1:504, "Active_Rigs"]
ARIMA(2,1,1)
Coefficients:
ar1 ar2 ma1
1.3542 -0.5612 -0.5003
s.e. 0.1060 0.0743 0.1188
sigma^2 estimated as 0.00116: log likelihood=987.37
AIC=-1966.74 AICc=-1966.66 BIC=-1949.86
Training set error measures:
ME RMSE MAE MPE MAPE
Training set -0.0005588134 0.03392013 0.02370383 -0.007655009 0.3358983
MASE ACF1
Training set 0.6624728 0.005979014
Ljung-Box test
data: Residuals from ARIMA(2,1,1)
Q* = 10.533, df = 7, p-value = 0.1603
Model df: 3. Total lags used: 10
The Ljung-Box test on the residuals of the fitted model yields a p-value of 0.16 (> 0.05), so the residuals are consistent with white noise and the model appears to have captured the available linear information; its point forecasts can therefore be used with reasonable confidence. The next step is to evaluate the model on the test data. Before computing any results, we should first establish baselines for comparison: as the RMSE baseline we use a model that predicts the mean rig count of the training data for every observation, and as the DAR baseline a model that assumes every successive change in rig count is positive (the rig count is always growing). Both baselines are computed over the full test set (2 years).
(RMSE_BASELINE<-sqrt(mean((mean(exp(oil_train$Active_Rigs))-exp(oil_test_2Y$Active_Rigs))^2)))
[1] 818.6661
(DAR_BASELINE<-mean(diff(oil_train$Active_Rigs)>0)*100)
[1] 61.2326
With the baselines in place, it is time to evaluate the performance of the ARIMA model:
| Horizon (months) | RMSE (rigs) | DAR (%) | DAR_NUM |
|---|---|---|---|
| 0 | 50.9119 | 72.16700 | 0 |
| 6 | 198.5156 | 60.00000 | 4 |
| 12 | 182.2701 | 63.63636 | 8 |
| 18 | 174.1971 | 47.05882 | 8 |
| 24 | 195.8716 | 43.47826 | 10 |
As the model forecasts further away from the training data, the RMSE increases. Across the whole test horizon, the ARIMA model yields lower (better) RMSE values than the mean-model baseline. A key difference from the mean model is that ARIMA provides DAR values, which are incredibly valuable for financial markets, especially for commodities like oil that have plenty of financial instruments and markets in which to trade them; on this metric, however, the ARIMA model underperforms the 61%-growing-trend baseline. Another important point is that a mean model cannot produce a DAR at all, since all its predicted values are constant.
The next model to test is a linear regression including all explanatory variables, using time-series cross-validation (“evaluation on a rolling forecasting origin”) with an initial window of 200 months (~17 years) for model training and successive validation sets of 24 months (2 years).
total_formula_Rigs <-as.formula(Active_Rigs ~ Month +
Oil_Price +
Year +
US_Population +
President_Party +
Senate +
House +
life_exp +
GDP +
Oil_Production +
Oil_Stock +
HDD +
CDD +
Energy_Consumption +
Net_imports_Energy +
Oil_Reserves +
Active_Conflicts +
Eco_Rec +
Season +
Gov_ctrl_type +
Gov_ctrl_party)
# linear model bench mark using caret
set.seed(123)
time_control_Rigs <- trainControl(method = "timeslice",
initialWindow = 200,
horizon = 24,
fixedWindow = TRUE,
verboseIter = TRUE,
allowParallel = TRUE)
With the setup ready, the next step is to create the linear model. For practicality we use the caret library, which allows us to use timeslice cross-validation. Every model in caret expects hyperparameter tuning; however, an “lm” model has no hyperparameters to tune, so we create a custom tuning grid with only the intercept as the hyperparameter, and set it to FALSE.
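A minimal sketch of the resulting call, assuming the formula and control objects defined above (the model name LM_Rigs is ours):

# one-row custom grid: caret's "lm" exposes only the `intercept` parameter
tunegrid_lm <- data.frame(intercept = FALSE)
set.seed(123)
LM_Rigs <- caret::train(total_formula_Rigs,
                        data = oil_train,
                        method = "lm",
                        trControl = time_control_Rigs,
                        tuneGrid = tunegrid_lm)
LM_Rigs                      # resampling summary shown below
summary(LM_Rigs$finalModel)  # coefficient table shown below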
Linear Regression
504 samples
21 predictor
No pre-processing
Resampling: Rolling Forecasting Origin Resampling (24 held-out with a fixed window)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results:
RMSE Rsquared MAE
259.4039 0.289489 200.3085
Tuning parameter 'intercept' was held constant at a value of FALSE
Call:
lm(formula = .outcome ~ 0 + ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-0.57154 -0.10623 0.00041 0.10594 0.55338
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
Month 4.429e-03 3.019e-03 1.467 0.143048
Oil_Price 1.180e-02 8.343e-04 14.147 < 2e-16 ***
Year -1.629e-01 6.638e-02 -2.453 0.014513 *
US_Population -1.898e-02 2.188e-02 -0.867 0.386103
President_PartyRepublican 2.822e+02 1.248e+02 2.261 0.024176 *
SenateRepublican 3.055e-01 4.077e-02 7.493 3.25e-13 ***
HouseRepublican 2.822e+02 1.248e+02 2.261 0.024216 *
life_exp 6.809e-01 5.196e-02 13.105 < 2e-16 ***
GDP 2.021e-01 3.844e-02 5.258 2.20e-07 ***
Oil_Production -3.838e-05 2.471e-05 -1.554 0.120917
Oil_Stock -3.463e-04 3.478e-04 -0.995 0.320022
HDD -1.180e-04 1.134e-04 -1.040 0.298783
CDD -2.334e-04 2.748e-04 -0.849 0.396069
Energy_Consumption 1.322e-01 4.281e-02 3.088 0.002130 **
Net_imports_Energy -1.080e-01 6.426e-02 -1.680 0.093548 .
Oil_Reserves 3.058e-05 8.791e-06 3.479 0.000548 ***
Active_Conflicts -4.035e-04 5.508e-04 -0.733 0.464185
Eco_RecRecession 2.029e-01 2.577e-02 7.870 2.36e-14 ***
SeasonSpring -7.635e-02 2.886e-02 -2.645 0.008433 **
SeasonSummer -6.708e-02 4.241e-02 -1.582 0.114378
SeasonWinter -5.460e-02 3.851e-02 -1.418 0.156907
Gov_ctrl_typeTotal 2.824e+02 1.248e+02 2.263 0.024066 *
Gov_ctrl_partyRepublican -5.645e+02 2.496e+02 -2.262 0.024160 *
Gov_ctrl_partyshared NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1694 on 481 degrees of freedom
Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
F-statistic: 3.962e+04 on 23 and 481 DF, p-value: < 2.2e-16
The model has a high coefficient of determination on the training set; however, the out-of-sample (validation) RMSE is high. The next step is to compare the model against our testing data.
| Horizon (months) | RMSE (rigs) | DAR (%) | DAR_NUM |
|---|---|---|---|
| 0 | 282.5075 | 62.42545 | 0 |
| 6 | 299.8339 | 40.00000 | 2 |
| 12 | 351.5306 | 54.54545 | 7 |
| 18 | 357.2967 | 58.82353 | 11 |
| 24 | 377.5252 | 52.17391 | 13 |
Similar to ARIMA, as the linear model forecasts further from the training data, the RMSE increases. Across the whole test horizon, the linear model yields lower (better) RMSE values than the mean-model baseline; however, in terms of DAR, the linear model underperforms the 61% growing baseline over all test data.
The next model to test is a random forest including all explanatory variables, again using time-series cross-validation (“evaluation on a rolling forecasting origin”) with an initial window of 200 months (~17 years) for model training and successive validation sets of 24 months (2 years). The training control parameters were already set up for the linear model, as was the model formula, so the only thing remaining before modeling is to set the hyperparameters. For practicality we use the caret package and ask caret to try 5 values of the main hyperparameter; for random forest, the only sensible parameter to vary is mtry, the number of randomly selected predictors considered at each split.
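A minimal sketch of the call, assuming the same formula and control objects as before (the splitrule and min.node.size columns in the output indicate the "ranger" implementation of random forest; the name RF_Rigs is ours):

set.seed(123)
RF_Rigs <- caret::train(total_formula_Rigs,
                        data = oil_train,
                        method = "ranger",
                        trControl = time_control_Rigs,
                        tuneLength = 5)  # 5 mtry values, crossed with both split rules
RF_Rigs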
Random Forest
504 samples
21 predictor
No pre-processing
Resampling: Rolling Forecasting Origin Resampling (24 held-out with a fixed window)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
mtry splitrule RMSE Rsquared MAE
2 variance 0.2325711 0.2369249 0.2050431
2 extratrees 0.2479065 0.2714997 0.2200324
7 variance 0.2179129 0.2441314 0.1888745
7 extratrees 0.2234541 0.2611241 0.1943951
13 variance 0.2239524 0.2369472 0.1933806
13 extratrees 0.2202204 0.2651279 0.1909573
18 variance 0.2309282 0.2296348 0.1989471
18 extratrees 0.2230851 0.2748473 0.1932292
24 variance 0.2393432 0.2300790 0.2060581
24 extratrees 0.2269943 0.2735128 0.1958212
Tuning parameter 'min.node.size' was held constant at a value of 5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 7, splitrule =
variance and min.node.size = 5.
Compared with the linear model, the out-of-sample RMSE values are drastically lower. It is important to highlight that the RMSE values reported in the model summary are on the log scale, since we transformed the response variable; to express errors in rig counts, predictions must be back-transformed (exponentiated) first. As the model plot shows, the tuned model uses 7 randomly selected predictors (mtry = 7), which yields the best validation results. The next step is to test the model against the testing data.
| Horizon (months) | RMSE (rigs) | DAR (%) | DAR_NUM |
|---|---|---|---|
| 0 | 54.99662 | 88.07157 | 0 |
| 6 | 553.30201 | 80.00000 | 5 |
| 12 | 530.96872 | 72.72727 | 9 |
| 18 | 458.12986 | 52.94118 | 10 |
| 24 | 403.63380 | 52.17391 | 13 |
In contrast to the previously evaluated models, the testing RMSE of the random forest does not grow steadily with the time horizon; in fact, the random forest outperforms the mean-baseline RMSE at all testing horizons. However, in terms of DAR, the model only beats the baseline up to 12 months of testing data; beyond that it destabilizes and actually underperforms the 61% DAR baseline.
The next model to test is an extreme gradient boosting (XGBoost) model including all explanatory variables, again using time-series cross-validation (“evaluation on a rolling forecasting origin”) with an initial window of 200 months (~17 years) and successive validation sets of 24 months (2 years). The training control parameters were already set up for the linear model; regarding the formula, however, XGBoost requires the explanatory variables as a numeric matrix and a separate input for the response variable, so we will not use the formula notation of the previous models.
For practicality we again use the caret package to build the model; given the complexity of the model, we create a customized grid of hyperparameters to find the best combination. For this specific case, the best combination found is the one encoded in the tuning grid below.
XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance (ref).
For more information on the algorithm, the following video provides a great description of its capabilities.
trainX_Rigs<- data.matrix(select(oil_train,-Active_Rigs,-Date))
set.seed(123)
tunegrid_xg <- expand.grid(
nrounds = 100,
max_depth = 7,
eta = 0.1,
gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1)
With all parameters set up, the next step is to create the model.
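A minimal sketch of the call, assuming the matrix and grid defined above (the name XG_Rigs is ours):

set.seed(123)
XG_Rigs <- caret::train(x = trainX_Rigs,
                        y = oil_train$Active_Rigs,  # log-transformed response
                        method = "xgbTree",
                        trControl = time_control_Rigs,
                        tuneGrid = tunegrid_xg)
XG_Rigs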
eXtreme Gradient Boosting
504 samples
21 predictor
No pre-processing
Resampling: Rolling Forecasting Origin Resampling (24 held-out with a fixed window)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results:
RMSE Rsquared MAE
0.2272818 0.2082118 0.1980332
Tuning parameter 'nrounds' was held constant at a value of 100
Tuning parameter 'max_depth' was held constant at a value of 7
Tuning parameter 'eta' was held constant at a value of 0.1
Tuning parameter 'gamma' was held constant at a value of 0
Tuning parameter 'colsample_bytree' was held constant at a value of 1
Tuning parameter 'min_child_weight' was held constant at a value of 1
Tuning parameter 'subsample' was held constant at a value of 1
Compared with random forest, the out-of-sample RMSE values are similar. Again, the RMSE values reported in the model summary are on the log scale, since we transformed the response variable; predictions must be exponentiated back to rig counts before computing errors on the original scale. The next step is to test the model against the testing data.
| Horizon (months) | RMSE (rigs) | DAR (%) | DAR_NUM |
|---|---|---|---|
| 0 | 14.15146 | 91.65010 | 0 |
| 6 | 409.38629 | 60.00000 | 4 |
| 12 | 354.44541 | 54.54545 | 7 |
| 18 | 302.52374 | 52.94118 | 10 |
| 24 | 286.91612 | 52.17391 | 13 |
In contrast to the earlier models, the testing RMSE of XGBoost does not grow steadily with the time horizon; in fact, XGBoost outperforms the mean-baseline RMSE at all testing horizons. However, in terms of DAR, the model underperforms the 61% DAR baseline at all testing horizons.
The next model to test is a Support Vector Machine (SVM), specifically an SVM with a Radial Basis Function (RBF) kernel, including all explanatory variables and again using time-series cross-validation (“evaluation on a rolling forecasting origin”) with an initial window of 200 months (~17 years) and successive validation sets of 24 months (2 years). The training control parameters were already set up for the linear model, as was the formula, so the only thing remaining before modeling is to set the hyperparameters. For practicality we use the caret package; given the complexity of the model, we create a customized grid of hyperparameters to find the best combination.
SVMgrid <- expand.grid(sigma = seq(0.005,0.02,by=0.005), C = c(100,200,300,400))
With everything set up, the next step is to fit the model.
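A minimal sketch of the call, assuming the grid defined above (the name SVM_Rigs is ours):

set.seed(123)
SVM_Rigs <- caret::train(total_formula_Rigs,
                         data = oil_train,
                         method = "svmRadial",
                         trControl = time_control_Rigs,
                         tuneGrid = SVMgrid)
SVM_Rigs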
Support Vector Machines with Radial Basis Function Kernel
504 samples
21 predictor
No pre-processing
Resampling: Rolling Forecasting Origin Resampling (24 held-out with a fixed window)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
sigma C RMSE Rsquared MAE
0.005 100 0.4698541 0.2791247 0.4250718
0.005 200 0.5096392 0.2877910 0.4561979
0.005 300 0.5202698 0.2862592 0.4645631
0.005 400 0.5259752 0.2893686 0.4684042
0.010 100 0.4680788 0.3041593 0.4226456
0.010 200 0.4845519 0.3169344 0.4358495
0.010 300 0.4886574 0.3282740 0.4383617
0.010 400 0.4913978 0.3308360 0.4406170
0.015 100 0.4409003 0.3195932 0.4003715
0.015 200 0.4458187 0.3233245 0.4043227
0.015 300 0.4479417 0.3143934 0.4065934
0.015 400 0.4481367 0.3129133 0.4069005
0.020 100 0.4128205 0.2995149 0.3768963
0.020 200 0.4151038 0.2915372 0.3791701
0.020 300 0.4151046 0.2872727 0.3791994
0.020 400 0.4144160 0.2858717 0.3785994
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.02 and C = 100.
Compared with XGBoost and random forest, the out-of-sample RMSE values are higher. As before, the RMSE values reported in the model summary are on the log scale; predictions must be exponentiated back to rig counts before computing errors on the original scale. As the plot shows, the best combination of hyperparameters is a cost of 100 and a sigma of 0.02.
The next step is to test the model against the testing data.
| Horizon (months) | RMSE (rigs) | DAR (%) | DAR_NUM |
|---|---|---|---|
| 0 | 75.45089 | 75.14911 | 0 |
| 6 | 414.63985 | 80.00000 | 5 |
| 12 | 313.23020 | 63.63636 | 8 |
| 18 | 386.20813 | 58.82353 | 11 |
| 24 | 416.46161 | 52.17391 | 13 |
Similar to the previously evaluated models, the testing RMSE of the SVM increases with the time horizon; nevertheless, the SVM outperforms the mean-baseline RMSE at all testing horizons. In terms of DAR, the model only beats the baseline up to 12 months of testing data; beyond that it destabilizes and actually underperforms the 61% DAR baseline.
A neural network can be thought of as a network of “neurons” which are organised in layers. The predictors (or inputs) form the bottom layer, and the forecasts (or outputs) form the top layer. There may also be intermediate layers containing “hidden neurons”.
The simplest networks contain no hidden layers and are equivalent to linear regressions. Consider the neural network version of a linear regression with four predictors: the coefficients attached to these predictors are called “weights”. The forecasts are obtained by a linear combination of the inputs. The weights are selected in the neural network framework using a “learning algorithm” that minimises a “cost function” such as the MSE. Of course, in this simple example, we can use linear regression, which is a much more efficient method of training the model.
Once we add an intermediate layer with hidden neurons, the neural network becomes non-linear.
This is known as a multilayer feed-forward network, where each layer of nodes receives inputs from the previous layers. The outputs of the nodes in one layer are inputs to the next layer. The inputs to each node are combined using a weighted linear combination. The result is then modified by a nonlinear function before being output.
For example, the inputs into hidden neuron \(j\) are combined linearly to give
\[z_j = b_j + \sum_{i=1}^4 w_{i,j} x_i.\] In the hidden layer, this is then modified using a nonlinear function such as a sigmoid or ReLU,
\[s(z) = \frac{1}{1+e^{-z}},\]
to give the input for the next layer. This tends to reduce the effect of extreme input values, thus making the network somewhat robust to outliers.
The parameters \(b_1, b_2, b_3\) and \(w_{1,1}, \dots, w_{4,3}\) are “learned” from the data. The values of the weights are often restricted to prevent them from becoming too large. The parameter that restricts the weights is known as the “decay parameter”, and is often set equal to 0.1.
The weights take random values to begin with, and these are then updated using the observed data. Consequently, there is an element of randomness in the predictions produced by a neural network. Therefore, the network is usually trained several times using different random starting points, and the results are averaged.
The number of hidden layers, and the number of nodes in each hidden layer, must be specified in advance.
With time series data, lagged values of the time series can be used as inputs to a neural network, just as we used lagged values in ARIMA models. We call this a neural network autoregression or NNAR model. To limit complexity, we will only consider feed-forward networks with one hidden layer.
Hyndman, R.J., & Athanasopoulos, G. (2018) Forecasting: principles and practice, 2nd edition, OTexts: Melbourne, Australia. OTexts.com/fpp2.
For the architecture of this network we will use a single, densely connected hidden layer with 60 units, and 12 lagged inputs (one year of previous monthly data points) as explanatory variables.
set.seed(123)
NNETAr_Rigs <- forecast::nnetar(y=OIL_MODIFIED_TS[1:504,"Active_Rigs"],p=12,size = 60)
NNETAr_Rigs
Series: OIL_MODIFIED_TS[1:504, "Active_Rigs"]
Model: NNAR(12,60)
Call: forecast::nnetar(y = OIL_MODIFIED_TS[1:504, "Active_Rigs"], p = 12,
size = 60)
Average of 20 networks, each of which is
a 12-60-1 network with 841 weights
options were - linear output units
sigma^2 estimated as 0.0005491
Compared with previous models, the in-sample error is the lowest so far; however, we can only confirm the validity of the model using the testing data.
| Horizon (months) | RMSE (rigs) | DAR (%) | DAR_NUM |
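A sketch of how the 24-month test forecasts can be produced and mapped back to rig counts (recall the series was log-transformed; the object names are ours):

NNAR_fc <- forecast::forecast(NNETAr_Rigs, h = 24)  # 2-year horizon
rig_preds <- exp(NNAR_fc$mean)                      # back-transform from the log scale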
|---|---|---|---|
| 0 | 32.66482 | 76.57841 | 0 |
| 6 | 53.52193 | 80.00000 | 5 |
| 12 | 169.64840 | 81.81818 | 10 |
| 18 | 143.21707 | 76.47059 | 14 |
| 24 | 130.64926 | 73.91304 | 18 |
In contrast to previously evaluated models such as the linear model and ARIMA, the testing RMSE of the NNAR does not grow steadily with the time horizon; in fact, the NNAR outperforms the mean-baseline RMSE at all testing horizons, and the values seem to stabilize in the long run. In terms of DAR, the model beats the 61% DAR baseline at all testing horizons.
The stacked model gave us a hint of the benefits of combining time-series autoregressive and structured-data approaches; therefore, the ideal model for this dataset might be a recurrent neural network (RNN), and more specifically a Long Short-Term Memory (LSTM) network.
LSTMs are quite useful in time series prediction tasks involving autocorrelation, the presence of correlation between the time series and lagged versions of itself, because of their ability to maintain state and recognize patterns over the length of the time series. The recurrent architecture enables the states to persist, or communicate between updates of the weights as each epoch progresses. Further, the LSTM cell architecture enhances the RNN by enabling long term persistence in addition to short term, which is fascinating!
Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in subsequent work. They work tremendously well on a large variety of problems, and are now widely used.
LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!
All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.
The repeating module in a standard RNN contains a single layer.
LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.
The repeating module in an LSTM contains four interacting layers.
In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied, with the copies going to different locations.
The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.
The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.
The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.
Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.
The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”
An LSTM has three of these gates, to protect and control the cell state.
The LSTM neural net will include the two most important variables proposed by the Boruta analysis performed during the EDA of this work, namely Oil_Price and Oil_Stock, as well as autoregressive terms of Active_Rigs. The neural net requires the input data as a matrix with all-numeric, scaled parameters; it also requires a third dimension for the time indexes, as well as the lag, the number of previous time points used as explanatory variables to train the network. For this specific case we decided to train/test split the dataset and chose the following parameters:
- Lag: 96 months, equivalent to 8 years of data points used as explanatory variables
- Batch size: 4, used to train the model on each epoch
- Train: 408 data points, representing the dataset up until the end of 2008
- Test: 24 months, the last two years of data
We will build the network using the Keras package, with Google's TensorFlow as the engine of the neural net.
For the initial architecture of our neural net, we will use "adam" as the optimizer instead of the classical "rmsprop", and the mean absolute error ("mae") as the loss function.
RIGS_LSTM <- select(OIL_MODIFIED,-Date,-Year,-Month)
RIGS_LSTM <- RIGS_LSTM %>% mutate_if(is.factor,as.numeric)
m2<-apply(RIGS_LSTM, 2, mean)
m2[1]
Active_Rigs
7.158164
std2<-apply(RIGS_LSTM, 2, sd)
std2[1]
Active_Rigs
0.4874875
RIGS_LSTM<-scale(RIGS_LSTM,center = m2,scale = std2)
datalags = 96
train = RIGS_LSTM[seq(312 + datalags), ]
test = RIGS_LSTM[312 + datalags + seq(24+ datalags), ]
batch.size = 4
x.train = array(data = lag(cbind(train[,1],
train[,9],
train[,10]), datalags)[-(1:datalags), ],
dim = c(nrow(train) - datalags, datalags, 3))
y.train = array(data = train[,1][-(1:datalags)], dim = c(nrow(train)-datalags, 1))
x.test = array(data = lag(cbind(test[,1],
test[,9],
test[,10]), datalags)[-(1:datalags), ],
dim = c(nrow(test) - datalags, datalags, 3))
y.test = array(data = test[,1][-(1:datalags)], dim = c(nrow(test) - datalags, 1))
set.seed(123)
LSTM <- keras_model_sequential()
LSTM %>%
layer_lstm(units = 70,
input_shape = c(datalags, 3),
batch_size = batch.size,
return_sequences = FALSE,
stateful = TRUE) %>%
layer_dropout(rate = 0.5) %>%
layer_dense(units = 1)
LSTM %>%
compile(loss = 'mae', optimizer = 'adam')
LSTM
Model
___________________________________________________________________________
Layer (type) Output Shape Param #
===========================================================================
lstm (LSTM) (4, 70) 20720
___________________________________________________________________________
dropout (Dropout) (4, 70) 0
___________________________________________________________________________
dense (Dense) (4, 1) 71
===========================================================================
Total params: 20,791
Trainable params: 20,791
Non-trainable params: 0
___________________________________________________________________________
# fit one epoch at a time, resetting the stateful LSTM's internal state between passes
for(i in 1:408){
LSTM %>% fit(x = x.train,
y = y.train,
batch_size = batch.size,
epochs = 1,
verbose = 0,
shuffle = FALSE)
LSTM %>% reset_states()
}
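A sketch of how test-set predictions can be generated and mapped back to rig counts (undoing the scaling and then the log transform applied earlier; the object names are ours):

pred_scaled <- LSTM %>% predict(x.test, batch_size = batch.size)  # stateful model: same batch size
rig_preds <- exp(pred_scaled * std2[1] + m2[1])                   # unscale, then invert the log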
| Horizon (months) | RMSE (rigs) | DAR (%) | DAR_NUM |
|---|---|---|---|
| 0 | 1185.8159 | 48.55305 | 0 |
| 6 | 118.6381 | 80.00000 | 5 |
| 12 | 102.6484 | 81.81818 | 10 |
| 18 | 146.9246 | 70.58824 | 13 |
| 24 | 245.7800 | 78.26087 | 19 |
For all test-data horizons, the LSTM yields better results than both the RMSE and DAR baselines; however, as the forecast moves further from the training data, the RMSE worsens and the DAR performance also declines somewhat.
Looking at the in-sample RMSE, we might guess that there is still plenty of room for improvement in the architecture of the neural net; however, this would require more extensive empirical optimization since, as many deep learning experts have stated, designing the architecture of these kinds of models is more an art than a strict science.
With all models built, the next step is to compare their performance against each other:
| Horizon | Model | RMSE (rigs) | DAR (proportion) |
|---|---|---|---|
| 0 Months | lm | 282.507486 | 0.6242545 |
| 0 Months | RF | 54.996617 | 0.8807157 |
| 0 Months | XGboost | 2.835767 | 0.9801193 |
| 0 Months | SVM_RBF | 75.450891 | 0.7514911 |
| 0 Months | ARIMA | 50.911900 | 0.7216700 |
| 0 Months | NNAR | 32.977514 | 0.7657841 |
| 0 Months | LSTM | 1185.815910 | 0.4855305 |
| 6 Months | lm | 299.833928 | 0.4000000 |
| 6 Months | RF | 553.302010 | 0.8000000 |
| 6 Months | XGboost | 408.720458 | 0.6000000 |
| 6 Months | SVM_RBF | 414.639851 | 0.8000000 |
| 6 Months | ARIMA | 198.515610 | 0.6000000 |
| 6 Months | NNAR | 48.490524 | 1.0000000 |
| 6 Months | LSTM | 118.638074 | 0.8000000 |
| 12 Months | lm | 351.530637 | 0.5454545 |
| 12 Months | RF | 530.968721 | 0.7272727 |
| 12 Months | XGboost | 354.518957 | 0.5454545 |
| 12 Months | SVM_RBF | 313.230198 | 0.6363636 |
| 12 Months | ARIMA | 182.270112 | 0.6363636 |
| 12 Months | NNAR | 144.182721 | 0.9090909 |
| 12 Months | LSTM | 102.648369 | 0.8181818 |
| 18 Months | lm | 357.296689 | 0.5882353 |
| 18 Months | RF | 458.129857 | 0.5294118 |
| 18 Months | XGboost | 302.937654 | 0.5294118 |
| 18 Months | SVM_RBF | 386.208128 | 0.5882353 |
| 18 Months | ARIMA | 174.197096 | 0.4705882 |
| 18 Months | NNAR | 121.904982 | 0.8235294 |
| 18 Months | LSTM | 146.924585 | 0.7058824 |
| 24 Months | lm | 377.525182 | 0.5217391 |
| 24 Months | RF | 403.633797 | 0.5217391 |
| 24 Months | XGboost | 289.391698 | 0.5217391 |
| 24 Months | SVM_RBF | 416.461614 | 0.5217391 |
| 24 Months | ARIMA | 195.871558 | 0.4347826 |
| 24 Months | NNAR | 113.204349 | 0.8260870 |
| 24 Months | LSTM | 245.780025 | 0.7826087 |
Looking at the table and plots in more detail, we can see that the autoregressive neural network (NNAR) provides the best results on both performance metrics, RMSE and DAR; it outperforms the baseline models as well as the other models tested in this work, although the LSTM also provided good results, close to the performance of the NNAR.
By plotting our 2-year Active Rigs predictions from the NNAR against the actual number of active rigs on a gain plot, we can obtain the Gini score, which in this case is 0.012; such a low Gini score suggests that the two curves (predictions vs. actuals) are nearly identical.
The best model obtained in this work for the prediction of future monthly rig counts is an autoregressive neural network (NNAR).
Using the selected NNAR model, it is possible to predict future rig count activity up to two years beyond the training data, with an RMSE of ~150 rigs and ~80% directional accuracy.
In terms of testing-data RMSE, the best performance is achieved by the autoregressive neural network (NNAR).
In terms of testing-data DAR, the best performance is achieved by the autoregressive neural network (NNAR).
A bigger dataset might yield better results for an LSTM.