The majority of the data used for this analysis was downloaded directly from the EIA webpage; the links can be found below.
Descriptions of the variables, their units of measurement, and the sources of the data can be found in the tables below:
Codebook: Variable Description
| Variable Name | Description |
|---|---|
| Date | Date of the observation |
| Month | Month of the observation |
| Year | Year of the observation |
| Active_Rigs | Number of active drilling rigs, onshore and offshore, recorded as working on that date |
| US_Population | USA population estimates for the date of the observation |
| President_Party | Political party of the sitting US president on the date of the observation |
| Senate | Political party holding the majority of the US Senate on the date of the observation |
| House | Political party holding the majority of the US House of Representatives on the date of the observation |
| life_exp | US average life expectancy at the date of the observation |
| GDP | Gross Domestic Product of the US economy on the date of the observation |
| Oil_Production | Total US oil production on the date of the observation |
| Oil_Stock | US inventory of oil in stock on the date of the observation |
| Oil_Price | US average domestic first purchase price, not adjusted for inflation |
| HDD | Monthly US heating degree days: the number of degrees by which the daily average temperature falls below 65 °F |
| CDD | Monthly US cooling degree days: the number of degrees by which the daily average temperature rises above 65 °F |
| Energy_Consumption | The amount of energy consumed by the US market on the date of the observation |
| Net_imports_Energy | The Net amount of energy imported by the US market on the date of the observation |
| Oil_Reserves | US proven oil reserves on the date of the observation |
| Active_Conflicts | Number of active military conflicts worldwide on the date of the observation, including civil wars |
| Drilled_Wells | Number of wells drilled in the US on the date of the observation |
Codebook: Variable Units of Measurement
| Variable Name | Units of measurement |
|---|---|
| Date | Date in Year-Month-Day format |
| Month | Integer from 1 to 12 |
| Year | Four-digit integer |
| Active_Rigs | Integer; each unit represents one drilling rig |
| US_Population | Numerical, in millions of people |
| President_Party | Categorical: either Democrat or Republican |
| Senate | Categorical: either Democrat or Republican |
| House | Categorical: either Democrat or Republican |
| life_exp | Numerical, life expectancy in years |
| GDP | Numerical, in trillions of US dollars (10^12 USD) |
| Oil_Production | Numerical, in thousand barrels per day |
| Oil_Stock | Numerical, US oil inventory in million barrels |
| Oil_Price | Numerical, in US dollars per barrel |
| HDD | Numerical, monthly US HDD in degree days (°F) |
| CDD | Numerical, monthly US CDD in degree days (°F) |
| Energy_Consumption | Numerical, energy consumed in quadrillion Btu (10^15 Btu) |
| Net_imports_Energy | Numerical, net energy imported in quadrillion Btu (10^15 Btu) |
| Oil_Reserves | Numerical, oil reserves in million barrels |
| Active_Conflicts | Numerical count; each unit represents one active military conflict |
| Drilled_Wells | Numerical count; each unit represents one drilled well |
Codebook: Variable Source of Data
| Variable Name | Source of data |
|---|---|
| Date | Included with other measurements |
| Month | Included with other measurements |
| Year | Included with other measurements |
| Active_Rigs | Data Link |
| US_Population | Data Link |
| President_Party | Data Link |
| Senate | Data Link |
| House | Data Link |
| life_exp | Data Link |
| GDP | Data Link |
| Oil_Production | Data Link |
| Oil_Stock | Data Link |
| Oil_Price | Data Link |
| HDD | Data Link |
| CDD | Data Link |
| Energy_Consumption | Data Link |
| Net_imports_Energy | Data Link |
| Oil_Reserves | Data Link |
| Active_Conflicts | Data Link |
| Active_Conflicts 2 | Data Link |
| Drilled_Wells | Data Link |
The main purpose of this work is to create models able to predict rig activity (Active_Rigs) in the US and the behavior of oil prices (Oil_Price), based on various explanatory variables.
The Exploratory Data Analysis (EDA) performed on this dataset, as well as various statistical tests, can be found at the following Link.
The working dataset is composed of 23 variables and 528 observations; there are no NAs, since the data was already wrangled in the previous part of this work (Link).
All the code related to model generation can be found at the following link: GIT.
'data.frame': 528 obs. of 23 variables:
$ Date : Date, format: "1974-01-01" "1974-02-01" ...
$ Month : num 1 2 3 4 5 6 7 8 9 10 ...
$ Year : num 1974 1974 1974 1974 1974 ...
$ Active_Rigs : num 1372 1355 1367 1381 1413 ...
$ US_Population : num 214 214 214 214 214 ...
$ President_Party : Factor w/ 2 levels "Democrat","Republican": 2 2 2 2 2 2 2 2 2 2 ...
$ Senate : Factor w/ 2 levels "Democrat","Republican": 1 1 1 1 1 1 1 1 1 1 ...
$ House : Factor w/ 2 levels "Democrat","Republican": 1 1 1 1 1 1 1 1 1 1 ...
$ life_exp : num 72 72 72 72 72 ...
$ GDP : num 1.55 1.55 1.55 1.55 1.55 ...
$ Oil_Production : num 10633 10870 10707 10650 10600 ...
$ Oil_Stock : num 233 240 244 256 269 268 268 264 266 269 ...
$ Oil_Price : num 6.95 6.87 6.77 6.77 6.87 6.85 6.8 6.71 6.7 6.97 ...
$ HDD : int 911 823 618 351 200 51 8 18 114 350 ...
$ CDD : int 16 5 22 23 88 149 296 242 97 30 ...
$ Energy_Consumption: num 6.91 6.32 6.37 5.86 5.87 ...
$ Net_imports_Energy: num 0.962 0.791 0.932 0.97 1.063 ...
$ Oil_Reserves : num 32250 32250 32250 32250 32250 ...
$ Active_Conflicts : num 56 56 56 56 56 56 56 56 56 56 ...
$ Eco_Rec : Factor w/ 2 levels "Normal","Recession": 2 2 2 2 2 2 2 2 2 2 ...
$ Season : Factor w/ 4 levels "Autumn","Spring",..: 4 4 2 2 2 3 3 3 1 1 ...
$ Gov_ctrl_type : Factor w/ 2 levels "shared","Total": 1 1 1 1 1 1 1 1 1 1 ...
$ Gov_ctrl_party : Factor w/ 3 levels "Democrat","Republican",..: 3 3 3 3 3 3 3 3 3 3 ...
The previous exploratory data analysis showed that the rig count (Active_Rigs) follows a heavily right-skewed distribution.
Due to the severe skewness of Active_Rigs, we will search for an adequate transformation using the forecast::BoxCox.lambda function in R, which returns the lambda value that best stabilizes the series and therefore suggests the most suitable transformation.
forecast::BoxCox.lambda(OIL_MODIFIED$Active_Rigs)
[1] 0.1143755
The obtained lambda value is close to zero, which suggests a logarithmic transformation.
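For reference, the Box-Cox family of transformations is
\[w_t = \begin{cases} \log(y_t) & \text{if } \lambda = 0, \\ (y_t^\lambda - 1)/\lambda & \text{otherwise,} \end{cases}\]
so a fitted \(\lambda\) near zero corresponds (up to scale and shift) to taking logs of the series.
The transformation centered the distribution and reduced the right-side skewness.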
Due to the intrinsic association of our response variable with time, the dataset will be divided into training and test sets following time-series best practice, in which the order of the observations is respected.
The training set contains observations from 1974 through December 2015; the test set consists of two years of monthly observations, up to December 2017. The training set will be used to fit the different models, and the test set to evaluate their quality.
OIL_MODIFIED <- OIL_MODIFIED %>% mutate(Active_Rigs = log(Active_Rigs))  # log-transform the response
oil_train <- OIL_MODIFIED[1:504,]       # Jan 1974 - Dec 2015
oil_test_6M <- OIL_MODIFIED[505:510,]   # first 6 months of the test period
oil_test_1Y <- OIL_MODIFIED[505:516,]   # first year
oil_test_1.5Y <- OIL_MODIFIED[505:522,] # first 18 months
oil_test_2Y <- OIL_MODIFIED[505:528,]   # full 2-year test period
dim(oil_train)
[1] 504 23
dim(oil_test_2Y)
[1] 24 23
Another important aspect of machine learning models is model tuning; for this purpose, a partition of the training set will be used as a validation set to tune hyperparameters and find the most optimized model, one that is able to generalize to outside data without overfitting the training set. The validation set will be created following the concept of time-series cross-validation, known in econometrics as “evaluation on a rolling forecasting origin”.
Hyndman, R.J., & Athanasopoulos, G. (2018) Forecasting: principles and practice, 2nd edition, OTexts: Melbourne, Australia. OTexts.com/fpp2.
To evaluate the quality of the models, we selected two performance measurements: the Root Mean Squared Error (RMSE) and the Directional Accuracy Rate (DAR), the percentage of months in which the predicted direction of change matches the actual one.
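As a reference, a minimal sketch of how these two metrics can be computed (the helper names rmse and dar are ours, for illustration only, and are not the exact code used to produce the tables below):

# `actual` and `predicted` are numeric vectors on the original rig-count scale
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
# DAR: share of months where the predicted direction of change matches the actual one
dar <- function(actual, predicted) mean(sign(diff(predicted)) == sign(diff(actual))) * 100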
The previously performed EDA and statistical analysis showed that Oil_Price exhibits time dependency, in which each data point is correlated with preceding and succeeding points; a Chi-square goodness-of-fit test indicated that monthly values are dependent on one another, and this can also be confirmed with an ACF plot of the monthly changes in the series.
The ACF plot shows strong autocorrelation at lags 1 and 2, and at various other lags. To confirm this suspicion, we apply a Ljung-Box test, with an alpha of 5% (0.05), to the differenced Active_Rigs series.
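A minimal sketch of the test call, using Box.test from base R's stats package (lag = 1 matches the single degree of freedom reported below):

Box.test(diff(OIL_MODIFIED_TS$Active_Rigs), lag = 1, type = "Ljung-Box")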
Box-Ljung test
data: diff(OIL_MODIFIED_TS$Active_Rigs)
X-squared = 289.2, df = 1, p-value < 2.2e-16
The very low p-value indicates that the Active_Rigs time series is not “white noise” and that information can be extracted from it; a time-series model is therefore appropriate, and the first chosen model is a non-seasonal ARIMA.
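A minimal sketch of how such a model can be fitted with the forecast package (the object name ARIMA_Rigs is ours; auto.arima performs the order selection that produced the ARIMA(2,1,1) reported below):

library(forecast)
ARIMA_Rigs <- auto.arima(OIL_MODIFIED_TS[1:504, "Active_Rigs"], seasonal = FALSE)  # non-seasonal search
summary(ARIMA_Rigs)          # coefficients and training-set error measures shown below
checkresiduals(ARIMA_Rigs)   # residual diagnostics, including the Ljung-Box test shown below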
Series: OIL_MODIFIED_TS[1:504, "Active_Rigs"]
ARIMA(2,1,1)
Coefficients:
ar1 ar2 ma1
1.3542 -0.5612 -0.5003
s.e. 0.1060 0.0743 0.1188
sigma^2 estimated as 0.00116: log likelihood=987.37
AIC=-1966.74 AICc=-1966.66 BIC=-1949.86
Training set error measures:
ME RMSE MAE MPE MAPE
Training set -0.0005588134 0.03392013 0.02370383 -0.007655009 0.3358983
MASE ACF1
Training set 0.6624728 0.005979014
Ljung-Box test
data: Residuals from ARIMA(2,1,1)
Q* = 10.533, df = 7, p-value = 0.1603
Model df: 3. Total lags used: 10
The Ljung-Box test on the residuals of the fitted model yields a p-value of 0.16 (> 0.05), so the residuals are consistent with white noise and the model appears to have captured the available linear information; its point forecasts can therefore be used with reasonable confidence. The next step is to evaluate the model on the test data. Before computing any results, we should first establish baselines for comparison: as the RMSE baseline we use a model that predicts the mean rig count of the training data for every observation, and as the DAR baseline a model that assumes every successive change in rig count is positive (the rig count is always growing). Both baselines are computed over the full test set (2 years).
(RMSE_BASELINE<-sqrt(mean((mean(exp(oil_train$Active_Rigs))-exp(oil_test_2Y$Active_Rigs))^2)))
[1] 818.6661
(DAR_BASELINE<-mean(diff(oil_train$Active_Rigs)>0)*100)
[1] 61.2326
With the baselines in place, it is time to evaluate the performance of the ARIMA model:
| Horizon (months) | RMSE (rigs) | DAR (%) | DAR_NUM |
|---|---|---|---|
| 0 | 50.9119 | 72.16700 | 0 |
| 6 | 198.5156 | 60.00000 | 4 |
| 12 | 182.2701 | 63.63636 | 8 |
| 18 | 174.1971 | 47.05882 | 8 |
| 24 | 195.8716 | 43.47826 | 10 |
As the model forecasts further away from the training data, the RMSE increases. Across the whole test horizon, the ARIMA model yields lower (better) RMSE values than the mean-model baseline. A key difference from the mean model is that ARIMA provides DAR values, which are incredibly valuable for financial markets, especially for commodities like oil that have plenty of financial instruments and markets in which to trade them; on this metric, however, the ARIMA model underperforms the 61%-growing-trend baseline. Another important point is that a mean model cannot produce a DAR at all, since all its predicted values are constant.
The next model to test is a linear regression including all explanatory variables, using time-series cross-validation (“evaluation on a rolling forecasting origin”) with an initial window of 200 months (~17 years) for model training and successive validation sets of 24 months (2 years).
total_formula_Rigs <-as.formula(Active_Rigs ~ Month +
Oil_Price +
Year +
US_Population +
President_Party +
Senate +
House +
life_exp +
GDP +
Oil_Production +
Oil_Stock +
HDD +
CDD +
Energy_Consumption +
Net_imports_Energy +
Oil_Reserves +
Active_Conflicts +
Eco_Rec +
Season +
Gov_ctrl_type +
Gov_ctrl_party)
# linear model bench mark using caret
set.seed(123)
time_control_Rigs <- trainControl(method = "timeslice",
initialWindow = 200,
horizon = 24,
fixedWindow = TRUE,
verboseIter = TRUE,
allowParallel = TRUE)
With the setup ready, the next step is to create the linear model. For practicality we use the caret library, which allows us to use timeslice cross-validation. Every model in caret expects hyperparameter tuning; however, an “lm” model has no hyperparameters to tune, so we create a custom tuning grid with only the intercept as the hyperparameter, and set it to FALSE.
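A minimal sketch of the resulting call, assuming the formula and control objects defined above (the model name LM_Rigs is ours):

# one-row custom grid: caret's "lm" exposes only the `intercept` parameter
tunegrid_lm <- data.frame(intercept = FALSE)
set.seed(123)
LM_Rigs <- caret::train(total_formula_Rigs,
                        data = oil_train,
                        method = "lm",
                        trControl = time_control_Rigs,
                        tuneGrid = tunegrid_lm)
LM_Rigs                      # resampling summary shown below
summary(LM_Rigs$finalModel)  # coefficient table shown below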
Linear Regression
504 samples
21 predictor
No pre-processing
Resampling: Rolling Forecasting Origin Resampling (24 held-out with a fixed window)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results:
RMSE Rsquared MAE
259.4039 0.289489 200.3085
Tuning parameter 'intercept' was held constant at a value of FALSE
Call:
lm(formula = .outcome ~ 0 + ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-0.57154 -0.10623 0.00041 0.10594 0.55338
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
Month 4.429e-03 3.019e-03 1.467 0.143048
Oil_Price 1.180e-02 8.343e-04 14.147 < 2e-16 ***
Year -1.629e-01 6.638e-02 -2.453 0.014513 *
US_Population -1.898e-02 2.188e-02 -0.867 0.386103
President_PartyRepublican 2.822e+02 1.248e+02 2.261 0.024176 *
SenateRepublican 3.055e-01 4.077e-02 7.493 3.25e-13 ***
HouseRepublican 2.822e+02 1.248e+02 2.261 0.024216 *
life_exp 6.809e-01 5.196e-02 13.105 < 2e-16 ***
GDP 2.021e-01 3.844e-02 5.258 2.20e-07 ***
Oil_Production -3.838e-05 2.471e-05 -1.554 0.120917
Oil_Stock -3.463e-04 3.478e-04 -0.995 0.320022
HDD -1.180e-04 1.134e-04 -1.040 0.298783
CDD -2.334e-04 2.748e-04 -0.849 0.396069
Energy_Consumption 1.322e-01 4.281e-02 3.088 0.002130 **
Net_imports_Energy -1.080e-01 6.426e-02 -1.680 0.093548 .
Oil_Reserves 3.058e-05 8.791e-06 3.479 0.000548 ***
Active_Conflicts -4.035e-04 5.508e-04 -0.733 0.464185
Eco_RecRecession 2.029e-01 2.577e-02 7.870 2.36e-14 ***
SeasonSpring -7.635e-02 2.886e-02 -2.645 0.008433 **
SeasonSummer -6.708e-02 4.241e-02 -1.582 0.114378
SeasonWinter -5.460e-02 3.851e-02 -1.418 0.156907
Gov_ctrl_typeTotal 2.824e+02 1.248e+02 2.263 0.024066 *
Gov_ctrl_partyRepublican -5.645e+02 2.496e+02 -2.262 0.024160 *
Gov_ctrl_partyshared NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1694 on 481 degrees of freedom
Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
F-statistic: 3.962e+04 on 23 and 481 DF, p-value: < 2.2e-16
The model has a high coefficient of determination on the training set; however, the out-of-sample (validation) RMSE is high. The next step is to compare the model against our testing data.
| Horizon (months) | RMSE (rigs) | DAR (%) | DAR_NUM |
|---|---|---|---|
| 0 | 282.5075 | 62.42545 | 0 |
| 6 | 299.8339 | 40.00000 | 2 |
| 12 | 351.5306 | 54.54545 | 7 |
| 18 | 357.2967 | 58.82353 | 11 |
| 24 | 377.5252 | 52.17391 | 13 |
Similar to ARIMA, as the linear model forecasts further from the training data, the RMSE increases. Across the whole test horizon, the linear model yields lower (better) RMSE values than the mean-model baseline; however, in terms of DAR, the linear model underperforms the 61% growing baseline over all test data.
The next model to test is a random forest including all explanatory variables, again using time-series cross-validation (“evaluation on a rolling forecasting origin”) with an initial window of 200 months (~17 years) for model training and successive validation sets of 24 months (2 years). The training control parameters were already set up for the linear model, as was the model formula, so the only thing remaining before modeling is to set the hyperparameters. For practicality we use the caret package and ask caret to try 5 values of the main hyperparameter; for random forest, the only sensible parameter to vary is mtry, the number of randomly selected predictors considered at each split.
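A minimal sketch of the call, assuming the same formula and control objects as before (the splitrule and min.node.size columns in the output indicate the "ranger" implementation of random forest; the name RF_Rigs is ours):

set.seed(123)
RF_Rigs <- caret::train(total_formula_Rigs,
                        data = oil_train,
                        method = "ranger",
                        trControl = time_control_Rigs,
                        tuneLength = 5)  # 5 mtry values, crossed with both split rules
RF_Rigs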
Random Forest
504 samples
21 predictor
No pre-processing
Resampling: Rolling Forecasting Origin Resampling (24 held-out with a fixed window)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
mtry splitrule RMSE Rsquared MAE
2 variance 0.2325711 0.2369249 0.2050431
2 extratrees 0.2479065 0.2714997 0.2200324
7 variance 0.2179129 0.2441314 0.1888745
7 extratrees 0.2234541 0.2611241 0.1943951
13 variance 0.2239524 0.2369472 0.1933806
13 extratrees 0.2202204 0.2651279 0.1909573
18 variance 0.2309282 0.2296348 0.1989471
18 extratrees 0.2230851 0.2748473 0.1932292
24 variance 0.2393432 0.2300790 0.2060581
24 extratrees 0.2269943 0.2735128 0.1958212
Tuning parameter 'min.node.size' was held constant at a value of 5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 7, splitrule =
variance and min.node.size = 5.
Compared with the linear model, the out-of-sample RMSE values are drastically lower. It is important to highlight that the RMSE values reported in the model summary are on the log scale, since we transformed the response variable; to express errors in rig counts, predictions must be back-transformed (exponentiated) first. As the model plot shows, the tuned model uses 7 randomly selected predictors (mtry = 7), which yields the best validation results. The next step is to test the model against the testing data.
| Horizon (months) | RMSE (rigs) | DAR (%) | DAR_NUM |
|---|---|---|---|
| 0 | 54.99662 | 88.07157 | 0 |
| 6 | 553.30201 | 80.00000 | 5 |
| 12 | 530.96872 | 72.72727 | 9 |
| 18 | 458.12986 | 52.94118 | 10 |
| 24 | 403.63380 | 52.17391 | 13 |
In contrast to the previously evaluated models, the testing RMSE of the random forest does not grow steadily with the time horizon; in fact, the random forest outperforms the mean-baseline RMSE at all testing horizons. However, in terms of DAR, the model only beats the baseline up to 12 months of testing data; beyond that it destabilizes and actually underperforms the 61% DAR baseline.
The next model to test is an extreme gradient boosting (XGBoost) model including all explanatory variables, again using time-series cross-validation (“evaluation on a rolling forecasting origin”) with an initial window of 200 months (~17 years) and successive validation sets of 24 months (2 years). The training control parameters were already set up for the linear model; regarding the formula, however, XGBoost requires the explanatory variables as a numeric matrix and a separate input for the response variable, so we will not use the formula notation of the previous models.
For practicality we again use the caret package to build the model; given the complexity of the model, we create a customized grid of hyperparameters to find the best combination. For this specific case, the best combination found is the one encoded in the tuning grid below.
XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance (ref).
For more information on the algorithm, the following video provides a great description of its capabilities.
trainX_Rigs<- data.matrix(select(oil_train,-Active_Rigs,-Date))
set.seed(123)
tunegrid_xg <- expand.grid(
nrounds = 100,
max_depth = 7,
eta = 0.1,
gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1)
With all parameters set up, the next step is to create the model.
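A minimal sketch of the call, assuming the matrix and grid defined above (the name XG_Rigs is ours):

set.seed(123)
XG_Rigs <- caret::train(x = trainX_Rigs,
                        y = oil_train$Active_Rigs,  # log-transformed response
                        method = "xgbTree",
                        trControl = time_control_Rigs,
                        tuneGrid = tunegrid_xg)
XG_Rigs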
eXtreme Gradient Boosting
504 samples
21 predictor
No pre-processing
Resampling: Rolling Forecasting Origin Resampling (24 held-out with a fixed window)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results:
RMSE Rsquared MAE
0.2272818 0.2082118 0.1980332
Tuning parameter 'nrounds' was held constant at a value of 100
Tuning parameter 'max_depth' was held constant at a value of 7
Tuning parameter 'eta' was held constant at a value of 0.1
Tuning parameter 'gamma' was held constant at a value of 0
Tuning parameter 'colsample_bytree' was held constant at a value of 1
Tuning parameter 'min_child_weight' was held constant at a value of 1
Tuning parameter 'subsample' was held constant at a value of 1
Compared with random forest, the out-of-sample RMSE values are similar. Again, the RMSE values reported in the model summary are on the log scale, since we transformed the response variable; predictions must be exponentiated back to rig counts before computing errors on the original scale. The next step is to test the model against the testing data.
| Horizon (months) | RMSE (rigs) | DAR (%) | DAR_NUM |
|---|---|---|---|
| 0 | 14.15146 | 91.65010 | 0 |
| 6 | 409.38629 | 60.00000 | 4 |
| 12 | 354.44541 | 54.54545 | 7 |
| 18 | 302.52374 | 52.94118 | 10 |
| 24 | 286.91612 | 52.17391 | 13 |
In contrast to the earlier models, the testing RMSE of XGBoost does not grow steadily with the time horizon; in fact, XGBoost outperforms the mean-baseline RMSE at all testing horizons. However, in terms of DAR, the model underperforms the 61% DAR baseline at all testing horizons.
The next model to test is a Support Vector Machine (SVM), specifically an SVM with a Radial Basis Function (RBF) kernel, including all explanatory variables and again using time-series cross-validation (“evaluation on a rolling forecasting origin”) with an initial window of 200 months (~17 years) and successive validation sets of 24 months (2 years). The training control parameters were already set up for the linear model, as was the formula, so the only thing remaining before modeling is to set the hyperparameters. For practicality we use the caret package; given the complexity of the model, we create a customized grid of hyperparameters to find the best combination.
SVMgrid <- expand.grid(sigma = seq(0.005,0.02,by=0.005), C = c(100,200,300,400))
With everything set up, the next step is to fit the model.
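A minimal sketch of the call, assuming the grid defined above (the name SVM_Rigs is ours):

set.seed(123)
SVM_Rigs <- caret::train(total_formula_Rigs,
                         data = oil_train,
                         method = "svmRadial",
                         trControl = time_control_Rigs,
                         tuneGrid = SVMgrid)
SVM_Rigs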
Support Vector Machines with Radial Basis Function Kernel
504 samples
21 predictor
No pre-processing
Resampling: Rolling Forecasting Origin Resampling (24 held-out with a fixed window)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
sigma C RMSE Rsquared MAE
0.005 100 0.4698541 0.2791247 0.4250718
0.005 200 0.5096392 0.2877910 0.4561979
0.005 300 0.5202698 0.2862592 0.4645631
0.005 400 0.5259752 0.2893686 0.4684042
0.010 100 0.4680788 0.3041593 0.4226456
0.010 200 0.4845519 0.3169344 0.4358495
0.010 300 0.4886574 0.3282740 0.4383617
0.010 400 0.4913978 0.3308360 0.4406170
0.015 100 0.4409003 0.3195932 0.4003715
0.015 200 0.4458187 0.3233245 0.4043227
0.015 300 0.4479417 0.3143934 0.4065934
0.015 400 0.4481367 0.3129133 0.4069005
0.020 100 0.4128205 0.2995149 0.3768963
0.020 200 0.4151038 0.2915372 0.3791701
0.020 300 0.4151046 0.2872727 0.3791994
0.020 400 0.4144160 0.2858717 0.3785994
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.02 and C = 100.
Compared with XGBoost and random forest, the out-of-sample RMSE values are higher. As before, the RMSE values reported in the model summary are on the log scale; predictions must be exponentiated back to rig counts before computing errors on the original scale. As the plot shows, the best combination of hyperparameters is a cost of 100 and a sigma of 0.02.
The next step is to test the model against the testing data.
| Horizon (months) | RMSE (rigs) | DAR (%) | DAR_NUM |
|---|---|---|---|
| 0 | 75.45089 | 75.14911 | 0 |
| 6 | 414.63985 | 80.00000 | 5 |
| 12 | 313.23020 | 63.63636 | 8 |
| 18 | 386.20813 | 58.82353 | 11 |
| 24 | 416.46161 | 52.17391 | 13 |
Similar to the previously evaluated models, the testing RMSE of the SVM increases with the time horizon; nevertheless, the SVM outperforms the mean-baseline RMSE at all testing horizons. In terms of DAR, the model only beats the baseline up to 12 months of testing data; beyond that it destabilizes and actually underperforms the 61% DAR baseline.
A neural network can be thought of as a network of “neurons” which are organised in layers. The predictors (or inputs) form the bottom layer, and the forecasts (or outputs) form the top layer. There may also be intermediate layers containing “hidden neurons”.
The simplest networks contain no hidden layers and are equivalent to linear regressions. Consider the neural network version of a linear regression with four predictors: the coefficients attached to these predictors are called “weights”. The forecasts are obtained by a linear combination of the inputs. The weights are selected in the neural network framework using a “learning algorithm” that minimises a “cost function” such as the MSE. Of course, in this simple example, we can use linear regression, which is a much more efficient method of training the model.
Once we add an intermediate layer with hidden neurons, the neural network becomes non-linear.
This is known as a multilayer feed-forward network, where each layer of nodes receives inputs from the previous layers. The outputs of the nodes in one layer are inputs to the next layer. The inputs to each node are combined using a weighted linear combination. The result is then modified by a nonlinear function before being output.
For example, the inputs into hidden neuron \(j\) are combined linearly to give
\[z_j = b_j + \sum_{i=1}^4 w_{i,j} x_i.\] In the hidden layer, this is then modified using a nonlinear function such as a sigmoid or ReLU,
\[s(z) = \frac{1}{1+e^{-z}},\]
to give the input for the next layer. This tends to reduce the effect of extreme input values, thus making the network somewhat robust to outliers.
The parameters \(b_1, b_2, b_3\) and \(w_{1,1}, \dots, w_{4,3}\) are “learned” from the data. The values of the weights are often restricted to prevent them from becoming too large. The parameter that restricts the weights is known as the “decay parameter”, and is often set equal to 0.1.
The weights take random values to begin with, and these are then updated using the observed data. Consequently, there is an element of randomness in the predictions produced by a neural network. Therefore, the network is usually trained several times using different random starting points, and the results are averaged.
The number of hidden layers, and the number of nodes in each hidden layer, must be specified in advance.
With time series data, lagged values of the time series can be used as inputs to a neural network, just as we used lagged values in ARIMA models. We call this a neural network autoregression or NNAR model. To limit complexity, we will only consider feed-forward networks with one hidden layer.
Hyndman, R.J., & Athanasopoulos, G. (2018) Forecasting: principles and practice, 2nd edition, OTexts: Melbourne, Australia. OTexts.com/fpp2.
For the architecture of this network we will use a single, densely connected hidden layer with 60 units, and 12 lagged inputs (one year of previous monthly data points) as explanatory variables.
set.seed(123)
NNETAr_Rigs <- forecast::nnetar(y=OIL_MODIFIED_TS[1:504,"Active_Rigs"],p=12,size = 60)
NNETAr_Rigs
Series: OIL_MODIFIED_TS[1:504, "Active_Rigs"]
Model: NNAR(12,60)
Call: forecast::nnetar(y = OIL_MODIFIED_TS[1:504, "Active_Rigs"], p = 12,
size = 60)
Average of 20 networks, each of which is
a 12-60-1 network with 841 weights
options were - linear output units
sigma^2 estimated as 0.0005491
Compared with previous models, the in-sample error is the lowest so far; however, we can only confirm the validity of the model using the testing data.
| Horizon (months) | RMSE (rigs) | DAR (%) | DAR_NUM |
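A sketch of how the 24-month test forecasts can be produced and mapped back to rig counts (recall the series was log-transformed; the object names are ours):

NNAR_fc <- forecast::forecast(NNETAr_Rigs, h = 24)  # 2-year horizon
rig_preds <- exp(NNAR_fc$mean)                      # back-transform from the log scale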
|---|---|---|---|
| 0 | 32.66482 | 76.57841 | 0 |
| 6 | 53.52193 | 80.00000 | 5 |
| 12 | 169.64840 | 81.81818 | 10 |
| 18 | 143.21707 | 76.47059 | 14 |
| 24 | 130.64926 | 73.91304 | 18 |
In contrast to previously evaluated models such as the linear model and ARIMA, the testing RMSE of the NNAR does not grow steadily with the time horizon; in fact, the NNAR outperforms the mean-baseline RMSE at all testing horizons, and the values seem to stabilize in the long run. In terms of DAR, the model beats the 61% DAR baseline at all testing horizons.
The stacked model gave us a hint of the benefits of combining time-series autoregressive and structured-data approaches; therefore, the ideal model for this dataset might be a recurrent neural network (RNN), and more specifically a Long Short-Term Memory (LSTM) network.
LSTMs are quite useful in time series prediction tasks involving autocorrelation, the presence of correlation between the time series and lagged versions of itself, because of their ability to maintain state and recognize patterns over the length of the time series. The recurrent architecture enables the states to persist, or communicate between updates of the weights as each epoch progresses. Further, the LSTM cell architecture enhances the RNN by enabling long term persistence in addition to short term, which is fascinating!
Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in subsequent work. They work tremendously well on a large variety of problems, and are now widely used.
LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!
All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.
The repeating module in a standard RNN contains a single layer.
LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.
The repeating module in an LSTM contains four interacting layers.
In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied, with the copies going to different locations.
The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.
The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.
The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.
Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.
The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”
An LSTM has three of these gates, to protect and control the cell state.
The LSTM neural net will include the two most important variables proposed by the Boruta analysis performed during the EDA of this work, namely Oil_Price and Oil_Stock, as well as autoregressive terms of Active_Rigs. The neural net requires the input data as a matrix with all-numeric, scaled parameters; it also requires a third dimension for the time indexes, as well as the lag, the number of previous time points used as explanatory variables to train the network. For this specific case we decided to train/test split the dataset and chose the following parameters:
- Lag: 96 months, equivalent to 8 years of data points used as explanatory variables
- Batch size: 4, used to train the model on each epoch
- Train: 408 data points, representing the dataset up until the end of 2008
- Test: 24 months, the last two years of data
We will build the network using the Keras package, with Google's TensorFlow as the engine of the neural net.
For the initial architecture of our neural net, we will use "adam" as the optimizer instead of the classical "rmsprop", and the mean absolute error ("mae") as the loss function.
RIGS_LSTM <- select(OIL_MODIFIED,-Date,-Year,-Month)
RIGS_LSTM <- RIGS_LSTM %>% mutate_if(is.factor,as.numeric)
m2<-apply(RIGS_LSTM, 2, mean)
m2[1]
Active_Rigs
7.158164
std2<-apply(RIGS_LSTM, 2, sd)
std2[1]
Active_Rigs
0.4874875
RIGS_LSTM<-scale(RIGS_LSTM,center = m2,scale = std2)
datalags = 96
train = RIGS_LSTM[seq(312 + datalags), ]
test = RIGS_LSTM[312 + datalags + seq(24+ datalags), ]
batch.size = 4
x.train = array(data = lag(cbind(train[,1],
train[,9],
train[,10]), datalags)[-(1:datalags), ],
dim = c(nrow(train) - datalags, datalags, 3))
y.train = array(data = train[,1][-(1:datalags)], dim = c(nrow(train)-datalags, 1))
x.test = array(data = lag(cbind(test[,1],
test[,9],
test[,10]), datalags)[-(1:datalags), ],
dim = c(nrow(test) - datalags, datalags, 3))
y.test = array(data = test[,1][-(1:datalags)], dim = c(nrow(test) - datalags, 1))
set.seed(123)
LSTM <- keras_model_sequential()
LSTM %>%
layer_lstm(units = 70,
input_shape = c(datalags, 3),
batch_size = batch.size,
return_sequences = FALSE,
stateful = TRUE) %>%
layer_dropout(rate = 0.5) %>%
layer_dense(units = 1)
LSTM %>%
compile(loss = 'mae', optimizer = 'adam')
LSTM
Model
___________________________________________________________________________
Layer (type) Output Shape Param #
===========================================================================
lstm (LSTM) (4, 70) 20720
___________________________________________________________________________
dropout (Dropout) (4, 70) 0
___________________________________________________________________________
dense (Dense) (4, 1) 71
===========================================================================
Total params: 20,791
Trainable params: 20,791
Non-trainable params: 0
___________________________________________________________________________
# fit one epoch at a time, resetting the stateful LSTM's internal state between passes
for(i in 1:408){
LSTM %>% fit(x = x.train,
y = y.train,
batch_size = batch.size,
epochs = 1,
verbose = 0,
shuffle = FALSE)
LSTM %>% reset_states()
}
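A sketch of how test-set predictions can be generated and mapped back to rig counts (undoing the scaling and then the log transform applied earlier; the object names are ours):

pred_scaled <- LSTM %>% predict(x.test, batch_size = batch.size)  # stateful model: same batch size
rig_preds <- exp(pred_scaled * std2[1] + m2[1])                   # unscale, then invert the log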
| Horizon (months) | RMSE (rigs) | DAR (%) | DAR_NUM |
|---|---|---|---|
| 0 | 1185.8159 | 48.55305 | 0 |
| 6 | 118.6381 | 80.00000 | 5 |
| 12 | 102.6484 | 81.81818 | 10 |
| 18 | 146.9246 | 70.58824 | 13 |
| 24 | 245.7800 | 78.26087 | 19 |
For all test-data horizons, the LSTM yields better results than both the RMSE and DAR baselines; however, as the forecast moves further from the training data, the RMSE worsens and the DAR performance also declines somewhat.
Looking at the in-sample RMSE, we might guess that there is still plenty of room for improvement in the architecture of the neural net; however, this would require more extensive empirical optimization since, as many deep learning experts have stated, designing the architecture of these kinds of models is more an art than a strict science.
With all models built, the next step is to compare their performance against each other:
| Horizon | Model | RMSE (rigs) | DAR (proportion) |
|---|---|---|---|
| 0 Months | lm | 282.507486 | 0.6242545 |
| 0 Months | RF | 54.996617 | 0.8807157 |
| 0 Months | XGboost | 2.835767 | 0.9801193 |
| 0 Months | SVM_RBF | 75.450891 | 0.7514911 |
| 0 Months | ARIMA | 50.911900 | 0.7216700 |
| 0 Months | NNAR | 32.977514 | 0.7657841 |
| 0 Months | LSTM | 1185.815910 | 0.4855305 |
| 6 Months | lm | 299.833928 | 0.4000000 |
| 6 Months | RF | 553.302010 | 0.8000000 |
| 6 Months | XGboost | 408.720458 | 0.6000000 |
| 6 Months | SVM_RBF | 414.639851 | 0.8000000 |
| 6 Months | ARIMA | 198.515610 | 0.6000000 |
| 6 Months | NNAR | 48.490524 | 1.0000000 |
| 6 Months | LSTM | 118.638074 | 0.8000000 |
| 12 Months | lm | 351.530637 | 0.5454545 |
| 12 Months | RF | 530.968721 | 0.7272727 |
| 12 Months | XGboost | 354.518957 | 0.5454545 |
| 12 Months | SVM_RBF | 313.230198 | 0.6363636 |
| 12 Months | ARIMA | 182.270112 | 0.6363636 |
| 12 Months | NNAR | 144.182721 | 0.9090909 |
| 12 Months | LSTM | 102.648369 | 0.8181818 |
| 18 Months | lm | 357.296689 | 0.5882353 |
| 18 Months | RF | 458.129857 | 0.5294118 |
| 18 Months | XGboost | 302.937654 | 0.5294118 |
| 18 Months | SVM_RBF | 386.208128 | 0.5882353 |
| 18 Months | ARIMA | 174.197096 | 0.4705882 |
| 18 Months | NNAR | 121.904982 | 0.8235294 |
| 18 Months | LSTM | 146.924585 | 0.7058824 |
| 24 Months | lm | 377.525182 | 0.5217391 |
| 24 Months | RF | 403.633797 | 0.5217391 |
| 24 Months | XGboost | 289.391698 | 0.5217391 |
| 24 Months | SVM_RBF | 416.461614 | 0.5217391 |
| 24 Months | ARIMA | 195.871558 | 0.4347826 |
| 24 Months | NNAR | 113.204349 | 0.8260870 |
| 24 Months | LSTM | 245.780025 | 0.7826087 |
Looking at the table and plots in more detail, we can see that the autoregressive neural network (NNAR) provides the best results on both performance metrics, RMSE and DAR; it outperforms the baseline models as well as the other models tested in this work, although the LSTM also provided good results, close to the performance of the NNAR.
By plotting our 2-year Active Rigs predictions from the NNAR against the actual number of active rigs on a gain plot, we can obtain the Gini score, which in this case is 0.012; such a low Gini score suggests that the two curves (predictions vs. actuals) are nearly identical.
The best model obtained in this work for the prediction of future monthly rig counts is an autoregressive neural network (NNAR).
Using the selected NNAR model, it is possible to predict future rig count activity up to two years beyond the training data, with an RMSE of ~150 rigs and ~80% directional accuracy.
In terms of testing-data RMSE, the best performance is achieved by the autoregressive neural network (NNAR).
In terms of testing-data DAR, the best performance is achieved by the autoregressive neural network (NNAR).
A bigger dataset might yield better results for an LSTM.