The majority of the data used for this analysis was downloaded directly from the EIA webpage; the links can be found below.
Descriptions of the variables, their units of measurement, and the sources of the data can be found in the tables below:
Codebook: Variable Descriptions
| Variable Name | Description |
|---|---|
| Date | Date of the observation |
| Month | Month of the observation |
| Year | Year of the observation |
| Active_Rigs | Number of active drilling rigs (onshore and offshore) recorded as working on that date |
| US_Population | US population estimate for the date of the observation |
| President_Party | Political party of the sitting US president on the date of the observation |
| Senate | Political party holding the majority of the US Senate on the date of the observation |
| House | Political party holding the majority of the US House of Representatives on the date of the observation |
| life_exp | US average life expectancy on the date of the observation |
| GDP | Gross Domestic Product of the US economy on the date of the observation |
| Oil_Production | Total US oil production on the date of the observation |
| Oil_Stock | US inventory of oil in stock on the date of the observation |
| Oil_Price | US average domestic first purchase price, not adjusted for inflation |
| HDD | Monthly US heating degree days: the number of degrees by which the daily average temperature falls below 65 °F |
| CDD | Monthly US cooling degree days: the number of degrees by which the daily average temperature rises above 65 °F |
| Energy_Consumption | The amount of energy consumed by the US market on the date of the observation |
| Net_imports_Energy | The net amount of energy imported by the US market on the date of the observation |
| Oil_Reserves | US proven oil reserves on the date of the observation |
| Active_Conflicts | Number of active military conflicts worldwide on the date of the observation, including civil wars |
| Drilled_Wells | Number of wells drilled in the US on the date of the observation |
Codebook: Variable Units of Measurement
| Variable Name | Units of measurement |
|---|---|
| Date | Date, format Year-Month-Day |
| Month | Integer, 1–12 |
| Year | Integer, 4 digits |
| Active_Rigs | Integer; each unit represents one drilling rig |
| US_Population | Numerical; millions of people |
| President_Party | Categorical; either Democrat or Republican |
| Senate | Categorical; either Democrat or Republican |
| House | Categorical; either Democrat or Republican |
| life_exp | Numerical; life expectancy in years |
| GDP | Numerical; GDP in trillions of dollars (10^12 USD) |
| Oil_Production | Numerical; oil production in thousand barrels per day |
| Oil_Stock | Numerical; US oil inventory in million barrels |
| Oil_Price | Numerical; US dollars |
| HDD | Numerical; monthly US HDD in degree-days (°F) |
| CDD | Numerical; monthly US CDD in degree-days (°F) |
| Energy_Consumption | Numerical; energy consumed in quadrillion Btu (10^15 Btu) |
| Net_imports_Energy | Numerical; net energy imported in quadrillion Btu (10^15 Btu) |
| Oil_Reserves | Numerical; oil reserves in million barrels |
| Active_Conflicts | Numerical; each unit represents one active military conflict |
| Drilled_Wells | Numerical; each unit represents one drilled well |
Codebook: Variable Data Sources
| Variable Name | Source of data |
|---|---|
| Date | Included with other measurements |
| Month | Included with other measurements |
| Year | Included with other measurements |
| Active_Rigs | Data Link |
| US_Population | Data Link |
| President_Party | Data Link |
| Senate | Data Link |
| House | Data Link |
| life_exp | Data Link |
| GDP | Data Link |
| Oil_Production | Data Link |
| Oil_Stock | Data Link |
| Oil_Price | Data Link |
| HDD | Data Link |
| CDD | Data Link |
| Energy_Consumption | Data Link |
| Net_imports_Energy | Data Link |
| Oil_Reserves | Data Link |
| Active_Conflicts | Data Link |
| Active_Conflicts 2 | Data Link |
| Drilled_Wells | Data Link |
The main purpose of this work is to create models able to predict rig activity (Active_Rigs) in the US and the behavior of oil prices (Oil_Price), based on various explanatory variables.
The exploratory data analysis (EDA) performed on this dataset, as well as various statistical tests, can be found at the following Link.
The working dataset is composed of 23 variables and 528 observations; there are no NAs, since the data was already wrangled in the previous part of this work (Link).
All the code related to model generation can be found at the following link: GIT.
'data.frame': 528 obs. of 23 variables:
$ Date : Date, format: "1974-01-01" "1974-02-01" ...
$ Month : num 1 2 3 4 5 6 7 8 9 10 ...
$ Year : num 1974 1974 1974 1974 1974 ...
$ Active_Rigs : num 1372 1355 1367 1381 1413 ...
$ US_Population : num 214 214 214 214 214 ...
$ President_Party : Factor w/ 2 levels "Democrat","Republican": 2 2 2 2 2 2 2 2 2 2 ...
$ Senate : Factor w/ 2 levels "Democrat","Republican": 1 1 1 1 1 1 1 1 1 1 ...
$ House : Factor w/ 2 levels "Democrat","Republican": 1 1 1 1 1 1 1 1 1 1 ...
$ life_exp : num 72 72 72 72 72 ...
$ GDP : num 1.55 1.55 1.55 1.55 1.55 ...
$ Oil_Production : num 10633 10870 10707 10650 10600 ...
$ Oil_Stock : num 233 240 244 256 269 268 268 264 266 269 ...
$ Oil_Price : num 6.95 6.87 6.77 6.77 6.87 6.85 6.8 6.71 6.7 6.97 ...
$ HDD : int 911 823 618 351 200 51 8 18 114 350 ...
$ CDD : int 16 5 22 23 88 149 296 242 97 30 ...
$ Energy_Consumption: num 6.91 6.32 6.37 5.86 5.87 ...
$ Net_imports_Energy: num 0.962 0.791 0.932 0.97 1.063 ...
$ Oil_Reserves : num 32250 32250 32250 32250 32250 ...
$ Active_Conflicts : num 56 56 56 56 56 56 56 56 56 56 ...
$ Eco_Rec : Factor w/ 2 levels "Normal","Recession": 2 2 2 2 2 2 2 2 2 2 ...
$ Season : Factor w/ 4 levels "Autumn","Spring",..: 4 4 2 2 2 3 3 3 1 1 ...
$ Gov_ctrl_type : Factor w/ 2 levels "shared","Total": 1 1 1 1 1 1 1 1 1 1 ...
$ Gov_ctrl_party : Factor w/ 3 levels "Democrat","Republican",..: 3 3 3 3 3 3 3 3 3 3 ...
The previous exploratory data analysis showed that Oil_Price follows a heavily right-skewed distribution.
Due to this severe skewness, we will search for an adequate transformation using the BoxCox.lambda function from the forecast package, which returns the best lambda value and therefore suggests the most suitable transformation.
forecast::BoxCox.lambda(OIL_MODIFIED$Oil_Price)
[1] -0.0996917
The obtained lambda value is almost zero, which suggests a logarithmic transformation.
The transformation centered the distribution and reduced the right-side skewness.
Due to the intrinsic association of our response variable with time, the dataset will be divided into training and test sets following time series best practice, in which the order of observations is respected.
The training set will consist of observations from 1974 up to December 2015, and the test set of two years of monthly observations up to December 2017. The training set will be used to train our different models, and the test sets to evaluate their quality.
OIL_MODIFIED <-OIL_MODIFIED %>% mutate(Oil_Price = log(Oil_Price))
oil_train <- OIL_MODIFIED[1:504,]
oil_test_6M <- OIL_MODIFIED[505:510,]
oil_test_1Y <- OIL_MODIFIED[505:516,]
oil_test_1.5Y <- OIL_MODIFIED[505:522,]
oil_test_2Y<- OIL_MODIFIED[505:528,]
dim(oil_train)
[1] 504 23
dim(oil_test_2Y)
[1] 24 23
Another important aspect of machine learning models is model tuning; for this reason a partition of the training dataset will be used as a validation set to tune hyperparameters and find the most optimized model, one that generalizes to data outside the training set without overfitting. The validation sets will be created following the concept of time series cross-validation, known in econometrics as “evaluation on a rolling forecasting origin”.
Hyndman, R.J., & Athanasopoulos, G. (2018) Forecasting: principles and practice, 2nd edition, OTexts: Melbourne, Australia. OTexts.com/fpp2.
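As an illustration of the slicing, the rolling origins can be generated with caret's createTimeSlices; this sketch uses the same 48-month window and 24-month horizon that we configure in trainControl later on.
library(caret)
# Rolling-origin resampling: each slice trains on a fixed 48-month window and
# validates on the following 24 months, then the origin rolls forward one month
slices <- createTimeSlices(1:nrow(oil_train),
                           initialWindow = 48,
                           horizon = 24,
                           fixedWindow = TRUE)
length(slices$train)  # number of rolling origins
slices$train[[1]]     # months 1..48, validated on months 49..72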
To evaluate the quality of the models, we selected two performance measurements: the root mean squared error (RMSE), reported in US dollars, and the directional accuracy rate (DAR), the percentage of months for which the predicted direction of the price change matches the observed direction.
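Both metrics can be computed with short helper functions; the sketch below is a hypothetical implementation consistent with how the results are reported later (predictions are back-transformed from the log scale to US dollars before computing RMSE).
# Hypothetical helpers: pred_log and actual_log are on the log-price scale
rmse_usd <- function(pred_log, actual_log) {
  sqrt(mean((exp(pred_log) - exp(actual_log))^2))  # RMSE in US dollars
}
dar <- function(pred_log, actual_log) {
  # Share of months where the predicted direction of the monthly price change
  # matches the observed direction, in percent
  mean(sign(diff(pred_log)) == sign(diff(actual_log))) * 100
}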
The previously performed EDA and statistical analysis showed that Oil_Price exhibits a time dependency in which data points are correlated with preceding and succeeding points. A Chi-square goodness-of-fit test indicated that monthly values are dependent on each other, and this can also be confirmed with an ACF plot of the month-to-month changes in the monthly oil price data.
The plot shows strong autocorrelation at lags 1 and 2. To confirm this suspicion we will use a Ljung-Box test with an alpha of 5% (0.05).
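The test call producing the output below is of the following form (a lag of 1 matches the single degree of freedom reported):
Box.test(diff(OIL_MODIFIED_TS$Oil_Price), lag = 1, type = "Ljung-Box")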
Box-Ljung test
data: diff(OIL_MODIFIED_TS$Oil_Price)
X-squared = 116.52, df = 1, p-value < 2.2e-16
The low p-value obtained suggests that the Oil_Price time series is not “white noise” and that information can be extracted from it, so a time series model may be suitable; the first chosen model will therefore be a non-seasonal ARIMA model.
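A sketch of the call producing the fit below (ARIMA_PRICE is a hypothetical object name); auto.arima searches over (p, d, q) orders and here settles on ARIMA(2,1,1).
ARIMA_PRICE <- forecast::auto.arima(OIL_MODIFIED_TS[1:504, "Oil_Price"],
                                    seasonal = FALSE)
summary(ARIMA_PRICE)
forecast::checkresiduals(ARIMA_PRICE)  # residual Ljung-Box test shown further below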
Series: OIL_MODIFIED_TS[1:504, "Oil_Price"]
ARIMA(2,1,1)
Coefficients:
ar1 ar2 ma1
1.1971 -0.4270 -0.6797
s.e. 0.1510 0.0676 0.1590
sigma^2 estimated as 0.004606: log likelihood=640.77
AIC=-1273.53 AICc=-1273.45 BIC=-1256.65
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 0.002166705 0.06760035 0.0479155 0.068042 1.544229 0.9194907
ACF1
Training set 0.001812794
Ljung-Box test
data: Residuals from ARIMA(2,1,1)
Q* = 5.0384, df = 7, p-value = 0.6553
Model df: 3. Total lags used: 10
The residuals on the training data appear to be normally distributed around 0, and the p-value obtained from the Ljung-Box test on the residuals confirms that they are “white noise”; the next step is to evaluate the model on the test data. Before computing any results, however, we should first calculate baselines for comparison. As the RMSE baseline we will use a model that predicts the mean oil price of the training data for every observation, and as the DAR baseline a model that assumes all differences between successive oil prices are positive (i.e., the oil price is always growing). Both baselines are calculated over the full test data (2 years).
(RMSE_BASELINE<-sqrt(mean((mean(exp(oil_train$Oil_Price))-exp(oil_test_2Y$Oil_Price))^2)))
[1] 12.65433
(DAR_BASELINE<-mean(diff(oil_train$Oil_Price)>0)*100)
[1] 56.06362
With the baselines in place, it is time to evaluate the performance of the ARIMA model.
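A sketch of the 2-year evaluation, reusing the hypothetical ARIMA_PRICE object from above; the shorter horizons in the table are computed analogously by changing h.
fc_2Y <- forecast::forecast(ARIMA_PRICE, h = 24)
# Back-transform from log prices to US dollars before scoring
(RMSE_ARIMA_2Y <- sqrt(mean((exp(fc_2Y$mean) - exp(oil_test_2Y$Oil_Price))^2)))
(DAR_ARIMA_2Y <- mean(sign(diff(as.numeric(fc_2Y$mean))) ==
                        sign(diff(oil_test_2Y$Oil_Price))) * 100)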
| Horizon | RMSE | DAR | DAR_NUM |
|---|---|---|---|
| 0 | 2.898197 | 59.84095 | 0 |
| 6 | 7.319842 | 100.00000 | 6 |
| 12 | 9.417773 | 72.72727 | 9 |
| 18 | 11.534491 | 58.82353 | 11 |
| 24 | 13.599467 | 69.56522 | 17 |
As the model forecasts further away from the training data, the RMSE increases; on the 24-month test data, the mean baseline yields lower (better) RMSE values. A key difference, however, is that ARIMA provides DAR values, which are incredibly valuable in financial markets, especially for commodities like oil that are traded through many different financial instruments and markets; on this metric the ARIMA model outperforms the 56% growing-trend baseline. It is also worth highlighting that a mean model cannot produce a DAR at all, since its predictions are constant.
The next model to test is a linear regression including all explanatory variables and using time series cross-validation, or “evaluation on a rolling forecasting origin”, with an initial window of 48 months (4 years) for model training and successive validation sets of 24 months (2 years).
total_formula_price<-as.formula(Oil_Price ~ Month +
Year +
Active_Rigs +
US_Population +
President_Party +
Senate +
House +
life_exp +
GDP +
Oil_Production +
Oil_Stock +
HDD +
CDD +
Energy_Consumption +
Net_imports_Energy +
Oil_Reserves +
Active_Conflicts +
Eco_Rec +
Season +
Gov_ctrl_type +
Gov_ctrl_party)
set.seed(123)
time_control <- trainControl(method = "timeslice",
initialWindow = 48,
horizon = 24,
fixedWindow = TRUE,
verboseIter = TRUE,
allowParallel = TRUE)
With the setup ready, the next step is to create the linear model. For practicality we are going to use the caret library, which lets us use timeslice cross-validation. Every model in caret expects hyperparameter tuning; however, an “lm” model does not have any hyperparameters to tune, so we will create a custom tuning grid with only the intercept as the hyperparameter and set this parameter to FALSE.
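A sketch of the resulting call, assuming the formula and control objects defined above (lm_price is a hypothetical object name):
set.seed(123)
lm_price <- caret::train(total_formula_price,
                         data = oil_train,
                         method = "lm",
                         trControl = time_control,
                         tuneGrid = data.frame(intercept = FALSE))
lm_price
summary(lm_price$finalModel)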
Linear Regression
504 samples
21 predictor
No pre-processing
Resampling: Rolling Forecasting Origin Resampling (24 held-out with a fixed window)
Summary of sample sizes: 48, 48, 48, 48, 48, 48, ...
Resampling results:
RMSE Rsquared MAE
4631.625 0.2844531 4571.577
Tuning parameter 'intercept' was held constant at a value of FALSE
Call:
lm(formula = .outcome ~ 0 + ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-0.71120 -0.10446 0.01978 0.12215 0.46132
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
Month 2.493e-04 3.527e-03 0.071 0.94367
Year 2.037e-01 7.750e-02 2.628 0.00886 **
Active_Rigs 4.276e-04 2.444e-05 17.496 < 2e-16 ***
US_Population -9.492e-02 2.541e-02 -3.736 0.00021 ***
President_PartyRepublican -3.925e+02 1.456e+02 -2.696 0.00726 **
SenateRepublican 1.096e-02 5.029e-02 0.218 0.82761
HouseRepublican -3.926e+02 1.456e+02 -2.697 0.00725 **
life_exp 1.960e-01 6.741e-02 2.907 0.00382 **
GDP 2.394e-01 4.388e-02 5.456 7.79e-08 ***
Oil_Production -1.597e-04 2.759e-05 -5.789 1.28e-08 ***
Oil_Stock -5.585e-04 4.057e-04 -1.377 0.16928
HDD -1.851e-04 1.326e-04 -1.396 0.16321
CDD -2.065e-04 3.208e-04 -0.644 0.52001
Energy_Consumption 7.054e-02 5.052e-02 1.396 0.16329
Net_imports_Energy -4.868e-01 7.273e-02 -6.693 6.10e-11 ***
Oil_Reserves -1.376e-05 1.037e-05 -1.327 0.18517
Active_Conflicts -2.630e-03 6.242e-04 -4.213 3.00e-05 ***
Eco_RecRecession -1.317e-03 3.208e-02 -0.041 0.96727
SeasonSpring 2.056e-02 3.371e-02 0.610 0.54227
SeasonSummer 1.168e-02 4.955e-02 0.236 0.81375
SeasonWinter -5.428e-02 4.494e-02 -1.208 0.22778
Gov_ctrl_typeTotal -3.926e+02 1.456e+02 -2.696 0.00725 **
Gov_ctrl_partyRepublican 7.852e+02 2.912e+02 2.697 0.00724 **
Gov_ctrl_partyshared NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1976 on 481 degrees of freedom
Multiple R-squared: 0.9965, Adjusted R-squared: 0.9964
F-statistic: 5987 on 23 and 481 DF, p-value: < 2.2e-16
The model has a high coefficient of determination on the training set; however, the out-of-sample (validation) RMSE shows large errors. The next step is to evaluate the model on our testing data.
| Horizon | RMSE | DAR | DAR_NUM |
|---|---|---|---|
| 0 | 7.41275 | 49.70179 | 0 |
| 6 | 13.49372 | 80.00000 | 5 |
| 12 | 11.58670 | 63.63636 | 8 |
| 18 | 16.37096 | 64.70588 | 12 |
| 24 | 19.53783 | 56.52174 | 14 |
Similar to ARIMA, the linear model's RMSE increases as it forecasts further from the training data; on the 24-month test data the mean baseline yields lower (better) RMSE values, and for the linear model this also applies at shorter horizons such as 1 and 1.5 years. The DAR values produced by the linear model outperform the 56% growing-trend baseline at horizons below one year, but at 24 months the model barely matches the baseline DAR.
The next model to test is a random forest including all explanatory variables and using time series cross-validation (“evaluation on a rolling forecasting origin”) with an initial window of 48 months (4 years) for model training and successive validation sets of 24 months (2 years). The training control parameters and the formula were already set up during the creation of the linear model, so the only thing remaining before modeling is the hyperparameters. For practicality we are going to use the caret package and ask it to try 5 hyperparameter combinations; for random forest the only sensible parameter to vary is mtry, the number of randomly selected variables considered for each tree.
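A sketch of the call; the splitrule and min.node.size columns in the output below indicate the “ranger” implementation, and tuneLength = 5 asks caret to try five mtry values (rf_price is a hypothetical object name).
set.seed(123)
rf_price <- caret::train(total_formula_price,
                         data = oil_train,
                         method = "ranger",
                         trControl = time_control,
                         tuneLength = 5)
rf_price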
Random Forest
504 samples
21 predictor
No pre-processing
Resampling: Rolling Forecasting Origin Resampling (24 held-out with a fixed window)
Summary of sample sizes: 48, 48, 48, 48, 48, 48, ...
Resampling results across tuning parameters:
mtry splitrule RMSE Rsquared MAE
2 variance 0.3203021 0.2091753 0.2729190
2 extratrees 0.3214654 0.2005519 0.2745532
7 variance 0.3098552 0.1970000 0.2614745
7 extratrees 0.3099570 0.1938968 0.2614333
13 variance 0.3112874 0.2077682 0.2637568
13 extratrees 0.3111779 0.1895608 0.2625623
18 variance 0.3128114 0.2081044 0.2656446
18 extratrees 0.3125834 0.1854089 0.2640673
24 variance 0.3158450 0.2028856 0.2689051
24 extratrees 0.3137600 0.1858963 0.2655273
Tuning parameter 'min.node.size' was held constant at a value of 5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 7, splitrule =
variance and min.node.size = 5.
Compared with the linear model, the out-of-sample RMSE values are drastically lower. It is important to highlight that the RMSE values reported by the model summary are on the log scale, since we transformed the response variable; to obtain RMSE values in US dollars we must exponentiate the predictions first. As the model plot shows, tuning selected mtry = 7 randomly sampled predictors, which yields the best prediction results. The next step is to test the model against the testing data.
| Horizon | RMSE | DAR | DAR_NUM |
|---|---|---|---|
| 0 | 2.330145 | 81.90855 | 0 |
| 6 | 11.799760 | 60.00000 | 4 |
| 12 | 8.980799 | 63.63636 | 8 |
| 18 | 7.656877 | 58.82353 | 11 |
| 24 | 7.530799 | 47.82609 | 11 |
As opposed to the previously evaluated models, the test RMSE of random forest does not increase with the time horizon; in fact, random forest outperforms the mean baseline RMSE at all testing horizons. In terms of DAR, however, the model only performs better than the baseline up to 18 months of testing data, after which it destabilizes and actually underperforms the 56% DAR baseline.
The next model to test is an extreme gradient boosting (XGBoost) model including all explanatory variables and using time series cross-validation (“evaluation on a rolling forecasting origin”) with an initial window of 48 months (4 years) for model training and successive validation sets of 24 months (2 years). The training control parameters were already set up during the creation of the linear model; however, XGBoost requires the explanatory variables as a matrix and a separate input for the response variable, so we will not use the formula notation employed by the previous models.
For practicality we are going to use the caret package to build the model, and due to the complexity of the model we will create a customized grid of hyperparameters to find the best combination. For this specific case the best combination of hyperparameters found was:
XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data. It is an implementation of gradient boosted decision trees designed for speed and performance (ref).
For more information on the algorithm, the following video provides a great description of its capabilities.
trainX<- data.matrix(select(oil_train,-Oil_Price,-Date))
set.seed(123)
tunegrid_xg_default <- expand.grid(
nrounds = 100,
max_depth = 6,
eta = 0.3,
gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1)
With all parameters set up, the next step is to create the model.
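A sketch of the call, assuming the predictor matrix and the grid defined above (xg_price is a hypothetical object name):
set.seed(123)
xg_price <- caret::train(x = trainX,
                         y = oil_train$Oil_Price,
                         method = "xgbTree",
                         trControl = time_control,
                         tuneGrid = tunegrid_xg_default)
xg_price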
eXtreme Gradient Boosting
504 samples
21 predictor
No pre-processing
Resampling: Rolling Forecasting Origin Resampling (24 held-out with a fixed window)
Summary of sample sizes: 48, 48, 48, 48, 48, 48, ...
Resampling results:
RMSE Rsquared MAE
0.3275638 0.1714746 0.2777964
Tuning parameter 'nrounds' was held constant at a value of 100
Tuning parameter 'colsample_bytree' was held constant at a value of 1
Tuning parameter 'min_child_weight' was held constant at a value of 1
Tuning parameter 'subsample' was held constant at a value of 1
Compared with random forest, the out-of-sample RMSE values are similar. Again, the RMSE values reported by the model summary are on the log scale, since we transformed the response variable; to obtain RMSE values in US dollars we must exponentiate the predictions first. The next step is to test the model against the testing data.
| Horizon | RMSE | DAR | DAR_NUM |
|---|---|---|---|
| 0 | 0.0647931 | 97.81312 | 0 |
| 6 | 7.7233566 | 60.00000 | 4 |
| 12 | 6.4759711 | 63.63636 | 8 |
| 18 | 6.3246999 | 58.82353 | 11 |
| 24 | 6.4187649 | 52.17391 | 13 |
As opposed to previously evaluated models such as the linear model and ARIMA, the test RMSE of XGBoost does not increase with the time horizon; in fact, XGBoost outperforms the mean baseline RMSE at all testing horizons, and the values seem to stabilize in the long run. XGBoost also provides lower RMSE values than random forest, which was itself already an improvement over “lm” and ARIMA. In terms of DAR, however, the model only performs better than the baseline up to 18 months of testing data, after which it destabilizes and actually underperforms the 56% DAR baseline.
The next model to test is a support vector machine (SVM), specifically an SVM with a radial basis function (RBF) kernel, including all explanatory variables and using time series cross-validation (“evaluation on a rolling forecasting origin”) with an initial window of 48 months (4 years) for model training and successive validation sets of 24 months (2 years). The training control parameters and the formula were already set up during the creation of the linear model, so the only thing remaining before modeling is the hyperparameters. For practicality we are going to use the caret package; due to the complexity of the model we will create a customized grid of hyperparameters to find the best combination.
SVMgrid <- expand.grid(sigma = seq(0.005,0.02,by=0.005), C = c(100,200,300,400))
With everything set up, the next step is to fit the model.
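A sketch of the call, assuming the grid defined above (svm_price is a hypothetical object name):
set.seed(123)
svm_price <- caret::train(total_formula_price,
                          data = oil_train,
                          method = "svmRadial",
                          trControl = time_control,
                          tuneGrid = SVMgrid)
svm_price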
Support Vector Machines with Radial Basis Function Kernel
504 samples
21 predictor
No pre-processing
Resampling: Rolling Forecasting Origin Resampling (24 held-out with a fixed window)
Summary of sample sizes: 48, 48, 48, 48, 48, 48, ...
Resampling results across tuning parameters:
sigma C RMSE Rsquared MAE
0.005 100 0.4329691 0.1823023 0.3911349
0.005 200 0.4297423 0.1718971 0.3882421
0.005 300 0.4285934 0.1686252 0.3872183
0.005 400 0.4285501 0.1685428 0.3871764
0.010 100 0.4201683 0.2389799 0.3800965
0.010 200 0.4201683 0.2389799 0.3800965
0.010 300 0.4201683 0.2389799 0.3800965
0.010 400 0.4201683 0.2389799 0.3800965
0.015 100 0.4171104 0.2438520 0.3775426
0.015 200 0.4171104 0.2438520 0.3775426
0.015 300 0.4171104 0.2438520 0.3775426
0.015 400 0.4171104 0.2438520 0.3775426
0.020 100 0.4169478 0.2037181 0.3775444
0.020 200 0.4169478 0.2037181 0.3775444
0.020 300 0.4169478 0.2037181 0.3775444
0.020 400 0.4169478 0.2037181 0.3775444
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.02 and C = 100.
Compared with XGBoost and random forest, the out-of-sample RMSE values are higher. Again, the RMSE values reported by the model summary are on the log scale, since we transformed the response variable; to obtain RMSE values in US dollars we must exponentiate the predictions first. As the plot shows, the best combination of hyperparameters is a cost C of 100 and a sigma of 0.02.
The next step is to test the model against the testing data.
| Horizon | RMSE | DAR | DAR_NUM |
|---|---|---|---|
| 0 | 2.655426 | 68.78728 | 0 |
| 6 | 11.427578 | 80.00000 | 5 |
| 12 | 9.309312 | 72.72727 | 9 |
| 18 | 13.851903 | 70.58824 | 13 |
| 24 | 18.903295 | 56.52174 | 14 |
Similar to previously evaluated models such as the linear model and ARIMA, the test RMSE of the SVM increases with the time horizon; after one year the mean baseline model starts to outperform the SVM. In terms of DAR, however, the SVM outperforms the 56% growing-trend baseline at all testing horizons, and in the long term it also performs better than the previous models.
A neural network can be thought of as a network of “neurons” which are organised in layers. The predictors (or inputs) form the bottom layer, and the forecasts (or outputs) form the top layer. There may also be intermediate layers containing “hidden neurons”.
The simplest networks contain no hidden layers and are equivalent to linear regressions. Figure 11.11 shows the neural network version of a linear regression with four predictors. The coefficients attached to these predictors are called “weights”. The forecasts are obtained by a linear combination of the inputs. The weights are selected in the neural network framework using a “learning algorithm” that minimises a “cost function” such as the MSE. Of course, in this simple example, we can use linear regression which is a much more efficient method of training the model.
Once we add an intermediate layer with hidden neurons, the neural network becomes non-linear.
This is known as a multilayer feed-forward network, where each layer of nodes receives inputs from the previous layers. The outputs of the nodes in one layer are inputs to the next layer. The inputs to each node are combined using a weighted linear combination. The result is then modified by a nonlinear function before being output
For example, the inputs into hidden neuron \(j\) in the previous figure are combined linearly to give
\[z_j = b_j + \sum_{i=1}^4 w_{i,j} x_i.\] In the hidden layer, this is then modified using a nonlinear function such as a sigmoid or ReLU,
\[s(z) = \frac{1}{1+e^{-z}},\]
to give the input for the next layer. This tends to reduce the effect of extreme input values, thus making the network somewhat robust to outliers.
The parameters \(b_1, b_2, b_3\) and \(w_{1,1},\dots,w_{4,3}\) are “learned” from the data. The values of the weights are often restricted to prevent them from becoming too large. The parameter that restricts the weights is known as the “decay parameter”, and is often set to be equal to 0.1.
The weights take random values to begin with, and these are then updated using the observed data. Consequently, there is an element of randomness in the predictions produced by a neural network. Therefore, the network is usually trained several times using different random starting points, and the results are averaged.
The number of hidden layers, and the number of nodes in each hidden layer, must be specified in advance
Hyndman, R.J., & Athanasopoulos, G. (2018) Forecasting: principles and practice, 2nd edition, OTexts: Melbourne, Australia. OTexts.com/fpp2.
The next model to test is a simple feed-forward neural network, an MLP for regression, including all explanatory variables but not using time series cross-validation, since that would require a more advanced architecture, namely a recurrent neural network. The network requires the input data as a matrix with all parameters numeric and scaled. We will build it using the keras package, with Google's TensorFlow as the backend.
For the initial architecture of our neural net we will use a single densely connected hidden layer of 128 ReLU units with L2 regularization, followed by a 20% dropout layer and a single output unit (see the code below). As the optimizer we will use “adam” instead of the classical “rmsprop”, with mean squared error (“mse”) as the loss function; we will train the net with a batch size of 1 sample for 20 epochs.
trainX_DL <- trainX
dimnames(trainX_DL) <- NULL
# Standardize the predictors using the training-set mean and standard deviation
m <- apply(trainX_DL, 2, mean)
std <- apply(trainX_DL, 2, sd)
trainX_DL <- scale(trainX_DL, center = m, scale = std)
# Response: log oil price, as a one-column matrix
target <- data.matrix(oil_train$Oil_Price)
set.seed(123)
model_nnet_price <- keras_model_sequential()
model_nnet_price %>%
layer_flatten(input_shape = dim(trainX_DL)[[2]]) %>%
layer_dense(units = 128,activation = "relu",kernel_regularizer = regularizer_l2(0.01)) %>%
layer_dropout(rate = 0.2) %>%
layer_dense(units = 1)
model_nnet_price
Model
___________________________________________________________________________
Layer (type) Output Shape Param #
===========================================================================
flatten (Flatten) (None, 21) 0
___________________________________________________________________________
dense (Dense) (None, 128) 2816
___________________________________________________________________________
dropout (Dropout) (None, 128) 0
___________________________________________________________________________
dense_1 (Dense) (None, 1) 129
===========================================================================
Total params: 2,945
Trainable params: 2,945
Non-trainable params: 0
___________________________________________________________________________
model_nnet_price %>% compile(
optimizer = "adam",
loss = "mse",
metrics = c("mae")
)
net <- model_nnet_price %>%
fit(trainX_DL,
target,
epochs = 20,
batch_size=1
)
pred_train<-model_nnet_price %>% predict(trainX_DL) %>% exp()
oil_test_6M_DL<- select(oil_test_6M,-Oil_Price,-Date) %>% data.matrix()
dimnames(oil_test_6M_DL) <- NULL
oil_test_6M_DL<-scale(oil_test_6M_DL,center = m,scale = std)
pred6M<-model_nnet_price %>% predict(oil_test_6M_DL) %>% exp()
oil_test_1Y_DL<- select(oil_test_1Y,-Oil_Price,-Date) %>% data.matrix()
dimnames(oil_test_1Y_DL) <- NULL
oil_test_1Y_DL<-scale(oil_test_1Y_DL,center = m,scale = std)
pred1Y<-model_nnet_price %>% predict(oil_test_1Y_DL) %>% exp()
oil_test_1.5Y_DL<- select(oil_test_1.5Y,-Oil_Price,-Date) %>% data.matrix()
dimnames(oil_test_1.5Y_DL) <- NULL
oil_test_1.5Y_DL<-scale(oil_test_1.5Y_DL,center = m,scale = std)
pred1.5Y<-model_nnet_price %>% predict(oil_test_1.5Y_DL) %>% exp()
oil_test_2Y_DL<- select(oil_test_2Y,-Oil_Price,-Date) %>% data.matrix()
dimnames(oil_test_2Y_DL) <- NULL
oil_test_2Y_DL<-scale(oil_test_2Y_DL,center = m,scale = std)
pred2Y<-model_nnet_price %>% predict(oil_test_2Y_DL) %>% exp()
Compared with previous models, the in-sample errors are minimal; however, we can only confirm the validity of the model using the testing data.
| Horizon | RMSE | DAR | DAR_NUM |
|---|---|---|---|
| 0 | 9.219194 | 56.85885 | 0 |
| 6 | 3.662618 | 80.00000 | 5 |
| 12 | 6.226381 | 54.54545 | 7 |
| 18 | 8.271522 | 52.94118 | 10 |
| 24 | 9.740417 | 47.82609 | 11 |
Similar to previously evaluated models such as the linear model and ARIMA, the test RMSE of the neural network increases with the time horizon; however, the network outperforms the mean baseline RMSE at all testing horizons, and the values seem to stabilize in the long run. In terms of DAR, though, the model mostly underperforms the 56% DAR baseline.
# Time Series Neural Network Autoregression Modeling
With time series data, lagged values of the time series can be used as inputs to a neural network, just as we used lagged values in the ARIMA model. We call this a neural network autoregression, or NNAR, model. To limit complexity we will only consider feed-forward networks with one hidden layer.
The previously modeled MLP was built using all explanatory variables as inputs for a regression; in this case we will use only the oil price and its time dependency, feeding previous oil-price values as explanatory inputs to predict future behavior, as in autoregressive models, but with the architecture of a single-hidden-layer neural network. It is worth mentioning that an architecture combining the time dependency of an NNAR with the multivariate inputs of an MLP can be built; it is known as a recurrent neural network, and this concept will be tested in a later section of this work.
For the architecture of this network we will use a single densely connected hidden layer with 60 units and 12 inputs, i.e., one year of previous data points as explanatory variables.
set.seed(123)
NNETAr <- forecast::nnetar(y=OIL_MODIFIED_TS[1:504,"Oil_Price"],p=12,size = 60)
NNETAr
Series: OIL_MODIFIED_TS[1:504, "Oil_Price"]
Model: NNAR(12,60)
Call: forecast::nnetar(y = OIL_MODIFIED_TS[1:504, "Oil_Price"], p = 12,
size = 60)
Average of 20 networks, each of which is
a 12-60-1 network with 841 weights
options were - linear output units
sigma^2 estimated as 0.002466
Compared with previous models, the in-sample errors are minimal; however, we can only confirm the validity of the model using the testing data.
| Horizon | RMSE | DAR | DAR_NUM |
|---|---|---|---|
| 0 | 2.230611 | 66.59878 | 0 |
| 6 | 9.223506 | 60.00000 | 4 |
| 12 | 10.357155 | 63.63636 | 8 |
| 18 | 9.885619 | 70.58824 | 13 |
| 24 | 9.557852 | 69.56522 | 17 |
As opposed to previously evaluated models such as the linear model and ARIMA, the test RMSE of the NNAR does not increase with the time horizon; in fact, the NNAR outperforms the mean baseline RMSE at all testing horizons, and the values seem to stabilize in the long run. In terms of DAR, the model performs better than the 56% baseline at all testing horizons.
With all models built, the next step is to compare their performance against each other:
| Horizon | models | RMSE | DAR |
|---|---|---|---|
| 0 Months | lm | 7.4127496 | 0.4970179 |
| 0 Months | RF | 2.3301449 | 0.8190855 |
| 0 Months | XGboost | 0.0647931 | 0.9781312 |
| 0 Months | SVM_RBF | 2.6554262 | 0.6878728 |
| 0 Months | ARIMA | 2.8981971 | 0.5984095 |
| 0 Months | NNET | 9.2191936 | 0.5685885 |
| 0 Months | NNAR | 2.2306106 | 0.6659878 |
| 6 Months | lm | 13.4937230 | 0.8000000 |
| 6 Months | RF | 11.7997596 | 0.6000000 |
| 6 Months | XGboost | 7.7233566 | 0.6000000 |
| 6 Months | SVM_RBF | 11.4275781 | 0.8000000 |
| 6 Months | ARIMA | 7.3198420 | 1.0000000 |
| 6 Months | NNET | 3.6626182 | 0.8000000 |
| 6 Months | NNAR | 9.2235058 | 0.6000000 |
| 12 Months | lm | 11.5867010 | 0.6363636 |
| 12 Months | RF | 8.9807994 | 0.6363636 |
| 12 Months | XGboost | 6.4759711 | 0.6363636 |
| 12 Months | SVM_RBF | 9.3093122 | 0.7272727 |
| 12 Months | ARIMA | 9.4177726 | 0.7272727 |
| 12 Months | NNET | 6.2263805 | 0.5454545 |
| 12 Months | NNAR | 10.3571553 | 0.6363636 |
| 18 Months | lm | 16.3709636 | 0.6470588 |
| 18 Months | RF | 7.6568769 | 0.5882353 |
| 18 Months | XGboost | 6.3246999 | 0.5882353 |
| 18 Months | SVM_RBF | 13.8519033 | 0.7058824 |
| 18 Months | ARIMA | 11.5344907 | 0.5882353 |
| 18 Months | NNET | 8.2715220 | 0.5294118 |
| 18 Months | NNAR | 9.8856189 | 0.7058824 |
| 24 Months | lm | 19.5378270 | 0.5652174 |
| 24 Months | RF | 7.5307989 | 0.4782609 |
| 24 Months | XGboost | 6.4187649 | 0.5217391 |
| 24 Months | SVM_RBF | 18.9032948 | 0.5652174 |
| 24 Months | ARIMA | 13.5994669 | 0.6956522 |
| 24 Months | NNET | 9.7404169 | 0.4782609 |
| 24 Months | NNAR | 9.5578516 | 0.6956522 |
Looking at the table and plots in more detail, we notice that XGBoost provides the best results in terms of RMSE, although in terms of DAR, NNAR and the SVM provide the best results. To obtain the best of both worlds we can turn to the concept of stacked models.
# Machine Learning Stack Modeling
The concept of stacking is fairly simple: it means using multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent algorithms alone. Stacking, a meta-modeling technique, was introduced by Wolpert in 1992. In stacking there are two types of learners: base learners and a meta learner. Both can be ordinary machine learning algorithms such as random forests, SVMs, perceptrons, etc.; the base learners fit the original dataset, whereas the meta learner fits on the predictions of the base learners.
For the specific case of this work, we will use as meta learner a simple linear combination of the obtained models:
\[Stack = 0.35\,NNAR + 0.40\,XG + 0.10\,ARIMA + 0.05\,RF + 0.05\,NNET + 0.025\,lm + 0.025\,SVM\]
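A minimal sketch of how the blend is computed, assuming pred_NNAR, pred_XG, and the other vectors hold each base model's 24-month test predictions in US dollars (the object names are hypothetical):
# Fixed-weight linear meta learner over the base models' test predictions
pred_stack <- 0.35 * pred_NNAR + 0.40 * pred_XG + 0.10 * pred_ARIMA +
  0.05 * pred_RF + 0.05 * pred_NNET + 0.025 * pred_lm + 0.025 * pred_SVM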
| RMSE_Stack | RMSE_NNAR | RMSE_XG | DAR_Stack | DAR_NNAR | DAR_XG |
|---|---|---|---|---|---|
| 7.511742 | 9.557852 | 6.418765 | 0.6521739 | 0.6956522 | 0.5217391 |
As the theory suggests, the stacked model successfully picks the best of both worlds: it shows a low RMSE along with good performance in terms of DAR.
| Horizon | RMSE | DAR |
|---|---|---|
| 6 | 6.391856 | 0.6000000 |
| 12 | 6.820706 | 0.6363636 |
| 18 | 6.945631 | 0.7058824 |
| 24 | 7.511742 | 0.6521739 |
By plotting our 2-year Active Rigs predictions from the NNAR against the actual number of active rigs on a gain plot, we can obtain the Gini score, which in this case is 0.012; such a low Gini score suggests that the two curves (predictions vs. actuals) are nearly equal.
The stacked model gave us a hint of the benefits of combining time series autoregressive approaches with structured-data approaches; the ideal model for this dataset might therefore be a recurrent neural network (RNN), and more specifically a long short-term memory (LSTM) network.
LSTMs are quite useful in time series prediction tasks involving autocorrelation, the presence of correlation between the time series and lagged versions of itself, because of their ability to maintain state and recognize patterns over the length of the time series. The recurrent architecture enables the states to persist, or communicate between updates of the weights as each epoch progresses. Further, the LSTM cell architecture enhances the RNN by enabling long term persistence in addition to short term, which is fascinating!
Long Short-Term Memory networks, usually just called “LSTMs”, are a special kind of RNN capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in subsequent work. They work tremendously well on a large variety of problems, and are now widely used.
LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!
All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.
The repeating module in a standard RNN contains a single layer.
LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.
The repeating module in an LSTM contains four interacting layers.
In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied, with the copies going to different locations.
The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.
The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.
The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.
Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.
The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”
An LSTM has three of these gates, to protect and control the cell state.
The LSTM will include all explanatory variables except Year and Month, since these date-related variables are handled directly by the LSTM. The network requires the input data as a matrix with all parameters numeric and scaled; it also requires a third dimension for the time index, as well as a lag, the number of previous time points used as explanatory variables to train the network. For this specific case we decided to split the dataset into train and test and chose the following parameters:
- Lag: 12 months, equivalent to 1 year of data used as explanatory variables
- Batch size: 4, used to train the model on each epoch
- Train: 504 data points, representing the dataset up to the end of 2015, as in previous models
- Test: 24 months, the last two years of data
We will build the network using the keras package, with Google's TensorFlow as the backend. For the initial architecture we will use two stacked LSTM layers of 100 and 50 units, each followed by a 50% dropout layer, and a single output unit (see the code below). As the optimizer we will use “adam” instead of the classical “rmsprop”, and, matching the compile call below, the mean absolute error (“mae”) as the loss function.
Oil_LSTM <- select(OIL_MODIFIED,-Date,-Year,-Month)
Oil_LSTM <- Oil_LSTM %>% mutate_if(is.factor,as.numeric)
m2<-apply(Oil_LSTM, 2, mean)
m2[10]
Oil_Price
3.204852
std2<-apply(Oil_LSTM, 2, sd)
std2[10]
Oil_Price
0.7650267
Oil_LSTM<-scale(Oil_LSTM,center = m2,scale = std2)
# Use the previous 12 months as lagged inputs
datalags = 12
# Training rows: the first 480 + 12 observations
train = Oil_LSTM[seq(480 + datalags), ]
# Test rows: the following 24 + 12 observations (24 targets plus their lags)
test = Oil_LSTM[480 + datalags + seq(24 + datalags), ]
batch.size = 4
x.train = array(data = lag(cbind(train[,1],
train[,2],
train[,3],
train[,4],
train[,5],
train[,6],
train[,7],
train[,8],
train[,9],
train[,10],
train[,11],
train[,12],
train[,13],
train[,14],
train[,15],
train[,16],
train[,17],
train[,18],
train[,19],
train[,20]), datalags)[-(1:datalags), ],
dim = c(nrow(train) - datalags, datalags, 20))
y.train = array(data = train[,10][-(1:datalags)], dim = c(nrow(train)-datalags, 1))
x.test = array(data = lag(cbind(test[,1],
test[,2],
test[,3],
test[,4],
test[,5],
test[,6],
test[,7],
test[,8],
test[,9],
test[,10],
test[,11],
test[,12],
test[,13],
test[,14],
test[,15],
test[,16],
test[,17],
test[,18],
test[,19],
test[,20]), datalags)[-(1:datalags), ],
dim = c(nrow(test) - datalags, datalags, 20))
y.test = array(data = test[,10][-(1:datalags)], dim = c(nrow(test) - datalags, 1))
set.seed(123)
LSTM <- keras_model_sequential()
LSTM %>%
layer_lstm(units = 100,
input_shape = c(datalags, 20),
batch_size = batch.size,
return_sequences = TRUE,
stateful = TRUE) %>%
layer_dropout(rate = 0.5) %>%
layer_lstm(units = 50,
return_sequences = FALSE,
stateful = TRUE) %>%
layer_dropout(rate = 0.5) %>%
layer_dense(units = 1)
LSTM %>%
compile(loss = 'mae', optimizer = 'adam')
LSTM
Model
___________________________________________________________________________
Layer (type) Output Shape Param #
===========================================================================
lstm (LSTM) (4, 12, 100) 48400
___________________________________________________________________________
dropout (Dropout) (4, 12, 100) 0
___________________________________________________________________________
lstm_1 (LSTM) (4, 50) 30200
___________________________________________________________________________
dropout_1 (Dropout) (4, 50) 0
___________________________________________________________________________
dense (Dense) (4, 1) 51
===========================================================================
Total params: 78,651
Trainable params: 78,651
Non-trainable params: 0
___________________________________________________________________________
# Train for 480 epochs one at a time, resetting the LSTM cell states after
# each epoch so that state does not carry over between passes
for(i in 1:480){
  LSTM %>% fit(x = x.train,
               y = y.train,
               batch_size = batch.size,
               epochs = 1,
               verbose = 0,
               shuffle = FALSE)
  LSTM %>% reset_states()
}
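A sketch of how the test predictions can be scored, assuming, as in the scaling step above, that Oil_Price is the 10th column of the scaled matrix:
# Predict on the test tensor, then undo the standardization of Oil_Price
# (column 10) and the log transform to recover prices in US dollars
pred_lstm <- LSTM %>% predict(x.test, batch_size = batch.size)
pred_usd <- exp(pred_lstm * std2[10] + m2[10])
actual_usd <- exp(y.test * std2[10] + m2[10])
sqrt(mean((pred_usd - actual_usd)^2))  # 24-month test RMSE in US dollars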
| Horizon | RMSE | DAR | DAR_NUM |
|---|---|---|---|
| 0 | 40.396388 | 80.16701 | 0 |
| 6 | 4.546233 | 80.00000 | 5 |
| 12 | 5.837501 | 63.63636 | 8 |
| 18 | 7.104999 | 58.82353 | 11 |
| 24 | 8.816599 | 47.82609 | 11 |
In the short term (less than 12 months) the LSTM provides incredibly accurate results; after that, however, the performance of the model starts to decay until it can no longer beat the baselines.
Looking at the in-training RMSE, we might guess that there is still plenty of room for improvement in the architecture of the neural net; however, this would require more extensive empirical optimization, since, as many deep learning experts have stated, designing the architecture of these kinds of models is more an art than a strict science.
The best model obtained in this work for predicting future monthly oil prices is a stacked model using a simple linear meta learner over all previously generated models, predominantly weighted toward the autoregressive neural network (NNAR) and the XGBoost regression.
Using the selected stacked model it is possible to predict future oil prices up to two years beyond the training data, with an RMSE of roughly 7 US dollars and roughly 70% directional accuracy.
In terms of test RMSE, the best single-model performance is achieved by XGBoost.
In terms of test DAR, the best single-model performance is achieved by the autoregressive neural network (NNAR).
The stacked method provides better performance than the basic LSTM neural network.
A bigger dataset might yield better results for an LSTM.