Instruction

We are interested in the way you approach a data science problem and how you present and/or support your reasons with data.

You are asked by the business to provide one day ahead probabilistic forecast (not a point forecast) for the oil prices. The requirement is to use the gamlss package in R ( https://cran.r-project.org/web/packages/gamlss/gamlss.pdf ) to forecast the probability distribution of the oil prices. The dataset available to you is the “oil” dataset within the gamlss package, where the variable “OILPRICE” is the response variable. You are expected to come up with a predictive model on “OILPRICE” and provide the forecast for one day ahead, explaining how you selected the model, the model diagnostics that you used to assess the model etc. The preferred way to report your findings is an R markdown. Alternative you can report your finding by an R script or any other tool, in a way that does not require further processing.

You can use the code below to install the gamlss package in R and extract the oil data.

Solution

To get the prediction of oil price, I am going to load gamlss package. And follow the below steps for this assignment.

  1. Problem statement
  2. Explain the data
  3. Data exploration (EDA)
  4. Model processes
  5. Result and conclusion

Problem statement

Oil price is predicted by two main approaches: The structural model and the time series model. The structural model depends on other factors like demand and supply, consumption, production, etc. We do not have those data here so we will use time series models to make the prediction.

Explain the data

Load data using gamlss package and necessary libraries.

library(gamlss)
library(gamlss.add)
library(gamlss.dist)
library(ggplot2)
library(dplyr)
library(forecast)
library(lubridate)

Lets have a look on data, below shows first 6 data rows.

# view data
head(oil)
##   OILPRICE  CL2_log  CL3_log  CL4_log  CL5_log  CL6_log  CL7_log  CL8_log
## 1 4.640923 4.636475 4.641116 4.644968 4.648038 4.649761 4.650908 4.651863
## 2 4.633077 4.645352 4.649857 4.653484 4.656338 4.657858 4.658711 4.659564
## 3 4.634049 4.637831 4.642466 4.646312 4.649665 4.651672 4.653103 4.654341
## 4 4.646312 4.638315 4.642562 4.646120 4.648708 4.649952 4.650621 4.651099
## 5 4.631520 4.650526 4.654722 4.658047 4.660321 4.661267 4.661740 4.662117
## 6 4.627616 4.635893 4.640344 4.644199 4.646984 4.648421 4.649378 4.650335
##    CL9_log CL10_log CL11_log CL12_log CL13_log CL14_log CL15_log BDIY_log
## 1 4.652340 4.651672 4.650621 4.648613 4.646120 4.643236 4.639765 6.850126
## 2 4.660037 4.659374 4.657952 4.655578 4.652531 4.649187 4.645256 6.850126
## 3 4.655293 4.655007 4.653865 4.651672 4.648804 4.645736 4.642081 6.879356
## 4 4.651577 4.651004 4.649570 4.647080 4.644102 4.640923 4.636960 6.882437
## 5 4.662401 4.661645 4.659942 4.656908 4.653484 4.649761 4.645544 6.896694
## 6 4.651099 4.650813 4.649570 4.646984 4.643910 4.640537 4.636475 6.913737
##    SPX_log  DX1_log  GC1_log  HO1_log USCI_log  GNR_log SHCOMP_log FTSE_log
## 1 7.221624 4.386554 7.413367 1.136197 4.108412 3.917806   7.744539 8.636699
## 2 7.235309 4.379762 7.419680 1.152564 4.120986 3.942552   7.762536 8.650062
## 3 7.222756 4.387449 7.418481 1.155182 4.115127 3.923952   7.766061 8.639729
## 4 7.222252 4.383675 7.410347 1.136743 4.103965 3.925531   7.765158 8.642292
## 5 7.237620 4.382364 7.399704 1.139946 4.107096 3.941970   7.755763 8.659907
## 6 7.233556 4.382951 7.404888 1.137256 4.097672 3.936325   7.775213 8.656137
##    respLAG
## 1 4.631812
## 2 4.640923
## 3 4.633077
## 4 4.634049
## 5 4.646312
## 6 4.631520
# dimension of data
dim(oil)
## [1] 1000   25

OILPRICE: response variable.

CL2_log, CL3_log, CL4_log, CL5_log, CL6_log, CL7_logCL8_log, CL9_log, CL10_log, CL11_log, CL12_log, CL13_log, CL14_log, CL15_log: numeric vectors which are the log prices of the 2 to 15 months ahead WTI oil contracts traded by NYMEX.

BDIY_log: Baltic Dry Index

SPX_log: S&P 500 index

DX1_log: US Dollar Index.

GC1_log: log price of front month gold price contract traded by NYMEX

HO1_log: log price of front month heating oil contract traded by NYMEX

USCI_log: United States Commodity Index

GNR_log: S&P Global Natural Resources Index

SHCOMP_log: Shanghai Stock Exchange Composite Index.

FTSE_log: FTSE 100 Index

respLAG: lagged version of the response variable.

Exploratory data analysis

Data has 1000 rows and 25 columns. We cannot see any date column in the dataset. Add date column to display the future prediction. Data looks like daily data, and there is no starting date given in the instruction so can start with any date, for example, 1st June 2016.

[Note: Assume trade is for all days in a week]

Add date column and create a subset of oil which contains OILPRICE and Date because to perform time series analysis data should be univariable.

data <- oil

# added date
data$Date <- as.Date(c(0:999), origin="2016-06-01")

# select required columns
data_new <- data[, c("OILPRICE", "Date")]

# view data
head(data_new)
##   OILPRICE       Date
## 1 4.640923 2016-06-01
## 2 4.633077 2016-06-02
## 3 4.634049 2016-06-03
## 4 4.646312 2016-06-04
## 5 4.631520 2016-06-05
## 6 4.627616 2016-06-06

Data exploration (EDA)

Check, if any NA present in the data. Use summary() to see the descriptive statistics and null values.

summary(data_new$OILPRICE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.266   3.966   4.517   4.309   4.580   4.705

NA is not available in the data.

Plot OILPRICE and Date to see how the price changes over time.

ggplot(data_new) +
  aes(x = Date, y = OILPRICE) + 
  geom_line(color = 'steelblue') + 
  labs(title = 'Oil Price', x = 'Time Period') + 
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
          panel.grid.minor.x = element_blank(),
          axis.ticks.x = element_line(color = "grey")
  )

Time Series Analysis

Seasonality

Understanding seasonality in this data was extremely difficult since we didn’t know what the time period of our observations were. A key component to the time series objects in R is understanding the frequency of your data. All further analysis hinges on the values you place in your frequency parameter. Based on our previous discussion, we made the assumption that this was daily data and used both 1 and 365 as frequency values in our time series objects. With our assumption in mind, we had several approaches to determining seasonality. One method was to construct plots and to visually examine them for seasonality. One such plot is the seasonal plot created by using gg_season. This plot allows us to overlay different years of a variable on top of each other so seasonality can be inspected:

data_new %>%
  mutate(year = as.factor(year(Date))) %>%
  mutate(new_date = as.Date(paste0('2016-', month(Date),'-', day(Date)))) %>% 
  ggplot() + 
  aes(x = new_date, y = data_new$OILPRICE, color = year) + 
  geom_line() +
  labs(title = "Seasonal Analysis of Oil Price", x = '', y = 'Oil Price') + 
  scale_x_date(date_labels = '%b')

Seasonality is not clearly showing in the above plot. Another way to check seasonality in the data is the auto co-relation function (ACF) plot. Let’s ACF and partial autocorrelation function (PACF). ACF in layman terms describes how well the present value of the series is related to its past values. From ACF and PACF plot shows the series is Autoregressive (AR) or Moving average or both AR and MA i.e Autoregressive Integrated Moving Average (ARIMA).

data_new$OILPRICE %>%
  ggtsdisplay(main="ACF and PACF", ylab="Oil Price")

The above plot shows, ACF is slowly decreasing while lags are increasing, this means data is not seasonal. ACF and PCAF plot shows there are significant lags that are present. In this case, ARIMA is the suitable choice to go with. Additionally, data is not stationary. The top plot does not look like white noise and the ACF and PACF plot both have many spikes above the critical threshold. This tells us that we need to apply diff() on the data.

data_new$OILPRICE %>%
  diff() %>%
  ggtsdisplay(ylab="Oil Price")

Now, data looks stationary, and the ACF and PACF has significant lag at 1, 5, 20 etc.

Model processes

We will create 2 models to make the prediction, i.e ARIMA (Autoregressive Integrated Moving Average), and ETS (Exponential Smoothing). Split the data into 2 parts train and test, train contains 80% of data and test 20%. Apply model on the train test. Out of 2 which models perform better, we consider that model is our final model and apply the prediction using test data.

set.seed(101)
# each dataset was said to have 1622 rows
n <- 1000

# find 80% of row count
train_rows <- floor(1000 * 0.80)
test_rows <- n - train_rows

# split training and test sets
train <- data_new[1:train_rows,]
test <- data_new[(train_rows + 1):1000,]

# time series objects
data_new_ts <- ts(data_new$OILPRICE)
data_train_ts <- ts(train$OILPRICE)
data_test_ts <- ts(test$OILPRICE)


# create models based off training data
fa <- data_train_ts %>% Arima(order = c(1, 1, 1))
fe <- data_train_ts %>% ets()

# summary of models
summary(fa)
## Series: . 
## ARIMA(1,1,1) 
## 
## Coefficients:
##           ar1     ma1
##       -0.4540  0.3302
## s.e.   0.1775  0.1869
## 
## sigma^2 estimated as 0.000314:  log likelihood=2089.71
## AIC=-4173.43   AICc=-4173.4   BIC=-4159.38
## 
## Training set error measures:
##                         ME       RMSE        MAE         MPE      MAPE
## Training set -0.0007427288 0.01768661 0.01242051 -0.01800145 0.2872314
##                   MASE        ACF1
## Training set 0.9940263 0.001485793
summary(fe)
## ETS(A,Ad,N) 
## 
## Call:
##  ets(y = .) 
## 
##   Smoothing parameters:
##     alpha = 0.8544 
##     beta  = 0.0203 
##     phi   = 0.971 
## 
##   Initial states:
##     l = 4.6456 
##     b = -0.005 
## 
##   sigma:  0.0177
## 
##       AIC      AICc       BIC 
## -1100.635 -1100.529 -1072.527 
## 
## Training set error measures:
##                         ME       RMSE        MAE          MPE      MAPE
## Training set -0.0002949578 0.01763794 0.01242314 -0.007280784 0.2871792
##                   MASE         ACF1
## Training set 0.9942371 -0.004927889

Using the automatically selected Arima function, an ARIMA(0,1,2) model is produced. This shows that the function believes oil price integrated of order 0 and has 2 significant lags (as demonstrated by ACF analysis). After attempting various combinations, including (0,1,1), (1,1,1) and (1,1,0) models, find that ARIMA(1,1,1) is the best performing among other models.

We show below our comparison of model diagnostics between ARIMA and ets:

# arima accuracy
prediction_arima <- ts(fa %>% forecast(h = test_rows))
forecast::accuracy(as.numeric((prediction_arima$mean)),as.numeric((data_test_ts)))
##                  ME      RMSE       MAE       MPE     MAPE
## Test set -0.3768124 0.4200878 0.3771382 -10.41862 10.42656
# ets accuracy
prediction_ets <- ts(fe %>% forecast(h = test_rows))
forecast::accuracy(as.numeric((prediction_ets$mean)),as.numeric((data_test_ts)))
##                  ME      RMSE       MAE       MPE     MAPE
## Test set -0.3950896 0.4382275 0.3953368 -10.91674 10.92276

Looking at our MAPE on the test set, we see that on average, 10 to 11% of our forecast is is incorrect. Now we’ll look at the residuals.

fa %>% residuals () %>%
  ggtsdisplay()

fe %>% residuals () %>%
  ggtsdisplay()

The residuals look like white noise, which we see in our ACF & PACF plot that does not have many spikes above the critical boundary. It means that our confidence interval calculations can be reliable.

Additionally, plot the results of our model run below:

ARIMA() forecast:

# evaluate accuracy of arima
data_train_ts %>% 
  autoplot(series = "historical") + 
  autolayer(fa %>% forecast(h = test_rows))  + 
  labs(title = "ARIMA - Actuals vs Predictions", y = "Oil Price in log")

ets() forecast:

# evaluate accuracy of ets
data_train_ts %>% 
  autoplot(series = "historical") + 
  autolayer(fe %>% forecast(h = test_rows))  + 
  labs(title = "ETS - Actuals vs Predictions", y = "Oil Price in log")

Result and conclusion

Both ARIMA and ETS perform equally but ARIMA performs a little better than ETS. So use ARIMA model as the optimal model and fit the model with entire dataset, then apply the prediction.

The forecasts for oil prices over the next 10 days are:

# fit the entire data
final_fit <- data_new_ts %>% Arima(order = c(1, 1, 1))
# prediction for next 10 days
predictions <- ts(final_fit %>% forecast(h = 10))
# create data frame
df <- as.data.frame(predictions$mean)
# add date
df$Date <- as.Date(c(1:10), origin = "2019-02-25")
# rename column name
names(df) <- c('OILPRICE', "Date")
# move last column to first
df <- df %>% select("Date", everything())
# view data
df
##          Date OILPRICE
## 1  2019-02-26 3.609921
## 2  2019-02-27 3.608357
## 3  2019-02-28 3.608878
## 4  2019-03-01 3.608705
## 5  2019-03-02 3.608762
## 6  2019-03-03 3.608743
## 7  2019-03-04 3.608750
## 8  2019-03-05 3.608747
## 9  2019-03-06 3.608748
## 10 2019-03-07 3.608748

The above analysis found that oil is hard to predict, due to the absence of seasonality, trend, cyclicity makes it difficult to forecast accurately. There are other models such as SLT, Neural nets, non-linear time series may show better result. Nevertheless and due to the complicated cost dynamics of oil, even the best models can have very large errors in predicting the real price of oil in the future, particularly when it is forecasting for longer stretches of time.