We are interested in the way you approach a data science problem and how you present and/or support your reasons with data.
You are asked by the business to provide one day ahead probabilistic forecast (not a point forecast) for the oil prices. The requirement is to use the gamlss package in R ( https://cran.r-project.org/web/packages/gamlss/gamlss.pdf ) to forecast the probability distribution of the oil prices. The dataset available to you is the “oil” dataset within the gamlss package, where the variable “OILPRICE” is the response variable. You are expected to come up with a predictive model on “OILPRICE” and provide the forecast for one day ahead, explaining how you selected the model, the model diagnostics that you used to assess the model etc. The preferred way to report your findings is an R markdown. Alternative you can report your finding by an R script or any other tool, in a way that does not require further processing.
You can use the code below to install the gamlss package in R and extract the oil data.
To get the prediction of oil price, I am going to load gamlss
package. And follow the below steps for this assignment.
Oil price is predicted by two main approaches: The structural model and the time series model. The structural model depends on other factors like demand and supply, consumption, production, etc. We do not have those data here so we will use time series models to make the prediction.
Load data using gamlss
package and necessary libraries.
library(gamlss)
library(gamlss.add)
library(gamlss.dist)
library(ggplot2)
library(dplyr)
library(forecast)
library(lubridate)
Lets have a look on data, below shows first 6 data rows.
# view data
head(oil)
## OILPRICE CL2_log CL3_log CL4_log CL5_log CL6_log CL7_log CL8_log
## 1 4.640923 4.636475 4.641116 4.644968 4.648038 4.649761 4.650908 4.651863
## 2 4.633077 4.645352 4.649857 4.653484 4.656338 4.657858 4.658711 4.659564
## 3 4.634049 4.637831 4.642466 4.646312 4.649665 4.651672 4.653103 4.654341
## 4 4.646312 4.638315 4.642562 4.646120 4.648708 4.649952 4.650621 4.651099
## 5 4.631520 4.650526 4.654722 4.658047 4.660321 4.661267 4.661740 4.662117
## 6 4.627616 4.635893 4.640344 4.644199 4.646984 4.648421 4.649378 4.650335
## CL9_log CL10_log CL11_log CL12_log CL13_log CL14_log CL15_log BDIY_log
## 1 4.652340 4.651672 4.650621 4.648613 4.646120 4.643236 4.639765 6.850126
## 2 4.660037 4.659374 4.657952 4.655578 4.652531 4.649187 4.645256 6.850126
## 3 4.655293 4.655007 4.653865 4.651672 4.648804 4.645736 4.642081 6.879356
## 4 4.651577 4.651004 4.649570 4.647080 4.644102 4.640923 4.636960 6.882437
## 5 4.662401 4.661645 4.659942 4.656908 4.653484 4.649761 4.645544 6.896694
## 6 4.651099 4.650813 4.649570 4.646984 4.643910 4.640537 4.636475 6.913737
## SPX_log DX1_log GC1_log HO1_log USCI_log GNR_log SHCOMP_log FTSE_log
## 1 7.221624 4.386554 7.413367 1.136197 4.108412 3.917806 7.744539 8.636699
## 2 7.235309 4.379762 7.419680 1.152564 4.120986 3.942552 7.762536 8.650062
## 3 7.222756 4.387449 7.418481 1.155182 4.115127 3.923952 7.766061 8.639729
## 4 7.222252 4.383675 7.410347 1.136743 4.103965 3.925531 7.765158 8.642292
## 5 7.237620 4.382364 7.399704 1.139946 4.107096 3.941970 7.755763 8.659907
## 6 7.233556 4.382951 7.404888 1.137256 4.097672 3.936325 7.775213 8.656137
## respLAG
## 1 4.631812
## 2 4.640923
## 3 4.633077
## 4 4.634049
## 5 4.646312
## 6 4.631520
# dimension of data
dim(oil)
## [1] 1000 25
OILPRICE: response variable.
CL2_log, CL3_log, CL4_log, CL5_log, CL6_log, CL7_logCL8_log, CL9_log, CL10_log, CL11_log, CL12_log, CL13_log, CL14_log, CL15_log: numeric vectors which are the log prices of the 2 to 15 months ahead WTI oil contracts traded by NYMEX.
BDIY_log: Baltic Dry Index
SPX_log: S&P 500 index
DX1_log: US Dollar Index.
GC1_log: log price of front month gold price contract traded by NYMEX
HO1_log: log price of front month heating oil contract traded by NYMEX
USCI_log: United States Commodity Index
GNR_log: S&P Global Natural Resources Index
SHCOMP_log: Shanghai Stock Exchange Composite Index.
FTSE_log: FTSE 100 Index
respLAG: lagged version of the response variable.
Data has 1000 rows and 25 columns. We cannot see any date column in the dataset. Add date column to display the future prediction. Data looks like daily data, and there is no starting date given in the instruction so can start with any date, for example, 1st June 2016.
[Note: Assume trade is for all days in a week]
Add date column and create a subset of oil which contains OILPRICE
and Date
because to perform time series analysis data should be univariable.
data <- oil
# added date
data$Date <- as.Date(c(0:999), origin="2016-06-01")
# select required columns
data_new <- data[, c("OILPRICE", "Date")]
# view data
head(data_new)
## OILPRICE Date
## 1 4.640923 2016-06-01
## 2 4.633077 2016-06-02
## 3 4.634049 2016-06-03
## 4 4.646312 2016-06-04
## 5 4.631520 2016-06-05
## 6 4.627616 2016-06-06
Check, if any NA present in the data. Use summary() to see the descriptive statistics and null values.
summary(data_new$OILPRICE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.266 3.966 4.517 4.309 4.580 4.705
NA is not available in the data.
Plot OILPRICE
and Date
to see how the price changes over time.
ggplot(data_new) +
aes(x = Date, y = OILPRICE) +
geom_line(color = 'steelblue') +
labs(title = 'Oil Price', x = 'Time Period') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
Understanding seasonality in this data was extremely difficult since we didn’t know what the time period of our observations were. A key component to the time series objects in R is understanding the frequency
of your data. All further analysis hinges on the values you place in your frequency
parameter. Based on our previous discussion, we made the assumption that this was daily data and used both 1 and 365 as frequency values in our time series objects. With our assumption in mind, we had several approaches to determining seasonality. One method was to construct plots and to visually examine them for seasonality. One such plot is the seasonal plot created by using gg_season
. This plot allows us to overlay different years of a variable on top of each other so seasonality can be inspected:
data_new %>%
mutate(year = as.factor(year(Date))) %>%
mutate(new_date = as.Date(paste0('2016-', month(Date),'-', day(Date)))) %>%
ggplot() +
aes(x = new_date, y = data_new$OILPRICE, color = year) +
geom_line() +
labs(title = "Seasonal Analysis of Oil Price", x = '', y = 'Oil Price') +
scale_x_date(date_labels = '%b')
Seasonality is not clearly showing in the above plot. Another way to check seasonality in the data is the auto co-relation function (ACF) plot. Let’s ACF and partial autocorrelation function (PACF). ACF in layman terms describes how well the present value of the series is related to its past values. From ACF and PACF plot shows the series is Autoregressive (AR) or Moving average or both AR and MA i.e Autoregressive Integrated Moving Average (ARIMA).
data_new$OILPRICE %>%
ggtsdisplay(main="ACF and PACF", ylab="Oil Price")
The above plot shows, ACF is slowly decreasing while lags are increasing, this means data is not seasonal. ACF and PCAF plot shows there are significant lags that are present. In this case, ARIMA is the suitable choice to go with. Additionally, data is not stationary. The top plot does not look like white noise and the ACF and PACF plot both have many spikes above the critical threshold. This tells us that we need to apply diff() on the data.
data_new$OILPRICE %>%
diff() %>%
ggtsdisplay(ylab="Oil Price")
Now, data looks stationary, and the ACF and PACF has significant lag at 1, 5, 20 etc.
We will create 2 models to make the prediction, i.e ARIMA (Autoregressive Integrated Moving Average), and ETS (Exponential Smoothing). Split the data into 2 parts train and test, train contains 80% of data and test 20%. Apply model on the train test. Out of 2 which models perform better, we consider that model is our final model and apply the prediction using test data.
set.seed(101)
# each dataset was said to have 1622 rows
n <- 1000
# find 80% of row count
train_rows <- floor(1000 * 0.80)
test_rows <- n - train_rows
# split training and test sets
train <- data_new[1:train_rows,]
test <- data_new[(train_rows + 1):1000,]
# time series objects
data_new_ts <- ts(data_new$OILPRICE)
data_train_ts <- ts(train$OILPRICE)
data_test_ts <- ts(test$OILPRICE)
# create models based off training data
fa <- data_train_ts %>% Arima(order = c(1, 1, 1))
fe <- data_train_ts %>% ets()
# summary of models
summary(fa)
## Series: .
## ARIMA(1,1,1)
##
## Coefficients:
## ar1 ma1
## -0.4540 0.3302
## s.e. 0.1775 0.1869
##
## sigma^2 estimated as 0.000314: log likelihood=2089.71
## AIC=-4173.43 AICc=-4173.4 BIC=-4159.38
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set -0.0007427288 0.01768661 0.01242051 -0.01800145 0.2872314
## MASE ACF1
## Training set 0.9940263 0.001485793
summary(fe)
## ETS(A,Ad,N)
##
## Call:
## ets(y = .)
##
## Smoothing parameters:
## alpha = 0.8544
## beta = 0.0203
## phi = 0.971
##
## Initial states:
## l = 4.6456
## b = -0.005
##
## sigma: 0.0177
##
## AIC AICc BIC
## -1100.635 -1100.529 -1072.527
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set -0.0002949578 0.01763794 0.01242314 -0.007280784 0.2871792
## MASE ACF1
## Training set 0.9942371 -0.004927889
Using the automatically selected Arima function, an ARIMA(0,1,2) model is produced. This shows that the function believes oil price integrated of order 0 and has 2 significant lags (as demonstrated by ACF analysis). After attempting various combinations, including (0,1,1), (1,1,1) and (1,1,0) models, find that ARIMA(1,1,1) is the best performing among other models.
We show below our comparison of model diagnostics between ARIMA
and ets
:
# arima accuracy
prediction_arima <- ts(fa %>% forecast(h = test_rows))
forecast::accuracy(as.numeric((prediction_arima$mean)),as.numeric((data_test_ts)))
## ME RMSE MAE MPE MAPE
## Test set -0.3768124 0.4200878 0.3771382 -10.41862 10.42656
# ets accuracy
prediction_ets <- ts(fe %>% forecast(h = test_rows))
forecast::accuracy(as.numeric((prediction_ets$mean)),as.numeric((data_test_ts)))
## ME RMSE MAE MPE MAPE
## Test set -0.3950896 0.4382275 0.3953368 -10.91674 10.92276
Looking at our MAPE on the test set, we see that on average, 10 to 11% of our forecast is is incorrect. Now we’ll look at the residuals.
fa %>% residuals () %>%
ggtsdisplay()
fe %>% residuals () %>%
ggtsdisplay()
The residuals look like white noise, which we see in our ACF & PACF plot that does not have many spikes above the critical boundary. It means that our confidence interval calculations can be reliable.
Additionally, plot the results of our model run below:
ARIMA()
forecast:
# evaluate accuracy of arima
data_train_ts %>%
autoplot(series = "historical") +
autolayer(fa %>% forecast(h = test_rows)) +
labs(title = "ARIMA - Actuals vs Predictions", y = "Oil Price in log")
ets()
forecast:
# evaluate accuracy of ets
data_train_ts %>%
autoplot(series = "historical") +
autolayer(fe %>% forecast(h = test_rows)) +
labs(title = "ETS - Actuals vs Predictions", y = "Oil Price in log")
Both ARIMA and ETS perform equally but ARIMA performs a little better than ETS. So use ARIMA model as the optimal model and fit the model with entire dataset, then apply the prediction.
The forecasts for oil prices over the next 10 days are:
# fit the entire data
final_fit <- data_new_ts %>% Arima(order = c(1, 1, 1))
# prediction for next 10 days
predictions <- ts(final_fit %>% forecast(h = 10))
# create data frame
df <- as.data.frame(predictions$mean)
# add date
df$Date <- as.Date(c(1:10), origin = "2019-02-25")
# rename column name
names(df) <- c('OILPRICE', "Date")
# move last column to first
df <- df %>% select("Date", everything())
# view data
df
## Date OILPRICE
## 1 2019-02-26 3.609921
## 2 2019-02-27 3.608357
## 3 2019-02-28 3.608878
## 4 2019-03-01 3.608705
## 5 2019-03-02 3.608762
## 6 2019-03-03 3.608743
## 7 2019-03-04 3.608750
## 8 2019-03-05 3.608747
## 9 2019-03-06 3.608748
## 10 2019-03-07 3.608748
The above analysis found that oil is hard to predict, due to the absence of seasonality, trend, cyclicity makes it difficult to forecast accurately. There are other models such as SLT, Neural nets, non-linear time series may show better result. Nevertheless and due to the complicated cost dynamics of oil, even the best models can have very large errors in predicting the real price of oil in the future, particularly when it is forecasting for longer stretches of time.