This research work is aimed towards applying Time Series Analysis on Nigeria’s Crude Oil Production from the year 2005 to 2021, analyze trends and patterns and give an idea of how different forecasting methods work.
Introduction
Oil is a major source of non-renewable energy for not only Nigeria but also for rest of the world. It plays a vital role in shaping the Nigerian economy. Nigeria is Africa’s top oil producer and a part of OPEC. The Nigerian Oil Industry began to play a prominent role in Nigeria’s economy only after the Nigerian Civil War (1967-1970). Nigeria is now the 12th largest crude-oil producer worldwide.
The primary focus here, is to examine the trend and pattern of crude oil production over 17 years, to investigate stationarity, develop and fit a statistical model and forecast production for the next 21 months
Our Data
Data: The Monthly Crude Oil Production Data is available for the years 1993 to 2021. Click here
We start by loading the data file along with necessary libraries. We then split the original data into 2 parts for training the models and testing the results. For Training, we use data from Jan 2005 to Dec 2019. Based on the training data, we seek to fit a model and forecast 21 months ahead. For checking how the models perform, we’ll keep the remaining data of 21 Months as Test data. For this we use the window() command in R.
Code
# Load datadata =read.csv("Crude_Oil_Production_Nigeria_Monthly.csv")crude.oil =ts(data$Million.Barrels, start=1993, frequency =12) # Load Librarylibrary(forecast); library(ggplot2)library(tseries); library(car)library(lmtest)ts_train <-window(crude.oil, start =c(2005, 1), end =c(2019, 12), frequency =12)ts_test <-window(crude.oil, start =c(2020, 1), frequency =12)
Here we have a plot of the month-wise data series.
Say, for January, the plot are all figures for January from 2005 to 2019.
The horizontal line is the monthly-average.
These averages seem to be very close to each other.
However, there is a slight decrease in the months of May-June-July.
Code
ggseasonplot(ts_train, year.labels = T)+ggtitle("Seasonal Plot", sub="Monthly Production of Crude Oil")
Here we have a plot to visualize the seasonality aspects.
Every year gets it’s own curve.
It is very haphazard i.e. there is no common seasonal trend over the years.
The Y-axis magnitude indicates that the variation due to seasonality is not much.
In order to understand what is going on, we can breakdown/ decompose the data into 3 of it’s components: (1) Trend (2) Seasonality -&- (3) Randomness
For this we use the stl() command in R.
Code
tsdecompse_train <-stl(ts_train, s.window ="periodic") plot(tsdecompse_train, main="Seasonal Decomposition by Loess")
Data = Trend + Seasonal + Remainder
Because we add all three components to get our data, we say that the components are additive
If we multiplied them, we’d say the components are multiplicative
swindow = ‘periodic’ indicates that there is seasonality in the data
STL command in R gives us ‘error bars’. They tell us if anything is significant or not. Anything between the bars is insignificant.
Thus, we can say that the seasonal component is of not much importance.
Code
range(tsdecompse_train$time.series[,1]) # range of seasonalityrange(tsdecompse_train$time.series[,3]) # range of randomness
Range of seasonality is -0.051 to 0.038 i.e. due to changing season, the production falls by 5.1% and rises by 3.8%. This is the size of seasonal variation.
Range of randomness is -0.21 to 0.16 i.e. the randomness, the data might fall by 21% or rise by 16%. This is the unpredictable part of the data.
Let’s Visualize the Trend, Seasonality and Randomness
We know to make a series stationary, we can difference it. Now, to determine the order of difference let’s run a for loop.
Code
# for loopfor (i in1:5) { print(paste("Order of Differencing :",i,"| Value of Variance :",round(var(diff(ts_train, differences = i)),4)) ) }
[1] "Order of Differencing : 1 | Value of Variance : 0.0073"
[1] "Order of Differencing : 2 | Value of Variance : 0.017"
[1] "Order of Differencing : 3 | Value of Variance : 0.0528"
[1] "Order of Differencing : 4 | Value of Variance : 0.1785"
[1] "Order of Differencing : 5 | Value of Variance : 0.6299"
When the loops runs for the first time, we difference the series once. Then calculate variance of first differenced series.
We repeat this 5 times.
Now because the variance increases beyond d = 1 we difference the series once.
From the plot below, we can see that we achieve stationarity after first-difference.
#Naivets_train_naive =naive(ts_train, h =21)#SNaivets_train_snaive =snaive(ts_train, h =21)
Note: The dark and light blue shaded regions around the forecast line represent 80% and 95% Confidence Intervals.
Note: The dark and light blue shaded regions around the forecast line represent 80% and 95% Confidence Intervals.
Let’s look at the values of 2019 - 2020
Naive Models are very simple. They say:
Forecast_tomorrow = Observed_today
So, last observed value Dec 2019 is 1.65 million barrels
Thus, All future forecasts are the same i.e. 1.65 million barrels
SNaive or Seasonal Naive extends the Naive Model introducing seasonality.
Forecast_Jan 2020 = Observed_Jan 2019 = 1.55
Forecast_Feb 2020 = Observed_Feb 2019 = 1.58 … and so on.
The Naive Models are vital because they act as a good basis to compare rest of the models. If a model is performing poorer than the Naive Models, it is best we re-evaluate or discard it.
Here, “drift” implies that there is an ability to pick up the trend and seasonality.
We can see that that forecast is not constant. It has picked up some trend and seasonality.
The dark and light shaded region around the forecast represent 80% and 95% CI.
Simple Exponential Smoothing
This model forecasts future values based on an equation :
Code
ts_train_ses <-ses(ts_train, h=21)
What has R done?
It has given the SAME result for all months. But, Why!
This is because R did not have a “forecast_today” value to begin with.
So R assumed that forecast_today = observed_today. Thus, the equation became \(Y_{t+1}=Y_t\)
Thus, forecast_Jan2020 = Actual_Dec2020 i.e. 1.65 Million Barrels.
Solution?
I propose we use Excel to get a better Simple Exponential Smoothing results. Thus, I present you with another Article dealing with it. Do check it out to understand the process in Excel.
Holt Winter Double Exponential
Code
ts_train_HW <-holt(ts_train, h=21)
Holt Winter Triple Exponential
Under Double Exponential, seasonality was ignored. Here, we’ll capture it.
QQ plot compares the quantile of our errors with the theoretical ones. The closer the points to the black line, the better. We can see there is not much deviation.
Histogram & Density Plot shows us how the errors are distributed. It seems to be normally distributed because most of the values lie close the centre.
Thus, assumptions of our model seem to hold in reality.
datax = T keeps sample quantiles on the X Axis
tsdiag
Code
tsdiag(arma11, gof.lag =15)
Here, we get 3 plots:
(1) Plot of Standardized Residuals
We do not want any patterns in our error terms.
This is because we assume errors are 0 on average.
Here, there is no visible patterns thus, we have a good fit!
(2) ACF of Residuals
We do not want any significant values beyond lag 0.
Autocorrelated errors are not good for forecasting.
Non-Autocorrelated errors imply ‘my errors/mistakes today will not cause mistakes tomorrow’
Here, there is no significant values beyond 1st lag, we have a good fit!
(3) P-value of Ljung-Box Statistics
Under, Ljung Box Test, H0: Residuals are independent.
For this plot, R performs Ljung Box Test at different lags (X-axis) and calculates the p-values (Y-axis)
High p-values suggest that residuals are independent. Thus, we have a good fit!
Forecasting from Fitted SARIMA Model
Code
arima.fit =arima(x = ts_train, order =c(1,1,1), seasonal =c(0,0,1))
Linear Regression with SARIMA errors
The idea here is to model the series as a Linear Regression but with Errors from SARIMA model.
Code
# Creating time covariate objecttime =seq(2005,2019+11/12, by =1/12)# Modelreg.arma =auto.arima(ts_train, xreg = time, ic="aic")# Predictionreg.arma.p =forecast(reg.arma, xreg =seq(2020, 2021+9/12, 1/12))
It is quite clear from the graphs how the respective models perform. But to compare them, we need some numbers i.e. accuracy measures.
Conclusion
Accuracy Measures: RMSE –&– MAPE
RMSE is Square Root of Mean Square Error. It is analogous to standard deviation.
What does RMSE say?
If RMSE = 2 and forecast = 20, it roughly implies that we are 95% confident of observing reality between 16 and 24 \((20 +/- 1.96*2)\)
It measures Mean Absolute Errors in percentage (%).
What does MAPE say?
If MAPE = 0.036, it implies that on average we are 3.6% away from the actual value. But it does not specify the direction i.e. +/-.
For a good forecast, we want small values of RMSE and MAPE as it implies that we have committed less errors.
Bias-Variance Trade off
This is another important aspect in building and choosing any statistical model.
Bias refers to assumptions made in the model to simplify the task at hand. Whereas, Variance refers to sensitivity of model to changes in the training data.
If we want to reduce the variability of the model, we need lesser parameters to be estimated.
But as soon as we estimate less parameters, we are not estimating correctly i.e. creating a bias.
From an analytical perspective, to some extent, RMSE tracks Variance and MAPE tracks bias.
So, a model with low MAPE has less BIAS and a model with low RMSE has less uncertainty.
Table : Model Performance
This table is formulated in Excel, to properly compare the results.
Based on the above table, Linear Regression with SARIMA errors seems to be the best fit with lowest RMSE and MAPE.
However, it is important to remember:
All Models Are Wrong But Some Are Useful!
Let us call the graph of original data for concluding remarks.
Animated Time Series Graph
From the above plot, it is clear that the Crude Oil Production has been on a decline since 2005-06. There are many global and local economic and geo-political factors contributing to this decline.
However, from an academic point-of-view, we can use Better Forecasting methods and models like GARCH models, Dynamic Linear Models, TBATS, Prophet, NNETAR (neural networks), LSTM. For some insight have a look at this article
What Problems does Nigeria face?
The First problem is NNPC’s (Nigerian National Petroleum Corporation) structure. NNPC has two roles; a commercial and regulatory. At present, there is a lot of corruption and scams like the infamous case of 2016, where the Government Audit said 16 Billion dollars are missing!
The Second problem is Tax Evasion i.e. under-reporting of profits by International companies operating in Nigeria. This leads to lower taxes collected by the NNPC. So, the public oil money gets lost in a swamp of tax havens and murky accounting.
The passing of the Petroleum Industry Bill PIB will split the NNPC’s regulatory and commercial sides. This will ensure more transparency.
Regular, strict and independent Annual Auditing of Local and International Companies will reduce tax evasion and scams.
The government should seek ways to diversify it’s economy to reduce dependency on oil sector and foreign companies.
However, none of these steps will single-handedly improve Nigeria’s oil industry, or it’s economy or it’s poverty levels. But the combined impact will be fruitful for the country as a whole.