This research work is aimed towards applying Time Series Analysis on Nigeria’s Crude Oil Production from the year 2005 to 2021, analyze trends and patterns and give an idea of how different forecasting methods work.
Data: The Monthly Crude Oil Production Data is available for the years 1993 to 2021. Click here
Oil is a major source of non-renewable energy for not only Nigeria but also for rest of the world. It plays a vital role in shaping the Nigerian economy. Nigeria is Africa’s top oil producer and a part of OPEC. Nigerian Oil Industry began to play a prominent role in Nigeria’s economy only after the Nigerian Civil War (1967-1970). Nigeria is the 12th largest producer worldwide.
The primary focus here, is to examine the trend and pattern of crude oil production over 17 years, to investigate stationarity, develop and fit a statistical model and forecast production for the next 21 months
We start by reading in the data file along with necessary libraries.
We then split the original data into 2 sets for training the models and testing the results. For Training, we use data from Jan 2005 to Dec 2019. Based on the training data, we seek to fit a model and forecast 21 months ahead. For checking how the models perform, we’ll keep the remaining data of 21 Months as Test data.
For this we use the window command.
#Load data
= read.csv("Crude_Oil_Production_Nigeria_Monthly.csv")
data = ts(data$Million.Barrels, start=1993, frequency = 12)
crude.oil
#Load Library
library(forecast); library(ggplot2)
library(tseries); library(car)
library(lmtest)
<- window(crude.oil, start = c(2005, 1), end = c(2019, 12), frequency = 12)
ts_train <- window(crude.oil, start = c(2020, 1), frequency = 12) ts_test
ggtsdisplay(ts_train,points = T, plot.type = "partial", smooth = T, cex=0.8,
main="Training Dataset", xlab="years", ylab="Million Barrels")
monthplot(ts_train, main="Month Plot", xlab="Months", ylab="Million Barrels")
ggseasonplot(ts_train, year.labels = T)+
ggtitle("Seasonal Plot", sub="Monthly Production of Crude Oil")
In order to understand what is going on, we can breakdown/ decompose the data into 3 of it’s components: (1)trend (2)seasonality and (3)randomness.
For this we use the stl command in R.
<- stl(ts_train, s.window = "periodic")
tsdecompse_train
plot(tsdecompse_train, main="Seasonal Decomposition by Loess")
range(tsdecompse_train$time.series[,1]) #range of seasonality
range(tsdecompse_train$time.series[,3]) #range of randomness
Let’s Visualize the Trend, Seasonality and Randomness
We know that when it difference the series it can achieve stationarity. To determine the order of difference let’s run a for loop.
#loop
for (i in 1:5) { print(var(diff(ts_train, lag = 1, differences = i))) }
## [1] 0.00731542
## [1] 0.01699673
## [1] 0.05284384
## [1] 0.1785229
## [1] 0.6298697
= diff(ts_train, lag = 1, differences = 1)
diff1 ggtsdisplay(diff1, points = T, plot.type = "partial", smooth = T, cex=0.5,
main="Differenced Series", xlab="years", ylab="Change")
From the plot above, it seems we achieve stationarity. But to confirm, we perform some formal tests.
adf.test(ts_train)
adf.test(diff1)
kpss.test(ts_train)
kpss.test(diff1)
For the Differenced Series,
Both tests with opposing H0 are in agreement!
Naive, SNaive Model
Seasonal Decomposition + Random Walk with drift
Simple Exponential Smoothing
Holt Winter Double Exponential
Holt Winter Triple Exponential
SARIMA Model
Linear Regression + SARIMA errors
#Naive
= naive(ts_train, h = 21)
ts_train_naive #SNaive
= snaive(ts_train, h = 21) ts_train_snaive
Note: The dark and light blue shaded regions around the forecast line represent 80% and 95% Confidence Intervals.
Note: The dark and light blue shaded regions around the forecast line represent 80% and 95% Confidence Intervals.
Let’s look at the values of 2019 - 2020
Naive Models are very simple. They say:
SNaive or Seasonal Naive extends the Naive Model introducing seasonality.
The Naive Models are vital because they act as a good basis to compare rest of the models. If a model is performing poorer than the Naive Models, it is best we re-evaluate or discard it.
<- forecast(tsdecompse_train, method = "rwdrift", h=21) ts_train_rwd
This model forecasts future values based on an equation :
<- ses(ts_train, h=21) ts_train_ses
What has R done?
It has given the SAME result for all months. But, Why!
Solution?
I propose we use Excel to get a better Simple Exponential Smoothing results. Thus, I present you with another Article dealing with it. Do check it out to understand the process in Excel.
<- holt(ts_train, h=21) ts_train_HW
Under Double Exponential, seasonality was ignored. Here, we’ll capture it.
<- hw(ts_train, h=21, seasonal = "additive") ts_train_HW1
SARIMA stands for Seasonal Auto Regressive Integrated Moving Average.
auto.arima(ts_train, ic="aic")
## Series: ts_train
## ARIMA(1,1,1)(0,0,1)[12]
##
## Coefficients:
## ar1 ma1 sma1
## 0.7350 -0.8555 -0.1010
## s.e. 0.1624 0.1246 0.0709
##
## sigma^2 = 0.007073: log likelihood = 190.55
## AIC=-373.09 AICc=-372.86 BIC=-360.34
We can difference the series to make it an I(0) process.
= auto.arima(diff1, ic = "aic")
arma11 print(arma11)
## Series: diff1
## ARIMA(1,0,1)(0,0,1)[12] with zero mean
##
## Coefficients:
## ar1 ma1 sma1
## 0.7350 -0.8555 -0.1010
## s.e. 0.1624 0.1246 0.0709
##
## sigma^2 = 0.007073: log likelihood = 190.55
## AIC=-373.09 AICc=-372.86 BIC=-360.34
Residual Analysis of Fitted Model
Aim: To see if our assumption \(e_t ~ N(0, 0.007073)\) holds.
What does this mean?
= arma11$residuals
e11 = dnorm(e11,0, arma11$sigma2^0.5) pdf
#Density Plot of Residuals
plot(e11, pdf, col="#e88215", cex=0.8, xlim=c(-0.5,0.5),ylim=c(0,5),
xlab="Residuals",ylab="Density",main="SARIMA Residual Density Analysis")
hist(e11, prob=T, add = T, density=35, col="#e85e0e"); lines(density(e11))
#QQ plot of residuals
qqnorm(e11, datax = T, col="#e88215"); qqline(e11, datax = T)
tsdiag
tsdiag(arma11, gof.lag = 15)
Here, we get 3 plots:
(1) Plot of Standardized Residuals
(2) ACF of Residuals
(3) P-value of Ljung-Box Statistics
Forecasting from Fitted SARIMA Model
= arima(x = ts_train, order = c(1,1,1), seasonal = c(0,0,1)) arima.fit
The idea here is to model the series as a Linear Regression but with Errors from SARIMA model.
#Creating time covariate object
= seq(2005,2019+11/12, by = 1/12)
time
#Model
= auto.arima(ts_train, xreg = time, ic="aic")
reg.arma
#Prediction
= forecast(reg.arma, xreg = seq(2020, 2021+9/12, 1/12)) reg.arma.p
It is quite clear from the graphs how the respective models perform. But to compare them, we need some numbers i.e. accuracy measures.
RMSE is Square Root of Mean Square Error. It is analogous to standard deviation.
What does RMSE say?
If RMSE = 2 and forecast = 20, it roughly implies that we are 95% confident of observing reality between 16 and 24 \((20 +/- 1.96*2)\)
It is measures Mean Absolute Errors in percentage (%).
What does MAPE say?
If MAPE = 0.036, it implies that on average we are 3.6% away from the actual value. But it does not specify the direction i.e. +/-.
For a good forecast, we want small values of RMSE and MAPE as it implies that we have committed less errors.
This is another important aspect in building and choosing any statistical model.
Bias refers to assumptions made in the model to simplify the task at hand. Whereas, Variance refers to sensitivity of model to changes in the training data.
If we want to reduce the variability of the model, we need lesser parameters to be estimated.
But as soon as we estimate less parameters, we are not estimating correctly i.e. creating a bias.
From an analytical perspective, to some extent, RMSE tracks Variance and MAPE tracks bias.
So, a model with low MAPE has less BIAS and a model with low RMSE has less uncertainty.
This table is formulated in Excel, to properly compare the results.
Based on the above table, Linear Regression with SARIMA errors seems to be the best fit with lowest RMSE and MAPE.
However, it is important to remember:
All Models Are Wrong But Some Are Useful!
Let us call the graph of original data for concluding remarks.
From the above plot, it is clear that the Crude Oil Production has been on a decline since 2005-06. There are many global and local economic and geo-political factors contributing to this decline.
However, from an academic point-of-view, we can use Better Forecasting methods and models like GARCH models, Dynamic Linear Models, TBATS, Prophet, NNETAR (neural networks), LSTM. For some insight have a look at this article
The First problem is NNPC’s (Nigerian National Petroleum Corporation) structure. NNPC has two roles; a commercial and regulatory. At present, there is a lot of corruption and scams like the infamous case of 2016, where the Government Audit said 16 Billion dollars are missing!
The Second problem is Tax Evasion i.e. under-reporting of profits by International companies operating in Nigeria. This leads to lower taxes collected by the NNPC. So, the public oil money gets lost in a swamp of tax havens and murky accounting.
The Third problem is Oil-theft. There are hundreds of ill-legal refineries in the delta controlled by criminal gangs. In early 2019, about 22 million barrels were stolen
The passing of the Petroleum Industry Bill PIB will split the NNPC’s regulatory and commercial sides. This will ensure more transparency.
Regular, strict and independent Annual Auditing of Local and International Companies will reduce tax evasion and scams.
The government should seek ways to diversify it’s economy to reduce dependency on oil sector and foreign companies.
However, none of these steps will single-handedly improve Nigeria’s oil industry, or it’s economy or it’s poverty levels. But the combined impact will be fruitful for the country as a whole.