Nigerian Crude Oil Production

ABSTRACT

This research work is aimed towards applying Time Series Analysis on Nigeria’s Crude Oil Production from the year 2005 to 2021, analyze trends and patterns and give an idea of how different forecasting methods work.

Data: The Monthly Crude Oil Production Data is available for the years 1993 to 2021. Click here

Introduction

Oil is a major source of non-renewable energy for not only Nigeria but also for rest of the world. It plays a vital role in shaping the Nigerian economy. Nigeria is Africa’s top oil producer and a part of OPEC. Nigerian Oil Industry began to play a prominent role in Nigeria’s economy only after the Nigerian Civil War (1967-1970). Nigeria is the 12th largest producer worldwide.

The primary focus here, is to examine the trend and pattern of crude oil production over 17 years, to investigate stationarity, develop and fit a statistical model and forecast production for the next 21 months

Our Data

We start by reading in the data file along with necessary libraries.

We then split the original data into 2 sets for training the models and testing the results. For Training, we use data from Jan 2005 to Dec 2019. Based on the training data, we seek to fit a model and forecast 21 months ahead. For checking how the models perform, we’ll keep the remaining data of 21 Months as Test data.

For this we use the window command.

#Load data
data = read.csv("Crude_Oil_Production_Nigeria_Monthly.csv")
crude.oil = ts(data$Million.Barrels, start=1993, frequency = 12) 

#Load Library
library(forecast); library(ggplot2)
library(tseries); library(car)
library(lmtest)

ts_train <- window(crude.oil, start = c(2005, 1), end = c(2019, 12), frequency = 12)
ts_test <- window(crude.oil, start = c(2020, 1),  frequency = 12)

Exploratory Data Analysis

ggtsdisplay

ggtsdisplay(ts_train,points = T, plot.type = "partial", smooth = T, cex=0.8,
            main="Training Dataset", xlab="years", ylab="Million Barrels")

ggtsdisplay gives us 3 plots - (1)Plot of Original Series (2)Plot of ACF (3)Plot of PACF.
smooth = T tells R to add a smooth blue-coloured regression line to help us better visualize the underlying trend.
It is clear, on average, the production is not constant. It’s variability is also not constant.Thus, we say that this series is not stationary.
There is an overall downward trend. The blue line is a good indicator of this.
Sample ACF decays very slowly.
Sample PACF cuts off after few initial lags
The blue dotted lines are the 95% Confidence Intervals. Anything inside these lines are statistically insignificant.
Sample values may will not conform perfectly with theoretical behaviours. So, we can ignore a few lags slightly peaking beyond the intervals.
This tells us that the series is not stationary.

monthplot

monthplot(ts_train, main="Month Plot", xlab="Months", ylab="Million Barrels")

Here we have a plot of the month-wise data series.
Say, for January, the plot are all figures for January from 2005 to 2019.
The horizontal line is the monthly-average.
These averages seem to be very close to each other.
However, there is a slight decrease in the months of May-June-July.

ggseasonplot

ggseasonplot(ts_train, year.labels = T)+
  ggtitle("Seasonal Plot", sub="Monthly Production of Crude Oil")

Here we have a plot to visualize the seasonality aspects.
Every year gets it’s own curve.
It is very haphazard i.e. there is no common seasonal trend over the years.
The Y-axis magnitude indicates that the variation due to seasonality is not much.

Decomposition

In order to understand what is going on, we can breakdown/ decompose the data into 3 of it’s components: (1)trend (2)seasonality and (3)randomness.

For this we use the stl command in R.

tsdecompse_train <- stl(ts_train, s.window = "periodic")  

plot(tsdecompse_train, main="Seasonal Decomposition by Loess")

Data = Trend + Seasonal + Remainder
Because we add all three components to get our data, we say that the components are additive
If we multiplied them, we’d say the components are multiplicative
swindow = ‘periodic’ indicates that there is seasonality in the data_
STL command in R gives us ‘error bars’. They tell us if anything is significant or not. Anything between the bars is insignificant.
Thus, we can say that the seasonal component is of not much importance.

range(tsdecompse_train$time.series[,1]) #range of seasonality
range(tsdecompse_train$time.series[,3]) #range of randomness

Range of seasonality is -0.051 to 0.038 i.e. due to changing season, the production falls by 5.1% and rises by 3.8%. This is the size of seasonal variation.
Range of randomness is -0.21 to 0.16 i.e. the randomness, the data might fall by 21% or rise by 16%. This is the unpredictable part of the data.

Let’s Visualize the Trend, Seasonality and Randomness

Differencing

We know that when it difference the series it can achieve stationarity. To determine the order of difference let’s run a for loop.

#loop
for (i in 1:5) { print(var(diff(ts_train, lag = 1, differences = i))) }

## [1] 0.00731542
## [1] 0.01699673
## [1] 0.05284384
## [1] 0.1785229
## [1] 0.6298697

When the loops runs for the first time, we difference the series once. Then calculate variance of first differenced series.
We repeat this 5 times.
Now because the variance increases beyond d = 1 we difference the series once.
From the plot below, we can see that we achieve stationarity after first-difference.

diff1 = diff(ts_train, lag = 1, differences = 1)
ggtsdisplay(diff1, points = T, plot.type = "partial", smooth = T, cex=0.5,
            main="Differenced Series", xlab="years", ylab="Change")

Unit Root Tests

From the plot above, it seems we achieve stationarity. But to confirm, we perform some formal tests.

Augmented Dickey Fuller Test

H0: Series is not stationary
H1: Series is stationary

adf.test(ts_train)
adf.test(diff1)

Kwiatkowski Philips Schmidt Shin (KPSS) Test

H0: Series is stationary
H1: Series is not stationary

kpss.test(ts_train)
kpss.test(diff1)

We are returned with Dickey-Fuller test statistic, Lag order used, P-Value and alternative hypothesis used.

For the Differenced Series,

ADF test returns a p-value < 0.05. We have sufficient evidence to reject H0 at 5% LS. Thus, the differenced series is stationary!
KPSS test returns a p-value > 0.05
We cannot reject KPSS’s H0, thus differenced series is stationary

Both tests with opposing H0 are in agreement!

Forecasting Methods Used

Naive, SNaive Model
Seasonal Decomposition + Random Walk with drift
Simple Exponential Smoothing
Holt Winter Double Exponential
Holt Winter Triple Exponential
SARIMA Model
Linear Regression + SARIMA errors

Naive, SNaive Model

#Naive
ts_train_naive = naive(ts_train, h = 21)
#SNaive
ts_train_snaive = snaive(ts_train, h = 21)

Note: The dark and light blue shaded regions around the forecast line represent 80% and 95% Confidence Intervals.

Let’s look at the values of 2019 - 2020

Naive Models are very simple. They say:

Forecast_tomorrow = Observed_today
So, last observed value Dec 2019 is 1.65 million barrels
Thus, All future forecasts are the same i.e. 1.65 million barrels

SNaive or Seasonal Naive extends the Naive Model introducing seasonality.

Forecast_Jan 2020 = Observed_Jan 2019 = 1.55
Forecast_Feb 2020 = Observed_Feb 2019 = 1.58 … and so on.

The Naive Models are vital because they act as a good basis to compare rest of the models. If a model is performing poorer than the Naive Models, it is best we re-evaluate or discard it.

Seasonal Decomposition + Random Walk with Drift

ts_train_rwd <- forecast(tsdecompse_train, method = "rwdrift", h=21)

Here, “drift” implies that there is an ability to pick up the trend and seasonality.
We can see that that forecast is not constant. It has picked up some trend and seasonality.
The dark and light shaded region around the forecast represent 80% and 95% CI.

Simple Exponential Smoothing

This model forecasts future values based on an equation :

ts_train_ses <- ses(ts_train, h=21)

What has R done?

It has given the SAME result for all months. But, Why!

This is because R did not have a “forecast_today” value to begin with.
So R assumed that forecast_today = observed_today. Thus, the equation became \(Y_{t+1}=Y_t\)
Thus, forecast_Jan2020 = Actual_Dec2020 i.e. 1.65 Million Barrels.

Solution?

I propose we use Excel to get a better Simple Exponential Smoothing results. Thus, I present you with another Article dealing with it. Do check it out to understand the process in Excel.

Holt Winter Double Exponential

ts_train_HW <- holt(ts_train, h=21)

Holt Winter Triple Exponential

Under Double Exponential, seasonality was ignored. Here, we’ll capture it.

ts_train_HW1 <- hw(ts_train, h=21, seasonal = "additive")

SARIMA Model

SARIMA stands for Seasonal Auto Regressive Integrated Moving Average.

auto.arima(ts_train, ic="aic")

## Series: ts_train 
## ARIMA(1,1,1)(0,0,1)[12] 
## 
## Coefficients:
##          ar1      ma1     sma1
##       0.7350  -0.8555  -0.1010
## s.e.  0.1624   0.1246   0.0709
## 
## sigma^2 = 0.007073:  log likelihood = 190.55
## AIC=-373.09   AICc=-372.86   BIC=-360.34

Code: auto.arima automatically fits an SARIMA model based on the specified Information Criteria, AIC. We can use BIC or AICc as well.
Fitted Model: ARIMA(1,1,1)(0,0,1)[12]
This means that we have ARIMA(1,1,1) with MA(1)-Seasonality or SMA(1) i.e. seasonal moving average of order 1.
[12] indicates Monthly seasonal component - makes sense because we have seasonal data.

We can difference the series to make it an I(0) process.

arma11 = auto.arima(diff1, ic = "aic")
print(arma11)

## Series: diff1 
## ARIMA(1,0,1)(0,0,1)[12] with zero mean 
## 
## Coefficients:
##          ar1      ma1     sma1
##       0.7350  -0.8555  -0.1010
## s.e.  0.1624   0.1246   0.0709
## 
## sigma^2 = 0.007073:  log likelihood = 190.55
## AIC=-373.09   AICc=-372.86   BIC=-360.34

Residual Analysis of Fitted Model

Aim: To see if our assumption \(e_t ~ N(0, 0.007073)\) holds.

What does this mean?

On average, errors are zero i.e. \(E(e_t) = 0\)
All errors have the same variance i.e. \(V(e_t) = 0.007\) for all et. This property is called ‘Homoskedasticity’
Errors are not autocorrelated i.e. past errors do not influence future errors.
Errors are Normally distributed.

e11 = arma11$residuals 
pdf = dnorm(e11,0, arma11$sigma2^0.5)

#Density Plot of Residuals
plot(e11, pdf, col="#e88215", cex=0.8, xlim=c(-0.5,0.5),ylim=c(0,5), 
     xlab="Residuals",ylab="Density",main="SARIMA Residual Density Analysis")
hist(e11, prob=T, add = T, density=35, col="#e85e0e"); lines(density(e11))

#QQ plot of residuals
qqnorm(e11, datax = T, col="#e88215"); qqline(e11, datax = T)

Q-Q plot compares the quantile of our errors with the theoretical ones. The closer the points to the black line, the better. We can see there is not much deviation.
Histogram & Density Plot shows us how the errors are distributed. It seems to be normally distributed because most of the values lie close the center.
Thus, assumptions of our model seem to hold in reality.
Note 1 : For colour: Google > color picker > copy ‘#HEX code’
datax = T keeps sample quantiles on the X Axis

tsdiag

tsdiag(arma11, gof.lag = 15)

Here, we get 3 plots:

(1) Plot of Standardized Residuals

We do not want any patterns in our error terms.
This is because we assume errors are 0 on average.
Here, there is no visible patterns thus, we have a good fit!

(2) ACF of Residuals

We do not want any significant values beyond lag 0.
Autocorrelated errors are not good for forecasting.
Non-Autocorrelated errors imply ‘my errors/mistakes today will not cause mistakes tomorrow’
Here, there is no significant values beyond 1st lag, we have a good fit!

(3) P-value of Ljung-Box Statistics

Under, Ljung Box Test, H0: Residuals are independent.
For this plot, R performs Ljung Box Test at different lags (X-axis) and calculates the p-values (Y-axis)
High p-values suggest that residuals are independent. Thus, we have a good fit!

Forecasting from Fitted SARIMA Model

arima.fit = arima(x = ts_train, order = c(1,1,1), seasonal = c(0,0,1))

Linear Regression + SARIMA errors

The idea here is to model the series as a Linear Regression but with Errors from SARIMA model.

#Creating time covariate object
time = seq(2005,2019+11/12, by = 1/12)

#Model
reg.arma = auto.arima(ts_train, xreg = time, ic="aic")

#Prediction
reg.arma.p = forecast(reg.arma, xreg = seq(2020, 2021+9/12, 1/12))

It is quite clear from the graphs how the respective models perform. But to compare them, we need some numbers i.e. accuracy measures.

Conclusion

Accuracy Measures: RMSE

RMSE is Square Root of Mean Square Error. It is analogous to standard deviation.

What does RMSE say?

If RMSE = 2 and forecast = 20, it roughly implies that we are 95% confident of observing reality between 16 and 24 \((20 +/- 1.96*2)\)

Accuracy Measures: MAPE

It is measures Mean Absolute Errors in percentage (%).

What does MAPE say?

If MAPE = 0.036, it implies that on average we are 3.6% away from the actual value. But it does not specify the direction i.e. +/-.

For a good forecast, we want small values of RMSE and MAPE as it implies that we have committed less errors.

Bias-Variance Trade off

This is another important aspect in building and choosing any statistical model.

Bias refers to assumptions made in the model to simplify the task at hand. Whereas, Variance refers to sensitivity of model to changes in the training data.

If we want to reduce the variability of the model, we need lesser parameters to be estimated.

But as soon as we estimate less parameters, we are not estimating correctly i.e. creating a bias.

From an analytical perspective, to some extent, RMSE tracks Variance and MAPE tracks bias.

So, a model with low MAPE has less BIAS and a model with low RMSE has less uncertainty.

This table is formulated in Excel, to properly compare the results.

Based on the above table, Linear Regression with SARIMA errors seems to be the best fit with lowest RMSE and MAPE.

However, it is important to remember:

All Models Are Wrong But Some Are Useful!

Let us call the graph of original data for concluding remarks.

From the above plot, it is clear that the Crude Oil Production has been on a decline since 2005-06. There are many global and local economic and geo-political factors contributing to this decline.

However, from an academic point-of-view, we can use Better Forecasting methods and models like GARCH models, Dynamic Linear Models, TBATS, Prophet, NNETAR (neural networks), LSTM. For some insight have a look at this article

What Problems does Nigeria face?

The First problem is NNPC’s (Nigerian National Petroleum Corporation) structure. NNPC has two roles; a commercial and regulatory. At present, there is a lot of corruption and scams like the infamous case of 2016, where the Government Audit said 16 Billion dollars are missing!
The Second problem is Tax Evasion i.e. under-reporting of profits by International companies operating in Nigeria. This leads to lower taxes collected by the NNPC. So, the public oil money gets lost in a swamp of tax havens and murky accounting.
The Third problem is Oil-theft. There are hundreds of ill-legal refineries in the delta controlled by criminal gangs. In early 2019, about 22 million barrels were stolen

Some Possible Solutions

The passing of the Petroleum Industry Bill PIB will split the NNPC’s regulatory and commercial sides. This will ensure more transparency.

Regular, strict and independent Annual Auditing of Local and International Companies will reduce tax evasion and scams.

The government should seek ways to diversify it’s economy to reduce dependency on oil sector and foreign companies.

However, none of these steps will single-handedly improve Nigeria’s oil industry, or it’s economy or it’s poverty levels. But the combined impact will be fruitful for the country as a whole.