Introduction:

About the company:

Subsequent to the dismantling of the Administered Pricing Mechanism (APM) in the petroleum sector, Oil Coordination Committee (OCC) was abolished and a new cell called Petroleum Planning & Analysis Cell (PPAC), attached to the Ministry of Petroleum and Natural Gas, was created effective 1st April 2002.

PPAC is to assist the Government in discharge of some of the functions earlier performed by OCC.

Shri Ram Naik, the then Hon’ble Minister of Petroleum & Natural Gas inaugurated Petroleum Planning & Analysis Cell at SCOPE Complex, Lodhi Road, New Delhi.

The expenditure of PPAC is borne by Oil Industry Development Board (OIDB).

The cell is located at New Delhi (India).

Here we collected data of net imports by india since Apr’2011 to Mar’2020.

Goal: To forecast the net crude oil import by india for next 2 years.

Data Collection:

Data collected from PPAC (Petroleum planning and analysis cell)

Data Exploration and Visualization:

tsdisplay(netimport.ts) # inspecting the ts object

autoplot(netimport.ts) # Plot function 
## Warning in is.na(main): is.na() applied to non-(list or vector) of type
## 'NULL'

ggseasonplot(netimport.ts) # Plot the seasonality 
## Warning in is.na(ylab): is.na() applied to non-(list or vector) of type
## 'NULL'

ggseasonplot(netimport.ts, polar = TRUE) # poltting seasonality with time as circular rather than linear 
## Warning in is.na(ylab): is.na() applied to non-(list or vector) of type
## 'NULL'

ggsubseriesplot(netimport.ts) # Plotting subseries for each month (season)

ggAcf(netimport.ts) # Autocorrelation Plot 

ggPacf(netimport.ts) # partial Autocorrelation Plot 

Data pre partition and preprocessing:

# Data Partition
nValid <- 12
nTrain <- length(netimport.ts) - nValid
train.ts <- window(netimport.ts, start = c(2011, 4), end = c(2011, nTrain))
valid.ts <- window(netimport.ts, start = c(2011, nTrain + 4 ), end = c(2011, nTrain + nValid))

applying forecast methods

######################################################################################
# Naive forecasts 
naive_fc <- naive(train.ts, h = 20)
summary(naive_fc)
## 
## Forecast method: Naive method
## 
## Model Information:
## Call: naive(y = train.ts, h = 20) 
## 
## Residual sd: 1777.2562 
## 
## Error measures:
##                    ME     RMSE      MAE        MPE     MAPE     MASE
## Training set 50.14565 1768.282 1341.233 -0.6894039 10.62507 1.014018
##                    ACF1
## Training set -0.5272263
## 
## Forecasts:
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Jan 2019       17453.22 15187.076 19719.36 13987.451 20918.99
## Feb 2019       17453.22 14248.408 20658.03 12551.883 22354.56
## Mar 2019       17453.22 13528.143 21378.30 11450.332 23456.11
## Apr 2019       17453.22 12920.931 21985.51 10521.682 24384.76
## May 2019       17453.22 12385.967 22520.47  9703.525 25202.91
## Jun 2019       17453.22 11902.322 23004.12  8963.855 25942.59
## Jul 2019       17453.22 11457.565 23448.87  8283.657 26622.78
## Aug 2019       17453.22 11043.596 23862.84  7650.545 27255.89
## Sep 2019       17453.22 10654.787 24251.65  7055.913 27850.53
## Oct 2019       17453.22 10287.042 24619.40  6493.496 28412.94
## Nov 2019       17453.22  9937.269 24969.17  5958.565 28947.88
## Dec 2019       17453.22  9603.065 25303.37  5447.444 29459.00
## Jan 2020       17453.22  9282.520 25623.92  4957.213 29949.23
## Feb 2020       17453.22  8974.084 25932.36  4485.500 30420.94
## Mar 2020       17453.22  8676.480 26229.96  4030.355 30876.09
## Apr 2020       17453.22  8388.642 26517.80  3590.144 31316.30
## May 2020       17453.22  8109.667 26796.77  3163.489 31742.95
## Jun 2020       17453.22  7838.783 27067.66  2749.208 32157.23
## Jul 2020       17453.22  7575.325 27331.11  2346.284 32560.16
## Aug 2020       17453.22  7318.714 27587.73  1953.830 32952.61
# Use accuracy() to compute RMSE statistics
accuracy(naive_fc,netimport.ts)
##                      ME     RMSE      MAE        MPE      MAPE      MASE
## Training set   50.14565 1768.282 1341.233 -0.6894039 10.625070 1.0140184
## Test set     -362.49800 1380.497 1036.706 -2.7455271  6.236945 0.7837856
##                     ACF1 Theil's U
## Training set -0.52722625        NA
## Test set     -0.07991316 0.7057737
######################################################################################

Decomposition

######################################################################################
# Decompose the given series into its various components: seasonal, trend, and irregular components (multiplicative decomposition).
import_decomposed <- decompose(netimport.ts,  type = "multiplicative")
plot(import_decomposed) # Plot the decomposed series

#####################################################################################
# Seasonal Index from the decomposed series 
seasonal_index <- data.frame(import_decomposed$figure)
seasonal_index <- cbind(seasonal_index, seq(1:12))

Linear trend model:

#####################################################################################
# Linear Trend Model
train.lm <- tslm(netimport.ts ~ trend)
summary(train.lm)
## 
## Call:
## tslm(formula = netimport.ts ~ trend)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3269.2  -666.2    27.3   841.2  3244.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9924.591    252.315   39.33   <2e-16 ***
## trend         72.445      4.019   18.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1302 on 106 degrees of freedom
## Multiple R-squared:  0.7541, Adjusted R-squared:  0.7517 
## F-statistic:   325 on 1 and 106 DF,  p-value: < 2.2e-16
train.lm.linear.trend.pred <- forecast(train.lm, h = nValid)
accuracy(train.lm.linear.trend.pred, nValid) # Model accuracy 
##                         ME      RMSE        MAE           MPE         MAPE
## Training set  1.347056e-13  1289.863   996.1412 -9.942159e-01 7.609353e+00
## Test set     -1.780908e+04 17809.085 17809.0847 -1.484090e+05 1.484090e+05
##                    MASE       ACF1
## Training set  0.7203631 0.01945983
## Test set     12.8787034         NA
# Plotting the linear trend 
plot(train.ts, xlab = "Time", ylab = "netimport", ylim = c(10000, 20000), bty = "l")
lines(train.lm$fitted, lwd = 2)

Exponential model

####################################################################################
# Exponential Trend Model
train.lm.expo.trend <- tslm(train.ts ~ trend, lambda = 0)
train.lm.expo.trend.pred <- forecast(train.lm.expo.trend, h = nValid)
accuracy(train.lm.expo.trend.pred, nValid) # Model accuracy
##                        ME      RMSE        MAE           MPE        MAPE
## Training set     64.21258  1274.451   979.9133 -5.237561e-01 7.73784e+00
## Test set     -17149.08591 17149.086 17149.0859 -1.429090e+05 1.42909e+05
##                    MASE       ACF1
## Training set  0.7306064 0.02246003
## Test set     12.7860623         NA
# Plots for fitted value for Exponential Trend 
plot(train.lm.expo.trend.pred, ylim = c(10000, 25000),  ylab = "netimports", xlab = "Time", bty = "l", xaxt = "n", xlim = c(2011,2022.25), main = "", flty = 2)
axis(1, at = seq(2011, 2022, 1), labels = format(seq(2011, 2022, 1)))
lines(train.lm.expo.trend.pred$fitted, lwd = 2, col = "blue")  # Added in 6-5
lines(train.lm.linear.trend.pred$fitted, lwd = 2, col = "red", lty = 3)
lines(train.lm.linear.trend.pred$mean, lwd = 2, col = "black", lty = 3)
lines(valid.ts)
lines(c(2020.25 - 3, 2020.25 - 3), c(0, 35000)) 
lines(c(2020.25, 2020.25), c(0, 35000))
text(2011.25, 25000, "Training")
text(2018.25, 25000, "Validation")
text(2022.25, 25000, "Future")
arrows(2020 - 3, 24500, 2011.25, 24500, code = 3, length = 0.1, lwd = 1,angle = 30)
arrows(2020.5 - 3, 24500, 2020, 24500, code = 3, length = 0.1, lwd = 1,angle = 30)
arrows(2020.5, 24500, 2022, 24500, code = 3, length = 0.1, lwd = 1, angle = 30)

Model with seasonality

########################################################################################
# Modeling Seasonality 
train.lm.season <- tslm(train.ts ~ season)
summary(train.lm.season)
## 
## Call:
## tslm(formula = train.ts ~ season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4799.2 -1854.9  -200.5  2043.8  4394.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14669.9      918.5  15.971   <2e-16 ***
## season2      -2279.8     1299.0  -1.755   0.0830 .  
## season3      -1617.7     1299.0  -1.245   0.2166    
## season4       -426.8     1257.8  -0.339   0.7353    
## season5       -648.8     1257.8  -0.516   0.6073    
## season6      -1437.5     1257.8  -1.143   0.2564    
## season7      -1373.4     1257.8  -1.092   0.2781    
## season8      -1075.0     1257.8  -0.855   0.3952    
## season9      -2568.1     1257.8  -2.042   0.0444 *  
## season10     -1597.0     1257.8  -1.270   0.2078    
## season11     -1753.1     1257.8  -1.394   0.1672    
## season12     -1009.8     1257.8  -0.803   0.4244    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2430 on 81 degrees of freedom
## Multiple R-squared:  0.08635,    Adjusted R-squared:  -0.03773 
## F-statistic: 0.6959 on 11 and 81 DF,  p-value: 0.7388
train.lm.season.pred <- forecast(train.lm.season, h = nValid)

accuracy(train.lm.season.pred , nValid)
##                         ME      RMSE       MAE           MPE         MAPE
## Training set -1.160911e-13  2268.006  2009.405 -3.108099e+00     15.84392
## Test set     -1.465795e+04 14657.950 14657.950 -1.221496e+05 122149.58333
##                   MASE      ACF1
## Training set  1.498178 0.7905511
## Test set     10.928714        NA
train.lm.season%>% forecast(h = 10) %>% autoplot()

# Plots of Fitted and Resdiuals 
par(mfrow = c(2,1))
plot(train.lm.season.pred, ylim = c(13000, 26250),  ylab = "netimports", xlab = "Time", bty = "l", xaxt = "n", xlim = c(2011,2022.25), main = "", flty = 2)
axis(1, at = seq(2011, 2022, 1), labels = format(seq(2011, 2022, 1)))
lines(train.lm.season.pred$fitted, lwd = 2, col = "blue")
lines(valid.ts)
lines(c(2020.25 - 3, 2020.25 - 3), c(0, 35000)) 
lines(c(2020.25, 2020.25), c(0, 35000))
text(2011.25, 26000, "Training")
text(2018.75, 26000, "Validation")
text(2022.25, 26000, "Future")
arrows(2020 - 3, 24500, 2011.25, 24500, code = 3, length = 0.1, lwd = 1,angle = 30)
arrows(2020.5 - 3, 24500, 2020, 24500, code = 3, length = 0.1, lwd = 1,angle = 30)
arrows(2020.5, 24500, 2022, 24500, code = 3, length = 0.1, lwd = 1, angle = 30)

# Residuals 
plot(train.lm.season.pred$residuals, ylim = c(-4000, 5500),  ylab = "Residuals", xlab = "Time", bty = "l", xaxt = "n", xlim = c(2011,2022.25), main = "")
axis(1, at = seq(2011, 2022, 1), labels = format(seq(2011, 2022, 1)))
lines(train.ts - train.lm.season.pred$fitted)
lines(valid.ts - train.lm.season.pred$mean)
lines(c(2020.25 - 3, 2020.25 - 3), c(-5000, 35000))
lines(c(2020.25, 2020.25), c(-5000, 35000))
text(2011.25, 5250, "Training")
text(2018.75, 5250, "Validation")
text(2022.25, 5250, "Future")
arrows(2020 - 3, 4250, 2011.25, 4250, code = 3, length = 0.1, lwd = 1,angle = 30)
arrows(2020.5 - 3, 4250, 2020, 4250, code = 3, length = 0.1, lwd = 1,angle = 30)
arrows(2020.5, 4250, 2022, 4250, code = 3, length = 0.1, lwd = 1, angle = 30)

########################################################################################

Quadratic trend + Seasonality

# Modleing Quadratic Trend + Seasonality 
train.lm.trend.season <- tslm(train.ts ~ trend + I(trend^2) + season)
train.lm.trend.season.pred <- forecast(train.lm.trend.season, h = nValid, level = 0)
accuracy(train.lm.trend.season.pred , nValid)
##                         ME       RMSE      MAE           MPE         MAPE
## Training set  1.957783e-14   993.5972   799.11 -6.445713e-01 6.319136e+00
## Test set     -1.857160e+04 18571.5952 18571.60 -1.547633e+05 1.547633e+05
##                    MASE         ACF1
## Training set  0.5958026 -0.005597907
## Test set     13.8466606           NA
# Summary
summary(train.lm.trend.season)
## 
## Call:
## tslm(formula = train.ts ~ trend + I(trend^2) + season)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2282.31  -639.18    81.64   679.96  1991.83 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11427.1619   533.2136  21.431  < 2e-16 ***
## trend          61.1973    16.9490   3.611 0.000534 ***
## I(trend^2)      0.1589     0.1747   0.909 0.365983    
## season2     -2355.7907   576.2563  -4.088 0.000104 ***
## season3     -1769.9995   576.3018  -3.071 0.002922 ** 
## season4      -229.3490   559.1232  -0.410 0.682774    
## season5      -526.4552   558.9719  -0.942 0.349152    
## season6     -1390.4495   558.8743  -2.488 0.014953 *  
## season7     -1402.0141   558.8290  -2.509 0.014159 *  
## season8     -1179.5847   558.8348  -2.111 0.037953 *  
## season9     -2749.0115   558.8915  -4.919 4.65e-06 ***
## season10    -1854.4656   558.9993  -3.317 0.001375 ** 
## season11    -2087.5015   559.1594  -3.733 0.000355 ***
## season12    -1421.4534   559.3731  -2.541 0.013007 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1078 on 79 degrees of freedom
## Multiple R-squared:  0.8246, Adjusted R-squared:  0.7958 
## F-statistic: 28.58 on 13 and 79 DF,  p-value: < 2.2e-16
train.lm.trend.season %>% forecast(h = 20) %>% autoplot()

#Plotting the fitted and residuals 
par(mfrow = c(2,1))
plot(train.lm.trend.season.pred, ylim = c(8000, 35250),  ylab = "netimports", xlab = "Time", bty = "l", xaxt = "n", xlim = c(2011,2020.25), main = "", flty = 2)
axis(1, at = seq(2011, 2022, 4), labels = format(seq(2011, 2022, 4)))
lines(train.lm.trend.season.pred$fitted, lwd = 2, col = "green")
lines(valid.ts)
lines(c(2020.25 - 3, 2020.25 - 3), c(0, 35000))
lines(c(2020.25, 2020.25), c(0, 35000))
text(2011.25, 35000, "Training")
text(2018.25, 35000, "Validation")
text(2022.25, 35000, "Future")
arrows(2020 - 3, 35500, 2011.25, 35500, code = 3, length = 0.1, lwd = 1,angle = 30)
arrows(2020.5 - 3, 35500, 2020, 35500, code = 3, length = 0.1, lwd = 1,angle = 30)
arrows(2020.5, 35500, 2022, 35500, code = 3, length = 0.1, lwd = 1, angle = 30)


# Plotting with residuals
plot(train.lm.trend.season.pred$residuals, ylim = c(-4000, 5500),  ylab = "Residuals", xlab = "Time", bty = "l", xaxt = "n", xlim = c(2011,2022.25), main = "")
 axis(1, at = seq(2011, 2022, 1), labels = format(seq(2011, 2022, 1)))
lines(train.ts - train.lm.trend.season.pred$fitted)
lines(valid.ts - train.lm.trend.season.pred$mean)
lines(c(2020.25 - 3, 2020.25 - 3), c(-5000, 35000))
lines(c(2020.25, 2020.25), c(-5000, 35000))
 text(2011.25, 5250, "Training")
text(2018.75, 5250, "Validation")
text(2022.25, 5250, "Future")
 arrows(2020 - 3, 4250, 2011.25, 4250, code = 3, length = 0.1, lwd = 1,angle = 30)
arrows(2020.5 - 3, 4250, 2020, 4250, code = 3, length = 0.1, lwd = 1,angle = 30)
arrows(2020.5, 4250, 2022, 4250, code = 3, length = 0.1, lwd = 1, angle = 30)

# testing for white noise 
train.lm.trend.season.resid <- train.lm.trend.season$residuals
  
Box.test(train.lm.trend.season.resid, lag = 24, fitdf = 0, type = "Ljung")
## 
##  Box-Ljung test
## 
## data:  train.lm.trend.season.resid
## X-squared = 52.952, df = 24, p-value = 0.000587
ggAcf(train.lm.trend.season.resid)

gglagplot(train.lm.trend.season.resid) 
## Warning in is.na(xlab): is.na() applied to non-(list or vector) of type
## 'NULL'
## Warning in is.na(ylab): is.na() applied to non-(list or vector) of type
## 'NULL'

# There is strong correlation among neighboring values 
# we need to take advantage of it in our forecast.
###########################################################################################

There are two ways

1 - Model ARMA/ ARIMA

2 - Model Residual itself as its not White Noise, so extract whatever information can be extracted

from the residual. Then, after modeling the residual, update the forecast for the main series.

############################################################################################
# Fitting ARIMA directly on the main series 
fit <- auto.arima(netimport.ts)
checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)(2,0,0)[12]
## Q* = 11.351, df = 18, p-value = 0.8789
## 
## Model df: 4.   Total lags used: 22
AIC(fit)
## [1] 1829.304
fit %>% forecast(h = 10) %>% autoplot()

# ARMA and ARIMA 
ggAcf(netimport.ts)

gglagplot(netimport.ts)
## Warning in is.na(xlab): is.na() applied to non-(list or vector) of type
## 'NULL'
## Warning in is.na(ylab): is.na() applied to non-(list or vector) of type
## 'NULL'

# Fitting an ARIMA on residual series 
train.lm.trend.season <- tslm(train.ts ~ trend + I(trend^2) + season)
train.res.arima <- Arima(train.lm.trend.season$residuals, order = c(1,0,0))
train.res.arima.pred <- forecast(train.res.arima, h = nValid)

# Plot
plot(train.lm.trend.season$residuals, ylim = c(-5000, 5000),  ylab = "Residuals", 
     xlab = "Time", bty = "l", xaxt = "n", xlim = c(2011,2022.25), main = "")
axis(1, at = seq(2011, 2022, 1), labels = format(seq(2011, 2022, 1)))
lines(train.res.arima.pred$fitted, lwd = 2, col = "red")
lines(c(2020.25 - 3, 2020.25 - 3), c(-5000, 35000))
lines(c(2020.25, 2020.25), c(-5000, 35000))
text(2011.25, 2250, "Training")
text(2018.75, 2250, "Validation")
text(2022.25, 22500, "Future")
arrows(2011 - 3, 2000, 2011.25, 2000, code = 3, length = 0.1, lwd = 1,angle = 30)
arrows(2020.5 - 3, 2000, 2020, 2000, code = 3, length = 0.1, lwd = 1,angle = 30)
arrows(2020.5, 2000, 2022, 2000, code = 3, length = 0.1, lwd = 1, angle = 30)

summary(train.res.arima)
## Series: train.lm.trend.season$residuals 
## ARIMA(1,0,0) with non-zero mean 
## 
## Coefficients:
##           ar1      mean
##       -0.0057   -0.1196
## s.e.   0.1047  102.4748
## 
## sigma^2 estimated as 1008900:  log likelihood=-773.78
## AIC=1553.57   AICc=1553.84   BIC=1561.17
## 
## Training set error measures:
##                      ME     RMSE      MAE      MPE     MAPE      MASE
## Training set 0.09657344 993.5812 799.3317 99.44335 99.44335 0.7010363
##                     ACF1
## Training set 0.001783958
# ACF function 
ggAcf(train.res.arima$residuals, lag.max = 12, main = "")
## Warning: Ignoring unknown parameters: main

# # White Noise 
set.seed(3)
wn <- ts(rnorm(120))
autoplot(wn)
## Warning in is.na(main): is.na() applied to non-(list or vector) of type
## 'NULL'

ggAcf(wn) + ggtitle("Sample ACF for whitenoise")

# How to test if a series is whitenoise or not 
# Ljung-Box test 
Box.test(wn, lag = 24, fitdf = 0, type = "Ljung")
## 
##  Box-Ljung test
## 
## data:  wn
## X-squared = 10.853, df = 24, p-value = 0.99

Conclusion:

Since crude oil import not only depend on demand, there is many factors influencing those are like socio economic factors , cross border relations, global econimc policies, population growth , exchange rates etc.

The above forecasting models forecast of crude oil net import by india.

To conclude that quadratic trend model is the best fit for above models for this problem statement,However many other factors influencing crude oil imports hence we have to explore other models to get accurate forecasting of crude imports.