Forecasting the sales of Food and Beverages of South Africa

Author

Katleho Nyoni

ABSTRACT

The Box and Jenkins methodology to Time Series Analysis(TSA) has shed light on how to deal with processes that do not follow classical methods. I use the Box and Jenkins methodology to fit and find a candidate model for the sales of the Food and Beverages of South Africa(SA). This analysis features fitting various seasonal ARIMA models to the sales of the Food & Beverages of South Africa as well as testing the necessary assumptions. The ARIMA(2,1,3)(1,1,0) is chosen to be the best model and the analysis proceeds to forecast future values and evaluates the performance of the model using the MAPE(evaluated to be 1.98%) metric.

1 INTRODUCTION

The history of Time Series Analysis(TSA) goes back to the famous Autoregressive(AR) and Moving Average(MA) proposed models. It was shortly when they were extended by the Autoregressive Integrated Moving Average(ARIMA) which has proved itself to be powerful and is widely used in various field like finance, engineering, and biological sciences to name a few. This analysis is more like a Tutorial which is more concerned with the practicality of TSA techniques. The analysis thus does not include references as it doesn’t cover any theoretical aspect of TSA and it gradually introduces TSA terminology.

2 ANALYSIS

2.1 PROCESS & METHOD

I will be analyzing the monthly sales of Food and Beverages of South Africa from January 1998 to December 2015(in South African Rand - ZAR). The dataset with the period from January 2016 to December 2019 will be used as the validation dataset to evaluate how well our predictions of the fitted model is. The dataset is made available to the public by Statistics South Africa(Stats SA) which can be found at https://www.statssa.gov.za .


Attaching package: 'TSA'
The following objects are masked from 'package:stats':

    acf, arima
The following object is masked from 'package:utils':

    tar
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
Registered S3 methods overwritten by 'forecast':
  method       from
  fitted.Arima TSA 
  plot.Arima   TSA 

Attaching package: 'forecast'
The following object is masked from 'package:Metrics':

    accuracy

Attaching package: 'sjPlot'
The following objects are masked from 'package:TSstudio':

    plot_grid, plot_model
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 8382196 12753241 17302507 19984423 26404763 40146894 

Below is the time series plot of t he sales of food and beverages from January 1998 to December 2015. There is a clear increasing trend of the sales, which shows the increasing demand of food supply, which in turn might point to the increasing population size of the country.

Below is the decomposition plot of our sales which show a linearly increasing trend as well as seasonality. “Random” refers to what’s left after removing the trend and seasonality from our sales.

The BoxCox lambda is calculated to be 0.0939, which is approximately 1. This suggests that the sales of food & beverages don’t need to be transformed.

[1] 0.0939

To investigate an important concept, we test for stationarity. To do this we use both the Partial Autocorrelation Function(PACF) and Autocorrelation Function(ACF) then compute the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. The ACF trails off while the PACF cuts-off after lag 1(only significant at lag 1), this shows that our sales are non-stationary.

To confirm our findings above, we compute the KPSS test. The KPSS test has the Null Hypothesis(Null) : data is trend stationary, and the Alternative Hypothesis : data is trend non-stationary. Accepting one Hypothesis type is consequently rejecting the other, and vice versa. The Null is rejected if the p-value is less than 0.05 or 5%. The computed results of the KPSS test shows a trend statistic of and a p-value = 0.01 which is less that 0.05, thus, we reject the Null that the sales are trend stationary(meaning that we accept they’re trend non-stationary). When a process is non-stationary, it oath to be made stationary. To do that we use a technique called differencing, I thus first difference and seasonally difference the sales to de-trend and remove the seasonal component.

Warning in kpss.test(Sales, null = "Trend"): p-value smaller than printed
p-value
Original Data
statistic parameter p.value method data.name
0.9032 4 0.0100 KPSS Test for Trend Stationarity Sales

Furthermore, we test if the seasonally & first differenced sales process is stationary or not. We used the famous Agumented Dickey-Fuller(ADF) and Phillips-Perron(PP) Unit Root tests, both with the Null that the process is non-stationary. They both show a p-value of 0.01 < 0.05, thus we reject the Null, the process is stationary. The KPSS also confirms that the differenced process is now trend stationary with a p-value = 0.1 .

Warning in adf.test(Ds): p-value smaller than printed p-value
Warning in pp.test(Ds): p-value smaller than printed p-value
statistic parameter alternative p.value method data.name
-10.9176 5 stationary 0.0100 Augmented Dickey-Fuller Test Ds
statistic parameter alternative p.value method data.name
-310.0175 4 stationary 0.0100 Phillips-Perron Unit Root Test Ds
Warning in kpss.test(Ds, null = "Trend"): p-value greater than printed p-value
statistic parameter p.value method data.name
0.0233 4 0.1000 KPSS Test for Trend Stationarity Ds

In addition, we plot the first and seasonally differenced sales and see that there is no particular trend, the process is stationary.

2.2 MODEL SPECIFICATION

The ACF is significant at lag 1, with lag 12 slightly significant. The significant lags at 13,33 & 34 can be attributed to random error and thus be ignored. The PACF cuts-off after lag 4. Both the ACF & PACF proposes an ARIMA(4,1,1)(0,1,1) model for our sales.

In contrast, the Extended Autocorrelation Function(EACF) proposes an ARIMA(0,1,2)(0,0,1). The EACF lets us know the order of an Autoregressive Integrated Moving Average(ARIMA) model, thus is can be trusted as opposed to the ACF & PACF for the order of ARIMA. The numbers with o are the suggested and most appropriate, as compared to those with an x. Let us investigate the suggested models further and select the best candidate.

AR/MA
  0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 x x o o o o o o o o o  x  x  o 
1 x x x o o o o o o o o  o  x  o 
2 x o x o o o o o o o o  x  x  o 
3 x o x o o o o o o o o  x  x  o 
4 x x x x o o o o o o o  o  x  o 
5 x x o x o o o o o o o  o  x  o 
6 x x o x o o o o o o o  o  x  o 
7 x x o x o o o o o o o  o  o  o 

2.3 MODEL SELECTION

We first start by fitting the auto-ARIMA which claims that ARIMA(2,1,2)(1,1,0) is the best model. All the parameters(coefficients) of the model are statistically significant as their absolute value is greater than their corresponding Standard Error(s.e) except the AR(2) parameter with estimate = -0.0040 and the corresponding s.e of -0.0040. I claim this model is not the best as the AR(2) coefficient is statistically insignificant and the ACF showed a significant lag at 12, not the PACF, thus the bold parameter [ ARIMA(2,1,2)(1,1,0) ] should not be in the model.

Series: TrainD 
ARIMA(2,1,2)(1,1,0)[12] 

Coefficients:
         ar1      ar2      ma1     ma2     sar1
      0.7093  -0.0040  -1.2771  0.5489  -0.4971
s.e.  0.1451   0.1314   0.1275  0.1154   0.0620

sigma^2 = 2.484e+11:  log likelihood = -2950.71
AIC=5913.43   AICc=5913.86   BIC=5933.31

Below we fit a number of models proposed by the ACF & PACF, and EACF as well as over-fitting the parameters to find the best fit. The 1st model in the table is the model proposed by auto-ARIMA, followed by the ACF & PACF model, then followed by the suggested model by the EACF. The model ARIMA(0,1,3)(1,1,0) seems to be the best fit as it has the lowest AIC of 5852.460 . The Mean Absolute Percentage Error(MAPE) is computed by comparing the forecasted Sales with those from the Validation Dataset that was introduced in section 2.1 . This is a measure of how much error does our model incorporate-how much does the forecasted values deviate from the observed(validation).

Fitted Models
Models AIC MAPE
ARIMA(2,1,2)(1,1,0) 5913.4296 0.0433
ARIMA(4,1,1)(0,1,1) 5856.4040 0.0181
ARIMA(0,1,2))(0,1,1) 5862.9418 0.0297
ARIMA(0,1,3))(0,1,1) 5857.5688 0.0349
ARIMA(1,1,3)(0,1,1) 5855.0407 0.0270
ARIMA(2,1,3)(0,1,1) 5852.4600 0.0287
ARIMA(2,1,1)(0,1,1) 5866.8917 0.0287
ARIMA(3,1,1)(0,1,1) 5856.5395 0.0165

2.4 MODEL DIAGNOSIS

We observe that our fitted model don’t have Innovative Outliers(IO) but has an Additive outliers(AO) corresponding to index 192(November 2013). The AO is treated as an Intervention, thus it is removed. To further inspect our candidate model we check whether if the residuals of the model are Normally distributed by applying the Shapiro-Wilk(SW) test with Null : residuals are Normally distributed. The computed SW test shows a p-value = 0.0092 which is less than 0.05, thus we reject the Null(the residuals are non-Normal). The McLeod-Li test with the Null : no conditional heteroscedasticity, shows that all the lags are not significant under the 5% level. There is no evidence that supports an Autoregressive Conditional Heteroscedasticity(ARCH) process or autocorrelation. The final diagnosis include checking the residuals of the model using the ACF, which shows white noise, there is no significant autocorrelation. This shows that our model is very good.

              [,1]
ind     192.000000
lambda2   3.953247
[1] "No IO detected"
Warning in window.default(x, ...): 'end' value not changed
statistic p.value method data.name
0.99 0.03 Shapiro-Wilk normality test residuals(M)

Warning: Ignoring 36 observations
Warning: Ignoring 34 observations
Warning: Ignoring 4 observations

2.5 PARAMETER ESTIMATION

The fitted model has statistically significant parameters as their absolute is greater than their corresponding s.e . The MAPE of 0.0287 implies that the model forecasts predictions that are off by 2.87% , which suggests that our forecasts are great.


Call:
arima(x = TrainD, order = c(2, 1, 3), seasonal = list(order = c(0, 1, 1), period = 12))

Coefficients:
          ar1     ar2      ma1     ma2     ma3     sma1
      -0.1987  0.6474  -0.2710  -0.706  0.5648  -0.9998
s.e.   0.1257  0.1254   0.1162   0.099  0.0663   0.1526

sigma^2 estimated as 1.485e+11:  log likelihood = -2919.23,  aic = 5850.46

2.6 FORECASTING

We forecast for January 2016 to December 2019, where the Blue line is our forecast, while the Grey area is the 95% confidence interval.

The table below shows the difference of the forecast from the actual observations(validation dataset).

Warning in as.data.table.list(x, keep.rownames = keep.rownames, check.names =
check.names, : Item 3 has 47 rows but longest item has 48; recycled with
remainder.
Actual Forecast Difference
40338249 39714378.89 379201.52
39801841 39959047.48 -408559.01
40905278 40210400.01 329790.15
40553824 40575487.85 -112841.64
42242193 40666665.64 1291650.46
42543889 40950542.54 1408001.56
42801258 41135887.44 1494274.13
41974923 41306983.87 456381.53
43070325 41518541.47 1495823.45
42385137 41574501.55 601582.55
42967995 41783554.45 853942.96
42566863 42114052.04 334133.25
42405733 42232729.75 74579.51
42750689 42331153.49 304441.69
42241345 42446247.31 -491816.92
43624135 42733161.92 865630.64
42841374 42758504.36 -160225.25
43591047 43001599.25 436344.29
43233437 43154702.71 -71388.61
43741526 43304825.61 241076.04
43069542 43500449.96 -476174.68
43697004 43545716.68 -49844.44
45351529 43746848.44 1279605.93
43548179 44071923.07 -638468.51
44060624 44186647.51 -221706.24
44039147 44282330.24 -356299.10
44586293 44395446.10 -94685.18
45173438 44680978.18 468108.61
44551562 44705329.39 -396165.87
45504475 44947727.87 404140.92
45308701 45100334.08 58594.53
46714547 45250106.47 1269065.80
46005870 45445481.20 515298.39
46429995 45490571.61 738416.98
47355645 45691578.02 1339081.01
47644180 46016563.99 1512954.55
50305239 46131225.45 4078375.39
47047241 46226863.61 707293.17
47968299 46339947.83 1342841.50
49068915 46625457.50 2419122.19
48420864 46649792.81 1528683.97
48538912 46892180.03 1494133.75
49123987 47044778.25 1929442.01
49907465 47194544.99 2517549.29
49159373 47389915.71 1724369.74
49412233 47435003.26 1776225.34
50180108 47636007.66 2219115.80
49565946 47960992.20 379201.52

3 CONCLUSION

The fitted ARIMA(2,1,3)(1,1,0) model has proved itself to be the best candidate model to forecast the Sales of Food and Beverages in South Africa. The consistency in the increasing trend of sales suggest an increasing population which will correspond to increased food production. South African Policy makers and citizens should start exploring alternative ways to grow and produce food to avoid the issue of food insecurity in the near/far future. The model can be trusted to predict future values but should be updated with the availability of new data so to avoid affecting the performance of the model and to avoid bias with a model that will give large intervals of estimates that will incorporate a lot of unnessesary uncertainty.