Forecasting the sales of Food and Beverages of South Africa

Author

Katleho Nyoni

ABSTRACT

The Box and Jenkins methodology to Time Series Analysis(TSA) has shed light on how to deal with processes that do not follow classical methods. I use the Box and Jenkins methodology to fit and find a candidate model for the sales of the Food and Beverages of South Africa(SA). This analysis features fitting various seasonal ARIMA models to the sales of the Food & Beverages of South Africa as well as testing the necessary assumptions. The ARIMA(2,1,3)(1,1,0) is chosen to be the best model and the analysis proceeds to forecast future values and evaluates the performance of the model using the MAPE(evaluated to be 1.98%) metric.

1 INTRODUCTION

The history of Time Series Analysis(TSA) goes back to the famous Autoregressive(AR) and Moving Average(MA) proposed models. It was shortly when they were extended by the Autoregressive Integrated Moving Average(ARIMA) which has proved itself to be powerful and is widely used in various field like finance, engineering, and biological sciences to name a few. This analysis is more like a Tutorial which is more concerned with the practicality of TSA techniques. The analysis thus does not include references as it doesn’t cover any theoretical aspect of TSA and it gradually introduces TSA terminology.

2 ANALYSIS

2.1 PROCESS & METHOD

I will be analyzing the monthly sales of Food and Beverages of South Africa from January 1998 to December 2015(in South African Rand - ZAR). The dataset with the period from January 2016 to December 2019 will be used as the validation dataset to evaluate how well our predictions of the fitted model is. The dataset is made available to the public by Statistics South Africa(Stats SA) which can be found at https://www.statssa.gov.za .


Attaching package: 'TSA'

The following objects are masked from 'package:stats':

    acf, arima

The following object is masked from 'package:utils':

    tar

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

Registered S3 methods overwritten by 'forecast':
  method       from
  fitted.Arima TSA 
  plot.Arima   TSA


Attaching package: 'forecast'

The following object is masked from 'package:Metrics':

    accuracy


Attaching package: 'sjPlot'

The following objects are masked from 'package:TSstudio':

    plot_grid, plot_model

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 8382196 12753241 17302507 19984423 26404763 40146894

Below is the time series plot of t he sales of food and beverages from January 1998 to December 2015. There is a clear increasing trend of the sales, which shows the increasing demand of food supply, which in turn might point to the increasing population size of the country.

Below is the decomposition plot of our sales which show a linearly increasing trend as well as seasonality. “Random” refers to what’s left after removing the trend and seasonality from our sales.

The BoxCox lambda is calculated to be 0.0939, which is approximately 1. This suggests that the sales of food & beverages don’t need to be transformed.

[1] 0.0939

To investigate an important concept, we test for stationarity. To do this we use both the Partial Autocorrelation Function(PACF) and Autocorrelation Function(ACF) then compute the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. The ACF trails off while the PACF cuts-off after lag 1(only significant at lag 1), this shows that our sales are non-stationary.

To confirm our findings above, we compute the KPSS test. The KPSS test has the Null Hypothesis(Null) : data is trend stationary, and the Alternative Hypothesis : data is trend non-stationary. Accepting one Hypothesis type is consequently rejecting the other, and vice versa. The Null is rejected if the p-value is less than 0.05 or 5%. The computed results of the KPSS test shows a trend statistic of and a p-value = 0.01 which is less that 0.05, thus, we reject the Null that the sales are trend stationary(meaning that we accept they’re trend non-stationary). When a process is non-stationary, it oath to be made stationary. To do that we use a technique called differencing, I thus first difference and seasonally difference the sales to de-trend and remove the seasonal component.

Warning in kpss.test(Sales, null = "Trend"): p-value smaller than printed
p-value

Original Data
statistic	parameter	p.value	method	data.name
0.9032	4	0.0100	KPSS Test for Trend Stationarity	Sales

Furthermore, we test if the seasonally & first differenced sales process is stationary or not. We used the famous Agumented Dickey-Fuller(ADF) and Phillips-Perron(PP) Unit Root tests, both with the Null that the process is non-stationary. They both show a p-value of 0.01 < 0.05, thus we reject the Null, the process is stationary. The KPSS also confirms that the differenced process is now trend stationary with a p-value = 0.1 .

Warning in adf.test(Ds): p-value smaller than printed p-value

Warning in pp.test(Ds): p-value smaller than printed p-value

statistic	parameter	alternative	p.value	method	data.name
-10.9176	5	stationary	0.0100	Augmented Dickey-Fuller Test	Ds

statistic	parameter	alternative	p.value	method	data.name
-310.0175	4	stationary	0.0100	Phillips-Perron Unit Root Test	Ds

Warning in kpss.test(Ds, null = "Trend"): p-value greater than printed p-value

statistic	parameter	p.value	method	data.name
0.0233	4	0.1000	KPSS Test for Trend Stationarity	Ds

In addition, we plot the first and seasonally differenced sales and see that there is no particular trend, the process is stationary.

2.2 MODEL SPECIFICATION

The ACF is significant at lag 1, with lag 12 slightly significant. The significant lags at 13,33 & 34 can be attributed to random error and thus be ignored. The PACF cuts-off after lag 4. Both the ACF & PACF proposes an ARIMA(4,1,1)(0,1,1) model for our sales.

In contrast, the Extended Autocorrelation Function(EACF) proposes an ARIMA(0,1,2)(0,0,1). The EACF lets us know the order of an Autoregressive Integrated Moving Average(ARIMA) model, thus is can be trusted as opposed to the ACF & PACF for the order of ARIMA. The numbers with o are the suggested and most appropriate, as compared to those with an x. Let us investigate the suggested models further and select the best candidate.

AR/MA
  0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 x x o o o o o o o o o  x  x  o 
1 x x x o o o o o o o o  o  x  o 
2 x o x o o o o o o o o  x  x  o 
3 x o x o o o o o o o o  x  x  o 
4 x x x x o o o o o o o  o  x  o 
5 x x o x o o o o o o o  o  x  o 
6 x x o x o o o o o o o  o  x  o 
7 x x o x o o o o o o o  o  o  o

2.3 MODEL SELECTION

We first start by fitting the auto-ARIMA which claims that ARIMA(2,1,2)(1,1,0) is the best model. All the parameters(coefficients) of the model are statistically significant as their absolute value is greater than their corresponding Standard Error(s.e) except the AR(2) parameter with estimate = -0.0040 and the corresponding s.e of -0.0040. I claim this model is not the best as the AR(2) coefficient is statistically insignificant and the ACF showed a significant lag at 12, not the PACF, thus the bold parameter [ ARIMA(2,1,2)(1,1,0) ] should not be in the model.

Series: TrainD 
ARIMA(2,1,2)(1,1,0)[12] 

Coefficients:
         ar1      ar2      ma1     ma2     sar1
      0.7093  -0.0040  -1.2771  0.5489  -0.4971
s.e.  0.1451   0.1314   0.1275  0.1154   0.0620

sigma^2 = 2.484e+11:  log likelihood = -2950.71
AIC=5913.43   AICc=5913.86   BIC=5933.31

Below we fit a number of models proposed by the ACF & PACF, and EACF as well as over-fitting the parameters to find the best fit. The 1st model in the table is the model proposed by auto-ARIMA, followed by the ACF & PACF model, then followed by the suggested model by the EACF. The model ARIMA(0,1,3)(1,1,0) seems to be the best fit as it has the lowest AIC of 5852.460 . The Mean Absolute Percentage Error(MAPE) is computed by comparing the forecasted Sales with those from the Validation Dataset that was introduced in section 2.1 . This is a measure of how much error does our model incorporate-how much does the forecasted values deviate from the observed(validation).

Fitted Models
Models	AIC	MAPE
ARIMA(2,1,2)(1,1,0)	5913.4296	0.0433
ARIMA(4,1,1)(0,1,1)	5856.4040	0.0181
ARIMA(0,1,2))(0,1,1)	5862.9418	0.0297
ARIMA(0,1,3))(0,1,1)	5857.5688	0.0349
ARIMA(1,1,3)(0,1,1)	5855.0407	0.0270
ARIMA(2,1,3)(0,1,1)	5852.4600	0.0287
ARIMA(2,1,1)(0,1,1)	5866.8917	0.0287
ARIMA(3,1,1)(0,1,1)	5856.5395	0.0165

2.4 MODEL DIAGNOSIS

We observe that our fitted model don’t have Innovative Outliers(IO) but has an Additive outliers(AO) corresponding to index 192(November 2013). The AO is treated as an Intervention, thus it is removed. To further inspect our candidate model we check whether if the residuals of the model are Normally distributed by applying the Shapiro-Wilk(SW) test with Null : residuals are Normally distributed. The computed SW test shows a p-value = 0.0092 which is less than 0.05, thus we reject the Null(the residuals are non-Normal). The McLeod-Li test with the Null : no conditional heteroscedasticity, shows that all the lags are not significant under the 5% level. There is no evidence that supports an Autoregressive Conditional Heteroscedasticity(ARCH) process or autocorrelation. The final diagnosis include checking the residuals of the model using the ACF, which shows white noise, there is no significant autocorrelation. This shows that our model is very good.

              [,1]
ind     192.000000
lambda2   3.953247

[1] "No IO detected"

Warning in window.default(x, ...): 'end' value not changed

statistic	p.value	method	data.name
0.99	0.03	Shapiro-Wilk normality test	residuals(M)

Warning: Ignoring 36 observations

Warning: Ignoring 34 observations

Warning: Ignoring 4 observations

2.5 PARAMETER ESTIMATION

The fitted model has statistically significant parameters as their absolute is greater than their corresponding s.e . The MAPE of 0.0287 implies that the model forecasts predictions that are off by 2.87% , which suggests that our forecasts are great.


Call:
arima(x = TrainD, order = c(2, 1, 3), seasonal = list(order = c(0, 1, 1), period = 12))

Coefficients:
          ar1     ar2      ma1     ma2     ma3     sma1
      -0.1987  0.6474  -0.2710  -0.706  0.5648  -0.9998
s.e.   0.1257  0.1254   0.1162   0.099  0.0663   0.1526

sigma^2 estimated as 1.485e+11:  log likelihood = -2919.23,  aic = 5850.46

2.6 FORECASTING

We forecast for January 2016 to December 2019, where the Blue line is our forecast, while the Grey area is the 95% confidence interval.

The table below shows the difference of the forecast from the actual observations(validation dataset).

Warning in as.data.table.list(x, keep.rownames = keep.rownames, check.names =
check.names, : Item 3 has 47 rows but longest item has 48; recycled with
remainder.

Actual	Forecast	Difference
40338249	39714378.89	379201.52
39801841	39959047.48	-408559.01
40905278	40210400.01	329790.15
40553824	40575487.85	-112841.64
42242193	40666665.64	1291650.46
42543889	40950542.54	1408001.56
42801258	41135887.44	1494274.13
41974923	41306983.87	456381.53
43070325	41518541.47	1495823.45
42385137	41574501.55	601582.55
42967995	41783554.45	853942.96
42566863	42114052.04	334133.25
42405733	42232729.75	74579.51
42750689	42331153.49	304441.69
42241345	42446247.31	-491816.92
43624135	42733161.92	865630.64
42841374	42758504.36	-160225.25
43591047	43001599.25	436344.29
43233437	43154702.71	-71388.61
43741526	43304825.61	241076.04
43069542	43500449.96	-476174.68
43697004	43545716.68	-49844.44
45351529	43746848.44	1279605.93
43548179	44071923.07	-638468.51
44060624	44186647.51	-221706.24
44039147	44282330.24	-356299.10
44586293	44395446.10	-94685.18
45173438	44680978.18	468108.61
44551562	44705329.39	-396165.87
45504475	44947727.87	404140.92
45308701	45100334.08	58594.53
46714547	45250106.47	1269065.80
46005870	45445481.20	515298.39
46429995	45490571.61	738416.98
47355645	45691578.02	1339081.01
47644180	46016563.99	1512954.55
50305239	46131225.45	4078375.39
47047241	46226863.61	707293.17
47968299	46339947.83	1342841.50
49068915	46625457.50	2419122.19
48420864	46649792.81	1528683.97
48538912	46892180.03	1494133.75
49123987	47044778.25	1929442.01
49907465	47194544.99	2517549.29
49159373	47389915.71	1724369.74
49412233	47435003.26	1776225.34
50180108	47636007.66	2219115.80
49565946	47960992.20	379201.52

3 CONCLUSION

The fitted ARIMA(2,1,3)(1,1,0) model has proved itself to be the best candidate model to forecast the Sales of Food and Beverages in South Africa. The consistency in the increasing trend of sales suggest an increasing population which will correspond to increased food production. South African Policy makers and citizens should start exploring alternative ways to grow and produce food to avoid the issue of food insecurity in the near/far future. The model can be trusted to predict future values but should be updated with the availability of new data so to avoid affecting the performance of the model and to avoid bias with a model that will give large intervals of estimates that will incorporate a lot of unnessesary uncertainty.