2 ANALYSIS
2.1 PROCESS & METHOD
We analyse the monthly sales of food and beverages in South Africa, in South African Rand (ZAR), from January 1998 to December 2015. The period from January 2016 to December 2019 is held out as a validation dataset to evaluate how well the fitted model predicts. The data are made publicly available by Statistics South Africa (Stats SA) at https://www.statssa.gov.za .
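To make the setup concrete, a minimal sketch of the data import and train/validation split is given below; the file name and column name are assumptions, as the exact Stats SA extract is not specified (the object names Sales and TrainD follow the output later in this section):

```r
library(forecast)
library(tseries)
library(TSA)
library(Metrics)

# Read the Stats SA extract (file and column names are assumptions)
sales_raw <- read.csv("food_beverage_sales.csv")

# Full monthly series in ZAR, January 1998 - December 2019
AllSales <- ts(sales_raw$sales, start = c(1998, 1), frequency = 12)

# Training series (Jan 1998 - Dec 2015) and validation series (Jan 2016 - Dec 2019)
Sales  <- window(AllSales, end = c(2015, 12))
TrainD <- Sales                                  # the report uses both names
ValidD <- window(AllSales, start = c(2016, 1))

summary(Sales)
```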
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 8382196 | 12753241 | 17302507 | 19984423 | 26404763 | 40146894 |
Below is the time series plot of the sales of food and beverages from January 1998 to December 2015. There is a clear increasing trend in the sales, reflecting the growing demand for food, which in turn may point to the country's growing population.
Below is the decomposition plot of the sales, which shows a linearly increasing trend as well as seasonality. “Random” refers to what is left after removing the trend and seasonal components from the sales.
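The two plots can be reproduced along these lines (a sketch using base graphics):

```r
# Time series plot of the training sales
plot(Sales, ylab = "Sales (ZAR)", xlab = "Year",
     main = "Monthly Food & Beverage Sales, 1998-2015")

# Classical decomposition into trend, seasonal and random components
plot(decompose(Sales))
```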
The Box-Cox lambda is calculated to be 0.939, which is approximately 1. This suggests that the sales of food & beverages do not need to be transformed.
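The lambda estimate comes from the usual forecast-package call:

```r
# Estimate the Box-Cox transformation parameter (Guerrero method by default)
BoxCox.lambda(Sales)
```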
Next, we test for stationarity, a key requirement for ARIMA modelling. To do this we inspect both the Autocorrelation Function (ACF) and the Partial Autocorrelation Function (PACF), and then compute the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. The ACF trails off very slowly while the PACF cuts off after lag 1 (only lag 1 is significant), which indicates that the sales are non-stationary.
To confirm these findings, we compute the KPSS test, which has the Null Hypothesis (Null) that the data are trend stationary and the Alternative Hypothesis that the data are not trend stationary; accepting one hypothesis means rejecting the other. The Null is rejected if the p-value is less than 0.05 (5%). The computed KPSS test shows a test statistic of 0.9032 and a p-value of 0.01 (the smallest value the test prints), which is less than 0.05; thus we reject the Null that the sales are trend stationary and conclude that they are trend non-stationary. A non-stationary process ought to be made stationary. To do that we use a technique called differencing: we first difference and seasonally difference the sales to de-trend them and remove the seasonal component.
Original Data

| statistic | parameter | p.value | method | data.name |
|---|---|---|---|---|
| 0.9032 | 4 | 0.0100 | KPSS Test for Trend Stationarity | Sales |
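A sketch of the calls behind this step; Ds denotes the first and seasonally differenced series used in the tests that follow:

```r
# Correlograms of the raw series
acf(Sales)
pacf(Sales)

# KPSS test with Null: trend stationary
kpss.test(Sales, null = "Trend")

# First difference, then seasonal difference at lag 12
Ds <- diff(diff(Sales), lag = 12)
```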
Furthermore, we test whether the seasonally and first differenced sales process is stationary. We use the well-known Augmented Dickey-Fuller (ADF) and Phillips-Perron (PP) unit root tests, both with the Null that the process is non-stationary. Both show a p-value of 0.01 < 0.05 (the smallest value printed), thus we reject the Null: the differenced process is stationary. The KPSS test also confirms that the differenced process is now trend stationary, with a p-value of 0.1 (the largest value printed).
| statistic | parameter | alternative | p.value | method | data.name |
|---|---|---|---|---|---|
| -10.9176 | 5 | stationary | 0.0100 | Augmented Dickey-Fuller Test | Ds |
| statistic | parameter | alternative | p.value | method | data.name |
|---|---|---|---|---|---|
| -310.0175 | 4 | stationary | 0.0100 | Phillips-Perron Unit Root Test | Ds |
| statistic | parameter | p.value | method | data.name |
|---|---|---|---|---|
| 0.0233 | 4 | 0.1000 | KPSS Test for Trend Stationarity | Ds |
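The three tests on the differenced series are one-liners from the tseries package:

```r
# Unit root tests with Null: non-stationary
adf.test(Ds)
pp.test(Ds)

# KPSS test with Null: trend stationary
kpss.test(Ds, null = "Trend")

# The differenced series should now fluctuate around a constant level
plot(Ds)
```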
In addition, we plot the first and seasonally differenced sales and see that there is no remaining trend; the process is stationary.
2.2 MODEL SPECIFICATION
The ACF of the differenced series is significant at lag 1, with lag 12 slightly significant. The significant lags at 13, 33 & 34 can be attributed to random error and are therefore ignored. The PACF cuts off after lag 4. Together, the ACF & PACF propose an ARIMA(4,1,1)(0,1,1) model for the sales.
In contrast, the Extended Autocorrelation Function (EACF) proposes an ARIMA(0,1,2)(0,0,1). The EACF is designed to identify the order of an Autoregressive Integrated Moving Average (ARIMA) model, so for this purpose it can be trusted more than the ACF & PACF alone. Entries marked o indicate plausible orders, while entries marked x are rejected. Let us investigate the suggested models further and select the best candidate.
AR/MA
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 x x o o o o o o o o o x x o
1 x x x o o o o o o o o o x o
2 x o x o o o o o o o o x x o
3 x o x o o o o o o o o x x o
4 x x x x o o o o o o o o x o
5 x x o x o o o o o o o o x o
6 x x o x o o o o o o o o x o
7 x x o x o o o o o o o o o o
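The table above is the output of the TSA package's EACF applied to the differenced series:

```r
# Extended ACF; the top-left vertex of 'o' entries suggests the ARMA order
eacf(Ds)
```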
2.3 MODEL SELECTION
We first fit auto-ARIMA, which claims that ARIMA(2,1,2)(1,1,0) is the best model. All the parameters (coefficients) of the model are statistically significant, in that each absolute estimate exceeds its corresponding standard error (s.e.), except the AR(2) parameter, whose estimate of -0.0040 is far smaller in absolute value than its s.e. of 0.1314. We argue this model is not the best: the AR(2) coefficient is statistically insignificant, and since it was the ACF, not the PACF, that showed a significant lag at 12, the seasonal AR term in ARIMA(2,1,2)(1,1,0) should not be in the model; a seasonal MA term is suggested instead.
Series: TrainD
ARIMA(2,1,2)(1,1,0)[12]
Coefficients:
ar1 ar2 ma1 ma2 sar1
0.7093 -0.0040 -1.2771 0.5489 -0.4971
s.e. 0.1451 0.1314 0.1275 0.1154 0.0620
sigma^2 = 2.484e+11: log likelihood = -2950.71
AIC=5913.43 AICc=5913.86 BIC=5933.31
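The output above comes from the forecast package's automatic selection:

```r
# Automatic ARIMA selection on the training series
auto.arima(TrainD)
```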
Below we fit a number of models proposed by the ACF & PACF and the EACF, as well as over-fitted variants, to find the best fit; a sketch of the fitting loop follows the table. The first model in the table is the one proposed by auto-ARIMA, followed by the ACF & PACF model, then the model suggested by the EACF. The model ARIMA(2,1,3)(0,1,1) appears to be the best fit, as it has the lowest AIC of 5852.4600. The Mean Absolute Percentage Error (MAPE) is computed by comparing the forecasted sales against the validation dataset introduced in section 2.1; it measures how much the forecasted values deviate from the observed (validation) values.
Fitted Models

| Models | AIC | MAPE |
|---|---|---|
| ARIMA(2,1,2)(1,1,0) | 5913.4296 | 0.0433 |
| ARIMA(4,1,1)(0,1,1) | 5856.4040 | 0.0181 |
| ARIMA(0,1,2)(0,1,1) | 5862.9418 | 0.0297 |
| ARIMA(0,1,3)(0,1,1) | 5857.5688 | 0.0349 |
| ARIMA(1,1,3)(0,1,1) | 5855.0407 | 0.0270 |
| ARIMA(2,1,3)(0,1,1) | 5852.4600 | 0.0287 |
| ARIMA(2,1,1)(0,1,1) | 5866.8917 | 0.0287 |
| ARIMA(3,1,1)(0,1,1) | 5856.5395 | 0.0165 |
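A sketch of how such a comparison table can be produced; the candidate orders are taken from the table above, and Metrics::mape compares each 48-month forecast with the validation set (exact AIC values may differ slightly with the fitting method):

```r
# Candidate (p,d,q)(P,D,Q)[12] orders from the ACF/PACF, EACF and over-fitting
candidates <- list(
  c(2, 1, 2, 1, 1, 0), c(4, 1, 1, 0, 1, 1), c(0, 1, 2, 0, 1, 1),
  c(0, 1, 3, 0, 1, 1), c(1, 1, 3, 0, 1, 1), c(2, 1, 3, 0, 1, 1),
  c(2, 1, 1, 0, 1, 1), c(3, 1, 1, 0, 1, 1)
)

fit_stats <- t(sapply(candidates, function(o) {
  fit <- arima(TrainD, order = o[1:3],
               seasonal = list(order = o[4:6], period = 12))
  fc  <- predict(fit, n.ahead = 48)$pred       # forecast Jan 2016 - Dec 2019
  c(AIC = AIC(fit), MAPE = mape(ValidD, fc))   # Metrics::mape(actual, predicted)
}))
rownames(fit_stats) <- sapply(candidates, function(o)
  sprintf("ARIMA(%d,%d,%d)(%d,%d,%d)", o[1], o[2], o[3], o[4], o[5], o[6]))
fit_stats
```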
2.4 MODEL DIAGNOSIS
We observe that the fitted model has no Innovative Outliers (IO) but has one Additive Outlier (AO) at index 192 (November 2013). The AO is treated as an intervention and is therefore removed. To further inspect the candidate model, we check whether the residuals are Normally distributed by applying the Shapiro-Wilk (SW) test, with Null: the residuals are Normally distributed. The computed SW test shows a p-value of 0.03, which is less than 0.05, thus we reject the Null (the residuals are non-Normal). The McLeod-Li test, with Null: no conditional heteroscedasticity, shows that none of the lags are significant at the 5% level; there is no evidence of an Autoregressive Conditional Heteroscedasticity (ARCH) process in the residuals. The final diagnostic checks the residuals with the ACF, which shows white noise with no significant autocorrelation. Overall, the model passes its diagnostics; the calls are sketched at the end of this section.
| | [,1] |
|---|---|
| ind | 192.000000 |
| lambda2 | 3.953247 |
| statistic | p.value | method | data.name |
|---|---|---|---|
| 0.99 | 0.03 | Shapiro-Wilk normality test | residuals(M) |
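A sketch of the diagnostic calls, assuming M holds the candidate ARIMA(2,1,3)(0,1,1) fit (the intervention handling for the AO is not shown):

```r
# Candidate model
M <- arima(TrainD, order = c(2, 1, 3),
           seasonal = list(order = c(0, 1, 1), period = 12))

detectIO(M)                 # innovative outliers: none found
detectAO(M)                 # additive outlier at index 192 (November 2013)

shapiro.test(residuals(M))  # Null: residuals are Normally distributed
McLeod.Li.test(M)           # Null: no conditional heteroscedasticity
acf(residuals(M))           # residuals should resemble white noise
```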
2.5 PARAMETER ESTIMATION
The fitted model has statistically significant parameters, as each coefficient's absolute value is greater than its corresponding s.e. (this check is sketched after the output below). The MAPE of 0.0287 implies that the forecasts are off by 2.87% on average, which suggests that they are very accurate.
Call:
arima(x = TrainD, order = c(2, 1, 3), seasonal = list(order = c(0, 1, 1), period = 12))
Coefficients:
ar1 ar2 ma1 ma2 ma3 sma1
-0.1987 0.6474 -0.2710 -0.706 0.5648 -0.9998
s.e. 0.1257 0.1254 0.1162 0.099 0.0663 0.1526
sigma^2 estimated as 1.485e+11: log likelihood = -2919.23, aic = 5850.46
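The |coefficient| > s.e. comparison can be read off the fit directly; a small sketch:

```r
# Compare each estimate with its standard error;
# abs.t > 1 means the estimate exceeds its s.e.
est <- coef(M)
se  <- sqrt(diag(M$var.coef))
round(cbind(estimate = est, s.e. = se, abs.t = abs(est) / se), 4)
```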
2.6 FORECASTING
We forecast from January 2016 to December 2019; in the plot, the blue line is our forecast and the grey area is the 95% prediction interval. A sketch of the forecasting call follows.
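```r
# 48-month-ahead forecast with a 95% interval
fc <- forecast(M, h = 48, level = 95)
plot(fc)  # blue line: point forecast; grey band: 95% prediction interval

# Compare with the validation set
comparison <- data.frame(Actual     = as.numeric(ValidD),
                         Forecast   = as.numeric(fc$mean),
                         Difference = as.numeric(ValidD - fc$mean))
head(comparison)
```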
The table below shows the difference between the forecasts and the actual observations (validation dataset).
| Actual | Forecast | Difference |
|---|---|---|
| 40338249 | 39714378.89 | 379201.52 |
| 39801841 | 39959047.48 | -408559.01 |
| 40905278 | 40210400.01 | 329790.15 |
| 40553824 | 40575487.85 | -112841.64 |
| 42242193 | 40666665.64 | 1291650.46 |
| 42543889 | 40950542.54 | 1408001.56 |
| 42801258 | 41135887.44 | 1494274.13 |
| 41974923 | 41306983.87 | 456381.53 |
| 43070325 | 41518541.47 | 1495823.45 |
| 42385137 | 41574501.55 | 601582.55 |
| 42967995 | 41783554.45 | 853942.96 |
| 42566863 | 42114052.04 | 334133.25 |
| 42405733 | 42232729.75 | 74579.51 |
| 42750689 | 42331153.49 | 304441.69 |
| 42241345 | 42446247.31 | -491816.92 |
| 43624135 | 42733161.92 | 865630.64 |
| 42841374 | 42758504.36 | -160225.25 |
| 43591047 | 43001599.25 | 436344.29 |
| 43233437 | 43154702.71 | -71388.61 |
| 43741526 | 43304825.61 | 241076.04 |
| 43069542 | 43500449.96 | -476174.68 |
| 43697004 | 43545716.68 | -49844.44 |
| 45351529 | 43746848.44 | 1279605.93 |
| 43548179 | 44071923.07 | -638468.51 |
| 44060624 | 44186647.51 | -221706.24 |
| 44039147 | 44282330.24 | -356299.10 |
| 44586293 | 44395446.10 | -94685.18 |
| 45173438 | 44680978.18 | 468108.61 |
| 44551562 | 44705329.39 | -396165.87 |
| 45504475 | 44947727.87 | 404140.92 |
| 45308701 | 45100334.08 | 58594.53 |
| 46714547 | 45250106.47 | 1269065.80 |
| 46005870 | 45445481.20 | 515298.39 |
| 46429995 | 45490571.61 | 738416.98 |
| 47355645 | 45691578.02 | 1339081.01 |
| 47644180 | 46016563.99 | 1512954.55 |
| 50305239 | 46131225.45 | 4078375.39 |
| 47047241 | 46226863.61 | 707293.17 |
| 47968299 | 46339947.83 | 1342841.50 |
| 49068915 | 46625457.50 | 2419122.19 |
| 48420864 | 46649792.81 | 1528683.97 |
| 48538912 | 46892180.03 | 1494133.75 |
| 49123987 | 47044778.25 | 1929442.01 |
| 49907465 | 47194544.99 | 2517549.29 |
| 49159373 | 47389915.71 | 1724369.74 |
| 49412233 | 47435003.26 | 1776225.34 |
| 50180108 | 47636007.66 | 2219115.80 |
| 49565946 | 47960992.20 | — |