LBB-TS: Holt-Winters and ARIMA Forecast Comparison Study Based on Staples Food Pricing in Indonesia

Wayan K.

May 30, 2021

1 About the Data

This report gathers data that represent the statistics of the certain food products pricing in Indonesia and its consumption and livelihoods. It also covers the availability of strategic food commodities in the markets and respective price trends at the national levels. The food-price dataset includes various categories and types of food items. The data was collected in monthly basis and extends from 2007 up to year 2020. Food price indicators were calculated depending on its category – either in monthly period based on food commodities categories and observed year.

This report will try to evaluate the data using the time series methods, models, and calculations, and try to leverage as much as possible all of information from it by using time series modeling comparison. We will also try to see which is the time series modeling is considered better in understanding and predicting/ forcasting the staples food pricing based on the available data, and also some of the errors and assumptions conditions which may occured during interpreting the staples food pricing.

1.1 Extracting the Data Source:

food <- read.csv("wfpfood2021.csv")

head(food)

glimpse(food)

#> Rows: 72,437
#> Columns: 17
#> $ date     <chr> "1/15/2007", "2/15/2007", "3/15/2007", "4/15/2007", "5/15/200~
#> $ cmname   <chr> "Rice - Retail", "Rice - Retail", "Rice - Retail", "Rice - Re~
#> $ unit     <chr> "KG", "KG", "KG", "KG", "KG", "KG", "KG", "KG", "KG", "KG", "~
#> $ category <chr> "cereals and tubers", "cereals and tubers", "cereals and tube~
#> $ price    <dbl> 5941.975, 6445.000, 6414.000, 6083.000, 5955.000, 5949.000, 5~
#> $ currency <chr> "IDR", "IDR", "IDR", "IDR", "IDR", "IDR", "IDR", "IDR", "IDR"~
#> $ country  <chr> "Indonesia", "Indonesia", "Indonesia", "Indonesia", "Indonesi~
#> $ admname  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
#> $ adm1id   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ mktname  <chr> "National Average", "National Average", "National Average", "~
#> $ mktid    <int> 974, 974, 974, 974, 974, 974, 974, 974, 974, 974, 974, 974, 9~
#> $ cmid     <int> 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 5~
#> $ ptid     <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 1~
#> $ umid     <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5~
#> $ catid    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ sn       <chr> "974_52_15_5", "974_52_15_5", "974_52_15_5", "974_52_15_5", "~
#> $ default  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~

1.2 Data Dictionary:

date : Date of Sampling
cmname : Commodity Name
unit : Unit of Measurement
category : Commodity Category
price : Price of Commodity
currency : Currency
country : Country
admname : Admin Name
adm1id : Admin ID
mktname : Marketing Name
mktid : Marketing ID
cmid : Commodity ID
ptid : PT ID
umid : UM ID
catid : Category ID
sn : Serial No
default : Default

1.2.1 Bussiness Requirements:

As there are too many types of food products on the dataset (cmname), for this reporting purpose, we will only focus on two mostly known food commodities to evaluate its pricing using the time series methods, models, and calculations, and based on the modeling results then try to leverage as much as possible all of information from it by using time series modeling comparison.

Two mostly known food commodities will be used on the report are as follows: Rice - Retail, and Vegetable Oil - Retail.

1.3 Data Cleansing & Subsetting:

For Time Series modeling and calculation, we will only need the Pricing and Date data; therefore new object (food_clean) will be made to make the size of the initial dataset to be smaller and efficient.

food_rice <- food %>% 
          filter(cmname == "Rice - Retail") %>%
          mutate(date = mdy(date)) %>% 
          arrange(date) %>%
          select( - unit, - category, - currency, - ptid, - umid, - catid, - sn, - default,
                  - country, - admname, - adm1id, - mktname, - mktid, - cmid)

food_oil <- food %>% 
          filter(cmname == "Oil (vegetable) - Retail") %>%
          mutate(date = mdy(date)) %>% 
          arrange(date) %>%
          select( - unit, - category, - currency, - ptid, - umid, - catid, - sn, - default,
                  - country, - admname, - adm1id, - mktname, - mktid, - cmid)

head(food_rice,12)

head(food_oil,12)

tail(food_rice,12)

Checking for NA value of Data:

anyNA(food_rice)

#> [1] FALSE

anyNA(food_oil)

#> [1] FALSE

colSums(is.na(food_rice))

#>   date cmname  price 
#>      0      0      0

colSums(is.na(food_oil))

#>   date cmname  price 
#>      0      0      0

Note: There is no indication of NA/ missing values on the columns on dataset.

1.4 Time Series Modeling

For time series analysis, we need to convert our data into ts objects using the ts() function. We will try to create a time series objects from monthly data Rice - Retail and Vegetable Oil - Retail with annual seasonality (per 1 year).

This report will use two kind of time series

rice_ts <- ts(data = food_rice$price, 
             start = 2007, 
             frequency = 12)

oil_ts <- ts(data = food_oil$price, 
             start = 2007, 
             frequency = 12)

We then try to visualize both of these time series objects (rice_ts and oil_ts) into the graph below:

rice_ts %>% 
  autoplot() + theme_minimal()

oil_ts %>% 
  autoplot() + theme_minimal()

Insight: From the graphs above, it can be concluded that both of Rice and Vegetable Oil commodities seems to have both Additive types of time series data pattern as its trend and seasonal patterns tend to be constant. This can further be ensured by conducting the exploratory data analysis.

2 Exploratory Data Analysis

As stated priorly from the business assumption requirement, this report will only gather and process these columns of dataset; These requirement is based on the assumption that both of commodities (rice and vegetable oil) are belong to the most consumed commodities by Indonesian people: - Rice - Retail - Oil (vegetable) - Retail

Exploratory data analysis can be done by doing a decomposition of the time series objects that been created priorly in order to get more insights of the time series data.

Decomposing time series objects for EDA purpose:

rice_dc <- decompose(rice_ts, type = "additive")
oil_dc <- decompose(oil_ts, type = "additive")

autoplot(rice_dc)

autoplot(oil_dc)

Insight:

As it can be seen from the graphs above, although both graphs can clearly shown the trend of both commodities, but as for its seasonal conditions, we still cannot conclude its seasonal conditions (perhaps due to its tight seasonal patterns); therefore we will try to use a more robust time series decomposition method, named Seasonal and Trend Decomposition using Loess (STL) in order to get a better understanding of its seasonal conditions.

Decomposition of time series objects using Seasonal and Trend Decomposition using Loess Method:

forecast::mstl(rice_ts) %>% 
  autoplot()

forecast::mstl(oil_ts) %>% 
  autoplot()

Note: - Although not clearly indicated, but it seemed that both commodities are having a slight seasonal patterns; This can be seen by the amplitude repetition indication that occurs in the time series data for both commodities.

2.1 Time Series Forcasting Model

As we suspect that there are both trend and seasonal on both of the commodities data, This report will use Triple Exponential Smoothing (Holt Winters) and ARIMA Model to forecast its trend and seasonal patterns.

2.1.1 Cross Validation of Data

In order to do the cross validation, the data will be splitted into two parts, the train data (consists of all data up to last 2 years of observation data), and test data (consist of only last 2 years of observation data).

The train data will be used for training of the model, where the test data will be used for testing the model performance. The model then will also be tested to predict the test data. The predicted results and actual data from the test data will then be compared to validate the model performance.

rice_train <- head(rice_ts, n= length(rice_ts) - 24)
rice_test <- tail(rice_ts, n=24)

oil_train <- head(oil_ts, n= length(oil_ts) - 24)
oil_test <- tail(oil_ts, n=24)

2.2 Triple Exponential Smoothing (Holt-Winters) Model

Holt-Winters time series model uses exponential smoothing to encode lots of values from the past and use them to predict “typical” values for the present and future.

The Holt-Winters model can be described by the following six components:

x = time series object
alpha = smoothing error
beta = smoothing trend
gamma = seasonal smoothing
seasonal: seasonal additive type (default) / multiplicative

Either alpha/ beta/ gamma values can be set manually/ allowed to be automatic, or disabled (F / False).

In this report, all alpha (for smoothing error), beta (for smoothing trend), and gamma (for smoothing error) values will be used on the model as we suspect that there are indications of the existence of these three components in the data, so that they will be left to set to automatic.

Creating Holt-Winters model for Rice and Vegetable Oil data:

model_rice <- HoltWinters(x = rice_train)
model_oil <- HoltWinters(x = oil_train)

2.2.1 Forecasting Holt-Winters Model Using Test Data

forecast_rice <- forecast(object = model_rice, h = length(rice_test))
forecast_oil <- forecast(object = model_oil, h = length(oil_test))

2.2.2 Holt-Winters Model Visualization

plot_forecast(forecast_rice)

plot_forecast(forecast_oil)

test_forecast(actual = rice_ts, forecast.obj = forecast_rice, test = rice_test)

test_forecast(actual = oil_ts, forecast.obj = forecast_oil, test = oil_test)

2.2.3 Holt-Winters Model Evaluation

To evaluate the model, this report will calculate the prediction error using the MAPE (Mean Absolute Percentage Error) method, where the MAPE value range is between 0-100%, where the model will be considered better if the error is close to 0%.

MAPE(y_pred = forecast_rice$mean, y_true = rice_test)*100

#> [1] 14.18583

MAPE(y_pred = forecast_oil$mean, y_true = oil_test)*100

#> [1] 9.943515

Insight: From the MAPE results above, it can be gathered that Rice commodity is having a MAPE value of about 14.19%, whereas the Vegetables Oil commodity is having the MAPE value of about 9.94%, thus bringing the Vegetables Oil commmodity data is having a slight better of data error compared to the Rice commodity data.

2.3 ARIMA Model

ARIMA Model stands for Auto Regressive Integrated Moving Average Model. This model represents the combination of two time series model methods, namely Autoregressive (AR) and Moving Average (MA).

The ARIMA model is explained by three terms:

p: order for AR model.
d: how many times the data needs to be differencing until it becomes stationary.
q: order for MA models.

One of the requirements for data to be processed using the ARIMA model is that the data must be stationary. Data is considered to be stationary when its value fluctuates around its mean (ie. when data is without a trend).

One of the method to make the data to be stationary is by using ** data differencing **, which is adjusting the value of an observation with previous observations. The amount of differencing can be more than 1 time, depending on the complexity of the data.

Check Data Stationary Status using adf.test function:

H0: Data is not stationary.
H1: Data is Stationary.
Reject H0 when p-value < 0.05

adf.test(rice_ts)

#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  rice_ts
#> Dickey-Fuller = -10.413, Lag order = 14, p-value = 0.01
#> alternative hypothesis: stationary

adf.test(oil_ts)

#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  oil_ts
#> Dickey-Fuller = -9.8177, Lag order = 14, p-value = 0.01
#> alternative hypothesis: stationary

Insight: As the adf.test results shown that both of the time series objects p-value results are lower than 0.05 (p-val < 0.05), therefore it can be assumed that the data is already in stationary state.

2.3.1 Visualizing Data to Check for AR Order and MA Order:

In order to find the p order for an AR model, we will look at the PACF (Partial Autocorrelation Function) graph, where it shows autocorrelation between t (current) and t-k (lag) without being influenced by the correlation between the intermediate periods.

As to find the q order for the MA model, we will look at the ACF (Autocorrelation Function) plot, where it shows autocorrelation between t and t-k, with the influence of the intermediate period.

There are 2 types of patterns in the PACF and ACF plots:

tails off: where lag line underwent slow changes.
cuts off after lag: where the line changes rapidly (significantly).

tsdisplay(diff(rice_ts))

tsdisplay(diff(oil_ts))

2.3.2 Creating ARIMA Model using Auto-ARIMA function:

In this report, we assumed that all p (order for AR model), d (how many times the data needs to be differencing until it becomes stationary), and q (order for MA models values will be used) will be set to automatic.

ARIMA Model for Rice Commodity:

arima_rice <- auto.arima(y = rice_ts)
summary(arima_rice)

#> Series: rice_ts 
#> ARIMA(1,1,2)(0,0,1)[12] with drift 
#> 
#> Coefficients:
#>          ar1      ma1     ma2     sma1   drift
#>       0.7970  -1.1423  0.1515  -0.0491  0.9053
#> s.e.  0.0231   0.0334  0.0301   0.0187  0.8420
#> 
#> sigma^2 estimated as 872978:  log likelihood=-23899.35
#> AIC=47810.71   AICc=47810.74   BIC=47846.53
#> 
#> Training set error measures:
#>                    ME     RMSE      MAE        MPE     MAPE      MASE
#> Training set 13.05576 933.3643 668.8733 -0.4741684 5.920817 0.4608643
#>                     ACF1
#> Training set 0.008476651

ARIMA Model for Oil Commodity:

arima_oil <- auto.arima(y = oil_ts)
summary(arima_oil)

#> Series: oil_ts 
#> ARIMA(2,1,1)(1,0,0)[12] 
#> 
#> Coefficients:
#>          ar1      ar2      ma1     sar1
#>       0.6071  -0.0379  -0.9462  -0.0205
#> s.e.  0.0272   0.0251   0.0214   0.0202
#> 
#> sigma^2 estimated as 766383:  log likelihood=-23711.02
#> AIC=47432.04   AICc=47432.06   BIC=47461.89
#> 
#> Training set error measures:
#>                    ME    RMSE     MAE        MPE     MAPE      MASE        ACF1
#> Training set 13.22907 874.677 584.724 -0.2342855 4.252181 0.5679973 0.002254561

2.3.3 Forecasting ARIMA Model Using Test Data

forecast_arima_rice <- forecast(object = arima_rice, h = length(rice_test))
forecast_arima_oil <- forecast(object = arima_oil, h = length(oil_test))

2.3.4 ARIMA Model Visualization

plot_forecast(forecast_obj = forecast_arima_rice)

plot_forecast(forecast_obj = forecast_arima_oil)

Decomposition of ARIMA model using Seasonal and Trend Decomposition using Loess Method:

rice_stl_arima <- forecast::stlm(y = rice_train, 
                              s.window = 12, 
                              method = "arima") 

oil_stl_arima <- forecast::stlm(y = oil_train, 
                              s.window = 12, 
                              method = "arima") 

autoplot(rice_train) +
  autolayer(rice_stl_arima$fitted)

autoplot(oil_train) +
  autolayer(oil_stl_arima$fitted)

2.3.5 ARIMA Model Evaluation

From the above model calculations, it can be concluded as follow: - Both of the Rice and Vegetable Oil commodities are having a tails-off type of data patterns as it shows slow changes of lag line throughout its periods.

The calculations of both Auto-Arima Models from the two models above (for Rice and Vegetable Oil commodities) will produce MAPE (Mean Absolute Percentage Error) of 5.92% (for Rice commodity), and MAPE of 4.25% (for vegetable oil commodity).
According to the Auto-Arima calculation, both of the Rice and Vegetable Oil commodities are having one time of differencing process in order to make the data to be more stationary.

3 Conclusion

From the Holt-Winters model and ARIMA model that we have calculated, forecasted, and visualized above, there are some conclusions that can be derived from both of Rice and Vegetable Oil commodities, These are:

Both of Rice and Oil commodities seems to have both Additive types of time series data pattern as its trend and seasonal patterns tend to be constant. This can further be ensured by conducting the exploratory data analysis either by using Holt-Winters model and ARIMA model.
Using a more robust time series decomposition method (Seasonal and Trend Decomposition using Loess - STL) we can get a slight better understanding of the seasonal conditions both for Rice and Vegetable Oil models.
When using Holt-Winters model using an automatic values for its alpha (for smoothing error), beta (for smoothing trend), and gamma (for smoothing error), we can gathered the MAPE (Mean Absolute Percentage Error) of 14.19% for Rice model and 9.94% for Vegetable Oil model. These MAPE values show that both of Rice and Vegetable Oil models have a relatively low errors (under 15% of errors in overall).
Meanwhile when using ARIMA model using an automatic values for its p (order for AR model), d (number of times the data needs to be differenced until it becomes stationary), and q (order for MA models values), we can gathered the MAPE (Mean Absolute Percentage Error) of 5.92% for Rice model and 4.25% for Vegetable Oil model. These MAPE values show that both of Rice and Vegetable Oil models have an even better low errors (under 6% of errors in overall).
Based on the MAPE values comparison, it can be concluded that the performance of ARIMA model surpasses the Holt-Winters model by about 9% in predicting the available test data based on the forecasting model of each method and for each commodities.