This report gathers data that represent the statistics of the certain food products pricing in Indonesia and its consumption and livelihoods. It also covers the availability of strategic food commodities in the markets and respective price trends at the national levels. The food-price dataset includes various categories and types of food items. The data was collected in monthly basis and extends from 2007 up to year 2020. Food price indicators were calculated depending on its category – either in monthly period based on food commodities categories and observed year.
This report will try to evaluate the data using the time series methods, models, and calculations, and try to leverage as much as possible all of information from it by using time series modeling comparison. We will also try to see which is the time series modeling is considered better in understanding and predicting/ forcasting the staples food pricing based on the available data, and also some of the errors and assumptions conditions which may occured during interpreting the staples food pricing.
food <- read.csv("wfpfood2021.csv")
head(food)glimpse(food)#> Rows: 72,437
#> Columns: 17
#> $ date <chr> "1/15/2007", "2/15/2007", "3/15/2007", "4/15/2007", "5/15/200~
#> $ cmname <chr> "Rice - Retail", "Rice - Retail", "Rice - Retail", "Rice - Re~
#> $ unit <chr> "KG", "KG", "KG", "KG", "KG", "KG", "KG", "KG", "KG", "KG", "~
#> $ category <chr> "cereals and tubers", "cereals and tubers", "cereals and tube~
#> $ price <dbl> 5941.975, 6445.000, 6414.000, 6083.000, 5955.000, 5949.000, 5~
#> $ currency <chr> "IDR", "IDR", "IDR", "IDR", "IDR", "IDR", "IDR", "IDR", "IDR"~
#> $ country <chr> "Indonesia", "Indonesia", "Indonesia", "Indonesia", "Indonesi~
#> $ admname <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
#> $ adm1id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ mktname <chr> "National Average", "National Average", "National Average", "~
#> $ mktid <int> 974, 974, 974, 974, 974, 974, 974, 974, 974, 974, 974, 974, 9~
#> $ cmid <int> 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 5~
#> $ ptid <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 1~
#> $ umid <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5~
#> $ catid <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ sn <chr> "974_52_15_5", "974_52_15_5", "974_52_15_5", "974_52_15_5", "~
#> $ default <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
As there are too many types of food products on the dataset (cmname), for this reporting purpose, we will only focus on two mostly known food commodities to evaluate its pricing using the time series methods, models, and calculations, and based on the modeling results then try to leverage as much as possible all of information from it by using time series modeling comparison.
Two mostly known food commodities will be used on the report are as follows: Rice - Retail, and Vegetable Oil - Retail.
For Time Series modeling and calculation, we will only need the Pricing and Date data; therefore new object (food_clean) will be made to make the size of the initial dataset to be smaller and efficient.
food_rice <- food %>%
filter(cmname == "Rice - Retail") %>%
mutate(date = mdy(date)) %>%
arrange(date) %>%
select( - unit, - category, - currency, - ptid, - umid, - catid, - sn, - default,
- country, - admname, - adm1id, - mktname, - mktid, - cmid)
food_oil <- food %>%
filter(cmname == "Oil (vegetable) - Retail") %>%
mutate(date = mdy(date)) %>%
arrange(date) %>%
select( - unit, - category, - currency, - ptid, - umid, - catid, - sn, - default,
- country, - admname, - adm1id, - mktname, - mktid, - cmid)
head(food_rice,12)head(food_oil,12)tail(food_rice,12)Checking for NA value of Data:
anyNA(food_rice)#> [1] FALSE
anyNA(food_oil)#> [1] FALSE
colSums(is.na(food_rice))#> date cmname price
#> 0 0 0
colSums(is.na(food_oil))#> date cmname price
#> 0 0 0
Note: There is no indication of NA/ missing values on the columns on dataset.
For time series analysis, we need to convert our data into ts objects using the ts() function. We will try to create a time series objects from monthly data Rice - Retail and Vegetable Oil - Retail with annual seasonality (per 1 year).
This report will use two kind of time series
rice_ts <- ts(data = food_rice$price,
start = 2007,
frequency = 12)
oil_ts <- ts(data = food_oil$price,
start = 2007,
frequency = 12)We then try to visualize both of these time series objects (rice_ts and oil_ts) into the graph below:
rice_ts %>%
autoplot() + theme_minimal()oil_ts %>%
autoplot() + theme_minimal()Insight: From the graphs above, it can be concluded that both of Rice and Vegetable Oil commodities seems to have both Additive types of time series data pattern as its trend and seasonal patterns tend to be constant. This can further be ensured by conducting the exploratory data analysis.
As stated priorly from the business assumption requirement, this report will only gather and process these columns of dataset; These requirement is based on the assumption that both of commodities (rice and vegetable oil) are belong to the most consumed commodities by Indonesian people: - Rice - Retail - Oil (vegetable) - Retail
Exploratory data analysis can be done by doing a decomposition of the time series objects that been created priorly in order to get more insights of the time series data.
Decomposing time series objects for EDA purpose:
rice_dc <- decompose(rice_ts, type = "additive")
oil_dc <- decompose(oil_ts, type = "additive")
autoplot(rice_dc)autoplot(oil_dc)Insight:
As it can be seen from the graphs above, although both graphs can clearly shown the trend of both commodities, but as for its seasonal conditions, we still cannot conclude its seasonal conditions (perhaps due to its tight seasonal patterns); therefore we will try to use a more robust time series decomposition method, named Seasonal and Trend Decomposition using Loess (STL) in order to get a better understanding of its seasonal conditions.
Decomposition of time series objects using Seasonal and Trend Decomposition using Loess Method:
forecast::mstl(rice_ts) %>%
autoplot()forecast::mstl(oil_ts) %>%
autoplot() Note: - Although not clearly indicated, but it seemed that both commodities are having a slight seasonal patterns; This can be seen by the amplitude repetition indication that occurs in the time series data for both commodities.
As we suspect that there are both trend and seasonal on both of the commodities data, This report will use Triple Exponential Smoothing (Holt Winters) and ARIMA Model to forecast its trend and seasonal patterns.
In order to do the cross validation, the data will be splitted into two parts, the train data (consists of all data up to last 2 years of observation data), and test data (consist of only last 2 years of observation data).
The train data will be used for training of the model, where the test data will be used for testing the model performance. The model then will also be tested to predict the test data. The predicted results and actual data from the test data will then be compared to validate the model performance.
rice_train <- head(rice_ts, n= length(rice_ts) - 24)
rice_test <- tail(rice_ts, n=24)
oil_train <- head(oil_ts, n= length(oil_ts) - 24)
oil_test <- tail(oil_ts, n=24)Holt-Winters time series model uses exponential smoothing to encode lots of values from the past and use them to predict “typical” values for the present and future.
The Holt-Winters model can be described by the following six components:
x = time series objectalpha = smoothing errorbeta = smoothing trendgamma = seasonal smoothingseasonal: seasonal additive type (default) / multiplicativeEither alpha/ beta/ gamma values can be set manually/ allowed to be automatic, or disabled (F / False).
In this report, all alpha (for smoothing error), beta (for smoothing trend), and gamma (for smoothing error) values will be used on the model as we suspect that there are indications of the existence of these three components in the data, so that they will be left to set to automatic.
Creating Holt-Winters model for Rice and Vegetable Oil data:
model_rice <- HoltWinters(x = rice_train)
model_oil <- HoltWinters(x = oil_train)forecast_rice <- forecast(object = model_rice, h = length(rice_test))
forecast_oil <- forecast(object = model_oil, h = length(oil_test))plot_forecast(forecast_rice)plot_forecast(forecast_oil)test_forecast(actual = rice_ts, forecast.obj = forecast_rice, test = rice_test)test_forecast(actual = oil_ts, forecast.obj = forecast_oil, test = oil_test)To evaluate the model, this report will calculate the prediction error using the MAPE (Mean Absolute Percentage Error) method, where the MAPE value range is between 0-100%, where the model will be considered better if the error is close to 0%.
MAPE(y_pred = forecast_rice$mean, y_true = rice_test)*100#> [1] 14.18583
MAPE(y_pred = forecast_oil$mean, y_true = oil_test)*100#> [1] 9.943515
Insight: From the MAPE results above, it can be gathered that Rice commodity is having a MAPE value of about 14.19%, whereas the Vegetables Oil commodity is having the MAPE value of about 9.94%, thus bringing the Vegetables Oil commmodity data is having a slight better of data error compared to the Rice commodity data.
ARIMA Model stands for Auto Regressive Integrated Moving Average Model. This model represents the combination of two time series model methods, namely Autoregressive (AR) and Moving Average (MA).
The ARIMA model is explained by three terms:
p: order for AR model.d: how many times the data needs to be differencing until it becomes stationary.q: order for MA models.One of the requirements for data to be processed using the ARIMA model is that the data must be stationary. Data is considered to be stationary when its value fluctuates around its mean (ie. when data is without a trend).
One of the method to make the data to be stationary is by using ** data differencing **, which is adjusting the value of an observation with previous observations. The amount of differencing can be more than 1 time, depending on the complexity of the data.
Check Data Stationary Status using adf.test function:
H0: Data is not stationary.
H1: Data is Stationary.
Reject H0 when p-value < 0.05
adf.test(rice_ts)#>
#> Augmented Dickey-Fuller Test
#>
#> data: rice_ts
#> Dickey-Fuller = -10.413, Lag order = 14, p-value = 0.01
#> alternative hypothesis: stationary
adf.test(oil_ts)#>
#> Augmented Dickey-Fuller Test
#>
#> data: oil_ts
#> Dickey-Fuller = -9.8177, Lag order = 14, p-value = 0.01
#> alternative hypothesis: stationary
Insight: As the adf.test results shown that both of the time series objects p-value results are lower than 0.05 (p-val < 0.05), therefore it can be assumed that the data is already in stationary state.
In order to find the p order for an AR model, we will look at the PACF (Partial Autocorrelation Function) graph, where it shows autocorrelation between t (current) and t-k (lag) without being influenced by the correlation between the intermediate periods.
As to find the q order for the MA model, we will look at the ACF (Autocorrelation Function) plot, where it shows autocorrelation between t and t-k, with the influence of the intermediate period.
There are 2 types of patterns in the PACF and ACF plots:
tsdisplay(diff(rice_ts)) tsdisplay(diff(oil_ts)) In this report, we assumed that all p (order for AR model), d (how many times the data needs to be differencing until it becomes stationary), and q (order for MA models values will be used) will be set to automatic.
ARIMA Model for Rice Commodity:
arima_rice <- auto.arima(y = rice_ts)
summary(arima_rice)#> Series: rice_ts
#> ARIMA(1,1,2)(0,0,1)[12] with drift
#>
#> Coefficients:
#> ar1 ma1 ma2 sma1 drift
#> 0.7970 -1.1423 0.1515 -0.0491 0.9053
#> s.e. 0.0231 0.0334 0.0301 0.0187 0.8420
#>
#> sigma^2 estimated as 872978: log likelihood=-23899.35
#> AIC=47810.71 AICc=47810.74 BIC=47846.53
#>
#> Training set error measures:
#> ME RMSE MAE MPE MAPE MASE
#> Training set 13.05576 933.3643 668.8733 -0.4741684 5.920817 0.4608643
#> ACF1
#> Training set 0.008476651
ARIMA Model for Oil Commodity:
arima_oil <- auto.arima(y = oil_ts)
summary(arima_oil)#> Series: oil_ts
#> ARIMA(2,1,1)(1,0,0)[12]
#>
#> Coefficients:
#> ar1 ar2 ma1 sar1
#> 0.6071 -0.0379 -0.9462 -0.0205
#> s.e. 0.0272 0.0251 0.0214 0.0202
#>
#> sigma^2 estimated as 766383: log likelihood=-23711.02
#> AIC=47432.04 AICc=47432.06 BIC=47461.89
#>
#> Training set error measures:
#> ME RMSE MAE MPE MAPE MASE ACF1
#> Training set 13.22907 874.677 584.724 -0.2342855 4.252181 0.5679973 0.002254561
forecast_arima_rice <- forecast(object = arima_rice, h = length(rice_test))
forecast_arima_oil <- forecast(object = arima_oil, h = length(oil_test))plot_forecast(forecast_obj = forecast_arima_rice)plot_forecast(forecast_obj = forecast_arima_oil)Decomposition of ARIMA model using Seasonal and Trend Decomposition using Loess Method:
rice_stl_arima <- forecast::stlm(y = rice_train,
s.window = 12,
method = "arima")
oil_stl_arima <- forecast::stlm(y = oil_train,
s.window = 12,
method = "arima")
autoplot(rice_train) +
autolayer(rice_stl_arima$fitted)autoplot(oil_train) +
autolayer(oil_stl_arima$fitted)From the above model calculations, it can be concluded as follow: - Both of the Rice and Vegetable Oil commodities are having a tails-off type of data patterns as it shows slow changes of lag line throughout its periods.
The calculations of both Auto-Arima Models from the two models above (for Rice and Vegetable Oil commodities) will produce MAPE (Mean Absolute Percentage Error) of 5.92% (for Rice commodity), and MAPE of 4.25% (for vegetable oil commodity).
According to the Auto-Arima calculation, both of the Rice and Vegetable Oil commodities are having one time of differencing process in order to make the data to be more stationary.
From the Holt-Winters model and ARIMA model that we have calculated, forecasted, and visualized above, there are some conclusions that can be derived from both of Rice and Vegetable Oil commodities, These are:
Both of Rice and Oil commodities seems to have both Additive types of time series data pattern as its trend and seasonal patterns tend to be constant. This can further be ensured by conducting the exploratory data analysis either by using Holt-Winters model and ARIMA model.
Using a more robust time series decomposition method (Seasonal and Trend Decomposition using Loess - STL) we can get a slight better understanding of the seasonal conditions both for Rice and Vegetable Oil models.
When using Holt-Winters model using an automatic values for its alpha (for smoothing error), beta (for smoothing trend), and gamma (for smoothing error), we can gathered the MAPE (Mean Absolute Percentage Error) of 14.19% for Rice model and 9.94% for Vegetable Oil model. These MAPE values show that both of Rice and Vegetable Oil models have a relatively low errors (under 15% of errors in overall).
Meanwhile when using ARIMA model using an automatic values for its p (order for AR model), d (number of times the data needs to be differenced until it becomes stationary), and q (order for MA models values), we can gathered the MAPE (Mean Absolute Percentage Error) of 5.92% for Rice model and 4.25% for Vegetable Oil model. These MAPE values show that both of Rice and Vegetable Oil models have an even better low errors (under 6% of errors in overall).
Based on the MAPE values comparison, it can be concluded that the performance of ARIMA model surpasses the Holt-Winters model by about 9% in predicting the available test data based on the forecasting model of each method and for each commodities.