Vegetable and Fruit Prices || Forecasting

Background

Choosing which agricultural products to grow is a challenging task for farmers, as they lack the tools needed to determine which products will fetch the best price when they are ready for market.

Machine learning algorithms have been applied in many different sectors to solve prediction problems. Given their importance and widespread use, machine learning will be applied here to predict prices for agricultural products.

Dataset

This dataset contains official price information for major vegetables and fruits in Nepal from 2013 to 2021. It includes daily records for each vegetable and fruit, with the minimum, maximum, and average price on each date. The prices are based on official figures.

Business Objective

Mr. Saitama is a new Nepalese farmer with a degree in agriculture, so his expertise in the field is solid. He wants to identify five profitable crops to cultivate. To maximize profits and avoid losses, Mr. Saitama seeks price projections and is requesting assistance from a data scientist to formulate his strategy.

Exploratory & Explanatory Data Analysis

# Data Wrangling
library(tidyverse)
library(lubridate)
library(glue)

# Visualization
library(ggthemes)
library(scales)
library(plotly)

# Time Series
library(forecast)
library(tseries)
library(padr)
library(zoo)
library(tibbletime)
library(sweep)
library(anomalize)
library(timetk)

# Machine Learning
library(rsample)
library(recipes)
library(purrr)
library(yardstick)
library(magrittr)

commodity_price <- read.csv("kalimati_tarkari_dataset.csv")

rmarkdown::paged_table(commodity_price)

Column descriptions:

  • Commodity: the kind of vegetable or fruit
  • Date: the date of the recorded price; the data represents daily prices
  • Unit: the unit of mass
  • Minimum: the minimum price of the commodity on that date
  • Maximum: the maximum price of the commodity on that date
  • Average: the average price of the commodity on that date
glimpse(commodity_price)
#> Rows: 197,161
#> Columns: 7
#> $ SN        <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17~
#> $ Commodity <chr> "Tomato Big(Nepali)", "Tomato Small(Local)", "Potato Red", "~
#> $ Date      <chr> "2013-06-16", "2013-06-16", "2013-06-16", "2013-06-16", "201~
#> $ Unit      <chr> "Kg", "Kg", "Kg", "Kg", "Kg", "Kg", "Kg", "Kg", "Kg", "Kg", ~
#> $ Minimum   <dbl> 35, 26, 20, 15, 28, 30, 6, 30, 35, 25, 16, 20, 20, 55, 25, 6~
#> $ Maximum   <dbl> 40, 32, 21, 16, 30, 35, 10, 35, 40, 30, 18, 22, 25, 60, 30, ~
#> $ Average   <dbl> 37.5, 29.0, 20.5, 15.5, 29.0, 32.5, 8.0, 32.5, 37.5, 27.5, 1~
# Checking kinds of commodities
length(unique(commodity_price$Commodity))
#> [1] 132
# Checking the range of dates in the dataset
range(commodity_price$Date)
#> [1] "2013-06-16" "2021-05-13"

This dataset contains 197,161 rows and 7 columns, covering 132 commodities. The recorded dates range from 2013-06-16 to 2021-05-13. The Date column has the wrong data type, so I’ll convert it from character to date. The Commodity column will also be converted to factor.

commodity_price <- commodity_price %>% 
  
  # select the columns needed for further analysis
  select(Date, Commodity, Average) %>% 
  
  # convert the data types
  mutate(Date = ymd(Date),
         Commodity = as.factor(Commodity))

rmarkdown::paged_table(commodity_price)

Based on the business objective, I’ll extract the top five commodities in Nepal, i.e. those with the most recorded observations.

top5_Commodity <- commodity_price %>%
  mutate(Commodity = droplevels(Commodity)) %>% 
  group_by(Commodity) %>% 
  summarise(Frekuensi = length(Commodity)) %>%  # Frekuensi = number of records
  arrange(-Frekuensi) %>%
  head(5)

droplevels(top5_Commodity$Commodity)
#> [1] Ginger               Cauli Local          Cabbage(Local)      
#> [4] Chilli Dry           Raddish White(Local)
#> 5 Levels: Cabbage(Local) Cauli Local Chilli Dry ... Raddish White(Local)

The top five commodities in Nepal are Ginger, Cauli Local, Cabbage (Local), Chilli Dry, and Raddish White (Local). Next, we’ll extract the price time series for these five commodities.

top5_Comm_prices <- commodity_price %>% 
  filter(Commodity %in% droplevels(top5_Commodity$Commodity)) %>% 
  select(Date, Commodity, Average)


rmarkdown::paged_table(top5_Comm_prices)

Next, for time series analysis, we must ensure that there are no gaps in the date sequence.

# Data preparation for checking the dates and for time series analysis
top5_Comm_wider <- pivot_wider(top5_Comm_prices,
            names_from = Commodity,
            values_from = Average)

rmarkdown::paged_table(top5_Comm_wider)
# Checking the date
colSums(is.na(pad(top5_Comm_wider)))
#>                 Date       Cabbage(Local)          Cauli Local 
#>                    0                  140                  139 
#> Raddish White(Local)               Ginger           Chilli Dry 
#>                  142                  138                  141

After padding the dates, many missing values appear, which confirms that the dataset has gaps and is not in strict daily sequence. We will fill each missing value by interpolating between the observations before and after it.

top5_Comm_clean <- top5_Comm_wider %>% 
  pad() %>% 
  mutate(`Cabbage(Local)` = na.fill(object = `Cabbage(Local)`, fill = "extend"),
         `Cauli Local` = na.fill(object = `Cauli Local`, fill = "extend"),
        `Raddish White(Local)` = na.fill(object = `Raddish White(Local)`, fill = "extend"),
         Ginger = na.fill(object = Ginger, fill = "extend"),
         `Chilli Dry` = na.fill(object = `Chilli Dry`, fill = "extend"))

colSums(is.na(pad(top5_Comm_clean)))
#>                 Date       Cabbage(Local)          Cauli Local 
#>                    0                    0                    0 
#> Raddish White(Local)               Ginger           Chilli Dry 
#>                    0                    0                    0

Literature Review

To produce a sound analysis, we need domain knowledge about the five commodities above. We’ll look up the planting period, the harvest period, and the length of the growing season; this way, we can determine the seasonal periods and develop a sound strategy.

From ChatGPT, we obtained the following information about the five commodities in Nepal.
1. Planting Period

  • Cabbage : September-December
  • Cauli : September-January
  • Raddish : February-November
  • Ginger : January-December
  • Chilli : April-July

2. Harvest Period

  • Cabbage : November-March
  • Cauli : November-March
  • Raddish : March-November
  • Ginger : November-July
  • Chilli : June-October

3. Growing Season

  • Cabbage : 2.5-4.5 months
  • Cauli : 2.5-4.5 months
  • Raddish : 3-6 weeks
  • Ginger : 8-10 months
  • Chilli : 2-4 months

Based on the information above, we can expect annual seasonality in the prices. For time series forecasting, we’ll examine both monthly and yearly seasonality, because the Raddish commodity, with its short growing season, may also exhibit monthly seasonality.

Seasonal Analysis

First, we’ll analyze the annual seasonality of the top five commodities.

# Plotting annual seasonality
top5_Comm_clean %>% 
  mutate(
    `Cabbage` = decompose(ts(top5_Comm_clean[,2],
                            frequency = 365))$seasonal,
    `Cauli` = decompose(ts(top5_Comm_clean[,3],
                            frequency = 365))$seasonal,
    `Raddish` = decompose(ts(top5_Comm_clean[,4],
                            frequency = 365))$seasonal,
    `Ginger` = decompose(ts(top5_Comm_clean[,5],
                            frequency = 365))$seasonal,
    `Chilli Dry` = decompose(ts(top5_Comm_clean[,6],
                            frequency = 365))$seasonal,
    Month = month(Date, label = T, abb = T)
  ) %>%   
  distinct(Month, `Cabbage`, `Cauli`, `Raddish`, `Ginger`, `Chilli Dry`) %>%
  pivot_longer(
    cols = c("Cabbage", "Cauli", "Raddish", "Ginger", "Chilli Dry"),
    names_to = "Commodity",
    values_to = "Seasonal") %>% 
  ggplot(mapping = aes(x = Month, y = Seasonal, fill = Commodity)) +
  geom_col() +
  facet_grid(rows = vars(Commodity),
             scales = "free_y") +
  scale_fill_ordinal()+
  geom_hline(yintercept = 0) +
  labs(title = "Annual Seasonality in Top 5 Commodities",
       subtitle = "price of the top five commodities",
       y = "Observed Seasonality",
       x = "Month") +
  theme_light() + 
  theme(legend.position = "none", 
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

From the plot above, the following conclusions can be drawn:

  • Cabbage, Cauli, and Raddish: prices tend to decrease in the first half of the year and increase in the second half.
  • Chilli Dry: prices decline during the middle of the year, between April and September, in contrast to Ginger.
  • Highest prices: Cabbage in October to November, Cauli in September, Chilli in January, Ginger in August, Raddish in September.
  • Lowest prices: Cabbage in April to May, Cauli in March, Chilli in July, Ginger in December to January, Raddish in February.

Anomaly Detection

We will conduct anomaly detection on the prices of the top five commodities. By doing so, we aim to develop a robust strategy for the future.

anom_cabbage_yearly <- top5_Comm_clean %>% 
  select(Date, `Cabbage(Local)`) %>% 
  as_tibble() %>% 
  time_decompose(`Cabbage(Local)`, 
                 frequency = "1 year", 
                 trend = "1 year",
                 merge = TRUE) %>%
  anomalize(remainder) %>%
  time_recompose() %>% 
  plot_anomalies(size_dots = 1, 
                 size_circles = 3) +
  geom_line(linewidth = 0.1) +  
  labs(title = "The Cabbage Local Anomaly Price on a Yearly Basis",
       subtitle = "Nepal, 2013-2021",
       x = "Date",
       y = "Price (observed)")+ 
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

ggplotly(anom_cabbage_yearly)
anom_cauli_yearly <- top5_Comm_clean %>% 
  select(Date, `Cauli Local`) %>% 
  as_tibble() %>% 
  time_decompose(`Cauli Local`, 
                 frequency = "1 year", 
                 trend = "1 year",
                 merge = TRUE) %>%
  anomalize(remainder) %>%
  time_recompose() %>% 
  plot_anomalies(size_dots = 1, 
                 size_circles = 3) +
  geom_line(linewidth = 0.1) +
  labs(title = "The Cauli Local Anomaly Price on a Yearly Basis",
       subtitle = "Nepal, 2013-2021",
       x = "Date",
       y = "Price (observed)")+ 
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

ggplotly(anom_cauli_yearly)
anom_raddish_yearly <- top5_Comm_clean %>% 
  select(Date, `Raddish White(Local)`) %>% 
  as_tibble() %>% 
  time_decompose(`Raddish White(Local)`, 
                 frequency = "1 year", 
                 trend = "1 year",
                 merge = TRUE) %>%
  anomalize(remainder) %>%
  time_recompose() %>% 
  plot_anomalies(size_dots = 1, 
                 size_circles = 3) +
  geom_line(linewidth = 0.1) +
  labs(title = "The Raddish White Local Anomaly Price on a Yearly Basis",
       subtitle = "Nepal, 2013-2021",
       x = "Date",
       y = "Price (observed)")+ 
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

ggplotly(anom_raddish_yearly) 
anom_ginger_yearly <- top5_Comm_clean %>% 
  select(Date, Ginger) %>% 
  as_tibble() %>% 
  time_decompose(Ginger, 
                 frequency = "1 year", 
                 trend = "1 year",
                 merge = TRUE) %>%
  anomalize(remainder) %>%
  time_recompose() %>% 
  plot_anomalies(size_dots = 1, 
                 size_circles = 3) +
  geom_line(linewidth = 0.1) +
  labs(title = "The Ginger Anomaly Price on a Yearly Basis",
       subtitle = "Nepal, 2013-2021",
       x = "Date",
       y = "Price (observed)")+ 
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

ggplotly(anom_ginger_yearly)
anom_chili_yearly <- top5_Comm_clean %>% 
  select(Date, `Chilli Dry`) %>% 
  as_tibble() %>% 
  time_decompose(`Chilli Dry`, 
                 frequency = "1 year", 
                 trend = "1 year",
                 merge = TRUE) %>%
  anomalize(remainder) %>%
  time_recompose() %>%  
  plot_anomalies(size_dots = 1, 
                 size_circles = 3) +
  geom_line(linewidth = 0.1) +
  labs(title = "The Chilli Dry Anomaly Price on a Yearly Basis",
       x = "Date",
       y = "Price (observed)")+ 
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(anom_chili_yearly)

From the anomaly detection with 95% confidence intervals, we can conclude that all commodities except chilli tend to experience frequent anomalies in the form of drastic price increases. The chilli commodity, in contrast, often experiences anomalous price decreases, and this is an important warning sign for Mr. Saitama going forward.
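
Since the same decompose, anomalize, and recompose pipeline is applied to each of the five commodities above, the repetition could optionally be wrapped in a small helper. Below is a minimal sketch; plot_yearly_anomalies is a hypothetical helper name, and the commodity column is passed as a string.

# A sketch of a helper that wraps the repeated yearly anomaly-detection pipeline.
# `col` is the commodity column name passed as a string (hypothetical helper).
plot_yearly_anomalies <- function(data, col) {
  data %>% 
    select(Date, Price = all_of(col)) %>% 
    as_tibble() %>% 
    time_decompose(Price, 
                   frequency = "1 year", 
                   trend = "1 year",
                   merge = TRUE) %>%
    anomalize(remainder) %>%
    time_recompose() %>% 
    plot_anomalies(size_dots = 1, 
                   size_circles = 3) +
    geom_line(linewidth = 0.1) +
    labs(title = glue("The {col} Anomaly Price on a Yearly Basis"),
         subtitle = "Nepal, 2013-2021",
         x = "Date",
         y = "Price (observed)") + 
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))
}

# e.g. ggplotly(plot_yearly_anomalies(top5_Comm_clean, "Ginger"))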

Next, we will further analyze the commodity ‘chilli’ concerning the months that frequently experience anomalies. This will help Mr. Saitama stay vigilant during those months when harvesting chilli crops.

anomaly_chilli <- top5_Comm_clean %>% 
  select(Date, `Chilli Dry`) %>% 
  as_tibble() %>% 
  time_decompose(`Chilli Dry`, 
                 frequency = "1 year", 
                 trend = "1 year",
                 merge = TRUE) %>%
  anomalize(remainder) %>%
  time_recompose() %>% 
  filter(anomaly == "Yes") %>% 
  mutate(Year = year(Date),
         Month = month(Date, label = T, abbr = F)) %>% 
  group_by(Year, Month) %>% 
  summarise(Count = length(Month)) %>% 
  ggplot(aes(x = Year, y = Count, fill = Month,
             text = glue("{Year}, {Month}
                         Count: {Count}"))) +
  geom_col()+
  scale_fill_hue() +
  labs(title = "The Chilli Dry Anomaly Price on a Yearly Basis")+ 
  theme(plot.title = element_text(hjust = 0.5)) +
  theme_minimal() 

ggplotly(anomaly_chilli, tooltip = "text")

Based on the anomaly_chilli plot and the anomaly detection plots above, the months with frequent significant price declines for chilli fall between June and September. The month that stands out most, with drastic decreases that require special attention, is September.

Great! Let’s begin preparing the forecasting models. In this process, we will automatically select the appropriate model and perform the necessary preprocessing, taking the time series seasonality into account to improve forecast accuracy.

Cross Validation

The first step is to split the data for cross-validation. We’ll set aside the last 120 days, roughly four months, as test data.

# Data wrangling for cross-validation preparation
comm_name <- colnames(top5_Comm_clean)[-1]

top5_Comm_longer <- top5_Comm_clean %>% 
  pivot_longer(cols = comm_name,
               names_to = "Commodity",
               values_to = "Price")
train_list <- list(
  `Cabbage(Local)` =  top5_Comm_longer %>% filter(Commodity == "Cabbage(Local)") %>% select(Price) %>% head(-120),
  `Cauli Local` =  top5_Comm_longer %>% filter(Commodity == "Cauli Local") %>% select(Price) %>% head(-120),
  `Raddish White(Local)` =  top5_Comm_longer %>% filter(Commodity == "Raddish White(Local)") %>% select(Price) %>% head(-120),
  Ginger =  top5_Comm_longer %>% filter(Commodity == "Ginger") %>% select(Price) %>% head(-120),
  `Chilli Dry` =  top5_Comm_longer %>% filter(Commodity == "Chilli Dry") %>% select(Price) %>% head(-120)
)%>% 
  enframe(name = "Commodity", value = "Price")

test_list <- list(
  `Cabbage(Local)` =  top5_Comm_longer %>% filter(Commodity == "Cabbage(Local)") %>% select(Price) %>% tail(120),
  `Cauli Local` =  top5_Comm_longer %>% filter(Commodity == "Cauli Local") %>% select(Price) %>% tail(120),
  `Raddish White(Local)` =  top5_Comm_longer %>% filter(Commodity == "Raddish White(Local)") %>% select(Price) %>% tail(120),
  Ginger =  top5_Comm_longer %>% filter(Commodity == "Ginger") %>% select(Price) %>% tail(120),
  `Chilli Dry` =  top5_Comm_longer %>% filter(Commodity == "Chilli Dry") %>% select(Price) %>% tail(120)
)%>% 
  enframe(name = "Commodity", value = "Price")

rmarkdown::paged_table(test_list)

Preprocessing Specification

We will experiment with multiple preprocessing approaches as there is a possibility that transformed data may perform better than the original scale. We will apply the following treatments:

  • No data transformation
  • Square-root transformation (the spec is named squared_spec because the reverse step squares the values)
  • Scaled value: the data is normalized using the z-score
  • Log transformation
recipe_spec <- list(
  normal_spec = function(x) x, # no transformation
  squared_spec = function(x) sqrt(x), # take the square root of the price (reversed later by squaring)
  scale_spec = function(x) scale(x), # normalize the price with a z-score
  log_spec = function(x) log(x+1) # log-transform the price
) %>%
  enframe(name = "preprocess_name", value = "preprocess_spec")

rmarkdown::paged_table(recipe_spec)

Additionally, we need functions to revert the transformed values back to their original scale for later model evaluation. For instance, if the data is transformed with a square root, the reverse function undoes it by squaring. This step is essential for accurate evaluation and interpretation of the model results.

reverse_spec <- list(

  # no transformation, so nothing to reverse
  normal_spec = function(x, y) x,

  # square-root transform is reversed by squaring
  squared_spec = function(x, y) x^2,

  # z-score scaling is reversed using the mean and sd of the reference series y
  scale_spec = function(x, y) x * sd(y) + mean(y),

  # log(x + 1) transform is reversed with exp(x) - 1
  log_spec = function(x, y) exp(x) - 1
) %>%
  enframe(name = "reverse_name", value = "reverse_spec")

# join the preprocessing and reverse-transformation specifications
recipe_spec <- recipe_spec %>%
  bind_cols(reverse_spec)

rmarkdown::paged_table(recipe_spec)
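
As a quick sanity check (a sketch using made-up values), each reverse function should undo its paired transformation; here the square-root spec (row 2 of recipe_spec) is applied and then reversed.

# Sanity check: the reverse function should recover the original values.
x_original <- c(10, 20, 30)
x_transformed <- recipe_spec$preprocess_spec[[2]](x_original)   # square root
recipe_spec$reverse_spec[[2]](x_transformed, x_original)        # should return 10 20 30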

Outlier Imputation

Additionally, we will explore whether outliers should be replaced or kept. If we decide to replace them, we will identify outliers and estimate replacement values using the tsoutliers function, which detects unusual residuals by fitting a loess curve for non-seasonal data and using a periodic STL decomposition for seasonal data. This helps us handle outliers and improve data quality before feeding the series into the forecasting models.

outlier_spec <- list(
  normal_spec = function(x) x, # no transformation
  
  out_spec = function(x){
    # replace detected outliers with the replacement values suggested by tsoutliers()
    outlier_place <- tsoutliers(x)
    x[outlier_place$index] <- outlier_place$replacement
    return(x)
  }
) %>% 
  enframe(name = "outlier_name", value = "out_spec")

rmarkdown::paged_table(outlier_spec)

Seasonality Specification

The next step is to specify the seasonal period for each series. We will try several seasonal periods, including:

  • monthly seasonality
  • annual seasonality
  • monthly and annual seasonality (multi-seasonal)
seasonal_forecast <- list(
  monthly = function(x) ts(x, frequency = 30),
  yearly = function(x) ts(x, frequency = 365),
  monthly_yearly = function(x) msts(x, seasonal.periods = c(30, 365))
  ) %>% 
  enframe(name = "season_name", value = "season_spec")

rmarkdown::paged_table(seasonal_forecast)

Model Specification

Next, we will specify the models the data will be fitted to. The models include:

  • Holt-Winters exponential smoothing
  • STL + ETS model
  • STL + ARIMA
model_forecast <- list(
  holt_winters = function(x) HoltWinters(x),
  stl_ets = function(x) stlm(x, method = "ets"),
  stl_arima = function(x) stlm(x, method = "arima")
  ) %>% 
  enframe(name = "model_name", value = "model_spec")

rmarkdown::paged_table(model_forecast)

Model Fitting

First, we will combine all the preprocessing and model specifications with the training and testing data using crossing(), which produces every combination of commodity, transformation, seasonal setting, outlier treatment, and model (5 × 4 × 3 × 2 × 3 = 360 configurations per dataset). This lets us apply every defined preprocessing method and model setting to each dataset in a systematic and consistent way.

train_process <- crossing(train_list, recipe_spec, seasonal_forecast, outlier_spec, model_forecast)

test_process <- crossing(test_list, recipe_spec, seasonal_forecast, outlier_spec, model_forecast)

rmarkdown::paged_table(train_process)

Next, the price data will be transformed according to the preprocessing steps mentioned earlier. By applying these transformations, we can ensure that the data is appropriately prepared for the subsequent steps in the forecasting model.

tranformasi_data <- map2(.x = train_process$Price,
                         .y = train_process$preprocess_spec,
                         .f = ~exec( .y, .x)
                         )

Then, the automated model selection with the predefined preprocessing steps is executed, and the forecasting is carried out in the same pipeline. Automating model selection and forecasting streamlines the analysis, letting us quickly evaluate many combinations of models and preprocessing techniques and identify the most suitable one for each series.

forecast_map <- tranformasi_data %>%

  # Convert data to time series with different seasonalities
  map2(.y = train_process$season_spec,
       .f = ~exec(.y, .x)) %>%

  # Handle outliers according to the outlier spec (replace or keep)
  map2(.y = train_process$out_spec,
       .f = ~exec(.y, .x)) %>%

  # Fit data to time series model
  map2(.y = train_process$model_spec,
       .f =  ~exec(.y,.x)) %>% 

  # Forecast for the next 4 months
  map(forecast, h = 120) %>%

  # Extract the point forecast (the mean)
  map(~pluck(.x, "mean")) %>%

  map(as.numeric)

The forecasting results will be transformed back to their original scale so that they represent actual commodity prices. This step is crucial for interpreting and using the forecasts correctly; reverting to the original scale keeps the outcomes meaningful and practical for decision-making.

transform_forecast <- list()

for (i in 1:length(forecast_map)) {
  transform_forecast[[i]] <- train_process$reverse_spec[[i]]( x = forecast_map[[i]], y = tranformasi_data[[i]])
}

Next, to evaluate the forecasting results, we will compute the errors against the actual test data using MAE (Mean Absolute Error), RMSE (Root Mean Squared Error), and MAPE (Mean Absolute Percentage Error). These metrics quantify how well each model’s forecasts match the actual commodity prices, so we can judge their accuracy and reliability and adjust the models where needed.
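
For reference, with actual prices \(y_t\), forecasts \(\hat{y}_t\), and \(n\) test observations, the three metrics are defined as (MAPE expressed in percent):

\[
\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right|, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2}, \qquad
\mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right|
\]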

mae_forecast <- transform_forecast %>% 
  map2(.y = test_process$Price, 
       .f = ~yardstick::mae_vec(.x %>% as.numeric(), .y$Price))

rmse_forecast <- transform_forecast %>% 
  map2(.y = test_process$Price, 
       .f = ~yardstick::rmse_vec(.x %>% as.numeric(), .y$Price))

mape_forecast <- transform_forecast %>% 
  map2(.y = test_process$Price, 
       .f = ~yardstick::mape_vec(.x %>% as.numeric(), .y$Price))

# show result
mae_rmse_mape <- train_process %>% 
  select_if(is.character) %>% 
  bind_cols(mae = mae_forecast %>% as.numeric()) %>% 
  bind_cols(rmse = rmse_forecast%>% as.numeric()) %>% 
  bind_cols(mape = mape_forecast%>% as.numeric()) %>% 
  select(Commodity, mae, rmse, mape, everything())

rmarkdown::paged_table(mae_rmse_mape)

Let’s select the best model for each commodity based on the smallest values of the evaluation metrics (ranked here by MAE).

mae_rmse_mape %>%   
  group_by(Commodity) %>% 
  arrange(mae) %>% 
  select(Commodity, mae, rmse, mape, model_name, everything()) %>% 
  slice(1) 
#> # A tibble: 5 x 9
#> # Groups:   Commodity [5]
#>   Commodity              mae  rmse  mape model_name preprocess_name reverse_name
#>   <chr>                <dbl> <dbl> <dbl> <chr>      <chr>           <chr>       
#> 1 Cabbage(Local)        6.59  8.77 38.4  stl_ets    normal_spec     normal_spec 
#> 2 Cauli Local           9.73 12.6  23.4  stl_ets    normal_spec     normal_spec 
#> 3 Chilli Dry            3.86 10.2   1.21 stl_ets    normal_spec     normal_spec 
#> 4 Ginger                5.80  8.43  7.87 stl_arima  squared_spec    squared_spec
#> 5 Raddish White(Local) 16.6  18.7  80.0  stl_ets    normal_spec     normal_spec 
#> # i 2 more variables: season_name <chr>, outlier_name <chr>

For the Chilli Dry and Ginger commodities, the forecasting models have good accuracy; MAPE is the most informative metric here, and the smaller the MAPE, the better. However, the forecasting model failed to adequately capture the Raddish White commodity (MAPE around 80%). The accuracy for the Cabbage and Cauli commodities is not very high, but still acceptable.

Let’s fit the best forecasting model for each commodity and plot the fitted values and forecasts against the actual price data. Comparing the predictions with the actual values visually shows how well the models capture the price patterns and gives a clear picture of their accuracy and reliability. Let’s proceed with building the models and plotting the results.

best_adjust <- train_process %>% 
  bind_cols(mae = mae_forecast %>% as.numeric()) %>%
  bind_cols(rmse = rmse_forecast %>% as.numeric()) %>% 
  bind_cols(mape = mape_forecast%>% as.numeric()) %>%   
  group_by(Commodity) %>% 
  arrange(mae) %>% 
  slice(1) 

best_adjust
#> # A tibble: 5 x 15
#> # Groups:   Commodity [5]
#>   Commodity   Price    preprocess_name preprocess_spec reverse_name reverse_spec
#>   <chr>       <list>   <chr>           <list>          <chr>        <list>      
#> 1 Cabbage(Lo~ <tibble> normal_spec     <fn>            normal_spec  <fn>        
#> 2 Cauli Local <tibble> normal_spec     <fn>            normal_spec  <fn>        
#> 3 Chilli Dry  <tibble> normal_spec     <fn>            normal_spec  <fn>        
#> 4 Ginger      <tibble> squared_spec    <fn>            squared_spec <fn>        
#> 5 Raddish Wh~ <tibble> normal_spec     <fn>            normal_spec  <fn>        
#> # i 9 more variables: season_name <chr>, season_spec <list>,
#> #   outlier_name <chr>, out_spec <list>, model_name <chr>, model_spec <list>,
#> #   mae <dbl>, rmse <dbl>, mape <dbl>
model_best <- 
  map2(.x = best_adjust$Price, 
       .y = best_adjust$preprocess_spec,
       .f =  ~exec(.y,.x)
       ) %>% 
  map2(.y = best_adjust$season_spec,
       .f =  ~exec(.y,.x)
       ) %>% 
  map2(.y = best_adjust$out_spec,
       .f = ~exec(.y, .x)
       ) %>%
  map2(.y = best_adjust$model_spec,
       .f =  ~exec(.y,.x)
       ) 
data_test1 <- test_list$Price[[1]] %>% 
  ts(start = c(8, 215),
     end = c(8, 334),   
     frequency = 365)

best_adjust$reverse_spec[[1]](model_best[[1]], 1)$fitted %>% 
  autoplot(series = "model fitted", 
           linewidth = 1.2) +
  autolayer(best_adjust$season_spec[[1]](train_list$Price[[1]]),
            color = "black", 
            series = "train price",
            linewidth = 1.2,
            alpha = 0.5) +
  autolayer(best_adjust$reverse_spec[[1]](forecast(model_best[[1]], 
                                                   h = 120),
                              best_adjust$Price[[1]]), 
            series = "forecast price") +
  autolayer(data_test1, 
            series = "test price") +
  scale_x_continuous(
    name = "Date",
    breaks = breaks_extended(6),
    labels = as.Date.numeric(seq(25,525,100)*5-9, 
                             origin = range(commodity_price$Date)[1])) +
  labs(
    title = glue("{best_adjust$Commodity[[1]]} Price"),
    y = "Price") +
  theme_economist_white(
    base_size = 9,
    gray_bg = F
    )

data_test2 <- test_list$Price[[2]] %>% 
  ts(start = c(93, 10),
     end = c(97, 9),   
     frequency = 30)

best_adjust$reverse_spec[[2]](model_best[[2]], 1)$fitted %>% 
  autoplot(series = "model fitted", 
           linewidth = 1.2) +
  autolayer(best_adjust$season_spec[[2]](train_list$Price[[2]]),
            color = "black", 
            series = "train price",
            linewidth = 1.2,
            alpha = 0.5) +
  autolayer(best_adjust$reverse_spec[[2]](forecast(model_best[[2]], 
                                                   h = 120),
                              best_adjust$Price[[2]]), 
            series = "forecast price") +
  autolayer(data_test2, 
            series = "test price") +
  scale_x_continuous(
    name = "Date",
    breaks = breaks_extended(6),
    labels = as.Date.numeric(seq(25,525,100)*5-9, 
                             origin = range(commodity_price$Date)[1])) +
  labs(
    title = glue("{best_adjust$Commodity[[2]]} Price"),
    y = "Price") +
  theme_economist_white(
    base_size = 9,
    gray_bg = F
    )

data_test3 <- test_list$Price[[5]] %>% 
  ts(start = c(8, 215),
     end = c(8, 334),   
     frequency = 365)

best_adjust$reverse_spec[[3]](model_best[[3]], 1)$fitted %>% 
  autoplot(series = "model fitted", 
           linewidth = 1.2) +
  autolayer(best_adjust$season_spec[[3]](train_list$Price[[5]]),
            color = "black", 
            series = "train price",
            linewidth = 1.2,
            alpha = 0.5) +
  autolayer(best_adjust$reverse_spec[[3]](forecast(model_best[[3]], 
                                                   h = 120),
                              best_adjust$Price[[3]]), 
            series = "forecast price") +
  autolayer(data_test3, 
            series = "test price") +
  scale_x_continuous(
    name = "Date",
    breaks = breaks_extended(6),
    labels = as.Date.numeric(seq(25,525,100)*5-9, 
                             origin = range(commodity_price$Date)[1])) +
  labs(
    title = glue("{best_adjust$Commodity[[3]]} Price"),
    y = "Price") +
  theme_economist_white(
    base_size = 9,
    gray_bg = F
    )

data_test4 <- test_list$Price[[4]] %>% 
  ts(start = c(8, 215),
     end = c(8, 334),   
     frequency = 365)

best_adjust$reverse_spec[[4]](model_best[[4]]$fitted, 1) %>% 
  autoplot(series = "model fitted", 
           linewidth = 1.2) +
  autolayer(best_adjust$season_spec[[4]](train_list$Price[[4]]),
            color = "black", 
            series = "train price",
            linewidth = 1.2,
            alpha = 0.5) +
  autolayer(best_adjust$reverse_spec[[4]](forecast(model_best[[4]], 
                                                   h = 120)$mean,
                              best_adjust$Price[[4]]), 
            series = "forecast price") +  autolayer(best_adjust$reverse_spec[[4]](forecast(model_best[[4]], 
                                                   h = 120)$lower,
                              best_adjust$Price[[4]]), 
            series = "lower forecast") + autolayer(best_adjust$reverse_spec[[4]](forecast(model_best[[4]], 
                                                   h = 120)$upper,
                              best_adjust$Price[[4]]), 
            series = "upper forecast") +
  autolayer(data_test4, 
            series = "test price") +
  scale_x_continuous(
    name = "Date",
    breaks = breaks_extended(6),
    labels = as.Date.numeric(seq(25,525,100)*5-9, 
                             origin = range(commodity_price$Date)[1])) +
  labs(
    title = glue("{best_adjust$Commodity[[4]]} Price"),
    y = "Price") +
  theme_economist_white(
    base_size = 9,
    gray_bg = F
    )

data_test5 <- test_list$Price[[3]] %>% 
  ts(start = c(8, 215),
     end = c(8, 334),   
     frequency = 365)

best_adjust$reverse_spec[[5]](model_best[[5]], 1)$fitted %>% 
  autoplot(series = "model fitted", 
           linewidth = 1.2) +
  autolayer(best_adjust$season_spec[[5]](train_list$Price[[3]]),
            color = "black", 
            series = "train price",
            linewidth = 1.2,
            alpha = 0.5) +
  autolayer(best_adjust$reverse_spec[[5]](forecast(model_best[[5]], 
                                                   h = 120),
                              best_adjust$Price[[3]]), 
            series = "forecast price") +
  autolayer(data_test5, 
            series = "test price") +
  scale_x_continuous(
    name = "Date",
    breaks = breaks_extended(6),
    labels = as.Date.numeric(seq(25,525,100)*5-9, 
                             origin = range(commodity_price$Date)[1])) +
  labs(
    title = glue("{best_adjust$Commodity[[5]]} Price"),
    y = "Price") +
  theme_economist_white(
    base_size = 9,
    gray_bg = F
    )

Checking Assumptions

The assumption checks apply specifically to regression-based models (ARIMA/SARIMA, i.e. autoregressive models). So, they will be performed only on the model for the Ginger commodity, which uses STL + ARIMA.

1. No autocorrelation in the residuals

This assumption checks whether autocorrelation remains in the residuals. If it does, there are still patterns in the data that the model has not captured, and the model needs further tuning and adjustment to improve its performance.

\(H_0\): the residuals have no autocorrelation

\(H_1\): the residuals have autocorrelation

Box.test(model_best[[4]]$residuals, type = "Ljung-Box")
#> 
#>  Box-Ljung test
#> 
#> data:  model_best[[4]]$residuals
#> X-squared = 0.00011024, df = 1, p-value = 0.9916

Because the p-value (0.9916) is greater than 0.05, we fail to reject the null hypothesis \(H_0\), indicating that the residuals have no autocorrelation. The assumption of no autocorrelation is met, and the model has captured the relevant patterns in the data.
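
As a complementary visual check (a sketch, not part of the original workflow), the residual autocorrelation function can be plotted; spikes outside the confidence bounds would indicate remaining autocorrelation.

# Sketch: ACF plot of the Ginger model residuals as a visual complement to the Ljung-Box test.
ggAcf(model_best[[4]]$residuals) +
  labs(title = "Ginger : ACF of residuals")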

2. Normality of Residuals

To check the normality of the residuals from the time series forecasts, we can use a histogram of the residuals. If the distribution is close to a normal (bell-shaped) curve, the residuals can be considered approximately normal. If the distribution deviates significantly from normality, a data transformation or an alternative method may be needed.

ggplotly(gghistogram(model_best[[4]]$residuals)+
           labs(title="Ginger : Histogram of residuals", x = "residuals"))

Based on the plot, the regression-based model meets the assumption of normally distributed residuals: the residuals are approximately normal, which is a positive sign for the accuracy and reliability of the forecasts. It gives us confidence that the model assumptions are valid and the forecasts are robust.
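
As an optional complement to the visual check (a sketch, not part of the original workflow), a Shapiro-Wilk test could be applied to the same residuals; a p-value above 0.05 would support the normality assumption.

# Sketch: formal normality test on the Ginger model residuals
# (the residual series here is well under shapiro.test's 5,000-observation limit).
shapiro.test(as.numeric(na.omit(model_best[[4]]$residuals)))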