Vegetable and Fruit Prices || Forecasting
Background
Choosing which agricultural products to grow is a challenging task for farmers, as they lack the tools needed to determine which products will fetch the best price by the time they are ready for market.
Machine learning algorithms have already been applied in many different sectors to solve prediction problems. Given this importance and wide use, machine learning will be applied here to predict prices for agricultural products.
Dataset
This dataset contains official price information for major vegetables and fruits in Nepal from 2013 to 2021. The dataset includes daily price data for each vegetable and fruit, as well as the maximum, minimum, and average prices over the period. The prices are based on official figures.
Business Objective
Mr. Saitama is a new Nepalese farmer. With a degree in agriculture, he wants to identify five profitable crops to cultivate, and his agricultural expertise is solid. In order to maximize profits and avoid losses, Mr. Saitama wants price projections and is asking a data scientist for help in formulating his strategy.
Exploratory & Explanatory Data Analysis
# Data Wrangling
library(tidyverse)
library(lubridate)
library(glue)
# Visualization
library(ggthemes)
library(scales)
library(plotly)
# Time Series
library(forecast)
library(tseries)
library(padr)
library(zoo)
library(tibbletime)
library(sweep)
library(anomalize)
library(timetk)
# Machine Learning
library(rsample)
library(recipes)
library(purrr)
library(yardstick)
library(magrittr)
commodity_price <- read.csv("kalimati_tarkari_dataset.csv")

rmarkdown::paged_table(commodity_price)
Columns description:
- Commodity : Kind of vegetable or fruit
- Date : Date of the commodity's price; the data represents daily prices
- Unit : The unit of mass
- Minimum : The minimum price of the commodity on that date
- Maximum : The maximum price of the commodity on that date
- Average : The average price of the commodity on that date
glimpse(commodity_price)
#> Rows: 197,161
#> Columns: 7
#> $ SN <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17~
#> $ Commodity <chr> "Tomato Big(Nepali)", "Tomato Small(Local)", "Potato Red", "~
#> $ Date <chr> "2013-06-16", "2013-06-16", "2013-06-16", "2013-06-16", "201~
#> $ Unit <chr> "Kg", "Kg", "Kg", "Kg", "Kg", "Kg", "Kg", "Kg", "Kg", "Kg", ~
#> $ Minimum <dbl> 35, 26, 20, 15, 28, 30, 6, 30, 35, 25, 16, 20, 20, 55, 25, 6~
#> $ Maximum <dbl> 40, 32, 21, 16, 30, 35, 10, 35, 40, 30, 18, 22, 25, 60, 30, ~
#> $ Average <dbl> 37.5, 29.0, 20.5, 15.5, 29.0, 32.5, 8.0, 32.5, 37.5, 27.5, 1~
# Checking kinds of commodities
length(unique(commodity_price$Commodity))
#> [1] 132
# Checking the range of dates in the dataset
range(commodity_price$Date)
#> [1] "2013-06-16" "2021-05-13"
This dataset contains 197,161 rows and 7 columns, with 132 commodities. The recorded dates range from 2013-06-16 to 2021-05-13. The Date column has the wrong type: it is stored as character, so I'll convert it to a date type. The Commodity column will also be converted to a factor.
commodity_price <- commodity_price %>%
  # select the columns relevant for further analysis
  select(Date, Commodity, Average) %>%
  # convert the data types
  mutate(Date = ymd(Date),
         Commodity = as.factor(Commodity))

rmarkdown::paged_table(commodity_price)
Based on the business objective, I'll extract the top five commodities in Nepal (by number of recorded observations).
top5_Commodity <- commodity_price %>%
  mutate(Commodity = droplevels(Commodity)) %>%
  group_by(Commodity) %>%
  summarise(Frekuensi = length(Commodity)) %>%
  arrange(-Frekuensi) %>%
  head(5)

droplevels(top5_Commodity$Commodity)
#> [1] Ginger Cauli Local Cabbage(Local)
#> [4] Chilli Dry Raddish White(Local)
#> 5 Levels: Cabbage(Local) Cauli Local Chilli Dry ... Raddish White(Local)
The top five commodities in Nepal are Ginger, Cauli Local, Cabbage Local, Chilli Dry, and Raddish White Local. Next, we'll obtain the time series of prices for these top five commodities.
top5_Comm_prices <- commodity_price %>%
  filter(Commodity %in% droplevels(top5_Commodity$Commodity)) %>%
  select(Date, Commodity, Average)

rmarkdown::paged_table(top5_Comm_prices)
Then, for time series analysis, we must ensure that there are no gaps in the sequence of dates.
# Data preparation for checking the dates and for time series analysis
top5_Comm_wider <- pivot_wider(top5_Comm_prices,
                               names_from = Commodity,
                               values_from = Average)

rmarkdown::paged_table(top5_Comm_wider)

# Checking for missing dates
colSums(is.na(pad(top5_Comm_wider)))
#> Date Cabbage(Local) Cauli Local
#> 0 140 139
#> Raddish White(Local) Ginger Chilli Dry
#> 142 138 141
After padding the dates, there are many missing values. This confirms that the dataset has gaps and is not a continuous daily sequence. So, we will fill each missing value by interpolating between the observations immediately before and after it.
top5_Comm_clean <- top5_Comm_wider %>%
  pad() %>%
  mutate(`Cabbage(Local)` = na.fill(object = `Cabbage(Local)`, fill = "extend"),
         `Cauli Local` = na.fill(object = `Cauli Local`, fill = "extend"),
         `Raddish White(Local)` = na.fill(object = `Raddish White(Local)`, fill = "extend"),
         Ginger = na.fill(object = Ginger, fill = "extend"),
         `Chilli Dry` = na.fill(object = `Chilli Dry`, fill = "extend"))

colSums(is.na(pad(top5_Comm_clean)))
#> Date Cabbage(Local) Cauli Local
#> 0 0 0
#> Raddish White(Local) Ginger Chilli Dry
#> 0 0 0
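To make the behaviour of pad() and na.fill(fill = "extend") concrete, here is a minimal sketch on a toy series (hypothetical values, not taken from the dataset): the missing calendar days are inserted as NA rows and then filled by interpolating between the surrounding observations.
# Toy illustration of pad() + na.fill(fill = "extend") (hypothetical values)
toy <- tibble(Date  = ymd(c("2021-01-01", "2021-01-02", "2021-01-05")),
              Price = c(10, 12, 18))
toy %>%
  pad() %>%   # inserts 2021-01-03 and 2021-01-04 as NA rows
  mutate(Price = na.fill(object = Price, fill = "extend"))
# the two interior NAs should become roughly 14 and 16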
Literature Review
To obtain a sound analysis, we have to acquire domain knowledge about the top five commodities mentioned above. First, we'll look up the planting period, then the harvest period, and finally the growing season. This way, we can determine the seasonal periods and develop a sound strategy.
From ChatGPT, we obtained the following information for the five commodities in Nepal.
1. Planting Period
- Cabbage : September-December
- Cauli : September-January
- Raddish : February-November
- Ginger : January-December
- Chilli : April-July
2. Harvest Period
- Cabbage : November-March
- Cauli : November-March
- Raddish : March-November
- Ginger : November-July
- Chilli : June-October
3. Growing Season
- Cabbage : 2.5-4.5 months
- Cauli : 2.5-4.5 months
- Raddish : 3-6 weeks
- Ginger : 8-10 months
- Chilli : 2-4 months
Based on the information above, we can expect annual seasonality in the prices. For time series forecasting, we'll analyze both monthly and yearly seasonality, because the Raddish commodity is considered to have monthly seasonality.
Seasonal Analysis
Firstly, we’ll analysis for annually seasonality of the top five commodities.
# Plotting annual seasonality
top5_Comm_clean %>%
  mutate(
    `Cabbage` = decompose(ts(top5_Comm_clean[,2],
                             frequency = 365))$seasonal,
    `Cauli` = decompose(ts(top5_Comm_clean[,3],
                           frequency = 365))$seasonal,
    `Raddish` = decompose(ts(top5_Comm_clean[,4],
                             frequency = 365))$seasonal,
    `Ginger` = decompose(ts(top5_Comm_clean[,5],
                            frequency = 365))$seasonal,
    `Chilli Dry` = decompose(ts(top5_Comm_clean[,6],
                                frequency = 365))$seasonal,
    Month = month(Date, label = TRUE, abbr = TRUE)
  ) %>%
  distinct(Month, `Cabbage`, `Cauli`, `Raddish`, `Ginger`, `Chilli Dry`) %>%
  pivot_longer(
    cols = c("Cabbage", "Cauli", "Raddish", "Ginger", "Chilli Dry"),
    names_to = "Commodity",
    values_to = "Seasonal") %>%
  ggplot(mapping = aes(x = Month, y = Seasonal, fill = Commodity)) +
  geom_col() +
  facet_grid(rows = vars(Commodity),
             scales = "free_y") +
  scale_fill_ordinal() +
  geom_hline(yintercept = 0) +
  labs(title = "Annual Seasonality in Top 5 Commodities",
       subtitle = "price of the top five commodities",
       y = "Observed Seasonality",
       x = "Month") +
  theme_light() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))
From the plot above, the following conclusions can be drawn:
- Cabbage, Cauli, and Raddish: prices tend to decrease in the first semester and increase in the second semester.
- Chilli Dry: prices decline during the middle of the year, between April and September, in contrast to Ginger.
- The highest prices for each commodity:
  + Cabbage : October to November
  + Cauli : September
  + Chilli : January
  + Ginger : August
  + Raddish : September
- The lowest prices for each commodity:
  + Cabbage : April to May
  + Cauli : March
  + Chilli : July
  + Ginger : December to January
  + Raddish : February
Anomaly Detection
We will conduct anomaly detection on the prices of the top five commodities. By doing so, we aim to develop a robust strategy for the future.
anom_cabbage_yearly <- top5_Comm_clean %>%
  select(Date, `Cabbage(Local)`) %>%
  as_tibble() %>%
  time_decompose(`Cabbage(Local)`,
                 frequency = "1 year",
                 trend = "1 year",
                 merge = TRUE) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  plot_anomalies(size_dots = 1,
                 size_circles = 3) +
  geom_line(linewidth = 0.1) +
  labs(title = "The Cabbage Local Anomaly Price on a Yearly Basis",
       subtitle = "Nepal, 2013-2021",
       x = "Date",
       y = "Price (observed)") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

ggplotly(anom_cabbage_yearly)
anom_cauli_yearly <- top5_Comm_clean %>%
  select(Date, `Cauli Local`) %>%
  as_tibble() %>%
  time_decompose(`Cauli Local`,
                 frequency = "1 year",
                 trend = "1 year",
                 merge = TRUE) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  plot_anomalies(size_dots = 1,
                 size_circles = 3) +
  geom_line(linewidth = 0.1) +
  labs(title = "The Cauli Local Anomaly Price on a Yearly Basis",
       subtitle = "Nepal, 2013-2021",
       x = "Date",
       y = "Price (observed)") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

ggplotly(anom_cauli_yearly)
anom_raddish_yearly <- top5_Comm_clean %>%
  select(Date, `Raddish White(Local)`) %>%
  as_tibble() %>%
  time_decompose(`Raddish White(Local)`,
                 frequency = "1 year",
                 trend = "1 year",
                 merge = TRUE) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  plot_anomalies(size_dots = 1,
                 size_circles = 3) +
  geom_line(linewidth = 0.1) +
  labs(title = "The Raddish White Local Anomaly Price on a Yearly Basis",
       subtitle = "Nepal, 2013-2021",
       x = "Date",
       y = "Price (observed)") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

ggplotly(anom_raddish_yearly)
anom_ginger_yearly <- top5_Comm_clean %>%
  select(Date, Ginger) %>%
  as_tibble() %>%
  time_decompose(Ginger,
                 frequency = "1 year",
                 trend = "1 year",
                 merge = TRUE) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  plot_anomalies(size_dots = 1,
                 size_circles = 3) +
  geom_line(linewidth = 0.1) +
  labs(title = "The Ginger Anomaly Price on a Yearly Basis",
       subtitle = "Nepal, 2013-2021",
       x = "Date",
       y = "Price (observed)") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

ggplotly(anom_ginger_yearly)
anom_chili_yearly <- top5_Comm_clean %>%
  select(Date, `Chilli Dry`) %>%
  as_tibble() %>%
  time_decompose(`Chilli Dry`,
                 frequency = "1 year",
                 trend = "1 year",
                 merge = TRUE) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  plot_anomalies(size_dots = 1,
                 size_circles = 3) +
  geom_line(linewidth = 0.1) +
  labs(title = "The Chilli Dry Anomaly Price on a Yearly Basis",
       x = "Date",
       y = "Price (observed)") +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(anom_chili_yearly)
From the anomaly detection with 95% confidence intervals, we can conclude that all commodities tend to experience frequent anomalies in the form of drastic price increases, except for chilli. The chilli commodity often experiences anomalous price decreases, and this serves as an important indicator for Mr. Saitama in the future.
Next, we will further analyze the commodity ‘chilli’ concerning the months that frequently experience anomalies. This will help Mr. Saitama stay vigilant during those months when harvesting chilli crops.
anomaly_chilli <- top5_Comm_clean %>%
  select(Date, `Chilli Dry`) %>%
  as_tibble() %>%
  time_decompose(`Chilli Dry`,
                 frequency = "1 year",
                 trend = "1 year",
                 merge = TRUE) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  filter(anomaly == "Yes") %>%
  mutate(Year = year(Date),
         Month = month(Date, label = TRUE, abbr = FALSE)) %>%
  group_by(Year, Month) %>%
  summarise(Count = length(Month)) %>%
  ggplot(aes(x = Year, y = Count, fill = Month,
             text = glue("{Year}, {Month}
                         Count: {Count}"))) +
  geom_col() +
  scale_fill_hue() +
  labs(title = "The Chilli Dry Anomaly Price on a Yearly Basis") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme_minimal()

ggplotly(anomaly_chilli, tooltip = "text")
Based on the anomaly_chilli plot and the anomaly detection plots above, it can be observed that the months with frequent significant price declines for the chilli commodity are between June and September. However, the month that stands out the most, with a drastic decrease and requiring special attention, is September.
Great! Let’s begin the preparation steps for making the forecasting model. In this process, we will automatically select the appropriate model and perform the necessary preprocessing. Additionally, we will consider the time series seasonality for improved accuracy in our forecasts.
Cross Validation
The first step is to split the data for cross-validation. This time, we’ll set aside 120 days or four months for the test data.
# Data wrangling for cross-validation preparation
comm_name <- colnames(top5_Comm_clean)[-1]

top5_Comm_longer <- top5_Comm_clean %>%
  pivot_longer(cols = comm_name,
               names_to = "Commodity",
               values_to = "Price")

train_list <- list(
  `Cabbage(Local)` = top5_Comm_longer %>% filter(Commodity == "Cabbage(Local)") %>% select(Price) %>% head(-120),
  `Cauli Local` = top5_Comm_longer %>% filter(Commodity == "Cauli Local") %>% select(Price) %>% head(-120),
  `Raddish White(Local)` = top5_Comm_longer %>% filter(Commodity == "Raddish White(Local)") %>% select(Price) %>% head(-120),
  Ginger = top5_Comm_longer %>% filter(Commodity == "Ginger") %>% select(Price) %>% head(-120),
  `Chilli Dry` = top5_Comm_longer %>% filter(Commodity == "Chilli Dry") %>% select(Price) %>% head(-120)
) %>%
  enframe(name = "Commodity", value = "Price")

test_list <- list(
  `Cabbage(Local)` = top5_Comm_longer %>% filter(Commodity == "Cabbage(Local)") %>% select(Price) %>% tail(120),
  `Cauli Local` = top5_Comm_longer %>% filter(Commodity == "Cauli Local") %>% select(Price) %>% tail(120),
  `Raddish White(Local)` = top5_Comm_longer %>% filter(Commodity == "Raddish White(Local)") %>% select(Price) %>% tail(120),
  Ginger = top5_Comm_longer %>% filter(Commodity == "Ginger") %>% select(Price) %>% tail(120),
  `Chilli Dry` = top5_Comm_longer %>% filter(Commodity == "Chilli Dry") %>% select(Price) %>% tail(120)
) %>%
  enframe(name = "Commodity", value = "Price")

rmarkdown::paged_table(test_list)
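As a quick sanity check (optional), we can confirm that every commodity keeps exactly 120 observations for testing, with the remaining daily observations used for training:
# Rows held out for testing and kept for training, per commodity
map_int(test_list$Price, nrow)    # should be 120 for every commodity
map_int(train_list$Price, nrow)   # the remaining daily observations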
Preprocessing Specification
We will experiment with multiple preprocessing approaches, since transformed data may perform better than the original scale. We will apply the following treatments:
- No data transformation
- Square root of the value (reversed later by squaring)
- Scaled value; the data will be normalized using a Z-score
- Log transformation
recipe_spec <- list(
  normal_spec = function(x) x,          # no transformation
  squared_spec = function(x) sqrt(x),   # take the square root of the price
  scale_spec = function(x) scale(x),    # normalize the price with a Z-score
  log_spec = function(x) log(x + 1)     # convert the price to a log scale
) %>%
  enframe(name = "preprocess_name", value = "preprocess_spec")

rmarkdown::paged_table(recipe_spec)
Additionally, we need to create a function to revert the scaled values back to their original form for later model evaluation. For instance, if the data is scaled using square root, the function should undo the scaling by squaring the data. This step is essential to ensure accurate evaluation and interpretation of the model results.
reverse_spec <- list(
  normal_spec = function(x, y) {
    y <- y
    return(x)
  },
  squared_spec = function(x, y) {
    y <- y
    return(x^2)
  },
  scale_spec = function(x, y) x * sd(y) + mean(y),
  log_spec = function(x, y) {
    y <- y
    return(exp(x) - 1)
  }
) %>%
  enframe(name = "reverse_name", value = "reverse_spec")

# join the preprocessing functions with their scale-back functions
recipe_spec <- recipe_spec %>%
  bind_cols(reverse_spec)

rmarkdown::paged_table(recipe_spec)
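As a quick sanity check (using a toy vector, not part of the actual workflow), we can verify that each reverse function undoes its matching transformation; the indices below assume the row order normal, squared, scale, log shown above.
# Round-trip check of the transform / scale-back pairs (toy values)
toy_price <- c(10, 20, 30, 40)

sq <- recipe_spec$preprocess_spec[[2]](toy_price)          # squared_spec: sqrt()
recipe_spec$reverse_spec[[2]](sq, toy_price)               # back to 10 20 30 40

sc <- recipe_spec$preprocess_spec[[3]](toy_price)          # scale_spec: Z-score
recipe_spec$reverse_spec[[3]](as.numeric(sc), toy_price)   # back to 10 20 30 40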
Impute Outliers
Additionally, we will explore preprocessing the data by determining
whether outliers should be replaced or kept. If we decide to replace
outliers, we will identify them and estimate the replacement using the
tsoutliers
function. Residuals will be detected by fitting
a loess curve for non-seasonal data and using periodic STL decomposition
for seasonal data. This approach will help us handle outliers and
improve the data quality before feeding it into the forecasting
model.
outlier_spec <- list(
  normal_spec = function(x) x,   # no transformation, keep outliers
  out_spec = function(x) {
    outlier_place <- tsoutliers(x)
    x[outlier_place$index] <- outlier_place$replacements
    return(x)
  }
) %>%
  enframe(name = "outlier_name", value = "out_spec")

rmarkdown::paged_table(outlier_spec)
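To illustrate what out_spec does, here is a small sketch on a synthetic series (not data from the dataset): tsoutliers() returns the index of suspected outliers together with suggested replacement values, and out_spec substitutes those replacements into the series.
# Synthetic series with one injected spike (toy example)
set.seed(100)
toy_ts <- ts(50 + 10 * sin(1:120 / 10) + rnorm(120), frequency = 30)
toy_ts[60] <- 150                          # artificial outlier

tsoutliers(toy_ts)                         # suggested $index and $replacements
outlier_spec$out_spec[[2]](toy_ts)[60]     # the spike should be replaced by a value near the trend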
Seasonality Specification
The next step is to specify the seasonal period for the series. We will try several seasonal periods, including:
- monthly seasonality
- annual seasonality
- monthly and annual seasonality (multi-seasonal)
seasonal_forecast <- list(
  monthly = function(x) ts(x, frequency = 30),
  yearly = function(x) ts(x, frequency = 365),
  monthly_yearly = function(x) msts(x, seasonal.periods = c(30, 365))
) %>%
  enframe(name = "season_name", value = "season_spec")

rmarkdown::paged_table(seasonal_forecast)
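For intuition, the three specifications differ only in the seasonal metadata they attach to the series (a brief illustration on a toy vector): ts() stores a single frequency, while msts() keeps both seasonal periods so stlm() can decompose on each of them.
# Seasonal metadata attached by each specification (toy vector)
toy <- rnorm(800, mean = 50)
frequency(seasonal_forecast$season_spec[[1]](toy))      # 30  (monthly)
frequency(seasonal_forecast$season_spec[[2]](toy))      # 365 (yearly)
attr(seasonal_forecast$season_spec[[3]](toy), "msts")   # 30 365 (multi-seasonal)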
Model Specification
Next, we will specify the models that the data will be fit to. The models include:
- Holt-Winters Exponential Smoothing
- STL + ETS
- STL + ARIMA
model_forecast <- list(
  holt_winters = function(x) HoltWinters(x),
  stl_ets = function(x) stlm(x, method = "ets"),
  stl_arima = function(x) stlm(x, method = "arima")
) %>%
  enframe(name = "model_name", value = "model_spec")

rmarkdown::paged_table(model_forecast)
Model Fitting
Firstly, we will combine all the lists of preprocessing steps and model specifications with the lists of data for training and testing. This integration will allow us to apply the defined preprocessing methods and model settings to the respective training and testing datasets. By doing this, we can systematically implement the entire forecasting process and ensure consistency in the analysis.
train_process <- crossing(train_list, recipe_spec, seasonal_forecast, outlier_spec, model_forecast)

test_process <- crossing(test_list, recipe_spec, seasonal_forecast, outlier_spec, model_forecast)

rmarkdown::paged_table(train_process)
Next, the price data will be transformed according to the preprocessing steps mentioned earlier. By applying these transformations, we can ensure that the data is appropriately prepared for the subsequent steps in the forecasting model.
tranformasi_data <- map2(.x = train_process$Price,
                         .y = train_process$preprocess_spec,
                         .f = ~exec(.y, .x))
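The map2()/exec() idiom used here (and again in the model-fitting step below) simply calls each stored function on its paired data; a minimal sketch with toy inputs:
# exec() calls a function with the supplied argument, so map2() can pair
# each price series with the function stored in the specification column.
exec(sqrt, 16)                                    # 4
map2(.x = list(c(1, 4, 9), c(10, 100)),
     .y = list(function(x) sqrt(x), function(x) log10(x)),
     .f = ~exec(.y, .x))                          # list(c(1, 2, 3), c(1, 2))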
Then, the automated model selection process with the predefined preprocessing steps will be executed to ensure efficiency. During this stage, the forecasting process will also be carried out simultaneously. By automating the model selection and forecasting, we can streamline the analysis and obtain predictions more efficiently. This approach allows us to quickly evaluate multiple models with different preprocessing techniques and identify the most suitable one for our dataset.
forecast_map <- tranformasi_data %>%
  # Convert data to time series with different seasonalities
  map2(.y = train_process$season_spec,
       .f = ~exec(.y, .x)) %>%
  # Transform outliers
  map2(.y = train_process$out_spec,
       .f = ~exec(.y, .x)) %>%
  # Fit data to the time series model
  map2(.y = train_process$model_spec,
       .f = ~exec(.y, .x)) %>%
  # Forecast for the next 4 months
  map(forecast, h = 120) %>%
  # Take the mean of the forecast
  map(~pluck(.x, "mean")) %>%
  map(as.numeric)
The forecasting results will be transformed back to their original form to ensure that the forecasted data align with their true meaning, which is the commodity prices. This step is crucial as it allows us to interpret and utilize the forecasted values correctly in the context of the actual commodity prices. By reverting the data to its original scale, we ensure that the forecasting outcomes are meaningful and practical for decision-making.
transform_forecast <- list()

for (i in 1:length(forecast_map)) {
  transform_forecast[[i]] <- train_process$reverse_spec[[i]](x = forecast_map[[i]],
                                                             y = tranformasi_data[[i]])
}
Next, to evaluate the forecasting results, we will calculate the errors against the actual data using metrics such as MAE (Mean Absolute Error), RMSE (Root Mean Squared Error), and MAPE (Mean Absolute Percentage Error). These metrics will provide us with quantitative measures of how well the forecasting model performs in comparison to the actual commodity prices. By assessing the errors, we can determine the accuracy and reliability of the forecasting model and make any necessary adjustments to improve its performance.
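For intuition, here is a toy illustration (hypothetical numbers, not from the forecasts) of how these yardstick helpers behave; the same vectorized functions are applied to the real forecasts below.
# Toy illustration of the three error metrics (hypothetical values)
actual    <- c(100, 110, 120)
predicted <- c( 90, 115, 130)

yardstick::mae_vec(truth = actual, estimate = predicted)    # mean(|error|)               = 8.33
yardstick::rmse_vec(truth = actual, estimate = predicted)   # sqrt(mean(error^2))         = 8.66
yardstick::mape_vec(truth = actual, estimate = predicted)   # mean(|error| / truth) x 100 = 7.63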
mae_forecast <- transform_forecast %>%
  map2(.y = test_process$Price,
       .f = ~yardstick::mae_vec(.x %>% as.numeric(), .y$Price))

rmse_forecast <- transform_forecast %>%
  map2(.y = test_process$Price,
       .f = ~yardstick::rmse_vec(.x %>% as.numeric(), .y$Price))

mape_forecast <- transform_forecast %>%
  map2(.y = test_process$Price,
       .f = ~yardstick::mape_vec(.x %>% as.numeric(), .y$Price))

# show result
mae_rmse_mape <- train_process %>%
  select_if(is.character) %>%
  bind_cols(mae = mae_forecast %>% as.numeric()) %>%
  bind_cols(rmse = rmse_forecast %>% as.numeric()) %>%
  bind_cols(mape = mape_forecast %>% as.numeric()) %>%
  select(Commodity, mae, rmse, mape, everything())

rmarkdown::paged_table(mae_rmse_mape)
Let’s work together to select the best model for each commodity based on the smallest parameter values of the evaluation metrics.
mae_rmse_mape %>%
  group_by(Commodity) %>%
  arrange(mae) %>%
  select(Commodity, mae, rmse, mape, model_name, everything()) %>%
  slice(1)
#> # A tibble: 5 x 9
#> # Groups: Commodity [5]
#> Commodity mae rmse mape model_name preprocess_name reverse_name
#> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Cabbage(Local) 6.59 8.77 38.4 stl_ets normal_spec normal_spec
#> 2 Cauli Local 9.73 12.6 23.4 stl_ets normal_spec normal_spec
#> 3 Chilli Dry 3.86 10.2 1.21 stl_ets normal_spec normal_spec
#> 4 Ginger 5.80 8.43 7.87 stl_arima squared_spec squared_spec
#> 5 Raddish White(Local) 16.6 18.7 80.0 stl_ets normal_spec normal_spec
#> # i 2 more variables: season_name <chr>, outlier_name <chr>
For the Chilli Dry and Ginger commodities, the forecasting models have good accuracy, with MAPE being the most reliable metric to describe the accuracy; the smaller the MAPE value, the better the model's accuracy. However, the forecasting model failed to adequately model the Raddish White commodity. On the other hand, the accuracy for the Cabbage and Cauli commodities is not very high, but still acceptable.
Let’s create the best forecasting models for each commodity. Next, we will plot the results of the model fitting and forecasting against the actual price data. This way, we can visually compare the model predictions with the actual values to evaluate the performance of our forecasting models. The visualization will help us understand how well the models are capturing the price patterns and provide a clear representation of their accuracy and reliability. Let’s proceed with building the models and plotting the results!
best_adjust <- train_process %>%
  bind_cols(mae = mae_forecast %>% as.numeric()) %>%
  bind_cols(rmse = rmse_forecast %>% as.numeric()) %>%
  bind_cols(mape = mape_forecast %>% as.numeric()) %>%
  group_by(Commodity) %>%
  arrange(mae) %>%
  slice(1)

best_adjust
#> # A tibble: 5 x 15
#> # Groups: Commodity [5]
#> Commodity Price preprocess_name preprocess_spec reverse_name reverse_spec
#> <chr> <list> <chr> <list> <chr> <list>
#> 1 Cabbage(Lo~ <tibble> normal_spec <fn> normal_spec <fn>
#> 2 Cauli Local <tibble> normal_spec <fn> normal_spec <fn>
#> 3 Chilli Dry <tibble> normal_spec <fn> normal_spec <fn>
#> 4 Ginger <tibble> squared_spec <fn> squared_spec <fn>
#> 5 Raddish Wh~ <tibble> normal_spec <fn> normal_spec <fn>
#> # i 9 more variables: season_name <chr>, season_spec <list>,
#> # outlier_name <chr>, out_spec <list>, model_name <chr>, model_spec <list>,
#> # mae <dbl>, rmse <dbl>, mape <dbl>
model_best <- map2(.x = best_adjust$Price,
                   .y = best_adjust$preprocess_spec,
                   .f = ~exec(.y, .x)) %>%
  map2(.y = best_adjust$season_spec,
       .f = ~exec(.y, .x)) %>%
  map2(.y = best_adjust$out_spec,
       .f = ~exec(.y, .x)) %>%
  map2(.y = best_adjust$model_spec,
       .f = ~exec(.y, .x))
data_test1 <- test_list$Price[[1]] %>%
  ts(start = c(8, 215),
     end = c(8, 334),
     frequency = 365)

best_adjust$reverse_spec[[1]](model_best[[1]], 1)$fitted %>%
  autoplot(series = "model fitted",
           linewidth = 1.2) +
  autolayer(best_adjust$season_spec[[1]](train_list$Price[[1]]),
            color = "black",
            series = "train price",
            linewidth = 1.2,
            alpha = 0.5) +
  autolayer(best_adjust$reverse_spec[[1]](forecast(model_best[[1]], h = 120),
                                          best_adjust$Price[[1]]),
            series = "forecast price") +
  autolayer(data_test1,
            series = "test price") +
  scale_x_continuous(
    name = "Date",
    breaks = breaks_extended(6),
    labels = as.Date.numeric(seq(25, 525, 100) * 5 - 9,
                             origin = range(commodity_price$Date)[1])) +
  labs(
    title = glue("{best_adjust$Commodity[[1]]} Price"),
    y = "Price") +
  theme_economist_white(
    base_size = 9,
    gray_bg = F
  )
# test data for Cauli Local (the second commodity in test_list)
data_test2 <- test_list$Price[[2]] %>%
  ts(start = c(93, 10),
     end = c(97, 9),
     frequency = 30)

best_adjust$reverse_spec[[2]](model_best[[2]], 1)$fitted %>%
  autoplot(series = "model fitted",
           linewidth = 1.2) +
  autolayer(best_adjust$season_spec[[2]](train_list$Price[[2]]),
            color = "black",
            series = "train price",
            linewidth = 1.2,
            alpha = 0.5) +
  autolayer(best_adjust$reverse_spec[[2]](forecast(model_best[[2]], h = 120),
                                          best_adjust$Price[[2]]),
            series = "forecast price") +
  autolayer(data_test2,
            series = "test price") +
  scale_x_continuous(
    name = "Date",
    breaks = breaks_extended(6),
    labels = as.Date.numeric(seq(25, 525, 100) * 5 - 9,
                             origin = range(commodity_price$Date)[1])) +
  labs(
    title = glue("{best_adjust$Commodity[[2]]} Price"),
    y = "Price") +
  theme_economist_white(
    base_size = 9,
    gray_bg = F
  )
data_test3 <- test_list$Price[[5]] %>%
  ts(start = c(8, 215),
     end = c(8, 334),
     frequency = 365)

best_adjust$reverse_spec[[3]](model_best[[3]], 1)$fitted %>%
  autoplot(series = "model fitted",
           linewidth = 1.2) +
  autolayer(best_adjust$season_spec[[3]](train_list$Price[[5]]),
            color = "black",
            series = "train price",
            linewidth = 1.2,
            alpha = 0.5) +
  autolayer(best_adjust$reverse_spec[[3]](forecast(model_best[[3]], h = 120),
                                          best_adjust$Price[[3]]),
            series = "forecast price") +
  autolayer(data_test3,
            series = "test price") +
  scale_x_continuous(
    name = "Date",
    breaks = breaks_extended(6),
    labels = as.Date.numeric(seq(25, 525, 100) * 5 - 9,
                             origin = range(commodity_price$Date)[1])) +
  labs(
    title = glue("{best_adjust$Commodity[[3]]} Price"),
    y = "Price") +
  theme_economist_white(
    base_size = 9,
    gray_bg = F
  )
data_test4 <- test_list$Price[[4]] %>%
  ts(start = c(8, 215),
     end = c(8, 334),
     frequency = 365)

best_adjust$reverse_spec[[4]](model_best[[4]]$fitted, 1) %>%
  autoplot(series = "model fitted",
           linewidth = 1.2) +
  autolayer(best_adjust$season_spec[[4]](train_list$Price[[4]]),
            color = "black",
            series = "train price",
            linewidth = 1.2,
            alpha = 0.5) +
  autolayer(best_adjust$reverse_spec[[4]](forecast(model_best[[4]], h = 120)$mean,
                                          best_adjust$Price[[4]]),
            series = "forecast price") +
  autolayer(best_adjust$reverse_spec[[4]](forecast(model_best[[4]], h = 120)$lower,
                                          best_adjust$Price[[4]]),
            series = "lower forecast") +
  autolayer(best_adjust$reverse_spec[[4]](forecast(model_best[[4]], h = 120)$upper,
                                          best_adjust$Price[[4]]),
            series = "upper forecast") +
  autolayer(data_test4,
            series = "test price") +
  scale_x_continuous(
    name = "Date",
    breaks = breaks_extended(6),
    labels = as.Date.numeric(seq(25, 525, 100) * 5 - 9,
                             origin = range(commodity_price$Date)[1])) +
  labs(
    title = glue("{best_adjust$Commodity[[4]]} Price"),
    y = "Price") +
  theme_economist_white(
    base_size = 9,
    gray_bg = F
  )
data_test5 <- test_list$Price[[3]] %>%
  ts(start = c(8, 215),
     end = c(8, 334),
     frequency = 365)

best_adjust$reverse_spec[[5]](model_best[[5]], 1)$fitted %>%
  autoplot(series = "model fitted",
           linewidth = 1.2) +
  autolayer(best_adjust$season_spec[[5]](train_list$Price[[3]]),
            color = "black",
            series = "train price",
            linewidth = 1.2,
            alpha = 0.5) +
  autolayer(best_adjust$reverse_spec[[5]](forecast(model_best[[5]], h = 120),
                                          best_adjust$Price[[3]]),
            series = "forecast price") +
  autolayer(data_test5,
            series = "test price") +
  scale_x_continuous(
    name = "Date",
    breaks = breaks_extended(6),
    labels = as.Date.numeric(seq(25, 525, 100) * 5 - 9,
                             origin = range(commodity_price$Date)[1])) +
  labs(
    title = glue("{best_adjust$Commodity[[5]]} Price"),
    y = "Price") +
  theme_economist_white(
    base_size = 9,
    gray_bg = F
  )
Checking Assumptions
The assumption checks apply specifically to regression-based models (ARIMA, SARIMA -> autoregressive). So, the assumption checks will be performed only on the model for the Ginger commodity.
1. No Autocorrelation in the Residuals
This assumption is to check whether there is still autocorrelation in the residuals. If yes, it indicates that there are still patterns in the data that the model has not captured for forecasting. In that case, the model needs to undergo tuning and further adjustments to improve its performance.
\(H_0\): the residuals have no autocorrelation
\(H_1\): the residuals have autocorrelation
Box.test(model_best[[4]]$residuals, type = "Ljung-Box")
#>
#> Box-Ljung test
#>
#> data: model_best[[4]]$residuals
#> X-squared = 0.00011024, df = 1, p-value = 0.9916
Because the p-value is greater than 0.05, we fail to reject the null hypothesis (H0), indicating that the residuals have no autocorrelation. This means the assumption of no autocorrelation is met and the model has captured the relevant patterns in the data for forecasting.
2. Normality of Residual
To check the normality of residuals in the time series forecasting results, we can use a histogram plot of the residuals. The histogram will display the frequency distribution of the residuals, and if the distribution is close to a normal distribution (bell-shaped curve), it can be considered that the residuals meet the assumption of normality. However, if the distribution deviates significantly from normal, it may require data transformation or consideration of alternative methods to address the non-normality in the residuals.
ggplotly(gghistogram(model_best[[4]]$residuals)+
labs(title="Ginger : Histogram of residuals", x = "residuals"))
Based on the plot, we can see that the regression-based model meets the assumption of normality in the distribution of residuals. This indicates that the residuals are approximately normally distributed, which is a positive sign for the accuracy and reliability of the forecasting model. It gives us confidence that the model assumptions are valid and the forecasts are robust.