Picture are taken from Kaggle

Introduction

1.1 Greetings

Hello Everyone.

Welcome to my Rmd.

This is my HTML_Document about IMF Zinc Price Forecast.

Hope you can enjoy that!

1.2 What We Will Do

We will learn to forecast the future prices of zinc by using IMF Zinc Price Forecast dataset.

Data Source: https://www.kaggle.com/datasets/yasserh/imf-zinc-price-forecast-dataset

1.3 About Dataset

A simple yet challenging project, to forecast the IMF commodity price of Zinc, based on monthly totals zinc price recorded from 1980 to 2016.

1.4 Business Goal

We will try to overcome these obstacles & Forecast its future prices.

Data Preparing

Library and Setup

library(pageviews)
library(dplyr)
library(forecast)
library(plotly)
library(TTR)
library(MLmetrics)
library(tseries)

In this project we want to forecast the future prices of zinc by using IMF Zinc Price dataset.The datasets contains 2 columns, Date and Price. The Date started from 1980 till 2016.

Read Data

zinc <- read.csv("data_input/zinc.csv")
str(zinc)
## 'data.frame':    434 obs. of  2 variables:
##  $ Date : chr  "1-Jan-80" "1-Feb-80" "1-Mar-80" "1-Apr-80" ...
##  $ Price: num  774 869 741 708 701 ...

After reading the dataset, we can see if the type of Date column is not right yet. So we need to convert the type of the column.

Data Preprocessing

zinc <- zinc %>% 
  mutate(Date = as.Date(paste(Date,sep="-"),"%d-%b-%y")) %>% 
  arrange(Date)
glimpse(zinc)
## Rows: 434
## Columns: 2
## $ Date  <date> 1980-01-01, 1980-02-01, 1980-03-01, 1980-04-01, 1980-05-01, 198~
## $ Price <dbl> 773.82, 868.62, 740.75, 707.68, 701.07, 676.82, 712.09, 767.21, ~
head(zinc)
##         Date  Price
## 1 1980-01-01 773.82
## 2 1980-02-01 868.62
## 3 1980-03-01 740.75
## 4 1980-04-01 707.68
## 5 1980-05-01 701.07
## 6 1980-06-01 676.82

Why should we change the type of Date column? Because in Time Series and Forecasting, the variables that will be used are time and y, so that we need to convert from character type to Date.

We already have some rules when doing Time Series and Forecasting, such as :

  • The dataset can’t contain NA.

  • The dataset must be arrange by Date column.

  • The dataset can’t contain empty value in Date column.

Cek Missing Value

anyNA(zinc)
## [1] FALSE
colSums(is.na(zinc))
##  Date Price 
##     0     0

In this dataset, there are no missing value or NA datas. The dataset already has been arranged and there are no missing value in Date column too. So we can continue to the next step.

Data Time Series

Changing type of Date column to Time series

We will use a function to change from date to timeseries object. The function is ts(), and it will be followed by some parameters like :

  • price = is value of y that will be observed.

  • start = the beginning of the time series, in this case is 1980.

  • frequency = amount of data in 1 season, the pattern that we will use is 12 (months in a year)

zinc_ts <- ts(zinc$Price, start = c(1980,1), frequency = 12)
autoplot(zinc_ts, series = "Actual")

Decompose data time Series

The purpose of Decompose is to break ts to be some components like :

  • Trend = the mean movement globally

  • Seasonal = data pattern per frequency

  • Error = a value that Trend and Seasonal cannot get

zinc_dec <- decompose(zinc_ts)
autoplot(zinc_dec)

  • We want to check the pattern or seasonal
autoplot(zinc_dec$x - zinc_dec$trend)

  • If we want to check trend using adjusted seasonal (discarding the effects of seasonal data)
autoplot(zinc_dec$x - zinc_dec$seasonal)

  • The data ts has trend and seassonal too. So we can assume that we don’t use SES for modelling.

Build Model

Cross Validation

We can separate the dataset into data train and data test.

nrow(zinc)
## [1] 434
zinc_train <- zinc_ts %>% 
  head(-34)

zinc_test <- zinc_ts %>% 
  tail(34)

#zinc_train <- zinc_ts[88:434]

#zinc_test <- zinc_ts[1:87]

length(zinc_train)
## [1] 400
length(zinc_test)
## [1] 34
head(zinc_train)
##         Jan    Feb    Mar    Apr    May    Jun
## 1980 773.82 868.62 740.75 707.68 701.07 676.82
tail(zinc)
##           Date   Price
## 429 2015-09-01 1720.23
## 430 2015-10-01 1724.34
## 431 2015-11-01 1583.31
## 432 2015-12-01 1527.79
## 433 2016-01-01 1520.36
## 434 2016-02-01 1709.85
head(zinc_test)
##          May     Jun     Jul     Aug     Sep     Oct
## 2013 1831.01 1839.01 1837.62 1896.39 1846.88 1884.84

Double Exponential Smoothing (Holt)

zinc_holt_ets <- ets(zinc_train, model = "AAA")
zinc_holt_ets
## ETS(A,Ad,A) 
## 
## Call:
##  ets(y = zinc_train, model = "AAA") 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 0.1413 
##     gamma = 1e-04 
##     phi   = 0.8 
## 
##   Initial states:
##     l = 707.5389 
##     b = 7.4116 
##     s = -8.2147 -16.8277 -17.5468 -1.9369 1.106 -3.9202
##            -5.4498 37.9439 31.9173 9.5334 -12.4781 -14.1266
## 
##   sigma:  114.7689
## 
##      AIC     AICc      BIC 
## 6209.550 6211.346 6281.397
autoplot(zinc_holt_ets$fitted, series = "AAA") +
  autolayer(zinc_train, series = "Actual")

  • Compare with HoltWinters(data)
zinc_hw <- HoltWinters(zinc_train)

Check error.

sqrt(zinc_holt_ets$mse)
## [1] 112.3036
sqrt(MSE(zinc_hw$fitted[,1], zinc_train))
## [1] 121.6185

From the result of RMSE above, we can see that Double Exponential Smoothing (Holt) model has the less error than Holtwinters model.

ARIMA

ARIMA is an acronyms for Auto Regresive Integreted Moving Averange. ARIMA is great for predicting a stationary time series data. Stationary time series data has values which move around its mean (no trend nor seasonality). It predicts future values by combining Auto Regresive and Moving Averange method, whereas integrated stands for how many times differencing applied to the data (a way to transform non-stationary into stationary time series data). There is also a special approach for stationary data with seasonality that is using Seasonal ARIMA (SARIMA).

Check stationary data

zinc_train %>% 
  adf.test()
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -3.998, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

The result is the dataset is stationary data with p-value < 0.05, so we don’t need to do differencing.

Check PACF

The purpose of PACF is to see how strong or how great the correlation of data to lag.

zinc_train %>% 
  pacf()

The result is first lag has high spike, followed the second lag. It means we need to do tunning AR for ARIMA.

Model Tunning

# AR= 2, I=0, MA=0
zinc_train_200 <- arima(zinc_train, order = c(2,0,0))
zinc_train_200$residuals %>% 
  pacf()

# AR= 2, I=0, MA=15
zinc_train_2015 <- arima(zinc_train, order = c(2,0,15))
zinc_train_2015$residuals %>% 
  pacf()

From the result we can use ARIMA model : 2,0,15 which gave many lags that have no autocol.

diff(zinc_ts) %>% tsdisplay()

AUTO ARIMA

We can make ARIMA model using automatic way called auto.arima()

model_auto <- auto.arima(zinc_ts, seasonal = F)
model_auto
## Series: zinc_ts 
## ARIMA(0,1,1) 
## 
## Coefficients:
##          ma1
##       0.3312
## s.e.  0.0443
## 
## sigma^2 = 11699:  log likelihood = -2641.97
## AIC=5287.95   AICc=5287.98   BIC=5296.09
model_auto_train <- auto.arima(zinc_train, seasonal = F)
model_auto_train
## Series: zinc_train 
## ARIMA(0,1,1) 
## 
## Coefficients:
##          ma1
##       0.3338
## s.e.  0.0460
## 
## sigma^2 = 12065:  log likelihood = -2440.62
## AIC=4885.25   AICc=4885.28   BIC=4893.23
model_auto_trainT <- auto.arima(zinc_train, seasonal = T)
model_auto_trainT
## Series: zinc_train 
## ARIMA(0,1,1) 
## 
## Coefficients:
##          ma1
##       0.3338
## s.e.  0.0460
## 
## sigma^2 = 12065:  log likelihood = -2440.62
## AIC=4885.25   AICc=4885.28   BIC=4893.23
model_auto$aic
## [1] 5287.947
accuracy(model_auto)
##                    ME     RMSE      MAE        MPE     MAPE     MASE
## Training set 1.726787 107.9135 67.06823 0.01942194 4.535081 0.205718
##                     ACF1
## Training set 0.008722621
# Visualisasi
zinc_ts %>% autoplot()+
  autolayer(model_auto$fitted)

adf.test(zinc_train)
## Warning in adf.test(zinc_train): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  zinc_train
## Dickey-Fuller = -3.998, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
zinc_auto <- auto.arima(zinc_train, seasonal = T)
sqrt(MSE(zinc_auto$fitted, zinc_train))
## [1] 109.5647
MAPE(zinc_auto$fitted, zinc_train)
## [1] 0.04667532

Then compare AIC from each model that we have built.

zinc_train_2015$aic
## [1] 4887.642
zinc_auto$aic
## [1] 4885.249
  • We choose zinc_auto (Auto Arima) because the AIC is smaller than model ARIMA.

Compare RMSE from each model and the best model will be used for predicton.

sqrt(zinc_holt_ets$mse)
## [1] 112.3036
sqrt(MSE(zinc_hw$fitted[,1], zinc_train))
## [1] 121.6185
sqrt(MSE(zinc_auto$fitted, zinc_train))
## [1] 109.5647

The result is AUTO ARIMA model has the smaller RMSE so we will use this model for prediction.

Forecasting

Using Auto Arima Model

forecasting_zinc_AUTO <- forecast(zinc_auto, h = 34)
forecasting_zinc_AUTO %>% 
  autoplot() +
  autolayer(zinc_test)

forecasting_zinc_AUTO
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## May 2013       1857.447 1716.6819 1998.212 1642.1654 2072.729
## Jun 2013       1857.447 1622.7901 2092.104 1498.5703 2216.324
## Jul 2013       1857.447 1556.9260 2157.968 1397.8397 2317.054
## Aug 2013       1857.447 1503.0998 2211.794 1315.5197 2399.374
## Sep 2013       1857.447 1456.4346 2258.460 1244.1514 2470.743
## Oct 2013       1857.447 1414.6603 2300.234 1180.2632 2534.631
## Nov 2013       1857.447 1376.5010 2338.393 1121.9035 2592.991
## Dec 2013       1857.447 1341.1543 2373.740 1067.8455 2647.049
## Jan 2014       1857.447 1308.0772 2406.817 1017.2584 2697.636
## Feb 2014       1857.447 1276.8816 2438.013  969.5488 2745.345
## Mar 2014       1857.447 1247.2788 2467.615  924.2752 2790.619
## Apr 2014       1857.447 1219.0472 2495.847  881.0987 2833.795
## May 2014       1857.447 1192.0123 2522.882  839.7524 2875.142
## Jun 2014       1857.447 1166.0337 2548.860  800.0215 2914.873
## Jul 2014       1857.447 1140.9964 2573.898  761.7303 2953.164
## Aug 2014       1857.447 1116.8051 2598.089  724.7329 2990.161
## Sep 2014       1857.447 1093.3792 2621.515  688.9062 3025.988
## Oct 2014       1857.447 1070.6506 2644.244  654.1457 3060.748
## Nov 2014       1857.447 1048.5603 2666.334  620.3616 3094.533
## Dec 2014       1857.447 1027.0575 2687.837  587.4759 3127.418
## Jan 2015       1857.447 1006.0976 2708.796  555.4205 3159.474
## Feb 2015       1857.447  985.6415 2729.253  524.1356 3190.759
## Mar 2015       1857.447  965.6545 2749.240  493.5681 3221.326
## Apr 2015       1857.447  946.1057 2768.788  463.6708 3251.223
## May 2015       1857.447  926.9676 2787.927  434.4015 3280.493
## Jun 2015       1857.447  908.2152 2806.679  405.7222 3309.172
## Jul 2015       1857.447  889.8262 2825.068  377.5986 3337.295
## Aug 2015       1857.447  871.7802 2843.114  349.9997 3364.894
## Sep 2015       1857.447  854.0587 2860.835  322.8970 3391.997
## Oct 2015       1857.447  836.6448 2878.249  296.2647 3418.629
## Nov 2015       1857.447  819.5230 2895.371  270.0792 3444.815
## Dec 2015       1857.447  802.6791 2912.215  244.3187 3470.575
## Jan 2016       1857.447  786.1000 2928.794  218.9632 3495.931
## Feb 2016       1857.447  769.7736 2945.120  193.9941 3520.900

Evaluation

Error Checking

RMSE check

sqrt(MSE(forecasting_zinc_AUTO$mean, zinc_test))
## [1] 253.525

MAPE check

MAPE(forecasting_zinc_AUTO$mean, zinc_test)
## [1] 0.1035662

Assumption check

Norma Distribution Check

hist(forecasting_zinc_AUTO$residuals, breaks = 100)

Autocol Check

Box.test(zinc_auto$residuals, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  zinc_auto$residuals
## X-squared = 0.026856, df = 1, p-value = 0.8698

Conclusion

By the result of evaluation that we have made, we can get conclusion that Auto ARIMA model is the best model so far for this case. The result of autocol check has p-value > 0.5, it means there are no autocol in Auto ARIMA model.