Picture are taken from Kaggle
Hello Everyone.
Welcome to my Rmd.
This is my HTML_Document about IMF Zinc Price Forecast.
Hope you can enjoy that!
We will learn to forecast the future prices of zinc by using IMF Zinc Price Forecast dataset.
Data Source: https://www.kaggle.com/datasets/yasserh/imf-zinc-price-forecast-dataset
A simple yet challenging project, to forecast the IMF commodity price of Zinc, based on monthly totals zinc price recorded from 1980 to 2016.
We will try to overcome these obstacles & Forecast its future prices.
Library and Setup
library(pageviews)
library(dplyr)
library(forecast)
library(plotly)
library(TTR)
library(MLmetrics)
library(tseries)
In this project we want to forecast the future prices of zinc by
using IMF Zinc Price dataset.The datasets contains 2 columns,
Date and Price. The Date started from 1980
till 2016.
Read Data
zinc <- read.csv("data_input/zinc.csv")
str(zinc)
## 'data.frame': 434 obs. of 2 variables:
## $ Date : chr "1-Jan-80" "1-Feb-80" "1-Mar-80" "1-Apr-80" ...
## $ Price: num 774 869 741 708 701 ...
After reading the dataset, we can see if the type of
Date column is not right yet. So we need to convert the
type of the column.
zinc <- zinc %>%
mutate(Date = as.Date(paste(Date,sep="-"),"%d-%b-%y")) %>%
arrange(Date)
glimpse(zinc)
## Rows: 434
## Columns: 2
## $ Date <date> 1980-01-01, 1980-02-01, 1980-03-01, 1980-04-01, 1980-05-01, 198~
## $ Price <dbl> 773.82, 868.62, 740.75, 707.68, 701.07, 676.82, 712.09, 767.21, ~
head(zinc)
## Date Price
## 1 1980-01-01 773.82
## 2 1980-02-01 868.62
## 3 1980-03-01 740.75
## 4 1980-04-01 707.68
## 5 1980-05-01 701.07
## 6 1980-06-01 676.82
Why should we change the type of Date column? Because in
Time Series and Forecasting, the variables that will be used are time
and y, so that we need to convert from character type to Date.
We already have some rules when doing Time Series and Forecasting, such as :
The dataset can’t contain NA.
The dataset must be arrange by Date column.
The dataset can’t contain empty value in Date column.
Cek Missing Value
anyNA(zinc)
## [1] FALSE
colSums(is.na(zinc))
## Date Price
## 0 0
In this dataset, there are no missing value or NA datas. The dataset already has been arranged and there are no missing value in Date column too. So we can continue to the next step.
Changing type of Date column to Time series
We will use a function to change from date to timeseries object. The
function is ts(), and it will be followed by some
parameters like :
price = is value of y that will be observed.
start = the beginning of the time series, in this case is 1980.
frequency = amount of data in 1 season, the pattern that we will use is 12 (months in a year)
zinc_ts <- ts(zinc$Price, start = c(1980,1), frequency = 12)
autoplot(zinc_ts, series = "Actual")
Decompose data time Series
The purpose of Decompose is to break ts to be some components like :
Trend = the mean movement globally
Seasonal = data pattern per frequency
Error = a value that Trend and Seasonal cannot get
zinc_dec <- decompose(zinc_ts)
autoplot(zinc_dec)
autoplot(zinc_dec$x - zinc_dec$trend)
autoplot(zinc_dec$x - zinc_dec$seasonal)
Cross Validation
We can separate the dataset into data train and data test.
nrow(zinc)
## [1] 434
zinc_train <- zinc_ts %>%
head(-34)
zinc_test <- zinc_ts %>%
tail(34)
#zinc_train <- zinc_ts[88:434]
#zinc_test <- zinc_ts[1:87]
length(zinc_train)
## [1] 400
length(zinc_test)
## [1] 34
head(zinc_train)
## Jan Feb Mar Apr May Jun
## 1980 773.82 868.62 740.75 707.68 701.07 676.82
tail(zinc)
## Date Price
## 429 2015-09-01 1720.23
## 430 2015-10-01 1724.34
## 431 2015-11-01 1583.31
## 432 2015-12-01 1527.79
## 433 2016-01-01 1520.36
## 434 2016-02-01 1709.85
head(zinc_test)
## May Jun Jul Aug Sep Oct
## 2013 1831.01 1839.01 1837.62 1896.39 1846.88 1884.84
Double Exponential Smoothing (Holt)
zinc_holt_ets <- ets(zinc_train, model = "AAA")
zinc_holt_ets
## ETS(A,Ad,A)
##
## Call:
## ets(y = zinc_train, model = "AAA")
##
## Smoothing parameters:
## alpha = 0.9999
## beta = 0.1413
## gamma = 1e-04
## phi = 0.8
##
## Initial states:
## l = 707.5389
## b = 7.4116
## s = -8.2147 -16.8277 -17.5468 -1.9369 1.106 -3.9202
## -5.4498 37.9439 31.9173 9.5334 -12.4781 -14.1266
##
## sigma: 114.7689
##
## AIC AICc BIC
## 6209.550 6211.346 6281.397
autoplot(zinc_holt_ets$fitted, series = "AAA") +
autolayer(zinc_train, series = "Actual")
zinc_hw <- HoltWinters(zinc_train)
Check error.
sqrt(zinc_holt_ets$mse)
## [1] 112.3036
sqrt(MSE(zinc_hw$fitted[,1], zinc_train))
## [1] 121.6185
From the result of RMSE above, we can see that Double Exponential Smoothing (Holt) model has the less error than Holtwinters model.
ARIMA
ARIMA is an acronyms for Auto Regresive Integreted Moving Averange. ARIMA is great for predicting a stationary time series data. Stationary time series data has values which move around its mean (no trend nor seasonality). It predicts future values by combining Auto Regresive and Moving Averange method, whereas integrated stands for how many times differencing applied to the data (a way to transform non-stationary into stationary time series data). There is also a special approach for stationary data with seasonality that is using Seasonal ARIMA (SARIMA).
Check stationary data
zinc_train %>%
adf.test()
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -3.998, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
The result is the dataset is stationary data with p-value < 0.05, so we don’t need to do differencing.
Check PACF
The purpose of PACF is to see how strong or how great the correlation of data to lag.
zinc_train %>%
pacf()
The result is first lag has high spike, followed the second lag. It means we need to do tunning AR for ARIMA.
Model Tunning
# AR= 2, I=0, MA=0
zinc_train_200 <- arima(zinc_train, order = c(2,0,0))
zinc_train_200$residuals %>%
pacf()
# AR= 2, I=0, MA=15
zinc_train_2015 <- arima(zinc_train, order = c(2,0,15))
zinc_train_2015$residuals %>%
pacf()
From the result we can use ARIMA model : 2,0,15 which gave many lags that have no autocol.
diff(zinc_ts) %>% tsdisplay()
AUTO ARIMA
We can make ARIMA model using automatic way called auto.arima()
model_auto <- auto.arima(zinc_ts, seasonal = F)
model_auto
## Series: zinc_ts
## ARIMA(0,1,1)
##
## Coefficients:
## ma1
## 0.3312
## s.e. 0.0443
##
## sigma^2 = 11699: log likelihood = -2641.97
## AIC=5287.95 AICc=5287.98 BIC=5296.09
model_auto_train <- auto.arima(zinc_train, seasonal = F)
model_auto_train
## Series: zinc_train
## ARIMA(0,1,1)
##
## Coefficients:
## ma1
## 0.3338
## s.e. 0.0460
##
## sigma^2 = 12065: log likelihood = -2440.62
## AIC=4885.25 AICc=4885.28 BIC=4893.23
model_auto_trainT <- auto.arima(zinc_train, seasonal = T)
model_auto_trainT
## Series: zinc_train
## ARIMA(0,1,1)
##
## Coefficients:
## ma1
## 0.3338
## s.e. 0.0460
##
## sigma^2 = 12065: log likelihood = -2440.62
## AIC=4885.25 AICc=4885.28 BIC=4893.23
model_auto$aic
## [1] 5287.947
accuracy(model_auto)
## ME RMSE MAE MPE MAPE MASE
## Training set 1.726787 107.9135 67.06823 0.01942194 4.535081 0.205718
## ACF1
## Training set 0.008722621
# Visualisasi
zinc_ts %>% autoplot()+
autolayer(model_auto$fitted)
adf.test(zinc_train)
## Warning in adf.test(zinc_train): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: zinc_train
## Dickey-Fuller = -3.998, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
zinc_auto <- auto.arima(zinc_train, seasonal = T)
sqrt(MSE(zinc_auto$fitted, zinc_train))
## [1] 109.5647
MAPE(zinc_auto$fitted, zinc_train)
## [1] 0.04667532
Then compare AIC from each model that we have built.
zinc_train_2015$aic
## [1] 4887.642
zinc_auto$aic
## [1] 4885.249
Compare RMSE from each model and the best model will be used for predicton.
sqrt(zinc_holt_ets$mse)
## [1] 112.3036
sqrt(MSE(zinc_hw$fitted[,1], zinc_train))
## [1] 121.6185
sqrt(MSE(zinc_auto$fitted, zinc_train))
## [1] 109.5647
The result is AUTO ARIMA model has the smaller RMSE so we will use this model for prediction.
Using Auto Arima Model
forecasting_zinc_AUTO <- forecast(zinc_auto, h = 34)
forecasting_zinc_AUTO %>%
autoplot() +
autolayer(zinc_test)
forecasting_zinc_AUTO
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## May 2013 1857.447 1716.6819 1998.212 1642.1654 2072.729
## Jun 2013 1857.447 1622.7901 2092.104 1498.5703 2216.324
## Jul 2013 1857.447 1556.9260 2157.968 1397.8397 2317.054
## Aug 2013 1857.447 1503.0998 2211.794 1315.5197 2399.374
## Sep 2013 1857.447 1456.4346 2258.460 1244.1514 2470.743
## Oct 2013 1857.447 1414.6603 2300.234 1180.2632 2534.631
## Nov 2013 1857.447 1376.5010 2338.393 1121.9035 2592.991
## Dec 2013 1857.447 1341.1543 2373.740 1067.8455 2647.049
## Jan 2014 1857.447 1308.0772 2406.817 1017.2584 2697.636
## Feb 2014 1857.447 1276.8816 2438.013 969.5488 2745.345
## Mar 2014 1857.447 1247.2788 2467.615 924.2752 2790.619
## Apr 2014 1857.447 1219.0472 2495.847 881.0987 2833.795
## May 2014 1857.447 1192.0123 2522.882 839.7524 2875.142
## Jun 2014 1857.447 1166.0337 2548.860 800.0215 2914.873
## Jul 2014 1857.447 1140.9964 2573.898 761.7303 2953.164
## Aug 2014 1857.447 1116.8051 2598.089 724.7329 2990.161
## Sep 2014 1857.447 1093.3792 2621.515 688.9062 3025.988
## Oct 2014 1857.447 1070.6506 2644.244 654.1457 3060.748
## Nov 2014 1857.447 1048.5603 2666.334 620.3616 3094.533
## Dec 2014 1857.447 1027.0575 2687.837 587.4759 3127.418
## Jan 2015 1857.447 1006.0976 2708.796 555.4205 3159.474
## Feb 2015 1857.447 985.6415 2729.253 524.1356 3190.759
## Mar 2015 1857.447 965.6545 2749.240 493.5681 3221.326
## Apr 2015 1857.447 946.1057 2768.788 463.6708 3251.223
## May 2015 1857.447 926.9676 2787.927 434.4015 3280.493
## Jun 2015 1857.447 908.2152 2806.679 405.7222 3309.172
## Jul 2015 1857.447 889.8262 2825.068 377.5986 3337.295
## Aug 2015 1857.447 871.7802 2843.114 349.9997 3364.894
## Sep 2015 1857.447 854.0587 2860.835 322.8970 3391.997
## Oct 2015 1857.447 836.6448 2878.249 296.2647 3418.629
## Nov 2015 1857.447 819.5230 2895.371 270.0792 3444.815
## Dec 2015 1857.447 802.6791 2912.215 244.3187 3470.575
## Jan 2016 1857.447 786.1000 2928.794 218.9632 3495.931
## Feb 2016 1857.447 769.7736 2945.120 193.9941 3520.900
Error Checking
RMSE check
sqrt(MSE(forecasting_zinc_AUTO$mean, zinc_test))
## [1] 253.525
MAPE check
MAPE(forecasting_zinc_AUTO$mean, zinc_test)
## [1] 0.1035662
Assumption check
Norma Distribution Check
hist(forecasting_zinc_AUTO$residuals, breaks = 100)
Autocol Check
Box.test(zinc_auto$residuals, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: zinc_auto$residuals
## X-squared = 0.026856, df = 1, p-value = 0.8698
By the result of evaluation that we have made, we can get conclusion that Auto ARIMA model is the best model so far for this case. The result of autocol check has p-value > 0.5, it means there are no autocol in Auto ARIMA model.