General Brief

This is Learning By Building Project for Time Series (TS) course. The data used for this analysis is Daily Gold Price (2015-2021), you can find the data in Kaggle. This analysis will be separated into 6 parts:

  1. Pre-processing Data
  2. Data exploration
  3. Model and forecast
  4. Model evaluation
  5. Conclusion
  6. Reference

Let’s get started!


Main Objective

In this report, we want to forecast gold price in 6 months ahead.

Library Setup

In this report, we will use following library:

#import library
library(tidyverse) 
library(lubridate) # date manipulation
library(forecast) # time series library
library(MLmetrics) # calculate error
library(tseries) # adf.test
library(TSstudio) 
library(padr) # complete data frame
library(imputeTS)
theme_set(theme_minimal())

1 Pre-processing Data

In this part, we’re going to read our data, and prepare it into “clean data”.

1.1 Read The Data

First, let’s read our data and we will assign it as gold.

gold <- read.csv("data_input/Gold Price.csv")
gold

Hereby following description for every column:

this part refer to data source website, please refer to this link for more detail information.

  • Date: Recording date.
  • Price: It is close price which can be considered as final price.
  • Open: Price at the time of market opening at that day.
  • High: Highest price during whole day.
  • Low: Lowest price during whole day.
  • Volume: Traded volume.
  • Chg: % Change from previous price.

1.2 Data Cleansing

1.2.1 Data Type

Check any mismatch data types in our data.

# check data types for each column
str(gold)
#> 'data.frame':    2072 obs. of  7 variables:
#>  $ Date  : chr  "2014-01-01" "2014-01-02" "2014-01-03" "2014-01-04" ...
#>  $ Price : int  29542 29975 29727 29279 29119 28959 28934 28997 29169 29312 ...
#>  $ Open  : int  29435 29678 30031 29279 29300 29130 28916 28990 29030 29170 ...
#>  $ High  : int  29598 30050 30125 29279 29395 29195 29029 29053 29198 29330 ...
#>  $ Low   : int  29340 29678 29539 29279 29051 28912 28820 28865 28960 29133 ...
#>  $ Volume: int  2930 3140 3050 0 24380 18710 18140 15130 15810 13780 ...
#>  $ Chg.  : num  0.25 1.47 -0.83 -1.51 -0.55 -0.55 -0.09 0.22 0.59 0.49 ...

Date column still in character type, we have to convert it into time format.

# changing data type
gold <- gold %>% 
  mutate(Date = ymd(Date))

str(gold)
#> 'data.frame':    2072 obs. of  7 variables:
#>  $ Date  : Date, format: "2014-01-01" "2014-01-02" ...
#>  $ Price : int  29542 29975 29727 29279 29119 28959 28934 28997 29169 29312 ...
#>  $ Open  : int  29435 29678 30031 29279 29300 29130 28916 28990 29030 29170 ...
#>  $ High  : int  29598 30050 30125 29279 29395 29195 29029 29053 29198 29330 ...
#>  $ Low   : int  29340 29678 29539 29279 29051 28912 28820 28865 28960 29133 ...
#>  $ Volume: int  2930 3140 3050 0 24380 18710 18140 15130 15810 13780 ...
#>  $ Chg.  : num  0.25 1.47 -0.83 -1.51 -0.55 -0.55 -0.09 0.22 0.59 0.49 ...

Each column now already in their proper types.

1.2.2 Padding and Imputation

If we observe our data, there is skipped date in our data. The data only record weekdays, but in time series we have to make our data as complete as possible (no skipped date). We have to pad our data first, then we will impute our data.

# padding our data
gold <- gold %>% 
  pad(interval = "day") 
gold
# impute NA value with last day observation

gold_clean <- gold %>% na.locf()
gold_clean

Now our data already complete.

2 Data Exploration

After we got clean data, now we’re going to explore our data. To explore our data, we have to create Time Series object first (ts). In this analysis we will focus on gold price.

2.1 TS Object

We will make time series object with ts(). The data is daily, and we want to observe it into a year, so the frequency will be 7 * 4 * 12.

# ts
gold_ts <- ts(data = gold_clean$Price,
             start = 2014,
             frequency = 7*4*12)
gold_ts %>% 
  autoplot()

Insight:

  • Gold price over the year is increasing
  • We have multiplicative model

2.2 Decompose

Decompose is a process to extract trend, seasonal, and remainder (error) from our time series data.

gold_decompose <- gold_ts %>% 
  decompose(type = "multiplicative")
  
gold_decompose %>% 
  autoplot()

Insight:

  • The price has increasing trend
  • There is seasonal pattern in our data.

2.2.1 Seasonal Analysis

gold_decompose$seasonal %>% 
  autoplot()

We will try to look for monthly pattern for gold price from 2014 - 2021.

gold_clean %>% 
  mutate(Month = month(Date, label = T)) %>% 
  mutate(seasons = gold_decompose$seasonal) %>% 
  group_by(Month) %>% 
  summarise(total = sum(seasons)) %>%   
  ggplot(aes(Month, total)) +
  geom_col()+
  theme_minimal()

Insight:

  • The price is high in January, but going down in February and going up again in March.

3 Time Series Model and Forecasting

We will build time series model base on our data. Our data has seasonality and trend, so we will make Triple Exponential Smoothing model, and for comparison we will make ARIMA model too.

Cross Validation

We will split our data into train and test. We have 2014 - 2021 data, and we will try to forecast last 6 months price.

  • 6 months = 7 * 4 * 6 = 168
# train data
train <- head(gold_ts, -168)
# test data
test <- tail(gold_ts, 168)

3.1 Triple Exponential Smoothing Model

In Triple Exponential Smoothing there are value for alpha, beta, and gamma. This value is tune able, but we want the function to define the value for our model.

# Making model using HoltWinters()
gold_HWmodel <- HoltWinters(train, seasonal = "multiplicative")

# alpha beta gamma
gold_HWmodel$alpha
#>     alpha 
#> 0.7941857
gold_HWmodel$beta
#>         beta 
#> 0.0008367679
gold_HWmodel$gamma
#> gamma 
#>     1

3.1.1 Holt’s Winter Model Forecasting

After making model, we will try to forecast using our test data and visualize it.

# forecasting
gold_HWforecast <- forecast(gold_HWmodel, h = 168)
# visualize
gold_ts %>% 
  autoplot() +
  autolayer(gold_HWmodel$fitted[,1], lwd = 0.5, 
            series = "HW model") +
  autolayer(gold_HWforecast$mean, lwd = 0.5,
            series = "Forecast 1 year")

From the graph we can see the blue line (HW model) with train data is quite similar with our original data. Unfortunately, the red line (Forecast) is not really accurate, the forecast predict the price will going up, but the original data is going down. We will check our error with MAPE.

MAPE(gold_HWmodel$fitted[,1], train)*100
#> [1] 0.5856348
MAPE(gold_HWforecast$mean, test)*100
#> [1] 1.854949

The error for model is 0.585%, and the error for forecasting is 1.85%.

3.2 ARIMA Model

As an alternative, we will make ARIMA model. ARIMA is an integration from 2 method, AR (Auto Regression) and MA (Moving Average). We want the ARIMA method considering our seasonality data, so we will use STLM with ARIMA method.

gold_arima_stl <- stlm(y = train, method = "arima")
gold_arima_stl$model
#> Series: x 
#> ARIMA(2,1,3) with drift 
#> 
#> Coefficients:
#>           ar1      ar2     ma1     ma2      ma3   drift
#>       -0.4435  -0.8978  0.3942  0.9145  -0.0463  6.0428
#> s.e.   0.0319   0.0428  0.0366  0.0355   0.0202  4.1497
#> 
#> sigma^2 = 50850:  log likelihood = -18806.34
#> AIC=37626.68   AICc=37626.72   BIC=37668.11

3.2.1 ARIMA Model Forecasting

After making model, we will try to forecast using our test data and visualize it.

# forecasting
gold_ARIMAforecast <- forecast(gold_arima_stl, h = 168)
# visualize
gold_ts %>% 
  autoplot() +
  autolayer(gold_arima_stl$fitted, lwd = 0.5, 
            series = "ARIMA model") +
  autolayer(gold_ARIMAforecast$mean, lwd = 0.5,
            series = "Forecast 1 year")

From the graph we can see the red line (ARIMA model) with train data is quite similar with our original data. Unfortunately, the blue line (Forecast) still having deviation from original data. We will check our error with MAPE.

MAPE(gold_arima_stl$fitted, train)*100
#> [1] 0.4136891
MAPE(gold_ARIMAforecast$mean, test)*100
#> [1] 1.468281

The error for ARIMA model is 0.413%, and the error for forecasting is 1.468%.

The error is slightly lower than Triple Exponential Smoothing model, we will choose this ARIMA model as our final model.

4 Model Evaluation

After making prediction, we will evaluate our ARIMA model with assumption check.

4.0.1 No Auto Correlation

\(H_0\): residual has no-autocorrelation

\(H_1\): residual has autocorrelation

# ARIMA Model
Box.test(gold_arima_stl$residuals, type = "Ljung-Box")
#> 
#>  Box-Ljung test
#> 
#> data:  gold_arima_stl$residuals
#> X-squared = 0.0053734, df = 1, p-value = 0.9416

p-value = 0.94 > 0.05 (residual has no-autocorrelation)

4.0.2 Normality of Residuals

\(H_0\): residual normally distributed

\(H_1\): residual not normally distributed

shapiro.test(gold_arima_stl$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  gold_arima_stl$residuals
#> W = 0.88046, p-value < 0.00000000000000022

p-value < 0.05 (residual not normally distributed).

Our model fail in Normality of Residuals check.

5 Conclusion

To sum up our analysis, we have to return to our main objective. We already makes model that can predict gold price in 6 months ahead, with forecasting error at 1.468% and still needs improvement in normality of residuals check. As an alternative, we already makes another model with Triple Exponential Smoothing Method with forecasting error at 1.85%.

Here is recommendation that may improve our model:

  • We can try making ARIMA model using arima() and define the parameter on our own.

6 Reference