Introduction

Antam is a shortname for PT Aneka Tambang, Tbk., which is an Indonesian State-Owned Enterprise mining company, mainly in gold and nickel. Its gold operations was the primary contributor of the company’s revenue as it forms 71 percent of Antam’s Q1 2018 revenue. Antam is the only Indonesian gold refiner to be accredited by the London Bullion Market Association. Due to this nature, Antam is a very lucrative stock to collect. Hence, it would be interesting to see the pattern of Antam’s stock price over the years and forecast how it will look like in the future.

The dataset used for this exercise was retrieved from: https://www.kaggle.com/datasets/muhardianabasandi/antam-stock-market-by-kitto.

Import R Library

library(dplyr)
library(forecast)
library(plotly)
library(TTR)
library(MLmetrics)
library(tseries)
library(padr)
library(zoo)
library(xts)
library(ggplot2)
library(gridExtra)

Read Data

antam <- read.csv("ANTM.JK.csv")
glimpse(antam)

#> Rows: 3,808
#> Columns: 7
#> $ Date      <chr> "2005-09-29", "2005-09-30", "2005-10-03", "2005-10-04", "200…
#> $ Open      <chr> "432.588074", "457.787384", "470.387024", "474.586914", "482…
#> $ High      <chr> "436.787964", "457.787384", "474.586914", "482.986694", "482…
#> $ Low       <chr> "407.388763", "432.588074", "440.987854", "466.187164", "466…
#> $ Close     <chr> "432.588074", "457.787384", "470.387024", "474.586914", "482…
#> $ Adj.Close <chr> "304.904572", "322.666016", "331.546692", "334.506927", "340…
#> $ Volume    <chr> "76180670", "105493978", "59712955", "56236668", "31319315",…

The variables in the dataset are as follows: - Date = specifies trading date - Open = the price at which a stock started trading when the opening bell rangopening bell rang - High = highest price at which a stock traded during a period - Low = lowest price at which a stock traded during a period - Close = the price of an individual stock when the stock exchange closed shop for the end of period - Adj.Close = adjusted close price adjusted for both dividends and splits - Volume = the number of shares that changed hands during a given period

Next, we need to convert our data into time series object using the function of ts(). The parameters used are as follows: - data: this is our target variable, in our case we will be using the variable Close - start: the starting time of our time series. In this case we will start from 2005. - frequency: The number of data in one seasonal pattern or known as seasonality. For our case, we want to see the weekly seasonality of the Antam Closing Price starting on 29 September 2005 and ending on 3 February 2021. Hence the seasonality that we’ll be setting is frequency = 365.

Since we only want to see Antam’s closing price, let’s delete the other variables.

antam_close <- antam %>% select(c(Date, Close))
antam_close

Data Wrangling

Prior to making our data into a time series object, we need to fulfill the following requirements:

The time series should be in order from past to present
No missing time. In our case since the data used is working days and we know that there are many holidays happening in Indonesia, hence we will need to fill this missing time using the function of padr
There should be no NA.

So, let’s make sure that the data follows the above requirements.

antam_close <- antam_close %>%
  mutate(Date = lubridate::ymd(Date)) %>% 
    arrange(Date) %>% 
  padr::pad()

# check NA
anyNA(antam_close)

#> [1] TRUE

As we notice from the above data, there seems to be NAs. Hence we would need to imputate these NAs using the function of na.fill.

antam_close$Close <- antam_close %>%
  pad() %>%
  pull(Close) %>%
  na.fill(fill = "extend")

# check NA again
anyNA(antam_close)

#> [1] FALSE

Let’s check the data again to be safe.

glimpse(antam_close)

#> Rows: 5,607
#> Columns: 2
#> $ Date  <date> 2005-09-29, 2005-09-30, 2005-10-01, 2005-10-02, 2005-10-03, 200…
#> $ Close <chr> "432.588074", "457.787384", "461.987264", "466.187144", "470.387…

We need to change the Close variable from character to numeric.

antam_close <- antam_close %>%
  mutate(Close = as.numeric(Close))
glimpse(antam_close)

#> Rows: 5,607
#> Columns: 2
#> $ Date  <date> 2005-09-29, 2005-09-30, 2005-10-01, 2005-10-02, 2005-10-03, 200…
#> $ Close <dbl> 432.5881, 457.7874, 461.9873, 466.1871, 470.3870, 474.5869, 482.…

Perfect. Let’s proceed.

Create Time Series Object and Exploratory Data Analysis (EDA)

Now that we have fulfill the criteria to make a Time Series Object, let’s proceed in creating a Time Series object.

antam_ts <- ts(antam_close$Close, start = 2005, frequency = 365)
autoplot(antam_ts)

Decompose data time series

In order to gain insight from the time series data, we can decompose the data to breakdown three components of time series, namely: - Trend: there are two trends, going upward or downward. - Seasonal: this is a repeated pattern within a period. - Error: this is the data that cannot be explained by trend nor season.

Decomposing is done to make sure that the trend and seasonal pattern can be nicely captured, and those that cannot be explained will be the error data.

antam_dc <- decompose(antam_ts)
autoplot(antam_dc)

Insight:

Trend: Trend looks pretty good and smooth. There is an upward motion from 2005 to around 2008, but then declines constantly since 2010 to 2015 after which it slightly increases until 2021.
Seasonal: There is a repeated pattern every year (additive)
Error: Error is higher during early years, particularly around the period of 2005-2010 but stays a bit constant ever since.

Seasonality Analysis

Based on our data above, we would like to know in which period the Antam’s closing price peaks and slumps.

# data preparation
season_data <- antam_close %>% 
  mutate(seasonal = antam_dc$seasonal,
         date = Date, label = T, abbr = T) %>% 
  distinct(date, seasonal)

season_data_inorder <- season_data[order(season_data$seasonal, decreasing=T),]
head(season_data_inorder)

tail(season_data_inorder)

season_data_inorder

Apparently, Antam’s closing price usually peaks at the end of April to first week of May, and at its lowest in September, around mid September. This warrants for another checking for multi-seasonality for our time series data, particularly for weekly, monthly, and yearly basis.

Multi-Seasonality Checking

antam_msts <- msts(antam_close$Close, start = 2005, seasonal.periods = c(7, 365/12, 365))

mstl(antam_msts) %>% 
  autoplot()

The trend data looks smoother after we check the data multi-seasonality. So, let’s use this for our model building.

Stationary Checking

Before we build our model, let’s check whether our data is stationary or not usin ADF test. Data is considered stationary if the value fluctuates around its mean (just like a data without trend). The reason we need to do stationary checking is because this is the prerequisite of conducting ARIMA method.

adf.test(antam_msts)

#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  antam_msts
#> Dickey-Fuller = -2.4181, Lag order = 17, p-value = 0.4013
#> alternative hypothesis: stationary

Based on the Augmented Dickey-Fuller Test, the p-value > alpha 0.05, therefore we would need to do differencing to our data to make it stationary.

adf.test(diff(antam_msts))

#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  diff(antam_msts)
#> Dickey-Fuller = -16.166, Lag order = 17, p-value = 0.01
#> alternative hypothesis: stationary

After 1x differencing, we get p-value < alpha. This is to confirm the ordo that we’ll be using when we do the ARIMA method.

Cross-Validation

Let’s split the data into data test and data train. Since pandemic starts in early 2020 onward, we can use this as our data test. But, we need to know when our data ends.

tail(antam_close)

Since our data ends on 3 February 2021, let’s use 3 February 2020 - 3 February 2021 as our train data.

# the last two years as data test
antam_train <- head(antam_msts, length(antam_ts) - 365)
antam_test <- tail(antam_msts, 365)

Model Building and Evaluation

Since our model has trend as well as seasonal, therefore we will use Holt-Winters and ARIMA. The reason why we are using ARIMA instead of SARIMA is because when we check using auto.arima function, the results are showing ARIMA ordo (p,d,q), instead of SARIMA ordo (p,d,q)(P,D,Q)[freq].

antam_auto <- auto.arima(antam_train)
summary(antam_auto)

#> Series: antam_train 
#> ARIMA(2,1,0) 
#> 
#> Coefficients:
#>          ar1     ar2
#>       0.0410  0.0828
#> s.e.  0.0138  0.0138
#> 
#> sigma^2 = 1142:  log likelihood = -25885.71
#> AIC=51777.43   AICc=51777.43   BIC=51797.12
#> 
#> Training set error measures:
#>                      ME    RMSE     MAE         MPE     MAPE       MASE
#> Training set 0.05008048 33.7859 17.3448 -0.01244763 1.355187 0.03351607
#>                      ACF1
#> Training set -0.001075351

Note that the (p,d,q) ordo here means: * p = PACF (Partial Autocorrelation Function) plot which shows the correlation between t (current) and t-k (the lag) without being affected by the correlation of the middle period. * d = differencing. This confirms that the data would need 1x differencing as what we’ve done earlier. * q = ACF (Autocorrelation Function) plot which shows the correlation between t and t-k, influenced by the middle period.

Based on the auto ARIMA, the ordo to be used is p = 2, d = 1, q = 0.

Let’s check it manually.

# PACF & ACF plot
tsdisplay(diff(antam_msts), lag.max = 5)

If we check our data, the ordo that we get based on the line that goes beyond the blue line (significant) are: * Ordo p: 1,2 * Ordo d: 1 * Ordo q: 1,2

Now, let’s check which ARIMA model is the best fit to be used for forecasting based on the MAPE value.

# modelling
antam_arima_1 <- arima(antam_train, order = c(1,1,1))
antam_arima_2 <- arima(antam_train, order = c(1,1,2))
antam_arima_3 <- arima(antam_train, order = c(2,1,1))
antam_arima_4 <- arima(antam_train, order = c(2,1,2))

# forecasting
forecast_antam_auto <- forecast(antam_auto, h = 365*3)
forecast_antam1 <- forecast(antam_arima_1, h = 365*3)
forecast_antam2 <- forecast(antam_arima_2, h = 365*3)
forecast_antam3 <- forecast(antam_arima_3, h = 365*3)
forecast_antam4 <- forecast(antam_arima_4, h = 365*3)

# evaluation with data test
accuracy(object = forecast_antam_auto, x = antam_test)

#>                        ME     RMSE      MAE         MPE      MAPE       MASE
#> Training set   0.05008048  33.7859  17.3448 -0.01244763  1.355187 0.03351607
#> Test set     237.57703183 680.2658 394.3771  1.65290429 32.397513 0.76207088
#>                      ACF1 Theil's U
#> Training set -0.001075351        NA
#> Test set      0.989770610  11.45393

accuracy(object = forecast_antam1, x = antam_test)

#>                       ME      RMSE       MAE         MPE      MAPE       MASE
#> Training set   0.0496899  33.84062  17.28788 -0.01236272  1.349792 0.03340608
#> Test set     237.8778628 680.37039 394.35701  1.69332892 32.381526 0.76203214
#>                     ACF1 Theil's U
#> Training set -0.02081075        NA
#> Test set      0.98976973  11.44983

accuracy(object = forecast_antam2, x = antam_test)

#>                        ME      RMSE       MAE         MPE     MAPE       MASE
#> Training set   0.05010421  33.77908  17.35964 -0.01245231  1.35668 0.03354473
#> Test set     237.56482911 680.26172 394.37809  1.65126355 32.39817 0.76207288
#>                      ACF1 Theil's U
#> Training set 0.0004079673        NA
#> Test set     0.9897706490  11.45409

accuracy(object = forecast_antam3, x = antam_test)

#>                        ME      RMSE       MAE         MPE      MAPE       MASE
#> Training set   0.04969832  33.78419  17.34763 -0.01229478  1.355421 0.03352153
#> Test set     237.47475477 680.22978 394.38394  1.63916474 32.403022 0.76208418
#>                       ACF1 Theil's U
#> Training set -0.0002684074        NA
#> Test set      0.9897708067  11.45534

accuracy(object = forecast_antam4, x = antam_test)

#>                       ME      RMSE       MAE         MPE     MAPE       MASE
#> Training set   0.0525294  33.75796  17.42725 -0.01346786  1.36425 0.03367539
#> Test set     239.0426805 680.78065 394.27862  1.84981924 32.31900 0.76188065
#>                     ACF1 Theil's U
#> Training set 0.007695389        NA
#> Test set     0.989769134  11.43392

The best fit model is antam_arima_4 of ARIMA(2,1,2) with MAPE score at 32.32% in data test.

Now, let’s compare the above ARIMA model, with ARIMA model derived from stlm function, as well as with the Holt-Winters model.

# modelling
model_arima <- stlm(antam_train, s.window=365, method = "arima")
model_hw <- HoltWinters(antam_train, seasonal = "additive")

# forecasting
arima_forecast <- forecast(model_arima, h = 365*3)
hw_forecast <- forecast(model_hw, h = 365*3)

Forecasting Visualization

Let’s see how they fare in terms of forecasting visualization.

# visualize ARIMA 4
x <- autoplot(forecast_antam4, series = "ARIMA4", fcol = "salmon") +
  autolayer(antam_msts, series = "Actual", color = "navy") +
  labs(subtitle = "Antam Closing Price Forecasting",
       y = "Closing Price") +
  theme_minimal()

# visualize ARIMA STLM
y <- autoplot(arima_forecast, series = "ARIMA_stlm", fcol = "yellow") +
  autolayer(antam_msts, series = "Actual", color = "navy") +
  labs(subtitle = "Antam Closing Price Forecasting",
       y = "Closing Price") +
  theme_minimal()

# visualize Holt-Winters
z <- autoplot(hw_forecast, series = "Holt-Winters", fcol = "green") +
  autolayer(antam_msts, series = "Actual", color = "navy") +
  labs(subtitle = "Antam Closing Price Forecasting",
       y = "Closing Price") +
  theme_minimal()

grid.arrange(x,y,z)

Evaluation

# evaluation in data test
accuracy(object = forecast_antam4, x = antam_test)

#>                       ME      RMSE       MAE         MPE     MAPE       MASE
#> Training set   0.0525294  33.75796  17.42725 -0.01346786  1.36425 0.03367539
#> Test set     239.0426805 680.78065 394.27862  1.84981924 32.31900 0.76188065
#>                     ACF1 Theil's U
#> Training set 0.007695389        NA
#> Test set     0.989769134  11.43392

accuracy(object = arima_forecast, x = antam_test)

#>                        ME      RMSE       MAE         MPE      MAPE       MASE
#> Training set   0.04024304  32.36694  18.56239 -0.01276132  1.568766 0.03586886
#> Test set     206.26782947 685.05777 422.91831 -3.83033244 38.215506 0.81722230
#>                      ACF1 Theil's U
#> Training set -0.001257541        NA
#> Test set      0.989848240  13.19604

accuracy(object = hw_forecast, x = antam_test)

#>                       ME      RMSE       MAE          MPE      MAPE      MASE
#> Training set  -0.4445879  41.24909  25.35241  -0.01796043  2.153081 0.0489895
#> Test set     -61.1842252 699.24352 507.55839 -41.79685143 61.457216 0.9807758
#>                   ACF1 Theil's U
#> Training set 0.1956635        NA
#> Test set     0.9893577  22.85537

Based on the above comparison, the model antam_arima_4 of ARIMA(2,1,2) still gives the best MAPE score of 32.32% in data test compared to model_arima which scored 38.22% and model_hw which scored 61.46%.

Assumption Checking

Normality of Residuals

H0: Residual/error normally distributed (what we aim)
H1: Residual not distributed normally

Decision: Reject H0 if p-value < 0.05

Let’s check the normality of residuals of model antam_arima_4. What we aim is that the residual/error is normally distributed or the error value should be around 0

# Shapiro Test
shapiro.test(antam_arima_4$residuals[0:5000])

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  antam_arima_4$residuals[0:5000]
#> W = 0.72029, p-value < 0.00000000000000022

p-value < alpha, therefore reject H0, meaning the residuals are not distributed normally. Although the assumption is not fulfilled, this does not mean that we cannot use this model (note that the model is based on data train). This suggests that there are certain prediction points where the model has high error result

plot(antam_arima_4$residuals)

No-Autocorrelation on Residuals

H0: Residuals do not experience autocorrelation (what we aim)
H1: Residuals experience autocorrelation

We do not want the residuals to have autocorrelation because it means that there are patterns in the data that have not been captured.