Antam is a shortname for PT Aneka Tambang, Tbk., which is an Indonesian State-Owned Enterprise mining company, mainly in gold and nickel. Its gold operations was the primary contributor of the company’s revenue as it forms 71 percent of Antam’s Q1 2018 revenue. Antam is the only Indonesian gold refiner to be accredited by the London Bullion Market Association. Due to this nature, Antam is a very lucrative stock to collect. Hence, it would be interesting to see the pattern of Antam’s stock price over the years and forecast how it will look like in the future.
The dataset used for this exercise was retrieved from: https://www.kaggle.com/datasets/muhardianabasandi/antam-stock-market-by-kitto.
library(dplyr)
library(forecast)
library(plotly)
library(TTR)
library(MLmetrics)
library(tseries)
library(padr)
library(zoo)
library(xts)
library(ggplot2)
library(gridExtra)antam <- read.csv("ANTM.JK.csv")
glimpse(antam)#> Rows: 3,808
#> Columns: 7
#> $ Date <chr> "2005-09-29", "2005-09-30", "2005-10-03", "2005-10-04", "200…
#> $ Open <chr> "432.588074", "457.787384", "470.387024", "474.586914", "482…
#> $ High <chr> "436.787964", "457.787384", "474.586914", "482.986694", "482…
#> $ Low <chr> "407.388763", "432.588074", "440.987854", "466.187164", "466…
#> $ Close <chr> "432.588074", "457.787384", "470.387024", "474.586914", "482…
#> $ Adj.Close <chr> "304.904572", "322.666016", "331.546692", "334.506927", "340…
#> $ Volume <chr> "76180670", "105493978", "59712955", "56236668", "31319315",…
The variables in the dataset are as follows: - Date =
specifies trading date - Open = the price at which a stock
started trading when the opening bell rangopening bell rang -
High = highest price at which a stock traded during a
period - Low = lowest price at which a stock traded during
a period - Close = the price of an individual stock when
the stock exchange closed shop for the end of period -
Adj.Close = adjusted close price adjusted for both
dividends and splits - Volume = the number of shares that
changed hands during a given period
Next, we need to convert our data into time series
object using the function of ts(). The parameters
used are as follows: - data: this is our target variable,
in our case we will be using the variable Close -
start: the starting time of our time series. In this case
we will start from 2005. - frequency: The number of data in
one seasonal pattern or known as seasonality. For our
case, we want to see the weekly seasonality of the
Antam Closing Price starting on 29 September 2005 and ending on 3
February 2021. Hence the seasonality that we’ll be setting is
frequency = 365.
Since we only want to see Antam’s closing price, let’s delete the other variables.
antam_close <- antam %>% select(c(Date, Close))
antam_closePrior to making our data into a time series object, we need to fulfill the following requirements:
padrSo, let’s make sure that the data follows the above requirements.
antam_close <- antam_close %>%
mutate(Date = lubridate::ymd(Date)) %>%
arrange(Date) %>%
padr::pad()
# check NA
anyNA(antam_close)#> [1] TRUE
As we notice from the above data, there seems to be NAs. Hence we
would need to imputate these NAs using the function of
na.fill.
antam_close$Close <- antam_close %>%
pad() %>%
pull(Close) %>%
na.fill(fill = "extend")
# check NA again
anyNA(antam_close)#> [1] FALSE
Let’s check the data again to be safe.
glimpse(antam_close)#> Rows: 5,607
#> Columns: 2
#> $ Date <date> 2005-09-29, 2005-09-30, 2005-10-01, 2005-10-02, 2005-10-03, 200…
#> $ Close <chr> "432.588074", "457.787384", "461.987264", "466.187144", "470.387…
We need to change the Close variable from character to
numeric.
antam_close <- antam_close %>%
mutate(Close = as.numeric(Close))
glimpse(antam_close)#> Rows: 5,607
#> Columns: 2
#> $ Date <date> 2005-09-29, 2005-09-30, 2005-10-01, 2005-10-02, 2005-10-03, 200…
#> $ Close <dbl> 432.5881, 457.7874, 461.9873, 466.1871, 470.3870, 474.5869, 482.…
Perfect. Let’s proceed.
Now that we have fulfill the criteria to make a Time Series Object, let’s proceed in creating a Time Series object.
antam_ts <- ts(antam_close$Close, start = 2005, frequency = 365)
autoplot(antam_ts)In order to gain insight from the time series data, we can decompose the data to breakdown three components of time series, namely: - Trend: there are two trends, going upward or downward. - Seasonal: this is a repeated pattern within a period. - Error: this is the data that cannot be explained by trend nor season.
Decomposing is done to make sure that the trend and seasonal pattern can be nicely captured, and those that cannot be explained will be the error data.
antam_dc <- decompose(antam_ts)
autoplot(antam_dc)Insight:
Based on our data above, we would like to know in which period the Antam’s closing price peaks and slumps.
# data preparation
season_data <- antam_close %>%
mutate(seasonal = antam_dc$seasonal,
date = Date, label = T, abbr = T) %>%
distinct(date, seasonal)
season_data_inorder <- season_data[order(season_data$seasonal, decreasing=T),]
head(season_data_inorder)tail(season_data_inorder)season_data_inorderApparently, Antam’s closing price usually peaks at the end of April to first week of May, and at its lowest in September, around mid September. This warrants for another checking for multi-seasonality for our time series data, particularly for weekly, monthly, and yearly basis.
antam_msts <- msts(antam_close$Close, start = 2005, seasonal.periods = c(7, 365/12, 365))
mstl(antam_msts) %>%
autoplot()
The trend data looks smoother after we check the data multi-seasonality.
So, let’s use this for our model building.
Before we build our model, let’s check whether our data is stationary or not usin ADF test. Data is considered stationary if the value fluctuates around its mean (just like a data without trend). The reason we need to do stationary checking is because this is the prerequisite of conducting ARIMA method.
adf.test(antam_msts)#>
#> Augmented Dickey-Fuller Test
#>
#> data: antam_msts
#> Dickey-Fuller = -2.4181, Lag order = 17, p-value = 0.4013
#> alternative hypothesis: stationary
Based on the Augmented Dickey-Fuller Test, the p-value > alpha 0.05, therefore we would need to do differencing to our data to make it stationary.
adf.test(diff(antam_msts))#>
#> Augmented Dickey-Fuller Test
#>
#> data: diff(antam_msts)
#> Dickey-Fuller = -16.166, Lag order = 17, p-value = 0.01
#> alternative hypothesis: stationary
After 1x differencing, we get p-value < alpha. This is to confirm the ordo that we’ll be using when we do the ARIMA method.
Let’s split the data into data test and data train. Since pandemic starts in early 2020 onward, we can use this as our data test. But, we need to know when our data ends.
tail(antam_close)Since our data ends on 3 February 2021, let’s use 3 February 2020 - 3 February 2021 as our train data.
# the last two years as data test
antam_train <- head(antam_msts, length(antam_ts) - 365)
antam_test <- tail(antam_msts, 365)Since our model has trend as well as seasonal, therefore we will use
Holt-Winters and ARIMA. The reason why we are using
ARIMA instead of SARIMA is because when we check using
auto.arima function, the results are showing ARIMA ordo
(p,d,q), instead of SARIMA ordo (p,d,q)(P,D,Q)[freq].
antam_auto <- auto.arima(antam_train)
summary(antam_auto)#> Series: antam_train
#> ARIMA(2,1,0)
#>
#> Coefficients:
#> ar1 ar2
#> 0.0410 0.0828
#> s.e. 0.0138 0.0138
#>
#> sigma^2 = 1142: log likelihood = -25885.71
#> AIC=51777.43 AICc=51777.43 BIC=51797.12
#>
#> Training set error measures:
#> ME RMSE MAE MPE MAPE MASE
#> Training set 0.05008048 33.7859 17.3448 -0.01244763 1.355187 0.03351607
#> ACF1
#> Training set -0.001075351
Note that the (p,d,q) ordo here means: * p =
PACF (Partial Autocorrelation Function) plot which
shows the correlation between t (current) and t-k (the lag) without
being affected by the correlation of the middle period. * d
= differencing. This confirms that the data would need 1x differencing
as what we’ve done earlier. * q = ACF
(Autocorrelation Function) plot which shows the correlation
between t and t-k, influenced by the middle period.
Based on the auto ARIMA, the ordo to be used is p = 2, d = 1, q = 0.
Let’s check it manually.
# PACF & ACF plot
tsdisplay(diff(antam_msts), lag.max = 5)
If we check our data, the ordo that we get based on the line that goes
beyond the blue line (significant) are: * Ordo p: 1,2 * Ordo d: 1 * Ordo
q: 1,2
Now, let’s check which ARIMA model is the best fit to be used for forecasting based on the MAPE value.
# modelling
antam_arima_1 <- arima(antam_train, order = c(1,1,1))
antam_arima_2 <- arima(antam_train, order = c(1,1,2))
antam_arima_3 <- arima(antam_train, order = c(2,1,1))
antam_arima_4 <- arima(antam_train, order = c(2,1,2))# forecasting
forecast_antam_auto <- forecast(antam_auto, h = 365*3)
forecast_antam1 <- forecast(antam_arima_1, h = 365*3)
forecast_antam2 <- forecast(antam_arima_2, h = 365*3)
forecast_antam3 <- forecast(antam_arima_3, h = 365*3)
forecast_antam4 <- forecast(antam_arima_4, h = 365*3)# evaluation with data test
accuracy(object = forecast_antam_auto, x = antam_test)#> ME RMSE MAE MPE MAPE MASE
#> Training set 0.05008048 33.7859 17.3448 -0.01244763 1.355187 0.03351607
#> Test set 237.57703183 680.2658 394.3771 1.65290429 32.397513 0.76207088
#> ACF1 Theil's U
#> Training set -0.001075351 NA
#> Test set 0.989770610 11.45393
accuracy(object = forecast_antam1, x = antam_test)#> ME RMSE MAE MPE MAPE MASE
#> Training set 0.0496899 33.84062 17.28788 -0.01236272 1.349792 0.03340608
#> Test set 237.8778628 680.37039 394.35701 1.69332892 32.381526 0.76203214
#> ACF1 Theil's U
#> Training set -0.02081075 NA
#> Test set 0.98976973 11.44983
accuracy(object = forecast_antam2, x = antam_test)#> ME RMSE MAE MPE MAPE MASE
#> Training set 0.05010421 33.77908 17.35964 -0.01245231 1.35668 0.03354473
#> Test set 237.56482911 680.26172 394.37809 1.65126355 32.39817 0.76207288
#> ACF1 Theil's U
#> Training set 0.0004079673 NA
#> Test set 0.9897706490 11.45409
accuracy(object = forecast_antam3, x = antam_test)#> ME RMSE MAE MPE MAPE MASE
#> Training set 0.04969832 33.78419 17.34763 -0.01229478 1.355421 0.03352153
#> Test set 237.47475477 680.22978 394.38394 1.63916474 32.403022 0.76208418
#> ACF1 Theil's U
#> Training set -0.0002684074 NA
#> Test set 0.9897708067 11.45534
accuracy(object = forecast_antam4, x = antam_test)#> ME RMSE MAE MPE MAPE MASE
#> Training set 0.0525294 33.75796 17.42725 -0.01346786 1.36425 0.03367539
#> Test set 239.0426805 680.78065 394.27862 1.84981924 32.31900 0.76188065
#> ACF1 Theil's U
#> Training set 0.007695389 NA
#> Test set 0.989769134 11.43392
The best fit model is antam_arima_4 of ARIMA(2,1,2) with
MAPE score at 32.32% in data test.
Now, let’s compare the above ARIMA model, with ARIMA model derived
from stlm function, as well as with the Holt-Winters
model.
# modelling
model_arima <- stlm(antam_train, s.window=365, method = "arima")
model_hw <- HoltWinters(antam_train, seasonal = "additive")# forecasting
arima_forecast <- forecast(model_arima, h = 365*3)
hw_forecast <- forecast(model_hw, h = 365*3)Let’s see how they fare in terms of forecasting visualization.
# visualize ARIMA 4
x <- autoplot(forecast_antam4, series = "ARIMA4", fcol = "salmon") +
autolayer(antam_msts, series = "Actual", color = "navy") +
labs(subtitle = "Antam Closing Price Forecasting",
y = "Closing Price") +
theme_minimal()
# visualize ARIMA STLM
y <- autoplot(arima_forecast, series = "ARIMA_stlm", fcol = "yellow") +
autolayer(antam_msts, series = "Actual", color = "navy") +
labs(subtitle = "Antam Closing Price Forecasting",
y = "Closing Price") +
theme_minimal()
# visualize Holt-Winters
z <- autoplot(hw_forecast, series = "Holt-Winters", fcol = "green") +
autolayer(antam_msts, series = "Actual", color = "navy") +
labs(subtitle = "Antam Closing Price Forecasting",
y = "Closing Price") +
theme_minimal()
grid.arrange(x,y,z)# evaluation in data test
accuracy(object = forecast_antam4, x = antam_test)#> ME RMSE MAE MPE MAPE MASE
#> Training set 0.0525294 33.75796 17.42725 -0.01346786 1.36425 0.03367539
#> Test set 239.0426805 680.78065 394.27862 1.84981924 32.31900 0.76188065
#> ACF1 Theil's U
#> Training set 0.007695389 NA
#> Test set 0.989769134 11.43392
accuracy(object = arima_forecast, x = antam_test)#> ME RMSE MAE MPE MAPE MASE
#> Training set 0.04024304 32.36694 18.56239 -0.01276132 1.568766 0.03586886
#> Test set 206.26782947 685.05777 422.91831 -3.83033244 38.215506 0.81722230
#> ACF1 Theil's U
#> Training set -0.001257541 NA
#> Test set 0.989848240 13.19604
accuracy(object = hw_forecast, x = antam_test)#> ME RMSE MAE MPE MAPE MASE
#> Training set -0.4445879 41.24909 25.35241 -0.01796043 2.153081 0.0489895
#> Test set -61.1842252 699.24352 507.55839 -41.79685143 61.457216 0.9807758
#> ACF1 Theil's U
#> Training set 0.1956635 NA
#> Test set 0.9893577 22.85537
Based on the above comparison, the model antam_arima_4
of ARIMA(2,1,2) still gives the best MAPE score of 32.32% in data test
compared to model_arima which scored 38.22% and
model_hw which scored 61.46%.
Decision: Reject H0 if p-value < 0.05
Let’s check the normality of residuals of model
antam_arima_4. What we aim is that the residual/error is
normally distributed or the error value should be around 0
# Shapiro Test
shapiro.test(antam_arima_4$residuals[0:5000]) #>
#> Shapiro-Wilk normality test
#>
#> data: antam_arima_4$residuals[0:5000]
#> W = 0.72029, p-value < 0.00000000000000022
p-value < alpha, therefore reject H0, meaning the residuals are not distributed normally. Although the assumption is not fulfilled, this does not mean that we cannot use this model (note that the model is based on data train). This suggests that there are certain prediction points where the model has high error result
plot(antam_arima_4$residuals)We do not want the residuals to have autocorrelation because it means that there are patterns in the data that have not been captured.
Decision: Reject H0 if p-value < 0.05
# Ljung-Box test
Box.test(antam_arima_4$residuals)#>
#> Box-Pierce test
#>
#> data: antam_arima_4$residuals
#> X-squared = 0.31043, df = 1, p-value = 0.5774
The p-value > 0.05, therefore residuals do not experience autocorrelation (assumption fulfilled)
The model that we can use to forecast Antam’s closing price is the
model deriving from antam_arima_4 due to it having the
smallest MAPE score. However, after assumption checking, this model only
ticks the no auto-correlation box and does not fulfil the normality of
residuals test. This warrants for seasonality adjustment or warning that
there are data prediction points that have high error.