Introduction

Shares are investment instruments that signify a person’s or entity’s ownership in a company. By purchasing a stock, the owner of the stock (often referred to as an investor or shareholder) is entitled to a portion of the company’s assets and profits according to the number of shares owned, the problem we will solve is to forecast/predict the stock price for the next 3 months. The library we need includes:

library(dplyr) # data wrangling
library(lubridate) # date manipulation
library(padr) # complete data frame
library(zoo) # Missing value imputation
library(forecast) # time series library
library(TTR) # for Simple moving average function
library(MLmetrics) # calculate error
library(tseries) # adf.test
library(fpp) # data for forecasting: principles and practice
library(TSstudio) # mempercantik visualisasi timeseries
library(ggplot2)
library(tidyr)
library(quantmod)
library(xts)

Scraping & Read Data

With the help of the library (quantmod) we will do scraping first, the stock data that we will scrape with code BBCA on yfinance:

getSymbols("BBCA.JK", src = "yahoo", from = "2020-10-01", to = Sys.Date())
#> [1] "BBCA.JK"
BBCA.JK <- data.frame(Date = index(BBCA.JK), coredata(BBCA.JK))
head(BBCA.JK)

We will save it into a data frame

write.csv(BBCA.JK, "data_saham_BBCA.csv", row.names = FALSE)

We will call the file we saved again

saham_bbca <- read.csv("data_saham_BBCA.csv")
head(saham_bbca)

Data Wrangling

At this stage we will check and adjust whether the data we have has met the characteristics of time series data. At this stage we will select data for Date and BBCA.JK.Close. some columns that we delete such as BBCA.JK.Open, BBCA.JK.High, BBCA.JK.Low, BBCA.JK.Volume and BBCA.JK.Adjusted.

bbca_clean <- saham_bbca %>% select(Date,BBCA.JK.Close) %>%  mutate(Date = ymd(Date))
all(seq.Date(from = min(bbca_clean$Date), to = max(bbca_clean$Date), by = "day")==bbca_clean$Date)
#> [1] FALSE

Because there is a missed day then we will do padding

bbca_clean <- bbca_clean %>% pad()

Check Missing values

anyNA(bbca_clean)
#> [1] TRUE

There are missing values because we did padding, we will fill the missing values with the previous value.

bbca_clean <- bbca_clean %>% mutate(BBCA.JK.Close = na.locf(BBCA.JK.Close))
anyNA(bbca_clean)
#> [1] FALSE
head(bbca_clean)
names(bbca_clean) <- c("ds", "y")

From the above steps, we can conclude that there is no missing time series data and no missing values.

With the data above, we will create a time series for the next 1 year.

bbca_ts <- ts(data = bbca_clean$y,
               start = c(2020,10),
               frequency = 365)
bbca_ts
#> Time Series:
#> Start = c(2020, 10) 
#> End = c(2024, 35) 
#> Frequency = 365 
#>    [1]  5570  5505  5505  5505  5520  5700  5755  5780  5775  5775  5775  5855
#>   [13]  5855  5900  5785  5760  5760  5760  5900  5805  5780  5800  5770  5770
#>   [25]  5770  5815  5790  5790  5790  5790  5790  5790  5820  5890  5820  6150
#>   [37]  6300  6300  6300  6285  6480  6540  6420  6390  6390  6390  6480  6550
#>   [49]  6570  6615  6600  6600  6600  6600  6565  6410  6480  6385  6385  6385
#>   [61]  6205  6395  6450  6460  6390  6390  6390  6520  6490  6490  6575  6735
#>   [73]  6735  6735  6820  6790  6950  6935  6800  6800  6800  6830  6715  6725
#>   [85]  6725  6725  6725  6725  6780  6765  6770  6770  6770  6770  6770  6835
#>   [97]  7090  6945  6965  7050  7050  7050  7345  7160  7120  7020  6955  6955
#>  [109]  6955  7120  7095  7095  7075  7080  7080  7080  7035  6820  6790  6900
#>  [121]  6760  6760  6760  6820  6800  6825  6855  6915  6915  6915  6920  6980
#>  [133]  6920  6880  6880  6880  6880  6800  6940  6900  6735  6825  6825  6825
#>  [145]  6790  6825  6725  6705  6710  6710  6710  7045  7015  7000  6720  6800
#>  [157]  6800  6800  6720  6605  6705  6705  6765  6765  6765  6665  6625  6610
#>  [169]  6705  6760  6760  6760  6620  6565  6440  6370  6415  6415  6415  6360
#>  [181]  6395  6215  6225  6225  6225  6225  6155  6165  6250  6135  6200  6200
#>  [193]  6200  6065  6000  6305  6280  6275  6275  6275  6270  6235  6165  6215
#>  [205]  6390  6390  6390  6285  6405  6320  6410  6405  6405  6405  6390  6400
#>  [217]  6425  6425  6400  6400  6400  6420  6480  6480  6480  6480  6480  6480
#>  [229]  6500  6390  6345  6380  6360  6360  6360  6325  6355  6355  6270  6340
#>  [241]  6340  6340  6375  6375  6465  6600  6580  6580  6580  6530  6430  6530
#>  [253]  6620  6470  6470  6470  6410  6470  6385  6330  6325  6325  6325  6250
#>  [265]  6340  6230  6210  6190  6190  6190  6055  6045  6025  6025  6100  6100
#>  [277]  6100  6110  6055  6065  6015  6020  6020  6020  6170  6045  5990  6115
#>  [289]  6110  6110  6110  6005  6005  6010  6155  6035  6035  6035  6005  6005
#>  [301]  5980  6040  5970  5970  5970  5960  6145  6120  6300  6160  6160  6160
#>  [313]  6200  6300  6300  6300  6410  6410  6410  6420  6420  6600  6600  6600
#>  [325]  6600  6600  6590  6600  6600  6560  6510  6510  6510  6565  6550  6565
#>  [337]  6540  6600  6600  6600  6575  6570  6440  6570  6520  6520  6520  6555
#>  [349]  6545  6495  6500  6520  6520  6520  6585  6490  6555  6580  6585  6585
#>  [361]  6585  6580  6520  6580  7000  6760  6760  6760  6960  6945  7180  7160
#>  [373]  7290  7290  7290  7255  7320  7525  7750  7650  7650  7650  7525  7500
#>  [385]  7500  7400  7525  7525  7525  7525  7525  7450  7375  7475  7475  7475
#>  [397]  7400  7300  7350  7375  7450  7450  7450  7575  7675  7650  7675  7525
#>  [409]  7525  7525  7500  7475  7575  7400  7425  7425  7425  7475  7475  7475
#>  [421]  7425  7275  7275  7275  7400  7275  7300  7500  7375  7375  7375  7350
#>  [433]  7350  7425  7350  7375  7375  7375  7300  7300  7300  7275  7500  7500
#>  [445]  7500  7375  7375  7325  7300  7300  7300  7300  7350  7350  7300  7300
#>  [457]  7300  7300  7300  7325  7400  7450  7475  7650  7650  7650  7600  7700
#>  [469]  7700  7700  7850  7850  7850  7750  7675  7675  7775  7950  7950  7950
#>  [481]  7800  7775  7700  7800  7775  7775  7775  7625  7625  7800  7725  7725
#>  [493]  7725  7725  7800  7725  7950  7750  7825  7825  7825  7700  7875  7975
#>  [505]  7900  7925  7925  7925  7950  7900  8050  8000  8050  8050  8050  8050
#>  [517]  8050  7975  7975  7900  7900  7900  7700  7650  7850  7925  7950  7950
#>  [529]  7950  8075  8150  8200  8000  7900  7900  7900  7900  7925  7900  7925
#>  [541]  7950  7950  7950  7900  7850  7875  7975  7925  7925  7925  7900  7900
#>  [553]  7750  7750  7850  7850  7850  7725  7800  7800  7700  7700  7700  7700
#>  [565]  7700  7625  7650  7925  7875  7875  7875  8000  8125  8200  8125  8125
#>  [577]  8125  8125  8125  8125  8125  8125  8125  8125  8125  7600  7525  7650
#>  [589]  7275  7325  7325  7325  7325  7400  7575  7450  7400  7400  7400  7375
#>  [601]  7350  7375  7375  7575  7575  7575  7575  7750  7750  7575  7600  7600
#>  [613]  7600  7450  7375  7600  7500  7350  7350  7350  7350  7400  7325  7575
#>  [625]  7500  7500  7500  7625  7650  7500  7525  7475  7475  7475  7350  7300
#>  [637]  7275  7250  7250  7250  7250  7050  7250  7300  7100  7150  7150  7150
#>  [649]  7125  7175  7000  7025  7000  7000  7000  7150  7175  7400  7400  7325
#>  [661]  7325  7325  7300  7300  7325  7350  7350  7350  7350  7500  7600  7625
#>  [673]  7800  7875  7875  7875  7875  7900  7900  7950  7925  7925  7925  7950
#>  [685]  7975  7975  8000  7900  7900  7900  8000  7900  7950  8075  8000  8000
#>  [697]  8000  8150  8175  8200  8150  8225  8225  8225  8275  8275  8375  8350
#>  [709]  8375  8375  8375  8375  8525  8500  8750  8450  8450  8450  8650  8550
#>  [721]  8475  8475  8375  8375  8375  8425  8300  8325  8375  8550  8550  8550
#>  [733]  8500  8550  8450  8425  8200  8200  8200  8300  8250  8325  8275  8250
#>  [745]  8250  8250  8250  8300  8275  8500  8650  8650  8650  8900  8700  8550
#>  [757]  8700  8750  8750  8750  8800  8800  8750  8800  8775  8775  8775  8850
#>  [769]  8750  8875  8800  8850  8850  8850  8750  8800  8600  8725  8825  8825
#>  [781]  8825  8725  8900  8875  9000  8975  8975  8975  9025  8975  9300  9000
#>  [793]  8900  8900  8900  8775  8675  8450  8500  8575  8575  8575  8700  8700
#>  [805]  8625  8500  8600  8600  8600  8650  8575  8675  8575  8500  8500  8500
#>  [817]  8575  8600  8650  8575  8550  8550  8550  8550  8550  8350  8250  8300
#>  [829]  8300  8300  8450  8175  8125  8175  8050  8050  8050  8150  8325  8300
#>  [841]  8325  8300  8300  8300  8300  8225  8200  8475  8700  8700  8700  8700
#>  [853]  8475  8500  8450  8700  8700  8700  8725  8850  8825  8900  8825  8825
#>  [865]  8825  8875  8950  8875  8700  8725  8725  8725  8750  8700  8675  8725
#>  [877]  8675  8675  8675  8775  8750  8600  8625  8475  8475  8475  8400  8425
#>  [889]  8575  8575  8450  8450  8450  8550  8325  8325  8300  8375  8375  8375
#>  [901]  8400  8500  8500  8500  8825  8825  8825  8700  8675  8800  8825  8750
#>  [913]  8750  8750  8800  8775  8725  8750  8750  8750  8750  8800  8825  8900
#>  [925]  8925  9000  9000  9000  9025  9125  9125  9125  9125  9125  9125  9125
#>  [937]  9125  9200  9150  9050  9050  9050  9050  9050  8925  9000  9000  9000
#>  [949]  9000  9000  8925  8925  8825  8800  8800  8800  8775  8700  8775  8775
#>  [961]  9000  9000  9000  9000  9125  9025  9050  9150  9150  9150  9150  9250
#>  [973]  9050  9050  9050  9050  9050  9200  9150  9100  9125  9100  9100  9100
#>  [985]  9150  9150  9075  9050  9050  9050  9050  9000  9050  9125  9050  9050
#>  [997]  9050  9050  9075  9150  9150  9150  9150  9150  9150  9075  9050  9050
#> [1009]  9075  9025  9025  9025  9050  9025  9175  9125  9200  9200  9200  9175
#> [1021]  9150  9150  9150  9150  9150  9150  9100  9150  9350  9225  9125  9125
#> [1033]  9125  9125  9125  9200  9250  9150  9150  9150  9275  9200  9400  9400
#> [1045]  9400  9400  9400  9350  9300  9300  9300  9250  9250  9250  9175  9300
#> [1057]  9300  9200  9275  9275  9275  9200  9250  9200  9175  9225  9225  9225
#> [1069]  9225  9225  9150  9175  9125  9125  9125  9125  9100  9075  9100  9000
#> [1081]  9000  9000  9000  9075  9150  9125  9075  9075  9075  9000  8950  8875
#> [1093]  8875  8825  8825  8825  9075  9200  9200  9075  9025  9025  9025  9050
#> [1105]  8925  8925  9050  9075  9075  9075  9100  8950  8850  8750  8975  8975
#> [1117]  8975  8850  8775  8875  8725  8700  8700  8700  8850  8750  8600  8850
#> [1129]  8900  8900  8900  9050  8975  9000  9000  8825  8825  8825  8875  8925
#> [1141]  9050  9075  9075  9075  9075  8875  8775  8875  8925  8925  8925  8925
#> [1153]  8875  8875  8900  8975  8950  8950  8950  8925  8900  8800  8825  8750
#> [1165]  8750  8750  8750  8700  8675  9050  9225  9225  9225  9200  9250  9300
#> [1177]  9325  9325  9325  9325  9325  9325  9375  9400  9400  9400  9400  9400
#> [1189]  9425  9350  9475  9575  9575  9575  9575  9625  9550  9575  9700  9700
#> [1201]  9700  9725  9700  9750  9675  9625  9625  9625  9625  9600  9525  9500
#> [1213]  9350  9350  9350  9550  9650  9550  9700  9700  9700  9700  9575  9625
#> [1225]  9700  9700  9700  9700  9700  9800  9725  9725  9850  9950  9950  9950
#> [1237]  9875 10025  9975  9875  9825  9825  9825  9800  9875 10000  9875  9825
#> [1249]  9825  9825  9750  9800  9950 10125 10150 10150 10150 10150 10150 10000
#> [1261] 10325 10150 10150 10150 10150 10175 10125 10125 10100 10100 10100 10075
#> [1273] 10050 10075 10075 10075 10075 10075  9850  9900  9525  9850  9825  9825
#> [1285]  9825  9825  9825  9825  9825  9825  9825  9825  9825  9475  9525  9475
#> [1297]  9475  9475  9475  9350  9725  9950  9775  9625  9625  9625  9800  9800
#> [1309]  9800  9550  9850  9850  9850  9800  9700  9375  9375  9375  9375  9375
#> [1321]  9525  9550  9500  9600  9750  9750  9750  9475  9375  9425  9425  9425
#> [1333]  9425  9425  9300  9300  9150  9000  9250  9250  9250  9275  9350  9450
#> [1345]  9475  9325  9325  9325  9525  9300  9250  9200  9200  9200  9200  9200
#> [1357]  9200  9050  9425  9600  9600  9600  9600  9600  9500  9750  9925  9925
#> [1369]  9925  9875  9900 10000  9825  9950  9950  9950 10050 10075 10100 10075
#> [1381] 10075 10075 10075 10050  9950  9800 10100 10125 10125 10125 10100 10175
#> [1393] 10075 10300 10325 10325 10325 10250 10175 10275 10375 10200 10200 10200
#> [1405]  9875 10000 10100 10225 10150 10150 10150 10200 10300 10200 10225 10325
#> [1417] 10325 10325 10375 10425 10425 10325 10325 10325 10325 10325 10200 10350
#> [1429] 10225 10325 10325 10325 10275 10175 10300 10250 10300 10300 10300 10275
#> [1441] 10350 10425 10475 10425 10425 10425 10425 10500 10625 10900 10775 10775
#> [1453] 10775 10950 10800 10850 10700 10650 10650 10650 10325 10550 10500 10450
#> [1465] 10475 10475 10475 10300 10400 10425 10500 10375 10375 10375 10500 10625
#> [1477] 10475 10725 10750 10750 10750 10675 10500 10650 10700 10750

EDA

Visualisasi

After we create the bbca stock time series, we can see the time series plot below:

autoplot(bbca_ts)

Insight: From the data above, we can see that the data fluctuates, the trend tends to increase.

Decomposition

Decomposition is a stage in time series analysis that is used to decompose several components in time series data. Components/elements in time series:

  • Trend (Tt) : the general pattern of data, tends to increase or decrease.
  • Seasonal (St): seasonal patterns that form a recurring pattern over a fixed period of time
  • Error/Reminder/Random (Et): patterns that cannot be captured in trends and seasonality

To be able to decompose our time series object into these 3 components, we can use the decompose() function.

bbca_decom <- decompose(bbca_ts)

Visualize the decompose result using autoplot()

bbca_decom %>% autoplot()

In the decompose result we get visualization information:

  • Data : original data pattern
  • Seasonal (S) : have seasonal
  • Trend (T) : global data patterns (up or down)
  • Remainder (E) : data patterns that cannot be captured by seasonality and trends

Cross Validation

A good cross validation stage will always be done before model building, the data will be divided into train data and test data. Especially for time series data, the division of data should not be taken randomly but divided by splitting sequentially. In the migration_ts data, we know that the observed pattern (seasonal) is daily, so we will take test data for one season, namely the last 365 days

bbca_test <- tail(bbca_ts,365)
  
#membuat data train
bbca_train <- head(bbca_ts,-365)

Build Model

ARIMA

ARIMA is a combination of two methods, namely Auto Regressive (AR) and Moving Average (MA). The I explains Integrated.

  • AR(p) : Auto Regressive Auto regressive in ARIMA means that we create a linear regression model based on the lag of the data as a predictor.
  • I(d) : Integrated is the number of times the data is differenced to make the data stationary. The value of d can be known by finding out how many times differencing is done on the data.
  • MA(q) : Moving Average Moving Average in ARIMA means that we do a running average of the time series data itself

The requirement for data to be processed using ARIMA is that the data must be stationary. Stationary in time series means that the time series data we have does not have a trend or seasonal and fluctuates around its mean. In knowing whether our data is stationary or not, we can test the assumptions with the adf.test() function.

Integrated

bbca_train %>% adf.test()
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  .
#> Dickey-Fuller = -2.9951, Lag order = 10, p-value = 0.1571
#> alternative hypothesis: stationary
bbca_train %>% 
  diff(lag = 365) %>%
  adf.test()
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  .
#> Dickey-Fuller = -2.8689, Lag order = 9, p-value = 0.2105
#> alternative hypothesis: stationary

From the data above, the p-value is greater than 0.05, so it is stationary.

from the differencing results above we get the order d for model I in the SARIMA(p,d,q)(P,D,Q)[seasonal] model is:

  • d = 1 -> obtained from trend differencing diff()
  • D = 1 -> obtained from seasonal differencing diff(lag = 365)

AR(p) dan MA(q)

To find the value of p and q, we can see with the plot below:

bbca_train %>% 
  diff(lag = 365) %>% 
  diff() %>% 
  tsdisplay()

From the plot above we can see to look for the value of p then we look at the plot in the PACF that crosses the blue line. and to look for the value of q then we look at the plot in the ACF that crosses the blue line. the maximum value of p and q that we take is 5

  • p = 0,3
  • d = 1
  • q = 0,3

model candidate:

  • c(0,1,0)
  • c(0,1,3)
  • c(3,1,0)
  • c(3,1,3)
model_010 <- Arima(y = bbca_train,order = c(0,1,0))
model_013 <- Arima(y = bbca_train,order = c(0,1,3))
model_310 <- Arima(y = bbca_train,order = c(3,1,0))
model_313 <- Arima(y = bbca_train,order = c(3,1,3))
model_autoarima <- auto.arima(bbca_train)

Goodness of fit

For selecting the best model, we can use the AIC value. Remember: AIC value is information loss -> so the smallest value is desired. can be done by way of name_model$aic

model_010$aic
#> [1] 13107.29
model_013$aic
#> [1] 13090.24
model_310$aic
#> [1] 13089.3
model_313$aic
#> [1] 13093.82
model_autoarima$aic
#> [1] 13088.88

the best model of arima is:

  • model_autoarima
  • model_310
  • model_013

### Forecasting ARIMA

forecast_autoarima <- forecast(model_autoarima, h = 365)
forecast_310 <- forecast(model_310, h = 365)
forecast_013 <- forecast(model_013, h = 365)

Accuracy Arima

we will evaluate the model using the accuracy() function of the best model.

accuracy(forecast_autoarima$mean, bbca_test)
#>                ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
#> Test set 1018.037 1148.011 1020.241 10.15693 10.18238 0.9705849  10.53864
accuracy(forecast_310$mean, bbca_test)
#>                ME     RMSE      MAE     MPE     MAPE      ACF1 Theil's U
#> Test set 1016.768 1146.883 1019.007 10.1439 10.16975 0.9705785  10.52798
accuracy(forecast_013$mean, bbca_test)
#>                ME     RMSE     MAE      MPE     MAPE     ACF1 Theil's U
#> Test set 1014.565 1144.921 1016.88 10.12129 10.14801 0.970595  10.50944

From the Arima evaluation above model_013 has the smallest MAPE of our best models

Assumption

Assumptions on time series are tested to measure whether the residuals obtained from the modeling results are good enough to describe and capture the information in the data. Below are the assumptions to test:

Normality of residual

hist(model_013$residuals)

From the histogram above that the plot forms a normal looking curve. to make sure we can use the shapiro.test

shapiro.test(model_013$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_013$residuals
#> W = 0.913, p-value < 0.00000000000000022

after doing the shapiro test, the p-value obtained is smaller than 0.05, so the residuals of the 111 model are not normally distributed (if the residuals are changed normally, then do it like regression).

No-autocorrelation residual

To check the presence or absence of autocorrelation in time series forecasting results, we can use the Ljung-box test using the Box.test(residual model, type = “Ljung-Box) function which will produce a p-value. The hypothesis used is:

H0: residual has no-autocorrelation

H1: residual has autocorrelation

desired p-value > 0.05 (alpha), no-autocorrelation

Box.test(model_013$residuals, type = "Ljung-Box")
#> 
#>  Box-Ljung test
#> 
#> data:  model_013$residuals
#> X-squared = 0.010191, df = 1, p-value = 0.9196

Insight:

  • obtained p value greater than 0.05, then the residuals of model 013 do not occur autocorrelation.

STLM

It is one way to decompose the data but still retain the information from all the data we have. Using the train data information, we will try to create a model using stlm(). Some model parameters:

  • y : object time series

  • s.window : seasonal pattern to be captured (frequency value in ts() function)

  • method: “arima”

model_stlm_arima <- stlm(y = bbca_train,
                   s.window = 365,
                   method = "arima")

Forecasting

Forecast for the next 365 days

forecast_stlm_arima <- forecast(model_stlm_arima, h = 365)

Visualization of forecast results

bbca_ts %>% 
  autoplot()+
  autolayer(forecast_autoarima$mean, series = " model autoarima")+
  autolayer(forecast_310$mean, series = "model 310")+
  autolayer(forecast_013$mean, series = "model 013")+
  autolayer(forecast_stlm_arima$mean, series = "model stlm arima")

Evaluation

Evaluate between models using the MAPE() function from the MLmetrics package.

accuracy(forecast_013$mean, bbca_test)
#>                ME     RMSE     MAE      MPE     MAPE     ACF1 Theil's U
#> Test set 1014.565 1144.921 1016.88 10.12129 10.14801 0.970595  10.50944
accuracy(forecast_stlm_arima$mean, bbca_test)
#>                ME     RMSE      MAE     MPE     MAPE      ACF1 Theil's U
#> Test set 466.1979 622.3011 528.5009 4.56954 5.271799 0.9584233  5.709883

From the above evaluation with the percentage of MAPE between the test data, model_stlm_arima is our best model.

Conclusion

With a MAPE of 5.271799, model_stlm is the best model to predict stock price for 365 days from our test data.

Reference

  • yfinance