In this modern era, data plays a big role in decision making. As organizations or as individuals, we often have to deal with data that is collected sequentially in time. This kind of data is called time-series data. Have you dealt with this kind of data recently? Can you name some examples of time-series data that often "drive" your decision making?
Let me give you an example. If we play the role of investors, stock prices will be among the most important data affecting our buy or sell decisions. Stock price data is collected sequentially in time, which makes it an example of time-series data that "drives" our decision making.
In this article, I want to share how to analyze time-series data using R. As a case study, we will continue in our role as investors who want to analyze a stock price as one of the factors affecting an investment decision. Remember that many other factors influence investment decisions, so don't use this analysis as the only factor that "drives" yours!
As mentioned before, time-series data is data collected sequentially in time, usually recorded at a specific interval such as a day, a month, or a year. When we plot this kind of data, we often see patterns such as trend, seasonality, or a combination of both. Identifying these patterns can help us make decisions. For example, identifying a seasonal pattern in stock price data can help determine when to buy or sell the stock. In R, we can see these patterns in more detail by decomposing the data.
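To make these components concrete, here is a minimal sketch with simulated data (the numbers are purely illustrative): we build a monthly series from an upward trend plus a repeating seasonal pattern plus noise, and let decompose() separate them. Note that decompose() requires a ts object whose frequency is specified.
# Simulate a monthly series: trend + seasonality + noise
set.seed(42)
sim <- ts(1:120 + 10*sin(2*pi*(1:120)/12) + rnorm(120, sd=3),
          start=c(2014,1), frequency=12)
# Decompose into trend, seasonal, and random components
plot(decompose(sim))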
Obtaining stock prices is easy in R using the quantmod package, which is designed to assist quantitative traders in exploring and building trading models. The package uses Yahoo Finance as the source of its stock prices, so remember to check your stock's ticker on Yahoo Finance before using it. Here, we will analyze the stock price of "Bank Rakyat Indonesia" (BBRI.JK) from 2014 to 2019.
library(quantmod)
price_bbri <- getSymbols("BBRI.JK", auto.assign=FALSE, from="2014-01-01", to="2019-12-31")
head(price_bbri)
##            BBRI.JK.Open BBRI.JK.High BBRI.JK.Low BBRI.JK.Close BBRI.JK.Volume
## 2014-01-02 1460 1470 1440 1460 122627500
## 2014-01-03 1450 1460 1430 1450 102682500
## 2014-01-06 1440 1450 1400 1405 111960500
## 2014-01-07 1415 1425 1415 1415 88971500
## 2014-01-08 1420 1445 1420 1435 96254500
## 2014-01-09 1440 1475 1440 1465 212697500
## BBRI.JK.Adjusted
## 2014-01-02 1151.861
## 2014-01-03 1143.972
## 2014-01-06 1108.469
## 2014-01-07 1116.359
## 2014-01-08 1132.137
## 2014-01-09 1155.806
The data we obtained is daily OHLC data. For the rest of this article we will use the closing price, which can be obtained with the Cl() function. The closing price is considered a useful marker for assessing changes in a stock price over time. Some investors prefer the adjusted price over the closing price.
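For reference, quantmod also provides extractor functions for each price column, including Ad() for the adjusted price; a short sketch comparing the two:
# Extract single columns from the OHLC object
cl_bbri <- Cl(price_bbri)   # closing price
ad_bbri <- Ad(price_bbri)   # adjusted closing price
head(merge(cl_bbri, ad_bbri))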
Stock price data can be visualized with a static or an interactive plot. Visualization lets us analyze the patterns in the data and their impact on our investment decisions.
# Candle stick chart
chartSeries(price_bbri, name = "BBRI Price 2014-2019")
Overall, we can see an upward trend and seasonal patterns in the plot. To confirm the components of the data, we will decompose it. We will focus only on the closing price and convert the data to a monthly period first.
# Take only the closing price
closing_pr <- Cl(to.monthly(price_bbri))
## Warning in to.period(x, "months", indexAt = indexAt, name = name, ...): missing
## values removed from data
# Decompose it
dc <- decompose(as.ts(closing_pr, start=c(2014,1)))
plot(dc)
# Seasonal component
dc$seasonal
##         Jan        Feb        Mar       Apr        May        Jun
## 2014 94.225694 125.392361 178.267361 8.684028 -35.815972 -71.940972
## 2015 94.225694 125.392361 178.267361 8.684028 -35.815972 -71.940972
## 2016 94.225694 125.392361 178.267361 8.684028 -35.815972 -71.940972
## 2017 94.225694 125.392361 178.267361 8.684028 -35.815972 -71.940972
## 2018 94.225694 125.392361 178.267361 8.684028 -35.815972 -71.940972
## 2019 94.225694 125.392361 178.267361 8.684028 -35.815972 -71.940972
## Jul Aug Sep Oct Nov Dec
## 2014 -67.649306 -42.482639 -159.399306 -83.607639 -32.690972 87.017361
## 2015 -67.649306 -42.482639 -159.399306 -83.607639 -32.690972 87.017361
## 2016 -67.649306 -42.482639 -159.399306 -83.607639 -32.690972 87.017361
## 2017 -67.649306 -42.482639 -159.399306 -83.607639 -32.690972 87.017361
## 2018 -67.649306 -42.482639 -159.399306 -83.607639 -32.690972 87.017361
## 2019 -67.649306 -42.482639 -159.399306 -83.607639 -32.690972 87.017361
The decomposition output shows four plots of our closing price data:
Observed : original plot of the data
Trend : long-term movement in the mean. In this plot, we can see a significant upward trend starting around 2017.
Seasonal : repetitive seasonal fluctuation of the data. The closing price of BBRI tended to reach its highest in March and its lowest in September. Looking at this pattern, we can say that overall, the right time to sell this stock was at the beginning of the year (especially in March) and the right time to buy was at the end of the year (especially in September). A sketch after this list shows how to remove this seasonal component to inspect the underlying trend.
Random : irregular or random fluctuation not captured by the trend and seasonal components. The current Covid-19 pandemic is an example of a factor that causes this kind of random fluctuation. When the random component dominates the data, accurate forecasting becomes harder. That is why this article only uses data up to 2019.
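As referenced above, here is a minimal sketch of removing the seasonal effect from the monthly closing price, using the dc object returned by decompose() (its x element stores the original series):
# Seasonally adjusted series: original data minus the seasonal component
adjusted_pr <- dc$x - dc$seasonal
plot(adjusted_pr, main="Seasonally Adjusted Monthly Closing Price")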
library(highcharter)
highchart(type="stock") %>%
hc_add_series(price_bbri) %>%
hc_add_series(SMA(na.omit(Cl(price_bbri)),n=50),name="SMA(50)") %>%
hc_add_series(SMA(na.omit(Cl(price_bbri)),n=200),name="SMA(200)") %>%
hc_title(text="<b>BBRI Price Candle Stick Chart 2014-2019</b>")Using this kind of visualization will be very useful if we need to see the detail of the data. Addition indicator such as Single Moving Average(SMA) is also easier to read when we use interactive plot than static plot. We can use this plot to highlight some important technical analysis that affect our decision, such as golden or death cross.
Investopedia explains the golden and death cross as follows:
The golden cross occurs when a short-term moving average crosses over a major long-term moving average to the upside and is interpreted by analysts and traders as signaling a definitive upward turn in a market. Basically, the short-term average trends up faster than the long-term average, until they cross. Conversely, a similar downside moving average crossover constitutes the death cross and is understood to signal a decisive downturn in a market. The death cross occurs when the short term average trends down and crosses the long-term average, basically going in the opposite direction of the golden cross.
In other words, the golden cross is considered a good time to buy a stock and the death cross a good time to sell. But remember to consider the other factors too!
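As a rough sketch of how to locate these crossovers programmatically (the 50- and 200-day windows are conventional but illustrative choices), we can compare the two moving averages and look for sign changes:
library(TTR)
# Short- and long-term moving averages of the closing price
cl <- na.omit(Cl(price_bbri))
sma50  <- SMA(cl, n=50)
sma200 <- SMA(cl, n=200)
# 1 when the short MA is above the long MA, 0 otherwise
sig <- (sma50 > sma200) * 1
# Day-over-day change: +1 marks a golden cross, -1 a death cross
cross <- diff(sig)
index(cross)[which(cross == 1)]    # golden cross dates
index(cross)[which(cross == -1)]   # death cross dates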
We can also use this chart to compare our stock price with other stocks in the same sector and with the stock index. This kind of visualization helps us analyze the performance of our stock compared to the market. After this comparison and other analyses, we can determine whether our stock is still a better investment choice than other stocks in the same sector.
# Fetch BBNI, BMRI, and IHSG stock prices
price_bbni <- getSymbols("BBNI.JK",auto.assign=FALSE,from="2014-01-01",to="2019-12-31")
price_bmri <- getSymbols("BMRI.JK",auto.assign=FALSE,from="2014-01-01",to="2019-12-31")
price_ihsg <- getSymbols("^JKSE",auto.assign=FALSE,from="2014-01-01",to="2019-12-31")
# Compare the stock prices
highchart(type="stock") %>%
hc_add_series(Cl(price_bbri), name="BBRI") %>%
hc_add_series(Cl(price_bbni), name="BBNI") %>%
hc_add_series(Cl(price_bmri), name="BMRI") %>%
hc_add_series(Cl(price_ihsg), name="IHSG") %>%
hc_title(text="<b>BBRI vs BBNI vs BMRI vs IHSG Closing Price</b>")It turns out that our stock, BBRI, had the lowest price compared to Bank Mandiri (BMRI) and Bank Negara Indonesia (BBNI). But if we compare the stability, BBRI had the most stable price than the other stocks. However, the fluctuation of the BBRI was similar with the index (IHSG) which was a good thing because our stock performance was in line with the market.
Taking a closer look at returns is also important for investors. Of course, they do not want to hold a stock that gives a bad return on their investment. Returns can be calculated over different periods, such as daily, weekly, or monthly.
# Calculate the stocks return
return_bbri <- dailyReturn(Cl(price_bbri))
return_bbni <- dailyReturn(Cl(price_bbni))
return_bmri <- dailyReturn(Cl(price_bmri))
# Combine the returns as one data frame
returns <- data.frame(return_bbri,return_bbni,return_bmri)
names(returns) <- c("return_bbri","return_bbni","return_bmri")
returns <- as.xts(returns)
# Plot the returns
library(PerformanceAnalytics)
charts.PerformanceSummary(returns, main="Daily Return BBRI vs BBNI vs BMRI 2014-2019")
BBRI turned out to have the highest cumulative return compared to BBNI and BMRI. Relating this to the earlier discussion of price fluctuation, we can say that BBRI's high cumulative return was partly the result of its more stable price during this period. The drawdown part of the plot also shows that BBRI was the fastest to get back on track compared to the other two stocks.
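Earlier we noted that returns can be computed over different periods; quantmod also provides weeklyReturn() and monthlyReturn() (both wrappers around periodReturn()), as in this quick sketch:
# Weekly and monthly returns for BBRI
ret_weekly  <- weeklyReturn(Cl(price_bbri))
ret_monthly <- monthlyReturn(Cl(price_bbri))
head(ret_monthly)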
The idea of forecasting time-series data is to use past and present data to predict the future. Although no forecasting model can be 100% accurate, the results can still be very useful for making decisions about the future. There are plenty of models and algorithms that can be used to forecast time-series data, but in this discussion we will focus on the Naive Method and the ARIMA model.
Before forecasting our time-series data, we first split it into train and test data. The objective is to be able to evaluate the model on data that was not used to train it. In other words, we can say that:
Train data : data that is used to fit the model
Test data : data that is used to evaluate the model
As the analyst, we have to determine a split percentage that represents our train and test data well without incurring too much computational cost in training the model. The most common split is 80% train - 20% test, but it still depends on the data and the purpose of our forecast.
# Number of periods we want to forecast
n <- 100
# Splitting the data
train <- head(Cl(price_bbri), length(Cl(price_bbri))-n)
test <- tail(Cl(price_bbri), n)
In this case, we want to forecast 100 days of stock prices. Therefore, we use the last 100 observations as the test data and the remaining data as the train data.
The Naive Method is a forecasting method that uses the last observation as the forecast for all future periods. It is commonly used as the baseline model in forecasting: if our model's result is worse than the Naive Method's, we would prefer not to use our model.
library(forecast)
# Forecast the data
fc_na <- naive(train, h=n)
# Plot the result
autoplot(fc_na) +
autolayer(ts(test, start=length(train)), series = "Test Data")
The blue line is the mean of our prediction, while the darker and lighter shaded areas represent the 80% and 95% prediction intervals, respectively. If we compare the result to the actual test data, we can see that there are differences between them.
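If we need the numbers rather than the plot, the object returned by naive() stores the point forecasts and the interval bounds directly; a short sketch:
# Point forecasts and interval bounds from the forecast object
head(fc_na$mean)    # point forecasts
head(fc_na$lower)   # lower bounds of the 80% and 95% intervals
head(fc_na$upper)   # upper bounds of the 80% and 95% intervals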
The Auto-Regressive Integrated Moving Average (ARIMA) model is a combination of an Auto-Regressive model, integration via differencing, and a Moving Average model. The Auto-Regressive (AR) part models the relationship between an observation and a number of lagged observations. To use an ARIMA model, the time-series data has to be made stationary, which is achieved by differencing the data. Finally, the Moving Average (MA) part models the relationship between an observation and the residual errors from a moving average model applied to lagged observations.
ARIMA models are generally denoted ARIMA(p,d,q), where p, d, and q are non-negative integer parameters. The AR order (p) is the number of lagged observations in the model. The differencing order (d) is the number of times the data are differenced to become stationary; usually at most two rounds of differencing are needed. The MA order (q) is the size of the moving average window.
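If we prefer to pick the parameters ourselves, forecast's Arima() accepts an explicit order, and ndiffs() estimates how many differences are needed for stationarity. A minimal sketch (the order c(1,1,1) is just an illustrative choice, not a recommendation for this data):
library(forecast)
# Estimate the number of differences needed for stationarity
ndiffs(train)
# Fit an ARIMA with manually chosen orders p=1, d=1, q=1
model_manual <- Arima(train, order=c(1,1,1))
summary(model_manual)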
Determining appropriate values for these parameters by hand can be laborious. Fortunately, R provides the auto.arima() function, which does the job for us. This function can return two kinds of ARIMA models: non-seasonal ARIMA (the one discussed above) and seasonal ARIMA, which additionally models the seasonal part of the data. We will see which one produces the better forecast for our data.
Non-seasonal ARIMA
# Create the Model
model_non <- auto.arima(train, seasonal=FALSE)
# Forecast n periods of the data
fc_non <- forecast(model_non, h=n)
# Plot the result
autoplot(fc_non)+
autolayer(ts(test, start= length(train)), series="Test Data")
The non-seasonal ARIMA forecast shows an upward trend. However, if we compare it to the actual test data, there are still some differences between them.
Seasonal ARIMA
# Create the Model
model_s <- auto.arima(train)
# Forecast n periods of the data
fc_s <- forecast(model_s, h=n)
# Plot the result
autoplot(fc_s)+
autolayer(ts(test, start= length(train)), series="Test Data")
It turns out that whether or not we include the seasonal part of the data, auto.arima() produces the same result. This means the seasonal part of our data is not significant, so the non-seasonal ARIMA is the best ARIMA model for our dataset according to auto.arima().
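We can confirm that both calls selected the same model by inspecting the chosen orders with forecast's arimaorder(), which returns the fitted (p, d, q):
# Both models should report the same non-seasonal order
arimaorder(model_non)
arimaorder(model_s)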
After forecasting our data, the last step is to evaluate the forecasts. Forecast evaluation is done by checking whether the residuals meet the residual assumptions and by comparing accuracy metrics.
Check Residuals
A residual is the difference between a forecast and the corresponding actual value. A good model must have randomly distributed residuals without an obvious pattern. Here are the residual assumptions that have to be met:
Normally distributed (with mean 0), which can be checked with a normal curve. The curve has to be bell-shaped.
Constant variance, which can be checked with a residual plot. Constant variance shows up as a steady level of fluctuation over time.
No autocorrelation, which can be checked with an ACF plot and the Ljung-Box test. Autocorrelation shows up in the ACF plot as spikes beyond the upper or lower bound. In the Ljung-Box test, the p-value must be greater than 0.05 to meet this assumption. If the ACF plot and the Ljung-Box test disagree, we prefer the result of the Ljung-Box test (the test can also be run directly, as sketched below).
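The checkresiduals() function used below runs the Ljung-Box test for us, but the test can also be called directly through base R's Box.test(); a small sketch on the ARIMA forecast residuals (fitdf=3 matches the model degrees of freedom reported by checkresiduals() further down):
# Ljung-Box test on the ARIMA residuals, run manually
Box.test(residuals(fc_non), lag=10, type="Ljung-Box", fitdf=3)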
Residual of Naive Method
checkresiduals(fc_na)
##
## Ljung-Box test
##
## data: Residuals from Naive method
## Q* = 21.746, df = 10, p-value = 0.01645
##
## Model df: 0. Total lags used: 10
Residual of ARIMA Model
checkresiduals(fc_non)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,2) with drift
## Q* = 9.483, df = 7, p-value = 0.2198
##
## Model df: 3. Total lags used: 10
From the results, we know that the Naive Method did not meet the no-autocorrelation assumption, while the ARIMA model met all of the assumptions. So, based on the residual check, our ARIMA model produced a better result than the Naive Method.
Accuracy Metrics Comparison
There are many accuracy metrics in forecasting. In this discussion, we will compare the Root Mean Squared Error (RMSE) of both models.
Accuracy Metrics of Naive Method
accuracy(fc_na)
##                ME     RMSE      MAE        MPE     MAPE     MASE      ACF1
## Training set 2.012802 49.25043 33.89758 0.05909604 1.291768 1.000338 0.0408483
Accuracy Metrics of ARIMA
accuracy(fc_non)
##                 ME     RMSE      MAE         MPE     MAPE  MASE
## Training set 0.006778766 48.94989 33.95388 -0.01925042 1.292598 1.002
## ACF1
## Training set 0.0007096646
Based on the RMSE, the ARIMA model performs slightly better than the Naive Method in forecasting the BBRI closing price.
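Note that the tables above report training-set errors only. Since we held out a test set, we can also compare the two models on data they never saw by passing the actual test values to accuracy(); a short sketch:
# Evaluate both forecasts against the held-out test data
accuracy(fc_na,  as.numeric(test))
accuracy(fc_non, as.numeric(test))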
Combining the analysis and forecast of stock price data with other analyses, such as technical or sentiment analysis, is a good approach for making an investment decision.
Time-series data is everywhere, and much of it can affect our decisions. Besides the stock prices we have discussed throughout this article, there are certainly many other examples. For a business owner running a department store, sales figures are important time-series data that drive business decisions. We can analyze and forecast sales data to support those decisions as well.
Decomposing time-series data gives us a more detailed view of the patterns in our data, and analyzing them helps us make decisions. However, decomposition in R can only be applied to a ts object whose frequency has been specified.
After analyzing our time-series data, we usually want to forecast it to help us make decisions about the future. Keep in mind that forecasts can never be 100% accurate, so we should have a backup plan for any future decision. Forecasting can be done with many models and algorithms; in this article, we discussed the Naive Method and the ARIMA model.
The Naive Method is a really simple method, since we only use the last observation as the forecast, and it makes a great baseline for other models. Although in some cases its forecast can be useful on its own (for example, a salesperson who targets next month's sales to match this month's), it is usually better to use a more robust approach. On the other hand, the ARIMA model (including its seasonal variant) can be a good model for forecasting more fluctuating data. However, it can only be used for data that can be differenced to stationarity. There are many other models that may give better results for your dataset, so don't hesitate to explore further!