In this modern era, data plays a big role in decision making. As organizations or as individuals, we often have to deal with data that is collected sequentially in time. This kind of data is called time-series data. Have you dealt with this kind of data recently? Can you name some examples of time-series data that often "drive" your decision making?
Let me give you an example. If we play the role of investors, stock prices will be among the most important data affecting our buy or sell decisions. Stock price data is collected sequentially in time, which makes it an example of time-series data that "drives" our decision making.
In this article, I want to share how to analyze time-series data using R. As a case study, we will continue in our role as investors who want to analyze a stock price as one of the factors affecting an investment decision. Remember that many other factors influence investment decisions, so don't use this analysis as the only factor that "drives" yours!
As mentioned before, time-series data is data collected sequentially in time, usually recorded at a specific interval such as a day, a month, or a year. When we plot this kind of data, we often see patterns such as trend, seasonality, or a combination of both. Identifying these patterns can help us make decisions. For example, identifying a seasonal pattern in stock price data can help determine when to buy or sell the stock. In R, we can see these patterns in more detail by decomposing the data.
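To make these components concrete, here is a minimal sketch with simulated data (the numbers are purely illustrative): we build a monthly series from an upward trend plus a repeating seasonal pattern plus noise, and let decompose() separate them. Note that decompose() requires a ts object whose frequency is specified.
# Simulate a monthly series: trend + seasonality + noise
set.seed(42)
sim <- ts(1:120 + 10*sin(2*pi*(1:120)/12) + rnorm(120, sd=3),
          start=c(2014,1), frequency=12)
# Decompose into trend, seasonal, and random components
plot(decompose(sim))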
Obtaining stock prices is easy in R using the quantmod package, which is designed to assist quantitative traders in exploring and building trading models. The package uses Yahoo Finance as the source of its stock prices, so remember to check your stock's ticker on Yahoo Finance before using it. Here, we will analyze the stock price of "Bank Rakyat Indonesia" (BBRI.JK) from 2014 to 2019.
library(quantmod)
price_bbri <- getSymbols("BBRI.JK", auto.assign=FALSE, from="2014-01-01", to="2019-12-31")
head(price_bbri)
##            BBRI.JK.Open BBRI.JK.High BBRI.JK.Low BBRI.JK.Close BBRI.JK.Volume
## 2014-01-02 1460 1470 1440 1460 122627500
## 2014-01-03 1450 1460 1430 1450 102682500
## 2014-01-06 1440 1450 1400 1405 111960500
## 2014-01-07 1415 1425 1415 1415 88971500
## 2014-01-08 1420 1445 1420 1435 96254500
## 2014-01-09 1440 1475 1440 1465 212697500
## BBRI.JK.Adjusted
## 2014-01-02 1151.861
## 2014-01-03 1143.972
## 2014-01-06 1108.469
## 2014-01-07 1116.359
## 2014-01-08 1132.137
## 2014-01-09 1155.806
The data we obtained is daily OHLC data. For the rest of this article we will use the closing price, which can be obtained with the Cl() function. The closing price is considered a useful marker for assessing changes in a stock price over time. Some investors prefer the adjusted price over the closing price.
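For reference, quantmod also provides extractor functions for each price column, including Ad() for the adjusted price; a short sketch comparing the two:
# Extract single columns from the OHLC object
cl_bbri <- Cl(price_bbri)   # closing price
ad_bbri <- Ad(price_bbri)   # adjusted closing price
head(merge(cl_bbri, ad_bbri))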
Stock price data can be visualized with a static or an interactive plot. Visualization lets us analyze the patterns in the data and their impact on our investment decisions.
# Candle stick chart
chartSeries(price_bbri, name = "BBRI Price 2014-2019")
Overall, we can see an upward trend and seasonal patterns in the plot. To confirm the components of the data, we will decompose it. We will focus only on the closing price and convert the data to a monthly period first.
# Take only the closing price
closing_pr <- Cl(to.monthly(price_bbri))
## Warning in to.period(x, "months", indexAt = indexAt, name = name, ...): missing
## values removed from data
# Decompose it
dc <- decompose(as.ts(closing_pr, start=c(2014,1)))
plot(dc)
# Seasonal component
dc$seasonal
##         Jan        Feb        Mar       Apr        May        Jun
## 2014 94.225694 125.392361 178.267361 8.684028 -35.815972 -71.940972
## 2015 94.225694 125.392361 178.267361 8.684028 -35.815972 -71.940972
## 2016 94.225694 125.392361 178.267361 8.684028 -35.815972 -71.940972
## 2017 94.225694 125.392361 178.267361 8.684028 -35.815972 -71.940972
## 2018 94.225694 125.392361 178.267361 8.684028 -35.815972 -71.940972
## 2019 94.225694 125.392361 178.267361 8.684028 -35.815972 -71.940972
## Jul Aug Sep Oct Nov Dec
## 2014 -67.649306 -42.482639 -159.399306 -83.607639 -32.690972 87.017361
## 2015 -67.649306 -42.482639 -159.399306 -83.607639 -32.690972 87.017361
## 2016 -67.649306 -42.482639 -159.399306 -83.607639 -32.690972 87.017361
## 2017 -67.649306 -42.482639 -159.399306 -83.607639 -32.690972 87.017361
## 2018 -67.649306 -42.482639 -159.399306 -83.607639 -32.690972 87.017361
## 2019 -67.649306 -42.482639 -159.399306 -83.607639 -32.690972 87.017361
The decomposition output shows four plots of our closing price data:
Observed : original plot of the data
Trend : long-term movement in the mean. In this plot, we can see a significant upward trend starting around 2017.
Seasonal : repetitive seasonal fluctuation of the data. The closing price of BBRI tended to reach its highest in March and its lowest in September. Looking at this pattern, we can say that overall, the right time to sell this stock was at the beginning of the year (especially in March) and the right time to buy was at the end of the year (especially in September). A sketch after this list shows how to remove this seasonal component to inspect the underlying trend.
Random : irregular or random fluctuation not captured by the trend and seasonal components. The current Covid-19 pandemic is an example of a factor that causes this kind of random fluctuation. When the random component dominates the data, accurate forecasting becomes harder. That is why this article only uses data up to 2019.
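As referenced above, here is a minimal sketch of removing the seasonal effect from the monthly closing price, using the dc object returned by decompose() (its x element stores the original series):
# Seasonally adjusted series: original data minus the seasonal component
adjusted_pr <- dc$x - dc$seasonal
plot(adjusted_pr, main="Seasonally Adjusted Monthly Closing Price")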
library(highcharter)
highchart(type="stock") %>%
hc_add_series(price_bbri) %>%
hc_add_series(SMA(na.omit(Cl(price_bbri)),n=50),name="SMA(50)") %>%
hc_add_series(SMA(na.omit(Cl(price_bbri)),n=200),name="SMA(200)") %>%
hc_title(text="<b>BBRI Price Candle Stick Chart 2014-2019</b>")Using this kind of visualization will be very useful if we need to see the detail of the data. Addition indicator such as Single Moving Average(SMA) is also easier to read when we use interactive plot than static plot. We can use this plot to highlight some important technical analysis that affect our decision, such as golden or death cross.
Investopedia explains the golden and death cross as follows:
The golden cross occurs when a short-term moving average crosses over a major long-term moving average to the upside and is interpreted by analysts and traders as signaling a definitive upward turn in a market. Basically, the short-term average trends up faster than the long-term average, until they cross. Conversely, a similar downside moving average crossover constitutes the death cross and is understood to signal a decisive downturn in a market. The death cross occurs when the short term average trends down and crosses the long-term average, basically going in the opposite direction of the golden cross.
In other words, the golden cross is considered a good time to buy a stock and the death cross a good time to sell. But remember to consider the other factors too!
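As a rough sketch of how to locate these crossovers programmatically (the 50- and 200-day windows are conventional but illustrative choices), we can compare the two moving averages and look for sign changes:
library(TTR)
# Short- and long-term moving averages of the closing price
cl <- na.omit(Cl(price_bbri))
sma50  <- SMA(cl, n=50)
sma200 <- SMA(cl, n=200)
# 1 when the short MA is above the long MA, 0 otherwise
sig <- (sma50 > sma200) * 1
# Day-over-day change: +1 marks a golden cross, -1 a death cross
cross <- diff(sig)
index(cross)[which(cross == 1)]    # golden cross dates
index(cross)[which(cross == -1)]   # death cross dates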
We can also use this chart to compare our stock price with other stocks in the same sector and with the stock index. This kind of visualization helps us analyze the performance of our stock compared to the market. After this comparison and other analyses, we can determine whether our stock is still a better investment choice than other stocks in the same sector.
# Fetch BBNI, BMRI, and IHSG stock prices
price_bbni <- getSymbols("BBNI.JK",auto.assign=FALSE,from="2014-01-01",to="2019-12-31")
price_bmri <- getSymbols("BMRI.JK",auto.assign=FALSE,from="2014-01-01",to="2019-12-31")
price_ihsg <- getSymbols("^JKSE",auto.assign=FALSE,from="2014-01-01",to="2019-12-31")
# Compare the stock prices
highchart(type="stock") %>%
hc_add_series(Cl(price_bbri), name="BBRI") %>%
hc_add_series(Cl(price_bbni), name="BBNI") %>%
hc_add_series(Cl(price_bmri), name="BMRI") %>%
hc_add_series(Cl(price_ihsg), name="IHSG") %>%
hc_title(text="<b>BBRI vs BBNI vs BMRI vs IHSG Closing Price</b>")It turns out that our stock, BBRI, had the lowest price compared to Bank Mandiri (BMRI) and Bank Negara Indonesia (BBNI). But if we compare the stability, BBRI had the most stable price than the other stocks. However, the fluctuation of the BBRI was similar with the index (IHSG) which was a good thing because our stock performance was in line with the market.
Taking a closer look at returns is also important for investors. Of course, they do not want to hold a stock that gives a bad return on their investment. Returns can be calculated over different periods, such as daily, weekly, or monthly.
# Calculate the stocks return
return_bbri <- dailyReturn(Cl(price_bbri))
return_bbni <- dailyReturn(Cl(price_bbni))
return_bmri <- dailyReturn(Cl(price_bmri))
# Combine the returns as one data frame
returns <- data.frame(return_bbri,return_bbni,return_bmri)
names(returns) <- c("return_bbri","return_bbni","return_bmri")
returns <- as.xts(returns)
# Plot the returns
library(PerformanceAnalytics)
charts.PerformanceSummary(returns, main="Daily Return BBRI vs BBNI vs BMRI 2014-2019")
BBRI turned out to have the highest cumulative return compared to BBNI and BMRI. Relating this to the earlier discussion of price fluctuation, we can say that BBRI's high cumulative return was partly the result of its more stable price during this period. The drawdown part of the plot also shows that BBRI was the fastest to get back on track compared to the other two stocks.
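Earlier we noted that returns can be computed over different periods; quantmod also provides weeklyReturn() and monthlyReturn() (both wrappers around periodReturn()), as in this quick sketch:
# Weekly and monthly returns for BBRI
ret_weekly  <- weeklyReturn(Cl(price_bbri))
ret_monthly <- monthlyReturn(Cl(price_bbri))
head(ret_monthly)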
The idea of forecasting time-series data is to use past and present data to predict the future. Although no forecasting model can be 100% accurate, the results can still be very useful for making decisions about the future. There are plenty of models and algorithms that can be used to forecast time-series data, but in this discussion we will focus on the Naive Method and the ARIMA model.
Before forecasting our time-series data, we first split it into train and test data. The objective is to be able to evaluate the model on data that was not used to train it. In other words, we can say that:
Train data : data that is used to fit the model
Test data : data that is used to evaluate the model
As the analyst, we have to determine a split percentage that represents our train and test data well without incurring too much computational cost in training the model. The most common split is 80% train - 20% test, but it still depends on the data and the purpose of our forecast.
# Number of periods we want to forecast
n <- 100
# Splitting the data
train <- head(Cl(price_bbri), length(Cl(price_bbri))-n)
test <- tail(Cl(price_bbri), n)
In this case, we want to forecast 100 days of stock prices. Therefore, we use the last 100 observations as the test data and the remaining data as the train data.
The Naive Method is a forecasting method that uses the last observation as the forecast for all future periods. It is commonly used as the baseline model in forecasting: if our model's result is worse than the Naive Method's, we would prefer not to use our model.
library(forecast)
# Forecast the data
fc_na <- naive(train, h=n)
# Plot the result
autoplot(fc_na) +
autolayer(ts(test, start=length(train)), series = "Test Data")
The blue line is the mean of our prediction, while the darker and lighter shaded areas represent the 80% and 95% prediction intervals, respectively. If we compare the result to the actual test data, we can see that there are differences between them.
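If we need the numbers rather than the plot, the object returned by naive() stores the point forecasts and the interval bounds directly; a short sketch:
# Point forecasts and interval bounds from the forecast object
head(fc_na$mean)    # point forecasts
head(fc_na$lower)   # lower bounds of the 80% and 95% intervals
head(fc_na$upper)   # upper bounds of the 80% and 95% intervals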
The Auto-Regressive Integrated Moving Average (ARIMA) model is a combination of an Auto-Regressive model, integration via differencing, and a Moving Average model. The Auto-Regressive (AR) part models the relationship between an observation and a number of lagged observations. To use an ARIMA model, the time-series data has to be made stationary, which is achieved by differencing the data. Finally, the Moving Average (MA) part models the relationship between an observation and the residual errors from a moving average model applied to lagged observations.
ARIMA models are generally denoted ARIMA(p,d,q), where p, d, and q are non-negative integer parameters. The AR order (p) is the number of lagged observations in the model. The differencing order (d) is the number of times the data are differenced to become stationary; usually at most two rounds of differencing are needed. The MA order (q) is the size of the moving average window.
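If we prefer to pick the parameters ourselves, forecast's Arima() accepts an explicit order, and ndiffs() estimates how many differences are needed for stationarity. A minimal sketch (the order c(1,1,1) is just an illustrative choice, not a recommendation for this data):
library(forecast)
# Estimate the number of differences needed for stationarity
ndiffs(train)
# Fit an ARIMA with manually chosen orders p=1, d=1, q=1
model_manual <- Arima(train, order=c(1,1,1))
summary(model_manual)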
Determining appropriate values for these parameters by hand can be laborious. Fortunately, R provides the auto.arima() function, which does the job for us. This function can return two kinds of ARIMA models: non-seasonal ARIMA (the one discussed above) and seasonal ARIMA, which additionally models the seasonal part of the data. We will see which one produces the better forecast for our data.
Non-seasonal ARIMA
# Create the Model
model_non <- auto.arima(train, seasonal=FALSE)
# Forecast n periods of the data
fc_non <- forecast(model_non, h=n)
# Plot the result
autoplot(fc_non)+
autolayer(ts(test, start= length(train)), series="Test Data")
The non-seasonal ARIMA forecast shows an upward trend. However, if we compare it to the actual test data, there are still some differences between them.
Seasonal ARIMA
# Create the Model
model_s <- auto.arima(train)
# Forecast n periods of the data
fc_s <- forecast(model_s, h=n)
# Plot the result
autoplot(fc_s)+
autolayer(ts(test, start= length(train)), series="Test Data")
It turns out that whether or not we include the seasonal part of the data, auto.arima() produces the same result. This means the seasonal part of our data is not significant, so the non-seasonal ARIMA is the best ARIMA model for our dataset according to auto.arima().
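We can confirm that both calls selected the same model by inspecting the chosen orders with forecast's arimaorder(), which returns the fitted (p, d, q):
# Both models should report the same non-seasonal order
arimaorder(model_non)
arimaorder(model_s)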
After forecasting our data, the last step is to evaluate the forecasts. Forecast evaluation is done by checking whether the residuals meet the residual assumptions and by comparing accuracy metrics.
Check Residuals
A residual is the difference between a forecast and the corresponding actual value. A good model must have randomly distributed residuals without an obvious pattern. Here are the residual assumptions that have to be met:
Normally distributed (with mean 0), which can be checked with a normal curve. The curve has to be bell-shaped.
Constant variance, which can be checked with a residual plot. Constant variance shows up as a steady level of fluctuation over time.
No autocorrelation, which can be checked with an ACF plot and the Ljung-Box test. Autocorrelation shows up in the ACF plot as spikes beyond the upper or lower bound. In the Ljung-Box test, the p-value must be greater than 0.05 to meet this assumption. If the ACF plot and the Ljung-Box test disagree, we prefer the result of the Ljung-Box test (the test can also be run directly, as sketched below).
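The checkresiduals() function used below runs the Ljung-Box test for us, but the test can also be called directly through base R's Box.test(); a small sketch on the ARIMA forecast residuals (fitdf=3 matches the model degrees of freedom reported by checkresiduals() further down):
# Ljung-Box test on the ARIMA residuals, run manually
Box.test(residuals(fc_non), lag=10, type="Ljung-Box", fitdf=3)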
Residual of Naive Method
checkresiduals(fc_na)
##
## Ljung-Box test
##
## data: Residuals from Naive method
## Q* = 21.746, df = 10, p-value = 0.01645
##
## Model df: 0. Total lags used: 10
Residual of ARIMA Model
checkresiduals(fc_non)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,2) with drift
## Q* = 9.483, df = 7, p-value = 0.2198
##
## Model df: 3. Total lags used: 10
From the results, we know that the Naive Method did not meet the no-autocorrelation assumption, while the ARIMA model met all of the assumptions. So, based on the residual check, our ARIMA model produced a better result than the Naive Method.
Accuracy Metrics Comparison
There are many accuracy metrics in forecasting. In this discussion, we will compare the Root Mean Squared Error (RMSE) of both models.
Accuracy Metrics of Naive Method
accuracy(fc_na)
##                ME     RMSE      MAE        MPE     MAPE     MASE      ACF1
## Training set 2.012802 49.25043 33.89758 0.05909604 1.291768 1.000338 0.0408483
Accuracy Metrics of ARIMA
accuracy(fc_non)
##                 ME     RMSE      MAE         MPE     MAPE  MASE
## Training set 0.006778766 48.94989 33.95388 -0.01925042 1.292598 1.002
## ACF1
## Training set 0.0007096646
Based on the RMSE, the ARIMA model performs slightly better than the Naive Method in forecasting the BBRI closing price.
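Note that the tables above report training-set errors only. Since we held out a test set, we can also compare the two models on data they never saw by passing the actual test values to accuracy(); a short sketch:
# Evaluate both forecasts against the held-out test data
accuracy(fc_na,  as.numeric(test))
accuracy(fc_non, as.numeric(test))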
Combining the analysis and forecast of stock price data with other analyses, such as technical or sentiment analysis, is a good approach for making an investment decision.
Time-series data is everywhere, and much of it can affect our decisions. Besides the stock prices we have discussed throughout this article, there are certainly many other examples. For a business owner running a department store, sales figures are important time-series data that drive business decisions. We can analyze and forecast sales data to support those decisions as well.
Decomposing time-series data gives us a more detailed view of the patterns in our data, and analyzing them helps us make decisions. However, decomposition in R can only be applied to a ts object whose frequency has been specified.
After analyzing our time-series data, we usually want to forecast it to help us make decisions about the future. Keep in mind that forecasts can never be 100% accurate, so we should have a backup plan for any future decision. Forecasting can be done with many models and algorithms; in this article, we discussed the Naive Method and the ARIMA model.
The Naive Method is a really simple method, since we only use the last observation as the forecast, and it makes a great baseline for other models. Although in some cases its forecast can be useful on its own (for example, a salesperson who targets next month's sales to match this month's), it is usually better to use a more robust approach. On the other hand, the ARIMA model (including its seasonal variant) can be a good model for forecasting more fluctuating data. However, it can only be used for data that can be differenced to stationarity. There are many other models that may give better results for your dataset, so don't hesitate to explore further!