Introduction

The purpose of this report is to build a time series model and predict the Hang Seng Index (HSI). First, obtain the HSI data and build the training set. Then, explore the data and identify the features a model should capture. Next, find the ARIMA model with the lowest AIC and fit its residuals with a GARCH model. Finally, make a 10-day forecast and compare the outcome with the test data.

Process Data

# Load packages
library(tidyquant) 
library(fpp3)
library(tseries)
library(rugarch)
library(data.table)
library(plotly)
setwd("C:/Users/Apple/Desktop/RStudio Tour/note/TSA")

The historical data was downloaded from Yahoo Finance. The training data covers 2010-01-05 to 2021-01-29.
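
Since tidyquant is already loaded, an alternative (a hedged sketch, not used in this report) is to pull the series directly instead of reading a downloaded CSV; note that tq_get() returns lower-case column names (date, close, ...) rather than Date and Close.

# Alternative data source: download the series directly with tidyquant
HSI_raw <- tq_get("^HSI", get = "stock.prices",
                  from = "2010-01-05", to = "2021-01-29")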

HSI <- fread("^HSI.csv")
HSI <- HSI[, .(Date, Close = as.numeric(Close))]
HSI <- HSI %>%
  as_tsibble(index = Date) %>%
  mutate(Return = difference(log(Close)) * 100)

HSI <- HSI %>%
  na.omit() %>%
  mutate(time = row_number()) %>%
  update_tsibble(index = time)
head(HSI)
## # A tsibble: 6 x 4 [1]
##   Date        Close Return  time
##   <date>      <dbl>  <dbl> <int>
## 1 2010-01-05 22280.  2.07      1
## 2 2010-01-06 22417.  0.613     2
## 3 2010-01-07 22269. -0.659     3
## 4 2010-01-08 22297.  0.123     4
## 5 2010-01-11 22412.  0.513     5
## 6 2010-01-12 22327. -0.379     6

Plot the Close Price

HSI %>%
  ggplot(aes(x = Date, y = Close)) +
  geom_line() +
  theme_tq() +
  labs(title = "HSI Index (date data)")

According to the plot, the close price is not stationary. To fit an ARIMA model, it helps to difference the logarithm of the price, which yields the percentage log return.

Plot the return

Return(t) = (ln(Close(t)) - ln(Close(t-1))) * 100
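
A quick sanity check (a small sketch, not in the original) that difference(log(x)) * 100 computes exactly this quantity:

# diff(log(x)) equals log(x[t] / x[t-1]) element-wise
x <- c(100, 101, 99)
all.equal(diff(log(x)) * 100, log(x[-1] / x[-length(x)]) * 100)
## [1] TRUE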

HSI %>%
  ggplot(aes(x = Date, y = Return)) +
  geom_line() +
  theme_tq() +
  labs(title = "HSI Return (date data)")

The return series looks stationary, with no obvious seasonality.

A more objective way to determine whether differencing is required is a unit root test: a statistical hypothesis test of stationarity designed for exactly this purpose.

Unit Root Test

KPSS Test

H0: the data is stationary around a deterministic trend.

H1: the data has a unit root.

HSI %>% features(Return, unitroot_kpss)
## # A tibble: 1 x 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1    0.0308         0.1

The KPSS p-value is 0.1 (> 0.05), so we fail to reject the null hypothesis: the return series can be treated as stationary. (The reported p-value is truncated at 0.1.)

ADF Test

H0: a unit root is present.

H1: time series is stationary.

adf.test(HSI$Return, alternative = "stationary")
## 
##  Augmented Dickey-Fuller Test
## 
## data:  HSI$Return
## Dickey-Fuller = -13.999, Lag order = 13, p-value = 0.01
## alternative hypothesis: stationary

The ADF p-value is 0.01 (< 0.05), so the null hypothesis of a unit root is rejected. (The reported p-value is truncated at 0.01.)

Both tests agree: the return series is stationary and suitable for an ARIMA model.
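
As an additional check (a hedged aside; output not shown in the original), feasts provides unitroot_ndiffs, which estimates how many differences a series needs; for a stationary series it should return 0:

# number of differences required; 0 confirms stationarity
HSI %>% features(Return, unitroot_ndiffs)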

Model: ARIMA + GARCH

ARIMA part

Search for the ARIMA model with the lowest information criterion (fable's ARIMA() minimises AICc by default; stepwise = FALSE and approximation = FALSE make the search exhaustive).

arima.mod <- HSI %>%
    model(ARIMA(Return, stepwise = FALSE, approximation = FALSE))

arima.mod
## # A mable: 1 x 1
##   `ARIMA(Return, stepwise = FALSE, approximation = FALSE)`
##                                                    <model>
## 1                                           <ARIMA(2,0,3)>

ARIMA residual analysis:

arima.mod %>%
  gg_tsresiduals()

Ljung-Box Test

H0: Data cannot be distinguished from white noise.

augment(arima.mod) %>% features(.innov, ljung_box)
## # A tibble: 1 x 3
##   .model                                                 lb_stat lb_pvalue
##   <chr>                                                    <dbl>     <dbl>
## 1 ARIMA(Return, stepwise = FALSE, approximation = FALSE)  0.0101     0.920

According to the plots and the Ljung-Box test (p-value 0.920 > 0.05), the residuals of the model cannot be distinguished from white noise. It is reasonable to accept ARIMA(2,0,3).
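
To inspect the coefficient estimates behind the mable (a hedged aside; output omitted here), report() prints the fitted model:

# coefficient estimates and AICc of the chosen ARIMA(2,0,3)
arima.mod %>% report()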

Detect the ARCH effect in the residuals:

An ARCH effect is present if the ACF and PACF of the squared residuals show significant spikes.

p1 <- arima.mod %>%
  augment() %>%
  select(.innov) %>%
  ACF(.innov ^ 2) %>%
  autoplot() +
  labs(y = "resid^2  acf")

p2 <- arima.mod %>%
  augment() %>%
  select(.innov) %>%
  PACF(.innov ^ 2) %>%
  autoplot() +
  labs(y = "resid^2  pacf")

gridExtra::grid.arrange(p1, p2)

The ACF and PACF of the squared residuals are clearly significant, so an ARCH effect is present and it is plausible to fit a GARCH model to the residuals.
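
A formal complement to the visual check (a small sketch, not in the original run): a Ljung-Box test on the squared innovations, where a small p-value indicates an ARCH effect.

# Ljung-Box test on squared ARIMA residuals; H0: no autocorrelation (no ARCH effect)
Box.test(na.omit(augment(arima.mod)$.innov)^2,
         lag = 10, type = "Ljung-Box")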

GARCH part

Try the model: ARIMA(2,0,3) + GARCH(1,1).
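
In a GARCH(1,1) model, the conditional variance of the residual e(t) evolves as

sigma^2(t) = omega + alpha1 * e^2(t-1) + beta1 * sigma^2(t-1)

where omega, alpha1 and beta1 are exactly the parameters reported in the rugarch output below.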

spec <-
  ugarchspec(
    variance.model = list(
      model = "sGARCH",
      garchOrder = c(1, 1),
      submodel = NULL,
      external.regressors = NULL,
      variance.targeting = FALSE
    ),
    mean.model = list(armaOrder = c(2, 3),
                      include.mean = TRUE),
    distribution.model = "sged"
  )


fit <- ugarchfit(spec, HSI$Return,
                 solver = "hybrid")

Report the final model

print(fit)
## 
## *---------------------------------*
## *          GARCH Model Fit        *
## *---------------------------------*
## 
## Conditional Variance Dynamics    
## -----------------------------------
## GARCH Model  : sGARCH(1,1)
## Mean Model   : ARFIMA(2,0,3)
## Distribution : sged 
## 
## Optimal Parameters
## ------------------------------------
##         Estimate  Std. Error   t value Pr(>|t|)
## mu      0.025299    0.018994   1.33197 0.182871
## ar1    -1.312107    0.015853 -82.76469 0.000000
## ar2    -0.440748    0.020674 -21.31875 0.000000
## ma1     1.311944    0.002183 601.05156 0.000000
## ma2     0.429959    0.029283  14.68293 0.000000
## ma3     0.006806    0.011885   0.57264 0.566887
## omega   0.020438    0.007478   2.73295 0.006277
## alpha1  0.051404    0.009379   5.48086 0.000000
## beta1   0.933010    0.012840  72.66518 0.000000
## skew    0.916142    0.021129  43.36049 0.000000
## shape   1.383882    0.053582  25.82747 0.000000
## 
## Robust Standard Errors:
##         Estimate  Std. Error    t value Pr(>|t|)
## mu      0.025299    0.019449    1.30079 0.193331
## ar1    -1.312107    0.012415 -105.69088 0.000000
## ar2    -0.440748    0.016284  -27.06573 0.000000
## ma1     1.311944    0.002901  452.27417 0.000000
## ma2     0.429959    0.016539   25.99719 0.000000
## ma3     0.006806    0.006901    0.98623 0.324022
## omega   0.020438    0.008056    2.53687 0.011185
## alpha1  0.051404    0.011494    4.47233 0.000008
## beta1   0.933010    0.014748   63.26441 0.000000
## skew    0.916142    0.022258   41.16010 0.000000
## shape   1.383882    0.055471   24.94806 0.000000
## 
## LogLikelihood : -4046.441 
## 
## Information Criteria
## ------------------------------------
##                    
## Akaike       2.9933
## Bayes        3.0173
## Shibata      2.9933
## Hannan-Quinn 3.0020
## 
## Weighted Ljung-Box Test on Standardized Residuals
## ------------------------------------
##                          statistic p-value
## Lag[1]                       3.696 0.05453
## Lag[2*(p+q)+(p+q)-1][14]     8.013 0.19343
## Lag[4*(p+q)+(p+q)-1][24]    11.670 0.59853
## d.o.f=5
## H0 : No serial correlation
## 
## Weighted Ljung-Box Test on Standardized Squared Residuals
## ------------------------------------
##                         statistic p-value
## Lag[1]                      2.893 0.08898
## Lag[2*(p+q)+(p+q)-1][5]     6.596 0.06492
## Lag[4*(p+q)+(p+q)-1][9]     8.511 0.10215
## d.o.f=2
## 
## Weighted ARCH LM Tests
## ------------------------------------
##             Statistic Shape Scale P-Value
## ARCH Lag[3]     1.928 0.500 2.000  0.1650
## ARCH Lag[5]     2.589 1.440 1.667  0.3551
## ARCH Lag[7]     3.664 2.315 1.543  0.3976
## 
## Nyblom stability test
## ------------------------------------
## Joint Statistic:  1.5655
## Individual Statistics:              
## mu     0.10650
## ar1    0.06487
## ar2    0.06350
## ma1    0.08800
## ma2    0.08856
## ma3    0.12493
## omega  0.09098
## alpha1 0.07925
## beta1  0.07443
## skew   0.18009
## shape  0.27474
## 
## Asymptotic Critical Values (10% 5% 1%)
## Joint Statistic:          2.49 2.75 3.27
## Individual Statistic:     0.35 0.47 0.75
## 
## Sign Bias Test
## ------------------------------------
##                    t-value    prob sig
## Sign Bias           0.6568 0.51138    
## Negative Sign Bias  0.2529 0.80038    
## Positive Sign Bias  2.0783 0.03778  **
## Joint Effect       10.3007 0.01618  **
## 
## 
## Adjusted Pearson Goodness-of-Fit Test:
## ------------------------------------
##   group statistic p-value(g-1)
## 1    20     20.06       0.3911
## 2    30     32.83       0.2845
## 3    40     38.35       0.4995
## 4    50     56.52       0.2146
## 
## 
## Elapsed time : 4.976874
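
Besides the mean forecast used in the next section, the fitted conditional volatility can be extracted with rugarch's sigma() extractor (a hedged aside, not part of the original analysis):

# extract and plot the fitted conditional standard deviation
vol <- as.numeric(sigma(fit))
plot(vol, type = "l", main = "Fitted conditional volatility")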

Forecast

Build a forecast function to predict the return over the next 10 trading days.

fore <- function(input, p, q, n) {
  spec <-
    ugarchspec(
      variance.model = list(
        model = "sGARCH",
        garchOrder = c(1, 1),
        submodel = NULL,
        external.regressors = NULL,
        variance.targeting = FALSE
      ),
      mean.model = list(armaOrder = c(p, q),
                        include.mean = TRUE),
      distribution.model = "sged"
    )
  fit <- ugarchfit(spec, input$Return,
                   solver = "hybrid")
  fore <- ugarchforecast(fit, n.ahead = n)
  pred <- fore@forecast$seriesFor
  pred <- as.data.table(pred)
  setnames(pred, "Return")
  # add a step index so the forecast can be joined with the test set
  pred[, time := seq_len(n)][]
}


pred <- fore(HSI, 2, 3, 10)

# View the predictions
pred
##           Return time
##  1:  0.154925180    1
##  2: -0.109081450    2
##  3:  0.138608698    3
##  4: -0.064147019    4
##  5:  0.092721230    5
##  6: -0.023742514    6
##  7:  0.059931009    7
##  8:  0.001473559    8
##  9:  0.041297048    9
## 10:  0.014809375   10

Examine the predictions against the test set. The raw test data starts on 2021-01-29, the last day of the training set, so the first computed return falls on the first new trading day (2021-02-01).

test <- fread("test.csv")
test <- test[, .(Date, Close = as.numeric(Close))]
test <- test %>%
  as_tsibble(index = Date) %>%
  mutate(Return = difference(log(Close)) * 100)

test <- test %>%
  na.omit() %>%
  mutate(time = row_number()) %>%
  update_tsibble(index = time)

# Preview the test data
head(test, 5)
## # A tsibble: 5 x 4 [1]
##   Date        Close Return  time
##   <date>      <dbl>  <dbl> <int>
## 1 2021-02-01 28893.  2.13      1
## 2 2021-02-02 29249.  1.22      2
## 3 2021-02-03 29307.  0.201     3
## 4 2021-02-04 29114. -0.664     4
## 5 2021-02-05 29289.  0.600     5

Plot the forecast values against the test values.

foretest <- function(pred, test) {
  p <- pred %>%
    as_tsibble(index = time) %>%
    left_join(test, by = "time") %>%
    mutate(Pred = Return.x,
           Actual = Return.y) %>%
    select(time, Date, Close, Pred, Actual) %>%
    pivot_longer(
      cols = c(Pred, Actual),
      values_to = "Return",
      names_to = "Type"
    ) %>%
    ggplot(aes(x = Date, y = Return, color = Type)) +
    geom_line() +
    scale_x_date(date_minor_breaks = "1 day") +
    ggsci::scale_color_aaas()+
    labs(title = "10 Trading Days Forecast")
  ggplotly(p)
}

foretest(pred, test)

A discrepancy is easy to observe: the actual curve appears to lag one step behind the prediction curve. However, the model still captures some of the trend, since the locations of the inflection points are fairly close. Also, as the forecast horizon grows, the prediction curve gradually smooths out toward the mean.

(Perhaps the deviation is caused by market efficiency, which might be good news for society. Just kidding!)

Of course, the model can be improved. Cross-validation would be useful. It might also be fun to use bootstrapping and the hilo() function to draw prediction intervals.
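
As a minimal sketch of the interval idea (hedged: this uses the fitted fable ARIMA alone and ignores the GARCH variance dynamics, so the intervals are only indicative):

# bootstrap 10-step-ahead forecasts and extract 95% prediction intervals
arima.mod %>%
  forecast(h = 10, bootstrap = TRUE, times = 1000) %>%
  hilo(level = 95)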

Cross Validation

From 2020 to 2021, slide a window of 100 trading days as training data and make a one-day-ahead forecast (a toy sketch of the windowing follows the examples below).

For instance,

train Day1~Day100, forecast Day101;

train Day2~Day101, forecast Day102;

train Day3~Day102, forecast Day103; and so on.
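
A toy illustration of this windowing (a hedged sketch; fpp3 is assumed loaded as above): stretch_tsibble() grows the window one step at a time, and keeping only the last k rows of each slice turns it into a fixed-width rolling window, which is the same trick used in the implementation below.

# fixed-width rolling windows of size 3 on a toy series
toy <- tsibble(y = rnorm(5), time = 1:5, index = time)
toy %>%
  stretch_tsibble(.step = 1, .init = 3, .id = ".id") %>%
  group_by_key() %>%
  slice(n() - 2:0) # keep the last 3 rows of each slice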

Find the ARMA(p, q) orders with the lowest AICc on the data from 2020 onwards.

CrossValidation <- HSI %>%
  filter(year(Date) >= 2020)

buildARIMA <- function(input) {
    arima.mod <- input %>%
        model(ARIMA(Return, stepwise = FALSE, approximation = FALSE))
    
    list(
        model = arima.mod,
        ljung_box = augment(arima.mod) %>% features(.innov, ljung_box)
    )
}

buildARIMA(CrossValidation)
## $model
## # A mable: 1 x 1
##   `ARIMA(Return, stepwise = FALSE, approximation = FALSE)`
##                                                    <model>
## 1                                           <ARIMA(3,0,1)>
## 
## $ljung_box
## # A tibble: 1 x 3
##   .model                                                  lb_stat lb_pvalue
##   <chr>                                                     <dbl>     <dbl>
## 1 ARIMA(Return, stepwise = FALSE, approximation = FALSE) 0.000959     0.975

According to the result (Ljung-Box p-value 0.975), ARMA(3,1) is acceptable.

Cross Validation Test

ARIMA(3,0,1) + GARCH(1,1)

CrossValidation <- CrossValidation %>%
  stretch_tsibble(.step = 1, .init = 100, .id = ".id") %>%
  group_by_key() %>%
  slice(n() - 99:0) # keep the last 100 rows of each slice: a fixed-width window
spec <-
  ugarchspec(
    variance.model = list(
      model = "sGARCH",
      garchOrder = c(1, 1),
      submodel = NULL,
      external.regressors = NULL,
      variance.targeting = FALSE
    ),
    mean.model = list(armaOrder = c(3, 1),
                      include.mean = TRUE),
    distribution.model = "sged"
  )

Crossforecast <- function(data) {
  CrossFore <- vector()
  for (i in seq_len(max(data$.id))) {
    train <- data %>%
      filter(.id == i)

    fit <- ugarchfit(spec, train$Return,
                     solver = "hybrid")

    # one-step-ahead mean forecast for window i
    CrossFore[i] <-
      ugarchforecast(fit, n.ahead = 1)@forecast$seriesFor[1]
  }
  CrossFore
}

CrossPred <- Crossforecast(CrossValidation)

tsCV <- HSI %>%
  filter(year(Date) >= 2020) %>%
  mutate(time = row_number()) %>%
  filter(time > 100) %>%
  # drop the final forecast, which falls beyond the observed sample
  mutate(Pred = CrossPred[1:(length(CrossPred) - 1)])
head(tsCV)
## # A tsibble: 6 x 5 [1]
##   Date        Close Return  time    Pred
##   <date>      <dbl>  <dbl> <int>   <dbl>
## 1 2020-05-29 22961. -0.743   101 -0.490 
## 2 2020-06-01 23733.  3.30    102  0.0586
## 3 2020-06-02 23996.  1.10    103 -0.957 
## 4 2020-06-03 24326.  1.36    104 -0.744 
## 5 2020-06-04 24366.  0.167   105 -0.329 
## 6 2020-06-05 24770.  1.64    106 -0.532

Plot the result

tsCV %>%
  pivot_longer(
    cols = c(Pred, Return),
    values_to = "Return",
    names_to = "Type"
  ) %>%
  ggplot(aes(x = Date, y = Return, color = Type)) +
  geom_line() +
  scale_x_date(date_minor_breaks = "1 day") +
  theme_tq() +
  labs(title = "Cross Validation Test")

Weakness:

The fluctuation of the forecast curve is significantly weaker than that of the actual curve; one-step-ahead ARMA mean forecasts are pulled toward the unconditional mean, while the GARCH part models the variance rather than the level.

Advantage:

The trend of the prediction curve is informative. Moreover, after June 2020 the sign of the forecast is a fairly reliable guide to whether the actual return is positive or negative.
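
One hedged way to quantify that sign claim (not computed in the original) is the directional hit rate of the one-step forecasts:

# share of days where forecast and actual return have the same sign
tsCV %>%
  as_tibble() %>%
  filter(Date >= as.Date("2020-06-01")) %>%
  summarise(hit_rate = mean(sign(Pred) == sign(Return)))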

A More Practical Application

The model above captures some of the trend in the HSI, but it is not accurate enough for practical use. The bright side is that the model may fit better for some specific stocks.

Below is a better-fitting ARIMA+GARCH model for China Shenhua Energy Company Limited (601088.SS).

To keep this report brief, the code for this part is hidden, since it closely resembles the code presented above. To ensure reproducibility, all code and data for this report have been uploaded to my GitHub.