The purpose is to build a time series model and predict the Hang Seng Index (HSI). Firstly, get the HSI index and build the training data. Then, explore the data and capture the features for a model. Find the ARIMA model with the least AIC and fit the residuals by GARCH model. Finally, make a 10 days forecast, compare the outcome with testing data.
#loading package
library(tidyquant)
library(fpp3)
library(tseries)
library(rugarch)
library(data.table)
library(plotly)
setwd("C:/Users/Apple/Desktop/RStudio Tour/note/TSA")
The historical data is downloaded from Yahoo Finance([ctrl+click] to attach link) The training data covered from 2010-01-05 to 2021-01-29.
HSI <- fread("^HSI.csv")
HSI <- HSI[, .(Date, Close = as.numeric(Close))]
HSI <- HSI %>%
tsibble(index = Date) %>%
mutate(Return = difference(log(HSI$Close)) * 100)
HSI <- HSI %>%
na.omit() %>%
mutate(time = row_number()) %>%
update_tsibble(index = time)
head(HSI)
## # A tsibble: 6 x 4 [1]
## Date Close Return time
## <date> <dbl> <dbl> <int>
## 1 2010-01-05 22280. 2.07 1
## 2 2010-01-06 22417. 0.613 2
## 3 2010-01-07 22269. -0.659 3
## 4 2010-01-08 22297. 0.123 4
## 5 2010-01-11 22412. 0.513 5
## 6 2010-01-12 22327. -0.379 6
HSI %>%
ggplot(aes(x = Date, y = Close)) +
geom_line() +
theme_tq() +
labs(title = "HSI Index (date data)")
According to the plot, the Close Price is not stationary. In order to fit an ARIMA model, it could be helpful to difference the logarithm of the initial price.
Return = (ln(Close Price (t+1)) - ln(Close Price(t))) *100
HSI %>%
ggplot(aes(x = Date, y = Return)) +
geom_line() +
theme_tq() +
labs(title = "HSI Return (date data)")
The Return seems stationary without seasonality.
One way to determine more objectively whether differencing is required is to use a unit root test. These are statistical hypothesis tests of stationarity that are designed for determining whether differencing is required.
H0:the data is stationary around a deterministic trend.
HSI%>%features(Return,unitroot_kpss)
## # A tibble: 1 x 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.0308 0.1
KPSS test Pvalue = 0.1 > 0.05, so the null hypothesis is accepted.
H0: a unit root is present.
H1: time series is stationary.
adf.test(HSI$Return, alternative = "stationary")
##
## Augmented Dickey-Fuller Test
##
## data: HSI$Return
## Dickey-Fuller = -13.999, Lag order = 13, p-value = 0.01
## alternative hypothesis: stationary
ADF test Pvalue = 0.01 < 0.05, so the null hypothesis is rejected.
In conclusion, the time series is stationary for an ARIMA model.
arima.mod <- HSI %>%
model(ARIMA(Return, stepwise = FALSE, approximation = FALSE))
arima.mod
## # A mable: 1 x 1
## `ARIMA(Return, stepwise = FALSE, approximation = FALSE)`
## <model>
## 1 <ARIMA(2,0,3)>
arima.mod%>%
gg_tsresiduals()
H0: Data cannot be distinguished from white noise.
augment(arima.mod)%>%features(.innov,ljung_box)
## # A tibble: 1 x 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(Return, stepwise = FALSE, approximation = FALSE) 0.0101 0.920
According to the plot and ljung-box test, pvalue>0.05, so the residuals of the model cannot be distinguished from white noise. It is rational to accept ARIMA(2,0,3).
There would be an ARCH effect if the acf and pacf of residuals^2 are significant.
p1 <- arima.mod %>%
augment() %>%
select(.innov) %>%
ACF(.innov ^ 2) %>%
autoplot() +
labs(y = "resid^2 acf")
p2 <- arima.mod %>%
augment() %>%
select(.innov) %>%
PACF(.innov ^ 2) %>%
autoplot() +
labs(y = "resid^2 pacf")
gridExtra::grid.arrange(p1, p2)
Thus the ARCH effect on residuals is significant, it plausible to fit a GARCH model on residuals.
Try model: ARIMA(2,0,3)+GARCH(1,1)
spec <-
ugarchspec(
variance.model = list(
model = "sGARCH",
garchOrder = c(1, 1),
submodel = NULL,
external.regressors = NULL,
variance.targeting = FALSE
),
mean.model = list(armaOrder = c(2, 3),
include.mean = TRUE),
distribution.model = "sged"
)
fit <- ugarchfit(spec, HSI$Return,
solver = "hybrid")
Report the final model
print(fit)
##
## *---------------------------------*
## * GARCH Model Fit *
## *---------------------------------*
##
## Conditional Variance Dynamics
## -----------------------------------
## GARCH Model : sGARCH(1,1)
## Mean Model : ARFIMA(2,0,3)
## Distribution : sged
##
## Optimal Parameters
## ------------------------------------
## Estimate Std. Error t value Pr(>|t|)
## mu 0.025299 0.018994 1.33197 0.182871
## ar1 -1.312107 0.015853 -82.76469 0.000000
## ar2 -0.440748 0.020674 -21.31875 0.000000
## ma1 1.311944 0.002183 601.05156 0.000000
## ma2 0.429959 0.029283 14.68293 0.000000
## ma3 0.006806 0.011885 0.57264 0.566887
## omega 0.020438 0.007478 2.73295 0.006277
## alpha1 0.051404 0.009379 5.48086 0.000000
## beta1 0.933010 0.012840 72.66518 0.000000
## skew 0.916142 0.021129 43.36049 0.000000
## shape 1.383882 0.053582 25.82747 0.000000
##
## Robust Standard Errors:
## Estimate Std. Error t value Pr(>|t|)
## mu 0.025299 0.019449 1.30079 0.193331
## ar1 -1.312107 0.012415 -105.69088 0.000000
## ar2 -0.440748 0.016284 -27.06573 0.000000
## ma1 1.311944 0.002901 452.27417 0.000000
## ma2 0.429959 0.016539 25.99719 0.000000
## ma3 0.006806 0.006901 0.98623 0.324022
## omega 0.020438 0.008056 2.53687 0.011185
## alpha1 0.051404 0.011494 4.47233 0.000008
## beta1 0.933010 0.014748 63.26441 0.000000
## skew 0.916142 0.022258 41.16010 0.000000
## shape 1.383882 0.055471 24.94806 0.000000
##
## LogLikelihood : -4046.441
##
## Information Criteria
## ------------------------------------
##
## Akaike 2.9933
## Bayes 3.0173
## Shibata 2.9933
## Hannan-Quinn 3.0020
##
## Weighted Ljung-Box Test on Standardized Residuals
## ------------------------------------
## statistic p-value
## Lag[1] 3.696 0.05453
## Lag[2*(p+q)+(p+q)-1][14] 8.013 0.19343
## Lag[4*(p+q)+(p+q)-1][24] 11.670 0.59853
## d.o.f=5
## H0 : No serial correlation
##
## Weighted Ljung-Box Test on Standardized Squared Residuals
## ------------------------------------
## statistic p-value
## Lag[1] 2.893 0.08898
## Lag[2*(p+q)+(p+q)-1][5] 6.596 0.06492
## Lag[4*(p+q)+(p+q)-1][9] 8.511 0.10215
## d.o.f=2
##
## Weighted ARCH LM Tests
## ------------------------------------
## Statistic Shape Scale P-Value
## ARCH Lag[3] 1.928 0.500 2.000 0.1650
## ARCH Lag[5] 2.589 1.440 1.667 0.3551
## ARCH Lag[7] 3.664 2.315 1.543 0.3976
##
## Nyblom stability test
## ------------------------------------
## Joint Statistic: 1.5655
## Individual Statistics:
## mu 0.10650
## ar1 0.06487
## ar2 0.06350
## ma1 0.08800
## ma2 0.08856
## ma3 0.12493
## omega 0.09098
## alpha1 0.07925
## beta1 0.07443
## skew 0.18009
## shape 0.27474
##
## Asymptotic Critical Values (10% 5% 1%)
## Joint Statistic: 2.49 2.75 3.27
## Individual Statistic: 0.35 0.47 0.75
##
## Sign Bias Test
## ------------------------------------
## t-value prob sig
## Sign Bias 0.6568 0.51138
## Negative Sign Bias 0.2529 0.80038
## Positive Sign Bias 2.0783 0.03778 **
## Joint Effect 10.3007 0.01618 **
##
##
## Adjusted Pearson Goodness-of-Fit Test:
## ------------------------------------
## group statistic p-value(g-1)
## 1 20 20.06 0.3911
## 2 30 32.83 0.2845
## 3 40 38.35 0.4995
## 4 50 56.52 0.2146
##
##
## Elapsed time : 4.976874
Build a forecast function to predict return in 10 days.
fore <- function(input, p, q, n) {
spec <-
ugarchspec(
variance.model = list(
model = "sGARCH",
garchOrder = c(1, 1),
submodel = NULL,
external.regressors = NULL,
variance.targeting = FALSE
),
mean.model = list(armaOrder = c(p, q),
include.mean = TRUE),
distribution.model = "sged"
)
fit <- ugarchfit(spec, input$Return,
solver = "hybrid")
fore <- ugarchforecast(fit, n.ahead = n)
pred <- fore@forecast$seriesFor
pred <- as.data.table(pred)
setnames(pred, "Return")
pred[, time := seq.int(from = 1, to = n, by = 1)][]
}
pred <- fore(HSI, 2, 3, 10)
#View the predictions
pred
## Return time
## 1: 0.154925180 1
## 2: -0.109081450 2
## 3: 0.138608698 3
## 4: -0.064147019 4
## 5: 0.092721230 5
## 6: -0.023742514 6
## 7: 0.059931009 7
## 8: 0.001473559 8
## 9: 0.041297048 9
## 10: 0.014809375 10
Examine the predictions with testing set. The raw test data starts from 2021-01-29, the last day of the traing set.
test <- fread("test.csv")
test <- test[, .(Date, Close = as.numeric(Close))]
test <- test %>%
tsibble(index = Date) %>%
mutate(Return = difference(log(test$Close)) * 100)
test <- test %>%
na.omit() %>%
mutate(time = row_number()) %>%
update_tsibble(index = time)
# Preview the test data
head(test, 5)
## # A tsibble: 5 x 4 [1]
## Date Close Return time
## <date> <dbl> <dbl> <int>
## 1 2021-02-01 28893. 2.13 1
## 2 2021-02-02 29249. 1.22 2
## 3 2021-02-03 29307. 0.201 3
## 4 2021-02-04 29114. -0.664 4
## 5 2021-02-05 29289. 0.600 5
Plot forecast value and test value.
foretest <- function(pred, test) {
p <- pred %>%
tsibble(index = time) %>%
left_join(test, by = "time") %>%
mutate(Pred = Return.x,
Actual = Return.y) %>%
select(time, Date, Close, Pred, Actual) %>%
pivot_longer(
cols = c(Pred, Actual),
values_to = "Return",
names_to = "Type"
) %>%
ggplot(aes(x = Date, y = Return, color = Type)) +
geom_line() +
scale_x_date(date_minor_breaks = "1 day") +
ggsci::scale_color_aaas()+
labs(title = "10 Trading Days Forecast")
ggplotly(p)
}
foretest(pred, test)
Discrepancy could easily be observed, the actual curve seems 1 lag behind the prediction curve. However, the model is still capable of capturing some trend since the locations of inflection points are pretty close. Also, as the forecast period grows, the prediction curve gradually smooths.
(Perhaps the deviation is caused by efficient market, which maybe a good news for society. Just kidding!)
Of course the model has to be improved. Cross-validation would be useful. It might be fun to use bootstrap and function hilo() to draw the prediction interval.
From year 2020 to 2021, splice every 100 trading days as train data and make a one day forecast.
For instance,
train Day1~Day100, forecast Day101;
train Day2~Day101, forecast Day102;
train Day3~Day102, forecast Day103,etc.
CrossValidation <- HSI %>%
filter(year(Date) >= 2020)
buildARIMA <- function(input) {
arima.mod <- input %>%
model(ARIMA(Return, stepwise = FALSE, approximation = FALSE))
list(
model = arima.mod,
ljung_box = augment(arima.mod) %>% features(.innov, ljung_box)
)
}
buildARIMA(CrossValidation)
## $model
## # A mable: 1 x 1
## `ARIMA(Return, stepwise = FALSE, approximation = FALSE)`
## <model>
## 1 <ARIMA(3,0,1)>
##
## $ljung_box
## # A tibble: 1 x 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(Return, stepwise = FALSE, approximation = FALSE) 0.000959 0.975
According to the result, ARMA(3,1) is acceptable.
ARIMA(3,0,1)+GARCH(1,1)
CrossValidation <- CrossValidation %>%
tsibble(index = time) %>%
stretch_tsibble(.step = 1, .init = 100, .id = ".id") %>%
group_by_key() %>%
slice(n() - 99:0)
spec <-
ugarchspec(
variance.model = list(
model = "sGARCH",
garchOrder = c(1, 1),
submodel = NULL,
external.regressors = NULL,
variance.targeting = FALSE
),
mean.model = list(armaOrder = c(3, 1),
include.mean = TRUE),
distribution.model = "sged"
)
Crossforecast <- function(data) {
CrossFore <- vector()
for (i in 1:tail(data$.id)[1]) {
train <- data %>%
filter(.id == i)
fit <- ugarchfit(spec, train$Return,
solver = "hybrid")
CrossFore[i] <-
ugarchforecast(fit, n.ahead = 1)@forecast$seriesFor[1]
}
CrossFore
}
CrossPred<-Crossforecast(CrossValidation)
tsCV<-HSI %>%
filter(year(Date) >= 2020)%>%
mutate(time=row_number())%>%
filter(time>100)%>%
mutate(Pred=CrossPred[1:(length(CrossPred)-1)])
head(tsCV)
## # A tsibble: 6 x 5 [1]
## Date Close Return time Pred
## <date> <dbl> <dbl> <int> <dbl>
## 1 2020-05-29 22961. -0.743 101 -0.490
## 2 2020-06-01 23733. 3.30 102 0.0586
## 3 2020-06-02 23996. 1.10 103 -0.957
## 4 2020-06-03 24326. 1.36 104 -0.744
## 5 2020-06-04 24366. 0.167 105 -0.329
## 6 2020-06-05 24770. 1.64 106 -0.532
tsCV %>%
pivot_longer(
cols = c(Pred, Return),
values_to = "Return",
names_to = "Type"
) %>%
ggplot(aes(x = Date, y = Return, color = Type)) +
geom_line() +
scale_x_date(date_minor_breaks = "1 day") +
theme_tq() +
labs(title = "Cross Validation Test")
Weakness:
The fluctuation of the forecast curve is significantly weaker than the actual curve.
Advantage:
The trend of the prediction curve is informative. Also, it is harmless to determine whether return is positive or negative by forecast value after June 2020.
The model above may catch some trend on HSI, however, not accuracy enough for practical use. The bright point is that for some specific stock, the model may fit better.
Below is a better fitted ARIMA+GARCH model on China Shenhua Energy Company Limited (601088.SS).
To keep brief, code has been hided in this part since it is similar as the code presented above. All code for this report could be view on my Github([ctrl+click] to attach link)
To assure reproducibility, all code and data have been uploaded to Github([ctrl+click] to attach link)