In this project we will try to fit different Time Series models to the most influential stock market index: S&P 500.
The first step of every analysis. In this case I downloaded the data from Yahoo! Finance, but there are other methods and libraries for this purpose.
tail(SP, n = 10)
## Open High Low Close Adj.Close Volume
## 2018-07-11 2779.82 2785.91 2770.77 2774.02 2774.02 2964740000
## 2018-07-12 2783.14 2799.22 2781.53 2798.29 2798.29 2821690000
## 2018-07-13 2796.93 2804.53 2791.69 2801.31 2801.31 2614000000
## 2018-07-16 2797.36 2801.19 2793.39 2798.43 2798.43 2812230000
## 2018-07-17 2789.34 2814.19 2789.24 2809.55 2809.55 3050730000
## 2018-07-18 2811.35 2816.76 2805.89 2815.62 2815.62 3089780000
## 2018-07-19 2809.37 2812.05 2799.77 2804.49 2804.49 3266700000
## 2018-07-20 2804.55 2809.70 2800.01 2801.83 2801.83 3230210000
## 2018-07-23 2799.17 2808.61 2795.14 2806.98 2806.98 2907430000
## 2018-07-24 2820.68 2829.99 2811.12 2820.40 2820.40 3417530000
It is a good idea to remove the Adj.Close column. The name is similar to the Close column and this might prove to be a problem.
SP <- SP[,colnames(SP) != "Adj.Close"]
charts.PerformanceSummary(ROC(SP[, 1:4], n = 1, type = "discrete"), main = "S&P 500 stock evolution")
The best way to get a glimpse of our data is to create a stickchart. We will add to this a T indicator. This indicator sums up the large absolute returns, namely, returns above the target variation margin, fixed beforehand in a threshold of 2.5%. A high T would mean that there are opportunities to make profits given an expected value rise Conversely, a too low T value means that we should sell actions given an expected price decline.
We will use the following formula to forecast the behaviour of the stock value: \[T_{i} = \sum_v \{v \in\ V_{i}: v > p\% \bigvee_{v} < -p\%\}\]
Where \(V_{i}\) is the so-called arithmetic return, containing the values of close, high and low quote.Here the threshold is fixed in 0.25%, so any change above or below this threshold will change the T parameter, and it will give us a clue to buy or sell. P is fixed in 0.25% since this would cover any transaction cost.
T.ind <- function(quotes, tgt.margin = 0.025, n.days = 10) {
v <- apply(HLC(quotes), 1, mean)
r <- matrix(NA, ncol = n.days, nrow = NROW(quotes))
for (x in 1:n.days) r[, x] <- Next(Delt(v, k = x), x)
x <- apply(r, 1, function(x) sum(x[x > tgt.margin | x <
+ -tgt.margin]))
if (is.xts(quotes))
xts(x, time(quotes))
else x
}
chartSeries(SP, subset='last 1 years')
addBBands()
avgPrice <- function(p) apply(HLC(p), 1, mean)
addAvgPrice <- newTA(FUN = avgPrice, col = 1, legend = "AvgPrice")
addT.ind <- newTA(FUN = T.ind, col = "red", legend = "tgtRet")
addAvgPrice(on = 1, col = "blue")
addT.ind()
As we can observe in the plot, there are two differentiated periods: one with a smooth rise of prices, and another one of slight fall. The overall tend of these three months is clearly upward as we can see in the Bollinger brands.
First up, we need to convert the data into a xts object. We can use the function as.xts to do so. This will create a xts object we can work with to do a time series analysis.
sp.xts <- xts(SP$Close)
sp.ts <- as.ts(sp.zoo <- as.zooreg(sp.xts))
There are too many NA values corresponding to the weekend days. We can use the na.locf() function to handle them.
sp.ts <- na.locf(sp.ts)
any(is.na(sp.ts))
## [1] FALSE
Now our database is clean. Next, let’s create the Hold-out 80% split.
index <- (3653*.2) + 14084
sp.train <- window(sp.ts, start = 14500, end = index)
sp.test <- window(sp.ts, start = index + 1)
We can take a look at our data and see its variance and mean to get a glimpse.
apply(sp.train, MARGIN = 2, FUN = var)
## Close
## 2047.436
apply(sp.train, MARGIN = 2, FUN = sd)
## Close
## 45.2486
In this model, we give more importance to the last data and less to the past values. We achieve this applying an exponential smoothing (SES) where alpha is the degree of smoothness to apply.
\[l_{t}=\alpha y_{t}+(1???\alpha)l_{t???1}\] Let’s get started with an a = 0.2.
ses.sp <- ses(sp.train, alpha = .2, h = 100)
autoplot(ses.sp)
checkresiduals(ses.sp)
##
## Ljung-Box test
##
## data: Residuals from Simple exponential smoothing
## Q* = 412.05, df = 8, p-value < 2.2e-16
##
## Model df: 2. Total lags used: 10
The data doesn’t show a Random Walk pattern as we can check in the ACF plot. We need to transform the data until we see a White Noise pattern in the data in order to forecast it properly.
sp.dif <- diff(sp.train)
ses.sp.dif <- ses(sp.dif, alpha = .2, h = 100)
autoplot(ses.sp.dif)
Here we have the plots for the forecasting and the difference values with the confidence intervals at 80 and 95. Now let’s calculate the accuracy of the model.
sp.dif.test <- diff(sp.test)
accuracy(ses.sp.dif, sp.dif.test)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.01107545 11.664380 7.816932 NaN Inf 0.7196407
## Test set -3.29823421 9.350432 6.986933 -Inf Inf 0.6432295
## ACF1 Theil's U
## Training set -0.10818959 NA
## Test set -0.04800174 NaN
Not bad. But at this point we must ask ourselves if we can improve our model even more. We can create a for loop with all the range available for the alpha smoothing and look for the optimal value.
alpha <- seq(.01, .99, by = .01)
RMSE <- NA
for(i in seq_along(alpha)) {
fit <- ses(sp.dif, alpha = alpha[i], h = 100)
RMSE[i] <- accuracy(fit, sp.dif.test)[2,2]
}
alpha.fit <- data_frame(alpha, RMSE)
(alpha.min <- filter(alpha.fit, RMSE == min(RMSE)))
## # A tibble: 1 x 2
## alpha RMSE
## <dbl> <dbl>
## 1 0.93 8.75
require(ggplot2)
ggplot(alpha.fit, aes(alpha, RMSE)) +
geom_line() +
geom_point(data = alpha.min, aes(alpha, RMSE), size = 2, color = "blue")
Since the past data is not relevant for forecasting, we should rewrite the model with an alpha = .9.
ses.sp2 <- ses(sp.train, alpha = .93, h = 100)
autoplot(ses.sp2)
checkresiduals(ses.sp2)
##
## Ljung-Box test
##
## data: Residuals from Simple exponential smoothing
## Q* = 10.947, df = 8, p-value = 0.2047
##
## Model df: 2. Total lags used: 10
sp.dif2 <- diff(sp.train)
ses.sp.dif2 <- ses(sp.dif2, alpha = .93, h = 100)
autoplot(ses.sp.dif2)
sp.dif.test2 <- diff(sp.test)
accuracy(ses.sp.dif2, sp.dif.test2)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.01968657 15.14350 10.532256 NaN Inf 0.9696183
## Test set 0.01702358 8.74943 5.473453 -Inf Inf 0.5038958
## ACF1 Theil's U
## Training set -0.50312380 NA
## Test set -0.04800174 NaN
Although the RMSE of the training test has increased, we managed to decrease the RMSE of the test set, so now our model can forecast better than before.
The SES model usually doesn’t work well when the data shows a long term trend. This method uses two smoothing techniques instead of just the alpha one. Let’s see if we can forecast better.
holt.sp <- holt(sp.train, h = 200)
autoplot(holt.sp)
accuracy(holt.sp, sp.test)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.06540846 11.02196 6.79317 -0.01075023 0.6171706 1.013467
## Test set 64.74676072 89.57583 74.07009 5.15025526 6.0223041 11.050454
## ACF1 Theil's U
## Training set 0.002795245 NA
## Test set 0.981063115 10.09028
As before, we have to check if we can use a better beta smoothing value, again, with a for loop.
beta <- seq(.0001, .5, by = .001)
RMSE <- NA
for(i in seq_along(beta)) {
fit <- holt(sp.train, beta = beta[i], h = 100)
RMSE[i] <- accuracy(fit, sp.test)[2,2]
}
beta.fit <- data_frame(beta, RMSE)
(beta.min <- filter(beta.fit, RMSE == min(RMSE)))
## # A tibble: 1 x 2
## beta RMSE
## <dbl> <dbl>
## 1 0.0331 29.0
ggplot(beta.fit, aes(beta, RMSE)) +
geom_line() +
geom_point(data = beta.min, aes(beta, RMSE), size = 2, color = "blue")
We can define a better beta parameter. We will rewrite the model with the value suggested by the for loop.
holt.sp.opt <- holt(sp.train, h = 200, beta = 0.0331)
autoplot(holt.sp.opt)
accuracy(holt.sp.opt, sp.test)
## ME RMSE MAE MPE MAPE
## Training set -0.01933033 11.11156 6.986682 -0.004071152 0.6344377
## Test set 22.00431747 45.34574 37.462480 1.658662567 3.0922154
## MASE ACF1 Theil's U
## Training set 1.042337 -0.0107087 NA
## Test set 5.588996 0.9706806 5.22302
The model has improved, but it’s quite overfitted (the RMSE is way bigger in the train sample than the test sample) and has a RMSE too high compared to the SES model.
This model is equivalent to ARIMA(0, 1, 0).
naive.sp <- naive(sp.train, h = 2)
autoplot(naive.sp)
checkresiduals(naive.sp)
##
## Ljung-Box test
##
## data: Residuals from Naive method
## Q* = 10.12, df = 10, p-value = 0.4301
##
## Model df: 0. Total lags used: 10
accuracy(naive.sp, sp.test)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.19086 11.00468 6.70290 0.01278136 0.6085691 1.000000
## Test set 12.34998 12.34998 12.34998 1.10761122 1.1076112 1.842482
## ACF1
## Training set -0.01899267
## Test set NA
The model performed quite well, but it can be improved. In order to select the best value for h let’s create a for loop.
sq <- function(u){u^2}
for(h in 1:10)
{
sp.train %>% tsCV(forecastfunction=naive, h = h) %>% sq() %>% mean(na.rm=TRUE) %>% print()
}
## [1] 121.3504
## [1] 179.9522
## [1] 243.2503
## [1] 301.3722
## [1] 354.2179
## [1] 404.8024
## [1] 454.6297
## [1] 505.8894
## [1] 556.2989
## [1] 606.938
The white noise pattern is statiscally significant and the model is
On of the most used Time Series analysis models. We will use the auto.arima() function to find the values automatically. Also we can use the BoxCox.lambda() function to look for the best value for the smoothing parameter \(\lambda\).
BoxCox.lambda(sp.train)
## [1] 1.999924
arima.sp <- auto.arima(sp.train)
arima.sp %>% forecast(h = 10) %>% autoplot()
summary(arima.sp)
## Series: sp.train
## ARIMA(0,1,0)
##
## sigma^2 estimated as 121.1: log likelihood=-1198.62
## AIC=2399.24 AICc=2399.25 BIC=2402.99
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1935644 10.98736 6.684931 0.01305825 0.6069546 0.9973193
## ACF1
## Training set -0.01884848
arima.fore <- forecast(arima.sp, h = 10)
accuracy(arima.fore, sp.test)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1935644 10.98736 6.684931 0.01305825 0.6069546 0.9973193
## Test set 7.0766198 11.30033 8.034437 0.63144903 0.7183983 1.1986509
## ACF1 Theil's U
## Training set -0.01884848 NA
## Test set 0.39699432 1.185804
Now let’s set up forecast functions for ARIMA model and to do a Cross Validation to see its performance with future data.
farima <- function(x, h) {
forecast(auto.arima(x), h = h)
}
e <- sp.train %>% tsCV(farima, h = 1)
mean(e^2, na.rm = TRUE)
## [1] 123.4365
To see the whole picture of our analysis, now we will display all the accuracies of each model we have created so far.
accuracy(ses.sp.dif2, sp.dif.test2)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.01968657 15.14350 10.532256 NaN Inf 0.9696183
## Test set 0.01702358 8.74943 5.473453 -Inf Inf 0.5038958
## ACF1 Theil's U
## Training set -0.50312380 NA
## Test set -0.04800174 NaN
accuracy(holt.sp.opt, sp.test)
## ME RMSE MAE MPE MAPE
## Training set -0.01933033 11.11156 6.986682 -0.004071152 0.6344377
## Test set 22.00431747 45.34574 37.462480 1.658662567 3.0922154
## MASE ACF1 Theil's U
## Training set 1.042337 -0.0107087 NA
## Test set 5.588996 0.9706806 5.22302
accuracy(naive.sp, sp.test)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.19086 11.00468 6.70290 0.01278136 0.6085691 1.000000
## Test set 12.34998 12.34998 12.34998 1.10761122 1.1076112 1.842482
## ACF1
## Training set -0.01899267
## Test set NA
accuracy(arima.fore, sp.test)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1935644 10.98736 6.684931 0.01305825 0.6069546 0.9973193
## Test set 7.0766198 11.30033 8.034437 0.63144903 0.7183983 1.1986509
## ACF1 Theil's U
## Training set -0.01884848 NA
## Test set 0.39699432 1.185804
The best model was the SES one given its low RMSE in the test sample.