Importing the data
Exploratory analysis
Split
Model creation
Comparison between models
Other methods

In this project we will try to fit different Time Series models to the most influential stock market index: S&P 500.

1. Importing the data

The first step of every analysis. In this case I downloaded the data from Yahoo! Finance, but there are other methods and libraries for this purpose.

tail(SP, n = 10)

##               Open    High     Low   Close Adj.Close     Volume
## 2018-07-11 2779.82 2785.91 2770.77 2774.02   2774.02 2964740000
## 2018-07-12 2783.14 2799.22 2781.53 2798.29   2798.29 2821690000
## 2018-07-13 2796.93 2804.53 2791.69 2801.31   2801.31 2614000000
## 2018-07-16 2797.36 2801.19 2793.39 2798.43   2798.43 2812230000
## 2018-07-17 2789.34 2814.19 2789.24 2809.55   2809.55 3050730000
## 2018-07-18 2811.35 2816.76 2805.89 2815.62   2815.62 3089780000
## 2018-07-19 2809.37 2812.05 2799.77 2804.49   2804.49 3266700000
## 2018-07-20 2804.55 2809.70 2800.01 2801.83   2801.83 3230210000
## 2018-07-23 2799.17 2808.61 2795.14 2806.98   2806.98 2907430000
## 2018-07-24 2820.68 2829.99 2811.12 2820.40   2820.40 3417530000

It is a good idea to remove the Adj.Close column. The name is similar to the Close column and this might prove to be a problem.

SP <- SP[,colnames(SP) != "Adj.Close"]

2. Exploratory analysis

2.1. Closing values

plot(SP$Close,main=paste("Closing prices of SP"))

2.2. Market evolution line chart

charts.PerformanceSummary(ROC(SP[, 1:4], n = 1, type = "discrete"), main = "S&P 500 stock evolution")

2.3. Stickplot

The best way to get a glimpse of our data is to create a stickchart. We will add to this a T indicator. This indicator sums up the large absolute returns, namely, returns above the target variation margin, fixed beforehand in a threshold of 2.5%. A high T would mean that there are opportunities to make profits given an expected value rise Conversely, a too low T value means that we should sell actions given an expected price decline.

We will use the following formula to forecast the behaviour of the stock value: \[T_{i} = \sum_v \{v \in\ V_{i}: v > p\% \bigvee_{v} < -p\%\}\]

Where \(V_{i}\) is the so-called arithmetic return, containing the values of close, high and low quote.Here the threshold is fixed in 0.25%, so any change above or below this threshold will change the T parameter, and it will give us a clue to buy or sell. P is fixed in 0.25% since this would cover any transaction cost.

T.ind <- function(quotes, tgt.margin = 0.025, n.days = 10) {
  v <- apply(HLC(quotes), 1, mean)
  r <- matrix(NA, ncol = n.days, nrow = NROW(quotes))
  for (x in 1:n.days) r[, x] <- Next(Delt(v, k = x), x)
  x <- apply(r, 1, function(x) sum(x[x > tgt.margin | x <
                                         + -tgt.margin]))
  if (is.xts(quotes))
    xts(x, time(quotes))
  else x
  }

chartSeries(SP, subset='last 1 years')

addBBands()

avgPrice <- function(p) apply(HLC(p), 1, mean)
addAvgPrice <- newTA(FUN = avgPrice, col = 1, legend = "AvgPrice")
addT.ind <- newTA(FUN = T.ind, col = "red", legend = "tgtRet")
addAvgPrice(on = 1, col = "blue")

addT.ind()

As we can observe in the plot, there are two differentiated periods: one with a smooth rise of prices, and another one of slight fall. The overall tend of these three months is clearly upward as we can see in the Bollinger brands.

3. Split

First up, we need to convert the data into a xts object. We can use the function as.xts to do so. This will create a xts object we can work with to do a time series analysis.

sp.xts <- xts(SP$Close)
sp.ts <- as.ts(sp.zoo <- as.zooreg(sp.xts))

There are too many NA values corresponding to the weekend days. We can use the na.locf() function to handle them.

sp.ts <- na.locf(sp.ts)
any(is.na(sp.ts))

## [1] FALSE

Now our database is clean. Next, let’s create the Hold-out 80% split.

index <- (3653*.2) + 14084

sp.train <- window(sp.ts, start = 14500, end = index)
sp.test <- window(sp.ts, start = index + 1)

4. Model creation

We can take a look at our data and see its variance and mean to get a glimpse.

apply(sp.train, MARGIN = 2, FUN = var)

##    Close 
## 2047.436

apply(sp.train, MARGIN = 2, FUN = sd)

##   Close 
## 45.2486

4.1. Simple Exponential Smoothing (SES model)

In this model, we give more importance to the last data and less to the past values. We achieve this applying an exponential smoothing (SES) where alpha is the degree of smoothness to apply.

\[l_{t}=\alpha y_{t}+(1???\alpha)l_{t???1}\] Let’s get started with an a = 0.2.

ses.sp <- ses(sp.train, alpha = .2, h = 100)
autoplot(ses.sp)

checkresiduals(ses.sp)

## 
##  Ljung-Box test
## 
## data:  Residuals from Simple exponential smoothing
## Q* = 412.05, df = 8, p-value < 2.2e-16
## 
## Model df: 2.   Total lags used: 10

The data doesn’t show a Random Walk pattern as we can check in the ACF plot. We need to transform the data until we see a White Noise pattern in the data in order to forecast it properly.

sp.dif <- diff(sp.train)
ses.sp.dif <- ses(sp.dif, alpha = .2, h = 100)
autoplot(ses.sp.dif)

Here we have the plots for the forecasting and the difference values with the confidence intervals at 80 and 95. Now let’s calculate the accuracy of the model.

sp.dif.test <- diff(sp.test)
accuracy(ses.sp.dif, sp.dif.test)

##                       ME      RMSE      MAE  MPE MAPE      MASE
## Training set  0.01107545 11.664380 7.816932  NaN  Inf 0.7196407
## Test set     -3.29823421  9.350432 6.986933 -Inf  Inf 0.6432295
##                     ACF1 Theil's U
## Training set -0.10818959        NA
## Test set     -0.04800174       NaN

Not bad. But at this point we must ask ourselves if we can improve our model even more. We can create a for loop with all the range available for the alpha smoothing and look for the optimal value.

alpha <- seq(.01, .99, by = .01)
RMSE <- NA
for(i in seq_along(alpha)) {
  fit <- ses(sp.dif, alpha = alpha[i], h = 100)
  RMSE[i] <- accuracy(fit, sp.dif.test)[2,2]
}

alpha.fit <- data_frame(alpha, RMSE)
(alpha.min <- filter(alpha.fit, RMSE == min(RMSE)))

## # A tibble: 1 x 2
##   alpha  RMSE
##   <dbl> <dbl>
## 1  0.93  8.75

require(ggplot2)
ggplot(alpha.fit, aes(alpha, RMSE)) +
  geom_line() +
  geom_point(data = alpha.min, aes(alpha, RMSE), size = 2, color = "blue")

Since the past data is not relevant for forecasting, we should rewrite the model with an alpha = .9.

ses.sp2 <- ses(sp.train, alpha = .93, h = 100)
autoplot(ses.sp2)

checkresiduals(ses.sp2)

## 
##  Ljung-Box test
## 
## data:  Residuals from Simple exponential smoothing
## Q* = 10.947, df = 8, p-value = 0.2047
## 
## Model df: 2.   Total lags used: 10

sp.dif2 <- diff(sp.train)
ses.sp.dif2 <- ses(sp.dif2, alpha = .93, h = 100)
autoplot(ses.sp.dif2)

sp.dif.test2 <- diff(sp.test)
accuracy(ses.sp.dif2, sp.dif.test2)

##                       ME     RMSE       MAE  MPE MAPE      MASE
## Training set -0.01968657 15.14350 10.532256  NaN  Inf 0.9696183
## Test set      0.01702358  8.74943  5.473453 -Inf  Inf 0.5038958
##                     ACF1 Theil's U
## Training set -0.50312380        NA
## Test set     -0.04800174       NaN

Although the RMSE of the training test has increased, we managed to decrease the RMSE of the test set, so now our model can forecast better than before.

4.2. Holt’s model

The SES model usually doesn’t work well when the data shows a long term trend. This method uses two smoothing techniques instead of just the alpha one. Let’s see if we can forecast better.

holt.sp <- holt(sp.train, h = 200)
autoplot(holt.sp)

accuracy(holt.sp, sp.test)

##                       ME     RMSE      MAE         MPE      MAPE      MASE
## Training set -0.06540846 11.02196  6.79317 -0.01075023 0.6171706  1.013467
## Test set     64.74676072 89.57583 74.07009  5.15025526 6.0223041 11.050454
##                     ACF1 Theil's U
## Training set 0.002795245        NA
## Test set     0.981063115  10.09028

As before, we have to check if we can use a better beta smoothing value, again, with a for loop.

beta <- seq(.0001, .5, by = .001)
RMSE <- NA
for(i in seq_along(beta)) {
  fit <- holt(sp.train, beta = beta[i], h = 100)
  RMSE[i] <- accuracy(fit, sp.test)[2,2]
}

beta.fit <- data_frame(beta, RMSE)
(beta.min <- filter(beta.fit, RMSE == min(RMSE)))

## # A tibble: 1 x 2
##     beta  RMSE
##    <dbl> <dbl>
## 1 0.0331  29.0

ggplot(beta.fit, aes(beta, RMSE)) +
  geom_line() +
  geom_point(data = beta.min, aes(beta, RMSE), size = 2, color = "blue")

We can define a better beta parameter. We will rewrite the model with the value suggested by the for loop.

holt.sp.opt <- holt(sp.train, h = 200, beta = 0.0331)
autoplot(holt.sp.opt)

accuracy(holt.sp.opt, sp.test)

##                       ME     RMSE       MAE          MPE      MAPE
## Training set -0.01933033 11.11156  6.986682 -0.004071152 0.6344377
## Test set     22.00431747 45.34574 37.462480  1.658662567 3.0922154
##                  MASE       ACF1 Theil's U
## Training set 1.042337 -0.0107087        NA
## Test set     5.588996  0.9706806   5.22302

The model has improved, but it’s quite overfitted (the RMSE is way bigger in the train sample than the test sample) and has a RMSE too high compared to the SES model.

4.3 Naive model

This model is equivalent to ARIMA(0, 1, 0).

naive.sp <- naive(sp.train, h = 2)
autoplot(naive.sp)

checkresiduals(naive.sp)

## 
##  Ljung-Box test
## 
## data:  Residuals from Naive method
## Q* = 10.12, df = 10, p-value = 0.4301
## 
## Model df: 0.   Total lags used: 10

accuracy(naive.sp, sp.test)

##                    ME     RMSE      MAE        MPE      MAPE     MASE
## Training set  0.19086 11.00468  6.70290 0.01278136 0.6085691 1.000000
## Test set     12.34998 12.34998 12.34998 1.10761122 1.1076112 1.842482
##                     ACF1
## Training set -0.01899267
## Test set              NA

The model performed quite well, but it can be improved. In order to select the best value for h let’s create a for loop.

sq <- function(u){u^2}
for(h in 1:10)
{
  sp.train %>% tsCV(forecastfunction=naive, h = h) %>% sq() %>% mean(na.rm=TRUE) %>% print()
}

## [1] 121.3504
## [1] 179.9522
## [1] 243.2503
## [1] 301.3722
## [1] 354.2179
## [1] 404.8024
## [1] 454.6297
## [1] 505.8894
## [1] 556.2989
## [1] 606.938

The white noise pattern is statiscally significant and the model is

4.4. ARIMA

On of the most used Time Series analysis models. We will use the auto.arima() function to find the values automatically. Also we can use the BoxCox.lambda() function to look for the best value for the smoothing parameter \(\lambda\).

BoxCox.lambda(sp.train)

## [1] 1.999924

arima.sp <- auto.arima(sp.train)
arima.sp %>% forecast(h = 10) %>% autoplot()

summary(arima.sp)

## Series: sp.train 
## ARIMA(0,1,0) 
## 
## sigma^2 estimated as 121.1:  log likelihood=-1198.62
## AIC=2399.24   AICc=2399.25   BIC=2402.99
## 
## Training set error measures:
##                     ME     RMSE      MAE        MPE      MAPE      MASE
## Training set 0.1935644 10.98736 6.684931 0.01305825 0.6069546 0.9973193
##                     ACF1
## Training set -0.01884848

arima.fore <- forecast(arima.sp, h = 10)
accuracy(arima.fore, sp.test)

##                     ME     RMSE      MAE        MPE      MAPE      MASE
## Training set 0.1935644 10.98736 6.684931 0.01305825 0.6069546 0.9973193
## Test set     7.0766198 11.30033 8.034437 0.63144903 0.7183983 1.1986509
##                     ACF1 Theil's U
## Training set -0.01884848        NA
## Test set      0.39699432  1.185804

Now let’s set up forecast functions for ARIMA model and to do a Cross Validation to see its performance with future data.

farima <- function(x, h) {
  forecast(auto.arima(x), h = h)
}
e <- sp.train %>% tsCV(farima, h = 1) 
mean(e^2, na.rm = TRUE)

## [1] 123.4365

5. Comparison between models

To see the whole picture of our analysis, now we will display all the accuracies of each model we have created so far.

accuracy(ses.sp.dif2, sp.dif.test2)

##                       ME     RMSE       MAE  MPE MAPE      MASE
## Training set -0.01968657 15.14350 10.532256  NaN  Inf 0.9696183
## Test set      0.01702358  8.74943  5.473453 -Inf  Inf 0.5038958
##                     ACF1 Theil's U
## Training set -0.50312380        NA
## Test set     -0.04800174       NaN

accuracy(holt.sp.opt, sp.test)

##                       ME     RMSE       MAE          MPE      MAPE
## Training set -0.01933033 11.11156  6.986682 -0.004071152 0.6344377
## Test set     22.00431747 45.34574 37.462480  1.658662567 3.0922154
##                  MASE       ACF1 Theil's U
## Training set 1.042337 -0.0107087        NA
## Test set     5.588996  0.9706806   5.22302

accuracy(naive.sp, sp.test)

##                    ME     RMSE      MAE        MPE      MAPE     MASE
## Training set  0.19086 11.00468  6.70290 0.01278136 0.6085691 1.000000
## Test set     12.34998 12.34998 12.34998 1.10761122 1.1076112 1.842482
##                     ACF1
## Training set -0.01899267
## Test set              NA

accuracy(arima.fore, sp.test)

##                     ME     RMSE      MAE        MPE      MAPE      MASE
## Training set 0.1935644 10.98736 6.684931 0.01305825 0.6069546 0.9973193
## Test set     7.0766198 11.30033 8.034437 0.63144903 0.7183983 1.1986509
##                     ACF1 Theil's U
## Training set -0.01884848        NA
## Test set      0.39699432  1.185804

The best model was the SES one given its low RMSE in the test sample.

6. Other methods

There is a lot of different methods and models we can try to fit. Just to mention a few, we could fild an ETS model, Dinamic Regression, Harmonic Regression or the TBATS model. Since this project is already too long we will try to create them in a future project.

Time Series Analysis (S&P 500 index)

David Valls Lanàquera

15/8/2018