Time series forecasting with deep learning

Sigrid Keydana, Trivadis GmbH
2017-05-23
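
The code chunks on these slides assume that a set of packages is attached; the exact list is not shown on the slides, so the following setup is a guess based on the functions used:

# packages assumed attached throughout (a sketch of the setup, not shown on the slides)
library(readr)      # read_csv, read_delim
library(dplyr)      # data_frame, mutate, filter, between, ...
library(tidyr)      # gather
library(lubridate)  # wday
library(ggplot2)    # plotting
library(ggfortify)  # autoplot methods for ts / acf objects
library(gridExtra)  # grid.arrange
library(forecast)   # auto.arima, Arima, msts, tbats, forecast, ...
library(kerasR)     # Sequential, LSTM, Dense, keras_compile, keras_fit, keras_predict, ...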

 

 

Time series forecasting: the classical approach

A time series

 

cola_df <- read_csv("monthly-sales-of-tasty-cola.csv", col_names = c("month", "sales"), skip = 1, 
                    col_types = cols(month = col_date("%y-%m")))
ggplot(cola_df, aes(x = month, y = sales)) + geom_line() + ggtitle("Monthly sales of Tasty Cola")


Another one ...

 

traffic_df <- read_csv("internet-traffic-data-in-bits-fr.csv", col_names = c("hour", "bits"), skip = 1)
ggplot(traffic_df, aes(x = hour, y = bits)) + geom_line() + ggtitle("Internet traffic")


And another.

 

win_df <- read_csv("winning-times-for-the-mens-400-m.csv", col_names = c("year", "seconds"), skip = 1)
ggplot(win_df, aes(x = year, y = seconds)) + geom_line() + ggtitle("Men's 400m winning times")


Sometimes we may look at several time series together.

 

deaths_df <- read_csv("deaths-from-homicides-and-suicid.csv", col_names = c("year", "homicide", "suicide"), skip = 1)
deaths_df <- gather(deaths_df, key = 'type', value = 'deaths', homicide:suicide)
ggplot(deaths_df, aes(x = year, y = deaths, color = type)) + geom_line() + scale_colour_manual(values=c("green","blue")) + ggtitle("Australia: Homicides and suicides")


Sometimes there's more to the picture than we might think.

 

So far, this is nothing but a univariate time series of the lynx population.

data("lynx")
autoplot(lynx) + ggtitle("Lynx population over time")


However ...

 

Source: Rudolfo’s Usenet Animal Pictures Gallery (link no longer exists)

Lynx and hare

lynx_df <- read_delim("lynxhare.csv", delim = ";") %>%
  select(year, hare, lynx) %>%
  filter(between(year, 1890, 1945)) %>%
  mutate(hare = scale(hare), lynx = scale(lynx))
lynx_df <- gather(lynx_df, key = 'species', value = 'number', hare:lynx)
ggplot(lynx_df, aes(x = year, y = number, color = species)) + geom_line() + scale_colour_manual(values=c("green","red")) + ggtitle("Lynx and hare populations over time")


Concepts in classical time series modeling

 

  • Stationarity
  • Decomposition
  • Autocorrelation

Wait. This will be about deep learning

 

… why would the classical approach even matter?

Stationarity (1)

 

  • We want to forecast future values of a time series
  • We need fundamental statistical properties like mean, variance …
  • What is the mean, or the variance, of a time series?

Stationarity (2)

 

  • If a time series \( y_{t} \) is stationary, then for all \( s \), the distribution of \( (y_{t}, \dots, y_{t+s}) \) does not depend on \( t \).
  • After removing any trend and seasonality effects, we may treat the residual series as stationary in the mean: \( \mu(t) = \mu \), a constant that does not depend on \( t \); ergodicity then lets us estimate that mean from a single realization (a quick empirical check is sketched below).
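
A quick, informal way to check whether (and how often) a series needs differencing to become stationary; a sketch, assuming the forecast package is attached, with the built-in AirPassengers series as a stand-in example:

# ndiffs() estimates how many (first) differences are needed to reach stationarity
ndiffs(AirPassengers)        # strongly trending series: suggests at least one difference
ndiffs(diff(AirPassengers))  # after first differencing, usually no further difference is needed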

Differencing

  • The trend is usually removed using differencing (forming the differences of neighboring values)
set.seed(7777)
trend <- 1:100 + rnorm(100, sd = 5)
diff <- diff(trend)
df <- data_frame(time_id = 1:100,
                 trend = trend,
                 diff = c(NA, diff))
df <- df %>% gather(key = 'trend_diff', value = 'value', -time_id)
ggplot(df, aes(x = time_id, y = value, color = trend_diff)) + geom_line()


Time series decomposition

 

autoplot(stl(ts(cola_df$sales, frequency=12), s.window = 12))


Autocorrelation - Case 1: White noise

If consecutive values were not related, there'd be no way of forecasting future values

s1 <- ts(rnorm(100))
ts1 <- autoplot(s1)
acf1 <- ggfortify:::autoplot.acf(acf(s1, plot = FALSE), conf.int.fill = '#00cccc', conf.int.value = 0.95)
do.call('grid.arrange', list('grobs' = list(ts1, acf1), 'ncol' = 2, top = "White noise"))


Autocorrelation - Case 2: Linear trend

 

s2 <- ts(1:100 + rnorm(100, 2, 4))
ts2 <- autoplot(s2)
acf2 <- ggfortify:::autoplot.acf(acf(s2, plot = FALSE), conf.int.fill = '#00cccc', conf.int.value = 0.95)
do.call('grid.arrange', list('grobs' = list(ts2, acf2), 'ncol' = 2, top = "Series with a trend"))


Autocorrelation - Case 3: Seasonality

 

s3 <- ts(rep(1:5,20) + rnorm(100, sd= 0.5))
ts3 <- autoplot(s3)
acf3 <- ggfortify:::autoplot.acf(acf(s3, plot = FALSE), conf.int.fill = '#00cccc', conf.int.value = 0.95)
do.call('grid.arrange', list('grobs' = list(ts3, acf3), 'ncol' = 2, top = "Series with seasonality"))


 

 

Forecasting internet traffic, the classical way

Here's the traffic time series again.

 

ggplot(traffic_df, aes(x = hour, y = bits)) + geom_line() + ggtitle("Internet traffic")


Let's first look at decomposition.

 

traffic_ts <- msts(traffic_df$bits,seasonal.periods = c(24, 24*7))
autoplot(stl(traffic_ts, s.window = 7 * 24))


How about autocorrelation?

 

ggfortify:::autoplot.acf(acf(traffic_ts, plot = FALSE), conf.int.fill = '#00cccc', conf.int.value = 0.95)


The usual ARIMA won't work...

 

arima_fit <- auto.arima(traffic_ts, stepwise = FALSE, max.order = 10, trace = TRUE)

 ARIMA(0,0,0) with zero mean     : 60774.59
 ARIMA(0,0,0) with non-zero mean : Inf *
 ARIMA(0,0,0)(0,0,1)[168] with zero mean     : Inf
 ARIMA(0,0,0)(0,0,1)[168] with non-zero mean : Inf *
Error in myarima(x, order = c(i, d, j), seasonal = c(I, D, J), constant = (K == : root finding code failed

... will regression with ARIMA errors work?

 

Let's add an indicator variable for whether it's weekend.

traffic_df_wd <- traffic_df %>% mutate(weekend = if_else(wday(hour) %in% c(7,1), 1, 0))
ggplot(traffic_df_wd, aes(x=hour, y=bits, color=factor(weekend))) + geom_point()


No.

 

This will run forever and you'll have to kill it.

# arima_fit <- auto.arima(ts(traffic_df_wd$bits, frequency = 24 * 7), xreg = traffic_df_wd$weekend, 
#                            stepwise = FALSE, max.order = 10, trace = TRUE)

Trying TBATS

TBATS (“Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components”) does not fail…

But, look at the forecast.

tbats_fit <- tbats(traffic_ts)
plot(forecast(tbats_fit, h=14*24))


Will deep learning do any better for this time series?

 

  • Let's step back a little though.
  • How does deep learning even do time series?

Enter: LSTM (Long Short-Term Memory)

New world, new rules?

 

  • Do we still care about stationarity and decomposition?
  • How does DL handle trends, or seasonality?

Let's compare ARIMA vs. LSTM on a little set of benchmarks

 

  • synthetic dataset, with trend only, test data out-of-range
  • synthetic dataset, with trend only, test data in-range
  • synthetic dataset, seasonal only

 

 

ARIMA vs. LSTM, Round 1: Trend-only dataset, test data out-of-range

Trend-only dataset

 

set.seed(7777)
trend_train <- 11:110 + rnorm(100, sd = 2)
trend_test <- 111:130 + rnorm(20, sd =2)
df <- data_frame(time_id = 1:120,
                 train = c(trend_train, rep(NA, length(trend_test))),
                 test = c(rep(NA, length(trend_train)), trend_test))
df <- df %>% gather(key = 'train_test', value = 'value', -time_id)
ggplot(df, aes(x = time_id, y = value, color = train_test)) + geom_line()


Trend-only dataset: Enter: ARIMA (1)

 

set.seed(7777)
trend_train <- 11:110 + rnorm(100, sd = 2)
trend_test <- 111:130 + rnorm(20, sd =2)
h <- 1
n <- length(trend_test) - h + 1
fit <- auto.arima(trend_train)

# re-estimate the model as new data arrives, as per https://robjhyndman.com/hyndsight/rolling-forecasts/ 
order <- arimaorder(fit)
predictions <- matrix(0, nrow=n, ncol=h)
lower <- matrix(0, nrow=n, ncol=h) # 95% prediction interval
upper <- matrix(0, nrow=n, ncol=h)
for(i in 1:n) {  
  x <- c(trend_train[(1+i):length(trend_train)], trend_test[1:i])
  refit <- Arima(x, order=order[1:3], seasonal=order[4:6])
  predictions[i,] <- forecast(refit, h=h)$mean
  lower[i,] <- unclass(forecast(refit, h=h)$lower)[,2]
  upper[i,] <- unclass(forecast(refit, h=h)$upper)[,2]
}

(test_rsme <- sqrt(sum((trend_test - predictions)^2))) 
[1] 5.31686

Trend-only dataset: Enter: ARIMA (2)

 

df <- data_frame(time_id = 1:120,
  train = c(trend_train, rep(NA, length(trend_test))),
  test = c(rep(NA, length(trend_train)), trend_test),
  fitted = c(fit$fitted, rep(NA, length(trend_test))),
  preds = c(rep(NA, length(trend_train)), predictions),
  lower = c(rep(NA, length(trend_train)), lower),
  upper = c(rep(NA, length(trend_train)), upper))
df <- df %>% gather(key = 'type', value = 'value', train:preds)
ggplot(df, aes(x = time_id, y = value)) + geom_line(aes(color = type)) + geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.1)


Trend-only dataset: Enter: LSTM

 

  • Let's first show what we have to do for time series prediction with LSTM networks.

  • We'll choose the Keras framework, and the R bindings provided by kerasR.

Background: Data preparation for LSTM in Keras (1)

Firstly, LSTMs work with a sliding window of input, so we need to provide the data (train and test) in “window form”:

# example given for the training set - do the same for the test set
# length(trend_train)
lstm_num_timesteps <- 5
X_train <- t(sapply(1:(length(trend_train) - lstm_num_timesteps), function(x) trend_train[x:(x + lstm_num_timesteps - 1)]))
# dim(X_train)
X_train[1:5, ]
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,]  7.24409 12.65291 12.39629 16.36328 14.67253
[2,] 12.65291 12.39629 16.36328 14.67253 18.59298
[3,] 12.39629 16.36328 14.67253 18.59298 19.70279
[4,] 16.36328 14.67253 18.59298 19.70279 17.41485
[5,] 14.67253 18.59298 19.70279 17.41485 17.67259
y_train <- sapply((lstm_num_timesteps + 1):(length(trend_train)), function(x) trend_train[x])
y_train[1:5]
[1] 18.59298 19.70279 17.41485 17.67259 23.13023

Background: Data preparation for LSTM in Keras (2)

Keras LSTMs expect the input array to be shaped as (no. samples, no. time steps, no. features)

# example given for the training set - do the same for the test set
# add 3rd dimension
dim(X_train)
[1] 95  5
X_train <- expand_dims(X_train, axis = 2)
dim(X_train)
[1] 95  5  1
# LSTM input shape: (samples, time steps, features)
num_samples <- dim(X_train)[1]
num_steps <- dim(X_train)[2]
num_features <- dim(X_train)[3]
c(num_samples, num_steps, num_features)
[1] 95  5  1

Background: Keras - build the model

 

model <- Sequential()
model$add(LSTM(units = 4, input_shape=c(num_steps, num_features)))
model$add(Dense(1))
keras_compile(model, loss='mean_squared_error', optimizer='adam')

Background: Keras - fit the model!

 

# not executed "live" ;-)
# keras_fit(model, X_train, y_train, batch_size = 1, epochs = 500, verbose = 1)

We'll load the fitted model instead, and get the predictions.

 

# we'll load the fitted model instead...
# keras_fit(model, X_train, y_train, batch_size = 1, epochs = 500, verbose = 1)

model <- keras_load('trend_nodiff.h5')
pred_train <- keras_predict(model, X_train, batch_size = 1)
pred_test <-keras_predict(model, X_test, batch_size = 1)

Hm. Whatever happened to predicting the test data?

df <- data_frame(time_id = 1:120,
  train = c(trend_train, rep(NA, length(trend_test))),
  test = c(rep(NA, length(trend_train)), trend_test),
  pred_train = c(rep(NA, lstm_num_timesteps), pred_train, rep(NA, length(trend_test))),
  pred_test = c(rep(NA, length(trend_train)), rep(NA, lstm_num_timesteps), pred_test))
df <- df %>% gather(key = 'type', value = 'value', train:pred_test)
ggplot(df, aes(x = time_id, y = value)) + geom_line(aes(color = type))


Anything we could have done to our data to make this work better?

 

  • What if we had worked with the value differences, instead of the original values (learning from ARIMA & co.)?
  • Or even the relative differences?
  • What if we had scaled the data?

What if we work with differenced data?

 

set.seed(7777)
trend_train <- 11:110 + rnorm(100, sd = 2)
trend_test <- 111:130 + rnorm(20, sd =2)

trend_train_start <- trend_train[1]
trend_test_start <- trend_test[1]

trend_train_diff <- diff(trend_train)
trend_test_diff <- diff(trend_test)
trend_test_diff
 [1]  4.2235013 -3.5657054  4.9708362 -3.8893841  5.9185922  0.1646659
 [7]  1.4168194  2.8387000 -2.4343835  2.6326364  1.6875305 -1.6268218
[13]  2.4204479  3.5124616 -2.6026255 -0.8882357  4.0684488 -0.3788169
[19]  0.7841282
lstm_num_timesteps <- 4

Get the predictions

 

# we'll load the fitted model instead...
model <- keras_load('trend_diff.h5')
pred_train <- keras_predict(model, X_train, batch_size = 1)
pred_test <-keras_predict(model, X_test, batch_size = 1)

# "undiff"
pred_train_undiff <- pred_train + trend_train[(lstm_num_timesteps+1):(length(trend_train)-1)]
pred_test_undiff <- pred_test + trend_test[(lstm_num_timesteps+1):(length(trend_test)-1)]

Differencing makes the difference...

df <- data_frame(time_id = 1:120,
   train = c(trend_train, rep(NA, length(trend_test))),
   test = c(rep(NA, length(trend_train)), trend_test),
   pred_train = c(rep(NA, lstm_num_timesteps+1), pred_train_undiff, rep(NA, length(trend_test))),
   pred_test = c(rep(NA, length(trend_train)), rep(NA, lstm_num_timesteps+1), pred_test_undiff))
df <- df %>% gather(key = 'type', value = 'value', train:pred_test)
(test_rsme <- sqrt(sum((tail(trend_test,length(trend_test) - lstm_num_timesteps - 1) - pred_test_undiff)^2))) 
[1] 9.231277
ggplot(df, aes(x = time_id, y = value)) + geom_line(aes(color = type))


Just for completeness, let's try relative differences as well

 

set.seed(7777)
trend_train <- 11:110 + rnorm(100, sd = 2)
trend_test <- 111:130 + rnorm(20, sd =2)

trend_train_start <- trend_train[1]
trend_test_start <- trend_test[1]

trend_train_diff <- diff(trend_train)/trend_train[-length(trend_train)]
trend_test_diff <- diff(trend_test)/trend_test[-length(trend_test)]
trend_test_diff
 [1]  0.038423558 -0.031238910  0.044953465 -0.033660270  0.053006039
 [6]  0.001400490  0.012033246  0.023822810 -0.019954355  0.022018781
[11]  0.013810047 -0.013131880  0.019798102  0.028172487 -0.020302958
[16] -0.007072681  0.032626255 -0.002941878  0.006107476
lstm_num_timesteps <- 4

Get the predictions

# we'll load the fitted model instead...
model <- keras_load('trend_reldiff.h5')

pred_train <- keras_predict(model, X_train, batch_size = 1)
pred_test <-keras_predict(model, X_test, batch_size = 1)
pred_train_undiff <- pred_train * trend_train[(lstm_num_timesteps+1):(length(trend_train)-1)] + trend_train[(lstm_num_timesteps+1):(length(trend_train)-1)]
pred_test_undiff <- pred_test * trend_test[(lstm_num_timesteps+1):(length(trend_test)-1)] + trend_test[(lstm_num_timesteps+1):(length(trend_test)-1)]

Relative differences: results

df <- data_frame(time_id = 1:120,
    train = c(trend_train, rep(NA, length(trend_test))),
    test = c(rep(NA, length(trend_train)), trend_test),
    pred_train = c(rep(NA, lstm_num_timesteps+1), pred_train_undiff, rep(NA, length(trend_test))),
    pred_test = c(rep(NA, length(trend_train)), rep(NA, lstm_num_timesteps+1), pred_test_undiff))
df <- df %>% gather(key = 'type', value = 'value', train:pred_test)
(test_rsme <- sqrt(sum((tail(trend_test,length(trend_test) - lstm_num_timesteps - 1) - pred_test_undiff)^2)))
[1] 10.35424
ggplot(df, aes(x = time_id, y = value)) + geom_line(aes(color = type))


What if we difference AND normalize?

 

set.seed(7777)
trend_train <- 11:110 + rnorm(100, sd = 2)
trend_test <- 111:130 + rnorm(20, sd =2)

trend_train_start <- trend_train[1]
trend_test_start <- trend_test[1]

trend_train_diff <- diff(trend_train)
trend_test_diff <- diff(trend_test)

minval <- min(trend_train_diff)
maxval <- max(trend_train_diff)

normalize <- function(vec, min, max) {
  (vec-min) / (max-min)
}
denormalize <- function(vec,min,max) {
  vec * (max - min) + min
}

trend_train_diff <- normalize(trend_train_diff, minval, maxval)
trend_test_diff <- normalize(trend_test_diff, minval, maxval)

lstm_num_timesteps <- 4

Get the predictions

# we'll load the fitted model instead...
model <- keras_load('trend_diffnorm.h5')
pred_train <- keras_predict(model, X_train, batch_size = 1)
pred_test <-keras_predict(model, X_test, batch_size = 1)

pred_train <- denormalize(pred_train, minval, maxval)
pred_test <- denormalize(pred_test, minval, maxval)

pred_train_undiff <- pred_train + trend_train[(lstm_num_timesteps+1):(length(trend_train)-1)]
pred_test_undiff <- pred_test + trend_test[(lstm_num_timesteps+1):(length(trend_test)-1)]

Difference and normalize: results

df <- data_frame(time_id = 1:120,
    train = c(trend_train, rep(NA, length(trend_test))),
    test = c(rep(NA, length(trend_train)), trend_test),
    pred_train = c(rep(NA, lstm_num_timesteps+1), pred_train_undiff, rep(NA, length(trend_test))),
    pred_test = c(rep(NA, length(trend_train)), rep(NA, lstm_num_timesteps+1), pred_test_undiff))
df <- df %>% gather(key = 'type', value = 'value', train:pred_test)
(test_rsme <- sqrt(sum((tail(trend_test,length(trend_test) - lstm_num_timesteps - 1) - pred_test_undiff)^2)))
[1] 6.662725
ggplot(df, aes(x = time_id, y = value)) + geom_line(aes(color = type))


 

 

ARIMA vs. LSTM, Round 2: Trend-only dataset, test data in-range

Would anything change if the test data were in the range already known by the model?

set.seed(7777)
trend_train <- 11:110 + rnorm(100, sd = 2)
trend_test <- 31:50 + rnorm(20, sd =2)
df <- data_frame(time_id = 1:120,
train = c(trend_train, rep(NA, length(trend_test))),
test = c(rep(NA, length(trend_train)), trend_test))
df <- df %>% gather(key = 'train_test', value = 'value', -time_id)
ggplot(df, aes(x = time_id, y = value, color = train_test)) + geom_line()


Trend-only dataset, test data in-range: Enter: ARIMA (1)

 

set.seed(7777)
trend_train <- 11:110 + rnorm(100, sd = 2)
trend_test <- 31:50 + rnorm(20, sd =2)
fit <- auto.arima(trend_train)
order <- arimaorder(fit)
# fit a model of the same order on the test set
refit <- Arima(trend_test, order=order[1:3], seasonal=order[4:6])
predictions <- refit$fitted
(test_rsme <- sqrt(sum((trend_test - predictions)^2))) 
[1] 9.629612

Trend-only dataset, test data in-range: Enter: ARIMA (2)

 

df <- data_frame(time_id = 1:120,
  train = c(trend_train, rep(NA, length(trend_test))),
  test = c(rep(NA, length(trend_train)), trend_test),
  fitted = c(fit$fitted, rep(NA, length(trend_test))),
  preds = c(rep(NA, length(trend_train)), predictions))
df <- df %>% gather(key = 'type', value = 'value', train:preds)
ggplot(df, aes(x = time_id, y = value)) + geom_line(aes(color = type)) 


Trend-only dataset, test data in-range: Enter: LSTM (no differencing)

(test_rsme <- sqrt(sum((tail(trend_test,length(trend_test) - lstm_num_timesteps) - pred_test)^2)))
[1] 9.254975
ggplot(df, aes(x = time_id, y = value)) + geom_line(aes(color = type))


Does it get even better with differencing?

(test_rsme <- sqrt(sum((tail(trend_test,length(trend_test) - lstm_num_timesteps - 1) - pred_test_undiff)^2)))
[1] 6.394894
ggplot(df, aes(x = time_id, y = value)) + geom_line(aes(color = type))


And with relative differencing?

(test_rsme <- sqrt(sum((tail(trend_test,length(trend_test) - lstm_num_timesteps - 1) - pred_test_undiff)^2)))
[1] 7.491432
ggplot(df, aes(x = time_id, y = value)) + geom_line(aes(color = type))


Finally, what about both differencing and normalizing?

(test_rsme <- sqrt(sum((tail(trend_test,length(trend_test) - lstm_num_timesteps - 1) - pred_test_undiff)^2)))
[1] 6.710867
ggplot(df, aes(x = time_id, y = value)) + geom_line(aes(color = type))


 

 

ARIMA vs. LSTM, Round 3: seasonal-only dataset

Seasonal-only dataset

 

set.seed(7777)
seasonal_train <- rep(1:7, times = 13) + rnorm(91, sd=0.2)
seasonal_test <- rep(1:7, times = 3) + rnorm(21, sd=0.2)
df <- data_frame(time_id = 1:112,
train = c(seasonal_train, rep(NA, length(seasonal_test))),
test = c(rep(NA, length(seasonal_train)), seasonal_test))
df <- df %>% gather(key = 'train_test', value = 'value', -time_id)
ggplot(df, aes(x = time_id, y = value, color = train_test)) + geom_line()


Seasonal-only dataset: Enter: ARIMA (1)

 

set.seed(7777)
seasonal_train <- rep(1:7, times = 13) + rnorm(91, sd=0.2)
seasonal_test <- rep(1:7, times = 3) + rnorm(21, sd=0.2)
h <- 1
n <- length(seasonal_test) - h + 1
fit <- auto.arima(seasonal_train)
order <- arimaorder(fit)
refit <- Arima(seasonal_test, order=order[1:3], seasonal=order[4:6])
predictions <- refit$fitted
(test_rsme <- sqrt(sum((seasonal_test - predictions)^2))) 
[1] 4.136755

Seasonal-only dataset: Enter: ARIMA (2)

df <- data_frame(time_id = 1:112,
  train = c(seasonal_train, rep(NA, length(seasonal_test))),
  test = c(rep(NA, length(seasonal_train)), seasonal_test),
  fitted = c(fit$fitted, rep(NA, length(seasonal_test))),
  preds = c(rep(NA, length(seasonal_train)), predictions))
df <- df %>% gather(key = 'type', value = 'value', train:preds)
ggplot(df, aes(x = time_id, y = value)) + geom_line(aes(color = type)) 


Seasonal-only dataset: Enter: LSTM (no differencing)

(test_rsme <- sqrt(sum((tail(seasonal_test,length(seasonal_test) - lstm_num_timesteps) - pred_test)^2)))
[1] 0.6890056
ggplot(df, aes(x = time_id, y = value)) + geom_line(aes(color = type))


Does it get even better with differencing? (How could it ;-))

(test_rsme <- sqrt(sum((tail(seasonal_test,length(seasonal_test) - lstm_num_timesteps - 1) - pred_test_undiff)^2))) 
[1] 1.005745
ggplot(df, aes(x = time_id, y = value)) + geom_line(aes(color = type))


Or with relative differencing (for completeness' sake)?

(test_rsme <- sqrt(sum((tail(seasonal_test,length(seasonal_test) - lstm_num_timesteps - 1) - pred_test_undiff)^2))) 
[1] 0.5890867
ggplot(df, aes(x = time_id, y = value)) + geom_line(aes(color = type))


Finally, what about both differencing and normalizing?

(test_rsme <- sqrt(sum((tail(seasonal_test,length(seasonal_test) - lstm_num_timesteps - 1) - pred_test_undiff)^2)))
[1] 1.064543
ggplot(df, aes(x = time_id, y = value)) + geom_line(aes(color = type))


Benchmark: Remarks and conclusions

 

  • We've used a very basic setup for the LSTM (a richer variant is sketched below)

    • just one layer
    • no experiments with number of units, optimization routines, activation functions…
    • no use of dropout, regularization, weight decay…
    • not making use of Keras' stateful LSTMs
  • We've only compared the two methods on rolling one-step-ahead forecasts (not on forecasts several periods into the future)
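
For illustration only (hypothetical, not the model used in the benchmarks above; it assumes the kerasR layer wrappers LSTM, Dropout and Dense pass these arguments through to the corresponding Keras layers), a less minimal architecture might look like this:

# a richer architecture, for illustration only: two stacked LSTM layers plus dropout
model <- Sequential()
model$add(LSTM(units = 32, return_sequences = TRUE, input_shape = c(num_steps, num_features)))
model$add(Dropout(0.2))
model$add(LSTM(units = 16))
model$add(Dense(1))
keras_compile(model, loss = 'mean_squared_error', optimizer = 'adam')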

Aside (1): Stateful RNNs in Keras

 

  • With stateful RNNs, states computed for the samples in one batch will be reused as initial states for the respective samples in the next batch
  • “Makes sense” for a time series, as long as the data is reformatted or batch_size=1 is used
  • Presupposes batches arriving in the same order in every epoch (set shuffle = False)
  • Not currently implemented in kerasR - time for some Python…

 

 

Demo: Stateful LSTM in Keras

Aside (2): Multi-step-ahead forecasts in Keras

 

  • Multi-step-ahead forecasts using LSTM can be done in a number of ways:
    • build predictions on earlier predictions (“low end”; a rough sketch follows after this list)
    • seq2seq architecture (“high end”, not currently available out of the box in Keras)
    • using a TimeDistributed layer (not currently implemented in kerasR)
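
A minimal sketch of the first (“low end”) option, iterating one-step forecasts and feeding each prediction back in; it assumes a model trained on raw (undifferenced) windows of length lstm_num_timesteps, as in the first trend example:

# feed each one-step prediction back in as the newest input value
n_ahead <- 10
last_window <- tail(trend_train, lstm_num_timesteps)           # last known values
preds <- numeric(n_ahead)
for (i in 1:n_ahead) {
  x <- array(last_window, dim = c(1, lstm_num_timesteps, 1))   # (samples, timesteps, features)
  preds[i] <- keras_predict(model, x, batch_size = 1)
  last_window <- c(last_window[-1], preds[i])                  # slide the window forward
}
preds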

 

 

Demo: Multi-step-ahead forecasting in Keras using TimeDistributed

Finally...

 

 

Forecasting internet traffic using LSTM

Here, again, is our time series...

 

ggplot(traffic_df, aes(x = hour, y = bits)) + geom_line() + ggtitle("Internet traffic")


Forecasting internet traffic with LSTM

We apply first-order differencing and normalize the data.

traffic_train <- traffic_df$bits[1:800]
traffic_test <- traffic_df$bits[801:nrow(traffic_df)]

traffic_train_start <- traffic_train[1]
traffic_test_start <- traffic_test[1]

traffic_train_diff <- diff(traffic_train)
traffic_test_diff <- diff(traffic_test)

minval <- min(traffic_train_diff)
maxval <- max(traffic_train_diff)

normalize <- function(vec, min, max) {
  (vec-min) / (max-min)
}
denormalize <- function(vec,min,max) {
  vec * (max - min) + min
}

traffic_train_diff <- normalize(traffic_train_diff, minval, maxval)
traffic_test_diff <- normalize(traffic_test_diff, minval, maxval)

We choose 7 * 24 = 168 (one week of hourly observations) as the number of timesteps.

lstm_num_timesteps <- 7*24
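
The remaining steps are not shown on the original slide; roughly, and reusing the same kerasR calls as in the trend examples, they would look like this (a sketch, not necessarily the exact model that produced the plot below):

# windowed inputs from the differenced, normalized training series (sketch only)
X_train <- t(sapply(1:(length(traffic_train_diff) - lstm_num_timesteps),
                    function(x) traffic_train_diff[x:(x + lstm_num_timesteps - 1)]))
y_train <- traffic_train_diff[(lstm_num_timesteps + 1):length(traffic_train_diff)]
X_train <- expand_dims(X_train, axis = 2)   # shape: (samples, timesteps, features)

model <- Sequential()
model$add(LSTM(units = 4, input_shape = c(lstm_num_timesteps, 1)))
model$add(Dense(1))
keras_compile(model, loss = 'mean_squared_error', optimizer = 'adam')
# keras_fit(model, X_train, y_train, batch_size = 1, epochs = 500, verbose = 1)

Predictions on the test windows would then be de-normalized and un-differenced, exactly as in the trend example, before plotting.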

 

And the results? ... Wow!

 

ggplot(df, aes(x = time_id, y = value)) + geom_line(aes(color = type))


The end?

 

  • This is more like a beginning
  • Lots of things to explore…
    • experiment with parameters, architectures…
    • experiment with different datasets…
  • Have fun!

 

Thanks for your attention!!