Reproduce the Code from Section 10.9 (pg.443)

A Single Layer Network on the Hitters Data

We load the data, remove rows with missing values, and create training and test indices. We will hold out about 1/3 of the data as a test set.

library(ISLR2)
Gitters <- na.omit(Hitters)
n <- nrow(Gitters)
set.seed(13)
ntest <- trunc(n/3)
testid <- sample(1:n, ntest)

Let’s check out the structure of our data.

summary(Gitters)
##      AtBat            Hits           HmRun            Runs       
##  Min.   : 19.0   Min.   :  1.0   Min.   : 0.00   Min.   :  0.00  
##  1st Qu.:282.5   1st Qu.: 71.5   1st Qu.: 5.00   1st Qu.: 33.50  
##  Median :413.0   Median :103.0   Median : 9.00   Median : 52.00  
##  Mean   :403.6   Mean   :107.8   Mean   :11.62   Mean   : 54.75  
##  3rd Qu.:526.0   3rd Qu.:141.5   3rd Qu.:18.00   3rd Qu.: 73.00  
##  Max.   :687.0   Max.   :238.0   Max.   :40.00   Max.   :130.00  
##       RBI             Walks            Years            CAtBat       
##  Min.   :  0.00   Min.   :  0.00   Min.   : 1.000   Min.   :   19.0  
##  1st Qu.: 30.00   1st Qu.: 23.00   1st Qu.: 4.000   1st Qu.:  842.5  
##  Median : 47.00   Median : 37.00   Median : 6.000   Median : 1931.0  
##  Mean   : 51.49   Mean   : 41.11   Mean   : 7.312   Mean   : 2657.5  
##  3rd Qu.: 71.00   3rd Qu.: 57.00   3rd Qu.:10.000   3rd Qu.: 3890.5  
##  Max.   :121.00   Max.   :105.00   Max.   :24.000   Max.   :14053.0  
##      CHits            CHmRun           CRuns             CRBI       
##  Min.   :   4.0   Min.   :  0.00   Min.   :   2.0   Min.   :   3.0  
##  1st Qu.: 212.0   1st Qu.: 15.00   1st Qu.: 105.5   1st Qu.:  95.0  
##  Median : 516.0   Median : 40.00   Median : 250.0   Median : 230.0  
##  Mean   : 722.2   Mean   : 69.24   Mean   : 361.2   Mean   : 330.4  
##  3rd Qu.:1054.0   3rd Qu.: 92.50   3rd Qu.: 497.5   3rd Qu.: 424.5  
##  Max.   :4256.0   Max.   :548.00   Max.   :2165.0   Max.   :1659.0  
##      CWalks       League  Division    PutOuts          Assists     
##  Min.   :   1.0   A:139   E:129    Min.   :   0.0   Min.   :  0.0  
##  1st Qu.:  71.0   N:124   W:134    1st Qu.: 113.5   1st Qu.:  8.0  
##  Median : 174.0                    Median : 224.0   Median : 45.0  
##  Mean   : 260.3                    Mean   : 290.7   Mean   :118.8  
##  3rd Qu.: 328.5                    3rd Qu.: 322.5   3rd Qu.:192.0  
##  Max.   :1566.0                    Max.   :1377.0   Max.   :492.0  
##      Errors           Salary       NewLeague
##  Min.   : 0.000   Min.   :  67.5   A:141    
##  1st Qu.: 3.000   1st Qu.: 190.0   N:122    
##  Median : 7.000   Median : 425.0            
##  Mean   : 8.593   Mean   : 535.9            
##  3rd Qu.:13.000   3rd Qu.: 750.0            
##  Max.   :32.000   Max.   :2460.0

Let’s train a linear model: the response \(Y\) is Salary, and we run a least squares linear regression with all of the available predictors.

lfit <- lm(Salary ~., data=Gitters[-testid,])
lpred <- predict(lfit, Gitters[testid,])
with(Gitters[testid,], mean(abs(lpred - Salary)))
## [1] 254.6687

From the output above we can see that the mean absolute prediction error on the test data is 254.6687.

Now let’s try fitting a lasso regression using glmnet.

x <- scale(model.matrix(Salary ~. -1, data=Gitters))
y <- Gitters$Salary

The call to model.matrix() creates the same matrix that lm() uses internally (the -1 omits the intercept). This function automatically converts factors to dummy variables. The scale() function then standardizes the matrix so that each column has mean zero and variance one.
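As a quick aside (a minimal illustration, not part of the lab), here is how model.matrix() expands a factor such as League into dummy columns:

# Hypothetical check: the factor League becomes LeagueA / LeagueN dummy columns
head(model.matrix(~ League - 1, data = Gitters), 3)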

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-2
cvfit <- cv.glmnet(x[-testid, ], y[-testid], type.measure='mae')
cpred <- predict(cvfit, x[testid,], s='lambda.min')
mean(abs(y[testid] - cpred))
## [1] 252.2994

The lasso regression model has a slightly better mean absolute error than the least squares model.

Now let’s try setting up a neural network.

library(keras)
modnn <- keras_model_sequential() %>% 
  layer_dense(units=50, activation='relu', input_shape = ncol(x)) %>% 
  layer_dropout(rate=0.4) %>%
  layer_dense(units=1)

We are using a single hidden layer with \(K_1 = 50\) hidden units and a ReLU activation function. To adjust for overfitting we use a dropout layer, in which a random 40% of the 50 activations from the previous layer are set to zero during each iteration of stochastic gradient descent. Finally, the output layer has just one unit with no activation function, indicating that the model provides a single quantitative output.
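To see what dropout is doing, here is a toy sketch (an illustration only, not Keras’ internal implementation) of one training pass: a random 40% of 50 activations are zeroed, and the survivors are rescaled by 1/(1 - rate), as in Keras’ inverted dropout.

set.seed(1)
a <- runif(50)                     # pretend these are the 50 hidden-layer activations
keep <- rbinom(50, 1, 1 - 0.4)     # keep each unit with probability 0.6
a_dropped <- a * keep / (1 - 0.4)  # zero the dropped units, rescale the rest
sum(keep == 0)                     # roughly 20 of the 50 units get dropped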

We now add code to control the fitting algorithm. We will minimize the squared-error loss. Because of the CPU time required, we will run only 200 epochs.

modnn %>% compile(loss='mse', optimizer=optimizer_rmsprop(), metrics=list('mean_absolute_error'))

history <- modnn %>% fit(
  x[-testid, ], y[-testid], epochs=200, batch_size=32, 
  validation_data=list(x[testid,], y[testid])
)
plot(history)

Finally we will evaluate the model’s performance.

npred <- predict(modnn, x[testid,])
mean.abs.error <- mean(abs(y[testid] - npred))
mean.abs.error
## [1] 374.4037

Interestingly enough, with roughly 1/6 of the epochs the textbook suggests, our model performed comparably, possibly even better, but that could just be due to the randomness of stochastic gradient descent.

A Multilayer Network on the MNIST Dataset: Digit Classification

We will partition our dataset into 60,000 training images and 10,000 test images.

mnist <- dataset_mnist()
x_train <- mnist$train$x
g_train <- mnist$train$y
x_test <- mnist$test$x
g_test <- mnist$test$y
dim(x_train)
## [1] 60000    28    28
dim(x_test)
## [1] 10000    28    28

Recall that each image is 28x28 pixels, stored as a three-dimensional array. We would like to reshape each image into a vector of length 28 x 28 = 784, matching our model specification \(X = (X_1, \ldots, X_p)\) with \(p = 784\). Luckily, the keras package has built-in functions that do this for us.

x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))
y_train <- to_categorical(g_train, 10)
y_test <- to_categorical(g_test, 10)

Neural networks are sensitive to the scale of the inputs, so we will rescale each component feature of the input \(X\) to the unit interval. In other words, instead of a number between 0 and 255, each component \(X_j\) will be between 0 and 1.

x_train <- x_train/255
x_test <- x_test/255

We are now ready to fit our neural network. The first hidden layer has 256 units, fed by the input vector of length 784 (one value per pixel). Each layer_dropout() call adds dropout regularization to help protect against overfitting, and we do this after each hidden layer. Finally, the second hidden layer is piped into the output layer, which has 10 units (one per digit 0-9) and, as we stated before, a softmax activation function, so that it returns a probability of belonging to each digit class.

Why do we need regularization? Even with such a large training set (60,000 images), the number of coefficients we need to estimate for this particular neural network, 235,146, is nearly four times the number of training images.
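As a quick check of that figure, each dense layer contributes (inputs x units) weights plus one bias per unit:

# 784*256 + 256 = 200,960;  256*128 + 128 = 32,896;  128*10 + 10 = 1,290
(784 * 256 + 256) + (256 * 128 + 128) + (128 * 10 + 10)
## [1] 235146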

modelnn <- keras_model_sequential()
modelnn %>% 
  layer_dense(units=256, activation="relu", input_shape=c(784)) %>%
  layer_dropout(rate=0.4) %>%
  layer_dense(units=128, activation='relu') %>%
  layer_dropout(rate=0.3) %>%
  layer_dense(units=10, activation='softmax')

summary(modelnn)
## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense_4 (Dense)                     (None, 256)                     200960      
## ________________________________________________________________________________
## dropout_2 (Dropout)                 (None, 256)                     0           
## ________________________________________________________________________________
## dense_3 (Dense)                     (None, 128)                     32896       
## ________________________________________________________________________________
## dropout_1 (Dropout)                 (None, 128)                     0           
## ________________________________________________________________________________
## dense_2 (Dense)                     (None, 10)                      1290        
## ================================================================================
## Total params: 235,146
## Trainable params: 235,146
## Non-trainable params: 0
## ________________________________________________________________________________

So far so good. Next, we add details to the model to specify the fitting criteria. As discussed, we will minimize the negative multinomial log-likelihood, otherwise known as the cross-entropy.
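For reference, with one-hot response vectors \(y_{im}\) (equal to 1 if image \(i\) is digit \(m\) and 0 otherwise) and predicted class probabilities \(f_m(x_i)\), this loss is

\[
-\sum_{i=1}^{n}\sum_{m=0}^{9} y_{im}\log f_m(x_i).
\]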

modelnn %>% compile(loss='categorical_crossentropy', optimizer=optimizer_rmsprop(), metrics=c("accuracy"))

Now we are ready to go. We just need to supply the training data and fit the model.

system.time(
  history <- modelnn %>% 
    fit(x_train, y_train, epochs=30, batch_size=128, validation_split=0.2)
)
##    user  system elapsed 
##  144.73    2.34   60.59
plot(history, smooth=FALSE)

Finally, we can compare predicted and true class labels and evaluate the accuracy.

# predict() returns a 10,000 x 10 matrix of class probabilities; the predicted
# digit for each image is the column with the largest probability, minus 1
# because the digits run from 0 to 9.
predict_x <- modelnn %>% predict(x_test)
classes_x <- apply(predict_x, 1, which.max) - 1
correct.accuracy <- mean(g_test == classes_x)
correct.accuracy
## [1] 0.9816

Wow. On our test set we have achieved 98.16% accuracy. Quite a feat.

10.9.4 Using Pretrained CNN Models

We will show how to use a pretrained CNN model to classify natural images. The CNN model was pretrained on the ImageNet database.

We first load the images from the directory “book_images” (this dataset, from the ISLR website, has been added to our current working directory). We then convert each image into the array format x that the keras software expects and that matches the specifications of the ImageNet database.

img_dir <- "book_images"
image_names <- list.files(img_dir)
num_images <- length(image_names)
x <- array(dim=c(num_images, 224, 224, 3))
for(i in 1:num_images) {
  img_path <- paste(img_dir, image_names[i], sep='/')
  img <- image_load(img_path, target_size=c(224, 224))
  x[i,,,] <- image_to_array(img)
}
x <- imagenet_preprocess_input(x)

Now we load the pretrained network, which we will use to classify the six images we just loaded into the variable x.

model <- application_resnet50(weights='imagenet')
#summary(model)

Finally, we classify our six images and return the top 3 class choices in terms of predicted probability.

pred6 <- model %>% predict(x) %>% 
  imagenet_decode_predictions(top=3)
print(pred6)
## [[1]]
##   class_name class_description       score
## 1  n02007558          flamingo 0.926349938
## 2  n02006656         spoonbill 0.071699433
## 3  n02002556       white_stork 0.001228212
## 
## [[2]]
##   class_name class_description     score
## 1  n03388043          fountain 0.2788654
## 2  n03532672              hook 0.1785543
## 3  n03804744              nail 0.1080727
## 
## [[3]]
##   class_name class_description      score
## 1  n01608432              kite 0.72270912
## 2  n01622779    great_grey_owl 0.08182563
## 3  n01532829       house_finch 0.04218881
## 
## [[4]]
##   class_name           class_description      score
## 1  n02097474             Tibetan_terrier 0.50929713
## 2  n02098413                       Lhasa 0.42209893
## 3  n02098105 soft-coated_wheaten_terrier 0.01695858
## 
## [[5]]
##   class_name    class_description      score
## 1  n02105641 Old_English_sheepdog 0.83266032
## 2  n02086240             Shih-Tzu 0.04513876
## 3  n03223299              doormat 0.03299771
## 
## [[6]]
##   class_name class_description      score
## 1  n01843065           jacamar 0.49795389
## 2  n01818515             macaw 0.22193295
## 3  n02494079   squirrel_monkey 0.04287862

As in Section 10.4, the image classifier has trouble when there is more than one object in the foreground. For example, image 2 shows a hawk perched on a fountain. The classifier correctly identified the fountain, but did not match the label that we humans gave the image, which was the hawk.

10.9.5 IMDb Document Classification

Now we perform document classification on the IMDb dataset. Recall that our goal is to classify each document as a positive or negative review. We limit the dictionary size to the 10,000 most frequently used words and tokens.

max_features <- 10000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb

Let us try to better understand the data structure of a training document. Each element of x_train is a vector of numbers between 0 and 9999. Each vector is a document. The numbers refer to words found in the dictionary. For example, the first training document as a vector of numbers is the following:

x_train[[1]][1:12]
##  [1]    1   14   22   16   43  530  973 1622 1385   65  458 4468

This training document corresponds to the positive review from pg.419 of the ISLR.v2 textbook. We will copy and paste that positive review here:

“this film was just brilliant casting location scenery story direction everyone’s really suited the part they played and you could just imagine being there robert…”

Thus, it will benefit us to decode the words and create an interface to the dictionary.

word_index <- dataset_imdb_word_index()
decode_review <- function(text, word_index) {
  word <- names(word_index)
  idx <- unlist(word_index, use.names=FALSE)
  # indices 0-3 are reserved for padding, start-of-sequence, unknown, and unused tokens
  word <- c("<PAD>", "<START>", "<UNK>", "<UNUSED>", word)
  idx <- c(0:3, idx + 3)
  words <- word[match(text, idx, 2)]
  paste(words, collapse = " ")
}
decode_review(x_train[[1]][1:12], word_index)
## [1] "<START> this film was just brilliant casting location scenery story direction everyone's"

Thus, we have written a function that will decode a document in the training set.

Recall we are trying to featurize each document. One such way (not the only way) that the book presents is to score each document for the presence of each of the words in our \(p=10,000\) dictionary. Thus, the feature vector that represents each document is a binary string of length \(p=10,000\) where component \(i\) equals 1 if word \(i\) from the dictionary is present and equals 0 if word \(i\) from the dictionary is not present.
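In symbols, restating the paragraph above, the feature matrix has entries

\[
x_{ij} =
\begin{cases}
1 & \text{if word } j \text{ appears in document } i,\\
0 & \text{otherwise.}
\end{cases}
\]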

Ultimately, our training set of 25,000 reviews can be stored as 25,000 vectors of length 10,000. We can store this in a feature matrix X of size 25,000 x 10,000 (consisting mostly of 0’s). Thus, we will have a sparse matrix.

Therefore, with all that being said, we need to write a function to “one-hot” encode each document in a list of documents, and return a binary matrix in sparse-matrix format.

library(Matrix)
one_hot <- function(sequences, dimension) {
  seqlen <- sapply(sequences, length)   # number of words in each document
  n <- length(seqlen)                   # number of documents
  rowind <- rep(1:n, seqlen)            # row index: the document each word belongs to
  colind <- unlist(sequences)           # column index: the word's position in the dictionary
  sparseMatrix(i = rowind, j = colind, dims = c(n, dimension))
}

x_train_1h <- one_hot(x_train, 10000)
x_test_1h <- one_hot(x_test, 10000)
dim(x_train_1h)
## [1] 25000 10000
nnzero(x_train_1h)/(25000 * 10000)
## [1] 0.01316987

We’ve saved a lot of memory here. As we can see, only 1.3% of the entries are non-zero.

Let’s set up our training and validation sets. We’ll create a size 2,000 validation set, which leaves 23,000 documents to train on.

set.seed(3)
ival <- sample(seq(along=y_train), 2000)

Let’s fit a lasso logistic regression model using glmnet() and then evaluate its performance on the validation data.

library(glmnet)
fitlm <- glmnet(x_train_1h[-ival,], y_train[-ival], family='binomial', standardize=FALSE)
classlmv <- predict(fitlm, x_train_1h[ival,], type='response') > 0.50


accuracy <- function(pred, truth) {
  return(mean(drop(pred) == drop(truth)))
}

acclmv <- apply(classlmv, 2, accuracy, y_train[ival] > 0)

We use the standard decision boundary, i.e., if the predicted probability of a positive review exceeds 0.50, the classifier labels the document as positive.

classlmv is a matrix of size 2,000 x 100. Each column corresponds to one value of lambda along the glmnet path, and contains the predictions for the 2,000 validation documents at that value of lambda.

Ultimately we want to find the value of lambda that maximizes the accuracy on our validation set. Therefore, we plot the accuracy as a function of these lambdas.

par(mar=c(4,4,4,4), mfrow=c(1,1))
maximizer = -log(fitlm$lambda[which.max(acclmv)])
plot(-log(fitlm$lambda), acclmv)
abline(v = maximizer, col='red')

print(paste("The maximized validation set accuracy is: ", acclmv[which.max(acclmv)]))
## [1] "The maximized validation set accuracy is:  0.883"

Next we fit a fully-connected neural network with two hidden layers, each with 16 units and ReLU activation.

model <- keras_model_sequential() %>% 
  layer_dense(units = 16, activation ='relu', input_shape=c(10000)) %>%
  layer_dense(units=16, activation='relu') %>%
  layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('accuracy'))
history <- model %>% fit(x_train_1h[-ival,], y_train[-ival], epochs=20, batch_size=512, validation_data = list(x_train_1h[ival, ], y_train[ival]))

Now we can check the accuracy of the model on the validation set.

# Predicted probabilities on the validation set; classify as positive when > 0.5.
predict_x <- model %>% predict(x_train_1h[ival, ])
predict_class_x <- as.numeric(predict_x > 0.5)
correct.accuracy <- mean(y_train[ival] == predict_class_x)
correct.accuracy
## [1] 0.862

Our lasso logistic regression had a validation set accuracy of 0.883. Comparing that to our neural network accuracy of 0.862, we see that the lasso logistic regression performed slightly better. This is summarized well by Figure 10.11 on page 420 of ISLR.v2. For completeness, the figure is reproduced below.

Figure 10.11: Accuracy of the lasso and the two-hidden-layer neural network.

10.9.6 Recurrent Neural Networks

Sequential Models for Document Classification

max_features <- 10000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb
wc <- sapply(x_train, length)
median(wc)
## [1] 178
sum(wc <= 500)/length(wc)
## [1] 0.91568

Thus, about 92% of documents have 500 words or fewer. Our recurrent neural network requires all documents to have the same length. The work-around is to truncate longer documents to their last 500 words, and to pad the beginning of shorter documents with blanks.

maxlen <- 500
x_train <- pad_sequences(x_train, maxlen=maxlen)
x_test <- pad_sequences(x_test, maxlen=maxlen)
dim(x_train)
## [1] 25000   500
dim(x_test)
## [1] 25000   500
x_train[1, 490:500]
##  [1]   16 4472  113  103   32   15   16 5345   19  178   32

Thus, our feature matrix X is a 25,000 x 500 matrix. Each row is a document, and each document contains exactly 500 words (either the last 500 or padded). We can see what a typical feature vector looks like in the last line of code above. Each of the 500 words in the document is represented by an integer corresponding to the location of that word in the 10,000-word dictionary.

We will use a two-layer RNN for classification. The first layer is an embedding layer of size 32: conceptually it one-hot encodes each document as a matrix of size 500 x 10,000 and then maps those 10,000 dimensions down to 32. The second layer is an LSTM with 32 units, and the output layer is a single sigmoid unit for the binary classification task.

model <- keras_model_sequential() %>% 
  layer_embedding(input_dim=10000, output_dim = 32) %>%
  layer_lstm(units=32) %>%
  layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('acc'))
history <- model %>% fit(x_train, y_train, epochs=10, batch_size=128, validation_data=list(x_test, y_test))
plot(history)
## `geom_smooth()` using formula 'y ~ x'

predy <- predict(model, x_test) > 0.5
mean(abs(y_test == as.numeric(predy)))
## [1] 0.83784

Thus, our RNN model was able to correctly classify a review as positive or negative about 84% of the time.

Time Series Prediction

We’ll now use deep learning to fit the time series models of section 10.5.

library(ISLR2)
xdata <- data.matrix(NYSE[, c("DJ_return", "log_volume", "log_volatility")])
istrain <- NYSE[,"train"]
xdata <- scale(xdata)

In particular, we wish to predict a value \(v_t\) from past values \(v_{t-1}, v_{t-2}, \ldots\).

We have one series of data, and we have an entire series of targets \(v_t\), and the inputs include past values of this series. For a mathematical representation of the structure, see the Figure below from the textbook.

Mathematical Structure of Time Series Problem.

For the NYSE data we will use the past five trading days to predict the next day’s trading volume. Hence, we use \(L=5\), and since \(T=6,051\) we can create 6,046 such \((X,Y)\) pairs.
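Concretely, with \(L = 5\), each of these pairs has the form

\[
X = \bigl(v_{t-5},\, v_{t-4},\, v_{t-3},\, v_{t-2},\, v_{t-1}\bigr), \qquad Y = v_t,
\]

where each \(v_s\) is the vector of the three measurements (DJ_return, log_volume, log_volatility) on trading day \(s\), and the component of \(Y\) we ultimately predict is log_volume.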

We will write functions to create lagged versions of the three time series. We start with a function that takes as input a data matrix and a lag \(L\), and returns a lagged version of the matrix. It simply inserts \(L\) rows of NA at the top and truncates the bottom.

lagm <- function(x, k=1) {
  n <- nrow(x)
  pad <- matrix(NA, k, ncol(x))
  rbind(pad, x[1:(n-k),])
}
arframe <- data.frame(log_volume=xdata[, "log_volume"], 
                      L1 = lagm(xdata,1), L2=lagm(xdata,2),
                      L3 = lagm(xdata,3), L4=lagm(xdata,4),
                      L5 = lagm(xdata,5))

arframe[1:5,]
##   log_volume L1.DJ_return L1.log_volume L1.log_volatility L2.DJ_return
## 1  0.1750605           NA            NA                NA           NA
## 2  1.5171653   -0.5497779     0.1750605         -4.356718           NA
## 3  2.2836006    0.9051251     1.5171653         -2.528849   -0.5497779
## 4  0.9350983    0.4347768     2.2836006         -2.417837    0.9051251
## 5  0.2247600   -0.4313611     0.9350983         -2.366325    0.4347768
##   L2.log_volume L2.log_volatility L3.DJ_return L3.log_volume L3.log_volatility
## 1            NA                NA           NA            NA                NA
## 2            NA                NA           NA            NA                NA
## 3     0.1750605         -4.356718           NA            NA                NA
## 4     1.5171653         -2.528849   -0.5497779     0.1750605         -4.356718
## 5     2.2836006         -2.417837    0.9051251     1.5171653         -2.528849
##   L4.DJ_return L4.log_volume L4.log_volatility L5.DJ_return L5.log_volume
## 1           NA            NA                NA           NA            NA
## 2           NA            NA                NA           NA            NA
## 3           NA            NA                NA           NA            NA
## 4           NA            NA                NA           NA            NA
## 5   -0.5497779     0.1750605         -4.356718           NA            NA
##   L5.log_volatility
## 1                NA
## 2                NA
## 3                NA
## 4                NA
## 5                NA

We can see that our arframe has missing values in the lagged variables (due to how they were constructed). We remove these rows and adjust istrain (our training index) accordingly.

arframe <- arframe[-(1:5), ]
istrain <- istrain[-(1:5)]

We now fit the linear AR model to the training data using lm() and predict on the test data.

arfit <- lm(log_volume ~., data=arframe[istrain, ])
arpred <- predict(arfit, arframe[!istrain, ])
V0 <- var(arframe[!istrain, "log_volume"])
1 - mean((arpred - arframe[!istrain, "log_volume"])^2)/V0
## [1] 0.413223

The last two lines compute the \(R^2\) on the test data, as defined in (3.17) of the textbook.
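Written out, the quantity computed above is

\[
R^2 = 1 - \frac{\tfrac{1}{n}\sum_{t}\bigl(\hat{v}_t - v_t\bigr)^2}{\widehat{\operatorname{Var}}(v_t)},
\]

where the sum runs over the \(n\) test days, \(\hat{v}_t\) is the predicted log_volume, and the denominator is the sample variance of log_volume on the test data (the V0 computed above).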

We’ll add a factor variable day_of_week to our data because we have a hunch that day_of_week has a strong association with trading volume.

arframed <- data.frame(day=NYSE[-(1:5), "day_of_week"], arframe)
arfitd <- lm(log_volume ~., data=arframed[istrain, ])
arpredd <- predict(arfitd, arframed[!istrain, ])
1 - mean((arpredd - arframe[!istrain, "log_volume"])^2)/V0
## [1] 0.4598616

Including day_of_week improved our test \(R^2\) from about 0.41 to about 0.46.

Now let’s fit our RNN. The RNN model expects a sequence of \(L = 5\) feature vectors \(X = \{X_\ell\}_{\ell=1}^{L}\) for each observation. These are lagged versions of the time series going back \(L\) time points.

n <- nrow(arframe)
# Pull out the 15 lagged predictors and reshape them into an n x 3 x 5 array
# (3 variables by 5 lags, following the column order of arframe), reverse the
# lag dimension so the sequence runs from lag 5 up to lag 1, then permute to
# the n x 5 x 3 shape (observations x time steps x variables) that Keras expects.
xrnn <- data.matrix(arframe[, -1])
xrnn <- array(xrnn, c(n, 3, 5))
xrnn <- xrnn[,, 5:1]
xrnn <- aperm(xrnn, c(1, 3, 2))
dim(xrnn)
## [1] 6046    5    3

Now let’s construct our RNN, which uses 12 hidden units.

model <- keras_model_sequential() %>% 
  layer_simple_rnn(units=12, 
                   input_shape=list(5,3),
                   dropout = 0.1, recurrent_dropout = 0.1) %>%
  layer_dense(units=1)

model %>% compile(optimizer = optimizer_rmsprop(), loss='mse')

We fit the model as follows. We supply the fit() function with the test data as validation data, so that when we monitor its progress and plot the history we can see the progress on the test data.

history <- model %>% fit(
  xrnn[istrain,,], arframe[istrain, "log_volume"], 
  batch_size=64, epochs = 200, 
  validation_data = 
    list(xrnn[!istrain,,], arframe[!istrain, "log_volume"])
)

Now we can assess the model’s performance (using R^2).

kpred <- predict(model, xrnn[!istrain,,])
1 - mean((kpred - arframe[!istrain, "log_volume"])^2)/V0
## [1] 0.4139273

The performance is about the same as that of the standard lm() AR model.