Chapter 10 Problem 7.) Fit a neural network to the Default data…

We need to fit a neural network to the Default data using a single hidden layer with 10 units and dropout regularization.

library(ISLR2)
library(keras)
library(tensorflow)

Per the ISLR2 package documentation, Default is a data frame with 10,000 observations on the following 4 variables.

default: a factor with levels No and Yes indicating whether the customer defaulted on their debt

student: a factor with levels No and Yes indicating whether the customer is a student

balance: the average balance that the customer has remaining on their credit card after making their monthly payment

income: Income of customer

summary(Default)
##  default    student       balance           income     
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772  
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
##                        Median : 823.6   Median :34553  
##                        Mean   : 835.4   Mean   :33517  
##                        3rd Qu.:1166.3   3rd Qu.:43808  
##                        Max.   :2654.3   Max.   :73554

We need to first split our data into training and validation sets.

set.seed(1)
n <- nrow(Default)
ntest <- trunc(n/5)
testid <- sample(1:n, ntest)
y_test <- Default$default[testid] == 'Yes'

One thing to note above is that y_test encodes the default variable as a logical: TRUE when default is Yes and FALSE when default is No.

Let’s train our linear logistic regression model and then assess it on the validation set. In our model it will predict Default Yes if the probability of Default is greater than 0.5. In other words, our model will predict \(P(\text{Default} | X)\). If this probability is greater than 0.5, then our model will classify \(X\) as a Default TRUE. Otherwise, it will classify \(X\) as a Default FALSE.
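For reference, the logistic regression model estimates this probability as

\[
P(\text{Default} = \text{Yes} \mid X) = \frac{e^{\beta_0 + \beta_1\,\text{student} + \beta_2\,\text{balance} + \beta_3\,\text{income}}}{1 + e^{\beta_0 + \beta_1\,\text{student} + \beta_2\,\text{balance} + \beta_3\,\text{income}}},
\]

and an observation is classified as Yes whenever this estimated probability exceeds 0.5.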

ll.reg <- glm(default ~ student + balance + income, family = "binomial", data = Default[-testid, ])
ll.pred <- predict(ll.reg, newdata = Default[testid, ], type = 'response') > 0.5
ll.accuracy <- mean(ll.pred == y_test)
ll.accuracy
## [1] 0.95225

Our linear logistic regression model performs well on this validation set with accuracy around 95%.
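Because only 333 of the 10,000 customers default, accuracy alone hides how the errors split between the two classes. A confusion matrix gives a fuller picture; the line below is a minimal illustrative sketch, assuming the ll.pred and y_test objects created above.

# Illustrative: cross-tabulate predicted vs. actual defaults on the validation set
table(Predicted = ll.pred, Actual = y_test)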

Let us now try to fit a neural network.

x <- model.matrix(default ~ . - 1, data = Default)

x_train <- x[-testid,]
g_train <- Default$default[-testid]=='Yes'

x_test <- x[testid,]
g_test <- Default$default[testid] == 'Yes'

modnn <- keras_model_sequential() %>% 
  layer_dense(units=10, activation='relu', input_shape=ncol(x)) %>%
  layer_dropout(rate=0.4) %>% 
  layer_dense(units = 1, activation='sigmoid')

Now let’s compile our model.

modnn %>% compile(
  optimizer=optimizer_rmsprop(), 
  loss='binary_crossentropy', 
  metrics='accuracy')

Now let’s fit our model to our data.

history <- modnn %>% fit(
  x = x_train, 
  y = g_train, 
  epochs=30, 
  batch_size=128)

Let’s check our neural network accuracy.

nnpred <- predict(modnn, x_test) > 0.5
nn.accuracy <- mean(nnpred == g_test)
nn.accuracy
## [1] 0.965

As we can see, our neural network's test set accuracy is about 96%. This is comparable to our linear logistic regression model, which had roughly 95% accuracy.

Chapter 10 Problem 9.) Fit a lag-5 autoregressive model to the NYSE data…

library(ISLR2)
xdata <- data.matrix(NYSE[, c("DJ_return", "log_volume", "log_volatility")])
istrain <- NYSE[,"train"]
xdata <- scale(xdata)

For the NYSE data we will use the past five trading days to predict the next day’s trading volume. Hence, we use \(L=5\), and since \(T=6,051\) we can create 6,046 such \((X,Y)\) pairs.
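As a quick sanity check (illustrative only), the counts line up with the dimensions of the raw data:

nrow(NYSE)       # T = 6051 trading days
nrow(NYSE) - 5   # 6046 usable (X, Y) pairs once the first L = 5 days are dropped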

We will write a function to create lagged versions of the three time series: it takes as input a data matrix and a lag \(L\), and returns a lagged version of the matrix. It simply inserts \(L\) rows of NA at the top and truncates the bottom.

lagm <- function(x, k=1) {
  n <- nrow(x)
  pad <- matrix(NA, k, ncol(x))   # k rows of NA to insert at the top
  rbind(pad, x[1:(n-k),])         # shift everything down by k rows, dropping the last k
}
arframe <- data.frame(log_volume=xdata[, "log_volume"], 
                      L1 = lagm(xdata,1), L2=lagm(xdata,2),
                      L3 = lagm(xdata,3), L4=lagm(xdata,4),
                      L5 = lagm(xdata,5))
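To make the behavior of lagm() concrete, here is a tiny illustrative example on a made-up matrix (the toy object is hypothetical and not part of the NYSE data):

toy <- cbind(a = 1:5, b = 11:15)   # hypothetical 5 x 2 matrix
lagm(toy, 2)   # first 2 rows are NA; rows 1-3 of toy shift down, rows 4-5 are dropped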

We can see that our arframe has missing values in the lagged variables (due to the construction of them). We remove these rows and adjust istrain (our train index) accordingly.

arframe <- arframe[-(1:5), ]
istrain <- istrain[-(1:5)]

We now fit the linear AR model to the training data using lm() and predict on the test data.

arfit <- lm(log_volume ~., data=arframe[istrain, ])
arpred <- predict(arfit, arframe[!istrain, ])
V0 <- var(arframe[!istrain, "log_volume"])
1 - mean((arpred - arframe[!istrain, "log_volume"])^2)/V0
## [1] 0.413223

The last two lines compute the \(R^2\) on the test data, as defined in Equation (3.17) of the textbook. In other words, our model explains about 41% of the variation in the test data.
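For reference, those last two lines implement

\[
R^2 \;=\; 1 - \frac{\operatorname{mean}\!\left[(y_i - \hat{y}_i)^2\right]}{\widehat{\operatorname{Var}}(y)},
\]

where the mean squared prediction error and the variance of \(y\) (log_volume, stored in V0) are both computed over the test observations.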

We’ll add a 12-level factor variable for the month (months in the code below) because we have a hunch that the month of the year has a strong association with trading volume.

library(lubridate) #for extracting month from date object

string.dates <- NYSE[-c(1:5), "date"]
dates <- as.Date(string.dates)
months <- as.factor(month(dates))
arframe.month <- data.frame(months, arframe)
arfit.month <- lm(log_volume ~., data=arframe.month[istrain, ])
arpred.month <- predict(arfit.month, arframe.month[!istrain, ])
1 - mean((arpred.month - arframe[!istrain, "log_volume"])^2)/V0
## [1] 0.4170418

It appears that including this 12-level factor representation doesn’t significantly improve the performance of the model. Before, without the 12-level month factor, the \(R^2\) was equal to 0.413. Now, with the 12-level month factor, the \(R^2\) was equal to 0.417. This is hardly an improvement.

Chapter 10 Problem 10.) Flatten short sequences to fit a linear AR Model.

Now let’s set up the data in the sequence form an RNN would expect, and then flatten those short sequences to fit a linear AR model in keras. The model expects a sequence of \(L=5\) feature vectors \(X = \{X_{\ell}\}_{\ell=1}^{L}\) for each observation. These are lagged versions of the time series going back \(L\) time points.

n <- nrow(arframe)
xrnn <- data.matrix(arframe[, -1])   # drop the response, keep the 15 lagged predictors
xrnn <- array(xrnn, c(n, 3, 5))      # reshape into n x 3 variables x 5 lags
xrnn <- xrnn[,, 5:1]                 # reverse the lag order so lag 5 (furthest back) comes first
xrnn <- aperm(xrnn, c(1,3,2))        # rearrange to n x 5 (lags) x 3 (variables)
dim(xrnn)
## [1] 6046    5    3
model.rnn <- keras_model_sequential() %>%
  layer_flatten(input_shape=c(5,3)) %>% 
  layer_dense(units=1)

Compile our model.

model.rnn %>% compile(loss='mse', optimizer=optimizer_rmsprop())

Fit our model to the data.

history <- model.rnn %>% fit(
  xrnn[istrain,, ], arframe[istrain, "log_volume"], epochs=200, 
  batch_size = 64, validation_data =
    list(xrnn[!istrain,, ], arframe[!istrain, "log_volume"])
)

Now let’s check our test set \(R^2\).

npred <- predict(model.rnn, xrnn[!istrain,,])
test.Rsquared <- 1 - mean((arframe[!istrain, "log_volume"] - npred)^2)/V0
test.Rsquared
## [1] 0.4124667

It appears that the test \(R^2\) for this flattened linear AR model fit in keras is essentially the same as that of the simple lm() fit above.

The advantage of using lm() is that the interpretation of the coefficients is easier to communicate. The disadvantage of using lm() is that it assumes a normality condition on the residuals, which may or may not hold.
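As a small illustration of that interpretability advantage (a sketch only, assuming the arfit object fit above is still in the workspace), the estimated AR coefficients can be read off directly:

# Illustrative: the lagged coefficients are immediately available from the lm fit
head(coef(arfit))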

Chapter 10 Problem 12.) RNN with inclusion of variable day_of_week

Let’s reload the data in case we mutated anything.

library(ISLR2)
xdata <- data.matrix(NYSE[, c("DJ_return", "log_volume", "log_volatility")])
istrain <- NYSE[,"train"]
xdata <- scale(xdata)

We will need to create the AR dataframe and then add day of week as a one-hot vector (of length 5) for each observation.

arframe <- data.frame(log_volume=xdata[, "log_volume"], 
                      L1 = lagm(xdata,1), L2=lagm(xdata,2),
                      L3 = lagm(xdata,3), L4=lagm(xdata,4),
                      L5 = lagm(xdata,5))

arframe <- arframe[-(1:5), ]
istrain <- istrain[-(1:5)]
day <- NYSE[-(1:5), "day_of_week"]

arframed <- data.frame(day = day, arframe)
x <- model.matrix(log_volume ~ . - 1, data = arframed)

# The first five columns of x are the one-hot encoding of the day_of_week factor
day.feature <- x[, 1:5]

arframe.day <- data.frame(arframe$log_volume, 
                          arframe[,c(2:4)], day.feature, 
                          arframe[,c(5:7)], day.feature,
                          arframe[,c(8:10)], day.feature,
                          arframe[,c(11:13)], day.feature,
                          arframe[,c(14:16)], day.feature
                          )
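As a quick sanity check (illustrative only), arframe.day should contain one response column plus, for each of the 5 lags, the 3 lagged series and the 5 day-of-week dummies:

dim(arframe.day)   # expect 6046 rows and 1 + 5 * (3 + 5) = 41 columns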

Let’s set up our RNN data structure.

n <- nrow(arframe.day)                       # same number of rows as arframe (6046)
xrnn.day <- data.matrix(arframe.day[, -1])   # drop the response column
xrnn.day <- array(xrnn.day, c(n, 8, 5))      # 8 features per lag: 3 series + 5 day dummies
xrnn.day <- xrnn.day[,, 5:1]                 # reverse the lag order so lag 5 comes first
xrnn.day <- aperm(xrnn.day, c(1,3,2))        # rearrange to n x 5 (lags) x 8 (features)

Now let’s define and compile our model.

arnn.day <- keras_model_sequential() %>%
  layer_simple_rnn(units = 12, 
                   input_shape = list(5,8), 
                   dropout = 0.1, recurrent_dropout = 0.1) %>%
  layer_dense(units = 1)

arnn.day %>% compile(optimizer = optimizer_rmsprop(), 
                     loss = 'mse')

Now let’s fit our model

history <- arnn.day %>% fit(
  xrnn.day[istrain,,], arframe[istrain, "log_volume"], 
  batch_size = 64, epochs = 200, 
  validation_data = 
    list(xrnn.day[!istrain,,], arframe[!istrain, "log_volume"])
)

Let’s compute the test set \(R^2\) to see how our model does.

kpred <- predict(arnn.day, xrnn.day[!istrain,,])
1 - mean((kpred - arframe[!istrain, "log_volume"])^2)/V0
## [1] 0.4561137

As we can see, this new RNN model with day_of_week included explains roughly four to five percentage points more of the variation than the RNN model without day_of_week that the textbook fit on page 457.

Chapter 10 Problem 13.) Repeat Analysis of Lab 10.9.5 on the IMDb data

There we used a dictionary of size 10,000. Consider the effects of varying the dictionary size. Try the values

1,000
3,000
5,000
10,000

and compare the results.

Dictionary of size 1,000

Load our data and one_hot() code our documents into feature vectors, which can then be joined into a feature matrix. Since we are dealing with a sparse matrix, we use the Matrix library to create a sparseMatrix() object, which saves considerable memory.

max_features <- 1000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb

library(Matrix)
one_hot <- function(sequences, dimension) {
  seqlen <- sapply(sequences, length)   # number of words in each document
  n <- length(seqlen)                   # number of documents
  rowind <- rep(1:n, seqlen)            # row index for every word occurrence
  colind <- unlist(sequences)           # column index = word id in the dictionary
  sparseMatrix(i = rowind, j = colind, dims = c(n, dimension))
}
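A tiny hypothetical example makes the encoding concrete: each document (a vector of word ids) becomes one row, with a nonzero entry in every column whose word appears in that document.

# Illustrative toy input: two 'documents' over a 5-word dictionary
one_hot(list(c(1, 3), c(2, 4, 5)), dimension = 5)
# row 1 has entries in columns 1 and 3; row 2 in columns 2, 4 and 5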

x_train_1h <- one_hot(x_train, 1000)
x_test_1h <- one_hot(x_test, 1000)
dim(x_train_1h)
## [1] 25000  1000

We can see above that our x_train_1h is a matrix of size 25,000 x 1,000.

Let’s set up our training and validation sets. We’ll hold out a validation set of 2,000 documents, which leaves 23,000 documents to train on.

set.seed(3)
ival <- sample(seq(along=y_train), 2000)

Next we fit a fully-connected neural network with two hidden layers, each with 16 units and ReLU activation.

Define model and compile it.

model <- keras_model_sequential() %>% 
  layer_dense(units = 16, activation ='relu', input_shape=c(1000)) %>%
  layer_dense(units=16, activation='relu') %>%
  layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('accuracy'))

Fit to our data:

history <- model %>% fit(x_train_1h[-ival,], y_train[-ival], epochs=20, batch_size=512, validation_data = list(x_train_1h[ival, ], y_train[ival]))

Now we can check the accuracy of the model on the validation set.

predict_x=model %>% predict(x_train_1h[ival,]) 
predict_class_x <- as.numeric(predict_x > 0.5)

correct.accuracy.1000 <- mean(y_train[ival] == predict_class_x)
correct.accuracy.1000
## [1] 0.8435

We see that our test accuracy when we use a dictionary of size 1,000 is approximately 0.844.

Dictionary of size 3,000

We repeat the steps above. Fewer comments will be made.

max_features <- 3000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb

x_train_1h <- one_hot(x_train, max_features)
x_test_1h <- one_hot(x_test, max_features)

model <- keras_model_sequential() %>% 
  layer_dense(units = 16, activation ='relu', input_shape=c(max_features)) %>%
  layer_dense(units=16, activation='relu') %>%
  layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('accuracy'))


history <- model %>% fit(x_train_1h[-ival,], y_train[-ival], epochs=20, batch_size=512, validation_data = list(x_train_1h[ival, ], y_train[ival]))


predict_x=model %>% predict(x_train_1h[ival,]) 
predict_class_x <- as.numeric(predict_x > 0.5)

correct.accuracy.3000 <- mean(y_train[ival] == predict_class_x)
correct.accuracy.3000
## [1] 0.857

We see that our test accuracy when we use a dictionary of size 3,000 is approximately 0.857.

Dictionary of size 5,000

max_features <- 5000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb

x_train_1h <- one_hot(x_train, max_features)
x_test_1h <- one_hot(x_test, max_features)

model <- keras_model_sequential() %>% 
  layer_dense(units = 16, activation ='relu', input_shape=c(max_features)) %>%
  layer_dense(units=16, activation='relu') %>%
  layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('accuracy'))


history <- model %>% fit(x_train_1h[-ival,], y_train[-ival], epochs=20, batch_size=512, validation_data = list(x_train_1h[ival, ], y_train[ival]))


predict_x=model %>% predict(x_train_1h[ival,]) 
predict_class_x <- as.numeric(predict_x > 0.5)

correct.accuracy.5000 <- mean(y_train[ival] == predict_class_x)
correct.accuracy.5000
## [1] 0.862

We see that our test accuracy when we use a dictionary of size 5,000 is approximately 0.862.

Dictionary of size 10,000

max_features <- 10000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb

x_train_1h <- one_hot(x_train, max_features)
x_test_1h <- one_hot(x_test, max_features)

model <- keras_model_sequential() %>% 
  layer_dense(units = 16, activation ='relu', input_shape=c(max_features)) %>%
  layer_dense(units=16, activation='relu') %>%
  layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('accuracy'))


history <- model %>% fit(x_train_1h[-ival,], y_train[-ival], epochs=20, batch_size=512, validation_data = list(x_train_1h[ival, ], y_train[ival]))


predict_x=model %>% predict(x_train_1h[ival,]) 
predict_class_x <- as.numeric(predict_x > 0.5)

correct.accuracy.10000 <- mean(y_train[ival] == predict_class_x)
correct.accuracy.10000
## [1] 0.8585

We see that our test accuracy when we use a dictionary of size 10,000 is approximately 0.859.

Compare the results

Accuracy of NN Models on IMDb data

Model                      Accuracy
Dictionary Size 1,000      0.8435
Dictionary Size 3,000      0.8570
Dictionary Size 5,000      0.8620
Dictionary Size 10,000     0.8585

We can see that accuracy generally increases with dictionary size, though only marginally, and it actually dips slightly at 10,000 words (0.8585) compared to 5,000 words (0.8620).
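Since each run stored its accuracy in its own variable, a short snippet (assuming those four objects are still in the workspace) collects them into the comparison table above:

# Illustrative: gather the stored accuracies into one data frame
data.frame(
  dictionary_size = c(1000, 3000, 5000, 10000),
  accuracy = c(correct.accuracy.1000, correct.accuracy.3000,
               correct.accuracy.5000, correct.accuracy.10000)
)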