Part 1:

ISLR Summary

ISLR 10.1 Single Layer Neural Networks

A single layer neural network takes an input vector of \(p\) variables \(X = (X_1, ..., X_p)\) and builds a nonlinear function \(f(X)\) to predict the response \(Y\). The structure of a simple feed-forward neural network can be seen in the Figure below, which uses \(p=4\) predictors and \(K=5\) hidden units, sometimes called activations.

Single Layer Neural Network.

The neural network has the form \[ \begin{align} f(X) &= \beta_0 + \sum_{k=1}^K \beta_k h_k(X)\\ &= \beta_0 + \sum_{k=1}^K \beta_k g(w_{k0} + \sum_{j=1}^p w_{kj}X_j) \end{align} \]

Step 1.) Compute the \(K\) activations \(A_k, k=1...K\) as functions of the input features \(X_1,...,X_p\). In particular,

\[ A_k = h_k(X) = g(w_{k0} + \sum_{j=1}^p w_{kj}X_j) \]

where \(g(z)\) is a nonlinear activation function that is specified in advance. For example, \(g(z)\) can be the sigmoid activation function, or perhaps the ReLU (rectified linear unit) activation function.

Step 2.) Feed these \(K\) activations from the hidden layer into the output layer. The output \(f(X)\) is a linear model in the \(K\) activations. In particular,

\[ f(X) = \beta_0 + \sum_{k=1}^K \beta_k h_k(X) \]
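To make these two steps concrete, here is a minimal R sketch of the forward pass for \(p=4\) and \(K=5\); the input vector, weights, and choice of sigmoid below are arbitrary placeholders rather than fitted values.

set.seed(1)
p <- 4; K <- 5
x    <- rnorm(p)                              # one hypothetical input vector X
W    <- matrix(rnorm(K * (p + 1)), K, p + 1)  # row k holds (w_k0, w_k1, ..., w_kp)
beta <- rnorm(K + 1)                          # (beta_0, beta_1, ..., beta_K)
g    <- function(z) 1 / (1 + exp(-z))         # sigmoid activation

A  <- g(W %*% c(1, x))                        # Step 1: the K activations A_k = h_k(X)
fX <- beta[1] + sum(beta[-1] * A)             # Step 2: f(X) = beta_0 + sum_k beta_k A_k
fX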

All the parameters are estimated from the data. For this simple \(K=5\) and \(p=4\) example, we would need to estimate \(\beta_0,...,\beta_5\) and, for each hidden unit \(k\), the weights \(w_{k0},...,w_{k4}\). In total this is \(6 + (5 \times 5) = 31\) parameters. Even this toy example makes it clear that applying neural networks to realistic problems requires large datasets and substantial computing power, which is why neural networks/deep learning only took off in research and industry over the last 15-20 years.
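A quick sanity check of that count in R: each of the \(K\) hidden units carries an intercept plus \(p\) input weights, and the output layer adds \(K+1\) coefficients.

p <- 4; K <- 5
K * (p + 1) + (K + 1)    # 5 * 5 + 6 = 31 parameters in total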

As with any parameter estimation problem we must choose a fitting criterion. For a quantitative response, squared-error loss is typically used, so that the parameters are chosen to minimize \(\sum_{i=1}^n (y_i - f(x_i))^2\).

ISLR 10.2 Multilayer Neural Networks

Modern neural networks typically have more than one hidden layer. We will illustrate a large dense network on the famous and publicly available MNIST handwritten digit dataset. See Midterm Part 2 where we go into full detail, including source R code, for the MNIST digit dataset.

Data Input and Output Structure

Every image has 784 pixels, each stored as an eight-bit grayscale value between 0 and 255 representing how much of the written digit falls in that tiny pixel square. Thus, an input vector is \(X = (X_1,...,X_p)\) with \(p = 784\) pixels.

The output is the class label, represented by a vector \(Y = (Y_0,...,Y_9)\) of 10 dummy variables, with a one in the position corresponding to the label, and zeros elsewhere.

Neural Network Model Specification

We will use two hidden layers \(L_1\) and \(L_2\). See the image below for a visual diagram of how our model will be constructed.

The input layer has \(p=784\) units (one per pixel in each input image). The first hidden layer has \(K_1 = 256\) units, the second hidden layer has \(K_2 = 128\) units, and the output layer has 10 units (one for each digit 0-9). One can show that the number of parameters is 235,146.

We have adopted some new notation.

\(W_1\) in Figure below represents the entire matrix of weights that feed from input layer to the first hidden layer. This matrix will have (784+1) x 256 = 200,960 elements/weights.

Secondly, \(W_2\) is the matrix of weights that feed from hidden layer 1 to hidden layer 2. This matrix will have (256+1) x 128 = 32,896 elements/weights.

Lastly, \(B\) is the matrix of weights from hidden layer 2 to output layer. This matrix will have (128 + 1) x 10 = 1,290 elements/weights.

Multilayer Neural Network for Digit Classification.
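A quick arithmetic check that the three weight matrices above account for all 235,146 parameters (the \(+1\) in each term is the bias/intercept unit feeding the next layer):

(784 + 1) * 256 + (256 + 1) * 128 + (128 + 1) * 10   # 200960 + 32896 + 1290 = 235146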

One thing to note is the form of our output functions \(f_m(X) = Z_m\). Since we are dealing with classification, we would prefer that our estimates represent class probabilities (just as in logistic or multinomial regression). As in multinomial logistic regression, we use the special softmax activation function, which ensures that the 10 numbers behave like probabilities. The classifier then assigns the image to the class with the highest probability.

\[ f_m(X) = Pr(Y = m | X) = \frac{e^{Z_m}}{\sum_{l=0}^9 e^{Z_l}} \]

Our fitting criterion will be to minimize the negative multinomial log-likelihood, also known as the cross-entropy.
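As a small numerical illustration, here is the softmax and the resulting cross-entropy contribution for a single image; the output scores \(Z_0,...,Z_9\) below are made up, and the true class is taken to be the digit 3.

Z <- c(0.2, -1.0, 0.5, 3.1, 0.0, -0.3, 1.2, 0.4, -2.0, 0.1)  # hypothetical scores Z_0, ..., Z_9
probs <- exp(Z) / sum(exp(Z))   # f_m(X) = Pr(Y = m | X) via softmax
round(probs, 3)
which.max(probs) - 1            # predicted digit (R indices start at 1, digits at 0)
-log(probs[3 + 1])              # cross-entropy contribution when the true digit is 3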

For full code details and implementation see Midterm Part 2

ISLR 10.3 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) mimic to some degree how humans classify images, by recognizing specific features or patterns anywhere in the image that distinguish each particular object class.

See Figure 10.6 below.

Figure 10.6 Convolutional Neural Network.

A CNN first identifies low-level features, such as edges, patches of color, etc. Then, these low-level features are combined to form higher-level features, such as parts of ears, eyes, and so on. Eventually, the presence or absence of these higher-level features contributes to the probability of any given output class.

A CNN builds up this hierarchy by combining two specialized types of hidden layers, called convolution layers and pooling layers. Convolution layers search for instances of small patterns in the image, while pooling layers downsample these to select a prominent subset.
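To illustrate the mechanics (in base R rather than keras), here is a minimal sketch of one convolution filter followed by ReLU and a 2x2 max-pool, applied to a small made-up grayscale image.

set.seed(1)
img  <- matrix(runif(36), 6, 6)            # hypothetical 6x6 image
filt <- matrix(c(1, 0, -1,
                 1, 0, -1,
                 1, 0, -1), 3, 3)          # a simple vertical-edge filter

convolve2d <- function(image, filter) {
  k <- nrow(filter)
  out_dim <- nrow(image) - k + 1
  out <- matrix(0, out_dim, out_dim)
  for (i in 1:out_dim)
    for (j in 1:out_dim)
      out[i, j] <- sum(image[i:(i + k - 1), j:(j + k - 1)] * filter)
  out                                      # "valid" convolution, no padding
}

max_pool2 <- function(x) {
  # downsample by taking the max over non-overlapping 2x2 blocks
  rows <- seq(1, nrow(x) - 1, by = 2)
  cols <- seq(1, ncol(x) - 1, by = 2)
  outer(rows, cols, Vectorize(function(i, j) max(x[i:(i + 1), j:(j + 1)])))
}

conv_out <- pmax(convolve2d(img, filt), 0) # convolution layer + ReLU
max_pool2(conv_out)                        # pooled feature map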

ISLR 10.5 Recurrent Neural Networks

In a recurrent neural network (RNN), the input object \(X\) is a sequence. Consider a collection of documents, such as the IMDb movie reviews. Each document can be expressed as a sequence of \(L\) words, so \(X = \{X_1, X_2, \ldots, X_L\}\), where each \(X_l\) represents a word. The order of the words, and the closeness of certain words in a sentence, convey semantic meaning.

RNNs are designed to accommodate and take advantage of the sequential nature of such input objects, much like convolutional neural networks accommodate the spatial structure of image inputs.

Figure 10.12 below illustrates the structure of a very basic RNN with a sequence \(X = \{X_1, X_2, \ldots, X_L\}\) as input, a simple output \(Y\), and a hidden-layer sequence \(A = \{A_1, A_2, \ldots, A_L\}\). Each \(X_l\) is a vector; in the document example, \(X_l\) could represent a one-hot encoding of the \(l\)-th word based on the language dictionary for the collection of documents. As the sequence is processed one vector \(X_l\) at a time, the network updates the activations \(A_l\) in the hidden layer, taking as input the vector \(X_l\) and the activation vector \(A_{l-1}\) from the previous step in the sequence. Each \(A_l\) feeds into the output layer and produces a prediction \(O_l\) for \(Y\).

Figure 10.12 Recurrent Neural Network.
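To make the recurrence concrete, here is a minimal base-R sketch of the update pictured in Figure 10.12, with hypothetical dimensions and random weights; the same weights are shared across all positions \(l\) in the sequence, and the prediction is taken from the final activations.

set.seed(1)
p <- 4; d <- 3; L <- 6                     # input dim, hidden units, sequence length
X  <- matrix(rnorm(L * p), L, p)           # one sequence: rows are X_1, ..., X_L
W  <- matrix(rnorm(d * p), d, p)           # input-to-hidden weights (shared across l)
U  <- matrix(rnorm(d * d), d, d)           # hidden-to-hidden weights (shared across l)
w0 <- rnorm(d); B <- rnorm(d); b0 <- 0.1
g  <- function(z) tanh(z)                  # activation

A <- rep(0, d)                             # A_0 = 0
for (l in 1:L) {
  A <- g(w0 + W %*% X[l, ] + U %*% A)      # update activations one X_l at a time
}
O_L <- b0 + sum(B * A)                     # prediction from the final activations
O_L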

The rest of Section 10.5 in the textbook covers case studies of RNN applications, such as time series forecasting and document classification. For more details on the R implementation of these examples, please see Midterm Part 2.

ISLR 10.6 When to Use Deep Learning

A guiding principle is Occam’s razor: when faced with several methods that give roughly equivalent performance, pick the simplest.

If we can produce models with the simpler tools that perform as well, they are likely to be easier to fit and understand and potentially less fragile than the more complex approaches.

Typically we expect deep learning to be an attractive choice when the sample size of the training set is extremely large, and when interpretability of the model is not a high priority.

ISLR 10.7 Fitting a Neural Network

The technicalities of fitting/optimizing a neural network, and of possibly introducing penalization, are covered in this section. Due to its highly technical nature, details are omitted here and we encourage the reader to consult the textbook itself.

CASI Summary Chapter 18

Neural networks are fit using back-propagation and gradient descent. The details can be found in the first half of Chapter 18. In the summary below we will focus on other tuning parameters.
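For intuition only (the actual back-propagation details are in the first half of Chapter 18), here is a generic sketch of gradient descent on a squared-error loss, applied to a plain linear model with simulated data rather than a neural network.

set.seed(1)
X <- cbind(1, matrix(rnorm(200), 100, 2))  # n = 100 observations: intercept + 2 predictors
b_true <- c(1, 2, -3)
y <- X %*% b_true + rnorm(100, sd = 0.5)

b <- rep(0, 3); lr <- 0.01                 # start at zero, small learning rate
for (step in 1:500) {
  grad <- -2 * t(X) %*% (y - X %*% b) / nrow(X)  # gradient of the mean squared error
  b <- b - lr * grad                              # descent step
}
cbind(estimate = drop(b), truth = b_true)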

Number of Hidden Layers, and their Sizes

The current collective wisdom suggests it is better to have an abundant number of hidden units and to control the model complexity instead by weight regularization. Having deeper networks (more hidden layers) increases the complexity as well. The right depth tends to be task specific; for example, in the digit classification problem, two hidden layers led to the most competitive performance.

Choice of Nonlinearities

We have many choices for our nonlinear activation functions \(g^{(k)}\). These include but are not limited to

the sigmoid function

rectified linear

leaky rectified linear

hyperbolic tan function

See the graph below for a visual comparison of the different activation functions.

Activation Functions.
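For reference, here is a minimal R sketch that writes out the four activation functions listed above and plots them on a common grid; the leaky-ReLU slope alpha is an arbitrary small constant.

sigmoid    <- function(z) 1 / (1 + exp(-z))
relu       <- function(z) pmax(0, z)
leaky_relu <- function(z, alpha = 0.01) ifelse(z > 0, z, alpha * z)
# the hyperbolic tangent is built in: tanh(z)

z <- seq(-4, 4, length.out = 200)
matplot(z, cbind(sigmoid(z), relu(z), leaky_relu(z), tanh(z)),
        type = "l", lty = 1, ylab = "g(z)")
legend("topleft", c("sigmoid", "ReLU", "leaky ReLU", "tanh"),
       col = 1:4, lty = 1, bty = "n")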

Choice of Regularization

Typically this is a mixture of \(l_2\) and \(l_1\) regularization, each of which requires a tuning parameter.
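In keras this might look like the following sketch, where a combined \(l_1\)/\(l_2\) penalty is attached to a hidden layer's weights; the penalty values are arbitrary placeholders, not tuned choices.

library(keras)
reg_model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784),
              # elastic-net style penalty on this layer's weight matrix
              kernel_regularizer = regularizer_l1_l2(l1 = 1e-5, l2 = 1e-4)) %>%
  layer_dense(units = 10, activation = "softmax")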

Dropout

This is a form of regularization that is performed when learning a network, typically at different rates at the different layers. It applies to all networks, not just convolutional; in fact, it appears to work better when applied at the deeper, denser layers.
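A conceptual base-R sketch of what dropout does at training time (the keras layer handles this internally): each activation is zeroed independently with probability rate, and the survivors are rescaled so the layer's expected output is unchanged.

set.seed(1)
a <- runif(10)                          # hypothetical activations from one layer
rate <- 0.4
keep <- rbinom(length(a), 1, 1 - rate)  # 0 = dropped, 1 = kept
a_dropped <- a * keep / (1 - rate)      # "inverted" dropout rescaling
rbind(original = round(a, 2), after_dropout = round(a_dropped, 2))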

Part 2: Reproducing Code from Lab 10.9

Reproduce the Code from Section 10.9 (pg.443)

A Single Layer Network on the Hitters Data

We load the data and subset test vs training indexes. It looks like we’re going to use about 1/3 of the data as a validation set.

library(ISLR2)
Gitters <- na.omit(Hitters)
n <- nrow(Gitters)
set.seed(13)
ntest <- trunc(n/3)
testid <- sample(1:n, ntest)

Let’s check out the structure of our data.

summary(Gitters)
##      AtBat            Hits           HmRun            Runs       
##  Min.   : 19.0   Min.   :  1.0   Min.   : 0.00   Min.   :  0.00  
##  1st Qu.:282.5   1st Qu.: 71.5   1st Qu.: 5.00   1st Qu.: 33.50  
##  Median :413.0   Median :103.0   Median : 9.00   Median : 52.00  
##  Mean   :403.6   Mean   :107.8   Mean   :11.62   Mean   : 54.75  
##  3rd Qu.:526.0   3rd Qu.:141.5   3rd Qu.:18.00   3rd Qu.: 73.00  
##  Max.   :687.0   Max.   :238.0   Max.   :40.00   Max.   :130.00  
##       RBI             Walks            Years            CAtBat       
##  Min.   :  0.00   Min.   :  0.00   Min.   : 1.000   Min.   :   19.0  
##  1st Qu.: 30.00   1st Qu.: 23.00   1st Qu.: 4.000   1st Qu.:  842.5  
##  Median : 47.00   Median : 37.00   Median : 6.000   Median : 1931.0  
##  Mean   : 51.49   Mean   : 41.11   Mean   : 7.312   Mean   : 2657.5  
##  3rd Qu.: 71.00   3rd Qu.: 57.00   3rd Qu.:10.000   3rd Qu.: 3890.5  
##  Max.   :121.00   Max.   :105.00   Max.   :24.000   Max.   :14053.0  
##      CHits            CHmRun           CRuns             CRBI       
##  Min.   :   4.0   Min.   :  0.00   Min.   :   2.0   Min.   :   3.0  
##  1st Qu.: 212.0   1st Qu.: 15.00   1st Qu.: 105.5   1st Qu.:  95.0  
##  Median : 516.0   Median : 40.00   Median : 250.0   Median : 230.0  
##  Mean   : 722.2   Mean   : 69.24   Mean   : 361.2   Mean   : 330.4  
##  3rd Qu.:1054.0   3rd Qu.: 92.50   3rd Qu.: 497.5   3rd Qu.: 424.5  
##  Max.   :4256.0   Max.   :548.00   Max.   :2165.0   Max.   :1659.0  
##      CWalks       League  Division    PutOuts          Assists     
##  Min.   :   1.0   A:139   E:129    Min.   :   0.0   Min.   :  0.0  
##  1st Qu.:  71.0   N:124   W:134    1st Qu.: 113.5   1st Qu.:  8.0  
##  Median : 174.0                    Median : 224.0   Median : 45.0  
##  Mean   : 260.3                    Mean   : 290.7   Mean   :118.8  
##  3rd Qu.: 328.5                    3rd Qu.: 322.5   3rd Qu.:192.0  
##  Max.   :1566.0                    Max.   :1377.0   Max.   :492.0  
##      Errors           Salary       NewLeague
##  Min.   : 0.000   Min.   :  67.5   A:141    
##  1st Qu.: 3.000   1st Qu.: 190.0   N:122    
##  Median : 7.000   Median : 425.0            
##  Mean   : 8.593   Mean   : 535.9            
##  3rd Qu.:13.000   3rd Qu.: 750.0            
##  Max.   :32.000   Max.   :2460.0

Let’s train a linear model, i.e., response \(Y\) is Salary and we’ll run a least squares linear regression with all the predictors available.

lfit <- lm(Salary ~., data=Gitters[-testid,])
lpred <- predict(lfit, Gitters[testid,])
with(Gitters[testid,], mean(abs(lpred - Salary)))
## [1] 254.6687

From above we can see that the mean absolute prediction error on this data was 254.6687.

Now let’s try fitting a lasso regression using glmnet.

x <- scale(model.matrix(Salary ~. -1, data=Gitters))
y <- Gitters$Salary

The call to model.matrix() creates the same matrix that is used in lm() (and the -1 omits the intercept). This function automatically converts factors to dummy variables. The scale() function standardizes the matrix so each column has mean zero and variance one.

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-2
cvfit <- cv.glmnet(x[-testid, ], y[-testid], type.measure='mae')
cpred <- predict(cvfit, x[testid,], s='lambda.min')
mean(abs(y[testid] - cpred))
## [1] 252.2994

The lasso regression model has a slightly better mean absolute error than the least squares model.

Now let’s try setting up a neural network.

library(keras)
modnn <- keras_model_sequential() %>% 
  layer_dense(units=50, activation='relu', input_shape = ncol(x)) %>% 
  layer_dropout(rate=0.4) %>%
  layer_dense(units=1)

We are using a single hidden layer with \(K_1=50\) hidden units and a ReLU activation function. To adjust for overfitting we use a dropout layer, in which a random 40% of the 50 activations from the previous layer are set to zero during each iteration of the stochastic gradient descent. Finally, the output layer has just one unit with no activation function, indicating that the model provides a single quantitative output.

We now add code to control the fitting algorithm. We will minimize the squared-error loss. Due to the high CPU demands and time, we will only run 200 epochs.

modnn %>% compile(loss='mse', optimizer=optimizer_rmsprop(), metrics=list('mean_absolute_error'))

history <- modnn %>% fit(
  x[-testid, ], y[-testid], epochs=200, batch_size=32, 
  validation_data=list(x[testid,], y[testid])
)
plot(history)

Finally we will evaluate the model’s performance.

npred <- predict(modnn, x[testid,])
mean.abs.error <- mean(abs(y[testid] - npred))
mean.abs.error
## [1] 369.0129

Interestingly, with far fewer epochs than the textbook uses, our network's mean absolute error of about 369 is noticeably worse than the linear and lasso fits above; training for more epochs would likely narrow that gap, and individual runs also vary because of the stochastic gradient descent.

A multilayer network on the MNIST dataset: Digit Classification

We will partition our dataset into 60,000 training images and 10,000 test images.

mnist <- dataset_mnist()
x_train <- mnist$train$x
g_train <- mnist$train$y
x_test <- mnist$test$x
g_test <- mnist$test$y
dim(x_train)
## [1] 60000    28    28
dim(x_test)
## [1] 10000    28    28

Recall that each image is 28x28 pixels, so the training images are stored as a three-dimensional array. We would like to reshape each image into a vector of length 28 x 28 = 784, matching our model specification \(X = (X_1,...,X_p)\) with \(p = 784\). Luckily, the keras package has built-in functions that do this for us.

x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))
y_train <- to_categorical(g_train, 10)
y_test <- to_categorical(g_test, 10)

Neural networks are sensitive to the scale of the inputs, so we will rescale each component feature of the input \(X\) to the unit interval. In other words, instead of a number between 0 and 255, each component \(X_j\) will be between 0 and 1.

x_train <- x_train/255
x_test <- x_test/255

We are now ready to fit our neural network. The first hidden layer has 256 activation units and is fed by the input vector of length 784. The layer_dropout() call applies dropout regularization at the given rate, which helps protect against overfitting; we do this after each hidden layer. Finally, we pipe the second hidden layer into the output layer, which has 10 units (digits 0-9) and, as stated before, the softmax activation function so that it returns a probability of belonging to each digit class.

Why do we need regularization? Even with such a large training set (60,000 images), the number of coefficients we need to estimate for this particular neural network is nearly four times that (235,146).

modelnn <- keras_model_sequential()
modelnn %>% 
  layer_dense(units=256, activation="relu", input_shape=c(784)) %>%
  layer_dropout(rate=0.4) %>%
  layer_dense(units=128, activation='relu') %>%
  layer_dropout(rate=0.3) %>%
  layer_dense(units=10, activation='softmax')

summary(modelnn)
## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense_4 (Dense)                     (None, 256)                     200960      
## ________________________________________________________________________________
## dropout_2 (Dropout)                 (None, 256)                     0           
## ________________________________________________________________________________
## dense_3 (Dense)                     (None, 128)                     32896       
## ________________________________________________________________________________
## dropout_1 (Dropout)                 (None, 128)                     0           
## ________________________________________________________________________________
## dense_2 (Dense)                     (None, 10)                      1290        
## ================================================================================
## Total params: 235,146
## Trainable params: 235,146
## Non-trainable params: 0
## ________________________________________________________________________________

So far so good. Next, we add details to the model to specify the fitting criteria. As discussed, we will minimize the negative multinomial log-likelihood, otherwise known as the cross-entropy.

modelnn %>% compile(loss='categorical_crossentropy', optimizer=optimizer_rmsprop(), metrics=c("accuracy"))

Now we are ready to go. We just need to supply the training data and fit the model.

system.time(
  history <- modelnn %>% 
    fit(x_train, y_train, epochs=30, batch_size=128, validation_split=0.2)
)
##    user  system elapsed 
##  132.77    3.17   58.33
plot(history, smooth=FALSE)

Finally, we can compare predicted and true class labels and evaluate the accuracy.

predict_x=modelnn %>% predict(x_test) 
classes_x <- rep(-1, nrow(predict_x))
for(i in 1:nrow(predict_x)) {
  classes_x[i] = which.max(predict_x[i,]) - 1
}
correct.accuracy <- mean(g_test == classes_x)
correct.accuracy
## [1] 0.9804

Wow. On our test set we have achieved 98.04% accuracy. Quite a feat.

10.9.4 Using Pretrained CNN Models

We will show how to use a pretrained CNN model to classify natural images. The CNN model was pretrained on the imagenet database.

We first load the images from the directory “book_images”. We add this dataset (from the ISLR website) to our current working directory. We then convert each image to an array “x” format that keras software expects and that matches the specifications of the imagenet database.

img_dir <- "book_images"
image_names <- list.files(img_dir)
num_images <- length(image_names)
x <- array(dim=c(num_images, 224, 224, 3))
for(i in 1:num_images) {
  img_path <- paste(img_dir, image_names[i], sep='/')
  img <- image_load(img_path, target_size=c(224, 224))
  x[i,,,] <- image_to_array(img)
}
x <- imagenet_preprocess_input(x)

Now we load the trained network and use it to classify these 6 images we just loaded into variable “x”.

model <- application_resnet50(weights='imagenet')
#summary(model)

Finally, we classify our six images and return the top 3 class choices in terms of predicted probability.

pred6 <- model %>% predict(x) %>% 
  imagenet_decode_predictions(top=3)
print(pred6)
## [[1]]
##   class_name class_description       score
## 1  n02007558          flamingo 0.926349938
## 2  n02006656         spoonbill 0.071699433
## 3  n02002556       white_stork 0.001228212
## 
## [[2]]
##   class_name class_description     score
## 1  n03388043          fountain 0.2788654
## 2  n03532672              hook 0.1785543
## 3  n03804744              nail 0.1080727
## 
## [[3]]
##   class_name class_description      score
## 1  n01608432              kite 0.72270912
## 2  n01622779    great_grey_owl 0.08182563
## 3  n01532829       house_finch 0.04218881
## 
## [[4]]
##   class_name           class_description      score
## 1  n02097474             Tibetan_terrier 0.50929713
## 2  n02098413                       Lhasa 0.42209893
## 3  n02098105 soft-coated_wheaten_terrier 0.01695858
## 
## [[5]]
##   class_name    class_description      score
## 1  n02105641 Old_English_sheepdog 0.83266032
## 2  n02086240             Shih-Tzu 0.04513876
## 3  n03223299              doormat 0.03299771
## 
## [[6]]
##   class_name class_description      score
## 1  n01843065           jacamar 0.49795389
## 2  n01818515             macaw 0.22193295
## 3  n02494079   squirrel_monkey 0.04287862

Similar to the textbook's discussion of pretrained classifiers (Section 10.3.4), the image classifier has trouble when there is more than one object in the foreground. For example, in image 2 there is a hawk perched on a fountain. The classifier's top choice, fountain, is indeed present in the image, but it does not match the label that we humans gave it, which was the hawk.

10.9.5 IMDb Document Classification

Now we perform document classification from the IMDb dataset. Recall our goal is to classify the document as a positive or negative review. We limit the dictionary size to the 10,000 most frequently-used words and tokens.

max_features <- 10000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb

Let us try to better understand the data structure of a training document. Each element of x_train is a vector of numbers between 0 and 9999. Each vector is a document. The numbers refer to words found in the dictionary. For example, the first training document as a vector of numbers is the following:

x_train[[1]][1:12]
##  [1]    1   14   22   16   43  530  973 1622 1385   65  458 4468

This training document corresponds to the positive review from pg. 419 of the ISLR v2 textbook. We reproduce that positive review here:

“this film was just brilliant casting location scenery story direction everyone’s really suited the part they played and you could just imagine being there robert…”

Thus, it will benefit us to decode the words and create an interface to the dictionary.

word_index <- dataset_imdb_word_index()
decode_review <- function(text, word_index) {
  word <- names(word_index)
  idx <- unlist(word_index, use.names=FALSE)
  # the first four codes are reserved tokens; dictionary indices are shifted by 3
  word <- c("<PAD>", "<START>", "<UNK>", "<UNUSED>", word)
  idx <- c(0:3, idx+3)
  words <- word[match(text, idx, 2)]
  paste(words, collapse=' ')
}
decode_review(x_train[[1]][1:12], word_index)
##  [1] "<START> "    "this "       "film "       "was "        "just "      
##  [6] "brilliant "  "casting "    "location "   "scenery "    "story "     
## [11] "direction "  "everyone's "

Thus, we have written a function that will decode a document in the training set.

Recall we are trying to featurize each document. One such way (not the only way) that the book presents is to score each document for the presence of each of the words in our \(p=10,000\) dictionary. Thus, the feature vector that represents each document is a binary string of length \(p=10,000\) where component \(i\) equals 1 if word \(i\) from the dictionary is present and equals 0 if word \(i\) from the dictionary is not present.

Ultimately, our training set of 25,000 reviews can be stored as 25,000 vectors of length 10,000. We can store this in a feature matrix X of size 25,000 x 10,000 (consisting mostly of 0’s). Thus, we will have a sparse matrix.

Therefore, with all that being said, we need to write a function to “one-hot” encode each document in a list of documents, and return a binary matrix in sparse-matrix format.

library(Matrix)
one_hot <- function(sequences, dimension) {
  seqlen <- sapply(sequences, length)
  n <- length(seqlen)
  rowind <- rep(1:n, seqlen)
  colind <- unlist(sequences)
  sparseMatrix(i = rowind, j=colind, dims = c(n, dimension))
}

x_train_1h <- one_hot(x_train, 10000)
x_test_1h <- one_hot(x_test, 10000)
dim(x_train_1h)
## [1] 25000 10000
nnzero(x_train_1h)/(25000 * 10000)
## [1] 0.01316987

We’ve saved a lot of memory here. As we can see, only 1.3% of the entries are non-zero.

Let’s set up our training and validation sets. We’ll create a size 2,000 validation set, which leaves 23,000 documents to train on.

set.seed(3)
ival <- sample(seq(along=y_train), 2000)

Let’s fit a lasso logistic regression model using glmnet() and then evaluate its performance on the validation data.

library(glmnet)
fitlm <- glmnet(x_train_1h[-ival,], y_train[-ival], family='binomial', standardize=FALSE)
classlmv <- predict(fitlm, x_train_1h[ival,], type='response') > 0.50


accuracy <- function(pred, truth) {
  return(mean(drop(pred) == drop(truth)))
}

acclmv <- apply(classlmv, 2, accuracy, y_train[ival] > 0)

We use the standard decision boundary: if the predicted probability of a positive review exceeds 0.50, the classifier labels the document as positive.

classlmv is a matrix of size 2,000 x 100: each column corresponds to one value of lambda along the glmnet path and contains the predictions for the 2,000 validation documents at that lambda.

Ultimately we want to find the value of lambda that maximizes the accuracy on our validation set. Therefore, we plot the accuracy as a function of these lambdas.

par(mar=c(4,4,4,4), mfrow=c(1,1))
maximizer = -log(fitlm$lambda[which.max(acclmv)])
plot(-log(fitlm$lambda), acclmv)
abline(v = maximizer, col='red')

print(paste("The maximized validation set accuracy is: ", acclmv[which.max(acclmv)]))
## [1] "The maximized validation set accuracy is:  0.883"

Next we fit a fully-connected neural network with two hidden layers, each with 16 units and ReLU activation.

model <- keras_model_sequential() %>% 
  layer_dense(units = 16, activation ='relu', input_shape=c(10000)) %>%
  layer_dense(units=16, activation='relu') %>%
  layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('accuracy'))
history <- model %>% fit(x_train_1h[-ival,], y_train[-ival], epochs=20, batch_size=512, validation_data = list(x_train_1h[ival, ], y_train[ival]))

Now we can check the accuracy of the model on the validation set.

predict_x=model %>% predict(x_train_1h[ival,]) 
predict_class_x <- as.numeric(predict_x > 0.5)

correct.accuracy <- mean(y_train[ival] == predict_class_x)
correct.accuracy
## [1] 0.862

Our lasso logistic regression had a validation set accuracy of 0.883. Comparing that to our neural network accuracy of 0.862, we see that the lasso logistic regression performed slightly better. This is summarized well by Figure 10.11 on page 420 of ISLR v2; for completeness the figure is reproduced below.

Figure 10.11 Accuracy of lasso and two-hidden-layer nn.

10.9.6 Recurrent Neural Networks

Sequential Models for Document Classification

max_features <- 10000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb
wc <- sapply(x_train, length)
median(wc)
## [1] 178
sum(wc <= 500)/length(wc)
## [1] 0.91568

Thus, about 92% of the documents have 500 or fewer words. Our recurrent neural network requires all documents to have the same length. As a work-around, we truncate longer documents to their last 500 words and pad shorter documents at the beginning with blanks.

maxlen <- 500
x_train <- pad_sequences(x_train, maxlen=maxlen)
x_test <- pad_sequences(x_test, maxlen=maxlen)
dim(x_train)
## [1] 25000   500
dim(x_test)
## [1] 25000   500
x_train[1, 490:500]
##  [1]   16 4472  113  103   32   15   16 5345   19  178   32

Thus, our feature matrix X is a 25,000 x 500 matrix. Each row is a document, and each document now contains exactly 500 words (either its last 500, or padded at the front). We can see what a typical feature vector looks like in the last line of code above: each of the 500 words in the document is represented by an integer giving the location of that word in the 10,000-word dictionary.

We will use a two-layer RNN to classify the reviews. The first layer is an embedding layer of dimension 32: conceptually it one-hot encodes each document as a 500 x 10,000 matrix and then maps each word down to a 32-dimensional representation. The second layer is an LSTM with 32 units, and the output layer is a single sigmoid unit for the binary classification task.

model <- keras_model_sequential() %>% 
  layer_embedding(input_dim=10000, output_dim = 32) %>%
  layer_lstm(units=32) %>%
  layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('acc'))
history <- model %>% fit(x_train, y_train, epochs=10, batch_size=128, validation_data=list(x_test, y_test))
plot(history)
## `geom_smooth()` using formula 'y ~ x'

predy <- predict(model, x_test) > 0.5
mean(abs(y_test == as.numeric(predy)))
## [1] 0.8432

Thus, our RNN model was able to correctly classify a review as positive or negative about 84% of the time.

Time Series Prediction

We’ll now use deep learning to fit the time series models of Section 10.5.

library(ISLR2)
xdata <- data.matrix(NYSE[, c("DJ_return", "log_volume", "log_volatility")])
istrain <- NYSE[,"train"]
xdata <- scale(xdata)

In particular, we wish to predict a value \(v_t\) from past values \(v_{t-1}, v_{t-2}, \ldots\).

We have one series of data, and we have an entire series of targets \(v_t\), and the inputs include past values of this series. For a mathematical representation of the structure, see the Figure below from the textbook.

Mathematical Structure of Time Series Problem.

For the NYSE data we will use the past five trading days to predict the next day’s trading volume. Hence, we use \(L=5\), and since \(T=6,051\) we can create 6,046 such \((X,Y)\) pairs.

We will write functions to create lagged versions of the three time series. We start with a function that takes as input a data matrix and a lag \(L\), and returns a lagged version of the matrix: it simply inserts \(L\) rows of NA at the top and truncates the bottom.

lagm <- function(x, k=1) {
  n <- nrow(x)
  pad <- matrix(NA, k, ncol(x))
  rbind(pad, x[1:(n-k),])
}
arframe <- data.frame(log_volume=xdata[, "log_volume"], 
                      L1 = lagm(xdata,1), L2=lagm(xdata,2),
                      L3 = lagm(xdata,3), L4=lagm(xdata,4),
                      L5 = lagm(xdata,5))

arframe[1:5,]
##   log_volume L1.DJ_return L1.log_volume L1.log_volatility L2.DJ_return
## 1  0.1750605           NA            NA                NA           NA
## 2  1.5171653   -0.5497779     0.1750605         -4.356718           NA
## 3  2.2836006    0.9051251     1.5171653         -2.528849   -0.5497779
## 4  0.9350983    0.4347768     2.2836006         -2.417837    0.9051251
## 5  0.2247600   -0.4313611     0.9350983         -2.366325    0.4347768
##   L2.log_volume L2.log_volatility L3.DJ_return L3.log_volume L3.log_volatility
## 1            NA                NA           NA            NA                NA
## 2            NA                NA           NA            NA                NA
## 3     0.1750605         -4.356718           NA            NA                NA
## 4     1.5171653         -2.528849   -0.5497779     0.1750605         -4.356718
## 5     2.2836006         -2.417837    0.9051251     1.5171653         -2.528849
##   L4.DJ_return L4.log_volume L4.log_volatility L5.DJ_return L5.log_volume
## 1           NA            NA                NA           NA            NA
## 2           NA            NA                NA           NA            NA
## 3           NA            NA                NA           NA            NA
## 4           NA            NA                NA           NA            NA
## 5   -0.5497779     0.1750605         -4.356718           NA            NA
##   L5.log_volatility
## 1                NA
## 2                NA
## 3                NA
## 4                NA
## 5                NA

We can see that our arframe has missing values in the lagged variables (due to the construction of them). We remove these rows and adjust istrain (our train index) accordingly.

arframe <- arframe[-(1:5), ]
istrain <- istrain[-(1:5)]

We now fit the linear AR model to the training data using lm() and predict on the test data

arfit <- lm(log_volume ~., data=arframe[istrain, ])
arpred <- predict(arfit, arframe[!istrain, ])
V0 <- var(arframe[!istrain, "log_volume"])
1 - mean((arpred - arframe[!istrain, "log_volume"])^2)/V0
## [1] 0.413223

The last two lines compute the \(R^2\) on the test data, as defined in equation (3.17) of the textbook.

We’ll add a factor variable day_of_week to our data because we have a hunch that day_of_week has a strong association with trading volume.

arframed <- data.frame(day=NYSE[-(1:5), "day_of_week"], arframe)
arfitd <- lm(log_volume ~., data=arframed[istrain, ])
arpredd <- predict(arfitd, arframed[!istrain, ])
1 - mean((arpredd - arframe[!istrain, "log_volume"])^2)/V0
## [1] 0.4598616

Including day_of_week improved our test \(R^2\) from about 0.41 to about 0.46.

Now let’s fit our RNN. The RNN model expects a sequence of \(L=5\) feature vectors \(X = \{X_\ell\}_{\ell=1}^{L}\) for each observation; these are lagged versions of the time series going back \(L\) time points.

n <- nrow(arframe)
xrnn <- data.matrix(arframe[, -1])   # drop the response; keep the 15 lagged predictors
xrnn <- array(xrnn, c(n, 3, 5))      # reshape to n observations x 3 variables x 5 lags
xrnn <- xrnn[,, 5:1]                 # reverse the lag order so the oldest lag (L5) comes first
xrnn <- aperm(xrnn, c(1,3,2))        # rearrange to n x 5 time steps x 3 variables
dim(xrnn)
## [1] 6046    5    3

Now let’s construct our RNN, which uses 12 hidden units.

model <- keras_model_sequential() %>% 
  layer_simple_rnn(units=12, 
                   input_shape=list(5,3),
                   dropout = 0.1, recurrent_dropout = 0.1) %>%
  layer_dense(units=1)

model %>% compile(optimizer = optimizer_rmsprop(), loss='mse')

We fit the model as follows. We supply the fit function with test data as validation data, so that when we monitor its progress and plot the history we can see the progress on the test data.

history <- model %>% fit(
  xrnn[istrain,,], arframe[istrain, "log_volume"], 
  batch_size=64, epochs = 200, 
  validation_data = 
    list(xrnn[!istrain,,], arframe[!istrain, "log_volume"])
)

Now we can assess the model’s performance (using R^2).

kpred <- predict(model, xrnn[!istrain,,])
1 - mean((kpred - arframe[!istrain, "log_volume"])^2)/V0
## [1] 0.4129158

The performance is about the same as that of the standard lm() AR model.

Part 3: Problems at end of Chapter 10

Chapter 10 Problem 7.) Fit a neural network to the Default data…

We need to fit a neural network to the Default data using a single hidden layer with 10 units and dropout regularization.

library(ISLR2)
library(keras)
library(tensorflow)

Per the ISLR package we know that Default is a data frame with 10,000 observations on the following 4 variables.

default: a factor with levels No and Yes indicating whether the customer defaulted on their debt

student: a factor with levels No and Yes indicating whether the customer is a student

balance: the average balance that the customer has remaining on their credit card after making their monthly payment

income: Income of customer

summary(Default)
##  default    student       balance           income     
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772  
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
##                        Median : 823.6   Median :34553  
##                        Mean   : 835.4   Mean   :33517  
##                        3rd Qu.:1166.3   3rd Qu.:43808  
##                        Max.   :2654.3   Max.   :73554

We need to first split our data into training and validation sets.

set.seed(1)
n <- nrow(Default)
ntest <- trunc(n/5)
testid <- sample(1:n, ntest)
y_test <- Default$default[testid] == 'Yes'

One thing to note above is that we convert the default factor into a logical: TRUE if the customer defaulted (Yes) and FALSE otherwise (No).

Let’s train our linear logistic regression model and then assess it on the validation set. In our model it will predict Default Yes if the probability of Default is greater than 0.5. In other words, our model will predict \(P(\text{Default} | X)\). If this probability is greater than 0.5, then our model will classify \(X\) as a Default TRUE. Otherwise, it will classify \(X\) as a Default FALSE.

ll.reg <- glm(default ~ student + balance + income, family="binomial", data=Default[-testid,])
ll.pred <- predict(ll.reg, newdata=Default[testid,], type='response') > 0.5  # newdata (not data) gives test-set predictions
ll.accuracy = mean(ll.pred == y_test)
ll.accuracy
## [1] 0.95225

Our linear logistic regression model performs well on this validation set with accuracy around 95%.

Let us now try to fit a neural network.

x = model.matrix(default ~. -1, data=Default)

x_train <- x[-testid,]
g_train <- Default$default[-testid]=='Yes'

x_test <- x[testid,]
g_test <- Default$default[testid] == 'Yes'

modnn <- keras_model_sequential() %>% 
  layer_dense(units=10, activation='relu', input_shape=ncol(x)) %>%
  layer_dropout(rate=0.4) %>% 
  layer_dense(units = 1, activation='sigmoid')

Now let’s compile our model.

modnn %>% compile(
  optimizer=optimizer_rmsprop(), 
  loss='binary_crossentropy', 
  metrics='accuracy')

Now let’s fit our model to our data.

history <- modnn %>% fit(
  x = x_train, 
  y = g_train, 
  epochs=30, 
  batch_size=128)

Let’s check our neural network accuracy.

nnpred <- predict(modnn, x_test) > 0.5
nn.accuracy <- mean(nnpred == g_test)
nn.accuracy
## [1] 0.965

As we can see, our neural network test set accuracy is about 96%. This is comparable to our linear logistic regression which had 95% accuracy.

Chapter 10 Problem 9.) Fit a lag-5 autoregressive model to the NYSE data…

library(ISLR2)
xdata <- data.matrix(NYSE[, c("DJ_return", "log_volume", "log_volatility")])
istrain <- NYSE[,"train"]
xdata <- scale(xdata)

For the NYSE data we will use the past five trading days to predict the next day’s trading volume. Hence, we use \(L=5\), and since \(T=6,051\) we can create 6,046 such \((X,Y)\) pairs.

We will write functions to create lagged versions of the three time series. We start with a function that takes as input a data matrix and a lag \(L\), and returns a lagged version of the matrix: it simply inserts \(L\) rows of NA at the top and truncates the bottom.

lagm <- function(x, k=1) {
  n <- nrow(x)
  pad <- matrix(NA, k, ncol(x))
  rbind(pad, x[1:(n-k),])
}
arframe <- data.frame(log_volume=xdata[, "log_volume"], 
                      L1 = lagm(xdata,1), L2=lagm(xdata,2),
                      L3 = lagm(xdata,3), L4=lagm(xdata,4),
                      L5 = lagm(xdata,5))

We can see that our arframe has missing values in the lagged variables (due to the construction of them). We remove these rows and adjust istrain (our train index) accordingly.

arframe <- arframe[-(1:5), ]
istrain <- istrain[-(1:5)]

We now fit the linear AR model to the training data using lm() and predict on the test data

arfit <- lm(log_volume ~., data=arframe[istrain, ])
arpred <- predict(arfit, arframe[!istrain, ])
V0 <- var(arframe[!istrain, "log_volume"])
1 - mean((arpred - arframe[!istrain, "log_volume"])^2)/V0
## [1] 0.413223

The last two lines compute the \(R^2\) on the test data, as defined in equation (3.17) of the textbook. In other words, our model explains about 41% of the variation in the test data.

We’ll add a factor variable month_number to our data because we have a hunch that month_number has a strong association with trading volume.

library(lubridate) #for extracting month from date object

string.dates <- NYSE[-c(1:5), "date"]
dates <- as.Date(string.dates)
months <- as.factor(month(dates))
arframe.month <- data.frame(months, arframe)
arfit.month <- lm(log_volume ~., data=arframe.month[istrain, ])
arpred.month <- predict(arfit.month, arframe.month[!istrain, ])
1 - mean((arpred.month - arframe[!istrain, "log_volume"])^2)/V0
## [1] 0.4170418

It appears that including this 12-level factor representation doesn’t significantly improve the performance of the model. Before, without the 12-level month factor, the \(R^2\) was equal to 0.413. Now, with the 12-level month factor, the \(R^2\) was equal to 0.417. This is hardly an improvement.

Chapter 10 Problem 10.) Flatten short sequences to fit a linear AR Model.

We set up the same sequence data structure as for the RNN: \(L=5\) feature vectors \(X = \{X_\ell\}_{\ell=1}^{L}\) per observation, i.e., lagged versions of the time series going back \(L\) time points. We then flatten this sequence and feed it into a single linear unit, which is equivalent to a linear AR model.

n <- nrow(arframe)
xrnn <- data.matrix(arframe[, -1])
xrnn <- array(xrnn, c(n, 3, 5))
xrnn <- xrnn[,, 5:1]
xrnn <- aperm(xrnn, c(1,3,2))
dim(xrnn)
## [1] 6046    5    3
model.rnn <- keras_model_sequential() %>%
  layer_flatten(input_shape=c(5,3)) %>% 
  layer_dense(units=1)

Compile our model.

model.rnn %>% compile(loss='mse', optimizer=optimizer_rmsprop())

Fit our model to the data.

history <- model.rnn %>% fit(
  xrnn[istrain,, ], arframe[istrain, "log_volume"], epochs=200, 
  batch_size = 64, validation_data =
    list(xrnn[!istrain,, ], arframe[!istrain, "log_volume"])
)

Now let’s check our test set \(R^2\).

npred <- predict(model.rnn, xrnn[!istrain,,])
test.Rsquared <- 1 - mean((arframe[!istrain, "log_volume"] - npred)^2)/V0
test.Rsquared
## [1] 0.4122102

It appears that our \(R^2\) for this RNN linear AR model is the same as that of just a simple lm() function call.

The advantage of using lm() is that the interpretation of the coefficients is easier to communicate. The disadvantage of using lm() is that we assume a normality condition on the residuals, which may or may not hold.

Chapter 10 Problem 12.) RNN with inclusion of variable day_of_week

Let’s reload the data in case we mutated anything.

library(ISLR2)
xdata <- data.matrix(NYSE[, c("DJ_return", "log_volume", "log_volatility")])
istrain <- NYSE[,"train"]
xdata <- scale(xdata)

We will need to create the AR dataframe and then add day of week as a one-hot vector (of length 5) for each observation.

arframe <- data.frame(log_volume=xdata[, "log_volume"], 
                      L1 = lagm(xdata,1), L2=lagm(xdata,2),
                      L3 = lagm(xdata,3), L4=lagm(xdata,4),
                      L5 = lagm(xdata,5))

arframe <- arframe[-(1:5), ]
istrain <- istrain[-(1:5)]
day=NYSE[-(1:5), "day_of_week"]

arframed <- data.frame(day=NYSE[-(1:5), "day_of_week"], arframe)
x <- model.matrix(log_volume ~. -1, data=arframed)

#first five columns are the factor one-hot vector for day of week. 
day.feature = x[,c(1:5)]

arframe.day <- data.frame(arframe$log_volume, 
                          arframe[,c(2:4)], day.feature, 
                          arframe[,c(5:7)], day.feature,
                          arframe[,c(8:10)], day.feature,
                          arframe[,c(11:13)], day.feature,
                          arframe[,c(14:16)], day.feature
                          )

Let’s set up our RNN data structure.

xrnn.day <- data.matrix(arframe.day[, -1])
xrnn.day <- array(xrnn.day, c(n, 8, 5))
xrnn.day <- xrnn.day[,, 5:1]
xrnn.day <- aperm(xrnn.day, c(1,3,2))

Now let’s compile our model.

arnn.day <- keras_model_sequential() %>%
  layer_simple_rnn(units = 12, 
                   input_shape = list(5,8), 
                   dropout = 0.1, recurrent_dropout = 0.1) %>%
  layer_dense(units = 1)

arnn.day %>% compile(optimizer = optimizer_rmsprop(), 
                     loss = 'mse')

Now let’s fit our model

history <- arnn.day %>% fit(
  xrnn.day[istrain,,], arframe[istrain, "log_volume"], 
  batch_size = 64, epochs = 200, 
  validation_data = 
    list(xrnn.day[!istrain,,], arframe[!istrain, "log_volume"])
)

Let’s compute the test-set \(R^2\) to see how this model does.

kpred <- predict(arnn.day, xrnn.day[!istrain,,])
1 - mean((kpred - arframe[!istrain, "log_volume"])^2)/V0
## [1] 0.4645618

As we can see, this new RNN model with variable day_of_week included explains about 5% more of the variation than the RNN model without variable day_of_week that the textbook completed on page 457.

Chapter 10 Problem 13.) Repeat Analysis of Lab 10.9.5 on the IMDb data

There we used a dictionary of size 10,000. Consider the effects of varying the dictionary size. Try the values

1,000
3,000
5,000
10,000

and compare the results.

Dictionary of size 1,000

Load our data and one_hot() encode our documents into feature vectors, which can then be joined into a feature matrix. Since we are dealing with a sparse matrix, we will use the Matrix library to create a sparseMatrix() object, which saves considerable memory.

max_features <- 1000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb

library(Matrix)
one_hot <- function(sequences, dimension) {
  seqlen <- sapply(sequences, length)
  n <- length(seqlen)
  rowind <- rep(1:n, seqlen)
  colind <- unlist(sequences)
  sparseMatrix(i = rowind, j=colind, dims = c(n, dimension))
}

x_train_1h <- one_hot(x_train, 1000)
x_test_1h <- one_hot(x_test, 1000)
dim(x_train_1h)
## [1] 25000  1000

We can see above that our x_train_1h is a matrix of size 25,000 x 1,000.

Let’s set up our training and validation sets. We’ll create a size 2,000 validation set, which leaves 23,000 documents to train on.

set.seed(3)
ival <- sample(seq(along=y_train), 2000)

Next we fit a fully-connected neural network with two hidden layers, each with 16 units and ReLU activation.

Define model and compile it.

model <- keras_model_sequential() %>% 
  layer_dense(units = 16, activation ='relu', input_shape=c(1000)) %>%
  layer_dense(units=16, activation='relu') %>%
  layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('accuracy'))

Fit to our data:

history <- model %>% fit(x_train_1h[-ival,], y_train[-ival], epochs=20, batch_size=512, validation_data = list(x_train_1h[ival, ], y_train[ival]))

Now we can check the accuracy of the model on the validation set.

predict_x=model %>% predict(x_train_1h[ival,]) 
predict_class_x <- as.numeric(predict_x > 0.5)

correct.accuracy.1000 <- mean(y_train[ival] == predict_class_x)
correct.accuracy.1000
## [1] 0.845

We see that our validation set accuracy when we use a dictionary of size 1,000 is approximately 0.845.

Dictionary of size 3,000

We repeat the steps above. Fewer comments will be made.

max_features <- 3000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb

x_train_1h <- one_hot(x_train, max_features)
x_test_1h <- one_hot(x_test, max_features)

model <- keras_model_sequential() %>% 
  layer_dense(units = 16, activation ='relu', input_shape=c(max_features)) %>%
  layer_dense(units=16, activation='relu') %>%
  layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('accuracy'))


history <- model %>% fit(x_train_1h[-ival,], y_train[-ival], epochs=20, batch_size=512, validation_data = list(x_train_1h[ival, ], y_train[ival]))


predict_x=model %>% predict(x_train_1h[ival,]) 
predict_class_x <- as.numeric(predict_x > 0.5)

correct.accuracy.3000 <- mean(y_train[ival] == predict_class_x)
correct.accuracy.3000
## [1] 0.8655

We see that our validation set accuracy when we use a dictionary of size 3,000 is approximately 0.866.

Dictionary of size 5,000

max_features <- 5000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb

x_train_1h <- one_hot(x_train, max_features)
x_test_1h <- one_hot(x_test, max_features)

model <- keras_model_sequential() %>% 
  layer_dense(units = 16, activation ='relu', input_shape=c(max_features)) %>%
  layer_dense(units=16, activation='relu') %>%
  layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('accuracy'))


history <- model %>% fit(x_train_1h[-ival,], y_train[-ival], epochs=20, batch_size=512, validation_data = list(x_train_1h[ival, ], y_train[ival]))


predict_x=model %>% predict(x_train_1h[ival,]) 
predict_class_x <- as.numeric(predict_x > 0.5)

correct.accuracy.5000 <- mean(y_train[ival] == predict_class_x)
correct.accuracy.5000
## [1] 0.8635

We see that our validation set accuracy when we use a dictionary of size 5,000 is approximately 0.864.

Dictionary of size 10,000

max_features <- 10000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb

x_train_1h <- one_hot(x_train, max_features)
x_test_1h <- one_hot(x_test, max_features)

model <- keras_model_sequential() %>% 
  layer_dense(units = 16, activation ='relu', input_shape=c(max_features)) %>%
  layer_dense(units=16, activation='relu') %>%
  layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('accuracy'))


history <- model %>% fit(x_train_1h[-ival,], y_train[-ival], epochs=20, batch_size=512, validation_data = list(x_train_1h[ival, ], y_train[ival]))


predict_x=model %>% predict(x_train_1h[ival,]) 
predict_class_x <- as.numeric(predict_x > 0.5)

correct.accuracy.10000 <- mean(y_train[ival] == predict_class_x)
correct.accuracy.10000
## [1] 0.861

We see that our validation set accuracy when we use a dictionary of size 10,000 is approximately 0.861.

Compare the results

Accuracy of NN Models on IMDb data

| Model                   | Accuracy |
|-------------------------|----------|
| Dictionary size 1,000   | 0.8450   |
| Dictionary size 3,000   | 0.8655   |
| Dictionary size 5,000   | 0.8635   |
| Dictionary size 10,000  | 0.8610   |

We can see that accuracy improves noticeably when moving from a 1,000-word dictionary to a 3,000-word dictionary, but then plateaus (and even dips slightly) as the dictionary grows to 5,000 and 10,000 words.