A single-layer neural network takes an input vector of \(p\) variables \(X = (X_1, ..., X_p)\) and builds a nonlinear function \(f(X)\) to predict the response \(Y\). The structure of a simple feed-forward neural network is shown in the Figure below, using \(p=4\) predictors and \(K=5\) hidden units (sometimes called activations).
Single Layer Neural Network.
The neural network has the form \[ \begin{align} f(X) &= \beta_0 + \sum_{k=1}^K \beta_k h_k(X)\\ &= \beta_0 + \sum_{k=1}^K \beta_k g(w_{k0} + \sum_{j=1}^p w_{kj}X_j) \end{align} \]
Step 1.) Compute the \(K\) activations \(A_k, k=1...K\) as functions of the input features \(X_1,...,X_p\). In particular,
\[ A_k = h_k(X) = g(w_{k0} + \sum_{j=1}^p w_{kj}X_j) \]
where \(g(z)\) is a nonlinear activation function that is specified in advance. For example, \(g(z)\) can be the sigmoid activation function, or perhaps the ReLU (rectified linear unit) activation function.
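For concreteness, here is a minimal base-R sketch of those two activation functions (not part of the lab code):
sigmoid <- function(z) 1 / (1 + exp(-z)) # squashes any real z into (0, 1)
relu <- function(z) pmax(0, z) # zero for negative z, identity for positive z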
Step 2.) Feed these \(K\) activations from the hidden layer into the output layer. The output \(f(X)\) is a linear regression model of the \(K\) activations. In particular,
\[ f(X) = \beta_0 + \sum_{k=1}^K \beta_k h_k(X) \]
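To make the two steps concrete, here is a hedged sketch of the forward pass in base R, assuming the weights are already known (W is a hypothetical K x (p+1) matrix of hidden-layer weights with intercepts in column 1, and beta is a vector of length K+1 with \(\beta_0\) first):
forward_pass <- function(x, W, beta, g = function(z) pmax(0, z)) {
  A <- g(as.vector(W %*% c(1, x))) # Step 1: the K activations A_k = g(w_k0 + sum_j w_kj x_j)
  beta[1] + sum(beta[-1] * A) # Step 2: f(X) = beta_0 + sum_k beta_k A_k
}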
All the parameters are estimated from the data. For this simple \(K=5\) and \(p=4\) example, we need to estimate \(\beta_0,...,\beta_5\) and \(w_{10},...,w_{14}\) through \(w_{50},...,w_{54}\). In total this is 6 + (5x5) = 31 parameters. It is easy to see that, as networks grow, applying a neural network requires large datasets and substantial computation. This is why neural networks/deep learning only took off in research and industry in the last 15-20 years.
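A quick sanity check of that count (the formula \(K(p+1) + (K+1)\) counts the hidden-layer weights plus the output coefficients):
n_params <- function(p, K) K * (p + 1) + (K + 1)
n_params(p = 4, K = 5) # 31 for the example above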
As with any parameter estimation problem we must choose a fitting criterion. For a quantitative response, squared-error loss is typically used, so that the parameters are chosen to minimize \(\sum_{i=1}^n (y_i - f(x_i))^2\).
Modern neural networks typically have more than one hidden layer. We will illustrate a large dense network on the famous and publicly available MNIST handwritten digit dataset. See Midterm Part 2 where we go into full detail, including source R code, for the MNIST digit dataset.
Every image has 784 pixels, each of which is stored as an eight-bit grayscale value between 0 and 255 representing the relative amount of the handwritten digit in that tiny pixel square. Thus, an input vector is \(X = (X_1,...,X_p)\) where \(p = 784\) pixels.
The output is the class label, represented by a vector \(Y = (Y_0,...,Y_9)\) of 10 dummy variables, with a one in the position corresponding to the label, and zeros elsewhere.
We will use two hidden layers \(L_1\) and \(L_2\). See the image below for a visual diagram of how our model will be constructed.
The input layer has \(p=784\) units (one per pixel in each input image). The first hidden layer has \(K_1 = 256\) units, the second hidden layer has \(K_2 = 128\) units, and the output layer has 10 units (one per digit 0-9). One can show that the number of parameters is 235,146.
We have adopted some new notation.
\(W_1\) in Figure below represents the entire matrix of weights that feed from input layer to the first hidden layer. This matrix will have (784+1) x 256 = 200,960 elements/weights.
Secondly, \(W_2\) is the matrix of weights that feed from hidden layer 1 to hidden layer 2. This matrix will have (256+1) x 128 = 32,896 elements/weights.
Lastly, \(B\) is the matrix of weights from hidden layer 2 to output layer. This matrix will have (128 + 1) x 10 = 1,290 elements/weights.
Multilayer Neural Network for Digit Classification.
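The three weight counts above, and their total, can be verified with a line of arithmetic:
(784 + 1) * 256 + (256 + 1) * 128 + (128 + 1) * 10 # 200,960 + 32,896 + 1,290 = 235,146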
One thing to note concerns our output functions \(f_m(X) = Z_m\). Since we are dealing with classification, we would prefer that our estimates represent class probabilities (just as in logistic or multinomial regression). As in multinomial logistic regression, we use the special softmax activation function. This ensures that the 10 numbers behave like probabilities. The classifier then assigns the image to the class with the highest probability.
\[ f_m(X) = Pr(Y = m | X) = \frac{e^{Z_m}}{\sum_{l=0}^9 e^{Z_l}} \]
Our criterion will be to minimize the negative multinomial log-likelihood, also known as the cross-entropy.
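As a minimal sketch (illustrative only; keras computes both quantities internally), the softmax transform and the cross-entropy for a single observation can be written as:
softmax <- function(z) { ez <- exp(z - max(z)); ez / sum(ez) } # max-shift for numerical stability
cross_entropy <- function(y_onehot, prob) -sum(y_onehot * log(prob)) # negative log-likelihood for one observation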
For full code details and implementation see Midterm Part 2
Convolutional Neural Networks (CNNs) mimic to some degree how humans classify images, by recognizing specific features or patterns anywhere in the image that distinguish each particular object class.
See Figure 10.6 below.
Figure 10.6 Convolutional Neural Network.
A CNN first identifies low-level features, such as edges, patches of color, etc. Then, these low-level features are combined to form higher-level features, such as parts of ears, eyes, and so on. Eventually, the presence or absence of these higher-level features contributes to the probability of any given output class.
A CNN builds up this hierarchy by combining two specialized types of hidden layers, called convolution layers and pooling layers. Convolution layers search for instances of small patterns in the image, while pooling layers downsample these to select a prominent subset.
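To make these two operations concrete, here is a toy base-R sketch of a single convolution filter pass and a 2x2 max-pool (purely illustrative; keras implements these far more efficiently and with learned filters):
conv2d <- function(img, filt) {
  # slide a small square filter over the image, recording the sum of element-wise products
  f <- nrow(filt)
  out <- matrix(0, nrow(img) - f + 1, ncol(img) - f + 1)
  for (i in seq_len(nrow(out)))
    for (j in seq_len(ncol(out)))
      out[i, j] <- sum(img[i:(i + f - 1), j:(j + f - 1)] * filt)
  out
}
max_pool2 <- function(x) {
  # keep only the maximum of each non-overlapping 2x2 block
  r <- seq(1, nrow(x) - 1, by = 2)
  c <- seq(1, ncol(x) - 1, by = 2)
  outer(r, c, Vectorize(function(i, j) max(x[i:(i + 1), j:(j + 1)])))
}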
In a recurrent neural network (RNN), the input object \(X\) is a sequence. Consider a collection of documents, such as the collection of IMDb movie reviews. Each document can be expressed as a sequence of \(L\) words, so \(X=\{X_1, X_2, ..., X_L\}\), where each \(X_l\) represents a word. The order of the words, and the closeness of certain words in a sentence, convey semantic meaning.
RNNs are designed to accommodate and take advantage of the sequential nature of such input objects, much like convolutional neural networks accommodate the spatial structure of image inputs.
Figure 10.12 below illustrates the structure of a very basic RNN with sequence \(X=\{X_1, X_2, ..., X_L\}\) as input, a simple output \(Y\), and a hidden-layer sequence \(A=\{A_1, A_2, ..., A_L\}\). Each \(X_l\) is a vector; in the document example \(X_l\) could represent a one-hot encoding for the l-th word based on the language dictionary for the collection of documents. As the sequence is processed one vector \(X_l\) at a time, the network updates the activations \(A_l\) in the hidden layer, taking as input the vector \(X_l\) and the activation vector \(A_{l-1}\) from the previous step in the sequence. Each \(A_l\) feeds into the output layer and produces a prediction \(O_l\) for \(Y\).
Figure 10.12 Recurrent Neural Network.
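A single step of this recurrence can be sketched as follows (the weight matrices W, U and bias b are hypothetical placeholders; the key point is that the same W, U and b are shared across all positions l):
rnn_step <- function(x_l, a_prev, W, U, b, g = tanh) {
  # new activations mix the current input with the previous hidden state
  g(as.vector(W %*% x_l + U %*% a_prev + b))
}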
The rest of Section 10.5 in the textbook goes into case studies of applications of RNNs, for example time series forecasting and document classification. For more details of the implementation of these examples in R, please see Midterm Part 2.
A guiding principle is Occam’s razor: when faced with several methods that give roughly equivalent performance, pick the simplest.
If we can produce models with the simpler tools that perform as well, they are likely to be easier to fit and understand and potentially less fragile than the more complex approaches.
Typically we expect deep learning to be an attractive choice when the sample size of the training set is extremely large, and when interpretability of the model is not a high priority.
This section covers the technicalities of fitting/optimizing a neural network and possibly introducing penalization. Due to its highly technical nature, details are omitted here and we encourage the reader to consult the textbook.
Neural networks are fit using backpropagation and gradient descent. The details can be found in the first half of Chapter 18. In the summary below we will focus on other tuning parameters.
We have many choices for our nonlinear activation functions \(g^{(k)}\). These include but are not limited to
the sigmoid function
rectified linear
leaky rectified linear
hyperbolic tangent function
See the graph below for a visual comparison of the different activation functions.
Activation Functions.
Penalization of the weights is typically a mixture of \(\ell_2\) and \(\ell_1\) regularization, each of which requires a tuning parameter.
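As a hedged sketch of how such a penalty can be attached in keras (not part of the lab code; the layer sizes and penalty values below are illustrative placeholders that would need tuning):
library(keras)
mod_pen <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784),
              kernel_regularizer = regularizer_l1_l2(l1 = 1e-4, l2 = 1e-4)) %>%
  layer_dense(units = 10, activation = "softmax")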
Dropout is a form of regularization that is performed while learning the network, typically at different rates in different layers. It applies to all networks, not just convolutional ones; in fact, it appears to work best when applied at the deeper, denser layers.
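The masking idea behind dropout can be sketched in a few lines of base R (illustrative only; keras handles the masking and rescaling internally during training):
dropout_mask <- function(a, rate = 0.4) {
  keep <- rbinom(length(a), 1, prob = 1 - rate) # randomly keep each activation with probability 1 - rate
  a * keep / (1 - rate) # inverted-dropout rescaling so the expected activation is unchanged
}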
#Part 2: Reproducing Code from Lab 10.9
We load the data and subset test vs training indexes. It looks like we’re going to use about 1/3 of the data as a validation set.
library(ISLR2)
Gitters <- na.omit(Hitters)
n <- nrow(Gitters)
set.seed(13)
ntest <- trunc(n/3)
testid <- sample(1:n, ntest)
Let’s check out the structure of our data.
summary(Gitters)
## AtBat Hits HmRun Runs
## Min. : 19.0 Min. : 1.0 Min. : 0.00 Min. : 0.00
## 1st Qu.:282.5 1st Qu.: 71.5 1st Qu.: 5.00 1st Qu.: 33.50
## Median :413.0 Median :103.0 Median : 9.00 Median : 52.00
## Mean :403.6 Mean :107.8 Mean :11.62 Mean : 54.75
## 3rd Qu.:526.0 3rd Qu.:141.5 3rd Qu.:18.00 3rd Qu.: 73.00
## Max. :687.0 Max. :238.0 Max. :40.00 Max. :130.00
## RBI Walks Years CAtBat
## Min. : 0.00 Min. : 0.00 Min. : 1.000 Min. : 19.0
## 1st Qu.: 30.00 1st Qu.: 23.00 1st Qu.: 4.000 1st Qu.: 842.5
## Median : 47.00 Median : 37.00 Median : 6.000 Median : 1931.0
## Mean : 51.49 Mean : 41.11 Mean : 7.312 Mean : 2657.5
## 3rd Qu.: 71.00 3rd Qu.: 57.00 3rd Qu.:10.000 3rd Qu.: 3890.5
## Max. :121.00 Max. :105.00 Max. :24.000 Max. :14053.0
## CHits CHmRun CRuns CRBI
## Min. : 4.0 Min. : 0.00 Min. : 2.0 Min. : 3.0
## 1st Qu.: 212.0 1st Qu.: 15.00 1st Qu.: 105.5 1st Qu.: 95.0
## Median : 516.0 Median : 40.00 Median : 250.0 Median : 230.0
## Mean : 722.2 Mean : 69.24 Mean : 361.2 Mean : 330.4
## 3rd Qu.:1054.0 3rd Qu.: 92.50 3rd Qu.: 497.5 3rd Qu.: 424.5
## Max. :4256.0 Max. :548.00 Max. :2165.0 Max. :1659.0
## CWalks League Division PutOuts Assists
## Min. : 1.0 A:139 E:129 Min. : 0.0 Min. : 0.0
## 1st Qu.: 71.0 N:124 W:134 1st Qu.: 113.5 1st Qu.: 8.0
## Median : 174.0 Median : 224.0 Median : 45.0
## Mean : 260.3 Mean : 290.7 Mean :118.8
## 3rd Qu.: 328.5 3rd Qu.: 322.5 3rd Qu.:192.0
## Max. :1566.0 Max. :1377.0 Max. :492.0
## Errors Salary NewLeague
## Min. : 0.000 Min. : 67.5 A:141
## 1st Qu.: 3.000 1st Qu.: 190.0 N:122
## Median : 7.000 Median : 425.0
## Mean : 8.593 Mean : 535.9
## 3rd Qu.:13.000 3rd Qu.: 750.0
## Max. :32.000 Max. :2460.0
Let’s train a linear model, i.e., response \(Y\) is Salary and we’ll run a least squares linear regression with all the predictors available.
lfit <- lm(Salary ~., data=Gitters[-testid,])
lpred <- predict(lfit, Gitters[testid,])
with(Gitters[testid,], mean(abs(lpred - Salary)))
## [1] 254.6687
From above we can see that the mean absolute prediction error on this data was 254.6687.
Now let’s try fitting a lasso regression using glmnet.
x <- scale(model.matrix(Salary ~. -1, data=Gitters))
y <- Gitters$Salary
The call to model.matrix() creates the same design matrix that is used in lm() (and -1 omits the intercept). This function automatically converts factors to dummy variables. The scale() function standardizes the matrix so each column has mean zero and variance one.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-2
cvfit <- cv.glmnet(x[-testid, ], y[-testid], type.measure='mae')
cpred <- predict(cvfit, x[testid,], s='lambda.min')
mean(abs(y[testid] - cpred))
## [1] 252.2994
The lasso regression model has a slightly better mean absolute error than the least squares model.
Now let’s try setting up a neural network.
library(keras)
modnn <- keras_model_sequential() %>%
layer_dense(units=50, activation='relu', input_shape = ncol(x)) %>%
layer_dropout(rate=0.4) %>%
layer_dense(units=1)
We are using a single hidden layer with \(K_1=50\) hidden units and a ReLU activation function. To adjust for overfitting we use a dropout layer, in which a random 40% of the 50 activations from the previous layer are set to zero during each iteration of the stochastic gradient descent. Finally, the output layer has just one unit with no activation function, indicating that the model provides a single quantitative output.
We now add code to control the fitting algorithm. We will minimize the squared-error loss. Due to the high CPU demands and time, we will only run 200 epochs.
modnn %>% compile(loss='mse', optimizer=optimizer_rmsprop(), metrics=list('mean_absolute_error'))
history <- modnn %>% fit(
x[-testid, ], y[-testid], epochs=200, batch_size=32,
validation_data=list(x[testid,], y[testid])
)
plot(history)
Finally we will evaluate the model’s performance.
npred <- predict(modnn, x[testid,])
mean.abs.error <- mean(abs(y[testid] - npred))
mean.abs.error
## [1] 369.0129
Interestingly, with far fewer epochs than the textbook suggests, our network's mean absolute error of about 369 is noticeably worse than the linear and lasso fits above; training for more epochs would likely narrow the gap, and results also vary from run to run because of the stochastic gradient descent.
We will partition our dataset into 60,000 training images and 10,000 test images.
mnist <- dataset_mnist()
x_train <- mnist$train$x
g_train <- mnist$train$y
x_test <- mnist$test$x
g_test <- mnist$test$y
dim(x_train)
## [1] 60000 28 28
dim(x_test)
## [1] 10000 28 28
Recall that each image is 28x28 pixels, and it is stored as a three-dimensional array. We would like to reshape them into a vector of length 28 x 28 = 784. This matches our model specification vector that \(X = (X_1,....,X_p)\) and \(p = 784\). Luckily package keras has built-in functions that do this for us.
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))
y_train <- to_categorical(g_train, 10)
y_test <- to_categorical(g_test, 10)
Neural networks are sensitive to the scale of the inputs, so we will rescale each component feature of the input \(X\) to the unit interval. In other words, instead of a number between 0 and 255, each \(X_j\) component will be between 0 and 1.
x_train <- x_train/255
x_test <- x_test/255
We are now ready to fit our neural network. Layer 1 has 256 activation units and is fed by the input vector of length 784 pixels. The layer_dropout() calls apply dropout regularization to help protect against overfitting; we protect against overfitting at each hidden layer. Finally, we pipe Layer 2 into the output layer, which has 10 units (digits 0-9) and, as stated before, the softmax activation function, so that it returns a probability of belonging to each digit class.
Why do we need regularization? Even with such a large training set (60,000 images), the number of coefficients we need to estimate for this particular neural network (235,146) is nearly four times the number of training images.
modelnn <- keras_model_sequential()
modelnn %>%
layer_dense(units=256, activation="relu", input_shape=c(784)) %>%
layer_dropout(rate=0.4) %>%
layer_dense(units=128, activation='relu') %>%
layer_dropout(rate=0.3) %>%
layer_dense(units=10, activation='softmax')
summary(modelnn)
## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense_4 (Dense) (None, 256) 200960
## ________________________________________________________________________________
## dropout_2 (Dropout) (None, 256) 0
## ________________________________________________________________________________
## dense_3 (Dense) (None, 128) 32896
## ________________________________________________________________________________
## dropout_1 (Dropout) (None, 128) 0
## ________________________________________________________________________________
## dense_2 (Dense) (None, 10) 1290
## ================================================================================
## Total params: 235,146
## Trainable params: 235,146
## Non-trainable params: 0
## ________________________________________________________________________________
So far so good. Next, we add details to the model to specify the fitting criteria. As discussed, we will minimize the negative multinomial log-likelihood, otherwise known as the cross-entropy.
modelnn %>% compile(loss='categorical_crossentropy', optimizer=optimizer_rmsprop(), metrics=c("accuracy"))
Now we are ready to go. We just need to supply the training data and fit the model.
system.time(
history <- modelnn %>%
fit(x_train, y_train, epochs=30, batch_size=128, validation_split=0.2)
)
## user system elapsed
## 132.77 3.17 58.33
plot(history, smooth=FALSE)
Finally, we can compare predicted and true class labels and evaluate the accuracy.
predict_x=modelnn %>% predict(x_test)
classes_x <- rep(-1, nrow(predict_x))
for(i in 1:nrow(predict_x)) {
classes_x[i] = which.max(predict_x[i,]) - 1
}
correct.accuracy <- mean(g_test == classes_x)
correct.accuracy
## [1] 0.9804
Wow. On our test set we have achieved 98.04% accuracy. Quite a feat.
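As a side note, the same class assignments can be obtained without the explicit loop; given the same predict_x matrix as above, a row-wise argmax is equivalent:
classes_x <- apply(predict_x, 1, which.max) - 1 # shift from column index 1-10 to digit labels 0-9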
We will show how to use a pretrained CNN model to classify natural images. The CNN model was pretrained on the ImageNet database.
We first load the images from the directory “book_images”. We add this dataset (from the ISLR website) to our current working directory. We then convert each image to an array “x” format that keras software expects and that matches the specifications of the imagenet database.
img_dir <- "book_images"
image_names <- list.files(img_dir)
num_images <- length(image_names)
x <- array(dim=c(num_images, 224, 224, 3))
for(i in 1:num_images) {
img_path <- paste(img_dir, image_names[i], sep='/')
img <- image_load(img_path, target_size=c(224, 224))
x[i,,,] <- image_to_array(img)
}
x <- imagenet_preprocess_input(x)
Now we load the trained network and use it to classify these 6 images we just loaded into variable “x”.
model <- application_resnet50(weights='imagenet')
#summary(model)
Finally, we classify our six images and return the top 3 class choices in terms of predicted probability.
pred6 <- model %>% predict(x) %>%
imagenet_decode_predictions(top=3)
print(pred6)
## [[1]]
## class_name class_description score
## 1 n02007558 flamingo 0.926349938
## 2 n02006656 spoonbill 0.071699433
## 3 n02002556 white_stork 0.001228212
##
## [[2]]
## class_name class_description score
## 1 n03388043 fountain 0.2788654
## 2 n03532672 hook 0.1785543
## 3 n03804744 nail 0.1080727
##
## [[3]]
## class_name class_description score
## 1 n01608432 kite 0.72270912
## 2 n01622779 great_grey_owl 0.08182563
## 3 n01532829 house_finch 0.04218881
##
## [[4]]
## class_name class_description score
## 1 n02097474 Tibetan_terrier 0.50929713
## 2 n02098413 Lhasa 0.42209893
## 3 n02098105 soft-coated_wheaten_terrier 0.01695858
##
## [[5]]
## class_name class_description score
## 1 n02105641 Old_English_sheepdog 0.83266032
## 2 n02086240 Shih-Tzu 0.04513876
## 3 n03223299 doormat 0.03299771
##
## [[6]]
## class_name class_description score
## 1 n01843065 jacamar 0.49795389
## 2 n01818515 macaw 0.22193295
## 3 n02494079 squirrel_monkey 0.04287862
Similar to Section 10.4, the image classifier has trouble when there is more than one object in the foreground. For example, in image 2 there is a hawk perched on a fountain. The classifier correctly identified a fountain, but did not match the label that we humans gave it, which was the hawk.
Now we perform document classification from the IMDb dataset. Recall our goal is to classify the document as a positive or negative review. We limit the dictionary size to the 10,000 most frequently-used words and tokens.
max_features <- 10000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb
Let us try to better understand the data structure of a training document. Each element of x_train is a vector of numbers between 0 and 9999. Each vector is a document. The numbers refer to words found in the dictionary. For example, the first training document as a vector of numbers is the following:
x_train[[1]][1:12]
## [1] 1 14 22 16 43 530 973 1622 1385 65 458 4468
This training document corresponds to the positive review shown on p. 419 of the ISLR v2 textbook.
Thus, it will benefit us to decode the words and create an interface to the dictionary.
word_index <- dataset_imdb_word_index()
decode_review <- function(text, word_index) {
word <- names(word_index)
idx <- unlist(word_index, use.names=FALSE)
word <- c("<PAD>", "<START>", "<UNK>", "UNUSED", word)
idx <- c(0:3, idx+3)
words <- word[match(text, idx, 2)]
paste(words, collapse = " ")
}
decode_review(x_train[[1]][1:12], word_index)
## [1] "<START> " "this " "film " "was " "just "
## [6] "brilliant " "casting " "location " "scenery " "story "
## [11] "direction " "everyone's "
Thus, we have written a function that will decode a document in the training set.
Recall we are trying to featurize each document. One such way (not the only way) that the book presents is to score each document for the presence of each of the words in our \(p=10,000\) dictionary. Thus, the feature vector that represents each document is a binary string of length \(p=10,000\) where component \(i\) equals 1 if word \(i\) from the dictionary is present and equals 0 if word \(i\) from the dictionary is not present.
Ultimately, our training set of 25,000 reviews can be stored as 25,000 vectors of length 10,000. We can store this in a feature matrix X of size 25,000 x 10,000 (consisting mostly of 0’s). Thus, we will have a sparse matrix.
Therefore, with all that being said, we need to write a function to “one-hot” encode each document in a list of documents, and return a binary matrix in sparse-matrix format.
library(Matrix)
one_hot <- function(sequences, dimension) {
seqlen <- sapply(sequences, length)
n <- length(seqlen)
rowind <- rep(1:n, seqlen)
colind <- unlist(sequences)
sparseMatrix(i = rowind, j=colind, dims = c(n, dimension))
}
x_train_1h <- one_hot(x_train, 10000)
x_test_1h <- one_hot(x_test, 10000)
dim(x_train_1h)
## [1] 25000 10000
nnzero(x_train_1h)/(25000 * 10000)
## [1] 0.01316987
We’ve saved a lot of memory here. As we can see, only 1.3% of the entries are non-zero.
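A rough way to confirm the saving (the exact number will vary by machine and R version) is to inspect the object size and compare it with what a dense matrix of the same shape would require:
print(object.size(x_train_1h), units = "Mb")
# a dense numeric matrix of the same shape would need 25000 * 10000 * 8 bytes, roughly 1.9 Gb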
Let’s set up our training and validation sets. We’ll create a size 2,000 validation set, which leaves 23,000 documents to train on.
set.seed(3)
ival <- sample(seq(along=y_train), 2000)
Let’s fit a lasso logistic regression model using glmnet() and then evaluate its performance on the validation data.
library(glmnet)
fitlm <- glmnet(x_train_1h[-ival,], y_train[-ival], family='binomial', standardize=FALSE)
classlmv <- predict(fitlm, x_train_1h[ival,], type='response') > 0.50
accuracy <- function(pred, truth) {
return(mean(drop(pred) == drop(truth)))
}
acclmv <- apply(classlmv, 2, accuracy, y_train[ival] > 0)
We use the standard decision boundary, i.e., if the predicted probability of a positive review is greater than 0.50, then the classifier labels the document as positive.
classlmv is a matrix of size 2,000 x 100. Each column corresponds to a value of lambda on the glmnet path; thus each column contains the predictions for the 2,000 validation documents at a specific value of lambda.
Ultimately we want to find the value of lambda that maximizes the accuracy on our validation set. Therefore, we plot the accuracy as a function of these lambdas.
par(mar=c(4,4,4,4), mfrow=c(1,1))
maximizer = -log(fitlm$lambda[which.max(acclmv)])
plot(-log(fitlm$lambda), acclmv)
abline(v = maximizer, col='red')
print(paste("The maximized validation set accuracy is: ", acclmv[which.max(acclmv)]))
## [1] "The maximized validation set accuracy is: 0.883"
Next we fit a fully-connected neural network with two hidden layers, each with 16 units and ReLU activation.
model <- keras_model_sequential() %>%
layer_dense(units = 16, activation ='relu', input_shape=c(10000)) %>%
layer_dense(units=16, activation='relu') %>%
layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('accuracy'))
history <- model %>% fit(x_train_1h[-ival,], y_train[-ival], epochs=20, batch_size=512, validation_data = list(x_train_1h[ival, ], y_train[ival]))
Now we can check the accuracy of the model on the validation set.
predict_x=model %>% predict(x_train_1h[ival,])
predict_class_x <- as.numeric(predict_x > 0.5)
correct.accuracy <- mean(y_train[ival] == predict_class_x)
correct.accuracy
## [1] 0.862
Our lasso logistic regression had a validation set accuracy of 0.883. Comparing that to our neural network accuracy of 0.862, we see that the lasso logistic regression performed slightly better. This is summarized well by Figure 10.11 on page 420 of ISLR v2; for completeness the figure is reproduced below.
Figure 10.11 Accuracy of lasso and two-hidden-layer nn.
max_features <- 10000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb
wc <- sapply(x_train, length)
median(wc)
## [1] 178
sum(wc <= 500)/length(wc)
## [1] 0.91568
Thus, about 91% of the documents have 500 or fewer words. Our recurrent neural network requires all documents to have the same length. A work-around is to truncate any longer documents to their last 500 words, and to pad the beginning of any shorter documents with blanks.
maxlen <- 500
x_train <- pad_sequences(x_train, maxlen=maxlen)
x_test <- pad_sequences(x_test, maxlen=maxlen)
dim(x_train)
## [1] 25000 500
dim(x_test)
## [1] 25000 500
x_train[1, 490:500]
## [1] 16 4472 113 103 32 15 16 5345 19 178 32
Thus, our feature matrix X is a 25,000 x 500 matrix. Each row is a document, and each document contains exactly 500 words (either the last 500 or padded). We can see what a typical feature vector looks like in the last line of code above. Each of the 500 words in the document is represented by an integer giving the location of that word in the 10,000-word dictionary.
We will use a two-layer RNN to classify. The first layer is an embedding layer of size 32: it one-hot encodes each document as a matrix of size 500 x 10,000 and then maps those 10,000 dimensions down to 32. The second layer is an LSTM with 32 units, and the output layer is a single sigmoid unit for the binary classification task.
model <- keras_model_sequential() %>%
layer_embedding(input_dim=10000, output_dim = 32) %>%
layer_lstm(units=32) %>%
layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('acc'))
history <- model %>% fit(x_train, y_train, epochs=10, batch_size=128, validation_data=list(x_test, y_test))
plot(history)
## `geom_smooth()` using formula 'y ~ x'
predy <- predict(model, x_test) > 0.5
mean(abs(y_test == as.numeric(predy)))
## [1] 0.8432
Thus, our RNN model was able to correctly classify a review as positive or negative about 84% of the time.
We'll now use deep learning to fit the time series models of Section 10.5.
library(ISLR2)
xdata <- data.matrix(NYSE[, c("DJ_return", "log_volume", "log_volatility")])
istrain <- NYSE[,"train"]
xdata <- scale(xdata)
In particular, we wish to predict a value \(v_t\) from past values \(v_{t-1}, v_{t-2}, \ldots\).
We have one series of data, and we have an entire series of targets \(v_t\), and the inputs include past values of this series. For a mathematical representation of the structure, see the Figure below from the textbook.
Mathematical Structure of Time Series Problem.
For the NYSE data we will use the past five trading days to predict the next day’s trading volume. Hence, we use \(L=5\), and since \(T=6,051\) we can create 6,046 such \((X,Y)\) pairs.
We will write functions to create lagged versions of the three time series. We start with a function that takes as input a data matrix and a lag \(L\), and returns a lagged version of the matrix. It simply inserts \(L\) rows of NA at the top and truncates the bottom.
lagm <- function(x, k=1) {
n <- nrow(x)
pad <- matrix(NA, k, ncol(x))
rbind(pad, x[1:(n-k),])
}
arframe <- data.frame(log_volume=xdata[, "log_volume"],
L1 = lagm(xdata,1), L2=lagm(xdata,2),
L3 = lagm(xdata,3), L4=lagm(xdata,4),
L5 = lagm(xdata,5))
arframe[1:5,]
## log_volume L1.DJ_return L1.log_volume L1.log_volatility L2.DJ_return
## 1 0.1750605 NA NA NA NA
## 2 1.5171653 -0.5497779 0.1750605 -4.356718 NA
## 3 2.2836006 0.9051251 1.5171653 -2.528849 -0.5497779
## 4 0.9350983 0.4347768 2.2836006 -2.417837 0.9051251
## 5 0.2247600 -0.4313611 0.9350983 -2.366325 0.4347768
## L2.log_volume L2.log_volatility L3.DJ_return L3.log_volume L3.log_volatility
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 0.1750605 -4.356718 NA NA NA
## 4 1.5171653 -2.528849 -0.5497779 0.1750605 -4.356718
## 5 2.2836006 -2.417837 0.9051251 1.5171653 -2.528849
## L4.DJ_return L4.log_volume L4.log_volatility L5.DJ_return L5.log_volume
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 -0.5497779 0.1750605 -4.356718 NA NA
## L5.log_volatility
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
We can see that arframe has missing values in the lagged variables (by construction). We remove these rows and adjust istrain (our training index) accordingly.
arframe <- arframe[-(1:5), ]
istrain <- istrain[-(1:5)]
We now fit the linear AR model to the training data using lm() and predict on the test data.
arfit <- lm(log_volume ~., data=arframe[istrain, ])
arpred <- predict(arfit, arframe[!istrain, ])
V0 <- var(arframe[!istrain, "log_volume"])
1 - mean((arpred - arframe[!istrain, "log_volume"])^2)/V0
## [1] 0.413223
The last two lines compute the \(R^2\) on the test data, as defined in equation (3.17) of the textbook.
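For readability, that calculation can be wrapped in a small helper; since V0 is just the variance of the test response, this reproduces the value above:
test_r2 <- function(pred, truth) 1 - mean((pred - truth)^2) / var(truth)
test_r2(arpred, arframe[!istrain, "log_volume"])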
We’ll add a factor variable day_of_week to our data because we have a hunch that day_of_week has a strong association with trading volume.
arframed <- data.frame(day=NYSE[-(1:5), "day_of_week"], arframe)
arfitd <- lm(log_volume ~., data=arframed[istrain, ])
arpredd <- predict(arfitd, arframed[!istrain, ])
1 - mean((arpredd - arframe[!istrain, "log_volume"])^2)/V0
## [1] 0.4598616
Including day_of_week improved our test \(R^2\) by about 5 percentage points (from 0.413 to 0.460).
Now let's fit our RNN. The RNN model expects a sequence of \(L=5\) feature vectors \(X = \{X_\ell\}_{\ell=1}^{L}\). These are lagged versions of the time series going back \(L\) time points.
n <- nrow(arframe)
xrnn <- data.matrix(arframe[, -1]) # drop the response; keep the 15 lagged predictors
xrnn <- array(xrnn, c(n, 3, 5)) # reshape into an n x 3 (variables) x 5 (lags) array
xrnn <- xrnn[,, 5:1] # reverse the lag order so the oldest lag comes first
xrnn <- aperm(xrnn, c(1,3,2)) # transpose to the n x 5 x 3 shape the RNN expects
dim(xrnn)
## [1] 6046 5 3
Now let's construct our RNN, which uses 12 hidden units.
model <- keras_model_sequential() %>%
layer_simple_rnn(units=12,
input_shape=list(5,3),
dropout = 0.1, recurrent_dropout = 0.1) %>%
layer_dense(units=1)
model %>% compile(optimizer = optimizer_rmsprop(), loss='mse')
We fit the model as follows. We supply the fit function with the test data as validation data, so that as we monitor progress and plot the history we can see the progress on the test data.
history <- model %>% fit(
xrnn[istrain,,], arframe[istrain, "log_volume"],
batch_size=64, epochs = 200,
validation_data =
list(xrnn[!istrain,,], arframe[!istrain, "log_volume"])
)
Now we can assess the model’s performance (using R^2).
kpred <- predict(model, xrnn[!istrain,,])
1 - mean((kpred - arframe[!istrain, "log_volume"])^2)/V0
## [1] 0.4129158
The performance is about the same as that of the standard lm() AR model.
#Part 3: Problems at end of Chapter 10
We need to fit a neural network to the Default data using a single hidden layer with 10 units and dropout regularization.
library(ISLR2)
library(keras)
library(tensorflow)
Per the ISLR package we know that Default is a data frame with 10,000 observations on the following 4 variables.
default: a factor with levels No and Yes indicating whether the customer defaulted on their debt
student: a factor with levels No and Yes indicating whether the customer is a student
balance: the average balance that the customer has remaining on their credit card after making their monthly payment
income: Income of customer
summary(Default)
## default student balance income
## No :9667 No :7056 Min. : 0.0 Min. : 772
## Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
We need to first split our data into training and validation sets.
set.seed(1)
n <- nrow(Default)
ntest <- trunc(n/5)
testid <- sample(1:n, ntest)
y_test <- Default$default[testid] == 'Yes'
One thing to note above is that our default variable is No (if the Default is FALSE) and Yes (if the Default is TRUE).
Let’s train our linear logistic regression model and then assess it on the validation set. In our model it will predict Default Yes if the probability of Default is greater than 0.5. In other words, our model will predict \(P(\text{Default} | X)\). If this probability is greater than 0.5, then our model will classify \(X\) as a Default TRUE. Otherwise, it will classify \(X\) as a Default FALSE.
ll.reg <- glm(default~student+balance+income,family="binomial",data=Default[-testid,])
ll.pred <- predict(ll.reg, newdata=Default[testid,], type='response') > 0.5
ll.accuracy = mean(ll.pred == y_test)
ll.accuracy
## [1] 0.95225
Our linear logistic regression model performs well on this validation set with accuracy around 95%.
Let us now try to fit a neural network.
x = model.matrix(default ~. -1, data=Default)
x_train <- x[-testid,]
g_train <- Default$default[-testid]=='Yes'
x_test <- x[testid,]
g_test <- Default$default[testid] == 'Yes'
modnn <- keras_model_sequential() %>%
layer_dense(units=10, activation='relu', input_shape=ncol(x)) %>%
layer_dropout(rate=0.4) %>%
layer_dense(units = 1, activation='sigmoid')
Now let’s compile our model.
modnn %>% compile(
optimizer=optimizer_rmsprop(),
loss='binary_crossentropy',
metrics='accuracy')
Now let’s fit our model to our data.
history <- modnn %>% fit(
x = x_train,
y = g_train,
epochs=30,
batch_size=128)
Let’s check our neural network accuracy.
nnpred <- predict(modnn, x_test) > 0.5
nn.accuracy <- mean(nnpred == g_test)
nn.accuracy
## [1] 0.965
As we can see, our neural network test set accuracy is about 96%. This is comparable to our linear logistic regression which had 95% accuracy.
library(ISLR2)
xdata <- data.matrix(NYSE[, c("DJ_return", "log_volume", "log_volatility")])
istrain <- NYSE[,"train"]
xdata <- scale(xdata)
For the NYSE data we will use the past five trading days to predict the next day’s trading volume. Hence, we use \(L=5\), and since \(T=6,051\) we can create 6,046 such \((X,Y)\) pairs.
We will write functions to create lagged versions of the three time series. We start with a function that takes as input a data matrix and a lag \(L\), and returns a lagged version of the matrix. It simply inserts \(L\) rows of NA at the top and truncates the bottom.
lagm <- function(x, k=1) {
n <- nrow(x)
pad <- matrix(NA, k, ncol(x))
rbind(pad, x[1:(n-k),])
}
arframe <- data.frame(log_volume=xdata[, "log_volume"],
L1 = lagm(xdata,1), L2=lagm(xdata,2),
L3 = lagm(xdata,3), L4=lagm(xdata,4),
L5 = lagm(xdata,5))
We can see that arframe has missing values in the lagged variables (by construction). We remove these rows and adjust istrain (our training index) accordingly.
arframe <- arframe[-(1:5), ]
istrain <- istrain[-(1:5)]
We now fit the linear AR model to the training data using lm() and predict on the test data.
arfit <- lm(log_volume ~., data=arframe[istrain, ])
arpred <- predict(arfit, arframe[!istrain, ])
V0 <- var(arframe[!istrain, "log_volume"])
1 - mean((arpred - arframe[!istrain, "log_volume"])^2)/V0
## [1] 0.413223
The last two lines compute the \(R^2\) on the test data, as defined in equation (3.17) of the textbook. In other words, our model explains about 41% of the variation seen in the test data.
We’ll add a factor variable month_number to our data because we have a hunch that month_number has a strong association with trading volume.
library(lubridate) #for extracting month from date object
string.dates <- NYSE[-c(1:5), "date"]
dates <- as.Date(string.dates)
months <- as.factor(month(dates))
arframe.month <- data.frame(months, arframe)
arfit.month <- lm(log_volume ~., data=arframe.month[istrain, ])
arpred.month <- predict(arfit.month, arframe.month[!istrain, ])
1 - mean((arpred.month - arframe[!istrain, "log_volume"])^2)/V0
## [1] 0.4170418
It appears that including this 12-level factor representation doesn’t significantly improve the performance of the model. Before, without the 12-level month factor, the \(R^2\) was equal to 0.413. Now, with the 12-level month factor, the \(R^2\) was equal to 0.417. This is hardly an improvement.
Now let's fit the linear AR model with keras. We first arrange the data as a sequence of \(L=5\) feature vectors \(X = \{X_\ell\}_{\ell=1}^{L}\), lagged versions of the time series going back \(L\) time points; a flatten layer will then unroll this array so a single dense unit can fit the linear AR model.
n <- nrow(arframe)
xrnn <- data.matrix(arframe[, -1])
xrnn <- array(xrnn, c(n, 3, 5))
xrnn <- xrnn[,, 5:1]
xrnn <- aperm(xrnn, c(1,3,2))
dim(xrnn)
## [1] 6046 5 3
model.rnn <- keras_model_sequential() %>%
layer_flatten(input_shape=c(5,3)) %>%
layer_dense(units=1)
Compile our model.
model.rnn %>% compile(loss='mse', optimizer=optimizer_rmsprop())
Fit our model to the data.
history <- model.rnn %>% fit(
xrnn[istrain,, ], arframe[istrain, "log_volume"], epochs=200,
batch_size = 64, validation_data =
list(xrnn[!istrain,, ], arframe[!istrain, "log_volume"])
)
Now let’s check our test set \(R^2\).
npred <- predict(model.rnn, xrnn[!istrain,,])
test.Rsquared <- 1 - mean((arframe[!istrain, "log_volume"] - npred)^2)/V0
test.Rsquared
## [1] 0.4122102
It appears that the test \(R^2\) for this keras-based linear AR model is essentially the same as that from the simple lm() call.
The advantage of using lm() is that the interpretation of the coefficients is easier to communicate. The disadvantage is that we assume a normality condition on the residuals for inference, which may or may not hold.
Let's reload the data in case we mutated anything.
library(ISLR2)
xdata <- data.matrix(NYSE[, c("DJ_return", "log_volume", "log_volatility")])
istrain <- NYSE[,"train"]
xdata <- scale(xdata)
We will need to create the AR dataframe and then add day of week as a one-hot vector (of length 5) for each observation.
arframe <- data.frame(log_volume=xdata[, "log_volume"],
L1 = lagm(xdata,1), L2=lagm(xdata,2),
L3 = lagm(xdata,3), L4=lagm(xdata,4),
L5 = lagm(xdata,5))
arframe <- arframe[-(1:5), ]
istrain <- istrain[-(1:5)]
day=NYSE[-(1:5), "day_of_week"]
arframed <- data.frame(day=NYSE[-(1:5), "day_of_week"], arframe)
x <- model.matrix(log_volume ~. -1, data=arframed)
#first five columns are the factor one-hot vector for day of week.
day.feature = x[,c(1:5)]
arframe.day <- data.frame(arframe$log_volume,
arframe[,c(2:4)], day.feature,
arframe[,c(5:7)], day.feature,
arframe[,c(8:10)], day.feature,
arframe[,c(11:13)], day.feature,
arframe[,c(14:16)], day.feature
)
Let’s set up our RNN data structure.
xrnn.day <- data.matrix(arframe.day[, -1])
xrnn.day <- array(xrnn.day, c(n, 8, 5))
xrnn.day <- xrnn.day[,, 5:1]
xrnn.day <- aperm(xrnn.day, c(1,3,2))
Now let's define and compile our model.
arnn.day <- keras_model_sequential() %>%
layer_simple_rnn(units = 12,
input_shape = list(5,8),
dropout = 0.1, recurrent_dropout = 0.1) %>%
layer_dense(units = 1)
arnn.day %>% compile(optimizer = optimizer_rmsprop(),
loss = 'mse')
Now let's fit our model.
history <- arnn.day %>% fit(
xrnn.day[istrain,,], arframe[istrain, "log_volume"],
batch_size = 64, epochs = 200,
validation_data =
list(xrnn.day[!istrain,,], arframe[!istrain, "log_volume"])
)
Let's recompute the test set \(R^2\) to see how this model does.
kpred <- predict(arnn.day, xrnn.day[!istrain,,])
1 - mean((kpred - arframe[!istrain, "log_volume"])^2)/V0
## [1] 0.4645618
As we can see, this new RNN model with day_of_week included explains about 5 percentage points more of the variation than the RNN model without day_of_week that the textbook fit on page 457.
In the IMDb document-classification example above we used a dictionary of size 10,000. Consider the effects of varying the dictionary size. Try the values
1,000
3,000
5,000
10,000
and compare the results.
Load the data and one_hot() encode the documents into feature vectors, which can then be joined into a feature matrix. Since we are dealing with a sparse matrix, we will use the Matrix library to create a sparseMatrix() object, saving considerable memory.
max_features <- 1000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb
library(Matrix)
one_hot <- function(sequences, dimension) {
seqlen <- sapply(sequences, length)
n <- length(seqlen)
rowind <- rep(1:n, seqlen)
colind <- unlist(sequences)
sparseMatrix(i = rowind, j=colind, dims = c(n, dimension))
}
x_train_1h <- one_hot(x_train, 1000)
x_test_1h <- one_hot(x_test, 1000)
dim(x_train_1h)
## [1] 25000 1000
We can see above that our x_train_1h is a matrix of size 25,000 x 1,000.
Let’s set up our training and validation sets. We’ll create a size 2,000 validation set, which leaves 23,000 documents to train on.
set.seed(3)
ival <- sample(seq(along=y_train), 2000)
Next we fit a fully-connected neural network with two hidden layers, each with 16 units and ReLU activation.
Define model and compile it.
model <- keras_model_sequential() %>%
layer_dense(units = 16, activation ='relu', input_shape=c(1000)) %>%
layer_dense(units=16, activation='relu') %>%
layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('accuracy'))
Fit to our data:
history <- model %>% fit(x_train_1h[-ival,], y_train[-ival], epochs=20, batch_size=512, validation_data = list(x_train_1h[ival, ], y_train[ival]))
Now we can check the accuracy of the model on the validation set.
predict_x=model %>% predict(x_train_1h[ival,])
predict_class_x <- as.numeric(predict_x > 0.5)
correct.accuracy.1000 <- mean(y_train[ival] == predict_class_x)
correct.accuracy.1000
## [1] 0.845
We see that our validation accuracy with a dictionary of size 1,000 is 0.845.
We repeat the steps above. Fewer comments will be made.
max_features <- 3000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb
x_train_1h <- one_hot(x_train, max_features)
x_test_1h <- one_hot(x_test, max_features)
model <- keras_model_sequential() %>%
layer_dense(units = 16, activation ='relu', input_shape=c(max_features)) %>%
layer_dense(units=16, activation='relu') %>%
layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('accuracy'))
history <- model %>% fit(x_train_1h[-ival,], y_train[-ival], epochs=20, batch_size=512, validation_data = list(x_train_1h[ival, ], y_train[ival]))
predict_x=model %>% predict(x_train_1h[ival,])
predict_class_x <- as.numeric(predict_x > 0.5)
correct.accuracy.3000 <- mean(y_train[ival] == predict_class_x)
correct.accuracy.3000
## [1] 0.8655
We see that our validation accuracy with a dictionary of size 3,000 is 0.8655.
max_features <- 5000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb
x_train_1h <- one_hot(x_train, max_features)
x_test_1h <- one_hot(x_test, max_features)
model <- keras_model_sequential() %>%
layer_dense(units = 16, activation ='relu', input_shape=c(max_features)) %>%
layer_dense(units=16, activation='relu') %>%
layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('accuracy'))
history <- model %>% fit(x_train_1h[-ival,], y_train[-ival], epochs=20, batch_size=512, validation_data = list(x_train_1h[ival, ], y_train[ival]))
predict_x=model %>% predict(x_train_1h[ival,])
predict_class_x <- as.numeric(predict_x > 0.5)
correct.accuracy.5000 <- mean(y_train[ival] == predict_class_x)
correct.accuracy.5000
## [1] 0.8635
We see that our validation accuracy with a dictionary of size 5,000 is 0.8635.
max_features <- 10000
imdb <- dataset_imdb(num_words=max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb
x_train_1h <- one_hot(x_train, max_features)
x_test_1h <- one_hot(x_test, max_features)
model <- keras_model_sequential() %>%
layer_dense(units = 16, activation ='relu', input_shape=c(max_features)) %>%
layer_dense(units=16, activation='relu') %>%
layer_dense(units=1, activation='sigmoid')
model %>% compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=c('accuracy'))
history <- model %>% fit(x_train_1h[-ival,], y_train[-ival], epochs=20, batch_size=512, validation_data = list(x_train_1h[ival, ], y_train[ival]))
predict_x=model %>% predict(x_train_1h[ival,])
predict_class_x <- as.numeric(predict_x > 0.5)
correct.accuracy.10000 <- mean(y_train[ival] == predict_class_x)
correct.accuracy.10000
## [1] 0.861
We see that our validation accuracy with a dictionary of size 10,000 is 0.861.
| Model | Validation Accuracy |
|---|---|
| Dictionary Size 1,000 | 0.8450 |
| Dictionary Size 3,000 | 0.8655 |
| Dictionary Size 5,000 | 0.8635 |
| Dictionary Size 10,000 | 0.8610 |
We can see that accuracy improves when going from a 1,000-word to a 3,000-word dictionary, but beyond that the larger dictionaries give essentially the same (in fact slightly lower) accuracy.