Sentiment Analysis with the IMDB dataset

Loading the IMDB dataset

To work with vector data of a manageable size, keep only the top 10,000 most frequently occurring words in the training data. Rare words will be discarded.

The variables train_data and test_data are lists of reviews; each review is a list of word indices (encoding a sequence of words). train_labels and test_labels are lists of 0s and 1s, where 0 stands for a negative review and 1 stands for a positive one.

Note: The multi-assignment operator %<-% (re-exported from the zeallot package) is automatically available whenever the R Keras package is loaded.

library(keras)

imdb <- dataset_imdb(num_words = 10000)
c(c(train_data, train_labels), c(test_data, test_labels)) %<-% imdb
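
If you prefer not to use %<-%, the same assignment can be written out explicitly; a minimal equivalent, assuming the nested list structure (train and test, each with x and y) returned by dataset_imdb():

# Equivalent to the destructuring assignment above
train_data <- imdb$train$x
train_labels <- imdb$train$y
test_data <- imdb$test$x
test_labels <- imdb$test$y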

Here’s how to decode a review back to English words.

# word_index is a named list mapping words to an integer index
word_index <- dataset_imdb_word_index()

# Reverses it, mapping integer indices to words
reverse_word_index <- names(word_index)                                    
names(reverse_word_index) <- word_index

# Decodes the 1st review. Note that the indices are offset by 3 because 0, 1, and 2 are reserved indices for "padding," "start of sequence," and "unknown."
decoded_review <- sapply(train_data[[1]], function(index) {                
  word <- if (index >= 3) reverse_word_index[[as.character(index - 3)]]
  if (!is.null(word)) word else "?"
})

paste(decoded_review, collapse = " ")
[1] "? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"

Preparing the data

Since you can’t feed lists of integers into a neural network, the data must first be turned into tensors: each list of word indices becomes a row (a 1D vector) of a matrix.

Each row of the matrix contains 10,000 columns, one for each possible word in a review. The reviews are one-hot encoded (also known as categorical encoding): a column holds a 1 if that word appears in the review and a 0 otherwise. Doing this for all 25,000 reviews produces a 25,000 × 10,000 matrix of 250,000,000 0s and 1s (about 2 GB of data), most of them 0s. The same is done for the test data, which has the same dimensions.

vectorize_sequences <- function(sequences, dimension = 10000) {
  # Initialize a matrix with all zeroes
  results <- matrix(0, nrow = length(sequences), ncol = dimension)
  # Set the entry to 1 for every word index that appears in the review
  for (i in seq_along(sequences))
    results[i, sequences[[i]]] <- 1
  results
}

x_train <- vectorize_sequences(train_data)
x_test <- vectorize_sequences(test_data)

str(x_train[1,])
 num [1:10000] 1 1 0 1 1 1 1 1 1 0 ...
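
As a rough check of the 2 GB estimate above (25,000 reviews × 10,000 columns, stored as double-precision numbers), you can inspect the dimensions and in-memory size of the matrix directly:

dim(x_train)
format(object.size(x_train), units = "auto")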

Also convert labels from integer to numeric.

y_train <- as.numeric(train_labels)
y_test <- as.numeric(test_labels)

Now the data is ready to be fed into a neural network.

Building the network

A type of network that performs well on this kind of vector data is a simple stack of fully connected (dense) layers with relu[1] activations.

There are two key architecture decisions to be made about such a stack of dense layers:

  1. How many layers to use
  2. How many hidden units to choose for each layer

Having more hidden units (a higher-dimensional representation space) allows your network to learn more-complex representations, but it makes the network more computationally expensive and may lead to learning unwanted patterns (patterns that will improve performance on the training data but not on the test data).

The model chosen has two intermediate dense layers with 16 hidden units each and a third layer with a sigmoid activation[2] that outputs the scalar prediction for the sentiment of the current review.

model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")
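
Printing the model confirms the layer shapes and parameter counts: the first layer alone has 10,000 × 16 weights plus 16 biases = 160,016 parameters, the second 16 × 16 + 16 = 272, and the output layer 16 + 1 = 17.

summary(model)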

Compiling the model

Because this is a binary classification problem and the output of the network is a probability, it’s best to use the binary_crossentropy loss.[3]

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

Optimizer parameters can be configured by passing an optimizer instance as the optimizer argument.

model %>% compile(
  optimizer = optimizer_rmsprop(learning_rate=0.001),
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

Custom loss or metric functions can also be passed via the respective loss and metrics arguments.

model %>% compile(
  optimizer = optimizer_rmsprop(learning_rate = 0.001),
  loss = loss_binary_crossentropy,
  metrics = metric_binary_accuracy
)
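
Going one step further, a fully custom metric can be defined as well. A minimal sketch using keras's custom_metric() and the backend helper k_mean(); the metric here just reports the mean predicted probability per batch and is purely illustrative:

# Illustrative custom metric: average predicted probability in each batch
metric_mean_pred <- custom_metric("mean_pred", function(y_true, y_pred) {
  k_mean(y_pred)
})

model %>% compile(
  optimizer = optimizer_rmsprop(learning_rate = 0.001),
  loss = "binary_crossentropy",
  metrics = list("accuracy", metric_mean_pred)
)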

Validation step

Create a validation set by setting apart 10,000 samples from the original training data.

val_indices <- 1:10000

x_val <- x_train[val_indices,]
partial_x_train <- x_train[-val_indices,]
y_val <- y_train[val_indices]
partial_y_train <- y_train[-val_indices]
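
Taking the first 10,000 rows works because the loader returns the reviews in shuffled order; if you'd rather draw a random hold-out set instead, a minimal variant (set.seed() only makes the split reproducible):

set.seed(42)
val_indices <- sample(nrow(x_train), 10000)

x_val <- x_train[val_indices, ]
partial_x_train <- x_train[-val_indices, ]
y_val <- y_train[val_indices]
partial_y_train <- y_train[-val_indices]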

Training the model

Train the model for 20 epochs (20 iterations over all samples in the partial_x_train and partial_y_train tensors), in mini-batches of 512 samples. At the same time, monitor loss and accuracy on the 10,000 samples that were set apart in the validation step.

history <- model %>% fit(
  partial_x_train,
  partial_y_train,
  epochs = 20,
  batch_size = 512,
  validation_data = list(x_val, y_val)
)
Epoch 1/20
30/30 [==============================] - 1s 16ms/step - loss: 0.5095 - accuracy: 0.7856 - val_loss: 0.3816 - val_accuracy: 0.8704
Epoch 2/20
30/30 [==============================] - 0s 9ms/step - loss: 0.3030 - accuracy: 0.9013 - val_loss: 0.3207 - val_accuracy: 0.8752
Epoch 3/20
30/30 [==============================] - 0s 8ms/step - loss: 0.2206 - accuracy: 0.9269 - val_loss: 0.2839 - val_accuracy: 0.8879
Epoch 4/20
30/30 [==============================] - 0s 8ms/step - loss: 0.1735 - accuracy: 0.9431 - val_loss: 0.2745 - val_accuracy: 0.8916
Epoch 5/20
30/30 [==============================] - 0s 7ms/step - loss: 0.1423 - accuracy: 0.9547 - val_loss: 0.2871 - val_accuracy: 0.8875
Epoch 6/20
30/30 [==============================] - 0s 7ms/step - loss: 0.1160 - accuracy: 0.9652 - val_loss: 0.2940 - val_accuracy: 0.8863
Epoch 7/20
30/30 [==============================] - 0s 9ms/step - loss: 0.0973 - accuracy: 0.9703 - val_loss: 0.3214 - val_accuracy: 0.8787
Epoch 8/20
30/30 [==============================] - 0s 10ms/step - loss: 0.0777 - accuracy: 0.9789 - val_loss: 0.3349 - val_accuracy: 0.8802
Epoch 9/20
30/30 [==============================] - 0s 8ms/step - loss: 0.0634 - accuracy: 0.9829 - val_loss: 0.3796 - val_accuracy: 0.8692
Epoch 10/20
30/30 [==============================] - 0s 8ms/step - loss: 0.0525 - accuracy: 0.9863 - val_loss: 0.3840 - val_accuracy: 0.8751
Epoch 11/20
30/30 [==============================] - 0s 8ms/step - loss: 0.0416 - accuracy: 0.9904 - val_loss: 0.4203 - val_accuracy: 0.8758
Epoch 12/20
30/30 [==============================] - 0s 8ms/step - loss: 0.0343 - accuracy: 0.9931 - val_loss: 0.4352 - val_accuracy: 0.8755
Epoch 13/20
30/30 [==============================] - 0s 7ms/step - loss: 0.0262 - accuracy: 0.9945 - val_loss: 0.4682 - val_accuracy: 0.8749
Epoch 14/20
30/30 [==============================] - 0s 8ms/step - loss: 0.0205 - accuracy: 0.9960 - val_loss: 0.5022 - val_accuracy: 0.8714
Epoch 15/20
30/30 [==============================] - 0s 8ms/step - loss: 0.0128 - accuracy: 0.9990 - val_loss: 0.5363 - val_accuracy: 0.8683
Epoch 16/20
30/30 [==============================] - 0s 8ms/step - loss: 0.0117 - accuracy: 0.9989 - val_loss: 0.5709 - val_accuracy: 0.8701
Epoch 17/20
30/30 [==============================] - 0s 8ms/step - loss: 0.0087 - accuracy: 0.9991 - val_loss: 0.6230 - val_accuracy: 0.8637
Epoch 18/20
30/30 [==============================] - 0s 8ms/step - loss: 0.0087 - accuracy: 0.9987 - val_loss: 0.6395 - val_accuracy: 0.8681
Epoch 19/20
30/30 [==============================] - 0s 8ms/step - loss: 0.0039 - accuracy: 0.9998 - val_loss: 0.7504 - val_accuracy: 0.8534
Epoch 20/20
30/30 [==============================] - 0s 7ms/step - loss: 0.0063 - accuracy: 0.9989 - val_loss: 0.7101 - val_accuracy: 0.8658
str(history)
List of 2
 $ params :List of 3
  ..$ verbose: int 1
  ..$ epochs : int 20
  ..$ steps  : int 30
 $ metrics:List of 4
  ..$ loss        : num [1:20] 0.51 0.303 0.221 0.174 0.142 ...
  ..$ accuracy    : num [1:20] 0.786 0.901 0.927 0.943 0.955 ...
  ..$ val_loss    : num [1:20] 0.382 0.321 0.284 0.274 0.287 ...
  ..$ val_accuracy: num [1:20] 0.87 0.875 0.888 0.892 0.887 ...
 - attr(*, "class")= chr "keras_training_history"
plot(history)
`geom_smooth()` using formula 'y ~ x'
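
The turning point can also be read straight off the history object (its metrics list is shown by str(history) above):

# Epoch with the lowest validation loss and the highest validation accuracy
which.min(history$metrics$val_loss)
which.max(history$metrics$val_accuracy)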

Validation loss and accuracy show evidence of overfitting after about the 5th epoch, where validation accuracy peaks and validation loss bottoms out. To mitigate this, we can retrain the model from scratch for only 5 epochs.

model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

model %>% fit(x_train, y_train, epochs = 5, batch_size = 512)
Epoch 1/5
49/49 [==============================] - 1s 8ms/step - loss: 0.4730 - accuracy: 0.8116
Epoch 2/5
49/49 [==============================] - 0s 7ms/step - loss: 0.2750 - accuracy: 0.9090
Epoch 3/5
49/49 [==============================] - 0s 6ms/step - loss: 0.2091 - accuracy: 0.9275
Epoch 4/5
49/49 [==============================] - 0s 5ms/step - loss: 0.1733 - accuracy: 0.9412
Epoch 5/5
49/49 [==============================] - 0s 5ms/step - loss: 0.1511 - accuracy: 0.9476
results <- model %>% evaluate(x_test, y_test)

782/782 [==============================] - 1s 604us/step - loss: 0.3191 - accuracy: 0.8763
results
     loss  accuracy 
0.3190779 0.8762800 

This fairly naive approach achieves a test accuracy of about 88%. State-of-the-art approaches can reach about 95%.
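
Rather than hard-coding the number of epochs, another option (not used above) is to stop training automatically when the validation loss stops improving, using callback_early_stopping(). A sketch, assuming a freshly built and compiled model and the validation split from earlier; the patience setting is illustrative:

model %>% fit(
  partial_x_train,
  partial_y_train,
  epochs = 20,
  batch_size = 512,
  validation_data = list(x_val, y_val),
  callbacks = list(
    # Stop once val_loss has not improved for 2 consecutive epochs,
    # and keep the weights from the best epoch
    callback_early_stopping(monitor = "val_loss", patience = 2,
                            restore_best_weights = TRUE)
  )
)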

Predicting on new data

Use the predict method directly. Each value is the predicted probability that the corresponding review is positive.

model %>% predict(x_test[1:10,])
             [,1]
 [1,] 0.117030084
 [2,] 0.996890545
 [3,] 0.652673364
 [4,] 0.769265294
 [5,] 0.925853550
 [6,] 0.658951819
 [7,] 0.999153733
 [8,] 0.002696246
 [9,] 0.933195472
[10,] 0.988951802
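
To turn these probabilities into hard class labels, threshold at 0.5:

preds <- model %>% predict(x_test[1:10, ])
as.integer(preds > 0.5)  # 1 = predicted positive, 0 = predicted negative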

Other training experiments to try

  - Use one or three hidden layers, and see how doing so affects validation and test accuracy (a sketch of this variation follows the list).
  - Use layers with more or fewer hidden units: 32 units, 64 units, and so on.
  - Use the mse loss function instead of binary_crossentropy.
  - Use the tanh activation (popular in the early days of neural nets) instead of relu.
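
For example, the first experiment is a small change to the model definition; a sketch (model_1layer is just an illustrative name):

model_1layer <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 1, activation = "sigmoid")

model_1layer %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

model_1layer %>% fit(
  partial_x_train, partial_y_train,
  epochs = 5,
  batch_size = 512,
  validation_data = list(x_val, y_val)
)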

Takeaways

  - You usually need to do quite a bit of preprocessing on your raw data in order to be able to feed it, as tensors, into a neural network. Sequences of words can be encoded as binary vectors, but there are other encoding options, too.
  - Stacks of dense layers with relu activations can solve a wide range of problems (including sentiment classification), and you'll likely use them frequently.
  - In a binary classification problem (two output classes), your network should end with a dense layer with one unit and a sigmoid activation: the output of your network should be a scalar between 0 and 1, encoding a probability.
  - With such a scalar sigmoid output on a binary classification problem, the loss function you should use is binary_crossentropy.
  - The rmsprop optimizer is generally a good enough choice, whatever your problem. That's one less thing for you to worry about.
  - As they get better on their training data, neural networks eventually start overfitting and end up obtaining increasingly worse results on data they've never seen before. Be sure to always monitor performance on data that is outside of the training set.


  [1] rectified linear unit: zeroes out negative values

  [2] The sigmoid activation function “squashes” arbitrary values into the [0, 1] interval, outputting something that can be interpreted as a probability.

  [3] Crossentropy is a quantity from the field of Information Theory that measures the distance between probability distributions or, in this case, between the ground-truth distribution and your predictions.
