Info about the activity

Objectives

By the end of this activity, students should be able to

  1. Train an LSTM with GloVe word vectors for text classification in R torch.

It is important to understand that we will not be able to achieve decent accuracy in text classification today, simply because we do not have enough computing power. The main takeaway is that, although deep learning models are very powerful, their power comes at a cost: they need a lot of data and a lot of effort to outperform models like random forests (as well as AdaBoost, gradient boosting, support vector machines, etc.).

Mode

Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. There will be questions that either require a short answer (type your answer right in this document) or ask you to modify R code (modify the code right here). In either case, you can discuss your work with the instructor.

Data

We will work with BBC news data.

Source: https://www.kaggle.com/c/learn-ai-bbc

Loading data into R

library(tidyverse)
library(torch)
library(text2vec) # for GloVe

clean_news <- function(x) {
  x %>% 
    tolower %>%                  # convert to lower case
    gsub('[^a-z0-9]', ' ', . )   # replace everything except letters and digits with spaces
}
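
As a quick illustration (the string below is made up just for this example), the function lower-cases the text and turns every character that is not a lowercase letter or a digit into a space:

clean_news("Breaking: Oil prices UP 3%!")   # gives "breaking  oil prices up 3  "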


n <- read_csv("BBC News Train.csv") %>%
  mutate(Category = as.factor(Category)) %>%
  mutate(clean_text = clean_news(Text))

n

Training GloVe

Here, we will quickly train a simple 50-dimensional GloVe model ourselves to save time. The proper way is to download pretrained GloVe vectors and load them into R. Here is the link for your reference: https://nlp.stanford.edu/projects/glove/

dim_word_emb <- 50
skip_gram_window_size <- 5L

tokens <- space_tokenizer(n$clean_text)
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5) # keep words that appear at least 5 times

# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab)
# use window of 5 for context words
tcm <- create_tcm(it, vectorizer, 
                  skip_grams_window = skip_gram_window_size)

glove = GlobalVectors$new(rank = dim_word_emb, x_max = 10)

wv_main = glove$fit_transform(tcm, n_iter = 50, convergence_tol = 0.01)

wv_context = glove$components
word_vectors = wv_main + t(wv_context)

dim(word_vectors)
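
As a quick sanity check of the embeddings, we can look at nearest neighbours by cosine similarity using sim2 from text2vec (a small optional check; it assumes the word "election" survived the vocabulary pruning):

election_vec <- word_vectors["election", , drop = FALSE]
cos_sim <- sim2(word_vectors, election_vec, method = "cosine", norm = "l2")
head(sort(cos_sim[, 1], decreasing = TRUE), 10)   # ten closest words; "election" itself comes first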

Understanding LSTM

Here, we will create a randomly initialized LSTM block and test it on a random input. The objective is to understand what goes into it and what comes out of it. We will create an LSTM layer that transforms a sequence of 5-dimensional vectors into a sequence of 4-dimensional vectors:

rnn <- nn_lstm(5, 4)
rnn
## An `nn_module` containing 176 parameters.
## 
## -- Parameters ------------------------------------------------------------------
## * weight_ih_l1: Float [1:16, 1:5]
## * weight_hh_l1: Float [1:16, 1:4]
## * bias_ih_l1: Float [1:16]
## * bias_hh_l1: Float [1:16]

The total number of parameters can be found as follows: an LSTM has four gates, and each gate has an input-to-hidden weight matrix (4 by 5) with a bias of length 4, giving (5+1)*4 parameters, plus a hidden-to-hidden weight matrix (4 by 4) with a bias of length 4, giving (4+1)*4 parameters:

((5+1)*4+(4+1)*4)*4
## [1] 176
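
As a cross-check, we can add up the number of entries in every parameter tensor of the module:

sum(sapply(rnn$parameters, function(p) prod(dim(p))))   # should again give 176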

Now we will create a random input: it will contain 3 sequences, each consisting of 6 vectors of dimension 5.

input <- torch_randn(3, 6, 5)
input
## torch_tensor
## (1,.,.) = 
##   0.5708 -0.0675  0.6794 -0.1003 -0.1650
##  -1.0021 -1.1822 -0.5198  1.6388  1.9145
##   0.5766 -1.4728  0.5142 -0.2676 -1.2059
##   1.2317  0.5425 -0.2121  0.8429  0.7004
##   0.7021  1.6605  0.4310 -0.5378 -0.4442
##  -0.2120 -1.9608  1.7355 -1.4333 -0.1524
## 
## (2,.,.) = 
##   1.2052 -0.2871  0.0893  0.3656 -1.0609
##  -0.5559 -0.1428 -0.6418 -0.7515  0.0681
##  -0.5482  1.0624 -1.1676 -1.3781  0.6400
##  -1.3292  0.1798  1.9925  0.3894 -0.7269
##  -0.8455  1.2165 -0.1595  0.6082 -0.4818
##   1.5097 -0.3973  0.0573  0.4306 -0.2526
## 
## (3,.,.) = 
##   0.6288  1.2930  0.8281 -0.0820  1.3557
##   0.6408 -0.5747  0.5410  0.5686 -1.4462
##  -0.8214  0.7221 -2.7238  1.7821  0.1085
##   1.6837 -0.9538 -0.3798  1.4186 -0.6179
##  -1.2391  0.6689  1.0110  0.2489 -0.8238
##  -0.2121  0.4577 -1.5570  0.3236  0.4115
## [ CPUFloatType{3,6,5} ]

Now we will feed it into our LSTM module:

output <- rnn(input)
output
## [[1]]
## torch_tensor
## (1,.,.) = 
##  -0.0267  0.1548  0.0664 -0.0810
##  -0.3578  0.0267  0.0010  0.1151
##  -0.0527  0.1902 -0.0659 -0.3106
##   0.0521  0.1129  0.0677 -0.0654
##   0.1474  0.2417  0.1131 -0.0296
##  -0.0629 -0.0110 -0.2439  0.0233
## 
## (2,.,.) = 
##  -0.0038  0.2504  0.1324 -0.2343
##  -0.2271  0.2443 -0.0450  0.0763
##  -0.0271  0.2785 -0.0537 -0.0761
##  -0.1992  0.0138  0.3402  0.1626
##  -0.0516  0.2006  0.2269  0.0130
##  -0.0449  0.1732 -0.0442 -0.1972
## 
## (3,.,.) = 
##  -0.0620  0.1866  0.1201 -0.0135
##  -0.2388  0.2972  0.0964 -0.1592
##  -0.1973  0.0757  0.0738 -0.0893
##  -0.2900  0.0976  0.1753 -0.1522
##  -0.2303  0.1583  0.3700  0.1151
##  -0.1034  0.1579  0.0159 -0.1192
## [ CPUFloatType{3,6,4} ]
## 
## [[2]]
## [[2]][[1]]
## torch_tensor
## (1,.,.) = 
##  -0.0620  0.1866  0.1201 -0.0135
##  -0.2388  0.2972  0.0964 -0.1592
##  -0.1973  0.0757  0.0738 -0.0893
##  -0.2900  0.0976  0.1753 -0.1522
##  -0.2303  0.1583  0.3700  0.1151
##  -0.1034  0.1579  0.0159 -0.1192
## [ CPUFloatType{1,6,4} ]
## 
## [[2]][[2]]
## torch_tensor
## (1,.,.) = 
##  -0.0981  0.4019  0.4302 -0.0403
##  -0.3845  0.6177  0.1547 -0.2277
##  -0.2249  0.7197  0.1873 -0.1490
##  -0.3876  0.3958  0.4601 -0.2272
##  -0.4252  0.2633  0.5725  0.1800
##  -0.1365  0.6681  0.0393 -0.2418
## [ CPUFloatType{1,6,4} ]

Note that the output is a list of length 2. Its first item contains the same number of sequences, each with the same number of vectors, but they have dimension 4 rather than 5 - this is because our LSTM transforms 5-dimensional vectors to 4-dimensional vectors.

dim(output[[1]])
## [1] 3 6 4

In addition, the second element of that list itself contains two tensors. Both have the same dimensions: 1, by the number of vectors in each sequence, by the dimension of the output vectors.

dim(output[[2]][[1]])
## [1] 1 6 4
dim(output[[2]][[2]])
## [1] 1 6 4

These are so-called “hidden states”. If you are interested, look for materials to better understand recurrent neural networks.

Further, we need to extract the final vector (i.e., the 6th one) of every input sequence from the model's output. This is because it is the final output of the LSTM layer that gets passed on to the next layer.

This means that we need just a set of 3 vectors of dimension 4. This is how we can do it:

extract_lstm_final_output <- function(output) {
  n <- dim(output[[1]])[2]   # number of vectors in each sequence
  output[[1]][ , n , ]       # keep only the final (n-th) vector of each sequence
}

extract_lstm_final_output(output)
## torch_tensor
## -0.0629 -0.0110 -0.2439  0.0233
## -0.0449  0.1732 -0.0442 -0.1972
## -0.1034  0.1579  0.0159 -0.1192
## [ CPUFloatType{3,4} ]

Preparing data for LSTM

Removing words that are not in the vocabulary

Here, we will start converting sequences of words into sequences of word vectors. Only words that appear in our vocabulary can be converted, so we will first write a function whose input is a character vector, thought of as a sequence of words, and whose output is the same sequence with the words that are not in the vocabulary removed.

keep_words_in_vocab <- function(x) {
  # x is a sequence of words
  # the output is the same sequence with words not in the vocabulary removed
  x[x %in% rownames(word_vectors)]
}

c("mfe", "is", "the", "best") %>% keep_words_in_vocab()
## [1] "is"   "the"  "best"

Question 1

Write a function whose input is a sequence of words and whose output is a matrix with rows representing word vectors for words in the sequence that actually appear in our vocabulary.

doc_to_vector_sequence <- function(x) {
  # x is a sequence of words
  # the output is the corresponding sequence of word vectors
  word_vectors[keep_words_in_vocab(x) , ]
}

c("mfe", "is", "the", "best") %>%
  doc_to_vector_sequence()
##             [,1]        [,2]        [,3]       [,4]        [,5]      [,6]
## is    0.06992671  0.51875145 -0.33758363 -0.4103236  0.03910815 0.6007398
## the  -0.59600627  0.49821806 -0.32427391 -1.0793698  0.05068699 0.7133359
## best -0.37386291 -0.04957605  0.06703747 -1.0576419 -1.25389786 0.3838965
##             [,7]      [,8]       [,9]     [,10]    [,11]      [,12]      [,13]
## is   -0.21143221 0.9243905 -0.4870051 -1.079826 1.336915 -0.3990836 -0.8951306
## the   0.23297595 0.4041999 -1.3272980 -0.458140 1.268350  0.4113141 -1.8347583
## best -0.03976561 0.3630931  0.7298949  0.325497 1.550214 -0.2238901 -1.7015079
##           [,14]      [,15]      [,16]      [,17]      [,18]      [,19]
## is   -1.2492843 0.72721587 -0.2973046  0.8117945 -0.2134237 0.49669386
## the  -0.6900967 0.02307693 -1.2547817  0.1673617 -1.1491398 0.91756592
## best -0.1755150 0.57114497 -0.4800561 -0.1375125 -0.3495292 0.09675292
##            [,20]     [,21]       [,22]       [,23]      [,24]      [,25]
## is   -0.90045044 0.5676096 -0.88922358 -0.38198146 -0.3704513 -0.4290502
## the  -0.07696151 0.9193126 -0.08831784  1.47746937 -0.2047517 -0.7999161
## best  0.37564961 0.5044329 -0.14002333 -0.07614849 -0.3090218 -0.8909856
##          [,26]       [,27]      [,28]      [,29]       [,30]      [,31]
## is   1.5942293 -0.01428077 -0.7754776  0.9408662 -1.03019985  0.5509425
## the  1.4017431  0.67858374 -0.3197300  1.1430419 -0.01979674  0.2081596
## best 0.2289432  0.39565444 -0.2616278 -0.4760076  0.53749199 -0.4135999
##           [,32]     [,33]     [,34]      [,35]      [,36]      [,37]     [,38]
## is    0.1236672 0.5227013 0.3853909 -1.0148681 -0.2481313  0.3035956 0.3436571
## the  -0.1160918 0.8654789 1.1753424  0.4266378 -0.8978530 -0.2385122 1.2387519
## best -0.6905683 1.0814938 0.2986401  0.4564447 -0.6388204 -0.2823120 0.5101160
##           [,39]      [,40]      [,41]       [,42]      [,43]     [,44]
## is   -0.5458721 -1.2552614 -0.7716221 -0.65269960 0.31759697 0.2112984
## the  -0.4550541 -0.5527503 -0.1552242 -0.28129001 0.93415223 0.9722938
## best -1.1793912 -0.7030975  0.4699766  0.02527979 0.03643233 0.9087853
##           [,45]      [,46]      [,47]       [,48]      [,49]      [,50]
## is   -0.6138701 0.02847755 -0.4648214  0.08526896 -0.5574806  0.3227411
## the  -0.6405497 0.23991516 -0.3027951  0.48339889 -0.9435089 -0.2148640
## best -0.2978892 0.23444276  0.1377534 -1.71746045  1.4331844 -0.6259600

Tokenized articles

To construct sequences of word vectors, we need to convert our documents (news articles) into sequences of words (tokens). This has already been done in preparation for training GloVe.

For example, here is article 574:

n$Text[574]
## [1] "blunkett hints at election call ex-home secretary david blunkett has given fresh clues that the general election will be announced on monday.  he told bbc radio five live:  i m out in my constituency getting ready for what we presume will be an announcement very shortly at the weekend.  he clarified that he meant he would be in his sheffield seat this weekend  not that he expected an election call then. tony blair is tipped to ask the queen on monday to dissolve parliament ready for a 5 may poll."

And here is the tokenized version (sequence of words):

tokens[574]
## [[1]]
##   [1] "blunkett"     "hints"        "at"           "election"     "call"        
##   [6] "ex"           "home"         "secretary"    "david"        "blunkett"    
##  [11] "has"          "given"        "fresh"        "clues"        "that"        
##  [16] "the"          "general"      "election"     "will"         "be"          
##  [21] "announced"    "on"           "monday"       ""             ""            
##  [26] "he"           "told"         "bbc"          "radio"        "five"        
##  [31] "live"         ""             ""             "i"            "m"           
##  [36] "out"          "in"           "my"           "constituency" "getting"     
##  [41] "ready"        "for"          "what"         "we"           "presume"     
##  [46] "will"         "be"           "an"           "announcement" "very"        
##  [51] "shortly"      "at"           "the"          "weekend"      ""            
##  [56] ""             "he"           "clarified"    "that"         "he"          
##  [61] "meant"        "he"           "would"        "be"           "in"          
##  [66] "his"          "sheffield"    "seat"         "this"         "weekend"     
##  [71] ""             "not"          "that"         "he"           "expected"    
##  [76] "an"           "election"     "call"         "then"         ""            
##  [81] "tony"         "blair"        "is"           "tipped"       "to"          
##  [86] "ask"          "the"          "queen"        "on"           "monday"      
##  [91] "to"           "dissolve"     "parliament"   "ready"        "for"         
##  [96] "a"            "5"            "may"          "poll"         ""

Question 2

What is the length of the longest word sequence that we have after removing the words that are not in the vocabulary?

tokens %>% lapply(keep_words_in_vocab) %>% sapply(length) %>% max
## [1] 3223
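
Before deciding how to equalize these lengths (next section), it can also help to look at the whole distribution of lengths rather than just the maximum (a small optional check):

doc_lengths <- tokens %>% lapply(keep_words_in_vocab) %>% sapply(length)
summary(doc_lengths)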

Getting sequences of equal length

Note that the word sequences have different lengths. To facilitate training of a recurrent neural network, we need to make all of these lengths the same. In practice, one should pad with zero vectors every sequence whose length is less than the maximum. In Torch, there is a special function for that - nn_utils_rnn_pad_sequence.
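
For reference, a padding-based version could look roughly like the sketch below (not run here; it reuses doc_to_vector_sequence from Question 1 and assumes it returns a matrix of word vectors for every document):

# Not run: pad every document with zero vectors up to the maximal length
padded <- tokens %>%
  lapply(doc_to_vector_sequence) %>%
  lapply(torch_tensor) %>%
  nn_utils_rnn_pad_sequence(batch_first = TRUE)   # result: documents x maximal length x 50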

However, in our problem, this maximal length is just too large, and training would be extremely slow if we actually did that. So, instead, we will trim all the word sequences to some small number, say, 80. This means that we will try to predict the label of an article from its first 80 words (which is, of course, harder than predicting the label from the entire article).

Below we do it:

word_sequences <- tokens %>% 
  lapply(doc_to_vector_sequence) %>%                     # convert each document to a matrix of word vectors
  sapply(function(A) A[1:80 , ], simplify = 'array') %>% # keep only the first 80 word vectors
  aperm(c(3, 1, 2))                                      # rearrange dimensions to (document, position, embedding)

Note that the result of this operation is an array. Its dimensions are

dim(word_sequences)
## [1] 1490   80   50

This means that it contains 1490 sequences, each consisting of 80 vectors of dimension 50.

Response variable

The original response variable is categorical:

head(n$Category)
## [1] business business business tech     business politics
## Levels: business entertainment politics sport tech

To feed the data into a deep learning model, we will convert it to a numeric vector whose values correspond to the possible categories. This is how it is done:

y_all <- as.numeric(n$Category) 
head(y_all)
## [1] 1 1 1 5 1 3

To retrieve the labels, we do the following:

attributes(n$Category)$levels[y_all] %>% head
## [1] "business" "business" "business" "tech"     "business" "politics"

Training and test sets

And now we will split our data into training and test sets.

set.seed(42)
N <- dim(word_sequences)[1]
train_ind <- sample(1:N, size = floor(0.7 * N))
test_ind <- setdiff(1:N, train_ind)

x_train = word_sequences[train_ind , , ] %>%
  torch_tensor(dtype = torch_float())
  
y_train = y_all[train_ind] %>%
  torch_tensor(dtype = torch_long())

x_test = word_sequences[test_ind , , ] %>%
  torch_tensor(dtype = torch_float())

y_test = y_all[test_ind] %>%
  torch_tensor(dtype = torch_long())

cat("Training set dimensions =", dim(x_train), "\n")
## Training set dimensions = 1043 80 50
cat("Test set dimensions =", dim(x_test), "\n")
## Test set dimensions = 447 80 50
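
Since the split is random, it is worth a quick check that all five classes show up in both parts (as_array converts a torch tensor back to a plain R vector):

table(as_array(y_train))
table(as_array(y_test))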

Training LSTM

Defining the model

With an LSTM, we cannot simply use the nn_sequential container like we did for a fully connected network. The reason is that the sequence of computations involves extracting only part of the LSTM output, and that operation is not (yet) available as a layer in the torch library. Thus, we have to define our own custom module. This is how it is done:

net <- nn_module(
 initialize = function() {
   self$lstm_layer <- nn_lstm(dim(x_train)[3], 16)   # word-vector dimension (50) -> 16 hidden units
   self$fully_connected <- nn_linear(16, 5)          # 16 hidden units -> 5 class scores
 },
 forward = function(input) {
   input %>% 
     self$lstm_layer() %>%
     extract_lstm_final_output() %>%   # keep only the final LSTM output of each sequence
     self$fully_connected()  %>%
     nnf_softmax(2)                    # convert class scores to probabilities (along dimension 2)
 }
)

model <- net()
model
## An `nn_module` containing 4,437 parameters.
## 
## -- Modules ---------------------------------------------------------------------
## * lstm_layer: <nn_lstm> #4,352 parameters
## * fully_connected: <nn_linear> #85 parameters

Training settings

Here, we choose an optimizer, the number of epochs, and the loss function:

optimizer <- optim_adam(model$parameters)
epochs <- 100
loss_function <- nnf_cross_entropy

Model validation

The output of our model is a matrix with one row per document and one column per category:

x_train %>%
  model %>%
  head
## torch_tensor
##  0.2132  0.1690  0.2436  0.1716  0.2026
##  0.2050  0.1619  0.2476  0.1758  0.2096
##  0.2128  0.1507  0.2418  0.1814  0.2133
##  0.2007  0.1668  0.2421  0.1781  0.2123
##  0.2057  0.1648  0.2380  0.1825  0.2090
##  0.2053  0.1431  0.2736  0.1676  0.2104
## [ CPUFloatType{6,5} ]

Every row here is a probability distribution.

To validate our model, we need to convert the raw output to predicted labels.

predicted_classes <- function(pred_prob) {
  pred_prob %>% apply(1, which.max)   # for every row, the index of the largest probability
}

x_train %>% model %>% predicted_classes %>% head
## [1] 3 3 3 3 3 3

And we need to calculate accuracy. This is going to be just the fraction of labels predicted correctly.

acc <- function(x, y) {
  (x == y) %>% as.numeric %>% mean
}

x_test %>% model %>% predicted_classes %>% acc(y_test)
## [1] 0.2214765
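
Beyond a single accuracy number, a confusion matrix shows which categories the (so far untrained) model confuses (a small optional check):

table(predicted = x_test %>% model %>% predicted_classes,
      actual = as_array(y_test))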

And here is our validation function:

validation <- function(x_train, y_train, x_test, y_test, epoch_no = 0) {
  pred_train <- model(x_train) 
  pred_test <- model(x_test)
  train_loss <- loss_function(pred_train, y_train) %>% as.numeric
  test_loss <- loss_function(pred_test, y_test) %>% as.numeric

  train_acc <- pred_train %>% predicted_classes %>% acc(y_train)
  test_acc <- pred_test %>% predicted_classes %>% acc(y_test)
  data.frame(value =  c(train_loss, test_loss, train_acc, test_acc),
             type = c("loss", "loss", "accuracy", "accuracy"),
             dataset = c("train", "test", "train", "test"),
             epoch = epoch_no)
}

val_df <- validation(x_train, y_train, x_test, y_test)
val_df

Gradient descent

Now we are ready to run gradient descent.

val_df <- validation(x_train, y_train, x_test, y_test)
for(i in 1:epochs){
  optimizer$zero_grad()

  y_pred <- model(x_train)
  loss <- loss_function(y_pred, y_train)
  loss$backward() # this is gradient calculation
  optimizer$step() # this is one step of optimized gradient descent

  # every 5 epochs, calculate and save the metrics
  if(i %% 5 == 0){
    cat("Epoch ", i, "\n")
    val_df <- validation(x_train, y_train, x_test, y_test, i) %>%
      rbind(val_df)
  }
}
## Epoch  5 
## Epoch  10 
## Epoch  15 
## Epoch  20 
## Epoch  25 
## Epoch  30 
## Epoch  35 
## Epoch  40 
## Epoch  45 
## Epoch  50 
## Epoch  55 
## Epoch  60 
## Epoch  65 
## Epoch  70 
## Epoch  75 
## Epoch  80 
## Epoch  85 
## Epoch  90 
## Epoch  95 
## Epoch  100
val_df %>% 
  ggplot(aes(x = epoch, y = value, group = dataset, color = dataset)) +
  facet_grid(. ~type) +
  geom_line()
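
To read off the final numbers instead of eyeballing the plot, we can filter the validation data frame for the last recorded epoch:

val_df %>% filter(epoch == max(epoch))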

Question 3 (open-ended)

Try to introduce some form of regularization into the model and see whether it prevents overfitting. Play with the optimizer. You can even try mini-batch gradient descent, but it is harder to implement - you will need to read some tutorials. One possible starting point is sketched below.
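
One possible starting point (just a sketch, not a definitive recipe): add a dropout layer between the LSTM output and the linear layer, and/or pass weight_decay to the Adam optimizer.

net_reg <- nn_module(
 initialize = function() {
   self$lstm_layer <- nn_lstm(dim(x_train)[3], 16)
   self$dropout <- nn_dropout(p = 0.5)      # randomly zeroes half of the features during training
   self$fully_connected <- nn_linear(16, 5)
 },
 forward = function(input) {
   input %>% 
     self$lstm_layer() %>%
     extract_lstm_final_output() %>%
     self$dropout() %>%
     self$fully_connected()  %>%
     nnf_softmax(2)
 }
)

model_reg <- net_reg()
optimizer_reg <- optim_adam(model_reg$parameters, weight_decay = 1e-4)  # L2-style penalty on the weights

# Remember to call model_reg$train() before the optimization steps and
# model_reg$eval() before computing validation metrics, so that dropout
# is switched off during evaluation.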