By the end of this activity, students should be able to build and train a simple LSTM-based text classifier in Torch.
It is important to understand that we won't be able to achieve decent accuracy in text classification today: we simply do not have enough computing power. The main takeaway is that, although deep learning models are very powerful, their power comes at a cost: they need a lot of data and a lot of effort to outperform models such as random forests (as well as AdaBoost, gradient boosting, support vector machines, etc.).
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. There will be questions that either require a short answer (type your answer right in this document) or a modification of the R code (modify the code here). In either case, you can discuss your work with the instructor.
We will work with the BBC News data.
Source: https://www.kaggle.com/c/learn-ai-bbc
library(tidyverse)
library(torch)
library(text2vec) # for GloVe
clean_news <- function(x) {
  # lower-case the text and replace every character that is not a
  # lower-case letter or a digit with a space
  x %>%
    tolower %>%
    gsub('[^a-z0-9]', ' ', . )
}
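To see what the cleaning does, here is a quick check on a made-up example string (the string is just an illustration, not part of the data):
clean_news("Breaking: GDP grew 2.5% in Q3!")
# should return "breaking  gdp grew 2 5  in q3 "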
n <- read_csv("BBC News Train.csv") %>%
mutate(Category = as.factor(Category)) %>%
mutate(clean_text = clean_news(Text))
n
Here, we will quickly train a simple 50-dimensional GloVe model on our own corpus. We do this only to save time. The right way is to download pretrained GloVe vectors and load them into R; here is the link for your reference: https://nlp.stanford.edu/projects/glove/
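For reference only (we do not run this in the activity), here is a minimal sketch of loading pretrained vectors. The file name glove.6B.50d.txt is an assumption based on the standard Stanford download; the result has the same layout (one row per word, row names are the words) as the word_vectors matrix we build below.
# Not run: read the pretrained 50-dimensional GloVe file into a matrix
# quote = "" and comment.char = "" are needed because the vocabulary
# contains tokens such as quotation marks and "#"
glove_raw <- read.table("glove.6B.50d.txt", sep = " ", quote = "",
                        comment.char = "", stringsAsFactors = FALSE)
pretrained_vectors <- as.matrix(glove_raw[ , -1])
rownames(pretrained_vectors) <- glove_raw[[1]]
Instead, to keep the activity self-contained, we train a small GloVe model on the news corpus below: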
dim_word_emb <- 50
skip_gram_window_size <- 5L
tokens <- space_tokenizer(n$clean_text)
it <- itoken(tokens, progressbar = FALSE)
# keep only words that appear at least 5 times in the corpus
vocab <- create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5)
# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab)
# use a window of 5 for context words
tcm <- create_tcm(it, vectorizer,
                  skip_grams_window = skip_gram_window_size)
glove <- GlobalVectors$new(rank = dim_word_emb, x_max = 10)
wv_main <- glove$fit_transform(tcm, n_iter = 50, convergence_tol = 0.01)
wv_context <- glove$components
# the final word vectors are the sum of the "main" and "context" vectors
word_vectors <- wv_main + t(wv_context)
dim(word_vectors)
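As a quick sanity check of the embeddings, we can look at the nearest neighbours of a word. The choice of "election" here is just an example and assumes that word survived vocabulary pruning (it appears many times in the corpus); sim2 comes from text2vec, which is already loaded.
# cosine similarity between "election" and every word in the vocabulary
election_sim <- sim2(word_vectors,
                     word_vectors["election", , drop = FALSE],
                     method = "cosine")
# the ten most similar words (the word itself will come first)
head(sort(election_sim[ , 1], decreasing = TRUE), 10)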
Here we will create a random LSTM block and test it on a random input. The objective is to understand what goes into it and what comes out. We will create an LSTM layer that transforms a sequence of 5-dimensional vectors into a sequence of 4-dimensional vectors:
rnn <- nn_lstm(5, 4)
rnn
## An `nn_module` containing 176 parameters.
##
## -- Parameters ------------------------------------------------------------------
## * weight_ih_l1: Float [1:16, 1:5]
## * weight_hh_l1: Float [1:16, 1:4]
## * bias_ih_l1: Float [1:16]
## * bias_hh_l1: Float [1:16]
The total number of parameters can be computed by hand: an LSTM block has 4 gates, and for each gate every one of the 4 output units has 5 input weights plus a bias and 4 recurrent weights plus a bias, i.e., (5+1)*4 + (4+1)*4 = 44 parameters per gate:
((5+1)*4+(4+1)*4)*4
## [1] 176
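We can double-check this number directly from the module's parameter list (the same $parameters field that we will later pass to the optimizer):
# total number of entries across all parameter tensors of the LSTM
sum(sapply(rnn$parameters, function(p) prod(dim(p))))  # should again give 176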
Now we will create a random input - it will contain 3 sets, each with 6 vectors of dimension 5.
input <- torch_randn(3, 6, 5)
input
## torch_tensor
## (1,.,.) =
## 0.5708 -0.0675 0.6794 -0.1003 -0.1650
## -1.0021 -1.1822 -0.5198 1.6388 1.9145
## 0.5766 -1.4728 0.5142 -0.2676 -1.2059
## 1.2317 0.5425 -0.2121 0.8429 0.7004
## 0.7021 1.6605 0.4310 -0.5378 -0.4442
## -0.2120 -1.9608 1.7355 -1.4333 -0.1524
##
## (2,.,.) =
## 1.2052 -0.2871 0.0893 0.3656 -1.0609
## -0.5559 -0.1428 -0.6418 -0.7515 0.0681
## -0.5482 1.0624 -1.1676 -1.3781 0.6400
## -1.3292 0.1798 1.9925 0.3894 -0.7269
## -0.8455 1.2165 -0.1595 0.6082 -0.4818
## 1.5097 -0.3973 0.0573 0.4306 -0.2526
##
## (3,.,.) =
## 0.6288 1.2930 0.8281 -0.0820 1.3557
## 0.6408 -0.5747 0.5410 0.5686 -1.4462
## -0.8214 0.7221 -2.7238 1.7821 0.1085
## 1.6837 -0.9538 -0.3798 1.4186 -0.6179
## -1.2391 0.6689 1.0110 0.2489 -0.8238
## -0.2121 0.4577 -1.5570 0.3236 0.4115
## [ CPUFloatType{3,6,5} ]
Now we will feed it into our LSTM module:
output <- rnn(input)
output
## [[1]]
## torch_tensor
## (1,.,.) =
## -0.0267 0.1548 0.0664 -0.0810
## -0.3578 0.0267 0.0010 0.1151
## -0.0527 0.1902 -0.0659 -0.3106
## 0.0521 0.1129 0.0677 -0.0654
## 0.1474 0.2417 0.1131 -0.0296
## -0.0629 -0.0110 -0.2439 0.0233
##
## (2,.,.) =
## -0.0038 0.2504 0.1324 -0.2343
## -0.2271 0.2443 -0.0450 0.0763
## -0.0271 0.2785 -0.0537 -0.0761
## -0.1992 0.0138 0.3402 0.1626
## -0.0516 0.2006 0.2269 0.0130
## -0.0449 0.1732 -0.0442 -0.1972
##
## (3,.,.) =
## -0.0620 0.1866 0.1201 -0.0135
## -0.2388 0.2972 0.0964 -0.1592
## -0.1973 0.0757 0.0738 -0.0893
## -0.2900 0.0976 0.1753 -0.1522
## -0.2303 0.1583 0.3700 0.1151
## -0.1034 0.1579 0.0159 -0.1192
## [ CPUFloatType{3,6,4} ]
##
## [[2]]
## [[2]][[1]]
## torch_tensor
## (1,.,.) =
## -0.0620 0.1866 0.1201 -0.0135
## -0.2388 0.2972 0.0964 -0.1592
## -0.1973 0.0757 0.0738 -0.0893
## -0.2900 0.0976 0.1753 -0.1522
## -0.2303 0.1583 0.3700 0.1151
## -0.1034 0.1579 0.0159 -0.1192
## [ CPUFloatType{1,6,4} ]
##
## [[2]][[2]]
## torch_tensor
## (1,.,.) =
## -0.0981 0.4019 0.4302 -0.0403
## -0.3845 0.6177 0.1547 -0.2277
## -0.2249 0.7197 0.1873 -0.1490
## -0.3876 0.3958 0.4601 -0.2272
## -0.4252 0.2633 0.5725 0.1800
## -0.1365 0.6681 0.0393 -0.2418
## [ CPUFloatType{1,6,4} ]
Note that the output is a list of length 2. Its first item contains the same number of sequences, each with the same number of vectors, but they have dimension 4 rather than 5 - this is because our LSTM transforms 5-dimensional vectors to 4-dimensional vectors.
dim(output[[1]])
## [1] 3 6 4
The second element of that list, in turn, contains two tensors. These tensors have the same dimensions: 1, by the number of vectors in each sequence, by the dimension of the output vectors.
dim(output[[2]][[1]])
## [1] 1 6 4
dim(output[[2]][[2]])
## [1] 1 6 4
These are the so-called "hidden states" (strictly speaking, the first tensor is the final hidden state and the second is the final cell state). If you are interested, look for materials on recurrent neural networks to understand them better.
Further, we need to extract the final vector (i.e., the 6th one) of every input set of 6 vectors from the model's output, because it is this final output that needs to be passed on to the next layer.
That means we need just a set of 3 four-dimensional vectors. This is how we can do it:
extract_lstm_final_output <- function(output) {
  # output[[1]] has dimensions (sets, vectors per set, output dimension);
  # keep only the last vector of every set
  n <- dim(output[[1]])[2]
  output[[1]][ , n , ]
}
extract_lstm_final_output(output)
## torch_tensor
## -0.0629 -0.0110 -0.2439 0.0233
## -0.0449 0.1732 -0.0442 -0.1972
## -0.1034 0.1579 0.0159 -0.1192
## [ CPUFloatType{3,4} ]
Here, we will convert sequences of words into sequences of word vectors. Only words that appear in our vocabulary can be converted. To start, we will write a function whose input is a character vector, thought of as a sequence of words, and whose output is the same sequence with words not in the vocabulary removed.
keep_words_in_vocab <- function(x) {
  # x is a sequence of words
  # the output is the same sequence with words not in the vocabulary removed
  x[x %in% rownames(word_vectors)]
}
c("mfe", "is", "the", "best") %>% keep_words_in_vocab()
## [1] "is" "the" "best"
Write a function whose input is a sequence of words and whose output is a matrix with rows representing word vectors for words in the sequence that actually appear in our vocabulary.
doc_to_vector_sequence <- function(x) {
  # x is a sequence of words
  # the output is the corresponding sequence (matrix) of word vectors;
  # drop = FALSE keeps the result a matrix even if only one word survives
  word_vectors[keep_words_in_vocab(x) , , drop = FALSE]
}
c("mfe", "is", "the", "best") %>%
doc_to_vector_sequence()
## [,1] [,2] [,3] [,4] [,5] [,6]
## is 0.06992671 0.51875145 -0.33758363 -0.4103236 0.03910815 0.6007398
## the -0.59600627 0.49821806 -0.32427391 -1.0793698 0.05068699 0.7133359
## best -0.37386291 -0.04957605 0.06703747 -1.0576419 -1.25389786 0.3838965
## [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## is -0.21143221 0.9243905 -0.4870051 -1.079826 1.336915 -0.3990836 -0.8951306
## the 0.23297595 0.4041999 -1.3272980 -0.458140 1.268350 0.4113141 -1.8347583
## best -0.03976561 0.3630931 0.7298949 0.325497 1.550214 -0.2238901 -1.7015079
## [,14] [,15] [,16] [,17] [,18] [,19]
## is -1.2492843 0.72721587 -0.2973046 0.8117945 -0.2134237 0.49669386
## the -0.6900967 0.02307693 -1.2547817 0.1673617 -1.1491398 0.91756592
## best -0.1755150 0.57114497 -0.4800561 -0.1375125 -0.3495292 0.09675292
## [,20] [,21] [,22] [,23] [,24] [,25]
## is -0.90045044 0.5676096 -0.88922358 -0.38198146 -0.3704513 -0.4290502
## the -0.07696151 0.9193126 -0.08831784 1.47746937 -0.2047517 -0.7999161
## best 0.37564961 0.5044329 -0.14002333 -0.07614849 -0.3090218 -0.8909856
## [,26] [,27] [,28] [,29] [,30] [,31]
## is 1.5942293 -0.01428077 -0.7754776 0.9408662 -1.03019985 0.5509425
## the 1.4017431 0.67858374 -0.3197300 1.1430419 -0.01979674 0.2081596
## best 0.2289432 0.39565444 -0.2616278 -0.4760076 0.53749199 -0.4135999
## [,32] [,33] [,34] [,35] [,36] [,37] [,38]
## is 0.1236672 0.5227013 0.3853909 -1.0148681 -0.2481313 0.3035956 0.3436571
## the -0.1160918 0.8654789 1.1753424 0.4266378 -0.8978530 -0.2385122 1.2387519
## best -0.6905683 1.0814938 0.2986401 0.4564447 -0.6388204 -0.2823120 0.5101160
## [,39] [,40] [,41] [,42] [,43] [,44]
## is -0.5458721 -1.2552614 -0.7716221 -0.65269960 0.31759697 0.2112984
## the -0.4550541 -0.5527503 -0.1552242 -0.28129001 0.93415223 0.9722938
## best -1.1793912 -0.7030975 0.4699766 0.02527979 0.03643233 0.9087853
## [,45] [,46] [,47] [,48] [,49] [,50]
## is -0.6138701 0.02847755 -0.4648214 0.08526896 -0.5574806 0.3227411
## the -0.6405497 0.23991516 -0.3027951 0.48339889 -0.9435089 -0.2148640
## best -0.2978892 0.23444276 0.1377534 -1.71746045 1.4331844 -0.6259600
To construct sequences of word vectors, we need to convert our documents (news articles) to sequences of words (tokens). This has already been done as a preparation for training GloVe:
For example, here is article 574:
n$Text[574]
## [1] "blunkett hints at election call ex-home secretary david blunkett has given fresh clues that the general election will be announced on monday. he told bbc radio five live: i m out in my constituency getting ready for what we presume will be an announcement very shortly at the weekend. he clarified that he meant he would be in his sheffield seat this weekend not that he expected an election call then. tony blair is tipped to ask the queen on monday to dissolve parliament ready for a 5 may poll."
And here is the tokenized version (sequence of words):
tokens[574]
## [[1]]
## [1] "blunkett" "hints" "at" "election" "call"
## [6] "ex" "home" "secretary" "david" "blunkett"
## [11] "has" "given" "fresh" "clues" "that"
## [16] "the" "general" "election" "will" "be"
## [21] "announced" "on" "monday" "" ""
## [26] "he" "told" "bbc" "radio" "five"
## [31] "live" "" "" "i" "m"
## [36] "out" "in" "my" "constituency" "getting"
## [41] "ready" "for" "what" "we" "presume"
## [46] "will" "be" "an" "announcement" "very"
## [51] "shortly" "at" "the" "weekend" ""
## [56] "" "he" "clarified" "that" "he"
## [61] "meant" "he" "would" "be" "in"
## [66] "his" "sheffield" "seat" "this" "weekend"
## [71] "" "not" "that" "he" "expected"
## [76] "an" "election" "call" "then" ""
## [81] "tony" "blair" "is" "tipped" "to"
## [86] "ask" "the" "queen" "on" "monday"
## [91] "to" "dissolve" "parliament" "ready" "for"
## [96] "a" "5" "may" "poll" ""
What is the length of the longest word sequence that we have after removing words that are not in a vocabulary?
tokens %>% lapply(keep_words_in_vocab) %>% sapply(length) %>% max
## [1] 3223
Note that the word sequences have different lengths. To facilitate training of a recurrent neural network, we need to make all of these lengths equal. In practice, one pads every sequence that is shorter than the maximum with zero vectors; in Torch, there is a special function for that, nn_utils_rnn_pad_sequence.
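For illustration only (we will not use padding in this activity), here is a minimal sketch of how that function behaves on a list of sequences of different lengths; the two random sequences are just placeholders:
# two sequences of 50-dimensional vectors with different lengths
short_seq <- torch_randn(3, 50)
long_seq  <- torch_randn(5, 50)
padded <- nn_utils_rnn_pad_sequence(list(short_seq, long_seq),
                                    batch_first = TRUE)
dim(padded)  # 2 x 5 x 50: the shorter sequence is padded with zero vectors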
However, in our problem, this maximal length is just too large, and training would be extremely slow if we actually did that. So, instead, we will trim all word sequences to some small length, say, 80. This means that we will try to predict the label of an article from its first 80 words (which is, of course, harder than predicting the label from the entire article).
Below we do it:
word_sequences <- tokens %>%
  lapply(doc_to_vector_sequence) %>%
  # keep only the first 80 word vectors of every article
  # (this assumes every article has at least 80 in-vocabulary words)
  sapply(function(A) A[1:80 , ], simplify = 'array') %>%
  # reorder the dimensions to (articles, words, embedding entries)
  aperm(c(3, 1, 2))
Note that the result of this operation is an array. Its dimensions are
dim(word_sequences)
## [1] 1490 80 50
This means that it contains 1490 sequences, each consisting of 80 vectors of dimension 50.
The original response variable is categorical:
head(n$Category)
## [1] business business business tech business politics
## Levels: business entertainment politics sport tech
To feed our data into a deep learning model, we convert it to a numeric vector whose values encode the possible categories. This is how it is done:
y_all <- as.numeric(n$Category)
head(y_all)
## [1] 1 1 1 5 1 3
To retrieve the labels, we do the following:
attributes(n$Category)$levels[y_all] %>% head
## [1] "business" "business" "business" "tech" "business" "politics"
And now we will split our data into training and test sets.
set.seed(42)
N <- dim(word_sequences)[1]
train_ind <- sample(1:N, size = floor(0.7 * N))
test_ind <- setdiff(1:N, train_ind)
x_train = word_sequences[train_ind , , ] %>%
torch_tensor(dtype = torch_float())
y_train = y_all[train_ind] %>%
torch_tensor(dtype = torch_long())
x_test = word_sequences[test_ind , , ] %>%
torch_tensor(dtype = torch_float())
y_test = y_all[test_ind] %>%
torch_tensor(dtype = torch_long())
cat("Training set dimensions =", dim(x_train), "\n")
## Training set dimensions = 1043 80 50
cat("Test set dimensions =", dim(x_test), "\n")
## Test set dimensions = 447 80 50
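Before training, it can also be worth checking that the five categories are represented in both parts of the split; this is just a quick diagnostic, nothing later depends on it:
# counts of each category in the training and test parts
table(n$Category[train_ind])
table(n$Category[test_ind])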
With an LSTM, we cannot simply use the container nn_sequential like we did for a fully connected network. The reason is that the sequence of computations involves extracting just part of the LSTM output, and this operation is not (yet) available as a layer in the Torch library. Thus we have to define our own custom module. This is how it is done:
net <- nn_module(
  initialize = function() {
    # LSTM from 50-dimensional word vectors to 16-dimensional outputs
    self$lstm_layer <- nn_lstm(dim(x_train)[3], 16)
    # fully connected layer from 16 dimensions to the 5 categories
    self$fully_connected <- nn_linear(16, 5)
  },
  forward = function(input) {
    input %>%
      self$lstm_layer() %>%
      extract_lstm_final_output() %>%
      self$fully_connected() %>%
      nnf_softmax(dim = 2)
  }
)
model <- net()
model
## An `nn_module` containing 4,437 parameters.
##
## -- Modules ---------------------------------------------------------------------
## * lstm_layer: <nn_lstm> #4,352 parameters
## * fully_connected: <nn_linear> #85 parameters
Here, we choose an optimizer, the number of epochs, and the loss function:
optimizer <- optim_adam(model$parameters)
epochs <- 100
loss_function <- nnf_cross_entropy
The output of our model looks like a matrix:
x_train %>%
model %>%
head
## torch_tensor
## 0.2132 0.1690 0.2436 0.1716 0.2026
## 0.2050 0.1619 0.2476 0.1758 0.2096
## 0.2128 0.1507 0.2418 0.1814 0.2133
## 0.2007 0.1668 0.2421 0.1781 0.2123
## 0.2057 0.1648 0.2380 0.1825 0.2090
## 0.2053 0.1431 0.2736 0.1676 0.2104
## [ CPUFloatType{6,5} ]
Every row here is a probability distribution.
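Purely as a check, the entries in each row should sum to 1 (up to rounding), since the forward pass ends with a softmax:
# sum over the class dimension; every entry should be approximately 1
x_train %>% model %>% head %>% torch_sum(dim = 2)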
To validate our model, we need to convert the raw output to predicted labels.
predicted_classes <- function(pred_prob) {
  # for every row, return the index of the largest predicted probability
  pred_prob %>% apply(1, which.max)
}
x_train %>% model %>% predicted_classes %>% head
## [1] 3 3 3 3 3 3
And we need to calculate accuracy. This is going to be just the fraction of labels predicted correctly.
acc <- function(x, y) {
(x == y) %>% as.numeric %>% mean
}
x_test %>% model %>% predicted_classes %>% acc(y_test)
## [1] 0.2214765
And here is our validation function:
validation <- function(x_train, y_train, x_test, y_test, epoch_no = 0) {
  pred_train <- model(x_train)
  pred_test <- model(x_test)
  train_loss <- loss_function(pred_train, y_train) %>% as.numeric
  test_loss <- loss_function(pred_test, y_test) %>% as.numeric
  train_acc <- pred_train %>% predicted_classes %>% acc(y_train)
  test_acc <- pred_test %>% predicted_classes %>% acc(y_test)
  data.frame(value = c(train_loss, test_loss, train_acc, test_acc),
             type = c("loss", "loss", "accuracy", "accuracy"),
             dataset = c("train", "test", "train", "test"),
             epoch = epoch_no)
}
val_df <- validation(x_train, y_train, x_test, y_test)
val_df
Now we are ready to run gradient descent (note that every step below uses the entire training set):
val_df <- validation(x_train, y_train, x_test, y_test)
for(i in 1:epochs){
  optimizer$zero_grad()
  y_pred <- model(x_train)
  loss <- loss_function(y_pred, y_train)
  loss$backward()   # gradient calculation
  optimizer$step()  # one step of the (Adam) gradient descent
  # every 5 epochs, calculate and save the metrics
  if(i %% 5 == 0){
    cat("Epoch ", i, "\n")
    val_df <- validation(x_train, y_train, x_test, y_test, i) %>%
      rbind(val_df)
  }
}
## Epoch 5
## Epoch 10
## Epoch 15
## Epoch 20
## Epoch 25
## Epoch 30
## Epoch 35
## Epoch 40
## Epoch 45
## Epoch 50
## Epoch 55
## Epoch 60
## Epoch 65
## Epoch 70
## Epoch 75
## Epoch 80
## Epoch 85
## Epoch 90
## Epoch 95
## Epoch 100
val_df %>%
ggplot(aes(x = epoch, y = value, group = dataset, color = dataset)) +
facet_grid(. ~type) +
geom_line()
Try to introduce some form of regularization into the model and see whether it prevents overfitting. Play with the optimizer. You can even try mini-batch gradient descent (e.g., with dataloaders), but it is harder to implement - you will need to read tutorials.
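As a starting point, here is a minimal sketch of one possible form of regularization: dropout before the final layer plus weight decay in Adam. The dropout rate of 0.3 and the weight decay value are arbitrary choices, not tuned ones, and this is only one of many options.
net_reg <- nn_module(
  initialize = function() {
    self$lstm_layer <- nn_lstm(dim(x_train)[3], 16)
    self$dropout <- nn_dropout(p = 0.3)      # randomly zeroes 30% of entries during training
    self$fully_connected <- nn_linear(16, 5)
  },
  forward = function(input) {
    input %>%
      self$lstm_layer() %>%
      extract_lstm_final_output() %>%
      self$dropout() %>%
      self$fully_connected() %>%
      nnf_softmax(dim = 2)
  }
)
model_reg <- net_reg()
# weight decay adds an L2 penalty on the parameters
optimizer_reg <- optim_adam(model_reg$parameters, weight_decay = 1e-4)
You can then train model_reg with the same training loop as above (replacing model and optimizer) and compare the train and test curves.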