Pre-trained word embeddings are useful when you do not have enough text to train your own word embeddings. A popular pre-trained model is Global Vectors (GloVe). GloVe uses word co-occurrence information, specifically ratios of co-occurrence probabilities, to create vectors that capture meaning.
Like LSA, GloVe is rooted in matrix factorization: where LSA applies SVD to a term-document matrix, GloVe fits its vectors to a word-word co-occurrence matrix.
The pre-trained GloVe models available include models trained on:
- Wikipedia 2014 + Gigaword 5 (6 billion tokens)
- Common Crawl (42 billion tokens)
- Common Crawl (840 billion tokens)
- Twitter (27 billion tokens)
For each model, there are different embedding dimensions available, such as 50, 100, 200, etc. In the example below, we will use the model trained on 6 billion tokens with an embedding size of 50 (the smallest model) for efficiency.
library(tm) # text mining
library(caret) # classification
library(keras) # deep learning
library(tensorflow) # deep learning
First, we import the data, remove documents with empty review text, and prepare the text for tokenization.
cr <- read.csv("clothing_revs_sample.csv", stringsAsFactors = FALSE)
# drop rows with empty review text
cr <- cr[cr$Review.Text != "" & cr$Review.Text != " ", ]
# DataframeSource() requires columns named doc_id and text
names(cr)[1] <- "doc_id"
cr$text <- paste(cr$Title, cr$Review.Text, sep = " ")
crCorpus <- Corpus(DataframeSource(cr))
# lowercase everything
crCorpus <- tm_map(crCorpus, tolower)
myCorpus <- tm_map(crCorpus, PlainTextDocument)
# pull the cleaned text back out of the corpus
cr$clean_text <- myCorpus$content$content
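Before moving on, we can spot-check that the cleaning worked, for example by inspecting the start of the first cleaned review:
substr(cr$clean_text[1], 1, 80)  # first 80 characters, now lowercased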
We split the data into training and testing sets, preserving the Rating distribution: 70% of the data for training and 30% for testing.
set.seed(831)
# stratified split on Rating
samp <- createDataPartition(cr$Rating, p = .70, list = FALSE)
train <- cr[samp, ]
test <- cr[-samp, ]
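As a quick check, the Rating proportions in the two sets should be nearly identical, because createDataPartition() samples within each rating level:
round(prop.table(table(train$Rating)), 3)
round(prop.table(table(test$Rating)), 3)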
To make the dependent variable compatible with keras modeling, we need to convert the categorical variable Rating (with levels 1-5) to a one-hot encoded (dummy variable) representation with levels 0-4, to be consistent with Python's zero-based indexing.
# make compatible
y_train <- train$Rating - 1
y_test <- test$Rating - 1
# convert y variable to one-hot-encoding representation
y_train <- to_categorical(y = y_train, num_classes = 5)
y_test <- to_categorical(y = y_test, num_classes = 5)
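To see what to_categorical() produces, consider a small example: each label becomes a row with a 1 in its (zero-indexed) class column and 0s elsewhere.
to_categorical(c(0, 4), num_classes = 5)
# row 1: 1 0 0 0 0   (class 0)
# row 2: 0 0 0 0 1   (class 4)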
Next, we choose two parameters that govern how the text is prepared: max_words, the vocabulary size (we keep the 5,000 most frequent words), and max_length, the maximum document length in tokens, which we set to the mean document length plus two standard deviations.
max_words <- 5000
# token count per document
counts <- sapply(strsplit(cr$clean_text, " "), length)
max_length <- round(mean(counts) + 2 * sd(counts))
max_length
## [1] 122
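Documents longer than max_length will be truncated; a quick way to see how many documents that affects:
mean(counts > max_length)  # proportion of documents longer than max_length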
Once these numbers are chosen, we use text_tokenizer() to tokenize and prepare our text.
tokenizer <- text_tokenizer(num_words = max_words) %>%
  fit_text_tokenizer(cr$clean_text)
word_index <- tokenizer$word_index
# convert each document to a sequence of integer word indices,
# then pad/truncate every sequence to exactly max_length entries
x_train_text <- texts_to_sequences(tokenizer, train$clean_text)
x_train_text <- pad_sequences(x_train_text, maxlen = max_length)
x_test_text <- texts_to_sequences(tokenizer, test$clean_text)
x_test_text <- pad_sequences(x_test_text, maxlen = max_length)
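The result is a numeric matrix with one row per document and max_length columns:
dim(x_train_text)  # (number of training documents) x 122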
The GloVe model that will be used to create the embedding layer can be found here: http://nlp.stanford.edu/data/glove.6B.zip
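If the file is not already in the working directory, it can be downloaded and extracted directly from R. A minimal sketch (note that the zipped archive is large, on the order of 800 MB):
options(timeout = 600)  # allow extra time for the large download
download.file("http://nlp.stanford.edu/data/glove.6B.zip", destfile = "glove.6B.zip", mode = "wb")
unzip("glove.6B.zip", files = "glove.6B.50d.txt")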
The embeddings with dimension = 50 will be used (“glove.6B.50d.txt”). First, we parse the file to match words to their vectors.
glove_lines <- readLines("glove.6B.50d.txt")
# use an environment as a hash map from word to embedding vector
embeddings_index <- new.env(hash = TRUE, parent = emptyenv())
for (i in seq_along(glove_lines)) {
  line <- glove_lines[[i]]
  # each line is: word v1 v2 ... v50
  values <- strsplit(line, " ")[[1]]
  word <- values[[1]]
  embeddings_index[[word]] <- as.double(values[-1])
}
length(embeddings_index)
## [1] 400000
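Each entry is a numeric vector with one value per embedding dimension; for example, assuming the word appears in the GloVe vocabulary:
length(embeddings_index[["dress"]])  # 50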
Now, we build the embedding matrix that will initialize the embedding layer. Since we are using the GloVe model with 50 dimensions, we set embedding_dims to 50. We set up a zero matrix with one row per entry in our word index and one column per embedding dimension. Then, we copy each word's GloVe vector into the row given by its index, offset by one because keras reserves index 0 for padding.
embedding_dims <- 50
embedding_matrix <- array(0, c(length(word_index), embedding_dims))
for (word in names(word_index)){
  index <- word_index[[word]]
  if (index < length(word_index)) {
    embedding_vector <- embeddings_index[[word]]
    # words with no GloVe vector keep a row of zeros
    if (!is.null(embedding_vector))
      embedding_matrix[index + 1, ] <- embedding_vector
  }
}
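Not every word in the reviews has a GloVe vector; unmatched words keep an all-zero row. A rough check of vocabulary coverage:
mean(rowSums(embedding_matrix != 0) > 0)  # proportion of rows with a GloVe vector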
We build a deep learning model with 1 embedding layer, 1 dense hidden layer, 1 dropout layer and 1 output layer. A flatten layer converts the output of the embedding from a 3D tensor to a 2D tensor (matrix) before it reaches the dense layer.
We freeze the weights of the embedding layer so that they are not updated while the model is training; the model instead uses the pre-trained embeddings as-is.
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = length(word_index),
                  output_dim = embedding_dims,
                  input_length = max_length) %>%
  layer_flatten() %>%
  layer_dense(units = 32,
              activation = "relu",
              kernel_regularizer = regularizer_l2(l = 0.001)) %>%
  layer_dropout(0.5) %>%
  layer_dense(units = 5,
              activation = "softmax")
# load the GloVe matrix into the embedding layer, then freeze it
get_layer(model, index = 1) %>%
  set_weights(list(embedding_matrix)) %>%
  freeze_weights()
model %>% compile(
  optimizer = 'adam',
  loss = 'categorical_crossentropy',
  metrics = list('accuracy')
)
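At this point, summary(model) is a useful sanity check: it shows each layer's output shape and confirms that the embedding weights are counted as non-trainable parameters.
summary(model)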
To address class imbalance, we use class-based weighting: each class is weighted inversely to its frequency in the training data.
dist <- table(train$Rating)
# weight for class i: n_total / (n_classes * n_i)
weights <- c()
for (i in 1:length(dist)){
  weights[i] <- sum(dist) / (length(dist) * dist[i])
}
class_weights <- list("0" = weights[1],
                      "1" = weights[2],
                      "2" = weights[3],
                      "3" = weights[4],
                      "4" = weights[5])
We train the model for 10 epochs with a batch size of 16, hold out 10% of the training data for validation, and apply the calculated class weights.
history <- model %>% fit(
  x_train_text,
  y_train,
  epochs = 10,
  batch_size = 16,
  validation_split = 0.1,
  class_weight = class_weights,
  verbose = 2
)
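The returned history object can be plotted to inspect the loss and accuracy curves across epochs:
plot(history)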
To evaluate model performance, we convert the predicted class probabilities back to a single-variable representation by taking, for each document, the class with the highest probability. Since which.max() returns a 1-based position, this also maps the 0-4 keras classes back onto the original 1-5 Rating scale.
ypred_train <- predict(model, x_train_text)
# pick the most probable class for each document
ypred_train_cat <- vector()
for (i in 1:nrow(ypred_train)){
  ypred_train_cat[i] <- which.max(ypred_train[i, ])
}
confusionMatrix(factor(ypred_train_cat),
                factor(train$Rating),
                mode = "everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 50 6 10 6 26
## 2 0 76 8 9 7
## 3 45 91 284 196 254
## 4 9 31 60 224 359
## 5 3 8 33 247 1114
##
## Overall Statistics
##
## Accuracy : 0.5539
## 95% CI : (0.5363, 0.5713)
## No Information Rate : 0.5577
## P-Value [Acc > NIR] : 0.6731
##
## Kappa : 0.3314
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.46729 0.35849 0.71899 0.32845 0.6330
## Specificity 0.98426 0.99185 0.78776 0.81447 0.7915
## Pos Pred Value 0.51020 0.76000 0.32644 0.32796 0.7929
## Neg Pred Value 0.98136 0.95550 0.95144 0.81480 0.6311
## Precision 0.51020 0.76000 0.32644 0.32796 0.7929
## Recall 0.46729 0.35849 0.71899 0.32845 0.6330
## F1 0.48780 0.48718 0.44901 0.32821 0.7039
## Prevalence 0.03390 0.06717 0.12516 0.21610 0.5577
## Detection Rate 0.01584 0.02408 0.08999 0.07098 0.3530
## Detection Prevalence 0.03105 0.03169 0.27567 0.21641 0.4452
## Balanced Accuracy 0.72577 0.67517 0.75337 0.57146 0.7123
We perform the same conversion on the test-set predictions and use confusionMatrix() to assess out-of-sample performance.
ypred_test <- predict(model, x_test_text)
ypred_test_cat <- vector()
for (i in 1:nrow(ypred_test)){
  ypred_test_cat[i] <- which.max(ypred_test[i, ])
}
confusionMatrix(factor(ypred_test_cat),
                factor(test$Rating),
                mode = "everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 5 4 7 4 21
## 2 4 9 2 6 6
## 3 23 39 66 89 126
## 4 16 23 37 92 197
## 5 6 21 42 103 404
##
## Overall Statistics
##
## Accuracy : 0.426
## 95% CI : (0.3995, 0.4529)
## No Information Rate : 0.5577
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1461
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.092593 0.093750 0.42857 0.31293 0.5358
## Specificity 0.972265 0.985669 0.76878 0.74197 0.7124
## Pos Pred Value 0.121951 0.333333 0.19242 0.25205 0.7014
## Neg Pred Value 0.962624 0.934340 0.91278 0.79534 0.5490
## Precision 0.121951 0.333333 0.19242 0.25205 0.7014
## Recall 0.092593 0.093750 0.42857 0.31293 0.5358
## F1 0.105263 0.146341 0.26559 0.27921 0.6075
## Prevalence 0.039941 0.071006 0.11391 0.21746 0.5577
## Detection Rate 0.003698 0.006657 0.04882 0.06805 0.2988
## Detection Prevalence 0.030325 0.019970 0.25370 0.26997 0.4260
## Balanced Accuracy 0.532429 0.539709 0.59868 0.52745 0.6241
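If the per-class metrics are needed programmatically rather than read off the printed output, the result of confusionMatrix() can be stored and its byClass matrix indexed; a small sketch:
cm_test <- confusionMatrix(factor(ypred_test_cat),
                           factor(test$Rating),
                           mode = "everything")
cm_test$byClass[, "F1"]  # per-class F1 scores as a named vector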