Pre-trained word embeddings are useful when you do not have enough text to train your own word embeddings. A popular pre-trained model is Global Vectors (GloVe). GloVe uses word co-occurrence information, specifically ratios of co-occurrence probabilities, to create vectors that capture meaning.
Like LSA, GloVe is rooted in matrix factorization: where LSA applies SVD to a term-document matrix, GloVe fits its vectors to a word-word co-occurrence matrix.
The pre-trained GloVe models available include models trained on:
- Wikipedia 2014 + Gigaword 5 (6 billion tokens)
- Common Crawl (42 billion tokens)
- Common Crawl (840 billion tokens)
- Twitter (27 billion tokens)
For each model, there are different embedding dimensions available, such as 50, 100, 200, etc. In the example below, we will use the model trained on 6 billion tokens with an embedding size of 50 (the smallest model) for efficiency.
library(tm) # text mining
library(caret) # classification
library(keras) # deep learning
library(tensorflow) # deep learning
First, we import the data, remove documents with empty review text, and prepare the text for tokenization.
cr <- read.csv("clothing_revs_sample.csv", stringsAsFactors = FALSE)
# drop rows with empty review text
cr <- cr[cr$Review.Text != "" & cr$Review.Text != " ", ]
# DataframeSource() requires columns named doc_id and text
names(cr)[1] <- "doc_id"
cr$text <- paste(cr$Title, cr$Review.Text, sep = " ")
crCorpus <- Corpus(DataframeSource(cr))
# lowercase everything
crCorpus <- tm_map(crCorpus, tolower)
myCorpus <- tm_map(crCorpus, PlainTextDocument)
# pull the cleaned text back out of the corpus
cr$clean_text <- myCorpus$content$content
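Before moving on, we can spot-check that the cleaning worked, for example by inspecting the start of the first cleaned review:
substr(cr$clean_text[1], 1, 80)  # first 80 characters, now lowercased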
We split the data into training and testing sets, preserving the Rating distribution: 70% of the data for training and 30% for testing.
set.seed(831)
# stratified split on Rating
samp <- createDataPartition(cr$Rating, p = .70, list = FALSE)
train <- cr[samp, ]
test <- cr[-samp, ]
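As a quick check, the Rating proportions in the two sets should be nearly identical, because createDataPartition() samples within each rating level:
round(prop.table(table(train$Rating)), 3)
round(prop.table(table(test$Rating)), 3)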
To make the dependent variable compatible with keras modeling, we need to convert the categorical variable Rating (with levels 1-5) to a one-hot encoded (dummy variable) representation with levels 0-4, to be consistent with Python's zero-based indexing.
# make compatible
y_train <- train$Rating - 1
y_test <- test$Rating - 1
# convert y variable to one-hot-encoding representation
y_train <- to_categorical(y = y_train, num_classes = 5)
y_test <- to_categorical(y = y_test, num_classes = 5)
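To see what to_categorical() produces, consider a small example: each label becomes a row with a 1 in its (zero-indexed) class column and 0s elsewhere.
to_categorical(c(0, 4), num_classes = 5)
# row 1: 1 0 0 0 0   (class 0)
# row 2: 0 0 0 0 1   (class 4)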
Next, we choose two parameters that govern how the text is prepared: max_words, the vocabulary size (we keep the 5,000 most frequent words), and max_length, the maximum document length in tokens, which we set to the mean document length plus two standard deviations.
max_words <- 5000
# token count per document
counts <- sapply(strsplit(cr$clean_text, " "), length)
max_length <- round(mean(counts) + 2 * sd(counts))
max_length
## [1] 122
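Documents longer than max_length will be truncated; a quick way to see how many documents that affects:
mean(counts > max_length)  # proportion of documents longer than max_length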
Once these numbers are chosen, we use text_tokenizer() to tokenize and prepare our text.
tokenizer <- text_tokenizer(num_words = max_words) %>%
  fit_text_tokenizer(cr$clean_text)
word_index <- tokenizer$word_index
# convert each document to a sequence of integer word indices,
# then pad/truncate every sequence to exactly max_length entries
x_train_text <- texts_to_sequences(tokenizer, train$clean_text)
x_train_text <- pad_sequences(x_train_text, maxlen = max_length)
x_test_text <- texts_to_sequences(tokenizer, test$clean_text)
x_test_text <- pad_sequences(x_test_text, maxlen = max_length)
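The result is a numeric matrix with one row per document and max_length columns:
dim(x_train_text)  # (number of training documents) x 122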
The GloVe model that will be used to create the embedding layer can be found here: http://nlp.stanford.edu/data/glove.6B.zip
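If the file is not already in the working directory, it can be downloaded and extracted directly from R. A minimal sketch (note that the zipped archive is large, on the order of 800 MB):
options(timeout = 600)  # allow extra time for the large download
download.file("http://nlp.stanford.edu/data/glove.6B.zip", destfile = "glove.6B.zip", mode = "wb")
unzip("glove.6B.zip", files = "glove.6B.50d.txt")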
The embeddings with dimension = 50 will be used (“glove.6B.50d.txt”). First, we parse the file to match words to their vectors.
glove_lines <- readLines("glove.6B.50d.txt")
# use an environment as a hash map from word to embedding vector
embeddings_index <- new.env(hash = TRUE, parent = emptyenv())
for (i in seq_along(glove_lines)) {
  line <- glove_lines[[i]]
  # each line is: word v1 v2 ... v50
  values <- strsplit(line, " ")[[1]]
  word <- values[[1]]
  embeddings_index[[word]] <- as.double(values[-1])
}
length(embeddings_index)
## [1] 400000
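Each entry is a numeric vector with one value per embedding dimension; for example, assuming the word appears in the GloVe vocabulary:
length(embeddings_index[["dress"]])  # 50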
Now, we build the embedding matrix that will initialize the embedding layer. Since we are using the GloVe model with 50 dimensions, we set embedding_dims to 50. We set up a zero matrix with one row per entry in our word index and one column per embedding dimension. Then, we copy each word's GloVe vector into the row given by its index, offset by one because keras reserves index 0 for padding.
embedding_dims <- 50
embedding_matrix <- array(0, c(length(word_index), embedding_dims))
for (word in names(word_index)){
  index <- word_index[[word]]
  if (index < length(word_index)) {
    embedding_vector <- embeddings_index[[word]]
    # words with no GloVe vector keep a row of zeros
    if (!is.null(embedding_vector))
      embedding_matrix[index + 1, ] <- embedding_vector
  }
}
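Not every word in the reviews has a GloVe vector; unmatched words keep an all-zero row. A rough check of vocabulary coverage:
mean(rowSums(embedding_matrix != 0) > 0)  # proportion of rows with a GloVe vector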
We build a deep learning model with 1 embedding layer, 1 dense hidden layer, 1 dropout layer and 1 output layer. A flatten layer converts the output of the embedding from a 3D tensor to a 2D tensor (matrix) before it reaches the dense layer.
We freeze the weights of the embedding layer so that they are not updated while the model is training; the model instead uses the pre-trained embeddings as-is.
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = length(word_index),
                  output_dim = embedding_dims,
                  input_length = max_length) %>%
  layer_flatten() %>%
  layer_dense(units = 32,
              activation = "relu",
              kernel_regularizer = regularizer_l2(l = 0.001)) %>%
  layer_dropout(0.5) %>%
  layer_dense(units = 5,
              activation = "softmax")
# load the GloVe matrix into the embedding layer, then freeze it
get_layer(model, index = 1) %>%
  set_weights(list(embedding_matrix)) %>%
  freeze_weights()
model %>% compile(
  optimizer = 'adam',
  loss = 'categorical_crossentropy',
  metrics = list('accuracy')
)
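At this point, summary(model) is a useful sanity check: it shows each layer's output shape and confirms that the embedding weights are counted as non-trainable parameters.
summary(model)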
To address class imbalance, we use class-based weighting: each class is weighted inversely to its frequency in the training data.
dist <- table(train$Rating)
# weight for class i: n_total / (n_classes * n_i)
weights <- c()
for (i in 1:length(dist)){
  weights[i] <- sum(dist) / (length(dist) * dist[i])
}
class_weights <- list("0" = weights[1],
                      "1" = weights[2],
                      "2" = weights[3],
                      "3" = weights[4],
                      "4" = weights[5])
We train the model for 10 epochs with a batch size of 16, hold out 10% of the training data for validation, and apply the calculated class weights.
history <- model %>% fit(
  x_train_text,
  y_train,
  epochs = 10,
  batch_size = 16,
  validation_split = 0.1,
  class_weight = class_weights,
  verbose = 2
)
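The returned history object can be plotted to inspect the loss and accuracy curves across epochs:
plot(history)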
To evaluate model performance, we convert the predicted class probabilities back to a single-variable representation by taking, for each document, the class with the highest probability. Since which.max() returns a 1-based position, this also maps the 0-4 keras classes back onto the original 1-5 Rating scale.
ypred_train <- predict(model, x_train_text)
# pick the most probable class for each document
ypred_train_cat <- vector()
for (i in 1:nrow(ypred_train)){
  ypred_train_cat[i] <- which.max(ypred_train[i, ])
}
confusionMatrix(factor(ypred_train_cat),
                factor(train$Rating),
                mode = "everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 50 6 10 6 26
## 2 0 76 8 9 7
## 3 45 91 284 196 254
## 4 9 31 60 224 359
## 5 3 8 33 247 1114
##
## Overall Statistics
##
## Accuracy : 0.5539
## 95% CI : (0.5363, 0.5713)
## No Information Rate : 0.5577
## P-Value [Acc > NIR] : 0.6731
##
## Kappa : 0.3314
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.46729 0.35849 0.71899 0.32845 0.6330
## Specificity 0.98426 0.99185 0.78776 0.81447 0.7915
## Pos Pred Value 0.51020 0.76000 0.32644 0.32796 0.7929
## Neg Pred Value 0.98136 0.95550 0.95144 0.81480 0.6311
## Precision 0.51020 0.76000 0.32644 0.32796 0.7929
## Recall 0.46729 0.35849 0.71899 0.32845 0.6330
## F1 0.48780 0.48718 0.44901 0.32821 0.7039
## Prevalence 0.03390 0.06717 0.12516 0.21610 0.5577
## Detection Rate 0.01584 0.02408 0.08999 0.07098 0.3530
## Detection Prevalence 0.03105 0.03169 0.27567 0.21641 0.4452
## Balanced Accuracy 0.72577 0.67517 0.75337 0.57146 0.7123
We perform the same conversion on the test-set predictions and use confusionMatrix() to assess out-of-sample performance.
ypred_test <- predict(model, x_test_text)
ypred_test_cat <- vector()
for (i in 1:nrow(ypred_test)){
  ypred_test_cat[i] <- which.max(ypred_test[i, ])
}
confusionMatrix(factor(ypred_test_cat),
                factor(test$Rating),
                mode = "everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 5 4 7 4 21
## 2 4 9 2 6 6
## 3 23 39 66 89 126
## 4 16 23 37 92 197
## 5 6 21 42 103 404
##
## Overall Statistics
##
## Accuracy : 0.426
## 95% CI : (0.3995, 0.4529)
## No Information Rate : 0.5577
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1461
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.092593 0.093750 0.42857 0.31293 0.5358
## Specificity 0.972265 0.985669 0.76878 0.74197 0.7124
## Pos Pred Value 0.121951 0.333333 0.19242 0.25205 0.7014
## Neg Pred Value 0.962624 0.934340 0.91278 0.79534 0.5490
## Precision 0.121951 0.333333 0.19242 0.25205 0.7014
## Recall 0.092593 0.093750 0.42857 0.31293 0.5358
## F1 0.105263 0.146341 0.26559 0.27921 0.6075
## Prevalence 0.039941 0.071006 0.11391 0.21746 0.5577
## Detection Rate 0.003698 0.006657 0.04882 0.06805 0.2988
## Detection Prevalence 0.030325 0.019970 0.25370 0.26997 0.4260
## Balanced Accuracy 0.532429 0.539709 0.59868 0.52745 0.6241
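If the per-class metrics are needed programmatically rather than read off the printed output, the result of confusionMatrix() can be stored and its byClass matrix indexed; a small sketch:
cm_test <- confusionMatrix(factor(ypred_test_cat),
                           factor(test$Rating),
                           mode = "everything")
cm_test$byClass[, "F1"]  # per-class F1 scores as a named vector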