Word embedding is the process of capturing the context of a word in a document, such as its semantics and its relation to other words. A popular architecture for learning word embeddings is Word2Vec, which uses a shallow neural network to build word vectors that represent the characteristics of each word.
Here we will describe, step by step, how to perform word embedding with the Word2Vec architecture using Keras in R. The libraries used throughout this article will be introduced one by one at each step.
News concerning COVID-19 was collected on 29 April 2020 from various news platforms in Indonesia. We aim to perform word embedding for the words found in the news.
Description:

The data contains a text column holding the content of each news article; this is the column we will process. In addition, we also save a vector containing Indonesian stopwords from an online source for data pre-processing.
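The chunk that reads the data and the stopwords is not shown here; a minimal sketch, assuming the scraped news is stored in a CSV file with a text column and the stopwords in a plain text file with one word per line (both file names are hypothetical):

# read the scraped news; assumed to contain a `text` column (file name is an assumption)
data <- read.csv("covid19_news.csv", stringsAsFactors = FALSE)

# read the Indonesian stopwords; header = FALSE gives a single column named V1,
# which is referenced later by tm::removeWords() (file name is an assumption)
stopwords <- read.csv("stopwords_id.txt", header = FALSE, stringsAsFactors = FALSE)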
This step transforms the values in the text column into a standardized and tidy format, ready for tokenization.
library(tidyverse)
data_clean <- data %>%
mutate(text = text %>%
# turn text into lowercase
str_to_lower() %>%
# remove stopwords
tm::removeWords(words = stopwords$V1) %>%
# reduce repeated whitespace from the text
str_squish())

We will use functions from the keras package to tokenize each string of our text data (one string per article) into its component words.
library(keras)
# making tokenizer
tokenizer <- text_tokenizer(num_words = 18000) # maximum number of words to keep (based on frequency)
# tokenize data
tokenizer %>% fit_text_tokenizer(data_clean$text)

After tokenization, we need to create skip-gram training samples for training our Word2Vec architecture. The training samples contain pairs of words drawn from the sequences in our text data, generated within a specified window size.
For example, the sentence “I love to drink orange juice” with a window size of 3 should create the following (target, context) pairs:
(‘I’, ‘love’), (‘I’, ‘to’), (‘I’, ‘drink’),
(‘love’, ‘I’), (‘love’, ‘to’), (‘love’, ‘drink’), (‘love’, ‘orange’),
(‘to’, ‘I’), (‘to’, ‘love’), (‘to’, ‘drink’), (‘to’, ‘orange’), (‘to’, ‘juice’),
(‘drink’, ‘I’), (‘drink’, ‘love’), (‘drink’, ‘to’), (‘drink’, ‘orange’), (‘drink’, ‘juice’),
(‘orange’, ‘love’), (‘orange’, ‘to’), (‘orange’, ‘drink’), (‘orange’, ‘juice’),
(‘juice’, ‘to’), (‘juice’, ‘drink’), (‘juice’, ‘orange’)
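As an illustration of how such pairs can be generated in code, below is a minimal sketch using the keras function skipgrams() (also used inside the generator later) on the toy sentence above; the pairs are returned as word indices rather than the words themselves, and negative sampling is switched off here:

library(keras)

# fit a small tokenizer on the toy sentence
toy_tokenizer <- text_tokenizer(num_words = 10)
toy_tokenizer %>% fit_text_tokenizer("I love to drink orange juice")

# convert the sentence into a sequence of word indices
toy_seq <- texts_to_sequences(toy_tokenizer, "I love to drink orange juice")[[1]]

# generate (target, context) pairs with a window size of 3, without negative samples
toy_skip <- skipgrams(toy_seq, vocabulary_size = 10, window_size = 3, negative_samples = 0)
toy_skip$couples # list of (target, context) index pairs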
Below is a function to prepare our skip-gram training sample:
library(reticulate)
library(purrr)
skipgrams_generator <- function(text, tokenizer, window_size, negative_samples) {
gen <- texts_to_sequences_generator(tokenizer, sample(text))
function() {
skip <- generator_next(gen) %>%
skipgrams(
vocabulary_size = tokenizer$num_words,
window_size = window_size,
negative_samples = negative_samples
)
x <- transpose(skip$couples) %>% map(. %>% unlist %>% as.matrix(ncol = 1))
y <- skip$labels %>% as.matrix(ncol = 1)
list(x, y)
}
}

Description:

text: the text data used to generate the training samples.
tokenizer: the fitted tokenizer.
window_size: the number of neighboring words considered on each side of a target word.
negative_samples: the number of negative samples drawn for each target word.

Next, we will build the Word2Vec architecture and prepare some model tuning inputs:
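The chunk that defines these tuning inputs is not shown above; a minimal sketch, with the embedding size taken from the model summary below (256) and the window size and number of negative samples set to illustrative values:

# dimension of the embedding vector of each word (matches the model summary below)
embedding_size <- 256
# number of neighboring words considered on each side of a target word (illustrative value)
skip_window <- 5
# number of negative samples drawn for each target word (illustrative value)
num_sampled <- 1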
Also note that when we use the skip-gram architecture:
the input is a single (target) word;
the outputs are its neighboring (context) words;
the model is trained with a list of skip-gram training samples, where one epoch stands for one training iteration and one batch (in each epoch) corresponds to one group of skip-gram pairs, for example:
(‘I’, ‘love’), (‘I’, ‘to’), (‘I’, ‘drink’),
# making architecture
input_target <- layer_input(shape = 1)
input_context <- layer_input(shape = 1)
embedding <- layer_embedding(
input_dim = tokenizer$num_words + 1,
output_dim = embedding_size,
input_length = 1,
name = "embedding"
)
target_vector <- input_target %>%
embedding() %>%
layer_flatten() # flatten the (1, embedding_size) output into a vector of length embedding_size
context_vector <- input_context %>%
embedding() %>%
layer_flatten()
dot_product <- layer_dot(list(target_vector, context_vector), axes = 1)
output <- layer_dense(dot_product, units = 1, activation = "sigmoid")

model <- keras_model(list(input_target, input_context), output)
model %>% compile(loss = "binary_crossentropy", optimizer = "adam")

Below is the summary of the Word2Vec architecture:
#> Model: "model"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param # Connected to
#> ================================================================================
#> input_1 (InputLayer) [(None, 1)] 0
#> ________________________________________________________________________________
#> input_2 (InputLayer) [(None, 1)] 0
#> ________________________________________________________________________________
#> embedding (Embedding) (None, 1, 256) 4608256 input_1[0][0]
#> input_2[0][0]
#> ________________________________________________________________________________
#> flatten (Flatten) (None, 256) 0 embedding[0][0]
#> ________________________________________________________________________________
#> flatten_1 (Flatten) (None, 256) 0 embedding[1][0]
#> ________________________________________________________________________________
#> dot (Dot) (None, 1) 0 flatten[0][0]
#> flatten_1[0][0]
#> ________________________________________________________________________________
#> dense (Dense) (None, 1) 2 dot[0][0]
#> ================================================================================
#> Total params: 4,608,258
#> Trainable params: 4,608,258
#> Non-trainable params: 0
#> ________________________________________________________________________________
During the training process, our model updates the weights from each input (one word) to the specified number of neurons in the hidden/embedding layer. These weights are the values that will be used to describe each word in the embedding. The collection of these weights across all words is what we call the word vectors.
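The training call itself is not shown above; a minimal sketch using the generator defined earlier, where the values of steps_per_epoch and epochs are illustrative assumptions:

# train the model on skip-gram samples streamed from the generator
# (steps_per_epoch and epochs are illustrative assumptions)
model %>% fit_generator(
  skipgrams_generator(data_clean$text, tokenizer, skip_window, num_sampled),
  steps_per_epoch = 100,
  epochs = 5
)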
#obtaining word vector
embedding_matrix <- get_weights(model)[[1]]
words <- dplyr::tibble(
word = names(tokenizer$word_index),
id = as.integer(unlist(tokenizer$word_index))
)
words <- words %>%
dplyr::filter(id <= tokenizer$num_words) %>%
dplyr::arrange(id)
row.names(embedding_matrix) <- c("UNK", words$word)
dim(embedding_matrix)

#> [1] 18001   256
As you can see, there are 18,001 rows (the 18,000 words kept by the tokenizer plus one extra row labelled “UNK”), each described by 256 values (characteristics).
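As a quick check, we can retrieve the vector of a single word by its row name (assuming, for example, that “corona” is in the vocabulary):

# retrieve the 256-dimensional vector of one word
corona_vector <- embedding_matrix["corona", ]
length(corona_vector) # 256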
Once we have the word vectors, we can use them to analyze word/text semantics. For example, we can use them to find the most similar words in the vocabulary based on cosine similarity:
library(text2vec)
find_similar_words <- function(word, embedding_matrix, n = 7) {
similarities <- embedding_matrix[word, , drop = FALSE] %>%
sim2(embedding_matrix, y = ., method = "cosine")
similarities[,1] %>% sort(decreasing = TRUE) %>% head(n)
}

For example, querying a few words from our news vocabulary gives:

#> corona virus 19 akibat pandemi positif covid
#> 1.0000000 0.9815613 0.9276114 0.9237683 0.9217538 0.9209640 0.9183338
#> pandemi wabah dampak penyebaran virus melawan menghadapi
#> 1.0000000 0.9558361 0.9392546 0.9342459 0.9318135 0.9308110 0.9293815
#> baswedan anies dki pemkot bodebek menyampaikan
#> 1.0000000 0.9701637 0.9476971 0.9448433 0.9428023 0.9402692
#> pemprov
#> 0.9386129
#> lockdown buruh karantina meskipun setelah mengikuti awal
#> 1.0000000 0.9560191 0.9545811 0.9536048 0.9528437 0.9524906 0.9517070
#> psbb pembatasan berskala besar sosial penerapan
#> 1.0000000 0.9616514 0.9581562 0.9541515 0.9398070 0.9336288
#> pemberlakuan
#> 0.9332383
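The listings above correspond to calling the helper once per query word (each query word appears first in its own listing, with a cosine similarity of 1); for example:

find_similar_words("corona", embedding_matrix)
find_similar_words("pandemi", embedding_matrix)
find_similar_words("baswedan", embedding_matrix)
find_similar_words("lockdown", embedding_matrix)
find_similar_words("psbb", embedding_matrix)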
We can see that the word “corona” is most similar to “virus”, while “pandemi” is closest to “wabah”, and “baswedan” to “anies”, “dki”, and so on. Some words may not look very similar, but the performance can be improved by providing more data for model training.
We hope this brief explanation of word embedding and Word2Vec in R is useful to you. Happy learning!