In this post, we will build a text classification model using a deep learning architecture. We will then compare it with a Naive Bayes classifier as the benchmark model.
The following code contains all the required packages and setup for this post. All source code is provided on my GitHub repository.
# Data Wrangling
library(tidyverse)
# Text Preprocessing
library(tidytext)
library(textclean)
library(hunspell)
# Model Evaluation
library(yardstick)
# Naive Bayes
library(e1071)
# Deep Learning
library(keras)
use_condaenv("r-tensorflow")
# ggplot2 Plot Configuration
theme_set(theme_minimal() +
theme(legend.position = "top")
)
The data are collected from the Analytics Vidhya JanataHack: NLP Hackathon. The dataset consists of 5 columns and around 17,000 observations.
## Rows: 17,494
## Columns: 5
## $ review_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ title <chr> "Spooky's Jump Scare Mansion", "Spooky's Jump Scare M…
## $ year <int> 2016, 2016, 2016, 2015, 2015, 2015, 2017, 2015, 2015,…
## $ user_review <chr> "I'm scared and hearing creepy voices. So I'll pause…
## $ user_suggestion <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
Data Description:
- review_id: ID of the review
- title: Title of the game
- year: Year in which the review was posted
- user_review: Review written by the user
- user_suggestion: Game marked as Recommended (1) or Not Recommended (0) by the user

We will start by cleansing the data with the following steps:
- Replace non-ASCII characters
- Convert all text to lowercase
- Remove mentions (@username)
- Remove the phrase "early access review"
- Remove URLs, hashtags, and HTML tags
- Expand contractions and replace word elongations
- Convert ? into an explicit questionmark token
- Convert ! into an explicit exclamationmark token
- Remove the remaining punctuation and digits
- Trim and squish extra whitespace

Since the process takes quite a long time to run, we will use parallel computing via the furrr package to get the job done.
cleansing_text <- function(x) x %>%
replace_non_ascii() %>%
tolower() %>%
str_replace_all(pattern = "\\@.*? |\\@.*?[:punct:]", replacement = " ") %>%
str_remove(pattern = "early access review") %>%
replace_url() %>%
replace_hash() %>%
replace_html() %>%
replace_contraction() %>%
replace_word_elongation() %>%
str_replace_all("\\?", " questionmark") %>%
str_replace_all("\\!", " exclamationmark") %>%
str_replace_all("[:punct:]", " ") %>%
str_replace_all("[:digit:]", " ") %>%
str_trim() %>%
str_squish()
cleansing_text("I really love this game !!!")
## [1] "i really love this game exclamationmark exclamationmark exclamationmark"
library(furrr)
plan(multisession, workers = 4) # Using 4 CPU cores
df_clean <- df %>%
mutate(
text_clean = user_review %>%
future_map_chr(cleansing_text)
)
glimpse(df_clean)
## Rows: 17,494
## Columns: 6
## $ review_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ title <chr> "Spooky's Jump Scare Mansion", "Spooky's Jump Scare M…
## $ year <int> 2016, 2016, 2016, 2015, 2015, 2015, 2017, 2015, 2015,…
## $ user_review <chr> "I'm scared and hearing creepy voices. So I'll pause…
## $ user_suggestion <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ text_clean <chr> "I am scared and hearing creepy voices so I will paus…
Next, we need to check the length of each review after cleansing, since some texts may have only 1 or 2 words left.
word_count <- map_dbl(df_clean$text_clean, function(x) str_split(x, " ") %>%
unlist() %>%
length()
)
summary(word_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 51 84 141 158 1601
We will filter the dataset, keeping only texts that have at least 3 words in the sentence.
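The filtering chunk itself is not shown; a minimal sketch, assuming the word_count vector computed above is aligned row-wise with df_clean, could be:
# Keep only reviews with at least 3 words after cleansing
df_clean <- df_clean %>%
  filter(word_count >= 3)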
## Rows: 17,455
## Columns: 6
## $ review_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ title <chr> "Spooky's Jump Scare Mansion", "Spooky's Jump Scare M…
## $ year <int> 2016, 2016, 2016, 2015, 2015, 2015, 2017, 2015, 2015,…
## $ user_review <chr> "I'm scared and hearing creepy voices. So I'll pause…
## $ user_suggestion <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ text_clean <chr> "I am scared and hearing creepy voices so I will paus…
We will split the data: 80% of it will be the training data while the rest will be the test data.
set.seed(123)
row_data <- nrow(df_clean)
index <- sample(row_data, row_data*0.8)
data_train <- df_clean[ index, ]
data_test <- df_clean[-index, ]
Don't forget to check the proportion of the target variable in the training data to see whether there is any class imbalance.
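That check is a one-liner; as a sketch:
# Class proportion of the target variable in the training data
prop.table(table(data_train$user_suggestion))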
##
## 0 1
## 0.4315382 0.5684618
We will check how many unique words we have in our corpus.
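The counting chunk is hidden; a quick sketch using tidytext (assuming the count is taken over the cleaned text of the whole corpus) could be:
# Count the number of distinct tokens in the cleaned reviews
df_clean %>%
  unnest_tokens(word, text_clean) %>%
  pull(word) %>%
  n_distinct()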
## [1] 44206
We will transform the text data into tokens using the tokenizer from the keras library. The number of words used during model training is set manually, with words ranked by their frequency in the corpus in descending order.
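The tokenizer chunk is not shown; a sketch is given below, where the values of num_words and maxlen are assumptions inferred from the model summary and the padded matrix dimensions shown later:
num_words <- 40000 # keep only the most frequent words (inferred, not shown in the original)
maxlen <- 250 # maximum sequence length (inferred from the padded matrix below)
tokenizer <- text_tokenizer(num_words = num_words) %>%
  fit_text_tokenizer(data_train$text_clean)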
Since the length of the texts can differ, we will pad the sequences so that all texts have the same length, inserting the value 0 when a text is too short. We will use the pre-padding method, which places the zeros at the beginning of the sequence so that information at the end is not lost by the LSTM layer. See this paper, which discusses it. Since we only use a maximum length of 250 words while some reviews have more than 250 words, we will truncate them and keep only the first 250 words using the post truncating type.
train_x <- texts_to_sequences(tokenizer, data_train$text_clean) %>%
pad_sequences(maxlen = maxlen, padding = "pre", truncating = "post")
test_x <- texts_to_sequences(tokenizer, data_test$text_clean) %>%
pad_sequences(maxlen = maxlen, padding = "pre", truncating = "post")
# Transform the target variable on data train
train_y <- data_train$user_suggestion
Let's check the dimension of the features.
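For example, a quick check of the padded training matrix:
# Number of observations and sequence length
dim(train_x)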
## [1] 13964 250
The model will use an embedding layer as the input layer. In deep learning frameworks such as Keras, the embedding layer learns to map text data into numerical vectors that represent the closeness of meaning between words. The context and features of the text will then be extracted using a recurrent LSTM layer. If you are unfamiliar with either layer, I recommend reading this article. To avoid overfitting, we will use an Elastic Net regularizer, which uses both the L1 and L2 norms to penalize the loss function. You can read about it here.
# Set Random Seed for Initial Weight
tensorflow::tf$random$set_seed(123)
# Build model architecture
model <- keras_model_sequential(name = "lstm_model") %>%
layer_embedding(name = "input",
input_dim = num_words,
input_length = maxlen,
output_dim = 8
) %>%
layer_lstm(name = "LSTM",
units = 8,
kernel_regularizer = regularizer_l1_l2(l1 = 0.05, l2 = 0.05),
return_sequences = F
) %>%
layer_dense(name = "Output",
units = 1,
activation = "sigmoid"
)
model
## Model
## Model: "lstm_model"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## input (Embedding) (None, 250, 8) 320000
## ________________________________________________________________________________
## LSTM (LSTM) (None, 8) 544
## ________________________________________________________________________________
## Output (Dense) (None, 1) 9
## ================================================================================
## Total params: 320,553
## Trainable params: 320,553
## Non-trainable params: 0
## ________________________________________________________________________________
The model will be trained using the adam optimizer with a learning rate of 0.001 and the binary cross-entropy loss function. The model is trained for 9 epochs with a batch size of 64. We will also use 10% of the training data as a validation set to monitor whether the model starts to overfit after some time.
model %>%
compile(optimizer = optimizer_adam(lr = 0.001),
metrics = "accuracy",
loss = "binary_crossentropy"
)
epochs <- 9
batch_size <- 64
train_history <- model %>%
fit(x = train_x,
y = train_y,
batch_size = batch_size,
epochs = epochs,
validation_split = 0.1, # 10% validation data
# print progress but don't create graphic
verbose = 1,
view_metrics = 0
)
plot(train_history) +
  geom_line()
Now we will predict the test data using the trained model.
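The prediction chunk is hidden; a minimal sketch, assuming the class label is obtained by thresholding the predicted probability at 0.5, could be:
# Predict the probability of a positive review and convert it to a 0/1 label
pred_prob <- predict(model, test_x)
pred_test <- ifelse(pred_prob > 0.5, 1, 0)
head(pred_test, 10)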
## [,1]
## [1,] 1
## [2,] 1
## [3,] 1
## [4,] 1
## [5,] 1
## [6,] 1
## [7,] 1
## [8,] 0
## [9,] 1
## [10,] 1
Let’s check the confusion matrix.
decode <- function(x) as.factor(ifelse(x == 0, "Not Recommended", "Recommended"))
pred_class <- decode(pred_test)
true_class <- decode(data_test$user_suggestion)
# Confusion Matrix
table("Prediction" = pred_class, "Actual" = true_class)
## Actual
## Prediction Not Recommended Recommended
## Not Recommended 1225 322
## Recommended 263 1681
Finally, we can evaluate the model using the main performance metrics: accuracy, recall, precision, and F1-score.
data.frame(
  Accuracy = accuracy_vec(truth = true_class, estimate = pred_class),
  Recall = sens_vec(truth = true_class, estimate = pred_class),
  Precision = precision_vec(truth = true_class, estimate = pred_class),
  F1 = f_meas_vec(truth = true_class, estimate = pred_class)
)
Now we will compare the deep learning model with a simpler Naive Bayes model, which is often used as a benchmark.
We will manually tokenize the text and use the tidytext package to get the job done.
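The tokenization chunk itself is not shown; a minimal sketch (the object names train_token and test_token are assumed to match the later chunks) could be:
# One row per token, keeping the review id for the Document-Term Matrix later
train_token <- data_train %>%
  select(review_id, text_clean) %>%
  unnest_tokens(word, text_clean)
test_token <- data_test %>%
  select(review_id, text_clean) %>%
  unnest_tokens(word, text_clean)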
Unlike the LSTM, which can capture the context of a sequence of text, Naive Bayes only cares about the bag of words, i.e. the frequency (or sometimes the TF-IDF value) of each word in the corpus. Here, we will also remove unnecessary stop words.
The goal of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Here we will use the hunspell package to do word stemming. Hunspell is the spell-checker library used by LibreOffice, OpenOffice, Mozilla Firefox, Google Chrome, Mac OS X, InDesign, Opera, RStudio and many others. It provides a system for tokenizing, stemming and spell checking in almost any language or alphabet. Hunspell uses a special dictionary format that defines which characters, words and conjugations are valid in a given language.
stem_hunspell <- function(term) {
# look up the term in the dictionary
stems <- hunspell_stem(term)[[1]]
if (length(stems) == 0) { # if there are no stems, use the original term
stem <- term
} else { # if there are multiple stems, use the last one
stem <- stems[[length(stems)]]
}
return(stem)
}
We use parallel computing to do the hunspell stemming.
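The stemming chunk is not shown; a sketch, assuming stem_hunspell is applied to every token with furrr (reusing the multisession plan set up earlier), could be:
# Stem every token in the training and test sets in parallel
train_token <- train_token %>%
  mutate(word = future_map_chr(word, stem_hunspell))
test_token <- test_token %>%
  mutate(word = future_map_chr(word, stem_hunspell))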
Next, we will find the words that appear in at least 80% of all documents and the words that appear in 5 or fewer documents in the training data. The purpose of this step is to remove common words and rare words, which may hold little information.
# Count the number of documents in which each word appears
frequent_token <- train_token %>%
count(review_id, word) %>%
count(word, name = "appearance") %>%
arrange(desc(appearance))
number_of_document <- n_distinct(train_token$review_id)
# Get words that appear in at least 80% of all documents
top_word <- frequent_token %>%
filter(appearance >= (number_of_document * 0.8)) %>%
pull(word)
# Get words that appear in 5 or fewer documents
low_word <- frequent_token %>%
filter(appearance <= 5) %>%
pull(word)
custom_stop_word <- c(top_word, low_word)
head(custom_stop_word, 30)
## [1] "i" "the" "game" "and" "to"
## [6] "a" "it" "aber" "abilties" "abound"
## [11] "accelerator" "accually" "ache" "achievments" "actives"
## [16] "actully" "acutally" "adaptation" "administrate" "advocate"
## [21] "africa" "agame" "aggravate" "ahve" "alias"
## [26] "alienware" "allthough" "altitude" "altough" "amazon"
We then filter out the custom stop words from our tokenized data.
train_token <- train_token %>%
filter(!word %in% custom_stop_word)
test_token <- test_token %>%
filter(!word %in% custom_stop_word)
Finally, we will create the Document-Term Matrix.
train_dtm <- train_token %>%
count(review_id, word) %>%
cast_dtm(document = review_id,
term = word,
value = n)
test_dtm <- test_token %>%
count(review_id, word) %>%
cast_dtm(document = review_id,
term = word,
value = n)
We then convert the values in the Document-Term Matrix into categorical values that indicate whether a word is present (frequency > 0) in a document or not.
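That conversion chunk is hidden; a common pattern, and the one assumed here, is to apply a small helper function over the columns of each matrix (the names bernoulli_conv, train_x_nb and test_x_nb are ours):
# Convert term counts into a presence/absence label
bernoulli_conv <- function(x) {
  ifelse(x > 0, "Present", "Absent")
}
train_x_nb <- apply(train_dtm, MARGIN = 2, FUN = bernoulli_conv)
test_x_nb <- apply(test_dtm, MARGIN = 2, FUN = bernoulli_conv)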
Now that all the data are properly processed, we will build the Naive Bayes model.
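The fitting chunk is not shown; a sketch using e1071, with the converted matrices from the previous sketch and a Laplace smoothing value that is our own assumption, could be:
# Target labels for the documents that remain in the training DTM
train_y_nb <- data_train %>%
  filter(review_id %in% train_dtm$dimnames$Docs) %>%
  pull(user_suggestion) %>%
  decode()
# Fit the Naive Bayes classifier
model_nb <- naiveBayes(x = train_x_nb, y = train_y_nb, laplace = 1)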
We then make predictions on the test data.
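As a sketch, continuing with the objects defined above:
pred_test <- predict(model_nb, newdata = test_x_nb)
head(pred_test)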
## [1] Not Recommended Recommended Recommended Recommended
## [5] Not Recommended Recommended
## Levels: Not Recommended Recommended
Let's check the confusion matrix.
test_y <- data_test %>%
filter(review_id %in% test_dtm$dimnames$Docs) %>%
pull(user_suggestion) %>%
decode()
# Confusion Matrix
table("Prediction" = pred_test, "Actual" = test_y)
## Actual
## Prediction Not Recommended Recommended
## Not Recommended 518 626
## Recommended 970 1377
Finally, let’s check the model performance.
data.frame(Accuracy = accuracy_vec(truth = test_y, estimate = pred_test),
  Recall = sens_vec(truth = test_y, estimate = pred_test),
  Precision = precision_vec(truth = test_y, estimate = pred_test),
  "F1 Score" = f_meas_vec(truth = test_y, estimate = pred_test),
  check.names = F
)