In this post, we will build a text classification model using a deep learning architecture, then compare it with a Naive Bayes classifier as a benchmark.

1 Library and Setup

The following code contains all the required packages and setup for this post. All source code is provided on my GitHub repository.

# Data Wrangling
library(tidyverse)

# Text Preprocessing
library(tidytext)
library(textclean)
library(hunspell)

# Model Evaluation
library(yardstick)

# Naive Bayes
library(e1071)

# Deep Learning
library(keras)
use_condaenv("r-tensorflow")

# ggplot2 Plot Configuration
theme_set(theme_minimal() +
            theme(legend.position = "top")
          )

2 Data

Data are collected from the Analytics Vidhya JanataHack: NLP Hackathon. The dataset consists of 5 columns and around 17,000 observations.

df <- read.csv("data/steam_review.csv")

glimpse(df)
## Rows: 17,494
## Columns: 5
## $ review_id       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ title           <chr> "Spooky's Jump Scare Mansion", "Spooky's Jump Scare M…
## $ year            <int> 2016, 2016, 2016, 2015, 2015, 2015, 2017, 2015, 2015,…
## $ user_review     <chr> "I'm scared and hearing creepy voices.  So I'll pause…
## $ user_suggestion <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

Data Description:

  • review_id: id of the review
  • title: Title of the game
  • year: Year in which the review was posted
  • user_review: Review of the user
  • user_suggestion: Whether the game was marked Recommended (1) or Not Recommended (0) by the user

3 Text Preprocessing

3.1 Text Cleansing

We will start by cleansing the data with the following steps:

  • Make all characters lowercase
  • Remove all non-ASCII characters
  • Remove all mentions of names (@…)
  • Remove the phrase “early access review”
  • Remove all URL links
  • Remove all hashtags (#…)
  • Remove all HTML tags
  • Replace contracted words with their full forms
  • Replace word elongations
  • Replace ? with the explicit token questionmark
  • Replace ! with the explicit token exclamationmark
  • Remove all punctuation
  • Remove all numbers
  • Remove unnecessary white space

Since the process takes quite a long time to run, we will use parallel computing via the furrr package to get the job done.

cleansing_text <- function(x) x %>% 
                       replace_non_ascii() %>% 
                       tolower() %>% 
                       str_replace_all(pattern = "\\@.*? |\\@.*?[:punct:]", replacement = " ") %>% 
                       str_remove(pattern = "early access review") %>%
                       replace_url() %>% 
                       replace_hash() %>% 
                       replace_html() %>% 
                       replace_contraction() %>% 
                       replace_word_elongation() %>% 
                       str_replace_all("\\?", " questionmark") %>% 
                       str_replace_all("\\!", " exclamationmark") %>% 
                       str_replace_all("[:punct:]", " ") %>% 
                       str_replace_all("[:digit:]", " ") %>% 
                       str_trim() %>% 
                       str_squish()

cleansing_text("I really love this game !!!")
## [1] "i really love this game exclamationmark exclamationmark exclamationmark"

library(furrr)
plan(multisession, workers = 4) # Using 4 CPU cores

df_clean <- df %>% 
  mutate(
    text_clean = user_review %>% 
      future_map_chr(cleansing_text)
    ) 

glimpse(df_clean)
## Rows: 17,494
## Columns: 6
## $ review_id       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ title           <chr> "Spooky's Jump Scare Mansion", "Spooky's Jump Scare M…
## $ year            <int> 2016, 2016, 2016, 2015, 2015, 2015, 2017, 2015, 2015,…
## $ user_review     <chr> "I'm scared and hearing creepy voices.  So I'll pause…
## $ user_suggestion <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ text_clean      <chr> "I am scared and hearing creepy voices so I will paus…

Next, we need to check the length of each review after cleansing, since some texts may have only 1 or 2 words left.

word_count <- map_dbl(df_clean$text_clean, function(x) str_split(x, " ") %>% 
                        unlist() %>% 
                        length()
                      )

summary(word_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      51      84     141     158    1601

We will filter the dataset, keeping only texts with more than 3 words.

df_clean <- df_clean %>% 
  filter(word_count > 3)

glimpse(df_clean)
## Rows: 17,455
## Columns: 6
## $ review_id       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ title           <chr> "Spooky's Jump Scare Mansion", "Spooky's Jump Scare M…
## $ year            <int> 2016, 2016, 2016, 2015, 2015, 2015, 2017, 2015, 2015,…
## $ user_review     <chr> "I'm scared and hearing creepy voices.  So I'll pause…
## $ user_suggestion <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ text_clean      <chr> "I am scared and hearing creepy voices so I will paus…

4 Train-Test Split

We will split the data: 80% will be used as the training set, while the rest will be the test set.

set.seed(123)
row_data <- nrow(df_clean)
index <- sample(row_data, row_data*0.8)

data_train <- df_clean[ index, ]
data_test <- df_clean[-index, ]

Don’t forget to check the proportion of the target variable in the training data to see whether there is any class imbalance.

table(data_train$user_suggestion) %>% 
  prop.table()
## 
##         0         1 
## 0.4315382 0.5684618

The two classes are reasonably balanced (roughly 43:57), so we can proceed without resampling.

5 Deep Learning with LSTM

5.1 Tokenization

We will check how many unique words we have in our corpus.

paste(data_train$text_clean, collapse = " ") %>% 
  str_split(" ") %>% 
  unlist() %>% 
  n_distinct()
## [1] 44206

We will transform the text data into tokens using the tokenizer from the keras library. The number of words to use during model training is set manually; the tokenizer keeps words ranked by their frequency in the corpus in descending order.

num_words <- 40 * 1e3 # keep only the 40,000 most frequent words

tokenizer <- text_tokenizer(num_words = num_words) %>% 
  fit_text_tokenizer(data_train$text_clean)

# Maximum number of words (tokens) to keep per review
maxlen <- 250
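
As a quick sanity check, we can see how the tokenizer maps a cleaned sentence to integer indices. The exact indices depend on the fitted corpus, so treat the call below as illustrative.

# Illustrative: each word is replaced by its frequency rank in the vocabulary
texts_to_sequences(tokenizer, "i really love this game")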

5.2 Padding Text Sequence

Since the lengths of the texts differ, we will pad the sequences so that all texts have the same length, inserting 0 values where a text is short. We will use the pre-padding method, which helps ensure that information from the LSTM layer is not lost; see this paper for a discussion. Since we use a maximum length of 250 words while some reviews have more than 250 words, we will truncate and keep only the first 250 words using the post-truncating type.
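
To make the behavior concrete, here is a minimal sketch with a toy maxlen of 5: the short sequence gets zeros in front, while the long one keeps only its first five elements.

# Toy example of pre-padding and post-truncating with maxlen = 5
toy <- list(c(3, 7), c(1, 2, 3, 4, 5, 6, 7))
pad_sequences(toy, maxlen = 5, padding = "pre", truncating = "post")
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    3    7
## [2,]    1    2    3    4    5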

train_x <- texts_to_sequences(tokenizer, data_train$text_clean) %>% 
  pad_sequences(maxlen = maxlen, padding = "pre", truncating = "post")

test_x <- texts_to_sequences(tokenizer, data_test$text_clean) %>% 
  pad_sequences(maxlen = maxlen, padding = "pre", truncating = "post")

# Target variable for the training data
train_y <- data_train$user_suggestion

Let’s check the dimension of the feature.

dim(train_x)
## [1] 13964   250

5.3 Model Architecture

The model will use an embedding layer as the input layer. In deep learning frameworks such as Keras, the embedding layer learns to represent text as numerical vectors that capture the closeness of meaning between words. The context and features of the text will then be extracted by the recurrent LSTM layer. If you are unfamiliar with either layer, I recommend you read this article. To avoid overfitting, we will use the elastic net regularizer, which uses both the L1 and L2 norms to penalize the loss function. You can read about it here.
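
For reference, the penalty that regularizer_l1_l2(l1, l2) adds to the loss has the general form (here with l1 = l2 = 0.05):

$$\text{penalty} = l_1 \sum_j |w_j| + l_2 \sum_j w_j^2$$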

# Set Random Seed for Initial Weight
tensorflow::tf$random$set_seed(123)

# Build model architecture
model <- keras_model_sequential(name = "lstm_model") %>% 
  layer_embedding(name = "input",
                  input_dim = num_words,
                  input_length = maxlen,
                  output_dim = 8
                  ) %>% 
  layer_lstm(name = "LSTM",
             units = 8,
             kernel_regularizer = regularizer_l1_l2(l1 = 0.05, l2 = 0.05),
             return_sequences = FALSE
             ) %>% 
  layer_dense(name = "Output",
              units = 1,
              activation = "sigmoid"
              )

model
## Model
## Model: "lstm_model"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## input (Embedding)                   (None, 250, 8)                  320000      
## ________________________________________________________________________________
## LSTM (LSTM)                         (None, 8)                       544         
## ________________________________________________________________________________
## Output (Dense)                      (None, 1)                       9           
## ================================================================================
## Total params: 320,553
## Trainable params: 320,553
## Non-trainable params: 0
## ________________________________________________________________________________
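
As a sanity check, the parameter counts in the summary can be reproduced by hand.

# Embedding: one 8-dimensional vector per word in the vocabulary
40000 * 8               # 320,000
# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
4 * (8 * 8 + 8 * 8 + 8) # 544
# Dense: 8 weights plus 1 bias
8 * 1 + 1               # 9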

5.4 Model Fitting

The model will be trained using the adam optimizer with a learning rate of 0.001 and the binary cross-entropy loss function, for 9 epochs with a batch size of 64. We will also use 10% of the training dataset as a validation set to monitor whether the model starts to overfit.

model %>% 
  compile(optimizer = optimizer_adam(lr = 0.001),
          metrics = "accuracy",
          loss = "binary_crossentropy"
          )

epochs <- 9
batch_size <- 64

train_history <- model %>% 
  fit(x = train_x,
      y = train_y,
      batch_size = batch_size,
      epochs = epochs,
      validation_split = 0.1, # 10% validation data
      
      # print progress but don't create graphic
      verbose = 1,
      view_metrics = 0
      )

plot(train_history) +
  geom_line()

5.5 Model Evaluation

Now we will predict the test data using the trained model.

pred_test <- predict_classes(model, test_x)

head(pred_test, 10)
##       [,1]
##  [1,]    1
##  [2,]    1
##  [3,]    1
##  [4,]    1
##  [5,]    1
##  [6,]    1
##  [7,]    1
##  [8,]    0
##  [9,]    1
## [10,]    1
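
Note that predict_classes() has been removed from recent versions of Keras. If it is unavailable in your setup, thresholding the sigmoid output at 0.5 gives the same predicted classes:

# Equivalent to predict_classes() for a single sigmoid output unit
pred_test <- ifelse(predict(model, test_x) > 0.5, 1, 0)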

Let’s check the confusion matrix.

decode <- function(x) as.factor(ifelse(x == 0, "Not Recommended", "Recommended"))

pred_class <- decode(pred_test)
true_class <- decode(data_test$user_suggestion)

# Confusion Matrix
table("Prediction" = pred_class, "Actual" = true_class)
##                  Actual
## Prediction        Not Recommended Recommended
##   Not Recommended            1225         322
##   Recommended                 263        1681

Finally, we can evaluate the model using the main performance metrics: accuracy, recall, precision, and F1 score.

data.frame(
           Accuracy = accuracy_vec(truth = true_class, estimate = pred_class),
           Recall = sens_vec(truth = true_class, estimate = pred_class),
           Precision = precision_vec(truth = true_class, estimate = pred_class),
           F1 = f_meas_vec(truth = true_class, estimate = pred_class)
           )

6 Naive Bayes

Now we will compare the deep learning model with the simpler Naive Bayes model, which is often used as a benchmark.

6.1 Tokenization

We will manually tokenize the text using the tidytext package.

train_token <- data_train %>% 
  select(review_id, text_clean) %>% 
  unnest_tokens(output = "word",
                input = text_clean)

test_token <- data_test %>% 
  select(review_id, text_clean) %>% 
  unnest_tokens(output = "word",
                input = text_clean)

6.2 Remove Stop Words

Unlike LSTM, which can capture the context of a sequence of text, Naive Bayes only cares about the bag of words: the frequency (sometimes the TF-IDF value) of each word in the corpus. Here, we will remove the unnecessary stop words.

train_token <- train_token %>% 
  filter(!word %in% stop_words$word)

test_token <- test_token %>% 
  filter(!word %in% stop_words$word)

6.3 Stemming

The goal of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Here we will use the hunspell package to do word stemming. Hunspell is the spell-checker library used by LibreOffice, OpenOffice, Mozilla Firefox, Google Chrome, macOS, InDesign, Opera, RStudio, and many others. It provides a system for tokenizing, stemming, and spelling in almost any language or alphabet. Hunspell uses a special dictionary format that defines which characters, words, and conjugations are valid in a given language.

stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell_stem(term)[[1]]
  
  if (length(stems) == 0) { # if there are no stems, use the original term
    stem <- term
  } else { # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }
  return(stem)
}
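
A quick check of the helper (the exact stems returned depend on the installed en_US dictionary, so treat this as illustrative):

stem_hunspell("walking") # likely returns "walk"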

We use parallel computing to speed up the hunspell stemming.

library(furrr)
plan(multisession, workers = 4) # number of cpu core

train_token <- train_token %>% 
  mutate(word = future_map_chr(word, stem_hunspell))

test_token <- test_token %>% 
  mutate(word = future_map_chr(word, stem_hunspell))

6.4 Document-Term Matrix

Next, we will find words that appear in at least 80% of all documents and words that appear in 5 or fewer documents in the training data. The purpose of this step is to remove common words and rare words, both of which may hold little information.

# Find number of word appearance in the corpus
frequent_token <- train_token %>% 
  count(review_id, word) %>% 
  count(word, name = "appearance") %>% 
  arrange(desc(appearance))

number_of_document <- n_distinct(train_token$review_id)

# Get word that appear in at least 80% of all document
top_word <- frequent_token %>% 
  filter(appearance >= (number_of_document * 0.8)) %>% 
  pull(word)

# Get word that appear in less than 5 document
low_word <- frequent_token %>% 
  filter(appearance <= 5) %>% 
  pull(word)

custom_stop_word <- c(top_word, low_word)

head(custom_stop_word, 30)
##  [1] "i"            "the"          "game"         "and"          "to"          
##  [6] "a"            "it"           "aber"         "abilties"     "abound"      
## [11] "accelerator"  "accually"     "ache"         "achievments"  "actives"     
## [16] "actully"      "acutally"     "adaptation"   "administrate" "advocate"    
## [21] "africa"       "agame"        "aggravate"    "ahve"         "alias"       
## [26] "alienware"    "allthough"    "altitude"     "altough"      "amazon"

We filter the custom stop words from our tokenized data.

train_token <- train_token %>% 
  filter(!word %in% custom_stop_word)

test_token <- test_token %>% 
  filter(!word %in% custom_stop_word)

Finally, we will create the Document-Term Matrix.

train_dtm <- train_token %>% 
  count(review_id, word) %>% 
  cast_dtm(document = review_id,
           term = word,
           value = n) 

test_dtm <- test_token %>% 
  count(review_id, word) %>% 
  cast_dtm(document = review_id,
           term = word,
           value = n)

We then convert the values in the Document-Term Matrix into a categorical form: whether the word is present (frequency > 0) or not.

bernoulli_conv <- function(x){
        as.factor(ifelse(x > 0, 1, 0))
}
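
A quick check of the conversion:

# Counts become presence/absence factors
bernoulli_conv(c(0, 3, 1))
## [1] 0 1 1
## Levels: 0 1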

train_bn <- apply(train_dtm, 2, bernoulli_conv)
test_bn <- apply(test_dtm, 2, bernoulli_conv)

6.5 Model Fitting

After all data is properly processed, now we will build the Naive Bayes model.

train_y <- data_train %>% 
  filter(review_id %in% train_dtm$dimnames$Docs) %>% 
  pull(user_suggestion) %>% 
  decode()

model_bayes <- naiveBayes(train_bn, train_y)
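
One design note: since many word-class combinations never appear in the training data, Laplace smoothing is worth considering to avoid zero conditional probabilities. e1071 exposes this through the laplace argument; a possible variant (not used for the results below):

# Optional variant with add-one (Laplace) smoothing
model_bayes_smooth <- naiveBayes(train_bn, train_y, laplace = 1)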

6.6 Model Evaluation

We then make predictions on the test data.

pred_test <- predict(model_bayes, test_bn, type = "class")

head(pred_test)
## [1] Not Recommended Recommended     Recommended     Recommended    
## [5] Not Recommended Recommended    
## Levels: Not Recommended Recommended

Check the confusion matrix.

test_y <- data_test %>% 
  filter(review_id %in% test_dtm$dimnames$Docs) %>% 
  pull(user_suggestion) %>% 
  decode()

# Confusion Matrix
table("Prediction" = pred_test, "Actual" = test_y)
##                  Actual
## Prediction        Not Recommended Recommended
##   Not Recommended             518         626
##   Recommended                 970        1377

Finally, let’s check the model performance.

data.frame(Accuracy = accuracy_vec(truth = test_y, estimate = pred_test),
           Recall = sens_vec(truth = test_y, estimate = pred_test),
           Precision = precision_vec(truth = test_y, estimate = pred_test),
           "F1 Score" = f_meas_vec(truth = test_y, estimate = pred_test),
           check.names = FALSE
           )