# Data Wrangling
library(dplyr)
library(tidyr)
library(stringr)
library(tidyverse)
library(lubridate)

# Plotting
library(ggplot2)

# Text Preprocessing
library(textclean)
# devtools::install_github('nurandi/katadasaR')  # one-time install; run once, then keep commented out
library(katadasaR)
library(tokenizers)
library(textstem)
library(stopwords)
library(tm)
library(tidytext)

# Cross Validation
library(rsample)

# Machine Learning
## Naive Bayes
library(e1071)
## LSTM
library(keras)
library(tensorflow)

# Machine Learning Evaluation
library(caret)
library(lime)

Information about Data Used

The SMS dataset is a curated collection of text messages, each labelled spam or ham. As users, we do not want to waste time reading spam, so we would like to know whether a message entering our inbox is spam or ham. That way we can classify the spam texts and move them to a separate inbox, preventing them from cluttering the main one. We are going to use machine learning models to build a spam/ham classifier.

Data Wrangling

There is:

  1. train data: used to train and validate the initial model, with prior knowledge of whether each text is spam or ham
  2. test data: the data set to be predicted, with no prior knowledge of whether each text is spam or ham
text_train <- read.csv("data_input/data-train.csv")
head(text_train)
text_test <- read.csv("data_input/data-test.csv")
head(text_test)

Distribution plot of total hourly frequency for each status

# Data preparation
text_train_plot <- text_train %>% 
                    mutate(datetime=ymd_hms(datetime),
                           hourly=hour(datetime),
                           status=as.factor(status)) %>% 
                    select(c(hourly, status)) %>% 
                    group_by(hourly, status) %>% 
                    summarise(total_count=n())

# Plotting using line plot                    
ggplot(text_train_plot, aes(x=hourly, y=total_count, group=status)) +
      geom_line(aes(colour=status), size=0.75)+
      geom_point(aes(colour=status), size=2) + theme_minimal()+
      labs(y="Total spam/ham",
           x="Hour",
           colour=NULL)

Checking NAs on data test and data train

colSums(is.na(text_train))
#> datetime     text   status 
#>        0        0        0
colSums(is.na(text_test))
#> datetime     text   status 
#>        0        0      283

Dropping the all-NA status column from the test data

text_test <- text_test[,c("datetime", "text")]

Cleaning data test

text_test_cln <- text_test %>% 
                    select(-c(datetime))

Cleaning data train

text_train <- text_train %>% 
                    mutate(status=as.factor(status))
                    
                           
text_train_cln <- text_train %>% 
                    select(-c(datetime))

Text (SMS) case folding

An important step in text preprocessing, because texts arrive in inconsistent forms. We need to convert all text to lower case and remove symbols, numbers, and other noise.

Cleaning text of HTML, URLs, symbols, etc. using the textclean library

strip() from textclean removes unnecessary symbols, numbers, extra whitespace, etc. Text messages rarely use emoticons, so we will not be on the lookout for those.
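As a minimal illustration of what this pipeline does, here is an invented message (not a row of the dataset) run through the same steps; the exact output depends on textclean's regex rules:

# hedged illustration on a made-up message, not a row of the dataset
"<b>PROMO</b> kuota 10GB! cek http://contoh.id skrg!!" %>%
  replace_html() %>%
  replace_url() %>%
  replace_symbol() %>%
  strip()
# roughly: "promo kuota gb cek skrg" (tags and URL removed, lower-cased,
# numbers and punctuation stripped)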

text_train_cln$text <- text_train_cln$text %>% 
                          replace_html() %>%
                          replace_url() %>% 
                          replace_symbol() %>% 
                          strip()

text_test_cln$text <- text_test_cln$text %>% 
                          replace_html() %>%
                          replace_url() %>% 
                          replace_symbol() %>% 
                          strip()

Replacing Indonesian internet slang for data train

# # import Indonesian lexicon
# spell.lex <- read.csv("data_input/colloquial-indonesian-lexicon.csv")
# 
# # replace internet slang
# text_train_cln$text <- replace_internet_slang(text_train_cln$text, slang = paste0("\\b", spell.lex$slang, "\\b"),
#                                  replacement = spell.lex$formal, ignore.case = TRUE)

Replacing Indonesian internet slang for data test

# # import Indonesian lexicon
# spell.lex <- read.csv("data_input/colloquial-indonesian-lexicon.csv")
# 
# # replace internet slang
# text_test_cln$text <- replace_internet_slang(text_test_cln$text, slang = paste0("\\b", spell.lex$slang, "\\b"),
#                                  replacement = spell.lex$formal, ignore.case = TRUE)
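As a small, hedged illustration of what the commented-out step does, here is the same call with a two-entry toy lexicon standing in for colloquial-indonesian-lexicon.csv:

# toy lexicon (invented), standing in for the real CSV
toy_slang  <- c("gmn", "skrg")
toy_formal <- c("bagaimana", "sekarang")

replace_internet_slang("Gmn kabarnya skrg?",
                       slang = paste0("\\b", toy_slang, "\\b"),
                       replacement = toy_formal,
                       ignore.case = TRUE)
# expected: "bagaimana kabarnya sekarang?"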

Saving variable with internet slang cleaned

# saveRDS(text_train_cln, file = "data_input/train_clean.rds")
# saveRDS(text_test_cln, file = "data_input/test_clean.rds")

Reading variable with internet slang cleaned

train_clean <- readRDS("data_input/train_clean.rds")
test_clean <- readRDS("data_input/test_clean.rds")

Word cloud

We will look at which words, based on the term frequency of each word or token, appear most often under the spam and ham labels.

  1. Spam
library(wordcloud)

spam <- train_clean %>% 
          filter(status=="spam") %>% 
          select(text)

wordcloud(words=as.matrix(spam),
          max.words = 20,
          scale = c(2.5,1.25),
          random.order = F,
          rot.per = 0.5,
          colors = brewer.pal(3,"Dark2")) # brewer.pal requires n >= 3

  2. Ham
ham <- train_clean %>% 
          filter(status=="ham") %>% 
          select(text)

wordcloud(words=as.matrix(ham),
          max.words = 20,
          scale = c(2.5,1.25),
          random.order = F,
          rot.per = 0.5,
          colors = brewer.pal(3,"Dark2")) # brewer.pal requires n >= 3

Text (SMS) pre-processing

Stemming words

To stem words into the simplest form they can take, e.g. abaikan and mengabaikan both reduce to the root abai, so they can be recognized as one type of word. A quick check of the stemmer is sketched below.
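A hedged sanity check on single words (the expected roots assume katadasaR's standard behaviour):

# katadasar() stems one word at a time
katadasar("abaikan")     # expected root: "abai"
katadasar("menjalankan") # expected root: "jalan"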

# stemming <- function(x){
#   paste(lapply(x,katadasar),collapse = " ")}
# 
# text_train_cln1 <- lapply(tokenize_words(train_clean$text), stemming)

# stemming <- function(x){
#   paste(lapply(x,katadasar),collapse = " ")}
# 
# text_test_cln1 <- lapply(tokenize_words(test_clean$text), stemming)

Tokenizer

Splitting phrases, sentences, or paragraphs into smaller units (usually words) called tokens; a small example is shown below.
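For instance, on an invented sentence (tokenize_words also lower-cases and drops punctuation):

tokenize_words("Paket internet MURAH, cek sekarang!")
# expected: list(c("paket", "internet", "murah", "cek", "sekarang"))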

# text_train_cln1 <- tokenize_words(text_train_cln1)

# text_test_cln1 <- tokenize_words(text_test_cln1)

Saving variable that is stemmed and tokenized

# saveRDS(text_train_cln1, file = "data_input/train_clean_final.rds")
# saveRDS(text_test_cln1, file = "data_input/test_clean_final.rds")

Dropping stop words

Dropping stop words, i.e. words that are commonly used but carry little meaning. In English these include the, is, and; in Indonesian, words like adalah, atas, bagi. A small illustration is shown below.
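A hedged illustration with an invented sentence and a two-word stop list (the real run below uses the full stopwords-id.txt):

tokenize_words("ini adalah contoh pesan singkat",
               stopwords = c("ini", "adalah"))
# expected: list(c("contoh", "pesan", "singkat"))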

text_train_clean <- readRDS("data_input/train_clean_final.rds")
text_test_clean <- readRDS("data_input/test_clean_final.rds")

myStopwords <- readLines("data_input/stopwords-id.txt")
text_train_clean <- text_train_clean %>% as.character() %>% tokenize_words(stopwords = myStopwords)
text_test_clean <- text_test_clean %>% as.character() %>% tokenize_words(stopwords = myStopwords)

Document Term Matrix

text_train_dtm <- DocumentTermMatrix(text_train_clean)
inspect(text_train_dtm)
#> <<DocumentTermMatrix (documents: 2004, terms: 2665)>>
#> Non-/sparse entries: 20522/5320138
#> Sparsity           : 100%
#> Maximal term length: 46
#> Weighting          : term frequency (tf)
#> Sample             :
#>       Terms
#> Docs   "and", "gb", "kuota", "number", "paket", "pulsa", "rp", "sms", "ya",
#>   1343      0     0        0         0        0        2     3      0     0
#>   1910      2     0        0         0        0        0     0      0     0
#>   1924      2     0        0         0        0        0     0      0     0
#>   193       0     0        0         0        0        0     0      0     0
#>   197       1     0        0         0        0        0     0      1     0
#>   1983      2     0        0         0        0        0     0      0     0
#>   225       0     0        0         0        0        0     0      0     0
#>   344       0     0        0         0        0        0     0      0     0
#>   409       0     0        0         0        0        0     0      0     2
#>   410       0     0        0         1        2        2     6      0     0
#>       Terms
#> Docs   c("c",
#>   1343      1
#>   1910      1
#>   1924      1
#>   193       1
#>   197       1
#>   1983      1
#>   225       1
#>   344       1
#>   409       1
#>   410       1
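A side note: sampled terms such as "and", and c("c", suggest the token lists were coerced to character with as.character(), so fragments of R's list-printing syntax leaked into the vocabulary. A minimal sketch of a cleaner route, assuming we collapse each token list back into a plain string before building the DTM:

# hedged sketch: paste tokens back into space-separated documents
train_docs <- sapply(text_train_clean, paste, collapse = " ")
text_train_dtm <- DocumentTermMatrix(Corpus(VectorSource(train_docs)))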

Cross validation for Naive Bayes

Splitting train text to text_tn and text_vd

RNGkind(sample.kind = "Rounding")
set.seed(100)

index <- sample(nrow(text_train_dtm), nrow(text_train_dtm) * 0.75)
text_tn <- text_train_dtm[index,]
text_vd <- text_train_dtm[-index,]
label_tn <- text_train[index, "status"]
label_vd <- text_train[-index, "status"]

Checking proportion

prop.table(table(label_tn))
#> label_tn
#>      ham     spam 
#> 0.584165 0.415835
prop.table(table(label_vd))
#> label_vd
#>       ham      spam 
#> 0.5668663 0.4331337

Further Data Preprocessing

Bernoulli Converter

Values in the text_tn matrix are still raw term frequencies. We will convert each entry to whether the term appears (1) or not (0). One way is to use a Bernoulli converter.

  • If frequency > 0, then the value is 1 (appears)
  • If frequency == 0, then the value is 0 (does not appear)
bernoulli_conv <- function(x){
  # ifelse parameters: condition, value if TRUE, value if FALSE
  x <- as.factor(ifelse(x > 0, 1, 0)) 
  return(x)
}

# testing the function
bernoulli_conv(c(3,0,0,1,4,0))
#> [1] 1 0 0 1 1 0
#> Levels: 0 1
text_tn_bn <- apply(X = text_tn, FUN = bernoulli_conv, MARGIN = 2)
text_vd_bn <- apply(X = text_vd, FUN = bernoulli_conv, MARGIN = 2)

Making Naive Bayes Model

Naive Bayes and Accuracy

  • Naive Bayes classifies based on the conditional probability between the predictors and the target variable (Bayes’ Theorem).
  • Pro: relatively fast computation, because it only calculates proportions from frequency tables. It is therefore often used as a baseline model or benchmark: a simple reference model to compare against more complex models.
  • Con: skew due to data scarcity, i.e. bias arises for events that occur rarely or not at all in the training data. This can be addressed with Laplace smoothing, illustrated below.
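A toy illustration of what laplace = 1 does (the counts here are invented): if a word such as hadiah never occurs in the ham training texts, its unsmoothed conditional probability is exactly zero and would zero out the whole product of probabilities; add-one smoothing keeps it small but positive.

# invented counts: 0 occurrences among 878 ham documents, 2 factor levels (0/1)
(0 + 1) / (878 + 2)  # ~0.00114 instead of exactly 0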
# build the model
naive_spam <- naiveBayes(x=text_tn_bn, y=label_tn, laplace = 1)

text_pred_class <- predict(object=naive_spam, newdata=text_vd_bn, type = "class")

confusionMatrix(text_pred_class, reference = as.factor(label_vd), positive = "spam")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction ham spam
#>       ham  251    6
#>       spam  33  211
#>                                                
#>                Accuracy : 0.9222               
#>                  95% CI : (0.8951, 0.9441)     
#>     No Information Rate : 0.5669               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.8438               
#>                                                
#>  Mcnemar's Test P-Value : 0.00003136           
#>                                                
#>             Sensitivity : 0.9724               
#>             Specificity : 0.8838               
#>          Pos Pred Value : 0.8648               
#>          Neg Pred Value : 0.9767               
#>              Prevalence : 0.4331               
#>          Detection Rate : 0.4212               
#>    Detection Prevalence : 0.4870               
#>       Balanced Accuracy : 0.9281               
#>                                                
#>        'Positive' Class : spam                 
#> 

Checking out prediction vs label in data validation (Naive Bayes)

The model mostly failed on texts that fall under the ham category, classifying them as spam instead. This might be because those texts share characteristics with spam texts (links and words often used for promotion). Some of those ham texts contain not only ham content but also spam-like words at the end.

validation_texts <- data.frame(train_clean[-index,]) 
validation_texts$NBPrediction <- text_pred_class
check <- validation_texts %>% 
          filter(status!=NBPrediction)

check

Words present in most of the misclassified texts

mispred <- check %>% 
          select(text)

wordcloud(words=as.matrix(mispred),
          max.words = 20,
          scale = c(2.5,1.25),
          random.order = F,
          rot.per = 0.5,
          colors = brewer.pal(3,"Dark2")) # brewer.pal requires n >= 3

Trying out LSTM model

LSTM is a type of recurrent neural network (RNN). A plain RNN cell has only one gate, so new information and information from the previous cell are processed once through a tanh function. LSTM, by contrast, maintains a cell state and uses multiple gates: a forget gate (what to forget; sigmoid), an input gate (what new information is relevant; tanh and sigmoid), and an output gate (what to carry over; tanh and sigmoid). The standard gate equations are sketched below.
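For reference, the standard LSTM gate updates are (standard notation: $\sigma$ is the sigmoid function, $\odot$ the element-wise product):

$$
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i) && \text{(input gate)}\\
\tilde{C}_t &= \tanh(W_C\,[h_{t-1}, x_t] + b_C) && \text{(candidate cell state)}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell state update)}\\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(C_t) && \text{(hidden state)}
\end{aligned}
$$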

Changing target variable into numeric

train_clean_lstm <- train_clean %>% 
                  mutate(label = factor(status, levels = c("spam", "ham")),
                         label = as.numeric(label),
                         label = label - 1) %>% 
                  select(text, label)

Data cross validation for LSTM

library(rsample)
set.seed(100)

# Splitting
text_split_lstm <- initial_split(train_clean_lstm, 0.8, "label") 

# Data train
text_train_lstm <- training(text_split_lstm)

# Data validation
text_validation_lstm <- testing(text_split_lstm)

Preparing the tokenizer and sequence length for LSTM

  1. Tokenizer
  • unique words: 2665
  • taking roughly half of them as num_words
num_words <- 1500

# prepare tokenizers
tokenizer <- text_tokenizer(num_words = num_words) %>% 
             fit_text_tokenizer(train_clean$text)
  2. Max len
maxlen <- max(str_count(train_clean$text, "\\w+")) + 1 

Converting text to numeric matrices

# predictors
train_x <- texts_to_sequences(tokenizer, text_train_lstm$text) %>%
           pad_sequences(maxlen = maxlen)

validation_x <- texts_to_sequences(tokenizer, text_validation_lstm$text) %>%
                pad_sequences(maxlen = maxlen) 

# target
train_y <- to_categorical(text_train_lstm$label, num_classes = 2) 

validation_y <- to_categorical(text_validation_lstm$label, num_classes = 2)

Making LSTM Model

tensorflow::set_random_seed(15)
model_tuning <- keras_model_sequential() 
model_tuning %>%

    # layer input
    layer_embedding(
     name = "input",
     input_dim = num_words,
     input_length = maxlen,
     output_dim = 32,
     embeddings_initializer = initializer_random_uniform(minval = -0.05, maxval = 0.05, seed = 2)
   ) %>%
    # layer dropout
    layer_dropout(
      name = "embedding_dropout",
      rate = 0.5
   ) %>%
   # layer lstm 1
   layer_lstm(
     name = "lstm",
     units = 256,
     dropout = 0.5,
     recurrent_dropout = 0.5,
     return_sequences = FALSE,
     recurrent_initializer = initializer_random_uniform(minval = -0.05, maxval = 0.05, seed = 2),
     kernel_initializer = initializer_random_uniform(minval = -0.05, maxval = 0.05, seed = 2)
   ) %>%
   # layer output
   layer_dense(
     name = "output",
     units = 2,
     activation = "sigmoid",
     kernel_initializer = initializer_random_uniform(minval = -0.05, maxval = 0.05, seed = 2)
   ) %>%
   # compile the model
   compile(
     optimizer = optimizer_adam(learning_rate = 0.001),
     metrics = "accuracy",
     loss = "binary_crossentropy"
   ) %>%
   # Fitting the model
   fit(train_x, train_y,
       batch_size = 20,
       epochs = 10,
       verbose = 0)
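Before evaluating, the architecture and parameter counts can be double-checked with keras' summary():

# print layer names, output shapes, and parameter counts
summary(model_tuning)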

Predicting and Checking Accuracy of LSTM Model

# predict classes on the validation data
pred_vd_LSTM <- model_tuning %>% predict(validation_x) %>% k_argmax() %>% as.array() %>% as.factor()
#> 13/13 [==============================] - 2s 108ms/step
pred_vd_LSTM_cln <- ifelse(pred_vd_LSTM==0, "spam", "ham")

text_val <- ifelse(text_validation_lstm$label==0, "spam", "ham")

# confusion matrix data validation
confusionMatrix(as.factor(pred_vd_LSTM_cln), as.factor(text_val), "spam")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction ham spam
#>       ham  225    9
#>       spam   8  160
#>                                              
#>                Accuracy : 0.9577             
#>                  95% CI : (0.9332, 0.9752)   
#>     No Information Rate : 0.5796             
#>     P-Value [Acc > NIR] : <0.0000000000000002
#>                                              
#>                   Kappa : 0.9132             
#>                                              
#>  Mcnemar's Test P-Value : 1                  
#>                                              
#>             Sensitivity : 0.9467             
#>             Specificity : 0.9657             
#>          Pos Pred Value : 0.9524             
#>          Neg Pred Value : 0.9615             
#>              Prevalence : 0.4204             
#>          Detection Rate : 0.3980             
#>    Detection Prevalence : 0.4179             
#>       Balanced Accuracy : 0.9562             
#>                                              
#>        'Positive' Class : spam               
#> 

Predicting data test

The Naive Bayes and LSTM models will be used to predict the test data.

These two models were chosen based on:

  1. Naive Bayes, because of its light computation
  2. LSTM, because of its higher validation accuracy
# Predict using naive bayes
test_cln_dtm <- DocumentTermMatrix(text_test_clean)

test_bn <- apply(X = test_cln_dtm, FUN = bernoulli_conv, MARGIN = 2)

test_pred_nb <- predict(object=naive_spam, newdata=test_bn, type = "class")
# Predict using LSTM
num_words_test <- 350

# prepare tokenizers (note: this overwrites the tokenizer fitted on the
# training text and re-fits on the test text; see the note below)
tokenizer <- text_tokenizer(num_words = num_words_test) %>% 
             fit_text_tokenizer(test_clean$text)

maxlen_test <- max(str_count(test_clean$text, "\\w+")) + 1 

test_x <- texts_to_sequences(tokenizer, test_clean$text) %>%
                pad_sequences(maxlen = maxlen_test) 

test_pred_LSTM <- model_tuning %>% predict(test_x) %>% k_argmax() %>% as.array() %>% as.factor()
#> 9/9 [==============================] - 1s 43ms/step
test_pred_LSTM_cln <- ifelse(test_pred_LSTM==0, "spam", "ham")
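Note: re-fitting a brand-new tokenizer on the test text, with a different num_words and a different maxlen, means the integer indices no longer match the vocabulary the LSTM learned during training; this alone can badly degrade the test predictions. A minimal sketch of the more conventional approach, where train_tokenizer is a hypothetical copy of the tokenizer originally fitted on train_clean$text (that object was overwritten above):

# reuse the TRAINING tokenizer and the TRAINING maxlen so the
# word -> integer mapping matches what the model saw while fitting
test_x_alt <- texts_to_sequences(train_tokenizer, test_clean$text) %>%
              pad_sequences(maxlen = maxlen)

test_pred_alt <- model_tuning %>% predict(test_x_alt) %>%
                 k_argmax() %>% as.array() %>% as.factor()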

Submission file

submission_ifa_LSTM <- 
data.frame(datetime = ymd_hms(text_test$datetime),
           status = test_pred_LSTM_cln)

submission_ifa_NB <- 
data.frame(datetime = ymd_hms(text_test$datetime),
           status = test_pred_nb)
# write.csv(x=submission_ifa_LSTM,
#           file = "submission_ifa_LSTM.csv",
#           row.names = F)
# 
# write.csv(x=submission_ifa_NB,
#           file = "submission_ifa_NB.csv",
#           row.names = F)

Model Evaluation using LIME

Overall, after submitting and checking against the test labels, the LSTM model shows a tendency to overfit. We want to look further into what makes it overfit, using LIME.

get_embedding_explanation <- function(text) {
  # fit the tokenizer on the incoming text, convert the text to integer
  # sequences, and pad them to the training maxlen
  tokenizer %>% fit_text_tokenizer(text)
  text_to_seq <- texts_to_sequences(tokenizer, text)
  text_to_seq %>% pad_sequences(maxlen = maxlen)
}


library(lime)

# Let's choose some texts (4 rows) to explain
sentence_to_explain <- test_clean$text[37:40]

# Note that our input is just plain text. Unlike tabular data, lime for
# text classification requires a preprocess function, which converts the
# text into the integer sequences the model expects.
explainer <- lime(sentence_to_explain, model = model_tuning, preprocess = get_embedding_explanation)

# Get explanation for the first 10 words
explanation <- explain(sentence_to_explain, explainer, n_labels = 1, n_features = 10, n_permutations = 5, feature_select = "none")
#> 1/1 [==============================] - 0s 107ms/step
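The explanation can then be visualised with lime's plot_features(); the graphs discussed in the next paragraph presumably come from this call:

# bar charts of the words supporting or contradicting each predicted label
plot_features(explanation)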

From the graphs above, the LSTM model assigns a probability of around 70-90% to the predicted label, and the Explanation Fit is around 0.8-0.9 (80%-90%), which means LIME can explain roughly 80%-90% of this model's behaviour on these cases. In cases 37-40, several words support classifying the SMS as spam, such as ribu, min, and jam, while kuota, cek, and info contradict the spam label. Arguably, words from both groups can appear in both spam and ham texts.

This is in line with the observation made above (from the interpretable Naive Bayes model) that texts, whether spam or ham, can share the characteristics of spam texts (links and words often used for promotion). Some ham texts contain not only ham content but also spam-like words at the end. This may be why the LSTM model overfits and fails to predict spam/ham classes to Algo's standard. Unlike with Naive Bayes, though, here a dedicated package measures whether words are seen as supporting the spam or the ham class, so we do not have to rely on our observation alone.

Conclusion

Our initial goal was to use machine learning models to build a spam/ham classifier, and we managed to build one, albeit not a perfect one. The problem is partly solved, but we still need to retrain the classifier so it can predict spam/ham classes better in the future. In the end the basic Naive Bayes model was used, because the other model tried here (LSTM) overfitted and failed to predict the test data satisfactorily. The Naive Bayes model's initial validation accuracy was not as high as the LSTM's, but in the end it predicted the test data better. Of course, this kind of machine learning model is very important for companies that deal with a lot of text data, e.g. social media companies (Twitter, Instagram), messaging companies (WhatsApp, LINE), or mail providers (Gmail, Yahoo Mail). Any company that handles a large volume of incoming text needs such a model, and from a business perspective it is quite valuable.