# Data Wrangling
library(dplyr)
library(tidyr)
library(stringr)
library(tidyverse)
library(lubridate)
# Plotting
library(ggplot2)
# Text Preprocessing
library(textclean)
devtools::install_github('nurandi/katadasaR')
library(katadasaR)
library(tokenizers)
library(textstem)
library(stopwords)
library(tm)
library(tidytext)
# Cross Validation
library(rsample)
# Machine Learning
## Naive Bayes
library(e1071)
## LSTM
library(keras)
library(tensorflow)
# Machine Learning Evaluation
library(caret)
library(lime)
The SMS dataset is a curated collection of text messages, each labelled as spam or ham. As users, we do not want to waste time reading spam. We would therefore like to know whether a message entering our inbox is spam or ham, so that spam can be moved to a separate folder and kept from cluttering the main inbox. We are going to use machine learning models to build a spam/ham classifier.
We start with a distribution plot of the total hourly message frequency for each status.
# Data preparation
text_train_plot <- text_train %>%
  mutate(datetime=ymd_hms(datetime),
         hourly=hour(datetime),
         status=as.factor(status)) %>%
  select(c(hourly, status)) %>%
  group_by(hourly, status) %>%
  summarise(total_count=n())
# Plotting using line plot
ggplot(text_train_plot, aes(x=hourly, y=total_count, group=status)) +
  geom_line(aes(colour=status), size=0.75) +
  geom_point(aes(colour=status), size=2) +
  theme_minimal() +
  labs(y="Total spam/ham",
       x="Hour",
       colour=NULL)
Checking for NAs in the train and test data
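The NA counts below can be produced with a simple column-wise check, for example (a minimal sketch; it assumes the raw test data frame is named text_test):
# count missing values per column in the raw train and test data
colSums(is.na(text_train))
colSums(is.na(text_test))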
#> datetime text status
#> 0 0 0
#> datetime text status
#> 0 0 283
Dropping NAs on the test data
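A minimal sketch of this step using tidyr, assuming the 283 missing values sit in the status column as shown above (object name text_test is illustrative):
# keep only test rows that actually have a label
text_test <- text_test %>%
  drop_na(status)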
Cleaning the test data
Cleaning the train data
Text preprocessing is an important step because the messages come in inconsistent forms. We need to convert all text to lower case and get rid of symbols, numbers, and other noise.
strip() from textclean removes most unnecessary symbols, numbers, and extra whitespace. Text messages rarely use emoticons, so we will not be on the lookout for those.
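A minimal sketch of this cleaning step with textclean (the _cln object names match the ones used in the commented-out chunks below):
# strip() lower-cases the text and removes symbols, digits and extra whitespace by default
text_train_cln <- text_train %>% mutate(text = strip(text))
text_test_cln <- text_test %>% mutate(text = strip(text))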
# # import Indonesian lexicon
# spell.lex <- read.csv("data_input/colloquial-indonesian-lexicon.csv")
#
# # replace internet slang
# text_train_cln$text <- replace_internet_slang(text_train_cln$text, slang = paste0("\\b", spell.lex$slang, "\\b"),
#                                            replacement = spell.lex$formal, ignore.case = TRUE)
# # import Indonesian lexicon
# spell.lex <- read.csv("data_input/colloquial-indonesian-lexicon.csv")
#
# # replace internet slang
# text_test_cln$text <- replace_internet_slang(text_test_cln$text, slang = paste0("\\b", spell.lex$slang, "\\b"),
#                                            replacement = spell.lex$formal, ignore.case = TRUE)
Saving the variable with the internet slang cleaned
# saveRDS(text_train_cln, file = "data_input/train_clean.rds")
# saveRDS(text_test_cln, file = "data_input/test_clean.rds")
Reading the variable with the internet slang cleaned back in
We will take a look at which words, based on the term frequency of each word or token, pop up more often under the spam or ham label.
library(wordcloud)
spam <- train_clean %>%
  filter(status=="spam") %>%
  select(text)
wordcloud(words=as.matrix(spam),
          max.words = 20,
          scale = c(2.5,1.25),
          random.order = F,
          rot.per = 0.5,
          colors = brewer.pal(2,"Dark2"))
ham <- train_clean %>%
filter(status=="ham") %>%
select(text)
wordcloud(words=as.matrix(ham),
max.words = 20,
scale = c(2.5,1.25),
random.order = F,
rot.per = 0.5,
colors = brewer.pal(2,"Dark2"))To stem words into the most simple form that they can take: ex.
abai, abaikan to aba so it can be
recognized as one type of word
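A minimal sketch of the stemming step with katadasaR (katadasar() stems one word at a time, so it is applied word by word; the _cln1 object names match the commented-out chunk below, the helper stem_text() is illustrative):
# stem every word of every message with the katadasaR base-word stemmer
stem_text <- function(x) {
  words <- unlist(tokenize_words(x))
  paste(sapply(words, katadasar), collapse = " ")
}
text_train_cln1 <- sapply(text_train_cln$text, stem_text)
text_test_cln1 <- sapply(text_test_cln$text, stem_text)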
Tokenization separates phrases, sentences, or paragraphs into smaller units (usually words) called tokens.
# text_train_cln1 <- tokenize_words(text_train_cln1)
# text_test_cln1 <- tokenize_words(text_test_cln1)
Saving the stemmed and tokenized variable
Dropping stop-words: words that are commonly used but carry no meaning (insignificant). In English these include the, is, and, etc.; in Indonesian they include words like adalah, atas, bagi, etc.
text_train_clean <- readRDS("data_input/train_clean_final.rds")
text_test_clean <- readRDS("data_input/test_clean_final.rds")
myStopwords <- readLines("data_input/stopwords-id.txt")
text_train_clean <- text_train_clean %>% as.character() %>% tokenize_words(stopwords = myStopwords)
text_test_clean <- text_test_clean %>% as.character() %>% tokenize_words(stopwords = myStopwords)
#> <<DocumentTermMatrix (documents: 2004, terms: 2665)>>
#> Non-/sparse entries: 20522/5320138
#> Sparsity : 100%
#> Maximal term length: 46
#> Weighting : term frequency (tf)
#> Sample :
#> Terms
#> Docs "and", "gb", "kuota", "number", "paket", "pulsa", "rp", "sms", "ya",
#> 1343 0 0 0 0 0 2 3 0 0
#> 1910 2 0 0 0 0 0 0 0 0
#> 1924 2 0 0 0 0 0 0 0 0
#> 193 0 0 0 0 0 0 0 0 0
#> 197 1 0 0 0 0 0 0 1 0
#> 1983 2 0 0 0 0 0 0 0 0
#> 225 0 0 0 0 0 0 0 0 0
#> 344 0 0 0 0 0 0 0 0 0
#> 409 0 0 0 0 0 0 0 0 2
#> 410 0 0 0 1 2 2 6 0 0
#> Terms
#> Docs c("c",
#> 1343 1
#> 1910 1
#> 1924 1
#> 193 1
#> 197 1
#> 1983 1
#> 225 1
#> 344 1
#> 409 1
#> 410 1
The data is then split into text_tn (training) and text_vd (validation); the label proportions in each split are:
#> label_tn
#> ham spam
#> 0.584165 0.415835
#> label_vd
#> ham spam
#> 0.5668663 0.4331337
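The chunk that builds the document-term matrix and splits it is not shown above; below is a minimal sketch of how the matrix, the text_tn/text_vd split, and the label proportions could be produced. The seed and the 75/25 proportion are assumptions, although 75/25 matches the validation size reported in the confusion matrix further down.
# document-term matrix from the tokenized training text
train_dtm <- DocumentTermMatrix(text_train_clean)
inspect(train_dtm)
# 75/25 split into training and validation sets
set.seed(100)
index <- sample(nrow(train_dtm), size = nrow(train_dtm) * 0.75)
text_tn <- train_dtm[index, ]
text_vd <- train_dtm[-index, ]
label_tn <- train_clean$status[index]
label_vd <- train_clean$status[-index]
# class proportions in each split
prop.table(table(label_tn))
prop.table(table(label_vd))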
The values in the text_tn matrix are still numeric counts. Therefore, we will convert the contents so that they only indicate whether a term appears (1) or not (0). One way is to use a Bernoulli converter.
bernoulli_conv <- function(x){
  # ifelse arguments: condition, value if TRUE, value if FALSE
  x <- as.factor(ifelse(x > 0, 1, 0))
  return(x)
}
# testing the function
bernoulli_conv(c(3,0,0,1,4,0))
#> [1] 1 0 0 1 1 0
#> Levels: 0 1
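The converter is then applied column by column to both splits, mirroring the pattern used later for the test set (a sketch; text_tn_bn and text_vd_bn are the names used by the model below):
# convert term counts into 1/0 presence indicators
text_tn_bn <- apply(X = text_tn, MARGIN = 2, FUN = bernoulli_conv)
text_vd_bn <- apply(X = text_vd, MARGIN = 2, FUN = bernoulli_conv)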
# build the model
naive_spam <- naiveBayes(x=text_tn_bn, y=label_tn, laplace = 1)
text_pred_class <- predict(object=naive_spam, newdata=text_vd_bn, type = "class")
confusionMatrix(text_pred_class, reference = as.factor(label_vd), positive = "spam")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction ham spam
#> ham 251 6
#> spam 33 211
#>
#> Accuracy : 0.9222
#> 95% CI : (0.8951, 0.9441)
#> No Information Rate : 0.5669
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.8438
#>
#> Mcnemar's Test P-Value : 0.00003136
#>
#> Sensitivity : 0.9724
#> Specificity : 0.8838
#> Pos Pred Value : 0.8648
#> Neg Pred Value : 0.9767
#> Prevalence : 0.4331
#> Detection Rate : 0.4212
#> Detection Prevalence : 0.4870
#> Balanced Accuracy : 0.9281
#>
#> 'Positive' Class : spam
#>
The model mostly failed on texts that fall under the ham category (classifying them as spam instead). This might be because those texts share the same characteristics as spam texts (links and words often used for promotion). Those ham texts do not contain only ham-like words but also some spam-like words at the end.
validation_texts <- data.frame(train_clean[-index,])
validation_texts$NBPrediction <- text_pred_class
check <- validation_texts %>%
  filter(status!=NBPrediction)
check
Words that are present in most of the misclassified texts:
mispred <- check %>%
  select(text)
wordcloud(words=as.matrix(mispred),
          max.words = 20,
          scale = c(2.5,1.25),
          random.order = F,
          rot.per = 0.5,
          colors = brewer.pal(2,"Dark2"))
LSTM (Long Short-Term Memory) is a type of Recurrent Neural Network (RNN). A plain RNN cell has only one gate, so new information and information from the previous cell are processed only once, through a tanh function. An LSTM cell, on the other hand, maintains a cell state and uses multiple gates: a forget gate (what to forget; sigmoid), an input gate (which new information is relevant; sigmoid and tanh), and an output gate (what to carry over to the output; sigmoid and tanh).
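For reference, the standard LSTM cell equations (textbook form, not project-specific code), where $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication:

$$
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i), \qquad \tilde{C}_t = \tanh(W_C\,[h_{t-1}, x_t] + b_C) && \text{(input gate)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell state update)} \\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o), \qquad h_t = o_t \odot \tanh(C_t) && \text{(output gate)}
\end{aligned}
$$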
num_words <- 1500
# prepare tokenizers
tokenizer <- text_tokenizer(num_words = num_words) %>%
  fit_text_tokenizer(train_clean$text)
# predictors
train_x <- texts_to_sequences(tokenizer, text_train_lstm$text) %>%
  pad_sequences(maxlen = maxlen)
validation_x <- texts_to_sequences(tokenizer, text_validation_lstm$text) %>%
  pad_sequences(maxlen = maxlen)
# target
train_y <- to_categorical(text_train_lstm$label, num_classes = 2)
validation_y <- to_categorical(text_validation_lstm$label, num_classes = 2)
tensorflow::set_random_seed(15)
model_tuning <- keras_model_sequential()
model_tuning %>%
  # layer input
  layer_embedding(
    name = "input",
    input_dim = num_words,
    input_length = maxlen,
    output_dim = 32,
    embeddings_initializer = initializer_random_uniform(minval = -0.05, maxval = 0.05, seed = 2)
  ) %>%
  # layer dropout
  layer_dropout(
    name = "embedding_dropout",
    rate = 0.5
  ) %>%
  # layer lstm 1
  layer_lstm(
    name = "lstm",
    units = 256,
    dropout = 0.5,
    recurrent_dropout = 0.5,
    return_sequences = FALSE,
    recurrent_initializer = initializer_random_uniform(minval = -0.05, maxval = 0.05, seed = 2),
    kernel_initializer = initializer_random_uniform(minval = -0.05, maxval = 0.05, seed = 2)
  ) %>%
  # layer output
  layer_dense(
    name = "output",
    units = 2,
    activation = "sigmoid",
    kernel_initializer = initializer_random_uniform(minval = -0.05, maxval = 0.05, seed = 2)
  ) %>%
  # compile the model
  compile(
    optimizer = optimizer_adam(learning_rate = 0.001),
    metrics = "accuracy",
    loss = "binary_crossentropy"
  ) %>%
  # Fitting the model
  fit(train_x, train_y,
      batch_size = 20,
      epochs = 10,
      verbose = 0)
# predicting classes on the validation data
pred_vd_LSTM <- model_tuning %>% predict(validation_x) %>% k_argmax() %>% as.array() %>% as.factor()
#>
#> 13/13 [==============================] - 2s 108ms/step
pred_vd_LSTM_cln <- ifelse(pred_vd_LSTM==0, "spam", "ham")
text_val <- ifelse(text_validation_lstm$label==0, "spam", "ham")
# confusion matrix data validation
confusionMatrix(as.factor(pred_vd_LSTM_cln), as.factor(text_val), "spam")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction ham spam
#> ham 225 9
#> spam 8 160
#>
#> Accuracy : 0.9577
#> 95% CI : (0.9332, 0.9752)
#> No Information Rate : 0.5796
#> P-Value [Acc > NIR] : <0.0000000000000002
#>
#> Kappa : 0.9132
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.9467
#> Specificity : 0.9657
#> Pos Pred Value : 0.9524
#> Neg Pred Value : 0.9615
#> Prevalence : 0.4204
#> Detection Rate : 0.3980
#> Detection Prevalence : 0.4179
#> Balanced Accuracy : 0.9562
#>
#> 'Positive' Class : spam
#>
The Naive Bayes model and the LSTM model will both be used to predict the test data. These two models were chosen based on their performance on the validation data above.
# Predict using naive bayes
test_cln_dtm <- DocumentTermMatrix(text_test_clean)
test_bn <- apply(X = test_cln_dtm, FUN = bernoulli_conv, MARGIN = 2)
test_pred_nb <- predict(object=naive_spam, newdata=test_bn, type = "class")
# Predict using LSTM
num_words_test <- 350
# prepare tokenizers
tokenizer <- text_tokenizer(num_words = num_words_test) %>%
  fit_text_tokenizer(test_clean$text)
maxlen_test <- max(str_count(test_clean$text, "\\w+")) + 1
test_x <- texts_to_sequences(tokenizer, test_clean$text) %>%
  pad_sequences(maxlen = maxlen_test)
test_pred_LSTM <- model_tuning %>% predict(test_x) %>% k_argmax() %>% as.array() %>% as.factor()
#>
#> 9/9 [==============================] - 1s 43ms/step
Overall, after submitting the predictions to be checked against the test labels, the LSTM model turned out to have a tendency to overfit. We want to look further into what makes it overfit using LIME.
get_embedding_explanation <- function(text) {
  tokenizer %>% fit_text_tokenizer(text)
  text_to_seq <- texts_to_sequences(tokenizer, text)
  # pad every sequence to the same length used when training the model
  text_to_seq %>% pad_sequences(maxlen = maxlen)
}
library(lime)
# Let's choose some texts (4 rows) to explain
sentence_to_explain <- test_clean$text[37:40]
# Note that our input is plain text. Unlike tabular data, the lime() function for
# text classification requires a preprocessing function, which converts the text
# into the integer sequences the model expects.
explainer <- lime(sentence_to_explain, model = model_tuning, preprocess = get_embedding_explanation)
# Get explanation for the first 10 words
explanation <- explain(sentence_to_explain, explainer, n_labels = 1, n_features = 10, n_permutations = 5, feature_select = "none")
#>
#> 1/1 [==============================] - ETA: 0s
#> 1/1 [==============================] - 0s 107ms/step
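The explanation graphs discussed below can be produced with lime's built-in plotting helpers, for example:
# word-level view of which tokens support or contradict the predicted label
plot_text_explanations(explanation)
# alternatively, bar charts of feature weights per explained case
plot_features(explanation)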
From the graphs above, the LSTM model assigns a probability of around 70-90%, and the Explanation Fit is around 0.8-0.9 (80%-90%), which means LIME can explain around 80%-90% of the model's behaviour for these cases. In cases 37-40 there are several words that support labelling the SMS as spam, such as ribu, min, and jam, while kuota, cek, and info contradict the spam label. Arguably, both groups of words can appear in both spam and ham messages.
This is in line with the observation made above (from the interpretable Naive Bayes model) that texts, whether spam or ham, can share the same characteristics as spam texts (links and words often used for promotion). Those ham texts do not contain only ham-like words but also some spam-like words at the end. This might be why the LSTM model overfits and fails to predict the spam/ham classes up to Algo's standard. Unlike with Naive Bayes, however, here there is a package specialized in measuring whether those words support the spam class or the ham class, so we do not have to rely on our own observation alone.
Our initial goal was to use machine learning models to build a spam/ham classifier. We managed to build one, albeit not a perfect one. The problem is partly solved, but we still need to retrain the classifier so it can predict the spam/ham classes better in the future. In the end the basic Naive Bayes model was used, because the other model tried here (LSTM) was overfitted and failed to predict the test data satisfactorily. The Naive Bayes model's validation accuracy might not be as high as the LSTM's, but in the end it predicted the test data better. Of course, this kind of machine learning model is very important for companies that deal with a lot of text data, e.g. social media companies (Twitter, Instagram, etc.), messaging companies (WhatsApp, LINE), or mail providers (Gmail, Yahoo Mail). Every company that handles a large amount of incoming text needs a model like this, and from a business perspective it is quite valuable.