Introduction

The use of mobile phones has skyrocketed in the last decade, opening a new channel for junk promotions from disreputable marketers. People innocently give out their mobile phone numbers while using day-to-day services and are then flooded with promotional spam messages.

In this RPubs post, we will classify SMS messages using a Naive Bayes model and a Random Forest model.

Setup

First, let’s load some libraries. The key libraries used here are dplyr, lubridate, tm, stopwords and tidyr.

library(dplyr)
library(lubridate)
library(tm)
library(stopwords)
library(e1071)
library(caret)
library(lime)
library(ggplot2)
library(tidyr)
library(tibble)

Import Data

Okay, let’s read the data.

sms <- read.csv("data/data-train.csv", stringsAsFactors = FALSE)

Here’s a glimpse of our imported data:

sms %>% glimpse()

## Rows: 2,004
## Columns: 3
## $ datetime <chr> "2017-02-15T14:48:00Z", "2017-02-15T15:24:00Z", "2017-02-15T…
## $ text     <chr> "Telegram code 53784", "Rezeki Nomplok Dompetku Pengiriman U…
## $ status   <chr> "ham", "spam", "ham", "ham", "ham", "ham", "ham", "spam", "s…

As you can see, the data contains three columns:

- datetime: when the SMS was sent
- text: the content of the SMS
- status: whether the SMS is spam or ham

The data preprocessing will take a few steps:

- Convert the text to a corpus
- Clean the text of punctuation, white space, stopwords and numbers
- Convert it to a document-term matrix

But before going any further, let’s do some data wrangling and EDA.

Data wrangling

The text column is already read as character, so in this data wrangling step we only convert status to a factor and datetime to a date-time.

sms <- sms %>% 
   mutate(
      status=as.factor(status),
      datetime=ymd_hms(datetime)
   )
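As a quick sanity check of this conversion, the sketch below parses one of the timestamps shown in the glimpse output and converts a small made-up status vector to a factor. The names parsed and toy_status are illustrative and not part of the original analysis.

```r
library(lubridate)

# One timestamp copied from the glimpse output above
parsed <- ymd_hms("2017-02-15T14:48:00Z")

# Made-up labels, only to illustrate the factor conversion
toy_status <- as.factor(c("ham", "spam", "ham"))

hour(parsed)        # hour of the day, in UTC
levels(toy_status)  # factor levels are sorted alphabetically
```

Having datetime as a date-time object is what makes the hour() call in the EDA step possible.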

Exploratory Data Analytics (EDA)

Now let’s visualize our data to see at which hours of the day spam and ham SMS are sent most often. For the visualization we will use a bar plot.

Let’s preprocess the data for visualization by grouping it by hour and counting the spam and ham messages, and then plot it using geom_col in ggplot.

sms %>% 
   mutate(hour = hour(datetime)) %>% 
   group_by(hour) %>% 
   summarise(
      spam = sum(ifelse(status == "spam", 1, 0)),
      ham = sum(ifelse(status == "spam", 0, 1)),
   ) %>% 
   ungroup() %>% 
   pivot_longer(
      cols=c(ham, spam)
   ) %>% 
   ggplot(
      aes(
         x=hour,
         y=value,
         fill=name
      )
   ) +
   geom_col() + # geom_col already plots the values as-is, no stat argument needed
   scale_fill_manual(values=c("#2990ff", "#e64027"))

## `summarise()` ungrouping output (override with `.groups` argument)

As seen above, the number of SMS increases gradually through the morning and peaks at 9 AM. The lowest volume is at 4 AM.
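The same hourly counts can also be checked numerically instead of read off the plot. This is a minimal base R sketch on a made-up data frame (toy_sms and its four rows are hypothetical), not the actual dataset.

```r
# Made-up messages, only to illustrate the hourly count
toy_sms <- data.frame(
  datetime = as.POSIXct(
    c("2017-02-15 09:10:00", "2017-02-15 09:45:00",
      "2017-02-15 04:05:00", "2017-02-15 14:48:00"),
    tz = "UTC"
  ),
  status = factor(c("spam", "ham", "ham", "spam"), levels = c("ham", "spam"))
)

# Extract the hour of day and cross-tabulate it against the label
toy_sms$hour <- as.integer(format(toy_sms$datetime, "%H"))
hourly <- table(hour = toy_sms$hour, status = toy_sms$status)
hourly

# Hour with the largest total number of messages
busiest_hour <- as.integer(names(which.max(rowSums(hourly))))
busiest_hour
```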

Data characteristic

Our data is separated into two categories: spam or ham (not spam). Let’s explore what characteristics make an SMS ham or spam.

Spam

sms %>%
   filter(status == "spam") %>% 
   tail()

As seen above, spam texts are usually promotional, with words/tokens such as “gratis” and “bonus”.

Ham

sms %>% 
   filter(status == "ham") %>% 
   tail()

Ham texts are usually verification codes, provider information or ordinary conversation. Some of the words used are “code” and “dimana”.

Data preparation

The text in our data is raw SMS text. Let’s clean it so our model can use it for training.

There are a few steps we need to go through.

Convert to corpus

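The first step is to convert our text to a corpus.

> A corpus is a set of documents.

sms.corpus <- sms$text %>% 
   # Convert the text column only. Passing the whole data frame makes each
   # column one document, which is why the printed outputs in this section
   # report 3 documents.
   VectorSource() %>% 
   VCorpus()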

Text Cleaning

After that, we can clean our text. A few things we need to do:

- Convert the text to lowercase.
- Remove numbers.
- Remove stopwords. Since the dataset is in Indonesian, we need to remove stopwords specific to that language.
- Remove punctuation marks from each text document.
- Stem the document, converting each word to its base form.
- Strip whitespace, removing extra white space.

sms.corpus <- sms.corpus %>%
   tm_map(content_transformer(tolower)) %>% 
   tm_map(removeNumbers) %>% 
   tm_map(removeWords, stopwords("id", source="stopwords-iso")) %>% 
   tm_map(removePunctuation) %>%
   tm_map(function(x) { stemDocument(x, language="indonesian") }) %>%
   tm_map(stripWhitespace)
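To see what these transformations do to an individual message, here is a small sketch on two made-up messages. It leaves out the stopword and stemming steps to keep the dependencies to tm alone, and toy_text, toy_corpus and toy_clean are illustrative names.

```r
library(tm)

# Made-up messages, not taken from the dataset
toy_text <- c(
  "PROMO!!! Gratis 100 SMS, ketik *123#",
  "Kode verifikasi Anda: 53784"
)

toy_corpus <- VCorpus(VectorSource(toy_text))
toy_corpus <- tm_map(toy_corpus, content_transformer(tolower))
toy_corpus <- tm_map(toy_corpus, removeNumbers)
toy_corpus <- tm_map(toy_corpus, removePunctuation)
toy_corpus <- tm_map(toy_corpus, stripWhitespace)

# Pull the cleaned strings back out of the corpus
toy_clean <- vapply(toy_corpus, function(doc) content(doc), character(1))
toy_clean
```

Note that stripWhitespace collapses runs of white space into a single blank; it does not trim leading or trailing blanks.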

Convert to Document Term Matrix

Once the text is clean, the next question is how to build a model when the predictor is still text.

In text mining, text is usually converted to a Document-Term Matrix (DTM) through tokenization. Tokenization splits a sentence into terms; a term can be one word, two words or more. In the DTM, each term is one predictor, and its value is how frequently the term appears in a document (here, an SMS).

sms.dtm <- sms.corpus %>% 
   DocumentTermMatrix()

sms.dtm

## <<DocumentTermMatrix (documents: 3, terms: 2821)>>
## Non-/sparse entries: 2822/5641
## Sparsity           : 67%
## Maximal term length: 79
## Weighting          : term frequency (tf)
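The structure of a DTM is easier to see on a tiny example. The sketch below builds one from three made-up, already-clean messages (toy_text is hypothetical) and converts it to an ordinary matrix with one row per document and one column per term.

```r
library(tm)

# Made-up, already-clean messages
toy_text <- c("promo gratis pulsa", "kode verifikasi", "promo pulsa murah")

toy_dtm <- DocumentTermMatrix(VCorpus(VectorSource(toy_text)))

# Dense view: rows are documents, columns are terms, cells are counts
toy_matrix <- as.matrix(toy_dtm)
toy_matrix
```

The dense conversion is only reasonable for small examples; on a real corpus the sparse DTM object should be kept as long as possible.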

Get only the most frequent terms

By keeping only the terms that occur at least 20 times across the corpus, we get the candidates for the most influential predictors.

sms.freq <- findFreqTerms(sms.dtm, lowfreq = 20)

sms.dtm <- sms.dtm[,sms.freq]

Filtering down to these predictors also cuts the training time for our model.
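A small sketch of this filter on made-up messages (toy_text is hypothetical). Note that findFreqTerms works on the total count of a term over all documents, so a term repeated inside a single message can pass the threshold on its own.

```r
library(tm)

# "promo" occurs twice, but only inside the first message
toy_text <- c("promo promo gratis", "kode verifikasi", "pulsa murah")
toy_dtm <- DocumentTermMatrix(VCorpus(VectorSource(toy_text)))

# Terms whose total count over the corpus is at least 2
frequent <- findFreqTerms(toy_dtm, lowfreq = 2)
frequent

# Keep only those columns of the DTM
toy_small <- toy_dtm[, frequent]
dim(toy_small)
```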

Convert to Bernoulli

In the document-term matrix, each value is a frequency ranging from 0 upward. To calculate the probabilities, the frequency needs to be simplified to 0 or 1: the term does not appear or it appears.

To do that we need to build a custom function, a Bernoulli converter.

The logic behind it is very simple:

- If the word frequency is greater than 0, it returns 1
- If the word frequency is 0, it returns 0

bernoulli_conv <- function(x) {
  x <- as.factor(ifelse(x > 0, 1, 0))
  return(x)
}

bernoulli_conv(c(0,1,3))

## [1] 0 1 1
## Levels: 0 1

As you can see, the custom function works. Now let’s apply it to our data.

sms.dtm <- sms.dtm %>% 
   apply(MARGIN = 2, FUN = bernoulli_conv)

sms.dtm[1:3, 1:20]

##     Terms
## Docs aja aks aktif aktifkan aplikasi app aspen axi axisnet ayo bala bank beba
##    1 "0" "0" "0"   "0"      "0"      "0" "0"   "0" "0"     "0" "0"  "0"  "0" 
##    2 "1" "1" "1"   "1"      "1"      "1" "1"   "1" "1"     "1" "1"  "1"  "1" 
##    3 "0" "0" "0"   "0"      "0"      "0" "0"   "0" "0"     "0" "0"  "0"  "0" 
##     Terms
## Docs beli berhasil berita berlaku bersifat biaya blm
##    1 "0"  "0"      "0"    "0"     "0"      "0"   "0"
##    2 "1"  "1"      "1"    "1"     "1"      "1"   "1"
##    3 "0"  "0"      "0"    "0"     "0"      "0"   "0"
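One detail worth noting in the output above is that the values are quoted strings. A minimal sketch with a made-up count matrix (toy_counts is illustrative) shows the same behaviour: apply() collects the factors returned by bernoulli_conv into a character matrix.

```r
# Same converter as defined above, repeated so the sketch runs on its own
bernoulli_conv <- function(x) {
  as.factor(ifelse(x > 0, 1, 0))
}

# Made-up term counts: 3 documents, 2 terms
toy_counts <- matrix(
  c(0, 2, 1,   # counts of "promo" in documents 1 to 3
    0, 0, 5),  # counts of "kode" in documents 1 to 3
  nrow = 3,
  dimnames = list(Docs = 1:3, Terms = c("promo", "kode"))
)

# Apply the converter column by column
toy_binary <- apply(toy_counts, MARGIN = 2, FUN = bernoulli_conv)
toy_binary
class(toy_binary[, "promo"])
```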

Our data is now clean and encoded as binary term presence (0 or 1) instead of raw term frequency.

Create a function

We can wrap all of the data preparation we’ve just done into a function.

tokenize_text <- function(x, is_bernoulli = TRUE) {
   data_dtm <- x %>% 
      # Convert to corpus
      VectorSource() %>% 
      VCorpus() %>% 
      
      # text cleaning
      tm_map(content_transformer(tolower)) %>% 
      tm_map(removeNumbers) %>% 
      tm_map(removeWords, stopwords("id", source="stopwords-iso")) %>% 
      tm_map(removePunctuation) %>%
      tm_map(stemDocument) %>%
      tm_map(stripWhitespace) %>% 

      # Convert DTM
      DocumentTermMatrix()
   
   # Keep only the terms that occur at least 20 times
   data_freq <- findFreqTerms(data_dtm, lowfreq = 20)
   data_dtm <- data_dtm[, data_freq]

   if (is_bernoulli) {
      # Encode each term as present ("1") or absent ("0")
      data_dtm <- apply(data_dtm, MARGIN = 2, FUN = bernoulli_conv)
   }

   data_dtm
}

Cross Validation

After cleaning the data, let’s split it into train and test sets. For this case we will use 75% of the data for training and 25% for testing.

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(100)

index <- sample(nrow(sms), nrow(sms)*0.75)

sms_clean <- tokenize_text(sms$text)

data_train_clean <- sms_clean[index,]
data_test_clean <- sms_clean[-index,]

label_train <- sms[index, "status"]
label_test <- sms[-index, "status"]

The raw train and test data below will be used later to interpret the model.

data_train <- sms[index,]
data_test <- sms[-index,]
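The index-based split can be checked on a toy data frame. This sketch uses made-up rows (toy, toy_train and toy_test are hypothetical names) and confirms that the two parts are disjoint and together cover every row.

```r
set.seed(100)

# Made-up data: 20 rows, half ham and half spam
toy <- data.frame(
  id = 1:20,
  status = factor(rep(c("ham", "spam"), each = 10))
)

# Draw 75% of the row numbers for training; the rest is the test set
toy_index <- sample(nrow(toy), nrow(toy) * 0.75)
toy_train <- toy[toy_index, ]
toy_test <- toy[-toy_index, ]

nrow(toy_train)
nrow(toy_test)

# Class balance of the training part
prop.table(table(toy_train$status))
```

Because the rows are drawn at random without stratifying on status, the class proportions in the two parts can differ slightly from the full data.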

Model

For the model, we will create two distinct models for comparison: Naive Bayes and Random Forest.

Naive Bayes

Now let’s create the Naive Bayes model from the data we’ve just cleaned.

model_nb <- naiveBayes(
   x = data_train_clean, 
   y = label_train,
   laplace = 1
)
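To make the role of laplace = 1 concrete, here is a sketch on a made-up training set with two binary predictors (toy_train, toy_label and to_binary_factor are hypothetical). With Laplace smoothing, each conditional probability is (count + 1) / (class size + number of levels), so a term never seen in a class still gets a non-zero probability.

```r
library(e1071)

# Helper so that every predictor has the same two levels
to_binary_factor <- function(x) factor(x, levels = c("0", "1"))

# Made-up data: 3 spam and 3 ham messages, 2 term-presence predictors
toy_train <- data.frame(
  promo = to_binary_factor(c("1", "1", "1", "0", "0", "0")),
  kode  = to_binary_factor(c("0", "0", "0", "1", "1", "0"))
)
toy_label <- factor(c("spam", "spam", "spam", "ham", "ham", "ham"))

toy_nb <- naiveBayes(x = toy_train, y = toy_label, laplace = 1)

# Smoothed P(promo = 1 | ham) = (0 + 1) / (3 + 2), instead of 0
toy_nb$tables$promo

# A new message that contains "promo" but not "kode"
toy_new <- data.frame(
  promo = to_binary_factor("1"),
  kode  = to_binary_factor("0")
)
toy_pred <- predict(toy_nb, newdata = toy_new, type = "class")
toy_pred
```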

Random forest

For comparison, let’s create our next model, a random forest.

WARNING! Training a random forest model can take minutes or even hours, so it’s helpful to save the model once it has been created.

set.seed(417)

ctrl <- trainControl(method="repeatedcv", number = 5, repeats = 3)

model_forest <- train(
   x = data_train_clean,
   y = label_train,
   method = "rf",
   trControl = ctrl
)

saveRDS(model_forest, "spam_forest_3.RDS") # save model

Let’s load our random forest model.

model_forest <- readRDS("spam_forest_3.RDS")
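One way to avoid retraining by accident is to wrap the save and load steps in a small helper that only fits when no saved file exists. This is a sketch under assumptions: fit_or_load is a hypothetical helper, and a quick lm() fit on the built-in cars data stands in for the slow random forest.

```r
# Hypothetical helper: load the model if it was saved before, otherwise fit and save it
fit_or_load <- function(path, fit_fun) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  model <- fit_fun()
  saveRDS(model, path)
  model
}

# Stand-in for an expensive training call
toy_fit <- function() lm(dist ~ speed, data = cars)

# Temporary path so the sketch leaves nothing behind in the project folder
model_path <- file.path(tempdir(), "toy_model.RDS")
unlink(model_path)

first_run <- fit_or_load(model_path, toy_fit)   # fits and saves
second_run <- fit_or_load(model_path, toy_fit)  # loads from disk
```

In the analysis above, fit_fun would wrap the train() call and path would be the RDS file name.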

Evaluate model

After creating our models, let’s evaluate them.

Predicting

But first, we need to create predictions so they can be compared against the test data.

Naive Bayes

sms_pred_naive <- predict(model_nb, newdata = data_test_clean, type="class")
head(sms_pred_naive)

## [1] ham  spam spam ham  ham  ham 
## Levels: ham spam

Random Forest

sms_pred_rf <- predict(model_forest, newdata = data_test_clean, type="raw")
head(sms_pred_rf)

## [1] ham  spam spam ham  spam spam
## Levels: ham spam

Confusion Matrix

To evaluate our models, let’s create a confusion matrix.

For this SMS classification case, we will use accuracy to measure model performance, because correctly identifying the positive class (spam) is as important as correctly identifying the negative class (ham): most people don’t want to miss an important SMS, but they also want to get rid of spam.

Naive Bayes

confusionMatrix(data = sms_pred_naive, reference = label_test, positive = "spam")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  257   17
##       spam  27  200
##                                           
##                Accuracy : 0.9122          
##                  95% CI : (0.8839, 0.9355)
##     No Information Rate : 0.5669          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8221          
##                                           
##  Mcnemar's Test P-Value : 0.1748          
##                                           
##             Sensitivity : 0.9217          
##             Specificity : 0.9049          
##          Pos Pred Value : 0.8811          
##          Neg Pred Value : 0.9380          
##              Prevalence : 0.4331          
##          Detection Rate : 0.3992          
##    Detection Prevalence : 0.4531          
##       Balanced Accuracy : 0.9133          
##                                           
##        'Positive' Class : spam            
##

The confusion matrix for the Naive Bayes prediction shows 91% accuracy, so the Naive Bayes model is reasonably accurate.
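The statistics in the Naive Bayes output can be reproduced by hand from the four cells of its confusion matrix, with spam as the positive class. The variable names below are illustrative; the counts are copied from the output above.

```r
# Counts read from the Naive Bayes confusion matrix (positive class: spam)
tp <- 200  # spam predicted as spam
fn <- 17   # spam predicted as ham
fp <- 27   # ham predicted as spam
tn <- 257  # ham predicted as ham

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)  # share of spam that was caught
specificity <- tn / (tn + fp)  # share of ham that was kept
precision   <- tp / (tp + fp)  # "Pos Pred Value" in the caret output

round(c(
  accuracy = accuracy,
  sensitivity = sensitivity,
  specificity = specificity,
  precision = precision
), 4)
```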

Random Forest

confusionMatrix(data = sms_pred_rf, reference = label_test, positive = "spam")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  270    6
##       spam  14  211
##                                          
##                Accuracy : 0.9601         
##                  95% CI : (0.939, 0.9754)
##     No Information Rate : 0.5669         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9191         
##                                          
##  Mcnemar's Test P-Value : 0.1175         
##                                          
##             Sensitivity : 0.9724         
##             Specificity : 0.9507         
##          Pos Pred Value : 0.9378         
##          Neg Pred Value : 0.9783         
##              Prevalence : 0.4331         
##          Detection Rate : 0.4212         
##    Detection Prevalence : 0.4491         
##       Balanced Accuracy : 0.9615         
##                                          
##        'Positive' Class : spam           
##

Although the previous model is already fairly accurate on the test data, the random forest model is more accurate than Naive Bayes, reaching 96% accuracy on the test data.

False prediction

Okay, let’s see which predictions our model got wrong. We will use the random forest predictions here, since that model was more accurate than Naive Bayes.

pred.false <- data_test %>% 
   mutate(
      pred.rf = sms_pred_rf,
   ) %>% 
   filter(pred.rf != status)
pred.false %>% select(-datetime) %>% filter(pred.rf == "spam")

As seen above, many of the misclassified ham texts are messages from an internet provider informing users about something useful, such as their remaining data. This may happen because internet providers often send promotional messages containing words like “pulsa”, “kuota” or “paket”, which are also used when informing users about remaining data or other useful information.

Interpreting Model

Two methods will be used for interpreting our models: Variable Importance for the random forest and LIME for Naive Bayes.

Variable Importance

Variable importance tells us which variables contribute the most and which contribute nothing.

Let’s see which words are the most important, using the variable importance from our random forest model.

caret::varImp(model_forest)$importance %>% 
   as.data.frame() %>%
   rownames_to_column() %>%
   arrange(-Overall) %>%
   mutate(rowname = forcats::fct_inorder(rowname))

As seen above, the biggest contributor to the model is info and the smallest contributor is memberitahukan.

LIME

LIME, short for Local Interpretable Model-agnostic Explanations, is a novel explanation technique that explains the prediction of any classifier in an interpretable and faithful manner by learning an interpretable model locally around the prediction.

LIME can explain any model, treating it as a black box. In contrast, a decision tree or the variable importance of a random forest is only applicable to those models.

LIME will be used to interpret the Naive Bayes model.

Since lime doesn’t support naiveBayes out of the box, we need to define a custom method for it named model_type.naiveBayes.

model_type.naiveBayes <- function(x){
  return("classification")
}

We also need a function that returns the predictions: predict_model.naiveBayes.

predict_model.naiveBayes <- function(x, newdata, type = "raw") {
    res <- predict(x, newdata, type = "raw") %>% as.data.frame()
    return(res)
}
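Both functions above rely on R’s S3 dispatch: a function named generic.class is picked up automatically when the generic is called on an object of that class. The sketch below shows the mechanism with a made-up generic (describe_model is hypothetical and not part of lime) and an empty object that merely carries the naiveBayes class.

```r
# Hypothetical generic, standing in for a generic such as lime's model_type
describe_model <- function(x, ...) UseMethod("describe_model")

# Fallback for classes without a dedicated method
describe_model.default <- function(x, ...) "unknown"

# Method picked up for any object whose class is "naiveBayes"
describe_model.naiveBayes <- function(x, ...) "classification"

# Empty stand-in object that only carries the class attribute
fake_model <- structure(list(), class = "naiveBayes")

describe_model(fake_model)
describe_model(1:3)
```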

Now we need to prepare the input for lime. In a common classification problem, the input can be a table containing the features. In text classification, however, the input should be the original text, and we also need to supply the preprocessing step that takes the text from cleaning through tokenization. Make sure the text input is a character vector, not a factor.

text_train <- data_train$text %>% as.character()
text_test <- data_test$text

explainer <- lime(
   text_train,
   model=model_nb,
   preprocess=tokenize_text
)

Now we will try to explain how our model works on the test dataset. We will look at the interpretation of the first five observations of the test data. Don’t forget to call set.seed to get a reproducible example.

We will use five features for this.

set.seed(123)
explanation <- explain(
   text_test[1:5],
   explainer = explainer, 
   n_labels = 1, # show only 1 label (spam or ham)
   n_features = 5, 
   feature_select = "none", # use all terms to explain the model
   single_explanation = FALSE
)

Let’s visualize the result.

plot_text_explanations(explanation)

For the third observation, the probability of ham is 98%. The explanation fit shows how well LIME explains the prediction for this observation; at 76%, it is reasonably accurate.

The blue-highlighted text marks words that support (increase) the probability of SPAM, the most influential being promo and belaku.

The red-highlighted text marks words that contradict (decrease) that probability, such as your, hari or nasi.

The difference between LIME and an interpretable machine learning model such as a decision tree is that LIME can be applied to any model, but it explains the role of features based on the model’s predictions for sample data. An interpretable model’s own tools, like variable importance in a random forest, can only be applied to that model, but they explain each feature’s contribution to the model itself.

Submission data

Now let’s apply our model to the submission data. For this we will use the random forest, since it is more accurate than Naive Bayes.

Import data

First, import the submission data.

submission <- read.csv("data/data-test.csv", stringsAsFactors = FALSE)

Text Cleaning

Because we already created the tokenize_text function earlier, we can reuse it to clean the submission text.

submission.clean <- tokenize_text(submission$text)
submission.clean[1:5,1:10]

##     Terms
## Docs aplikasi axi axisnet bala beli berlaku bonus bronet dgn diblokir
##    1 "0"      "0" "0"     "0"  "1"  "0"     "0"   "0"    "1" "0"     
##    2 "0"      "0" "0"     "0"  "0"  "0"     "0"   "0"    "0" "0"     
##    3 "0"      "0" "0"     "0"  "0"  "0"     "0"   "0"    "0" "0"     
##    4 "0"      "0" "0"     "0"  "0"  "0"     "0"   "0"    "0" "0"     
##    5 "0"      "0" "0"     "0"  "0"  "0"     "1"   "0"    "0" "0"

Optimize data

Since a random forest requires the new data to have the same predictors as the training data, we need to trim our predictors to match. Let’s create a function for that named trimRfPredictor.

trimRfPredictor <- function(x, train_data) {
   x %>%
      as.data.frame() %>% 
      fncols(colnames(train_data)) %>% 
      select(colnames(train_data)) %>% 
      mutate_all(as.factor) %>% 
      as.matrix.data.frame() %>% 
      return()
}

We also need a custom function that adds any missing columns so the data matches the training predictors. We can name it fncols.

fncols <- function(data, cname) {
  add <-cname[!cname%in%names(data)]

  if(length(add)!=0) data[add] <- as.factor("0")
  data
}

submission.clean.df <- trimRfPredictor(submission.clean, data_train_clean)
submission.clean.df[1:5,1:20]

##      aja aks aktif aktifkan aplikasi app aspen axi axisnet ayo bala bank beba
## [1,] "0" "0" "0"   "0"      "0"      "0" "0"   "0" "0"     "0" "0"  "0"  "0" 
## [2,] "0" "0" "0"   "0"      "0"      "0" "0"   "0" "0"     "0" "0"  "0"  "0" 
## [3,] "0" "0" "0"   "0"      "0"      "0" "0"   "0" "0"     "0" "0"  "0"  "0" 
## [4,] "0" "0" "0"   "0"      "0"      "0" "0"   "0" "0"     "0" "0"  "0"  "0" 
## [5,] "0" "0" "0"   "0"      "0"      "0" "0"   "0" "0"     "0" "0"  "0"  "0" 
##      beli berhasil berita berlaku bersifat biaya blm
## [1,] "1"  "0"      "0"    "0"     "0"      "0"   "0"
## [2,] "0"  "0"      "0"    "0"     "0"      "0"   "0"
## [3,] "0"  "0"      "0"    "0"     "0"      "0"   "0"
## [4,] "0"  "0"      "0"    "0"     "0"      "0"   "0"
## [5,] "0"  "0"      "0"    "0"     "0"      "0"   "0"
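The alignment idea can be sketched in base R on a tiny made-up matrix: columns missing from the new data are filled with "0", and columns the model never saw are dropped. Here align_columns, toy_new and the term names are hypothetical and only mirror what trimRfPredictor and fncols do together.

```r
# Hypothetical helper mirroring the two functions above
align_columns <- function(new_data, train_cols, fill = "0") {
  new_data <- as.data.frame(new_data, stringsAsFactors = FALSE)

  # Add every training column the new data lacks, filled with "0"
  missing_cols <- setdiff(train_cols, names(new_data))
  for (col in missing_cols) {
    new_data[[col]] <- fill
  }

  # Keep only the training columns, in the training order
  new_data[, train_cols, drop = FALSE]
}

# Made-up new data: has "bonus" and an unseen term "extra", but lacks "aja"
toy_new <- matrix(
  c("1", "0",   # bonus
    "0", "1"),  # extra
  nrow = 2,
  dimnames = list(NULL, c("bonus", "extra"))
)

aligned <- align_columns(toy_new, train_cols = c("aja", "bonus"))
aligned
```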

Predict Submission

After cleaning the data, let’s predict on it and save the results.

Naive bayes

submission.nb <- submission %>% 
   select(datetime)
submission.nb$status <- predict(model_nb, newdata = submission.clean.df, type="class")

head(submission.nb)

write.csv(submission.nb, "data/submission_nb.csv")

Random Forest

submission.rf <- submission %>% 
   select(datetime)
submission.rf$status <- predict(model_forest, newdata = submission.clean.df, type="raw")

head(submission.rf)

write.csv(submission.rf, "data/submission_rf_3.csv")

Result

Conclusion

To classify whether an SMS is spam or not, we used Naive Bayes and Random Forest. On the held-out test split, the random forest was more accurate than Naive Bayes, scoring above 90% in accuracy, sensitivity, specificity and precision, so we used it to predict the submission data. This suggests the problem can be solved with machine learning.

One potential business implementation is an SMS spam filter.

Is this SPAM or not? SMS Spam Classification

Arya

2020-08-26
