Mobile phone use has skyrocketed over the last decade, opening a new channel for junk promotions from disreputable marketers. People innocently give out their mobile phone numbers while using day-to-day services and are then flooded with promotional spam messages.
In this RPubs post, we will classify SMS messages using a Naive Bayes model and a Random Forest model.
First, let's import some libraries. The key libraries used for text mining are dplyr, lubridate, tm, stopwords, and tidyr.
Okay, let's read the data.
Here's a glimpse of our imported data:
## Rows: 2,004
## Columns: 3
## $ datetime <chr> "2017-02-15T14:48:00Z", "2017-02-15T15:24:00Z", "2017-02-15T…
## $ text <chr> "Telegram code 53784", "Rezeki Nomplok Dompetku Pengiriman U…
## $ status <chr> "ham", "spam", "ham", "ham", "ham", "ham", "ham", "spam", "s…
As you can see, the data contains three columns:
- datetime: when the SMS was sent
- text: the content of the SMS
- status: whether the SMS is spam or ham
The data preprocessing involves a few steps:
- Convert the text into a corpus
- Clean the text of punctuation, whitespace, stopwords, and numbers
- Convert it to a document-term matrix
But before going any further, let's do some data wrangling and EDA.
In the data wrangling step, we'll convert text to character, status to factor, and datetime to a date-time.
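The conversion code is not shown here, so the following is only a minimal sketch in base R. The data frame `sms_demo`, its example rows, and the timestamp format string are assumptions for illustration; the post itself loads lubridate, which could do the same job.

```r
# Toy data with the same column layout as the imported data (values are illustrative)
sms_demo <- data.frame(
  datetime = c("2017-02-15T14:48:00Z", "2017-02-15T15:24:00Z", "2017-02-16T09:05:00Z"),
  text     = c("Telegram code 53784", "promo gratis bonus pulsa", "dimana sekarang?"),
  status   = c("ham", "spam", "ham"),
  stringsAsFactors = FALSE
)

# text -> character, status -> factor, datetime -> date-time (UTC)
sms_demo$text     <- as.character(sms_demo$text)
sms_demo$status   <- factor(sms_demo$status, levels = c("ham", "spam"))
sms_demo$datetime <- as.POSIXct(sms_demo$datetime,
                                format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Hour of day, which is what the bar plot groups by
sms_demo$hour <- as.integer(format(sms_demo$datetime, "%H"))
str(sms_demo)
```

Storing `status` as a factor with explicit levels keeps the class order stable across the training and test sets.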
Now let's visualize the data to see at which hours of the day spam and ham SMS are sent most often. We will use a bar plot.
To prepare the data, we group it by hour and count the spam and ham messages in each group, then plot the result using geom_col in ggplot.
sms %>%
  mutate(hour = hour(datetime)) %>%
  group_by(hour) %>%
  summarise(
    spam = sum(ifelse(status == "spam", 1, 0)),
    ham = sum(ifelse(status == "spam", 0, 1))
  ) %>%
  ungroup() %>%
  pivot_longer(
    cols = c(ham, spam)
  ) %>%
  ggplot(
    aes(
      x = hour,
      y = value,
      fill = name
    )
  ) +
  # geom_col() already uses stat = "identity", so the argument is not needed
  geom_col() +
  scale_fill_manual(values = c("#2990ff", "#e64027"))
## `summarise()` ungrouping output (override with `.groups` argument)
As seen above, the number of SMS gradually increases through the morning and peaks at 9 AM. It is lowest at 4 AM.
Our data is separated into two categories: spam and ham (not spam). Let's explore what characterizes an SMS as ham or spam.
As seen above, spam texts are usually promotional and contain words/tokens like “gratis” and “bonus”.
Ham texts usually relate to verification codes, provider information, or ordinary conversation. Common words include “code” and “dimana”.
The text in our data is raw SMS text, so let's clean it so that our model can use it for training.
There are a few steps we need to go through.
The first step is to convert our text to a corpus.
> A corpus is a set of documents.
After that, we can clean our text. There are a few things we need to do.
- Convert the text to lowercase.
- Remove numbers.
- Remove stopwords. Since the dataset is in Indonesian, we need to remove stopwords specific to that language.
- Remove punctuation: strip punctuation marks from each text document.
- Stem the document: reduce each word to its base form.
- Strip whitespace: remove extra white space.
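To make the effect of these steps concrete, here is a small base R sketch that applies them to one made-up SMS. It is a stand-in for the tm transformations, not the pipeline used in this post: the function `clean_text` and the short stopword list `stop_id` are hypothetical, and stemming is left out because base R has no stemmer.

```r
# Hypothetical mini stopword list; the post uses the Indonesian list from the stopwords package
stop_id <- c("yang", "di", "dan", "untuk")

clean_text <- function(x, stopwords = stop_id) {
  x <- tolower(x)                                        # lowercase
  x <- gsub("[[:digit:]]+", "", x)                       # remove numbers
  pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)                 # remove whole-word stopwords
  x <- gsub("[[:punct:]]+", "", x)                       # remove punctuation
  x <- gsub("\\s+", " ", x)                              # collapse repeated whitespace
  trimws(x)                                              # drop leading/trailing spaces
}

clean_text("PROMO 50% untuk Anda,  beli paket di  sini!!")
# "promo anda beli paket sini"
```

The word-boundary markers in the stopword pattern keep "di" from being stripped out of longer words such as "dimana".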
Once the text is clean, the next question is how to build a model when the predictor is still text.
In text mining, text is usually converted to a Document-Term Matrix (DTM) through tokenization. Tokenization splits a sentence into terms; a term can be one word, two words, or more. In our DTM, each word is one predictor, and its value is how frequently that word appears in a document (SMS).
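As an illustration of what a DTM contains, the sketch below builds one by hand in base R from three made-up, already-cleaned messages. It assumes one-word terms split on whitespace; the object names are hypothetical and this is not how tm constructs its sparse matrix internally.

```r
docs <- c("promo gratis pulsa", "kode verifikasi anda", "gratis pulsa gratis")

# Tokenize on whitespace: one term = one word
tokens <- strsplit(docs, "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Count each vocabulary term per document (terms x documents), then transpose
counts <- sapply(tokens, function(tok) as.integer(table(factor(tok, levels = vocab))))
dtm <- t(counts)
dimnames(dtm) <- list(Docs = as.character(seq_along(docs)), Terms = vocab)
dtm
```

Each row is one SMS and each column is one term, so the third message has a count of 2 under "gratis".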
## <<DocumentTermMatrix (documents: 3, terms: 2821)>>
## Non-/sparse entries: 2822/5641
## Sparsity : 67%
## Maximal term length: 79
## Weighting : term frequency (tf)
By keeping only the terms that appear at least 20 times across all SMS, we get the candidates for the most influential predictors.
Filtering down to these predictors also cuts the time needed to train our model.
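The filter can be sketched in base R on a tiny dense matrix. The matrix `dtm_demo` and the threshold of 2 are made up for illustration (the post uses 20 on the full data), and the sketch assumes the threshold applies to a term's total count across all documents.

```r
# Toy document-term matrix: 3 documents x 3 terms
dtm_demo <- matrix(
  c(1, 0, 2,
    0, 1, 0,
    1, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs = c("1", "2", "3"), Terms = c("gratis", "kode", "pulsa"))
)

min_freq <- 2  # illustrative threshold

# Keep only the terms whose total count reaches the threshold
frequent_terms <- colnames(dtm_demo)[colSums(dtm_demo) >= min_freq]
dtm_small <- dtm_demo[, frequent_terms, drop = FALSE]
frequent_terms
```

Here "kode" appears only once in total, so it is dropped and two predictors remain.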
In a document-term matrix, each value is a frequency ranging from 0 upward. To calculate the probabilities, the frequency needs to be simplified to 0 or 1: absent or present.
To do that, we build a custom function named bernoulli_conv (a Bernoulli converter).
The logic behind it is very simple:
- If the word frequency is greater than 0, it returns 1
- If the word frequency is 0, it returns 0
bernoulli_conv <- function(x) {
  x <- as.factor(ifelse(x > 0, 1, 0))
  return(x)
}
bernoulli_conv(c(0,1,3))
## [1] 0 1 1
## Levels: 0 1
As you can see, the custom function works. Now let's apply it to our data.
## Terms
## Docs aja aks aktif aktifkan aplikasi app aspen axi axisnet ayo bala bank beba
## 1 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## 2 "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## 3 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## Terms
## Docs beli berhasil berita berlaku bersifat biaya blm
## 1 "0" "0" "0" "0" "0" "0" "0"
## 2 "1" "1" "1" "1" "1" "1" "1"
## 3 "0" "0" "0" "0" "0" "0" "0"
Our data is now clean and encoded as binary term presence (0/1), derived from term-frequency (TF) counts rather than TF-IDF weights.
We can wrap all of the data preparation we've just done into a single function.
tokenize_text <- function(x, is_bernoulli = TRUE) {
  data_dtm <- x %>%
    # Convert to corpus
    VectorSource() %>%
    VCorpus() %>%
    # Text cleaning
    tm_map(content_transformer(tolower)) %>%
    tm_map(removeNumbers) %>%
    tm_map(removeWords, stopwords("id", source="stopwords-iso")) %>%
    tm_map(removePunctuation) %>%
    tm_map(stemDocument) %>%
    tm_map(stripWhitespace) %>%
    # Convert to DTM
    DocumentTermMatrix()
  # Keep only the frequent terms
  data_freq <- findFreqTerms(data_dtm, lowfreq = 20)
  if (is_bernoulli) {
    # Convert counts to presence/absence
    data_dtm[, data_freq] %>%
      apply(MARGIN = 2, FUN = bernoulli_conv)
  } else {
    data_dtm[, data_freq]
  }
}
After cleaning the data, let's split it into training and test sets. In this case we will use 75% of the data for training and 25% for testing.
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
index <- sample(nrow(sms), nrow(sms)*0.75)
sms_clean <- tokenize_text(sms$text)
data_train_clean <- sms_clean[index,]
data_test_clean <- sms_clean[-index,]
label_train <- sms[index, "status"]
label_test <- sms[-index, "status"]
*The raw (uncleaned) train and test data will be used later to interpret the model.
We will create two distinct models for comparison: Naive Bayes and Random Forest.
Now let's create the Naive Bayes model from the data we've just cleaned.
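The model-fitting code is not shown here. To illustrate what a Bernoulli Naive Bayes classifier computes on 0/1 term features, here is a from-scratch sketch in base R. It is not the model object used in this post: the functions `fit_bnb` and `predict_bnb`, the toy data, and the Laplace smoothing value are all assumptions, and the sketch expects a numeric 0/1 matrix rather than the character matrix produced above.

```r
# X: numeric 0/1 matrix (documents x terms); y: factor of class labels
fit_bnb <- function(X, y, laplace = 1) {
  classes <- levels(y)
  prior <- as.numeric(table(y)[classes]) / length(y)
  # P(term present | class) with Laplace smoothing, one row per class
  cond <- do.call(rbind, lapply(classes, function(k) {
    Xk <- X[y == k, , drop = FALSE]
    (colSums(Xk) + laplace) / (nrow(Xk) + 2 * laplace)
  }))
  rownames(cond) <- classes
  list(classes = classes, prior = prior, cond = cond)
}

predict_bnb <- function(model, X) {
  # Log posterior up to a constant: log prior + log likelihood of present and absent terms
  log_post <- sapply(seq_along(model$classes), function(i) {
    p <- model$cond[i, ]
    log(model$prior[i]) + X %*% log(p) + (1 - X) %*% log(1 - p)
  })
  log_post <- matrix(log_post, nrow = nrow(X))
  best <- max.col(log_post, ties.method = "first")
  factor(model$classes[best], levels = model$classes)
}

X_demo <- matrix(
  c(1, 1, 0,
    1, 0, 0,
    0, 0, 1,
    0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("gratis", "promo", "kode"))
)
y_demo <- factor(c("spam", "spam", "ham", "ham"), levels = c("ham", "spam"))

model_demo <- fit_bnb(X_demo, y_demo)
predict_bnb(model_demo, matrix(c(1, 1, 0), nrow = 1))  # message containing "gratis" and "promo"
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters once there are hundreds of terms.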
For comparison, let's create our next model, a Random Forest.
WARNING! Training a random forest model can take minutes or even hours, so it's helpful to save the model after it is created.
set.seed(417)
ctrl <- trainControl(method="repeatedcv", number = 5, repeats = 3)
model_forest <- train(
  x = data_train_clean,
  y = label_train,
  method = "rf",
  trControl = ctrl
)
saveRDS(model_forest, "spam_forest_3.RDS") # save model
Let's load our saved random forest model.
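The loading code is not shown; the natural counterpart of saveRDS is readRDS. The sketch below demonstrates the round trip with a placeholder object and a temporary file, both of which are assumptions so that the example runs on its own.

```r
# Stand-in object; in the post this would be the trained model_forest
model_stub <- list(method = "rf", note = "placeholder for a trained model")

path <- file.path(tempdir(), "spam_forest_demo.RDS")  # hypothetical file name
saveRDS(model_stub, path)

# Load the saved object back instead of retraining
model_loaded <- readRDS(path)
identical(model_stub, model_loaded)
```

For the model saved earlier, the analogous call would presumably be `model_forest <- readRDS("spam_forest_3.RDS")`.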
Now that both models are created, let's evaluate them.
But first, we need to generate predictions so they can be evaluated against the test data.
## [1] ham spam spam ham ham ham
## Levels: ham spam
## [1] ham spam spam ham spam spam
## Levels: ham spam
To evaluate our models, let's create a confusion matrix.
For this SMS classification case, we will use accuracy to measure model performance, because correctly identifying the positive class (spam) is as important as correctly identifying the negative class (ham): most people don't want to miss an important SMS, but they also want to get rid of spam.
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 257 17
## spam 27 200
##
## Accuracy : 0.9122
## 95% CI : (0.8839, 0.9355)
## No Information Rate : 0.5669
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8221
##
## Mcnemar's Test P-Value : 0.1748
##
## Sensitivity : 0.9217
## Specificity : 0.9049
## Pos Pred Value : 0.8811
## Neg Pred Value : 0.9380
## Prevalence : 0.4331
## Detection Rate : 0.3992
## Detection Prevalence : 0.4531
## Balanced Accuracy : 0.9133
##
## 'Positive' Class : spam
##
The confusion matrix for the Naive Bayes predictions shows an accuracy of 91%, so the Naive Bayes model is reasonably accurate.
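The headline statistics can be reproduced by hand from the four cells of the Naive Bayes confusion matrix. The short base R sketch below uses the counts reported above, with spam as the positive class; the variable names are only illustrative.

```r
# Counts copied from the Naive Bayes confusion matrix (positive class: spam)
tp <- 200  # predicted spam, actually spam
fp <- 27   # predicted spam, actually ham
fn <- 17   # predicted ham, actually spam
tn <- 257  # predicted ham, actually ham

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # share of real spam that was caught
specificity <- tn / (tn + fp)   # share of real ham that was kept
precision   <- tp / (tp + fp)   # reported as Pos Pred Value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

These come out to 0.9122, 0.9217, 0.9049, and 0.8811, matching the Accuracy, Sensitivity, Specificity, and Pos Pred Value lines in the printed output.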
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 270 6
## spam 14 211
##
## Accuracy : 0.9601
## 95% CI : (0.939, 0.9754)
## No Information Rate : 0.5669
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9191
##
## Mcnemar's Test P-Value : 0.1175
##
## Sensitivity : 0.9724
## Specificity : 0.9507
## Pos Pred Value : 0.9378
## Neg Pred Value : 0.9783
## Prevalence : 0.4331
## Detection Rate : 0.4212
## Detection Prevalence : 0.4491
## Balanced Accuracy : 0.9615
##
## 'Positive' Class : spam
##
Although the previous model is already fairly accurate on the test data, the random forest model is more accurate than Naive Bayes, reaching 96% accuracy on the test data.
Okay, let's see which predictions our model got wrong. We will use the random forest predictions, since that model is more accurate than Naive Bayes.
pred.false <- data_test %>%
  mutate(
    pred.rf = sms_pred_rf
  ) %>%
  filter(pred.rf != status)
pred.false %>% select(-datetime) %>% filter(pred.rf == "spam")
As seen above, many of the misclassified ham texts are SMS from internet providers informing users about something useful, such as their remaining data. This may happen because providers often send promotional messages containing words like “pulsa”, “kuota”, or “paket”, which are also used when informing users about remaining data or other useful information.
We will use two methods to interpret our models: Variable Importance for the random forest and LIME for Naive Bayes.
Variable Importance tells us which variables contribute the most and which contribute nothing.
Let's see which words are the most important, using Variable Importance from our random forest model.
caret::varImp(model_forest, 20)$importance %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  arrange(-Overall) %>%
  mutate(rowname = forcats::fct_inorder(rowname))
As seen above, the biggest contributor in the model is info and the smallest is memberitahukan.
LIME, short for Local Interpretable Model-agnostic Explanations, is an explanation technique that explains the predictions of any classifier in an interpretable and faithful manner by learning an interpretable model locally around the prediction.
LIME can explain any model by treating it as a black box. In contrast, a decision tree's structure or Variable Importance in a random forest is only applicable to those models.
We will use LIME to interpret the Naive Bayes model.
Since lime doesn't support naiveBayes out of the box, we need to create a custom function for it named model_type.naiveBayes.
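The definition of model_type.naiveBayes is not shown. A common way to write it is as a one-line S3 method that tells lime the model is a classifier, sketched below. The fallback generic and the `fake_nb` object are assumptions added only so the snippet runs without lime attached.

```r
# Fallback generic so the sketch runs on its own (illustration only);
# with library(lime) attached, lime's own model_type() generic is used instead.
if (!exists("model_type", mode = "function")) {
  model_type <- function(x, ...) UseMethod("model_type")
}

# S3 method: objects of class "naiveBayes" are treated as classification models
model_type.naiveBayes <- function(x, ...) {
  "classification"
}

fake_nb <- structure(list(), class = "naiveBayes")  # stand-in for a fitted model
model_type(fake_nb)
```

Returning "classification" signals that the model's predictions should be treated as class probabilities rather than a single numeric value.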
We also need a function that returns the predictions: predict_model.naiveBayes.
predict_model.naiveBayes <- function(x, newdata, type = "raw") {
  res <- predict(x, newdata, type = "raw") %>% as.data.frame()
  return(res)
}
Now we need to prepare the input for lime. In a typical classification problem, the input can be a table containing the features. In text classification, however, the input should be the original text, and we also need to supply the preprocessing step that takes the text from cleaning through tokenization. Make sure the text input is a character vector, not a factor.
text_train <- data_train$text %>% as.character()
text_test <- data_test$text
explainer <- lime(
  text_train,
  model = model_nb,
  preprocess = tokenize_text
)
Now we will try to explain how our model works on the test dataset. We will look at the interpretation of the first five observations of the test data. Don't forget to call set.seed to get a reproducible example.
We will use five features for this.
set.seed(123)
explanation <- explain(
  text_test[1:5],
  explainer = explainer,
  n_labels = 1, # show only 1 label (spam or ham)
  n_features = 5,
  feature_select = "none", # use all terms to explain the model
  single_explanation = FALSE
)
Let's visualize the result.
For the third observation, the probability of being ham is 98%. The explanation fit shows how well LIME explains the prediction for this observation; at 76%, it is reasonably accurate.
Blue-labeled text means that the word supports/increases the probability of being SPAM, with the most influential words being promo and belaku.
Red-labeled text means that the word contradicts/decreases the probability of the SMS being HAM, such as your, hari, or nasi.
The difference between LIME and an interpretable machine learning model like a decision tree is that LIME can be applied to any model, but it explains each feature's role based on the model's predictions for sample data. An interpretable model's own tools, like Variable Importance in a random forest, apply only to that model, but they explain each feature's contribution to the model itself.
Now let's apply our model to the submission data. We will use the random forest, since it is more accurate than Naive Bayes.
Because we already created the tokenize_text function earlier, we can reuse it to clean the submission text.
## Terms
## Docs aplikasi axi axisnet bala beli berlaku bonus bronet dgn diblokir
## 1 "0" "0" "0" "0" "1" "0" "0" "0" "1" "0"
## 2 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## 3 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## 4 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## 5 "0" "0" "0" "0" "0" "0" "1" "0" "0" "0"
Since the random forest requires the same predictors it was trained on, we need to trim our predictors so that they match the training data. Let's create a function for that named trimRfPredictor.
trimRfPredictor <- function(x, train_data) {
  x %>%
    as.data.frame() %>%
    fncols(colnames(train_data)) %>%
    select(colnames(train_data)) %>%
    mutate_all(as.factor) %>%
    as.matrix.data.frame()
}
We also need a custom function that adds any missing columns so that the predictors match the training data. We can name it fncols.
fncols <- function(data, cname) {
  # Columns present in the training data but missing here
  add <- cname[!cname %in% names(data)]
  if (length(add) != 0) data[add] <- as.factor("0")
  data
}
submission.clean.df <- trimRfPredictor(submission.clean, data_train_clean)
submission.clean.df[1:5,1:20]
## aja aks aktif aktifkan aplikasi app aspen axi axisnet ayo bala bank beba
## [1,] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## [2,] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## [3,] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## [4,] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## [5,] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
## beli berhasil berita berlaku bersifat biaya blm
## [1,] "1" "0" "0" "0" "0" "0" "0"
## [2,] "0" "0" "0" "0" "0" "0" "0"
## [3,] "0" "0" "0" "0" "0" "0" "0"
## [4,] "0" "0" "0" "0" "0" "0" "0"
## [5,] "0" "0" "0" "0" "0" "0" "0"
After cleaning the data, let's predict the labels and save them.
To classify whether an SMS is spam or not, we used Naive Bayes and Random Forest. Since the random forest was more accurate than Naive Bayes, we used it to predict the submission test data and got >90% in accuracy, sensitivity, specificity, and precision. This shows that the problem can be solved with machine learning.
One potential business application is an SMS spam filter.