Introduction
The goal of this report is to walk through the steps of a data analysis project that uses text data to classify SMS messages as SPAM or not SPAM (ham). The data can be obtained from a Kaggle challenge at this link.
Data Analysis Steps
This project will use the following steps to create this classifier:
- Load the data and check data quality.
- Split the data into training and testing datasets.
- Feature engineering using simple stats and regular expressions.
- Creation of semantic dictionaries to be used as features.
- Apply a machine learning algorithm.
- Assess performance.
- Closing remarks.
Execution
Loading the data and data quality
Here is a sample of 6 rows of data where you can identify 3 columns:
- is_spam: identifies if the SMS message is spam or not (ham).
- sms: SMS content in text format.
- id: identifier of each SMS.
library(tidyverse)
library(tidytext)
library(skimr)
# Read the raw CSV and rename the original v1/v2 columns to friendlier names.
spam = read.csv(
  file = "/Users/dscorzoni/Documents/R Scripts/06 spam classifier/spam.csv",
  header = T,
  fileEncoding = "latin1"
) %>%
  mutate(
    is_spam = v1,
    sms = v2
  )

# Keep only the renamed columns and add a sequential id for each SMS.
spam = spam %>%
  dplyr::select(is_spam, sms) %>%
  mutate(
    id = 1:dim(spam)[1]
  )
knitr::kable(head(spam))
| is_spam | sms | id |
|---|---|---|
| ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet… Cine there got amore wat… | 1 |
| ham | Ok lar… Joking wif u oni… | 2 |
| spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s | 3 |
| ham | U dun say so early hor… U c already then say… | 4 |
| ham | Nah I don’t think he goes to usf, he lives around here though | 5 |
| spam | FreeMsg Hey there darling it’s been 3 week’s now and no word back! I’d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv | 6 |
Splitting the data into training and testing sets
In the following code I’m splitting the dataset randomly 50%/50% into training and testing datasets. We will use the training dataset to train the machine learning algorithm and then apply the fitted model to the testing dataset to assess performance and check for overfitting.
# Splitting the dataset in training and testing
set.seed(123)
train = spam %>%
sample_n(round(dim(spam)[1]/2))
test = spam %>%
filter(!id %in% train$id)
dim(train)
## [1] 2786 3
dim(test)
## [1] 2786 3
Feature engineering
In this section we will use some basic text mining tools, such as character counting and regular expressions, to extract deterministic elements from the text content and create features that will be used in our classification task.
library(stringi)
library(qdapRegex)
train_features = train %>%
mutate(
char_count = nchar(sms), # Count the number of characters
has_numbers = ifelse(grepl("[0-9]", sms), 1, 0), # Boolean if the string has numbers
numbers_count = str_count(sms, "[0-9]"), # Count of number digits
has_url = ifelse(grepl(grab("@rm_url"), sms), 1, 0), # Detect URL (needs improvement)
has_date = ifelse(grepl(grab("@rm_date"), sms), 1, 0), # Detect Dates
has_dollar = ifelse(grepl(grab("@rm_dollar"), sms), 1, 0), # Detect Dollar Sign
has_emoticon = ifelse(grepl(grab("@rm_emoticon"), sms), 1, 0), # Detect Emoticon
has_email = ifelse(grepl(grab("@rm_email"), sms), 1, 0), # Detect Email
has_phone = ifelse(grepl(grab("@rm_phone"), sms), 1, 0), # Detect Phone Number
)
skim_without_charts(train_features)
| Name | train_features |
|---|---|
| Number of rows | 2786 |
| Number of columns | 12 |
| Column type frequency: | |
| character | 2 |
| numeric | 10 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| is_spam | 0 | 1 | 3 | 4 | 0 | 2 | 0 |
| sms | 0 | 1 | 2 | 632 | 0 | 2665 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 2763.95 | 1608.21 | 1 | 1359.25 | 2759 | 4131.75 | 5571 |
| char_count | 0 | 1 | 78.52 | 57.18 | 2 | 35.00 | 60 | 118.00 | 632 |
| has_numbers | 0 | 1 | 0.25 | 0.44 | 0 | 0.00 | 0 | 1.00 | 1 |
| numbers_count | 0 | 1 | 2.22 | 5.88 | 0 | 0.00 | 0 | 1.00 | 36 |
| has_url | 0 | 1 | 0.02 | 0.14 | 0 | 0.00 | 0 | 0.00 | 1 |
| has_date | 0 | 1 | 0.00 | 0.07 | 0 | 0.00 | 0 | 0.00 | 1 |
| has_dollar | 0 | 1 | 0.00 | 0.05 | 0 | 0.00 | 0 | 0.00 | 1 |
| has_emoticon | 0 | 1 | 0.18 | 0.38 | 0 | 0.00 | 0 | 0.00 | 1 |
| has_email | 0 | 1 | 0.00 | 0.05 | 0 | 0.00 | 0 | 0.00 | 1 |
| has_phone | 0 | 1 | 0.07 | 0.25 | 0 | 0.00 | 0 | 0.00 | 1 |
The summary table above shows how frequently some of these objects appear in the SMS messages. For example, URLs appear in 2% of the data, 25% of the messages contain numbers, 7% contain phone numbers and 18% contain emoticons. Just out of curiosity, let’s see what proportion of messages with a URL were classified as SPAM, so we can get a sense of the predictive power of this feature:
train_features %>%
count(is_spam, has_url) %>%
pivot_wider(
names_from = "is_spam",
values_from = "n",
values_fill = 0
) %>%
mutate(
total = ham + spam,
spam_pct = round(spam / total * 100,1)
) %>%
knitr::kable(.)
| has_url | ham | spam | total | spam_pct |
|---|---|---|---|---|
| 0 | 2418 | 316 | 2734 | 11.6 |
| 1 | 2 | 50 | 52 | 96.2 |
So, according to the table above, the spam rate is 96.2% among messages that contain a URL, versus only 11.6% among messages that don’t. Let’s look at another example, this time with emoticons:
train_features %>%
count(is_spam, has_emoticon) %>%
pivot_wider(
names_from = "is_spam",
values_from = "n",
values_fill = 0
) %>%
mutate(
total = ham + spam,
spam_pct = round(spam / total * 100,1)
) %>%
knitr::kable(.)
| has_emoticon | ham | spam | total | spam_pct |
|---|---|---|---|---|
| 0 | 2126 | 165 | 2291 | 7.2 |
| 1 | 294 | 201 | 495 | 40.6 |
Here we see another interesting result: messages with emoticons also have a higher rate of SPAM (40.6%) compared to messages without emoticons (7.2%). However, note how the lift differs from that of URLs: while for URLs the lift is 8.3x (96.2/11.6), for emoticons it is 5.6x (40.6/7.2). Both variables show strong predictive power, but the URL feature is stronger.
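Since the same lift calculation is useful for the other boolean features as well, here is a small helper. This is a sketch that is not part of the original analysis and assumes the train_features object built above is in memory:
# Sketch: spam-rate lift for any 0/1 feature (assumes train_features exists).
feature_lift = function(data, feature) {
  rates = data %>%
    group_by(flag = .data[[feature]]) %>%              # 0 = feature absent, 1 = present
    summarise(spam_rate = mean(is_spam == "spam")) %>% # spam rate within each group
    arrange(flag)
  rates$spam_rate[2] / rates$spam_rate[1]              # lift = rate(present) / rate(absent)
}
feature_lift(train_features, "has_url")      # ~8.3, as computed above
feature_lift(train_features, "has_emoticon") # ~5.6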
Now, let’s explore some word frequency techniques to identify which words are more related to SPAM and create a dictionary that can help us build more features based on textual content.
Semantic Dictionaries
We can start by exploring which words occur more frequently in SPAM messages but not in HAM. But before that, we need to process the text a little to remove noise:
library(tidytext)
stopwords_en = get_stopwords(language = "en")
train %>%
unnest_tokens(word, sms) %>%
anti_join(stopwords_en) %>% # This removes too frequent words that don't add meaning.
filter(nchar(word) >= 3) %>% # Remove too short words
count(is_spam, word) %>% # Counting word frequency
group_by(is_spam) %>%
arrange(-n) %>% # Ordering from more to less frequent
slice_head(n = 10) %>%
ggplot(aes(x = word, y = n, fill = is_spam)) +
geom_bar(stat = "identity") +
facet_wrap(~ is_spam, scales = "free_y") +
coord_flip() +
theme_light(base_size = 13) +
labs(
title = "Top 10 frequent words in ham and spam SMS messages.",
subtitle = "Kaggle SMS SPAM Dataset",
x = "Words",
y = "Frequency"
)
We get very interesting insights from the chart above: the words with the highest frequency in SPAM messages are stop, text, reply, mobile, free and claim. You can probably recognize these words from most of the SPAM messages you receive every day.
To highlight the contrast between the two classes, we can calculate the relative frequency of each word within ham and within spam and take the ratio of these proportions, giving each word’s lift for detecting spam:
spam_docs = train %>%
count(is_spam, name = "total_docs")
spam_dict = train %>%
unnest_tokens(word, sms) %>%
anti_join(stopwords_en) %>% # This removes too frequent words that don't add meaning.
filter(nchar(word) >= 3) %>%
count(is_spam, word) %>%
inner_join(spam_docs) %>%
mutate(
word_prop = n/total_docs
) %>%
dplyr::select(word_prop, is_spam, word) %>%
pivot_wider(names_from = "is_spam", values_from = "word_prop", names_prefix = "word_") %>%
filter(word_ham > 0, word_spam > 0) %>%
mutate(
word_ratio = word_spam / word_ham
)
top_30_spam_words = spam_dict %>%
arrange(-word_ratio) %>%
slice_head(n = 30)
top_30_spam_words %>%
mutate(
page = c(rep(1, 10), rep(2,10), rep(3,10))
) %>%
ggplot(aes(x = word, y = word_ratio)) +
geom_bar(stat = "identity") +
coord_flip() +
facet_wrap(~ page, scales = "free_y") +
theme_light(base_size = 13) +
labs(
title = "Top 30 Word Lift on SPAM messages over HAM messages.",
subtitle = "Words with bigger predictive power to detect SPAM messages.",
x = "Words",
y = "Lift (Relative Freq. Ratio)"
)
These are the words with the highest relative frequency in SPAM messages compared to their frequency in HAM messages. Many of them are easy to recognize from real SPAM messages, and we can build a dictionary from this criterion by selecting the top X words and then counting how many of them appear in each document, creating a dictionary-based feature. Let’s use these 30 words as our SPAM dictionary and create a new feature:
train_features_complete = train_features %>%
left_join(
train %>%
unnest_tokens(word, sms) %>%
anti_join(stopwords_en) %>% # This removes too frequent words that don't add meaning.
filter(nchar(word) >= 3) %>%
dplyr::select(id, word) %>%
inner_join(top_30_spam_words) %>%
count(id, name = "spam_words_count")
) %>%
mutate(
spam_words_count = replace_na(spam_words_count, 0),
is_spam_binary = ifelse(is_spam == "spam", 1, 0)
) %>%
dplyr::select(-id, -sms)
train_features_complete %>%
count(spam_words_count, is_spam) %>%
pivot_wider(names_from = is_spam, values_from = n, values_fill = 0) %>%
mutate(
total = ham + spam,
spam_pct = round(spam / total * 100,1)
) %>%
knitr::kable(.)
| spam_words_count | ham | spam | total | spam_pct |
|---|---|---|---|---|
| 0 | 2375 | 100 | 2475 | 4.0 |
| 1 | 40 | 101 | 141 | 71.6 |
| 2 | 4 | 80 | 84 | 95.2 |
| 3 | 1 | 50 | 51 | 98.0 |
| 4 | 0 | 29 | 29 | 100.0 |
| 5 | 0 | 5 | 5 | 100.0 |
| 6 | 0 | 1 | 1 | 100.0 |
The table above shows how effective this new feature is! As the number of spam words increases, the percentage of SPAM messages increases sharply. Note also that this feature alone already catches a good portion of the SPAM messages, as the quick check below shows.
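To put a number on that, here is a quick sanity check. This is a sketch assuming train_features_complete from the step above is still in memory, computing the share of SPAM messages that contain at least one dictionary word:
# Sketch: share of SPAM messages with at least one dictionary word.
# From the counts in the table above this is (366 - 100) / 366, roughly 0.73.
train_features_complete %>%
  filter(is_spam == "spam") %>%
  summarise(dictionary_recall = mean(spam_words_count >= 1))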
Machine Learning Model
Finally, let’s start with a simple logistic regression to model the probability of an SMS being SPAM:
mod.glm = glm(
is_spam_binary ~ .,
data = train_features_complete %>% dplyr::select(-is_spam),
family = "binomial"
)
summary(mod.glm)
##
## Call:
## glm(formula = is_spam_binary ~ ., family = "binomial", data = train_features_complete %>%
## dplyr::select(-is_spam))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9322 -0.0926 -0.0915 -0.0891 3.3443
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.432533 0.340092 -15.974 < 2e-16 ***
## char_count -0.001026 0.002478 -0.414 0.678972
## has_numbers 1.193233 0.425571 2.804 0.005050 **
## numbers_count 0.524931 0.071649 7.326 2.36e-13 ***
## has_url 4.411534 0.949499 4.646 3.38e-06 ***
## has_date -0.985855 4.699260 -0.210 0.833832
## has_dollar -2.168442 1.264742 -1.715 0.086431 .
## has_emoticon 1.156173 0.342986 3.371 0.000749 ***
## has_email 3.056134 2.573076 1.188 0.234938
## has_phone 0.062574 0.980224 0.064 0.949101
## spam_words_count 2.292275 0.309353 7.410 1.26e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2167.42 on 2785 degrees of freedom
## Residual deviance: 349.42 on 2775 degrees of freedom
## AIC: 371.42
##
## Number of Fisher Scoring iterations: 8
library(caret)
confusionMatrix(
factor(ifelse(predict(mod.glm, train_features_complete, type = "response") > 0.5, 1, 0)),
reference = factor(train_features_complete$is_spam_binary)
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2408 41
## 1 12 325
##
## Accuracy : 0.981
## 95% CI : (0.9752, 0.9857)
## No Information Rate : 0.8686
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9137
##
## Mcnemar's Test P-Value : 0.00012
##
## Sensitivity : 0.9950
## Specificity : 0.8880
## Pos Pred Value : 0.9833
## Neg Pred Value : 0.9644
## Prevalence : 0.8686
## Detection Rate : 0.8643
## Detection Prevalence : 0.8790
## Balanced Accuracy : 0.9415
##
## 'Positive' Class : 0
##
library(pROC)
auc(
train_features_complete$is_spam_binary,
predict(mod.glm, train_features_complete, type = "response")
)
## Area under the curve: 0.9775
Performance on Testing Set
The logistic regression shows very good performance, with AUC > 0.9. Usually this would be too good to be true, but it is expected here since we are evaluating on the training data itself. Now, let’s apply the same feature engineering to the test set, score it with the model, and check whether we get similar results:
# Feature Engineering
test_features_complete = test %>%
mutate(
char_count = nchar(sms), # Count the number of characters
has_numbers = ifelse(grepl("[0-9]", sms), 1, 0), # Boolean if the string has numbers
numbers_count = str_count(sms, "[0-9]"), # Count of number digits
has_url = ifelse(grepl(grab("@rm_url"), sms), 1, 0), # Detect URL (needs improvement)
has_date = ifelse(grepl(grab("@rm_date"), sms), 1, 0), # Detect Dates
has_dollar = ifelse(grepl(grab("@rm_dollar"), sms), 1, 0), # Detect Dollar Sign
has_emoticon = ifelse(grepl(grab("@rm_emoticon"), sms), 1, 0), # Detect Emoticon
has_email = ifelse(grepl(grab("@rm_email"), sms), 1, 0), # Detect Email
has_phone = ifelse(grepl(grab("@rm_phone"), sms), 1, 0), # Detect Phone Number
) %>%
left_join(
test %>%
unnest_tokens(word, sms) %>%
anti_join(stopwords_en) %>% # This removes too frequent words that don't add meaning.
filter(nchar(word) >= 3) %>%
dplyr::select(id, word) %>%
inner_join(top_30_spam_words) %>%
count(id, name = "spam_words_count")
) %>%
mutate(
spam_words_count = replace_na(spam_words_count, 0),
is_spam_binary = ifelse(is_spam == "spam", 1, 0)
) %>%
dplyr::select(-id, -sms)
With the features built, let’s apply the model and assess performance:
confusionMatrix(
factor(ifelse(predict(mod.glm, test_features_complete, type = "response") > 0.5, 1, 0)),
reference = factor(test_features_complete$is_spam_binary)
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2376 41
## 1 29 340
##
## Accuracy : 0.9749
## 95% CI : (0.9684, 0.9804)
## No Information Rate : 0.8632
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8922
##
## Mcnemar's Test P-Value : 0.1886
##
## Sensitivity : 0.9879
## Specificity : 0.8924
## Pos Pred Value : 0.9830
## Neg Pred Value : 0.9214
## Prevalence : 0.8632
## Detection Rate : 0.8528
## Detection Prevalence : 0.8676
## Balanced Accuracy : 0.9402
##
## 'Positive' Class : 0
##
auc(
test_features_complete$is_spam_binary,
predict(mod.glm, test_features_complete, type = "response")
)
## Area under the curve: 0.9758
The model shows consistent results between the training and testing sets, with very similar AUC values (0.977 on training vs. 0.976 on testing), showing that these features have high predictive power and generalize well for this use case.
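As a visual complement, and as a sketch that assumes mod.glm and the two feature datasets built above are still in memory, we can overlay the training and testing ROC curves with pROC to see how close the two AUCs really are:
# Sketch: overlay train vs. test ROC curves from the fitted model.
roc_train = roc(
  train_features_complete$is_spam_binary,
  predict(mod.glm, train_features_complete, type = "response")
)
roc_test = roc(
  test_features_complete$is_spam_binary,
  predict(mod.glm, test_features_complete, type = "response")
)
plot(roc_train, col = "steelblue")
plot(roc_test, col = "firebrick", add = TRUE)
legend(
  "bottomright",
  legend = c("Training (AUC 0.977)", "Testing (AUC 0.976)"),
  col = c("steelblue", "firebrick"),
  lwd = 2
)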
Bonus: interpreting coefficients
The following chart shows the odds ratio of each model component and is helpful for understanding which features in the model contribute the most to predicting SPAM:
library(sjPlot)
plot_model(
mod.glm,
show.values = T,
value.size = 3,
line.size = 0.2,
value.offset = 0.4,
vline.color = "grey",
sort.est = T
) +
theme_light(base_size = 13) +
labs(
title = "Odds ratio of model features.",
subtitle = "Estimates from logistic regression model. Colors indicating OR below or above 1."
)
Looking at the chart above, we can understand that:
- SMS messages that contain a URL have an 81.4x higher risk of being SPAM, making this the biggest statistically significant predictor in the model (the sketch after this list shows how to reproduce these odds ratios from the fitted model).
- Having an email address in the SMS content has a large estimated effect, but the low coverage of this feature doesn’t allow the effect to be statistically significant.
- Each SPAM-dictionary word detected in the SMS content increases the risk of being SPAM by 8.9x. Note that the same SMS can contain multiple SPAM words, compounding the effect, so the effort of building a SPAM dictionary from the text data was important for model performance.
- Other factors with significant effects were: presence of numbers, presence of emoticons and the count of digits in the SMS content.
- The remaining factors were not statistically significant: presence of a phone number, character count, presence of a date and presence of a dollar sign.
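If you prefer the numbers behind the chart, a minimal sketch using only the fitted mod.glm object is to exponentiate the coefficients and their Wald confidence intervals to obtain the odds ratios directly:
# Sketch: odds ratios with 95% Wald confidence intervals from the logistic
# regression (the same quantities plotted by plot_model above).
round(exp(cbind(OR = coef(mod.glm), confint.default(mod.glm))), 2)
For example, the has_url coefficient of 4.41 translates into an odds ratio of roughly 82.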
Closing
In this report we trained a model to predict whether an SMS message is SPAM or not. We used features created from fixed rules based on regular expressions over the message content, as well as a SPAM dictionary built from the SMS content itself and the training labels.
It is important to note that we did not need advanced text mining techniques such as word embeddings, nor advanced machine learning models: the engineered features had enough predictive value, and the logistic regression model was good enough for our use case.
The performance of the final model on the testing dataset showed that it generalizes well, with similar AUC between the training and testing sets, meaning the model is stable.