Introduction
The goal of this report is to walk through the steps of a data analysis project that uses text data to classify SMS messages as SPAM or not SPAM (ham). The data can be obtained from a Kaggle challenge at this link.
Data Analysis Steps
This project will use the following steps to create this classifier:
- Load the data and check data quality.
- Split the data into training and testing datasets.
- Feature engineering using simple stats and regular expressions.
- Creation of semantic dictionaries to be used as features.
- Apply a machine learning algorithm.
- Assess performance.
- Closing remarks.
Execution
Loading the data and data quality
Here is a sample of 6 rows of data where you can identify 3 columns:
- is_spam: identifies if the SMS message is spam or not (ham).
- sms: SMS content in text format.
- id: identifier of each SMS.
library(tidyverse)
library(tidytext)
library(skimr)
# Read the raw CSV and rename the original v1/v2 columns to friendlier names.
spam = read.csv(
  file = "/Users/dscorzoni/Documents/R Scripts/06 spam classifier/spam.csv",
  header = T,
  fileEncoding = "latin1"
) %>%
  mutate(
    is_spam = v1,
    sms = v2
  )

# Keep only the renamed columns and add a sequential id for each SMS.
spam = spam %>%
  dplyr::select(is_spam, sms) %>%
  mutate(
    id = 1:dim(spam)[1]
  )
knitr::kable(head(spam))
| is_spam | sms | id |
|---|---|---|
| ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet… Cine there got amore wat… | 1 |
| ham | Ok lar… Joking wif u oni… | 2 |
| spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C’s apply 08452810075over18’s | 3 |
| ham | U dun say so early hor… U c already then say… | 4 |
| ham | Nah I don’t think he goes to usf, he lives around here though | 5 |
| spam | FreeMsg Hey there darling it’s been 3 week’s now and no word back! I’d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv | 6 |
Splitting the data into training and testing sets
In the following code I’m splitting the dataset randomly 50%/50% into training and testing datasets. We will use the training dataset to train the machine learning algorithm and then apply the fitted model to the testing dataset to assess performance and check for overfitting.
# Splitting the dataset in training and testing
set.seed(123)
train = spam %>%
sample_n(round(dim(spam)[1]/2))
test = spam %>%
filter(!id %in% train$id)
dim(train)
## [1] 2786 3
dim(test)
## [1] 2786 3
Feature engineering
In this section we will use some basic text mining tools, such as character counting and regular expressions, to extract deterministic elements from the text content and create features that will be used in our classification task.
library(stringi)
library(qdapRegex)
train_features = train %>%
mutate(
char_count = nchar(sms), # Count the number of characters
has_numbers = ifelse(grepl("[0-9]", sms), 1, 0), # Boolean if the string has numbers
numbers_count = str_count(sms, "[0-9]"), # Count of number digits
has_url = ifelse(grepl(grab("@rm_url"), sms), 1, 0), # Detect URL (needs improvement)
has_date = ifelse(grepl(grab("@rm_date"), sms), 1, 0), # Detect Dates
has_dollar = ifelse(grepl(grab("@rm_dollar"), sms), 1, 0), # Detect Dollar Sign
has_emoticon = ifelse(grepl(grab("@rm_emoticon"), sms), 1, 0), # Detect Emoticon
has_email = ifelse(grepl(grab("@rm_email"), sms), 1, 0), # Detect Email
has_phone = ifelse(grepl(grab("@rm_phone"), sms), 1, 0), # Detect Phone Number
)
skim_without_charts(train_features)
| Name | train_features |
|---|---|
| Number of rows | 2786 |
| Number of columns | 12 |
| Column type frequency: | |
| character | 2 |
| numeric | 10 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| is_spam | 0 | 1 | 3 | 4 | 0 | 2 | 0 |
| sms | 0 | 1 | 2 | 632 | 0 | 2665 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 2763.95 | 1608.21 | 1 | 1359.25 | 2759 | 4131.75 | 5571 |
| char_count | 0 | 1 | 78.52 | 57.18 | 2 | 35.00 | 60 | 118.00 | 632 |
| has_numbers | 0 | 1 | 0.25 | 0.44 | 0 | 0.00 | 0 | 1.00 | 1 |
| numbers_count | 0 | 1 | 2.22 | 5.88 | 0 | 0.00 | 0 | 1.00 | 36 |
| has_url | 0 | 1 | 0.02 | 0.14 | 0 | 0.00 | 0 | 0.00 | 1 |
| has_date | 0 | 1 | 0.00 | 0.07 | 0 | 0.00 | 0 | 0.00 | 1 |
| has_dollar | 0 | 1 | 0.00 | 0.05 | 0 | 0.00 | 0 | 0.00 | 1 |
| has_emoticon | 0 | 1 | 0.18 | 0.38 | 0 | 0.00 | 0 | 0.00 | 1 |
| has_email | 0 | 1 | 0.00 | 0.05 | 0 | 0.00 | 0 | 0.00 | 1 |
| has_phone | 0 | 1 | 0.07 | 0.25 | 0 | 0.00 | 0 | 0.00 | 1 |
The summary table above shows how frequently some of these objects appear in the SMS messages. For example, URLs appear in 2% of the data, 25% of the messages contain numbers, 7% contain phone numbers and 18% contain emoticons. Just out of curiosity, let’s see what proportion of messages with a URL were classified as SPAM, so we can get a sense of the predictive power of this feature:
train_features %>%
count(is_spam, has_url) %>%
pivot_wider(
names_from = "is_spam",
values_from = "n",
values_fill = 0
) %>%
mutate(
total = ham + spam,
spam_pct = round(spam / total * 100,1)
) %>%
knitr::kable(.)
| has_url | ham | spam | total | spam_pct |
|---|---|---|---|---|
| 0 | 2418 | 316 | 2734 | 11.6 |
| 1 | 2 | 50 | 52 | 96.2 |
So, according to the table above, the spam rate is 96.2% among messages that contain a URL, versus only 11.6% among messages that don’t. Let’s look at another example, this time with emoticons:
train_features %>%
count(is_spam, has_emoticon) %>%
pivot_wider(
names_from = "is_spam",
values_from = "n",
values_fill = 0
) %>%
mutate(
total = ham + spam,
spam_pct = round(spam / total * 100,1)
) %>%
knitr::kable(.)
| has_emoticon | ham | spam | total | spam_pct |
|---|---|---|---|---|
| 0 | 2126 | 165 | 2291 | 7.2 |
| 1 | 294 | 201 | 495 | 40.6 |
Here we see another interesting result: messages with emoticons also have a higher rate of SPAM (40.6%) compared to messages without emoticons (7.2%). However, note how the lift differs from that of URLs: while for URLs the lift is 8.3x (96.2/11.6), for emoticons it is 5.6x (40.6/7.2). Both variables show strong predictive power, but the URL feature is stronger.
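Since the same lift calculation is useful for the other boolean features as well, here is a small helper. This is a sketch that is not part of the original analysis and assumes the train_features object built above is in memory:
# Sketch: spam-rate lift for any 0/1 feature (assumes train_features exists).
feature_lift = function(data, feature) {
  rates = data %>%
    group_by(flag = .data[[feature]]) %>%              # 0 = feature absent, 1 = present
    summarise(spam_rate = mean(is_spam == "spam")) %>% # spam rate within each group
    arrange(flag)
  rates$spam_rate[2] / rates$spam_rate[1]              # lift = rate(present) / rate(absent)
}
feature_lift(train_features, "has_url")      # ~8.3, as computed above
feature_lift(train_features, "has_emoticon") # ~5.6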
Now, let’s explore some word frequency techniques to identify which words are more related to SPAM and create a dictionary that can help us build more features based on textual content.
Semantic Dictionaries
We can start by exploring which words occur more frequently in SPAM messages but not in HAM. But before that, we need to process the text a little to remove noise:
library(tidytext)
stopwords_en = get_stopwords(language = "en")
train %>%
unnest_tokens(word, sms) %>%
anti_join(stopwords_en) %>% # This removes too frequent words that don't add meaning.
filter(nchar(word) >= 3) %>% # Remove too short words
count(is_spam, word) %>% # Counting word frequency
group_by(is_spam) %>%
arrange(-n) %>% # Ordering from more to less frequent
slice_head(n = 10) %>%
ggplot(aes(x = word, y = n, fill = is_spam)) +
geom_bar(stat = "identity") +
facet_wrap(~ is_spam, scales = "free_y") +
coord_flip() +
theme_light(base_size = 13) +
labs(
title = "Top 10 frequent words in ham and spam SMS messages.",
subtitle = "Kaggle SMS SPAM Dataset",
x = "Words",
y = "Frequency"
)
We get very interesting insights from the chart above: the words with the highest frequency in SPAM messages are stop, text, reply, mobile, free and claim. You can probably recognize these words from most of the SPAM messages you receive every day.
To highlight the contrast between the two classes, we can calculate the relative frequency of each word within ham and within spam and take the ratio of these proportions, giving each word’s lift for detecting spam:
spam_docs = train %>%
count(is_spam, name = "total_docs")
spam_dict = train %>%
unnest_tokens(word, sms) %>%
anti_join(stopwords_en) %>% # This removes too frequent words that don't add meaning.
filter(nchar(word) >= 3) %>%
count(is_spam, word) %>%
inner_join(spam_docs) %>%
mutate(
word_prop = n/total_docs
) %>%
dplyr::select(word_prop, is_spam, word) %>%
pivot_wider(names_from = "is_spam", values_from = "word_prop", names_prefix = "word_") %>%
filter(word_ham > 0, word_spam > 0) %>%
mutate(
word_ratio = word_spam / word_ham
)
top_30_spam_words = spam_dict %>%
arrange(-word_ratio) %>%
slice_head(n = 30)
top_30_spam_words %>%
mutate(
page = c(rep(1, 10), rep(2,10), rep(3,10))
) %>%
ggplot(aes(x = word, y = word_ratio)) +
geom_bar(stat = "identity") +
coord_flip() +
facet_wrap(~ page, scales = "free_y") +
theme_light(base_size = 13) +
labs(
title = "Top 30 Word Lift on SPAM messages over HAM messages.",
subtitle = "Words with bigger predictive power to detect SPAM messages.",
x = "Words",
y = "Lift (Relative Freq. Ratio)"
)
These are the words with the highest relative frequency in SPAM messages compared to their frequency in HAM messages. Many of them are easy to recognize from real SPAM messages, and we can build a dictionary from this criterion by selecting the top X words and then counting how many of them appear in each document, creating a dictionary-based feature. Let’s use these 30 words as our SPAM dictionary and create a new feature:
train_features_complete = train_features %>%
left_join(
train %>%
unnest_tokens(word, sms) %>%
anti_join(stopwords_en) %>% # This removes too frequent words that don't add meaning.
filter(nchar(word) >= 3) %>%
dplyr::select(id, word) %>%
inner_join(top_30_spam_words) %>%
count(id, name = "spam_words_count")
) %>%
mutate(
spam_words_count = replace_na(spam_words_count, 0),
is_spam_binary = ifelse(is_spam == "spam", 1, 0)
) %>%
dplyr::select(-id, -sms)
train_features_complete %>%
count(spam_words_count, is_spam) %>%
pivot_wider(names_from = is_spam, values_from = n, values_fill = 0) %>%
mutate(
total = ham + spam,
spam_pct = round(spam / total * 100,1)
) %>%
knitr::kable(.)
| spam_words_count | ham | spam | total | spam_pct |
|---|---|---|---|---|
| 0 | 2375 | 100 | 2475 | 4.0 |
| 1 | 40 | 101 | 141 | 71.6 |
| 2 | 4 | 80 | 84 | 95.2 |
| 3 | 1 | 50 | 51 | 98.0 |
| 4 | 0 | 29 | 29 | 100.0 |
| 5 | 0 | 5 | 5 | 100.0 |
| 6 | 0 | 1 | 1 | 100.0 |
The table above shows how effective this new feature is! As the number of spam words increases, the percentage of SPAM messages increases sharply. Note also that this feature alone already catches a good portion of the SPAM messages, as the quick check below shows.
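To put a number on that, here is a quick sanity check. This is a sketch assuming train_features_complete from the step above is still in memory, computing the share of SPAM messages that contain at least one dictionary word:
# Sketch: share of SPAM messages with at least one dictionary word.
# From the counts in the table above this is (366 - 100) / 366, roughly 0.73.
train_features_complete %>%
  filter(is_spam == "spam") %>%
  summarise(dictionary_recall = mean(spam_words_count >= 1))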
Machine Learning Model
Finally, let’s start with a simple logistic regression to model the probability of an SMS being SPAM:
mod.glm = glm(
is_spam_binary ~ .,
data = train_features_complete %>% dplyr::select(-is_spam),
family = "binomial"
)
summary(mod.glm)
##
## Call:
## glm(formula = is_spam_binary ~ ., family = "binomial", data = train_features_complete %>%
## dplyr::select(-is_spam))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9322 -0.0926 -0.0915 -0.0891 3.3443
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.432533 0.340092 -15.974 < 2e-16 ***
## char_count -0.001026 0.002478 -0.414 0.678972
## has_numbers 1.193233 0.425571 2.804 0.005050 **
## numbers_count 0.524931 0.071649 7.326 2.36e-13 ***
## has_url 4.411534 0.949499 4.646 3.38e-06 ***
## has_date -0.985855 4.699260 -0.210 0.833832
## has_dollar -2.168442 1.264742 -1.715 0.086431 .
## has_emoticon 1.156173 0.342986 3.371 0.000749 ***
## has_email 3.056134 2.573076 1.188 0.234938
## has_phone 0.062574 0.980224 0.064 0.949101
## spam_words_count 2.292275 0.309353 7.410 1.26e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2167.42 on 2785 degrees of freedom
## Residual deviance: 349.42 on 2775 degrees of freedom
## AIC: 371.42
##
## Number of Fisher Scoring iterations: 8
library(caret)
confusionMatrix(
factor(ifelse(predict(mod.glm, train_features_complete, type = "response") > 0.5, 1, 0)),
reference = factor(train_features_complete$is_spam_binary)
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2408 41
## 1 12 325
##
## Accuracy : 0.981
## 95% CI : (0.9752, 0.9857)
## No Information Rate : 0.8686
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9137
##
## Mcnemar's Test P-Value : 0.00012
##
## Sensitivity : 0.9950
## Specificity : 0.8880
## Pos Pred Value : 0.9833
## Neg Pred Value : 0.9644
## Prevalence : 0.8686
## Detection Rate : 0.8643
## Detection Prevalence : 0.8790
## Balanced Accuracy : 0.9415
##
## 'Positive' Class : 0
##
library(pROC)
auc(
train_features_complete$is_spam_binary,
predict(mod.glm, train_features_complete, type = "response")
)
## Area under the curve: 0.9775
Performance on Testing Set
The logistic regression shows very good performance, with AUC > 0.9. Usually this would be too good to be true, but it is expected here since we are evaluating on the training data itself. Now, let’s apply the same feature engineering to the test set, score it with the model, and check whether we get similar results:
# Feature Engineering
test_features_complete = test %>%
mutate(
char_count = nchar(sms), # Count the number of characters
has_numbers = ifelse(grepl("[0-9]", sms), 1, 0), # Boolean if the string has numbers
numbers_count = str_count(sms, "[0-9]"), # Count of number digits
has_url = ifelse(grepl(grab("@rm_url"), sms), 1, 0), # Detect URL (needs improvement)
has_date = ifelse(grepl(grab("@rm_date"), sms), 1, 0), # Detect Dates
has_dollar = ifelse(grepl(grab("@rm_dollar"), sms), 1, 0), # Detect Dollar Sign
has_emoticon = ifelse(grepl(grab("@rm_emoticon"), sms), 1, 0), # Detect Emoticon
has_email = ifelse(grepl(grab("@rm_email"), sms), 1, 0), # Detect Email
has_phone = ifelse(grepl(grab("@rm_phone"), sms), 1, 0), # Detect Phone Number
) %>%
left_join(
test %>%
unnest_tokens(word, sms) %>%
anti_join(stopwords_en) %>% # This removes too frequent words that don't add meaning.
filter(nchar(word) >= 3) %>%
dplyr::select(id, word) %>%
inner_join(top_30_spam_words) %>%
count(id, name = "spam_words_count")
) %>%
mutate(
spam_words_count = replace_na(spam_words_count, 0),
is_spam_binary = ifelse(is_spam == "spam", 1, 0)
) %>%
dplyr::select(-id, -sms)
With the features built, let’s apply the model and assess performance:
confusionMatrix(
factor(ifelse(predict(mod.glm, test_features_complete, type = "response") > 0.5, 1, 0)),
reference = factor(test_features_complete$is_spam_binary)
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2376 41
## 1 29 340
##
## Accuracy : 0.9749
## 95% CI : (0.9684, 0.9804)
## No Information Rate : 0.8632
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8922
##
## Mcnemar's Test P-Value : 0.1886
##
## Sensitivity : 0.9879
## Specificity : 0.8924
## Pos Pred Value : 0.9830
## Neg Pred Value : 0.9214
## Prevalence : 0.8632
## Detection Rate : 0.8528
## Detection Prevalence : 0.8676
## Balanced Accuracy : 0.9402
##
## 'Positive' Class : 0
##
auc(
test_features_complete$is_spam_binary,
predict(mod.glm, test_features_complete, type = "response")
)
## Area under the curve: 0.9758
The model shows consistent results between the training and testing sets, with very similar AUC values (0.977 on training vs. 0.976 on testing), showing that these features have high predictive power and generalize well for this use case.
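As a visual complement, and as a sketch that assumes mod.glm and the two feature datasets built above are still in memory, we can overlay the training and testing ROC curves with pROC to see how close the two AUCs really are:
# Sketch: overlay train vs. test ROC curves from the fitted model.
roc_train = roc(
  train_features_complete$is_spam_binary,
  predict(mod.glm, train_features_complete, type = "response")
)
roc_test = roc(
  test_features_complete$is_spam_binary,
  predict(mod.glm, test_features_complete, type = "response")
)
plot(roc_train, col = "steelblue")
plot(roc_test, col = "firebrick", add = TRUE)
legend(
  "bottomright",
  legend = c("Training (AUC 0.977)", "Testing (AUC 0.976)"),
  col = c("steelblue", "firebrick"),
  lwd = 2
)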
Bonus: interpreting coefficients
The following chart shows the odds ratio of each model component and is helpful for understanding which features in the model contribute the most to predicting SPAM:
library(sjPlot)
plot_model(
mod.glm,
show.values = T,
value.size = 3,
line.size = 0.2,
value.offset = 0.4,
vline.color = "grey",
sort.est = T
) +
theme_light(base_size = 13) +
labs(
title = "Odds ratio of model features.",
subtitle = "Estimates from logistic regression model. Colors indicating OR below or above 1."
)
Looking at the chart above, we can understand that:
- SMS messages that contain a URL have an 81.4x higher risk of being SPAM, making this the biggest statistically significant predictor in the model (the sketch after this list shows how to reproduce these odds ratios from the fitted model).
- Having an email address in the SMS content has a large estimated effect, but the low coverage of this feature doesn’t allow the effect to be statistically significant.
- Each SPAM-dictionary word detected in the SMS content increases the risk of being SPAM by 8.9x. Note that the same SMS can contain multiple SPAM words, compounding the effect, so the effort of building a SPAM dictionary from the text data was important for model performance.
- Other factors with significant effects were: presence of numbers, presence of emoticons and the count of digits in the SMS content.
- The remaining factors were not statistically significant: presence of a phone number, character count, presence of a date and presence of a dollar sign.
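If you prefer the numbers behind the chart, a minimal sketch using only the fitted mod.glm object is to exponentiate the coefficients and their Wald confidence intervals to obtain the odds ratios directly:
# Sketch: odds ratios with 95% Wald confidence intervals from the logistic
# regression (the same quantities plotted by plot_model above).
round(exp(cbind(OR = coef(mod.glm), confint.default(mod.glm))), 2)
For example, the has_url coefficient of 4.41 translates into an odds ratio of roughly 82.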
Closing
In this report we trained a model to predict whether an SMS message is SPAM or not. We used features created from fixed rules based on regular expressions over the message content, as well as a SPAM dictionary built from the SMS content itself and the training labels.
It is important to note that we did not need advanced text mining techniques such as word embeddings, nor advanced machine learning models: the engineered features had enough predictive value, and the logistic regression model was good enough for our use case.
The performance of the final model on the testing dataset showed that it generalizes well, with similar AUC between the training and testing sets, meaning the model is stable.