Introduction

This project aims to build a model for detecting spam in SMS messages. Specifically, it uses a Naive Bayes classifier to determine whether a given text message is spam or not.

The following sections walk through the concepts applied in this project: loading and cleaning the data, splitting it into training and testing sets, balancing the training data, extracting features with a document-term matrix, and training and evaluating the model.

Loading the Packages

The following packages were loaded for this project:

# Document-creation
pacman::p_load("rsconnect")

# Tidyverse-related packages
pacman::p_load("readr", "dplyr", "stringr", "purrr")

# Data-cleaning/balancing datasets
pacman::p_load("stopwords", "ROSE", "tm")

# Model-related packages
pacman::p_load("caret", "fastNaiveBayes", "broom", "ePCR")

Loading the Dataset

This dataset is obtained from the UCI Machine Learning Repository. The full collection contains over 5,000 pre-classified SMS messages labelled as spam or non-spam (ham). The dataset can be found here: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

# Loads tab-separated text file
spam_dataset <- read_tsv("SMSSpamCollection.tsv", col_names = FALSE, show_col_types = FALSE)
colnames(spam_dataset) <- c("Type", "Message")

# Prints the dimensions of the dataset
cat("Rows: ", dim(spam_dataset)[1], "\nColumns: ", dim(spam_dataset)[2])
## Rows:  4773 
## Columns:  2
# Displays the first 10 rows
head(spam_dataset, 10)
## # A tibble: 10 x 2
##    Type  Message                                                                
##    <chr> <chr>                                                                  
##  1 ham   Go until jurong point, crazy.. Available only in bugis n great world l~
##  2 ham   Ok lar... Joking wif u oni...                                          
##  3 spam  Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Te~
##  4 ham   U dun say so early hor... U c already then say...                      
##  5 ham   Nah I don't think he goes to usf, he lives around here though          
##  6 spam  FreeMsg Hey there darling it's been 3 week's now and no word back! I'd~
##  7 ham   Even my brother is not like to speak with me. They treat me like aids ~
##  8 ham   As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)'~
##  9 spam  WINNER!! As a valued network customer you have been selected to receiv~
## 10 spam  Had your mobile 11 months or more? U R entitled to Update to the lates~

Inspecting the Dataset

Looking at the data, there are far more messages classified as non-spam (ham) than as spam. To remedy this and avoid biasing the model towards the majority class, the author will apply a balancing technique to the training data. This is introduced in a later section.

# Displays the number of messages per classification
table(spam_dataset$Type)
## 
##  ham spam 
## 4123  650
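
To make the imbalance explicit, the class proportions can be computed directly; based on the counts above, roughly 86% of the messages are ham. A quick check:

# Shows the proportion of messages per class
round(prop.table(table(spam_dataset$Type)), 3)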

Cleaning the Data

For this step, the data is pre-processed using relevant cleaning methods. This ensures that only meaningful tokens remain as candidate features for training and testing.

# Loads all English stopwords
eng_stopwords <- stopwords(kind = "en")

# Function for cleaning SMS messages
clean_sms <- function(message){
  
  message <- str_to_lower(message) # Converts text to lowercase
  
  # Removes English stopwords from the text
  message <- unlist(map(str_split(message, " "), 
                        function(x) paste(x[!(x %in% eng_stopwords)], collapse = " ")))
  
  message <- str_replace_all(message, "[:punct:]|[:digit:]", " ") # Replaces punctuation/digits with spaces
  message <- str_squish(message) # Removes extra spaces
  
  # Retains only alphabetic tokens
  message <- paste(unlist(str_extract_all(message, "[A-Za-z]+"))
                   , collapse = " ") 
  
  # Retains words with three or more characters
  message <- unlist(map(str_split(message, " "), 
                        function(x) paste(x[str_length(x) > 2], collapse = " ")))
  message
}
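
As a quick sanity check, the cleaning function can be applied to a single made-up message; it should keep only lowercase alphabetic tokens of three or more characters, with stopwords, digits, and punctuation removed.

# Example with a hypothetical message - keeps tokens such as "winner", "free", "tickets"
clean_sms("WINNER!! You have won 2 FREE tickets, call 09061701461 now!")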

# Cleans the dataset and converts class labels to factor
spam_dataset$Message <- unlist(map(spam_dataset$Message, clean_sms))
spam_dataset$Type <- factor(spam_dataset$Type, levels = c("spam", "ham"))

# Displays the first 10 rows of the cleaned dataset
head(spam_dataset, 10)
## # A tibble: 10 x 2
##    Type  Message                                                                
##    <fct> <chr>                                                                  
##  1 ham   jurong point crazy available bugis great world buffet cine got amore w~
##  2 ham   lar joking wif oni                                                     
##  3 spam  free entry wkly comp win cup final tkts may text receive entry questio~
##  4 ham   dun say early hor already say                                          
##  5 ham   nah think goes usf lives around though                                 
##  6 spam  freemsg hey darling week now word back like fun still xxx std chgs sen~
##  7 ham   even brother like speak treat like aids patent                         
##  8 ham   per request melle melle oru minnaminunginte nurungu vettam set callert~
##  9 spam  winner valued network customer selected receivea prize reward claim ca~
## 10 spam  mobile months more entitled update latest colour mobiles camera free c~

Train-test Splitting

Afterwards, the cleaned dataset will be split into two: 80% of the data will serve as the training data while 20% will serve as the testing data.

# For replicability
set.seed(1)

# Generates training indices for 80% of the data
split_indeces <- createDataPartition(y = spam_dataset[["Type"]], p = 0.8, list=FALSE)

# Splits data into training/testing
train_set <- spam_dataset[split_indeces,]
test_set <- spam_dataset[-split_indeces,]  

# Prints the number of rows in each split
cat("Training Set:", dim(train_set)[1], "\nTesting Set:", dim(test_set)[1])
## Training Set: 3819 
## Testing Set: 954
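
Because createDataPartition samples within each class, both splits keep roughly the original spam/ham ratio; this is why the balancing step in the next section is applied only to the training data. A quick check of the training-set proportions:

# Confirms the stratified split preserves the original class imbalance
round(prop.table(table(train_set$Type)), 3)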

Remedy to the Imbalance Problem

Since the training data is imbalanced, the ovun.sample function from the ROSE (Random Over-Sampling Examples) package is applied with method = "both". This over-samples the under-represented spam class by resampling with replacement and under-samples the ham class, while keeping the training set at roughly its original size. The result is a training set with a class proportion of nearly 50-50.

# Balances the training dataset while retaining the same amount of data
balanced_train_set <- ovun.sample(Type~., data = train_set, p = 0.5, seed = 1, method="both")$data

# Displays the updated number of messages per classification
table(balanced_train_set$Type)
## 
##  ham spam 
## 1964 1855
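
Since method = "both" combines over- and under-sampling, it is worth confirming that the balanced set still has the original number of training rows (the counts above sum to 3,819):

# Checks the size of the balanced training set
nrow(balanced_train_set)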

Feature-Response Selection

The balanced training data is then used to create a document-term matrix. After stemming, this matrix contains the frequency count of each term (feature) in every document/SMS message.

# Implements a function for converting a vector of text messages
# into a document-term matrix
convert_to_dtm_matrix <- function(vector){
  corpus <- Corpus(VectorSource(as.vector(vector))) # Builds a corpus from the messages
  corpus <- tm_map(corpus, stemDocument)            # Stems each word to its root form
  dtm <- as.matrix(DocumentTermMatrix(corpus))      # Converts to a document-term frequency matrix
  dtm
}

# Splits training data into x/y - feature/response variables
train_x = convert_to_dtm_matrix(balanced_train_set[["Message"]])
train_y = balanced_train_set[["Type"]]

cat("Total no. of Features:", ncol(train_x), 
    "\nTotal no. of Documents:", nrow(train_x))
## Total no. of Features: 3541 
## Total no. of Documents: 3819
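
To get a feel for the resulting feature space, the most frequent stemmed terms can be inspected. A short sketch (the actual terms depend on the corpus):

# Displays the ten most frequent terms across the balanced training documents
head(sort(colSums(train_x), decreasing = TRUE), 10)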

Training the Model

After the data has been prepared, the Naive Bayes classifier is trained. The fastNaiveBayes function is used instead of the traditional naive_bayes function since it can train on large amounts of data at much higher speed. A Laplace smoothing value of 1 is used so that terms that never appear in one class do not zero out that class's probability.

# Implements a classifier using the created training data
nb_classifier <- fastNaiveBayes(x = train_x, y = train_y, laplace = 1)

Model Evaluation

The resulting model is then evaluated on the held-out test data. Based on the results below, the model achieves roughly 98% accuracy in classifying SMS messages. It correctly detects spam messages (sensitivity) around 92% of the time and correctly classifies non-spam messages (specificity) around 99% of the time.

# Splits test data into x/y - feature/response variables
test_x = convert_to_dtm_matrix(test_set[["Message"]])
test_y = test_set[["Type"]]

# Evaluates model using the test data
test_predictions <- predict(nb_classifier, newdata = test_x)
test_cf <- confusionMatrix(test_predictions, test_y)
test_cf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction spam ham
##       spam  120   5
##       ham    10 819
##                                           
##                Accuracy : 0.9843          
##                  95% CI : (0.9742, 0.9912)
##     No Information Rate : 0.8637          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9321          
##                                           
##  Mcnemar's Test P-Value : 0.3017          
##                                           
##             Sensitivity : 0.9231          
##             Specificity : 0.9939          
##          Pos Pred Value : 0.9600          
##          Neg Pred Value : 0.9879          
##              Prevalence : 0.1363          
##          Detection Rate : 0.1258          
##    Detection Prevalence : 0.1310          
##       Balanced Accuracy : 0.9585          
##                                           
##        'Positive' Class : spam            
## 
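
To use the model on unseen text, a new message must be cleaned and represented with the same vocabulary the model was trained on. The sketch below, using two made-up messages, re-aligns the new document-term matrix to the training columns (terms absent from the training vocabulary are dropped, missing ones are filled with zeros); it is an illustration only and not part of the evaluation above.

# Hypothetical new messages, cleaned with the same function as the training data
new_messages <- unlist(map(c("Congratulations! You have won a FREE prize, call now",
                             "Are we still meeting for lunch later?"), clean_sms))

# Builds a document-term matrix for the new messages
new_dtm <- convert_to_dtm_matrix(new_messages)

# Re-aligns the columns to the training vocabulary
aligned <- matrix(0, nrow = nrow(new_dtm), ncol = ncol(train_x),
                  dimnames = list(NULL, colnames(train_x)))
shared_terms <- intersect(colnames(new_dtm), colnames(train_x))
aligned[, shared_terms] <- new_dtm[, shared_terms]

# Predicts spam/ham for the new messages
predict(nb_classifier, newdata = aligned)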

Conclusion

The aim of this project was to create a model for detecting spam in SMS messages. Methods were introduced to prepare the data for training and to reduce possible bias in the model.

Based on the results above, the project produced a model with over 98% accuracy on the test set while maintaining a high detection rate for spam, correctly flagging around 92% of spam messages.