This project aims to create a model for detecting spam in SMS messages. Specifically, it uses a Naive Bayes classifier to determine whether a given text message is spam or not.
The following concepts were applied in this project:
The following packages were loaded for this project:
# Document-creation
pacman::p_load("rsconnect")
# Tidyverse-related packages
pacman::p_load("readr", "dplyr", "stringr", "purrr")
# Data-cleaning/balancing datasets
pacman::p_load("stopwords", "ROSE", "tm")
# Model-related packages
pacman::p_load("caret", "fastNaiveBayes", "broom", "ePCR")
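This dataset was obtained from the UCI Machine Learning Repository. It contains over 5,000 pre-classified SMS messages (spam and non-spam/ham). The link to the dataset can be found here: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
Note that the import below yields 4,773 rows, which is fewer than the collection's stated size. A likely cause is that read_tsv treats double quotes as quoting characters by default, which can merge several lines into one record; passing quote = "" to read_tsv would be one way to check this.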
# Loads tab-separated text file
spam_dataset <- read_tsv("SMSSpamCollection.tsv", col_names = FALSE, show_col_types = FALSE)
colnames(spam_dataset) <- c("Type", "Message")
# Prints the dimensions of the dataset
cat("Rows: ", dim(spam_dataset)[1], "\nColumns: ", dim(spam_dataset)[2])
## Rows: 4773
## Columns: 2
# Displays the first 10 rows
head(spam_dataset, 10)
## # A tibble: 10 x 2
## Type Message
## <chr> <chr>
## 1 ham Go until jurong point, crazy.. Available only in bugis n great world l~
## 2 ham Ok lar... Joking wif u oni...
## 3 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Te~
## 4 ham U dun say so early hor... U c already then say...
## 5 ham Nah I don't think he goes to usf, he lives around here though
## 6 spam FreeMsg Hey there darling it's been 3 week's now and no word back! I'd~
## 7 ham Even my brother is not like to speak with me. They treat me like aids ~
## 8 ham As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)'~
## 9 spam WINNER!! As a valued network customer you have been selected to receiv~
## 10 spam Had your mobile 11 months or more? U R entitled to Update to the lates~
The class counts below show that the dataset is imbalanced: 4,123 messages are labelled ham and only 650 are labelled spam (about 14% of the total). To reduce the bias this could introduce in the model, a balancing algorithm is applied to the training data, as described in the following sections.
# Displays the number of messages per classification
table(spam_dataset$Type)
##
## ham spam
## 4123 650
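To put the imbalance in numbers, the counts above can be converted to proportions. This is a small sketch that uses only the counts printed in the table; the variable names are illustrative and not part of the original analysis.

```r
# Class counts copied from the table above
class_counts <- c(ham = 4123, spam = 650)

# Share of each class in the full dataset
class_share <- class_counts / sum(class_counts)
print(round(class_share, 3))
```

A classifier that always predicted ham would already be right for the ham share of messages, which is why accuracy alone is a weak yardstick for this dataset.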
In this step, the messages are pre-processed with several cleaning methods so that the model is trained and tested on meaningful features rather than on noise such as punctuation, digits, and stopwords.
# Loads all English stopwords once so they can be reused for every message
eng_stopwords <- stopwords(kind = "en")
# Function for cleaning SMS messages
clean_sms <- function(message){
  message <- str_to_lower(message) # Converts text to lowercase
  # Removes stopwords in the text
  message <- unlist(map(str_split(message, " "),
                        function(x) paste(x[!(x %in% eng_stopwords)], collapse = " ")))
  # Replaces punctuation and digits with spaces
  message <- str_replace_all(message, "[:punct:]|[:digit:]", " ")
  message <- str_squish(message) # Removes extra spaces
  # Keeps only alphabetic tokens
  message <- paste(unlist(str_extract_all(message, "[A-Za-z]+")), collapse = " ")
  # Retains words with three or more characters
  message <- unlist(map(str_split(message, " "),
                        function(x) paste(x[str_length(x) > 2], collapse = " ")))
  message
}
# Cleans dataset and converts class labels to factor
spam_dataset$Message <- unlist(map(spam_dataset$Message, clean_sms))
spam_dataset$Type <- factor(spam_dataset$Type, levels = c("spam", "ham"))
# Displays the first 10 rows of the cleaned dataset
head(spam_dataset, 10)
## # A tibble: 10 x 2
## Type Message
## <fct> <chr>
## 1 ham jurong point crazy available bugis great world buffet cine got amore w~
## 2 ham lar joking wif oni
## 3 spam free entry wkly comp win cup final tkts may text receive entry questio~
## 4 ham dun say early hor already say
## 5 ham nah think goes usf lives around though
## 6 spam freemsg hey darling week now word back like fun still xxx std chgs sen~
## 7 ham even brother like speak treat like aids patent
## 8 ham per request melle melle oru minnaminunginte nurungu vettam set callert~
## 9 spam winner valued network customer selected receivea prize reward claim ca~
## 10 spam mobile months more entitled update latest colour mobiles camera free c~
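The cleaning steps above can also be sketched with base R only, which makes it easy to try them on a single message without loading any packages. This is an approximation rather than the function used in this project: `clean_sms_base` and `toy_stopwords` are hypothetical names, the stopword list is a tiny stand-in for the full English list, and base R's POSIX character classes may treat some symbols differently from stringr.

```r
# Base R approximation of the cleaning pipeline (hypothetical helper)
clean_sms_base <- function(messages, stop_words) {
  messages <- tolower(messages)                      # lowercase
  tokens <- strsplit(messages, " ", fixed = TRUE)    # split on single spaces
  vapply(tokens, function(words) {
    words <- words[!(words %in% stop_words)]         # drop stopwords
    text <- paste(words, collapse = " ")
    text <- gsub("[[:punct:][:digit:]]", " ", text)  # punctuation/digits -> spaces
    words <- unlist(regmatches(text, gregexpr("[a-z]+", text)))  # alphabetic tokens
    paste(words[nchar(words) > 2], collapse = " ")   # keep words of 3+ characters
  }, character(1))
}

# Toy stopword list, assumed for illustration only
toy_stopwords <- c("in", "a", "to")
cleaned <- clean_sms_base(
  c("Free entry in 2 a wkly comp to WIN!!", "Ok lar... Joking wif u oni..."),
  toy_stopwords
)
print(cleaned)
```

Because the helper is vectorised over `messages`, it can be applied to a whole column in one call instead of once per message.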
Afterwards, the cleaned dataset will be split into two: 80% of the data will serve as the training data while 20% will serve as the testing data.
# For replicability
set.seed(1)
# Generates training indices for 80% of the data (stratified by class)
split_indices <- createDataPartition(y = spam_dataset[["Type"]], p = 0.8, list = FALSE)
# Splits data into training/testing
train_set <- spam_dataset[split_indices, ]
test_set <- spam_dataset[-split_indices, ]
# Prints the number of rows in each set
cat("Training Set:", dim(train_set)[1], "\nTesting Set:", dim(test_set)[1])
## Training Set: 3819
## Testing Set: 954
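For a factor outcome, createDataPartition samples within each class so that the class proportions are roughly preserved in both sets. The idea can be sketched in base R as follows; `stratified_split` is a hypothetical helper written for illustration and is not caret's implementation.

```r
# Stratified sampling sketch: draw a fraction p of the rows from each class
stratified_split <- function(labels, p) {
  groups <- split(seq_along(labels), labels)  # row indices per class
  picked <- lapply(groups, function(idx) {
    # sample positions rather than values to avoid sample()'s
    # special behaviour when idx has length one
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1)
toy_labels <- factor(rep(c("ham", "spam"), times = c(80, 20)))
train_idx <- stratified_split(toy_labels, p = 0.8)
print(table(toy_labels[train_idx]))
```

Sampling per class matters here because spam is rare: a purely random 80/20 split could leave the test set with noticeably more or less spam than the training set.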
Since the training dataset is imbalanced, the ovun.sample function from the ROSE (Random Over-Sampling Examples) package is used to rebalance it. With method = "both", it over-samples the under-represented class (with replacement) and under-samples the majority class while keeping the original number of rows. With p = 0.5, this results in a dataset in which the two classes are split nearly 50-50.
# Balances the training dataset while retaining the same amount of data
balanced_train_set <- ovun.sample(Type~., data = train_set, p = 0.5, seed = 1, method="both")$data
# Displays the updated number of messages per classification
table(balanced_train_set$Type)
##
## ham spam
## 1964 1855
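The combined over- and under-sampling idea can be illustrated with a short base R sketch. It is an assumption-laden approximation of the concept, not the ROSE package's code: `resample_both` is a hypothetical helper, and it returns row indices rather than a data frame.

```r
# Resampling sketch: keep the dataset size, aim for a minority share of about p
resample_both <- function(labels, p = 0.5) {
  n <- length(labels)
  counts <- table(labels)
  minority <- names(counts)[which.min(counts)]
  min_idx <- which(labels == minority)
  maj_idx <- which(labels != minority)

  # Number of minority rows in the new sample is drawn at random around n * p
  n_min <- rbinom(1, size = n, prob = p)
  n_maj <- n - n_min

  c(
    # minority class: over-sampled with replacement
    min_idx[sample.int(length(min_idx), n_min, replace = TRUE)],
    # majority class: under-sampled, without replacement when possible
    maj_idx[sample.int(length(maj_idx), n_maj, replace = n_maj > length(maj_idx))]
  )
}

set.seed(1)
toy_labels <- rep(c("ham", "spam"), times = c(90, 10))
new_idx <- resample_both(toy_labels, p = 0.5)
print(table(toy_labels[new_idx]))
```

Note that the over-sampled spam rows are repeats of existing messages, so the balanced set contains duplicates by design.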
The balanced dataset is then used to create a document-term matrix. Each row of this matrix corresponds to one document (SMS message) and each column to one feature (a stemmed word); the entries are the frequency counts of each feature in each document.
# Implements a function for converting a vector of text messages
# into a document-term matrix
convert_to_dtm_matrix <- function(vector){
  corpus <- Corpus(VectorSource(as.vector(vector)))
  corpus <- tm_map(corpus, stemDocument)
  dtm <- as.matrix(DocumentTermMatrix(corpus))
  dtm
}
# Splits training data into x/y - feature/response variables
train_x <- convert_to_dtm_matrix(balanced_train_set[["Message"]])
train_y <- balanced_train_set[["Type"]]
cat("Total no. of Features:", ncol(train_x),
"\nTotal no. of Documents:", nrow(train_x))
## Total no. of Features: 3541
## Total no. of Documents: 3819
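To make the structure of a document-term matrix concrete, here is a minimal base R sketch that counts words per message. It is not the tm implementation and it skips stemming; `build_dtm` and its `vocab` argument are hypothetical names. The optional `vocab` argument shows one way to give new messages the same columns as the training matrix, which matters because a matrix built from other messages alone will generally have a different vocabulary.

```r
# Document-term matrix sketch: rows are messages, columns are words
build_dtm <- function(messages, vocab = NULL) {
  tokens <- lapply(strsplit(messages, " +"), function(words) words[nzchar(words)])
  # Without a supplied vocabulary, use every word seen in these messages
  if (is.null(vocab)) vocab <- sort(unique(unlist(tokens)))
  dtm <- matrix(0L, nrow = length(messages), ncol = length(vocab),
                dimnames = list(NULL, vocab))
  for (i in seq_along(tokens)) {
    # Words outside the vocabulary become NA and are dropped by table()
    dtm[i, ] <- as.integer(table(factor(tokens[[i]], levels = vocab)))
  }
  dtm
}

toy_train <- build_dtm(c("free entry free", "call now"))
# Reuse the training vocabulary so the columns line up
toy_test <- build_dtm("free call unseen", vocab = colnames(toy_train))
print(toy_train)
print(toy_test)
```

In the toy test message, the word "unseen" is dropped because it never occurred in the training messages.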
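After the data has been prepared, the Naive Bayes classifier is trained. The fastNaiveBayes function is used instead of the naive_bayes function because it can train on large amounts of data more quickly. The laplace = 1 argument applies Laplace (add-one) smoothing, so that a word that never appears in one class during training does not force that class's probability to zero.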
# Implements a classifier using the created training data
nb_classifier <- fastNaiveBayes(x = train_x, y = train_y, laplace = 1)
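To show what a Naive Bayes classifier with Laplace smoothing computes on word counts, here is a minimal base R sketch of a multinomial model. It is written for illustration under that assumption and may differ from the event model that fastNaiveBayes selects internally; `train_multinomial_nb` and `predict_multinomial_nb` are hypothetical helpers.

```r
# Multinomial Naive Bayes sketch with Laplace smoothing
train_multinomial_nb <- function(x, y, laplace = 1) {
  y <- droplevels(factor(y))
  # Log prior: share of documents in each class
  log_prior <- setNames(log(as.numeric(table(y)) / length(y)), levels(y))
  # Total count of each word within each class (one row per class)
  word_counts <- rowsum(x, group = y)
  # Smoothed word probabilities; each row sums to one
  log_likelihood <- log((word_counts + laplace) /
                          (rowSums(word_counts) + laplace * ncol(x)))
  list(log_prior = log_prior, log_likelihood = log_likelihood)
}

predict_multinomial_nb <- function(model, newdata) {
  # Score = log prior + sum over words of count * log P(word | class)
  scores <- newdata %*% t(model$log_likelihood)
  scores <- sweep(scores, 2, model$log_prior, "+")
  colnames(scores)[max.col(scores, ties.method = "first")]
}

# Toy counts for three words across four messages (illustrative data)
toy_x <- matrix(c(3, 2, 0,
                  2, 1, 0,
                  0, 0, 2,
                  0, 1, 3),
                nrow = 4, byrow = TRUE,
                dimnames = list(NULL, c("free", "win", "lunch")))
toy_y <- c("spam", "spam", "ham", "ham")

toy_model <- train_multinomial_nb(toy_x, toy_y, laplace = 1)
toy_pred <- predict_multinomial_nb(toy_model, rbind(c(2, 1, 0), c(0, 0, 2)))
print(toy_pred)
```

Working with log probabilities avoids numerical underflow when many small word probabilities are multiplied together.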
The resulting model is then evaluated using the remaining test data. Based on the results below, the model achieved an accuracy of 98.4% on the test set. It correctly detected about 92% of the spam messages (sensitivity) and correctly classified about 99% of the non-spam messages (specificity).
# Splits test data into x/y - feature/response variables
test_x <- convert_to_dtm_matrix(test_set[["Message"]])
test_y <- test_set[["Type"]]
# Evaluates model using the test data
test_predictions <- predict(nb_classifier, newdata = test_x)
test_cf <- confusionMatrix(test_predictions, test_y)
test_cf
## Confusion Matrix and Statistics
##
## Reference
## Prediction spam ham
## spam 120 5
## ham 10 819
##
## Accuracy : 0.9843
## 95% CI : (0.9742, 0.9912)
## No Information Rate : 0.8637
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9321
##
## Mcnemar's Test P-Value : 0.3017
##
## Sensitivity : 0.9231
## Specificity : 0.9939
## Pos Pred Value : 0.9600
## Neg Pred Value : 0.9879
## Prevalence : 0.1363
## Detection Rate : 0.1258
## Detection Prevalence : 0.1310
## Balanced Accuracy : 0.9585
##
## 'Positive' Class : spam
##
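The headline figures can be recomputed directly from the four cells of the confusion matrix above, which makes it clear what each statistic measures. The sketch below uses only the counts printed in that output, with spam as the positive class; the variable names are illustrative.

```r
# Confusion matrix cells copied from the output above
cm <- matrix(c(120, 10, 5, 819), nrow = 2,
             dimnames = list(Prediction = c("spam", "ham"),
                             Reference  = c("spam", "ham")))

accuracy    <- sum(diag(cm)) / sum(cm)                  # all correct / all messages
sensitivity <- cm["spam", "spam"] / sum(cm[, "spam"])   # spam caught / actual spam
specificity <- cm["ham", "ham"] / sum(cm[, "ham"])      # ham kept / actual ham
precision   <- cm["spam", "spam"] / sum(cm["spam", ])   # true spam / flagged as spam

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 4))
```

Sensitivity and precision are the more informative figures here, since the high share of ham in the test set inflates accuracy.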
The aim of this project was to create a model for detecting spam in SMS messages. Several methods were applied to prepare the data for training and to reduce possible bias in the model.
Based on the results above, the project produced a model with over 98% accuracy on the test set while still detecting about 92% of the spam messages.