An email box flooded with unsolicited emails makes it easy for the account holder to miss an important message, defeating the purpose of having an email address for effective communication. These junk emails, which come from online marketing campaigns, online fraudsters and others, are the motivation for this model.
The goal of this project is to build a spam filter that can effectively categorise an incoming email or text message as either “spam” or “ham”. We will use a dataset from the repository of the Center for Machine Learning and Intelligent Systems at the University of California, Irvine.
This dataset consists of 5574 observations of 2 variables. The first variable is the content of the messages; the second is the target variable, the class to be predicted, which is either “spam” or “ham”. We will build this classifier using the text of the messages.
The dataset was downloaded from the repository at “https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip”.
# Building a spam filter using ML
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
# Download the archive only if it is not already present locally
if (!file.exists("smsspamcollection.zip"))
{
  download.file(url=url, destfile="smsspamcollection.zip", method="curl")
}
unzip("smsspamcollection.zip")
# Tab-separated file with no header row: V1 is the label, V2 the message text
data_text <- read.delim("SMSSpamCollection", sep="\t", header=F, colClasses="character", quote="")
str(data_text)
## 'data.frame': 5574 obs. of 2 variables:
## $ V1: chr "ham" "ham" "spam" "ham" ...
## $ V2: chr "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." "Ok lar... Joking wif u oni..." "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question("| __truncated__ "U dun say so early hor... U c already then say..." ...
head(data_text)
## V1
## 1 ham
## 2 ham
## 3 spam
## 4 ham
## 5 ham
## 6 spam
## V2
## 1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2 Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4 U dun say so early hor... U c already then say...
## 5 Nah I don't think he goes to usf, he lives around here though
## 6 FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
For easier identification of the columns, we rename V1 as Class and V2 as Text. We also convert the Class column from character strings to a factor, and check the proportion of ham to spam in our dataset.
colnames(data_text)
## [1] "V1" "V2"
colnames(data_text) <- c("Class", "Text")
colnames(data_text)
## [1] "Class" "Text"
data_text$Class <- factor(data_text$Class)
prop.table(table(data_text$Class))
##
## ham spam
## 0.8659849 0.1340151
Data often come from different sources and rarely arrive in a format the machine can process directly, so data cleaning is an important part of a data science project. In text mining, we need to convert the words to lowercase, remove stop words that add no meaning to the model, and so on.
# Cleaning the texts
library(tm)
## Loading required package: NLP
library(SnowballC)
corpus = VCorpus(VectorSource(data_text$Text))
as.character(corpus[[1]])
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)
as.character(corpus[[1]])
## [1] "go jurong point crazi avail bugi n great world la e buffet cine got amor wat"
In text mining, it is important to get a feel for the words that indicate whether a message will be regarded as spam or ham. What is the frequency of each word? Which word appears most often? To answer these questions, we create a DocumentTermMatrix to hold all these words.
dtm = DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 5574, terms: 6981)>>
## Non-/sparse entries: 43801/38868293
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
dtm = removeSparseTerms(dtm, 0.999)
dim(dtm)
## [1] 5574 1209
inspect(dtm[40:50, 10:15])
## <<DocumentTermMatrix (documents: 11, terms: 6)>>
## Non-/sparse entries: 0/66
## Sparsity : 100%
## Maximal term length: 7
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs activ actual add address admir adult
## 40 0 0 0 0 0 0
## 41 0 0 0 0 0 0
## 42 0 0 0 0 0 0
## 43 0 0 0 0 0 0
## 44 0 0 0 0 0 0
## 45 0 0 0 0 0 0
## 46 0 0 0 0 0 0
## 47 0 0 0 0 0 0
## 48 0 0 0 0 0 0
## 49 0 0 0 0 0 0
## 50 0 0 0 0 0 0
Rather than keeping raw counts, we binarise the entries of the DTM into “Yes”/“No” factors, since for this classification task we only care whether a word occurs in a message, not how often.
convert_count <- function(x) {
  # Any count above 0 becomes "Yes", otherwise "No"
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
  y
}
# Apply the convert_count function to get final training and testing DTMs
datasetNB <- apply(dtm, 2, convert_count)
dataset = as.data.frame(as.matrix(datasetNB))
We want to see which words appear frequently in the dataset. Given the number of words it contains, we look at terms that appear at least 60 times.
freq<- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
tail(freq, 10)
## vikki vodafon vote vri wherev wnt wwq yay yiju
## 6 6 6 6 6 6 6 6 6
## zed
## 6
findFreqTerms(dtm, lowfreq=60) # identify terms that appear at least 60 times
## [1] "alreadi" "also" "amp" "anyth" "around" "ask"
## [7] "award" "babe" "back" "buy" "call" "can"
## [13] "cant" "care" "cash" "chat" "claim" "come"
## [19] "contact" "cos" "custom" "day" "dear" "didnt"
## [25] "dont" "end" "even" "everi" "feel" "find"
## [31] "finish" "first" "free" "friend" "get" "give"
## [37] "good" "got" "great" "gud" "guy" "happi"
## [43] "help" "hey" "home" "hope" "ill" "ive"
## [49] "just" "keep" "know" "last" "later" "leav"
## [55] "let" "life" "like" "lol" "look" "lor"
## [61] "love" "ltgt" "make" "meet" "messag" "min"
## [67] "miss" "mobil" "morn" "msg" "much" "need"
## [73] "new" "next" "night" "nokia" "now" "number"
## [79] "one" "person" "phone" "pick" "place" "pleas"
## [85] "pls" "prize" "realli" "repli" "right" "said"
## [91] "say" "see" "send" "sent" "servic" "show"
## [97] "sleep" "smile" "someon" "someth" "sorri" "start"
## [103] "still" "stop" "sure" "take" "talk" "tell"
## [109] "text" "thank" "that" "thing" "think" "time"
## [115] "today" "tomorrow" "tone" "tonight" "tri" "txt"
## [121] "urgent" "use" "wait" "want" "wat" "watch"
## [127] "way" "week" "well" "went" "will" "win"
## [133] "wish" "won" "work" "yeah" "year" "yes"
We would like to plot the most frequent words in our dataset; to keep the plot readable, only words that appear more than 100 times are shown.
#Plot Word Frequencies
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
wf<- data.frame(word=names(freq), freq=freq)
head(wf)
## word freq
## call call 657
## now now 479
## get get 451
## can can 405
## will will 389
## just just 368
pp <- ggplot(subset(wf, freq>100), aes(x=reorder(word, -freq), y =freq)) +
geom_bar(stat = "identity") +
theme(axis.text.x=element_text(angle=45, hjust=1))
pp
From the plot, “call” appears the greatest number of times.
We can also present the word frequencies as a word cloud.
library("wordcloud")
## Loading required package: RColorBrewer
library("RColorBrewer")
set.seed(1234)
wordcloud(words = wf$word, freq = wf$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
The text data has been cleaned and is now ready to be joined to the response variable “Class” for the purpose of predictive analytics.
dataset$Class = data_text$Class
str(dataset$Class)
## Factor w/ 2 levels "ham","spam": 1 1 2 1 1 2 1 1 2 2 ...
The usual practice in Machine Learning is to split the dataset into a training set and a test set. The model is built on the training set and evaluated on the test set, which it has not been exposed to before.
To ensure that both samples are representative of the full dataset, we check the class proportions after the split.
set.seed(222)
split = sample(2,nrow(dataset),prob = c(0.75,0.25),replace = TRUE)
train_set = dataset[split == 1,]
test_set = dataset[split == 2,]
prop.table(table(train_set$Class))
##
## ham spam
## 0.8670327 0.1329673
prop.table(table(test_set$Class))
##
## ham spam
## 0.8628159 0.1371841
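The random split above happens to preserve the class balance well. If we wanted to guarantee that both splits preserve the ham/spam ratio, a stratified split is an alternative; here is a minimal sketch using caret::createDataPartition (caret is loaded later for the confusion matrices anyway; the object names are illustrative):
# Alternative (not used below): stratified split that preserves the class ratio
library(caret)
set.seed(222)
idx <- createDataPartition(dataset$Class, p = 0.75, list = FALSE)
train_strat <- dataset[idx, ]   # ~75% of rows, same ham/spam proportion
test_strat  <- dataset[-idx, ]  # remaining ~25%
prop.table(table(train_strat$Class))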
We will build our model with 3 different Machine Learning algorithms, Random Forest, Naive Bayes and Support Vector Machine, to decide which performs best.
Random Forest is an ensemble method of Machine Learning; here 300 decision trees are grown, and the mode of the individual trees' outputs is taken as the final prediction.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
rf_classifier = randomForest(x = train_set[-1210],
y = train_set$Class,
ntree = 300)
rf_classifier
##
## Call:
## randomForest(x = train_set[-1210], y = train_set$Class, ntree = 300)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 34
##
## OOB estimate of error rate: 2.7%
## Confusion matrix:
## ham spam class.error
## ham 3615 17 0.004680617
## spam 96 461 0.172351885
On the training set, the rf_classifier achieved an out-of-bag (OOB) error estimate of 2.7%. Ham messages were almost always classified correctly (class error of about 0.5%), while roughly 17% of spam messages were misclassified. Since the OOB estimate is computed on observations each tree did not see during training, it is a more realistic measure than raw training accuracy.
## 4.2.1.1 Making Predictions and Evaluating the Random Forest Classifier
We now evaluate the model on the test_set to see whether it can match its training performance on a set of data it has never been exposed to.
# Predicting the Test set results
rf_pred = predict(rf_classifier, newdata = test_set[-1210])
# Making the Confusion Matrix
library(caret)
## Loading required package: lattice
confusionMatrix(table(rf_pred,test_set$Class))
## Confusion Matrix and Statistics
##
##
## rf_pred ham spam
## ham 1191 36
## spam 4 154
##
## Accuracy : 0.9711
## 95% CI : (0.9609, 0.9793)
## No Information Rate : 0.8628
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8687
## Mcnemar's Test P-Value : 9.509e-07
##
## Sensitivity : 0.9967
## Specificity : 0.8105
## Pos Pred Value : 0.9707
## Neg Pred Value : 0.9747
## Prevalence : 0.8628
## Detection Rate : 0.8599
## Detection Prevalence : 0.8859
## Balanced Accuracy : 0.9036
##
## 'Positive' Class : ham
##
The Random Forest classifier (rf_classifier) performed well on the test set, with an accuracy of 97.11%. Still, its sensitivity (99.67% of ham correctly identified) is much higher than its specificity (81.05% of spam caught), and Random Forests can overfit, so we should not get too excited.
The Naive Bayes classifier is a Machine Learning model that applies Bayes' Theorem under the “naive” assumption that features are conditionally independent given the class. It is fast to train and simple to implement.
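To make the conditional-probability idea concrete, here is a small illustration (not part of the original analysis) of Bayes' rule for a single term, using the binarised dataset built earlier. The term “free” is an arbitrary choice, assumed to have survived the sparsity filter since it appears in the frequent-terms list above.
# Illustration only: P(spam | message contains "free") via Bayes' rule
p_spam <- mean(dataset$Class == "spam")                           # prior P(spam)
p_free_given_spam <- mean(dataset$free[dataset$Class == "spam"] == "Yes")
p_free <- mean(dataset$free == "Yes")                             # evidence P(free)
p_free_given_spam * p_spam / p_free                               # posterior P(spam | free)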
library(e1071)
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# Caveat 1: trControl and tuneLength are caret::train() arguments;
# e1071::naiveBayes quietly ignores them via its "..." argument.
# Caveat 2: train_set still contains the Class column, so the label itself
# is among the predictors, which will inflate the reported accuracy.
system.time( classifier_nb <- naiveBayes(train_set, train_set$Class, laplace = 1,
                                         trControl = control, tuneLength = 7) )
## user system elapsed
## 0.25 0.08 0.33
nb_pred = predict(classifier_nb, type = 'class', newdata = test_set)
confusionMatrix(nb_pred,test_set$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 1195 7
## spam 0 183
##
## Accuracy : 0.9949
## 95% CI : (0.9896, 0.998)
## No Information Rate : 0.8628
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9783
## Mcnemar's Test P-Value : 0.02334
##
## Sensitivity : 1.0000
## Specificity : 0.9632
## Pos Pred Value : 0.9942
## Neg Pred Value : 1.0000
## Prevalence : 0.8628
## Detection Rate : 0.8628
## Detection Prevalence : 0.8679
## Balanced Accuracy : 0.9816
##
## 'Positive' Class : ham
##
The Naive Bayes classifier also performed very well, achieving 99.49% accuracy on the test set, i.e. only 7 misclassifications out of 1385 test observations. The model has 100% sensitivity, the proportion of the positive class predicted as positive, and about 96.32% specificity, the proportion of the negative class predicted accurately, i.e. 183 out of 190 spam messages. One caveat: as noted above, the Class column was included among the predictors, so these numbers are likely optimistic.
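As a quick check on these definitions, the headline metrics can be recomputed by hand from the confusion matrix (a sketch using the objects created above):
# Recompute sensitivity and specificity directly from the table
tab <- table(nb_pred, test_set$Class)
sens <- tab["ham", "ham"] / sum(tab[, "ham"])     # 1195/1195 = 1.0000
spec <- tab["spam", "spam"] / sum(tab[, "spam"])  # 183/190  = 0.9632
c(sensitivity = sens, specificity = spec)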
The Support Vector Machine is another algorithm; it finds the hyperplane that best separates the two classes to be predicted, ham and spam in this case. SVMs can handle both linear and non-linear classification problems.
svm_classifier <- svm(Class~., data=train_set)
svm_classifier
##
## Call:
## svm(formula = Class ~ ., data = train_set)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.0008264463
##
## Number of Support Vectors: 1157
Our model employs a total of 1157 support vectors in building this classification model.
svm_pred = predict(svm_classifier,test_set)
confusionMatrix(svm_pred,test_set$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 1195 189
## spam 0 1
##
## Accuracy : 0.8635
## 95% CI : (0.8443, 0.8812)
## No Information Rate : 0.8628
## P-Value [Acc > NIR] : 0.4882
##
## Kappa : 0.009
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.000000
## Specificity : 0.005263
## Pos Pred Value : 0.863439
## Neg Pred Value : 1.000000
## Prevalence : 0.862816
## Detection Rate : 0.862816
## Detection Prevalence : 0.999278
## Balanced Accuracy : 0.502632
##
## 'Positive' Class : ham
##
The Support Vector Machine model performed badly on this dataset, doing barely better than always guessing “ham”. With an accuracy of 86.35% we might be tempted to think the performance is good, but that is essentially the no-information rate of 86.28%, and the specificity of about 0.5% (only 1 of 190 spam messages identified) shows the model is not doing well.
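One plausible remedy, sketched here but not run as part of this project, is to weight the minority class more heavily so the radial-kernel SVM is penalised for ignoring spam. The inverse-frequency weights below are an assumption, not a tuned choice.
# Hypothetical tweak: class weights to counter the ham/spam imbalance
wts <- c(ham = 1,
         spam = sum(train_set$Class == "ham") / sum(train_set$Class == "spam"))
svm_weighted <- svm(Class ~ ., data = train_set, class.weights = wts)
svm_w_pred <- predict(svm_weighted, test_set)
confusionMatrix(svm_w_pred, test_set$Class)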
The essence of building a spam classifier is for the model to effectively categorise an incoming email as either spam or ham, and a model is not doing well if it cannot handle both categories. As much as we can expect some errors in our predictions, we still expect our model to do a good job. Random Forest and Naive Bayes performed very well in this project.
This spam classifier was built purely for academic purposes, so suggestions on what to improve or what was not done properly are welcome.