An email box flooded with unsolicited emails makes it easy for the account holder to miss an important message, defeating the purpose of having an email address for effective communication. These junk emails, which come from online marketing campaigns, online fraudsters and others, are the motivation for this model.
The goal of this project is to build a spam filter that can effectively categorise an incoming email or text message as either “spam” or “ham”. We will use a dataset from the repository of the Center for Machine Learning and Intelligent Systems at the University of California, Irvine.
This dataset consists of 5574 observations of 2 variables. The first variable is the content of the messages; the second is the target variable, the class to be predicted, which is either “spam” or “ham”. We will build this classifier using the text of the messages.
The dataset was downloaded from the repository at “https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip”.
# Building a spam filter using ML
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
# Download the archive only if it is not already present locally
if (!file.exists("smsspamcollection.zip"))
{
  download.file(url=url, destfile="smsspamcollection.zip", method="curl")
}
unzip("smsspamcollection.zip")
# Tab-separated file with no header row: V1 is the label, V2 the message text
data_text <- read.delim("SMSSpamCollection", sep="\t", header=F, colClasses="character", quote="")
str(data_text)
## 'data.frame': 5574 obs. of 2 variables:
## $ V1: chr "ham" "ham" "spam" "ham" ...
## $ V2: chr "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." "Ok lar... Joking wif u oni..." "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question("| __truncated__ "U dun say so early hor... U c already then say..." ...
head(data_text)
## V1
## 1 ham
## 2 ham
## 3 spam
## 4 ham
## 5 ham
## 6 spam
## V2
## 1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2 Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4 U dun say so early hor... U c already then say...
## 5 Nah I don't think he goes to usf, he lives around here though
## 6 FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
For easier identification of the columns, we rename V1 as Class and V2 as Text. We also convert the Class column from character strings to a factor, and check the proportion of ham to spam in our dataset.
colnames(data_text)
## [1] "V1" "V2"
colnames(data_text) <- c("Class", "Text")
colnames(data_text)
## [1] "Class" "Text"
data_text$Class <- factor(data_text$Class)
prop.table(table(data_text$Class))
##
## ham spam
## 0.8659849 0.1340151
Data often come from different sources and rarely arrive in a format the machine can process directly, so data cleaning is an important part of a data science project. In text mining, we need to convert the words to lowercase, remove stop words that add no meaning to the model, and so on.
# Cleaning the texts
library(tm)
## Loading required package: NLP
library(SnowballC)
corpus = VCorpus(VectorSource(data_text$Text))
as.character(corpus[[1]])
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)
as.character(corpus[[1]])
## [1] "go jurong point crazi avail bugi n great world la e buffet cine got amor wat"
In text mining, it is important to get a feel for the words that indicate whether a message will be regarded as spam or ham. What is the frequency of each word? Which word appears most often? To answer these questions, we create a DocumentTermMatrix to hold all these words.
dtm = DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 5574, terms: 6981)>>
## Non-/sparse entries: 43801/38868293
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
dtm = removeSparseTerms(dtm, 0.999)
dim(dtm)
## [1] 5574 1209
inspect(dtm[40:50, 10:15])
## <<DocumentTermMatrix (documents: 11, terms: 6)>>
## Non-/sparse entries: 0/66
## Sparsity : 100%
## Maximal term length: 7
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs activ actual add address admir adult
## 40 0 0 0 0 0 0
## 41 0 0 0 0 0 0
## 42 0 0 0 0 0 0
## 43 0 0 0 0 0 0
## 44 0 0 0 0 0 0
## 45 0 0 0 0 0 0
## 46 0 0 0 0 0 0
## 47 0 0 0 0 0 0
## 48 0 0 0 0 0 0
## 49 0 0 0 0 0 0
## 50 0 0 0 0 0 0
Rather than keeping raw counts, we binarise the entries of the DTM into “Yes”/“No” factors, since for this classification task we only care whether a word occurs in a message, not how often.
convert_count <- function(x) {
  # Any count above 0 becomes "Yes", otherwise "No"
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
  y
}
# Apply the convert_count function to get final training and testing DTMs
datasetNB <- apply(dtm, 2, convert_count)
dataset = as.data.frame(as.matrix(datasetNB))
We want to see which words appear frequently in the dataset. Given the number of words it contains, we look at terms that appear at least 60 times.
freq<- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
tail(freq, 10)
## vikki vodafon vote vri wherev wnt wwq yay yiju
## 6 6 6 6 6 6 6 6 6
## zed
## 6
findFreqTerms(dtm, lowfreq=60) # identify terms that appear at least 60 times
## [1] "alreadi" "also" "amp" "anyth" "around" "ask"
## [7] "award" "babe" "back" "buy" "call" "can"
## [13] "cant" "care" "cash" "chat" "claim" "come"
## [19] "contact" "cos" "custom" "day" "dear" "didnt"
## [25] "dont" "end" "even" "everi" "feel" "find"
## [31] "finish" "first" "free" "friend" "get" "give"
## [37] "good" "got" "great" "gud" "guy" "happi"
## [43] "help" "hey" "home" "hope" "ill" "ive"
## [49] "just" "keep" "know" "last" "later" "leav"
## [55] "let" "life" "like" "lol" "look" "lor"
## [61] "love" "ltgt" "make" "meet" "messag" "min"
## [67] "miss" "mobil" "morn" "msg" "much" "need"
## [73] "new" "next" "night" "nokia" "now" "number"
## [79] "one" "person" "phone" "pick" "place" "pleas"
## [85] "pls" "prize" "realli" "repli" "right" "said"
## [91] "say" "see" "send" "sent" "servic" "show"
## [97] "sleep" "smile" "someon" "someth" "sorri" "start"
## [103] "still" "stop" "sure" "take" "talk" "tell"
## [109] "text" "thank" "that" "thing" "think" "time"
## [115] "today" "tomorrow" "tone" "tonight" "tri" "txt"
## [121] "urgent" "use" "wait" "want" "wat" "watch"
## [127] "way" "week" "well" "went" "will" "win"
## [133] "wish" "won" "work" "yeah" "year" "yes"
We would like to plot the most frequent words in our dataset; to keep the plot readable, only words that appear more than 100 times are shown.
#Plot Word Frequencies
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
wf<- data.frame(word=names(freq), freq=freq)
head(wf)
## word freq
## call call 657
## now now 479
## get get 451
## can can 405
## will will 389
## just just 368
pp <- ggplot(subset(wf, freq>100), aes(x=reorder(word, -freq), y =freq)) +
geom_bar(stat = "identity") +
theme(axis.text.x=element_text(angle=45, hjust=1))
pp
From the plot, “call” appears the greatest number of times.
We can also present the word frequencies as a word cloud.
library("wordcloud")
## Loading required package: RColorBrewer
library("RColorBrewer")
set.seed(1234)
wordcloud(words = wf$word, freq = wf$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
The text data has been cleaned and is now ready to be joined to the response variable “Class” for the purpose of predictive analytics.
dataset$Class = data_text$Class
str(dataset$Class)
## Factor w/ 2 levels "ham","spam": 1 1 2 1 1 2 1 1 2 2 ...
The usual practice in Machine Learning is to split the dataset into a training set and a test set. The model is built on the training set and evaluated on the test set, which it has not been exposed to before.
To ensure that both samples are representative of the full dataset, we check the class proportions after the split.
set.seed(222)
split = sample(2,nrow(dataset),prob = c(0.75,0.25),replace = TRUE)
train_set = dataset[split == 1,]
test_set = dataset[split == 2,]
prop.table(table(train_set$Class))
##
## ham spam
## 0.8670327 0.1329673
prop.table(table(test_set$Class))
##
## ham spam
## 0.8628159 0.1371841
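The random split above happens to preserve the class balance well. If we wanted to guarantee that both splits preserve the ham/spam ratio, a stratified split is an alternative; here is a minimal sketch using caret::createDataPartition (caret is loaded later for the confusion matrices anyway; the object names are illustrative):
# Alternative (not used below): stratified split that preserves the class ratio
library(caret)
set.seed(222)
idx <- createDataPartition(dataset$Class, p = 0.75, list = FALSE)
train_strat <- dataset[idx, ]   # ~75% of rows, same ham/spam proportion
test_strat  <- dataset[-idx, ]  # remaining ~25%
prop.table(table(train_strat$Class))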
We will build our model with 3 different Machine Learning algorithms, Random Forest, Naive Bayes and Support Vector Machine, to decide which performs best.
Random Forest is an ensemble method of Machine Learning; here 300 decision trees are grown, and the mode of the individual trees' outputs is taken as the final prediction.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
rf_classifier = randomForest(x = train_set[-1210],
y = train_set$Class,
ntree = 300)
rf_classifier
##
## Call:
## randomForest(x = train_set[-1210], y = train_set$Class, ntree = 300)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 34
##
## OOB estimate of error rate: 2.7%
## Confusion matrix:
## ham spam class.error
## ham 3615 17 0.004680617
## spam 96 461 0.172351885
On the training set, the rf_classifier achieved an out-of-bag (OOB) error estimate of 2.7%. Ham messages were almost always classified correctly (class error of about 0.5%), while roughly 17% of spam messages were misclassified. Since the OOB estimate is computed on observations each tree did not see during training, it is a more realistic measure than raw training accuracy.
## 4.2.1.1 Making Predictions and Evaluating the Random Forest Classifier
We now evaluate the model on the test_set to see whether it can match its training performance on a set of data it has never been exposed to.
# Predicting the Test set results
rf_pred = predict(rf_classifier, newdata = test_set[-1210])
# Making the Confusion Matrix
library(caret)
## Loading required package: lattice
confusionMatrix(table(rf_pred,test_set$Class))
## Confusion Matrix and Statistics
##
##
## rf_pred ham spam
## ham 1191 36
## spam 4 154
##
## Accuracy : 0.9711
## 95% CI : (0.9609, 0.9793)
## No Information Rate : 0.8628
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8687
## Mcnemar's Test P-Value : 9.509e-07
##
## Sensitivity : 0.9967
## Specificity : 0.8105
## Pos Pred Value : 0.9707
## Neg Pred Value : 0.9747
## Prevalence : 0.8628
## Detection Rate : 0.8599
## Detection Prevalence : 0.8859
## Balanced Accuracy : 0.9036
##
## 'Positive' Class : ham
##
The Random Forest classifier (rf_classifier) performed well on the test set, with an accuracy of 97.11%. Still, its sensitivity (99.67% of ham correctly identified) is much higher than its specificity (81.05% of spam caught), and Random Forests can overfit, so we should not get too excited.
The Naive Bayes classifier is a Machine Learning model that applies Bayes' Theorem under the “naive” assumption that features are conditionally independent given the class. It is fast to train and simple to implement.
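To make the conditional-probability idea concrete, here is a small illustration (not part of the original analysis) of Bayes' rule for a single term, using the binarised dataset built earlier. The term “free” is an arbitrary choice, assumed to have survived the sparsity filter since it appears in the frequent-terms list above.
# Illustration only: P(spam | message contains "free") via Bayes' rule
p_spam <- mean(dataset$Class == "spam")                           # prior P(spam)
p_free_given_spam <- mean(dataset$free[dataset$Class == "spam"] == "Yes")
p_free <- mean(dataset$free == "Yes")                             # evidence P(free)
p_free_given_spam * p_spam / p_free                               # posterior P(spam | free)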
library(e1071)
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# Caveat 1: trControl and tuneLength are caret::train() arguments;
# e1071::naiveBayes quietly ignores them via its "..." argument.
# Caveat 2: train_set still contains the Class column, so the label itself
# is among the predictors, which will inflate the reported accuracy.
system.time( classifier_nb <- naiveBayes(train_set, train_set$Class, laplace = 1,
                                         trControl = control, tuneLength = 7) )
## user system elapsed
## 0.25 0.08 0.33
nb_pred = predict(classifier_nb, type = 'class', newdata = test_set)
confusionMatrix(nb_pred,test_set$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 1195 7
## spam 0 183
##
## Accuracy : 0.9949
## 95% CI : (0.9896, 0.998)
## No Information Rate : 0.8628
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9783
## Mcnemar's Test P-Value : 0.02334
##
## Sensitivity : 1.0000
## Specificity : 0.9632
## Pos Pred Value : 0.9942
## Neg Pred Value : 1.0000
## Prevalence : 0.8628
## Detection Rate : 0.8628
## Detection Prevalence : 0.8679
## Balanced Accuracy : 0.9816
##
## 'Positive' Class : ham
##
The Naive Bayes classifier also performed very well, achieving 99.49% accuracy on the test set, i.e. only 7 misclassifications out of 1385 test observations. The model has 100% sensitivity, the proportion of the positive class predicted as positive, and about 96.32% specificity, the proportion of the negative class predicted accurately, i.e. 183 out of 190 spam messages. One caveat: as noted above, the Class column was included among the predictors, so these numbers are likely optimistic.
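As a quick check on these definitions, the headline metrics can be recomputed by hand from the confusion matrix (a sketch using the objects created above):
# Recompute sensitivity and specificity directly from the table
tab <- table(nb_pred, test_set$Class)
sens <- tab["ham", "ham"] / sum(tab[, "ham"])     # 1195/1195 = 1.0000
spec <- tab["spam", "spam"] / sum(tab[, "spam"])  # 183/190  = 0.9632
c(sensitivity = sens, specificity = spec)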
The Support Vector Machine is another algorithm; it finds the hyperplane that best separates the two classes to be predicted, ham and spam in this case. SVMs can handle both linear and non-linear classification problems.
svm_classifier <- svm(Class~., data=train_set)
svm_classifier
##
## Call:
## svm(formula = Class ~ ., data = train_set)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.0008264463
##
## Number of Support Vectors: 1157
Our model employs a total of 1157 support vectors in building this classification model.
svm_pred = predict(svm_classifier,test_set)
confusionMatrix(svm_pred,test_set$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 1195 189
## spam 0 1
##
## Accuracy : 0.8635
## 95% CI : (0.8443, 0.8812)
## No Information Rate : 0.8628
## P-Value [Acc > NIR] : 0.4882
##
## Kappa : 0.009
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.000000
## Specificity : 0.005263
## Pos Pred Value : 0.863439
## Neg Pred Value : 1.000000
## Prevalence : 0.862816
## Detection Rate : 0.862816
## Detection Prevalence : 0.999278
## Balanced Accuracy : 0.502632
##
## 'Positive' Class : ham
##
The Support Vector Machine model performed badly on this dataset, doing barely better than always guessing “ham”. With an accuracy of 86.35% we might be tempted to think the performance is good, but that is essentially the no-information rate of 86.28%, and the specificity of about 0.5% (only 1 of 190 spam messages identified) shows the model is not doing well.
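One plausible remedy, sketched here but not run as part of this project, is to weight the minority class more heavily so the radial-kernel SVM is penalised for ignoring spam. The inverse-frequency weights below are an assumption, not a tuned choice.
# Hypothetical tweak: class weights to counter the ham/spam imbalance
wts <- c(ham = 1,
         spam = sum(train_set$Class == "ham") / sum(train_set$Class == "spam"))
svm_weighted <- svm(Class ~ ., data = train_set, class.weights = wts)
svm_w_pred <- predict(svm_weighted, test_set)
confusionMatrix(svm_w_pred, test_set$Class)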
The essence of building a spam classifier is for the model to effectively categorise an incoming email as either spam or ham, and a model is not doing well if it cannot handle both categories. As much as we can expect some errors in our predictions, we still expect our model to do a good job. Random Forest and Naive Bayes performed very well in this project.
This spam classifier was built purely for academic purposes, so suggestions on what to improve or what was not done properly are welcome.