Email Ham And Spam Classifier

Problem Scoping And Diagnosis

An inbox flooded with unsolicited emails makes it easy for the account holder to miss an important message, which defeats the purpose of having an email address for effective communication. Junk emails from online marketing campaigns and online fraudsters, among others, are the motivation for this model.

Objectives And Goals Of The Project

The goal of this project is to build a spam filter that can effectively categorise an incoming email or text message as either spam (unsolicited) or ham (legitimate).

Dataset

We will use a dataset from the dataset repository of the Center for Machine Learning and Intelligent Systems at the University of California, Irvine.

Dataset Description

This dataset consists of 5574 observations of 2 variables. The first variable is the target variable, the class to be predicted, which is either “spam” or “ham”. The second variable is the text of the message. We will build the classifier from this message text.

Import Libraries

library(ggplot2)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(e1071)
library(caret)

Import Data

The dataset was imported from the repository of the Center for Machine Learning and Intelligent Systems at the University of California, Irvine.

emailData <- read.delim("SMSSpamCollection",sep = "\t",header = FALSE,colClasses = "character",quote = "")

Structure And Summary Of The Data

head(emailData)

##     V1
## 1  ham
## 2  ham
## 3 spam
## 4  ham
## 5  ham
## 6 spam
##                                                                                                                                                            V2
## 1                                             Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2                                                                                                                               Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4                                                                                                           U dun say so early hor... U c already then say...
## 5                                                                                               Nah I don't think he goes to usf, he lives around here though
## 6         FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

str(emailData)

## 'data.frame':    5574 obs. of  2 variables:
##  $ V1: chr  "ham" "ham" "spam" "ham" ...
##  $ V2: chr  "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." "Ok lar... Joking wif u oni..." "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question("| __truncated__ "U dun say so early hor... U c already then say..." ...

summary(emailData)

##       V1                 V2           
##  Length:5574        Length:5574       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character

Renaming The Columns For Easy Identification

For easy identification of the columns, we rename V1 as Class and V2 as Text. We also convert the Class column from character strings to a factor, and we look at the proportion of ham to spam in the dataset.

colnames(emailData) <- c("Class","Text")
head(emailData)

##   Class
## 1   ham
## 2   ham
## 3  spam
## 4   ham
## 5   ham
## 6  spam
##                                                                                                                                                          Text
## 1                                             Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2                                                                                                                               Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4                                                                                                           U dun say so early hor... U c already then say...
## 5                                                                                               Nah I don't think he goes to usf, he lives around here though
## 6         FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

Convert Target Variable Into Factor

emailData$Class <- factor(emailData$Class)
propTable <- prop.table(table(emailData$Class))
barplot(propTable,col = c("lightblue","salmon"),horiz = TRUE,ylab = "Ham And Spam",xlab = "Proportion",main = "Proportion Of Spam And Ham Emails")

Data Cleaning

Data often come from different sources and, most of the time, do not arrive in a format the machine can process directly. Hence, data cleaning is an important part of a data science project. In text mining, we need to convert the words to lowercase, remove numbers and punctuation, remove stop words that add no meaning to the model, reduce words to their stems, etc.

corpus <- VCorpus(VectorSource(emailData$Text))
corpus <- tm_map(corpus,content_transformer(tolower))
corpus <- tm_map(corpus,removeNumbers)
corpus <- tm_map(corpus,removePunctuation)
corpus <- tm_map(corpus,removeWords,stopwords("english"))
corpus <- tm_map(corpus,stemDocument)
corpus <- tm_map(corpus,stripWhitespace)
as.character(corpus[[1]])

## [1] "go jurong point crazi avail bugi n great world la e buffet cine got amor wat"

as.character(corpus[[2]])

## [1] "ok lar joke wif u oni"

as.character(corpus[[3]])

## [1] "free entri wkli comp win fa cup final tkts st may text fa receiv entri questionstd txt ratetc appli over"

as.character(corpus[[4]])

## [1] "u dun say earli hor u c alreadi say"

as.character(corpus[[5]])

## [1] "nah dont think goe usf live around though"
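To make the effect of the individual cleaning steps easier to see, here is a minimal base R sketch that approximates three of them (lowercasing, number removal, punctuation removal) plus whitespace collapsing on the start of the third message. It is only an illustration: the `gsub()` patterns are assumptions chosen to mimic the behaviour, not the internals of the `tm` functions, and stop-word removal and stemming are left out.

```r
# First sentence of the third message in the dataset
msg <- "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005."

step1 <- tolower(msg)                      # lowercase everything
step2 <- gsub("[[:digit:]]+", "", step1)   # drop digits: "21st" becomes "st"
step3 <- gsub("[[:punct:]]+", "", step2)   # drop punctuation
step4 <- gsub("\\s+", " ", trimws(step3))  # trim and collapse repeated whitespace

step4
# [1] "free entry in a wkly comp to win fa cup final tkts st may"
```

This explains why fragments such as "st" appear in the cleaned corpus: they are what is left of tokens like "21st" once the digits are removed.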

Creating The Bag Of Words For The Model

In text mining, it is important to get a feel for the words that indicate whether a text message will be regarded as spam or ham. What is the frequency of each of these words? Which word appears the most? In order to answer these questions, we create a DocumentTermMatrix that holds the count of every word in every message.

dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm,0.999)
dtm

## <<DocumentTermMatrix (documents: 5574, terms: 1209)>>
## Non-/sparse entries: 34521/6704445
## Sparsity           : 99%
## Maximal term length: 19
## Weighting          : term frequency (tf)

dim(dtm)

## [1] 5574 1209
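The structure of a document-term matrix can be sketched in base R on a few toy messages. The messages and all object names below are hypothetical; the real matrix above is built by `tm` and stored in a sparse format, which this sketch does not reproduce.

```r
# Toy messages (hypothetical), already cleaned and stemmed
docs <- c("free entri win cup", "ok lar joke", "free call now call")

# Split each message into tokens on whitespace
tokens <- strsplit(docs, " +")

# Vocabulary: every distinct term, sorted alphabetically
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per term; each cell holds a term count
toy_dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
rownames(toy_dtm) <- paste0("doc", seq_along(docs))

toy_dtm
colSums(toy_dtm)  # total frequency of each term across all documents
```

Most cells are zero even in this tiny example, which is why the real matrix reports 99% sparsity. `removeSparseTerms(dtm, 0.999)` drops the terms that are absent from more than 99.9% of the documents, which cuts the vocabulary down to the 1209 terms shown above.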

Converting The Word Frequencies To Yes And No Labels

convert <- function(x){

  y <- ifelse(x>0,1,0)
  y <- factor(y,levels = c(0,1),labels = c("No","Yes"))
  y
}

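# apply() returns a character matrix of "No"/"Yes" values, one column per term
datasetNB <- apply(dtm,2,convert)
# stringsAsFactors = TRUE keeps every column a factor on R >= 4.0 as well
dataset <- as.data.frame(as.matrix(datasetNB), stringsAsFactors = TRUE)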
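The conversion from counts to presence/absence labels can be checked on a small matrix. This is a sketch under stated assumptions: the count matrix and the names `counts`, `to_yes_no`, `labels` and `toy` are hypothetical. Unlike the pipeline above, the factor conversion is done after `apply()`, because `apply()` returns a plain character matrix even when the function it calls returns a factor.

```r
# Toy count matrix (hypothetical): 3 documents x 2 terms
counts <- matrix(c(0, 2, 1,
                   3, 0, 0),
                 nrow = 3,
                 dimnames = list(paste0("doc", 1:3), c("call", "free")))

# Same rule as convert() above: any positive count becomes "Yes"
to_yes_no <- function(x) ifelse(x > 0, "Yes", "No")

# apply() yields a character matrix, so each column is turned into a
# factor afterwards, with both levels always present
labels <- apply(counts, 2, to_yes_no)
toy <- as.data.frame(labels, stringsAsFactors = FALSE)
toy[] <- lapply(toy, factor, levels = c("No", "Yes"))

str(toy)
```

Fixing the levels explicitly means a term that never occurs in a given subset still has both "No" and "Yes" as possible values.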

Building Word Frequency

We want to find the words that appear most frequently in the dataset. Because the vocabulary is large, we sort the terms by their total count so that the most and least frequent ones can be inspected.

freq <- sort(colSums(as.matrix(dtm)),decreasing = TRUE)
tail(freq,10)

##   vikki vodafon    vote     vri  wherev     wnt     wwq     yay    yiju 
##       6       6       6       6       6       6       6       6       6 
##     zed 
##       6

Plotting Word Frequency

We would like to plot the words that appear more than 200 times in our dataset.

wf <- data.frame(word=names(freq),freq=freq)
head(wf)

##      word freq
## call call  657
## now   now  479
## get   get  451
## can   can  405
## will will  389
## just just  368

ggplot(subset(wf,freq>200),aes(x=reorder(word,-freq),y=freq,fill=word))+
  geom_bar(stat = "identity")+
  theme(axis.text.x = element_text(angle = 50,hjust = 1))+xlab("Words")+ylab("Frequencies")+ggtitle("Word Frequencies")

Building Word Cloud

The word frequencies can also be presented as a word cloud.
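No code is shown for this step, so the following is only a sketch of how such a cloud could be drawn with the `wordcloud` and `RColorBrewer` packages loaded at the top. The vector `toy_freq` is a stand-in holding the six most frequent terms from the `head(wf)` output above. The seed, the palette and the argument values are assumptions, not the author's settings.

```r
library(wordcloud)     # wordcloud()
library(RColorBrewer)  # brewer.pal()

# Stand-in for the full `freq` vector: the six most frequent terms shown above
toy_freq <- c(call = 657, now = 479, get = 451, can = 405, will = 389, just = 368)

set.seed(123)  # the layout is random; a fixed seed makes the plot repeatable
wordcloud(words = names(toy_freq),
          freq = toy_freq,
          min.freq = 1,          # draw every word in this small example
          random.order = FALSE,  # plot the most frequent words first
          colors = brewer.pal(8, "Dark2"))
```

With the full data, passing `names(freq)` and `freq` and raising `min.freq` would limit the cloud to the more common terms.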

Adding the Class variable to the Dataset

The text data has been cleaned and is now ready to be combined with the response variable “Class” for the purpose of predictive analytics.

dataset$Class <- emailData$Class
str(dataset$Class)

##  Factor w/ 2 levels "ham","spam": 1 1 2 1 1 2 1 1 2 2 ...

Data Modeling

The usual practice in machine learning is to split the dataset into a training set and a test set. The model is built on the training set and evaluated on the test set, which it has not been exposed to before. To ensure that both samples are representative of the full dataset, it is worth checking the proportions produced by the split.

splitData <- sample(2,nrow(dataset),prob = c(0.80,0.20),replace = TRUE)
trainData <- dataset[splitData==1,]
testData <- dataset[splitData==2,]
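One way to check the split is to compare the share of rows and the class balance in each subset. The sketch below uses hypothetical labels with an imbalanced ham/spam mix and an arbitrary seed. The names `toy`, `split_id`, `train` and `test` are illustrative only.

```r
set.seed(42)  # arbitrary seed, used only so this sketch is repeatable

# Hypothetical labels: 850 ham and 150 spam
toy <- data.frame(Class = factor(rep(c("ham", "spam"), times = c(850, 150))))

# Same assignment rule as above: each row goes to set 1 (train) with
# probability 0.8 and to set 2 (test) with probability 0.2
split_id <- sample(2, nrow(toy), prob = c(0.80, 0.20), replace = TRUE)
train <- toy[split_id == 1, , drop = FALSE]
test  <- toy[split_id == 2, , drop = FALSE]

# Share of rows in each set, then class balance within each set
c(train = nrow(train), test = nrow(test)) / nrow(toy)
prop.table(table(train$Class))
prop.table(table(test$Class))
```

Because each row is assigned independently, the set sizes are only approximately 80/20 and the class balance is not guaranteed to match exactly. A stratified split, for instance with `createDataPartition()` from `caret`, is an alternative when the balance matters.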

Model Fitting

We use the Naive Bayes classifier. It applies Bayes’ theorem to estimate the probability of each class given the words in a message, under the “naive” assumption that the words are conditionally independent of one another given the class. It is fast to train and simple to implement.
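The calculation behind the classifier can be worked through by hand for a single word. The counts below are hypothetical, and the smoothing rule (add `laplace` to every cell, and `laplace` times the number of levels to every class total) is the usual add-one scheme, written out here as an assumption rather than taken from the package source.

```r
# Hypothetical training counts: number of messages in each class in
# which the word "free" is absent ("No") or present ("Yes")
counts <- matrix(c(990, 10,
                   120, 30),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Class = c("ham", "spam"), free = c("No", "Yes")))

laplace <- 1

# Class priors P(class), estimated from the class totals
prior <- rowSums(counts) / sum(counts)

# Smoothed likelihoods P(free | class)
likelihood <- (counts + laplace) / (rowSums(counts) + laplace * ncol(counts))

# Bayes' theorem for a message that contains "free": the posterior is
# proportional to prior * likelihood, then normalised to sum to 1
unnormalised <- prior * likelihood[, "Yes"]
posterior <- unnormalised / sum(unnormalised)

round(posterior, 3)
```

Although ham is far more common overall, the word occurs in a much larger share of spam messages, so the posterior favours spam. With many words, the naive independence assumption lets the per-word likelihoods simply be multiplied together.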

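# naiveBayes() from e1071 has no trControl/tuneLength arguments (they belong to caret::train), so they are dropped.
# Class is excluded from the predictors to avoid leaking the label; the results printed below came from a fit that included it, so they are optimistic.
predictors <- trainData[, names(trainData) != "Class"]
model <- naiveBayes(predictors,trainData$Class,laplace = 1)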

Prediction And Evaluating The Naive Bayes Classifier

pred <- predict(model,type = "class",newdata = testData)
confusionMatrix(pred,testData$Class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  924    2
##       spam   0  159
##                                           
##                Accuracy : 0.9982          
##                  95% CI : (0.9934, 0.9998)
##     No Information Rate : 0.8516          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9927          
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9876          
##          Pos Pred Value : 0.9978          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.8516          
##          Detection Rate : 0.8516          
##    Detection Prevalence : 0.8535          
##       Balanced Accuracy : 0.9938          
##                                           
##        'Positive' Class : ham             
##

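On this test split the model classified 1,083 of 1,085 messages correctly, an accuracy of 99.8%. The only errors were two spam messages labelled as ham; no ham message was flagged as spam.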
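The headline statistics can be recomputed directly from the four cells of the confusion matrix reported above, which shows where each number comes from. The object names are illustrative, and "ham" is treated as the positive class, as in the `caret` output.

```r
# Confusion matrix reported above (rows = prediction, columns = reference)
cm <- matrix(c(924, 2,
               0, 159),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("ham", "spam"),
                             Reference  = c("ham", "spam")))

accuracy    <- sum(diag(cm)) / sum(cm)                  # correct / all
sensitivity <- cm["ham", "ham"]   / sum(cm[, "ham"])    # true ham predicted as ham
specificity <- cm["spam", "spam"] / sum(cm[, "spam"])   # true spam predicted as spam
ppv         <- cm["ham", "ham"]   / sum(cm["ham", ])    # predicted ham that is ham

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv), 4)
#    accuracy sensitivity specificity         ppv 
#      0.9982      1.0000      0.9876      0.9978
```

Read from the spam side, the specificity of 0.9876 means that 159 of the 161 spam messages in the test set were caught.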