The fact that an email box can be flooded with unsolicited emails makes it possible for the account holder to miss an important message, thereby defeating the purpose of having an email address for effective communication. Junk emails from online marketing campaigns and online fraudsters, among others, are one of the motivations for this model.
The goal of this project is to build a Spam Filter that can effectively categorise an incoming mail or text message as either Spam or Ham.
We will use a dataset from the dataset repository of the Center for Machine Learning and Intelligent Systems at the University of California, Irvine.
This dataset consists of 5574 observations of 2 variables. The first variable is the target variable, which is the class to be predicted, and the second variable is the content of the message. The target variable is either “spam” or “ham”. We will build the classifier from the text of the messages.
library(ggplot2)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(e1071)
library(caret)
The dataset was imported from the repository of the Center for Machine Learning and Intelligent Systems at the University of California, Irvine.
emailData <- read.delim("SMSSpamCollection",sep = "\t",header = FALSE,colClasses = "character",quote = "")
head(emailData)
## V1
## 1 ham
## 2 ham
## 3 spam
## 4 ham
## 5 ham
## 6 spam
## V2
## 1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2 Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4 U dun say so early hor... U c already then say...
## 5 Nah I don't think he goes to usf, he lives around here though
## 6 FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
str(emailData)
## 'data.frame': 5574 obs. of 2 variables:
## $ V1: chr "ham" "ham" "spam" "ham" ...
## $ V2: chr "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." "Ok lar... Joking wif u oni..." "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question("| __truncated__ "U dun say so early hor... U c already then say..." ...
summary(emailData)
## V1 V2
## Length:5574 Length:5574
## Class :character Class :character
## Mode :character Mode :character
For easy identification of the columns, we rename V1 as Class and V2 as Text. We also convert the Class column from character strings to a factor, and we check the proportion of ham to spam in our dataset.
colnames(emailData) <- c("Class","Text")
head(emailData)
## Class
## 1 ham
## 2 ham
## 3 spam
## 4 ham
## 5 ham
## 6 spam
## Text
## 1 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
## 2 Ok lar... Joking wif u oni...
## 3 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
## 4 U dun say so early hor... U c already then say...
## 5 Nah I don't think he goes to usf, he lives around here though
## 6 FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
emailData$Class <- factor(emailData$Class)
propTable <- prop.table(table(emailData$Class))
barplot(propTable,col = c("lightblue","salmon"),horiz = TRUE,ylab = "Ham And Spam",xlab = "Frequency",main = "Proportion Of Spam And Ham Emails")
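The class proportions also give a useful baseline: a model that always predicts the most common class is right exactly as often as that class occurs. A minimal sketch on a made-up label vector (the values in `labels` are hypothetical and are not the SMS data; `classShare` and `baselineAccuracy` are illustrative names):

```r
# Hypothetical labels for illustration only; not the SMS dataset.
labels <- factor(c("ham", "ham", "ham", "spam", "ham", "spam", "ham", "ham"))

# Share of each class, computed the same way as propTable above.
classShare <- prop.table(table(labels))

# Accuracy of a trivial model that always predicts the most common class.
baselineAccuracy <- max(classShare)

print(classShare)
print(baselineAccuracy)
```

Any classifier built on imbalanced data should be judged against this baseline rather than against 50%.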
Data often come from different sources and, most of the time, do not arrive in a format that the machine can process directly. Hence, data cleaning is an important aspect of a data science project. In text mining, we need to put the words in lowercase, remove numbers and punctuation, remove stop words that do not add any meaning to the model, stem the remaining words, and strip extra whitespace.
corpus <- VCorpus(VectorSource(emailData$Text))
corpus <- tm_map(corpus,content_transformer(tolower))
corpus <- tm_map(corpus,removeNumbers)
corpus <- tm_map(corpus,removePunctuation)
corpus <- tm_map(corpus,removeWords,stopwords("english"))
corpus <- tm_map(corpus,stemDocument)
corpus <- tm_map(corpus,stripWhitespace)
as.character(corpus[[1]])
## [1] "go jurong point crazi avail bugi n great world la e buffet cine got amor wat"
as.character(corpus[[2]])
## [1] "ok lar joke wif u oni"
as.character(corpus[[3]])
## [1] "free entri wkli comp win fa cup final tkts st may text fa receiv entri questionstd txt ratetc appli over"
as.character(corpus[[4]])
## [1] "u dun say earli hor u c alreadi say"
as.character(corpus[[5]])
## [1] "nah dont think goe usf live around though"
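To see what each cleaning step does to a single message, the pipeline can be approximated in base R. This is only a sketch: it skips stemming, and `stopList` is a tiny hypothetical stop-word list rather than the full `stopwords("english")` list used above. `cleanText` is an illustrative helper name.

```r
# Base-R approximation of the cleaning steps above (no stemming).
# `stopList` is a small hypothetical list, not tm's stopwords("english").
stopList <- c("i", "he", "to", "the", "a", "in", "so", "until", "only", "there")

# Cleans ONE message (a character vector of length 1).
cleanText <- function(x, stopWords = stopList) {
  x <- tolower(x)                              # lowercase
  x <- gsub("[[:digit:]]+", "", x)             # remove numbers
  x <- gsub("[[:punct:]]+", "", x)             # remove punctuation (no space inserted)
  words <- strsplit(x, "[[:space:]]+")[[1]]    # split on runs of whitespace
  words <- words[nzchar(words) & !(words %in% stopWords)]  # drop stop words
  paste(words, collapse = " ")                 # rejoin with single spaces
}

print(cleanText("Ok lar... Joking wif u oni..."))
print(cleanText("Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005."))
```

Because punctuation is deleted rather than replaced by a space, words separated only by punctuation get glued together, which is why tokens such as "questionstd" appear in the cleaned corpus.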
In text mining, it is important to get a feel for the words that indicate whether a text message will be regarded as spam or ham. What is the frequency of each of these words? Which word appears the most? To answer these questions, we create a DocumentTermMatrix that holds the count of every word in every message.
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm,0.999)
dtm
## <<DocumentTermMatrix (documents: 5574, terms: 1209)>>
## Non-/sparse entries: 34521/6704445
## Sparsity : 99%
## Maximal term length: 19
## Weighting : term frequency (tf)
dim(dtm)
## [1] 5574 1209
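To make the structure concrete, here is a small hand-built sketch of what a document-term matrix holds: one row per document, one column per term, and counts in the cells. It also works through the threshold implied by `removeSparseTerms(dtm, 0.999)`, under the assumption that tm keeps a term only when it occurs in more than N × (1 − sparse) documents. The toy documents and the names `docs`, `vocab`, `toyDtm` and `minDocs` are illustrative.

```r
# Three already-cleaned toy documents (hypothetical, for illustration).
docs <- c("free entri win", "ok lar joke", "free call now free")

# Tokenise, collect the vocabulary, and count each term per document.
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))
toyDtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
rownames(toyDtm) <- seq_along(docs)
print(toyDtm)

# Assumed rule: a term survives if it occurs in MORE than
# nDocs * (1 - sparse) documents. With 5574 documents and sparse = 0.999
# that bound is 5.574, so the smallest qualifying whole number follows.
nDocs   <- 5574
minDocs <- floor(nDocs * (1 - 0.999)) + 1
print(minDocs)
```

With 5574 documents this works out to at least 6 documents per retained term.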
convert <- function(x){
  y <- ifelse(x>0,1,0)
  y <- factor(y,levels = c(0,1),labels = c("No","Yes"))
  y
}
datasetNB <- apply(dtm,2,convert)
dataset <- as.data.frame(as.matrix(datasetNB))
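The conversion above replaces word counts with presence/absence indicators, so each word becomes a categorical "Yes"/"No" feature. A minimal sketch of the same idea on a tiny made-up count matrix (`toyCounts`, `toIndicator`, `indicators` and `indicatorDf` are illustrative names):

```r
# Tiny hypothetical count matrix: 3 messages, 2 terms.
toyCounts <- matrix(c(0, 2, 1,
                      1, 0, 0),
                    nrow = 3,
                    dimnames = list(NULL, c("free", "call")))

# Same logic as convert(): any positive count becomes "Yes", zero becomes "No".
toIndicator <- function(x) {
  factor(ifelse(x > 0, 1, 0), levels = c(0, 1), labels = c("No", "Yes"))
}

# apply() collapses the factors it gets back into a character matrix.
indicators <- apply(toyCounts, 2, toIndicator)
print(indicators)

# Asking for stringsAsFactors = TRUE turns the columns back into factors,
# whatever the default of the installed R version is.
indicatorDf <- as.data.frame(indicators, stringsAsFactors = TRUE)
str(indicatorDf)
```

Whether the columns end up as character or factor after `as.data.frame()` depends on the R version's default, so it can be worth setting `stringsAsFactors` explicitly.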
We want to find the words that appear most frequently in the dataset. Due to the number of words in the dataset, we are keeping words that appeared more than 60 times.
freq <- sort(colSums(as.matrix(dtm)),decreasing = TRUE)
tail(freq,10)
## vikki vodafon vote vri wherev wnt wwq yay yiju
## 6 6 6 6 6 6 6 6 6
## zed
## 6
We would like to plot the words that appeared more than 200 times in our dataset.
wf <- data.frame(word=names(freq),freq=freq)
head(wf)
## word freq
## call call 657
## now now 479
## get get 451
## can can 405
## will will 389
## just just 368
ggplot(subset(wf,freq>200),aes(x=reorder(word,-freq),y=freq,fill=word))+
geom_bar(stat = "identity")+
theme(axis.text.x = element_text(angle = 50,hjust = 1))+xlab("Words")+ylab("Frequencies")+ggtitle("Word Frequencies")
Presenting the word frequency as a word cloud.
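A word cloud of this kind could be drawn along the following lines. This is a sketch under assumptions: the `min.freq = 60` cut-off is taken from the text, while the palette, the seed and the `random.order` setting are arbitrary choices. `toyFreq` reuses the six most frequent words shown earlier; in the full analysis the `freq` vector would take its place.

```r
# The six most frequent words reported above, used here as a small stand-in for `freq`.
toyFreq <- c(call = 657, now = 479, get = 451, can = 405, will = 389, just = 368)

# Only draw if both packages are installed.
if (requireNamespace("wordcloud", quietly = TRUE) &&
    requireNamespace("RColorBrewer", quietly = TRUE)) {
  set.seed(1234)  # word placement is random; the seed value is arbitrary
  wordcloud::wordcloud(words = names(toyFreq), freq = toyFreq,
                       min.freq = 60,          # assumed cut-off from the text
                       random.order = FALSE,   # most frequent words in the centre
                       colors = RColorBrewer::brewer.pal(8, "Dark2"))
}
```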
The text data has been cleaned and is now ready to be combined with the response variable “Class” for the purpose of predictive analytics.
dataset$Class <- emailData$Class
str(dataset$Class)
## Factor w/ 2 levels "ham","spam": 1 1 2 1 1 2 1 1 2 2 ...
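Because Class now lives in the same data frame as the word indicators, a model that is handed the whole data frame as its predictor set would also see the label it is supposed to predict. A minimal sketch, on a made-up data frame, of keeping the two apart before modelling (`toyData`, `predictorCols`, `x` and `y` are illustrative names):

```r
# Toy stand-in for `dataset`: two word indicators plus the label (hypothetical values).
toyData <- data.frame(
  free  = factor(c("Yes", "No", "No", "Yes")),
  call  = factor(c("Yes", "No", "Yes", "No")),
  Class = factor(c("spam", "ham", "ham", "spam"))
)

# Keep every column except the label as predictors.
predictorCols <- setdiff(names(toyData), "Class")
x <- toyData[, predictorCols, drop = FALSE]
y <- toyData$Class

print(names(x))
print(levels(y))
```

`x` and `y` can then be passed as the predictor and response arguments of a classifier, and the same column selection applied to the data used for prediction.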
The usual practice in Machine Learning is to split the dataset into a training set and a test set. The model is built on the training set and evaluated on the test set, which the model has not been exposed to before. To ensure that both the training and the test samples are a true representation of the dataset, we check the proportion of the data split.
splitData <- sample(2,nrow(dataset),prob = c(0.80,0.20),replace = TRUE)
trainData <- dataset[splitData==1,]
testData <- dataset[splitData==2,]
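One way to check the split is to tabulate how many rows land in each part and how the classes are distributed within each part. The sketch below does this on simulated labels; the label vector, the seed and the names `toyClass` and `splitIdx` are assumptions for illustration, not the SMS data.

```r
set.seed(123)  # arbitrary seed so the random split can be reproduced

# Hypothetical, imbalanced labels (mostly ham), simulated for illustration.
toyClass <- factor(sample(c("ham", "spam"), 1000, replace = TRUE,
                          prob = c(0.87, 0.13)))

# Same splitting idiom as above: 1 = training row, 2 = test row.
splitIdx <- sample(2, length(toyClass), prob = c(0.80, 0.20), replace = TRUE)

# Share of rows in each part, then the class mix within each part.
print(prop.table(table(splitIdx)))
print(prop.table(table(toyClass[splitIdx == 1])))
print(prop.table(table(toyClass[splitIdx == 2])))
```

Since `sample()` draws each row independently, the split is only approximately 80/20, and the class mix in the two parts is similar but not identical. Setting a seed before the split makes the result repeatable.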
We are using the Naive Bayes Machine Learning model. The Naive Bayes classifier is based on Bayes’ Theorem of conditional probability, together with the “naive” assumption that the features are independent of one another given the class. It is fast and easy to train.
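To illustrate the idea, the posterior for a message can be computed by hand: each class is scored as its prior probability times the product of the per-word likelihoods, and the scores are then normalised. The sketch below uses a made-up six-message training set and add-one (Laplace) smoothing. It is a hand calculation for intuition, not a reproduction of any particular library's internals, and `train`, `condProb` and `score` are illustrative names.

```r
# Hypothetical training data: presence ("Yes"/"No") of two words per message.
train <- data.frame(
  free  = c("Yes", "Yes", "No", "No",  "No", "Yes"),
  call  = c("Yes", "No",  "No", "Yes", "No", "Yes"),
  Class = c("spam", "spam", "ham", "ham", "ham", "spam")
)

# P(word = value | class) with Laplace smoothing over the two levels Yes/No.
condProb <- function(word, value, cls, laplace = 1) {
  inClass <- train$Class == cls
  (sum(train[[word]][inClass] == value) + laplace) / (sum(inClass) + 2 * laplace)
}

# Unnormalised score: class prior times the product of per-word likelihoods.
score <- function(cls, free, call) {
  mean(train$Class == cls) * condProb("free", free, cls) * condProb("call", call, cls)
}

# Posterior for a new message that contains "free" but not "call".
scores    <- c(ham = score("ham", "Yes", "No"), spam = score("spam", "Yes", "No"))
posterior <- scores / sum(scores)
print(round(posterior, 3))
```

Smoothing keeps a word that never occurred in one class from forcing that class's score to zero.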
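# trainControl() configures resampling for caret::train(); e1071::naiveBayes() does not use it.
trnControl <- trainControl(method="repeatedcv", number=10, repeats=3)
# naiveBayes() takes the predictors, the response and the Laplace smoothing value.
# Note: trainData still contains the Class column, so the label is also among the predictors.
model <- naiveBayes(trainData,trainData$Class,laplace = 1)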
pred <- predict(model,type = "class",newdata = testData)
confusionMatrix(pred,testData$Class)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction ham spam
##       ham  924    2
##       spam   0  159
##
## Accuracy : 0.9982
## 95% CI : (0.9934, 0.9998)
## No Information Rate : 0.8516
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9927
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 1.0000
## Specificity : 0.9876
## Pos Pred Value : 0.9978
## Neg Pred Value : 1.0000
## Prevalence : 0.8516
## Detection Rate : 0.8516
## Detection Prevalence : 0.8535
## Balanced Accuracy : 0.9938
##
## 'Positive' Class : ham
##
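The headline statistics can be recomputed directly from the four counts in the confusion matrix, which makes clear what each one measures. In this sketch the counts are copied from the output above, with ham as the positive class; `cm` and the metric variable names are illustrative.

```r
# Counts copied from the confusion matrix above (rows = prediction, cols = reference).
cm <- matrix(c(924, 0, 2, 159), nrow = 2,
             dimnames = list(Prediction = c("ham", "spam"),
                             Reference  = c("ham", "spam")))

accuracy    <- sum(diag(cm)) / sum(cm)                 # correct predictions / all predictions
sensitivity <- cm["ham", "ham"]   / sum(cm[, "ham"])   # share of true ham predicted as ham
specificity <- cm["spam", "spam"] / sum(cm[, "spam"])  # share of true spam predicted as spam
balancedAcc <- (sensitivity + specificity) / 2         # mean of the two rates

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balancedAcc), 4))
```

The two off-diagonal cells are the errors: 2 spam messages predicted as ham, and 0 ham messages predicted as spam.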