Define locations for the training and test datasets
library(tm)  # provides Corpus(), VectorSource() and TermDocumentMatrix()
spam.path <- "data/spam/"
easyham.path <- "data/easy_ham/"
easyham.test.path <- "data/easy_ham_test/"
Create a vector of spam emails for processing
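get.msg <- function(path) {
  con <- file(path, open = "rt", encoding = "latin1")
  on.exit(close(con))  # close the connection even if reading fails
  text <- readLines(con)
  # The body starts after the first blank line, which ends the headers
  first.blank <- which(text == "")[1]
  if (is.na(first.blank) || first.blank == length(text)) return("")
  msg <- text[seq(first.blank + 1, length(text))]
  return(paste(msg, collapse = "\n"))
}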
spam.docs <- dir(spam.path)
spam.docs <- spam.docs[which(spam.docs != "cmds")]
all.spam <- sapply(spam.docs,
function(p) get.msg(file.path(spam.path, p)))
head(all.spam, n = 1)
## 00001.317e78fa8ee2f54cd4890fdc09ba8176
## "Greetings!\n\nYou are receiving this letter because you have expressed an interest in \nreceiving information about online business opportunities. If this is \nerroneous then please accept my most sincere apology. This is a one-time \nmailing, so no removal is necessary.\n\nIf you've been burned, betrayed, and back-stabbed by multi-level marketing, \nMLM, then please read this letter. It could be the most important one that \nhas ever landed in your Inbox.\n\nMULTI-LEVEL MARKETING IS A HUGE MISTAKE FOR MOST PEOPLE\n\nMLM has failed to deliver on its promises for the past 50 years. The pursuit \nof the \"MLM Dream\" has cost hundreds of thousands of people their friends, \ntheir fortunes and their sacred honor. The fact is that MLM is fatally \nflawed, meaning that it CANNOT work for most people.\n\nThe companies and the few who earn the big money in MLM are NOT going to \ntell you the real story. FINALLY, there is someone who has the courage to \ncut through the hype and lies and tell the TRUTH about MLM.\n\nHERE'S GOOD NEWS\n\nThere IS an alternative to MLM that WORKS, and works BIG! If you haven't yet \nabandoned your dreams, then you need to see this. Earning the kind of income \nyou've dreamed about is easier than you think!\n\nWith your permission, I'd like to send you a brief letter that will tell you \nWHY MLM doesn't work for most people and will then introduce you to \nsomething so new and refreshing that you'll wonder why you haven't heard of \nthis before.\n\nI promise that there will be NO unwanted follow up, NO sales pitch, no one \nwill call you, and your email address will only be used to send you the \ninformation. Period.\n\nTo receive this free, life-changing information, simply click Reply, type \n\"Send Info\" in the Subject box and hit Send. I'll get the information to you \nwithin 24 hours. Just look for the words MLM WALL OF SHAME in your Inbox.\n\nCordially,\n\nSiddhi\n\nP.S. Someone recently sent the letter to me and it has been the most \neye-opening, financially beneficial information I have ever received. I \nhonestly believe that you will feel the same way once you've read it. And \nit's FREE!\n\n\n------------------------------------------------------------\nThis email is NEVER sent unsolicited. THIS IS NOT \"SPAM\". You are receiving \nthis email because you EXPLICITLY signed yourself up to our list with our \nonline signup form or through use of our FFA Links Page and E-MailDOM \nsystems, which have EXPLICIT terms of use which state that through its use \nyou agree to receive our emailings. You may also be a member of a Altra \nComputer Systems list or one of many numerous FREE Marketing Services and as \nsuch you agreed when you signed up for such list that you would also be \nreceiving this emailing.\nDue to the above, this email message cannot be considered unsolicitated, or \nspam.\n-----------------------------------------------------------\n\n\n\n\n-- \nIrish Linux Users' Group: ilug@linux.ie\nhttp://www.linux.ie/mailman/listinfo/ilug for (un)subscription information.\nList maintainer: listmaster@linux.ie\n\n"
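The header/body split that get.msg relies on can be illustrated on a message held in memory. This is only a sketch: the vector raw and its header values are invented for the example, and it assumes the message contains at least one blank line.

```r
# Toy message; the header names and values are made up for illustration.
raw <- c("From: someone@example.com",
         "Subject: hello",
         "",
         "First body line",
         "",
         "Second body line")

# Index of the first empty line, which separates headers from the body.
first.blank <- which(raw == "")[1]

# Keep everything after that line and join it with newlines,
# so later blank lines inside the body are preserved.
body <- paste(raw[seq(first.blank + 1, length(raw))], collapse = "\n")
print(body)
```

If a file had no blank line at all, first.blank would be NA and the seq() call would fail, so a real reader function probably needs a guard for that case.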
Create a text corpus and term-document matrix (TDM) from the spam email vector
get.tdm <- function(doc.vec) {
control <- list(stopwords = TRUE,
removePunctuation = TRUE,
removeNumbers = TRUE,
minDocFreq = 2)
doc.corpus <- Corpus(VectorSource(doc.vec))
doc.dtm <- TermDocumentMatrix(doc.corpus, control)
return(doc.dtm)
}
spam.tdm <- get.tdm(all.spam)
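To make the shape of a term-document matrix concrete, here is a small base-R sketch that builds one by hand. It is not how tm does it: the three documents are invented, the tokenizer is a simple split on non-letters, and no stopwords are removed.

```r
# Three made-up documents (hypothetical data, not from the corpus).
docs <- c(d1 = "free money now", d2 = "money for free", d3 = "meeting notes")

# Lowercase each document and split it on runs of non-letters.
tokens <- lapply(tolower(docs), function(d) {
  parts <- strsplit(d, "[^a-z]+")[[1]]
  parts[nzchar(parts)]
})

# The vocabulary is the sorted set of all distinct tokens.
terms <- sort(unique(unlist(tokens)))

# Rows are terms, columns are documents, cells are raw counts.
toy.tdm <- vapply(tokens,
                  function(tk) as.integer(table(factor(tk, levels = terms))),
                  integer(length(terms)))
rownames(toy.tdm) <- terms
print(toy.tdm)
```

Each row of the result plays the role of one term in spam.tdm and each column one email.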
Create a data frame that provides the feature set from the spam training data
spam.matrix <- as.matrix(spam.tdm)
spam.counts <- rowSums(spam.matrix)
# One row per term, with its total count across all spam messages
spam.df <- data.frame(term = names(spam.counts),
                      frequency = as.numeric(spam.counts),
                      stringsAsFactors = FALSE)
# Fraction of spam messages that contain each term at least once
spam.occurrence <- unname(rowSums(spam.matrix > 0)) / ncol(spam.matrix)
spam.density <- spam.df$frequency / sum(spam.df$frequency)
spam.df <- transform(spam.df,
density = spam.density,
occurrence = spam.occurrence)
head(spam.df[with(spam.df, order(-occurrence)),])
## term frequency density occurrence
## 71 http 11435 0.015306648 0.8276226
## 342 com 8780 0.011752722 0.7116500
## 248 html 4043 0.005411874 0.6046389
## 38 email 3410 0.004564554 0.5566684
## 21 click 2193 0.002935503 0.5366368
## 247 href 5093 0.006817382 0.5224038
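The three per-term statistics can be checked by hand on a tiny matrix. The sketch below uses an invented 3-term, 3-document matrix (the counts are hypothetical) and computes frequency, density and occurrence the same way as above.

```r
# Hypothetical term-document counts: rows are terms, columns are documents.
toy <- matrix(c(3, 0, 1,
                0, 0, 2,
                1, 1, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("click", "offer", "http"),
                              c("d1", "d2", "d3")))

frequency  <- rowSums(toy)                  # total count of each term
density    <- frequency / sum(frequency)    # share of all term counts
occurrence <- rowSums(toy > 0) / ncol(toy)  # share of documents containing the term

toy.df <- data.frame(term = rownames(toy),
                     frequency = unname(frequency),
                     density = unname(density),
                     occurrence = unname(occurrence),
                     stringsAsFactors = FALSE)
print(toy.df[order(-toy.df$occurrence), ])
```

Here "click" has the highest frequency (4 of 9 counts) but "http" has the highest occurrence, because it appears in every document. That is the same distinction the sorted spam.df output shows for terms such as "href" and "click".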
Process the ham training data set using the same steps
easyham.docs <- dir(easyham.path)
easyham.docs <- easyham.docs[which(easyham.docs != "cmds")]
# Use as many ham messages as there are spam messages
all.easyham <- sapply(easyham.docs[1:length(spam.docs)],
function(p) get.msg(file.path(easyham.path, p)))
easyham.tdm <- get.tdm(all.easyham)
easyham.matrix <- as.matrix(easyham.tdm)
easyham.counts <- rowSums(easyham.matrix)
# One row per term, with its total count across the ham training messages
easyham.df <- data.frame(term = names(easyham.counts),
                         frequency = as.numeric(easyham.counts),
                         stringsAsFactors = FALSE)
# Fraction of ham training messages that contain each term at least once
easyham.occurrence <- unname(rowSums(easyham.matrix > 0)) / ncol(easyham.matrix)
easyham.density <- easyham.df$frequency / sum(easyham.df$frequency)
easyham.df <- transform(easyham.df,
density = easyham.density,
occurrence = easyham.occurrence)
head(easyham.df[with(easyham.df, order(-occurrence)),])
## term frequency density occurrence
## 125 http 2840 0.009813136 0.6642066
## 8 com 3106 0.010732253 0.6162362
## 49 list 2151 0.007432414 0.4791776
## 322 www 1462 0.005051692 0.4586189
## 50 listinfo 936 0.003234188 0.4449130
## 5 can 1691 0.005842962 0.4406958
Apply the naive Bayes classifier to the ham test dataset
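# Score one message against a training data frame (spam.df or easyham.df).
# prior: assumed prior probability of the class
# c: small probability given to each message term absent from the training data
classify.email <- function(path, training.df, prior = 0.5, c = 1e-6) {
  msg <- get.msg(path)
  msg.tdm <- get.tdm(msg)
  msg.freq <- rowSums(as.matrix(msg.tdm))
  # Terms that the message shares with the training data
  msg.match <- intersect(names(msg.freq), training.df$term)
  if (length(msg.match) < 1) {
    # No shared terms: every term contributes the small probability c
    return(prior * c ^ (length(msg.freq)))
  } else {
    match.probs <- training.df$occurrence[match(msg.match, training.df$term)]
    return(prior * prod(match.probs) * c ^ (length(msg.freq) - length(msg.match)))
  }
}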
easyham.test.docs <- dir(easyham.test.path)
easyham.test.docs <- easyham.test.docs[which(easyham.test.docs != "cmds")]
easyham.test.spamtest <- sapply(easyham.test.docs,
function(p) classify.email(file.path(easyham.test.path, p), training.df = spam.df, prior = 0.2))
easyham.test.hamtest <- sapply(easyham.test.docs,
function(p) classify.email(file.path(easyham.test.path, p), training.df = easyham.df, prior = 0.8))
# TRUE means the spam score is higher than the ham score
easyham.test.res <- easyham.test.spamtest > easyham.test.hamtest
summary(easyham.test.res)
## Mode FALSE TRUE
## logical 1368 32
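Because the score is a product of many small probabilities, it can underflow to zero for long messages, which may turn a comparison of two scores into a tie. One common workaround is to sum logarithms instead. The sketch below is a variant for illustration, not the code that produced the results above: log.score is a hypothetical helper, and toy.df and msg.terms are invented data.

```r
# Log of the same quantity classify.email returns, computed as a sum of logs.
log.score <- function(msg.terms, training.df, prior = 0.5, c = 1e-6) {
  matched <- intersect(msg.terms, training.df$term)
  match.probs <- training.df$occurrence[match(matched, training.df$term)]
  log(prior) + sum(log(match.probs)) +
    (length(msg.terms) - length(matched)) * log(c)
}

# Hypothetical training table with two known terms.
toy.df <- data.frame(term = c("click", "http"),
                     occurrence = c(0.5, 0.8),
                     stringsAsFactors = FALSE)

# A message with the two known terms plus 400 terms the table has never seen.
msg.terms <- c("click", "http", paste0("unseen", 1:400))

direct <- 0.2 * prod(c(0.5, 0.8)) * 1e-6 ^ 400      # product form
logged <- log.score(msg.terms, toy.df, prior = 0.2)  # log form

print(direct)
print(logged)
```

The product form collapses to zero in double precision, while the log form stays finite. Comparing log scores gives the same ordering as comparing the raw products whenever those products are representable.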
Conclusion
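On the easy-ham test set, the naive Bayes classifier labels 1,368 of 1,400
messages as ham, which is about 97.7% accuracy on ham. The remaining 32
messages (about 2.3%) are misclassified as spam. This run does not measure
how well the classifier detects spam.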
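The 97.7% figure follows directly from the summary counts printed above. A quick check, with the two counts typed in by hand from that output (the variable names are only for this example):

```r
# Counts copied from the summary() output of easyham.test.res.
n.ham.kept    <- 1368  # FALSE: scored as ham
n.ham.flagged <- 32    # TRUE: scored as spam

n.total <- n.ham.kept + n.ham.flagged
ham.accuracy <- n.ham.kept / n.total
false.positive.rate <- n.ham.flagged / n.total

print(round(100 * ham.accuracy, 1))
print(round(100 * false.positive.rate, 1))
```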
References
Automated Data Collection with R
Machine Learning for Hackers