Project 04 -

Define locations for the training and test datasets

spam.path <- "data/spam/"
easyham.path <- "data/easy_ham/"
easyham.test.path <- "data/easy_ham_test/"

Create a vector of spam emails for processing

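get.msg <- function(path) {
        con <- file(path, open = "rt", encoding = "latin1")
        on.exit(close(con))  # close the connection even if reading fails
        text <- readLines(con)
        # The body starts after the first blank line, which ends the headers
        blank <- which(text == "")
        if (length(blank) == 0 || blank[1] == length(text)) {
                return("")  # no header/body separator, or nothing after it
        }
        msg <- text[(blank[1] + 1):length(text)]
        return(paste(msg, collapse = "\n"))
}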

spam.docs <- dir(spam.path)
spam.docs <- spam.docs[which(spam.docs != "cmds")]
all.spam <- sapply(spam.docs,
                   function(p) get.msg(file.path(spam.path, p)))

head(all.spam, n = 1)

##   00001.317e78fa8ee2f54cd4890fdc09ba8176
## "Greetings!\n\nYou are receiving this letter because you have expressed an interest in \nreceiving information about online business opportunities. If this is \nerroneous then please accept my most sincere apology. This is a one-time \nmailing, so no removal is necessary.\n\nIf you've been burned, betrayed, and back-stabbed by multi-level marketing, \nMLM, then please read this letter. It could be the most important one that \nhas ever landed in your Inbox.\n\nMULTI-LEVEL MARKETING IS A HUGE MISTAKE FOR MOST PEOPLE\n\nMLM has failed to deliver on its promises for the past 50 years. The pursuit \nof the \"MLM Dream\" has cost hundreds of thousands of people their friends, \ntheir fortunes and their sacred honor. The fact is that MLM is fatally \nflawed, meaning that it CANNOT work for most people.\n\nThe companies and the few who earn the big money in MLM are NOT going to \ntell you the real story. FINALLY, there is someone who has the courage to \ncut through the hype and lies and tell the TRUTH about MLM.\n\nHERE'S GOOD NEWS\n\nThere IS an alternative to MLM that WORKS, and works BIG! If you haven't yet \nabandoned your dreams, then you need to see this. Earning the kind of income \nyou've dreamed about is easier than you think!\n\nWith your permission, I'd like to send you a brief letter that will tell you \nWHY MLM doesn't work for most people and will then introduce you to \nsomething so new and refreshing that you'll wonder why you haven't heard of \nthis before.\n\nI promise that there will be NO unwanted follow up, NO sales pitch, no one \nwill call you, and your email address will only be used to send you the \ninformation. Period.\n\nTo receive this free, life-changing information, simply click Reply, type \n\"Send Info\" in the Subject box and hit Send. I'll get the information to you \nwithin 24 hours. Just look for the words MLM WALL OF SHAME in your Inbox.\n\nCordially,\n\nSiddhi\n\nP.S. 
Someone recently sent the letter to me and it has been the most \neye-opening, financially beneficial information I have ever received. I \nhonestly believe that you will feel the same way once you've read it. And \nit's FREE!\n\n\n------------------------------------------------------------\nThis email is NEVER sent unsolicited.  THIS IS NOT \"SPAM\". You are receiving \nthis email because you EXPLICITLY signed yourself up to our list with our \nonline signup form or through use of our FFA Links Page and E-MailDOM \nsystems, which have EXPLICIT terms of use which state that through its use \nyou agree to receive our emailings.  You may also be a member of a Altra \nComputer Systems list or one of many numerous FREE Marketing Services and as \nsuch you agreed when you signed up for such list that you would also be \nreceiving this emailing.\nDue to the above, this email message cannot be considered unsolicitated, or \nspam.\n-----------------------------------------------------------\n\n\n\n\n-- \nIrish Linux Users' Group: ilug@linux.ie\nhttp://www.linux.ie/mailman/listinfo/ilug for (un)subscription information.\nList maintainer: listmaster@linux.ie\n\n"
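
To see the header/body split in isolation, the sketch below applies the same convention (the first empty line separates the headers from the body) to an in-memory message. The sample text and the helper name `extract.body` are made up for illustration; a message with no blank line is treated as having an empty body, which is one reasonable choice among several.

```r
# Hypothetical helper: return the body of a message given its lines.
extract.body <- function(lines) {
        blank <- which(lines == "")
        # No separator, or nothing after it: treat the body as empty.
        if (length(blank) == 0 || blank[1] == length(lines)) {
                return("")
        }
        paste(lines[(blank[1] + 1):length(lines)], collapse = "\n")
}

sample.msg <- c("From: someone@example.com",
                "Subject: Hello",
                "",
                "First body line",
                "Second body line")

body.text <- extract.body(sample.msg)
no.body <- extract.body(c("From: someone@example.com", "Subject: Hello"))
cat(body.text, "\n")
```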

Create a text corpus and term-document matrix (TDM) from the spam email vector

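library(tm)  # provides Corpus, VectorSource and TermDocumentMatrix

get.tdm <- function(doc.vec) {
        # Drop stopwords, punctuation and numbers before counting terms.
        # Depending on the tm version, minDocFreq may be ignored.
        control <- list(stopwords = TRUE,
                        removePunctuation = TRUE,
                        removeNumbers = TRUE,
                        minDocFreq = 2)
        doc.corpus <- Corpus(VectorSource(doc.vec))
        doc.tdm <- TermDocumentMatrix(doc.corpus, control)
        return(doc.tdm)
}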

spam.tdm <- get.tdm(all.spam)
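
For intuition about what `get.tdm` returns: a term-document matrix has one row per term and one column per document, and each cell holds the count of that term in that document. The sketch below builds a tiny one in base R rather than with tm. The two sample documents and the whitespace tokenizer are simplifying assumptions and do not reproduce tm's preprocessing.

```r
docs <- c(doc1 = "free money free offer",
          doc2 = "meeting agenda offer")

# Naive tokenizer: lowercase, then split on whitespace.
tokens <- lapply(docs, function(d) strsplit(tolower(d), "\\s+")[[1]])
terms <- sort(unique(unlist(tokens)))

# Rows are terms, columns are documents, cells are counts.
toy.tdm <- sapply(tokens,
                  function(tok) as.vector(table(factor(tok, levels = terms))))
rownames(toy.tdm) <- terms
print(toy.tdm)
```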

Create a data frame of term features from the spam training data

spam.matrix <- as.matrix(spam.tdm)
spam.counts <- rowSums(spam.matrix)
spam.df <- data.frame(term = names(spam.counts),
                      frequency = as.numeric(spam.counts),
                      stringsAsFactors = FALSE)
# Fraction of documents in which each term appears at least once
spam.occurrence <- unname(rowSums(spam.matrix > 0)) / ncol(spam.matrix)
# Share of all term occurrences accounted for by each term
spam.density <- spam.df$frequency / sum(spam.df$frequency)

spam.df <- transform(spam.df,
                     density = spam.density,
                     occurrence = spam.occurrence)

head(spam.df[with(spam.df, order(-occurrence)),])

##      term frequency     density occurrence
## 71   http     11435 0.015306648  0.8276226
## 342   com      8780 0.011752722  0.7116500
## 248  html      4043 0.005411874  0.6046389
## 38  email      3410 0.004564554  0.5566684
## 21  click      2193 0.002935503  0.5366368
## 247  href      5093 0.006817382  0.5224038
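
The two derived columns measure different things: density is a term's share of all term occurrences in the corpus, while occurrence is the fraction of documents that contain the term at least once. The classifier further down uses occurrence. A small worked example on a made-up three-document matrix shows how the two can disagree:

```r
# Made-up counts: "free" is repeated in one document, "http" appears in all.
toy <- matrix(c(4, 0, 0,
                1, 1, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("free", "http"), c("d1", "d2", "d3")))

frequency <- rowSums(toy)
density <- frequency / sum(frequency)
occurrence <- rowSums(toy > 0) / ncol(toy)

print(data.frame(term = rownames(toy), frequency, density, occurrence,
                 row.names = NULL))
```

Here "free" has the higher density (4/7 against 3/7) but the lower occurrence (1/3 against 1), because all of its uses sit in a single document.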

Process the ham training dataset using the steps above

easyham.docs <- dir(easyham.path)
easyham.docs <- easyham.docs[which(easyham.docs != "cmds")]
all.easyham <- sapply(easyham.docs[1:length(spam.docs)],
                      function(p) get.msg(file.path(easyham.path, p)))

easyham.tdm <- get.tdm(all.easyham)

easyham.matrix <- as.matrix(easyham.tdm)
easyham.counts <- rowSums(easyham.matrix)
easyham.df <- data.frame(term = names(easyham.counts),
                         frequency = as.numeric(easyham.counts),
                         stringsAsFactors = FALSE)
# Fraction of documents in which each term appears at least once
easyham.occurrence <- unname(rowSums(easyham.matrix > 0)) / ncol(easyham.matrix)
# Share of all term occurrences accounted for by each term
easyham.density <- easyham.df$frequency / sum(easyham.df$frequency)

easyham.df <- transform(easyham.df,
                        density = easyham.density,
                        occurrence = easyham.occurrence)
head(easyham.df[with(easyham.df, order(-occurrence)),])

##         term frequency     density occurrence
## 125     http      2840 0.009813136  0.6642066
## 8        com      3106 0.010732253  0.6162362
## 49      list      2151 0.007432414  0.4791776
## 322      www      1462 0.005051692  0.4586189
## 50  listinfo       936 0.003234188  0.4449130
## 5        can      1691 0.005842962  0.4406958

Apply the naive Bayes classifier to the ham test dataset

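classify.email <- function(path, training.df, prior = 0.5, c = 1e-6) {
        # Term counts for the single message being classified
        msg <- get.msg(path)
        msg.tdm <- get.tdm(msg)
        msg.freq <- rowSums(as.matrix(msg.tdm))
        # Terms shared between the message and the training data
        msg.match <- intersect(names(msg.freq), training.df$term)
        if (length(msg.match) < 1) {
                # No shared terms: every term gets the small probability c
                return(prior * c ^ (length(msg.freq)))
        } else {
                # Shared terms use their training occurrence; the rest use c
                match.probs <- training.df$occurrence[match(msg.match, training.df$term)]
                return(prior * prod(match.probs) * c ^ (length(msg.freq) - length(msg.match)))
        }
}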

easyham.test.docs <- dir(easyham.test.path)
easyham.test.docs <- easyham.test.docs[which(easyham.test.docs != "cmds")]

easyham.test.spamtest <- sapply(easyham.test.docs,
                           function(p) classify.email(file.path(easyham.test.path, p), training.df = spam.df, prior = 0.2))

easyham.test.hamtest <- sapply(easyham.test.docs,
                          function(p) classify.email(file.path(easyham.test.path, p), training.df = easyham.df, prior = 0.8))

# TRUE means the message scored higher as spam than as ham
easyham.test.res <- easyham.test.spamtest > easyham.test.hamtest
summary(easyham.test.res)

##    Mode   FALSE    TRUE 
## logical    1368      32
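
Multiplying many small probabilities, as `classify.email` does, can underflow to zero in double precision for long messages. When both scores are zero the comparison returns FALSE, so the message is counted as ham regardless of its content. One common remedy, sketched here as an assumption rather than as part of the original analysis, is to sum logarithms instead of multiplying. The function name `score.log` and its arguments are hypothetical.

```r
# Hypothetical log-space counterpart of the product used in classify.email:
# log(prior) + sum(log(p_i)) + n.unmatched * log(c)
score.log <- function(match.probs, n.unmatched, prior = 0.5, c = 1e-6) {
        log(prior) + sum(log(match.probs)) + n.unmatched * log(c)
}

# 60 unmatched terms: the plain product underflows, the log score does not.
raw.score <- 0.5 * (1e-6) ^ 60
safe.score <- score.log(numeric(0), n.unmatched = 60, prior = 0.5)

print(raw.score)   # prints 0
print(safe.score)
```

Comparing log scores gives the same ordering as comparing the products whenever the products are representable, since the logarithm is monotonic.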

Conclusion

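On the easy-ham test set, the naive Bayes classifier labeled 1,368 of 1,400
messages as ham, about 97.7% accuracy on ham, and misclassified 32 (about 2.3%)
as spam. No spam test messages were scored here, so this figure does not
measure how well spam itself is detected.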
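
The 97.7% figure follows directly from the summary counts reported for the ham test set. Because every message in that set is ham, each TRUE result is a false positive. A short check of the arithmetic (variable names are illustrative):

```r
n.ham.correct <- 1368   # classified as not spam (FALSE)
n.false.pos <- 32       # ham classified as spam (TRUE)
n.total <- n.ham.correct + n.false.pos

ham.accuracy <- n.ham.correct / n.total
false.pos.rate <- n.false.pos / n.total

cat(sprintf("ham accuracy: %.1f%%, false-positive rate: %.1f%%\n",
            100 * ham.accuracy, 100 * false.pos.rate))
```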

References

Automated Data Collection with R  
Machine Learning for Hackers

Project 04 -

Binish Kurian Chandy

4/13/2018
