In this assignment, I begin with a spam/ham dataset and then predict the class of new documents using a naive Bayes classifier. I first describe the spam/ham data, after transforming the plain-text emails and extracting the most frequently occurring terms.

library(tm)
library(tidytext)
library(wordcloud)
library("RColorBrewer")
library(e1071)
library(dplyr)

Sourcing the email files: The spam and ham files are saved in a local directory. There are 1396 spam files and 2500 ham files.

source <- DirSource("C:/Users/ZacharyHerold/Documents/DATA607/Project4/spamham/spam_2") #input path for spam files
source2 <- DirSource("C:/Users/ZacharyHerold/Documents/DATA607/Project4/spamham/easy_ham") #input path for ham files
length(source)
## [1] 1396
length(source2)
## [1] 2500

These files are compiled into two distinct corpora.

spam.corp <- Corpus(source, readerControl=list(reader=readPlain))
ham.corp <- Corpus(source2, readerControl=list(reader=readPlain))

The content of each corpus is cleaned via a CleanCorpus function, which performs the following transformations:

- Removing line breaks
- Removing meta content between HTML tags
- Removing all header content before the subject line
- Converting to lower case
- Removing punctuation
- Removing numbers
- Removing stop words, the customized stop words "can" and "will", and words over 15 characters long
- Stemming word suffixes
- Stripping white space

CleanCorpus <- function(corpus) {
  toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
  tmp <- tm_map(corpus, toSpace, "\\n")            # line breaks
  tmp <- tm_map(tmp, toSpace, "<.*?>")             # HTML tags
  tmp <- tm_map(tmp, toSpace, "^.*Subject: ")      # header content before the subject line
  tmp <- tm_map(tmp, content_transformer(tolower))
  tmp <- tm_map(tmp, content_transformer(removePunctuation))
  tmp <- tm_map(tmp, content_transformer(removeNumbers))
  tmp <- tm_map(tmp, removeWords, c(stopwords("english"), "can", "will", "[[:alpha:]]{15,}"))
  tmp <- tm_map(tmp, content_transformer(stemDocument))
  tmp <- tm_map(tmp, content_transformer(stripWhitespace))
  tmp                                              # return the cleaned corpus explicitly
}
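
To see the effect of these transformations, here is a minimal sketch on a made-up toy document (the text and address below are hypothetical, not taken from the dataset):

toy <- VCorpus(VectorSource("From: someone@example.com\nSubject: Win a FREE prize!\n<b>Click</b> here to claim 1000 dollars now!!!"))
toy.clean <- CleanCorpus(toy)
content(toy.clean[[1]])
# expect something along the lines of: "win free prize click claim dollar"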

Viewing the first spam content after the transformations.

clean.spam <- CleanCorpus(spam.corp)
inspect(clean.spam[[1]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 1546
## 
## ilug stop mlm insan sender errorsto preced bulk listid irish linux user group xbeenther iluglinuxi greet receiv letter express interest receiv inform onlin busi opportun erron pleas accept sincer apolog onetim mail remov necessari youv burn betray backstab multilevel market mlm pleas read letter import one ever land inbox multilevel market huge mistak peopl mlm fail deliv promis past year pursuit mlm dream cost hundr thousand peopl friend fortun sacr honor fact mlm fatal flaw mean work peopl compani earn big money mlm go tell real stori final someon courag cut hype lie tell truth mlm here good news altern mlm work work big havent yet abandon dream need see earn kind incom youv dream easier think permiss id like send brief letter tell mlm doesnt work peopl introduc someth new refresh youll wonder havent heard promis unwant follow sale pitch one call email address use send inform period receiv free lifechang inform simpli click repli type send info subject box hit send ill get inform within hour just look word mlm wall shame inbox cordial siddhi ps someon recent sent letter eyeopen financi benefici inform ever receiv honest believ feel way youv read free email never sent unsolicit spam receiv email explicit sign list onlin signup form use ffa link page emaildom system explicit term use state use agre receiv email may also member altra comput system list one mani numer free market servic agre sign list also receiv email due email messag consid unsolicit spam irish linux user group iluglinuxi unsubscript inform list maintain

Viewing the first ham content after the transformations.

clean.ham <- CleanCorpus(ham.corp)
inspect(clean.ham[[1]])

I now separate out some training data with which to build the term-document matrices.

First, I develop a customized sparse-term removal function:

CustomRemoveSparse <- function(x) {
  ndocs <- length(x)
  # ignore overly sparse terms (appearing in less than 5% of the documents)
  minDocFreq <- ndocs * 0.05
  # ignore overly common terms (appearing in more than 60% of the documents)
  maxDocFreq <- ndocs * 0.6
  TermDocumentMatrix(x, control = list(bounds = list(global = c(minDocFreq, maxDocFreq))))
}

I designate 75% of each corpus for training purposes. The output below shows the non-/sparse entries of the resulting term-document matrices.

no_spam_train <- round(length(clean.spam)*3/4)
no_ham_train <- round(length(clean.ham)*3/4)
tdm.spam <- CustomRemoveSparse(clean.spam[1:no_spam_train])
tdm.spam
## <<TermDocumentMatrix (terms: 462, documents: 1047)>>
## Non-/sparse entries: 58671/425043
## Sparsity           : 88%
## Maximal term length: 14
## Weighting          : term frequency (tf)
tdm.ham <- CustomRemoveSparse(clean.ham[1:no_ham_train])
tdm.ham
## <<TermDocumentMatrix (terms: 358, documents: 1875)>>
## Non-/sparse entries: 74835/596415
## Sparsity           : 89%
## Maximal term length: 14
## Weighting          : term frequency (tf)
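
To make the 5% / 60% bounds concrete, a small worked sketch using the 1047 spam training documents reported above:

ndocs <- 1047   # spam training documents, per tdm.spam above
c(lower = ndocs * 0.05, upper = ndocs * 0.6)
# a term is kept only if it appears in at least 53 and at most 628 of these documents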

Now I create a function that lists the words in a term-document matrix in descending order of frequency, keeping only those above a pre-defined number of occurrences.

MostFreq <- function(y, num){
  m <- as.matrix(y, rownames = FALSE)
  v <- sort(rowSums(m), decreasing = TRUE)   # total frequency of each term
  d <- data.frame(word = names(v), freq = v)
  subset(d, freq > num)                      # keep only terms above the cutoff
}

freq.spam <- MostFreq(tdm.spam, 500)
head(freq.spam)
##            word freq
## free       free 1413
## nbsp       nbsp 1364
## click     click 1219
## receiv   receiv 1136
## order     order 1109
## address address 1095

Among the spam, the most frequent word is “free”.

barplot(freq.spam$freq, border = NA, names.arg = freq.spam$word, las = 2, ylim = c(0,max(freq.spam$freq)))

freq.ham <- MostFreq(tdm.ham, 600)
head(freq.ham)
##      word freq
## use   use 2072
## list list 1828
## sep   sep 1346
## get   get 1303
## mail mail 1238
## one   one 1158

Among the ham, the most frequent word is “use”.

barplot(freq.ham$freq, border = NA, names.arg = freq.ham$word, las = 2, ylim = c(0,max(freq.ham$freq)))

Two wordclouds visualize the discrepancies, first for the spam, then the ham.

set.seed(1234)
wordcloud(words = freq.spam$word, freq = freq.spam$freq, min.freq = 500,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))
## Warning in wordcloud(words = freq.spam$word, freq = freq.spam$freq,
## min.freq = 500, : messag could not be fit on page. It will not be plotted.

wordcloud(words = freq.ham$word, freq = freq.ham$freq, min.freq = 600,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))
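
The warning for the spam cloud indicates that one long term could not be fit in the plotting region. A possible workaround (not run here) is to pass a smaller scale argument so the largest words are drawn smaller:

wordcloud(words = freq.spam$word, freq = freq.spam$freq, min.freq = 500,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          scale = c(3, 0.5), colors = brewer.pal(8, "Dark2"))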

Now I build the training set, eventually combining the spam and ham data with rbind().

#the spam training data, converted into TDM, then a tibble
tdm.spam <- TermDocumentMatrix(clean.spam[1:no_spam_train])
tdm.spam
## <<TermDocumentMatrix (terms: 12845, documents: 1047)>>
## Non-/sparse entries: 115227/13333488
## Sparsity           : 99%
## Maximal term length: 20
## Weighting          : term frequency (tf)

I then:

- use the tidytext package to arrange it into a long tibble,
- subset it so that only words appearing at least 5 times in a document are included,
- mark the data as spam in an extra column.

tidy.spam <- tidy(tdm.spam)
tidy.spam <- subset(tidy.spam, tidy.spam$count >= 5)
tidy.spam$type <- "spam"
str(tidy.spam)
## Classes 'tbl_df', 'tbl' and 'data.frame':    4781 obs. of  4 variables:
##  $ term    : chr  "email" "inform" "mlm" "receiv" ...
##  $ document: chr  "00001.317e78fa8ee2f54cd4890fdc09ba8176" "00001.317e78fa8ee2f54cd4890fdc09ba8176" "00001.317e78fa8ee2f54cd4890fdc09ba8176" "00001.317e78fa8ee2f54cd4890fdc09ba8176" ...
##  $ count   : num  6 6 10 7 10 8 6 7 6 9 ...
##  $ type    : chr  "spam" "spam" "spam" "spam" ...
head(tidy.spam)
## # A tibble: 6 x 4
##   term    document                               count type 
##   <chr>   <chr>                                  <dbl> <chr>
## 1 email   00001.317e78fa8ee2f54cd4890fdc09ba8176     6 spam 
## 2 inform  00001.317e78fa8ee2f54cd4890fdc09ba8176     6 spam 
## 3 mlm     00001.317e78fa8ee2f54cd4890fdc09ba8176    10 spam 
## 4 receiv  00001.317e78fa8ee2f54cd4890fdc09ba8176     7 spam 
## 5 address 00002.9438920e9a55591b18e60d1ed37d992b    10 spam 
## 6 card    00002.9438920e9a55591b18e60d1ed37d992b     8 spam
tidy.ham <- tidy(tdm.ham)
tidy.ham <- subset(tidy.ham, tidy.ham$count >= 5)
tidy.ham$type <- "ham"    
train.all <- rbind(tidy.spam, tidy.ham)
head(train.all)
## # A tibble: 6 x 4
##   term    document                               count type 
##   <chr>   <chr>                                  <dbl> <chr>
## 1 email   00001.317e78fa8ee2f54cd4890fdc09ba8176     6 spam 
## 2 inform  00001.317e78fa8ee2f54cd4890fdc09ba8176     6 spam 
## 3 mlm     00001.317e78fa8ee2f54cd4890fdc09ba8176    10 spam 
## 4 receiv  00001.317e78fa8ee2f54cd4890fdc09ba8176     7 spam 
## 5 address 00002.9438920e9a55591b18e60d1ed37d992b    10 spam 
## 6 card    00002.9438920e9a55591b18e60d1ed37d992b     8 spam
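
As a quick sanity check (not part of the original output), one could tabulate how many term-document rows each class contributes to the combined training tibble:

table(train.all$type)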

Finally, I create two sets of testing data from the remaining 25% of the documents: one containing only spam, the other only ham.

##assembling the testing data

test.spam.corp <- clean.spam[(no_spam_train+1):length(source)]
test.ham.corp <- clean.ham[(no_ham_train+1):length(source2)]
        
tdm.spam.test <- TermDocumentMatrix(test.spam.corp)
tidy.spam.test <- tidy(tdm.spam.test)       
tidy.spam.test <- subset(tidy.spam.test, tidy.spam.test$count >= 5)
tidy.spam.test$type <- "spam" 

tdm.ham.test <- TermDocumentMatrix(test.ham.corp)
tidy.ham.test <- tidy(tdm.ham.test)       
tidy.ham.test <- subset(tidy.ham.test, tidy.ham.test$count >= 5)
tidy.ham.test$type <- "ham" 

Now I make use of the naiveBayes() function in the e1071 package. A predictive classifier is trained on the complete training data.

## Naive Bayes prediction

email_classifier <- naiveBayes(train.all, factor(train.all$type))
class(email_classifier)
## [1] "naiveBayes"

After merging the test data into one tibble, I run it through the predictor. It classifies every ham row of the test data correctly and misclassifies only 12 of the 1642 spam rows (about 0.7%) as ham.

test.data <- rbind(tidy.spam.test, tidy.ham.test)
preds3 <- predict(email_classifier, newdata=test.data)
table(preds3, test.data$type)
##       
## preds3  ham spam
##   ham   262   12
##   spam    0 1630
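
To tie the confusion matrix back to the error rates quoted above, a small follow-up sketch:

conf <- table(preds3, test.data$type)
sum(diag(conf)) / sum(conf)                # overall accuracy: 1892/1904, about 99.4%
conf["ham", "spam"] / sum(conf[, "spam"])  # spam rows predicted as ham: 12/1642, about 0.7%
conf["spam", "ham"] / sum(conf[, "ham"])   # ham rows predicted as spam: 0/262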