In this assignment, I start from a spam/ham email dataset and predict the class of new documents using a Naive Bayes classifier. I first describe the spam/ham data, transforming the plain-text emails and extracting the most frequently occurring terms.
library(tm)
library(tidytext)
library(wordcloud)
library("RColorBrewer")
library(e1071)
library(dplyr)
Sourcing the email files: The spam and ham files are saved in a local directory. There are 1,396 spam files and 2,500 ham files.
source <- DirSource("C:/Users/ZacharyHerold/Documents/DATA607/Project4/spamham/spam_2") #input path for spam files
source2 <- DirSource("C:/Users/ZacharyHerold/Documents/DATA607/Project4/spamham/easy_ham") #input path for ham files
length(source)
## [1] 1396
length(source2)
## [1] 2500
These files are compiled into two distinct corpora.
spam.corp <- Corpus(source, readerControl=list(reader=readPlain))
ham.corp <- Corpus(source2, readerControl=list(reader=readPlain))
The corpora's content is cleaned via a CleanCorpus function, which performs the following transformations:
- Removing line breaks
- Removing HTML tags
- Removing all header content before the subject line
- Converting to lower case
- Removing punctuation
- Removing numbers
- Removing stop words, the custom stop words "can" and "will", and words of 15 or more characters
- Stemming word suffixes
- Stripping extra white space
CleanCorpus <- function(corpus) {
  toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
  tmp <- tm_map(corpus, toSpace, "\\n")          # line breaks
  tmp <- tm_map(tmp, toSpace, "<.*?>")           # HTML tags
  tmp <- tm_map(tmp, toSpace, "^.*Subject: ")    # header content before the subject line
  tmp <- tm_map(tmp, content_transformer(tolower))
  tmp <- tm_map(tmp, content_transformer(removePunctuation))
  tmp <- tm_map(tmp, content_transformer(removeNumbers))
  tmp <- tm_map(tmp, removeWords, c(stopwords("english"), "can", "will", "[[:alpha:]]{15,}"))
  tmp <- tm_map(tmp, content_transformer(stemDocument))
  tmp <- tm_map(tmp, content_transformer(stripWhitespace))
  tmp
}
Viewing the first spam content after the transformations.
clean.spam <- CleanCorpus(spam.corp)
inspect(clean.spam[[1]])
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1546
##
## ilug stop mlm insan sender errorsto preced bulk listid irish linux user group xbeenther iluglinuxi greet receiv letter express interest receiv inform onlin busi opportun erron pleas accept sincer apolog onetim mail remov necessari youv burn betray backstab multilevel market mlm pleas read letter import one ever land inbox multilevel market huge mistak peopl mlm fail deliv promis past year pursuit mlm dream cost hundr thousand peopl friend fortun sacr honor fact mlm fatal flaw mean work peopl compani earn big money mlm go tell real stori final someon courag cut hype lie tell truth mlm here good news altern mlm work work big havent yet abandon dream need see earn kind incom youv dream easier think permiss id like send brief letter tell mlm doesnt work peopl introduc someth new refresh youll wonder havent heard promis unwant follow sale pitch one call email address use send inform period receiv free lifechang inform simpli click repli type send info subject box hit send ill get inform within hour just look word mlm wall shame inbox cordial siddhi ps someon recent sent letter eyeopen financi benefici inform ever receiv honest believ feel way youv read free email never sent unsolicit spam receiv email explicit sign list onlin signup form use ffa link page emaildom system explicit term use state use agre receiv email may also member altra comput system list one mani numer free market servic agre sign list also receiv email due email messag consid unsolicit spam irish linux user group iluglinuxi unsubscript inform list maintain
Viewing the first ham content after the transformations.
clean.ham <- CleanCorpus(ham.corp)
inspect(clean.ham[[1]])
I now separate out the training data. First, I develop a custom function to remove sparse terms:
CustomRemoveSparse <- function(x) {
  ndocs <- length(x)
  # ignore overly sparse terms (appearing in less than 5% of the documents)
  minDocFreq <- ndocs * 0.05
  # ignore overly common terms (appearing in more than 60% of the documents)
  maxDocFreq <- ndocs * 0.6
  TermDocumentMatrix(x, control = list(bounds = list(global = c(minDocFreq, maxDocFreq))))
}
I designate the first 75% of each corpus for training. The following output shows the non-/sparse entries of the resulting term-document matrices.
no_spam_train <- round(length(clean.spam)*3/4)
no_ham_train <- round(length(clean.ham)*3/4)
tdm.spam <- CustomRemoveSparse(clean.spam[1:no_spam_train])
tdm.spam
## <<TermDocumentMatrix (terms: 462, documents: 1047)>>
## Non-/sparse entries: 58671/425043
## Sparsity : 88%
## Maximal term length: 14
## Weighting : term frequency (tf)
tdm.ham <- CustomRemoveSparse(clean.ham[1:no_ham_train])
tdm.ham
## <<TermDocumentMatrix (terms: 358, documents: 1875)>>
## Non-/sparse entries: 74835/596415
## Sparsity : 89%
## Maximal term length: 14
## Weighting : term frequency (tf)
Now, I create a function that lists the words in a term-document matrix in order of frequency, keeping only those that occur more than a pre-defined number of times.
MostFreq <- function(y, num) {
  m <- as.matrix(y, rownames = FALSE)
  v <- sort(rowSums(m), decreasing = TRUE)
  d <- data.frame(word = names(v), freq = v)
  subset(d, freq > num)
}
freq.spam <- MostFreq(tdm.spam, 500)
head(freq.spam)
## word freq
## free free 1413
## nbsp nbsp 1364
## click click 1219
## receiv receiv 1136
## order order 1109
## address address 1095
Among the spam, the most frequent word is "free".
barplot(freq.spam$freq, border = NA, names.arg = freq.spam$word, las = 2, ylim = c(0,max(freq.spam$freq)))
freq.ham <- MostFreq(tdm.ham, 600)
head(freq.ham)
## word freq
## use use 2072
## list list 1828
## sep sep 1346
## get get 1303
## mail mail 1238
## one one 1158
Among the ham, the most frequent word is "use".
barplot(freq.ham$freq, border = NA, names.arg = freq.ham$word, las = 2, ylim = c(0,max(freq.ham$freq)))
Two word clouds visualize the differences, first for the spam, then for the ham.
set.seed(1234)
wordcloud(words = freq.spam$word, freq = freq.spam$freq, min.freq = 500,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
## Warning in wordcloud(words = freq.spam$word, freq = freq.spam$freq,
## min.freq = 500, : messag could not be fit on page. It will not be plotted.
wordcloud(words = freq.ham$word, freq = freq.ham$freq, min.freq = 600,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Now I build the training set, eventually combining the spam and ham data with rbind.
#the spam training data, converted into TDM, then a tibble
tdm.spam <- TermDocumentMatrix(clean.spam[1:no_spam_train])
tdm.spam
## <<TermDocumentMatrix (terms: 12845, documents: 1047)>>
## Non-/sparse entries: 115227/13333488
## Sparsity : 99%
## Maximal term length: 20
## Weighting : term frequency (tf)
I then:
- use the tidytext package to arrange it into a long tibble,
- subset so that only words appearing at least 5 times in a document are kept,
- mark the data as spam in an extra column.
tidy.spam <- tidy(tdm.spam)
tidy.spam <- subset(tidy.spam, tidy.spam$count >= 5)
tidy.spam$type <- "spam"
str(tidy.spam)
## Classes 'tbl_df', 'tbl' and 'data.frame': 4781 obs. of 4 variables:
## $ term : chr "email" "inform" "mlm" "receiv" ...
## $ document: chr "00001.317e78fa8ee2f54cd4890fdc09ba8176" "00001.317e78fa8ee2f54cd4890fdc09ba8176" "00001.317e78fa8ee2f54cd4890fdc09ba8176" "00001.317e78fa8ee2f54cd4890fdc09ba8176" ...
## $ count : num 6 6 10 7 10 8 6 7 6 9 ...
## $ type : chr "spam" "spam" "spam" "spam" ...
head(tidy.spam)
## # A tibble: 6 x 4
## term document count type
## <chr> <chr> <dbl> <chr>
## 1 email 00001.317e78fa8ee2f54cd4890fdc09ba8176 6 spam
## 2 inform 00001.317e78fa8ee2f54cd4890fdc09ba8176 6 spam
## 3 mlm 00001.317e78fa8ee2f54cd4890fdc09ba8176 10 spam
## 4 receiv 00001.317e78fa8ee2f54cd4890fdc09ba8176 7 spam
## 5 address 00002.9438920e9a55591b18e60d1ed37d992b 10 spam
## 6 card 00002.9438920e9a55591b18e60d1ed37d992b 8 spam
tidy.ham <- tidy(tdm.ham)
tidy.ham <- subset(tidy.ham, tidy.ham$count >= 5)
tidy.ham$type <- "ham"
train.all <- rbind(tidy.spam, tidy.ham)
head(train.all)
## # A tibble: 6 x 4
## term document count type
## <chr> <chr> <dbl> <chr>
## 1 email 00001.317e78fa8ee2f54cd4890fdc09ba8176 6 spam
## 2 inform 00001.317e78fa8ee2f54cd4890fdc09ba8176 6 spam
## 3 mlm 00001.317e78fa8ee2f54cd4890fdc09ba8176 10 spam
## 4 receiv 00001.317e78fa8ee2f54cd4890fdc09ba8176 7 spam
## 5 address 00002.9438920e9a55591b18e60d1ed37d992b 10 spam
## 6 card 00002.9438920e9a55591b18e60d1ed37d992b 8 spam
Finally, I create two sets of testing data, one with all spam, the other with all ham.
##assembling the testing data
test.spam.corp <- clean.spam[(no_spam_train+1):length(clean.spam)]
test.ham.corp <- clean.ham[(no_ham_train+1):length(clean.ham)]
tdm.spam.test <- TermDocumentMatrix(test.spam.corp)
tidy.spam.test <- tidy(tdm.spam.test)
tidy.spam.test <- subset(tidy.spam.test, tidy.spam.test$count >= 5)
tidy.spam.test$type <- "spam"
tdm.ham.test <- TermDocumentMatrix(test.ham.corp)
tidy.ham.test <- tidy(tdm.ham.test)
tidy.ham.test <- subset(tidy.ham.test, tidy.ham.test$count >= 5)
tidy.ham.test$type <- "ham"
Now I make use of the Naive Bayes implementation in the e1071 package. A predictive classifier is trained on the complete training data.
## Naive Bayes prediction
email_classifier <- naiveBayes(train.all, factor(train.all$type))
class(email_classifier)
## [1] "naiveBayes"
After merging the test data into one tibble, I run it through the classifier. Reading the confusion matrix below (rows are predictions, columns are actual classes), all 262 ham rows are classified correctly, while 12 of the 1,642 spam rows (about 0.7%) are misclassified as ham.
test.data <- rbind(tidy.spam.test, tidy.ham.test)
preds3 <- predict(email_classifier, newdata=test.data)
table(preds3, test.data$type)
##
## preds3 ham spam
## ham 262 12
## spam 0 1630
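The confusion matrix can also be summarized as rates; a minimal sketch (note that each row of test.data is a term-document pair, not a whole email):
conf <- table(preds3, test.data$type)
sum(diag(conf)) / sum(conf)                # overall accuracy across all rows
conf["ham", "spam"] / sum(conf[, "spam"])  # share of spam rows predicted as ham
conf["spam", "ham"] / sum(conf[, "ham"])   # share of ham rows predicted as spam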