Document Classification

We are provided with a public corpus of spam and ham email files (https://spamassassin.apache.org/old/publiccorpus/), along with instructions on how to download them. We prefer to download the files programmatically from RStudio so that everyone can run the script. We are going to use the tm package, which provides text-mining (NLP) functions.

We are going to implement the following steps: 1. Download and load the data 2. Transform the data into a corpus 3. Model development 4. Prediction

We use this code to download the files and load them into R.

#Loading and Downloading the files
url_ham <- "https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2"
url_spam <- "https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2"

download.file(url_ham, destfile = "ham.tar.bz2")
download.file(url_spam, destfile = "spam.tar.bz2")
untar("spam.tar.bz2", exdir = "spam")
untar("ham.tar.bz2", exdir = "ham")

We ran this chunk once and then disabled it, because downloading and extracting the archives on every run makes our R Markdown document freeze.
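One way to keep the download step in the script without repeating it on every run is to skip it when the files are already on disk. The following is a minimal sketch in base R, not the code we actually ran; the helper name `fetch_once` and its `download` and `extract` arguments are our own illustrative choices.

```r
# Hypothetical helper: download and extract an archive only when it is missing.
# The download/extract arguments default to the base R functions and exist only
# so the helper can be exercised without network access.
fetch_once <- function(url, destfile, exdir,
                       download = utils::download.file,
                       extract = utils::untar) {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps the binary archive intact on Windows
    download(url, destfile = destfile, mode = "wb")
  }
  if (!dir.exists(exdir)) {
    extract(destfile, exdir = exdir)
  }
  invisible(exdir)
}
```

With this helper, a call such as `fetch_once(url_spam, "spam.tar.bz2", "spam")` would download and extract on the first run and do nothing on later runs.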

# Download and Load the data 
library(tm)
## Loading required package: NLP
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
## Loading required package: lattice
library(e1071)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::annotate() masks NLP::annotate()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ purrr::lift()       masks caret::lift()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(wordcloud)
## Loading required package: RColorBrewer
library(naivebayes)
## naivebayes 0.9.7 loaded
library(SnowballC)
library(gbm)
## Loaded gbm 2.1.8.1

#Loading and Downloading the files


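# Folders where the extracted emails were saved
spam_path <- "spam/"
ham_path <- "ham/easy_ham/"
# Note: the two lines below compare the folder path itself with "cmds", so they
# leave the paths unchanged; a "cmds" index file in a folder would still be read
spam_path <- spam_path[which(spam_path!="cmds")]
ham_path <- ham_path[which(ham_path!="cmds")]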
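Because the paths above are single strings, comparing them with "cmds" does not drop any file. If the goal is to leave out a `cmds` index file, the filter has to act on the file listing instead. A minimal base R sketch follows; the helper name `list_emails` is hypothetical and the approach is an assumption about the intent, not part of the original analysis.

```r
# Hypothetical helper: list the email files in a folder, skipping "cmds".
list_emails <- function(dir) {
  files <- list.files(dir, full.names = TRUE)
  files[basename(files) != "cmds"]
}

# Small demonstration on a temporary folder
demo_dir <- file.path(tempdir(), "demo_spam")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("00001.abc", "00002.def", "cmds")))
basename(list_emails(demo_dir))
```

The resulting vector of file names could then be read with `readLines()` or passed to a tm source that accepts individual files.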

Transform the data into Corpus

The data analysis step involves creating a corpus and tidying the data into a data frame so that the algorithm can run on it. We are going to bind the ham and spam corpora and turn them into a single data set. It is very important to create both a training set and a test set.

# Creating the text Corpus
spam <- Corpus(DirSource(spam_path))
ham <- Corpus(DirSource(ham_path))

# Let's see what is inside each email corpus
meta(spam[[1]])
##   author       : character(0)
##   datetimestamp: 2023-11-20 04:05:14
##   description  : character(0)
##   heading      : character(0)
##   id           : 00001.7848dde101aa985090474a91ec93fcf0
##   language     : en
##   origin       : character(0)
meta(ham[[1]])
##   author       : character(0)
##   datetimestamp: 2023-11-20 04:05:14
##   description  : character(0)
##   heading      : character(0)
##   id           : 0001.ea7e79d3153e7469e7a9c3e0af6a357e
##   language     : en
##   origin       : character(0)
# Create a function that converts the text to lower case, removes punctuation
# and numbers, stems the words, and removes common stop words and extra whitespace
spam_tidy <- function(doc){
  doc <- tm_map(doc, content_transformer(tolower))
  doc <- tm_map(doc, content_transformer(PlainTextDocument))
  doc <- tm_map(doc, content_transformer(removePunctuation))
  doc <- tm_map(doc,content_transformer(removeNumbers))
  doc <- tm_map(doc, content_transformer(stemDocument),  language = 'english')
  #doc <- tm_map(doc, removeWords, c('receiv', stopwords('english')))
  doc <- tm_map(doc, removeWords, c('spamassassin', stopwords('english')))
  doc <- tm_map(doc, stripWhitespace)
  
  return(doc)
  
}
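To make the effect of these cleaning steps concrete, here is a rough base R approximation applied to a single string. It is only a sketch: it does not use tm, it skips stemming, and the short stop-word list is an illustrative assumption rather than the list returned by `stopwords('english')`.

```r
# Base R approximation of the cleaning steps (no stemming).
# The default stop_words vector is a tiny illustrative list, not tm's list.
clean_text <- function(x, stop_words = c("the", "a", "and", "of", "to")) {
  x <- tolower(x)                          # lower case
  x <- gsub("[[:punct:]]+", "", x)         # delete punctuation
  x <- gsub("[[:digit:]]+", "", x)         # delete digits
  words <- strsplit(x, "[[:space:]]+")[[1]]
  words <- words[nzchar(words) & !(words %in% stop_words)]
  paste(words, collapse = " ")             # also collapses extra whitespace
}

clean_text("Hello, World! 123 the cat")
# → "hello world cat"
```

Because punctuation is deleted rather than replaced by a space, an address such as `user@example.com` becomes the single token `userexamplecom`, which matches the fused tokens visible in the cleaned emails.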

# use function for both spam and ham 
spam_corpus <- spam_tidy(spam)
spam_corpus
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 501
easy_ham_corpus <- spam_tidy(ham)
easy_ham_corpus
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 2551
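# Combine the ham and spam corpora and build a document-term matrix
# (the object is named tdm, but documents are in rows and terms in columns).
# Note: the summary below reports 6 documents rather than 2551 + 501, which
# suggests c() did not merge the two SimpleCorpus objects document by document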
ham_or_spam_corpus <- c(easy_ham_corpus, spam_corpus)
tdm <- DocumentTermMatrix(ham_or_spam_corpus)
tdm
## <<DocumentTermMatrix (documents: 6, terms: 59472)>>
## Non-/sparse entries: 64479/292353
## Sparsity           : 82%
## Maximal term length: 298
## Weighting          : term frequency (tf)
inspect(tdm)
## <<DocumentTermMatrix (documents: 6, terms: 59472)>>
## Non-/sparse entries: 64479/292353
## Sparsity           : 82%
## Maximal term length: 298
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs  aug esmtp  ist localhost  mon  oct postfix receiv   sep  thu
##    1 5179  8710 4389      7602 3869 5456    4825  14477 10008 4224
##    2    0     0    0         0    0    0       0      0     0    0
##    3    0     0    0         0    0    0       0      0     0    0
##    4 1173  1118  740      1178  674   56     595   2890  2128  486
##    5    0     0    0         0    0    0       0      0     0    0
##    6    0     0    0         0    0    0       0      0     0    0
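For readers new to the idea, a document-term matrix simply counts how often each term occurs in each document. The sketch below builds one by hand in base R for two made-up documents; the document names and texts are invented for illustration and are not taken from the corpus.

```r
# Two toy documents (invented for illustration)
docs <- c(ham1 = "meeting at noon", spam1 = "free money free offer")

# Tokenise on spaces and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term per document: terms x documents, then transpose
counts <- vapply(tokens,
                 function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
                 integer(length(vocab)))
dtm <- t(counts)
colnames(dtm) <- vocab
dtm

# Sparsity is the share of zero cells
sparsity <- mean(dtm == 0)
sparsity
# → 0.5
```

In this toy matrix each row is one document, so a corpus of 3052 emails would be expected to give 3052 rows.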
# See the top 10 words in the combined ham and spam corpus
# Summary Statistics

freq <- tdm %>% as.matrix() %>% colSums()
length(freq)  
## [1] 59472
freq_ord <- freq %>% order(decreasing = TRUE)
par(las=1)
# This will create a bar plot of the top 10 words in the combined corpus
barplot(freq[freq_ord[1:10]], horiz = TRUE, col=terrain.colors(10), cex.names=0.7)

# Let's see a word cloud of up to 100 words that appear in the combined ham and spam corpus
wordcloud(ham_or_spam_corpus, max.words = 100, random.order = FALSE, rot.per=0.15, min.freq=5, colors = brewer.pal(8, "Dark2"))

Model Development

We are going to turn the cleaned corpora into a data frame and add a category column with the values spam and ham. Adding a categorical label column makes it straightforward to fit a Naive Bayes model. At this stage we encountered a series of problems, because the many Naive Bayes packages available each require the data to be in a specific format. This was the part where I spent more than 6 hours figuring out how to put the corpus into the right format. This step is also demanding on computing power.

# Model development
# Build a data frame with the cleaned text and its label
# (the text column is named "text" and the ham/spam label column is named "email")
df_ham <- as.data.frame(unlist(easy_ham_corpus), stringsAsFactors = FALSE)
df_ham$type <- "ham"
colnames(df_ham)=c("text", "email")


df_spam <- as.data.frame(unlist(spam_corpus), stringsAsFactors = FALSE)
df_spam$type <- "spam"
colnames(df_spam)=c("text", "email")

df_ham_or_spam <- rbind(df_ham, df_spam)

head(df_ham_or_spam)
##          text
## content1  exmhworkersadminredhatcom thu aug returnpath exmhworkersadminexamplecom deliveredto zzzzlocalhostnetnoteinccom receiv localhost localhost phoboslabsnetnoteinccom postfix esmtp id dec zzzzlocalhost thu aug edt receiv phobo localhost imap fetchmail zzzzlocalhost singledrop thu aug ist receiv listmanexamplecom listmanexamplecom dogmaslashnullorg esmtp id gmbyrz zzzzexmhexamplecom thu aug receiv listmanexamplecom localhostlocaldomain listmanredhatcom postfix esmtp id thu aug edt deliveredto exmhworkerslistmanexamplecom receiv intmxcorpexamplecom intmxcorpexamplecom listmanredhatcom postfix esmtp id cfd exmhworkerslistmanredhatcom thu aug edt receiv maillocalhost intmxcorpexamplecom id gmbyg exmhworkerslistmanredhatcom thu aug receiv mxexamplecom mxexamplecom intmxcorpredhatcom smtp id gmbyy exmhworkersredhatcom thu aug receiv ratreepsuacth mxexamplecom smtp id gmbihl exmhworkersredhatcom thu aug receiv deltacsmuozau deltacoepsuacth ratreepsuacth esmtp id gmbwel thu aug ict receiv munnariozau localhost deltacsmuozau esmtp id gmbqpw thu aug ict robert elz kremunnariozau chris garrigu cwgdatedfaddeepeddycom cc exmhworkersexamplecom subject re new sequenc window inreplyto tmdadeepeddyvirciocom refer tmdadeepeddyvirciocom tmdadeepeddyvirciocom munnariozau tmdadeepeddyvirciocom tmdadeepeddyvirciocom mimevers contenttyp textplain charsetusascii messageid munnariozau xloop exmhworkersexamplecom sender exmhworkersadminexamplecom errorsto exmhworkersadminexamplecom xbeenther exmhworkersexamplecom xmailmanvers preced bulk listhelp mailtoexmhworkersrequestexamplecomsubjecthelp listpost mailtoexmhworkersexamplecom listsubscrib httpslistmanexamplecommailmanlistinfoexmhwork mailtoexmhworkersrequestredhatcomsubjectsubscrib listid discuss list exmh develop exmhworkersexamplecom listunsubscrib httpslistmanexamplecommailmanlistinfoexmhwork mailtoexmhworkersrequestredhatcomsubjectunsubscrib listarch httpslistmanexamplecommailmanprivateexmhwork date thu aug date wed aug chris 
garrigu cwgdatedfaddeepeddycom messageid tmdadeepeddyvirciocom cant reproduc error veri repeat like everi time without fail debug log pick happen pickit exec pick inbox list lbrace lbrace subject ftp rbrace rbrace sequenc mercuri exec pick inbox list lbrace lbrace subject ftp rbrace rbrace sequenc mercuri ftocpickmsg hit mark hit tkerror syntax error express int note run pick command hand delta pick inbox list lbrace lbrace subject ftp rbrace rbrace sequenc mercuri hit hit come obvious version nmh im use delta pick version pick nmh compil fuchsiacsmuozau sun mar ict relev part mhprofil delta mhparam pick seq sel list sinc pick command work sequenc actual one explicit command line search popup one come mhprofil get creat kre ps still use version code form day ago havent abl reach cvs repositori today local rout issu think exmhwork mail list exmhworkersredhatcom httpslistmanredhatcommailmanlistinfoexmhwork
## content2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        steveburtcursorsystemcom thu aug returnpath steveburtcursorsystemcom deliveredto zzzzlocalhostnetnoteinccom receiv localhost localhost phoboslabsnetnoteinccom postfix esmtp id beec zzzzlocalhost thu aug edt receiv phobo localhost imap fetchmail zzzzlocalhost singledrop thu aug ist receiv ngrpscdyahoocom ngrpscdyahoocom dogmaslashnullorg smtp id gmbktz zzzzexamplecom thu aug xegroupsreturn senttozzzzexamplecomreturnsgroupsyahoocom receiv ngrpscdyahoocom nnfmp aug xsender steveburtcursorsystemcom xapparentlyto zzzzteanayahoogroupscom receiv egp mail aug receiv qmail invok network aug receiv unknown mgrpscdyahoocom qmqp aug receiv unknown helo mailgatewaycursorsystemcom mtagrpscdyahoocom smtp aug receiv exchangecpsloc unverifi mailgatewaycursorsystemcom content technolog smtprs esmtp id tcdefacddmailgatewaycursorsystemcom forteanayahoogroupscom thu aug receiv exchangecpsloc internet mail servic id pxxat thu aug messageid ecadddfbbdaddefbfexchangecpsloc zzzzteanayahoogroupscom zzzzteanayahoogroupscom xmailer internet mail servic xegroupsfrom steve burt 
steveburtcursorsystemcom steve burt steveburtcursorsystemcom xyahooprofil pyrus mimevers mailinglist list zzzzteanayahoogroupscom contact forteanaowneryahoogroupscom deliveredto mail list zzzzteanayahoogroupscom preced bulk listunsubscrib mailtozzzzteanaunsubscribeyahoogroupscom date thu aug subject zzzzteana re alexand replyto zzzzteanayahoogroupscom contenttyp textplain charsetusascii contenttransferencod bit martin post tasso papadopoulo greek sculptor behind plan judg limeston mount kerdylio mile east salonika far mount atho monast communiti ideal patriot sculptur well alexand granit featur ft high ft wide museum restor amphitheatr car park admir crowd plan mountain limeston granit limeston itll weather pretti fast yahoo group sponsor dvds free sp join now httpusclickyahoocomptybbnxieaamghaagsolbtm unsubscrib group send email forteanaunsubscribeegroupscom use yahoo group subject httpdocsyahoocominfoterm
## content3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             timcubhcom thu aug returnpath timcubhcom deliveredto zzzzlocalhostnetnoteinccom receiv localhost localhost phoboslabsnetnoteinccom postfix esmtp id c zzzzlocalhost thu aug edt receiv phobo localhost imap fetchmail zzzzlocalhost singledrop thu aug ist receiv ngrpscdyahoocom ngrpscdyahoocom dogmaslashnullorg smtp id gmcrdz zzzzexamplecom thu aug xegroupsreturn senttozzzzexamplecomreturnsgroupsyahoocom receiv ngrpscdyahoocom nnfmp aug xsender timcubhcom xapparentlyto zzzzteanayahoogroupscom receiv egp mail aug receiv qmail invok network aug receiv unknown mgrpscdyahoocom qmqp aug receiv unknown helo rheniumbtinternetcom mtagrpscdyahoocom smtp aug receiv hostinaddrbtopenworldcom rheniumbtinternetcom esmtp exim id hrtgj forteanayahoogroupscom thu aug xmailer microsoft outlook express macintosh edit zzzzteana zzzzteanayahoogroupscom xprioriti messageid ehrtgjrheniumbtinternetcom tim chapman timcubhcom xyahooprofil timubh mimevers mailinglist list zzzzteanayahoogroupscom contact forteanaowneryahoogroupscom deliveredto mail list zzzzteanayahoogroupscom preced bulk listunsubscrib mailtozzzzteanaunsubscribeyahoogroupscom date thu aug subject zzzzteana moscow bomber replyto zzzzteanayahoogroupscom contenttyp textplain charsetusascii contenttransferencod bit man threaten explos moscow thursday august pm moscow ap secur offic thursday seiz unidentifi man 
said arm explos threaten blow truck front russia feder secur servic headquart moscow ntv televis report offic seiz automat rifl man carri man got truck taken custodi ntv said detail immedi avail man demand talk high govern offici interfax itartass news agenc said ekho moskvi radio report want talk russian presid vladimir putin polic secur forc rush secur servic build within block kremlin red squar bolshoi ballet surround man claim one half ton explos news agenc said negoti continu one half hour outsid build itartass interfax report cite wit man later drove away build polic escort drove street near moscow olymp penta hotel author held negoti moscow polic press servic said move appear attempt secur servic get secur locat yahoo group sponsor dvds free sp join now httpusclickyahoocomptybbnxieaamghaagsolbtm unsubscrib group send email forteanaunsubscribeegroupscom use yahoo group subject httpdocsyahoocominfoterm
## content4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         irregularsadmintbtf thu aug returnpath irregularsadmintbtf deliveredto zzzzlocalhostnetnoteinccom receiv localhost localhost phoboslabsnetnoteinccom postfix esmtp id daec zzzzlocalhost thu aug edt receiv phobo localhost imap fetchmail zzzzlocalhost singledrop thu aug ist receiv webtbtf routetelocitycom dogmaslashnullorg esmtp id gmdgoz zzzzirrexamplecom thu aug receiv webtbtf localhostlocaldomain webtbtf esmtp id gmdpi thu aug receiv redharveehom red may forg webtbtf esmtp id gmdoi irregularstbtf thu aug receiv prservnet outprservnet redharveehom esmtp id gmdfbd irregularstbtf thu aug receiv slipmausprservnet prservnet esmtp id qujc thu aug mimevers xsender unverifi messageid pbaca undisclosedrecipi monti solomon montyroscomcom contenttyp textplain charsetusascii subject irr klez virus wont die sender irregularsadmintbtf errorsto irregularsadmintbtf xbeenther irregularstbtf xmailmanvers preced bulk listhelp mailtoirregularsrequesttbtfsubjecthelp listpost mailtoirregularstbtf listsubscrib httptbtfmailmanlistinfoirregular 
mailtoirregularsrequesttbtfsubjectsubscrib listid new home tbtf irregular mail list irregularstbtf listunsubscrib httptbtfmailmanlistinfoirregular mailtoirregularsrequesttbtfsubjectunsubscrib listarch httptbtfmailmanprivateirregular date thu aug klez virus wont die alreadi prolif virus ever klez continu wreak havoc andrew brandt septemb issu pc world magazin post thursday august klez worm approach seventh month wriggl across web make one persist virus ever expert warn may harbing new virus use combin pernici approach go pc pc antivirus softwar maker symantec mcafe report new infect daili sign letup press time british secur firm messagelab estim everi email messag hold variat klez virus say klez alreadi surpass last summer sircam prolif virus ever newer klez variant arent mere nuisancesthey can carri virus corrupt data httpwwwpcworldcomnewsarticleaidasp irregular mail list irregularstbtf httptbtfmailmanlistinfoirregular
## content5                                                                                                           exmhusersadminredhatcom thu aug returnpath exmhusersadminexamplecom deliveredto zzzzlocalhostnetnoteinccom receiv localhost localhost phoboslabsnetnoteinccom postfix esmtp id bc zzzzlocalhost thu aug edt receiv phobo localhost imap fetchmail zzzzlocalhost singledrop thu aug ist receiv listmanexamplecom listmanexamplecom dogmaslashnullorg esmtp id gmdgez zzzzexmhexamplecom thu aug receiv listmanexamplecom localhostlocaldomain listmanredhatcom postfix esmtp id feea thu aug edt deliveredto exmhuserslistmanexamplecom receiv intmxcorpexamplecom intmxcorpexamplecom listmanredhatcom postfix esmtp id aceffa exmhuserslistmanredhatcom thu aug edt receiv maillocalhost intmxcorpexamplecom id gmdc exmhuserslistmanredhatcom thu aug receiv mxexamplecom mxexamplecom intmxcorpredhatcom smtp id gmdci exmhusersredhatcom thu aug receiv mtabwbigpondcom mtabwbigpondcom mxredhatcom smtp id gmdnvl exmhusersredhatcom thu aug receiv hobbitlinuxworkscomau mtabwbigpondcom netscap messag server mtabw may smtp id hzfg exmhusersredhatcom thu aug receiv cpeqldbigpondnetau bwmammailsvcemailbigpondcommailrout vn aug receiv tonylocalhost hobbitlinuxworkscomau id gmdawx thu aug messageid gmdawxhobbitlinuxworkscomaunospam exmh user mail list exmhusersexamplecom toni nugent tonylinuxworkscomau xface irgslrofdtgfsrgasghrrzthdjxbvdrjoelxzaz qwxnllbxhsuqlllwsirvxyyebuivmufu uthzqrfpqcnjdxtpikquattvczfhfam organ linux work network xmailer nmh exmh xos linux redhat inreplyto messageid glkkqfmailbanirhcom wed aug subject re insert signatur xloop exmhusersexamplecom sender exmhusersadminexamplecom errorsto exmhusersadminexamplecom xbeenther exmhusersexamplecom xmailmanvers preced bulk replyto exmhusersexamplecom listhelp mailtoexmhusersrequestexamplecomsubjecthelp listpost mailtoexmhusersexamplecom listsubscrib httpslistmanexamplecommailmanlistinfoexmhus 
mailtoexmhusersrequestredhatcomsubjectsubscrib listid discuss list exmh user exmhusersexamplecom listunsubscrib httpslistmanexamplecommailmanlistinfoexmhus mailtoexmhusersrequestredhatcomsubjectunsubscrib listarch httpslistmanexamplecommailmanprivateexmhus date thu aug wed aug ulis ponc wrote hi command insert signatur use combin key sent mail insert simpli put nmh compon file compon replcomp forwcomp way get edit messag also use comp file specif folder can alter sig per folder trick see doc nmh detail might must also way get sedit ive use gvim exmh messag editor long time now load command load emailspecif set eg syntax colourhighlight header quot part email possibl map vim key add sig even give select sig choos sort way randomlychosen sig somewher rtfmmitedu ok go rtfmmitedupubusenetbygroupnewsanswerssignaturefingerfaq warn old may regard ulis hope help cheer toni exmhus mail list exmhusersredhatcom httpslistmanredhatcommailmanlistinfoexmhus
## content6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     stewartsmitheeedacuk thu aug returnpath stewartsmitheeedacuk deliveredto zzzzlocalhostnetnoteinccom receiv localhost localhost phoboslabsnetnoteinccom postfix esmtp id ecdc zzzzlocalhost thu aug edt receiv phobo localhost imap fetchmail zzzzlocalhost singledrop thu aug ist receiv ngrpscdyahoocom ngrpscdyahoocom dogmaslashnullorg smtp id gmdcoz zzzzexamplecom thu aug xegroupsreturn senttozzzzexamplecomreturnsgroupsyahoocom receiv ngrpscdyahoocom nnfmp aug xsender stewartsmitheeedacuk xapparentlyto zzzzteanayahoogroupscom receiv egp mail aug receiv qmail invok network aug receiv unknown mgrpscdyahoocom qmqp aug receiv unknown helo postboxeeedacuk mtagrpscdyahoocom smtp aug receiv eeedacuk sxsdunblan postboxeeedacuk esmtp id gmdcni forteanayahoogroupscom thu aug bst messageid deeeeedacuk organ scottish microelectron centr userag mozilla x u suno sunu enus rvb gecko xacceptlanguag en enus zzzzteanayahoogroupscom refer dfealocalhost stewart smith stewartsmitheeedacuk xyahooprofil stochasticus mimevers mailinglist list 
zzzzteanayahoogroupscom contact forteanaowneryahoogroupscom deliveredto mail list zzzzteanayahoogroupscom preced bulk listunsubscrib mailtozzzzteanaunsubscribeyahoogroupscom date thu aug subject re zzzzteana noth like mama use make replyto zzzzteanayahoogroupscom contenttyp textplain charsetusascii contenttransferencod bit ad cream spaghetti carbonara effect pasta make pizza deeppi just jump carbonara one favourit make ask hell suppos use instead cream ive never seen recip hasnt use person use low fat creme fraich becaus work quit nice onli time ive seen suppos authent recip carbonara ident mine cream egg lot fresh parmesan except creme fraich stew stewart smith scottish microelectron centr univers edinburgh httpwwweeedacuksx yahoo group sponsor dvds free sp join now httpusclickyahoocomptybbnxieaamghaagsolbtm unsubscrib group send email forteanaunsubscribeegroupscom use yahoo group subject httpdocsyahoocominfoterm
##          email
## content1   ham
## content2   ham
## content3   ham
## content4   ham
## content5   ham
## content6   ham
# Split data 
set.seed(250)
df_ham_or_spam$text[df_ham_or_spam$text==""] <- "NaN"
train_index <- createDataPartition(df_ham_or_spam$email, p=0.80, list=FALSE)
email_train <- df_ham_or_spam[train_index,]
email_test <- df_ham_or_spam[-train_index,]


#Create corpus for training and test data
train_email_corpus <- Corpus(VectorSource(email_train$text))
test_email_corpus <- Corpus(VectorSource(email_test$text))

train_clean_corpus <- tm_map(train_email_corpus ,removeNumbers)
test_clean_corpus <- tm_map(test_email_corpus,removeNumbers)

train_clean_corpus <- tm_map(train_clean_corpus,removePunctuation)
test_clean_corpus <- tm_map(test_clean_corpus,removePunctuation)

train_clean_corpus <- tm_map(train_clean_corpus,removeWords,stopwords())
test_clean_corpus  <- tm_map(test_clean_corpus,removeWords, stopwords())

train_clean_corpus<- tm_map(train_clean_corpus,stripWhitespace)
test_clean_corpus<- tm_map(test_clean_corpus,stripWhitespace)

train_email_dtm <- DocumentTermMatrix(train_clean_corpus)
test_email_dtm <- DocumentTermMatrix(test_clean_corpus )
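The split above relies on `createDataPartition()`, which samples within each class so that the ham/spam ratio is roughly preserved in both sets. The idea can be sketched in base R as follows; the function name `stratified_split` and the toy labels are our own assumptions, and this is not caret's implementation.

```r
# Hypothetical helper: draw a fraction p of the indices within each class.
stratified_split <- function(labels, p = 0.8) {
  idx_by_class <- split(seq_along(labels), labels)
  train <- unlist(lapply(idx_by_class, function(idx) {
    n_train <- ceiling(p * length(idx))
    idx[sample.int(length(idx), n_train)]
  }), use.names = FALSE)
  sort(train)
}

set.seed(250)
labels <- rep(c("ham", "spam"), times = c(50, 10))  # toy labels
train_idx <- stratified_split(labels)

table(labels[train_idx])    # 40 ham and 8 spam
table(labels[-train_idx])   # 10 ham and 2 spam
```

Both sets keep the 5:1 ham-to-spam ratio of the toy labels, which a plain random split would not guarantee for a small minority class.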

After creating the training and test sets, we are going to convert the term counts in both sets to 0 and 1 (term absent or present) and create the model. The training set contains 80% of the combined ham and spam emails, and the rest goes to the test set.

# Convert each term count into a 0/1 indicator (0 = term absent, 1 = term present)
convert_count <- function(x) {
  y <- ifelse(x > 0, 1,0)
  y <- factor(y, levels=c(0,1), labels=c(0,1))
  y
}


train_sms <- apply(train_email_dtm, 2, convert_count)
test_sms <- apply(test_email_dtm, 2, convert_count)
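As a small illustration of this conversion, the sketch below turns a toy count matrix into presence indicators in base R. The matrix values and term names are invented, and the one-line comparison is a simplification that yields the same 0/1 values without creating factors.

```r
# Toy count matrix: 2 documents x 3 terms (invented values)
counts <- matrix(c(0, 3, 1,
                   0, 0, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"), c("free", "meeting", "offer")))

# Any positive count becomes 1; zero stays 0
presence <- (counts > 0) * 1
presence
```

Here `doc1` becomes 0, 1, 1 and `doc2` becomes 0, 0, 1, so the model only sees whether a term occurs, not how often.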


# NaiveBayes Model:
classifier <- naiveBayes(train_sms, factor(email_train$email))
test_pred <- predict(classifier, newdata=test_sms)
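To show what a Naive Bayes classifier does with 0/1 term indicators, here is a minimal Bernoulli Naive Bayes written in base R. It is a sketch under our own assumptions, not the e1071 implementation: the function names are hypothetical, the toy data are invented, and it applies Laplace smoothing with `alpha = 1`, whereas `naiveBayes()` uses `laplace = 0` by default.

```r
# X: 0/1 matrix (documents x terms); y: vector of class labels
train_bernoulli_nb <- function(X, y, alpha = 1) {
  classes <- sort(unique(y))
  prior <- vapply(classes, function(k) mean(y == k), numeric(1))
  # Smoothed P(term present | class), one row per class
  p_present <- t(vapply(classes, function(k) {
    rows <- X[y == k, , drop = FALSE]
    (colSums(rows) + alpha) / (nrow(rows) + 2 * alpha)
  }, numeric(ncol(X))))
  list(classes = classes, prior = prior, p_present = p_present)
}

predict_bernoulli_nb <- function(model, X) {
  # log P(class) + sum over terms of log P(term present or absent | class)
  log_post <- X %*% t(log(model$p_present)) +
    (1 - X) %*% t(log(1 - model$p_present))
  log_post <- sweep(log_post, 2, log(model$prior), "+")
  model$classes[max.col(log_post, ties.method = "first")]
}

# Toy training data (invented): columns are the terms free, offer, meeting
X <- matrix(c(1, 1, 0,
              1, 0, 0,
              0, 0, 1,
              0, 1, 1), ncol = 3, byrow = TRUE)
y <- c("spam", "spam", "ham", "ham")

model <- train_bernoulli_nb(X, y)
pred <- predict_bernoulli_nb(model, rbind(c(1, 1, 0), c(0, 0, 1)))
pred
# → "spam" "ham"
```

The smoothing keeps a term that never appears in one class from forcing that class's probability to zero, which matters when the vocabulary is as large as ours.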

cm <- table(test_pred, email_test$email)
summary(classifier)
##           Length Class  Mode     
## apriori       2  table  numeric  
## tables    47073  -none- list     
## levels        2  -none- character
## isnumeric 47073  -none- logical  
## call          3  -none- call
confusionMatrix(cm)
## Confusion Matrix and Statistics
## 
##          
## test_pred ham spam
##      ham  510    1
##      spam   0   99
##                                      
##                Accuracy : 0.9984     
##                  95% CI : (0.9909, 1)
##     No Information Rate : 0.8361     
##     P-Value [Acc > NIR] : <2e-16     
##                                      
##                   Kappa : 0.994      
##                                      
##  Mcnemar's Test P-Value : 1          
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 0.9900     
##          Pos Pred Value : 0.9980     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.8361     
##          Detection Rate : 0.8361     
##    Detection Prevalence : 0.8377     
##       Balanced Accuracy : 0.9950     
##                                      
##        'Positive' Class : ham        
## 

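Our model achieved an accuracy of 0.998 (95% CI: 0.9909 to 1), well above the no-information rate of 0.836 (p < 2e-16); the p-value of 1 in the output belongs to McNemar's test, not to the accuracy. With a sensitivity of 1.00, a specificity of 0.99 and a balanced accuracy of 0.995, the model performs well on this test set. Reading the confusion matrix column by column, 510 ham emails were classified as ham and none as spam, while 99 spam emails were classified as spam and 1 spam email was classified as ham.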
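The metrics in the caret output can be recomputed by hand from the four counts in the confusion matrix. The base R sketch below uses the counts reported above, with ham as the positive class as in the caret output; the variable names are our own.

```r
# Confusion matrix from the output: rows = predicted, columns = actual
cm <- matrix(c(510, 0, 1, 99), nrow = 2,
             dimnames = list(predicted = c("ham", "spam"),
                             actual = c("ham", "spam")))

accuracy    <- sum(diag(cm)) / sum(cm)                   # correct / all
sensitivity <- cm["ham", "ham"] / sum(cm[, "ham"])       # ham recognised as ham
specificity <- cm["spam", "spam"] / sum(cm[, "spam"])    # spam recognised as spam
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced), 4)
```

These values reproduce the reported accuracy of 0.9984, sensitivity of 1, specificity of 0.99 and balanced accuracy of 0.995.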

Conclusion

We were able to classify the ham and spam emails loaded from (https://spamassassin.apache.org/old/publiccorpus/). Our most significant challenge was the equipment we used: we needed more computing power to run the Naive Bayes model. This was a very good assignment that added significant knowledge on how to approach a classification problem. On the test set, 99% of the spam emails (99 of 100) were correctly classified.

References

  1. GeeksforGeeks. (2021, July 13). Naive Bayes classifier in R programming. GeeksforGeeks. https://www.geeksforgeeks.org/naive-bayes-classifier-in-r-programming/

  2. Welbers, K., Van Atteveldt, W., & Benoit, K. (n.d.). Text analysis in R. https://eprints.lse.ac.uk/86659/1/Benoit_Text%20analysis%20in%20R_2018.pdf

  3. Apache SpamAssassin. (n.d.). Public corpus. https://spamassassin.apache.org/old/publiccorpus/