Project problem

It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/

Here are two short videos that you may find helpful.

The first video shows how to unzip the provided files.

Solutions:

Packages used

library(tidyverse)
library(tidytext)
library(tm)
library(caret)
library(e1071)
library(wordcloud)
library(R.utils)
library(DT)

Download and Extract files

The spam and ham files are downloaded from spamassassin’s website:

# Download each target file from the given base URL into the working directory
getFiles <- function(files, baseurl) {
  for (i in seq_along(files)) {
    fUrl <- str_c(baseurl, files[i])
    download.file(fUrl, destfile = files[i])
    Sys.sleep(1)  # pause briefly between downloads
  }
}
targetFiles <- c("20050311_spam_2.tar.bz2","20030228_easy_ham_2.tar.bz2")
getFiles(targetFiles,"https://spamassassin.apache.org/old/publiccorpus/")

Extract the downloaded files in two steps: first decompress the .bz2, then unpack the resulting .tar.

for (i in seq_along(targetFiles)) {
  bunzip2(targetFiles[i])
  untar(tarfile = str_replace(targetFiles[i], ".bz2", ""))
}
View all the sub-directories in the current working directory to be sure the files were extracted with their expected contents:
list.dirs(path = ".", full.names = TRUE, recursive = TRUE)
## [1] "."                                                                      
## [2] "./easy_ham_2"                                                           
## [3] "./rsconnect"                                                            
## [4] "./rsconnect/documents"                                                  
## [5] "./rsconnect/documents/Henry-document-classification.Rmd"                
## [6] "./rsconnect/documents/Henry-document-classification.Rmd/rpubs.com"      
## [7] "./rsconnect/documents/Henry-document-classification.Rmd/rpubs.com/rpubs"
## [8] "./spam_2"
length(list.files("./easy_ham_2"))
## [1] 1400
length(list.files("./spam_2"))
## [1] 1396
All the files are now extracted into their separate folders, but I need to be sure that every file in each folder follows the same naming pattern, so that I won't try to parse files that are neither spam nor ham.
hamFiles <- list.files("./easy_ham_2")
spamFiles <- list.files("./spam_2")
head(hamFiles)
## [1] "00001.1a31cc283af0060967a233d26548a6ce"
## [2] "00002.5a587ae61666c5aa097c8e866aedcc59"
## [3] "00003.19be8acd739ad589cd00d8425bac7115"
## [4] "00004.b2ed6c3c62bbdfab7683d60e214d1445"
## [5] "00005.07b9d4aa9e6c596440295a5170111392"
## [6] "00006.654c4ec7c059531accf388a807064363"
head(spamFiles)
## [1] "00001.317e78fa8ee2f54cd4890fdc09ba8176"
## [2] "00002.9438920e9a55591b18e60d1ed37d992b"
## [3] "00003.590eff932f8704d8b0fcbe69d023b54d"
## [4] "00004.bdcc075fa4beb5157b5dd6cd41d8887b"
## [5] "00005.ed0aba4d386c5e62bc737cf3f0ed9589"
## [6] "00006.3ca1f399ccda5d897fecb8c57669a283"
(notHamFile <- hamFiles[str_detect(hamFiles,"^\\d+\\.[:alnum:]") == FALSE])
## character(0)
(notSpamFile <- spamFiles[str_detect(spamFiles,"^\\d+\\.[:alnum:]") == FALSE])
## character(0)
The pattern check returns no stray files in this run, but the extracted folders can contain a cmds file that is neither ham nor spam, so I delete it if it still exists:
if (file.exists("./easy_ham_2/cmds")) file.remove("./easy_ham_2/cmds")
if (file.exists("./spam_2/cmds")) file.remove("./spam_2/cmds")
With the cmds files deleted, I re-list the contents of the folders and check how many files remain.
hamFiles <- list.files("./easy_ham_2")
spamFiles <- list.files("./spam_2")
length(spamFiles)
## [1] 1396

Read all files into a dataframe

hamSpamMails <- tibble()
hamSpamFolders <- c("./spam_2", "./easy_ham_2")
fileTypes <- c("spam", "ham")
allFileNames <- c(spamFiles, hamFiles)
# Read every file in each folder line by line and tag it with its type
for (i in seq_along(hamSpamFolders)) {
  type <- fileTypes[i]
  ff <- tibble(file = dir(hamSpamFolders[i], full.names = TRUE)) %>%
    mutate(text = map(file, read_lines)) %>%
    transmute(id = basename(file), type = type, text) %>%
    unnest(text)
  hamSpamMails <- bind_rows(hamSpamMails, ff)
}
head(hamSpamMails)
## # A tibble: 6 x 3
##   id                        type  text                                     
##   <chr>                     <chr> <chr>                                    
## 1 00001.317e78fa8ee2f54cd4~ spam  From ilug-admin@linux.ie  Tue Aug  6 11:~
## 2 00001.317e78fa8ee2f54cd4~ spam  Return-Path: <ilug-admin@linux.ie>       
## 3 00001.317e78fa8ee2f54cd4~ spam  Delivered-To: yyyy@localhost.netnoteinc.~
## 4 00001.317e78fa8ee2f54cd4~ spam  Received: from localhost (localhost [127~
## 5 00001.317e78fa8ee2f54cd4~ spam  "\tby phobos.labs.netnoteinc.com (Postfi~
## 6 00001.317e78fa8ee2f54cd4~ spam  "\tfor <jm@localhost>; Tue,  6 Aug 2002 ~
allEmails <- hamSpamMails
allEmails <- tibble::rowid_to_column(allEmails, "linenumber")
datatable(head(allEmails), class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#26868d', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))
Group the data frame by file type and check how many messages of each type it contains:
allEmails %>% group_by(type) %>% summarize(messages  = n_distinct(id)) %>% ungroup()
## # A tibble: 2 x 2
##   type  messages
##   <chr>    <int>
## 1 ham       1400
## 2 spam      1396

Data cleaning

Next, create a corpus from the emails, but first strip all non-graphic characters to avoid the errors they can cause when tm_map is called. Then remove characters like colons, hyphens and apostrophes, replacing them with spaces so that the words they separate are not joined into one. Finally, remove punctuation, convert the corpus to lower case, remove all numbers, strip extra whitespace, and remove stop words.
allEmails$text <- str_replace_all(allEmails$text,"[^[:graph:]]", " ")
(allCorpus <- Corpus(VectorSource(allEmails$text)))
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 328911
stripJoiners <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})
cleanMailCorpus <- tm_map(allCorpus, stripJoiners, "-")
cleanMailCorpus <- tm_map(cleanMailCorpus, stripJoiners, ":")
cleanMailCorpus <- tm_map(cleanMailCorpus, stripJoiners, "\\.")
cleanMailCorpus <- tm_map(cleanMailCorpus, stripJoiners, "'")
cleanMailCorpus <- tm_map(cleanMailCorpus, removePunctuation)
cleanMailCorpus <- tm_map(cleanMailCorpus, removeNumbers)
cleanMailCorpus <- tm_map(cleanMailCorpus, tolower)
cleanMailCorpus <- tm_map(cleanMailCorpus, stripWhitespace)
cleanMailCorpus <- tm_map(cleanMailCorpus, removeWords, stopwords("English"))
mailsDf <- data.frame(text = get("content", cleanMailCorpus))
allEmails$text <- as.character(mailsDf$text)
allMailsTokens <- allEmails %>%
   unnest_tokens(output = word, input = text) %>%
   # remove numbers
   filter(!str_detect(word, "^[0-9]*$")) %>%
   # remove stop words
   anti_join(stop_words) %>%
   # stem the words
   mutate(word = SnowballC::wordStem(word))
## Joining, by = "word"
datatable(head(allMailsTokens), class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#26868d', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

Weighting (tf-idf)

Term frequency (tf) simply counts how often a word occurs in a document. An alternative approach is term frequency-inverse document frequency (tf-idf): a term's frequency adjusted for how rarely it is used across documents.
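For reference, this is a minimal hand-rolled sketch of what bind_tf_idf() (used below) computes, assuming a count data frame with hypothetical columns document, term and n; tidytext uses the natural logarithm for the idf part:
manualTfIdf <- function(counts) {
  # total number of distinct documents in the counts table
  totalDocs <- length(unique(counts$document))
  counts %>%
    group_by(document) %>%
    mutate(tf = n / sum(n)) %>%            # term frequency within each document
    ungroup() %>%
    group_by(term) %>%
    mutate(idf = log(totalDocs / n_distinct(document)),  # rarity across documents
           tf_idf = tf * idf) %>%
    ungroup()
}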
cleanMailsTokensTfIdf <- allMailsTokens %>%
   count(type, word) %>%
   bind_tf_idf(term = word, document = type, n = n)
datatable(head(cleanMailsTokensTfIdf), class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#26868d', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

Exploratory analysis

Visualisations: there are some very low-frequency terms in the cleaned data. In some cases, rarely occurring terms can carry important descriptive value in the documents in which they occur.
# Before weighting
hams <- allMailsTokens %>% group_by(type) %>% filter(type == "ham") %>% count(word, sort = TRUE) %>% ungroup()
spams <- allMailsTokens %>% group_by(type) %>% filter(type == "spam") %>% count(word, sort = TRUE) %>% ungroup()
head(hams)
## # A tibble: 6 x 3
##   type  word       n
##   <chr> <chr>  <int>
## 1 ham   org    10661
## 2 ham   net    10536
## 3 ham   id     10348
## 4 ham   list   10014
## 5 ham   receiv  9629
## 6 ham   aug     7922

Bar plots

Before weighting
# For hams

hams %>% 
        top_n(15) %>% 
        mutate(word = reorder(word, n)) %>%
        ggplot(aes(word, n))+
        geom_col(show.legend = FALSE, fill = "steelblue") +
        labs(y= "Ham Word Count", x=NULL)+
        coord_flip()
## Selecting by n

# For spams
spams %>% 
        top_n(15) %>% 
        mutate(word = reorder(word, n)) %>%
        ggplot(aes(word, n))+
        geom_col(show.legend = FALSE, fill = "steelblue") +
        labs(y= "Spam Word Count", x=NULL)+
        coord_flip()
## Selecting by n

After weighting

Sort the tf-idf data frame in descending order and convert word to a factor column:
mailsPlot <- cleanMailsTokensTfIdf %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))))
datatable(head(mailsPlot), class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#26868d', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
# graph the top 10 tokens for both ham and spam
mailsPlot %>%
  filter(type %in% c('ham', 'spam')) %>%
  mutate(type = factor(type, levels = c('ham', 'spam'),
                        labels = c("Ham mails", "Spam mails"))) %>%
  group_by(type) %>%
  top_n(10) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf)) +
  geom_col(fill = "steelblue") +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~type, scales = "free") +
  coord_flip()
## Selecting by tf_idf

Word clouds of the top 50 ham and spam terms before and after tf-idf weighting, to see whether the words that occurred most frequently before weighting remain prominent after it.
# After Weighting
hams50 <- top_n(subset(cleanMailsTokensTfIdf, type == 'ham'), 50)
## Selecting by tf_idf
spams50 <- top_n(subset(cleanMailsTokensTfIdf, type == 'spam'), 50)
## Selecting by tf_idf

For Hams

# Before weighting

wordcloud(words = hams$word, freq = hams$n, min.freq = 1,
          max.words=50, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

# After weighting
wordcloud(words = hams50$word, freq = hams50$n, min.freq = 1,
          max.words=50, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

For Spams

# Before weighting
wordcloud(words = spams$word, freq = spams$n, min.freq = 1,
          max.words=50, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

# After Weighting
wordcloud(words = spams50$word, freq = spams50$n, min.freq = 1,
          max.words=50, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

The visualisations for raw word counts and for tf-idf are clearly different: words that dominate the former barely show up in the latter.

Training, Classification and prediction

The next steps are to:

  1. Randomise the order of the emails and quantify each subset
  2. Generate training and test document-term matrices
  3. Create the training and test labels
  4. Create proportions for the training and test labels
  5. Create a vector of the most frequently occurring words
  6. Filter the document-term matrices down to those words
  7. Train a model on the training dataset
  8. Make predictions on the test dataset
  9. Build a confusion matrix to summarise the prediction statistics
set.seed(12)
# Shuffle the token rows, then build a document-term matrix from them
randomisedSet <- allMailsTokens[sample(nrow(allMailsTokens)),]
randomisedCorpus <- Corpus(VectorSource(randomisedSet$word))
randommisedDTM <- DocumentTermMatrix(randomisedCorpus)
# Split roughly 75% of the rows for training and the rest for testing
trainingDim <- dim(randomisedSet)[1] %/% 4 * 3
txtSet <- dim(randomisedSet)[1]
trainingSet <- randomisedSet[1:trainingDim,]
testingSet <- randomisedSet[(trainingDim+1):txtSet,]
trainingDTM <- randommisedDTM[1:trainingDim,]
testDTM <- randommisedDTM[(trainingDim+1):txtSet,]
trainingLabels <- trainingSet$type
testLabels <- testingSet$type

Sparsity

I tried to use trainingDTM and testDTM directly in training and testing, but it almost sucked the life out of my laptop, so I want to reduce the model complexity by removing sparse terms, that is, tokens which do not appear across many documents. This is similar to using tf-idf weighting, but it directly deletes sparse variables from the document-term matrix, which results in a statistical learning model with a much smaller set of variables. The tm package contains the removeSparseTerms() function for this task: its first argument is a document-term matrix, and its second argument defines the maximal allowed sparsity in the range from 0 to 1. For instance, sparse = .99 would remove any token which is missing from more than 99% of the documents in the corpus (i.e. the token must appear in at least 1% of the documents to be retained). In the code below I take the closely related route of keeping only the terms that pass a minimum-frequency threshold, using findFreqTerms().
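For reference, a minimal sketch of the removeSparseTerms() route described above, applied to the trainingDTM built earlier; this is an illustration rather than part of the pipeline:
# Drop every term that is absent from more than 99% of documents,
# i.e. keep only terms that appear in at least 1% of them
sparseReducedDTM <- removeSparseTerms(trainingDTM, sparse = 0.99)
dim(sparseReducedDTM)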
# Require a term to occur at least as often as 0.5% of the number of documents
threshold <- 0.5
minFreq <- round(randommisedDTM$nrow * (threshold / 100), 0)
frequentWords <- findFreqTerms(x = randommisedDTM, lowfreq = minFreq)
length(testDTM)
## [1] 6
length(frequentWords)
## [1] 15
trainingDTMFreq <- trainingDTM[ , frequentWords]
testDTMFreq <- testDTM[ , frequentWords]

dim(trainingDTMFreq)
## [1] 784068     15
# Convert term counts into categorical "Yes"/"No" features for the classifier
categoriseValues <- function(x) {
  ifelse(x > 0, "Yes", "No")
}
trainingText <- apply(trainingDTMFreq, MARGIN = 2, categoriseValues)
testText <- apply(testDTMFreq, MARGIN = 2, categoriseValues)

Naive Bayes Classifier

This is based on Bayes' rule, frequency analysis of word occurrences, and an independence assumption (the "naive" part). The Naive Bayes classifier assigns a probability that a new sample belongs to one class or the other (spam or ham): from the words contained or not contained in a message, it computes the probability that the message is spam or ham. Naive Bayes is a reasonable choice here because the inputs are categorical (each word is either present in a mail or not).
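As a rough illustration of the decision rule, here is a toy, hand-computed sketch with made-up probabilities and hypothetical word features; it is not what e1071 does internally:
# For word indicators w1..wn, P(class | w1..wn) is proportional to
# P(class) * prod_i P(wi | class)
prior    <- c(spam = 0.5, ham = 0.5)     # class priors (assumed equal)
pFree    <- c(spam = 0.60, ham = 0.05)   # made-up P("free" present | class)
pMeeting <- c(spam = 0.02, ham = 0.30)   # made-up P("meeting" present | class)
# a message that contains "free" but not "meeting"
score <- prior * pFree * (1 - pMeeting)
score / sum(score)                       # normalised posterior: strongly spam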
classifier <- naiveBayes(trainingText, factor(trainingLabels))
testPrediction <- predict(classifier, testText)

Prediction Statistics

confusionMatrix(data = testPrediction, reference = factor(testLabels),
                positive = "spam", dnn = c("Prediction", "Actual"))
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction    ham   spam
##       ham   19657   9767
##       spam  95906 136027
##                                           
##                Accuracy : 0.5957          
##                  95% CI : (0.5938, 0.5976)
##     No Information Rate : 0.5578          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1117          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9330          
##             Specificity : 0.1701          
##          Pos Pred Value : 0.5865          
##          Neg Pred Value : 0.6681          
##              Prevalence : 0.5578          
##          Detection Rate : 0.5205          
##    Detection Prevalence : 0.8874          
##       Balanced Accuracy : 0.5516          
##                                           
##        'Positive' Class : spam            
##