Introduction

As a group, we worked with two collections of documents, one containing spam and one containing ham, to predict whether a given document is spam. Using our ‘training’ documents, we were able to classify the “test” documents. We communicated via Zoom meetings and collaborated through GitHub.

For this project, we started with a spam/ham dataset, then predicted the class of new documents (either withheld from the training dataset or drawn from another source, such as one's own spam folder). We were provided with the corpus (https://spamassassin.apache.org/old/publiccorpus/) and instructions on how to download the ham and spam files.
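As a quick illustration, the archives can also be pulled and unpacked directly from R. The sketch below is hypothetical: the archive name is one of the files listed on the corpus page, and the destination matches the folders used later in this project.

download.file("https://spamassassin.apache.org/old/publiccorpus/20030228_spam.tar.bz2",
              destfile = "spam.tar.bz2", mode = "wb")  # mode = "wb" for a binary file on Windows
untar("spam.tar.bz2", exdir = "C:/data/607project4/")  # untar() handles .tar.bz2 archives directly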

Loading Libraries

We started by loading our libraries: tidyverse, tm, magrittr, qdap, data.table, e1071, and caret. The first step was to store the paths to the corpus data in the objects ‘spam_folder’ and ‘ham_folder’. To get a better sense of the data, we then called ‘list.files’ on ‘spam_folder’, which produces a character vector of the names of the files in that directory, and read that information into a data frame with the column names we specified. We also used the ‘lapply’ function, which performs an operation on each element of a list, vector, or data frame and returns a list of the same length as its input.
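For example, a one-line illustration of ‘lapply’ on a toy character vector:

lapply(c("spam", "ham"), toupper)  # returns a list of length 2
## [[1]]
## [1] "SPAM"
## 
## [[2]]
## [1] "HAM"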

library(tidyverse)
library(tm)
## Warning: package 'tm' was built under R version 4.0.3
## Warning: package 'NLP' was built under R version 4.0.3
library(magrittr)
library(qdap)
## Warning: package 'qdap' was built under R version 4.0.3
## Warning: package 'qdapDictionaries' was built under R version 4.0.3
## Warning: package 'qdapRegex' was built under R version 4.0.3
## Warning: package 'qdapTools' was built under R version 4.0.3
library(data.table)
library(e1071) 
## Warning: package 'e1071' was built under R version 4.0.3
library(caret)
## Warning: package 'caret' was built under R version 4.0.3

Loading the Data

We used two folders for our project - spam and easy_ham_2.

spam_folder <- 'C:/data/607project4/spam/'

ham_folder <- 'C:/data/607project4/easy_ham_2/'

length(list.files(path = spam_folder))
## [1] 500
spam_files <- list.files(path = spam_folder, full.names = TRUE)
ham_files <- list.files(path = ham_folder, full.names = TRUE)

spam <- list.files(path = spam_folder) %>%
  as.data.frame() %>%                               # one row per file name
  set_colnames("file") %>%
  mutate(text = lapply(spam_files, read_lines)) %>% # read each message as a vector of lines
  unnest(c(text)) %>%                               # one row per line of each message
  mutate(class = "spam",
         spam = 1) %>%
  group_by(file) %>%
  mutate(text = paste(text, collapse = " ")) %>%    # collapse each message back into one string
  ungroup() %>%
  distinct()                                        # keep a single row per message

ham <- list.files(path = ham_folder) %>%            # same pipeline as above, for the ham files
  as.data.frame() %>%
  set_colnames("file") %>%
  mutate(text = lapply(ham_files, read_lines)) %>%
  unnest(c(text)) %>%
  mutate(class = "ham",
         spam = 0) %>%
  group_by(file) %>%
  mutate(text = paste(text, collapse = " ")) %>%
  ungroup() %>%
  distinct()

Tidying Data

This was perhaps the section we found most challenging. The next step in our process was to combine the contents of ‘ham’ and ‘spam’ with the rbind function which, as we knew from previous assignments, can combine several vectors, matrices, and/or data frames by rows. Keeping in mind that select() keeps only the variables mentioned, we used it to retain the columns class, spam, file, and text. Another handy function for tidying the data was str_replace, which we used to strip whitespace characters such as carriage returns, newlines, and tabs from the text of ‘ham_spam’. We also found the content_transformer function helpful: it wraps an ordinary function that modifies the content of an R object so that tm_map can apply it across a corpus. In this case we used it to replace punctuation with a space.
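A quick toy illustration of that cleanup (the string is made up). Note that str_replace replaces only the first matching run of characters; str_replace_all would replace every occurrence.

str_replace("Subject: hi\r\n\tbody text", "[\\r\\n\\t]+", "")
## [1] "Subject: hibody text"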

During this project we also learned to work with structures such as a document-term matrix (or term-document matrix), a mathematical matrix that describes the frequency of terms occurring in a collection of documents. In a document-term matrix, each row represents one document, each column represents one term (word), and each value (typically) contains the number of appearances of that term in that document. We also experimented with the removeSparseTerms function, which drops terms that appear in too few documents. In this instance we removed words that did not appear in at least 10 documents.
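To make the sparsity threshold concrete, here is a minimal sketch on a made-up three-document corpus: with sparse = 1 - (2/length(toy)), a term must appear in at least 2 of the 3 documents to survive, so only “offer” is kept.

toy <- Corpus(VectorSource(c("cheap offer today",
                             "offer meeting notes",
                             "last offer expires")))
toy_dtm <- DocumentTermMatrix(toy)
dim(removeSparseTerms(toy_dtm, 1 - (2/length(toy))))  # 3 documents, 1 surviving term ("offer")
## [1] 3 1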

ham_spam <- rbind(ham, spam) %>%
  select(class, spam,file, text)

ham_spam$text <- ham_spam$text %>%
  str_replace(.,"[\\r\\n\\t]+", "")

replacePunctuation <- content_transformer(function(x) {return (gsub("[[:punct:]]", " ", x))})



corpus <- Corpus(VectorSource(ham_spam$text)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(replacePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(stripWhitespace)
## Warning in tm_map.SimpleCorpus(., content_transformer(tolower)): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(., removeWords, stopwords("english")):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(., replacePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops documents
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 1 - (10/length(corpus))) # keep terms that appear in at least 10 documents

inspect(dtm)
## <<DocumentTermMatrix (documents: 1900, terms: 3816)>>
## Non-/sparse entries: 282538/6967862
## Sparsity           : 96%
## Maximal term length: 19
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   aug com font http linux list localhost net org received
##   1317   0   4    1  242    17   16        13 254   5       10
##   1345  12  56    0   38    71   10         6   0   9        5
##   1380  11  18    0    5     5   18         8   0   3        9
##   1637   0  36  149   22     0    1         5   1   8        6
##   1645   2   4    0    7     0    1         5  17  12        5
##   1682   0   5  542    1     0    2         5   1   3        4
##   1805   0   6  542    1     0    3         5   2   3        4
##   1835   0  13  126    6     0    3         5   1   6        3
##   1876   0   8  542    1     0    2         5   4   6        6
##   693    0  24    0   10     0    6         7  13   7        5
dim(dtm)
## [1] 1900 3816

Creating Training and Test Data
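To prepare for modeling, we converted the document-term matrix into a data frame, attached the class labels, and then randomly split the rows 80/20 into training and test sets, setting a seed so the split is reproducible.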

email_dtm <- dtm %>%
  as.matrix() %>%
  as.data.frame() %>%
  sapply(., as.numeric) %>%          # ensure every term column is numeric
  as.data.frame() %>%
  mutate(class = ham_spam$class) %>% # attach the ham/spam label
  select(class, everything())        # put the label in the first column



email_dtm$class <- as.factor(email_dtm$class)

#Training & Test set
sample_size <- floor(0.8 * nrow(email_dtm))

set.seed(1564)
index <- sample(seq_len(nrow(email_dtm)), size = sample_size)

dtm_train <- email_dtm[index, ]
dtm_test <-  email_dtm[-index, ]

#Training & Test Spam Count

train_labels <- dtm_train$class
test_labels <- dtm_test$class
#Proportion for training & test Spam
prop.table(table(train_labels))
## train_labels
##       ham      spam 
## 0.7434211 0.2565789
prop.table(table(test_labels))
## test_labels
##       ham      spam 
## 0.7105263 0.2894737

Model Training
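Since naiveBayes() from e1071 builds conditional probability tables for categorical predictors, we first converted each term count into a simple presence/absence indicator (“Yes”/“No”). We then trained the classifier on the training terms and evaluated its predictions on the held-out test set with confusionMatrix() from caret.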

# Convert every term column (column 1 is the class label) to a presence/absence indicator
dtm_train[ , 2:ncol(dtm_train)] <- ifelse(dtm_train[ , 2:ncol(dtm_train)] == 0, "No", "Yes")
dtm_test[ , 2:ncol(dtm_test)] <- ifelse(dtm_test[ , 2:ncol(dtm_test)] == 0, "No", "Yes")

# Train on the term columns only, excluding the class label itself from the predictors
model_classifier <- naiveBayes(dtm_train[ , -1], train_labels)

test_pred <- predict(model_classifier, dtm_test[ , -1])

confusionMatrix(test_pred, test_labels, positive = "spam", 
                dnn = c("Prediction","Actual"))
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction ham spam
##       ham  269    4
##       spam   1  106
##                                           
##                Accuracy : 0.9868          
##                  95% CI : (0.9696, 0.9957)
##     No Information Rate : 0.7105          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9678          
##                                           
##  Mcnemar's Test P-Value : 0.3711          
##                                           
##             Sensitivity : 0.9636          
##             Specificity : 0.9963          
##          Pos Pred Value : 0.9907          
##          Neg Pred Value : 0.9853          
##              Prevalence : 0.2895          
##          Detection Rate : 0.2789          
##    Detection Prevalence : 0.2816          
##       Balanced Accuracy : 0.9800          
##                                           
##        'Positive' Class : spam            
## 

Conclusion

As tested using the Naive Bayes model from the e1071 package, we were able to accurately classify roughly 99% of the emails into the proper categories. The model also achieved a sensitivity of about 96%, meaning that 96% of the spam emails were classified correctly, and a specificity of about 99.6%, meaning that nearly all of the ham emails were classified correctly.