It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One such corpus is the Apache SpamAssassin public corpus used here.
Email spam, also known as junk email, is a type of electronic spam where unsolicited messages are sent by email.
As a result of the huge number of spam emails being sent across the Internet each day, most email providers offer a spam filter that automatically flags likely spam messages and separates them from the ham. Although these filters use a number of techniques, most rely heavily on the analysis of the contents of an email via text analytics.
An inbox flooded with unsolicited emails makes it easy for the account holder to miss an important message, defeating the purpose of having an email address for effective communication. These junk emails, from online marketing campaigns, online fraudsters, and others, are the motivation for this model.
The goal of this project is to build a spam filter that can effectively categorise an incoming mail or text message as either spam or ham. I will use a dataset from the Apache SpamAssassin public corpus.
I picked 20030228_spam_2.tar.bz2 for spam and 20030228_easy_ham_2.tar.bz2 for ham. Each archive contains more than 1,200 email files that need to be processed, cleaned up, converted to a dataframe, and tidied.
library(R.utils)    # bunzip2()
library(tidyverse)  # also attaches dplyr, ggplot2, stringr
library(tidytext)
library(readtext)
library(tm)
library(rpart)
library(rpart.plot)
library(e1071)
library(caret)      # also attaches lattice
library(randomForest)
library(wordcloud)
library(RColorBrewer)
# Library for parallel processing
library(doMC)
registerDoMC(cores=detectCores())
library(knitr)
The first step is to download both archives so we can read the data inside them. I used bunzip2() and untar() to extract the files inside a tryCatch() block.
base_url_spam <- "https://spamassassin.apache.org/old/publiccorpus/20030228_spam_2.tar.bz2"
spam_zip <- "20030228_spam_2.tar.bz2"
spam_tar <- "20030228_spam_2.tar"
base_url_ham <- "https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham_2.tar.bz2"
ham_zip <- "20030228_easy_ham_2.tar.bz2"
ham_tar <- "20030228_easy_ham_2.tar"
if(!file.exists(spam_tar)){
  res_spam <- tryCatch(download.file(base_url_spam,
                                     destfile = spam_zip,
                                     method = "curl"),
                       error = function(e) 1)
  bunzip2(spam_zip)
  untar(spam_tar, exdir = "spam_ham_documents")
}
if(!file.exists(ham_tar)){
  res_ham <- tryCatch(download.file(base_url_ham,
                                    destfile = ham_zip,
                                    method = "curl"),
                      error = function(e) 1)
  bunzip2(ham_zip)
  untar(ham_tar, exdir = "spam_ham_documents")
} else {
  paste("The file already exists!")
}
## [1] "The file already exists!"
After fetching the files, I wrote a function get_content() to pull the content of each file into a list of character vectors.
base_dir <- "/Users/salmaelshahawy/Desktop/MSDS_2019/Fall2019/aquisition_management_607/week_10/spam_ham_documents"
get_content <- function(type) {
  email_content <- NA  # placeholder first element; skipped downstream
  files_path <- paste(base_dir, type, sep = "/")
  files_name <- list.files(files_path)
  for (file in seq_along(files_name)) {
    file_path <- paste(files_path, files_name[file], sep = "/")
    content_per_file <- lapply(file_path, readLines)  # one character vector per file
    email_content <- c(email_content, content_per_file)
  }
  return(email_content)
}
Then I extracted the nested content of each list and flattened it into a vector that can be pushed into a dataframe.
get_nested_content <- function(list_name) {
  nested_value <- NA
  # start at 2 to skip the NA placeholder added by get_content()
  for (value in 2:length(list_name)) {
    value_per_row <- list_name[[value]]
    nested_value <- c(nested_value, value_per_row)
  }
  return(nested_value)
}
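The objects spam_content and ham_content used below are never created in the code shown; presumably they come from chaining the two helpers over each extracted folder. A minimal sketch, assuming the archives unpack into folders named spam_2 and easy_ham_2:
spam_content <- get_nested_content(get_content("spam_2"))    # folder name assumed
ham_content <- get_nested_content(get_content("easy_ham_2")) # folder name assumed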
Note: we took a sample of roughly 10% of each generated dataset.
I also added a class column to each dataframe indicating the document type, either spam or ham. The resulting dataframes consist of observations of 2 variables: the first is the content of the emails per line, and the second is the target variable, the class to be predicted.
spam_df_1 <- as.data.frame(spam_content) %>%
  mutate(class = "spam") %>% # adding a class tag
  na.omit()
spam_df <- spam_df_1[c(2:10000), c(1:2)] ## taking a subset of ~10%
names(spam_df) <- c("text", "class")
spam_df
ham_df_1 <- as.data.frame(ham_content) %>%
  mutate(class = "ham") %>% # adding a class tag
  na.omit()
ham_df <- ham_df_1[c(2:6800), c(1:2)] ## taking a subset of ~10%
names(ham_df) <- c("text", "class")
ham_df
data_df <- rbind(spam_df, ham_df) %>%
  mutate_all(~ gsub("[^[:alnum:][:blank:]+\\s+?&-]", "", .)) # strip stray characters
data_df
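The class counts and proportions below are rendered output whose generating code is not shown; they most likely come from calls like these:
table(data_df$class)
prop.table(table(data_df$class))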
##
## ham spam
## 6799 9999
##
## ham spam
## 0.4047506 0.5952494
Next, convert the resulting dataframe into a corpus.
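The corpus construction code is not shown either; a minimal sketch with tm that matches the inspection output below:
corpus_data <- VCorpus(VectorSource(data_df$text))
inspect(corpus_data[1:3])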
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 29
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 39
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 40
# Standard tm clean-up pipeline
corpus_data = tm_map(corpus_data, content_transformer(stringi::stri_trans_tolower)) # lower-case
corpus_data = tm_map(corpus_data, removeNumbers)
corpus_data = tm_map(corpus_data, removePunctuation)
corpus_data = tm_map(corpus_data, stripWhitespace)
corpus_data = tm_map(corpus_data, removeWords, stopwords("english"))
corpus_data = tm_map(corpus_data, stemDocument) # reduce words to their stems
#as.character(corpus_data[[1]])
In text mining, it is important to get a feel for the words that indicate whether a message is spam or ham. What is the frequency of each of these words? Which word appears most often? To answer these questions, we create a DocumentTermMatrix to hold all these words.
The rows of the DTM correspond to documents in the collection, its columns correspond to terms, and its elements are the term frequencies. I used the built-in function from the tm package to create the DTM.
#I need the data in a one-row-per-document format. That is, a document-term matrix.
dtm <- DocumentTermMatrix(corpus_data)
dtm
## <<DocumentTermMatrix (documents: 16798, terms: 8440)>>
## Non-/sparse entries: 45679/141729441
## Sparsity : 100%
## Maximal term length: 250
## Weighting : term frequency (tf)
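The two dim() printouts below are rendered output without code; the drop from 8440 to 497 terms suggests a removeSparseTerms() step, roughly like this (the sparsity threshold is an assumption):
dim(dtm)                            # 16798 documents x 8440 terms
dtm <- removeSparseTerms(dtm, 0.97) # threshold assumed, not given in the original
dim(dtm)                            # 16798 x 497 after dropping sparse terms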
## [1] 16798 8440
## [1] 16798 497
## <<DocumentTermMatrix (documents: 61, terms: 11)>>
## Non-/sparse entries: 2/669
## Sparsity : 100%
## Maximal term length: 11
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs aligndright alreadi also alway amaz anoth answer anyth appear arg
## 40 0 0 0 0 0 0 0 0 0 0
## 41 0 0 0 0 0 0 0 0 0 0
## 42 0 0 0 0 0 0 0 0 0 0
## 43 0 0 0 0 0 0 0 0 0 0
## 44 0 0 0 0 0 0 0 0 0 0
## 45 0 0 0 0 0 0 0 0 0 0
## 46 0 0 0 0 0 0 0 0 0 0
## 47 0 0 0 0 0 0 0 0 0 0
## 91 0 0 1 0 0 0 0 0 0 0
## 93 0 0 1 0 0 0 0 0 0 0
We want to find the words that appear frequently in the dataset. Given the number of words in the dataset, we keep only words that appear at least 60 times.
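Neither the frequency tail nor the term list below has its generating code shown; a plausible reconstruction (term_freq is a hypothetical name):
term_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE) # count of each term
tail(term_freq, 10)                                           # the least frequent retained terms
findFreqTerms(dtm, lowfreq = 60)                              # terms appearing at least 60 times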
## natur nice outlook potenti requir secret select sun tbodi
## 17 17 17 17 17 17 17 17 17
## unit
## 17
## [1] "address" "advertis"
## [3] "also" "aug"
## [5] "bfont" "bit"
## [7] "bodi" "brbr"
## [9] "bulk" "busi"
## [11] "can" "card"
## [13] "claim" "colord"
## [15] "contenttransferencod" "contenttyp"
## [17] "copi" "credit"
## [19] "date" "day"
## [21] "deliveredto" "dogmaslashnullorg"
## [23] "dont" "edt"
## [25] "email" "errorsto"
## [27] "esmtp" "event"
## [29] "everi" "fetchmail"
## [31] "first" "font"
## [33] "free" "fri"
## [35] "get" "group"
## [37] "help" "host"
## [39] "ilugadminlinuxi" "iluglinuxi"
## [41] "imap" "includ"
## [43] "inform" "internet"
## [45] "invok" "irish"
## [47] "ist" "jmlocalhost"
## [49] "jul" "jun"
## [51] "just" "know"
## [53] "letter" "like"
## [55] "line" "link"
## [57] "linux" "list"
## [59] "listid" "listmasterlinuxi"
## [61] "localhost" "look"
## [63] "lugh" "lughtuathaorg"
## [65] "mail" "maintain"
## [67] "make" "may"
## [69] "messag" "messageid"
## [71] "mimevers" "mon"
## [73] "money" "month"
## [75] "name" "nbsp"
## [77] "need" "new"
## [79] "now" "offer"
## [81] "one" "option"
## [83] "order" "page"
## [85] "peopl" "person"
## [87] "phoboslabsnetnoteinccom" "pleas"
## [89] "postfix" "preced"
## [91] "product" "question"
## [93] "read" "receiv"
## [95] "report" "returnpath"
## [97] "rootlocalhost" "rootlughtuathaorg"
## [99] "sale" "sat"
## [101] "see" "send"
## [103] "sender" "sequenc"
## [105] "servic" "singledrop"
## [107] "site" "size"
## [109] "smtp" "social"
## [111] "socialadminlinuxi" "sociallinuxi"
## [113] "subject" "take"
## [115] "textplain" "thu"
## [117] "time" "tue"
## [119] "unsubscript" "use"
## [121] "user" "version"
## [123] "want" "websit"
## [125] "wed" "widthd"
## [127] "will" "within"
## [129] "work" "xauthenticationwarn"
## [131] "xbeenther" "xmailmanvers"
## [133] "yyyylocalhostnetnoteinccom"
We would like to plot the words that appear most often; the plot below keeps those with more than 100 occurrences.
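The word_freq data frame used by the plotting code is not constructed anywhere in the code shown; a minimal sketch matching its word and freq columns:
freqs <- colSums(as.matrix(dtm))
word_freq <- data.frame(word = names(freqs), freq = freqs) # column names taken from the plot code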
word_freq_bp <- ggplot(subset(word_freq, freq > 100), aes(x=reorder(word, -freq), y =freq)) +
geom_bar(stat = "identity") +
theme(axis.text.x=element_text(angle=45, hjust=1))
word_freq_bp
From the plot it appears that receiv is the most frequent word in our dataset.
The data has been cleaned and is now ready to be combined with the response variable “class” for predictive analytics.
A variant of the multinomial Naive Bayes algorithm known as binarized (Boolean feature) Naive Bayes, described by Dan Jurafsky, replaces the term frequencies with Boolean presence/absence features. The logic behind this is that for sentiment classification, word occurrence matters more than word frequency.
convert_count <- function(x) {
  # a count of 0 becomes "No"; any positive count becomes "Yes"
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
  y
}
# Apply the convert_count function to get final training and testing DTMs
datasetNB <- apply(dtm, 2, convert_count)
dataset = as.data.frame(as.matrix(datasetNB))
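The str() output below implies the class labels were re-attached to the converted dataset; a likely step, not shown in the original:
dataset$class <- as.factor(data_df$class)
str(dataset$class)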
## Factor w/ 2 levels "ham","spam": 2 2 2 2 2 2 2 2 2 2 ...
The usual practice in machine learning is to split the dataset into a training and a test set. The model is built on the training set and evaluated on the test set, which it has not been exposed to before.
To ensure that both samples are true representations of the dataset, we check the class proportions of each split. I used 75% of the data for training and 25% for testing.
set.seed(222)
split = sample(2,nrow(dataset),prob = c(0.75,0.25),replace = TRUE)
train_set = dataset[split == 1,]
test_set = dataset[split == 2,]
prop.table(table(train_set$class))
prop.table(table(test_set$class))
##
## ham spam
## 0.4050743 0.5949257
##
## ham spam
## 0.4037627 0.5962373
We will build our model with 3 different machine learning algorithms, Random Forest, Naive Bayes, and Support Vector Machine, to decide which performs best.
Random forest, like its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction.
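The training code is not shown, but the Call: line in the output records it, so it can be reconstructed directly. Note that train_set passed as x still contains the class column, which leaks the label into the predictors and inflates the training fit:
rf_classifier <- randomForest(x = train_set, y = train_set$class, ntree = 300)
rf_classifier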
##
## Call:
## randomForest(x = train_set, y = train_set$class, ntree = 300)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 22
##
## OOB estimate of error rate: 0.29%
## Confusion matrix:
## ham spam class.error
## ham 5125 0 0.000000000
## spam 37 7490 0.004915637
The rf_classifier classified the training messages almost perfectly, with a class error of 0 for ham and about 0.005 for spam (an OOB error estimate of 0.29%). This near-perfect fit is expected, as the model was exposed to this set of data.
We now evaluate the model on the test_set to see whether it can match this near-perfect accuracy on data it has not seen before.
# Predicting the Test set results
rf_pred = predict(rf_classifier, newdata = test_set)
# Making the Confusion Matrix
confusionMatrix(table(rf_pred,test_set$class))
## Confusion Matrix and Statistics
##
##
## rf_pred ham spam
## ham 1674 6
## spam 0 2466
##
## Accuracy : 0.9986
## 95% CI : (0.9969, 0.9995)
## No Information Rate : 0.5962
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.997
##
## Mcnemar's Test P-Value : 0.04123
##
## Sensitivity : 1.0000
## Specificity : 0.9976
## Pos Pred Value : 0.9964
## Neg Pred Value : 1.0000
## Prevalence : 0.4038
## Detection Rate : 0.4038
## Detection Prevalence : 0.4052
## Balanced Accuracy : 0.9988
##
## 'Positive' Class : ham
##
The Random Forest classifier (rf_classifier) performed well on this dataset, with a test-set accuracy of 0.9986. Still, we should not get too excited, as Random Forests can overfit.
Naive Bayes is a machine learning model based on conditional probability as given by Bayes’ Theorem. It is fast and easy to train.
# e1071::naiveBayes ignores caret-style arguments such as trControl and
# tuneLength, so they are dropped here; laplace = 1 applies add-one smoothing
system.time( classifier_nb <- naiveBayes(train_set, train_set$class, laplace = 1) )
## user system elapsed
## 0.294 0.162 0.457
nb_pred = predict(classifier_nb, type = 'class', newdata = test_set)
confusionMatrix(nb_pred,test_set$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 1674 2
## spam 0 2470
##
## Accuracy : 0.9995
## 95% CI : (0.9983, 0.9999)
## No Information Rate : 0.5962
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.999
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 1.0000
## Specificity : 0.9992
## Pos Pred Value : 0.9988
## Neg Pred Value : 1.0000
## Prevalence : 0.4038
## Detection Rate : 0.4038
## Detection Prevalence : 0.4042
## Balanced Accuracy : 0.9996
##
## 'Positive' Class : ham
##
The Naive Bayes classifier also performed very well, achieving 0.9995 accuracy on the test set, i.e. 2 misclassifications out of 4,146 observations. The model has a 100% sensitivity rate (the proportion of the positive class predicted as positive) and a specificity of about 0.9992 (the proportion of the negative class predicted accurately, i.e. 2470 out of 2472).
The Support Vector Machine is another algorithm; it finds the hyperplane that best separates the two classes to be predicted, ham and spam in this case. SVMs can handle both linear and non-linear classification problems.
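Again the training code is not shown, but the Call: line in the summary below records it:
svm_classifier <- svm(class ~ ., data = train_set) # object name assumed
summary(svm_classifier)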
##
## Call:
## svm(formula = class ~ ., data = train_set)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 10101
The model uses a total of 10101 support vectors to build the classification boundary.
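The confusion matrix below also lacks its generating code; presumably something like:
svm_pred <- predict(svm_classifier, newdata = test_set) # names assumed
confusionMatrix(table(svm_pred, test_set$class))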
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 323 20
## spam 1351 2452
##
## Accuracy : 0.6693
## 95% CI : (0.6548, 0.6836)
## No Information Rate : 0.5962
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2121
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.19295
## Specificity : 0.99191
## Pos Pred Value : 0.94169
## Neg Pred Value : 0.64475
## Prevalence : 0.40376
## Detection Rate : 0.07791
## Detection Prevalence : 0.08273
## Balanced Accuracy : 0.59243
##
## 'Positive' Class : ham
##
The Support Vector Machine performed poorly on this dataset, doing barely better than always guessing the majority class (accuracy 0.6693 against a no-information rate of 0.5962). It classified 1351 ham emails as spam, giving a ham sensitivity of only about 0.19.
The essence of building a spam classifier is for the model to effectively categorise an incoming email as either spam or ham. A model is not doing well if it cannot categorise both classes effectively. While we can expect some errors in our predictions, we also expect our model to do a good job. Random Forest and Naive Bayes performed exceptionally well in this project; Support Vector Machine, however, was not a good choice of classifier for this case.