It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One such corpus is the Apache SpamAssassin public corpus used here.
Email spam, also known as junk email, is a type of electronic spam where unsolicited messages are sent by email.
As a result of the huge number of spam emails being sent across the Internet each day, most email providers offer a spam filter that automatically flags likely spam messages and separates them from the ham. Although these filters use a number of techniques, most rely heavily on the analysis of the contents of an email via text analytics.
An inbox flooded with unsolicited emails makes it easy for the account holder to miss an important message, defeating the purpose of having an email address for effective communication. These junk emails, from online marketing campaigns, online fraudsters, and others, are the motivation for this model.
The goal of this project is to build a spam filter that can effectively categorise an incoming mail or text message as either spam or ham. I will use a dataset from the Apache SpamAssassin public corpus.
I picked 20030228_spam_2.tar.bz2 for spam and 20030228_easy_ham_2.tar.bz2 for ham. Each archive contains more than 1,200 email files that need to be processed, cleaned up, converted to a dataframe, and tidied.
library(R.utils)    # bunzip2()
library(tidyverse)  # also attaches dplyr, ggplot2, stringr
library(tidytext)
library(readtext)
library(tm)
library(rpart)
library(rpart.plot)
library(e1071)
library(caret)      # also attaches lattice
library(randomForest)
library(wordcloud)
library(RColorBrewer)
# Library for parallel processing
library(doMC)
registerDoMC(cores=detectCores())
library(knitr)
The first step is to download both archives so we can read the data inside them. I used bunzip2() and untar() to extract the files inside a tryCatch() block.
base_url_spam <- "https://spamassassin.apache.org/old/publiccorpus/20030228_spam_2.tar.bz2"
spam_zip <- "20030228_spam_2.tar.bz2"
spam_tar <- "20030228_spam_2.tar"
base_url_ham <- "https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham_2.tar.bz2"
ham_zip <- "20030228_easy_ham_2.tar.bz2"
ham_tar <- "20030228_easy_ham_2.tar"
if(!file.exists(spam_tar)){
  res_spam <- tryCatch(download.file(base_url_spam,
                                     destfile = spam_zip,
                                     method = "curl"),
                       error = function(e) 1)
  bunzip2(spam_zip)
  untar(spam_tar, exdir = "spam_ham_documents")
}
if(!file.exists(ham_tar)){
  res_ham <- tryCatch(download.file(base_url_ham,
                                    destfile = ham_zip,
                                    method = "curl"),
                      error = function(e) 1)
  bunzip2(ham_zip)
  untar(ham_tar, exdir = "spam_ham_documents")
} else {
  paste("The file already exists!")
}
## [1] "The file already exists!"
After fetching the files, I wrote a function get_content() to pull the content of each file into a list of character vectors.
base_dir <- "/Users/salmaelshahawy/Desktop/MSDS_2019/Fall2019/aquisition_management_607/week_10/spam_ham_documents"
get_content <- function(type) {
  email_content <- NA  # placeholder first element; skipped downstream
  files_path <- paste(base_dir, type, sep = "/")
  files_name <- list.files(files_path)
  for (file in seq_along(files_name)) {
    file_path <- paste(files_path, files_name[file], sep = "/")
    content_per_file <- lapply(file_path, readLines)  # one character vector per file
    email_content <- c(email_content, content_per_file)
  }
  return(email_content)
}
Then I extracted the nested content of each list and flattened it into a vector that can be pushed into a dataframe.
get_nested_content <- function(list_name) {
  nested_value <- NA
  # start at 2 to skip the NA placeholder added by get_content()
  for (value in 2:length(list_name)) {
    value_per_row <- list_name[[value]]
    nested_value <- c(nested_value, value_per_row)
  }
  return(nested_value)
}
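The objects spam_content and ham_content used below are never created in the code shown; presumably they come from chaining the two helpers over each extracted folder. A minimal sketch, assuming the archives unpack into folders named spam_2 and easy_ham_2:
spam_content <- get_nested_content(get_content("spam_2"))    # folder name assumed
ham_content <- get_nested_content(get_content("easy_ham_2")) # folder name assumed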
Note: we took a sample of roughly 10% of each generated dataset.
I also added a class column to each dataframe indicating the document type, either spam or ham. The resulting dataframes consist of observations of 2 variables: the first is the content of the emails per line, and the second is the target variable, the class to be predicted.
spam_df_1 <- as.data.frame(spam_content) %>%
  mutate(class = "spam") %>% # adding a class tag
  na.omit()
spam_df <- spam_df_1[c(2:10000), c(1:2)] ## taking a subset of ~10%
names(spam_df) <- c("text", "class")
spam_df
ham_df_1 <- as.data.frame(ham_content) %>%
  mutate(class = "ham") %>% # adding a class tag
  na.omit()
ham_df <- ham_df_1[c(2:6800), c(1:2)] ## taking a subset of ~10%
names(ham_df) <- c("text", "class")
ham_df
data_df <- rbind(spam_df, ham_df) %>%
  mutate_all(~ gsub("[^[:alnum:][:blank:]+\\s+?&-]", "", .)) # strip stray characters
data_df
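The class counts and proportions below are rendered output whose generating code is not shown; they most likely come from calls like these:
table(data_df$class)
prop.table(table(data_df$class))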
##
## ham spam
## 6799 9999
##
## ham spam
## 0.4047506 0.5952494
Next, convert the resulting dataframe into a corpus.
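The corpus construction code is not shown either; a minimal sketch with tm that matches the inspection output below:
corpus_data <- VCorpus(VectorSource(data_df$text))
inspect(corpus_data[1:3])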
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 29
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 39
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 40
# Standard tm clean-up pipeline
corpus_data = tm_map(corpus_data, content_transformer(stringi::stri_trans_tolower)) # lower-case
corpus_data = tm_map(corpus_data, removeNumbers)
corpus_data = tm_map(corpus_data, removePunctuation)
corpus_data = tm_map(corpus_data, stripWhitespace)
corpus_data = tm_map(corpus_data, removeWords, stopwords("english"))
corpus_data = tm_map(corpus_data, stemDocument) # reduce words to their stems
#as.character(corpus_data[[1]])
In text mining, it is important to get a feel for the words that indicate whether a message is spam or ham. What is the frequency of each of these words? Which word appears most often? To answer these questions, we create a DocumentTermMatrix to hold all these words.
The rows of the DTM correspond to documents in the collection, its columns correspond to terms, and its elements are the term frequencies. I used the built-in function from the tm package to create the DTM.
#I need the data in a one-row-per-document format. That is, a document-term matrix.
dtm <- DocumentTermMatrix(corpus_data)
dtm
## <<DocumentTermMatrix (documents: 16798, terms: 8440)>>
## Non-/sparse entries: 45679/141729441
## Sparsity : 100%
## Maximal term length: 250
## Weighting : term frequency (tf)
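The two dim() printouts below are rendered output without code; the drop from 8440 to 497 terms suggests a removeSparseTerms() step, roughly like this (the sparsity threshold is an assumption):
dim(dtm)                            # 16798 documents x 8440 terms
dtm <- removeSparseTerms(dtm, 0.97) # threshold assumed, not given in the original
dim(dtm)                            # 16798 x 497 after dropping sparse terms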
## [1] 16798 8440
## [1] 16798 497
## <<DocumentTermMatrix (documents: 61, terms: 11)>>
## Non-/sparse entries: 2/669
## Sparsity : 100%
## Maximal term length: 11
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs aligndright alreadi also alway amaz anoth answer anyth appear arg
## 40 0 0 0 0 0 0 0 0 0 0
## 41 0 0 0 0 0 0 0 0 0 0
## 42 0 0 0 0 0 0 0 0 0 0
## 43 0 0 0 0 0 0 0 0 0 0
## 44 0 0 0 0 0 0 0 0 0 0
## 45 0 0 0 0 0 0 0 0 0 0
## 46 0 0 0 0 0 0 0 0 0 0
## 47 0 0 0 0 0 0 0 0 0 0
## 91 0 0 1 0 0 0 0 0 0 0
## 93 0 0 1 0 0 0 0 0 0 0
We want to find the words that appear frequently in the dataset. Given the number of words in the dataset, we keep only words that appear at least 60 times.
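Neither the frequency tail nor the term list below has its generating code shown; a plausible reconstruction (term_freq is a hypothetical name):
term_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE) # count of each term
tail(term_freq, 10)                                           # the least frequent retained terms
findFreqTerms(dtm, lowfreq = 60)                              # terms appearing at least 60 times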
## natur nice outlook potenti requir secret select sun tbodi
## 17 17 17 17 17 17 17 17 17
## unit
## 17
## [1] "address" "advertis"
## [3] "also" "aug"
## [5] "bfont" "bit"
## [7] "bodi" "brbr"
## [9] "bulk" "busi"
## [11] "can" "card"
## [13] "claim" "colord"
## [15] "contenttransferencod" "contenttyp"
## [17] "copi" "credit"
## [19] "date" "day"
## [21] "deliveredto" "dogmaslashnullorg"
## [23] "dont" "edt"
## [25] "email" "errorsto"
## [27] "esmtp" "event"
## [29] "everi" "fetchmail"
## [31] "first" "font"
## [33] "free" "fri"
## [35] "get" "group"
## [37] "help" "host"
## [39] "ilugadminlinuxi" "iluglinuxi"
## [41] "imap" "includ"
## [43] "inform" "internet"
## [45] "invok" "irish"
## [47] "ist" "jmlocalhost"
## [49] "jul" "jun"
## [51] "just" "know"
## [53] "letter" "like"
## [55] "line" "link"
## [57] "linux" "list"
## [59] "listid" "listmasterlinuxi"
## [61] "localhost" "look"
## [63] "lugh" "lughtuathaorg"
## [65] "mail" "maintain"
## [67] "make" "may"
## [69] "messag" "messageid"
## [71] "mimevers" "mon"
## [73] "money" "month"
## [75] "name" "nbsp"
## [77] "need" "new"
## [79] "now" "offer"
## [81] "one" "option"
## [83] "order" "page"
## [85] "peopl" "person"
## [87] "phoboslabsnetnoteinccom" "pleas"
## [89] "postfix" "preced"
## [91] "product" "question"
## [93] "read" "receiv"
## [95] "report" "returnpath"
## [97] "rootlocalhost" "rootlughtuathaorg"
## [99] "sale" "sat"
## [101] "see" "send"
## [103] "sender" "sequenc"
## [105] "servic" "singledrop"
## [107] "site" "size"
## [109] "smtp" "social"
## [111] "socialadminlinuxi" "sociallinuxi"
## [113] "subject" "take"
## [115] "textplain" "thu"
## [117] "time" "tue"
## [119] "unsubscript" "use"
## [121] "user" "version"
## [123] "want" "websit"
## [125] "wed" "widthd"
## [127] "will" "within"
## [129] "work" "xauthenticationwarn"
## [131] "xbeenther" "xmailmanvers"
## [133] "yyyylocalhostnetnoteinccom"
We would like to plot the words that appear most often; the plot below keeps those with more than 100 occurrences.
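The word_freq data frame used by the plotting code is not constructed anywhere in the code shown; a minimal sketch matching its word and freq columns:
freqs <- colSums(as.matrix(dtm))
word_freq <- data.frame(word = names(freqs), freq = freqs) # column names taken from the plot code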
word_freq_bp <- ggplot(subset(word_freq, freq > 100), aes(x=reorder(word, -freq), y =freq)) +
geom_bar(stat = "identity") +
theme(axis.text.x=element_text(angle=45, hjust=1))
word_freq_bp
From the plot it appears that receiv is the most frequent word in our dataset.
The data has been cleaned and is now ready to be combined with the response variable “class” for predictive analytics.
A variant of the multinomial Naive Bayes algorithm known as binarized (Boolean feature) Naive Bayes, described by Dan Jurafsky, replaces the term frequencies with Boolean presence/absence features. The logic behind this is that for sentiment classification, word occurrence matters more than word frequency.
convert_count <- function(x) {
  # a count of 0 becomes "No"; any positive count becomes "Yes"
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
  y
}
# Apply the convert_count function to get final training and testing DTMs
datasetNB <- apply(dtm, 2, convert_count)
dataset = as.data.frame(as.matrix(datasetNB))
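The str() output below implies the class labels were re-attached to the converted dataset; a likely step, not shown in the original:
dataset$class <- as.factor(data_df$class)
str(dataset$class)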
## Factor w/ 2 levels "ham","spam": 2 2 2 2 2 2 2 2 2 2 ...
The usual practice in machine learning is to split the dataset into a training and a test set. The model is built on the training set and evaluated on the test set, which it has not been exposed to before.
To ensure that both samples are true representations of the dataset, we check the class proportions of each split. I used 75% of the data for training and 25% for testing.
set.seed(222)
split = sample(2,nrow(dataset),prob = c(0.75,0.25),replace = TRUE)
train_set = dataset[split == 1,]
test_set = dataset[split == 2,]
prop.table(table(train_set$class))
prop.table(table(test_set$class))
##
## ham spam
## 0.4050743 0.5949257
##
## ham spam
## 0.4037627 0.5962373
We will build our model with 3 different machine learning algorithms, Random Forest, Naive Bayes, and Support Vector Machine, to decide which performs best.
Random forest, like its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction.
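The training code is not shown, but the Call: line in the output records it, so it can be reconstructed directly. Note that train_set passed as x still contains the class column, which leaks the label into the predictors and inflates the training fit:
rf_classifier <- randomForest(x = train_set, y = train_set$class, ntree = 300)
rf_classifier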
##
## Call:
## randomForest(x = train_set, y = train_set$class, ntree = 300)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 22
##
## OOB estimate of error rate: 0.29%
## Confusion matrix:
## ham spam class.error
## ham 5125 0 0.000000000
## spam 37 7490 0.004915637
The rf_classifier classified the training messages almost perfectly, with a class error of 0 for ham and about 0.005 for spam (an OOB error estimate of 0.29%). This near-perfect fit is expected, as the model was exposed to this set of data.
We now evaluate the model on the test_set to see whether it can match this near-perfect accuracy on data it has not seen before.
# Predicting the Test set results
rf_pred = predict(rf_classifier, newdata = test_set)
# Making the Confusion Matrix
confusionMatrix(table(rf_pred,test_set$class))
## Confusion Matrix and Statistics
##
##
## rf_pred ham spam
## ham 1674 6
## spam 0 2466
##
## Accuracy : 0.9986
## 95% CI : (0.9969, 0.9995)
## No Information Rate : 0.5962
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.997
##
## Mcnemar's Test P-Value : 0.04123
##
## Sensitivity : 1.0000
## Specificity : 0.9976
## Pos Pred Value : 0.9964
## Neg Pred Value : 1.0000
## Prevalence : 0.4038
## Detection Rate : 0.4038
## Detection Prevalence : 0.4052
## Balanced Accuracy : 0.9988
##
## 'Positive' Class : ham
##
The Random Forest classifier (rf_classifier) performed well on this dataset, with a test-set accuracy of 0.9986. Still, we should not get too excited, as Random Forests can overfit.
Naive Bayes is a machine learning model based on conditional probability as given by Bayes’ Theorem. It is fast and easy to train.
# e1071::naiveBayes ignores caret-style arguments such as trControl and
# tuneLength, so they are dropped here; laplace = 1 applies add-one smoothing
system.time( classifier_nb <- naiveBayes(train_set, train_set$class, laplace = 1) )
## user system elapsed
## 0.294 0.162 0.457
nb_pred = predict(classifier_nb, type = 'class', newdata = test_set)
confusionMatrix(nb_pred,test_set$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 1674 2
## spam 0 2470
##
## Accuracy : 0.9995
## 95% CI : (0.9983, 0.9999)
## No Information Rate : 0.5962
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.999
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 1.0000
## Specificity : 0.9992
## Pos Pred Value : 0.9988
## Neg Pred Value : 1.0000
## Prevalence : 0.4038
## Detection Rate : 0.4038
## Detection Prevalence : 0.4042
## Balanced Accuracy : 0.9996
##
## 'Positive' Class : ham
##
The Naive Bayes classifier also performed very well, achieving 0.9995 accuracy on the test set, i.e. 2 misclassifications out of 4,146 observations. The model has a 100% sensitivity rate (the proportion of the positive class predicted as positive) and a specificity of about 0.9992 (the proportion of the negative class predicted accurately, i.e. 2470 out of 2472).
The Support Vector Machine is another algorithm; it finds the hyperplane that best separates the two classes to be predicted, ham and spam in this case. SVMs can handle both linear and non-linear classification problems.
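Again the training code is not shown, but the Call: line in the summary below records it:
svm_classifier <- svm(class ~ ., data = train_set) # object name assumed
summary(svm_classifier)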
##
## Call:
## svm(formula = class ~ ., data = train_set)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 10101
The model uses a total of 10101 support vectors to build the classification boundary.
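The confusion matrix below also lacks its generating code; presumably something like:
svm_pred <- predict(svm_classifier, newdata = test_set) # names assumed
confusionMatrix(table(svm_pred, test_set$class))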
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 323 20
## spam 1351 2452
##
## Accuracy : 0.6693
## 95% CI : (0.6548, 0.6836)
## No Information Rate : 0.5962
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2121
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.19295
## Specificity : 0.99191
## Pos Pred Value : 0.94169
## Neg Pred Value : 0.64475
## Prevalence : 0.40376
## Detection Rate : 0.07791
## Detection Prevalence : 0.08273
## Balanced Accuracy : 0.59243
##
## 'Positive' Class : ham
##
The Support Vector Machine performed poorly on this dataset, doing barely better than always guessing the majority class (accuracy 0.6693 against a no-information rate of 0.5962). It classified 1351 ham emails as spam, giving a ham sensitivity of only about 0.19.
The essence of building a spam classifier is for the model to effectively categorise an incoming email as either spam or ham. A model is not doing well if it cannot categorise both classes effectively. While we can expect some errors in our predictions, we also expect our model to do a good job. Random Forest and Naive Bayes performed exceptionally well in this project; Support Vector Machine, however, was not a good choice of classifier for this case.