In this document, I detail the creation and use of a Naive Bayes classifier to identify spam and ham emails. The spam and ham datasets can be downloaded from https://spamassassin.apache.org/old/publiccorpus/.
The two files I downloaded are easy_ham (which contains 2,551 email files) and spam (which contains 501 email files). I will download the files, detail the cleaning process, and create the corpus and document-term matrix. Once the files are in the correct format, I will build and test a Naive Bayes classifier to detect spam and ham emails.
library(tidyverse)
library(stringr)
library(knitr)
library(tidytext)
library(tidyr)
library(magrittr)
library(quanteda)
library(tm)
library(e1071)
library(wordcloud)
Due to the large number of files, I download them directly from the website, save them to my local drive, unzip them, and assign the extracted folder locations to the objects ham_loc and spam_loc.
# load spam and ham dataset into object
easyham_url <- "https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2"
spam_url <- "https://spamassassin.apache.org/old/publiccorpus/20030228_spam.tar.bz2"
# download data into local drive
download.file(easyham_url, destfile = "ham")
download.file(spam_url, destfile = "spam")
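The archives then have to be unpacked and the objects ham_loc and spam_loc pointed at the extracted message folders. A minimal sketch of that step is below, assuming the downloaded archives sit in the working directory and are extracted into a folder named emails (the folder name and extraction target are my own choices, not part of the original code).
# unpack the downloaded archives (untar handles .tar.bz2 directly);
# extracting into a separate folder avoids clashing with the archive file names
untar("ham", exdir = "emails")
untar("spam", exdir = "emails")
# the archives contain the folders easy_ham and spam
ham_loc <- "emails/easy_ham"
spam_loc <- "emails/spam"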
In the code below, I extract the name of every file in the two datasets and store them in the objects filenames_ham and filenames_spam. This will later allow me to use the readLines function to read each message contained in ham_loc and spam_loc.
# read file names from Spam and Ham dataset
filenames_ham <- list.files(ham_loc, pattern = "*.*", full.names = TRUE)
filenames_spam <- list.files(spam_loc, pattern = "*.*", full.names = TRUE)
Prior to creating my dataframes, I created a function that removes HTML tags from a string.
# create function that removes HTML tags
htmlclean <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}
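A quick check on a made-up string shows what the function does:
# tags are stripped, the text in between is kept
htmlclean("<html><body><p>Buy now!</p></body></html>")
## [1] "Buy now!"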
In the code below, along with creating the spam and ham dataframes, I perform extensive cleaning and transformation. The text within the emails contains headers, sender lists, received addresses, and other metadata that need to be removed. The steps are shown below:
Add column names: We want to make sure that once the messages are extracted, both the column holding the file name and the column holding the text are named accordingly.
Extract email content: Using lapply along with the readLines function, we can extract the email content of each file.
Remove HTML strings: We then apply the htmlclean function created earlier to remove HTML tags - content we are not interested in.
Unnest and paste text column: unnest and paste are applied to expand the text by line and then combine the strings in the text column into a single string per file.
Add label column: This column is added separately to the ham and spam dataframes. Depending on which dataframe I am working on, I add a label of no (this is not spam) or yes (this is spam).
# build the ham dataframe: read each file, strip HTML, collapse lines, label as not spam
ham <- filenames_ham %>%
  as.data.frame() %>%
  set_names("file") %>%
  mutate(text = lapply(filenames_ham, readLines)) %>%   # read every message line by line
  mutate(text = lapply(text, htmlclean)) %>%            # strip HTML tags
  unnest(c(text)) %>%                                   # one row per line of text
  mutate(spam = "no") %>%                               # label: not spam
  group_by(file) %>%
  mutate(text = paste(text, collapse = " ")) %>%        # collapse lines into one string per file
  ungroup() %>%
  distinct()
# build the spam dataframe with the same steps, labelled as spam
spam <- filenames_spam %>%
  as.data.frame() %>%
  set_names("file") %>%
  mutate(text = lapply(filenames_spam, readLines)) %>%
  mutate(text = lapply(text, htmlclean)) %>%
  unnest(c(text)) %>%
  mutate(spam = "yes") %>%                              # label: spam
  group_by(file) %>%
  mutate(text = paste(text, collapse = " ")) %>%
  ungroup() %>%
  distinct()
Once the dataframes were cleaned and ready, I saved them locally as CSV files; below I read them back in, use the rbind function to join them into a single dataframe, and shuffle the rows.
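A sketch of the save step is below; the file names and the use of the working directory are my own assumptions, while the read.csv calls that follow use the absolute paths where I stored the files.
# save the cleaned dataframes so the slow download and cleaning steps
# do not have to be rerun every time (paths are illustrative)
write.csv(ham, "ham.csv", row.names = FALSE)
write.csv(spam, "spam.csv", row.names = FALSE)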
spam = read.csv("C:\\Users\\urios\\OneDrive\\Documents\\spam.csv")
ham = read.csv("C:\\Users\\urios\\OneDrive\\Documents\\ham.csv")
combined <- rbind(spam,ham)
set.seed(123)
# shuffle the dataframe by rows
combined= combined[sample(1:nrow(combined)), ]
To get a quick sense of the most frequent words in each class, I generate a word cloud for the spam messages and another for the ham messages.
spamword <- subset(combined, spam == "yes")
wordcloud(spamword$text, max.words = 100, scale = c(5, 0.3), random.order = FALSE, rot.per = 0.15, colors = brewer.pal(8, "Dark2"))
hamword <- subset(combined, spam == "no")
wordcloud(hamword$text, max.words = 100, scale = c(5, 0.3), random.order = FALSE, rot.per = 0.15, colors = brewer.pal(8, "Dark2"))
combined$spam <- factor(combined$spam)
table(combined$spam)
##
## no yes
## 2551 501
# build a corpus from the combined text and clean it
email_corpus <- Corpus(VectorSource(combined$text))
corpus_clean <- tm_map(email_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
# create the document-term matrix from the cleaned corpus
email_dtm <- DocumentTermMatrix(corpus_clean)
email_dtm
## <<DocumentTermMatrix (documents: 3052, terms: 57531)>>
## Non-/sparse entries: 455550/175129062
## Sparsity : 100%
## Maximal term length: 298
## Weighting : term frequency (tf)
# split the raw data, document-term matrix and corpus into training and testing sets
email_raw_train <- combined[1:2000, ]
email_raw_test <- combined[2000:3052, ]
email_dtm_train <- email_dtm[1:2000, ]
email_dtm_test <- email_dtm[2000:3052, ]
email_corpus_train <- corpus_clean[1:2000]
email_corpus_test <- corpus_clean[2000:3052]
prop.table(table(email_raw_train$spam))
##
## no yes
## 0.8345 0.1655
prop.table(table(email_raw_test$spam))
##
## no yes
## 0.8385565 0.1614435
# keep only terms that appear in at least five training documents
email_dict <- findFreqTerms(email_dtm_train, 5)
#email_dict <- Dictionary(findFreqTerms(email_dtm_train, 5))
email_train <- DocumentTermMatrix(email_corpus_train, list(dictionary = email_dict))
email_test <- DocumentTermMatrix(email_corpus_test, list(dictionary = email_dict))
# convert word counts into a binary Yes/No factor (word present or not)
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
}
email_train <- apply(email_train, MARGIN = 2, convert_counts)
email_test <- apply(email_test, MARGIN = 2, convert_counts)
email_classifier <- naiveBayes(email_train, email_raw_train$spam)
#email_classifier
Next, I evaluate the predictions against the actual labels using a cross table from the gmodels package.
email_test_pred <- predict(email_classifier, email_test)
library(gmodels)
CrossTable(email_test_pred, email_raw_test$spam,
prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1053
##
##
## | actual
## predicted | no | yes | Row Total |
## -------------|-----------|-----------|-----------|
## no | 875 | 4 | 879 |
## | 0.991 | 0.024 | |
## -------------|-----------|-----------|-----------|
## yes | 8 | 166 | 174 |
## | 0.009 | 0.976 | |
## -------------|-----------|-----------|-----------|
## Column Total | 883 | 170 | 1053 |
## | 0.839 | 0.161 | |
## -------------|-----------|-----------|-----------|
##
##
As shown in the table, only 12 of the 1,053 test messages were classified incorrectly, meaning the algorithm labeled the testing set as spam or ham with approximately 98.86% accuracy. That is impressive. To improve the model, one might adjust the Laplace value, collect more email data, or try a different random split of the dataset into training and testing sets.
I suspect the accuracy would increase as the dataset gets bigger: the more data there is to train the algorithm, the more effective it should be at predicting spam or ham.
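The accuracy figure can be checked directly from the predictions; a minimal sketch using the objects defined above:
# proportion of test messages classified correctly
# ((875 + 166) / 1053, roughly 0.9886, given the table above)
mean(email_test_pred == email_raw_test$spam)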
# retrain the classifier with Laplace smoothing of 1
email_classifier2 <- naiveBayes(email_train, email_raw_train$spam, laplace = 1)
email_test_pred2 <- predict(email_classifier2, email_test)
CrossTable(email_test_pred2, email_raw_test$spam,
prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1053
##
##
## | actual
## predicted | no | yes | Row Total |
## -------------|-----------|-----------|-----------|
## no | 857 | 3 | 860 |
## | 0.971 | 0.018 | |
## -------------|-----------|-----------|-----------|
## yes | 26 | 167 | 193 |
## | 0.029 | 0.982 | |
## -------------|-----------|-----------|-----------|
## Column Total | 883 | 170 | 1053 |
## | 0.839 | 0.161 | |
## -------------|-----------|-----------|-----------|
##
##
As shown in the table, 29 of the 1,053 test messages were classified incorrectly this time, so the model with Laplace smoothing labeled the testing set with approximately 97.2% accuracy, slightly worse than the unsmoothed model. To improve the model further, one might tune the Laplace value, collect more email data, or try a different random split of the dataset into training and testing sets.
I suspect the accuracy would increase as the dataset gets bigger: the more data there is to train the algorithm, the more effective it should be at predicting spam or ham.
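For a side-by-side view, the test-set accuracy of the two models can be compared directly; accuracy below is a small helper I define for illustration, not part of the original code.
# helper: proportion of predictions that match the actual labels
accuracy <- function(pred, actual) mean(pred == actual)
accuracy(email_test_pred, email_raw_test$spam)    # no smoothing: roughly 0.989
accuracy(email_test_pred2, email_raw_test$spam)   # laplace = 1: roughly 0.972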