Document Classifier - Spam or Ham (Non-Spam)

It is often useful to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/old/publiccorpus/
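If you are starting from the SpamAssassin corpus, the archives can be fetched and unpacked directly from R. A minimal sketch, assuming the 20030228_spam.tar.bz2 archive name still matches one of the files listed on that page:

url  <- "https://spamassassin.apache.org/old/publiccorpus/20030228_spam.tar.bz2"
dest <- "20030228_spam.tar.bz2"       # archive name assumed from the corpus page
download.file(url, dest, mode = "wb") # binary mode so the archive is not corrupted
untar(dest, exdir = "spam")           # base R untar() handles .tar.bz2 archives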

# Libraries Used for this Project
library(knitr)         # report rendering
library(R.utils)       # unpacking the compressed corpus archives
library(tm)            # corpus creation and text cleaning
library(wordcloud)     # word cloud plots
library(SnowballC)     # stemming back-end used by tm
library(data.table)    # fast data-frame operations
library(tidyverse)     # loads dplyr, tidyr, stringr, and friends
library(readtext)      # reading the raw email files
library(caTools)       # stratified train/test splitting
library(randomForest)  # random forest model

R Function to Extract the Email Body

# Drop the header block (everything before the first blank line) and return
# the rest of the raw message as a single body string.
email_body <- function(Content){
  message <- str_split(Content, "\n\n") %>% unlist()
  body <- paste(message[2:length(message)], collapse = ' ')
  return(body)
}
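A quick check of the helper on a toy raw message shows the intended behavior: the header block before the first blank line is dropped, and the remaining paragraphs are collapsed into one body string.

email_body("From: a@b.com\nSubject: hi\n\nFirst paragraph.\n\nSecond paragraph.")
## [1] "First paragraph. Second paragraph."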

Importing Spam and Ham (Non-Spam) Emails

Spam Emails

dir="C:/Users/jpsim/Documents/DATA Acquisition and Management/spam/"
filename = list.files(dir)
messContent<-NA

for(i in 1:length(filename)){
  filepath<-paste0(dir,filename[i])  
  Content <-suppressWarnings(warning(readtext(filepath)))
  mess <- email_body(Content)
  mess <- gsub("<.*?>", " ", mess)
  eachmess<- list(paste(mess, collapse="\n"))
  messContent = c(messContent,eachmess)
 }
spam <- data.frame(mess = unlist(messContent), stringsAsFactors = FALSE)
spam$class <- 1
spam_num <- nrow(spam) # Total Number of Spam Emails
print(paste0("The Total Number of Emails in the Spam Data-Set is : ", spam_num))
## [1] "The Total Number of Emails in the Spam Data-Set is : 502"

Ham (Non-Spam) Emails

dir="C:/Users/jpsim/Documents/DATA Acquisition and Management/easy_ham/"
filename = list.files(dir)
messContent<-NA

for(i in 1:length(filename)){
  filepath<-paste0(dir,filename[i])  
  Content <-suppressWarnings(warning(readtext(filepath)))
  mess <- email_body(Content)
  mess <- gsub("<.*?>", " ", mess)
  eachmess<- list(paste(mess, collapse="\n"))
  messContent = c(messContent,eachmess)
  }
ham <- data.frame(mess = unlist(messContent), stringsAsFactors = FALSE)
ham$class <- 0
ham_num <- nrow(ham) # Total Number of Ham Emails
print(paste0("The Total Number of Emails in the Ham Data-Set is : ", ham_num))
## [1] "The Total Number of Emails in the Ham Data-Set is : 2502"

Merge the Spam and Ham (Non-Spam) Data Frames into One Data Frame

merged_data <- rbind(spam, ham)
total_num <- nrow(merged_data) # Total number of Emails
print(paste0("The Total Number of Emails in the Combined Data-Set is : ", total_num))
## [1] "The Total Number of Emails in the Combined Data-Set is : 3004"

Corpus Creation and Pre-processing of the Email Message Text

corpus = VCorpus(VectorSource(merged_data$mess))
corpus = tm_map(corpus, content_transformer(tolower))  # lower-case everything
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords())      # drop common English stop words
corpus = tm_map(corpus, stemDocument)                  # reduce words to their stems
corpus = tm_map(corpus, stripWhitespace)
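A quick spot-check of a single document confirms the transformations took effect; the cleaned text should be lower-case, free of numbers, punctuation, and stop words, and stemmed.

as.character(corpus[[1]])  # inspect the first cleaned document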

Bag of Words Model

bow = DocumentTermMatrix(corpus)

bow = removeSparseTerms(bow, 0.98)  # drop terms absent from more than 98% of documents

dataset = as.data.frame(as.matrix(bow))

dataset$outputType = merged_data$class

# outputType is an ordinary column here, so it also shows up in the
# frequency counts below alongside the real terms
spamDF <- dataset %>% filter(outputType == 1)

hamDF <- dataset %>% filter(outputType == 0)
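It is worth checking how much vocabulary the sparsity filter kept; dim() gives documents by retained terms, and findFreqTerms() lists terms above a chosen total frequency (the 500 cutoff here is an arbitrary example).

dim(bow)                           # documents x terms kept after removeSparseTerms
findFreqTerms(bow, lowfreq = 500)  # terms with a total count of at least 500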

Word Clouds

The following word clouds are built from the most frequently mined words in each email group.

Word Cloud for Spam Email

spam_freq = colSums(spamDF)
spam_freq = sort(spam_freq, decreasing = TRUE)
spam_freq[1:20]
##      email       will       nbsp       free     receiv        can 
##       1141        875        645        643        567        519 
##      money outputType       list        get      pleas      order 
##        505        502        483        475        473        457 
##       name     report      click       make       busi       mail 
##        446        433        412        411        409        406 
##    address        one 
##        395        387
spam_words = names(spam_freq)
wordcloud(spam_words[1:50], spam_freq[1:50])

Word Cloud for Ham (Non-Spam) Email

ham_freq = colSums(hamDF)
ham_freq  =  sort(ham_freq, decreasing = TRUE)
ham_freq[1:20]
##    use    can   will    get   list    one   mail   just   like messag 
##   2114   1523   1408   1402   1400   1336   1218   1211   1171   1096 
##   time   work  peopl  wrote   dont    new   date    now   make  email 
##   1076    990    950    923    901    895    869    816    800    783
ham_words = names(ham_freq)
wordcloud(ham_words[1:50], ham_freq[1:50])

Machine Learning Modeling

Shuffling the Data

shuffle = dataset[sample(1:nrow(dataset)), ]  # note: this runs before set.seed(), so the shuffle itself is not seeded

Splitting the Combined Dataset into Training and Test Sets

set.seed(123)
split = sample.split(shuffle$outputType, SplitRatio = 0.8)
training = subset(shuffle, split == TRUE)
testing = subset(shuffle, split == FALSE)
label_col = ncol(training)  # index of the outputType label column (the last column)
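Since sample.split() stratifies on the label, the spam/ham ratio should be nearly identical in the two subsets, which is easy to verify:

prop.table(table(training$outputType))  # class proportions in the training set
prop.table(table(testing$outputType))   # class proportions in the test set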

Random Forest

  1. Creation of a Random Forest Classifier
classifier = randomForest(x = training[-label_col],
                          y = training$outputType,
                          ntree = 3)  # a very small forest; more trees would be typical
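Because y is numeric here, randomForest() fits a regression forest rather than a classifier; printing the fitted object and its variable importance is a quick sanity check:

print(classifier)  # out-of-bag fit summary for the 3-tree forest
head(sort(importance(classifier)[, 1], decreasing = TRUE), 10)  # top 10 terms by node purity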

Predicting the Data

  1. Predict the Test set results
  2. Creation of a Confusion Matrix
y_predictor = predict(classifier, newdata = testing[-label_col])

# any message with a positive predicted spam score is treated as spam
confusion_matrix <- table(y_predictor > 0, testing$outputType)


confusion_matrix
##        
##           0   1
##   FALSE 498   0
##   TRUE    2 100
  1. Define the Accuracy
validation <- confusion_matrix["TRUE", "1"] + confusion_matrix["FALSE", "0"]  # correct predictions
accuracy_model <- validation / nrow(testing) * 100
print(paste0("The Accuracy of this Predictor Model is : ", accuracy_model, "%"))
## [1] "The Accuracy of this Predictor Model is : 99.6666666666667%"

Conclusion

At first glance, this project can seem overwhelming. Think about how many emails a single person can receive at one email domain daily, and how the spam folder in everyone’s inbox must sort them all.

After much data manipulation, I created a Bag of Words model to collect the most frequent words in the Spam and Ham (Non-Spam) categories. This made it easy to see which words belong to which category: each document is given an ‘outputType’ of 1 if it belongs to the Spam class or 0 for the Ham class. From there it was straightforward to create the word clouds above and to fit the Random Forest model.

On a side note, I thought it was very interesting to see ‘spam’ among the top 50 most frequently used words in a Non-Spam (Ham) email. I guess it doesn’t really surprise me given today’s heightened attention to cyber security.