It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/old/publiccorpus/
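The corpus can be pulled straight from that page. Below is a minimal sketch for downloading and unpacking two of the archives; the exact archive names (e.g. 20021010_spam.tar.bz2) are assumptions, so check the index page for the files actually available.
# Download and unpack the public corpus into the working directory (a sketch;
# the archive names below are assumptions -- see the index page for the real ones)
base_url <- "https://spamassassin.apache.org/old/publiccorpus/"
archives <- c("20021010_spam.tar.bz2", "20021010_easy_ham.tar.bz2")
for (archive in archives) {
  download.file(paste0(base_url, archive), archive, mode = "wb")
  untar(archive)   # should create ./spam and ./easy_ham folders
}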
# Libraries Used for this Project
library(knitr)
library(R.utils)
library(tm)
library(wordcloud)
library(SnowballC)
library(data.table)
library(tidyverse)
library(tidyr)
library(dplyr)
library(stringr)
library(stats)
library(readtext)
library(caTools)
library(randomForest)
# Extract the message body: everything after the first blank line,
# which separates the e-mail headers from the content
email_body <- function(Content){
  message <- str_split(Content, "\n\n") %>% unlist()
  body <- paste(message[2:length(message)], collapse = ' ')
  return(body)
}
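For example, on a small made-up message where a blank line separates the headers from the body:
raw <- "Subject: hello\nFrom: someone@example.com\n\nThis is the body."
email_body(raw)
## [1] "This is the body."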
dir="C:/Users/jpsim/Documents/DATA Acquisition and Management/spam/"
filename = list.files(dir)
messContent<-NA
for(i in 1:length(filename)){
filepath<-paste0(dir,filename[i])
Content <-suppressWarnings(warning(readtext(filepath)))
mess <- email_body(Content)
mess <- gsub("<.*?>", " ", mess)
eachmess<- list(paste(mess, collapse="\n"))
messContent = c(messContent,eachmess)
}
spam <- as.data.frame(unlist(messContent), stringsAsFactors = FALSE)
spam$class <- 1    # 1 = spam
colnames(spam) <- c("mess", "class")
spam_num <- nrow(spam) # Total Number of Spam Emails
print(paste0("The Total Number of Emails in the Spam Data-Set is : ", spam_num))
## [1] "The Total Number of Emails in the Spam Data-Set is : 502"
dir="C:/Users/jpsim/Documents/DATA Acquisition and Management/easy_ham/"
filename = list.files(dir)
messContent<-NA
for(i in 1:length(filename)){
filepath<-paste0(dir,filename[i])
Content <-suppressWarnings(warning(readtext(filepath)))
mess <- email_body(Content)
mess <- gsub("<.*?>", " ", mess)
eachmess<- list(paste(mess, collapse="\n"))
messContent = c(messContent,eachmess)
}
ham <- as.data.frame(unlist(messContent), stringsAsFactors = FALSE)
ham$class <- 0    # 0 = ham
colnames(ham) <- c("mess", "class")
ham_num <- nrow(ham) # Total Number of Ham Emails
print(paste0("The Total Number of Emails in the Ham Data-Set is : ", ham_num))
## [1] "The Total Number of Emails in the Ham Data-Set is : 2502"
merge_data <- rbind(spam, ham)
total_num <- nrow(merge_data) # Total Number of Emails
print(paste0("The Total Number of Emails in the Combined Data-Set is : ", total_num))
## [1] "The Total Number of Emails in the Combined Data-Set is : 3004"
# Build a corpus and normalize the text: lowercase, drop numbers, punctuation
# and stop words, stem each word, and collapse extra whitespace
corpus = VCorpus(VectorSource(merge_data$mess))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords())
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)
# Bag-of-words model: one column per term, keeping only terms that
# appear in at least 2% of the documents
bog = DocumentTermMatrix(corpus)
bog = removeSparseTerms(bog, 0.98)
dataset = as.data.frame(as.matrix(bog))
dataset$outputType = merge_data$class
spamDF <- dataset %>% filter(outputType == 1)
hamDF  <- dataset %>% filter(outputType == 0)
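A quick way to sanity-check the bag of words before modeling (the exact numbers will depend on your copy of the corpus):
dim(bog)                            # documents x retained terms
findFreqTerms(bog, lowfreq = 500)   # terms occurring at least 500 times overall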
The following word clouds are built from the most frequently mined words in each email group, respectively.
spam_freq = colSums(spamDF)
spam_freq = sort(spam_freq, decreasing = TRUE)
spam_freq[1:20]
## email will nbsp free receiv can
## 1141 875 645 643 567 519
## money outputType list get pleas order
## 505 502 483 475 473 457
## name report click make busi mail
## 446 433 412 411 409 406
## address one
## 395 387
spam_words = names(spam_freq)
wordcloud(spam_words[1:50], spam_freq[1:50])
ham_freq = colSums(hamDF)
ham_freq = sort(ham_freq, decreasing = TRUE)
ham_freq[1:20]
## use can will get list one mail just like messag
## 2114 1523 1408 1402 1400 1336 1218 1211 1171 1096
## time work peopl wrote dont new date now make email
## 1076 990 950 923 901 895 869 816 800 783
ham_words = names(ham_freq)
wordcloud(ham_words[1:50], ham_freq[1:50])
# Shuffle the rows, then hold out 20% of the data for testing
shuffle = dataset[sample(1:nrow(dataset)),]
set.seed(123)
split = sample.split(shuffle$outputType, SplitRatio = 0.8)
training = subset(shuffle, split == TRUE)
testing = subset(shuffle, split == FALSE)
# Drop the outputType (label) column from the predictors before fitting
label_col = which(names(training) == "outputType")
classifier = randomForest(x = training[-label_col],
                          y = training$outputType,
                          ntree = 3)
y_predictor = predict(classifier, newdata = testing[-label_col])
confusion_matrix <- table(y_predictor>0,testing$outputType)
confusion_matrix
##
## 0 1
## FALSE 498 0
## TRUE 2 100
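The same table can also be read for precision and recall (a quick sketch using the FALSE/TRUE rows and 0/1 columns printed above):
TP <- confusion_matrix["TRUE", "1"]   # spam correctly flagged
FP <- confusion_matrix["TRUE", "0"]   # ham incorrectly flagged
FN <- confusion_matrix["FALSE", "1"]  # spam that slipped through
precision <- TP / (TP + FP)   # with the matrix above: 100/102, roughly 0.98
recall    <- TP / (TP + FN)   # with the matrix above: 100/100 = 1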
correct_predictions <- confusion_matrix['TRUE', '1'] + confusion_matrix['FALSE', '0']
accuracy_model <- correct_predictions/nrow(testing) * 100
print(paste0("The Accuracy of this Predictor Model is : ", accuracy_model, "%"))
## [1] "The Accuracy of this Predictor Model is : 99.6666666666667%"
At first glance, this project can seem overwhelming. Think about how many emails a single person receives at one email address every day, and how the spam folder in everyone's inbox actually functions.
After much data manipulation, I created a Bag of Words model to collect the most frequent words in the Spam and Ham (non-spam) categories. This made it easy to see which words belong to which category, giving each document an 'outputType' of 1 for the Spam class or 0 for the Ham class. From there it was straightforward to create the word clouds above and to fit the Random Forest model.
On a side note, I thought it was very interesting to see 'spam' among the top 50 most frequently used words in non-spam (ham) emails. I guess it doesn't really surprise me given today's heightened focus on cyber security.