It took me a while to figure out the best way to do this. I looked for patterns between the spam and ham emails and noticed that spam emails tended to have more HTML headers, which I took as a sign that the emails were professionally created. I thought about using the HTML headers as a way to flag an email as spam, but then I realized that if a co-worker forwarded an email with an advertisement, coupon, or link to a website, I didn’t want it to go to the Spam folder.
Then I decided to focus on comparing the body text and leave numbers out of it. Of course, soon after I noticed that some emails can contain code snippets or could have a rating ("This movie was 10 out of 10").
I then realized I knew nothing about the differences between spam and ham emails, so I decided to keep everything. After running multiple trials, all the term weights were very low, possibly because ‘THE’, ‘The’, and ‘the’ were counted as different terms. I’m sure case matters in a more sophisticated algorithm, but for this project I stripped extra whitespace, removed English stopwords, and converted everything to lowercase so I could at least understand what was going on.
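To illustrate why case matters for term counts, here is a minimal base R sketch. The sample sentence and the two-word stopword list are made up for illustration (a stand-in for the much longer list from `stopwords("english")`), and the whitespace tokenizer is a simplification, not the tokenizer the tm package uses.

```r
# Toy text; toy_stopwords is a hypothetical stand-in for stopwords("english")
text <- "THE offer   is free. The offer ends today, the end."
toy_stopwords <- c("the", "is")

# Split on runs of whitespace, so repeated spaces act as one separator
tokens_raw <- strsplit(text, "\\s+")[[1]]

# Without lowercasing, the three spellings are counted as separate terms
raw_counts <- table(tokens_raw)
raw_counts[c("THE", "The", "the")]

# Lowercase first, then drop stopwords, then count
tokens <- tolower(tokens_raw)
tokens <- tokens[!tokens %in% toy_stopwords]
counts <- table(tokens)
counts
```

Lowercasing before removing stopwords matters here: a lowercase stopword list would not match ‘THE’ or ‘The’ otherwise.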
Build Training Function
This function text mines a directory of files. Its inputs are the path to the directory and an optional second argument, “spam”. Passing “spam” gives the terms a negative weighting to represent spam. The function returns a data frame of every individual term across all files, with its frequency and weighting.
library(tm)  # VCorpus, DirSource, tm_map, TermDocumentMatrix, stopwords

# x is a path to the directory of text files (the folder, not an individual file).
# y is optional; any non-NULL value (e.g. "spam") makes the weights negative.
text_mine <- function(x, y = NULL){
  w <- 1
  # Create a corpus from the directory of txt files. Each file is its own document
  corpus <- VCorpus(DirSource(x))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  # Create a Term Document Matrix:
  # terms as row names, documents as columns, cells hold the term's frequency in that document
  dtm <- as.matrix(TermDocumentMatrix(corpus))
  # Make a data frame showing the terms, frequency, and weighting (freq / total frequency)
  if(!is.null(y)){w <- -1}
  terms <- rownames(dtm)
  freq <- rowSums(dtm)
  weight <- freq / sum(freq) * w
  df_freq <- data.frame(terms, freq, weight,
                        row.names = NULL, stringsAsFactors = FALSE)
  # Sort by descending frequency
  df_freq <- df_freq[order(-df_freq$freq), ]
  return(df_freq)
}
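As a worked example of the weighting step, the sketch below applies the same formula, `freq / sum(freq) * w`, to three made-up term counts. The terms and numbers are hypothetical and only show the arithmetic.

```r
# Toy term frequencies (made-up numbers, for illustration only)
freq <- c(free = 6, offer = 3, meeting = 1)

# w is -1 for the spam corpus and +1 for the ham corpus
w <- -1

# Each term's weight is its share of all term occurrences, signed by w
weight <- freq / sum(freq) * w
weight

# The weights of a single corpus always add up to w
sum(weight)
```

Because each weight is a share of the whole corpus, a corpus with tens of thousands of distinct terms will give every individual term a very small weight.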
Spam Training Terms
spam <- "C:\\Users\\smith\\Desktop\\IS 607\\Projects\\spamham\\spam\\"
spam_training <- text_mine(spam, "spam")
head(spam_training)
##           terms freq       weight
## 10552      2002 3471 -0.016937574
## 37362 received: 2356 -0.011496665
## 39078       sep 2062 -0.010062021
## 6160        <td 1985 -0.009686282
## 3722      +0100 1453 -0.007090261
## 4192      </tr> 1138 -0.005553143
Ham Training Terms
ham <- "C:\\Users\\smith\\Desktop\\IS 607\\Projects\\spamham\\easy_ham\\"
ham_training <- text_mine(ham)
head(ham_training)
##           terms  freq      weight
## 28369      2002 20956 0.030861125
## 74819 received: 14170 0.020867634
## 78007       sep 10004 0.014732520
## 48716     esmtp  8684 0.012788605
## 8614      +0100  7526 0.011083261
## 69734       oct  5455 0.008033376
Merge Data Frames
I merged the spam_training and ham_training data frames and added the two weight columns together so that each term has a single final weight, either positive or negative.
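library(dplyr)  # provides %>% and select()

# merge() with its defaults keeps only the terms that appear in both data frames
final <- merge(ham_training, spam_training, by = "terms")
final$weight <- final$weight.x + final$weight.y
final <- final %>%
  select(terms, weight)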
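One thing to be aware of: base R `merge()` defaults to `all = FALSE`, so terms that occur in only one of the two training sets are dropped. The sketch below shows a possible alternative that keeps them by using `all = TRUE` and treating a missing weight as 0. The toy data frames and their numbers are hypothetical, and this is a variation to consider rather than the method used for the results in this post.

```r
# Toy training tables (made-up terms and weights)
ham_toy  <- data.frame(terms = c("meeting", "free"), weight = c(0.7, 0.3),
                       stringsAsFactors = FALSE)
spam_toy <- data.frame(terms = c("free", "prize"), weight = c(-0.6, -0.4),
                       stringsAsFactors = FALSE)

# Default merge: only "free" survives, because it is the only shared term
inner <- merge(ham_toy, spam_toy, by = "terms")
inner$terms

# Outer merge: keep every term and treat a missing weight as 0
outer <- merge(ham_toy, spam_toy, by = "terms", all = TRUE)
outer$weight.x[is.na(outer$weight.x)] <- 0
outer$weight.y[is.na(outer$weight.y)] <- 0
outer$weight <- outer$weight.x + outer$weight.y
final_toy <- outer[, c("terms", "weight")]
final_toy
```

With the outer merge, a term seen only in spam keeps its full negative weight instead of disappearing from the lookup table.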
Test Some Emails
I took 10 different spam emails and 10 different ham emails and put them in the same folder. I renamed each file to either spam or ham so I would know afterwards whether my guess was correct.
After running multiple tests, I realized that most of the spam emails didn’t have a negative total weight; their totals were positive but close to zero. I adjusted the cutoff so that an email is classified as spam when its total weight is 0.1 or less. I felt comfortable with this because the ham weight was always much higher, in the 0.5 to 0.6 range.
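The cutoff rule can be sketched in a couple of lines. The four weights below are copied from the test output in this post; the variable names are just for illustration.

```r
# A few total weights copied from the test output in this post
weights <- c(0.536131714052154, 0.049309534655008,
             0.128715955587171, -0.228134959995541)

# Anything above the cutoff is called ham; everything else is spam
threshold <- 0.1
label <- ifelse(weights > threshold, "Ham", "Spam")

data.frame(weight = weights, label = label)
```

Note that the third email, at roughly 0.13, lands just above the cutoff, so the classification is sensitive to where the threshold is placed.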
Build Function
The input argument is the path to the directory of email files. The path must end with a path separator, because the file name is appended to it directly.
library(tm)     # VCorpus, VectorSource, tm_map, TermDocumentMatrix, stopwords
library(readr)  # read_lines

# x is a path to the directory of text files (the folder, not an individual file).
# It must end with a path separator because paste0() appends the file name directly.
spam_or_ham <- function(x){
  # List the email files in the directory; each file is scored separately
  list_emails <- list.files(x)
  l <- length(list_emails)
  r <- 1
  while(r <= l){
    path <- paste0(x, list_emails[r])
    text <- read_lines(path)
    # Each line of the email becomes one document in the corpus
    corpus <- VCorpus(VectorSource(text))
    corpus <- tm_map(corpus, stripWhitespace)
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    # Create a Term Document Matrix:
    # terms as row names, documents as columns, cells hold the term's frequency in that document
    dtm <- as.matrix(TermDocumentMatrix(corpus))
    # Prep columns for data frame
    terms <- rownames(dtm)
    freq <- rowSums(dtm)
    # Create data frame
    df_freq <- data.frame(terms, freq,
                          row.names = NULL, stringsAsFactors = FALSE)
    # Match the terms from the test email against the final weight data frame
    # and multiply each matched term's frequency by its weight.
    # A total above 0.1 is classified as ham; 0.1 or less is spam
    i <- 1
    n <- nrow(df_freq)
    term_weight <- 0
    total_weight <- c()
    while(i <= n){
      j <- 1
      while(j <= nrow(final)){
        if(df_freq$terms[i] == final$terms[j]){
          term_weight <- df_freq$freq[i] * final$weight[j]
          total_weight <- c(total_weight, term_weight)
          # Terms are unique in final, so stop searching after the first match
          break
        }
        j <- j + 1
      }
      i <- i + 1
    }
    sum_weight <- sum(total_weight)
    if(sum_weight > 0.1){
      reply <- paste0("Email #", r, " is Ham.")
    } else {reply <- paste0("Email #", r, " is Spam.")}
    print(paste0("Email weight is ", sum_weight))
    print(reply)
    r <- r + 1
  }
}
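The nested loops look up each email term in the weight table one row at a time. A possible shortcut, sketched here on made-up data, is base R's `match()`, which finds the row position of every term in one call. The toy data frames `final_toy` and `email_toy` are hypothetical stand-ins for `final` and `df_freq`; this assumes each term appears only once in the weight table.

```r
# Toy stand-ins for the final weight table and one email's term counts
final_toy <- data.frame(terms = c("free", "meeting", "prize"),
                        weight = c(-0.3, 0.7, -0.4),
                        stringsAsFactors = FALSE)
email_toy <- data.frame(terms = c("free", "meeting", "hello"),
                        freq = c(1, 2, 5),
                        stringsAsFactors = FALSE)

# Row position of each email term in the weight table (NA when the term is unknown)
idx <- match(email_toy$terms, final_toy$terms)
matched <- !is.na(idx)

# Frequency times weight for matched terms only; unknown terms contribute nothing
sum_weight <- sum(email_toy$freq[matched] * final_toy$weight[idx[matched]])
sum_weight
```

Here "hello" is not in the weight table and is ignored, which mirrors what the loops do when no match is found.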
Test Emails
The results of the test were fairly good. All 10 ham emails were identified correctly. Four of the 10 spam emails were misidentified as ham, so 16 of the 20 emails were classified correctly, an accuracy rate of 80%.
test <- "C:\\Users\\smith\\Desktop\\IS 607\\Projects\\spamham\\test\\"
spam_or_ham(test)
## [1] "Email weight is 0.536131714052154"
## [1] "Email #1 is Ham."
## [1] "Email weight is 0.276233423162124"
## [1] "Email #2 is Ham."
## [1] "Email weight is 0.598563035644696"
## [1] "Email #3 is Ham."
## [1] "Email weight is 0.049309534655008"
## [1] "Email #4 is Spam."
## [1] "Email weight is 0.59630246753537"
## [1] "Email #5 is Ham."
## [1] "Email weight is 0.0117638146799612"
## [1] "Email #6 is Spam."
## [1] "Email weight is 0.622272703232102"
## [1] "Email #7 is Ham."
## [1] "Email weight is 0.0459641432678928"
## [1] "Email #8 is Spam."
## [1] "Email weight is 0.506375643766872"
## [1] "Email #9 is Ham."
## [1] "Email weight is 0.128715955587171"
## [1] "Email #10 is Ham."
## [1] "Email weight is 0.616273326974402"
## [1] "Email #11 is Ham."
## [1] "Email weight is -0.228134959995541"
## [1] "Email #12 is Spam."
## [1] "Email weight is 0.618011260532457"
## [1] "Email #13 is Ham."
## [1] "Email weight is 0.0727286301224972"
## [1] "Email #14 is Spam."
## [1] "Email weight is 0.625467638012444"
## [1] "Email #15 is Ham."
## [1] "Email weight is -0.647778095186361"
## [1] "Email #16 is Spam."
## [1] "Email weight is 0.6204697422721"
## [1] "Email #17 is Ham."
## [1] "Email weight is 0.220271634486181"
## [1] "Email #18 is Ham."
## [1] "Email weight is 0.480363325606337"
## [1] "Email #19 is Ham."
## [1] "Email weight is 0.387238612720074"
## [1] "Email #20 is Ham."
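The accuracy figure can be checked with a few lines of arithmetic, using only the counts reported for this test (10 ham, 10 spam, 4 spam emails misidentified). The spam detection rate is an extra number derived from the same counts.

```r
# Counts reported for this test
n_ham <- 10
n_spam <- 10
ham_correct <- 10            # every ham email was identified correctly
spam_correct <- n_spam - 4   # four spam emails were labelled as ham

# Share of all emails classified correctly
accuracy <- (ham_correct + spam_correct) / (n_ham + n_spam)
accuracy

# Share of the spam emails that were actually caught
spam_detection_rate <- spam_correct / n_spam
spam_detection_rate
```

Overall accuracy hides the imbalance between the two classes: every ham email was classified correctly, while 6 of the 10 spam emails were caught.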