Import Training Data

It took me a while to figure out the best way to do this. I looked for patterns between the spam and ham emails and noticed that spam emails tended to have more HTML headers, which I took as a sign that they were professionally created. I thought about using the HTML headers to flag an email as spam, but if a co-worker forwarded an email with an advertisement, coupon, or link to a website, I didn’t want it to go to the Spam folder.

Then I decided to focus on comparing the body text and leave numbers out of it. Of course, soon after I noticed that some emails can contain code snippets or a rating (“This movie was 10 out of 10”).

I then realized I knew nothing about the differences between spam and ham emails, so I decided to keep everything. After running multiple trials, all the term weights were very low, possibly because ‘THE’, ‘The’, and ‘the’ were considered different terms. I’m sure case matters in a more sophisticated algorithm, but for this project I collapsed extra whitespace, converted everything to lowercase, and removed English stopwords so at least I could understand what was going on.
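
A small base-R sketch of why case matters, using a made-up token vector rather than the real emails: without lowercasing, each spelling is counted as a separate term, which spreads the frequency (and so the weight) across them.

```r
# Hypothetical tokens for illustration only; not taken from the training data
tokens <- c("THE", "The", "the", "offer", "Offer")

# Counted as-is: every spelling is its own term
raw_counts <- table(tokens)

# Counted after lowercasing: the spellings collapse into one term each
lower_counts <- table(tolower(tokens))

print(raw_counts)
print(lower_counts)
```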

Build Training Function

Function to text mine a directory of files. The input arguments are the path to the directory and, optionally, “spam”. Passing “spam” (or any non-NULL second argument) gives the terms a negative weighting to represent spam. The function returns a data frame of all individual words from all files, with their frequency and weighting.

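library(tm)  #VCorpus, DirSource, tm_map, TermDocumentMatrix

#x is a path to the directory of text files (the folder, not the individual file).
#y is optional; any non-NULL value (e.g. "spam") makes the weights negative.
text_mine <- function(x, y=NULL){
  w <- 1
  
  #Create a corpus from the directory of txt files. Each file is its own document
  corpus <- VCorpus(DirSource(x))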
  
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  
  #Create a Term Document Matrix
  #Terms as rownames, documents as columns; each cell is the frequency of the term in that document
  dtm <- as.matrix(TermDocumentMatrix(corpus))
  
  #Make a data frame showing the terms, frequency, and weighting (freq/total term count)
  #The weighting is negated when y is supplied (spam)
  if(!is.null(y)){w <- -1}
  terms <- rownames(dtm)
  freq <- rowSums(dtm)
  weight <- freq/sum(freq)*w

  df_freq <- data.frame(terms, freq, weight,
                      row.names = NULL, stringsAsFactors = FALSE)

  #Sort by descending frequency
  df_freq <- df_freq[order(-df_freq$freq),]

  return(df_freq)
}
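
To make the weighting concrete, here is a minimal sketch of the same arithmetic on a made-up frequency vector (the terms and counts are hypothetical). Each weight is the term's share of all term occurrences, so the weights of one training set sum to 1 for ham, or -1 for spam.

```r
# Hypothetical term frequencies, standing in for rowSums(dtm)
freq <- c(free = 6, click = 3, meeting = 1)

# Same formula as in text_mine: share of total occurrences, times the sign w
ham_weight  <- freq / sum(freq) * 1
spam_weight <- freq / sum(freq) * -1

print(ham_weight)
print(spam_weight)
print(sum(spam_weight))
```

This is also why the weights in the training tables are small: the shares of all terms in a set add up to 1 in absolute value, so even the most frequent term accounts for only a few percent.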

Spam Training Terms

spam <- "C:\\Users\\smith\\Desktop\\IS 607\\Projects\\spamham\\spam\\"
spam_training <- text_mine(spam, "spam")
head(spam_training)
##           terms freq       weight
## 10552      2002 3471 -0.016937574
## 37362 received: 2356 -0.011496665
## 39078       sep 2062 -0.010062021
## 6160        <td 1985 -0.009686282
## 3722      +0100 1453 -0.007090261
## 4192      </tr> 1138 -0.005553143

Ham Training Terms

ham <- "C:\\Users\\smith\\Desktop\\IS 607\\Projects\\spamham\\easy_ham\\"
ham_training <- text_mine(ham)
head(ham_training)
##           terms  freq      weight
## 28369      2002 20956 0.030861125
## 74819 received: 14170 0.020867634
## 78007       sep 10004 0.014732520
## 48716     esmtp  8684 0.012788605
## 8614      +0100  7526 0.011083261
## 69734       oct  5455 0.008033376

Calculate Final Weighting

Merge Data Frames

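I merged the spam_training and ham_training data frames and added the weight columns together so each term has a final weight, either positive or negative. Because merge() keeps only matching rows by default, terms that appear in just one of the two training sets are dropped.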

library(dplyr)  #provides %>% and select

final <- merge(ham_training, spam_training, by="terms")
final$weight <- final$weight.x + final$weight.y

final <- final %>%
  select(terms, weight)
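
The merge step can be sketched on two tiny data frames (the terms and weights below are made up). It also illustrates a side effect worth knowing: merge() performs an inner join by default, so a term that occurs in only one training set does not reach the final table. Passing all = TRUE and treating a missing weight as 0 is one possible alternative, shown here as an assumption rather than what this project did.

```r
# Hypothetical training tables with the same columns that text_mine returns
ham_demo  <- data.frame(terms = c("meeting", "free", "report"),
                        freq = c(5, 1, 4),
                        weight = c(0.5, 0.1, 0.4),
                        stringsAsFactors = FALSE)
spam_demo <- data.frame(terms = c("free", "click", "meeting"),
                        freq = c(6, 3, 1),
                        weight = c(-0.6, -0.3, -0.1),
                        stringsAsFactors = FALSE)

# Inner join (the default): only terms present in both tables survive
inner <- merge(ham_demo, spam_demo, by = "terms")
inner$weight <- inner$weight.x + inner$weight.y
print(inner[, c("terms", "weight")])

# Full join: keep every term, counting a missing weight as 0
full <- merge(ham_demo, spam_demo, by = "terms", all = TRUE)
full$weight.x[is.na(full$weight.x)] <- 0
full$weight.y[is.na(full$weight.y)] <- 0
full$weight <- full$weight.x + full$weight.y
print(full[, c("terms", "weight")])
```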

Test Some Emails

I took 10 different spam emails and 10 different ham emails and put them in the same folder. I renamed each file to either spam or ham so I would know afterwards whether my guess was correct.

After running multiple tests, I realized that most of the spam emails didn’t have a negative weight but a positive one close to zero. I adjusted the rule so that an email is classified as spam when its total weight is 0.1 or less. I felt comfortable with this because the ham weight was always much higher, in the 0.5 to 0.6 range.
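
The adjusted rule can be sketched as a small helper. The function name classify_weight and its threshold argument are chosen for illustration only; the example weights are three of the totals printed in the test output at the end of this post.

```r
# classify_weight is a hypothetical helper, not part of the original functions
classify_weight <- function(sum_weight, threshold = 0.1){
  ifelse(sum_weight > threshold, "Ham", "Spam")
}

# Total weights taken from the test output: emails #1, #4 and #12
weights <- c(0.536131714052154, 0.049309534655008, -0.228134959995541)
print(classify_weight(weights))

# With a threshold of 0, email #4 would have been labelled ham instead
print(classify_weight(weights, threshold = 0))
```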

Build Function

The input argument is a path to the directory of email files.

library(readr)  #read_lines

#x is a path to the directory of text files (the folder, not the individual file).
#x must end with a path separator because it is pasted directly onto each file name.
spam_or_ham <- function(x){
  
  #List the email files in the directory; each file is scored in turn
  list_emails <- list.files(x)
  l <- length(list_emails)
  r <- 1
  
  while(r <= l){
    path <- paste0(x, list_emails[r])
    text <- read_lines(path)
    corpus <- VCorpus(VectorSource(text))
    
    corpus <- tm_map(corpus, stripWhitespace)
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    
    #Create a Term Document Matrix
    #Terms as rownames, documents as columns; each cell is the frequency of the term in that document
    dtm <- as.matrix(TermDocumentMatrix(corpus))
    
    #Prep columns for data frame
    terms <- rownames(dtm)
    freq <- rowSums(dtm)
  
    #Create data frame 
    df_freq <- data.frame(terms, freq, 
                        row.names = NULL, stringsAsFactors = F)
    
    #Match the terms from the test email against the final weight data frame
    #and collect freq * weight for every matched term.
    #A total above 0.1 is classified as ham; anything else is spam
    i <- 1
    n <- nrow(df_freq)
    term_weight <- 0
    total_weight <- c()
    
    while(i <= n){
      j <- 1
      while(j <= nrow(final)){
        if(df_freq$terms[i] == final$terms[j]){
          term_weight <- df_freq$freq[i] * final$weight[j]
          total_weight <- c(total_weight, term_weight)
          j <- nrow(final)  #term found; force the inner loop to exit
        }
        j <- j + 1
      }
      i <- i + 1
    }
    
    sum_weight <- sum(total_weight)
    
    if(sum_weight > 0.1){
      reply <- paste0("Email #", r, " is Ham.")
    } else {
      reply <- paste0("Email #", r, " is Spam.")
    }
  
    print(paste0("Email weight is ", sum_weight))
    print(reply)
    
    r <- r + 1
    }
    
  }
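
The two nested while loops look up each test term in final one row at a time. As a sketch of an alternative, the same total can be computed with match(), which returns the position of the first matching term (or NA when there is none). The helper name score_terms and the small data frames are hypothetical and only stand in for df_freq and final.

```r
# score_terms is a hypothetical helper; its arguments follow the column
# layout used above (terms/freq for the email, terms/weight for final)
score_terms <- function(df_freq, final){
  idx <- match(df_freq$terms, final$terms)   # NA where the term is unknown
  known <- !is.na(idx)
  sum(df_freq$freq[known] * final$weight[idx[known]])
}

# Made-up stand-ins for the merged weight table and one test email
final_demo <- data.frame(terms = c("free", "meeting", "report"),
                         weight = c(-0.5, 0.4, 0.2),
                         stringsAsFactors = FALSE)
email_demo <- data.frame(terms = c("meeting", "free", "unseen"),
                         freq = c(2, 1, 7),
                         stringsAsFactors = FALSE)

print(score_terms(email_demo, final_demo))
```

Terms that are missing from the weight table contribute nothing, as in the loop version, and an email with no known terms scores 0.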

Test Emails

The results of the test were fairly good. All 10 ham emails were identified correctly. 4 of the 10 spam emails were misidentified as ham, giving me an accuracy rate of 80% (16 of 20).

test <- "C:\\Users\\smith\\Desktop\\IS 607\\Projects\\spamham\\test\\"
spam_or_ham(test)
## [1] "Email weight is 0.536131714052154"
## [1] "Email #1 is Ham."
## [1] "Email weight is 0.276233423162124"
## [1] "Email #2 is Ham."
## [1] "Email weight is 0.598563035644696"
## [1] "Email #3 is Ham."
## [1] "Email weight is 0.049309534655008"
## [1] "Email #4 is Spam."
## [1] "Email weight is 0.59630246753537"
## [1] "Email #5 is Ham."
## [1] "Email weight is 0.0117638146799612"
## [1] "Email #6 is Spam."
## [1] "Email weight is 0.622272703232102"
## [1] "Email #7 is Ham."
## [1] "Email weight is 0.0459641432678928"
## [1] "Email #8 is Spam."
## [1] "Email weight is 0.506375643766872"
## [1] "Email #9 is Ham."
## [1] "Email weight is 0.128715955587171"
## [1] "Email #10 is Ham."
## [1] "Email weight is 0.616273326974402"
## [1] "Email #11 is Ham."
## [1] "Email weight is -0.228134959995541"
## [1] "Email #12 is Spam."
## [1] "Email weight is 0.618011260532457"
## [1] "Email #13 is Ham."
## [1] "Email weight is 0.0727286301224972"
## [1] "Email #14 is Spam."
## [1] "Email weight is 0.625467638012444"
## [1] "Email #15 is Ham."
## [1] "Email weight is -0.647778095186361"
## [1] "Email #16 is Spam."
## [1] "Email weight is 0.6204697422721"
## [1] "Email #17 is Ham."
## [1] "Email weight is 0.220271634486181"
## [1] "Email #18 is Ham."
## [1] "Email weight is 0.480363325606337"
## [1] "Email #19 is Ham."
## [1] "Email weight is 0.387238612720074"
## [1] "Email #20 is Ham."
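
The 80% figure follows from the counts reported above: 10 ham and 10 spam emails, with every ham email and all but 4 of the spam emails labelled correctly. A short check of that arithmetic, which also gives the share of spam that was caught:

```r
# Counts taken from the test description: 10 ham, 10 spam, 4 spam missed
n_ham <- 10
n_spam <- 10
ham_correct <- 10
spam_correct <- n_spam - 4

accuracy <- (ham_correct + spam_correct) / (n_ham + n_spam)
spam_caught <- spam_correct / n_spam

print(accuracy)     # 0.8
print(spam_caught)  # 0.6
```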