library(tidyverse)
library(tm)
library(purrr)
library(randomForest)
library(caTools)

Reading email data

Here I read the email data directly from two separate folders: one for the spam data and the other for the easy ham data. I run a few functions on each set to get the text formatted correctly. Notice I am re-encoding each file as 'latin1'; when I first ran the code in the next section I kept hitting encoding errors, and I found that fix on Stack Overflow. Lastly, I union the two data frames.

spam_files <- list.files("Spam/spam_2/",full.names=TRUE)
ham_files <- list.files("Ham/easy_ham_2/",full.names=TRUE)


spam_df <- spam_files %>%
  lapply(FUN = readLines) %>%                  # read each file into a vector of lines
  lapply(FUN = paste, collapse = " ") %>%      # collapse each email into one string
  gsub(pattern = "\\d", replacement = "") %>%  # strip digits
  as.data.frame() %>%
  mutate_if(is.character, function(x) {Encoding(x) <- 'latin1'; return(x)}) %>%  # fix encoding issues
  mutate(file_name = spam_files) %>%
  mutate(spam = 1)                             # label spam emails with 1

ham_df <- ham_files %>%
  lapply(FUN = readLines) %>%                  # read each file into a vector of lines
  lapply(FUN = paste, collapse = " ") %>%      # collapse each email into one string
  gsub(pattern = "\\d", replacement = "") %>%  # strip digits
  as.data.frame() %>%
  mutate_if(is.character, function(x) {Encoding(x) <- 'latin1'; return(x)}) %>%  # fix encoding issues
  mutate(file_name = ham_files) %>%
  mutate(spam = 0)                             # label ham emails with 0

colnames(spam_df) <- c("text","file_name","spam")
colnames(ham_df) <- c("text","file_name","spam")

emails <- rbind(spam_df, ham_df)

table(emails$spam)
## 
##    0    1 
## 1401 1396
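
Since the two pipelines above are identical apart from the folder and the label, the same steps could be wrapped in a small helper. This is just a sketch (the read_email_folder function is hypothetical, not something I actually used):

read_email_folder <- function(path, spam_label) {
  # Read every file in the folder, collapse each email into one string,
  # strip digits, fix the encoding, and tag the rows with a spam label.
  files <- list.files(path, full.names = TRUE)
  text <- map_chr(files, ~ paste(readLines(.x), collapse = " "))
  text <- gsub("\\d", "", text)
  Encoding(text) <- "latin1"
  data.frame(text = text, file_name = files, spam = spam_label,
             stringsAsFactors = FALSE)
}

# emails <- rbind(read_email_folder("Spam/spam_2/", 1),
#                 read_email_folder("Ham/easy_ham_2/", 0))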

Corpus Cleanup

The next step is to convert the data frame into a corpus. The reason we do this is that a corpus object comes with a ton of useful functions for cleaning up the text in the files. After we call all of these cleanup functions we convert the corpus into a Document Term Matrix. The 0.95 I pass as the second argument to removeSparseTerms means a term has to appear in at least 5% of the documents to be kept. This might be too aggressive? I would be interested to hear what others did.

email_corpus <- Corpus(VectorSource(emails$text))

email_corpus <- email_corpus %>%
  tm_map(content_transformer(tolower)) %>%  # wrap base tolower so tm_map keeps the document structure
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(stripWhitespace) %>%
  tm_map(stemDocument) %>%
  tm_map(removeWords, stopwords("english"))
  
dtm <- DocumentTermMatrix(email_corpus)

dtm <- removeSparseTerms(dtm, 0.95)

inspect(dtm)
## <<DocumentTermMatrix (documents: 2797, terms: 589)>>
## Non-/sparse entries: 248061/1399372
## Sparsity           : 85%
## Maximal term length: 54
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   aug email esmtp jul localhost mon receiv size tue wed
##   1079   0     1     4  10         1   0      9    0   0   0
##   1375   0     8     2   0         2   0      6   18   0   0
##   2713   0    20     6  10         3   0     12    3   0  10
##   2776  11     1     1   0         5   2      9    2   4   4
##   28     0     9     2   0         0   4      8    0   0   0
##   44     0    23     2   0         0   0      4   14   0   0
##   51     0    16     2   0         0   0      8   24   0   0
##   81     0    19     2   4         0   0      5   26   0   0
##   903    0    38     4   8         1   8     34    2   0   0
##   989    0    38     4   8         1   0     35    2   7   1
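
One way to sanity-check the 0.95 sparsity cutoff (just a quick check, not part of the analysis above) is to look at the document frequency of the terms that survived; each one should show up in at least roughly 5% of the 2797 emails:

# Number of emails containing each retained term. With removeSparseTerms(dtm, 0.95)
# every remaining term should appear in at least ~5% of documents (~140 emails).
doc_freq <- colSums(as.matrix(dtm) > 0)
summary(doc_freq / nDocs(dtm))
head(sort(doc_freq, decreasing = TRUE), 10)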

Prepping for Prediction Model

Here we convert the document term matrix into a data frame, clean up the column names so they are valid R variable names, and add the spam classification back as a factor. To train the model we split the data 70/30 into two data frames: train and test.

emails_dtm = as.data.frame(as.matrix(dtm))

colnames(emails_dtm) = make.names(colnames(emails_dtm))

emails_dtm$spam = emails$spam
emails_dtm$spam = as.factor(emails_dtm$spam)

spl = sample.split(emails_dtm$spam, SplitRatio = 0.7)

train = subset(emails_dtm, spl == TRUE)
test = subset(emails_dtm, spl == FALSE)

table(train$spam)
## 
##   0   1 
## 981 977
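
One caveat: sample.split is random, so the counts above will shift from run to run. Setting a seed before the split (a small tweak that is not in the code above) makes the partition reproducible:

# Seeding before the split makes the 70/30 partition reproducible;
# sample.split keeps the spam/ham proportions roughly equal in both subsets.
set.seed(123)
spl <- sample.split(emails_dtm$spam, SplitRatio = 0.7)
train <- subset(emails_dtm, spl == TRUE)
test <- subset(emails_dtm, spl == FALSE)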

Random Forest

The model itself is pretty straightforward: I decided to use Random Forest. Once the model is trained we can use it to predict the test set and see how accurate it is.

set.seed(10000)
rf_model = randomForest(spam~., data=train)

pred = predict(rf_model, newdata=test)

table(test$spam, pred)
##    pred
##       0   1
##   0 416   4
##   1  18 401
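
The accuracy can be computed straight from that confusion matrix, and randomForest's variable importance gives a quick look at which terms are doing the work (again just a sketch):

# Accuracy: correctly classified emails divided by the total number of test emails.
conf <- table(test$spam, pred)
sum(diag(conf)) / sum(conf)

# Top 10 terms by mean decrease in Gini impurity.
varImpPlot(rf_model, n.var = 10)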

Conclusion

Our prediction model is fairly accurate. We correctly classified 817 of the 839 test emails, a success rate of roughly 97.4%.