library(tidyverse)
library(tm)
library(purrr)
library(randomForest)
library(caTools)
Here I read the email data directly from two separate folders: one for the spam data and the other for the easy ham data. I run a few functions on each set to get the data formatted correctly. Notice that I encode each file as 'latin1'; when I first ran the code in the next section I kept hitting encoding errors, and I found that fix on Stack Overflow. Lastly, I union the two dataframes.
spam_files <- list.files("Spam/spam_2/",full.names=TRUE)
ham_files <- list.files("Ham/easy_ham_2/",full.names=TRUE)
spam_df <- spam_files %>%
  lapply(FUN = readLines) %>%              # read each file into a vector of lines
  lapply(FUN = paste, collapse = " ") %>%  # collapse each email into one string
  gsub(pattern = "\\d", replacement = "") %>% # strip digits
  as.data.frame() %>%
  mutate_if(is.character, function(x) {Encoding(x) <- 'latin1'; return(x)}) %>% # fix encoding
  mutate(file_name = spam_files) %>%
  mutate(spam = 1)                         # label: 1 = spam
ham_df <- ham_files %>%
  lapply(FUN = readLines) %>%              # read each file into a vector of lines
  lapply(FUN = paste, collapse = " ") %>%  # collapse each email into one string
  gsub(pattern = "\\d", replacement = "") %>% # strip digits
  as.data.frame() %>%
  mutate_if(is.character, function(x) {Encoding(x) <- 'latin1'; return(x)}) %>% # fix encoding
  mutate(file_name = ham_files) %>%
  mutate(spam = 0)                         # label: 0 = ham
colnames(spam_df) <- c("text","file_name","spam")
colnames(ham_df) <- c("text","file_name","spam")
emails <- rbind(spam_df, ham_df)
table(emails$spam)
##
## 0 1
## 1401 1396
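As an aside, another common fix for the same encoding errors is iconv(). This is just an alternative sketch, not what I used above; it assumes the emails dataframe already exists:
# Convert any stray latin1 bytes to UTF-8; sub = "" silently drops
# sequences that cannot be converted instead of raising an error.
emails$text <- iconv(emails$text, from = "latin1", to = "UTF-8", sub = "")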
The next step is to convert the dataframe into a corpus. We do this because a corpus object comes with a ton of useful functions for cleaning up the text in the files. After calling all of these cleanup functions, we convert the corpus into a Document Term Matrix. The 0.95 I pass as the second argument to removeSparseTerms() means a term is dropped unless it appears in at least 5% of the documents. This might be too aggressive? I would be interested to hear what others did.
email_corpus <- Corpus(VectorSource(emails$text))
email_corpus <- email_corpus %>%
  tm_map(content_transformer(tolower)) %>%      # content_transformer keeps the corpus structure intact
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(stripWhitespace) %>%
  tm_map(removeWords, stopwords("english")) %>% # drop stopwords before stemming so they still match the stopword list
  tm_map(stemDocument)                          # reduce words to their stems
dtm <- DocumentTermMatrix(email_corpus)
dtm <- removeSparseTerms(dtm, 0.95)
inspect(dtm)
## <<DocumentTermMatrix (documents: 2797, terms: 589)>>
## Non-/sparse entries: 248061/1399372
## Sparsity : 85%
## Maximal term length: 54
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs aug email esmtp jul localhost mon receiv size tue wed
## 1079 0 1 4 10 1 0 9 0 0 0
## 1375 0 8 2 0 2 0 6 18 0 0
## 2713 0 20 6 10 3 0 12 3 0 10
## 2776 11 1 1 0 5 2 9 2 4 4
## 28 0 9 2 0 0 4 8 0 0 0
## 44 0 23 2 0 0 0 4 14 0 0
## 51 0 16 2 0 0 0 8 24 0 0
## 81 0 19 2 4 0 0 5 26 0 0
## 903 0 38 4 8 1 8 34 2 0 0
## 989 0 38 4 8 1 0 35 2 7 1
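To sanity-check the 0.95 threshold, we can compute the share of documents each surviving term appears in; by construction every value should be at least 5%. A quick sketch against the dtm built above:
# Document frequency of each retained term (fraction of emails containing it).
doc_freq <- colSums(as.matrix(dtm) > 0) / nDocs(dtm)
summary(doc_freq)                           # the minimum should be >= 0.05
head(sort(doc_freq, decreasing = TRUE), 10) # the most widespread terms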
Here we convert our document term matrix into a dataframe and add the spam classification back. To train the model we split the data into two dataframes: train and test.
emails_dtm <- as.data.frame(as.matrix(dtm))
colnames(emails_dtm) <- make.names(colnames(emails_dtm)) # ensure syntactically valid column names
emails_dtm$spam <- as.factor(emails$spam)                # reattach the label as a factor
spl <- sample.split(emails_dtm$spam, SplitRatio = 0.7)   # stratified 70/30 split
train <- subset(emails_dtm, spl == TRUE)
test <- subset(emails_dtm, spl == FALSE)
table(train$spam)
##
## 0 1
## 981 977
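Since sample.split() stratifies on the label, both pieces should preserve the roughly 50/50 spam/ham ratio of the full dataset. A quick check on the split above:
prop.table(table(train$spam)) # class balance in the training set
prop.table(table(test$spam))  # class balance in the test set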
The model itself is pretty straightforward. I decided to use Random Forest. Once the model is trained we can use it to predict the test labels and see how accurate it is.
set.seed(10000)
rf_model <- randomForest(spam ~ ., data = train) # fit a random forest on the training set
pred <- predict(rf_model, newdata = test)        # predict labels for the held-out test set
table(test$spam, pred)                           # rows = actual, columns = predicted
## pred
## 0 1
## 0 416 4
## 1 18 401
Our prediction model is fairly accurate. We correctly classified 817 of the 839 test emails (416 ham and 401 spam), a 97.4% success rate.
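For anyone who prefers the metrics computed rather than read off the table, here is a short sketch against the confusion matrix above:
# Accuracy, precision, and recall from the test-set predictions
# (rows of the table are actual labels, columns are predictions).
conf <- table(test$spam, pred)
accuracy <- sum(diag(conf)) / sum(conf)        # (416 + 401) / 839
precision <- conf["1", "1"] / sum(conf[, "1"]) # of predicted spam, share truly spam
recall <- conf["1", "1"] / sum(conf["1", ])    # of actual spam, share we caught
round(c(accuracy = accuracy, precision = precision, recall = recall), 3)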