Data Intake

library(readtext)
library(dplyr)

dfham1 <- readtext('20021010_easy_ham/easy_ham/*')
dfham2 <- readtext('20021010_hard_ham/hard_ham/*')
dfham3 <- readtext('20030228_easy_ham_2/easy_ham_2/*')
dfspam1 <- readtext('20021010_spam/spam/*')
dfspam2 <- readtext('20030228_spam_2/spam_2/*')

dfham <- bind_rows(dfham1, dfham2, dfham3)
dfspam <- bind_rows(dfspam1, dfspam2)

dfham <- dfham %>% mutate(target = 1) %>% select(text, target)
dfspam <- dfspam %>% mutate(target = 0) %>% select(text, target)

df <- bind_rows(dfham, dfspam)

We read in the text with the readtext package, which loads every file in each directory as plain text, a good fit for the way these corpora are laid out on disk. We then bind the ham and spam sets together, label them 1 and 0 respectively, and combine everything into a single data frame. Now we must process the text and prepare it for modeling.
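
Before doing that, here is a quick optional sanity check (not part of the original pipeline) confirming how many messages were loaded and the ham/spam balance:

# optional sanity check: message count and class balance (1 = ham, 0 = spam)
nrow(df)
table(df$target)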

Text Preprocessing

I will transform all text to lower case, remove punctuation, remove stop words (overly frequent words that carry little information), and drop rare terms. Then I will build a document-term matrix that counts how often each term appears in each document, and apply a tf-idf transformation to down-weight terms that appear across many documents.
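
To make the tf-idf weighting concrete, here is a small toy illustration, separate from the pipeline below; the three example sentences are invented purely for demonstration:

library(quanteda)

toy <- c(d1 = "free money free offer",
         d2 = "meeting schedule for monday",
         d3 = "free offer expires monday")
toy_dfm <- dfm(corpus(toy))
# "free" and "offer" appear in two of the three documents, so they receive a
# lower idf weight than terms that occur in only one document
dfm_tfidf(toy_dfm, scheme_tf = 'count', scheme_df = 'inverse', base = 10)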

library(quanteda.textmodels)
## Warning: package 'quanteda.textmodels' was built under R version 3.6.3
library(quanteda)
## Warning: package 'quanteda' was built under R version 3.6.3
## Package version: 2.0.1
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
library(stringr)

set.seed(42)

# replace punctuation with spaces before tokenizing
df$text <- df$text %>% str_replace_all("[[:punct:]]", " ")

clean_corpus <- function(df){
  corp <- corpus(df$text)
  # lower-case the tokens and drop English stop words while building the dfm
  dfm1 <- dfm(corp, tolower = TRUE, remove = stopwords('en'))
  # drop rare terms: fewer than 5 occurrences overall or present in fewer than 3 documents
  dfm1 <- dfm_trim(dfm1, min_termfreq = 5, min_docfreq = 3)
  # weight the counts by inverse document frequency (tf-idf)
  dfm1 <- dfm_tfidf(dfm1, scheme_tf = 'count', scheme_df = 'inverse', base = 10, force = FALSE)
  dfm1
}

dfm1 <- clean_corpus(df)
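
As an optional check on the result (output not shown here), we can look at the dimensions of the weighted matrix and its highest-weighted features:

# optional inspection of the weighted document-feature matrix
dim(dfm1)               # documents x features
topfeatures(dfm1, 10)   # ten highest-weighted features overall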

# shuffle the rows so ham and spam are interleaved before splitting
index <- sample(nrow(df))

dfm1 <- dfm1[index,]
df <- df[index,]

# 80/20 train/test split on the shuffled rows
train_index <- 1:round(.8 * nrow(df))
test_index <- (round(.8 * nrow(df)) + 1):nrow(df)

train_dfm <- dfm1[train_index,]
train_target <- df$target[train_index]

test_dfm <- dfm1[test_index,]
test_target <- df$target[test_index]

The text is now in a document-term matrix, so it is ready for modeling. Another preprocessing step we could consider is transforming the text into n-grams; however, that would cause the number of features to explode.
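
For reference, here is a sketch of how unigram-plus-bigram features could be built with quanteda's tokens functions; it is not run here, and the trimming thresholds are simply carried over from above:

# sketch only (not run): a dfm with unigrams and bigrams
toks <- tokens(corpus(df$text), remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords('en'))
toks <- tokens_ngrams(toks, n = 1:2)      # keep unigrams, add bigrams
dfm_ngrams <- dfm(toks)
dfm_ngrams <- dfm_trim(dfm_ngrams, min_termfreq = 5, min_docfreq = 3)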

Modeling

I will fit a Naive Bayes model, as these are extremely common for NLP tasks and, thanks to their conditional-independence assumption, scale well to large numbers of features.

nb <- textmodel_nb(train_dfm, train_target, prior = 'docfreq', distribution = 'multinomial')
predictions <- predict(nb, test_dfm)

print(table(predicted = predictions, actual = test_target))
##          actual
## predicted   0   1
##         0 376   8
##         1  13 823
mean(predictions == test_target)
## [1] 0.9827869

Our model achieves an accuracy of 98.3% on the test set, misclassifying only 21 of the 1,220 test messages, a great result!
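
Since accuracy alone can hide which kind of mistake the model makes, the confusion matrix above can also be summarized as precision and recall for the spam class (label 0); a quick sketch:

# sketch: precision and recall for the spam class (target == 0)
cm <- table(predicted = predictions, actual = test_target)
precision_spam <- cm["0", "0"] / sum(cm["0", ])  # of messages predicted spam, how many were spam
recall_spam    <- cm["0", "0"] / sum(cm[, "0"])  # of actual spam, how many were caught

From the matrix shown above, these work out to roughly 0.98 precision and 0.97 recall for spam.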