library(readtext)
library(dplyr)
# Read every message in the ham directories as plain text
dfham1 <- readtext('20021010_easy_ham/easy_ham/*')
dfham2 <- readtext('20021010_hard_ham/hard_ham/*')
dfham3 <- readtext('20030228_easy_ham_2/easy_ham_2/*')
# Read every message in the spam directories
dfspam1 <- readtext('20021010_spam/spam/*')
dfspam2 <- readtext('20030228_spam_2/spam_2/*')
dfham <- bind_rows(dfham1, dfham2, dfham3)
dfspam <- bind_rows(dfspam1, dfspam2)
dfham <- dfham %>% mutate(target = 1) %>% select(text, target)
dfspam <- dfspam %>% mutate(target = 0) %>% select(text, target)
df <- bind_rows(dfham, dfspam)
We read in the text using the readtext library, which reads every file in these directories as plain text, a good fit for this file layout. We then bind all of the ham and spam together, label ham as 1 and spam as 0, and combine everything into a single dataframe. Now we can begin to process the text and prepare it for modeling.
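As a quick sanity check (an optional step, not part of the original pipeline), we can confirm how many documents ended up in each class:
# 1 = ham, 0 = spam
table(df$target)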
I will transform all text to lower case, remove punctuation, remove stopwords (overly frequent words that carry little information), and drop rare terms. Then I will build a document-term matrix that counts how often each term appears in each document, and apply a tf-idf transformation to down-weight terms that appear across many documents.
library(quanteda.textmodels)
## Warning: package 'quanteda.textmodels' was built under R version 3.6.3
library(quanteda)
## Warning: package 'quanteda' was built under R version 3.6.3
## Package version: 2.0.1
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
library(stringr)
set.seed(42)
# Replace punctuation with spaces before tokenizing
df$text <- df$text %>% str_replace_all("[[:punct:]]", " ")
clean_corpus <- function(df){
  # Build a corpus, lower-case the tokens, and drop English stopwords
  corp <- corpus(df$text)
  dfm1 <- dfm(corp, tolower = TRUE, remove = stopwords('en'))
  # Drop rare terms: keep terms occurring at least 5 times in total and in at least 3 documents
  dfm1 <- dfm_trim(dfm1, min_termfreq = 5, min_docfreq = 3)
  # Apply tf-idf weighting to down-weight terms that appear in many documents
  dfm1 <- dfm_tfidf(dfm1, scheme_tf = 'count', scheme_df = 'inverse', base = 10, force = FALSE)
  dfm1
}
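To make the tf-idf weighting inside clean_corpus concrete, here is a small toy illustration (a hypothetical example, separate from the pipeline): with the 'inverse' scheme, a word that occurs in every document gets an inverse document frequency of log10(3/3) = 0 and is zeroed out, while words confined to a single document keep the largest weights.
# Toy corpus of three tiny documents
toy_dfm <- dfm(corpus(c("the cat sat", "the dog sat", "the cat ran")))
# "the" appears in all 3 documents, so its tf-idf weight becomes 0;
# "dog" and "ran" appear in only 1 document each, so they are weighted most heavily
dfm_tfidf(toy_dfm, scheme_tf = 'count', scheme_df = 'inverse', base = 10)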
dfm1 <- clean_corpus(df)
# Shuffle the rows so ham and spam are mixed before the train/test split
index <- sample(nrow(df))
dfm1 <- dfm1[index,]
df <- df[index,]
train_index <- 1:round(.8 * nrow(df))
test_index <- (round(.8 * nrow(df)) + 1):nrow(df)
train_dfm <- dfm1[train_index,]
train_target <- df$target[train_index]
test_dfm <- dfm1[test_index,]
test_target <- df$target[test_index]
Now the text is in a document-term matrix, which means it is ready to be modeled. Another preprocessing step we could consider is transforming the text into n-grams; however, this would cause the number of features to explode.
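For reference, the sketch below (hypothetical code, not run as part of this analysis; toks, toks_ng, and dfm_ng are illustrative names) shows how unigram-plus-bigram features could be built with quanteda; comparing nfeat() of the result against the unigram dfm makes the feature explosion concrete.
# Sketch only: unigram + bigram features
toks <- tokens(df$text)
toks_ng <- tokens_ngrams(toks, n = 1:2)  # keep unigrams and add bigrams
dfm_ng <- dfm(toks_ng, tolower = TRUE)
nfeat(dfm_ng)  # far larger than nfeat(dfm1)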
I will fit a Naive Bayes model, as these are very commonly used for NLP tasks and scale well to large numbers of features thanks to their conditional independence assumption.
nb <- textmodel_nb(train_dfm, train_target, prior = 'docfreq', distribution = 'multinomial')
predictions <- predict(nb, test_dfm)
print(table(predicted = predictions, actual = test_target))
## actual
## predicted 0 1
## 0 376 8
## 1 13 823
mean(predictions == test_target)
## [1] 0.9827869
Our model achieved an accuracy of 98.3% on the test set, misclassifying only 21 of the 1220 test documents, a great result!
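As a sanity check, the same figure follows directly from the confusion matrix, since the correct predictions sit on the diagonal.
# (376 true spam + 823 true ham) out of all 1220 test documents
(376 + 823) / (376 + 8 + 13 + 823)
## [1] 0.9827869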