Analyze a corpus of emails classified as spam or legitimate (dubbed “ham”). Develop a predictive process for classifying new email correctly.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
## Loading required package: lattice
The following describes our general approach towards a predictive model for classifying emails:
1. Utilize the Tidyverse toolkit to clean and tokenize the dataset.
2. Create a Document Term Matrix (DTM) to describe the frequency of terms in the dataset.
3. Visualize descriptive findings with bar charts and word clouds.
4. Apply a machine learning model to develop a predictive classification model.
Tidyverse tools for text mining enable data input, cleansing, and tokenization. Tidy structures can additionally be cast to a DTM.
Reference: Text Mining with R: A Tidy Approach.
A key data structure in text mining is the DTM. DTMs serve as the input format for many machine learning models.
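To make the structure concrete, here is a minimal sketch in base R that builds a tiny document-term matrix from a tidy table of tokens. The document ids and words are made up for illustration, and table() is only a stand-in: the analysis itself casts the matrix with tidytext's cast_dtm().

```r
# Toy tidy token table: one row per (document, word) occurrence.
# The ids and words are hypothetical, for illustration only.
tokens <- data.frame(
  id   = c("d1", "d1", "d2", "d2", "d2"),
  word = c("free", "prize", "meeting", "free", "free"),
  stringsAsFactors = FALSE
)

# Cross-tabulate documents (rows) against terms (columns).
toy_dtm <- table(doc = tokens$id, term = tokens$word)
print(toy_dtm)

# Each cell holds how often a term occurs in a document.
toy_dtm["d2", "free"]
```

Most cells in a real DTM are zero, which is why the matrices later in this report are stored in a sparse format.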
We classify emails by employing a Support Vector Machine (SVM) model from the caret package.
As part of our approach, we tried additional text mining tools and methods in order to familiarize ourselves with other prevalent methodologies and inform our eventual process. As part of our experimentation, we:
- built a corpus with the tm package to demonstrate the recommended approach;
- used the quanteda package to enable more efficient processing; and
- attempted basic sentiment analysis of the corpus.
The processing begins with a Tidy approach. We first load the spam and ham emails from their respective directories. We observed that the readr::read_file() function is substantially faster and syntactically simpler than base::readLines(). We process at most 500 emails from each class, a pragmatic limit imposed by compute resources.
# Directories for emails
dir_spam <- "data/spam"
dir_ham <- "data/easy_ham"
# File lists
fils_ham <- list.files(dir_ham, full.names = TRUE)
fils_spam <- list.files(dir_spam, full.names = TRUE)
# Filter file count
max_fils <- 500
fils_ham <- fils_ham[1:(min(max_fils, length(fils_ham)))]
fils_spam <- fils_spam[1:(min(max_fils, length(fils_spam)))]
# Function for input
get_text_df <- function(files, class) {
raw_email <- sapply(files, function(x) read_file(x, locale=locale(encoding="latin1")))
# Pick off key value on end of file names
id <- str_extract(files, "\\w*$")
# Corpus in tidy format. Each row is a document.
text_df <- tibble(id = id, class = as.factor(class), email = raw_email)
return(text_df)
}
# Input
ham_df <- get_text_df(fils_ham, "ham")
spam_df <- get_text_df(fils_spam, "spam")
# Merge the documents into one corpus
email_df <- rbind(ham_df, spam_df)
# Free some memory
rm(list = c("ham_df", "spam_df"))
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 2046181 109.3 3908864 208.8 NA 3726537 199.1
## Vcells 3719032 28.4 8388608 64.0 32768 8388382 64.0
We start cleaning by removing headers. By specification, an email header block may not contain blank lines, so the first blank line marks the start of the body; we apply a regular expression to strip everything before it.
Notes on the regular expression:
- .*?\n\n: The question mark enables non-greedy matching of any text, so the pattern matches only up to the first double line break. The solution isn't perfect: there are some headers with multiple instances of blank lines. Nevertheless, most of the headers are removed and the solution appears adequate for our purposes.
- In our experience, stringr performed worse than base R for this string manipulation. However, our base R code requires more testing to fix a bug that allows some headers through.
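The header-stripping code is not shown at this point, so the following is only a sketch of the idea under stated assumptions: it uses base R's sub() with perl = TRUE, and the sample email is invented. The (?s) flag is added so that the dot also matches line breaks under the PCRE engine.

```r
# A made-up raw email: header lines, a blank line, then the body.
raw_email <- paste(
  "From: sender@example.com",
  "Subject: Hello",
  "",
  "First body paragraph.",
  "",
  "Second body paragraph.",
  sep = "\n"
)

# (?s) lets '.' match newlines; '.*?' is non-greedy, so the match
# stops at the FIRST blank line rather than the last one.
body <- sub("(?s)^.*?\\n\\n", "", raw_email, perl = TRUE)
cat(body)
```

With a greedy `.*` instead, the match would run to the last blank line and the first body paragraph would be lost.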
Tokenization creates a tall, narrow dataframe with one token per row.
We get some cleaning for free from the unnest_tokens() function, which:
- removes punctuation;
- converts to lower case; and
- removes white space.
We also anti-join a stop word list to remove stop words.
Notes on possibilities for additional cleaning:
- Stemming: we can enable improved analysis by reducing words to their root form, or stem. This would have the additional benefit of reducing the size of the DTM.
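The cleaning steps above can be sketched in base R to show what each one does. This is an illustration only, not how tidytext implements them: the sentence and the three-word stop list are hypothetical, and the real pipeline uses unnest_tokens() with the much longer stop_words table.

```r
email <- "Click HERE to win a FREE prize!"

# Replace punctuation with spaces, lower-case, and split on white space.
tokens <- strsplit(tolower(gsub("[[:punct:]]+", " ", email)), "\\s+")[[1]]
tokens <- tokens[nzchar(tokens)]

# Hypothetical miniature stop word list.
stop_list <- c("to", "a", "the")

# Keeping only rows with no match in the stop list is what an
# anti-join does on the tidy token table.
kept <- tokens[!tokens %in% stop_list]
print(kept)
```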
# Look at 2 emails for validation
# dplyr retrieval is easy, as is filtering and aggregation
email_df %>%
select(email) %>%
sample_n(2)
## email
## 1 --==_Exmh_267413022P\nContent-Type: text/plain; charset=us-ascii\n\n> From: Anders Eriksson <aeriksson@fastmail.fm>\n> Date: Thu, 22 Aug 2002 20:23:17 +0200\n>\n> \n> Oooops!\n> \n> Doesn't work at all. Got this on startup and on any attempt to change folde\n> r (which fail)\n\n~sigh~ I'd already found that and checked it in....apparently I did so after \nyou checked it out and before you sent this mail...I hoped I was fast enough \nthat you wouldn't see it.\n\nTry again!\n\nChris\n\n-- \nChris Garrigues http://www.DeepEddy.Com/~cwg/\nvirCIO http://www.virCIO.Com\n716 Congress, Suite 200\nAustin, TX 78701\t\t+1 512 374 0500\n\n World War III: The Wrong-Doers Vs. the Evil-Doers.\n\n\n\n\n--==_Exmh_267413022P\nContent-Type: application/pgp-signature\n\n-----BEGIN PGP SIGNATURE-----\nVersion: GnuPG v1.0.6 (GNU/Linux)\nComment: Exmh version 2.2_20000822 06/23/2000\n\niD8DBQE9ZUlVK9b4h5R0IUIRAr4LAJ9Mhzgw03dF2qiyqtMks72364uaqwCeJxp1\n23jNAVlrHHIDRMvMPXnfzoE=\n=HErg\n-----END PGP SIGNATURE-----\n\n--==_Exmh_267413022P--\n\n\n\n_______________________________________________\nExmh-workers mailing list\nExmh-workers@redhat.com\nhttps://listman.redhat.com/mailman/listinfo/exmh-workers\n\n
## 2 \n//this function should print all numbers up to 100...\n\nvoid print_nums()\n{\n int i;\n\n for(i = 0; i < 10l; i++) {\n printf("%d\\n",i);\n }\n\n}\n\n
# Tokenize all emails
email_tokens_df <- email_df %>%
unnest_tokens(word, email) %>%
anti_join(stop_words)
## Joining, by = "word"
## # A tibble: 11,972 x 2
## word n
## <chr> <int>
## 1 3d 3532
## 2 font 2718
## 3 br 1810
## 4 td 1493
## 5 http 1407
## 6 size 1055
## 7 tr 924
## 8 color 788
## 9 width 787
## 10 1 638
## # … with 11,962 more rows
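We calculate term frequency-inverse document frequency (TF-IDF) to find the terms that best distinguish each class. Because bind_tf_idf() is given class as the document column, each class is treated as a single document in this calculation.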
email_tf_idf <- email_tokens_df %>%
count(class, word, sort = TRUE) %>%
ungroup() %>%
bind_tf_idf(word, class, n)
email_tf_idf %>%
arrange(-tf_idf)
## # A tibble: 13,669 x 6
## class word n tf idf tf_idf
## <fct> <chr> <int> <dbl> <dbl> <dbl>
## 1 spam td 1493 0.0250 0.693 0.0173
## 2 spam tr 924 0.0155 0.693 0.0107
## 3 spam align 532 0.00891 0.693 0.00618
## 4 spam height 450 0.00754 0.693 0.00522
## 5 spam border 391 0.00655 0.693 0.00454
## 6 spam img 315 0.00527 0.693 0.00366
## 7 spam arial 308 0.00516 0.693 0.00358
## 8 ham rpm 94 0.00439 0.693 0.00304
## 9 spam span 249 0.00417 0.693 0.00289
## 10 ham exmh 86 0.00402 0.693 0.00278
## # … with 13,659 more rows
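The constant idf of 0.693 in the table above follows from treating each class as one document. As a worked check, assuming tidytext's definition of idf as the natural log of (number of documents / number of documents containing the term), the first row can be reproduced by hand:

```r
# Value read from the first row of the output above (class spam, word "td").
tf <- 0.0250          # share of spam tokens that are "td"

# With class as the document, there are only two "documents".
n_docs <- 2
docs_with_term <- 1   # "td" appears in spam only

idf <- log(n_docs / docs_with_term)  # natural log
tf_idf <- tf * idf

round(idf, 3)      # 0.693
round(tf_idf, 4)   # 0.0173

# A word used in both classes gets idf = log(2 / 2) = 0,
# so its tf-idf is zero no matter how frequent it is.
log(2 / 2)
```

One consequence is that this table ranks only words exclusive to one class; words shared by ham and spam carry no weight here.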
email_tf_idf %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(class) %>%
top_n(15) %>%
ungroup() %>%
ggplot(aes(word, tf_idf, fill = class)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~class, ncol = 2, scales = "free") +
coord_flip()
## Selecting by tf_idf
We make our first departure from Tidy formats to a format readily utilizable by ML models. We cast counts to a DTM, then inspect and reduce some of the sparseness.
# Word counts
word_counts <- email_tokens_df %>%
count(id, word, sort = TRUE) %>%
ungroup()
# Cast to a document term matrix
email_dtm <- word_counts %>%
cast_dtm(id, word, n)
# Print
dim(email_dtm)
## [1] 395 11972
## <<DocumentTermMatrix (documents: 395, terms: 11972)>>
## Non-/sparse entries: 36712/4692228
## Sparsity : 99%
## Maximal term length: 89
## Weighting : term frequency (tf)
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 2/23
## Sparsity : 92%
## Maximal term length: 6
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs æ arial class guides sc
## 7160290efcb7320ac8852369a695bcaf 0 0 0 0 0
## 770f0e7b8378a47a945043434f6f43df 0 0 0 0 0
## 829bab9379cfe32fe4b5af15ca99361b 0 2 0 0 0
## 8ff64b5c77f9c9618bd7b119ae14c8b2 0 0 0 0 0
## f97a14d667569ebbc0502bb2c7beec27 0 2 0 0 0
# inspect() does the work of printing the object and a matrix conversion
# Repeat the review to show the comparison
email_dtm
## <<DocumentTermMatrix (documents: 395, terms: 11972)>>
## Non-/sparse entries: 36712/4692228
## Sparsity : 99%
## Maximal term length: 89
## Weighting : term frequency (tf)
## [1] 395 11972
## Terms
## Docs æ arial sc guides class
## 7160290efcb7320ac8852369a695bcaf 0 0 0 0 0
## 770f0e7b8378a47a945043434f6f43df 0 0 0 0 0
## 829bab9379cfe32fe4b5af15ca99361b 0 2 0 0 0
## 8ff64b5c77f9c9618bd7b119ae14c8b2 0 0 0 0 0
## f97a14d667569ebbc0502bb2c7beec27 0 2 0 0 0
# Remove some sparse terms and review again using inspect()
email_dtm_rm_sparse <- removeSparseTerms(email_dtm, 0.99)
dim(email_dtm_rm_sparse)
## [1] 395 1983
## <<DocumentTermMatrix (documents: 395, terms: 1983)>>
## Non-/sparse entries: 23434/759851
## Sparsity : 97%
## Maximal term length: 65
## Weighting : term frequency (tf)
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 2/23
## Sparsity : 92%
## Maximal term length: 6
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs 50 left redhat select server
## 7160290efcb7320ac8852369a695bcaf 0 0 0 0 0
## 770f0e7b8378a47a945043434f6f43df 0 0 0 0 0
## 829bab9379cfe32fe4b5af15ca99361b 0 0 0 0 5
## 8ff64b5c77f9c9618bd7b119ae14c8b2 0 0 0 0 0
## f97a14d667569ebbc0502bb2c7beec27 0 0 0 0 5
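The sparsity figures reported above can be checked directly from the printed cell counts. This is a small arithmetic sketch; the helper name sparsity() is ours and is not part of tm.

```r
# Fraction of cells that are zero, from the non-/sparse entry counts.
sparsity <- function(non_sparse, sparse) sparse / (non_sparse + sparse)

before <- sparsity(36712, 4692228)   # 395 x 11972 matrix
after  <- sparsity(23434, 759851)    # 395 x 1983 matrix

# The cell counts add up to documents x terms in both cases.
stopifnot(36712 + 4692228 == 395 * 11972)
stopifnot(23434 + 759851 == 395 * 1983)

round(before, 3)   # 0.992
round(after, 3)    # 0.97
```

Dropping the rarest terms cuts the vocabulary from 11,972 to 1,983 columns while overall sparsity only falls from about 99% to about 97%, so the matrix remains mostly zeros.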
We segment the data into different sets for training and testing.
ham_cnt <- nrow(subset(email_df, class == "ham"))
spam_cnt <- nrow(subset(email_df, class == "spam"))
set.seed(2012)
train_indicator_ham <- rbinom(n = ham_cnt, size = 1, prob = .5)
train_indicator_spam <- rbinom(n = spam_cnt, size = 1, prob = .5)
# We discarded the split corpora when we merged them. Split them again
# to assign a training indicator
email_df_ham <- email_df %>%
filter(class == "ham") %>%
mutate(train_indicator = train_indicator_ham)
email_df_spam <- email_df %>%
filter(class == "spam") %>%
mutate(train_indicator = train_indicator_spam)
email_df <- rbind(email_df_ham, email_df_spam)
table(email_df$class, email_df$train_indicator)
##
## 0 1
## ham 100 99
## spam 99 101
# We need to add the class into the matrix
email_dtm_df <- as.data.frame(as.matrix(email_dtm_rm_sparse))
# The matrix conversion puts the id column into row names. The tibble package
# has a function for moving that into a column. There is a term for "id" in the
# matrix, so we need a synonym. Since unnest_tokens() removed punctuation, we can use
# punctuation in the variable name and be sure of no collision
email_dtm_df <- email_dtm_df %>%
rownames_to_column(var = "doc_id")
# Join the dtm to the email data frame to pick up the class. Don't need
# the document id or the raw email, though
email_dtm_df <- inner_join(email_df, email_dtm_df, by = c("id" ="doc_id")) %>%
select(-c("id", "email.x"))
# Build the train and test sets. Drop the indicators when done
train_dtm <- subset(email_dtm_df, train_indicator == 1)
train_dtm <- train_dtm %>% select(-"train_indicator")
test_dtm <- subset(email_dtm_df, train_indicator == 0)
test_dtm <- test_dtm %>% select(-"train_indicator")
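The rbinom() indicator used above assigns each email by an independent coin flip, so the train and test sets are only approximately equal in size. The sketch below contrasts that with one possible alternative that forces an exact half split; the class size of 200 and the variable names are hypothetical.

```r
set.seed(2012)
n <- 200  # hypothetical number of emails in one class

# Independent coin flips, as in the code above: group sizes vary by chance.
coin_flip <- rbinom(n = n, size = 1, prob = 0.5)
table(coin_flip)

# Alternative: shuffle a balanced vector to get exactly n / 2 per group.
exact <- sample(rep(c(0, 1), length.out = n))
table(exact)
```

Either approach is applied per class, which keeps the ham/spam ratio similar in the training and test sets.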
We build the predictive model as a supervised learning task. We employ an SVM for classification. We train the model on a subset of the DTM and test its ham/spam predictions on the remainder.
We evaluate the model with confusion matrices. First, we examine the confusion matrix for the training set as a sanity check. The model classifies every training email correctly, which is unsurprising because it was fit to this data.
The confusion matrix for the test set shows high accuracy with a low p-value. One run on the test set was 97% accurate with a p-value near zero. The positive class is ham. In that run, 2 of 247 ham emails were misclassified as spam (false negatives) and 11 of 248 spam emails were misclassified as ham (false positives). In the run shown below, accuracy is about 95%: 1 of 100 ham emails is misclassified as spam and 9 of 99 spam emails are misclassified as ham.
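The model-fitting code is not shown in this section. As a rough sketch of what such a step could look like, the example below fits a linear SVM with caret on made-up term counts. The method "svmLinear" (which relies on the kernlab package), the fixed cost C = 1, the disabled resampling, and all data and variable names are assumptions for illustration, not the settings used for the results reported here.

```r
library(caret)

set.seed(2012)
n <- 40

# Toy stand-in for a document-term data frame: counts of two terms.
toy_terms <- data.frame(
  free  = c(rpois(n, 4), rpois(n, 0.2)),   # frequent in "spam" rows
  patch = c(rpois(n, 0.2), rpois(n, 4))    # frequent in "ham" rows
)
toy_class <- factor(rep(c("spam", "ham"), each = n))

# Fit a linear SVM with a fixed cost and no resampling.
svm_fit <- train(
  x = toy_terms, y = toy_class,
  method    = "svmLinear",
  trControl = trainControl(method = "none"),
  tuneGrid  = data.frame(C = 1)
)

# Predicted classes, ready to pass to confusionMatrix().
toy_pred <- predict(svm_fit, newdata = toy_terms)
table(prediction = toy_pred, reference = toy_class)
```

Using the x/y interface rather than a formula avoids problems with term columns whose names are not valid R identifiers, such as "50".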
svm_confusion_m_train <- confusionMatrix(predict_train, train_dtm$class.x)
svm_confusion_m <- confusionMatrix(predict_test, test_dtm$class.x)
svm_confusion_m_train
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 99 0
## spam 0 101
##
## Accuracy : 1
## 95% CI : (0.9817, 1)
## No Information Rate : 0.505
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.000
## Specificity : 1.000
## Pos Pred Value : 1.000
## Neg Pred Value : 1.000
## Prevalence : 0.495
## Detection Rate : 0.495
## Detection Prevalence : 0.495
## Balanced Accuracy : 1.000
##
## 'Positive' Class : ham
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 99 9
## spam 1 90
##
## Accuracy : 0.9497
## 95% CI : (0.9095, 0.9756)
## No Information Rate : 0.5025
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8995
##
## Mcnemar's Test P-Value : 0.02686
##
## Sensitivity : 0.9900
## Specificity : 0.9091
## Pos Pred Value : 0.9167
## Neg Pred Value : 0.9890
## Prevalence : 0.5025
## Detection Rate : 0.4975
## Detection Prevalence : 0.5427
## Balanced Accuracy : 0.9495
##
## 'Positive' Class : ham
##
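The headline statistics in the test-set output above can be recomputed from the four cell counts, which helps when reading them with ham as the positive class. This is a base R sketch; caret's confusionMatrix() reports the same quantities.

```r
# Test-set counts from the output above (rows = prediction, cols = reference).
cm <- matrix(c(99, 1, 9, 90), nrow = 2,
             dimnames = list(prediction = c("ham", "spam"),
                             reference  = c("ham", "spam")))

accuracy    <- sum(diag(cm)) / sum(cm)               # all correct / all emails
sensitivity <- cm["ham", "ham"] / sum(cm[, "ham"])   # ham correctly kept
specificity <- cm["spam", "spam"] / sum(cm[, "spam"]) # spam correctly caught
ppv         <- cm["ham", "ham"] / sum(cm["ham", ])   # predicted ham that is ham

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv), 4)
# accuracy 0.9497, sensitivity 0.9900, specificity 0.9091, ppv 0.9167
```

The lower specificity means the typical error in this run is spam reaching the inbox rather than legitimate mail being discarded.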
## Package version: 1.5.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following objects are masked from 'package:tm':
##
## as.DocumentTermMatrix, stopwords
## The following object is masked from 'package:utils':
##
## View
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ purrr 0.3.2 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ NLP::annotate() masks ggplot2::annotate()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
There are only a few major packages focused on text mining in R, including tm, tidytext, corpus, and koRpus. While we found tidytext to be the most user-friendly and comprehensive, we also learned the ins and outs of the tm package and VCorpus, despite our file format not exactly lining up with any of the tm examples online. See below for what we found to be the most flexible grammars for working with tm.
# Create a corpus from many files
ham_rawc <- VCorpus(DirSource("data/easy_ham/")) # Use DirSource, VCorpus is sensitive to file endings
spam_rawc <- VCorpus(DirSource("data/spam/")) # Use DirSource, VCorpus is sensitive to file endings
# Use the meta class in VCorpus
# to tag each ham and spam corpus
# (also true for other corpus objects)
ham_corpus <- ham_rawc
spam_corpus <- spam_rawc
meta(ham_corpus, tag="class", type ="corpus") <- "ham"
meta(spam_corpus, tag="class", type ="corpus") <- "spam"
rm(list = c("ham_rawc", "spam_rawc")) # remove objects
gc() # garbage collection
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 2637175 140.9 3908864 208.8 NA 3908864 208.8
## Vcells 14062650 107.3 25064332 191.3 32768 25064332 191.3
# Speed up processing by collapsing the lines in the
# line-by-line storage format of the VCorpus import
# Always use content_transformer to transform VCorpus
# content due to its unique structure
# The collapse argument of paste has the added benefit of
# controlling line break codes that are altered by unzipping
# files on a Windows machine, allowing us to remove the Windows
# \\r\\n sequence from the regexes
collapse_lines <- content_transformer(function(x) paste(x, collapse="\n"))
# tm_map is also useful for mapping operations to the
# location of corpus content
ham_corpus <- tm_map(ham_corpus, collapse_lines)
spam_corpus <- tm_map(spam_corpus, collapse_lines)
# Check object contents if desired
#ham_corpus[[1]]$content[1]
Exporting metadata from VCorpus to quanteda is a special process, so note that exporting at this point is an option.
q_corp_ham <- tm::VCorpus(tm::VectorSource(ham_corpus))
q_corp_ham <- corpus(q_corp_ham)
q_corp_spam <- tm::VCorpus(tm::VectorSource(spam_corpus))
q_corp_spam <- corpus(q_corp_spam)
# Try out various sample code online
# Disable until tf-idf weighting can be reintegrated with training
# set parameters
#set.seed(123)
#train_prop <- 0.7 # % of data to use for training
# prepare data
#names(df) <- c("Id", "Class", "Text") # add column labels
#df <- df[sample(nrow(df)),] # randomize data
#df <- df %>% filter(Text != '') %>% filter(Class != '') # filter blank data
# create document corpus
#df_corpus <- corpus(df$Text) # convert Text to corpus
#docvars(df_corpus, field="Class") <- factor(df$Class, ordered=TRUE) # add classification label as docvar
#df_dfm <- dfm(df_corpus, tolower = TRUE)
#df_dfm <- dfm_wordstem(df_dfm)
#df_dfm <- dfm_trim(df_dfm, min_termfreq = 5, min_docfreq = 3)
# tf-idf weighting
#df_dfm <- dfm_tfidf(df_dfm, scheme_tf = "count", scheme_df = "inverse", force = TRUE)
#size <- dim(df)
#train_end <- round(train_prop*size[1])
#test_start <- train_end + 1
#test_end <- size[1]
#df_train <- df[1:train_end,]
#df_test <- df[test_start:test_end,]
#df_dfm_train <- df_dfm[1:train_end,]
#df_dfm_test <- df_dfm[test_start:test_end,]
#glimpse(df_dfm_train)
# Remove Quanteda until we need them
rm(list = c("q_corp_ham", "q_corp_spam")) # remove objects
gc() # garbage collection
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 2643548 141.2 3908864 208.8 NA 3908864 208.8
## Vcells 14098494 107.6 25064332 191.3 32768 25064332 191.3
Set docvars prior to cleaning the headers
# Setting the Metadata at the document level looks a little
# different than setting the metadata at the corpora level
# Since we are altering the metadata and not content
# we use a simple for loop to access the documents
# and extract data from the body which we hope will help
# classify our documents later
# We set these up to contain one rule per line so we can easily
# turn on or off by commenting them out as needed
set_doc_vars <- function(x) {
for(i in seq(1, length(x))){
doc_content <-x[[i]]$content
x[[i]]$meta["date"] <- str_extract(doc_content, "(?<=Date:)([^\\n]+)")
x[[i]]$meta["to"] <- str_extract(doc_content, "(?<=To:)([^\\n]+)")
x[[i]]$meta["from"] <- str_extract(doc_content, "(?<=From:)([^\\n]+)")
x[[i]]$meta["subject"] <- str_extract(doc_content, "(?<=Subject:)([^\\n]+)")
}
return(x)
}
ham_corpus <- set_doc_vars(ham_corpus)
spam_corpus <- set_doc_vars(spam_corpus)
# Additional regexes were removed to simplify the documents for testing
# Follow the code style recommendation in the VCorpus documentation:
# to_space <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
c_ham <- tm_map(ham_corpus, content_transformer(function(x) sub(".*?\n\n", "", x)))
c_spam <- tm_map(spam_corpus, content_transformer(function(x) sub(".*?\n\n", "", x)))
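The lookbehind patterns in set_doc_vars() can be tried on a single string. The sketch below uses base R's regexpr() and regmatches() with perl = TRUE instead of stringr, and the email text is invented; the trimws() call is our addition, since the captured value keeps the space after the colon.

```r
doc <- paste(
  "From: Steve Burt <steve.burt@example.com>",
  "Date: Thu, 22 Aug 2002 12:46:18 +0100",
  "Subject: RE: Alexander",
  "",
  "Body text",
  sep = "\n"
)

# (?<=Date:) is a lookbehind: match only text preceded by "Date:",
# without including "Date:" itself. [^\n]+ stops at the end of the line.
m <- regexpr("(?<=Date:)[^\\n]+", doc, perl = TRUE)
date_raw <- regmatches(doc, m)

# The raw match starts with the space that follows the colon.
date_clean <- trimws(date_raw)
print(date_clean)
```

Because the pattern is not anchored to the start of a line, a rule such as (?<=To:) can also match inside a "Reply-To:" header; prefixing the pattern with (?m)^ is one way to guard against that.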
Great text processing ideas come from many places since text processing challenges tend to be ubiquitous. We got this function directly from https://www.datacamp.com/community/tutorials/R-nlp-machine-learning
# TODO: expand contractions
fix_contractions <- function(doc) {
# "won't" is a special case as it does not expand to "wo not"
doc <- gsub("won't", "will not", doc)
doc <- gsub("can't", "can not", doc)
doc <- gsub("n't", " not", doc)
doc <- gsub("'ll", " will", doc)
doc <- gsub("'re", " are", doc)
doc <- gsub("'ve", " have", doc)
doc <- gsub("'m", " am", doc)
doc <- gsub("'d", " would", doc)
# 's could be 'is' or could be possessive: it has no expansion
doc <- gsub("'s", "", doc)
return(doc)
}
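The order of the rules matters: the special cases have to run before the general "n't" rule. A short illustration on an invented sentence:

```r
x <- "I won't go, and they can't stay"

# General rule alone: the special cases expand incorrectly.
wrong <- gsub("n't", " not", x)
print(wrong)   # "I wo not go, and they ca not stay"

# Special cases first, then the general rule.
right <- gsub("won't", "will not", x)
right <- gsub("can't", "can not", right)
right <- gsub("n't", " not", right)
print(right)   # "I will not go, and they can not stay"
```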
gsub, content_transformer, and tm_map
We separate the cleaning functions into one-line rules as much as possible so that we can turn them on or off during testing. In production, however, they would likely be grouped.
c_ham <- tm_map(c_ham, content_transformer(fix_contractions))
c_spam <- tm_map(c_spam, content_transformer(fix_contractions))
# Transform all to lowercase in order to capture more word frequencies
c_ham <- tm_map(c_ham, content_transformer(tolower))
# Remove urls
c_ham <- tm_map(c_ham, content_transformer(function(x) gsub("https?://[^ ]+", "", x)), lazy = TRUE)
# Clean some windows artifacts
##c_ham <- tm_map(c_ham, content_transformer(function(x) gsub("\\\\", "", x)), lazy = TRUE)
c_spam <- tm_map(c_spam, content_transformer(tolower))
c_spam <- tm_map(c_spam, content_transformer(function(x) gsub("https?://[^ ]+", "", x)))
##c_spam <- tm_map(c_spam, content_transformer(function(x) gsub("\\\\", "", x)))
# Create a temporary function for removing html tags, which tidytext will sometimes remove, but
# paste will not.
# NB: Our chosen spam files have more html than our ham files.
# While including html allows for more accurate identification of spam
# removing html will allow us to develop a model that identifies spam in non html text,
# or alternately by identifying ham mail that is in html should a company allow for html
c_ham <- tm_map(c_ham, content_transformer(function(x) gsub("<.*?>", "", x)), lazy = TRUE)
# Remove punctuation characters in general
c_ham <- tm_map(c_ham, content_transformer(function(x) gsub("[[:punct:]]+"," ", x)), lazy = TRUE)
# Requires more testing to prevent word concatenation
c_ham <- tm_map(c_ham, function(x) removePunctuation(x,
preserve_intra_word_contractions = FALSE,
preserve_intra_word_dashes = FALSE), lazy = TRUE)
# Normalize whitespace
#c_ham <- tm_map(c_ham, stripWhitespace, lazy = TRUE)
c_spam <- tm_map(c_spam, content_transformer(function(x) gsub("<.[^>]+>", "", x)))
c_spam <- tm_map(c_spam, content_transformer(function(x) gsub("[[:punct:]]+"," ", x)))
c_spam <- tm_map(c_spam, function(x) removePunctuation(x,
preserve_intra_word_contractions = FALSE,
preserve_intra_word_dashes = FALSE), lazy = TRUE)
#c_spam <- tm_map(c_spam, stripWhitespace, lazy = TRUE)
While removing numbers and stopwords will no doubt result in more robust classification over large corpora, doing so in this small dataset will likely yield a worse result. It is debatable whether removing stopwords and stemming also removes some of the grammatical signatures of spam; this depends on our method of text processing.
c_ham <- tm_map(c_ham, removeNumbers, lazy = TRUE)
c_ham <- tm_map(c_ham, removeWords, stopwords("english"), lazy = TRUE)
#c_ham <- tm_map(c_ham, stemDocument, lazy = TRUE)
c_spam <- tm_map(c_spam, removeNumbers, lazy = TRUE)
c_spam <- tm_map(c_spam, removeWords, stopwords("english"), lazy = TRUE)
#c_spam <- tm_map(c_spam, stemDocument, lazy = TRUE)
## ==ham corpus==
## <<VCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 11
## Content: chars: 1117
## ==ham corpus meta==
## $class
## [1] "ham"
##
## attr(,"class")
## [1] "CorpusMeta"
## ==ham corpus meta class==
## [1] "ham"
## ==ham corpus document 2 metadata==
## author : character(0)
## datetimestamp: 2019-11-17 20:04:16
## description : character(0)
## heading : character(0)
## id : 0002.b3120c4bcbf3101e661161ee7efcb8bf
## language : en
## origin : character(0)
## date : Thu, 22 Aug 2002 12:46:18 +0100
## to : zzzz@localhost.netnoteinc.com
## from : Steve Burt <steve.burt@cursor-system.com>
## subject : [zzzzteana] RE: Alexander
## ==ham document 1 content==
## [1] " date wed aug \n chris garrigues cwgdatedfaddeepeddycom\n messageid tmdadeepeddyvirciocom\n\n\n can reproduce error\n\n repeatable like every time without fail\n\n debug log pick happening \n\n pickit exec pick inbox list lbrace lbrace subject ftp rbrace rbrace sequence mercury\n exec pick inbox list lbrace lbrace subject ftp rbrace rbrace sequence mercury\n ftocpickmsgs hit\n marking hits\n tkerror syntax error expression int \n\nnote run pick command hand \n\ndelta pick inbox list lbrace lbrace subject ftp rbrace rbrace sequence mercury\n hit\n\n hit comes obviously version nmh \nusing \n\ndelta pick version\npick nmh compiled fuchsiacsmuozau sun mar ict \n\n relevant part mhprofile \n\ndelta mhparam pick\nseq sel list\n\n\nsince pick command works sequence actually \none explicit command line search popup \none comes mhprofile get created\n\nkre\n\nps still using version code form day ago \n able reach cvs repository today local routing issue think\n\n\n\n\nexmhworkers mailing list\nexmhworkersredhatcom\nhttpslistmanredhatcommailmanlistinfoexmhworkers\n"
## Length Class Mode
## 0001.ea7e79d3153e7469e7a9c3e0af6a357e 2 PlainTextDocument list
## 0002.b3120c4bcbf3101e661161ee7efcb8bf 2 PlainTextDocument list
## 0003.acfc5ad94bbd27118a0d8685d18c89dd 2 PlainTextDocument list
## 0004.e8d5727378ddde5c3be181df593f1712 2 PlainTextDocument list
## 0005.8c3b9e9c0f3f183ddaf7592a11b99957 2 PlainTextDocument list
## 0006.ee8b0dba12856155222be180ba122058 2 PlainTextDocument list
## <<VCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 11
## Content: chars: 1138
## $class
## [1] "ham"
##
## attr(,"class")
## [1] "CorpusMeta"
## author : character(0)
## datetimestamp: 2019-11-17 20:04:16
## description : character(0)
## heading : character(0)
## id : 0002.b3120c4bcbf3101e661161ee7efcb8bf
## language : en
## origin : character(0)
## date : Thu, 22 Aug 2002 12:46:18 +0100
## to : zzzz@localhost.netnoteinc.com
## from : Steve Burt <steve.burt@cursor-system.com>
## subject : [zzzzteana] RE: Alexander
## [1] "ham"
## [1] " date wed aug \n chris garrigues \n message id \n\n\n can reproduce error \n\n repeatable like every time without fail \n\n debug log pick happening \n\n pick exec pick inbox list lbrace lbrace subject ftp rbrace rbrace sequence mercury \n exec pick inbox list lbrace lbrace subject ftp rbrace rbrace sequence mercury\n ftoc pickmsgs hit \n marking hits\n tkerror syntax error expression int \n\nnote run pick command hand \n\ndelta pick inbox list lbrace lbrace subject ftp rbrace rbrace sequence mercury\n hit\n\n s hit comes obviously version nmh \nusing \n\ndelta pick version\npick nmh compiled fuchsia cs mu oz au sun mar ict \n\n relevant part mh profile \n\ndelta mhparam pick\n seq sel list\n\n\nsince pick command works sequence actually \none s explicit command line search popup \none comes mh profile get created \n\nkre\n\nps still using version code form day ago \n able reach cvs repository today local routing issue think \n\n\n\n \nexmh workers mailing list\nexmh workers redhat com\n"
## Length Class Mode
## 0001.ea7e79d3153e7469e7a9c3e0af6a357e 2 PlainTextDocument list
## 0002.b3120c4bcbf3101e661161ee7efcb8bf 2 PlainTextDocument list
## 0003.acfc5ad94bbd27118a0d8685d18c89dd 2 PlainTextDocument list
## 0004.e8d5727378ddde5c3be181df593f1712 2 PlainTextDocument list
## 0005.8c3b9e9c0f3f183ddaf7592a11b99957 2 PlainTextDocument list
## 0006.ee8b0dba12856155222be180ba122058 2 PlainTextDocument list
Quickly create a dataframe to check that our VCorpus is exportable to tidytext
make_df <- function() {
dfh <- data.frame(text = sapply(c(c_ham), as.character), stringsAsFactors = FALSE)
dfh$class <- "ham"
dfh$id <- rownames(dfh)
glimpse(dfh) # validate ham output
dfs <- data.frame(text = sapply(c(c_spam), as.character), stringsAsFactors = FALSE)
dfs$class <- "spam"
dfs$id <- rownames(dfs)
glimpse(dfs) # validate spam output
df <- rbind(dfh, dfs)
glimpse(df) # validate merge
df <- df %>%
mutate(class = as.factor(class))
return(df)
}
df <- make_df()
## Observations: 199
## Variables: 3
## $ text <chr> " date wed aug \n chris garrigues …
## $ class <chr> "ham", "ham", "ham", "ham", "ham", "ham", "ham", "ham", "h…
## $ id <chr> "0001.ea7e79d3153e7469e7a9c3e0af6a357e", "0002.b3120c4bcbf…
## Observations: 199
## Variables: 3
## $ text <chr> " date wed aug \n chris garr…
## $ class <chr> "spam", "spam", "spam", "spam", "spam", "spam", "spam", "s…
## $ id <chr> "0001.ea7e79d3153e7469e7a9c3e0af6a357e", "0002.b3120c4bcbf…
## Observations: 398
## Variables: 3
## $ text <chr> " date wed aug \n chris garrigues …
## $ class <chr> "ham", "ham", "ham", "ham", "ham", "ham", "ham", "ham", "h…
## $ id <chr> "0001.ea7e79d3153e7469e7a9c3e0af6a357e", "0002.b3120c4bcbf…
## Observations: 398
## Variables: 3
## $ text <chr> " date wed aug \n chris garrigues …
## $ class <fct> ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, ham…
## $ id <chr> "0001.ea7e79d3153e7469e7a9c3e0af6a357e", "0002.b3120c4bcbf…