It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/
Here are two short videos that you may find helpful.
The first video shows how to unzip the provided files.
Solutions:
Packages used
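Based on the functions called throughout the report, the analysis relies on at least the following packages; loading them up front keeps the rest of the code self-contained.
# Packages inferred from the function calls used below.
library(tidyverse)    # stringr, dplyr, purrr, readr, tibble, ggplot2
library(R.utils)      # bunzip2()
library(tm)           # Corpus, tm_map, DocumentTermMatrix, findFreqTerms
library(tidytext)     # unnest_tokens, bind_tf_idf, stop_words
library(SnowballC)    # wordStem
library(wordcloud)    # wordcloud()
library(RColorBrewer) # brewer.pal
library(DT)           # datatable, JS
library(e1071)        # naiveBayes
library(caret)        # confusionMatrix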
The spam and ham files are downloaded from spamassassin’s website:
getFiles <- function(files,baseurl)
{
for (i in 1: length(files))
{
fUrl <- str_c(baseurl,files[i])
download.file(fUrl, destfile = files[i])
Sys.sleep(1)
}
}
targetFiles <- c("20050311_spam_2.tar.bz2", "20030228_easy_ham_2.tar.bz2")
getFiles(targetFiles, "https://spamassassin.apache.org/old/publiccorpus/")
The downloaded files are extracted in two steps: first decompressed from .bz2, then unpacked from .tar.
for (i in 1: length(targetFiles))
{
bunzip2(targetFiles[i])
untar(tarfile = str_replace(targetFiles[i], ".bz2", ""))
}
list.dirs(path = ".", full.names = TRUE, recursive = TRUE)
## [1] "."
## [2] "./easy_ham_2"
## [3] "./rsconnect"
## [4] "./rsconnect/documents"
## [5] "./rsconnect/documents/Henry-document-classification.Rmd"
## [6] "./rsconnect/documents/Henry-document-classification.Rmd/rpubs.com"
## [7] "./rsconnect/documents/Henry-document-classification.Rmd/rpubs.com/rpubs"
## [8] "./spam_2"
length(list.files("./easy_ham_2"))
## [1] 1400
length(list.files("./spam_2"))
## [1] 1396
hamFiles <- list.files("./easy_ham_2")
spamFiles <- list.files("./spam_2")
head(hamFiles)
## [1] "00001.1a31cc283af0060967a233d26548a6ce"
## [2] "00002.5a587ae61666c5aa097c8e866aedcc59"
## [3] "00003.19be8acd739ad589cd00d8425bac7115"
## [4] "00004.b2ed6c3c62bbdfab7683d60e214d1445"
## [5] "00005.07b9d4aa9e6c596440295a5170111392"
## [6] "00006.654c4ec7c059531accf388a807064363"
head(spamFiles)
## [1] "00001.317e78fa8ee2f54cd4890fdc09ba8176"
## [2] "00002.9438920e9a55591b18e60d1ed37d992b"
## [3] "00003.590eff932f8704d8b0fcbe69d023b54d"
## [4] "00004.bdcc075fa4beb5157b5dd6cd41d8887b"
## [5] "00005.ed0aba4d386c5e62bc737cf3f0ed9589"
## [6] "00006.3ca1f399ccda5d897fecb8c57669a283"
(notHamFile <- hamFiles[str_detect(hamFiles, "^\\d+\\.[:alnum:]") == FALSE])
## character(0)
(notSpamFile <- spamFiles[str_detect(spamFiles, "^\\d+\\.[:alnum:]") == FALSE])
## character(0)
if (file.exists("./easy_ham_2/cmds")) file.remove("./easy_ham_2/cmds")
if (file.exists("./spam_2/cmds")) file.remove("./spam_2/cmds")cmds files have been deleted, and then check the length of the contenthamFiles <- list.files("./easy_ham_2")spamFiles <- list.files("./spam_2")length(spamFiles)## [1] 1396
hamSpamMails <- tibble()
hamSpamFolders <- c("./spam_2", "./easy_ham_2")
fileTypes <-c("spam", "ham")
allFileNames <- c(spamFiles, hamFiles)
for (i in 1:length(hamSpamFolders))
{
type <- fileTypes[i]
ff <- tibble(file = dir(hamSpamFolders[i], full.names = TRUE)) %>% mutate(text = map(file, read_lines)) %>%
transmute(id = basename(file), type = type, text) %>%
unnest(text)
hamSpamMails <- bind_rows(hamSpamMails, ff)
}
head(hamSpamMails)
## # A tibble: 6 x 3
## id type text
## <chr> <chr> <chr>
## 1 00001.317e78fa8ee2f54cd4~ spam From ilug-admin@linux.ie Tue Aug 6 11:~
## 2 00001.317e78fa8ee2f54cd4~ spam Return-Path: <ilug-admin@linux.ie>
## 3 00001.317e78fa8ee2f54cd4~ spam Delivered-To: yyyy@localhost.netnoteinc.~
## 4 00001.317e78fa8ee2f54cd4~ spam Received: from localhost (localhost [127~
## 5 00001.317e78fa8ee2f54cd4~ spam "\tby phobos.labs.netnoteinc.com (Postfi~
## 6 00001.317e78fa8ee2f54cd4~ spam "\tfor <jm@localhost>; Tue, 6 Aug 2002 ~
allEmails <- hamSpamMails
allEmails <- tibble::rowid_to_column(allEmails, "linenumber")
datatable(head(allEmails), class = 'cell-border stripe', options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#26868d', 'color': '#fff', 'text-align': 'center !important'});",
"$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
"}")
))
allEmails %>% group_by(type) %>% summarize(messages = n_distinct(id)) %>% ungroup()
## # A tibble: 2 x 2
## type messages
## <chr> <int>
## 1 ham 1400
## 2 spam 1396
Next, the text is wrapped in a tm corpus and cleaned with tm_map. Joining characters such as colons, hyphens, and apostrophes are first replaced with spaces so that the words they separate are not merged into a single token; then punctuation is removed, the corpus is converted to lower case, numbers are dropped, extra whitespace is stripped, and stop words are removed.
allEmails$text <- str_replace_all(allEmails$text, "[^[:graph:]]", " ")
(allCorpus <- Corpus(VectorSource(allEmails$text)))
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 328911
stripJoiners <- content_transformer(function(x, pattern) {return(gsub(pattern, " ", x))})
cleanMailCorpus <- tm_map(allCorpus, stripJoiners, "-")
cleanMailCorpus <- tm_map(cleanMailCorpus, stripJoiners, ":")
cleanMailCorpus <- tm_map(cleanMailCorpus, stripJoiners, "\\.")
cleanMailCorpus <- tm_map(cleanMailCorpus, stripJoiners, "'")
cleanMailCorpus <- tm_map(cleanMailCorpus, removePunctuation)
cleanMailCorpus <- tm_map(cleanMailCorpus, removeNumbers)
cleanMailCorpus <- tm_map(cleanMailCorpus, tolower)
cleanMailCorpus <- tm_map(cleanMailCorpus, stripWhitespace)
cleanMailCorpus <- tm_map(cleanMailCorpus, removeWords, stopwords("english"))
mailsDf <- data.frame(text = get("content", cleanMailCorpus))
allEmails$text <- as.character(mailsDf$text)
allMailsTokens <- allEmails %>%
unnest_tokens(output = word, input = text) %>%
# remove numbers
filter(!str_detect(word, "^[0-9]*$")) %>%
# remove stop words
anti_join(stop_words) %>%
# stem the words
mutate(word = SnowballC::wordStem(word))
## Joining, by = "word"
datatable(head(allMailsTokens), class = 'cell-border stripe', options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#26868d', 'color': '#fff', 'text-align': 'center !important'});",
"$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
"}")
))
cleanMailsTokensTfIdf <- allMailsTokens %>%
count(type, word) %>%
bind_tf_idf(term = word, document = type, n = n)
datatable(head(cleanMailsTokensTfIdf), class = 'cell-border stripe', options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#26868d', 'color': '#fff', 'text-align': 'center !important'});",
"$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
"}")
))
# Before weighting
hams <- allMailsTokens %>% group_by(type) %>% filter(type == "ham") %>% count(word, sort = TRUE) %>% ungroup()
spams <- allMailsTokens %>% group_by(type) %>% filter(type == "spam") %>% count(word, sort = TRUE) %>% ungroup()
head(hams)
## # A tibble: 6 x 3
## type word n
## <chr> <chr> <int>
## 1 ham org 10661
## 2 ham net 10536
## 3 ham id 10348
## 4 ham list 10014
## 5 ham receiv 9629
## 6 ham aug 7922
# For hams
hams %>%
top_n(15) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n))+
geom_col(show.legend = FALSE, fill = "steelblue") +
labs(y= "Ham Word Count", x=NULL)+
coord_flip()
## Selecting by n
# For spams
spams %>%
top_n(15) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n))+
geom_col(show.legend = FALSE, fill = "steelblue") +
labs(y= "Spam Word Count", x=NULL)+
coord_flip()
## Selecting by n
mailsPlot <- cleanMailsTokensTfIdf %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word))))
datatable(head(mailsPlot), class = 'cell-border stripe', options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#26868d', 'color': '#fff', 'text-align': 'center !important'});",
"$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
"}")
))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
# graph the top 10 tokens for both ham and spam
mailsPlot %>%
filter(type %in% c('ham', 'spam')) %>%
mutate(type = factor(type, levels = c('ham', 'spam'),
labels = c("Ham mails", "Spam mails"))) %>%
group_by(type) %>%
top_n(10) %>%
ungroup() %>%
ggplot(aes(word, tf_idf)) +
geom_col(fill = "steelblue") +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~type, scales = "free") +
coord_flip()
## Selecting by tf_idf
# After Weighting
hams50 <- top_n(subset(cleanMailsTokensTfIdf, type == 'ham'), 50)
## Selecting by tf_idf
spams50 <- top_n(subset(cleanMailsTokensTfIdf, type == 'spam'), 50)
## Selecting by tf_idf
# Before weighting
wordcloud(words = hams$word, freq = hams$n, min.freq = 1,
max.words=50, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
# After weighting
wordcloud(words = hams50$word, freq = hams50$n, min.freq = 1,
max.words=50, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
# Before weighting
wordcloud(words = spams$word, freq = spams$n, min.freq = 1,
max.words=50, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
# After weighting
wordcloud(words = spams50$word, freq = spams50$n, min.freq = 1,
max.words=50, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
set.seed(12)
randomisedSet <- allMailsTokens[sample(nrow(allMailsTokens)), ]
randomisedCorpus <- Corpus(VectorSource(randomisedSet$word))
randomisedDTM <- DocumentTermMatrix(randomisedCorpus)
trainingDim <- dim(randomisedSet)[1] %/% 4 * 3
trainingSet <- randomisedSet[1:trainingDim, ]
txtSet <- dim(randomisedSet)[1]
testingSet <- randomisedSet[(trainingDim + 1):txtSet, ]
trainingDTM <- randomisedDTM[1:trainingDim, ]
testDTM <- randomisedDTM[(trainingDim + 1):txtSet, ]
trainingLabels <- trainingSet$type
testLabels <- testingSet$type
I originally passed the full trainingDTM and testDTM to the training and testing steps, but that almost sucked the life out of my laptop, so I reduce the model complexity by removing sparse terms, i.e. tokens that do not appear across many documents. This is similar in spirit to tf-idf weighting, but it directly deletes sparse variables from the document-term matrix, leaving a statistical learning model with a much smaller set of variables. The tm package provides the removeSparseTerms() function for this task: the first argument is a document-term matrix, and the second defines the maximal allowed sparsity in the range from 0 to 1. For instance, sparse = 0.99 would remove any token that is missing from more than 99% of the documents in the corpus (i.e. the token must appear in at least 1% of the documents to be retained). In the code below I take the closely related route of keeping only terms that occur at least a minimum number of times, via findFreqTerms().
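For comparison only, this is roughly what the removeSparseTerms() route would look like on the same matrix; sparseDTM and the 0.99 cut-off are illustrative and are not used in the model fit below.
# Illustrative alternative (not used below): keep only terms present in
# at least 1% of documents, i.e. drop terms with sparsity above 0.99.
sparseDTM <- removeSparseTerms(randomisedDTM, sparse = 0.99)
dim(sparseDTM)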
threshold <- 0.5
minFreq <- round(randomisedDTM$nrow * (threshold / 100), 0)
frequentWords <- findFreqTerms(x = randomisedDTM, lowfreq = minFreq)
length(testDTM)
## [1] 6
length(frequentWords)
## [1] 15
trainingDTMFreq <- trainingDTM[ , frequentWords]
testDTMFreq <- testDTM[ , frequentWords]
dim(trainingDTMFreq)
## [1] 784068 15
The counts in the reduced document-term matrices are recoded as "Yes"/"No", so that naiveBayes() treats each term as a categorical presence/absence feature rather than a numeric count.
categoriseValues <- function(x) {
x <- ifelse(x > 0, "Yes", "No")
}
trainingText <- apply(trainingDTMFreq, MARGIN = 2, categoriseValues)
testText <- apply(testDTMFreq, MARGIN = 2, categoriseValues)
classifier <- naiveBayes(trainingText, factor(trainingLabels))
testPrediction <- predict(classifier, testText)
confusionMatrix(data = testPrediction, reference = factor(testLabels),
positive = "spam", dnn = c("Prediction", "Actual"))
## Confusion Matrix and Statistics
##
## Actual
## Prediction ham spam
## ham 19657 9767
## spam 95906 136027
##
## Accuracy : 0.5957
## 95% CI : (0.5938, 0.5976)
## No Information Rate : 0.5578
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.1117
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9330
## Specificity : 0.1701
## Pos Pred Value : 0.5865
## Neg Pred Value : 0.6681
## Prevalence : 0.5578
## Detection Rate : 0.5205
## Detection Prevalence : 0.8874
## Balanced Accuracy : 0.5516
##
## 'Positive' Class : spam
##
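As a quick sanity check, the headline statistics above can be recomputed directly from the printed confusion-matrix counts; the object name cm below is illustrative and not part of the original script.
# Rebuild the confusion-matrix counts reported above (rows = Prediction, cols = Actual).
cm <- matrix(c(19657, 95906, 9767, 136027), nrow = 2,
dimnames = list(Prediction = c("ham", "spam"), Actual = c("ham", "spam")))
sum(diag(cm)) / sum(cm)                 # accuracy, ~0.5957
cm["spam", "spam"] / sum(cm[, "spam"])  # sensitivity for the spam class, ~0.9330
cm["ham", "ham"] / sum(cm[, "ham"])     # specificity, ~0.1701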