It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/
For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.
The following dependencies were used to complete the assignment:
library(knitr)
library(kableExtra)
library(tidyverse)
library(tm) # Text mining package
library(SnowballC) # Word stemming package
library(RTextTools)
library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)In order to begin, I first downloaded an 8-Zip application to extract the spam/ham documents from the public corpus into my week10 github folder. The .bz2 and .tar files needed to be unzipped twice. This process created two subfolders in the directory, labed “easy_ham” and “spam.”
I then set this folder as my working directory in R, and used the tm package to load the spam and ham corpora into R. I quickly examined the content of both corpora and used the document count to build a loop to label all files according to corpus.
Once complete, I used the c() function to merge the spam and ham corpora into a new test corpus.
# set working directory
setwd("../week10")
# create list of file names for reference
spam.files <- list.files("spam")
ham.files <- list.files("easy_ham")
# load corpus from directory into R memory
spam <- VCorpus(DirSource("spam"), readerControl = list(language = "en"))
ham <- VCorpus(DirSource("easy_ham"), readerControl = list(language = "en")) spam## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 501
ham## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2551
# loop spam
for (i in 1:length(spam)){
meta(spam[[i]], "spam", "corpus") <- "1"
}
# loop ham
for (i in 1:length(ham)){
meta(ham[[i]], "spam", "corpus") <- "0"
}Test Corpus# Combine and randomize corpora
set.seed(10)
test <- sample(c(ham, spam))
# Review Labeling
spam_id <- data.frame(unlist(meta(test, "spam")))
table(spam_id)## spam_id
## 0 1
## 2551 501
In order to develop a spam/ham prediction, the test corpus content needed to be cleaned through transformations. You can view a snippit of the untransformed content below:
meta(test[[1]])## author : character(0)
## datetimestamp: 2018-11-04 22:41:09
## description : character(0)
## heading : character(0)
## id : 1549.929f73150d87a253b519a286d3dab77e
## language : en
## origin : character(0)
## spam : 0
strwrap(test[[1]]$content[1:20])## [1] "From spamassassin-talk-admin@lists.sourceforge.net Wed Sep 11"
## [2] "16:05:26 2002"
## [3] "Return-Path: <spamassassin-talk-admin@example.sourceforge.net>"
## [4] "Delivered-To: yyyy@localhost.example.com"
## [5] "Received: from localhost (jalapeno [127.0.0.1])"
## [6] "by jmason.org (Postfix) with ESMTP id CAB9B16F03"
## [7] "for <jm@localhost>; Wed, 11 Sep 2002 16:05:25 +0100 (IST)"
## [8] "Received: from jalapeno [127.0.0.1]"
## [9] "by localhost with IMAP (fetchmail-5.9.0)"
## [10] "for jm@localhost (single-drop); Wed, 11 Sep 2002 16:05:25 +0100"
## [11] "(IST)"
## [12] "Received: from usw-sf-list2.sourceforge.net"
## [13] "(usw-sf-fw2.sourceforge.net"
## [14] "[216.136.171.252]) by dogma.slashnull.org (8.11.6/8.11.6) with"
## [15] "ESMTP id"
## [16] "g8BDeHC16691 for <jm-sa@jmason.org>; Wed, 11 Sep 2002 14:40:17"
## [17] "+0100"
## [18] "Received: from usw-sf-list1-b.sourceforge.net ([10.3.1.13]"
## [19] "helo=usw-sf-list1.sourceforge.net) by usw-sf-list2.sourceforge.net"
## [20] "with"
## [21] "esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id 17p7g0-0002aM-00; Wed,"
## [22] "11 Sep 2002 06:37:04 -0700"
## [23] "Received: from yertle.kcilink.com ([216.194.193.105]) by"
## [24] "usw-sf-list1.sourceforge.net with esmtp (Exim 3.31-VA-mm2 #1"
## [25] "(Debian)) id"
## [26] "17p7fd-00078a-00 for <spamassassin-talk@lists.sourceforge.net>;"
## [27] "Wed, 11 Sep 2002 06:36:41 -0700"
I again used the tm package to map several transformation functions to clean the content of the test corpus. Some cleaning functions, such as stopwords were added after the initial review of Document Term Matrix in the subsequent section.
stop <- c("date", "deliveredto", "received", "subject", "localhost", "returnpath") #added stopwords
test_tm <- test %>%
tm_map(content_transformer(tolower)) %>% # transform to lower case
tm_map(content_transformer(function(x) gsub(x, pattern="\\S*\\.\\S*", replacement=" "))) %>% #regex
tm_map(content_transformer(function(x) gsub(x, pattern="\\S*\\@\\S*", replacement=" "))) %>% #regex
tm_map(content_transformer(removePunctuation)) %>% # remove punctuation
tm_map(content_transformer(removeNumbers)) %>% # remove numbers
tm_map(content_transformer(PlainTextDocument)) %>% # plain text
tm_map(content_transformer(function(x) removeWords(x, words = c(stop, stopwords("en"))))) %>% #stopwords
tm_map(stemDocument) %>% # stem document
tm_map(content_transformer(stripWhitespace)) # remove white spaceNext, I used the newly transformed corpus, test_tm, to create a Document Term Matrix (DTM). The DTM creates a vector frequencies of terms used in the corpus, which we can inspect below:
test_dtm <- DocumentTermMatrix(test_tm)
kable(inspect(test_dtm), caption = 'Inspect Initial DTM', format = "html") %>%
kable_styling(bootstrap_options = "condensed", full_width = F, position = "left") %>%
row_spec(row = 0:0, background = "lightgrey") %>%
column_spec(column = 1, bold = T)## <<DocumentTermMatrix (documents: 3052, terms: 46028)>>
## Non-/sparse entries: 369725/140107731
## Sparsity : 100%
## Maximal term length: 298
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs aug esmtp ist jalapeno mon oct
## 0208.0577aa26cbb2382d8789377d351ca8bf 0 2 2 2 0 6
## 0255.42a6feb4435a0a68929075c0926f085d 2 3 2 2 1 0
## 0317.0bea188e5bd639ae421f07b3ad68c5e0 0 1 2 2 4 0
## 0320.6c54ea1bb991c6fae395588219cfce37 0 5 2 2 0 8
## 0352.f7adb4aa267e50a8db1e4bcacfe863f3 0 2 2 2 0 0
## 0627.c9ad8730dad7bda1e1169ee00c4006fc 0 5 2 2 0 0
## 0730.9570ee3b6bf144198297b23bca5044e9 0 5 2 2 0 0
## 0737.aa298505cb31aac78d0dbf229fc45fb9 0 4 2 2 0 0
## 0826.082e92a79a15aa7f6dd5b85a40327abd 0 5 2 2 0 0
## 1022.73ab70b91862d709897cfe3dd5bb22a0 0 4 2 2 0 9
## Terms
## Docs postfix sep thu wed
## 0208.0577aa26cbb2382d8789377d351ca8bf 1 0 0 0
## 0255.42a6feb4435a0a68929075c0926f085d 1 10 3 0
## 0317.0bea188e5bd639ae421f07b3ad68c5e0 1 6 0 0
## 0320.6c54ea1bb991c6fae395588219cfce37 3 0 0 4
## 0352.f7adb4aa267e50a8db1e4bcacfe863f3 1 5 0 0
## 0627.c9ad8730dad7bda1e1169ee00c4006fc 3 8 8 1
## 0730.9570ee3b6bf144198297b23bca5044e9 3 8 0 0
## 0737.aa298505cb31aac78d0dbf229fc45fb9 3 9 0 0
## 0826.082e92a79a15aa7f6dd5b85a40327abd 3 8 0 1
## 1022.73ab70b91862d709897cfe3dd5bb22a0 3 0 4 4
| aug | esmtp | ist | jalapeno | mon | oct | postfix | sep | thu | wed | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0208.0577aa26cbb2382d8789377d351ca8bf | 0 | 2 | 2 | 2 | 0 | 6 | 1 | 0 | 0 | 0 |
| 0255.42a6feb4435a0a68929075c0926f085d | 2 | 3 | 2 | 2 | 1 | 0 | 1 | 10 | 3 | 0 |
| 0317.0bea188e5bd639ae421f07b3ad68c5e0 | 0 | 1 | 2 | 2 | 4 | 0 | 1 | 6 | 0 | 0 |
| 0320.6c54ea1bb991c6fae395588219cfce37 | 0 | 5 | 2 | 2 | 0 | 8 | 3 | 0 | 0 | 4 |
| 0352.f7adb4aa267e50a8db1e4bcacfe863f3 | 0 | 2 | 2 | 2 | 0 | 0 | 1 | 5 | 0 | 0 |
| 0627.c9ad8730dad7bda1e1169ee00c4006fc | 0 | 5 | 2 | 2 | 0 | 0 | 3 | 8 | 8 | 1 |
| 0730.9570ee3b6bf144198297b23bca5044e9 | 0 | 5 | 2 | 2 | 0 | 0 | 3 | 8 | 0 | 0 |
| 0737.aa298505cb31aac78d0dbf229fc45fb9 | 0 | 4 | 2 | 2 | 0 | 0 | 3 | 9 | 0 | 0 |
| 0826.082e92a79a15aa7f6dd5b85a40327abd | 0 | 5 | 2 | 2 | 0 | 0 | 3 | 8 | 0 | 1 |
| 1022.73ab70b91862d709897cfe3dd5bb22a0 | 0 | 4 | 2 | 2 | 0 | 9 | 3 | 0 | 4 | 4 |
As you can see, there are 58,606 total terms used in the 3,052 test corpus documents, with a meximal term length of 298.
To refine the DTM results, I introduced new controls that affect the sparsity and length of terms. The non-/sparse entries and sparsity percent output show the relative frequency that a term appears in a document. I used the removeSparseTerms function below to set the sparcity to 95% and remove less frequent terms from the matrix. In addition, I set word length controls to only account for terms between the length of 4 and 20 characters.
refine_test <- test_tm %>%
DocumentTermMatrix(control=list(wordLengths=c(4, 20))) %>%
removeSparseTerms(.95)
kable(inspect(refine_test), caption = 'Inspect Refined DTM', format = "html") %>%
kable_styling(bootstrap_options = "condensed", full_width = F, position = "left") %>%
row_spec(row = 0:0, background = "lightgrey") %>%
column_spec(column = 1, bold = T)## <<DocumentTermMatrix (documents: 3052, terms: 319)>>
## Non-/sparse entries: 141664/831924
## Sparsity : 85%
## Maximal term length: 20
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs contenttyp esmtp imap jalapeno
## 0191.cd23170803a4680d6dbd798a4cd1e5dd 1 2 0 0
## 0206.806263422d55d38a151fe3b89d56192f 1 2 1 0
## 0255.42a6feb4435a0a68929075c0926f085d 0 3 1 2
## 0320.6c54ea1bb991c6fae395588219cfce37 1 5 1 2
## 0433.8977506bae8028f48290ea0fb2f54ddd 1 4 1 2
## 0627.c9ad8730dad7bda1e1169ee00c4006fc 1 5 1 2
## 0730.9570ee3b6bf144198297b23bca5044e9 1 5 1 2
## 0737.aa298505cb31aac78d0dbf229fc45fb9 1 4 1 2
## 0826.082e92a79a15aa7f6dd5b85a40327abd 1 5 1 2
## 1022.73ab70b91862d709897cfe3dd5bb22a0 1 4 1 2
## Terms
## Docs list messageid postfix singledrop
## 0191.cd23170803a4680d6dbd798a4cd1e5dd 8 1 2 1
## 0206.806263422d55d38a151fe3b89d56192f 8 1 1 1
## 0255.42a6feb4435a0a68929075c0926f085d 0 1 1 1
## 0320.6c54ea1bb991c6fae395588219cfce37 0 1 3 1
## 0433.8977506bae8028f48290ea0fb2f54ddd 8 1 4 1
## 0627.c9ad8730dad7bda1e1169ee00c4006fc 1 1 3 1
## 0730.9570ee3b6bf144198297b23bca5044e9 1 1 3 1
## 0737.aa298505cb31aac78d0dbf229fc45fb9 1 1 3 1
## 0826.082e92a79a15aa7f6dd5b85a40327abd 3 1 3 1
## 1022.73ab70b91862d709897cfe3dd5bb22a0 0 1 3 1
## Terms
## Docs textplain will
## 0191.cd23170803a4680d6dbd798a4cd1e5dd 1 19
## 0206.806263422d55d38a151fe3b89d56192f 1 22
## 0255.42a6feb4435a0a68929075c0926f085d 0 27
## 0320.6c54ea1bb991c6fae395588219cfce37 1 14
## 0433.8977506bae8028f48290ea0fb2f54ddd 1 24
## 0627.c9ad8730dad7bda1e1169ee00c4006fc 1 18
## 0730.9570ee3b6bf144198297b23bca5044e9 1 167
## 0737.aa298505cb31aac78d0dbf229fc45fb9 1 167
## 0826.082e92a79a15aa7f6dd5b85a40327abd 1 21
## 1022.73ab70b91862d709897cfe3dd5bb22a0 1 14
| contenttyp | esmtp | imap | jalapeno | list | messageid | postfix | singledrop | textplain | will | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0191.cd23170803a4680d6dbd798a4cd1e5dd | 1 | 2 | 0 | 0 | 8 | 1 | 2 | 1 | 1 | 19 |
| 0206.806263422d55d38a151fe3b89d56192f | 1 | 2 | 1 | 0 | 8 | 1 | 1 | 1 | 1 | 22 |
| 0255.42a6feb4435a0a68929075c0926f085d | 0 | 3 | 1 | 2 | 0 | 1 | 1 | 1 | 0 | 27 |
| 0320.6c54ea1bb991c6fae395588219cfce37 | 1 | 5 | 1 | 2 | 0 | 1 | 3 | 1 | 1 | 14 |
| 0433.8977506bae8028f48290ea0fb2f54ddd | 1 | 4 | 1 | 2 | 8 | 1 | 4 | 1 | 1 | 24 |
| 0627.c9ad8730dad7bda1e1169ee00c4006fc | 1 | 5 | 1 | 2 | 1 | 1 | 3 | 1 | 1 | 18 |
| 0730.9570ee3b6bf144198297b23bca5044e9 | 1 | 5 | 1 | 2 | 1 | 1 | 3 | 1 | 1 | 167 |
| 0737.aa298505cb31aac78d0dbf229fc45fb9 | 1 | 4 | 1 | 2 | 1 | 1 | 3 | 1 | 1 | 167 |
| 0826.082e92a79a15aa7f6dd5b85a40327abd | 1 | 5 | 1 | 2 | 3 | 1 | 3 | 1 | 1 | 21 |
| 1022.73ab70b91862d709897cfe3dd5bb22a0 | 1 | 4 | 1 | 2 | 0 | 1 | 3 | 1 | 1 | 14 |
term_freq <- refine_test %>%
as.matrix %>%
colSums() %>%
sort(decreasing=TRUE)
kable(head(term_freq, 10)) | x | |
|---|---|
| esmtp | 9824 |
| postfix | 5413 |
| jalapeno | 4157 |
| messageid | 3094 |
| contenttyp | 3016 |
| singledrop | 2869 |
| imap | 2767 |
| list | 2591 |
| textplain | 2443 |
| will | 2361 |
terms <- data.frame(term=names(term_freq), frequency=term_freq)
plot <- ggplot(subset(terms, frequency>1000), aes(x = reorder(term, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill='grey') +
theme(axis.text.x=element_text(angle=90, hjust=1)) +
labs(title = "Terms with Frequencies > 1000", x = "Term", "Frequency")
plotI set up an SVM and Maxent predicitive models to train and classify to indicate whether or not an email should be marked as spam. The steps for this section are outlined below:
# Create loop to lable spam indicator
indicator<-c()
for(i in 1:length(test_tm)){
indicator<-c(indicator,test_tm[[i]]$meta$spam)
}
# randomize data and set up model container using 75% probability
set.seed(100)
probs <- runif(length(test_tm),0,1)
train <- which(probs<=.75)
test <- which(probs>.75)
# build container for model from DTM
container <- create_container(refine_test, labels = indicator, trainSize = train, testSize = test, virgin = FALSE)
# use container to train models
train_svm <- train_model(container, "SVM")
train_max <-train_model(container, "MAXENT")
# use trained models to classify new data
classify_svm <- classify_model(container, train_svm)
classify_max <-classify_model(container, train_max)
# view output of models
svm <- head(classify_svm, 10)
max <- head(classify_max, 10)
kable(svm, caption = 'SVM Model Output', format = "html") %>%
kable_styling(bootstrap_options = "condensed", full_width = F, position = "left") %>%
column_spec(column = 1, bold = T)| SVM_LABEL | SVM_PROB |
|---|---|
| 0 | 0.9998325 |
| 0 | 0.8725006 |
| 0 | 0.9989105 |
| 0 | 0.9950264 |
| 0 | 0.9998966 |
| 0 | 0.9997816 |
| 0 | 0.9353922 |
| 0 | 0.9999979 |
| 0 | 0.9686015 |
| 0 | 0.9993535 |
kable(max, caption = 'Maxent Model Output', format = "html") %>%
kable_styling(bootstrap_options = "condensed", full_width = F, position = "right") %>%
column_spec(column = 1, bold = T)| MAXENTROPY_LABEL | MAXENTROPY_PROB |
|---|---|
| 0 | 1.0000000 |
| 0 | 0.9999977 |
| 0 | 1.0000000 |
| 0 | 1.0000000 |
| 0 | 1.0000000 |
| 0 | 1.0000000 |
| 0 | 0.9999324 |
| 0 | 1.0000000 |
| 0 | 0.9305556 |
| 0 | 1.0000000 |