Text mining

Assignment Instructions

It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/

For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.

Dependencies

The following dependencies were used to complete the assignment:

library(knitr) 
library(kableExtra)
library(tidyverse)
library(tm) # Text mining package
library(SnowballC) # Word stemming package
library(RTextTools)
library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)

Getting Started

To begin, I downloaded the 8-Zip archiving application to extract the spam/ham documents from the public corpus into my week10 GitHub folder. The .tar.bz2 archives needed to be extracted twice: once to decompress the .bz2 layer and once to unpack the .tar. This process created two subfolders in the directory, labeled "easy_ham" and "spam."
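
Alternatively, the download and extraction can be scripted from R. Here is a minimal sketch, assuming the 20021010 snapshot file names from the corpus page (the file names are illustrative; any snapshot will do):

# download and extract one ham and one spam archive (illustrative file names)
base <- "https://spamassassin.apache.org/publiccorpus/"
archives <- c("20021010_easy_ham.tar.bz2", "20021010_spam.tar.bz2")
for (f in archives) {
  download.file(paste0(base, f), destfile = f)
  untar(f, exdir = ".") # untar() decompresses .bz2 and unpacks .tar in one step
}

Either way, the result is the same easy_ham and spam subfolders used below.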

I then set this folder as my working directory in R and used the tm package to load the spam and ham corpora. I briefly examined the content of both corpora and used the document counts to build loops that label every document according to its corpus.

Once labeling was complete, I merged the spam and ham corpora into a single test corpus with the c() function.

1. Load Corpora from Directory

# set working directory 
setwd("../week10") 

# create list of file names for reference
spam.files <- list.files("spam")
ham.files <- list.files("easy_ham")

# load corpus from directory into R memory
spam <- VCorpus(DirSource("spam"), readerControl = list(language = "en")) 
ham <- VCorpus(DirSource("easy_ham"), readerControl = list(language = "en")) 

2. Brief Examination of Corpora

spam
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 501
ham
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 2551

3. Create Loop to Label the Corpora

# label each spam document with spam = 1

for (i in 1:length(spam)){
  meta(spam[[i]], "spam", "corpus") <- "1"
}

# label each ham document with spam = 0
for (i in 1:length(ham)){
  meta(ham[[i]], "spam", "corpus") <- "0"
}

4. Create a Test Corpus

# Combine and randomize corpora 
set.seed(10)
test <- sample(c(ham, spam))

# Review Labeling 
spam_id <- data.frame(unlist(meta(test, "spam"))) 
table(spam_id)
## spam_id
##    0    1 
## 2551  501

Corpus Transformations

In order to develop a spam/ham prediction, the content of the test corpus needed to be cleaned through a series of transformations. You can view a snippet of the untransformed content below:

meta(test[[1]])
##   author       : character(0)
##   datetimestamp: 2018-11-04 22:41:09
##   description  : character(0)
##   heading      : character(0)
##   id           : 1549.929f73150d87a253b519a286d3dab77e
##   language     : en
##   origin       : character(0)
##   spam         : 0
strwrap(test[[1]]$content[1:20])
##  [1] "From spamassassin-talk-admin@lists.sourceforge.net Wed Sep 11"     
##  [2] "16:05:26 2002"                                                     
##  [3] "Return-Path: <spamassassin-talk-admin@example.sourceforge.net>"    
##  [4] "Delivered-To: yyyy@localhost.example.com"                          
##  [5] "Received: from localhost (jalapeno [127.0.0.1])"                   
##  [6] "by jmason.org (Postfix) with ESMTP id CAB9B16F03"                  
##  [7] "for <jm@localhost>; Wed, 11 Sep 2002 16:05:25 +0100 (IST)"         
##  [8] "Received: from jalapeno [127.0.0.1]"                               
##  [9] "by localhost with IMAP (fetchmail-5.9.0)"                          
## [10] "for jm@localhost (single-drop); Wed, 11 Sep 2002 16:05:25 +0100"   
## [11] "(IST)"                                                             
## [12] "Received: from usw-sf-list2.sourceforge.net"                       
## [13] "(usw-sf-fw2.sourceforge.net"                                       
## [14] "[216.136.171.252]) by dogma.slashnull.org (8.11.6/8.11.6) with"    
## [15] "ESMTP id"                                                          
## [16] "g8BDeHC16691 for <jm-sa@jmason.org>; Wed, 11 Sep 2002 14:40:17"    
## [17] "+0100"                                                             
## [18] "Received: from usw-sf-list1-b.sourceforge.net ([10.3.1.13]"        
## [19] "helo=usw-sf-list1.sourceforge.net) by usw-sf-list2.sourceforge.net"
## [20] "with"                                                              
## [21] "esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id 17p7g0-0002aM-00; Wed,"    
## [22] "11 Sep 2002 06:37:04 -0700"                                        
## [23] "Received: from yertle.kcilink.com ([216.194.193.105]) by"          
## [24] "usw-sf-list1.sourceforge.net with esmtp (Exim 3.31-VA-mm2 #1"      
## [25] "(Debian)) id"                                                      
## [26] "17p7fd-00078a-00 for <spamassassin-talk@lists.sourceforge.net>;"   
## [27] "Wed, 11 Sep 2002 06:36:41 -0700"

I again used the tm package to map several transformation functions over the test corpus to clean its content. Some cleaning steps, such as the custom stopwords below, were added after an initial review of the Document Term Matrix in the subsequent section.

stop <- c("date", "deliveredto", "received", "subject", "localhost", "returnpath") #added stopwords

test_tm <- test %>% 
  tm_map(content_transformer(tolower)) %>% # transform to lower case 
  tm_map(content_transformer(function(x) gsub("\\S*\\.\\S*", " ", x))) %>% # drop tokens containing a period (demonstrated after this block)
  tm_map(content_transformer(function(x) gsub("\\S*\\@\\S*", " ", x))) %>% # drop tokens containing an @ (e-mail addresses)
  tm_map(content_transformer(removePunctuation)) %>% # remove punctuation
  tm_map(content_transformer(removeNumbers)) %>% # remove numbers
  tm_map(content_transformer(PlainTextDocument)) %>% # plain text
  tm_map(content_transformer(function(x) removeWords(x, words = c(stop, stopwords("en"))))) %>% #stopwords
  tm_map(stemDocument) %>% # stem document 
  tm_map(content_transformer(stripWhitespace)) # remove white space
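
To see what the two regular-expression steps do, they can be applied to a toy string (illustrative only; the example string is invented):

# tokens containing a period or an @ are blanked out before punctuation removal
x <- "visit spam.example.net or mail root@localhost now"
x <- gsub("\\S*\\.\\S*", " ", x) # drops the spam.example.net token
gsub("\\S*\\@\\S*", " ", x)      # then drops the root@localhost token
# result is roughly "visit or mail now" plus extra whitespace,
# which the stripWhitespace step later collapses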

Document Term Matrix

Next, I used the newly transformed corpus, test_tm, to create a Document Term Matrix (DTM). A DTM has one row per document and one column per term, and each cell counts how often that term appears in that document.
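
For intuition, here is a minimal sketch on two invented documents (illustrative only, not part of the assignment pipeline):

# a two-document toy corpus showing the DTM structure
toy <- VCorpus(VectorSource(c("free money now", "meeting notes attached")))
inspect(DocumentTermMatrix(toy))

The DTM for the cleaned test corpus is built and inspected the same way: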

test_dtm <- DocumentTermMatrix(test_tm) 
kable(inspect(test_dtm), caption = 'Inspect Initial DTM', format = "html") %>%
  kable_styling(bootstrap_options = "condensed", full_width = F, position = "left") %>%
  row_spec(row = 0:0, background = "lightgrey") %>%
  column_spec(column = 1, bold = T)
## <<DocumentTermMatrix (documents: 3052, terms: 46028)>>
## Non-/sparse entries: 369725/140107731
## Sparsity           : 100%
## Maximal term length: 298
## Weighting          : term frequency (tf)
## Sample             :
##                                        Terms
## Docs                                    aug esmtp ist jalapeno mon oct
##   0208.0577aa26cbb2382d8789377d351ca8bf   0     2   2        2   0   6
##   0255.42a6feb4435a0a68929075c0926f085d   2     3   2        2   1   0
##   0317.0bea188e5bd639ae421f07b3ad68c5e0   0     1   2        2   4   0
##   0320.6c54ea1bb991c6fae395588219cfce37   0     5   2        2   0   8
##   0352.f7adb4aa267e50a8db1e4bcacfe863f3   0     2   2        2   0   0
##   0627.c9ad8730dad7bda1e1169ee00c4006fc   0     5   2        2   0   0
##   0730.9570ee3b6bf144198297b23bca5044e9   0     5   2        2   0   0
##   0737.aa298505cb31aac78d0dbf229fc45fb9   0     4   2        2   0   0
##   0826.082e92a79a15aa7f6dd5b85a40327abd   0     5   2        2   0   0
##   1022.73ab70b91862d709897cfe3dd5bb22a0   0     4   2        2   0   9
##                                        Terms
## Docs                                    postfix sep thu wed
##   0208.0577aa26cbb2382d8789377d351ca8bf       1   0   0   0
##   0255.42a6feb4435a0a68929075c0926f085d       1  10   3   0
##   0317.0bea188e5bd639ae421f07b3ad68c5e0       1   6   0   0
##   0320.6c54ea1bb991c6fae395588219cfce37       3   0   0   4
##   0352.f7adb4aa267e50a8db1e4bcacfe863f3       1   5   0   0
##   0627.c9ad8730dad7bda1e1169ee00c4006fc       3   8   8   1
##   0730.9570ee3b6bf144198297b23bca5044e9       3   8   0   0
##   0737.aa298505cb31aac78d0dbf229fc45fb9       3   9   0   0
##   0826.082e92a79a15aa7f6dd5b85a40327abd       3   8   0   1
##   1022.73ab70b91862d709897cfe3dd5bb22a0       3   0   4   4

As you can see, there are 46,028 distinct terms across the 3,052 test corpus documents, with a maximal term length of 298 characters. The sparsity rounds to 100% because only 369,725 of the roughly 140.5 million document-term cells are non-zero.

Refine Terms

To refine the DTM, I introduced controls on term sparsity and term length. The non-/sparse entries and sparsity percentage describe how often a term appears across documents. I used the removeSparseTerms function below to drop terms with sparsity above 95%, i.e., terms that appear in fewer than roughly 5% of documents. In addition, I set word-length controls to keep only terms between 4 and 20 characters.

refine_test <- test_tm %>% 
  DocumentTermMatrix(control=list(wordLengths=c(4, 20))) %>% 
  removeSparseTerms(.95)

kable(inspect(refine_test), caption = 'Inspect Refined DTM', format = "html") %>%
  kable_styling(bootstrap_options = "condensed", full_width = F, position = "left") %>%
  row_spec(row = 0:0, background = "lightgrey") %>%
  column_spec(column = 1, bold = T)
## <<DocumentTermMatrix (documents: 3052, terms: 319)>>
## Non-/sparse entries: 141664/831924
## Sparsity           : 85%
## Maximal term length: 20
## Weighting          : term frequency (tf)
## Sample             :
##                                        Terms
## Docs                                    contenttyp esmtp imap jalapeno
##   0191.cd23170803a4680d6dbd798a4cd1e5dd          1     2    0        0
##   0206.806263422d55d38a151fe3b89d56192f          1     2    1        0
##   0255.42a6feb4435a0a68929075c0926f085d          0     3    1        2
##   0320.6c54ea1bb991c6fae395588219cfce37          1     5    1        2
##   0433.8977506bae8028f48290ea0fb2f54ddd          1     4    1        2
##   0627.c9ad8730dad7bda1e1169ee00c4006fc          1     5    1        2
##   0730.9570ee3b6bf144198297b23bca5044e9          1     5    1        2
##   0737.aa298505cb31aac78d0dbf229fc45fb9          1     4    1        2
##   0826.082e92a79a15aa7f6dd5b85a40327abd          1     5    1        2
##   1022.73ab70b91862d709897cfe3dd5bb22a0          1     4    1        2
##                                        Terms
## Docs                                    list messageid postfix singledrop
##   0191.cd23170803a4680d6dbd798a4cd1e5dd    8         1       2          1
##   0206.806263422d55d38a151fe3b89d56192f    8         1       1          1
##   0255.42a6feb4435a0a68929075c0926f085d    0         1       1          1
##   0320.6c54ea1bb991c6fae395588219cfce37    0         1       3          1
##   0433.8977506bae8028f48290ea0fb2f54ddd    8         1       4          1
##   0627.c9ad8730dad7bda1e1169ee00c4006fc    1         1       3          1
##   0730.9570ee3b6bf144198297b23bca5044e9    1         1       3          1
##   0737.aa298505cb31aac78d0dbf229fc45fb9    1         1       3          1
##   0826.082e92a79a15aa7f6dd5b85a40327abd    3         1       3          1
##   1022.73ab70b91862d709897cfe3dd5bb22a0    0         1       3          1
##                                        Terms
## Docs                                    textplain will
##   0191.cd23170803a4680d6dbd798a4cd1e5dd         1   19
##   0206.806263422d55d38a151fe3b89d56192f         1   22
##   0255.42a6feb4435a0a68929075c0926f085d         0   27
##   0320.6c54ea1bb991c6fae395588219cfce37         1   14
##   0433.8977506bae8028f48290ea0fb2f54ddd         1   24
##   0627.c9ad8730dad7bda1e1169ee00c4006fc         1   18
##   0730.9570ee3b6bf144198297b23bca5044e9         1  167
##   0737.aa298505cb31aac78d0dbf229fc45fb9         1  167
##   0826.082e92a79a15aa7f6dd5b85a40327abd         1   21
##   1022.73ab70b91862d709897cfe3dd5bb22a0         1   14

Term Frequencies

term_freq <- refine_test %>% 
  as.matrix %>% 
  colSums() %>% 
  sort(decreasing=TRUE)

kable(head(term_freq, 10)) 
Term         Frequency
esmtp             9824
postfix           5413
jalapeno          4157
messageid         3094
contenttyp        3016
singledrop        2869
imap              2767
list              2591
textplain         2443
will              2361

terms <- data.frame(term=names(term_freq), frequency=term_freq)

plot <- ggplot(subset(terms, frequency>1000), aes(x = reorder(term, -frequency), y = frequency)) +
  geom_bar(stat = "identity", fill='grey') +
  theme(axis.text.x=element_text(angle=90, hjust=1)) +
  labs(title = "Terms with Frequencies > 1000", x = "Term", y = "Frequency")
plot

Predictive Models

I set up SVM and Maxent predictive models to train on labeled documents and classify held-out documents as spam or ham. The steps for this section are outlined below:

# create a vector of spam labels from document metadata
indicator <- c()
for (i in 1:length(test_tm)) {
  indicator <- c(indicator, test_tm[[i]]$meta$spam)
}

# randomly split documents roughly 75/25 into training and test sets
set.seed(100)
probs <- runif(length(test_tm), 0, 1)
train <- which(probs <= .75)
test <- which(probs > .75)

# build container for the models from the refined DTM
# (virgin = FALSE tells RTextTools that all documents carry known labels)
container <- create_container(refine_test, labels = indicator, trainSize = train, testSize = test, virgin = FALSE)

# use the container to train both models
train_svm <- train_model(container, "SVM")
train_max <- train_model(container, "MAXENT")

# use the trained models to classify the test documents
classify_svm <- classify_model(container, train_svm)
classify_max <- classify_model(container, train_max)

# view output of models  
svm <- head(classify_svm, 10)
max <- head(classify_max, 10)
  
kable(svm, caption = 'SVM Model Output', format = "html") %>%
  kable_styling(bootstrap_options = "condensed", full_width = F, position = "left") %>%
  column_spec(column = 1, bold = T)
SVM Model Output

SVM_LABEL   SVM_PROB
0           0.9998325
0           0.8725006
0           0.9989105
0           0.9950264
0           0.9998966
0           0.9997816
0           0.9353922
0           0.9999979
0           0.9686015
0           0.9993535

kable(max, caption = 'Maxent Model Output', format = "html") %>%
  kable_styling(bootstrap_options = "condensed", full_width = F, position = "right") %>%
  column_spec(column = 1, bold = T)
Maxent Model Output

MAXENTROPY_LABEL   MAXENTROPY_PROB
0                  1.0000000
0                  0.9999977
0                  1.0000000
0                  1.0000000
0                  1.0000000
0                  1.0000000
0                  0.9999324
0                  1.0000000
0                  0.9305556
0                  1.0000000
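
The tables above show only the first ten classifications, all ham with high confidence. As a natural next step, the predictions can be scored against the held-out labels; a minimal sketch, assuming indicator and the test split defined above (no results are claimed here):

# compare predictions to the held-out true labels (sketch; output not shown)
truth <- indicator[test]
mean(classify_svm$SVM_LABEL == truth)        # SVM accuracy on the test split
mean(classify_max$MAXENTROPY_LABEL == truth) # Maxent accuracy on the test split
table(truth, svm = classify_svm$SVM_LABEL)   # SVM confusion matrix

RTextTools also provides create_analytics(container, cbind(classify_svm, classify_max)) for a fuller precision/recall summary.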