It can be useful to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/
For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.
This assignment is due end of day on Sunday, November 6th. You may work in a small team if you want. We will look at all of your solutions in our meetup on Thursday, November 10th.
For this assignment, I am using the spam and ham datasets below from the https://spamassassin.apache.org/publiccorpus/ site:
20030228_easy_ham.tar.bz2
20050311_spam_2.tar.bz2
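If the archives are not already on disk, they can be fetched and unpacked directly from R. The sketch below assumes the tarballs are downloadable at URLs formed by appending the file names above to the site address, and that the working directory has already been set:
#Download and unpack both corpus archives (a sketch; the exact URLs are an
#assumption based on the file names above)
base_url <- "https://spamassassin.apache.org/publiccorpus/"
for (f in c("20030228_easy_ham.tar.bz2", "20050311_spam_2.tar.bz2")) {
  download.file(paste0(base_url, f), destfile = f)
  untar(f) #creates the easy_ham/ and spam_2/ directories
}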
setwd("~/Desktop/IS607/Data-607/Week 10 Assignment")
require(RCurl)
## Loading required package: RCurl
## Loading required package: bitops
require(XML)
## Loading required package: XML
require(stringr)
## Loading required package: stringr
require(tm)
## Loading required package: tm
## Loading required package: NLP
require(SnowballC)
## Loading required package: SnowballC
require(RTextTools)
## Loading required package: RTextTools
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
##
## Attaching package: 'RTextTools'
## The following objects are masked from 'package:SnowballC':
##
## getStemLanguages, wordStem
length(list.files("easy_ham")) #2501 HAM files
## [1] 2501
length(list.files("spam_2")) #1397 SPAM files
## [1] 1397
list.files("easy_ham")[1:5]
## [1] "00001.7c53336b37003a9286aba55d2945844c"
## [2] "00002.9c4069e25e1ef370c078db7ee85ff9ac"
## [3] "00003.860e3c3cee1b42ead714c5c874fe25f7"
## [4] "00004.864220c5b6930b209cc287c361c99af1"
## [5] "00005.bf27cdeaf0b8c4647ecd61b1d09da613"
list.files("spam_2")[1:5]
## [1] "00001.317e78fa8ee2f54cd4890fdc09ba8176"
## [2] "00002.9438920e9a55591b18e60d1ed37d992b"
## [3] "00003.590eff932f8704d8b0fcbe69d023b54d"
## [4] "00004.bdcc075fa4beb5157b5dd6cd41d8887b"
## [5] "00005.ed0aba4d386c5e62bc737cf3f0ed9589"
Before we can perform supervised learning on the data, we must build a corpus and then a document-term matrix that contains records from both the HAM and SPAM datasets.
#Create a list of all files in each directory
spam_list <- list.files("spam_2", full.names = TRUE)
ham_list <- list.files("easy_ham", full.names = TRUE)
#test process out on one record from SPAM dataset
tmp <- readLines(spam_list[1])
tmp <- str_c(tmp, collapse = "")
email_corpus <- Corpus(VectorSource(tmp))
#Assign category for SPAM in meta data
meta(email_corpus[[1]], "category") <- "Spam"
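A quick retrieval confirms the tag was stored (a minimal sketch; the exact print format depends on the tm version installed):
#Confirm the category tag is attached to the first document's metadata
meta(email_corpus[[1]], "category")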
Now that the process has been verified on one record from the SPAM dataset, we will use two loops to build a corpus containing all records from the SPAM and HAM datasets.
#SPAM loop
#Start at 2 because the first SPAM record was added to the corpus above
n <- 1
for (i in 2:length(spam_list)) {
  tmp <- readLines(spam_list[i])
  tmp <- str_c(tmp, collapse = "")
  #try to fix encoding issue with idea from Stack Overflow
  tmp <- iconv(tmp, to = "utf-8-mac", sub = "")
  if (length(tmp) != 0) {
    n <- n + 1
    temp_corpus <- Corpus(VectorSource(tmp))
    email_corpus <- c(email_corpus, temp_corpus)
    meta(email_corpus[[n]], "category") <- "Spam"
  }
}
#HAM loop
#Leave the counter at n from the previous loop
for (i in 1:length(ham_list)) {
  tmp <- readLines(ham_list[i])
  tmp <- str_c(tmp, collapse = "")
  #try to fix encoding issue with idea from Stack Overflow
  tmp <- iconv(tmp, to = "utf-8-mac", sub = "")
  if (length(tmp) != 0) {
    n <- n + 1
    temp_corpus <- Corpus(VectorSource(tmp))
    email_corpus <- c(email_corpus, temp_corpus)
    meta(email_corpus[[n]], "category") <- "Ham"
  }
}
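As an aside, growing a corpus with c() inside a loop copies it on every iteration, so the same corpus could be built more efficiently in one pass. A loop-free sketch (not the approach used above; read_email is a hypothetical helper):
#Read each file into a single string, applying the same encoding fix as above
read_email <- function(path) {
  iconv(str_c(readLines(path, warn = FALSE), collapse = ""),
        to = "utf-8-mac", sub = "")
}
all_text <- vapply(c(spam_list, ham_list), read_email, character(1))
alt_corpus <- Corpus(VectorSource(all_text))
#Attach the category metadata in one pass
alt_labels <- c(rep("Spam", length(spam_list)), rep("Ham", length(ham_list)))
for (j in seq_along(alt_corpus)) meta(alt_corpus[[j]], "category") <- alt_labels[j]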
Now we can check the counts of the category metadata in the corpus.
metadata <- unlist(meta(email_corpus, "category"))
table(metadata) #view meta data for spam/ham
## metadata
## Ham Spam
## 2501 1397
Prior to creating the document-term matrix (DTM), we need to randomize the order of the documents, since the training/testing split below is positional and both classes must appear in each part.
email_corpus <- sample(email_corpus) #randomize
email_corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3898
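Note that sample() makes the document order non-deterministic, so the positional train/test split used later will differ from run to run. Fixing the random seed first makes the shuffle reproducible (a sketch; the seed value is arbitrary):
set.seed(123) #arbitrary seed so the shuffle, and the split below, are reproducible
email_corpus <- sample(email_corpus)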
Before we create the DTM, we need to clean the corpus data.
#NOTE: There seems to be a bug in a recent tm revision that causes the files
#created by tm_map to not pull into a DTM correctly, and it appears to be
#isolated to Mac. This code works when it is not run from R Markdown, and from
#the discussion board at least two others had the same issue, so this section
#was turned off because it would not knit.
#convert all letters to lower case first, so that stop-word removal and
#stemming operate on consistent case
email_corpus <- tm_map(email_corpus, content_transformer(tolower), lazy = TRUE)
#remove numbers
email_corpus <- tm_map(email_corpus, content_transformer(removeNumbers), lazy = TRUE)
#remove English stop words
email_corpus <- tm_map(email_corpus, content_transformer(removeWords), words = stopwords("en"), lazy = TRUE)
#replace punctuation with spaces
email_corpus <- tm_map(email_corpus, content_transformer(str_replace_all), pattern = "[[:punct:]]", replacement = " ", lazy = TRUE)
#stem words last, after lower-casing and stop-word removal
email_corpus <- tm_map(email_corpus, content_transformer(stemDocument), lazy = TRUE)
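For what it is worth, a workaround often suggested on Stack Overflow for Mac-specific tm_map problems of that era was to disable parallel execution, since tm_map dispatched to parallel::mclapply on OS X. Whether the mc.cores argument is accepted depends on the installed tm version, so treat this sketch as an assumption:
#Force serial execution by limiting mclapply to one core (version-dependent)
email_corpus <- tm_map(email_corpus, content_transformer(tolower), mc.cores = 1)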
Now that the corpus has been created and cleaned, we can move forward with creating the DTM.
#convert to plain text document
email_corpus <- tm_map(email_corpus, content_transformer(PlainTextDocument), lazy = TRUE)
#Create DTM
dtm <- DocumentTermMatrix(email_corpus)
dtm
## <<DocumentTermMatrix (documents: 3898, terms: 105920)>>
## Non-/sparse entries: 762739/412113421
## Sparsity : 100%
## Maximal term length: 17339
## Weighting : term frequency (tf)
#Remove sparse terms: keep only terms that appear in at least 10 documents
dtm <- removeSparseTerms(dtm, 1-(10/length(email_corpus)))
dtm
## <<DocumentTermMatrix (documents: 3898, terms: 7423)>>
## Non-/sparse entries: 596339/28338515
## Sparsity : 98%
## Maximal term length: 70
## Weighting : term frequency (tf)
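As a sanity check on the reduced vocabulary, tm's findFreqTerms can list the terms that occur above a frequency threshold (a sketch; the threshold of 500 is arbitrary):
#Show the first few terms appearing at least 500 times across the corpus
head(findFreqTerms(dtm, lowfreq = 500), 20)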
With the DTM created, we can now set up the three supervised learning methods we will use for classification.
#create vector with labels
category_labels <- unlist(meta(email_corpus, "category"))
category_labels[1:5]
## 1 1 1 1 1
## "Ham" "Ham" "Ham" "Ham" "Ham"
N <- length(category_labels)
#create container with relevant information used in the estimation procedures
container <- create_container(
  dtm,
  labels = category_labels,
  trainSize = 1:1000,  #use 1000 records for training
  testSize = 1001:N,   #use the remaining records for testing
  virgin = FALSE       #all documents carry known labels
)
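Because the split is positional, the earlier shuffle is what keeps both classes represented in each part. As an alternative to a single split, RTextTools can also estimate out-of-sample accuracy with n-fold cross-validation (a sketch; four folds is an arbitrary choice):
#Estimate SVM accuracy with 4-fold cross-validation; per-fold accuracy is printed
svm_cv <- cross_validate(container, 4, "SVM")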
For this assignment we will use three model types: a support vector machine (SVM), a decision tree (RTextTools' "TREE" algorithm), and maximum entropy.
#Estimation procedures
slotNames(container)
## [1] "training_matrix" "classification_matrix" "training_codes"
## [4] "testing_codes" "column_names" "virgin"
svm_model <- train_model(container, "SVM")
tree_model <- train_model(container, "TREE")
maxent_model <- train_model(container, "MAXENT")
svm_out <- classify_model(container, svm_model)
tree_out <- classify_model(container, tree_model)
maxent_out <- classify_model(container, maxent_model)
#inspect outcome of each procedure
head(svm_out)
## SVM_LABEL SVM_PROB
## 1 Ham 0.9999923
## 2 Ham 0.9996881
## 3 Spam 1.0000000
## 4 Ham 0.9979691
## 5 Spam 1.0000000
## 6 Spam 0.9968999
head(tree_out)
## TREE_LABEL TREE_PROB
## 1 Ham 1
## 2 Ham 1
## 3 Spam 1
## 4 Ham 1
## 5 Spam 1
## 6 Spam 1
head(maxent_out)
## MAXENTROPY_LABEL MAXENTROPY_PROB
## 1 Ham 1
## 2 Ham 1
## 3 Spam 1
## 4 Ham 1
## 5 Spam 1
## 6 Spam 1
Finally, we can examine in detail how each model performed on our data.
labels_out <- data.frame(
  correct_label = category_labels[1001:N],
  svm = as.character(svm_out[, 1]),
  tree = as.character(tree_out[, 1]),
  maxent = as.character(maxent_out[, 1]),
  stringsAsFactors = FALSE
)
#SVM
table(labels_out[,1] == labels_out[,2])
##
## FALSE TRUE
## 23 2875
prop.table(table(labels_out[,1] == labels_out[,2]))
##
## FALSE TRUE
## 0.007936508 0.992063492
#Decision Tree
table(labels_out[,1] == labels_out[,3])
##
## FALSE TRUE
## 39 2859
prop.table(table(labels_out[,1] == labels_out[,3]))
##
## FALSE TRUE
## 0.01345756 0.98654244
#Maximum Entropy
table(labels_out[,1] == labels_out[,4])
##
## FALSE TRUE
## 27 2871
prop.table(table(labels_out[,1] == labels_out[,4]))
##
## FALSE TRUE
## 0.00931677 0.99068323
Examining the results of the three methods, the SVM appears to be the most accurate (~99.2%), but all three models perform extremely well. I am somewhat skeptical of results this strong: the high performance may be related to the large number of training/testing documents, to the source data being unusually clean, or to an error in the construction of the models.
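One way to dig deeper than raw accuracy is RTextTools' create_analytics, which summarizes precision, recall, and F-scores per label and per algorithm; a sketch using the classifier outputs above:
#Combine the three classifiers' outputs and summarize performance
analytics <- create_analytics(container, cbind(svm_out, tree_out, maxent_out))
summary(analytics)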