It can be useful to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/
For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.
This assignment is due end of day on Sunday, November 6th. You may work in a small team if you want. We will look at all of your solutions in our meetup on Thursday, November 10th.
For this assignment, I am using the spam and ham datasets below from the https://spamassassin.apache.org/publiccorpus/ site:
20030228_easy_ham.tar.bz2
20050311_spam_2.tar.bz2
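If the archives are not already on disk, they can be fetched and unpacked directly from R. The sketch below assumes the tarballs are downloadable at URLs formed by appending the file names above to the site address, and that the working directory has already been set:
#Download and unpack both corpus archives (a sketch; the exact URLs are an
#assumption based on the file names above)
base_url <- "https://spamassassin.apache.org/publiccorpus/"
for (f in c("20030228_easy_ham.tar.bz2", "20050311_spam_2.tar.bz2")) {
  download.file(paste0(base_url, f), destfile = f)
  untar(f) #creates the easy_ham/ and spam_2/ directories
}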
setwd("~/Desktop/IS607/Data-607/Week 10 Assignment")
require(RCurl)
## Loading required package: RCurl
## Loading required package: bitops
require(XML)
## Loading required package: XML
require(stringr)
## Loading required package: stringr
require(tm)
## Loading required package: tm
## Loading required package: NLP
require(SnowballC)
## Loading required package: SnowballC
require(RTextTools)
## Loading required package: RTextTools
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
##
## Attaching package: 'RTextTools'
## The following objects are masked from 'package:SnowballC':
##
## getStemLanguages, wordStem
length(list.files("easy_ham")) #2501 HAM files
## [1] 2501
length(list.files("spam_2")) #1397 SPAM files
## [1] 1397
list.files("easy_ham")[1:5]
## [1] "00001.7c53336b37003a9286aba55d2945844c"
## [2] "00002.9c4069e25e1ef370c078db7ee85ff9ac"
## [3] "00003.860e3c3cee1b42ead714c5c874fe25f7"
## [4] "00004.864220c5b6930b209cc287c361c99af1"
## [5] "00005.bf27cdeaf0b8c4647ecd61b1d09da613"
list.files("spam_2")[1:5]
## [1] "00001.317e78fa8ee2f54cd4890fdc09ba8176"
## [2] "00002.9438920e9a55591b18e60d1ed37d992b"
## [3] "00003.590eff932f8704d8b0fcbe69d023b54d"
## [4] "00004.bdcc075fa4beb5157b5dd6cd41d8887b"
## [5] "00005.ed0aba4d386c5e62bc737cf3f0ed9589"
Before we can perform supervised learning on the data, we must build a corpus and then a document-term matrix that contains records from both the HAM and SPAM datasets.
#Create a list of all files in each directory
spam_list <- list.files("spam_2", full.names = TRUE)
ham_list <- list.files("easy_ham", full.names = TRUE)
#test process out on one record from SPAM dataset
tmp <- readLines(spam_list[1])
tmp <- str_c(tmp, collapse = "")
email_corpus <- Corpus(VectorSource(tmp))
#Assign category for SPAM in meta data
meta(email_corpus[[1]], "category") <- "Spam"
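A quick retrieval confirms the tag was stored (a minimal sketch; the exact print format depends on the tm version installed):
#Confirm the category tag is attached to the first document's metadata
meta(email_corpus[[1]], "category")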
Now that the process has been verified on one record from the SPAM dataset, we will use two loops to build a corpus containing all records from the SPAM and HAM datasets.
#SPAM loop
#Start at 2 because the first SPAM record was added to the corpus above
n <- 1
for (i in 2:length(spam_list)) {
  tmp <- readLines(spam_list[i])
  tmp <- str_c(tmp, collapse = "")
  #try to fix encoding issue with idea from Stack Overflow
  tmp <- iconv(tmp, to = "utf-8-mac", sub = "")
  if (length(tmp) != 0) {
    n <- n + 1
    temp_corpus <- Corpus(VectorSource(tmp))
    email_corpus <- c(email_corpus, temp_corpus)
    meta(email_corpus[[n]], "category") <- "Spam"
  }
}
#HAM loop
#Leave the counter at n from the previous loop
for (i in 1:length(ham_list)) {
  tmp <- readLines(ham_list[i])
  tmp <- str_c(tmp, collapse = "")
  #try to fix encoding issue with idea from Stack Overflow
  tmp <- iconv(tmp, to = "utf-8-mac", sub = "")
  if (length(tmp) != 0) {
    n <- n + 1
    temp_corpus <- Corpus(VectorSource(tmp))
    email_corpus <- c(email_corpus, temp_corpus)
    meta(email_corpus[[n]], "category") <- "Ham"
  }
}
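As an aside, growing a corpus with c() inside a loop copies it on every iteration, so the same corpus could be built more efficiently in one pass. A loop-free sketch (not the approach used above; read_email is a hypothetical helper):
#Read each file into a single string, applying the same encoding fix as above
read_email <- function(path) {
  iconv(str_c(readLines(path, warn = FALSE), collapse = ""),
        to = "utf-8-mac", sub = "")
}
all_text <- vapply(c(spam_list, ham_list), read_email, character(1))
alt_corpus <- Corpus(VectorSource(all_text))
#Attach the category metadata in one pass
alt_labels <- c(rep("Spam", length(spam_list)), rep("Ham", length(ham_list)))
for (j in seq_along(alt_corpus)) meta(alt_corpus[[j]], "category") <- alt_labels[j]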
Now we can check the counts of the category metadata in the corpus.
metadata <- unlist(meta(email_corpus, "category"))
table(metadata) #view meta data for spam/ham
## metadata
## Ham Spam
## 2501 1397
Prior to creating the document-term matrix (DTM), we need to randomize the order of the documents, since the training/testing split below is positional and both classes must appear in each part.
email_corpus <- sample(email_corpus) #randomize
email_corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3898
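Note that sample() makes the document order non-deterministic, so the positional train/test split used later will differ from run to run. Fixing the random seed first makes the shuffle reproducible (a sketch; the seed value is arbitrary):
set.seed(123) #arbitrary seed so the shuffle, and the split below, are reproducible
email_corpus <- sample(email_corpus)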
Before we create the DTM, we need to clean the corpus data.
#NOTE: There seems to be a bug in a recent tm revision that causes the files
#created by tm_map to not pull into a DTM correctly, and it appears to be
#isolated to Mac. This code works when it is not run from R Markdown, and from
#the discussion board at least two others had the same issue, so this section
#was turned off because it would not knit.
#convert all letters to lower case first, so that stop-word removal and
#stemming operate on consistent case
email_corpus <- tm_map(email_corpus, content_transformer(tolower), lazy = TRUE)
#remove numbers
email_corpus <- tm_map(email_corpus, content_transformer(removeNumbers), lazy = TRUE)
#remove English stop words
email_corpus <- tm_map(email_corpus, content_transformer(removeWords), words = stopwords("en"), lazy = TRUE)
#replace punctuation with spaces
email_corpus <- tm_map(email_corpus, content_transformer(str_replace_all), pattern = "[[:punct:]]", replacement = " ", lazy = TRUE)
#stem words last, after lower-casing and stop-word removal
email_corpus <- tm_map(email_corpus, content_transformer(stemDocument), lazy = TRUE)
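For what it is worth, a workaround often suggested on Stack Overflow for Mac-specific tm_map problems of that era was to disable parallel execution, since tm_map dispatched to parallel::mclapply on OS X. Whether the mc.cores argument is accepted depends on the installed tm version, so treat this sketch as an assumption:
#Force serial execution by limiting mclapply to one core (version-dependent)
email_corpus <- tm_map(email_corpus, content_transformer(tolower), mc.cores = 1)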
Now that the corpus has been created and cleaned, we can move forward with creating the DTM.
#convert to plain text document
email_corpus <- tm_map(email_corpus, content_transformer(PlainTextDocument), lazy = TRUE)
#Create DTM
dtm <- DocumentTermMatrix(email_corpus)
dtm
## <<DocumentTermMatrix (documents: 3898, terms: 105920)>>
## Non-/sparse entries: 762739/412113421
## Sparsity : 100%
## Maximal term length: 17339
## Weighting : term frequency (tf)
#Remove sparse terms: keep only terms that appear in at least 10 documents
dtm <- removeSparseTerms(dtm, 1-(10/length(email_corpus)))
dtm
## <<DocumentTermMatrix (documents: 3898, terms: 7423)>>
## Non-/sparse entries: 596339/28338515
## Sparsity : 98%
## Maximal term length: 70
## Weighting : term frequency (tf)
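As a sanity check on the reduced vocabulary, tm's findFreqTerms can list the terms that occur above a frequency threshold (a sketch; the threshold of 500 is arbitrary):
#Show the first few terms appearing at least 500 times across the corpus
head(findFreqTerms(dtm, lowfreq = 500), 20)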
With the DTM created, we can now set up the three supervised learning methods we will use for classification.
#create vector with labels
category_labels <- unlist(meta(email_corpus, "category"))
category_labels[1:5]
## 1 1 1 1 1
## "Ham" "Ham" "Ham" "Ham" "Ham"
N <- length(category_labels)
#create container with relevant information used in the estimation procedures
container <- create_container(
  dtm,
  labels = category_labels,
  trainSize = 1:1000,  #use 1000 records for training
  testSize = 1001:N,   #use the remaining records for testing
  virgin = FALSE       #all documents carry known labels
)
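Because the split is positional, the earlier shuffle is what keeps both classes represented in each part. As an alternative to a single split, RTextTools can also estimate out-of-sample accuracy with n-fold cross-validation (a sketch; four folds is an arbitrary choice):
#Estimate SVM accuracy with 4-fold cross-validation; per-fold accuracy is printed
svm_cv <- cross_validate(container, 4, "SVM")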
For this assignment we will use three model types: a support vector machine (SVM), a decision tree (RTextTools' "TREE" algorithm), and maximum entropy.
#Estimation procedures
slotNames(container)
## [1] "training_matrix" "classification_matrix" "training_codes"
## [4] "testing_codes" "column_names" "virgin"
svm_model <- train_model(container, "SVM")
tree_model <- train_model(container, "TREE")
maxent_model <- train_model(container, "MAXENT")
svm_out <- classify_model(container, svm_model)
tree_out <- classify_model(container, tree_model)
maxent_out <- classify_model(container, maxent_model)
#inspect outcome of each procedure
head(svm_out)
## SVM_LABEL SVM_PROB
## 1 Ham 0.9999923
## 2 Ham 0.9996881
## 3 Spam 1.0000000
## 4 Ham 0.9979691
## 5 Spam 1.0000000
## 6 Spam 0.9968999
head(tree_out)
## TREE_LABEL TREE_PROB
## 1 Ham 1
## 2 Ham 1
## 3 Spam 1
## 4 Ham 1
## 5 Spam 1
## 6 Spam 1
head(maxent_out)
## MAXENTROPY_LABEL MAXENTROPY_PROB
## 1 Ham 1
## 2 Ham 1
## 3 Spam 1
## 4 Ham 1
## 5 Spam 1
## 6 Spam 1
Finally, we can examine in detail how each model performed on our data.
labels_out <- data.frame(
  correct_label = category_labels[1001:N],
  svm = as.character(svm_out[, 1]),
  tree = as.character(tree_out[, 1]),
  maxent = as.character(maxent_out[, 1]),
  stringsAsFactors = FALSE
)
#SVM
table(labels_out[,1] == labels_out[,2])
##
## FALSE TRUE
## 23 2875
prop.table(table(labels_out[,1] == labels_out[,2]))
##
## FALSE TRUE
## 0.007936508 0.992063492
#Decision Tree
table(labels_out[,1] == labels_out[,3])
##
## FALSE TRUE
## 39 2859
prop.table(table(labels_out[,1] == labels_out[,3]))
##
## FALSE TRUE
## 0.01345756 0.98654244
#Maximum Entropy
table(labels_out[,1] == labels_out[,4])
##
## FALSE TRUE
## 27 2871
prop.table(table(labels_out[,1] == labels_out[,4]))
##
## FALSE TRUE
## 0.00931677 0.99068323
Examining the results of the three methods, the SVM appears to be the most accurate (~99.2%), but all three models perform extremely well. I am somewhat skeptical of results this strong: the high performance may be related to the large number of training/testing documents, to the source data being unusually clean, or to an error in the construction of the models.
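One way to dig deeper than raw accuracy is RTextTools' create_analytics, which summarizes precision, recall, and F-scores per label and per algorithm; a sketch using the classifier outputs above:
#Combine the three classifiers' outputs and summarize performance
analytics <- create_analytics(container, cbind(svm_out, tree_out, maxent_out))
summary(analytics)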