This document pertains to the Week 11 assignment for CUNY607 - Automated Data Collection with R. In this week's assignment, we explore text classification. To this end, we work with a set of emails that have already been identified as "spam" or "ham". We download these emails into "spam" and "ham" folders under the working directory, read them into a corpus while tagging each as spam (s) or ham (h), build a document-term matrix, perform some data cleansing, and finally classify the documents.
For the purposes of this assignment, we assume that the "spam" and "ham" archives have been downloaded, extracted, renamed "spam" and "ham" respectively, and placed under the working directory.
Figure: the "spam" and "ham" folders under the working directory.
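The emails appear to come from the SpamAssassin public corpus (the file naming matches it). A minimal sketch for downloading and extracting one spam and one ham archive is shown below; the archive file names are assumptions and should be adjusted to whichever snapshot is actually used.

# Sketch (not part of the assignment script): download and extract the archives.
# The archive names below are assumptions; adjust them to the snapshot being used.
base_url <- "https://spamassassin.apache.org/old/publiccorpus/"
download.file(paste0(base_url, "20030228_spam.tar.bz2"), "spam.tar.bz2", mode = "wb")
download.file(paste0(base_url, "20030228_easy_ham.tar.bz2"), "ham.tar.bz2", mode = "wb")
untar("spam.tar.bz2")           # assumed to extract into a "spam" folder
untar("ham.tar.bz2")            # assumed to extract into an "easy_ham" folder
file.rename("easy_ham", "ham")  # rename so it matches the expected "ham" folder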
We are using the following packages:

* stringr
* XML
* tm
* SnowballC
* RTextTools
## Loading required package: NLP
## Warning: package 'RTextTools' was built under R version 3.2.4
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
##
## Attaching package: 'RTextTools'
## The following objects are masked from 'package:SnowballC':
##
## getStemLanguages, wordStem
Once again, we assume that the "spam" and "ham" emails reside in their respective directories under the working directory.
We set the spam_directory and ham_directory variables below. These must be updated if a different directory naming convention is used.
spam_directory <- "spam"
ham_directory <- "ham"
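As a quick safeguard (an addition, not in the original script), we can verify that both folders exist before processing; dir.exists() is available in R 3.2.0 and later.

# Sketch: stop early if either folder is missing
if (!dir.exists(spam_directory) || !dir.exists(ham_directory)) {
  stop("Expected 'spam' and 'ham' folders under the working directory")
}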
We will load each email into a corpus of text documents, classifying each with either "s" or "h" to indicate whether the email came from the spam or ham folder. In addition, we will add subject, content_type, and date as metadata for each document. Before loading the emails into the corpus, some clean-up will be done.
An analysis of some emails indicates that they are either text-based or HTML-based documents. Once the metadata is extracted, the corresponding header lines of the email are dropped. We process each set of emails individually, using a for loop to iterate over every item in the folder that matches the file-name pattern.
Spam Emails:
# Construct list of all spam emails
l_spam <- list.files(spam_directory, pattern = "\\d\\d\\d\\d\\d\\.", all.files = FALSE, full.names = TRUE)
# Count emails in the spam folder
number_spam <- length(l_spam)
# initialize counter to be used across spam and ham emails
n <- 0
# initiate for loop
for (i in 1:number_spam) {
  # read email
  tmp <- readLines(l_spam[i])
  if (length(tmp) != 0) {
    # increment counter
    n <- n + 1
    # Extract Subject line
    subject_pos <- grep("Subject:", tmp, ignore.case = FALSE, value = FALSE)
    subject_tmp <- tmp[subject_pos][1] ### We will use the first subject encountered
    subject_txt <- str_trim(str_sub(subject_tmp, start = 9))
    # Extract Date and Time line; we will store it as text
    date_pos <- grep("Date:", tmp, ignore.case = FALSE, value = FALSE)
    date_tmp <- tmp[date_pos][1] ### We will use the first date encountered
    date_txt <- str_trim(str_sub(date_tmp, start = 6))
    # Extract Content-Type
    content_pos <- grep("Content-Type:", tmp, ignore.case = FALSE, value = FALSE)
    content_tmp <- tmp[content_pos][1]
    content_txt <- str_trim(str_sub(content_tmp, start = 14))
    # retrieve the latest position among the Date, Subject, Content-Type matches
    start_line <- max(unlist(subject_pos), unlist(date_pos), unlist(content_pos)) + 1
    end_line <- length(tmp)
    # keep only the lines after the latest of date, subject, content_type
    tmp <- tmp[start_line:end_line]
    # collapse the remaining lines into a single string
    tmp <- str_c(tmp, collapse = " ")
    # remove html tags
    tmp <- str_replace_all(tmp, pattern = "<.*?>", replacement = " ")
    # Add to corpus: create (n == 1) or append
    if (n == 1) {
      mail_corpus <- Corpus(VectorSource(tmp))
    } else {
      tmp_corpus <- Corpus(VectorSource(tmp))
      mail_corpus <- c(mail_corpus, tmp_corpus)
    }
    # Add metadata for the new element of the corpus
    meta(mail_corpus[[n]], "subject") <- subject_txt
    meta(mail_corpus[[n]], "date") <- date_txt
    meta(mail_corpus[[n]], "content_type") <- content_txt
    meta(mail_corpus[[n]], "class") <- "s"
  }
}
## Warning in readLines(l_spam[i]): incomplete final line found on 'spam/
## 00136.faa39d8e816c70f23b4bb8758d8a74f0'
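Before moving on to the ham emails, a quick sanity check (an added line, not in the original code) confirms that the counter agrees with the number of documents loaded so far.

# Sketch: the counter and the corpus length should match after the spam folder
length(mail_corpus) == n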
Similarly, we now process all the emails from the ham folder. Note that since we want all email text in the same corpus, the counter is not reset, and we keep appending to mail_corpus. We then randomize the corpus with the sample function so that it can later be partitioned into a training set and a test set for analysis.
Ham Emails:
# Construct list of all ham emails
l_ham <- list.files(ham_directory, pattern = "\\d\\d\\d\\d\\d\\.", all.files = FALSE, full.names = TRUE)
# Count emails in the ham folder
number_ham <- length(l_ham)
# initiate for loop
for (i in 1:number_ham) {
  # read email
  tmp <- readLines(l_ham[i])
  if (length(tmp) != 0) {
    # increment counter
    n <- n + 1
    # Extract Subject line
    subject_pos <- grep("Subject:", tmp, ignore.case = FALSE, value = FALSE)
    subject_tmp <- tmp[subject_pos][1] ### We will use the first subject encountered
    subject_txt <- str_trim(str_sub(subject_tmp, start = 9))
    # Extract Date and Time line; we will store it as text
    date_pos <- grep("Date:", tmp, ignore.case = FALSE, value = FALSE)
    date_tmp <- tmp[date_pos][1] ### We will use the first date encountered
    date_txt <- str_trim(str_sub(date_tmp, start = 6))
    # Extract Content-Type
    content_pos <- grep("Content-Type:", tmp, ignore.case = FALSE, value = FALSE)
    content_tmp <- tmp[content_pos][1]
    content_txt <- str_trim(str_sub(content_tmp, start = 14))
    # retrieve the latest position among the Date, Subject, Content-Type matches
    start_line <- max(unlist(subject_pos), unlist(date_pos), unlist(content_pos)) + 1
    end_line <- length(tmp)
    # keep only the lines after the latest of date, subject, content_type
    tmp <- tmp[start_line:end_line]
    # collapse the remaining lines into a single string
    tmp <- str_c(tmp, collapse = " ")
    # remove html tags
    tmp <- str_replace_all(tmp, pattern = "<.*?>", replacement = " ")
    # Add to corpus: create (n == 1) or append
    if (n == 1) {
      mail_corpus <- Corpus(VectorSource(tmp))
    } else {
      tmp_corpus <- Corpus(VectorSource(tmp))
      mail_corpus <- c(mail_corpus, tmp_corpus)
    }
    # Add metadata for the new element of the corpus
    meta(mail_corpus[[n]], "subject") <- subject_txt
    meta(mail_corpus[[n]], "date") <- date_txt
    meta(mail_corpus[[n]], "content_type") <- content_txt
    meta(mail_corpus[[n]], "class") <- "h"
  }
}
# randomize corpus
mail_corpus <- sample(mail_corpus)
mail_corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1900
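Note that sample() draws a random permutation, so the exact ordering, and therefore the later train/test split, will differ between runs. A minimal tweak for reproducibility would be to fix the seed before shuffling (123 is an arbitrary choice).

# Sketch: fix the random seed so the shuffled order is reproducible
set.seed(123)
mail_corpus <- sample(mail_corpus)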
# Meta Data
meta_subject <- meta(mail_corpus, type = "local", tag = "subject")
meta_date <- meta(mail_corpus, type = "local", tag = "date")
meta_content <- meta(mail_corpus, type = "local", tag = "content_type")
meta_class <- meta(mail_corpus, type = "local", tag = "class")
length(meta_class)
## [1] 1900
length(meta_date)
## [1] 1900
length(meta_subject)
## [1] 1900
length(meta_content)
## [1] 1900
meta_data_mail <- data.frame(
  subject = unlist(meta_subject),
  date = unlist(meta_date),
  content_type = unlist(meta_content),
  class = unlist(meta_class))
head(meta_data_mail)
## subject
## 1 Re: [Baseline] Raising chickens the high-tech way
## 2 Re: [Razor-users] dot-tk registrations hitting Razor
## 3 Pics and thanks
## 4 [ILUG-Social] HELLO
## 5 Tired of paying big bucks for cable
## 6 Re: request ?
## date
## 1 Tue, 23 Jul 2002 15:05:48 -0700
## 2 Wed, 14 Aug 2002 17:39:50 -0700
## 3 Thu, 1 Aug 2002 02:11:10 -0400 (EDT)
## 4 ginal-Date: Wed, 18 Sep 2002 01:11:52
## 5 Tue, 24 Sep 2002 10:16:22 -0400 (EDT)
## 6 ginal-Date: Fri, 16 Aug 2002 15:21:49 +0200
## content_type class
## 1 <NA> h
## 2 text/plain; charset=us-ascii h
## 3 TEXT/PLAIN; charset=US-ASCII h
## 4 text/plain; charset="iso-8859-1" s
## 5 text/plain; charset="iso-8859-1" s
## 6 text/plain; charset=ISO-8859-15 h
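As an added sanity check (output not shown), we could also tabulate the class column to see how many spam and ham emails made it into the corpus.

# Sketch: class balance of the corpus
table(meta_data_mail$class)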
Cleaning the corpus:
We are now going to run some cleaning operations on the corpus using the tm_map function from the tm package:

* remove numbers
* remove stopwords
* remove punctuation
* change to lower case
* remove extra white space
* reduce terms in each document to their stems
# remove numbers
mail_corpus <- tm_map(mail_corpus, removeNumbers)
# remove stopwords
mail_corpus <- tm_map(mail_corpus, removeWords, words = stopwords("en"))
# remove punctuation; since we want to replace it with a space, we use a regular expression
mail_corpus <- tm_map(mail_corpus, str_replace_all, pattern="[[:punct:]]", replacement = " ")
# change to lower case
mail_corpus <- tm_map(mail_corpus, tolower)
# remove white space
mail_corpus <- tm_map(mail_corpus, stripWhitespace)
# reduce term in document to stem
mail_corpus <- tm_map(mail_corpus, stemDocument)
# force mail_corpus back to plain text documents (some transformations above return plain character vectors rather than text documents)
mail_corpus <- tm_map(mail_corpus, PlainTextDocument)
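To confirm the transformations behaved as expected, we could peek at one cleaned document (an added check; output omitted).

# Sketch: view the content of the first document after cleaning
as.character(mail_corpus[[1]])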
Building a Document-term Matrix and Container
dtm <- DocumentTermMatrix(mail_corpus)
dtm <- removeSparseTerms(dtm, 1-(10/length(mail_corpus)))
dtm
## <<DocumentTermMatrix (documents: 1900, terms: 3048)>>
## Non-/sparse entries: 150190/5641010
## Sparsity : 97%
## Maximal term length: 70
## Weighting : term frequency (tf)
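For context on the removeSparseTerms() call above: the sparsity threshold works out to 1 - (10/1900) ≈ 0.9947, so, roughly speaking, terms that appear in about 10 or fewer of the 1,900 documents are dropped, which is what shrinks the matrix to the 3,048 terms shown.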
# Create label based on class meta data
class_label <- unlist(meta_class)
# use 1/3 of the corpus (633 of 1900 documents) for training; the order was randomized, so we should get a mix of spam and ham
# since the true labels were stored as metadata when the corpus was created, we set virgin = FALSE
mail_container <- create_container(dtm, labels = class_label, trainSize = 1:633, testSize = 634:1900, virgin = FALSE)
Running Statistical Text Processing

We will train Support Vector Machine (SVM), decision tree (TREE), and Maximum Entropy (MAXENT) models.
# Training models
svm_mail_model <- train_model(mail_container, "SVM")
tree_mail_model <- train_model(mail_container, "TREE")
maxent_mail_model <- train_model(mail_container, "MAXENT")
svm_mail_out <- classify_model(mail_container, svm_mail_model)
tree_mail_out <- classify_model(mail_container, tree_mail_model)
maxent_mail_out <- classify_model(mail_container, maxent_mail_model)
Looking at Results
We are now going to look at the results and compare the effectiveness of each model.
head(svm_mail_out)
## SVM_LABEL SVM_PROB
## 1 h 0.9904617
## 2 h 0.9999514
## 3 s 0.9790393
## 4 h 0.9992861
## 5 h 0.9946637
## 6 h 0.8236364
head(tree_mail_out)
## TREE_LABEL TREE_PROB
## 1 h 0.9977273
## 2 h 0.9977273
## 3 s 0.8644068
## 4 h 0.9977273
## 5 h 0.9977273
## 6 h 0.9977273
head(maxent_mail_out)
## SVM_LABEL SVM_PROB
## 1 h 0.9884091
## 2 h 0.9999264
## 3 s 0.9755319
## 4 h 0.9990335
## 5 h 0.9933558
## 6 h 0.8138945
# Build data frame of labels: actual vs. predicted
labels_out <- data.frame(correct_label = class_label[634:1900],
                         svm = as.character(svm_mail_out[, 1]),
                         tree = as.character(tree_mail_out[, 1]),
                         maxent = as.character(maxent_mail_out[, 1]),
                         stringsAsFactors = FALSE)
# SVM Performance
table(labels_out[, 1] == labels_out[, 2])
##
## FALSE TRUE
## 44 1223
prop.table(table(labels_out[, 1] == labels_out[, 2]))
##
## FALSE TRUE
## 0.0347277 0.9652723
# Decision Tree Performance
table(labels_out[, 1] == labels_out[, 3])
##
## FALSE TRUE
## 79 1188
prop.table(table(labels_out[, 1] == labels_out[, 3]))
##
## FALSE TRUE
## 0.06235201 0.93764799
# Maximum Entropy Performance
table(labels_out[, 1] == labels_out[, 4])
##
## FALSE TRUE
## 44 1223
prop.table(table(labels_out[, 1] == labels_out[, 4]))
##
## FALSE TRUE
## 0.0347277 0.9652723
Both the Support Vector Machine and Maximum Entropy models classify roughly 96.5% of the test emails correctly, while the decision tree model is less accurate at about 94%.
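A natural follow-up (not part of the original analysis) would be to look at per-class confusion matrices rather than overall accuracy, since flagging ham as spam is usually more costly than letting spam through. A minimal sketch using the labels_out data frame:

# Sketch: confusion matrices of actual vs. predicted labels for each model
table(actual = labels_out$correct_label, predicted = labels_out$svm)
table(actual = labels_out$correct_label, predicted = labels_out$tree)
table(actual = labels_out$correct_label, predicted = labels_out$maxent)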