This document pertains to the Week 11 assignment for CUNY607 - Automated Data Collection with R. In this week's assignment, we explore text classification. To this end, we work with a set of emails that have already been identified as "spam" or "ham". We download these emails into "spam" and "ham" folders under the working directory, read them into a corpus while tagging each as spam (s) or ham (h), build a document-term matrix, perform some data cleansing, and finally classify the documents.
For the purposes of this assignment, we assume that the "spam" and "ham" archives have been downloaded, extracted, renamed "spam" and "ham" respectively, and placed under the working directory.
Figure: the "spam" and "ham" folders under the working directory.
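The emails appear to come from the SpamAssassin public corpus (the file naming matches it). A minimal sketch for downloading and extracting one spam and one ham archive is shown below; the archive file names are assumptions and should be adjusted to whichever snapshot is actually used.

# Sketch (not part of the assignment script): download and extract the archives.
# The archive names below are assumptions; adjust them to the snapshot being used.
base_url <- "https://spamassassin.apache.org/old/publiccorpus/"
download.file(paste0(base_url, "20030228_spam.tar.bz2"), "spam.tar.bz2", mode = "wb")
download.file(paste0(base_url, "20030228_easy_ham.tar.bz2"), "ham.tar.bz2", mode = "wb")
untar("spam.tar.bz2")           # assumed to extract into a "spam" folder
untar("ham.tar.bz2")            # assumed to extract into an "easy_ham" folder
file.rename("easy_ham", "ham")  # rename so it matches the expected "ham" folder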
We are using the following packages:

* stringr
* XML
* tm
* SnowballC
* RTextTools
## Loading required package: NLP
## Warning: package 'RTextTools' was built under R version 3.2.4
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
##
## Attaching package: 'RTextTools'
## The following objects are masked from 'package:SnowballC':
##
## getStemLanguages, wordStem
Once again, we assume that the "spam" and "ham" emails reside in their respective directories under the working directory.
We set the spam_directory and ham_directory variables below. These must be updated if a different directory naming convention is used.
spam_directory <- "spam"
ham_directory <- "ham"
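As a quick safeguard (an addition, not in the original script), we can verify that both folders exist before processing; dir.exists() is available in R 3.2.0 and later.

# Sketch: stop early if either folder is missing
if (!dir.exists(spam_directory) || !dir.exists(ham_directory)) {
  stop("Expected 'spam' and 'ham' folders under the working directory")
}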
We will load each email into a corpus of text documents, classifying each with either "s" or "h" to indicate whether the email came from the spam or ham folder. In addition, we will add subject, content_type, and date as metadata for each document. Before loading the emails into the corpus, some clean-up will be done.
An analysis of some emails indicates that they are either text-based or HTML-based documents. Once the metadata is extracted, the corresponding header lines of the email are dropped. We process each set of emails individually, using a for loop to iterate over every item in the folder that matches the file-name pattern.
Spam Emails:
# Construct list of all spam emails
l_spam <- list.files(spam_directory, pattern = "\\d\\d\\d\\d\\d\\.", all.files = FALSE, full.names = TRUE)
# Count emails in the spam folder
number_spam <- length(l_spam)
# initialize counter to be used across spam and ham emails
n <- 0
# initiate for loop
for (i in 1:number_spam) {
  # read email
  tmp <- readLines(l_spam[i])
  if (length(tmp) != 0) {
    # increment counter
    n <- n + 1
    # Extract Subject line
    subject_pos <- grep("Subject:", tmp, ignore.case = FALSE, value = FALSE)
    subject_tmp <- tmp[subject_pos][1] ### We will use the first subject encountered
    subject_txt <- str_trim(str_sub(subject_tmp, start = 9))
    # Extract Date and Time line; we will store it as text
    date_pos <- grep("Date:", tmp, ignore.case = FALSE, value = FALSE)
    date_tmp <- tmp[date_pos][1] ### We will use the first date encountered
    date_txt <- str_trim(str_sub(date_tmp, start = 6))
    # Extract Content-Type
    content_pos <- grep("Content-Type:", tmp, ignore.case = FALSE, value = FALSE)
    content_tmp <- tmp[content_pos][1]
    content_txt <- str_trim(str_sub(content_tmp, start = 14))
    # retrieve the latest position among the Date, Subject, Content-Type matches
    start_line <- max(unlist(subject_pos), unlist(date_pos), unlist(content_pos)) + 1
    end_line <- length(tmp)
    # keep only the lines after the latest of date, subject, content_type
    tmp <- tmp[start_line:end_line]
    # collapse the remaining lines into a single string
    tmp <- str_c(tmp, collapse = " ")
    # remove html tags
    tmp <- str_replace_all(tmp, pattern = "<.*?>", replacement = " ")
    # Add to corpus: create (n == 1) or append
    if (n == 1) {
      mail_corpus <- Corpus(VectorSource(tmp))
    } else {
      tmp_corpus <- Corpus(VectorSource(tmp))
      mail_corpus <- c(mail_corpus, tmp_corpus)
    }
    # Add metadata for the new element of the corpus
    meta(mail_corpus[[n]], "subject") <- subject_txt
    meta(mail_corpus[[n]], "date") <- date_txt
    meta(mail_corpus[[n]], "content_type") <- content_txt
    meta(mail_corpus[[n]], "class") <- "s"
  }
}
## Warning in readLines(l_spam[i]): incomplete final line found on 'spam/
## 00136.faa39d8e816c70f23b4bb8758d8a74f0'
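Before moving on to the ham emails, a quick sanity check (an added line, not in the original code) confirms that the counter agrees with the number of documents loaded so far.

# Sketch: the counter and the corpus length should match after the spam folder
length(mail_corpus) == n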
Similarly, we now process all the emails from the ham folder. Note that since we want all email text in the same corpus, the counter is not reset, and we keep appending to mail_corpus. We then randomize the corpus with the sample function so that it can later be partitioned into a training set and a test set for analysis.
Ham Emails:
# Construct list of all ham emails
l_ham <- list.files(ham_directory, pattern = "\\d\\d\\d\\d\\d\\.", all.files = FALSE, full.names = TRUE)
# Count emails in the ham folder
number_ham <- length(l_ham)
# initiate for loop
for (i in 1:number_ham) {
  # read email
  tmp <- readLines(l_ham[i])
  if (length(tmp) != 0) {
    # increment counter
    n <- n + 1
    # Extract Subject line
    subject_pos <- grep("Subject:", tmp, ignore.case = FALSE, value = FALSE)
    subject_tmp <- tmp[subject_pos][1] ### We will use the first subject encountered
    subject_txt <- str_trim(str_sub(subject_tmp, start = 9))
    # Extract Date and Time line; we will store it as text
    date_pos <- grep("Date:", tmp, ignore.case = FALSE, value = FALSE)
    date_tmp <- tmp[date_pos][1] ### We will use the first date encountered
    date_txt <- str_trim(str_sub(date_tmp, start = 6))
    # Extract Content-Type
    content_pos <- grep("Content-Type:", tmp, ignore.case = FALSE, value = FALSE)
    content_tmp <- tmp[content_pos][1]
    content_txt <- str_trim(str_sub(content_tmp, start = 14))
    # retrieve the latest position among the Date, Subject, Content-Type matches
    start_line <- max(unlist(subject_pos), unlist(date_pos), unlist(content_pos)) + 1
    end_line <- length(tmp)
    # keep only the lines after the latest of date, subject, content_type
    tmp <- tmp[start_line:end_line]
    # collapse the remaining lines into a single string
    tmp <- str_c(tmp, collapse = " ")
    # remove html tags
    tmp <- str_replace_all(tmp, pattern = "<.*?>", replacement = " ")
    # Add to corpus: create (n == 1) or append
    if (n == 1) {
      mail_corpus <- Corpus(VectorSource(tmp))
    } else {
      tmp_corpus <- Corpus(VectorSource(tmp))
      mail_corpus <- c(mail_corpus, tmp_corpus)
    }
    # Add metadata for the new element of the corpus
    meta(mail_corpus[[n]], "subject") <- subject_txt
    meta(mail_corpus[[n]], "date") <- date_txt
    meta(mail_corpus[[n]], "content_type") <- content_txt
    meta(mail_corpus[[n]], "class") <- "h"
  }
}
# randomize corpus
mail_corpus <- sample(mail_corpus)
mail_corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1900
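Note that sample() draws a random permutation, so the exact ordering, and therefore the later train/test split, will differ between runs. A minimal tweak for reproducibility would be to fix the seed before shuffling (123 is an arbitrary choice).

# Sketch: fix the random seed so the shuffled order is reproducible
set.seed(123)
mail_corpus <- sample(mail_corpus)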
# Meta Data
meta_subject <- meta(mail_corpus, type = "local", tag = "subject")
meta_date <- meta(mail_corpus, type = "local", tag = "date")
meta_content <- meta(mail_corpus, type = "local", tag = "content_type")
meta_class <- meta(mail_corpus, type = "local", tag = "class")
length(meta_class)
## [1] 1900
length(meta_date)
## [1] 1900
length(meta_subject)
## [1] 1900
length(meta_content)
## [1] 1900
meta_data_mail <- data.frame(
  subject = unlist(meta_subject),
  date = unlist(meta_date),
  content_type = unlist(meta_content),
  class = unlist(meta_class))
head(meta_data_mail)
## subject
## 1 Re: [Baseline] Raising chickens the high-tech way
## 2 Re: [Razor-users] dot-tk registrations hitting Razor
## 3 Pics and thanks
## 4 [ILUG-Social] HELLO
## 5 Tired of paying big bucks for cable
## 6 Re: request ?
## date
## 1 Tue, 23 Jul 2002 15:05:48 -0700
## 2 Wed, 14 Aug 2002 17:39:50 -0700
## 3 Thu, 1 Aug 2002 02:11:10 -0400 (EDT)
## 4 ginal-Date: Wed, 18 Sep 2002 01:11:52
## 5 Tue, 24 Sep 2002 10:16:22 -0400 (EDT)
## 6 ginal-Date: Fri, 16 Aug 2002 15:21:49 +0200
## content_type class
## 1 <NA> h
## 2 text/plain; charset=us-ascii h
## 3 TEXT/PLAIN; charset=US-ASCII h
## 4 text/plain; charset="iso-8859-1" s
## 5 text/plain; charset="iso-8859-1" s
## 6 text/plain; charset=ISO-8859-15 h
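As an added sanity check (output not shown), we could also tabulate the class column to see how many spam and ham emails made it into the corpus.

# Sketch: class balance of the corpus
table(meta_data_mail$class)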
Cleaning the corpus:
We are now going to run some cleaning operations on the corpus using the tm_map function from the tm package:

* remove numbers
* remove stopwords
* remove punctuation
* change to lower case
* remove extra white space
* reduce terms in each document to their stems
# remove numbers
mail_corpus <- tm_map(mail_corpus, removeNumbers)
# remove stopwords
mail_corpus <- tm_map(mail_corpus, removeWords, words = stopwords("en"))
# remove punctuation; since we want to replace it with a space, we use a regular expression
mail_corpus <- tm_map(mail_corpus, str_replace_all, pattern="[[:punct:]]", replacement = " ")
# change to lower case
mail_corpus <- tm_map(mail_corpus, tolower)
# remove white space
mail_corpus <- tm_map(mail_corpus, stripWhitespace)
# reduce term in document to stem
mail_corpus <- tm_map(mail_corpus, stemDocument)
# force mail_corpus back to plain text documents (some transformations above return plain character vectors rather than text documents)
mail_corpus <- tm_map(mail_corpus, PlainTextDocument)
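To confirm the transformations behaved as expected, we could peek at one cleaned document (an added check; output omitted).

# Sketch: view the content of the first document after cleaning
as.character(mail_corpus[[1]])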
Building a Document-term Matrix and Container
dtm <- DocumentTermMatrix(mail_corpus)
dtm <- removeSparseTerms(dtm, 1-(10/length(mail_corpus)))
dtm
## <<DocumentTermMatrix (documents: 1900, terms: 3048)>>
## Non-/sparse entries: 150190/5641010
## Sparsity : 97%
## Maximal term length: 70
## Weighting : term frequency (tf)
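For context on the removeSparseTerms() call above: the sparsity threshold works out to 1 - (10/1900) ≈ 0.9947, so, roughly speaking, terms that appear in about 10 or fewer of the 1,900 documents are dropped, which is what shrinks the matrix to the 3,048 terms shown.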
# Create label based on class meta data
class_label <- unlist(meta_class)
# use 1/3 of the corpus (633 of 1900 documents) for training; the order was randomized, so we should get a mix of spam and ham
# since the true labels were stored as metadata when the corpus was created, we set virgin = FALSE
mail_container <- create_container(dtm, labels = class_label, trainSize = 1:633, testSize = 634:1900, virgin = FALSE)
Running Statistical Text Processing

We will train Support Vector Machine (SVM), decision tree (TREE), and Maximum Entropy (MAXENT) models.
# Training models
svm_mail_model <- train_model(mail_container, "SVM")
tree_mail_model <- train_model(mail_container, "TREE")
maxent_mail_model <- train_model(mail_container, "MAXENT")
svm_mail_out <- classify_model(mail_container, svm_mail_model)
tree_mail_out <- classify_model(mail_container, tree_mail_model)
maxent_mail_out <- classify_model(mail_container, maxent_mail_model)
Looking at Results
We are now going to look at the results and compare the effectiveness of each model.
head(svm_mail_out)
## SVM_LABEL SVM_PROB
## 1 h 0.9904617
## 2 h 0.9999514
## 3 s 0.9790393
## 4 h 0.9992861
## 5 h 0.9946637
## 6 h 0.8236364
head(tree_mail_out)
## TREE_LABEL TREE_PROB
## 1 h 0.9977273
## 2 h 0.9977273
## 3 s 0.8644068
## 4 h 0.9977273
## 5 h 0.9977273
## 6 h 0.9977273
head(maxent_mail_out)
## SVM_LABEL SVM_PROB
## 1 h 0.9884091
## 2 h 0.9999264
## 3 s 0.9755319
## 4 h 0.9990335
## 5 h 0.9933558
## 6 h 0.8138945
# Build data frame of labels: actual vs. predicted
labels_out <- data.frame(correct_label = class_label[634:1900],
                         svm = as.character(svm_mail_out[, 1]),
                         tree = as.character(tree_mail_out[, 1]),
                         maxent = as.character(maxent_mail_out[, 1]),
                         stringsAsFactors = FALSE)
# SVM Performance
table(labels_out[, 1] == labels_out[, 2])
##
## FALSE TRUE
## 44 1223
prop.table(table(labels_out[, 1] == labels_out[, 2]))
##
## FALSE TRUE
## 0.0347277 0.9652723
# Decision Tree Performance
table(labels_out[, 1] == labels_out[, 3])
##
## FALSE TRUE
## 79 1188
prop.table(table(labels_out[, 1] == labels_out[, 3]))
##
## FALSE TRUE
## 0.06235201 0.93764799
# Maximum Entropy Performance
table(labels_out[, 1] == labels_out[, 4])
##
## FALSE TRUE
## 44 1223
prop.table(table(labels_out[, 1] == labels_out[, 4]))
##
## FALSE TRUE
## 0.0347277 0.9652723
Both the Support Vector Machine and Maximum Entropy models classify roughly 96.5% of the test emails correctly, while the decision tree model is less accurate at about 94%.
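A natural follow-up (not part of the original analysis) would be to look at per-class confusion matrices rather than overall accuracy, since flagging ham as spam is usually more costly than letting spam through. A minimal sketch using the labels_out data frame:

# Sketch: confusion matrices of actual vs. predicted labels for each model
table(actual = labels_out$correct_label, predicted = labels_out$svm)
table(actual = labels_out$correct_label, predicted = labels_out$tree)
table(actual = labels_out$correct_label, predicted = labels_out$maxent)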