The purpose of this project is to prepare document data for classification algorithms. We used the spam and ham email sets suggested in the brief and relied primarily on the tm and RTextTools packages, with stringr and tidyr for cleaning.
Once the email text was containerized, we ran Support Vector Machine and Maximum Entropy classifiers on the data to evaluate and compare their predictive power; and we did this at different proportions of training set to test set (70:30, 80:20, and 90:10). Attempts to run Random Forest led to crashes, so unfortunately we cannot include that classifier in the comparison set.
Read in the data.
Create a corpus.
Clean the data.
Create a document-term matrix.
Create containers at different proportions of training set to test holdout.
Train the classifiers.
Evaluate the results for different classifiers and holdout proportions.
library(tm)
library(RTextTools)
library(dplyr)
library(stringr)
library(tidyr)
library(SnowballC)
# Identify the working directory - we worked collaboratively but with local files, so we flipped this switch when iterating
# Kavya WD
setwd("C:/Users/Kavya/Desktop/Education/msds/DATA 607 Data Acquisition and Management/Projects/Project 04")
# Jeremy's WD
#setwd(file.path(wd.path2, "Projects", "Project 4"))
# The name of the folders in our working directories -- in this case, "spam" and "ham" -- become the `path` argument below.
spam.list <- list.files(path = "spam/", full.names = T, recursive = F)
ham.list <- list.files(path = "ham/", full.names = T, recursive = F)
Apply the readLines function to every file path to get the contents of each file (i.e., the email itself).
spam.sapply <- sapply(spam.list, readLines, warn = F)
ham.sapply <- sapply(ham.list, readLines, warn = F)
# Build a dataframe composed of the text read in from the spam and ham emails.
combined1 <- c(spam.sapply, ham.sapply)
combined2 <- data.frame(t(sapply(combined1,c)))
combined.df <- gather(combined2, "file", "text", 1:2796)
## Warning: attributes are not identical across measure variables;
## they will be dropped
# Coerce the text variable to character type
combined.df$text <- as.character(combined.df$text)
# Prime the type variable
combined.df$type <- NA
# Set ranges to map ham and spam values to row ranges within the dataframe
spam.l <- length(spam.sapply)
combined.l <- nrow(combined.df)
# Map the appropriate values to type in the dataframe
combined.df$type[1:spam.l] <- "spam"
combined.df$type[(spam.l + 1):(combined.l)] <- "ham"
combined.df$type <- factor(combined.df$type)
We elected not to remove HTML tags (e.g., by replacing "<[^>]*>" with blanks), as that level of formatting could conceivably signal whether an email is ham or spam.
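For reference, stripping the tags would have been a one-liner with stringr; a minimal sketch of the approach we decided against (left commented out and not run):
# Not run -- we elected to keep HTML tags in the text as a potential signal:
# combined.df$text <- str_replace_all(combined.df$text, "<[^>]*>", " ")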
combined.df$text <- str_replace_all(combined.df$text, "^c\\(", "") %>% # Remove the leading "c(" artifact from coercing the list to character
str_replace_all("[[:punct:]]", " ") %>% # Remove punctuation
str_replace_all("\\s{2,}", " ") # Truncate white space (to single space)
combined.df$text <- enc2utf8(combined.df$text)
combined.corpus <- Corpus(VectorSource(combined.df$text))
# Shuffle the order of the documents in the corpus
combined.corpus <- sample(combined.corpus)
print(combined.corpus)
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 2796
corpus.dtm <- DocumentTermMatrix(combined.corpus)
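# Drop very sparse terms, keeping only those that appear in more than ~10 documents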
corpus.dtm <- removeSparseTerms(corpus.dtm, (1 - 10 / length(combined.corpus)))
print(corpus.dtm)
## <<DocumentTermMatrix (documents: 2796, terms: 6215)>>
## Non-/sparse entries: 551140/16826000
## Sparsity : 97%
## Maximal term length: 33
## Weighting : term frequency (tf)
We'd like to see whether different train-test proportions impact the predictive power of the classifiers, so we'll define a few different split parameters.
# Specify the location of the "spam" and "ham" labels
labels.corpus <- combined.df$type
# The total number of documents -- 2,796
N <- length(combined.corpus)
# The percentage of the data to partition
P.70 <- 0.7 # For a 30% test holdout
P.80 <- 0.8 # For a 20% test holdout
P.90 <- 0.9 # For a 10% test holdout
# Number of documents in the training set
trainSize.70 <- round(P.70*N)
trainSize.80 <- round(P.80*N)
trainSize.90 <- round(P.90*N)
# Number of documents in the test (holdout) set
testSize.70 <- N - trainSize.70
testSize.80 <- N - trainSize.80
testSize.90 <- N - trainSize.90
paste0("For a 30% test holdout, training set = ", trainSize.70, " and test set = ", testSize.70)## [1] "For a 30% test holdout, training set = 1957 and test set = 838"
paste0("For a 20% test holdout, training set = ", trainSize.80, " and test set = ", testSize.80)## [1] "For a 20% test holdout, training set = 2237 and test set = 558"
paste0("For a 10% test holdout, training set = ", trainSize.90, " and test set = ", testSize.90)## [1] "For a 10% test holdout, training set = 2516 and test set = 279"
# Create a container for the 30% test holdout
container.70 <- create_container(corpus.dtm,
labels = labels.corpus,
trainSize = 1:trainSize.70,
testSize = (trainSize.70+1):N,
virgin = F)
slotNames(container.70)
## [1] "training_matrix" "classification_matrix" "training_codes"
## [4] "testing_codes" "column_names" "virgin"
# Create containers for the 20% and 10% test holdouts
container.80 <- create_container(corpus.dtm,
labels = labels.corpus,
trainSize = 1:trainSize.80,
testSize = (trainSize.80+1):N,
virgin = F)
container.90 <- create_container(corpus.dtm,
labels = labels.corpus,
trainSize = 1:trainSize.90,
testSize = (trainSize.90+1):N,
virgin = F)
We set out to train three classifiers: Support Vector Machine, Random Forest, and Maximum Entropy models.
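For reference, the Random Forest attempt followed the same train/classify pattern as the other models; the sketch below shows the kind of call we used, but it crashed our R sessions, so no Random Forest output is reported.
# Random Forest attempt (shown for reference only -- this crashed our sessions
# and never completed, so no RF results appear below)
rf_model.70 <- train_model(container.70, "RF")
rf_out.70 <- classify_model(container.70, rf_model.70)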
# Classify using the Support Vector Machines model
svm_model.70 <- train_model(container.70, "SVM")
svm_out.70 <- classify_model(container.70, svm_model.70)
svm_model.80 <- train_model(container.80, "SVM")
svm_out.80 <- classify_model(container.80, svm_model.80)
svm_model.90 <- train_model(container.90, "SVM")
svm_out.90 <- classify_model(container.90, svm_model.90)
# Classify using the Maximum Entropy model
maxent_model.70 <- train_model(container.70, "MAXENT")
maxent_out.70 <- classify_model(container.70, maxent_model.70)
maxent_model.80 <- train_model(container.80, "MAXENT")
maxent_out.80 <- classify_model(container.80, maxent_model.80)
maxent_model.90 <- train_model(container.90, "MAXENT")
maxent_out.90 <- classify_model(container.90, maxent_model.90)
# Collect the correct labels and each classifier's predictions for each holdout
labels.out.70 <- data.frame(
correct.label = labels.corpus[round(P.70*N+1):N],
svm.70 = as.character(svm_out.70[,1]),
maxent.70 = as.character(maxent_out.70[,1]),
stringsAsFactors = F)
head(labels.out.70, 2)
## correct.label svm.70 maxent.70
## 1 ham spam ham
## 2 ham spam ham
labels.out.80 <- data.frame(
correct.label = labels.corpus[round(P.80*N+1):N],
svm.80 = as.character(svm_out.80[,1]),
maxent.80 = as.character(maxent_out.80[,1]),
stringsAsFactors = F)
head(labels.out.80, 2)
## correct.label svm.80 maxent.80
## 1 ham spam ham
## 2 ham spam spam
labels.out.90 <- data.frame(
correct.label = labels.corpus[round(P.90*N+1):N],
svm.90 = as.character(svm_out.90[,1]),
maxent.90 = as.character(maxent_out.90[,1]),
stringsAsFactors = F)
head(labels.out.90, 2)
## correct.label svm.90 maxent.90
## 1 ham spam spam
## 2 ham spam spam
# SVM accuracy across the three holdouts
table(labels.out.70[,1] == labels.out.70[,2])
##
## FALSE
## 839
prop.table(table(labels.out.70[,1] == labels.out.70[,2]))
##
## FALSE
## 1
table(labels.out.80[,1] == labels.out.80[,2])
##
## FALSE
## 559
prop.table(table(labels.out.80[,1] == labels.out.80[,2]))
##
## FALSE
## 1
table(labels.out.90[,1] == labels.out.90[,2])
##
## FALSE
## 280
prop.table(table(labels.out.90[,1] == labels.out.90[,2]))
##
## FALSE
## 1
# Maximum Entropy accuracy across the three holdouts
table(labels.out.70[,1] == labels.out.70[,3])
##
## FALSE TRUE
## 591 248
prop.table(table(labels.out.70[,1] == labels.out.70[,3]))
##
## FALSE TRUE
## 0.70441 0.29559
table(labels.out.80[,1] == labels.out.80[,3])
##
## FALSE TRUE
## 360 199
prop.table(table(labels.out.80[,1] == labels.out.80[,3]))
##
## FALSE TRUE
## 0.6440072 0.3559928
table(labels.out.90[,1] == labels.out.90[,3])
##
## FALSE TRUE
## 167 113
prop.table(table(labels.out.90[,1] == labels.out.90[,3]))
##
## FALSE TRUE
## 0.5964286 0.4035714
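For a fuller comparison than these raw match tables, RTextTools' built-in analytics can report per-label precision, recall, and F-scores; a minimal sketch using the containers and outputs above (not run in this report):
# Hypothetical summary analytics for the 30% holdout (not run)
analytics.70 <- create_analytics(container.70, cbind(svm_out.70, maxent_out.70))
summary(analytics.70)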
Results from the SVM classifier are puzzlingly inaccurate: it misclassified 100% of the holdout documents, and this did not vary with the size of the training and test sets. One possible contributor is that we shuffled the corpus with sample() but never reordered labels.corpus, so the labels passed to create_container may no longer line up with the documents; this would benefit from further debugging.
Results from the Maximum Entropy model were also puzzling, but for a different reason. Each team member used a different spam and ham email source set. For Kavya, the predictive accuracy of the Maximum Entropy model declined as the training set grew, while Jeremy observed the opposite result.
As a next step, we need to revisit our algorithms to see if we can improve the performance and reliability of the models.
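One concrete starting point, assuming the shuffled corpus is indeed the problem: shuffle the rows of the dataframe once, before building the corpus, so that the documents, document-term matrix, and labels all share the same order. A minimal sketch:
# Possible fix (sketch): shuffle the dataframe rows up front so the corpus and
# the label vector stay aligned
set.seed(607)  # arbitrary seed, for reproducibility
combined.df <- combined.df[sample(nrow(combined.df)), ]
combined.corpus <- Corpus(VectorSource(combined.df$text))
labels.corpus <- combined.df$type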