Background

A. Purpose

The purpose of this project is to prepare document data for classification algorithms. We used the spam and ham email sets suggested in the brief, and primarily applied the tm and RTextTools packages along with a little bit of tidyr for cleaning.
Once the email text was containerized, we ran Support Vector Machine and Maximum Entropy classifiers on the data to evaluate and compare their predictive power; and we did this at different proportions of training set to test set (70:30, 80:20, and 90:10). Attempts to run Random Forest led to crashes, so unfortunately we cannot include that classifier in the comparison set.


B. Team


C. Approach

  1. Read in the data.

  2. Create a corpus.

  3. Clean the data.

  4. Create a document-term matrix.

  5. Create containers at different proportions of training set to test holdout.

  6. Train the classifiers.

  7. Evaluate the results for different classifiers and holdout proportions.


D. Libraries and Setup

library(tm)
library(RTextTools)
library(dplyr)
library(stringr)
library(tidyr)
library(SnowballC)



1. Read in the data.

A. Set the working directory to be the location of the extracted ham and spam folders.

# Identify the working directory - we worked collaboratively but with local files, so we flipped this switch when iterating

# Kavya WD
setwd <- "C:/Users/Kavya/Desktop/Education/msds/DATA 607 Data Acquisition and Management/Projects/Project 04"

# Jeremy's WD
#setwd(file.path(wd.path2, "Projects", "Project 4"))


B. Create separate lists of spam and ham filepaths.

# The name of the folders in our working directories -- in this case, "spam" and "ham" -- become the `path` argument below.

ham.list <- list.files(path = "spam/", full.names = T, recursive = F)

spam.list <- list.files(path = "ham/", full.names = T, recursive = F)


C. Apply the readLines function to every filepath to get the contents of the file (i.e., the email itself).

ham.sapply <- sapply(ham.list, readLines, warn = F)

spam.sapply <- sapply(spam.list, readLines, warn = F)



2. Create a corpus

A. Combine the two lists of files and create a dataframe.

# Build a dataframe composed of the text read in from the ham and spam emails.

combined1 <- c(ham.sapply, spam.sapply)

combined2 <- data.frame(t(sapply(combined1,c)))

combined.df <- gather(combined2, "file", "text", 1:2796)
## Warning: attributes are not identical across measure variables;
## they will be dropped
# Coerce the text variable to character type 

combined.df$text <- as.character(combined.df$text)


B. Add a column for “type” and label the documents as spam or ham.

# Prime the type variable

combined.df$type <- NA

# Set ranges to map ham and spam values to row ranges within the dataframe

ham.l <- length(ham.sapply)

combined.l <- nrow(combined.df)

# Map the appropriate values to type in the dataframe

combined.df$type[1:ham.l] <- "spam"

combined.df$type[(ham.l + 1):(combined.l)] <- "ham"

combined.df$type <- factor(combined.df$type)


C. Perform some basic cleaning.

We elected not to remove HTML tags using regex to turn “<[^>]*>“” into blanks, as that level of formatting could conceivably signal whether an email is ham or spam.

combined.df$text <-  str_replace_all(combined.df$text, "^c\\(", "") %>% # Remove 
                     str_replace_all("[[:punct:]]", " ") %>% # Remove punctuation
                     str_replace_all("\\s{2,}", " ") # Truncate white space (to single space)

combined.df$text <- enc2utf8(combined.df$text)


D. Create a corpus

combined.corpus <- Corpus(VectorSource(combined.df$text))

combined.corpus <- sample(combined.corpus)

print(combined.corpus)
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 2796


3. Create a document-term matrix.

corpus.dtm <- DocumentTermMatrix(combined.corpus)

corpus.dtm <- removeSparseTerms(corpus.dtm, (1 - 10 / length(combined.corpus)))

print(corpus.dtm)
## <<DocumentTermMatrix (documents: 2796, terms: 6215)>>
## Non-/sparse entries: 551140/16826000
## Sparsity           : 97%
## Maximal term length: 33
## Weighting          : term frequency (tf)


4. Create a container.

A. Set the parameters.

We’d like to see if different train-test proportion impact predictive power of the classifiers, so we’ll create a few different parameters.

# Specify the location of the "spam" and "ham" labels

labels.corpus <- combined.df$type

# The total number of documents -- 2,796

N <- length(combined.corpus)

# The percentage of the data to partition
P.70 <- 0.7 # For a 30% test holdout
P.80 <- 0.8 # For a 20% test holdout
P.90 <- 0.9 # For a 10% test holdout

# Number of documents in the training set
trainSize.70 <- round(P.70*N)
trainSize.80 <- round(P.80*N)
trainSize.90 <- round(P.90*N)

# Number of documents in the test (holdout) set
testSize.70 <- N - round(P.70*N+1)
testSize.80 <- N - round(P.80*N+1)
testSize.90 <- N - round(P.90*N+1)

paste0("For a 30% test holdout, training set = ", trainSize.70, " and test set = ", testSize.70)
## [1] "For a 30% test holdout, training set = 1957 and test set = 838"
paste0("For a 20% test holdout, training set = ", trainSize.80, " and test set = ", testSize.80)
## [1] "For a 20% test holdout, training set = 2237 and test set = 558"
paste0("For a 10% test holdout, training set = ", trainSize.90, " and test set = ", testSize.90)
## [1] "For a 10% test holdout, training set = 2516 and test set = 279"


B. Create the containers – one each for each of the test holdout levels.

# Create a container for the 30% test holdout
container.70 <- create_container(corpus.dtm, 
                              labels = labels.corpus, 
                              trainSize = 1:trainSize.70, 
                              testSize = (trainSize.70+1):N,
                              virgin = F)

slotNames(container.70)
## [1] "training_matrix"       "classification_matrix" "training_codes"       
## [4] "testing_codes"         "column_names"          "virgin"
# Create containers for the 20% and 10% test holdouts
container.80 <- create_container(corpus.dtm, 
                              labels = labels.corpus, 
                              trainSize = 1:trainSize.80, 
                              testSize = (trainSize.80+1):N,
                              virgin = F)

container.90 <- create_container(corpus.dtm, 
                              labels = labels.corpus, 
                              trainSize = 1:trainSize.90, 
                              testSize = (trainSize.90+1):N,
                              virgin = F)


5. Train the data.

We started with three classifiers to train the data: Support Vector Machine Models, Random Forest Models, and Maximum Entropy Models.

  • The SVM model provided results that seemed unreasonable; we need more time to refine the approach.
  • The Random Forest model crashed every machine we tried, so we discontinued that approach.
# Classify using the Support Vector Machines model
svm_model.70 <- train_model(container.70, "SVM")
svm_out.70 <- classify_model(container.70, svm_model.70)

svm_model.80 <- train_model(container.80, "SVM")
svm_out.80 <- classify_model(container.80, svm_model.80)

svm_model.90 <- train_model(container.90, "SVM")
svm_out.90 <- classify_model(container.90, svm_model.90)

# Classify using the Maximum Entropy model
maxent_model.70 <- train_model(container.70, "MAXENT")
maxent_out.70 <- classify_model(container.70, maxent_model.70)

maxent_model.80 <- train_model(container.80, "MAXENT")
maxent_out.80 <- classify_model(container.80, maxent_model.80)

maxent_model.90 <- train_model(container.90, "MAXENT")
maxent_out.90 <- classify_model(container.90, maxent_model.90)


6. Train the classifiers and evaluate the results.

A. Create a new dataframe with the true and expected classifications.

labels.out.70 <- data.frame( 
  correct.label = labels.corpus[round(P.70*N+1):N], 
  svm.70 = as.character(svm_out.70[,1]), 
  maxent.70 = as.character(maxent_out.70[,1]), 
  stringsAsFactors = F)

head(labels.out.70, 2)
##   correct.label svm.70 maxent.70
## 1           ham   spam       ham
## 2           ham   spam       ham
labels.out.80 <- data.frame( 
  correct.label = labels.corpus[round(P.80*N+1):N], 
  svm.80 = as.character(svm_out.80[,1]), 
  maxent.80 = as.character(maxent_out.80[,1]), 
  stringsAsFactors = F)

head(labels.out.80, 2)
##   correct.label svm.80 maxent.80
## 1           ham   spam       ham
## 2           ham   spam      spam
labels.out.90 <- data.frame( 
  correct.label = labels.corpus[round(P.90*N+1):N], 
  svm.90 = as.character(svm_out.90[,1]), 
  maxent.90 = as.character(maxent_out.90[,1]), 
  stringsAsFactors = F)

head(labels.out.90, 2)
##   correct.label svm.90 maxent.90
## 1           ham   spam      spam
## 2           ham   spam      spam


B. Performance of Support Vector Machine Model at different test holdout levels.

table(labels.out.70[,1] == labels.out.70[,2]) 
## 
## FALSE 
##   839
prop.table(table(labels.out.70[,1] == labels.out.70[,2])) 
## 
## FALSE 
##     1
table(labels.out.80[,1] == labels.out.80[,2]) 
## 
## FALSE 
##   559
prop.table(table(labels.out.80[,1] == labels.out.80[,2])) 
## 
## FALSE 
##     1
table(labels.out.90[,1] == labels.out.90[,2]) 
## 
## FALSE 
##   280
prop.table(table(labels.out.90[,1] == labels.out.90[,2])) 
## 
## FALSE 
##     1


C. Performance of Maximum Entropy Model at different test holdout levels.

table(labels.out.70[,1] == labels.out.70[,3])
## 
## FALSE  TRUE 
##   591   248
prop.table(table(labels.out.70[,1] == labels.out.70[,3])) 
## 
##   FALSE    TRUE 
## 0.70441 0.29559
table(labels.out.80[,1] == labels.out.80[,3])
## 
## FALSE  TRUE 
##   360   199
prop.table(table(labels.out.80[,1] == labels.out.80[,3])) 
## 
##     FALSE      TRUE 
## 0.6440072 0.3559928
table(labels.out.90[,1] == labels.out.90[,3])
## 
## FALSE  TRUE 
##   167   113
prop.table(table(labels.out.90[,1] == labels.out.90[,3])) 
## 
##     FALSE      TRUE 
## 0.5964286 0.4035714



7. Conclusions

Results from the SVM classifier are puzzlingly inaccurate – 100%. This does not vary based on the size of the training and test sets. We feel this would benefit from further debugging.

Results from the Maximum Entropy Model were also puzzling, but for a different reason. Each team member used different spam and ham email source sets. For Kavya, the predictive accuracy of the Maximum Entropy model declined when the training set grew in size; while Jeremy observed the opposite result.

As a next step, we need to revisit our algorithms to see if we can improve the performance and reliability of the models.