Summary
This project involves predicting whether an email is SPAM or HAM using the tm and RTextTools packages.
The corpora were downloaded from https://spamassassin.apache.org/publiccorpus/. The files 20030228_easy_ham.tar.bz2 and 20050311_spam_2.tar.bz2 are used for this analysis.
Setup
The following packages were used in this analysis: stringr, dplyr, tm, RTextTools, wordcloud, DT, ROCR, ggplot2, SDMTools, and pROC.
library(stringr)
library(dplyr)
library(tm)
library(RTextTools)
library(wordcloud)
library(DT)
library(ROCR)
library(ggplot2)
library(SDMTools)
library(pROC)
# Set the working directory to the folder containing the extracted corpora.
setwd("~/Documents/CUNYMSDA/Data 607/CUNYDATA607/Project 4")
Building the Corpus
VCorpus was used to build the junk and good emails into separate corpora. A "spam" meta tag was then added to each email and set to 1 for junk or 0 for good so the two classes can be told apart once the corpora are combined.
There are 1397 files in the spam folder and 2501 in the ham folder.
projectjunk <- VCorpus(DirSource("spam_2"))
projectgood <- VCorpus(DirSource("easy_ham"))
## update headers into 1 and 0 to differentiate for processing
meta(projectjunk, "spam") <- 1
meta(projectgood, "spam") <- 0
projectall <- c(projectjunk, projectgood)
Cleanup and Transformation of the Corpus
The following steps convert everything to lowercase, remove numbers, remove English stop words (built into tm), strip punctuation and other non-alphanumeric characters, and collapse extra whitespace. These steps were adapted directly from examples in Data Collection with R.
projectall <- tm_map(projectall, content_transformer(function(x) iconv(x, to = 'UTF-8-MAC', sub = 'byte')))
projectall <- tm_map(projectall, content_transformer(tolower))
projectall <- tm_map(projectall, removeNumbers)
projectall <- tm_map(projectall, removeWords, words = stopwords("en"))
projectall <- tm_map(projectall, content_transformer(function(x) str_replace_all(x, "[[:punct:]]|<|>", " ")))
projectall <- tm_map(projectall, stripWhitespace)
Sampling and Seeds
The following section generates the training and test sets from the same corpus rather than from separate files.
The document-term matrix had 100% sparsity, so to keep infrequent terms from skewing the results, sparse terms were removed at a 0.95 threshold. This reduces the sparsity to 82%. The most common remaining terms are shown below.
RTextTools is then used to determine how well each model predicts.
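The code for building the matrix, splitting the data, and creating the RTextTools container is not shown in the rendered output; the sketch below is one way it could look. The 0.95 sparsity cutoff comes from the text above, while the seed, the split fraction, and the object names dtm, labels, train_idx, test_idx, and container are illustrative assumptions.
# Build the document-term matrix from the cleaned corpus and drop terms
# absent from more than 95% of documents (the 0.95 sparsity threshold).
dtm <- DocumentTermMatrix(projectall)
dtm <- removeSparseTerms(dtm, 0.95)
# Recover the 0/1 spam labels stored in the corpus metadata.
labels <- unlist(meta(projectall, "spam"))
# Split the combined corpus into training and test indices
# (seed and split fraction are placeholders, not the values actually used).
set.seed(123)
n <- length(labels)
train_idx <- sample(seq_len(n), size = round(0.7 * n))
test_idx  <- setdiff(seq_len(n), train_idx)
# RTextTools container holding both subsets for model training and scoring.
container <- create_container(dtm, labels,
                              trainSize = train_idx,
                              testSize = test_idx,
                              virgin = FALSE)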
dataset | spam | Freq |
---|---|---|
training | 0 | 0.6553333 |
training | 1 | 0.3446667 |
test | 0 | 0.6330275 |
test | 1 | 0.3669725 |
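The subset proportions above and the common-terms list below could be produced along these lines (a sketch reusing the assumed labels, train_idx, test_idx, and dtm objects from the previous block; the frequency cutoff of 200 is arbitrary):
# Share of ham (0) and spam (1) emails in each subset.
prop.table(table(labels[train_idx]))
prop.table(table(labels[test_idx]))
# Terms that occur frequently across the reduced document-term matrix.
print("List of Common Terms")
findFreqTerms(dtm, lowfreq = 200)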
## [1] "List of Common Terms"
## [1] "com" "esmtp" "font" "fork"
## [5] "http" "list" "localhost" "nbsp"
## [9] "net" "org" "received" "sep"
## [13] "spamassassin" "taint" "xent"
Analysis
In looking at the confusion matrices below, the misclassification counts (Type I plus Type II errors) are:
1. SVM - 40 / 2398 (1.7%)
2. Tree - 41 / 2398 (1.7%)
3. Maxent - 18 / 2398 (0.8%)
This means that the Maxent model produces the fewest Type I and Type II errors.
Since maxent was the best predictor, its ROC curve below is nearly a right angle; the SVM and Tree curves have less pronounced vertices.
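The model training and classification code is not shown in the rendered output; a minimal sketch using the assumed container from the sampling step is below. The algorithm names are standard RTextTools options, and svm_out, tree_out, maxent_out, and results match the objects used in the confusion matrices that follow.
# Train one model per algorithm on the training portion of the container.
svm_model    <- train_model(container, "SVM")
tree_model   <- train_model(container, "TREE")
maxent_model <- train_model(container, "MAXENT")
# Score the held-out test documents; column 1 of each result holds the
# predicted label and column 2 the associated probability.
svm_out    <- classify_model(container, svm_model)
tree_out   <- classify_model(container, tree_model)
maxent_out <- classify_model(container, maxent_model)
# True labels for the test documents, used as the observed values below.
results <- data.frame(spam = labels[test_idx])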
confusion.matrix(results$spam, maxent_out[,1])
## Warning in Ops.factor(pred, threshold): '>=' not meaningful for factors
## Warning in Ops.factor(pred, threshold): '<' not meaningful for factors
## obs
## pred 0 1
## 0 1507 7
## 1 11 873
## attr(,"class")
## [1] "confusion.matrix"
confusion.matrix(results$spam, svm_out[,1])
## Warning in Ops.factor(pred, threshold): '>=' not meaningful for factors
## Warning in Ops.factor(pred, threshold): '<' not meaningful for factors
## obs
## pred 0 1
## 0 1488 10
## 1 30 870
## attr(,"class")
## [1] "confusion.matrix"
confusion.matrix(results$spam, tree_out[,1])
## Warning in Ops.factor(pred, threshold): '>=' not meaningful for factors
## Warning in Ops.factor(pred, threshold): '<' not meaningful for factors
## obs
## pred 0 1
## 0 1501 24
## 1 17 856
## attr(,"class")
## [1] "confusion.matrix"
par(mfrow=c(1,3))
plot(roc(results$spam, as.numeric(as.character(maxent_out[,1]))), main="Maxent")
plot(roc(results$spam, as.numeric(as.character(svm_out[,1]))), main="SVM")
plot(roc(results$spam, as.numeric(as.character(tree_out[,1]))), main="Tree")
Conclusion
The analysis posed some challenges in determining how much data to use for training and testing. A conventional 30/70 test/training split was used to determine the data subsets for this project. In addition, the high level of accuracy of the maxent model is interesting and was not expected. In reviewing the weights of keywords, it wasn't clear why it would rank some months such as 'sep' and 'aug' as spam and lump those in with sexual terms and specific companies like yahoo.