Summary
This project involves predicting whether an email is SPAM or HAM using the tm and RTextTools packages.
The corpora were downloaded from https://spamassassin.apache.org/publiccorpus/. The files 20030228_easy_ham.tar.bz2 and 20050311_spam_2.tar.bz2 are used for this analysis.
Setup
The following packages were used in this analysis: stringr, dplyr, tm, RTextTools, wordcloud, DT, ROCR, ggplot2, SDMTools, and pROC.
library(stringr)
library(dplyr)
library(tm)
library(RTextTools)
library(wordcloud)
library(DT)
library(ROCR)
library(ggplot2)
library(SDMTools)
library(pROC)
# Set the working directory to the folder containing the extracted corpora.
setwd("~/Documents/CUNYMSDA/Data 607/CUNYDATA607/Project 4")
Building the Corpus
VCorpus was used to build the junk and good emails into separate corpora. A "spam" meta tag was then added to each email and set to 1 for junk or 0 for good so the two classes can be told apart once the corpora are combined.
There are 1397 files in the spam folder and 2501 in the ham folder.
projectjunk <- VCorpus(DirSource("spam_2"))
projectgood <- VCorpus(DirSource("easy_ham"))
## update headers into 1 and 0 to differentiate for processing
meta(projectjunk, "spam") <- 1
meta(projectgood, "spam") <- 0
projectall <- c(projectjunk, projectgood)
Cleanup and Transformation of the Corpus
The following steps convert everything to lowercase, remove numbers, remove English stop words (built into tm), strip punctuation and other non-alphanumeric characters, and collapse extra whitespace. These steps were adapted directly from examples in Data Collection with R.
projectall <- tm_map(projectall, content_transformer(function(x) iconv(x, to = 'UTF-8-MAC', sub = 'byte')))
projectall <- tm_map(projectall, content_transformer(tolower))
projectall <- tm_map(projectall, removeNumbers)
projectall <- tm_map(projectall, removeWords, words = stopwords("en"))
projectall <- tm_map(projectall, content_transformer(function(x) str_replace_all(x, "[[:punct:]]|<|>", " ")))
projectall <- tm_map(projectall, stripWhitespace)
Sampling and Seeds
The following section generates the training and test sets from the same corpus rather than from separate files.
The document-term matrix had 100% sparsity, so to keep infrequent terms from skewing the results, sparse terms were removed at a 0.95 threshold. This reduces the sparsity to 82%. The most common remaining terms are shown below.
RTextTools is then used to determine how well each model predicts.
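The code for building the matrix, splitting the data, and creating the RTextTools container is not shown in the rendered output; the sketch below is one way it could look. The 0.95 sparsity cutoff comes from the text above, while the seed, the split fraction, and the object names dtm, labels, train_idx, test_idx, and container are illustrative assumptions.
# Build the document-term matrix from the cleaned corpus and drop terms
# absent from more than 95% of documents (the 0.95 sparsity threshold).
dtm <- DocumentTermMatrix(projectall)
dtm <- removeSparseTerms(dtm, 0.95)
# Recover the 0/1 spam labels stored in the corpus metadata.
labels <- unlist(meta(projectall, "spam"))
# Split the combined corpus into training and test indices
# (seed and split fraction are placeholders, not the values actually used).
set.seed(123)
n <- length(labels)
train_idx <- sample(seq_len(n), size = round(0.7 * n))
test_idx  <- setdiff(seq_len(n), train_idx)
# RTextTools container holding both subsets for model training and scoring.
container <- create_container(dtm, labels,
                              trainSize = train_idx,
                              testSize = test_idx,
                              virgin = FALSE)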
dataset | spam | Freq |
---|---|---|
training | 0 | 0.6553333 |
training | 1 | 0.3446667 |
test | 0 | 0.6330275 |
test | 1 | 0.3669725 |
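The subset proportions above and the common-terms list below could be produced along these lines (a sketch reusing the assumed labels, train_idx, test_idx, and dtm objects from the previous block; the frequency cutoff of 200 is arbitrary):
# Share of ham (0) and spam (1) emails in each subset.
prop.table(table(labels[train_idx]))
prop.table(table(labels[test_idx]))
# Terms that occur frequently across the reduced document-term matrix.
print("List of Common Terms")
findFreqTerms(dtm, lowfreq = 200)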
## [1] "List of Common Terms"
## [1] "com" "esmtp" "font" "fork"
## [5] "http" "list" "localhost" "nbsp"
## [9] "net" "org" "received" "sep"
## [13] "spamassassin" "taint" "xent"
Analysis
In looking at the confusion matrices below, the misclassification counts (Type I plus Type II errors) are:
1. SVM - 40 / 2398 (1.7%)
2. Tree - 41 / 2398 (1.7%)
3. Maxent - 18 / 2398 (0.8%)
This means that the Maxent model produces the fewest Type I and Type II errors.
Since maxent was the best predictor, its ROC curve below is nearly a right angle; the SVM and Tree curves have less pronounced vertices.
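The model training and classification code is not shown in the rendered output; a minimal sketch using the assumed container from the sampling step is below. The algorithm names are standard RTextTools options, and svm_out, tree_out, maxent_out, and results match the objects used in the confusion matrices that follow.
# Train one model per algorithm on the training portion of the container.
svm_model    <- train_model(container, "SVM")
tree_model   <- train_model(container, "TREE")
maxent_model <- train_model(container, "MAXENT")
# Score the held-out test documents; column 1 of each result holds the
# predicted label and column 2 the associated probability.
svm_out    <- classify_model(container, svm_model)
tree_out   <- classify_model(container, tree_model)
maxent_out <- classify_model(container, maxent_model)
# True labels for the test documents, used as the observed values below.
results <- data.frame(spam = labels[test_idx])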
confusion.matrix(results$spam, maxent_out[,1])
## Warning in Ops.factor(pred, threshold): '>=' not meaningful for factors
## Warning in Ops.factor(pred, threshold): '<' not meaningful for factors
## obs
## pred 0 1
## 0 1507 7
## 1 11 873
## attr(,"class")
## [1] "confusion.matrix"
confusion.matrix(results$spam, svm_out[,1])
## Warning in Ops.factor(pred, threshold): '>=' not meaningful for factors
## Warning in Ops.factor(pred, threshold): '<' not meaningful for factors
## obs
## pred 0 1
## 0 1488 10
## 1 30 870
## attr(,"class")
## [1] "confusion.matrix"
confusion.matrix(results$spam, tree_out[,1])
## Warning in Ops.factor(pred, threshold): '>=' not meaningful for factors
## Warning in Ops.factor(pred, threshold): '<' not meaningful for factors
## obs
## pred 0 1
## 0 1501 24
## 1 17 856
## attr(,"class")
## [1] "confusion.matrix"
par(mfrow=c(1,3))
plot(roc(results$spam, as.numeric(as.character(maxent_out[,1]))), main="Maxent")
plot(roc(results$spam, as.numeric(as.character(svm_out[,1]))), main="SVM")
plot(roc(results$spam, as.numeric(as.character(tree_out[,1]))), main="Tree")
Conclusion
The analysis posed some challenges in determining how much data to use for training and testing. A conventional 30/70 test/training split was used to determine the data subsets for this project. In addition, the high level of accuracy of the maxent model is interesting and was not expected. In reviewing the weights of keywords, it wasn't clear why it would rank some months such as 'sep' and 'aug' as spam and lump those in with sexual terms and specific companies like yahoo.