CUNY MSDS 607: Canned Spam
Project Overview
A common use for NLP is to help email service providers distinguish spam, the unwanted or inappropriate messages sent in bulk to many people, from email that is pertinent and addressed to the user.
In this exercise we will examine a sample of known spam emails, compare them to known non-spam emails (which will be referred to as “ham”), and train and test models to find the best method to make this comparison.
Guiding Question
There are a number of classifier models that can be used to identify spam from ham. Which one is the best to use in production?
Prepare the data
1. Load the data
# Bentley Source path
easy_ham_2 <- VCorpus(DirSource("easy_ham_2"))
spam_2 <- VCorpus(DirSource("spam_2"))
# Niculescu Source Path
#setwd("~/Documents/R/CUNYMS/DATA607/PROJECTS/")
#easy_ham_2 <- VCorpus(DirSource("../DATA/easy_ham"))
#spam_2 <- VCorpus(DirSource("../DATA/spam"))

2. Tag the data
We add metadata to the two data sets so that we can later identify which emails are spam and which are ham.
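That chunk is not shown in this report; a minimal sketch, assuming per-document “type” tags and a combined corpus named spham (the object the cleaning steps below operate on), might look like this:

# sketch: tag each document, then combine the two corpora into one spham corpus
for (i in seq_along(easy_ham_2)) meta(easy_ham_2[[i]], "type") <- "ham"
for (i in seq_along(spam_2)) meta(spam_2[[i]], "type") <- "spam"
spham <- c(easy_ham_2, spam_2)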
3. Preliminary visualization
Before we perform transformations on the data, we wanted to see what the spam and ham looked like in their “natural” states.
dataframe <- data.frame(text=unlist(sapply(spam_2, `[`)), stringsAsFactors=F)
spamcorpus <- Corpus(VectorSource(spam_2))
spamcorpus <- tm_map(spamcorpus, removePunctuation)

## Warning in tm_map.SimpleCorpus(spamcorpus, removePunctuation):
## transformation drops documents
dtm <- TermDocumentMatrix(spamcorpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
e <- data.frame(word = names(v),freq=v)
head(e, 100)

## word freq
## the the 14416
## and and 10183
## from from 9789
## you you 9173
## 2002 2002 8706
## with with 8165
## for for 7879
## your your 6478
## received received 6059
## character0 character0 5592
## this this 4683
## jul jul 4382
## font font 3384
## mon mon 3360
## esmtp esmtp 3139
## table table 2973
## are are 2880
## email email 2845
## that that 2739
## tby tby 2577
## will will 2557
## our our 2516
## helvetica helvetica 2496
## 0100 0100 2438
## may may 2416
## tfor tfor 2406
## have have 2311
## not not 2022
## meta meta 2007
## localhost localhost 1898
## subject subject 1805
## can can 1771
## free free 1767
## all all 1746
## year year 1736
## face3darial face3darial 1707
## div div 1668
## sansserif sansserif 1602
## contenttype contenttype 1599
## 81168116 81168116 1515
## tue tue 1508
## spam spam 1501
## date date 1496
## facearial facearial 1480
## 0400 0400 1468
## hour hour 1464
## jun jun 1452
## description description 1422
## 118 118 1415
## language language 1412
## min min 1407
## heading heading 1405
## arial arial 1404
## 310 310 1403
## messageid messageid 1400
## datetimestamp datetimestamp 1398
## isdst isdst 1398
## mday mday 1398
## origin origin 1398
## wday wday 1398
## yday yday 1398
## listcontent listcontent 1385
## wed wed 1382
## listauthor listauthor 1376
## html html 1370
## listsec listsec 1345
## thu thu 1336
## more more 1300
## aug aug 1288
## smtp smtp 1285
## get get 1245
## body body 1230
## new new 1202
## business business 1200
## dogmaslashnullorg dogmaslashnullorg 1190
## faceverdana faceverdana 1190
## returnpath returnpath 1185
## please please 1179
## cfrom cfrom 1176
## out out 1174
## list list 1157
## mimeversion mimeversion 1149
## contenttransferencoding contenttransferencoding 1130
## 0500 0500 1124
## one one 1103
## jmlocalhost jmlocalhost 1083
## 127001 127001 1067
## mandarklabsnetnoteinccom mandarklabsnetnoteinccom 1040
## money money 1018
## border3d0 border3d0 978
## postfix postfix 966
## only only 961
## information information 937
## border0 border0 929
## tr20 tr20 924
## fri fri 900
## internet internet 895
## size2 size2 893
## here here 875
## option option 875
dataframe <- data.frame(text=unlist(sapply(easy_ham_2, `[`)), stringsAsFactors=F)
mycorpus <- Corpus(VectorSource(easy_ham_2))
mycorpus <- tm_map(mycorpus, removePunctuation)

## Warning in tm_map.SimpleCorpus(mycorpus, removePunctuation): transformation
## drops documents
dtm <- TermDocumentMatrix(mycorpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)

#plot in a wordcloud
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))

wordcloud(words = e$word, freq = e$freq, min.freq = 1,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Accent"))

4. Data cleaning
To normalize the data we are going to apply a number of transformations:
All lowercase: this will prevent the model from considering “This” and “this” as two different words.
Remove stopwords: this will prevent very common words (e.g., “i”, “we”, “my”, “me”) from being part of the model.
Consolidate stems: this will allow the model to treat words that essentially mean the same thing (e.g., “sparse”, “sparsity” and “sparseness”) as the same word.
Remove punctuation: will prevent the model from considering words with punctuation attached as different words (e.g., “this, and that” versus “this and that”).
Remove special characters: similar to the above, but with characters such as “+” or “~”. We had originally removed dollar signs as well, but a dollar sign seemed like it might be a good spam indicator, so we kept it.
Strip whitespace: eliminates spaces, tabs, or other “empty” space from being included in the analysis.
We originally also excluded strings with over 30 characters but that resulted in too few documents for the analysis.
# initially these steps were piped together, but on some machines there were memory issues, so we broke them out individually.
# transform all to lowercase
spham <- tm_map(spham, content_transformer(tolower))
# consolidate stems
spham <- tm_map(spham, stemDocument)
# remove punctuation
spham <- tm_map(spham, removePunctuation)
# remove other punctuation
spham <- tm_map(spham, content_transformer(str_replace_all), pattern = "[[:punct:]]", replacement = "")
# remove special characters such as "+", "~", "|", "?", and backticks
spham <- tm_map(spham, content_transformer(str_replace_all), pattern = "[+~|?`]",
replacement = "")
# remove whitespace
spham <- tm_map(spham, stripWhitespace)
# strip stopwords
spham <- tm_map(spham, removeWords, stopwords("english"))

5. Randomize
Because the container below will split the data by position (the first two-thirds become the training set), we randomize the document order so that both ham and spam appear in the training and test sets.
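The shuffling chunk is not shown; a minimal sketch (the seed value is our assumption, added for reproducibility):

# sketch: shuffle the combined corpus so train/test slices contain both classes
set.seed(42)
spham <- spham[sample(length(spham))]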
6. Create Document Term Matrix
We now convert the data into a document term matrix, the format needed to do further analysis.
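A minimal sketch of this step, assuming the standard tm constructor:

# sketch: build the document term matrix from the cleaned, shuffled corpus
dtm <- DocumentTermMatrix(spham)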
7. Reduce sparsity
Sparsity is the proportion of entries in the document term matrix that are zero, i.e., the extent to which terms do not appear in documents.
First we look at the initial sparsity so that we have a baseline to compare against.
## <<DocumentTermMatrix (documents: 2799, terms: 129129)>>
## Non-/sparse entries: 620022/360812049
## Sparsity : 100%
## Maximal term length: 880
## Weighting : term frequency (tf)
We then reduce sparsity, keeping only terms that appear in at least 10 documents.
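The reduction chunk is not shown; a sketch using tm’s removeSparseTerms(), where the 10-document threshold is expressed as a maximum allowed sparsity:

# sketch: keep only terms appearing in at least 10 of the documents
min_docs <- 10
dtm <- removeSparseTerms(dtm, sparse = 1 - min_docs / nDocs(dtm))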
## <<DocumentTermMatrix (documents: 2799, terms: 5795)>>
## Non-/sparse entries: 429093/15791112
## Sparsity : 97%
## Maximal term length: 71
## Weighting : term frequency (tf)
Reducing sparsity this way left us with 5,795 terms which, while only about 4% of the terms pre-reduction, should still be enough to create a model.
Mine the corpus
1. Create containers
First collect the meta labels assigned in step 2.
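A minimal sketch of that collection step, assuming the per-document “type” tags from step 2:

# sketch: gather the type tags in (randomized) corpus order
meta_type <- unlist(lapply(spham, meta, "type"))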
Now we create a “container” that separates the data into training and test sets.
We will use 2/3 of the available data to train the models and then test the model on the remaining 1/3.
# calculating lengths as variables for the container formula. Setting these up as variables will make it easier to change the numbers later if we choose to try different test / train proportions.
n <- length(meta_type)
n_train <- round(n*.67,0)
n_test <- n-n_train
container <- create_container(dtm,
labels = meta_type,
trainSize = 1:n_train,
testSize = (n_train+1):n,
virgin = FALSE)

2. Train and test SVM model
We will first use the SVM model as a test to make sure we have the coding correct, and will then apply the same pattern to the other models.
We’re also going to add a timing function so that we can consider computational efficiency for the other models.
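The training chunk itself is not shown; a sketch mirroring the pattern used for the TREE, MAXENT, and BOOSTING models below (timing calls omitted):

# sketch: train and execute the SVM model
svm_model <- train_model(container, "SVM")
svm_classy <- classify_model(container, svm_model)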
3. Create results DF
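This chunk is also not shown; a sketch mirroring the tree_df construction used below:

# sketch: true labels for the test slice alongside the SVM predictions
svm_df <- data.frame(label = meta_type[(n_train+1):n], svm = svm_classy[,1], stringsAsFactors = FALSE)
svm_df$svm <- as.character(svm_df$svm)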
4. Check SVM performance
tic("svm prop table")
svmperf <- prop.table(table(svm_df[,1] == svm_df[,2]))
toc(log = TRUE, quiet = TRUE)
svmperf

##
## FALSE TRUE
## 0.005411255 0.994588745
5. Starting a log of the results
perflog <- data.frame(svmperf) %>%
rownames_to_column %>%
gather(var, value, -rowname) %>%
spread(rowname, value) %>%
slice(1)
perflog[1] <- "svm"
names(perflog) <- c("Model", "% False", "% True")
perflog

## Model % False % True
## 1 svm 0.00541125541125541 0.994588744588745
The SVM model classified messages correctly 99.5% of the time, which is quite a good score. A confusion matrix will tell us more, though, particularly how often our model resulted in false positives or negatives.
6. Create confusion matrix
tic("svm confusion matrix")
svm_condf <- svm_df
bi <- c("ham" = 0, "spam" = 1)
svm_condf$label <- bi[svm_condf$label]
svm_condf$svm <- bi[svm_condf$svm]
confusion.matrix(svm_condf$label, svm_condf$svm)

## obs
## pred 0 1
## 0 462 4
## 1 1 457
## attr(,"class")
## [1] "confusion.matrix"
There was 1 Type I error (ham misclassified as spam, 0.1%) and 4 Type II errors (spam misclassified as ham, 0.4%).
In terms of timing the whole process took 7.054 seconds.
We’re going to start creating vectors to capture the processing times so we can compare later.
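A sketch of that capture, mirroring the pattern repeated after each model below:

# sketch: pull timer labels and elapsed seconds out of the tictoc log
tl <- tic.log(format = TRUE)
tla <- str_extract(tl, "\\b(.*)(!?\\:)")
tlb <- str_extract(tl, "\\d\\.\\d{1,3}")
svmtime <- c(tla, tlb)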
7. Train and test other models
Now that we see how well SVM did at the job we can run other models and compare their performances.
# start timer
tic('tree')
# train and execute model
tree_model <- train_model(container, "TREE")
tree_classy <- classify_model(container, tree_model)
tree_df <- data.frame(label = meta_type[(n_train+1):n],tree = tree_classy[,1], stringsAsFactors = FALSE)
tree_df$tree <- as.character(tree_df$tree)
treeperf <- prop.table(table(tree_df[,1] == tree_df[,2]))
# add to performance log
treeperf_df <- data.frame(treeperf) %>%
rownames_to_column %>%
gather(var, value, -rowname) %>%
spread(rowname, value) %>%
slice(1)
treeperf_df[1] <- "tree"
names(treeperf_df) <- c("Model", "% False", "% True")
perflog <- rbind(perflog, treeperf_df)
# create confusion matrix
tree_condf <- tree_df
bi <- c("ham" = 0, "spam" = 1)
tree_condf$label <- bi[tree_condf$label]
tree_condf$tree <- bi[tree_condf$tree]
confusion.matrix(tree_condf$label, tree_condf$tree)

## obs
## pred 0 1
## 0 427 17
## 1 36 444
## attr(,"class")
## [1] "confusion.matrix"
# stop timer
toc(log = TRUE, quiet = TRUE)
# add time to log
tl <- tic.log(format = TRUE)
tla <- str_extract(tl, "\\b(.*)(!?\\:)")
tlb <- str_extract(tl, "\\d\\.\\d{1,3}")
treetime <- c(tla, tlb)

Not nearly as good: only 94.3% correct, with 36 Type I errors (3.9%) and 17 Type II errors (1.8%).
# start timer
tic('maxy')
# train and execute model
maxy_model <- train_model(container, "MAXENT")
maxy_classy <- classify_model(container, maxy_model)
maxy_df <- data.frame(label = meta_type[(n_train+1):n],maxy = maxy_classy[,1], stringsAsFactors = FALSE)
maxy_df$maxy <- as.character(maxy_df$maxy)
maxyperf <- prop.table(table(maxy_df[,1] == maxy_df[,2]))
# add to performance log
maxyperf_df <- data.frame(maxyperf) %>%
rownames_to_column %>%
gather(var, value, -rowname) %>%
spread(rowname, value) %>%
slice(1)
maxyperf_df[1] <- "maxy"
names(maxyperf_df) <- c("Model", "% False", "% True")
perflog <- rbind(perflog, maxyperf_df)
# create confusion matrix
maxy_condf <- maxy_df
bi <- c("ham" = 0, "spam" = 1)
maxy_condf$label <- bi[maxy_condf$label]
maxy_condf$maxy <- bi[maxy_condf$maxy]
confusion.matrix(maxy_condf$label, maxy_condf$maxy)

## obs
## pred 0 1
## 0 462 2
## 1 1 459
## attr(,"class")
## [1] "confusion.matrix"
#stop timer
toc(log = TRUE, quiet = TRUE)
tl <- tic.log(format = TRUE)
tla <- str_extract(tl, "\\b(.*)(!?\\:)")
tlb <- str_extract(tl, "\\d\\.\\d{1,3}")
maxytime <- c(tla, tlb)
maxyperf

##
## FALSE TRUE
## 0.003246753 0.996753247
Maximum Entropy has taken the lead with 99.7% correct predictions, 0.1% Type I errors, and 0.2% Type II errors.
# start timer
tic('boost')
# train and execute model
boost_model <- train_model(container, "BOOSTING")
boost_classy <- classify_model(container, boost_model)
boost_df <- data.frame(label = meta_type[(n_train+1):n],boost = boost_classy[,1], stringsAsFactors = FALSE)
boost_df$boost <- as.character(boost_df$boost)
boostperf <- prop.table(table(boost_df[,1] == boost_df[,2]))
# add to performance log
boostperf_df <- data.frame(boostperf) %>%
rownames_to_column %>%
gather(var, value, -rowname) %>%
spread(rowname, value) %>%
slice(1)
boostperf_df[1] <- "boost"
names(boostperf_df) <- c("Model", "% False", "% True")
perflog <- rbind(perflog, boostperf_df)
# create confusion matrix
boost_condf <- boost_df
bi <- c("ham" = 0, "spam" = 1)
boost_condf$label <- bi[boost_condf$label]
boost_condf$boost <- bi[boost_condf$boost]
confusion.matrix(boost_condf$label, boost_condf$boost)

## obs
## pred 0 1
## 0 462 2
## 1 1 459
## attr(,"class")
## [1] "confusion.matrix"
toc(log = TRUE, quiet = TRUE)
tl <- tic.log(format = TRUE)
tla <- str_extract(tl, "\\b(.*)(!?\\:)")
tlb <- str_extract(tl, "\\d\\.\\d{1,3}")
boosttime <- c(tla, tlb)

Boosting produced exactly the same results as Maximum Entropy: 99.7% correct, with 1 Type I error (0.1%) and 2 Type II errors (0.2%). Two different algorithms agreeing this closely seems surprising, but we don’t see anywhere the code is off, so we take the result at face value.
Let’s convert the log to numeric and put it into percent form.
perflog$`% False`<- round(as.numeric(perflog$`% False`) * 100, 1)
perflog$`% True`<- round(as.numeric(perflog$`% True`) * 100, 1)
perflog

## Model % False % True
## 1 svm 0.5 99.5
## 2 tree 5.7 94.3
## 3 maxy 0.3 99.7
## 4 boost 0.3 99.7
8. Compare timing
Merge the logs to compare efficiency rates.
timelog <- data.frame("model" = c("svm", boosttime[1:3]), "time" = c(7.054, boosttime[4:6]), stringsAsFactors = FALSE)
timelog$time <- round(as.numeric(timelog$time),2)
timelog

## model time
## 1 svm 7.05
## 2 tree: 0.87
## 3 maxy: 6.38
## 4 boost: 2.50
9. Create final table of results
Now we’ll centralize all of the data into one table.
spham_perf <- bind_cols(perflog, timelog[2])
knitr::kable(head(spham_perf), format = "html") %>%
kable_styling(bootstrap_options = c("striped", "condensed"))

| Model | % False | % True | time |
|---|---|---|---|
| svm | 0.5 | 99.5 | 7.05 |
| tree | 5.7 | 94.3 | 0.87 |
| maxy | 0.3 | 99.7 | 6.38 |
| boost | 0.3 | 99.7 | 2.50 |
Conclusion
Maximum Entropy and Boosting tied as the most accurate models, but accuracy alone doesn’t make a model the ideal choice.
Boosting delivered the same accuracy in less than half the time (2.50 versus 6.38 seconds). When deploying a model at scale this will be a massive saving in resources, and since spam filtering is not a life-or-death situation, Boosting looks like the better production choice.
Next steps
In addition to testing the models on other samples, perhaps using k-fold cross-validation on this data set (see the sketch below), it would be good to see whether any of the text transformations would result in greater or lesser accuracy.
For example, we chose to keep numerical data in while others may have chosen to remove it. We would look not only at the results with numbers excluded, but also at the results of using the numbers alone.
Additionally, it would be good to generate features from the data in order to develop more rigorous logistic modeling.
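As a concrete starting point for the k-fold idea, here is a sketch assuming RTextTools’ cross_validate(), which reuses the container built earlier (the choice of 5 folds is ours):

# sketch: 5-fold cross-validation of the leading algorithms
svm_cv <- cross_validate(container, nfold = 5, algorithm = "SVM")
maxy_cv <- cross_validate(container, nfold = 5, algorithm = "MAXENT")
boost_cv <- cross_validate(container, nfold = 5, algorithm = "BOOSTING")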