CUNY MSDS 607: Canned Spam
Project Overview
A common use for NLP is to help email service providers distinguish spam, the unwanted or inappropriate messages sent in bulk to many people, from email that is pertinent and addressed to the user.
In this exercise we will examine a sample of known spam emails, compare them to known non-spam emails (which will be referred to as “ham”), and train and test models to find the best method to make this comparison.
Guiding Question
There are a number of classifier models that can be used to identify spam from ham. Which one is the best to use in production?
Prepare the data
1. Load the data
# Bentley Source path
easy_ham_2 <- VCorpus(DirSource("easy_ham_2"))
spam_2 <- VCorpus(DirSource("spam_2"))
# Niculescu Source Path
#setwd("~/Documents/R/CUNYMS/DATA607/PROJECTS/")
#easy_ham_2 <- VCorpus(DirSource("../DATA/easy_ham"))
#spam_2 <- VCorpus(DirSource("../DATA/spam"))

2. Tag the data
We add metadata to the two data sets so that we can later identify which emails are spam and which are ham.
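That chunk is not shown in this report; a minimal sketch, assuming per-document “type” tags and a combined corpus named spham (the object the cleaning steps below operate on), might look like this:

# sketch: tag each document, then combine the two corpora into one spham corpus
for (i in seq_along(easy_ham_2)) meta(easy_ham_2[[i]], "type") <- "ham"
for (i in seq_along(spam_2)) meta(spam_2[[i]], "type") <- "spam"
spham <- c(easy_ham_2, spam_2)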
3. Preliminary visualization
Before we perform transformations on the data, we wanted to see what the spam and ham looked like in their “natural” states.
dataframe <- data.frame(text=unlist(sapply(spam_2, `[`)), stringsAsFactors=F)
spamcorpus <- Corpus(VectorSource(spam_2))
spamcorpus <- tm_map(spamcorpus, removePunctuation)

## Warning in tm_map.SimpleCorpus(spamcorpus, removePunctuation):
## transformation drops documents
dtm <- TermDocumentMatrix(spamcorpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
e <- data.frame(word = names(v),freq=v)
head(e, 100)

## word freq
## the the 14416
## and and 10183
## from from 9789
## you you 9173
## 2002 2002 8706
## with with 8165
## for for 7879
## your your 6478
## received received 6059
## character0 character0 5592
## this this 4683
## jul jul 4382
## font font 3384
## mon mon 3360
## esmtp esmtp 3139
## table table 2973
## are are 2880
## email email 2845
## that that 2739
## tby tby 2577
## will will 2557
## our our 2516
## helvetica helvetica 2496
## 0100 0100 2438
## may may 2416
## tfor tfor 2406
## have have 2311
## not not 2022
## meta meta 2007
## localhost localhost 1898
## subject subject 1805
## can can 1771
## free free 1767
## all all 1746
## year year 1736
## face3darial face3darial 1707
## div div 1668
## sansserif sansserif 1602
## contenttype contenttype 1599
## 81168116 81168116 1515
## tue tue 1508
## spam spam 1501
## date date 1496
## facearial facearial 1480
## 0400 0400 1468
## hour hour 1464
## jun jun 1452
## description description 1422
## 118 118 1415
## language language 1412
## min min 1407
## heading heading 1405
## arial arial 1404
## 310 310 1403
## messageid messageid 1400
## datetimestamp datetimestamp 1398
## isdst isdst 1398
## mday mday 1398
## origin origin 1398
## wday wday 1398
## yday yday 1398
## listcontent listcontent 1385
## wed wed 1382
## listauthor listauthor 1376
## html html 1370
## listsec listsec 1345
## thu thu 1336
## more more 1300
## aug aug 1288
## smtp smtp 1285
## get get 1245
## body body 1230
## new new 1202
## business business 1200
## dogmaslashnullorg dogmaslashnullorg 1190
## faceverdana faceverdana 1190
## returnpath returnpath 1185
## please please 1179
## cfrom cfrom 1176
## out out 1174
## list list 1157
## mimeversion mimeversion 1149
## contenttransferencoding contenttransferencoding 1130
## 0500 0500 1124
## one one 1103
## jmlocalhost jmlocalhost 1083
## 127001 127001 1067
## mandarklabsnetnoteinccom mandarklabsnetnoteinccom 1040
## money money 1018
## border3d0 border3d0 978
## postfix postfix 966
## only only 961
## information information 937
## border0 border0 929
## tr20 tr20 924
## fri fri 900
## internet internet 895
## size2 size2 893
## here here 875
## option option 875
dataframe <- data.frame(text=unlist(sapply(easy_ham_2, `[`)), stringsAsFactors=F)
mycorpus <- Corpus(VectorSource(easy_ham_2))
mycorpus <- tm_map(mycorpus, removePunctuation)

## Warning in tm_map.SimpleCorpus(mycorpus, removePunctuation): transformation
## drops documents
dtm <- TermDocumentMatrix(mycorpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)

#plot in a wordcloud
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))

wordcloud(words = e$word, freq = e$freq, min.freq = 1,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Accent"))

4. Data cleaning
To normalize the data we are going to apply a number of transformations:
All lowercase: this will prevent the model from considering “This” and “this” as two different words.
Remove stopwords: this will prevent very common words (e.g., “i”, “we”, “my”, “me”) from being part of the model.
Consolidate stems: this will allow the model to treat words that essentially mean the same thing (e.g., “sparse”, “sparsity” and “sparseness”) as the same word.
Remove punctuation: will prevent the model from considering words with punctuation attached as different words (e.g., “this, and that” versus “this and that”).
Remove special characters: similar to the above, but with characters such as “+” or “~”. We had originally removed dollar signs as well, but a dollar sign seemed like it might be a good spam indicator, so we kept it.
Strip whitespace: eliminates spaces, tabs, or other “empty” space from being included in the analysis.
We originally also excluded strings with over 30 characters but that resulted in too few documents for the analysis.
# initially these steps were piped together, but on some machines there were memory issues, so we broke them out individually.
# transform all to lowercase
spham <- tm_map(spham, content_transformer(tolower))
# consolidate stems
spham <- tm_map(spham, stemDocument)
# remove punctuation
spham <- tm_map(spham, removePunctuation)
# remove other punctuation
spham <- tm_map(spham, content_transformer(str_replace_all), pattern = "[[:punct:]]", replacement = "")
# remove special characters such as "+", "~", "|", "?", and backticks
spham <- tm_map(spham, content_transformer(str_replace_all), pattern = "[+~|?`]",
replacement = "")
# remove whitespace
spham <- tm_map(spham, stripWhitespace)
# strip stopwords
spham <- tm_map(spham, removeWords, stopwords("english"))

5. Randomize
Because the container below will split the data by position (the first two-thirds become the training set), we randomize the document order so that both ham and spam appear in the training and test sets.
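The shuffling chunk is not shown; a minimal sketch (the seed value is our assumption, added for reproducibility):

# sketch: shuffle the combined corpus so train/test slices contain both classes
set.seed(42)
spham <- spham[sample(length(spham))]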
6. Create Document Term Matrix
We now convert the data into a document term matrix, the format needed to do further analysis.
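A minimal sketch of this step, assuming the standard tm constructor:

# sketch: build the document term matrix from the cleaned, shuffled corpus
dtm <- DocumentTermMatrix(spham)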
7. Reduce sparsity
Sparsity is the proportion of entries in the document term matrix that are zero, i.e., the extent to which terms do not appear in documents.
First we look at the initial sparsity so that we have a baseline to compare against.
## <<DocumentTermMatrix (documents: 2799, terms: 129129)>>
## Non-/sparse entries: 620022/360812049
## Sparsity : 100%
## Maximal term length: 880
## Weighting : term frequency (tf)
We then reduce sparsity, keeping only terms that appear in at least 10 documents.
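The reduction chunk is not shown; a sketch using tm’s removeSparseTerms(), where the 10-document threshold is expressed as a maximum allowed sparsity:

# sketch: keep only terms appearing in at least 10 of the documents
min_docs <- 10
dtm <- removeSparseTerms(dtm, sparse = 1 - min_docs / nDocs(dtm))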
## <<DocumentTermMatrix (documents: 2799, terms: 5795)>>
## Non-/sparse entries: 429093/15791112
## Sparsity : 97%
## Maximal term length: 71
## Weighting : term frequency (tf)
Reducing sparsity this way left us with 5,795 terms which, while only about 4% of the terms pre-reduction, should still be enough to create a model.
Mine the corpus
1. Create containers
First collect the meta labels assigned in step 2.
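A minimal sketch of that collection step, assuming the per-document “type” tags from step 2:

# sketch: gather the type tags in (randomized) corpus order
meta_type <- unlist(lapply(spham, meta, "type"))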
Now we create a “container” that separates the data into training and test sets.
We will use 2/3 of the available data to train the models and then test the model on the remaining 1/3.
# calculating lengths as variables for the container formula. Setting these up as variables will make it easier to change the numbers later if we choose to try different test / train proportions.
n <- length(meta_type)
n_train <- round(n*.67,0)
n_test <- n-n_train
container <- create_container(dtm,
labels = meta_type,
trainSize = 1:n_train,
testSize = (n_train+1):n,
virgin = FALSE)

2. Train and test SVM model
We will first use the SVM model as a test to make sure we have the coding correct, and will then apply the same pattern to the other models.
We’re also going to add a timing function so that we can consider computational efficiency for the other models.
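The training chunk itself is not shown; a sketch mirroring the pattern used for the TREE, MAXENT, and BOOSTING models below (timing calls omitted):

# sketch: train and execute the SVM model
svm_model <- train_model(container, "SVM")
svm_classy <- classify_model(container, svm_model)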
3. Create results DF
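This chunk is also not shown; a sketch mirroring the tree_df construction used below:

# sketch: true labels for the test slice alongside the SVM predictions
svm_df <- data.frame(label = meta_type[(n_train+1):n], svm = svm_classy[,1], stringsAsFactors = FALSE)
svm_df$svm <- as.character(svm_df$svm)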
4. Check SVM performance
tic("svm prop table")
svmperf <- prop.table(table(svm_df[,1] == svm_df[,2]))
toc(log = TRUE, quiet = TRUE)
svmperf

##
## FALSE TRUE
## 0.005411255 0.994588745
5. Starting a log of the results
perflog <- data.frame(svmperf) %>%
rownames_to_column %>%
gather(var, value, -rowname) %>%
spread(rowname, value) %>%
slice(1)
perflog[1] <- "svm"
names(perflog) <- c("Model", "% False", "% True")
perflog

## Model % False % True
## 1 svm 0.00541125541125541 0.994588744588745
The SVM model classified messages correctly 99.5% of the time, which is quite a good score. A confusion matrix will tell us more, though, particularly how often our model resulted in false positives or negatives.
6. Create confusion matrix
tic("svm confusion matrix")
svm_condf <- svm_df
bi <- c("ham" = 0, "spam" = 1)
svm_condf$label <- bi[svm_condf$label]
svm_condf$svm <- bi[svm_condf$svm]
confusion.matrix(svm_condf$label, svm_condf$svm)

## obs
## pred 0 1
## 0 462 4
## 1 1 457
## attr(,"class")
## [1] "confusion.matrix"
There was 1 Type I error (ham misclassified as spam, 0.1%) and 4 Type II errors (spam misclassified as ham, 0.4%).
In terms of timing the whole process took 7.054 seconds.
We’re going to start creating vectors to capture the processing times so we can compare later.
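A sketch of that capture, mirroring the pattern repeated after each model below:

# sketch: pull timer labels and elapsed seconds out of the tictoc log
tl <- tic.log(format = TRUE)
tla <- str_extract(tl, "\\b(.*)(!?\\:)")
tlb <- str_extract(tl, "\\d\\.\\d{1,3}")
svmtime <- c(tla, tlb)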
7. Train and test other models
Now that we see how well SVM did at the job we can run other models and compare their performances.
# start timer
tic('tree')
# train and execute model
tree_model <- train_model(container, "TREE")
tree_classy <- classify_model(container, tree_model)
tree_df <- data.frame(label = meta_type[(n_train+1):n],tree = tree_classy[,1], stringsAsFactors = FALSE)
tree_df$tree <- as.character(tree_df$tree)
treeperf <- prop.table(table(tree_df[,1] == tree_df[,2]))
# add to performance log
treeperf_df <- data.frame(treeperf) %>%
rownames_to_column %>%
gather(var, value, -rowname) %>%
spread(rowname, value) %>%
slice(1)
treeperf_df[1] <- "tree"
names(treeperf_df) <- c("Model", "% False", "% True")
perflog <- rbind(perflog, treeperf_df)
# create confusion matrix
tree_condf <- tree_df
bi <- c("ham" = 0, "spam" = 1)
tree_condf$label <- bi[tree_condf$label]
tree_condf$tree <- bi[tree_condf$tree]
confusion.matrix(tree_condf$label, tree_condf$tree)

## obs
## pred 0 1
## 0 427 17
## 1 36 444
## attr(,"class")
## [1] "confusion.matrix"
# stop timer
toc(log = TRUE, quiet = TRUE)
# add time to log
tl <- tic.log(format = TRUE)
tla <- str_extract(tl, "\\b(.*)(!?\\:)")
tlb <- str_extract(tl, "\\d\\.\\d{1,3}")
treetime <- c(tla, tlb)

Not nearly as good: only 94.3% correct, with 36 Type I errors (3.9%) and 17 Type II errors (1.8%).
# start timer
tic('maxy')
# train and execute model
maxy_model <- train_model(container, "MAXENT")
maxy_classy <- classify_model(container, maxy_model)
maxy_df <- data.frame(label = meta_type[(n_train+1):n],maxy = maxy_classy[,1], stringsAsFactors = FALSE)
maxy_df$maxy <- as.character(maxy_df$maxy)
maxyperf <- prop.table(table(maxy_df[,1] == maxy_df[,2]))
# add to performance log
maxyperf_df <- data.frame(maxyperf) %>%
rownames_to_column %>%
gather(var, value, -rowname) %>%
spread(rowname, value) %>%
slice(1)
maxyperf_df[1] <- "maxy"
names(maxyperf_df) <- c("Model", "% False", "% True")
perflog <- rbind(perflog, maxyperf_df)
# create confusion matrix
maxy_condf <- maxy_df
bi <- c("ham" = 0, "spam" = 1)
maxy_condf$label <- bi[maxy_condf$label]
maxy_condf$maxy <- bi[maxy_condf$maxy]
confusion.matrix(maxy_condf$label, maxy_condf$maxy)

## obs
## pred 0 1
## 0 462 2
## 1 1 459
## attr(,"class")
## [1] "confusion.matrix"
#stop timer
toc(log = TRUE, quiet = TRUE)
tl <- tic.log(format = TRUE)
tla <- str_extract(tl, "\\b(.*)(!?\\:)")
tlb <- str_extract(tl, "\\d\\.\\d{1,3}")
maxytime <- c(tla, tlb)
maxyperf

##
## FALSE TRUE
## 0.003246753 0.996753247
Maximum Entropy has taken the lead with 99.7% correct predictions, 0.1% Type I errors, and 0.2% Type II errors.
# start timer
tic('boost')
# train and execute model
boost_model <- train_model(container, "BOOSTING")
boost_classy <- classify_model(container, boost_model)
boost_df <- data.frame(label = meta_type[(n_train+1):n],boost = boost_classy[,1], stringsAsFactors = FALSE)
boost_df$boost <- as.character(boost_df$boost)
boostperf <- prop.table(table(boost_df[,1] == boost_df[,2]))
# add to performance log
boostperf_df <- data.frame(boostperf) %>%
rownames_to_column %>%
gather(var, value, -rowname) %>%
spread(rowname, value) %>%
slice(1)
boostperf_df[1] <- "boost"
names(boostperf_df) <- c("Model", "% False", "% True")
perflog <- rbind(perflog, boostperf_df)
# create confusion matrix
boost_condf <- boost_df
bi <- c("ham" = 0, "spam" = 1)
boost_condf$label <- bi[boost_condf$label]
boost_condf$boost <- bi[boost_condf$boost]
confusion.matrix(boost_condf$label, boost_condf$boost)

## obs
## pred 0 1
## 0 462 2
## 1 1 459
## attr(,"class")
## [1] "confusion.matrix"
toc(log = TRUE, quiet = TRUE)
tl <- tic.log(format = TRUE)
tla <- str_extract(tl, "\\b(.*)(!?\\:)")
tlb <- str_extract(tl, "\\d\\.\\d{1,3}")
boosttime <- c(tla, tlb)

Boosting produced exactly the same results as Maximum Entropy: 99.7% correct, with 1 Type I error (0.1%) and 2 Type II errors (0.2%). Two different algorithms agreeing this closely seems surprising, but we don’t see anywhere the code is off, so we take the result at face value.
Let’s convert the log to numeric and put it into percent form.
perflog$`% False`<- round(as.numeric(perflog$`% False`) * 100, 1)
perflog$`% True`<- round(as.numeric(perflog$`% True`) * 100, 1)
perflog

## Model % False % True
## 1 svm 0.5 99.5
## 2 tree 5.7 94.3
## 3 maxy 0.3 99.7
## 4 boost 0.3 99.7
8. Compare timing
Merge the logs to compare efficiency rates.
timelog <- data.frame("model" = c("svm", boosttime[1:3]), "time" = c(7.054, boosttime[4:6]), stringsAsFactors = FALSE)
timelog$time <- round(as.numeric(timelog$time),2)
timelog

## model time
## 1 svm 7.05
## 2 tree: 0.87
## 3 maxy: 6.38
## 4 boost: 2.50
9. Create final table of results
Now we’ll centralize all of the data into one table.
spham_perf <- bind_cols(perflog, timelog[2])
knitr::kable(head(spham_perf), format = "html") %>%
kable_styling(bootstrap_options = c("striped", "condensed"))

| Model | % False | % True | time |
|---|---|---|---|
| svm | 0.5 | 99.5 | 7.05 |
| tree | 5.7 | 94.3 | 0.87 |
| maxy | 0.3 | 99.7 | 6.38 |
| boost | 0.3 | 99.7 | 2.50 |
Conclusion
Maximum Entropy and Boosting tied as the most accurate models, but accuracy alone doesn’t make a model the ideal choice.
Boosting delivered the same accuracy in less than half the time (2.50 versus 6.38 seconds). When deploying a model at scale this will be a massive saving in resources, and since spam filtering is not a life-or-death situation, Boosting looks like the better production choice.
Next steps
In addition to testing the models on other samples, perhaps using k-fold cross-validation on this data set (see the sketch below), it would be good to see whether any of the text transformations would result in greater or lesser accuracy.
For example, we chose to keep numerical data in while others may have chosen to remove it. We would look not only at the results with numbers excluded, but also at the results of using the numbers alone.
Additionally, it would be good to generate features from the data in order to develop more rigorous logistic modeling.
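As a concrete starting point for the k-fold idea, here is a sketch assuming RTextTools’ cross_validate(), which reuses the container built earlier (the choice of 5 folds is ours):

# sketch: 5-fold cross-validation of the leading algorithms
svm_cv <- cross_validate(container, nfold = 5, algorithm = "SVM")
maxy_cv <- cross_validate(container, nfold = 5, algorithm = "MAXENT")
boost_cv <- cross_validate(container, nfold = 5, algorithm = "BOOSTING")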