1 Introduction

In this project, we analyze email messages and develop several models to predict whether they are spam. I chose to work with two sets of emails from the website given in the assignment (https://spamassassin.apache.org/publiccorpus/):

  • “easy_ham_2”: a collection of 1,400 “ham” emails from 2003/02
  • “spam_2”: a collection of 1,396 spam emails from 2005/03.

For the text mining and modeling work, I used the tm and RTextTools packages:

  • tm: to load the messages, create a corpus and document-term matrix, and to prepare the data for modeling
  • RTextTools: to train models on the document-term matrix using different learning algorithms, and then evaluate their predictive performance.
# load required packages
library(tm)
library(stringr)
library(RTextTools)
library(knitr)

2 Data preparation

2.1 Load the data

First we load the data. I downloaded and unzipped the files from the website https://spamassassin.apache.org/publiccorpus/ into separate directories, and then used the DirSource function to read all of the files into two SimpleCorpus data structures. We have 1,400 ham messages and 1,396 spam messages.

# load the data
ham_raw <- SimpleCorpus(DirSource("HamSpam/easy_ham_2/"))
spam_raw <- SimpleCorpus(DirSource("HamSpam/spam_2/"))
ham_raw
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1400
spam_raw
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1396
# number of emails
n_ham <- length(ham_raw)
n_spam <- length(spam_raw)
N <- n_ham + n_spam

Once the messages are loaded, we can review some sample emails using the inspect function. Note that the headers are long and are separated from the message content by a blank line. Also note that the sample ham and spam messages below are 3,492 and 1,821 characters long, respectively.

# inspect sample emails
inspect(ham_raw[[1000]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 3492
## 
## From fork-admin@xent.com  Mon Aug 12 11:09:53 2002
## Return-Path: <fork-admin@xent.com>
## Delivered-To: yyyy@localhost.netnoteinc.com
## Received: from localhost (localhost [127.0.0.1])
##  by phobos.labs.netnoteinc.com (Postfix) with ESMTP id BCD4244108
##  for <jm@localhost>; Mon, 12 Aug 2002 05:57:02 -0400 (EDT)
## Received: from phobos [127.0.0.1]
##  by localhost with IMAP (fetchmail-5.9.0)
##  for jm@localhost (single-drop); Mon, 12 Aug 2002 10:57:02 +0100 (IST)
## Received: from xent.com ([64.161.22.236]) by dogma.slashnull.org
##     (8.11.6/8.11.6) with ESMTP id g7BAVlb30446 for <jm@jmason.org>;
##     Sun, 11 Aug 2002 11:31:47 +0100
## Received: from lair.xent.com (localhost [127.0.0.1]) by xent.com (Postfix)
##     with ESMTP id 94EC929415D; Sun, 11 Aug 2002 03:28:05 -0700 (PDT)
## Delivered-To: fork@spamassassin.taint.org
## Received: from venus.phpwebhosting.com (venus.phpwebhosting.com
##     [64.29.16.27]) by xent.com (Postfix) with SMTP id D7CD0294159 for
##     <fork@xent.com>; Sun, 11 Aug 2002 03:27:25 -0700 (PDT)
## Received: (qmail 22327 invoked by uid 508); 11 Aug 2002 10:28:24 -0000
## Received: from unknown (HELO hydrogen.leitl.org) (62.155.144.56) by
##     venus.phpwebhosting.com with SMTP; 11 Aug 2002 10:28:24 -0000
## Received: from localhost (eugen@localhost) by hydrogen.leitl.org
##     (8.11.6/8.11.6) with ESMTP id g7BAS2U26714; Sun, 11 Aug 2002 12:28:06
##     +0200
## X-Authentication-Warning: hydrogen.leitl.org: eugen owned process doing -bs
## From: Eugen Leitl <eugen@leitl.org>
## To: Gary Lawrence Murphy <garym@canada.com>
## Cc: fork <fork@spamassassin.taint.org>
## Subject: Re: Forged whitelist spam
## In-Reply-To: <m2r8h6qumb.fsf@maya.dyndns.org>
## Message-Id: <Pine.LNX.4.33.0208111214300.3981-100000@hydrogen.leitl.org>
## MIME-Version: 1.0
## Content-Type: TEXT/PLAIN; charset=US-ASCII
## Sender: fork-admin@xent.com
## Errors-To: fork-admin@xent.com
## X-Beenthere: fork@spamassassin.taint.org
## X-Mailman-Version: 2.0.11
## Precedence: bulk
## List-Help: <mailto:fork-request@xent.com?subject=help>
## List-Post: <mailto:fork@spamassassin.taint.org>
## List-Subscribe: <http://xent.com/mailman/listinfo/fork>, <mailto:fork-request@xent.com?subject=subscribe>
## List-Id: Friends of Rohit Khare <fork.xent.com>
## List-Unsubscribe: <http://xent.com/mailman/listinfo/fork>,
##     <mailto:fork-request@xent.com?subject=unsubscribe>
## List-Archive: <http://xent.com/pipermail/fork/>
## Date: Sun, 11 Aug 2002 12:28:02 +0200 (CEST)
## 
## On 10 Aug 2002, Gary Lawrence Murphy wrote:
## 
## > My uneducated guess is that all they need to jump expensive whitelist
## > walls would be buckshot a spam-laden Klez with a 5-million-addresses
## > mailer; if it finds just one vulnerable host on an Exchange server,
## > through hopping addressbooks across a few degrees of freedom, a world
## > of whitelists are instantly breechable.
## 
## You seem to be saying that whitelists are useless, because there are worms
## which can compromise your system, read your address book/whitelist, and
## sent themselves on, compromising a nonnegligible fraction of systems as
## they go along. 
## 
## While mailing lists can be spam/worm amplifiers, I don't think this is
## true for individual users even today. Moreover, worms which use email as
## vector exist *only* because a single vendor ships mailers with broken
## default settings, and insists to make documents executables. This makes
## for very bad press, and eventually that vendor is going to wise up, and 
## stop shipping as many broken wares (or people will switch to more secure 
## alternatives, whatever comes first).
## 
## 
## 
## http://xent.com/mailman/listinfo/fork
inspect(spam_raw[[500]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 1821
## 
## From gjmy@public.ayptt.ha.cn  Mon Jun 24 17:08:00 2002
## Return-Path: gyyyyy@public.ayptt.ha.cn
## Delivery-Date: Wed May 29 10:49:10 2002
## Received: from mandark.labs.netnoteinc.com ([213.105.180.140]) by
##     dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g4T9n8O30150 for
##     <jm@jmason.org>; Wed, 29 May 2002 10:49:08 +0100
## Received: from public.ayptt.ha.cn ([202.102.230.147]) by
##     mandark.labs.netnoteinc.com (8.11.2/8.11.2) with ESMTP id g4T9n2701963 for
##     <jm@netnoteinc.com>; Wed, 29 May 2002 10:49:07 +0100
## Received: from Margaret ([218.29.21.108]) by public.ayptt.ha.cn
##     (8.9.1a/8.9.1) with SMTP id RAA29880; Wed, 29 May 2002 17:45:51 +0800
##     (CST)
## Message-Id: <200205290945.RAA29880@public.ayptt.ha.cn>
## Reply-To: Margaret<plum318@163.com>
## From: "Margaret"<gyyyyy@public.ayptt.ha.cn>
## To: ""<ruud@RUUD.ORG>
## Date: Wed,29 May 2002 17:51:47 +0800
## X-Auto-Forward: To: ""<ruud@RUUD.ORG>
## X-Keywords: 
## Subject: 
## 
## Dear Sirs,
## We know your esteemed company in beach towels from Internet, and pleased to introduce us as a leading producer of high quality 100% cotton velour printed towels in China, we sincerely hope to establish a long-term business relationship with your esteemed company in this field.
##   
## Our major items are 100% cotton full printed velour towels of the following sizes and weights with a annual production capacity of one million dozens:
## Disney Standard:
## 30X60 inches, weight  305grams/SM, 350gram/PC  
## 40X70 inches, weight  305grams/SM, 550gram/PC  
## Please refer to our website http://www.jacquard-towel.com/index.html for more details ie patterns about our products.
## Once you are interested in our products, we will give you a more favorable price.
## Looking forward to hearing from you soon 
## Thanks and best regards,
## Margaret/Sales Manager
## Henan Ziyang Textiles
## http://www.jacquard-towel.com

2.2 Clean the data

Before we start cleaning the data, let’s review a sample of the stopwords from the “SMART” stopword set. We observe that they are all lowercase, some include punctuation (e.g., “ain’t”), and some are inflected variants of the same stem (e.g., “look” and “looking”). We factor these observations into our sequence of text cleaning steps below.

sort(sample(stopwords("SMART"), 100))
##   [1] "a"             "able"          "according"     "across"       
##   [5] "ain't"         "also"          "always"        "another"      
##   [9] "appear"        "appropriate"   "are"           "associated"   
##  [13] "awfully"       "become"        "becoming"      "behind"       
##  [17] "can't"         "certain"       "changes"       "clearly"      
##  [21] "com"           "come"          "course"        "currently"    
##  [25] "d"             "edu"           "etc"           "every"        
##  [29] "everywhere"    "exactly"       "followed"      "getting"      
##  [33] "gives"         "gone"          "help"          "hence"        
##  [37] "here"          "hereafter"     "herein"        "in"           
##  [41] "indicated"     "it"            "it'll"         "known"        
##  [45] "last"          "look"          "looking"       "ltd"          
##  [49] "many"          "more"          "moreover"      "much"         
##  [53] "namely"        "necessary"     "neither"       "non"          
##  [57] "not"           "on"            "other"         "please"       
##  [61] "possible"      "really"        "regarding"     "regardless"   
##  [65] "secondly"      "see"           "several"       "should"       
##  [69] "since"         "so"            "thanx"         "that"         
##  [73] "them"          "thence"        "theres"        "these"        
##  [77] "they'll"       "thoroughly"    "thus"          "too"          
##  [81] "two"           "unfortunately" "unto"          "up"           
##  [85] "uucp"          "vs"            "weren't"       "when"         
##  [89] "whenever"      "where"         "while"         "who's"        
##  [93] "why"           "will"          "would"         "yes"          
##  [97] "you'll"        "yours"         "yourselves"    "z"

Now we undertake the text cleaning steps below in sequence, applying each one with the tm_map function:

  • Remove the message header: we use a regular expression to identify the blank line separating the header from the message content, and then remove the header (a toy example follows this list)
  • Convert words to lower case
  • Remove stopwords using the “SMART” set
  • Remove punctuation
  • Remove numbers
  • Stem words to their root stems
  • Remove extra white space.
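
As a quick sanity check, here is the header-removal regular expression applied to a toy message (a hypothetical two-line header, not from the corpus):

# toy demo of the header-removal regex: one or more non-empty lines at the
# start of the string, followed by the blank separator line
toy <- "From: a@example.com\nSubject: hi\n\nBody line 1\nBody line 2"
str_replace(toy, "^(.+\\n)+\\n", "")
## [1] "Body line 1\nBody line 2"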

Note that some of these steps may remove information that could be useful in identifying spam. For instance, certain URLs in the header can be associated with particular spam senders; heavy use of UPPERCASE words or punctuation runs (e.g., “!!!”) can signal spam; and certain numbers (phone numbers, dollar amounts, IP addresses) can be predictive of spam. Removing these items may reduce predictive performance, but it forces the learning algorithms to focus on the text words alone as predictors of spam.
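
If we wanted to retain some of that signal, we could compute simple features from the raw text before cleaning; below is a minimal sketch (illustrative feature names; these features are not used in the models that follow):

# sketch: pre-cleaning features that the cleaning steps would otherwise discard
raw_txt   <- sapply(spam_raw, as.character)
n_exclaim <- str_count(raw_txt, fixed("!"))   # "!!!" punctuation runs
n_upper   <- str_count(raw_txt, "[A-Z]")      # UPPERCASE usage
n_digit   <- str_count(raw_txt, "[0-9]")      # phone numbers, amounts, IPs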

To accomplish the text cleaning process, we define a cleaning function that includes all the cleaning steps and then apply it to both the ham and spam datasets.

# define cleaning function
cleandata <- function(x) {
    tmp <- x
    # remove header: strip everything up to and including the first blank line
    tmp <- tm_map(tmp, str_replace, pattern = "^(.+\\n)+\\n", replacement = "")
    # convert to lowercase
    tmp <- tm_map(tmp, content_transformer(tolower))
    # remove stopwords
    tmp <- tm_map(tmp, removeWords, stopwords("SMART"))
    # remove punctuation
    tmp <- tm_map(tmp, str_replace_all, pattern = "[:punct:]", replacement = " ")
    # remove numbers
    tmp <- tm_map(tmp, removeNumbers)
    # stem words
    tmp <- tm_map(tmp, stemDocument)
    # remove extra whitespace
    tmp <- tm_map(tmp, stripWhitespace)
    return(tmp)
}

# clean ham and spam data
ham <- cleandata(ham_raw)
spam <- cleandata(spam_raw)

Afterwards, we inspect the same sample emails as before. Notice that the headers are gone, and the character lengths have been reduced to 590 and 534 for the ham and spam messages, respectively.

# inspect sample emails
inspect(ham[[1000]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 590
## 
## aug gari lawrenc murphi wrote > uneduc guess jump expens whitelist > wall buckshot spam laden klez million address > mailer find vulner host exchang server > hop addressbook degre freedom world > whitelist instant breechabl whitelist useless worm compromis system read address book whitelist compromis nonneglig fraction system mail list spam worm amplifi true individu user today worm email vector exist singl vendor ship mailer broken default set insist make document execut make bad press eventu vendor wise stop ship broken ware peopl switch secur altern http xent mailman listinfo fork
inspect(spam[[500]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 534
## 
## dear sir esteem compani beach towel internet pleas introduc lead produc high qualiti cotton velour print towel china sincer hope establish long term busi relationship esteem compani field major item cotton full print velour towel size weight annual product capac million dozen disney standard x inch weight gram sm gram pc x inch weight gram sm gram pc refer websit http www jacquard towel index html detail pattern product interest product give favor price forward hear margaret sale manag henan ziyang textil http www jacquard towel

2.3 Create document-term matrix

Next we create document-term matrices for the ham and spam datasets, which we will use to find the most frequent terms in each dataset. We do this using two different term weightings:

  • Term frequency (TF)
  • Term frequency-inverse document frequency (TF-IDF)

First, we build the document-term matrices using the term frequency weighting. Note that ham_dtm and spam_dtm are 99% and (after rounding) 100% sparse. We will reduce this sparsity before developing the predictive models by applying a sparsity threshold below.

# create doc-term matrix: term frequency weighting
ham_dtm <- DocumentTermMatrix(ham)
spam_dtm <- DocumentTermMatrix(spam)
inspect(ham_dtm)
## <<DocumentTermMatrix (documents: 1400, terms: 15718)>>
## Non-/sparse entries: 112456/21892744
## Sparsity           : 99%
## Maximal term length: 76
## Weighting          : term frequency (tf)
## Sample             :
##                                         Terms
## Docs                                     http linux list listinfo mail
##   00668.0c194428812d424ce5d9b0a39615b041   42     0    9        0    4
##   00693.2183b91fb14b93bdfaab337b915c98bb    7     0    1        1    1
##   00695.2de9d6d30a7713e550b4fd02bb35e7b4    6     0    1        1    0
##   00813.6598e1ef9134cf77f48bca239e4ba2dc    3     0    0        1    1
##   00869.0fbb783356f6875063681dc49cfcb1eb   18     0    0        1    0
##   00966.8ebefc5eaa53c3bf9ef1dfcec1ee2087    3     0    0        1    1
##   01060.95d3e0a8c47b33d1533f18ac2c60c81a    1     0    0        1    0
##   01317.7fc86413a091430c3104b041a6525131  239    17   12        1    4
##   01345.c40d5798193a4a060ec9f3d2321e37e4   38    59    3        0    2
##   01380.e3fad5af747d3a110008f94a046bf31b    5     5   26        0    0
##                                         Terms
## Docs                                     mailman net razor user www
##   00668.0c194428812d424ce5d9b0a39615b041       0  13     0    1  32
##   00693.2183b91fb14b93bdfaab337b915c98bb       1  13     0    4   2
##   00695.2de9d6d30a7713e550b4fd02bb35e7b4       1   0     0    4   2
##   00813.6598e1ef9134cf77f48bca239e4ba2dc       1   0     0    0   2
##   00869.0fbb783356f6875063681dc49cfcb1eb       1   0     0    0   0
##   00966.8ebefc5eaa53c3bf9ef1dfcec1ee2087       1   0     0    0   2
##   01060.95d3e0a8c47b33d1533f18ac2c60c81a       1   0     0    0   0
##   01317.7fc86413a091430c3104b041a6525131       1 220     0  155  19
##   01345.c40d5798193a4a060ec9f3d2321e37e4       0   0     0   28  30
##   01380.e3fad5af747d3a110008f94a046bf31b       0   0     0  122   1
inspect(spam_dtm)
## <<DocumentTermMatrix (documents: 1396, terms: 31729)>>
## Non-/sparse entries: 178443/44115241
## Sparsity           : 100%
## Maximal term length: 121
## Weighting          : term frequency (tf)
## Sample             :
##                                         Terms
## Docs                                     align arial color face font http
##   00028.60393e49c90f750226bee6381eb3e69d     0   273   275  271 1627   79
##   00044.9f8c4b9ae007c6ded3d57476082bf2b2     4     5     7    5   20   97
##   00051.8b17ce16ace4d5845e2299c0123e1f14     9    18    18   20   41   82
##   00077.6e13224e39fae4b94bcbe0f5ae9f4939     9    18    18   20   41   92
##   00200.2fcabc2b58baa0ebc051e3ea3dfafd8f    12    18    14   24   60    2
##   00777.284d3dc66b4f1bdedb5a5eba41d18d14    19     6     6   10   60    7
##   00975.5e2e7c9d8b2c04929ff41e010163e5e8    23     4    47   24  113    6
##   01083.a6b3c50be5abf782b585995d2c11176b     0     0     4    0    0    8
##   01094.91779ec04e5e6b27e84297c28fc7369f   126    32   170   52 1102  516
##   01095.520dcad6e0ebb4d30222292f51ee76ab   126    32   170   52 1102  516
##                                         Terms
## Docs                                     nbsp size width www
##   00028.60393e49c90f750226bee6381eb3e69d    0  273     0  74
##   00044.9f8c4b9ae007c6ded3d57476082bf2b2   15   16    24  40
##   00051.8b17ce16ace4d5845e2299c0123e1f14  567   24    13  68
##   00077.6e13224e39fae4b94bcbe0f5ae9f4939  283   24    13  74
##   00200.2fcabc2b58baa0ebc051e3ea3dfafd8f    7   20    54   1
##   00777.284d3dc66b4f1bdedb5a5eba41d18d14   10   13     2   9
##   00975.5e2e7c9d8b2c04929ff41e010163e5e8   15   50     8   6
##   01083.a6b3c50be5abf782b585995d2c11176b    0    0     0  10
##   01094.91779ec04e5e6b27e84297c28fc7369f  339  447     0 407
##   01095.520dcad6e0ebb4d30222292f51ee76ab  339  447     0 407

Second, we build the document-term matrices using the TF-IDF weighting. As before, both DTMs are extremely sparse.
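
For reference, tm’s weightTfIdf with normalize = FALSE computes tf(t, d) * log2(N / df(t)) for term t in document d, where N is the number of documents and df(t) is the number of documents containing t. A quick illustration with assumed counts (not taken from our corpus):

# tf-idf weight under weightTfIdf(normalize = FALSE):
#   tfidf(t, d) = tf(t, d) * log2(N / df(t))
tf_td <- 3; n_docs <- 1400; df_t <- 100
tf_td * log2(n_docs / df_t)   # = 3 * log2(14), about 11.4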

# create doc-term matrix: TFIDF weighting
ham_dtm2 <- DocumentTermMatrix(ham, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))
spam_dtm2 <- DocumentTermMatrix(spam, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))
inspect(ham_dtm2)
## <<DocumentTermMatrix (documents: 1400, terms: 15718)>>
## Non-/sparse entries: 112456/21892744
## Sparsity           : 99%
## Maximal term length: 76
## Weighting          : term frequency - inverse document frequency (tf-idf)
## Sample             :
##                                         Terms
## Docs                                     exmh       file freshmeat ilug
##   00663.660f0334bb6d89793e3d3bb5367cd9c1    0   0.000000     0.000    0
##   00668.0c194428812d424ce5d9b0a39615b041    0   0.000000     0.000    0
##   00813.6598e1ef9134cf77f48bca239e4ba2dc    0   0.000000     0.000    0
##   00869.0fbb783356f6875063681dc49cfcb1eb    0   0.000000     0.000    0
##   00966.8ebefc5eaa53c3bf9ef1dfcec1ee2087    0   5.249325     0.000    0
##   01060.95d3e0a8c47b33d1533f18ac2c60c81a    0   0.000000     0.000    0
##   01317.7fc86413a091430c3104b041a6525131    0 162.729083  1804.701    0
##   01345.c40d5798193a4a060ec9f3d2321e37e4    0  73.490553     0.000    0
##   01380.e3fad5af747d3a110008f94a046bf31b    0 640.417680     0.000    0
##   01389.e4cfb234aace4e12b2d9453686c911c9    0   2.624663     0.000    0
##                                         Terms
## Docs                                         linux       net razor
##   00663.660f0334bb6d89793e3d3bb5367cd9c1  0.000000  31.49565     0
##   00668.0c194428812d424ce5d9b0a39615b041  0.000000  18.61106     0
##   00813.6598e1ef9134cf77f48bca239e4ba2dc  0.000000   0.00000     0
##   00869.0fbb783356f6875063681dc49cfcb1eb  0.000000   0.00000     0
##   00966.8ebefc5eaa53c3bf9ef1dfcec1ee2087  0.000000   0.00000     0
##   01060.95d3e0a8c47b33d1533f18ac2c60c81a  0.000000   0.00000     0
##   01317.7fc86413a091430c3104b041a6525131 20.699054 314.95648     0
##   01345.c40d5798193a4a060ec9f3d2321e37e4 71.837895   0.00000     0
##   01380.e3fad5af747d3a110008f94a046bf31b  6.087957   0.00000     0
##   01389.e4cfb234aace4e12b2d9453686c911c9  0.000000  18.61106     0
##                                         Terms
## Docs                                          rpm     spam   unison
##   00663.660f0334bb6d89793e3d3bb5367cd9c1 0.000000 0.000000    0.000
##   00668.0c194428812d424ce5d9b0a39615b041 0.000000 2.904317    0.000
##   00813.6598e1ef9134cf77f48bca239e4ba2dc 0.000000 0.000000    0.000
##   00869.0fbb783356f6875063681dc49cfcb1eb 0.000000 0.000000    0.000
##   00966.8ebefc5eaa53c3bf9ef1dfcec1ee2087 0.000000 0.000000    0.000
##   01060.95d3e0a8c47b33d1533f18ac2c60c81a 0.000000 0.000000    0.000
##   01317.7fc86413a091430c3104b041a6525131 6.294861 0.000000    0.000
##   01345.c40d5798193a4a060ec9f3d2321e37e4 0.000000 0.000000    0.000
##   01380.e3fad5af747d3a110008f94a046bf31b 0.000000 0.000000 2003.657
##   01389.e4cfb234aace4e12b2d9453686c911c9 0.000000 5.808633    0.000
inspect(spam_dtm2)
## <<DocumentTermMatrix (documents: 1396, terms: 31729)>>
## Non-/sparse entries: 178443/44115241
## Sparsity           : 100%
## Maximal term length: 121
## Weighting          : term frequency - inverse document frequency (tf-idf)
## Sample             :
##                                         Terms
## Docs                                         align      arial      color
##   00028.60393e49c90f750226bee6381eb3e69d   0.00000 384.428837 306.330166
##   00051.8b17ce16ace4d5845e2299c0123e1f14  11.74479  25.346956  20.050702
##   00200.2fcabc2b58baa0ebc051e3ea3dfafd8f  15.65971  25.346956  15.594990
##   00777.284d3dc66b4f1bdedb5a5eba41d18d14  24.79455   8.448985   6.683567
##   00975.5e2e7c9d8b2c04929ff41e010163e5e8  30.01445   5.632657  52.354610
##   01083.a6b3c50be5abf782b585995d2c11176b   0.00000   0.000000   4.455712
##   01094.91779ec04e5e6b27e84297c28fc7369f 164.42700  45.061256 189.367739
##   01095.520dcad6e0ebb4d30222292f51ee76ab 164.42700  45.061256 189.367739
##   01097.98d732b93866d13b0c13589ae2acc383   0.00000   0.000000   0.000000
##   01359.deafa1d42658c6624c6809a446b7f369   0.00000   0.000000   0.000000
##                                         Terms
## Docs                                          face       font    height
##   00028.60393e49c90f750226bee6381eb3e69d 305.52841 1557.42178  0.000000
##   00051.8b17ce16ace4d5845e2299c0123e1f14  22.54822   39.24665  9.198765
##   00200.2fcabc2b58baa0ebc051e3ea3dfafd8f  27.05787   57.43412 60.711846
##   00777.284d3dc66b4f1bdedb5a5eba41d18d14  11.27411   57.43412  0.000000
##   00975.5e2e7c9d8b2c04929ff41e010163e5e8  27.05787  108.16759 31.275800
##   01083.a6b3c50be5abf782b585995d2c11176b   0.00000    0.00000  0.000000
##   01094.91779ec04e5e6b27e84297c28fc7369f  58.62538 1054.87326  0.000000
##   01095.520dcad6e0ebb4d30222292f51ee76ab  58.62538 1054.87326  0.000000
##   01097.98d732b93866d13b0c13589ae2acc383   0.00000    0.00000  0.000000
##   01359.deafa1d42658c6624c6809a446b7f369   0.00000    0.00000  0.000000
##                                         Terms
## Docs                                         nbsp      size   verdana
##   00028.60393e49c90f750226bee6381eb3e69d   0.0000 245.75153 630.92652
##   00051.8b17ce16ace4d5845e2299c0123e1f14 888.7725  21.60453  41.90656
##   00200.2fcabc2b58baa0ebc051e3ea3dfafd8f  10.9725  18.00378  37.25027
##   00777.284d3dc66b4f1bdedb5a5eba41d18d14  15.6750  11.70245   0.00000
##   00975.5e2e7c9d8b2c04929ff41e010163e5e8  23.5125  45.00944  23.28142
##   01083.a6b3c50be5abf782b585995d2c11176b   0.0000   0.00000   0.00000
##   01094.91779ec04e5e6b27e84297c28fc7369f 531.3825 402.38438   0.00000
##   01095.520dcad6e0ebb4d30222292f51ee76ab 531.3825 402.38438   0.00000
##   01097.98d732b93866d13b0c13589ae2acc383   0.0000   0.00000   0.00000
##   01359.deafa1d42658c6624c6809a446b7f369   0.0000   0.00000   0.00000
##                                         Terms
## Docs                                         width
##   00028.60393e49c90f750226bee6381eb3e69d  0.000000
##   00051.8b17ce16ace4d5845e2299c0123e1f14 17.606239
##   00200.2fcabc2b58baa0ebc051e3ea3dfafd8f 73.133609
##   00777.284d3dc66b4f1bdedb5a5eba41d18d14  2.708652
##   00975.5e2e7c9d8b2c04929ff41e010163e5e8 10.834609
##   01083.a6b3c50be5abf782b585995d2c11176b  0.000000
##   01094.91779ec04e5e6b27e84297c28fc7369f  0.000000
##   01095.520dcad6e0ebb4d30222292f51ee76ab  0.000000
##   01097.98d732b93866d13b0c13589ae2acc383  0.000000
##   01359.deafa1d42658c6624c6809a446b7f369  0.000000

3 Exploratory data analysis

We can use several features of the tm package to do some exploratory data analysis.

3.1 Find most frequent terms

First let’s find the most frequent terms in the ham and spam datasets, under both the TF and TF-IDF weightings. We use the findFreqTerms function, and set minimum frequency thresholds for each dataset to arrive at lists of roughly 30 to 55 of the most frequent terms. Note that while many terms are shared between the TF and TF-IDF versions of the most-frequent-term lists, many are not.
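
Note that findFreqTerms(dtm, lowfreq) returns the terms whose total weight summed over all documents (the column sum of the DTM) is at least lowfreq. Those totals can be checked directly with the slam package, on which tm builds; a quick sketch:

# column sums behind the findFreqTerms thresholds (slam is a tm dependency)
library(slam)
head(sort(col_sums(ham_dtm), decreasing = TRUE), 10)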

# find most frequent terms in ham dtm
HAM_TF <- sort(findFreqTerms(ham_dtm, 500))
HAM_TFIDF <- sort(findFreqTerms(ham_dtm2, 1300))

# find most frequent terms in spam dtm
SPAM_TF <- sort(findFreqTerms(spam_dtm, 1300))
SPAM_TFIDF <- sort(findFreqTerms(spam_dtm2, 2400))

# fill in NA's in shorter vectors and display in a table
max_length <- max(length(HAM_TF), length(HAM_TFIDF), length(SPAM_TF), length(SPAM_TFIDF))
HAM_TF <- c(HAM_TF, rep(NA, max_length - length(HAM_TF)))
HAM_TFIDF <- c(HAM_TFIDF, rep(NA, max_length - length(HAM_TFIDF)))
SPAM_TF <- c(SPAM_TF, rep(NA, max_length - length(SPAM_TF)))
SPAM_TFIDF <- c(SPAM_TFIDF, rep(NA, max_length - length(SPAM_TFIDF)))

kable(data.frame(cbind(1:max_length, HAM_TF, HAM_TFIDF, SPAM_TF, SPAM_TFIDF)), 
      caption = "Most Frequent Terms in the Ham and Spam Datasets by TF and TFIDF Weightings")
Most Frequent Terms in the Ham and Spam Datasets by TF and TFIDF Weightings

    HAM_TF       HAM_TFIDF    SPAM_TF      SPAM_TFIDF
 1  email        div          address      align
 2  exmh         exmh         align        arial
 3  file         file         arial        bgcolor
 4  fork         freshmeat    bgcolor      blockquote
 5  group        ftoc         border       border
 6  http         ilug         busi         busi
 7  ilug         licens       cellpadding  cellpadding
 8  inform       linux        cellspacing  cellspacing
 9  irish        list         center       center
10  linux        mail         click        cfont
11  list         messag       color        color
12  listinfo     net          content      colspan
13  listmast     org          div          content
14  mail         peopl        email        dcenter
15  mailman      perl         face         div
16  maintain     razor        ffffff       face
17  make         rpm          font         ffffff
18  messag       server       free         ffont
19  net          sourceforg   gif          font
20  org          spam         height       free
21  peopl        spamassassin helvetica    geneva
22  razor        system       href         gif
23  rpm          time         html         grant
24  server       unison       http         height
25  sourceforg   user         imag         helvetica
26  spam         window       img          href
27  subscript    work         left         imag
28  system       NA           list         img
29  time         NA           mail         input
30  user         NA           nbsp         left
31  work         NA           net          margin
32  wrote        NA           option       money
33  www          NA           order        mso
34  NA           NA           receiv       nbsp
35  NA           NA           remov        net
36  NA           NA           san          option
37  NA           NA           serif        order
38  NA           NA           size         san
39  NA           NA           span         serif
40  NA           NA           src          site
41  NA           NA           strong       size
42  NA           NA           style        span
43  NA           NA           tabl         src
44  NA           NA           table        strong
45  NA           NA           text         style
46  NA           NA           time         tabl
47  NA           NA           top          table
48  NA           NA           type         tbody
49  NA           NA           verdana      text
50  NA           NA           width        top
51  NA           NA           www          type
52  NA           NA           NA           valign
53  NA           NA           NA           verdana
54  NA           NA           NA           width
55  NA           NA           NA           www

3.2 Find correlated terms

Second, we can find the words that are most frequently associated with certain terms in the ham and spam datasets. We do this using the findAssocs function for two frequent words in each of ham_dtm and spam_dtm, and set a minimum correlation threshold to arrive at a list of 5-10 associated terms. For instance, below are some of the terms with the highest correlation to the sample words:

  • “make” in ham: associated with “file”, “interfac”, “improv”, “chang”, “fix”
  • “user” in ham: associated with “footprint”, “function”, “environ”, “librari”, “featur”
  • “money” in spam: associated with “guid”, “unemploy”, “dollar”, “insid”, “grant”
  • “offer” in spam: associated with “intro”, “craft”, “sweet”, “herba”, “hookah”

The results are identical between the TF and TF-IDF versions of the ham and spam datasets. This is expected: findAssocs reports Pearson correlations between term columns, and because we used normalize = FALSE, the TF-IDF weighting simply multiplies each term column by a constant (its IDF factor), which leaves those correlations unchanged.
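
A tiny numeric check of this scale invariance:

# rescaling a term column by a constant (its IDF factor) leaves the
# Pearson correlation unchanged
x <- c(0, 1, 2, 0, 3)
y <- c(1, 0, 2, 0, 4)
c(cor(x, y), cor(5 * x, y))   # the two values are identical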

# find terms associated with freq terms in ham dtm
findAssocs(ham_dtm, "make", 0.78)
## $make
##     file interfac    small   improv  speedup    chang      fix wildcard 
##     0.80     0.79     0.79     0.79     0.79     0.78     0.78     0.78
findAssocs(ham_dtm2, "make", 0.78)
## $make
##     file interfac    small   improv  speedup    chang      fix wildcard 
##     0.80     0.79     0.79     0.79     0.79     0.78     0.78     0.78
findAssocs(ham_dtm, "user", 0.85)
## $user
## footprint    bugfix  function     dynam   environ     creat     stabl 
##      0.92      0.89      0.88      0.88      0.88      0.87      0.87 
##   librari    featur    cygwin 
##      0.85      0.85      0.85
findAssocs(ham_dtm2, "user", 0.85)
## $user
## footprint    bugfix  function     dynam   environ     creat     stabl 
##      0.92      0.89      0.88      0.88      0.88      0.87      0.87 
##   librari    featur    cygwin 
##      0.85      0.85      0.85
# find terms associated with freq terms in spam dtm
findAssocs(spam_dtm, "money", 0.77)
## $money
##     guid unemploy     step guidelin   dollar    insid    grant 
##     0.82     0.81     0.80     0.80     0.79     0.79     0.77
findAssocs(spam_dtm2, "money", 0.77)
## $money
##     guid unemploy     step guidelin   dollar    insid    grant 
##     0.82     0.81     0.80     0.80     0.79     0.79     0.77
findAssocs(spam_dtm, "offer", 0.736)
## $offer
##  intro  craft  sweet  herba hookah jigget 
##   0.76   0.75   0.74   0.74   0.74   0.74
findAssocs(spam_dtm2, "offer", 0.736)
## $offer
##  intro  craft  sweet  herba hookah jigget 
##   0.76   0.75   0.74   0.74   0.74   0.74

3.3 View “denser” document-term matrices

Finally, let’s take a peek at the ham and spam document-term matrices when we reduce the degree of sparsity. We do this using the removeSparseTerms function with a maximum sparsity threshold of 0.4; a term’s sparsity is the share of documents in which it does not appear, so this drops every term that is absent from more than 40% of the documents.
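
As a rough check of what that threshold implies:

# implied document-frequency cutoffs at sparsity cap s = 0.4: a term must
# appear in roughly (1 - s) of the documents or more to survive
s <- 0.4
(1 - s) * n_ham    # more than ~840 of the 1400 ham docs
(1 - s) * n_spam   # more than ~838 of the 1396 spam docs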

# inspect reduced form of ham dtm
inspect(removeSparseTerms(ham_dtm, 0.4))
## <<DocumentTermMatrix (documents: 1400, terms: 5)>>
## Non-/sparse entries: 5618/1382
## Sparsity           : 20%
## Maximal term length: 8
## Weighting          : term frequency (tf)
## Sample             :
##                                         Terms
## Docs                                     http list listinfo mailman www
##   00562.0f377593022357878ec2249f0c9a5f08    9   15        4       4   6
##   00664.28f4cb9fad800d0c7175d3a67e6c6458   22    4        0       0  15
##   00665.087e07e6a5f47598db0629c21e6e1a70   26    1        0       0  20
##   00666.009d6116caed8ebd2b48febcea7b6c38   34    0        0       0  26
##   00668.0c194428812d424ce5d9b0a39615b041   42    9        0       0  32
##   01303.80dd19a2b1d8496c48396b630179b00f   26    0        0       0  20
##   01317.7fc86413a091430c3104b041a6525131  239   12        1       1  19
##   01318.193fb7308fee59bb4aa70cc72191b0b1   76    1        0       0  38
##   01345.c40d5798193a4a060ec9f3d2321e37e4   38    3        0       0  30
##   01389.e4cfb234aace4e12b2d9453686c911c9   53    4        0       0  44
inspect(removeSparseTerms(ham_dtm2, 0.4))
## <<DocumentTermMatrix (documents: 1400, terms: 5)>>
## Non-/sparse entries: 5618/1382
## Sparsity           : 20%
## Maximal term length: 8
## Weighting          : term frequency - inverse document frequency (tf-idf)
## Sample             :
##                                         Terms
## Docs                                           http       list   listinfo
##   00664.28f4cb9fad800d0c7175d3a67e6c6458  1.8435199  1.9938795 0.00000000
##   00665.087e07e6a5f47598db0629c21e6e1a70  2.1787054  0.4984699 0.00000000
##   00666.009d6116caed8ebd2b48febcea7b6c38  2.8490763  0.0000000 0.00000000
##   00668.0c194428812d424ce5d9b0a39615b041  3.5194471  4.4862288 0.00000000
##   01303.80dd19a2b1d8496c48396b630179b00f  2.1787054  0.0000000 0.00000000
##   01317.7fc86413a091430c3104b041a6525131 20.0273302  5.9816384 0.09036403
##   01318.193fb7308fee59bb4aa70cc72191b0b1  6.3685234  0.4984699 0.00000000
##   01345.c40d5798193a4a060ec9f3d2321e37e4  3.1842617  1.4954096 0.00000000
##   01380.e3fad5af747d3a110008f94a046bf31b  0.4189818 12.9602165 0.00000000
##   01389.e4cfb234aace4e12b2d9453686c911c9  4.4412071  1.9938795 0.00000000
##                                         Terms
## Docs                                       mailman        www
##   00664.28f4cb9fad800d0c7175d3a67e6c6458 0.0000000 10.6208958
##   00665.087e07e6a5f47598db0629c21e6e1a70 0.0000000 14.1611944
##   00666.009d6116caed8ebd2b48febcea7b6c38 0.0000000 18.4095527
##   00668.0c194428812d424ce5d9b0a39615b041 0.0000000 22.6579110
##   01303.80dd19a2b1d8496c48396b630179b00f 0.0000000 14.1611944
##   01317.7fc86413a091430c3104b041a6525131 0.3040062 13.4531346
##   01318.193fb7308fee59bb4aa70cc72191b0b1 0.0000000 26.9062693
##   01345.c40d5798193a4a060ec9f3d2321e37e4 0.0000000 21.2417915
##   01380.e3fad5af747d3a110008f94a046bf31b 0.0000000  0.7080597
##   01389.e4cfb234aace4e12b2d9453686c911c9 0.0000000 31.1546276
# inspect reduced form of spam dtm
inspect(removeSparseTerms(spam_dtm, 0.4))
## <<DocumentTermMatrix (documents: 1396, terms: 2)>>
## Non-/sparse entries: 2024/768
## Sparsity           : 28%
## Maximal term length: 4
## Weighting          : term frequency (tf)
## Sample             :
##                                         Terms
## Docs                                     html http
##   00028.60393e49c90f750226bee6381eb3e69d   13   79
##   00044.9f8c4b9ae007c6ded3d57476082bf2b2   27   97
##   00077.6e13224e39fae4b94bcbe0f5ae9f4939   11   92
##   00081.4c7fbdca38b8def54e276e75ec56682e   11   90
##   00117.9f0ba9c35b1fe59307e32b7c2c0d4e61   12   81
##   00145.b6788a48c1eace0b7c34ff7de32766f6   15   83
##   01094.91779ec04e5e6b27e84297c28fc7369f   10  516
##   01095.520dcad6e0ebb4d30222292f51ee76ab   10  516
##   01113.75ded32e6beb52dec1b6007dc86b47bb   61   64
##   01304.114140cd4c51e9795559b974964aa043   86  153
inspect(removeSparseTerms(spam_dtm2, 0.4))
## <<DocumentTermMatrix (documents: 1396, terms: 2)>>
## Non-/sparse entries: 2024/768
## Sparsity           : 28%
## Maximal term length: 4
## Weighting          : term frequency - inverse document frequency (tf-idf)
## Sample             :
##                                         Terms
## Docs                                          html      http
##   00028.60393e49c90f750226bee6381eb3e69d  8.847206  21.79661
##   00044.9f8c4b9ae007c6ded3d57476082bf2b2 18.374967  26.76292
##   00077.6e13224e39fae4b94bcbe0f5ae9f4939  7.486097  25.38339
##   00081.4c7fbdca38b8def54e276e75ec56682e  7.486097  24.83158
##   00145.b6788a48c1eace0b7c34ff7de32766f6 10.208315  22.90023
##   01017.11a80131a2ae31ad0a9969189de3c2bb 14.291641  16.27848
##   01094.91779ec04e5e6b27e84297c28fc7369f  6.805543 142.36772
##   01095.520dcad6e0ebb4d30222292f51ee76ab  6.805543 142.36772
##   01113.75ded32e6beb52dec1b6007dc86b47bb 41.513813  17.65801
##   01304.114140cd4c51e9795559b974964aa043 58.527671  42.21368

4 Develop predictive models

4.1 Prepare the data

For efficiency in the remainder of the analysis, we work only with the document-term matrices with the TF weighting. First we combine ham_dtm and spam_dtm into a combined document-term matrix, and in the process:

  • Confirm the row positions of the ham emails (1-1400) and the spam emails (1401-2796)
  • Remove words that appear in 10 documents or fewer, corresponding to a sparsity threshold of roughly 99.6% (1 - 10 / 2796).
# create combined dtm
hamspam_dtm <- c(ham_dtm, spam_dtm)
hamspam_dtm
## <<DocumentTermMatrix (documents: 2796, terms: 42048)>>
## Non-/sparse entries: 290899/117275309
## Sparsity           : 100%
## Maximal term length: 121
## Weighting          : term frequency (tf)
# check that ham and spam emails line up in combined dtm
identical(inspect(ham_dtm[1:n_ham, ]), inspect(hamspam_dtm[1:n_ham, ]))
## <<DocumentTermMatrix (documents: 1400, terms: 15718)>>
## Non-/sparse entries: 112456/21892744
## Sparsity           : 99%
## Maximal term length: 76
## Weighting          : term frequency (tf)
## Sample             :
##                                         Terms
## Docs                                     http linux list listinfo mail
##   00668.0c194428812d424ce5d9b0a39615b041   42     0    9        0    4
##   00693.2183b91fb14b93bdfaab337b915c98bb    7     0    1        1    1
##   00695.2de9d6d30a7713e550b4fd02bb35e7b4    6     0    1        1    0
##   00813.6598e1ef9134cf77f48bca239e4ba2dc    3     0    0        1    1
##   00869.0fbb783356f6875063681dc49cfcb1eb   18     0    0        1    0
##   00966.8ebefc5eaa53c3bf9ef1dfcec1ee2087    3     0    0        1    1
##   01060.95d3e0a8c47b33d1533f18ac2c60c81a    1     0    0        1    0
##   01317.7fc86413a091430c3104b041a6525131  239    17   12        1    4
##   01345.c40d5798193a4a060ec9f3d2321e37e4   38    59    3        0    2
##   01380.e3fad5af747d3a110008f94a046bf31b    5     5   26        0    0
##                                         Terms
## Docs                                     mailman net razor user www
##   00668.0c194428812d424ce5d9b0a39615b041       0  13     0    1  32
##   00693.2183b91fb14b93bdfaab337b915c98bb       1  13     0    4   2
##   00695.2de9d6d30a7713e550b4fd02bb35e7b4       1   0     0    4   2
##   00813.6598e1ef9134cf77f48bca239e4ba2dc       1   0     0    0   2
##   00869.0fbb783356f6875063681dc49cfcb1eb       1   0     0    0   0
##   00966.8ebefc5eaa53c3bf9ef1dfcec1ee2087       1   0     0    0   2
##   01060.95d3e0a8c47b33d1533f18ac2c60c81a       1   0     0    0   0
##   01317.7fc86413a091430c3104b041a6525131       1 220     0  155  19
##   01345.c40d5798193a4a060ec9f3d2321e37e4       0   0     0   28  30
##   01380.e3fad5af747d3a110008f94a046bf31b       0   0     0  122   1
## <<DocumentTermMatrix (documents: 1400, terms: 42048)>>
## Non-/sparse entries: 112456/58754744
## Sparsity           : 100%
## Maximal term length: 121
## Weighting          : term frequency (tf)
## Sample             :
##                                         Terms
## Docs                                     http linux list listinfo mail
##   00668.0c194428812d424ce5d9b0a39615b041   42     0    9        0    4
##   00693.2183b91fb14b93bdfaab337b915c98bb    7     0    1        1    1
##   00695.2de9d6d30a7713e550b4fd02bb35e7b4    6     0    1        1    0
##   00813.6598e1ef9134cf77f48bca239e4ba2dc    3     0    0        1    1
##   00869.0fbb783356f6875063681dc49cfcb1eb   18     0    0        1    0
##   00966.8ebefc5eaa53c3bf9ef1dfcec1ee2087    3     0    0        1    1
##   01060.95d3e0a8c47b33d1533f18ac2c60c81a    1     0    0        1    0
##   01317.7fc86413a091430c3104b041a6525131  239    17   12        1    4
##   01345.c40d5798193a4a060ec9f3d2321e37e4   38    59    3        0    2
##   01380.e3fad5af747d3a110008f94a046bf31b    5     5   26        0    0
##                                         Terms
## Docs                                     mailman net razor user www
##   00668.0c194428812d424ce5d9b0a39615b041       0  13     0    1  32
##   00693.2183b91fb14b93bdfaab337b915c98bb       1  13     0    4   2
##   00695.2de9d6d30a7713e550b4fd02bb35e7b4       1   0     0    4   2
##   00813.6598e1ef9134cf77f48bca239e4ba2dc       1   0     0    0   2
##   00869.0fbb783356f6875063681dc49cfcb1eb       1   0     0    0   0
##   00966.8ebefc5eaa53c3bf9ef1dfcec1ee2087       1   0     0    0   2
##   01060.95d3e0a8c47b33d1533f18ac2c60c81a       1   0     0    0   0
##   01317.7fc86413a091430c3104b041a6525131       1 220     0  155  19
##   01345.c40d5798193a4a060ec9f3d2321e37e4       0   0     0   28  30
##   01380.e3fad5af747d3a110008f94a046bf31b       0   0     0  122   1
## [1] TRUE
identical(inspect(spam_dtm[1:n_spam, ]), inspect(hamspam_dtm[(n_ham + 1):N, ]))
## <<DocumentTermMatrix (documents: 1396, terms: 31729)>>
## Non-/sparse entries: 178443/44115241
## Sparsity           : 100%
## Maximal term length: 121
## Weighting          : term frequency (tf)
## Sample             :
##                                         Terms
## Docs                                     align arial color face font http
##   00028.60393e49c90f750226bee6381eb3e69d     0   273   275  271 1627   79
##   00044.9f8c4b9ae007c6ded3d57476082bf2b2     4     5     7    5   20   97
##   00051.8b17ce16ace4d5845e2299c0123e1f14     9    18    18   20   41   82
##   00077.6e13224e39fae4b94bcbe0f5ae9f4939     9    18    18   20   41   92
##   00200.2fcabc2b58baa0ebc051e3ea3dfafd8f    12    18    14   24   60    2
##   00777.284d3dc66b4f1bdedb5a5eba41d18d14    19     6     6   10   60    7
##   00975.5e2e7c9d8b2c04929ff41e010163e5e8    23     4    47   24  113    6
##   01083.a6b3c50be5abf782b585995d2c11176b     0     0     4    0    0    8
##   01094.91779ec04e5e6b27e84297c28fc7369f   126    32   170   52 1102  516
##   01095.520dcad6e0ebb4d30222292f51ee76ab   126    32   170   52 1102  516
##                                         Terms
## Docs                                     nbsp size width www
##   00028.60393e49c90f750226bee6381eb3e69d    0  273     0  74
##   00044.9f8c4b9ae007c6ded3d57476082bf2b2   15   16    24  40
##   00051.8b17ce16ace4d5845e2299c0123e1f14  567   24    13  68
##   00077.6e13224e39fae4b94bcbe0f5ae9f4939  283   24    13  74
##   00200.2fcabc2b58baa0ebc051e3ea3dfafd8f    7   20    54   1
##   00777.284d3dc66b4f1bdedb5a5eba41d18d14   10   13     2   9
##   00975.5e2e7c9d8b2c04929ff41e010163e5e8   15   50     8   6
##   01083.a6b3c50be5abf782b585995d2c11176b    0    0     0  10
##   01094.91779ec04e5e6b27e84297c28fc7369f  339  447     0 407
##   01095.520dcad6e0ebb4d30222292f51ee76ab  339  447     0 407
## <<DocumentTermMatrix (documents: 1396, terms: 42048)>>
## Non-/sparse entries: 178443/58520565
## Sparsity           : 100%
## Maximal term length: 121
## Weighting          : term frequency (tf)
## Sample             :
##                                         Terms
## Docs                                     align arial color face font http
##   00028.60393e49c90f750226bee6381eb3e69d     0   273   275  271 1627   79
##   00044.9f8c4b9ae007c6ded3d57476082bf2b2     4     5     7    5   20   97
##   00051.8b17ce16ace4d5845e2299c0123e1f14     9    18    18   20   41   82
##   00077.6e13224e39fae4b94bcbe0f5ae9f4939     9    18    18   20   41   92
##   00200.2fcabc2b58baa0ebc051e3ea3dfafd8f    12    18    14   24   60    2
##   00777.284d3dc66b4f1bdedb5a5eba41d18d14    19     6     6   10   60    7
##   00975.5e2e7c9d8b2c04929ff41e010163e5e8    23     4    47   24  113    6
##   01083.a6b3c50be5abf782b585995d2c11176b     0     0     4    0    0    8
##   01094.91779ec04e5e6b27e84297c28fc7369f   126    32   170   52 1102  516
##   01095.520dcad6e0ebb4d30222292f51ee76ab   126    32   170   52 1102  516
##                                         Terms
## Docs                                     nbsp size width www
##   00028.60393e49c90f750226bee6381eb3e69d    0  273     0  74
##   00044.9f8c4b9ae007c6ded3d57476082bf2b2   15   16    24  40
##   00051.8b17ce16ace4d5845e2299c0123e1f14  567   24    13  68
##   00077.6e13224e39fae4b94bcbe0f5ae9f4939  283   24    13  74
##   00200.2fcabc2b58baa0ebc051e3ea3dfafd8f    7   20    54   1
##   00777.284d3dc66b4f1bdedb5a5eba41d18d14   10   13     2   9
##   00975.5e2e7c9d8b2c04929ff41e010163e5e8   15   50     8   6
##   01083.a6b3c50be5abf782b585995d2c11176b    0    0     0  10
##   01094.91779ec04e5e6b27e84297c28fc7369f  339  447     0 407
##   01095.520dcad6e0ebb4d30222292f51ee76ab  339  447     0 407
## [1] TRUE
# remove words that appear in 10 docs or less
j <- 10
hamspam_dtm <- removeSparseTerms(hamspam_dtm, 1 - j / N) 
hamspam_dtm
## <<DocumentTermMatrix (documents: 2796, terms: 3657)>>
## Non-/sparse entries: 226899/9998073
## Sparsity           : 98%
## Maximal term length: 33
## Weighting          : term frequency (tf)

Next we prepare the data for analysis using the RTextTools package. Steps include:

  • Defining a spam flag vector that identifies whether emails are ham (0) or spam (1); this vector will be set to 0 for the first 1,400 emails, then set to 1 for the next 1,396 emails.
  • Creating training and test indices that identify which emails are in the training set and which are in the holdout (test) set; we will use random sampling to divide each of the ham and spam datasets in half (k=2). Note that without a set.seed call, the random split (and hence the exact accuracy figures below) will vary from run to run.
  • Setting up the data container in order to train and test the classifier models in RTextTools.
# define spam flag: 0=ham, 1=spam
spam_labs <- rep(0, n_ham)
spam_labs <- c(spam_labs, rep(1, n_spam))
# validate
sum(spam_labs[1:n_ham])
## [1] 0
sum(spam_labs[(n_ham + 1):N])
## [1] 1396
# split dataset in half and define the training set and the holdout (test) set
k <- 2
n_ham_train <- n_ham %/% k
n_spam_train <- n_spam %/% k
n_ham_test  <- n_ham - n_ham_train
n_spam_test  <- n_spam - n_spam_train

# define index of randomly selected training & test cases in the DTM
idx_ham_train <- sample(1:n_ham, n_ham_train)
idx_spam_train <- sample(1:n_spam, n_spam_train)
idx_train <- c(idx_ham_train, n_ham + idx_spam_train)
idx_test <- (1:N)[-idx_train]

# set up data container
container <- create_container(
    hamspam_dtm,
    labels = spam_labs,
    trainSize = idx_train,
    testSize = idx_test,
    virgin = FALSE
)
str(container)
## Formal class 'matrix_container' [package "RTextTools"] with 6 slots
##   ..@ training_matrix      :Formal class 'matrix.csr' [package "SparseM"] with 4 slots
##   .. .. ..@ ra       : num [1:113690] 1 1 2 2 1 1 1 1 2 1 ...
##   .. .. ..@ ja       : int [1:113690] 163 164 165 10 166 167 168 19 169 170 ...
##   .. .. ..@ ia       : int [1:1399] 1 39 83 169 211 254 309 335 364 405 ...
##   .. .. ..@ dimension: int [1:2] 1398 3657
##   ..@ classification_matrix:Formal class 'matrix.csr' [package "SparseM"] with 4 slots
##   .. .. ..@ ra       : num [1:113209] 2 1 1 1 1 5 1 1 1 1 ...
##   .. .. ..@ ja       : int [1:113209] 1 2 3 4 5 6 7 8 9 10 ...
##   .. .. ..@ ia       : int [1:1399] 1 163 354 396 462 487 532 637 685 711 ...
##   .. .. ..@ dimension: int [1:2] 1398 3657
##   ..@ training_codes       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   ..@ testing_codes        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   ..@ column_names         : chr [1:3657] "addit" "age" "ago" "appar" ...
##   ..@ virgin               : logi FALSE

4.2 Train & test the predictive models

We use the RTextTools package to develop and test several models on the ham/spam dataset. Nine different classification algorithms are available in RTextTools:

  • SVM: support vector machine
  • GLMNET: regularized generalized linear model (lasso / elastic net)
  • MAXENT: maximum entropy
  • SLDA: scaled linear discriminant analysis
  • BOOSTING: logitboost
  • BAGGING: bootstrap aggregation
  • RF: random forest
  • NNET: neural networks
  • TREE: classification or regression tree

After some experimentation, I decided to use only the SVM, MAXENT, TREE, and BOOSTING algorithms. The other algorithms either took too much time to run (SLDA, BAGGING, RF, and NNET) or could not be used with the set-up of my data container (GLMNET appeared to require the idx_train and idx_test indices to be contiguous sequences).

First we train the models on the training dataset, and then use the models to classify emails (ham/spam) in the holdout dataset.

# train models
SVM <- train_model(container, "SVM")
TREE <- train_model(container, "TREE")
MAXENT <- train_model(container, "MAXENT")
#GLMNET <- train_model(container, "GLMNET")
#SLDA <- train_model(container, "SLDA")
BOOSTING <- train_model(container, "BOOSTING")
#BAGGING <- train_model(container, "BAGGING")
#RF <- train_model(container, "RF")
#NNET <- train_model(container, "NNET")

# test models
SVM_out <- classify_model(container, SVM)
TREE_out <- classify_model(container, TREE)
MAXENT_out <- classify_model(container, MAXENT)
#GLMNET_out <- classify_model(container, GLMNET) 
#SLDA_out <- classify_model(container, SLDA)
BOOSTING_out <- classify_model(container, BOOSTING)
#BAGGING_out <- classify_model(container, BAGGING)
#RF_out <- classify_model(container, RF)
#NNET_out <- classify_model(container, NNET)

Once the models have been run on the holdout dataset, we can compute their predictive accuracy. Here we measure accuracy by the proportion of total predictions that are correct. We compile the results for each model, and then summarize in a table.

# collect model predictions
labels_out <- data.frame(
    correct_label = spam_labs[idx_test],
    svm = SVM_out[ , 1],
    tree = TREE_out[ , 1],
    maxent = MAXENT_out[ , 1],
    #glmnet = GLMNET_out[ , 1],
    #slda = SLDA_out[ , 1],
    boosting = BOOSTING_out[ , 1]
    #bagging = BAGGING_out[ , 1],
    #rf = RF_out[ , 1]
    #nnet = NNET_out[ , 1])
)

# inspect model predictions; test rows 1-700 are ham, rows 701-1398 are spam
head(labels_out)
##   correct_label svm tree maxent boosting
## 1             0   0    0      0        0
## 2             0   0    0      0        0
## 3             0   0    0      0        0
## 4             0   0    0      0        0
## 5             0   0    0      0        0
## 6             0   0    0      0        0
labels_out[680:720, ]
##     correct_label svm tree maxent boosting
## 680             0   0    0      0        0
## 681             0   0    0      0        0
## 682             0   0    0      0        0
## 683             0   0    0      0        0
## 684             0   0    0      0        0
## 685             0   0    0      0        0
## 686             0   1    1      1        1
## 687             0   0    0      0        0
## 688             0   0    0      1        0
## 689             0   0    0      1        0
## 690             0   1    1      1        1
## 691             0   0    0      1        0
## 692             0   0    0      0        0
## 693             0   0    0      0        0
## 694             0   0    0      0        0
## 695             0   1    1      1        1
## 696             0   0    0      0        0
## 697             0   0    1      0        1
## 698             0   0    1      0        1
## 699             0   0    1      0        1
## 700             0   0    0      0        0
## 701             1   1    0      1        1
## 702             1   1    1      1        1
## 703             1   1    1      1        1
## 704             1   1    1      1        1
## 705             1   0    0      0        0
## 706             1   1    1      1        1
## 707             1   1    1      1        1
## 708             1   1    1      1        1
## 709             1   1    1      1        1
## 710             1   1    1      1        1
## 711             1   1    1      1        1
## 712             1   1    1      1        1
## 713             1   1    1      1        1
## 714             1   1    0      1        1
## 715             1   1    1      1        1
## 716             1   1    1      1        1
## 717             1   1    1      1        1
## 718             1   1    1      1        1
## 719             1   1    0      1        1
## 720             1   1    1      1        1
tail(labels_out)
##      correct_label svm tree maxent boosting
## 1393             1   1    1      1        1
## 1394             1   1    1      1        1
## 1395             1   1    1      1        1
## 1396             1   1    1      1        1
## 1397             1   1    1      1        1
## 1398             1   1    0      1        1
# calc SVM accuracy (# correct predictions / total cases)       
tab_svm <- table(labels_out[ , 1] == labels_out[ , 2])
(ptab_svm <- prop.table(tab_svm))
## 
##      FALSE       TRUE 
## 0.06294707 0.93705293
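# a confusion matrix separates false positives (ham flagged as spam) from
# false negatives (missed spam); a sketch using the columns built above:
table(true = labels_out$correct_label, predicted = labels_out$svm)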
# create dataframe and compile accuracy for all models
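# (note: c() coerces each assigned row to character, so the digits = 4
# argument passed to kable below has no visible effect on the table)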
results <- data.frame(matrix(NA, nrow = 4, ncol = 3))
model_names <- c("SVM", "TREE", "MAXENT", "BOOSTING") 

for (j in 1:4) {
    results[j, ] <- c(model_names[j], 
                      prop.table(table(labels_out[ , 1] == labels_out[ , j + 1])))
}
colnames(results) <- c("MODEL", "INCORRECT", "CORRECT")
kable(results, digits = 4, caption = "Summary of Classification Accuracy on Holdout Dataset (# Correct Predictions / # Total Cases)")
Summary of Classification Accuracy on Holdout Dataset (# Correct Predictions / # Total Cases)

MODEL     INCORRECT           CORRECT
SVM       0.0629470672389127  0.937052932761087
TREE      0.108726752503577   0.891273247496423
MAXENT    0.0436337625178827  0.956366237482117
BOOSTING  0.0543633762517883  0.945636623748212

This seems almost too good to be true: according to this analysis, the models are accurate 89% to 96% of the time on the holdout dataset.
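
For context, a trivial classifier that always predicts the majority class of the holdout set would be right only about half the time:

# majority-class baseline on the holdout set (700 ham vs. 698 spam)
max(table(spam_labs[idx_test])) / length(idx_test)
## [1] 0.5007153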

4.3 Evaluate model performance

The RTextTools package offers several features that more formally measure model performance. The ranking of the models by precision, recall, and F-scores is similar to what we compiled above.

analytics <- create_analytics(
    container,
    cbind(SVM_out, TREE_out, MAXENT_out, BOOSTING_out)
)
summary(analytics)
## ENSEMBLE SUMMARY
## 
##        n-ENSEMBLE COVERAGE n-ENSEMBLE RECALL
## n >= 1                1.00              0.93
## n >= 2                1.00              0.93
## n >= 3                0.98              0.95
## n >= 4                0.90              0.98
## 
## 
## ALGORITHM PERFORMANCE
## 
##        SVM_PRECISION           SVM_RECALL           SVM_FSCORE 
##                0.945                0.940                0.940 
## LOGITBOOST_PRECISION    LOGITBOOST_RECALL    LOGITBOOST_FSCORE 
##                0.945                0.945                0.940 
##       TREE_PRECISION          TREE_RECALL          TREE_FSCORE 
##                0.900                0.890                0.895 
## MAXENTROPY_PRECISION    MAXENTROPY_RECALL    MAXENTROPY_FSCORE 
##                0.960                0.960                0.960
# create and view summaries 
topic_summary <- analytics@label_summary
#alg_summary <- analytics@algorithm_summary
#ens_summary <- analytics@ensemble_summary
#doc_summary <- analytics@document_summary

topic_summary
##   NUM_MANUALLY_CODED NUM_CONSENSUS_CODED NUM_PROBABILITY_CODED
## 0                700                 786                   743
## 1                698                 612                   655
##   PCT_CONSENSUS_CODED PCT_PROBABILITY_CODED PCT_CORRECTLY_CODED_CONSENSUS
## 0           112.28571             106.14286                      99.28571
## 1            87.67908              93.83954                      86.96275
##   PCT_CORRECTLY_CODED_PROBABILITY
## 0                        98.42857
## 1                        92.26361
#alg_summary
#ens_summary
#doc_summary

5 Conclusion

5.1 Findings

Whether assessed from the compiled results or from the formal analytics summaries produced by RTextTools, the classification algorithms rank as follows on this dataset:

  • #1: MAXENT 96% accuracy
  • #2: BOOSTING 95% accuracy
  • #3: SVM 94% accuracy
  • #4: TREE 89% accuracy

The analytics summary from RTextTools also indicates that if we relied on all four models together, we would achieve 98% recall with 90% coverage of the dataset; that is, all four models agree on 90% of the holdout emails, and on that subset the consensus label is correct 98% of the time.
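
We can verify those consensus figures directly from labels_out; a sketch (predictions converted to character so the factor columns are comparable):

# four-way consensus by hand: coverage = share of emails where all models
# agree; accuracy = how often that consensus label is correct
p <- data.frame(lapply(labels_out[-1], as.character), stringsAsFactors = FALSE)
agree <- p$svm == p$tree & p$svm == p$maxent & p$svm == p$boosting
mean(agree)                                            # coverage (~0.90)
mean(p$svm[agree] == labels_out$correct_label[agree])  # accuracy (~0.98)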

5.2 Suggestions for further work

Some interesting topics for further work include:

  • Develop models based on the email headers: instead of using the message content to develop the predictive models, one could use the headers. Which method is more effective / efficient?
  • Test whether results are sensitive to the data cleaning process: for instance, what would be the cost/benefit in performance if we hadn’t removed punctuation and numbers, or hadn’t lowercased the text, during data cleaning?
  • Randomize the ham/spam dataframe: could the models be exploiting the ordered sequence of ham (1-1400) and spam (1401-2796) messages as a predictive variable? This is unlikely, since the document-term matrix contains no positional features, but shuffling the rows would rule it out as an explanation for the high accuracy.
  • Test the other classification algorithms available in the RTextTools package: GLMNET, SLDA, BAGGING, RF, and NNET.
  • Try cross validation as a means of testing model performance: this can be done with the cross_validate function in RTextTools (a sketch follows below).
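
For instance, a sketch of 4-fold cross validation for the SVM model (not run in this report):

# sketch: 4-fold cross validation with RTextTools (untested here)
# SVM_cv <- cross_validate(container, 4, "SVM")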