In this project, we analyze email messages and develop several models to predict whether they are spam. I chose to work with two sets of emails from the website given in the assignment (https://spamassassin.apache.org/publiccorpus/): the easy_ham_2 and spam_2 sets, as named in the directory paths used when loading the data.
For the text mining and modeling work, I used the tm and RTextTools packages:
tm: to load the messages, create a corpus and document-term matrix, and to prepare the data for modeling
RTextTools: to train models on the document-term matrix using different learning algorithms, and then evaluate their predictive performance.
# load required packages
library(tm)
library(stringr)
library(RTextTools)
library(knitr)
First we load the data. I downloaded and unzipped the files from the website https://spamassassin.apache.org/publiccorpus/ into separate directories, and then used the DirSource function to read all the files into two SimpleCorpus data structures. We have 1,400 ham messages and 1,396 spam messages.
# load the data
ham_raw <- SimpleCorpus(DirSource("HamSpam/easy_ham_2/"))
spam_raw <- SimpleCorpus(DirSource("HamSpam/spam_2/"))
ham_raw
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1400
spam_raw
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1396
# number of emails
n_ham <- length(ham_raw)
n_spam <- length(spam_raw)
N <- n_ham + n_spam
Once the messages are loaded, we can review some sample emails using the inspect function. Note that the headers are long, and are separated from the message content by a blank line. Also the sample ham and spam messages are 3,492 and 1,821 characters long, respectively.
# inspect sample emails
inspect(ham_raw[[1000]])
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 3492
##
## From fork-admin@xent.com Mon Aug 12 11:09:53 2002
## Return-Path: <fork-admin@xent.com>
## Delivered-To: yyyy@localhost.netnoteinc.com
## Received: from localhost (localhost [127.0.0.1])
## by phobos.labs.netnoteinc.com (Postfix) with ESMTP id BCD4244108
## for <jm@localhost>; Mon, 12 Aug 2002 05:57:02 -0400 (EDT)
## Received: from phobos [127.0.0.1]
## by localhost with IMAP (fetchmail-5.9.0)
## for jm@localhost (single-drop); Mon, 12 Aug 2002 10:57:02 +0100 (IST)
## Received: from xent.com ([64.161.22.236]) by dogma.slashnull.org
## (8.11.6/8.11.6) with ESMTP id g7BAVlb30446 for <jm@jmason.org>;
## Sun, 11 Aug 2002 11:31:47 +0100
## Received: from lair.xent.com (localhost [127.0.0.1]) by xent.com (Postfix)
## with ESMTP id 94EC929415D; Sun, 11 Aug 2002 03:28:05 -0700 (PDT)
## Delivered-To: fork@spamassassin.taint.org
## Received: from venus.phpwebhosting.com (venus.phpwebhosting.com
## [64.29.16.27]) by xent.com (Postfix) with SMTP id D7CD0294159 for
## <fork@xent.com>; Sun, 11 Aug 2002 03:27:25 -0700 (PDT)
## Received: (qmail 22327 invoked by uid 508); 11 Aug 2002 10:28:24 -0000
## Received: from unknown (HELO hydrogen.leitl.org) (62.155.144.56) by
## venus.phpwebhosting.com with SMTP; 11 Aug 2002 10:28:24 -0000
## Received: from localhost (eugen@localhost) by hydrogen.leitl.org
## (8.11.6/8.11.6) with ESMTP id g7BAS2U26714; Sun, 11 Aug 2002 12:28:06
## +0200
## X-Authentication-Warning: hydrogen.leitl.org: eugen owned process doing -bs
## From: Eugen Leitl <eugen@leitl.org>
## To: Gary Lawrence Murphy <garym@canada.com>
## Cc: fork <fork@spamassassin.taint.org>
## Subject: Re: Forged whitelist spam
## In-Reply-To: <m2r8h6qumb.fsf@maya.dyndns.org>
## Message-Id: <Pine.LNX.4.33.0208111214300.3981-100000@hydrogen.leitl.org>
## MIME-Version: 1.0
## Content-Type: TEXT/PLAIN; charset=US-ASCII
## Sender: fork-admin@xent.com
## Errors-To: fork-admin@xent.com
## X-Beenthere: fork@spamassassin.taint.org
## X-Mailman-Version: 2.0.11
## Precedence: bulk
## List-Help: <mailto:fork-request@xent.com?subject=help>
## List-Post: <mailto:fork@spamassassin.taint.org>
## List-Subscribe: <http://xent.com/mailman/listinfo/fork>, <mailto:fork-request@xent.com?subject=subscribe>
## List-Id: Friends of Rohit Khare <fork.xent.com>
## List-Unsubscribe: <http://xent.com/mailman/listinfo/fork>,
## <mailto:fork-request@xent.com?subject=unsubscribe>
## List-Archive: <http://xent.com/pipermail/fork/>
## Date: Sun, 11 Aug 2002 12:28:02 +0200 (CEST)
##
## On 10 Aug 2002, Gary Lawrence Murphy wrote:
##
## > My uneducated guess is that all they need to jump expensive whitelist
## > walls would be buckshot a spam-laden Klez with a 5-million-addresses
## > mailer; if it finds just one vulnerable host on an Exchange server,
## > through hopping addressbooks across a few degrees of freedom, a world
## > of whitelists are instantly breechable.
##
## You seem to be saying that whitelists are useless, because there are worms
## which can compromise your system, read your address book/whitelist, and
## sent themselves on, compromising a nonnegligible fraction of systems as
## they go along.
##
## While mailing lists can be spam/worm amplifiers, I don't think this is
## true for individual users even today. Moreover, worms which use email as
## vector exist *only* because a single vendor ships mailers with broken
## default settings, and insists to make documents executables. This makes
## for very bad press, and eventually that vendor is going to wise up, and
## stop shipping as many broken wares (or people will switch to more secure
## alternatives, whatever comes first).
##
##
##
## http://xent.com/mailman/listinfo/fork
inspect(spam_raw[[500]])
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1821
##
## From gjmy@public.ayptt.ha.cn Mon Jun 24 17:08:00 2002
## Return-Path: gyyyyy@public.ayptt.ha.cn
## Delivery-Date: Wed May 29 10:49:10 2002
## Received: from mandark.labs.netnoteinc.com ([213.105.180.140]) by
## dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g4T9n8O30150 for
## <jm@jmason.org>; Wed, 29 May 2002 10:49:08 +0100
## Received: from public.ayptt.ha.cn ([202.102.230.147]) by
## mandark.labs.netnoteinc.com (8.11.2/8.11.2) with ESMTP id g4T9n2701963 for
## <jm@netnoteinc.com>; Wed, 29 May 2002 10:49:07 +0100
## Received: from Margaret ([218.29.21.108]) by public.ayptt.ha.cn
## (8.9.1a/8.9.1) with SMTP id RAA29880; Wed, 29 May 2002 17:45:51 +0800
## (CST)
## Message-Id: <200205290945.RAA29880@public.ayptt.ha.cn>
## Reply-To: Margaret<plum318@163.com>
## From: "Margaret"<gyyyyy@public.ayptt.ha.cn>
## To: ""<ruud@RUUD.ORG>
## Date: Wed,29 May 2002 17:51:47 +0800
## X-Auto-Forward: To: ""<ruud@RUUD.ORG>
## X-Keywords:
## Subject:
##
## Dear Sirs,
## We know your esteemed company in beach towels from Internet, and pleased to introduce us as a leading producer of high quality 100% cotton velour printed towels in China, we sincerely hope to establish a long-term business relationship with your esteemed company in this field.
##
## Our major items are 100% cotton full printed velour towels of the following sizes and weights with a annual production capacity of one million dozens:
## Disney Standard:
## 30X60 inches, weight 305grams/SM, 350gram/PC
## 40X70 inches, weight 305grams/SM, 550gram/PC
## Please refer to our website http://www.jacquard-towel.com/index.html for more details ie patterns about our products.
## Once you are interested in our products, we will give you a more favorable price.
## Looking forward to hearing from you soon
## Thanks and best regards,
## Margaret/Sales Manager
## Henan Ziyang Textiles
## http://www.jacquard-towel.com
Before we start cleaning the data, let’s review a sample of the stopwords from the “SMART” stopword set. We observe that they are all lowercase, that some contain punctuation (apostrophes, as in “can't”), and that several are variants of the same stem (for example, “become” and “becoming”). We factor these observations into our sequence of text cleaning steps below.
sort(sample(stopwords("SMART"), 100))
## [1] "a" "able" "according" "across"
## [5] "ain't" "also" "always" "another"
## [9] "appear" "appropriate" "are" "associated"
## [13] "awfully" "become" "becoming" "behind"
## [17] "can't" "certain" "changes" "clearly"
## [21] "com" "come" "course" "currently"
## [25] "d" "edu" "etc" "every"
## [29] "everywhere" "exactly" "followed" "getting"
## [33] "gives" "gone" "help" "hence"
## [37] "here" "hereafter" "herein" "in"
## [41] "indicated" "it" "it'll" "known"
## [45] "last" "look" "looking" "ltd"
## [49] "many" "more" "moreover" "much"
## [53] "namely" "necessary" "neither" "non"
## [57] "not" "on" "other" "please"
## [61] "possible" "really" "regarding" "regardless"
## [65] "secondly" "see" "several" "should"
## [69] "since" "so" "thanx" "that"
## [73] "them" "thence" "theres" "these"
## [77] "they'll" "thoroughly" "thus" "too"
## [81] "two" "unfortunately" "unto" "up"
## [85] "uucp" "vs" "weren't" "when"
## [89] "whenever" "where" "while" "who's"
## [93] "why" "will" "would" "yes"
## [97] "you'll" "yours" "yourselves" "z"
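These observations suggest an order for the cleaning steps: lowercase first, then remove stopwords, and only then strip punctuation. The sketch below illustrates why, using base R only. The three-word list `toy_stopwords` and the helpers `remove_words` and `squish` are hypothetical stand-ins for illustration, not tm's `stopwords("SMART")` or `removeWords`.

```r
# Toy stand-ins (assumptions for illustration, not the tm implementations)
toy_stopwords <- c("can't", "the", "is")

remove_words <- function(x, words) {
  # delete whole-word matches of any stopword
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "", x, perl = TRUE)
}

# collapse runs of whitespace and trim the ends
squish <- function(x) trimws(gsub("\\s+", " ", x))

text <- "The offer is one you CAN'T miss"

# Order used in this analysis: lowercase -> stopwords -> punctuation
good <- tolower(text)
good <- remove_words(good, toy_stopwords)
good <- squish(gsub("[[:punct:]]", " ", good))

# Alternative order: punctuation is stripped before stopwords are removed
bad <- tolower(text)
bad <- gsub("[[:punct:]]", " ", bad)
bad <- squish(remove_words(bad, toy_stopwords))

print(good)
print(bad)
```

In the second ordering the apostrophe is gone before the stopword list is applied, so "can't" no longer matches and its fragments survive as separate tokens.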
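Now we undertake the following text cleaning steps in sequence, applying each with the tm_map function: remove the header, convert to lowercase, remove stopwords, remove punctuation, remove numbers, stem words, and remove extra whitespace.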
Note that some of these steps may remove information that could be useful in identifying spam. For instance, certain website / URL addresses in the header could be associated with certain spam senders; emails that use frequent UPPERCASE words or punctuation patterns (frequent !!!) may be associated with spam; and certain numbers (indicating phone numbers, dollar amounts, or IP addresses) could be predictive of spam. Removing these items may reduce predictive performance, but will force the learning algorithms to focus on the text words alone as a predictor of spam.
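If we wanted to keep some of this information, one option (not pursued in this analysis) would be to compute a few summary features from the raw text before cleaning. A minimal base R sketch follows; the function name `raw_text_features`, the choice of features, and the sample string are all assumptions made for illustration.

```r
# Hypothetical helper: count a few surface features of a raw message
raw_text_features <- function(text) {
  # number of non-overlapping matches of a pattern in a single string
  count_matches <- function(pattern, x) {
    lengths(regmatches(x, gregexpr(pattern, x)))
  }
  words <- strsplit(text, "\\s+")[[1]]
  letters_only <- gsub("[^A-Za-z]", "", words)
  c(
    exclamations    = count_matches("!", text),
    digits          = count_matches("[0-9]", text),
    # words of two or more letters written entirely in uppercase
    uppercase_words = sum(grepl("^[A-Z]{2,}$", letters_only))
  )
}

features <- raw_text_features("FREE money!!! Call 555-1234 NOW")
print(features)
```

Features like these could in principle be appended as extra columns alongside the term counts, although that step is outside the scope of the workflow described here.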
To accomplish the text cleaning process, we define a cleaning function that includes all the cleaning steps and then apply it to both the ham and spam datasets.
# define cleaning function
cleandata <- function(x) {
  tmp <- x
  # remove header
  tmp <- tm_map(tmp, str_replace, pattern = "^(.+\\n)+\\n", replacement = "")
  # convert to lowercase
  tmp <- tm_map(tmp, content_transformer(tolower))
  # remove stopwords
  tmp <- tm_map(tmp, removeWords, stopwords("SMART"))
  # remove punctuation
  tmp <- tm_map(tmp, str_replace_all, pattern = "[:punct:]", replacement = " ")
  # remove numbers
  tmp <- tm_map(tmp, removeNumbers)
  # stem words
  tmp <- tm_map(tmp, stemDocument)
  # remove extra whitespace
  tmp <- tm_map(tmp, stripWhitespace)
  return(tmp)
}
# clean ham and spam data
ham <- cleandata(ham_raw)
spam <- cleandata(spam_raw)
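To see what the header-removal pattern does, the sketch below applies the same regular expression to a small invented message. It uses base R's `sub` with `perl = TRUE` in place of `str_replace`; that substitution is an assumption made to keep the example free of package dependencies, and the two should agree on this simple input.

```r
# Invented toy message: header lines, a blank line, then the body
msg <- paste(
  "From: sender@example.com",
  "Subject: Hello",
  "",
  "First body line",
  "",
  "Second paragraph",
  sep = "\n"
)

# Same pattern as in cleandata: one or more non-empty lines at the start
# of the message, followed by the first blank line
body <- sub("^(.+\\n)+\\n", "", msg, perl = TRUE)
cat(body, "\n")
```

Because the pattern is anchored at the start and `.` does not cross a newline, only the block before the first blank line is removed; later blank lines inside the body are left alone.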
Afterwards, we inspect the same sample emails as before. Notice that the headers are gone, and the character lengths have been reduced to 590 and 534 for the ham and spam messages, respectively.
# inspect sample emails
inspect(ham[[1000]])
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 590
##
## aug gari lawrenc murphi wrote > uneduc guess jump expens whitelist > wall buckshot spam laden klez million address > mailer find vulner host exchang server > hop addressbook degre freedom world > whitelist instant breechabl whitelist useless worm compromis system read address book whitelist compromis nonneglig fraction system mail list spam worm amplifi true individu user today worm email vector exist singl vendor ship mailer broken default set insist make document execut make bad press eventu vendor wise stop ship broken ware peopl switch secur altern http xent mailman listinfo fork
inspect(spam[[500]])
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 534
##
## dear sir esteem compani beach towel internet pleas introduc lead produc high qualiti cotton velour print towel china sincer hope establish long term busi relationship esteem compani field major item cotton full print velour towel size weight annual product capac million dozen disney standard x inch weight gram sm gram pc x inch weight gram sm gram pc refer websit http www jacquard towel index html detail pattern product interest product give favor price forward hear margaret sale manag henan ziyang textil http www jacquard towel
Next we create document-term matrices for the ham and spam datasets, which we will use to find the most frequent terms in each dataset. We do this using two different term weightings: term frequency (TF) and term frequency-inverse document frequency (TF-IDF).
First, the document-term matrices using the term frequency weighting. Note that ham_dtm and spam_dtm are reported as 99% and 100% sparse (about 99.5% and 99.6% before rounding). We will reduce the sparsity before developing the predictive models, by setting a sparsity threshold below.
# create doc-term matrix: term frequency weighting
ham_dtm <- DocumentTermMatrix(ham)
spam_dtm <- DocumentTermMatrix(spam)
inspect(ham_dtm)
## <<DocumentTermMatrix (documents: 1400, terms: 15718)>>
## Non-/sparse entries: 112456/21892744
## Sparsity : 99%
## Maximal term length: 76
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs http linux list listinfo mail
## 00668.0c194428812d424ce5d9b0a39615b041 42 0 9 0 4
## 00693.2183b91fb14b93bdfaab337b915c98bb 7 0 1 1 1
## 00695.2de9d6d30a7713e550b4fd02bb35e7b4 6 0 1 1 0
## 00813.6598e1ef9134cf77f48bca239e4ba2dc 3 0 0 1 1
## 00869.0fbb783356f6875063681dc49cfcb1eb 18 0 0 1 0
## 00966.8ebefc5eaa53c3bf9ef1dfcec1ee2087 3 0 0 1 1
## 01060.95d3e0a8c47b33d1533f18ac2c60c81a 1 0 0 1 0
## 01317.7fc86413a091430c3104b041a6525131 239 17 12 1 4
## 01345.c40d5798193a4a060ec9f3d2321e37e4 38 59 3 0 2
## 01380.e3fad5af747d3a110008f94a046bf31b 5 5 26 0 0
## Terms
## Docs mailman net razor user www
## 00668.0c194428812d424ce5d9b0a39615b041 0 13 0 1 32
## 00693.2183b91fb14b93bdfaab337b915c98bb 1 13 0 4 2
## 00695.2de9d6d30a7713e550b4fd02bb35e7b4 1 0 0 4 2
## 00813.6598e1ef9134cf77f48bca239e4ba2dc 1 0 0 0 2
## 00869.0fbb783356f6875063681dc49cfcb1eb 1 0 0 0 0
## 00966.8ebefc5eaa53c3bf9ef1dfcec1ee2087 1 0 0 0 2
## 01060.95d3e0a8c47b33d1533f18ac2c60c81a 1 0 0 0 0
## 01317.7fc86413a091430c3104b041a6525131 1 220 0 155 19
## 01345.c40d5798193a4a060ec9f3d2321e37e4 0 0 0 28 30
## 01380.e3fad5af747d3a110008f94a046bf31b 0 0 0 122 1
inspect(spam_dtm)
## <<DocumentTermMatrix (documents: 1396, terms: 31729)>>
## Non-/sparse entries: 178443/44115241
## Sparsity : 100%
## Maximal term length: 121
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs align arial color face font http
## 00028.60393e49c90f750226bee6381eb3e69d 0 273 275 271 1627 79
## 00044.9f8c4b9ae007c6ded3d57476082bf2b2 4 5 7 5 20 97
## 00051.8b17ce16ace4d5845e2299c0123e1f14 9 18 18 20 41 82
## 00077.6e13224e39fae4b94bcbe0f5ae9f4939 9 18 18 20 41 92
## 00200.2fcabc2b58baa0ebc051e3ea3dfafd8f 12 18 14 24 60 2
## 00777.284d3dc66b4f1bdedb5a5eba41d18d14 19 6 6 10 60 7
## 00975.5e2e7c9d8b2c04929ff41e010163e5e8 23 4 47 24 113 6
## 01083.a6b3c50be5abf782b585995d2c11176b 0 0 4 0 0 8
## 01094.91779ec04e5e6b27e84297c28fc7369f 126 32 170 52 1102 516
## 01095.520dcad6e0ebb4d30222292f51ee76ab 126 32 170 52 1102 516
## Terms
## Docs nbsp size width www
## 00028.60393e49c90f750226bee6381eb3e69d 0 273 0 74
## 00044.9f8c4b9ae007c6ded3d57476082bf2b2 15 16 24 40
## 00051.8b17ce16ace4d5845e2299c0123e1f14 567 24 13 68
## 00077.6e13224e39fae4b94bcbe0f5ae9f4939 283 24 13 74
## 00200.2fcabc2b58baa0ebc051e3ea3dfafd8f 7 20 54 1
## 00777.284d3dc66b4f1bdedb5a5eba41d18d14 10 13 2 9
## 00975.5e2e7c9d8b2c04929ff41e010163e5e8 15 50 8 6
## 01083.a6b3c50be5abf782b585995d2c11176b 0 0 0 10
## 01094.91779ec04e5e6b27e84297c28fc7369f 339 447 0 407
## 01095.520dcad6e0ebb4d30222292f51ee76ab 339 447 0 407
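The sparsity figures can be reproduced from the entry counts that inspect reports. In the short sketch below the helper name `sparsity` is mine and the counts are copied from the output above; the result suggests that the printed percentage is rounded to the nearest whole number.

```r
# Share of zero entries, given the non-sparse and sparse entry counts
sparsity <- function(non_sparse, sparse) sparse / (non_sparse + sparse)

ham_sparsity  <- sparsity(112456, 21892744)   # counts from inspect(ham_dtm)
spam_sparsity <- sparsity(178443, 44115241)   # counts from inspect(spam_dtm)

# The two counts add up to documents x terms for each matrix
ham_cells  <- 1400 * 15718
spam_cells <- 1396 * 31729

print(round(100 * c(ham = ham_sparsity, spam = spam_sparsity), 2))
```

So even the "100%" spam matrix still has a small share of non-zero cells; it is just very sparse.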
Second, the document-term matrices using the TF-IDF weighting. As before, both DTMs are extremely sparse.
# create doc-term matrix: TFIDF weighting
ham_dtm2 <- DocumentTermMatrix(ham, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))
spam_dtm2 <- DocumentTermMatrix(spam, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))
inspect(ham_dtm2)
## <<DocumentTermMatrix (documents: 1400, terms: 15718)>>
## Non-/sparse entries: 112456/21892744
## Sparsity : 99%
## Maximal term length: 76
## Weighting : term frequency - inverse document frequency (tf-idf)
## Sample :
## Terms
## Docs exmh file freshmeat ilug
## 00663.660f0334bb6d89793e3d3bb5367cd9c1 0 0.000000 0.000 0
## 00668.0c194428812d424ce5d9b0a39615b041 0 0.000000 0.000 0
## 00813.6598e1ef9134cf77f48bca239e4ba2dc 0 0.000000 0.000 0
## 00869.0fbb783356f6875063681dc49cfcb1eb 0 0.000000 0.000 0
## 00966.8ebefc5eaa53c3bf9ef1dfcec1ee2087 0 5.249325 0.000 0
## 01060.95d3e0a8c47b33d1533f18ac2c60c81a 0 0.000000 0.000 0
## 01317.7fc86413a091430c3104b041a6525131 0 162.729083 1804.701 0
## 01345.c40d5798193a4a060ec9f3d2321e37e4 0 73.490553 0.000 0
## 01380.e3fad5af747d3a110008f94a046bf31b 0 640.417680 0.000 0
## 01389.e4cfb234aace4e12b2d9453686c911c9 0 2.624663 0.000 0
## Terms
## Docs linux net razor
## 00663.660f0334bb6d89793e3d3bb5367cd9c1 0.000000 31.49565 0
## 00668.0c194428812d424ce5d9b0a39615b041 0.000000 18.61106 0
## 00813.6598e1ef9134cf77f48bca239e4ba2dc 0.000000 0.00000 0
## 00869.0fbb783356f6875063681dc49cfcb1eb 0.000000 0.00000 0
## 00966.8ebefc5eaa53c3bf9ef1dfcec1ee2087 0.000000 0.00000 0
## 01060.95d3e0a8c47b33d1533f18ac2c60c81a 0.000000 0.00000 0
## 01317.7fc86413a091430c3104b041a6525131 20.699054 314.95648 0
## 01345.c40d5798193a4a060ec9f3d2321e37e4 71.837895 0.00000 0
## 01380.e3fad5af747d3a110008f94a046bf31b 6.087957 0.00000 0
## 01389.e4cfb234aace4e12b2d9453686c911c9 0.000000 18.61106 0
## Terms
## Docs rpm spam unison
## 00663.660f0334bb6d89793e3d3bb5367cd9c1 0.000000 0.000000 0.000
## 00668.0c194428812d424ce5d9b0a39615b041 0.000000 2.904317 0.000
## 00813.6598e1ef9134cf77f48bca239e4ba2dc 0.000000 0.000000 0.000
## 00869.0fbb783356f6875063681dc49cfcb1eb 0.000000 0.000000 0.000
## 00966.8ebefc5eaa53c3bf9ef1dfcec1ee2087 0.000000 0.000000 0.000
## 01060.95d3e0a8c47b33d1533f18ac2c60c81a 0.000000 0.000000 0.000
## 01317.7fc86413a091430c3104b041a6525131 6.294861 0.000000 0.000
## 01345.c40d5798193a4a060ec9f3d2321e37e4 0.000000 0.000000 0.000
## 01380.e3fad5af747d3a110008f94a046bf31b 0.000000 0.000000 2003.657
## 01389.e4cfb234aace4e12b2d9453686c911c9 0.000000 5.808633 0.000
inspect(spam_dtm2)
## <<DocumentTermMatrix (documents: 1396, terms: 31729)>>
## Non-/sparse entries: 178443/44115241
## Sparsity : 100%
## Maximal term length: 121
## Weighting : term frequency - inverse document frequency (tf-idf)
## Sample :
## Terms
## Docs align arial color
## 00028.60393e49c90f750226bee6381eb3e69d 0.00000 384.428837 306.330166
## 00051.8b17ce16ace4d5845e2299c0123e1f14 11.74479 25.346956 20.050702
## 00200.2fcabc2b58baa0ebc051e3ea3dfafd8f 15.65971 25.346956 15.594990
## 00777.284d3dc66b4f1bdedb5a5eba41d18d14 24.79455 8.448985 6.683567
## 00975.5e2e7c9d8b2c04929ff41e010163e5e8 30.01445 5.632657 52.354610
## 01083.a6b3c50be5abf782b585995d2c11176b 0.00000 0.000000 4.455712
## 01094.91779ec04e5e6b27e84297c28fc7369f 164.42700 45.061256 189.367739
## 01095.520dcad6e0ebb4d30222292f51ee76ab 164.42700 45.061256 189.367739
## 01097.98d732b93866d13b0c13589ae2acc383 0.00000 0.000000 0.000000
## 01359.deafa1d42658c6624c6809a446b7f369 0.00000 0.000000 0.000000
## Terms
## Docs face font height
## 00028.60393e49c90f750226bee6381eb3e69d 305.52841 1557.42178 0.000000
## 00051.8b17ce16ace4d5845e2299c0123e1f14 22.54822 39.24665 9.198765
## 00200.2fcabc2b58baa0ebc051e3ea3dfafd8f 27.05787 57.43412 60.711846
## 00777.284d3dc66b4f1bdedb5a5eba41d18d14 11.27411 57.43412 0.000000
## 00975.5e2e7c9d8b2c04929ff41e010163e5e8 27.05787 108.16759 31.275800
## 01083.a6b3c50be5abf782b585995d2c11176b 0.00000 0.00000 0.000000
## 01094.91779ec04e5e6b27e84297c28fc7369f 58.62538 1054.87326 0.000000
## 01095.520dcad6e0ebb4d30222292f51ee76ab 58.62538 1054.87326 0.000000
## 01097.98d732b93866d13b0c13589ae2acc383 0.00000 0.00000 0.000000
## 01359.deafa1d42658c6624c6809a446b7f369 0.00000 0.00000 0.000000
## Terms
## Docs nbsp size verdana
## 00028.60393e49c90f750226bee6381eb3e69d 0.0000 245.75153 630.92652
## 00051.8b17ce16ace4d5845e2299c0123e1f14 888.7725 21.60453 41.90656
## 00200.2fcabc2b58baa0ebc051e3ea3dfafd8f 10.9725 18.00378 37.25027
## 00777.284d3dc66b4f1bdedb5a5eba41d18d14 15.6750 11.70245 0.00000
## 00975.5e2e7c9d8b2c04929ff41e010163e5e8 23.5125 45.00944 23.28142
## 01083.a6b3c50be5abf782b585995d2c11176b 0.0000 0.00000 0.00000
## 01094.91779ec04e5e6b27e84297c28fc7369f 531.3825 402.38438 0.00000
## 01095.520dcad6e0ebb4d30222292f51ee76ab 531.3825 402.38438 0.00000
## 01097.98d732b93866d13b0c13589ae2acc383 0.0000 0.00000 0.00000
## 01359.deafa1d42658c6624c6809a446b7f369 0.0000 0.00000 0.00000
## Terms
## Docs width
## 00028.60393e49c90f750226bee6381eb3e69d 0.000000
## 00051.8b17ce16ace4d5845e2299c0123e1f14 17.606239
## 00200.2fcabc2b58baa0ebc051e3ea3dfafd8f 73.133609
## 00777.284d3dc66b4f1bdedb5a5eba41d18d14 2.708652
## 00975.5e2e7c9d8b2c04929ff41e010163e5e8 10.834609
## 01083.a6b3c50be5abf782b585995d2c11176b 0.000000
## 01094.91779ec04e5e6b27e84297c28fc7369f 0.000000
## 01095.520dcad6e0ebb4d30222292f51ee76ab 0.000000
## 01097.98d732b93866d13b0c13589ae2acc383 0.000000
## 01359.deafa1d42658c6624c6809a446b7f369 0.000000
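As a rough check on what the unnormalized TF-IDF weighting computes, the sketch below assumes each weight is the raw count multiplied by log2(N / df), where N is the number of documents and df is the number of documents containing the term. That assumption is consistent with the output above: for the term net in the ham set, the weights 18.61106 and 314.95648 correspond to counts of 13 and 220, and both imply the same per-count factor of about 1.4316. The toy matrix and the function name `tfidf_unnormalized` are invented for illustration.

```r
# Hypothetical helper: count * log2(N / df), applied to a dense docs x terms matrix
tfidf_unnormalized <- function(counts) {
  n_docs   <- nrow(counts)
  doc_freq <- colSums(counts > 0)   # number of documents containing each term
  idf      <- log2(n_docs / doc_freq)
  sweep(counts, 2, idf, `*`)        # scale each term column by its idf
}

counts <- matrix(
  c(2, 0, 1,
    0, 3, 1,
    0, 0, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3"), c("font", "linux", "http"))
)

weights <- tfidf_unnormalized(counts)
print(round(weights, 3))
```

In the toy matrix, http occurs in every document, so its idf is log2(1) = 0 and all of its weights vanish, however large the counts. This is the sense in which TF-IDF down-weights terms that are common across the whole collection.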
We can use several features of the tm package to do some exploratory data analysis.
First let’s find the most frequent terms in the ham and spam datasets, under both the TF and TF-IDF weightings. We use the findFreqTerms function, and set minimum frequency thresholds for each dataset to arrive at a list of roughly 30-50 of the most frequent words. Note that many terms are common between the TF and TF-IDF versions of the most-frequent term lists, but many are not.
# find most frequent terms in ham dtm
HAM_TF <- sort(findFreqTerms(ham_dtm, 500))
HAM_TFIDF <- sort(findFreqTerms(ham_dtm2, 1300))
# find most frequent terms in spam dtm
SPAM_TF <- sort(findFreqTerms(spam_dtm, 1300))
SPAM_TFIDF <- sort(findFreqTerms(spam_dtm2, 2400))
# fill in NA's in shorter vectors and display in a table
max_length <- max(length(HAM_TF), length(HAM_TFIDF), length(SPAM_TF), length(SPAM_TFIDF))
HAM_TF <- c(HAM_TF, rep(NA, max_length - length(HAM_TF)))
HAM_TFIDF <- c(HAM_TFIDF, rep(NA, max_length - length(HAM_TFIDF)))
SPAM_TF <- c(SPAM_TF, rep(NA, max_length - length(SPAM_TF)))
SPAM_TFIDF <- c(SPAM_TFIDF, rep(NA, max_length - length(SPAM_TFIDF)))
kable(data.frame(cbind(1:max_length, HAM_TF, HAM_TFIDF, SPAM_TF, SPAM_TFIDF)),
caption = "Most Frequent Terms in the Ham and Spam Datasets by TF and TFIDF Weightings")
| V1 | HAM_TF | HAM_TFIDF | SPAM_TF | SPAM_TFIDF |
|---|---|---|---|---|
| 1 | div | address | | align |
| 2 | exmh | exmh | align | arial |
| 3 | file | file | arial | bgcolor |
| 4 | fork | freshmeat | bgcolor | blockquote |
| 5 | group | ftoc | border | border |
| 6 | http | ilug | busi | busi |
| 7 | ilug | licens | cellpadding | cellpadding |
| 8 | inform | linux | cellspacing | cellspacing |
| 9 | irish | list | center | center |
| 10 | linux | | click | cfont |
| 11 | list | messag | color | color |
| 12 | listinfo | net | content | colspan |
| 13 | listmast | org | div | content |
| 14 | | peopl | | dcenter |
| 15 | mailman | perl | face | div |
| 16 | maintain | razor | ffffff | face |
| 17 | make | rpm | font | ffffff |
| 18 | messag | server | free | ffont |
| 19 | net | sourceforg | gif | font |
| 20 | org | spam | height | free |
| 21 | peopl | spamassassin | helvetica | geneva |
| 22 | razor | system | href | gif |
| 23 | rpm | time | html | grant |
| 24 | server | unison | http | height |
| 25 | sourceforg | user | imag | helvetica |
| 26 | spam | window | img | href |
| 27 | subscript | work | left | imag |
| 28 | system | NA | list | img |
| 29 | time | NA | | input |
| 30 | user | NA | nbsp | left |
| 31 | work | NA | net | margin |
| 32 | wrote | NA | option | money |
| 33 | www | NA | order | mso |
| 34 | NA | NA | receiv | nbsp |
| 35 | NA | NA | remov | net |
| 36 | NA | NA | san | option |
| 37 | NA | NA | serif | order |
| 38 | NA | NA | size | san |
| 39 | NA | NA | span | serif |
| 40 | NA | NA | src | site |
| 41 | NA | NA | strong | size |
| 42 | NA | NA | style | span |
| 43 | NA | NA | tabl | src |
| 44 | NA | NA | table | strong |
| 45 | NA | NA | text | style |
| 46 | NA | NA | time | tabl |
| 47 | NA | NA | top | table |
| 48 | NA | NA | type | tbody |
| 49 | NA | NA | verdana | text |
| 50 | NA | NA | width | top |
| 51 | NA | NA | www | type |
| 52 | NA | NA | NA | valign |
| 53 | NA | NA | NA | verdana |
| 54 | NA | NA | NA | width |
| 55 | NA | NA | NA | www |
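The threshold passed to findFreqTerms applies to each term's total over all documents. Under TF-IDF those totals are sums of weights rather than counts, which is why the thresholds used above differ between the two weightings. A base R sketch of the same idea on a toy matrix follows; the helper `freq_terms` is a hypothetical stand-in, not the tm function.

```r
# Hypothetical helper: terms whose column total is at least lowfreq
freq_terms <- function(m, lowfreq) {
  totals <- colSums(m)   # total per term across all documents
  sort(names(totals)[totals >= lowfreq])
}

counts <- matrix(
  c(5, 0, 1,
    4, 2, 0,
    6, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3"), c("http", "razor", "spam"))
)

frequent <- freq_terms(counts, 2)
print(frequent)
```

Note that a term can clear the threshold through a handful of very long documents, so a high total does not by itself mean the term is widespread.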
Finally, let’s take a peek at the ham and spam document-term matrices when we reduce the degree of sparsity. We do this using the removeSparseTerms function with a maximum sparsity threshold of 0.4; this removes terms that are absent from 40% or more of the documents, keeping only terms that appear in more than 60% of them.
# inspect reduced form of ham dtm
inspect(removeSparseTerms(ham_dtm, 0.4))
## <<DocumentTermMatrix (documents: 1400, terms: 5)>>
## Non-/sparse entries: 5618/1382
## Sparsity : 20%
## Maximal term length: 8
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs http list listinfo mailman www
## 00562.0f377593022357878ec2249f0c9a5f08 9 15 4 4 6
## 00664.28f4cb9fad800d0c7175d3a67e6c6458 22 4 0 0 15
## 00665.087e07e6a5f47598db0629c21e6e1a70 26 1 0 0 20
## 00666.009d6116caed8ebd2b48febcea7b6c38 34 0 0 0 26
## 00668.0c194428812d424ce5d9b0a39615b041 42 9 0 0 32
## 01303.80dd19a2b1d8496c48396b630179b00f 26 0 0 0 20
## 01317.7fc86413a091430c3104b041a6525131 239 12 1 1 19
## 01318.193fb7308fee59bb4aa70cc72191b0b1 76 1 0 0 38
## 01345.c40d5798193a4a060ec9f3d2321e37e4 38 3 0 0 30
## 01389.e4cfb234aace4e12b2d9453686c911c9 53 4 0 0 44
inspect(removeSparseTerms(ham_dtm2, 0.4))
## <<DocumentTermMatrix (documents: 1400, terms: 5)>>
## Non-/sparse entries: 5618/1382
## Sparsity : 20%
## Maximal term length: 8
## Weighting : term frequency - inverse document frequency (tf-idf)
## Sample :
## Terms
## Docs http list listinfo
## 00664.28f4cb9fad800d0c7175d3a67e6c6458 1.8435199 1.9938795 0.00000000
## 00665.087e07e6a5f47598db0629c21e6e1a70 2.1787054 0.4984699 0.00000000
## 00666.009d6116caed8ebd2b48febcea7b6c38 2.8490763 0.0000000 0.00000000
## 00668.0c194428812d424ce5d9b0a39615b041 3.5194471 4.4862288 0.00000000
## 01303.80dd19a2b1d8496c48396b630179b00f 2.1787054 0.0000000 0.00000000
## 01317.7fc86413a091430c3104b041a6525131 20.0273302 5.9816384 0.09036403
## 01318.193fb7308fee59bb4aa70cc72191b0b1 6.3685234 0.4984699 0.00000000
## 01345.c40d5798193a4a060ec9f3d2321e37e4 3.1842617 1.4954096 0.00000000
## 01380.e3fad5af747d3a110008f94a046bf31b 0.4189818 12.9602165 0.00000000
## 01389.e4cfb234aace4e12b2d9453686c911c9 4.4412071 1.9938795 0.00000000
## Terms
## Docs mailman www
## 00664.28f4cb9fad800d0c7175d3a67e6c6458 0.0000000 10.6208958
## 00665.087e07e6a5f47598db0629c21e6e1a70 0.0000000 14.1611944
## 00666.009d6116caed8ebd2b48febcea7b6c38 0.0000000 18.4095527
## 00668.0c194428812d424ce5d9b0a39615b041 0.0000000 22.6579110
## 01303.80dd19a2b1d8496c48396b630179b00f 0.0000000 14.1611944
## 01317.7fc86413a091430c3104b041a6525131 0.3040062 13.4531346
## 01318.193fb7308fee59bb4aa70cc72191b0b1 0.0000000 26.9062693
## 01345.c40d5798193a4a060ec9f3d2321e37e4 0.0000000 21.2417915
## 01380.e3fad5af747d3a110008f94a046bf31b 0.0000000 0.7080597
## 01389.e4cfb234aace4e12b2d9453686c911c9 0.0000000 31.1546276
# inspect reduced form of spam dtm
inspect(removeSparseTerms(spam_dtm, 0.4))
## <<DocumentTermMatrix (documents: 1396, terms: 2)>>
## Non-/sparse entries: 2024/768
## Sparsity : 28%
## Maximal term length: 4
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs html http
## 00028.60393e49c90f750226bee6381eb3e69d 13 79
## 00044.9f8c4b9ae007c6ded3d57476082bf2b2 27 97
## 00077.6e13224e39fae4b94bcbe0f5ae9f4939 11 92
## 00081.4c7fbdca38b8def54e276e75ec56682e 11 90
## 00117.9f0ba9c35b1fe59307e32b7c2c0d4e61 12 81
## 00145.b6788a48c1eace0b7c34ff7de32766f6 15 83
## 01094.91779ec04e5e6b27e84297c28fc7369f 10 516
## 01095.520dcad6e0ebb4d30222292f51ee76ab 10 516
## 01113.75ded32e6beb52dec1b6007dc86b47bb 61 64
## 01304.114140cd4c51e9795559b974964aa043 86 153
inspect(removeSparseTerms(spam_dtm2, 0.4))
## <<DocumentTermMatrix (documents: 1396, terms: 2)>>
## Non-/sparse entries: 2024/768
## Sparsity : 28%
## Maximal term length: 4
## Weighting : term frequency - inverse document frequency (tf-idf)
## Sample :
## Terms
## Docs html http
## 00028.60393e49c90f750226bee6381eb3e69d 8.847206 21.79661
## 00044.9f8c4b9ae007c6ded3d57476082bf2b2 18.374967 26.76292
## 00077.6e13224e39fae4b94bcbe0f5ae9f4939 7.486097 25.38339
## 00081.4c7fbdca38b8def54e276e75ec56682e 7.486097 24.83158
## 00145.b6788a48c1eace0b7c34ff7de32766f6 10.208315 22.90023
## 01017.11a80131a2ae31ad0a9969189de3c2bb 14.291641 16.27848
## 01094.91779ec04e5e6b27e84297c28fc7369f 6.805543 142.36772
## 01095.520dcad6e0ebb4d30222292f51ee76ab 6.805543 142.36772
## 01113.75ded32e6beb52dec1b6007dc86b47bb 41.513813 17.65801
## 01304.114140cd4c51e9795559b974964aa043 58.527671 42.21368
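The effect of the sparsity threshold can be sketched in base R, assuming a term is kept only when the share of documents in which it does not occur is below the threshold. The helper `drop_sparse_terms` and the toy counts are hypothetical and operate on a dense matrix rather than tm's sparse representation.

```r
# Hypothetical helper: keep terms whose share of zero entries is below `sparse`
drop_sparse_terms <- function(m, sparse) {
  zero_share <- colMeans(m == 0)   # fraction of documents lacking each term
  m[, zero_share < sparse, drop = FALSE]
}

counts <- cbind(
  common = c(1, 2, 1, 3, 0),   # missing from 1 of 5 documents (20%)
  medium = c(1, 0, 0, 2, 0),   # missing from 3 of 5 documents (60%)
  rare   = c(0, 0, 0, 0, 7)    # missing from 4 of 5 documents (80%)
)

reduced <- drop_sparse_terms(counts, 0.4)
print(colnames(reduced))
```

With a threshold as strict as 0.4, only near-ubiquitous terms survive, which matches the outputs above, where just five ham terms and two spam terms remain.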
For efficiency in the remainder of the analysis, we work only with the document-term matrices with the TF weighting. First we combine ham_dtm and spam_dtm into a single document-term matrix, and then check that the ham and spam emails line up in the combined matrix.
# create combined dtm
hamspam_dtm <- c(ham_dtm, spam_dtm)
hamspam_dtm
## <<DocumentTermMatrix (documents: 2796, terms: 42048)>>
## Non-/sparse entries: 290899/117275309
## Sparsity : 100%
## Maximal term length: 121
## Weighting : term frequency (tf)
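Combining the two matrices takes the union of their vocabularies, so the term counts imply that 15,718 + 31,729 - 42,048 = 5,399 terms appear in both the ham and spam sets. A toy base R sketch of that union-and-pad step follows; the helper `combine_counts` is hypothetical and works on dense matrices, so it only illustrates the idea rather than what tm does internally.

```r
# Hypothetical helper: stack two docs x terms matrices over the union of terms
combine_counts <- function(a, b) {
  terms <- union(colnames(a), colnames(b))
  pad <- function(m) {
    # start from all zeros, then copy in the columns this matrix has
    out <- matrix(0, nrow = nrow(m), ncol = length(terms),
                  dimnames = list(rownames(m), terms))
    out[, colnames(m)] <- m
    out
  }
  rbind(pad(a), pad(b))
}

ham_toy <- matrix(c(2, 1,
                    0, 3), nrow = 2, byrow = TRUE,
                  dimnames = list(c("h1", "h2"), c("list", "linux")))
spam_toy <- matrix(c(4, 1,
                     5, 0), nrow = 2, byrow = TRUE,
                   dimnames = list(c("s1", "s2"), c("font", "list")))

combined <- combine_counts(ham_toy, spam_toy)
shared_terms <- 15718 + 31729 - 42048   # overlap implied by the real term counts
print(combined)
```

Terms that occur in only one set are filled with zeros for the other set's documents, which is why the combined matrix is even sparser than either input.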
# check that ham and spam emails line up in combined dtm
identical(inspect(ham_dtm[1:n_ham, ]), inspect(hamspam_dtm[1:n_ham, ]))
## <<DocumentTermMatrix (documents: 1400, terms: 15718)>>
## Non-/sparse entries: 112456/21892744
## Sparsity : 99%
## Maximal term length: 76
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs http linux list listinfo mail
## 00668.0c194428812d424ce5d9b0a39615b041 42 0 9 0 4
## 00693.2183b91fb14b93bdfaab337b915c98bb 7 0 1 1 1
## 00695.2de9d6d30a7713e550b4fd02bb35e7b4 6 0 1 1 0
## 00813.6598e1ef9134cf77f48bca239e4ba2dc 3 0 0 1 1
## 00869.0fbb783356f6875063681dc49cfcb1eb 18 0 0 1 0
## 00966.8ebefc5eaa53c3bf9ef1dfcec1ee2087 3 0 0 1 1
## 01060.95d3e0a8c47b33d1533f18ac2c60c81a 1 0 0 1 0
## 01317.7fc86413a091430c3104b041a6525131 239 17 12 1 4
## 01345.c40d5798193a4a060ec9f3d2321e37e4 38 59 3 0 2
## 01380.e3fad5af747d3a110008f94a046bf31b 5 5 26 0 0
## Terms
## Docs mailman net razor user www
## 00668.0c194428812d424ce5d9b0a39615b041 0 13 0 1 32
## 00693.2183b91fb14b93bdfaab337b915c98bb 1 13 0 4 2
## 00695.2de9d6d30a7713e550b4fd02bb35e7b4 1 0 0 4 2
## 00813.6598e1ef9134cf77f48bca239e4ba2dc 1 0 0 0 2
## 00869.0fbb783356f6875063681dc49cfcb1eb 1 0 0 0 0
## 00966.8ebefc5eaa53c3bf9ef1dfcec1ee2087 1 0 0 0 2
## 01060.95d3e0a8c47b33d1533f18ac2c60c81a 1 0 0 0 0
## 01317.7fc86413a091430c3104b041a6525131 1 220 0 155 19
## 01345.c40d5798193a4a060ec9f3d2321e37e4 0 0 0 28 30
## 01380.e3fad5af747d3a110008f94a046bf31b 0 0 0 122 1
## <<DocumentTermMatrix (documents: 1400, terms: 42048)>>
## Non-/sparse entries: 112456/58754744
## Sparsity : 100%
## Maximal term length: 121
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs http linux list listinfo mail
## 00668.0c194428812d424ce5d9b0a39615b041 42 0 9 0 4
## 00693.2183b91fb14b93bdfaab337b915c98bb 7 0 1 1 1
## 00695.2de9d6d30a7713e550b4fd02bb35e7b4 6 0 1 1 0
## 00813.6598e1ef9134cf77f48bca239e4ba2dc 3 0 0 1 1
## 00869.0fbb783356f6875063681dc49cfcb1eb 18 0 0 1 0
## 00966.8ebefc5eaa53c3bf9ef1dfcec1ee2087 3 0 0 1 1
## 01060.95d3e0a8c47b33d1533f18ac2c60c81a 1 0 0 1 0
## 01317.7fc86413a091430c3104b041a6525131 239 17 12 1 4
## 01345.c40d5798193a4a060ec9f3d2321e37e4 38 59 3 0 2
## 01380.e3fad5af747d3a110008f94a046bf31b 5 5 26 0 0
## Terms
## Docs mailman net razor user www
## 00668.0c194428812d424ce5d9b0a39615b041 0 13 0 1 32
## 00693.2183b91fb14b93bdfaab337b915c98bb 1 13 0 4 2
## 00695.2de9d6d30a7713e550b4fd02bb35e7b4 1 0 0 4 2
## 00813.6598e1ef9134cf77f48bca239e4ba2dc 1 0 0 0 2
## 00869.0fbb783356f6875063681dc49cfcb1eb 1 0 0 0 0
## 00966.8ebefc5eaa53c3bf9ef1dfcec1ee2087 1 0 0 0 2
## 01060.95d3e0a8c47b33d1533f18ac2c60c81a 1 0 0 0 0
## 01317.7fc86413a091430c3104b041a6525131 1 220 0 155 19
## 01345.c40d5798193a4a060ec9f3d2321e37e4 0 0 0 28 30
## 01380.e3fad5af747d3a110008f94a046bf31b 0 0 0 122 1
## [1] TRUE
identical(inspect(spam_dtm[1:n_spam, ]), inspect(hamspam_dtm[(n_ham + 1):N, ]))
## <<DocumentTermMatrix (documents: 1396, terms: 31729)>>
## Non-/sparse entries: 178443/44115241
## Sparsity : 100%
## Maximal term length: 121
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs align arial color face font http
## 00028.60393e49c90f750226bee6381eb3e69d 0 273 275 271 1627 79
## 00044.9f8c4b9ae007c6ded3d57476082bf2b2 4 5 7 5 20 97
## 00051.8b17ce16ace4d5845e2299c0123e1f14 9 18 18 20 41 82
## 00077.6e13224e39fae4b94bcbe0f5ae9f4939 9 18 18 20 41 92
## 00200.2fcabc2b58baa0ebc051e3ea3dfafd8f 12 18 14 24 60 2
## 00777.284d3dc66b4f1bdedb5a5eba41d18d14 19 6 6 10 60 7
## 00975.5e2e7c9d8b2c04929ff41e010163e5e8 23 4 47 24 113 6
## 01083.a6b3c50be5abf782b585995d2c11176b 0 0 4 0 0 8
## 01094.91779ec04e5e6b27e84297c28fc7369f 126 32 170 52 1102 516
## 01095.520dcad6e0ebb4d30222292f51ee76ab 126 32 170 52 1102 516
## Terms
## Docs nbsp size width www
## 00028.60393e49c90f750226bee6381eb3e69d 0 273 0 74
## 00044.9f8c4b9ae007c6ded3d57476082bf2b2 15 16 24 40
## 00051.8b17ce16ace4d5845e2299c0123e1f14 567 24 13 68
## 00077.6e13224e39fae4b94bcbe0f5ae9f4939 283 24 13 74
## 00200.2fcabc2b58baa0ebc051e3ea3dfafd8f 7 20 54 1
## 00777.284d3dc66b4f1bdedb5a5eba41d18d14 10 13 2 9
## 00975.5e2e7c9d8b2c04929ff41e010163e5e8 15 50 8 6
## 01083.a6b3c50be5abf782b585995d2c11176b 0 0 0 10
## 01094.91779ec04e5e6b27e84297c28fc7369f 339 447 0 407
## 01095.520dcad6e0ebb4d30222292f51ee76ab 339 447 0 407
## <<DocumentTermMatrix (documents: 1396, terms: 42048)>>
## Non-/sparse entries: 178443/58520565
## Sparsity : 100%
## Maximal term length: 121
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs align arial color face font http
## 00028.60393e49c90f750226bee6381eb3e69d 0 273 275 271 1627 79
## 00044.9f8c4b9ae007c6ded3d57476082bf2b2 4 5 7 5 20 97
## 00051.8b17ce16ace4d5845e2299c0123e1f14 9 18 18 20 41 82
## 00077.6e13224e39fae4b94bcbe0f5ae9f4939 9 18 18 20 41 92
## 00200.2fcabc2b58baa0ebc051e3ea3dfafd8f 12 18 14 24 60 2
## 00777.284d3dc66b4f1bdedb5a5eba41d18d14 19 6 6 10 60 7
## 00975.5e2e7c9d8b2c04929ff41e010163e5e8 23 4 47 24 113 6
## 01083.a6b3c50be5abf782b585995d2c11176b 0 0 4 0 0 8
## 01094.91779ec04e5e6b27e84297c28fc7369f 126 32 170 52 1102 516
## 01095.520dcad6e0ebb4d30222292f51ee76ab 126 32 170 52 1102 516
## Terms
## Docs nbsp size width www
## 00028.60393e49c90f750226bee6381eb3e69d 0 273 0 74
## 00044.9f8c4b9ae007c6ded3d57476082bf2b2 15 16 24 40
## 00051.8b17ce16ace4d5845e2299c0123e1f14 567 24 13 68
## 00077.6e13224e39fae4b94bcbe0f5ae9f4939 283 24 13 74
## 00200.2fcabc2b58baa0ebc051e3ea3dfafd8f 7 20 54 1
## 00777.284d3dc66b4f1bdedb5a5eba41d18d14 10 13 2 9
## 00975.5e2e7c9d8b2c04929ff41e010163e5e8 15 50 8 6
## 01083.a6b3c50be5abf782b585995d2c11176b 0 0 0 10
## 01094.91779ec04e5e6b27e84297c28fc7369f 339 447 0 407
## 01095.520dcad6e0ebb4d30222292f51ee76ab 339 447 0 407
## [1] TRUE
# remove words that appear in 10 or fewer documents
j <- 10
hamspam_dtm <- removeSparseTerms(hamspam_dtm, 1 - j / N)
hamspam_dtm
## <<DocumentTermMatrix (documents: 2796, terms: 3657)>>
## Non-/sparse entries: 226899/9998073
## Sparsity : 98%
## Maximal term length: 33
## Weighting : term frequency (tf)
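To make the sparsity threshold concrete, here is a minimal base R sketch of the same idea on a toy matrix. As far as I understand removeSparseTerms, a threshold of 1 - j / N amounts to keeping only the terms that occur in more than roughly j documents. The matrix toy, the cutoff j_toy, and the term counts are made up for illustration and are not taken from the corpus.

```r
# toy document-term counts: 5 documents (rows) x 4 terms (columns); values are hypothetical
toy <- matrix(
  c(1, 0, 2, 0,
    3, 0, 0, 0,
    1, 1, 0, 0,
    2, 0, 0, 4,
    1, 0, 1, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("doc", 1:5), c("http", "razor", "font", "nbsp"))
)

# document frequency: number of documents in which each term occurs
doc_freq <- colSums(toy > 0)

# keep only the terms that occur in more than j_toy documents
j_toy <- 1
toy_kept <- toy[ , doc_freq > j_toy, drop = FALSE]
colnames(toy_kept)
```

Here only http and font survive, because razor and nbsp each occur in a single document. Filtering on document frequency rather than total count means a term repeated many times in one message (such as nbsp above) is still dropped.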
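Next we prepare the data for analysis using the RTextTools package. The steps are to define the spam labels, split the data into training and holdout (test) sets, and build the data container required by RTextTools.
# define spam flag: 0=ham, 1=spam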
spam_labs <- rep(0, n_ham)
spam_labs <- c(spam_labs, rep(1, n_spam))
# validate
sum(spam_labs[1:n_ham])
## [1] 0
sum(spam_labs[(n_ham + 1):N])
## [1] 1396
# split dataset in half and define the training set and the holdout (test) set
k <- 2
n_ham_train <- n_ham %/% k
n_spam_train <- n_spam %/% k
n_ham_test <- n_ham - n_ham_train
n_spam_test <- n_spam - n_spam_train
# define index of randomly selected training & test cases in the DTM
idx_ham_train <- sample(1:n_ham, n_ham_train)
idx_spam_train <- sample(1:n_spam, n_spam_train)
idx_train <- c(idx_ham_train, n_ham + idx_spam_train)
idx_test <- (1:N)[-idx_train]
# set up data container
container <- create_container(
  hamspam_dtm,
  labels = spam_labs,
  trainSize = idx_train,
  testSize = idx_test,
  virgin = FALSE
)
str(container)
## Formal class 'matrix_container' [package "RTextTools"] with 6 slots
## ..@ training_matrix :Formal class 'matrix.csr' [package "SparseM"] with 4 slots
## .. .. ..@ ra : num [1:113690] 1 1 2 2 1 1 1 1 2 1 ...
## .. .. ..@ ja : int [1:113690] 163 164 165 10 166 167 168 19 169 170 ...
## .. .. ..@ ia : int [1:1399] 1 39 83 169 211 254 309 335 364 405 ...
## .. .. ..@ dimension: int [1:2] 1398 3657
## ..@ classification_matrix:Formal class 'matrix.csr' [package "SparseM"] with 4 slots
## .. .. ..@ ra : num [1:113209] 2 1 1 1 1 5 1 1 1 1 ...
## .. .. ..@ ja : int [1:113209] 1 2 3 4 5 6 7 8 9 10 ...
## .. .. ..@ ia : int [1:1399] 1 163 354 396 462 487 532 637 685 711 ...
## .. .. ..@ dimension: int [1:2] 1398 3657
## ..@ training_codes : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## ..@ testing_codes : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## ..@ column_names : chr [1:3657] "addit" "age" "ago" "appar" ...
## ..@ virgin : logi FALSE
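Because the split relies on sample, the training and holdout sets (and therefore the accuracy figures) change from run to run unless a random seed is fixed earlier in the script. The sketch below shows one way to make a split of this kind reproducible. The seed value and the _demo variable names are my own choices for illustration; the class sizes are the ones reported for this corpus.

```r
# class sizes as reported for this corpus; the _demo names are hypothetical
n_ham_demo <- 1400
n_spam_demo <- 1396
N_demo <- n_ham_demo + n_spam_demo

# fixing the seed makes sample() return the same draw on every run
set.seed(2018)  # arbitrary value, chosen for illustration only
idx_ham_demo <- sample(seq_len(n_ham_demo), n_ham_demo %/% 2)
idx_spam_demo <- sample(seq_len(n_spam_demo), n_spam_demo %/% 2)

# spam rows follow the ham rows in the combined matrix, hence the offset
idx_train_demo <- c(idx_ham_demo, n_ham_demo + idx_spam_demo)
idx_test_demo <- setdiff(seq_len(N_demo), idx_train_demo)

# sanity checks: no overlap, and every document is used exactly once
length(intersect(idx_train_demo, idx_test_demo)) == 0
length(idx_train_demo) + length(idx_test_demo) == N_demo
```

Sampling ham and spam separately keeps the two classes balanced in both halves, which matches the 1398 by 3657 dimensions of the training and classification matrices shown above.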
We use the RTextTools package to develop and test several models on the ham/spam dataset. Nine different classification algorithms are available in RTextTools: SVM, GLMNET, MAXENT, SLDA, BOOSTING, BAGGING, RF, NNET, and TREE.
After some experimentation, I decided to use only the SVM, MAXENT, TREE, and BOOSTING algorithms. The other algorithms either took too long to run (SLDA, BAGGING, RF, and NNET) or could not be used with the set-up of my data container (GLMNET seemed to need the idx_train and idx_test indices to be numbered in sequence).
First we train the models on the training dataset, and then use the models to classify emails (ham/spam) in the holdout dataset.
# train models
SVM <- train_model(container, "SVM")
TREE <- train_model(container, "TREE")
MAXENT <- train_model(container, "MAXENT")
#GLMNET <- train_model(container, "GLMNET")
#SLDA <- train_model(container, "SLDA")
BOOSTING <- train_model(container, "BOOSTING")
#BAGGING <- train_model(container, "BAGGING")
#RF <- train_model(container, "RF")
#NNET <- train_model(container, "NNET")
# test models
SVM_out <- classify_model(container, SVM)
TREE_out <- classify_model(container, TREE)
MAXENT_out <- classify_model(container, MAXENT)
#GLMNET_out <- classify_model(container, GLMNET)
#SLDA_out <- classify_model(container, SLDA)
BOOSTING_out <- classify_model(container, BOOSTING)
#BAGGING_out <- classify_model(container, BAGGING)
#RF_out <- classify_model(container, RF)
#NNET_out <- classify_model(container, NNET)
Once the models have been run on the holdout dataset, we can compute their predictive accuracy. Here we measure accuracy by the proportion of total predictions that are correct. We compile the results for each model, and then summarize in a table.
# collect model predictions
labels_out <- data.frame(
  correct_label = spam_labs[idx_test],
  svm = SVM_out[ , 1],
  tree = TREE_out[ , 1],
  maxent = MAXENT_out[ , 1],
  #glmnet = GLMNET_out[ , 1],
  #slda = SLDA_out[ , 1],
  boosting = BOOSTING_out[ , 1]
  #bagging = BAGGING_out[ , 1],
  #rf = RF_out[ , 1],
  #nnet = NNET_out[ , 1]
)
# inspect model predictions; note rows 1-700 are ham, rows 701-1398 are spam
head(labels_out)
## correct_label svm tree maxent boosting
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
labels_out[680:720, ]
## correct_label svm tree maxent boosting
## 680 0 0 0 0 0
## 681 0 0 0 0 0
## 682 0 0 0 0 0
## 683 0 0 0 0 0
## 684 0 0 0 0 0
## 685 0 0 0 0 0
## 686 0 1 1 1 1
## 687 0 0 0 0 0
## 688 0 0 0 1 0
## 689 0 0 0 1 0
## 690 0 1 1 1 1
## 691 0 0 0 1 0
## 692 0 0 0 0 0
## 693 0 0 0 0 0
## 694 0 0 0 0 0
## 695 0 1 1 1 1
## 696 0 0 0 0 0
## 697 0 0 1 0 1
## 698 0 0 1 0 1
## 699 0 0 1 0 1
## 700 0 0 0 0 0
## 701 1 1 0 1 1
## 702 1 1 1 1 1
## 703 1 1 1 1 1
## 704 1 1 1 1 1
## 705 1 0 0 0 0
## 706 1 1 1 1 1
## 707 1 1 1 1 1
## 708 1 1 1 1 1
## 709 1 1 1 1 1
## 710 1 1 1 1 1
## 711 1 1 1 1 1
## 712 1 1 1 1 1
## 713 1 1 1 1 1
## 714 1 1 0 1 1
## 715 1 1 1 1 1
## 716 1 1 1 1 1
## 717 1 1 1 1 1
## 718 1 1 1 1 1
## 719 1 1 0 1 1
## 720 1 1 1 1 1
tail(labels_out)
## correct_label svm tree maxent boosting
## 1393 1 1 1 1 1
## 1394 1 1 1 1 1
## 1395 1 1 1 1 1
## 1396 1 1 1 1 1
## 1397 1 1 1 1 1
## 1398 1 1 0 1 1
# calc SVM accuracy (# correct predictions / total cases)
tab_svm <- table(labels_out[ , 1] == labels_out[ , 2])
(ptab_svm <- prop.table(tab_svm))
##
## FALSE TRUE
## 0.06294707 0.93705293
# create dataframe and compile accuracy for all models
model_names <- c("SVM", "TREE", "MAXENT", "BOOSTING")
# proportion of correct predictions per model, kept numeric so that kable() can round it
correct <- sapply(2:5, function(j) mean(labels_out[ , 1] == labels_out[ , j]))
results <- data.frame(MODEL = model_names, INCORRECT = 1 - correct, CORRECT = correct)
kable(results, digits = 4, caption = "Summary of Classification Accuracy on Holdout Dataset (# Correct Predictions / # Total Cases)")
| MODEL | INCORRECT | CORRECT |
|---|---|---|
| SVM | 0.0629 | 0.9371 |
| TREE | 0.1087 | 0.8913 |
| MAXENT | 0.0436 | 0.9564 |
| BOOSTING | 0.0544 | 0.9456 |
This seems too good to be true. According to this analysis, the models are accurate 89% to 96% of the time on the holdout dataset.
The RTextTools package offers several features that more formally measure model performance. The ranking of the models by precision, recall, and F-scores is similar to what we compiled above.
analytics <- create_analytics(
  container,
  cbind(SVM_out, TREE_out, MAXENT_out, BOOSTING_out)
)
summary(analytics)
## ENSEMBLE SUMMARY
##
## n-ENSEMBLE COVERAGE n-ENSEMBLE RECALL
## n >= 1 1.00 0.93
## n >= 2 1.00 0.93
## n >= 3 0.98 0.95
## n >= 4 0.90 0.98
##
##
## ALGORITHM PERFORMANCE
##
## SVM_PRECISION SVM_RECALL SVM_FSCORE
## 0.945 0.940 0.940
## LOGITBOOST_PRECISION LOGITBOOST_RECALL LOGITBOOST_FSCORE
## 0.945 0.945 0.940
## TREE_PRECISION TREE_RECALL TREE_FSCORE
## 0.900 0.890 0.895
## MAXENTROPY_PRECISION MAXENTROPY_RECALL MAXENTROPY_FSCORE
## 0.960 0.960 0.960
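For reference, precision, recall, and the F-score can be derived by hand from a confusion matrix. The base R sketch below does this for the spam class on a small set of made-up labels; truth and pred are hypothetical vectors, not the holdout results. As I understand it, the figures printed by summary(analytics) are averaged over both classes, so this sketch shows the per-class building block rather than reproducing those numbers.

```r
# hypothetical true labels and predictions (0 = ham, 1 = spam)
truth <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
pred  <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0)

# confusion matrix: rows are true labels, columns are predicted labels
conf_mat <- table(truth = truth, pred = pred)

# counts for the spam class (label "1")
tp <- conf_mat["1", "1"]  # spam correctly flagged
fp <- conf_mat["0", "1"]  # ham wrongly flagged as spam
fn <- conf_mat["1", "0"]  # spam that slipped through

precision <- tp / (tp + fp)  # share of flagged messages that really are spam
recall <- tp / (tp + fn)     # share of spam messages that were flagged
f_score <- 2 * precision * recall / (precision + recall)
round(c(precision = precision, recall = recall, f_score = f_score), 3)
```

For a spam filter the two error types are not equally costly: a false positive hides a legitimate message, so precision on the spam class is often the number to watch.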
# create and view summaries
topic_summary <- analytics@label_summary
#alg_summary <- analytics@algorithm_summary
#ens_summary <- analytics@ensemble_summary
#doc_summary <- analytics@document_summary
topic_summary
## NUM_MANUALLY_CODED NUM_CONSENSUS_CODED NUM_PROBABILITY_CODED
## 0 700 786 743
## 1 698 612 655
## PCT_CONSENSUS_CODED PCT_PROBABILITY_CODED PCT_CORRECTLY_CODED_CONSENSUS
## 0 112.28571 106.14286 99.28571
## 1 87.67908 93.83954 86.96275
## PCT_CORRECTLY_CODED_PROBABILITY
## 0 98.42857
## 1 92.26361
#alg_summary
#ens_summary
#doc_summary
Assessed from either the compiled results or the formal analytics summaries produced by RTextTools, it is apparent that the classification algorithms can be ranked as follows for this dataset, from most to least accurate: MAXENT, BOOSTING, SVM, and TREE (BOOSTING and SVM are nearly tied).
The ensemble summary from RTextTools also indicates that all four models agree on about 90% of the holdout messages (coverage), and that when they do agree, the consensus label is correct about 98% of the time.
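The coverage and accuracy of a unanimous ensemble can also be computed directly from a table of predictions laid out like labels_out. The sketch below uses a small made-up data frame (demo) rather than the real holdout predictions, and it reflects my reading of the ensemble summary: coverage is the share of messages on which all models agree, and accuracy is measured only on those messages.

```r
# hypothetical predictions from four models on eight messages (0 = ham, 1 = spam)
demo <- data.frame(
  correct_label = c(0, 0, 0, 0, 1, 1, 1, 1),
  svm           = c(0, 0, 0, 1, 1, 1, 1, 0),
  tree          = c(0, 0, 1, 1, 1, 1, 0, 0),
  maxent        = c(0, 0, 0, 1, 1, 1, 1, 1),
  boosting      = c(0, 0, 0, 1, 1, 1, 1, 0)
)
preds <- as.matrix(demo[ , -1])

# a message is "covered" when all four models predict the same label
unanimous <- apply(preds, 1, function(p) length(unique(p)) == 1)
coverage <- mean(unanimous)

# accuracy of the unanimous label, restricted to the covered messages
agreed_accuracy <- mean(preds[unanimous, 1] == demo$correct_label[unanimous])
c(coverage = coverage, agreed_accuracy = agreed_accuracy)
```

In this toy example the models agree on five of the eight messages (coverage 0.625) and the agreed label is right for four of those five (accuracy 0.8). The trade-off is the same as in the real summary: requiring agreement raises accuracy but leaves some messages without a consensus label.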
Some interesting topics for further work include testing the remaining algorithms in the RTextTools package: GLMNET, SLDA, BAGGING, RF, and NNET.