Project 4

Assignment

Analyze a corpus of emails classified as spam or legitimate (dubbed “ham”). Develop a predictive process for classifying new email correctly.


Approach

The following describes our general approach towards a predictive model for classifying emails:

1. use the Tidyverse toolkit to clean and tokenize the dataset
2. create a Document Term Matrix (DTM) to describe the frequency of terms in the dataset
3. visualize descriptive findings with bar charts and word clouds
4. apply a machine learning model to develop a predictive classification model

Tidyverse

We use Tidyverse tools for text mining to handle data input, cleansing, and tokenization. Tidy structures can also be cast to a DTM.

Reference: Text Mining with R: A Tidy Approach.

Document Term Matrix

A key data structure in text mining is the DTM. DTMs serve as the input format for many machine learning models.

Machine learning

We classify emails by employing a Support Vector Machine model from the caret package.

Other text mining tools

As part of our approach, we tried additional text mining tools and methods in order to familiarize ourselves with other prevalent methodologies and inform our eventual process. As part of our experimentation, we:

- built a corpus with the tm package to demonstrate the recommended approach;
- used the quanteda package to enable more efficient processing; and
- attempted basic sentiment analysis of the corpus.

2. Clean

We start cleaning by removing headers. By specification, email headers may not contain blank lines, so the first blank line marks the end of the header block; we apply a regular expression to strip everything up to that point.

Notes on the regular expression:

- .*?\n\n: The question mark makes the match non-greedy, so it stops at the first double line break. The solution isn’t perfect: some headers contain multiple blank lines, so part of those headers survives. Nevertheless, most of the headers are removed and the solution appears adequate for our purposes.
- In our experience, stringr performed worse than base R for this string manipulation. However, our base R code requires more testing to fix a bug that lets some headers through.
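The header-stripping step can be sketched in base R as follows. This is a minimal illustration, not the code used for the report: the sample message and the helper name `strip_header()` are hypothetical, and we assume the PCRE engine (`perl = TRUE`) with the `(?s)` flag so that `.` is allowed to match line breaks.

```r
# Hypothetical helper: drop everything up to and including the first blank line.
strip_header <- function(email) {
  # (?s) lets "." match newlines; ".*?" is non-greedy, so the match stops
  # at the FIRST occurrence of two consecutive line breaks.
  sub("(?s)^.*?\n\n", "", email, perl = TRUE)
}

# Made-up message for illustration only.
raw <- "From: a@example.com\nSubject: hello\n\nBody line 1\n\nBody line 2"
body <- strip_header(raw)
cat(body)
```

Because only the first blank line is treated as the boundary, later blank lines inside the body are preserved, and a header that itself contains a blank line would be only partly removed.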

3. Tokenize

This approach creates a tall, narrow dataframe of tokens.

We get some cleaning for free from the unnest_tokens() function, which:

- removes punctuation;
- converts text to lower case; and
- removes white space.

We also anti-join a stop word list to remove stop words.

Notes on possibilities for additional cleaning:

- Stemming: we could improve the analysis by reducing words to their root form, or stem. This would have the additional benefit of reducing the size of the DTM.

##   email
## 1 --==_Exmh_267413022P\nContent-Type: text/plain; charset=us-ascii\n\n> From:  Anders Eriksson <aeriksson@fastmail.fm>\n> Date:  Thu, 22 Aug 2002 20:23:17 +0200\n>\n> \n> Oooops!\n> \n> Doesn't work at all. Got this on startup and on any attempt to change folde\n> r (which fail)\n\n~sigh~  I'd already found that and checked it in....apparently I did so after \nyou checked it out and before you sent this mail...I hoped I was fast enough \nthat you wouldn't see it.\n\nTry again!\n\nChris\n\n-- \nChris Garrigues                 http://www.DeepEddy.Com/~cwg/\nvirCIO                          http://www.virCIO.Com\n716 Congress, Suite 200\nAustin, TX  78701\t\t+1 512 374 0500\n\n  World War III:  The Wrong-Doers Vs. the Evil-Doers.\n\n\n\n\n--==_Exmh_267413022P\nContent-Type: application/pgp-signature\n\n-----BEGIN PGP SIGNATURE-----\nVersion: GnuPG v1.0.6 (GNU/Linux)\nComment: Exmh version 2.2_20000822 06/23/2000\n\niD8DBQE9ZUlVK9b4h5R0IUIRAr4LAJ9Mhzgw03dF2qiyqtMks72364uaqwCeJxp1\n23jNAVlrHHIDRMvMPXnfzoE=\n=HErg\n-----END PGP SIGNATURE-----\n\n--==_Exmh_267413022P--\n\n\n\n_______________________________________________\nExmh-workers mailing list\nExmh-workers@redhat.com\nhttps://listman.redhat.com/mailman/listinfo/exmh-workers\n\n
## 2 \n//this function should print all numbers up to 100...\n\nvoid print_nums()\n{\n  int i;\n\n  for(i = 0; i < 10l; i++) {\n    printf("%d\\n",i);\n  }\n\n}\n\n
## Joining, by = "word"
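A minimal sketch of the tokenizing step, assuming the tidytext and dplyr packages. The two messages and the column names `id`, `class`, and `text` are made up for illustration; the report's own data frame may be laid out differently.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-message data set.
emails <- tibble(
  id    = c("a", "b"),
  class = c("ham", "spam"),
  text  = c("Try again, Chris!", "FREE offer: click the link")
)

tokens <- emails %>%
  # One row per word; lower-cases and strips punctuation by default.
  unnest_tokens(word, text) %>%
  # Keep only the rows whose word is NOT in the stop word list.
  anti_join(stop_words, by = "word")

tokens
```

Passing `by = "word"` explicitly also silences the "Joining, by" message seen in the output above.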

Check counts for validation

## # A tibble: 11,972 x 2
##    word      n
##    <chr> <int>
##  1 3d     3532
##  2 font   2718
##  3 br     1810
##  4 td     1493
##  5 http   1407
##  6 size   1055
##  7 tr      924
##  8 color   788
##  9 width   787
## 10 1       638
## # … with 11,962 more rows

Ham Word Cloud

Spam Word Cloud

4. Term Frequency–Inverse Document Frequency

We calculate term frequency and inverse document frequency (TF-IDF) in preparation for producing a DTM.

## # A tibble: 13,669 x 6
##    class word       n      tf   idf  tf_idf
##    <fct> <chr>  <int>   <dbl> <dbl>   <dbl>
##  1 spam  td      1493 0.0250  0.693 0.0173 
##  2 spam  tr       924 0.0155  0.693 0.0107 
##  3 spam  align    532 0.00891 0.693 0.00618
##  4 spam  height   450 0.00754 0.693 0.00522
##  5 spam  border   391 0.00655 0.693 0.00454
##  6 spam  img      315 0.00527 0.693 0.00366
##  7 spam  arial    308 0.00516 0.693 0.00358
##  8 ham   rpm       94 0.00439 0.693 0.00304
##  9 spam  span     249 0.00417 0.693 0.00289
## 10 ham   exmh      86 0.00402 0.693 0.00278
## # … with 13,659 more rows
## Selecting by tf_idf
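The calculation can be sketched with `bind_tf_idf()` from tidytext. The counts below are made up, and we assume each class is treated as one "document", which is consistent with the idf of 0.693 (the natural log of 2) reported for every row in the table above.

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts per class.
counts <- tibble(
  class = c("spam", "spam", "ham", "ham"),
  word  = c("td", "free", "rpm", "free"),
  n     = c(3L, 1L, 2L, 2L)
)

scored <- counts %>%
  # tf  = n / total words in the class
  # idf = ln(number of classes / number of classes containing the word)
  bind_tf_idf(word, class, n) %>%
  arrange(desc(tf_idf))

scored
```

A word that occurs in both classes ("free" here) gets an idf of zero, so its tf-idf is zero regardless of how often it appears.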

5. Document Term Matrix

We make our first departure from tidy formats to a format that ML models can consume directly. We cast the counts to a DTM, then inspect it and remove sparse terms, which reduces the matrix from 11,972 to 1,983 terms and its sparsity from 99% to 97%.

## [1]   395 11972
## <<DocumentTermMatrix (documents: 395, terms: 11972)>>
## Non-/sparse entries: 36712/4692228
## Sparsity           : 99%
## Maximal term length: 89
## Weighting          : term frequency (tf)
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 2/23
## Sparsity           : 92%
## Maximal term length: 6
## Weighting          : term frequency (tf)
## Sample             :
##                                   Terms
## Docs                               æ arial class guides sc
##   7160290efcb7320ac8852369a695bcaf 0     0     0      0  0
##   770f0e7b8378a47a945043434f6f43df 0     0     0      0  0
##   829bab9379cfe32fe4b5af15ca99361b 0     2     0      0  0
##   8ff64b5c77f9c9618bd7b119ae14c8b2 0     0     0      0  0
##   f97a14d667569ebbc0502bb2c7beec27 0     2     0      0  0
## <<DocumentTermMatrix (documents: 395, terms: 11972)>>
## Non-/sparse entries: 36712/4692228
## Sparsity           : 99%
## Maximal term length: 89
## Weighting          : term frequency (tf)
## [1]   395 11972
##                                   Terms
## Docs                               æ arial sc guides class
##   7160290efcb7320ac8852369a695bcaf 0     0  0      0     0
##   770f0e7b8378a47a945043434f6f43df 0     0  0      0     0
##   829bab9379cfe32fe4b5af15ca99361b 0     2  0      0     0
##   8ff64b5c77f9c9618bd7b119ae14c8b2 0     0  0      0     0
##   f97a14d667569ebbc0502bb2c7beec27 0     2  0      0     0
## [1]  395 1983
## <<DocumentTermMatrix (documents: 395, terms: 1983)>>
## Non-/sparse entries: 23434/759851
## Sparsity           : 97%
## Maximal term length: 65
## Weighting          : term frequency (tf)
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 2/23
## Sparsity           : 92%
## Maximal term length: 6
## Weighting          : term frequency (tf)
## Sample             :
##                                   Terms
## Docs                               50 left redhat select server
##   7160290efcb7320ac8852369a695bcaf  0    0      0      0      0
##   770f0e7b8378a47a945043434f6f43df  0    0      0      0      0
##   829bab9379cfe32fe4b5af15ca99361b  0    0      0      0      5
##   8ff64b5c77f9c9618bd7b119ae14c8b2  0    0      0      0      0
##   f97a14d667569ebbc0502bb2c7beec27  0    0      0      0      5
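A small sketch of casting tidy counts to a DTM and trimming sparse terms, assuming tidytext's `cast_dtm()` and tm's `removeSparseTerms()`. The counts and the `sparse = 0.5` threshold are illustrative assumptions; the threshold used for the output above is not stated in the report.

```r
library(dplyr)
library(tidytext)
library(tm)

# Hypothetical per-document word counts.
counts <- tibble(
  id   = c("d1", "d1", "d2", "d3", "d3"),
  word = c("server", "arial", "server", "server", "redhat"),
  n    = c(5, 2, 1, 2, 1)
)

# One row per document, one column per term.
dtm <- counts %>% cast_dtm(id, word, n)

# Drop terms that are absent from too many documents.
dtm_small <- removeSparseTerms(dtm, sparse = 0.5)

dim(dtm)
dim(dtm_small)
```

Lower values of `sparse` are stricter: fewer terms survive, and the resulting matrix is smaller and denser.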

6. Segment

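We segment the data into training and test sets. In the table below, column 1 matches the training-set totals reported in the results (99 ham, 101 spam) and column 0 matches the test-set totals (100 ham, 99 spam).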

##       
##          0   1
##   ham  100  99
##   spam  99 101
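One way to produce a roughly even split within each class is sketched below in base R. The data frame, the seed, and the 0/1 `train` flag are assumptions made for illustration; the report does not show how its own split was drawn.

```r
set.seed(2019)  # arbitrary seed, only for reproducibility of the sketch

# Hypothetical labelled documents.
emails <- data.frame(
  id    = sprintf("doc%03d", 1:20),
  class = factor(rep(c("ham", "spam"), each = 10))
)

# Sample half of the row indices within each class for training.
train_idx <- unlist(lapply(
  split(seq_len(nrow(emails)), emails$class),
  function(idx) sample(idx, length(idx) %/% 2)
))

emails$train <- 0L
emails$train[train_idx] <- 1L

table(emails$class, emails$train)
```

Sampling within each class keeps the ham/spam balance about the same in both sets.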

7. Support Vector Machine

We build the predictive model as a supervised learning task, using an SVM for classification. We train the model on a subset of the DTM and test its ham/spam predictions on the remainder.
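The train-and-test cycle can be sketched with caret as follows. Everything specific here is an assumption for illustration: the two made-up term-count features, the `"svmLinear"` method (which relies on the kernlab package), the fixed cost `C = 1`, and the absence of resampling. The report does not state which SVM variant or tuning settings were used.

```r
library(caret)

# Hypothetical term counts: "free" is frequent in spam, "rpm" in ham.
x <- data.frame(
  free = c(rep(c(3, 4, 5, 2), 5), rep(c(0, 0, 1, 0), 5)),
  rpm  = c(rep(c(0, 1, 0, 0), 5), rep(c(4, 2, 3, 5), 5))
)
y <- factor(rep(c("spam", "ham"), each = 20), levels = c("ham", "spam"))

train_rows <- c(1:10, 21:30)   # half of each class for training
test_rows  <- c(11:20, 31:40)  # the remainder for testing

fit <- train(
  x = x[train_rows, ], y = y[train_rows],
  method    = "svmLinear",
  tuneGrid  = data.frame(C = 1),
  trControl = trainControl(method = "none")  # fit once, no resampling
)

pred <- predict(fit, newdata = x[test_rows, ])
cm <- confusionMatrix(pred, y[test_rows], positive = "ham")
cm$overall["Accuracy"]
```

With a real DTM, the matrix would first be converted to a data frame or matrix of term counts, with one column per retained term.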

8. Results

We examine the confusion matrices. First, out of interest, we look at the confusion matrix for the training set. As expected for data the model was fit on, it scores perfectly.

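The model is also accurate on the test set. In one run it was 97% accurate, with a p-value near zero for the test that accuracy exceeds the no-information rate. The positive class is ham. That run produced 2 false negatives (ham identified as spam) among 247 ham emails and 11 false positives (spam identified as ham) among 248 spam emails. The run shown below, on a smaller test set, reached 94.97% accuracy, with 1 false negative among 100 ham emails and 9 false positives among 99 spam emails.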

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham   99    0
##       spam   0  101
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9817, 1)
##     No Information Rate : 0.505      
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.000      
##             Specificity : 1.000      
##          Pos Pred Value : 1.000      
##          Neg Pred Value : 1.000      
##              Prevalence : 0.495      
##          Detection Rate : 0.495      
##    Detection Prevalence : 0.495      
##       Balanced Accuracy : 1.000      
##                                      
##        'Positive' Class : ham        
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham   99    9
##       spam   1   90
##                                           
##                Accuracy : 0.9497          
##                  95% CI : (0.9095, 0.9756)
##     No Information Rate : 0.5025          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.8995          
##                                           
##  Mcnemar's Test P-Value : 0.02686         
##                                           
##             Sensitivity : 0.9900          
##             Specificity : 0.9091          
##          Pos Pred Value : 0.9167          
##          Neg Pred Value : 0.9890          
##              Prevalence : 0.5025          
##          Detection Rate : 0.4975          
##    Detection Prevalence : 0.5427          
##       Balanced Accuracy : 0.9495          
##                                           
##        'Positive' Class : ham             
## 
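The headline statistics for the test set can be reproduced by hand from the four cell counts printed above. The sketch below uses base R only; the variable names are ours.

```r
# Test-set confusion matrix from the output above (filled column by column).
cm <- matrix(
  c(99, 1, 9, 90), nrow = 2,
  dimnames = list(Prediction = c("ham", "spam"), Reference = c("ham", "spam"))
)

tp <- cm["ham", "ham"]    # ham correctly predicted as ham
fn <- cm["spam", "ham"]   # ham wrongly predicted as spam
fp <- cm["ham", "spam"]   # spam wrongly predicted as ham
tn <- cm["spam", "spam"]  # spam correctly predicted as spam

metrics <- c(
  accuracy    = (tp + tn) / sum(cm),
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  ppv         = tp / (tp + fp),
  npv         = tn / (tn + fn)
)
round(metrics, 4)
```

Since ham is the positive class, sensitivity measures how much legitimate mail is kept, and specificity measures how much spam is caught.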

Appendix: Alternate Approaches

Appendix libraries

## Package version: 1.5.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

tm & VCorpus

There are only a few major packages focused on text mining in R, including tm, tidytext, corpus, and koRpus. While we found tidytext to be the most user-friendly and comprehensive, we also learned the ins and outs of the tm package and VCorpus, even though our file format did not exactly match any of the tm examples online. See below for what we found to be the most flexible grammars for working with tm.

##            used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
## Ncells  2637175 140.9    3908864 208.8         NA  3908864 208.8
## Vcells 14062650 107.3   25064332 191.3      32768 25064332 191.3

Export to Quanteda

Exporting metadata from a VCorpus to quanteda requires extra handling, so we note that exporting the corpus at this point, before cleaning, is an option.

##            used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
## Ncells  2643548 141.2    3908864 208.8         NA  3908864 208.8
## Vcells 14098494 107.6   25064332 191.3      32768 25064332 191.3

Cleaning with VCorpus

Clean text with gsub, content_transformer, and tm_map

We keep each cleaning rule on its own line as far as possible, so that individual rules can be turned on or off during testing. In production, however, they would likely be grouped.
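The one-rule-per-line style might look like the sketch below, assuming the tm package. The two sample documents, the helper name `replace_pattern`, and the particular rules are illustrative and are not the report's actual rule set.

```r
library(tm)

# Hypothetical two-document corpus.
docs <- VCorpus(VectorSource(c(
  "Hello, World! 123",
  "Visit http://example.com NOW"
)))

# Wrap gsub so it can be applied to document content through tm_map.
replace_pattern <- content_transformer(
  function(x, pattern, replacement) gsub(pattern, replacement, x)
)

# One rule per line, so each can be commented out during testing.
docs <- tm_map(docs, replace_pattern, "[[:punct:]]", " ")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, stripWhitespace)

content(docs[[1]])
```

Using `content_transformer()` keeps each document's metadata attached while only its text is modified.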

Advanced cleaning

While removing numbers and stopwords would probably make classification more robust over large corpora, doing so in this small dataset will likely yield a worse result. It is debatable whether removing stopwords and stemming would also remove some of the grammatical signatures of spam; this depends on our method of text processing.

Check the state of our VCorpus

## ==ham corpus==
## <<VCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  11
## Content:  chars: 1117
## ==ham corpus meta==
## $class
## [1] "ham"
## 
## attr(,"class")
## [1] "CorpusMeta"
## ==ham corpus meta class==
## [1] "ham"
## ==ham corpus document 2 metadata==
##   author       : character(0)
##   datetimestamp: 2019-11-17 20:04:16
##   description  : character(0)
##   heading      : character(0)
##   id           : 0002.b3120c4bcbf3101e661161ee7efcb8bf
##   language     : en
##   origin       : character(0)
##   date         :  Thu, 22 Aug 2002 12:46:18 +0100
##   to           :  zzzz@localhost.netnoteinc.com
##   from         :  Steve Burt <steve.burt@cursor-system.com>
##   subject      :  [zzzzteana] RE: Alexander
## ==ham document 1 content==
## [1] "    date        wed  aug   \n            chris garrigues cwgdatedfaddeepeddycom\n    messageid  tmdadeepeddyvirciocom\n\n\n    can  reproduce  error\n\n     repeatable like every time without fail\n\n   debug log   pick happening \n\n pickit exec pick inbox list lbrace lbrace subject ftp rbrace rbrace  sequence mercury\n exec pick inbox list lbrace lbrace subject ftp rbrace rbrace  sequence mercury\n ftocpickmsgs  hit\n marking  hits\n tkerror syntax error  expression int \n\nnote   run  pick command  hand \n\ndelta pick inbox list lbrace lbrace subject ftp rbrace rbrace   sequence mercury\n hit\n\n    hit comes  obviously   version  nmh  \nusing  \n\ndelta pick version\npick  nmh compiled  fuchsiacsmuozau  sun mar   ict \n\n  relevant part   mhprofile \n\ndelta mhparam pick\nseq sel list\n\n\nsince  pick command works  sequence actually    \none  explicit   command line   search popup  \none  comes  mhprofile  get created\n\nkre\n\nps   still using  version   code form  day ago   \n able  reach  cvs repository today local routing issue  think\n\n\n\n\nexmhworkers mailing list\nexmhworkersredhatcom\nhttpslistmanredhatcommailmanlistinfoexmhworkers\n"
##                                       Length Class             Mode
## 0001.ea7e79d3153e7469e7a9c3e0af6a357e 2      PlainTextDocument list
## 0002.b3120c4bcbf3101e661161ee7efcb8bf 2      PlainTextDocument list
## 0003.acfc5ad94bbd27118a0d8685d18c89dd 2      PlainTextDocument list
## 0004.e8d5727378ddde5c3be181df593f1712 2      PlainTextDocument list
## 0005.8c3b9e9c0f3f183ddaf7592a11b99957 2      PlainTextDocument list
## 0006.ee8b0dba12856155222be180ba122058 2      PlainTextDocument list
## <<VCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  11
## Content:  chars: 1138
## $class
## [1] "ham"
## 
## attr(,"class")
## [1] "CorpusMeta"
##   author       : character(0)
##   datetimestamp: 2019-11-17 20:04:16
##   description  : character(0)
##   heading      : character(0)
##   id           : 0002.b3120c4bcbf3101e661161ee7efcb8bf
##   language     : en
##   origin       : character(0)
##   date         :  Thu, 22 Aug 2002 12:46:18 +0100
##   to           :  zzzz@localhost.netnoteinc.com
##   from         :  Steve Burt <steve.burt@cursor-system.com>
##   subject      :  [zzzzteana] RE: Alexander
## [1] "ham"
## [1] "    date         wed   aug      \n             chris garrigues \n    message id   \n\n\n     can  reproduce  error \n\n     repeatable   like every time  without fail \n\n   debug log   pick happening  \n\n   pick   exec pick  inbox  list  lbrace  lbrace  subject ftp  rbrace  rbrace      sequence mercury \n   exec pick  inbox  list  lbrace  lbrace  subject ftp  rbrace  rbrace    sequence mercury\n   ftoc pickmsgs   hit \n   marking  hits\n   tkerror  syntax error  expression  int  \n\nnote    run  pick command  hand  \n\ndelta  pick  inbox  list  lbrace  lbrace  subject ftp  rbrace  rbrace     sequence mercury\n hit\n\n s     hit  comes   obviously    version  nmh  \nusing   \n\ndelta  pick  version\npick   nmh     compiled  fuchsia cs mu oz au  sun mar     ict  \n\n  relevant part    mh profile  \n\ndelta  mhparam pick\n seq sel  list\n\n\nsince  pick command works   sequence  actually      \none  s explicit   command line    search popup   \none  comes   mh profile   get created \n\nkre\n\nps    still using  version   code form  day ago    \n able  reach  cvs repository today  local routing issue  think \n\n\n\n \nexmh workers mailing list\nexmh workers redhat com\n"
##                                       Length Class             Mode
## 0001.ea7e79d3153e7469e7a9c3e0af6a357e 2      PlainTextDocument list
## 0002.b3120c4bcbf3101e661161ee7efcb8bf 2      PlainTextDocument list
## 0003.acfc5ad94bbd27118a0d8685d18c89dd 2      PlainTextDocument list
## 0004.e8d5727378ddde5c3be181df593f1712 2      PlainTextDocument list
## 0005.8c3b9e9c0f3f183ddaf7592a11b99957 2      PlainTextDocument list
## 0006.ee8b0dba12856155222be180ba122058 2      PlainTextDocument list

Export to data frame

Quickly create a data frame to check that our VCorpus can be exported to tidytext.

## Observations: 199
## Variables: 3
## $ text  <chr> "    date        wed  aug   \n            chris garrigues …
## $ class <chr> "ham", "ham", "ham", "ham", "ham", "ham", "ham", "ham", "h…
## $ id    <chr> "0001.ea7e79d3153e7469e7a9c3e0af6a357e", "0002.b3120c4bcbf…
## Observations: 199
## Variables: 3
## $ text  <chr> "    date         wed   aug      \n             chris garr…
## $ class <chr> "spam", "spam", "spam", "spam", "spam", "spam", "spam", "s…
## $ id    <chr> "0001.ea7e79d3153e7469e7a9c3e0af6a357e", "0002.b3120c4bcbf…
## Observations: 398
## Variables: 3
## $ text  <chr> "    date        wed  aug   \n            chris garrigues …
## $ class <chr> "ham", "ham", "ham", "ham", "ham", "ham", "ham", "ham", "h…
## $ id    <chr> "0001.ea7e79d3153e7469e7a9c3e0af6a357e", "0002.b3120c4bcbf…
## Observations: 398
## Variables: 3
## $ text  <chr> "    date        wed  aug   \n            chris garrigues …
## $ class <fct> ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, ham…
## $ id    <chr> "0001.ea7e79d3153e7469e7a9c3e0af6a357e", "0002.b3120c4bcbf…
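A sketch of the export, assuming the tm package. The two tiny corpora and the helper name `corpus_to_df()` are hypothetical; the columns `text`, `class`, and `id` mirror the output above.

```r
library(tm)

# Hypothetical corpora standing in for the ham and spam VCorpus objects.
ham  <- VCorpus(VectorSource(c("first ham message", "second ham message")))
spam <- VCorpus(VectorSource(c("first spam message", "second spam message")))

# Hypothetical helper: one row per document, with its text, label, and id.
corpus_to_df <- function(corpus, label) {
  idx <- seq_along(corpus)
  data.frame(
    text  = vapply(idx, function(i) paste(content(corpus[[i]]), collapse = "\n"),
                   character(1)),
    class = label,
    id    = vapply(idx, function(i) as.character(meta(corpus[[i]], "id")),
                   character(1)),
    stringsAsFactors = FALSE
  )
}

emails <- rbind(corpus_to_df(ham, "ham"), corpus_to_df(spam, "spam"))
emails$class <- factor(emails$class)  # classifiers expect a factor label

str(emails)
```

A data frame in this shape, with one row per document and a character `text` column, is a convenient input for tidytext tokenization.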