Document Classification

1. Began by loading the required packages

```r
library(stringr)
library(RCurl)
library(tm)
library(RTextTools)
library(knitr)
```
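
If any of these packages are missing, they can be installed first; a minimal sketch (package names as loaded above):

```r
# Install any packages that are not yet available (only needed once per machine).
install.packages(c("stringr", "RCurl", "tm", "RTextTools", "knitr"))
```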

2. Focusing first on the ham examples, I took the following steps:

* Downloaded the ham files from the corpus available at https://spamassassin.apache.org/publiccorpus/
* Used the base R "readLines" function to read the lines of the first ham email
* Began building the corpus for analysis with the "str_c" function from the stringr package

```r
opts_knit$set(root.dir = 'C:/Users/jenieman/Documents/CUNY/Data 607/HW11/easy_ham/')
hamlist <- list.files("C:/Users/jenieman/Documents/CUNY/Data 607/HW11/easy_ham")
n <- length(hamlist)
temp <- c(readLines(hamlist[1]))
corpham <- str_c(temp, collapse = "")
hams <- "ham"
```
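
A quick peek at the collapsed text (a sketch, purely as a sanity check) confirms that the first email is now a single string:

```r
# Show the first 80 characters of the first collapsed ham message.
substr(corpham, 1, 80)
```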

3. Created a loop to finish the corpus for all of the downloaded ham emails and to build a matching vector of “ham” labels.

```r
for (i in 2:n) {
  temp <- readLines(hamlist[i])
  temp <- str_c(temp, collapse = "")
  corpham <- c(corpham, temp)
  hams <- c(hams, "ham")
}
```
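
Growing a vector inside a loop works but is slow for large corpora; here is a minimal sketch of an equivalent vectorized version (same packages, assuming the working directory is still the easy_ham folder set above):

```r
# Read and collapse every ham file in one pass, then label them all at once.
corpham <- sapply(hamlist, function(f) str_c(readLines(f), collapse = ""), USE.NAMES = FALSE)
hams <- rep("ham", length(corpham))
```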

4. Finished the ham corpus by combining the “ham” labels with the text corpus and giving the resulting matrix column names.

```r
hams <- as.matrix(hams, ncol=1)
corpham <- as.matrix(corpham, ncol=1)
ham2 <- cbind(hams, corpham)
colnames(ham2) <- c("Type", "Text")
```
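
A quick structural check (a sketch) verifies that the labels and text line up, one row per email with the two named columns:

```r
# Expect one row per ham email and the columns "Type" and "Text".
dim(ham2)
colnames(ham2)
```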

An example of a ham email from the corpus:

```r
ham2[4,2]
## Text
## "From irregulars-admin@tb.tf  Thu Aug 22 14:23:39 2002Return-Path: <irregulars-admin@tb.tf>Delivered-To: zzzz@localhost.netnoteinc.comReceived: from localhost (localhost [127.0.0.1])\tby phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9DAE147C66\tfor <zzzz@localhost>; Thu, 22 Aug 2002 09:23:38 -0400 (EDT)Received: from phobos [127.0.0.1]\tby localhost with IMAP (fetchmail-5.9.0)\tfor zzzz@localhost (single-drop); Thu, 22 Aug 2002 14:23:38 +0100 (IST)Received: from web.tb.tf (route-64-131-126-36.telocity.com    [64.131.126.36]) by dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id    g7MDGOZ07922 for <zzzz-irr@example.com>; Thu, 22 Aug 2002 14:16:24 +0100Received: from web.tb.tf (localhost.localdomain [127.0.0.1]) by web.tb.tf    (8.11.6/8.11.6) with ESMTP id g7MDP9I16418; Thu, 22 Aug 2002 09:25:09    -0400Received: from red.harvee.home (red [192.168.25.1] (may be forged)) by    web.tb.tf (8.11.6/8.11.6) with ESMTP id g7MDO4I16408 for    <irregulars@tb.tf>; Thu, 22 Aug 2002 09:24:04 -0400Received: from prserv.net (out4.prserv.net [32.97.166.34]) by    red.harvee.home (8.11.6/8.11.6) with ESMTP id g7MDFBD29237 for    <irregulars@tb.tf>; Thu, 22 Aug 2002 09:15:12 -0400Received: from [209.202.248.109]    (slip-32-103-249-10.ma.us.prserv.net[32.103.249.10]) by prserv.net (out4)    with ESMTP id <2002082213150220405qu8jce>; Thu, 22 Aug 2002 13:15:07 +0000MIME-Version: 1.0X-Sender: @ (Unverified)Message-Id: <p04330137b98a941c58a8@[209.202.248.109]>To: undisclosed-recipient: ;From: Monty Solomon <monty@roscom.com>Content-Type: text/plain; charset=\"us-ascii\"Subject: [IRR] Klez: The Virus That  Won't DieSender: irregulars-admin@tb.tfErrors-To: irregulars-admin@tb.tfX-Beenthere: irregulars@tb.tfX-Mailman-Version: 2.0.6Precedence: bulkList-Help: <mailto:irregulars-request@tb.tf?subject=help>List-Post: <mailto:irregulars@tb.tf>List-Subscribe: <http://tb.tf/mailman/listinfo/irregulars>,    <mailto:irregulars-request@tb.tf?subject=subscribe>List-Id: New home of the TBTF Irregulars mailing list <irregulars.tb.tf>List-Unsubscribe: <http://tb.tf/mailman/listinfo/irregulars>,    <mailto:irregulars-request@tb.tf?subject=unsubscribe>List-Archive: <http://tb.tf/mailman/private/irregulars/>Date: Thu, 22 Aug 2002 09:15:25 -0400Klez: The Virus That Won't Die Already the most prolific virus ever, Klez continues to wreak havoc.Andrew Brandt>>From the September 2002 issue of PC World magazinePosted Thursday, August 01, 2002The Klez worm is approaching its seventh month of wriggling across the Web, making it one of the most persistent viruses ever. And experts warn that it may be a harbinger of new viruses that use a combination of pernicious approaches to go from PC to PC.Antivirus software makers Symantec and McAfee both report more than 2000 new infections daily, with no sign of letup at press time. The British security firm MessageLabs estimates that 1 in every 300 e-mail messages holds a variation of the Klez virus, and says that Klez has already surpassed last summer's SirCam as the most prolific virus ever.And some newer Klez variants aren't merely nuisances--they can carry other viruses in them that corrupt your data....http://www.pcworld.com/news/article/0,aid,103259,00.asp_______________________________________________Irregulars mailing listIrregulars@tb.tfhttp://tb.tf/mailman/listinfo/irregulars"
```

5. Repeated steps 2-4 for the spam corpus

```r
opts_knit$set(root.dir = 'C:/Users/jenieman/Documents/CUNY/Data 607/HW11/spam/')
spamlist <- list.files("C:/Users/jenieman/Documents/CUNY/Data 607/HW11/spam")
m <- length(spamlist)
temp <- c(readLines(spamlist[1]))
corpspam <- str_c(temp, collapse = "")
spams <- "spam"
for (i in 2:m) {
  temp <- readLines(spamlist[i])
  temp <- str_c(temp, collapse = "")
  corpspam <- c(corpspam, temp)
  spams <- c(spams, "spam")
}

spams <- as.matrix(spams, ncol=1)
corpspam <- as.matrix(corpspam, ncol=1)
spam2 <- cbind(spams, corpspam)
colnames(spam2) <- c("Type", "Text")
```

An example of a spam email from the corpus:

```r
spam2[4,2]
## Text
## "From sabrina@mx3.1premio.com  Thu Aug 22 14:44:07 2002Return-Path: <sabrina@mx3.1premio.com>Delivered-To: zzzz@localhost.example.comReceived: from localhost (localhost [127.0.0.1])\tby phobos.labs.example.com (Postfix) with ESMTP id 1E90847C66\tfor <zzzz@localhost>; Thu, 22 Aug 2002 09:44:02 -0400 (EDT)Received: from mail.webnote.net [193.120.211.219]\tby localhost with POP3 (fetchmail-5.9.0)\tfor zzzz@localhost (single-drop); Thu, 22 Aug 2002 14:44:03 +0100 (IST)Received: from email.qves.com (email1.qves.net [209.63.151.251] (may be forged))\tby webnote.net (8.9.3/8.9.3) with ESMTP id OAA04953\tfor <zzzz@example.com>; Thu, 22 Aug 2002 14:37:23 +0100Received: from qvp0086 ([169.254.6.17]) by email.qves.com with Microsoft SMTPSVC(5.0.2195.2966);\t Thu, 22 Aug 2002 07:36:20 -0600From: \"Slim Down\" <sabrina@mx3.1premio.com>To: <zzzz@example.com>Subject: Guaranteed to lose 10-12 lbs in 30 days                          11.150Date: Thu, 22 Aug 2002 07:36:19 -0600Message-ID: <9a63c01c249e0$e5a9d610$1106fea9@freeyankeedom.com>MIME-Version: 1.0Content-Type: text/plain;\tcharset=\"iso-8859-1\"Content-Transfer-Encoding: 7bitX-Mailer: Microsoft CDO for Windows 2000Thread-Index: AcJJ4OWpowGq7rdNSwCz5HE3x9ZZDQ==Content-Class: urn:content-classes:messageX-MimeOLE: Produced By Microsoft MimeOLE V6.00.2462.0000X-OriginalArrivalTime: 22 Aug 2002 13:36:20.0969 (UTC) FILETIME=[E692FD90:01C249E0]1) Fight The Risk of Cancer!http://www.adclick.ws/p.cfm?o=315&s=pk0072) Slim Down - Guaranteed to lose 10-12 lbs in 30 dayshttp://www.adclick.ws/p.cfm?o=249&s=pk0073) Get the Child Support You Deserve - Free Legal Advicehttp://www.adclick.ws/p.cfm?o=245&s=pk0024) Join the Web's Fastest Growing Singles Communityhttp://www.adclick.ws/p.cfm?o=259&s=pk0075) Start Your Private Photo Album Online!http://www.adclick.ws/p.cfm?o=283&s=pk007Have a Wonderful Day,Offer ManagerPrizeMamaIf you wish to leave this list please use the link below.http://www.qves.com/trim/?zzzz@example.com%7C17%7C308417"
```

6. Created a document-term matrix using the following steps:

* Combined the corpus of ham emails with the corpus of spam emails
* Created a sample of 3000 out of the 3052 available documents to shuffle the spam and ham examples
* Used the "create_matrix" function in the RTextTools package to build the matrix
* To improve the comparison I removed punctuation, stop words, numbers, and sparse terms (sparsity threshold of 0.9)

```r
hamspam <- as.matrix(rbind(spam2, ham2))
hamspam2 <- hamspam[sample(1:3052, size = 3000, replace = FALSE),]
# Check that the ham and spam examples are shuffled
hamspam2[1:20,1]
##  [1] "ham"  "ham"  "ham"  "ham"  "spam" "ham"  "ham"  "ham"  "spam" "ham" 
## [11] "spam" "ham"  "ham"  "ham"  "ham"  "ham"  "ham"  "ham"  "ham"  "ham"
```r
hsmat <- create_matrix(hamspam2, language = "english", removePunctuation = TRUE, removeStopwords = TRUE, removeNumbers = TRUE, removeSparseTerms = 0.9)
hsmat
## <<DocumentTermMatrix (documents: 3000, terms: 145)>>
## Non-/sparse entries: 121608/313392
## Sparsity           : 72%
## Maximal term length: 53
## Weighting          : term frequency (tf)
```
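
The tm package offers a quick way to inspect the terms that survived the sparsity filter; a sketch using `findFreqTerms` (the frequency cutoff is arbitrary):

```r
# List terms that occur at least 1000 times across the 3000 sampled documents.
findFreqTerms(hsmat, lowfreq = 1000)
```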

7. Created a container using the “create_container” function from the RTextTools package splitting the sample of 3000 into 2/3 training data and 1/3 testing data.

```r
type <- unlist(hamspam2[,1])
container <- create_container(hsmat, as.numeric(factor(type)), trainSize = 1:2000, testSize = 2001:3000, virgin = FALSE)
```
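
The container stores the labels as numeric codes; since factor levels are sorted alphabetically, "ham" becomes 1 and "spam" becomes 2, which is how the codes in the confusion matrices below should be read. A quick check:

```r
# Confirm the label-to-code mapping used when building the container.
levels(factor(type))  # "ham" -> 1, "spam" -> 2
```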

8. Trained a model for each of 8 different algorithms using the “train_model” function from the RTextTools package. The article at https://journal.r-project.org/archive/2013-1/collingwood-jurka-boydstun-etal.pdf was very helpful.

```r
SVM <- train_model(container, "SVM")
GLMNET <- train_model(container, "GLMNET")
MAXENT <- train_model(container, "MAXENT")
SLDA <- train_model(container, "SLDA")
BOOSTING <- train_model(container, "BOOSTING")
BAGGING <- train_model(container, "BAGGING")
RF <- train_model(container, "RF")
TREE <- train_model(container, "TREE")
```

9. Classified the test data with each of the 8 trained models using the “classify_model” function from RTextTools

```r
SVM_CLASSIFY <- classify_model(container, SVM)
GLMNET_CLASSIFY <- classify_model(container, GLMNET)
MAXENT_CLASSIFY <- classify_model(container, MAXENT)
SLDA_CLASSIFY <- classify_model(container, SLDA)
BOOSTING_CLASSIFY <- classify_model(container, BOOSTING)
BAGGING_CLASSIFY <- classify_model(container, BAGGING)
RF_CLASSIFY <- classify_model(container, RF)
TREE_CLASSIFY <- classify_model(container, TREE)
```
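
As an aside, RTextTools also provides wrapper functions that train and classify several algorithms at once; a hedged sketch of the equivalent, assuming `train_models` and `classify_models` behave as documented:

```r
# Train the same eight algorithms and classify the test set in two calls.
models <- train_models(container, algorithms = c("SVM", "GLMNET", "MAXENT", "SLDA",
                                                 "BOOSTING", "BAGGING", "RF", "TREE"))
results <- classify_models(container, models)
```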

10. Ran and summarized the analytics using the “create_analytics” function from RTextTools

```r
analytics <- create_analytics(container, cbind(SVM_CLASSIFY, GLMNET_CLASSIFY, MAXENT_CLASSIFY, SLDA_CLASSIFY, BOOSTING_CLASSIFY, BAGGING_CLASSIFY, RF_CLASSIFY, TREE_CLASSIFY))
summary(analytics)
## ENSEMBLE SUMMARY
## 
##        n-ENSEMBLE COVERAGE n-ENSEMBLE RECALL
## n >= 1                1.00                 1
## n >= 2                1.00                 1
## n >= 3                1.00                 1
## n >= 4                1.00                 1
## n >= 5                1.00                 1
## n >= 6                1.00                 1
## n >= 7                0.99                 1
## n >= 8                0.99                 1
## 
## 
## ALGORITHM PERFORMANCE
## 
##        SVM_PRECISION           SVM_RECALL           SVM_FSCORE 
##                1.000                0.985                0.990 
##       SLDA_PRECISION          SLDA_RECALL          SLDA_FSCORE 
##                0.980                0.980                0.975 
## LOGITBOOST_PRECISION    LOGITBOOST_RECALL    LOGITBOOST_FSCORE 
##                1.000                1.000                1.000 
##    BAGGING_PRECISION       BAGGING_RECALL       BAGGING_FSCORE 
##                1.000                1.000                1.000 
##    FORESTS_PRECISION       FORESTS_RECALL       FORESTS_FSCORE 
##                1.000                1.000                1.000 
##     GLMNET_PRECISION        GLMNET_RECALL        GLMNET_FSCORE 
##                1.000                1.000                1.000 
##       TREE_PRECISION          TREE_RECALL          TREE_FSCORE 
##                1.000                1.000                1.000 
## MAXENTROPY_PRECISION    MAXENTROPY_RECALL    MAXENTROPY_FSCORE 
##                0.995                0.985                0.990
```

11. Created the data.frame summaries

```r
topic_summary <- analytics@label_summary
alg_summary <- analytics@algorithm_summary
ens_summary <-analytics@ensemble_summary
doc_summary <- analytics@document_summary

head(topic_summary)
##   NUM_MANUALLY_CODED NUM_CONSENSUS_CODED NUM_PROBABILITY_CODED
## 1                841                 841                   841
## 2                159                 159                   159
##   PCT_CONSENSUS_CODED PCT_PROBABILITY_CODED PCT_CORRECTLY_CODED_CONSENSUS
## 1                 100                   100                           100
## 2                 100                   100                           100
##   PCT_CORRECTLY_CODED_PROBABILITY
## 1                             100
## 2                             100
head(alg_summary)
##   SVM_PRECISION SVM_RECALL SVM_FSCORE SLDA_PRECISION SLDA_RECALL
## 1             1       1.00       1.00           1.00        0.99
## 2             1       0.97       0.98           0.96        0.97
##   SLDA_FSCORE LOGITBOOST_PRECISION LOGITBOOST_RECALL LOGITBOOST_FSCORE
## 1        0.99                    1                 1                 1
## 2        0.96                    1                 1                 1
##   BAGGING_PRECISION BAGGING_RECALL BAGGING_FSCORE FORESTS_PRECISION
## 1                 1              1              1                 1
## 2                 1              1              1                 1
##   FORESTS_RECALL FORESTS_FSCORE GLMNET_PRECISION GLMNET_RECALL
## 1              1              1                1             1
## 2              1              1                1             1
##   GLMNET_FSCORE TREE_PRECISION TREE_RECALL TREE_FSCORE
## 1             1              1           1           1
## 2             1              1           1           1
##   MAXENTROPY_PRECISION MAXENTROPY_RECALL MAXENTROPY_FSCORE
## 1                 1.00              1.00              1.00
## 2                 0.99              0.97              0.98
ens_summary
##        n-ENSEMBLE COVERAGE n-ENSEMBLE RECALL
## n >= 1                1.00                 1
## n >= 2                1.00                 1
## n >= 3                1.00                 1
## n >= 4                1.00                 1
## n >= 5                1.00                 1
## n >= 6                1.00                 1
## n >= 7                0.99                 1
## n >= 8                0.99                 1
head(doc_summary)
##   SVM_LABEL  SVM_PROB GLMNET_LABEL GLMNET_PROB MAXENTROPY_LABEL
## 1         1 0.9999860            1   0.9936855                1
## 2         1 0.9999803            1   0.9936855                1
## 3         2 1.0000000            2   0.9691075                2
## 4         1 0.9998647            1   0.9936855                1
## 5         2 1.0000000            2   0.9691075                2
## 6         1 0.9997909            1   0.9936855                1
##   MAXENTROPY_PROB SLDA_LABEL SLDA_PROB LOGITBOOST_LABEL LOGITBOOST_PROB
## 1               1          1 1.0000000                1               1
## 2               1          1 0.9999998                1               1
## 3               1          2 0.9999649                2               1
## 4               1          1 0.9999999                1               1
## 5               1          2 0.9581309                2               1
## 6               1          1 0.9999965                1               1
##   BAGGING_LABEL BAGGING_PROB FORESTS_LABEL FORESTS_PROB TREE_LABEL
## 1             1            1             1        1.000          1
## 2             1            1             1        1.000          1
## 3             2            1             2        0.990          2
## 4             1            1             1        1.000          1
## 5             2            1             2        0.965          2
## 6             1            1             1        1.000          1
##   TREE_PROB MANUAL_CODE CONSENSUS_CODE CONSENSUS_AGREE CONSENSUS_INCORRECT
## 1         1           1              1               8                   0
## 2         1           1              1               8                   0
## 3         1           2              2               8                   0
## 4         1           1              1               8                   0
## 5         1           2              2               8                   0
## 6         1           1              1               8                   0
##   PROBABILITY_CODE PROBABILITY_INCORRECT
## 1                1                     0
## 2                1                     0
## 3                2                     0
## 4                1                     0
## 5                2                     0
## 6                1                     0
```
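
The document summary also makes it easy to compute an overall accuracy for the eight-model consensus; a sketch:

```r
# Proportion of the 1000 test documents where the consensus code matches the manual label.
mean(doc_summary$CONSENSUS_CODE == doc_summary$MANUAL_CODE)
```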

12. Created confusion matrices for the consensus and probability classifications along with a number of the individual models (special thanks to the example Dr. Andy Catlin provided on this)

```r
consensusCM <- table(true = analytics@document_summary$MANUAL_CODE, predict = analytics@document_summary$CONSENSUS_CODE)
probCM <- table(true = analytics@document_summary$MANUAL_CODE, predict = analytics@document_summary$PROBABILITY_CODE)
svmCM <- table(true = analytics@document_summary$MANUAL_CODE, predict = analytics@document_summary$SVM_LABEL)
glmnetCM <- table(true = analytics@document_summary$MANUAL_CODE, predict = analytics@document_summary$GLMNET_LABEL)
sldaCM <- table(true = analytics@document_summary$MANUAL_CODE, predict = analytics@document_summary$SLDA_LABEL)
baggingCM <- table(true = analytics@document_summary$MANUAL_CODE, predict = analytics@document_summary$BAGGING_LABEL)
treeCM <- table(true = analytics@document_summary$MANUAL_CODE, predict = analytics@document_summary$TREE_LABEL)

consensusCM 
##     predict
## true   1   2
##    1 841   0
##    2   0 159
probCM 
##     predict
## true   1   2
##    1 841   0
##    2   0 159
svmCM
##     predict
## true   1   2
##    1 841   0
##    2   4 155
glmnetCM 
##     predict
## true   1   2
##    1 841   0
##    2   0 159
sldaCM 
##     predict
## true   1   2
##    1 834   7
##    2   4 155
baggingCM
##     predict
## true   1   2
##    1 841   0
##    2   0 159
treeCM 
##     predict
## true   1   2
##    1 841   0
##    2   0 159
```
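
Overall accuracy for any single model can be read straight off its confusion matrix as the diagonal (correct predictions) divided by the total; for example, for the SVM model:

```r
# Accuracy = correct predictions / all predictions.
sum(diag(svmCM)) / sum(svmCM)
```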

13. Finally, computed some performance metrics for the SLDA and bagging models (shown as examples below)

```r
# Performance metrics for the SLDA model.
# The confusion matrices are laid out as table(true, predict), with class 2 ("spam")
# treated as the positive class.
TP <- sldaCM[2, 2]  # spam correctly flagged
TN <- sldaCM[1, 1]  # ham correctly passed
FP <- sldaCM[1, 2]  # ham incorrectly flagged as spam
FN <- sldaCM[2, 1]  # spam missed
Accuracy <- ((TP + TN)/(TP + FP + TN + FN))
ErrorRate <- ((FP + FN)/(TP + FP + TN + FN))
Precision <- (TP/(TP + FP))
Recall <- (TP/(TP + FN))
Sensitivity <- (TP/(TP + FN))
Specificity <- (TN/(TN + FP))
Output <- as.matrix(c(Accuracy, ErrorRate, Precision, Recall, Sensitivity, Specificity))
rownames(Output) <- c("Accuracy", "Error Rate", "Precision", "Recall", "Sensitivity", "Specificity")
Output
##                  [,1]
## Accuracy    0.9890000
## Error Rate  0.0110000
## Precision   0.9567901
## Recall      0.9748428
## Sensitivity 0.9748428
## Specificity 0.9916766
# Performance metrics for the bagging model (same confusion-matrix layout as above).
TP <- baggingCM[2, 2]
TN <- baggingCM[1, 1]
FP <- baggingCM[1, 2]  # ham incorrectly flagged as spam
FN <- baggingCM[2, 1]  # spam missed
Accuracy <- ((TP + TN)/(TP + FP + TN + FN))
ErrorRate <- ((FP + FN)/(TP + FP + TN + FN))
Precision <- (TP/(TP + FP))
Recall <- (TP/(TP + FN))
Sensitivity <- (TP/(TP + FN))
Specificity <- (TN/(TN + FP))
Output<- as.matrix(c(Accuracy, ErrorRate, Precision, Recall, Sensitivity, Specificity))
rownames(Output) <- c("Accuracy", "Error Rate", "Precision", "Recall", "Sensitivity", "Specificity")
Output
##             [,1]
## Accuracy       1
## Error Rate     0
## Precision      1
## Recall         1
## Sensitivity    1
## Specificity    1
```
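
Since the same calculations are repeated for each model, they can be wrapped in a small helper; a sketch (the `metrics` function below is mine, not part of RTextTools), using the same table(true, predict) layout with class 2, spam, as the positive class:

```r
# Compute standard metrics from a 2x2 confusion matrix laid out as table(true, predict).
metrics <- function(cm) {
  TP <- cm[2, 2]   # spam correctly flagged
  TN <- cm[1, 1]   # ham correctly passed
  FP <- cm[1, 2]   # ham incorrectly flagged as spam
  FN <- cm[2, 1]   # spam missed
  c(Accuracy      = (TP + TN) / sum(cm),
    `Error Rate`  = (FP + FN) / sum(cm),
    Precision     = TP / (TP + FP),
    Recall        = TP / (TP + FN),
    Sensitivity   = TP / (TP + FN),
    Specificity   = TN / (TN + FP))
}

metrics(sldaCM)
metrics(baggingCM)
```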

Conclusions

All of the models tested performed well on the 1000 test documents, and several of them (boosting, bagging, random forests, GLMNet, and tree) classified every test document correctly. SLDA was among the weaker performers, with 11 misclassifications out of the 1000 tested: 7 ham emails flagged as spam and 4 spam emails missed.