It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/
library(tm)
## Warning: package 'tm' was built under R version 3.4.3
## Loading required package: NLP
library(stringr)
library(tidytext)
## Warning: package 'tidytext' was built under R version 3.4.4
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.3
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(wordcloud)
## Loading required package: RColorBrewer
The spam and ham e-mails were retrieved from http://spamassassin.apache.org/old/publiccorpus/, specifically the archives 20050311_spam_2.tar.bz2 and 20030228_easy_ham.tar.bz2.
These were unpacked into the following directories.
spam_file <- "/Users/Naman/harpreet/cuny/Data607/project4/spam_2/"
ham_file <- "/Users/Naman/harpreet/cuny/Data607/project4/easy_ham/"
spamcorpus <- Corpus(DirSource(spam_file), readerControl = list(language="en"))
hamcorpus <- Corpus(DirSource(ham_file), readerControl = list(language="en"))
meta(spamcorpus[[1]])
## author : character(0)
## datetimestamp: 2018-04-16 12:13:19
## description : character(0)
## heading : character(0)
## id : 00001.317e78fa8ee2f54cd4890fdc09ba8176
## language : en
## origin : character(0)
meta(hamcorpus[[1]])
## author : character(0)
## datetimestamp: 2018-04-16 12:13:19
## description : character(0)
## heading : character(0)
## id : 00001.7c53336b37003a9286aba55d2945844c
## language : en
## origin : character(0)
#summary(spamcorpus,1)
#summary(hamcorpus,1)
getTransformations() # Predefined transformations (mappings) which can be used with tm_map
## [1] "removeNumbers" "removePunctuation" "removeWords"
## [4] "stemDocument" "stripWhitespace"
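To make the cleaning steps that follow easier to picture, here is a rough base-R sketch of what removing punctuation, lower-casing, and removing numbers do to a single message. The sample string and the regular expressions are my own assumptions for illustration; tm's implementations differ in details, and stemming is left out because it needs an extra package.

```r
# Hypothetical sample message (not taken from the corpus)
raw <- "WIN 1000 Dollars NOW!!! Click here: offer 42."

# Same order as the tm_map() calls: punctuation, lower case, numbers
no_punct  <- gsub("[[:punct:]]", "", raw)
lower     <- tolower(no_punct)
no_digits <- gsub("[[:digit:]]", "", lower)

# Collapse the gaps left behind by the removed tokens
cleaned <- gsub("\\s+", " ", trimws(no_digits))
print(cleaned)
```

Running the steps on one string like this is a quick way to check that each transformation does what is expected before applying it to thousands of files.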
spamcorpus <- tm_map(spamcorpus, content_transformer(removePunctuation))
spamcorpus <- tm_map(spamcorpus, content_transformer(tolower))
spamcorpus <- tm_map(spamcorpus, content_transformer(removeNumbers))
spamcorpus <- tm_map(spamcorpus, content_transformer(PlainTextDocument))
spamcorpus <- tm_map(spamcorpus, content_transformer(stemDocument), language = 'english')
hamcorpus <- tm_map(hamcorpus, content_transformer(removePunctuation))
hamcorpus <- tm_map(hamcorpus, content_transformer(tolower))
hamcorpus <- tm_map(hamcorpus, content_transformer(removeNumbers))
hamcorpus <- tm_map(hamcorpus, content_transformer(PlainTextDocument))
hamcorpus <- tm_map(hamcorpus, content_transformer(stemDocument), language = 'english')
#meta(spamcorpus, "ind") <- 1
#meta(hamcorpus, "ind") <- 0
#meta
#spamhamCorpus <- c(hamcorpus, spamcorpus)
spamcorpus <- Corpus(VectorSource(spamcorpus))
tdmspam <- TermDocumentMatrix(spamcorpus)
tdmspam <- removeSparseTerms(tdmspam,0.97)
tdmspam
## <<TermDocumentMatrix (terms: 61394, documents: 3)>>
## Non-/sparse entries: 61397/122785
## Sparsity : 67%
## Maximal term length: 868
## Weighting : term frequency (tf)
hamcorpus <- Corpus(VectorSource(hamcorpus))
tdmham <- TermDocumentMatrix(hamcorpus)
tdmham <- removeSparseTerms(tdmham,0.97)
#tdmham <- tdmham %>% removeSparseTerms(1-(10/length(hamcorpus)))
tdmham
## <<TermDocumentMatrix (terms: 37666, documents: 3)>>
## Non-/sparse entries: 37669/75329
## Sparsity : 67%
## Maximal term length: 265
## Weighting : term frequency (tf)
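As a rough illustration of what a sparsity cutoff like the 0.97 passed to removeSparseTerms means, the base-R sketch below computes, for each term, the share of documents in which it does not occur, and keeps only the terms below a threshold. The toy counts, term names, and the 0.6 threshold are made up for the example; this is not tm's internal code.

```r
# Toy term-document counts: rows are terms, columns are documents (hypothetical data)
counts <- matrix(
  c(2, 0, 0, 0,
    1, 1, 1, 0,
    0, 0, 0, 3,
    1, 0, 2, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("free", "meeting", "prize", "click"), paste0("doc", 1:4))
)

# Sparsity of a term = fraction of documents where its count is zero
sparsity <- rowMeans(counts == 0)

# Keep only terms whose sparsity is below the chosen threshold
threshold <- 0.6
kept <- counts[sparsity < threshold, , drop = FALSE]
print(rownames(kept))
```

By the same arithmetic, when a matrix reports only 3 documents, a term that occurs in just one of them has a sparsity of about 0.67, so a 0.97 cutoff would not be expected to drop anything.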
wordcloud(spamcorpus, max.words = 75, random.order = FALSE, random.color = TRUE, colors = palette())
wordcloud(hamcorpus, max.words = 75, random.order = FALSE, random.color = TRUE, colors = palette())
R code:
combinedspamham <- c(spamcorpus, hamcorpus, recursive = TRUE)
spamham <- sample(combinedspamham)
dtm <- DocumentTermMatrix(spamham)
I ran into a problem creating the DocumentTermMatrix from the corpus. It looks like the files contained some binary characters, which resulted in the error below. Because of this, I was not able to run the various supervised algorithms.
Error in nchar(names(tab), type = "chars") : invalid multibyte string, element 25
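One thing that might help with an invalid multibyte string error is stripping bytes that are not valid UTF-8 before the text reaches the matrix step. The sketch below uses only base R (validUTF8 and iconv) on a made-up vector; the sample strings are assumptions, and I have not verified that this resolves the error on the SpamAssassin files.

```r
# Hypothetical sample: the second string contains a raw byte (0xE9) that is not valid UTF-8
msgs <- c("plain ascii text", "caf\xe9 special offer")

# Flag the problem strings
print(validUTF8(msgs))

# Drop any bytes that cannot be interpreted as UTF-8 (sub = "" removes them)
clean <- iconv(msgs, from = "UTF-8", to = "UTF-8", sub = "")

# nchar() with type = "chars" now works on every element
print(nchar(clean, type = "chars"))
```

Under the same assumption, the iconv call could be wrapped in content_transformer() and applied with tm_map() before the DocumentTermMatrix is built.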
I also tried tidying the matrix so that I could merge the spam and ham corpora, but it keeps throwing:
Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "character"
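The message suggests that, at that point, the documents were plain character vectors rather than tm document objects. One possible way around merging corpora through their metadata is to keep the text and its label together in an ordinary data frame, shuffle it, and split it into training and test sets. This is only a sketch with made-up messages, an arbitrary seed, and an assumed 70/30 split, not the approach used above.

```r
set.seed(607)  # arbitrary seed so the shuffle is reproducible

# Hypothetical labeled messages standing in for the real e-mails
emails <- data.frame(
  text  = c("win cash now", "cheap pills online", "meeting at noon",
            "lunch tomorrow?", "project status update", "claim your prize"),
  label = c("spam", "spam", "ham", "ham", "ham", "spam"),
  stringsAsFactors = FALSE
)

# Shuffle the rows so spam and ham are mixed
shuffled <- emails[sample(nrow(emails)), ]

# Assumed 70/30 split between training and test documents
n_train <- floor(0.7 * nrow(shuffled))
train <- shuffled[seq_len(n_train), ]
test  <- shuffled[-seq_len(n_train), ]

print(table(train$label))
```

Because the label travels with each row, it stays attached to the right document no matter how the rows are reordered.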