It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/

Document counts: easy_ham_2: 1401, spam_2: 1397, total: 2798

I followed the step-by-step directions from a tutorial, "How to Build a Text Mining, Machine Learning Document Classification System in R!". The tutorial originally classified speeches from the Obama/Romney election campaigns, but the same approach seemed to fit the spam/ham problem.

Initial step: load libraries.

libs <- c("tm", "plyr", "class", "RTextTools")
lapply(libs, require, character.only = TRUE)
## Loading required package: tm
## Loading required package: NLP
## Loading required package: plyr
## Warning: package 'plyr' was built under R version 3.2.4
## Loading required package: class
## Warning: package 'class' was built under R version 3.2.4
## Loading required package: RTextTools
## Warning: package 'RTextTools' was built under R version 3.2.4
## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
## 
##     backsolve
## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] FALSE

Set options

options(stringsAsFactors = FALSE)

Set parameters

types <- c("spam_2", "easy_ham_2")
pathname <- "C:/Users/danielhong/Documents/DATA607"

Clean Text

cleanCorpus <- function(corpus){
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  # wrap base functions in content_transformer() so the documents stay tm text documents
  corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
  # removeWords (capital W) is the tm function name
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
  return(corpus.tmp)
}
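To see what these cleaning steps do to a single message, here is a minimal base-R sketch that mirrors them without tm. The helper `clean_text` and its three-word stop list are made up for illustration; tm's `stopwords("english")` list is much longer.

```r
# Hypothetical helper mirroring the tm cleaning steps with base R only.
clean_text <- function(x, stop_words = c("the", "a", "is")) {
  x <- gsub("[[:punct:]]+", "", x)                 # remove punctuation
  x <- tolower(x)                                  # lower-case everything
  words <- strsplit(x, "[[:space:]]+")[[1]]        # split on runs of whitespace
  words <- words[nzchar(words) & !words %in% stop_words]  # drop stop words
  paste(words, collapse = " ")                     # rejoin with single spaces
}

clean_text("The  offer is FREE!!! Click   now.")
# [1] "offer free click now"
```

Lower-casing before stop-word removal matters: the stop list is lower-case, so "The" would survive if the order were reversed.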

Build TDM

#generateTDM <- function(type, path){
#  s.dir <- sprintf("%s/%s", path, type)
#  s.cor <- Corpus(DirSource(directory = s.dir, encoding = "UTF-8"))
#  s.cor.cl <- cleanCorpus(s.cor)
#  s.tdm <- TermDocumentMatrix(s.cor.cl)

#  s.tdm <- removeSparseTerms(s.tdm, 0.7)
#  # label the result with this call's type, not the whole types vector
#  result <- list(name = type, tdm = s.tdm)
#}

#tdm <- lapply(types, generateTDM, path = pathname)
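For readers new to the idea, a term-document matrix is a table of word counts with one row per term and one column per document. The sketch below builds one in base R from three made-up, already-cleaned documents. The final filter is only loosely analogous to `removeSparseTerms()`: it uses its own threshold (terms seen in at least two documents) rather than tm's sparsity parameter.

```r
# Toy documents (made up for illustration), already cleaned.
docs <- c(d1 = "free offer click now",
          d2 = "meeting notes attached",
          d3 = "free meeting now")

# Split each document into words and count them per document.
tokens <- strsplit(docs, " ", fixed = TRUE)
toy_tdm <- table(
  term = unlist(tokens, use.names = FALSE),
  doc  = rep(names(tokens), lengths(tokens))
)                                   # rows = terms, columns = documents

# Number of documents each term appears in.
doc_freq <- rowSums(toy_tdm > 0)

# Keep only terms that occur in at least 2 of the 3 documents.
toy_tdm[doc_freq >= 2, , drop = FALSE]
```

Only "free", "meeting", and "now" survive the filter, which shows how dropping rare terms shrinks the matrix before modeling.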

Attach type - We will add the type to each row

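#bindTypeToTDM <- function(tdm){
#  # transpose so that rows are documents and columns are terms
#  s.mat <- t(data.matrix(tdm[["tdm"]]))
#  s.df <- as.data.frame(s.mat, stringsAsFactors = FALSE)

#  s.df <- cbind(s.df, rep(tdm[["name"]], nrow(s.df)))
#  # the label column must be named "targettype" to match the KNN step
#  colnames(s.df)[ncol(s.df)] <- "targettype"
#  return(s.df)
#}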

#typeTDM <- lapply(tdm, bindTypeToTDM)

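Stack the two data frames and replace NAs with 0s. rbind.fill keeps every term column from both classes and fills in NA where a term never appeared in one class, so NA here really means a count of zero.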

#tdm.stack <- do.call(rbind.fill, typeTDM)
#tdm.stack[is.na(tdm.stack)] <- 0
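The stacking step can be illustrated on two tiny data frames. This sketch is a base-R stand-in for `plyr::rbind.fill()`, not the plyr implementation; the helper `pad` and all values are made up for illustration.

```r
# Two toy data frames with partly different term columns (made-up values).
spam_df <- data.frame(free = c(2, 1), click = c(1, 0),
                      targettype = "spam_2", stringsAsFactors = FALSE)
ham_df  <- data.frame(free = 0, meeting = 3,
                      targettype = "easy_ham_2", stringsAsFactors = FALSE)

# Add any missing columns as NA, then put columns in a common order.
all_cols <- union(names(spam_df), names(ham_df))
pad <- function(df) {
  df[setdiff(all_cols, names(df))] <- NA
  df[all_cols]
}
stacked <- rbind(pad(spam_df), pad(ham_df))

# A term missing from one class was never counted there, so NA becomes 0.
stacked[is.na(stacked)] <- 0
stacked
```

The result has three rows and the union of all columns, with zeros where a term was absent from a class.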

Hold-out: train the model on a random sample of the rows (70% here) and use the remaining 30% to test it.

#train.idx <- sample(nrow(tdm.stack), ceiling(nrow(tdm.stack)*0.7))
#test.idx <- (1:nrow(tdm.stack)) [-train.idx]
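A small sketch of the same split on a pretend data set of 10 rows shows that the two index sets are disjoint and together cover every row. The seed and the row count are arbitrary choices for illustration.

```r
set.seed(607)                       # arbitrary seed so the split is reproducible
n <- 10                             # pretend the stacked data frame has 10 rows

train.idx <- sample(n, ceiling(n * 0.7))   # 7 random row numbers
test.idx  <- (1:n)[-train.idx]             # the 3 rows not chosen

length(train.idx)                   # 7
length(test.idx)                    # 3
intersect(train.idx, test.idx)      # integer(0): no row is in both sets
```

Because the stacked data frame is ordered by class, random sampling also ensures that both spam and ham rows end up in each set.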

KNN model: we need two new variables, one holding only the targettype column (the class labels) and one holding every column except targettype (the term counts).

#tdm.type <- tdm.stack[, "targettype"]
#tdm.stack.nl <- tdm.stack[, !colnames(tdm.stack) %in% "targettype"]

#knn.pred <- knn(tdm.stack.nl[train.idx, ], tdm.stack.nl[test.idx, ], tdm.type[train.idx])
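To make the `knn()` call concrete, here is a minimal sketch on made-up term counts. It assumes only the `class` package, which is distributed with R as a recommended package; the data and variable names are hypothetical.

```r
library(class)

# Made-up term counts: rows are documents, columns are terms.
train_x <- data.frame(free = c(3, 2, 0, 0), meeting = c(0, 0, 2, 3))
train_y <- factor(c("spam_2", "spam_2", "easy_ham_2", "easy_ham_2"))
test_x  <- data.frame(free = c(4, 0), meeting = c(0, 4))

# k = 1 is knn()'s default, which is what the call above relies on:
# each test document takes the label of its single nearest training document.
pred <- knn(train_x, test_x, cl = train_y, k = 1)
as.character(pred)
# [1] "spam_2"     "easy_ham_2"
```

The first test document is closest to a training document heavy in "free", the second to one heavy in "meeting", so the labels follow.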

One way to measure accuracy is a confusion matrix, which cross-tabulates predicted against actual classes. Correct predictions fall on the diagonal.

#conf.mat <- table("Predictions" = knn.pred, Actual = tdm.type[test.idx])
#(accuracy <- sum(diag(conf.mat))/length(test.idx)*100)
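The accuracy calculation can be checked by hand on a small made-up example. Note that `diag()` only picks out the correct predictions when both vectors contain the same set of classes, so that rows and columns line up.

```r
# Made-up predictions and true labels for six test documents.
predicted <- c("spam", "spam", "ham", "ham", "spam", "ham")
actual    <- c("spam", "ham",  "ham", "ham", "spam", "spam")

conf.mat <- table(Predictions = predicted, Actual = actual)
conf.mat

# Four of the six predictions match, so accuracy is about 66.7%.
accuracy <- sum(diag(conf.mat)) / sum(conf.mat) * 100
accuracy
```

Dividing by `sum(conf.mat)` gives the same result as dividing by the number of test documents, as the original calculation does.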

To test additional models, RTextTools first needs the data wrapped in a container that holds the matrix, the labels, and the train/test split

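#labels go in the second argument; trainSize/testSize give the training and test rows
#container <- create_container(tdm.stack.nl, as.numeric(factor(tdm.type)), trainSize = train.idx, testSize = test.idx, virgin = FALSE)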

According to the RTextTools website, one way to train and classify data is in batch, fitting several algorithms in a single call

#models <- train_models(container, algorithms=c("MAXENT","SVM","GLMNET","SLDA","TREE","BAGGING","BOOSTING","RF"))
#results <- classify_models(container, models)

View the results by creating analytics

#analytics <- create_analytics(container, results)