It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/
easy_ham_2: 1401 spam: 1397 total: 2798
Followed the step by step directions from a tutorial, How to Build a Text Mining, Machine Learning Document Classification System in R! The original application was for speeches during the Obama/Romney election campaigns but thought it could fit into our work.
Inital step, load libraries:
libs <- c("tm", "plyr", "class", "RTextTools")
lapply(libs, require, character.only = TRUE)
## Loading required package: tm
## Loading required package: NLP
## Loading required package: plyr
## Warning: package 'plyr' was built under R version 3.2.4
## Loading required package: class
## Warning: package 'class' was built under R version 3.2.4
## Loading required package: RTextTools
## Warning: package 'RTextTools' was built under R version 3.2.4
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] FALSE
Set options
options(stringASFactors = FALSE)
Set paramaters
types <- c("spam_2", "easy_ham_2")
pathname <- "C:/Users/danielhong/Documents/DATA607"
Clean Text
cleanCorpus <- function(corpus){
corpus.tmp <- tm_map(corpus,removePunctuation)
corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
corpus.tmp <- tm_map(corpus.tmp, tolower)
corpus.tmp <- tm_map(corpus.tmp, removewords, stopwords("english"))
corpus.tmp <- tm_map(corpus.tmp, PlainTextDocument)
return(corpus.tmp)
}
Build TDM
#generateTDM <- function(type,path){
#s.dir <- sprintf("%s/%s", path, type)
#s.cor <- Corpus(DirSource(directory = s.dir, encoding = "UTF-8"))
#s.cor.cl <- cleanCorpus(s.cor)
#s.tdm <- TermDocumentMatrix(s.cor.cl)
#s.tdm <- removeSparseTerms(s.tdm, 0.7)
#result <- list(name = types, tdm = s.tdm)
#}
#tdm <- lapply(types, generateTDM, path = pathname)
Attach type - We will add the type to each row
#bindTypeToTDM <- function(tdm){
#s.mat <- t(data.matrix(tdm[["tdm"]]))
#s.df <- as.data.frame(s.mat, StringAsFactors = FALSE)
#s.df <- cbind(s.df, rep(tdm[["name"]], nrow(s.df)))
#colnames(s.df)[ncol(s.df)] <- "types"
#return(s.df)
#}
#typeTDM <- lapply(tdm, bindTypeToTDM)
Stack the two dataframes and replace NAs with 0s
#tdm.stack <- do.call(rbind.fill, typeTDM)
#tdm.stack[is.na(tdm.stack)] <- 0
Hold-out - teach the model by taking a random sample, in this case 70% of the rows to train and use the remaining 30% to test the model
#train.idx <- sample(nrow(tdm.stack), ceiling(nrow(tdm.stack)*0.7))
#test.idx <- (1:nrow(tdm.stack)) [-train.idx]
KNN Model - We need two new variables, one with all of the rows with targettype and the other variable without targettype
#tdm.type <- tdm.stack[, "targettype"]
#tdm.stack.nl <- tdm.stack[, !colnames(tdm.stack) %in% "targettype"]
#knn.pred <- knn(tdm.stack.nl[train.idx, ], tdm.stack.nl[test.idx, ], tdm.type[train.idx])
One method to measure accuracy is a confusion matrix
#conf.mat <- table("Predictions" = knn.pred, Actual = tdm.type[test.idx])
#(accuracy <- sum(diag(conf.mat))/length(test.idx)*100)
We want to test additional models by creating a container
#container <- create_container(tdm.stack.nl,t(train.idx),virgin=FALSE)
According to the RTextTools websiste, one method of training and classifying data is batch
#models <- train_models(container, algorithms=c("MAXENT","SVM","GLMNET","SLDA","TREE","BAGGING","BOOSTING","RF"))
#results <- classify_models(container, models)
View the results by creating analytics
#analytics <- create_analytics(container, results)