PROJECT 4: Document Classification
It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/
For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.
In this project, I downloaded two different sample sets: one classified as spam and the other classified as ham. Though the tm package is one of the most utilized packages for text mining, I wanted to take advantage of a different package called tidytext. With this package, my goal is to find a subset of words that are particular to either spam or ham. That information can help us differentiate which e-mails are spam and which are not.
Load the Appropriate Libraries for the Project
library(stringr)
library(tidytext)
library(dplyr)
library(tidyr)
library(ggplot2)
library(wordcloud)
Downloading of the Spam and Ham Folders
In this section, a folder was created (if one did not already exist) for each of spam and ham. R then downloaded the .tar.bz2 archives from the SpamAssassin public corpus and untarred them for easy access.
if (!dir.exists("easy_ham")) {
  download.file(url = "http://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2", destfile = "20021010_easy_ham.tar.bz2")
  untar("20021010_easy_ham.tar.bz2", compressed = "bzip2")
}
ham_files <- list.files(path = "easy_ham", full.names = TRUE)
if (!dir.exists("spam_2")) {
  download.file(url = "http://spamassassin.apache.org/old/publiccorpus/20050311_spam_2.tar.bz2", destfile = "20050311_spam_2.tar.bz2")
  untar("20050311_spam_2.tar.bz2", compressed = "bzip2")
}
spam_files <- list.files(path = "spam_2", full.names = TRUE)
Cleaning of the Data
As we look through several of the spam and ham files, we note that essentially all of the e-mails contain headers, and these headers carry a lot of information that I wanted to exclude. I wanted to examine the content of the body of the e-mails and use that for my text mining and analysis.
Of note, I did not do this the first time through. The analysis from that initial pass is summarized below. (I will not show the code, as most of it was run again on this same data, just without the headers.)
Initial First Time Through Without Eliminating the Headers
As you can see in the comparison above, many of the keywords are not helpful (e.g. sansseriffont, brbr, etc.). As a result, on my second pass through the analysis, I opted to leave out the headers and to examine the content of the e-mail body instead.
Let’s start with spam.
spam.body.df <- data.frame(e.mail = NA)
spam.body <- readLines(spam_files[1])
# Blank out the header lines until we hit the first line with no alphanumeric characters
# (the blank line separating the headers from the body), then collapse what remains into one string.
for (i in 1:length(spam.body)) {
  if (str_detect(spam.body[i], "[[:alnum:]].*")) {
    spam.body[i] <- ""
  } else {
    spam.body <- unlist(str_c(spam.body, collapse = ""))
    spam.body.df <- rbind(spam.body.df, spam.body)
    break
  }
}
# Omit the NAs
spam.body.df <- na.omit(spam.body.df)
for (i in 2:length(spam_files)) {
  spam.body <- readLines(spam_files[i])
  for (j in 1:length(spam.body)) {
    if (str_detect(spam.body[j], "[[:alnum:]].*")) {
      spam.body[j] <- ""
    } else {
      spam.body <- unlist(str_c(spam.body, collapse = ""))
      # Get rid of the HTML tags
      spam.body <- unlist(gsub("<[[:alnum:]].*>", "", spam.body))
      spam.body.df <- rbind(spam.body.df, spam.body)
      break
    }
  }
}
# While not perfect, the data has certainly been cleaned up fairly extensively.
# Now to remove the punctuation, numbers, and excessive spaces.
x <- gsub("[[:punct:]]|[[:digit:]]", "", spam.body.df$e.mail[1])
x <- gsub("\\s{2,}", " ", x)
spam.content.cleaned <- data.frame(NUM = 1, E.Mail = x, stringsAsFactors = FALSE)
for (i in 2:length(spam.body.df$e.mail)) {
  x <- gsub("[[:punct:]]|[[:digit:]]", "", spam.body.df$e.mail[i])
  x <- gsub("\\s{2,}", " ", x)
  y <- data.frame(NUM = i, E.Mail = x)
  spam.content.cleaned <- rbind(spam.content.cleaned, y)
}
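As an aside, because e-mail headers are separated from the body by a single blank line, the same header-stripping step can be written more compactly. Below is a minimal sketch under that assumption; the helper name strip_headers is my own and not part of any package.
# Minimal sketch: keep only the lines after the first blank line (the header/body separator).
strip_headers <- function(lines) {
  blank <- which(lines == "")[1]                                    # index of the first blank line
  if (is.na(blank) || blank == length(lines)) return(character(0))  # no body found
  lines[(blank + 1):length(lines)]
}
# Hypothetical usage: collapse the remaining body lines into one string.
# body <- str_c(strip_headers(readLines(spam_files[1])), collapse = " ")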
We now have a clean spam data frame. In order to use this data set correctly with the tidytext package, we need to tokenize the data frame so that it is tidy, or in other words, a table with one token per row.
This is an excerpt from Dr. Silge (author of Text Mining with R). “A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens. This one-token-per-row structure is in contrast to the ways text is often stored in current analyses, perhaps as strings or in a document-term matrix. For tidy text mining, the token that is stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph.”
Now, in order to do this, it would be helpful to get rid of words that are not likely to be contributory (i.e. “the”, “a”, “of”, etc.). Fortunately, the tidytext package contains a stop_words data frame of the words that are most likely unnecessary for our text mining analysis. By using the anti_join() function, we can exclude these words from the spam data frame.
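To make the one-token-per-row idea and the stop-word removal concrete, here is a tiny toy example (illustrative only, not part of the project data):
toy <- data.frame(NUM = 1, E.Mail = "Click here for the free money", stringsAsFactors = FALSE)
toy %>% unnest_tokens(word, E.Mail)
# unnest_tokens() lowercases the text and splits it into one word per row:
# click, here, for, the, free, money
toy %>% unnest_tokens(word, E.Mail) %>% anti_join(stop_words)
# anti_join(stop_words) then drops the uninformative words ("here", "for", "the"),
# leaving click, free, money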
# Tidying up the `spam` data
data(stop_words)
spam.content.test <- spam.content.cleaned[1:600,] # Why did I choose 600? For some unusual reason, anything above 600 was crashing RStudio, hence the arbitrary number.
spam.content.unnested <- spam.content.test %>% unnest_tokens(word, E.Mail) %>% anti_join(stop_words)
sorted.spam.content <- spam.content.unnested %>% count(word, sort = TRUE)
Now that we have done this for spam, we need to repeat the exact same process for ham.
ham.body.df <- data.frame(e.mail = NA)
ham.body <- readLines(ham_files[1])
for (i in 1:length(ham.body)) {
  if (str_detect(ham.body[i], "[[:alnum:]].*")) {
    ham.body[i] <- ""
  } else {
    ham.body <- unlist(str_c(ham.body, collapse = ""))
    ham.body.df <- rbind(ham.body.df, ham.body)
    break
  }
}
# Omit the NAs
ham.body.df <- na.omit(ham.body.df)
for (i in 2:length(ham_files)) {
  ham.body <- readLines(ham_files[i])
  for (j in 1:length(ham.body)) {
    if (str_detect(ham.body[j], "[[:alnum:]].*")) {
      ham.body[j] <- ""
    } else {
      ham.body <- unlist(str_c(ham.body, collapse = ""))
      # Get rid of the HTML tags
      ham.body <- unlist(gsub("<[[:alnum:]].*>", "", ham.body))
      ham.body.df <- rbind(ham.body.df, ham.body)
      break
    }
  }
}
# Now to remove the punctuation, numbers, and excessive spaces.
x <- gsub("[[:punct:]]|[[:digit:]]", "", ham.body.df$e.mail[1])
x <- gsub("\\s{2,}", " ", x)
ham.content.cleaned <- data.frame(NUM = 1, E.Mail = x, stringsAsFactors = FALSE)
for (i in 2:length(ham.body.df$e.mail)) {
  x <- gsub("[[:punct:]]|[[:digit:]]", "", ham.body.df$e.mail[i])
  x <- gsub("\\s{2,}", " ", x)
  y <- data.frame(NUM = i, E.Mail = x)
  ham.content.cleaned <- rbind(ham.content.cleaned, y)
}
ham.content.test <- ham.content.cleaned[1:600,] # Again, 600 was chosen because any higher number was causing RStudio to crash.
ham.content.unnested <- ham.content.test %>% unnest_tokens(word, E.Mail) %>% anti_join(stop_words)
sorted.ham.content <- ham.content.unnested %>% count(word, sort = TRUE)
head(sorted.spam.content, 15)
## # A tibble: 15 × 2
## word n
## <chr> <int>
## 1 email 645
## 2 free 343
## 3 send 213
## 4 money 200
## 5 credit 199
## 6 people 199
## 7 business 198
## 8 information 195
## 9 time 188
## 10 address 180
## 11 list 177
## 12 receive 169
## 13 software 152
## 14 click 150
## 15 internet 139
head(sorted.ham.content, 15)
## # A tibble: 15 × 2
## word n
## <chr> <int>
## 1 dont 238
## 2 wrote 214
## 3 linux 203
## 4 people 194
## 5 email 182
## 6 time 174
## 7 yahoo 169
## 8 xml 157
## 9 technology 144
## 10 im 141
## 11 subject 127
## 12 users 122
## 13 world 122
## 14 companies 119
## 15 send 114
We have now successfully created our tidy data frames for text analysis!
Analysis of Spam and Ham
Now, here comes the fun part: the analysis. In this portion, we’ll use the ggplot2 package to plot the most frequent words from spam and ham. For the sake of space, only the words that showed up more than 100 times were plotted. To get a general idea of which words were popular in each set, please look at the bar plots below.
par(mfrow = c(2, 1)) # Note: par() does not affect ggplot2 output, so the two plots below simply print one after the other.
sorted.spam.content %>%
filter(n > 100) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab("Spam") +
coord_flip()
sorted.ham.content %>%
filter(n > 100) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab("Ham") +
coord_flip()
In the “Analyzing Word and Document Frequency” section of the Text Mining with R textbook, there is a great section demonstrating a graphical analysis of term frequency. Below is a plot of each word’s frequency (count of that word / total number of words), e.g. (count of “email” / total number of words). We can use this plot to see how frequently certain words showed up.
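Concretely, the term frequency of a single word, say “email” within the spam set, is just its count divided by the total number of words counted in that set. A quick sketch using the counts we already have:
# term frequency of "email" within spam = count of "email" / total words counted in spam
645 / sum(sorted.spam.content$n)
The code below computes this ratio for every word in both classes and plots the resulting distributions.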
email.content <- bind_rows(mutate(sorted.spam.content, type = "spam"),
mutate(sorted.ham.content, type = "ham")) %>% ungroup()
total.email.content <- email.content %>%
group_by(type) %>%
summarize(total = sum(n))
email.content <- left_join(email.content, total.email.content)
ggplot(email.content, aes(n/total, fill = type)) +
geom_histogram(show.legend = FALSE) +
xlim(NA, 0.001) +
facet_wrap(~type, ncol = 2, scales = "free_y")
## Warning: Removed 190 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
Looking at these curves, we notice something particular about them: they both resemble a power law. According to Wikipedia, “a power law is a functional relationship between two quantities, where a relative change in one quantity results in a proportional relative change in the other quantity, independent of the initial size of those quantities: one quantity varies as a power of another. For instance, considering the area of a square in terms of the length of its side, if the length is doubled, the area is multiplied by a factor of four.”
Dr. Silge noticed from her work that distributions such as the one above are typical in language. In fact, this pattern is so common that the 20th-century linguist George Zipf gave his name to Zipf’s Law, which states that the frequency with which a word appears is inversely proportional to its rank.
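To see what Zipf’s Law predicts in its idealized form, here is a tiny illustrative sketch (not our data):
# Idealized Zipf's Law: the r-th most common word has frequency proportional to 1/r,
# which appears as a straight line with slope -1 on a log-log plot.
zipf_rank <- 1:10
zipf_freq <- (1 / zipf_rank) / sum(1 / zipf_rank)   # normalize so the frequencies sum to 1
round(zipf_freq, 3)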
Let’s apply Zipf’s Law to our data set and see what we find!
freq_by_rank.content <- email.content %>%
group_by(type) %>%
mutate(rank = row_number(),
`term frequency` = n/total)
freq_by_rank.content %>%
ggplot(aes(rank, `term frequency`, color = type)) +
geom_line(size = 1.2, alpha = 0.8) +
scale_x_log10() +
scale_y_log10()
We are taking advantage of the fact that the term frequency distribution resembles a power law. By plotting both axes on a logarithmic scale, we can see that term frequency falls off as rank increases along a roughly straight line; in other words, it is an inversely proportional relationship with an approximately constant negative slope.
Here, we fit a best-fit line over a subset of the rank range to estimate that slope (or trend).
rank_subset <- freq_by_rank.content %>%
filter(rank < 1200,
rank > 800)
lm(log10(`term frequency`) ~ log10(rank), data = rank_subset)
##
## Call:
## lm(formula = log10(`term frequency`) ~ log10(rank), data = rank_subset)
##
## Coefficients:
## (Intercept) log10(rank)
## -1.0237 -0.9085
freq_by_rank.content %>%
ggplot(aes(rank, `term frequency`, color = type)) +
geom_abline(intercept = -1.0237, slope = -0.9085, color = "gray50", linetype = 2) +
geom_line(size = 1.2, alpha = 0.8) +
scale_x_log10() +
scale_y_log10()
While it is fun to analyze how rapidly word frequency falls off as rank increases, it does not really get us closer to the factors that differentiate spam from ham. What we would really like is to find words that are likely to be associated with spam or with ham. Term frequency can be helpful, but it does not necessarily tell us whether a particular word belongs to one group (for example, containing the word “email” does not necessarily mean a message is spam).
Fortunately, there is a statistic called tf-idf that can help us do that. Again, according to Text Mining with R, “the idea of tf-idf is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents. Calculating tf-idf attempts to find the words that are important (i.e., common) in a text, but not too common.”
Let’s see how the data changes when we calculate the tf-idf.
email.content <- email.content %>%
bind_tf_idf(word, type, n) %>%
select(-total) %>%
arrange(desc(tf_idf))
head(email.content,20)
## # A tibble: 20 × 6
## word n type tf idf
## <chr> <int> <chr> <dbl> <dbl>
## 1 xml 157 ham 0.0027158401 0.6931472
## 2 datapower 104 ham 0.0017990278 0.6931472
## 3 httpdocsyahoocominfoterms 87 ham 0.0015049560 0.6931472
## 4 toforteanaunsubscribeegroupscom 87 ham 0.0015049560 0.6931472
## 5 sep 84 ham 0.0014530609 0.6931472
## 6 tm 56 spam 0.0013704664 0.6931472
## 7 tablets 52 spam 0.0012725760 0.6931472
## 8 unseen 68 ham 0.0011762874 0.6931472
## 9 msgs 66 ham 0.0011416907 0.6931472
## 10 beberg 64 ham 0.0011070941 0.6931472
## 11 capillaris 44 spam 0.0010767951 0.6931472
## 12 ragga 44 spam 0.0010767951 0.6931472
## 13 dagga 43 spam 0.0010523225 0.6931472
## 14 listslk 59 ham 0.0010206023 0.6931472
## 15 french 58 ham 0.0010033040 0.6931472
## 16 botanical 39 spam 0.0009544320 0.6931472
## 17 charsetisocontenttransferencoding 38 spam 0.0009299594 0.6931472
## 18 intro 36 spam 0.0008810141 0.6931472
## 19 monthly 36 spam 0.0008810141 0.6931472
## 20 herba 32 spam 0.0007831237 0.6931472
## # ... with 1 more variables: tf_idf <dbl>
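A quick note on the idf column above: bind_tf_idf() treats each value of type as a “document”, so with only two documents (spam and ham) any term that appears in just one class gets idf = ln(2/1) ≈ 0.693, while a term that appears in both classes gets idf = ln(2/2) = 0 (which is why the shared words never reach the top of the tf-idf ranking). A minimal check of the arithmetic:
log(2 / 1)   # term found in only one of the two classes -> 0.6931472, matching the idf column
log(2 / 2)   # term found in both classes -> 0, so its tf-idf is 0 regardless of tf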
plot_email.content <- email.content %>%
mutate(type = factor(type, levels = rev(unique(type))))
plot_email.content %>%
group_by(type) %>%
top_n(50) %>%
ungroup %>%
ggplot(aes(word, tf_idf, fill = type)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~type, ncol = 2, scales = "free") +
coord_flip() +
theme(axis.text.x = element_text(angle=60, hjust=1))
Now, as you can see, there are words that are more important in spam and other words that are more important in ham. This certainly provides more insight into which words were used more heavily in each class. Words like “sweet”, “prescription”, “mastercard”, “aphrodasic”, “advertisement”, and “nigeria” make sense for spam, whereas “technologies”, “infrastructure”, and “xml” make sense for ham. Interestingly, the technique wasn’t perfect, as words like “promiscuous” made it into the ham list, and I suspect they probably belong in the spam list.
Let’s have some fun, take this new information, and plot it in a word cloud.
# Red indicates `Spam`
plot_email.content %>%
filter(type == "spam") %>%
top_n(75) %>%
with(wordcloud(word, n, max.words = 75, colors = "#F8766D"))
# Lightblue indicates `Ham`
plot_email.content %>%
filter(type == "ham") %>%
top_n(75) %>%
with(wordcloud(word, n, max.words = 75, colors = "#00BFC4"))
Another Approach: The tm Package
While tidytext was a fun package to play with, the more established package is tm. The tm package supports text mining and enables some supervised learning, so it is important that we understand the functions it provides.
Let’s load up the library first.
library(tm)
From above, we have already downloaded the spam and ham files into our working directory and stored their file paths in spam_files and ham_files. Since this part of the work is done, we can now take this information and create a corpus. It is important to attach a meta tag to each e-mail: we will create a tag called “classification” and assign it the value “spam” if the e-mail is spam, and likewise “ham” for ham e-mails. Initially, I used the Corpus() function and ran into many errors with the meta tags (for instance, the package was not allowing me to create new meta tags). However, I was able to circumvent this issue by using VCorpus() rather than a plain Corpus().
# We'll read each e-mail into a temporary variable, convert it into a VCorpus, and attach the associated meta tag (classification = "spam").
tmp <- readLines(spam_files[1])
tmp <- str_c(tmp, collapse = "")
txt_corpus <- VCorpus(VectorSource(tmp))
meta(txt_corpus[[1]], "classification") <- "spam"
# Now I take the rest of the files and repeat the process in a for loop
n <- 1
for (i in 2:length(spam_files)) {
  tmp <- readLines(spam_files[i])
  tmp <- str_c(tmp, collapse = "")
  tmp_corpus <- VCorpus(VectorSource(tmp))
  txt_corpus <- c(txt_corpus, tmp_corpus)  # Adding the VCorpora together.
  # Now to add the meta key-value pair: "classification" = "spam"
  n <- n + 1
  meta(txt_corpus[[n]], "classification") <- "spam"
}
# Now we perform the same process for the ham files and add it to the txt_corpus from prior
for (i in 1:length(ham_files)) {
  tmp <- readLines(ham_files[i])
  tmp <- str_c(tmp, collapse = "")
  n <- n + 1
  tmp_corpus <- VCorpus(VectorSource(tmp))
  txt_corpus <- c(txt_corpus, tmp_corpus)
  meta(txt_corpus[[n]], "classification") <- "ham"
}
# Now that we have a complete corpus, we shuffle it with sample() so that it is not all spam followed by all ham.
txt_corpus <- sample(txt_corpus)
txt_corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3948
I have now successfully created the corpus. However, it needs some cleanup. After examining several e-mails from both the ham and spam folders, I found many words that need to be stemmed, plus extra whitespace, numbers, and so forth to remove, so that the texts are standardized. That way, once we analyze the texts, R will recognize that, for example, “Playing” is the same as “play”. Fortunately, there is a function called tm_map() that allows us to do this. Interestingly, there may have been updates to the tm package, as the meta tags were being dropped during the “cleaning” process. In order to preserve the meta tags and their associated information, the content_transformer() wrapper needed to be added.
corpus.tmp <- tm_map(txt_corpus, content_transformer(function(x) iconv(x, to='UTF-8-MAC', sub='byte')))
# This above function exists as I am working on a MAC. For reproducibility on a Windows machine, you may need to take the above line out.
corpus.tmp <- tm_map(corpus.tmp, content_transformer(removePunctuation))
# Unfortunately, the removePunctuation function will remove the punctuation, but not necessarily add in a space. According to the Automated Data Collection book, they suggested tm_map(corpus.tmp, str_replace_all, pattern = "[[:punct:]]", replacement = " "). But again, I lose all of the associated meta tags. Perhaps, someone can find an alternative way to do this.
corpus.tmp <- tm_map(corpus.tmp, content_transformer(removeNumbers))
corpus.tmp <- tm_map(corpus.tmp, content_transformer(stripWhitespace))
corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
corpus.tmp <- tm_map(corpus.tmp, content_transformer(stemDocument))
corpus.tmp <- tm_map(corpus.tmp, removeWords, words = stopwords("en"))
# http://stackoverflow.com/questions/13640188/error-converting-text-to-lowercase-with-tm-map-tolower
corpus.tmp
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3948
Once we have a cleaned-up corpus, it is ready to be transformed into a term-document matrix. According to Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, “a term-document matrix is a way to arrange text in matrix form where the rows represent individual terms and the columns contain the texts. The cells are filled with counts of how often a particular term appears in a given text.” Once the text is arranged in this matrix, we can perform statistical analysis on it. We use the TermDocumentMatrix() function for this task.
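To make the idea concrete, here is a toy term-document matrix built from two short “documents” (illustrative only, not the project corpus):
toy_tdm <- TermDocumentMatrix(VCorpus(VectorSource(c("free money now", "meeting notes for monday"))))
inspect(toy_tdm)
# Rows are the terms ("free", "money", "now", "meeting", ...), columns are the two documents,
# and each cell counts how often that term occurs in that document.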
tdm <- TermDocumentMatrix(corpus.tmp)
tdm
## <<TermDocumentMatrix (terms: 112361, documents: 3948)>>
## Non-/sparse entries: 623637/442977591
## Sparsity : 100%
## Maximal term length: 74230
## Weighting : term frequency (tf)
It is not surprising that the matrix is extremely sparse, meaning that the vast majority of cells are empty (roughly 99–100%). What I will do next is get rid of the sparsest terms. While this may discard terms that could be valuable, it comes with the benefit of improved computational feasibility. In this case, I will use the removeSparseTerms() function to discard all terms that appear in 10 documents or fewer.
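The threshold passed to removeSparseTerms() below is worth unpacking: a term is kept only if it appears in more than 10 of the 3,948 documents, which corresponds to the sparsity cutoff we pass in. A quick sketch of the arithmetic:
1 - (10 / length(corpus.tmp))   # ~0.99747: terms sparser than this (i.e., appearing in 10 documents or fewer) are dropped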
tdm <- removeSparseTerms(tdm, 1-(10/length(corpus.tmp)))
tdm
## <<TermDocumentMatrix (terms: 5830, documents: 3948)>>
## Non-/sparse entries: 451763/22565077
## Sparsity : 98%
## Maximal term length: 110
## Weighting : term frequency (tf)
Now that a TDM has been obtained, I will take this opportunity to perform some supervised learning on the matrix; in this case, Support Vector Machines, Random Forest, and Maximum Entropy.
According to Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining:
Support Vector Machines: “This employs a spatial representation of the data. The term occurrences which we stored in the TDM represent the spatial locations of our documents in high-dimensional spaces. Recall that we supplied the group memberships, that is, classification types, of the documents in the training data. Using the SVMs, we try to fit vectors between the document features that best separate the documents into the various groups. After the estimation, we can classify new documents by checking on which sides of the vectors the features of unlabeled documents come to life and estimate the categorical membership.”
Random Forest: “This classifier creates multiple decision trees and takes the most frequently predicted membership category of many decision trees as the classification that is most likely to be accurate. A decision tree models the group membership of the object we care to classify based on various observed features (here, spam vs. ham). In a classification of a new document, we move down the tree and consider whether the trained features are present or absent to be able to predict the categorical membership of the document.”
Maximum Entropy: “This logit model predicts the probability of belonging to one of two categories”
We will need to load up the RTextTools package to perform these analyses.
library(RTextTools)
## Warning: package 'SparseM' was built under R version 3.3.2
In order for RTextTools to work appropriately, we need to prepare the data. However, the information in the textbook is now outdated; the updated instructions are listed in the errata at http://www.r-datacollection.com/errata/errata.pdf
# Assemble a vector of labels from the corpus we built;
# the labels are stored in the "classification" meta tags
meta_type <- as.vector(unlist(meta(corpus.tmp, type = "local", tag = "classification")))
meta_data <- data.frame(
type = unlist(meta_type)
)
# Showing the first 10 items in the meta_type vector
head(meta_type, 10)
## [1] "spam" "spam" "spam" "ham" "spam" "ham" "spam" "ham" "ham" "ham"
table(meta_data)
## meta_data
## ham spam
## 2551 1397
Now it’s time to create a container using the create_container() function. We will use the first 3000 e-mails as training data, while documents 3001 through 3948 will serve as the test set.
N <- length(meta_type)
container <- create_container(tdm,
labels = meta_type,
trainSize = 1:3000,
testSize = 3001:N,
virgin = FALSE)
slotNames(container)
## [1] "training_matrix" "classification_matrix" "training_codes"
## [4] "testing_codes" "column_names" "virgin"
This container is an object of class matrix_container. It holds the set of objects that are used by the estimation procedures of the supervised learning methods.
Next, we are going to create the training models using the train_model() function, building the models discussed above.
# Training models
#svm_model <- train_model(container, algorithm = "SVM")
#tree_model <- train_model(container, algorithm = "TREE")
#maxent_model <- train_model(container, algorithm = "MAXENT")
#svm_out <- classify_model(container, svm_model)
#tree_out <- classify_model(container, tree_model)
#maxent_out <- classify_model(container, maxent_model)
# Now let's inspect the outcome. The first column represents the estimated labels and the second column provides an estimate of the probability of classification.
#head(svm_out)
#head(tree_out)
#head(maxent_out)
## FYI, all of these calls are commented out. It is unclear why train_model() worked previously but suddenly stopped working.
## The error was in svm.default(x = container@training_matrix, y = container@training_codes, ...): x and y don't match.
# We can investigate how often the algorithms have misclassified the press releases.
#labels_out <- data.frame(
#  correct_label = meta_type[3001:N],
#  svm = as.character(svm_out[,1]),
#  tree = as.character(tree_out[,1]),
#  maxent = as.character(maxent_out[,1]),
#  stringsAsFactors = F)
## Again, these are commented out; they worked previously but no longer run because of the train_model() error above.
## SVM performance
#table(labels_out[,1] == labels_out[,2])
#prop.table(table(labels_out[,1] == labels_out[,2]))
# It appears that the SVM performance correctly classified about 64% of the time.
## Random forest performance
#table(labels_out[,1] == labels_out[,3])
#prop.table(table(labels_out[,1] == labels_out[,3]))
# It appears that the random forest performance correctly classified about 64% of the time.
## Maximum entropy performance
#table(labels_out[,1] == labels_out[,4])
#prop.table(table(labels_out[,1] == labels_out[,4]))
# Maximum entropy performance correctly classified about 60.1% of the time.
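For what it is worth, one possible culprit (a guess on my part, not something I have verified) is that create_container() expects documents in rows, i.e. a document-term matrix, whereas tdm has terms in rows; the mismatch between the 5,830 rows of the matrix and the 3,948 labels would be consistent with the “x and y don't match” error. A sketch of what that fix might look like, kept commented out like the code above:
# Hedged sketch, not verified: build a DocumentTermMatrix (documents in rows) instead of a TDM.
# dtm <- DocumentTermMatrix(corpus.tmp)
# dtm <- removeSparseTerms(dtm, 1 - (10 / length(corpus.tmp)))
# container <- create_container(dtm, labels = meta_type,
#                               trainSize = 1:3000, testSize = 3001:N, virgin = FALSE)
# svm_model <- train_model(container, algorithm = "SVM")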
This was the screenshot from when it was working. As you can see, the first set of results was from the SVM, the second from the random forest, and the third from maximum entropy.
So, interestingly, my models turned out to be somewhat better than chance, getting the classification right about two-thirds of the time (on the testing data).
Conclusion
So there we have two different ways to approach text mining: one via the tidytext package, and the other via supervised learning with the tm package.