CUNY 607 Project 4

Joel Park

2017-04-13

PROJECT 4: Document Classification

It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/

For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.

For this project, I downloaded two sample sets: one classified as spam and the other classified as ham. Though the tm package is one of the most widely used packages for text mining, I wanted to take advantage of a different package called tidytext. With this package, my goal is to find a subset of words that are particular to either spam or ham; that information can help us differentiate spam e-mails from legitimate ones.

Load the Appropriate Libraries for the Project

library(stringr)
library(tidytext)
library(dplyr)
library(tidyr)
library(ggplot2)
library(wordcloud)

Downloading of the Spam and Ham Folders

In this section, a folder is created for each of spam and ham (if one does not already exist). R downloads the .tar.bz2 archives from the SpamAssassin public corpus and extracts them into the working directory for easy access.

if (!dir.exists("easy_ham")){
  download.file(url = "http://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2", destfile = "20021010_easy_ham.tar.bz2")
  untar("20021010_easy_ham.tar.bz2",compressed = "bzip2")
}
ham_files <- list.files(path = "easy_ham", full.names = TRUE)

if (!dir.exists("spam_2")){
  download.file(url = "http://spamassassin.apache.org/old/publiccorpus/20050311_spam_2.tar.bz2", destfile = "20050311_spam_2.tar.bz2")
  untar("20050311_spam_2.tar.bz2", compressed = "bzip2")
}
spam_files <- list.files(path = "spam_2", full.names = TRUE)

Cleaning of the Data

Looking through several of the spam and ham files, we note that essentially all of the e-mails contain headers, and those headers hold a lot of information that I wanted to exclude. I wanted to examine only the body of each e-mail and use that text for the mining and analysis.

Of note, I did not do this the first time through. The analysis from that initial run is shown below (I will not show the code, since the same steps were later repeated on the same data, just without the headers).

Initial first run, without eliminating the headers (word-frequency plots for spam and ham).

As you can see in the above comparison, many of the keywords are not helpful (e.g. sansseriffont, brbr, etc.). As a result, on my second pass I opted to strip the headers and examine only the content of each e-mail.

Let’s start with spam.

# A one-column data frame that will hold one collapsed e-mail body per row
spam.body.df <- data.frame(e.mail = NA)

# For the first spam file: blank out each header line until we reach the first line
# with no alphanumeric characters (the empty line that separates the header from the
# body). The body lines are left intact, the whole message is collapsed into a single
# string, and that string is added to the data frame.
spam.body <- readLines(spam_files[1])
for (i in 1:length(spam.body)){
  if(str_detect(spam.body[i],"[[:alnum:]].*")){
    spam.body[i] <- ""
  }else{
    spam.body <- unlist(str_c(spam.body, collapse = ""))
    spam.body.df <- rbind(spam.body.df, spam.body)
    break
  }
}

# Omit the NAs
spam.body.df <- na.omit(spam.body.df)

# Repeat the same process for the remaining spam files, this time also stripping HTML tags
for (i in 2:length(spam_files)){
  spam.body <- readLines(spam_files[i])
  for (j in 1:length(spam.body)){
    if(str_detect(spam.body[j],"[[:alnum:]].*")){
      spam.body[j] <- ""
    }else{
      spam.body <- unlist(str_c(spam.body, collapse = ""))
      # Get rid of the HTML tags
      spam.body <- unlist(gsub("<[[:alnum:]].*>", "", spam.body))
      spam.body.df <- rbind(spam.body.df, spam.body)
      break
    }
  }
}

# While not perfect, the data has certainly been cleaned up fairly extensively.
# Now to remove the punctuation, numbers, and excessive spaces.

x <- gsub("[[:punct:]]|[[:digit:]]", "", spam.body.df$e.mail[1])
x <- gsub("\\s{2,}", " ", x)
spam.content.cleaned <- data.frame(NUM = 1, E.Mail=x, stringsAsFactors = FALSE)

for (i in 2:length(spam.body.df$e.mail)) {
  x <- gsub("[[:punct:]]|[[:digit:]]", "", spam.body.df$e.mail[i])
  x <- gsub("\\s{2,}", " ", x)
  y <- data.frame(NUM = i, E.Mail = x, stringsAsFactors = FALSE)
  spam.content.cleaned <- rbind(spam.content.cleaned, y)
}

We have now collected a clean spam data frame. In order to use this data set correctly with the tidytext package, we need to tokenize it so that it is tidy, or in other words, a table with one token per row.

This is an excerpt from Dr. Silge (author of Text Mining with R). “A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens. This one-token-per-row structure is in contrast to the ways text is often stored in current analyses, perhaps as strings or in a document-term matrix. For tidy text mining, the token that is stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph.”

Before doing this, it is helpful to get rid of words that are unlikely to be contributory (e.g. “the”, “a”, “of”). Fortunately, the tidytext package ships a stop_words data frame containing the words that are most likely unnecessary for our text mining analysis. By using the anti_join() function, we can exclude these words from the spam data frame.
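As a quick illustration (a toy, made-up sentence rather than project data), here is what unnest_tokens() followed by anti_join(stop_words) produces:

# Toy example: one lower-cased token per row, with common stop words removed
toy <- data.frame(NUM = 1,
                  E.Mail = "Click here for your free credit report",
                  stringsAsFactors = FALSE)
toy %>%
  unnest_tokens(word, E.Mail) %>%   # splits the text into one word per row
  anti_join(stop_words)             # drops common words such as "here", "for", and "your"

Only the content-bearing words (“click”, “free”, “credit”, “report”) should survive, which is exactly the structure we want for the real spam data frame below.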

# Tidying up the `spam` data
data(stop_words)
spam.content.test <- spam.content.cleaned[1:600,] # Why did I choose 600? For some unusual reason, anything above 600 was crashing RStudio, hence the arbitrary number.
spam.content.unnested <- spam.content.test %>% unnest_tokens(word, E.Mail) %>% anti_join(stop_words)
sorted.spam.content <- spam.content.unnested %>% count(word, sort = TRUE)

Now that we have done this for spam, we repeat the exact same process for ham.

ham.body.df <- data.frame(e.mail = NA)

ham.body <- readLines(ham_files[1])
for (i in 1:length(ham.body)){
  if(str_detect(ham.body[i],"[[:alnum:]].*")){
    ham.body[i] <- ""
  }else{
    ham.body <- unlist(str_c(ham.body, collapse = ""))
    ham.body.df <- rbind(ham.body.df, ham.body)
    break
  }
}

# Omit the NAs
ham.body.df <- na.omit(ham.body.df)

for (i in 2:length(ham_files)){
  ham.body <- readLines(ham_files[i])
  for (j in 1:length(ham.body)){
    if(str_detect(ham.body[j],"[[:alnum:]].*")){
      ham.body[j] <- ""
    }else{
      ham.body <- unlist(str_c(ham.body, collapse = ""))
      # Get rid of the HTML tags
      ham.body <- unlist(gsub("<[[:alnum:]].*>", "", ham.body))
      ham.body.df <- rbind(ham.body.df, ham.body)
      break
    }
  }
}

# Now to remove the punctuation, numbers, and excessive spaces.

x <- gsub("[[:punct:]]|[[:digit:]]", "", ham.body.df$e.mail[1])
x <- gsub("\\s{2,}", " ", x)
ham.content.cleaned <- data.frame(NUM = 1, E.Mail=x, stringsAsFactors = FALSE)

for (i in 2:length(ham.body.df$e.mail)) {
  x <- gsub("[[:punct:]]|[[:digit:]]", "", ham.body.df$e.mail[i])
  x <- gsub("\\s{2,}", " ", x)
  y <- data.frame(NUM = i, E.Mail = x, stringsAsFactors = FALSE)
  ham.content.cleaned <- rbind(ham.content.cleaned, y)
}

ham.content.test <- ham.content.cleaned[1:600,] # Again, 600 was chosen because any higher number was causing my RStudio to crash.
ham.content.unnested <- ham.content.test %>% unnest_tokens(word, E.Mail) %>% anti_join(stop_words)
sorted.ham.content <- ham.content.unnested %>% count(word, sort = TRUE)

head(sorted.spam.content, 15)
## # A tibble: 15 × 2
##           word     n
##          <chr> <int>
## 1        email   645
## 2         free   343
## 3         send   213
## 4        money   200
## 5       credit   199
## 6       people   199
## 7     business   198
## 8  information   195
## 9         time   188
## 10     address   180
## 11        list   177
## 12     receive   169
## 13    software   152
## 14       click   150
## 15    internet   139
head(sorted.ham.content, 15)
## # A tibble: 15 × 2
##          word     n
##         <chr> <int>
## 1        dont   238
## 2       wrote   214
## 3       linux   203
## 4      people   194
## 5       email   182
## 6        time   174
## 7       yahoo   169
## 8         xml   157
## 9  technology   144
## 10         im   141
## 11    subject   127
## 12      users   122
## 13      world   122
## 14  companies   119
## 15       send   114

We have now successfully created our tidy data frames for text analysis!

Analysis of Spam and Ham

Now, here comes the fun part: the analysis. In this portion, we’ll use the ggplot2 package to plot the most frequent words from spam and ham. For the sake of space, only words that showed up more than 100 times were plotted. To get a general idea of which words were popular in each set, see the bar plots below.

# Note: par(mfrow = c(2, 1)) only affects base graphics, not ggplot2, so the two
# plots below are simply printed one after the other.
sorted.spam.content %>%
  filter(n > 100) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab("Spam") +
  coord_flip()

sorted.ham.content %>%
  filter(n > 100) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab("Ham") +
  coord_flip()

The “Analyzing Word and Document Frequency” chapter of the Text Mining with R textbook has a great section demonstrating graphical analysis of term frequency. Below is a histogram of term frequency (a specific word’s count divided by the total number of words, e.g. count of “email” / total number of words). We can use this plot to see how frequently words showed up.

email.content <- bind_rows(mutate(sorted.spam.content, type = "spam"),
                   mutate(sorted.ham.content, type = "ham")) %>% ungroup()

total.email.content <- email.content %>% 
  group_by(type) %>% 
  summarize(total = sum(n))

email.content <- left_join(email.content, total.email.content)

ggplot(email.content, aes(n/total, fill = type)) +
  geom_histogram(show.legend = FALSE) +
  xlim(NA, 0.001) +
  facet_wrap(~type, ncol = 2, scales = "free_y")
## Warning: Removed 190 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

Looking at these curves, we notice something particular about them. They both resemble the power law in statistics. According to Wikipedia, "a power law is a functional relationship between two quantities, where a relative change in one quantity results in a proportional relative change in the other quantity, independent of the initial size of those quantities: one quantity varies as a power of another. For instance, considering the area of a square in terms of the length of its side, if the length is doubled, the area is multiplied by a factor of four."

Dr. Silge notes from her work that distributions like the ones above are typical in language. In fact, they are so common that the 20th-century linguist George Zipf gave his name to Zipf’s Law, which states that the frequency of a word is inversely proportional to its rank.
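In symbols, Zipf’s law says frequency ≈ c / rank for some constant c, so the product of rank and term frequency should stay roughly constant. As a quick sanity check (my own sketch on the spam counts we already have, not a step from the textbook):

# If Zipf's law roughly holds, rank * (term frequency) should hover around a similar value
zipf.check <- sorted.spam.content %>%
  mutate(rank = row_number(),   # the table is already sorted by descending count
         freq = n / sum(n),
         rank_x_freq = rank * freq)
head(zipf.check, 10)

In practice the product drifts a bit, especially for the very top-ranked words, which is why the log-log plot below is a better way to see the pattern.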

Let’s apply Zipf’s Law to our data set and see what we find!

freq_by_rank.content <- email.content %>% 
  group_by(type) %>% 
  mutate(rank = row_number(), 
         `term frequency` = n/total)

freq_by_rank.content %>% 
  ggplot(aes(rank, `term frequency`, color = type)) + 
  geom_line(size = 1.2, alpha = 0.8) + 
  scale_x_log10() +
  scale_y_log10()

We are taking advantage of the fact that term frequency follows an approximate power law. By putting both axes on a logarithmic scale, we can see term frequency decrease steadily as rank increases: a roughly straight line with a constant negative slope, which is what an inversely proportional relationship looks like on log-log axes.

Here, we fit a line to a middle section of the rank range and plot it to show that slope (or trend).

rank_subset <- freq_by_rank.content %>% 
  filter(rank < 1200,
         rank > 800)

lm(log10(`term frequency`) ~ log10(rank), data = rank_subset)
## 
## Call:
## lm(formula = log10(`term frequency`) ~ log10(rank), data = rank_subset)
## 
## Coefficients:
## (Intercept)  log10(rank)  
##     -1.0237      -0.9085
freq_by_rank.content %>% 
  ggplot(aes(rank, `term frequency`, color = type)) + 
  geom_abline(intercept = -1.0237, slope = -0.9085, color = "gray50", linetype = 2) +
  geom_line(size = 1.2, alpha = 0.8) + 
  scale_x_log10() +
  scale_y_log10()

The fitted slope of roughly -0.91 is close to the -1 that Zipf’s law predicts. Still, while it is fun to analyze how quickly word frequency falls off with rank, it does not get us much closer to the factors that differentiate spam from ham. What we would really like is to find words that are likely to be associated with spam or with ham. Term frequency can be helpful, but it does not necessarily tell us whether a particular word belongs to one group (for example, containing the word “email” does not necessarily mean a message is spam).

Fortunately, there is a statistic, tf-idf (term frequency-inverse document frequency), that can help us do exactly that. Again, according to Text Mining with R, “the idea of tf-idf is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents. Calculating tf-idf attempts to find the words that are important (i.e., common) in a text, but not too common.”
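Before running it, here is a small sketch of the arithmetic that bind_tf_idf() performs on this data, assuming the standard definitions used by tidytext and treating each class (spam or ham) as one “document”; the example word and count below are taken from the tables shown earlier:

# tf     = a word's count within a class / total number of words in that class
# idf    = ln(number of classes / number of classes containing the word)
# tf-idf = tf * idf
# With only two classes, idf is ln(2) (about 0.693) for a word seen in one class
# and 0 for a word seen in both, so shared words drop out of the ranking entirely.
tf  <- 157 / sum(sorted.ham.content$n)  # "xml" appears 157 times in the ham sample
idf <- log(2 / 1)                       # and, in this tokenized sample, only in ham
tf * idf                                # its tf-idf score within ham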

Let’s see how the data changes when we calculate the tf-idf.

email.content <- email.content %>%
  bind_tf_idf(word, type, n) %>% 
  select(-total) %>% 
  arrange(desc(tf_idf))

head(email.content,20)
## # A tibble: 20 × 6
##                                 word     n  type           tf       idf
##                                <chr> <int> <chr>        <dbl>     <dbl>
## 1                                xml   157   ham 0.0027158401 0.6931472
## 2                          datapower   104   ham 0.0017990278 0.6931472
## 3          httpdocsyahoocominfoterms    87   ham 0.0015049560 0.6931472
## 4    toforteanaunsubscribeegroupscom    87   ham 0.0015049560 0.6931472
## 5                                sep    84   ham 0.0014530609 0.6931472
## 6                                 tm    56  spam 0.0013704664 0.6931472
## 7                            tablets    52  spam 0.0012725760 0.6931472
## 8                             unseen    68   ham 0.0011762874 0.6931472
## 9                               msgs    66   ham 0.0011416907 0.6931472
## 10                            beberg    64   ham 0.0011070941 0.6931472
## 11                        capillaris    44  spam 0.0010767951 0.6931472
## 12                             ragga    44  spam 0.0010767951 0.6931472
## 13                             dagga    43  spam 0.0010523225 0.6931472
## 14                           listslk    59   ham 0.0010206023 0.6931472
## 15                            french    58   ham 0.0010033040 0.6931472
## 16                         botanical    39  spam 0.0009544320 0.6931472
## 17 charsetisocontenttransferencoding    38  spam 0.0009299594 0.6931472
## 18                             intro    36  spam 0.0008810141 0.6931472
## 19                           monthly    36  spam 0.0008810141 0.6931472
## 20                             herba    32  spam 0.0007831237 0.6931472
## # ... with 1 more variables: tf_idf <dbl>
plot_email.content <- email.content %>% 
  mutate(type = factor(type, levels = rev(unique(type))))

plot_email.content %>% 
  group_by(type) %>% 
  top_n(50) %>% 
  ungroup %>%
  ggplot(aes(word, tf_idf, fill = type)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~type, ncol = 2, scales = "free") +
  coord_flip() +
  theme(axis.text.x = element_text(angle=60, hjust=1))

Now, as you can see, some words are more important in spam and others are more important in ham, which gives additional insight into how the two groups differ. Words like “sweet”, “prescription”, “mastercard”, “aphrodasic”, “advertisement”, and “nigeria” make sense for spam, whereas “technologies”, “infrastructure”, and “xml” make sense for ham. Interestingly, the technique was not perfect: a word like “promiscuous” made it into the ham list, and I suspect it probably belongs on the spam side.

Let’s have some fun and plot this new information as word clouds.

# Red indicates `Spam`
plot_email.content %>%
  filter(type == "spam") %>%
  top_n(75) %>%
  with(wordcloud(word, n, max.words = 75, colors = "#F8766D"))

# Lightblue indicates `Ham`
plot_email.content %>%
  filter(type == "ham") %>%
  top_n(75) %>% 
  with(wordcloud(word, n, max.words = 75, colors = "#00BFC4"))

Another Approach: The TM Package

While tidytext was a fun package to play with, the more established package is tm. The tm package supports text mining and the data preparation needed for supervised learning, so it is important that we understand the functions it provides.

Let’s load up the library first.

library(tm)

From above, we have already downloaded the spam and ham files into the working directory and stored their paths in spam_files and ham_files. With that done, we can take these files and build a corpus. It is important to attach a meta tag to each e-mail: we create a tag called “classification” and set it to “spam” for spam e-mails and, likewise, to “ham” for ham e-mails. Initially I used the Corpus() function and ran into many errors with the meta tags (for instance, the package would not let me create new tags). I was able to circumvent the issue by using VCorpus instead of a simple corpus.

# We'll take the e-mails and read them into a temporary file. Then I took the tmp file, converted it into a VCorpus and attached an associated meta tag (classification, spam).
tmp <- readLines(spam_files[1])
tmp <- str_c(tmp, collapse = "")
txt_corpus <- VCorpus(VectorSource(tmp))
meta(txt_corpus[[1]], "classification") <- "spam"

# Now I take the rest of the files and repeat the process in a for loop
n <- 1
for (i in 2:length(spam_files)) {
  tmp <- readLines(spam_files[i])
  tmp <- str_c(tmp, collapse = "")
  tmp_corpus <- VCorpus(VectorSource(tmp))
  txt_corpus <- c(txt_corpus, tmp_corpus) # Adding the Vcorpus together.
  
  # Now to add the meta key-value, "classification", "spam"
  n <- n + 1
  meta(txt_corpus[[n]], "classification") <- "spam"
}

# Now we perform the same process for the ham files and add it to the txt_corpus from prior
for (i in 1:length(ham_files)) {
  tmp <- readLines(ham_files[i])
  tmp <- str_c(tmp, collapse = "")
  
  n <- n + 1
  tmp_corpus <- VCorpus(VectorSource(tmp))
  txt_corpus <- c(txt_corpus, tmp_corpus)
  meta(txt_corpus[[n]], "classification") <- "ham"
}

# Now that we have a complete corpus, shuffle it with sample() so the documents
# are not ordered as all spam followed by all ham.
txt_corpus <- sample(txt_corpus)
txt_corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3948

I have now successfully created the corpus, but it needs some cleanup. Having examined several e-mails from both the ham and spam folders, there are many words to stem, extra whitespace to strip, numbers to remove, and so forth, so that the texts are standardized; that way, when we analyze the texts, R can recognize that, for example, “Playing” is the same as “play”. Fortunately, the tm_map() function lets us apply all of these transformations. Interestingly, there may have been updates to the tm package, because the meta tags were being dropped during the cleaning process. In order to preserve the meta tags and their associated information, each transformation needed to be wrapped in content_transformer().
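For reference, content_transformer() simply wraps an ordinary string function so that it is applied to a document’s content while the document object itself (and therefore its meta data) is returned intact. A rough sketch of the idea (my own illustration, not tm’s actual source code):

# Roughly what content_transformer(f) gives you: a function that edits a
# document's content in place and returns the document, so meta tags such as
# our "classification" tag are not lost along the way.
my_content_transformer <- function(f) {
  function(doc, ...) {
    content(doc) <- f(content(doc), ...)
    doc
  }
}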

corpus.tmp <- tm_map(txt_corpus, content_transformer(function(x) iconv(x, to='UTF-8-MAC', sub='byte')))
# This above function exists as I am working on a MAC. For reproducibility on a Windows machine, you may need to take the above line out.
corpus.tmp <- tm_map(corpus.tmp, content_transformer(removePunctuation))
# Unfortunately, the removePunctuation function will remove the punctuation, but not necessarily add in a space. According to the Automated Data Collection book, they suggested tm_map(corpus.tmp, str_replace_all, pattern = "[[:punct:]]", replacement = " "). But again, I lose all of the associated meta tags. Perhaps, someone can find an alternative way to do this.
corpus.tmp <- tm_map(corpus.tmp, content_transformer(removeNumbers))
corpus.tmp <- tm_map(corpus.tmp, content_transformer(stripWhitespace))
corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
corpus.tmp <- tm_map(corpus.tmp, content_transformer(stemDocument))
corpus.tmp <- tm_map(corpus.tmp, removeWords, words = stopwords("en"))
# http://stackoverflow.com/questions/13640188/error-converting-text-to-lowercase-with-tm-map-tolower

corpus.tmp
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3948

Once we have a cleaned-up corpus, it is ready to be transformed into a term-document matrix. According to Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, “a term-document matrix is a way to arrange text in matrix form where the rows represent individual terms and columns contain the texts. The cells are filled with counts of how often a particular term appears in a given text.” Once the text is arranged in this matrix form, we can perform statistical analysis on it. We use the TermDocumentMatrix() function for this task.

tdm <- TermDocumentMatrix(corpus.tmp)
tdm
## <<TermDocumentMatrix (terms: 112361, documents: 3948)>>
## Non-/sparse entries: 623637/442977591
## Sparsity           : 100%
## Maximal term length: 74230
## Weighting          : term frequency (tf)

It is not surprising that the matrix is extremely sparse, meaning that the vast majority of cells (well over 99%) contain no entry at all. What I will do next is get rid of the sparse terms. While this may discard terms that could be valuable, it greatly improves computational feasibility. Here, I use the removeSparseTerms() function to discard all terms that appear in 10 or fewer documents: the sparsity threshold works out to 1 - 10/3948 ≈ 0.9975, which corresponds to dropping terms that appear in only a handful (roughly 10 or fewer) of the 3948 documents.

tdm <- removeSparseTerms(tdm, 1-(10/length(corpus.tmp)))
tdm
## <<TermDocumentMatrix (terms: 5830, documents: 3948)>>
## Non-/sparse entries: 451763/22565077
## Sparsity           : 98%
## Maximal term length: 110
## Weighting          : term frequency (tf)

Once the TDM has been obtained, I take the opportunity to perform some supervised learning on this matrix, in this case with support vector machines, random forests, and maximum entropy.

According to Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining:

  1. Support Vector Machines: “This employs a spatial representation of the data. The term occurrences which we stored in the TDM represent the spatial locations of our documents in high-dimensional spaces. Recall that we supplied the group memberships, that is, classification types, of the documents in the training data. Using the SVMs, we try to fit vectors between the document features that best separate the documents into the various groups. After the estimation, we can classify new documents by checking on which sides of the vectors the features of unlabeled documents come to life and estimate the categorical membership.”

  2. Random Forest: “This classifier creates multiple decision trees and takes the most frequently predicted membership category of many decision trees as the classification that is most likely to be accurate. A decision tree models the group membership of the object we care to classify based on various observed features (here, spam vs. ham). In a classification of a new document, we move down the tree and consider whether the trained features are present or absent to be able to predict the categorical membership of the document.”

  3. Maximum Entropy: “This logit model predicts the probability of belonging to one of two categories.”

We will need to load up the RTextTools package to perform these analyses.

library(RTextTools)
## Warning: package 'SparseM' was built under R version 3.3.2

In order for RTextTools to work appropriately, we need to prepare the data. Note that the instructions in the textbook are now outdated; the corrected steps are listed in the errata at http://www.r-datacollection.com/errata/errata.pdf.

# Assemble a vector of class labels, one per document, pulled from the
# "classification" meta tag attached to the cleaned corpus
meta_type <- as.vector(unlist(meta(corpus.tmp, type = "local", tag = "classification")))
meta_data <- data.frame(
  type = unlist(meta_type)
)

# Showing the first 10 items in the meta_type vector
head(meta_type, 10)
##  [1] "spam" "spam" "spam" "ham"  "spam" "ham"  "spam" "ham"  "ham"  "ham"
table(meta_data)
## meta_data
##  ham spam 
## 2551 1397

Now it’s time to create a container using the create_container() function. We will use the first 3000 e-mails as training data and hold out documents 3001 through 3948 as test data.

N <- length(meta_type)
container <- create_container(tdm,
                              labels = meta_type,
                              trainSize = 1:3000,
                              testSize = 3001:N,
                              virgin = FALSE)

slotNames(container)
## [1] "training_matrix"       "classification_matrix" "training_codes"       
## [4] "testing_codes"         "column_names"          "virgin"

The container is an object of class matrix_container, which bundles the objects (training and classification matrices, label codes, and so on) used by the estimation procedures of the supervised learning methods.

Next we create the training models using the train_model() function, one for each of the algorithms discussed above.

# Training models 
#svm_model <- train_model(container, algorithm = "SVM")
#tree_model <- train_model(container, algorithm = "TREE")
#maxent_model <- train_model(container, algorithm = "MAXENT")

#svm_out <- classify_model(container, svm_model)
#tree_out <- classify_model(container, tree_model)
#maxent_out <- classify_model(container, maxent_model)

# Now let's inspect the outcome. The first column represents the estimated labels and the second column provides an estimate of the probability of classification. 
#head(svm_out)
#head(tree_out)
#head(maxent_out)

## FYI, all of these lines are commented out: the train_model() call had worked previously but suddenly stopped working, throwing 'Error in svm.default(x = container@training_matrix, y = container@training_codes, ...): x and y don't match'.

# We can investigate how often the algorithms misclassified the e-mails.
#labels_out <- data.frame(
#  correct_label = meta_type[3001:N],
#  svm = as.character(svm_out[,1]),
#  tree = as.character(tree_out[,1]),
#  maxent = as.character(maxent_out[,1]),
#  stringsAsFactors = F)

## Again, these lines are commented out; they ran previously but no longer work because of the train_model() error above.

## SVM performance
#table(labels_out[,1] == labels_out[,2])
#prop.table(table(labels_out[,1] == labels_out[,2]))

# It appears that SVM classified correctly about 64% of the time.

## Random forest performance
#table(labels_out[,1] == labels_out[,3])
#prop.table(table(labels_out[,1] == labels_out[,3]))

# It appears that random forest classified correctly about 64% of the time.

## Maximum entropy performance
#table(labels_out[,1] == labels_out[,4])
#prop.table(table(labels_out[,1] == labels_out[,4]))

# Maximum entropy performance correctly classified about 60.1% of the time.

This is a screenshot from when the models were working: the first set of results is SVM, the second is random forest, and the third is maximum entropy.

So, interestingly, my models turned out to be somewhat better than chance, getting the classification right about two-thirds of the time on the test data.
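For anyone who wants to try reproducing the training step, one possible culprit (an assumption on my part, not something I have verified against this corpus) is that create_container() is documented around a document-term matrix, with documents in rows, whereas tdm here is a term-document matrix with terms in rows. tm’s t() method transposes one into the other, so a sketch of the workflow to retry would be:

# Hedged sketch: transpose the term-document matrix into a document-term matrix
# (tm's t() method swaps the two classes), rebuild the container, and retrain.
dtm <- t(tdm)
container <- create_container(dtm,
                              labels = meta_type,
                              trainSize = 1:3000,
                              testSize = 3001:N,
                              virgin = FALSE)
svm_model <- train_model(container, algorithm = "SVM")
svm_out   <- classify_model(container, svm_model)
# Proportion of held-out e-mails classified correctly
prop.table(table(meta_type[3001:N] == as.character(svm_out[, 1])))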

Conclusion

So there we have two different ways to approach text mining: one via the tidytext package, and the other via the tm package combined with supervised learning through RTextTools.