Packages Used

All of these steps can be accomplished with RTextTools alone, but the methods from the text were used, which require the tm and SnowballC packages.

packages <- c("RTextTools", "tm", "SnowballC")
lapply(packages, library, character.only = TRUE)

Classified Documents

The documents used in this assignment are “sentiment-labelled” sentences from reviews on Amazon, Yelp, and IMDB, and can be found on the University of California, Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences). Each text file is from the corresponding website, with each line in the file being a different review. Not all of the reviews are classified - just the reviews that could be clearly labelled as positive (with a “1”) or negative (with a “0”).

Here, we’ll take the individual text files and put them together in a data frame, adding a source column in case we want to build models based on the source later. Even though the reviews are classified only as positive or negative, the subject matter across the three sites differs (i.e. products on Amazon, restaurants on Yelp, movies on IMDB), so the words that help classify a review may differ in some cases.

yelp <- read.delim("https://raw.githubusercontent.com/Logan213/DATA607_Week11/master/yelp_labelled.txt", header = FALSE)
yelp$source <- "Yelp"

amazon <- read.delim("https://raw.githubusercontent.com/Logan213/DATA607_Week11/master/amazon_cells_labelled.txt", header = FALSE)
amazon$source <- "Amazon"

imdb <- read.delim("https://raw.githubusercontent.com/Logan213/DATA607_Week11/master/imdb_labelled.txt", header = FALSE)
imdb$source <- "IMDB"

reviews <- rbind(yelp, amazon, imdb)
head(reviews)
##                                                                                                           V1
## 1                                                                                   Wow... Loved this place.
## 2 I learned that if an electric slicer is used the blade becomes hot enough to start to cook the prosciutto.
## 3                                                                           But they don't clean the chiles?
## 4                                                                                         Crust is not good.
## 5                                                                  Not tasty and the texture was just nasty.
## 6        Pancake was WAY bigger than I thought and kind of regretted getting now since I couldn't finish it.
##   V2 source
## 1  1   Yelp
## 2 NA   Yelp
## 3 NA   Yelp
## 4  0   Yelp
## 5  0   Yelp
## 6 NA   Yelp

As we can see above, the data frame contains all of the reviews, and only some of them are classified. There is a way to run models on unclassified (“virgin”) data, but in this case we will remove the reviews that are not classified as positive or negative.

reviews <- na.omit(reviews)

colnames(reviews) <- c("Text", "Sentiment", "Source")
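As a quick check on what remains after filtering, we can tabulate the labelled reviews by source and sentiment class. This is a minimal sketch using the renamed columns above (output not shown):

# Sketch: count labelled reviews for each source and sentiment class
table(reviews$Source, reviews$Sentiment)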

Create Corpus and Document Term Matrix

Before the data can be analyzed by the models, we need to create a Document Term Matrix, which is the format that RTextTools takes as its input. First we create a corpus by passing the Text column from the data frame into the nested VectorSource and Corpus functions from the tm (text mining) package. The metadata is set using the meta() function.

review_corpus <- Corpus(VectorSource(reviews$Text))
meta(review_corpus[[1]], "sentiment") <- reviews$Sentiment
meta(review_corpus[[1]], "source") <- reviews$Source

Once we have our corpus, we pass it to the DocumentTermMatrix function. Punctuation, numbers, and English “stop words” have been removed to improve the performance of the models. All characters have also been converted to lower case, and some very sparse terms have been removed as well.

dtm <- DocumentTermMatrix(review_corpus,
                                  control = list(removePunctuation = TRUE,
                                                 removeNumbers = TRUE,
                                                 stopwords = TRUE,
                                                 tolower = TRUE))

dtm <- removeSparseTerms(dtm, 0.998)

RTextTools also has its own function for creating the Document Term Matrix, create_matrix. We could have achieved the same result using the following:

dtm <- create_matrix(reviews$Text,
                     removePunctuation = TRUE,
                     removeNumbers = TRUE,
                     removeStopwords = TRUE,
                     toLower = TRUE,
                     removeSparseTerms = .998)

Model Estimation

Before running our estimation procedures, we will create a container object with the create_container function, passing in our Document Term Matrix, the sent_labels object containing the positive/negative classification, the indices of the documents to be used in the training set, the indices for the test set, and a logical value specifying whether or not to treat the data as virgin (unclassified). The result is stored in an object simply called container.

The method below is from the text (“Automated Data Collection with R”), but alternatively, since we have a data frame containing the data, instead of creating the sent_labels object from the metadata, we could have simply passed the column with the classification information (reviews$Sentiment) into the function instead (a sketch of this alternative follows the code below).

sent_labels <- unlist(meta(review_corpus, "sentiment"))

container <- create_container(
  dtm,
  labels = sent_labels,
  trainSize = 1:1000,
  testSize = 1001:length(sent_labels),
  virgin = FALSE)
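
The alternative mentioned above would look something like the following sketch, which labels the container with the data frame column directly instead of the corpus metadata:

# Alternative (sketch): use the data frame column as the labels
container_alt <- create_container(
  dtm,
  labels = reviews$Sentiment,
  trainSize = 1:1000,
  testSize = 1001:nrow(reviews),
  virgin = FALSE)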

There are nine algorithms included in RTextTools, and for this set of documents, we will use the support vector machine (“SVM”), maximum entropy (“MAXENT”), decision tree (“TREE”), and random forest (“RF”) algorithms.

To train our selected algorithms, we will use the train_model function, passing the container object and the string referencing the algorithm into it. Each of the four trained models is then passed into the classify_model function to return the classified data.

svm_model <- train_model(container, "SVM")
tree_model <- train_model(container, "TREE")
maxent_model <- train_model(container, "MAXENT")
rf_model <- train_model(container, "RF")

svm_out <- classify_model(container, svm_model)
tree_out <- classify_model(container, tree_model)
maxent_out <- classify_model(container, maxent_model)
rf_out <- classify_model(container, rf_model)
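
As an aside, RTextTools also provides the train_models and classify_models convenience functions, which accept a vector of algorithm names. A sketch of the equivalent calls is below; note that classify_models returns a single combined data frame rather than four separate objects, so the per-model columns would need to be extracted from it.

# Sketch: train and classify all four algorithms in one call each
models <- train_models(container, algorithms = c("SVM", "TREE", "MAXENT", "RF"))
results <- classify_models(container, models)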

Model Comparison

To compare our results, we will make a data frame containing the correct labels (supplied by the sent_labels object), and the results of each model.

labels_out <- data.frame(
  correct_label = sent_labels[1001:length(sent_labels)],
  svm = as.character(svm_out[,1]),
  tree = as.character(tree_out[,1]),
  maxent = as.character(maxent_out[,1]),
  rf = as.character(rf_out[,1]),
  stringsAsFactors = FALSE)

head(labels_out)
##       correct_label svm tree maxent rf
## 11001             1   1    0      1  1
## 11002             1   1    0      0  0
## 11003             0   0    0      0  0
## 11004             1   1    1      1  1
## 11005             0   1    0      1  0
## 11006             1   1    1      1  1

For each model, a table is created comparing the counts of predictions that match the correct label, along with the corresponding percentages.
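
The code that produced these tables is not shown above; a minimal sketch for the SVM model (the other three follow the same pattern, using the labels_out data frame from above) might look like:

# Counts of incorrect (FALSE) and correct (TRUE) SVM predictions, then proportions (sketch)
table(labels_out$correct_label == labels_out$svm)
prop.table(table(labels_out$correct_label == labels_out$svm))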

SVM Model Performance

## 
## FALSE  TRUE 
##   277   637
## 
##     FALSE      TRUE 
## 0.3030635 0.6969365

Decision Tree Performance

## 
## FALSE  TRUE 
##   346   568
## 
##     FALSE      TRUE 
## 0.3785558 0.6214442

Max. Entropy Performance

## 
## FALSE  TRUE 
##   279   635
## 
##     FALSE      TRUE 
## 0.3052516 0.6947484

Random Forest Performance

## 
## FALSE  TRUE 
##   278   636
## 
##     FALSE      TRUE 
## 0.3041575 0.6958425

Alternatively, we can also use the create_analytics function from RTextTools to create a table showing the performance of each algorithm. Calling summary on the analytics object will display the ensemble summary and individual algorithm performance.

analytics <- create_analytics(container, cbind(svm_out, tree_out, maxent_out, rf_out))
summary(analytics)
## ENSEMBLE SUMMARY
## 
##        n-ENSEMBLE COVERAGE n-ENSEMBLE RECALL
## n >= 1                1.00              0.68
## n >= 2                1.00              0.68
## n >= 3                0.89              0.72
## n >= 4                0.52              0.81
## 
## 
## ALGORITHM PERFORMANCE
## 
##        SVM_PRECISION           SVM_RECALL           SVM_FSCORE 
##                0.705                0.695                0.695 
##    FORESTS_PRECISION       FORESTS_RECALL       FORESTS_FSCORE 
##                0.715                0.695                0.690 
##       TREE_PRECISION          TREE_RECALL          TREE_FSCORE 
##                0.705                0.620                0.580 
## MAXENTROPY_PRECISION    MAXENTROPY_RECALL    MAXENTROPY_FSCORE 
##                0.695                0.695                0.695

We can also get the label summary:

analytics@label_summary
##   NUM_MANUALLY_CODED NUM_CONSENSUS_CODED NUM_PROBABILITY_CODED
## 0                461                 649                   519
## 1                453                 265                   395
##   PCT_CONSENSUS_CODED PCT_PROBABILITY_CODED PCT_CORRECTLY_CODED_CONSENSUS
## 0            140.7809             112.58134                      88.93709
## 1             58.4989              87.19647                      47.24062
##   PCT_CORRECTLY_CODED_PROBABILITY
## 0                        75.48807
## 1                        62.25166

And lastly, a preview of the document summary:

head(analytics@document_summary)
##   SVM_LABEL  SVM_PROB TREE_LABEL TREE_PROB MAXENTROPY_LABEL
## 1         1 0.8293420          0 0.6006441                1
## 2         1 0.5551329          0 0.6006441                0
## 3         0 0.9561301          0 1.0000000                0
## 4         1 0.8928492          1 0.9333333                1
## 5         1 0.5403048          0 0.6006441                1
## 6         1 0.9589604          1 0.9878049                1
##   MAXENTROPY_PROB FORESTS_LABEL FORESTS_PROB MANUAL_CODE CONSENSUS_CODE
## 1       0.9999981             1        0.885           1              1
## 2       1.0000000             0        0.705           1              0
## 3       1.0000000             0        1.000           0              0
## 4       1.0000000             1        0.955           1              1
## 5       1.0000000             0        0.635           0              0
## 6       1.0000000             1        1.000           1              1
##   CONSENSUS_AGREE CONSENSUS_INCORRECT PROBABILITY_CODE
## 1               3                   0                1
## 2               3                   1                0
## 3               4                   0                0
## 4               4                   0                1
## 5               2                   0                1
## 6               4                   0                1
##   PROBABILITY_INCORRECT
## 1                     0
## 2                     1
## 3                     0
## 4                     0
## 5                     1
## 6                     0

Conclusion

None of the four models was able to classify the documents with greater than 70% accuracy. This may be due to the sparsity of the document-term matrix, or the differing subject matter of each review set. Individually, the Support Vector Machine, Maximum Entropy, and Random Forest models performed similarly (roughly 69-70% accuracy), while the Decision Tree model had the worst performance, at about 62% accuracy.