All of these steps can be accomplished with RTextTools alone, but the methods from the text ("Automated Data Collection with R") are used here, which also require the tm and SnowballC packages.
# Load the required packages
packages <- c("RTextTools", "tm", "SnowballC")
lapply(packages, library, character.only = TRUE)
The documents used in this assignment are “sentiment-labelled” sentences from reviews on Amazon, Yelp, and IMDB, available from the University of California - Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences). Each text file corresponds to one website, with each line in the file being a different review. Not all of the reviews are classified - only the reviews that could be clearly labelled as positive (with a “1”) or negative (with a “0”).
Here, we’ll take the individual text files and combine them into a single data frame, adding a source column in case we want to build models based on the source. Even though the reviews are all classified as positive or negative, the subject matter across the three sites differs (i.e. products on Amazon, restaurants on Yelp, movies on IMDB), so the words that classify a review may differ in some cases.
yelp <- read.delim("https://raw.githubusercontent.com/Logan213/DATA607_Week11/master/yelp_labelled.txt", header = FALSE)
yelp$source <- "Yelp"
amazon <- read.delim("https://raw.githubusercontent.com/Logan213/DATA607_Week11/master/amazon_cells_labelled.txt", header = FALSE)
amazon$source <- "Amazon"
imdb <- read.delim("https://raw.githubusercontent.com/Logan213/DATA607_Week11/master/imdb_labelled.txt", header = FALSE)
imdb$source <- "IMDB"
reviews <- rbind(yelp, amazon, imdb)
head(reviews)
## V1
## 1 Wow... Loved this place.
## 2 I learned that if an electric slicer is used the blade becomes hot enough to start to cook the prosciutto.
## 3 But they don't clean the chiles?
## 4 Crust is not good.
## 5 Not tasty and the texture was just nasty.
## 6 Pancake was WAY bigger than I thought and kind of regretted getting now since I couldn't finish it.
## V2 source
## 1 1 Yelp
## 2 NA Yelp
## 3 NA Yelp
## 4 0 Yelp
## 5 0 Yelp
## 6 NA Yelp
As we can see above, the data frame contains all reviews, and only select cases are classified. There is a way to run models on unclassified data, but in this case we will remove those reviews that are not classified as positive or negative.
reviews <- na.omit(reviews)
colnames(reviews) <- c("Text", "Sentiment", "Source")
Before the data can be analyzed by the models, we need to create a Document Term Matrix, which is the format that RTextTools takes as its input. First we create a corpus by passing the Text column from the data frame into the nested VectorSource and Corpus functions from the tm (text mining) package. The metadata is set using the meta() function.
review_corpus <- Corpus(VectorSource(reviews$Text))
# Attach the sentiment label and the review source as document-level metadata
meta(review_corpus, "sentiment") <- reviews$Sentiment
meta(review_corpus, "source") <- reviews$Source
Once we have our corpus of words, we pass the corpus containing the text of the reviews to the DocumentTermMatrix function. Punctuation, numbers, and English “stop words” are removed to improve the performance of the models. All characters are also converted to lower case, and very sparse terms are removed as well.
dtm <- DocumentTermMatrix(review_corpus,
                          control = list(removePunctuation = TRUE,
                                         removeNumbers = TRUE,
                                         stopwords = TRUE,
                                         tolower = TRUE))
# Drop terms that are absent from more than 99.8% of the documents
dtm <- removeSparseTerms(dtm, 0.998)
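Printing the resulting matrix is a quick way to confirm how many terms survive the sparsity threshold (a sanity check that is not part of the original write-up):
dtm       # reports the number of documents, remaining terms, and overall sparsity
dim(dtm)  # the same information as a simple documents x terms dimension vector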
RTextTools also provides its own function for creating the Document Term Matrix, create_matrix. We could have achieved the same result with the following:
dtm <- create_matrix(reviews$Text,
                     removePunctuation = TRUE,
                     removeNumbers = TRUE,
                     removeStopwords = TRUE,
                     toLower = TRUE,
                     removeSparseTerms = .998)
Before running our estimation procedures, we will create a container object by passing the Document Term Matrix, the sent_labels object containing the positive/negative classifications, the range of documents to be used as the training set, the range for the test set, and a logical value specifying whether to treat the data as virgin into the create_container function. The result is stored in an object simply called container.
The method below is from the text (“Automated Data Collection with R”), but alternatively, since we have a data frame containing the data, instead of creating the sent_labels object from the metadata, we could have simply passed the column with the classification information (reviews$Sentiment) into the function, as shown after the code below.
sent_labels <- unlist(meta(review_corpus, "sentiment"))
container <- create_container(dtm,
                              labels = sent_labels,
                              trainSize = 1:1000,
                              testSize = 1001:length(sent_labels),
                              virgin = FALSE)
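As noted above, because the classifications already live in the reviews data frame, an equivalent container could be built without the metadata step (a sketch of that alternative; container_alt is just an illustrative name):
# Alternative: pass the Sentiment column directly as the label vector
container_alt <- create_container(dtm,
                                  labels = reviews$Sentiment,
                                  trainSize = 1:1000,
                                  testSize = 1001:nrow(reviews),
                                  virgin = FALSE)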
There are nine algorithms included in RTextTools, and for this set of documents we will use the support vector machine (“SVM”), maximum entropy (“MAXENT”), decision tree (“TREE”), and random forest (“RF”) training models.
To train our selected algorithms, we use the train_model function, passing in the container object and the string referencing the algorithm. Each of the four trained models is then passed, along with the container, into the classify_model function to return the classified data.
svm_model <- train_model(container, "SVM")
tree_model <- train_model(container, "TREE")
maxent_model <- train_model(container, "MAXENT")
rf_model <- train_model(container, "RF")
svm_out <- classify_model(container, svm_model)
tree_out <- classify_model(container, tree_model)
maxent_out <- classify_model(container, maxent_model)
rf_out <- classify_model(container, rf_model)
To compare our results, we will make a data frame containing the correct labels (supplied by the sent_labels object) and the results of each model.
labels_out <- data.frame(correct_label = sent_labels[1001:length(sent_labels)],
                         svm = as.character(svm_out[, 1]),
                         tree = as.character(tree_out[, 1]),
                         maxent = as.character(maxent_out[, 1]),
                         rf = as.character(rf_out[, 1]),
                         stringsAsFactors = FALSE)
head(labels_out)
## correct_label svm tree maxent rf
## 11001 1 1 0 1 1
## 11002 1 1 0 0 0
## 11003 0 0 0 0 0
## 11004 1 1 1 1 1
## 11005 0 1 0 1 0
## 11006 1 1 1 1 1
For each model, a table of the counts of correct and incorrect classifications, along with the corresponding proportions, is shown below:
##
## FALSE TRUE
## 277 637
##
## FALSE TRUE
## 0.3030635 0.6969365
##
## FALSE TRUE
## 346 568
##
## FALSE TRUE
## 0.3785558 0.6214442
##
## FALSE TRUE
## 279 635
##
## FALSE TRUE
## 0.3052516 0.6947484
##
## FALSE TRUE
## 278 636
##
## FALSE TRUE
## 0.3041575 0.6958425
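The code that produced these tables is not echoed above; one way to generate them is with table() and prop.table() on the labels_out data frame created earlier (a sketch of that approach):
# For each model, count (and then express as proportions) how often its
# prediction agrees with the correct label
for (model in c("svm", "tree", "maxent", "rf")) {
  agreement <- table(labels_out$correct_label == labels_out[[model]])
  print(agreement)
  print(prop.table(agreement))
}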
Alternatively, we can use the create_analytics function from RTextTools to create a table showing the performance of each algorithm. Calling summary() on the analytics object displays the ensemble summary and the individual algorithm performance.
analytics <- create_analytics(container, cbind(svm_out, tree_out, maxent_out, rf_out))
summary(analytics)
## ENSEMBLE SUMMARY
##
## n-ENSEMBLE COVERAGE n-ENSEMBLE RECALL
## n >= 1 1.00 0.68
## n >= 2 1.00 0.68
## n >= 3 0.89 0.72
## n >= 4 0.52 0.81
##
##
## ALGORITHM PERFORMANCE
##
## SVM_PRECISION SVM_RECALL SVM_FSCORE
## 0.705 0.695 0.695
## FORESTS_PRECISION FORESTS_RECALL FORESTS_FSCORE
## 0.715 0.695 0.690
## TREE_PRECISION TREE_RECALL TREE_FSCORE
## 0.705 0.620 0.580
## MAXENTROPY_PRECISION MAXENTROPY_RECALL MAXENTROPY_FSCORE
## 0.695 0.695 0.695
We can also get the label summary:
analytics@label_summary
## NUM_MANUALLY_CODED NUM_CONSENSUS_CODED NUM_PROBABILITY_CODED
## 0 461 649 519
## 1 453 265 395
## PCT_CONSENSUS_CODED PCT_PROBABILITY_CODED PCT_CORRECTLY_CODED_CONSENSUS
## 0 140.7809 112.58134 88.93709
## 1 58.4989 87.19647 47.24062
## PCT_CORRECTLY_CODED_PROBABILITY
## 0 75.48807
## 1 62.25166
And lastly, a preview of the document summary:
head(analytics@document_summary)
## SVM_LABEL SVM_PROB TREE_LABEL TREE_PROB MAXENTROPY_LABEL
## 1 1 0.8293420 0 0.6006441 1
## 2 1 0.5551329 0 0.6006441 0
## 3 0 0.9561301 0 1.0000000 0
## 4 1 0.8928492 1 0.9333333 1
## 5 1 0.5403048 0 0.6006441 1
## 6 1 0.9589604 1 0.9878049 1
## MAXENTROPY_PROB FORESTS_LABEL FORESTS_PROB MANUAL_CODE CONSENSUS_CODE
## 1 0.9999981 1 0.885 1 1
## 2 1.0000000 0 0.705 1 0
## 3 1.0000000 0 1.000 0 0
## 4 1.0000000 1 0.955 1 1
## 5 1.0000000 0 0.635 0 0
## 6 1.0000000 1 1.000 1 1
## CONSENSUS_AGREE CONSENSUS_INCORRECT PROBABILITY_CODE
## 1 3 0 1
## 2 3 1 0
## 3 4 0 0
## 4 4 0 1
## 5 2 0 1
## 6 4 0 1
## PROBABILITY_INCORRECT
## 1 0
## 2 1
## 3 0
## 4 0
## 5 1
## 6 0
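The per-algorithm accuracies discussed below can also be recovered from the document summary by comparing each algorithm’s predicted label to the manually coded label (a sketch based on the column names shown above; doc_sum is just a convenience name):
doc_sum <- analytics@document_summary
# Proportion of test-set documents each algorithm labelled correctly
mean(doc_sum$SVM_LABEL == doc_sum$MANUAL_CODE)
mean(doc_sum$TREE_LABEL == doc_sum$MANUAL_CODE)
mean(doc_sum$MAXENTROPY_LABEL == doc_sum$MANUAL_CODE)
mean(doc_sum$FORESTS_LABEL == doc_sum$MANUAL_CODE)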
None of the four models was able to classify the documents with better than 70% accuracy. This may be due to the sparsity of the documents, or to the differing subject matter of each review set. Individually, the Support Vector Machine, Maximum Entropy, and Random Forest models performed similarly (roughly 69-70% accuracy), while the Decision Tree model had the worst performance, at about 62% accuracy.