Adena Lin
December 9, 2015
Sentiment analysis: “Opinion mining” - determining whether a piece of text is positive, negative, or neutral
The greatest challenge is inter-rater reliability: even human raters frequently disagree on the sentiment of a text.
“Numerous studies have shown that the rate of human concordance is between 70% and 80%.”
Source: http://brnrd.me/sentiment-analysis-never-accurate/
As a result, I aimed for an accuracy of 80% for my models.
The models presented do not examine sentence structure.
Only word-level sentiments are analyzed to produce an overall sentiment rating for each piece of text.
1) Using lexicon of positive & negative words
2) Using 'syuzhet' package
3) Using classification models (SVM & RF)
All three approaches were evaluated against the original training set; only the classification models, which performed best, were also run against the held-out test set.
Train dataset: http://www.cs.cornell.edu/people/pabo/movie-review-data/ [polarity dataset v2.0]
Test dataset: http://ai.stanford.edu/~amaas/data/sentiment/
library(RTextTools)
library(plyr)
library(stringr)
library(readr)
# Load in lexicon of words that are categorized into positive/negative sentiment
# Source: http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
positiveWords <- scan("~/Desktop/review_polarity/lexicon/positive-words.txt", what = 'character')
negativeWords <- scan("~/Desktop/review_polarity/lexicon/negative-words.txt", what = 'character')
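One caveat: the raw lexicon files from the source above open with a ';'-prefixed comment header, and scan() will read those header tokens as words unless told to skip them. scan()'s standard comment.char argument handles this; a safer load looks like:
positiveWords <- scan("~/Desktop/review_polarity/lexicon/positive-words.txt",
                      what = 'character', comment.char = ';')
negativeWords <- scan("~/Desktop/review_polarity/lexicon/negative-words.txt",
                      what = 'character', comment.char = ';')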
# Function to break each review into individual words and compare them to the lexicon
sentimentFunction <- function(sentences, positiveWords, negativeWords, .progress = 'none') {
  scores <- laply(sentences, function(sentence, positiveWords, negativeWords) {
    # split the review into individual words
    words <- unlist(str_split(sentence, '\\s+'))
    # count how many words match the +/- lexicons (non-matches are NA)
    posMatches <- sum(!is.na(match(words, positiveWords)))
    negMatches <- sum(!is.na(match(words, negativeWords)))
    # one row per review: the text plus its positive/negative word counts
    c(sentence, posMatches, negMatches)
  }, positiveWords, negativeWords, .progress = .progress)
  return(scores)
}
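A quick spot-check of the scorer on toy input ('great' and 'wonderful' appear in the positive lexicon, 'bad' and 'boring' in the negative one):
sentimentFunction(c('a great wonderful movie', 'a bad boring mess'),
                  positiveWords, negativeWords)
# returns a 2 x 3 character matrix: review text, positive count, negative count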
### Import training files ###
setwd("~/Desktop/review_polarity/txt_sentoken/pos/")
pos <- list.files(path = "~/Desktop/review_polarity/txt_sentoken/pos/", pattern = "*.txt")
positiveReviews <- lapply(pos, read_file)
# remove punctuation & line breaks
positiveReviews <- lapply(positiveReviews, gsub, pattern = '[[:punct:]\n]', replacement = '')
# remove digits
positiveReviews <- lapply(positiveReviews, gsub, pattern = '\\d+', replacement = '')
# do the same for 'negativeReviews'
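For reference, a sketch of the mirrored steps for the negative reviews (same cleaning, neg/ directory):
setwd("~/Desktop/review_polarity/txt_sentoken/neg/")
neg <- list.files(path = "~/Desktop/review_polarity/txt_sentoken/neg/", pattern = "*.txt")
negativeReviews <- lapply(neg, read_file)
negativeReviews <- lapply(negativeReviews, gsub, pattern = '[[:punct:]\n]', replacement = '')
negativeReviews <- lapply(negativeReviews, gsub, pattern = '\\d+', replacement = '')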
# data frame of +/- reviews with scores
# compares words from training data to lexicon to count number of +/- words
posResult <- as.data.frame(sentimentFunction(positiveReviews, positiveWords, negativeWords))
negResult <- as.data.frame(sentimentFunction(negativeReviews, positiveWords, negativeWords))
posResult <- cbind(posResult, 'positive')
colnames(posResult) <- c('sentence','positive', 'negative','sentiment')
negResult <- cbind(negResult, 'negative')
colnames(negResult) <- c('sentence', 'positive', 'negative', 'sentiment')
functionResults <- rbind(posResult, negResult)
# end up with a data frame that lists the number of +/- words in each review
# FORMULA: more positive than negative words = positive sentiment (ties count as negative)
# note: the counts are stored as factors, so convert via as.character before as.numeric
functionResults$predicted <- ifelse(as.numeric(as.character(functionResults$positive)) >
                                    as.numeric(as.character(functionResults$negative)),
                                    'positive', 'negative')
# test on (compare with) original training data
# overall accuracy = 65%
recall_accuracy(functionResults$sentiment, functionResults$predicted)
# positive sentiments accuracy = 58%
recall_accuracy(functionResults$sentiment[1:1000], functionResults$predicted[1:1000])
# negative sentiments accuracy = 72%
recall_accuracy(functionResults$sentiment[1001:2000], functionResults$predicted[1001:2000])
library(syuzhet)
# Import text files
setwd("~Adena/Desktop/review_polarity/txt_sentoken/pos/")
pos <- list.files(path = "~Adena/Desktop/review_polarity/txt_sentoken/pos/", pattern = "*.txt")
positive.reviews <- lapply(pos, read_file)
setwd("~Adena/Desktop/review_polarity/txt_sentoken/neg/")
neg <- list.files(path = "~Adena/Desktop/review_polarity/txt_sentoken/neg/", pattern = "*.txt")
negative.reviews <- lapply(neg, read_file)
# sentiment analysis for all positive reviews
positive.reviews <- as.character(positive.reviews)
posSentiment <- get_sentiment(positive.reviews, method = "bing") # using "Bing" lexicon
length(which(posSentiment > 0)) # this says that 598 of these reviews were classified as positive
# since we know all 1000 reviews were positive, this gives an accuracy of about 60%
# sentiment analysis for all negative reviews
negative.reviews <- as.character(negative.reviews)
negSentiment <- get_sentiment(negative.reviews, method = "bing")
length(which(negSentiment < 0))
# 771/1000 negative reviews were correctly classified, giving an accuracy of about 77%
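The same counts can be expressed as proportions directly; pooling both classes gives an overall accuracy of about 68% for the syuzhet/Bing approach:
mean(posSentiment > 0)  # ~0.60 on the positive reviews
mean(negSentiment < 0)  # ~0.77 on the negative reviews
(sum(posSentiment > 0) + sum(negSentiment < 0)) / 2000  # ~0.68 overall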
library(RTextTools)
library(e1071)
library(readr)
# Import text files & remove punctuation, digits, and line breaks (same steps as above)
# Create new sentiment column for each review:
posReviews <- cbind(positive.reviews, 'positive')
negReviews <- cbind(negative.reviews, 'negative')
# Create one large dataset for training
allReviews <- rbind(posReviews, negReviews)
reviewsMatrix <- create_matrix(allReviews[,1],
                               language = "english",
                               removeStopwords = FALSE,  # some stop words may be relevant
                               stemWords = FALSE,
                               toLower = TRUE)           # convert words to lowercase
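create_matrix() returns a document-term matrix (a tm DocumentTermMatrix under the hood); a quick look at its dimensions confirms 2000 documents by however many unique terms survived preprocessing:
dim(reviewsMatrix)  # 2000 reviews x vocabulary size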
# store labels as numeric codes: 'negative' -> 1, 'positive' -> 2 (factor levels sort alphabetically)
allReviews[,2] <- as.numeric(as.factor(allReviews[,2]))
# CLASSIFICATION: '2' is positive, '1' is negative
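Why '2' means positive: as.numeric() on a factor returns the level index, and factor levels sort alphabetically, so 'negative' codes to 1 and 'positive' to 2:
as.numeric(as.factor(c('negative', 'positive')))  # 1 2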
# create model to test first on training data
reviewsContainer <- create_container(reviewsMatrix, as.numeric(allReviews[,2]),
                                     trainSize = 1:2000,  # train on the entire dataset
                                     virgin = FALSE)
# train models
sentimentModels <- train_models(reviewsContainer,
                                algorithms = c("SVM", "RF"),
                                cost = 5,    # SVM parameter
                                ntree = 20)  # RF parameter
results1 <- classify_models(reviewsContainer, sentimentModels)
# test on training data
# SVM
recall_accuracy(as.numeric(allReviews[,2]), results1[,'SVM_LABEL']) # 81.8% accuracy
# RF
recall_accuracy(as.numeric(allReviews[,2]), results1[,'FORESTS_LABEL']) # 100% accuracy?! the forest has effectively memorized the training set
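A perfect in-sample random-forest score is a classic sign of overfitting. RTextTools ships a cross_validate() helper that gives a less optimistic in-sample estimate (a sketch; the 4-fold choice is arbitrary):
SVM_cv <- cross_validate(reviewsContainer, 4, "SVM")
RF_cv <- cross_validate(reviewsContainer, 4, "RF")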
# build data frame for test data (using approx 400 text files, 50% positive)
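positiveTest and negativeTest are assembled from the Stanford files the same way as the training reviews; a sketch, where the unpack location (~/Desktop/aclImdb/test/) and the 200-file sample per class are assumptions:
setwd("~/Desktop/aclImdb/test/pos/")                  # assumed unpack location
posTestFiles <- list.files(pattern = "*.txt")[1:200]  # assumed sample of 200 reviews
positiveTest <- sapply(posTestFiles, read_file)
positiveTest <- gsub('[[:punct:]\n]', '', positiveTest)
positiveTest <- gsub('\\d+', '', positiveTest)
positiveTest <- cbind(positiveTest, 'positive')
# ... mirrored for negativeTest (neg/ directory, 'negative' label)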
testData <- rbind(positiveTest, negativeTest)
testData[,2] <- as.numeric(as.factor(testData[,2]))  # same coding as training: 1 = negative, 2 = positive
allReviews_test <- rbind(allReviews, testData)
# build a document-term matrix and container over the combined train + test reviews
testMatrix <- create_matrix(allReviews_test[,1], language = "english",
                            removeStopwords = FALSE, stemWords = FALSE, toLower = TRUE)
testContainer <- create_container(testMatrix,
                                  as.numeric(allReviews_test[,2]),
                                  trainSize = 1:2000,
                                  testSize = 2001:2400,  # test data
                                  virgin = FALSE)
# train models on the new container (same train_models call as before)
testModels <- train_models(testContainer, algorithms = c("SVM", "RF"), cost = 5, ntree = 20)
# classify the 400 held-out reviews
results2 <- classify_models(testContainer, testModels)
# gives a test output of 400 predictions
# SVM = 54% accuracy
recall_accuracy(as.numeric(allReviews_test[2001:2400, 2]), results2[,"SVM_LABEL"])
# RF = 69% accuracy
recall_accuracy(as.numeric(allReviews_test[2001:2400, 2]), results2[,"FORESTS_LABEL"])
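Beyond raw recall, RTextTools can summarize per-label precision, recall, and F-scores for the held-out documents via create_analytics():
analytics <- create_analytics(testContainer, results2)
summary(analytics)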
Conclusions
The test-set results (54% for SVM, 69% for random forest) fell short of the 80% target. One likely factor is that the test reviews were considerably shorter than the training reviews, so each document offered fewer word-level features for the models to work with.