Sentiment Analysis

Adena Lin
December 9, 2015

Introduction

Sentiment analysis (“opinion mining”): determining whether a piece of text is positive, negative, or neutral

The greatest challenge is interrater reliability.
“Numerous studies have shown that the rate of human concordance is between 70% and 80%.”

Source: http://brnrd.me/sentiment-analysis-never-accurate/

As a result, I aimed for an accuracy of 80% for my models.

Other Challenges

  • Context: “My bank does a great job stealing money from me”
  • Ambiguity: “Recommendations for a good comedy?”
  • Comparatives: “Coke is better than Pepsi”
  • Slang: “That was a sick concert!”
  • Sarcasm

The models presented do not examine sentence structure.

Only word sentiments will be analyzed to provide overall sentiment ratings for a piece of text.
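For example, a purely word-level scorer cannot see the sarcasm in the bank sentence above. A minimal sketch of the problem (a toy illustration, with tiny inline word lists standing in for the full lexicon used later):

library(stringr)

# toy stand-ins for the positive/negative lexicons loaded later
positiveWords <- c("great", "good", "love")
negativeWords <- c("stealing", "bad", "hate")

sentence <- tolower("My bank does a great job stealing money from me")
words <- unlist(str_split(sentence, '\\s+'))

sum(words %in% positiveWords) # 1 ("great")
sum(words %in% negativeWords) # 1 ("stealing")
# the counts tie, so a count-based rule cannot tell this review is negative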

Methodology

1) Using a lexicon of positive & negative words

2) Using the 'syuzhet' package

3) Using classification models (SVM & RF)

All models were first compared against the original training set; only the classification models, which performed best, were also tested against the test set.

Train dataset: http://www.cs.cornell.edu/people/pabo/movie-review-data/ [polarity dataset v2.0]

Test dataset: http://ai.stanford.edu/~amaas/data/sentiment/

Using a Lexicon

library(RTextTools)
library(plyr)
library(stringr)
library(readr)

# Load in lexicon of words that are categorized into positive/negative sentiment
# Source: http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
positiveWords <- scan("~/Desktop/review_polarity/lexicon/positive-words.txt", what = 'character')
negativeWords <- scan("~/Desktop/review_polarity/lexicon/negative-words.txt", what = 'character')
# Function: break each review into individual words and count matches against the lexicon
sentimentFunction <- function(sentences, positiveWords, negativeWords, .progress = 'none') {
  scores <- laply(sentences, function(sentence, positiveWords, negativeWords) {
    # split the review into individual words
    words <- unlist(str_split(sentence, '\\s+'))
    # compare our words to the lexicon of +/- terms and count the matches
    posMatches <- sum(!is.na(match(words, positiveWords)))
    negMatches <- sum(!is.na(match(words, negativeWords)))
    # one row per review: the text plus its two counts
    c(sentence, posMatches, negMatches)
  }, positiveWords, negativeWords, .progress = .progress)
  return(scores)
}
### Import training files ###
setwd("~/Desktop/review_polarity/txt_sentoken/pos/")
pos <- list.files(path = "~/Desktop/review_polarity/txt_sentoken/pos/", pattern = "*.txt")
positiveReviews <- lapply(pos, read_file)
# remove punctuation & line breaks
positiveReviews <- lapply(positiveReviews, gsub, pattern = '[[:punct:],\n]', replacement = '')
# remove digits
positiveReviews <- lapply(positiveReviews, gsub, pattern = '\\d+', replacement = '')
# do the same for 'negativeReviews'
setwd("~/Desktop/review_polarity/txt_sentoken/neg/")
neg <- list.files(path = "~/Desktop/review_polarity/txt_sentoken/neg/", pattern = "*.txt")
negativeReviews <- lapply(neg, read_file)
negativeReviews <- lapply(negativeReviews, gsub, pattern = '[[:punct:],\n]', replacement = '')
negativeReviews <- lapply(negativeReviews, gsub, pattern = '\\d+', replacement = '')
# data frame of +/- reviews with scores
# compares words from training data to lexicon to count number of +/- words
posResult <- as.data.frame(sentimentFunction(positiveReviews, positiveWords, negativeWords), stringsAsFactors = FALSE)
negResult <- as.data.frame(sentimentFunction(negativeReviews, positiveWords, negativeWords), stringsAsFactors = FALSE)
posResult <- cbind(posResult, 'positive')
colnames(posResult) <- c('sentence','positive', 'negative','sentiment')
negResult <- cbind(negResult, 'negative')
colnames(negResult) <- c('sentence', 'positive', 'negative', 'sentiment')
functionResults <- rbind(posResult, negResult)
# end up with data frame that lists the number of +/- words in each review
# FORMULA: more positive than negative words = positive sentiment
functionResults$predicted <- ifelse(as.numeric(functionResults$positive) > as.numeric(functionResults$negative), 'positive', 'negative')

# test on (compare with) original training data
# overall accuracy = 65%
recall_accuracy(functionResults$sentiment, functionResults$predicted)
# positive sentiments accuracy = 58%
recall_accuracy(functionResults$sentiment[1:1000], functionResults$predicted[1:1000])
# negative sentiments accuracy = 72%
recall_accuracy(functionResults$sentiment[1001:2000], functionResults$predicted[1001:2000])
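As a quick sanity check, the same function can score new sentences directly (hypothetical examples, reusing the lexicons loaded above):

sentimentFunction(c("an absolutely wonderful and moving film",
                    "a dull and painful mess"),
                  positiveWords, negativeWords)
# returns one row per sentence: the text, its positive count, its negative count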

Using 'syuzhet' package

library(syuzhet)

# Import text files
setwd("~Adena/Desktop/review_polarity/txt_sentoken/pos/")
pos <- list.files(path = "~Adena/Desktop/review_polarity/txt_sentoken/pos/", pattern = "*.txt")
positive.reviews <- lapply(pos, read_file)

setwd("~Adena/Desktop/review_polarity/txt_sentoken/neg/")
neg <- list.files(path = "~Adena/Desktop/review_polarity/txt_sentoken/neg/", pattern = "*.txt")
negative.reviews <- lapply(neg, read_file)
# sentiment analysis for all positive reviews
positive.reviews <- as.character(positive.reviews)
posSentiment <- get_sentiment(positive.reviews, method = "bing") # using "Bing" lexicon
length(which(posSentiment > 0)) # this says that 598 of these reviews were classified as positive
# since we know all 1000 reviews were positive, this gives an accuracy of about 60%

# sentiment analysis for all negative reviews
negative.reviews <- as.character(negative.reviews)
negSentiment <- get_sentiment(negative.reviews, method = "bing")
length(which(negSentiment < 0))
# 771/1000 negative reviews were correctly classified, giving an accuracy of 77%

These results are similar to those from the manual lexicon approach.
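Combining the two checks gives the overall training accuracy directly (the 598 and 771 counts come from the calls above):

# TRUE wherever the sign of the syuzhet score matches the known label
correct <- c(posSentiment > 0, negSentiment < 0)
mean(correct) # (598 + 771) / 2000 = ~68%, in line with the lexicon approach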

Using classification models

library(RTextTools)
library(e1071)
library(readr)

# Import text files & remove punctuation, digits, and line breaks (as before)
# Create a new sentiment column for each review:
posReviews <- cbind(positive.reviews, 'positive')
negReviews <- cbind(negative.reviews, 'negative')

# Create one large dataset for training
allReviews <- rbind(posReviews, negReviews)
reviewsMatrix <- 
  create_matrix(allReviews[,1], 
                language="english",
                removeStopwords = FALSE, # some stop words may be relevant
                stemWords=FALSE, toLower = TRUE) # turn words into lowercase

# CLASSIFICATION: labels are coded numerically with as.numeric(as.factor(...));
# factor levels sort alphabetically, so '2' is positive and '1' is negative

# create model to test first on training data
reviewsContainer <- create_container(reviewsMatrix,
                                     as.numeric(as.factor(allReviews[,2])),
                                     trainSize = 1:2000, # train on the entire dataset
                                     virgin = FALSE)
# train models
sentimentModels <-
  train_models(reviewsContainer,
               algorithms = c("SVM", "RF"),
               cost = 5,   # SVM parameter
               ntree = 20) # RF parameter

results1 <- classify_models(reviewsContainer, sentimentModels)

# test on training data
# SVM
recall_accuracy(as.numeric(as.factor(allReviews[,2])), results1[,'SVM_LABEL']) # 81.8% accuracy
# RF
recall_accuracy(as.numeric(as.factor(allReviews[,2])), results1[,'FORESTS_LABEL']) # 100% accuracy?!
# (the RF is scored on its own training data here, so this mostly reflects overfitting)
# build data frame for test data (approx. 400 text files, 50% positive,
# imported and labelled the same way as the training reviews)
testData <- rbind(positiveTest, negativeTest)
allReviews_test <- rbind(allReviews, testData)

# identify data: rebuild the document-term matrix over train + test reviews
testMatrix <- create_matrix(allReviews_test[,1],
                            language = "english",
                            removeStopwords = FALSE,
                            stemWords = FALSE, toLower = TRUE)
testContainer <- create_container(testMatrix,
                                  as.numeric(as.factor(allReviews_test[,2])),
                                  trainSize = 1:2000,
                                  testSize = 2001:2400, # test data
                                  virgin = FALSE)

# train models on the new container (same settings as before)
testModels <- train_models(testContainer,
                           algorithms = c("SVM", "RF"),
                           cost = 5, ntree = 20)
# test model
results2 <- classify_models(testContainer, testModels)
# gives a test output of 400 predictions
# SVM = 54% accuracy
recall_accuracy(as.numeric(as.factor(allReviews_test[2001:2400,2])), results2[,"SVM_LABEL"])
# RF = 69% accuracy
recall_accuracy(as.numeric(as.factor(allReviews_test[2001:2400,2])), results2[,"FORESTS_LABEL"])
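A more honest training-set estimate than the RF's 100% would come from n-fold cross-validation, which RTextTools supports; a sketch along these lines (not run for this write-up), reusing reviewsContainer from above:

# 4-fold cross-validation on the training container
svmCV <- cross_validate(reviewsContainer, 4, algorithm = "SVM")
rfCV <- cross_validate(reviewsContainer, 4, algorithm = "RF")
# each call reports per-fold accuracies and their mean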

Conclusions

Results were not as high as I had anticipated. This may be because the test text files were considerably shorter than the training text files.
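One quick way to check that explanation, reusing the objects built above:

# average character length of training vs. test reviews
mean(nchar(allReviews[, 1]))               # training reviews
mean(nchar(allReviews_test[2001:2400, 1])) # test reviews
# a much smaller test-set mean would support the length explanation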