Adena Lin
December 18, 2015
Left off with this formula
# FORMULA: more positive than negative words = positive sentiment
functionResults$predicted <- ifelse(as.numeric(functionResults$positive) > as.numeric(functionResults$negative), 'positive', 'negative')
Next, use a classification model in place of the formula.
First, the positive and negative word lists were updated with common movie review terminology from this site.
positiveWords[2007:2015] <- c('comical',
'uproarious',
'original',
'absorbing',
'riveting',
'surprising',
'dazzling',
'thought-provoking',
'unpretentious')
negativeWords[4784:4790] <- c('second-rate',
'third-rate',
'juvenile',
'ordinary',
'predictable',
'uninteresting',
'outdated')
1) Counting punctuation (.?!)
# count number of periods in each review
positivePeriods <- lapply(gregexpr("[.]", positiveReviews), function(x) ifelse(x[[1]] > 0, length(x), as.integer(0)))
# count number of exclamation marks
positiveExclamations <- lapply(gregexpr("[!]", positiveReviews), function(x) ifelse(x[[1]] > 0, length(x), as.integer(0)))
# count number of question marks
positiveQuestions <- lapply(gregexpr("[?]", positiveReviews), function(x) ifelse(x[[1]] > 0, length(x), as.integer(0)))
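gregexpr() plus lapply() returns these counts as lists, so before they can sit in the feature table as predictors they need to be flattened; presumably the same three counts are also computed for the negative reviews. A minimal sketch of the flattening step:
# flatten the list output into plain integer vectors
positivePeriods <- unlist(positivePeriods)
positiveExclamations <- unlist(positiveExclamations)
positiveQuestions <- unlist(positiveQuestions)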
2) Proportion of counted words to total number of words in lexicon for each sentiment
# create two new columns for the proportion of +/- words (i.e., positive words counted / total positive words in the lexicon)
functionResults$prop_pos <- functionResults$positive/(length(positiveWords))
functionResults$prop_neg <- functionResults$negative/(length(negativeWords))
3) Difference between number of positive vs. negative words
# add feature: difference between positive and negative words
functionResults$difference <- functionResults$positive - functionResults$negative
4) Proportion of positive to negative words in each review
# add feature: ratio of positive to negative words (set to 100 when a review has no negative words, to avoid dividing by zero)
functionResults$prop_pos_neg <- ifelse(functionResults$negative == 0, 100, functionResults$positive / functionResults$negative)
5) Proportion of positive or negative words to total number of words
# add feature: proportion of positive or negative words to total words (uses num_words, which is computed in feature 6 below, so that step has to run first)
functionResults$pos_totalwords <- functionResults$positive/functionResults$num_words
functionResults$neg_totalwords <- functionResults$negative/functionResults$num_words
6) Total number of words in each review
# count words by counting the runs of non-word characters between words, then adding one
wordcount <- function(str) {
  sapply(gregexpr("\\b\\W+\\b", str, perl = TRUE), function(x) sum(x > 0)) + 1
}
functionResults$num_words <- wordcount(functionResults$sentence)
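A quick sanity check of the counter (the example strings are my own): the regex counts runs of non-word characters that sit between words, so leading or trailing punctuation does not inflate the count.
wordcount("This movie was surprisingly good!")   # returns 5
wordcount("Dull.")                               # returns 1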
12 predictors in total
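Those 12 predictors are the positive and negative word counts, the three punctuation counts, prop_pos, prop_neg, difference, prop_pos_neg, pos_totalwords, neg_totalwords, and num_words. The assembly of sentimentMatrix isn't shown above; a sketch of what it might look like, assuming functionResults also carries the true sentiment label and that the punctuation counts were stored under the column names used here (both assumptions):
# sentiment label in column 1, the 12 predictors in columns 2:13 (punctuation column names are assumptions)
sentimentMatrix <- functionResults[, c("sentiment",
                                       "positive", "negative",
                                       "num_periods", "num_exclamations", "num_questions",
                                       "prop_pos", "prop_neg", "difference",
                                       "prop_pos_neg", "pos_totalwords", "neg_totalwords",
                                       "num_words")]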
Decision tree
library(rpart)
library(RTextTools)   # provides recall_accuracy()
# fit a single decision tree on all 12 predictors
rfAnalysis <- rpart(sentiment ~ ., data = sentimentMatrix)
# predict on the training data first (returns class probabilities)
trainResults <- predict(rfAnalysis, sentimentMatrix)
trainResults <- data.frame(trainResults)
trainResults$sentiment <- ifelse(trainResults$positive > trainResults$negative, 'positive', 'negative')
# then predict on the held-out test data
testResults <- predict(rfAnalysis, testMatrix)
testResults <- data.frame(testResults)
testResults$sentiment <- ifelse(testResults$positive > testResults$negative, 'positive', 'negative')
recall_accuracy(testMatrix$sentiment, testResults$sentiment)
Tested against the training data:
100% accuracy with the random forest
70.55% accuracy with the decision tree
Although the decision tree had worse accuracy than the random forest when tested against the original training data, it fared better on the held-out test data, suggesting that the random forest was overfitting.
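The random forest run referenced above isn't shown in this post; a minimal sketch of how it could be reproduced, assuming the randomForest package, the same sentimentMatrix / testMatrix frames, and that sentiment is already a factor:
library(randomForest)
# fit on all 12 predictors, mirroring the decision tree above
rfModel <- randomForest(sentiment ~ ., data = sentimentMatrix)
rfTestPred <- predict(rfModel, testMatrix)    # predicted class labels
recall_accuracy(testMatrix$sentiment, as.character(rfTestPred))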
Next, we will use automatic feature selection (recursive feature elimination via caret's rfe()) instead of building the model from all predictors.
library(caret)   # provides rfeControl(), rfe(), and rfFuncs
ctrl <- rfeControl(method = "repeatedcv",
                   repeats = 5,
                   verbose = TRUE,
                   functions = rfFuncs)
featureSelection <- rfe(x = sentimentMatrix[, 2:13],
                        y = sentimentMatrix$sentiment,
                        sizes = 1:12,            # subset sizes to try (12 predictors in total)
                        metric = "Accuracy",
                        rfeControl = ctrl)
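Printing the fitted rfe object shows the resampling profile across subset sizes, and caret's predictors() helper lists the variables kept in the final model:
featureSelection               # accuracy for each subset size
predictors(featureSelection)   # the variables retained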
# automatic feature selection chose 5 variables: prop_pos_neg, difference, neg_totalwords, prop_pos, positive
# build a new decision tree (rpart) using only these 5 features
rfFeature <- rpart(sentiment ~ prop_pos_neg + difference + neg_totalwords + prop_pos + positive, data = sentimentMatrix)
# test the model on the held-out data
testResults2 <- predict(rfFeature, testData)
# predict() returns class probabilities for 'positive' and 'negative' sentiment
# convert the probabilities into class labels for comparison
testResults2 <- data.frame(testResults2)
testResults2$sentiment <- ifelse(testResults2$positive > testResults2$negative, 'positive', 'negative')
recall_accuracy(testData$sentiment, testResults2$sentiment)
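A quick cross-tabulation shows where the remaining errors fall (assuming testData$sentiment holds the true labels):
table(actual = testData$sentiment, predicted = testResults2$sentiment)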
Results
72.75% accuracy using only 5 features (the same accuracy as when all 12 features are used)