Introduction

Social media sources deliver a massive amount of valuable textual data on a daily basis. A numerical value can be extracted from these data so that whole sentences, or more broadly whole posts, can be summarized. This value is often called ‘sentiment’, as it encapsulates the emotional tone of the associated expression. Sentiment analysis makes it possible to gather overall social feedback towards particular goods, events or even people, and it helps in areas such as consumer analysis or marketing.

The paper by Go, Bhayani and Huang (2009) is one of the most notable articles that laid the groundwork for the field of sentiment analysis. In this short report, I will try to reproduce their approach, but on a different data set. In their paper, Go et al. present an analysis of sentiment in Twitter posts, i.e. tweets. Tweets are short messages posted on a public website, in which people mostly express their opinions on a wide range of topics. Go et al. train several classifiers to see whether they can effectively determine whether a particular tweet carries a rather positive or negative sentiment. To train a classifier that can capture the sentiment of textual data, a labeled data set is necessary. Unfortunately, Twitter does not let users mark their tweets with a sentiment measure, so Go et al. make use of emoticons: if a tweet contains emoticons such as ‘:)’ or ‘:D’, they consider it a positive one, whereas if it contains emoticons such as ‘:(’ or ‘:-(’, they consider it a negative one. The classifiers they use include SVM, Naive Bayes and maximum entropy (logistic regression), alongside a simple keyword-based baseline that assigns a sentiment to each word. Their results indicate that the accuracy of the machine learning methods is above 80%, which is relatively good.
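As a minimal sketch of that emoticon-based labeling idea (the tweets vector here is purely hypothetical, not data used in this report):

# hypothetical example tweets
tweets <- c("loving the new update :)", "this patch broke everything :(")

# a text is labeled positive or negative depending on which emoticon set it contains
hasAny <- function(text, emoticons) {
  Reduce(`|`, lapply(emoticons, function(e) grepl(e, text, fixed = TRUE)))
}
label <- ifelse(hasAny(tweets, c(":)", ":D")), "positive",
                ifelse(hasAny(tweets, c(":(", ":-(")), "negative", NA))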

In this report, I would like to use another data set, from a different domain. Game reviews are a way for developers to learn how well they did their job: either a game gets good, positive reviews, or it is completely dragged through the mud. If a game received very good reviews, developers usually give it another chance by releasing a sequel with a new story, graphics or audio. The attitude towards sequels is mixed, though. Sometimes it would have been better not to touch a masterpiece, but several times in the history of gaming developers have managed to deliver another episode of excellent-quality entertainment. Given that, I would like to extend this report in one more direction: by measuring the difference in sentiment between original and sequel games.

Data

The data consists of reviews in the form of JSONs that were fetched using Steam’s API for Mass Effect 1 and 2, and Mafia 1 and 2. The data is not provided here, but feel free to download the reviews using the API: https://partner.steamgames.com/doc/store/getreviews. The API call returns several thousand of the most recent reviews for a particular title. A single JSON contains: the author, a timestamp, the review text, its popularity (the number of times the review has been liked) and other features that are not needed here.
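The packages assumed to be loaded throughout this report (inferred from the functions used in the code below) are listed here, together with a rough sketch of how a page of reviews could be fetched; the exact endpoint and parameters are the ones described in the documentation linked above.

library(jsonlite)          # fromJSON
library(dplyr)             # %>%, filter, full_join
library(tm)                # Corpus, tm_map, DocumentTermMatrix, TermDocumentMatrix
library(SnowballC)         # stemDocument
library(ggplot2)           # bar plots
library(wordcloud2)        # wordclouds
library(htmlwidgets)       # saveWidget
library(webshot)           # webshot
library(knitr)             # kable
library(kableExtra)        # kable_styling, row_spec, column_spec
library(SentimentAnalysis) # analyzeSentiment, convertToDirection
library(caret)             # createDataPartition, upSample, train, confusionMatrix

# sketch: one page of reviews for a given appid (e.g. 17460 for Mass Effect);
# the 'cursor' field of the response would be used to page through all reviews
# page <- fromJSON("https://store.steampowered.com/appreviews/17460?json=1&num_per_page=100")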

#ME
me <- fromJSON("me.json", flatten=TRUE)
me1.orig <- me %>% filter(product_id == "17460")
me2.orig <- me %>% filter(product_id == "24980")
me$product_id <- ifelse(me$product_id == "17460", "Mass Effect 1", "Mass Effect 2")
me$gameplay <- cut(me$hours, breaks = 5, include.lowest = T, dig.lab = 4)

#mafia
mafia <- fromJSON("mafia.json", flatten=TRUE)
mafia1.orig <- mafia %>% filter(product_id == "40990")
mafia2.orig <- mafia %>% filter(product_id == "50130")
mafia$product_id <- ifelse(mafia$product_id == "40990", "Mafia 1", "Mafia 2")
mafia$gameplay <- cut(mafia$hours, breaks = 5, include.lowest = T, dig.lab = 4)
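
# createTdmDtm: clean the raw review text (special characters, whitespace, punctuation,
# numbers, stop words), lower-case and stem it, then return a document-term matrix and
# a term-document matrix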
createTdmDtm <- function(text) {
  
  toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))

  corp <- Corpus(VectorSource(text))
  corp <- tm_map(corp, toSpace, "/")
  corp <- tm_map(corp, toSpace, "@")
  corp <- tm_map(corp, toSpace, "\\|")
  corp <- tm_map(corp, toSpace, "[^\x01-\x7F]")
  corp <- tm_map(corp, toSpace, "¦")
  corp <- tm_map(corp, stripWhitespace)
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, removeWords, stopwords("english"))
  corp <- tm_map(corp, stemDocument)
  
  dtm <- DocumentTermMatrix(corp)
  tdm <- TermDocumentMatrix(corp)
  
  list(dtm, tdm)
}

mostCommonWordsPlt <- function(DTM1, DTM2, sparsity = 0.9, gameName1, gameName2){
  
  freq1 <- colSums(as.matrix(removeSparseTerms(DTM1, sparsity)))
  freq2 <- colSums(as.matrix(removeSparseTerms(DTM2, sparsity)))
  freq <- full_join(
    data.frame(word = names(freq1), freq1 = freq1), 
    data.frame(word = names(freq2), freq2 = freq2),
    by = "word")
  
  freq %>% 
    subset(., freq1 > 30 | freq2 > 30) %>%
    head(20) %>% 
    ggplot(aes(x = reorder(word, -freq1))) + 
    geom_bar(aes(y = freq1, fill = "Original"), stat = "identity", alpha = 0.7) +  # DTM1 holds the original game
    geom_bar(aes(y = freq2, fill = "Sequel"), stat = "identity", alpha = 0.7) +    # DTM2 holds the sequel
    theme(axis.text.x = element_text(angle=45, hjust=1)) + 
    labs(x = "Words", y = "Frequencies", title = paste0("Word frequencies for ", gameName1, " and ", gameName2)) +
    scale_fill_manual(name=NULL, values=c(Original ="lightblue", Sequel ="red")) +
    theme(legend.key.size = unit(0.25, "cm"), legend.position="bottom", plot.title = element_text(size=8))
  
}

Tokenization

The first step is to parse the review text into R, tokenize the words, remove unnecessary words, punctuation and special characters (which are clearly redundant in data that comes from the Internet) and, at the end, stem the words to their root forms.
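The document-term and term-document matrices used below (me1.DTM, me1.TDM and so on) are presumably built by applying createTdmDtm to each set of reviews, along these lines:

me1.mats <- createTdmDtm(me1.orig$text)
me1.DTM <- me1.mats[[1]]; me1.TDM <- me1.mats[[2]]

me2.mats <- createTdmDtm(me2.orig$text)
me2.DTM <- me2.mats[[1]]; me2.TDM <- me2.mats[[2]]

mafia1.mats <- createTdmDtm(mafia1.orig$text)
mafia1.DTM <- mafia1.mats[[1]]; mafia1.TDM <- mafia1.mats[[2]]

mafia2.mats <- createTdmDtm(mafia2.orig$text)
mafia2.DTM <- mafia2.mats[[1]]; mafia2.TDM <- mafia2.mats[[2]]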

Each of the games has the following number of reviews:

rbind(
  c("", "Mass Effect", "Mafia"),
  cbind(
    rbind("Original", 
          "Sequel"),
    rbind(nrow(me1.orig),
          nrow(me2.orig)),
    rbind(nrow(mafia1.orig),
          nrow(mafia2.orig))
  )
) %>% 
  kable() %>% 
  kable_styling() %>% 
  row_spec(1, bold = T)
Mass Effect Mafia
Original 6747 503
Sequel 7151 10937

Exploratory analysis

Let us take a quick detour from the main research topic and have a look at the most common root forms in each of the games. For each game, the word game itself is one of the most popular, and the name of the game shows up in most reviews. Story and play are also common. Mass Effect is more of an RPG/shooter game, so story is an important review word, fitting for a single-player genre; Mafia is a shooter game with a great story. Many of the reviews also feature words like great or good, which suggests that the ratings are mostly positive.

  df <- rbind(c(
      "Mass Effect 1", "", "Mass Effect 2", "", "Mafia 1", "", "Mafia 2",  ""
    ),
    cbind(
      head(names(sort(rowSums(as.matrix(me1.TDM)), T)), 10),
      head(sort(rowSums(as.matrix(me1.TDM)), T), 10),
      head(names(sort(rowSums(as.matrix(me2.TDM)), T)), 10),
      head(sort(rowSums(as.matrix(me2.TDM)), T), 10),
      head(names(sort(rowSums(as.matrix(mafia1.TDM)), T)), 10),
      head(sort(rowSums(as.matrix(mafia1.TDM)), T), 10),
      head(names(sort(rowSums(as.matrix(mafia2.TDM)), T)), 10),
      head(sort(rowSums(as.matrix(mafia2.TDM)), T), 10)
    )
    )
  rownames(df) <- NULL
  kable(df, caption = "Number of most common term occurrences")  %>%
  kable_styling() %>% 
  row_spec(1, bold = T)
Number of most common term occurrences
Mass Effect 1 Mass Effect 2 Mafia 1 Mafia 2
game 11185 game 10480 game 1041 game 17736
play 3931 effect 3669 mafia 298 stori 6300
effect 3256 mass 3525 play 265 mafia 5170
stori 3179 play 3471 stori 222 play 4098
mass 3111 one 2695 time 192 good 3922
one 2410 stori 2631 get 175 great 3841
like 2067 best 2197 one 171 like 3750
can 2030 first 2001 mission 169 one 2681
great 1978 charact 1976 like 157 time 2655
charact 1967 great 1818 just 147 just 2490

If we now compare the original and sequel games, we do not see many differences: both series share their most common words across both installments. However, we can already note that the number of reviews for Mafia 1 is much lower than for its sequel.

mostCommonWordsPlt(me1.DTM, me2.DTM, 0.85, gameName1 = "Mass Effect 1", gameName2 = "Mass Effect 2")

mostCommonWordsPlt(mafia1.DTM, mafia2.DTM, 0.85, gameName1 = "Mafia 1", gameName2 = "Mafia 2")

Another way to present the most common words is a wordcloud. The wordclouds below should resemble the information above, although matrices with a higher sparsity threshold were used in these examples.

me1.DTM.nonsparse <- removeSparseTerms(me1.DTM, 0.9)
me1.freq <- colSums(as.matrix(me1.DTM.nonsparse))
w1 <- wordcloud2(data.frame(names(me1.freq),
           me1.freq),
           color = "random-dark")
saveWidget(w1, '1.html', selfcontained = F)
webshot('1.html', '1.png', vwidth=500,vheight=350, delay = 5)

me2.DTM.nonsparse <- removeSparseTerms(me2.DTM, 0.9)
me2.freq <- colSums(as.matrix(me2.DTM.nonsparse))
w2 <- wordcloud2(data.frame(names(me2.freq),
           me2.freq),
           color = "random-dark")
saveWidget(w2, '2.html', selfcontained = F)
webshot('2.html', '2.png', vwidth=500,vheight=350, delay = 5)

mafia1.DTM.nonsparse <- removeSparseTerms(mafia1.DTM, 0.9)
mafia1.freq <- colSums(as.matrix(mafia1.DTM.nonsparse))
w3 <- wordcloud2(data.frame(names(mafia1.freq),
           mafia1.freq),
           color = "random-dark")
saveWidget(w3, '3.html', selfcontained = F)
webshot('3.html', '3.png', vwidth=500,vheight=350, delay = 5)

mafia2.DTM.nonsparse <- removeSparseTerms(mafia2.DTM, 0.9)
mafia2.freq <- colSums(as.matrix(mafia2.DTM.nonsparse))
w4 <- wordcloud2(data.frame(names(mafia2.freq),
           mafia2.freq),
           color = "random-dark")
saveWidget(w4, '4.html', selfcontained = F)
webshot('4.html', '4.png', vwidth=500,vheight=350, delay = 5)

Sentiment modelling (manual word sentiment)

The main goal of the analysis is to find the average sentiment of the reviews and to check whether it is higher for sequels or not. In the simplest scenario, Go et al. assess a tweet’s (here, a review’s) sentiment based on the sum of the sentiments of particular keywords. In this approach a similar keyword dictionary is applied, and the overall sentiment (a sentiment score for each review) is presented in the graphs below.
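As a minimal illustration of this dictionary-based scoring (using the SentimentAnalysis package that produces the scores below):

toy <- analyzeSentiment("Great story, terrible port")
toy$SentimentQDAP  # roughly (positive hits - negative hits) / word count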

In the original-versus-sequel “battle”, Mass Effect has a better mean sentiment score for the sequel. For both games, over 82% of the reviews score as positive, while about 94% of reviewers recommended them; the percentage of people who did not recommend a game is almost identical to the share of negative reviews. The Mafia games behave similarly to the Mass Effect ones: the sequel scores better, and overall the sentiment values of Mafia 2 and Mass Effect 2 are very close. For Mafia, the share of positive sentiment scores is a little lower than the share of recommending reviews, but the 92% recommendation rate for the sequel roughly matches the positive and neutral reviews summed up.

 me1.sent <- analyzeSentiment(me1.orig$text)
 plotSentiment(me1.sent$SentimentHE, xlab = "Mass Effect 1")

 me2.sent <- analyzeSentiment(me2.orig$text)
 plotSentiment(me2.sent$SentimentHE, xlab = "Mass Effect 2")

 mafia1.sent <- analyzeSentiment(mafia1.orig$text)
 plotSentiment(mafia1.sent$SentimentHE, xlab = "Mafia 1")

 mafia2.sent <- analyzeSentiment(iconv(mafia2.orig$text, "latin1", "ASCII", sub="")) # these reviews were identified as wrongly formatted
 plotSentiment(mafia2.sent$SentimentHE, xlab = "Mafia 2")

rbind(
  c("Mass Effect 1", "Mass Effect 2"),
    cbind(round(mean(me1.sent$SentimentQDAP, na.rm = T), 3),
       round(mean(me2.sent$SentimentQDAP, na.rm = T), 3))
       ) %>%
     kable(caption = "Mean sentiment for Mass Effect games", digits = 2)   %>%
  kable_styling() %>% 
  row_spec(1, bold = T)
Mean sentiment for Mass Effect games
Mass Effect 1 Mass Effect 2
0.194 0.215
rbind(c("Mass Effect 1", "Mass Effect 2"),
 cbind(
  round(prop.table(table(convertToDirection(me1.sent$SentimentQDAP))), 3),
   round(prop.table(table(convertToDirection(me2.sent$SentimentQDAP))), 3))
 ) %>%
   kable(caption = "Sentiment scores for Mass Effect games")  %>%
  kable_styling() %>% 
  row_spec(1, bold = T)
Sentiment scores for Mass Effect games
Mass Effect 1 Mass Effect 2
negative 0.062 0.055
neutral 0.113 0.123
positive 0.825 0.822
rbind(c("Mass Effect 1", "Mass Effect 2"),
 cbind(
   round(prop.table(table(me1.orig$recommended)), 3),
   round(prop.table(table(me2.orig$recommended)), 3)) 
 ) %>%
    kable(caption = "Percentage of reviewers that recommend Mass Effect games")   %>%
  kable_styling() %>% 
  row_spec(1, bold = T)
Percentage of reviewers that recommend Mass Effect games
Mass Effect 1 Mass Effect 2
FALSE 0.061 0.053
TRUE 0.939 0.947
rbind(
  c("Mafia 1", "Mafia 2"),
    cbind(round(mean(mafia1.sent$SentimentQDAP, na.rm = T), 3),
       round(mean(mafia2.sent$SentimentQDAP, na.rm = T), 3))
       ) %>%
     kable(caption = "Mean sentiment for Mafia games", digits = 2)   %>%
  kable_styling() %>% 
  row_spec(1, bold = T)
Mean sentiment for Mafia games
Mafia 1 Mafia 2
0.191 0.214
rbind(c("Mafia 1", "Mafia 2"),
 cbind(
  round(prop.table(table(convertToDirection(mafia1.sent$SentimentQDAP))), 3),
   round(prop.table(table(convertToDirection(mafia2.sent$SentimentQDAP))), 3))
 ) %>%
   kable(caption = "Sentiment scores for Mafia games")  %>%
  kable_styling() %>% 
  row_spec(1, bold = T)
Sentiment scores for Mafia games
Mafia 1 Mafia 2
negative 0.072 0.067
neutral 0.123 0.129
positive 0.805 0.804
rbind(c("Mafia 1", "Mafia 2"),
 cbind(
   round(prop.table(table(mafia1.orig$recommended)), 3),
   round(prop.table(table(mafia2.orig$recommended)), 3)) 
 ) %>%
    kable(caption = "Percentage of reviewers that recommend Mafia games")   %>%
  kable_styling() %>% 
  row_spec(1, bold = T)
Percentage of reviewers that recommend Mafia games
Mafia 1 Mafia 2
FALSE 0.113 0.076
TRUE 0.887 0.924

Machine learning classifiers

Returning to the main thread of the analyzed paper, Go et al. employ three machine learning techniques that are trained to classify whether a particular tweet (here, a particular review) has positive or negative sentiment. The label is taken from the ‘recommended’ flag available to reviewers. The models are built on features representing the aggregated usage of particular words in a review.

me1.df <- cbind(as.data.frame(data.matrix(me1.DTM), stringsAsFactors = FALSE), as.factor(me1.orig$recommended))
me2.df <- cbind(as.data.frame(data.matrix(me2.DTM), stringsAsFactors = FALSE), as.factor(me2.orig$recommended))
colnames(me1.df)[ncol(me1.df)] <- "Recommended"
colnames(me2.df)[ncol(me2.df)] <- "Recommended"

set.seed(2137)
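# limit each game to 2500 randomly sampled reviews to keep training times manageable (see the note in the results discussion)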
me1.df.cut <- me1.df[sample(nrow(me1.df), 2500), ]
me2.df.cut <- me2.df[sample(nrow(me2.df), 2500), ]

me1.part <- createDataPartition(me1.df.cut$Recommended, p = 0.8, list = F)
me2.part <- createDataPartition(me2.df.cut$Recommended, p = 0.8, list = F)

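# keep only terms whose total count exceeds 0.8*0.025*n (about 2% of the sampled reviews); the trailing TRUE retains the Recommended column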
me1.train <- me1.df.cut[me1.part, c(colSums(me1.df.cut[, -ncol(me1.df.cut)]) > 0.8*0.025*nrow(me1.df.cut), T)]
me1.test <- me1.df.cut[-me1.part, c(colSums(me1.df.cut[, -ncol(me1.df.cut)]) > 0.8*0.025*nrow(me1.df.cut), T)]

me2.train <- me2.df.cut[me2.part, c(colSums(me2.df.cut[, -ncol(me2.df.cut)]) > 0.8*0.025*nrow(me2.df.cut), T)]
me2.test <- me2.df.cut[-me2.part, c(colSums(me2.df.cut[, -ncol(me2.df.cut)]) > 0.8*0.025*nrow(me2.df.cut), T)]

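# recommendations are heavily skewed towards TRUE, so the training sets are up-sampled to balance the classes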
me1.train <- upSample(me1.train, me1.train$Recommended)
me2.train <- upSample(me2.train, me2.train$Recommended)

me1.train$Class <- NULL
me2.train$Class <- NULL

# me 1
glm.model1 <- train(Recommended ~ . ,
                    data = me1.train,
                    method="glm",
                    family=binomial(),
                    preProcess = c("center", "scale"))

# me 2
glm.model2 <- train(Recommended ~ . ,
                   data = me2.train,
                   method="glm",
                   family=binomial(),
                   preProcess = c("center", "scale"))


me1.glm.pred <- predict(glm.model1, me1.test)
me2.glm.pred <- predict(glm.model2, me2.test)

#----------------------------------

# me 1
svm.model1 <- train(Recommended ~ . ,
                    data = me1.train,
                    method = "svmLinear",
                    preProcess = c("center", "scale"))

# me 2
svm.model2 <- train(Recommended ~ . ,
                    data = me2.train,
                    method = "svmLinear",
                    preProcess = c("center", "scale"))


me1.svm.pred <- predict(svm.model1, me1.test)
me2.svm.pred <- predict(svm.model2, me2.test)

#----------------------------------

# me 1
nb.model1 <- train(Recommended ~ . ,
                    data = me1.train,
                    method = "nb",
                    preProcess = c("center", "scale"))

# me 2
nb.model2 <- train(Recommended ~ . ,
                    data = me2.train,
                    method = "nb",
                    preProcess = c("center", "scale"))


me1.nb.pred <- predict(nb.model1, me1.test)
me2.nb.pred <- predict(nb.model2, me2.test)
mafia1.df <- cbind(as.data.frame(data.matrix(mafia1.DTM), stringsAsFactors = FALSE), as.factor(mafia1.orig$recommended))
mafia2.df <- cbind(as.data.frame(data.matrix(mafia2.DTM), stringsAsFactors = FALSE), as.factor(mafia2.orig$recommended))
colnames(mafia1.df)[ncol(mafia1.df)] <- "Recommended"
colnames(mafia2.df)[ncol(mafia2.df)] <- "Recommended"

set.seed(2137)
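# Mafia 1 has only about 500 reviews, so it is used in full; Mafia 2 is subsampled like the Mass Effect sets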
mafia1.df.cut <- mafia1.df
mafia2.df.cut <- mafia2.df[sample(nrow(mafia2.df), 2500), ]

mafia1.part <- createDataPartition(mafia1.df.cut$Recommended, p = 0.8, list = F)
mafia2.part <- createDataPartition(mafia2.df.cut$Recommended, p = 0.8, list = F)

mafia1.train <- mafia1.df.cut[mafia1.part, c(colSums(mafia1.df.cut[, -ncol(mafia1.df.cut)]) > 0.8*0.025*nrow(mafia1.df.cut), T)]
mafia1.test <- mafia1.df.cut[-mafia1.part, c(colSums(mafia1.df.cut[, -ncol(mafia1.df.cut)]) > 0.8*0.025*nrow(mafia1.df.cut), T)]

mafia2.train <- mafia2.df.cut[mafia2.part, c(colSums(mafia2.df.cut[, -ncol(mafia2.df.cut)]) > 0.8*0.025*nrow(mafia2.df.cut), T)]
mafia2.test <- mafia2.df.cut[-mafia2.part, c(colSums(mafia2.df.cut[, -ncol(mafia2.df.cut)]) > 0.8*0.025*nrow(mafia2.df.cut), T)]


mafia1.train <- upSample(mafia1.train, mafia1.train$Recommended)
mafia2.train <- upSample(mafia2.train, mafia2.train$Recommended)

mafia1.train$Class <- NULL
mafia2.train$Class <- NULL

# mafia 1
glm.model1 <- train(Recommended ~ . ,
                    data = mafia1.train,
                    method="glm",
                    family=binomial(),
                    preProcess = c("center", "scale"))

# mafia 2
glm.model2 <- train(Recommended ~ . ,
                    data = mafia2.train,
                    method="glm",
                    family=binomial(),
                    preProcess = c("center", "scale"))


mafia1.glm.pred <- predict(glm.model1, mafia1.test)
mafia2.glm.pred <- predict(glm.model2, mafia2.test)

#----------------------------------

# mafia 1
svm.model1 <- train(Recommended ~ . ,
                    data = mafia1.train,
                    method = "svmLinear",
                    preProcess = c("center", "scale"))

# mafia 2
svm.model2 <- train(Recommended ~ . ,
                    data = mafia2.train,
                    method = "svmLinear",
                    preProcess = c("center", "scale"))


mafia1.svm.pred <- predict(svm.model1, mafia1.test)
mafia2.svm.pred <- predict(svm.model2, mafia2.test)

#----------------------------------

# mafia 1
nb.model1 <- train(Recommended ~ . ,
                    data = mafia1.train,
                    method = "nb",
                    preProcess = c("center", "scale"))

# mafia 2
nb.model2 <- train(Recommended ~ . ,
                    data = mafia2.train,
                    method = "nb",
                    preProcess = c("center", "scale"))


mafia1.nb.pred <- predict(nb.model1, mafia1.test)
mafia2.nb.pred <- predict(nb.model2, mafia2.test)
  rbind(c("", "Mass Effect 1", "Mass Effect 2", "Mafia 1", "Mafia 2"),
    cbind(c("logistic regression", "SVM", "Naive Bayes"),
      c(sprintf("%2.4f",confusionMatrix(me1.test$Recommended, me1.glm.pred)$overall["Accuracy"]),
        sprintf("%2.4f",confusionMatrix(me1.test$Recommended, me1.svm.pred)$overall["Accuracy"]),
        sprintf("%2.4f",confusionMatrix(me1.test$Recommended, me1.nb.pred)$overall["Accuracy"])),

      c(sprintf("%2.4f",confusionMatrix(me2.test$Recommended, me2.glm.pred)$overall["Accuracy"]),
        sprintf("%2.4f",confusionMatrix(me2.test$Recommended, me2.svm.pred)$overall["Accuracy"]),
        sprintf("%2.4f",confusionMatrix(me2.test$Recommended, me2.nb.pred)$overall["Accuracy"])),

      c(sprintf("%2.4f",confusionMatrix(mafia1.test$Recommended, mafia1.glm.pred)$overall["Accuracy"]),
        sprintf("%2.4f",confusionMatrix(mafia1.test$Recommended, mafia1.svm.pred)$overall["Accuracy"]),
        sprintf("%2.4f",confusionMatrix(mafia1.test$Recommended, mafia1.nb.pred)$overall["Accuracy"])),

      c(sprintf("%2.4f",confusionMatrix(mafia2.test$Recommended, mafia2.glm.pred)$overall["Accuracy"]),
        sprintf("%2.4f",confusionMatrix(mafia2.test$Recommended, mafia2.svm.pred)$overall["Accuracy"]),
        sprintf("%2.4f",confusionMatrix(mafia2.test$Recommended, mafia2.nb.pred)$overall["Accuracy"]))

        )
    ) %>%
  kable(row.names = F, digits = 2, caption = "Accuracy measure for each of the studied models") %>%
  kable_styling() %>% 
  column_spec(1, bold = T) %>% 
  row_spec(1, bold = T)
Accuracy measure for each of the studied models
Mass Effect 1 Mass Effect 2 Mafia 1 Mafia 2
logistic regression 0.9018 0.8657 0.5300 0.8920
SVM 0.8858 0.8818 0.8800 0.8760
Naive Bayes 0.1062 0.8958 0.2500 0.9200

In their work, Go et al. applied the models to both unigrams and bigrams. Due to the much richer data set (in the textual sense), applying a bigram model here would be impractical, so only unigram results are reported. For the same reason, I have limited the training sample sizes to 2500 observations; feel free to remove this restriction if you have more computational power.
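For reference, a bigram document-term matrix could be built roughly as follows (a sketch based on the standard tm/NLP tokenizer idiom; corp stands for a cleaned corpus such as the one inside createTdmDtm, and a VCorpus rather than a SimpleCorpus may be needed for custom tokenizers):

BigramTokenizer <- function(x)
  unlist(lapply(NLP::ngrams(NLP::words(x), 2), paste, collapse = " "))

dtm.bigrams <- DocumentTermMatrix(corp, control = list(tokenize = BigramTokenizer))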

The results are comparable to the ones obtained by Go et al., for whom Naive Bayes and SVM performed best. In this scenario, however, the training sample may have been too small to provide enough variety of observations, so for Mass Effect 1 and Mafia 1 Naive Bayes has very poor predictive power; SVM is the most consistent model in terms of accuracy here. It needs to be noted, though, that the accuracy of the models is very close to, and mostly lower than, the share of positive reviews in the testing sample. This is a minor weakness of the Go et al. paper as well, because plain accuracy cannot be related to the class balance of the sample.

rbind(c("Mass Effect 1", "Mass Effect 2", "Mafia 1", "Mafia 2"),
      round(unname(c(
        prop.table(table(me1.test$Recommended))['TRUE'],
        prop.table(table(me2.test$Recommended))['TRUE'],
        prop.table(table(mafia1.test$Recommended))['TRUE'],
        prop.table(table(mafia2.test$Recommended))['TRUE']
      )), 2)
    ) %>%
  kable(row.names = F, digits = 2, caption = "Percentage of positive reviews in the testing sample") %>%
  kable_styling() %>% 
  row_spec(1, bold = T)
Percentage of positive reviews in the testing sample
Mass Effect 1 Mass Effect 2 Mafia 1 Mafia 2
0.94 0.94 0.89 0.92

Therefore I also present the balanced accuracy for each of the models. It indicates that either the feature engineering is too crude in this scenario or the training sample is too small and not varied enough. Still, SVM proves to be of some value, with balanced accuracy of roughly 57-65%, which is not that bad. In summary, following the approach of Go et al., one can build a predictive model to automatically detect sentiment in online reviews.
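For reference, balanced accuracy is the mean of sensitivity and specificity, which caret already reports:

cm <- confusionMatrix(me1.test$Recommended, me1.svm.pred)
(cm$byClass["Sensitivity"] + cm$byClass["Specificity"]) / 2  # equals cm$byClass["Balanced Accuracy"]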

  rbind(c("", "Mass Effect 1", "Mass Effect 2", "Mafia 1", "Mafia 2"),
    cbind(c("logistic regression", "SVM", "Naive Bayes"),
      c(sprintf("%2.4f",confusionMatrix(me1.test$Recommended, me1.glm.pred)$byClass["Balanced Accuracy"]),
        sprintf("%2.4f",confusionMatrix(me1.test$Recommended, me1.svm.pred)$byClass["Balanced Accuracy"]),
        sprintf("%2.4f",confusionMatrix(me1.test$Recommended, me1.nb.pred)$byClass["Balanced Accuracy"])),

      c(sprintf("%2.4f",confusionMatrix(me2.test$Recommended, me2.glm.pred)$byClass["Balanced Accuracy"]),
        sprintf("%2.4f",confusionMatrix(me2.test$Recommended, me2.svm.pred)$byClass["Balanced Accuracy"]),
        sprintf("%2.4f",confusionMatrix(me2.test$Recommended, me2.nb.pred)$byClass["Balanced Accuracy"])),

      c(sprintf("%2.4f",confusionMatrix(mafia1.test$Recommended, mafia1.glm.pred)$byClass["Balanced Accuracy"]),
        sprintf("%2.4f",confusionMatrix(mafia1.test$Recommended, mafia1.svm.pred)$byClass["Balanced Accuracy"]),
        sprintf("%2.4f",confusionMatrix(mafia1.test$Recommended, mafia1.nb.pred)$byClass["Balanced Accuracy"])),

      c(sprintf("%2.4f",confusionMatrix(mafia2.test$Recommended, mafia2.glm.pred)$byClass["Balanced Accuracy"]),
        sprintf("%2.4f",confusionMatrix(mafia2.test$Recommended, mafia2.svm.pred)$byClass["Balanced Accuracy"]),
        sprintf("%2.4f",confusionMatrix(mafia2.test$Recommended, mafia2.nb.pred)$byClass["Balanced Accuracy"]))

        )
    ) %>%
  kable(row.names = F, digits = 2, caption = "Balanced accuracy measure for each of the studied models") %>%
  kable_styling() %>% 
  column_spec(1, bold = T) %>% 
  row_spec(1, bold = T)
Balanced accuracy measure for each of the studied models
Mass Effect 1 Mass Effect 2 Mafia 1 Mafia 2
logistic regression 0.5865 0.5842 0.4988 0.6556
SVM 0.5669 0.6211 0.6526 0.6385
Naive Bayes 0.5101 0.6124 0.5640 NA