User Score Classification with Naive Bayes and Tree-Based Methods
Intro
What We’ll Do
We will try to classify whether a user will give a game an above-average score based on the content of their review, using Naive Bayes and some tree-based methods. Features will be extracted from the reviews with a text mining approach. The dataset contains user reviews of the 100 best PC games on the Metacritic website. I have already scraped the data, which you can download here . I have done a similar classification task before, with different features and models, using the sentiment value of the reviews; you can check it here .
Why It Matters
It is important for companies to pay close attention to the Voice of Customer (VoC). By analyzing and drawing insights from customer feedback, companies get better information for strategic decisions, a more accurate understanding of what customers actually want and, as a result, a better experience for everyone. But what are customers actually saying about the product? Text mining is a useful tool for gaining insights from unstructured text data. Analyzing the content of user reviews with text mining deepens our understanding of what really matters to users, for example specific groups of words that indicate whether a user will give an above-average score.
PC gaming is a competitive segment of the entertainment industry, especially within video games. On Steam, the largest online PC gaming platform, the number of games released has grown rapidly since 2004. Game developers need to understand what players want. Therefore, we will look at the best PC games out there and see what their customers say about them. Classifying scores into above or below average based on the content of customer reviews may help us understand what makes people rate a game higher or lower than the average.
Data Preparation
Load the required packages.
library(tidyverse)
library(tidytext)
library(tidymodels)
library(caret)
library(plotly)
library(data.table)
library(RVerbalExpressions)
library(textclean)
library(randomForest)
library(e1071)
library(ranger)
library(rpart)
library(rattle)
library(rpart.plot)
library(tictoc)
set.seed(123)
Load the dataset. It consists of V1 (the review id), the title of the game, the score given by the user, the category (based on Metacritic, which we will not use), and the user review.
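The loading chunk is not shown in this post; below is a minimal sketch, assuming the scraped reviews are stored in a CSV file (the file name game_review.csv is a placeholder).
# read the scraped Metacritic reviews (file name is a placeholder)
game_review <- fread("game_review.csv", encoding = "UTF-8") %>%
    as.data.frame()
glimpse(game_review)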
We want to clean the text by removing URLs and word elongations. We will also replace “?” with “questionmark” and “!” with “exclamationmark” to see whether these characters are useful in our analysis.
question <- rx() %>%
rx_find(value = "?") %>%
rx_one_or_more()
exclamation <- rx() %>%
rx_find(value = "!") %>%
rx_one_or_more()
punctuation <- rx_punctuation()
number <- rx_digit()
dollar <- rx() %>%
rx_find("$")
game_review <- game_review %>%
mutate(
text_clean = review %>%
replace_url() %>%
replace_html() %>%
replace_contraction() %>%
replace_word_elongation() %>%
str_replace_all(pattern = question, replacement = " questionmark ") %>%
str_replace_all(pattern = exclamation, replacement = " exclamationmark ") %>%
str_remove_all(pattern = punctuation) %>%
str_remove_all(pattern = number) %>%
str_remove_all(pattern = dollar) %>%
str_to_lower() %>%
str_squish()
)
game_review
Since we want to classify the score into above average or below average, we need to add that label to the data.
# Remove game with only 1 review
more_1 <- game_review %>% group_by(game) %>% summarise(total = n()) %>% filter(total >
1)
game_review <- game_review[game_review$game %in% more_1$game, ]
x <- game_review %>% mutate(game = factor(game, unique(game))) %>% group_by(game) %>%
summarise(game_mean = mean(score))
# Label the data with above average/below average
game_clean <- game_review %>% left_join(x) %>% mutate(above_average = if_else(score >
game_mean, "Above", "Below")) %>% mutate(above_average = factor(above_average,
c("Below", "Above"))) %>% select(V1, above_average, text_clean) %>% na.omit()
game_clean
Finally, we will build a document-term matrix, where each row is a review and the columns are the top 500 words across all reviews. We will use this matrix to classify whether a user gives an above-average score based on the appearance of one or more terms.
game_token <- game_clean %>% unnest_tokens(word, text_clean)
game_tidy <- game_token %>% anti_join(stop_words) %>% count(V1, above_average,
word)
top_word <- game_tidy %>% count(word, sort = T) %>% top_n(500) %>% select(word)
game_tidy <- game_tidy %>% inner_join(top_word)
game.y <- game_tidy %>% group_by(V1, above_average) %>% summarise(total = n()) %>%
select(V1, above_average) %>% as.matrix()
game_dtm <- game_tidy %>% cast_dtm(document = V1, term = word, value = n)
game.x <- as.matrix(game_dtm)
game.x <- as.data.frame(game.x)
game_data <- cbind(game.y, game.x)
head(game_data)
Modeling
Holdout: Train-Test Split
We will split the data into an 80% training set and a 20% testing set.
# split data
set.seed(100)
intrain <- initial_split(data = game_data, prop = 0.8, strata = "above_average")
data_train <- training(intrain)
data_test <- testing(intrain)
data_train <- data_train %>% select(-V1)
data_test <- data_test %>% select(-V1)
Next, we check whether there is a severe class imbalance in our training dataset.
data_train %>% group_by(above_average) %>% summarise(total = n()) %>% mutate(proportion = (total/sum(total)) *
100)
Naive Bayes
Naive Bayes is a machine learning model based on Bayes’ theorem, for example: \[ P(above \mid awesome) = \frac {P(awesome \mid above) \cdot P(above)}{P(awesome)}\]
With this, we can calculate the probability that a user gives an above-average score given that the word “awesome” appears in their review.
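As a toy illustration with made-up numbers (not taken from this dataset): suppose 60% of all reviews are above average, the word “awesome” appears in 20% of above-average reviews, and in 5% of below-average reviews. Then:
\[ P(awesome) = 0.2 \cdot 0.6 + 0.05 \cdot 0.4 = 0.14 \]
\[ P(above \mid awesome) = \frac {0.2 \cdot 0.6}{0.14} \approx 0.86 \]
Seeing “awesome” raises the estimated probability of an above-average score from 60% to about 86%.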
Because Naive Bayes computes probabilities from prior events, we need to transform the variables into events (occurred or not occurred). If a word appears in a review, instead of counting how many times it appears, we label the variable as occurred or not occurred.
data_train_occur <- data_train %>% mutate_if(~is.numeric(.), ~as.factor(ifelse(. >=
1, "occured", "not_occured")))
data_test_occur <- data_test %>% mutate_if(~is.numeric(.), ~as.factor(ifelse(. >=
1, "occured", "not_occured")))
data_train_occur
Now we fit the Naive Bayes classifier.
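The fitting chunk was not rendered in this post; below is a minimal sketch, assuming the classifier is fitted with e1071::naiveBayes and stored as bayes_mod (the object name used later in the evaluation section), with default settings.
set.seed(123)
tic()
# fit Naive Bayes on the occurrence-encoded training data
bayes_mod <- naiveBayes(x = data_train_occur %>% select(-above_average),
    y = data_train_occur$above_average)
toc()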
1.42 sec elapsed
Decision Tree
A decision tree is a machine learning method that classifies the target by repeatedly splitting on variables, starting with the split that gives the highest information gain. Information gain is the amount of information obtained from a split: the difference between the entropy of the parent node and the (weighted) entropy of the resulting child nodes.
\[ Information\ Gain = Entropy(P) - Entropy(S) \]
In binary classification, the entropy can be calculated with \[ Entropy = -p \cdot \log_2(p) - (1-p) \cdot \log_2(1-p) \] where p is the proportion of one class in a node and (1-p) is the proportion of the other class.
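To make the formulas concrete, here is a small helper (a sketch, not part of the original analysis) that computes the binary entropy of a node and the information gain of a two-way split.
# binary (Shannon) entropy of a node with class proportion p
entropy <- function(p) {
    ifelse(p == 0 | p == 1, 0, -p * log2(p) - (1 - p) * log2(1 - p))
}
# information gain of splitting a parent node into two children,
# weighting each child's entropy by the number of observations it receives
information_gain <- function(p_parent, p_left, n_left, p_right, n_right) {
    n <- n_left + n_right
    entropy(p_parent) - (n_left/n * entropy(p_left) + n_right/n * entropy(p_right))
}
# e.g. a 50/50 parent split into a pure left child (50 obs) and a 25/75 right child (100 obs)
information_gain(p_parent = 0.5, p_left = 1, n_left = 50, p_right = 0.25, n_right = 100)  # ~0.46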
Now we will model the decision tree using the rpart package.
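The fitting chunk was not rendered; below is a minimal sketch, assuming the tree is stored as dtree (the object name used later in the evaluation section) and fitted with the default rpart.control settings.
set.seed(123)
tic()
# fit the classification tree on the word-count training data
dtree <- rpart(above_average ~ ., data = data_train, method = "class")
toc()
# plot the fitted tree
fancyRpartPlot(dtree, sub = NULL)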
56.78 sec elapsed
A decision tree is easy to interpret. The tree starts at the top, where the first node splits on the appearance of the term “boring”. If the review contains the word “boring” fewer than 0.5 times (i.e. not at all), the probability that the user gives an above-average score is 94%. The “Above” branch splits again: does the review contain the word “overrated”? If yes, the user tends to give a below-average score. If no, does it contain the word “amazing”? If not, does it contain the word “worst”? If yes, the user tends to give a below-average score.
Random Forest
Random forest is an ensemble method, meaning it combines many models to get better performance than a single model. A random forest consists of many “trees”, where each tree is a single decision tree model. The classification is based on the majority vote (or the average, in regression problems) of all the trees.
Illustration of Random Forest
Using tidymodels, we can create the random forest model with specific parameters. The mtry parameter is the number of variables randomly sampled as split candidates at each split, trees is the number of trees in the forest, and min_n is the minimum number of data points in a node required for the node to be split further.
# define model spec
model_spec <- rand_forest(mode = "classification", mtry = 3, trees = 500, min_n = 1)
# define model engine
model_spec <- set_engine(model_spec, engine = "ranger", seed = 123, num.threads = parallel::detectCores(),
importance = "impurity")
# model fitting
set.seed(123)
tic()
model <- fit_xy(object = model_spec, x = select(data_train, -above_average),
y = select(data_train, above_average))
toc()
43.6 sec elapsed
Evaluation
The models will be evaluated with a confusion matrix, a table that shows four categories: True Positive, True Negative, False Positive, and False Negative.
|  | Predicted_Yes | Predicted_No |
|---|---|---|
| Actual_Yes | True Positive | False Negative |
| Actual_No | False Positive | True Negative |
The performance metrics will be Accuracy, Sensitivity/Recall, Specificity, and Precision. Accuracy measures how much of the data is correctly predicted. Sensitivity measures how many of the actual positive outcomes are correctly predicted. Specificity measures how many of the actual negative outcomes are correctly predicted. Precision measures how many of our positive predictions are correct.
\[ Accuracy = \frac {TP+TN}{TP+TN+FP+FN}\] \[ Sensitivity = \frac {TP}{TP+FN}\] \[ Specificity = \frac {TN}{TN+FP}\] \[ Precision = \frac {TP}{TP+FP}\]
Naive Bayes
bayes_pred <- select(data_test_occur, above_average) %>% bind_cols(pred_class = predict(bayes_mod,
newdata = data_test_occur[, -1], type = "class"))
# performance
bayes_pred %>% conf_mat(above_average, pred_class)
Truth
Prediction Above Below
Above 3452 1061
Below 644 816
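As a quick sanity check, the four metrics can be read straight off this confusion matrix (positive class = “Above”); the values should match perf_bayes below.
# entries taken from the confusion matrix above
TP <- 3452; FP <- 1061; FN <- 644; TN <- 816
(TP + TN)/(TP + TN + FP + FN)  # accuracy    ~0.715
TP/(TP + FN)                   # sensitivity ~0.843
TN/(TN + FP)                   # specificity ~0.435
TP/(TP + FP)                   # precision   ~0.765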
perf_bayes <- bayes_pred %>% summarise(accuracy = accuracy_vec(above_average,
pred_class), sensitivity = sens_vec(above_average, pred_class), specificity = spec_vec(above_average,
pred_class), precision = precision_vec(above_average, pred_class))
perf_bayes
Decision Tree
dt_pred <- select(data_test, above_average) %>% bind_cols(pred_class = predict(dtree,
newdata = data_test[, -1], type = "class"))
# confusion matrix
dt_pred %>% conf_mat(above_average, pred_class)
Truth
Prediction Above Below
Above 3948 1531
Below 148 346
perf_dt <- dt_pred %>% summarise(accuracy = accuracy_vec(above_average, pred_class),
sensitivity = sens_vec(above_average, pred_class), specificity = spec_vec(above_average,
pred_class), precision = precision_vec(above_average, pred_class))
perf_dt
Random Forest
Besides measuring the model performance, we also want to tweak the random forest model to improve it further. Thus, we also look at the sensitivity-specificity curve and the precision-recall curve. These curves can help us choose a threshold value that increases the preferred performance metric.
Performance
# predict on test
set.seed(123)
pred_test <- select(data_test, above_average) %>% bind_cols(predict(model, select(data_test,
-above_average))) %>% bind_cols(predict(model, select(data_test, -above_average),
type = "prob"))
pred_test$above_average <- factor(pred_test$above_average, c("Above", "Below"))
pred_test$.pred_class <- factor(pred_test$.pred_class, c("Above", "Below"))
pred_test %>% conf_mat(above_average, .pred_class)
Truth
Prediction Above Below
Above 4028 1478
Below 68 399
perf_rf <- pred_test %>% summarise(accuracy = accuracy_vec(above_average, .pred_class),
sensitivity = sens_vec(above_average, .pred_class), specificity = spec_vec(above_average,
.pred_class), precision = precision_vec(above_average, .pred_class))
perf_rf
ROC Curve
An ROC graph is a two-dimensional plot of a classifier with false positive rate on the x axis against true positive rate on the y axis. As such, a ROC graph depicts relative trade-offs that a classifier makes between benefits (true positives) and costs (false positives). ROC curves are useful for comparing different classifiers, since they take into account all possible thresholds.
The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the (ROC) curve (AUC). An ideal ROC curve will hug the top left corner, so the larger the AUC the better the classifier. We expect a classifier that performs no better than chance to have an AUC of 0.5 (when evaluated on an independent test set not used in model training).
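The AUC itself is not reported in this post, but it can be computed from the predicted probabilities with yardstick (a quick sketch; pred_test is the random forest prediction data frame built above, with “Above” as the first factor level and thus the positive class).
# area under the ROC curve for the random forest
pred_test %>% roc_auc(above_average, .pred_Above)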
Sensitivity-Specificity Trade-Off
pred_test_roc <- pred_test %>% roc_curve(above_average, .pred_Above) %>% mutate_if(~is.numeric(.),
~round(., 4)) %>% gather(metric, value, -.threshold)
p <- ggplot(pred_test_roc, aes(x = .threshold, y = value)) + geom_line(aes(colour = metric)) +
scale_x_continuous(breaks = seq(0, 1, 0.1)) + labs(x = "Probability Threshold to be Classified as Positive",
y = "Value", colour = "Metrics") + theme_minimal()
ggplotly(p)
Based on the curve, we reach equilibrium when the threshold is around 0.675.
Precision-Recall Trade-Off
pred_test_pr <- pred_test %>% pr_curve(above_average, .pred_Above) %>% mutate_if(~is.numeric(.),
~round(., 4)) %>% gather(metric, value, -.threshold)
p <- ggplot(pred_test_pr, aes(x = .threshold, y = value)) + geom_line(aes(colour = metric)) +
scale_x_continuous(breaks = seq(0, 1, 0.1)) + labs(x = "Probability Threshold to be Classified as Positive",
y = "Value", colour = "Metrics") + theme_minimal()
ggplotly(p)
Based on the precision-recall curve, we can reach equilibrium when the threshold is around 0.63.
Depending on our objective, we can change the threshold to boost specific metrics. If we want a good trade-off between recall, specificity, and precision, we may choose a threshold of 0.675, since this value gives us higher precision and an equilibrium between sensitivity and specificity.
Model Improvement
We will try to improve model performance by tuning the parameters. We will also check whether the models improve if we balance the classes in the training dataset, and we will scale all numeric predictors to the range [0, 1].
rec <- recipe(above_average ~ ., data = data_train) %>% step_range(all_predictors(),
min = 0, max = 1) %>% step_upsample(above_average, ratio = 1/1, seed = 123) %>%
prep(strings_as_factors = F)
data_train <- juice(rec)
data_test <- bake(rec, testing(intrain))
Let’s see the class proportions.
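The chunk that produced the output below isn’t shown; a one-liner such as the following would do it.
# proportion of each class in the upsampled training data
prop.table(table(data_train$above_average))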
Above Below
0.5 0.5
Decision Tree
We will change the minimum number of data points in a node required to attempt a split via minsplit in the control parameter.
control = rpart.control(minsplit = 1)
set.seed(123)
dtree_tune <- rpart(above_average ~ ., data = data_train, method = "class",
control = control)
fancyRpartPlot(dtree_tune, sub = NULL)
We get additional nodes for the words “bad” and “questionmark” (“?”).
Let’s see the performance of the decision tree
set.seed(123)
dt_pred_tune <- select(data_test, above_average) %>% bind_cols(pred_class = predict(dtree,
newdata = data_test[, -1], type = "class"))
# confusion matrix
dt_pred_tune %>% conf_mat(above_average, pred_class)
Truth
Prediction Above Below
Above 4095 1869
Below 1 8
perf_dt_tune <- dt_pred_tune %>% summarise(accuracy = accuracy_vec(above_average,
pred_class), sensitivity = sens_vec(above_average, pred_class), specificity = spec_vec(above_average,
pred_class), precision = precision_vec(above_average, pred_class))
perf_dt_tune
Random Forest
By Changing The Threshold Value Of The Previous Model
We will evaluate the performance of the previous random forest with the new threshold of 0.675.
pred_test <- pred_test %>% mutate(pred_ed = factor(if_else(.pred_Above > 0.675,
"Above", "Below"), c("Above", "Below")))
pred_test %>% conf_mat(above_average, pred_ed)
Truth
Prediction Above Below
Above 2933 510
Below 1163 1367
perf_rf_thres <- pred_test %>% summarise(accuracy = accuracy_vec(above_average,
pred_ed), sensitivity = sens_vec(above_average, pred_ed), specificity = spec_vec(above_average,
pred_ed), precision = precision_vec(above_average, pred_ed))
perf_rf_thres
By increasing the threshold, we significantly increase the specificity. The precision increases by around 0.12 (12 percentage points). Even though the accuracy and the sensitivity decrease, they remain at an acceptable level.
Modeling With Balanced Dataset
Now we fit the random forest again using the balanced dataset.
# define model spec
set.seed(123)
model_spec_tune <- rand_forest(mode = "classification", mtry = 3, trees = 500,
min_n = 1)
# define model engine
model_spec_tune <- set_engine(model_spec_tune, engine = "ranger", seed = 123,
num.threads = parallel::detectCores(), importance = "impurity")
# model fitting
set.seed(123)
model_tune <- fit_xy(object = model_spec_tune, x = select(data_train, -above_average),
y = select(data_train, above_average))
Model Performance
Now we look at the model performance on the training dataset.
set.seed(123)
pred_train_tune <- select(data_train, above_average) %>% bind_cols(predict(model_tune,
select(data_train, -above_average))) %>% bind_cols(predict(model_tune, select(data_train,
-above_average), type = "prob"))
pred_train_tune$above_average <- factor(pred_train_tune$above_average, c("Above",
"Below"))
pred_train_tune$.pred_class <- factor(pred_train_tune$.pred_class, c("Above",
"Below"))
pred_train_tune %>% conf_mat(above_average, .pred_class)
Truth
Prediction Above Below
Above 15849 1724
Below 536 14661
pred_train_tune %>% summarise(accuracy = accuracy_vec(above_average, .pred_class),
sensitivity = sens_vec(above_average, .pred_class), specificity = spec_vec(above_average,
.pred_class), precision = precision_vec(above_average, .pred_class))
Next, we look at the model performance on the testing dataset.
# predict on test
set.seed(123)
pred_test_tune <- select(data_test, above_average) %>% bind_cols(predict(model_tune,
select(data_test, -above_average))) %>% bind_cols(predict(model_tune, select(data_test,
-above_average), type = "prob"))
pred_test_tune$above_average <- factor(pred_test_tune$above_average, c("Above",
"Below"))
pred_test_tune$.pred_class <- factor(pred_test_tune$.pred_class, c("Above",
"Below"))
pred_test_tune %>% conf_mat(above_average, .pred_class)
Truth
Prediction Above Below
Above 3555 752
Below 541 1125
perf_rf_tune <- pred_test_tune %>% summarise(accuracy = accuracy_vec(above_average,
.pred_class), sensitivity = sens_vec(above_average, .pred_class), specificity = spec_vec(above_average,
.pred_class), precision = precision_vec(above_average, .pred_class))
perf_rf_tune
Sensitivity-Specificity Trade-off
pred_test_roc <- pred_test_tune %>% roc_curve(above_average, .pred_Above) %>%
mutate_if(~is.numeric(.), ~round(., 4)) %>% gather(metric, value, -.threshold)
p <- ggplot(pred_test_roc, aes(x = .threshold, y = value)) + geom_line(aes(colour = metric)) +
scale_x_continuous(breaks = seq(0, 1, 0.1)) + labs(x = "Probability Threshold to be Classified as Positive",
y = "Value", colour = "Metrics") + theme_minimal()
ggplotly(p)
Based on the curve, we can reach equilibrium by choosing the threshold value of 0.55.
Precision-Recall Trade-Off
pred_test_prc <- pred_test_tune %>% pr_curve(above_average, .pred_Above) %>%
mutate_if(~is.numeric(.), ~round(., 4)) %>% gather(metric, value, -.threshold)
p <- ggplot(pred_test_prc, aes(x = .threshold, y = value)) + geom_line(aes(colour = metric)) +
scale_x_continuous(breaks = seq(0, 1, 0.1)) + labs(x = "Probability Threshold to be Classified as Positive",
y = "Value", colour = "Metrics") + theme_minimal()
ggplotly(p)
Based on the curve, we can reach equilibrium by choosing the threshold value of 0.51.
We will try increasing the threshold to 0.55 to see whether the model gets better.
pred_test_tune <- pred_test_tune %>% mutate(pred_ed = factor(if_else(.pred_Above >
0.55, "Above", "Below"), c("Above", "Below")))
pred_test_tune %>% conf_mat(above_average, pred_ed)
Truth
Prediction Above Below
Above 3007 491
Below 1089 1386
perf_rf_tune_thres <- pred_test_tune %>% summarise(accuracy = accuracy_vec(above_average,
pred_ed), sensitivity = sens_vec(above_average, pred_ed), specificity = spec_vec(above_average,
pred_ed), precision = precision_vec(above_average, pred_ed))
perf_rf_tune_thres
Conclusion
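To compare the models side by side, we can gather all the perf_* objects created above into one table (a quick sketch; the model labels below are just illustrative names).
bind_rows(
    naive_bayes = perf_bayes,
    decision_tree = perf_dt,
    decision_tree_tuned = perf_dt_tune,
    random_forest = perf_rf,
    random_forest_thres_0.675 = perf_rf_thres,
    random_forest_balanced = perf_rf_tune,
    random_forest_balanced_thres_0.55 = perf_rf_tune_thres,
    .id = "model"
)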
The decision tree has the worst precision. Naive Bayes is a bit better than the initial random forest (threshold = 0.5) but cannot compete with the other random forest variations. The tuned random forest with threshold = 0.55 has the highest precision, followed by the initial random forest with threshold = 0.675. The tuned random forest (threshold = 0.55) also has better accuracy, sensitivity, and specificity than the initial random forest (threshold = 0.675). Naive Bayes is the best in terms of computational time.
If we want high precision (i.e. we want our positive predictions to be as reliable as possible), we should choose the tuned random forest with threshold = 0.55.