User Score Classification with Naive Bayes and Tree-Based Methods
Intro
What We’ll Do
We will try to classify whether a user will give a game an above-average score based on the content of their review, using Naive Bayes and some tree-based methods. Features will be extracted from the reviews with a text mining approach. The dataset contains user reviews of the 100 best PC games on the Metacritic website. I have already scraped the data, which you can download here . I have done a similar classification task before, with different features and models, using the sentiment value of the reviews; you can check it here .
Why It Matters
It is important for companies to pay close attention to the Voice of Customer (VoC). By analyzing and drawing insights from customer feedback, companies get better information for strategic decisions, a more accurate understanding of what customers actually want and, as a result, a better experience for everyone. But what are customers actually saying about the product? Text mining is a useful tool for gaining insights from unstructured text data. Analyzing the content of user reviews with text mining deepens our understanding of what really matters to users, for example specific groups of words that indicate whether a user will give an above-average score.
PC gaming is a competitive segment of the entertainment industry, especially within video games. On Steam, the largest online PC gaming platform, the number of games released has grown rapidly since 2004. Game developers need to understand what players want. Therefore, we will look at the best PC games out there and see what their customers say about them. Classifying scores into above or below average based on the content of customer reviews may help us understand what makes people rate a game higher or lower than the average.
Data Preparation
Load the required packages.
library(tidyverse)
library(tidytext)
library(tidymodels)
library(caret)
library(plotly)
library(data.table)
library(RVerbalExpressions)
library(textclean)
library(randomForest)
library(e1071)
library(ranger)
library(rpart)
library(rattle)
library(rpart.plot)
library(tictoc)
set.seed(123)
Load the dataset. It consists of V1 (the review id), the title of the game, the score given by the user, the category (based on Metacritic, which we will not use), and the user review.
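The loading chunk is not shown in this post; below is a minimal sketch, assuming the scraped reviews are stored in a CSV file (the file name game_review.csv is a placeholder).
# read the scraped Metacritic reviews (file name is a placeholder)
game_review <- fread("game_review.csv", encoding = "UTF-8") %>%
    as.data.frame()
glimpse(game_review)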
We want to clean the text by removing URLs and word elongations. We will also replace “?” with “questionmark” and “!” with “exclamationmark” to see whether these characters are useful in our analysis.
question <- rx() %>%
rx_find(value = "?") %>%
rx_one_or_more()
exclamation <- rx() %>%
rx_find(value = "!") %>%
rx_one_or_more()
punctuation <- rx_punctuation()
number <- rx_digit()
dollar <- rx() %>%
rx_find("$")
game_review <- game_review %>%
mutate(
text_clean = review %>%
replace_url() %>%
replace_html() %>%
replace_contraction() %>%
replace_word_elongation() %>%
str_replace_all(pattern = question, replacement = " questionmark ") %>%
str_replace_all(pattern = exclamation, replacement = " exclamationmark ") %>%
str_remove_all(pattern = punctuation) %>%
str_remove_all(pattern = number) %>%
str_remove_all(pattern = dollar) %>%
str_to_lower() %>%
str_squish()
)
game_review
Since we want to classify the score into above average or below average, we need to add that label to the data.
# Remove game with only 1 review
more_1 <- game_review %>% group_by(game) %>% summarise(total = n()) %>% filter(total >
1)
game_review <- game_review[game_review$game %in% more_1$game, ]
x <- game_review %>% mutate(game = factor(game, unique(game))) %>% group_by(game) %>%
summarise(game_mean = mean(score))
# Label the data with above average/below average
game_clean <- game_review %>% left_join(x) %>% mutate(above_average = if_else(score >
game_mean, "Above", "Below")) %>% mutate(above_average = factor(above_average,
c("Below", "Above"))) %>% select(V1, above_average, text_clean) %>% na.omit()
game_clean
Finally, we will build a document-term matrix, where each row is a review and the columns are the top 500 words across all reviews. We will use this matrix to classify whether a user gives an above-average score based on the appearance of one or more terms.
game_token <- game_clean %>% unnest_tokens(word, text_clean)
game_tidy <- game_token %>% anti_join(stop_words) %>% count(V1, above_average,
word)
top_word <- game_tidy %>% count(word, sort = T) %>% top_n(500) %>% select(word)
game_tidy <- game_tidy %>% inner_join(top_word)
game.y <- game_tidy %>% group_by(V1, above_average) %>% summarise(total = n()) %>%
select(V1, above_average) %>% as.matrix()
game_dtm <- game_tidy %>% cast_dtm(document = V1, term = word, value = n)
game.x <- as.matrix(game_dtm)
game.x <- as.data.frame(game.x)
game_data <- cbind(game.y, game.x)
head(game_data)
Modeling
Holdout: Train-Test Split
We will split the data into an 80% training set and a 20% testing set.
# split data
set.seed(100)
intrain <- initial_split(data = game_data, prop = 0.8, strata = "above_average")
data_train <- training(intrain)
data_test <- testing(intrain)
data_train <- data_train %>% select(-V1)
data_test <- data_test %>% select(-V1)
Next, we check whether there is a severe class imbalance in our training dataset.
data_train %>% group_by(above_average) %>% summarise(total = n()) %>% mutate(proportion = (total/sum(total)) *
100)
Naive Bayes
Naive Bayes is a machine learning model based on Bayes’ theorem, for example: \[ P(above \mid awesome) = \frac {P(awesome \mid above) \cdot P(above)}{P(awesome)}\]
With this, we can calculate the probability that a user gives an above-average score given that the word “awesome” appears in their review.
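As a toy illustration with made-up numbers (not taken from this dataset): suppose 60% of all reviews are above average, the word “awesome” appears in 20% of above-average reviews, and in 5% of below-average reviews. Then:
\[ P(awesome) = 0.2 \cdot 0.6 + 0.05 \cdot 0.4 = 0.14 \]
\[ P(above \mid awesome) = \frac {0.2 \cdot 0.6}{0.14} \approx 0.86 \]
Seeing “awesome” raises the estimated probability of an above-average score from 60% to about 86%.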
Because Naive Bayes computes probabilities from prior events, we need to transform the variables into events (occurred or not occurred). If a word appears in a review, instead of counting how many times it appears, we label the variable as occurred or not occurred.
data_train_occur <- data_train %>% mutate_if(~is.numeric(.), ~as.factor(ifelse(. >=
1, "occured", "not_occured")))
data_test_occur <- data_test %>% mutate_if(~is.numeric(.), ~as.factor(ifelse(. >=
1, "occured", "not_occured")))
data_train_occur
Now we fit the Naive Bayes classifier.
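The fitting chunk was not rendered in this post; below is a minimal sketch, assuming the classifier is fitted with e1071::naiveBayes and stored as bayes_mod (the object name used later in the evaluation section), with default settings.
set.seed(123)
tic()
# fit Naive Bayes on the occurrence-encoded training data
bayes_mod <- naiveBayes(x = data_train_occur %>% select(-above_average),
    y = data_train_occur$above_average)
toc()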
1.42 sec elapsed
Decision Tree
A decision tree is a machine learning method that classifies the target by repeatedly splitting on variables, starting with the split that gives the highest information gain. Information gain is the amount of information obtained from a split: the difference between the entropy of the parent node and the (weighted) entropy of the resulting child nodes.
\[ Information\ Gain = Entropy(P) - Entropy(S) \]
In binary classification, the entropy can be calculated with \[ Entropy = -p \cdot \log_2(p) - (1-p) \cdot \log_2(1-p) \] where p is the proportion of one class in a node and (1-p) is the proportion of the other class.
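To make the formulas concrete, here is a small helper (a sketch, not part of the original analysis) that computes the binary entropy of a node and the information gain of a two-way split.
# binary (Shannon) entropy of a node with class proportion p
entropy <- function(p) {
    ifelse(p == 0 | p == 1, 0, -p * log2(p) - (1 - p) * log2(1 - p))
}
# information gain of splitting a parent node into two children,
# weighting each child's entropy by the number of observations it receives
information_gain <- function(p_parent, p_left, n_left, p_right, n_right) {
    n <- n_left + n_right
    entropy(p_parent) - (n_left/n * entropy(p_left) + n_right/n * entropy(p_right))
}
# e.g. a 50/50 parent split into a pure left child (50 obs) and a 25/75 right child (100 obs)
information_gain(p_parent = 0.5, p_left = 1, n_left = 50, p_right = 0.25, n_right = 100)  # ~0.46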
Now we will model the decision tree using the rpart package.
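The fitting chunk was not rendered; below is a minimal sketch, assuming the tree is stored as dtree (the object name used later in the evaluation section) and fitted with the default rpart.control settings.
set.seed(123)
tic()
# fit the classification tree on the word-count training data
dtree <- rpart(above_average ~ ., data = data_train, method = "class")
toc()
# plot the fitted tree
fancyRpartPlot(dtree, sub = NULL)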
56.78 sec elapsed
A decision tree is easy to interpret. The tree starts at the top, where the first node splits on the appearance of the term “boring”. If the review contains the word “boring” fewer than 0.5 times (i.e. not at all), the probability that the user gives an above-average score is 94%. The “Above” branch splits again: does the review contain the word “overrated”? If yes, the user tends to give a below-average score. If no, does it contain the word “amazing”? If not, does it contain the word “worst”? If yes, the user tends to give a below-average score.
Random Forest
Random forest is an ensemble method, meaning it combines many models to get better performance than a single model. A random forest consists of many “trees”, where each tree is a single decision tree model. The classification is based on the majority vote (or the average, in regression problems) of all the trees.
Illustration of Random Forest
Using tidymodels, we can create the random forest model with specific parameters. The mtry parameter is the number of variables randomly sampled as split candidates at each split, trees is the number of trees in the forest, and min_n is the minimum number of data points in a node required for the node to be split further.
# define model spec
model_spec <- rand_forest(mode = "classification", mtry = 3, trees = 500, min_n = 1)
# define model engine
model_spec <- set_engine(model_spec, engine = "ranger", seed = 123, num.threads = parallel::detectCores(),
importance = "impurity")
# model fitting
set.seed(123)
tic()
model <- fit_xy(object = model_spec, x = select(data_train, -above_average),
y = select(data_train, above_average))
toc()
43.6 sec elapsed
Evaluation
The models will be evaluated with a confusion matrix, a table that shows four categories: True Positive, True Negative, False Positive, and False Negative.
|  | Predicted_Yes | Predicted_No |
|---|---|---|
| Actual_Yes | True Positive | False Negative |
| Actual_No | False Positive | True Negative |
The performance metrics will be Accuracy, Sensitivity/Recall, Specificity, and Precision. Accuracy measures how much of the data is correctly predicted. Sensitivity measures how many of the actual positive outcomes are correctly predicted. Specificity measures how many of the actual negative outcomes are correctly predicted. Precision measures how many of our positive predictions are correct.
\[ Accuracy = \frac {TP+TN}{TP+TN+FP+FN}\] \[ Sensitivity = \frac {TP}{TP+FN}\] \[ Specificity = \frac {TN}{TN+FP}\] \[ Precision = \frac {TP}{TP+FP}\]
Naive Bayes
bayes_pred <- select(data_test_occur, above_average) %>% bind_cols(pred_class = predict(bayes_mod,
newdata = data_test_occur[, -1], type = "class"))
# performance
bayes_pred %>% conf_mat(above_average, pred_class)
Truth
Prediction Above Below
Above 3452 1061
Below 644 816
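As a quick sanity check, the four metrics can be read straight off this confusion matrix (positive class = “Above”); the values should match perf_bayes below.
# entries taken from the confusion matrix above
TP <- 3452; FP <- 1061; FN <- 644; TN <- 816
(TP + TN)/(TP + TN + FP + FN)  # accuracy    ~0.715
TP/(TP + FN)                   # sensitivity ~0.843
TN/(TN + FP)                   # specificity ~0.435
TP/(TP + FP)                   # precision   ~0.765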
perf_bayes <- bayes_pred %>% summarise(accuracy = accuracy_vec(above_average,
pred_class), sensitivity = sens_vec(above_average, pred_class), specificity = spec_vec(above_average,
pred_class), precision = precision_vec(above_average, pred_class))
perf_bayes
Decision Tree
dt_pred <- select(data_test, above_average) %>% bind_cols(pred_class = predict(dtree,
newdata = data_test[, -1], type = "class"))
# confusion matrix
dt_pred %>% conf_mat(above_average, pred_class)
Truth
Prediction Above Below
Above 3948 1531
Below 148 346
perf_dt <- dt_pred %>% summarise(accuracy = accuracy_vec(above_average, pred_class),
sensitivity = sens_vec(above_average, pred_class), specificity = spec_vec(above_average,
pred_class), precision = precision_vec(above_average, pred_class))
perf_dt
Random Forest
Besides measuring the model performance, we also want to tweak the random forest model to improve it further. Thus, we also look at the sensitivity-specificity curve and the precision-recall curve. These curves can help us choose a threshold value that increases the preferred performance metric.
Performance
# predict on test
set.seed(123)
pred_test <- select(data_test, above_average) %>% bind_cols(predict(model, select(data_test,
-above_average))) %>% bind_cols(predict(model, select(data_test, -above_average),
type = "prob"))
pred_test$above_average <- factor(pred_test$above_average, c("Above", "Below"))
pred_test$.pred_class <- factor(pred_test$.pred_class, c("Above", "Below"))
pred_test %>% conf_mat(above_average, .pred_class)
Truth
Prediction Above Below
Above 4028 1478
Below 68 399
perf_rf <- pred_test %>% summarise(accuracy = accuracy_vec(above_average, .pred_class),
sensitivity = sens_vec(above_average, .pred_class), specificity = spec_vec(above_average,
.pred_class), precision = precision_vec(above_average, .pred_class))
perf_rf
ROC Curve
An ROC graph is a two-dimensional plot of a classifier with false positive rate on the x axis against true positive rate on the y axis. As such, a ROC graph depicts relative trade-offs that a classifier makes between benefits (true positives) and costs (false positives). ROC curves are useful for comparing different classifiers, since they take into account all possible thresholds.
The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the (ROC) curve (AUC). An ideal ROC curve will hug the top left corner, so the larger the AUC the better the classifier. We expect a classifier that performs no better than chance to have an AUC of 0.5 (when evaluated on an independent test set not used in model training).
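The AUC itself is not reported in this post, but it can be computed from the predicted probabilities with yardstick (a quick sketch; pred_test is the random forest prediction data frame built above, with “Above” as the first factor level and thus the positive class).
# area under the ROC curve for the random forest
pred_test %>% roc_auc(above_average, .pred_Above)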
Sensitivity-Specificity Trade-Off
pred_test_roc <- pred_test %>% roc_curve(above_average, .pred_Above) %>% mutate_if(~is.numeric(.),
~round(., 4)) %>% gather(metric, value, -.threshold)
p <- ggplot(pred_test_roc, aes(x = .threshold, y = value)) + geom_line(aes(colour = metric)) +
scale_x_continuous(breaks = seq(0, 1, 0.1)) + labs(x = "Probability Threshold to be Classified as Positive",
y = "Value", colour = "Metrics") + theme_minimal()
ggplotly(p)
Based on the curve, we reach equilibrium when the threshold is around 0.675.
Precision-Recall Trade-Off
pred_test_pr <- pred_test %>% pr_curve(above_average, .pred_Above) %>% mutate_if(~is.numeric(.),
~round(., 4)) %>% gather(metric, value, -.threshold)
p <- ggplot(pred_test_pr, aes(x = .threshold, y = value)) + geom_line(aes(colour = metric)) +
scale_x_continuous(breaks = seq(0, 1, 0.1)) + labs(x = "Probability Threshold to be Classified as Positive",
y = "Value", colour = "Metrics") + theme_minimal()
ggplotly(p)
Based on the precision-recall curve, we can reach equilibrium when the threshold is around 0.63.
Depending on our objective, we can change the threshold to boost specific metrics. If we want a good trade-off between recall, specificity, and precision, we may choose a threshold of 0.675, since this value gives us higher precision and an equilibrium between sensitivity and specificity.
Model Improvement
We will try to improve model performance by tuning the parameters. We will also check whether the models improve if we balance the classes in the training dataset, and we will scale all numeric predictors to the range [0, 1].
rec <- recipe(above_average ~ ., data = data_train) %>% step_range(all_predictors(),
min = 0, max = 1) %>% step_upsample(above_average, ratio = 1/1, seed = 123) %>%
prep(strings_as_factors = F)
data_train <- juice(rec)
data_test <- bake(rec, testing(intrain))
Let’s see the class proportions.
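The chunk that produced the output below isn’t shown; a one-liner such as the following would do it.
# proportion of each class in the upsampled training data
prop.table(table(data_train$above_average))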
Above Below
0.5 0.5
Decision Tree
We will change the minimum number of data points in a node required to attempt a split via minsplit in the control parameter.
control = rpart.control(minsplit = 1)
set.seed(123)
dtree_tune <- rpart(above_average ~ ., data = data_train, method = "class",
control = control)
fancyRpartPlot(dtree_tune, sub = NULL)
We get additional nodes for the words “bad” and “questionmark” (“?”).
Let’s see the performance of the decision tree
set.seed(123)
dt_pred_tune <- select(data_test, above_average) %>% bind_cols(pred_class = predict(dtree,
newdata = data_test[, -1], type = "class"))
# confusion matrix
dt_pred_tune %>% conf_mat(above_average, pred_class)
Truth
Prediction Above Below
Above 4095 1869
Below 1 8
perf_dt_tune <- dt_pred_tune %>% summarise(accuracy = accuracy_vec(above_average,
pred_class), sensitivity = sens_vec(above_average, pred_class), specificity = spec_vec(above_average,
pred_class), precision = precision_vec(above_average, pred_class))
perf_dt_tune
Random Forest
By Changing The Threshold Value Of The Previous Model
We will evaluate the performance of the previous random forest with the new threshold of 0.675.
pred_test <- pred_test %>% mutate(pred_ed = factor(if_else(.pred_Above > 0.675,
"Above", "Below"), c("Above", "Below")))
pred_test %>% conf_mat(above_average, pred_ed)
Truth
Prediction Above Below
Above 2933 510
Below 1163 1367
perf_rf_thres <- pred_test %>% summarise(accuracy = accuracy_vec(above_average,
pred_ed), sensitivity = sens_vec(above_average, pred_ed), specificity = spec_vec(above_average,
pred_ed), precision = precision_vec(above_average, pred_ed))
perf_rf_thres
By increasing the threshold, we significantly increase the specificity. The precision increases by around 0.12 (12 percentage points). Even though the accuracy and the sensitivity decrease, they remain at an acceptable level.
Modeling With Balanced Dataset
Now we fit the random forest again using the balanced dataset.
# define model spec
set.seed(123)
model_spec_tune <- rand_forest(mode = "classification", mtry = 3, trees = 500,
min_n = 1)
# define model engine
model_spec_tune <- set_engine(model_spec_tune, engine = "ranger", seed = 123,
num.threads = parallel::detectCores(), importance = "impurity")
# model fitting
set.seed(123)
model_tune <- fit_xy(object = model_spec_tune, x = select(data_train, -above_average),
y = select(data_train, above_average))
Model Performance
Now we look at the model performance on the training dataset.
set.seed(123)
pred_train_tune <- select(data_train, above_average) %>% bind_cols(predict(model_tune,
select(data_train, -above_average))) %>% bind_cols(predict(model_tune, select(data_train,
-above_average), type = "prob"))
pred_train_tune$above_average <- factor(pred_train_tune$above_average, c("Above",
"Below"))
pred_train_tune$.pred_class <- factor(pred_train_tune$.pred_class, c("Above",
"Below"))
pred_train_tune %>% conf_mat(above_average, .pred_class)
Truth
Prediction Above Below
Above 15849 1724
Below 536 14661
pred_train_tune %>% summarise(accuracy = accuracy_vec(above_average, .pred_class),
sensitivity = sens_vec(above_average, .pred_class), specificity = spec_vec(above_average,
.pred_class), precision = precision_vec(above_average, .pred_class))
Next, we look at the model performance on the testing dataset.
# predict on test
set.seed(123)
pred_test_tune <- select(data_test, above_average) %>% bind_cols(predict(model_tune,
select(data_test, -above_average))) %>% bind_cols(predict(model_tune, select(data_test,
-above_average), type = "prob"))
pred_test_tune$above_average <- factor(pred_test_tune$above_average, c("Above",
"Below"))
pred_test_tune$.pred_class <- factor(pred_test_tune$.pred_class, c("Above",
"Below"))
pred_test_tune %>% conf_mat(above_average, .pred_class)
Truth
Prediction Above Below
Above 3555 752
Below 541 1125
perf_rf_tune <- pred_test_tune %>% summarise(accuracy = accuracy_vec(above_average,
.pred_class), sensitivity = sens_vec(above_average, .pred_class), specificity = spec_vec(above_average,
.pred_class), precision = precision_vec(above_average, .pred_class))
perf_rf_tune
Sensitivity-Specificity Trade-off
pred_test_roc <- pred_test_tune %>% roc_curve(above_average, .pred_Above) %>%
mutate_if(~is.numeric(.), ~round(., 4)) %>% gather(metric, value, -.threshold)
p <- ggplot(pred_test_roc, aes(x = .threshold, y = value)) + geom_line(aes(colour = metric)) +
scale_x_continuous(breaks = seq(0, 1, 0.1)) + labs(x = "Probability Threshold to be Classified as Positive",
y = "Value", colour = "Metrics") + theme_minimal()
ggplotly(p)
Based on the curve, we can reach equilibrium by choosing the threshold value of 0.55.
Precision-Recall Trade-Off
pred_test_prc <- pred_test_tune %>% pr_curve(above_average, .pred_Above) %>%
mutate_if(~is.numeric(.), ~round(., 4)) %>% gather(metric, value, -.threshold)
p <- ggplot(pred_test_prc, aes(x = .threshold, y = value)) + geom_line(aes(colour = metric)) +
scale_x_continuous(breaks = seq(0, 1, 0.1)) + labs(x = "Probability Threshold to be Classified as Positive",
y = "Value", colour = "Metrics") + theme_minimal()
ggplotly(p)
Based on the curve, we can reach equilibrium by choosing the threshold value of 0.51.
We will try increasing the threshold to 0.55 to see whether the model gets better.
pred_test_tune <- pred_test_tune %>% mutate(pred_ed = factor(if_else(.pred_Above >
0.55, "Above", "Below"), c("Above", "Below")))
pred_test_tune %>% conf_mat(above_average, pred_ed)
Truth
Prediction Above Below
Above 3007 491
Below 1089 1386
perf_rf_tune_thres <- pred_test_tune %>% summarise(accuracy = accuracy_vec(above_average,
pred_ed), sensitivity = sens_vec(above_average, pred_ed), specificity = spec_vec(above_average,
pred_ed), precision = precision_vec(above_average, pred_ed))
perf_rf_tune_thres
Conclusion
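To compare the models side by side, we can gather all the perf_* objects created above into one table (a quick sketch; the model labels below are just illustrative names).
bind_rows(
    naive_bayes = perf_bayes,
    decision_tree = perf_dt,
    decision_tree_tuned = perf_dt_tune,
    random_forest = perf_rf,
    random_forest_thres_0.675 = perf_rf_thres,
    random_forest_balanced = perf_rf_tune,
    random_forest_balanced_thres_0.55 = perf_rf_tune_thres,
    .id = "model"
)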
The decision tree has the worst precision. Naive Bayes is a bit better than the initial random forest (threshold = 0.5) but cannot compete with the other random forest variations. The tuned random forest with threshold = 0.55 has the highest precision, followed by the initial random forest with threshold = 0.675. The tuned random forest (threshold = 0.55) also has better accuracy, sensitivity, and specificity than the initial random forest (threshold = 0.675). Naive Bayes is the best in terms of computational time.
If we want high precision (i.e. we want our positive predictions to be as reliable as possible), we should choose the tuned random forest with threshold = 0.55.