User Score Classification with Neural Network and Keras

1 Intro

1.1 What We’ll Do

We will try to classify whether a user will give a game an above-average score based on the content of their review. We will build a neural network with Keras, with features extracted from the reviews using a text mining approach. I have done a similar classification task with different features and models, using the sentiment value of the reviews; you can check it here. I also used the Naive Bayes and Random Forest methods to classify the score, which you can check here.

1.2 The Dataset

The dataset consists of user reviews of the 100 best PC games on the Metacritic website. I have already scraped the data, which you can download here.

2 Data Preparation

First, we load the required packages.
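A minimal sketch of the setup; the exact package list is an assumption based on the steps that follow:

# packages assumed for this walkthrough
library(data.table)  # fast data import with fread
library(textclean)   # text cleaning helpers
library(tm)          # document-term matrix
library(caret)       # down-sampling and confusion matrix
library(ROCR)        # ROC, sensitivity-specificity, and precision-recall curves
library(keras)       # neural network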

2.1 Import Data

We will import the dataset using fread for faster reading.
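A sketch of the import step; the file name here is a placeholder for the scraped dataset:

# read the scraped reviews (hypothetical file name)
reviews <- fread("metacritic_user_reviews.csv")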

2.2 Data PreProcessing

We want to clean the text by removing URLs and word elongations. We will also replace "?" with "questionmark" and "!" with "exclamationmark" to see whether these characters are useful for our analysis.
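One way to do this with textclean; the column name text is an assumption:

# remove URLs and word elongation (e.g. "goooood" -> "good"),
# then encode "?" and "!" as words so they survive tokenization
reviews$text <- replace_url(reviews$text)
reviews$text <- replace_word_elongation(reviews$text)
reviews$text <- gsub("\\?", " questionmark ", reviews$text)
reviews$text <- gsub("!", " exclamationmark ", reviews$text)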

Since we want to classify the score into above average or below average, we need to add the label into the data.
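One way to add the label, assuming the user score lives in a column called score and "above average" means above the mean score:

# label each review as Above or Below the average score
reviews$label <- ifelse(reviews$score > mean(reviews$score), "Above", "Below")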

Finally, we will build a document-term matrix, where each row represents a review and the columns consist of the top 1024 words across all reviews. We will use this matrix to classify whether a user will give an above-average score based on the appearance of one or more terms.
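A sketch of the document-term matrix construction with tm, keeping the 1024 most frequent terms; the preprocessing options are assumptions:

# build a corpus and a document-term matrix
corpus <- VCorpus(VectorSource(reviews$text))
dtm    <- DocumentTermMatrix(corpus, control = list(tolower = TRUE,
                                                    removePunctuation = TRUE))

# keep only the 1024 most frequent terms across all reviews
dtm_mat   <- as.matrix(dtm)
top_terms <- names(sort(colSums(dtm_mat), decreasing = TRUE))[1:1024]
dtm_mat   <- dtm_mat[, top_terms]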

3 Exploratory Data Analysis

We will check whether there is a class imbalance by looking at the proportion of the target variable.
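Assuming the label column from earlier, the class proportions can be checked with:

# proportion of each class
prop.table(table(reviews$label))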


    Above     Below 
0.6858394 0.3141606 

It turns out there is a class imbalance: the above-average class is roughly twice as big as the below-average class.

4 Modeling

4.1 Cross-Validation

We will split the data into a training set, a validation set, and a testing set. First, we split the data into the training and testing sets.
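A sketch of the first split; the 80/20 proportion is an assumption:

# 80/20 split into training and testing sets
set.seed(123)
train_idx <- sample(nrow(dtm_mat), size = 0.8 * nrow(dtm_mat))
train_x   <- dtm_mat[train_idx, ]
train_y   <- reviews$label[train_idx]
test_x    <- dtm_mat[-train_idx, ]
test_y    <- reviews$label[-train_idx]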

We will balance the classes in the training set and normalize all numeric features. Then we split the testing set into the validation set and the testing set itself.
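One way to do this, using caret's downSample for the class balancing and a z-score normalization; the details are assumptions:

# down-sample the majority class so both classes are 50/50
balanced <- downSample(x = as.data.frame(train_x), y = factor(train_y),
                       yname = "label")
train_y  <- balanced$label
train_x  <- as.matrix(balanced[, colnames(balanced) != "label"])

# normalize features using the training set statistics
train_x <- scale(train_x)
test_x  <- scale(test_x, center = attr(train_x, "scaled:center"),
                 scale = attr(train_x, "scaled:scale"))

# split the testing set in half into validation and final test sets
val_idx <- sample(nrow(test_x), size = 0.5 * nrow(test_x))
val_x   <- test_x[val_idx, ];  val_y  <- test_y[val_idx]
test_x  <- test_x[-val_idx, ]; test_y <- test_y[-val_idx]

prop.table(table(train_y))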


Above Below 
  0.5   0.5 

We adjust the data to get a proper structure before we feed it into Keras.
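A sketch of the final reshaping, encoding the labels as 0/1 for a sigmoid output:

# keras expects a numeric matrix and a numeric target
train_x <- as.matrix(train_x)
val_x   <- as.matrix(val_x)
train_y <- ifelse(train_y == "Above", 1, 0)
val_y   <- ifelse(val_y == "Above", 1, 0)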

4.2 Neural Network

We will build the neural network architecture.

Our model has several layers. The dense layers transform the data, using the ReLU activation function on the first and second dense layers. There are also dropout layers to prevent the model from overfitting. Finally, the output layer squashes the result into the range [0, 1] with the sigmoid function, giving the probability that a review belongs to a particular class. The number of epochs represents how many times the model performs feed-forward and back-propagation over the training data.
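A sketch of an architecture matching that description; the layer sizes, dropout rates, batch size, and epoch count are assumptions:

# two hidden dense layers with ReLU, dropout in between,
# and a sigmoid output for the class probability
model <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = "relu", input_shape = ncol(train_x)) %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 256, activation = "relu") %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "adam",
  loss      = "binary_crossentropy",
  metrics   = "accuracy"
)

# each epoch is one full feed-forward / back-propagation pass over the data
history <- model %>% fit(
  train_x, train_y,
  epochs = 10, batch_size = 128,
  validation_data = list(val_x, val_y)
)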

Our model has an accuracy of 86.4% on the training set and 74.18% on the validation set at the end of the training phase. Since the gap between training and validation accuracy is not too big, we can conclude that our model is not overfitting.

5 Evaluation

5.1 Performance

We will check the confusion matrix for the training set and the testing set, shown below in that order.
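A sketch of how these matrices could be produced, assuming a 0.5 cutoff on the predicted probabilities:

# predicted classes from predicted probabilities (cutoff 0.5)
train_pred  <- ifelse(predict(model, train_x) > 0.5, "Above", "Below")
test_pred   <- ifelse(predict(model, test_x)  > 0.5, "Above", "Below")
train_truth <- ifelse(train_y == 1, "Above", "Below")

# confusion matrices: training set first, then testing set
table(Prediction = train_pred, Truth = train_truth)
table(Prediction = test_pred,  Truth = test_y)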

          Truth
Prediction Above Below
     Above  6894   689
     Below   627  6832
          Truth
Prediction Above Below
     Above  1555   260
     Below   497   679

We will check the performance of our model on the training set.

Next, we check the performance on the testing set.
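A sketch of the metric calculations with caret, treating "Above" as the positive class:

# accuracy, sensitivity/recall, specificity, and precision in one call
confusionMatrix(factor(train_pred), factor(train_truth), positive = "Above")
confusionMatrix(factor(test_pred),  factor(test_y),      positive = "Above")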

5.2 ROC Curve
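A sketch of the ROC curve with ROCR, using the predicted probabilities on the testing set:

# ROCR needs the raw probabilities and the true labels
test_prob <- predict(model, test_x)
pred_obj  <- prediction(as.numeric(test_prob), test_y == "Above")

# true positive rate vs. false positive rate
plot(performance(pred_obj, "tpr", "fpr"), main = "ROC Curve")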

5.3 Sensitivity-Specificity Curve
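The same ROCR object can draw the sensitivity-specificity curve:

# sensitivity vs. specificity, reusing pred_obj from the ROC sketch
plot(performance(pred_obj, "sens", "spec"),
     main = "Sensitivity-Specificity Curve")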

5.4 Precision-Recall Curve
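And likewise the precision-recall curve:

# precision vs. recall, reusing pred_obj from the ROC sketch
plot(performance(pred_obj, "prec", "rec"),
     main = "Precision-Recall Curve")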

6 Model Improvement

6.1 Only Change the Threshold

By changing the threshold, we have increased the precision (even if only by a small margin).
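A sketch of the threshold change; the 0.6 cutoff here is a hypothetical value:

# raise the cutoff so the model only predicts "Above" when it is more
# confident, trading some recall for precision (0.6 is hypothetical)
test_pred_t <- ifelse(test_prob > 0.6, "Above", "Below")
confusionMatrix(factor(test_pred_t), factor(test_y), positive = "Above")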

6.2 Tuning the Model Parameters

We will try to use tf-idf instead of the raw term frequency. Tf-idf (term frequency - inverse document frequency) represents how unique a term or word is across reviews.
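A sketch of swapping in tf-idf weighting with tm, assuming the same vocabulary as before; the rest of the pipeline stays the same:

# weight terms by tf-idf instead of raw counts
dtm_tfidf <- DocumentTermMatrix(corpus,
                                control = list(weighting = weightTfIdf,
                                               tolower = TRUE,
                                               removePunctuation = TRUE))
dtm_mat_tfidf <- as.matrix(dtm_tfidf)[, top_terms]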

The numeric features will be scaled using min-max scaling instead of normalization.
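A minimal min-max scaler; the split names train_x2 and test_x2 are hypothetical tf-idf splits built the same way as before, and applying the training set's minima and maxima to the other splits is an assumption:

# scale each column into [0, 1] using the training set's min and max
# (columns with zero range would need special handling)
min_max <- function(x, mins, maxs) sweep(sweep(x, 2, mins), 2, maxs - mins, "/")

mins <- apply(train_x2, 2, min)
maxs <- apply(train_x2, 2, max)
train_x2 <- min_max(train_x2, mins, maxs)
test_x2  <- min_max(test_x2,  mins, maxs)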


Above Below 
  0.5   0.5 

We build a new neural network architecture.

Our model has an accuracy of 88.21% on the training set and 75.7% on the validation set at the end of the training phase. Since the gap between training and validation accuracy is not too big, we can conclude that our model is not overfitting.
We then check the model performance.

6.2.1 ROC Curve

6.2.2 Sensitivity-Specificity Curve

6.2.3 Precision-Recall Curve

We will try to change the threshold to 0.65.

7 Conclusion

This is the summary of our model performance. Model 1 refers to the neural network that uses the term frequency, while Model 2 refers to the one that uses tf-idf.

There is no apparent difference in performance between the model using term frequency and the one using tf-idf.

2019-10-06