User Score Classification With Sentiment Analysis: Logistic Regression and K-NN

Intro

What We’ll Do

We will perform a sentiment analysis and classify whether the score given by a user is above or below average, based on the sentiment value of the review. Logistic regression and K-Nearest Neighbour (K-NN) will be used as the classification methods. The dataset consists of user reviews of the 100 best PC games on the Metacritic website. I have already scraped the data, which you can download here, or you can scrape it yourself using the code in the Web Scraping section.

Why It Matters

It’s important for companies to pay close attention to the Voice of the Customer (VoC). By analyzing and drawing insights from customer feedback, companies have better information for strategic decisions, a more accurate understanding of what customers actually want and, as a result, a better experience for everyone. But what are customers saying about the product? Thanks to sentiment analysis, we can automatically extract more informative results from the feedback and separate it into positive, neutral, and negative categories. We can then focus on what people really like and dislike about our product.

PC gaming is a competitive segment of the entertainment industry, especially within video games. On Steam, the largest online PC gaming platform, the number of games released has grown rapidly since 2004 (Gough, 2019). Game developers need to understand what people want. Therefore, we will look at the best PC games out there and see what their customers say about them. Classifying the score as above or below average based on the sentiment of customer reviews may help us understand what makes people rate a game higher or lower than others do on average.

Web Scraping

Use or modify this code to scrape the user review data from Metacritic.

library(tidyverse)

URL <- "https://www.metacritic.com/browse/games/score/metascore/all/pc/filtered?sort=desc"

library(rvest)
webpage <- read_html(URL)
title_html <- html_nodes(webpage,'#main .product_title a')

#convert title to text
title_text <- html_text(title_html)

#Clean the game title with gsub
##remove \n
title_text <- gsub("\n","",title_text)
## remove the fixed-width leading whitespace
title_text <- substring(title_text,29)
## remove the fixed-width trailing padding (the string x below is used only for its length)
x <- "----------------------------------------------------"
title_text <- substring(title_text,1,last = nchar(title_text)-nchar(x) )
title_text <- gsub(" ","-",title_text)
head(title_text)

# build the URL slug: lower-case and strip apostrophes and colons
title_add <- tolower(title_text)
title_add <- str_replace_all(title_add, "'", "")
title_add <- str_replace_all(title_add, ":", "")

library(curl)
URL_add <- "https://www.metacritic.com/game/pc/"
  
game_review <- data.frame()

for (i in 1:100){
  print(paste("GAME :",i,title_add[i]))
  webpage_add <- read_html(paste(URL_add,title_add[i],"/user-reviews?sort-by=score&num_items=100&page=0",sep = ""))
  
  # find how many pages of user reviews this game has
  pagenum_html <- html_nodes(webpage_add,".page_num")
  pagenum_text <- html_text(pagenum_html)
  pagenum_text <- as.numeric(pagenum_text)
  pagenum_text <- max(pagenum_text)
  print(paste("number of page:",max(pagenum_text)))
  maxpage <- max(pagenum_text)
  if (maxpage == -Inf | maxpage == Inf) {
    maxpage <-  1
  }
  
for (j in 1:maxpage) {
  page_url <- paste(URL_add,title_add[i],
                    "/user-reviews?sort-by=score&num_items=100&page=",(j-1),sep = "")
  
  print(page_url)
  webpage_add <- read_html(url(page_url))
  
  #Get review
  review_html <- html_nodes(webpage_add,".review_body span")
  review_text <- html_text(review_html)
  review_text <- review_text[review_text != ", click expand to view"]
  expandrow <- which(str_length(review_text)==6 & str_detect(review_text,"Expand")==T)
  if (length(expandrow) == 0 ) {
    expandrow <-  1
  }
  if (expandrow[1] > 4) {
    review_text <- review_text[-c(expandrow,(expandrow-1),(expandrow-3),(expandrow-4))]  
  }
  
    # find indices of adjacent reviews whose first 20 characters match (likely duplicates)
    x <- substr(review_text,1,20)
    xo <- which(x == c(x[-1],0))
    xox <- xo[xo - c(xo[-1],0) == -1]
    rox <- c(xox,xox+1)
  
  #Get score
  score_html <- html_nodes(webpage_add,"#main .indiv")
  score_text <- html_text(score_html)
  score <- as.numeric(score_text) 
  
  if (length(review_text) == 99) {
    score <- score[-1]
  }
  
  if (length(rox) != 0 & length(review_text) != 100 & length(score) != length(review_text)) {
    review_text <- review_text[-rox]
  }
  
  #combine into data frame
  review_1 <- data.frame(game = title_add[i], review = review_text,score=score)
  # categorise following Metacritic's bands (positive 8-10, mixed 5-7, negative 0-4)
  review_1 <- review_1 %>% mutate(category = case_when(score > 7 ~ "Positive",
                                                       score > 4 ~ "Mixed",
                                                       TRUE ~ "Negative"))
  game_review <- rbind(game_review,review_1) 
}
  
}

write.csv(game_review,"D:/R/datasets/meta_review.csv")

Data Preparation

First, we load the required package.
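A minimal sketch of the setup, assuming the packages used later in the post and the CSV produced by the scraping step (the path is the one used above; adjust it to your machine):

# packages assumed throughout this post
library(tidyverse)   # data wrangling and ggplot2
library(tidytext)    # tokenization and sentiment lexicons
library(caret)       # confusionMatrix() and upSample()
library(class)       # knn() and knn1()

# load the scraped reviews
game_review <- read.csv("D:/R/datasets/meta_review.csv", stringsAsFactors = FALSE)
glimpse(game_review)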

The data consists of user reviews of the 100 best PC games of all time from the Metacritic website. It contains the game title, the user score, the category of the score (positive, mixed, negative), and the review text.

Before we jump into the analysis, it’s best to explore and clean the data. Some words may not be recognized by the sentiment lexicons because they are not written properly.
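As an illustration only (the exact rules depend on what you find in the data), we might lower-case the text and strip characters that the lexicons will not match:

# illustrative cleaning; adapt the rules to what you find in the data
game_review <- game_review %>%
  mutate(review = str_to_lower(review),                       # lexicons are lower-case
         review = str_replace_all(review, "[^a-z' ]", " "),   # drop digits and symbols
         review = str_squish(review))                         # collapse repeated spaces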

Since we want to classify the score as above or below average, we need to add that label to the data.
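A sketch of the labelling, assuming "average" means the mean score of the game being reviewed (games with a single review have nothing to compare against and are dropped later):

# label each review relative to the average score of its own game (assumed definition)
game_review <- game_review %>%
  group_by(game) %>%
  mutate(above_average = ifelse(score > mean(score), "Yes", "No")) %>%
  ungroup()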

We need to transform the reviews into tidy format, with each row containing only one word. We will use the tidytext package.
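A sketch of the tokenization with tidytext's unnest_tokens() (review_id is a hypothetical key added so the words can be aggregated back per review):

# one row per word; keep an id so words can be aggregated back per review
review_words <- game_review %>%
  mutate(review_id = row_number()) %>%
  unnest_tokens(word, review)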

We will do a sentiment analysis on the reviews. The tidytext package provides several sentiment lexicons. Four general-purpose lexicons are afinn, bing, loughran, and nrc. The bing lexicon contains only two classes of sentiment: positive and negative. The loughran lexicon contains six different sentiments, and the nrc lexicon contains ten. Meanwhile, the afinn lexicon does not assign a sentiment class; instead it gives each word a numeric score from -5 to 5 indicating the strength of the sentiment, with negative scores for negative sentiment and positive scores for positive sentiment (Nielsen, 2011).

Since we want an overall sentiment value for each review, we will use the afinn lexicon: the sentiment is best represented as a numeric value, and afinn measures the degree of the sentiment.
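A sketch of scoring each review with AFINN; the feature is named value to match the model output later:

# join the AFINN lexicon and sum the word scores into one sentiment value per review
review_sentiment <- review_words %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(review_id, game, score, above_average) %>%
  summarise(value = sum(value), .groups = "drop")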

Exploratory Data Analysis

For a bit of fun, we create a word cloud of the most frequent words with positive and negative sentiment, using the reshape2 and wordcloud packages.
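A sketch of the word cloud, assuming the bing lexicon for the positive/negative split:

library(reshape2)
library(wordcloud)

# most frequent words by sentiment, drawn as a comparison cloud
review_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("firebrick", "darkgreen"), max.words = 100)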

We are also curious whether there is a pattern between a game’s rank and the number of people who give it an above-average score.
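A sketch of one way to check this, assuming the games appear in the scraped data in rank order (1 = highest metascore):

# count above-average reviews per game and plot them against the game's rank
review_sentiment %>%
  filter(above_average == "Yes") %>%
  count(game, name = "n_above") %>%
  mutate(rank = match(game, unique(game_review$game))) %>%   # rank from scraping order
  ggplot(aes(rank, n_above)) +
  geom_point() +
  labs(x = "Game rank (1 = highest metascore)", y = "Reviews scored above average")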

Some games are dropped because they have only one review, so we cannot decide whether the score given is above or below average. There is no clear pattern between the rank of a game and the number of people who give it an above-average score: a higher rank does not guarantee more above-average scores. Therefore, we will not use the game rank as a feature to classify the user score.

Next, we want to see whether there is a difference in sentiment value between reviews with above-average and below-average scores.
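For example, a boxplot of the sentiment value by class (a sketch; any comparable plot works):

# compare the distribution of sentiment values between the two classes
ggplot(review_sentiment, aes(above_average, value, fill = above_average)) +
  geom_boxplot() +
  labs(x = "Score above average?", y = "Sentiment value (AFINN)")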

As we can see in the figure above, there is a difference between the two groups: reviews with an above-average user score tend to have a higher sentiment value, and vice versa. We will use the sentiment value as the feature to classify the user score.

Modeling

Holdout: Train-Test Split

Before we train our models, we first check whether there is a class imbalance between the above-average and below-average scores. Class imbalance can affect the model’s ability to classify the output.
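A sketch of the check (column names follow the output below):

# class balance of the target; produces the table below
review_sentiment %>%
  count(above_average, name = "total") %>%
  mutate(proportion = round(total / sum(total) * 100, 1))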

# A tibble: 2 x 3
  above_average total proportion
  <chr>         <int>      <dbl>
1 No             9522       31.7
2 Yes           20528       68.3

The ratio of class 1 (above average) to class 2 is close to 2:1, so the imbalance is not too severe and we can proceed to the next step.

We need to convert the class label into a factor.

We split the dataset into a training set and a test set. This is done to check whether the model can classify new data it has not seen before. The data will be split with a ratio of 80/20 (80% of the data is used to train the model, 20% to test it).
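A sketch of the factor conversion and the split (data_train and data_test are the names used by the model calls later; the seed is illustrative):

# convert the label to a factor, then split 80/20 into train and test sets
review_sentiment <- review_sentiment %>%
  mutate(above_average = factor(above_average, levels = c("No", "Yes")))

set.seed(123)
train_idx  <- sample(nrow(review_sentiment), size = 0.8 * nrow(review_sentiment))
data_train <- review_sentiment[train_idx, ]
data_test  <- review_sentiment[-train_idx, ]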

Logistic Regression

We train the logistic regression model. This is the equation of a logistic regression model: \[ \log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X \] The left-hand side is called the log-odds or logit. On the right-hand side, \(\beta_0\) is the model intercept and \(\beta_1\) is the coefficient of the feature \(X\).
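The model is fit with glm(); the call below matches the summary output that follows (model_lr is just an assumed name):

# logistic regression with the sentiment value as the single predictor
model_lr <- glm(above_average ~ value, family = binomial("logit"), data = data_train)
summary(model_lr)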


Call:
glm(formula = above_average ~ value, family = binomial("logit"), 
    data = data_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.8781  -1.2588   0.7414   0.8985   2.2815  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 0.359114   0.017699   20.29   <2e-16 ***
value       0.056563   0.001649   34.31   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 30015  on 24039  degrees of freedom
Residual deviance: 28581  on 24038  degrees of freedom
AIC: 28585

Number of Fisher Scoring iterations: 4

The output shows the summary of our logistic regression model. The coefficient for the intercept is 0.36 and for the sentiment value 0.06. Both coefficients have small p-values (< 0.05), indicating that they are significant and should not be removed from the equation. The intercept indicates that when the sentiment value is 0, the probability of the outcome being “Yes” is 0.5888 (James et al., 2013). Adding the coefficient for the sentiment value, a sentiment value of 1 gives a probability of “Yes” of 0.6024. The standard error quantifies the uncertainty of each estimate; a rough one-standard-error interval for the true coefficient is \[estimate - Std. Error < \beta < estimate + Std. Error\]
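The quoted probabilities can be recovered from the coefficients with the inverse logit, for example using plogis():

plogis(0.359114)               # P(Yes) when the sentiment value is 0, ~0.589
plogis(0.359114 + 0.056563)    # P(Yes) when the sentiment value is 1, ~0.602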

K-NN

K-NN classifies the outcome by looking at the nearest “neighbours”. In other words, K-NN looks at the class of the data point(s) with the shortest distance to the data we want to classify. The distance is measured with the Euclidean distance:

\[d(x,y) = \sqrt{\sum_{i} (x_i - y_i)^2}\]

We can choose the number of neighbours (K) used to classify our data. First, we try the rule-of-thumb value for K: the square root of the number of observations in the training set.
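A sketch of that attempt; the object names (train_x, test_x, train_y, neighbour) follow the error message below:

# rule of thumb: K = square root of the number of training observations
neighbour <- round(sqrt(nrow(data_train)))

train_x <- data_train["value"]
test_x  <- data_test["value"]
train_y <- data_train$above_average

pred_knn <- class::knn(train = train_x, test = test_x, cl = train_y, k = neighbour)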

Error in class::knn(train = train_x, test = test_x, cl = train_y, k = neighbour): too many ties in knn

We encountered a problem here, so let’s try another number of K.

Error in class::knn(train = train_x, test = test_x, cl = train_y, k = 3): too many ties in knn
Error in class::knn(train = train_x, test = test_x, cl = train_y, k = 5): too many ties in knn

We encountered the same problem multiple times. The errors show that there are too many ties in K-NN, meaning two or more points are equidistant from an unclassified observation, making it difficult to choose which neighbours to include (Pylypiw, 2017). Moreover, we only use one feature to classify the target, and the sentiment value has a rather narrow range, from -375 to 234. Since we have more than 20,000 instances in our training data, it is almost certain that many observations share the same sentiment value.


One solution is to break the ties by randomly selecting the class. We will use a different K-NN function for this: knn1() instead of knn() from the class package. Another alternative is the knn() function from the neighbr package, but I would not recommend it since it runs slowly.
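A minimal sketch with knn1(), which uses a single neighbour and breaks distance ties at random:

# 1-nearest-neighbour classification from the class package
pred_knn1 <- class::knn1(train = train_x, test = test_x, cl = train_y)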

We’ve successfully classified the test data with one neighbour.

Evaluation

The models will be evaluated with a confusion matrix, a table that shows four categories of outcomes: True Positive, True Negative, False Positive, and False Negative.

             Predicted_Yes    Predicted_No
Actual_Yes   True Positive    False Negative
Actual_No    False Positive   True Negative

The performance metrics will be Accuracy, Sensitivity/Recall, Specificity, and Precision (Saito and Rehmsmeier, 2015). Accuracy measures how much of our data is correctly predicted. Sensitivity measures how many of all the positive outcomes are correctly predicted. Specificity measures how many of the negative outcomes are correctly predicted. Precision measures how many of our positive predictions are correct.

\[ Accuracy = \frac {TP+TN}{TP+TN+FP+FN}\] \[ Sensitivity = \frac {TP}{TP+FN}\] \[ Specificity = \frac {TN}{TN+FP}\] \[ Precision = \frac {TP}{TP+FP}\]

Logistic Regression

Logistic regression returns a value in the range [0, 1], not a binary class. The value is an estimate of the probability that the data belongs to the positive class (Yes, or above average).

We convert the probability into a class using a threshold value: any value above the threshold is classified as the positive class. By default, the threshold is 0.5.
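A sketch of the prediction and thresholding step; the confusion matrix below comes from caret's confusionMatrix():

# predicted probabilities on the test set, then a 0.5 threshold
prob_lr <- predict(model_lr, newdata = data_test, type = "response")
pred_lr <- factor(ifelse(prob_lr > 0.5, "Yes", "No"), levels = c("No", "Yes"))

caret::confusionMatrix(pred_lr, data_test$above_average, positive = "Yes")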

Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No   181   48
       Yes 1730 4051
                                          
               Accuracy : 0.7042          
                 95% CI : (0.6924, 0.7157)
    No Information Rate : 0.682           
    P-Value [Acc > NIR] : 0.0001112       
                                          
                  Kappa : 0.1085          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.98829         
            Specificity : 0.09471         
         Pos Pred Value : 0.70074         
         Neg Pred Value : 0.79039         
             Prevalence : 0.68203         
         Detection Rate : 0.67404         
   Detection Prevalence : 0.96190         
      Balanced Accuracy : 0.54150         
                                          
       'Positive' Class : Yes             
                                          

The result shows that our logistic regression model has an accuracy of 70.42% on the test set, meaning that 70.42% of the data is correctly classified. The sensitivity and specificity are 98.83% and 9.47%, indicating that most positive outcomes are correctly classified but only a small number of negative outcomes are. The precision (positive predictive value) is 70.07%, meaning that 70.07% of our positive predictions are correct.

K-NN

Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No   468  229
       Yes 1443 3870
                                          
               Accuracy : 0.7218          
                 95% CI : (0.7103, 0.7331)
    No Information Rate : 0.682           
    P-Value [Acc > NIR] : 1.161e-11       
                                          
                  Kappa : 0.2276          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9441          
            Specificity : 0.2449          
         Pos Pred Value : 0.7284          
         Neg Pred Value : 0.6714          
             Prevalence : 0.6820          
         Detection Rate : 0.6439          
   Detection Prevalence : 0.8840          
      Balanced Accuracy : 0.5945          
                                          
       'Positive' Class : Yes             
                                          

The result shows that our K-NN model with K = 1 has an accuracy of 72.18% on the test set, meaning that 72.18% of the data is correctly classified. The sensitivity and specificity are 94.41% and 24.49%, indicating that most positive outcomes are correctly classified but only a small number of negative outcomes are. The precision (positive predictive value) is 72.84%, meaning that 72.84% of our positive predictions are correct.

Model Improvement

We want to improve the performance of our models. One way to do that is to balance the classes. First, we can oversample: simply increase the number of observations in the under-represented class by replicating existing data. Second, we can undersample: decrease the number of observations in the over-represented class to match the under-represented one. Lastly, we can create artificial samples with a technique called SMOTE (Synthetic Minority Over-sampling Technique). We will use oversampling, which is very simple and sacrifices little information because we do not have to remove data from the over-represented class.
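A sketch of the oversampling using caret's upSample(), which resamples the minority class with replacement until the classes are equal in size:

# balance the classes by replicating minority-class rows (applied to the full data here,
# matching the table below; it can equally be applied to the training set only)
data_balanced <- upSample(x = review_sentiment["value"],
                          y = review_sentiment$above_average,
                          yname = "above_average")

data_balanced %>%
  count(above_average, name = "total") %>%
  mutate(proportion = round(total / sum(total) * 100))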

# A tibble: 2 x 3
  above_average total proportion
  <fct>         <int>      <dbl>
1 Yes           20528         50
2 No            20528         50

Logistic Regression


Call:
glm(formula = above_average ~ value, family = binomial("logit"), 
    data = data_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.8526  -1.2603   0.7437   0.8994   2.2682  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 0.360531   0.017841   20.21   <2e-16 ***
value       0.055948   0.001655   33.80   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 29983  on 24039  degrees of freedom
Residual deviance: 28592  on 24038  degrees of freedom
AIC: 28596

Number of Fisher Scoring iterations: 4
Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No   424   45
       Yes 3724 4019
                                          
               Accuracy : 0.541           
                 95% CI : (0.5302, 0.5519)
    No Information Rate : 0.5051          
    P-Value [Acc > NIR] : 3.903e-11       
                                          
                  Kappa : 0.0903          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9889          
            Specificity : 0.1022          
         Pos Pred Value : 0.5190          
         Neg Pred Value : 0.9041          
             Prevalence : 0.4949          
         Detection Rate : 0.4894          
   Detection Prevalence : 0.9429          
      Balanced Accuracy : 0.5456          
                                          
       'Positive' Class : Yes             
                                          

K-NN

Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  2014  891
       Yes 2134 3173
                                          
               Accuracy : 0.6316          
                 95% CI : (0.6211, 0.6421)
    No Information Rate : 0.5051          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.2655          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.7808          
            Specificity : 0.4855          
         Pos Pred Value : 0.5979          
         Neg Pred Value : 0.6933          
             Prevalence : 0.4949          
         Detection Rate : 0.3864          
   Detection Prevalence : 0.6462          
      Balanced Accuracy : 0.6331          
                                          
       'Positive' Class : Yes             
                                          

ROC Curve

To see whether any of the models is better than a random guess, we can use the ROC curve, which plots the true positive rate against the false positive rate.
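A sketch of the curve for the logistic model with the pROC package, using the predicted probabilities from earlier (a hard K-NN class prediction contributes only a single point, so a class-probability output would be needed to draw its full curve):

library(pROC)

# ROC curve for the logistic regression model
roc_lr <- roc(response = data_test$above_average, predictor = prob_lr)
plot(roc_lr, legacy.axes = TRUE)   # legacy.axes = TRUE puts the false positive rate on the x-axis
auc(roc_lr)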

The straight diagonal line represents the combinations of sensitivity (true positive rate) and false positive rate obtained by classifying the target with random guessing. In other words, if our model sits below this line, we can safely conclude that it is no better than randomly guessing the classes. The ideal position is the point (0, 1), where the sensitivity is 1 and the false positive rate is 0; the closer we are to this point, the better the model.

Now we look at our models. We’ve made four models (two initial models and two improved models trained on balanced data). Both logistic regression models have a very high false positive rate, meaning they are almost unable to classify the negative class (No, or below average). The initial K-NN model (K-NN 1) has high sensitivity but also a high false positive rate. Compared with K-NN 2, K-NN 1 performs worse at correctly predicting the negative class, while K-NN 2 has lower sensitivity than K-NN 1.

Conclusion

Both models perform worse in accuracy, sensitivity, and precision, but their specificity increases after oversampling.

There is no significant difference between logistic regression and K-NN in terms of accuracy. The logistic regression model classifies data as the positive outcome (above average) more often than K-NN; as a result, its sensitivity is higher and its specificity lower than K-NN’s. Both models perform poorly at predicting true negative outcomes. The precision of K-NN is slightly higher than that of logistic regression. Overall, K-NN is better than logistic regression.

Depending on what we want to achieve, we can choose one of the K-NN models. Accuracy may not be the best metric in this case. If we want to maximize the number of correct positive and negative outcomes, we should choose the improved K-NN model. However, if all that matters is the positive classification, for example if we spend money to target those predicted as positive and want the investment to pay off, we should prioritize the model with the higher precision. In that case, we may be better off choosing the initial K-NN model.

Reference

  1. Gough, Christina. 2019. Number of games released on Steam worldwide from 2004 to 2018. Accessed on September 18, 2019 via https://www.statista.com/statistics/552623/number-games-released-steam/.
  2. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: with Applications in R. New York: Springer.
  3. Nielsen, Finn Årup. 2011. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’: Big things come in small packages, 93-98. https://arxiv.org/abs/1103.2903
  4. Pylypiw, Nicholas. 2017. Breaking Ties in K-NN Classification. Accessed on September 18, 2019 via https://www.linkedin.com/pulse/breaking-ties-k-nn-classification-nicholas-pylypiw.
  5. Saito, Takaya and Marc Rehmsmeier. 2015. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One, 10(3): e0118432. DOI: https://doi.org/10.1371/journal.pone.0118432

2019-09-19