User Score Classification With Sentiment Analysis: Logistic Regression and K-NN
Intro
What We’ll Do
We will do a sentiment analysis and classify whether the score given by a user is above or below average based on the sentiment value of the review. Logistic regression and K-Nearest Neighbour (K-NN) will be used as the classification methods. The dataset consists of user reviews of the 100 best PC games from the Metacritic website. I have already scraped the data, which you can download here. Or you can scrape the data yourself using the code on the web scrape tab.
Why It Matters
It's important for companies to pay close attention to the Voice of Customer (VoC). By analyzing and getting insights from customer feedback, companies have better information for strategic decisions, a more accurate understanding of what the customer actually wants and, as a result, a better experience for everyone. But what are customers saying about the product? Thanks to sentiment analysis, we can automatically gain more informative results from the feedback and separate it into positive, neutral, and negative feedback. We can then focus on what people really like and/or dislike about our product.
PC gaming is a competitive segment of the entertainment industry, especially within the video game industry. On Steam, the largest online PC gaming platform, the number of games released has grown rapidly since 2004 (Gough, 2019). Game developers need to understand what people want. Therefore, we shall look at the best PC games out there and see what their customers say about them. Classifying the score into above average or below average based on the sentiment of customer reviews may help us gain insight into what makes people rate a game higher or lower than other people do on average.
Web Scraping
Use or modify this code to scrape the user review data from Metacritic.
library(tidyverse)
URL <- "https://www.metacritic.com/browse/games/score/metascore/all/pc/filtered?sort=desc"
library(rvest)
webpage <- read_html(URL)
title_html <- html_nodes(webpage,'#main .product_title a')
#convert title to text
title_text <- html_text(title_html)
#Clean the game title with gsub
##remove \n
title_text <- gsub("\n","",title_text)
##remove the first space
title_text <- substring(title_text,29)
##remove the last space
x <- "----------------------------------------------------"
title_text <- substring(title_text,1,last = nchar(title_text)-nchar(x) )
title_text <- gsub(" ","-",title_text)
head(title_text)
title_add <- tolower(title_text)
##remove apostrophes and colons (they are dropped in metacritic URL slugs)
title_add <- str_replace_all(title_add,"'","")
title_add <- str_replace_all(title_add,":","")
library(curl)
URL_add <- "https://www.metacritic.com/game/pc/"
game_review <- data.frame()
for (i in 1:100){
print(paste("GAME :",i,title_add[i]))
webpage_add <- read_html(paste(URL_add,title_add[i],"/user-reviews?sort-by=score&num_items=100&page=0",sep = ""))
pagenum_html <- html_nodes(webpage_add,".page_num")
pagenum_text <- html_text(pagenum_html)
pagenum_text <- as.numeric(pagenum_text)
pagenum_text <- max(pagenum_text)
print(paste("number of page:",max(pagenum_text)))
maxpage <- max(pagenum_text)
#games with a single review page have no .page_num node, so max() of the empty result is -Inf
if (maxpage == -Inf | maxpage == Inf) {
maxpage <- 1
}
for (j in 1:maxpage) {
page_url <- paste(URL_add,title_add[i],
"/user-reviews?sort-by=score&num_items=100&page=",(j-1),sep = "")
print(page_url)
webpage_add <- read_html(url(page_url))
#Get review
review_html <- html_nodes(webpage_add,".review_body span")
review_text <- html_text(review_html)
review_text <- review_text[review_text != ", click expand to view"]
#rows that contain only the word "Expand" (the collapsed-review toggle)
expandrow <- which(str_length(review_text)==6 & str_detect(review_text,"Expand")==T)
if (length(expandrow) == 0 ) {
expandrow <- 1
}
if (expandrow[1] > 4) {
review_text <- review_text[-c(expandrow,(expandrow-1),(expandrow-3),(expandrow-4))]
}
#flag pairs of consecutive reviews whose first 20 characters match (collapsed/expanded duplicates of the same review)
x <- substr(review_text,1,20)
xo <- which(x == c(x[-1],0))
xox <- xo[xo - c(xo[-1],0) == -1]
rox <- c(xox,xox+1)
#Get score
score_html <- html_nodes(webpage_add,"#main .indiv")
score_text <- html_text(score_html)
score <- as.numeric(score_text)
if (length(review_text) == 99) {
score <- score[-1]
}
if (length(rox) != 0 & length(review_text) != 100 & length(score) != length(review_text)) {
review_text <- review_text[-rox]
}
#combine into data frame
review_1 <- data.frame(game = title_add[i], review = review_text,score=score)
#score category following Metacritic's convention: 8-10 positive, 5-7 mixed, 0-4 negative
review_1 <- review_1 %>% mutate(category = case_when(score >= 8 ~ "Positive",
score >= 5 ~ "Mixed",
TRUE ~ "Negative"))
game_review <- rbind(game_review,review_1)
}
}
write.csv(game_review,"D:/R/datasets/meta_review.csv")
Data Preparation
First, we load the required packages.
library(data.table)
library(dplyr)
library(class)
library(caret)
library(stringr)
library(ggplot2)
library(plotly)
library(tidytext)
library(ROSE)
library(reshape2)
library(wordcloud)
library(ggrepel)
set.seed(13)
The data consists of user reviews of the 100 best PC games of all time from the Metacritic website: the game title, the user score, the category of the score (positive, mixed, negative), and the review text.
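If you downloaded the scraped file, a minimal sketch for loading it (the path is the one used in the scraping step above; adjust it to wherever you saved the file):
# read the scraped reviews; fread() comes from data.table
game_review <- fread("D:/R/datasets/meta_review.csv")
# the row-id column written by write.csv comes back named V1
game_review <- as.data.frame(game_review)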
Before we jump into the analysis, it's best to explore and clean the data. Some words may not be recognized by sentiment lexicons because they are not written in their proper form (for example, contractions).
# typo correction: expand contractions so sentiment lexicons can match the words
game_review$review <- gsub("shouldn't","should not",game_review$review)
game_review$review <- gsub("didn't","did not",game_review$review)
game_review$review <- gsub("don't","do not",game_review$review)
game_review$review <- gsub("can't","can not",game_review$review)
game_review$review <- gsub("couldn't","could not",game_review$review)
game_review$review <- gsub("'ll"," will",game_review$review)
game_review$review <- gsub("'ve"," have",game_review$review)
Since we want to classify the score into above average or below average, we need to add the label into the data.
#add game rank
game_rank <- as.numeric(factor(game_review$game,unique(game_review$game)))
game_review$rank <- game_rank
#Get mean of user score on each game
#keep only games with more than one review (otherwise above/below average is undefined)
more_1 <- game_review %>% group_by(game) %>% summarise(total = n()) %>% filter(total > 1)
game_review <- game_review[game_review$game %in% more_1$game,]
x <- game_review %>%
mutate(game = factor(game,unique(game))) %>%
group_by(game,rank) %>%
summarise(game_mean = mean(score))
#Label the data with above average/below average
game_review <- game_review %>% left_join(x) %>%
mutate(above_average = if_else(score > game_mean, "Yes", "No"))
rmarkdown::paged_table(game_review)
We need to transform the reviews into tidy format, with each row containing only one word. We will use the tidytext package.
#unnest token to extract word from review
game_token <- game_review %>% unnest_tokens(word,review)
rmarkdown::paged_table(head(game_token,10))
We will do a sentiment analysis on the reviews. The tidytext package provides several sentiment lexicons. Four general-purpose lexicons are afinn, bing, loughran, and nrc. The bing lexicon contains only 2 classes of sentiment: positive and negative. The loughran lexicon contains 6 different sentiments, and the nrc lexicon contains 10. Meanwhile, the afinn lexicon doesn't give a class of sentiment; instead it gives a numeric scale, ranging from -5 to 5, to show the strength of the sentiment, with negative scores indicating negative sentiment and positive scores indicating positive sentiment (Nielsen, 2011).
Since we want to get the overall sentiment value of each review, we will use the afinn lexicon, because the sentiment may be best represented as a numeric value and the afinn lexicon measures the degree of the sentiment.
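A quick peek at a few afinn entries illustrates the numeric scale (a small sketch; the chosen words are arbitrary examples):
# each word is mapped to an integer value between -5 and 5
get_sentiments("afinn") %>% filter(word %in% c("awesome", "good", "bad", "terrible"))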
# join the word with sentiment
game_sentiment <- game_token %>% inner_join(get_sentiments("afinn"))
# summarise the sentiment value of each review
# V1 is the per-review row id from the CSV, so each review keeps its own sentiment total
game_sentiment <- game_sentiment %>% group_by(V1,game,rank,above_average) %>%
summarise(value = sum(value)) %>% ungroup()
rmarkdown::paged_table(head(game_sentiment,10))
Exploratory Data Analysis
For a bit of fun, we create a word cloud of the most frequent words with positive and negative sentiment using the reshape2 and wordcloud packages.
# wordcloud from the review with above average score
game_token %>% filter(above_average == "Yes") %>% count(word,sort = T) %>%
inner_join(get_sentiments("bing")) %>%
filter(is.na(sentiment)==F) %>%
acast(word~sentiment, value.var = "n",fill = 0) %>%
comparison.cloud(colors = c("red","blue"),max.words = 100,title.bg.colors = "white")
# Top 10 words for each sentiment
game_token %>% filter(above_average == "Yes") %>% count(word,sort = T) %>%
inner_join(get_sentiments("bing")) %>%
filter(is.na(sentiment)==F) %>%
group_by(sentiment) %>% top_n(10,n) %>% ungroup() %>%
mutate(word = factor(word,rev(word))) %>%
ggplot()+
geom_col(aes(word,n,fill=sentiment))+
coord_flip()+
labs(title = "Top Words For Each Sentiment For Reviews With Above Average Score",
y = "Frequency")+
theme(axis.title.y = element_blank(),
panel.grid = element_blank(),
panel.background = element_blank(),
axis.text = element_text(colour = "black"))
# wordcloud from the review with below average score
game_token %>% filter(above_average == "No") %>% count(word,sort = T) %>%
inner_join(get_sentiments("bing")) %>%
filter(is.na(sentiment)==F) %>%
acast(word~sentiment, value.var = "n",fill = 0) %>%
comparison.cloud(colors = c("red","blue"),max.words = 100,title.bg.colors = "white")
# Top 10 words for each sentiment
game_token %>% filter(above_average == "No") %>% count(word,sort = T) %>%
inner_join(get_sentiments("bing")) %>%
filter(is.na(sentiment)==F) %>%
group_by(sentiment) %>% top_n(10,n) %>% ungroup() %>%
mutate(word = factor(word,rev(word))) %>%
ggplot()+
geom_col(aes(word,n,fill=sentiment))+
coord_flip()+
labs(title = "Top Words For Each Sentiment For Reviews With Below Average Score",
y = "Frequency")+
theme(axis.title.y = element_blank(),
panel.grid = element_blank(),
panel.background = element_blank(),
axis.text = element_text(colour = "black"))
We are curious whether there is a pattern between the rank of a game and the proportion of people who give it an above-average score.
num_x <- game_review %>% group_by(game,rank,above_average) %>%
summarise(total = n())
xo <- game_review %>% group_by(game,rank) %>% summarise(n= n())
num_x <- num_x %>% left_join(xo) %>% mutate(prop = total/n)
p <- num_x %>% ggplot(aes(rank,prop))+
geom_col(aes(fill=above_average),position = "stack")+
theme(panel.background = element_blank(),
panel.grid = element_blank())+
labs(title = "Proportion of above average vs below average score",
x = "game rank", y = "proportion")
ggplotly(p)
Some games were dropped because they have only one review, so we can't decide whether the score given is above or below average. There is no clear pattern between the rank of a game and the proportion of people who give an above-average score: a higher rank doesn't guarantee more above-average scores. Therefore, we will not use the game rank as a feature to classify the user score.
Next, we want to see whether there is a difference in sentiment value between reviews with above-average and below-average scores.
median_score <- game_sentiment %>%
group_by(above_average) %>%
summarise(median = median(value),
mean = mean(value),
value = 200)
game_sentiment %>%
ggplot(aes(above_average,value))+
geom_boxplot()+
labs(x = "Above Average", y = "Sentiment Value",
title = "Sentiment Value Distribution")+
geom_text(aes(label = paste("median :",median)),data=median_score,
hjust="left",color="red3",nudge_x = -0.5)+
geom_text(aes(label = paste("mean :",round(mean,2))),data=median_score,
nudge_y = -30,hjust="left",color="red3",nudge_x = -0.5)+
theme(panel.grid = element_blank(),
panel.background = element_blank(),
axis.text = element_text(colour = "black"))
As we can see in the figure above, there is a difference between above-average and below-average reviews. Reviews with a user score above average tend to have a higher sentiment value, and vice versa. We will use the sentiment value as the feature to classify the user score.
Modeling
Holdout : Train-Test split
Before we train our models, we first check whether there is a class imbalance between the above-average and below-average scores. Class imbalance can affect our model's ability to classify the output.
game_sentiment %>%
group_by(above_average) %>%
summarise(total = n()) %>%
mutate(proportion = (total/sum(total))*100)
# A tibble: 2 x 3
above_average total proportion
<chr> <int> <dbl>
1 No 9522 31.7
2 Yes 20528 68.3
The ratio of class 1 (above average, "Yes") to class 2 (below average, "No") is close to 2:1, so the imbalance is not too severe and we can proceed to the next step.
We need to convert the class label into a factor.
We must also split the dataset into train and test sets. This is done to check whether the model is capable of classifying new data it has not seen. The data will be split with a ratio of 80/20 (80% of the data will be used to train, 20% to test), as in the sketch below.
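A minimal sketch of both steps, mirroring the split used for the improved models later in this post (the seed was set at the top):
# convert the label into a factor ("No" first, "Yes" second)
game_sentiment$above_average <- factor(game_sentiment$above_average, c("No","Yes"))
# 80/20 train-test split by random sampling
samplesize <- round(0.8 * nrow(game_sentiment))
index <- sample(seq_len(nrow(game_sentiment)), size = samplesize)
data_train <- game_sentiment[index, ]
data_test <- game_sentiment[-index, ]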
Logistic Regression
We train the logistic regression model. This is the equation of a logistic regression model: \[ \log \left( \frac{p(X)}{1-p(X)} \right) = \beta_0 + \beta_1 X \] The left-hand side is called the log-odds or logit. On the right-hand side, \(\beta_0\) is the model intercept and \(\beta_1\) is the coefficient of the feature \(X\).
logit_mod <- glm(above_average~value,data = data_train,family = binomial("logit"))
summary(logit_mod)
Call:
glm(formula = above_average ~ value, family = binomial("logit"),
data = data_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.8781 -1.2588 0.7414 0.8985 2.2815
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.359114 0.017699 20.29 <2e-16 ***
value 0.056563 0.001649 34.31 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 30015 on 24039 degrees of freedom
Residual deviance: 28581 on 24038 degrees of freedom
AIC: 28585
Number of Fisher Scoring iterations: 4
The output shows the summary of our logistic regression model. The coefficient for the intercept is 0.36 and for the sentiment value 0.06. Both coefficients have small p-values (< 0.05), indicating that they are significant and should not be removed from the equation. The intercept indicates that if the sentiment value is 0, the probability of the outcome being "Yes" is 0.5888 (James et al., 2013). The coefficient for the sentiment value shows that at a sentiment value of 1, the probability of the outcome being "Yes" rises to 0.6024. The standard error gives a measure of uncertainty for each coefficient estimate; an interval of roughly one standard error on either side of the estimate is: \[estimate - Std. Error < \beta < estimate + Std. Error\]
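Those probabilities can be recovered by passing the coefficients through the logistic function; a small check using the estimates printed above:
# P(above average = "Yes") when the sentiment value is 0
plogis(0.359114)                # approximately 0.5888
# P(above average = "Yes") when the sentiment value is 1
plogis(0.359114 + 0.056563)     # approximately 0.6024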
K-NN
K-NN classify the outcome by looking at the nearest “neighbour”. In other words, K-NN is looking at what is the class of data-point(s) with the least/shortest distance to the data we want to classify. The distance is measured with Euclidean Distance
, which can penalizes neighbour with greater distance.
\[d(x,y) = \sqrt{\sum {(x_i - y_i)}^2}\]
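Since we only have one feature (the sentiment value), the distance between two reviews reduces to the absolute difference of their values; a tiny illustration with made-up numbers:
# Euclidean distance between two hypothetical sentiment values, 12 and -3
sqrt(sum((12 - (-3))^2))   # = 15, the same as abs(12 - (-3))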
We can customize the number of neighbours (K) used to classify our data. First, we try a common rule of thumb for the optimum K: the square root of the number of observations in the training set.
#separate target variable
train_x <- as.matrix(data_train$value)
test_x <- as.matrix(data_test$value)
train_y <- as.factor(data_train$above_average)
# Predict the test classes with K = the square root of the number of training observations
neighbour <- round(sqrt(nrow(data_train)),0)
knn_pred <- class::knn(train = train_x,test = test_x,cl = train_y,k = neighbour)
Error in class::knn(train = train_x, test = test_x, cl = train_y, k = neighbour): too many ties in knn
We encountered a problem here, so let's try other values of K.
Error in class::knn(train = train_x, test = test_x, cl = train_y, k = 3): too many ties in knn
Error in class::knn(train = train_x, test = test_x, cl = train_y, k = 5): too many ties in knn
We encountered the same problem multiple times. The errors show that there are too many ties in K-NN, meaning two or more points are equidistant from an unclassified observation, making it difficult to choose which neighbours to include (Pylypiw, 2017). Moreover, we only use one feature to classify the target, and the sentiment value has a limited range, from -375 to 234. Since we have more than 20,000 instances in our training data, it's almost certain that many observations share the same sentiment value.
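A quick check illustrates this (a small sketch using the train_x matrix defined above):
# proportion of training points whose sentiment value also occurs elsewhere in the training set
mean(duplicated(train_x) | duplicated(train_x, fromLast = TRUE))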
One solution is to break the tie by randomly selecting the class. We will use a different function for this: knn1() instead of knn() from the class package. Another alternative is the knn() function from the neighbr package, but I would not recommend it since it runs slowly.
knn_1 <- class::knn1(train = train_x, test = test_x, cl = train_y)
rmarkdown::paged_table(head(as.data.frame(knn_1), 10))
We've successfully classified the test data with 1 nearest neighbour.
Evaluation
Evaluation of the models will be done with a confusion matrix. A confusion matrix is a table that shows four different categories: True Positive, True Negative, False Positive, and False Negative.
|            | Predicted_Yes  | Predicted_No   |
|------------|----------------|----------------|
| Actual_Yes | True Positive  | False Negative |
| Actual_No  | False Positive | True Negative  |
The performance metrics will be Accuracy, Sensitivity/Recall, Specificity, and Precision (Saito and Rehmsmeier, 2015). Accuracy measures how much of our data is correctly predicted. Sensitivity measures how many of all positive outcomes are correctly predicted. Specificity measures how many negative outcomes are correctly predicted. Precision measures how many of our positive predictions are correct.
\[ Accuracy = \frac {TP+TN}{TP+TN+FP+FN}\] \[ Sensitivity = \frac {TP}{TP+FN}\] \[ Specificity = \frac {TN}{TN+FP}\] \[ Precision = \frac {TP}{TP+FP}\]
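As an illustration, the four metrics can be computed directly from the four cells of the table; a short sketch with hypothetical counts:
# hypothetical confusion-matrix counts, for illustration only
TP <- 40; TN <- 30; FP <- 20; FN <- 10
c(accuracy    = (TP + TN) / (TP + TN + FP + FN),
  sensitivity = TP / (TP + FN),
  specificity = TN / (TN + FP),
  precision   = TP / (TP + FP))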
Logistic Regression
Logistic regression returns a value within the range [0, 1] rather than a binary class. The value is an estimate of the probability that the data belongs to the positive class (Yes, or Above Average).
#predict
pred_logit <- predict(logit_mod,newdata = data_test,type = "response")
rmarkdown::paged_table(head(as.data.frame(pred_logit),10))
We will convert the probability into a class using a threshold value: any value above the threshold is classified as the positive class. By default, the threshold is 0.5.
# determine the class based on the threshold 0.5
pred_class <- as.factor(if_else(pred_logit > 0.5, "Yes", "No"))
# confusion matrix
perf_logit1 <- confusionMatrix(data = pred_class, reference = data_test$above_average,
positive = "Yes")
perf_logit1
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 181 48
Yes 1730 4051
Accuracy : 0.7042
95% CI : (0.6924, 0.7157)
No Information Rate : 0.682
P-Value [Acc > NIR] : 0.0001112
Kappa : 0.1085
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.98829
Specificity : 0.09471
Pos Pred Value : 0.70074
Neg Pred Value : 0.79039
Prevalence : 0.68203
Detection Rate : 0.67404
Detection Prevalence : 0.96190
Balanced Accuracy : 0.54150
'Positive' Class : Yes
The result shows that our logistic regression model has an accuracy of 70.42% on the test dataset, meaning that 70.42% of the data is correctly classified. The sensitivity and specificity are 98.83% and 9.47%, respectively, indicating that most positive outcomes are correctly classified but only a small number of negative outcomes are. The precision (positive predictive value) is 70.07%, meaning that 70.07% of our positive predictions are correct.
K-NN
perf_knn1 <- confusionMatrix(data = knn_1, reference = data_test$above_average,
positive = "Yes")
perf_knn1
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 468 229
Yes 1443 3870
Accuracy : 0.7218
95% CI : (0.7103, 0.7331)
No Information Rate : 0.682
P-Value [Acc > NIR] : 1.161e-11
Kappa : 0.2276
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9441
Specificity : 0.2449
Pos Pred Value : 0.7284
Neg Pred Value : 0.6714
Prevalence : 0.6820
Detection Rate : 0.6439
Detection Prevalence : 0.8840
Balanced Accuracy : 0.5945
'Positive' Class : Yes
The result shows that our K-NN with K = 1 has an accuracy of 72.18% on the test dataset, meaning that 72.18% of the data is correctly classified. The sensitivity and specificity are 94.41% and 24.49%, respectively, indicating that most positive outcomes are correctly classified but only a small number of negative outcomes are. The precision (positive predictive value) is 72.84%, meaning that 72.84% of our positive predictions are correct.
Model Improvement
We want to improve the performance of our models. One way to do that is by making the classes more balanced. First, we can use oversampling, where we simply increase the number of the under-represented class by replicating existing data. Second, we can use undersampling, where we decrease the number of the over-represented class to match the under-represented one. Lastly, we can create artificial samples using a technique called SMOTE (Synthetic Minority Over-sampling Technique). We will use oversampling, which is very simple and does not sacrifice information, because we don't have to remove data from the over-represented class.
balance_data <- ovun.sample(above_average~.,data = game_sentiment,method = "over",
N = 20528*2,seed = 1)$data
balance_data %>%
group_by(above_average) %>%
summarise(total = n()) %>%
mutate(proportion = (total/sum(total))*100)
# A tibble: 2 x 3
above_average total proportion
<fct> <int> <dbl>
1 Yes 20528 50
2 No 20528 50
samplesize <- 0.8*nrow(balance_data)
index <- sample(seq_len(nrow(balance_data)),size = samplesize)
balance_data$above_average <- factor(balance_data$above_average,c("No","Yes"))
data_train <- balance_data[index,]
data_test <- balance_data[-index,]
Logistic Regression
logit_mod <- glm(above_average~value,data = data_train,family = binomial("logit"))
summary(logit_mod)
Call:
glm(formula = above_average ~ value, family = binomial("logit"),
data = data_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.8526 -1.2603 0.7437 0.8994 2.2682
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.360531 0.017841 20.21 <2e-16 ***
value 0.055948 0.001655 33.80 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 29983 on 24039 degrees of freedom
Residual deviance: 28592 on 24038 degrees of freedom
AIC: 28596
Number of Fisher Scoring iterations: 4
# predict
pred_logit <- predict(logit_mod, newdata = data_test, type = "response")
# determine the class based on the threshold 0.5
pred_class2 <- as.factor(if_else(pred_logit > 0.5, "Yes", "No"))
# confusion matrix
perf_logit2 <- confusionMatrix(data = pred_class2, reference = data_test$above_average,
positive = "Yes")
perf_logit2
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 424 45
Yes 3724 4019
Accuracy : 0.541
95% CI : (0.5302, 0.5519)
No Information Rate : 0.5051
P-Value [Acc > NIR] : 3.903e-11
Kappa : 0.0903
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9889
Specificity : 0.1022
Pos Pred Value : 0.5190
Neg Pred Value : 0.9041
Prevalence : 0.4949
Detection Rate : 0.4894
Detection Prevalence : 0.9429
Balanced Accuracy : 0.5456
'Positive' Class : Yes
K-NN
# separate target variable
train_x <- as.matrix(data_train$value)
test_x <- as.matrix(data_test$value)
train_y <- data_train$above_average
knn_2 <- knn1(train = train_x, test = test_x, cl = train_y)
perf_knn2 <- confusionMatrix(data = knn_2, reference = data_test$above_average,
positive = "Yes")
perf_knn2
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 2014 891
Yes 2134 3173
Accuracy : 0.6316
95% CI : (0.6211, 0.6421)
No Information Rate : 0.5051
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.2655
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.7808
Specificity : 0.4855
Pos Pred Value : 0.5979
Neg Pred Value : 0.6933
Prevalence : 0.4949
Detection Rate : 0.3864
Detection Prevalence : 0.6462
Balanced Accuracy : 0.6331
'Positive' Class : Yes
ROC Curve
To see whether any of the models is better than a random guess, we can use the ROC curve, which shows the true positive rate against the false positive rate.
tpr <- c(perf_logit1$byClass[1], perf_logit2$byClass[1], perf_knn1$byClass[1],
perf_knn2$byClass[1])
fpr <- c(1 - perf_logit1$byClass[2], 1 - perf_logit2$byClass[2], 1 - perf_knn1$byClass[2],
1 - perf_knn2$byClass[2])
perf <- data.frame(tpr = tpr, fpr = fpr, model = c("Logit 1", "Logit 2", "K-NN 1",
"K-NN 2"))
perf %>% ggplot() +
geom_point(aes(fpr, tpr), size = 4, alpha = 0.7, color = "skyblue3") +
geom_abline(aes(intercept = 0, slope = 1)) +
geom_text_repel(aes(fpr, tpr, label = model), box.padding = 1) +
geom_text(aes(1, 0.25, label = "Logit 1 : Initial Logistic Regression Model"), hjust = "right") +
geom_text(aes(1, 0.2, label = "Logit 2 : Improved Logistic Regression Model"), hjust = "right") +
geom_text(aes(1, 0.15, label = "K-NN 1 : Initial K-NN"), hjust = "right") +
geom_text(aes(1, 0.1, label = "K-NN 2 : Improved K-NN"), hjust = "right") +
scale_x_continuous(limits = c(0, 1)) +
scale_y_continuous(limits = c(0, 1)) +
theme(panel.background = element_blank(), panel.grid = element_blank()) +
labs(title = "ROC Curve", x = "False Positive Rate", y = "True Positive Rate")
The straight line represents the combinations of true positive rate and false positive rate we would get by classifying the target with random guessing. In other words, if a model sits below this line, we can safely conclude that it is no better than randomly guessing the classes. The ideal position is the point (0, 1), where the sensitivity is 1 and the false positive rate is 0; the closer we are to this point, the better the model.
Now we look at our models. We've made four models (two initial models and two improved by balancing the data). Both logistic regression models have a really high false positive rate, meaning they are almost unable to classify the negative class (No, or below average). The initial K-NN model (K-NN 1) has a high sensitivity but also a high false positive rate. Compared to K-NN 2, K-NN 1 performs worse at correctly predicting the negative class; however, K-NN 2 has a lower sensitivity than K-NN 1.
Conclusion
Both models perform worse in accuracy, sensitivity, and precision, but their specificity increases after oversampling.
There is no significant difference between logistic regression and K-NN in terms of accuracy. The logistic regression model classifies data as the positive outcome (above average) more often than K-NN; as a result, its sensitivity is higher and its specificity lower, and it performs worse at predicting true negative outcomes. The precision of K-NN is a bit higher than that of logistic regression. Overall, K-NN is better than logistic regression.
Depending on what we want to achieve, we can choose one of the K-NN models. Accuracy may not be the best metric in this case. If we want to maximize both the number of correct positive and correct negative outcomes, we should choose the improved K-NN model. However, if all that matters to us is the positive classification, for example if we spend money to target those who are predicted as positive and want that investment to be worthwhile, we should prioritize the model with the higher precision value. Thus, we may be better off choosing the initial K-NN model.
Reference
- Gough, Christina. 2019. Number of games released on Steam worldwide from 2004 to 2018 . Accessed on September 18, 2019 via https://www.statista.com/statistics/552623/number-games-released-steam/.
- James, Gareth, Witten, Daniela, Hastie, Trevor, and Tibshirani, Robert. 2013. An Introduction to Statistical Learning: with Applications in R . New York: Springer.
- Nielsen, Finn Årup. 2011. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages, 93-98. https://arxiv.org/abs/1103.2903
- Pylypiw, Nicholas. 2017. Breaking Ties in K-NN Classification . Accessed on September 18, 2019 via https://www.linkedin.com/pulse/breaking-ties-k-nn-classification-nicholas-pylypiw.
- Saito, Takaya and Marc Rehmsmeier. 2015. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE, 10(3): e0118432. DOI: https://doi.org/10.1371/journal.pone.0118432