User Score Classification with Sentiment Analysis: Logistic Regression and kNN methods

Intro

This documentation is an adaptation from the Python Notebook by Dyah Nurlita to conduct a sentiment analysis from woman’s clothing review data set that is downloaded from Kaggle repository. The data set contains 23486 observations and 10 features of variables which would be described as follow:

Clothing ID: Integer Categorical variable that refers to the specific piece being reviewed.
Age: Positive Integer variable of the reviewers age.
Title: String variable for the title of the review.
Review Text: String variable for the review body.
Rating: Positive Ordinal Integer variable for the product score granted by the customer from 1 Worst, to 5 Best.
Recommended IND: Binary variable stating where the customer recommends the product where 1 is recommended, 0 is not recommended.
Positive Feedback Count: Positive Integer documenting the number of other customers who found this review positive.
Division Name: Categorical name of the product high level division.
Department Name: Categorical name of the product department name.
Class Name: Categorical name of the product class name.

Within this case study, the target variable would be the recommended_ind which is a boolean variable that describes whether the item is recommended by the user. Moreover, the predictor would be the value of sentiments of the review text that would be acquired from the text mining process.

To conduct the sentiment analysis, a text mining approach would be used and the supervised models that would be used and compared are Logistic Regression and k-NN methods. To evaluate the models, a confusion matrix would be used by identifying the Accuracy, Sensitivity, and Precision Matrix of the models. Likewise, ROC Curve would be presented to identify whether any of the models are still better than a random guess.

Data Preparation

To initiate, the required packages would be loaded. The packages that are loaded are as below:

library(dplyr)
library(readr)
library(class)
library(caret)
library(stringr)
library(ggplot2)
library(tidytext)
library(tm)
library(textclean)
library(SnowballC)
library(ROSE)
library(reshape2)
library(wordcloud)
library(ggrepel)

The next step is to import the data from the local directory. To ease the labeling, the column names would be changed as follow:

clothes_review <- read_csv("data_input/reviews.csv") 
names(clothes_review) <- c("id", "clothing_id", "age", "title", "review_text","rating", "recommended_ind", "pos_feedback_cnt", "division", "department", "class")

glimpse(clothes_review)

#> Rows: 23,486
#> Columns: 11
#> $ id               <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,~
#> $ clothing_id      <dbl> 767, 1080, 1077, 1049, 847, 1080, 858, 858, 1077, 107~
#> $ age              <dbl> 33, 34, 60, 50, 47, 49, 39, 39, 24, 34, 53, 39, 53, 4~
#> $ title            <chr> NA, NA, "Some major design flaws", "My favorite buy!"~
#> $ review_text      <chr> "Absolutely wonderful - silky and sexy and comfortabl~
#> $ rating           <dbl> 4, 5, 3, 5, 5, 2, 5, 4, 5, 5, 3, 5, 5, 5, 3, 4, 3, 5,~
#> $ recommended_ind  <dbl> 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,~
#> $ pos_feedback_cnt <dbl> 0, 4, 0, 0, 6, 4, 1, 4, 0, 0, 14, 2, 2, 0, 1, 3, 2, 0~
#> $ division         <chr> "Initmates", "General", "General", "General Petite", ~
#> $ department       <chr> "Intimate", "Dresses", "Dresses", "Bottoms", "Tops", ~
#> $ class            <chr> "Intimates", "Dresses", "Dresses", "Pants", "Blouses"~

It is seen that each of the observations leads to one specific clothing id. To investigate the number of clothings that are reviewed would be performed by the code below.

length(unique(clothes_review$clothing_id))

#> [1] 1206

The result shows that there are 1206 unique clothes that are identified in the dataset. To continue the data preparation process, the missing and duplicated values identification would be performed.

colSums(is.na(clothes_review))

#>               id      clothing_id              age            title 
#>                0                0                0             3810 
#>      review_text           rating  recommended_ind pos_feedback_cnt 
#>              845                0                0                0 
#>         division       department            class 
#>               14               14               14

As shown from the result, the title column has lots of missing values in comparison to the other variables which would then be ommitted in the next step. Moreover, the review_text column has a small proportion of missing values. This would allow the missing values in that variable to be dropped.

anyDuplicated(clothes_review)

#> [1] 0

This result would also show that no duplicated values are identified in the dataset. For the next step, the missing values from the review_text are removed. Note that this would also removed several observations and reducing the numbers of unique clothes to be observed.The reduction of unique clothes count would be approximately 20 unique clothes which is assumed to be insignificant in this case.

clothes_review <- clothes_review %>% 
  filter(review_text != is.na(review_text))

nrow(clothes_review)

#> [1] 22641

length(unique(clothes_review$clothing_id))

#> [1] 1179

sum(is.na(clothes_review$review_text))

#> [1] 0

glimpse(clothes_review)

#> Rows: 22,641
#> Columns: 11
#> $ id               <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,~
#> $ clothing_id      <dbl> 767, 1080, 1077, 1049, 847, 1080, 858, 858, 1077, 107~
#> $ age              <dbl> 33, 34, 60, 50, 47, 49, 39, 39, 24, 34, 53, 39, 53, 4~
#> $ title            <chr> NA, NA, "Some major design flaws", "My favorite buy!"~
#> $ review_text      <chr> "Absolutely wonderful - silky and sexy and comfortabl~
#> $ rating           <dbl> 4, 5, 3, 5, 5, 2, 5, 4, 5, 5, 3, 5, 5, 5, 3, 4, 3, 5,~
#> $ recommended_ind  <dbl> 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,~
#> $ pos_feedback_cnt <dbl> 0, 4, 0, 0, 6, 4, 1, 4, 0, 0, 14, 2, 2, 0, 1, 3, 2, 0~
#> $ division         <chr> "Initmates", "General", "General", "General Petite", ~
#> $ department       <chr> "Intimate", "Dresses", "Dresses", "Bottoms", "Tops", ~
#> $ class            <chr> "Intimates", "Dresses", "Dresses", "Pants", "Blouses"~

As in this document would only required to perform a sentiment analysis from review_text, only several columns would be selected including clothing_id, review_text, and recommended_ind. Likewise, the type of recommended_ind column is still in character type. Thus, it would be required to transform to a factor type and change its label for the analysis purpose.

clothes_review_subset <- clothes_review %>% 
  select(clothing_id, review_text, recommended_ind) %>% 
  mutate(recommended_ind = as.factor(ifelse(recommended_ind > 0, "yes", "no")))

glimpse(clothes_review_subset)

#> Rows: 22,641
#> Columns: 3
#> $ clothing_id     <dbl> 767, 1080, 1077, 1049, 847, 1080, 858, 858, 1077, 1077~
#> $ review_text     <chr> "Absolutely wonderful - silky and sexy and comfortable~
#> $ recommended_ind <fct> yes, yes, no, yes, yes, no, yes, yes, yes, yes, no, ye~

The data set then could be reviewed in the table below:

head(clothes_review_subset)

To go forward, the text mining process would be conducted by utilizing several libraries including tidytext, tm, and textclean. Several steps are performed in this process to acquire a clean and processed word for the sentiment analysis.

Expanding Contractions

In this step, several abbreviated words (such as “I’m”, “He’s”, etc.) would be expanded to ease the analysis. To perform this step, replace_contraction function from textclean library would be used. Note that one of the parameter of the function is contraction.key which would required the key of the contraction. In this context, the contraction key that is used is from lexicon library.

added_contractions <- data.frame(contraction = c("bc", "bcs"),
                                 expanded = c("because", "because"))

contraction_ext <- rbind(lexicon::key_contractions, added_contractions)

clothes_review_subset$review_text <- replace_contraction(clothes_review_subset$review_text, 
                                                         contraction.key = contraction_ext,
                                                         ignore.case = T)

head(clothes_review_subset)

To Lower Case

Lowering the case of each words would be useful in the text mining process due to the tokenization process that differs the use of the case of each words. For example, the words of “Algoritma” and “algoritma” would be counted as different words. This would be inefficient as the analysis would only requires to count each of unique words apart from their case. To conduct this step, tolower function would be used as follow:

clothes_review_subset$review_text <- tolower(clothes_review_subset$review_text)
head(clothes_review_subset)

Removing Punctuation and Special Characters

The removal of any special characters (!, ?, ., etc.) is required as this would be unnecessary to the analysis and would make the different counts in each words (e.g. “Go” and “Go!”). To remove the punctuation and any special characters, removePunctuation would be used from tm library.

clothes_review_subset$review_text <- removePunctuation(clothes_review_subset$review_text)

head(clothes_review_subset)

length(unique(clothes_review_subset$clothing_id))

#> [1] 1179

Replace Elongated Words

The elongated words (e.g. “soooo”) would also be counted as different words from its shortened version during the tokenization which would be necessary to transform such words in the text. To transformed them, replace_word_elongation could be used from textclean library

clothes_review_subset$review_text <- replace_word_elongation(clothes_review_subset$review_text)

head(clothes_review_subset)

Dropping Numbers

In the text mining process, dropping numbers would be favorable as numbers have no meaning in the text which would be unnecessary in the sentiment analysis. To drop the numbers from the text, removeNumbers function could be used from the tm package.

clothes_review_subset$review_text <- removeNumbers(clothes_review_subset$review_text)

head(clothes_review_subset)

length(unique(clothes_review_subset$clothing_id))

#> [1] 1179

Tokenization

The tokenization step is one of the mandatory steps in text mining. This step would divide the text data words-by-words which would allow the sentiment analysis to be performed on each word. There are several ways to perform this step in R. In this context, however, the unnest_tokens is used from tidytext library by passing down the new column word and the text column review_text into the function parameter.

clothes_review_token <- clothes_review_subset %>% 
  unnest_tokens(word, review_text)

clothes_review_token

Removing Stopwords

Stopwords are words that are common in any text data such as “I”, “He”, or “She”, that have less meaning in the sentiment analysis. These words could be removed in many ways. In this documentation, tidytext library would be used by performing an anti-join to the words that are listed in stop_words table of tidytext. Likewise, the “snowball” lexicon would be used in the joining process by filtering the table.

clothes_review_token <- clothes_review_token %>% 
  anti_join(stop_words %>% filter(lexicon == "snowball"))

clothes_review_token

Stemming

There are several words in english that have different verbs (such as “playing” and “play”) that have similar meaning but would be counted as different words. To tackle this problem, a stemming approach would be helpful to generalized the verb of the words. wordStem function from tm library could be used for this purpose. By default, the algorithm in wordStem for the Stemming is the Porter’s Stemming Algorithm as some of the words might be transformed in an incorrect form (e.g. “stories” to be “stori”).

clothes_review_token <- clothes_review_token %>% 
  mutate(word = wordStem(word))

clothes_review_token

Adding Sentiments

The final step of the text mining process is adding the sentiments for each words for the analysis. In tidytext package, several lexicons are available including afinn, bing, loughran, and nrc. Each of these lexicons provide different methods for the sentiment analysis. For example, bing lexicon contains two classes of sentiment including positive and negative. Meanwhile, the loughran lexicon provides 6 different classes. In this documentation, afinn lexicon would be used as this lexicon would give values on each words ranging from -5 to 5 where the positives indicates the positive sentiments of a word and the negatives indicates the negative sentiments of the word. the numeric value also shows the strength of the sentiments for each word. To visualize this description, the affin lexicon table is provided below:

head(get_sentiments("afinn"), 10)

After getting the lexicon, the table would then be joined with the tokenized table clothes_review_token that was prepared above. The join opperation would be performed by using inner_join to remove the words that do not have the sentiment values (or NAs). Finally, the value of each word would be accounted and summarised by sum operation.

review_sentiment <- clothes_review_token %>% 
  inner_join(get_sentiments("afinn"))

review_sentiment <- review_sentiment %>% group_by(clothing_id, recommended_ind) %>% 
  summarise(value = sum(value)) %>% ungroup()

review_sentiment

review_sentiment %>% group_by(recommended_ind) %>% 
  summarise(total = n())

Exploratory Data Analysis

In this exploratory phase, several graphs and charts are presented to visualize the holistic view of the data set to be working on. Below is the visualization of recommended_ind class proportion

clothes_review_subset %>% 
  group_by(recommended_ind) %>% 
  summarise(n = n()) %>% 
  ungroup() %>% 
  ggplot(aes(reorder(recommended_ind, -n), n, fill = recommended_ind)) +
  geom_col() +
  geom_text(aes(label = paste(round(n/sum(n) *100, 2), "%")),
            vjust = -0.2) +
  labs(title = "Proportion of Recommended Items and Non-Recommended Items",
       x = "Recommended Indicies",
       y = NULL) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.y = element_blank(),
        panel.grid = element_blank(),
        legend.position = "none")

Based on the visualization above, it is observed that the ‘yes’ label in the recommended_ind variable has a larger proportion in comparison to its counterpart. This would be problematic for modeling the machine learning as the majority of the class would affect the prediction of the model. Thus, a balanced proportion of train and test split may be required which would be explained and performed in the model tuning section.

Next, word clouds and ranks for each words would be presented below. To establish the word cloud visualization, reshape2 and wordcloud package would be used. The word cloud and word ranks would be presented by filtering the recommended_ind variable to divide the visualization for ‘yes’ class and ‘no’ class. Below are the word cloud for ‘yes’ label of the recommended_ind variable.

clothes_review_token %>% 
  filter(recommended_ind == "yes") %>% 
  count(word, sort= T) %>% 
  inner_join(get_sentiments("bing")) %>% 
  filter(is.na(sentiment) == F) %>% 
  acast(word~sentiment, value.var = "n", fill = 0) %>% 
  comparison.cloud(colors = c("firebrick1", "steelblue"), max.words = 100, title.bg.colors = "white")

clothes_review_token %>% filter(recommended_ind == "yes") %>% count(word, sort = T) %>% 
  inner_join(get_sentiments("bing")) %>% 
  filter(is.na(sentiment) == F) %>% 
  group_by(sentiment) %>%  top_n(10, n) %>% ungroup() %>% 
  mutate(word = factor(word, rev(word))) %>% 
  ggplot() +
  geom_col(aes(n, word, fill = sentiment)) +
  labs(title = "Top Words for Each Sentiments for Recommended Items",
       x = "Frequency",
       y = NULL) +
  theme_minimal() +
  theme(panel.grid.major.y = element_blank())

To compare its counterpart, below is shown the word cloud and the word ranks for no label of recommended_ind variable.

clothes_review_token %>% 
  filter(recommended_ind == "no") %>% 
  count(word, sort= T) %>% 
  inner_join(get_sentiments("bing")) %>% 
  filter(is.na(sentiment) == F) %>% 
  acast(word~sentiment, value.var = "n", fill = 0) %>% 
  comparison.cloud(colors = c("firebrick1", "steelblue"), max.words = 100, title.bg.colors = "white")

clothes_review_token %>% filter(recommended_ind == "no") %>% count(word, sort = T) %>% 
  inner_join(get_sentiments("bing")) %>% 
  filter(is.na(sentiment) == F) %>% 
  group_by(sentiment) %>%  top_n(10, n) %>% ungroup() %>% 
  mutate(word = factor(word, rev(word))) %>% 
  ggplot() +
  geom_col(aes(n, word, fill = sentiment)) +
  labs(title = "Top Words for Each Sentiments for Recommended Items",
       x = "Frequency",
       y = NULL) +
  theme_minimal() +
  theme(panel.grid.major.y = element_blank())

Modelling

Splitting Train and Test data

To split the data into train and test data set, sample function would be used. The size of the train data set would be 75% of the total observation in the data set. At the end of the code chunk, the proportions of classes within the target variable are checked to see whether the target class proportion is balanced

set.seed(123)

train_size <- 0.75*nrow(review_sentiment)

train_index <- sample(seq_len(nrow(review_sentiment)), size = train_size)

data_train <- review_sentiment[train_index,]
data_test <- review_sentiment[-train_index,]

prop.table(table(data_train$recommended_ind))

#> 
#>        no       yes 
#> 0.3094629 0.6905371

Based on the proportion table above, it is seen that the proportion of class “yes” and class “no” have approximately ratio of 2:1 and would be considered as not too high for their disproportion. By this, the next step could be proceeded which is the establishment of the logistic regression model and the k-NN model.

Logistic Regression

To briefly explain the establishment of the logistic regression model, an equation of logistic regression would be described as below.

\[log(\frac{p(X)}{1-p(X)}) = B_0+B_1.X\]

As seen from the equation, the left hand side is the log of odds of the prediction. Meanwhile, \(B_0\) is the intercept of the model and \(B_1\) is the coefficient of feature \(X\).

To establish the logistic regression model in R, glm function is used and the “binomial” input is passed down into the family parameter.

logit_model <- glm(recommended_ind ~ value, data = data_train, family = "binomial")

summary(logit_model)

#> 
#> Call:
#> glm(formula = recommended_ind ~ value, family = "binomial", data = data_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.0815  -1.4849   0.8799   0.8939   0.9018  
#> 
#> Coefficients:
#>              Estimate Std. Error z value             Pr(>|z|)    
#> (Intercept) 0.6943916  0.0678904  10.228 < 0.0000000000000002 ***
#> value       0.0015258  0.0004657   3.276              0.00105 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1451.4  on 1172  degrees of freedom
#> Residual deviance: 1429.7  on 1171  degrees of freedom
#> AIC: 1433.7
#> 
#> Number of Fisher Scoring iterations: 6

By this summary of the model, it is seen that both of the intercept and coefficients of word value has a small p-value (below 0.05) which indicates that both of the coefficients are significant to the model and should be kept into the model. The value of each coefficients shows the log of odds of each coefficient. To interpret the model clearly, a conversion to the probability of each coefficients would be performed by using the inv.logit function from gtools library.

data.frame(
  row.names = c("(Intercept)","value"),
  probability = c(gtools::inv.logit(0.6943916), gtools::inv.logit(0.6943916 + 0.0015258))
)

By the table above, the model could be interpreted as the log of odds values in the model has been transformed into their probability values respectively (James et al. , 2013). The probability of the intercept indicates that if the sentiment value is 0, then the probability of the predictions of the model to be ‘yes’ is 0.6669431. Meanwhile, the probability value of the sentiment value coefficient shows that the increase of sentiment value by 1 unit would resulted the probability of ‘yes’ occurred by 0.6672820.

k-NN

In the k-NN model, the observations are classified by their nearest “neighbor” or the other observations that are close to them by their distances between each other. The distances are measured by the Ecludiean Distance which would be calculated by the equation below.

\[d(x,y)=\sqrt{\Sigma(x_i-y_i)^2}\]

The ‘k’ itself stands for how many neighbors that should be observed within the model. the k value could be customized as there would be any certain guidelines to determine it. However, the rule of thumb to determine the ‘k’ value of the model could be calculated by finding the square-root of the total data train population (\(\sqrt{n}\)). The determination of k value in this case would be shown as below.

sqrt(nrow(data_train))

#> [1] 34.24909

The value of 34 has been attained which shows the recommended ‘k’ of the model. A ‘k’ of 33, however, would be chosen as the classes of the target variable have 2 classes which is an even number (again, this is the rule of thumb to determine the ‘k’ value).

To establish the model, the target and the predictor variables are separated before hand. In k-NN model, the scales among the numeric variables of the predictors should have the same scales as to have the same distance comparisons between each predictor values. In this case, however, the predictor variable would only be the sentiment value which makes the feature scaling unnecessary to the model.

train_x <- as.matrix(data_train$value)
train_y <- as.matrix(data_train$recommended_ind)
test_x <- as.matrix(data_test$value)

knn_pred <- knn(train = train_x, test = test_x, cl = train_y, k = 33)

knn_pred %>% as.data.frame() %>% head(10)

Evaluation

In the evaluation of the classification models, a confusion matrix would be used as it shows several different categories of the prediction including True Positives, True Negatives, False Positives, and False Negatives.

data.frame(predicted_yes = c("True Positive", "False Positive"),
           predicted_no = c("False Negative", "True Negative"),
           row.names = c("actual_yes", "actual_no"))

The metric of the performance that are used are Accuracy, Sensitivity/Recall, Precision, and Specificity (Saito and Rehmsmeier, 2015). Accuracy measures how many predictions that are correctly predicted. Separately, Sensitivity measures how many positive predictions are correctly predicted of all the positive outcome. The Precision is a metric to define how many of the positive predictions are correct. Lastly, Specificity measures the ratio of Negative predictions are correctly predicted. The equation of each metrics are described on the equation below.

\[Accuracy = \frac{TP + TN}{TP + TN +FP + FN}\] \[Sensitivity = \frac{TP}{TP+FN}\] \[Specificity = \frac{TN}{TN+FP}\] \[Precision = \frac{TP}{TP+FP}\]

Logistic Regression

To attain the probability values logistic regression predictions, predict functions is used by passing down ‘response’ string into the type parameter. The result of the predictions would be shown on the table below.

pred_logit <- predict(logit_model, newdata = data_test, type = "response")

as.data.frame(pred_logit)

By the table above, it is shown the probability value of each predictions. In a classification problem by the logistic model, the result should be converted to the class of the target variable. In this case, the ‘yes’ class would be presented as 1 binary value while the ‘no’ class would be presented as 0. At the end the code chunk, the confusion matrix would be established by using confusionMatrix function.

pred_class <- as.factor(if_else(pred_logit > 0.5, "yes", "no"))

perf_logit <- confusionMatrix(data = pred_class, reference = data_test$recommended_ind,
                              positive = "yes")

perf_logit

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  no yes
#>        no    0   0
#>        yes 130 261
#>                                              
#>                Accuracy : 0.6675             
#>                  95% CI : (0.6184, 0.7141)   
#>     No Information Rate : 0.6675             
#>     P-Value [Acc > NIR] : 0.5238             
#>                                              
#>                   Kappa : 0                  
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 1.0000             
#>             Specificity : 0.0000             
#>          Pos Pred Value : 0.6675             
#>          Neg Pred Value :    NaN             
#>              Prevalence : 0.6675             
#>          Detection Rate : 0.6675             
#>    Detection Prevalence : 1.0000             
#>       Balanced Accuracy : 0.5000             
#>                                              
#>        'Positive' Class : yes                
#>

By the result of the matrix, it is shown that the accuracy of the model on the data set is 66.75 % which means that only 66.75% of the predictions are correctly classified by the model. The Sensitivity and Specificity of 100% and 0% respectively show that all of the positive outcomes are correctly classified, while none of the negative outcomes are correctly specified. Lastly, the precision of 66.75 % (Pos Pred Value) shows that 66.75% of the positive predictions are correct.

k-NN

The confusion matrix of the k-NN model could be directly established as the result of the prediction of the k-NN are labelled by the class of the target. The confusion matrix of k-NN model is shown as below.

perf_knn <- confusionMatrix(data = knn_pred, reference = data_test$recommended_ind,
                            positive = "yes")
perf_knn

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  no yes
#>        no   10   5
#>        yes 120 256
#>                                              
#>                Accuracy : 0.6803             
#>                  95% CI : (0.6316, 0.7263)   
#>     No Information Rate : 0.6675             
#>     P-Value [Acc > NIR] : 0.3162             
#>                                              
#>                   Kappa : 0.0742             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.98084            
#>             Specificity : 0.07692            
#>          Pos Pred Value : 0.68085            
#>          Neg Pred Value : 0.66667            
#>              Prevalence : 0.66752            
#>          Detection Rate : 0.65473            
#>    Detection Prevalence : 0.96164            
#>       Balanced Accuracy : 0.52888            
#>                                              
#>        'Positive' Class : yes                
#>

By the result of the matrix, it is shown that the model is performed better than the logistic regression. The accuracy of the model is 68.03 % meaning that 68.03 % of the predictions are matched with the test data set. The value of Sensitivity and Specificity are slightly different from the logistic regression model which are respectively 98.08 % and 7.69 % respectively. Finally, the Precision of the model shows that 68.08 % of positive predictions are correctly classified.

Model Improvement

In the previous section, it is shown that both of the models has a greater sensitivity above all of the other metrics. This may be potentially occurred due to an unbalanced proportion of target variable’s classes in the train data set. There are several ways to balance the proportion of the class. The first option is to do an up-sampling to the train data set where the proportion of the minority class of the target variable is increased by the same level of the majority class. The second option is to do a down-sampling where the majority class of the target variable would be reduced. In this case, however, the down-sampling method would be unfavorable as the difference between the majority and minority class are high in the data set as seen on the visualization sections. Another option is to create an artificial sample using a technique called SMOTE (Synthetic Minority Over-sampling Technique) which is to increase the proportion of the minority class by adding the artificial observations in the minority class. To do so, an ovun.sample function is used as seen as below. Note that the N parameter is used by calculating the number of observations in the majority class times the number of class presented in the target variable.

balanced_data <- ovun.sample(recommended_ind ~ ., data = review_sentiment, 
                             method = "over", N = 1071*2, seed = 123)$data

balanced_data %>% 
  group_by(recommended_ind) %>% 
  summarise(total = n()) %>% 
  mutate(proportion = (total/sum(total)*100))

set.seed(123)

train_size2 <- 0.75*nrow(balanced_data)

train_index2 <- sample(seq_len(nrow(balanced_data)), size = train_size2)

balanced_data$recommended_ind <- factor(balanced_data$recommended_ind, c("no","yes"))

data_train2 <- balanced_data[train_index2,]
data_test2 <- balanced_data[-train_index2,]

table(data_test2$recommended_ind)

#> 
#>  no yes 
#> 285 251

table(data_train2$recommended_ind)

#> 
#>  no yes 
#> 786 820

Improvement for Logistic Regression Model

logit_model2 <- glm(recommended_ind ~ value, data = data_train2, family = "binomial")

summary(logit_model2)

#> 
#> Call:
#> glm(formula = recommended_ind ~ value, family = "binomial", data = data_train2)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.8191  -1.1507   0.4379   1.2000   1.2132  
#> 
#> Coefficients:
#>               Estimate Std. Error z value   Pr(>|z|)    
#> (Intercept) -0.0786575  0.0542599  -1.450      0.147    
#> value        0.0017187  0.0003578   4.804 0.00000156 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 2225.7  on 1605  degrees of freedom
#> Residual deviance: 2184.3  on 1604  degrees of freedom
#> AIC: 2188.3
#> 
#> Number of Fisher Scoring iterations: 5

pred_logit2 <- predict(logit_model2, newdata = data_test2, type = "response")

pred_class2 <- as.factor(if_else(pred_logit2 > 0.5, "yes", "no"))

perf_logit2 <- confusionMatrix(data = pred_class2, reference = data_test2$recommended_ind,
                               positive = "yes")

perf_logit2

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  no yes
#>        no  222 185
#>        yes  63  66
#>                                              
#>                Accuracy : 0.5373             
#>                  95% CI : (0.4941, 0.5802)   
#>     No Information Rate : 0.5317             
#>     P-Value [Acc > NIR] : 0.4147             
#>                                              
#>                   Kappa : 0.0431             
#>                                              
#>  Mcnemar's Test P-Value : 0.00000000000001548
#>                                              
#>             Sensitivity : 0.2629             
#>             Specificity : 0.7789             
#>          Pos Pred Value : 0.5116             
#>          Neg Pred Value : 0.5455             
#>              Prevalence : 0.4683             
#>          Detection Rate : 0.1231             
#>    Detection Prevalence : 0.2407             
#>       Balanced Accuracy : 0.5209             
#>                                              
#>        'Positive' Class : yes                
#>

As seen on the matrix above, it is shown that the accuracy of the improved logistic regression model is reduced to 53.73 % by the changes in the class proportion of the target variable. However, the specificity of the model has a significant increase from 0 % to 77.89%. This shows that the balanced proportion of the class affects the Negative predictions are correctly predicted.

k-NN Improvement

train_x2 <- as.matrix(data_train2$value)
train_y2 <- as.matrix(data_train2$recommended_ind)
test_x2 <- as.matrix(data_test2$value)

knn_pred2 <- knn(train = train_x2, test = test_x2, cl = train_y2, k = 33)

perf_knn2 <- confusionMatrix(data = knn_pred2, reference = data_test2$recommended_ind,
                             positive = "yes")

perf_knn2

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  no yes
#>        no  200 108
#>        yes  85 143
#>                                           
#>                Accuracy : 0.6399          
#>                  95% CI : (0.5977, 0.6806)
#>     No Information Rate : 0.5317          
#>     P-Value [Acc > NIR] : 0.0000002603    
#>                                           
#>                   Kappa : 0.273           
#>                                           
#>  Mcnemar's Test P-Value : 0.1133          
#>                                           
#>             Sensitivity : 0.5697          
#>             Specificity : 0.7018          
#>          Pos Pred Value : 0.6272          
#>          Neg Pred Value : 0.6494          
#>              Prevalence : 0.4683          
#>          Detection Rate : 0.2668          
#>    Detection Prevalence : 0.4254          
#>       Balanced Accuracy : 0.6357          
#>                                           
#>        'Positive' Class : yes             
#>

Similar to the improvement of the logistic regression model, the balanced proportion decreases the accuracy of the models and increases the specificity of the model to 70.18 %.

ROC Curve

ROC Curve could also be used to see the performance of the classification models. In the ROC curve, the x-axis and y-axis are the False Positive Rate and True Positive Rate respectively. The ROC curve is presented as below.

tpr <- c(perf_knn$byClass[1], perf_knn2$byClass[1], 
         perf_logit$byClass[1], perf_logit2$byClass[1])
fpr <- c(1 - perf_knn$byClass[2], 1 - perf_knn2$byClass[2], 
         1 - perf_logit$byClass[2], 1 - perf_logit2$byClass[2])
perf <- data.frame(tpr = tpr, fpr = fpr, model = c("kNN - 1", "kNN - 2", "Logit - 1", "Logit - 2"))

perf %>% ggplot() +
  geom_point(aes(fpr, tpr), size = 3, alpha = 0.8, color = "steelblue") +
  geom_abline(aes(intercept = 0, slope = 1)) +
  geom_text_repel(aes(fpr, tpr, label = model), box.padding = 1) +
  geom_text(aes(1, 0.25, label = "Logit - 1: Initial Logistic Regression Model"), hjust = "right") +
  geom_text(aes(1, 0.20, label = "Logit - 2: Improved Logistic Regression Model"), hjust = "right") +
  geom_text(aes(1, 0.15, label = "kNN - 1: Initial kNN"), hjust = "right") +
  geom_text(aes(1, 0.10, label = "kNN - 2: Improved kNN"), hjust = "right") +
  scale_x_continuous(limits = c(0,1)) + scale_y_continuous(limits = c(0,1)) +
  labs(title = "ROC Curve", 
       x = "False Positive Rate",
       y = "True Positive Rate") +
  theme(panel.background = element_blank(), panel.grid = element_blank(),
        plot.title = element_text(hjust = 0.5))

To interpret the chart, the straight line within the chart indicates whether the performance of each models are worse or better than any random guess. If any of the models fall below the straight line, it is shown that the models are performed worse than a random guess. Otherwise, the models would be defined as a great model in comparison to any random guess. The ideal and position within the chart would be [0,1] which has True Positive Rate of 1 and False Positive Rate of 0.

Based on the chart above, the 4 models that has been established shows different positions within the chart. The initial Logistic Regression Model falls on the straight line which means that the predictions of the model may be similar to a random guess. Separately, the improved logistic regression models shows a slightly better performance than any random guess. However, both of the True Positive and False Positive rates are lower in this case which indicates that the improved model has a low sensitivity but perform well in predicting the negative class (or ‘no’ class).

In the case of the k-NN models, the initial model has a great sensitivity (having a great True Positive rate) despite having a lower performance on predicting the negative class. Meanwhile, the improved model has a greater specificity but lower sensitivity than the initial one.

Conclusion

In conclusion, both of the Logistic Regression models and the k-NN models performed worse in Accuracy and the Precision metrics but having a great performance in the Sensitivity metric. The overall performances show that the k-NN models perform better than the Logistic Regression Models despite the differences of both of the models are insignificant. After the balance of the class proportion has been sorted, both of the models shows an immense increase on the Specificity metric

The selection of the models for this case depends on the case of the business. For selecting the model, Accuracy metric may not be the best metric to use as it would only show the general accuracy of the model. If the goals is to maximize the correct predictions of as a whole, it would be better to use the initial k-NN model. Likewise, if the priority is the positive classification value, such as the focus of investment for the buyers who recommended the items in the platform, the model that is used should be the model with a higher precision which is also the initial k-NN model.

References

Agarap, AF. 2018. Statistical Analysis on E-Commerce Reviews, with Sentiment Classification using Bidirectional Recurrent Neural Network. via https://www.researchgate.net/publication/323545316_Statistical_Analysis_on_E-Commerce_Reviews_with_Sentiment_Classification_using_Bidirectional_Recurrent_Neural_Network
Brownlee, Jason. “SMOTE for Imbalanced Classification with Python”. Accessed September 5, 2021, via https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
Harrison, Onel. “Machine Learning Basics with the K-Nearest Neighbors Algorithm”. Accessed September 6, 2021, via https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761
James, Gareth, Witten, Daniela, Hastie, Trevor, and Tibshirani, Robert. 2013. An Introduction to Statistical Learning: with Applications in R . New York: Springer.
Maltenfort, Mitchell. Answering question in “Why does the ROC curve plot for sensitivity by (1-specificity)?”. Accessed September 9, 2021, via https://www.researchgate.net/post/Why-does-the-ROC-curve-plot-for-sensitivity-by-1-specificity
Narkhede, Sarang. “Understanding AUC - ROC Curve”. Accessed September 6, 2021, via https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
Narkhede, Sarang. “Understanding Confusion Matrix”. Accessed September 6, 2021, via https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62
Nielsen,Finn Årup. 2011.A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’: Big things come in small packages , 93-98. https://arxiv.org/abs/1103.2903
Saito, Takaya and Marc Rehmsmeier. 2015. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One , 10(3):e011843. DOI: https://doi.org/10.1371/journal.pone.0118432
Silipo, Rosaria. “Confusion Matrix and Class Statistics”. Accessed September 6, 2021, via https://towardsdatascience.com/confusion-matrix-and-class-statistics-68b79f4f510b
Swaminathan, Saishruthi. “Logistic Regression — Detailed Overview”. Accessed September 6, 2021, via https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc

User Score Classification with Sentiment Analysis: Logistic Regression and kNN methods

Aditya Adhrevi

September 09, 2021

Intro

Data Preparation

Expanding Contractions

To Lower Case

Removing Punctuation and Special Characters

Replace Elongated Words

Dropping Numbers

Tokenization

Removing Stopwords

Stemming

Adding Sentiments

Exploratory Data Analysis

Modelling

Splitting Train and Test data

Logistic Regression

k-NN

Evaluation

Logistic Regression

k-NN

Model Improvement

Improvement for Logistic Regression Model

k-NN Improvement

ROC Curve

Conclusion

References