Amazon Alexa Reviews - Sentiment Analysis

Introduction

Objective: Performing sentiment analysis on the Amazon Alexa reviews dataset

Source: Kaggle

About Data:

This dataset consists of 3,150 Amazon customer reviews (input text), star ratings, review dates, product variants, and feedback for various Amazon Alexa products such as the Echo, Echo Dot, and Fire TV Stick.

Loading Data

Let us take a look at the data:

library(data.table)   # fread()
library(tidyverse)    # dplyr, tidyr, ggplot2
library(tidytext)     # unnest_tokens(), get_sentiments()
library(wordcloud)    # wordcloud()
library(DT)           # datatable()

alexa <- as.data.frame(fread("amazon_alexa.tsv"))
alexa <- as_tibble(alexa)   # as.tibble() is deprecated


x <- head(alexa, n = 10)
datatable(x, caption = "Table: Data")

The dataset has reviews from 3,150 Alexa users.

The different Alexa product variants that are reviewed:

alexa %>% 
  group_by(variation) %>% 
  count() %>% 
  arrange(desc(n))
## # A tibble: 16 x 2
## # Groups:   variation [16]
##    variation                        n
##    <chr>                        <int>
##  1 Black  Dot                     516
##  2 Charcoal Fabric                430
##  3 Configuration: Fire TV Stick   350
##  4 Black  Plus                    270
##  5 Black  Show                    265
##  6 Black                          261
##  7 Black  Spot                    241
##  8 White  Dot                     184
##  9 Heather Gray Fabric            157
## 10 White  Spot                    109
## 11 White                           91
## 12 Sandstone Fabric                90
## 13 White  Show                     85
## 14 White  Plus                     78
## 15 Oak Finish                      14
## 16 Walnut Finish                    9



Let us take a look at the ratings:

ggplot(alexa , aes(rating)) + 
  geom_bar()

Clearly, we can see that most of the reviews are positive; very few are negative (rating <= 2). We can quantify that, and then look at some of the negative reviews:
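A one-line sanity check of the share of negative reviews (a sketch; pct_negative is an illustrative name):

# proportion of reviews rated 2 or lower
alexa %>% summarise(pct_negative = mean(rating <= 2))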



alexa_reviews <- alexa %>% 
  filter(rating <= 2) %>%   # ratings 1 and 2
  select(verified_reviews)

alexa_head <- head(alexa_reviews, n = 10)
datatable(alexa_head, caption = "Reviews with rating <=2")



Let us take a look at the word cloud:
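A minimal sketch of how such a plain word cloud can be built, assuming standard stop-word removal with tidytext's stop_words list (cloud_words is an illustrative name):

# tokenize all reviews, drop common stop words, and count word frequencies
cloud_words <- alexa %>%
  unnest_tokens(word, verified_reviews) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

wordcloud(cloud_words$word, cloud_words$n, max.words = 100, random.order = FALSE)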

We observe words like echo, amazon, product, and music most frequently. A few not-so-good words also appear. Let's remove the obvious words and look deeper, coloring positive words in green and negative words in red:

# one row per word occurrence
check <- alexa %>% unnest_tokens(word, verified_reviews)

# attach each word's overall frequency
check1 <- check %>% 
  group_by(word) %>% 
  mutate(freq = n()) %>% 
  select(rating, variation, feedback, word, freq) 

# keep only words carrying a bing sentiment (this drops the neutral, obvious words)
checked <- check1 %>% inner_join(get_sentiments("bing"), by = "word")

# count each word-sentiment pair and assign a color by polarity
what <- checked %>%
  count(word, sentiment) %>%
  mutate(color = ifelse(sentiment == "positive", "darkgreen", "red"))

wordcloud(what$word, what$n, random.order = FALSE, colors = what$color, ordered.colors = TRUE)

We observe that positive words like love and great dominate. However, there are also negative ones like disappointing, frustrating, and disabled. Let us look closely at only the negative words:



below_rated <- checked %>%
  filter(rating <= 2) %>% 
  count(word, rating, sentiment) %>% 
  filter(sentiment == 'negative')

wordcloud(below_rated$word, below_rated$n, max.words = 100, random.order = FALSE)



Let us check the ratio of positive/negative words for the different alexa products.



checked %>%  
  group_by(variation, sentiment) %>% 
  summarize(freq = mean(freq)) %>%    # mean word frequency per variant and sentiment
  spread(sentiment, freq) %>% 
  ungroup() %>% 
  mutate(ratio = positive/negative,   # > 1 means positive words dominate
         variation = reorder(variation, ratio)) %>% 
  ggplot(aes(variation, ratio)) +
  geom_point() +
  coord_flip()

Conclusion: the White Show has the highest positive-to-negative ratio, whereas the Black Spot is received the worst.
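To read the extremes off programmatically rather than from the plot (a sketch reusing the pipeline above; ratios is an illustrative name):

# positive/negative mean-frequency ratio per variant
ratios <- checked %>%
  group_by(variation, sentiment) %>%
  summarize(freq = mean(freq)) %>%
  spread(sentiment, freq) %>%
  ungroup() %>%
  mutate(ratio = positive / negative)

ratios %>% arrange(desc(ratio)) %>% head(1)   # best-received variant
ratios %>% arrange(ratio) %>% head(1)         # worst-received variant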



Move on to the next tab for sentiment analysis using different lexicons such as bing, AFINN, and nrc.

Sentiment Analysis Using Lexicons

Sentiment Analysis Using BING:
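The pipelines below use alexa1, a version of the data with a review.no identifier; a minimal reconstruction, assuming review.no is simply the row index:

# assumed: the original data plus a sequential review id
alexa1 <- alexa %>% mutate(review.no = row_number())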

alexa2 <- alexa1 %>%
    unnest_tokens(word, verified_reviews)  

alexa2$variation <- as.factor(alexa2$variation)

alexa2 %>%
  group_by(variation) %>% 
  inner_join(get_sentiments("bing")) %>%
  count(variation, review.no, sentiment) %>%
  ungroup() %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative,
         variation = factor(variation)) %>%
  ggplot(aes(review.no, sentiment, fill = variation)) +
  geom_bar(alpha = 0.5, stat = "identity", show.legend = FALSE) +
  facet_wrap(~ variation, ncol = 2, scales = "free_x")



Comparing Sentiments Using BING, AFINN, and NRC

  • AFINN scores each word on a scale from -5 (strongly negative) to +5 (strongly positive). BING and NRC are more alike in that they classify sentiment in binary terms; for these two lexicons we have counted each positive word as +1 and each negative word as -1. Hence, AFINN will show higher peaks than the other two lexicons (a quick side-by-side look at the two scales is sketched below).

  • The overall trend across the three lexicons looks similar. Due to the difference in scoring, AFINN produces different absolute sentiment scores.
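A quick look at the two scales side by side (a sketch; the chosen words are illustrative):

# AFINN returns graded integer scores; bing returns a positive/negative label
get_sentiments("afinn") %>% filter(word %in% c("love", "hate"))
get_sentiments("bing") %>% filter(word %in% c("love", "hate"))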

afinn <- alexa2 %>%
  group_by(variation) %>% 
  inner_join(get_sentiments("afinn")) %>%
  group_by(variation, review.no) %>%
  summarise(sentiment = sum(score)) %>%   # note: newer lexicon releases name this column "value"
  mutate(method = "AFINN")

bing_and_nrc <- bind_rows(alexa2 %>%
                            group_by(variation) %>% 
                            inner_join(get_sentiments("bing")) %>%
                            mutate(method = "Bing"),
                          alexa2 %>%
                            group_by(variation) %>% 
                            inner_join(get_sentiments("nrc") %>%
                                         filter(sentiment %in% c("positive", "negative"))) %>%
                            mutate(method = "NRC")) %>%
  count(variation, method, review.no, sentiment) %>%
  ungroup() %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  dplyr::select(variation, review.no, method, sentiment)


bind_rows(afinn, 
          bing_and_nrc) %>%
  ungroup() %>%
  mutate(variation = factor(variation)) %>%
  ggplot(aes(review.no, sentiment, fill = method)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_grid(variation ~ method)



Comparing Sentiments - Negative Reviews

  • I have visually compared the three lexicons' performance in classifying a review as negative.

  • Only negative reviews (rating < 3) have been used, so we should expect the scores to be negative, i.e. below the zero line.

  • We can clearly see in the following figure that some reviews are wrongly classified as positive.

afinn.neg <- alexa2 %>% 
  filter(rating < 3) %>% 
  group_by(variation) %>% 
  inner_join(get_sentiments("afinn")) %>%
  group_by(variation, review.no) %>%
  summarise(sentiment = sum(score)) %>%
  mutate(method = "AFINN")

bing_and_nrc.neg <- bind_rows(alexa2 %>% 
                                filter(rating < 3) %>% 
                                group_by(variation) %>% 
                                inner_join(get_sentiments("bing")) %>%
                                mutate(method = "Bing"),
                              alexa2 %>% 
                                filter(rating < 3) %>% 
                                group_by(variation) %>% 
                                inner_join(get_sentiments("nrc") %>%
                                             filter(sentiment %in% c("positive", "negative"))) %>%
                                mutate(method = "NRC")) %>%
  count(variation, method, review.no, sentiment) %>%
  ungroup() %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  dplyr::select(variation, review.no, method, sentiment)


bind_rows(afinn.neg, bing_and_nrc.neg) %>%
  ungroup() %>%
  mutate(variation = factor(variation)) %>%
  ggplot(aes(review.no, sentiment, fill = method)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_grid(variation ~ method)



Sentiment Analysis-Review Comparison

Comparing Sentiment Analysis Results with Customer Reviews

  • Customer review scores range from 1 to 5, with 1 being bad and 5 being good.

  • If the actual customer rating is <= 2, that review has been considered negative; otherwise non-negative.

  • If the computed sentiment score is <= 0, the review is classified as negative; otherwise non-negative (matching the ifelse() in the code below).

# get the sentiment of each review by summing word-level AFINN scores;
# the review text (verified_reviews) is first split into sentences, then words
alexa_sentences <- alexa1 %>% unnest_tokens(sentence, verified_reviews, token = "sentences")

abc <- alexa_sentences %>%
  group_by(review.no) %>%
  mutate(sentence_num = 1:n()) %>%
  unnest_tokens(word, sentence) %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(variation, review.no) %>%
  summarise(sentiment = sum(score, na.rm = TRUE)) %>%
  mutate(calcul.sentiment = ifelse(sentiment<=0,"negative", "non-negative"))

table(abc$calcul.sentiment)
## 
##     negative non-negative 
##           81          974
abc1 <- abc %>%  left_join(alexa1, c("variation", "review.no")) %>%
  mutate(actual.sent = ifelse(rating<=2,"negative", "non-negative")) %>% 
  mutate(match= ifelse(actual.sent==calcul.sentiment,1,0))


table(abc1$actual.sent, abc1$calcul.sentiment, dnn=c("Actual","Measured Sentiment"))
##               Measured Sentiment
## Actual         negative non-negative
##   negative           29           30
##   non-negative       52          944
  • Misclassification rate = (30 + 52) / 1055 ≈ 7.8%, which seems decent.
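The rate can also be computed directly from the confusion table (a sketch; tab is an illustrative name):

# off-diagonal cells are the misclassified reviews
tab <- table(abc1$actual.sent, abc1$calcul.sentiment)
1 - sum(diag(tab)) / sum(tab)   # (30 + 52) / 1055 ≈ 0.078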

Bigram Analysis

  • Common Bigrams
alexa3 <- alexa1 %>%
  unnest_tokens(bigram, verified_reviews, token = "ngrams", n = 2)  

# Common bigrams
alexa3 %>%
  count(bigram, sort = TRUE)
## # A tibble: 11,428 x 2
##    bigram       n
##    <chr>    <int>
##  1 love it    155
##  2 i love     117
##  3 the echo   110
##  4 set up     100
##  5 easy to     98
##  6 i have      87
##  7 to set      77
##  8 it is       76
##  9 echo dot    75
## 10 i am        73
## # ... with 11,418 more rows



Removing stop words - only from word 1 - and looking at different variations

# remove stop words from word1 only, not word2: filtering word2 as well would lose
# information - e.g. in the bigram "love it", removing "it" would also drop "love"
x <- alexa3 %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word1 %in% c("echo", "prime", "black", "dot", "white", "alexa") ) %>%
  count(word1, word2, sort = TRUE) %>% filter(!is.na(word1))


alexa3 %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word1 %in% c("echo", "prime", "black", "dot", "white", "alexa")) %>%
  count(variation, word1, word2, sort = TRUE) %>% filter(!is.na(word1)) %>% 
  unite("bigram", c(word1, word2), sep = " ") %>%
  group_by(variation) %>% 
  top_n(10) %>%
  ungroup() %>%
  ggplot(aes(reorder(bigram, n), n, fill = variation)) +
  geom_bar(stat = "identity", alpha = .8, show.legend = FALSE) +
  facet_wrap(~ variation, ncol = 2, scales = "free") +
  coord_flip()



Negating Word Combinations

alexa3 %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 == "not") %>%
  count(variation, word1, word2, sort = TRUE)
## # A tibble: 124 x 4
##    variation       word1 word2         n
##    <chr>           <chr> <chr>     <int>
##  1 Black  Dot      not   sure          6
##  2 Black  Dot      not   the           6
##  3 Black  Dot      not   a             4
##  4 Black  Dot      not   as            4
##  5 Black  Dot      not   have          4
##  6 Black  Dot      not   impressed     4
##  7 Charcoal Fabric not   have          4
##  8 Charcoal Fabric not   like          4
##  9 Charcoal Fabric not   to            4
## 10 Charcoal Fabric not   very          4
## # ... with 114 more rows



Sentiment scores can be wrong when negation is not considered:

AFINN <- get_sentiments("afinn")

(nots <- alexa3 %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    filter(word1 == "not") %>%
    inner_join(AFINN, by = c(word2 = "word")) %>%
    count(word2, score, sort = TRUE) 
)
## # A tibble: 18 x 3
##    word2        score     n
##    <chr>        <int> <int>
##  1 like             2     9
##  2 allow            1     4
##  3 impressed        3     4
##  4 awkward         -2     2
##  5 bad             -3     2
##  6 easy             1     2
##  7 miss            -2     2
##  8 perfect          3     2
##  9 satisfied        2     2
## 10 supporting       1     2
## 11 true             2     2
## 12 worth            2     2
## 13 disappointed    -2     1
## 14 good             3     1
## 15 happy            3     1
## 16 recommend        2     1
## 17 regret          -2     1
## 18 worry           -3     1
nots %>%
  mutate(contribution = n * score) %>%
  arrange(desc(abs(contribution))) %>%
  head(20) %>%
  ggplot(aes(reorder(word2, contribution), n * score, fill = n * score > 0)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  xlab("Words preceded by 'not'") +
  ylab("Sentiment score multiplied by no. of occurrances") +
  coord_flip()

  • We can see the total impact the negated cases had on misspecifying sentiment. For example, the topmost word preceded by “not” is “like”. The sentiment score for “like” is +2; however, “like” was preceded by “not” 9 times, which means the sentiment could easily have been overstated by 9 × 2 = 18 points.



Words preceded by ‘not’, ‘no’, and ‘without’

# looking at a number of negation words
negation_words <- c("not", "no", "without")

negated <- alexa3 %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    filter(word1 %in% negation_words) %>%
    inner_join(AFINN, by = c(word2 = "word")) %>%
    count(word1, word2, score, sort = TRUE) %>%
    ungroup()

negated %>%
  mutate(contribution = n * score) %>%
  arrange(desc(abs(contribution))) %>%
  group_by(word1) %>%
  top_n(10, abs(contribution)) %>%
  ggplot(aes(reorder(word2, contribution), contribution, fill = contribution > 0)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  xlab("Words preceded by 'not'") +
  ylab("Sentiment score multiplied by no. of occurrances") +
  facet_wrap(~ word1, scales = "free") +
  coord_flip()

  • The primary reason for the misclassification between sentiment scores and customer ratings in the previous section is that the effect of negating words, such as “not good” or “not recommend”, was ignored. A simple correction is sketched below.
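A minimal sketch of one possible fix (not part of the original analysis; corrected is an illustrative name): score bigrams with AFINN and flip the sign whenever the word follows a negation term. Note this scores bigram tokens, so totals are not directly comparable to the unigram analysis.

# sign-flip AFINN scores for words preceded by "not", "no", or "without"
corrected <- alexa3 %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  mutate(score = ifelse(word1 %in% negation_words, -score, score))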

Summary

  • We found that the three lexicons gave a similar trend in sentiment scores across reviews (visual inspection).

  • We compared the negative reviews (customer rating <= 2) with the computed sentiment scores. Only about 7.8% of the reviews were misclassified; hence, the sentiment analysis gave us decent results.

  • The primary reason for misclassification is ignoring the effect of negating words such as “not good” and “not recommend”.