Amazon products Review

The dataset used in this project consists of 5,000 consumer reviews of Amazon products.

The variables that we want to include in our task are:

name = name of the Amazon product
reviews.rating = rating from 1 to 5 for each review
reviews.text = full text of each review

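# Packages used throughout the report
library(tidyverse); library(tidytext); library(tm)
library(pander); library(reshape2); library(wordcloud)
# read.csv is used here, so the file is expected to be comma-separated despite its .xls extension
Amazon_P <- read.csv('/Users/panca97/Desktop/Erasmus/Text mining/Final Project/Amazon_Consumer_Reviews_of_Amazon_Products.xls')
Amazon_P <- Amazon_P[, c(4, 19, 21)]  # keep name, reviews.rating, reviews.text
Amazon_P <- as_tibble(Amazon_P)       # as.tibble() is deprecated
setwd("~/Desktop/Erasmus/Text mining/Final Project")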

Project topic

The first purpose of this analysis is to give general advice on whether to purchase an Amazon Echo product.

Then, we compare two models of this product: Amazon Echo Show vs Amazon Echo Plus.

The text mining technique used to meet the project objectives is sentiment analysis.

Thus, we filter the dataset to keep only the Echo products of interest:

Amazon Echo Show is a smart speaker that is part of the Amazon Echo line of products. Like other devices in the family, it is designed around Amazon's virtual assistant Alexa, but it additionally features a 7-inch touchscreen that can display visual information to accompany its responses, play video, and conduct video calls with other Echo Show users.

Amazon Echo Plus is a hands-free smart speaker that you control using your voice. It connects to Alexa, a cloud-based voice service, to play music, make calls, check weather and news, set alarms, control smart home devices, and much more.

Amazon_echo <- Amazon_P %>% 
  filter(str_detect(name, '\\b(Echo)\\b')) %>%
  rename(rating = reviews.rating,
         text = reviews.text) 

# Shorten the product names: the Echo Show keeps its own label,
# every other Echo product is labelled as Echo Plus
Amazon_echo$name <- ifelse(
  Amazon_echo$name == 'Amazon Echo Show Alexa-enabled Bluetooth Speaker with 7" Screen',
  "Amazon Echo Show",
  "Amazon Echo Plus"
)
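To illustrate what the word-boundary pattern used in the filter keeps and discards, here is a minimal sketch. The product names are hypothetical and serve only as an example, and base R's grepl() stands in for str_detect() so that the snippet needs no extra packages.

```r
# Hypothetical product names, used only to illustrate the pattern
product_names <- c(
  "Amazon Echo Show Alexa-enabled Bluetooth Speaker",
  "Echo Plus with built-in hub",
  "Fire Tablet 7",
  "Echoes of Sound speaker"
)

# \\b marks a word boundary, so "Echo" must appear as a whole word
is_echo <- grepl("\\bEcho\\b", product_names, perl = TRUE)

# Only the first two names are kept; "Echoes" does not match
print(product_names[is_echo])
```

Note that the pattern is case-sensitive, so a name written as "echo" would not be kept.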

First view of products reviews

Let's look at the new restricted dataset, called Amazon_echo.

Naturally, the variables remain the same, while the number of observations drops to 1,435.

Furthermore, our dataset contains no missing values.

str(Amazon_echo)

## tibble [1,435 × 3] (S3: tbl_df/tbl/data.frame)
##  $ name  : chr [1:1435] "Amazon Echo Show" "Amazon Echo Show" "Amazon Echo Show" "Amazon Echo Show" ...
##  $ rating: int [1:1435] 5 5 5 5 5 5 5 5 4 4 ...
##  $ text  : chr [1:1435] "Great Gift for anyone. Very easy to setup. Coexist with all IOT Devices. Alexa is AWESOME!" "Super excited to give this as a gift. It's super convenient that Best Buy has Echo products in store instead of"| __truncated__ "We bought this for mother in law, buying another for me." "Well designed, good sound, has everything Alexa has plus the HD video. Always ready with answers with associate"| __truncated__ ...

sum(is.na(Amazon_echo))

## [1] 0

The horizontal bar chart below illustrates the distribution of ratings for the two Echo products.

It is easy to see that both Amazon Echo products receive mostly positive ratings.

If, on the other hand, we consider the number of ratings, Amazon Echo Show is reviewed more often (845 reviews against 590).

summary(as.factor(Amazon_echo$name))

## Amazon Echo Plus Amazon Echo Show 
##              590              845

ggplot(Amazon_echo, aes(rating, fill= as.factor(name))) +
  geom_bar(position = "dodge") +
  coord_flip()+
  scale_fill_manual(values = c("darkslateblue","olivedrab4")) +
  ggtitle("Distribution of Rating between Amazon products")+
  guides(fill = guide_legend(title="Type of product"))

As for the review text itself, it must be cleaned before the analysis.

pandoc.table(Amazon_echo[1:4,c(1,3)], 
             justify = c('center','left'), style = 'grid')

## 
## 
## +------------------+--------------------------------+
## |       name       | text                           |
## +==================+================================+
## | Amazon Echo Show | Great Gift for anyone. Very    |
## |                  | easy to setup. Coexist with    |
## |                  | all IOT Devices. Alexa is      |
## |                  | AWESOME!                       |
## +------------------+--------------------------------+
## | Amazon Echo Show | Super excited to give this as  |
## |                  | a gift. It's super convenient  |
## |                  | that Best Buy has Echo         |
## |                  | products in store instead of   |
## |                  | having to purchase from        |
## |                  | Amazon.                        |
## +------------------+--------------------------------+
## | Amazon Echo Show | We bought this for mother in   |
## |                  | law, buying another for me.    |
## +------------------+--------------------------------+
## | Amazon Echo Show | Well designed, good sound, has |
## |                  | everything Alexa has plus the  |
## |                  | HD video. Always ready with    |
## |                  | answers with associated video  |
## |                  | or text if applicable. Can     |
## |                  | show movie trailers, can also  |
## |                  | watch Amazon video. Excellent  |
## |                  | on-demand security video with  |
## |                  | Amazon compatible cameras.     |
## |                  | Voice activated message and/or |
## |                  | video calls to Amazon Show     |
## |                  | owners or Alexa App holders.   |
## |                  | Highly recommended.            |
## +------------------+--------------------------------+

Another important comparison is the average rating: Amazon Echo Plus is rated slightly higher than Amazon Echo Show (4.75 vs 4.66).

Amazon_echo %>%
  group_by(name) %>%
  summarise(rating_mean = round(mean(rating),2))

## # A tibble: 2 x 2
##   name             rating_mean
##   <chr>                  <dbl>
## 1 Amazon Echo Plus        4.75
## 2 Amazon Echo Show        4.66

Text Cleaning

To clean the text of each review directly, we use functions that allow us to:

Replace punctuation with whitespace
Transform uppercase letters to lowercase
Remove numbers
Remove English stopwords
Strip extra whitespace

Amazon_echo$text <- gsub("[[:punct:]]", " ", Amazon_echo$text)
Amazon_echo$text <- tolower(Amazon_echo$text)
Amazon_echo$text<- removeNumbers(Amazon_echo$text)
Amazon_echo$text<- removeWords(Amazon_echo$text,stopwords("english"))
Amazon_echo$text<- stripWhitespace(Amazon_echo$text)
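The effect of these steps can be followed on a single sentence. The sketch below is only an illustration: the review text is made up, the stop-word step is left out, and base R regular expressions stand in for removeNumbers() and stripWhitespace() so that the snippet runs without the tm package.

```r
# Hypothetical review, used only to illustrate the cleaning steps
review <- "I don't love it, 2 stars!"

step1 <- gsub("[[:punct:]]", " ", review)          # punctuation -> whitespace
step2 <- tolower(step1)                            # lowercase
step3 <- gsub("[[:digit:]]+", "", step2)           # stand-in for removeNumbers()
step4 <- trimws(gsub("[[:space:]]+", " ", step3))  # stand-in for stripWhitespace(), plus trimming

print(step4)
# "i don t love it stars"
```

Replacing the apostrophe with a space splits "don't" into "don" and "t". This is a likely reason why the fragment "don" shows up among the frequent words later in the report. One possible variation, not used here, is to delete apostrophes before replacing the remaining punctuation.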

Continuing with the cleanup, we create a new dataset called tidy_echo with one word per row and a new variable, Line_N, that numbers the reviews within each product.

The table below shows the most frequent words in our new data.

tidy_echo <- Amazon_echo %>%
  group_by(name) %>%
  mutate(Line_N = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

tidy_echo %>%
  count(word) %>%
  arrange(desc(n))

## # A tibble: 2,445 x 2
##    word        n
##    <chr>   <int>
##  1 echo      609
##  2 alexa     432
##  3 love      421
##  4 music     284
##  5 amazon    253
##  6 easy      217
##  7 home      211
##  8 product   192
##  9 bought    178
## 10 screen    160
## # … with 2,435 more rows

Clearly, the tibble above advises us to remove some words that could bias our analysis.

In fact, words like echo, alexa and amazon do not convey any sentiment.

US.word <- tribble(
  ~word, ~lexicon,
  "alexa", "US",
  "echo", "US",
  "amazon", "US",
  "product", "US"
)
stop_words2 <- stop_words %>%
  bind_rows(US.word)

tidy_echo <- Amazon_echo %>%
  group_by(name) %>%
  mutate(Line_N = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words2)

tidy_echo %>%
  count(word) %>%
  arrange(desc(n))

## # A tibble: 2,441 x 2
##    word       n
##    <chr>  <int>
##  1 love     421
##  2 music    284
##  3 easy     217
##  4 home     211
##  5 bought   178
##  6 screen   160
##  7 smart    158
##  8 sound    156
##  9 set      147
## 10 video    146
## # … with 2,431 more rows
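The custom stop-word step can be checked on a toy example. This is a minimal sketch under stated assumptions: the two reviews and the names toy_reviews, custom_stop and all_stop are hypothetical, and only the two-column layout (word, lexicon) follows tidytext's stop_words.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-review dataset
toy_reviews <- tibble(
  id = 1:2,
  text = c("love the echo and alexa", "amazon echo sound is great")
)

# Hypothetical custom stop words, same layout as tidytext::stop_words
custom_stop <- tibble(word = c("echo", "alexa", "amazon"), lexicon = "custom")
all_stop <- bind_rows(stop_words, custom_stop)

# One word per row, then drop every word found in the combined stop list
toy_tokens <- toy_reviews %>%
  unnest_tokens(word, text) %>%
  anti_join(all_stop, by = "word")

print(toy_tokens)
```

Passing by = "word" explicitly avoids the message that anti_join() prints when it has to guess the join column.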

General Amazon Echo

Love is the most used word in the reviews, so we can start to suspect that buying an Amazon Echo is a good choice.

Let's create a graph that compares the words repeated more than 100 times, so that we can identify other interesting ones.

Of course, all of the Amazon Echo products are used to listen to music: this is why the word music is repeated nearly 300 times.

word_counts <- tidy_echo %>%
  count(word) %>%
  filter(n>100) %>%
  mutate(word2 = fct_reorder(word, n))

ggplot(word_counts, aes(x=word2, y=n, fill = n)) + 
  geom_col(show.legend= F) +
  coord_flip() +
  scale_fill_gradient(low = "yellow", high = "red") + 
  labs(title = "Review Word Counts")

Sentiment Analysis

Sentiment analysis is a natural language processing method used to assess whether text is positive, negative or neutral.

Applied to consumer reviews, it captures how customers feel about a product and helps to understand their needs.

It is useful for rapidly gaining insights from large amounts of text data.

A sentiment lexicon is a collection of words, also known as polar or opinion words, each associated with its sentiment orientation.
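In practice, a lexicon is applied by joining it to the tokenised text on the word column. The sketch below shows the idea with a tiny made-up lexicon. toy_lexicon and toy_words are hypothetical and are not part of any published lexicon.

```r
library(dplyr)

# Hypothetical three-word lexicon: one row per word with its orientation
toy_lexicon <- tibble(
  word = c("love", "great", "broken"),
  sentiment = c("positive", "positive", "negative")
)

# Hypothetical tokenised text, one word per row
toy_words <- tibble(word = c("love", "music", "broken", "love"))

# inner_join keeps only the words found in the lexicon ("music" is dropped)
toy_counts <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(sentiment)

print(toy_counts)
# negative: 1, positive: 2
```

Words that are missing from the lexicon contribute nothing, so the result depends on how well the lexicon covers the vocabulary of the reviews.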

Loughran

The Loughran lexicon labels words with six possible sentiments:

negative
positive
litigious
uncertainty
constraining
superfluous

It is easy to see from the bar plot below that the most frequent sentiment in the Amazon Echo reviews is positive.

sentiment_review <- tidy_echo %>%
  inner_join(get_sentiments("loughran")) %>%
  count(sentiment) %>%
  mutate(word3 = fct_reorder(sentiment, n))

ggplot(sentiment_review, aes(x=word3, y=n, fill = word3)) + 
  geom_col(show.legend= F) +
  coord_flip() +
  scale_fill_manual(values = c("darkorchid4", "chocolate4", "lightseagreen","red", "green")) + 
  labs(title = "Review Sentiment Counts")

Bing

The bing lexicon categorizes words in a binary fashion into positive and negative categories.

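The bar chart below presents the difference between the number of positive and negative words at each review index (Line_N). Because Line_N restarts at 1 for each product, a bar here pools the Echo Show and Echo Plus reviews that share the same index.

As we could have expected, far more bars are net positive (above zero) than net negative.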

Echo_sentiment <- tidy_echo %>%
  inner_join(get_sentiments("bing")) %>%
  count(index = Line_N, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

ggplot(Echo_sentiment, aes(index, sentiment, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  scale_fill_gradient(low = "red", high = "darkgreen") +
  labs(title = "")
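Line_N is assigned within each product (group_by(name) before row_number()), so the same index occurs once per product. The sketch below, on made-up data, shows one possible way to compute a net score per individual review by counting on name and Line_N together. It uses pivot_wider(), the newer equivalent of spread(). toy_scored and per_review are hypothetical names.

```r
library(dplyr)
library(tidyr)

# Hypothetical sentiment-tagged words: review 1 of product A and review 1 of product B
toy_scored <- tibble(
  name = c("A", "A", "A", "B", "B"),
  Line_N = c(1, 1, 1, 1, 1),
  sentiment = c("positive", "positive", "negative", "negative", "negative")
)

# Count per product AND line, spread into columns, then take the difference
per_review <- toy_scored %>%
  count(name, Line_N, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)

print(per_review)
# A: net = 1, B: net = -2
```

Counting on Line_N alone would merge both reviews into a single score of 2 - 3 = -1, which hides the fact that one review is positive and the other negative.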

So most of the reviews are positive.

Now let's see which negative words are associated with our product.

Both representations below, the bar chart and the word cloud, rely on the same bing word classification.

Alarm is the most frequent negative word, but it most likely refers to the alarm clock, which is a feature of this product rather than a complaint.

Although its frequency is very low, some consumers use limited as a negative word.

bing_word_counts <- tidy_echo %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = T) %>%
  ungroup()

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()

tidy_echo %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "darkgreen"),title.colors = c("red", "darkgreen"),
                   max.words = 100)

Proceeding with our sentiment analysis of Amazon Echo, we compute an overall sentiment by rating.

We find that ratings 1 and 2 have a negative overall sentiment (-11 and -3), while ratings 3, 4 and 5 are classified as positive overall.

tidy_echo %>%
  inner_join(get_sentiments("bing")) %>%
  count(rating, sentiment) %>%
  spread(sentiment, n) %>%
  mutate(overall_sentiment = positive - negative)

## # A tibble: 5 x 4
##   rating negative positive overall_sentiment
##    <int>    <int>    <int>             <int>
## 1      1       23       12               -11
## 2      2       11        8                -3
## 3      3       33       45                12
## 4      4      110      389               279
## 5      5      168     1786              1618

sentiment_stars <- tidy_echo %>%
  inner_join(get_sentiments("bing")) %>%
  count(rating, sentiment) %>%
  spread(sentiment, n) %>%
  mutate(overall_sentiment = positive - negative,
    rating = fct_reorder(as.factor(rating), overall_sentiment)
  )

ggplot(
  sentiment_stars, aes(x=rating, y=overall_sentiment, fill=as.factor(rating))
) + 
  geom_col(show.legend=FALSE) +
  coord_flip() +
  labs(
    title = "Overall Sentiment by rating",
    subtitle = "Reviews for Amazon Echo",
    x = "rating",
    y = "Overall Sentiment"
  )

Afinn

The AFINN lexicon is a list of English terms manually rated for valence from -5 (negative) to +5 (positive).

Let's start by counting the sentiment words, from the most recurrent.

Several frequent words carry a weak, nearly neutral value such as 1 or 2 (easy, smart, etc.).

sentiment_echo_afinn <- tidy_echo %>%
  inner_join(get_sentiments("afinn"))

sentiment_echo_afinn %>%
  count(word, value) %>%
  arrange(desc(n))

## # A tibble: 270 x 3
##    word      value     n
##    <chr>     <dbl> <int>
##  1 love          3   421
##  2 easy          1   217
##  3 smart         1   158
##  4 fun           4   107
##  5 nice          3    78
##  6 gift          2    77
##  7 awesome       4    61
##  8 recommend     2    55
##  9 amazing       4    54
## 10 enjoy         2    44
## # … with 260 more rows

We decided to remove the values between -2 and +2 in order to exclude near-neutral sentiment.

The faceted plot below summarises the final result of our analysis.

No word is considered very negative, but on the other hand three words reach the maximum positive value:

outstanding
superb
thrilled

# Keep only strongly polarised words (absolute AFINN value of 3 or more)
sentiment_echo_afinn2 <- sentiment_echo_afinn %>%
  filter(abs(value) >= 3)

word_counts_afinn <- sentiment_echo_afinn2 %>%
  count(word, value) %>%
  group_by(value) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(
    word = fct_reorder(word, n)
  )

ggplot(word_counts_afinn, aes(x=word, y=n, fill=value)) + 
  geom_col(show.legend=FALSE) +
  facet_wrap(~value, scales="free") +
  coord_flip() +
  labs(
    y = "Sentiment Word Counts",
    x = "Words")

Amazon Plus vs Amazon Show

Having concluded that the Amazon Echo product can be considered a great purchase, we now compare the two types of Amazon Echo through sentiment analysis.

Sentiment analysis

First of all, it is interesting to find the words repeated most often in the reviews of the two products.

Here it is clear how our products stand out:

Amazon Echo Plus is used to play music and control the smart home, and it is equipped with lights.
Amazon Echo Show's main distinguishing feature is its video screen.

word_counts2 <- tidy_echo %>%
  count(word, name) %>%
  group_by(name) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word2 = fct_reorder(word, n))

ggplot(word_counts2, aes(x=word2, y=n, fill=name)) + 
  geom_col(show.legend= F) +
  facet_wrap(~name, scales="free_y") +
  coord_flip() +
  scale_fill_manual(values = c("darkslateblue","olivedrab4")) +
  guides(fill = guide_legend(title="Type of product"))

NRC

The NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).

We explore the trust emotion for the two different types of product.

The comparison plot shows the words that are associated with the trust emotion.

Some words do not appear for Amazon Echo Plus, either because we display only the words repeated more than 10 times or because they are absent from the reviews of this product.

In raw counts, Amazon Echo Show seems to benefit from consumer trust more than Amazon Echo Plus, although it also has more reviews (845 against 590).

nrc_trust <- get_sentiments("nrc") %>% 
  filter(sentiment == "trust")

tidy_echo %>%
  filter(name == "Amazon Echo Show") %>%
  inner_join(nrc_trust) %>%
  count(word, sort = TRUE)

## # A tibble: 112 x 2
##    word          n
##    <chr>     <int>
##  1 calls        32
##  2 recommend    31
##  3 system       30
##  4 helpful      21
##  5 shopping     21
##  6 enjoy        20
##  7 don          19
##  8 pretty       19
##  9 perfect      18
## 10 deal         16
## # … with 102 more rows

tidy_echo %>%
  filter(name == "Amazon Echo Plus") %>%
  inner_join(nrc_trust) %>%
  count(word, sort = TRUE)

## # A tibble: 94 x 2
##    word          n
##    <chr>     <int>
##  1 enjoy        24
##  2 recommend    24
##  3 happy        16
##  4 pretty       16
##  5 system       15
##  6 don          14
##  7 helpful      13
##  8 excellent    10
##  9 perfect      10
## 10 friend        9
## # … with 84 more rows

ncr_sentiment <- tidy_echo %>%
  group_by(name) %>%
  inner_join(nrc_trust) %>%
  count(word, sort = TRUE) %>%
  filter(n>10) 

ggplot(ncr_sentiment, aes(word, n, fill = name)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~name, ncol = 2, scales = "free_x") +
  scale_fill_manual(values = c("darkslateblue","olivedrab4")) +
  labs(title = "")
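Since the two products have different numbers of reviews, raw word counts are not directly comparable. As a rough check, the counts can be divided by the number of reviews of each product. The sketch below does this for the word recommend only, using the counts printed above (31 and 24) and the review totals (845 and 590). The variable names are hypothetical.

```r
# Number of reviews per product, from summary(as.factor(Amazon_echo$name))
n_reviews <- c("Amazon Echo Plus" = 590, "Amazon Echo Show" = 845)

# Occurrences of the trust word "recommend" per product, from the tables above
recommend_counts <- c("Amazon Echo Plus" = 24, "Amazon Echo Show" = 31)

# Occurrences per review, rounded to three decimals
recommend_rate <- round(recommend_counts / n_reviews, 3)

print(recommend_rate)
# Amazon Echo Plus: 0.041, Amazon Echo Show: 0.037
```

For this single word the per-review rate is slightly higher for Amazon Echo Plus, which suggests that conclusions drawn from raw counts should be read with some caution.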

Loughran

Naturally, given its larger number of reviews, Amazon Echo Show generally has higher sentiment counts.

The only exception is the positive sentiment, where Amazon Echo Plus reaches more or less the same overall count.

sentiment_review2 <- tidy_echo %>%
  group_by(name) %>%
  inner_join(get_sentiments("loughran")) %>%
  count(sentiment) 

ggplot(sentiment_review2, aes(x=sentiment, y=n, fill = name)) + 
  geom_col() +
  coord_flip() +
  scale_fill_manual(values = c("darkslateblue","olivedrab4")) + 
  labs(title = "Review Sentiment Counts between Plus vs Show ")

Bing

As we did before for the general evaluation of Amazon Echo, we create a bar chart that illustrates the difference between the positive and negative words of each review.

Both products are positively reviewed.

Amazon Echo Show has the most strongly positive reviews but also the most negative ones.

Echo_sentiment2 <- tidy_echo %>%
  inner_join(get_sentiments("bing")) %>%
  count(name, index = Line_N, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

ggplot(Echo_sentiment2, aes(index, sentiment, fill = name)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~name, ncol = 2, scales = "free_x") +
  scale_fill_manual(values = c("darkslateblue","olivedrab4")) +
  labs(title = "")

Conclusion

The final conclusion of our project is that, based on the sentiment contained in the reviews of our dataset, we can recommend the purchase of an Amazon Echo.

On the other hand, regarding the comparison between Amazon Echo Plus and Amazon Echo Show, the latter seems to be reviewed in a more specific way.

The video screen makes high satisfaction possible, although it could be a double-edged sword given the presence of some reviews classified as very negative.

On the other hand, Amazon Echo Plus could be a safer purchase, able to satisfy but hardly able to surprise.

Project 2 of Text Mining

Matteo Pancaldi, Riccardo Ventura
