The dataset used in this project consists of 5,000 consumer reviews of Amazon products.
The variables that we want to include in our task are:
name = name of the Amazon product
reviews.rating = rating from 1 to 5 for each review
reviews.text = full text of each review

library(tidyverse)  # dplyr, stringr, ggplot2, forcats, tidyr, tibble
library(tidytext)   # unnest_tokens(), stop_words, get_sentiments()
library(tm)         # removeNumbers(), removeWords(), stopwords(), stripWhitespace()
Amazon_P <- read.csv('/Users/panca97/Desktop/Erasmus/Text mining/Final Project/Amazon_Consumer_Reviews_of_Amazon_Products.xls')
Amazon_P <- Amazon_P[, c(4, 19, 21)]  # keep name, reviews.rating and reviews.text
Amazon_P <- as_tibble(Amazon_P)
setwd("~/Desktop/Erasmus/Text mining/Final Project")

The first purpose of this analysis is to provide advice on the general purchase of an Amazon Echo product.
Then, we compare two different types of this product: Amazon Echo Show vs Amazon Echo Plus.
The text mining technique used to meet the project objectives is sentiment analysis.
Thus, we filter the dataset down to the Echo products we are interested in:
Amazon Echo Show is a smart speaker that is part of the Amazon Echo line of products. Similarly to other devices in the family, it is designed around Amazon's virtual assistant Alexa, but additionally features a 7-inch touchscreen display that can be used to display visual information to accompany its responses, as well as play video and conduct video calls with other Echo Show users.
Amazon Echo Plus is a hands-free smart speaker that you control using your voice. It connects to Alexa, a cloud-based voice service, to play music, make calls, check weather and news, set alarms, control smart home devices, and much more.

# Keep only the products whose name contains the whole word "Echo"
Amazon_echo <- Amazon_P %>%
  filter(str_detect(name, '\\b(Echo)\\b')) %>%
  rename(rating = reviews.rating,
         text = reviews.text)
# Shorten the product names: the Echo Show keeps its own label,
# every other Echo product is labelled "Amazon Echo Plus"
Amazon_echo <- Amazon_echo %>%
  mutate(name = if_else(
    name == 'Amazon Echo Show Alexa-enabled Bluetooth Speaker with 7" Screen',
    "Amazon Echo Show",
    "Amazon Echo Plus"
  ))

Let's inspect the new, restricted dataset, called Amazon_echo.
The variables remain the same, while the number of observations drops to 1435.
Furthermore, the dataset contains no missing values.
str(Amazon_echo)
## tibble [1,435 × 3] (S3: tbl_df/tbl/data.frame)
## $ name : chr [1:1435] "Amazon Echo Show" "Amazon Echo Show" "Amazon Echo Show" "Amazon Echo Show" ...
## $ rating: int [1:1435] 5 5 5 5 5 5 5 5 4 4 ...
## $ text : chr [1:1435] "Great Gift for anyone. Very easy to setup. Coexist with all IOT Devices. Alexa is AWESOME!" "Super excited to give this as a gift. It's super convenient that Best Buy has Echo products in store instead of"| __truncated__ "We bought this for mother in law, buying another for me." "Well designed, good sound, has everything Alexa has plus the HD video. Always ready with answers with associate"| __truncated__ ...
sum(is.na(Amazon_echo))
## [1] 0
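The word-boundary pattern used earlier to select the Echo products can be checked on a few names. The sketch below uses base R's grepl() instead of str_detect() so that it runs without extra packages. The product names in it are made up for illustration and are not taken from the dataset.

```r
# Hypothetical product names, invented for illustration only
product_names <- c(
  "Amazon Echo Show",
  "Echo Plus with built-in hub",
  "Amazon Echoes Bundle",
  "Fire HD 8 Tablet"
)

# \\b marks a word boundary, so "Echo" must appear as a whole word
is_echo <- grepl("\\b(Echo)\\b", product_names, perl = TRUE)

# Names that the filter would keep
echo_names <- product_names[is_echo]
print(echo_names)
```

A name such as "Echoes" is not kept, because the letters after "Echo" mean there is no word boundary at that position.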
The horizontal bar chart below illustrates the distribution of ratings for the two Echo products.
It is easy to see that both Amazon Echo products receive mostly positive ratings.
If, on the other hand, we consider the number of ratings, Amazon Echo Show is reviewed more often.

summary(as.factor(Amazon_echo$name))
## Amazon Echo Plus Amazon Echo Show
## 590 845
ggplot(Amazon_echo, aes(rating, fill = as.factor(name))) +
  geom_bar(position = "dodge") +
  coord_flip() +
  scale_fill_manual(values = c("darkslateblue", "olivedrab4")) +
  ggtitle("Distribution of Rating between Amazon products") +
  guides(fill = guide_legend(title = "Type of product"))

Regarding the form of the text, we must clean it before our analysis. The first four reviews look like this:

library(pander)  # pandoc.table()
pandoc.table(Amazon_echo[1:4, c(1, 3)],
             justify = c('center', 'left'), style = 'grid')
##
##
## +------------------+--------------------------------+
## | name | text |
## +==================+================================+
## | Amazon Echo Show | Great Gift for anyone. Very |
## | | easy to setup. Coexist with |
## | | all IOT Devices. Alexa is |
## | | AWESOME! |
## +------------------+--------------------------------+
## | Amazon Echo Show | Super excited to give this as |
## | | a gift. It's super convenient |
## | | that Best Buy has Echo |
## | | products in store instead of |
## | | having to purchase from |
## | | Amazon. |
## +------------------+--------------------------------+
## | Amazon Echo Show | We bought this for mother in |
## | | law, buying another for me. |
## +------------------+--------------------------------+
## | Amazon Echo Show | Well designed, good sound, has |
## | | everything Alexa has plus the |
## | | HD video. Always ready with |
## | | answers with associated video |
## | | or text if applicable. Can |
## | | show movie trailers, can also |
## | | watch Amazon video. Excellent |
## | | on-demand security video with |
## | | Amazon compatible cameras. |
## | | Voice activated message and/or |
## | | video calls to Amazon Show |
## | | owners or Alexa App holders. |
## | | Highly recommended. |
## +------------------+--------------------------------+
Another useful comparison is the average rating: Amazon Echo Plus is rated slightly higher than Amazon Echo Show.

Amazon_echo %>%
  group_by(name) %>%
  summarise(rating_mean = round(mean(rating), 2))
## # A tibble: 2 x 2
## name rating_mean
## <chr> <dbl>
## 1 Amazon Echo Plus 4.75
## 2 Amazon Echo Show 4.66
To clean the text of each review, we use a few functions that replace punctuation with spaces, convert the text to lower case, remove numbers, remove English stop words and collapse repeated whitespace:

Amazon_echo$text <- gsub("[[:punct:]]", " ", Amazon_echo$text)            # punctuation -> space
Amazon_echo$text <- tolower(Amazon_echo$text)                             # lower case
Amazon_echo$text <- removeNumbers(Amazon_echo$text)                       # drop digits
Amazon_echo$text <- removeWords(Amazon_echo$text, stopwords("english"))   # drop English stop words
Amazon_echo$text <- stripWhitespace(Amazon_echo$text)                     # collapse repeated spaces

Continuing with the cleanup, we create a new dataset called tidy_echo, with one word per row and a new variable Line_N that numbers the reviews within each product.
The table below shows the most frequent words in the new data.

tidy_echo <- Amazon_echo %>%
  group_by(name) %>%
  mutate(Line_N = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

tidy_echo %>%
  count(word) %>%
  arrange(desc(n))
## # A tibble: 2,445 x 2
## word n
## <chr> <int>
## 1 echo 609
## 2 alexa 432
## 3 love 421
## 4 music 284
## 5 amazon 253
## 6 easy 217
## 7 home 211
## 8 product 192
## 9 bought 178
## 10 screen 160
## # … with 2,435 more rows
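The cleaning and counting steps can be reproduced on a single review in base R. This is only a sketch: the digit and whitespace steps are base R stand-ins for removeNumbers() and stripWhitespace(), and the short stop-word list is a hand-picked assumption, far smaller than stopwords("english").

```r
# First review from the dataset, as printed earlier in the report
review <- "Great Gift for anyone. Very easy to setup. Coexist with all IOT Devices. Alexa is AWESOME!"

# Same order of steps as the cleaning code: punctuation, case, digits, whitespace
clean <- gsub("[[:punct:]]", " ", review)
clean <- tolower(clean)
clean <- gsub("[[:digit:]]+", "", clean)           # stand-in for removeNumbers()
clean <- trimws(gsub("[[:space:]]+", " ", clean))  # stand-in for stripWhitespace()

# A tiny, hand-picked stop-word list (assumption, for illustration only)
small_stop <- c("for", "to", "with", "all", "is", "very")

# Tokenise on spaces, drop stop words, and count what is left
tokens <- strsplit(clean, " ")[[1]]
tokens <- tokens[!tokens %in% small_stop]
word_counts <- sort(table(tokens), decreasing = TRUE)
print(word_counts)
```

Each remaining word occurs once in this single review. On the full dataset the same idea yields the frequency table printed by count().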
Clearly, the tibble above suggests that we should remove some words that could distort our analysis.
In fact, words like echo, alexa and amazon do not convey any sentiment.

# Custom stop words specific to this dataset
US.word <- tribble(
  ~word,     ~lexicon,
  "alexa",   "US",
  "echo",    "US",
  "amazon",  "US",
  "product", "US"
)

# Extend the standard stop-word list with the custom words
stop_words2 <- stop_words %>%
  bind_rows(US.word)

tidy_echo <- Amazon_echo %>%
  group_by(name) %>%
  mutate(Line_N = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words2)

tidy_echo %>%
  count(word) %>%
  arrange(desc(n))
## # A tibble: 2,441 x 2
## word n
## <chr> <int>
## 1 love 421
## 2 music 284
## 3 easy 217
## 4 home 211
## 5 bought 178
## 6 screen 160
## 7 smart 158
## 8 sound 156
## 9 set 147
## 10 video 146
## # … with 2,431 more rows
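The effect of the extended stop-word list can be illustrated on a tiny example. This is a sketch in base R, where the %in% filter plays the role of anti_join(). The sample tokens are invented and only stand in for the output of unnest_tokens().

```r
# Invented tokens standing in for the output of unnest_tokens()
tokens <- data.frame(
  word = c("love", "echo", "alexa", "music", "amazon", "easy", "product"),
  stringsAsFactors = FALSE
)

# Dataset-specific stop words, mirroring the US.word table
custom_stop <- c("alexa", "echo", "amazon", "product")

# Keep only the tokens that are not in the custom list (same effect as anti_join)
kept <- tokens[!tokens$word %in% custom_stop, , drop = FALSE]
print(kept$word)
```

Only the words that can carry some meaning for the analysis remain, which is why love moves to the top of the frequency table.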
Love is the most used word in the reviews, so we can start to think that buying an Amazon Echo is a good choice.
Let's create a graph that compares the most repeated words, so that we can identify other interesting ones.
Of course, all of the Amazon Echo products are used to listen to music: this is why the word music is repeated nearly 300 times.

# Keep only the words that appear more than 100 times
word_counts <- tidy_echo %>%
  count(word) %>%
  filter(n > 100) %>%
  mutate(word2 = fct_reorder(word, n))

ggplot(word_counts, aes(x = word2, y = n, fill = n)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_fill_gradient(low = "yellow", high = "red") +
  labs(title = "Review Word Counts")

Sentiment analysis is a natural language processing method used to assess if the data are positive, negative or neutral.
Sentiment analysis is also applied to textual data to capture how consumers feel about a product in their reviews and to understand customer needs.
It is useful for rapidly gaining insights from large amounts of text data.
A sentiment lexicon is a collection of words, also known as polar or opinion words, associated with their sentiment orientation.
The Loughran lexicon labels words with six possible sentiments: negative, positive, litigious, uncertainty, constraining and superfluous.
It is easy to see from the bar plot below that the most common sentiment in the Amazon Echo product reviews is positive.

sentiment_review <- tidy_echo %>%
  inner_join(get_sentiments("loughran")) %>%
  count(sentiment) %>%
  mutate(word3 = fct_reorder(sentiment, n))

ggplot(sentiment_review, aes(x = word3, y = n, fill = word3)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_fill_manual(values = c("darkorchid4", "chocolate4", "lightseagreen", "red", "green")) +
  labs(title = "Review Sentiment Counts")

The bing lexicon categorizes words in a binary fashion into positive and negative categories.
The bar chart below presents the difference between the number of positive and negative words in each review.
As expected, far more reviews are classified as positive overall (bars above zero).

# Note: Line_N restarts at 1 within each product, so one index can pool
# a review of Echo Plus and a review of Echo Show
Echo_sentiment <- tidy_echo %>%
  inner_join(get_sentiments("bing")) %>%
  count(index = Line_N, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

ggplot(Echo_sentiment, aes(index, sentiment, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  scale_fill_gradient(low = "red", high = "darkgreen") +
  labs(title = "")

OK, we now know that most of the reviews are positive.
Now let's see which negative words are associated with our product.
Both representations below, the bar chart and the word cloud, use the same word classification.
Alarm is the most frequent negative word, but we think that the alarm clock is simply a feature of this product rather than a complaint.
Some consumers also use limited as a negative word, although its frequency is very low.

bing_word_counts <- tidy_echo %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

# Ten most frequent words for each sentiment
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()

library(reshape2)   # acast()
library(wordcloud)  # comparison.cloud()
tidy_echo %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "darkgreen"), title.colors = c("red", "darkgreen"),
                   max.words = 100)

Proceeding with our sentiment analysis of the Amazon Echo, we compute an overall sentiment score for each rating.
We find that ratings 1 and 2 have a negative overall sentiment, while ratings 3, 4 and 5 are classified as positive overall.

tidy_echo %>%
  inner_join(get_sentiments("bing")) %>%
  count(rating, sentiment) %>%
  spread(sentiment, n) %>%
  mutate(overall_sentiment = positive - negative)
## # A tibble: 5 x 4
## rating negative positive overall_sentiment
## <int> <int> <int> <int>
## 1 1 23 12 -11
## 2 2 11 8 -3
## 3 3 33 45 12
## 4 4 110 389 279
## 5 5 168 1786 1618
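Because the raw difference grows with the number of reviews in each rating group, it can also help to look at the share of positive words. The sketch below recomputes the overall sentiment from the counts printed in the table and adds that share. The column positive_share is an illustrative addition and is not part of the original analysis.

```r
# Counts copied from the table (bing lexicon, by rating)
by_rating <- data.frame(
  rating   = 1:5,
  negative = c(23, 11, 33, 110, 168),
  positive = c(12, 8, 45, 389, 1786)
)

# Net sentiment, as in the analysis
by_rating$overall_sentiment <- by_rating$positive - by_rating$negative

# Share of positive words (illustrative addition), independent of group size
by_rating$positive_share <- round(
  by_rating$positive / (by_rating$positive + by_rating$negative), 3
)

print(by_rating)
```

The share of positive words rises steadily with the rating, which agrees with the sign of the overall sentiment.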

sentiment_stars <- tidy_echo %>%
  inner_join(get_sentiments("bing")) %>%
  count(rating, sentiment) %>%
  spread(sentiment, n) %>%
  mutate(overall_sentiment = positive - negative,
         rating = fct_reorder(as.factor(rating), overall_sentiment)
  )

ggplot(
  sentiment_stars, aes(x = rating, y = overall_sentiment, fill = as.factor(rating))
) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Overall Sentiment by rating",
    subtitle = "Reviews for Amazon Echo",
    x = "rating",
    y = "Overall Sentiment"
  )

The AFINN lexicon is a list of English terms manually rated for valence between -5 (negative) and +5 (positive).
Let's start by counting the sentiment words, from the most frequent.
Several words are classified with a near-neutral value such as 1 or 2 (easy, smart, etc.).

sentiment_echo_afinn <- tidy_echo %>%
  inner_join(get_sentiments("afinn"))

sentiment_echo_afinn %>%
  count(word, value) %>%
  arrange(desc(n))
## # A tibble: 270 x 3
## word value n
## <chr> <dbl> <int>
## 1 love 3 421
## 2 easy 1 217
## 3 smart 1 158
## 4 fun 4 107
## 5 nice 3 78
## 6 gift 2 77
## 7 awesome 4 61
## 8 recommend 2 55
## 9 amazing 4 54
## 10 enjoy 2 44
## # … with 260 more rows
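To get a feeling for how the AFINN values combine with the word frequencies, the ten rows printed in the table can be summarised by hand. This is a rough sketch: it covers only these ten words, not all 270, and the frequency-weighted mean is an illustrative measure that the original analysis does not use.

```r
# Top ten AFINN words copied from the table
afinn_top <- data.frame(
  word  = c("love", "easy", "smart", "fun", "nice",
            "gift", "awesome", "recommend", "amazing", "enjoy"),
  value = c(3, 1, 1, 4, 3, 2, 4, 2, 4, 2),
  n     = c(421, 217, 158, 107, 78, 77, 61, 55, 54, 44),
  stringsAsFactors = FALSE
)

# Frequency-weighted mean valence of these ten words
weighted_mean <- sum(afinn_top$value * afinn_top$n) / sum(afinn_top$n)

# Words that survive a filter keeping only strong values (|value| >= 3)
strong_words <- afinn_top$word[abs(afinn_top$value) >= 3]

print(round(weighted_mean, 2))
print(strong_words)
```

Half of the ten most frequent words have a value of 1 or 2, so they drop out once only the strongly rated words are kept.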
We decided to remove the values between -2 and +2 in order to exclude the near-neutral sentiment.
The multi-panel plot below summarises the final result of our analysis.
There are no words with a very negative sentiment, but on the other hand three words reach the maximum positive value:
outstanding
superb
thrilled

# Keep only the strongly rated words (|value| >= 3)
sentiment_echo_afinn2 <- sentiment_echo_afinn %>%
  filter(abs(value) >= 3)

# Ten most frequent words for each AFINN value
word_counts_afinn <- sentiment_echo_afinn2 %>%
  count(word, value) %>%
  group_by(value) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(
    word = fct_reorder(word, n)
  )

ggplot(word_counts_afinn, aes(x = word, y = n, fill = value)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~value, scales = "free") +
  coord_flip() +
  labs(
    y = "Sentiment Word Counts",
    x = "Words")

Having concluded that the Amazon Echo product can be considered a great purchase, we now compare the two types of Amazon Echo using sentiment analysis.
First of all, it is interesting to find the words repeated most often in the reviews of the two products.
Here it is clear how our products stand out:
Amazon Echo Plus is used to play music and control the smart home, and it is equipped with lights.
Amazon Echo Show's main distinguishing feature is its video screen.

# Ten most frequent words for each product
word_counts2 <- tidy_echo %>%
  count(word, name) %>%
  group_by(name) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word2 = fct_reorder(word, n))

ggplot(word_counts2, aes(x = word2, y = n, fill = name)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~name, scales = "free_y") +
  coord_flip() +
  scale_fill_manual(values = c("darkslateblue", "olivedrab4")) +
  guides(fill = guide_legend(title = "Type of product"))

The NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).
We explore the trust emotion for the two different types of product.
The comparison plot shows the words that are associated with trust.
Some words are missing for Amazon Echo Plus: either because we only visualize words repeated more than 10 times, or because they do not appear in the reviews of this product at all.
Generally, we might say that Amazon Echo Show benefits from consumer trust more than its competitor.

nrc_trust <- get_sentiments("nrc") %>%
  filter(sentiment == "trust")

tidy_echo %>%
  filter(name == "Amazon Echo Show") %>%
  inner_join(nrc_trust) %>%
  count(word, sort = TRUE)
## # A tibble: 112 x 2
## word n
## <chr> <int>
## 1 calls 32
## 2 recommend 31
## 3 system 30
## 4 helpful 21
## 5 shopping 21
## 6 enjoy 20
## 7 don 19
## 8 pretty 19
## 9 perfect 18
## 10 deal 16
## # … with 102 more rows

tidy_echo %>%
  filter(name == "Amazon Echo Plus") %>%
  inner_join(nrc_trust) %>%
  count(word, sort = TRUE)
## # A tibble: 94 x 2
## word n
## <chr> <int>
## 1 enjoy 24
## 2 recommend 24
## 3 happy 16
## 4 pretty 16
## 5 system 15
## 6 don 14
## 7 helpful 13
## 8 excellent 10
## 9 perfect 10
## 10 friend 9
## # … with 84 more rows
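Since the two products have different numbers of reviews (845 for Amazon Echo Show and 590 for Amazon Echo Plus), raw counts are not directly comparable. As a rough sketch, the counts printed in the two tables can be divided by the number of reviews. The rate per 100 reviews is an illustrative measure introduced here and is not part of the original analysis.

```r
# Number of reviews per product, from the summary earlier in the report
n_reviews <- c(show = 845, plus = 590)

# Counts of six trust words that appear in both top-ten lists
trust <- data.frame(
  word = c("recommend", "system", "helpful", "enjoy", "pretty", "perfect"),
  show = c(31, 30, 21, 20, 19, 18),
  plus = c(24, 15, 13, 24, 16, 10),
  stringsAsFactors = FALSE
)

# Occurrences per 100 reviews (illustrative normalisation)
trust$show_per_100 <- round(100 * trust$show / n_reviews[["show"]], 2)
trust$plus_per_100 <- round(100 * trust$plus / n_reviews[["plus"]], 2)

print(trust)
```

Reading the two rate columns side by side gives a comparison that does not depend on how many reviews each product received.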

# Trust words that appear more than 10 times for a product
ncr_sentiment <- tidy_echo %>%
  group_by(name) %>%
  inner_join(nrc_trust) %>%
  count(word, sort = TRUE) %>%
  filter(n > 10)

ggplot(ncr_sentiment, aes(word, n, fill = name)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~name, ncol = 2, scales = "free_x") +
  scale_fill_manual(values = c("darkslateblue", "olivedrab4")) +
  labs(title = "")

Naturally, because it has more reviews, Amazon Echo Show generally shows higher counts for each sentiment.
The only exception is the positive sentiment, for which Amazon Echo Plus has more or less the same overall count.

sentiment_review2 <- tidy_echo %>%
  group_by(name) %>%
  inner_join(get_sentiments("loughran")) %>%
  count(sentiment)

ggplot(sentiment_review2, aes(x = sentiment, y = n, fill = name)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values = c("darkslateblue", "olivedrab4")) +
  labs(title = "Review Sentiment Counts between Plus vs Show ")

As we did before for the general evaluation of Amazon Echo, we create a bar chart that illustrates the difference between the positive and negative feelings of each review.
Both products are positively reviewed.
Amazon Echo Show has the most strongly positive reviews, but also the most negative ones.

Echo_sentiment2 <- tidy_echo %>%
  inner_join(get_sentiments("bing")) %>%
  count(name, index = Line_N, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

ggplot(Echo_sentiment2, aes(index, sentiment, fill = name)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~name, ncol = 2, scales = "free_x") +
  scale_fill_manual(values = c("darkslateblue", "olivedrab4")) +
  labs(title = "")

The final conclusion of our project is that, having analyzed the feelings contained in the reviews of our dataset, we can recommend the purchase of an Amazon Echo.
Regarding the comparison between Amazon Echo Plus and Amazon Echo Show, we can say that the latter seems to be reviewed in a more specific way.
The presence of the video screen makes high satisfaction possible, although it could be a double-edged sword, given the presence of some reviews classified as very negative.
Amazon Echo Plus, on the other hand, could be a safer purchase, able to satisfy but unlikely to surprise.