Marketing Analytics - Lab 4.3 Amazon Dataset

MKT-4223 LAB 4-3

Author

Abigail Russell

Introduction

This report looks at a random sample of customer reviews of books on Amazon. Each review includes a title, the review text, and a star rating.
We run topic modeling on the reviews and create visualizations of six key topics: the three most associated with high ratings and the three most associated with low ratings.
Insights about these highest- and lowest-rated topics appear at the end of the report.

library(stm)
library(tm)
library(Rtsne)
library(rsvd)
library(geometry)
library(SnowballC)
library(wordcloud)
set.seed(4)

Read in the Customer Reviews Dataset

Random 300

These datasets are very large, so you will read in a random sample from the dataset.

library(dplyr)
library(readr)
file_path <- "amazon_reviews_us_Books_v1_02.tsv"  

# Count the rows in the .tsv file (this reads the whole file once)
total_rows <- nrow(read_tsv(file_path, progress = FALSE))

# Generate a random sample of 300 row numbers
set.seed(4) # this ensures reproducibility
sample_rows <- sample(seq_len(total_rows), 300)

# Load the needed columns, then keep only the randomly selected rows
reviews <- vroom::vroom(file_path,
                        col_select = c(marketplace, review_headline, review_body, star_rating)) %>%
  slice(sample_rows) %>%
  filter(marketplace == "US") %>%  # can drop rows if a sampled review is non-US
  rename(review_title = review_headline,
         review_text = review_body,
         review_star = star_rating) %>%
  select(-marketplace)

My original customer reviews dataset had 3.089872 × 10^6 (about 3.09 million) customer reviews.
I took a random sample of 300 rows from the full dataset.
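The sampling step can be sketched on a small made-up table. This is only an illustration in base R: the `toy` data frame and its 900/100 marketplace split are hypothetical, not taken from the Amazon file. It shows that a fixed seed reproduces the same row numbers, and that filtering after sampling can leave fewer rows than were drawn.

```r
# Hypothetical stand-in for the full reviews table (not the Amazon data)
toy <- data.frame(
  id = 1:1000,
  marketplace = rep(c("US", "UK"), times = c(900, 100))
)

set.seed(4)                                      # same seed -> same rows every run
sample_rows <- sample(seq_len(nrow(toy)), 300)   # 300 distinct row numbers

sampled <- toy[sample_rows, ]                    # keep only the sampled rows
us_only <- sampled[sampled$marketplace == "US", ]

nrow(sampled)   # 300
nrow(us_only)   # at most 300; smaller if any non-US rows were drawn
```

In this report the final sample still has 300 rows, so every sampled review was from the US marketplace.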

Look at the first few rows of the selected sample of reviews

head(reviews)

# A tibble: 6 × 3
  review_title                                       review_text     review_star
  <chr>                                              <chr>                 <dbl>
1 Off  the Mark...A Mixed Bag of Essays...           Patrick Collis…           1
2 Everyday Things                                    Harms writes p…           5
3 An excellent sci-fi book!                          Time Enough fo…           5
4 Even veteran Tiel owners can learn something new   I've owned tie…           5
5 Good but not perfect biography of an important man Lamer's biogra…           4
6 Magnificent - a must read!                         My 41 year old…           5

Distribution of the sample by Star Rating

table(reviews$review_star)


  1   2   3   4   5 
 18  21  28  41 192
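
The counts above are easier to compare as percentages. A minimal sketch, with the counts typed in by hand from the table rather than recomputed from `reviews`:

```r
# Star-rating counts copied from the table above
star_counts <- c("1" = 18, "2" = 21, "3" = 28, "4" = 41, "5" = 192)

total <- sum(star_counts)                     # 300 reviews in the sample
pct <- round(100 * star_counts / total, 1)    # share of each rating, in percent
pct
```

Five-star reviews make up 64% of the sample, while one- and two-star reviews together account for only 39 of the 300 reviews, so topics tied to low ratings rest on relatively few documents.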

An additional step for our Mac Users

If you’re running on a Mac, you need to run this code. (Remove the comment #)

#reviews$review_text <- iconv(reviews$review_text, to = "utf-8-mac")
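
As a rough illustration of what `iconv` does, the sketch below re-encodes a made-up string. The example string is hypothetical and not from the dataset, and it uses the portable target name `"UTF-8"` rather than the macOS-specific `"utf-8-mac"`.

```r
# Hypothetical string containing a single Latin-1 byte (0xE9, "e" with acute accent)
x <- "caf\xe9"
Encoding(x) <- "latin1"

validUTF8(x)                                   # FALSE: the raw bytes are not valid UTF-8
y <- iconv(x, from = "latin1", to = "UTF-8")   # re-encode the text as UTF-8
validUTF8(y)                                   # TRUE
```

Running `validUTF8()` on the review text first is one way to check whether any conversion is needed at all.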

Process Reviews (each row is a document)

Here are the custom words I added to the “stop words”

customwords = c("book", "page","read")  #CHANGE THIS FOR YOUR DATASET
customwords

[1] "book" "page" "read"

Build the corpus

library(stm)
processed <- textProcessor(reviews$review_text, metadata = reviews, 
                           customstopwords=customwords)

Building corpus... 
Converting to Lower Case... 
Removing punctuation... 
Removing stopwords... 
Remove Custom Stopwords...
Removing numbers... 
Stemming... 
Creating Output...

out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

Removing 3362 of 5345 terms (3362 of 16915 tokens) due to frequency 
Your corpus now has 300 documents, 1983 terms and 13553 tokens.

docs <- out$documents
vocab <- out$vocab
meta <- out$meta
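
The log above shows 3362 terms and exactly 3362 tokens removed, which is consistent with each dropped term occurring only once. A minimal base R sketch of that kind of frequency filter follows, assuming `prepDocuments` ran with its default `lower.thresh = 1` (drop terms that appear in only one document). The toy documents and the names `docs_toy` and `lower_thresh` are hypothetical.

```r
# Hypothetical, already-stemmed toy documents (not taken from the reviews)
docs_toy <- list(
  c("stori", "charact", "love"),
  c("stori", "recip", "cook"),
  c("charact", "love", "stori", "plot")
)

# Document frequency: in how many documents does each term appear?
doc_freq <- table(unlist(lapply(docs_toy, unique)))

lower_thresh <- 1                                  # assumed default threshold
keep <- names(doc_freq)[doc_freq > lower_thresh]   # terms in more than one document
drop <- names(doc_freq)[doc_freq <= lower_thresh]  # terms in a single document

keep
drop

# Arithmetic from the log above
5345 - 3362     # 1983 terms remain
16915 - 3362    # 13553 tokens remain
```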

Determine the Number of Topics

NOTE: THIS CAN TAKE A SIGNIFICANT AMOUNT OF TIME

# K = 0 lets stm choose the number of topics (requires init.type = "Spectral")
reviewsFit <- stm(documents = out$documents, vocab = out$vocab, K = 0, seed = 1,
                  prevalence =~ review_star, data = out$meta, init.type = "Spectral",
                  verbose = FALSE)
num_topics <- reviewsFit$settings$dim$K
num_topics

[1] 51

The model selected 51 topics for this corpus.

See which topics relate to high vs. low ratings

As a Plot

out$meta$rating <- as.factor(out$meta$review_star)
prep <- estimateEffect(1:num_topics ~ rating, reviewsFit, meta = out$meta,
                       uncertainty = "Global")

plot(prep, covariate = "rating", topics = 1:num_topics, model = reviewsFit,
     method = "difference", cov.value1 = 5, cov.value2 = 1,
     xlab = "Lower Rating ... Higher Rating",
     main = "Relationship between Topic and Rating",
     labeltype = "custom", custom.labels = 1:num_topics)

As a List

Lowest 3 Ratings by Score

differences <- numeric()

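# For each topic, take the coefficients from the first simulation draw.
# With `rating` as a factor, "(Intercept)" is the estimate at 1 star and
# "rating5" is the offset of 5 stars from that baseline.
for (i in seq_along(prep$parameters)) {
  ests <- prep$parameters[[i]][[1]]$est
  # Score used for ranking: the 5-star offset minus the 1-star baseline
  differences[i] <- ests["rating5"] - ests["(Intercept)"]
}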

difference_table <- data.frame(
  Topic = 1:length(differences),
  Difference_Effect = differences)

library(dplyr)
dtsorted <- difference_table %>% 
              arrange(desc(Difference_Effect))

t3 <- tail(dtsorted,3)
t3

   Topic Difference_Effect
49     3        -0.1289491
50    24        -0.1359205
51    42        -0.2000584

Highest 3 Ratings by Score

h3 <- head(dtsorted,3)
h3

  Topic Difference_Effect
1    47        0.02977683
2    50        0.02950766
3     8        0.02823816
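
When reading these scores, it helps to recall how coefficients on a factor are coded. The sketch below uses a plain `lm` on made-up numbers, assuming `estimateEffect` uses R's default treatment coding for `rating`. The `toy` data and its values are hypothetical and are not the lab's results. Under that coding, the intercept is the baseline (1-star) mean and the `rating5` coefficient is already the 5-star minus 1-star difference, which is the quantity the plot compares with `cov.value1=5` and `cov.value2=1`. The score in the loop above additionally subtracts the intercept, so it is a related but different number.

```r
# Hypothetical toy data: topic share by star rating (not the lab's results)
toy <- data.frame(
  rating = factor(rep(c(1, 3, 5), each = 4)),
  share  = c(0.30, 0.28, 0.32, 0.30,   # 1-star reviews, mean 0.30
             0.20, 0.22, 0.18, 0.20,   # 3-star reviews, mean 0.20
             0.10, 0.12, 0.08, 0.10)   # 5-star reviews, mean 0.10
)

fit  <- lm(share ~ rating, data = toy)
ests <- coef(fit)

ests["(Intercept)"]                     # mean share at 1 star (baseline)
ests["rating5"]                         # 5-star mean minus 1-star mean
ests["(Intercept)"] + ests["rating5"]   # mean share at 5 stars
```

In this toy case the 5-versus-1 difference is -0.20, while `rating5` minus the intercept would be -0.50.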

Visualize Lowest and Highest Topics using Wordcloud

Lowest 3 Ratings

Note the last part of the options is the selected color palette.
I just picked a few to show you options. They are not the best choices, so play around here a bit.

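# Install RColorBrewer only if it is not already available
if (!requireNamespace("RColorBrewer", quietly = TRUE)) install.packages("RColorBrewer")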
library(RColorBrewer)

stm::cloud(reviewsFit, topic=t3$Topic[3],colors = brewer.pal(5, "Dark2"))

stm::cloud(reviewsFit, topic=t3$Topic[2],colors = brewer.pal(6, "Dark2"))

stm::cloud(reviewsFit, topic=t3$Topic[1],colors = brewer.pal(7, "Dark2"))

Highest 3 Ratings

library(RColorBrewer)
stm::cloud(reviewsFit, topic=h3$Topic[1],colors = brewer.pal(5, "Paired"))

stm::cloud(reviewsFit, topic=h3$Topic[2],colors = brewer.pal(8, "Paired"))

stm::cloud(reviewsFit, topic=h3$Topic[3],colors = brewer.pal(8, "Paired"))

INSIGHTS

These word clouds come from Amazon reviews of books. After removing stop words and fitting the topic model, the corpus has 51 topics.

For the lowest-rated topic (Topic 42), the word cloud suggests reviews about the writing or the author. Many of the negative reviews appear to be about the books themselves, which is an issue with the author and the book rather than an Amazon issue. We should only take action if a book becomes unprofitable for us. The second-lowest topic (Topic 24) appears to cover historical or faith-based books, where reviewers seem unhappy with the characters and the level of detail. The third-lowest topic (Topic 3) appears to relate to self-help books. The emphasis on “use” could mean that the advice does not work or that the books are no longer useful. Customers also seem to be upset that many of these books are about food.

Among the three highest-rated topics, the first (Topic 47) appears to be about health or motherhood books. The word “answer” stands out, so I believe customers find these books helpful and worth purchasing. The next topic (Topic 50) appears to cover books on how to make things or on construction, and customers seem to enjoy the techniques and structure these books provide. The third topic (Topic 8) looks like romance novels. Words such as “love”, “time”, and even “japan” suggest Japanese romance, which would make sense for Amazon US if customers cannot buy these books in US stores. Words like “well”, “love”, and “wonder” suggest that customers like these books.