library(stm)
library(tm)
library(Rtsne)
library(rsvd)
library(geometry)
library(SnowballC)
library(wordcloud)
set.seed(4)
Marketing Analytics - Lab 4.3 Amazon Data set
MKT-4223 LAB 4-3
Introduction
This report looks at a random selection of customer reviews from Amazon. All reviews include a title, review text, and the number of stars for the rating.
We run topic modeling on the reviews and visualize six key topics: the three most associated with high ratings and the three most associated with low ratings.
At the end, we draw insights about the topics with the highest and lowest ratings and add them to the end of this file.
Read in the Customer Reviews Dataset
Random 300
These datasets are very large, so you will read in a random sample from the dataset.
library(dplyr)
library(readr)
file_path <- "amazon_reviews_us_Books_v1_02.tsv"
# Count the rows in the .tsv file
total_rows <- nrow(read_tsv(file_path, progress = FALSE))
# Generate a random sample of X row numbers
set.seed(4) # this ensures reproducibility
sample_rows <- sample(1:total_rows, 300)
# Read in only the rows whose row numbers were randomly selected
reviews <- vroom::vroom(file_path,
                        col_select = c(marketplace, review_headline, review_body, star_rating)) %>%
  slice(sample_rows) %>%
  filter(marketplace == "US") %>%
  rename(review_title = review_headline,
         review_text = review_body,
         review_star = star_rating) %>%
  select(-marketplace)
My original customer reviews dataset had 3.089872 × 10^6 customer reviews.
I took a random sample of 300 rows from the full dataset.
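The sampling step can be sketched on its own in base R. This is a minimal illustration rather than the lab's code: `draw_rows()` and the total of 1000 rows are hypothetical stand-ins. It shows that `sample()` draws distinct row numbers and that fixing the seed reproduces the same draw.

```r
# Hypothetical helper: draw n distinct row numbers out of total_rows,
# resetting the seed first so the same rows come back on every run.
draw_rows <- function(total_rows, n, seed = 4) {
  set.seed(seed)
  sample(seq_len(total_rows), n)  # sampling without replacement
}

rows_a <- draw_rows(1000, 300)
rows_b <- draw_rows(1000, 300)

length(rows_a)             # number of rows drawn
anyDuplicated(rows_a)      # 0 means no row number was drawn twice
identical(rows_a, rows_b)  # TRUE when the seed makes the draw reproducible
```

Because the report filters on marketplace after slicing, the final sample could in principle be smaller than the number of rows drawn if any sampled row were not from the US marketplace.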
Look at the first few rows of the selected sample of reviews
head(reviews)
# A tibble: 6 × 3
  review_title                                       review_text     review_star
  <chr>                                              <chr>                 <dbl>
1 Off the Mark...A Mixed Bag of Essays...            Patrick Collis…           1
2 Everyday Things                                    Harms writes p…           5
3 An excellent sci-fi book!                          Time Enough fo…           5
4 Even veteran Tiel owners can learn something new   I've owned tie…           5
5 Good but not perfect biography of an important man Lamer's biogra…           4
6 Magnificent - a must read!                         My 41 year old…           5
Distribution of the sample by Star Rating
table(reviews$review_star)
  1   2   3   4   5
 18  21  28  41 192
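To make the skew in this table easier to read, the counts can be turned into percentages. The sketch below retypes the counts from the table above by hand instead of recomputing them from `reviews`, so it runs on its own.

```r
# Star-rating counts copied from the table above
star_counts <- c(`1` = 18, `2` = 21, `3` = 28, `4` = 41, `5` = 192)

# Share of the sample at each rating, in percent
star_share <- round(100 * prop.table(star_counts), 1)

sum(star_counts)  # total number of sampled reviews
star_share
```

Most of the sample is 5-star reviews, so topics tied to low ratings are estimated from relatively few documents.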
An additional step for our Mac Users
If you’re running on a Mac, you need to run this code. (Remove the comment #)
#reviews$review_text <- iconv(reviews$review_text, to = "utf-8-mac")
Process Reviews (each row is a document)
Here are the custom words I added to the “stop words”
customwords = c("book", "page","read") #CHANGE THIS FOR YOUR DATASET
customwords
[1] "book" "page" "read"
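As a rough picture of what a custom stop word list does, here is a base-R sketch on a made-up sentence. It is not how `textProcessor()` is implemented; it only shows exact whole-word removal. Since the processing log lists custom stop word removal before stemming, my reading is that inflected forms such as "books" or "reading" are not caught by this list and get stemmed afterwards.

```r
customwords <- c("book", "page", "read")

# Made-up review text, for illustration only
review <- "I read this book twice; the books in this series keep me reading every page."

# Lower-case, strip punctuation, split on whitespace
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(review)), "\\s+")[[1]]

# Drop tokens that exactly match a custom stop word
kept <- tokens[!tokens %in% customwords]
kept
```

If plural or inflected forms should also be removed, one option is to add them to `customwords` explicitly.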
Build the corpus
library(stm)
processed <- textProcessor(reviews$review_text, metadata = reviews,
                           customstopwords=customwords)
Building corpus...
Converting to Lower Case...
Removing punctuation...
Removing stopwords...
Remove Custom Stopwords...
Removing numbers...
Stemming...
Creating Output...
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
Removing 3362 of 5345 terms (3362 of 16915 tokens) due to frequency
Your corpus now has 300 documents, 1983 terms and 13553 tokens.
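The numbers in this message can be checked by hand, and the pruning rule can be sketched on toy data. The sketch assumes `prepDocuments()` was run with its default `lower.thresh = 1`, which drops terms that appear in only one document. The toy documents are invented.

```r
# Figures copied from the prepDocuments() message above
terms_before   <- 5345
terms_removed  <- 3362
tokens_before  <- 16915
tokens_removed <- 3362

terms_after  <- terms_before - terms_removed    # terms left in the vocabulary
tokens_after <- tokens_before - tokens_removed  # tokens left in the corpus
tokens_removed / terms_removed                  # 1: each dropped term occurred once

# Toy version of the rule: keep terms found in more than one document
toy_docs <- list(c("plot", "character", "plot"),
                 c("character", "recipe"),
                 c("plot", "author"))
doc_freq <- table(unlist(lapply(toy_docs, unique)))
keep <- names(doc_freq)[doc_freq > 1]
keep
```

About 63% of the distinct terms are dropped, but they account for only about 20% of the tokens.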
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
Determine the Number of Topics
NOTE: THIS CAN TAKE A SIGNIFICANT AMOUNT OF TIME
reviewsFit <- stm(documents = out$documents, vocab = out$vocab, K = 0, seed = 1,
                  prevalence =~ review_star, data = out$meta, init.type = "Spectral",
                  verbose = FALSE)
num_topics <- reviewsFit$settings$dim$K
num_topics
[1] 51
There are 51 topics in this corpus.
See which topics relate to high vs. low ratings
As a Plot
out$meta$rating <- as.factor(out$meta$review_star)
prep <- estimateEffect(1:num_topics ~ rating, reviewsFit, meta=out$meta,
                       uncertainty="Global")
plot(prep, covariate="rating", topics=c(1:num_topics), model=reviewsFit,
     method="difference", cov.value1=5, cov.value2=1,
     xlab="Lower Rating ... Higher Rating",
     main="Relationship between Topic and Rating",
     labeltype ="custom", custom.labels=c(1:num_topics))
As a List
Lowest 3 Ratings by Score
differences <- numeric()
for (i in seq_along(prep$parameters)) {
  ests <- prep$parameters[[i]][[1]]$est
  differences[i] <- ests["rating5"] - ests["(Intercept)"]
}
difference_table <- data.frame(
  Topic = seq_along(differences),
  Difference_Effect = differences)
library(dplyr)
dtsorted <- difference_table %>%
  arrange(desc(Difference_Effect))
t3 <- tail(dtsorted,3)
t3
   Topic Difference_Effect
49     3        -0.1289491
50    24        -0.1359205
51    42        -0.2000584
Highest 3 ratings by Scores
h3 <- head(dtsorted,3)
h3
  Topic Difference_Effect
1    47        0.02977683
2    50        0.02950766
3     8        0.02823816
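The rankings above are built from regression coefficients for a factor covariate. As a reading aid, here is a minimal base-R sketch with made-up topic shares (not values from the reviews) showing how such coefficients relate to group means under R's default treatment coding. The intercept is the mean for the baseline level (1 star), and the `rating5` coefficient is already the 5-star mean minus the 1-star mean. That is worth keeping in mind when reading the Difference_Effect column.

```r
# Hypothetical topic shares for five reviews; rating is a factor with levels 1 and 5
toy <- data.frame(
  rating      = factor(c(1, 1, 5, 5, 5)),
  topic_share = c(0.30, 0.20, 0.05, 0.10, 0.15)
)

fit   <- lm(topic_share ~ rating, data = toy)
coefs <- coef(fit)

mean_1 <- mean(toy$topic_share[toy$rating == "1"])
mean_5 <- mean(toy$topic_share[toy$rating == "5"])

coefs["(Intercept)"]  # matches mean_1, the baseline (1-star) mean
coefs["rating5"]      # matches mean_5 - mean_1
```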
Visualize Lowest and Highest Topics using Wordcloud
Lowest 3 Ratings
Note the last part of the options is the selected color palette.
I just picked a few to show you options. They are not the best choices, so play around here a bit.
# install.packages("RColorBrewer")  # run once if the package is not installed
library(RColorBrewer)
stm::cloud(reviewsFit, topic=t3$Topic[3],colors = brewer.pal(5, "Dark2"))
stm::cloud(reviewsFit, topic=t3$Topic[2],colors = brewer.pal(6, "Dark2"))
stm::cloud(reviewsFit, topic=t3$Topic[1],colors = brewer.pal(7, "Dark2"))
Highest 3 Ratings
library(RColorBrewer)
stm::cloud(reviewsFit, topic=h3$Topic[1],colors = brewer.pal(5, "Paired"))
stm::cloud(reviewsFit, topic=h3$Topic[2],colors = brewer.pal(8, "Paired"))
stm::cloud(reviewsFit, topic=h3$Topic[3],colors = brewer.pal(8, "Paired"))
INSIGHTS
These word clouds come from Amazon reviews of books. After removing stop words and fitting the topic model, there are 51 topics.
Based on the word clouds, I think the lowest-rated topic is about the writing or the author. There seem to be a lot of negative reviews of specific book titles, which is not an Amazon issue; it is an issue with the author and their book. We should only take action if the book becomes unprofitable for us. The second-lowest topic, I believe, covers reviews of either historical or faith-based books. It seems that people are unhappy with the characters and the level of detail. Finally, I think the third-lowest topic relates to self-help books. Given the emphasis on “use”, reviewers may be saying that the books do not work or that they are no longer useful. Customers also seem to be upset that a lot of them are about food.
Among the three highest topics, the first appears to be books about health or motherhood. The word “answer” appears, so I believe customers find these books helpful and worth purchasing. The next highest topic, I believe, is books on how to make things or on construction, so those customers are enjoying the techniques and structure given. Finally, I think the third topic is romance novels. Given “love”, “time”, and even “japan”, this could be Japanese romance, which would make sense for US Amazon, as customers would not be able to purchase these books in US stores. Customers appear to like these books, with words like “well”, “love”, and “wonder”.