Background

As part of our data analytics course assignment, we were tasked with evaluating the sales of an SD card manufacturer using Amazon reviews. Our objective is to practice string manipulation and text mining, and to explore questions of competitive industry analysis using text-analysis methods.

Part 1: Identify Competitors in the Amazon SD Marketplace

1.a Setup

setwd("~/Desktop/OIDD245/Lab 5")

library(rvest)
library(ggplot2)
library(dplyr)
library(magrittr)
library(stringr)
library(tidyr)
library(tidytext)
library(syuzhet)
library(tm)
library(wordcloud)
library(SnowballC)
library(RColorBrewer)

# load in the data
reviews <- read.csv("electronics_downsample.csv")
colnames(reviews) <- tolower(colnames(reviews))

1.b & 1.c Filter for reviews containing “sd” or “SD” in the review text. Aggregate by ASIN. Look up the manufacturers of the top 3 ASINs.

# aggregate the sdcard data
sdcard <- reviews %>% 
  mutate(sd.detected = str_detect(reviewtext, "\\b(sd|SD)\\b")) %>% 
  filter(sd.detected == TRUE) %>% 
  group_by(asin) %>% 
  summarize(total = n()) %>% 
  arrange(desc(total))

head(sdcard, 3)
## # A tibble: 3 x 2
##   asin       total
##   <fct>      <int>
## 1 B007WTAJTO   576
## 2 B002WE6D44   214
## 3 B000VX6XL6   192

The top 3 ASINs, their review counts, and their manufacturers are: B007WTAJTO (576), which belongs to SanDisk; B002WE6D44 (214), which belongs to Transcend; and B000VX6XL6 (192), which belongs to Kingston.
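As a quick check on the regex, the word boundaries (\b) ensure that only standalone “sd”/“SD” tokens match, excluding tokens such as “sdhc” or “microsd” (toy strings, not drawn from the data):

str_detect(c("great SD card", "my sdhc card", "microsd adapter"), "\\b(sd|SD)\\b")
## [1]  TRUE FALSE FALSE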

Part 2: Exploratory Analysis

2.a Find average score.

What is the average overall star rating for the three products you identified in the previous part?

top3 <- reviews %>% 
  filter(asin %in% c("B007WTAJTO","B002WE6D44","B000VX6XL6"))

mean(top3$overall) #4.499221
## [1] 4.499221
top3.mean <- reviews %>% 
  filter(asin %in% c("B007WTAJTO","B002WE6D44","B000VX6XL6")) %>% 
  group_by(asin) %>% 
  summarize(mean(overall))

head(top3.mean,3)
## # A tibble: 3 x 2
##   asin       `mean(overall)`
##   <fct>                <dbl>
## 1 B000VX6XL6            4.51
## 2 B002WE6D44            4.53
## 3 B007WTAJTO            4.48

The average overall star rating across the three products is 4.499221. Individually, the averages are B000VX6XL6 = 4.507160, B002WE6D44 = 4.527871, and B007WTAJTO = 4.481679.

2.b Sentiments score. What are the average sentiment scores for the reviews of these three products? For this part, use a sentiment analysis package (e.g. syuzhet or tidytext) to analyze the sentiment of the text for the reviews and compute the average sentiment score for each of the three products. Choose a sentiment computation method that produces a numeric score (e.g. -5 to 5) rather than a category (happy, angry, sad).

# average sentiment score for all top 3 products
top3sentiments <- as.vector(top3$reviewtext)
mean(get_sentiment(top3sentiments)) #1.260529

# average sentiment score for "B007WTAJTO"
B007sentiments <- as.vector(top3$reviewtext[top3$asin == "B007WTAJTO"])
mean(get_sentiment(B007sentiments)) #1.341498

# average sentiment score for "B002WE6D44"
B002sentiments <- as.vector(top3$reviewtext[top3$asin == "B002WE6D44"])
mean(get_sentiment(B002sentiments)) #1.103794

# average sentiment score for "B000VX6XL6"
B000sentiments <- as.vector(top3$reviewtext[top3$asin == "B000VX6XL6"])
mean(get_sentiment(B000sentiments)) #1.267422

The average sentiment score across the reviews of these three products is 1.26. Individually, the average is 1.34 for “B007WTAJTO”, 1.10 for “B002WE6D44”, and 1.27 for “B000VX6XL6”.
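As a cross-check, similar averages could be computed with tidytext’s AFINN lexicon, which scores each word from -5 to 5. This is a minimal sketch, not the method used above; the numbers will differ from syuzhet’s because the lexicons differ, and get_sentiments("afinn") requires the textdata package to download the lexicon on first use.

# score each review as the sum of its AFINN word values, then average by product
afinn <- get_sentiments("afinn")
top3 %>%
  mutate(review.id = row_number(),
         reviewtext = as.character(reviewtext)) %>%
  unnest_tokens(word, reviewtext) %>%
  inner_join(afinn, by = "word") %>%
  group_by(asin, review.id) %>%
  summarize(sentiment = sum(value)) %>%   # per-review score; regroups by asin
  summarize(mean(sentiment))              # per-product average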

Part 3: Text exploration with wordclouds.

For this section, use only the reviews for the three product ASINs identified above. Use all the reviews for these products, not just the ones you identified in your regex match in Part 1. The goal of this section is to develop word clouds that summarize the review words most highly correlated with positive and negative scores for this product category.

3.a Convert into corpus.

Convert the relevant reviews into a text corpus using VCorpus (located in the tm package). Clean the reviews by eliminating “stopwords”, removing whitespace, and converting words to lowercase. You may also choose to make other adjustments, such as removing punctuation and numbers.

corp.original = VCorpus(VectorSource(top3$reviewtext))
corp = tm_map(corp.original, removePunctuation)
corp = tm_map(corp, removeNumbers)
corp = tm_map(corp, content_transformer(tolower), lazy=TRUE)
# remove stopwords before stemming, so their unstemmed forms are still matched
corp = tm_map(corp, removeWords, stopwords("english"), lazy=TRUE)
# stemDocument has a PlainTextDocument method, so no content_transformer() wrapper is needed
corp = tm_map(corp, stemDocument, lazy=TRUE)
corp = tm_map(corp, stripWhitespace)
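To confirm the cleaning behaved as intended, it is worth inspecting one review before and after the transformations:

substr(content(corp.original[[1]]), 1, 80) # raw review text
substr(content(corp[[1]]), 1, 80)          # cleaned: lowercased, stopwords removed, stemmed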

3.b Document-term matrix.

Generate a document-term matrix from this corpus. Remove sparse terms using removeSparseTerms() with a sparsity threshold that leaves you with no more than 300 words. Then attach a column to this matrix containing the overall star score for each review.

dtm = DocumentTermMatrix(corp)
dim(dtm)
dtms = removeSparseTerms(dtm, .983) # 300 words falls between thresholds 0.983 and 0.984
dim(dtms)
dtms_matrix = as.matrix(dtms)
dim(dtms_matrix)

# attach the star scores as an extra (unnamed) column
dtms_matrix2 <- cbind(dtms_matrix, top3$overall)
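The 0.983 threshold above was found by trial and error. A small helper function (hypothetical, not part of tm) could automate the search for a threshold that keeps at most 300 terms:

# step the threshold down until removeSparseTerms() keeps <= max.terms terms
find_sparsity <- function(dtm, max.terms = 300, step = 0.001) {
  s <- 0.999
  while (ncol(removeSparseTerms(dtm, s)) > max.terms && s - step > 0) {
    s <- s - step
  }
  s
}
find_sparsity(dtm) # ~0.983 for this corpus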

3.c Correlation of positive and negative words.

Extract the 30 words that are most highly positively correlated with the number of stars, and the 30 words most negatively correlated with the number of stars.

# Correlate each word's counts with the star ratings.
corr = cor(dtms_matrix2, top3$overall)
corr2 <- data.frame(words = rownames(corr), corr = as.numeric(corr),
                    stringsAsFactors = FALSE)

# The appended star column (empty rowname) correlates perfectly with itself. Remove it.
which(corr2$corr == 1) #289
corr2 <- corr2[-c(289),]

# top 30 +ve correlated words (order on the numeric column, not its text representation)
corr30positive = order(corr2$corr, decreasing = T)[1:30]
corr30positivewords = corr2$words[corr30positive]
corr30positivewords

# top 30 -ve correlated words
corr30negative = order(corr2$corr, decreasing = F)[1:30]
corr30negativewords = corr2$words[corr30negative]
corr30negativewords
  • The top 30 positively correlated words are:

 [1] "great"     "price"     "fast"      "space"
 [5] "perfect"   "high"      "room"      "storag"
 [9] "capac"     "plenti"    "love"      "anyon"
[13] "ton"       "hold"      "easi"      "best"
[17] "expand"    "nice"      "excel"     "movi"
[21] "valu"      "fit"       "enough"    "extra"
[25] "size"      "complaint" "worri"     "recommend"
[29] "need"      "definit"

  • The top 30 negatively correlated words are:

 [1] "well"     "flash"    "lot"      "full"     "qualiti"
 [6] "can"      "hero"     "soni"     "perform"  "hard"
[11] "wrong"    "slot"     "download" "version"  "camcord"
[16] "gig"      "avail"    "larg"     "memori"   "suppos"
[21] "ever"     "nikon"    "run"      "note"     "sdhc"
[26] "pic"      "chip"     "want"     "arriv"    "advertis"

3.d Wordcloud.

Plot two wordclouds: One wordcloud for your list of positively correlated words and another wordcloud for your list of negatively correlated words. For each of these word clouds, the size of the words that appear in the cloud should be in proportion to the strength of the correlation between that word and the number of stars.

set.seed(111)
# word size is proportional to the strength of the positive correlation
corrpositive <- corr2 %>% 
  filter(corr > 0)
wordcloud(words = corrpositive$words, freq = corrpositive$corr)

set.seed(111)
# use the absolute value so more strongly negative words appear larger
corrnegative <- corr2 %>% 
  filter(corr < 0) %>% 
  mutate(corr.abs = abs(corr))
wordcloud(words = corrnegative$words, freq = corrnegative$corr.abs, scale=c(2.2,0.01))

Part 4: Predicting review helpfulness

Put yourself in the shoes of Amazon’s platform designers. How do web platforms let reviewers know when they have written a good review? Amazon solves this problem by allowing reviews to be voted “Helpful” by other readers. This question asks you to build a predictive model for whether a review has at least one helpful vote or not.

4.a Creating binary variable.

First create a new binary variable indicating whether a review has at least one helpful vote.

# the helpful field stores vote counts as text; the first number is the count of helpful votes
reviews$helpful.extracted <- str_extract(reviews$helpful, "\\d{1,}")
reviews$helpful.extracted2 <- as.numeric(reviews$helpful.extracted)
reviews$helpful.binary <- ifelse(reviews$helpful.extracted2 > 0, 1, 0)
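For example, assuming the helpful column stores vote pairs as strings like “[3, 5]” (3 of 5 voters found the review helpful; the exact format is an assumption about the raw data), the regex pulls out the first integer:

str_extract("[3, 5]", "\\d{1,}") # "3"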

4.b Feature engineering.

From the review text, create “features” that you can use in a predictive logistic regression model to predict your new binary dependent variable.


# feature 1: sentiment score
reviews <- reviews %>%
  mutate(text.vector = as.vector(reviewtext)) %>% 
  mutate(sentiments= get_sentiment(text.vector))

# feature 2: number of words in a review (a simple whitespace split; the ngram package is not needed for this)
reviews <- reviews %>% 
  mutate(reviewtextcounts = sapply(strsplit(text.vector, " "), length))

# feature 3: number of words in the summary tab
reviews <- reviews %>% 
  mutate(summarycounts = sapply(strsplit(as.vector(summary), " "), length))

# feature 4: year of the review (the individual year coefficients turn out not to be statistically significant, though the feature still improves accuracy; see 4.e)
reviews$year <- str_sub(reviews$reviewtime,-4,-1)
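Quick sanity checks of the word-count and year features on toy inputs (the “MM DD, YYYY” reviewtime format is an assumption about the raw data):

sapply(strsplit("this sd card works great", " "), length) # 5 words
str_sub("07 23, 2013", -4, -1)                            # "2013"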

4.c Predictive logistic regression

Divide the data into training and test sets, then fit a logistic regression on these features to predict the binary helpfulness variable.

# Divide into training and testing. 
set.seed(1234)
train <- sample(nrow(reviews), 0.8*nrow(reviews), replace = FALSE)
trainset <- reviews[train,]
testset <- reviews[-train,]

# Build logistic regression 
# (note: this fits on the full data set, as in the original run; fitting on
# trainset alone would make the test-set accuracy below strictly out-of-sample)

model = glm(helpful.binary ~ reviewtextcounts + sentiments, data = reviews, family = 'binomial') # add in the summarycounts later
summary(model)
## 
## Call:
## glm(formula = helpful.binary ~ reviewtextcounts + sentiments, 
##     family = "binomial", data = reviews)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.9139  -0.8600  -0.7698   1.2047   1.7211  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.2519773  0.0132278 -94.648  < 2e-16 ***
## reviewtextcounts  0.0061099  0.0001022  59.786  < 2e-16 ***
## sentiments        0.0193331  0.0039473   4.898 9.69e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 88108  on 65808  degrees of freedom
## Residual deviance: 79020  on 65806  degrees of freedom
## AIC: 79026
## 
## Number of Fisher Scoring iterations: 5

4.d Baseline accuracy of predicting all zeros

If you predicted that every row was a 0 (not helpful), you would be close to 60% accurate. How much better does your model do?

# only using zero
testset$onlyzero <- 0
mean(testset$onlyzero == testset$helpful.binary) #0.6084882

# compare to my model
testset$outcomes <- predict(model, newdata = testset, type = "response")

# classify as helpful if the predicted probability is at least 0.36
x <- 0.36
testset$glm.binary <- 0
testset$glm.binary[testset$outcomes >= x] <- 1
mean(testset$glm.binary == testset$helpful.binary) #0.6853062

The model that I created beats the all-zeros baseline: its accuracy is 0.6853062, versus 0.6084882 for predicting 0 on every row.
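The 0.36 cutoff above was picked by hand. A simple sweep over candidate cutoffs would make the choice explicit:

# evaluate test-set accuracy across a grid of classification cutoffs
thresholds <- seq(0.2, 0.6, by = 0.02)
accuracy <- sapply(thresholds, function(t)
  mean((testset$outcomes >= t) == testset$helpful.binary))
thresholds[which.max(accuracy)] # cutoff that maximizes test accuracy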

4.e Improve the logistic regression by adjusting the features.

# build revised model by adding one new feature
model.revised = glm(helpful.binary ~ year + reviewtextcounts + sentiments, data = reviews, family = 'binomial') 
summary(model.revised)
## 
## Call:
## glm(formula = helpful.binary ~ year + reviewtextcounts + sentiments, 
##     family = "binomial", data = reviews)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.0598  -0.8186  -0.6112   0.9882   2.0595  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       9.438e+00  4.625e+01   0.204    0.838    
## year2000         -7.045e+00  4.626e+01  -0.152    0.879    
## year2001         -8.353e+00  4.625e+01  -0.181    0.857    
## year2002         -8.412e+00  4.625e+01  -0.182    0.856    
## year2003         -8.285e+00  4.625e+01  -0.179    0.858    
## year2004         -7.806e+00  4.625e+01  -0.169    0.866    
## year2005         -7.857e+00  4.625e+01  -0.170    0.865    
## year2006         -8.212e+00  4.625e+01  -0.178    0.859    
## year2007         -8.997e+00  4.625e+01  -0.195    0.846    
## year2008         -9.411e+00  4.625e+01  -0.203    0.839    
## year2009         -9.669e+00  4.625e+01  -0.209    0.834    
## year2010         -9.854e+00  4.625e+01  -0.213    0.831    
## year2011         -1.011e+01  4.625e+01  -0.219    0.827    
## year2012         -1.050e+01  4.625e+01  -0.227    0.820    
## year2013         -1.099e+01  4.625e+01  -0.238    0.812    
## year2014         -1.147e+01  4.625e+01  -0.248    0.804    
## reviewtextcounts  5.307e-03  1.029e-04  51.600  < 2e-16 ***
## sentiments        1.684e-02  4.108e-03   4.100 4.13e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 88108  on 65808  degrees of freedom
## Residual deviance: 72528  on 65791  degrees of freedom
## AIC: 72564
## 
## Number of Fisher Scoring iterations: 9
testset$outcomes.revised <- predict(model.revised, newdata = testset, type = "response")

# classify as helpful if the predicted probability is at least 0.36
x <- 0.36
testset$glm.binary.revised <- 0
testset$glm.binary.revised[testset$outcomes.revised >= x] <- 1
mean(testset$glm.binary.revised == testset$helpful.binary) #0.7109102
## [1] 0.7109102

Introducing the year in which the review was written increases the accuracy of the model (0.7109102 versus 0.6853062). This is plausible: as the Amazon platform matured and its reviewer base grew, reviewing and voting behavior changed over time, so a review's year carries information about how likely it was to receive a helpful vote.
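Since the two models are nested and fit on the same data, the improvement can also be checked formally rather than only through classification accuracy:

# likelihood-ratio test and AIC comparison for the nested models
anova(model, model.revised, test = "Chisq")
AIC(model, model.revised) # lower AIC indicates the better fit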

Completed by: Pitchaya Liewchanpatana