As part of our data analytics course assignment, we were tasked with evaluating the sales of an SD card manufacturer using Amazon reviews. Our objective is to practice string manipulation and text mining, and to explore questions of competitive industry analysis using text analysis methods.
setwd("~/Desktop/OIDD245/Lab 5")
library(rvest)
library(ggplot2)
library(dplyr)
library(magrittr)
library(stringr)
library(tidyr)
library(tidytext)
library(syuzhet)
library(tm)
library(wordcloud)
library(SnowballC)
library(RColorBrewer)
# load in the data
reviews <- read.csv("electronics_downsample.csv")
colnames(reviews) <- tolower(colnames(reviews))
# aggregate the sdcard data
sdcard <- reviews %>%
  mutate(sd.detected = str_detect(reviewtext, "\\b(sd|SD)\\b")) %>%
  filter(sd.detected == TRUE) %>%
  group_by(asin) %>%
  summarize(total = n()) %>%
  arrange(desc(total))
head(sdcard, 3)
## # A tibble: 3 x 2
## asin total
## <fct> <int>
## 1 B007WTAJTO 576
## 2 B002WE6D44 214
## 3 B000VX6XL6 192
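One caveat on the regex: the pattern \b(sd|SD)\b only matches "sd" as a standalone word, so reviews that mention only "SDHC" or "SDXC" are missed. A case-insensitive alternative would look like this (a sketch, not what was used for the counts above):
# broader, case-insensitive match that also catches "sdhc"/"sdxc" (sketch)
str_detect(reviews$reviewtext, regex("\\bsd(hc|xc)?\\b", ignore_case = TRUE))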
The top 3 ASINs and their respective manufacturers are: B007WTAJTO (576 reviews), which belongs to SanDisk; B002WE6D44 (214), which belongs to Transcend; and B000VX6XL6 (192), which belongs to Kingston.
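To keep these labels attached to the data, a small lookup vector can be joined onto the counts (the manufacturer names come from searching each ASIN on Amazon):
# label the top ASINs with their manufacturers
makers <- c(B007WTAJTO = "SanDisk", B002WE6D44 = "Transcend",
            B000VX6XL6 = "Kingston")
head(sdcard, 3) %>% mutate(manufacturer = makers[as.character(asin)])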
What is the average overall number of stars for the three products you identified in the previous part?
top3 <- reviews %>%
  filter(asin %in% c("B007WTAJTO", "B002WE6D44", "B000VX6XL6"))
mean(top3$overall) #4.499221
## [1] 4.499221
top3.mean <- reviews %>%
  filter(asin %in% c("B007WTAJTO", "B002WE6D44", "B000VX6XL6")) %>%
  group_by(asin) %>%
  summarize(mean(overall))
head(top3.mean,3)
## # A tibble: 3 x 2
## asin `mean(overall)`
## <fct> <dbl>
## 1 B000VX6XL6 4.51
## 2 B002WE6D44 4.53
## 3 B007WTAJTO 4.48
The average overall rating across all reviews of the three products is 4.499221. Individually, the averages are B000VX6XL6 = 4.507160, B002WE6D44 = 4.527871, and B007WTAJTO = 4.481679.
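Note that the pooled figure weights every review equally, so it is pulled toward B007WTAJTO, which has by far the most reviews; the unweighted mean of the three product averages is slightly different:
# unweighted mean of the three product means (compare with the pooled 4.499)
mean(c(4.507160, 4.527871, 4.481679)) # 4.50557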
# average sentiment score for all top 3 products
top3sentiments <- as.vector(top3$reviewtext)
mean(get_sentiment(top3sentiments)) #1.260529
# average sentiment score for "B007WTAJTO"
B007sentiments <- as.vector(top3$reviewtext[top3$asin == "B007WTAJTO"])
mean(get_sentiment(B007sentiments)) #1.341498
# average sentiment score for "B002WE6D44"
B002sentiments <- as.vector(top3$reviewtext[top3$asin == "B002WE6D44"])
mean(get_sentiment(B002sentiments)) #1.103794
# average sentiment score for "B000VX6XL6"
B000sentiments <- as.vector(top3$reviewtext[top3$asin == "B000VX6XL6"])
mean(get_sentiment(B000sentiments)) #1.267422
The average sentiment score across the reviews of these 3 products is 1.26. For the individual products, the averages are 1.34 for "B007WTAJTO", 1.10 for "B002WE6D44", and 1.27 for "B000VX6XL6".
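These scores use get_sentiment()'s default "syuzhet" lexicon. As an optional robustness check, other lexicons bundled with the package can be swapped in (a sketch; the resulting values will differ from those above):
# compare alternative lexicons (the scores above use the default "syuzhet")
mean(get_sentiment(top3sentiments, method = "bing"))
mean(get_sentiment(top3sentiments, method = "afinn"))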
For this section, use only the reviews for the three product ASINs identified above. Use all the reviews for these products, not just the ones you identified in your regex match in Part 1. The goal of this section is to develop word clouds that summarize the review words most highly correlated with positive and negative scores for this product category.
Convert the relevant reviews into a text corpus using VCorpus (located in the tm package). Clean the reviews by eliminating “stopwords”, removing whitespace, and converting words to lowercase. You may also choose to make other adjustments, such as removing punctuation and numbers.
corp.original = VCorpus(VectorSource(top3$reviewtext))
corp = tm_map(corp.original, removePunctuation)
corp = tm_map(corp, removeNumbers)
corp = tm_map(corp, content_transformer(tolower), lazy = TRUE)
corp = tm_map(corp, content_transformer(stemDocument), lazy = TRUE)
corp = tm_map(corp, stripWhitespace)
# caveat: stopwords are removed after stemming here, so stemmed forms (e.g.
# "veri" for "very") no longer match the unstemmed stopword list and may survive
corp = tm_map(corp, content_transformer(removeWords),
              c(stopwords("english")), lazy = TRUE)
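A quick way to eyeball what the cleaning steps did is to print one review before and after:
# spot check: compare a raw review with its cleaned counterpart
as.character(corp.original[[1]])
as.character(corp[[1]])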
Generate a document-term matrix from this Corpus. Remove sparse terms using RemoveSparseTerms and a threshold for the sparsity parameter that leaves you with no more than 300 words. Then, attach a column of data to this matrix that includes overall star scores for each review.
dtm = DocumentTermMatrix(corp)
dim(dtm)
dtms = removeSparseTerms(dtm, .983) # 0.983 keeps just under 300 terms; the 300-word cutoff falls between 0.983 and 0.984
dim(dtms)
dtms_matrix = as.matrix(dtms)
dim(dtms_matrix)
# attach the overall star scores as an extra column; note that cbind() leaves
# the new column unnamed, which matters for the correlation step below
dtms_matrix2 <- cbind(dtms_matrix, top3$overall)
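To sanity-check the surviving vocabulary, tm's findFreqTerms() lists the terms above a frequency floor (the 1000 cutoff here is arbitrary):
# peek at terms appearing at least 1000 times across these reviews
findFreqTerms(dtms, lowfreq = 1000)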
Extract the 30 words that are most highly positively correlated with the number of stars, and the 30 words most negatively correlated with the number of stars.
# Correlate each word's counts with the star scores.
corr = cor(dtms_matrix2, top3$overall)
words <- rownames(corr)
corr2 <- as.data.frame(cbind(words, corr), stringsAsFactors = FALSE)
# cbind() coerced the correlations to character; convert back so that order()
# sorts numerically rather than lexicographically
corr2$V2 <- as.numeric(corr2$V2)
# The unnamed column appended by cbind() above is the star score itself, so its
# correlation is exactly 1; drop that row rather than hard-coding its index.
corr2 <- corr2[corr2$V2 != 1, ]
# top 30 positively correlated words
corr30positive = order(corr2$V2, decreasing = T)[1:30]
corr30positivewords = rownames(corr2)[corr30positive]
corr30positivewords
# top 30 negatively correlated words
corr30negative = order(corr2$V2, decreasing = F)[1:30]
corr30negativewords = rownames(corr2)[corr30negative]
corr30negativewords
## [1] "great"     "price"     "fast"      "space"
## [5] "perfect"   "high"      "room"      "storag"
## [9] "capac"     "plenti"    "love"      "anyon"
## [13] "ton"       "hold"      "easi"      "best"
## [17] "expand"    "nice"      "excel"     "movi"
## [21] "valu"      "fit"       "enough"    "extra"
## [25] "size"      "complaint" "worri"     "recommend"
## [29] "need"      "definit"
## [1] "well"     "flash"    "lot"      "full"     "qualiti"
## [6] "can"      "hero"     "soni"     "perform"  "hard"
## [11] "wrong"    "slot"     "download" "version"  "camcord"
## [16] "gig"      "avail"    "larg"     "memori"   "suppos"
## [21] "ever"     "nikon"    "run"      "note"     "sdhc"
## [26] "pic"      "chip"     "want"     "arriv"    "advertis"
Plot two wordclouds: One wordcloud for your list of positively correlated words and another wordcloud for your list of negatively correlated words. For each of these word clouds, the size of the words that appear in the cloud should be in proportion to the strength of the correlation between that word and the number of stars.
set.seed(111)
corrpositive <- corr2 %>%
  filter(V2 > 0)
wordcloud(words = corrpositive$words, freq = corrpositive$V2)
set.seed(111)
corrnegative <- corr2 %>%
  filter(V2 < 0) %>%
  mutate(V2.abs = abs(V2)) # wordcloud() needs non-negative weights
wordcloud(words = corrnegative$words, freq = corrnegative$V2.abs,
          scale = c(2.2, 0.01))
Put yourself in the shoes of Amazon’s platform designers. How do web platforms let reviewers know when they have written a good review? Amazon solves this problem by allowing reviews to be voted “Helpful” by other readers. This question asks you to build a predictive model for whether a review has at least one helpful vote or not.
First create a new binary variable indicating whether a review has at least one helpful vote.
# extract the first number in the helpful field (the count of helpful votes)
reviews$helpful.extracted <- str_extract(reviews$helpful, "\\d+")
reviews$helpful.extracted2 <- as.numeric(reviews$helpful.extracted)
# binary flag: 1 if the review received at least one helpful vote
reviews$helpful.binary <- ifelse(reviews$helpful.extracted2 > 0, 1, 0)
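A quick sanity check on the base rate of the new variable (this should sit near 40%, consistent with the roughly 60% all-zero baseline used in the evaluation below):
# share of reviews with at least one helpful vote
mean(reviews$helpful.binary)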
From the review text, create “features” that you can use in a predictive logistic regression model to predict your new binary dependent variable.
# feature 1: sentiment score of the review text
reviews <- reviews %>%
  mutate(text.vector = as.vector(reviewtext)) %>%
  mutate(sentiments = get_sentiment(text.vector))
# feature 2: number of words in a review (base R is sufficient here)
reviews <- reviews %>%
  mutate(reviewtextcounts = sapply(strsplit(text.vector, " "), length))
# feature 3: number of words in the summary tab
reviews <- reviews %>%
  mutate(summarycounts = sapply(strsplit(as.vector(summary), " "), length))
# feature 4: year of the review (the individual year dummies turn out to be
# insignificant, but the feature still improves accuracy; see below)
reviews$year <- str_sub(reviews$reviewtime, -4, -1)
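Before modeling, a glance at the engineered features helps catch obvious problems:
# quick look at the distributions of the engineered features
summary(reviews[, c("sentiments", "reviewtextcounts", "summarycounts")])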
Divide the data into training and testing sets, then estimate a logistic regression that predicts the binary helpfulness variable from these features.
# divide into training and testing sets
set.seed(1234)
train <- sample(nrow(reviews), 0.8 * nrow(reviews), replace = FALSE)
trainset <- reviews[train, ]
testset <- reviews[-train, ]
# build the logistic regression; note it is fit on the full data, matching the
# summary output below (fitting on trainset would avoid test-set leakage)
model = glm(helpful.binary ~ reviewtextcounts + sentiments, data = reviews,
            family = 'binomial')
summary(model)
##
## Call:
## glm(formula = helpful.binary ~ reviewtextcounts + sentiments,
## family = "binomial", data = reviews)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.9139 -0.8600 -0.7698 1.2047 1.7211
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2519773 0.0132278 -94.648 < 2e-16 ***
## reviewtextcounts 0.0061099 0.0001022 59.786 < 2e-16 ***
## sentiments 0.0193331 0.0039473 4.898 9.69e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 88108 on 65808 degrees of freedom
## Residual deviance: 79020 on 65806 degrees of freedom
## AIC: 79026
##
## Number of Fisher Scoring iterations: 5
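Since a train/test split was created above, a leakage-free variant would fit on the training rows only. A minimal sketch (the coefficients would differ slightly from the summary above):
# refit on the training set only, so the test set stays truly unseen (sketch)
model.train = glm(helpful.binary ~ reviewtextcounts + sentiments,
                  data = trainset, family = 'binomial')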
If you predicted that every row was a 0 (not helpful), you would be close to 60% accurate. How much better does your model do?
# baseline: predict 0 ("not helpful") for every review
testset$onlyzero <- 0
mean(testset$onlyzero == testset$helpful.binary) # 0.6084882
# compare to my model: predicted probabilities on the test set
testset$outcomes <- predict(model, newdata = testset, type = "response")
# classify as helpful if the predicted probability is at least 0.36
x <- 0.36
testset$glm.binary <- 0
testset$glm.binary[testset$outcomes >= x] <- 1
mean(testset$glm.binary == testset$helpful.binary) # 0.6853062
The model beats the all-zero baseline: its accuracy on the test set is 0.6853062, compared with 0.6084882 for predicting 0 everywhere.
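The 0.36 cutoff appears tuned rather than the conventional 0.5. A minimal sketch of how such a threshold could be chosen by sweeping candidates (note that tuning on the test set itself is optimistic; a separate validation split would be cleaner):
# sweep candidate cutoffs and keep the most accurate one (illustrative sketch)
cutoffs <- seq(0.05, 0.95, by = 0.01)
acc <- sapply(cutoffs, function(t)
  mean((testset$outcomes >= t) == testset$helpful.binary))
cutoffs[which.max(acc)]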
# revised model: add the year feature (again fit on the full data, matching
# the output below)
model.revised = glm(helpful.binary ~ year + reviewtextcounts + sentiments,
                    data = reviews, family = 'binomial')
summary(model.revised)
##
## Call:
## glm(formula = helpful.binary ~ year + reviewtextcounts + sentiments,
## family = "binomial", data = reviews)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.0598 -0.8186 -0.6112 0.9882 2.0595
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.438e+00 4.625e+01 0.204 0.838
## year2000 -7.045e+00 4.626e+01 -0.152 0.879
## year2001 -8.353e+00 4.625e+01 -0.181 0.857
## year2002 -8.412e+00 4.625e+01 -0.182 0.856
## year2003 -8.285e+00 4.625e+01 -0.179 0.858
## year2004 -7.806e+00 4.625e+01 -0.169 0.866
## year2005 -7.857e+00 4.625e+01 -0.170 0.865
## year2006 -8.212e+00 4.625e+01 -0.178 0.859
## year2007 -8.997e+00 4.625e+01 -0.195 0.846
## year2008 -9.411e+00 4.625e+01 -0.203 0.839
## year2009 -9.669e+00 4.625e+01 -0.209 0.834
## year2010 -9.854e+00 4.625e+01 -0.213 0.831
## year2011 -1.011e+01 4.625e+01 -0.219 0.827
## year2012 -1.050e+01 4.625e+01 -0.227 0.820
## year2013 -1.099e+01 4.625e+01 -0.238 0.812
## year2014 -1.147e+01 4.625e+01 -0.248 0.804
## reviewtextcounts 5.307e-03 1.029e-04 51.600 < 2e-16 ***
## sentiments 1.684e-02 4.108e-03 4.100 4.13e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 88108 on 65808 degrees of freedom
## Residual deviance: 72528 on 65791 degrees of freedom
## AIC: 72564
##
## Number of Fisher Scoring iterations: 9
testset$outcomes.revised <- predict(model.revised, newdata = testset,
                                    type = "response")
# classify as helpful if the predicted probability is at least 0.36
x <- 0.36
testset$glm.binary.revised <- 0
testset$glm.binary.revised[testset$outcomes.revised >= x] <- 1
mean(testset$glm.binary.revised == testset$helpful.binary) # 0.7109102
## [1] 0.7109102
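Accuracy alone hides where the errors fall; a confusion matrix for the revised model gives a fuller picture:
# cross-tabulate predicted vs. actual helpfulness on the test set
table(predicted = testset$glm.binary.revised,
      actual = testset$helpful.binary)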
By introducing the year in which the review was written, we increased the accuracy of the model from 0.6853 to 0.7109. This is plausible: as more reviewers joined the Amazon platform over time, review style and voting behavior changed, so the review year carries real predictive signal.