Background:

In this analysis, I investigate Transcend (a Taiwanese electronics company), whose sales for its SD (Secure Digital) card product on Amazon eroded from 2013 to 2015 (the "present"). This is especially interesting given that the product's star rating rose while sales were falling. Transcend has been the category sales leader on Amazon since it joined the platform in 2010. I aim to diagnose what may be going on in the Amazon ecosystem, starting with a competitive analysis using text analytics.

Data Source:

The data analyzed here come from Julian McAuley's curated Amazon Review Dataset (http://jmcauley.ucsd.edu/data/amazon/). That dataset contains over 1.6 million reviews of electronics products from May 1996 to July 2014, and its key fields include: i) the Amazon ASIN (i.e., product/product-family ID), ii) the review text, iii) the star rating, iv) the review date, and v) the helpfulness votes on the review.

Data Reading

#Set the working directory to where the data file lives
getwd()
setwd("C:/Users/fzy20/Downloads")
#Load the libraries used throughout the analysis
library(readr)      #fast CSV reading
library(wordcloud)  #word cloud visuals
library(stringr)    #string matching
library(dplyr)      #data manipulation
library(tm)         #text mining / corpus handling
library(syuzhet)    #sentiment scoring
library(ggplot2)    #plotting
#Read the downsampled electronics review data
electronics <- read_csv("electronics_downsample.csv")
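As a quick sanity check on the load, one might inspect the dimensions and field names (these match the head() output shown in Part 4):

#Confirm the number of reviews and the available fields
dim(electronics)
names(electronics)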

Part 1

To narrow the scope of the analysis as prefaced earlier, I am interested in reviews of SD cards, our manufacturer's and its top competitors'. While I do not know in advance which SD cards are the top-selling competitors, I can quickly identify candidates by searching for "SD" in the review text. Then, after isolating those candidates, I will use an Amazon search to find their product details.

Examine SD-card-related observations and cross-check the top ASINs against Amazon to identify the top SD card manufacturers

#Extract reviews that mention " sd " or " SD " (a deliberately simple pattern)
elec = electronics[str_detect(electronics$reviewText, " sd | SD "),] 
#Count the reviews per product (ASIN)
agg = elec %>% group_by(asin) %>% summarise(count = n())
#Order products by review count, most-reviewed first
agg_ordered = agg[order(agg$count,decreasing = TRUE),]
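To make the hand-off to the next part explicit, one can peek at the leaders; the three ASINs that surface here are the ones analyzed in Part 2:

#The three most-reviewed SD-card products
head(agg_ordered, 3)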

Part 2

For this section, I will use only the reviews for the three product ASINs identified above. I aim to answer the following questions:

A. What is the average overall star rating for each of the three products identified in the previous part?
B. What is the average sentiment score for the reviews of these three products?

Observe the average star ratings of the top three products by review count

#Condense the dataset to the three most-reviewed ASINs
condensed = electronics[electronics$asin %in% agg_ordered$asin[1:3],]
#Summarize the average star rating per product
avg_star = condensed %>% group_by(asin) %>% summarise(average_star = mean(overall))
avg_star
## # A tibble: 3 x 2
##   asin       average_star
##   <chr>             <dbl>
## 1 B000VX6XL6         4.51
## 2 B002WE6D44         4.53
## 3 B007WTAJTO         4.48
#Visualize the average star ratings
ggplot(aes(x = asin, y = average_star), data = avg_star) +
  geom_col() +
  geom_text(aes(label = average_star), hjust = 1.02, size = 5, color = "white") +
  theme_minimal() +
  coord_flip()

Observe the average review sentiment of the top three products by review count

#Extract the sentiment of each relevant review text
condensed$sentiment = sapply(condensed$reviewText, get_sentiment)
#Summarize the average review sentiment per product
avg_sent = condensed %>% group_by(asin) %>% summarise(average_sentiment = mean(sentiment))
avg_sent
## # A tibble: 3 x 2
##   asin       average_sentiment
##   <chr>                  <dbl>
## 1 B000VX6XL6              1.27
## 2 B002WE6D44              1.10
## 3 B007WTAJTO              1.34
#Visualize the average sentiment
ggplot(aes(x = asin, y = average_sentiment), data = avg_sent) +
  geom_col() +
  geom_text(aes(label = average_sentiment), hjust = 1.02, size = 5, color = "white") +
  theme_minimal() +
  coord_flip()

The star ratings are roughly the same across the board, but the average review sentiment of the second product (B002WE6D44) is noticeably lower than the other two.
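To check whether that gap holds up statistically, one could run a quick two-sample test on the sentiment column computed above (a sketch, not a definitive analysis):

#Compare B002WE6D44's review sentiment against the other two products
t.test(sentiment ~ (asin == "B002WE6D44"), data = condensed)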

Part 3

In this section, my goal is to build word clouds that summarize the review words most highly correlated with positive and negative star ratings for this product category.

Extracting the words that are most indicative of good and bad reviews

#Data transformation and cleaning
#Note: punctuation is stripped before stopword removal, so contractions such as
#"don't" become "dont" and slip past the SMART stopword list; likewise, lowercasing
#after stopword removal leaves capitalized stopwords behind
reviews = VCorpus(VectorSource(condensed$reviewText))
reviews <- tm_map(reviews, removePunctuation)
reviews <- tm_map(reviews, removeNumbers)
reviews <- tm_map(reviews, content_transformer(removeWords), stopwords("SMART"), lazy=TRUE)  
reviews <- tm_map(reviews, content_transformer(tolower), lazy=TRUE) 
reviews <- tm_map(reviews, content_transformer(removeWords), c("til")) 
reviews <- tm_map(reviews, stripWhitespace)
#DTM transformation: build the document-term matrix and drop very sparse terms
dtm = DocumentTermMatrix(reviews)
dtms = removeSparseTerms(dtm, .988)
dtms_matrix <- as.matrix(dtms)
dim(dtms_matrix)
## [1] 5138  284
#Append the star rating as column 285
dtms_matrix = cbind(dtms_matrix, condensed$overall)
#Correlation extraction: correlate each term (cols 1:284) with the star rating
correlation = cor(dtms_matrix[,c(-285)],dtms_matrix[,c(285)])
#Select the 30 words most positively and most negatively correlated with stars
top30 <- tail(correlation[order(correlation)],30)
top30_names <- tail(row.names(correlation)[order(correlation)],30)
bottom30 <- head(correlation[order(correlation)],30)
bottom30_names <- head(row.names(correlation)[order(correlation)],30)
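Before plotting, it can help to eyeball the extremes; a small check using the vectors just built:

#The ten words most negatively correlated with the star rating
data.frame(word = bottom30_names, correlation = round(bottom30, 3))[1:10, ]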

Part 3 Visuals

Words indicative of Good Reviews

#Visualizing words
wordcloud(words = top30_names,freq=top30)

Words indicative of Bad Reviews

#Visualize the words: flip the sign of the (negative) correlations so the most
#negative words get the largest weights, scaled down by 0.2 for display
wordcloud(words = bottom30_names, freq = bottom30*-0.2)

The word clouds are largely as expected: customers seem particularly happy with the price and speed of the SD cards, while complaints come mainly from functional problems (such as the card failing) and logistical problems (such as the card being lost).

Part 4

Now moving on to the general case: a crucial concern of Amazon's platform designers is how to let reviewers know when they have written a good review. Amazon addresses this by allowing readers to vote a review "Helpful." In this part I will build a predictive model for whether a review has at least one helpful vote. In the data, the number of helpful votes is the first of the two numbers in the helpful column.

Predicting review helpfulness from relevant information

#Binary variable creation: flag reviews with at least one helpful vote
head(electronics)
## # A tibble: 6 x 10
##       X1 asin  helpful overall reviewText reviewTime reviewerID
##    <dbl> <chr> <chr>     <dbl> <chr>      <chr>      <chr>     
## 1 175426 B000~ [0, 0]        5 Was deliv~ 04 27, 20~ AYK7KLKHF~
## 2 175427 B000~ [0, 0]        5 More than~ 09 28, 20~ A28OOZEJ1~
## 3 175428 B000~ [0, 0]        5 Nice pric~ 03 12, 20~ A2W8HPGBR~
## 4 175429 B000~ [0, 0]        5 This card~ 04 17, 20~ A2OMYXNS1~
## 5 175430 B000~ [0, 0]        4 I have ha~ 05 8, 2014 A3O1Y8Y31~
## 6 175431 B000~ [0, 1]        4 I made th~ 05 6, 2013 A1OXQAY95~
## # ... with 3 more variables: reviewerName <chr>, summary <chr>,
## #   unixReviewTime <dbl>
#Extract the first number (the helpful-vote count) from strings like "[0, 1]"
first_elem = function(x){
  return(str_match(x, regex("\\[(.*?)\\,"))[1,2])
}
electronics$helpful_single <- sapply(electronics$helpful, first_elem) 
#Convert to numeric before comparing, then flag reviews with at least one vote
electronics$helpful_binary <- as.numeric(electronics$helpful_single) > 0
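A quick look at the class balance of the new target (a one-line check; the split matters for interpreting accuracy later):

#How many reviews have at least one helpful vote?
table(electronics$helpful_binary)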

I will use the words from the reviews to make the prediction. The final number of words is the result of parameter testing that takes into account both accuracy and complexity.

#Data transformation on the whole dataset (same cleaning pipeline as Part 3)
reviews = VCorpus(VectorSource(electronics$reviewText))
reviews <- tm_map(reviews, removePunctuation)
reviews <- tm_map(reviews, removeNumbers)
reviews <- tm_map(reviews, content_transformer(removeWords), stopwords("SMART"), lazy=TRUE)  
reviews <- tm_map(reviews, content_transformer(tolower), lazy=TRUE) 
reviews <- tm_map(reviews, content_transformer(removeWords), c("til")) 
reviews <- tm_map(reviews, stripWhitespace)
#DTM
dtm = DocumentTermMatrix(reviews)
dtms = removeSparseTerms(dtm, .988)
dtms_matrix <- as.matrix(dtms)
dim(dtms_matrix)
## [1] 65809   683
#Combine the 683 term columns with the review metadata
frame <- data.frame(dtms_matrix)
electronics_prep <- cbind(frame, electronics)
#Use correlation for feature selection: correlate the terms (cols 1:683) and the
#star rating (col 687) with the helpful_binary target (col 695)
correlation = cor(electronics_prep[,c(1:683,687)],electronics_prep[,c(695)])
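#Optional check: the ten terms most positively correlated with a helpful vote
head(row.names(correlation)[order(correlation, decreasing = TRUE)], 10)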
#Data splitting: 80% train, 20% test
index = sample(nrow(electronics_prep),0.8*nrow(electronics_prep))
train = electronics_prep[index,]
test = electronics_prep[-index,]
#Try several feature-set sizes and report the best test accuracy for each
for(i in c(40,50,60,70,80,90)){
  #Select the i terms most positively correlated with helpfulness
  relevant = row.names(correlation)[order(correlation,decreasing = TRUE)][1:i]
  #Construct formula
  variable_text = paste(unlist(relevant), collapse = " + ")
  #Construct model: a linear probability model (gaussian glm), fit on the training set
  model = glm(paste("helpful_binary ~",variable_text), data=train)

  #Threshold choice: scan cutoffs from 0.01 to 1 on the test predictions
  prediction = predict(model,test)

  func_1 <- function(num){
    pred <- as.integer(prediction >= num)
    return(sum(pred==test$helpful_binary)/nrow(test))
  }
  num <- (1:100)/100

  result = sapply(num, func_1)

  #Keep the single best cutoff (which.max avoids ties yielding a vector)
  final = prediction > num[which.max(result)]
  accuracy = sum(final == test$helpful_binary)/nrow(test)
  #Accuracy
  print(paste("the model with ", i, " terms, has a max accuracy of", accuracy))
}
## [1] "the model with  40  terms, has a max accuracy of 0.6897887858988"
## [1] "the model with  50  terms, has a max accuracy of 0.691384288102112"
## [1] "the model with  60  terms, has a max accuracy of 0.692296003646862"
## [1] "the model with  70  terms, has a max accuracy of 0.694879197690321"
## [1] "the model with  80  terms, has a max accuracy of 0.695487008053487"
## [1] "the model with  90  terms, has a max accuracy of 0.696930557666008"

The model with 70 words appears to sit at the start of the accuracy plateau; adding more terms buys little accuracy while increasing complexity.
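To make the plateau visible, one could plot the printed accuracies against the number of terms (a sketch using the values reported above):

#Accuracies printed by the loop, by number of terms
acc_df = data.frame(terms = c(40, 50, 60, 70, 80, 90),
                    accuracy = c(0.6898, 0.6914, 0.6923, 0.6949, 0.6955, 0.6969))
ggplot(aes(x = terms, y = accuracy), data = acc_df) + geom_line() + geom_point() + theme_minimal()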

#Refit the final model with the chosen 70 terms
relevant = row.names(correlation)[order(correlation,decreasing = TRUE)][1:70]
#Construct formula
variable_text = paste(unlist(relevant), collapse = " + ")
#Construct model, fit on the full dataset
model = glm(paste("helpful_binary ~",variable_text), data=electronics_prep)
summary(model)
## 
## Call:
## glm(formula = paste("helpful_binary ~", variable_text), data = electronics_prep)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8130  -0.3362  -0.2915   0.5585   0.7387  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.282455   0.002398 117.775  < 2e-16 ***
## dont         0.016320   0.003345   4.879 1.07e-06 ***
## good         0.002987   0.002582   1.157 0.247224    
## time         0.007303   0.003140   2.325 0.020051 *  
## quality      0.019336   0.003244   5.961 2.52e-09 ***
## make         0.016685   0.004769   3.499 0.000468 ***
## find         0.019348   0.005103   3.792 0.000150 ***
## back         0.011287   0.003695   3.055 0.002253 ** 
## set          0.014838   0.003748   3.959 7.52e-05 ***
## features     0.039087   0.006443   6.067 1.31e-09 ***
## people       0.020496   0.006290   3.259 0.001120 ** 
## found        0.025992   0.005670   4.584 4.58e-06 ***
## thing        0.023425   0.004547   5.152 2.59e-07 ***
## doesnt       0.030370   0.005007   6.066 1.32e-09 ***
## reviews      0.050770   0.006066   8.369  < 2e-16 ***
## review      -0.001904   0.006118  -0.311 0.755595    
## point        0.018544   0.006462   2.870 0.004108 ** 
## sound        0.014493   0.002236   6.483 9.06e-11 ***
## camera       0.009599   0.001745   5.501 3.78e-08 ***
## ive          0.007308   0.003559   2.053 0.040056 *  
## screen       0.009908   0.002663   3.720 0.000199 ***
## small        0.021349   0.004375   4.880 1.06e-06 ***
## feature      0.005158   0.006341   0.813 0.415956    
## unit         0.021004   0.002694   7.797 6.44e-15 ***
## bit          0.008988   0.004575   1.964 0.049497 *  
## manual       0.038246   0.007259   5.269 1.38e-07 ***
## big          0.030157   0.006183   4.877 1.08e-06 ***
## lot         -0.004592   0.005119  -0.897 0.369676    
## high         0.014206   0.006053   2.347 0.018922 *  
## system       0.018319   0.003566   5.137 2.80e-07 ***
## give         0.022889   0.006683   3.425 0.000616 ***
## software     0.025722   0.003586   7.172 7.45e-13 ***
## amazon       0.034060   0.004326   7.873 3.51e-15 ***
## problem      0.020120   0.004512   4.459 8.26e-06 ***
## things      -0.001659   0.006447  -0.257 0.796911    
## work         0.005291   0.003354   1.578 0.114609    
## easy         0.016182   0.004070   3.976 7.03e-05 ***
## light        0.012738   0.004156   3.065 0.002176 ** 
## nice         0.010358   0.004188   2.473 0.013400 *  
## picture      0.033295   0.004740   7.025 2.17e-12 ***
## power        0.022183   0.003182   6.972 3.15e-12 ***
## turn         0.012234   0.006244   1.959 0.050068 .  
## youre        0.016567   0.005433   3.049 0.002295 ** 
## read         0.025430   0.005371   4.735 2.20e-06 ***
## support      0.047792   0.005416   8.824  < 2e-16 ***
## low          0.015808   0.006518   2.425 0.015301 *  
## full         0.012014   0.006346   1.893 0.058359 .  
## pretty      -0.004181   0.005156  -0.811 0.417425    
## video       -0.001430   0.003229  -0.443 0.657956    
## home         0.018229   0.005264   3.463 0.000534 ***
## settings    -0.011047   0.006630  -1.666 0.095657 .  
## didnt        0.004191   0.005190   0.808 0.419356    
## buy          0.023825   0.004273   5.575 2.48e-08 ***
## control      0.017612   0.006072   2.900 0.003727 ** 
## hours        0.026091   0.006072   4.297 1.73e-05 ***
## model        0.048242   0.006171   7.817 5.47e-15 ***
## image        0.004586   0.005957   0.770 0.441413    
## update       0.020961   0.006511   3.220 0.001285 ** 
## box          0.012026   0.004894   2.457 0.014002 *  
## great        0.003748   0.002600   1.441 0.149471    
## wanted       0.036530   0.006653   5.491 4.01e-08 ***
## lens         0.020368   0.001993  10.218  < 2e-16 ***
## experience   0.018360   0.008095   2.268 0.023337 *  
## made         0.022145   0.005461   4.055 5.02e-05 ***
## size         0.011785   0.005421   2.174 0.029703 *  
## makes        0.010044   0.006261   1.604 0.108682    
## setting     -0.007049   0.008461  -0.833 0.404794    
## button      -0.001940   0.004801  -0.404 0.686229    
## mode        -0.005553   0.005178  -1.073 0.283455    
## range        0.019429   0.006781   2.865 0.004168 ** 
## wont         0.020670   0.007100   2.911 0.003598 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.2134052)
## 
##     Null deviance: 15678  on 65808  degrees of freedom
## Residual deviance: 14029  on 65738  degrees of freedom
## AIC: 85185
## 
## Number of Fisher Scoring iterations: 2
#Examining the baseline model: always predict "no helpful votes"
all_zeroes = rep(0, nrow(test))
proxy_accuracy = sum(all_zeroes == test$helpful_binary)/nrow(test)
proxy_accuracy
## [1] 0.6065188

Wrapping up, my model outperforms the baseline (all-zero) model by roughly 9 percentage points in accuracy (about 0.695 versus 0.607).
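As a possible follow-up, a confusion matrix for the final 70-term model would show where the remaining errors fall. A sketch, assuming a 0.5 cutoff (the tuned threshold from the loop could be substituted), and noting that this final model was fit on the full dataset, so the test rows here are not strictly held out:

#Confusion matrix on the test set with an assumed 0.5 cutoff
pred70 <- as.integer(predict(model, test) >= 0.5)
table(predicted = pred70, actual = test$helpful_binary)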