Yelp Milestone Report on Exploratory Analysis: Capstone Project

Introduction

This report presents the exploratory analysis carried out during the fall 2015 session of the Coursera Data Science capstone project. The goal of the capstone project is to build a model that rates a business from user-supplied review text: “Write your tip, we rate for you”.

Consumer reviews on web pages and social media have a tremendous impact on business success, and they also reflect on the performance and effectiveness of employees. Customers look for a complete and satisfactory experience in terms of performance, quality of service, and many other features, and they read reviews about all of these. The content of customer reviews covers very diverse topics and is more informative than a numeric rating, although a numeric rating remains a valuable way to assess a business quickly.

In this project we try to answer the following questions:

How well can a review’s star rating be guessed from its text alone? How does the balance between positive and negative words in customers’ reviews differ across star ratings?

We start with data from Yelp. Once the text is cleaned, a DocumentTermMatrix is generated by applying quad-gram (4-gram) tokenization to the review dataset and removing all sparse terms. We then run Latent Dirichlet Allocation (LDA) and a Correlated Topics Model, using the document-term frequency matrix for each star rating as input, and obtain the most frequent terms for each topic within each star rating.

In part two of this project, we evaluate a dictionary-based approach to classifying review text as positive or negative in sentiment. We base the algorithm on the difference between the number of positive and negative words in the reviews for each star rating.

The data

The dataset is part of the Yelp Dataset Challenge, and the specific dataset used in this capstone corresponds to Round 6 of the challenge. The dataset consists of a set of JSON files that include business information, reviews, tips, and user information. Review objects list the star rating, the review text, the review date, the number of reviews that the business has received, and many other attributes.

To answer these questions, we selected a sample from the yelp academic dataset review file that includes the fields for the review text, the star rating, and the review ID. The review text forms the basic corpus of this project. We applied two techniques to create a document-term matrix as input. Preprocessing is critical for building the model, because frequencies and sentiment scores can be distorted by elements that make otherwise identical terms look different, such as extra spaces, punctuation, mixed lower and upper case, and articles.
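
The effect of preprocessing can be illustrated with a minimal sketch in base R. The helper name `normalize_text` and the toy strings are assumptions made for this illustration; the report itself relies on the tm functions shown later.

```r
# Hypothetical helper (not from the report): normalise raw review text with base R only.
normalize_text <- function(x) {
  x <- iconv(x, to = "ASCII", sub = "")   # drop characters that cannot be represented in ASCII
  x <- tolower(x)                         # fold case
  x <- gsub("[[:digit:]]+", "", x)        # remove numbers
  x <- gsub("[[:punct:]]+", "", x)        # remove punctuation
  x <- gsub("[[:space:]]+", " ", x)       # collapse runs of whitespace
  trimws(x)                               # strip leading and trailing spaces
}

# Toy inputs that differ only in case, spacing, punctuation and digits
raw <- c("Great  FOOD!!", "great food", "GREAT food, 10/10")
clean <- normalize_text(raw)
unique(clean)
```

Without normalisation the three strings would be counted as three different expressions; after it they collapse into a single one.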

# library(jsonlite)  # provides stream_in() for line-delimited JSON
# review<-'yelp_academic_dataset_review.json'
# review <- stream_in(file(review))
# save(review,file='review.RData')
# review1<-review[1:5000, ]
# save(review1,file='review1.RData')
# rm(review)

# business.file <- 'yelp_academic_dataset_business.json'
# business <- stream_in(file(business.file))
# save(business,file='business.RData')
# rm(business)

# checkin<- 'yelp_academic_dataset_checkin.json'
# checkin <- stream_in(file(checkin))
# save(checkin,file='checkin.RData')
# rm(checkin)

# tip<- 'yelp_academic_dataset_tip.json'
# tip <- stream_in(file(tip))
# save(tip,file='tip.RData')
# rm(tip)

# user<- 'yelp_academic_dataset_user.json'
# user <- stream_in(file(user))
# save(user,file='user.RData')
# rm(user)
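
The sample above takes the first 5,000 rows of the review file. As a hedged alternative, the sketch below draws a random sample instead, which avoids depending on the order of the file. The data frame `reviews_toy`, the seed, and the sample size are placeholders chosen for this illustration, not values from the report.

```r
set.seed(2015)  # arbitrary seed, only to make the sketch reproducible

# Toy stand-in for the full review data frame (assumption, not real Yelp data)
reviews_toy <- data.frame(
  stars = rep(1:5, times = 200),
  text  = paste("review", 1:1000),
  stringsAsFactors = FALSE
)

n_sample <- 100                              # hypothetical sample size
idx <- sample(nrow(reviews_toy), n_sample)   # row indices drawn without replacement
review_sample <- reviews_toy[idx, ]

table(review_sample$stars)                   # how many sampled reviews fall in each star rating
```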

Question 1 (Topic Modeling)

A DocumentTermMatrix was generated by applying quad-gram tokenization to the review text for each star rating. An implementation of the Latent Dirichlet Allocation model is provided in the topicmodels package. In general, a topic model discerns topics within a collection of review texts. The seven most frequent terms for each topic within each star rating gave us insight into how consumers express themselves at that rating, and into the diversity of positive and negative expressions used in the review text.
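
To make the tokenization step concrete, here is a minimal base R sketch of word 4-grams. The helper `word_ngrams` is hypothetical and only illustrates the idea; the analysis itself uses NGramTokenizer from RWeka.

```r
# Hypothetical helper (not from the report): word n-grams of a single string.
word_ngrams <- function(text, n = 4) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]  # split on anything that is not a letter or apostrophe
  words <- words[nzchar(words)]                      # drop empty tokens
  if (length(words) < n) return(character(0))        # too short to form an n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

word_ngrams("The food was really good and cheap")
# [1] "the food was really"   "food was really good"
# [3] "was really good and"   "really good and cheap"
```

Each 4-gram becomes one column of the document-term matrix, so longer n-grams capture phrases at the cost of a much sparser matrix.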

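library(tm)          # corpus handling, document-term matrices
library(RWeka)       # NGramTokenizer, Weka_control
library(topicmodels) # LDA
library(slam)        # rollup
library(stringr)     # str_detect
load('review1.RData')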
review1$stars<-as.factor(review1$stars)
streview<-data.frame(review1$stars, review1$text, review1$user_id)
names(streview)<-c("stars","text", "user_id")
s1treview<- streview[ streview$stars==1, ]
s2treview<- streview[ streview$stars==2, ]
s3treview<- streview[ streview$stars==3, ]
s4treview<- streview[ streview$stars==4, ]
s5treview<- streview[ streview$stars==5, ]

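# Despite its name, sentiment() fits a topic model: it cleans the review text,
# builds an n-gram document-term matrix, and returns the top terms per LDA topic.
sentiment <- function(data) {
  text <- as.character(data[, 2])

  # Normalise the raw text before tokenization
  textfun <- function(text, lowercase = TRUE, numbers = TRUE, punctuation = TRUE, spaces = TRUE) {
    text <- iconv(text, to = "ASCII", sub = "")  # drop non-ASCII characters
    if (lowercase) text <- tolower(text)
    if (numbers) text <- removeNumbers(text)
    if (punctuation) text <- removePunctuation(text)
    if (spaces) text <- stripWhitespace(text)
    text
  }
  text <- textfun(text)

  options(mc.cores = 1)  # run the RWeka tokenizer on a single core
  # min = 4, max = 5 produces both 4-grams and 5-grams
  NgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 5, delimiters = " \\r\\n\\t.,;:\"()?!"))
  dtm.streview.fourgram <- DocumentTermMatrix(
    VCorpus(VectorSource(text), readerControl = list(language = "en")),
    control = list(tokenize = NgramTokenizer, wordLengths = c(3, Inf), tolower = FALSE,
                   stemming = TRUE, bounds = list(global = c(5, Inf)), stopwords = TRUE))
  rownames(dtm.streview.fourgram) <- as.character(data[, 3])  # label documents by user_id

  # Drop documents that are left without any term
  doc.freq <- rollup(dtm.streview.fourgram, 2, FUN = sum)
  is.empty <- as.vector(doc.freq == 0)
  dtm.streview.fourgram <- dtm.streview.fourgram[!is.empty, ]

  lda_out <- LDA(dtm.streview.fourgram, 3)  # 3 topics
  result <- terms(lda_out, 7)               # 7 x 3 matrix: top 7 terms per topic

  # Long format. "word" holds the topic label and "word.freq" the term; each
  # topic label is repeated once per term so that labels and terms line up.
  word.result <- data.frame(word = rep(colnames(result), each = nrow(result)),
                            word.freq = as.vector(result), stringsAsFactors = FALSE)
  word.result
}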

star1review<-sentiment(s1treview)
star2review<-sentiment(s2treview)
star3review<-sentiment(s3treview)
star4review<-sentiment(s4treview)
star5review<-sentiment(s5treview)

Figure 1. A topic model for one-star reviews. Only the 7 most frequent terms are shown for each topic. Colors in the chart show the relative in-topic weight of each term; the largest element is the most frequent term.

Figure 2. A topic model for two-star reviews. Only the 7 most frequent terms are shown for each topic. Colors in the chart show the relative in-topic weight of each term; the largest element is the most frequent term.

Figure 3. A topic model for three-star reviews. Only the 7 most frequent terms are shown for each topic. Colors in the chart show the relative in-topic weight of each term; the largest element is the most frequent term.

Figure 4. A topic model for four-star reviews. Only the 7 most frequent terms are shown for each topic. Colors in the chart show the relative in-topic weight of each term; the largest element is the most frequent term.

Figure 5. A topic model for five-star reviews. Only the 7 most frequent terms are shown for each topic. Colors in the chart show the relative in-topic weight of each term; the largest element is the most frequent term.

Question 2 (Sentiment analysis)

To summarize the overall sentiment of the review text, for each star rating we calculate the difference between the number of positive and negative terms; texts that contain no positive or negative words were excluded from the analysis. For the sentiment analysis we use the dictionary provided by Hu and Liu (2004) and Liu et al. (2005), the Opinion Lexicon (English). It consists of two lists, one of positive and one of negative terms, with several thousand entries between them.

In a regular term-document matrix, each cell holds the frequency of a term in a text. Setting the control option weighting of the function TermDocumentMatrix() to weightBin instead records only whether a term occurs in a text. We match the terms of each review against the sentiment dictionary presented above and calculate the difference between the number of positive and negative terms.
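
The scoring rule can be sketched in base R on toy data. The word lists, the example reviews, and the helper `sentiment_diff` are assumptions made for this illustration; the analysis itself uses the full opinion lexicon and a binary term-document matrix.

```r
# Toy stand-ins for the positive and negative lexicon lists (assumption)
pos_words <- c("good", "great", "friendly")
neg_words <- c("bad", "slow", "rude")

reviews_toy <- c("great food and friendly staff, great prices",
                 "slow service and rude staff",
                 "good food but slow service")

# Hypothetical helper: positive minus negative term count for one review.
# unique() mimics binary weighting: a repeated term is counted once.
sentiment_diff <- function(text, pos, neg) {
  terms <- unique(strsplit(tolower(text), "[^a-z]+")[[1]])
  sum(terms %in% pos) - sum(terms %in% neg)
}

scores <- vapply(reviews_toy, sentiment_diff, numeric(1),
                 pos = pos_words, neg = neg_words, USE.NAMES = FALSE)
scores
# [1]  2 -2  0
```

In the first review "great" appears twice but contributes only once, which is the effect of binary weighting.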

# Per-review sentiment score: number of positive minus number of negative terms
stareview <- function(data) {
  text <- as.character(data[, 2])
  reviews <- VCorpus(VectorSource(text))
  # Lowercase first so that capitalised stopwords are removed too;
  # content_transformer() keeps each document a text document
  reviews <- tm_map(reviews, content_transformer(tolower))
  reviews <- tm_map(reviews, removeNumbers)
  reviews <- tm_map(reviews, removePunctuation)
  reviews <- tm_map(reviews, removeWords, words = stopwords("en"))
  reviews <- tm_map(reviews, stemDocument, language = "english")

  # Opinion lexicon files; lines starting with ";" are header comments
  pos <- readLines("positive-words.txt")
  pos <- pos[!str_detect(pos, "^;")]
  neg <- readLines("negative-words.txt")
  neg <- neg[!str_detect(neg, "^;")]

  # Stem the lexicon so it matches the stemmed reviews, then drop duplicates
  pos <- stemDocument(pos, language = "english")
  pos <- pos[!duplicated(pos)]
  neg <- stemDocument(neg, language = "english")
  neg <- neg[!duplicated(neg)]

  # Binary weighting: 1 if a term occurs in a review, 0 otherwise
  tdm.reviews.bin <- TermDocumentMatrix(reviews, control = list(weighting = weightBin))
  # Keep terms that occur in at least about 3 reviews
  tdm.reviews.bin <- removeSparseTerms(tdm.reviews.bin, 1 - (3 / length(reviews)))

  pos.mat <- tdm.reviews.bin[rownames(tdm.reviews.bin) %in% pos, ]
  neg.mat <- tdm.reviews.bin[rownames(tdm.reviews.bin) %in% neg, ]
  pos.out <- apply(pos.mat, 2, sum)  # positive terms per review
  neg.out <- apply(neg.mat, 2, sum)  # negative terms per review

  senti.diff <- pos.out - neg.out
  senti.diff
}

star1review<-stareview(s1treview)
star2review<-stareview(s2treview)
star3review<-stareview(s3treview)
star4review<-stareview(s4treview)
star5review<-stareview(s5treview)

summary(star1review);summary(star2review);summary(star3review);summary(star4review);summary(star5review)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -6.000   0.000   2.000   2.207   4.000  22.000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -3.000   1.000   3.000   3.327   5.000  16.000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -3.000   2.000   4.000   3.938   6.000  20.000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -4.000   2.000   3.000   3.994   6.000  19.000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -2.000   2.000   3.000   4.043   6.000  21.000

Figure 6. Violin plots of the estimated difference between positive and negative sentiment in the review text, by star rating.

Conclusions

Loading the Yelp dataset takes a long time because of the large file size, so it was necessary to work with a sample of the data. Sampling reduces the accuracy of the estimates for the specific terms used in the reviews of each star rating, although the algorithm and functions are the same for any file size. The results give us a hint of the sentiment words and phrases most likely to appear in reviews of each star rating, and lead us to use these algorithms to build the model for predicting stars from text alone.

The next step of the project is to build the model for the final phase.