Predicting the locale of businesses from Yelp reviews

The slide deck presentation for this project is at this rpubs link.

Title

This project attempts to predict the location of a business from the text of its Yelp reviews. Natural language processing techniques were applied to the review text so that supervised learning could be carried out on the resulting corpus. The goal was to predict, from the text alone, the state in which the business is located. With the methods used here, the classifier was moderately successful, predicting the state with 62% accuracy.

Introduction

With thousands of reviews available from different locations, I was curious whether basic natural language processing and machine learning could predict where a business is located from its reviews. The assumption is that reviewers in different locations write about the same business categories in slightly different ways. Although not attempted in this project, it would be interesting to study what those differences are.

Methods and Data

Data for this project came from two files, yelp_academic_dataset_review.json and yelp_academic_dataset_business.json. The two files were joined on the business ID (bid) and read into a data frame, summarized below.

str(review.df)

## 'data.frame':    833021 obs. of  5 variables:
##  $ state   : Factor w/ 22 levels "AZ","BW","CA",..: 1 1 1 1 1 16 16 16 16 16 ...
##  $ city    : Factor w/ 348 levels "1023 E Frye Rd",..: 231 231 231 231 231 83 20 20 238 238 ...
##  $ category: Factor w/ 426 levels "Accessories",..: 125 125 125 125 125 282 3 3 363 363 ...
##  $ bid     : Factor w/ 55725 levels "000i-lkjp-wsnk5s6z3s2Q",..: 47255 47255 47255 47255 47255 46369 13213 13213 23352 23352 ...
##  $ review  : Factor w/ 832429 levels ":)",":) :) :)",..: 113552 733752 113551 163543 329476 91506 100259 535373 603929 20705 ...
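The join described above can be sketched with base R's merge. This is only an illustration on toy data: the two small data frames and their column contents are assumptions standing in for the parsed JSON files, not the actual loading code used in the project.

```r
# Toy stand-ins for the two Yelp files (hypothetical rows and values).
business <- data.frame(
  bid      = c("b1", "b2", "b3"),
  state    = c("AZ", "QC", "BW"),
  city     = c("Phoenix", "Montreal", "Karlsruhe"),
  category = c("Restaurants", "Bars", "Restaurants"),
  stringsAsFactors = FALSE
)
review <- data.frame(
  bid    = c("b1", "b1", "b2", "b3"),
  review = c("Great tacos", "Slow service", "Bonne ambiance", "Sehr lecker"),
  stringsAsFactors = FALSE
)

# Inner join on the shared business id: one row per review,
# each carrying the state, city and category of its business.
toy.review.df <- merge(business, review, by = "bid")
str(toy.review.df)
```

An inner join drops reviews whose business is missing from the business file, which is usually the desired behaviour when the state label is required for every row.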

From this data, reviews from one state in each of four countries were selected: AZ (US), QC (Canada), EDH (UK) and BW (Germany). Because BW contained only 342 reviews, only the business categories that occur in BW were kept for all four states. All the reviews of each business were then concatenated, so that each business forms one document (line item) in the analysis.

review.df <- subset(review.df, state=='AZ'|state=='BW'| state=='EDH'|state=='QC')

uniqueCategories <- unique(subset(review.df,state=='BW')[,'category'])

review.df <- review.df[review.df$category %in% uniqueCategories,]

review.df <- droplevels(review.df)
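The concatenation of reviews per business is described above but not shown in the code. A minimal sketch, assuming a data frame with bid, state and review columns (the toy rows and the name docs are hypothetical), could look like this:

```r
# Toy data: two reviews for business b1, one for b2 (hypothetical values).
toy <- data.frame(
  bid    = c("b1", "b1", "b2"),
  state  = c("AZ", "AZ", "QC"),
  review = c("Great tacos.", "Slow service.", "Bonne ambiance."),
  stringsAsFactors = FALSE
)

# Collapse all reviews of a business into a single string, keeping its state.
per.business <- aggregate(review ~ bid + state, data = toy,
                          FUN = paste, collapse = " ")

# One document per business, ready to be turned into a corpus.
docs <- per.business$review
```

A character vector like docs is the kind of input that VectorSource expects when the corpus is built in the next step.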

The R package tm was then used to convert the text to lower case, remove stop words and punctuation, and stem the words. The resulting corpus was then tokenized into bigrams (pairs of adjacent words), which capture commonly occurring word pairs. The results are stored in a DocumentTermMatrix, the sparse matrix class provided by tm.

library(tm)
dat.tm <- Corpus(VectorSource(list.vector.words))  # make a corpus
dat.tm <- tm_map(dat.tm, content_transformer(tolower))  # convert all words to lowercase
options(mc.cores=1)  # use a single core
# remove stop words for the three languages that occur in the reviews
dat.tm <- tm_map(dat.tm, removeWords, stopwords('english'))
dat.tm <- tm_map(dat.tm, removeWords, stopwords('french'))
dat.tm <- tm_map(dat.tm, removeWords, stopwords('german'))
# remove a few extra words that are not informative in a corpus of reviews
dat.tm <- tm_map(dat.tm, removeWords, words=c("yelp","review","reviews","rezension","revue","like","aimer",'mögen'))
dat.tm <- tm_map(dat.tm, removePunctuation)  # remove punctuation
dat.tm <- tm_map(dat.tm, stripWhitespace)  # remove extra white space
dat.tm <- tm_map(dat.tm, stemDocument)  # reduce words to their stems
dat.tm <- tm_map(dat.tm, PlainTextDocument)  # coerce back to plain text documents

require(RWeka)

BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
# create the document-term matrix
datmat<- DocumentTermMatrix(dat.tm, control = list(tokenize = BigramTokenizer))

summary(datmat)

##          Length   Class  Mode   
## i        50442286 -none- numeric
## j        50442286 -none- numeric
## v        50442286 -none- numeric
## nrow            1 -none- numeric
## ncol            1 -none- numeric
## dimnames        2 -none- list
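To make the idea of a bigram concrete, the sketch below builds bigrams from a single string using base R only. It is an illustration, not a replacement for the RWeka tokenizer used above; the function name bigrams is hypothetical.

```r
# Build bigrams (pairs of adjacent words) from a whitespace-separated string.
bigrams <- function(text) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  words <- words[nzchar(words)]            # drop empty tokens
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))  # word i paired with word i + 1
}

bigrams("great happy hour specials")
# "great happy" "happy hour" "hour specials"
```

Each bigram becomes one column of the document-term matrix, which is why the matrix grows so large and so sparse.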

This matrix contains many sparsely used terms. To remove them, the column sums of the matrix were computed and only columns with a sum greater than four were kept. findFreqTerms lists the terms whose total frequency falls between a lower and an upper bound.

findFreqTerms(datmat.out, highfreq=10)[100:120]

##  [1] "actual tast"    "actual use"     "actual want"    "add extra"     
##  [5] "add nice"       "ad extra"       "ad nice"        "adult pool"    
##  [9] "age group"      "ago friend"     "ago now"        "ahead order"   
## [13] "ahead time"     "aim bien"       "air condit"     "ale tap"       
## [17] "allow us"       "almond milk"    "almost empti"   "almost everyth"
## [21] "almost forgot"

findFreqTerms(datmat.out, lowfreq=100)

##  [1] "5 star"         "can get"        "come back"      "custom servic" 
##  [5] "even though"    "everi time"     "first time"     "food good"     
##  [9] "go back"        "good food"      "great place"    "happi hour"    
## [13] "high recommend" "ice cream"      "last night"     "love place"    
## [17] "make sure"      "next time"      "one best"       "place go"      
## [21] "pretti good"    "pretti much"    "realli good"    "reason price"  
## [25] "staff friend"

As a side note, it is interesting to observe that the high frequency terms convey mostly positive sentiment.
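The column-sum filter mentioned earlier in this section does not appear in the code. A minimal sketch on a small dense matrix is given here; the matrix values and the names m, keep and m.out are hypothetical. For a real DocumentTermMatrix, slam::col_sums is the sparse counterpart of colSums and avoids converting the matrix to dense form.

```r
# Toy document-term matrix: 3 documents (rows) by 3 bigrams (columns).
m <- matrix(c(3, 0, 1,
              2, 0, 0,
              1, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("happi hour", "ale tap", "air condit")))

# Keep only the terms whose total count across documents is greater than four.
keep  <- colSums(m) > 4
m.out <- m[, keep, drop = FALSE]   # drop = FALSE keeps the matrix shape

colnames(m.out)
# "happi hour"
```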

Results

The commonly used text classification algorithms SVM and k-nearest neighbors (knn) failed to find any pattern in the data; knn assigned every observation to a single class. I then decided to use random forests, an ensemble learning method, to see whether it could classify the data into one of the four states. After shuffling the rows, the first 2,200 observations were used as the training set and rows 2200 to 3800 (1,601 observations) as the test set.

library(randomForest)
dat.mat <- dat.mat[sample(nrow(dat.mat)),]
dat.df <- as.data.frame(dat.mat)

dat.df$state <- lapply(dat.df$state, as.factor)
dat.df$state <- unlist(dat.df$state)

rfTrain <- randomForest(x=dat.df[1:2200,-1], y= dat.df[1:2200,1])
rfPredict <- predict(rfTrain,dat.df[2200:3800,-1] )
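The index ranges 1:2200 and 2200:3800 above share row 2200, so one observation is used for both training and testing. The effect on the reported accuracy is tiny, but a split with no shared rows is easy to write. The sketch below assumes 3,800 rows and uses a hypothetical seed; it only builds the indices and does not refit the model.

```r
set.seed(42)                 # hypothetical seed, for reproducibility
n   <- 3800                  # assumed number of rows, matching the ranges above
idx <- sample(n)             # random permutation of the row numbers

train.idx <- idx[1:2200]
test.idx  <- idx[2201:n]     # starts at 2201, so no row is shared

length(intersect(train.idx, test.idx))
# 0
```

The two index vectors can then be used in place of the fixed ranges, for example dat.df[train.idx, -1] for the training predictors.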

The results follow. The state codes are 1 = AZ (US), 2 = BW (Germany), 3 = EDH (UK), 4 = QC (Canada). Rows of the table are the random forest predictions and columns are the actual states.

rfTable <- table(RF = rfPredict, "actual"=  dat.df[2200:3800,1])
#confusion matrix
rfTable

##    actual
## RF    1   3   4   2
##   1 583 159 293  52
##   3  33 208  23   0
##   4  25  23 171   3
##   2   0   1   0  27
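The classes appear in the order 1, 3, 4, 2 rather than sorted. A likely explanation is the lapply and unlist conversion used when preparing the state column: when unlist combines a list of one-element factors, the levels are taken in order of first appearance. The toy vector x below is hypothetical and only demonstrates that behaviour.

```r
x <- c(1, 3, 1, 4, 2)

# Converting element by element and then unlisting keeps first-seen order.
levels(unlist(lapply(x, as.factor)))
# "1" "3" "4" "2"

# Converting the whole vector at once gives sorted levels.
levels(factor(x))
# "1" "2" "3" "4"
```

The ordering does not affect the accuracy, since the diagonal of the table still pairs each predicted class with the same actual class.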

sum(diag(rfTable))/nrow(dat.df[2200:3800,-1])

## [1] 0.6177389

The result above shows an overall accuracy of about 62% (989 of the 1,601 test observations were classified correctly).
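Overall accuracy hides how unevenly the classes are predicted. The sketch below re-enters the confusion matrix printed above and derives a majority-class baseline and the per-class recall from it. The variable names are hypothetical; the counts are copied from the table.

```r
lv <- c("1", "3", "4", "2")   # class order as printed in the table
cm <- matrix(c(583, 159, 293, 52,
                33, 208,  23,  0,
                25,  23, 171,  3,
                 0,   1,   0, 27),
             nrow = 4, byrow = TRUE,
             dimnames = list(RF = lv, actual = lv))

accuracy <- sum(diag(cm)) / sum(cm)       # correct predictions / all predictions
baseline <- max(colSums(cm)) / sum(cm)    # always guessing the largest class
recall   <- diag(cm) / colSums(cm)        # share of each actual class recovered

round(c(accuracy = accuracy, baseline = baseline), 3)
round(recall, 2)
```

On these counts, always predicting AZ would score about 40%, so the random forest improves on that baseline by roughly 22 percentage points. Recall is about 91% for AZ, 53% for EDH, 35% for QC and 33% for BW, which shows that most of the errors come from other states being predicted as AZ.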

Discussion

There seems to be enough variation in the reviews to predict the location of businesses from them. Bigram analysis appears to be a reasonable way of representing human language for this task. With the methods used here, the single classifiers (SVM and knn) were unable to classify the reviews, and even with ensemble learning the accuracy is only moderate. However, the result still shows that such a classification is possible, and more advanced language processing methods may improve the accuracy further. It would be interesting to find out the actual differences that make these classifications possible.