Background

This project is based on a data challenge hosted by Yelp. In this report you can see how I:
1. performed Natural Language Processing (NLP) to look at word frequency and gauge the mood behind reviews,
2. designed an algorithm to filter out the top 20 brunch spots in Pittsburgh, and
3. visualized the places on an interactive map.

Data Cleaning

Yelp provided a very thorough dataset covering 400 cities, split into 5 data files with every available attribute about businesses, users, reviews, check-ins and tips. The first step is to read in the files and narrow the dataset down to the city and the attributes we’re interested in.

library(jsonlite)  # stream_in() reads the line-delimited JSON files

# read the review and business files into data frames
review <- stream_in(file("yelp_academic_dataset_review.json"))
business <- stream_in(file("yelp_academic_dataset_business.json"))
# user data (csv version)
user <- read.csv("yelp_academic_dataset_user.csv")

Choosing a City

Why Pittsburgh? The thing is, even though there are more than 400 cities available, most of them have very few businesses in the dataset. We can take a look at the cities, and not surprisingly there aren’t many options. Vegas is a bigger city, but it’s not really THE place that comes to mind for brunch :) Pittsburgh’s fine, Pitt is cute.

library(plyr)  # count()
cities <- count(business$city)
cities[cities$freq >= 3000, ]
##              x  freq
## 63   Charlotte  5695
## 106  Edinburgh  3360
## 150  Henderson  3145
## 183  Las Vegas 19328
## 224       Mesa  3638
## 242   Montréal  4371
## 298    Phoenix 11852
## 307 Pittsburgh  3628
## 358 Scottsdale  5638
## 404      Tempe  3043

All Brunch Places in Pitt

Here’s a chunk of code showing how I filtered out the brunch spots using a few filters: city is “Pittsburgh”, categories include “Breakfast & Brunch”, and the place is still open. Then I merged the reviews of these places into a bigger data frame and renamed the variables. To keep the knitting process smooth I used the .rds files I saved earlier.

# narrow down to Pittsburgh, Breakfast & Brunch, and still-open places
pitt <- business[business$city == "Pittsburgh", ]
pos <- c()
for (i in 1:nrow(pitt)) {pos[i] <- "Breakfast & Brunch" %in% pitt$categories[[i]]}
pitt.brunch <- pitt[pos, ]
pitt.brunch <- pitt.brunch[pitt.brunch$open == TRUE, ]
save(pitt.brunch, file = "pitt.brunch.rds")
# merging reviews into the dataset
pitt.brunch.review <- merge(pitt.brunch, review, by = "business_id")
pbreview <- data.frame(pitt.brunch.review[c("name", "stars.x", "full_address", "longitude", "latitude", "review_count", "date", "stars.y", "text")], pitt.brunch.review$votes$useful)
names(pbreview) <- c("name", "avg.star", "full.address", "longitude", "latitude", "review.count", "review.date", "review.star", "text", "vote.useful")
save("pbreview", file = "pbreview.rds")

Now I have the dataset that I’ll be using throughout the project.

names(pbreview)
##  [1] "name"         "avg.star"     "full.address" "longitude"   
##  [5] "latitude"     "review.count" "review.date"  "review.star" 
##  [9] "text"         "vote.useful"

A Quick Look

library(ggmap)  # get_googlemap(), ggmap()

# one row per place: name, avg.star, full.address, longitude, latitude
pb <- unique(pbreview[1:5])
map <- get_googlemap("pittsburgh", zoom = 12, markers = data.frame(pb$longitude, pb$latitude), scale = 2, maptype = "roadmap")
ggmap(map, extent = 'device')

Here’s a map of all the brunch spots in Pittsburgh from the dataset. There are 61 pins on this map, and later on you’ll see how I narrow them down to 20 and visualize them on an interactive map.

NLP Text Analysis

Wordcloud on All 61 Brunch Places in Pitt

library(tm)            # Corpus, tm_map, DocumentTermMatrix
library(wordcloud)     # wordcloud()
library(RColorBrewer)  # brewer.pal() palettes

textall <- pbreview$text
textall <- iconv(textall, to = "utf-8-mac")
docs <- Corpus(VectorSource(textall))
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
docs <- tm_map(docs, removeWords, c('just', 'like', 'dont', 'get', 'one', 'amp', 'the', stopwords('english')))

# re-wrap the cleaned text in a fresh corpus before building the term matrix
docs <- Corpus(VectorSource(docs))

dtm <- DocumentTermMatrix(docs)

set.seed(19)
wordcloud(docs, max.words = 60, colors = brewer.pal(5, "Dark2"))

The 61 brunch places have 4375 reviews in total, each containing a paragraph of text. Here we’re using a word cloud to break the info down and see what’s in them. As we can see, the most frequently mentioned words are very brunch-ish: pancake, egg, good, coffee, order, nice... Nothing in particular, but it provides an overview of the words in these reviews.

Word Frequency (Ref here)

We see from the word cloud that these places share the same range of words - not hard to imagine, these are all brunch places, what would you expect? But from real-life experience we also have a vague idea that: this place has amazing eggs Benedict, that place is famous for its bacon hash.

Taking one diner, “Ritters Diner”, as an example:

ritters <- pbreview[pbreview$name == "Ritters Diner", ]
docs <- Corpus(VectorSource(ritters$text))
docs <- tm_map(docs, removeWords, stopwords('english'))
dtm <- DocumentTermMatrix(docs)   

freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)   
head(freq, 14) 
##      food       the     diner     place      good      like breakfast 
##       187       138       126        90        81        75        73 
##      just   ritters       get     great    always     night      late 
##        65        57        49        49        48        46        42
wf <- data.frame(word=names(freq), freq=freq)   
head(wf)  
##        word freq
## food   food  187
## the     the  138
## diner diner  126
## place place   90
## good   good   81
## like   like   75
p <- ggplot(subset(wf, freq>25), aes(word, freq))    
p + geom_bar(stat="identity") + theme(axis.text.x=element_text(angle=45, hjust=1))     

Again, not too much information, because most restaurants will have the same words repeated in their reviews. But what if we get rid of the words these restaurants have in common and see what’s left?
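
One way to do that (a rough sketch, not necessarily how the final top-20 ranking works) is to build one document per restaurant and weight the document-term matrix with tf-idf, which down-weights words that appear in every place's reviews. The names below (by.place, docs2, ritters.tfidf) are just for illustration.

# Sketch: down-weight the words all restaurants share, so each place's
# distinctive terms float to the top. Assumes pbreview is loaded and the
# tm package is attached; one "document" per restaurant.
by.place <- tapply(as.character(pbreview$text), pbreview$name, paste, collapse = " ")
docs2 <- Corpus(VectorSource(by.place))
docs2 <- tm_map(docs2, content_transformer(tolower))
docs2 <- tm_map(docs2, removePunctuation)
docs2 <- tm_map(docs2, removeWords, stopwords("english"))
dtm.tfidf <- DocumentTermMatrix(docs2, control = list(weighting = weightTfIdf))
# the highest-weighted (most distinctive) words for one place, e.g. Ritters Diner
ritters.tfidf <- as.matrix(dtm.tfidf)[which(names(by.place) == "Ritters Diner"), ]
head(sort(ritters.tfidf, decreasing = TRUE), 10)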

Sentiment Analysis (Ref here)

Sentiment analysis uses a naive Bayes algorithm to predict whether a piece of text carries a positive, neutral or negative emotion. Using sentiment analysis can help us determine the emotion within the reviews of each brunch place, and see whether it’s consistent with the place’s average rating in the Yelp database. Taking “Ritters Diner” as an example, we throw in the reviews and here’s what it returns.

library(sentiment)   # classify_emotion(), classify_polarity()
library(ggplot2)
library(gridExtra)   # grid.arrange()

textdata <- ritters$text
# naive Bayes emotion classifier; column 7 holds the best-fit emotion
class_emo = classify_emotion(textdata, algorithm="bayes", prior=1.0)
emotion = class_emo[,7]
emotion[is.na(emotion)] = "unknown"
# naive Bayes polarity classifier; column 4 holds the best-fit polarity
class_pol = classify_polarity(textdata, algorithm="bayes")
polarity = class_pol[,4]

sent_df = data.frame(text=textdata, emotion=emotion, polarity=polarity, stringsAsFactors=FALSE)
sent_df = within(sent_df, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))

p1 <- ggplot(sent_df, aes(x=emotion)) + geom_bar(aes(y=..count.., fill=emotion)) + 
  scale_fill_brewer(palette="Dark2") + labs(x="emotion categories", y="") # emotion 
p2 <- ggplot(sent_df, aes(x=polarity)) + geom_bar(aes(y=..count.., fill=polarity)) + 
  scale_fill_brewer(palette="RdGy") + labs(x="polarity categories", y="") # polarity
grid.arrange(p1, p2, ncol = 2, top = "Emotion and Polarity of Reviews for Diner 'Ritters Diner'")
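
To actually line the sentiment up against the Yelp ratings, one option (a sketch only, and slow since classify_polarity has to run over all 4000+ reviews) is to compute each place's share of reviews classified as positive and put it next to avg.star:

# Sketch: per-place share of "positive" reviews vs. the Yelp average star rating.
# Assumes the sentiment package is loaded and pbreview is available.
pbreview$polarity <- classify_polarity(as.character(pbreview$text), algorithm = "bayes")[, 4]
pos.share <- tapply(pbreview$polarity == "positive", pbreview$name, mean)
avg.star <- tapply(pbreview$avg.star, pbreview$name, mean)  # avg.star is constant within a place
head(data.frame(pos.share, avg.star)[order(-pos.share), ])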

The Interactive Map

I’m using GitHub to render a GeoJSON file built from the csv data I just saved. GitHub has a nice feature that presents a geojson file as an interactive map, which can be embedded by inserting a short piece of JavaScript.
For converting the csv to geojson, there’s a Python script here.
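
The Python script does the actual conversion; just to illustrate what the GeoJSON needs to look like, here is a rough equivalent in R using jsonlite (the top20 data frame and its column names are assumptions, standing in for whatever the saved csv contains):

# Sketch of the csv -> GeoJSON step, done in R instead of Python.
# Assumes a data frame `top20` with columns name, longitude and latitude.
library(jsonlite)
features <- lapply(seq_len(nrow(top20)), function(i) {
  list(type = "Feature",
       geometry = list(type = "Point",
                       # GeoJSON expects coordinates as [longitude, latitude]
                       coordinates = c(top20$longitude[i], top20$latitude[i])),
       properties = list(title = as.character(top20$name[i])))
})
geojson <- list(type = "FeatureCollection", features = features)
writeLines(toJSON(geojson, auto_unbox = TRUE, digits = 6), "top20.geojson")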