This project is based on a data challenge hosted by Yelp. In this report you can see how I:
1. performed Natural Language Processing (NLP) to examine word frequencies and determine the mood behind reviews,
2. designed an algorithm to filter out the top 20 brunch spots in Pittsburgh, and
3. visualized the places on an interactive map.
Yelp provided a very thorough dataset covering 400 cities, with five data files holding every available attribute of the businesses, users, reviews, check-ins and tips. The first step is to read in the files and narrow the dataset down to the city and attributes we're interested in.
# stream_in() comes from the jsonlite package
library(jsonlite)
review <- stream_in(file("yelp_academic_dataset_review.json"))
business <- stream_in(file("yelp_academic_dataset_business.json"))
user <- read.csv("yelp_academic_dataset_user.csv")
Why Pittsburgh? The thing is, even though there are more than 400 cities available, most of them have very few businesses in the dataset. We can take a look at the cities, and not surprisingly there aren't many options. Las Vegas is bigger, but it's not really THE place that comes to mind for brunch :) Pittsburgh works fine, and Pitt is cute.
# count() comes from the plyr package
cities <- plyr::count(business$city)
cities[cities$freq >= 3000, ]
## x freq
## 63 Charlotte 5695
## 106 Edinburgh 3360
## 150 Henderson 3145
## 183 Las Vegas 19328
## 224 Mesa 3638
## 242 Montréal 4371
## 298 Phoenix 11852
## 307 Pittsburgh 3628
## 358 Scottsdale 5638
## 404 Tempe 3043
Here’s a chunk of code showing how I filtered out the brunch spots using a few filters: city is "Pittsburgh", category includes "Breakfast & Brunch", and the business is still open. Then I merged the reviews of these places into a bigger data frame and renamed the variables. To keep the knitting process smooth I load RDS files I saved earlier.
# narrow down to Pittsburgh businesses in the "Breakfast & Brunch" category that are still open
pitt <- business[business$city == "Pittsburgh", ]
pos <- sapply(pitt$categories, function(cats) "Breakfast & Brunch" %in% cats)
pitt.brunch <- pitt[pos, ]
pitt.brunch <- pitt.brunch[pitt.brunch$open == TRUE, ]
save(pitt.brunch, file = "pitt.brunch.rds")
# merge the reviews of these places into one data frame and keep the columns we need
pitt.brunch.review <- merge(pitt.brunch, review, by = "business_id")
pbreview <- data.frame(pitt.brunch.review[c("name", "stars.x", "full_address", "longitude", "latitude", "review_count", "date", "stars.y", "text")], pitt.brunch.review$votes$useful)
names(pbreview) <- c("name", "avg.star", "full.address", "longitude", "latitude", "review.count", "review.date", "review.star", "text", "vote.useful")
save(pbreview, file = "pbreview.rds")
Now I have the dataset that I’ll be using throughout the project.
names(pbreview)
## [1] "name" "avg.star" "full.address" "longitude"
## [5] "latitude" "review.count" "review.date" "review.star"
## [9] "text" "vote.useful"
# get_googlemap() and ggmap() come from the ggmap package (a Google Maps API key may be required)
pb <- unique(pbreview[1:5])
map <- get_googlemap("pittsburgh", zoom = 12, markers = data.frame(pb$longitude, pb$latitude), scale = 2, maptype = "roadmap")
ggmap(map, extent = 'device')
Here’s a map of all the brunch spots in Pittsburgh from the dataset. There are 61 pins on this map; later on you’ll see how I narrow them down to 20 and visualize them on an interactive map.
# text mining and word cloud packages
library(tm)
library(wordcloud)
library(RColorBrewer)
textall <- pbreview$text
textall <- iconv(textall, to = "utf-8-mac")
docs <- Corpus(VectorSource(textall))
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
# drop English stop words plus a few frequent but uninformative tokens
docs <- tm_map(docs, removeWords, c('just', 'like', 'dont', 'get', 'one', 'amp', 'the', stopwords('english')))
docs <- Corpus(VectorSource(docs))
dtm <- DocumentTermMatrix(docs)
set.seed(19)
wordcloud(docs, max.words = 60, colors = brewer.pal(5, "Dark2"))
The 61 brunch places have 4375 reviews in total, each containing a paragraph of text. Here we’re using a word cloud to break the text down and see what’s in it. The most frequently mentioned words are very brunch-ish: pancake, egg, good, coffee, order, nice... Nothing in particular, but it provides an overview of the vocabulary in these reviews.
We see from the word cloud that these places share much the same vocabulary - not hard to imagine, since they are all brunch places, so what would you expect? But from real-life experience we also have the vague idea that this place has amazing eggs Benedict, while that place is famous for its bacon hash.
# word frequencies for a single place: Ritters Diner
ritters <- pbreview[pbreview$name == "Ritters Diner", ]
docs <- Corpus(VectorSource(ritters$text))
docs <- tm_map(docs, removeWords, stopwords('english'))
dtm <- DocumentTermMatrix(docs)
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
head(freq, 14)
## food the diner place good like breakfast
## 187 138 126 90 81 75 73
## just ritters get great always night late
## 65 57 49 49 48 46 42
wf <- data.frame(word=names(freq), freq=freq)
head(wf)
## word freq
## food food 187
## the the 138
## diner diner 126
## place place 90
## good good 81
## like like 75
p <- ggplot(subset(wf, freq>25), aes(word, freq))
p + geom_bar(stat="identity") + theme(axis.text.x=element_text(angle=45, hjust=1))
Again, not much information, because most restaurants have the same words repeated throughout their reviews. But what if we got rid of the words these restaurants have in common and looked at what's left?
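One way to do that (a sketch of a possible approach, not the exact method used in this report) is to down-weight words that appear in every place's reviews with TF-IDF, so that restaurant-specific terms float to the top. The per-place aggregation below is an assumption built from the pbreview data frame defined earlier.
# A sketch: collapse each place's reviews into one document, then use TF-IDF
# weighting so that words shared by all brunch spots are down-weighted.
library(tm)
by.place <- aggregate(text ~ name, data = pbreview, FUN = paste, collapse = " ")
docs.place <- Corpus(VectorSource(by.place$text))
docs.place <- tm_map(docs.place, content_transformer(tolower))
docs.place <- tm_map(docs.place, removePunctuation)
docs.place <- tm_map(docs.place, removeWords, stopwords("english"))
dtm.tfidf <- DocumentTermMatrix(docs.place, control = list(weighting = weightTfIdf))
# the highest-weighted terms for Ritters Diner are its "distinctive" words
ritters.row <- which(by.place$name == "Ritters Diner")
head(sort(as.matrix(dtm.tfidf)[ritters.row, ], decreasing = TRUE), 10)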
Sentiment analysis uses a naive Bayes classifier to predict whether a piece of text carries a positive, neutral or negative emotion. It can help us determine the emotion within the reviews of each brunch place and check whether that emotion is consistent with the place's average rating in the Yelp database. Taking “Ritters Diner” as an example, we throw in the reviews and here’s what it returned.
# classify_emotion()/classify_polarity() come from the 'sentiment' package; grid.arrange() from gridExtra
textdata <- ritters$text
class_emo <- classify_emotion(textdata, algorithm = "bayes", prior = 1.0)
emotion <- class_emo[, 7]               # column 7 is BEST_FIT
emotion[is.na(emotion)] <- "unknown"
class_pol <- classify_polarity(textdata, algorithm = "bayes")
polarity <- class_pol[, 4]              # column 4 is BEST_FIT
sent_df <- data.frame(text = textdata, emotion = emotion, polarity = polarity, stringsAsFactors = FALSE)
sent_df <- within(sent_df, emotion <- factor(emotion, levels = names(sort(table(emotion), decreasing = TRUE))))
p1 <- ggplot(sent_df, aes(x = emotion)) + geom_bar(aes(y = ..count.., fill = emotion)) +
  scale_fill_brewer(palette = "Dark2") + labs(x = "emotion categories", y = "") # emotion
p2 <- ggplot(sent_df, aes(x = polarity)) + geom_bar(aes(y = ..count.., fill = polarity)) +
  scale_fill_brewer(palette = "RdGy") + labs(x = "polarity categories", y = "") # polarity
grid.arrange(p1, p2, ncol = 2, top = "Emotion and Polarity of Reviews for 'Ritters Diner'")
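To check across all 61 places (not just Ritters Diner) whether sentiment lines up with the Yelp rating, one could repeat the polarity classification per place and correlate the share of positive reviews with the average star. This is a sketch under the assumption that classify_polarity() and the pb/pbreview objects above are available; it is not part of the original analysis.
# For each place (keyed by full address), classify its reviews and compute the
# share labelled "positive", then compare with the Yelp average star.
pos.share <- sapply(pb$full.address, function(addr) {
  txt <- pbreview$text[pbreview$full.address == addr]
  pol <- classify_polarity(txt, algorithm = "bayes")[, 4]   # BEST_FIT
  mean(pol == "positive")
})
cor(pos.share, pb$avg.star, use = "complete.obs")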
# year() comes from the lubridate package
par(mfrow = c(1, 2))
barplot(table(year(ritters$review.date)), xlab = "year", ylab = "number of reviews per year", col = "steelblue", cex.names = 0.8, border = NA, main = "Ritters Diner")
barplot(table(year(pbreview$review.date)), xlab = "year", ylab = "number of reviews per year", col = "steelblue", cex.names = 0.8, border = NA, main = "all brunch places in Pittsburgh")
As we can see, there’s an upward trend in the number of Yelp reviews, both for individual diners and for the brunch places overall, showing the growing popularity of the service. I took a look at the 2016 data, and the sudden decrease comes from the incompleteness of the dataset - it only covers the first half of 2016. Again not surprising, it’s still 2016 :)
Yelp shows a star rating for every business, but how is it computed? Could we reproduce it from the information we have in hand? Here I’m calculating my own rating based on the reviews. The method is to assign a weight to each review and take the weighted mean of the stars that users gave, i.e. star = sum(w_i * s_i) / sum(w_i); reviews that received more “useful” votes from other users get more weight.
# weighted average star for each place: reviews with more "useful" votes count more
add <- pb$full.address
star.list <- c()
for (i in seq_along(add)) {
  spot <- pbreview[pbreview$full.address == add[i], ]
  if (sum(spot$vote.useful) == 0) {
    spot$vote.rescale <- 1
  } else {
    # rescale the "useful" vote counts linearly so higher counts get more weight
    spot$vote.rescale <- 4/(max(spot$vote.useful) - min(spot$vote.useful))*spot$vote.useful + 1
  }
  star <- sum(spot$review.star * spot$vote.rescale)/sum(spot$vote.rescale)
  star.list <- c(star.list, star)
}
pb$star.calculated <- as.vector(star.list)
# round to the nearest half star, like Yelp's display
pb$star.rounded <- round(pb$star.calculated*2, digits = 0)/2
If we take a look at my recalculated stars and round them to the nearest half star, they’re actually pretty close to what Yelp has in its system :)
head(data.frame(pb$avg.star, pb$star.calculated, pb$star.rounded))
## pb.avg.star pb.star.calculated pb.star.rounded
## 1 3.5 3.281289 3.5
## 2 3.0 2.833333 3.0
## 3 4.0 3.812309 4.0
## 4 4.5 4.307352 4.5
## 5 4.0 4.255178 4.5
## 6 3.0 2.000000 2.0
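To put a number on “pretty close”, a quick check could compare the two columns directly (a sketch, not part of the original write-up; it assumes the pb data frame built above):
# mean absolute difference after rounding, and correlation before rounding
mean(abs(pb$star.rounded - pb$avg.star))
cor(pb$star.calculated, pb$avg.star)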
Here we’re fitting a mixed-effects model to describe the trend of reviews over time. The individual diners enter as random intercepts, and each diner also gets its own fixed-effect slope on year, which we’ll later use as its “trend” score.
# lmer() comes from lme4; FEsim()/plotFEsim() come from merTools
pbreview$year <- year(pbreview$review.date)
urs <- lmer(review.star ~ -1 + (1|name) + year:name, data = pbreview)
fsim <- FEsim(urs)
plotFEsim(fsim)
# extract the per-diner year slopes (the fixed effects) as a "trend" measure
df <- data.frame(fixef(urs))
df <- add_rownames(df, var = "rowname")
df$rowname <- gsub("year:name", "", df$rowname)
pb.df <- merge(pb, df, by.x = "name", by.y = "rowname", all.x = TRUE)
pb.df$trend <- round(scale(pb.df$fixef.urs.), 2)
# keep places with a recalculated star above 3.4 and an above-average (positive scaled) trend
best20.calculated <- subset(pb.df, star.calculated > 3.4 & trend > 0)[, -2]
best20.calculated <- best20.calculated[, c("name", "full.address", "longitude", "latitude", "star.rounded", "trend")]
names(best20.calculated) <- c("Title", "Full Address", "longitude", "latitude", "Recommending Level", "Trend")
write.csv(best20.calculated, file = "best20.csv", row.names = FALSE)
I’m using GitHub to render a GeoJSON file built from the CSV file I just saved. GitHub has a nice feature that presents a GeoJSON file as an interactive map, which can be embedded elsewhere with a short JavaScript snippet.
For converting the CSV to GeoJSON, see the Python script here.
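For reference, the same conversion can also be sketched directly in R with jsonlite (this is a sketch, not the linked Python script; the column names follow the best20.csv written above):
# build a GeoJSON FeatureCollection from best20.csv
library(jsonlite)
best20 <- read.csv("best20.csv", check.names = FALSE)
features <- lapply(seq_len(nrow(best20)), function(i) {
  row <- best20[i, ]
  list(
    type = "Feature",
    geometry = list(
      type = "Point",
      coordinates = c(row$longitude, row$latitude)   # GeoJSON order: lon, lat
    ),
    properties = list(
      Title = row$Title,
      `Full Address` = row$`Full Address`,
      `Recommending Level` = row$`Recommending Level`,
      Trend = row$Trend
    )
  )
})
geojson <- list(type = "FeatureCollection", features = features)
write(toJSON(geojson, auto_unbox = TRUE, digits = 7), "best20.geojson")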