An analysis of the Yelp Ratings Database: A milestone in the Coursera Data Science capstone project

Introduction

We live in a review-obsessed culture: when looking for a place to have lunch or dinner, checking online ratings before making a final call has become a common reflex. This, of course, reflects the customer's point of view: narrowing the choice to venues with 3, 4 or 5 stars, depending on taste and available budget.

But there is also the business owner point of view!

The importance of ratings for restaurants was tragically underlined in France by the suicide of the famous chef Bernard Loiseau, whose death 12 years ago has been linked to the threat of losing his prestigious three-star ranking in the Michelin Red Guide.

A restaurant with a poor rating will probably face serious business difficulties and may even have to close. According to an article published in the New Yorker following the death of Loiseau (http://www.newyorker.com/magazine/2003/05/12/death-of-a-chef):

“the loss of a Michelin star can mean financial strain: business can drop by as much as twenty-five per cent. (When Loiseau got his third star, in 1991, business at La Côte d’Or increased by sixty per cent.)”

This raises a simple question: beyond the quality of the food, which factors explain restaurant ratings?

Being able to identify these factors and predict ratings would be valuable information for owners, allowing them to offer a better customer experience and ultimately grow their businesses.

The results of our initial exploratory analysis are presented below and illustrated in Figures 1 to 4 and in Table 1.

Getting the data

The dataset is available at http://www.yelp.com/dataset_challenge. It contains in particular:

1.6M reviews and 500K tips by 366K users for 61K businesses
481K business attributes, e.g., hours, parking availability, ambience

If you are not familiar with Yelp (perhaps you use TripAdvisor instead?), here is an example review page for a French bakery in Oslo: http://www.yelp.co.uk/biz/pascal-oslo-2

Note: The data is provided mainly as JSON files.

R Libraries

To clean the data and to visualize and publish our results, we will use the following R packages. If they are not installed, use install.packages("name of the package"); installing packages such as “DT” may also require installing “devtools” first.

library(rjson)
library(plyr)
library(dplyr)
library(ggplot2)
library(knitr)
library(glmnet)
library(googleVis)
library(DT)
library(scales)

Reading the “Business” JSON file

Note: The files must first be saved in your working directory. Also, do not forget to close the connection once the file has been read.

con <- file("yelp_academic_dataset_business.json", "r")
input <- readLines(con, -1L)
close(con)

Creating and loading our dataframe

This simply consists of extracting each business record and making it a row of a data frame that we will call “yelpdata”.

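# Parse each JSON line, flatten its nested fields into a one-row matrix,
# and stack the rows into a single data frame.
yelpdata <- input %>%
   lapply(function(x) t(unlist(fromJSON(x)))) %>% 
   ldply()
# Cache the result so that later sessions can skip the slow parsing step.
save(yelpdata, file= 'yelpdata.rdata')
load("yelpdata.rdata")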
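
To see what the flattening step does, here is a minimal sketch on two made-up records (not real Yelp data). It assumes the same one-JSON-object-per-line layout and uses base R instead of ldply() to stack the rows; the field names are purely illustrative.

```r
library(rjson)

# Two invented records, one JSON object per line.
lines <- c(
  '{"name": "Cafe A", "stars": 4.5, "attributes": {"Wi-Fi": "free", "Parking": {"lot": true}}}',
  '{"name": "Diner B", "stars": 3.0, "attributes": {"Wi-Fi": "no"}}'
)

# unlist() flattens nested lists and joins the nested names with dots;
# every value is coerced to character along the way.
flat <- lapply(lines, function(x) unlist(fromJSON(x)))
names(flat[[1]])

# Stack the records, filling fields that a record lacks with NA.
all_cols <- unique(unlist(lapply(flat, names)))
rows <- lapply(flat, function(v) {
  out <- setNames(rep(NA_character_, length(all_cols)), all_cols)
  out[names(v)] <- v
  out
})
toy <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
```

This explains why the column names cleaned in the next section start with "attributes", and why numeric fields such as the star rating have to be converted back with as.numeric().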

Cleaning the data

We will give the different features clearer names, such as whether the business offers a “Happy Hour”, whether “Wi-Fi” is available, and so on.

We will also handle the “N/A” values.

clean.names <- function(df){
  colnames(df) <- gsub("[^[:alnum:]]", "", colnames(df))
  colnames(df) <- tolower(colnames(df))
  return(df)
}
yelpdata <- clean.names(yelpdata)
yelpdata <- yelpdata[,!duplicated(colnames(yelpdata))]

# Features 
yelpdata$stars <- as.numeric(as.character(yelpdata$stars))
yelpdata$reviewcount <- as.numeric(as.character(yelpdata$reviewcount))
names(yelpdata)[names(yelpdata)=="attributeshappyhour"] <- "happyhour"
names(yelpdata)[names(yelpdata)=="attributesacceptscreditcards"] <- "acc"
names(yelpdata)[names(yelpdata)=="attributesgoodforgroups"] <- "groups"
names(yelpdata)[names(yelpdata)=="attributesoutdoorseating"] <- "outdoor"
names(yelpdata)[names(yelpdata)=="attributespricerange"] <- "price"
names(yelpdata)[names(yelpdata)=="attributesalcohol"] <- "alcohol"
names(yelpdata)[names(yelpdata)=="attributesnoiselevel"] <- "noiselevel"
names(yelpdata)[names(yelpdata)=="attributesambienceclassy"] <- "classy"
names(yelpdata)[names(yelpdata)=="attributesparkingvalet"] <- "valet"
names(yelpdata)[names(yelpdata)=="neighborhoods"] <- "nhood"
names(yelpdata)[names(yelpdata)=="attributesdrivethru"] <- "drivethru"
names(yelpdata)[names(yelpdata)=="attributesparkinglot"] <- "parkinglot"
names(yelpdata)[names(yelpdata)=="attributespaymenttypescashonly"] <- "cash"
names(yelpdata)[names(yelpdata)=="attributesambiencecasual"] <- "casual"
names(yelpdata)[names(yelpdata)=="attributesgoodfordancing"] <- "dance"
names(yelpdata)[names(yelpdata)=="attributesdelivery"] <- "delivery"
names(yelpdata)[names(yelpdata)=="attributescoatcheck"] <- "ccheck"
names(yelpdata)[names(yelpdata)=="attributestakeout"] <- "takeout"
names(yelpdata)[names(yelpdata)=="attributestakesreservations"] <- "res"
names(yelpdata)[names(yelpdata)=="attributeswaiterservice"] <- "service"
names(yelpdata)[names(yelpdata)=="attributesparkingstreet"] <- "street"
names(yelpdata)[names(yelpdata)=="attributesparkinggarage"] <- "garage"
names(yelpdata)[names(yelpdata)=="attributesgoodforlatenight"] <- "late"
names(yelpdata)[names(yelpdata)=="attributesgoodfordessert"] <- "desert"
names(yelpdata)[names(yelpdata)=="attributescaters"] <- "caters"
names(yelpdata)[names(yelpdata)=="attributeswifi"] <- "wifi"
names(yelpdata)[names(yelpdata)=="attributesattire"] <- "attire"
names(yelpdata)[names(yelpdata)=="attributesgoodforkids"] <- "goodforkids"
names(yelpdata)[names(yelpdata)=="attributeshastv"] <- "tv"
names(yelpdata)[names(yelpdata)=="attributesambienceromantic"] <- "romantic"
names(yelpdata)[names(yelpdata)=="attributesambiencetrendy"] <- "trendy"
names(yelpdata)[names(yelpdata)=="attributesambienceupscale"] <- "upscale"
names(yelpdata)[names(yelpdata)=="attributesambiencedivey"] <- "divey"
names(yelpdata)[names(yelpdata)=="attributeswheelchairaccessible"] <- "wheelchair"
names(yelpdata)[names(yelpdata)=="attributesmusicbackgroundmusic"] <- "bkgmusic"
names(yelpdata)[names(yelpdata)=="attributesmusiclive"] <- "livemusic"
names(yelpdata)[names(yelpdata)=="attributesbyob"] <- "byob"
names(yelpdata)[names(yelpdata)=="attributesdogsallowed"] <- "dogsallowed"
names(yelpdata)[names(yelpdata)=="attributesopen24hours"] <- "open24hrs"
names(yelpdata)[names(yelpdata)=="attributespaymenttypesamex"] <- "amex"
names(yelpdata)[names(yelpdata)=="attributesorderatcounter"] <- "orderatcounter"
names(yelpdata)[names(yelpdata)=="attributespaymenttypesvisa"] <- "visa"


# Identify N/A as "dnr" (did not respond).
addDNR <- function(x){
  if(is.factor(x)) return(factor(x, levels=c(levels(x), "dnr")))
  return(x)
}
yelpdata <- as.data.frame(lapply(yelpdata, addDNR))
yelpdata[is.na(yelpdata)] <- "dnr"
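
The two cleaning steps above can be illustrated on a toy data frame. This is only a sketch with invented values; it applies the same name-cleaning rule as clean.names() and the same extra "dnr" factor level as addDNR().

```r
# Toy data frame (made-up values) with a dotted name like those unlist() produces.
toy <- data.frame(a = c("4.5", "3"), b = factor(c("free", NA)))
names(toy) <- c("stars", "attributes.Wi-Fi")

# Same rule as clean.names(): drop non-alphanumeric characters, then lowercase.
names(toy) <- tolower(gsub("[^[:alnum:]]", "", names(toy)))
names(toy)   # "stars" "attributeswifi"

# The "dnr" level has to exist before it is assigned: writing an unknown
# level into a factor produces NA (with a warning) instead of the new value.
toy$attributeswifi <- factor(toy$attributeswifi,
                             levels = c(levels(toy$attributeswifi), "dnr"))
toy[is.na(toy)] <- "dnr"
```

Note that assigning "dnr" into a numeric column that contains NA would turn that column into character, so it is worth checking that columns such as stars have no missing values before this step.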

Then, we have to:

Identify the different cities and US states
Classify the restaurants into the most important categories, about 60 in total (Wine Bars, Tex-Mex, Vegan…)

yelpdata <- mutate(yelpdata, loc = ifelse(yelpdata$state=="NV", "Las Vegas, NV",
                                        ifelse(yelpdata$state=="PA", "Pittsburgh, PA",
                                          ifelse(yelpdata$state=="NC", "Charlotte, NC",
                                            ifelse(yelpdata$state=="AZ", "Phoenix, AZ",
                                              ifelse(yelpdata$state=="IL", "Urbana-Champaign, IL",
                                                ifelse(yelpdata$state=="WI", "Madison, WI",
                                                  ifelse(yelpdata$state=="MLN", "Edinburgh, UK",
                                                    ifelse(yelpdata$state=="BW", "Karlsruhe, Germany",
                                                      ifelse(yelpdata$state=="QC", "Montreal, Canada",  
                                                       ifelse(yelpdata$state=="ON", "Waterloo, Canada",
                                                        ifelse(yelpdata$state=="SC", "Charlotte, NC",
                                                         ifelse(yelpdata$state=="EDH", "Edinburgh, UK",
                                                          ifelse(yelpdata$state=="KHL", "Edinburgh, UK",
                                                           ifelse(yelpdata$state=="XGL", "Edinburgh, UK",
                                                            ifelse(yelpdata$state=="NTH", "Edinburgh, UK",
                                                            ifelse(yelpdata$state=="SCB", "Edinburgh, UK",
                                                         NA)))))))))))))))))
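
As an aside, the same state-to-city mapping could be expressed as a lookup in a named vector, which is easier to extend than a nested ifelse() chain. The sketch below uses only a subset of the codes and a made-up input vector; state_to_loc is a hypothetical name, not part of the analysis above.

```r
# Lookup table: names are state codes, values are the city labels.
state_to_loc <- c(
  NV  = "Las Vegas, NV",
  AZ  = "Phoenix, AZ",
  SC  = "Charlotte, NC",
  MLN = "Edinburgh, UK",
  EDH = "Edinburgh, UK"
)

# Made-up input; "ZZ" stands for a code that is not in the table.
states <- factor(c("NV", "EDH", "ZZ"))

# Index by name; as.character() matters because indexing with a factor
# would use its integer codes. Unknown codes come back as NA.
loc <- unname(state_to_loc[as.character(states)])
loc
```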

# Filter the restaurants.
all_restaurants <- filter(yelpdata, categories == "Restaurants" |
                     categories1 == "Restaurants" | 
                     categories2 == "Restaurants"| 
                     categories3 == "Restaurants"|
                     categories4 == "Restaurants"|
                     categories5 == "Restaurants"|
                     categories6 == "Restaurants"|
                     categories7 == "Restaurants"|
                     categories8 == "Restaurants"|
                     categories9 == "Restaurants"|
                     categories10 == "Restaurants") 

# Show all of the categories of the restaurants
bigcat <- c(as.character(all_restaurants$categories1), 
            as.character(all_restaurants$categories2), 
            as.character(all_restaurants$categories3),
            as.character(all_restaurants$categories4), 
            as.character(all_restaurants$categories5), 
            as.character(all_restaurants$categories6),
            as.character(all_restaurants$categories7), 
            as.character(all_restaurants$categories8), 
            as.character(all_restaurants$categories9),
            as.character(all_restaurants$categories10),
            as.character(all_restaurants$categories)) %>% 
  table() %>% 
  sort()
 
# Let's have a look at the most important categories
# tail(bigcat,65)

# "Varmaker" function creates a column for a category
# 1 = yes, 0 = no
varmaker <- function(x){
  all_restaurants <- mutate(all_restaurants, 
                            a = 
                              ifelse(
                                categories == x |
                                categories1 == x | 
                                categories2 == x | 
                                categories3 == x | 
                                categories4 == x | 
                                categories5 == x | 
                                categories6 == x | 
                                categories7 == x | 
                                categories8 == x | 
                                categories9 == x | 
                                categories10 == x , 1, 0) )
  all_restaurants$a <- as.factor(all_restaurants$a)
  names(all_restaurants)[names(all_restaurants)=="a"] <- gsub(" ", "", x, fixed = TRUE)
  return(all_restaurants)
  }

# Create new columns associated to the most important categories
all_restaurants <- varmaker("Fast Food")
all_restaurants <- varmaker("Pizza")
all_restaurants <- varmaker("Mexican")
all_restaurants <- varmaker("American (Traditional)")
all_restaurants <- varmaker("Nightlife")
all_restaurants <- varmaker("Sandwiches")
all_restaurants <- varmaker("Bars")
all_restaurants <- varmaker("Food")
all_restaurants <- varmaker("Italian")
all_restaurants <- varmaker("Chinese")
all_restaurants <- varmaker("American (New)")
all_restaurants <- varmaker("Burgers")
all_restaurants <- varmaker("Breakfast & Brunch")
all_restaurants <- varmaker("Cafes")
all_restaurants <- varmaker("Japanese")
all_restaurants <- varmaker("Sushi Bars")
all_restaurants <- varmaker("Delis")
all_restaurants <- varmaker("Steakhouses")
all_restaurants <- varmaker("Seafood")
all_restaurants <- varmaker("Chicken Wings")
all_restaurants <- varmaker("Sports Bars")
all_restaurants <- varmaker("Coffee & Tea")
all_restaurants <- varmaker("Mediterranean")
all_restaurants <- varmaker("Barbeque")
all_restaurants <- varmaker("Thai")
all_restaurants <- varmaker("Asian Fusion")
all_restaurants <- varmaker("French")
all_restaurants <- varmaker("Buffets")
all_restaurants <- varmaker("Indian")
all_restaurants <- varmaker("Pubs")
all_restaurants <- varmaker("Greek")
all_restaurants <- varmaker("Diners")
all_restaurants <- varmaker("Bakeries")
all_restaurants <- varmaker("Vietnamese")
all_restaurants <- varmaker("Tex-Mex")
all_restaurants <- varmaker("Vegetarian")
all_restaurants <- varmaker("Salad")
all_restaurants <- varmaker("Hot Dogs")
all_restaurants <- varmaker("Middle Eastern")
all_restaurants <- varmaker("Event Planning & Services")
all_restaurants <- varmaker("Specialty Food")
all_restaurants <- varmaker("Lounges")
all_restaurants <- varmaker("Korean")
all_restaurants <- varmaker("Canadian (New)")
all_restaurants <- varmaker("Arts & Entertainment")
all_restaurants <- varmaker("Wine Bars")
all_restaurants <- varmaker("Gluten-Free")
all_restaurants <- varmaker("Latin American")
all_restaurants <- varmaker("British")
all_restaurants <- varmaker("Gastropubs")
all_restaurants <- varmaker("Ice Cream & Frozen Yogurt")
all_restaurants <- varmaker("Southern")
all_restaurants <- varmaker("Vegan")
all_restaurants <- varmaker("Desserts")
all_restaurants <- varmaker("Hawaiian")
all_restaurants <- varmaker("German")
all_restaurants <- varmaker("Bagels")
all_restaurants <- varmaker("Caterers")
all_restaurants <- varmaker("Juice Bars & Smoothies")
all_restaurants <- varmaker("Fish & Chips")
all_restaurants <- varmaker("Ethnic Food")
all_restaurants <- varmaker("Tapas Bars")
all_restaurants <- varmaker("Soup")
all_restaurants <- varmaker("Halal")
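
The repeated varmaker() calls above could also be driven by a loop over a vector of category names. The following is a minimal sketch on a toy data frame with only two category columns; has_category is a hypothetical helper introduced for illustration and assumes, as in the cleaned data, that the category columns contain no NA.

```r
# Toy stand-in for all_restaurants (made-up rows, two category columns).
toy <- data.frame(
  categories  = c("Restaurants", "Pizza", "Restaurants"),
  categories1 = c("Pizza", "Restaurants", "Fast Food"),
  stringsAsFactors = FALSE
)
cat_cols <- c("categories", "categories1")

# Hypothetical helper: 0/1 factor flagging rows where any category column equals x.
has_category <- function(df, x, cols) {
  hit <- rowSums(df[, cols, drop = FALSE] == x) > 0
  factor(as.integer(hit), levels = c(0, 1))
}

# One new column per category, named after the category without spaces.
for (x in c("Pizza", "Fast Food")) {
  toy[[gsub(" ", "", x, fixed = TRUE)]] <- has_category(toy, x, cat_cols)
}
toy
```

Passing the data frame as an argument, rather than reading all_restaurants from the global environment, also makes the helper easier to test in isolation.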

Observations

Contrary to what might naively be expected, the distribution of ratings is not normal but negatively skewed (Figure 1). Low ratings seem to be less frequent, and some people might too quickly interpret this as a sign of biased data. Furthermore, the histogram in Figure 2 shows that most places tend to have fewer than 10 reviews.

Quite surprisingly, Figure 3 shows that there are more reviews for 5-star restaurants than for 1-star restaurants, suggesting that reporting a very good customer experience is a more common behaviour among Yelp users than reporting a very bad one. One could also argue that 1-star restaurants probably do not stay open very long!

In any case, as the number of reviews increases, the Yelp rating will approach the restaurant's “true” rating, which will most likely not be 5 stars.
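
As a rough illustration of what “negatively skewed” means, the sketch below computes the moment-based sample skewness of a small set of invented star ratings (these are not taken from the Yelp data). A negative value indicates a longer tail towards low ratings.

```r
# Invented ratings: most values are high, with one very low rating in the left tail.
stars_toy <- c(1, 2.5, 3, 3.5, 3.5, 4, 4, 4, 4.5, 5)

# Moment-based skewness: third central moment divided by
# the second central moment raised to the power 3/2.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

skew_toy <- skewness(stars_toy)
skew_toy < 0   # TRUE: the distribution leans towards high ratings
```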

Figure 1

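# stars is plotted as a discrete variable, so geom_bar() counts the restaurants per rating
# (recent ggplot2 versions reject geom_histogram() on a discrete x).
ggplot(all_restaurants, aes(as.factor(stars))) + 
  geom_bar(fill = "blue", col="black") +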
  xlab("Stars") +
  ylab("Number of Restaurants") +
  theme_classic()

Figure 2

ggplot(all_restaurants, aes(reviewcount)) + 
  geom_histogram(binwidth = .15, fill = "blue", col="white") + 
  scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x),
                   labels = trans_format("log10", math_format(10^.x))) +
  xlab("Number of Reviews") +
  ylab("Number of Restaurants") + 
  theme_classic()

Figure 3

ggplot(all_restaurants, aes(as.factor(stars), reviewcount)) + 
  geom_boxplot(col = "blue") +
  scale_y_log10() +
  xlab("Stars") +
  ylab("Number of Reviews") + 
  theme_classic()

Observations related to location:

Thanks to the “googleVis” package, we can map the ratings by location (Figure 4).

Table 1 gives the average rating for each city in more detail.

Figure 4

# localisation in US and Europe
all_restaurants$latlong <- paste(all_restaurants$latitude, all_restaurants$longitude, sep=":")
counts <- all_restaurants %>% 
  group_by(loc) %>%
  summarize(Restaurants = n(),  Avg_Rating = round(mean(stars),2))

locdata <- data.frame(latlong = all_restaurants$latlong, 
                      loc = all_restaurants$loc)

counts <- inner_join(counts, locdata, by="loc") %>%
  group_by(loc) %>%
  summarize(Restaurants = first(Restaurants), 
            latlong = first(latlong),
            Avg_Rating = first(Avg_Rating))

require(datasets)

USmap <- gvisGeoChart(counts, "loc", 
                          sizevar="Restaurants",
                          colorvar="Avg_Rating", 
                          options=list(region = 'US',
                                       displayMode = "markers",
                                       colorAxis="{colors:['white', 'blue']}"))

Europemap <- gvisGeoChart(counts, "loc", 
                          sizevar = "Restaurants",
                          colorvar = "Avg_Rating", 
                          options=list(region = '150',
                                       displayMode = "markers",
                                       colorAxis="{colors:['white', 'blue']}"
                                       ))


print(gvisMerge(USmap, Europemap, horizontal=TRUE), "chart")

Table 1

dstate <- all_restaurants %>%
  group_by(loc) %>%
  summarise(num_rest = n(),
            avg_stars = round(mean(stars), digits =  2), 
            avg_num_rev = round(mean(reviewcount), digits =  2)) %>%
  as_tibble()  # tbl_df() is deprecated in recent dplyr versions

dstate <- dstate[order(-dstate$avg_stars),]

kable(dstate, col.names = c("City","Restaurants","Average Star Rating of Restaurants","Average Number of Ratings per Restaurant") , align  = "c")

City	Restaurants	Average Star Rating of Restaurants	Average Number of Ratings per Restaurant
NA	19	4.00	5.58
Edinburgh, UK	1137	3.81	14.47
Karlsruhe, Germany	469	3.71	11.88
Montreal, Canada	2353	3.61	16.16
Pittsburgh, PA	1361	3.55	37.38
Waterloo, Canada	244	3.55	10.03
Madison, WI	981	3.45	34.10
Charlotte, NC	2119	3.43	33.74
Phoenix, AZ	7985	3.43	51.23
Las Vegas, NV	4960	3.42	90.13
Urbana-Champaign, IL	264	3.38	34.28

Next Step in this Capstone Project: Predict Ratings

To predict the ratings, we will:

Use the business attributes (trendy, romantic…) as predictors
Define a training set and a testing set
Apply the LASSO (Least Absolute Shrinkage and Selection Operator) to our linear regression in order to limit overfitting. A more detailed description of this statistical approach can be found in the book “The Elements of Statistical Learning: Data Mining, Inference, and Prediction” by Trevor Hastie et al. (see References).
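
The steps above could be sketched as follows with the glmnet package. This is only an outline on simulated data: the predictors, the 70/30 split and the simulated effect sizes are assumptions chosen for illustration, not results from the Yelp dataset.

```r
library(glmnet)

set.seed(42)
n <- 200

# Simulated stand-ins for business attributes (not Yelp data).
toy <- data.frame(
  trendy   = factor(sample(c("yes", "no"), n, replace = TRUE)),
  romantic = factor(sample(c("yes", "no"), n, replace = TRUE)),
  price    = sample(1:4, n, replace = TRUE)
)
toy$stars <- 3 + 0.5 * (toy$trendy == "yes") + 0.1 * toy$price + rnorm(n, sd = 0.5)

# Training / testing split (70 / 30).
train_idx <- sample(seq_len(n), size = round(0.7 * n))

# glmnet expects a numeric matrix, so factors are expanded into dummy columns
# and the intercept column is dropped.
x <- model.matrix(stars ~ ., data = toy)[, -1]
y <- toy$stars

# alpha = 1 selects the LASSO penalty; cv.glmnet() chooses lambda by cross-validation.
cv_fit <- cv.glmnet(x[train_idx, ], y[train_idx], alpha = 1)

# Evaluate on the held-out restaurants.
pred <- predict(cv_fit, newx = x[-train_idx, ], s = "lambda.min")
rmse <- sqrt(mean((y[-train_idx] - pred)^2))

# Coefficients shrunk exactly to zero correspond to predictors the LASSO drops.
coef(cv_fit, s = "lambda.min")
```

The test-set RMSE gives a first measure of predictive accuracy, and the coefficient table shows which attributes the penalty keeps in the model.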

Conclusion

Achieving this prediction objective seems perfectly feasible thanks to the richness of the Yelp dataset.

However, in terms of accuracy, taking only business attributes into account will not be optimal. We can expect other factors, such as geographic location and food category (Indian, Greek, Fast Food…), to have a significant influence.

As a first approach, we will focus on regression analysis, but it would certainly be beneficial to also apply text mining to the users' review texts in order to improve or confirm our model. Finally, creating a simple Shiny app to visualize this information would be really interesting (though time-consuming).

There is therefore still a lot of work and experimentation ahead before this project is complete.

References

Trevor Hastie, Robert Tibshirani and Jerome Friedman, 2009, “The Elements of Statistical Learning”, 2nd Edition, p.68: available at http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://spidr-ursa.rutgers.edu/resources/WebDB.pdf
http://cs229.stanford.edu/proj2014/Chen%20Li,%20Jin%20Zhang,%20Prediction%20of%20Yelp%20Review%20Star%20Rating%20using%20Sentiment%20Analysis.pdf
http://zhongyaonan.com/predict-rating-of-yelp-user-review-text/
https://en.wikipedia.org/wiki/N-gram
http://www.entrepreneur.com/article/241051