ReviewSegMentor App: https://ojustwin.shinyapps.io/RestReviewApp
App Overview Presentation: http://rpubs.com/ojustwin/ReviewSegmentor
FIGURE 1: Shiny App
The restaurant industry has one of the highest business failure rates among the retail and service industries. The National Restaurant Association in the US recognizes a 30% failure rate as the norm in the restaurant industry. For new entrepreneurs or existing business owners, the decision to open a new restaurant therefore has to be thoroughly researched from multiple perspectives, including but not limited to market share research.
Current market research sources for an entrepreneur include very useful restaurant market metrics compiled by companies such as Nielsen Research or the National Restaurant Association. But these sources can be very expensive. It is also unlikely that they include restaurant review share metrics segmented by restaurant concept and locality.
To assist new entrepreneurs, the problem can be broken down into two components:
Is it possible to use Yelp review data to cluster and segment restaurant “Concepts” and “Localities”?
Using such “Concept” and “Locality” segments, is it possible to conduct useful restaurant review share analysis to determine if a restaurant “Concept” is a good candidate for a given “Locality”?
The data used is part of the Yelp Dataset Challenge. After the JSON data was read in, it was cleaned and transformed. This included standardizing text where needed (e.g. lowercase for city names, date format for dates) and converting lists to binary flags or logical factors.
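The cleaning steps described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy records, the column names (`business_id`, `city`, `categories`) and the `categories.` prefix are hypothetical stand-ins for the parsed Yelp JSON, not the study's actual preparation code.

```r
# Hypothetical toy records standing in for parsed Yelp business JSON.
business <- data.frame(
  business_id = c("b1", "b2", "b3"),
  city = c("Pittsburgh", "PITTSBURGH", "Las Vegas"),
  stringsAsFactors = FALSE
)
# Each business carries a list of category labels.
business$categories <- list(
  c("Restaurants", "Italian"),
  c("Restaurants", "Pubs"),
  c("Shopping")
)

# Standardize text: lowercase city names.
business$city <- tolower(business$city)

# Convert the list column into one logical flag per category.
all_categories <- sort(unique(unlist(business$categories)))
flags <- t(sapply(business$categories, function(x) all_categories %in% x))
colnames(flags) <- paste0("categories.", all_categories)

business_flat <- cbind(business[, c("business_id", "city")], as.data.frame(flags))
```

Each row of `business_flat` now has a fixed set of logical columns, which is the shape that distance-based clustering expects.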
Among the 10 cities with the most reviews, cities like Las Vegas have the highest review counts. But as a tourist destination with a very high density of businesses, Las Vegas is not a good representative of most American cities. Therefore Pittsburgh was selected for this study.
Since individual restaurant review data is highly dependent on Concept execution, this study first defines Concept clusters using the individual restaurant data. Individual restaurant review data is then aggregated by Concept before analysis. Similarly, individual restaurant star ratings are not used; instead, Concept review volumes are used to infer whether the overall concept was good enough to get clients to walk through the door.
The choice of location is expected to have a significant impact on the success or failure of a restaurant. Direct competitors can include nearby restaurants with a similar Concept (menu, pricing, atmosphere or target market). Furthermore, all restaurants within a geographic trading area, or a 5-15 mile radius, can be in competition. For very busy urban areas, this radius can be as small as 2-3 minutes, while suburban and spread-out markets may have a competitive radius of a 5-10 minute drive.
Therefore, in this study a “Locality” is defined as a cluster or group of businesses operating within geographic proximity. A “Locality” so defined can potentially cross neighborhoods and postal codes. It can be spread out in areas where the businesses are spread out and much smaller in areas with a lot of businesses.
Instead of just the restaurants, it was more effective to use all businesses to define the “Locality” cluster. This provided a much larger sample set and resulted in a tighter definition of the Locality clusters. It is also likely that visits to non-restaurant businesses influence the decision to eat within a “Locality”.
For the Locality clustering exercise, the latitude and longitude variables of the businesses were used. The distance matrix was calculated using the Haversine great-circle formula (“as the crow flies”). Both hierarchical and K-means clustering procedures were evaluated.
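A minimal base-R sketch of this step is shown below. The coordinates are invented, the helper name `haversine_km` is hypothetical, and complete linkage (the `hclust` default) is an assumption, since the linkage used for the Locality clusters is not stated.

```r
# Haversine great-circle distance in kilometres between points given in degrees.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Hypothetical coordinates: two tight groups of businesses.
biz <- data.frame(
  latitude  = c(40.4406, 40.4410, 40.4415, 40.3836, 40.3840),
  longitude = c(-79.9959, -79.9965, -79.9950, -80.0440, -80.0445)
)

# Pairwise distance matrix over all businesses.
n <- nrow(biz)
d <- outer(seq_len(n), seq_len(n), function(i, j) {
  haversine_km(biz$latitude[i], biz$longitude[i],
               biz$latitude[j], biz$longitude[j])
})

# Hierarchical clustering on the distances, then cut into Locality labels.
hc <- hclust(as.dist(d))   # assumption: default complete linkage
biz$Locality <- cutree(hc, k = 2)
```

Because `hclust` is deterministic for a given distance matrix, repeated runs on the same data return the same labels, unlike K-means with random starting centers.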
K-Means clustering divided downtown Pittsburgh into multiple clusters. This division could be defended, given the high density of downtown businesses. But unfortunately the K-means clustering was not stable, as it resulted in different locality clustering when run multiple times on the same data.
Therefore hierarchical clustering was chosen. The results for the hierarchical clustering were also reasonable. Even though it lumped all of downtown into one cluster, it was better at avoiding overly spread-out locality clusters in the suburbs, and the results were consistent across multiple runs.
The elbow method for optimal clustering recommended a cutoff at 14, but that resulted in clusters too large to provide a meaningful analysis of review share within the direct competitive vicinity. Therefore the clusters were cut at k = 40, since that provided a reasonable number of clusters while still retaining adequate differentiation.
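One common way to apply the elbow method to a hierarchical tree is to compute the total within-cluster sum of squares for a range of cuts and look for the point where the curve flattens. The sketch below uses simulated points and a hypothetical helper `wss_for_k`; it illustrates the idea only and is not the study's actual procedure.

```r
set.seed(42)

# Hypothetical 2-D points in three loose groups (stand-ins for coordinates).
pts <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
)
hc <- hclust(dist(pts))

# Total within-cluster sum of squares when the tree is cut into k clusters.
wss_for_k <- function(k) {
  labels <- cutree(hc, k = k)
  sum(sapply(split(as.data.frame(pts), labels), function(g) {
    sum(scale(g, center = TRUE, scale = FALSE)^2)
  }))
}

ks <- 1:8
wss <- sapply(ks, wss_for_k)

# plot(ks, wss, type = "b")  # the "elbow" is where the curve flattens
```

The elbow only marks where additional clusters stop reducing spread quickly. As in the study, a larger k can still be preferable when the clusters need to be small enough for the intended analysis.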
FIGURE 2: Locality Cluster Representation
A poor concept that is not differentiated is one of the primary reasons for restaurant failures (Fields 2007).
In this study Restaurant specific Business metadata features in the Yelp dataset are used to identify Concept Clusters. Instead of focusing on Pittsburgh data, to gain a larger sample size, restaurants in all locations provided in the Yelp dataset were used.
A “Concept” is defined by using multiple restaurant metadata characteristics such as Cuisine, Price Range, Ambiance, Service, etc. as clustering variables.
After initial prep, further changes were applied to the data after performing the first round of clustering. For example, the feature “categories.Food” was removed as it was too general and skewed the clustering to generate hard-to-interpret categories.
In addition, to keep the number of categories at a manageable level, a new feature called ‘cuisine.Region’ was introduced using external data. This feature was generated by mapping all the cuisine categories (extracted from the Yelp Developer site) to high-level world geographic regions (from Wikipedia). For example, the categories ‘Chinese’, ‘DimSum’, ‘Ramen’, ‘Japanese’, ‘Sushi Bars’, ‘Udon’, etc. were all mapped to the ‘East Asian’ Region.
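The mapping can be pictured as a simple lookup from cuisine category to region. In the sketch below, only the East Asian entries come from the text; the ‘Italian’ entry, the toy restaurants, and the helper `region_flags` are hypothetical and added purely for illustration.

```r
# Excerpt of a cuisine-to-region lookup. East Asian entries follow the text;
# the Italian entry is an assumed example of a second region.
cuisine_region <- c(
  "Chinese" = "East Asian", "DimSum" = "East Asian", "Ramen" = "East Asian",
  "Japanese" = "East Asian", "Sushi Bars" = "East Asian", "Udon" = "East Asian",
  "Italian" = "Southern European"
)

# Hypothetical restaurants, each with a list of Yelp categories.
restaurants <- data.frame(name = c("r1", "r2", "r3"), stringsAsFactors = FALSE)
restaurants$categories <- list(
  c("Sushi Bars", "Japanese"),
  c("Italian", "Pizza"),
  c("Burgers")
)

# One logical flag per region: TRUE if any category maps to that region.
region_flags <- function(categories, lookup) {
  regions <- sort(unique(unname(lookup)))
  rows <- lapply(categories, function(x) {
    regions %in% lookup[x[x %in% names(lookup)]]
  })
  flags <- do.call(rbind, rows)
  colnames(flags) <- paste0("cuisine.Region.", make.names(regions))
  flags
}

flags <- region_flags(restaurants$categories, cuisine_region)
```

Collapsing many cuisine labels into a handful of region flags keeps the feature space small while preserving the broad cuisine signal.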
Because the input variables were binary, the distance was calculated using the Gower dissimilarity coefficient (Jaccard). Gower allows features to be weighted, so features related to Cuisine Region, Price Range, Service, and Ambiance were weighted higher. Both hierarchical and K-means clustering procedures were evaluated.
After trying different cutoff points (10, 15, 20, 25, 60), based on the feature representation, k = 25 clusters provided the best balance between differentiation and end user consumption.
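```r
library(cluster)  # provides daisy()

# Calculate Gower distance on all columns except the first; `wt` holds one weight per remaining column
gowerDistance <- daisy(Restaurant.All.Clustered[, -1], type = list(), metric = "gower", weights = wt)

# Hierarchical clustering (hclust defaults to complete linkage), cut into 25 Concept clusters
hcConcepts <- hclust(gowerDistance)
Restaurant.All.Clustered$Concept.H.Cluster <- cutree(hcConcepts, k = 25)
```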
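To make the role of the weight vector concrete, here is a base-R sketch of a weighted Gower dissimilarity on 0/1 columns. The feature names, the weight values, and the helper `weighted_gower` are all hypothetical; the study's actual `wt` is not shown in the text.

```r
# Hypothetical binary feature table for four restaurants.
features <- data.frame(
  cuisine.Region.East.Asian = c(1, 1, 0, 0),
  price.Range.2             = c(1, 0, 1, 1),
  attributes.Take.out       = c(0, 1, 1, 0)
)

# Assumed weighting: cuisine and price features count double.
wt <- ifelse(grepl("^(cuisine\\.Region|price\\.Range)", names(features)), 2, 1)

# Weighted Gower dissimilarity for 0/1 columns: the weighted share of
# features on which two rows disagree.
weighted_gower <- function(x, w) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- sum(w * abs(x[i, ] - x[j, ])) / sum(w)
    }
  }
  as.dist(d)
}

d <- weighted_gower(features, wt)
```

This sketch counts shared absences as agreement. A Jaccard-style treatment would ignore them; in `daisy` that corresponds to declaring the columns as `asymm` through the `type` argument.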
FIGURE 3: Concept Cluster #4 British Pub Casual Details
TABLE 1 Concept Clusters
| Cluster | Concept.Label | Price | Cuisine | Other |
|---|---|---|---|---|
| 4 | British Pub Casual | Medium (2), Low (1) | Northern European, British, Irish, North American | Full Bar, Nightlife, Pubs, Fish & Chips |
| 5 | Italian Casual | Medium (2) | Southern European, Italian, North American, Mediterranean, Greek | Pizza, Nightlife, Sandwiches |
| 6 | East Asian Casual Dining | Medium (2) | East Asian, Chinese, Japanese, Southeast Asian, Thai, Korean, Vietnamese, Mongolian | Full Bar, Beer & Wine, Sushi Bars, Asian Fusion |
The clustered restaurant data was prepared and aggregated to create Yelp Review Share metrics for each Concept-Locality combination.
Aggregated Metrics were created to compare and contrast Yelp Review Share for Concepts, Localities, and the combination of Concept Locality. Below is a list of the calculated metrics. The Metrics were compiled on an aggregate level as well as a yearly level to allow for trend analysis.
In addition to Yelp Review Share, a market concentration metric was calculated to aid with the analysis. The Herfindahl index (HHI) is a measure of the size of firms in relation to the industry and an indicator of the amount of competition among them. HHI is defined as the sum of the squares of the review shares of the restaurants within a defined market. The major benefit of the Herfindahl index in relationship to such measures as the concentration ratio is that it gives more weight to larger firms. (See Wikipedia HHI)
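As a worked illustration of this definition, the sketch below computes review shares and the HHI from invented review counts. The data and the helper name `hhi` are hypothetical; shares are expressed as fractions, so the index ranges from 1/n (n equally sized restaurants) to 1 (a single restaurant holds all reviews).

```r
# HHI: sum of squared review shares within a market.
hhi <- function(counts) {
  share <- counts / sum(counts)
  sum(share^2)
}

# Hypothetical review counts per restaurant in two localities.
reviews <- data.frame(
  Locality     = c("L1", "L1", "L1", "L2", "L2"),
  review_count = c(50, 30, 20, 10, 10)
)

# L1 shares are 0.5, 0.3, 0.2, so HHI = 0.25 + 0.09 + 0.04 = 0.38.
# L2 shares are 0.5, 0.5, so HHI = 0.25 + 0.25 = 0.50.
hhi_by_locality <- tapply(reviews$review_count, reviews$Locality, hhi)
```

A higher value indicates that reviews are concentrated in fewer restaurants, which the study reads as less competition for review share.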
Finally some of the metrics focus on identifying percentage of all locality reviews which were made by Concept Reviewers who did not review Concept restaurants in the locality. This can be used to identify untapped potential Concept reviewers in a Locality.
Concept Metrics: Review Share in City, City Concept Herfindahl Index (Review Concentration).
Locality Metrics: Review Share in City, Restaurant Herfindahl Index, Concepts Herfindahl Index.
Concept-Locality Metrics: Review Share in Locality, Locality Concept Restaurant Herfindahl Index.
Concept Reviewer Metrics: Locality review share based on non-Concept reviews made by uncaptured Concept Reviewers.
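The uncaptured Concept Reviewer metric can be read as follows: take every user who has reviewed the Concept anywhere, drop those who reviewed a Concept restaurant inside the Locality, and measure what share of the Locality's reviews the remaining users wrote. The sketch below implements that reading on invented data; the column names and the helper `uncaptured_share` are hypothetical, and the study's exact computation may differ.

```r
# Hypothetical review log: one row per review.
reviews <- data.frame(
  user     = c("u1", "u1", "u2", "u2", "u3", "u4"),
  Locality = c("L1", "L2", "L1", "L1", "L1", "L1"),
  Concept  = c("Italian", "Mideast", "Mideast", "Italian", "Italian", "Italian"),
  stringsAsFactors = FALSE
)

# Share of a Locality's reviews written by users who review the Concept
# elsewhere but have not reviewed a Concept restaurant in this Locality.
uncaptured_share <- function(reviews, concept, locality) {
  concept_reviewers <- unique(reviews$user[reviews$Concept == concept])
  in_loc <- reviews[reviews$Locality == locality, ]
  captured <- unique(in_loc$user[in_loc$Concept == concept])
  uncaptured <- setdiff(concept_reviewers, captured)
  sum(in_loc$user %in% uncaptured) / nrow(in_loc)
}

share <- uncaptured_share(reviews, "Mideast", "L1")
```

In this toy log, user `u1` reviews the "Mideast" Concept only in L2 but is active in L1, so one of the five L1 reviews counts as uncaptured.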
Clustering Results: 40 Pittsburgh Localities identified. 25 Restaurant Concepts identified. 33 Review Metrics calculated for Concepts, Localities, and Concept-Localities.
Classification Results: The Random Forest model for Locality classification, based on latitude and longitude, resulted in an accuracy of 0.9745 with 95% CI (0.9626, 0.9834). The Random Forest model for Concept classification, based on restaurant attributes, resulted in an accuracy of 0.9939 with 95% CI (0.9919, 0.9955).
A Shiny web app was developed to encapsulate and support analysis of the results. See:
ReviewSegMentor App: https://ojustwin.shinyapps.io/RestReviewApp
App Overview Presentation: http://rpubs.com/ojustwin/ReviewSegmentor
The Shiny Web App referenced above was used to conduct the following analysis.
Scenario: an entrepreneur wants to explore the possibility of opening a Persian restaurant in the Mt. Lebanon neighborhood in Pittsburgh. The entrepreneur can specify restaurant attributes in the Shiny App and use the classification model to identify the “Middle Eastern Casual” Concept as a good fit for further evaluation.
Assessing the Concept
The overall review share of the “Middle Eastern Casual” Concept in Pittsburgh is less than 1% (0.69%), and has been relatively stable since 2010. The city review share of other related Concepts such as “Mediterranean Casual” and “International Highend” is even lower.
Assessing the Locality
Nine restaurants within Pittsburgh fall in the “Middle Eastern Casual” category, with two of them near the Mt. Lebanon Cemetery on Washington Rd. and Cedar Blvd. (Kous Kous Cafe and Alladin’s Eatery). Within the “Shady Dr., Mount Lebanon Cemetery” Locality the “Middle Eastern Casual” Concept has 14.44% of the review share with HHI of 0.51. “Shady Dr., Mount Lebanon Cemetery” Locality has 1.6% of all city reviews, while “Mt. Lebanon” Locality has 0.33% of all city reviews.
In the “Mt. Lebanon” Locality, there is higher market concentration (less competition) of review share than in the “Shady Dr., Mount Lebanon Cemetery” Locality. The first reviews for Mt. Lebanon appear in 2010, at 0.15% of total city reviews. Over the next few years the reviews remain close to 0.5% of total city reviews. The “Mt. Lebanon” Locality has 11 restaurants: 3 “East Asian Budget” (11% Locality review share), 1 “East Asian Casual” (21%), 2 “Italian Casual” (5%), 2 “North American Casual” (17%), 1 “North American Fast Food” (25%), 1 “Italian/East Asian Upscale” (19%).
Assessing the Concept in Locality
Since there is no “Middle Eastern Casual” Concept represented in the Mt. Lebanon Locality, some of the other related concepts can be studied. “East Asian Casual” has the highest review share within Mt. Lebanon in 2014, while “Italian/East Asian Upscale” has 15.38% of the review share in the locality. Furthermore, 16.9% of all “Mt. Lebanon” Locality reviews were made by “Italian/East Asian Upscale” restaurant reviewers who did not review the one “Italian/East Asian Upscale” restaurant in the locality. This indicates that the “Italian/East Asian Upscale” category in the Locality has uncaptured reviewers.
Based on the above assessment, opening a Persian restaurant in the “Mt. Lebanon” Locality is not immediately justifiable. The overall interest based on city review share is low. From a competitive perspective, the nearest “Middle Eastern Casual” restaurants are near the Mt. Lebanon Cemetery, so depending on the strategy, a Persian restaurant should be placed either closer to or further away from these restaurants. It may also be worthwhile to explore going after the upscale category in Mt. Lebanon rather than the casual.
This study validates that Yelp Review Data can be used by entrepreneurs to create Concept and Locality clusters to inform a new restaurant decision. Even though this analysis alone is not enough, it can narrow down choices or reveal new opportunities. The classification models can also make the initial analysis easier, by providing a surrogate Concept for analysis.
Assumptions/Limitations
This study is based on the assumption that Yelp review data volume is a good indicator of actual market activity, which has not been verified in this study. Furthermore the review activity may differ based on the Concept being analyzed, so it may not be an ideal indicator for comparing different concepts.
Next Steps
Topic Modeling could be used to create attributes to enhance Concept clustering. Additional metrics could be developed for the Concepts, such as which features are most important for driving review volumes in specific Concept segments.
Fields, R. (2007). Restaurant Success by the Numbers. Berkeley, CA: Ten Speed Press
Bryan Hood, Victor Hwang, Jennifer King. Inferring Future Business Attention. http://www.yelp.com/html/pdf/YelpDatasetChallengeWinner_InferringFuture.pdf
H.G. Parsa, Amy Gregory, Michael Terry. Why do restaurants fail. https://hospitality.ucf.edu/files/2011/08/DPI-Why-Restaurants-Fail.pdf
Nielsen Research. http://www.nielsen.com/content/dam/corporate/us/en/public%20factsheets/restaurant-growth-index.pdf
Restaurant Types. http://www.foodservicewarehouse.com/blog/overview-different-restaurant-types/
Wikipedia HHI. https://en.wikipedia.org/wiki/Herfindahl_index