1. Introduction

My retirement plan is to start a quiet little cafe in the foothills of the Himalayas. I want to fill my little space with paintings and books and, obviously, serve amazing food. Though this plan is a long shot, I am constantly curious about the market dynamics of restaurants. To start understanding what makes a good restaurant, I planned a market research analysis of restaurants in the United States.

To conduct the market analysis, I am using the Yelp dataset, which is a subset of Yelp’s businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge to give students a chance to conduct research or analysis on Yelp’s data and share their discoveries. The proposed approaches and analytic techniques are listed below.

  • Geographical analysis - This analysis aids the location selection process and helps find the best possible place to establish the cafe. The user should be able to select the desired city and see the distribution of restaurants across the location along with their ratings. Maybe it is better to start a restaurant that serves high-quality food in a location with a large number of low-rated restaurants. Maybe opening a restaurant on a street full of popular restaurants gives good visibility. It all depends on the strategy adopted by the management.

  • Understanding previous trends - This aids the selection of working hours and cuisine for the restaurant. Some places might suit restaurants that are open during the evening. Other locations might be profitable when the restaurant is open throughout the day. By analysing past trends, we can decide the working hours best suited to our restaurant. We will also explore the different cuisines that are popular in the chosen locality.

  • Learning from the best and worst players in the market - Sentiment analysis is performed on the user reviews to understand trends and patterns in user behaviour. Bad reviews help us learn without making the mistakes ourselves. Positive reviews can help me understand what works for the general public and incorporate that into my business.

2. Packages

The packages used for this analysis are listed below.

  • pacman: To load packages and install missing ones
  • data.table: Fast data load.
  • tidyverse: Collection of R packages used for data manipulation. stringr, dplyr, readr and ggplot2 from this collection are used for the analysis.
  • stringr: String operations in R
  • lubridate: Package to manipulate dates and times
  • DT: Package to put data objects in R as HTML tables
  • NLP: Package for basic classes and methods for Natural Language Processing
  • tidytext: Package for text mining for word processing and sentiment analysis
  • knitr: A General-Purpose Package for Dynamic Report Generation in R
  • leaflet: Create Interactive Web Maps
  • tm: Text mining package
  • wordcloud: To create wordclouds
  • grid: A rewrite of the graphics layout capabilities
  • gridExtra: Provides a number of user-level functions to work with “grid” graphics, notably to arrange multiple grid-based plots on a page, and draw tables.
  • radarchart: Creates interactive radar charts
  • igraph: Network Analysis and Visualization
  • ggraph: An implementation of the grammar of graphics for graphs and networks, built on ggplot2
if (!require("pacman")) install.packages("pacman")

# p_load function installs missing packages and loads all the packages given as input
pacman::p_load("data.table", 
               "tidyverse",
               "stringr", 
               "lubridate", 
               "DT", 
               "tidytext",
               "NLP" ,
               "knitr",
               "leaflet",
               "tm",
               "wordcloud",
               "grid",
               "gridExtra",
               "radarchart",
               "igraph",
               "ggraph")

3. Data Preparation

The data is loaded from the source and cleaned for further analysis.

3.1 Data source

Yelp is an American multinational corporation founded in 2004 by former PayPal employees Russel Simmons and Jeremy Stoppelman. It develops, hosts and markets Yelp.com and the Yelp mobile app, which publish crowd-sourced reviews about local businesses, as well as the online reservation service Yelp Reservations. The dataset used in this study is a subset of Yelp’s businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp’s data and share their discoveries.

In the dataset you’ll find information about businesses across 11 metropolitan areas in four countries. There are 6 tables available that contain business-related information.

  • business : Contains location and category information for each business
  • business_attributes : Contains attributes of the business such as music, delivery and parking
  • business_hours : Contains working hours for each weekday for businesses
  • checkin : Contains the number of check-ins for each day and hour for a business
  • tip : Contains tips and comments written by users about a business
  • review : Contains user text review data for businesses

3.2 Data import

The six files containing the data are loaded in this step. fread from data.table is used for most files because it is fast on large files; read_csv from readr is used for business_hours and review. After import, each data set is examined and cleaned as needed.

business <- fread("yelp-dataset/yelp_business.csv")
business_attributes <- fread("yelp-dataset/yelp_business_attributes.csv")
business_hours <- read_csv("yelp-dataset/yelp_business_hours.csv")
checkin <- fread("yelp-dataset/yelp_checkin.csv")
tip <- fread("yelp-dataset/yelp_tip.csv")
review <- read_csv("yelp-dataset/yelp_review.csv")

3.3 Data cleaning

In this step, each of the six files is evaluated separately and cleaned for further analysis.

a. business

The business table contains the location and category details of the businesses included in the dataset. There are 174567 businesses listed in the original dataset. For this project, only restaurants from the United States are considered.

# Keep only the data for restaurants in the United States
business  <- business %>% 
             filter(state %in% state.abb) %>% 
             filter(categories %like% "Restaurant")

After keeping only the restaurants in the United States, we have data on 32484 businesses. We first look at how much data is available for each city and keep the top 10 cities for this study.
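The filter above can be tried on a few made-up rows to see what each condition removes. This is only a sketch: toy_business and its values are hypothetical and not taken from the Yelp files. Note that the built-in state.abb vector holds the 50 state codes only, so a "DC" row would be dropped by this condition.

```r
library(dplyr)
library(data.table)  # provides the %like% operator

# Hypothetical toy rows standing in for the business table
toy_business <- data.frame(
  business_id = c("a", "b", "c", "d"),
  state       = c("NV", "ON", "AZ", "NV"),
  categories  = c("Restaurants;Pizza", "Restaurants;Cafes",
                  "Hair Salons", "Bars;Restaurants"),
  stringsAsFactors = FALSE
)

us_restaurants <- toy_business %>%
  # state.abb ships with base R and lists the 50 US state codes
  filter(state %in% state.abb) %>%
  # %like% matches the pattern anywhere in the category string
  filter(categories %like% "Restaurant")

us_restaurants$business_id
# "a" "d"  (row b is in Ontario, row c is not a restaurant)
```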

# Displays the top 10 cities present in the data set based on number of restaurants for which data is available.

business %>% select(state,city) %>% 
             dplyr::group_by(state,city) %>% 
             summarise(n = n()) %>% 
             arrange(desc(n)) %>% 
             head(n = 10) %>%
             datatable(class = 'cell-border stripe hover condensed responsive')

Several of the top 10 cities also appear under more than one name in the data set. The variants found for each city are listed below.

  • Las Vegas : Cities in NV named Las Vegas, North Las Vegas, N Las Vegas, las vegas, N. Las Vegas, South Las Vegas, Las vegas and LasVegas are combined.
  • Phoenix : Cities in AZ named Phoenix, Pheonix, Pheonix AZ and Phoenix Valley are combined.
  • Pittsburgh : Cities in PA named Pittsburgh and East Pittsburgh are combined.
  • Scottsdale : Cities in AZ named Scottsdale and Scottdale are combined.
  • Cleveland : Cities in OH named Cleveland, Cleveland Heights, East Cleveland and Cleveland Hghts. are combined.
  • Mesa : Cities in AZ named Mesa, MESA and Mesa AZ are combined.
# Lists are created so that any additional pattern found later can be added easily.

# Give all restaurants in Las Vegas the same city name
lasVegas <- c("Las Vegas", "North Las Vegas", "N Las Vegas", "las vegas", "N. Las Vegas", "South Las Vegas", "Las vegas", "LasVegas")
business$city[business$city %in% lasVegas & business$state == "NV"] <- "Las Vegas"

# Give all restaurants in Phoenix the same city name
phoenix <- c("Phoenix", "Pheonix", "Pheonix AZ", "Phoenix Valley")
business$city[business$city %in% phoenix & business$state == "AZ"] <- "Phoenix"

# Give all restaurants in Pittsburgh the same city name
pittsburgh <- c("Pittsburgh", "East Pittsburgh")
business$city[business$city %in% pittsburgh & business$state == "PA"] <- "Pittsburgh"

# Give all restaurants in Scottsdale the same city name
scottsdale <- c("Scottsdale", "Scottdale")
business$city[business$city %in% scottsdale & business$state == "AZ"] <- "Scottsdale"

# Give all restaurants in Cleveland the same city name
cleveland <- c("Cleveland", "Cleveland Heights", "East Cleveland", "Cleveland Hghts.")
business$city[business$city %in% cleveland & business$state == "OH"] <- "Cleveland"

# Give all restaurants in Mesa the same city name
mesa <- c("Mesa", "MESA", "Mesa AZ")
business$city[business$city %in% mesa & business$state == "AZ"] <- "Mesa"

# Filters the data so that we analyse only the restaurants from top 10 cities (Based on count)
top_10_cities <- c("Las Vegas", "Phoenix", "Charlotte", "Pittsburgh", "Scottsdale", "Cleveland", "Mesa", "Madison", "Tempe", "Chandler")
business <- business %>% filter(city %in% top_10_cities)
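The per-city vectors above could also be written as a single lookup table that is joined onto the data. The sketch below is an alternative, not the code used in this analysis. city_aliases and toy_business are hypothetical names, and only a few of the variants are included for brevity.

```r
library(dplyr)

# Hypothetical lookup: one row per (state, spelling variant, canonical name)
city_aliases <- data.frame(
  state     = c("NV", "NV", "AZ", "AZ"),
  variant   = c("N Las Vegas", "las vegas", "Pheonix", "Mesa AZ"),
  canonical = c("Las Vegas", "Las Vegas", "Phoenix", "Mesa"),
  stringsAsFactors = FALSE
)

# Hypothetical toy rows standing in for the business table
toy_business <- data.frame(
  business_id = c("a", "b", "c", "d", "e"),
  state       = c("NV", "NV", "AZ", "AZ", "PA"),
  city        = c("N Las Vegas", "Las Vegas", "Pheonix", "Mesa AZ", "Pittsburgh"),
  stringsAsFactors = FALSE
)

normalised <- toy_business %>%
  # Match on state as well, so a variant is only renamed in the right state
  left_join(city_aliases, by = c("state", "city" = "variant")) %>%
  # Use the canonical name where a variant matched, otherwise keep the city
  mutate(city = coalesce(canonical, city)) %>%
  select(-canonical)

normalised$city
# "Las Vegas" "Las Vegas" "Phoenix" "Mesa" "Pittsburgh"
```

With this layout, a newly discovered spelling becomes one extra row in the lookup rather than a change to the code.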

The final data set for the top ten cities contains information on 21406 restaurants. Some of the important variables are described below. A snapshot of the dataset is also provided.

Name         Description
business_id  ID of the business
name         Name of the business
city         City in which the business is located
latitude     Latitude coordinate of the restaurant. Will be used to plot the location on the map.
longitude    Longitude coordinate of the restaurant. Will be used to plot the location on the map.
stars        Rating of the restaurant. Minimum = 0, Maximum = 5
is_open      Whether the restaurant is still operating or has shut down
categories   Categories under which the restaurant falls
# Displays 10 rows from the dataset
datatable(head(business, n = 10), class = 'cell-border stripe hover condensed responsive')

b. business_attributes

The business_attributes table contains business attributes such as music, delivery and parking. There are 152041 businesses listed in the original dataset. We need to filter it and keep only the restaurants in the top 10 cities.

# Only the data for restaurants from top 10 cities are retained

business_attributes  <- business_attributes %>% 
                        filter(business_id %in% business$business_id) 

After filtering, only 21123 restaurants are retained. There are 82 columns in this data set. Not all of them are relevant to us. We will drop the irrelevant variables.

# The attributes that are not significant are dropped from the dataset.

business_attributes <- business_attributes %>% 
                       select(-c( HasTV, Caters , BusinessAcceptsBitcoin, BYOBCorkage, BYOB), -contains("Hair"))

There are 68 variables left in the data set now. The final data set for top ten cities contains information regarding 21123 observations.

# Display 10 rows of the business_attributes table
kable(head(business_attributes, n = 10))
business_id AcceptsInsurance ByAppointmentOnly BusinessAcceptsCreditCards BusinessParking_garage BusinessParking_street BusinessParking_validated BusinessParking_lot BusinessParking_valet RestaurantsPriceRange2 GoodForKids BikeParking Alcohol NoiseLevel RestaurantsAttire Music_dj Music_background_music Music_no_music Music_karaoke Music_live Music_video Music_jukebox Ambience_romantic Ambience_intimate Ambience_classy Ambience_hipster Ambience_divey Ambience_touristy Ambience_trendy Ambience_upscale Ambience_casual RestaurantsGoodForGroups WiFi RestaurantsReservations RestaurantsTakeOut HappyHour GoodForDancing RestaurantsTableService OutdoorSeating RestaurantsDelivery BestNights_monday BestNights_tuesday BestNights_friday BestNights_wednesday BestNights_thursday BestNights_sunday BestNights_saturday GoodForMeal_dessert GoodForMeal_latenight GoodForMeal_lunch GoodForMeal_dinner GoodForMeal_breakfast GoodForMeal_brunch CoatCheck Smoking DriveThru DogsAllowed Open24Hours Corkage DietaryRestrictions_dairy-free DietaryRestrictions_gluten-free DietaryRestrictions_vegan DietaryRestrictions_kosher DietaryRestrictions_halal DietaryRestrictions_soy-free DietaryRestrictions_vegetarian AgesAllowed RestaurantsCounterService
fNMVV_ZX7CJSDWQGdOM8Nw Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na
rDMptJYWtnMhpQu_rRXHng Na Na Na Na False False False True Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na True Na Na Na Na Na Na Na Na Na Na Na
1WBkAuQg81kokZIPMpn9Zg Na Na Na Na False False False True Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na False Na Na Na Na Na Na Na Na Na Na Na
Pd52CjgyEU3Rb8co6QfTPw Na Na Na Na Na Na Na Na Na Na True Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na
4srfPk1s8nlm1YusyDUbjg Na Na Na Na False False False False Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na False Na Na Na Na Na Na Na Na Na Na Na
n7V4cD-KqqE3OXk0irJTyA Na Na Na Na True False False True Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na
EJFdWX908N8Yc2XG0Lky8A Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na False Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na
iPa__LOhse-hobC2Xmp-Kw Na Na Na Na False False False True Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na True Na Na Na Na Na Na Na Na Na Na Na
o1fTwfqN0sDFNpV1CkOPPg Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na
-nHkhiuerqmfBG3v2v9O-g Na Na Na Na Na Na Na Na Na Na False Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na Na
# Missing values are stored as the character string 'Na'. They are replaced by NA
business_attributes[,2:ncol(business_attributes)][ business_attributes[,2:ncol(business_attributes)] == 'Na' ] <- NA

# The percentage of missing values is calculated for each attribute
data.table("Variable" = colnames(business_attributes), "Percentage of NA values" = colMeans(is.na(business_attributes)))
##                            Variable Percentage of NA values
##  1:                     business_id               0.0000000
##  2:                AcceptsInsurance               1.0000000
##  3:               ByAppointmentOnly               1.0000000
##  4:      BusinessAcceptsCreditCards               0.9941770
##  5:          BusinessParking_garage               0.9980116
##  6:          BusinessParking_street               0.3398192
##  7:       BusinessParking_validated               0.3398192
##  8:             BusinessParking_lot               0.3398192
##  9:           BusinessParking_valet               0.3398192
## 10:          RestaurantsPriceRange2               1.0000000
## 11:                     GoodForKids               0.9961180
## 12:                     BikeParking               0.6178573
## 13:                         Alcohol               0.9974909
## 14:                      NoiseLevel               0.9985324
## 15:               RestaurantsAttire               0.9994792
## 16:                        Music_dj               0.9991952
## 17:          Music_background_music               1.0000000
## 18:                  Music_no_music               1.0000000
## 19:                   Music_karaoke               1.0000000
## 20:                      Music_live               1.0000000
## 21:                     Music_video               1.0000000
## 22:                   Music_jukebox               1.0000000
## 23:               Ambience_romantic               1.0000000
## 24:               Ambience_intimate               0.9993372
## 25:                 Ambience_classy               0.9993372
## 26:                Ambience_hipster               0.9993372
## 27:                  Ambience_divey               0.9993372
## 28:               Ambience_touristy               0.9993372
## 29:                 Ambience_trendy               0.9993372
## 30:                Ambience_upscale               0.9993372
## 31:                 Ambience_casual               0.9993372
## 32:        RestaurantsGoodForGroups               0.9993372
## 33:                            WiFi               0.9992425
## 34:         RestaurantsReservations               0.9991005
## 35:              RestaurantsTakeOut               0.9976803
## 36:                       HappyHour               0.9826729
## 37:                  GoodForDancing               0.9987691
## 38:         RestaurantsTableService               0.9863182
## 39:                  OutdoorSeating               0.9774653
## 40:             RestaurantsDelivery               0.9966387
## 41:               BestNights_monday               0.9356625
## 42:              BestNights_tuesday               0.9995266
## 43:               BestNights_friday               0.9995266
## 44:            BestNights_wednesday               0.9995266
## 45:             BestNights_thursday               0.9995266
## 46:               BestNights_sunday               0.9995266
## 47:             BestNights_saturday               0.9995266
## 48:             GoodForMeal_dessert               0.9995266
## 49:           GoodForMeal_latenight               0.9401127
## 50:               GoodForMeal_lunch               0.9401127
## 51:              GoodForMeal_dinner               0.9401127
## 52:           GoodForMeal_breakfast               0.9401127
## 53:              GoodForMeal_brunch               0.9401127
## 54:                       CoatCheck               0.9401127
## 55:                         Smoking               0.9943190
## 56:                       DriveThru               0.9820101
## 57:                     DogsAllowed               0.8610993
## 58:                     Open24Hours               0.9996213
## 59:                         Corkage               1.0000000
## 60:  DietaryRestrictions_dairy-free               1.0000000
## 61: DietaryRestrictions_gluten-free               0.9978696
## 62:       DietaryRestrictions_vegan               0.9978696
## 63:      DietaryRestrictions_kosher               0.9978696
## 64:       DietaryRestrictions_halal               0.9978696
## 65:    DietaryRestrictions_soy-free               0.9978696
## 66:  DietaryRestrictions_vegetarian               0.9978696
## 67:                     AgesAllowed               0.9978696
## 68:       RestaurantsCounterService               1.0000000
##                            Variable Percentage of NA values

There are only 5 columns with less than 50% missing values (a share below 0.5 in the output above): business_id and four of the parking attributes. Since parking information alone is not valuable for this analysis, we discard this data set.
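The 50% cut-off can be applied programmatically rather than read off the printed output. The following is a minimal base R sketch on a hypothetical toy table (toy_attr and its values are made up), using the same 'Na' placeholder.

```r
# Hypothetical toy attribute table using the same 'Na' placeholder
toy_attr <- data.frame(
  business_id         = c("a", "b", "c", "d"),
  BusinessParking_lot = c("True", "Na", "False", "True"),
  WiFi                = c("Na", "Na", "Na", "free"),
  Corkage             = c("Na", "Na", "Na", "Na"),
  stringsAsFactors = FALSE
)

# Replace the placeholder string with a real NA
toy_attr[toy_attr == "Na"] <- NA

# Share of missing values per column (a proportion between 0 and 1)
na_share <- colMeans(is.na(toy_attr))

# Columns that are less than half missing
keep <- names(na_share)[na_share < 0.5]
keep
# "business_id" "BusinessParking_lot"
```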

# Discard business attributes

rm("business_attributes")

c. business_hours

The business_hours table contains the working hours of the businesses included in the dataset. There are 174567 businesses listed in the original dataset. We need to filter it and keep only the restaurants in the top 10 cities.

# Only the data for restaurants from top 10 cities are retained

business_hours  <- business_hours %>% 
                   filter(business_id %in% business$business_id) 

The final data set for the top ten cities contains information on 21406 restaurants. The data set has a separate column for each weekday. We create a single column for the weekday and separate the starting and closing hours for each business. The time variables are converted from character to an hour and minute format. The final form of the data set is displayed below.

business_hours <- business_hours %>%
                  # Name of the column will be stored in 'week_day' and value will be stored in 'working hours'
                  gather(week_day, working_hours, -business_id) %>%
                  # The character value None represents NA
                  transform(working_hours =  ifelse(working_hours == 'None',NA,working_hours)) %>% 
                  # All null values are removed
                  na.omit() %>% 
                  # From 'working_hours' column, starting and closing hours are separated
                  separate(col = working_hours, into = c("starting_hour", "closing_hour"), sep = "-") %>% 
                  # Starting and closing hours are stored as Hour and minutes format
                  transform(starting_hour = hm(starting_hour), closing_hour = hm(closing_hour))
# Displays 5 rows of business_hours table

kable(head(business_hours, n = 5))
   business_id             week_day  starting_hour  closing_hour
1  fNMVV_ZX7CJSDWQGdOM8Nw  monday    7H 0M 0S       15H 0M 0S
3  1WBkAuQg81kokZIPMpn9Zg  monday    11H 0M 0S      22H 0M 0S
4  Pd52CjgyEU3Rb8co6QfTPw  monday    8H 30M 0S      22H 30M 0S
6  n7V4cD-KqqE3OXk0irJTyA  monday    11H 0M 0S      0S
8  iPa__LOhse-hobC2Xmp-Kw  monday    5H 0M 0S       23H 0M 0S
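The reshaping step above can be tried on a small table to see what each stage does. This sketch uses pivot_longer, the newer tidyr function that supersedes gather, so it is a variant of the code above rather than a copy of it. toy_hours and the "H:MM-H:MM" strings in it are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical wide table: one column per weekday, 'None' when closed
toy_hours <- data.frame(
  business_id = c("a", "b"),
  monday      = c("7:00-15:00", "None"),
  tuesday     = c("7:00-15:00", "11:00-22:30"),
  stringsAsFactors = FALSE
)

long_hours <- toy_hours %>%
  # One row per business and weekday
  pivot_longer(-business_id, names_to = "week_day", values_to = "working_hours") %>%
  # Drop the days on which the business is closed
  filter(working_hours != "None") %>%
  # Split "start-end" into two columns
  separate(working_hours, into = c("starting_hour", "closing_hour"), sep = "-") %>%
  # Parse the strings into lubridate periods (hours and minutes)
  mutate(starting_hour = hm(starting_hour),
         closing_hour  = hm(closing_hour))

nrow(long_hours)                  # 3
hour(long_hours$starting_hour)    # 7 7 11
minute(long_hours$closing_hour)   # 0 0 30
```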

d. checkin

The checkin table contains the number of check-ins that occurred at a restaurant on a given weekday and hour of the day. There are 3911218 check-in observations for 146350 businesses listed in the original dataset. We need to filter it and keep only the restaurants in the top 10 cities. The hour was stored as integer. It is converted to time format. To display the weekdays in order, they are converted to ordered factors.

checkin  <- checkin %>% 
            # Only the data for restaurants from top 10 cities are retained
            filter(business_id %in% business$business_id) %>%
            # Hour is converted to time format
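            transform( hour = hour(hm(hour)),
                       # Weekdays are stored as factors with order level.
                       # The filtered 'weekday' column is used here; 'checkin$weekday' would refer to the unfiltered table and have a different length
                       weekday = factor(weekday, levels = c("Sun","Sat","Fri","Thu","Wed","Tue","Mon"), ordered = TRUE) )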

The final data set for the top ten cities contains 1054197 check-in observations for 20714 restaurants. Variable descriptions are given below, followed by the minimum, maximum and average number of check-ins for each weekday.

Name         Description
business_id  ID of the business
weekday      Weekday on which the check-in occurred
hour         Hour at which the check-in occurred
checkins     Average number of check-ins for the given weekday and hour

weekday  min_checkins  max_checkins  average_checkins
Sun      1             827           6.606031
Sat      1             741           6.625402
Fri      1             831           6.641018
Thu      1             700           6.586976
Wed      1             785           6.595957
Tue      1             703           6.648213
Mon      1             875           6.640417
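The code behind the weekday summary is not shown. One way such a table could be computed is sketched below, assuming a checkins column as described in the variable list. toy_checkin and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy rows standing in for the checkin table
toy_checkin <- data.frame(
  business_id = c("a", "a", "b", "b", "c"),
  weekday     = c("Mon", "Mon", "Tue", "Mon", "Tue"),
  hour        = c(9, 12, 18, 20, 8),
  checkins    = c(1, 5, 2, 3, 4),
  stringsAsFactors = FALSE
)

weekday_summary <- toy_checkin %>%
  # Same ordered levels as in the cleaning step, so rows come out in that order
  mutate(weekday = factor(weekday,
                          levels = c("Sun", "Sat", "Fri", "Thu", "Wed", "Tue", "Mon"),
                          ordered = TRUE)) %>%
  group_by(weekday) %>%
  summarise(min_checkins     = min(checkins),
            max_checkins     = max(checkins),
            average_checkins = mean(checkins))

weekday_summary
# Two rows: Tue (min 2, max 4, mean 3) and Mon (min 1, max 5, mean 3)
```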

A snapshot of data is displayed below.

# Displays 10 rows of checkin table

datatable(head(checkin, n = 10), class = 'cell-border stripe hover condensed responsive')

e. tips

The tip table contains the attributes of tips given by users. There are 1098324 observations for 112365 businesses listed in the original dataset. We need to filter and keep the restaurants in top 10 cities.

# Only the data for restaurants from top 10 cities are retained

tip  <- tip %>% 
        filter(business_id %in% business$business_id) 

The final data set for the top ten cities contains 494099 tips for 19367 restaurants.
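Counts such as the number of tips and the number of distinct restaurants can be checked directly after filtering. The sketch below uses semi_join, a dplyr alternative to filtering with %in% that keeps the same rows. toy_business and toy_tip are hypothetical.

```r
library(dplyr)

# Hypothetical toy tables
toy_business <- data.frame(business_id = c("a", "b"), stringsAsFactors = FALSE)
toy_tip <- data.frame(
  business_id = c("a", "a", "c", "b"),
  text        = c("Great coffee", "Try the pie", "Closed", "Good service"),
  stringsAsFactors = FALSE
)

# Keep only tips whose business_id appears in the business table
kept <- toy_tip %>% semi_join(toy_business, by = "business_id")

n_tips        <- nrow(kept)                     # 3
n_restaurants <- n_distinct(kept$business_id)   # 2
```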

# Displays 10 observations for tip data set.

kable(head(tip, n = 10))
text date likes business_id user_id
Sunday $.55 bone-in wings
Monday $.55 boneless wings 2016-08-22 0 –ujy voQlwVoBgMYtA DiLA u lQ8Nyj7jCUR8M83SUMoRQ
Black Angus and the Roast beef :) 2012-12-03 0 JzB7NITHQ7gVHGVZ1ntgIQ TvkqJ8YEIsTb16RnnrNyfQ
Expensive, but convenient for hotel stays 2012-12-02 0 h14GmWZ8rXum9fXF__wt3w TvkqJ8YEIsTb16RnnrNyfQ
Finally, found some churros. Four types here. It should be great! 2012-03-28 0 xFN8mRubo3G0oIzJwc8XBA TvkqJ8YEIsTb16RnnrNyfQ
closed closed closed closed closed closed closed closed closed closed closed closed closed closed closed closed closed closed closed closed 2012-03-28 0 Xmndl6GoZg8taEUlwQMYxg TvkqJ8YEIsTb16RnnrNyfQ
Try one of the Bento Box options 2012-10-09 0 eZDXz_RylvdD0tHEA8I0NA TvkqJ8YEIsTb16RnnrNyfQ
USAir check-in desk agent Nancy E. is the sadest poor customer service provider I experienced this week. She doesn’t want to be there. #Fail 2011-05-20 0 u7CxxEzx8hvjoJ8onN4zTg TvkqJ8YEIsTb16RnnrNyfQ
Great weather for eating outdoors. Good service. 2012-03-27 0 1CqDdPrrb0xvQpgu7fhI5w TvkqJ8YEIsTb16RnnrNyfQ
The fried banana dessert is good 2012-09-02 0 _AKdBFzkl7GY-daxUCCbVA TvkqJ8YEIsTb16RnnrNyfQ
I didn’t eat here, but they were nice enough to tell me where to find tacos de lengua y churros. Look at my next post to find out where :) 2012-03-28 0 HWjqW5ZFJ8eZRQuHcpySQA TvkqJ8YEIsTb16RnnrNyfQ

f. reviews

The review table contains information regarding a review given by a user for a business. There are 5261668 reviews for 174567 businesses listed in the original dataset. We need to filter and keep the restaurants in top 10 cities.

# Only the data for restaurants from top 10 cities are retained

review  <- review %>% 
           filter(business_id %in% business$business_id ) 

The final data set for top ten cities contains 2106287 reviews for 21406 businesses. The description for important variables and a snapshot of data can be seen below.

Name         Description
review_id    ID of the review
user_id      ID of the user who gave the review
business_id  ID of the business which is reviewed
stars        Stars given by the user for the business
date         Date of the review
text         Review text
useful       Number of users who found the review useful
funny        Number of users who found the review funny
cool         Number of users who found the review cool

# Displays 10 observations for review data set.

kable(head(review, n = 10))
review_id user_id business_id stars date text useful funny cool
nsThIz_-TuvgoFh0o9XJfQ _L2SZSwf7A6YSrIHy_q4cw IXXERocY1bqGwRllcy8J2w 5 2009-08-30 Visiting from SF. Checked yelp and found this place. It is very small – a converted house. As you sit in it you can watch the kitchen. The food was excellent. We had an eggplant/red pepper omelet and peach waffles. The waffles were light and fluffy with fresh whipped creme.

We got pastries to go, which were also good, in particular the chocolate croissant was unique.

Great place for an informal breakfast. 0 0 0 BF0ANB54sc_f-3_howQBCg ssuXFjkH4neiBgwv-oN4IA JlNeaOymdVbE6_bubqjohg 1 2014-08-09 We always go to the chevo’s in chandler which is delicious, the one in ahwatukee is different for some reason. Ordered the chicken rolled tacos today there was a tiny lil piece of chicken in each one, so basically I had 3 rolled deep fried tortillas yuck! :( No flavor what so ever. Also ordered carne asada taco the meat tasted old like it was cooked earlier and just thrown on the grill to get warm. Very dissapointed!! 3 0 0 QgSf2JvYz-M4PU2yuJjxNQ nOTl4aPC4tKHK35T3bNauQ 9Jc3W0aR9Xf2gcHI0rEXsw 1 2012-08-23 After being scared away from Rock & Rita’s, we ended up at this place, which was, if nothing else, quieter.

I’ll start by saying that the hostess and our server were both lovely and they’re really what earned the single star, because they were sweet, but the food was just horrible.

I ordered a simple grilled cheese sandwich. I asked the server if it was processed cheese, or “real” cheese in the sandwich and he had no idea what I was talking about. I tried to clarify by saying, “Is it like Kraft single slices, or deli cheese?” He said it was “good”. This really should have been a sign and I could probably fault the waiter for this, since you SHOULD know what you’re serving, but he was so sweet, I couldn’t be mad!

My husband ordered a hot dog of sorts.

I asked if there was a way to substitute fries for a salad. There was not, but the server hooked me up and brought me a salad anyway (no charge). I asked for sweet potato fries with the sandwich.

Of course, my sandwich was made with plasticy, processed cheese, that was melted to the point of Cheeze-Wiz and was inedible. I took one bite of the sandwich and couldn’t do another. I got regular fries, instead of sweet potato (didn’t bother complaining, because I got the free salad, so whatever…like I said, nice server!) My husband’s food was fine, although nothing special. They were good with the refills of our drinks.

Anyway, as nice as the servers were, it wasn’t good. I would avoid eating here if I could. 0 1 0 gN6GARS_BRr5UX2D3WAH0w nOTl4aPC4tKHK35T3bNauQ xVEtGucSRLk5pxxN0t4i6g 5 2012-08-23 We got recommendations for this place from my parents and so, for our anniversary, we booked here. We were told though that it was first come, first served, in terms of having the ideal seats by the windows that overlook the fountains of the Bellagio and The Strip. We went for lunch though, so it wasn’t too busy. We were seated by the window and were promptly served some sort of puffed rolls, with cheese and herbs infused. They were delicious!

We ordered the onion soup to start and I ordered parmesan chicken and my husband ordered a lamb burger. Our food came and was delicious. The soup was probably my favorite part. We were feeling very full by the end and so declined dessert, but were still given some jellies and chocolate truffle and then were served a small chocolate mousse, on a decorated plate that said Happy Anniversary, which was a total surprise (they had asked when I booked if it was a special occasion, but had said nothing while we were there). Overall the service was great, the food delicious and the view wonderful. It was a great experience! 0 0 0 t4oXDPN4S4USIhBGpuSD8A nOTl4aPC4tKHK35T3bNauQ 2LZGeJy8qByYKB71ML-jcw 2 2012-08-23 We got a coupon to eat here when we checked in: $6.99 for breakfast and a second one for half, or free, or something to that effect. We went and had our breakfasts, which was served by a rather surely waitress and was average. If this had been our only experience, I would have given three stars, but another night we got in very late and were just looking for a late dinner and found ourselves at this place. It was not very crowded, but was as loud as if three times as many people were there. There was a small group of trashy (very obese people), scream/singing karaoke on a stage in the front of the restaurant. One of them was singing Eminem and the others were screaming and cheering. We sat down in the back, to try and avoid the chaos, but it was way too disruptive. Other people sitting there, who were in the middle of their dinners and therefore stuck there, were looking pained. We didn’t bother putting ourselves through this and left. It’s a crappy menu, that’s overpriced anyway. I guess Rock and Rita’s name really says it all, but basically if you’re not a fall-down, cheap drunk, I don’t think you’ll have much to say about this place. 
2 1 0 R9w7GeMX_KZTV23gmI8Zjg tL2pS5UOmN6aAOi3Z-qFGg RhV7sraRUB3km-gF-tmDow 3 2013-02-06 I’ve eaten here numerous times and am still amazed how popular these places are. It must be because they’re open 24 hours and late at night it’s one of the few places you can get something to eat fast. That’s the only reason I eat here.

The good is soso. It fills you up but as far as being flavorful it leaves a lot to be desired.

I’m sure I’ll visit again. It serves a purpose. 0 0 0 oncT7W70CFwzzJkQoz3T5Q tL2pS5UOmN6aAOi3Z-qFGg NaZVUOzqk5b-l0mlki-9Og 4 2017-02-10 We stopped in here for lunch this afternoon. Staff was helpful and friendly. Food was good and you get a lot for the price. We noted that they deliver and since we live close by we’ll probably order from them in the future. 2 0 0 9dWoAJGcJHWscv2ZAdzkNg tL2pS5UOmN6aAOi3Z-qFGg tJzf6H1dkuUbL-t8bzL3dw 5 2014-04-27 I was looking for a nice place to take the family to dinner last night. After reading the reviews and looking at the photos I settled on this place. Great choice!

The ambience was perfect. Nice and quite so you don’t have to talk loud so your guests can hear you.

I was really surprised that on a Saturday night this place only had a few other tables with diners. You’d think a place like this would be packed.

The service is what really made this place great for me. I would rate this place in the top two or three restaurants I’ve ever eaten in as far as service goes.

The food was good to. Everyone in my party of 6 enjoyed their meals. I personally would rate my steak a 4 out of 5. But with the kind of service you get here I didn’t mind.

The total bill came to 274 and change for a party of 6. That includes one bottle of wine and a couple of other mixed drinks. So really not a bad price for a place like this.

If you’re looking for a romantic place to take a date or just a nice quiet place to dine I would highly recommend Carve! 0 0 0 ZGlUf9noms8FQ67rmTZSdA tL2pS5UOmN6aAOi3Z-qFGg FtaTjyMUIY457tPJahjg1A 4 2014-04-25 We stopped by here for lunch this afternoon. The place was packed.

I thought we’d have a long wait, but we were shown to a booth in just a few minutes.

The hostess and our waitress were both very friendly. I appreciated them taking the extra effort to stay friendly when it was crazy busy like that. I deal with the public on a daily basis and know when it’s super busy like that it can sometimes be a challenge to keep a smile on your face and a kind word on your lips.

All of our meals were excellent. If I’m ever in the neighborhood I’d definitely eat here again. 0 0 0 pcszB9oTZE2DNylbbXIZAg tL2pS5UOmN6aAOi3Z-qFGg yLiaMaJFq03JxXPk4puloQ 3 2017-04-20 I’ve stopped in here several times. It’s always busy but they seem to move along at a decent speed.

As with all fast food joints nowadays be sure to check your bag before you go as there’s a 50/50 chance they won’t get your order right.

They’ve got one of those new machines inside where you can place your order. Thanks but no thanks. I’d rather place my order with an old fashioned human being. 1 2 1

4. Exploratory Data Analysis

Top cuisines

Most frequent categories in data set

The categories variable lists the categories to which a restaurant belongs. A few redundant categories such as “Restaurants”, “Food”, “Nightlife”, “Bars” and “New” were identified in the initial analysis. These are removed before creating the word cloud of the most frequent categories across cities. The word cloud includes only those categories that appear in the data set at least 300 times. The most frequent categories appear in the center of the word cloud in large type. Moving outwards, the size of the words decreases, denoting lower frequencies. Words of the same color fall in a similar frequency range.

# All categories listed for different restaurants are taken
categories <- unlist(strsplit(business$categories, ";"))
# Most common categories like food are removed
remove_categories <- c("Restaurants", "Food", "Nightlife", "Bars", "New")
clean_categories <- removeWords(categories, remove_categories)
# word cloud is created with this set of categories
wordcloud(clean_categories, 
          min.freq = 300,
          random.order=FALSE, 
          rot.per=0.35,
          colors=brewer.pal( 8,"Dark2"))
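The counting behind the word cloud can be sketched in base R on a few made-up rows. This is only an illustration: toy_categories is hypothetical data, and unlike removeWords above, which strips words from inside each string, this sketch drops whole categories that match the generic list.

```r
# Hypothetical toy data standing in for business$categories
toy_categories <- c("Restaurants;Pizza;Italian",
                    "Food;Coffee & Tea;Cafes",
                    "Restaurants;Italian;Bars")

# Split each semicolon-separated string into individual categories
all_categories <- unlist(strsplit(toy_categories, ";", fixed = TRUE))

# Drop the generic categories before counting
generic <- c("Restaurants", "Food", "Nightlife", "Bars")
kept <- all_categories[!all_categories %in% generic]

# Frequency table, most frequent category first
category_counts <- sort(table(kept), decreasing = TRUE)
print(category_counts)
```

On the real data, a threshold such as the 300 used above would then be applied to these counts.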

Geographical locations

The locations of restaurants are shown on geographical maps in this section. A function is created to plot the maps. It takes the city name, filters the data for that city and plots the map using the leaflet package. Since there are so many restaurants in a city, they are clustered on the map. When you hover the mouse over a cluster count, the region over which the count is aggregated is highlighted. This shows us where restaurants are densely concentrated.

Clicking a numbered circle zooms in to that area. On zooming in, the aggregation regions change. At the lowest zoom level, the individual restaurant locations can be seen. At this point, a restaurant with a rating of 1 or 2 is considered a low rated outlet and is represented by a red circle. A rating of 3 is considered average and is shown as a blue circle. High rated restaurants (places with a rating of 4 or 5) are represented by green circles.

geographical_map <- function(location_name){
  
      location_business <- business %>%
                          # filter for the city
                          filter(city == location_name) %>%
                          # Creates 3 level based on rating
                          mutate( rating_level = ifelse(stars == 4 | stars == 5 ,"High", ifelse(stars == 3, "Medium", "Low")))
      
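      # Creates color palette for rating levels; levels are given explicitly because
      # colorFactor sorts a character domain alphabetically (High, Low, Medium)
      pallete <- colorFactor(c("dark red",  "blue","dark green"), levels = c("Low", "Medium","High"))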
     
      location_business %>%  
                 leaflet() %>% 
                 setView(lng = mean(location_business$longitude), 
                         lat = mean(location_business$latitude), 
                         zoom = 12) %>% 
                 addProviderTiles(providers$CartoDB.Positron) %>%
                 addCircleMarkers(~longitude, 
                                  ~latitude,
                                  radius = 3,
                                  fillOpacity = 0.5,
                                  # Creates clusters for restaurants on high level
                                  clusterOptions = markerClusterOptions(),
                                  # Color palette is assigned based on rating level
                                  color = ~pallete(rating_level))
     
}
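The mapping from star rating to rating level and color can be checked in isolation with a small base R sketch. The helper rating_level and the vector level_colors are hypothetical names used only here. The cut points are an assumption taken from the description above, and they place half-star values such as 3.5 in the next level up, which may differ from the ifelse in geographical_map.

```r
# Assumed thresholds: 1-2 is "Low", 3 is "Medium", 4-5 is "High"
rating_level <- function(stars) {
  cut(stars,
      breaks = c(0, 2, 3, 5),
      labels = c("Low", "Medium", "High"),
      right = TRUE)
}

toy_stars <- c(1, 2, 3, 4, 5)
levels_out <- rating_level(toy_stars)
print(data.frame(stars = toy_stars, level = levels_out))

# One color per level, looked up by level name rather than by position
level_colors <- c(Low = "darkred", Medium = "blue", High = "darkgreen")
print(level_colors[as.character(levels_out)])
```

Looking colors up by name avoids depending on the order in which the levels happen to be sorted.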

Restaurants across Las Vegas

geographical_map("Las Vegas")

Restaurants across Phoenix

geographical_map("Phoenix")

Restaurants across Pittsburgh

geographical_map("Pittsburgh")

Restaurants across Scottsdale

geographical_map("Scottsdale")

Restaurants across Chandler

geographical_map("Chandler")

Restaurants across Madison

geographical_map("Madison")

Restaurants across Cleveland

geographical_map("Cleveland")

Restaurants across Mesa

geographical_map("Mesa")

Restaurants across Charlotte

geographical_map("Charlotte")

Restaurants across Tempe

geographical_map("Tempe")

5. Sentiment Analysis

Since the available RAM cannot handle the review data for all 10 locations, only Las Vegas reviews are used for sentiment analysis. To avoid untrustworthy reviews, a review is considered for analysis only if more than 5 people have rated it as useful. The text of each review is converted to lower case, and numbers and stop words are removed from it. Three terms were found to occur with high frequency across reviews in the initial analysis: Las Vegas, http and www.yelp.com. These are also removed from the text.

useful <-     review %>% 
              left_join(business, by = "business_id") %>% 
              # Filters reviews for Las Vegas
              filter(city == 'Las Vegas') %>%
              # Only reviews that more than 5 people found useful are taken
              filter(useful > 5)

# All text is converted to lower case
useful$text <- tolower(useful$text)
# Stop words and repeating words like Las vegas, http, www.yelp.com are removed
useful$text <- removeWords(useful$text ,c(stopwords("en"), "las vegas","http","www.yelp.com" ))
useful$text <- removeNumbers(useful$text )

Frequently linked words

The reviews given by customers about food, service, ambience and staff are of interest for this case study. As a first step, words that generally appear in reviews right after the key words food, service, ambience and staff are analysed through a network graph. For this, bigrams containing any of these four words as the first word are created. To avoid insignificant relationships that crowd the space, only those words that appear more than 100 times after these key words are considered.

# Bigrams are created with words in review text
bigrams <- useful %>% 
           unnest_tokens(bigram, text, token = "ngrams", n = 2)

# List of words that are of significance
analysis_word <- c("food", "ambience", "staff", "service")

# Creates data for network analysis graph
bg_grapgh <- bigrams %>%
             # Words from bigram are separated
             separate(bigram, c("word1", "word2"), sep = " ")  %>% 
             # Count for each combination of words are calculated
             group_by(word1, word2) %>% 
             summarise( n = n()) %>%
             # Only the combinations with a significant word in the beginning and more than 100 occurrences are taken
             filter( word1 %in% analysis_word & n > 100) %>%
             # Creates data for network graph
             graph_from_data_frame()
             
arrow_format <- grid::arrow(type = "closed", length = unit(.1, "inches"))

## Visual representation of connection of pair of words
ggraph(bg_grapgh, layout = "fr") +
  # Connection between words are represented by arrows
  geom_edge_link(aes(edge_alpha = n), 
                 show.legend = TRUE,
                 arrow = arrow_format, 
                 end_cap = circle(.1, 'inches')) +
  # Nodes for words
  geom_node_point(color = 'light blue', 
                  size = 7) +
  # Text is displayed
  geom_node_text(aes(label = name), 
                 vjust = 1, 
                 hjust = 1,
                 repel = TRUE) +
  theme_void()

The opacity of an arrow denotes the number of times the word pair appeared. Service is linked to both horrible and great with dark arrows, so there are a large number of both positive and negative reviews for service. Food is linked to many words in the reviews. Service and food share many of the words used to describe them. Ambience and staff have fewer words that follow them with high frequency.
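To make the bigram step concrete, here is a minimal base R sketch that builds bigrams from two made-up reviews and counts the pairs that start with a key word. The helper make_bigrams and the toy reviews are hypothetical, and the text is assumed to be already cleaned of stop words.

```r
toy_reviews <- c("food great service slow",
                 "service great food great")

# Build bigrams (pairs of consecutive words) for one review
make_bigrams <- function(text) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  if (length(words) < 2) {
    return(data.frame(word1 = character(0), word2 = character(0),
                      stringsAsFactors = FALSE))
  }
  data.frame(word1 = head(words, -1), word2 = tail(words, -1),
             stringsAsFactors = FALSE)
}

toy_bigrams <- do.call(rbind, lapply(toy_reviews, make_bigrams))

# Keep bigrams whose first word is a key word, then count each pair
key_words <- c("food", "service")
kept <- toy_bigrams[toy_bigrams$word1 %in% key_words, ]
pair_counts <- aggregate(n ~ word1 + word2, data = cbind(kept, n = 1), FUN = sum)
print(pair_counts)
```

Each row of pair_counts corresponds to one arrow in a network graph like the one above, with n driving how strongly it is drawn.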

Contributing words

A restaurant is rated from 1 to 5. Through this visualization, the patterns in reviews given for food, ambience, staff and service are studied. For each rating, we find the words that contribute to a positive or negative review for each attribute. These are the words that directly follow the significant words (food, ambience, staff or service).

We use sentiments from the AFINN lexicon for this analysis. The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. We calculate the product of the score and the number of occurrences to identify how much a word affects a review. The top 5 positive and negative words for each attribute are shown for every rating. If multiple words tie on contribution, all of them are displayed.
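The contribution measure can be worked through by hand on a toy table. In this sketch the counts are invented and the scores are illustrative stand-ins, not values looked up from the lexicon. It also uses head, which cuts ties, whereas top_n in the analysis keeps them.

```r
# Hypothetical counts of words following "service"; scores are illustrative
toy <- data.frame(
  word2 = c("great", "good", "horrible", "awful", "amazing"),
  score = c(3, 3, -3, -3, 4),
  n     = c(120, 80, 60, 10, 40),
  stringsAsFactors = FALSE
)

# Contribution = sentiment score * number of occurrences
toy$contribution <- toy$n * toy$score
toy$sign <- ifelse(toy$score > 0, "P", "N")

# Top 2 words by absolute contribution within each sign
top_by_sign <- lapply(split(toy, toy$sign), function(d) {
  head(d[order(-abs(d$contribution)), ], 2)
})
print(top_by_sign)
```

A frequent, mildly positive word can therefore outrank a rare, strongly positive one: here "great" (3 x 120) contributes more than "amazing" (4 x 40).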

For 5 rated restaurants, reviews use words like great and amazing to describe the staff and service. For 4 rated restaurants, good and awesome take their place. For top rated restaurants, all of the top 5 words describe food quality. For 4 rated restaurants, however, the fifth positive word is pretty, which could describe the presentation of the food. The distinguishing factor between 4 and 5 could therefore be the taste or quality of the food. Top rated restaurants also have a classy ambience that distinguishes them from 4 rated ones.

As we move down the ratings, the negative words increase across the attributes. The staff at low rated restaurants are described as annoying, stingy, confused and dumb. The service sucks and is horrible and awful. Food becomes disappointing and horrible. Low rated restaurants have uncomfortable, poor or horrible ambience.

Words that contribute to positive and negative sentiments for key attributes in a rating category

# afinn lexicon is imported
AFINN <- get_sentiments("afinn")

analysis <- bigrams %>%
                # Bigrams are separated
                separate(bigram, c("word1", "word2"), sep = " ") %>%
                # Only the words under analysis is chosen as first word
                filter(word1 %in% analysis_word) %>%
                # AFINN lexicon is used for sentiment analysis
                inner_join(AFINN, by = c(word2 = "word")) %>%
                # Count for each word and rating is taken
                group_by(word1, word2, score,stars.x)  %>%
                summarise(n = n()) %>%
                ungroup()

# creates plots for each star rating
star_plot <- function(star){
    analysis_plot <- analysis %>% filter(stars.x == star) %>%
        mutate(contribution = n * score, sign = ifelse(score > 0 , "P", "N")) %>%
        arrange(desc(abs(contribution))) %>%
        group_by(word1,sign) %>%
        # Selects the top 5 contributions to both positive and negative emotions
        top_n(5, abs(contribution)) %>%
        ggplot(aes(drlib::reorder_within(word2, contribution, word1), 
                   contribution, 
                   # Color is based on positive or negative emotion
                   fill = contribution > 0)) +
        geom_bar(stat = "identity", show.legend = FALSE) +
        xlab("Words preceded by topic under analysis") +
        ylab("Sentiment score * Number of occurrences") +
        ggtitle(paste("Contributing words for rating : ", as.character(star)))+
        drlib::scale_x_reordered() +
        facet_wrap( ~ word1,scales = "free", nrow = 1) +
        coord_flip()
    return(analysis_plot)
}

# Creates a grid of 5 rows, one per star rating
# The columns within each row come from the facet_wrap over the key words
star_plots <- lapply(c(5,4,3,2,1), star_plot)
do.call("grid.arrange", c(star_plots, ncol = 1))

Emotions across rating

Now that we have found the words that contribute positive and negative emotions for each attribute across ratings, we next analyse the general emotion expressed in reviews for each rating. The NRC lexicon is used for this analysis. The NRC lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. For each star rating, the percentage of words that express the emotions of trust, fear, anger, anticipation, disgust, joy, sadness and surprise is shown below.

A radar plot is made across emotions to show the percentage projected in reviews. Color in this visualisation represents the rating of the restaurant whose review is analysed. We can see that restaurants with a low rating of 1 or 2 have reviews that predominantly express anger, sadness, fear and disgust. These emotions are expressed the least in reviews of high rated restaurants. The medium rating of 3 shows a large amount of anticipation in its reviews. High rated restaurants (4 and 5 ratings) follow such a similar trend that their lines almost overlap in this visualization. Though customers express an equal percentage of trust in reviews for restaurants with ratings of 4 and 5 (overlapping pink and yellow lines), reviews show more joy for 5 rated restaurants.

Percentage of emotions projected in reviews for each rating

# Creates unigrams from text
unigrams <- useful %>% unnest_tokens(word, text, token = "ngrams", n = 1)
# nrc lexicon is loaded 
nrc <- get_sentiments("nrc")

sentiment_analysis <- unigrams %>% 
               dplyr::group_by(stars.x, word) %>% 
               # Count of words in review for each rating is calculated
               summarise( n = n()) %>% 
               # NRC sentiment analysis
               inner_join(nrc)

# positive and negative emotions are dropped
review_nrc <- sentiment_analysis %>%
  filter(!grepl("positive|negative", sentiment))


review_tally <- review_nrc %>%
                group_by(stars.x, sentiment) %>%
                tally() %>% 
                # Calculates the percentage of words that attribute to a sentiment
                mutate(cuisine_words = (nn / sum(nn))*100) %>% 
                select(-nn)

# Key value pairs
scores <- review_tally %>%
          spread(stars.x, cuisine_words)

# JavaScript radar chart
chartJSRadar(scores)
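The percentages fed to the radar chart can be reproduced on a toy table with base R. This is a sketch on made-up data, assuming each row is one emotion-bearing word occurrence in a review with the given star rating.

```r
# Hypothetical word occurrences already matched to an emotion
toy <- data.frame(
  stars = c(1, 1, 1, 5, 5, 5),
  sentiment = c("anger", "anger", "joy", "joy", "joy", "trust"),
  stringsAsFactors = FALSE
)

# Rows are emotions, columns are star ratings
counts <- table(toy$sentiment, toy$stars)

# Percentage of each emotion within a rating (every column sums to 100)
pct <- round(prop.table(counts, margin = 2) * 100, 1)
print(pct)
```

Normalising within each rating is what makes ratings with very different review volumes comparable on the same chart.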

Words across cuisines

In this step, we investigate whether specific words are used in the reviews given to a specific cuisine. If the cuisine category is not registered by the merchant, reviews could be used to identify the cuisine. Common words seen in reviews across the cuisines in Las Vegas are compared with the frequent words in the reviews given to a specific cuisine. This allows us to see strong deviations in word frequency within each cuisine relative to reviews across the location.

Words that are close to the line (light grey) are used with similar frequency in reviews for the cuisine under study and in reviews for all the cuisines together. For example, words such as “food” and “pizza” are fairly common and used with similar frequencies across most of the cuisines. Words that are far from the line (green color) are found more in one set of cuisine reviews than in the other. The words standing out above the line are common across the location but not for that particular category. The words below the line are common in that particular category but not across the location.

For example, “torta” stands out above the line in the American traditional cuisine. This means that “torta” is fairly common in reviews given to other cuisines, but is not used as much in reviews for traditional American cuisine. In contrast, a word below the line such as “burgr” in the traditional American category is common in reviews for this cuisine but far less common in reviews for other cuisines.

# Calculates the top 6 cuisines
LV_top_6_cuisines <- business %>% 
                    dplyr::select(city , categories) %>%
                    transform(categories = strsplit(categories, ";")) %>%
                    unnest(categories) %>%
                    # Only the categories in the cuisine list for Las Vegas are considered
                    filter(categories %in% cuisine_list & city == 'Las Vegas')  %>%  
                    dplyr::group_by(categories) %>%
                    tally() %>% 
                    top_n(n = 6) 

# Calculates the percentage of word use across all selected cuisines
word_pct <- useful %>%
           transform(categories = strsplit(categories, ";")) %>%
           unnest(categories) %>%
           filter(categories %in% LV_top_6_cuisines$categories)  %>% 
           unnest_tokens(word, text) %>%
           # Removes stop words
           anti_join(stop_words) %>%
           dplyr::group_by(word) %>%
           summarise(n = n()) %>%
           # Calculate the percentage in the whole review set
           transmute(word, all_cuisines = n / sum(n))

# Calculates the percentage of word use within each cuisine
frequency <- useful %>%
             transform(categories = strsplit(categories, ";")) %>%
             unnest(categories) %>%
             filter(categories %in% LV_top_6_cuisines$categories)  %>% 
             unnest_tokens(word, text) %>%
             # Removes stop words
             anti_join(stop_words) %>%
             dplyr::group_by(categories, word) %>%
             summarise(n = n()) %>%
             # Calculate the percentage in the review for given category
             mutate(cuisine_words = n / sum(n)) %>%
             left_join(word_pct) %>%
             arrange(desc(cuisine_words)) %>%
             ungroup()

# Plots frequency of words in reviews specific to cuisine in x axis and percentage of appearance in all in y
ggplot(frequency, aes(x = cuisine_words, y = all_cuisines, color = abs(all_cuisines - cuisine_words))) +
        geom_abline(color = "gray40", lty = 2) +
        geom_jitter(alpha = 0.1, size = 3, width = 0.3, height = 0.3) +
        geom_text(aes(label = word, size = 1), check_overlap = TRUE, vjust = ifelse(frequency$all_cuisines > frequency$cuisine_words, 2,-2)) +
        scale_x_log10(labels = scales::percent_format()) +
        scale_y_log10(labels = scales::percent_format()) +
        scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
        facet_wrap(~ categories, ncol = 2) +
        theme(legend.position = "none") +
        labs(y = "Cuisines", x = NULL)
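The comparison behind the plot can be sketched on a handful of made-up words in base R. The toy data frame and the column names mexican and gap are hypothetical, but the calculation mirrors the one above: the share of a word within one cuisine against its share across all cuisines.

```r
# Hypothetical words from reviews, one row per word occurrence
toy <- data.frame(
  cuisine = c("Mexican", "Mexican", "Mexican", "Pizza", "Pizza", "Pizza"),
  word    = c("taco", "food", "taco", "food", "crust", "food"),
  stringsAsFactors = FALSE
)

# Share of each word across all reviews
overall <- prop.table(table(toy$word))

# Share of each word within each cuisine (every column sums to 1)
within <- prop.table(table(toy$word, toy$cuisine), margin = 2)

# Gap between the within-cuisine share and the overall share
comparison <- data.frame(
  word = rownames(within),
  mexican = as.numeric(within[, "Mexican"]),
  all_cuisines = as.numeric(overall[rownames(within)]),
  stringsAsFactors = FALSE
)
comparison$gap <- comparison$mexican - comparison$all_cuisines
print(comparison)
```

With the axes used above, words with a large positive gap, such as "taco" here, fall below the dashed line, and words with a negative gap fall above it.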

6. Summary

Problem statement and data used

Restaurants in the US are identified as the target market. To understand the existing business models, the Yelp data set is selected. This data set contains information on businesses across 11 metropolitan areas in four countries. Restaurant data for the US was filtered, and the top 10 cities by number of restaurants with available data were chosen for the study. We have information on location, category, working hours for each weekday, number of check-ins for each day and hour, and user text reviews. The data is cleansed and processed to create visualizations for exploratory data analysis. Text mining is done on reviews to understand customer sentiments. Some of the findings from this case study are given below.

Data analysis

Exploratory data analysis

Trends in the restaurant market are studied based on analysis of the Yelp dataset. Data was sliced and diced to create different visualisations that uncover the patterns present. Restaurant locations were plotted on a map and aggregated to find the areas where restaurants are densely located. To find the busiest hours in a city, heat maps were generated for each hour of each weekday. The most frequent categories across the cities were identified using a word cloud. We then drilled down to find the top 5 cuisines in each city using bar charts.

Sentiment analysis

To understand the sentiments of customers, four key attributes of a business were identified. The general customer reaction to food, ambience, staff and service at a restaurant was analysed. As an initial step, the words commonly used along with the key terms are visually represented in a network chart. To identify the difference between high and low rated restaurants, the reviews given to each of these four attributes in each rating category were studied separately. The words that contribute to positive and negative sentiments for each attribute are identified, and the top 5 positive and negative words are shown in a diverging bar chart. The percentage of different emotions expressed in reviews given to each rating category is shown in a radar chart. The difference in the usage of words in reviews given to the top 6 cuisines in Las Vegas was also investigated and displayed.

Interesting findings

  • American cuisine is the most popular in most locations. Sandwiches and fast food are the next most popular options. Asian cuisines like Indian and Chinese did not make it to the top 5 in any of the locations.

  • Locations like Las Vegas have an active night life, and the hours after midnight are the busiest working time for restaurants there. Scottsdale is busy during dinner time, around 7, on most days. The patterns in working hours can be seen in the visualization provided in the exploratory data analysis.

  • Restaurants are mostly concentrated in downtown areas. As you move further away from the city, the number of restaurants for which we have information reduces.

  • Many words are used in common to describe food and service. Great and horrible are used with comparable frequency to describe the service of restaurants in Las Vegas. Staff and ambience are not described with frequent terms in reviews to the extent that food and service are.

  • The distinguishing factor between 4 and 5 could be the taste or quality of the food. Top rated restaurants have a classy ambience that distinguishes them from 4 rated ones. Low rated restaurants have uncomfortable, poor or horrible ambience. As we move down the ratings, the negative words increase across the attributes. The staff at low rated restaurants are described as annoying, stingy, confused and dumb. The service sucks and is horrible and awful. Food becomes disappointing and horrible.

  • Reviews for restaurants with a low rating of 1 or 2 predominantly express anger, sadness, fear and disgust. The medium rating of 3 shows a large amount of anticipation in its reviews. Though customers express an equal percentage of trust in reviews for restaurants with ratings of 4 and 5, reviews show more joy in customers at 5 rated restaurants.

  • Reviews for each cuisine contain words specific to that particular category. For example, refried is a word commonly used only for Mexican cuisine. This can be used to identify the cuisine from reviews.

Further steps

  • Collect more data on food served, menu, music etc. and explore more trends and patterns that will aid a new business person planning to open a restaurant in the US.

  • Create a machine learning model that identifies the cuisine from the review given.

  • Predict the rating that a customer might give to a restaurant by analysing the review he or she has written.

  • Due to time constraints, all sentiment analysis in this case study is done using unigrams and bigrams. Expand the scope of the study to larger n-grams and sentences.