Restaurant analysis

1. Introduction

My retirement plan is to start a quiet little cafe in the foothills of the Himalayas. I want to fill my little space with paintings and books and, obviously, serve amazing food. Though this is a long-shot plan, I am constantly curious about the market dynamics of restaurants. To start understanding what makes a good restaurant, I decided to do a market research analysis of restaurants in the United States.

To conduct the market analysis, I use the Yelp dataset, which is a subset of Yelp’s businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge to give students a chance to conduct research or analysis on Yelp’s data and share their discoveries. The proposed approaches and analytic techniques are listed below.

Geographical analysis - This analysis aids the location selection process and helps find the best possible place to establish the cafe. The user should be able to select a city and see the distribution of restaurants across that location along with their ratings. Maybe it is better to start a restaurant that offers high-quality food in a location with a large number of low-rated restaurants. Maybe opening a restaurant on a street full of popular restaurants gives good visibility. It all depends on the strategy adopted by the management.
Understanding previous trends - This aids the selection of working hours and cuisine for the restaurant. Some places might suit restaurants that are open in the evening. Other locations might be profitable when the restaurant is open throughout the day. By analysing past trends, we can decide our restaurant’s best-suited working hours. We will also explore the cuisines that are popular in the chosen locality.
Learning from the best and worst players in the market - Sentiment analysis is done on the user reviews to understand trends and patterns in user behaviour. Bad reviews help us learn without making the mistakes ourselves. Positive reviews can help me understand what works for the general public and incorporate that into my business.

2. Packages

The packages used for this analysis are listed below.

pacman: To load packages and install missing ones
data.table: Fast data load.
tidyverse: Collection of R packages used for data manipulation. stringr, dplyr, readr and ggplot2 from this collection are used for the analysis.
stringr: String operations in R
lubridate: Package to manipulate dates and times
DT: Package to put data objects in R as HTML tables
NLP: Package for basic classes and methods for Natural Language Processing
tidytext: Package for text mining for word processing and sentiment analysis
knitr: A General-Purpose Package for Dynamic Report Generation in R
leaflet: Create Interactive Web Maps
tm: Text mining package
wordcloud: To create wordclouds
grid: A rewrite of the graphics layout capabilities
gridExtra: Provides a number of user-level functions to work with “grid” graphics, notably to arrange multiple grid-based plots on a page, and draw tables.
radarchart: Creates interactive radar charts
igraph: Network Analysis and Visualization
ggraph: An implementation of the grammar of graphics for graphs and networks, built on ggplot2

if (!require("pacman")) install.packages("pacman")

# p_load function installs missing packages and loads all the packages given as input
pacman::p_load("data.table", 
               "tidyverse",
               "stringr", 
               "lubridate", 
               "DT", 
               "tidytext",
               "NLP" ,
               "knitr",
               "leaflet",
               "tm",
               "wordcloud",
               "grid",
               "gridExtra",
               "radarchart",
               "igraph",
               "ggraph")

3. Data Preparation

The data is loaded from the source and cleaned for further analysis.

3.1 Data source

Yelp is an American multinational corporation founded in 2004 by former PayPal employees Russel Simmons and Jeremy Stoppelman. It develops, hosts and markets Yelp.com and the Yelp mobile app, which publish crowd-sourced reviews about local businesses, as well as the online reservation service Yelp Reservations. The dataset used in this study is a subset of Yelp’s businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp’s data and share their discoveries.

The dataset contains information about businesses across 11 metropolitan areas in four countries. Six tables containing business-related information are available:

business : Contains location and category information for each business
business_attributes : Contains attributes of the business like music, delivery, parking etc.
business_hours : Contains working hours for each weekday for businesses
checkin : Number of checkins for each day and hour for a business is stored in this dataset.
tip : Contains tips and comment written by a user on a business
review : Contains user text review data for businesses

3.2 Data import

The six data files are loaded in this step. fread from data.table is mainly used for loading data, as it is fast for large files. After import, each data set is analysed and cleaned as needed.

business <- fread("yelp-dataset/yelp_business.csv")
business_attributes <- fread("yelp-dataset/yelp_business_attributes.csv")
business_hours <- read_csv("yelp-dataset/yelp_business_hours.csv")
checkin <- fread("yelp-dataset/yelp_checkin.csv")
tip <- fread("yelp-dataset/yelp_tip.csv")
review <- read_csv("yelp-dataset/yelp_review.csv")
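Before loading, it can help to confirm that all six files are where the code expects them. The sketch below only checks for the files and reports any that are missing. The paths repeat the ones used in the import code, and the variable names are illustrative rather than part of the original analysis.

```r
# Paths mirror the fread/read_csv calls used for the import.
yelp_files <- file.path(
  "yelp-dataset",
  c("yelp_business.csv", "yelp_business_attributes.csv",
    "yelp_business_hours.csv", "yelp_checkin.csv",
    "yelp_tip.csv", "yelp_review.csv")
)

# file.exists() is vectorised, so one call checks all six paths.
missing_files <- yelp_files[!file.exists(yelp_files)]

if (length(missing_files) > 0) {
  message("Missing input files: ", paste(missing_files, collapse = ", "))
}
```

An empty `missing_files` vector means every input is in place and the import can proceed.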

3.3 Data cleaning

In this step, each of the six files is evaluated separately and cleaned for further analysis.

a. business

The business table contains the location and category details of the businesses included in the dataset. There are 174567 businesses listed in the original dataset. For this project, only restaurants from the United States are considered.

# Only the data for restaurants in the United States are kept
business  <- business %>% 
             filter(state %in% state.abb) %>% 
             filter(categories %like% "Restaurant")
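To make the two conditions concrete, here is a minimal base R sketch of the same filter on a made-up table. The rows in `toy_business` are invented for illustration, and `grepl()` stands in for data.table's `%like%`.

```r
# Toy stand-in for the business table (values are made up for illustration).
toy_business <- data.frame(
  business_id = c("a1", "b2", "c3", "d4"),
  state       = c("NV", "ON", "AZ", "PA"),
  categories  = c("Restaurants;Mexican", "Restaurants;Cafes",
                  "Hair Salons;Beauty", "Bars;Restaurants"),
  stringsAsFactors = FALSE
)

# state.abb is a built-in vector of the 50 US state abbreviations.
is_us <- toy_business$state %in% state.abb
# fixed = TRUE treats "Restaurant" as a literal substring, not a regex.
is_restaurant <- grepl("Restaurant", toy_business$categories, fixed = TRUE)

us_restaurants <- toy_business[is_us & is_restaurant, ]
us_restaurants$business_id
```

One thing to keep in mind is that `state.abb` covers the 50 states only, so a business coded as DC would be dropped by this condition.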

After filtering for restaurants in the United States, we have data on 32484 businesses. We will first investigate the amount of data available for each city and then consider the top 10 cities for this study.

# Displays the top 10 cities present in the data set based on number of restaurants for which data is available.

business %>% select(state,city) %>% 
             dplyr::group_by(state,city) %>% 
             summarise(n = n()) %>% 
             arrange(desc(n)) %>% 
             head(n = 10) %>%
             datatable(class = 'cell-border stripe hover condensed responsive')

Several of the top 10 cities also appear under more than one name in the data set. The variants found for each city are listed below.

Las Vegas : NV state cities with names Las Vegas, North Las Vegas, N Las Vegas, las vegas, N. Las Vegas, South Las Vegas, Las vegas and LasVegas are combined together.
Phoenix : AZ state cities with names Phoenix, Pheonix, Pheonix AZ, Phoenix Valley are combined together.
Pittsburgh : PA state cities with names Pittsburgh and East Pittsburgh are combined together.
Scottsdale : AZ state cities with names Scottsdale and Scottdale are combined together.
Cleveland : OH state cities with names Cleveland, Cleveland Heights, East Cleveland and Cleveland Hghts. are combined together.
Mesa : AZ state cities with names Mesa, MESA, Mesa AZ are combined together.

# Vectors are used so that any additional spelling that is found can be added easily.

# All restaurants in Las Vegas get the same city name
lasVegas <- c("Las Vegas", "North Las Vegas", "N Las Vegas", "las vegas", "N. Las Vegas", "South Las Vegas", "Las vegas", "LasVegas")
business$city[business$city %in% lasVegas & business$state == "NV" ] <- "Las Vegas"

# All restaurants in Phoenix get the same city name
phoenix <- c("Phoenix", "Pheonix", "Pheonix AZ", "Phoenix Valley")
business$city[business$city %in% phoenix & business$state == "AZ" ] <- "Phoenix"

# All restaurants in Pittsburgh get the same city name
pittsburgh <- c("Pittsburgh", "East Pittsburgh")
business$city[business$city %in% pittsburgh & business$state == 'PA' ] <- "Pittsburgh"

# All restaurants in Scottsdale get the same city name
scottsdale <- c("Scottsdale", "Scottdale")
business$city[business$city %in% scottsdale & business$state == 'AZ' ] <- "Scottsdale"

# All restaurants in Cleveland get the same city name
cleveland <- c("Cleveland", "Cleveland Heights", "East Cleveland", "Cleveland Hghts.")
business$city[business$city %in% cleveland & business$state == 'OH' ] <- "Cleveland"

# All restaurants in Mesa get the same city name
mesa <- c("Mesa", "MESA", "Mesa AZ")
business$city[business$city %in% mesa & business$state == 'AZ' ] <- "Mesa"

# Filters the data so that we analyse only the restaurants from the top 10 cities (based on count)
top_10_cities <- c("Las Vegas", "Phoenix", "Charlotte", "Pittsburgh", "Scottsdale", "Cleveland", "Mesa", "Madison", "Tempe", "Chandler")
business <- business %>% filter(city %in% top_10_cities)
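The hand-written vectors work, but variants that differ only in case, periods or spacing could also be caught by cleaning the name before looking it up. The following is a sketch of that idea, not the method used in this report. Both `city_aliases` and `normalise_city` are hypothetical names, and the alias table covers only a few of the variants listed earlier.

```r
# Hypothetical alias table: cleaned lower-case variant -> canonical name.
city_aliases <- c(
  "las vegas"       = "Las Vegas",
  "n las vegas"     = "Las Vegas",
  "north las vegas" = "Las Vegas",
  "lasvegas"        = "Las Vegas",
  "pheonix"         = "Phoenix",
  "phoenix"         = "Phoenix",
  "mesa"            = "Mesa",
  "mesa az"         = "Mesa"
)

# Hypothetical helper: lower-case, drop periods, squeeze whitespace, then look up.
normalise_city <- function(city) {
  key <- tolower(trimws(city))
  key <- gsub(".", "", key, fixed = TRUE)
  key <- gsub("\\s+", " ", key)
  canonical <- unname(city_aliases[key])
  # Names with no alias are returned unchanged.
  ifelse(is.na(canonical), city, canonical)
}

normalise_city(c("N. Las Vegas", "MESA", "Pheonix", "Tempe"))
```

This sketch ignores the state column. In practice the state condition from the original code should still be applied, so that a city with the same name in another state is not merged by accident.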

The final data set for the top ten cities contains information on 21406 restaurants. Some of the important variables are described below. A snapshot of the dataset is also provided.

Name	Description
business_id	ID of the business
name	Name of the business
city	City in which the business is located
latitude	Latitude coordinate of the restaurant. Will be used to plot the location on the map.
longitude	Longitude coordinate of the restaurant. Will be used to plot the location on the map.
stars	Provides the rating of the restaurant. Minimum = 0, Maximum = 5
is_open	If the restaurant is working/shut down
categories	Categories under which the restaurant falls

# Displays 10 rows from the dataset
datatable(head(business, n = 10), class = 'cell-border stripe hover condensed responsive')

b. business_attributes

The business_attributes table contains the attributes of businesses like music, delivery, parking etc. There are 152041 businesses listed in the original dataset. We need to filter and keep the restaurants in top 10 cities.

# Only the data for restaurants from top 10 cities are retained

business_attributes  <- business_attributes %>% 
                        filter(business_id %in% business$business_id)
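Filtering on `business_id` keeps only attribute rows that belong to a retained restaurant, but it does not guarantee that every restaurant has an attribute row. A quick way to see the gap is sketched below on made-up IDs; the vector names are illustrative only.

```r
# Made-up IDs standing in for business$business_id and business_attributes$business_id.
toy_business_ids  <- c("a1", "b2", "c3", "d4")
toy_attribute_ids <- c("a1", "c3", "z9")

# Restaurants kept in the business table that have no row in the attributes table.
missing_attributes <- setdiff(toy_business_ids, toy_attribute_ids)

# Attribute rows that survive the filter.
kept <- toy_attribute_ids[toy_attribute_ids %in% toy_business_ids]
```

In this report the counts go from 21406 restaurants to 21123 attribute rows, which leaves 283 restaurants without any attribute record.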

After filtering, only 21123 restaurants are retained. There are 82 columns in this data set. Not all of them are relevant to us. We will drop the irrelevant variables.

# The attributes that are not significant are dropped from the dataset.

business_attributes <- business_attributes %>% 
                       select(-c( HasTV, Caters , BusinessAcceptsBitcoin, BYOBCorkage, BYOB), -contains("Hair"))

There are 68 variables left in the data set now. The final data set for top ten cities contains information regarding 21123 observations.

# 10 rows from the business_attributes table are displayed
kable(head(business_attributes, n = 10))

business_id	AcceptsInsurance	ByAppointmentOnly	BusinessAcceptsCreditCards	BusinessParking_garage	BusinessParking_street	BusinessParking_validated	BusinessParking_lot	BusinessParking_valet	RestaurantsPriceRange2	GoodForKids	BikeParking	Alcohol	NoiseLevel	RestaurantsAttire	Music_dj	Music_background_music	Music_no_music	Music_karaoke	Music_live	Music_video	Music_jukebox	Ambience_romantic	Ambience_intimate	Ambience_classy	Ambience_hipster	Ambience_divey	Ambience_touristy	Ambience_trendy	Ambience_upscale	Ambience_casual	RestaurantsGoodForGroups	WiFi	RestaurantsReservations	RestaurantsTakeOut	HappyHour	GoodForDancing	RestaurantsTableService	OutdoorSeating	RestaurantsDelivery	BestNights_monday	BestNights_tuesday	BestNights_friday	BestNights_wednesday	BestNights_thursday	BestNights_sunday	BestNights_saturday	GoodForMeal_dessert	GoodForMeal_latenight	GoodForMeal_lunch	GoodForMeal_dinner	GoodForMeal_breakfast	GoodForMeal_brunch	CoatCheck	Smoking	DriveThru	DogsAllowed	Open24Hours	Corkage	DietaryRestrictions_dairy-free	DietaryRestrictions_gluten-free	DietaryRestrictions_vegan	DietaryRestrictions_kosher	DietaryRestrictions_halal	DietaryRestrictions_soy-free	DietaryRestrictions_vegetarian	AgesAllowed	RestaurantsCounterService
fNMVV_ZX7CJSDWQGdOM8Nw	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na
rDMptJYWtnMhpQu_rRXHng	Na	Na	Na	Na	False	False	False	True	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	True	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na
1WBkAuQg81kokZIPMpn9Zg	Na	Na	Na	Na	False	False	False	True	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	False	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na
Pd52CjgyEU3Rb8co6QfTPw	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	True	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na
4srfPk1s8nlm1YusyDUbjg	Na	Na	Na	Na	False	False	False	False	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	False	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na
n7V4cD-KqqE3OXk0irJTyA	Na	Na	Na	Na	True	False	False	True	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na
EJFdWX908N8Yc2XG0Lky8A	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	False	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na
iPa__LOhse-hobC2Xmp-Kw	Na	Na	Na	Na	False	False	False	True	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	True	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na
o1fTwfqN0sDFNpV1CkOPPg	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na
-nHkhiuerqmfBG3v2v9O-g	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	False	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na	Na

# NA in the data set is represented by the character string 'Na'. These values are replaced by NA
business_attributes[,2:ncol(business_attributes)][ business_attributes[,2:ncol(business_attributes)] == 'Na' ] <- NA

# The share of missing values (a fraction between 0 and 1) is calculated for each attribute
data.table("Variable" = colnames(business_attributes), "Percentage of NA values" = colMeans(is.na(business_attributes)))

##                            Variable Percentage of NA values
##  1:                     business_id               0.0000000
##  2:                AcceptsInsurance               1.0000000
##  3:               ByAppointmentOnly               1.0000000
##  4:      BusinessAcceptsCreditCards               0.9941770
##  5:          BusinessParking_garage               0.9980116
##  6:          BusinessParking_street               0.3398192
##  7:       BusinessParking_validated               0.3398192
##  8:             BusinessParking_lot               0.3398192
##  9:           BusinessParking_valet               0.3398192
## 10:          RestaurantsPriceRange2               1.0000000
## 11:                     GoodForKids               0.9961180
## 12:                     BikeParking               0.6178573
## 13:                         Alcohol               0.9974909
## 14:                      NoiseLevel               0.9985324
## 15:               RestaurantsAttire               0.9994792
## 16:                        Music_dj               0.9991952
## 17:          Music_background_music               1.0000000
## 18:                  Music_no_music               1.0000000
## 19:                   Music_karaoke               1.0000000
## 20:                      Music_live               1.0000000
## 21:                     Music_video               1.0000000
## 22:                   Music_jukebox               1.0000000
## 23:               Ambience_romantic               1.0000000
## 24:               Ambience_intimate               0.9993372
## 25:                 Ambience_classy               0.9993372
## 26:                Ambience_hipster               0.9993372
## 27:                  Ambience_divey               0.9993372
## 28:               Ambience_touristy               0.9993372
## 29:                 Ambience_trendy               0.9993372
## 30:                Ambience_upscale               0.9993372
## 31:                 Ambience_casual               0.9993372
## 32:        RestaurantsGoodForGroups               0.9993372
## 33:                            WiFi               0.9992425
## 34:         RestaurantsReservations               0.9991005
## 35:              RestaurantsTakeOut               0.9976803
## 36:                       HappyHour               0.9826729
## 37:                  GoodForDancing               0.9987691
## 38:         RestaurantsTableService               0.9863182
## 39:                  OutdoorSeating               0.9774653
## 40:             RestaurantsDelivery               0.9966387
## 41:               BestNights_monday               0.9356625
## 42:              BestNights_tuesday               0.9995266
## 43:               BestNights_friday               0.9995266
## 44:            BestNights_wednesday               0.9995266
## 45:             BestNights_thursday               0.9995266
## 46:               BestNights_sunday               0.9995266
## 47:             BestNights_saturday               0.9995266
## 48:             GoodForMeal_dessert               0.9995266
## 49:           GoodForMeal_latenight               0.9401127
## 50:               GoodForMeal_lunch               0.9401127
## 51:              GoodForMeal_dinner               0.9401127
## 52:           GoodForMeal_breakfast               0.9401127
## 53:              GoodForMeal_brunch               0.9401127
## 54:                       CoatCheck               0.9401127
## 55:                         Smoking               0.9943190
## 56:                       DriveThru               0.9820101
## 57:                     DogsAllowed               0.8610993
## 58:                     Open24Hours               0.9996213
## 59:                         Corkage               1.0000000
## 60:  DietaryRestrictions_dairy-free               1.0000000
## 61: DietaryRestrictions_gluten-free               0.9978696
## 62:       DietaryRestrictions_vegan               0.9978696
## 63:      DietaryRestrictions_kosher               0.9978696
## 64:       DietaryRestrictions_halal               0.9978696
## 65:    DietaryRestrictions_soy-free               0.9978696
## 66:  DietaryRestrictions_vegetarian               0.9978696
## 67:                     AgesAllowed               0.9978696
## 68:       RestaurantsCounterService               1.0000000
##                            Variable Percentage of NA values

Only 5 columns have less than 50% missing values. From the output, we can see that they are business_id and four attributes related to parking. Since parking information alone is not valuable for this analysis, we will discard this data set.

# Discard business attributes

rm("business_attributes")
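The 50% cut-off used in this decision can also be applied programmatically instead of reading it off the printed table. The sketch below runs on a small made-up table; `toy_attributes` and its values are invented, and the column names are borrowed from the real data only for readability.

```r
# Made-up attribute table with 'Na' strings in place of missing values.
toy_attributes <- data.frame(
  business_id            = c("a1", "b2", "c3", "d4"),
  BusinessParking_street = c("False", "Na", "True", "False"),
  WiFi                   = c("Na", "Na", "Na", "free"),
  stringsAsFactors = FALSE
)

# Replace the literal string 'Na' with a real missing value.
toy_attributes[toy_attributes == "Na"] <- NA

# Share of missing values per column (a fraction between 0 and 1).
na_share <- colMeans(is.na(toy_attributes))

# Columns where fewer than half of the values are missing.
usable_columns <- names(na_share)[na_share < 0.5]
usable_columns
```

Note that `colMeans(is.na(...))` returns fractions, so 0.34 in the printed output corresponds to 34% missing.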

c. business_hours

The business_hours table contains the working hours of the businesses included in the dataset for each weekday. There are 174567 businesses listed in the original dataset. We need to filter and keep the restaurants in the top 10 cities.

# Only the data for restaurants from top 10 cities are retained

business_hours  <- business_hours %>% 
                   filter(business_id %in% business$business_id)

The final data set for the top ten cities contains information on 21406 restaurants. The data set has a separate column for each weekday. We create a single column for the weekday and separate the opening and closing hours for each business. The time variables are converted from character to an hour and minute format. The final form of the data set is displayed below.

business_hours <- business_hours %>%
                  # Name of the column will be stored in 'week_day' and value will be stored in 'working hours'
                  gather(week_day, working_hours, -business_id) %>%
                  # The character value None represents NA
                  transform(working_hours =  ifelse(working_hours == 'None',NA,working_hours)) %>% 
                  # All null values are removed
                  na.omit() %>% 
                  # From 'working_hours' column, starting and closing hours are separated
                  separate(col = working_hours, into = c("starting_hour", "closing_hour"), sep = "-") %>% 
                  # Starting and closing hours are stored in hour and minute format
                  transform(starting_hour = hm(starting_hour), closing_hour = hm(closing_hour))

# Displays 5 rows of business_hours table

kable(head(business_hours, n = 5))

	business_id	week_day	starting_hour	closing_hour
1	fNMVV_ZX7CJSDWQGdOM8Nw	monday	7H 0M 0S	15H 0M 0S
3	1WBkAuQg81kokZIPMpn9Zg	monday	11H 0M 0S	22H 0M 0S
4	Pd52CjgyEU3Rb8co6QfTPw	monday	8H 30M 0S	22H 30M 0S
6	n7V4cD-KqqE3OXk0irJTyA	monday	11H 0M 0S	0S
8	iPa__LOhse-hobC2Xmp-Kw	monday	5H 0M 0S	23H 0M 0S
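A closing time of midnight is printed as 0S, which matters once opening durations are computed: subtracting the opening time from the closing time would give a negative value. The sketch below shows one way to handle that in base R. It assumes raw strings of the form H:M-H:M, and `to_minutes` is a hypothetical helper, not part of the original code.

```r
# Hypothetical helper: convert "H:M" strings (e.g. "11:0") to minutes after midnight.
to_minutes <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

# Made-up working hours in the assumed raw format.
working_hours <- c("7:0-15:0", "11:0-0:0", "8:30-22:30")

# Split each range into its opening and closing part.
ranges  <- strsplit(working_hours, "-", fixed = TRUE)
opening <- to_minutes(vapply(ranges, `[`, character(1), 1))
closing <- to_minutes(vapply(ranges, `[`, character(1), 2))

# A closing time at or before the opening time is read as "next day".
minutes_open <- ifelse(closing <= opening, closing + 24 * 60, closing) - opening
hours_open   <- minutes_open / 60
hours_open
```

With this rule, a restaurant listed as open from 11:00 to midnight counts as open for 13 hours rather than a negative duration.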

d. checkin

The checkin table contains the number of check-ins that occurred at a restaurant on a given weekday and hour of the day. There are 3911218 check-in observations for 146350 businesses listed in the original dataset. We need to filter and keep the restaurants in the top 10 cities. The hour column is parsed with hm() and reduced to the hour of the day. To display the weekdays in order, they are converted to ordered factors.

checkin  <- checkin %>% 
            # Only the data for restaurants from top 10 cities are retained
            filter(business_id %in% business$business_id) %>%
            # Hour is parsed and reduced to the hour of the day
            transform( hour = hour(hm(hour)),
                       # Weekdays are stored as ordered factors; the filtered column is used, not the unfiltered checkin$weekday
                       weekday = factor(weekday, levels = c("Sun","Sat","Fri","Thu","Wed","Tue","Mon"), ordered = TRUE) )

The final data set for the top ten cities contains 1054197 check-in observations for 20714 restaurants. Variable descriptions are given below, followed by the minimum, maximum and average number of check-ins for each weekday.

Name	Description
business_id	ID of the business
weekday	Weekday on which the check-in occurred
hour	Hour at which the check-in occurred
checkins	Average number of check-ins for the given weekday and hour

weekday	min_checkins	max_checkins	average_checkins
Sun	1	827	6.606031
Sat	1	741	6.625402
Fri	1	831	6.641018
Thu	1	700	6.586976
Wed	1	785	6.595957
Tue	1	703	6.648213
Mon	1	875	6.640417
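The code that produced this summary is not shown in the report. One possible way to build such a table is sketched below in base R on made-up data. The rows of `toy_checkin` are invented, and the column names follow the variable table for the checkin data.

```r
# Made-up check-in rows for illustration.
toy_checkin <- data.frame(
  business_id = c("a1", "a1", "b2", "b2", "c3", "c3"),
  weekday     = c("Mon", "Mon", "Mon", "Sun", "Sun", "Sun"),
  hour        = c(9, 18, 12, 10, 19, 20),
  checkins    = c(1, 5, 3, 2, 8, 2),
  stringsAsFactors = FALSE
)

# Ordered factor so that rows come out in the same weekday order as the report.
toy_checkin$weekday <- factor(
  toy_checkin$weekday,
  levels  = c("Sun", "Sat", "Fri", "Thu", "Wed", "Tue", "Mon"),
  ordered = TRUE
)

# One row per weekday with minimum, maximum and mean check-ins.
weekday_summary <- do.call(rbind, lapply(
  split(toy_checkin$checkins, toy_checkin$weekday, drop = TRUE),
  function(x) data.frame(min_checkins     = min(x),
                         max_checkins     = max(x),
                         average_checkins = mean(x))
))
weekday_summary
```

Because the weekday is an ordered factor, `split()` returns the groups in level order, so Sunday comes first here just as in the table.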

A snapshot of data is displayed below.

# Displays 10 rows of checkin table

datatable(head(checkin, n = 10), class = 'cell-border stripe hover condensed responsive')

e. tips

The tip table contains the attributes of tips given by users. There are 1098324 observations for 112365 businesses listed in the original dataset. We need to filter and keep the restaurants in top 10 cities.

# Only the data for restaurants from top 10 cities are retained

tip  <- tip %>% 
        filter(business_id %in% business$business_id)

The final data set for the top ten cities contains 494099 tips for 19367 restaurants.
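Counts like these can be derived directly from the filtered table. The sketch below shows the idea on a made-up tip table; `toy_tip` and `kept_ids` are illustrative stand-ins for the tip data and for the retained business IDs.

```r
# Made-up tips; "x9" stands for a business outside the retained restaurants.
toy_tip <- data.frame(
  business_id = c("a1", "a1", "b2", "x9", "c3"),
  text        = c("Great tacos", "Try the salsa", "Slow service",
                  "Not a restaurant", "Good coffee"),
  stringsAsFactors = FALSE
)
kept_ids <- c("a1", "b2", "c3", "d4")

# Same filter as in the report: keep tips for retained restaurants only.
toy_tip <- toy_tip[toy_tip$business_id %in% kept_ids, ]

n_tips        <- nrow(toy_tip)                        # tips that survive the filter
n_restaurants <- length(unique(toy_tip$business_id))  # restaurants with at least one tip
tips_per_restaurant <- n_tips / n_restaurants
```

Applied to the figures in this report, 494099 tips over 19367 restaurants works out to roughly 25.5 tips per restaurant that has any tips.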

# Displays 10 observations for tip data set.

kable(head(tip, n = 10))

text	date	likes	business_id	user_id
Sunday $.55 bone-in wings Monday $.55 boneless wings	2016-08-22	0	--ujyvoQlwVoBgMYtADiLA	ulQ8Nyj7jCUR8M83SUMoRQ
Black Angus and the Roast beef :)	2012-12-03	0	JzB7NITHQ7gVHGVZ1ntgIQ	TvkqJ8YEIsTb16RnnrNyfQ
Expensive, but convenient for hotel stays	2012-12-02	0	h14GmWZ8rXum9fXF__wt3w	TvkqJ8YEIsTb16RnnrNyfQ
Finally, found some churros. Four types here. It should be great!	2012-03-28	0	xFN8mRubo3G0oIzJwc8XBA	TvkqJ8YEIsTb16RnnrNyfQ
closed closed closed closed closed closed closed closed closed closed closed closed closed closed closed closed closed closed closed closed	2012-03-28	0	Xmndl6GoZg8taEUlwQMYxg	TvkqJ8YEIsTb16RnnrNyfQ
Try one of the Bento Box options	2012-10-09	0	eZDXz_RylvdD0tHEA8I0NA	TvkqJ8YEIsTb16RnnrNyfQ
USAir check-in desk agent Nancy E. is the sadest poor customer service provider I experienced this week. She doesn’t want to be there. #Fail	2011-05-20	0	u7CxxEzx8hvjoJ8onN4zTg	TvkqJ8YEIsTb16RnnrNyfQ
Great weather for eating outdoors. Good service.	2012-03-27	0	1CqDdPrrb0xvQpgu7fhI5w	TvkqJ8YEIsTb16RnnrNyfQ
The fried banana dessert is good	2012-09-02	0	_AKdBFzkl7GY-daxUCCbVA	TvkqJ8YEIsTb16RnnrNyfQ
I didn’t eat here, but they were nice enough to tell me where to find tacos de lengua y churros. Look at my next post to find out where :)	2012-03-28	0	HWjqW5ZFJ8eZRQuHcpySQA	TvkqJ8YEIsTb16RnnrNyfQ

f. reviews

The review table contains information regarding a review given by a user for a business. There are 5261668 reviews for 174567 businesses listed in the original dataset. We need to filter and keep the restaurants in top 10 cities.

# Only the data for restaurants from top 10 cities are retained

review  <- review %>% 
           filter(business_id %in% business$business_id )

The final data set for top ten cities contains 2106287 reviews for 21406 businesses. The description for important variables and a snapshot of data can be seen below.

Name	Description
review_id	ID of the review
user_id	ID of the user who gave the review
business_id	ID of the business which is reviewed
stars	Stars given by user for the business
date	Date of review
text	review text
useful	Number of users who found the review useful
funny	Number of users who found the review funny
cool	Number of users who found the review cool
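Since the later sentiment analysis contrasts good and bad reviews, the stars column could be used to sort reviews into buckets first. The sketch below is only an illustration on made-up data. The cut-offs (1 to 2 stars negative, 3 neutral, 4 to 5 positive) are an assumption and are not taken from this report.

```r
# Made-up reviews for illustration.
toy_review <- data.frame(
  review_id = c("r1", "r2", "r3", "r4", "r5"),
  stars     = c(5, 1, 3, 4, 2),
  stringsAsFactors = FALSE
)

# Assumed cut-offs: (0, 2] negative, (2, 3] neutral, (3, 5] positive.
toy_review$sentiment_bucket <- cut(
  toy_review$stars,
  breaks = c(0, 2, 3, 5),
  labels = c("negative", "neutral", "positive")
)

table(toy_review$sentiment_bucket)
```

Different cut-offs are equally defensible; the point is only that the bucket is derived from `stars` and not from the review text.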

# Displays 10 observations for the review data set.

kable(head(review, n = 10))

review_id	user_id	business_id	stars	date	text	useful	funny	cool
nsThIz_-TuvgoFh0o9XJfQ	_L2SZSwf7A6YSrIHy_q4cw	IXXERocY1bqGwRllcy8J2w	5	2009-08-30	Visiting from SF. Checked yelp and found this place. It is very small – a converted house. As you sit in it you can watch the kitchen. The food was excellent. We had an eggplant/red pepper omelet and peach waffles. The waffles were light and fluffy with fresh whipped creme. We got pastries to go, which were also good, in particular the chocolate croissant was unique. Great place for an informal breakfast.	0	0	0
BF0ANB54sc_f-3_howQBCg	ssuXFjkH4neiBgwv-oN4IA	JlNeaOymdVbE6_bubqjohg	1	2014-08-09	We always go to the chevo’s in chandler which is delicious, the one in ahwatukee is different for some reason. Ordered the chicken rolled tacos today there was a tiny lil piece of chicken in each one, so basically I had 3 rolled deep fried tortillas yuck! :( No flavor what so ever. Also ordered carne asada taco the meat tasted old like it was cooked earlier and just thrown on the grill to get warm. Very dissapointed!!	3	0	0
QgSf2JvYz-M4PU2yuJjxNQ	nOTl4aPC4tKHK35T3bNauQ	9Jc3W0aR9Xf2gcHI0rEXsw	1	2012-08-23	After being scared away from Rock & Rita’s, we ended up at this place, which was, if nothing else, quieter. I’ll start by saying that the hostess and our server were both lovely and they’re really what earned the single star, because they were sweet, but the food was just horrible. I ordered a simple grilled cheese sandwich. I asked the server if it was processed cheese, or “real” cheese in the sandwich and he had no idea what I was talking about. I tried to clarify by saying, “Is it like Kraft single slices, or deli cheese?” He said it was “good”. This really should have been a sign and I could probably fault the waiter for this, since you SHOULD know what you’re serving, but he was so sweet, I couldn’t be mad! My husband ordered a hot dog of sorts. I asked if there was a way to substitute fries for a salad. There was not, but the server hooked me up and brought me a salad anyway (no charge). I asked for sweet potato fries with the sandwich. Of course, my sandwich was made with plasticy, processed cheese, that was melted to the point of Cheeze-Wiz and was inedible. I took one bite of the sandwich and couldn’t do another. I got regular fries, instead of sweet potato (didn’t bother complaining, because I got the free salad, so whatever…like I said, nice server!) My husband’s food was fine, although nothing special. They were good with the refills of our drinks. Anyway, as nice as the servers were, it wasn’t good. I would avoid eating here if I could.	0	1	0
gN6GARS_BRr5UX2D3WAH0w	nOTl4aPC4tKHK35T3bNauQ	xVEtGucSRLk5pxxN0t4i6g	5	2012-08-23	We got recommendations for this place from my parents and so, for our anniversary, we booked here. We were told though that it was first come, first served, in terms of having the ideal seats by the windows that overlook the fountains of the Bellagio and The Strip. We went for lunch though, so it wasn’t too busy. We were seated by the window and were promptly served some sort of puffed rolls, with cheese and herbs infused. They were delicious! We ordered the onion soup to start and I ordered parmesan chicken and my husband ordered a lamb burger. Our food came and was delicious. The soup was probably my favorite part. We were feeling very full by the end and so declined dessert, but were still given some jellies and chocolate truffle and then were served a small chocolate mousse, on a decorated plate that said Happy Anniversary, which was a total surprise (they had asked when I booked if it was a special occasion, but had said nothing while we were there). Overall the service was great, the food delicious and the view wonderful. It was a great experience!	0	0	0
t4oXDPN4S4USIhBGpuSD8A	nOTl4aPC4tKHK35T3bNauQ	2LZGeJy8qByYKB71ML-jcw	2	2012-08-23	We got a coupon to eat here when we checked in: $6.99 for breakfast and a second one for half, or free, or something to that effect. We went and had our breakfasts, which was served by a rather surely waitress and was average. If this had been our only experience, I would have given three stars, but another night we got in very late and were just looking for a late dinner and found ourselves at this place. It was not very crowded, but was as loud as if three times as many people were there. There was a small group of trashy (very obese people), scream/singing karaoke on a stage in the front of the restaurant. One of them was singing Eminem and the others were screaming and cheering. We sat down in the back, to try and avoid the chaos, but it was way too disruptive. Other people sitting there, who were in the middle of their dinners and therefore stuck there, were looking pained. We didn’t bother putting ourselves through this and left. It’s a crappy menu, that’s overpriced anyway. I guess Rock and Rita’s name really says it all, but basically if you’re not a fall-down, cheap drunk, I don’t think you’ll have much to say about this place.	2	1	0
R9w7GeMX_KZTV23gmI8Zjg	tL2pS5UOmN6aAOi3Z-qFGg	RhV7sraRUB3km-gF-tmDow	3	2013-02-06	I’ve eaten here numerous times and am still amazed how popular these places are. It must be because they’re open 24 hours and late at night it’s one of the few places you can get something to eat fast. That’s the only reason I eat here. The good is soso. It fills you up but as far as being flavorful it leaves a lot to be desired. I’m sure I’ll visit again. It serves a purpose.	0	0	0
oncT7W70CFwzzJkQoz3T5Q	tL2pS5UOmN6aAOi3Z-qFGg	NaZVUOzqk5b-l0mlki-9Og	4	2017-02-10	We stopped in here for lunch this afternoon. Staff was helpful and friendly. Food was good and you get a lot for the price. We noted that they deliver and since we live close by we’ll probably order from them in the future.	2	0	0
9dWoAJGcJHWscv2ZAdzkNg	tL2pS5UOmN6aAOi3Z-qFGg	tJzf6H1dkuUbL-t8bzL3dw	5	2014-04-27	I was looking for a nice place to take the family to dinner last night. After reading the reviews and looking at the photos I settled on this place. Great choice! The ambience was perfect. Nice and quite so you don’t have to talk loud so your guests can hear you. I was really surprised that on a Saturday night this place only had a few other tables with diners. You’d think a place like this would be packed. The service is what really made this place great for me. I would rate this place in the top two or three restaurants I’ve ever eaten in as far as service goes. The food was good to. Everyone in my party of 6 enjoyed their meals. I personally would rate my steak a 4 out of 5. But with the kind of service you get here I didn’t mind. The total bill came to 274 and change for a party of 6. That includes one bottle of wine and a couple of other mixed drinks. So really not a bad price for a place like this. If you’re looking for a romantic place to take a date or just a nice quiet place to dine I would highly recommend Carve!	0	0	0
ZGlUf9noms8FQ67rmTZSdA	tL2pS5UOmN6aAOi3Z-qFGg	FtaTjyMUIY457tPJahjg1A	4	2014-04-25	We stopped by here for lunch this afternoon. The place was packed. I thought we’d have a long wait, but we were shown to a booth in just a few minutes. The hostess and our waitress were both very friendly. I appreciated them taking the extra effort to stay friendly when it was crazy busy like that. I deal with the public on a daily basis and know when it’s super busy like that it can sometimes be a challenge to keep a smile on your face and a kind word on your lips. All of our meals were excellent. If I’m ever in the neighborhood I’d definitely eat here again.	0	0	0
pcszB9oTZE2DNylbbXIZAg	tL2pS5UOmN6aAOi3Z-qFGg	yLiaMaJFq03JxXPk4puloQ	3	2017-04-20	I’ve stopped in here several times. It’s always busy but they seem to move along at a decent speed. As with all fast food joints nowadays be sure to check your bag before you go as there’s a 50/50 chance they won’t get your order right. They’ve got one of those new machines inside where you can place your order. Thanks but no thanks. I’d rather place my order with an old fashioned human being.	1	2	1

4. Exploratory Data Analysis

Top cuisines

Most frequent categories in data set

The categories variable lists the categories to which a restaurant belongs. A few generic categories such as “Restaurants”, “Food”, “Nightlife”, “Bars” and “New” were identified in the initial analysis. These are removed before creating the word cloud of the most frequent categories across cities. In the word cloud, we consider only those categories that appear in the data set at least 300 times. The most frequent categories appear in the center of the word cloud in the largest size. As we move outwards, the size of the words decreases, denoting smaller frequencies. Words of the same color fall in a similar frequency range.

# All categories listed for different restaurants are taken
categories <- unlist(strsplit(business$categories, ";"))
# Most common categories like food are removed
remove_categories <- c("Restaurants", "Food", "Nightlife", "Bars", "New")
clean_categories <- removeWords(categories, remove_categories)
# word cloud is created with this set of categories
wordcloud(clean_categories, 
          min.freq = 300,
          random.order=FALSE, 
          rot.per=0.35,
          colors=brewer.pal( 8,"Dark2"))
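
As a cross-check on the word cloud, the same frequencies can be tabulated directly. The sketch below is only an illustration in base R: `toy_categories` is a made-up stand-in for `business$categories`, and the threshold of 2 replaces the 300 used in the report. It also matches whole labels, whereas `removeWords` works on words inside each string, so a label such as "Fast Food" may lose its second word there.

```r
# Toy stand-in for business$categories (hypothetical values, not from the Yelp data)
toy_categories <- c("Restaurants;Pizza;Italian",
                    "Food;Bakeries;Desserts",
                    "Restaurants;Pizza;Fast Food")

# Split each string on ";" and flatten into one vector of category labels
all_categories <- unlist(strsplit(toy_categories, ";", fixed = TRUE))

# Drop the generic labels, matching whole labels rather than substrings
generic <- c("Restaurants", "Food", "Nightlife", "Bars", "New")
kept <- all_categories[!all_categories %in% generic]

# Frequency table, most common first
category_freq <- sort(table(kept), decreasing = TRUE)

# Keep only labels that reach a minimum frequency (2 here; the report uses 300)
frequent <- category_freq[category_freq >= 2]
print(frequent)
```

A table like this makes it easy to see which labels sit just above or below the cut-off before they are drawn.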

Most popular five cuisines in each city

From the frequency plot, we can see that the data also contains categories such as buffets, speciality and caterers. In order to understand the top cuisines across locations, we restrict the analysis to the following cuisines, which were identified in the initial analysis.

Cuisines considered: “American (Traditional)”, “American (New)”, “Sandwiches”, “Fast Food”, “Mexican”, “Pizza”, “Italian”, “Chinese”, “Ice Cream & Frozen Yogurt”, “Bakeries”, “Desserts”, “Seafood”, “Sushi Bars”, “Juice Bars & Smoothies”, “Mediterranean”, “Steakhouses”, “Burgers”, “Salad”, “Barbeque”, “Cocktail Bars”, “Thai”, “Buffets”, “Hot Dogs”, “Asian”, “Japanese”, “American”, “Indian”

We can see that American cuisine is the most popular in most places, but fast food, pizza, Mexican and sandwiches have won the race in some. Asian cuisines like Chinese and Japanese did not make it to the top 5.

# Selected list of cuisines
cuisine_list <- c( "American (Traditional)", "American (New)", "Sandwiches", "Fast Food", "Mexican", "Pizza", "Italian", "Chinese", "Ice Cream & Frozen Yogurt", "Bakeries", "Desserts", "Seafood", "Sushi Bars", "Juice Bars & Smoothies", "Mediterranean", "Steakhouses", "Burgers", "Salad", "Barbeque", "Cocktail Bars", "Thai", "Buffets", "Hot Dogs", "Asian", "Japanese", "American", "Indian")

top_cusine_plot <- business %>% 
                   # Only city and categories are needed for this analysis
                   dplyr::select(city , categories) %>%
                   # categories are separated by ; in the variable
                   transform(categories = strsplit(categories, ";")) %>%
                   # A single row is created for each category from the list
                   unnest(categories) %>%
                   # Only categories in the cuisine_list is considered
                   filter(categories %in% cuisine_list)  %>%  
                    dplyr::group_by(categories,city) %>%
                    # Count is calculated for each combination of city and category
                    tally() %>% 
                    # Top 5 categories for a city is filtered 
                    dplyr::group_by(city) %>% 
                    top_n(n = 5) %>% 
                    # To plot uniformly, ranks are given. Maximum count is given the rank 1
                    mutate( count_rank = rank(-n)) %>% 
                    # Bar chart for each city is shown
                    ggplot(aes(x = count_rank, y = n))+ 
                    geom_bar(stat = "identity", fill = "#87cefa") + 
                    facet_wrap(~city, nrow = 2, scales = "free") + 
                    coord_flip() + 
                    # 1 should come on the top
                    scale_x_reverse() +
                    # Name of the cuisine is displayed inside the bars
                    geom_text(aes(label = categories), size = 2,hjust="inward") 

top_cusine_plot
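
One detail of the ranking step is worth a small illustration. `rank()` uses `ties.method = "average"` by default, so two cuisines with the same count in a city share a fractional rank and land on the same position of the axis. The sketch below uses made-up counts for a single city to show the difference; using `ties.method = "first"` is a suggestion, not what the plot above does.

```r
# Hypothetical counts for one city (illustrative numbers only)
counts <- data.frame(
  categories = c("Pizza", "Mexican", "Burgers", "Thai"),
  n = c(40, 25, 25, 10),
  stringsAsFactors = FALSE
)

# Default ties.method = "average": tied counts share a fractional rank
counts$rank_average <- rank(-counts$n)

# ties.method = "first": every row gets a distinct integer rank
counts$rank_first <- rank(-counts$n, ties.method = "first")

print(counts)
```

With distinct integer ranks each cuisine keeps its own bar and label even when counts are tied.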

Check in trends

For each city, the average number of check-ins is calculated for every hour of every weekday. This visualization shows the patterns across time and day. Blue denotes more check-ins and orange denotes fewer check-ins for the square that corresponds to a given hour on a given weekday. Las Vegas being a party city, the hours after midnight are the busy hours throughout the week. In Scottsdale, the dinner hour around 7 is the busiest hour throughout the weekend. In Pittsburgh, Sunday breakfast is the busiest hour.

Busy and lazy working hours in each city

timeplot <- function(city_name){
        checkin %>% 
        inner_join(business, by = "business_id") %>% 
        # Data for the city is selected
        filter(city == city_name)%>%
        # For each weekday and hour, average number of checkins are calculated 
        group_by( weekday, hour)%>%
        summarise( mean_checkin = mean(checkins)) %>%
        ggplot(aes(x = hour, 
                   y= weekday, 
                   fill = mean_checkin))+
        geom_tile(colour = "white") + 
        # If the number of check-ins is high, the tile is blue; if it is low, the tile is orange
        scale_fill_gradient(low = "orange", high = "blue", name = "Check-in count") + 
        ylab(" Day of the week") + 
        xlab(" Hour of the day") +
        ggtitle(paste("Check in trends for",city_name))

}
# Plots are created for all cities in the list
checkin_plots <- lapply(top_10_cities, timeplot)
# They are plotted in a grid of 5 rows and 2 columns
do.call("grid.arrange", c(checkin_plots, ncol=2))
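
The averaging behind each tile can be reproduced on a tiny example. This is a minimal sketch in base R with invented rows; it assumes, as `timeplot()` does, that the check-in table has `weekday`, `hour` and `checkins` columns.

```r
# Hypothetical check-in rows for two businesses in one city
toy_checkin <- data.frame(
  business_id = c("a", "a", "b", "b", "b"),
  weekday = c("Sat", "Sun", "Sat", "Sat", "Sun"),
  hour = c(19, 9, 19, 23, 9),
  checkins = c(10, 4, 6, 8, 2)
)

# Mean check-ins for every weekday/hour combination
mean_checkin <- aggregate(checkins ~ weekday + hour, data = toy_checkin, FUN = mean)
print(mean_checkin)
```

Note that the mean is taken only over the rows present for a slot. If the check-in table stores no row for a business with zero check-ins in that slot, quiet businesses do not pull the average down.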

Geographical locations

The locations of restaurants are shown on geographical maps in this section. A function is created to plot the maps. It takes in the city name, filters the data for the given city and plots the map. The leaflet package is used for this. Since there are so many restaurants in a city, they are clustered when shown on the map. When you hover the mouse over a number, the region for which the numbers are aggregated is highlighted. This shows us where restaurants are densely located.

Clicking a number in a circle zooms in to that area. On zooming in, the aggregation regions change. At the lowest zoom level, the restaurant locations can be seen in detail. At this point, if the rating of a restaurant is 1 or 2, it is considered a low rated outlet and is represented by a red circle. If the rating is 3, it is considered average and is shown as a blue circle. High rated restaurants (places with a rating of 4 or 5) are represented by green circles.

geographical_map <- function(location_name){
  
      location_business <- business %>%
                          # filter for the city
                          filter(city == location_name) %>%
                          # Creates 3 levels based on rating
                          mutate( rating_level = ifelse(stars == 4 | stars == 5 ,"High", ifelse(stars == 3, "Medium", "Low")))
      
      # Creates color palette for rating levels
      pallete <- colorFactor(c("dark red",  "blue","dark green"), domain = c("Low", "Medium","High"))
     
      location_business %>%  
                 leaflet() %>% 
                 setView(lng = mean(location_business$longitude), 
                         lat = mean(location_business$latitude), 
                         zoom = 12) %>% 
                 addProviderTiles(providers$CartoDB.Positron) %>%
                 addCircleMarkers(~longitude, 
                                  ~latitude,
                                  radius = 3,
                                  fillOpacity = 0.5,
                                  # Creates clusters for restaurants on high level
                                  clusterOptions = markerClusterOptions(),
                                  # Color palette is assigned based on rating level
                                  color = ~pallete(rating_level))
     
}
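
The rating levels in `geographical_map()` are assigned by testing for exact values. If the `stars` column ever holds half-star values (an assumption here; the text above only mentions whole ratings), the equality test would send 3.5 and 4.5 to "Low". The sketch below compares that rule with an interval-based alternative built on `cut()`; the break points are one possible choice, not the report's.

```r
# Hypothetical star ratings, including half-star values
stars <- c(1, 2.5, 3, 3.5, 4, 4.5, 5)

# Equality-based rule as in geographical_map()
level_equal <- ifelse(stars == 4 | stars == 5, "High",
                      ifelse(stars == 3, "Medium", "Low"))

# Interval-based alternative: below 3 is Low, 3 to under 4 is Medium, 4 and above is High
level_cut <- as.character(cut(stars,
                              breaks = c(-Inf, 3, 4, Inf),
                              labels = c("Low", "Medium", "High"),
                              right = FALSE))

print(data.frame(stars, level_equal, level_cut))
```

For whole-number ratings both rules agree, so the interval version only matters when fractional ratings are present.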

Restaurants across Las Vegas

geographical_map("Las Vegas")

Restaurants across Phoenix

geographical_map("Phoenix")

Restaurants across Pittsburgh

geographical_map("Pittsburgh")

Restaurants across Scottsdale

geographical_map("Scottsdale")

Restaurants across Chandler

geographical_map("Chandler")

Restaurants across Madison

geographical_map("Madison")

Restaurants across Cleveland

geographical_map("Cleveland")

Restaurants across Mesa

geographical_map("Mesa")

Restaurants across Charlotte

geographical_map("Charlotte")

Restaurants across Tempe

geographical_map("Tempe")

5. Sentiment Analysis

Since the available RAM cannot handle the review data for all 10 locations, only Las Vegas reviews are used for sentiment analysis. In order to avoid untrustworthy reviews, a review is considered for analysis only if more than 5 people have rated it as useful. The text of the review is converted to lower case, and numbers and stop words are removed from it. Three terms were found at high frequency across reviews in the initial analysis: las vegas, http and www.yelp.com. These are also removed from the text.

useful <-     review %>% 
              left_join(business, by = "business_id") %>% 
              # Filters reviews for Las Vegas
              filter(city == 'Las Vegas') %>%
              # Only reviews that more than 5 people found useful are taken
              filter(useful > 5)

# All text is converted to lower case
useful$text <- tolower(useful$text)
# Stop words and repeating words like Las vegas, http, www.yelp.com are removed
useful$text <- removeWords(useful$text ,c(stopwords("en"), "las vegas","http","www.yelp.com" ))
useful$text <- removeNumbers(useful$text )

Frequently linked words

The reviews given by customers about food, service, ambience and staff are of interest for this case study. As a first step, the words that generally appear in reviews right after the key words food, service, ambience and staff are analysed through a network graph. For this, bigrams that have any of these four words as the first word are created. In order to avoid insignificant relationships that crowd the space, only those words that appear more than 100 times after these key words are considered.

# Bigrams are created with words in review text
bigrams <- useful %>% 
           unnest_tokens(bigram, text, token = "ngrams", n = 2)

# List of words that are of significance
analysis_word <- c("food", "ambience", "staff", "service")

# Creates data for network analysis graph
bg_grapgh <- bigrams %>%
             # Words from each bigram are separated
             separate(bigram, c("word1", "word2"), sep = " ")  %>% 
             # Count for each combination of words are calculated
             group_by(word1, word2) %>% 
             summarise( n = n()) %>%
             # Only the combination with significant words in the beginning and min freq of 100 are taken
             filter( word1 %in% analysis_word & n > 100) %>%
             # Creates data for network graph
             graph_from_data_frame()
             
arrow_format <- grid::arrow(type = "closed", length = unit(.1, "inches"))

## Visual representation of connection of pair of words
ggraph(bg_grapgh, layout = "fr") +
  # Connection between words are represented by arrows
  geom_edge_link(aes(edge_alpha = n), 
                 show.legend = TRUE,
                 arrow = arrow_format, 
                 end_cap = circle(.1, 'inches')) +
  # Nodes for words
  geom_node_point(color = 'light blue', 
                  size = 7) +
  # Text is displayed
  geom_node_text(aes(label = name), 
                 vjust = 1, 
                 hjust = 1,
                 repel = TRUE) +
  theme_void()
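
To make the pairing step concrete, here is a minimal base R sketch of how "key word followed by another word" counts can be built. The two review texts are invented, and `make_bigrams` is a hypothetical helper written for this illustration; the report itself uses `unnest_tokens()` and `separate()`.

```r
# Hypothetical cleaned review texts
texts <- c("food great service horrible", "service great food great")

# Build consecutive word pairs within each review
make_bigrams <- function(txt) {
  words <- strsplit(txt, " +")[[1]]
  if (length(words) < 2) return(NULL)
  data.frame(word1 = head(words, -1), word2 = tail(words, -1),
             stringsAsFactors = FALSE)
}
pairs <- do.call(rbind, lapply(texts, make_bigrams))

# Keep pairs that start with a key word and count each combination
key_words <- c("food", "ambience", "staff", "service")
key_pairs <- pairs[pairs$word1 %in% key_words, ]
pair_counts <- as.data.frame(table(word1 = key_pairs$word1, word2 = key_pairs$word2),
                             stringsAsFactors = FALSE)
pair_counts <- pair_counts[pair_counts$Freq > 0, ]
print(pair_counts)
```

Each remaining row corresponds to one arrow in the network graph, pointing from the key word to the word that follows it.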

The opacity of an arrow denotes the number of times the pair of words appeared, so darker arrows mark more frequent pairs. Service is linked to both horrible and great with dark arrows, which means there are large numbers of both positive and negative reviews of service. Food is followed by many different words in reviews, and service and food share many of them. Ambience and staff have fewer words that follow them with high frequency.

Contributing words

A restaurant is rated from 1 to 5. Through this visualization, the patterns in reviews given for food, ambience, staff and service are studied. For each rating, we find the words that contribute to positive or negative sentiment for each attribute. These are the words that follow the key words (food, ambience, staff or service).

We use sentiments from the AFINN lexicon for this analysis. The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. We calculate the product of the score and the number of occurrences to identify how much a word affects the reviews. The top 5 positive and top 5 negative words for each attribute are shown for every rating. If multiple words have the same effect, all of them are displayed.

For 5 rated restaurants, reviews use words like great and amazing to describe the staff and service. For 4 rated restaurants, good and awesome take their place. For top rated restaurants, all top 5 words describe food quality. But for 4 rated restaurants, the fifth positive word is pretty, which could be used to describe the presentation of the food. The distinguishing factor between 4 and 5 could be the taste or quality of the food. Top rated restaurants also have a classy ambience that distinguishes them from 4 rated ones.

As we move down the ratings, the negative words increase across the attributes. The staff at low rated restaurants are described as annoying, stingy, confused and dumb. The service sucks and is horrible and awful. Food becomes disappointing and horrible. Low rated restaurants have uncomfortable, poor or horrible ambience.

Words that contribute to positive and negative sentiments for key attributes in a rating category

# afinn lexicon is imported
AFINN <- get_sentiments("afinn")

analysis <- bigrams %>%
                # Bigrams are separated
                separate(bigram, c("word1", "word2"), sep = " ") %>%
                # Only the words under analysis is chosen as first word
                filter(word1 %in% analysis_word) %>%
                # AFINN lexicon is used for sentiment analysis
                inner_join(AFINN, by = c(word2 = "word")) %>%
                # Count for each word and rating is taken
                group_by(word1, word2, score,stars.x)  %>%
                summarise(n = n()) %>%
                ungroup()

# creates plots for each star rating
star_plot <- function(star){
    analysis_plot <- analysis %>% filter(stars.x == star) %>%
        mutate(contribution = n * score, sign = ifelse(score > 0 , "P", "N")) %>%
        arrange(desc(abs(contribution))) %>%
        group_by(word1,sign) %>%
        # Selects the top 5 contributions to both positive and negative emotions
        top_n(5, abs(contribution)) %>%
        ggplot(aes(drlib::reorder_within(word2, contribution, word1), 
                   contribution, 
                   # Color is based on positive or negative emotion
                   fill = contribution > 0)) +
        geom_bar(stat = "identity", show.legend = FALSE) +
        xlab("Words preceded by topic under analysis") +
        ylab("Sentiment score * Number of occurrences") +
        ggtitle(paste("Contributing words for rating : ", as.character(star)))+
        drlib::scale_x_reordered() +
        facet_wrap( ~ word1,scales = "free", nrow = 1) +
        coord_flip()
    return(analysis_plot)
}

# Creates a grid of 5 rows, one per star rating
# The columns within each row come from facet_wrap, one per key word
star_plots <- lapply(c(5,4,3,2,1), star_plot)
do.call("grid.arrange", c(star_plots, ncol = 1))
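
The contribution measure used above (sentiment score times number of occurrences) is easy to check by hand. The following sketch uses invented counts and hypothetical scores on the AFINN scale for words that follow one key word; it is not output from the Yelp data.

```r
# Hypothetical lexicon scores and bigram counts (made-up numbers)
toy <- data.frame(
  word2 = c("great", "good", "horrible", "bad"),
  score = c(3, 3, -3, -3),
  n = c(120, 200, 40, 15),
  stringsAsFactors = FALSE
)

# Contribution = sentiment score * number of occurrences
toy$contribution <- toy$n * toy$score

# Largest absolute contribution first, as in star_plot()
toy <- toy[order(-abs(toy$contribution)), ]
print(toy)
```

Because the count is part of the product, a frequent word can outrank a rarer one with the same score, and a very frequent mild word can outrank a rare strong one.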

Emotions across rating

Now that we have found the words that contribute positive and negative emotions for each attribute across ratings, we will next analyse the general emotion expressed in reviews for each rating. The NRC lexicon is used for this analysis. It categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. For each star rating, the percentage of words that express the emotions of trust, fear, anger, anticipation, disgust, joy, sadness and surprise is shown below.

A radar plot is made across emotions to show the percentage projected in reviews. Color in this visualisation represents the rating of the restaurant whose reviews are analysed. We can see that restaurants with a low rating of 1 or 2 have reviews that predominantly express anger, sadness, fear and disgust. These emotions are expressed the least for high rated restaurants. The medium rating of 3 shows a large amount of anticipation in its reviews. High rated restaurants (4 and 5 ratings) follow such a similar trend that their lines almost overlap in this visualization. Though customers express an equal percentage of trust in reviews for restaurants with ratings of 4 and 5 (overlapping pink and yellow lines), reviews show more joy for 5 rated restaurants.

Percentage of emotions projected in reviews for each rating

# Creates unigrams from text
unigrams <- useful %>% unnest_tokens(word, text, token = "ngrams", n = 1)
# nrc lexicon is loaded 
nrc <- get_sentiments("nrc")

sentiment_analysis <- unigrams %>% 
               dplyr::group_by(stars.x, word) %>% 
               # Count of words in review for each rating is calculated
               summarise( n = n()) %>% 
               # NRC sentiment analysis
               inner_join(nrc)

# positive and negative emotions are dropped
review_nrc <- sentiment_analysis %>%
  filter(!grepl("positive|negative", sentiment))


review_tally <- review_nrc %>%
                group_by(stars.x, sentiment) %>%
                tally() %>% 
                # Calculates the percentage of words that attribute to a sentiment
                mutate(cuisine_words = (nn / sum(nn))*100) %>% 
                select(-nn)

# Reshape to wide format: one row per emotion, one column per star rating
scores <- review_tally %>%
          spread(stars.x, cuisine_words)

# JavaScript radar chart
chartJSRadar(scores)
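
The percentages fed to the radar chart are shares of emotion-bearing words within each rating. A minimal base R sketch of that calculation, using invented counts and only three emotions, could look like this:

```r
# Hypothetical word counts per star rating and emotion
toy_counts <- data.frame(
  stars = c(1, 1, 1, 5, 5, 5),
  sentiment = c("anger", "joy", "trust", "anger", "joy", "trust"),
  n = c(50, 10, 40, 5, 60, 35)
)

# Cross-tabulate emotion by rating, then convert each column to percentages
emotion_table <- xtabs(n ~ sentiment + stars, data = toy_counts)
emotion_pct <- prop.table(emotion_table, margin = 2) * 100
print(round(emotion_pct, 1))
```

Each column sums to 100, so the chart compares the mix of emotions within a rating rather than the raw volume of reviews.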

Words across cuisines

In this step, we investigate whether specific words are used in the reviews given to a specific cuisine. If the cuisine category is not registered by the merchant, reviews can then be used to identify the cuisine. Words that are common in reviews across the cuisines in Las Vegas are compared with the frequent words in the reviews given to a specific cuisine. This allows us to see strong deviations in word frequency within each cuisine as compared to reviews given across the location.

Words that are close to the line (light grey) are used with similar frequency in reviews for the cuisine under study and in reviews across all the cuisines. For example, words such as “food” and “pizza” are fairly common and used with similar frequencies across most of the cuisines. Words that are far from the line (green color) are found more in one set of reviews than in the other. The words standing out above the line are common across the location but not in that particular category. The words below the line are common in that particular category but not across the location.

For example, “torta” stands out above the line in the American (Traditional) cuisine. This means that “torta” is fairly common in reviews given to other cuisines, but is not used as much in reviews for American (Traditional) cuisine. In contrast, a word below the line such as “burgr” in the American (Traditional) category is common in reviews for this cuisine but far less common in reviews for other cuisines.

# Calculates the top 6 cuisines
LV_top_6_cuisines <- business %>% 
                    dplyr::select(city , categories) %>%
                    transform(categories = strsplit(categories, ";")) %>%
                    unnest(categories) %>%
                    # Only the categories in cuisine_list for Las Vegas are considered
                    filter(categories %in% cuisine_list & city == 'Las Vegas')  %>%  
                    dplyr::group_by(categories) %>%
                    tally() %>% 
                    top_n(n = 6) 

# Calculates the percentage of word use across all reviews of the top cuisines
word_pct <- useful %>%
           transform(categories = strsplit(categories, ";")) %>%
           unnest(categories) %>%
           filter(categories %in% LV_top_6_cuisines$categories)  %>% 
           unnest_tokens(word, text) %>%
           # Removes stop words
           anti_join(stop_words) %>%
           dplyr::group_by(word) %>%
           summarise(n = n()) %>%
           # Calculate the percentage in the whole review set
           transmute(word, all_cuisines = n / sum(n))

# calculate percent of word use within each cuisine
  frequency <- useful %>%
             transform(categories = strsplit(categories, ";")) %>%
             unnest(categories) %>%
             filter(categories  %in% LV_top_6_cuisines$categories)  %>% 
             unnest_tokens(word, text) %>%
              anti_join(stop_words) %>%
              dplyr::group_by(categories,word) %>%
              summarise(n = n()) %>%
              # Calculate the percentage in the review for given category
              mutate(cuisine_words = n / sum(n)) %>%
              left_join(word_pct) %>%
              arrange(desc(cuisine_words)) %>%
              ungroup()

# Plots frequency of words in reviews specific to cuisine in x axis and percentage of appearance in all in y
ggplot(frequency, aes(x = cuisine_words, y = all_cuisines, color = abs(all_cuisines - cuisine_words))) +
        geom_abline(color = "gray40", lty = 2) +
        geom_jitter(alpha = 0.1, size = 3, width = 0.3, height = 0.3) +
        geom_text(aes(label = word, size = 1), check_overlap = TRUE, vjust = ifelse(frequency$all_cuisines > frequency$cuisine_words, 2,-2)) +
        scale_x_log10(labels = scales::percent_format()) +
        scale_y_log10(labels = scales::percent_format()) +
        scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
        facet_wrap(~ categories, ncol = 2) +
        theme(legend.position = "none") +
        labs(y = "Cuisines", x = NULL)
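
The comparison drawn in the plot can also be summarised numerically. The sketch below uses invented word counts for two cuisines and computes each word's share within a cuisine and across all cuisines. The log ratio at the end is an extra summary added for illustration and is not part of the report's code: 0 puts a word on the diagonal, positive values mean the word is over-represented in that cuisine (below the line in the plot) and negative values mean it is under-represented (above the line).

```r
# Hypothetical word counts for two cuisines
toy_words <- data.frame(
  categories = c("Mexican", "Mexican", "Mexican", "Pizza", "Pizza", "Pizza"),
  word = c("food", "taco", "crust", "food", "taco", "crust"),
  n = c(50, 45, 5, 50, 5, 45),
  stringsAsFactors = FALSE
)

# Share of each word within its cuisine
toy_words$cuisine_share <- toy_words$n / ave(toy_words$n, toy_words$categories, FUN = sum)

# Share of each word across all cuisines together
overall <- aggregate(n ~ word, data = toy_words, FUN = sum)
overall$all_share <- overall$n / sum(overall$n)

# Compare the two shares on a log scale: 0 means the word sits on the diagonal
merged <- merge(toy_words, overall[, c("word", "all_share")], by = "word")
merged$log_ratio <- log2(merged$cuisine_share / merged$all_share)
print(merged[order(merged$categories, -merged$log_ratio), ])
```

Sorting by this ratio within a cuisine gives a quick list of candidate words for identifying the cuisine from review text.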

6. Summary

Problem statement and data used

Restaurants in the US are identified as the target market. In order to understand the existing business models, the Yelp data set was selected. This data set contains information on businesses across 11 metropolitan areas in four countries. Restaurant data for the US was filtered, and the 10 cities with the largest number of restaurants in the data were chosen for study. We have information on location, category, working hours for each weekday, number of check-ins for each day and hour, and user text reviews. The data is cleansed and processed to create visualizations for exploratory data analysis. Text mining is done on the reviews to understand customer sentiment. Some of the findings from this case study are listed below.

Data analysis

Exploratory data analysis

Trends in the restaurant market are studied based on analysis of the Yelp data set. The data was sliced and diced to create different visualisations that uncover the patterns present. Restaurant locations were plotted on a map, where they were aggregated to find the areas in which restaurants are densely located. In order to find the busiest hours in a city, heat maps were generated for each hour of each weekday. The most frequent categories across the cities were identified using a word cloud. The analysis then drilled down to find the top 5 cuisines in each city using bar charts.

Sentiment analysis

In order to understand the sentiments of customers, four key attributes of a business were identified. The general customer reaction to food, ambience, staff and service at a restaurant was analysed. As an initial step, the words commonly used along with the key terms are visually represented in a network chart. To identify the difference between high and low rated restaurants, the reviews given to each of these four attributes in each rating category were studied separately. The words that contribute to positive and negative sentiments for each attribute are identified, and the top 5 positive and negative words are shown in a diverging bar chart. The percentage of different emotions expressed in reviews given to each rating category is shown in a radar chart. The difference in usage of words in reviews given to the top 6 cuisines in Las Vegas was also investigated and displayed.

Interesting findings

American cuisine is the most popular in most locations. Sandwiches and fast food are the next best options. Asian cuisines like Indian and Chinese did not make it to the top 5 in any of the locations.
Locations like Las Vegas have an active night life, and the hours after midnight are the busiest working time for restaurants there. Scottsdale is busy during dinner time around 7 on most days. The patterns in working hours can be seen in the visualization provided in the exploratory data analysis.
Restaurants are densely present mostly in downtowns. As you move further away from the city, the number of restaurants for which we have information reduces.
A lot of words are used in common to describe food and service. Great and horrible are used with comparable frequency to describe the service of restaurants in Las Vegas. Staff and ambience are not described with frequently repeated terms in reviews, unlike food and service.
The distinguishing factor between 4 and 5 could be the taste or quality of the food. Top rated restaurants have a classy ambience that distinguishes them from 4 rated ones. Low rated restaurants have uncomfortable, poor or horrible ambience. As we move down the ratings, the negative words increase across the attributes. The staff at low rated restaurants are described as annoying, stingy, confused and dumb. The service sucks and is horrible and awful. Food becomes disappointing and horrible.
Reviews for restaurants with a low rating of 1 or 2 predominantly express anger, sadness, fear and disgust. The medium rating of 3 shows a large amount of anticipation in its reviews. Though customers express an equal percentage of trust in reviews for restaurants with ratings of 4 and 5, reviews show more joy in customers at 5 rated restaurants.
Reviews for each cuisine have words specific to that particular category. For example, refried is a word commonly used only for Mexican cuisine. This can be used to identify the cuisine from reviews.

Further steps.

Collect more data on food served, menu, music etc. and explore more trends and patterns that will aid a new business person planning to open a restaurant in the US.
Create a machine learning model that will identify the cuisine from the review given.
Predict the rating that a customer might give to a restaurant by analysing the review given by him/her.
Due to time constraints, all sentiment analysis in this case study is done using unigrams and bigrams. Expand the scope of the study to larger n-grams and sentences.