My retirement plan is to start a quiet little cafe in the foothills of the Himalayas. I want to fill my little space with paintings and books and, obviously, serve amazing food. Though this is a long-shot plan, I am constantly curious about the market dynamics of restaurants. To start understanding what makes a good restaurant, I planned a market research analysis of restaurants in the United States.
To conduct the market analysis, I am using the Yelp dataset, which is a subset of Yelp’s businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge to give students a chance to conduct research or analysis on Yelp’s data and share their discoveries. Below are the proposed approach and the analytic techniques used to conduct the analysis.
Geographical analysis - This analysis aids the location selection process and helps find the best possible place to establish the cafe. The user should be able to select the desired city and see the distribution of restaurants across the location along with their ratings. Maybe it is better to start a restaurant that serves high-quality food in a location with a large number of low-rated restaurants. Maybe opening a restaurant on a street full of popular restaurants gives good visibility. It all depends on the strategy adopted by the management.
Understanding previous trends - This aids the selection of working hours and cuisine for the restaurant. Some places might be suited to restaurants that are open during the evening. Some locations might be profitable when the restaurant is open throughout the day. By analysing past trends, we can decide our restaurant’s best-suited working hours. We will also explore the different cuisines that are popular in the chosen locality.
Learning from the best and worst players in the market - Sentiment analysis is done on the user reviews to understand trends and patterns in user behaviour. Bad reviews help us learn without making the mistakes ourselves. Positive reviews can help me understand what works for the general public and incorporate that into my business.
The packages used for this analysis are listed below.
if (!require("pacman")) install.packages("pacman")
# p_load function installs missing packages and loads all the packages given as input
pacman::p_load("data.table",
"tidyverse",
"stringr",
"lubridate",
"DT",
"tidytext",
"NLP" ,
"knitr",
"leaflet",
"tm",
"wordcloud",
"grid",
"gridExtra",
"radarchart",
"igraph",
"ggraph")
The data is loaded from the source and cleaned for further analysis.
Yelp is an American multinational corporation founded in 2004 by former PayPal employees Russel Simmons and Jeremy Stoppelman. It develops, hosts and markets Yelp.com and the Yelp mobile app, which publish crowd-sourced reviews about local businesses, as well as the online reservation service Yelp Reservations. The dataset used in this study is a subset of Yelp’s businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp’s data and share their discoveries.
In the dataset you’ll find information about businesses across 11 metropolitan areas in four countries. There are 6 tables available that contain business-related information.
The six files containing the data are loaded in this step. fread from data.table is mainly used for loading data as it is fast for large files. After importing the data, each of these data sets is analysed and cleaned according to our needs.
business <- fread("yelp-dataset/yelp_business.csv")
business_attributes <- fread("yelp-dataset/yelp_business_attributes.csv")
business_hours <- read_csv("yelp-dataset/yelp_business_hours.csv")
checkin <- fread("yelp-dataset/yelp_checkin.csv")
tip <- fread("yelp-dataset/yelp_tip.csv")
review <- read_csv("yelp-dataset/yelp_review.csv")
In this step, each of the six files is evaluated separately and cleaned for further analysis.
The business table contains the location and category details of the businesses included in the dataset. There are 174567 businesses listed in the original dataset. For this project, only restaurants from the United States are considered.
# Only the data for restaurants in the United States are kept
business <- business %>%
filter(state %in% state.abb) %>%
filter(categories %like% "Restaurant")
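As a small illustration of what this filter keeps, here is a sketch on made-up rows using base R only. The toy data frame and its values are hypothetical. `state.abb` is R's built-in vector of the 50 state abbreviations, so a row tagged `DC` would be dropped by this filter along with the non-US provinces.

```r
# Hypothetical toy rows; the real table comes from yelp_business.csv
toy_business <- data.frame(
  name       = c("A", "B", "C", "D"),
  state      = c("NV", "ON", "AZ", "DC"),
  categories = c("Restaurants;Pizza", "Restaurants;Thai",
                 "Hair Salons", "Restaurants;Bars"),
  stringsAsFactors = FALSE
)

# Keep rows in a US state (as defined by state.abb) whose categories mention "Restaurant"
keep <- toy_business$state %in% state.abb &
  grepl("Restaurant", toy_business$categories)
us_restaurants <- toy_business[keep, ]

print(us_restaurants$name)
```

Only row "A" survives here: "B" is in a Canadian province, "C" is not a restaurant, and "D" is in DC, which `state.abb` does not list.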
After filtering for restaurants in the United States, we have data on 32484 businesses. We will first investigate the amount of data available for each city and consider the top 10 cities for this study.
# Displays the top 10 cities present in the data set based on number of restaurants for which data is available.
business %>% select(state,city) %>%
dplyr::group_by(state,city) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>%
head(n = 10) %>%
datatable(class = 'cell-border stripe hover condensed responsive')
For the top 10 cities, the same city also appears under multiple names in the data set. Below are the different spellings found for the same city names.
# Lists are created so that if any additional pattern is found, it can be added flexibly.
# All Las Vegas spellings are mapped to a single name
lasVegas <- c("Las Vegas", "North Las Vegas", "N Las Vegas", "Las Vegas", "las vegas", "N. Las Vegas","South Las Vegas", "Las vegas", "LasVegas")
business$city[business$city %in% lasVegas & business$state == "NV" ] <- "Las Vegas"
# All Phoenix spellings are mapped to a single name
phoenix <- c("Pheonix", "Pheonix", "Pheonix AZ" , "Phoenix Valley")
business$city[business$city %in% phoenix & business$state == "AZ" ] <- "Phoenix"
# All Pittsburgh spellings are mapped to a single name
pittsburgh <- c("Pittsburgh", "East Pittsburgh")
business$city[business$city %in% pittsburgh & business$state == 'PA' ] <- "Pittsburgh"
# All Scottsdale spellings are mapped to a single name
scottsdale <- c("Scottsdale", "Scottdale")
business$city[business$city %in% scottsdale & business$state == 'AZ' ] <- "Scottsdale"
# All Cleveland spellings are mapped to a single name
cleveland <- c("Cleveland", "Cleveland Heights", "East Cleveland", "Cleveland Hghts.")
business$city[business$city %in% cleveland & business$state == 'OH' ] <- "Cleveland"
# All Mesa spellings are mapped to a single name
mesa <- c("Mesa", "MESA", "Mesa AZ")
business$city[business$city %in% mesa & business$state == 'AZ' ] <- "Mesa"
# Filters the data so that we analyse only the restaurants from top 10 cities (Based on count)
top_10_cities <- c("Las Vegas", "Phoenix", "Charlotte", "Pittsburgh", "Scottsdale", "Cleveland", "Mesa", "Madison", "Tempe", "Chandler")
business <- business %>% filter(city %in% top_10_cities)
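The repeated replace-by-list pattern above could also be wrapped in a small helper. The following is only a sketch: `normalise_city` and its arguments are hypothetical names, the sample rows are made up, and the case-insensitive, whitespace-trimmed matching is an assumption that goes slightly beyond the exact matching used above.

```r
# Hypothetical helper: map known spelling variants of a city to one canonical name,
# restricted to a single state so that same-named cities elsewhere are left alone.
normalise_city <- function(city, state, variants, canonical, in_state) {
  # Compare case-insensitively after trimming surrounding whitespace
  hit <- tolower(trimws(city)) %in% tolower(variants) & state == in_state
  city[hit] <- canonical
  city
}

# Made-up example rows
city  <- c("las vegas", "N. Las Vegas", "Las Vegas", "Mesa AZ", "las vegas")
state <- c("NV", "NV", "NV", "AZ", "NM")

city <- normalise_city(
  city, state,
  variants  = c("Las Vegas", "N. Las Vegas", "North Las Vegas"),
  canonical = "Las Vegas",
  in_state  = "NV"
)
print(city)
```

The last row keeps its original spelling because its state is not "NV", which mirrors the state check in the code above.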
The final data set for the top ten cities contains information on 21406 restaurants. Some of the important variables are described below. A snapshot of the dataset is also provided.
| Name | Description |
|---|---|
| business_id | ID of the business |
| name | Name of the business |
| city | City in which the business is located |
| latitude | Latitude coordinate of the restaurant. Will be used to plot the location on the map. |
| longitude | Longitude coordinate of the restaurant. Will be used to plot the location on the map. |
| stars | Provides the rating of the restaurant. Minimum = 0, Maximum = 5 |
| is_open | Whether the restaurant is still operating or has shut down |
| categories | Categories under which the restaurant falls |
# Displays 10 rows from the dataset
datatable(head(business, n = 10), class = 'cell-border stripe hover condensed responsive')
The business_attributes table contains the attributes of businesses like music, delivery, parking etc. There are 152041 businesses listed in the original dataset. We need to filter and keep the restaurants in top 10 cities.
# Only the data for restaurants from top 10 cities are retained
business_attributes <- business_attributes %>%
filter(business_id %in% business$business_id)
After filtering, only 21123 restaurants are retained. There are 82 columns in this data set. Not all of them are relevant to us. We will drop the irrelevant variables.
# The attributes that are not significant are dropped from the dataset.
business_attributes <- business_attributes %>%
select(-c( HasTV, Caters , BusinessAcceptsBitcoin, BYOBCorkage, BYOB), -contains("Hair"))
There are 68 variables left in the data set now. The final data set for top ten cities contains information regarding 21123 observations.
# 10 rows from the business_attributes table are displayed
kable(head(business_attributes, n = 10))
| business_id | AcceptsInsurance | ByAppointmentOnly | BusinessAcceptsCreditCards | BusinessParking_garage | BusinessParking_street | BusinessParking_validated | BusinessParking_lot | BusinessParking_valet | RestaurantsPriceRange2 | GoodForKids | BikeParking | Alcohol | NoiseLevel | RestaurantsAttire | Music_dj | Music_background_music | Music_no_music | Music_karaoke | Music_live | Music_video | Music_jukebox | Ambience_romantic | Ambience_intimate | Ambience_classy | Ambience_hipster | Ambience_divey | Ambience_touristy | Ambience_trendy | Ambience_upscale | Ambience_casual | RestaurantsGoodForGroups | WiFi | RestaurantsReservations | RestaurantsTakeOut | HappyHour | GoodForDancing | RestaurantsTableService | OutdoorSeating | RestaurantsDelivery | BestNights_monday | BestNights_tuesday | BestNights_friday | BestNights_wednesday | BestNights_thursday | BestNights_sunday | BestNights_saturday | GoodForMeal_dessert | GoodForMeal_latenight | GoodForMeal_lunch | GoodForMeal_dinner | GoodForMeal_breakfast | GoodForMeal_brunch | CoatCheck | Smoking | DriveThru | DogsAllowed | Open24Hours | Corkage | DietaryRestrictions_dairy-free | DietaryRestrictions_gluten-free | DietaryRestrictions_vegan | DietaryRestrictions_kosher | DietaryRestrictions_halal | DietaryRestrictions_soy-free | DietaryRestrictions_vegetarian | AgesAllowed | RestaurantsCounterService |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fNMVV_ZX7CJSDWQGdOM8Nw | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na |
| rDMptJYWtnMhpQu_rRXHng | Na | Na | Na | Na | False | False | False | True | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | True | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na |
| 1WBkAuQg81kokZIPMpn9Zg | Na | Na | Na | Na | False | False | False | True | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | False | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na |
| Pd52CjgyEU3Rb8co6QfTPw | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | True | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na |
| 4srfPk1s8nlm1YusyDUbjg | Na | Na | Na | Na | False | False | False | False | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | False | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na |
| n7V4cD-KqqE3OXk0irJTyA | Na | Na | Na | Na | True | False | False | True | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na |
| EJFdWX908N8Yc2XG0Lky8A | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | False | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na |
| iPa__LOhse-hobC2Xmp-Kw | Na | Na | Na | Na | False | False | False | True | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | True | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na |
| o1fTwfqN0sDFNpV1CkOPPg | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na |
| -nHkhiuerqmfBG3v2v9O-g | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | False | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na | Na |
# Missing values in the data set are represented by the character string 'Na'. They are replaced by NA
business_attributes[,2:ncol(business_attributes)][ business_attributes[,2:ncol(business_attributes)] == 'Na' ] <- NA
# The percentage of missing values is calculated for each attribute
data.table("Variable" = colnames(business_attributes), "Percentage of NA values" = colMeans(is.na(business_attributes)))
## Variable Percentage of NA values
## 1: business_id 0.0000000
## 2: AcceptsInsurance 1.0000000
## 3: ByAppointmentOnly 1.0000000
## 4: BusinessAcceptsCreditCards 0.9941770
## 5: BusinessParking_garage 0.9980116
## 6: BusinessParking_street 0.3398192
## 7: BusinessParking_validated 0.3398192
## 8: BusinessParking_lot 0.3398192
## 9: BusinessParking_valet 0.3398192
## 10: RestaurantsPriceRange2 1.0000000
## 11: GoodForKids 0.9961180
## 12: BikeParking 0.6178573
## 13: Alcohol 0.9974909
## 14: NoiseLevel 0.9985324
## 15: RestaurantsAttire 0.9994792
## 16: Music_dj 0.9991952
## 17: Music_background_music 1.0000000
## 18: Music_no_music 1.0000000
## 19: Music_karaoke 1.0000000
## 20: Music_live 1.0000000
## 21: Music_video 1.0000000
## 22: Music_jukebox 1.0000000
## 23: Ambience_romantic 1.0000000
## 24: Ambience_intimate 0.9993372
## 25: Ambience_classy 0.9993372
## 26: Ambience_hipster 0.9993372
## 27: Ambience_divey 0.9993372
## 28: Ambience_touristy 0.9993372
## 29: Ambience_trendy 0.9993372
## 30: Ambience_upscale 0.9993372
## 31: Ambience_casual 0.9993372
## 32: RestaurantsGoodForGroups 0.9993372
## 33: WiFi 0.9992425
## 34: RestaurantsReservations 0.9991005
## 35: RestaurantsTakeOut 0.9976803
## 36: HappyHour 0.9826729
## 37: GoodForDancing 0.9987691
## 38: RestaurantsTableService 0.9863182
## 39: OutdoorSeating 0.9774653
## 40: RestaurantsDelivery 0.9966387
## 41: BestNights_monday 0.9356625
## 42: BestNights_tuesday 0.9995266
## 43: BestNights_friday 0.9995266
## 44: BestNights_wednesday 0.9995266
## 45: BestNights_thursday 0.9995266
## 46: BestNights_sunday 0.9995266
## 47: BestNights_saturday 0.9995266
## 48: GoodForMeal_dessert 0.9995266
## 49: GoodForMeal_latenight 0.9401127
## 50: GoodForMeal_lunch 0.9401127
## 51: GoodForMeal_dinner 0.9401127
## 52: GoodForMeal_breakfast 0.9401127
## 53: GoodForMeal_brunch 0.9401127
## 54: CoatCheck 0.9401127
## 55: Smoking 0.9943190
## 56: DriveThru 0.9820101
## 57: DogsAllowed 0.8610993
## 58: Open24Hours 0.9996213
## 59: Corkage 1.0000000
## 60: DietaryRestrictions_dairy-free 1.0000000
## 61: DietaryRestrictions_gluten-free 0.9978696
## 62: DietaryRestrictions_vegan 0.9978696
## 63: DietaryRestrictions_kosher 0.9978696
## 64: DietaryRestrictions_halal 0.9978696
## 65: DietaryRestrictions_soy-free 0.9978696
## 66: DietaryRestrictions_vegetarian 0.9978696
## 67: AgesAllowed 0.9978696
## 68: RestaurantsCounterService 1.0000000
## Variable Percentage of NA values
There are only 5 columns with less than 50% missing values (the output above reports the share as a fraction between 0 and 1). From the output, we can see that they are business_id and the street, validated, lot and valet parking attributes. Since this information is not valuable for our analysis, we will discard this data set.
# Discard business attributes
rm("business_attributes")
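The missing-value check that led to this decision can be sketched on a tiny made-up table. The column values and the 0.5 cut-off below are illustrative assumptions; note that `colMeans(is.na(...))` returns a fraction, so it would need multiplying by 100 to read as a percentage.

```r
# Made-up attribute table using the same 'Na' placeholder as the raw file
attrs <- data.frame(
  business_id            = c("b1", "b2", "b3", "b4"),
  BusinessParking_street = c("False", "Na", "True", "False"),
  WiFi                   = c("Na", "Na", "Na", "free"),
  stringsAsFactors = FALSE
)

# Replace the placeholder with a real NA, then compute the share of NAs per column
attrs[attrs == "Na"] <- NA
na_share <- colMeans(is.na(attrs))

# Columns with less than half of their values missing
usable <- names(na_share)[na_share < 0.5]
print(usable)
```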
The business_hours table contains the opening and closing hours of the businesses included in the dataset. There are 174567 businesses listed in the original dataset. We need to filter and keep the restaurants in the top 10 cities.
# Only the data for restaurants from top 10 cities are retained
business_hours <- business_hours %>%
filter(business_id %in% business$business_id)
The final data set for the top ten cities contains information on 21406 restaurants. The data set contains a separate column for each weekday. We create a single column for the weekday and separate the starting and closing hours for each business. The time-related variables are converted from character to hour-and-minute format. The final form of the data set is displayed below.
business_hours <- business_hours %>%
# Name of the column will be stored in 'week_day' and value will be stored in 'working_hours'
gather(week_day, working_hours, -business_id) %>%
# The character value 'None' represents NA
transform(working_hours = ifelse(working_hours == 'None',NA,working_hours)) %>%
# All rows with missing values are removed
na.omit() %>%
# From the 'working_hours' column, starting and closing hours are separated
separate(col = working_hours, into = c("starting_hour", "closing_hour"), sep = "-") %>%
# Starting and closing hours are stored in hours-and-minutes format
transform(starting_hour = hm(starting_hour), closing_hour = hm(closing_hour))
# Displays 5 rows of business_hours table
kable(head(business_hours, n = 5))
| | business_id | week_day | starting_hour | closing_hour |
|---|---|---|---|---|
| 1 | fNMVV_ZX7CJSDWQGdOM8Nw | monday | 7H 0M 0S | 15H 0M 0S |
| 3 | 1WBkAuQg81kokZIPMpn9Zg | monday | 11H 0M 0S | 22H 0M 0S |
| 4 | Pd52CjgyEU3Rb8co6QfTPw | monday | 8H 30M 0S | 22H 30M 0S |
| 6 | n7V4cD-KqqE3OXk0irJTyA | monday | 11H 0M 0S | 0S |
| 8 | iPa__LOhse-hobC2Xmp-Kw | monday | 5H 0M 0S | 23H 0M 0S |
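A closing hour of midnight shows up as a zero-length period (`0S`) in the table above, which matters when opening durations are computed later. The sketch below shows one way to handle that case with base R only. The `to_hours` helper, the sample strings and the "open-close" string layout are assumptions for illustration, not the exact contents of the Yelp file.

```r
# Hypothetical helper: convert "H:M" strings to decimal hours
to_hours <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) + as.numeric(p[2]) / 60, numeric(1))
}

# Made-up working-hours strings in an assumed "open-close" format
working_hours <- c("7:00-15:00", "8:30-22:30", "11:00-0:00")
halves      <- strsplit(working_hours, "-", fixed = TRUE)
open_hour   <- to_hours(vapply(halves, `[`, character(1), 1))
close_hour  <- to_hours(vapply(halves, `[`, character(1), 2))

# A closing time at or before the opening time is read as "next day"
duration <- ifelse(close_hour <= open_hour,
                   close_hour + 24 - open_hour,
                   close_hour - open_hour)
print(duration)
```

Treating a closing time that is not later than the opening time as belonging to the next day is a design choice; it also turns identical opening and closing times into a 24-hour day, which may or may not be the intended reading.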
The checkin table contains the number of check-ins that occurred at a restaurant on a given weekday and hour of the day. There are 3911218 check-in observations for 146350 businesses listed in the original dataset. We need to filter and keep the restaurants in the top 10 cities. The hour was stored as an integer; it is converted to time format. To display the weekdays in order, they are converted to ordered factors.
checkin <- checkin %>%
# Only the data for restaurants from top 10 cities are retained
filter(business_id %in% business$business_id) %>%
# Hour is converted to time format
transform( hour = hour(hm(hour)),
# Weekdays are stored as ordered factors. The filtered 'weekday' column is used here,
# because checkin$weekday would still refer to the longer, unfiltered table.
weekday = factor(weekday, levels = c("Sun","Sat","Fri","Thu","Wed","Tue","Mon"), ordered = TRUE) )
The final data set for the top ten cities contains 1054197 check-in observations for 20714 restaurants. Variable descriptions and a snapshot of the data can be seen below. The minimum, maximum and average number of check-ins for each weekday are listed below.
| Name | Description |
|---|---|
| business_id | ID of the business |
| weekday | Weekday on which the check-in occurred |
| hour | Hour at which the check-in occurred |
| checkins | Average number of check-ins for the given weekday and hour |
| weekday | min_checkins | max_checkins | average_checkins |
|---|---|---|---|
| Sun | 1 | 827 | 6.606031 |
| Sat | 1 | 741 | 6.625402 |
| Fri | 1 | 831 | 6.641018 |
| Thu | 1 | 700 | 6.586976 |
| Wed | 1 | 785 | 6.595957 |
| Tue | 1 | 703 | 6.648213 |
| Mon | 1 | 875 | 6.640417 |
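A per-weekday summary like the one above could be computed along these lines. This is a base R sketch on made-up rows, not the code used to produce the table; the `toy_checkin` data and its values are hypothetical.

```r
# Made-up check-in rows; the real table has one row per business, weekday and hour
toy_checkin <- data.frame(
  weekday  = c("Mon", "Mon", "Tue", "Tue", "Tue"),
  checkins = c(1, 5, 2, 4, 9),
  stringsAsFactors = FALSE
)

# Group the check-in counts by weekday
by_day <- split(toy_checkin$checkins, toy_checkin$weekday)

# Minimum, maximum and mean check-ins per weekday
summary_by_day <- data.frame(
  weekday          = names(by_day),
  min_checkins     = vapply(by_day, min, numeric(1)),
  max_checkins     = vapply(by_day, max, numeric(1)),
  average_checkins = vapply(by_day, mean, numeric(1)),
  row.names        = NULL,
  stringsAsFactors = FALSE
)
print(summary_by_day)
```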
A snapshot of data is displayed below.
# Displays 10 rows of checkin table
datatable(head(checkin, n = 10), class = 'cell-border stripe hover condensed responsive')
The tip table contains the attributes of tips given by users. There are 1098324 observations for 112365 businesses listed in the original dataset. We need to filter and keep the restaurants in top 10 cities.
# Only the data for restaurants from top 10 cities are retained
tip <- tip %>%
filter(business_id %in% business$business_id)
The final data set for the top ten cities contains 494099 tips for 19367 restaurants.
# Displays 10 observations for tip data set.
kable(head(tip, n = 10))
| text | date | likes | business_id | user_id |
|---|---|---|---|---|
| Sunday $.55 bone-in wings Monday $.55 boneless wings | 2016-08-22 | 0 | --ujyvoQlwVoBgMYtADiLA | ulQ8Nyj7jCUR8M83SUMoRQ |
| Black Angus and the Roast beef :) | 2012-12-03 | 0 | JzB7NITHQ7gVHGVZ1ntgIQ | TvkqJ8YEIsTb16RnnrNyfQ |
| Expensive, but convenient for hotel stays | 2012-12-02 | 0 | h14GmWZ8rXum9fXF__wt3w | TvkqJ8YEIsTb16RnnrNyfQ |
| Finally, found some churros. Four types here. It should be great! | 2012-03-28 | 0 | xFN8mRubo3G0oIzJwc8XBA | TvkqJ8YEIsTb16RnnrNyfQ |
| closed closed closed closed closed closed closed closed closed closed closed closed closed closed closed closed closed closed closed closed | 2012-03-28 | 0 | Xmndl6GoZg8taEUlwQMYxg | TvkqJ8YEIsTb16RnnrNyfQ |
| Try one of the Bento Box options | 2012-10-09 | 0 | eZDXz_RylvdD0tHEA8I0NA | TvkqJ8YEIsTb16RnnrNyfQ |
| USAir check-in desk agent Nancy E. is the sadest poor customer service provider I experienced this week. She doesn’t want to be there. #Fail | 2011-05-20 | 0 | u7CxxEzx8hvjoJ8onN4zTg | TvkqJ8YEIsTb16RnnrNyfQ |
| Great weather for eating outdoors. Good service. | 2012-03-27 | 0 | 1CqDdPrrb0xvQpgu7fhI5w | TvkqJ8YEIsTb16RnnrNyfQ |
| The fried banana dessert is good | 2012-09-02 | 0 | _AKdBFzkl7GY-daxUCCbVA | TvkqJ8YEIsTb16RnnrNyfQ |
| I didn’t eat here, but they were nice enough to tell me where to find tacos de lengua y churros. Look at my next post to find out where :) | 2012-03-28 | 0 | HWjqW5ZFJ8eZRQuHcpySQA | TvkqJ8YEIsTb16RnnrNyfQ |
The review table contains information regarding a review given by a user for a business. There are 5261668 reviews for 174567 businesses listed in the original dataset. We need to filter and keep the restaurants in top 10 cities.
# Only the data for restaurants from top 10 cities are retained
review <- review %>%
filter(business_id %in% business$business_id )
The final data set for top ten cities contains 2106287 reviews for 21406 businesses. The description for important variables and a snapshot of data can be seen below.
| Name | Description |
|---|---|
| review_id | ID of the review |
| user_id | ID of the user who gave the review |
| business_id | ID of the business which is reviewed |
| stars | Stars given by user for the business |
| date | Date of review |
| text | review text |
| useful | Number of users who found the review useful |
| funny | Number of users who found the review funny |
| cool | Number of users who found the review cool |
# Displays 10 observations for the review data set.
kable(head(review, n = 10))
| review_id | user_id | business_id | stars | date | text | useful | funny | cool |
|---|---|---|---|---|---|---|---|---|
| nsThIz_-TuvgoFh0o9XJfQ | _L2SZSwf7A6YSrIHy_q4cw | IXXERocY1bqGwRllcy8J2w | 5 | 2009-08-30 | Visiting from SF. Checked yelp and found this place. It is very small – a converted house. As you sit in it you can watch the kitchen. The food was excellent. We had an eggplant/red pepper omelet and peach waffles. The waffles were light and fluffy with fresh whipped creme. We got pastries to go, which were also good, in particular the chocolate croissant was unique. Great place for an informal breakfast. | 0 | 0 | 0 |
| BF0ANB54sc_f-3_howQBCg | ssuXFjkH4neiBgwv-oN4IA | JlNeaOymdVbE6_bubqjohg | 1 | 2014-08-09 | We always go to the chevo’s in chandler which is delicious, the one in ahwatukee is different for some reason. Ordered the chicken rolled tacos today there was a tiny lil piece of chicken in each one, so basically I had 3 rolled deep fried tortillas yuck! :( No flavor what so ever. Also ordered carne asada taco the meat tasted old like it was cooked earlier and just thrown on the grill to get warm. Very dissapointed!! | 3 | 0 | 0 |
| QgSf2JvYz-M4PU2yuJjxNQ | nOTl4aPC4tKHK35T3bNauQ | 9Jc3W0aR9Xf2gcHI0rEXsw | 1 | 2012-08-23 | After being scared away from Rock & Rita’s, we ended up at this place, which was, if nothing else, quieter. I’ll start by saying that the hostess and our server were both lovely and they’re really what earned the single star, because they were sweet, but the food was just horrible. I ordered a simple grilled cheese sandwich. I asked the server if it was processed cheese, or “real” cheese in the sandwich and he had no idea what I was talking about. I tried to clarify by saying, “Is it like Kraft single slices, or deli cheese?” He said it was “good”. This really should have been a sign and I could probably fault the waiter for this, since you SHOULD know what you’re serving, but he was so sweet, I couldn’t be mad! My husband ordered a hot dog of sorts. I asked if there was a way to substitute fries for a salad. There was not, but the server hooked me up and brought me a salad anyway (no charge). I asked for sweet potato fries with the sandwich. Of course, my sandwich was made with plasticy, processed cheese, that was melted to the point of Cheeze-Wiz and was inedible. I took one bite of the sandwich and couldn’t do another. I got regular fries, instead of sweet potato (didn’t bother complaining, because I got the free salad, so whatever…like I said, nice server!) My husband’s food was fine, although nothing special. They were good with the refills of our drinks. Anyway, as nice as the servers were, it wasn’t good. I would avoid eating here if I could. | 0 | 1 | 0 |
| gN6GARS_BRr5UX2D3WAH0w | nOTl4aPC4tKHK35T3bNauQ | xVEtGucSRLk5pxxN0t4i6g | 5 | 2012-08-23 | We got recommendations for this place from my parents and so, for our anniversary, we booked here. We were told though that it was first come, first served, in terms of having the ideal seats by the windows that overlook the fountains of the Bellagio and The Strip. We went for lunch though, so it wasn’t too busy. We were seated by the window and were promptly served some sort of puffed rolls, with cheese and herbs infused. They were delicious! We ordered the onion soup to start and I ordered parmesan chicken and my husband ordered a lamb burger. Our food came and was delicious. The soup was probably my favorite part. We were feeling very full by the end and so declined dessert, but were still given some jellies and chocolate truffle and then were served a small chocolate mousse, on a decorated plate that said Happy Anniversary, which was a total surprise (they had asked when I booked if it was a special occasion, but had said nothing while we were there). Overall the service was great, the food delicious and the view wonderful. It was a great experience! | 0 | 0 | 0 |
| t4oXDPN4S4USIhBGpuSD8A | nOTl4aPC4tKHK35T3bNauQ | 2LZGeJy8qByYKB71ML-jcw | 2 | 2012-08-23 | We got a coupon to eat here when we checked in: $6.99 for breakfast and a second one for half, or free, or something to that effect. We went and had our breakfasts, which was served by a rather surely waitress and was average. If this had been our only experience, I would have given three stars, but another night we got in very late and were just looking for a late dinner and found ourselves at this place. It was not very crowded, but was as loud as if three times as many people were there. There was a small group of trashy (very obese people), scream/singing karaoke on a stage in the front of the restaurant. One of them was singing Eminem and the others were screaming and cheering. We sat down in the back, to try and avoid the chaos, but it was way too disruptive. Other people sitting there, who were in the middle of their dinners and therefore stuck there, were looking pained. We didn’t bother putting ourselves through this and left. It’s a crappy menu, that’s overpriced anyway. I guess Rock and Rita’s name really says it all, but basically if you’re not a fall-down, cheap drunk, I don’t think you’ll have much to say about this place. | 2 | 1 | 0 |
| R9w7GeMX_KZTV23gmI8Zjg | tL2pS5UOmN6aAOi3Z-qFGg | RhV7sraRUB3km-gF-tmDow | 3 | 2013-02-06 | I’ve eaten here numerous times and am still amazed how popular these places are. It must be because they’re open 24 hours and late at night it’s one of the few places you can get something to eat fast. That’s the only reason I eat here. The good is soso. It fills you up but as far as being flavorful it leaves a lot to be desired. I’m sure I’ll visit again. It serves a purpose. | 0 | 0 | 0 |
| oncT7W70CFwzzJkQoz3T5Q | tL2pS5UOmN6aAOi3Z-qFGg | NaZVUOzqk5b-l0mlki-9Og | 4 | 2017-02-10 | We stopped in here for lunch this afternoon. Staff was helpful and friendly. Food was good and you get a lot for the price. We noted that they deliver and since we live close by we’ll probably order from them in the future. | 2 | 0 | 0 |
| 9dWoAJGcJHWscv2ZAdzkNg | tL2pS5UOmN6aAOi3Z-qFGg | tJzf6H1dkuUbL-t8bzL3dw | 5 | 2014-04-27 | I was looking for a nice place to take the family to dinner last night. After reading the reviews and looking at the photos I settled on this place. Great choice! The ambience was perfect. Nice and quite so you don’t have to talk loud so your guests can hear you. I was really surprised that on a Saturday night this place only had a few other tables with diners. You’d think a place like this would be packed. The service is what really made this place great for me. I would rate this place in the top two or three restaurants I’ve ever eaten in as far as service goes. The food was good to. Everyone in my party of 6 enjoyed their meals. I personally would rate my steak a 4 out of 5. But with the kind of service you get here I didn’t mind. The total bill came to 274 and change for a party of 6. That includes one bottle of wine and a couple of other mixed drinks. So really not a bad price for a place like this. If you’re looking for a romantic place to take a date or just a nice quiet place to dine I would highly recommend Carve! | 0 | 0 | 0 |
| ZGlUf9noms8FQ67rmTZSdA | tL2pS5UOmN6aAOi3Z-qFGg | FtaTjyMUIY457tPJahjg1A | 4 | 2014-04-25 | We stopped by here for lunch this afternoon. The place was packed. I thought we’d have a long wait, but we were shown to a booth in just a few minutes. The hostess and our waitress were both very friendly. I appreciated them taking the extra effort to stay friendly when it was crazy busy like that. I deal with the public on a daily basis and know when it’s super busy like that it can sometimes be a challenge to keep a smile on your face and a kind word on your lips. All of our meals were excellent. If I’m ever in the neighborhood I’d definitely eat here again. | 0 | 0 | 0 |
| pcszB9oTZE2DNylbbXIZAg | tL2pS5UOmN6aAOi3Z-qFGg | yLiaMaJFq03JxXPk4puloQ | 3 | 2017-04-20 | I’ve stopped in here several times. It’s always busy but they seem to move along at a decent speed. As with all fast food joints nowadays be sure to check your bag before you go as there’s a 50/50 chance they won’t get your order right. They’ve got one of those new machines inside where you can place your order. Thanks but no thanks. I’d rather place my order with an old fashioned human being. | 1 | 2 | 1 |
The categories variable lists the different categories to which a restaurant belongs. A few redundant categories such as “Restaurants”, “Food”, “Nightlife”, “Bars” and “New” were identified in the initial analysis. These are removed before creating the word cloud to find the most frequent categories across cities. In the word cloud, we consider only those categories that appear in the data set at least 300 times. The most frequent categories appear in the center of the word cloud in a large size. As we move outwards, the size of the words decreases, denoting lower frequencies. Words of the same color have a similar frequency range.
# All categories listed for different restaurants are taken
categories <- unlist(strsplit(business$categories, ";"))
# Most common categories like food are removed
remove_categories <- c("Restaurants", "Food", "Nightlife", "Bars", "New")
clean_categories <- removeWords(categories, remove_categories)
# word cloud is created with this set of categories
wordcloud(clean_categories,
min.freq = 300,
random.order=FALSE,
rot.per=0.35,
colors=brewer.pal( 8,"Dark2"))
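As a rough illustration of the counting behind the word cloud, the sketch below tallies categories on a small made-up data frame (the `toy_business` rows are hypothetical, not Yelp data). It drops the generic labels by exact match on the whole category. That choice is an assumption of this sketch and differs from `removeWords`, which, as far as we understand it, strips matching words from inside longer labels such as "Fast Food" or "American (New)".

```r
# Hypothetical toy data; the real `business` table is much larger
toy_business <- data.frame(
  name = c("A", "B", "C"),
  categories = c("Restaurants;Pizza;Italian",
                 "Food;Pizza;Fast Food",
                 "Restaurants;American (New);Bars"),
  stringsAsFactors = FALSE
)

# Split the ;-separated strings into one long vector of categories
all_categories <- unlist(strsplit(toy_business$categories, ";", fixed = TRUE))

# Drop generic labels by exact match, so "Fast Food" and "American (New)" survive
generic <- c("Restaurants", "Food", "Nightlife", "Bars", "New")
kept <- all_categories[!all_categories %in% generic]

# Frequency table, most common category first
category_counts <- sort(table(kept), decreasing = TRUE)
print(category_counts)
```

A table like this can be checked against the word cloud to confirm that the largest words really are the most frequent categories.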
From the frequency plot, we can see that there are a few categories such as buffets, specialty, caterers etc. In order to understand the top cuisines across locations, we consider the following cuisines that were identified in the initial analysis.
We can see that American cuisine is the most popular in most places, but fast food, pizza, Mexican and sandwiches have won the race in some places. Asian cuisines like Chinese and Japanese did not make it to the top 5.
# Selected list of cuisines
cuisine_list <- c( "American (Traditional)", "American (New)", "Sandwiches", "Fast Food", "Mexican", "Pizza", "Italian", "Chinese", "Ice Cream & Frozen Yogurt", "Bakeries", "Desserts", "Seafood", "Sushi Bars", "Juice Bars & Smoothies", "Mediterranean", "Steakhouses", "Burgers", "Salad", "Barbeque", "Cocktail Bars", "Thai", "Buffets", "Hot Dogs", "Asian", "Japanese", "American", "Indian")
top_cusine_plot <- business %>%
# Only city and categories are needed for this analysis
dplyr::select(city , categories) %>%
# categories are separated by ; in the variable
transform(categories = strsplit(categories, ";")) %>%
# A single row is created for each category from the list
unnest(categories) %>%
# Only categories in the cuisine_list is considered
filter(categories %in% cuisine_list) %>%
dplyr::group_by(categories,city) %>%
# Count is calculated for each combination of city and category
tally() %>%
# Top 5 categories for a city is filtered
dplyr::group_by(city) %>%
top_n(n = 5) %>%
# To plot uniformly, ranks are given. Maximum count is given the rank 1
mutate( count_rank = rank(-n)) %>%
# Bar chart for each city is shown
ggplot(aes(x = count_rank, y = n))+
geom_bar(stat = "identity", fill = "#87cefa") +
facet_wrap(~city, nrow = 2, scales = "free") +
coord_flip() +
# 1 should come on the top
scale_x_reverse() +
# Name of the cuisine is displayed inside the bars
geom_text(aes(label = categories), size = 2,hjust="inward")
top_cusine_plot
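One detail of the ranking step is worth a quick check. `rank()` averages tied positions by default, so two cuisines with the same count in a city would share an x position in the chart. The sketch below uses made-up counts (the `counts` vector is hypothetical) to show the default behaviour next to `ties.method = "first"`, which gives every bar its own position.

```r
# Hypothetical counts for one city; two cuisines are tied
counts <- c(Pizza = 10, Mexican = 7, Burgers = 7, Thai = 3)

# Default: tied values receive the average of their positions
default_rank <- rank(-counts)
print(default_rank)

# ties.method = "first" breaks ties by order of appearance
first_rank <- rank(-counts, ties.method = "first")
print(first_rank)
```

Whether distinct positions are preferable is a design choice. Since `top_n` keeps ties, a city can also end up with more than five bars.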
For each city, the average number of check-ins is calculated for every hour of every weekday. This visualization shows the patterns across time and day. Dark blue denotes more check-ins and orange denotes fewer check-ins for the square that corresponds to a given hour on a given weekday. We can see that in Las Vegas, being a party city, the hours after midnight are the busy hours throughout the week. In Scottsdale, the dinner hour around 7 is the busiest hour throughout the weekend. Sunday breakfast is the busiest time for Pittsburgh.
timeplot <- function(city_name){
checkin %>%
inner_join(business, by = "business_id") %>%
# Data for the city is selected
filter(city == city_name)%>%
# For each weekday and hour, average number of checkins are calculated
group_by( weekday, hour)%>%
summarise( mean_checkin = mean(checkins)) %>%
ggplot(aes(x = hour,
y= weekday,
fill = mean_checkin))+
geom_tile(colour = "white") +
# If the number of check-ins is high, the tile is blue; if it is low, the tile is orange
scale_fill_gradient(low = "orange", high = "blue", name = "Check-in count") +
ylab(" Day of the week") +
xlab(" Hour of the day") +
ggtitle(paste("Check in trends for",city_name))
}
# Plots are created for all cities in the list
checkin_plots <- lapply(top_10_cities, timeplot)
# They are plotted in a grid of 5 rows and 2 columns
do.call("grid.arrange", c(checkin_plots, ncol=2))
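The aggregation behind each heat map tile can be sketched in base R. The data frame below is made up (`toy_checkin` and its values are hypothetical), and `aggregate()` stands in for the `group_by` and `summarise` steps used in `timeplot`.

```r
# Hypothetical check-in records for a single city
toy_checkin <- data.frame(
  weekday  = c("Mon", "Mon", "Mon", "Sun", "Sun"),
  hour     = c(19, 19, 8, 8, 8),
  checkins = c(4, 6, 1, 10, 20)
)

# Mean check-ins for every weekday and hour combination (one heat map tile each)
mean_checkin <- aggregate(checkins ~ weekday + hour, data = toy_checkin, FUN = mean)
names(mean_checkin)[3] <- "mean_checkin"
print(mean_checkin)

# The busiest tile is the row with the highest mean
busiest <- mean_checkin[which.max(mean_checkin$mean_checkin), ]
print(busiest)
```

Reading off the busiest tile this way is a useful cross-check on what the colour scale suggests.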
The locations of restaurants are shown on geographical maps in this section. A function is created to plot the maps. It takes in the city name, filters the data for the given city and plots the map. The Leaflet package is used for this. Since there are so many restaurants in a city, they are clustered and shown on the map. When you hover the mouse over a number, the region for which the restaurants are aggregated is highlighted. This shows us where restaurants are densely located.
Clicking the numbers in circles zooms in to the area. On zooming in, the aggregation regions change. When the map is zoomed in to the lowest level, the restaurant locations can be seen in detail. At this point, if the rating of a restaurant is 1 or 2, it is considered a low rated outlet and is represented by a red circle. If the rating is 3, it is considered average and is shown as a blue circle. High rated restaurants (places with a rating of 4 or 5) are represented by green circles.
geographical_map <- function(location_name){
location_business <- business %>%
# filter for the city
filter(city == location_name) %>%
# Creates 3 level based on rating
mutate( rating_level = ifelse(stars == 4 | stars == 5 ,"High", ifelse(stars == 3, "Medium", "Low")))
# Creates color palette for rating levels
pallete <- colorFactor(c("dark red", "blue","dark green"), domain = c("Low", "Medium","High"))
location_business %>%
leaflet() %>%
setView(lng = mean(location_business$longitude),
lat = mean(location_business$latitude),
zoom = 12) %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addCircleMarkers(~longitude,
~latitude,
radius = 3,
fillOpacity = 0.5,
# Creates clusters for restaurants on high level
clusterOptions = markerClusterOptions(),
# Color palette is assigned based on rating level
color = ~pallete(rating_level))
}
geographical_map("Las Vegas")
geographical_map("Phoenix")
geographical_map("Pittsburgh")
geographical_map("Scottsdale")
geographical_map("Chandler")
geographical_map("Madison")
geographical_map("Cleveland")
geographical_map("Mesa")
geographical_map("Charlotte")
geographical_map("Tempe")
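One thing worth verifying in the data: if `stars` contains half-star values such as 3.5 or 4.5 (our understanding is that Yelp reports business ratings in half-star steps, but this should be checked), the equality tests in `geographical_map` would send those restaurants to "Low". A minimal sketch of range-based binning with `cut()` follows. Where the half-star values should fall is an assumption made here, not something the analysis above specifies.

```r
# Hypothetical ratings, including half-star values
stars <- c(1, 2.5, 3, 3.5, 4, 4.5, 5)

# Assumed bins: [1, 3) is Low, [3, 4) is Medium, 4 and above is High
rating_level <- cut(stars,
                    breaks = c(1, 3, 4, Inf),
                    labels = c("Low", "Medium", "High"),
                    right = FALSE)

print(data.frame(stars, rating_level))
```

With these bins, whole-star ratings land in the same groups described above, and every half-star rating has a defined colour.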
Since the available RAM cannot handle the data for all 10 locations, only Las Vegas reviews are used for sentiment analysis. In order to avoid untrustworthy reviews, a review is considered for analysis only if at least 5 people have rated it as useful. The text of the review is converted to lower case, and numbers and stop words are removed from it. Three terms were found to be common across reviews with high frequency in the initial analysis: “las vegas”, “http” and “www.yelp.com”. These are also removed from the text.
useful <- review %>%
left_join(business, by = "business_id") %>%
# Filters reviews for Las Vegas
filter(city == 'Las Vegas') %>%
# Only reviews that at least 5 people found useful are taken
filter(useful >= 5)
# All text is converted to lower case
useful$text <- tolower(useful$text)
# Stop words and repeating words like Las vegas, http, www.yelp.com are removed
useful$text <- removeWords(useful$text ,c(stopwords("en"), "las vegas","http","www.yelp.com" ))
useful$text <- removeNumbers(useful$text )
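To make the cleaning steps concrete, here is a base R approximation that lower-cases the text, strips digits and removes a list of words. The helper `clean_text` and the short word list are hypothetical stand-ins. The analysis itself uses `removeWords`, `removeNumbers` and the full English stop word list.

```r
# Hypothetical helper: lower-case, drop digits, drop listed words, tidy spaces
clean_text <- function(x, drop_words) {
  x <- tolower(x)
  x <- gsub("[0-9]+", "", x)
  # Match whole words or phrases only, using word boundaries
  pattern <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  # Collapse the gaps left behind by the removals
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

# A tiny stand-in for the stop word list plus the repeated phrase
drop_words <- c("the", "in", "is", "it", "las vegas")

cleaned <- clean_text("The buffet in Las Vegas is 100 percent worth it", drop_words)
print(cleaned)
```

Lower-casing has to happen before the word removal, otherwise "Las Vegas" would not match the lower-case entry in the list.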
Customer reviews related to food, service, ambience and staff are of interest for this case study. As a first step, the words that generally appear in reviews after the key words food, service, ambience and staff are analysed through a network graph. For this, bigrams containing any of these four words as the first word are created. In order to avoid insignificant relationships that crowd the space, only those words that appear at least 100 times after these key words are considered.
# Bigrams are created with words in review text
bigrams <- useful %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
# List of words that are of significance
analysis_word <- c("food", "ambience", "staff", "service")
# Creates data for network analysis graph
bg_grapgh <- bigrams %>%
# Words from each bigram are separated
separate(bigram, c("word1", "word2"), sep = " ") %>%
# Count for each combination of words is calculated
group_by(word1, word2) %>%
summarise( n = n()) %>%
# Only combinations that start with a key word and appear at least 100 times are kept
filter( word1 %in% analysis_word & n >= 100) %>%
# Creates data for network graph
graph_from_data_frame()
arrow_format <- grid::arrow(type = "closed", length = unit(.1, "inches"))
## Visual representation of connection of pair of words
ggraph(bg_grapgh, layout = "fr") +
# Connection between words are represented by arrows
geom_edge_link(aes(edge_alpha = n),
show.legend = TRUE,
arrow = arrow_format,
end_cap = circle(.1, 'inches')) +
# Nodes for words
geom_node_point(color = 'light blue',
size = 7) +
# Text is displayed
geom_node_text(aes(label = name),
vjust = 1,
hjust = 1,
repel = TRUE) +
theme_void()
The opacity of an arrow denotes the number of times the word pair appeared (the count is mapped to edge_alpha in the code). Service is linked to both horrible and great with dark arrows, so there are a large number of both positive and negative reviews for service. Food is followed by many different words in reviews. Service and food share many of the words used in the reviews. Ambience and staff have fewer words that appear in reviews with high frequency.
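The bigram counting that feeds the network graph can be sketched in base R on a few made-up snippets. The `reviews` vector and the helper `make_bigrams` are hypothetical, and the frequency threshold is left out because the toy data is so small.

```r
# Hypothetical, already-cleaned review snippets
reviews <- c("food great service horrible",
             "service great food great",
             "staff friendly food cold")

# Build bigrams within each review so pairs never span two reviews
make_bigrams <- function(text) {
  words <- strsplit(text, " +")[[1]]
  if (length(words) < 2) return(NULL)
  data.frame(word1 = head(words, -1), word2 = tail(words, -1),
             stringsAsFactors = FALSE)
}
toy_bigrams <- do.call(rbind, lapply(reviews, make_bigrams))

# Keep pairs whose first word is a key term, then count each pair
key_words <- c("food", "ambience", "staff", "service")
key_bigrams <- toy_bigrams[toy_bigrams$word1 %in% key_words, ]
pair_counts <- as.data.frame(table(key_bigrams$word1, key_bigrams$word2),
                             stringsAsFactors = FALSE)
names(pair_counts) <- c("word1", "word2", "n")
pair_counts <- pair_counts[pair_counts$n > 0, ]
print(pair_counts[order(-pair_counts$n), ])
```

Each remaining row corresponds to one arrow in the graph, pointing from the key word to the word that follows it.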
A restaurant is rated between 1 and 5. Through this visualization, the patterns in reviews given for food, ambience, staff and service are studied. For each star rating, we find the words that contribute to positive or negative sentiment for each attribute. These are the words that follow the key words (food, ambience, staff or service).
We use sentiments from the AFINN lexicon for this analysis. The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. We calculate the product of the score and the number of occurrences to identify how much a word affects a review. The top 5 positive and negative words for each attribute are shown for each rating. If multiple words have the same effect, all of them are displayed.
For 5 rated restaurants, reviews have words like great and amazing to describe the staff and service. For 4 rated restaurants, good and awesome take their place. For top rated restaurants, all of the top 5 words describe food quality. But for 4 rated restaurants, the fifth positive word is pretty, which could be used to describe the presentation of food. The distinguishing factor between 4 and 5 could be the taste/quality of food. Top rated restaurants have a classy ambience that distinguishes them from 4 rated ones.
When we move down the ratings, the negative words increase across the attributes. The staff at low rated restaurants are described as annoying, stingy, confused and dumb. The service sucks and is horrible and awful. Food becomes disappointing and horrible. Low rated restaurants have uncomfortable, poor or horrible ambience.
# afinn lexicon is imported
AFINN <- get_sentiments("afinn")
analysis <- bigrams %>%
# Bigrams are separated
separate(bigram, c("word1", "word2"), sep = " ") %>%
# Only bigrams whose first word is one of the words under analysis are kept
filter(word1 %in% analysis_word) %>%
# AFINN lexicon is used for sentiment analysis
inner_join(AFINN, by = c(word2 = "word")) %>%
# Count for each word and rating is taken
group_by(word1, word2, score,stars.x) %>%
summarise(n = n()) %>%
ungroup()
# creates plots for each star rating
star_plot <- function(star){
analysis_plot <- analysis %>% filter(stars.x == star) %>%
mutate(contribution = n * score, sign = ifelse(score > 0 , "P", "N")) %>%
arrange(desc(abs(contribution))) %>%
group_by(word1,sign) %>%
# Selects the top 5 contributions to both positive and negative emotions
top_n(5, abs(contribution)) %>%
ggplot(aes(drlib::reorder_within(word2, contribution, word1),
contribution,
# Color is based on positive or negative emotion
fill = contribution > 0)) +
geom_bar(stat = "identity", show.legend = FALSE) +
xlab("Words preceded by topic under analysis") +
ylab("Sentiment score * Number of occurrences") +
ggtitle(paste("Contributing words for rating : ", as.character(star)))+
drlib::scale_x_reordered() +
facet_wrap( ~ word1,scales = "free", nrow = 1) +
coord_flip()
return(analysis_plot)
}
# Creates a grid of 5 rows, one per star rating
# The columns within each row come from facet_wrap, one per key word
star_plots <- lapply(c(5,4,3,2,1), star_plot)
do.call("grid.arrange", c(star_plots, ncol = 1))
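The contribution measure (score multiplied by number of occurrences) can be illustrated with a toy lexicon. The scores and counts below are illustrative placeholders, not values taken from AFINN or from the reviews, and `merge()` stands in for the `inner_join`. Depending on the installed tidytext version, the AFINN score column may be named `value` rather than `score`, which is worth checking before running the join above.

```r
# Illustrative scores only; the real analysis takes them from the AFINN lexicon
toy_lexicon <- data.frame(word = c("great", "good", "horrible", "awful"),
                          score = c(3, 3, -3, -3),
                          stringsAsFactors = FALSE)

# Hypothetical counts of words that follow "service" in low rated reviews
toy_counts <- data.frame(word = c("great", "good", "horrible", "awful", "slow"),
                         n = c(5, 8, 40, 12, 30),
                         stringsAsFactors = FALSE)

# Inner join: words missing from the lexicon ("slow") drop out
scored <- merge(toy_counts, toy_lexicon, by = "word")

# Contribution = sentiment score * number of occurrences
scored$contribution <- scored$n * scored$score
print(scored[order(-abs(scored$contribution)), ])
```

The example also shows a limitation of the approach: a frequent word with no lexicon entry, such as "slow" here, contributes nothing to the chart.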
Now that we have found the words that contribute positive and negative emotions for each attribute across ratings, we will next analyse the general emotion expressed in reviews for each rating. The NRC lexicon is used for this analysis. The NRC lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. For each star rating, the percentage of words that express the emotions of trust, fear, anger, anticipation, disgust, joy, sadness and surprise is shown below.
A radar plot is made across emotions to show the percentage expressed in reviews. Color in this visualisation represents the rating of the restaurant whose reviews are analysed. We can see that reviews of restaurants with a low rating of 1 or 2 predominantly express anger, sadness, fear and disgust. These emotions are expressed the least for high rated restaurants. The medium rating of 3 shows a large amount of anticipation in its reviews. High rated restaurants (4 and 5 ratings) have such similar trends that their lines almost overlap in this visualization. Though customers express an equal percentage of trust in reviews for restaurants with ratings of 4 and 5 (overlapping pink and yellow lines), reviews show more joy for 5 rated restaurants.
# Creates unigrams from text
unigrams <- useful %>% unnest_tokens(word, text, token = "ngrams", n = 1)
# nrc lexicon is loaded
nrc <- get_sentiments("nrc")
sentiment_analysis <- unigrams %>%
dplyr::group_by(stars.x, word) %>%
# Count of words in review for each rating is calculated
summarise( n = n()) %>%
# NRC sentiment analysis
inner_join(nrc)
# positive and negative emotions are dropped
review_nrc <- sentiment_analysis %>%
filter(!grepl("positive|negative", sentiment))
review_tally <- review_nrc %>%
group_by(stars.x, sentiment) %>%
tally() %>%
# Calculates the percentage of words that attribute to a sentiment
mutate(cuisine_words = (nn / sum(nn))*100) %>%
select(-nn)
# Key value pairs
scores <- review_tally %>%
spread(stars.x, cuisine_words)
# JavaScript radar chart
chartJSRadar(scores)
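The percentage step can be sketched with `prop.table()` on a small made-up count matrix (the numbers in `toy_counts` are hypothetical). Percentages are taken within each star rating, so ratings with very different review volumes remain comparable on the radar chart.

```r
# Hypothetical counts per star rating (rows) and emotion (columns)
toy_counts <- matrix(c(30, 10, 10,
                       10, 20, 20),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(stars = c("1", "5"),
                                     sentiment = c("anger", "joy", "trust")))

# Share of each emotion within a rating, as a percentage (each row sums to 100)
pct <- prop.table(toy_counts, margin = 1) * 100
print(round(pct, 1))
```

Each row of `pct` corresponds to one coloured line on the radar chart.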
In this step, we investigate whether specific words are used in the reviews given to a specific cuisine. If the cuisine category is not registered by the merchant, reviews can be used to identify the cuisine. Common words seen in reviews across the cuisines in Las Vegas are compared with the frequent words in the reviews given to a specific cuisine. This allows us to see strong deviations in word frequency within each cuisine as compared to the reviews given in the location as a whole.
Words that are close to the line (light grey) are used with similar frequency in reviews for the cuisine under study and in reviews for all the cuisines together. For example, words such as “food” and “pizza” are fairly common and used with similar frequencies across most of the cuisines. Words that are far from the line (green color) are found more in one set of reviews than in the other. The words standing out above the line are common across the location but not for that particular category. The words below the line are common in that particular category but not across the location.
For example, “torta” stands out above the line in the American (Traditional) cuisine. This means that “torta” is fairly common in reviews given to other cuisines, but is not used as much in reviews for American (Traditional) cuisine. In contrast, a word below the line such as “burgr” in the American (Traditional) category suggests this word is common in reviews for this cuisine but far less common in reviews for other cuisines.
# Calculates the top 6 cuisines
LV_top_6_cuisines <- business %>%
dplyr::select(city , categories) %>%
transform(categories = strsplit(categories, ";")) %>%
unnest(categories) %>%
# Only the categories in the cuisine list for Las Vegas are considered
filter(categories %in% cuisine_list & city == 'Las Vegas') %>%
dplyr::group_by(categories) %>%
tally() %>%
top_n(n = 6)
# Calculates the percentage of word use across all selected cuisines
word_pct <- useful %>%
transform(categories = strsplit(categories, ";")) %>%
unnest(categories) %>%
filter(categories %in% LV_top_6_cuisines$categories) %>%
unnest_tokens(word, text) %>%
# Removes stop words
anti_join(stop_words) %>%
dplyr::group_by(word) %>%
summarise(n = n()) %>%
# Calculate the percentage in the whole review set
transmute(word, all_cuisines = n / sum(n))
# calculate percent of word use within each cuisine
frequency <- useful %>%
transform(categories = strsplit(categories, ";")) %>%
unnest(categories) %>%
filter(categories %in% LV_top_6_cuisines$categories) %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
dplyr::group_by(categories,word) %>%
summarise(n = n()) %>%
# Calculate the percentage in the review for given category
mutate(cuisine_words = n / sum(n)) %>%
left_join(word_pct) %>%
arrange(desc(cuisine_words)) %>%
ungroup()
# Plots frequency of words in reviews specific to cuisine in x axis and percentage of appearance in all in y
ggplot(frequency, aes(x = cuisine_words, y = all_cuisines, color = abs(all_cuisines - cuisine_words))) +
geom_abline(color = "gray40", lty = 2) +
geom_jitter(alpha = 0.1, size = 3, width = 0.3, height = 0.3) +
geom_text(aes(label = word, size = 1), check_overlap = TRUE, vjust = ifelse(frequency$all_cuisines > frequency$cuisine_words, 2,-2)) +
scale_x_log10(labels = scales::percent_format()) +
scale_y_log10(labels = scales::percent_format()) +
scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
facet_wrap(~ categories, ncol = 2) +
theme(legend.position = "none") +
labs(y = "Cuisines", x = NULL)
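The two shares plotted against each other can be reproduced on a toy example. The counts in `toy_words` are made up, and the `ratio` column is an extra illustration added here, not part of the analysis above. A ratio well above 1 marks a word that is more typical of that cuisine than of the location as a whole.

```r
# Hypothetical word counts per cuisine
toy_words <- data.frame(
  cuisine = c("Mexican", "Mexican", "Mexican", "Pizza", "Pizza", "Pizza"),
  word    = c("taco", "food", "refried", "crust", "food", "taco"),
  n       = c(40, 50, 10, 45, 50, 5),
  stringsAsFactors = FALSE
)

# Share of each word across all cuisines together (the y axis)
overall <- aggregate(n ~ word, data = toy_words, FUN = sum)
overall$all_cuisines <- overall$n / sum(overall$n)

# Share of each word within its own cuisine (the x axis)
toy_words$cuisine_words <- toy_words$n / ave(toy_words$n, toy_words$cuisine, FUN = sum)

# Put both shares side by side and compare them
compared <- merge(toy_words, overall[, c("word", "all_cuisines")], by = "word")
compared$ratio <- compared$cuisine_words / compared$all_cuisines
print(compared[order(-compared$ratio),
               c("cuisine", "word", "cuisine_words", "all_cuisines", "ratio")])
```

In this toy data, "food" has a ratio of 1 for both cuisines and would sit on the dashed line, while "refried" falls on the side of the line reserved for cuisine-specific words.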
Restaurants in the US are identified as the target market. In order to understand the existing business model, the Yelp data set is selected. This data set contains information for businesses across 11 metropolitan areas in four countries. Restaurant data for the US was filtered, and the top 10 cities by number of restaurants with available data were chosen for study. Below are some of the findings from this case study. We have information regarding location, category, working hours for each weekday, number of check-ins for each day and hour, and user text reviews. Data is cleansed and processed to create visualizations for exploratory data analysis. Text mining is done on reviews to understand customer sentiments.
Based on analysis of the Yelp dataset, trends in the restaurant market are studied. Data was sliced and diced to create different visualisations to uncover the patterns present. Restaurant locations were plotted on a map, where they were aggregated to find the areas where restaurants are densely located. In order to find the busiest hours in a city, heat maps were generated for each hour of each weekday. The most frequent categories across the cities were identified using a word cloud. Then it was drilled down to find the top 5 cuisines in each city using bar charts.
In order to understand the sentiments of customers, four key attributes of a business were identified. The general customer reaction to food, ambience, staff and service at a restaurant was analysed. As an initial step, the words commonly used along with the key terms are visually represented in a network chart. To identify the difference between high and low rated restaurants, the reviews given to each of these four attributes in each rating category were studied separately. The words that contribute to positive and negative sentiments in each attribute are identified, and the top 5 positive and negative words are shown in a diverging bar chart. The percentage of different emotions expressed in reviews given to each rating category is shown in a radar chart. The difference in usage of words in reviews given to the top 6 cuisines in Las Vegas was also investigated and displayed.
American cuisine is the most popular in most locations. Sandwiches and fast food are the next best options. Asian cuisines like Indian and Chinese did not make it to the top 5 in any of the locations.
Locations like Las Vegas have an active night life, and the hours after midnight are the busiest working time for restaurants there. Scottsdale is busy during dinner time around 7 on most days. The patterns in working hours can be seen in the visualization provided in the exploratory data analysis.
Restaurants are densely located mostly in downtowns. When you move further away from the city, the number of restaurants for which we have information reduces.
A lot of words are used in common to describe food and service. Great and horrible are used with comparatively similar frequency to describe the service of restaurants in Las Vegas. Staff and ambience are not described using frequent terms in reviews as compared to food and service.
The distinguishing factor between 4 and 5 could be the taste/quality of food. Top rated restaurants have a classy ambience that distinguishes them from 4 rated ones. Low rated restaurants have uncomfortable, poor or horrible ambience. When we move down the ratings, the negative words increase across the attributes. The staff at low rated restaurants are described as annoying, stingy, confused and dumb. The service sucks and is horrible and awful. Food becomes disappointing and horrible.
Reviews for restaurants with a low rating of 1 or 2 predominantly express anger, sadness, fear and disgust. The medium rating of 3 shows a large amount of anticipation in its reviews. Though customers express an equal percentage of trust in reviews for restaurants with ratings of 4 and 5, reviews show more joy in customers at 5 rated restaurants.
Reviews for each cuisine have words specific to that particular category. For example, refried is a word commonly used only for Mexican cuisine. This can be used to identify the cuisines from reviews.
Collect more data on food served, menu, music etc. and explore more trends and patterns that will aid a new business person planning to open a restaurant in the US.
Create a machine learning model that will identify the cuisine from the review given.
Predict the rating that a customer might give to a restaurant by analysing the review they wrote.
Due to time constraints, all sentiment analysis in this case study is done using unigrams and bigrams. Expand the scope of the study to larger n-grams and sentences.