Exploratory analysis on Yelp challenge data.

We use Yelp every day for ratings of restaurants, activities and more. Yelp has now released part of its data for the Yelp Dataset Challenge. Let us do an exploratory analysis of the data to see what useful information we can get out of it.

As explained on its website, the data contains 4.1M reviews and 947K tips by 1M users for 144K businesses in:

* UK: Edinburgh
* Germany: Karlsruhe
* Canada: Montreal and Waterloo
* U.S.: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison, Cleveland

Yelp has also laid out some questions:

What cuisines do Yelpers rave about in these different countries?
Do Americans tend to eat out late compared to those in Germany or the U.K.?
Who are the most influential reviewers in the data?
How can we predict whether a business will remain open or close?

Exploring the dataset

Let us first import the data into R.

#load library
library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

library(ggplot2)
library(jsonlite)

## 
## Attaching package: 'jsonlite'

## The following object is masked from 'package:purrr':
## 
##     flatten

library(tibble)
library(stringr)
library(tidyr)
library(dplyr)
library(Amelia)    # a package to visualize missing data

## Loading required package: Rcpp

## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2017 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##

#import data
business=stream_in( file("yelp_academic_dataset_business.json") )

## opening file input connection.

## closing file input connection.

#checkin=stream_in(  file("yelp_academic_dataset_checkin.json") )

#tip=stream_in( file("yelp_academic_dataset_tip.json") )

#review=stream_in( file("yelp_academic_dataset_review.json")   )

#user=stream_in( file("yelp_academic_dataset_user.json") )

Next, we need to flatten the nested lists read from the JSON file into R data frames. Then I will convert the result to a tibble, a variant of the data frame that prints more readably.
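The flattening step itself is not shown in this post. As a minimal sketch of what it could look like, the toy data frame below stands in for the output of stream_in(); its nested `hours` column with `Monday` and `Tuesday` fields is made up for illustration, and the real file may be laid out differently.

```r
library(jsonlite)
library(tibble)

# Toy stand-in for the output of stream_in(): one ordinary column plus
# a nested data-frame column (hypothetical layout, for illustration only).
toy <- data.frame(business_id = c("a", "b"), stringsAsFactors = FALSE)
toy$hours <- data.frame(
  Monday  = c("9:00-17:00", NA),
  Tuesday = c("9:00-17:00", "10:00-18:00"),
  stringsAsFactors = FALSE
)

# jsonlite::flatten() is called with its namespace because purrr also
# exports a function named flatten() (see the masking message above).
toy_flat <- jsonlite::flatten(toy)

# Convert to a tibble for more compact printing.
toy_tbl <- as_tibble(toy_flat)
names(toy_tbl)
```

The nested fields become ordinary columns whose names are prefixed with the parent column name, which is why a selector such as starts_with("hours") can later drop them in one go.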

Let us first look at missing data for business

missmap(business,main="Yelp Data - Missings Map", col=c("black", "yellow"), legend=TRUE)

The visualization shows no significant missing data. There may still be errors in the data, but fixing them manually would take too much time, so we will continue with the data as it is.
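A numeric summary can complement the missingness map. The sketch below computes the share of missing values per column on a small made-up data frame; on the real data the same expression could be applied to the atomic columns of business.

```r
# Toy data frame with some missing values (made-up values).
toy <- data.frame(
  city    = c("Toronto", NA, "Cleveland", "Phoenix"),
  stars   = c(4.0, 3.5, NA, NA),
  is_open = c(1, 0, 1, 1)
)

# is.na() on a data frame gives a logical matrix; colMeans() turns it
# into the share of missing values per column. Sort largest first.
missing_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
missing_share
```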

Let us look at the city variable in business dataframe

business$city=business$city%>%as.factor

business$city%>%summary

##         Las Vegas           Toronto           Phoenix        Scottsdale 
##             22892             14540             14468              6917 
##         Charlotte        Pittsburgh          Montréal              Mesa 
##              6912              5275              4785              4714 
##         Henderson             Tempe         Edinburgh          Chandler 
##              3788              3703              3601              3325 
##         Cleveland           Madison           Gilbert          Glendale 
##              2785              2711              2574              2555 
##       Mississauga         Stuttgart            Peoria           Markham 
##              2094              1955              1367              1285 
##   North Las Vegas         Champaign        North York          Surprise 
##              1154              1018               883               853 
##       Scarborough     Richmond Hill          Goodyear           Concord 
##               781               719               646               632 
##          Brampton           Vaughan         Etobicoke          Matthews 
##               631               601               543               533 
##          Avondale          Oakville      Huntersville          Lakewood 
##               531               502               458               424 
##         Fort Mill            Urbana         Cornelius            Mentor 
##               410               336               327               315 
##        Cave Creek          Gastonia     North Olmsted          Westlake 
##               311               303               286               283 
##       Monroeville         Middleton         Thornhill      Strongsville 
##               281               271               266               263 
##             Laval         Pineville    Fountain Hills            Aurora 
##               259               248               247               237 
##    Cuyahoga Falls      Boulder City         Newmarket            Medina 
##               235               217               215               214 
##              Kent         Beachwood          Montreal           Wexford 
##               208               205               201               201 
##         Pickering       Ludwigsburg       Rocky River              Ajax 
##               199               189               187               186 
##             Parma      Indian Trail         Inverness       Sun Prairie 
##               179               174               171               168 
##          Sun City        Willoughby Cleveland Heights         Fitchburg 
##               167               167               166               165 
##              Stow             Solon   Litchfield Park        Woodbridge 
##               165               158               155               154 
##              Avon            Hudson        Coraopolis     Chagrin Falls 
##               145               145               137               133 
##       Bridgeville       Bethel Park          Brossard            Verona 
##               129               128               126               122 
##        Canonsburg            Verdun         East York         Brunswick 
##               121               121               117               116 
##      Sindelfingen            Elyria   Paradise Valley           Belmont 
##               115               114               113               109 
##            Laveen  Mayfield Heights         Esslingen            Waxhaw 
##               109               109               108               108 
##         Homestead            Anthem          Fairlawn           (Other) 
##               105               104               104              9090

What are the most common categories in the dataset?

business%>%select(-starts_with("hours"), -starts_with("attribute")) %>% unnest(categories) %>%
 select(name, categories)%>%group_by(categories)%>%summarise(n=n())%>%arrange(desc(n))%>%head(20)

## # A tibble: 20 × 2
##                   categories     n
##                        <chr> <int>
## 1                Restaurants 48485
## 2                   Shopping 22466
## 3                       Food 21189
## 4              Beauty & Spas 13711
## 5              Home Services 11241
## 6                  Nightlife 10524
## 7           Health & Medical 10476
## 8                       Bars  9087
## 9                 Automotive  8554
## 10            Local Services  8133
## 11 Event Planning & Services  7224
## 12               Active Life  6722
## 13                   Fashion  5824
## 14    American (Traditional)  5312
## 15                 Fast Food  5250
## 16                     Pizza  5229
## 17                Sandwiches  5220
## 18              Coffee & Tea  5099
## 19               Hair Salons  4858
## 20           Hotels & Travel  4857

At first glance, most businesses in the data are related to eating, drinking and shopping.
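The group_by/summarise/arrange chain used for this table can be written more compactly with dplyr's count(). A small sketch on made-up data, where `toy` and its values are purely illustrative:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy businesses with a list column of categories (made-up values).
toy <- tibble(
  name = c("A", "B", "C"),
  categories = list(c("Restaurants", "Pizza"), c("Restaurants", "Bars"), "Shopping")
)

# One row per business-category pair, then count pairs per category,
# most frequent first.
category_counts <- toy %>%
  unnest(categories) %>%
  count(categories, sort = TRUE)
category_counts
```

Note that a business listed under several categories contributes one row to each of them, so the counts add up to more than the number of businesses.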

What are the most common categories among businesses that are still open?

business%>%  filter(is_open==1) %>%select(-starts_with("hours"), -starts_with("attribute"))%>%
 unnest(categories) %>%
 select(name, categories)%>%group_by(categories)%>%summarise(n=n())%>%arrange(desc(n))%>%head(20)

## # A tibble: 20 × 2
##                   categories     n
##                        <chr> <int>
## 1                Restaurants 36889
## 2                   Shopping 19587
## 3                       Food 17348
## 4              Beauty & Spas 12182
## 5              Home Services 10835
## 6           Health & Medical  9870
## 7                  Nightlife  8155
## 8                 Automotive  7984
## 9             Local Services  7712
## 10                      Bars  7062
## 11 Event Planning & Services  6521
## 12               Active Life  5989
## 13                   Fashion  4960
## 14                 Fast Food  4778
## 15           Hotels & Travel  4496
## 16                Sandwiches  4343
## 17               Hair Salons  4294
## 18                     Pizza  4231
## 19    American (Traditional)  4167
## 20              Coffee & Tea  4074

We can see that the top 20 categories among businesses that are still open are the same 20 categories as in the whole dataset, although their order differs slightly.

Which categories have the highest proportion of dying businesses?

So if you want to open a business in one of the cities in the dataset, which kinds of business are more likely to fail?

We only include categories with more than 100 businesses, to reduce the noise from small samples.

business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%
 unnest(categories) %>%group_by(categories)%>%summarise(n=n(), dead_prop=sum(is_open==0)/n,live_prop=sum(is_open==1)/n)%>%filter(n>100)%>%arrange(desc(dead_prop),desc(n))%>%head(20)

## # A tibble: 20 × 4
##                    categories     n dead_prop live_prop
##                         <chr> <int>     <dbl>     <dbl>
## 1                    Gay Bars   158 0.4746835 0.5253165
## 2                     Spanish   177 0.3728814 0.6271186
## 3                   Soul Food   212 0.3679245 0.6320755
## 4                  Tapas Bars   286 0.3531469 0.6468531
## 5                    Hawaiian   207 0.3478261 0.6521739
## 6                     Buffets   710 0.3422535 0.6577465
## 7                 Dance Clubs   743 0.3257066 0.6742934
## 8                Cajun/Creole   206 0.3155340 0.6844660
## 9                     Dim Sum   220 0.3090909 0.6909091
## 10                Hookah Bars   211 0.3080569 0.6919431
## 11                     Korean   639 0.3004695 0.6995305
## 12                    Lounges  1261 0.2950040 0.7049960
## 13 Videos & Video Game Rental   234 0.2948718 0.7051282
## 14                     French   903 0.2934662 0.7065338
## 15                   Hot Dogs   551 0.2921960 0.7078040
## 16             American (New)  3621 0.2913560 0.7086440
## 17                   Japanese  2054 0.2862707 0.7137293
## 18                   Filipino   161 0.2795031 0.7204969
## 19            Modern European   197 0.2791878 0.7208122
## 20                   Barbeque  1279 0.2752150 0.7247850

So, in general, you might want to avoid these categories if you plan to open a business in those cities. But the analysis here is rough; a more detailed analysis would need to look at each kind of business in each city and each neighborhood.
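One way to see how rough these proportions are is to put a confidence interval around them. The sketch below reconstructs the closed counts for the first two rows of the table (n times dead_prop) and uses base R's binom.test() to get an exact 95% interval. It treats each business as an independent trial, which is an assumption rather than something the data guarantees.

```r
# Closed counts reconstructed from the table above: n * dead_prop.
n      <- c(`Gay Bars` = 158, Spanish = 177)
closed <- round(n * c(0.4746835, 0.3728814))

# Exact (Clopper-Pearson) 95% confidence interval for each closed share.
ci <- t(sapply(names(n), function(k) binom.test(closed[[k]], n[[k]])$conf.int))
colnames(ci) <- c("lower", "upper")
ci
```

With only a couple of hundred businesses per category the intervals are wide, so small differences between neighbouring rows of the ranking should not be over-read.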

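Which states have the highest proportion of dying businesses?

Again we only include states with n greater than 100. Note that, because the categories are unnested before grouping, n counts business-category pairs rather than distinct businesses, so a business with several categories is counted several times.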

business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%
 unnest(categories) %>%group_by(state)%>%summarise(n=n(), dead_prop=sum(is_open==0)/n,live_prop=sum(is_open==1)/n)%>%filter(n>100)%>%arrange(desc(dead_prop),desc(n))

## # A tibble: 15 × 4
##    state      n  dead_prop live_prop
##    <chr>  <int>      <dbl>     <dbl>
## 1    EDH  12272 0.17869948 0.8213005
## 2     ON  84728 0.16054905 0.8394509
## 3    MLN    627 0.15151515 0.8484848
## 4     IL   5901 0.14607694 0.8539231
## 5     NV 105568 0.14117915 0.8588209
## 6     SC   1786 0.13829787 0.8617021
## 7     WI  15303 0.13696661 0.8630334
## 8     AZ 162750 0.13399078 0.8660092
## 9     QC  22433 0.13177016 0.8682298
## 10    NC  37809 0.12422968 0.8757703
## 11    PA  29260 0.11760082 0.8823992
## 12    OH  37437 0.11413842 0.8858616
## 13    BW  10672 0.09895052 0.9010495
## 14   HLD    481 0.03326403 0.9667360
## 15   FIF    208 0.02884615 0.9711538

States BW, HLD and FIF have a surprisingly low dying proportion, while EDH, ON and MLN have a relatively high one. But we should not conclude that opening a business in EDH, ON or MLN is a bad idea; we would need more information to analyze whether a business would survive.
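Because unnest(categories) runs before group_by(state), the n column above counts business-category pairs, and businesses with many categories weigh more in dead_prop. The toy example below (made-up data) contrasts that with a per-business summary that skips the unnest step.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Three made-up businesses; the first has three categories.
toy <- tibble(
  business_id = c("a", "b", "c"),
  state       = c("OH", "OH", "ON"),
  is_open     = c(1, 0, 1),
  categories  = list(c("Restaurants", "Pizza", "Italian"), "Shopping", c("Bars", "Nightlife"))
)

# One row per business: n is the number of businesses.
per_business <- toy %>%
  group_by(state) %>%
  summarise(n = n(), dead_prop = mean(is_open == 0))

# One row per business-category pair: a business with k categories
# is counted k times.
per_pair <- toy %>%
  unnest(categories) %>%
  group_by(state) %>%
  summarise(n = n(), dead_prop = mean(is_open == 0))

per_business
per_pair
```

For state-level or city-level questions the per-business version is usually the one that matches the wording "proportion of businesses".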

Which cities have the highest proportion of dying businesses?

business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%
 unnest(categories) %>%group_by(city)%>%summarise(n=n(), dead_prop=sum(is_open==0)/n,live_prop=sum(is_open==1)/n)%>%filter(n>100)%>%arrange(desc(dead_prop),desc(n))

## # A tibble: 198 × 4
##                   city     n dead_prop live_prop
##                 <fctr> <int>     <dbl>     <dbl>
## 1           Harrisburg   322 0.2236025 0.7763975
## 2           Unionville   130 0.2230769 0.7769231
## 3      Paradise Valley   406 0.2216749 0.7783251
## 4  Dollard-des-Ormeaux   127 0.2204724 0.7795276
## 5            Wickliffe   116 0.1982759 0.8017241
## 6        Richmond Hill  2471 0.1926346 0.8073654
## 7              Toronto 50971 0.1869887 0.8130113
## 8       Shaker Heights   197 0.1827411 0.8172589
## 9            Edinburgh 12467 0.1807171 0.8192829
## 10           las vegas   103 0.1747573 0.8252427
## # ... with 188 more rows
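The table lists `las vegas` as a separate city from `Las Vegas`, so the city column contains spelling variants. A minimal base R sketch for merging variants that differ only in case or surrounding whitespace is shown below on made-up values; variants that differ in accents (for example Montreal and Montréal in the city summary earlier) would need an explicit mapping, which is not attempted here.

```r
# Made-up city values with case and whitespace variants.
city <- c("Las Vegas", "las vegas", " Las Vegas", "Toronto", "Toronto", "toronto ")

# Trim whitespace and lower-case to build a matching key.
key <- tolower(trimws(city))

# Within each key, use the most frequent trimmed spelling as the label.
canonical <- tapply(trimws(city), key, function(x) names(which.max(table(x))))

# Map every row back to its canonical spelling.
city_clean <- as.vector(canonical[key])
table(city_clean)
```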

Which categories have the highest proportion of dying businesses in each city?

Next, we want to know which business categories to avoid in different cities.

Let us first look at Toronto.

business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%
 unnest(categories) %>%group_by(city,categories)%>%summarise(n=n(), dead_prop=sum(is_open==0)/n,live_prop=sum(is_open==1)/n)%>%filter(n>100)%>%arrange(desc(dead_prop),desc(n))%>%filter(city=="Toronto")

## Source: local data frame [106 x 5]
## Groups: city [1]
## 
##       city                categories     n dead_prop live_prop
##     <fctr>                     <chr> <int>     <dbl>     <dbl>
## 1  Toronto                    French   153 0.4640523 0.5359477
## 2  Toronto                   Lounges   201 0.3880597 0.6119403
## 3  Toronto               Dance Clubs   109 0.3761468 0.6238532
## 4  Toronto            American (New)   166 0.3734940 0.6265060
## 5  Toronto Ice Cream & Frozen Yogurt   207 0.3381643 0.6618357
## 6  Toronto                   Italian   505 0.3168317 0.6831683
## 7  Toronto                    Diners   152 0.3092105 0.6907895
## 8  Toronto            Canadian (New)   571 0.3047285 0.6952715
## 9  Toronto                      Thai   254 0.3031496 0.6968504
## 10 Toronto                     Books   179 0.3016760 0.6983240
## # ... with 96 more rows

Judging by these proportions, the riskiest categories in Toronto are French food, Lounges, Dance Clubs, New American food and Ice Cream & Frozen Yogurt.

business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%
 unnest(categories) %>%group_by(city)%>%summarise(n=n(), dead_prop=sum(is_open==0)/n,live_prop=sum(is_open==1)/n)%>%filter(n>100)%>%arrange(desc(dead_prop),desc(n))%>%filter(city=="Toronto")

## # A tibble: 1 × 4
##      city     n dead_prop live_prop
##    <fctr> <int>     <dbl>     <dbl>
## 1 Toronto 50971 0.1869887 0.8130113

The overall dying proportion in Toronto is about 19%, so the categories to be wary of are those with a dying rate well above 19%.

You could do this for any other city in the data. I am especially interested in Cleveland, since this is where I live.

business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%
 unnest(categories) %>%group_by(city,categories)%>%summarise(n=n(), dead_prop=sum(is_open==0)/n,live_prop=sum(is_open==1)/n)%>%filter(n>100)%>%arrange(desc(dead_prop),desc(n))%>%filter(city=="Cleveland")%>%head(20)

## Source: local data frame [20 x 5]
## Groups: city [1]
## 
##         city                categories     n  dead_prop live_prop
##       <fctr>                     <chr> <int>      <dbl>     <dbl>
## 1  Cleveland            American (New)   172 0.25581395 0.7441860
## 2  Cleveland               Restaurants  1235 0.21700405 0.7829960
## 3  Cleveland                      Bars   341 0.19941349 0.8005865
## 4  Cleveland                 Nightlife   385 0.19220779 0.8077922
## 5  Cleveland    American (Traditional)   197 0.18274112 0.8172589
## 6  Cleveland              Coffee & Tea   104 0.16346154 0.8365385
## 7  Cleveland                   Burgers   106 0.14150943 0.8584906
## 8  Cleveland                      Food   525 0.13142857 0.8685714
## 9  Cleveland                  Shopping   333 0.12912913 0.8708709
## 10 Cleveland                     Pizza   155 0.12903226 0.8709677
## 11 Cleveland                Sandwiches   156 0.12820513 0.8717949
## 12 Cleveland      Arts & Entertainment   154 0.11688312 0.8831169
## 13 Cleveland             Beauty & Spas   150 0.10666667 0.8933333
## 14 Cleveland               Active Life   111 0.09909910 0.9009009
## 15 Cleveland Event Planning & Services   165 0.08484848 0.9151515
## 16 Cleveland            Specialty Food   112 0.08035714 0.9196429
## 17 Cleveland                 Fast Food   114 0.06140351 0.9385965
## 18 Cleveland           Hotels & Travel   128 0.05468750 0.9453125
## 19 Cleveland            Local Services   114 0.04385965 0.9561404
## 20 Cleveland                Automotive   163 0.04294479 0.9570552

business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%
 unnest(categories) %>%group_by(city)%>%summarise(n=n(), dead_prop=sum(is_open==0)/n,live_prop=sum(is_open==1)/n)%>%filter(n>100)%>%arrange(desc(dead_prop),desc(n))%>%filter(city=="Cleveland")

## # A tibble: 1 × 4
##        city     n dead_prop live_prop
##      <fctr> <int>     <dbl>     <dbl>
## 1 Cleveland 10204 0.1348491 0.8651509

It seems that Cleveland is less friendly to restaurants, bars and nightlife than to other kinds of business. New American food stands out: its dying rate (25.6%) is nearly twice the city average (13.5%).

Let us compare Toronto, Las Vegas and Cleveland on New American food, Bars, Nightlife and Restaurants.

business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%
 unnest(categories)%>%
  filter(city=="Cleveland"|city=="Toronto"|city=="Las Vegas")%>%filter(categories=="Restaurants" | categories=="American (New)"| categories=="Bars"| categories=="Nightlife") %>%
  group_by(city,categories)%>%summarise(n=n(), dead_prop=sum(is_open==0)/n,live_prop=sum(is_open==1)/n)%>%filter(n>100)%>%arrange(categories,desc(dead_prop),desc(n))

## Source: local data frame [12 x 5]
## Groups: city [3]
## 
##         city     categories     n dead_prop live_prop
##       <fctr>          <chr> <int>     <dbl>     <dbl>
## 1    Toronto American (New)   166 0.3734940 0.6265060
## 2  Las Vegas American (New)   527 0.2808349 0.7191651
## 3  Cleveland American (New)   172 0.2558140 0.7441860
## 4    Toronto           Bars  1170 0.2333333 0.7666667
## 5  Las Vegas           Bars  1211 0.2320396 0.7679604
## 6  Cleveland           Bars   341 0.1994135 0.8005865
## 7    Toronto      Nightlife  1331 0.2389181 0.7610819
## 8  Las Vegas      Nightlife  1658 0.2346200 0.7653800
## 9  Cleveland      Nightlife   385 0.1922078 0.8077922
## 10 Las Vegas    Restaurants  5431 0.2912907 0.7087093
## 11   Toronto    Restaurants  6347 0.2733575 0.7266425
## 12 Cleveland    Restaurants  1235 0.2170040 0.7829960

Apart from American (New), where Toronto (166) and Cleveland (172) are about level, these categories have many more businesses in Toronto and Las Vegas than in Cleveland. Cleveland, however, has the lowest dying rate in all four categories.
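The same unnest/group/summarise pipeline is repeated throughout this post with different grouping columns. It could be wrapped in a small helper; closure_by() below is a hypothetical function written for illustration, not part of the original analysis, and it is demonstrated on made-up data.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical helper: closed and open shares per group, after
# unnesting the categories list column.
closure_by <- function(data, ..., min_n = 100) {
  data %>%
    unnest(categories) %>%
    group_by(...) %>%
    summarise(
      n = n(),
      dead_prop = mean(is_open == 0),
      live_prop = mean(is_open == 1)
    ) %>%
    ungroup() %>%
    filter(n > min_n) %>%
    arrange(desc(dead_prop), desc(n))
}

# Made-up data; min_n = 0 keeps every group in this tiny example.
toy <- tibble(
  city       = c("Cleveland", "Cleveland", "Toronto"),
  is_open    = c(1, 0, 0),
  categories = list(c("Bars", "Nightlife"), "Bars", c("Bars", "Restaurants"))
)
res <- closure_by(toy, city, categories, min_n = 0)
res
```

On the real data, a call such as closure_by(business, city, categories) followed by filter(city %in% c("Cleveland", "Toronto", "Las Vegas")) would play the role of the chained `|` comparisons used above.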

What do the histograms of star ratings look like for businesses in those categories in Toronto, Las Vegas and Cleveland?

Let us compare the star ratings of Nightlife and Bars businesses in Toronto, Las Vegas and Cleveland.

I only keep businesses with more than 30 reviews so that the star ratings are meaningful.
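The reason a review threshold helps can be sketched with a little arithmetic. If a business's star rating is an average of individual reviews, its standard error shrinks with the square root of the number of reviews. The per-review standard deviation of 1.2 stars used below is an assumed figure for illustration, not something estimated from the Yelp data.

```r
# Assumed spread of individual review scores, in stars (illustrative only).
review_sd <- 1.2

# Standard error of an average rating for different review counts.
n_reviews <- c(3, 10, 30, 100)
std_error <- review_sd / sqrt(n_reviews)
round(data.frame(n_reviews, std_error), 2)
```

Under that assumption, the uncertainty at 30 reviews is already smaller than the half-star steps that separate the bars of the histograms.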

business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%filter(review_count>30)%>%
 unnest(categories)%>%
  filter(city=="Cleveland"|city=="Toronto"|city=="Las Vegas")%>%filter( categories=="Nightlife"|categories=="Bars")%>%ggplot(aes(x=stars,fill=city))+geom_histogram(position="dodge",binwidth = 0.25)+ggtitle("Number of Nightlife+Bar businesses with different star ratings in several cities.")

The ratings of Nightlife in Toronto center around 3.5, while the ratings of Nightlife in Las Vegas and Cleveland center around 4. This may partly explain why Nightlife has a higher dying rate in Toronto (23.9%) than in Cleveland (19.2%), although Las Vegas (23.5%) is close to Toronto despite its higher ratings.

How about restaurants? Restaurants+American (New)

business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%filter(review_count>30)%>%
 unnest(categories)%>%
  filter(city=="Cleveland"|city=="Toronto"|city=="Las Vegas")%>%filter( categories=="Restaurants"|categories=="American (New)")%>%ggplot(aes(x=stars,fill=city))+geom_histogram(position="dodge",binwidth = 0.25)+ggtitle("Number of restaurants businesses with different star ratings in several cities.")

The same trend exists in Restaurants. It seems that star ratings are related to dying rate of businesses.

Comparing French food and Chinese food.

Since I am really interested in both Chinese food and French Food, I will try to answer the following questions:

Which city has the most Chinese/French food choice?
Which city has higher ratings for Chinese/French food (that is, which city should I go to for quality Chinese/French food)?
Which city should I go to if I want both high quality Chinese and French food?

Let us only consider cities with more than 50 open Chinese restaurants that have more than 10 reviews each.

business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%filter(review_count>10)%>%filter(is_open==1)%>%
 unnest(categories)%>%
  filter( categories=="Chinese")%>%group_by(city)%>%summarise(n=n(),avg_star=mean(stars,na.rm=TRUE))%>%filter(n>50)%>%arrange(desc(n),desc(avg_star))

## # A tibble: 9 × 3
##            city     n avg_star
##          <fctr> <int>    <dbl>
## 1     Las Vegas   246 3.333333
## 2       Toronto   234 3.339744
## 3       Phoenix   134 3.376866
## 4       Markham   105 3.271429
## 5     Charlotte    89 3.297753
## 6    Pittsburgh    71 3.415493
## 7      Montréal    66 3.545455
## 8   Mississauga    64 3.375000
## 9 Richmond Hill    61 3.270492

It seems Chinese restaurants have relatively low average star ratings, around 3.3 to 3.4 in most cities. Las Vegas has the most Chinese restaurants, with Toronto the runner-up.

Notice what happens if we drop the condition that a restaurant needs more than 10 reviews (and keep only cities with more than 100 Chinese restaurants): the number of Chinese restaurants rises sharply in some cities, for example from 234 to 369 in Toronto and from 66 to 130 in Montréal.

business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%filter(is_open==1)%>%
 unnest(categories)%>%
  filter( categories=="Chinese")%>%group_by(city)%>%summarise(n=n(),avg_star=mean(stars,na.rm=TRUE))%>%filter(n>100)%>%arrange(desc(n),desc(avg_star))

## # A tibble: 6 × 3
##        city     n avg_star
##      <fctr> <int>    <dbl>
## 1   Toronto   369 3.254743
## 2 Las Vegas   271 3.297048
## 3   Phoenix   152 3.266447
## 4   Markham   147 3.265306
## 5  Montréal   130 3.519231
## 6 Charlotte   122 3.159836

It seems that many Chinese restaurants have few reviews, and including them lowers the average star rating slightly in every city shown.

How about French Food?

Let us first explore open French restaurants with more than 30 reviews.

business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%filter(review_count>30)%>%filter(is_open==1)%>%
 unnest(categories)%>%
  filter( categories=="French")%>%group_by(city)%>%summarise(n=n(),avg_star=mean(stars,na.rm=TRUE))%>%filter(n>10)%>%arrange(desc(n),desc(avg_star))

## # A tibble: 6 × 3
##         city     n avg_star
##       <fctr> <int>    <dbl>
## 1   Montréal    92 3.929348
## 2    Toronto    50 3.850000
## 3  Las Vegas    38 4.157895
## 4 Pittsburgh    11 4.181818
## 5  Edinburgh    11 4.136364
## 6  Charlotte    11 4.045455

French restaurants are rarer than Chinese restaurants, but they have much higher star ratings.

Let us relax the above restriction to more than 10 reviews.

business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%filter(is_open==1)%>%filter(review_count>10)%>%
 unnest(categories)%>%
  filter( categories=="French")%>%group_by(city)%>%summarise(n=n(),avg_star=mean(stars,na.rm=TRUE))%>%filter(n>10)%>%arrange(desc(n),desc(avg_star))

## # A tibble: 6 × 3
##         city     n avg_star
##       <fctr> <int>    <dbl>
## 1   Montréal   164 3.871951
## 2    Toronto    67 3.850746
## 3  Las Vegas    42 4.154762
## 4  Edinburgh    26 4.269231
## 5  Charlotte    13 4.038462
## 6 Pittsburgh    11 4.181818

It seems that Montreal has a surprisingly large number of French restaurants with a relatively high star rating.

Let us visualize star ratings of Chinese restaurants for Las Vegas, Toronto and Phoenix (the code also includes Cleveland).

business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%filter(review_count>30)%>%
 unnest(categories)%>%
  filter(city=="Cleveland"|city=="Toronto"|city=="Las Vegas"|city=="Phoenix")%>%filter( categories=="Chinese")%>%ggplot(aes(x=stars,fill=city))+geom_histogram(position="dodge",binwidth = 0.25)+ggtitle("Number of Chinese restaurants businesses with different star ratings in several cities.")

These cities have a similar distribution of star ratings for Chinese restaurants, but Las Vegas has many more 5-star Chinese restaurants.

Let us visualize star ratings of French restaurants for Las Vegas, Toronto and Montréal.

business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%filter(review_count>30)%>%
 unnest(categories)%>%
  filter(city=="Toronto"|city=="Las Vegas"|city=="Montréal")%>%filter( categories=="French")%>%ggplot(aes(x=stars,fill=city))+geom_histogram(position="dodge",binwidth = 0.25)+ggtitle("Number of French restaurants businesses with different star ratings in several cities.")

Toronto has more French restaurants with ratings around 3.5, while Montréal has more French restaurants with ratings around 4. There are fewer French restaurants in Las Vegas than in Montréal, but Las Vegas has several 5-star French restaurants.

I would be especially interested in how New York City, Los Angeles, San Francisco, Seattle and Chicago compare to Las Vegas and Toronto, but they are not among the cities listed for this dataset.

If you love Chinese or French cuisine, Las Vegas offers the most high quality restaurants in both cuisines. But if Toronto is closer to you, it would also be a nice choice.

How about night life?

Let us find which cities keep the largest proportion of their nightlife businesses open, considering only businesses with more than 30 reviews and cities with more than 100 such businesses.

business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%unnest(categories) %>% filter(review_count>30)%>%
  filter(str_detect(categories,"Nightlife"))%>%group_by(city)%>%summarise(n=n(), dead_prop=sum(is_open==0)/n,live_prop=sum(is_open==1)/n)%>%filter(n>100)%>%arrange(desc(live_prop),dead_prop, n)

## # A tibble: 11 × 4
##          city     n  dead_prop live_prop
##        <fctr> <int>      <dbl>     <dbl>
## 1   Edinburgh   119 0.03361345 0.9663866
## 2    Montréal   140 0.06428571 0.9357143
## 3     Madison   122 0.07377049 0.9262295
## 4  Pittsburgh   225 0.08444444 0.9155556
## 5   Cleveland   157 0.10191083 0.8980892
## 6   Charlotte   271 0.12915129 0.8708487
## 7     Toronto   511 0.12915851 0.8708415
## 8     Phoenix   357 0.14845938 0.8515406
## 9   Las Vegas   832 0.19350962 0.8064904
## 10      Tempe   123 0.23577236 0.7642276
## 11 Scottsdale   286 0.32167832 0.6783217

It seems that people in Edinburgh, Montréal, Madison, Pittsburgh and Cleveland love drinking a lot: about 90% or more of their well-reviewed nightlife businesses are still open.

Let us look at the star ratings of Nightlife.

business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%filter(review_count>30)%>%filter(is_open==1)%>%
 unnest(categories)%>%
  filter( categories=="Nightlife")%>%group_by(city)%>%summarise(n=n(),avg_star=mean(stars,na.rm=TRUE))%>%filter(n>10)%>%arrange(desc(n),desc(avg_star))

## # A tibble: 29 × 3
##          city     n avg_star
##        <fctr> <int>    <dbl>
## 1   Las Vegas   671 3.726528
## 2     Toronto   445 3.435955
## 3     Phoenix   304 3.657895
## 4   Charlotte   236 3.519068
## 5  Pittsburgh   206 3.594660
## 6  Scottsdale   194 3.739691
## 7   Cleveland   141 3.680851
## 8    Montréal   131 3.832061
## 9   Edinburgh   115 3.773913
## 10    Madison   113 3.654867
## # ... with 19 more rows

Let us visualize the Nightlife star rating distribution in Las Vegas, Toronto and Phoenix.

business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%filter(review_count>30)%>%filter(is_open==1)%>%
 unnest(categories)%>%
  filter(city=="Toronto"|city=="Las Vegas"| city=="Phoenix")%>%filter( categories=="Nightlife")%>%ggplot(aes(x=stars,fill=city))+geom_histogram(position="dodge",binwidth = 0.25)+ggtitle("Number of Nightlife businesses with different star ratings in several cities.")

It looks like Las Vegas is the best place for night life: it has the most nightlife businesses of the three cities and a higher average rating than Toronto or Phoenix.

Does the star rating distribution affect whether a business is open or not?

Before we proceed to build a model to predict whether a business will survive, let us visualize the star rating histograms for businesses that are closed and for businesses that are still open.

business%>%mutate(business_status=ifelse(is_open==0,"Closed","Open")%>%as.factor) %>%select(-starts_with("hours"), -starts_with("attribute"),-starts_with("categories"))%>%filter(review_count>10)%>%
  ggplot(aes(x=stars,fill=business_status ))+geom_histogram(position="dodge",binwidth = 0.25)+ggtitle("Star ratings for closed and open businesses")

It seems that a small difference (0.5) in star ratings means a lot for business survival.
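Because there are far more open than closed businesses, side-by-side counts can be hard to compare. An alternative view is the share of closed businesses at each star rating. The sketch below uses made-up data and base R's tapply(); the values are illustrative only.

```r
# Made-up businesses: star rating and open flag.
toy <- data.frame(
  stars   = c(2.5, 2.5, 3.0, 3.0, 3.5, 3.5, 4.0, 4.0),
  is_open = c(0,   1,   0,   1,   1,   1,   1,   0)
)

# Share of closed businesses within each star rating.
closed_share <- tapply(toy$is_open == 0, toy$stars, mean)
closed_share
```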

Let us look at the difference in review_count. The number of businesses differs dramatically across the range of review_count, so it is best to analyze different ranges of review_count separately.

business%>%mutate(business_status=ifelse(is_open==0,"Closed","Open")%>%as.factor)%>%select(business_id,business_status,review_count)%>%filter(review_count<200)%>%
  ggplot(aes(x=review_count,fill=business_status))+geom_histogram(position="dodge",binwidth =10)+ggtitle("Review count for closed and open businesses (fewer than 200 reviews)")

business%>%mutate(business_status=ifelse(is_open==0,"Closed","Open")%>%as.factor)%>%select(business_id,business_status,review_count)%>%filter(review_count>=200 & review_count<2000)%>%
  ggplot(aes(x=review_count,fill=business_status))+geom_histogram(position="dodge",binwidth =100)+ggtitle("Review count for closed and open businesses (200 to 1999 reviews)")

business%>%mutate(business_status=ifelse(is_open==0,"Closed","Open")%>%as.factor)%>%select(business_id,business_status,review_count)%>%filter(review_count>=2000)%>%
  ggplot(aes(x=review_count,fill=business_status))+geom_histogram(position="dodge",binwidth =100)+ggtitle("Review count for closed and open businesses (2000 or more reviews)")

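The plots suggest that review_count is a stronger predictor of whether a business survives than star rating, although a closed business also stops collecting reviews, so part of this effect may simply reflect how long a business has been open.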
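Instead of three separate histograms, the review counts can be cut into ranges and the closed share computed per range. The sketch below uses made-up data, and the break points are an arbitrary choice for illustration.

```r
# Made-up businesses: review count and open flag.
toy <- data.frame(
  review_count = c(5, 12, 40, 150, 250, 800, 2500, 3),
  is_open      = c(0, 0,  1,  1,   1,   1,   1,    1)
)

# Left-closed bins: [0,10), [10,30), [30,200), [200,2000), [2000,Inf).
bins <- cut(toy$review_count, breaks = c(0, 10, 30, 200, 2000, Inf),
            right = FALSE, dig.lab = 4)

# Number of businesses and share of closed businesses per bin.
n_per_bin    <- table(bins)
closed_share <- tapply(toy$is_open == 0, bins, mean)
data.frame(n = as.vector(n_per_bin), closed_share = as.vector(closed_share),
           row.names = names(closed_share))
```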

Predict if a business could survive or not.

Use the information in these datasets to predict if a business is open or not.
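As a possible starting point, a logistic regression on star rating and review count could look like the sketch below. The data here is simulated, and the coefficients used to generate it are arbitrary, so the output says nothing about the real Yelp data; only the column names `stars`, `review_count` and `is_open` are taken from the business table.

```r
set.seed(1)

# Simulated stand-in for the business table (not real Yelp data).
n <- 500
toy <- data.frame(
  stars        = sample(seq(1, 5, by = 0.5), n, replace = TRUE),
  review_count = rpois(n, lambda = 40)
)

# Simulated outcome: arbitrary coefficients, chosen only for illustration.
p_open <- plogis(-1 + 0.4 * toy$stars + 0.3 * log1p(toy$review_count))
toy$is_open <- rbinom(n, size = 1, prob = p_open)

# Logistic regression of open/closed on stars and log review count.
fit <- glm(is_open ~ stars + log1p(review_count), data = toy, family = binomial)
summary(fit)$coefficients

# Predicted probability of being open for each business.
pred <- predict(fit, type = "response")
```

On the real data, the model should be evaluated on businesses held out from fitting, and since most businesses are open, plain accuracy would be a weak measure of how well it works.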

Yelp data challenge round 9 data exploration

Edward Lu

March 1st, 2017