We use Yelp daily for ratings of restaurants, activities and more. Yelp has now released part of its data for the Yelp Dataset Challenge. Let us do an exploratory analysis of the data to see whether we can get any useful information from it.
As explained on its website, the data contains 4.1M reviews and 947K tips by 1M users for 144K businesses in:
* UK: Edinburgh
* Germany: Karlsruhe
* Canada: Montreal and Waterloo
* U.S.: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison, Cleveland
Yelp also lays out some questions:
* What cuisines do Yelpers rave about in these different countries?
* Do Americans tend to eat out late compared to those in Germany or the U.K.?
* Who are the most influential reviewers in the data?
* How can we predict whether a business will remain open or close?
Let us first import the data into R.
#load library
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(ggplot2)
library(jsonlite)
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
library(tibble)
library(stringr)
library(tidyr)
library(dplyr)
library(Amelia) # a package to visualize missing data
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2017 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
#import data
business=stream_in( file("yelp_academic_dataset_business.json") )
## opening file input connection.
## closing file input connection.
#checkin=stream_in( file("yelp_academic_dataset_checkin.json") )
#tip=stream_in( file("yelp_academic_dataset_tip.json") )
#review=stream_in( file("yelp_academic_dataset_review.json") )
#user=stream_in( file("yelp_academic_dataset_user.json") )
Next, we need to flatten the nested lists read from the JSON file into R data frames. Then, I will convert the result into a tibble, a variant of the data frame that prints more readably.
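The flattening step is not shown here, so the following is only a minimal sketch of how it could look. It assumes `jsonlite::flatten()` and `tibble::as_tibble()` are used, and it works on a small made-up JSON snippet (`nested` and its fields are illustrative stand-ins, not the real Yelp file).

```r
library(jsonlite)
library(tibble)

# Toy stand-in for a streamed Yelp file: `hours` is a nested object,
# so fromJSON() returns a data frame that contains a data frame column.
nested <- fromJSON('[
  {"business_id": "a1", "stars": 4.0, "hours": {"Monday": "9-17", "Tuesday": "9-17"}},
  {"business_id": "b2", "stars": 3.5, "hours": {"Monday": "10-22", "Tuesday": "10-22"}}
]')

# jsonlite::flatten() lifts the nested columns to the top level
# (hours.Monday, hours.Tuesday). The namespace prefix avoids any
# confusion with purrr::flatten().
flat <- jsonlite::flatten(nested)

# Convert to a tibble for more compact printing.
business_tbl <- as_tibble(flat)
names(business_tbl)
```

Applied to the real data, the same idea would read `business <- as_tibble(jsonlite::flatten(business))`, assuming the streamed object contains nested data frame columns.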
missmap(business,main="Yelp Data - Missings Map", col=c("black", "yellow"), legend=TRUE)
The visualization shows no significant amount of missing data. There may still be some errors in the data, but fixing them manually would take too much time, so we will continue with the data as is.
business$city=business$city%>%as.factor
business$city%>%summary
## Las Vegas Toronto Phoenix Scottsdale
## 22892 14540 14468 6917
## Charlotte Pittsburgh Montréal Mesa
## 6912 5275 4785 4714
## Henderson Tempe Edinburgh Chandler
## 3788 3703 3601 3325
## Cleveland Madison Gilbert Glendale
## 2785 2711 2574 2555
## Mississauga Stuttgart Peoria Markham
## 2094 1955 1367 1285
## North Las Vegas Champaign North York Surprise
## 1154 1018 883 853
## Scarborough Richmond Hill Goodyear Concord
## 781 719 646 632
## Brampton Vaughan Etobicoke Matthews
## 631 601 543 533
## Avondale Oakville Huntersville Lakewood
## 531 502 458 424
## Fort Mill Urbana Cornelius Mentor
## 410 336 327 315
## Cave Creek Gastonia North Olmsted Westlake
## 311 303 286 283
## Monroeville Middleton Thornhill Strongsville
## 281 271 266 263
## Laval Pineville Fountain Hills Aurora
## 259 248 247 237
## Cuyahoga Falls Boulder City Newmarket Medina
## 235 217 215 214
## Kent Beachwood Montreal Wexford
## 208 205 201 201
## Pickering Ludwigsburg Rocky River Ajax
## 199 189 187 186
## Parma Indian Trail Inverness Sun Prairie
## 179 174 171 168
## Sun City Willoughby Cleveland Heights Fitchburg
## 167 167 166 165
## Stow Solon Litchfield Park Woodbridge
## 165 158 155 154
## Avon Hudson Coraopolis Chagrin Falls
## 145 145 137 133
## Bridgeville Bethel Park Brossard Verona
## 129 128 126 122
## Canonsburg Verdun East York Brunswick
## 121 121 117 116
## Sindelfingen Elyria Paradise Valley Belmont
## 115 114 113 109
## Laveen Mayfield Heights Esslingen Waxhaw
## 109 109 108 108
## Homestead Anthem Fairlawn (Other)
## 105 104 104 9090
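The summary lists both Montréal (4785) and Montreal (201), and a lower-case `las vegas` shows up in a later table, so some cities are split across spellings. Cleaning this by hand would be slow, but a rough programmatic pass is possible. The sketch below is an assumption, not part of the original analysis: `normalize_city` is a hypothetical helper, and it relies on `stringi::stri_trans_general()` to strip accents.

```r
library(stringr)

# Hypothetical helper (not in the original analysis): harmonise city spellings.
normalize_city <- function(x) {
  x <- str_trim(x)                                    # drop stray whitespace
  x <- stringi::stri_trans_general(x, "Latin-ASCII")  # strip accents
  str_to_title(x)                                     # unify capitalisation
}

# Made-up examples of the variants seen in the data.
cities <- c("Montr\u00e9al", "Montreal", "las vegas", " Las Vegas")
cleaned <- normalize_city(cities)
cleaned
```

Title-casing is a blunt tool: it would also change names such as Dollard-des-Ormeaux, so the result should be checked before it replaces the original column.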
business%>%select(-starts_with("hours"), -starts_with("attribute")) %>% unnest(categories) %>%
select(name, categories)%>%group_by(categories)%>%summarise(n=n())%>%arrange(desc(n))%>%head(20)
## # A tibble: 20 × 2
## categories n
## <chr> <int>
## 1 Restaurants 48485
## 2 Shopping 22466
## 3 Food 21189
## 4 Beauty & Spas 13711
## 5 Home Services 11241
## 6 Nightlife 10524
## 7 Health & Medical 10476
## 8 Bars 9087
## 9 Automotive 8554
## 10 Local Services 8133
## 11 Event Planning & Services 7224
## 12 Active Life 6722
## 13 Fashion 5824
## 14 American (Traditional) 5312
## 15 Fast Food 5250
## 16 Pizza 5229
## 17 Sandwiches 5220
## 18 Coffee & Tea 5099
## 19 Hair Salons 4858
## 20 Hotels & Travel 4857
At first glance, we see that most businesses in the data are related to eating, drinking and shopping.
business%>% filter(is_open==1) %>%select(-starts_with("hours"), -starts_with("attribute"))%>%
unnest(categories) %>%
select(name, categories)%>%group_by(categories)%>%summarise(n=n())%>%arrange(desc(n))%>%head(20)
## # A tibble: 20 × 2
## categories n
## <chr> <int>
## 1 Restaurants 36889
## 2 Shopping 19587
## 3 Food 17348
## 4 Beauty & Spas 12182
## 5 Home Services 10835
## 6 Health & Medical 9870
## 7 Nightlife 8155
## 8 Automotive 7984
## 9 Local Services 7712
## 10 Bars 7062
## 11 Event Planning & Services 6521
## 12 Active Life 5989
## 13 Fashion 4960
## 14 Fast Food 4778
## 15 Hotels & Travel 4496
## 16 Sandwiches 4343
## 17 Hair Salons 4294
## 18 Pizza 4231
## 19 American (Traditional) 4167
## 20 Coffee & Tea 4074
We can see that the top 20 categories among businesses that are still open are the same 20 as for the whole dataset, although their order differs slightly.
So if you want to open a business in one of the cities in the dataset, which kinds of businesses are more likely to fail?
We only include categories with more than 100 businesses, since proportions computed from small groups are too noisy.
business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%
unnest(categories) %>%group_by(categories)%>%summarise(n=n(), dead_prop=sum(is_open==0)/n,live_prop=sum(is_open==1)/n)%>%filter(n>100)%>%arrange(desc(dead_prop),desc(n))%>%head(20)
## # A tibble: 20 × 4
## categories n dead_prop live_prop
## <chr> <int> <dbl> <dbl>
## 1 Gay Bars 158 0.4746835 0.5253165
## 2 Spanish 177 0.3728814 0.6271186
## 3 Soul Food 212 0.3679245 0.6320755
## 4 Tapas Bars 286 0.3531469 0.6468531
## 5 Hawaiian 207 0.3478261 0.6521739
## 6 Buffets 710 0.3422535 0.6577465
## 7 Dance Clubs 743 0.3257066 0.6742934
## 8 Cajun/Creole 206 0.3155340 0.6844660
## 9 Dim Sum 220 0.3090909 0.6909091
## 10 Hookah Bars 211 0.3080569 0.6919431
## 11 Korean 639 0.3004695 0.6995305
## 12 Lounges 1261 0.2950040 0.7049960
## 13 Videos & Video Game Rental 234 0.2948718 0.7051282
## 14 French 903 0.2934662 0.7065338
## 15 Hot Dogs 551 0.2921960 0.7078040
## 16 American (New) 3621 0.2913560 0.7086440
## 17 Japanese 2054 0.2862707 0.7137293
## 18 Filipino 161 0.2795031 0.7204969
## 19 Modern European 197 0.2791878 0.7208122
## 20 Barbeque 1279 0.2752150 0.7247850
So generally, you would want to be cautious about these categories if you plan to open a business in those cities. But the analysis here is too rough; we would need to look at each business category in each city and each neighborhood for a more detailed analysis.
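The same summary (count, closed proportion, open proportion, size cutoff, sort) is repeated for several groupings in this analysis. As a sketch only, it could be wrapped in a small function. `closure_rate` and `min_n` are hypothetical names introduced here, and `toy` is made-up data.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical helper: closed/open proportions per group, keeping groups
# with more than `min_n` rows, sorted by closed proportion.
closure_rate <- function(data, ..., min_n = 100) {
  data %>%
    group_by(...) %>%
    summarise(n = n(),
              dead_prop = mean(is_open == 0),
              live_prop = mean(is_open == 1),
              .groups = "drop") %>%
    filter(n > min_n) %>%
    arrange(desc(dead_prop), desc(n))
}

# Made-up data with a list column of categories, like the Yelp business table.
toy <- tibble(
  name = c("A", "B", "C", "D"),
  is_open = c(1, 0, 1, 0),
  categories = list(c("Bars", "Nightlife"), "Bars", "Pizza", c("Pizza", "Bars"))
)

rates <- toy %>% unnest(categories) %>% closure_rate(categories, min_n = 1)
rates
```

With the real data, the per-category table could then be requested as `business %>% unnest(categories) %>% closure_rate(categories)`, and the per-city-and-category table as `closure_rate(city, categories)`.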
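We apply the same cutoff to states, keeping only those with n greater than 100. Note that n here counts business-category pairs rather than businesses, because the categories are unnested before grouping.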
business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%
unnest(categories) %>%group_by(state)%>%summarise(n=n(), dead_prop=sum(is_open==0)/n,live_prop=sum(is_open==1)/n)%>%filter(n>100)%>%arrange(desc(dead_prop),desc(n))
## # A tibble: 15 × 4
## state n dead_prop live_prop
## <chr> <int> <dbl> <dbl>
## 1 EDH 12272 0.17869948 0.8213005
## 2 ON 84728 0.16054905 0.8394509
## 3 MLN 627 0.15151515 0.8484848
## 4 IL 5901 0.14607694 0.8539231
## 5 NV 105568 0.14117915 0.8588209
## 6 SC 1786 0.13829787 0.8617021
## 7 WI 15303 0.13696661 0.8630334
## 8 AZ 162750 0.13399078 0.8660092
## 9 QC 22433 0.13177016 0.8682298
## 10 NC 37809 0.12422968 0.8757703
## 11 PA 29260 0.11760082 0.8823992
## 12 OH 37437 0.11413842 0.8858616
## 13 BW 10672 0.09895052 0.9010495
## 14 HLD 481 0.03326403 0.9667360
## 15 FIF 208 0.02884615 0.9711538
The regions BW, HLD and FIF have a surprisingly low proportion of closed businesses, while EDH, ON and MLN have a relatively high one. Note that HLD and FIF have only 481 and 208 rows, so their proportions are less reliable. None of this means that opening a business in EDH, ON or MLN is a bad idea; we would need more information to judge whether a given business would survive.
business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%
unnest(categories) %>%group_by(city)%>%summarise(n=n(), dead_prop=sum(is_open==0)/n,live_prop=sum(is_open==1)/n)%>%filter(n>100)%>%arrange(desc(dead_prop),desc(n))
## # A tibble: 198 × 4
## city n dead_prop live_prop
## <fctr> <int> <dbl> <dbl>
## 1 Harrisburg 322 0.2236025 0.7763975
## 2 Unionville 130 0.2230769 0.7769231
## 3 Paradise Valley 406 0.2216749 0.7783251
## 4 Dollard-des-Ormeaux 127 0.2204724 0.7795276
## 5 Wickliffe 116 0.1982759 0.8017241
## 6 Richmond Hill 2471 0.1926346 0.8073654
## 7 Toronto 50971 0.1869887 0.8130113
## 8 Shaker Heights 197 0.1827411 0.8172589
## 9 Edinburgh 12467 0.1807171 0.8192829
## 10 las vegas 103 0.1747573 0.8252427
## # ... with 188 more rows
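One caveat about these tables: because `categories` is unnested before grouping, `n` counts business-category pairs, not businesses (Toronto has 14540 businesses in the city summary but n = 50971 here). A business with many categories therefore weighs more in `dead_prop`. The sketch below uses made-up data to show the difference; the per-business variant is an alternative, not what the analysis above computes.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up data: the closed business carries four categories.
toy <- tibble(
  business_id = c("b1", "b2", "b3"),
  city = "Toronto",
  is_open = c(0, 1, 1),
  categories = list(c("Restaurants", "French", "Bars", "Nightlife"),
                    "Shopping", "Food")
)

# As in the pipelines above: one row per business-category pair.
by_pair <- toy %>%
  unnest(categories) %>%
  group_by(city) %>%
  summarise(n = n(), dead_prop = mean(is_open == 0), .groups = "drop")

# Alternative: one row per business, so each business counts once.
by_business <- toy %>%
  group_by(city) %>%
  summarise(n = n(), dead_prop = mean(is_open == 0), .groups = "drop")

by_pair
by_business
```

In this toy case the pair-based proportion is 4/6 while the per-business proportion is 1/3, so the two definitions can diverge noticeably.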
Next, we want to know which business categories to avoid in different cities.
Let us first look at Toronto.
business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%
unnest(categories) %>%group_by(city,categories)%>%summarise(n=n(), dead_prop=sum(is_open==0)/n,live_prop=sum(is_open==1)/n)%>%filter(n>100)%>%arrange(desc(dead_prop),desc(n))%>%filter(city=="Toronto")
## Source: local data frame [106 x 5]
## Groups: city [1]
##
## city categories n dead_prop live_prop
## <fctr> <chr> <int> <dbl> <dbl>
## 1 Toronto French 153 0.4640523 0.5359477
## 2 Toronto Lounges 201 0.3880597 0.6119403
## 3 Toronto Dance Clubs 109 0.3761468 0.6238532
## 4 Toronto American (New) 166 0.3734940 0.6265060
## 5 Toronto Ice Cream & Frozen Yogurt 207 0.3381643 0.6618357
## 6 Toronto Italian 505 0.3168317 0.6831683
## 7 Toronto Diners 152 0.3092105 0.6907895
## 8 Toronto Canadian (New) 571 0.3047285 0.6952715
## 9 Toronto Thai 254 0.3031496 0.6968504
## 10 Toronto Books 179 0.3016760 0.6983240
## # ... with 96 more rows
This suggests that French food, lounges, dance clubs, New American food and ice cream are the riskiest categories in Toronto.
business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%
unnest(categories) %>%group_by(city)%>%summarise(n=n(), dead_prop=sum(is_open==0)/n,live_prop=sum(is_open==1)/n)%>%filter(n>100)%>%arrange(desc(dead_prop),desc(n))%>%filter(city=="Toronto")
## # A tibble: 1 × 4
## city n dead_prop live_prop
## <fctr> <int> <dbl> <dbl>
## 1 Toronto 50971 0.1869887 0.8130113
Toronto's overall closure proportion is about 19%, so the categories to be most cautious about in Toronto are those with a closure rate well above that.
You could do this for any other city in the data. I am especially interested in Cleveland, since this is where I live.
business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%
unnest(categories) %>%group_by(city,categories)%>%summarise(n=n(), dead_prop=sum(is_open==0)/n,live_prop=sum(is_open==1)/n)%>%filter(n>100)%>%arrange(desc(dead_prop),desc(n))%>%filter(city=="Cleveland")%>%head(20)
## Source: local data frame [20 x 5]
## Groups: city [1]
##
## city categories n dead_prop live_prop
## <fctr> <chr> <int> <dbl> <dbl>
## 1 Cleveland American (New) 172 0.25581395 0.7441860
## 2 Cleveland Restaurants 1235 0.21700405 0.7829960
## 3 Cleveland Bars 341 0.19941349 0.8005865
## 4 Cleveland Nightlife 385 0.19220779 0.8077922
## 5 Cleveland American (Traditional) 197 0.18274112 0.8172589
## 6 Cleveland Coffee & Tea 104 0.16346154 0.8365385
## 7 Cleveland Burgers 106 0.14150943 0.8584906
## 8 Cleveland Food 525 0.13142857 0.8685714
## 9 Cleveland Shopping 333 0.12912913 0.8708709
## 10 Cleveland Pizza 155 0.12903226 0.8709677
## 11 Cleveland Sandwiches 156 0.12820513 0.8717949
## 12 Cleveland Arts & Entertainment 154 0.11688312 0.8831169
## 13 Cleveland Beauty & Spas 150 0.10666667 0.8933333
## 14 Cleveland Active Life 111 0.09909910 0.9009009
## 15 Cleveland Event Planning & Services 165 0.08484848 0.9151515
## 16 Cleveland Specialty Food 112 0.08035714 0.9196429
## 17 Cleveland Fast Food 114 0.06140351 0.9385965
## 18 Cleveland Hotels & Travel 128 0.05468750 0.9453125
## 19 Cleveland Local Services 114 0.04385965 0.9561404
## 20 Cleveland Automotive 163 0.04294479 0.9570552
business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%
unnest(categories) %>%group_by(city)%>%summarise(n=n(), dead_prop=sum(is_open==0)/n,live_prop=sum(is_open==1)/n)%>%filter(n>100)%>%arrange(desc(dead_prop),desc(n))%>%filter(city=="Cleveland")
## # A tibble: 1 × 4
## city n dead_prop live_prop
## <fctr> <int> <dbl> <dbl>
## 1 Cleveland 10204 0.1348491 0.8651509
It seems that Cleveland is less friendly to restaurants, bars and nightlife than to other kinds of business. This is especially true of New American food, whose closure rate is nearly twice the city-wide rate (about 26% versus 13%).
Let us compare Toronto, Las Vegas and Cleveland on New American food, Bars, Nightlife and Restaurants.
business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%
unnest(categories)%>%
filter(city=="Cleveland"|city=="Toronto"|city=="Las Vegas")%>%filter(categories=="Restaurants" | categories=="American (New)"| categories=="Bars"| categories=="Nightlife") %>%
group_by(city,categories)%>%summarise(n=n(), dead_prop=sum(is_open==0)/n,live_prop=sum(is_open==1)/n)%>%filter(n>100)%>%arrange(categories,desc(dead_prop),desc(n))
## Source: local data frame [12 x 5]
## Groups: city [3]
##
## city categories n dead_prop live_prop
## <fctr> <chr> <int> <dbl> <dbl>
## 1 Toronto American (New) 166 0.3734940 0.6265060
## 2 Las Vegas American (New) 527 0.2808349 0.7191651
## 3 Cleveland American (New) 172 0.2558140 0.7441860
## 4 Toronto Bars 1170 0.2333333 0.7666667
## 5 Las Vegas Bars 1211 0.2320396 0.7679604
## 6 Cleveland Bars 341 0.1994135 0.8005865
## 7 Toronto Nightlife 1331 0.2389181 0.7610819
## 8 Las Vegas Nightlife 1658 0.2346200 0.7653800
## 9 Cleveland Nightlife 385 0.1922078 0.8077922
## 10 Las Vegas Restaurants 5431 0.2912907 0.7087093
## 11 Toronto Restaurants 6347 0.2733575 0.7266425
## 12 Cleveland Restaurants 1235 0.2170040 0.7829960
Toronto and Las Vegas have many more businesses in these categories than Cleveland, but businesses in Cleveland have a lower closure rate in every one of them.
Let us compare the star ratings of Nightlife and Bars businesses in Toronto, Las Vegas and Cleveland.
I keep only businesses with more than 30 reviews, so that the star ratings are reasonably stable.
business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%filter(review_count>30)%>%
unnest(categories)%>%
filter(city=="Cleveland"|city=="Toronto"|city=="Las Vegas")%>%filter( categories=="Nightlife"|categories=="Bars")%>%ggplot(aes(x=stars,fill=city))+geom_histogram(position="dodge",binwidth = 0.25)+ggtitle("Number of Nightlife+Bar businesses with different star ratings in several cities.")
The ratings of Nightlife in Toronto center around 3.5, while the ratings of Nightlife in Las Vegas and Cleveland center around 4. This may partly explain why the Nightlife closure rate in Toronto is higher than in Cleveland, and marginally higher than in Las Vegas.
How about restaurants? Here I combine the Restaurants and American (New) categories.
business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%filter(review_count>30)%>%
unnest(categories)%>%
filter(city=="Cleveland"|city=="Toronto"|city=="Las Vegas")%>%filter( categories=="Restaurants"|categories=="American (New)")%>%ggplot(aes(x=stars,fill=city))+geom_histogram(position="dodge",binwidth = 0.25)+ggtitle("Number of restaurants businesses with different star ratings in several cities.")
The same trend exists for restaurants. It seems that star ratings are related to the closure rate of businesses.
Since I am really interested in both Chinese food and French food, I will try to answer the following questions:
Which city has the most Chinese/French food choices?
Which city has higher ratings for Chinese/French food (i.e., which city should I go to if I want quality Chinese/French food)?
Which city should I go to if I want both high-quality Chinese and French food?
Let us only consider cities with more than 50 open Chinese restaurants that each have more than 10 reviews.
business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%filter(review_count>10)%>%filter(is_open==1)%>%
unnest(categories)%>%
filter( categories=="Chinese")%>%group_by(city)%>%summarise(n=n(),avg_star=mean(stars,na.rm=TRUE))%>%filter(n>50)%>%arrange(desc(n),desc(avg_star))
## # A tibble: 9 × 3
## city n avg_star
## <fctr> <int> <dbl>
## 1 Las Vegas 246 3.333333
## 2 Toronto 234 3.339744
## 3 Phoenix 134 3.376866
## 4 Markham 105 3.271429
## 5 Charlotte 89 3.297753
## 6 Pittsburgh 71 3.415493
## 7 Montréal 66 3.545455
## 8 Mississauga 64 3.375000
## 9 Richmond Hill 61 3.270492
Chinese restaurants have relatively low average star ratings, mostly between about 3.3 and 3.4 (Montréal is the exception at about 3.5). Las Vegas has the most Chinese restaurants, with Toronto the runner-up.
Notice what happens if we drop the condition that a restaurant must have more than 10 reviews (and raise the city cutoff to more than 100 restaurants): the counts rise substantially, for example from 234 to 369 in Toronto, which overtakes Las Vegas.
business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%filter(is_open==1)%>%
unnest(categories)%>%
filter( categories=="Chinese")%>%group_by(city)%>%summarise(n=n(),avg_star=mean(stars,na.rm=TRUE))%>%filter(n>100)%>%arrange(desc(n),desc(avg_star))
## # A tibble: 6 × 3
## city n avg_star
## <fctr> <int> <dbl>
## 1 Toronto 369 3.254743
## 2 Las Vegas 271 3.297048
## 3 Phoenix 152 3.266447
## 4 Markham 147 3.265306
## 5 Montréal 130 3.519231
## 6 Charlotte 122 3.159836
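Including the restaurants with few reviews lowers the average star rating in almost every city, which suggests that Chinese restaurants with a low review_count also tend to have a lower star rating.
How about French food?
Let us first explore French restaurants with more than 30 reviews.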
business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%filter(review_count>30)%>%filter(is_open==1)%>%
unnest(categories)%>%
filter( categories=="French")%>%group_by(city)%>%summarise(n=n(),avg_star=mean(stars,na.rm=TRUE))%>%filter(n>10)%>%arrange(desc(n),desc(avg_star))
## # A tibble: 6 × 3
## city n avg_star
## <fctr> <int> <dbl>
## 1 Montréal 92 3.929348
## 2 Toronto 50 3.850000
## 3 Las Vegas 38 4.157895
## 4 Pittsburgh 11 4.181818
## 5 Edinburgh 11 4.136364
## 6 Charlotte 11 4.045455
French restaurants are relatively rare compared to Chinese restaurants, but they have much higher average star ratings.
Let us relax the above restriction to more than 10 reviews.
business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%filter(is_open==1)%>%filter(review_count>10)%>%
unnest(categories)%>%
filter( categories=="French")%>%group_by(city)%>%summarise(n=n(),avg_star=mean(stars,na.rm=TRUE))%>%filter(n>10)%>%arrange(desc(n),desc(avg_star))
## # A tibble: 6 × 3
## city n avg_star
## <fctr> <int> <dbl>
## 1 Montréal 164 3.871951
## 2 Toronto 67 3.850746
## 3 Las Vegas 42 4.154762
## 4 Edinburgh 26 4.269231
## 5 Charlotte 13 4.038462
## 6 Pittsburgh 11 4.181818
It seems that Montreal has a surprisingly large number of French restaurants with a relatively high star rating.
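To compare cities on both cuisines at once, the two summaries can be joined by city. The sketch below copies the numbers from the two tables above (open restaurants with more than 10 reviews) and ranks the cities that appear in both. The plain average of the two ratings, `avg_both`, is an assumption made for illustration; it ignores how few French restaurants some cities have.

```r
library(dplyr)
library(tibble)

# Values copied from the Chinese summary (review_count > 10, n > 50).
chinese <- tribble(
  ~city,           ~n_chinese, ~star_chinese,
  "Las Vegas",     246, 3.333333,
  "Toronto",       234, 3.339744,
  "Phoenix",       134, 3.376866,
  "Markham",       105, 3.271429,
  "Charlotte",      89, 3.297753,
  "Pittsburgh",     71, 3.415493,
  "Montr\u00e9al",  66, 3.545455,
  "Mississauga",    64, 3.375000,
  "Richmond Hill",  61, 3.270492
)

# Values copied from the French summary (review_count > 10, n > 10).
french <- tribble(
  ~city,           ~n_french, ~star_french,
  "Montr\u00e9al", 164, 3.871951,
  "Toronto",        67, 3.850746,
  "Las Vegas",      42, 4.154762,
  "Edinburgh",      26, 4.269231,
  "Charlotte",      13, 4.038462,
  "Pittsburgh",     11, 4.181818
)

# Keep cities present in both tables and rank by the simple average.
both <- inner_join(chinese, french, by = "city") %>%
  mutate(avg_both = (star_chinese + star_french) / 2) %>%
  arrange(desc(avg_both))
both
```

Five cities appear in both tables. Pittsburgh comes out on top under this crude score, but with only 11 French restaurants there the ranking should be read with caution.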
Let us visualize the star ratings of Chinese restaurants in Las Vegas, Toronto and Phoenix.
business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%filter(review_count>30)%>%
unnest(categories)%>%
filter(city %in% c("Las Vegas","Toronto","Phoenix"))%>%filter( categories=="Chinese")%>%ggplot(aes(x=stars,fill=city))+geom_histogram(position="dodge",binwidth = 0.25)+ggtitle("Number of Chinese restaurants with different star ratings in several cities.")
These cities have similar distributions of star ratings for Chinese restaurants, but Las Vegas has many more 5-star Chinese restaurants.
Let us visualize the star ratings of French restaurants in Las Vegas, Toronto and Montréal.
business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%filter(review_count>30)%>%
unnest(categories)%>%
filter(city=="Toronto"|city=="Las Vegas"|city=="Montréal")%>%filter( categories=="French")%>%ggplot(aes(x=stars,fill=city))+geom_histogram(position="dodge",binwidth = 0.25)+ggtitle("Number of French restaurants businesses with different star ratings in several cities.")
The plot shows that Toronto has more French restaurants rated around 3.5, while Montréal has more rated around 4. There are fewer French restaurants in Las Vegas than in Montréal, but Las Vegas has several 5-star French restaurants.
I would also be interested in how New York City, Los Angeles, San Francisco, Seattle and Chicago compare to Las Vegas and Toronto, but those cities are not among the ones listed for this dataset.
Next, let us find which city has the largest proportion of Nightlife businesses that are still open, among cities with more than 100 Nightlife businesses that each have more than 30 reviews.
business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%unnest(categories) %>% filter(review_count>30)%>%
filter(str_detect(categories,"Nightlife"))%>%group_by(city)%>%summarise(n=n(), dead_prop=sum(is_open==0)/n,live_prop=sum(is_open==1)/n)%>%filter(n>100)%>%arrange(desc(live_prop),dead_prop, n)
## # A tibble: 11 × 4
## city n dead_prop live_prop
## <fctr> <int> <dbl> <dbl>
## 1 Edinburgh 119 0.03361345 0.9663866
## 2 Montréal 140 0.06428571 0.9357143
## 3 Madison 122 0.07377049 0.9262295
## 4 Pittsburgh 225 0.08444444 0.9155556
## 5 Cleveland 157 0.10191083 0.8980892
## 6 Charlotte 271 0.12915129 0.8708487
## 7 Toronto 511 0.12915851 0.8708415
## 8 Phoenix 357 0.14845938 0.8515406
## 9 Las Vegas 832 0.19350962 0.8064904
## 10 Tempe 123 0.23577236 0.7642276
## 11 Scottsdale 286 0.32167832 0.6783217
Nightlife businesses in Edinburgh, Montréal, Madison, Pittsburgh and Cleveland are the most likely to still be open; perhaps people in these cities love drinking a lot.
Let us look at the star ratings of Nightlife.
business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%filter(review_count>30)%>%filter(is_open==1)%>%
unnest(categories)%>%
filter( categories=="Nightlife")%>%group_by(city)%>%summarise(n=n(),avg_star=mean(stars,na.rm=TRUE))%>%filter(n>10)%>%arrange(desc(n),desc(avg_star))
## # A tibble: 29 × 3
## city n avg_star
## <fctr> <int> <dbl>
## 1 Las Vegas 671 3.726528
## 2 Toronto 445 3.435955
## 3 Phoenix 304 3.657895
## 4 Charlotte 236 3.519068
## 5 Pittsburgh 206 3.594660
## 6 Scottsdale 194 3.739691
## 7 Cleveland 141 3.680851
## 8 Montréal 131 3.832061
## 9 Edinburgh 115 3.773913
## 10 Madison 113 3.654867
## # ... with 19 more rows
Let us visualize the Nightlife star rating distribution in Las Vegas, Toronto and Phoenix.
business %>%select(-starts_with("hours"), -starts_with("attribute"))%>%filter(review_count>30)%>%filter(is_open==1)%>%
unnest(categories)%>%
filter(city=="Toronto"|city=="Las Vegas"| city=="Phoenix")%>%filter( categories=="Nightlife")%>%ggplot(aes(x=stars,fill=city))+geom_histogram(position="dodge",binwidth = 0.25)+ggtitle("Number of Nightlife businesses with different star ratings in several cities.")
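Among these three cities, Las Vegas looks like the best place for nightlife: it has the most Nightlife businesses and the highest average rating (about 3.73, versus 3.66 for Phoenix and 3.44 for Toronto).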
Before we proceed to build a model to predict whether a business will survive, let us visualize the star rating histograms for businesses that are closed and for businesses that are still open.
business%>%mutate(business_status=ifelse(is_open==0,"Closed","Open")%>%as.factor) %>%select(-starts_with("hours"), -starts_with("attribute"),-starts_with("categories"))%>%filter(review_count>10)%>%
ggplot(aes(x=stars,fill=business_status ))+geom_histogram(position="dodge",binwidth = 0.25)+ggtitle("Star ratings for closed and open businesses")
It seems that a small difference (0.5) in star rating means a lot for business survival.
Let us look at the difference in review_count. The number of businesses varies dramatically across the range of review_count, so it is best to analyze different ranges of review_count separately.
#businesses with fewer than 200 reviews
business%>%mutate(business_status=ifelse(is_open==0,"Closed","Open")%>%as.factor)%>%select(business_id,business_status,review_count)%>%filter(review_count<200)%>%
ggplot(aes(x=review_count,fill=business_status ))+geom_histogram(position="dodge",binwidth =10)+ggtitle("Review count for closed and open businesses (fewer than 200 reviews)")
#businesses with 200 to 1999 reviews
business%>%mutate(business_status=ifelse(is_open==0,"Closed","Open")%>%as.factor)%>%select(business_id,business_status,review_count)%>%filter(review_count>=200 &review_count<2000)%>%
ggplot(aes(x=review_count,fill=business_status ))+geom_histogram(position="dodge",binwidth =100)+ggtitle("Review count for closed and open businesses (200 to 1999 reviews)")
#businesses with 2000 or more reviews
business%>%mutate(business_status=ifelse(is_open==0,"Closed","Open")%>%as.factor)%>%select(business_id,business_status,review_count)%>%filter(review_count>=2000)%>%
ggplot(aes(x=review_count,fill=business_status ))+geom_histogram(position="dodge",binwidth =100)+ggtitle("Review count for closed and open businesses (2000 or more reviews)")
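The three review_count ranges can also be summarised numerically rather than only plotted. The following is a sketch using `cut()` to bucket the counts; the data in `toy` are invented for illustration and say nothing about the real proportions.

```r
library(dplyr)
library(tibble)

# Invented values, for illustration only.
toy <- tibble(
  review_count = c(5, 40, 150, 250, 900, 2500, 3000, 12),
  is_open      = c(0, 1, 1, 1, 1, 1, 1, 0)
)

# Bucket review_count into [0, 200), [200, 2000) and [2000, Inf),
# then compute the share of open businesses per bucket.
by_bucket <- toy %>%
  mutate(bucket = cut(review_count,
                      breaks = c(0, 200, 2000, Inf),
                      labels = c("<200", "200-1999", "2000+"),
                      right = FALSE)) %>%
  group_by(bucket) %>%
  summarise(n = n(), live_prop = mean(is_open == 1), .groups = "drop")
by_bucket
```

Replacing `toy` with the `business` table would give the open proportion in each of the three ranges, which makes the histograms easier to compare.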
The plots suggest that review_count may be a stronger predictor of whether a business stays open than the star rating, although this has not been tested formally here.
The next step is to use the information in these datasets to predict whether a business is open or not.
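No model is specified at this point, so the following is only one possible starting point: a logistic regression of `is_open` on `stars` and `review_count`, fitted with base R's `glm()`. The data are simulated (`toy` and the coefficients used to generate it are assumptions), so the fitted numbers mean nothing for the real Yelp data.

```r
set.seed(1)
n <- 500

# Simulated stand-in for the business table.
toy <- data.frame(
  stars = sample(seq(1, 5, by = 0.5), n, replace = TRUE),
  review_count = rpois(n, 30) + 3
)

# Simulated outcome: higher stars and more reviews raise the chance of being open.
p_open <- plogis(-1 + 0.3 * toy$stars + 0.4 * log1p(toy$review_count))
toy$is_open <- rbinom(n, 1, p_open)

# Logistic regression; review_count is log-transformed because it is heavily skewed.
fit <- glm(is_open ~ stars + log1p(review_count), data = toy, family = binomial)

# Predicted probability that each business is open.
pred <- predict(fit, type = "response")
summary(fit)$coefficients
```

On the real data, the same formula could be fitted to `business` directly, and category or city could be added as further predictors.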