Introduction

People all over the world like fast food. It is mostly unhealthy, but very convenient if you are in a hurry. Burgers and sandwiches are among the most popular fast food “dishes”. Using the Yelp data set, I’ll try to understand the differences between people who like burgers and those who prefer sandwiches. In addition, I’ll look at the data from the fast food owner’s perspective: are there any reasons to go with burgers or sandwiches, and how might location influence the decision? Planning a fast food menu is very important, as the profit relies heavily on the number of people attending. So the main goal of this paper is to understand how the two groups of Yelp users differ, and whether we can recommend a fast food menu based on the surrounding places.

Methods and Data

Exploratory data analysis

As mentioned above, all analyses are done using the Yelp data set. It includes five files in JSON format and can be downloaded from here. After loading, the files turned out to contain nested data frames. Those data frames were flattened using the ‘flatten()’ function from the ‘jsonlite’ package. The table below gives more information on the resulting files:
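The flattening step can be illustrated with a minimal sketch. The original analysis used R's ‘jsonlite’; the Python function below is only a stand-in that shows the idea, and the record, its field names, and the dotted key separator are hypothetical.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining nested keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects and merge their flattened keys
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical record, shaped like one line of a nested JSON file
line = (
    '{"business_id": "b1", "stars": 3.5, '
    '"attributes": {"Parking": {"lot": true, "street": false}, "Wi-Fi": "free"}}'
)
flat_record = flatten(json.loads(line))
print(sorted(flat_record))
```

Each nested attribute becomes its own column, which is why a flattened file can end up with a large number of parameters.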

##            Observations Parameters File_Size_Mb
## 1 Business        61184        105       45.078
## 2   Review      1569264         10     1312.395
## 3 Check_in        45166        170       32.757
## 4      Tip       495107          6       81.396
## 5     User       366715         23      133.411

First, let’s select the fast food places that serve burgers and sandwiches and see how they are distributed:

##            Group.Size Average.Rating Percentage.From.All.Businesses
## Sandwiches       3927       3.519862                    0.002502447
## Burgers         11364       3.380720                    0.007241611

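The table above shows that the mean ratings of the two groups differ, but tests have to be run to understand whether this happened by chance. (The last column appears to be the group size divided by the total number of reviews, e.g. 3927 / 1569264 ≈ 0.0025, so it is a fraction rather than a percentage of businesses.) We can’t just assume normality of the data, so the normal Q-Q plots below will help to examine this issue: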

The plots above show that we cannot assume normality of the data, so a t-test or any other test that assumes normality can’t be used. I will use the Wilcoxon-Mann-Whitney test (U test) to test my null hypothesis.

I’ll start with the assumption that people who like burgers and people who like sandwiches do not differ, and that fast food taste does not distinguish the groups from one another. So my H₀ is that both groups come from the same population.
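For readers who want to see what the U test computes, here is a minimal sketch in Python. It is not the R routine used for the results in this paper: it uses the normal approximation with a tie correction and no continuity correction, and the star ratings are made-up values.

```python
import math


def midranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided U test: returns (U for sample x, approximate p-value)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = midranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Tie correction for the variance of U
    counts = {}
    for value in combined:
        counts[value] = counts.get(value, 0) + 1
    tie_term = sum(t**3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # every observation identical: no evidence of a shift
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u1, p_value


# Hypothetical star ratings for two small groups of businesses
sandwich_stars = [5, 4, 4, 5, 3, 4]
burger_stars = [3, 2, 4, 3, 1, 3]
u_stat, p_value = mann_whitney_u(sandwich_stars, burger_stars)
print(u_stat, p_value)
```

Because the test works on ranks rather than raw values, it does not need the ratings to be normally distributed, which is the reason it was chosen here.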

With a p-value of 2.7961782 × 10⁻¹⁵ we reject the null hypothesis. We can say that the groups come from different populations and that the difference in ratings is statistically significant. This means that sandwich bars are rated higher on average than burger bars. But is it because sandwich bars are better, or do people who prefer sandwiches rate higher in general? I took people who gave either burger or sandwich bars high marks (4 or 5) and compared the ratings given by the two groups of users. Here are the U test results (H₀: the groups come from the same population):

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  sreviews$stars.x and breviews$stars.x
## W = 3557900000, p-value = 4.912e-11
## alternative hypothesis: true location shift is greater than 0

The test results show that people who like sandwiches tend to rate everything higher than those who like burgers: the p-value is less than 0.05, and thus we reject the null hypothesis. So when it comes to ratings, sandwich lovers are more generous. The data also tells us the city in which each business is located, and if we assume that people go to fast food places in the city where they live, we can find out how sandwich and burger lovers differ by city (remember, we’re talking only about people who gave 4-5 stars to either a burger or a sandwich place). See below the plot of the ratio of 4-5 star ratings to the total number of ratings by city.
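The per-city ratio can be sketched as follows. This is one plausible reading of the ratio (ratings of 4 or more divided by all ratings in the city), written in Python rather than the R used for the analysis; the function name and the sample reviews are hypothetical.

```python
from collections import defaultdict


def high_rating_ratio_by_city(reviews, threshold=4):
    """Return {city: share of ratings that are >= threshold}."""
    totals = defaultdict(int)
    high = defaultdict(int)
    for city, stars in reviews:
        totals[city] += 1
        if stars >= threshold:
            high[city] += 1
    return {city: high[city] / totals[city] for city in totals}


# Hypothetical (city, stars) pairs
sample_reviews = [
    ("Phoenix", 5), ("Phoenix", 3), ("Phoenix", 4),
    ("Las Vegas", 2), ("Las Vegas", 5),
]
ratios = high_rating_ratio_by_city(sample_reviews)
print(ratios)
```

Sorting the resulting dictionary by value gives the top cities for each group.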

We can see from the plot that the top cities by review ratio are really different; in fact, only 2 of the top 10 cities for each group overlap. To understand more about what people who rated burgers or sandwiches highly in those cities write about, we can look at the word clouds, with green for burgers and orange for sandwiches.

A quick glance at the text results shows that people value speed a lot in burger places, whereas sandwich lovers talk a lot about the food itself: beef, brisket, turkey, etc. When burger lovers talk about food they generally mention its size, and of course fries, which I believe are the most common side served with burgers all over the world. On the other hand, people who rated sandwiches highly value customer service and friendly staff.

Explaining methods

As mentioned before, the main goal of this work is to find out whether we can determine what to put on the fast food menu based on the surrounding environment. I will use an association rule mining method called apriori. It is mostly used for market basket analysis, but if we think of a street or a shopping mall as a supermarket, where shops, entertainment centers, cafes, etc. are the goods on the shelves, we can easily apply apriori to our problem. So we need to find out where else people who like burgers or sandwiches go. To achieve this I’ve created a table that holds user_id as a unique key and a list of places that person rated. Apriori then produces rules made of an antecedent and a consequent; in other words, given several places a person rated, what is the probability (shown as confidence in the table below) that he or she also rated burgers, for example.
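The user table and the two rule measures can be sketched in a few lines. This is an illustration in Python, not the “arules” code used for the analysis; the function names, user ids, and categories are hypothetical.

```python
from collections import defaultdict


def build_transactions(ratings):
    """Group rated categories by user: {user_id: set of categories}."""
    baskets = defaultdict(set)
    for user_id, category in ratings:
        baskets[user_id].add(category)
    return dict(baskets)


def support(baskets, items):
    """Share of users whose basket contains every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in baskets.values() if items <= basket)
    return hits / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the user baskets."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


# Hypothetical (user_id, category) pairs
ratings = [
    ("u1", "Cinema"), ("u1", "Burgers"),
    ("u2", "Cinema"), ("u2", "Burgers"), ("u2", "Coffee & Tea"),
    ("u3", "Cinema"), ("u3", "Sandwiches"),
    ("u4", "Coffee & Tea"), ("u4", "Burgers"),
]
baskets = build_transactions(ratings)
rule_support = support(baskets, {"Cinema", "Burgers"})
rule_confidence = confidence(baskets, {"Cinema"}, {"Burgers"})
print(rule_support, rule_confidence)
```

In this toy data two of the four users rated both a cinema and a burger place, and two of the three cinema-goers rated burgers, so the rule “Cinema, then Burgers” has support 2/4 and confidence 2/3.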

Now we can turn the list into transaction data and run the apriori algorithm provided by the “arules” package. The result will hopefully show the most common places (item sets) that people who rated burgers or sandwiches also attended.

In this example I’ve used apriori with support and confidence thresholds of 0.05 and 0.8, respectively, to eliminate uncommon results. I’ve worked only with those users who actually rated burgers, sandwiches, or both; I could not use the whole database due to size issues. After applying apriori, I also removed the rules whose antecedent included sandwiches for the burgers consequent, and vice versa.
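The effect of the two thresholds and of the final filtering step can be sketched as below. This is a brute-force stand-in written in Python, not the apriori implementation from “arules”: it simply enumerates small antecedents, and the `max_len` limit, the function name, and the baskets are assumptions made for illustration.

```python
from itertools import combinations


def mine_rules(baskets, consequent, excluded,
               min_support=0.05, min_confidence=0.8, max_len=2):
    """Return (antecedent, support, confidence) for rules above both thresholds.

    Antecedents never contain `excluded`, mirroring the removal of rules
    with sandwiches in the antecedent for the burgers consequent.
    """
    n = len(baskets)
    items = sorted({item for basket in baskets for item in basket}
                   - {consequent, excluded})
    rules = []
    for size in range(1, max_len + 1):
        for antecedent in combinations(items, size):
            lhs = set(antecedent)
            n_lhs = sum(1 for b in baskets if lhs <= b)
            if n_lhs == 0:
                continue
            n_both = sum(1 for b in baskets if lhs <= b and consequent in b)
            rule_support = n_both / n          # share of all transactions
            rule_confidence = n_both / n_lhs   # share of antecedent transactions
            if rule_support >= min_support and rule_confidence >= min_confidence:
                rules.append((antecedent, rule_support, rule_confidence))
    return rules


# Hypothetical transactions, one set of rated categories per user
toy_baskets = [
    {"Cinema", "Burgers"},
    {"Cinema", "Burgers", "Coffee & Tea"},
    {"Cinema", "Sandwiches"},
    {"Coffee & Tea", "Burgers"},
    {"Cinema", "Burgers"},
]
burger_rules = mine_rules(toy_baskets, consequent="Burgers", excluded="Sandwiches")
for rule in burger_rules:
    print(rule)
```

Here the “Cinema” rule is dropped because its confidence (3 out of 4) is below 0.8, while both rules involving “Coffee & Tea” pass. A real apriori run prunes infrequent item sets early instead of enumerating every combination, which matters a lot on data of this size.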

Results

The mining step produced 84 association rules, 18 of which are common to both groups. The table below shows support and confidence for burgers and sandwiches for six of the common rules:

##                     Businesses      B.Sup      S.Sup    B.conf    S.conf
## 1  Arts & Entertainment|Cinema 0.13274336 0.11867722 0.9615385 0.8596491
## 5                Bakeries|Food 0.07638565 0.07042385 0.9512761 0.8770302
## 6               Bars|Nightlife 0.06641826 0.05673032 0.9544846 0.8152610
## 7    Beauty & Spas|Nail Salons 0.07536097 0.06977177 0.8870614 0.8212719
## 9    Food|Beer, Wine & Spirits 0.05924546 0.05617140 0.9492537 0.9000000
## 10           Food|Coffee & Tea 0.18118305 0.15929204 0.9328537 0.8201439

If we look at the first line, for example, 96% of the transactions that contain “Arts & Entertainment|Cinema” also contain burgers, and 86% also contain sandwiches (confidence). Burgers and “Arts & Entertainment|Cinema” appear together in 13% of all transactions, and for sandwiches it is 12% (support). We can see that support and confidence for burgers are always slightly bigger in this table; to confirm this with a test we should check the normality of the data first. The Shapiro-Wilk test of normality gives a p-value < 0.001 for burger support and 0.001 for sandwich support. P-values this small actually reject the H₀ of normality rather than fail to reject it, so the t-test used for support here should be read as approximate. The null hypothesis of the t-test is that there is no difference between the two groups, and the alternative is that burgers are more common than sandwiches.
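The relation between the two measures can be checked on the first row of the table. The short Python sketch below uses only the numbers printed above; the variable names are mine. Since confidence is the rule support divided by the antecedent support, dividing support by confidence recovers the share of transactions that contain “Arts & Entertainment|Cinema”.

```python
# Values copied from the first row of the table (burgers and sandwiches)
b_sup, b_conf = 0.13274336, 0.9615385
s_sup, s_conf = 0.11867722, 0.8596491

# confidence = support(antecedent and consequent) / support(antecedent),
# so support(antecedent) = support / confidence
b_antecedent_support = b_sup / b_conf
s_antecedent_support = s_sup / s_conf

print(round(b_antecedent_support, 4), round(s_antecedent_support, 4))
```

Both values come out at about 0.138, which is what we would expect if the burger and sandwich rules were mined from the same set of transactions.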

With a p-value of 0.491, well above 0.05, and a 95% confidence interval of [-0.015, 0.031] that includes zero, we fail to reject the H₀ for support. However, what truly distinguishes burgers from sandwiches in this case is the confidence, as it shows the probability of the consequent given the antecedent. The Shapiro-Wilk test showed that the confidence data is not normally distributed, with p-value = 0.018 for burgers and 0.017 for sandwiches. So we apply the Wilcoxon test to see if the groups come from the same population. The resulting p-value of 2.6710346 × 10⁻⁷ shows that the two groups come from different populations. Below is the list of locations that people who like burgers also go to, for rules with confidence above 0.95:

##  [1] "Food|Coffee & Tea,Shopping Centers|Shopping"                                          
##  [2] "Arts & Entertainment|Cinema,Food|Grocery"                                             
##  [3] "Food|Grocery,Food|Ice Cream & Frozen Yogurt"                                          
##  [4] "Arts & Entertainment|Cinema,Food|Ice Cream & Frozen Yogurt"                           
##  [5] "Arts & Entertainment|Cinema,Food|Coffee & Tea"                                        
##  [6] "Shopping Centers|Shopping"                                                            
##  [7] "Department Stores|Fashion|Shopping,Food|Coffee & Tea"                                 
##  [8] "Food|Coffee & Tea,Food|Ice Cream & Frozen Yogurt"                                     
##  [9] "Food|Coffee & Tea,Hotels & Travel|Event Planning & Services|Hotels"                   
## [10] "Food|Coffee & Tea,Food|Grocery"                                                       
## [11] "Arts & Entertainment|Resorts|Casinos|Event Planning & Services|Hotels & Travel|Hotels"
## [12] "Department Stores|Fashion|Shopping"                                                   
## [13] "Arts & Entertainment|Cinema"                                                          
## [14] "Bars|Nightlife|Lounges"                                                               
## [15] "Arts & Entertainment|Performing Arts"                                                 
## [16] "Bars|Nightlife"                                                                       
## [17] "Hotels & Travel|Arts & Entertainment|Casinos|Event Planning & Services|Hotels"        
## [18] "Food|Grocery"                                                                         
## [19] "Hotels & Travel|Airports"                                                             
## [20] "Bakeries|Food"

Sandwich lovers also attend these places, but not as frequently.

Discussion

Based on the confidence results, I’d like to conclude that if you are choosing between burgers and sandwiches (and you have to pick only one), I would generally recommend that fast food owners go with burgers. Even though sandwich lovers tend to rate businesses higher, burger places are attended more frequently and thus might be more profitable. Sandwich lovers generally rated the same places burger lovers did; however, the “Car Wash|Automotive” category was, surprisingly, rated only by burger lovers (with a minimum confidence threshold of 0.8). So if you’re located near a car wash, sell burgers, serve them fast, and serve them with fries.
Of course, the decision may depend on factors other than the surrounding hotspots, such as the city your business is located in. In general, the two groups of people that tend to rate burger or sandwich places higher than the other are in fact different: from the city they live in all the way through to the things they value in fast food restaurants. I am planning on digging deeper into the problem by analyzing the days on which people attend different fast food places, using the check-in table. A fast food place’s geographical location, not only by city but also taking geographical coordinates into account, would definitely be of interest too.

Choosing fast food menu: burgers or sandwiches, based on the surrounding infrastructure using Yelp data set