Analysis of the Seasonal and Historical Popularity of Food Items from the Yelp Academic Dataset

Introduction

The Yelp Academic Dataset (YAD) provides an open set of information about businesses, reviews, and Yelp users. Two of the most popular business categories on Yelp are “Restaurants” and “Food”, which are closely related. In fact, food and restaurant reviews make up more than 70% of all reviews contained in the YAD. This project draws inferences from the review text of the dataset to tell a story about the popularity of selected food items from six categories: meat, seafood, green vegetables, staple foods, superfruits, and nuts. A few traditionally widespread food items (e.g. chicken, shrimp) and some new-wave food items (e.g. kale, quinoa) are arbitrarily selected.

The question of interest is: are there historical and seasonal trends in the popularity of these basic food items? This study is not an attempt to detect which foods are consumed the most in absolute numbers. After all, some of the food and drink items that we consume the most, such as water or the bread used in a sandwich, are unlikely to be mentioned in a review. If the absolute-number approach were used, the new-wave food items, which may be interesting to explore, would be unlikely to rank near the top of the list. Furthermore, there is a vast variety of seafood and green vegetable items, and building a dictionary of that size, with a correct understanding of the other languages used in the reviews (French and German), would require brute-force human effort that is beyond the intent of this project.

Methods

Data Preparation

The review text is the main source of data. However, the business information is also needed in order to filter out reviews of businesses outside the food and restaurant industry. The review and business information are merged into one data frame. The year, month, and day of the month of each review are extracted from the review date using functions in the lubridate package. The season of the year is determined from the month of the review; since all of the cities in the dataset are in the northern hemisphere, the month-to-season mapping is consistent across all reviews.
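
As a rough illustration of this step, the sketch below derives the date parts and the season from a review date. It uses base R rather than lubridate so that it runs without extra packages. The column name `date`, the helper names, and the month-to-season mapping (meteorological seasons, with December through February as winter) are assumptions, not details taken from the original analysis.

```r
# Assumed mapping: meteorological seasons for the northern hemisphere.
month_to_season <- function(month) {
  seasons <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
               "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")
  seasons[month]
}

# Hypothetical helper: add year, month, day of month, and season columns.
# Assumes `reviews$date` holds dates in "YYYY-MM-DD" form.
add_date_parts <- function(reviews) {
  d <- as.POSIXlt(as.Date(reviews$date))
  reviews$year   <- d$year + 1900L   # POSIXlt years count from 1900
  reviews$month  <- d$mon + 1L       # POSIXlt months are 0-based
  reviews$mday   <- d$mday
  reviews$season <- month_to_season(reviews$month)
  reviews
}

# Made-up dates, for illustration only
reviews <- data.frame(date = c("2013-01-15", "2014-07-04"),
                      stringsAsFactors = FALSE)
out <- add_date_parts(reviews)
```

With lubridate, the calls `year()`, `month()`, and `mday()` would take the place of the POSIXlt fields.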

There are also a few unusually labeled geographic locations; after some research, the appropriate location is assigned using a correction function. For example, some reviews come from businesses officially located in Fort Mill, South Carolina; since it is a suburb of Charlotte, North Carolina, they are categorized under the Charlotte subset.

Term Extraction

Each of six basic categories of food (meat, seafood, green vegetables, staple foods, superfruits, and nuts) has five arbitrarily selected items. In order to match the targeted terms via regular expressions, the entire body of each review text is first converted to lower case. In general, the aim is to ensure that the extracted term is indeed a whole word, and therefore these rules are put in place:

A term needs to have a space in front of it. The text for each review has been padded with one extra space at the very beginning to facilitate catching the first word of the review under this rule.
A term needs to have a space or a punctuation mark after it.
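
The two rules above could be expressed as a single regular expression, as in the minimal sketch below. The helper name `count_term` is hypothetical, and the sketch applies only these two general rules: it does not handle compound words or other language-specific cases, and a term at the very end of a review with no trailing punctuation is not matched.

```r
# Hypothetical helper: count whole-word occurrences of `term` in each review.
count_term <- function(text, term) {
  # Lower-case the text and pad it with one leading space (first rule)
  padded <- paste0(" ", tolower(text))
  # Space before the term; space or punctuation after it (second rule).
  # The lookahead checks the following character without consuming it.
  pattern <- paste0(" ", term, "(?=[ [:punct:]])")
  matches <- gregexpr(pattern, padded, perl = TRUE)
  # gregexpr() reports -1 when there is no match, so count positive starts
  vapply(matches, function(m) sum(m > 0), integer(1))
}

# Made-up review snippets, for illustration only
counts <- count_term(c("Salmon was great, the salmon roll too.",
                       "No fish here",
                       "Beware of salmonella"),
                     "salmon")
```

Because the term must be followed by a space or a punctuation mark, "salmonella" in the third snippet is not counted as "salmon".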

However, since German is used in Baden-Württemberg, there are bound to be some compound words. There are also some French words that may be preceded by an apostrophe (e.g. after an elided d' or l'). Therefore, it is essential to research the terms themselves in order to determine the most suitable regular expression search pattern for each one.

Data Processing

The metric used to evaluate food item popularity on Yelp is the term frequency: the average number of occurrences of the targeted food term per review. Multiple occurrences of a term in one review are counted multiple times, as opposed to only once. The baseball analogy would be the slugging average, where extra-base hits are rewarded more, as opposed to the batting average, where every hit counts only once. Still, the term frequency is almost certainly a number smaller than one, due to the huge range of different foods mentioned in the reviews. It may be helpful to multiply the term frequency by 100 and think of it as the number of times the term would occur over the course of 100 reviews.

The histogram and Q-Q plot show that the distribution of the raw per-review frequency of a food term (salmon, in this case) is heavily right-skewed. The reason is that in most reviews the targeted term is not found at all, which produces the very tall bin at zero on the histogram.

Figure 1 - Histogram and Q-Q Plot of Salmon Term Frequency, Raw

Nonetheless, there could be something interesting in this dataset, since there are over a million reviews to work with. One assumption behind evaluating a difference in means by t-test or ANOVA is that the data are normally distributed. This implies that the raw term frequency data should be pre-processed in some way before a t-test or ANOVA is used to evaluate the difference in means between subgroups of this dataset.

The strategy is to make use of the Central Limit Theorem to obtain an approximately normal distribution. Suppose that 2000 reviews are collected as a bundle; the average of the term frequencies in that bundle can be easily calculated, and is called the bundled term frequency (BTF). If the entire body of reviews is split into bundles of similar size, then the average of all the BTFs approximates the average term frequency of that food item. Because each BTF is itself an average over many reviews, the BTFs follow a far more Gaussian distribution than the raw per-review counts.
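
The effect of bundling can be illustrated with simulated data, as in the sketch below. The counts are drawn from a Poisson distribution purely as an assumption to mimic reviews that mostly contain zero mentions; they are not Yelp data, and the `skewness` helper is defined here for illustration only.

```r
set.seed(1)

# Simulated per-review counts: most reviews mention the term zero times
raw <- rpois(31 * 2000, lambda = 0.03)

# Assign the reviews to 31 bundles of 2000 and average within each bundle
bundle <- rep(1:31, each = 2000)
btf <- as.vector(tapply(raw, bundle, mean))

# Sample skewness: close to zero for a symmetric, Gaussian-like distribution
skewness <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5

c(raw = skewness(raw), bundled = skewness(btf))
```

The raw counts are strongly right-skewed, while the bundle averages should show a skewness much closer to zero. With equally sized bundles, the mean of the BTFs equals the overall mean of the raw counts.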

There are several ways to build a bundle. Bundling by date would give many samples, but some dates in the earlier years of Yelp would probably result in many zero BTFs as well, given the sparse number of reviews. Bundling by day of the week should give a more representative set of samples, but there would only be 7 samples. Bundling by day of the month (DoM) might work well, since it would yield 31 samples.
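
A minimal sketch of DoM bundling is given below, assuming a data frame with one row per review, a hypothetical `mday` column holding the day of the month, and a per-review count column (here `salmon`). The helper name `bundle_by_mday` is also hypothetical; this is one way such a summary could be produced, not necessarily how the table in this report was built.

```r
# Hypothetical helper: one row per day of the month, with the number of
# reviews in the bundle and the bundled term frequency (BTF).
bundle_by_mday <- function(reviews, count_col) {
  counts <- reviews[[count_col]]
  data.frame(
    mday  = sort(unique(reviews$mday)),
    count = as.vector(tapply(counts, reviews$mday, length)),
    btf   = as.vector(tapply(counts, reviews$mday, mean))
  )
}

# Made-up per-review counts, for illustration only
reviews <- data.frame(mday   = c(1, 1, 2, 2, 2),
                      salmon = c(0, 2, 1, 0, 2))
bundles <- bundle_by_mday(reviews, "salmon")
```

Each BTF is the mean count within its bundle, so multiple mentions in one review still count multiple times.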

tail(foodByMday[,c(1:7)])

##    mday count meat.beef meat.chicken  meat.pork  meat.duck meat.turkey
## 26   26 36252 0.1805418    0.2076851 0.06896171 0.01746111  0.01735077
## 27   27 36815 0.1787858    0.2049980 0.06703789 0.01982887  0.01616189
## 28   28 36784 0.1771966    0.2119400 0.07060135 0.01704545  0.01652893
## 29   29 34795 0.1794223    0.2042535 0.06601523 0.01741630  0.01629544
## 30   30 33725 0.1817346    0.2066716 0.07021497 0.01773165  0.01565604
## 31   31 21126 0.1875888    0.2095522 0.06645839 0.01883934  0.01633059

Surprisingly, the 31 bundles give very similar BTFs. The sample sizes are very close as well, except for the 31st day of the month, which occurs in only seven months of the year, but that bundle still gives a BTF comparable to the others. Below are the histogram and Q-Q plot of the BTF for salmon.

Figure 2 - Histogram and Q-Q Plot of Salmon Term Frequency, DoM-Bundled

With DoM bundling, the normality assumption of the aforementioned statistical tests is approximately satisfied.

Evaluation of Inference

Two questions will be evaluated:

Are there any statistically significant seasonal differences in the popularity of each food item according to Yelp? The alternative hypothesis states that there are. This is answered with an ANOVA test on the seasonal group means for each food item.
Are there any statistically significant changes in the year-by-year averages of term frequencies for each selected food item? The alternative hypothesis states that there are. This is answered with t-tests comparing the group means of the term frequencies between consecutive years from 2010 to 2014.

Results

Seasonal Differences

By sub-grouping the dataset into seasons according to the month of the review, and constructing a heat map, it can be seen that there are notable differences among the mean term frequencies of different seasons. Note that in the heat map below, each food item has its own color scale, so that the colors of the items with lesser frequencies would not be drowned out by the colors of the items with much higher frequencies. The purpose of the heat map is to enhance the contrast across seasons, not across food items.
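
One way to give each food item its own color scale is to rescale each row of the item-by-season matrix before plotting, as sketched below. Min-max rescaling is an assumption made for illustration; the source does not state which scaling was actually used, and the frequencies shown are made up.

```r
# Rescale each row (food item) of an item-by-season matrix to [0, 1].
# Note: a row with identical values in every season would divide by zero.
rescale_rows <- function(m) {
  scaled <- apply(m, 1, function(x) (x - min(x)) / (max(x) - min(x)))
  t(scaled)  # apply() returns one column per input row, so transpose back
}

# Made-up mean term frequencies, for illustration only
freq <- matrix(c(0.20,  0.21,  0.19,  0.22,
                 0.002, 0.004, 0.003, 0.001),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("chicken", "goji"),
                               c("Spring", "Summer", "Fall", "Winter")))
scaled <- rescale_rows(freq)

# The rescaled matrix could then be drawn with, for example:
# heatmap(scaled, Rowv = NA, Colv = NA, scale = "none")
```

After rescaling, a rare item such as goji spans the same color range as a common item such as chicken, so its seasonal contrast stays visible.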

Since an approximately normal distribution is achieved, ANOVA may now be performed to test whether there is a seasonal effect on the term frequency for each of the 30 food items.
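
For a single food item, the seasonal test could look like the sketch below. It assumes that the observations are bundled term frequencies with one row per bundle and season; the column names `btf` and `season`, the helper name, and the example values are hypothetical.

```r
# Hypothetical helper: p-value of a one-way ANOVA of BTF against season.
seasonal_anova_p <- function(bundles) {
  fit <- aov(btf ~ season, data = bundles)
  # The first row of the ANOVA table holds the season effect
  summary(fit)[[1]][["Pr(>F)"]][1]
}

# Made-up BTFs, for illustration only
bundles <- data.frame(
  season = rep(c("Spring", "Summer", "Fall", "Winter"), each = 3),
  btf = c(0.030, 0.031, 0.029,
          0.040, 0.041, 0.039,
          0.030, 0.032, 0.028,
          0.020, 0.021, 0.019)
)
p_value <- seasonal_anova_p(bundles)
```

A food item would be placed in the group with more seasonal variation when its p-value falls below the chosen significance level, which the source does not state.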

food.with.more.seasonal.variation.in.popularity

##  [1] "beef"        "chicken"     "pork"        "duck"        "salmon"     
##  [6] "tuna"        "crab"        "kale"        "broccoli"    "spinach"    
## [11] "asparagus"   "rice"        "corn"        "kiwi"        "pomegranate"
## [16] "blueberry"

food.with.less.seasonal.variation.in.popularity

##  [1] "turkey"    "shrimp"    "clams"     "chard"     "wheat"    
##  [6] "couscous"  "quinoa"    "acai"      "goji"      "almond"   
## [11] "cashew"    "peanut"    "pecan"     "pistachio"

In summary, the ANOVA finds a significant difference among the mean term frequencies of reviews written in different seasons for 16 of the 30 food items.

Historical Trends

It could also be interesting to see how trendy each food item has been on Yelp over the years since its inception. It should be noted that there are more reviews in the later years of Yelp's fairly short history.

2004	2005	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015
10	458	2678	11414	30505	52677	98956	150412	173442	233392	335572	10281

T-tests are performed to compare the annual term frequency of each item in one year with that of the previous year. Only the years 2010 through 2014 are considered, due to the relatively low number of reviews in the earlier years of Yelp and in the partial year of 2015. If the null hypothesis is rejected, the indicator denotes the direction of the change with Up or Down. If the null hypothesis cannot be rejected, the indicator is a long straight dash (---).
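
The indicator logic could be sketched as follows. The helper name `trend_indicator`, the significance level of 0.05, and the use of R's default Welch two-sample t-test are assumptions; the source does not specify them. The inputs are assumed to be the BTFs of two consecutive years.

```r
# Hypothetical helper: compare the BTFs of one year with the previous year.
trend_indicator <- function(prev, curr, alpha = 0.05) {
  p <- t.test(curr, prev)$p.value       # Welch two-sample t-test (R default)
  if (p >= alpha) return("---")         # null hypothesis not rejected
  if (mean(curr) > mean(prev)) "Up" else "Down"
}

# Made-up BTFs, for illustration only
btf_2010 <- c(0.100, 0.101, 0.099, 0.102, 0.098)
btf_2011 <- c(0.050, 0.051, 0.049, 0.052, 0.048)
indicator <- trend_indicator(btf_2010, btf_2011)
```

Applying such a helper to every food item and every pair of consecutive years would yield one indicator per cell of a year-over-year table.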

food	2010-2011	2011-2012	2012-2013	2013-2014
meat.beef	Down	Down	Down	Down
meat.chicken	Down	Down	Down	Down
meat.pork	Down	---	Down	Down
meat.duck	Down	---	Down	Down
meat.turkey	Down	---	Down	Down
seafood.shrimp	Down	Down	Down	Down
seafood.salmon	Down	---	---	Down
seafood.tuna	Down	Down	Down	Down
seafood.crab	Down	Down	Down	Down
seafood.clams	Down	---	Down	Down
greenveg.kale	Up	Up	Up	---
greenveg.chard	---	---	---	---
greenveg.broccoli	---	Down	---	Down
greenveg.spinach	Down	Down	Down	Down
greenveg.asparagus	Down	Down	Down	Down
staple.rice	Down	Down	Down	Down
staple.wheat	Down	---	Down	Down
staple.corn	Down	Up	Down	Down
staple.couscous	---	---	---	---
staple.quinoa	---	---	Up	---
superfruit.acai	---	---	Up	Up
superfruit.goji	---	---	---	---
superfruit.kiwi	---	Down	---	---
superfruit.pomegranate	Down	Down	---	---
superfruit.blueberry	---	---	---	Down
nut.almond	---	---	---	---
nut.cashew	---	---	---	---
nut.peanut	Down	Down	Down	Down
nut.pecan	---	Down	---	---
nut.pistachio	Down	---	---	---

The only food items showing statistically significant upward annual popularity trends are kale (2010-2013), corn (2011-2012), quinoa (2012-2013) and acai (2012-2014).

Discussion

While it is not easy to interpret word frequency data, search engines have long been reporting the “trending” searches of the day. Term frequency may very well have some value in detecting the fashionable trends of the day. In our case of mining data from restaurant reviews, there are a couple of things that we can infer from logic and common sense.

Firstly, while not all food items being consumed are mentioned in the reviews (e.g. bread, water), the food items that are mentioned in the reviews are almost certainly items that the reviewer or the dining party of the reviewer is having. Multiple mentions of the food item within one review may mean that there are multiple orders at the table containing that ingredient, or that the food item is worth mentioning more than once, regardless of whether the sentiment is positive or negative.

Furthermore, as seen in the analysis of historical trends, several food items show statistically significant decreases in the average number of occurrences per review. This may be due to an “inflation effect”: while the item is still being consumed regularly, the multiculturalism that exists today may be gradually increasing the number of other, newer food items mentioned across the entire body of food and restaurant reviews, so that each individual food term becomes less frequent relative to the number of reviews. If so, the food items whose annual popularity is statistically increasing are prevailing despite the “inflation effect”, which makes the feat even more notable. Food and restaurant business owners and decision makers may very well gain meaningful insights into consumer demand by further exploring, expanding, and refining the methods presented in this project.