Software used:

OS X El Capitan, version 10.11.2, and Linux Mint 17.2 Cinnamon 64-bit; R version 3.2.2 (2015-08-14) – “Fire Safety”

Introduction

This data analysis focuses on predicting the probability that a business will close in the near future and on identifying the reasons why businesses receive bad ratings.

For a business owner or an investor, it is important to predict the future of a business with good accuracy. Intuitively, a business with a bad rating remains open for a shorter time than a business with a good or above-average rating. However, there are still many factors that can cause even a good business to close. What is the probability of business closure given an average star rating? Does the answer depend on how long the business has been open? Which bad business practices influence the average business rating?

Imagine an outsider (e.g. a person not familiar with a given place) who wants to open a business or invest in some city. A relevant piece of information is which features of a given business type in a given city are perceived as bad. One could learn from the weaknesses of existing businesses, improve upon them and thus gain a competitive advantage. To learn what is wrong, we ask the consumers of the business, and this is where the Yelp data set comes in. From a table of yearly statistics (the frequency of occurrence of each bad feature) we predict the future relevance of these bad features.

This information, together with the future predictions, might be highly relevant for an already open business. The owner might focus on improving the bad features typical of their own business class, and, with knowledge of the likely future relevance of each feature, put more effort into the sectors that will matter most.

We first model how the average review depends on time and compute the probability that a business stays open given its average star rating. Secondly, we suggest changes to bad business practices based on the reviews customers write about many poorly rated businesses. We develop a natural language processing (NLP) method for semantic text analysis and select only those nouns in the customer reviews that are connected to bad business practices.

To briefly outline the NLP part: we read the bad reviews (3 stars or less) in a given city for a given business type, analyse the text and extract information about the bad features. In the process we also obtain statistics on how often each bad feature occurs; rescaling by the number of reviews yields what we call the inferiority measure. At the end of the NLP part we forecast the future development of the inferiority measure for the vector of bad features. Several time series prediction methods are used: a neural network method and two types of ARIMA algorithm.

Methods

In this section, we introduce the methods we use to predict whether a business will remain open or close - in particular, to find the probability of business closure.

Methods:

  1. simple histogram comparison of the number of received reviews for open/closed businesses

  2. recursive partition tree (the rpart method of the caret model-training package in R) used to build a decision tree specifying whether a certain business is/will be open or closed based on the number of reviews and stars (a rough sketch of this step is given after the list):
    • the business data are divided into a training set (70% of the data, randomly chosen) and a testing set (the remaining 30%)
    • the training set is used to find the probability that the business is open given a number of conditions, specified in the Results section
  3. non-linear curve fitting to find the dependence of the average rating on the time the business is/was open; from the fit we predict the average duration of a business given its number of stars
  4. NLP part: methods from the wordcloud, coreNLP, openNLP and syuzhet libraries (such as Maxent_Word_Token_Annotator()) together with our own “epsilon neighborhood” noun method (explained later)
  5. for forecasting we use methods from the forecast library: nnetar() for neural network time series prediction, Arima() for a custom ARIMA model and auto.arima() for automatic ARIMA model selection; no seasonality is used in the time series prediction methods
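A rough sketch of step 2 follows; the variable names (open, stars, review_count) are illustrative and the exact preprocessing may differ from the one actually used.

library(caret)
library(rattle)

# treat the open/closed flag as numeric so that the tree leaves can be read as probabilities
bus$open <- as.numeric(bus$open)

inTrain  <- createDataPartition(bus$open, p = 0.7, list = FALSE)
training <- bus[inTrain, ]
testing  <- bus[-inTrain, ]

# caret resamples with the bootstrap (25 repetitions) by default
modFit <- train(open ~ stars + review_count, data = training, method = "rpart")
fancyRpartPlot(modFit$finalModel)   # draw the resulting decision tree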

Brief Motivation and Exploratory Analysis

Below, we take the business data and quickly plot the variable pairs to see any dependence between the average number of stars, the location of the business (longitude and latitude) and whether a certain business is still running (open) or has closed.

From the above plot, we can see that businesses that are still open are reviewed more, and that the average number of stars tends towards 4 as the number of reviews increases. Therefore, it seems that there might indeed be some correlation between the number of reviews, the number of stars and whether the business is open or closed.

Before further analysis, we explore the data a little more and plot the number of reviews at each average star rating for closed/open businesses, normalized by the number of businesses with that average star rating. The normalization is performed separately for open and closed businesses so that the two groups are comparable (a rough sketch of this computation is given at the end of this discussion).

From the above we can infer that closed businesses have a higher chance of a bad average rating:

  • if the business is closed, then it is more likely to have a bad average rating of 2.5 or 2 stars
  • if the business has a good rating (3-5 stars), then it is more likely to be still open and more likely to have many reviews

Note, however, that the time for which a business has existed has a non-trivial influence on the above:

  • if a certain bad business has been open only for a short time, it may not yet have accumulated enough bad reviews to close
  • if a business has been open only for a short time, it might have a bad average rating simply because there are not yet enough good reviews
  • if a certain good business has been open for a long time, it is more likely to have many reviews

Therefore, the above graph does not directly explain why a certain business stayed open or closed; we address the time dependence and the reasons for bad reviews in the next section (the NLP analysis and time prediction).
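For reference, the normalization used in the plot above can be sketched as follows; the column names stars, open and review_count are illustrative and may not match the actual data frame.

library(dplyr)
library(ggplot2)

# total reviews per (average star, open/closed) cell divided by the number of
# businesses in that cell, i.e. the mean review count per business
norm_reviews <- bus %>%
  group_by(stars, open) %>%
  summarise(reviews_per_business = sum(review_count) / n())

ggplot(norm_reviews, aes(x = factor(stars), y = reviews_per_business, fill = factor(open))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "average star rating", y = "reviews per business", fill = "open")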



In this subsection we focus on the exploratory analysis necessary for the further NLP work. More concretely, we take the bad reviews for a given business type and city in a given year. For brevity, we focus on the city of Las Vegas and analyse bad reviews using sentiment analysis (packages wordcloud, coreNLP, openNLP, rJava). We pick out the words with negative sentiment and search for nouns in their “epsilon” neighborhood. The epsilon neighborhood is the set of words centered at a given (bad) word that includes the 3 closest words to its left and right. The value of “epsilon” is chosen by looking at the data structure and estimating the computational time. The rationale for choosing the nouns in that neighborhood is the observation that:

  • Nouns usually carry the factual information about the subject; however, nouns are not usually classified as negative.

  • Negative words (often adjectives) usually do not carry factual information, but they do carry the sentiment.

Let us give an example and consider the statement “the bread was awful”. Sentiment analysis will pick the word “awful” as a negative word, while words like “bread” or “was” are classified as neutral. Looking for nouns in the vicinity of the word “awful” therefore recovers the factual information. We collect a vector of such nouns with weights, where each weight is the number of times the noun appeared in the vicinity of a negative word (in reviews for a given city, business type and year).
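As an illustration, here is a minimal sketch of the “epsilon neighborhood” idea applied to the example sentence, using the openNLP part-of-speech annotators and the syuzhet sentiment lexicon. It is a simplification, not the actual implementation of our findNearNouns() method.

library(NLP)
library(openNLP)
library(syuzhet)

s <- as.String("The bread was awful")

# tokenize and tag parts of speech
ann <- annotate(s, list(Maxent_Sent_Token_Annotator(), Maxent_Word_Token_Annotator()))
ann <- annotate(s, Maxent_POS_Tag_Annotator(), ann)

word_ann <- subset(ann, type == "word")
tokens   <- s[word_ann]
pos_tags <- sapply(word_ann$features, `[[`, "POS")

eps      <- 3                                                  # +/- 3 words
neg_idx  <- which(get_sentiment(tokens, method = "bing") < 0)  # negative words, e.g. "awful"
noun_idx <- which(grepl("^NN", pos_tags))                      # nouns, e.g. "bread"

# nouns that lie within the epsilon neighborhood of a negative word
near_nouns <- unique(unlist(lapply(neg_idx, function(i)
  tokens[noun_idx[abs(noun_idx - i) <= eps]])))
near_nouns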

In this way we learn what is bad about a certain business type, quantified by the inferiority measure. In the second step we use this information to predict how the inferiority measures of the bad business features evolve in time. We consider three different time series models: nnetar() from the forecast package provides a neural network model, while Arima() and auto.arima() from the same package provide ARIMA-type models. We calculate the predicted values of all three models for the most significant bad features for the years 2015 and 2016 and estimate the accuracy of each prediction for each model (compared using the MASE, the mean absolute scaled error).

For brevity and to keep the computation tractable, we perform the analysis for the city of Las Vegas and the business type “Cafes” for the years 2006 to 2014. (Note: the original data contain more years, but there are no or only very few data for 2004, 2005 and 2015, so those years are disregarded.)

Because we want to keep everything organised and we like object-oriented programming languages, we create an S4 class that subsets our data to a manageable size. The class is called busS and takes the reviews data frame, the business data frame, a city, a business type and a year. One method of the busS class subsets the review data (the data frame is called rev from now on) for a given city, a given business type (“Cafes”) and a given year (from 2006 to 2014). This is done by the method getWholeBusiness(); a skeleton of the class is sketched at the end of this subsection. Code for the subsetting (assuming busS and all its methods are already loaded):

df<-busS(data_rev=rev,data=bus,city=las_vegas,business="Cafes",year=2006)
df<-getWholeBusiness(df)

The previous code subsets the rev data table to the las_vegas city. We also want to check whether we have enough data in each year for the city of Las Vegas in the business category “Cafes”. The following plot shows the situation:

In summary: we know the name variants of Las Vegas, we can easily subset the whole business data using the busS class, and we know that for the business type “Cafes” we have enough data for exploration. The last plot also shows that for the years 2004 and 2005 we have a relatively small amount of data, and we disregard the year 2015 since its data are obviously incomplete.
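For concreteness, a skeleton of what the busS class might look like is sketched below. It is inferred from the constructor call above; the real slot types and subsetting logic may differ.

busS <- setClass("busS",
                 slots = c(data_rev = "data.frame",  # review data frame (rev)
                           data     = "data.frame",  # business data frame (bus)
                           city     = "character",   # city name variant(s)
                           business = "character",   # business category, e.g. "Cafes"
                           year     = "numeric"))

setGeneric("getWholeBusiness", function(object) standardGeneric("getWholeBusiness"))

setMethod("getWholeBusiness", "busS", function(object) {
  # keep the businesses of the requested category in the requested city,
  # then the reviews of those businesses written in the requested year
  bus_sub <- object@data[object@data$city %in% object@city &
                           grepl(object@business, object@data$categories), ]
  rev_sub <- object@data_rev[object@data_rev$business_id %in% bus_sub$business_id, ]
  rev_sub[format(as.Date(rev_sub$date), "%Y") == as.character(object@year), ]
})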

Results

Below, we present a classification tree that gives the probability that a certain business is open/closed based on the number of stars and the review count. The tree is computed using the rpart method of the caret package, which uses bootstrap resampling to find an accurate classification.

Decision tree interpretation: to find the probability that a certain business will be open, we treat the open/closed indicator as a continuous variable rather than a factor. The root node (probability P = 0.88) gives the overall probability that a business is open. From the classification tree, if the business has more than 4.8 stars, the probability that it is open is 94%. The interpretation of the rest of the tree is analogous.

library(rattle)
# load the trained model for classification tree
load("businessData.rda")
modFit
## CART 
## 
## 42830 samples
##     5 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 42830, 42830, 42830, 42830, 42830, 42830, ... 
## Resampling results across tuning parameters:
## 
##   cp           RMSE       Rsquared     RMSE SD      Rsquared SD 
##   0.000964269  0.3262800  0.006794689  0.002199129  0.0009211813
##   0.002016676  0.3265864  0.004906140  0.002114058  0.0008596819
##   0.004573263  0.3270699  0.003965632  0.002246362  0.0006095627
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was cp = 0.000964269.

figure1

Time Range Influence:

The following classification tree gives the probability that a certain business is open when the time range over which the business has been reviewed is included among the predictors:

library(rattle)
#
load("modelWTimeRange.rda")
model
## CART 
## 
## 60785 samples
##     4 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 60785, 60785, 60785, 60785, 60785, 60785, ... 
## Resampling results across tuning parameters:
## 
##   cp           RMSE       Rsquared     RMSE SD      Rsquared SD
##   0.004549401  0.3242587  0.017303553  0.002191247  0.002501036
##   0.007419798  0.3253123  0.011302606  0.002228746  0.003559455
##   0.008100342  0.3259707  0.009510323  0.002491002  0.002983017
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was cp = 0.004549401.

picture2

Time vs. average rating:

Below we show the dependence of the mean time (in days) for which a business stays open on its average star rating. Since our dataset covers a constrained time range, we want to make sure that the curve is robust with respect to the time window we take. Therefore, the plot contains the data for different subsets defined by the time when the first review was written, ranging from businesses first reviewed at the latest 2 years ago (in 2013) up to businesses first reviewed since 2004. The data points are sized according to the subset: for example, size 10 (maxYears = 10) means that the data point belongs to the subset of businesses first reviewed 10 years ago. Within each subset the data are rescaled by the maximum average time duration for open/closed businesses.

picture

Even for the different subsets, we still see the same type of behaviour, and the curves fitted to the data for both open and closed businesses fit well. Grey lines show the standard deviation. We did not use any weights, i.e. the data point corresponding to maxYears = 2 has the same weight as the data point for maxYears = 10.
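As a simplified stand-in for the non-linear fit, one can fit a quadratic polynomial to the rescaled mean open time as a function of the average star rating. The data frame dur and its columns stars and mean_days are illustrative names, not the ones used in the actual analysis.

# quadratic fit: mean open time as a function of the average star rating
fit <- lm(mean_days ~ poly(stars, 2, raw = TRUE), data = dur)
summary(fit)

# predicted (rescaled) mean open time for, e.g., a 3.5-star business
predict(fit, newdata = data.frame(stars = 3.5))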

Interpretation of the average rating - open time dependence

(Good) Businesses Get Worse Rating Over Time

As time passes, even the best business has the ‘luck’ of serving some demanding customers, resulting in worse ratings. Even for very bad businesses, the average rating decreases over time.

As a result, the stars vs. average time open curve has a parabolic shape: good businesses cannot keep their great score and drift towards ~4-star ratings over time.

We have verified that the average rating tends to decrease over time, regardless of its current value. Below, we show a couple of plots of the evolution of the average rating over time.

Plot of time evolution of bad businesses (<=2 star average rating)

Each dot represents a specific business, with a different colour for each business; lines connect the same business over time.

Businesses open in 2006

picture

Businesses open in 2007

picture

Plot of time evolution of good businesses (>=4 star average rating)

Each dot represents a specific business, with a different colour for each business; lines connect the same business over time.

Businesses open in 2004

picture

Businesses open in 2005

picture

Plot of time evolution of businesses with an average rating (>= 2 and < 3.5 stars)

Each dot represents a specific business, with a different colour for each business; lines connect the same business over time.

Businesses open in 2005

picture

Businesses open in 2006

picture

Suggesting Good Business Practices

We want to find what is wrong with a particular business type. The class called words wraps a character string, and its methods perform various partial analyses of that string. The two main ingredients of the class are the sentiment analysis function and the part-of-speech tagging function. For sentiment analysis and word tagging we use methods from the packages wordcloud, coreNLP, openNLP, tm and syuzhet. The main method of the words class is findNearNouns(). It takes the string, in our case a particular review, and returns the list of nouns that lie in the “epsilon” neighborhood (chosen as plus or minus three words) of a negative sentiment word.

We also create the class evaluate, whose data structure consists of a character string (in our case the negative-sentiment nouns) and a file name. This class formats the result of the findNearNouns() function and saves it on the local drive. Inside the class we use a fault-tolerant implementation of findNearNouns() (the method eval_summary()), since in the next step we loop through large chunks of data.
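The fault-tolerance idea can be sketched with a small wrapper; the name safeFindNearNouns is hypothetical, and eval_summary() may handle errors differently.

# reviews on which the NLP annotators fail are skipped instead of aborting the whole loop
safeFindNearNouns <- function(text) {
  tryCatch(findNearNouns(text), error = function(e) character(0))
}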

The next stage is to loop through the years and use the methods getWholeBusiness() and findNearNouns() (via eval_summary()). We obtain the list of nouns that represents the problems of the “Cafes” business type in the years 2006-2014. We also subset the reviews to those with three or fewer stars:

set <- c(2006:2014)
SF <- sapply(set, function(x) {
  cafe <- busS(data_rev = rev, data = bus, city = las, business = "Cafes", year = as.numeric(x))
  cafe <- getWholeBusiness(cafe)
  cafe <- subset(cafe, stars <= 3)
  cafe <- evaluate(text = cafe$text, filename = paste("cafe_rev_", x, collapse = "", sep = ""))
  cafe <- eval_summary(cafe)
})

Another class, called taB, and its methods are used to obtain a nicely formatted and trimmed (to a desired level) version of the bad features, i.e. a clean representation of the list SF, which holds the bad features together with their occurrence counts. The taB class data structure consists of the list SF, a numeric value low_cutof and a numeric vector years. The low_cutof value holds a user-defined lower cutoff on the number of occurrences, so that, for example, words with low occurrence across the years can be disregarded from the start. Within the class we can also restrict which years appear in the final table.

There are several steps one has to take in order to properly analyse the content of SF. The list SF contains words with the same meaning that are nevertheless counted separately, for example the singular and plural forms of a noun. We need to identify such words and count them as the same. This is done using the method list_for_no_pick(); consider the following code:

SF_rev<-taB(theSF=SF,low_cutof=0,years=c(2006:2014))
no_pick_list<-list_for_no_pick(SF_rev)

We create an instance of the class taB called SF_rev. The method list_for_no_pick() gives the list of potentially identical words in SF. We observed that in the resulting list more word pairs share the same meaning than do not, so it is easier to pick out the words that should NOT be considered equal. The next step is therefore to identify the pairs that should not be merged and record their positions in the numeric vector no_pick:

no_pick_list
## [[1]]
## [1] "ass"        "assumption"
## 
## [[2]]
## [1] "bagel"  "bagels"
## 
## [[3]]
## [1] "basket"  "baskets"
## 
## [[4]]
## [1] "bit"   "bites"
## 
## [[5]]
## [1] "bottle"  "bottles"
## 
## [[6]]
## [1] "bread"  "breads"
## 
## [[7]]
## [1] "break."    "breakfast"

In the above partial listing of no_pick_list we can easily identify which word pairs should NOT be treated as the same, for example the pair “ass”-“assumption”. Most of the pairs in no_pick_list, however, should indeed be merged. The actual no_pick numeric vector in our situation is no_pick<-c(1,7,13,14,20,21,22,32,39,40,49,59,75,83,87). Once no_pick is identified, the taB class method merge_same() takes the SF_rev instance (and, behind the scenes, the numeric vector no_pick) and merges the words that have the same meaning. It also produces a nicely formatted table:

table_rev<-merge_same(SF_rev)
##      anything breakfast cafe chi choice customer day drink egg experience
## 2006        0         0    0   0      0        0   0     0   1          0
## 2007        0         1    3   2      0        0   1     1   1          0
## 2008        1         1    1  10      0        0   1     1   2          3
## 2009        1         2    2   6      0        1   4     2   7          6
## 2010        3         0    3   0      2        1   4     0   0          0
## 2011        1         1    0   2      1        1   2     1   1          2
## 2012        0         3    1   0      3        3   3     0   0          5
## 2013        1         0    5   3      0        4   0     1   0         10
## 2014        4         5    3   4      0        4  12     3   0          9
##      flavor food friend fries lot menu mood nothing order place price
## 2006      0    0      0     0   0    1    0       0     1     0     0
## 2007      0    7      1     0   0    0    0       7     0     3     1
## 2008      6   24      1     1   0    2    0       0     1     3     1
## 2009      2   27      4     0   1    1    0       7     5     9     2
## 2010      0    9      0     0   3    3    0       2     0     7     3
## 2011      1   14      0     3   0    1    3       1     0     5     2
## 2012      0   23      0     3   0    1    1       4     2     5     2
## 2013      1   14      1     0   0    6    0       9     1    12     3
## 2014      2   38      1     3   0    6    3       7     1     2     2
##      quality review sandwich seating server service star taste wait
## 2006       0      0        0       0      0       0    0     0    0
## 2007       0      1        0       0      0       5    2     0    1
## 2008       2      3        3       0      0      16    1     3    3
## 2009       0      4        1       0      5      38    0     3    3
## 2010       1      5        1       0      0      19    2     1    1
## 2011       0      2        0       1      1      12    0     0    0
## 2012       4      5        1       1      1      35    3     3    1
## 2013       6      7        2       6      1      40    2     4    3
## 2014       9      7        1       2      1      74    2     5    4
##      waitress way
## 2006        0   0
## 2007        1   0
## 2008       11   0
## 2009        1   3
## 2010        1   3
## 2011        0   1
## 2012        0   1
## 2013        0   0
## 2014        1   3
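Conceptually, the merging performed by merge_same() amounts to summing the yearly counts of the columns treated as synonyms; a rough sketch of the idea (the actual implementation may differ):

mergeColumns <- function(tab, keep, drop) {
  tab[, keep] <- tab[, keep] + tab[, drop]            # add the counts of the dropped column
  tab[, setdiff(colnames(tab), drop), drop = FALSE]   # and remove it from the table
}
# e.g. mergeColumns(table_rev, "bagel", "bagels")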

The table above is, however, not the whole story. The Yelp data set has a second large source of relevant information: the data set containing the tips. Tips can be considered very short reviews, and their data structure is indeed almost the same, so we proceed with no principal change in the analysis of the tips data set. We create the SF list again for the tip data set, called SF_tips, and use the taB class and its methods to obtain the tips analogue of the last table. The rationale is that it should not matter where people write about what was wrong; we should judge only the overall occurrence of the given bad features. We therefore ADD the review and tips bad-feature occurrence tables together.

The next step is to rescale the combined table, table_final, by the number of reviews and tips written in each year, so that its entries become relative frequencies rather than raw counts. A rough sketch of both steps is given below, followed by the resulting table:
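A rough sketch of the two steps, assuming the review and tips tables have been aligned to the same columns, and with n_rev and n_tips as hypothetical vectors of the yearly numbers of reviews and tips:

table_counts <- table_rev + table_tips                       # add the two count tables
table_final  <- sweep(table_counts, 1, n_rev + n_tips, "/")  # rescale each year (row) by its total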

table_final
##          anything    breakfast         cafe          chi       choice
## 2006 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## 2007 0.0000000000 0.0114942529 0.0344827586 0.0229885057 0.0000000000
## 2008 0.0044052863 0.0044052863 0.0044052863 0.0440528634 0.0000000000
## 2009 0.0032362460 0.0064724919 0.0064724919 0.0194174757 0.0000000000
## 2010 0.0039267016 0.0000000000 0.0039267016 0.0000000000 0.0026178010
## 2011 0.0007082153 0.0007082153 0.0000000000 0.0014164306 0.0007082153
## 2012 0.0000000000 0.0013999067 0.0004666356 0.0000000000 0.0013999067
## 2013 0.0003225806 0.0003225806 0.0016129032 0.0009677419 0.0000000000
## 2014 0.0008399832 0.0012599748 0.0006299874 0.0008399832 0.0000000000
##          customer         day        drink          egg  experience
## 2006 0.0000000000 0.000000000 0.0000000000 0.0769230769 0.000000000
## 2007 0.0000000000 0.011494253 0.0114942529 0.0114942529 0.000000000
## 2008 0.0000000000 0.004405286 0.0044052863 0.0088105727 0.013215859
## 2009 0.0032362460 0.012944984 0.0064724919 0.0226537217 0.019417476
## 2010 0.0013089005 0.005235602 0.0000000000 0.0000000000 0.001308901
## 2011 0.0007082153 0.001416431 0.0007082153 0.0007082153 0.001416431
## 2012 0.0013999067 0.001399907 0.0000000000 0.0000000000 0.002333178
## 2013 0.0012903226 0.000000000 0.0012903226 0.0000000000 0.004516129
## 2014 0.0008399832 0.002519950 0.0008399832 0.0000000000 0.001889962
##            flavor        food       friend        fries         lot
## 2006 0.0000000000 0.000000000 0.0000000000 0.0000000000 0.000000000
## 2007 0.0000000000 0.080459770 0.0114942529 0.0000000000 0.000000000
## 2008 0.0264317181 0.105726872 0.0044052863 0.0044052863 0.000000000
## 2009 0.0064724919 0.087378641 0.0129449838 0.0000000000 0.003236246
## 2010 0.0000000000 0.013089005 0.0000000000 0.0000000000 0.003926702
## 2011 0.0007082153 0.009915014 0.0000000000 0.0021246459 0.000000000
## 2012 0.0000000000 0.012599160 0.0000000000 0.0013999067 0.000000000
## 2013 0.0003225806 0.006451613 0.0003225806 0.0000000000 0.000000000
## 2014 0.0004199916 0.008609828 0.0002099958 0.0006299874 0.000000000
##              menu         mood      nothing        order       place
## 2006 0.0769230769 0.0000000000 0.0000000000 0.0769230769 0.000000000
## 2007 0.0000000000 0.0000000000 0.0804597701 0.0000000000 0.034482759
## 2008 0.0088105727 0.0000000000 0.0000000000 0.0044052863 0.013215859
## 2009 0.0032362460 0.0000000000 0.0226537217 0.0161812298 0.029126214
## 2010 0.0039267016 0.0000000000 0.0026178010 0.0000000000 0.009162304
## 2011 0.0007082153 0.0021246459 0.0007082153 0.0000000000 0.003541076
## 2012 0.0004666356 0.0004666356 0.0018665422 0.0009332711 0.002333178
## 2013 0.0019354839 0.0000000000 0.0029032258 0.0003225806 0.004516129
## 2014 0.0012599748 0.0006299874 0.0014699706 0.0002099958 0.001679966
##            price     quality      review     sandwich      seating
## 2006 0.000000000 0.000000000 0.000000000 0.0000000000 0.0000000000
## 2007 0.011494253 0.000000000 0.011494253 0.0000000000 0.0000000000
## 2008 0.004405286 0.008810573 0.013215859 0.0132158590 0.0000000000
## 2009 0.006472492 0.000000000 0.012944984 0.0032362460 0.0000000000
## 2010 0.003926702 0.001308901 0.006544503 0.0013089005 0.0000000000
## 2011 0.001416431 0.000000000 0.001416431 0.0000000000 0.0007082153
## 2012 0.001866542 0.004199720 0.002333178 0.0004666356 0.0004666356
## 2013 0.001612903 0.002258065 0.002258065 0.0006451613 0.0019354839
## 2014 0.001049979 0.001889962 0.001469971 0.0002099958 0.0004199916
##            server    service         star       taste         wait
## 2006 0.0000000000 0.00000000 0.0000000000 0.000000000 0.0000000000
## 2007 0.0000000000 0.05747126 0.0229885057 0.000000000 0.0114942529
## 2008 0.0000000000 0.07048458 0.0044052863 0.013215859 0.0132158590
## 2009 0.0161812298 0.12297735 0.0000000000 0.009708738 0.0097087379
## 2010 0.0000000000 0.02617801 0.0026178010 0.001308901 0.0013089005
## 2011 0.0007082153 0.01133144 0.0000000000 0.000000000 0.0000000000
## 2012 0.0004666356 0.02053196 0.0013999067 0.001399907 0.0004666356
## 2013 0.0009677419 0.01741935 0.0006451613 0.001290323 0.0009677419
## 2014 0.0010499790 0.01952961 0.0004199916 0.001049979 0.0008399832
##          waitress          way
## 2006 0.0000000000 0.0000000000
## 2007 0.0114942529 0.0000000000
## 2008 0.0484581498 0.0000000000
## 2009 0.0032362460 0.0097087379
## 2010 0.0013089005 0.0039267016
## 2011 0.0000000000 0.0007082153
## 2012 0.0000000000 0.0004666356
## 2013 0.0000000000 0.0000000000
## 2014 0.0002099958 0.0006299874

Forecasting

Our last step is to make time series predictions for table_final. We use three prediction methods from the forecast library; later on we also need the fpp library.


The first model is nnetar(); the second and third models are Arima() and auto.arima(). The nnetar() model trains a neural network on the patterns of previous years and uses it to predict future values. The ARIMA methods are univariate time series prediction methods.

nnetar() builds a feed-forward neural network using lagged values of the time series as inputs. The network we use has size = 18 nodes in its hidden layer, and the forecasts are averaged over repeats = 50 networks.

For the Arima() modelling we follow the Hyndman and Khandakar approach, which combines unit root tests, minimization of the AICc and MLE, as described at https://www.otexts.org/fpp/8/7, to find the best model; this gives the Arima parameters order = c(p, d, q) = c(0, 1, 4), where:

  • p is the order of the autoregressive part
  • d is the degree of first differencing involved
  • q is the order of the moving average part

We also estimate the model accuracies. We divide the dataset (table_final) into a training and a test set in the ratio 9:1 and use the accuracy() function, which in general returns various measures of model accuracy. We use the mean absolute scaled error (MASE), which has been proposed as a "generally applicable measurement of forecast accuracy without the problems seen in the other measurements"; see http://robjhyndman.com/papers/foresight.pdf.
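A minimal sketch of this forecasting and accuracy estimation for a single bad-feature column (“service” is used as an example; the actual analysis loops over all columns of table_final):

library(forecast)

y     <- ts(table_final[, "service"], start = 2006)   # yearly series 2006-2014
train <- window(y, end = 2013)                         # 9:1 split into training and test
test  <- window(y, start = 2014)

fit_nn    <- nnetar(train, size = 18, repeats = 50)    # feed-forward net on lagged values
fit_arima <- Arima(train, order = c(0, 1, 4))          # the ARIMA order chosen above
fit_auto  <- auto.arima(train)                         # automatic order selection

# forecast over the test period and read off the MASE for each model
h <- length(test)
sapply(list(nnetar     = forecast(fit_nn,    h = h),
            Arima      = forecast(fit_arima, h = h),
            auto.arima = forecast(fit_auto,  h = h)),
       function(f) accuracy(f, test)["Test set", "MASE"])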


We create an S4 class called predic that implements the predictions described above for table_final. The class data structure consists of the table for which we want to make predictions (in our case table_final), a numeric value start_year (indicating the first year) and logical values nnet, arima and autoarima (indicating which models should be used by the class method). The method of the class is called pred; it calculates the predictions of the requested models together with their accuracies and produces a list:

pred_final<-predic(data=table_final,start_year=2006,nnet=T,arima=T,autoarima=T)

The list pred_final_list returned by pred() has two major parts:

  • pred_final_list$data_frame is the formatted table with the model predictions and the corresponding accuracies
  • pred_final_list$fit is the list of fitted models, one for every column of table_final and every requested model

Results Table

The resulting forecasts, their comparison and the estimated accuracies are given in the following table:
pred_final_list<-pred(pred_final)
pred_final_list$data_frame
##                          anything    breakfast          cafe          chi
## 2006                 0.0000000000 0.0000000000  0.0000000000 0.0000000000
## 2007                 0.0000000000 0.0114942529  0.0344827586 0.0229885057
## 2008                 0.0044052863 0.0044052863  0.0044052863 0.0440528634
## 2009                 0.0032362460 0.0064724919  0.0064724919 0.0194174757
## 2010                 0.0039267016 0.0000000000  0.0039267016 0.0000000000
## 2011                 0.0007082153 0.0007082153  0.0000000000 0.0014164306
## 2012                 0.0000000000 0.0013999067  0.0004666356 0.0000000000
## 2013                 0.0003225806 0.0003225806  0.0016129032 0.0009677419
## 2014                 0.0008399832 0.0012599748  0.0006299874 0.0008399832
## nnetar_pred_2015     0.0004870041 0.0006694357  0.0008737658 0.0006625092
## nnetar_acu_train     0.5799990328 0.4562024660  0.5581867797 0.0126729681
## nnetar_acu_test      0.2589113717 5.4265906023  0.1263649267 0.0423509384
## Arima_pred_2015      0.0015103199 0.0014178249 -0.0004394195 0.0018457399
## Arima_acu_train      0.4311651847 0.4677127177  0.5725588054 0.5133055900
## Arima_acu_test       1.0952812602 5.2826537827  0.0464405958 0.1604396743
## auto.arima_pred_2015 0.0014932237 0.0028958565  0.0000000000 0.0000000000
## auto.arima_acu_train 1.1423584371 0.7966564687  0.6105936979 0.8661814757
## auto.arima_acu_test  1.0828831029 4.8954499997  0.0000000000 0.0000000000
##                            choice     customer         day        drink
## 2006                 0.0000000000 0.0000000000 0.000000000 0.0000000000
## 2007                 0.0000000000 0.0000000000 0.011494253 0.0114942529
## 2008                 0.0000000000 0.0000000000 0.004405286 0.0044052863
## 2009                 0.0000000000 0.0032362460 0.012944984 0.0064724919
## 2010                 0.0026178010 0.0013089005 0.005235602 0.0000000000
## 2011                 0.0007082153 0.0007082153 0.001416431 0.0007082153
## 2012                 0.0013999067 0.0013999067 0.001399907 0.0000000000
## 2013                 0.0000000000 0.0012903226 0.000000000 0.0012903226
## 2014                 0.0000000000 0.0008399832 0.002519950 0.0008399832
## nnetar_pred_2015     0.0005235603 0.0010888226 0.049541842 0.0001417605
## nnetar_acu_train     0.6327981957 0.7240096883 0.281766552 0.5316165464
## nnetar_acu_test      0.6327969543 1.3300304854 9.596566990 5.6768029290
## Arima_pred_2015      0.0007968417 0.0007880048 0.004303463 0.0021424061
## Arima_acu_train      0.3570610668 0.4504416850 0.390774654 0.4600122251
## Arima_acu_test       0.9630983731 0.8985370832 0.808392665 5.1361483796
## auto.arima_pred_2015 0.0000000000 0.0009759527 0.004379601 0.0028011725
## auto.arima_acu_train 0.6346623475 0.8441952374 0.691348758 0.8201091662
## auto.arima_acu_test  0.0000000000 1.1128481080 0.822694980 4.9621018478
##                                egg  experience       flavor        food
## 2006                  0.0769230769 0.000000000 0.0000000000 0.000000000
## 2007                  0.0114942529 0.000000000 0.0000000000 0.080459770
## 2008                  0.0088105727 0.013215859 0.0264317181 0.105726872
## 2009                  0.0226537217 0.019417476 0.0064724919 0.087378641
## 2010                  0.0000000000 0.001308901 0.0000000000 0.013089005
## 2011                  0.0007082153 0.001416431 0.0007082153 0.009915014
## 2012                  0.0000000000 0.002333178 0.0000000000 0.012599160
## 2013                  0.0000000000 0.004516129 0.0003225806 0.006451613
## 2014                  0.0000000000 0.001889962 0.0004199916 0.008609828
## nnetar_pred_2015     -0.0009307194 0.002633564 0.0006929874 0.006184085
## nnetar_acu_train      0.1587443537 0.424783498 0.7492331430 0.213158253
## nnetar_acu_test       0.0694258082 2.177864703 0.1332853428 2.442698735
## Arima_pred_2015      -0.0037965084 0.008107857 0.0037132468 0.022857175
## Arima_acu_train       0.6422549371 0.704055479 0.7981453955 0.569484285
## Arima_acu_test        0.2864591960 1.158803070 0.5430722316 1.847667066
## auto.arima_pred_2015  0.0000000000 0.004899771 0.0000000000 0.036025545
## auto.arima_acu_train  1.0109894275 0.936206858 0.5582788787 1.384293639
## auto.arima_acu_test   0.0000000000 1.750708458 0.0000000000 1.351983489
##                             friend        fries          lot         menu
## 2006                  0.0000000000 0.0000000000 0.0000000000 0.0769230769
## 2007                  0.0114942529 0.0000000000 0.0000000000 0.0000000000
## 2008                  0.0044052863 0.0044052863 0.0000000000 0.0088105727
## 2009                  0.0129449838 0.0000000000 0.0032362460 0.0032362460
## 2010                  0.0000000000 0.0000000000 0.0039267016 0.0039267016
## 2011                  0.0000000000 0.0021246459 0.0000000000 0.0007082153
## 2012                  0.0000000000 0.0013999067 0.0000000000 0.0004666356
## 2013                  0.0003225806 0.0000000000 0.0000000000 0.0019354839
## 2014                  0.0002099958 0.0006299874 0.0000000000 0.0012599748
## nnetar_pred_2015      0.0011546244 0.0026404412 0.0005393748 0.0030703288
## nnetar_acu_train      0.4230277131 0.1903928861 0.6868033833 0.1571537793
## nnetar_acu_test       0.2382715542 1.5436043706 0.5494425702 0.2549251537
## Arima_pred_2015      -0.0023397898 0.0016367732 0.0017673708 0.0026509573
## Arima_acu_train       0.5565987565 0.3304991305 0.5868846096 1.0438183035
## Arima_acu_test        0.4621457145 0.9564885041 1.8003617672 0.2172852278
## auto.arima_pred_2015  0.0000000000 0.0000000000 0.0000000000 0.0000000000
## auto.arima_acu_train  0.6447160616 0.5557937783 0.8107395421 0.8858293364
## auto.arima_acu_test   0.0000000000 0.0000000000 0.0000000000 0.0000000000
##                               mood       nothing        order       place
## 2006                  0.000000e+00  0.0000000000 0.0769230769 0.000000000
## 2007                  0.000000e+00  0.0804597701 0.0000000000 0.034482759
## 2008                  0.000000e+00  0.0000000000 0.0044052863 0.013215859
## 2009                  0.000000e+00  0.0226537217 0.0161812298 0.029126214
## 2010                  0.000000e+00  0.0026178010 0.0000000000 0.009162304
## 2011                  2.124646e-03  0.0007082153 0.0000000000 0.003541076
## 2012                  4.666356e-04  0.0018665422 0.0009332711 0.002333178
## 2013                  0.000000e+00  0.0029032258 0.0003225806 0.004516129
## 2014                  6.299874e-04  0.0014699706 0.0002099958 0.001679966
## nnetar_pred_2015     -6.959825e-05 -0.0000267174 0.0014730888 0.011969622
## nnetar_acu_train      7.527431e-01  0.3407682960 0.1206735187 0.095442382
## nnetar_acu_test       1.205529e-01  0.0555530384 0.1088087257 2.341371242
## Arima_pred_2015       3.465542e-04  0.0072826440 0.0075703201 0.009997139
## Arima_acu_train       4.312499e-01  0.4417859069 0.9753572890 0.405932163
## Arima_acu_test        5.682055e-01  0.2785655191 0.5458934874 2.564425241
## auto.arima_pred_2015  0.000000e+00  0.0000000000 0.0000000000 0.010895276
## auto.arima_acu_train  5.868387e-01  0.4788943383 0.7930098942 0.758361833
## auto.arima_acu_test   0.000000e+00  0.0000000000 0.0000000000 2.494985379
##                            price     quality       review     sandwich
## 2006                 0.000000000 0.000000000 0.0000000000 0.0000000000
## 2007                 0.011494253 0.000000000 0.0114942529 0.0000000000
## 2008                 0.004405286 0.008810573 0.0132158590 0.0132158590
## 2009                 0.006472492 0.000000000 0.0129449838 0.0032362460
## 2010                 0.003926702 0.001308901 0.0065445026 0.0013089005
## 2011                 0.001416431 0.000000000 0.0014164306 0.0000000000
## 2012                 0.001866542 0.004199720 0.0023331778 0.0004666356
## 2013                 0.001612903 0.002258065 0.0022580645 0.0006451613
## 2014                 0.001049979 0.001889962 0.0014699706 0.0002099958
## nnetar_pred_2015     0.003577987 0.001283632 0.0027602548 0.0022830861
## nnetar_acu_train     0.174178675 0.438358469 0.3433916626 0.6343804426
## nnetar_acu_test      3.108577092 1.778199867 0.8736578086 0.6568938151
## Arima_pred_2015      0.001520529 0.002487615 0.0008751961 0.0037833547
## Arima_acu_train      0.561518838 0.250895275 0.5314876330 0.6740540400
## Arima_acu_test       3.816522626 1.407672972 0.2612989704 1.1001304171
## auto.arima_pred_2015 0.003582732 0.002051913 0.0057419158 0.0000000000
## auto.arima_acu_train 0.788787689 0.605650945 1.4086703517 0.6165476739
## auto.arima_acu_test  3.204891611 1.537984063 1.7143090681 0.0000000000
##                           seating       server    service         star
## 2006                 0.0000000000 0.0000000000 0.00000000 0.0000000000
## 2007                 0.0000000000 0.0000000000 0.05747126 0.0229885057
## 2008                 0.0000000000 0.0000000000 0.07048458 0.0044052863
## 2009                 0.0000000000 0.0161812298 0.12297735 0.0000000000
## 2010                 0.0000000000 0.0000000000 0.02617801 0.0026178010
## 2011                 0.0007082153 0.0007082153 0.01133144 0.0000000000
## 2012                 0.0004666356 0.0004666356 0.02053196 0.0013999067
## 2013                 0.0019354839 0.0009677419 0.01741935 0.0006451613
## 2014                 0.0004199916 0.0010499790 0.01952961 0.0004199916
## nnetar_pred_2015     0.0023018387 0.0006501627 0.01553851 0.0020261800
## nnetar_acu_train     0.2880289538 0.7310386255 0.05937490 0.5397649112
## nnetar_acu_test      4.6858179408 1.5255019120 3.43525701 0.2732228580
## Arima_pred_2015      0.0017878383 0.0041157571 0.02099686 0.0015488532
## Arima_acu_train      0.5157273672 0.5717483479 0.86236886 0.7600343518
## Arima_acu_test       3.6355397404 0.7265810779 3.25417936 0.2312047444
## auto.arima_pred_2015 0.0004199916 0.0000000000 0.03843595 0.0000000000
## auto.arima_acu_train 0.8888888889 0.5080647030 0.96813879 0.5386606441
## auto.arima_acu_test  0.8540460026 1.6979773240 2.69399204 0.0000000000
##                             taste         wait     waitress          way
## 2006                 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## 2007                 0.0000000000 0.0114942529 0.0114942529 0.0000000000
## 2008                 0.0132158590 0.0132158590 0.0484581498 0.0000000000
## 2009                 0.0097087379 0.0097087379 0.0032362460 0.0097087379
## 2010                 0.0013089005 0.0013089005 0.0013089005 0.0039267016
## 2011                 0.0000000000 0.0000000000 0.0000000000 0.0007082153
## 2012                 0.0013999067 0.0004666356 0.0000000000 0.0004666356
## 2013                 0.0012903226 0.0009677419 0.0000000000 0.0000000000
## 2014                 0.0010499790 0.0008399832 0.0002099958 0.0006299874
## nnetar_pred_2015     0.0013961909 0.0010282713 0.0023940278 0.0002152927
## nnetar_acu_train     0.6499432262 0.4888578181 0.1869491878 0.7357297657
## nnetar_acu_test      1.6770114440 0.3274127128 0.1967764636 0.0906560543
## Arima_pred_2015      0.0008097033 0.0010148372 0.0059949830 0.0027654745
## Arima_acu_train      0.6246988468 0.4507243587 0.9361083178 0.5273689274
## Arima_acu_test       1.8124030050 0.2949334495 0.4937886714 1.1035708475
## auto.arima_pred_2015 0.0000000000 0.0042224568 0.0000000000 0.0000000000
## auto.arima_acu_train 0.8823330878 1.4047658486 0.5921961453 0.6846098777
## auto.arima_acu_test  2.0422564379 1.2271364767 0.0000000000 0.0000000000


Prediction plots

The pred_final_list$fit element is really handy when one wants to efficiently create the plots of interest. For plotting we use the ggplot2 library together with ggfortify. The plots of the 2015 and 2016 predictions for the most significant bad features of a “Cafes” type business in Las Vegas:
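A sketch of one such plot is given below; the indexing into pred_final_list$fit is an assumption, since the exact structure of the fit list is not shown here.

library(forecast)
library(ggplot2)
library(ggfortify)

# plot the forecast of one fitted model, e.g. the ARIMA fit for "service"
fit_service <- pred_final_list$fit[["service"]]$arima
autoplot(forecast(fit_service, h = 2)) +
  labs(title = "Forecast of the 'service' inferiority measure for 2015-2016",
       x = "year", y = "inferiority measure")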

Discussion

  1. We have shown that there are several factors connected to the probability that a certain business is open or closed:
    • average time the business is/was open
    • average rating
    • number of reviews
  2. The average rating of a business evolves in time; specifically, it decreases regardless of the duration of the business or its current average rating
    • good businesses (with an average rating > 4 in the year of their first review) evolve slowly towards an average rating in the range 3.5 to 4 stars
    • bad businesses stay open for a shorter time (approximately 60% of the open time of a good business)
  3. In the NLP part we saw that the most significant bad features are “service”, “food”, “experience”, “cafe”, “breakfast”, etc.
  4. For example, the predictions for “experience” and “food” suggest a potential increase in significance, while the predictions for “service”, “price” and “breakfast” are on average stable. The message to take from this is that anyone who wants to start, invest in or improve a Cafes business in Las Vegas should be careful about “food”, “cafe” and “breakfast” and provide good “service” with a good customer “experience”.