OS X El Capitan, version 10.11.2, and Linux Mint 17.2 Cinnamon 64-bit; R version 3.2.2 (2015-08-14), “Fire Safety”
This data analysis focuses on predicting the probability that a business will close in the near future and on identifying the reasons why businesses receive bad ratings.
For a business owner or an investor, it is important to predict the future of a business with good accuracy. Intuitively, a business with a bad rating remains open for a shorter time than a business with a good or above-average rating. However, many factors can cause even a good business to close. What is the probability of a business closing given its average star rating? Does the answer depend on how long the business has existed? Which bad business practices influence the average business rating?
Imagine that an outsider (e.g. a person not familiar with a given place) wants to open a business or invest in some city. A relevant question is then what the bad features of a given business type in a given city are. One could learn about these weaknesses, improve on them and thus gain a business advantage. To learn what is wrong, we let the customers of the business speak. This is where the Yelp data set comes in. From the table of yearly statistics (the frequency of occurrence of each bad feature) we predict the future relevance of the bad features.
This information, together with the future predictions, might also be highly relevant for a business that is already open. The owner might focus on improving the bad features of the owner's own business class and, knowing their likely future relevance, put more effort into the relevant areas.
We first model the dependence of the average review rating on time and compute the probability that a business stays open given its average star rating. Second, we have an accurate way to suggest changes to a bad business culture, based on the practices of many bad businesses and on customer reviews. We develop a natural language processing (NLP) method for semantic text analysis and select only the nouns in the customer reviews that are connected to bad business practices.
We briefly explain the NLP part further. We read the bad reviews (3 stars or less) in a given city for a given business type. We analyse the text and extract information about the bad features. In the process we also obtain statistics, i.e. how bad a given feature was. We rescale these with respect to the number of reviews to calculate the measure of inferiority. At the end of the NLP part we forecast the future development of the inferiority measure for the bad feature vector. Several time series prediction methods are used: a neural network method and two types of ARIMA algorithm.
In this section, we introduce the methods we use to predict whether a business will remain open or close. In particular, we estimate the probability of business closure.
Methods:
simple histogram comparison of the number of received reviews for open/closed businesses
for forecasting we use methods from the forecast library: nnetar() for neural network time series prediction, Arima() for a custom ARIMA model and auto.arima() for automatic ARIMA model selection. We used no seasonality in the time series prediction methods.
Below, we take the business data and quickly plot the variable pairs to see any dependence between the average number of stars, the location of the business (longitude and latitude) and whether the business is still running (open) or has closed.
From the above plot, we can see that businesses that are still open are reviewed more and that the average number of stars tends towards 4 as the number of reviews increases. Therefore, there might indeed be some correlation between the number of reviews, the number of stars and whether the business is open or closed.
Before further analysis, we explore the data a little more and plot the average number of reviews for each star rating for closed/open businesses, normalized by the number of businesses with that average star rating. The normalization is performed separately for open and closed businesses so that the quantities are comparable.
From the above we can infer that businesses that are closed have a higher chance of having a bad average rating:
Note, however, that the length of time for which a business exists has a non-trivial influence on the above:
Therefore, the above graph does not directly explain why a certain business stayed open or was closed and we will explain the time dependence and reasons for bad reviews in the next section (the (NLP) analysis and time prediction).
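The normalization used earlier in this section (average number of reviews per business, computed separately for open and closed businesses and for each star rating) can be sketched in base R. The data frame bus_toy and its column names are hypothetical stand-ins for the business data, and the values are made up for illustration only.

```r
# Toy stand-in for the business table (hypothetical names, illustrative values)
bus_toy <- data.frame(
  stars        = c(3.0, 3.0, 4.0, 4.0, 4.0, 3.0, 4.0),
  review_count = c(10, 20, 30, 50, 40, 6, 12),
  open         = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)
)

# For every (open/closed, star rating) group: total number of reviews divided
# by the number of businesses in that group, i.e. the group mean.
avg_reviews <- aggregate(review_count ~ open + stars, data = bus_toy, FUN = mean)
avg_reviews
```

Because each group is divided by its own number of businesses, the open and closed curves stay comparable even though there are far more open businesses than closed ones.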
In this subsection we focus on the exploratory analyses necessary for the NLP part. More concretely, we take the bad reviews for a given business type and city in a given year. For brevity, we focus on the city of Las Vegas and analyse the bad reviews using sentiment analysis (packages wordcloud, coreNLP, openNLP, rJava). We pick out the words with negative sentiment and search for the nouns in their “epsilon” neighborhood. The epsilon neighborhood is the set of words centered at a given (bad) word that includes the 3 closest words to its left and to its right. The value of “epsilon” was chosen by looking at the data structure and estimating the computational time. The rationale for choosing the nouns in that neighborhood is the following observation:
Nouns usually carry the factual information about the subject, however nouns are not usually considered negative.
Negative words (often adjectives) usually do not carry any factual information but they carry the sentiment information.
Let us give an example. Consider the statement “the bread was awful”. This simple example shows the structure. The sentiment analysis picks the word “awful” as a negative word, but it usually classifies words like “bread” or “was” as neutral. Looking for nouns in the vicinity of the word “awful” therefore gives us the factual information. Moreover, we look for a vector of nouns with weights. Each weight is the number of times a noun was present in the vicinity of a negative word (in the reviews for a given city, business type and year).
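A minimal base R sketch of the epsilon-neighborhood idea follows. It is not the implementation used in this report: the two small word lists are hypothetical replacements for the sentiment analysis and the part-of-speech tagging, and the function name near_nouns() is made up for illustration.

```r
# Hypothetical toy lexicons; the report relies on sentiment and tagging packages instead
negative_words <- c("awful", "rude", "cold")
noun_words     <- c("bread", "waitress", "coffee", "table")

# Return the nouns found within `eps` words of any negative word in `review`
near_nouns <- function(review, eps = 3) {
  tokens <- strsplit(tolower(gsub("[^A-Za-z ]", " ", review)), " +")[[1]]
  tokens <- tokens[tokens != ""]
  found  <- character(0)
  for (i in which(tokens %in% negative_words)) {
    window <- tokens[max(1, i - eps):min(length(tokens), i + eps)]
    found  <- c(found, window[window %in% noun_words])
  }
  found
}

reviews <- c("The bread was awful.",
             "Our waitress was rude and the coffee was cold.")

# Weight of a noun = number of times it appears near a negative word
weights <- table(unlist(lapply(reviews, near_nouns)))
weights
```

In this sketch a noun that lies in the neighborhood of two different negative words is counted twice, which is one possible reading of the weights described above.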
In the outlined way we learn what is bad about a certain business using the inferiority measure. In the second step we use this information to predict the inferiority measures of the bad business features in time. We consider three different time prediction models: the nnetar() method of the package forecast provides the neural network model for time prediction, while Arima() and auto.arima() of the package forecast provide ARIMA-type predictions. We calculate the predicted values of all three models for the most significant bad features for the years 2015 and 2016 and estimate the accuracy of each prediction for each model (compared using the MASE = mean absolute scaled error).
For brevity and because of the computational cost, we perform the analysis for the city of Las Vegas and the business type “Cafes”, for the years 2006 to 2014. (Note: the original data contain more years, but there were no or very few data for the years 2004, 2005 and 2015, so those years are disregarded.)
Because we want to keep everything nicely organised and we like object-oriented programming, we create an S4 class that subsets our data to a manageable size. The class is called busS and takes the reviews data frame, the business data frame, a city, a business type and a year. One method of the busS class, getWholeBusiness(), subsets the review data (the data frame is called rev from now on) for a given city, a given business type (“Cafes”) and a given year (from 2006 to 2014). Code for the subsetting (assuming busS is already loaded with all its methods):
df<-busS(data_rev=rev,data=bus,city=las_vegas,business="Cafes",year=2006)
df<-getWholeBusiness(df)
The previous code subsets the rev data table by the las_vegas city. We also want to explore whether we have enough data for each year for the city of Las Vegas in the business category “Cafes”. The following plot shows us the situation:
In summary: we know the names of Las Vegas, we can easily subset the whole business using the busS class, and we know that for the business type “Cafes” we have enough data for exploration. The last plot also shows that for the years 2004 and 2005 we have a relatively small amount of data. We also disregard the year 2015, since the data there are clearly incomplete.
Below, we present a classification tree that gives the probability that a certain business is open/closed based on the number of stars and the review count. The classification tree is computed with rpart through the caret package, which uses bootstrap resampling to find an accurate classification.
Decision Tree Interpretation: To find the probability that a certain business is open, we treat the open/closed indicator as a continuous variable and not as a factor. The root node (P=0.88) gives the overall probability that a business is open. From the classification tree, if the business has more than 4.8 stars, the probability that it is open is 94%. The interpretation of the rest of the tree is analogous.
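To see why treating the open/closed indicator as a number turns node values into probabilities, consider a single split done by hand. This is only an illustration on made-up data, not the caret model loaded below: a regression tree predicts the mean of the response in each node, and the mean of a 0/1 variable is the fraction of ones.

```r
# Toy data: 1 = open, 0 = closed (illustrative values only)
toy <- data.frame(
  stars = c(5.0, 5.0, 5.0, 5.0, 3.0, 3.0, 3.0, 3.0),
  open  = c(1, 1, 1, 1, 1, 1, 1, 0)
)

# Root node: overall fraction of open businesses
p_root <- mean(toy$open)

# One split on the star rating; each leaf predicts the mean of `open`,
# which is the fraction of open businesses in that leaf
p_leaf <- tapply(toy$open, toy$stars > 4.8, mean)

p_root
p_leaf
```

Here the root value is the share of open businesses in the whole toy sample, and the two leaf values are the shares on either side of the split.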
library(rattle)
# load the trained model for classification tree
load("businessData.rda")
modFit
## CART
##
## 42830 samples
## 5 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 42830, 42830, 42830, 42830, 42830, 42830, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared RMSE SD Rsquared SD
## 0.000964269 0.3262800 0.006794689 0.002199129 0.0009211813
## 0.002016676 0.3265864 0.004906140 0.002114058 0.0008596819
## 0.004573263 0.3270699 0.003965632 0.002246362 0.0006095627
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.000964269.
Time Range Influence:
The following classification tree decides on the probability of a certain business to be open according to:
library(rattle)
# load the trained classification tree that includes the time range
load("modelWTimeRange.rda")
model
## CART
##
## 60785 samples
## 4 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 60785, 60785, 60785, 60785, 60785, 60785, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared RMSE SD Rsquared SD
## 0.004549401 0.3242587 0.017303553 0.002191247 0.002501036
## 0.007419798 0.3253123 0.011302606 0.002228746 0.003559455
## 0.008100342 0.3259707 0.009510323 0.002491002 0.002983017
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.004549401.
Time vs. average rating:
Below we show how the mean time (in days) over which a business stays open depends on its average rating. Since our dataset covers a limited time range, we want to make sure that the curve is robust with respect to the time window we take. Therefore, the plot contains data for different subsets, selected according to when the first review was written. The subsets range from businesses first reviewed at most 2 years ago (in 2013) up to businesses open since 2004. The size of each data point indicates the dataset: for example, size 10 (maxYears = 10) means that the data point belongs to the dataset of businesses first reviewed within the last 10 years. The data in each dataset are rescaled by the maximum average time duration for open/closed businesses.
Even across the different datasets we see the same type of behavior, and the curves fitted to the data for both open and closed businesses are a good fit. Grey lines show the standard deviation. We did not use any weights, i.e. the data point corresponding to maxYears = 2 has the same weight as the data point for maxYears = 10.
Interpretation of the average rating - open time dependence
The short duration of 5 star businesses is (as we show below) due to the fact that a long-lasting business also tends to receive some bad reviews, so its average rating becomes lower (around 4 stars).
Businesses with 5 stars that are closed were open only for a very short time (otherwise they would have received some bad ratings and moved to a category with a worse average rating).
Businesses with a very bad rating that are closed were open for a longer time than those that are still open. This suggests that the very bad businesses that are still open are inevitably doomed to close.
From the plot, we see that the average lifetime of 1 and 1.5 star businesses is approximately the same. The average duration of a very bad business is much shorter than that of a good business.
Businesses with 3-4 stars are in general the businesses that are open for the longest time. If the business was closed, then its average open time does not differ much from that of bad businesses.
As time passes, even the best business has the ‘luck’ of serving demanding customers, resulting in worse ratings. Even in the case of very bad businesses, the average rating decreases over time.
As a result, the stars vs. average time open curve has a parabolic shape: good businesses cannot keep their great score and become ~4 star businesses over time.
We have verified that the average rating tends to decrease over time, whatever the average rating is. Below, we show a couple of plots of the evolution of the average rating over time.
Plot of time evolution of bad businesses (<=2 star average rating)
Dots are specific businesses and color is different for a different business. Lines connect the same business over time.
Businesses open in 2006
Businesses open in 2007
Plot of time evolution of good businesses (>=4 star average rating)
Dots are specific businesses and color is different for a different business. Lines connect the same business over time.
Businesses open in 2004
Businesses open in 2005
Plot of time evolution of businesses with the average rating (>2 and <=3.5 star rating)
Dots are specific businesses and color is different for a different business. Lines connect the same business over time.
Businesses open in 2005
Businesses open in 2006
We want to find out what is wrong with a particular business type. The class called words wraps a character string, and its methods perform various partial analyses of the given string. The two main ingredients of the class are the sentiment analysis function and the tagging function. For the sentiment analysis and word tagging we use methods of the packages wordcloud, coreNLP, openNLP, tm, and syuzhet. The main method of the words class is findNearNouns(). This method takes a string, in our case a particular review, and returns the list of nouns that are in the “epsilon” neighborhood (chosen as plus or minus three words) of a negative sentiment word.
We also create the class evaluate, whose data structure consists of a character string (in our case negative sentiment nouns) and a file name. This class formats the result of the findNearNouns() function and saves it to the local drive. In the class we use a fault-tolerant implementation of findNearNouns() (the method eval_summary()), since we want to loop through large chunks of data in the next step.
The next stage is to loop through the years and use the methods getWholeBusiness() and findNearNouns() (implemented as eval_summary()). We obtain the list of nouns that represent the problems of “Cafes” type businesses in the years 2006-2014. We also subset the reviews to those with three or fewer stars:
set <- c(2006:2014)
SF <- sapply(set, function(x) {
  # subset the reviews for year x, then keep those with three or fewer stars
  cafe <- busS(data_rev = rev, data = bus, city = las, business = "Cafes", year = as.numeric(x))
  cafe <- subset(getWholeBusiness(cafe), stars <= 3)
  cafe <- evaluate(text = cafe$text, filename = paste("cafe_rev_", x, collapse = "", sep = ""))
  eval_summary(cafe)
})
Another class, called taB, and its methods are used to get a nicely formatted and trimmed (to the desired level) version of the bad features, i.e. a clear representation of the list SF, which holds the bad features together with their occurrence counts. The taB class data structure consists of the list SF, the numeric value low_cutof and a numeric vector years. The low_cutof value holds a user-defined lower cutoff on the number of occurrences; for example, we can disregard at the outset words that occur rarely over the years. In the class we can also select the years that we want to see in the final table.
There are several steps one has to take in order to properly analyse the content of SF. The list SF contains words that have the same meaning but are listed separately, for example the singular and plural of a noun. We need to identify those words and count them as the same. This is done using the method called list_for_no_pick(); consider the following code:
SF_rev<-taB(theSF=SF,low_cutof=0,years=c(2006:2014))
no_pick_list<-list_for_no_pick(SF_rev)
We create an instance of the class taB called SF_rev. The method list_for_no_pick() returns the list of potentially identical words in the list SF. We observed that, in the resulting list, more word pairs share the same meaning than do not, so it is easier to pick the pairs that should NOT be considered equal. The next step is therefore to find the pairs that should not be merged and to store their positions in the no_pick numeric vector:
no_pick_list
## [[1]]
## [1] "ass" "assumption"
##
## [[2]]
## [1] "bagel" "bagels"
##
## [[3]]
## [1] "basket" "baskets"
##
## [[4]]
## [1] "bit" "bites"
##
## [[5]]
## [1] "bottle" "bottles"
##
## [[6]]
## [1] "bread" "breads"
##
## [[7]]
## [1] "break." "breakfast"
In the above excerpt of the no_pick_list list we can easily identify which words should NOT be treated as the same, for example the pair “ass”-“assumption”. Most of the pairs in no_pick_list, however, should indeed be merged. The actual no_pick numeric vector in our case is no_pick<-c(1,7,13,14,20,21,22,32,39,40,49,59,75,83,87). Once no_pick is identified, the taB class method merge_same() takes the SF_rev instance of the class and, implicitly, the numeric vector no_pick, and merges the words that have the same meaning. It also produces a nicely formatted table:
table_rev<-merge_same(SF_rev)
## anything breakfast cafe chi choice customer day drink egg experience
## 2006 0 0 0 0 0 0 0 0 1 0
## 2007 0 1 3 2 0 0 1 1 1 0
## 2008 1 1 1 10 0 0 1 1 2 3
## 2009 1 2 2 6 0 1 4 2 7 6
## 2010 3 0 3 0 2 1 4 0 0 0
## 2011 1 1 0 2 1 1 2 1 1 2
## 2012 0 3 1 0 3 3 3 0 0 5
## 2013 1 0 5 3 0 4 0 1 0 10
## 2014 4 5 3 4 0 4 12 3 0 9
## flavor food friend fries lot menu mood nothing order place price
## 2006 0 0 0 0 0 1 0 0 1 0 0
## 2007 0 7 1 0 0 0 0 7 0 3 1
## 2008 6 24 1 1 0 2 0 0 1 3 1
## 2009 2 27 4 0 1 1 0 7 5 9 2
## 2010 0 9 0 0 3 3 0 2 0 7 3
## 2011 1 14 0 3 0 1 3 1 0 5 2
## 2012 0 23 0 3 0 1 1 4 2 5 2
## 2013 1 14 1 0 0 6 0 9 1 12 3
## 2014 2 38 1 3 0 6 3 7 1 2 2
## quality review sandwich seating server service star taste wait
## 2006 0 0 0 0 0 0 0 0 0
## 2007 0 1 0 0 0 5 2 0 1
## 2008 2 3 3 0 0 16 1 3 3
## 2009 0 4 1 0 5 38 0 3 3
## 2010 1 5 1 0 0 19 2 1 1
## 2011 0 2 0 1 1 12 0 0 0
## 2012 4 5 1 1 1 35 3 3 1
## 2013 6 7 2 6 1 40 2 4 3
## 2014 9 7 1 2 1 74 2 5 4
## waitress way
## 2006 0 0
## 2007 1 0
## 2008 11 0
## 2009 1 3
## 2010 1 3
## 2011 0 1
## 2012 0 1
## 2013 0 0
## 2014 1 3
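The candidate-pair and merging steps behind list_for_no_pick() and merge_same() can be pictured with a small base R sketch. The internals of those methods are not shown in this report, so the prefix rule below is an assumption, and the objects counts, pairs, to_merge and merged are hypothetical names with illustrative values.

```r
# Toy yearly noun counts (illustrative values, not the report's data)
counts <- matrix(c(1, 2,  0, 3,  4, 1,  2, 0), nrow = 2,
                 dimnames = list(c("2013", "2014"),
                                 c("bagel", "bagels", "ass", "assumption")))

# Candidate pairs (assumed rule): a word and a longer word that starts with it
words <- colnames(counts)
pairs <- list()
for (w in words) {
  longer <- words[words != w & substr(words, 1, nchar(w)) == w]
  for (l in longer) pairs[[length(pairs) + 1]] <- c(w, l)
}

# Positions of the pairs that should NOT be merged, chosen by inspection
no_pick  <- c(2)
to_merge <- if (length(no_pick) > 0) pairs[-no_pick] else pairs

# Merge: add the longer word's column to the shorter word's column, then drop it
merged <- counts
for (p in to_merge) {
  merged[, p[1]] <- merged[, p[1]] + merged[, p[2]]
  merged <- merged[, colnames(merged) != p[2], drop = FALSE]
}
merged
```

In the toy example the pair "bagel"/"bagels" is merged, while "ass"/"assumption" is kept apart because its position is listed in no_pick.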
The table above is, however, not the whole story. The Yelp data set has a second large source of relevant information: the data set containing the tips. Tips can be considered very short reviews, and their data structure is indeed almost the same. We therefore analyse the tips data set in essentially the same way. We create the SF list again for the tips data set, called SF_tips, and use the taB class and its methods to obtain the tips analog of the last table. The rationale is that it should not matter where people write about what was wrong; only the overall occurrence of a given bad feature matters. We therefore ADD the review and tips bad feature (occurrence) tables together.
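Adding two occurrence tables whose noun columns differ can be sketched as follows. This is an assumption about how the addition might be done, not the code used in the report; the objects table_rev_toy, table_tips_toy and the helper pad() are hypothetical, the values are illustrative, and both tables are assumed to cover the same years.

```r
# Toy bad-feature tables (illustrative values); rows are years, columns are nouns
table_rev_toy  <- matrix(c(1, 2, 3, 4), nrow = 2,
                         dimnames = list(c("2013", "2014"), c("food", "service")))
table_tips_toy <- matrix(c(5, 6, 7, 8), nrow = 2,
                         dimnames = list(c("2013", "2014"), c("service", "wait")))

# Union of the nouns seen in either source; a noun missing from a table counts as zero
all_nouns <- union(colnames(table_rev_toy), colnames(table_tips_toy))
pad <- function(tab) {
  out <- matrix(0, nrow = nrow(tab), ncol = length(all_nouns),
                dimnames = list(rownames(tab), all_nouns))
  out[, colnames(tab)] <- tab
  out
}

table_sum <- pad(table_rev_toy) + pad(table_tips_toy)
table_sum
```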
The next step is to rescale table_final by the number of reviews and tips in a given year, so that each entry becomes a proportion:
table_final
## anything breakfast cafe chi choice
## 2006 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## 2007 0.0000000000 0.0114942529 0.0344827586 0.0229885057 0.0000000000
## 2008 0.0044052863 0.0044052863 0.0044052863 0.0440528634 0.0000000000
## 2009 0.0032362460 0.0064724919 0.0064724919 0.0194174757 0.0000000000
## 2010 0.0039267016 0.0000000000 0.0039267016 0.0000000000 0.0026178010
## 2011 0.0007082153 0.0007082153 0.0000000000 0.0014164306 0.0007082153
## 2012 0.0000000000 0.0013999067 0.0004666356 0.0000000000 0.0013999067
## 2013 0.0003225806 0.0003225806 0.0016129032 0.0009677419 0.0000000000
## 2014 0.0008399832 0.0012599748 0.0006299874 0.0008399832 0.0000000000
## customer day drink egg experience
## 2006 0.0000000000 0.000000000 0.0000000000 0.0769230769 0.000000000
## 2007 0.0000000000 0.011494253 0.0114942529 0.0114942529 0.000000000
## 2008 0.0000000000 0.004405286 0.0044052863 0.0088105727 0.013215859
## 2009 0.0032362460 0.012944984 0.0064724919 0.0226537217 0.019417476
## 2010 0.0013089005 0.005235602 0.0000000000 0.0000000000 0.001308901
## 2011 0.0007082153 0.001416431 0.0007082153 0.0007082153 0.001416431
## 2012 0.0013999067 0.001399907 0.0000000000 0.0000000000 0.002333178
## 2013 0.0012903226 0.000000000 0.0012903226 0.0000000000 0.004516129
## 2014 0.0008399832 0.002519950 0.0008399832 0.0000000000 0.001889962
## flavor food friend fries lot
## 2006 0.0000000000 0.000000000 0.0000000000 0.0000000000 0.000000000
## 2007 0.0000000000 0.080459770 0.0114942529 0.0000000000 0.000000000
## 2008 0.0264317181 0.105726872 0.0044052863 0.0044052863 0.000000000
## 2009 0.0064724919 0.087378641 0.0129449838 0.0000000000 0.003236246
## 2010 0.0000000000 0.013089005 0.0000000000 0.0000000000 0.003926702
## 2011 0.0007082153 0.009915014 0.0000000000 0.0021246459 0.000000000
## 2012 0.0000000000 0.012599160 0.0000000000 0.0013999067 0.000000000
## 2013 0.0003225806 0.006451613 0.0003225806 0.0000000000 0.000000000
## 2014 0.0004199916 0.008609828 0.0002099958 0.0006299874 0.000000000
## menu mood nothing order place
## 2006 0.0769230769 0.0000000000 0.0000000000 0.0769230769 0.000000000
## 2007 0.0000000000 0.0000000000 0.0804597701 0.0000000000 0.034482759
## 2008 0.0088105727 0.0000000000 0.0000000000 0.0044052863 0.013215859
## 2009 0.0032362460 0.0000000000 0.0226537217 0.0161812298 0.029126214
## 2010 0.0039267016 0.0000000000 0.0026178010 0.0000000000 0.009162304
## 2011 0.0007082153 0.0021246459 0.0007082153 0.0000000000 0.003541076
## 2012 0.0004666356 0.0004666356 0.0018665422 0.0009332711 0.002333178
## 2013 0.0019354839 0.0000000000 0.0029032258 0.0003225806 0.004516129
## 2014 0.0012599748 0.0006299874 0.0014699706 0.0002099958 0.001679966
## price quality review sandwich seating
## 2006 0.000000000 0.000000000 0.000000000 0.0000000000 0.0000000000
## 2007 0.011494253 0.000000000 0.011494253 0.0000000000 0.0000000000
## 2008 0.004405286 0.008810573 0.013215859 0.0132158590 0.0000000000
## 2009 0.006472492 0.000000000 0.012944984 0.0032362460 0.0000000000
## 2010 0.003926702 0.001308901 0.006544503 0.0013089005 0.0000000000
## 2011 0.001416431 0.000000000 0.001416431 0.0000000000 0.0007082153
## 2012 0.001866542 0.004199720 0.002333178 0.0004666356 0.0004666356
## 2013 0.001612903 0.002258065 0.002258065 0.0006451613 0.0019354839
## 2014 0.001049979 0.001889962 0.001469971 0.0002099958 0.0004199916
## server service star taste wait
## 2006 0.0000000000 0.00000000 0.0000000000 0.000000000 0.0000000000
## 2007 0.0000000000 0.05747126 0.0229885057 0.000000000 0.0114942529
## 2008 0.0000000000 0.07048458 0.0044052863 0.013215859 0.0132158590
## 2009 0.0161812298 0.12297735 0.0000000000 0.009708738 0.0097087379
## 2010 0.0000000000 0.02617801 0.0026178010 0.001308901 0.0013089005
## 2011 0.0007082153 0.01133144 0.0000000000 0.000000000 0.0000000000
## 2012 0.0004666356 0.02053196 0.0013999067 0.001399907 0.0004666356
## 2013 0.0009677419 0.01741935 0.0006451613 0.001290323 0.0009677419
## 2014 0.0010499790 0.01952961 0.0004199916 0.001049979 0.0008399832
## waitress way
## 2006 0.0000000000 0.0000000000
## 2007 0.0114942529 0.0000000000
## 2008 0.0484581498 0.0000000000
## 2009 0.0032362460 0.0097087379
## 2010 0.0013089005 0.0039267016
## 2011 0.0000000000 0.0007082153
## 2012 0.0000000000 0.0004666356
## 2013 0.0000000000 0.0000000000
## 2014 0.0002099958 0.0006299874
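The rescaling by the yearly number of reviews and tips amounts to dividing every row of the occurrence table by one number per year. A minimal base R sketch, with hypothetical object names and illustrative values rather than the report's data:

```r
# Toy occurrence table and yearly totals of reviews plus tips (illustrative values)
occurrences <- matrix(c(1, 3, 0, 6), nrow = 2,
                      dimnames = list(c("2006", "2007"), c("egg", "food")))
n_texts <- c("2006" = 10, "2007" = 60)

# Divide every row (year) by the number of reviews and tips written in that year
inferiority <- sweep(occurrences, 1, n_texts[rownames(occurrences)], "/")
inferiority
```

Indexing n_texts by the row names keeps each year aligned with its own total even if the two objects are stored in a different order.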
Our last step is to make time series predictions for table_final. We use three time series prediction methods from the forecast library. Later on we also need the fpp library.
The first model is nnetar(); the second and third models are Arima() and auto.arima(). The nnetar() model trains a neural network on the patterns of the previous years and uses it to predict future values. The ARIMA methods are univariate time series prediction methods.
nnetar() builds a feed-forward neural network whose inputs are lagged values of the time series. The one we used has size=18 nodes in its single hidden layer, and we average over repeats=50 networks fitted with different random starting weights.
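The automatic ARIMA modelling in auto.arima() uses the Hyndman and Khandakar algorithm, which combines unit root tests, minimization of the AICc and MLE. For Arima() we follow the description given at the webpage https://www.otexts.org/fpp/8/7 to find the best model, with the parameters order=c(p,d,q)=c(0,1,4), where p is the order of the autoregressive part, d is the degree of differencing and q is the order of the moving average part.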
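For reference, a non-seasonal ARIMA(0,1,4) model can be written out as below. The notation (y_t for the series, ε_t for the white-noise errors, θ_i for the moving average coefficients) is standard textbook notation rather than something defined in this report, and the absence of a constant term is an assumption.

```latex
y_t - y_{t-1} = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \theta_3 \varepsilon_{t-3} + \theta_4 \varepsilon_{t-4}
```

With p = 0 there are no autoregressive terms, d = 1 means the model is fitted to the first differences, and q = 4 gives four lagged error terms.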
We also estimate the model accuracies. We divide the dataset (table_final) into a training and a test set in the ratio 9:1. In general, accuracy(train,test) returns several measures of model accuracy. We use the mean absolute scaled error (MASE), which is proposed as a “generally applicable measurement of forecast accuracy without the problems seen in the other measurements”, see: http://robjhyndman.com/papers/foresight.pdf.
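To make the error measure concrete, the following base R sketch computes the mean absolute scaled error for a non-seasonal series by hand. The function name mase() and the toy numbers are hypothetical; in the analysis itself the values come from accuracy().

```r
# MASE for a non-seasonal series: mean absolute forecast error on the test set,
# scaled by the in-sample mean absolute error of the naive (previous value) forecast
mase <- function(train, test, forecasts) {
  scale <- mean(abs(diff(train)))
  mean(abs(test - forecasts)) / scale
}

# Toy series (illustrative values only)
train <- c(2, 4, 3, 5, 4)
test  <- c(6)

# The naive forecast repeats the last training value
naive_forecast <- tail(train, 1)
mase(train, test, naive_forecast)
```

A value below 1 means that the forecast errors are, on average, smaller than the one-step errors of the naive forecast on the training data.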
We create the S4 class called predic that implements the predictions described above for table_final. The class data structure consists of the table for which we want predictions (in our case table_final), a numeric value called start_year (the year with which to start) and the logical values nnet, arima and autoarima (indicating which models the class method should use). The method of the class is called pred. It calculates the predictions of the selected models (according to the logical values nnet, arima, autoarima) together with their accuracies and produces a list:
pred_final<-predic(data=table_final,start_year=2006,nnet=T,arima=T,autoarima=T)
The resulting list has two major parts:
pred_final_list<-pred(pred_final)
pred_final_list$data_frame
## anything breakfast cafe chi
## 2006 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## 2007 0.0000000000 0.0114942529 0.0344827586 0.0229885057
## 2008 0.0044052863 0.0044052863 0.0044052863 0.0440528634
## 2009 0.0032362460 0.0064724919 0.0064724919 0.0194174757
## 2010 0.0039267016 0.0000000000 0.0039267016 0.0000000000
## 2011 0.0007082153 0.0007082153 0.0000000000 0.0014164306
## 2012 0.0000000000 0.0013999067 0.0004666356 0.0000000000
## 2013 0.0003225806 0.0003225806 0.0016129032 0.0009677419
## 2014 0.0008399832 0.0012599748 0.0006299874 0.0008399832
## nnetar_pred_2015 0.0004870041 0.0006694357 0.0008737658 0.0006625092
## nnetar_acu_train 0.5799990328 0.4562024660 0.5581867797 0.0126729681
## nnetar_acu_test 0.2589113717 5.4265906023 0.1263649267 0.0423509384
## Arima_pred_2015 0.0015103199 0.0014178249 -0.0004394195 0.0018457399
## Arima_acu_train 0.4311651847 0.4677127177 0.5725588054 0.5133055900
## Arima_acu_test 1.0952812602 5.2826537827 0.0464405958 0.1604396743
## auto.arima_pred_2015 0.0014932237 0.0028958565 0.0000000000 0.0000000000
## auto.arima_acu_train 1.1423584371 0.7966564687 0.6105936979 0.8661814757
## auto.arima_acu_test 1.0828831029 4.8954499997 0.0000000000 0.0000000000
## choice customer day drink
## 2006 0.0000000000 0.0000000000 0.000000000 0.0000000000
## 2007 0.0000000000 0.0000000000 0.011494253 0.0114942529
## 2008 0.0000000000 0.0000000000 0.004405286 0.0044052863
## 2009 0.0000000000 0.0032362460 0.012944984 0.0064724919
## 2010 0.0026178010 0.0013089005 0.005235602 0.0000000000
## 2011 0.0007082153 0.0007082153 0.001416431 0.0007082153
## 2012 0.0013999067 0.0013999067 0.001399907 0.0000000000
## 2013 0.0000000000 0.0012903226 0.000000000 0.0012903226
## 2014 0.0000000000 0.0008399832 0.002519950 0.0008399832
## nnetar_pred_2015 0.0005235603 0.0010888226 0.049541842 0.0001417605
## nnetar_acu_train 0.6327981957 0.7240096883 0.281766552 0.5316165464
## nnetar_acu_test 0.6327969543 1.3300304854 9.596566990 5.6768029290
## Arima_pred_2015 0.0007968417 0.0007880048 0.004303463 0.0021424061
## Arima_acu_train 0.3570610668 0.4504416850 0.390774654 0.4600122251
## Arima_acu_test 0.9630983731 0.8985370832 0.808392665 5.1361483796
## auto.arima_pred_2015 0.0000000000 0.0009759527 0.004379601 0.0028011725
## auto.arima_acu_train 0.6346623475 0.8441952374 0.691348758 0.8201091662
## auto.arima_acu_test 0.0000000000 1.1128481080 0.822694980 4.9621018478
## egg experience flavor food
## 2006 0.0769230769 0.000000000 0.0000000000 0.000000000
## 2007 0.0114942529 0.000000000 0.0000000000 0.080459770
## 2008 0.0088105727 0.013215859 0.0264317181 0.105726872
## 2009 0.0226537217 0.019417476 0.0064724919 0.087378641
## 2010 0.0000000000 0.001308901 0.0000000000 0.013089005
## 2011 0.0007082153 0.001416431 0.0007082153 0.009915014
## 2012 0.0000000000 0.002333178 0.0000000000 0.012599160
## 2013 0.0000000000 0.004516129 0.0003225806 0.006451613
## 2014 0.0000000000 0.001889962 0.0004199916 0.008609828
## nnetar_pred_2015 -0.0009307194 0.002633564 0.0006929874 0.006184085
## nnetar_acu_train 0.1587443537 0.424783498 0.7492331430 0.213158253
## nnetar_acu_test 0.0694258082 2.177864703 0.1332853428 2.442698735
## Arima_pred_2015 -0.0037965084 0.008107857 0.0037132468 0.022857175
## Arima_acu_train 0.6422549371 0.704055479 0.7981453955 0.569484285
## Arima_acu_test 0.2864591960 1.158803070 0.5430722316 1.847667066
## auto.arima_pred_2015 0.0000000000 0.004899771 0.0000000000 0.036025545
## auto.arima_acu_train 1.0109894275 0.936206858 0.5582788787 1.384293639
## auto.arima_acu_test 0.0000000000 1.750708458 0.0000000000 1.351983489
## friend fries lot menu
## 2006 0.0000000000 0.0000000000 0.0000000000 0.0769230769
## 2007 0.0114942529 0.0000000000 0.0000000000 0.0000000000
## 2008 0.0044052863 0.0044052863 0.0000000000 0.0088105727
## 2009 0.0129449838 0.0000000000 0.0032362460 0.0032362460
## 2010 0.0000000000 0.0000000000 0.0039267016 0.0039267016
## 2011 0.0000000000 0.0021246459 0.0000000000 0.0007082153
## 2012 0.0000000000 0.0013999067 0.0000000000 0.0004666356
## 2013 0.0003225806 0.0000000000 0.0000000000 0.0019354839
## 2014 0.0002099958 0.0006299874 0.0000000000 0.0012599748
## nnetar_pred_2015 0.0011546244 0.0026404412 0.0005393748 0.0030703288
## nnetar_acu_train 0.4230277131 0.1903928861 0.6868033833 0.1571537793
## nnetar_acu_test 0.2382715542 1.5436043706 0.5494425702 0.2549251537
## Arima_pred_2015 -0.0023397898 0.0016367732 0.0017673708 0.0026509573
## Arima_acu_train 0.5565987565 0.3304991305 0.5868846096 1.0438183035
## Arima_acu_test 0.4621457145 0.9564885041 1.8003617672 0.2172852278
## auto.arima_pred_2015 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## auto.arima_acu_train 0.6447160616 0.5557937783 0.8107395421 0.8858293364
## auto.arima_acu_test 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## mood nothing order place
## 2006 0.000000e+00 0.0000000000 0.0769230769 0.000000000
## 2007 0.000000e+00 0.0804597701 0.0000000000 0.034482759
## 2008 0.000000e+00 0.0000000000 0.0044052863 0.013215859
## 2009 0.000000e+00 0.0226537217 0.0161812298 0.029126214
## 2010 0.000000e+00 0.0026178010 0.0000000000 0.009162304
## 2011 2.124646e-03 0.0007082153 0.0000000000 0.003541076
## 2012 4.666356e-04 0.0018665422 0.0009332711 0.002333178
## 2013 0.000000e+00 0.0029032258 0.0003225806 0.004516129
## 2014 6.299874e-04 0.0014699706 0.0002099958 0.001679966
## nnetar_pred_2015 -6.959825e-05 -0.0000267174 0.0014730888 0.011969622
## nnetar_acu_train 7.527431e-01 0.3407682960 0.1206735187 0.095442382
## nnetar_acu_test 1.205529e-01 0.0555530384 0.1088087257 2.341371242
## Arima_pred_2015 3.465542e-04 0.0072826440 0.0075703201 0.009997139
## Arima_acu_train 4.312499e-01 0.4417859069 0.9753572890 0.405932163
## Arima_acu_test 5.682055e-01 0.2785655191 0.5458934874 2.564425241
## auto.arima_pred_2015 0.000000e+00 0.0000000000 0.0000000000 0.010895276
## auto.arima_acu_train 5.868387e-01 0.4788943383 0.7930098942 0.758361833
## auto.arima_acu_test 0.000000e+00 0.0000000000 0.0000000000 2.494985379
## price quality review sandwich
## 2006 0.000000000 0.000000000 0.0000000000 0.0000000000
## 2007 0.011494253 0.000000000 0.0114942529 0.0000000000
## 2008 0.004405286 0.008810573 0.0132158590 0.0132158590
## 2009 0.006472492 0.000000000 0.0129449838 0.0032362460
## 2010 0.003926702 0.001308901 0.0065445026 0.0013089005
## 2011 0.001416431 0.000000000 0.0014164306 0.0000000000
## 2012 0.001866542 0.004199720 0.0023331778 0.0004666356
## 2013 0.001612903 0.002258065 0.0022580645 0.0006451613
## 2014 0.001049979 0.001889962 0.0014699706 0.0002099958
## nnetar_pred_2015 0.003577987 0.001283632 0.0027602548 0.0022830861
## nnetar_acu_train 0.174178675 0.438358469 0.3433916626 0.6343804426
## nnetar_acu_test 3.108577092 1.778199867 0.8736578086 0.6568938151
## Arima_pred_2015 0.001520529 0.002487615 0.0008751961 0.0037833547
## Arima_acu_train 0.561518838 0.250895275 0.5314876330 0.6740540400
## Arima_acu_test 3.816522626 1.407672972 0.2612989704 1.1001304171
## auto.arima_pred_2015 0.003582732 0.002051913 0.0057419158 0.0000000000
## auto.arima_acu_train 0.788787689 0.605650945 1.4086703517 0.6165476739
## auto.arima_acu_test 3.204891611 1.537984063 1.7143090681 0.0000000000
## seating server service star
## 2006 0.0000000000 0.0000000000 0.00000000 0.0000000000
## 2007 0.0000000000 0.0000000000 0.05747126 0.0229885057
## 2008 0.0000000000 0.0000000000 0.07048458 0.0044052863
## 2009 0.0000000000 0.0161812298 0.12297735 0.0000000000
## 2010 0.0000000000 0.0000000000 0.02617801 0.0026178010
## 2011 0.0007082153 0.0007082153 0.01133144 0.0000000000
## 2012 0.0004666356 0.0004666356 0.02053196 0.0013999067
## 2013 0.0019354839 0.0009677419 0.01741935 0.0006451613
## 2014 0.0004199916 0.0010499790 0.01952961 0.0004199916
## nnetar_pred_2015 0.0023018387 0.0006501627 0.01553851 0.0020261800
## nnetar_acu_train 0.2880289538 0.7310386255 0.05937490 0.5397649112
## nnetar_acu_test 4.6858179408 1.5255019120 3.43525701 0.2732228580
## Arima_pred_2015 0.0017878383 0.0041157571 0.02099686 0.0015488532
## Arima_acu_train 0.5157273672 0.5717483479 0.86236886 0.7600343518
## Arima_acu_test 3.6355397404 0.7265810779 3.25417936 0.2312047444
## auto.arima_pred_2015 0.0004199916 0.0000000000 0.03843595 0.0000000000
## auto.arima_acu_train 0.8888888889 0.5080647030 0.96813879 0.5386606441
## auto.arima_acu_test 0.8540460026 1.6979773240 2.69399204 0.0000000000
## taste wait waitress way
## 2006 0.0000000000 0.0000000000 0.0000000000 0.0000000000
## 2007 0.0000000000 0.0114942529 0.0114942529 0.0000000000
## 2008 0.0132158590 0.0132158590 0.0484581498 0.0000000000
## 2009 0.0097087379 0.0097087379 0.0032362460 0.0097087379
## 2010 0.0013089005 0.0013089005 0.0013089005 0.0039267016
## 2011 0.0000000000 0.0000000000 0.0000000000 0.0007082153
## 2012 0.0013999067 0.0004666356 0.0000000000 0.0004666356
## 2013 0.0012903226 0.0009677419 0.0000000000 0.0000000000
## 2014 0.0010499790 0.0008399832 0.0002099958 0.0006299874
## nnetar_pred_2015 0.0013961909 0.0010282713 0.0023940278 0.0002152927
## nnetar_acu_train 0.6499432262 0.4888578181 0.1869491878 0.7357297657
## nnetar_acu_test 1.6770114440 0.3274127128 0.1967764636 0.0906560543
## Arima_pred_2015 0.0008097033 0.0010148372 0.0059949830 0.0027654745
## Arima_acu_train 0.6246988468 0.4507243587 0.9361083178 0.5273689274
## Arima_acu_test 1.8124030050 0.2949334495 0.4937886714 1.1035708475
## auto.arima_pred_2015 0.0000000000 0.0042224568 0.0000000000 0.0000000000
## auto.arima_acu_train 0.8823330878 1.4047658486 0.5921961453 0.6846098777
## auto.arima_acu_test 2.0422564379 1.2271364767 0.0000000000 0.0000000000
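The rows of the table above can be reproduced in spirit for a single feature. The following is only a minimal sketch, not the original analysis code. It uses the "service" column from the table and the forecast package, and it rests on several assumptions that the text does not confirm: the train/test split (2006-2012 versus 2013-2014), the ARIMA order (1,0,0), and the use of MASE as the accuracy measure behind the acu_train and acu_test rows. The nnetar model is left out for brevity; a call to nnetar(train) could be slotted in the same way.

```r
# Minimal sketch, assuming the 'forecast' package is installed.
library(forecast)

# Yearly relative frequencies of the feature "service", 2006-2014,
# copied from the table above.
service <- ts(c(0.00000000, 0.05747126, 0.07048458, 0.12297735, 0.02617801,
                0.01133144, 0.02053196, 0.01741935, 0.01952961),
              start = 2006, frequency = 1)

# ASSUMPTION: fit on 2006-2012 and hold out 2013-2014 for testing.
train <- window(service, end = 2012)
test  <- window(service, start = 2013)

# ASSUMPTION: a fixed AR(1) model; the order used in the report is not stated.
fit_arima <- Arima(train, order = c(1, 0, 0), method = "ML")
# Automatic order selection on the same training window.
fit_auto  <- auto.arima(train)

# Forecast over the held-out years and compare with the observed values.
fc_arima <- forecast(fit_arima, h = length(test))
fc_auto  <- forecast(fit_auto,  h = length(test))
acc_arima <- accuracy(fc_arima, test)
acc_auto  <- accuracy(fc_auto,  test)

# ASSUMPTION: MASE stands in for the unnamed accuracy measure.
acu <- rbind(
  Arima      = acc_arima[c("Training set", "Test set"), "MASE"],
  auto.arima = acc_auto[c("Training set", "Test set"), "MASE"]
)
print(acu)

# Refit on the full series to obtain a one-step-ahead prediction for 2015.
pred_2015 <- forecast(Arima(service, order = c(1, 0, 0), method = "ML"), h = 1)$mean
print(pred_2015)
```

With only nine yearly observations per feature, any such accuracy figure is very noisy, so the comparison between models should be read as indicative rather than conclusive.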
The pred_final$fit object is really handy when one wants to create the plots of interest efficiently. For plotting we used the ggplot2 library together with ggfortify. The plots below show the 2015 and 2016 predictions for the most significant bad features of a “Cafe” type business in Las Vegas: