Author: Brandon Hao
Date: December 10, 2022
The dataset we analyze to train and test our prediction models comes from the Dataset Challenge presented by Yelp in 2014 (Dataset, 2014). The data covers users from Phoenix, Las Vegas, Madison, Waterloo, and Edinburgh and includes information on 42,153 businesses, 320,002 business attributes, 31,617 check-in sets, 403,210 tips, and 1,125,458 text reviews. The dataset consists of five .json files, one per object type: business, review, user, check-in, and tip. Of these, “business.json” contains the relevant business data (business ID, name, location, stars, review count, opening hours, etc.) and “review.json” contains the data obtained from business reviews (business IDs, user IDs, stars, review text, date, and votes), so only data from these files is studied via our models.
In our paper, we treat a business’ average star rating as a binary variable (“good” or “bad”) and train supervised algorithms to build two prediction models: one utilizing lasso for variable selection and logistic regression for classification, and the other utilizing forward stepwise selection for variable selection and k-nearest neighbors for classification. Finally, we evaluate each model under 10-fold cross-validation (a 90-to-10 train-test split in each fold) to tune each model and estimate its prediction accuracy.
Section 2 of our paper describes our methods for tidying the dataset for model building, section 3 describes the motivation for and details of each of our chosen models, section 4 presents our results and analyses of these models, and section 5 presents our conclusion and directions for future research.
The first step focused on extracting the restaurants from the business.json file and their reviews from the review.json file. This data was collected into one dataframe with 18 variables. Because the goal of our models was to correctly identify the quality of a business rather than to predict its exact average star value, we transformed each business's average star rating into a binary variable for classification. Given that the distribution of average business stars was skewed left, with more high than low ratings, we chose to split our levels such that values of {1, 1.5, 2, 2.5, 3} were mapped to 0 (bad quality) and values of {3.5, 4, 4.5, 5} were mapped to 1 (good quality). In this paper and in our dataset, we refer to this variable as “Business_Quality”.
Even after this transformation, the distribution of good and bad business qualities remained skewed; therefore, 10-fold cross-validation was used for each of our models to reduce bias in our results.
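The mapping from average stars to the binary label can be sketched in a few lines. This is a minimal Python illustration rather than the R code used for the paper, and the function name `business_quality` is hypothetical.

```python
def business_quality(avg_stars: float) -> int:
    """Map an average star rating to the binary label: 1 = good, 0 = bad."""
    # Ratings come in half-star steps from 1 to 5; 3.5 and above counts as good.
    return 1 if avg_stars >= 3.5 else 0


ratings = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
labels = [business_quality(r) for r in ratings]
print(labels)
```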
Yelp users construct their reviews freely, without following any set of specific guidelines. Distinguishing positive reviews from negative ones is fairly easy by looking for words with either a positive (delicious, mouthwatering, considerate) or negative (disgusting, rude, ridiculous) connotation. However, neutral words and stop words (such as “fine”, “okay”, “but”, or “this”) appear frequently yet carry little sentiment. To quantify the sentiment contained within user reviews, we extracted only the positive and negative words from each review and used the number of positive words minus the number of negative words as a measure of the review's sentiment. To identify positive and negative words within each review, a predefined opinion lexicon was utilized (Opinion Lexicon, 2011).
Since only words carrying sentiment are necessary for our analysis,
we used the R package stringr to strip superfluous
features. Every review was converted to lowercase through the
tolower function, and the regex pattern \\w+
was used within the str_extract_all function to match each run of
one or more word characters (letters, digits, and underscores). In this
way, punctuation, whitespace, and other non-word characters were
discarded; purely numeric tokens are still extracted, but they never
match a lexicon entry and so do not affect the score.
We then used word dictionaries (Opinion Lexicon, 2011) to identify positive and negative words and compute their respective frequencies for each review. This was done by checking which words in a review match those in the positive and negative word dictionaries, counting each of those occurrences, and subtracting the number of negative words from the number of positive words. Each result was stored in a new variable named Sentiment, which holds an overall sentiment score for each review and was appended to our dataset.
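The scoring procedure described above can be sketched as follows. This is a minimal Python illustration, not the R/stringr code used for the paper; the two word sets are tiny stand-ins built from the example words in the text (the paper used the full Opinion Lexicon), and the name `sentiment_score` is hypothetical.

```python
import re

# Tiny stand-in word lists; the paper used the full Opinion Lexicon (2011).
POSITIVE_WORDS = {"delicious", "mouthwatering", "considerate"}
NEGATIVE_WORDS = {"disgusting", "rude", "ridiculous"}


def sentiment_score(review: str) -> int:
    """Return (# positive words) - (# negative words) for one review."""
    # Lowercase first, then pull out every run of word characters.
    tokens = re.findall(r"\w+", review.lower())
    positives = sum(token in POSITIVE_WORDS for token in tokens)
    negatives = sum(token in NEGATIVE_WORDS for token in tokens)
    return positives - negatives


example = "Delicious food, but the waiter was rude. Truly DELICIOUS dessert!"
print(sentiment_score(example))  # two positive words minus one negative word
```

Repeated words count once per occurrence, so a review that says "delicious" twice contributes two to the positive total.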
Finally, unused data was removed from our dataset to improve the speed of our calculations. Since our models seek to estimate business quality from numerical user data, categorical and other non-numeric fields were removed from the dataset (namely index, user and business IDs, state and city, and elite years). This left us with the following variables: Star, Useful, Cool, Funny, Bus_Ave_Star, User_Review_count, User_Useful_count, User_Funny_count, User_Cool_count, User_Fans, Users_Ave_Star, Sentiment. These variables were the data used to train our models and algorithms.
Lasso
Least Absolute Shrinkage and Selection Operator (LASSO) is a form of
regression that performs both shrinkage and variable selection. To
reduce prediction error, LASSO imposes a constraint on the
model parameters that shrinks the regression coefficients toward zero.
Coefficients that are shrunk exactly to zero are regarded as
insignificant and are not included in the final regression model. The
strength of the constraint on the sum of the absolute values of the
coefficients is controlled by a tuning parameter, lambda, whose ideal
value is determined by k-fold cross-validation. Lasso trades some bias
for lower variance and can yield regression coefficients that are hard
to interpret, though it is thought to help with overfitting (J. Ranstam).
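For reference, one common way to write the lasso objective for a binary response is shown below. This is the standard L1-penalized logistic form and is an assumption about the exact objective, which the paper does not spell out; the symbols n (number of observations), x_i (predictor vector), and y_i (0/1 response) are introduced here only for illustration.

```latex
\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}
    \Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
      - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr]
  \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

A larger lambda puts more weight on the penalty term, which drives more coefficients to exactly zero.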
Lasso was implemented to obtain a sparse model containing only those variables which most affect business quality. After implementing lasso with the best lambda, a matrix of variable coefficients was obtained; the variables not shrunk to zero were chosen for our logistic regression.
Logistic Regression
Logistic Regression is a form of regression that models the natural log
of the odds (the odds being the ratio of the probability of something
happening to the probability of it not happening) as a
linear function of predictors. With a single predictor, the general form of logistic
regression is ln(odds(Y=1))=β0+β1X, with β0
being our regression intercept, β1 being our regression coefficient, and
X being our predictor. Each additional predictor adds a corresponding
regression coefficient (LaValley).
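To make the log-odds relationship concrete, the sketch below converts a linear predictor into a probability. It is a Python illustration only; the coefficient values are hypothetical and are not estimates from the paper.

```python
import math


def predicted_probability(x: float, beta0: float, beta1: float) -> float:
    """Invert ln(odds(Y=1)) = beta0 + beta1 * x to get P(Y=1 | x)."""
    log_odds = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients chosen for illustration, not fitted values.
beta0, beta1 = -1.0, 0.5

p = predicted_probability(2.0, beta0, beta1)  # log-odds = 0, so odds = 1
print(p)
```

A log-odds of zero corresponds to even odds, so the predicted probability at x = 2 is one half under these made-up coefficients.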
The sparse set of predictors obtained from our lasso was used to fit our logistic regression. Through logistic regression, the log-odds of our outcome (Business_Quality) are modeled as a linear combination of the chosen predictors. In this way, our logistic regression models the conditional probability of our response given the reduced model. We used model summaries to check the significance of variables, and a confusion matrix averaged over our cross-validation folds was created to obtain a measure of our model’s accuracy in predicting business quality labels.
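A confusion matrix and the accuracy derived from it can be computed as sketched below. This is a Python illustration with made-up labels, not output from the paper's models; the helper names are hypothetical.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1
        elif a == 0 and p == 0:
            counts["TN"] += 1
        elif a == 0 and p == 1:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts


def accuracy(counts):
    """Share of all predictions that were correct."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())


# Made-up labels for illustration only.
actual = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 1, 1, 0, 1]

cm = confusion_matrix(actual, predicted)
print(cm, accuracy(cm))
```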
Forward Stepwise Selection (and BIC)
Stepwise selection is a variable-selection procedure that adds or
removes predictors sequentially. There are two common variants,
forward and backward. We implemented forward
stepwise selection, which begins with a model
with no predictors and adds one predictor at a time until no further
improvement is found. At each step, the predictor that most improves the
fit is added; however, we must also guard against
overfitting, for which we use the BIC (Bayesian Information Criterion).
Both BIC and its counterpart AIC penalize model complexity, but BIC's
penalty grows with the log of the sample size, so on large datasets it
favors simpler models than AIC does. In
simpler terms, BIC weighs the additional variability
explained against the increase in complexity that comes
with adding another predictor (Royal) (Zhang).
Given that our dataset is large and contains many variables, forward stepwise selection was chosen as a way to iterate from simple to more complex models and gradually improve fit. Each candidate model was evaluated, and the model with the minimum BIC was chosen for our KNN.
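The way BIC trades fit against complexity can be illustrated with its textbook definition, BIC = k ln(n) - 2 ln(L), next to AIC = 2k - 2 ln(L). The Python sketch below uses invented log-likelihoods purely for illustration; it is not the computation performed by the R package used for the paper, which may report BIC in a different but equivalent form.

```python
import math


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian Information Criterion: the penalty grows with log(n_obs)."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: a fixed penalty of 2 per parameter."""
    return 2.0 * n_params - 2.0 * log_likelihood


# Invented log-likelihoods for models with 1, 2, and 3 predictors.
n_obs = 1000
candidates = {1: -520.0, 2: -480.0, 3: -478.5}

best_bic = min(candidates, key=lambda k: bic(candidates[k], k, n_obs))
best_aic = min(candidates, key=lambda k: aic(candidates[k], k))
print(best_bic, best_aic)
```

With these invented numbers the third predictor improves the fit only slightly, so BIC settles on the two-predictor model while AIC still prefers three predictors.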
KNN
KNN classification is a method that is favorable in
situations in which there is little to no prior information about the
distribution of the data. Similarly, KNN is thought to work well with
large datasets of low dimension. The idea of KNN is that the patterns
nearest to a target pattern provide information that helps
us classify that target pattern. K is a positive integer giving the
number of nearest neighbors consulted. The lower the value of K, the more local the
prediction; the higher the value of K, the more the classifier ignores patterns
with minority labels. K essentially defines the locality of KNN
(Kramer).
A model was created using the variables chosen through forward selection, and KNN was applied as its classifier. A value of k maximizing accuracy was obtained via cross-validation. We used this k to produce a confusion matrix and ROC curve. Using the ROC curve, we then chose an improved classification threshold for our confusion matrix in order to maximize accuracy while limiting false positives and false negatives.
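The interaction between k and the classification threshold can be sketched as below. This Python illustration assumes the predicted probability is the fraction of the k nearest neighbors labeled 1, with Euclidean distance; the paper does not state these details, and the data points and function names are invented.

```python
import math


def knn_positive_fraction(train_points, train_labels, query, k):
    """Fraction of the k nearest training points whose label is 1."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest = distances[:k]
    return sum(label for _, label in nearest) / k


def classify(fraction, threshold=0.5):
    """Predict 1 when the positive fraction reaches the threshold."""
    return 1 if fraction >= threshold else 0


# Invented two-feature training data for illustration only.
train_points = [(2.0, 2.5), (4.0, 4.0), (1.0, 1.0), (6.0, 6.0), (0.0, 0.0)]
train_labels = [1, 0, 0, 1, 0]

fraction = knn_positive_fraction(train_points, train_labels, (3.0, 3.0), k=3)
print(classify(fraction, 0.5), classify(fraction, 0.3))
```

Here one of the three nearest neighbors is positive, so the query is labeled 0 at a threshold of 0.5 but 1 at a threshold of 0.3, which is how lowering the threshold raises the true positive rate at the cost of more false positives.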
10-fold cross-validation (a 90-10 train-test split in each fold) was performed on lasso, logistic regression, and KNN (forward stepwise selection was performed on the whole dataset) to produce confusion matrices and accuracy measures for both prediction models.
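The fold structure behind a 90-10 split can be sketched as follows. This is a Python illustration of plain k-fold index splitting, not the R routine used for the paper; the function name, the seed, and the row count of 100 are hypothetical.

```python
import random


def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train, test) index lists; each row is held out exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test


splits = list(k_fold_indices(100, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each of the ten iterations trains on 90 percent of the rows and tests on the remaining 10 percent, and the per-fold confusion matrices can then be averaged.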
Each implementation was run on an Intel Core i3 CPU with 2 cores
(2.1 GHz each), 8 GB RAM, and a 64-bit Windows 10 operating system. The
analysis was written in R using the libraries stringr, glmnet, caret,
leaps, and ROCR.
Using the Lasso technique, we determined that the best value of lambda was 0.01, or about -4.6 for log(lambda). This is shown in the plot below, which displays the relationship between log(lambda) and the binomial deviance, the measure of model error that the best lambda minimizes.
By using this lambda, we were able to ascertain that the most useful features were Sentiment and Star Rating, as these were the features that were not shrunk to zero. As a result of these findings, our logistic regression model was trained using solely Sentiment and Star Rating to predict business quality.
We applied logistic regression to this sparse model and found that both predictors were statistically significant predictors of business quality.
According to this logistic regression model, we achieved an accuracy of roughly 0.89 (obtained from our averaged confusion matrix). However, we also produced some false negatives and multiple false positives.
This suggests that although our model fits the data well and a user’s star rating and the sentiment they express in their review tend to accurately predict the quality of a business, this logistic model, in particular, will likely overestimate the number of good quality businesses in a dataset.
We implemented forward stepwise selection on our pre-processed data and calculated a BIC value for each of 11 possible models. We created a plot relating BIC to the size of the model and found that the model minimizing BIC had 4 predictors. Taking a look at the summary of our stepwise selection, we found that the 4 optimal predictors for our classification were: Star, User_Review_count, User_Ave_Star, and Sentiment.
Using these as our predictors, our second model was fitted with KNN. Cross-validating this model over several different values of K yielded optimal accuracy at K = 9. With a threshold of 0.5, we constructed a confusion matrix to calculate our accuracy. However, although we obtained an accuracy of roughly 0.89, the false negative rate for our model was high and the true positive rate was correspondingly poor. Shifting our threshold down to 0.3 improved our true positive rate at the cost of a slightly higher false positive rate; however, this increase in bias is relatively negligible.
The final accuracy of our model became 0.86, indicating that under KNN, a user’s star review, total number of reviews, average star rating, and the sentiment they express in their review are decent predictors of business quality, albeit at the cost of higher chances of false positives and negatives.
The figures below compare the final accuracies of our 2-predictor logistic model and 4-predictor KNN model. Our logistic regression model performed slightly better than our KNN model in terms of overall accuracy. However, whereas the logistic regression model excelled at identifying true positives, the KNN model was better at identifying true negatives.
Moreover, both models suffered from their own set of biases, as our logistic model had a bias towards making false-positive predictions and our KNN model had a bias towards making both false-positive and false-negative predictions roughly equally.
Overall, neither model was able to classify a business’s quality with complete accuracy. However, each was adept at correctly classifying either true positives or true negatives. Although this came at the cost of some bias, these models appear to be decent candidates for binary classification in the context of numerical data. Most importantly, the success of our models shows that aspects of a Yelp user’s data, specifically numerical data and Yelp reviews, may hold some predictive value in ascertaining the quality of the businesses being critiqued.
Although our models focused mostly on the numerical data available on Yelp, it would be interesting for future research to investigate the influence that the available categorical Yelp data has on the binary classification of business quality. Attributes such as a user’s elite level or county of residence are readily available and may offer insight into how such factors affect the accuracy of business quality predictions.
Since our models collapsed average business stars into a binary business quality label, research into accurately predicting a business’s true average star rating would also prove practically useful.