Home cooks of all skill levels frequent AllRecipes.com for meal inspiration, but the quality of user-submitted recipe content on the site varies widely. We have identified seven kinds of non-exclusive problems (related to missing/misleading steps, incorrect measurements, incorrect cooking/preparation time, incorrect temperature, a key/flavor ingredient being missing or an incorrect ingredient being included, a lack of authenticity, and user error) that recipe reviewers might claim a recipe suffers from. Classifying these reviews according to the problems they identify could enable AllRecipes.com to effectively home in on the recipe content that is most likely to need reevaluation.
The methodology we developed in order to predict problems asserted in recipe reviews included: isolating reviews most likely to assert problems; employing unsupervised learning with Latent Dirichlet Allocation (LDA) to divide reviews into topics; manually labeling reviews according to sentiment expressed and problems asserted; employing supervised learning with Naive Bayes, Support Vector Machine (SVM), eXtreme Gradient Boosting, and Logistic Regression models to predict problem classes using only unigrams, only bigrams, only trigrams, or a combination of all three; and attempting to improve performance by removing highly correlated ngram features.
Only two problems were frequent enough for us to build moderately
successful models: MEASUREMENTS and
INGREDIENTS. The best MEASUREMENTS classifier
was an SVM model using a combination of ngrams, and it achieved an F1
Score of 71.74 percent while being 79.86 percent accurate. The best
INGREDIENTS classifier was a Naive Bayes model using only
unigrams that achieved an F1 Score of 64.11 percent while being 70.20
percent accurate.
According to AllRecipes.com’s About Us page, “Only the best recipes achieve Kitchen Approved status and get published and promoted” on the Web site. When users submit recipes for “Kitchen Approved” consideration on AllRecipes.com, the Web site’s team of recipe editors claims to check the content for “Completion,” “Accuracy,” and “Replication,” among other things. The issues these quality assurance checks are supposed to cover are described below:
| CHECK | DESCRIPTION |
|---|---|
| Completion | The ingredient list is evaluated to be sure it is complete, that measurements are correct, and that ingredients are properly ordered and grouped according to their use. |
| Accuracy | Serving sizes and yields are checked for accuracy and to be sure that recipes reflect USDA dietary recommendations. |
| Replication | Recipes are reviewed and edited to ensure they are replicable when following the instructions. Editors assess the techniques, equipment, and appliances used in the recipes, and explain any intermediate or advanced techniques with simple language and visuals. |
However, the quality of user-submitted recipe content on the site still varies widely, and users frequently express opinions that suggest “Kitchen Approved” recipes suffer from problems even after having gone through AllRecipes.com’s purported quality assurance checks. These recipes may very well benefit from further review, and we believe utilizing text classification to predict problems asserted by reviews could help AllRecipes.com focus its attention on recipes that reviewers believe need it most.
We have identified seven distinct, but non-exclusive problems that users might claim published recipes suffer from, and descriptions of these problems are below:
| PROBLEM | DESCRIPTION |
|---|---|
| DIRECTIONS | One or more steps was unclear, incorrect, or missing |
| MEASUREMENTS | Included too much or too little of one or more ingredients |
| TIMING | Needed more/less time to prep or to finish cooking |
| TEMPERATURE | Needed higher/lower temperature to cook or assemble correctly |
| INGREDIENTS | Missing a key/flavor ingredient or included an incorrect ingredient |
| INAUTHENTIC | Lacked a resemblance to the dish’s traditional preparation/flavor |
| USER_ERROR | Reviewer suspected/admitted they made a mistake |
The question we intended to answer with our analysis was:
Can AllRecipes.com review text be classified according to the problems reviewers are claiming recipes suffer from?
To answer this question, we utilized a combination of unsupervised and supervised learning techniques. Of the 134,800 total reviews we gathered, we grouped the 14,954 reviews with a star rating of three or below into six topics using LDA, and we labeled 3,398 of these reviews according to sentiment expressed and problems asserted. We generated four featuresets for these reviews: one containing only unigrams, one containing only bigrams, one containing only trigrams, and a combined featureset containing all three. Four kinds of supervised learning models were trained on each featureset, and we needed seven versions of each kind of model per featureset (one for each non-exclusive problem class).
The classification of text, including review text and even more specifically recipe review text, has been previously analyzed from many perspectives. Our literature review focused on: 1) unsupervised learning techniques that might reduce the amount of time spent manually labeling the review text for supervised learning, as well as the pros and cons of some forms of text pre-processing prior to unsupervised learning; 2) comparisons of several supervised learning techniques commonly used in Natural Language Processing (NLP), as well as their applications in multi-class and non-exclusive classification; 3) developing featuresets from ngrams of different lengths; and 4) performance metrics for assessing text classifiers when the labeled classes are imbalanced.
Unsupervised Learning: When reviewing unsupervised learning techniques that can be applied to NLP, we decided early on that we would not consider fuzzy clustering via artificial neural networks like the ones described in Derhgawen et al. (2020) due to computation costs. We also rejected semantic orientation as described in Chaovalit & Zhou (2005), as it can only be used in binary classification tasks like simple sentiment analysis.
Instead, we looked at other methods. Both Sutherland et al. (2020) and Schofield & Mimno (2016) have used topic modeling via LDA, a common unsupervised learning technique in NLP, to classify text into coherent groups automatically. Using LDA, Sutherland et al. (2020) were able to identify, then validate 14 distinct topics (value, amenities, etc.) of customer satisfaction in reviews guests left for various accommodations (hotels, etc.). So it seemed promising that we might achieve similar success employing LDA in our analysis to automatically categorize reviews prior to manually labeling them. However, Sutherland et al. (2020) and Schofield & Mimno (2016) took different approaches to stemming, a popular text pre-processing method that simplifies the vocabulary of a text corpus by collapsing words with the same beginnings but different endings into one root word. Sutherland et al. (2020) found stemming to be a key part of their topic modeling success, while Schofield & Mimno (2016) found that stemming did not improve their topic modeling regardless of how many coherent groups they tried to identify in their corpora; even the lightest stemming methods reduced performance. So the question, “To stem, or not to stem?” remained for us.
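For readers unfamiliar with stemming, the sketch below shows its effect on a small vocabulary. It uses the SnowballC package, which is not part of our pipeline and is included purely for illustration:

```r
# Demonstration only: SnowballC is not one of the libraries used in this
# analysis. Stemming collapses words with the same root into one vocabulary
# entry, shrinking the corpus vocabulary before topic modeling.
library(SnowballC)

wordStem(c("baking", "baked", "bakes", "measuring", "measurements"),
         language = "english")
# the "bak-" forms collapse to one root, as do the "measur-" forms
```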
Supervised Learning: Having settled on LDA as our unsupervised learning technique, whether it made manually labeling our reviews any faster or not, we considered a wide variety of supervised learning techniques commonly used in NLP. We settled on: 1) Naive Bayes models, used by Lohar et al. (2017), Pranckevičius & Marcinkevičius (2017), Zhang & Lee (2003), and Madushani & Arudchelvam (2021); 2) SVM models, used by Al Amrani et al. (2018), Pranckevičius & Marcinkevičius (2017), and Zhang & Lee (2003); 3) Logistic Regression models, used by Pranckevičius & Marcinkevičius (2017); and 4) eXtreme Gradient Boosting models. Both Al Amrani et al. (2018) and Pranckevičius & Marcinkevičius (2017) used Random Forest models, but we knew from previous experience those models would not be able to handle the sparse matrices we would need to use to represent our ngram featuresets. So we chose eXtreme Gradient Boosting models, which can handle sparse matrices, instead.
Lohar et al. (2017) successfully used Naive Bayes models to classify Microsoft Office customer feedback into six categories (bug, complaint, etc.). While the categories were non-exclusive like those in our analysis, there was not as much overlap in these categories as there was in the problems we wanted to identify in recipe review text. So their best performing model was a one-versus-rest classifier, whereas we needed separate classifiers for each problem we wanted to identify.
Unlike Lohar et al. (2017), Pranckevičius & Marcinkevičius (2017) compared the results of many kinds of models, not just Naive Bayes ones. They found that Logistic Regression models performed better than Naive Bayes models, Random Forest Models, Decision Tree models, and SVM models at a multi-class classification task using Amazon product review text to predict star-rating (1-5). However, Logistic Regression models predicted some classes much better than others, unlike the rest of the models compared in this analysis, which were more stable. (Note that the only kernel function they were able to use for their SVM models was the linear kernel, as other kernel functions were not yet implemented in the Apache Spark machine learning library they used for their analysis.)
Like Pranckevičius & Marcinkevičius (2017), Zhang & Lee (2003) compared the results of many kinds of models. They found that SVM models using a linear kernel performed better than Nearest Neighbors models, Naive Bayes models, Decision Tree models, and Sparse Network of Winnows models at a multi-class classification task using question text to predict the kind of question being asked. They also found there was no performance difference between linear SVM models and SVM models using radial basis, polynomial, or sigmoid kernels.
Ngrams of Different Lengths: In addition to wanting to test a variety of models, we also wanted to test a variety of featuresets based on ngrams of different lengths. Using only Naive Bayes models, Madushani & Arudchelvam (2021) found that the models always performed better as the length of the ngrams used to train them increased. That is, bigrams alone produced better models than unigrams alone, and trigrams alone produced better models than bigrams alone. However, when comparing performance across a variety of models, Pranckevičius & Marcinkevičius (2017) found that unigrams alone were sufficient. Bigrams alone didn’t perform better than unigrams alone, and neither did trigrams alone. All three combined did not improve performance either.
Performance Metrics: Jurafsky & Martin (2024) stress the importance of precision, recall, and F scores in measuring text classifier performance when the classes are imbalanced like the recipe problems we wanted to identify in the review text. Lohar et al. (2017) found unsurprisingly that when predicting multiple non-exclusive classes, it is the classes that are the best represented in the data that are easiest for models to predict. While they only examined Naive Bayes models in their analysis, we found the same to be true regardless of the type of model used in our analysis.
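For reference, if TP, FP, and FN denote the true positive, false positive, and false negative predictions for a given problem class, the metrics we report are defined as:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$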
In order to predict the problems asserted in recipe reviews, we developed the following methodology:
First, we isolated the reviews most likely to assert problems, i.e. those with a rating of three or fewer stars. While some positive-leaning reviews might also have identified recipe problems, it was reasonable for us to assume that negative-leaning reviews would be more likely to do so, and we expected 3-star and below reviews to express more negative sentiment than 4- or 5-star reviews would.
Then we employed LDA to fuzzy cluster the reviews into an optimal
number of topics based on their bigrams. We chose bigrams because they
include more context than unigrams, and some words that are commonly
considered stopwords for text data were useful for our analysis. Many
reviews included text like “too much” or “not enough,” both of which
could indicate MEASUREMENTS errors, and we hoped LDA would
group reviews together based on similar informative language like this
so that the next step of our methodology, manual labeling, might be
faster. The optimal number of topics turned out to be six.
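A condensed sketch of this step appears below. It assumes a data frame named `reviews` with `ID` and `Review_Txt` columns (both names are illustrative); the full pipeline, including the stopword and number handling described later, appears in the code at the end of this report.

```r
library(tidytext)
library(topicmodels)
library(dplyr)

# Count bigrams per review and cast the counts into a document-term matrix
bigram_dtm <- reviews |>
  unnest_tokens(BIGRAM, Review_Txt, token = "ngrams", n = 2) |>
  count(ID, BIGRAM, name = "COUNT") |>
  cast_dtm(ID, BIGRAM, COUNT)

# Fit LDA with six topics, then keep each review's most probable topic
lda_fit <- LDA(bigram_dtm, k = 6, control = list(seed = 208))
assignments <- tidy(lda_fit, matrix = "gamma") |>
  group_by(document) |>
  slice_max(gamma, n = 1) |>
  ungroup()
```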
Reviews were manually labeled according to sentiment expressed and problems asserted. Once that was completed, we employed Naive Bayes, SVM, eXtreme Gradient Boosting, and Logistic Regression models to predict problem classes using only unigrams, only bigrams, only trigrams, or a combination of all three. Models were then ranked by their F1 Scores on each of the seven problem classes.
Finally, we attempted to improve performance for all models by removing highly correlated ngram features from all featuresets.
The recipe data we gathered included several fields. Names and descriptions of the first 15 fields are below:
| VARIABLE | DESCRIPTION |
|---|---|
| URL | Web page address for the recipe |
| Recipe_Name | Name of the recipe |
| Basic_Info | List of info such as Prep Time, Cook Time, Servings, etc. |
| Ingredients | List of ingredients |
| Directions | List of directions |
| Nutrition | List of info such as Calories, Fat, Protein, etc. |
| Ratings_Total | Total number of star ratings |
| Ratings_Avg | Average star rating |
| Reviews_Total | Total number of reviews |
| MHPR_User | User who left Most Helpful Positive Review |
| MHPR_Rating | Star rating of Most Helpful Positive Review |
| MHPR_Date | Date of Most Helpful Positive Review |
| MHPR_Review | Text of Most Helpful Positive Review |
| MHPR_Review_Len | Length in characters of Most Helpful Positive Review |
| MHPR_Helpful | Count of users who marked Most Helpful Positive Review helpful |
The “Most Helpful Positive Review (MHPR)” mentioned in the last six
of these fields and their descriptions is the first review that
AllRecipes.com shows users. It is determined by user input, specifically
the number of other users who have marked the review “helpful.” Some
recipes didn’t have any positive reviews marked helpful by other users,
so their MHPR fields contained empty strings or NA
values.
The remaining field names and their descriptions are below.
| VARIABLE | DESCRIPTION |
|---|---|
| User_[0-8] | User who left review |
| Rating_[0-8] | Star rating for review |
| Date_[0-8] | Date of review |
| Review_[0-8] | Review number [0-8] |
| Review_Len[0-8] | Length in characters of review |
| Helpful_[0-8] | Count of users who marked review helpful |
Because these fields represent the nine most recent reviews for a
recipe, the same six field names were used for each review, then
numbered zero through eight to represent the review number. For
instance, the field “User_[0-8]” in the table above actually refers to
nine iterations of the “User” field, one for each of the nine most
recent reviews for a recipe. Some recipes had been reviewed fewer than
nine times, so any review fields beyond the number of times they had
actually been reviewed contained empty strings and NA
values. So there was more review data available for some recipes than
others.
During the data collection process, bad URLs were given a
Visited value of -1 to separate them from good links we had
already visited (value: 1) and links we had not yet visited (value: 0).
Some of these bad links were added to the records erroneously, so we
identified and removed them.
We also intended to exclude any recipes that had never been reviewed from the data collection process, but a small number of them did make their way into the records, so we removed those as well.
That left us with data for 18,347 recipes that had been reviewed at least once. We summarize the mean number of ratings, the mean star rating, and the mean number of reviews for recipes in our dataset below:
| VARIABLE | MEAN |
|---|---|
| Ratings Total | 114.9 |
| Ratings Average | 4.5 |
| Reviews Total | 90.0 |
The recipes in our dataset were generally rated very highly, with a mean star rating of 4.5. They generally received large numbers of ratings (mean: 114.9) and reviews (mean: 90) as well.
To prepare our data for unsupervised learning, we created a corpus of potentially negative reviews by pivoting our wide review data into a longer format, dropping many columns we didn’t need for our analysis, and removing any reviews with star ratings of 4 or higher from the 134,800 total reviews we gathered. We hoped this would make manually labeling reviews by what problem they were identifying faster by limiting the number of reviews we would have to look at that didn’t identify any problems at all. That left us with 14,954 potentially negative reviews.
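A simplified sketch of this reshaping is below. It collapses the separate per-field pivots we actually performed (see the code at the end of the report) into a single `pivot_longer()` call over the numbered rating and review-text columns, and the `reviews_long` name is illustrative:

```r
library(tidyverse)

# Pivot the nine numbered Rating_*/Review_* columns into one row per review,
# then keep only the potentially negative reviews (three stars or below)
reviews_long <- ar_df |>
  select(URL, matches("^Rating_\\d$"), matches("^Review_\\d$")) |>
  pivot_longer(cols = -URL,
               names_to = c(".value", "Review_Num"),
               names_pattern = "(Rating|Review)_(\\d)") |>
  filter(!is.na(Rating), Rating < 4)
```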
We created a featureset using only bigrams for these 14,954
potentially negative reviews. Bigrams provide more context than
unigrams, and bigrams that contained words that are commonly considered
stopwords for text data were useful for our analysis. Many reviews
included text like “too much” or “not enough,” both of which could
indicate MEASUREMENTS errors, and we hoped LDA would group
reviews together based on similar informative language like this. So we
did not exclude bigrams that contained words like “too” and “not” from
consideration. We did eliminate bigrams containing non-useful stopwords
(e.g. pronouns), as well as bigrams containing numbers, but we noticed
later that our number-identification method unintentionally left bigrams
containing fractions in.
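The slip-through happened because our number check relied on `as.numeric()`, which returns `NA` for strings like “1/2,” so bigrams containing fractions were never flagged. The sketch below demonstrates the difference; the regex shown is illustrative rather than a pattern we actually adopted:

```r
tokens <- c("2", "1/2", "cups", "not")

# The check we used: as.numeric() flags "2" but not the fraction "1/2"
!is.na(suppressWarnings(as.numeric(tokens)))
#> [1]  TRUE FALSE FALSE FALSE

# A pattern-based check would have flagged the fraction as well
grepl("^\\d+(/\\d+)?$", tokens)
#> [1]  TRUE  TRUE FALSE FALSE
```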
Using values of k from two to 12, we measured the perplexity of several different LDA models that would automatically partition the data into k topics. The below plot demonstrates how the perplexity for a model tends to decrease as k increases, but after a certain point, the reduction in perplexity achieved by increasing k begins to drop off.
In our case, the reduction in LDA model perplexity began to drop off after about six topics, so we set k to six in our final LDA model. The bigrams that were most frequent in the six identified topics were:
There was a lot of commonality in the bigrams that were most likely
to occur within these six topics. For instance, “next time” was the
first or second most identifying bigram in four out of six topics,
indicating the reviews in each of those topics were likely to exhibit
some disappointment and make some suggestions for improvement. The
bigrams “too much” and “way too” were the first and second most
identifying bigrams for topic six, so it seemed likely from this plot
that MEASUREMENTS errors might define this topic. However,
we were also able to find these bigrams somewhere in the identifying
bigrams list for every single other topic, so we did not expect
MEASUREMENTS errors to be exclusive to topic six.
Despite all the overlap, there were some bigrams that were unique to the top identifiers for each topic. It was notable, though, that some of these unique bigrams were related to specific ingredients, e.g. “brown sugar” identifying topic one, “cream cheese” identifying topics two and four, and “soy sauce” identifying topic six. Specific ingredients therefore contributed some unintended grouping. We suspected review length might have unintentionally factored into the grouping as well, so we plotted the distribution of review length by topic to confirm whether or not that was the case:
The topics actually had quite similar review length distributions. Topic four had the highest median review length at 201 characters, but that was only 22 characters more than the lowest median review length (179 characters, shared by topics three and six). So review length did not appear to adversely affect topic modeling.
We exported the potentially negative reviews, manually labeled a manageable portion of them, then loaded said newly labeled data for further data preparation and the application of supervised learning techniques.
There were 3,398 labeled observations in total: 515 observations from
each of the six topics identified during LDA, plus 308 observations that
were not composed of any significant bigrams and were thus assigned a
topic of NA. (We recoded NA topic values as
0.) Reasons why a review might not have been composed of any significant
bigrams included:

- the review was devoid of any words (e.g. “!!!”)
- the review contained only stopwords (e.g. “so so”)
- all possible bigrams in the review contained stopwords (e.g. “I hated it”)
The observations were labeled in two ways:

- as either “Neutral” or “Negative” in SENTIMENT
- by which of nine* specific problems they identified (if any): DIRECTIONS, MEASUREMENTS, TIMING, TEMPERATURE, ING_KEY_MISSING, ING_FLAVOR_MISSING, ING_WRONG_INCL, INAUTHENTIC, and USER_ERROR
*Note that these nine problems were later reduced to the seven
problems mentioned previously by combining the three ingredient-related
problems (ING_KEY_MISSING, ING_FLAVOR_MISSING,
and ING_WRONG_INCL) into a single problem
(INGREDIENTS).
Descriptions of the problems and the criteria we used to determine whether a review identified them are outlined below:
| PROBLEM | DESCRIPTION |
|---|---|
| DIRECTIONS | One or more steps was unclear, incorrect, or missing |
| MEASUREMENTS | Included too much or too little of one or more ingredients |
| TIMING | Needed more/less time to prep or to finish cooking |
| TEMPERATURE | Needed higher/lower temperature to cook or assemble correctly |
| ING_KEY_MISSING | Missing a key ingredient |
| ING_FLAVOR_MISSING | Missing a flavor ingredient |
| ING_WRONG_INCL | Included an incorrect ingredient |
| INAUTHENTIC | Lacked a resemblance to the dish’s traditional preparation/flavor |
| USER_ERROR | Reviewer suspected/admitted they made a mistake |
Labeling text manually is always subjective, and it requires a strong
commitment to consistency. We tried to apply the same criteria to every
review and to resort to the same logic when assigning labels, especially
in borderline cases. For instance, we decided it was not enough for a
reviewer to say that a recipe was “missing something” for us to label it
as indicating ING_FLAVOR_MISSING. The reviewer had to go one step further and state what they added or what they would add next time to improve the flavor. However, there were surely times
when we made a judgment call one way based on the text of one review,
then a different way when we encountered similar text in another review.
It is also true that another person might have made very different
labeling decisions throughout the process. We stand by the quality of
the labels we assigned nonetheless.
Recipe review text has some aspects that make labeling it a more complicated task than labeling other kinds of text. Many reviews were very long, but expressed little or nothing of use for our analysis. Reviewers were also frequently responding to one another when they left reviews, so reading those reviews felt like listening to one half of a phone call. It was more confusing than revealing. Many reviews required some reading between the lines, and in some cases the recipe itself had to be consulted to confirm our understanding of what was being said. Lastly, reviewers can be wrong. We did not allow the incorrectness of any statements in a review to disqualify the review from receiving problem indicator labels. However, some reviewers were occasionally so incorrect in their statements that we were offended on the recipe creator’s behalf.
Because we only looked at reviews with star ratings of three or
below, we originally felt it should not be possible for these reviews to
receive a “Positive” label for SENTIMENT. However, we did
encounter review text along the way that was clearly neither “Negative”
nor “Neutral,” and it’s possible the reviewer intended to leave a higher
star rating along with their seemingly “Positive” words. For the sake of
consistency, we continued to categorize these “Positive” reviews as
“Neutral” during labeling so that we could redefine our
SENTIMENT variable later and hopefully eliminate any
confusion. To that end, we replaced the SENTIMENT variable
with the NEGATIVE variable, a binary predictor in which a
value of 1 indicates the review text expressed a “Negative” sentiment,
and a value of 0 simply indicates it did not.
By visualizing the frequency of “Negative” reviews by star rating below, we confirm that the lower the star rating was, the more “Negative” the sentiment of the review text tended to be, as expected:
Almost all of the 1- and 2-star reviews we labeled expressed “Negative” sentiment, whereas a much smaller percentage of the 3-star reviews we labeled did. We also encountered more than twice as many 3-star reviews as we did 1- or 2-star reviews during labeling, but 1- and 2-star reviews are similar enough categories that we could justifiably aggregate them. When we considered them in aggregate, the number of 1- and 2-star reviews we labeled was similar to the number of 3-star reviews we labeled.
We visualized the problems we identified in terms of both frequency and overlap since the problems were non-exclusive.
Reviews that identified MEASUREMENTS errors were most
frequent, whereas reviews that identified TEMPERATURE
errors were least frequent. When reviews mentioned two problems,
MEASUREMENTS was frequently one of them. It is also worth
noting that some of the ingredient-related issues frequently occurred
together, as well as with authenticity issues. Lastly,
TEMPERATURE errors occurred frequently with
TIMING errors, which is sensible because in cooking, timing
issues can often be solved with temperature changes, and vice versa.
Despite the fact that the primary goal of LDA was pre-sorting the reviews so that manually labeling them by problem would hopefully be faster and simpler, we don’t believe there was actually much reduction in the time or effort it cost us to label the data. We visualized the distribution of problems by topic to demonstrate why:
While there were some visible differences in the exact frequencies of
problems by topic, it is clear that LDA did not segment the topics by
the problems, or there would have been topics where particular problems
were relatively absent and others were relatively dominant. We had
suspected topic six might be defined more by MEASUREMENTS
problems than the other topics because its most identifying bigrams were
“too much” and “way too,” but topic three actually had the most
MEASUREMENTS problems, and every topic had more
MEASUREMENTS problems than any other class of problem.
It’s not terribly surprising that LDA failed to segment the topics by the problems we wanted to identify. These problems are non-exclusive, so there has to be some overlap, and we only did a moderate amount of data preparation prior to LDA before manually labeling the data.
Since we anticipated unsupervised learning would be insufficient for our problem classification needs, we proceeded with preparing the labeled data further for supervised learning as planned. First, we collapsed the three ingredient-related problems into a single problem. While the specificity with which we originally labeled the data regarding ingredient issues might be helpful for other analyses, ours benefited from being slightly less complex.
The INGREDIENTS response variable took the place of the
ING_KEY_MISSING, ING_FLAVOR_MISSING, and
ING_WRONG_INCL response variables. A value of 1 for the new
variable indicates there was at least one ingredient issue, whereas a
value of 0 indicates there were no ingredient issues. For the sake of
clarity, the problems/criteria table has been reprinted below to reflect
this change:
| PROBLEM | DESCRIPTION |
|---|---|
| DIRECTIONS | One or more steps was unclear, incorrect, or missing |
| MEASUREMENTS | Included too much or too little of one or more ingredients |
| TIMING | Needed more/less time to prep or to finish cooking |
| TEMPERATURE | Needed higher/lower temperature to cook or assemble correctly |
| INGREDIENTS | Missing a key/flavor ingredient or included an incorrect ingredient |
| INAUTHENTIC | Lacked a resemblance to the dish’s traditional preparation/flavor |
| USER_ERROR | Reviewer suspected/admitted they made a mistake |
Then we refined the text pre-processing and tokenization that was done prior to LDA topic modeling. First, we did simple things like converting all text to lowercase and removing any punctuation marks except for apostrophes (after replacing right single quotation marks used as such). Then, we undertook more complex transformations.
First, we replaced any ingredient a review mentioned (e.g. “chicken,” “flour,” or “pepper”) with a single token: “__ING__.” We hoped this custom usage of stemming would help us group similar language together because we wanted to know how frequently a review discussed ingredients and the contexts in which ingredients were being discussed, but it didn’t really matter which ingredient a reviewer was talking about for our purposes. What mattered, as just one example, was whether the reviewer was saying there was “too much” or “not enough” of said ingredient. The ingredients list we used to reduce ingredient words to a single token was neither perfect nor exhaustive, but it did cover enough common ingredients to be useful.
For similar reasons, we replaced any measurement words and their common abbreviations (e.g. “tablespoons” or “tbsp”) with a single token: “__MEAS__.” We also replaced any numbers (including integers, fractions, mixed fractions, and decimals) with a single token: “__NUM__.”
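As an illustration, the sketch below applies greatly simplified versions of these replacements to a made-up review sentence; the patterns we actually used cover many more ingredients, measurement words, and number formats:

```r
library(stringr)

review <- "i added 1/2 cup more flour and it was still too runny"

review |>
  str_replace_all("\\b\\d+(/\\d+)?\\b", "__NUM__") |>                   # numbers/fractions
  str_replace_all("\\b(cups?|tablespoons?|tbsp|tsp)\\b", "__MEAS__") |>  # measurements
  str_replace_all("\\b(flour|chicken|pepper)\\b", "__ING__")             # ingredients
#> [1] "i added __NUM__ __MEAS__ more __ING__ and it was still too runny"
```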
We then tokenized the refined text into four featuresets: one containing only unigrams, one containing only bigrams, one containing only trigrams, and one containing a combination of all three. Each featureset was represented as a sparse matrix, which we split into train, validate, and test sets.
We generated four kinds of supervised learning models: Naive Bayes, SVM, eXtreme Gradient Boosting, and Logistic Regression. Each type of model was trained on each featureset, and there were seven versions of each kind of model per featureset (one for each non-exclusive problem class). For all SVM models, a linear kernel was used. (We also tested a radial basis kernel for the SVM models, but building them on this training data was extremely slow, and they performed poorly on all problems no matter which featureset was used to train them, tending to make no positive predictions whatsoever.)
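The sketch below condenses how one of these models was built and scored: a linear-kernel SVM for a single problem class. The names `train_x`, `train_y`, `validate_x`, and `validate_y` are stand-ins for one sparse ngram featureset and its “No”/“Yes” label factors; the full loops over featuresets and problem classes appear in the code at the end of the report.

```r
library(e1071)
library(caret)

# Tune the cost parameter with 10-fold cross-validation and keep the best model
ctrl <- tune.control(sampling = "cross", cross = 10)
fit <- tune(svm, train.x = train_x, train.y = train_y,
            kernel = "linear", ranges = list(cost = c(0.1, 1, 10, 100, 1000)),
            tunecontrol = ctrl)$best.model

# Score on the validation split, treating "Yes" (problem asserted) as positive
pred <- predict(fit, validate_x)
confusionMatrix(pred, validate_y, positive = "Yes",
                mode = "prec_recall")$byClass[c("Precision", "Recall", "F1")]
```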
We made predictions using the data that was set aside for validation.
Below are the top performing models by F1 Score for each problem. The
featuresets the top performers used to achieve their results are also
listed in the Token column.
| Model | Problem | Token | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| Naive Bayes | DIRECTIONS | Unigrams | 0.8259 | 0.4848 | 0.3310 | 0.3934 |
| Support Vector Machine | MEASUREMENTS | All | 0.7776 | 0.7204 | 0.6442 | 0.6802 |
| Support Vector Machine | TIMING | Unigrams | 0.9012 | 0.6263 | 0.5688 | 0.5962 |
| eXtreme Gradient Boosting | TEMPERATURE | All | 0.9612 | 0.4800 | 0.3750 | 0.4211 |
| Naive Bayes | INGREDIENTS | All | 0.7412 | 0.6927 | 0.6927 | 0.6927 |
| Support Vector Machine | INAUTHENTIC | All | 0.9012 | 0.4074 | 0.2973 | 0.3438 |
| Support Vector Machine | USER_ERROR | Unigrams | 0.8894 | 0.5625 | 0.3529 | 0.4337 |
Performance on most problems was low for all models, no matter which
featuresets they used, but moderate performance was achieved on two
problems. The SVM model using a combination of unigrams, bigrams, and
trigrams performed well on the MEASUREMENTS problem,
achieving an F1 Score of 68.02 percent, and the Naive Bayes model using
the same combination of ngrams performed well on the
INGREDIENTS problem, achieving an F1 Score of 69.27
percent.
In hopes of improving performance, we eliminated any features that demonstrated a correlation greater than 0.9 (in absolute value) with another feature. Due to computation costs, we had to calculate correlations separately for unigrams, then bigrams, then trigrams, and we then eliminated the highly correlated features from any featureset they appeared in. It would have been better to consider correlations across all ngrams at once; that way, we could have seen whether a trigram was highly correlated with a unigram, for instance. But our computational limitations only allowed us to see whether ngrams of the same length were highly correlated.
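Below is a sketch of this reduction for a single ngram length. It uses `caret::findCorrelation()` as one way to decide which member of a highly correlated pair to drop, and the `train_x` name is illustrative; note that the correlation matrix has to be computed on a densified copy of the featureset, which is what made processing all ngram lengths at once infeasible for us:

```r
library(caret)

# Correlations are computed on a densified copy of one ngram featureset;
# doing this for the combined featureset at once exceeded our memory budget
cors <- cor(as.matrix(train_x))

# Columns correlated above 0.9 (in absolute value) with another column
drop_idx <- findCorrelation(cors, cutoff = 0.9)

# Keep the remaining features (also handles the case where nothing is dropped)
keep_idx <- setdiff(seq_len(ncol(train_x)), drop_idx)
train_x_reduced <- train_x[, keep_idx, drop = FALSE]
```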
We made new predictions on the data set aside for validation using
the reduced featuresets resulting from removing highly correlated ngrams
of the same length. Below are the top performing models by F1 Score for
each problem based on the new predictions. The featuresets the top
performers used to achieve their results are again listed in the
Token column.
| Model | Problem | Token | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| Naive Bayes | DIRECTIONS | Unigrams | 0.8224 | 0.4712 | 0.3379 | 0.3936 |
| Support Vector Machine | MEASUREMENTS | All | 0.7729 | 0.7117 | 0.6410 | 0.6745 |
| Support Vector Machine | TIMING | Unigrams | 0.9012 | 0.6263 | 0.5688 | 0.5962 |
| eXtreme Gradient Boosting | TEMPERATURE | All | 0.9600 | 0.4615 | 0.3750 | 0.4138 |
| Naive Bayes | INGREDIENTS | All | 0.7376 | 0.6860 | 0.6955 | 0.6907 |
| Support Vector Machine | INAUTHENTIC | All | 0.9106 | 0.4792 | 0.3108 | 0.3770 |
| Support Vector Machine | USER_ERROR | Unigrams | 0.8882 | 0.5538 | 0.3529 | 0.4311 |
Performance was still low for most problems for all models, no matter
which featuresets they used. The SVM model using a combination of
unigrams, bigrams, and trigrams still performed well on the
MEASUREMENTS problem, but its F1 Score decreased slightly
to 67.45 percent, and the Naive Bayes model using the same combination
of ngrams still performed well on the INGREDIENTS problem,
but its F1 Score decreased slightly as well (to 69.07 percent).
Even though performance on the data set aside for validation decreased for our two best models after we removed highly correlated ngrams of the same length from all featuresets, we believed this feature reduction might still benefit these models when extrapolating to unseen data. So we made final predictions using the reduced featuresets on the data set aside for testing.
| Model | Problem | Token | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| Support Vector Machine | DIRECTIONS | Unigrams | 0.8057 | 0.4220 | 0.3108 | 0.3580 |
| Support Vector Machine | MEASUREMENTS | All | 0.7986 | 0.7614 | 0.6781 | 0.7174 |
| Support Vector Machine | TIMING | Unigrams | 0.8716 | 0.6330 | 0.5000 | 0.5587 |
| Support Vector Machine | TEMPERATURE | All | 0.9706 | 0.8571 | 0.2000 | 0.3243 |
| Naive Bayes | INGREDIENTS | Unigrams | 0.7020 | 0.6141 | 0.6706 | 0.6411 |
| Support Vector Machine | INAUTHENTIC | All | 0.8846 | 0.5833 | 0.3818 | 0.4615 |
| Support Vector Machine | USER_ERROR | Unigrams | 0.8810 | 0.4462 | 0.3085 | 0.3648 |
Performance increased for the SVM model using all ngrams on the MEASUREMENTS problem; it achieved a final F1 Score of 71.74 percent. However, performance on the INGREDIENTS problem decreased: the best-performing Naive Bayes model now used only unigrams instead of all ngrams, achieving a final F1 Score of 64.11 percent.
We were impressed by the final performance of the SVM model using all
ngrams on the MEASUREMENTS problem, given that we only
labeled 22.7 percent of the 14,954 potentially negative reviews we
isolated for our analysis. Had we had time to label more data, we
believe the performance of the eXtreme Gradient Boosting models would
have improved, as we have seen them outperform both Naive Bayes and SVM
models on text classification tasks in the past.
Testing featuresets built from ngrams of different lengths was fruitful, but the feature space increased dramatically when we combined unigrams, bigrams, and trigrams. If we were to combine them into a single featureset again in the future, we would want to trim the vocabulary of the corpus further first.
We were disappointed that we couldn’t build good classifiers for all
the problem classes, so the answer to the question we asked prior to our
analysis, Can AllRecipes.com review text be classified according
to the problems reviewers are claiming recipes suffer from?, is
“kind of.” Only MEASUREMENTS and INGREDIENTS
problems could be classified with moderate success.
Labeling more data would have enabled us to build better models for
the less frequent problems. We saw F1 Scores greater than 50 percent for
TIMING, so it’s possible we were close to having enough
observations to train a good classifier for that problem.
DIRECTIONS problems were the hardest to classify. While
labeling, we felt that reviewers expressed issues with a recipe’s steps
in much more varied and vague ways than they expressed other issues, so
that makes sense. Rarely would someone say something as simple as “this
recipe is missing a step.” Instead, we were more likely to encounter phrases like “if you try to put this much batter in one cake pan, it’s going to spill over; use two cake pans.” Considering that, it would probably have been worthwhile to separate the
DIRECTIONS problems that were equipment-specific into a
separate EQUIPMENT problem class.
There are many other ways in which this work could be improved upon and taken further. We would like to formalize and expand upon the custom stemming we did to reduce ingredients, measurements, and numbers to single tokens. We would also like to test how this recipe review-specific stemming might improve the LDA topic modeling we did, as we did not develop this stemming process until after LDA had already been performed. Since our supervised learning models performed best using either unigrams alone or a combination of unigrams, bigrams, and trigrams, and we used only bigrams during LDA, it is likely LDA would perform better using ngrams of different lengths as well. We intended for LDA to speed up our analysis, but we realize now that we would have actually needed to spend more time refining it than was available in order for us to have really gotten the most out of this tool.
knitr::opts_chunk$set(echo = FALSE)
library(caret)
library(e1071)
library(glmnet)
library(jsonlite)
library(knitr)
library(Matrix)
library(naivebayes)
library(RColorBrewer)
library(snakecase)
library(stats)
library(stopwords)
library(tidytext)
library(tidyverse)
library(topicmodels)
library(UpSetR)
library(xgboost)
check1 <- c("Completion", "Accuracy", "Replication")
desc1 <- c("The ingredient list is evaluated to be sure it is complete, that measurements are correct, and that ingredients are properly ordered and grouped according to their use.", "Serving sizes and yields are checked for accuracy and to be sure that recipes reflect USDA dietary recommendations.", "Recipes are reviewed and edited to ensure they are replicable when following the instructions. Editors assess the techniques, equipment, and appliances used in the recipes, and explain any intermediate or advanced techniques with simple language and visuals.")
tbl_print <- cbind(check1, desc1)
colnames(tbl_print) <- c("CHECK", "DESCRIPTION")
kable(tbl_print, format = "simple")
prob <- c("DIRECTIONS", "MEASUREMENTS", "TIMING", "TEMPERATURE",
"INGREDIENTS", "INAUTHENTIC", "USER_ERROR")
desc2 <- c("One or more steps was unclear, incorrect, or missing",
"Included too much or too little of one or more ingredients",
"Needed more/less time to prep or to finish cooking",
"Needed higher/lower temperature to cook or assemble correctly",
"Missing a key/flavor ingredient or included an incorrect ingredient",
"Lacked a resemblance to the dish's traditional preparation/flavor",
"Reviewer suspected/admitted they made a mistake")
tbl_print <- cbind(prob, desc2)
colnames(tbl_print) <- c("PROBLEM", "DESCRIPTION")
kable(tbl_print, format = "simple")
base <- "https://raw.githubusercontent.com/geedoubledee/data698_research_project/main/data/allrecipes_df_pt_"
exts <- c("1.csv", "2.csv", "3.csv", "4.csv", "5.csv")
for (i in 1:length(exts)){
url <- paste0(base, exts[i])
df <- read.csv(url)
if (i == 1){
ar_df <- df
}else{
ar_df <- ar_df |>
bind_rows(df)
}
}
url <- "https://raw.githubusercontent.com/geedoubledee/data698_research_project/main/data/links_df.csv"
links_df <- read.csv(url)
variable_descriptions1 <- read.table("variable_descriptions_p1.txt",
sep = "\t", header=TRUE)
kable(variable_descriptions1, format = "simple")
variable_descriptions2 <- read.table("variable_descriptions_p2.txt",
sep = "\t", header=TRUE)
kable(variable_descriptions2, format = "simple")
ar_df <- ar_df |>
left_join(links_df, by = join_by(URL)) |>
filter(Visited != -1) |>
select(-Visited)
rm(links_df)
ar_df <- ar_df |>
filter(!is.na(Reviews_Total))
ar_summ <- ar_df |>
summarize(mean_ratings_tot = round(mean(Ratings_Total), 1),
mean_ratings_avg = round(mean(Ratings_Avg), 1),
mean_reviews_tot = round(mean(Reviews_Total), 1))
colnames(ar_summ) <- c("Ratings Total", "Ratings Average", "Reviews Total")
rownames(ar_summ) <- c("MEAN")
ar_summ <- as.data.frame(t(ar_summ)) |>
rownames_to_column(var = "VARIABLE")
kable(ar_summ, format = "simple")
keep <- c("URL", "Recipe_Name", "Ratings_Total", "Ratings_Avg", "Reviews_Total")
user_sub <- ar_df |>
select(c(all_of(keep), starts_with("User_"))) |>
pivot_longer(cols = starts_with("User_"), names_to = "Review_Num",
values_to = "User", names_prefix = "User_")
rating_sub <- ar_df |>
select(c("URL", starts_with("Rating_"))) |>
pivot_longer(cols = starts_with("Rating_"), names_to = "Review_Num",
values_to = "Rating", names_prefix = "Rating_")
date_sub <- ar_df |>
select(c("URL", starts_with("Date_"))) |>
pivot_longer(cols = starts_with("Date_"), names_to = "Review_Num",
values_to = "Date", names_prefix = "Date_")
txt_sub <- ar_df |>
select(-starts_with("Review_Len")) |>
select(c("URL", starts_with("Review_"))) |>
pivot_longer(cols = starts_with("Review_"), names_to = "Review_Num",
values_to = "Review_Txt", names_prefix = "Review_")
txt_len_sub <- ar_df |>
select(c("URL", starts_with("Review_Len"))) |>
pivot_longer(cols = starts_with("Review_Len"), names_to = "Review_Num",
values_to = "Review_Len", names_prefix = "Review_Len")
helpful_sub <- ar_df |>
select(c("URL", starts_with("Helpful_"))) |>
pivot_longer(cols = starts_with("Helpful_"), names_to = "Review_Num",
values_to = "Helpful", names_prefix = "Helpful_")
ar_df_long <- user_sub |>
left_join(rating_sub, by = join_by(URL, Review_Num)) |>
left_join(date_sub, by = join_by(URL, Review_Num)) |>
left_join(txt_sub, by = join_by(URL, Review_Num)) |>
left_join(txt_len_sub, by = join_by(URL, Review_Num)) |>
left_join(helpful_sub, by = join_by(URL, Review_Num)) |>
filter(!is.na(Rating))
ar_df_long_neg <- ar_df_long |>
filter(Rating < 4)
stop_w <- as.data.frame(stopwords(language = "en", source = "nltk"))
colnames(stop_w) <- c("WORD")
excl <- c("do", "does", "did", "doing", "during", "before", "after", "up",
"down", "over", "under", "when", "why", "how", "all", "any", "both",
"each", "few", "more", "most", "no", "nor", "not", "only", "than",
"too", "very", "can", "will", "don't", "should", "should've",
"aren't", "couldn't", "didn't", "doesn't", "hadn't", "hasn't",
"haven't", "isn't", "mightn't", "mustn't", "needn't", "shan't",
"shouldn't", "wasn't", "weren't", "won't", "wouldn't")
stop_w <- stop_w |>
mutate(SRC = "NLTK") |>
filter(!WORD %in% excl)
keep <- c("URL", "Review_Num", "Review_Txt")
names <- c("w1", "w2")
ar_df_long_neg_bigrams <- ar_df_long_neg |>
select(all_of(keep)) |>
unnest_tokens(output = BIGRAM, input = Review_Txt,
token = "ngrams", n = 2) |>
count(URL, Review_Num, BIGRAM, name = "COUNT") |>
ungroup() |>
separate_wider_delim(BIGRAM, delim = " ", names = names,
cols_remove = FALSE) |>
# flag bigram words that are in the reduced stopword list or parse as numbers
# (note: as.numeric() does not flag fractions like "1/2", so those slip through),
# then drop any bigram with a flagged word in the filter() below
mutate(w1 = ifelse(w1 %in% stop_w$WORD, NA, w1),
w1 = ifelse(!is.na(as.numeric(w1)), NA, w1),
w2 = ifelse(w2 %in% stop_w$WORD, NA, w2),
w2 = ifelse(!is.na(as.numeric(w2)), NA, w2)) |>
filter(!is.na(w1) & !is.na(w2)) |>
select(-all_of(names)) |>
mutate(BIGRAM = to_snake_case(BIGRAM))
ar_df_long_neg_bigrams_dtm <- ar_df_long_neg_bigrams |>
unite("ID", URL, Review_Num, sep = "_Review_Num_") |>
cast_dtm(ID, BIGRAM, COUNT)
get_perplexity <- function(dtm, k_max){
#function adapted from https://bit.ly/4cLUHO8
m <- matrix(,1,1)
for (i in 2:k_max){
lda <- LDA(dtm, i)
per <- perplexity(lda)
m <- plyr::rbind.fill.matrix(m, per)
}
m <- as.data.frame(cbind(matrix(seq(2, k_max), ncol = 1), m[-1,]))
colnames(m) <- c("topics", "perplexity")
return(m)
}
fn <- "data/perplexity_values.csv"
if (!file.exists(fn)){
per_vals <- get_perplexity(ar_df_long_neg_bigrams_dtm, k_max = 12)
write.csv(per_vals, fn, row.names = FALSE, fileEncoding = "UTF-8")
}else{
per_vals <- read.csv(fn)
}
cur_theme <- theme_set(theme_classic())
pal <- brewer.pal(12, "Paired")
p0 <- per_vals |>
ggplot(aes(x = topics, y = perplexity)) +
geom_line(color = pal[9]) +
geom_point(color = pal[10]) +
scale_x_continuous(breaks = seq(2, 12, 1), limits = c(2, 12)) +
scale_y_continuous(breaks = seq(0, 25000, 5000), limits = c(0, 25000)) +
labs(title = "LDA Model Perplexity vs. Number of Topics")
p0
fn <- "ar_lda.rds"
if (!file.exists(fn)){
ar_lda <- LDA(ar_df_long_neg_bigrams_dtm, k = 6, control = list(seed = 208))
saveRDS(ar_lda, "ar_lda.rds")
}else{
ar_lda <- readRDS("ar_lda.rds")
}
ar_topics <- tidy(ar_lda, matrix = "beta")
top_terms <- ar_topics |>
group_by(topic) |>
slice_max(beta, n = 15) |>
ungroup() |>
arrange(topic, -beta) |>
mutate(term = reorder_within(term, beta, topic))
p1 <- top_terms |>
ggplot(aes(beta, term)) +
geom_col(color = pal[10], fill = pal[9]) +
facet_wrap(~ topic, scales = "free_y", nrow = 2) +
scale_y_reordered() +
labs(title = "Bigrams Most Likely to Occur in Topics Identified by LDA")
p1
ar_gamma <- tidy(ar_lda, matrix = "gamma")
names <- c("URL", "Review_Num")
ar_classifications <- ar_gamma |>
separate_wider_delim(document, delim = "_Review_Num_", names = names) |>
group_by(URL, Review_Num) |>
slice_max(gamma) |>
ungroup() |>
mutate(topic = as.factor(topic))
incl <- c("URL", "Review_Num", "Review_Len")
rev_len_by_top <- ar_classifications |>
left_join(ar_df_long_neg |> select(all_of(incl)),
by = join_by(URL, Review_Num))
annotations <- rev_len_by_top |>
group_by(topic) |>
summarize(`1st Qu.:` = round(quantile(Review_Len, probs = 0.25), 0),
`Med.:` = median(Review_Len),
`3rd Qu.:` = round(quantile(Review_Len, probs = 0.75), 0)) |>
pivot_longer(cols = !topic, names_to = "Variable",
values_to = "Value")
p2 <- rev_len_by_top |>
ggplot(aes(x = Review_Len)) +
geom_histogram(binwidth = 100, color = pal[10], fill = pal[9]) +
geom_segment(aes(x = Value, xend = Value, y = 0, yend = 1000),
data = annotations |> filter(Variable == "Med.:"),
linetype="dashed", color = "black", linewidth = 0.5) +
geom_text(aes(x = Value, y = 900, label = paste(Variable, Value)),
data = annotations |> filter(Variable == "Med.:"),
nudge_x = 450, color = "black") +
scale_y_continuous(breaks = seq(0, 1000, 250), limits = c(0, 1000)) +
facet_wrap(~ topic, nrow = 3) +
labs(x = "Review Length (Characters)", y = "Frequency",
title = "Distribution of Review Length by Topic")
p2
fn <- "data/allrecipes_df_neg_labs.csv"
if (!file.exists(fn)){
ar_df_long_neg_labs <- ar_df_long_neg |>
left_join(ar_classifications, by = join_by(URL, Review_Num)) |>
arrange(topic, desc(gamma))
write.csv(ar_df_long_neg_labs, fn, row.names = FALSE, fileEncoding = "UTF-8")
}else{
ar_df_long_neg_labs <- read.csv(fn)
}
ar_df_long_neg_labs <- ar_df_long_neg_labs |>
mutate(topic = ifelse(is.na(topic), 0, topic),
gamma = ifelse(is.na(gamma), 0, gamma)) |>
na.omit()
prob <- c("DIRECTIONS", "MEASUREMENTS", "TIMING", "TEMPERATURE", "ING_KEY_MISSING", "ING_FLAVOR_MISSING", "ING_WRONG_INCL", "INAUTHENTIC", "USER_ERROR")
desc3 <- c("One or more steps was unclear, incorrect, or missing",
"Included too much or too little of one or more ingredients",
"Needed more/less time to prep or to finish cooking",
"Needed higher/lower temperature to cook or assemble correctly",
"Missing a key ingredient",
"Missing a flavor ingredient",
"Included an incorrect ingredient",
"Lacked a resemblance to the dish's traditional preparation/flavor",
"Reviewer suspected/admitted they made a mistake")
tbl_print <- cbind(prob, desc3)
colnames(tbl_print) <- c("PROBLEM", "DESCRIPTION")
kable(tbl_print, format = "simple")
ar_df_long_neg_labs <- ar_df_long_neg_labs |>
rename(NEGATIVE = SENTIMENT) |>
mutate(NEGATIVE = ifelse(NEGATIVE == "Negative", 1, 0))
sel <- c("Negative", "Other")
sent_star_summ_df <- ar_df_long_neg_labs |>
filter(Rating > 0) |>
group_by(Rating) |>
summarize(Total = n(),
Negative = sum(NEGATIVE)) |>
ungroup() |>
mutate(Rating = paste0(Rating, "★"),
Other = Total - Negative) |>
pivot_longer(cols = all_of(sel),
names_to = "Category", values_to = "Value") |>
mutate(Perc = paste0(round(Value / Total * 100, 1), "%"))
greys <- brewer.pal(9, "Greys")
col <- c(pal[10], greys[6])
fil <- c(pal[9], greys[4])
title_str <- "Frequency of Reviews Expressing Negative Sentiment by Star Rating"
p3 <- sent_star_summ_df |>
ggplot(aes(x = Category, y = Value,
group = Category, color = Category, fill = Category)) +
geom_col() +
geom_text(aes(label = Perc), vjust = -0.75,
size = 4, fontface = "bold") +
scale_y_continuous(limits = c(0, 2000), breaks = seq(0, 2000, 250)) +
scale_color_manual(values = col) +
scale_fill_manual(values = fil) +
facet_grid(~ Rating, scales = "free_x", space = "free_x", switch = "y") +
labs(x = "Sentiment", y = "Frequency",
title = title_str) +
theme(legend.position = "none", panel.spacing = unit(0, units = "cm"),
strip.placement = "inside",
strip.background = element_blank(),
strip.text = element_text(size = 12, face = "bold"),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
plot.title.position = "plot")
p3
p4 <- ar_df_long_neg_labs |>
upset(sets = prob, group.by = "sets", cutoff = 4,
matrix.color = pal[10], main.bar.color = pal[10],
sets.bar.color = pal[10], text.scale = 1, show.numbers = "no")
p4
sel <- c("topic", prob)
prob_by_top <- ar_df_long_neg_labs |>
select(all_of(sel)) |>
group_by(topic) |>
summarize(across(all_of(prob), ~ sum(.x))) |>
pivot_longer(cols = !topic, names_to = "Problem", values_to = "Count")
p5 <- prob_by_top |>
filter(topic > 0) |>
ggplot(aes(x = reorder(Problem, desc(Count)), y = Count)) +
geom_bar(stat = "identity", color = pal[10], fill = pal[9]) +
labs(x = "Problem", y = "Frequency",
title = "Distribution of Problems by Topic") +
coord_flip() +
facet_wrap(~ topic, nrow = 3)
p5
ing <- c("ING_KEY_MISSING", "ING_FLAVOR_MISSING", "ING_WRONG_INCL")
ar_df_long_neg_labs <- ar_df_long_neg_labs |>
mutate(INGREDIENTS = rowSums(across(all_of(ing))),
INGREDIENTS = ifelse(INGREDIENTS > 0, 1, 0)) |>
select(-all_of(ing)) |>
relocate(INGREDIENTS, .before = INAUTHENTIC)
prob <- c("DIRECTIONS", "MEASUREMENTS", "TIMING", "TEMPERATURE",
"INGREDIENTS", "INAUTHENTIC", "USER_ERROR")
desc4 <- c("One or more steps was unclear, incorrect, or missing",
"Included too much or too little of one or more ingredients",
"Needed more/less time to prep or to finish cooking",
"Needed higher/lower temperature to cook or assemble correctly",
"Missing a key/flavor ingredient or included an incorrect ingredient",
"Lacked a resemblance to the dish's traditional preparation/flavor",
"Reviewer suspected/admitted they made a mistake")
tbl_print <- cbind(prob, desc4)
colnames(tbl_print) <- c("PROBLEM", "DESCRIPTION")
kable(tbl_print, format = "simple")
my_url1 <- "https://raw.githubusercontent.com/geedoubledee/data698_research_project/main/data/recipe_ing_train.json"
ings <- read_json(my_url1)
copy <- ings
for (i in 1:length(copy)){
l <- copy[[i]]
ings[[i]] <- l$ingredients
}
my_url2 <- "https://raw.githubusercontent.com/geedoubledee/data698_research_project/main/data/recipe_ing_test.json"
ings2 <- read_json(my_url2)
copy <- ings2
for (i in 1:length(copy)){
l <- copy[[i]]
ings2[[i]] <- l$ingredients
}
ings <- as.list(unique(sort(c(unlist(ings), unlist(ings2)), decreasing = TRUE)))
rm(ings2, copy)
ings <- gsub("[^[:alpha:] ]", "", ings)
ings <- lapply(ings, str_squish)
ings <- lapply(ings, tolower)
ings <- lapply(ings, function(x) paste0(" ", x, " "))
ing_patt_string <- paste(unlist(ings), collapse = "|")
# these patterns match numbers only after fractions like "1/2" and decimals like
# "2.5" have been rewritten as "1slash2" and "2point5" in the Review_Txt
# pre-processing below; together they cover integers, mixed fractions,
# fractions, and decimals
int_patt <- c("\\s\\d+\\s(?!\\d)")
mixed_frac_patt <- c("\\s\\d+\\s\\d+slash\\d+\\s")
frac_patt <- c("\\s\\d+slash\\d+\\s")
dec_patt <- c("\\s\\d+point\\d+\\s")
num_patt_string <- paste(int_patt, mixed_frac_patt, frac_patt, dec_patt,
sep = "|")
#adapted from https://en.wikipedia.org/wiki/Cooking_weights_and_measures
meas <- list("drops", "drop", "drs", "dr", "smidgens", "smidgen", "smdgs", "smdg", "smis", "smi", "pinches", "pinch", "pns", "pn", "dashes", "dash", "ds", "teaspoons", "teaspoon", "tsps", "tsp", "t", "tablespoons", "tablespoon", "tbsps", "tbsp", "ounces", "ounce", "oz", "cups", "cup", "c", "pints", "pint", "pts", "pt", "quarts", "quart", "qts", "qt", "gallons", "gallon", "gals", "gal", "pounds", "pound", "lbs", "lb", "large", "small", "medium")
meas <- lapply(meas, function(x) paste0(" ", x, " "))
meas_patt_string <- paste(unlist(meas), collapse = "|")
ar_df_long_neg_ING <- ar_df_long_neg |>
mutate(Review_Txt = tolower(Review_Txt),
Review_Txt = paste0(" ", Review_Txt, " "),
Review_Txt = str_replace_all(Review_Txt,
pattern = "’",
replacement = "'"),
Review_Txt = str_replace_all(Review_Txt,
pattern = "(\\d+)\\.(\\d+)",
replacement = "\\1point\\2"),
Review_Txt = str_replace_all(Review_Txt,
pattern = "(\\d+)/(\\d+)",
replacement = "\\1slash\\2"),
Review_Txt = str_replace_all(Review_Txt,
pattern = "[^[:alnum:][:space:]']",
replacement = " "),
Review_Txt = str_replace_all(Review_Txt,
pattern = num_patt_string,
replacement = " __NUM__ "),
Review_Txt = str_replace_all(Review_Txt,
pattern = ing_patt_string,
replacement = " __ING__ "),
Review_Txt = str_replace_all(Review_Txt,
pattern = meas_patt_string,
replacement = " __MEAS__ "),
Review_Txt = str_squish(Review_Txt))
keep <- c("URL", "Review_Num", "Review_Txt")
ar_df_long_neg_alln <- ar_df_long_neg_ING |>
select(all_of(keep)) |>
unnest_tokens(output = TOKEN, input = Review_Txt,
token = "ngrams", n = 3, n_min = 1,
ngram_delim = ".", to_lower = FALSE,
stopwords = stop_w$WORD) |>
count(URL, Review_Num, TOKEN, name = "COUNT") |>
ungroup() |>
mutate(n = str_count(TOKEN, "\\.") + 1)
ar_df_long_neg_n1 <- ar_df_long_neg_alln |>
filter(n == 1) |>
select(-n)
ar_df_long_neg_n2 <- ar_df_long_neg_alln |>
filter(n == 2) |>
select(-n)
ar_df_long_neg_n3 <- ar_df_long_neg_alln |>
filter(n == 3) |>
select(-n)
ar_df_long_neg_alln <- ar_df_long_neg_alln |>
select(-n)
sel <- c("ID", prob)
ar_df_long_neg_labs_only_num <- ar_df_long_neg_labs |>
unite("ID", URL, Review_Num, sep = "_Review_Num_") |>
select(all_of(sel)) |>
arrange(ID)
ar_df_long_neg_labs_only <- ar_df_long_neg_labs_only_num |>
mutate(across(all_of(prob), ~ factor(., levels = c(0, 1),
labels = c("No", "Yes"))))
ar_df_long_neg_tokens_dtm1 <- ar_df_long_neg_n1 |>
unite("ID", URL, Review_Num, sep = "_Review_Num_") |>
group_by(TOKEN) |>
filter(n() > 2) |>
ungroup() |>
cast_dtm(ID, TOKEN, COUNT)
ar_df_long_neg_tokens_dtm2 <- ar_df_long_neg_n2 |>
unite("ID", URL, Review_Num, sep = "_Review_Num_") |>
group_by(TOKEN) |>
filter(n() > 2) |>
ungroup() |>
cast_dtm(ID, TOKEN, COUNT)
ar_df_long_neg_tokens_dtm3 <- ar_df_long_neg_n3 |>
unite("ID", URL, Review_Num, sep = "_Review_Num_") |>
group_by(TOKEN) |>
filter(n() > 2) |>
ungroup() |>
cast_dtm(ID, TOKEN, COUNT)
ar_df_long_neg_tokens_dtm4 <- ar_df_long_neg_alln |>
unite("ID", URL, Review_Num, sep = "_Review_Num_") |>
group_by(TOKEN) |>
filter(n() > 2) |>
ungroup() |>
cast_dtm(ID, TOKEN, COUNT)
# right_join() restricts the unigram DTM to the labeled reviews and keeps
# labeled reviews whose tokens were all filtered out; those rows receive the
# placeholder term "***__***" so cast_sparse() still produces a row for them
ar_df_long_neg_tokens_sparse1 <- tidy(ar_df_long_neg_tokens_dtm1) |>
right_join(ar_df_long_neg_labs_only |> select(ID),
by = join_by(document == ID)) |>
arrange(document) |>
mutate(term = ifelse(is.na(term), "***__***", term),
count = ifelse(is.na(count), 1, count)) |>
cast_sparse(document, term, count)
ar_df_long_neg_tokens_sparse2 <- tidy(ar_df_long_neg_tokens_dtm2) |>
right_join(ar_df_long_neg_labs_only |> select(ID),
by = join_by(document == ID)) |>
arrange(document) |>
mutate(term = ifelse(is.na(term), "***__***", term),
count = ifelse(is.na(count), 1, count)) |>
cast_sparse(document, term, count)
ar_df_long_neg_tokens_sparse3 <- tidy(ar_df_long_neg_tokens_dtm3) |>
right_join(ar_df_long_neg_labs_only |> select(ID),
by = join_by(document == ID)) |>
arrange(document) |>
mutate(term = ifelse(is.na(term), "***__***", term),
count = ifelse(is.na(count), 1, count)) |>
cast_sparse(document, term, count)
ar_df_long_neg_tokens_sparse4 <- tidy(ar_df_long_neg_tokens_dtm4) |>
right_join(ar_df_long_neg_labs_only |> select(ID),
by = join_by(document == ID)) |>
arrange(document) |>
mutate(term = ifelse(is.na(term), "***__***", term),
count = ifelse(is.na(count), 1, count)) |>
cast_sparse(document, term, count)
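# Split the labeled reviews 50/25/25 into training, validation, and test sets,
# keeping rows aligned across the four feature sets and both label frames.
# (The index vector is named `sample`, masking base::sample, but the later
# sample() call still resolves to the function because R skips non-function
# bindings when looking up a call.)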
set.seed(1006)
sample <- sample(nrow(ar_df_long_neg_tokens_sparse1),
round(nrow(ar_df_long_neg_tokens_sparse1) * 0.5),
replace = FALSE)
train_x1 <- ar_df_long_neg_tokens_sparse1[sample, ]
train_x2 <- ar_df_long_neg_tokens_sparse2[sample, ]
train_x3 <- ar_df_long_neg_tokens_sparse3[sample, ]
train_x4 <- ar_df_long_neg_tokens_sparse4[sample, ]
train_x_all <- list(train_x1, train_x2, train_x3, train_x4)
train_y <- ar_df_long_neg_labs_only[sample, ]
train_y_num <- ar_df_long_neg_labs_only_num[sample, ]
validate_test_x1 <- ar_df_long_neg_tokens_sparse1[-sample, ]
validate_test_x2 <- ar_df_long_neg_tokens_sparse2[-sample, ]
validate_test_x3 <- ar_df_long_neg_tokens_sparse3[-sample, ]
validate_test_x4 <- ar_df_long_neg_tokens_sparse4[-sample, ]
validate_test_y <- ar_df_long_neg_labs_only[-sample, ]
validate_test_y_num <- ar_df_long_neg_labs_only_num[-sample, ]
sample <- sample(nrow(validate_test_x1),
round(nrow(validate_test_x1) * 0.5),
replace = FALSE)
validate_x1 <- validate_test_x1[sample, ]
validate_x2 <- validate_test_x2[sample, ]
validate_x3 <- validate_test_x3[sample, ]
validate_x4 <- validate_test_x4[sample, ]
validate_x_all <- list(validate_x1, validate_x2, validate_x3, validate_x4)
validate_y <- validate_test_y[sample, ]
validate_y_num <- validate_test_y_num[sample, ]
test_x1 <- validate_test_x1[-sample, ]
test_x2 <- validate_test_x2[-sample, ]
test_x3 <- validate_test_x3[-sample, ]
test_x4 <- validate_test_x4[-sample, ]
test_x_all <- list(test_x1, test_x2, test_x3, test_x4)
test_y <- validate_test_y[-sample, ]
test_y_num <- validate_test_y_num[-sample, ]
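# Fit one multinomial Naive Bayes model per token set (uni/bi/tri/all) and per
# problem label (col <- j + 1 skips the ID column), caching each fit as an RDS
# file under models/ so the models are only trained once.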
fol <- c("models/")
dat <- c("uni_", "bi_", "tri_", "all_")
fns <- c("nb1.rds", "nb2.rds", "nb3.rds", "nb4.rds", "nb5.rds", "nb6.rds",
"nb7.rds")
nbs <- list()
count = 0
for (i in 1:length(dat)){
for (j in 1:length(fns)){
fn <- paste0(fol, dat[i], fns[j])
if (!file.exists(fn)){
col <- j + 1
nb <- multinomial_naive_bayes(train_x_all[[i]],
train_y[, col], laplace = 1)
saveRDS(nb, fn)
}else{
nb <- readRDS(fn)
}
count = count + 1
nbs[[count]] <- nb
}
}
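# Same caching pattern for linear-kernel SVMs, tuning the cost parameter by
# 10-fold cross-validation and keeping the best model from each tune() run.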
fns <- c("svm1.rds", "svm2.rds", "svm3.rds", "svm4.rds", "svm5.rds", "svm6.rds",
"svm7.rds")
svms <- list()
count = 0
ctrl <- tune.control(sampling = "cross", cross = 10, nrepeat = 1)
tune_grid <- list(cost = c(0.1, 1, 10, 100, 1000))
for (i in 1:length(dat)){
for (j in 1:length(fns)){
fn <- paste0(fol, dat[i], fns[j])
if (!file.exists(fn)){
col <- j + 1
svm_tune <- tune(svm, train.x = train_x_all[[i]],
train.y = train_y[, col],
kernel = "linear", ranges = tune_grid,
tunecontrol = ctrl)
svm_mod <- svm_tune$best.model
saveRDS(svm_mod, fn)
}else{
svm_mod <- readRDS(fn)
}
count = count + 1
svms[[count]] <- svm_mod
}
}
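# XGBoost models use the numeric 0/1 labels and the binary:hinge objective,
# which returns hard 0/1 class predictions rather than probabilities.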
fns <- c("xgb1.json", "xgb2.json", "xgb3.json", "xgb4.json", "xgb5.json",
"xgb6.json", "xgb7.json")
xgbs <- list()
count = 0
for (i in 1:length(dat)){
for (j in 1:length(fns)){
fn <- paste0(fol, dat[i], fns[j])
if (!file.exists(fn)){
col <- j + 1
xgb <- xgboost(train_x_all[[i]], train_y_num[, col], nrounds = 100,
objective = "binary:hinge", verbose = 0)
xgb.save(xgb, fn)
}else{
xgb <- xgb.load(fn)
}
count = count + 1
xgbs[[count]] <- xgb
}
}
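# Lasso-penalized logistic regression via cv.glmnet (alpha = 1), selecting
# lambda by 10-fold cross-validated AUC.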
fns <- c("logm1.rds", "logm2.rds", "logm3.rds", "logm4.rds", "logm5.rds", "logm6.rds", "logm7.rds")
logms <- list()
count = 0
for (i in 1:length(dat)){
for (j in 1:length(fns)){
fn <- paste0(fol, dat[i], fns[j])
if (!file.exists(fn)){
col <- j + 1
logm <- cv.glmnet(train_x_all[[i]], train_y[, col],
family = "binomial", alpha = 1,
type.logistic = "modified.Newton",
type.measure = "auc", nfolds = 10)
saveRDS(logm, fn)
}else{
logm <- readRDS(fn)
}
count = count + 1
logms[[count]] <- logm
}
}
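# Validation-set evaluation of the Naive Bayes models. The flat index i maps
# back to a token set (dat_num) and a label column (col); each confusion
# matrix is kept whole and also reshaped into a long TN/TP/FN/FP table tagged
# with its token set and problem.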
nb_preds <- list()
nb_cms_complete <- list()
nb_cms <- list()
max_mod_num <- 7
for (i in 1:length(nbs)){
dat_num <- (i - 1) %/% 7 + 1
mod_num <- i %% 7
if (mod_num > 0){
col <- mod_num + 1
}else{
col <- max_mod_num + 1
}
nb <- nbs[[i]]
nb_pred <- predict(nb, validate_x_all[[dat_num]], type = "class")
nb_preds[[i]] <- nb_pred
nbcm_complete <- confusionMatrix(nb_pred, validate_y[, col],
positive = "Yes")
nb_cms_complete[[i]] <- nbcm_complete
nbcm <- as.data.frame(nbcm_complete$table)
nbcm$Reference <- factor(nbcm$Reference,
levels = rev(levels(nbcm$Reference)))
nbcm <- nbcm |>
mutate(
Label = case_when(
Prediction == "No" & Reference == "No" ~ "TN",
Prediction == "Yes" & Reference == "Yes" ~ "TP",
Prediction == "No" & Reference == "Yes" ~ "FN",
Prediction == "Yes" & Reference == "No" ~ "FP"),
Model = paste0(dat[dat_num], prob[col - 1]))
nb_cms[[i]] <- nbcm
}
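# Repeat the same validation-set evaluation for the SVM models.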
svm_preds <- list()
svm_cms_complete <- list()
svm_cms <- list()
for (i in 1:length(svms)){
dat_num <- (i - 1) %/% 7 + 1
mod_num <- i %% 7
if (mod_num > 0){
col <- mod_num + 1
}else{
col <- max_mod_num + 1
}
svm_mod <- svms[[i]]
svm_pred <- predict(svm_mod, validate_x_all[[dat_num]], type = "class")
svm_preds[[i]] <- svm_pred
svmcm_complete <- confusionMatrix(svm_pred, validate_y[, col],
positive = "Yes")
svm_cms_complete[[i]] <- svmcm_complete
svmcm <- as.data.frame(svmcm_complete$table)
svmcm$Reference <- factor(svmcm$Reference,
levels = rev(levels(svmcm$Reference)))
svmcm <- svmcm |>
mutate(
Label = case_when(
Prediction == "No" & Reference == "No" ~ "TN",
Prediction == "Yes" & Reference == "Yes" ~ "TP",
Prediction == "No" & Reference == "Yes" ~ "FN",
Prediction == "Yes" & Reference == "No" ~ "FP"),
Model = paste0(dat[dat_num], prob[col - 1]))
svm_cms[[i]] <- svmcm
}
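# Repeat for the XGBoost models, comparing numeric 0/1 predictions against the
# numeric labels.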
xgb_preds <- list()
xgb_cms_complete <- list()
xgb_cms <- list()
for (i in 1:length(xgbs)){
dat_num <- (i - 1) %/% 7 + 1
mod_num <- i %% 7
if (mod_num > 0){
col <- mod_num + 1
}else{
col <- max_mod_num + 1
}
xgb <- xgbs[[i]]
xgb_pred <- predict(xgb, validate_x_all[[dat_num]], type = "class")
xgb_preds[[i]] <- xgb_pred
xgbcm_complete <- confusionMatrix(as.factor(xgb_pred),
as.factor(validate_y_num[, col]),
positive = "1")
xgb_cms_complete[[i]] <- xgbcm_complete
xgbcm <- as.data.frame(xgbcm_complete$table)
xgbcm$Reference <- factor(xgbcm$Reference,
levels = rev(levels(xgbcm$Reference)))
xgbcm <- xgbcm |>
mutate(
Label = case_when(
Prediction == 0 & Reference == 0 ~ "TN",
Prediction == 1 & Reference == 1 ~ "TP",
Prediction == 0 & Reference == 1 ~ "FN",
Prediction == 1 & Reference == 0 ~ "FP"),
Model = paste0(dat[dat_num], prob[col - 1]))
xgb_cms[[i]] <- xgbcm
}
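# Repeat for the logistic regression models; predictions are coerced to the
# "No"/"Yes" factor levels before building the confusion matrix.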
logm_preds <- list()
logm_cms_complete <- list()
logm_cms <- list()
for (i in 1:length(logms)){
dat_num <- (i - 1) %/% 7 + 1
mod_num <- i %% 7
if (mod_num > 0){
col <- mod_num + 1
}else{
col <- max_mod_num + 1
}
logm <- logms[[i]]
logm_pred <- predict(logm, validate_x_all[[dat_num]], type = "class")
logm_preds[[i]] <- logm_pred
logmcm_complete <- confusionMatrix(factor(logm_pred,
levels = c("No", "Yes")),
validate_y[, col], positive = "Yes")
logm_cms_complete[[i]] <- logmcm_complete
logmcm <- as.data.frame(logmcm_complete$table)
logmcm$Reference <- factor(logmcm$Reference,
levels = rev(levels(logmcm$Reference)))
logmcm <- logmcm |>
mutate(
Label = case_when(
Prediction == "No" & Reference == "No" ~ "TN",
Prediction == "Yes" & Reference == "Yes" ~ "TP",
Prediction == "No" & Reference == "Yes" ~ "FN",
Prediction == "Yes" & Reference == "No" ~ "FP"),
Model = paste0(dat[dat_num], prob[col - 1]))
logm_cms[[i]] <- logmcm
}
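# --- Optional refactoring sketch (hypothetical; not used in this analysis) ---
# The four evaluation loops above repeat the same confusion-matrix
# post-processing; a helper along these lines could replace that boilerplate.
# `cm_long()` is an assumed name, expects factor predictions/references (as
# produced by the NB, SVM, and glmnet models above), and uses only caret and
# dplyr, which are already loaded here.
cm_long <- function(pred, ref, model_label){
  pos <- levels(ref)[2]   # "Yes" (or "1" if numeric labels are factored)
  neg <- levels(ref)[1]   # "No"  (or "0")
  cm <- confusionMatrix(pred, ref, positive = pos)
  out <- as.data.frame(cm$table)
  out$Reference <- factor(out$Reference, levels = rev(levels(out$Reference)))
  out |>
    mutate(Label = case_when(Prediction == neg & Reference == neg ~ "TN",
                             Prediction == pos & Reference == pos ~ "TP",
                             Prediction == neg & Reference == pos ~ "FN",
                             Prediction == pos & Reference == neg ~ "FP"),
           Model = model_label)
}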
# Combine all confusion matrices and assemble a summary of Accuracy,
# Precision, Recall, and F1 per model, problem, and token set, keeping the
# best model (by F1) for each problem.
cms_complete <- c(nb_cms_complete, svm_cms_complete, xgb_cms_complete,
                  logm_cms_complete)
by_class <- list()
overall <- list()
for (i in 1:length(cms_complete)){
cm_complete <- cms_complete[[i]]
by_class[[i]] <- as.data.frame(cm_complete$byClass)
overall[[i]] <- as.data.frame(cm_complete$overall)
}
col1 <- as.data.frame(c(rep("Naive Bayes", 28),
rep("Support Vector Machine", 28),
rep("eXtreme Gradient Boosting", 28),
rep("Logistic Regression", 28)))
colnames(col1) <- c("Model")
col2 <- as.data.frame(rep(prob, 16))
colnames(col2) <- c("Problem")
col3 <- as.data.frame(rep(c(rep("Unigrams", 7),
rep("Bigrams", 7),
rep("Trigrams", 7),
rep("All", 7)), 4))
colnames(col3) <- c("Token")
by_class <- t(round(bind_cols(by_class), 4))
rownames(by_class) <- NULL
overall <- t(round(bind_cols(overall), 4))
rownames(overall) <- NULL
metrics <- as.data.frame(cbind(col1, col2, col3, by_class, overall))
keep <- c("Model", "Problem", "Token", "Accuracy", "Precision", "Recall", "F1")
mods <- c("Naive Bayes", "Support Vector Machine", "eXtreme Gradient Boosting",
"Logistic Regression")
metrics <- metrics |>
select(all_of(keep)) |>
mutate(Model = factor(Model, levels = mods),
Problem = factor(Problem, levels = prob)) |>
group_by(Problem) |>
filter(F1 == max(F1, na.rm = TRUE)) |>
arrange(Problem, desc(F1))
kable(metrics, format = "simple")
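# Identify n-gram features with an absolute pairwise correlation above 0.9 in
# the unigram, bigram, or trigram matrices, caching the list so the expensive
# dense correlation step only runs once.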
fn <- "data/high_corr.csv"
if (!file.exists(fn)){
corr <- cor(as.matrix(ar_df_long_neg_tokens_sparse1))
corr[upper.tri(corr)] <- 0
diag(corr) <- 0
corr2 <- cor(as.matrix(ar_df_long_neg_tokens_sparse2))
corr2[upper.tri(corr2)] <- 0
diag(corr2) <- 0
corr3 <- cor(as.matrix(ar_df_long_neg_tokens_sparse3))
corr3[upper.tri(corr3)] <- 0
diag(corr3) <- 0
high_corr_uni <- colnames(corr[, apply(corr, 2,
function(x) any(abs(x) > 0.9,
na.rm = TRUE))])
high_corr_bi <- colnames(corr2[, apply(corr2, 2,
function(x) any(abs(x) > 0.9,
na.rm = TRUE))])
high_corr_tri <- colnames(corr3[, apply(corr3, 2,
function(x) any(abs(x) > 0.9,
na.rm = TRUE))])
high_corr <- as.data.frame(c(high_corr_uni,
high_corr_bi,
high_corr_tri))
colnames(high_corr) <- c("Variable")
write.csv(high_corr, fn, row.names = FALSE, fileEncoding = "UTF-8")
}else{
high_corr <- read.csv(fn)
}
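# Drop the highly correlated features from the training, validation, and test
# matrices for all four token sets (`copy` holds the unfiltered training
# matrices).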
copy <- train_x_all
rem <- as.character(high_corr$Variable)
for (i in 1:length(copy)){
train_x <- train_x_all[[i]]
validate_x <- validate_x_all[[i]]
test_x <- test_x_all[[i]]
train_x_all[[i]] <- train_x[, !colnames(train_x) %in% rem]
validate_x_all[[i]] <- validate_x[, !colnames(validate_x) %in% rem]
test_x_all[[i]] <- test_x[, !colnames(test_x) %in% rem]
}
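# Retrain all four model families on the reduced feature sets, caching the
# fitted models under models2/.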
fol <- c("models2/")
dat <- c("uni_", "bi_", "tri_", "all_")
fns <- c("nb1.rds", "nb2.rds", "nb3.rds", "nb4.rds", "nb5.rds", "nb6.rds",
"nb7.rds")
nbs <- list()
count = 0
for (i in 1:length(dat)){
for (j in 1:length(fns)){
fn <- paste0(fol, dat[i], fns[j])
if (!file.exists(fn)){
col <- j + 1
nb <- multinomial_naive_bayes(train_x_all[[i]],
train_y[, col], laplace = 1)
saveRDS(nb, fn)
}else{
nb <- readRDS(fn)
}
count = count + 1
nbs[[count]] <- nb
}
}
fns <- c("svm1.rds", "svm2.rds", "svm3.rds", "svm4.rds", "svm5.rds", "svm6.rds",
"svm7.rds")
svms <- list()
count = 0
ctrl <- tune.control(sampling = "cross", cross = 10, nrepeat = 1)
tune_grid <- list(cost = c(0.1, 1, 10, 100, 1000))
for (i in 1:length(dat)){
for (j in 1:length(fns)){
fn <- paste0(fol, dat[i], fns[j])
if (!file.exists(fn)){
col <- j + 1
svm_tune <- tune(svm, train.x = train_x_all[[i]],
train.y = train_y[, col],
kernel = "linear", ranges = tune_grid,
tunecontrol = ctrl)
svm_mod <- svm_tune$best.model
saveRDS(svm_mod, fn)
}else{
svm_mod <- readRDS(fn)
}
count = count + 1
svms[[count]] <- svm_mod
}
}
fns <- c("xgb1.json", "xgb2.json", "xgb3.json", "xgb4.json", "xgb5.json",
"xgb6.json", "xgb7.json")
xgbs <- list()
count = 0
for (i in 1:length(dat)){
for (j in 1:length(fns)){
fn <- paste0(fol, dat[i], fns[j])
if (!file.exists(fn)){
col <- j + 1
xgb <- xgboost(train_x_all[[i]], train_y_num[, col], nrounds = 100,
objective = "binary:hinge", verbose = 0)
xgb.save(xgb, fn)
}else{
xgb <- xgb.load(fn)
}
count = count + 1
xgbs[[count]] <- xgb
}
}
fns <- c("logm1.rds", "logm2.rds", "logm3.rds", "logm4.rds", "logm5.rds", "logm6.rds", "logm7.rds")
logms <- list()
count = 0
for (i in 1:length(dat)){
for (j in 1:length(fns)){
fn <- paste0(fol, dat[i], fns[j])
if (!file.exists(fn)){
col <- j + 1
logm <- cv.glmnet(train_x_all[[i]], train_y[, col],
family = "binomial", alpha = 1,
type.logistic = "modified.Newton",
type.measure = "auc", nfolds = 10)
saveRDS(logm, fn)
}else{
logm <- readRDS(fn)
}
count = count + 1
logms[[count]] <- logm
}
}
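# Re-run the validation-set evaluation for the reduced-feature models,
# mirroring the earlier evaluation loops.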
nb_preds <- list()
nb_cms_complete <- list()
nb_cms <- list()
max_mod_num <- 7
for (i in 1:length(nbs)){
dat_num <- (i - 1) %/% 7 + 1
mod_num <- i %% 7
if (mod_num > 0){
col <- mod_num + 1
}else{
col <- max_mod_num + 1
}
nb <- nbs[[i]]
nb_pred <- predict(nb, validate_x_all[[dat_num]], type = "class")
nb_preds[[i]] <- nb_pred
nbcm_complete <- confusionMatrix(nb_pred, validate_y[, col],
positive = "Yes")
nb_cms_complete[[i]] <- nbcm_complete
nbcm <- as.data.frame(nbcm_complete$table)
nbcm$Reference <- factor(nbcm$Reference,
levels = rev(levels(nbcm$Reference)))
nbcm <- nbcm |>
mutate(
Label = case_when(
Prediction == "No" & Reference == "No" ~ "TN",
Prediction == "Yes" & Reference == "Yes" ~ "TP",
Prediction == "No" & Reference == "Yes" ~ "FN",
Prediction == "Yes" & Reference == "No" ~ "FP"),
Model = paste0(dat[dat_num], prob[col - 1]))
nb_cms[[i]] <- nbcm
}
svm_preds <- list()
svm_cms_complete <- list()
svm_cms <- list()
for (i in 1:length(svms)){
dat_num <- (i - 1) %/% 7 + 1
mod_num <- i %% 7
if (mod_num > 0){
col <- mod_num + 1
}else{
col <- max_mod_num + 1
}
svm_mod <- svms[[i]]
svm_pred <- predict(svm_mod, validate_x_all[[dat_num]], type = "class")
svm_preds[[i]] <- svm_pred
svmcm_complete <- confusionMatrix(svm_pred, validate_y[, col],
positive = "Yes")
svm_cms_complete[[i]] <- svmcm_complete
svmcm <- as.data.frame(svmcm_complete$table)
svmcm$Reference <- factor(svmcm$Reference,
levels = rev(levels(svmcm$Reference)))
svmcm <- svmcm |>
mutate(
Label = case_when(
Prediction == "No" & Reference == "No" ~ "TN",
Prediction == "Yes" & Reference == "Yes" ~ "TP",
Prediction == "No" & Reference == "Yes" ~ "FN",
Prediction == "Yes" & Reference == "No" ~ "FP"),
Model = paste0(dat[dat_num], prob[col - 1]))
svm_cms[[i]] <- svmcm
}
xgb_preds <- list()
xgb_cms_complete <- list()
xgb_cms <- list()
for (i in 1:length(xgbs)){
dat_num <- (i - 1) %/% 7 + 1
mod_num <- i %% 7
if (mod_num > 0){
col <- mod_num + 1
}else{
col <- max_mod_num + 1
}
xgb <- xgbs[[i]]
xgb_pred <- predict(xgb, validate_x_all[[dat_num]], type = "class")
xgb_preds[[i]] <- xgb_pred
xgbcm_complete <- confusionMatrix(as.factor(xgb_pred),
as.factor(validate_y_num[, col]),
positive = "1")
xgb_cms_complete[[i]] <- xgbcm_complete
xgbcm <- as.data.frame(xgbcm_complete$table)
xgbcm$Reference <- factor(xgbcm$Reference,
levels = rev(levels(xgbcm$Reference)))
xgbcm <- xgbcm |>
mutate(
Label = case_when(
Prediction == 0 & Reference == 0 ~ "TN",
Prediction == 1 & Reference == 1 ~ "TP",
Prediction == 0 & Reference == 1 ~ "FN",
Prediction == 1 & Reference == 0 ~ "FP"),
Model = paste0(dat[dat_num], prob[col - 1]))
xgb_cms[[i]] <- xgbcm
}
logm_preds <- list()
logm_cms_complete <- list()
logm_cms <- list()
for (i in 1:length(logms)){
dat_num <- (i - 1) %/% 7 + 1
mod_num <- i %% 7
if (mod_num > 0){
col <- mod_num + 1
}else{
col <- max_mod_num + 1
}
logm <- logms[[i]]
logm_pred <- predict(logm, validate_x_all[[dat_num]], type = "class")
logm_preds[[i]] <- logm_pred
logmcm_complete <- confusionMatrix(factor(logm_pred,
levels = c("No", "Yes")),
validate_y[, col], positive = "Yes")
logm_cms_complete[[i]] <- logmcm_complete
logmcm <- as.data.frame(logmcm_complete$table)
logmcm$Reference <- factor(logmcm$Reference,
levels = rev(levels(logmcm$Reference)))
logmcm <- logmcm |>
mutate(
Label = case_when(
Prediction == "No" & Reference == "No" ~ "TN",
Prediction == "Yes" & Reference == "Yes" ~ "TP",
Prediction == "No" & Reference == "Yes" ~ "FN",
Prediction == "Yes" & Reference == "No" ~ "FP"),
Model = paste0(dat[dat_num], prob[col - 1]))
logm_cms[[i]] <- logmcm
}
# Rebuild the metrics summary for the reduced-feature validation results.
cms_complete <- c(nb_cms_complete, svm_cms_complete, xgb_cms_complete,
                  logm_cms_complete)
by_class <- list()
overall <- list()
for (i in 1:length(cms_complete)){
cm_complete <- cms_complete[[i]]
by_class[[i]] <- as.data.frame(cm_complete$byClass)
overall[[i]] <- as.data.frame(cm_complete$overall)
}
col1 <- as.data.frame(c(rep("Naive Bayes", 28),
rep("Support Vector Machine", 28),
rep("eXtreme Gradient Boosting", 28),
rep("Logistic Regression", 28)))
colnames(col1) <- c("Model")
col2 <- as.data.frame(rep(prob, 16))
colnames(col2) <- c("Problem")
col3 <- as.data.frame(rep(c(rep("Unigrams", 7),
rep("Bigrams", 7),
rep("Trigrams", 7),
rep("All", 7)), 4))
colnames(col3) <- c("Token")
by_class <- t(round(bind_cols(by_class), 4))
rownames(by_class) <- NULL
overall <- t(round(bind_cols(overall), 4))
rownames(overall) <- NULL
metrics <- as.data.frame(cbind(col1, col2, col3, by_class, overall))
keep <- c("Model", "Problem", "Token", "Accuracy", "Precision", "Recall", "F1")
mods <- c("Naive Bayes", "Support Vector Machine", "eXtreme Gradient Boosting",
"Logistic Regression")
metrics <- metrics |>
select(all_of(keep)) |>
mutate(Model = factor(Model, levels = mods),
Problem = factor(Problem, levels = prob)) |>
group_by(Problem) |>
filter(F1 == max(F1, na.rm = TRUE)) |>
arrange(Problem, desc(F1))
kable(metrics, format = "simple")
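# Final evaluation on the held-out test set, using the reduced-feature models
# loaded above and the same confusion-matrix and metrics summary as before.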
nb_preds <- list()
nb_cms_complete <- list()
nb_cms <- list()
max_mod_num <- 7
for (i in 1:length(nbs)){
dat_num <- (i - 1) %/% 7 + 1
mod_num <- i %% 7
if (mod_num > 0){
col <- mod_num + 1
}else{
col <- max_mod_num + 1
}
nb <- nbs[[i]]
nb_pred <- predict(nb, test_x_all[[dat_num]], type = "class")
nb_preds[[i]] <- nb_pred
nbcm_complete <- confusionMatrix(nb_pred, test_y[, col],
positive = "Yes")
nb_cms_complete[[i]] <- nbcm_complete
nbcm <- as.data.frame(nbcm_complete$table)
nbcm$Reference <- factor(nbcm$Reference,
levels = rev(levels(nbcm$Reference)))
nbcm <- nbcm |>
mutate(
Label = case_when(
Prediction == "No" & Reference == "No" ~ "TN",
Prediction == "Yes" & Reference == "Yes" ~ "TP",
Prediction == "No" & Reference == "Yes" ~ "FN",
Prediction == "Yes" & Reference == "No" ~ "FP"),
Model = paste0(dat[dat_num], prob[col - 1]))
nb_cms[[i]] <- nbcm
}
svm_preds <- list()
svm_cms_complete <- list()
svm_cms <- list()
for (i in 1:length(svms)){
dat_num <- (i - 1) %/% 7 + 1
mod_num <- i %% 7
if (mod_num > 0){
col <- mod_num + 1
}else{
col <- max_mod_num + 1
}
svm_mod <- svms[[i]]
svm_pred <- predict(svm_mod, test_x_all[[dat_num]], type = "class")
svm_preds[[i]] <- svm_pred
svmcm_complete <- confusionMatrix(svm_pred, test_y[, col],
positive = "Yes")
svm_cms_complete[[i]] <- svmcm_complete
svmcm <- as.data.frame(svmcm_complete$table)
svmcm$Reference <- factor(svmcm$Reference,
levels = rev(levels(svmcm$Reference)))
svmcm <- svmcm |>
mutate(
Label = case_when(
Prediction == "No" & Reference == "No" ~ "TN",
Prediction == "Yes" & Reference == "Yes" ~ "TP",
Prediction == "No" & Reference == "Yes" ~ "FN",
Prediction == "Yes" & Reference == "No" ~ "FP"),
Model = paste0(dat[dat_num], prob[col - 1]))
svm_cms[[i]] <- svmcm
}
xgb_preds <- list()
xgb_cms_complete <- list()
xgb_cms <- list()
for (i in 1:length(xgbs)){
dat_num <- (i - 1) %/% 7 + 1
mod_num <- i %% 7
if (mod_num > 0){
col <- mod_num + 1
}else{
col <- max_mod_num + 1
}
xgb <- xgbs[[i]]
xgb_pred <- predict(xgb, test_x_all[[dat_num]], type = "class")
xgb_preds[[i]] <- xgb_pred
xgbcm_complete <- confusionMatrix(as.factor(xgb_pred),
as.factor(test_y_num[, col]),
positive = "1")
xgb_cms_complete[[i]] <- xgbcm_complete
xgbcm <- as.data.frame(xgbcm_complete$table)
xgbcm$Reference <- factor(xgbcm$Reference,
levels = rev(levels(xgbcm$Reference)))
xgbcm <- xgbcm |>
mutate(
Label = case_when(
Prediction == 0 & Reference == 0 ~ "TN",
Prediction == 1 & Reference == 1 ~ "TP",
Prediction == 0 & Reference == 1 ~ "FN",
Prediction == 1 & Reference == 0 ~ "FP"),
Model = paste0(dat[dat_num], prob[col - 1]))
xgb_cms[[i]] <- xgbcm
}
logm_preds <- list()
logm_cms_complete <- list()
logm_cms <- list()
for (i in 1:length(logms)){
dat_num <- (i - 1) %/% 7 + 1
mod_num <- i %% 7
if (mod_num > 0){
col <- mod_num + 1
}else{
col <- max_mod_num + 1
}
logm <- logms[[i]]
logm_pred <- predict(logm, test_x_all[[dat_num]], type = "class")
logm_preds[[i]] <- logm_pred
logmcm_complete <- confusionMatrix(factor(logm_pred,
levels = c("No", "Yes")),
test_y[, col], positive = "Yes")
logm_cms_complete[[i]] <- logmcm_complete
logmcm <- as.data.frame(logmcm_complete$table)
logmcm$Reference <- factor(logmcm$Reference,
levels = rev(levels(logmcm$Reference)))
logmcm <- logmcm |>
mutate(
Label = case_when(
Prediction == "No" & Reference == "No" ~ "TN",
Prediction == "Yes" & Reference == "Yes" ~ "TP",
Prediction == "No" & Reference == "Yes" ~ "FN",
Prediction == "Yes" & Reference == "No" ~ "FP"),
Model = paste0(dat[dat_num], prob[col - 1]))
logm_cms[[i]] <- logmcm
}
# Metrics summary for the test-set results.
cms_complete <- c(nb_cms_complete, svm_cms_complete, xgb_cms_complete,
                  logm_cms_complete)
by_class <- list()
overall <- list()
for (i in 1:length(cms_complete)){
cm_complete <- cms_complete[[i]]
by_class[[i]] <- as.data.frame(cm_complete$byClass)
overall[[i]] <- as.data.frame(cm_complete$overall)
}
col1 <- as.data.frame(c(rep("Naive Bayes", 28),
rep("Support Vector Machine", 28),
rep("eXtreme Gradient Boosting", 28),
rep("Logistic Regression", 28)))
colnames(col1) <- c("Model")
col2 <- as.data.frame(rep(prob, 16))
colnames(col2) <- c("Problem")
col3 <- as.data.frame(rep(c(rep("Unigrams", 7),
rep("Bigrams", 7),
rep("Trigrams", 7),
rep("All", 7)), 4))
colnames(col3) <- c("Token")
by_class <- t(round(bind_cols(by_class), 4))
rownames(by_class) <- NULL
overall <- t(round(bind_cols(overall), 4))
rownames(overall) <- NULL
metrics <- as.data.frame(cbind(col1, col2, col3, by_class, overall))
keep <- c("Model", "Problem", "Token", "Accuracy", "Precision", "Recall", "F1")
mods <- c("Naive Bayes", "Support Vector Machine", "eXtreme Gradient Boosting",
"Logistic Regression")
metrics <- metrics |>
select(all_of(keep)) |>
mutate(Model = factor(Model, levels = mods),
Problem = factor(Problem, levels = prob)) |>
group_by(Problem) |>
filter(F1 == max(F1, na.rm = TRUE)) |>
arrange(Problem, desc(F1))
kable(metrics, format = "simple")