Abstract

Home cooks of all skill levels frequent AllRecipes.com for meal inspiration, but the quality of user-submitted recipe content on the site varies widely. We have identified seven kinds of non-exclusive problems that recipe reviewers might claim a recipe suffers from: unclear or missing steps, incorrect measurements, incorrect cooking/preparation time, incorrect temperature, a key/flavor ingredient being missing or an incorrect ingredient being included, a lack of authenticity, and user error. Classifying reviews according to the problems they identify could enable AllRecipes.com to home in on the recipe content that is most likely to need reevaluation.

The methodology we developed to predict the problems asserted in recipe reviews included: isolating the reviews most likely to assert problems; employing unsupervised learning with Latent Dirichlet Allocation (LDA) to divide those reviews into topics; manually labeling reviews according to the sentiment expressed and the problems asserted; employing supervised learning with Naive Bayes, Support Vector Machine (SVM), eXtreme Gradient Boosting, and Logistic Regression models to predict problem classes using only unigrams, only bigrams, only trigrams, or a combination of all three; and attempting to improve performance by removing highly correlated ngram features.

Only two problems were frequent enough for us to build moderately successful models: MEASUREMENTS and INGREDIENTS. The best MEASUREMENTS classifier was an SVM model using a combination of ngrams, and it achieved an F1 Score of 71.74 percent while being 79.86 percent accurate. The best INGREDIENTS classifier was a Naive Bayes model using only unigrams that achieved an F1 Score of 64.11 percent while being 70.20 percent accurate.

Keywords: natural language processing, text classification, topic modeling, support vector machine, naive bayes

Introduction

According to AllRecipes.com’s About Us page, “Only the best recipes achieve Kitchen Approved status and get published and promoted” on the Web site. When users submit recipes for “Kitchen Approved” consideration on AllRecipes.com, the Web site’s team of recipe editors claims to check the content for “Completion,” “Accuracy,” and “Replication,” among other things. The issues these quality assurance checks are supposed to cover are described below:

CHECK DESCRIPTION
Completion The ingredient list is evaluated to be sure it is complete, that measurements are correct, and that ingredients are properly ordered and grouped according to their use.
Accuracy Serving sizes and yields are checked for accuracy and to be sure that recipes reflect USDA dietary recommendations.
Replication Recipes are reviewed and edited to ensure they are replicable when following the instructions. Editors assess the techniques, equipment, and appliances used in the recipes, and explain any intermediate or advanced techniques with simple language and visuals.

However, the quality of user-submitted recipe content on the site still varies widely, and users frequently express opinions that suggest “Kitchen Approved” recipes suffer from problems even after having gone through AllRecipes.com’s purported quality assurance checks. These recipes may very well benefit from further review, and we believe utilizing text classification to predict problems asserted by reviews could help AllRecipes.com focus its attention on recipes that reviewers believe need it most.

We have identified seven distinct but non-exclusive problems that users might claim published recipes suffer from; descriptions of these problems are below:

PROBLEM DESCRIPTION
DIRECTIONS One or more steps was unclear, incorrect, or missing
MEASUREMENTS Included too much or too little of one or more ingredients
TIMING Needed more/less time to prep or to finish cooking
TEMPERATURE Needed higher/lower temperature to cook or assemble correctly
INGREDIENTS Missing a key/flavor ingredient or included an incorrect ingredient
INAUTHENTIC Lacked a resemblance to the dish’s traditional preparation/flavor
USER_ERROR Reviewer suspected/admitted they made a mistake

The question we intended to answer with our analysis was:

Can AllRecipes.com review text be classified according to the problems reviewers are claiming recipes suffer from?

To answer this question, we utilized a combination of unsupervised and supervised learning techniques. Of the 134,800 total reviews we gathered, we grouped the 14,954 reviews with a star rating of three or below into six topics using LDA, and we labeled 3,398 of these reviews according to sentiment expressed and problems asserted. We generated four featuresets for these reviews: one containing only unigrams, one containing only bigrams, one containing only trigrams, and a combined featureset containing all three. Four kinds of supervised learning models were trained on each featureset, and we needed seven versions of each kind of model per featureset (one for each non-exclusive problem class).

Literature Review

The classification of text, including review text and even more specifically recipe review text, has been previously analyzed from many perspectives. Our literature review focused on: 1) unsupervised learning techniques that might reduce the amount of time spent manually labeling the review text for supervised learning, as well as the pros and cons of some forms of text pre-processing prior to unsupervised learning; 2) comparisons of several supervised learning techniques commonly used in Natural Language Processing (NLP), as well as their applications in multi-class and non-exclusive classification; 3) developing featuresets from ngrams of different lengths; and 4) performance metrics for assessing text classifiers when the labeled classes are imbalanced.

Unsupervised Learning: When reviewing unsupervised learning techniques that can be applied to NLP, we decided early on that we would not consider fuzzy clustering via artificial neural networks like the ones described in Derhgawen et al. (2020) due to computation costs. We also rejected semantic orientation as described in Chaovalit & Zhou (2005), as it can only be used in binary classification tasks like simple sentiment analysis.

Instead, we looked at other methods. Both Sutherland et al. (2020) and Schofield & Mimno (2016) have used topic modeling via LDA, a common unsupervised learning technique in NLP, to classify text into coherent groups automatically. Using LDA, Sutherland et al. (2020) were able to identify, then validate, 14 distinct topics (value, amenities, etc.) of customer satisfaction in reviews guests left for various accommodations (hotels, etc.), so it seemed promising that we might achieve similar success employing LDA to automatically categorize reviews prior to manually labeling them. However, Sutherland et al. (2020) and Schofield & Mimno (2016) took different approaches to stemming, a popular text pre-processing method that simplifies the vocabulary of a corpus by collapsing words with the same beginning but different endings into a single root word. Sutherland et al. (2020) found stemming to be a key part of their topic modeling success, while Schofield & Mimno (2016) found that stemming did not improve their topic modeling regardless of how many coherent groups they tried to identify in their corpora; all stemming methods, even light ones, reduced performance. So the question, “To stem, or not to stem?” remained for us.
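As a quick illustration of what stemming does, below is a standalone sketch using the SnowballC package (which is not part of our pipeline); the Porter stemmer collapses inflected forms into shared roots:

library(SnowballC) # illustration only; not used elsewhere in this analysis
words <- c("baking", "baked", "bakes", "overcooked", "overcooking")
wordStem(words, language = "porter")
# expect roughly: "bake" "bake" "bake" "overcook" "overcook"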

Supervised Learning: Having settled on LDA as our unsupervised learning technique, whether it made manually labeling our reviews any faster or not, we considered a wide variety of supervised learning techniques commonly used in NLP. We settled on: 1) Naive Bayes models, used by Lohar et al. (2017), Pranckevičius & Marcinkevičius (2017), Zhang & Lee (2003), and Madushani & Arudchelvam (2021); 2) SVM models, used by Al Amrani et al. (2018), Pranckevičius & Marcinkevičius (2017), and Zhang & Lee (2003); 3) Logistic Regression models, used by Pranckevičius & Marcinkevičius (2017); and 4) eXtreme Gradient Boosting models. Both Al Amrani et al. (2018) and Pranckevičius & Marcinkevičius (2017) used Random Forest models, but we knew from previous experience those models would not be able to handle the sparse matrices we would need to use to represent our ngram featuresets. So we chose eXtreme Gradient Boosting models, which can handle sparse matrices, instead.

Lohar et al. (2017) successfully used Naive Bayes models to classify Microsoft Office customer feedback into six categories (bug, complaint, etc.). While the categories were non-exclusive like those in our analysis, there was not as much overlap in these categories as there was in the problems we wanted to identify in recipe review text. So their best performing model was a one-versus-rest classifier, whereas we needed separate classifiers for each problem we wanted to identify.

Unlike Lohar et al. (2017), Pranckevičius & Marcinkevičius (2017) compared the results of many kinds of models, not just Naive Bayes ones. They found that Logistic Regression models performed better than Naive Bayes, Random Forest, Decision Tree, and SVM models at a multi-class classification task using Amazon product review text to predict star rating (1-5). However, Logistic Regression models predicted some classes much better than others, unlike the rest of the models in their comparison, which were more stable. (Note that the only kernel function they were able to use for their SVM models was the linear kernel, as other kernel functions were not yet implemented in the Apache Spark machine learning library they used for their analysis.)

Like Pranckevičius & Marcinkevičius (2017), Zhang & Lee (2003) compared the results of many kinds of models. They found that SVM models using a linear kernel performed better than Nearest Neighbors models, Naive Bayes models, Decision Tree models, and Sparse Network of Winnows models at a multi-class classification task using question text to predict the kind of question being asked. They also found there was no performance difference between linear SVM models and SVM models using radial basis, polynomial, or sigmoid kernels.

Ngrams of Different Lengths: In addition to wanting to test a variety of models, we also wanted to test a variety of featuresets based on ngrams of different lengths. Using only Naive Bayes models, Madushani & Arudchelvam (2021) found that the models always performed better as the length of the ngrams used to train them increased. That is, bigrams alone produced better models than unigrams alone, and trigrams alone produced better models than bigrams alone. However, when comparing performance across a variety of models, Pranckevičius & Marcinkevičius (2017) found that unigrams alone were sufficient. Bigrams alone didn’t perform better than unigrams alone, and neither did trigrams alone. All three combined did not improve performance either.

Performance Metrics: Jurafsky & Martin (2024) stress the importance of precision, recall, and F scores in measuring text classifier performance when the classes are imbalanced like the recipe problems we wanted to identify in the review text. Lohar et al. (2017) found unsurprisingly that when predicting multiple non-exclusive classes, it is the classes that are the best represented in the data that are easiest for models to predict. While they only examined Naive Bayes models in their analysis, we found the same to be true regardless of the type of model used in our analysis.
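For reference, precision, recall, and the F1 Score can be computed from true positive (tp), false positive (fp), and false negative (fn) counts for a class, as in the minimal sketch below; the counts in the usage line are hypothetical:

# Minimal sketch of the metrics used to evaluate each problem classifier,
# assuming tp, fp, and fn have been tallied from a confusion matrix.
prf <- function(tp, fp, fn){
    precision <- tp / (tp + fp)
    recall <- tp / (tp + fn)
    f1 <- 2 * precision * recall / (precision + recall)
    c(Precision = precision, Recall = recall, F1 = f1)
}
prf(tp = 100, fp = 40, fn = 50) # hypothetical counts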

Methodology

In order to predict the problems asserted in recipe reviews, we developed the following methodology:

First, we isolated the reviews most likely to assert problems, i.e. those with a rating of three or fewer stars. While some positive-leaning reviews might also have identified recipe problems, it was reasonable for us to assume that negative-leaning reviews would be more likely to do so, and we expected 3-star and below reviews to express more negative sentiment than 4- or 5-star reviews would.

Then we employed LDA to fuzzy cluster the reviews into an optimal number of topics based on their bigrams. We chose bigrams because they include more context than unigrams, and some words that are commonly considered stopwords for text data were useful for our analysis. Many reviews included text like “too much” or “not enough,” both of which could indicate MEASUREMENTS errors, and we hoped LDA would group reviews together based on similar informative language like this so that the next step of our methodology, manual labeling, might be faster. The optimal number of topics turned out to be six.

Reviews were manually labeled according to sentiment expressed and problems asserted. Once that was completed, we employed Naive Bayes, SVM, eXtreme Gradient Boosting, and Logistic Regression models to predict problem classes using only unigrams, only bigrams, only trigrams, or a combination of all three. Models were then ranked by their F1 Scores on each of the seven problem classes.

Finally, we attempted to improve performance for all models by removing highly correlated ngram features from all featuresets.

Experimentation & Results

The recipe data we gathered included several fields. Names and descriptions of the first 15 fields are below:

VARIABLE DESCRIPTION
URL Web page address for the recipe
Recipe_Name Name of the recipe
Basic_Info List of info such as Prep Time, Cook Time, Servings, etc.
Ingredients List of ingredients
Directions List of directions
Nutrition List of info such as Calories, Fat, Protein, etc.
Ratings_Total Total number of star ratings
Ratings_Avg Average star rating
Reviews_Total Total number of reviews
MHPR_User User who left Most Helpful Positive Review
MHPR_Rating Star rating of Most Helpful Positive Review
MHPR_Date Date of Most Helpful Positive Review
MHPR_Review Text of Most Helpful Positive Review
MHPR_Review_Len Length in characters of Most Helpful Positive Review
MHPR_Helpful Count of users who marked Most Helpful Positive Review helpful

The “Most Helpful Positive Review” (MHPR) referenced in the last six of these fields is the first review that AllRecipes.com shows users. It is determined by user input, specifically the number of other users who have marked the review “helpful.” Some recipes didn’t have any positive reviews marked helpful by other users, so their MHPR fields contained empty strings or NA values.

The remaining field names and their descriptions are below.

VARIABLE DESCRIPTION
User_[0-8] User who left review
Rating_[0-8] Star rating for review
Date_[0-8] Date of review
Review_[0-8] Text of review
Review_Len[0-8] Length in characters of review
Helpful_[0-8] Count of users who marked review helpful

Because these fields represent the nine most recent reviews for a recipe, the same six field names were used for each review, then numbered zero through eight to indicate the review number. For instance, the field “User_[0-8]” in the table above actually refers to nine iterations of the “User” field, one for each of the nine most recent reviews for a recipe. Some recipes had been reviewed fewer than nine times, so any review fields beyond the number of times they had actually been reviewed contained empty strings and NA values. So there was more review data available for some recipes than others.
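A condensed sketch of this wide-to-long pivot for the repeated User fields is below (the full version in the appendix joins the remaining per-review fields the same way); ar_df is the wide recipe data frame:

# Condensed sketch: pivot the nine repeated User_* columns into one row per review.
library(tidyverse)
user_sub <- ar_df |>
    select(URL, Recipe_Name, starts_with("User_")) |>
    pivot_longer(cols = starts_with("User_"), names_to = "Review_Num",
                 values_to = "User", names_prefix = "User_")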

During the data collection process, bad URLs were given a Visited value of -1 to separate them from good links we had already visited (value: 1) and links we had not yet visited (value: 0). Some of these bad links were added to the records erroneously, so we identified and removed them.

We also intended to exclude any recipes that had never been reviewed from the data collection process, but a small number of them did make their way into the records, so we removed those as well.

That left us with data for 18,347 recipes that had been reviewed at least once. We summarize the mean number of ratings, the mean star rating, and the mean number of reviews for recipes in our dataset below:

VARIABLE MEAN
Ratings Total 114.9
Ratings Average 4.5
Reviews Total 90.0

The recipes in our dataset were generally rated very highly, with a mean star rating of 4.5. They generally received large numbers of ratings (mean: 114.9) and reviews (mean: 90) as well.

Initial Data Preparation for Unsupervised Learning

To prepare our data for unsupervised learning, we created a corpus of potentially negative reviews by pivoting our wide review data into a longer format, dropping many columns we didn’t need for our analysis, and removing any reviews with star ratings of 4 or higher from the 134,800 total reviews we gathered. We hoped this would make manually labeling reviews by what problem they were identifying faster by limiting the number of reviews we would have to look at that didn’t identify any problems at all. That left us with 14,954 potentially negative reviews.

We created a featureset using only bigrams for these 14,954 potentially negative reviews. Bigrams provide more context than unigrams, and bigrams that contained words that are commonly considered stopwords for text data were useful for our analysis. Many reviews included text like “too much” or “not enough,” both of which could indicate MEASUREMENTS errors, and we hoped LDA would group reviews together based on similar informative language like this. So we did not exclude bigrams that contained words like “too” and “not” from consideration. We did eliminate bigrams containing non-useful stopwords (e.g. pronouns), as well as bigrams containing numbers, but we noticed later that our number-identification method unintentionally left in bigrams containing fractions.
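A condensed sketch of this bigram featureset construction is below, assuming the long review data frame ar_df_long_neg and a stopword list stop_w from which useful words like “too” and “not” have already been removed (the complete version, including the snake_case normalization of the bigrams, appears in the appendix):

# Condensed sketch: count bigrams per review, then drop bigrams containing
# remaining stopwords or numbers; stop_w no longer includes "too", "not", etc.
library(tidytext)
library(tidyverse)
ar_df_long_neg_bigrams <- ar_df_long_neg |>
    select(URL, Review_Num, Review_Txt) |>
    unnest_tokens(output = BIGRAM, input = Review_Txt,
                  token = "ngrams", n = 2) |>
    filter(!is.na(BIGRAM)) |>
    count(URL, Review_Num, BIGRAM, name = "COUNT") |>
    separate_wider_delim(BIGRAM, delim = " ", names = c("w1", "w2"),
                         cols_remove = FALSE) |>
    filter(!w1 %in% stop_w$WORD, !w2 %in% stop_w$WORD,
           is.na(suppressWarnings(as.numeric(w1))),
           is.na(suppressWarnings(as.numeric(w2)))) |>
    select(-w1, -w2)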

Unsupervised Learning: Topic Modeling via LDA

Using values of k from 2 to 12, we measured the perplexity of several different LDA models that would automatically partition the data into k topics. The resulting plot demonstrates how the perplexity for a model tends to decrease as k increases, but after a certain point, the reduction in perplexity achieved by increasing k begins to drop off.
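A condensed sketch of that perplexity search is below, assuming the bigram document-term matrix ar_df_long_neg_bigrams_dtm defined in the appendix (the full helper function also appears there):

# Condensed sketch: fit one LDA model per value of k and record its perplexity;
# the elbow in the resulting curve suggests a reasonable number of topics.
library(topicmodels)
ks <- 2:12
per_vals <- data.frame(topics = ks,
                       perplexity = sapply(ks, function(k){
                           perplexity(LDA(ar_df_long_neg_bigrams_dtm, k = k))
                       }))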

In our case, the reduction in LDA model perplexity began to drop off after about six topics, so we set k to six in our final LDA model. The bigrams most likely to occur within the six identified topics were:

There was a lot of commonality in the bigrams that were most likely to occur within these six topics. For instance, “next time” was the first or second most identifying bigram in four out of six topics, indicating the reviews in each of those topics were likely to exhibit some disappointment and make some suggestions for improvement. The bigrams “too much” and “way too” were the first and second most identifying bigrams for topic six, so it seemed likely from this plot that MEASUREMENTS errors might define this topic. However, we were also able to find these bigrams somewhere in the identifying bigrams list for every single other topic, so we did not expect MEASUREMENTS errors to be exclusive to topic six.

Despite all the overlap, there were some bigrams that were unique to the top identifiers for each topic. It was notable, though, that some of these unique bigrams were related to specific ingredients, e.g. brown sugar being an identifier of topic one, cream cheese being an identifier of topics two and four, and soy sauce being an identifier of topic six. Specific ingredients therefore resulted in some unintended grouping. We suspected review length might have unintentionally factored into the grouping as well, so we plotted the distribution of review length by topic to confirm whether or not that was the case:

The topics actually had fairly similar review length distributions. Topic four had the highest median review length at 201 characters, but that was only 22 more characters than the lowest median review length (179 characters, shared by topics three and six). So review length did not adversely affect the topic modeling.

Data Labeling for Supervised Learning

We exported the potentially negative reviews, manually labeled a manageable portion of them, then loaded said newly labeled data for further data preparation and the application of supervised learning techniques.

There were 3,398 labeled observations in total: 515 observations from each of the six topics identified during LDA, plus 308 observations that were not composed of any significant bigrams and were thus assigned a topic of NA. (We recoded NA topic values as 0.) Reasons why a review might not have been composed of any significant bigrams included:

  • the review was devoid of any words (e.g. “!!!” )

  • the review contained only stopwords (e.g. “so so”)

  • all possible bigrams in the review contained stopwords (e.g. “I hated it”)

The observations were labeled in two ways:

  1. as either “Neutral” or “Negative” in SENTIMENT

  2. by which of nine* specific problems they identified (if any): DIRECTIONS, MEASUREMENTS, TIMING, TEMPERATURE, ING_KEY_MISSING, ING_FLAVOR_MISSING, ING_WRONG_INCL, INAUTHENTIC, and USER_ERROR

*Note that these nine problems were later reduced to the seven problems mentioned previously by combining the three ingredient-related problems (ING_KEY_MISSING, ING_FLAVOR_MISSING, and ING_WRONG_INCL) into a single problem (INGREDIENTS).

Descriptions of the problems and the criteria we used to determine whether a review identified them are outlined below:

PROBLEM DESCRIPTION
DIRECTIONS One or more steps was unclear, incorrect, or missing
MEASUREMENTS Included too much or too little of one or more ingredients
TIMING Needed more/less time to prep or to finish cooking
TEMPERATURE Needed higher/lower temperature to cook or assemble correctly
ING_KEY_MISSING Missing a key ingredient
ING_FLAVOR_MISSING Missing a flavor ingredient
ING_WRONG_INCL Included an incorrect ingredient
INAUTHENTIC Lacked a resemblance to the dish’s traditional preparation/flavor
USER_ERROR Reviewer suspected/admitted they made a mistake

Labeling text manually is always subjective, and it requires a strong commitment to consistency. We tried to apply the same criteria to every review and to resort to the same logic when assigning labels, especially in borderline cases. For instance, we decided it was not enough for a reviewer to say that a recipe was “missing something” for us to label it as indicating ING_FLAVOR_MISSING. The reviewer had to go one step further and state what they added or what they would add next time to improve the flavor. However, there were surely times when we made a judgment call one way based on the text of one review, then a different way when we encountered similar text in another review. It is also true that another person might have made very different labeling decisions throughout the process. We stand by the quality of the labels we assigned nonetheless.

Recipe review text has some aspects that make labeling it a more complicated task than labeling other kinds of text. Many reviews were very long, but expressed little or nothing of use for our analysis. Reviewers were also frequently responding to one another when they left reviews, so reading those reviews felt like listening to one half of a phone call. It was more confusing than revealing. Many reviews required some reading between the lines, and in some cases the recipe itself had to be consulted to confirm our understanding of what was being said. Lastly, reviewers can be wrong. We did not allow the incorrectness of any statements in a review to disqualify the review from receiving problem indicator labels. However, some reviewers were occasionally so incorrect in their statements that we were offended on the recipe creator’s behalf.

Because we only looked at reviews with star ratings of three or below, we originally felt it should not be possible for these reviews to receive a “Positive” label for SENTIMENT. However, we did encounter review text along the way that was clearly neither “Negative” nor “Neutral,” and it’s possible the reviewer intended to leave a higher star rating along with their seemingly “Positive” words. For the sake of consistency, we continued to categorize these “Positive” reviews as “Neutral” during labeling so that we could redefine our SENTIMENT variable later and hopefully eliminate any confusion. To that end, we replaced the SENTIMENT variable with the NEGATIVE variable, a binary predictor in which a value of 1 indicates the review text expressed a “Negative” sentiment, and a value of 0 simply indicates it did not.
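The recoding itself was a single transformation of the labeled data frame, shown in condensed form below (the same code appears in the appendix):

# Replace the "Neutral"/"Negative" SENTIMENT label with a binary NEGATIVE indicator.
library(tidyverse)
ar_df_long_neg_labs <- ar_df_long_neg_labs |>
    rename(NEGATIVE = SENTIMENT) |>
    mutate(NEGATIVE = ifelse(NEGATIVE == "Negative", 1, 0))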

By visualizing the frequency of “Negative” reviews by star rating below, we confirm that the lower the star rating was, the more “Negative” the sentiment of the review text tended to be, as expected:

Almost all of the 1- and 2-star reviews we labeled expressed “Negative” sentiment, whereas a much smaller percentage of the 3-star reviews we labeled did. We also encountered more than twice as many 3-star reviews as we did 1- or 2-star reviews during labeling, but 1- and 2-star reviews are similar enough categories that we could justifiably aggregate them. When we considered them in aggregate, the number of 1- and 2-star reviews we labeled was similar to the number of 3-star reviews we labeled.

We visualized the problems we identified in terms of both frequency and overlap since the problems were non-exclusive.

Reviews that identified MEASUREMENTS errors were most frequent, whereas reviews that identified TEMPERATURE errors were least frequent. When reviews mentioned two problems, MEASUREMENTS was frequently one of them. It is also worth noting that some of the ingredient-related issues frequently occurred together, as well as with authenticity issues. Lastly, TEMPERATURE errors occurred frequently with TIMING errors, which is sensible because in cooking, timing issues can often be solved with temperature changes, and vice versa.

Despite the fact that the primary goal of LDA was pre-sorting the reviews so that manually labeling them by problem would hopefully be faster and simpler, we don’t believe there was actually much reduction in the time or effort it cost us to label the data. We visualized the distribution of problems by topic to demonstrate why:

While there were some visible differences in the exact frequencies of problems by topic, it is clear that LDA did not segment the topics by the problems, or there would have been topics where particular problems were relatively absent and others were relatively dominant. We had suspected topic six might be defined more by MEASUREMENTS problems than the other topics because its most identifying bigrams were “too much” and “way too,” but topic three actually had the most MEASUREMENTS problems, and every topic had more MEASUREMENTS problems than any other class of problem.

It’s not terribly surprising that LDA failed to segment the topics by the problems we wanted to identify. These problems are non-exclusive, so there has to be some overlap, and we only did a moderate amount of data preparation prior to LDA before manually labeling the data.

Secondary Data Preparation for Supervised Learning

Since we anticipated unsupervised learning would be insufficient for our problem classification needs, we proceeded with preparing the labeled data further for supervised learning as planned. First, we collapsed the three ingredient-related problems into a single problem. While the specificity with which we originally labeled the data regarding ingredient issues might be helpful for other analyses, ours benefited from being slightly less complex.

The INGREDIENTS response variable took the place of the ING_KEY_MISSING, ING_FLAVOR_MISSING, and ING_WRONG_INCL response variables. A value of 1 for the new variable indicates there was at least one ingredient issue, whereas a value of 0 indicates there were no ingredient issues. For the sake of clarity, the problems/criteria table has been reprinted below to reflect this change:

PROBLEM DESCRIPTION
DIRECTIONS One or more steps was unclear, incorrect, or missing
MEASUREMENTS Included too much or too little of one or more ingredients
TIMING Needed more/less time to prep or to finish cooking
TEMPERATURE Needed higher/lower temperature to cook or assemble correctly
INGREDIENTS Missing a key/flavor ingredient or included an incorrect ingredient
INAUTHENTIC Lacked a resemblance to the dish’s traditional preparation/flavor
USER_ERROR Reviewer suspected/admitted they made a mistake

Then we refined the text pre-processing and tokenization that was done prior to LDA topic modeling. We began with simple steps, like converting all text to lowercase and removing any punctuation marks except for apostrophes (after replacing right single quotation marks used as apostrophes). Then we undertook more complex transformations.

First, we replaced any ingredients a review mentioned (e.g. “chicken,” “flour,” or “pepper”) with a single token: “__ING__.” We hoped this custom usage of stemming would help us group similar language together, because we wanted to know how frequently a review discussed ingredients and the contexts in which ingredients were being discussed, but it didn’t really matter which ingredient a reviewer was talking about for our purposes. What mattered, as just one example, was whether the reviewer was saying there was “too much” or “not enough” of said ingredient. The ingredients list we used to reduce ingredient words to a single token was neither perfect nor exhaustive, but it did cover enough common ingredients to be useful.

For similar reasons, we replaced any measurement words and their common abbreviations (e.g. “tablespoons” or “tbsp”) with a single token: “__MEAS__.” We also replaced any numbers (including integers, fractions, mixed fractions, and decimals) with a single token: “__NUM__.”
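A condensed sketch of these replacements is below, assuming the regular-expression pattern strings ing_patt_string, meas_patt_string, and num_patt_string built from the ingredient list, the measurement vocabulary, and the number formats described above (their full construction, along with the punctuation handling and the fraction/decimal rewriting the number patterns rely on, appears in the appendix):

# Condensed sketch: collapse ingredient, measurement, and number mentions into
# single placeholder tokens before ngram tokenization.
library(tidyverse)
ar_df_long_neg_ING <- ar_df_long_neg |>
    mutate(Review_Txt = tolower(Review_Txt),
           Review_Txt = str_replace_all(Review_Txt, num_patt_string, " __NUM__ "),
           Review_Txt = str_replace_all(Review_Txt, ing_patt_string, " __ING__ "),
           Review_Txt = str_replace_all(Review_Txt, meas_patt_string, " __MEAS__ "),
           Review_Txt = str_squish(Review_Txt))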

We then tokenized the refined text into four featuresets: one containing only unigrams, one containing only bigrams, one containing only trigrams, and one containing a combination of all three. Each featureset was represented as a sparse matrix, which we split into train, validate, and test sets.
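A minimal sketch of the split is below; the object names, the 60/20/20 proportions, and the seed are illustrative, with X standing in for one of the sparse ngram matrices and Y for the matching label data frame:

# Minimal sketch: split a sparse featureset and its labels into
# train (60%), validate (20%), and test (20%) sets by shuffled row index.
library(Matrix) # X is assumed to be a sparse dgCMatrix of ngram counts
set.seed(1)
n <- nrow(X)
idx <- sample(seq_len(n))
train_idx <- idx[1:floor(0.6 * n)]
valid_idx <- idx[(floor(0.6 * n) + 1):floor(0.8 * n)]
test_idx <- idx[(floor(0.8 * n) + 1):n]
X_train <- X[train_idx, ]; Y_train <- Y[train_idx, ]
X_valid <- X[valid_idx, ]; Y_valid <- Y[valid_idx, ]
X_test <- X[test_idx, ]; Y_test <- Y[test_idx, ]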

Supervised Learning: Model Development

We generated four kinds of supervised learning models: Naive Bayes, SVM, eXtreme Gradient Boosting, and Logistic Regression. Each kind of model was trained on each featureset, and there were seven versions of each kind of model per featureset (one for each non-exclusive problem class). A linear kernel was used for all SVM models. (We also tested a radial basis kernel, but those SVM models were not only incredibly slow to build on this training data, they also performed poorly on every problem, no matter which featureset was used to train them, and they tended to make no positive predictions whatsoever.)
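As one example of the per-problem training loop, below is a minimal sketch of fitting a linear-kernel SVM for the MEASUREMENTS class on a sparse training matrix; the object names follow the split sketch above, and the other model families were fit analogously with the naivebayes, xgboost, and glmnet packages:

# Minimal sketch: one linear-kernel SVM for one non-exclusive problem class.
# X_train/X_valid are sparse dgCMatrix objects of ngram counts;
# Y_train$MEASUREMENTS is a factor with levels "No"/"Yes".
library(e1071)
library(Matrix)
svm_meas <- svm(x = X_train, y = Y_train$MEASUREMENTS,
                kernel = "linear", scale = FALSE)
pred_meas <- predict(svm_meas, X_valid)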

Supervised Learning: Model Validation

We made predictions using the data that was set aside for validation. Below are the top performing models by F1 Score for each problem. The featuresets the top performers used to achieve their results are also listed in the Token column.

Model Problem Token Accuracy Precision Recall F1
Naive Bayes DIRECTIONS Unigrams 0.8259 0.4848 0.3310 0.3934
Support Vector Machine MEASUREMENTS All 0.7776 0.7204 0.6442 0.6802
Support Vector Machine TIMING Unigrams 0.9012 0.6263 0.5688 0.5962
eXtreme Gradient Boosting TEMPERATURE All 0.9612 0.4800 0.3750 0.4211
Naive Bayes INGREDIENTS All 0.7412 0.6927 0.6927 0.6927
Support Vector Machine INAUTHENTIC All 0.9012 0.4074 0.2973 0.3438
Support Vector Machine USER_ERROR Unigrams 0.8894 0.5625 0.3529 0.4337

Performance on most problems was low for all models, no matter which featuresets they used, but moderate performance was achieved on two problems. The SVM model using a combination of unigrams, bigrams, and trigrams performed well on the MEASUREMENTS problem, achieving an F1 Score of 68.02 percent, and the Naive Bayes model using the same combination of ngrams performed well on the INGREDIENTS problem, achieving an F1 Score of 69.27 percent.

Supervised Learning: Model Adjustment

In hopes of improving performance, we eliminated any features that demonstrated a correlation greater than 0.9 (in absolute value) with another feature. Due to computation costs, we had to calculate correlations separately for unigrams, then bigrams, then trigrams, and we then eliminated these highly correlated features from any featureset they appeared in. It would have been better to consider correlations across all ngram lengths at once, so that we could have seen, for instance, whether a trigram was highly correlated with a unigram, but our limitations only allowed us to see whether ngrams of the same length were highly correlated with one another.
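Below is a minimal sketch of this reduction for a single ngram length, using caret's findCorrelation on a unigram training matrix X_train (densified for the correlation computation); the object names are illustrative and the 0.9 cutoff matches the one described above:

# Minimal sketch: flag unigram features with pairwise |correlation| > 0.9 and
# drop them; the same step was repeated separately for bigrams and trigrams.
# Note: near-zero-variance columns may need to be dropped first to avoid NA correlations.
library(caret)
cor_uni <- cor(as.matrix(X_train))                # dense correlation matrix
drop_cols <- findCorrelation(cor_uni, cutoff = 0.9)
keep_cols <- setdiff(seq_len(ncol(X_train)), drop_cols)
X_train_reduced <- X_train[, keep_cols]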

We made new predictions on the data set aside for validation using the reduced featuresets resulting from removing highly correlated ngrams of the same length. Below are the top performing models by F1 Score for each problem based on the new predictions. The featuresets the top performers used to achieve their results are again listed in the Token column.

Model Problem Token Accuracy Precision Recall F1
Naive Bayes DIRECTIONS Unigrams 0.8224 0.4712 0.3379 0.3936
Support Vector Machine MEASUREMENTS All 0.7729 0.7117 0.6410 0.6745
Support Vector Machine TIMING Unigrams 0.9012 0.6263 0.5688 0.5962
eXtreme Gradient Boosting TEMPERATURE All 0.9600 0.4615 0.3750 0.4138
Naive Bayes INGREDIENTS All 0.7376 0.6860 0.6955 0.6907
Support Vector Machine INAUTHENTIC All 0.9106 0.4792 0.3108 0.3770
Support Vector Machine USER_ERROR Unigrams 0.8882 0.5538 0.3529 0.4311

Performance was still low for most problems for all models, no matter which featuresets they used. The SVM model using a combination of unigrams, bigrams, and trigrams still performed well on the MEASUREMENTS problem, but its F1 Score decreased slightly to 67.45 percent, and the Naive Bayes model using the same combination of ngrams still performed well on the INGREDIENTS problem, but its F1 Score decreased slightly as well (to 69.07 percent).

Supervised Learning: Model Evaluation

Even though performance on the data set aside for validation decreased for our two best models after we removed highly correlated ngrams of the same length from all featuresets, we believed this feature reduction might still benefit these models when extrapolating to unseen data. So we made final predictions using the reduced featuresets on the data set aside for testing.

Model Problem Token Accuracy Precision Recall F1
Support Vector Machine DIRECTIONS Unigrams 0.8057 0.4220 0.3108 0.3580
Support Vector Machine MEASUREMENTS All 0.7986 0.7614 0.6781 0.7174
Support Vector Machine TIMING Unigrams 0.8716 0.6330 0.5000 0.5587
Support Vector Machine TEMPERATURE All 0.9706 0.8571 0.2000 0.3243
Naive Bayes INGREDIENTS Unigrams 0.7020 0.6141 0.6706 0.6411
Support Vector Machine INAUTHENTIC All 0.8846 0.5833 0.3818 0.4615
Support Vector Machine USER_ERROR Unigrams 0.8810 0.4462 0.3085 0.3648

Performance increased for the SVM model using all ngrams on the MEASUREMENTS problem, which achieved a final F1 Score of 71.74 percent. However, performance decreased for the Naive Bayes model on the INGREDIENTS problem; its best version used only unigrams instead of all ngrams and achieved a final F1 Score of 64.11 percent.

Conclusions & Future Work

We were impressed by the final performance of the SVM model using all ngrams on the MEASUREMENTS problem, given that we only labeled 22.7 percent of the 14,954 potentially negative reviews we isolated for our analysis. Had we had time to label more data, we believe the performance of the eXtreme Gradient Boosting models would have improved, as we have seen them outperform both Naive Bayes and SVM models on text classification tasks in the past.

Testing featuresets built from ngrams of different lengths was fruitful, but the feature space dramatically increased when combining unigrams, bigrams, and trigrams. If we were to combine these into a single featureset again in the future, we would want to trim the vocabulary of the corpus further first.

We were disappointed that we couldn’t build good classifiers for all the problem classes, so the answer to the question we asked prior to our analysis (“Can AllRecipes.com review text be classified according to the problems reviewers are claiming recipes suffer from?”) is “kind of.” Only MEASUREMENTS and INGREDIENTS problems could be classified with moderate success.

Labeling more data would have enabled us to build better models for the less frequent problems. We saw F1 Scores greater than 50 percent for TIMING, so it’s possible we were close to having enough observations to train a good classifier for that problem. DIRECTIONS problems were the hardest to classify, which makes sense: while labeling, we felt that reviewers expressed issues with a recipe’s steps in much more varied and vague ways than they expressed other issues. Rarely would someone say something as simple as “this recipe is missing a step.” Instead, we were more likely to encounter phrases like “if you try to put this much batter in one cake pan, it’s going to spill over; use two cake pans.” Considering that, it would probably have been worthwhile to separate the DIRECTIONS problems that were equipment-specific into a separate EQUIPMENT problem class.

There are many other ways in which this work could be improved upon and taken further. We would like to formalize and expand upon the custom stemming we did to reduce ingredients, measurements, and numbers to single tokens. We would also like to test how this recipe review-specific stemming might improve the LDA topic modeling we did, as we did not develop this stemming process until after LDA had already been performed. Since our supervised learning models performed best using either unigrams alone or a combination of unigrams, bigrams, and trigrams, and we used only bigrams during LDA, it is likely LDA would perform better using ngrams of different lengths as well. We intended for LDA to speed up our analysis, but we now realize that getting the most out of this tool would have required more time for refinement than we had available.

Bibliography

Al Amrani, Yassine, Lazaar, Mohamed, & El Kadiri, Kamal Eddine. (2018). Random forest and support vector machine based hybrid approach to sentiment analysis. Procedia Computer Science, 127, 511–520. https://doi.org/10.1016/j.procs.2018.01.150
Chaovalit, Pimwadee, & Zhou, Lina. (2005). Movie review mining: A comparison between supervised and unsupervised classification approaches. Proceedings of the 38th Annual Hawaii International Conference on System Sciences, 112c. http://dx.doi.org/10.1109/HICSS.2005.445
Derhgawen, Shubham, Gogineni, Himaja, Chatterjee, Subhasish, & Tak, Rajesh. (2020). Efficient comment classification through NLP and fuzzy classification. International Journal of Trend in Scientific Research and Development, 4(3), 1000–1006. https://www.ijtsrd.com/papers/ijtsrd30758.pdf
Jurafsky, Daniel, & Martin, James H. (2024). Naive bayes, text classification, and sentiment. In Speech and language processing (3rd draft). https://web.stanford.edu/~jurafsky/slp3/
Lohar, Pintu, Dutta Chowdhury, Koel, Afli, Haithem, Hasanuzzaman, Mohammed, & Way, Andy. (2017). ADAPT at IJCNLP-2017 task 4: A multinomial naive bayes classification approach for customer feedback analysis task. https://www.aclweb.org/anthology/I17-4027
Madushani, PPE, & Arudchelvam, T. (2021). Fine-grained sentiment analysis on food recipe reviews. http://drr.vau.ac.lk/handle/123456789/290
Pranckevičius, Tomas, & Marcinkevičius, Virginijus. (2017). Comparison of naive bayes, random forest, decision tree, support vector machines, and logistic regression classifiers for text reviews classification. Baltic Journal of Modern Computing, 5(2), 221–232. http://dx.doi.org/10.22364/bjmc.2017.5.2.05
Schofield, Alexandra, & Mimno, David. (2016). Comparing apples to apple: The effects of stemmers on topic models. Transactions of the Association for Computational Linguistics, 4, 287–300. https://doi.org/10.1162/tacl_a_00099
Sutherland, Ian, Sim, Youngseok, Lee, Seul Ki, Byun, Jaemun, & Kiatkawsin, Kiattipoom. (2020). Topic modeling of online accommodation reviews via latent dirichlet allocation. Sustainability, 12(5), 1821. https://doi.org/10.3390/su12051821
Zhang, Dell, & Lee, Wee Sun. (2003). Question classification using support vector machines. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval, 26–32. https://doi.org/10.1145/860435.860443

Video Presentation

Here is a link to the video presentation.

Appendix with Code

knitr::opts_chunk$set(echo = FALSE)
library(caret)
library(e1071)
library(glmnet)
library(jsonlite)
library(knitr)
library(Matrix)
library(naivebayes)
library(RColorBrewer)
library(snakecase)
library(stats)
library(stopwords)
library(tidytext)
library(tidyverse)
library(topicmodels)
library(UpSetR)
library(xgboost)

check1 <- c("Completion", "Accuracy", "Replication")
desc1 <- c("The ingredient list is evaluated to be sure it is complete, that measurements are correct, and that ingredients are properly ordered and grouped according to their use.", "Serving sizes and yields are checked for accuracy and to be sure that recipes reflect USDA dietary recommendations.", "Recipes are reviewed and edited to ensure they are replicable when following the instructions. Editors assess the techniques, equipment, and appliances used in the recipes, and  explain any intermediate or advanced techniques with simple language and visuals.")
tbl_print <- cbind(check1, desc1)
colnames(tbl_print) <- c("CHECK", "DESCRIPTION")
kable(tbl_print, format = "simple")

prob <- c("DIRECTIONS", "MEASUREMENTS", "TIMING", "TEMPERATURE",
          "INGREDIENTS", "INAUTHENTIC", "USER_ERROR")
desc2 <- c("One or more steps was unclear, incorrect, or missing",
          "Included too much or too little of one or more ingredients",
          "Needed more/less time to prep or to finish cooking",
          "Needed higher/lower temperature to cook or assemble correctly",
          "Missing a key/flavor ingredient or included an incorrect ingredient",
          "Lacked a resemblance to the dish's traditional preparation/flavor",
          "Reviewer suspected/admitted they made a mistake")
tbl_print <- cbind(prob, desc2)
colnames(tbl_print) <- c("PROBLEM", "DESCRIPTION")
kable(tbl_print, format = "simple")

base <- "https://raw.githubusercontent.com/geedoubledee/data698_research_project/main/data/allrecipes_df_pt_"
exts <- c("1.csv", "2.csv", "3.csv", "4.csv", "5.csv")
for (i in 1:length(exts)){
    url <- paste0(base, exts[i])
    df <- read.csv(url)
    if (i == 1){
        ar_df <- df
    }else{
        ar_df <- ar_df |>
            bind_rows(df)
    }
}
url <- "https://raw.githubusercontent.com/geedoubledee/data698_research_project/main/data/links_df.csv"
links_df <- read.csv(url)
variable_descriptions1 <- read.table("variable_descriptions_p1.txt",
                                    sep = "\t", header=TRUE)
kable(variable_descriptions1, format = "simple")

variable_descriptions2 <- read.table("variable_descriptions_p2.txt",
                                    sep = "\t", header=TRUE)
kable(variable_descriptions2, format = "simple")

ar_df <- ar_df |>
    left_join(links_df, by = join_by(URL)) |>
    filter(Visited != -1) |>
    select(-Visited)
rm(links_df)

ar_df <- ar_df |>
    filter(!is.na(Reviews_Total))

ar_summ <- ar_df |>
    summarize(mean_ratings_tot = round(mean(Ratings_Total), 1),
              mean_ratings_avg = round(mean(Ratings_Avg), 1),
              mean_reviews_tot = round(mean(Reviews_Total), 1))
colnames(ar_summ) <- c("Ratings Total", "Ratings Average", "Reviews Total")
rownames(ar_summ) <- c("MEAN")
ar_summ <- as.data.frame(t(ar_summ)) |>
    rownames_to_column(var = "VARIABLE")
kable(ar_summ, format = "simple")

keep <- c("URL", "Recipe_Name", "Ratings_Total", "Ratings_Avg", "Reviews_Total")
user_sub <- ar_df |>
    select(c(all_of(keep), starts_with("User_"))) |>
    pivot_longer(cols = starts_with("User_"), names_to = "Review_Num",
                 values_to = "User", names_prefix = "User_")
rating_sub <- ar_df |>
    select(c("URL", starts_with("Rating_"))) |>
    pivot_longer(cols = starts_with("Rating_"), names_to = "Review_Num",
                 values_to = "Rating", names_prefix = "Rating_")
date_sub <- ar_df |>
    select(c("URL", starts_with("Date_"))) |>
    pivot_longer(cols = starts_with("Date_"), names_to = "Review_Num",
                 values_to = "Date", names_prefix = "Date_")
txt_sub <- ar_df |>
    select(-starts_with("Review_Len")) |>
    select(c("URL", starts_with("Review_"))) |>
    pivot_longer(cols = starts_with("Review_"), names_to = "Review_Num",
                 values_to = "Review_Txt", names_prefix = "Review_")
txt_len_sub <- ar_df |>
    select(c("URL", starts_with("Review_Len"))) |>
    pivot_longer(cols = starts_with("Review_Len"), names_to = "Review_Num",
                 values_to = "Review_Len", names_prefix = "Review_Len")
helpful_sub <- ar_df |>
    select(c("URL", starts_with("Helpful_"))) |>
    pivot_longer(cols = starts_with("Helpful_"), names_to = "Review_Num",
                 values_to = "Helpful", names_prefix = "Helpful_")
ar_df_long <- user_sub |>
    left_join(rating_sub, by = join_by(URL, Review_Num)) |>
    left_join(date_sub, by = join_by(URL, Review_Num)) |>
    left_join(txt_sub, by = join_by(URL, Review_Num)) |>
    left_join(txt_len_sub, by = join_by(URL, Review_Num)) |>
    left_join(helpful_sub, by = join_by(URL, Review_Num)) |>
    filter(!is.na(Rating))
ar_df_long_neg <- ar_df_long |>
    filter(Rating < 4)

stop_w <- as.data.frame(stopwords(language = "en", source = "nltk"))
colnames(stop_w) <- c("WORD")
excl <- c("do", "does", "did", "doing", "during", "before", "after", "up",
          "down", "over", "under", "when", "why", "how", "all", "any", "both",
          "each", "few", "more", "most", "no", "nor", "not", "only", "than",
          "too", "very", "can", "will", "don't", "should", "should've",
          "aren't", "couldn't", "didn't", "doesn't", "hadn't", "hasn't",
          "haven't", "isn't", "mightn't", "mustn't", "needn't", "shan't",
          "shouldn't", "wasn't", "weren't", "won't", "wouldn't")
stop_w <- stop_w |>
    mutate(SRC = "NLTK") |>
    filter(!WORD %in% excl)
keep <- c("URL", "Review_Num", "Review_Txt")
names <- c("w1", "w2")
ar_df_long_neg_bigrams <- ar_df_long_neg |>
    select(all_of(keep)) |>
    unnest_tokens(output = BIGRAM, input = Review_Txt,
                  token = "ngrams", n = 2) |>
    count(URL, Review_Num, BIGRAM, name = "COUNT") |>
    ungroup() |>
    separate_wider_delim(BIGRAM, delim = " ", names = names,
                         cols_remove = FALSE) |>
    mutate(w1 = ifelse(w1 %in% stop_w$WORD, NA, w1),
           w1 = ifelse(!is.na(as.numeric(w1)), NA, w1),
           w2 = ifelse(w2 %in% stop_w$WORD, NA, w2),
           w2 = ifelse(!is.na(as.numeric(w2)), NA, w2)) |>
    filter(!is.na(w1) & !is.na(w2)) |>
    select(-all_of(names)) |>
    mutate(BIGRAM = to_snake_case(BIGRAM))
ar_df_long_neg_bigrams_dtm <- ar_df_long_neg_bigrams |>
    unite("ID", URL, Review_Num, sep = "_Review_Num_") |>
    cast_dtm(ID, BIGRAM, COUNT)

get_perplexity <- function(dtm, k_max){
    #function adapted from https://bit.ly/4cLUHO8
    m <- matrix(,1,1)
    for (i in 2:k_max){
        lda <- LDA(dtm, i)
        per <- perplexity(lda)
        m <- plyr::rbind.fill.matrix(m, per)
    }
    m <- as.data.frame(cbind(matrix(seq(2, k_max), ncol = 1), m[-1,]))
    colnames(m) <- c("topics", "perplexity")
    return(m)
}
fn <- "data/perplexity_values.csv"
if (!file.exists(fn)){
    per_vals <- get_perplexity(ar_df_long_neg_bigrams_dtm, k_max = 12)
    write.csv(per_vals, fn, row.names = FALSE, fileEncoding = "UTF-8")
}else{
    per_vals <- read.csv(fn)
}
cur_theme <- theme_set(theme_classic())
pal <- brewer.pal(12, "Paired")
p0 <- per_vals |>
    ggplot(aes(x = topics, y = perplexity)) +
    geom_line(color = pal[9]) +
    geom_point(color = pal[10]) +
    scale_x_continuous(breaks = seq(2, 12, 1), limits = c(2, 12)) +
    scale_y_continuous(breaks = seq(0, 25000, 5000), limits = c(0, 25000)) +
    labs(title = "LDA Model Perplexity vs. Number of Topics")
p0

fn <- "ar_lda.rds"
if (!file.exists(fn)){
    ar_lda <- LDA(ar_df_long_neg_bigrams_dtm, k = 6, control = list(seed = 208))
    saveRDS(ar_lda, "ar_lda.rds")
}else{
    ar_lda <- readRDS("ar_lda.rds")
}
ar_topics <- tidy(ar_lda, matrix = "beta")
top_terms <- ar_topics |>
    group_by(topic) |>
    slice_max(beta, n = 15) |>
    ungroup() |>
    arrange(topic, -beta) |>
    mutate(term = reorder_within(term, beta, topic))
p1 <- top_terms |>
    ggplot(aes(beta, term)) +
    geom_col(color = pal[10], fill = pal[9]) +
    facet_wrap(~ topic, scales = "free_y", nrow = 2) +
    scale_y_reordered() +
    labs(title = "Bigrams Most Likely to Occur in Topics Identified by LDA")
p1


ar_gamma <- tidy(ar_lda, matrix = "gamma")
names <- c("URL", "Review_Num")
ar_classifications <- ar_gamma |>
    separate_wider_delim(document, delim = "_Review_Num_", names = names) |>
    group_by(URL, Review_Num) |>
    slice_max(gamma) |>
    ungroup() |>
    mutate(topic = as.factor(topic))
incl <- c("URL", "Review_Num", "Review_Len")
rev_len_by_top <- ar_classifications |>
    left_join(ar_df_long_neg |> select(all_of(incl)),
              by = join_by(URL, Review_Num))
annotations <- rev_len_by_top |>
    group_by(topic) |>
    summarize(`1st Qu.:` = round(quantile(Review_Len, probs = 0.25), 0),
              `Med.:` = median(Review_Len),
              `3rd Qu.:` = round(quantile(Review_Len, probs = 0.75), 0)) |>
    pivot_longer(cols = !topic, names_to = "Variable",
                 values_to = "Value")
p2 <- rev_len_by_top |>
    ggplot(aes(x = Review_Len)) + 
    geom_histogram(binwidth = 100, color = pal[10], fill = pal[9]) +
    geom_segment(aes(x = Value, xend = Value, y = 0, yend = 1000),
               data = annotations |> filter(Variable == "Med.:"),
               linetype="dashed", color = "black", linewidth = 0.5) +
    geom_text(aes(x = Value, y = 900, label = paste(Variable, Value)),
              data = annotations |> filter(Variable == "Med.:"),
              nudge_x = 450, color = "black") + 
    scale_y_continuous(breaks = seq(0, 1000, 250), limits = c(0, 1000)) +
    facet_wrap(~ topic, nrow = 3) +
    labs(x = "Review Length (Characters)", y = "Frequency",
         title = "Distribution of Review Length by Topic")
p2

fn <- "data/allrecipes_df_neg_labs.csv"
if (!file.exists(fn)){
    ar_df_long_neg_labs <- ar_df_long_neg |>
        left_join(ar_classifications, by = join_by(URL, Review_Num)) |>
        arrange(topic, desc(gamma))
    write.csv(ar_df_long_neg_labs, fn, row.names = FALSE, fileEncoding = "UTF-8")
}else{
    ar_df_long_neg_labs <- read.csv(fn)
}
ar_df_long_neg_labs <- ar_df_long_neg_labs |>
    mutate(topic = ifelse(is.na(topic), 0, topic),
           gamma = ifelse(is.na(gamma), 0, gamma)) |>
    na.omit()

prob <- c("DIRECTIONS", "MEASUREMENTS", "TIMING", "TEMPERATURE", "ING_KEY_MISSING", "ING_FLAVOR_MISSING", "ING_WRONG_INCL", "INAUTHENTIC", "USER_ERROR")
desc3 <- c("One or more steps was unclear, incorrect, or missing",
          "Included too much or too little of one or more ingredients",
          "Needed more/less time to prep or to finish cooking",
          "Needed higher/lower temperature to cook or assemble correctly",
          "Missing a key ingredient",
          "Missing a flavor ingredient",
          "Included an incorrect ingredient",
          "Lacked a resemblance to the dish's traditional preparation/flavor",
          "Reviewer suspected/admitted they made a mistake")
tbl_print <- cbind(prob, desc3)
colnames(tbl_print) <- c("PROBLEM", "DESCRIPTION")
kable(tbl_print, format = "simple")

ar_df_long_neg_labs <- ar_df_long_neg_labs |>
    rename(NEGATIVE = SENTIMENT) |>
    mutate(NEGATIVE = ifelse(NEGATIVE == "Negative", 1, 0))

sel <- c("Negative", "Other")
sent_star_summ_df <- ar_df_long_neg_labs |>
    filter(Rating > 0) |>
    group_by(Rating) |>
    summarize(Total = n(),
              Negative = sum(NEGATIVE)) |>
    ungroup() |>
    mutate(Rating = paste0(Rating, "★"),
           Other = Total - Negative) |>
    pivot_longer(cols = all_of(sel),
                 names_to = "Category", values_to = "Value") |>
    mutate(Perc = paste0(round(Value / Total * 100, 1), "%"))
greys <- brewer.pal(9, "Greys")
col <- c(pal[10], greys[6])
fil <- c(pal[9], greys[4])
title_str <- "Frequency of Reviews Expressing Negative Sentiment by Star Rating"
p3 <- sent_star_summ_df |>
    ggplot(aes(x = Category, y = Value,
               group = Category, color = Category, fill = Category)) +
    geom_col() +
    geom_text(aes(label = Perc), vjust = -0.75,
              size = 4, fontface = "bold") +
    scale_y_continuous(limits = c(0, 2000), breaks = seq(0, 2000, 250)) +
    scale_color_manual(values = col) +
    scale_fill_manual(values = fil) +
    facet_grid(~ Rating, scales = "free_x", space = "free_x", switch = "y") +
    labs(x = "Sentiment", y = "Frequency",
         title = title_str) +
    theme(legend.position = "none", panel.spacing = unit(0, units = "cm"),
          strip.placement = "inside",
          strip.background = element_blank(),
          strip.text = element_text(size = 12, face = "bold"),
          axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
          plot.title.position = "plot")
p3

p4 <- ar_df_long_neg_labs |>
    upset(sets = prob, group.by = "sets", cutoff = 4,
          matrix.color = pal[10], main.bar.color = pal[10],
          sets.bar.color = pal[10], text.scale = 1, show.numbers = "no")
p4

sel <- c("topic", prob)
prob_by_top <- ar_df_long_neg_labs |>
    select(all_of(sel)) |>
    group_by(topic) |>
    summarize(across(all_of(prob), ~ sum(.x))) |>
    pivot_longer(cols = !topic, names_to = "Problem", values_to = "Count")
p5 <- prob_by_top |>
    filter(topic > 0) |>
    ggplot(aes(x = reorder(Problem, desc(Count)), y = Count)) + 
    geom_bar(stat = "identity", color = pal[10], fill = pal[9]) +
    labs(x = "Problem", y = "Frequency",
         title = "Distribution of Problems by Topic") +
    coord_flip() +
    facet_wrap(~ topic, nrow = 3)
p5

ing <- c("ING_KEY_MISSING", "ING_FLAVOR_MISSING", "ING_WRONG_INCL")
ar_df_long_neg_labs <- ar_df_long_neg_labs |>
    mutate(INGREDIENTS = rowSums(across(all_of(ing))),
           INGREDIENTS = ifelse(INGREDIENTS > 0, 1, 0)) |>
    select(-all_of(ing)) |>
    relocate(INGREDIENTS, .before = INAUTHENTIC)

prob <- c("DIRECTIONS", "MEASUREMENTS", "TIMING", "TEMPERATURE",
          "INGREDIENTS", "INAUTHENTIC", "USER_ERROR")
desc4 <- c("One or more steps was unclear, incorrect, or missing",
          "Included too much or too little of one or more ingredients",
          "Needed more/less time to prep or to finish cooking",
          "Needed higher/lower temperature to cook or assemble correctly",
          "Missing a key/flavor ingredient or included an incorrect ingredient",
          "Lacked a resemblance to the dish's traditional preparation/flavor",
          "Reviewer suspected/admitted they made a mistake")
tbl_print <- cbind(prob, desc4)
colnames(tbl_print) <- c("PROBLEM", "DESCRIPTION")
kable(tbl_print, format = "simple")

my_url1 <- "https://raw.githubusercontent.com/geedoubledee/data698_research_project/main/data/recipe_ing_train.json"
ings <- read_json(my_url1)
copy <- ings
for (i in 1:length(copy)){
    l <- copy[[i]]
    ings[[i]] <- l$ingredients
}
my_url2 <- "https://raw.githubusercontent.com/geedoubledee/data698_research_project/main/data/recipe_ing_test.json"
ings2 <- read_json(my_url2)
copy <- ings2
for (i in 1:length(copy)){
    l <- copy[[i]]
    ings2[[i]] <- l$ingredients
}
ings <- as.list(unique(sort(c(unlist(ings), unlist(ings2)), decreasing = TRUE)))
rm(ings2, copy)
ings <- gsub("[^[:alpha:] ]", "", ings)
ings <- lapply(ings, str_squish)
ings <- lapply(ings, tolower)
ings <- lapply(ings, function(x) paste0(" ", x, " "))
ing_patt_string <- paste(unlist(ings), collapse = "|")
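# the pre-processing below rewrites decimals as e.g. "1point5" and fractions as
# "1slash2" before stripping punctuation, so these number patterns match those
# placeholder forms rather than literal "." or "/" characters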
int_patt <- c("\\s\\d+\\s(?!\\d)")
mixed_frac_patt <- c("\\s\\d+\\s\\d+slash\\d+\\s")
frac_patt <- c("\\s\\d+slash\\d+\\s")
dec_patt <- c("\\s\\d+point\\d+\\s")
num_patt_string <- paste(int_patt, mixed_frac_patt, frac_patt, dec_patt,
                         sep = "|")
# adapted from https://en.wikipedia.org/wiki/Cooking_weights_and_measures
meas <- list("drops", "drop", "drs", "dr", "smidgens", "smidgen", "smdgs",
             "smdg", "smis", "smi", "pinches", "pinch", "pns", "pn", "dashes",
             "dash", "ds", "teaspoons", "teaspoon", "tsps", "tsp", "t",
             "tablespoons", "tablespoon", "tbsps", "tbsp", "ounces", "ounce",
             "oz", "cups", "cup", "c", "pints", "pint", "pts", "pt", "quarts",
             "quart", "qts", "qt", "gallons", "gallon", "gals", "gal",
             "pounds", "pound", "lbs", "lb", "large", "small", "medium")
meas <- lapply(meas, function(x) paste0(" ", x, " "))
meas_patt_string <- paste(unlist(meas), collapse = "|")
ar_df_long_neg_ING <- ar_df_long_neg |>
    mutate(Review_Txt = tolower(Review_Txt),
           Review_Txt = paste0(" ", Review_Txt, " "),
           Review_Txt = str_replace_all(Review_Txt,
                                        pattern = "’",
                                        replacement = "'"),
           Review_Txt = str_replace_all(Review_Txt,
                                        pattern = "(\\d+)\\.(\\d+)",
                                        replacement = "\\1point\\2"),
           Review_Txt = str_replace_all(Review_Txt,
                                        pattern = "(\\d+)/(\\d+)",
                                        replacement = "\\1slash\\2"),
           Review_Txt = str_replace_all(Review_Txt,
                                        pattern = "[^[:alnum:][:space:]']",
                                        replacement = " "),
           Review_Txt = str_replace_all(Review_Txt,
                                        pattern = num_patt_string,
                                        replacement = " __NUM__ "),
           Review_Txt = str_replace_all(Review_Txt,
                                        pattern = ing_patt_string,
                                        replacement = " __ING__ "),
           Review_Txt = str_replace_all(Review_Txt,
                                        pattern = meas_patt_string,
                                        replacement = " __MEAS__ "),
           Review_Txt = str_squish(Review_Txt))

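# Tokenize the normalized review text into unigrams, bigrams, and trigrams
# (stop words removed), count tokens per review, and split the counts by
# ngram length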
keep <- c("URL", "Review_Num", "Review_Txt")
ar_df_long_neg_alln <- ar_df_long_neg_ING |>
    select(all_of(keep)) |>
    unnest_tokens(output = TOKEN, input = Review_Txt,
                  token = "ngrams", n = 3, n_min = 1,
                  ngram_delim = ".", to_lower = FALSE,
                  stopwords = stop_w$WORD) |>
    count(URL, Review_Num, TOKEN, name = "COUNT") |>
    ungroup() |>
    mutate(n = str_count(TOKEN, "\\.") + 1)
ar_df_long_neg_n1 <- ar_df_long_neg_alln |>
    filter(n == 1) |>
    select(-n)
ar_df_long_neg_n2 <- ar_df_long_neg_alln |>
    filter(n == 2) |>
    select(-n)
ar_df_long_neg_n3 <- ar_df_long_neg_alln |>
    filter(n == 3) |>
    select(-n)
ar_df_long_neg_alln <- ar_df_long_neg_alln |>
    select(-n)

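# Build the label data frames (factor and numeric versions) and cast the
# per-review token counts into document-term matrices, one per ngram set,
# dropping tokens that appear in two or fewer reviews. The right joins ensure
# every labeled review keeps a row (a placeholder term fills in for reviews
# with no retained tokens) before casting to sparse matrices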
sel <- c("ID", prob)
ar_df_long_neg_labs_only_num <- ar_df_long_neg_labs |>
    unite("ID", URL, Review_Num, sep = "_Review_Num_") |>
    select(all_of(sel)) |>
    arrange(ID)
ar_df_long_neg_labs_only <- ar_df_long_neg_labs_only_num |>
    mutate(across(all_of(prob), ~ factor(., levels = c(0, 1),
                                         labels = c("No", "Yes"))))
ar_df_long_neg_tokens_dtm1 <- ar_df_long_neg_n1 |>
    unite("ID", URL, Review_Num, sep = "_Review_Num_") |>
    group_by(TOKEN) |>
    filter(n() > 2) |>
    ungroup() |>
    cast_dtm(ID, TOKEN, COUNT)
ar_df_long_neg_tokens_dtm2 <- ar_df_long_neg_n2 |>
    unite("ID", URL, Review_Num, sep = "_Review_Num_") |>
    group_by(TOKEN) |>
    filter(n() > 2) |>
    ungroup() |>
    cast_dtm(ID, TOKEN, COUNT)
ar_df_long_neg_tokens_dtm3 <- ar_df_long_neg_n3 |>
    unite("ID", URL, Review_Num, sep = "_Review_Num_") |>
    group_by(TOKEN) |>
    filter(n() > 2) |>
    ungroup() |>
    cast_dtm(ID, TOKEN, COUNT)
ar_df_long_neg_tokens_dtm4 <- ar_df_long_neg_alln |>
    unite("ID", URL, Review_Num, sep = "_Review_Num_") |>
    group_by(TOKEN) |>
    filter(n() > 2) |>
    ungroup() |>
    cast_dtm(ID, TOKEN, COUNT)
ar_df_long_neg_tokens_sparse1 <- tidy(ar_df_long_neg_tokens_dtm1) |>
    right_join(ar_df_long_neg_labs_only |> select(ID),
               by = join_by(document == ID)) |>
    arrange(document) |>
    mutate(term = ifelse(is.na(term), "***__***", term),
           count = ifelse(is.na(count), 1, count)) |>
    cast_sparse(document, term, count)
ar_df_long_neg_tokens_sparse2 <- tidy(ar_df_long_neg_tokens_dtm2) |>
    right_join(ar_df_long_neg_labs_only |> select(ID),
               by = join_by(document == ID)) |>
    arrange(document) |>
    mutate(term = ifelse(is.na(term), "***__***", term),
           count = ifelse(is.na(count), 1, count)) |>
    cast_sparse(document, term, count)
ar_df_long_neg_tokens_sparse3 <- tidy(ar_df_long_neg_tokens_dtm3) |>
    right_join(ar_df_long_neg_labs_only |> select(ID),
               by = join_by(document == ID)) |>
    arrange(document) |>
    mutate(term = ifelse(is.na(term), "***__***", term),
           count = ifelse(is.na(count), 1, count)) |>
    cast_sparse(document, term, count)
ar_df_long_neg_tokens_sparse4 <- tidy(ar_df_long_neg_tokens_dtm4) |>
    right_join(ar_df_long_neg_labs_only |> select(ID),
               by = join_by(document == ID)) |>
    arrange(document) |>
    mutate(term = ifelse(is.na(term), "***__***", term),
           count = ifelse(is.na(count), 1, count)) |>
    cast_sparse(document, term, count)

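# Split the data 50/25/25 into training, validation, and test sets, applying
# the same row sample to all four ngram feature matrices and to the labels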
set.seed(1006)
sample <- sample(nrow(ar_df_long_neg_tokens_sparse1),
                 round(nrow(ar_df_long_neg_tokens_sparse1) * 0.5),
                 replace = FALSE)
train_x1 <- ar_df_long_neg_tokens_sparse1[sample, ]
train_x2 <- ar_df_long_neg_tokens_sparse2[sample, ]
train_x3 <- ar_df_long_neg_tokens_sparse3[sample, ]
train_x4 <- ar_df_long_neg_tokens_sparse4[sample, ]
train_x_all <- list(train_x1, train_x2, train_x3, train_x4)
train_y <- ar_df_long_neg_labs_only[sample, ]
train_y_num <- ar_df_long_neg_labs_only_num[sample, ]
validate_test_x1 <- ar_df_long_neg_tokens_sparse1[-sample, ]
validate_test_x2 <- ar_df_long_neg_tokens_sparse2[-sample, ]
validate_test_x3 <- ar_df_long_neg_tokens_sparse3[-sample, ]
validate_test_x4 <- ar_df_long_neg_tokens_sparse4[-sample, ]
validate_test_y <- ar_df_long_neg_labs_only[-sample, ]
validate_test_y_num <- ar_df_long_neg_labs_only_num[-sample, ]
sample <- sample(nrow(validate_test_x1),
                 round(nrow(validate_test_x1) * 0.5),
                 replace = FALSE)
validate_x1 <- validate_test_x1[sample, ]
validate_x2 <- validate_test_x2[sample, ]
validate_x3 <- validate_test_x3[sample, ]
validate_x4 <- validate_test_x4[sample, ]
validate_x_all <- list(validate_x1, validate_x2, validate_x3, validate_x4)
validate_y <- validate_test_y[sample, ]
validate_y_num <- validate_test_y_num[sample, ]
test_x1 <- validate_test_x1[-sample, ]
test_x2 <- validate_test_x2[-sample, ]
test_x3 <- validate_test_x3[-sample, ]
test_x4 <- validate_test_x4[-sample, ]
test_x_all <- list(test_x1, test_x2, test_x3, test_x4)
test_y <- validate_test_y[-sample, ]
test_y_num <- validate_test_y_num[-sample, ]

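# Three of the four model-fitting blocks below (all but XGBoost, which uses
# xgb.save/xgb.load) repeat the same cache-or-fit pattern. As an illustrative
# sketch only (fit_or_load is a hypothetical helper and is not used by the
# code that follows), the pattern could be factored out like this:
fit_or_load <- function(fn, fit_fun) {
    # Fit and cache the model if no file exists at fn; otherwise load the cache
    if (!file.exists(fn)) {
        mod <- fit_fun()
        saveRDS(mod, fn)
    } else {
        mod <- readRDS(fn)
    }
    mod
}

# Fit (or load cached) multinomial Naive Bayes models with Laplace smoothing,
# one per problem label per ngram feature set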
fol <- c("models/")
dat <- c("uni_", "bi_", "tri_", "all_")
fns <- c("nb1.rds", "nb2.rds", "nb3.rds", "nb4.rds", "nb5.rds", "nb6.rds",
         "nb7.rds")
nbs <- list()
count = 0
for (i in 1:length(dat)){
    for (j in 1:length(fns)){
        fn <- paste0(fol, dat[i], fns[j])
        if (!file.exists(fn)){
            col <- j + 1
            nb <- multinomial_naive_bayes(train_x_all[[i]],
                                          train_y[, col], laplace = 1)
            saveRDS(nb, fn)
        }else{
            nb <- readRDS(fn)
        }
        count = count + 1
        nbs[[count]] <- nb
    }
}

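# Tune linear-kernel SVMs over a grid of cost values with 10-fold
# cross-validation and keep the best model for each problem label and ngram
# feature set (or load cached models)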
fns <- c("svm1.rds", "svm2.rds", "svm3.rds", "svm4.rds", "svm5.rds", "svm6.rds",
         "svm7.rds")
svms <- list()
count = 0
ctrl <- tune.control(sampling = "cross", cross = 10, nrepeat = 1)
tune_grid <- list(cost = c(0.1, 1, 10, 100, 1000))
for (i in 1:length(dat)){
    for (j in 1:length(fns)){
        fn <- paste0(fol, dat[i], fns[j])
        if (!file.exists(fn)){
            col <- j + 1
            svm_tune <- tune(svm, train.x = train_x_all[[i]],
                             train.y = train_y[, col],
                             kernel = "linear", ranges = tune_grid,
                             tunecontrol = ctrl)
            svm_mod <- svm_tune$best.model
            saveRDS(svm_mod, fn)
        }else{
            svm_mod <- readRDS(fn)
        }
        count = count + 1
        svms[[count]] <- svm_mod
    }
}

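# Fit (or load cached) XGBoost models with a binary hinge objective and 100
# boosting rounds, one per problem label per ngram feature set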
fns <- c("xgb1.json", "xgb2.json", "xgb3.json", "xgb4.json", "xgb5.json",
         "xgb6.json", "xgb7.json")
xgbs <- list()
count = 0
for (i in 1:length(dat)){
    for (j in 1:length(fns)){
        fn <- paste0(fol, dat[i], fns[j])
        if (!file.exists(fn)){
            col <- j + 1
            xgb <- xgboost(train_x_all[[i]], train_y_num[, col], nrounds = 100,
                           objective = "binary:hinge", verbose = 0)
            xgb.save(xgb, fn)
        }else{
            xgb <- xgb.load(fn)
        }
        count = count + 1
        xgbs[[count]] <- xgb
    }
}

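# Fit (or load cached) lasso-penalized (alpha = 1) logistic regression models
# via 10-fold cross-validated glmnet, selecting lambda by AUC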
fns <- c("logm1.rds", "logm2.rds", "logm3.rds", "logm4.rds", "logm5.rds", "logm6.rds", "logm7.rds")
logms <- list()
count = 0
for (i in 1:length(dat)){
    for (j in 1:length(fns)){
        fn <- paste0(fol, dat[i], fns[j])
        if (!file.exists(fn)){
            col <- j + 1
            logm <- cv.glmnet(train_x_all[[i]], train_y[, col],
                              family = "binomial", alpha = 1,
                              type.logistic = "modified.Newton",
                              type.measure = "auc", nfolds = 10)
            saveRDS(logm, fn)
        }else{
            logm <- readRDS(fn)
        }
        count = count + 1
        logms[[count]] <- logm
    }
}

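# Evaluate every model on the validation set: predict the class, build a
# confusion matrix ("Yes"/1 is the positive class), and tabulate TP/TN/FP/FN
# counts. The index arithmetic maps model i to its ngram feature set (dat_num)
# and its problem label column (col)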
nb_preds <- list()
nb_cms_complete <- list()
nb_cms <- list()
max_mod_num <- 7
for (i in 1:length(nbs)){
    dat_num <- (i - 1) %/% 7 + 1
    mod_num <- i %% 7
    if (mod_num > 0){
        col <- mod_num + 1
    }else{
        col <- max_mod_num + 1
    }
    nb <- nbs[[i]]
    nb_pred <- predict(nb, validate_x_all[[dat_num]], type = "class")
    nb_preds[[i]] <- nb_pred
    nbcm_complete <- confusionMatrix(nb_pred, validate_y[, col],
                                     positive = "Yes")
    nb_cms_complete[[i]] <- nbcm_complete
    nbcm <- as.data.frame(nbcm_complete$table)
    nbcm$Reference <- factor(nbcm$Reference,
                             levels = rev(levels(nbcm$Reference)))
    nbcm <- nbcm |>
        mutate(
            Label = case_when(
                Prediction == "No" & Reference == "No" ~ "TN",
                Prediction == "Yes" & Reference == "Yes" ~ "TP",
                Prediction == "No" & Reference == "Yes" ~ "FN",
                Prediction == "Yes" & Reference == "No" ~ "FP"),
            Model = paste0(dat[dat_num], prob[col - 1]))
    nb_cms[[i]] <- nbcm
}
svm_preds <- list()
svm_cms_complete <- list()
svm_cms <- list()
for (i in 1:length(svms)){
    dat_num <- (i - 1) %/% 7 + 1
    mod_num <- i %% 7
    if (mod_num > 0){
        col <- mod_num + 1
    }else{
        col <- max_mod_num + 1
    }
    svm_mod <- svms[[i]]
    svm_pred <- predict(svm_mod, validate_x_all[[dat_num]], type = "class")
    svm_preds[[i]] <- svm_pred
    svmcm_complete <- confusionMatrix(svm_pred, validate_y[, col],
                                      positive = "Yes")
    svm_cms_complete[[i]] <- svmcm_complete
    svmcm <- as.data.frame(svmcm_complete$table)
    svmcm$Reference <- factor(svmcm$Reference,
                             levels = rev(levels(svmcm$Reference)))
    svmcm <- svmcm |>
        mutate(
            Label = case_when(
                Prediction == "No" & Reference == "No" ~ "TN",
                Prediction == "Yes" & Reference == "Yes" ~ "TP",
                Prediction == "No" & Reference == "Yes" ~ "FN",
                Prediction == "Yes" & Reference == "No" ~ "FP"),
            Model = paste0(dat[dat_num], prob[col - 1]))
    svm_cms[[i]] <- svmcm
}
xgb_preds <- list()
xgb_cms_complete <- list()
xgb_cms <- list()
for (i in 1:length(xgbs)){
    dat_num <- (i - 1) %/% 7 + 1
    mod_num <- i %% 7
    if (mod_num > 0){
        col <- mod_num + 1
    }else{
        col <- max_mod_num + 1
    }
    xgb <- xgbs[[i]]
    xgb_pred <- predict(xgb, validate_x_all[[dat_num]], type = "class")
    xgb_preds[[i]] <- xgb_pred
    xgbcm_complete <- confusionMatrix(as.factor(xgb_pred),
                                      as.factor(validate_y_num[, col]),
                                      positive = "1")
    xgb_cms_complete[[i]] <- xgbcm_complete
    xgbcm <- as.data.frame(xgbcm_complete$table)
    xgbcm$Reference <- factor(xgbcm$Reference,
                             levels = rev(levels(xgbcm$Reference)))
    xgbcm <- xgbcm |>
        mutate(
            Label = case_when(
                Prediction == 0 & Reference == 0 ~ "TN",
                Prediction == 1 & Reference == 1 ~ "TP",
                Prediction == 0 & Reference == 1 ~ "FN",
                Prediction == 1 & Reference == 0 ~ "FP"),
            Model = paste0(dat[dat_num], prob[col - 1]))
    xgb_cms[[i]] <- xgbcm
}
logm_preds <- list()
logm_cms_complete <- list()
logm_cms <- list()
for (i in 1:length(logms)){
    dat_num <- (i - 1) %/% 7 + 1
    mod_num <- i %% 7
    if (mod_num > 0){
        col <- mod_num + 1
    }else{
        col <- max_mod_num + 1
    }
    logm <- logms[[i]]
    logm_pred <- predict(logm, validate_x_all[[dat_num]], type = "class")
    logm_preds[[i]] <- logm_pred
    logmcm_complete <- confusionMatrix(factor(logm_pred,
                                              levels = c("No", "Yes")),
                                       validate_y[, col], positive = "Yes")
    logm_cms_complete[[i]] <- logmcm_complete
    logmcm <- as.data.frame(logmcm_complete$table)
    logmcm$Reference <- factor(logmcm$Reference,
                               levels = rev(levels(logmcm$Reference)))
    logmcm <- logmcm |>
        mutate(
            Label = case_when(
                Prediction == "No" & Reference == "No" ~ "TN",
                Prediction == "Yes" & Reference == "Yes" ~ "TP",
                Prediction == "No" & Reference == "Yes" ~ "FN",
                Prediction == "Yes" & Reference == "No" ~ "FP"),
            Model = paste0(dat[dat_num], prob[col - 1]))
    logm_cms[[i]] <- logmcm
}


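# Combine per-class and overall metrics for all 112 model/problem/ngram
# combinations into a single table and report the best F1 score per problem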
cms_complete <- c(nb_cms_complete, svm_cms_complete, xgb_cms_complete,
                  logm_cms_complete)
by_class <- list()
overall <- list()
for (i in 1:length(cms_complete)){
    cm_complete <- cms_complete[[i]]
    by_class[[i]] <- as.data.frame(cm_complete$byClass)
    overall[[i]] <- as.data.frame(cm_complete$overall)
}
col1 <- as.data.frame(c(rep("Naive Bayes", 28),
                        rep("Support Vector Machine", 28),
                        rep("eXtreme Gradient Boosting", 28),
                        rep("Logistic Regression", 28)))
colnames(col1) <- c("Model")
col2 <- as.data.frame(rep(prob, 16))
colnames(col2) <- c("Problem")
col3 <- as.data.frame(rep(c(rep("Unigrams", 7),
                          rep("Bigrams", 7),
                          rep("Trigrams", 7),
                          rep("All", 7)), 4))
colnames(col3) <- c("Token")
by_class <- t(round(bind_cols(by_class), 4))
rownames(by_class) <- NULL
overall <- t(round(bind_cols(overall), 4))
rownames(overall) <- NULL
metrics <- as.data.frame(cbind(col1, col2, col3, by_class, overall))
keep <- c("Model", "Problem", "Token", "Accuracy", "Precision", "Recall", "F1")
mods <- c("Naive Bayes", "Support Vector Machine", "eXtreme Gradient Boosting",
           "Logistic Regression")
metrics <- metrics |>
    select(all_of(keep)) |>
    mutate(Model = factor(Model, levels = mods),
           Problem = factor(Problem, levels = prob)) |>
    group_by(Problem) |>
    filter(F1 == max(F1, na.rm = TRUE)) |>
    arrange(Problem, desc(F1))
kable(metrics, format = "simple")

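# Identify ngram features with pairwise correlation above 0.9 within the
# unigram, bigram, and trigram matrices, and cache the list to CSV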
fn <- "data/high_corr.csv"
if (!file.exists(fn)){
    corr <- cor(as.matrix(ar_df_long_neg_tokens_sparse1))
    corr[upper.tri(corr)] <- 0
    diag(corr) <- 0
    corr2 <- cor(as.matrix(ar_df_long_neg_tokens_sparse2))
    corr2[upper.tri(corr2)] <- 0
    diag(corr2) <- 0
    corr3 <- cor(as.matrix(ar_df_long_neg_tokens_sparse3))
    corr3[upper.tri(corr3)] <- 0
    diag(corr3) <- 0
    high_corr_uni <- colnames(corr[, apply(corr, 2, 
                                           function(x) any(abs(x) > 0.9,
                                                           na.rm = TRUE))])
    high_corr_bi <- colnames(corr2[, apply(corr2, 2,
                                           function(x) any(abs(x) > 0.9,
                                                           na.rm = TRUE))])
    high_corr_tri <- colnames(corr3[, apply(corr3, 2,
                                            function(x) any(abs(x) > 0.9,
                                                            na.rm = TRUE))])
    high_corr <- as.data.frame(c(high_corr_uni,
                                 high_corr_bi,
                                 high_corr_tri))
    colnames(high_corr) <- c("Variable")
    write.csv(high_corr, fn, row.names = FALSE, fileEncoding = "UTF-8")
}else{
    high_corr <- read.csv(fn)
}

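# Drop the highly correlated features from every training, validation, and
# test feature matrix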
copy <- train_x_all
rem <- as.character(high_corr$Variable)
for (i in 1:length(copy)){
    train_x <- train_x_all[[i]]
    validate_x <- validate_x_all[[i]]
    test_x <- test_x_all[[i]]
    train_x_all[[i]] <- train_x[, !colnames(train_x) %in% rem]
    validate_x_all[[i]] <- validate_x[, !colnames(validate_x) %in% rem]
    test_x_all[[i]] <- test_x[, !colnames(test_x) %in% rem]
}

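# Refit all four model families on the reduced feature sets, caching the
# models under models2/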
fol <- c("models2/")
dat <- c("uni_", "bi_", "tri_", "all_")
fns <- c("nb1.rds", "nb2.rds", "nb3.rds", "nb4.rds", "nb5.rds", "nb6.rds",
         "nb7.rds")
nbs <- list()
count = 0
for (i in 1:length(dat)){
    for (j in 1:length(fns)){
        fn <- paste0(fol, dat[i], fns[j])
        if (!file.exists(fn)){
            col <- j + 1
            nb <- multinomial_naive_bayes(train_x_all[[i]],
                                          train_y[, col], laplace = 1)
            saveRDS(nb, fn)
        }else{
            nb <- readRDS(fn)
        }
        count = count + 1
        nbs[[count]] <- nb
    }
}
fns <- c("svm1.rds", "svm2.rds", "svm3.rds", "svm4.rds", "svm5.rds", "svm6.rds",
         "svm7.rds")
svms <- list()
count = 0
ctrl <- tune.control(sampling = "cross", cross = 10, nrepeat = 1)
tune_grid <- list(cost = c(0.1, 1, 10, 100, 1000))
for (i in 1:length(dat)){
    for (j in 1:length(fns)){
        fn <- paste0(fol, dat[i], fns[j])
        if (!file.exists(fn)){
            col <- j + 1
            svm_tune <- tune(svm, train.x = train_x_all[[i]],
                             train.y = train_y[, col],
                             kernel = "linear", ranges = tune_grid,
                             tunecontrol = ctrl)
            svm_mod <- svm_tune$best.model
            saveRDS(svm_mod, fn)
        }else{
            svm_mod <- readRDS(fn)
        }
        count = count + 1
        svms[[count]] <- svm_mod
    }
}
fns <- c("xgb1.json", "xgb2.json", "xgb3.json", "xgb4.json", "xgb5.json",
         "xgb6.json", "xgb7.json")
xgbs <- list()
count = 0
for (i in 1:length(dat)){
    for (j in 1:length(fns)){
        fn <- paste0(fol, dat[i], fns[j])
        if (!file.exists(fn)){
            col <- j + 1
            xgb <- xgboost(train_x_all[[i]], train_y_num[, col], nrounds = 100,
                           objective = "binary:hinge", verbose = 0)
            xgb.save(xgb, fn)
        }else{
            xgb <- xgb.load(fn)
        }
        count = count + 1
        xgbs[[count]] <- xgb
    }
}
fns <- c("logm1.rds", "logm2.rds", "logm3.rds", "logm4.rds", "logm5.rds", "logm6.rds", "logm7.rds")
logms <- list()
count = 0
for (i in 1:length(dat)){
    for (j in 1:length(fns)){
        fn <- paste0(fol, dat[i], fns[j])
        if (!file.exists(fn)){
            col <- j + 1
            logm <- cv.glmnet(train_x_all[[i]], train_y[, col],
                              family = "binomial", alpha = 1,
                              type.logistic = "modified.Newton",
                              type.measure = "auc", nfolds = 10)
            saveRDS(logm, fn)
        }else{
            logm <- readRDS(fn)
        }
        count = count + 1
        logms[[count]] <- logm
    }
}

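# Re-evaluate the reduced-feature models on the validation set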
nb_preds <- list()
nb_cms_complete <- list()
nb_cms <- list()
max_mod_num <- 7
for (i in 1:length(nbs)){
    dat_num <- (i - 1) %/% 7 + 1
    mod_num <- i %% 7
    if (mod_num > 0){
        col <- mod_num + 1
    }else{
        col <- max_mod_num + 1
    }
    nb <- nbs[[i]]
    nb_pred <- predict(nb, validate_x_all[[dat_num]], type = "class")
    nb_preds[[i]] <- nb_pred
    nbcm_complete <- confusionMatrix(nb_pred, validate_y[, col],
                                     positive = "Yes")
    nb_cms_complete[[i]] <- nbcm_complete
    nbcm <- as.data.frame(nbcm_complete$table)
    nbcm$Reference <- factor(nbcm$Reference,
                             levels = rev(levels(nbcm$Reference)))
    nbcm <- nbcm |>
        mutate(
            Label = case_when(
                Prediction == "No" & Reference == "No" ~ "TN",
                Prediction == "Yes" & Reference == "Yes" ~ "TP",
                Prediction == "No" & Reference == "Yes" ~ "FN",
                Prediction == "Yes" & Reference == "No" ~ "FP"),
            Model = paste0(dat[dat_num], prob[col - 1]))
    nb_cms[[i]] <- nbcm
}
svm_preds <- list()
svm_cms_complete <- list()
svm_cms <- list()
for (i in 1:length(svms)){
    dat_num <- (i - 1) %/% 7 + 1
    mod_num <- i %% 7
    if (mod_num > 0){
        col <- mod_num + 1
    }else{
        col <- max_mod_num + 1
    }
    svm_mod <- svms[[i]]
    svm_pred <- predict(svm_mod, validate_x_all[[dat_num]], type = "class")
    svm_preds[[i]] <- svm_pred
    svmcm_complete <- confusionMatrix(svm_pred, validate_y[, col],
                                      positive = "Yes")
    svm_cms_complete[[i]] <- svmcm_complete
    svmcm <- as.data.frame(svmcm_complete$table)
    svmcm$Reference <- factor(svmcm$Reference,
                             levels = rev(levels(svmcm$Reference)))
    svmcm <- svmcm |>
        mutate(
            Label = case_when(
                Prediction == "No" & Reference == "No" ~ "TN",
                Prediction == "Yes" & Reference == "Yes" ~ "TP",
                Prediction == "No" & Reference == "Yes" ~ "FN",
                Prediction == "Yes" & Reference == "No" ~ "FP"),
            Model = paste0(dat[dat_num], prob[col - 1]))
    svm_cms[[i]] <- svmcm
}
xgb_preds <- list()
xgb_cms_complete <- list()
xgb_cms <- list()
for (i in 1:length(xgbs)){
    dat_num <- (i - 1) %/% 7 + 1
    mod_num <- i %% 7
    if (mod_num > 0){
        col <- mod_num + 1
    }else{
        col <- max_mod_num + 1
    }
    xgb <- xgbs[[i]]
    xgb_pred <- predict(xgb, validate_x_all[[dat_num]], type = "class")
    xgb_preds[[i]] <- xgb_pred
    xgbcm_complete <- confusionMatrix(as.factor(xgb_pred),
                                      as.factor(validate_y_num[, col]),
                                      positive = "1")
    xgb_cms_complete[[i]] <- xgbcm_complete
    xgbcm <- as.data.frame(xgbcm_complete$table)
    xgbcm$Reference <- factor(xgbcm$Reference,
                             levels = rev(levels(xgbcm$Reference)))
    xgbcm <- xgbcm |>
        mutate(
            Label = case_when(
                Prediction == 0 & Reference == 0 ~ "TN",
                Prediction == 1 & Reference == 1 ~ "TP",
                Prediction == 0 & Reference == 1 ~ "FN",
                Prediction == 1 & Reference == 0 ~ "FP"),
            Model = paste0(dat[dat_num], prob[col - 1]))
    xgb_cms[[i]] <- xgbcm
}
logm_preds <- list()
logm_cms_complete <- list()
logm_cms <- list()
for (i in 1:length(logms)){
    dat_num <- (i - 1) %/% 7 + 1
    mod_num <- i %% 7
    if (mod_num > 0){
        col <- mod_num + 1
    }else{
        col <- max_mod_num + 1
    }
    logm <- logms[[i]]
    logm_pred <- predict(logm, validate_x_all[[dat_num]], type = "class")
    logm_preds[[i]] <- logm_pred
    logmcm_complete <- confusionMatrix(factor(logm_pred,
                                              levels = c("No", "Yes")),
                                       validate_y[, col], positive = "Yes")
    logm_cms_complete[[i]] <- logmcm_complete
    logmcm <- as.data.frame(logmcm_complete$table)
    logmcm$Reference <- factor(logmcm$Reference,
                               levels = rev(levels(logmcm$Reference)))
    logmcm <- logmcm |>
        mutate(
            Label = case_when(
                Prediction == "No" & Reference == "No" ~ "TN",
                Prediction == "Yes" & Reference == "Yes" ~ "TP",
                Prediction == "No" & Reference == "Yes" ~ "FN",
                Prediction == "Yes" & Reference == "No" ~ "FP"),
            Model = paste0(dat[dat_num], prob[col - 1]))
    logm_cms[[i]] <- logmcm
}


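# Summarize validation metrics for the reduced-feature models and report the
# best F1 score per problem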
cms_complete <- c(nb_cms_complete, svm_cms_complete, xgb_cms_complete,
                  logm_cms_complete)
by_class <- list()
overall <- list()
for (i in 1:length(cms_complete)){
    cm_complete <- cms_complete[[i]]
    by_class[[i]] <- as.data.frame(cm_complete$byClass)
    overall[[i]] <- as.data.frame(cm_complete$overall)
}
col1 <- as.data.frame(c(rep("Naive Bayes", 28),
                        rep("Support Vector Machine", 28),
                        rep("eXtreme Gradient Boosting", 28),
                        rep("Logistic Regression", 28)))
colnames(col1) <- c("Model")
col2 <- as.data.frame(rep(prob, 16))
colnames(col2) <- c("Problem")
col3 <- as.data.frame(rep(c(rep("Unigrams", 7),
                          rep("Bigrams", 7),
                          rep("Trigrams", 7),
                          rep("All", 7)), 4))
colnames(col3) <- c("Token")
by_class <- t(round(bind_cols(by_class), 4))
rownames(by_class) <- NULL
overall <- t(round(bind_cols(overall), 4))
rownames(overall) <- NULL
metrics <- as.data.frame(cbind(col1, col2, col3, by_class, overall))
keep <- c("Model", "Problem", "Token", "Accuracy", "Precision", "Recall", "F1")
mods <- c("Naive Bayes", "Support Vector Machine", "eXtreme Gradient Boosting",
           "Logistic Regression")
metrics <- metrics |>
    select(all_of(keep)) |>
    mutate(Model = factor(Model, levels = mods),
           Problem = factor(Problem, levels = prob)) |>
    group_by(Problem) |>
    filter(F1 == max(F1, na.rm = TRUE)) |>
    arrange(Problem, desc(F1))
kable(metrics, format = "simple")

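# Final evaluation: predict on the held-out test set with the reduced-feature
# models and build confusion matrices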
nb_preds <- list()
nb_cms_complete <- list()
nb_cms <- list()
max_mod_num <- 7
for (i in 1:length(nbs)){
    dat_num <- (i - 1) %/% 7 + 1
    mod_num <- i %% 7
    if (mod_num > 0){
        col <- mod_num + 1
    }else{
        col <- max_mod_num + 1
    }
    nb <- nbs[[i]]
    nb_pred <- predict(nb, test_x_all[[dat_num]], type = "class")
    nb_preds[[i]] <- nb_pred
    nbcm_complete <- confusionMatrix(nb_pred, test_y[, col],
                                     positive = "Yes")
    nb_cms_complete[[i]] <- nbcm_complete
    nbcm <- as.data.frame(nbcm_complete$table)
    nbcm$Reference <- factor(nbcm$Reference,
                             levels = rev(levels(nbcm$Reference)))
    nbcm <- nbcm |>
        mutate(
            Label = case_when(
                Prediction == "No" & Reference == "No" ~ "TN",
                Prediction == "Yes" & Reference == "Yes" ~ "TP",
                Prediction == "No" & Reference == "Yes" ~ "FN",
                Prediction == "Yes" & Reference == "No" ~ "FP"),
            Model = paste0(dat[dat_num], prob[col - 1]))
    nb_cms[[i]] <- nbcm
}
svm_preds <- list()
svm_cms_complete <- list()
svm_cms <- list()
for (i in 1:length(svms)){
    dat_num <- (i - 1) %/% 7 + 1
    mod_num <- i %% 7
    if (mod_num > 0){
        col <- mod_num + 1
    }else{
        col <- max_mod_num + 1
    }
    svm_mod <- svms[[i]]
    svm_pred <- predict(svm_mod, test_x_all[[dat_num]], type = "class")
    svm_preds[[i]] <- svm_pred
    svmcm_complete <- confusionMatrix(svm_pred, test_y[, col],
                                      positive = "Yes")
    svm_cms_complete[[i]] <- svmcm_complete
    svmcm <- as.data.frame(svmcm_complete$table)
    svmcm$Reference <- factor(svmcm$Reference,
                             levels = rev(levels(svmcm$Reference)))
    svmcm <- svmcm |>
        mutate(
            Label = case_when(
                Prediction == "No" & Reference == "No" ~ "TN",
                Prediction == "Yes" & Reference == "Yes" ~ "TP",
                Prediction == "No" & Reference == "Yes" ~ "FN",
                Prediction == "Yes" & Reference == "No" ~ "FP"),
            Model = paste0(dat[dat_num], prob[col - 1]))
    svm_cms[[i]] <- svmcm
}
xgb_preds <- list()
xgb_cms_complete <- list()
xgb_cms <- list()
for (i in 1:length(xgbs)){
    dat_num <- (i - 1) %/% 7 + 1
    mod_num <- i %% 7
    if (mod_num > 0){
        col <- mod_num + 1
    }else{
        col <- max_mod_num + 1
    }
    xgb <- xgbs[[i]]
    xgb_pred <- predict(xgb, test_x_all[[dat_num]], type = "class")
    xgb_preds[[i]] <- xgb_pred
    xgbcm_complete <- confusionMatrix(as.factor(xgb_pred),
                                      as.factor(test_y_num[, col]),
                                      positive = "1")
    xgb_cms_complete[[i]] <- xgbcm_complete
    xgbcm <- as.data.frame(xgbcm_complete$table)
    xgbcm$Reference <- factor(xgbcm$Reference,
                             levels = rev(levels(xgbcm$Reference)))
    xgbcm <- xgbcm |>
        mutate(
            Label = case_when(
                Prediction == 0 & Reference == 0 ~ "TN",
                Prediction == 1 & Reference == 1 ~ "TP",
                Prediction == 0 & Reference == 1 ~ "FN",
                Prediction == 1 & Reference == 0 ~ "FP"),
            Model = paste0(dat[dat_num], prob[col - 1]))
    xgb_cms[[i]] <- xgbcm
}
logm_preds <- list()
logm_cms_complete <- list()
logm_cms <- list()
for (i in 1:length(logms)){
    dat_num <- (i - 1) %/% 7 + 1
    mod_num <- i %% 7
    if (mod_num > 0){
        col <- mod_num + 1
    }else{
        col <- max_mod_num + 1
    }
    logm <- logms[[i]]
    logm_pred <- predict(logm, test_x_all[[dat_num]], type = "class")
    logm_preds[[i]] <- logm_pred
    logmcm_complete <- confusionMatrix(factor(logm_pred,
                                              levels = c("No", "Yes")),
                                       test_y[, col], positive = "Yes")
    logm_cms_complete[[i]] <- logmcm_complete
    logmcm <- as.data.frame(logmcm_complete$table)
    logmcm$Reference <- factor(logmcm$Reference,
                               levels = rev(levels(logmcm$Reference)))
    logmcm <- logmcm |>
        mutate(
            Label = case_when(
                Prediction == "No" & Reference == "No" ~ "TN",
                Prediction == "Yes" & Reference == "Yes" ~ "TP",
                Prediction == "No" & Reference == "Yes" ~ "FN",
                Prediction == "Yes" & Reference == "No" ~ "FP"),
            Model = paste0(dat[dat_num], prob[col - 1]))
    logm_cms[[i]] <- logmcm
}


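# Summarize test-set metrics and report the best F1 score per problem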
cms_complete <- c(nb_cms_complete, svm_cms_complete, xgb_cms_complete,
                  logm_cms_complete)
by_class <- list()
overall <- list()
for (i in 1:length(cms_complete)){
    cm_complete <- cms_complete[[i]]
    by_class[[i]] <- as.data.frame(cm_complete$byClass)
    overall[[i]] <- as.data.frame(cm_complete$overall)
}
col1 <- as.data.frame(c(rep("Naive Bayes", 28),
                        rep("Support Vector Machine", 28),
                        rep("eXtreme Gradient Boosting", 28),
                        rep("Logistic Regression", 28)))
colnames(col1) <- c("Model")
col2 <- as.data.frame(rep(prob, 16))
colnames(col2) <- c("Problem")
col3 <- as.data.frame(rep(c(rep("Unigrams", 7),
                          rep("Bigrams", 7),
                          rep("Trigrams", 7),
                          rep("All", 7)), 4))
colnames(col3) <- c("Token")
by_class <- t(round(bind_cols(by_class), 4))
rownames(by_class) <- NULL
overall <- t(round(bind_cols(overall), 4))
rownames(overall) <- NULL
metrics <- as.data.frame(cbind(col1, col2, col3, by_class, overall))
keep <- c("Model", "Problem", "Token", "Accuracy", "Precision", "Recall", "F1")
mods <- c("Naive Bayes", "Support Vector Machine", "eXtreme Gradient Boosting",
           "Logistic Regression")
metrics <- metrics |>
    select(all_of(keep)) |>
    mutate(Model = factor(Model, levels = mods),
           Problem = factor(Problem, levels = prob)) |>
    group_by(Problem) |>
    filter(F1 == max(F1, na.rm = TRUE)) |>
    arrange(Problem, desc(F1))
kable(metrics, format = "simple")