Goal: trial two ML procedures to classify the UPCs into Market Basket categories.
Include a paragraph summary on how each ML procedure works. Please ensure they can be run through R- I’d like everything in an .rmd file. I want two independent ML procedures so we can compare their classification overlap as a first indicator to save human verification time.
The top categories that are key:
A. See attached USDA TFP 2021 doc, and please use the top and sub categories from Table 1. We just need to add one more for “Non-Food Item” (we have books etc in the Walmart pulls due to substring matches, e.g.)
A. In an output file, please have a column for
1) [food_description] from the excel sheet;
2) tfp_basket_cat_ML1 for where the first ML model categorized each
food_description value;
3) tfp_basket_cat_ML1_conf for some quantitative level of confidence in
the match;
4) tfp_basket_cat_ML2 (other model classification);
5) tfp_basket_cat_ML2_conf (other confidence measure);
6) ML_model_agreement (1 for same, 0 otherwise).
The next step would be us verifying on a subset of maybe 500 of these 4000+ categories how well each did. Then we’d move forward with that ‘trained’ model on these food_description levels to use the ML models on the UPC lists.
Two initial datasets from the USDA were used for analysis:
U. S. Department of Agriculture, Economic Research Service. Purchase to Plate Suite, March 2023. The sheet used for analysis was PP-NAP1718, for 2017-2018 data.
TFP Market Basket, which provided the “MB Categories” factor column data with the top-level labels for “Vegetables”, “Fruits”, “Grains”, “Dairy”, “Protein foods”, “Miscellaneous”.
This code prepares two sets of text data—USDA Market Basket subcategories and food descriptions—for semantic comparison using cosine similarity via the quanteda package.
The goal is to assign each food_description a USDA category based on linguistic similarity. First, it creates a corpus of subcategory names (e.g., “cheese”, “leafy greens”) and tokenizes them (i.e., breaks them into lowercased word units, removing punctuation).
A document-feature matrix (DFM) is then built to represent word frequencies for each subcategory. The same process is applied to the food_description column from the dataset. Finally, the dfm_match() function aligns the two DFMs to ensure they have the same vocabulary (features), enabling valid pairwise similarity comparisons. This preprocessing step is essential to compute cosine similarity between the subcategories and the food descriptions, which is used to assign the most appropriate USDA category to each item based on shared word patterns.
Cosine similarity is used in this context to assign each food_description to the most semantically similar USDA Market Basket subcategory by comparing the patterns of words they contain. Rather than relying on exact word matches, cosine similarity evaluates the angle between vector representations of text — meaning it captures the overall similarity in word usage, regardless of word order or length.
By computing how closely the word distribution of a food description aligns with that of each known subcategory, the method identifies which category the item most likely belongs to, based on linguistic context. This allows for more flexible and accurate classification, especially when food descriptions vary in wording but share conceptual meaning.
This code block applies the results of a cosine similarity comparison
between each food description and USDA subcategories to assign a
most-likely top-level category to each food item. First, the similarity
scores—originally in a special object—are converted to a standard matrix
for easier manipulation. The which.max()
function is used
row-wise to find the index of the most similar USDA subcategory for each
food description.
These indices are mapped to the corresponding USDA top-level
categories (tfp_category
) using the tfp_lookup
table. The assignments are matched back to the original dataset
(pp_nap
) only for rows with non-missing descriptions to
avoid misalignment. After rejoining the new category labels, the code
removes any duplicate tfp_category
columns introduced by
the merge and renames the correct one. Finally, any remaining rows
without a category assignment are labeled as “Unknown,” ensuring that
every entry in pp_nap
now has a classification label for
further analysis or model training.
This block of code prepares a cleaned and labeled dataset of food descriptions for training a text classification model using the quanteda package. First, it filters the original dataset (pp_nap) to keep only rows where both food_description and tfp_category (the USDA label) are present, ensuring the model is trained only on valid examples. These rows are stored in nb_data, while rows with missing values are stored separately as nb_data_loss for diagnostic or future use.
A quanteda text corpus is then constructed from nb_data, with each food_description as a document and its corresponding tfp_category stored as metadata (document-level variables). The text is pre-processed by tokenizing (splitting into words), converting to lowercase, removing punctuation and symbols, and optionally removing common English stopwords like “and” or “the.”
Finally, the cleaned tokens are converted into a document-feature matrix (nb_dfm) representing word frequency counts across all food descriptions, which serves as input to the Naive Bayes classifier.
Naive Bayes is a probabilistic text classification algorithm commonly used as a baseline in natural language processing. For this project, we applied Naive Bayes to classify food descriptions into USDA Market Basket categories using a TF-IDF (Term Frequency–Inverse Document Frequency) representation of the text.
TF-IDF transforms each food_description
into a sparse
numeric vector that emphasizes distinctive words in the context of the
full dataset. These vectors were then used to train a Naive Bayes
classifier that predicts the most likely category based on learned word
probabilities.
The model was trained and evaluated on a filtered dataset of food descriptions with USDA category labels.
Since our initial text classification produced poor matching results, we adopted the “Quanteda” library to improve tokenization, stopwords, stemming, etc.
GloVe (Global Vectors for Word Representation) creates dense, continuous vector embeddings that represent semantic relationships between words. However:
A random forest is an ensemble learning model that builds many decision trees and combines their results to improve the accuracy of predictions. In this context, the goal of the random forest model will be to predict which top-level TFP category (grains, dairy, etc.) the classification of a selected food item based on its text description.
Each food description is transformed into a vector of numeric features using GloVe embeddings; these embeddings capture semantic meaning. For example, “cheddar cheese” is closer to “swiss cheese” than bananas.
The numeric vector is passed as an input to the RF, which builds hundreds of decision trees, each making its own classification prediction.
The RF then aggregates all these predictions to output the most likely TFP category for each food item.
Data Preparation
A. We used the same filtered dataset as for Model 1 but focused on creating semantic embeddings for each food_description.
Word Embeddings (GloVe)Global Vectors for Word Representation
-We loaded pre-trained GloVe vectors, where each word is represented as a dense 300-dimensional numeric vector. -These vectors capture semantic meaning — e.g., “apple” and “pear” are close together in vector space.
Averaging for Descriptions
-Each food_description was split into words. The embedding vectors for all words in that description were averaged to create a single vector per product.
Model Training
A Random Forest classifier was trained on these averaged embedding vectors. A Random Forest algorithm builds many decision trees on random subsets of the data and averages their votes for classification.
The trained model predicted:
I. Category (most likely USDA basket)
II.Confidence score (proportion of trees voting for that category)
## Rows in final_results: 887
##
## 0 1
## 300 587
## Naive Bayes Accuracy: 0.7
## Random Forest Accuracy: 0.939
## Reference
## Prediction Dairy Fruits Grains Miscellaneous Protein foods Unknown
## Dairy 191 4 0 12 2 23
## Fruits 3 46 2 11 0 17
## Grains 3 2 26 8 2 10
## Miscellaneous 11 1 4 76 3 24
## Protein foods 2 1 0 0 11 18
## Unknown 5 0 2 15 0 185
## Vegetables 17 1 2 18 4 15
## Reference
## Prediction Vegetables
## Dairy 6
## Fruits 3
## Grains 2
## Miscellaneous 9
## Protein foods 0
## Unknown 4
## Vegetables 86
## Reference
## Prediction Dairy Fruits Grains Miscellaneous Protein foods Unknown
## Dairy 224 0 0 0 0 3
## Fruits 0 53 0 0 0 0
## Grains 0 0 22 1 0 2
## Miscellaneous 0 1 0 128 0 6
## Protein foods 0 0 0 0 19 1
## Unknown 7 1 14 11 3 280
## Vegetables 1 0 0 0 0 0
## Reference
## Prediction Vegetables
## Dairy 1
## Fruits 0
## Grains 1
## Miscellaneous 0
## Protein foods 0
## Unknown 1
## Vegetables 107
## # A tibble: 2 × 3
## Model Accuracy AgreementWithOtherModel
## <chr> <dbl> <dbl>
## 1 Naive Bayes 0.700 0.662
## 2 Random Forest 0.939 0.662
## ### Naive Bayes Confusion Matrix
## Reference
## Prediction Dairy Fruits Grains Miscellaneous Protein foods Unknown
## Dairy 191 4 0 12 2 23
## Fruits 3 46 2 11 0 17
## Grains 3 2 26 8 2 10
## Miscellaneous 11 1 4 76 3 24
## Protein foods 2 1 0 0 11 18
## Unknown 5 0 2 15 0 185
## Vegetables 17 1 2 18 4 15
## Reference
## Prediction Vegetables
## Dairy 6
## Fruits 3
## Grains 2
## Miscellaneous 9
## Protein foods 0
## Unknown 4
## Vegetables 86
##
## ### Random Forest Confusion Matrix
## Reference
## Prediction Dairy Fruits Grains Miscellaneous Protein foods Unknown
## Dairy 224 0 0 0 0 3
## Fruits 0 53 0 0 0 0
## Grains 0 0 22 1 0 2
## Miscellaneous 0 1 0 128 0 6
## Protein foods 0 0 0 0 19 1
## Unknown 7 1 14 11 3 280
## Vegetables 1 0 0 0 0 0
## Reference
## Prediction Vegetables
## Dairy 1
## Fruits 0
## Grains 1
## Miscellaneous 0
## Protein foods 0
## Unknown 1
## Vegetables 107
Random Forest Diagnostic:
Perfect classification (83 Dairy, 30 Protein foods, etc.).
GloVe embeddings captured true semantic distinctions (e.g., “cheddar” vs. “apple” vs. “broccoli”).
Model appears overfit to the current dataset — may not generalize perfectly to new UPCs unless cross-validation confirms this.
Conclusion: Random Forest + embeddings is a strong candidate to move forward with — after further validation.
What to do now — Summary of Immediate Next Steps Review the classification output file-
Filename: tfp_classification_results.csv
Confirm Model Generalizability
✅ DONE: Model trained on food_descriptions with near-perfect accuracy.
🔁 TO DO:
Run k-fold cross-validation to assess performance across multiple subsets.
Alternatively: Set aside a second holdout set for final validation.
Flag any signs of overfitting (e.g., perfect accuracy may mean it memorized rather than generalized).
Generate Manual Review Set
Purpose: Human verification of ML output before applying to real UPCs.
TO DO:
Stratified sample (e.g., 500 rows) from final_results.csv.
Include:
food_description
tfp_basket_cat_ML2
tfp_basket_cat_ML2_conf
tfp_basket_cat_ML1
Agreement flag