1 OVERVIEW

Goal: trial two ML procedures to classify the UPCs into Market Basket categories.

  1. Include a paragraph summary of how each ML procedure works. Please ensure they can be run through R; I’d like everything in an .rmd file. I want two independent ML procedures so we can compare their classification overlap as a first indicator, to save human verification time.

  2. The key top-level categories:

A. See the attached USDA TFP 2021 document, and please use the top-level and sub-categories from Table 1. We just need to add one more for “Non-Food Item” (the Walmart pulls include books and other non-food items due to substring matches, for example).

  1. Nest the sub-categories in sheet “PP-NAP1718” (column “food_description”) under those topline USDA TFP 2021 [Table 1] categories. [Have the two ML procedures do this first; they should perform excellently here.]

A. In the output file, please include columns for:
1) [food_description] from the Excel sheet;
2) tfp_basket_cat_ML1: the category the first ML model assigned to each food_description value;
3) tfp_basket_cat_ML1_conf: a quantitative confidence measure for the match;
4) tfp_basket_cat_ML2: the second model’s classification;
5) tfp_basket_cat_ML2_conf: the second model’s confidence measure;
6) ML_model_agreement: 1 if the models agree, 0 otherwise.

The next step would be for us to verify, on a subset of roughly 500 of these 4,000+ food_description values, how well each model did. Then we would move forward with the ‘trained’ model on these food_description levels and apply the ML models to the UPC lists.

2 DATA SOURCES

Two initial datasets from the USDA were used for analysis:

  1. Carlson, Andrea, Christopher Lowe, and Elina Page. USDA Purchase to Plate National Average Prices for NHANES. U.S. Department of Agriculture, Economic Research Service, Purchase to Plate Suite, March 2023. The sheet used for analysis was PP-NAP1718 (2017-2018 data).

  2. U.S. Department of Agriculture. Thrifty Food Plan, 2021. August 2021. FNS-916. Available at https://FNS.usda.gov/TFP.

The TFP Market Basket (Table 1) provided the “MB Categories” factor column, with the top-level labels “Vegetables”, “Fruits”, “Grains”, “Dairy”, “Protein foods”, and “Miscellaneous”.

3 METHODOLOGY

3.1 Data Preparation

3.1.1 Quanteda Label Assignment

This code prepares two sets of text data—USDA Market Basket subcategories and food descriptions—for semantic comparison using cosine similarity via the quanteda package.

The goal is to assign each food_description a USDA category based on linguistic similarity. First, it creates a corpus of subcategory names (e.g., “cheese”, “leafy greens”) and tokenizes them (i.e., breaks them into lowercased word units, removing punctuation).

A document-feature matrix (DFM) is then built to represent word frequencies for each subcategory. The same process is applied to the food_description column from the dataset. Finally, the dfm_match() function aligns the two DFMs to ensure they have the same vocabulary (features), enabling valid pairwise similarity comparisons. This preprocessing step is essential to compute cosine similarity between the subcategories and the food descriptions, which is used to assign the most appropriate USDA category to each item based on shared word patterns.
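A minimal sketch of this preprocessing, assuming a tfp_lookup table whose subcategory column (an assumed column name) holds the Table 1 subcategory names:

```r
# Minimal sketch (tfp_lookup$subcategory is an assumed column name):
# build aligned DFMs for subcategory names and food descriptions.
library(quanteda)

subcat_dfm <- corpus(tfp_lookup$subcategory) |>
  tokens(remove_punct = TRUE) |>
  tokens_tolower() |>
  dfm()

# Same pipeline for the non-missing food descriptions
desc_texts <- pp_nap$food_description[!is.na(pp_nap$food_description)]
desc_dfm <- corpus(desc_texts) |>
  tokens(remove_punct = TRUE) |>
  tokens_tolower() |>
  dfm()

# Align vocabularies so pairwise similarity comparisons are valid
desc_dfm <- dfm_match(desc_dfm, features = featnames(subcat_dfm))
```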

3.1.2 Compute Cosine Similarity

Cosine similarity is used in this context to assign each food_description to the most semantically similar USDA Market Basket subcategory by comparing the patterns of words they contain. Rather than relying on exact word matches, cosine similarity evaluates the angle between vector representations of text — meaning it captures the overall similarity in word usage, regardless of word order or length.

By computing how closely the word distribution of a food description aligns with that of each known subcategory, the method identifies which category the item most likely belongs to, based on linguistic context. This allows for more flexible and accurate classification, especially when food descriptions vary in wording but share conceptual meaning.
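For two word-count vectors a and b, cosine similarity is sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2))). The computation itself is one call; a minimal sketch using quanteda.textstats and the DFMs built above:

```r
# Minimal sketch: cosine similarity between every food description
# (rows) and every subcategory (columns).
library(quanteda.textstats)

sim <- textstat_simil(desc_dfm, subcat_dfm,
                      margin = "documents", method = "cosine")
```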

3.1.3 Extract Best Match per Row

This code block applies the results of the cosine similarity comparison between each food description and the USDA subcategories to assign a most-likely top-level category to each food item. First, the similarity scores (returned as a textstat_simil object) are converted to a standard matrix for easier manipulation. The which.max() function is applied row-wise to find the index of the most similar USDA subcategory for each food description.

These indices are mapped to the corresponding USDA top-level categories (tfp_category) using the tfp_lookup table. The assignments are matched back to the original dataset (pp_nap) only for rows with non-missing descriptions to avoid misalignment. After rejoining the new category labels, the code removes any duplicate tfp_category columns introduced by the merge and renames the correct one. Finally, any remaining rows without a category assignment are labeled as “Unknown,” ensuring that every entry in pp_nap now has a classification label for further analysis or model training.
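A sketch of this extraction step, assuming the sim object above and that the row order of tfp_lookup matches the subcategory DFM:

```r
# Minimal sketch: row-wise best match, mapped to top-level categories.
sim_mat  <- as.matrix(sim)                 # textstat_simil -> plain matrix
best_idx <- apply(sim_mat, 1, which.max)   # best subcategory per description

ok <- !is.na(pp_nap$food_description)      # avoid misalignment on NAs
pp_nap$tfp_category     <- NA_character_
pp_nap$tfp_category[ok] <- tfp_lookup$tfp_category[best_idx]

# Any rows still unassigned get an explicit label
pp_nap$tfp_category[is.na(pp_nap$tfp_category)] <- "Unknown"
```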

3.1.4 TFP Categories Frequency Table
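A one-line sketch of the frequency table this section reports, using the objects above:

```r
# Frequency of assigned top-level TFP categories
table(pp_nap$tfp_category)
```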

3.1.5 Quanteda Corpus Creation

This block of code prepares a cleaned and labeled dataset of food descriptions for training a text classification model using the quanteda package. First, it filters the original dataset (pp_nap) to keep only rows where both food_description and tfp_category (the USDA label) are present, ensuring the model is trained only on valid examples. These rows are stored in nb_data, while rows with missing values are stored separately as nb_data_loss for diagnostic or future use.

A quanteda text corpus is then constructed from nb_data, with each food_description as a document and its corresponding tfp_category stored as metadata (document-level variables). The text is pre-processed by tokenizing (splitting into words), converting to lowercase, removing punctuation and symbols, and optionally removing common English stopwords like “and” or “the.”

Finally, the cleaned tokens are converted into a document-feature matrix (nb_dfm) representing word frequency counts across all food descriptions, which serves as input to the Naive Bayes classifier.
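A minimal sketch of this step with the object names used above (corpus() carries tfp_category along as a document variable):

```r
# Minimal sketch: labeled corpus and DFM for the Naive Bayes model.
library(quanteda)

keep         <- !is.na(pp_nap$food_description) & !is.na(pp_nap$tfp_category)
nb_data      <- pp_nap[keep, ]    # valid training examples
nb_data_loss <- pp_nap[!keep, ]   # set aside for diagnostics

nb_corpus <- corpus(nb_data, text_field = "food_description")
# tfp_category travels with each document as a docvar
nb_dfm <- nb_corpus |>
  tokens(remove_punct = TRUE, remove_symbols = TRUE) |>
  tokens_tolower() |>
  tokens_remove(stopwords("en")) |>
  dfm()
```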

3.2 Model 1: Naive Bayes + TF-IDF — Summary and Recommendations

3.2.1 Overview

Naive Bayes is a probabilistic text classification algorithm commonly used as a baseline in natural language processing. For this project, we applied Naive Bayes to classify food descriptions into USDA Market Basket categories using a TF-IDF (Term Frequency–Inverse Document Frequency) representation of the text.

TF-IDF transforms each food_description into a sparse numeric vector that emphasizes distinctive words in the context of the full dataset. These vectors were then used to train a Naive Bayes classifier that predicts the most likely category based on learned word probabilities.

3.2.2 How the Model Works (in plain terms)

3.2.2.1 Naive Bayes:

  • Learns how often each word appears in each USDA category (e.g., “cheddar” in Dairy, “peach” in Fruits).
  • Makes predictions by assuming each word contributes independently to the final classification (the “naive” assumption).
  • Calculates which USDA category is most probable given the words in a new food description.

3.2.2.2 TF-IDF:

  • Assigns higher weight to rare, meaningful words (e.g., “tortilla”) and lower weight to common words (e.g., “fresh”).
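A minimal sketch of the weighting and model fit using quanteda.textmodels, with object names from Section 3.1.5:

```r
# Minimal sketch: TF-IDF weighting, then a multinomial Naive Bayes fit.
library(quanteda)
library(quanteda.textmodels)

nb_tfidf <- dfm_tfidf(nb_dfm)   # up-weight distinctive words
nb_model <- textmodel_nb(nb_tfidf, y = docvars(nb_tfidf, "tfp_category"))

# Most probable category and its posterior probability per description
pred_cat  <- predict(nb_model, type = "class")
pred_prob <- predict(nb_model, type = "probability")
```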

3.2.3 Performance and Limitations

The model was trained and evaluated on a filtered dataset of food descriptions with USDA category labels.

3.2.3.1 Revision of text classifier library

Since our initial text classification produced poor matching results, we adopted the quanteda library to improve tokenization, stopword removal, stemming, and related preprocessing.

3.2.3.2 Why GloVe Embeddings Were Not Used

GloVe (Global Vectors for Word Representation) creates dense, continuous vector embeddings that represent semantic relationships between words. However:

  • The multinomial Naive Bayes used here expects non-negative, count-like features; GloVe embeddings are dense, real-valued vectors (including negative values).
  • Word frequencies and TF-IDF scores satisfy that requirement; semantic coordinates do not.
  • TF-IDF is therefore the appropriate input representation for this model.

3.2.4 NB Test-Train Split
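The split itself can be sketched as follows; the 80/20 ratio and seed are assumptions, since the section’s code is not reproduced here:

```r
# Minimal sketch (assumed 80/20 split and seed): hold out a test set
# before TF-IDF weighting and model evaluation.
set.seed(42)  # assumed seed
n         <- ndoc(nb_dfm)
train_idx <- sample(n, size = floor(0.8 * n))
nb_train  <- nb_dfm[train_idx, ]
nb_test   <- dfm_match(nb_dfm[-train_idx, ], features = featnames(nb_train))
```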

4 RANDOM FOREST MODEL

4.1 Definition

A random forest is an ensemble learning model that builds many decision trees and combines their results to improve the accuracy of predictions. In this context, the goal of the random forest model is to predict the top-level TFP category (grains, dairy, etc.) of a selected food item based on its text description.

4.2 How the model works

Each food description is transformed into a vector of numeric features using GloVe embeddings; these embeddings capture semantic meaning. For example, “cheddar cheese” is closer to “swiss cheese” than to “bananas”.

The numeric vector is passed as an input to the RF, which builds hundreds of decision trees, each making its own classification prediction.

The RF then aggregates all these predictions to output the most likely TFP category for each food item.
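A minimal sketch of the embedding step, assuming a local copy of the pre-trained vectors (the file name glove.6B.300d.txt is assumed) and the nb_data object from Section 3.1.5; embed_description is a hypothetical helper:

```r
# Minimal sketch (assumed file path; embed_description is a hypothetical
# helper): average pre-trained GloVe vectors to get one numeric vector
# per food description.
glove_raw <- read.table("glove.6B.300d.txt", sep = " ", quote = "",
                        row.names = 1, comment.char = "")
glove <- as.matrix(glove_raw)   # one 300-d row per word

embed_description <- function(text) {
  words <- tolower(unlist(strsplit(text, "\\W+")))
  hits  <- intersect(words, rownames(glove))
  if (length(hits) == 0) return(rep(0, ncol(glove)))
  colMeans(glove[hits, , drop = FALSE])   # average over known words
}

# One averaged embedding per description (matrix: descriptions x 300)
X <- t(vapply(nb_data$food_description, embed_description, numeric(300)))
```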

4.3 Why RF is a good model for text classification

A Random Forest:
  • handles high-dimensional numeric data well, making it a good fit for the GloVe vectors;
  • is typically robust to noisy or ambiguous input;
  • provides feature-importance measures for analyzing which features are most influential in classification;
  • counterbalances Naive Bayes: NB is probabilistic and works on sparse counts, while RF is non-parametric and works on dense embeddings.

4.3.1 Model 2 — Text Embeddings + Random Forest

    1. Data Preparation

      A. We used the same filtered dataset as for Model 1, but focused on creating semantic embeddings for each food_description.

    2. Word Embeddings: GloVe (Global Vectors for Word Representation)

      - We loaded pre-trained GloVe vectors, where each word is represented as a dense 300-dimensional numeric vector.
      - These vectors capture semantic meaning; e.g., “apple” and “pear” are close together in vector space.

    3. Averaging for Descriptions

      - Each food_description was split into words. The embedding vectors for all words in that description were averaged to create a single vector per product.

    4. Model Training

      A Random Forest classifier was trained on these averaged embedding vectors. The Random Forest algorithm builds many decision trees on random subsets of the data and aggregates their votes for classification.

    5. Prediction

      The trained model predicted:
      I. Category (most likely USDA basket)
      II. Confidence score (proportion of trees voting for that category)

      (A code sketch of this pipeline appears after the output below.)
The knitted output reports the number of rows in final_results and the counts of the model-agreement flag (0 = models disagree, 1 = agree):

## Rows in final_results: 887
## 
##   0   1 
## 300 587
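A minimal sketch of steps 4 and 5, assuming the embedding matrix X from the Section 4.2 sketch and the randomForest package; the seed is an assumption, and predictions are shown on the training rows for illustration only:

```r
# Minimal sketch (assumes X and nb_data from earlier steps): train the
# forest, then derive a category and a vote-share confidence per item.
library(randomForest)

y <- factor(nb_data$tfp_category)
set.seed(42)  # assumed seed
rf_model <- randomForest(x = X, y = y, ntree = 500)

rf_votes <- predict(rf_model, X, type = "vote")    # per-class vote shares
rf_pred  <- colnames(rf_votes)[max.col(rf_votes)]  # most likely category
rf_conf  <- apply(rf_votes, 1, max)                # proportion of trees agreeing
```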

4.4 Model Comparison and Evaluation

The knitted output below reports held-out accuracy, cross-model agreement, and each model’s confusion matrix:
## Naive Bayes Accuracy: 0.7
## Random Forest Accuracy: 0.939
## # A tibble: 2 × 3
##   Model         Accuracy AgreementWithOtherModel
##   <chr>            <dbl>                   <dbl>
## 1 Naive Bayes      0.700                   0.662
## 2 Random Forest    0.939                   0.662
## ### Naive Bayes Confusion Matrix
##                Reference
## Prediction      Dairy Fruits Grains Miscellaneous Protein foods Unknown Vegetables
##   Dairy           191      4      0            12             2      23          6
##   Fruits            3     46      2            11             0      17          3
##   Grains            3      2     26             8             2      10          2
##   Miscellaneous    11      1      4            76             3      24          9
##   Protein foods     2      1      0             0            11      18          0
##   Unknown           5      0      2            15             0     185          4
##   Vegetables       17      1      2            18             4      15         86
## 
## ### Random Forest Confusion Matrix
##                Reference
## Prediction      Dairy Fruits Grains Miscellaneous Protein foods Unknown Vegetables
##   Dairy           224      0      0             0             0       3          1
##   Fruits            0     53      0             0             0       0          0
##   Grains            0      0     22             1             0       2          1
##   Miscellaneous     0      1      0           128             0       6          0
##   Protein foods     0      0      0             0            19       1          0
##   Unknown           7      1     14            11             3     280          1
##   Vegetables        1      0      0             0             0       0        107

4.5 Visualization

5 Initial Analysis and Next Steps for Model Tuning

Random Forest Diagnostic:

  1. Near-perfect classification for several categories (83 Dairy, 30 Protein foods, etc.).

  2. GloVe embeddings captured true semantic distinctions (e.g., “cheddar” vs. “apple” vs. “broccoli”).

  3. The model appears overfit to the current dataset and may not generalize to new UPCs; cross-validation is needed to confirm.

Conclusion: Random Forest + embeddings is a strong candidate to move forward with — after further validation.

What to do now — Summary of Immediate Next Steps

Review the classification output file:

Filename: tfp_classification_results.csv

  1. Confirm Model Generalizability

    ✅ DONE: Model trained on food_descriptions with near-perfect accuracy.

    🔁 TO DO:

     Run k-fold cross-validation to assess performance across multiple subsets (see the cross-validation sketch after this list).
    
     Alternatively: Set aside a second holdout set for final validation.
    
     Flag any signs of overfitting (e.g., perfect accuracy may mean it memorized rather than generalized).
  2. Generate Manual Review Set

    Purpose: Human verification of ML output before applying to real UPCs.

    TO DO:

    Draw a stratified sample (e.g., 500 rows) from final_results.csv (see the sampling sketch after this list).

    Include:

         food_description
    
         tfp_basket_cat_ML2
    
         tfp_basket_cat_ML2_conf
    
         tfp_basket_cat_ML1
    
         Agreement flag
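For item 1, a minimal 5-fold cross-validation sketch for the Random Forest, assuming the X and y objects from Section 4; the fold count and seed are assumptions:

```r
# Minimal sketch (assumed k = 5 and seed): k-fold cross-validation of
# the Random Forest on the averaged GloVe embeddings.
library(randomForest)

set.seed(42)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(X)))
acc   <- numeric(k)
for (i in 1:k) {
  fit    <- randomForest(x = X[folds != i, ], y = y[folds != i])
  preds  <- predict(fit, X[folds == i, ])
  acc[i] <- mean(preds == y[folds == i])   # held-out accuracy for fold i
}
mean(acc)   # average held-out accuracy across folds
```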
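For item 2, a minimal sketch of the stratified review sample, assuming final_results.csv holds the columns listed above (names follow the output spec in the Overview):

```r
# Minimal sketch: ~500-row review sample, stratified by the Model 2
# category so all baskets are represented.
library(dplyr)

final_results <- read.csv("final_results.csv")
review_set <- final_results |>
  group_by(tfp_basket_cat_ML2) |>
  slice_sample(prop = 500 / nrow(final_results)) |>
  ungroup() |>
  select(food_description, tfp_basket_cat_ML2, tfp_basket_cat_ML2_conf,
         tfp_basket_cat_ML1, ML_model_agreement)

write.csv(review_set, "manual_review_sample.csv", row.names = FALSE)
```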