Overview/Instructions:

Goal: trial two ML procedures to classify the UPCs into Market Basket categories.

  1. Include a paragraph summary of how each ML procedure works. Please ensure they can be run through R; I’d like everything in an .rmd file. I want two independent ML procedures so we can compare their classification overlap as a first indicator, to save human verification time.

  2. The top categories that are key for us are as follows:

A. See the attached USDA TFP 2021 document, and please use the top-level and sub-categories from Table 1. We just need to add one more category, “Non-Food Item” (the Walmart pulls include books and other non-food items due to substring matches, for example).

  1. Nest the sub-categories in sheet “PP-NAP1718” (column “food_description”) under those topline USDA TFP 2021 [Table 1] categories. [Have the two ML procedures do this first – they should perform excellently here.]

A. In an output file, please have a column for each of the following (a sketch of assembling this file follows the list):
1) [food_description] from the Excel sheet;
2) tfp_basket_cat_ML1, the category the first ML model assigned to each food_description value;
3) tfp_basket_cat_ML1_conf, a quantitative confidence level for that match;
4) tfp_basket_cat_ML2 (the other model’s classification);
5) tfp_basket_cat_ML2_conf (the other model’s confidence measure);
6) ML_model_agreement (1 if the two models agree, 0 otherwise).
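
A minimal sketch of assembling that output file. Column names come from the spec above; the per-model prediction columns on `test` are created in the methodology sketches below, and the CSV file name is an assumption:

```r
final_results <- data.frame(
  food_description        = test$food_description,
  tfp_basket_cat_ML1      = test$tfp_basket_cat_ML1,
  tfp_basket_cat_ML1_conf = test$tfp_basket_cat_ML1_conf,
  tfp_basket_cat_ML2      = test$tfp_basket_cat_ML2,
  tfp_basket_cat_ML2_conf = test$tfp_basket_cat_ML2_conf
)

# 1 if both models picked the same category, 0 otherwise
final_results$ML_model_agreement <-
  as.integer(final_results$tfp_basket_cat_ML1 == final_results$tfp_basket_cat_ML2)

write.csv(final_results, "tfp_classification_results.csv", row.names = FALSE)
```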

The next step would be for us to verify, on a subset of roughly 500 of these 4,000+ food_description values, how well each model did. Then we’d move forward with that ‘trained’ model at the food_description level, applying the ML models to the UPC lists.

Methodology

Model 1 — TF-IDF + Naive Bayes

1. Data Preparation
  A. We filtered the dataset to keep only rows with both a valid food_description and an assigned tfp_category.

B. The data was split into training (80%) and testing (20%) sets, as sketched below.
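
A minimal sketch of steps A and B, assuming the “PP-NAP1718” sheet has already been read into a data frame `dat` with columns `food_description` and `tfp_category`:

```r
library(dplyr)

set.seed(42)  # make the 80/20 split reproducible

# Step A: keep only rows with a usable description and an assigned category
dat <- dat %>%
  filter(!is.na(food_description), food_description != "",
         !is.na(tfp_category))

# Step B: 80/20 train/test split
train_idx <- sample(seq_len(nrow(dat)), size = floor(0.8 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]
```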

2. Text Tokenization & Vectorization

  A. Each food_description was split into individual words (tokens).

  B. A vocabulary was built from the training set — a list of all unique words.

  C. We used TF-IDF (Term Frequency – Inverse Document Frequency) to transform each food description into a numeric vector: words that appear in many descriptions (e.g., “fresh”, “canned”) get lower weight, while distinctive words (e.g., “tuna”, “broccoli”) get higher weight. A sketch of this step follows.
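
One way to implement steps A–C in R is with the text2vec package (the package choice is an assumption; the write-up does not name one):

```r
library(text2vec)

# Step A: tokenize the training descriptions (lowercase, split into words)
it_train   <- itoken(train$food_description, tolower, word_tokenizer)

# Step B: vocabulary of all unique words in the training set
vocab      <- create_vocabulary(it_train)
vectorizer <- vocab_vectorizer(vocab)

# Step C: raw term counts, then TF-IDF weights fitted on the training set only
dtm_train       <- create_dtm(it_train, vectorizer)
tfidf           <- TfIdf$new()
dtm_train_tfidf <- tfidf$fit_transform(dtm_train)
```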

3. Model Training

  A. A Naive Bayes classifier was trained on the TF-IDF vectors.

  B. The model uses the conditional probabilities of words given categories to predict the most likely category from the words present (see the sketch below).
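
A sketch of the training step using the naivebayes package’s multinomial variant (the package choice is an assumption). Note that multinomial Naive Bayes formally expects word counts; feeding it TF-IDF weights, as here, is a common pragmatic shortcut:

```r
library(naivebayes)

nb_fit <- multinomial_naive_bayes(x = as.matrix(dtm_train_tfidf),
                                  y = factor(train$tfp_category),
                                  laplace = 1)  # smoothing for unseen words
```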

4. Prediction

  A. The same TF-IDF mapping from training was applied to the test data.

  B. The model produced:

        I. Predicted category (most likely USDA basket)

        II. Confidence score (probability of that prediction)
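
A sketch of the prediction step, reusing the vectorizer and TF-IDF transform fitted on the training data:

```r
# Apply the *training* vocabulary and IDF weights to the test descriptions
it_test  <- itoken(test$food_description, tolower, word_tokenizer)
dtm_test <- tfidf$transform(create_dtm(it_test, vectorizer))

# Top class = predicted category; its posterior probability = confidence
nb_probs <- predict(nb_fit, as.matrix(dtm_test), type = "prob")
test$tfp_basket_cat_ML1      <- colnames(nb_probs)[max.col(nb_probs)]
test$tfp_basket_cat_ML1_conf <- apply(nb_probs, 1, max)
```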
        

Model 2 — Text Embeddings + Random Forest

  1. Data Preparation

    A. We used the same filtered dataset as for Model 1 but focused on creating semantic embeddings for each food_description.

  2. Word Embeddings (GloVe: Global Vectors for Word Representation)

    - We loaded pre-trained GloVe vectors, in which each word is represented as a dense 300-dimensional numeric vector (a loading sketch follows).
    - These vectors capture semantic meaning; e.g., “apple” and “pear” are close together in vector space.
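
A minimal loading sketch, assuming the pre-trained file glove.6B.300d.txt from https://nlp.stanford.edu/projects/glove/ sits in the working directory (the file name and location are assumptions):

```r
library(data.table)

# Each row of the file is one word followed by 300 numbers
glove_raw <- fread("glove.6B.300d.txt", header = FALSE, quote = "")
glove_mat <- as.matrix(glove_raw[, -1])  # 300 numeric columns
rownames(glove_mat) <- glove_raw[[1]]    # words as row names for lookup
```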

  3. Averaging for Descriptions

    - Each food_description was split into words, and the embedding vectors for all words in that description were averaged to create a single vector per product (sketched below).
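
A sketch of the averaging step; `description_to_vector` is a hypothetical helper, not a function from the source or from any package:

```r
description_to_vector <- function(desc, emb) {
  words <- unlist(strsplit(tolower(desc), "[^a-z']+"))  # crude tokenizer
  words <- words[words %in% rownames(emb)]              # keep words GloVe knows
  if (length(words) == 0) return(rep(0, ncol(emb)))     # fallback: zero vector
  colMeans(emb[words, , drop = FALSE])                  # average word vectors
}

emb_train <- t(vapply(train$food_description, description_to_vector,
                      numeric(300), emb = glove_mat))
emb_test  <- t(vapply(test$food_description, description_to_vector,
                      numeric(300), emb = glove_mat))
```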

  4. Model Training

A Random Forest classifier was trained on these averaged embedding vectors (see the sketch below). A Random Forest builds many decision trees on random subsets of the data and combines their votes, by majority, for classification.
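
A sketch using the randomForest package (one reasonable choice; the write-up does not name a package):

```r
library(randomForest)

rf_fit <- randomForest(x = emb_train,
                       y = factor(train$tfp_category),
                       ntree = 500)  # 500 trees is the package default
```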

  5. Prediction

The trained model predicted:
        I. Category (most likely USDA basket)
        II. Confidence score (proportion of trees voting for that category)
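
A sketch of the prediction step; with type = "prob", predict() returns the proportion of trees voting for each class, so the maximum doubles as the confidence score:

```r
rf_probs <- predict(rf_fit, newdata = emb_test, type = "prob")

test$tfp_basket_cat_ML2      <- colnames(rf_probs)[max.col(rf_probs)]
test$tfp_basket_cat_ML2_conf <- apply(rf_probs, 1, max)
```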
Agreement between the two models on the held-out test rows (ML_model_agreement counts: 0 = models disagree, 1 = agree):

## Rows in final_results: 121
## 
##   0   1 
## 116   5

Model Comparison and Evaluation

Held-out test-set accuracy for each model:

## Naive Bayes Accuracy: 0.041
## Random Forest Accuracy: 1
## # A tibble: 2 × 3
##   Model         Accuracy AgreementWithOtherModel
##   <chr>            <dbl>                   <dbl>
## 1 Naive Bayes     0.0413                  0.0413
## 2 Random Forest   1                       0.0413
Naive Bayes Confusion Matrix

##                Reference
## Prediction      Dairy Fruits Protein foods Vegetables
##   Dairy             0      0             0          0
##   Fruits           83      3            29          3
##   Protein foods     0      0             0          0
##   Vegetables        0      0             1          2

Random Forest Confusion Matrix

##                Reference
## Prediction      Dairy Fruits Protein foods Vegetables
##   Dairy            83      0             0          0
##   Fruits            0      3             0          0
##   Protein foods     0      0            30          0
##   Vegetables        0      0             0          5

Visualization

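A minimal sketch of one possible figure here: a cross-tabulation heatmap of the two models’ categories, melted to long form with reshape2::melt (this section’s code attaches reshape2) and drawn with ggplot2. `final_results` is assumed to hold both models’ predictions as specified earlier:

```r
library(ggplot2)
library(reshape2)

# Cross-tabulate the two models' categories and melt to long form
agree_long <- melt(table(ML1 = final_results$tfp_basket_cat_ML1,
                         ML2 = final_results$tfp_basket_cat_ML2))

ggplot(agree_long, aes(x = ML1, y = ML2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = value), size = 3) +
  labs(title = "Model agreement by category",
       x = "Model 1 (Naive Bayes)", y = "Model 2 (Random Forest)",
       fill = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```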