Goal: trial two ML procedures to classify the UPCs into Market Basket categories.
Include a paragraph summary of how each ML procedure works. Please ensure they can be run through R; I’d like everything in an .rmd file. I want two independent ML procedures so we can compare their classification overlap as a first indicator to save human verification time.
The top categories that are key for us are as follows:
A. See the attached USDA TFP 2021 document, and please use the top and sub categories from Table 1. We just need to add one more for “Non-Food Item” (e.g., we have books etc. in the Walmart pulls due to substring matches).
B. In an output file, please have a column for:
1) [food_description] from the Excel sheet;
2) tfp_basket_cat_ML1 for where the first ML model categorized each food_description value;
3) tfp_basket_cat_ML1_conf for some quantitative level of confidence in the match;
4) tfp_basket_cat_ML2 (other model classification);
5) tfp_basket_cat_ML2_conf (other confidence measure);
6) ML_model_agreement (1 for same, 0 otherwise).
The next step would be for us to verify, on a subset of maybe 500 of these 4000+ categories, how well each model did. Then we’d move forward with that ‘trained’ model on these food_description levels to use the ML models on the UPC lists.
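The requested output layout can be sketched in base R as follows. The three rows, their categories, their confidence values, and the file name are hypothetical placeholders; in practice the four prediction columns would come from the two fitted models.

```r
# Hypothetical toy predictions standing in for real model output.
results <- data.frame(
  food_description        = c("canned tuna in water", "fresh broccoli florets",
                              "paperback cookbook"),
  tfp_basket_cat_ML1      = c("Protein foods", "Vegetables", "Non-Food Item"),
  tfp_basket_cat_ML1_conf = c(0.91, 0.78, 0.55),
  tfp_basket_cat_ML2      = c("Protein foods", "Fruits", "Non-Food Item"),
  tfp_basket_cat_ML2_conf = c(0.84, 0.41, 0.67),
  stringsAsFactors = FALSE
)

# 1 when both models assign the same category, 0 otherwise.
results$ML_model_agreement <- as.integer(
  results$tfp_basket_cat_ML1 == results$tfp_basket_cat_ML2
)

# Write the six requested columns to a CSV (the file name is an assumption).
out_path <- file.path(tempdir(), "tfp_ml_classification.csv")
write.csv(results, out_path, row.names = FALSE)
```

Rows where ML_model_agreement is 0 are the natural first candidates for human verification.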
Model 1: Naive Bayes on TF-IDF features
1. Data Preparation
A. We filtered the dataset to keep only rows with both a valid food_description and an assigned tfp_category.
B. The data was split into training (80%) and testing (20%) sets.
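A minimal base R sketch of the filtering and the 80/20 split is shown here. The `foods` data frame is a hypothetical stand-in for the labelled sheet, and the seed value is arbitrary; the exact filtering rules used in the report may differ.

```r
set.seed(42)  # arbitrary seed, used only so the split is reproducible

# Hypothetical stand-in for the labelled sheet.
foods <- data.frame(
  food_description = c("whole milk", "cheddar cheese", NA, "gala apples",
                       "canned tuna", "", "baby carrots", "greek yogurt",
                       "chicken breast", "frozen spinach", "bananas", "black beans"),
  tfp_category = c("Dairy", "Dairy", "Dairy", "Fruits",
                   "Protein foods", "Fruits", "Vegetables", NA,
                   "Protein foods", "Vegetables", "Fruits", "Protein foods"),
  stringsAsFactors = FALSE
)

# Keep rows with a non-missing, non-empty description and an assigned category.
valid <- !is.na(foods$food_description) & nzchar(trimws(foods$food_description)) &
  !is.na(foods$tfp_category) & nzchar(trimws(foods$tfp_category))
labelled <- foods[valid, ]

# Random 80/20 split by row index.
n_train   <- floor(0.8 * nrow(labelled))
train_idx <- sample(seq_len(nrow(labelled)), size = n_train)
train <- labelled[train_idx, ]
test  <- labelled[-train_idx, ]
```

With small or unbalanced categories, a split stratified by tfp_category may be preferable to this purely random one.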
2. Text Tokenization & Vectorization
A. Each food_description was split into individual words (tokens).
B. A vocabulary, the list of all unique words, was built from the training set.
C. We used TF-IDF (Term Frequency-Inverse Document Frequency) to transform each food description into a numeric vector. Words that appear in many descriptions (e.g., “fresh”, “canned”) get lower weight. Words that are distinctive (e.g., “tuna”, “broccoli”) get higher weight.
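The tokenization and TF-IDF steps can be illustrated with a small base R sketch. The example descriptions and the helper names `tokenize` and `tfidf` are hypothetical, and the weighting used here (term count divided by description length, times log(N / document frequency)) is one common variant that may not match the one used in the report.

```r
train_docs <- c("fresh tuna steak", "canned tuna", "fresh broccoli", "canned corn")
test_docs  <- c("fresh corn", "smoked tuna")

# Lower-case and split on anything that is not a letter or digit.
tokenize <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z0-9]+")
  lapply(tokens, function(t) t[nzchar(t)])
}

# Vocabulary and IDF weights come from the training set only.
train_tokens <- tokenize(train_docs)
vocab    <- sort(unique(unlist(train_tokens)))
doc_freq <- sapply(vocab, function(w) sum(sapply(train_tokens, function(t) w %in% t)))
idf      <- log(length(train_docs) / doc_freq)

# Term frequency (count / description length) times IDF.
# Words outside the training vocabulary are ignored.
tfidf <- function(docs) {
  m <- t(sapply(tokenize(docs), function(t) {
    counts <- table(factor(t, levels = vocab))
    tf <- as.numeric(counts) / max(length(t), 1)
    tf * idf
  }))
  colnames(m) <- vocab
  m
}

train_x <- tfidf(train_docs)
test_x  <- tfidf(test_docs)  # same vocabulary and IDF weights as training
```

Reusing `vocab` and `idf` for the test descriptions keeps the test vectors in the same feature space as the training vectors.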
3. Model Training
A. A Naive Bayes classifier was trained on the TF-IDF vectors.
B. This model applies Bayes' rule, treating the words as conditionally independent given the category, and predicts the category with the highest posterior probability based on the words present.
4. Prediction
A. The same TF-IDF mapping from training was applied to the test data.
B. The model produced:
I. Predicted category (most likely USDA basket)
II. Confidence score (posterior probability of the predicted category)
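To make the training and prediction steps concrete, here is a self-contained base R sketch of a multinomial Naive Bayes classifier with Laplace smoothing. It is an illustration under assumptions, not the report's implementation: it works on raw word counts rather than TF-IDF weights, the toy descriptions and labels are invented, and `count_matrix` and `predict_nb` are hypothetical helper names.

```r
train_docs   <- c("whole milk", "skim milk", "cheddar cheese",
                  "canned tuna", "tuna steak", "chicken breast")
train_labels <- c("Dairy", "Dairy", "Dairy",
                  "Protein foods", "Protein foods", "Protein foods")
test_docs    <- c("chocolate milk", "tuna salad")

tokenize <- function(x) strsplit(tolower(x), "[^a-z0-9]+")
vocab <- sort(unique(unlist(tokenize(train_docs))))

# Document-term count matrix over the training vocabulary.
count_matrix <- function(docs) {
  m <- t(sapply(tokenize(docs), function(t) {
    as.numeric(table(factor(t, levels = vocab)))
  }))
  colnames(m) <- vocab
  m
}

# Fit: class priors and Laplace-smoothed per-class word probabilities (logs).
x           <- count_matrix(train_docs)
classes     <- sort(unique(train_labels))
log_prior   <- log(table(factor(train_labels, levels = classes)) / length(train_labels))
word_counts <- rowsum(x, group = train_labels)[classes, , drop = FALSE]
log_lik     <- log((word_counts + 1) / (rowSums(word_counts) + length(vocab)))

# Predict: log posterior per class, normalised to probabilities per row.
predict_nb <- function(docs) {
  scores <- count_matrix(docs) %*% t(log_lik)
  scores <- sweep(scores, 2, as.numeric(log_prior), "+")
  probs  <- exp(scores - apply(scores, 1, max))
  probs  <- probs / rowSums(probs)
  data.frame(
    food_description = docs,
    predicted  = classes[max.col(probs, ties.method = "first")],
    confidence = apply(probs, 1, max),
    stringsAsFactors = FALSE
  )
}

nb_out <- predict_nb(test_docs)
```

The `confidence` column is the normalised posterior of the winning category, which is the kind of value that would populate tfp_basket_cat_ML1_conf.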
Model 2: Random Forest on averaged GloVe embeddings
Data Preparation
A. We used the same filtered dataset as for Model 1, but represented each food_description with semantic embeddings.
Word Embeddings (GloVe: Global Vectors for Word Representation)
A. We loaded pre-trained GloVe vectors, where each word is represented as a dense 300-dimensional numeric vector.
B. These vectors capture semantic meaning; e.g., “apple” and “pear” are close together in vector space.
Averaging for Descriptions
A. Each food_description was split into words.
B. The embedding vectors for all words in that description were averaged to create a single vector per product.
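The averaging step can be sketched in base R as below. The 3-dimensional `embeddings` matrix is a hypothetical stand-in for the 300-dimensional pre-trained GloVe vectors, `embed_description` is an illustrative helper name, and returning an all-zero vector when no word is found is an assumption about how out-of-vocabulary descriptions might be handled.

```r
# Hypothetical 3-dimensional embeddings standing in for 300-dimensional GloVe vectors.
embeddings <- rbind(
  apple  = c(0.9, 0.1, 0.0),
  pear   = c(0.8, 0.2, 0.0),
  tuna   = c(0.0, 0.1, 0.9),
  canned = c(0.1, 0.9, 0.1)
)

# Average the vectors of all in-vocabulary words in one description.
# Assumption: an all-zero vector is returned if no word is in the vocabulary.
embed_description <- function(description, emb) {
  tokens <- strsplit(tolower(description), "[^a-z0-9]+")[[1]]
  known  <- tokens[tokens %in% rownames(emb)]
  if (length(known) == 0) return(rep(0, ncol(emb)))
  colMeans(emb[known, , drop = FALSE])
}

descriptions <- c("Canned Tuna", "apple pear mix", "paperback book")

# One row per description, one column per embedding dimension.
features <- t(sapply(descriptions, embed_description, emb = embeddings))
```

Each row of `features` is the single vector per product that a classifier would be trained on.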
Model Training
A. A Random Forest classifier was trained on these averaged embedding vectors.
B. A Random Forest builds many decision trees, each on a bootstrap sample of the data with a random subset of features considered at each split, and classifies by majority vote across the trees.
C. The trained model predicted:
I. Category (most likely USDA basket)
II. Confidence score (proportion of trees voting for that category)
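The vote-proportion confidence can be illustrated in base R without fitting a forest. The `votes` matrix is hypothetical: each row is one product and each column is the category chosen by one tree. With a fitted model from the randomForest package, a comparable matrix of vote fractions is available from `predict(..., type = "prob")`; whether the report used that package is an assumption.

```r
# Hypothetical votes: one row per product, one column per tree.
votes <- rbind(
  c("Dairy", "Dairy", "Dairy", "Fruits", "Dairy"),
  c("Fruits", "Vegetables", "Fruits", "Fruits", "Vegetables"),
  c("Protein foods", "Protein foods", "Protein foods", "Protein foods", "Protein foods")
)

classes <- sort(unique(as.vector(votes)))

# Share of trees voting for each category, per product.
vote_share <- t(apply(votes, 1, function(v) {
  as.numeric(table(factor(v, levels = classes))) / length(v)
}))
colnames(vote_share) <- classes

# Predicted category is the majority vote; confidence is its vote share.
rf_out <- data.frame(
  predicted  = classes[max.col(vote_share, ties.method = "first")],
  confidence = apply(vote_share, 1, max),
  stringsAsFactors = FALSE
)
```

A vote share is not a calibrated probability, so it is not directly comparable to the Naive Bayes posterior, even though both lie between 0 and 1.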
## Rows in final_results: 121
##
##   0   1
## 116   5
## Naive Bayes Accuracy: 0.041
## Random Forest Accuracy: 1
##                Reference
## Prediction      Fruits Vegetables Dairy Protein foods
##   Fruits             3          3    83            29
##   Vegetables         0          2     0             1
##   Dairy              0          0     0             0
##   Protein foods      0          0     0             0
##                Reference
## Prediction      Dairy Fruits Protein foods Vegetables
##   Dairy            83      0             0          0
##   Fruits            0      3             0          0
##   Protein foods     0      0            30          0
##   Vegetables        0      0             0          5
## # A tibble: 2 × 3
##   Model         Accuracy AgreementWithOtherModel
##   <chr>            <dbl>                   <dbl>
## 1 Naive Bayes     0.0413                  0.0413
## 2 Random Forest   1                       0.0413
## ### Naive Bayes Confusion Matrix
##                Reference
## Prediction      Dairy Fruits Protein foods Vegetables
##   Dairy             0      0             0          0
##   Fruits           83      3            29          3
##   Protein foods     0      0             0          0
##   Vegetables        0      0             1          2
##
## ### Random Forest Confusion Matrix
##                Reference
## Prediction      Dairy Fruits Protein foods Vegetables
##   Dairy            83      0             0          0
##   Fruits            0      3             0          0
##   Protein foods     0      0            30          0
##   Vegetables        0      0             0          5