Processed Food at Three Major Grocery Store Chains

Background & Cleaning

Initial Import and Cleaning of my Grocery Store Dataset

#libraries
library(tidyverse) #ride or die
library(kableExtra) #for tables
library(DT) #for tables
library(plotly) #for interactivity
library(ggpubr) #for regression equations
#load data
GroceryDB_full <- read_csv("GroceryDB_foods.csv")

For my final project, I selected a dataset about groceries that are sold at three large nationwide chains. I originally found it in the Data Is Plural weekly newsletter, my favorite source for esoteric datasets.

The data includes 50,000+ food products available at Walmart, Target, and Whole Foods, including nutritional information and the degree of processing for each product.

The original dataset includes 50,468 rows of unique products and 27 variable columns, not all of which I fully understand.
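As a quick sanity check on the raw import, a minimal sketch using the tibble loaded above:

#confirm the size of the raw import
dim(GroceryDB_full)    #50468 rows, 27 columns
names(GroceryDB_full)  #the 27 variable names listed below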

Variables in original data table:

original_ID, name, store, `harmonized single category`, brand, f_FPro, f_FPro_P, f_min_FPro, f_std_FPro, f_FPro_class, price, `price percal`, package_weight, has10_nuts, is_Nuts_Converted_100g, Protein, `Total Fat`, Carbohydrate, `Sugars, total`, `Fiber, total dietary`, Calcium, Iron, Sodium, `Vitamin C`, Cholesterol, `Fatty acids, total saturated`, `Total Vitamin A`

Consequently, the first step I took was to create a version of the dataset that has only the variables I think I’ll care about for my analysis, and to rename some of the columns.

#create version of data with fewer variables, rename columns
groceries <- 
  GroceryDB_full %>%
  select(store,
         brand,
         product_type = `harmonized single category`,
         product_name = name,
         processing_score = f_FPro,
         score_class = f_FPro_class,
         price,
         price_percal = `price percal`,
         Protein, 
         Fat = `Total Fat`,
         Carbohydrate,
         Sugar = `Sugars, total`,
         Sodium,
         Cholesterol)

The resulting new data table still has 50,468 rows of products, but now only 14 variables.

Variables in cleaned data table:

store, brand, product_type, product_name, processing_score, score_class, price, price_percal, Protein, Fat, Carbohydrate, Sugar, Sodium, Cholesterol

Food Processing Score

The variable in this data that I’m most interested in exploring is called the FPro score, or the degree to which a food product has been processed.

The FPro (or processing_score, as I renamed the variable in my cleaned dataset) is a continuous variable that ranges from 0 for raw ingredients to 1 for ultra-processed foods. The score was created by a team of researchers as an alternative to older, more manual and subjective systems for classifying how processed a food is, such as the NOVA classification.

FPro captures the progressive changes in nutrient content induced by processing food. For example, FPro incrementally increases from raw onion (0.0203) to boiled onion (0.3126), fried onion (0.7779), and onion rings from frozen ingredients (0.9955) (citation).
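Before diving in, a quick look at how this score shows up in my cleaned table (a minimal sketch; score_class appears to be the discrete label that accompanies the continuous score):

#quick look at the processing score and its class labels
summary(groceries$processing_score)  #continuous, bounded between 0 and 1 (NAs discussed below)
groceries %>%
  count(score_class)                 #how many products fall into each FPro class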

For my exploratory analysis, I’m interested in learning what relationships exist between the grocery store retailers and the processing scores of the food they sell.

Some of my initial questions are:

  • How do the distributions and summary statistics of processing scores differ between the three major retailers? (A first-pass sketch follows this list.)
  • What does that look like when broken down by product type?
  • Do less processed foods tend to be more expensive?
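As a first pass at the first question, here is a minimal sketch of per-store summary statistics, computed on the raw groceries table and dropping the NA scores discussed in the next section:

#per-store summary statistics for processing score (NAs dropped inside each calculation)
groceries %>%
  group_by(store) %>%
  summarise(n = n(),
            mean_score = mean(processing_score, na.rm = TRUE),
            median_score = median(processing_score, na.rm = TRUE),
            IQR_score = IQR(processing_score, na.rm = TRUE))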

Additional Cleaning to Enable Analysis

#looks like there's a ton of NA values in processing score...
NA_scores <- sum(is.na(groceries$processing_score))

groceries_clean_score <-
  groceries %>%
  filter(is.finite(processing_score))

Immediately when I started poking around my data, I realized that a large number of food products in the dataset had an NA value for processing score (23,811, to be exact). I created a new subset of the data, groceries_clean_score, to use without running into issues when looking at this variable specifically.

These NA scores represent 47.2% of the original data, which is a bummer, but 26,657 is still a large number of products to work with.
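For the record, those percentages are just the counts divided out (a quick sketch of the arithmetic):

#arithmetic behind the 47.2% figure
NA_scores / nrow(groceries)       #23811 / 50468 = ~0.472
nrow(groceries) - NA_scores       #26657 products left with a usable score
nrow(groceries_clean_score)       #same number, confirming the filter did what I expected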

total_store_count <- 
  groceries %>%
  group_by(store) %>%
  summarise(total_rows = n(), 
            NA_score = sum(is.na(processing_score)),
            NA_percent = paste((round((NA_score/total_rows), digits = 3)*100), "%"))

kable(total_store_count,
      caption = "Missing Processing Scores by Store",
      col.names = c("Store", "Total Product Count", "Products with NA value for Processing Score", "Percent of Total Products Missing Processing Score"),
      align = c("l", "c", "c", "c")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "bordered", "condensed"))
Missing Processing Scores by Store

Store        Total Product Count   Products with NA Processing Score   Percent Missing
Target       17854                 11351                               63.6 %
Walmart      16823                 7524                                44.7 %
WholeFoods   15791                 4936                                31.3 %

Price Correlation

Do foods that are less processed tend to be more expensive?

Now that I understand how the spread of processing scores differs greatly for different categories of food, I am curious if there are any correlations between price and processing level.

Looking at all products across all three grocery store chains indicates a very weak correlation between these two variables. As you might expect, most products sold at these three stores are highly processed and very cheap.
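A quick numeric check of that claim, restricted to the same price range the plot below uses (a minimal sketch):

#overall correlation between price and processing score, within the plotted ranges
groceries_clean_score %>%
  filter(price <= 30) %>%                      #matches the ylim used in the plot
  summarise(r = cor(price, processing_score, use = "complete.obs"),
            r_squared = r^2)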

  ggplot(groceries_clean_score, aes(x = processing_score, y = price)) +
  geom_point(aes(color = store), 
             alpha = 0.4) +
  xlim(c(0, 1)) +
  ylim(c(0, 30)) +
  theme_light() + 
    labs(title = "Price and Processing Score have a Weak Correlation for All Products",
       x = "Processing Score",
       y = "Price ($)") +
  theme(legend.position = "top",
        legend.title = element_blank()) +
    geom_smooth(method="lm", color = "black") +
  scale_color_manual(values = brandcolors) +
  stat_cor(aes(label = after_stat(rr.label)), label.x = 0.26, label.y = 28)

#there's just way too much data here, most stuff is cheap and highly processed
#I tried making this plot interactive too, but the regression was messing up because that layer doesn't have the same attributes as the scatterplot beneath. 

What about broken down by product category?

I know now that some categories of food have a wide range of processing levels for sale. If I have a choice between a highly processed option and a minimally processed option of the same type of food, does the data show which one will be more expensive?

#I was hitting an issue where the R-squared values printed on the plot didn't match the R-squared values calculated here, because this calculation originally used data outside the plot ranges (stat_cor only sees the values that survive xlim/ylim)
#calculate R-squared for each product_type using lm, to later rearrange facets
rsq_data <- groceries_clean_score %>%
  filter(product_type %in% highest_IQR$product_type) %>%
  filter(!is.na(price), !is.na(processing_score)) %>%  # Remove rows with NA
  filter(processing_score >= 0, processing_score <= 1,  # Apply xlim filter
         price >= 0, price <= 30) %>%  # Apply ylim filter
  group_by(product_type) %>%
  do({
    model <- lm(price ~ processing_score, data = .)
    rsq <- summary(model)$r.squared
    data.frame(product_type = unique(.$product_type), r_squared = rsq)
  })

#is there any price correlation within the product types that have the widest ranges of processing scores?
groceries_clean_score %>%
  filter(product_type %in% highest_IQR$product_type) %>%
  left_join(rsq_data, by = "product_type") %>%
  mutate(product_type = factor(product_type, 
         levels = rsq_data$product_type[order(-rsq_data$r_squared)])) %>%
  ggplot(aes(x = processing_score, y = price)) +
  geom_point(aes(color = store), 
             alpha = 0.4) +
  xlim(c(0, 1)) +
  ylim(c(0, 30)) +
  theme_light() +
  facet_wrap(~product_type) +
  labs(title = "Correlation between Price and Processing Score for Specific Product Types \n (Arranged by Descending Correlation Strength)",
       x = "Processing Score",
       y = "Price ($)") +
  theme(legend.position = "top",
        legend.title = element_blank(),
        aspect.ratio = 0.6,
        axis.text.x = element_text(size = 6),
        axis.text.y = element_text(size = 6)) +
  scale_color_manual(values = brandcolors) +
  geom_smooth(method="lm", color = "black") +
  stat_cor(aes(label = after_stat(rr.label)), 
           label.x = 0.1, label.y = 28, size = 3)

According to this dataset, the correlation between price and processing level is only notable for coffee and culinary ingredients (a category that includes things like butter, vinegar, and oils). For the other types of food with a wide range of processing levels available, there is a negligible price difference between highly processed and minimally processed options.
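That ranking comes straight from the R-squared table built earlier; sorting it makes the comparison explicit (minimal sketch):

#product types ordered by strength of the price ~ processing score relationship
rsq_data %>%
  arrange(desc(r_squared))   #coffee and culinary ingredients come out on top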

One logistical coding note that feels worth mentioning: in order to sort the facets in descending order of R^2, I had to create a new dataframe of those values that is separate from the value printed on each plot. The numbers in the two weren't matching exactly, which infuriated me to no end for a long while. I finally realized that it was because one method was calculating R^2 based on all values in the category, while the other was only using the plotted values, and I had cut off prices above $30 on the y-axis. Fixing that error felt like a huge win!
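To make that pitfall concrete, here is a minimal sketch of the two disagreeing calculations, using an arbitrary category from the faceted plot:

#R^2 on every row in a category vs. only the rows that survive the plot's y-axis cutoff
example_type <- rsq_data$product_type[1]                      #any plotted category works here
all_rows     <- groceries_clean_score %>% filter(product_type == example_type)
plotted_rows <- all_rows %>% filter(price <= 30)              #ylim(c(0, 30)) silently drops the rest

summary(lm(price ~ processing_score, data = all_rows))$r.squared      #what I calculated at first
summary(lm(price ~ processing_score, data = plotted_rows))$r.squared  #what stat_cor reports on the plot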

Outstanding Questions & References

Questions for Continued Exploration

After this exploration of the groceries dataset, I have a few unanswered questions that I would continue to investigate with more time:

  • I looked at the three major retailers as a whole, but I wonder if any of the specific brands sold at those retailers are significant outliers compared to the rest of the products at that store? (A rough starting sketch follows this list.)
    • For example, brands like Annie’s Organics are probably sold at all three - does that brand skew the statistics for the entire store?
    • Do certain brands have a stronger price vs. processing correlation?
    • What about just each store’s generic brand?
  • I would also be curious to look at how the FPro score compares to the older NOVA Classification for categorizing food processing levels.
    • Would each product's NOVA class (categorical, from 1 to 4) map neatly onto the four quarters of the FPro range from 0 to 1?