#libraries
library(tidyverse) #ride or die
library(kableExtra) #for tables
library(DT) #for tables
library(plotly) #for interactivity
library(ggpubr) #for regression equations
#load data
GroceryDB_full <- read_csv("GroceryDB_foods.csv")
For my final project, I selected a dataset about groceries that are sold at three large nationwide chains. I originally found it in the Data Is Plural weekly newsletter, my favorite source for esoteric datasets.
The data covers 50,000+ food products available at Walmart, Target, and Whole Foods, including nutritional information and the degree of processing for each product.
The original dataset includes 50,468 rows of unique products and 27 variable columns, not all of which I fully understand.
Variables in original data table:
`original_ID`, `name`, `store`, `harmonized single category`, `brand`, `f_FPro`, `f_FPro_P`, `f_min_FPro`, `f_std_FPro`, `f_FPro_class`, `price`, `price percal`, `package_weight`, `has10_nuts`, `is_Nuts_Converted_100g`, `Protein`, `Total Fat`, `Carbohydrate`, `Sugars, total`, `Fiber, total dietary`, `Calcium`, `Iron`, `Sodium`, `Vitamin C`, `Cholesterol`, `Fatty acids, total saturated`, `Total Vitamin A`
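Before trimming anything, a quick look at the raw table's size and column names confirms the figures above (a minimal check, assuming the CSV loaded as shown):

```r
#confirm the size and column names of the raw table
dim(GroceryDB_full)    #expect 50468 rows and 27 columns
names(GroceryDB_full)
```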
Consequently, the first step I took was to create a version of the dataset that has only the variables I think I’ll care about for my analysis, and to rename some of the columns.
#create version of data with fewer variables, rename columns
groceries <-
GroceryDB_full %>%
select(store,
brand,
product_type = `harmonized single category`,
product_name = name,
processing_score = f_FPro,
score_class = f_FPro_class,
price,
price_percal = `price percal`,
Protein,
Fat = `Total Fat`,
Carbohydrate,
Sugar = `Sugars, total`,
Sodium,
Cholesterol)
The resulting new data table still has 50,468 rows of products, but now only 14 variables.
Variables in cleaned data table:
`store`, `brand`, `product_type`, `product_name`, `processing_score`, `score_class`, `price`, `price_percal`, `Protein`, `Fat`, `Carbohydrate`, `Sugar`, `Sodium`, `Cholesterol`
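A quick glimpse() confirms the trimmed structure (a sketch; the column types shown will depend on how read_csv parsed the file):

```r
#check the trimmed table: same row count, 14 renamed columns
glimpse(groceries)
```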
The variable in this data that I’m most interested in exploring is the FPro score, which measures the degree to which a food product has been processed. FPro (or processing_score, as I renamed the variable in my cleaned dataset) is a continuous variable that ranges from 0 for raw ingredients to 1 for ultra-processed foods. The score was created by a team of researchers as an alternative to older, more manual and subjective ways of categorizing the degree of food processing, such as the NOVA classification.
> FPro captures the progressive changes in nutrient content induced by processing food. For example, FPro incrementally increases from raw onion (0.0203) to boiled onion (0.3126), fried onion (0.7779), and onion rings from frozen ingredients (0.9955). (citation)
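As a quick sanity check on that 0-to-1 range in this dataset (ignoring missing values for now):

```r
#verify that processing_score stays within the documented 0-1 range
range(groceries$processing_score, na.rm = TRUE)
```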
For my exploratory analysis, I’m interested in learning what relationships exist between the grocery store retailers and the processing scores of the food they sell.
Some of my initial questions are:

- How do the distributions of processing scores compare across the three stores?
- How does the spread of processing scores differ across product types?
- Is there any relationship between a product’s price and how processed it is?
#looks like there's a ton of NA values in processing score...
NA_scores <- sum(is.na(groceries$processing_score))
groceries_clean_score <-
groceries %>%
filter(is.finite(processing_score))
Immediately when I started poking around my data, I realized that a large number of food products in the dataset had an NA value for processing score (23,811, to be exact). I created a new subset of the data, groceries_clean_score, to use without encountering any issues when looking at this variable specifically. These NA scores represent 47.2% of the original data, which is a bummer, but 26,657 is still a large number of products to work with.
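The figures quoted above come straight from those counts:

```r
#confirm the missing-score figures quoted above
NA_scores                                      #23811 products with no score
round(NA_scores / nrow(groceries) * 100, 1)    #about 47.2% of the original data
nrow(groceries_clean_score)                    #26657 products left to work with
```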
total_store_count <-
groceries %>%
group_by(store) %>%
summarise(total_rows = n(),
NA_score = sum(is.na(processing_score)),
NA_percent = paste((round((NA_score/total_rows), digits = 3)*100), "%"))
kable(total_store_count,
caption = "Missing Processing Scores by Store",
col.names = c("Store", "Total Product Count", "Products with NA value for Processing Score", "Percent of Total Products Missing Processing Score"),
align = c("l", "c", "c", "c")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "bordered", "condensed"))
| Store | Total Product Count | Products with NA value for Processing Score | Percent of Total Products Missing Processing Score |
|---|---|---|---|
| Target | 17854 | 11351 | 63.6 % |
| Walmart | 16823 | 7524 | 44.7 % |
| WholeFoods | 15791 | 4936 | 31.3 % |
I started this analysis by looking at the distribution of processing scores across all product types for the three grocery store chains. Here is a summary of the key statistical measures:
#summary statistics for each store
#first create a new list for each
target <-
groceries_clean_score %>%
filter(store == "Target")
walmart <-
groceries_clean_score %>%
filter(store == "Walmart")
wholefoods <-
groceries_clean_score %>%
filter(store == "WholeFoods")
#compile dataframe of stats
store <- as.factor(c("Target", "Walmart", "WholeFoods"))
scoremeans <- c(mean(target$processing_score), mean(walmart$processing_score), mean(wholefoods$processing_score))
scoremedians <- c(median(target$processing_score), median(walmart$processing_score), median(wholefoods$processing_score))
scorefirstqrts <- c(quantile(target$processing_score)[["25%"]], quantile(walmart$processing_score)[["25%"]], quantile(wholefoods$processing_score)[["25%"]])
scorethirdqrts <- c(quantile(target$processing_score)[["75%"]], quantile(walmart$processing_score)[["75%"]], quantile(wholefoods$processing_score)[["75%"]])
scoreSDs <- c(sd(target$processing_score), sd(walmart$processing_score), sd(wholefoods$processing_score))
score_summarystats <- data.frame(store, scoremeans, scoreSDs, scoremedians, scorefirstqrts, scorethirdqrts)
kable(score_summarystats %>%
mutate_if(is.numeric, round, digits = 3),
caption = "Summary Statistics of Processing Scores for all Products at the Three Major Grocery Stores",
col.names = c("Store", "Mean", "Standard Deviation", "Median", "1st Quartile", "3rd Quartile")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "bordered", "condensed"))
| Store | Mean | Standard Deviation | Median | 1st Quartile | 3rd Quartile |
|---|---|---|---|---|---|
| Target | 0.801 | 0.187 | 0.859 | 0.713 | 0.944 |
| Walmart | 0.758 | 0.198 | 0.809 | 0.640 | 0.918 |
| WholeFoods | 0.658 | 0.256 | 0.721 | 0.532 | 0.859 |
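For reference, the same statistics can also be produced in a single grouped summarise rather than building each store's vector separately (a sketch; score_summary_alt is just a placeholder name):

```r
#more compact route to the same summary statistics
score_summary_alt <-
  groceries_clean_score %>%
  group_by(store) %>%
  summarise(mean   = mean(processing_score),
            sd     = sd(processing_score),
            median = median(processing_score),
            q1     = quantile(processing_score, 0.25),
            q3     = quantile(processing_score, 0.75))
```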
Viewed visually, it is easy to see that Target and Walmart tend to carry more foods at the high end of the processing scale than Whole Foods does. All three stores have a left-skewed distribution.
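One quick numeric check of that skew: in a left-skewed distribution the median sits above the mean, which holds for every store in the summary table:

```r
#median > mean for each store, consistent with a left-skewed distribution
with(score_summarystats, scoremedians > scoremeans)
```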
brandcolors <-
c(Target = "#cc0000",
Walmart = "#0071ce",
WholeFoods = "#00674b")
brandaccents <- #darkens each brand color by 50%
c(Target = "#660000",
Walmart = "#003867",
WholeFoods = "#003325")
store_hist <-
ggplot(groceries_clean_score,
aes(x = processing_score, fill = store)) +
geom_histogram(binwidth = 0.002) +
geom_vline(data = score_summarystats,
aes(xintercept = scoremeans, color = store, linetype = "Mean"),
linewidth = 0.75) +
geom_vline(data = score_summarystats,
aes(xintercept = (scoremeans + scoreSDs), color = store, linetype = "SD"),
linewidth = 0.5) +
geom_vline(data = score_summarystats,
aes(xintercept = (scoremeans - scoreSDs), color = store, linetype = "SD"),
linewidth = 0.5) +
facet_grid(rows = "store") +
labs(title = "Processing Score of All Food Products Sold at Three Grocery Store Chains",
x = "Processing Score",
y = "Count of Products") +
scale_fill_manual(values = brandcolors) + #fill with brand color
scale_color_manual(values = brandaccents) + #line accent colors
scale_linetype_manual(values = c("Mean" = "longdash", "SD" = "dashed")) + # specify line types
theme_light() +
theme(legend.position = "top",
legend.title = element_blank()) +
guides(linetype = guide_legend(title = "Line Type")) # add line types to legend
store_hist
Recall that close to half of the food products included in the original dataset had an NA value for processing score. In order to gauge how well the known scores represent all of the food products available, I calculated a 95% confidence interval for each store: the range in which we would expect the mean score for the full “population” of food products at each store to fall. The resulting confidence intervals are extremely narrow, so the mean score of the products that do have one is estimated very precisely. (Strictly speaking, a narrow interval reflects the large sample size; it cannot rule out the possibility that the products missing a score differ systematically from those with one.)
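Concretely, the interval computed below for each store is the usual t-based interval around the sample mean,

$$\bar{x} \pm t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}}$$

where $\bar{x}$, $s$, and $n$ are the mean, standard deviation, and count of scored products at that store.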
#test confidence intervals for missing scores at each store
store_95CI <-
  groceries_clean_score %>%
  group_by(store) %>%
  summarize(
    mean = mean(processing_score),
    #a two-sided 95% CI uses the 97.5% t quantile; qt(0.95, ...) would give a 90% interval
    ci_lower = mean - qt(0.975, length(processing_score) - 1) * sd(processing_score) / sqrt(length(processing_score)),
    ci_upper = mean + qt(0.975, length(processing_score) - 1) * sd(processing_score) / sqrt(length(processing_score)))
store_95CI$store <- factor(store_95CI$store, levels = unique(store_95CI$store))
ggplot(store_95CI, aes(x = mean, y = store, color = store)) +
geom_point(position = position_dodge(width = 0.5), size = 1) +
geom_errorbar(aes(xmin = ci_lower, xmax = ci_upper),
width = 0.5,
linetype = "dashed",
position = position_dodge(width = 0.5)) +
theme_light() +
xlim(c(0, 1)) +
labs(title = "95% Confidence Intervals of Mean Processing Score for All Products",
x = "Processing Score",
y = "Store") +
theme(legend.position = "top",
legend.title = element_blank()) +
annotate("label", x = store_95CI$mean - 0.1, y = store_95CI$store,
label = paste("95% CI:", sep = "\n", round(store_95CI$ci_lower, 4), round(store_95CI$ci_upper, 4)),
fill = brandcolors[store_95CI$store], color = "white",
size = 3) +
scale_y_discrete(limits = rev(levels(store_95CI$store))) +
scale_color_manual(values = brandcolors)
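As a cross-check on the summarised intervals, t.test() reports the same style of two-sided 95% interval for a single store (a quick sketch using the Whole Foods subset created earlier):

```r
#cross-check one store's interval against t.test's default 95% CI
t.test(wholefoods$processing_score)$conf.int
```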
types_list <-
groceries %>%
distinct(product_type)
Next, I wanted to dig into how this distribution varies for different product types. There are 47 product categories included in this dataset, ranging from Baby Food to Pasta Noodles to Meat and Seafood.
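That category count can be confirmed directly from the types_list table created above:

```r
#confirm the number of distinct product categories
nrow(types_list)    #expect 47
```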
Which category has the most difference in the processing level of different products available? To start, I created another datatable showing summary statistics for the various product types:
product_spreads <-
groceries_clean_score %>%
group_by(product_type) %>%
summarise(n = n(),
score_mean = mean(processing_score),
score_median = median(processing_score),
score_SD = sd(processing_score),
score_min = min(processing_score),
score_max = max(processing_score),
score_range = (score_max-score_min)) %>%
arrange(desc(score_median)) %>%
mutate(product_type = factor(product_type, levels = product_type))
datatable(product_spreads %>%
mutate_if(is.numeric, round, digits = 3),
filter = 'top',
colnames = c("Type", "Count", "Mean", "Median", "SD", "Min", "Max", "Range"),
options = list(pageLength = 5, autoWidth = TRUE,
columnDefs = list(list(className = 'dt-center', targets = 1:7))))
Visually, it becomes clear that for most types of food there is a wide range in the level of processing among the products available at the three chain stores. The ordering also makes logical sense: we would expect cookies to be more highly processed than minimally processed staples like beans or eggs.
ggplot(groceries_clean_score,
aes(x = processing_score, y = product_type)) +
geom_boxplot(fill = "snow2", staplewidth = 1, outlier.size = .7, outlier.alpha = .5) +
labs(title = "Spread of Processing Scores Across Product Types \n (Sorted From Highest to Lowest Median)",
x = "Processing Score",
y = "Product Type") +
theme_light() +
theme(legend.position = "none", axis.text.y = element_text(size = 6)) +
scale_y_discrete(limits = rev(levels(product_spreads$product_type))) # Reverse the order of the y-axis
What intrigues me most about this visualization of the data is the categories of food that run the entire spectrum from raw ingredients to highly processed. In order to dig into those categories further, I pulled a subset of the data containing the 12 product types with the largest interquartile range (IQR).
I chose the IQR as the measure of spread here (rather than the full range from maximum to minimum) because many categories have outlier products beyond their whiskers, which extend 1.5 × IQR past the quartiles.
I really appreciate being able to add an interactive element to this plot in order to scroll around and see the actual product names that underlie each point. Some of the products don’t seem to match their grouping, which makes me wonder about the original categorization method for this data: did it come from the researchers, from the stores themselves, from barcode information, or perhaps from an AI?
highest_IQR <-
groceries_clean_score %>%
group_by(product_type) %>%
summarise(n = n(),
score_median = median(processing_score),
score_IQR = IQR(processing_score),
score_whiskers = (IQR(processing_score)*1.5)) %>%
arrange(desc(score_whiskers)) %>%
slice_head(n = 12) %>%
mutate(product_type = factor(product_type, levels = product_type))
jitter <-
ggplot(groceries_clean_score %>%
filter(product_type %in% highest_IQR$product_type),
aes(x = processing_score, y = product_type, color = store, fill = store, text = paste(product_name, "\n", "Sold at", store))) +
geom_jitter(height = 0.13, alpha = 0.75, size = 1) +
labs(title = "Product Types with the Widest Distribution of Processing Score \n (Defined by Interquartile Range)",
x = "Processing Score",
y = "Product Type") +
theme_light() +
scale_color_manual(values = brandcolors) +
theme(legend.position = "top",
legend.title = element_blank()) +
scale_y_discrete(limits = rev(levels(highest_IQR$product_type)))
ggplotly(jitter, tooltip = "text") %>%
layout(title = list(x = 0.5, #fix an issue with title and legend overlapping the chart
xanchor = "center",
yanchor = "top",
y = 1.5),
margin = list(t = 100),
legend = list(x = 0.5,
xanchor = "center",
y = 0.998,
yanchor = "bottom",
orientation = "h",
title = list(text = ""),
font = list(size = 10)))
Now that I understand how the spread of processing scores differs greatly for different categories of food, I am curious if there are any correlations between price and processing level.
Looking at all products across all three grocery store chains indicates a very weak correlation between these two variables. As you might expect, most products sold at these three stores are highly processed and very cheap.
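Before plotting, the overall strength of that relationship can be checked directly with a correlation coefficient (a quick sketch; note that the plot below trims prices above $30, so the R^2 it prints will differ slightly from the square of this value):

```r
#overall correlation between price and processing score, all stores combined
cor(groceries_clean_score$price, groceries_clean_score$processing_score,
    use = "complete.obs")
```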
ggplot(groceries_clean_score, aes(x = processing_score, y = price)) +
geom_point(aes(color = store),
alpha = 0.4) +
xlim(c(0, 1)) +
ylim(c(0, 30)) +
theme_light() +
labs(title = "Price and Processing Score have a Weak Correlation for All Products",
x = "Processing Score",
y = "Price ($)") +
theme(legend.position = "top",
legend.title = element_blank()) +
geom_smooth(method="lm", color = "black") +
scale_color_manual(values = brandcolors) +
stat_cor(aes(label = after_stat(rr.label)), label.x = 0.26, label.y = 28)
#there's just way too much data here, most stuff is cheap and highly processed
#I tried making this plot interactive too, but the regression was messing up because that layer doesn't have the same attributes as the scatterplot beneath.
I know now that some categories of food have a wide range of processing levels for sale. If I have a choice between a highly processed option and a minimally processed option of the same type of food, does the data indicate which one will be more expensive?
#I hit an issue where the R-squared values printed on the plot didn't match the R-squared values calculated here,
#because one calculation used data outside the plot ranges (i.e. points dropped by the axis limits)
#calculate R-squared for each product_type using lm, to later rearrange facets
rsq_data <- groceries_clean_score %>%
filter(product_type %in% highest_IQR$product_type) %>%
filter(!is.na(price), !is.na(processing_score)) %>% # Remove rows with NA
filter(processing_score >= 0, processing_score <= 1, # Apply xlim filter
price >= 0, price <= 30) %>% # Apply ylim filter
group_by(product_type) %>%
do({
model <- lm(price ~ processing_score, data = .)
rsq <- summary(model)$r.squared
data.frame(product_type = unique(.$product_type), r_squared = rsq)
})
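For reference, dplyr's do() is now superseded; the same per-category R^2 values could be computed with group_modify() instead (a sketch, not the approach used above; rsq_data_alt is just a placeholder name):

```r
#alternative to do(): per-category R-squared via group_modify()
rsq_data_alt <- groceries_clean_score %>%
  filter(product_type %in% highest_IQR$product_type,
         !is.na(price), !is.na(processing_score),
         between(processing_score, 0, 1),
         between(price, 0, 30)) %>%
  group_by(product_type) %>%
  group_modify(~ tibble(r_squared = summary(lm(price ~ processing_score, data = .x))$r.squared)) %>%
  ungroup()
```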
#any price correlation with the different product types that have high ranges??
groceries_clean_score %>%
filter(product_type %in% highest_IQR$product_type) %>%
left_join(rsq_data, by = "product_type") %>%
mutate(product_type = factor(product_type,
levels = rsq_data$product_type[order(-rsq_data$r_squared)])) %>%
ggplot(aes(x = processing_score, y = price)) +
geom_point(aes(color = store),
alpha = 0.4) +
xlim(c(0, 1)) +
ylim(c(0, 30)) +
theme_light() +
facet_wrap(~product_type) +
labs(title = "Correlation between Price and Processing Score for Specific Product Types \n (Arranged by Descending Correlation Strength)",
x = "Processing Score",
y = "Price ($)") +
theme(legend.position = "top",
legend.title = element_blank(),
aspect.ratio = 0.6,
axis.text.x = element_text(size = 6),
axis.text.y = element_text(size = 6)) +
scale_color_manual(values = brandcolors) +
geom_smooth(method="lm", color = "black") +
stat_cor(aes(label = after_stat(rr.label)),
label.x = 0.1, label.y = 28, size = 3)
According to this dataset, the relationship between price and processing level is only appreciable for coffee and for culinary ingredients (which include things like butter, vinegar, and oils). For the other types of food with a wide spread of processing levels available, there is a negligible difference between the price of highly processed and minimally processed options.
One logistical coding note that feels worth mentioning: in order to sort the facets in descending order of R^2 value, I had to create a new dataframe of those values, separate from the value printed on each plot. The numbers in the two weren’t matching exactly, which infuriated me to no end for a long while. I finally realized that one method was calculating R^2 based on all values in a category, while the other used only the plotted values, and I had cut off prices above $30 on the y-axis. Fixing that error felt like a huge win!
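Another way to avoid that mismatch entirely is to zoom with coord_cartesian() instead of xlim()/ylim(): scale limits drop out-of-range points before any statistics are calculated, while coord_cartesian() only crops the view, so stat_cor() and a separately fitted lm() see the same data. A minimal sketch (whether to keep the >$30 products in the fit is, of course, an analytical choice):

```r
#zoom without dropping data, so the printed R-squared uses every observation
ggplot(groceries_clean_score, aes(x = processing_score, y = price)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", color = "black") +
  stat_cor(aes(label = after_stat(rr.label))) +
  coord_cartesian(xlim = c(0, 1), ylim = c(0, 30))
```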
After this exploration of the groceries dataset, I have a few unanswered questions that I would continue to investigate with more time:
- How the FPro score compares to the older NOVA classification for categorizing food processing levels.