#libraries
library(tidyverse) #ride or die
library(kableExtra) #for tables
library(DT) #for tables
library(plotly) #for interactivity
library(ggpubr) #for regression equations
#load data
GroceryDB_full <- read_csv("GroceryDB_foods.csv")
For my final project, I selected a dataset about groceries that are sold at three large nationwide chains. I originally found it in the Data Is Plural weekly newsletter, my favorite source for esoteric datasets.
The data covers 50,000+ food products available at Walmart, Target, and Whole Foods, including nutritional information and the degree of processing for each product.
The original dataset includes 50,468 rows of unique products and 27 variable columns, not all of which I fully understand.
Variables in original data table:
`original_ID`, `name`, `store`, `harmonized single category`, `brand`, `f_FPro`, `f_FPro_P`, `f_min_FPro`, `f_std_FPro`, `f_FPro_class`, `price`, `price percal`, `package_weight`, `has10_nuts`, `is_Nuts_Converted_100g`, `Protein`, `Total Fat`, `Carbohydrate`, `Sugars, total`, `Fiber, total dietary`, `Calcium`, `Iron`, `Sodium`, `Vitamin C`, `Cholesterol`, `Fatty acids, total saturated`, `Total Vitamin A`
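Before trimming anything, a quick look at the raw table's size and column names confirms the figures above (a minimal check, assuming the CSV loaded as shown):

```r
#confirm the size and column names of the raw table
dim(GroceryDB_full)    #expect 50468 rows and 27 columns
names(GroceryDB_full)
```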
Consequently, the first step I took was to create a version of the dataset that has only the variables I think I’ll care about for my analysis, and to rename some of the columns.
#create version of data with fewer variables, rename columns
groceries <-
GroceryDB_full %>%
select(store,
brand,
product_type = `harmonized single category`,
product_name = name,
processing_score = f_FPro,
score_class = f_FPro_class,
price,
price_percal = `price percal`,
Protein,
Fat = `Total Fat`,
Carbohydrate,
Sugar = `Sugars, total`,
Sodium,
Cholesterol)
The resulting new data table still has 50,468 rows of products, but now only 14 variables.
Variables in cleaned data table:
`store`, `brand`, `product_type`, `product_name`, `processing_score`, `score_class`, `price`, `price_percal`, `Protein`, `Fat`, `Carbohydrate`, `Sugar`, `Sodium`, `Cholesterol`
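A quick glimpse() confirms the trimmed structure (a sketch; the column types shown will depend on how read_csv parsed the file):

```r
#check the trimmed table: same row count, 14 renamed columns
glimpse(groceries)
```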
The variable in this data that I’m most interested in exploring is the FPro score, which measures the degree to which a food product has been processed. FPro (or processing_score, as I renamed the variable in my cleaned dataset) is a continuous variable that ranges from 0 for raw ingredients to 1 for ultra-processed foods. The score was created by a team of researchers as an alternative to older, more manual and subjective ways of categorizing the degree of food processing, such as the NOVA classification.
> FPro captures the progressive changes in nutrient content induced by processing food. For example, FPro incrementally increases from raw onion (0.0203) to boiled onion (0.3126), fried onion (0.7779), and onion rings from frozen ingredients (0.9955). (citation)
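As a quick sanity check on that 0-to-1 range in this dataset (ignoring missing values for now):

```r
#verify that processing_score stays within the documented 0-1 range
range(groceries$processing_score, na.rm = TRUE)
```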
For my exploratory analysis, I’m interested in learning what relationships exist between the grocery store retailers and the processing scores of the food they sell.
Some of my initial questions are:

- How do the distributions of processing scores compare across the three stores?
- How does the spread of processing scores differ across product types?
- Is there any relationship between a product’s price and how processed it is?
#looks like there's a ton of NA values in processing score...
NA_scores <- sum(is.na(groceries$processing_score))
groceries_clean_score <-
groceries %>%
filter(is.finite(processing_score))
Immediately when I started poking around my data, I realized that a large number of food products in the dataset had an NA value for processing score (23,811, to be exact). I created a new subset of the data, groceries_clean_score, to use without encountering any issues when looking at this variable specifically. These NA scores represent 47.2% of the original data, which is a bummer, but 26,657 is still a large number of products to work with.
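The figures quoted above come straight from those counts:

```r
#confirm the missing-score figures quoted above
NA_scores                                      #23811 products with no score
round(NA_scores / nrow(groceries) * 100, 1)    #about 47.2% of the original data
nrow(groceries_clean_score)                    #26657 products left to work with
```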
total_store_count <-
groceries %>%
group_by(store) %>%
summarise(total_rows = n(),
NA_score = sum(is.na(processing_score)),
NA_percent = paste((round((NA_score/total_rows), digits = 3)*100), "%"))
kable(total_store_count,
caption = "Missing Processing Scores by Store",
col.names = c("Store", "Total Product Count", "Products with NA value for Processing Score", "Percent of Total Products Missing Processing Score"),
align = c("l", "c", "c", "c")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "bordered", "condensed"))
| Store | Total Product Count | Products with NA value for Processing Score | Percent of Total Products Missing Processing Score |
|---|---|---|---|
| Target | 17854 | 11351 | 63.6 % |
| Walmart | 16823 | 7524 | 44.7 % |
| WholeFoods | 15791 | 4936 | 31.3 % |
I started this analysis by looking at the distribution of processing scores across all product types for the three grocery store chains. Here is a summary of the key statistical measures:
#summary statistics for each store
#first create a new list for each
target <-
groceries_clean_score %>%
filter(store == "Target")
walmart <-
groceries_clean_score %>%
filter(store == "Walmart")
wholefoods <-
groceries_clean_score %>%
filter(store == "WholeFoods")
#compile dataframe of stats
store <- as.factor(c("Target", "Walmart", "WholeFoods"))
scoremeans <- c(mean(target$processing_score), mean(walmart$processing_score), mean(wholefoods$processing_score))
scoremedians <- c(median(target$processing_score), median(walmart$processing_score), median(wholefoods$processing_score))
scorefirstqrts <- c(quantile(target$processing_score)[["25%"]], quantile(walmart$processing_score)[["25%"]], quantile(wholefoods$processing_score)[["25%"]])
scorethirdqrts <- c(quantile(target$processing_score)[["75%"]], quantile(walmart$processing_score)[["75%"]], quantile(wholefoods$processing_score)[["75%"]])
scoreSDs <- c(sd(target$processing_score), sd(walmart$processing_score), sd(wholefoods$processing_score))
score_summarystats <- data.frame(store, scoremeans, scoreSDs, scoremedians, scorefirstqrts, scorethirdqrts)
kable(score_summarystats %>%
mutate_if(is.numeric, round, digits = 3),
caption = "Summary Statistics of Processing Scores for all Products at the Three Major Grocery Stores",
col.names = c("Store", "Mean", "Standard Deviation", "Median", "1st Quartile", "3rd Quartile")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "bordered", "condensed"))
| Store | Mean | Standard Deviation | Median | 1st Quartile | 3rd Quartile |
|---|---|---|---|---|---|
| Target | 0.801 | 0.187 | 0.859 | 0.713 | 0.944 |
| Walmart | 0.758 | 0.198 | 0.809 | 0.640 | 0.918 |
| WholeFoods | 0.658 | 0.256 | 0.721 | 0.532 | 0.859 |
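For reference, the same statistics can also be produced in a single grouped summarise rather than building each store's vector separately (a sketch; score_summary_alt is just a placeholder name):

```r
#more compact route to the same summary statistics
score_summary_alt <-
  groceries_clean_score %>%
  group_by(store) %>%
  summarise(mean   = mean(processing_score),
            sd     = sd(processing_score),
            median = median(processing_score),
            q1     = quantile(processing_score, 0.25),
            q3     = quantile(processing_score, 0.75))
```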
Viewed visually, it is easy to see that Target and Walmart tend to carry more foods at the high end of the processing scale than Whole Foods does. All three stores have a left-skewed distribution.
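One quick numeric check of that skew: in a left-skewed distribution the median sits above the mean, which holds for every store in the summary table:

```r
#median > mean for each store, consistent with a left-skewed distribution
with(score_summarystats, scoremedians > scoremeans)
```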
brandcolors <-
c(Target = "#cc0000",
Walmart = "#0071ce",
WholeFoods = "#00674b")
brandaccents <- #darkens each brand color by 50%
c(Target = "#660000",
Walmart = "#003867",
WholeFoods = "#003325")
store_hist <-
ggplot(groceries_clean_score,
aes(x = processing_score, fill = store)) +
geom_histogram(binwidth = 0.002) +
geom_vline(data = score_summarystats,
aes(xintercept = scoremeans, color = store, linetype = "Mean"),
linewidth = 0.75) +
geom_vline(data = score_summarystats,
aes(xintercept = (scoremeans + scoreSDs), color = store, linetype = "SD"),
linewidth = 0.5) +
geom_vline(data = score_summarystats,
aes(xintercept = (scoremeans - scoreSDs), color = store, linetype = "SD"),
linewidth = 0.5) +
facet_grid(rows = "store") +
labs(title = "Processing Score of All Food Products Sold at Three Grocery Store Chains",
x = "Processing Score",
y = "Count of Products") +
scale_fill_manual(values = brandcolors) + #fill with brand color
scale_color_manual(values = brandaccents) + #line accent colors
scale_linetype_manual(values = c("Mean" = "longdash", "SD" = "dashed")) + # specify line types
theme_light() +
theme(legend.position = "top",
legend.title = element_blank()) +
guides(linetype = guide_legend(title = "Line Type")) # add line types to legend
store_hist
Recall that close to half of the food products included in the original dataset had an NA value for processing score. In order to gauge how well the known scores represent all of the food products available, I calculated a 95% confidence interval for each store: the range in which we would expect the mean score for the full “population” of food products at each store to fall. The resulting confidence intervals are extremely narrow, so the mean score of the products that do have one is estimated very precisely. (Strictly speaking, a narrow interval reflects the large sample size; it cannot rule out the possibility that the products missing a score differ systematically from those with one.)
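Concretely, the interval computed below for each store is the usual t-based interval around the sample mean,

$$\bar{x} \pm t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}}$$

where $\bar{x}$, $s$, and $n$ are the mean, standard deviation, and count of scored products at that store.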
#test confidence intervals for missing scores at each store
store_95CI <-
  groceries_clean_score %>%
  group_by(store) %>%
  summarize(
    mean = mean(processing_score),
    #a two-sided 95% CI uses the 97.5% t quantile; qt(0.95, ...) would give a 90% interval
    ci_lower = mean - qt(0.975, length(processing_score) - 1) * sd(processing_score) / sqrt(length(processing_score)),
    ci_upper = mean + qt(0.975, length(processing_score) - 1) * sd(processing_score) / sqrt(length(processing_score)))
store_95CI$store <- factor(store_95CI$store, levels = unique(store_95CI$store))
ggplot(store_95CI, aes(x = mean, y = store, color = store)) +
geom_point(position = position_dodge(width = 0.5), size = 1) +
geom_errorbar(aes(xmin = ci_lower, xmax = ci_upper),
width = 0.5,
linetype = "dashed",
position = position_dodge(width = 0.5)) +
theme_light() +
xlim(c(0, 1)) +
labs(title = "95% Confidence Intervals of Mean Processing Score for All Products",
x = "Processing Score",
y = "Store") +
theme(legend.position = "top",
legend.title = element_blank()) +
annotate("label", x = store_95CI$mean - 0.1, y = store_95CI$store,
label = paste("95% CI:", sep = "\n", round(store_95CI$ci_lower, 4), round(store_95CI$ci_upper, 4)),
fill = brandcolors[store_95CI$store], color = "white",
size = 3) +
scale_y_discrete(limits = rev(levels(store_95CI$store))) +
scale_color_manual(values = brandcolors)
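As a cross-check on the summarised intervals, t.test() reports the same style of two-sided 95% interval for a single store (a quick sketch using the Whole Foods subset created earlier):

```r
#cross-check one store's interval against t.test's default 95% CI
t.test(wholefoods$processing_score)$conf.int
```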
types_list <-
groceries %>%
distinct(product_type)
Next, I wanted to dig into how this distribution varies for different product types. There are 47 product categories included in this dataset, ranging from Baby Food to Pasta Noodles to Meat and Seafood.
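That category count can be confirmed directly from the types_list table created above:

```r
#confirm the number of distinct product categories
nrow(types_list)    #expect 47
```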
Which category has the most difference in the processing level of different products available? To start, I created another datatable showing summary statistics for the various product types:
product_spreads <-
groceries_clean_score %>%
group_by(product_type) %>%
summarise(n = n(),
score_mean = mean(processing_score),
score_median = median(processing_score),
score_SD = sd(processing_score),
score_min = min(processing_score),
score_max = max(processing_score),
score_range = (score_max-score_min)) %>%
arrange(desc(score_median)) %>%
mutate(product_type = factor(product_type, levels = product_type))
datatable(product_spreads %>%
mutate_if(is.numeric, round, digits = 3),
filter = 'top',
colnames = c("Type", "Count", "Mean", "Median", "SD", "Min", "Max", "Range"),
options = list(pageLength = 5, autoWidth = TRUE,
columnDefs = list(list(className = 'dt-center', targets = 1:7))))
Visually, it becomes clear that for most types of food there is a wide range in the level of processing among the products available at the three chain stores. The ordering also makes logical sense: we would expect cookies to be more highly processed than minimally processed staples like beans or eggs.
ggplot(groceries_clean_score,
aes(x = processing_score, y = product_type)) +
geom_boxplot(fill = "snow2", staplewidth = 1, outlier.size = .7, outlier.alpha = .5) +
labs(title = "Spread of Processing Scores Across Product Types \n (Sorted From Highest to Lowest Median)",
x = "Processing Score",
y = "Product Type") +
theme_light() +
theme(legend.position = "none", axis.text.y = element_text(size = 6)) +
scale_y_discrete(limits = rev(levels(product_spreads$product_type))) # Reverse the order of the y-axis
What intrigues me most about this visualization of the data is the categories of food that run the entire spectrum from raw ingredients to highly processed. In order to dig into those categories further, I pulled a subset of the data containing the 12 product types with the largest interquartile range (IQR).
I chose the IQR as the measure of spread here (rather than the full range from maximum to minimum) because many categories have outlier products beyond their whiskers, which extend 1.5 × IQR past the quartiles.
I really appreciate being able to add an interactive element to this plot in order to scroll around and see the actual product names that underlie each point. Some of the products don’t seem to match their grouping, which makes me wonder about the original categorization method for this data: did it come from the researchers, from the stores themselves, from barcode information, or perhaps from an AI?
highest_IQR <-
groceries_clean_score %>%
group_by(product_type) %>%
summarise(n = n(),
score_median = median(processing_score),
score_IQR = IQR(processing_score),
score_whiskers = (IQR(processing_score)*1.5)) %>%
arrange(desc(score_whiskers)) %>%
slice_head(n = 12) %>%
mutate(product_type = factor(product_type, levels = product_type))
jitter <-
ggplot(groceries_clean_score %>%
filter(product_type %in% highest_IQR$product_type),
aes(x = processing_score, y = product_type, color = store, fill = store, text = paste(product_name, "\n", "Sold at", store))) +
geom_jitter(height = 0.13, alpha = 0.75, size = 1) +
labs(title = "Product Types with the Widest Distribution of Processing Score \n (Defined by Interquartile Range)",
x = "Processing Score",
y = "Product Type") +
theme_light() +
scale_color_manual(values = brandcolors) +
theme(legend.position = "top",
legend.title = element_blank()) +
scale_y_discrete(limits = rev(levels(highest_IQR$product_type)))
ggplotly(jitter, tooltip = "text") %>%
layout(title = list(x = 0.5, #fix an issue with title and legend overlapping the chart
xanchor = "center",
yanchor = "top",
y = 1.5),
margin = list(t = 100),
legend = list(x = 0.5,
xanchor = "center",
y = 0.998,
yanchor = "bottom",
orientation = "h",
title = list(text = ""),
font = list(size = 10)))
Now that I understand how the spread of processing scores differs greatly for different categories of food, I am curious if there are any correlations between price and processing level.
Looking at all products across all three grocery store chains indicates a very weak correlation between these two variables. As you might expect, most products sold at these three stores are highly processed and very cheap.
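Before plotting, the overall strength of that relationship can be checked directly with a correlation coefficient (a quick sketch; note that the plot below trims prices above $30, so the R^2 it prints will differ slightly from the square of this value):

```r
#overall correlation between price and processing score, all stores combined
cor(groceries_clean_score$price, groceries_clean_score$processing_score,
    use = "complete.obs")
```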
ggplot(groceries_clean_score, aes(x = processing_score, y = price)) +
geom_point(aes(color = store),
alpha = 0.4) +
xlim(c(0, 1)) +
ylim(c(0, 30)) +
theme_light() +
labs(title = "Price and Processing Score have a Weak Correlation for All Products",
x = "Processing Score",
y = "Price ($)") +
theme(legend.position = "top",
legend.title = element_blank()) +
geom_smooth(method="lm", color = "black") +
scale_color_manual(values = brandcolors) +
stat_cor(aes(label = after_stat(rr.label)), label.x = 0.26, label.y = 28)
#there's just way too much data here, most stuff is cheap and highly processed
#I tried making this plot interactive too, but the regression was messing up because that layer doesn't have the same attributes as the scatterplot beneath.
I know now that some categories of food have a wide range of processing levels for sale. If I have a choice between a highly processed option and a minimally processed option of the same type of food, does the data indicate which one will be more expensive?
#I hit an issue where the R-squared values printed on the plot didn't match the R-squared values calculated here,
#because one calculation used data outside the plot ranges (i.e. points dropped by the axis limits)
#calculate R-squared for each product_type using lm, to later rearrange facets
rsq_data <- groceries_clean_score %>%
filter(product_type %in% highest_IQR$product_type) %>%
filter(!is.na(price), !is.na(processing_score)) %>% # Remove rows with NA
filter(processing_score >= 0, processing_score <= 1, # Apply xlim filter
price >= 0, price <= 30) %>% # Apply ylim filter
group_by(product_type) %>%
do({
model <- lm(price ~ processing_score, data = .)
rsq <- summary(model)$r.squared
data.frame(product_type = unique(.$product_type), r_squared = rsq)
})
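For reference, dplyr's do() is now superseded; the same per-category R^2 values could be computed with group_modify() instead (a sketch, not the approach used above; rsq_data_alt is just a placeholder name):

```r
#alternative to do(): per-category R-squared via group_modify()
rsq_data_alt <- groceries_clean_score %>%
  filter(product_type %in% highest_IQR$product_type,
         !is.na(price), !is.na(processing_score),
         between(processing_score, 0, 1),
         between(price, 0, 30)) %>%
  group_by(product_type) %>%
  group_modify(~ tibble(r_squared = summary(lm(price ~ processing_score, data = .x))$r.squared)) %>%
  ungroup()
```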
#any price correlation with the different product types that have high ranges??
groceries_clean_score %>%
filter(product_type %in% highest_IQR$product_type) %>%
left_join(rsq_data, by = "product_type") %>%
mutate(product_type = factor(product_type,
levels = rsq_data$product_type[order(-rsq_data$r_squared)])) %>%
ggplot(aes(x = processing_score, y = price)) +
geom_point(aes(color = store),
alpha = 0.4) +
xlim(c(0, 1)) +
ylim(c(0, 30)) +
theme_light() +
facet_wrap(~product_type) +
labs(title = "Correlation between Price and Processing Score for Specific Product Types \n (Arranged by Descending Correlation Strength)",
x = "Processing Score",
y = "Price ($)") +
theme(legend.position = "top",
legend.title = element_blank(),
aspect.ratio = 0.6,
axis.text.x = element_text(size = 6),
axis.text.y = element_text(size = 6)) +
scale_color_manual(values = brandcolors) +
geom_smooth(method="lm", color = "black") +
stat_cor(aes(label = after_stat(rr.label)),
label.x = 0.1, label.y = 28, size = 3)
According to this dataset, the relationship between price and processing level is only appreciable for coffee and for culinary ingredients (which include things like butter, vinegar, and oils). For the other types of food with a wide spread of processing levels available, there is a negligible difference between the price of highly processed and minimally processed options.
One logistical coding note that feels worth mentioning: in order to sort the facets in descending order of R^2 value, I had to create a new dataframe of those values, separate from the value printed on each plot. The numbers in the two weren’t matching exactly, which infuriated me to no end for a long while. I finally realized that one method was calculating R^2 based on all values in a category, while the other used only the plotted values, and I had cut off prices above $30 on the y-axis. Fixing that error felt like a huge win!
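Another way to avoid that mismatch entirely is to zoom with coord_cartesian() instead of xlim()/ylim(): scale limits drop out-of-range points before any statistics are calculated, while coord_cartesian() only crops the view, so stat_cor() and a separately fitted lm() see the same data. A minimal sketch (whether to keep the >$30 products in the fit is, of course, an analytical choice):

```r
#zoom without dropping data, so the printed R-squared uses every observation
ggplot(groceries_clean_score, aes(x = processing_score, y = price)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", color = "black") +
  stat_cor(aes(label = after_stat(rr.label))) +
  coord_cartesian(xlim = c(0, 1), ylim = c(0, 30))
```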
After this exploration of the groceries dataset, I have a few unanswered questions that I would continue to investigate with more time:
- How the FPro score compares to the older NOVA classification for categorizing food processing levels.