library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(readr)

groceryfoods <- read.csv("GroceryDB_foods.csv")

Works Cited:

Direct dataset: GroceryDB_foods.csv https://github.com/Barabasi-Lab/GroceryDB/blob/main/data/GroceryDB_foods.csv

Website: https://github.com/Barabasi-Lab/GroceryDB/tree/main/data

Question: Do the average food processing scores (f_FPro) differ between products sold at Whole Foods, Walmart, and Target?

Introduction: The dataset I chose describes food products sold at three popular US grocery stores: Target, Walmart, and Whole Foods. The GroceryDB dataset has 27 variables and 50,468 cases, where each case is a food product sold at one of these stores. Products are given an f_FPro value, a food processing score that measures how processed the product is: the higher the score, the more processed the food, and the lower the score, the less processed it is. To answer my question, we need two columns from the dataset, “store” and “f_FPro”. With these two columns, I can compare the average f_FPro score across the three stores and determine whether any store’s mean differs from the others.

#Data exploration and cleaning the data

dim(groceryfoods)
## [1] 50468    27
colnames(groceryfoods)
##  [1] "original_ID"                  "name"                        
##  [3] "store"                        "harmonized.single.category"  
##  [5] "brand"                        "f_FPro"                      
##  [7] "f_FPro_P"                     "f_min_FPro"                  
##  [9] "f_std_FPro"                   "f_FPro_class"                
## [11] "price"                        "price.percal"                
## [13] "package_weight"               "has10_nuts"                  
## [15] "is_Nuts_Converted_100g"       "Protein"                     
## [17] "Total.Fat"                    "Carbohydrate"                
## [19] "Sugars..total"                "Fiber..total.dietary"        
## [21] "Calcium"                      "Iron"                        
## [23] "Sodium"                       "Vitamin.C"                   
## [25] "Cholesterol"                  "Fatty.acids..total.saturated"
## [27] "Total.Vitamin.A"
groceryfoods_2 <- groceryfoods %>%
  filter(!is.na(f_FPro)) %>%
  select(store, f_FPro)
dim(groceryfoods_2)
## [1] 26657     2
head(groceryfoods_2)
##        store    f_FPro
## 1     Target 0.8152500
## 2     Target 0.8152500
## 3     Target 0.5832194
## 4     Target 0.4510556
## 5     Target 0.7735185
## 6 WholeFoods 0.4438148

Explanation: In this first chunk, we explore and clean the data. The original “groceryfoods” data frame has 27 variables and 50,468 cases (food products sold in these grocery stores). We removed the rows where f_FPro is NA and kept only the two columns we need, “store” and “f_FPro”, naming the cleaned data frame “groceryfoods_2”. The cleaned data has 2 variables and 26,657 cases, so 23,811 products without a score were dropped. Lastly, head() previews the first six rows: five are from Target and one is from WholeFoods, and the highest f_FPro among them is about 0.815. Six rows are only a preview, so they do not tell us how the stores compare overall.
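
Since almost half of the rows are dropped, it can help to check whether the missing scores are spread evenly across stores before filtering. The sketch below shows one way to do that in base R. The `toy` data frame and its values are made up for illustration and stand in for “groceryfoods”; the names `missing_by_store` and `toy_clean` are likewise hypothetical.

```r
# Toy data standing in for groceryfoods (values are made up for illustration)
toy <- data.frame(
  store  = c("Target", "Target", "Walmart", "Walmart", "WholeFoods", "WholeFoods"),
  f_FPro = c(0.82, NA, 0.75, 0.70, NA, 0.44)
)

# Cross-tabulate store against whether the score is missing,
# so we can see how many rows each store would lose
missing_by_store <- table(toy$store, is.na(toy$f_FPro),
                          dnn = c("store", "f_FPro_missing"))
print(missing_by_store)

# Keep only rows with a score and only the two needed columns,
# mirroring filter(!is.na(f_FPro)) followed by select(store, f_FPro)
toy_clean <- toy[!is.na(toy$f_FPro), c("store", "f_FPro")]
print(dim(toy_clean))
```

If one store lost a much larger share of its products than the others, the comparison of means would describe only the scored products at that store, which is worth stating alongside the results.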

#Data summarization 


library(dplyr)

count_bystores <- groceryfoods_2 %>%
  group_by(store) %>%
  summarise(count = n(),
            mean_Fpro = mean(f_FPro, na.rm = TRUE))
print(count_bystores)
## # A tibble: 3 × 3
##   store      count mean_Fpro
##   <chr>      <int>     <dbl>
## 1 Target      6503     0.801
## 2 Walmart     9299     0.758
## 3 WholeFoods 10855     0.658

Explanation: In this chunk, we use group_by() to group all rows by “store” and then summarise() to produce a table with three columns: “store”, “count”, and “mean_Fpro”. Of the 26,657 food products, 6,503 belong to Target, 9,299 to Walmart, and 10,855 to WholeFoods. The mean() call gives the average f_FPro score for each store. Target has the highest mean processing score at 0.801, followed by Walmart at 0.758 and WholeFoods at 0.658.
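
A group mean is easier to interpret next to a measure of spread. The following is a minimal base R sketch of the same per-store summary with a standard deviation column added. The `toy` data and the `sd_FPro` column are assumptions for illustration and are not part of the original analysis.

```r
# Made-up scores for illustration only; not the GroceryDB data
toy <- data.frame(
  store  = rep(c("Target", "Walmart", "WholeFoods"), each = 3),
  f_FPro = c(0.80, 0.90, 0.70, 0.75, 0.80, 0.70, 0.60, 0.70, 0.65)
)

# tapply() applies a function to f_FPro within each store;
# groups come back in sorted order, matching sort(unique(store))
by_store <- data.frame(
  store     = sort(unique(toy$store)),
  count     = as.vector(tapply(toy$f_FPro, toy$store, length)),
  mean_FPro = as.vector(tapply(toy$f_FPro, toy$store, mean)),
  sd_FPro   = as.vector(tapply(toy$f_FPro, toy$store, sd))
)
print(by_store)
```

With dplyr, the same idea would be one extra argument inside summarise(), for example `sd_Fpro = sd(f_FPro)`.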

#Visualization with boxplot 

boxplot(f_FPro ~ store, data = groceryfoods_2,
        main = "Comparison of stores f_FPro Scores",
        xlab = "Grocery stores",
        ylab = "Food Processing Score (f_FPro)")

Explanation: In this chunk, we have our first visual, a boxplot comparing the food processing scores of the three grocery stores: Target, Walmart, and Whole Foods. Target appears to have the highest scores overall, with the highest median at about 0.90, so most of its products fall on the high end of the f_FPro scale. Every store has outliers, and they all sit at the low end of the scale. Target has many outliers, ranging roughly from 0.0 to 0.38, which indicates that Target carries a small number of minimally processed foods. Walmart also has many low-end outliers, though fewer than Target, ranging roughly from 0.0 to 0.21, so Walmart likewise carries a small number of minimally processed products. Lastly, Whole Foods has low-end outliers as well, ranging roughly from 0.0 to 0.5. This suggests that Whole Foods has somewhat more minimally processed products than the other two stores. Even so, its median is still relatively high at roughly 0.70, so the majority of products sold at Whole Foods are still on the higher end of the f_FPro scale.
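
The medians and outlier ranges above were read off the plot by eye. They can also be obtained numerically with boxplot.stats(), which returns the same five numbers and outlier points that boxplot() draws. This is a sketch on simulated data: the `toy` data frame and the beta-distribution parameters are assumptions chosen only to give skewed scores between 0 and 1.

```r
set.seed(1)
# Simulated scores for illustration only; not the GroceryDB data
toy <- data.frame(
  store  = rep(c("Target", "WholeFoods"), each = 50),
  f_FPro = c(rbeta(50, 5, 1.5), rbeta(50, 3, 2))
)

# One boxplot.stats() result per store
stats_by_store <- lapply(split(toy$f_FPro, toy$store), boxplot.stats)

for (s in names(stats_by_store)) {
  b <- stats_by_store[[s]]
  # b$stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker
  cat(s, "\n")
  cat("  median:", round(b$stats[3], 3), "\n")
  cat("  whiskers:", round(b$stats[1], 3), "to", round(b$stats[5], 3), "\n")
  # b$out holds the points drawn as outliers (beyond the whiskers)
  if (length(b$out) > 0) {
    cat("  outliers:", length(b$out), "between",
        round(min(b$out), 3), "and", round(max(b$out), 3), "\n")
  } else {
    cat("  outliers: none\n")
  }
}
```

Applied to groceryfoods_2, the same loop would replace the approximate values in the paragraph above with exact ones.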

#Visualization with histogram


hist(groceryfoods_2$f_FPro,
     main = "Distribution of f_FPro Scores",
     xlab = "Food Processing Score (f_FPro Scores)",
     ylab = "Frequency")

Explanation: The distribution of f_FPro scores is left-skewed, with no visible outliers. Most of the data sits in the higher range: judging by eye, the median falls around 0.80, and the first and third quartiles are roughly 0.60 and 0.95.

Hypothesis

\(H_0: \mu_{\text{WholeFoods}} = \mu_{\text{Walmart}} = \mu_{\text{Target}}\)

\(H_a\): At least one of the stores has a different mean (f_FPro)

# Perform ANOVA

head(groceryfoods_2)
##        store    f_FPro
## 1     Target 0.8152500
## 2     Target 0.8152500
## 3     Target 0.5832194
## 4     Target 0.4510556
## 5     Target 0.7735185
## 6 WholeFoods 0.4438148
anova_result <- aov(f_FPro ~ store, data = groceryfoods_2)

anova_result
## Call:
##    aov(formula = f_FPro ~ store, data = groceryfoods_2)
## 
## Terms:
##                     store Residuals
## Sum of Squares    95.5912 1303.0049
## Deg. of Freedom         2     26654
## 
## Residual standard error: 0.2211016
## Estimated effects may be unbalanced

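Explanation: In this chunk, we performed the ANOVA test. ANOVA splits the total variability in f_FPro into variability between the three stores and variability within them. The “store” sum of squares is 95.59 on 2 degrees of freedom, which reflects how much the stores’ average f_FPro scores vary from one another, while the residual sum of squares is 1303.00 on 26,654 degrees of freedom, which reflects the variation among products within the same store.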

#Summary of ANOVA result

summary(anova_result)
##                Df Sum Sq Mean Sq F value Pr(>F)    
## store           2   95.6   47.80   977.7 <2e-16 ***
## Residuals   26654 1303.0    0.05                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Explanation: The F value is 977.7 and the p-value is very small (<2e-16, that is, less than 0.0000000000000002). This is strong evidence against the null hypothesis, so we reject it and conclude that at least one store has a different mean f_FPro than the others.
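
To make the ANOVA table less of a black box, the F value can be rebuilt by hand from the sums of squares and degrees of freedom printed by aov(). The sketch below does that, and also computes eta squared (the share of total variation explained by store), which is a supplementary quantity not computed in the original analysis. The variable names are illustrative.

```r
# Values copied from the aov() output above
ss_store <- 95.5912    # between-store sum of squares
ss_resid <- 1303.0049  # residual (within-store) sum of squares
df_store <- 2          # number of stores minus 1
df_resid <- 26654      # number of products minus number of stores

# Mean squares are sums of squares divided by their degrees of freedom
ms_store <- ss_store / df_store
ms_resid <- ss_resid / df_resid

# F is the ratio of between-store to within-store mean squares
f_value <- ms_store / ms_resid

# Upper-tail probability of F under the null hypothesis
p_value <- pf(f_value, df_store, df_resid, lower.tail = FALSE)

# Eta squared: proportion of total variation attributable to store
eta_sq <- ss_store / (ss_store + ss_resid)

cat("F =", round(f_value, 1), "\n")
cat("eta squared =", round(eta_sq, 3), "\n")
```

This reproduces the reported F of 977.7 and gives an eta squared of about 0.068. In other words, the store differences are statistically clear with this many products, but store accounts for roughly 7% of the variation in f_FPro, with the rest being variation among products within each store.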

#Tukey's Honestly Significant Difference (HSD) test on the ANOVA model

library(tidyverse)

TukeyHSD(anova_result)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = f_FPro ~ store, data = groceryfoods_2)
## 
## $store
##                           diff         lwr         upr p adj
## Walmart-Target     -0.04273107 -0.05110781 -0.03435433     0
## WholeFoods-Target  -0.14248986 -0.15061577 -0.13436395     0
## WholeFoods-Walmart -0.09975880 -0.10708100 -0.09243660     0

Explanation: In this chunk, we ran Tukey’s Honestly Significant Difference (HSD) test, which compares the three stores against each other in pairs. The adjusted p-values are all reported as 0, meaning they are too small to show at the printed precision, and none of the 95% confidence intervals contains 0, so all three pairwise differences are statistically significant. On average, Walmart scores about 0.043 lower than Target, WholeFoods about 0.142 lower than Target, and WholeFoods about 0.100 lower than Walmart. Based on these results, Target has the highest mean f_FPro score, Walmart the second highest, and WholeFoods the lowest.
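
Two details of the Tukey output are worth seeing in code: the “diff” column is just the difference between two group means, and the “p adj” column can be printed in scientific notation so very small values are not shown as 0. This is a minimal sketch on simulated data. The `toy` data frame, its group means, and the standard deviation of 0.1 are assumptions for illustration only.

```r
set.seed(42)
# Simulated scores for illustration only; not the GroceryDB data
toy <- data.frame(
  store  = factor(rep(c("Target", "Walmart", "WholeFoods"), each = 30)),
  f_FPro = c(rnorm(30, 0.80, 0.1), rnorm(30, 0.76, 0.1), rnorm(30, 0.66, 0.1))
)

fit   <- aov(f_FPro ~ store, data = toy)
tukey <- TukeyHSD(fit)

# The result is a list with one matrix per factor; rows are store pairs
pairs_tbl <- tukey$store

# Print adjusted p-values in scientific notation instead of rounded values
print(format(pairs_tbl[, "p adj"], scientific = TRUE, digits = 3))

# Each "diff" equals the second-named store's mean subtracted from the first
group_means <- tapply(toy$f_FPro, toy$store, mean)
manual_diff <- unname(group_means["Walmart"] - group_means["Target"])
cat("Walmart-Target diff from TukeyHSD:", pairs_tbl["Walmart-Target", "diff"], "\n")
cat("Walmart-Target diff by hand:      ", manual_diff, "\n")
```

On the real data, the same extraction would be `TukeyHSD(anova_result)$store[, "p adj"]`. If the values are below what R can represent, they may still print as 0.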

Conclusion and Future Directions:

After cleaning and analyzing the dataset, creating visualizations, and performing statistical tests, I was able to answer my initial question. Yes, the average f_FPro scores do differ between products sold at Target, Walmart, and Whole Foods. The mean f_FPro score for each store was: Target (0.801), Walmart (0.758), and Whole Foods (0.658). The ANOVA and Tukey HSD results indicate that the mean food processing scores differ among all three stores. Target offers products with the highest mean f_FPro score, meaning its products are the most processed on average, and Walmart comes second. Whole Foods has the lowest mean, meaning its products are the least processed on average. Given these findings, we reject the null hypothesis that all the stores have the same mean and conclude in favor of the alternative hypothesis that at least one store has a different mean f_FPro score.

A potential next step is to examine in more depth which brands rank highest in f_FPro and which food categories rank highest in f_FPro. Another is to advocate for making less processed foods more readily available to consumers in stores.