Analyzing Misleading Food Labeling with R

Introduction

The main label on a product serves as the primary means of communication between a consumer and a producer. However, misleading food labels have become a significant concern in the food industry. They can easily influence consumers into making unhealthy food choices due to inaccurate or ambiguous information.

In this case study, I analyze a dataset obtained from the Food and Nutrient Database for Dietary Studies (FNDDS) and the Food Patterns Equivalents Database (FPED). The dataset contains nutritional data for 6455 food products. The objective is to explore the prevalence of misleading food labels and assess the variation between the labeled nutritional content and the actual values.

The Impact of Misleading Food Labels

Misleading food labels can have far-reaching consequences on consumer behavior and public health. Buzzwords like “Organic,” “Healthy,” “Unsweetened,” and others create an impression of healthiness around products, influencing consumers’ purchasing decisions. However, the reality may not match the claims made on the label.

The COM-B (Capability, Opportunity, Motivation, Behavior) model of behavior provides valuable insights into how misleading food labels influence consumers. Misleading food labels can hinder consumers from making informed and healthy food choices, even if they have the capability and motivation to do so. When the opportunity to make a healthy decision is taken away by misleading labels, consumers might end up making less healthy choices.

The issue is exacerbated by the lack of standardized rules for the main label of food products. For example, a product can be labeled as “no sugar added” while containing significant amounts of artificial sweeteners with complex chemical names. This lack of transparency puts the burden of fact-checking on the consumer, making it challenging for them to make truly informed decisions about their food choices.

Objective of the Case Study

The primary objective of this case study is to analyze the dataset obtained from the Food Surveys Research Group. Although the data was collected in 2011 and may be slightly outdated, it still provides valuable insights into the prevalence of misleading food claims. I will specifically focus on the following nutrients because the FDA (Food and Drug Administration) defines the label claims relating to them:

Protein
Fats
Calories
Cholesterol

The nutritional contents in the dataset will be compared with the FDA Regulatory Requirements for Nutrient Content Claims to identify misleading claims and assess the accuracy of the labeled values.

Addressing the Issue

While progress has been made with mobile applications that can scan product barcodes and fact-check nutritional labels, it is not enough to address the problem entirely. Food producers should be held to higher standards and be accountable for the claims they make on their products. Transparent and accurate food labeling is essential for consumers to make informed choices about their diet and overall health.

Data Acquisition and Preprocessing

Loading the Data

I will start by loading the dataset from the xlsx spreadsheet file format into R using the readxl package:

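library(readxl)   # read_excel()
library(ggplot2)  # charts used later in the analysis
library(knitr)    # kable() tables used later in the analysis
nutrients_table <- read_excel("food_label.xlsx", sheet = "Nutrients")
portion_data_table <- read_excel("food_label.xlsx", sheet = "Portion_data")
search_categories_table <- read_excel("food_label.xlsx", sheet = "Search_Categories")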

Data Preprocessing

The data will be pre-processed by merging the “category table” with the “nutrients table”. Misleading keywords will be saved in a vector, and a new column called “Misleading_Label” will be added to indicate whether a keyword is found in a product’s name.

# Remove rows with missing food IDs in the nutrients table
nutrients_table <- nutrients_table[!is.na(nutrients_table$Food_Item_ID), ]

# Replace missing values with 0 in the nutrients table
nutrients_table[is.na(nutrients_table)] <- 0

# Define misleading keywords
misleading_keywords <- c("low cholesterol","Cholesterol Free", "2% milk","(2%) milk", "no sugar", "organic", "multigrain", "free range","gluten free","with fruit","fat free","light",'unsweetened',"sugar free", "sugar-free","low calorie","low fat","lowfat","low carb",'whole grains',"fruit flavored","healthy","high protein","high fibre","With vegetables","reduced fat","reduced sugar")

# Create a new column indicating misleading labels
nutrients_table$Misleading_Label <- ifelse(
  grepl(paste(misleading_keywords, collapse = "|"), nutrients_table$foodname, ignore.case = TRUE),
  "Misleading",
  "Not Misleading"
)

# Display the frequency of each misleading label
table(nutrients_table$Misleading_Label)

## 
##     Misleading Not Misleading 
##            654           5800
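Because the keywords are joined into a single regular expression, characters such as parentheses are read as regex syntax rather than literal text, so a keyword like “(2%) milk” behaves like “2% milk”. The following is a minimal sketch of literal, case-insensitive matching, offered as an alternative under stated assumptions: the helper `flag_keywords` and the toy food names are hypothetical and are not part of the analysis above.

```r
# Hypothetical helper: TRUE where a food name contains any keyword as literal text
flag_keywords <- function(food_names, keywords) {
  lower_names <- tolower(food_names)
  # fixed = TRUE disables regex interpretation, so "(" and ")" match themselves
  hits <- lapply(tolower(keywords), function(k) grepl(k, lower_names, fixed = TRUE))
  Reduce(`|`, hits)
}

# Toy food names (made up for illustration)
toy_names <- c("Yogurt, (2%) milk, plain", "Bread, white")

# Regex matching: the parentheses form a group, so the literal "(2%) milk" is missed
regex_hits <- grepl("(2%) milk", toy_names, ignore.case = TRUE)

# Literal matching finds the first name
literal_hits <- flag_keywords(toy_names, c("(2%) milk", "sugar free"))

print(regex_hits)
print(literal_hits)
```

Switching to literal matching could change the counts reported in this case study, so it is shown here only as an option to consider.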

# Merge search_categories_table with nutrients_table to get the Category column
merged_table <- merge(nutrients_table, search_categories_table[, c("foodcode", "modcode", "Category")], by = c("foodcode", "modcode"), all.x = TRUE)

# Replace NA values in the Category column with "General"
merged_table$Category[is.na(merged_table$Category)] <- "General"
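The merge above is a left join: `all.x = TRUE` keeps every row of the nutrients table, and unmatched rows receive `NA` in the Category column before being relabeled “General”. A small sketch on made-up data illustrates this; the toy tables and their values are assumptions for illustration only.

```r
# Toy stand-ins for the nutrients and category tables (values are made up)
toy_nutrients <- data.frame(
  foodcode = c(101, 102, 103),
  modcode = c(0, 0, 0),
  foodname = c("Milk, low fat", "Bread, multigrain", "Apple, raw"),
  stringsAsFactors = FALSE
)
toy_categories <- data.frame(
  foodcode = c(101, 103),
  modcode = c(0, 0),
  Category = c("Dairy", "Fruit"),
  stringsAsFactors = FALSE
)

# Duplicate keys in the lookup table would duplicate rows after the merge
stopifnot(anyDuplicated(toy_categories[, c("foodcode", "modcode")]) == 0)

# Left join on the two key columns
toy_merged <- merge(toy_nutrients, toy_categories,
                    by = c("foodcode", "modcode"), all.x = TRUE)

# Unmatched rows get NA in Category; relabel them as "General"
toy_merged$Category[is.na(toy_merged$Category)] <- "General"

print(toy_merged)
```

Checking that the row count is unchanged after the merge is a quick way to confirm that no product was dropped or duplicated.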

A total of 654 products with misleading keywords were found in the dataset. Going forward, this analysis will be focused on those 654 food products.

Analyzing Misleading Food Labels

In this section, I will explore the prevalence of different misleading labels and their association with food categories.

Categorizing Misleading Labels

The misleading keywords will be categorized into groups of nutrient types, and sub-tables based on these categories will be created:

# Categorize the misleading keywords into groups of nutrient types
low_fat_related <- c("2% milk", "(2%) milk","light", "low fat", "lowfat", "reduced fat")
fat_free_related <- c("fat free")
cholesterol_free_related <- c("Cholesterol Free")
low_cholesterol_related <- c("low cholesterol")
low_sugar_related <- c("unsweetened", "reduced sugar")
sugar_free_related <- c("no sugar", "sugar free", "sugar-free")
calories_related <- c("low calorie")
protein_related <- c("high protein")
healthy_related <- c("organic", "multigrain", "free range", "gluten free", "with fruit", "whole grains", "fruit flavored", "healthy", "high fibre", "With vegetables")

# Function to split the merged table based on related keywords
split_table_by_keywords <- function(related_keywords) {
  # Create a regular expression pattern to match the related keywords
  pattern <- paste(related_keywords, collapse = "|")
  
  # Use grepl to filter the rows that match the pattern
  sub_table <- merged_table[grepl(pattern, merged_table$foodname, ignore.case = TRUE), ]
  
  return(sub_table)
}

# Split the merged_table into separate sub-tables
low_fat_table <- split_table_by_keywords(low_fat_related)
fat_free_table <- split_table_by_keywords(fat_free_related)
low_cholesterol_table <- split_table_by_keywords(low_cholesterol_related)
cholesterol_free_table <- split_table_by_keywords(cholesterol_free_related)
low_sugar_table <- split_table_by_keywords(low_sugar_related)
sugar_free_table <- split_table_by_keywords(sugar_free_related)
calories_table <- split_table_by_keywords(calories_related)
protein_table <- split_table_by_keywords(protein_related)
healthy_table <- split_table_by_keywords(healthy_related)
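Since each sub-table is filtered independently, a product whose name contains keywords from two groups appears in both sub-tables. The sketch below illustrates this on made-up data using a named list of keyword groups; `split_by_keywords` is a hypothetical variant of the function above that takes the table as an argument.

```r
# Toy table (food names are made up for illustration)
toy_table <- data.frame(
  foodname = c("Yogurt, low fat, sugar free", "Milk, fat free", "Bread, multigrain"),
  stringsAsFactors = FALSE
)

# A subset of the keyword groups, stored in a named list
keyword_groups <- list(
  low_fat = c("low fat", "lowfat", "reduced fat"),
  fat_free = c("fat free"),
  sugar_free = c("no sugar", "sugar free", "sugar-free")
)

# Hypothetical variant of split_table_by_keywords with the table passed in
split_by_keywords <- function(data_table, related_keywords) {
  pattern <- paste(related_keywords, collapse = "|")
  data_table[grepl(pattern, data_table$foodname, ignore.case = TRUE), , drop = FALSE]
}

# One sub-table per keyword group
toy_sub_tables <- lapply(keyword_groups, function(k) split_by_keywords(toy_table, k))

# Number of products in each group; the yogurt is counted in two groups
group_sizes <- vapply(toy_sub_tables, nrow, integer(1))
print(group_sizes)
```

Because of this overlap, the sub-table sizes are not expected to add up to the number of distinct flagged products.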

Investigating ‘Low Fat’ Label

In this section, I check whether each food item meets the FDA healthy fat standards and update the ‘Healthy’ column accordingly. The criteria for healthy fat include:

Total fat should be less than or equal to 3 g/serving and saturated fat less than or equal to 1 g/serving; certain categories are exempted from the standard.
Exempted categories include “Meat, poultry, fish, eggs, and soy” and any food item with the words “meat”, “fish”, or “poultry” in its name.
Fresh/frozen/canned/dried fruit and vegetables with no additives and water (plain and carbonated) do not have a total fat, saturated fat, or trans fat threshold.
For cheeses, milk, and dairy, the saturated fat threshold is 2 g, based on a 1% milk-fat product.

# Create a new column called 'Healthy' in both tables and initialize it with 'Not healthy'
low_fat_table$Healthy <- "Not healthy"
fat_free_table$Healthy <- "Not healthy"

# Regular expression pattern for the exempted categories
pattern <- "meat|poultry|fish|eggs|soy"

# Define the dairy pattern
dairy_pattern <- "milk|dairy|cheese"

# Flag rows whose name or category matches each pattern
is_dairy <- grepl(dairy_pattern, low_fat_table$foodname, ignore.case = TRUE) |
  grepl(dairy_pattern, low_fat_table$Category, ignore.case = TRUE)
is_exempt <- grepl(pattern, low_fat_table$foodname, ignore.case = TRUE) |
  grepl(pattern, low_fat_table$Category, ignore.case = TRUE)

# A product is marked "Healthy" if either rule holds:
#   1. dairy items with total fat <= 2 g (the 2 g threshold is applied to total fat here)
#   2. items outside the exempted categories with total fat <= 3 g and saturated fat <= 1 g
low_fat_table$Healthy[
  (low_fat_table$`_204 Total Fat (g)` <= 2 & is_dairy) |
    (low_fat_table$`_204 Total Fat (g)` <= 3 &
       low_fat_table$`_606 Fatty acids, total saturated (g)` <= 1 &
       !is_exempt)
] <- "Healthy"

# Fat-free products: total fat and saturated fat must both be at most 0.5 g
fat_free_table$Healthy[fat_free_table$`_204 Total Fat (g)` <= 0.5 &
                 fat_free_table$`_606 Fatty acids, total saturated (g)` <= 0.5] <- "Healthy"
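The criteria listed above can also be read as a three-way outcome in which exempted items are reported separately instead of being counted as “Not healthy”. The sketch below is one possible reading of those thresholds, not the method used for the figures reported in this case study; the helper `classify_low_fat` and the toy values are hypothetical.

```r
# Hypothetical helper encoding the thresholds listed above (all values in grams)
classify_low_fat <- function(total_fat, saturated_fat, is_dairy, is_exempt) {
  # Dairy uses the 2 g saturated-fat threshold; other foods use 1 g
  sat_limit <- ifelse(is_dairy, 2, 1)
  meets_limits <- total_fat <= 3 & saturated_fat <= sat_limit
  # Exempted categories are reported separately rather than judged
  ifelse(is_exempt, "Exempt", ifelse(meets_limits, "Healthy", "Not healthy"))
}

# Toy data (values are made up for illustration)
toy <- data.frame(
  foodname = c("Milk, low fat (1%)", "Cookie, reduced fat",
               "Fish sticks, light", "Crackers, low fat"),
  total_fat = c(1.0, 12.0, 9.0, 2.5),
  saturated_fat = c(0.6, 3.5, 1.5, 0.4),
  stringsAsFactors = FALSE
)

# Same name-based patterns as in the analysis above
toy_is_dairy <- grepl("milk|dairy|cheese", toy$foodname, ignore.case = TRUE)
toy_is_exempt <- grepl("meat|poultry|fish|eggs|soy", toy$foodname, ignore.case = TRUE)

toy$status <- classify_low_fat(toy$total_fat, toy$saturated_fat,
                               toy_is_dairy, toy_is_exempt)
print(toy[, c("foodname", "status")])
```

Keeping exempt items in their own group makes it easier to see how much of the “Not healthy” share comes from products the standard does not cover.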

Visualizing ‘Low Fat’ Label Healthiness

# Create a function for creating a pie chart
create_pieViz <- function(data_table, chart_title) {
  # Create a function to calculate percentage
  calculate_percentage <- function(x) {
    return(sprintf("%.1f%%", 100 * (x / sum(x))))
  }
  
  # Create the pie chart (a stacked bar drawn in polar coordinates)
  pie_chart <- ggplot(data_table, aes(x = "", fill = Healthy)) +
    geom_bar(width = 1, color = "white") +
    coord_polar(theta = "y") +
    scale_fill_manual(values = c("cyan", "lightcyan")) +
    labs(title = chart_title, x = NULL, y = NULL) +
    theme_void() +
    geom_text(stat = "count", aes(label = after_stat(calculate_percentage(count))), position = position_stack(vjust = 0.5)) +
    guides(fill = guide_legend(title = "Healthy Status", label.position = "right"))
  
  return(pie_chart)
}

# Select columns for visualization
selected_low_fat_table <- low_fat_table[, c("foodname", "Category", "Healthy")]
selected_fat_free_table <- fat_free_table[, c("foodname", "Category", "Healthy")]

Interestingly, of the 300 foods with a “low-fat” label, 63.7% do not meet the FDA standards.

# Create pie charts
create_pieViz(selected_low_fat_table, "Proportion of Foods Labeled 'Low-fat' and Their Actual Healthiness")

Also, of the 104 foods with a “fat-free” label, 57.7% do not meet the FDA standards.

create_pieViz(selected_fat_free_table, "Proportion of Foods Labeled 'Fat-free' and Their Actual Healthiness")
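The percentages shown on the pie charts can also be computed directly from the ‘Healthy’ column, which is a convenient way to check the figures quoted in the text. A minimal sketch on made-up labels (the vector `toy_status` is hypothetical):

```r
# Toy 'Healthy' column: 3 healthy and 5 not healthy products (made up)
toy_status <- c(rep("Healthy", 3), rep("Not healthy", 5))

# Counts per status, converted to percentages rounded to one decimal place
status_counts <- table(toy_status)
status_share <- round(100 * prop.table(status_counts), 1)

print(status_counts)
print(status_share)
```

Applied to a column such as `low_fat_table$Healthy`, the same two calls give the counts and shares behind each chart.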

Investigating ‘Low Cholesterol’ and ‘Cholesterol-free’ Labels

As per FDA regulations, a product can be labeled as “cholesterol-free” if it contains less than 2 mg of cholesterol per labeled serving and 2 g or less of saturated fat. Similarly, to be labeled as “low cholesterol,” the product must contain 20 mg or less of cholesterol and 2 g or less of saturated fat. Now, let’s verify if the products in the dataset meet these standards.

# Create a new column called 'Healthy' in low_cholesterol_table and cholesterol_free_table
low_cholesterol_table$Healthy <- 'Not healthy'
cholesterol_free_table$Healthy <- "Not healthy"

# Update 'Healthy' column for low cholesterol table
low_cholesterol_table$Healthy[low_cholesterol_table$`_601 Cholesterol (mg)` <= 20 & 
                                low_cholesterol_table$`_606 Fatty acids, total saturated (g)` <= 2] <- "Healthy"

# Update 'Healthy' column for cholesterol-free table
cholesterol_free_table$Healthy[cholesterol_free_table$`_601 Cholesterol (mg)` <= 2 & 
                                cholesterol_free_table$`_606 Fatty acids, total saturated (g)` <= 2] <- "Healthy"

Visualizing ‘Low Cholesterol’ and ‘Cholesterol-free’ Labels

Next, let’s visualize the proportion of foods labeled ‘Low Cholesterol’ and ‘Cholesterol-free’ and their actual healthiness:

# Select columns for visualization
selected_low_cholesterol_table <- low_cholesterol_table[, c("foodname", "Category", "Healthy")]
selected_cholesterol_free_table <- cholesterol_free_table[, c("foodname", "Category", "Healthy")]

The low-cholesterol table contains only one product. Although this product has a low cholesterol content of 15 mg, the FDA also requires a saturated fat content of 2 g or less for a low-cholesterol claim. This product, however, has 5.8 g of saturated fat.

# Create pie charts and table
kable(low_cholesterol_table[, c("foodname", "_601 Cholesterol (mg)", "_606 Fatty acids, total saturated (g)")])

	foodname	_601 Cholesterol (mg)	_606 Fatty acids, total saturated (g)
470	American or cheddar imitation cheese, low cholesterol	15	5.815

Among the ten products labeled as “cholesterol-free,” only one fails to meet the FDA standards.

create_pieViz(selected_cholesterol_free_table, "Proportion of Foods Labeled 'Cholesterol-free' and Their Actual Healthiness")

Investigating ‘Low Sugar’ and ‘Sugar-free’ Labels

To be labeled as “Sugar-Free” according to FDA standards, a product must meet the following criteria:

Less than 0.5g of sugars per Reference Amount Customarily Consumed (RACC) and per labeled serving. For meals and main dishes, the sugar content should also be less than 0.5g per labeled serving.

However, the FDA does not have a definition for “low sugar” labels. This means that producers can add this label to their product regardless of the amount of sugar or sugar-related ingredients it contains. For this analysis, I set the maximum sugar limit for “low sugar” labels at 5 g.

# Create a new column called 'Healthy' in low_sugar_table and sugar_free_table
low_sugar_table$Healthy <- 'Not healthy'
sugar_free_table$Healthy <- "Not healthy"

# Update 'Healthy' column for low sugar table
low_sugar_table$Healthy[low_sugar_table$`_269 Sugars, total (g)` <= 5] <- "Healthy"
# Update 'Healthy' column for sugar-free table
sugar_free_table$Healthy[sugar_free_table$`_269 Sugars, total (g)` <= 0.5] <- "Healthy"
table(sugar_free_table$Healthy)

## 
##     Healthy Not healthy 
##          10          20

table(low_sugar_table$Healthy)

## 
##     Healthy Not healthy 
##          24          35
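The criteria in the text are phrased as “less than 0.5 g”, while the code uses `<=`; the two differ only for a product at exactly the limit. The sketch below shows both comparisons with a small helper; `flag_by_limit`, its `strict` argument, and the toy values are hypothetical and only illustrate the distinction.

```r
# Hypothetical helper: label values against a limit, with a strict or inclusive comparison
flag_by_limit <- function(values, limit, strict = FALSE) {
  meets <- if (strict) values < limit else values <= limit
  ifelse(meets, "Healthy", "Not healthy")
}

# Toy sugar values in grams (made up for illustration)
toy_sugar <- c(0.0, 0.5, 0.7, 12.3)

# Inclusive comparison, as used in the analysis above
inclusive_flags <- flag_by_limit(toy_sugar, 0.5)

# Strict comparison, matching the "less than" wording
strict_flags <- flag_by_limit(toy_sugar, 0.5, strict = TRUE)

print(inclusive_flags)
print(strict_flags)
```

The same helper could be reused for the other single-threshold checks in this case study, such as the 5 g “low sugar” limit.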

Visualizing ‘Low Sugar’ and ‘Sugar-free’ Labels

# Select columns for visualization
selected_low_sugar_table <- low_sugar_table[, c("foodname", "Category", "Healthy")]
selected_sugar_free_table <- sugar_free_table[, c("foodname", "Category", "Healthy","_269 Sugars, total (g)")]

Of the 59 products with a “low-sugar” label, 24 met the 5 g limit used in this analysis, while 35 did not.

# Create pie charts
create_pieViz(selected_low_sugar_table, "Proportion of Foods Labeled 'Low Sugar' and Their Actual Healthiness")

Among the 30 products labeled as “sugar-free,” only 10 of them met the FDA standard for sugar content.

create_pieViz(selected_sugar_free_table, "Proportion of Foods Labeled 'Sugar-free' and Their Actual Healthiness")

Using a bar chart, I will visualize the distribution of sugar content across the 30 products.

# The distribution of sugar content across the sugar-free table

# Segment the `_269 Sugars, total (g)` column into right-closed intervals,
# so a value of exactly 0.5 g falls in the "<=0.5" segment
selected_sugar_free_table$Sugar_Segment <- cut(selected_sugar_free_table$`_269 Sugars, total (g)`,
                                              breaks = c(-Inf, 0.5, 1, 5, 10, 20, Inf),
                                              labels = c("<=0.5", "0.5-1", "1-5", "5-10", "10-20", ">20"),
                                              right = TRUE)

# Create a bar chart to show the number of products in each segment
ggplot(selected_sugar_free_table, aes(x = Sugar_Segment)) +
  geom_bar(fill = "cyan", color = "pink") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  labs(title = "Distribution of Sugar Content Among 'Sugar-free' Products",
       x = "Sugar Content (g)",
       y = "Count") +
  theme_minimal()
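The `right` argument of `cut()` controls which side of each interval is closed, and therefore which segment a value of exactly 0.5 g lands in. A minimal sketch on made-up sugar values shows the difference; the vectors below are hypothetical.

```r
# Toy sugar values in grams, including values that sit exactly on a break
toy_sugar <- c(0.2, 0.5, 1, 7.5)
sugar_breaks <- c(-Inf, 0.5, 1, 5, 10, 20, Inf)

# right = FALSE: intervals are closed on the left, so 0.5 falls in the second segment
left_closed <- cut(toy_sugar, breaks = sugar_breaks, right = FALSE)

# right = TRUE: intervals are closed on the right, so 0.5 stays in the first segment
right_closed <- cut(toy_sugar, breaks = sugar_breaks, right = TRUE)

# Compare the segment index assigned to each value
print(data.frame(sugar = toy_sugar,
                 left_closed = as.integer(left_closed),
                 right_closed = as.integer(right_closed)))
```

With a 0.5 g threshold applied as `<= 0.5`, right-closed intervals keep the first segment aligned with the products counted as meeting the standard.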

Investigating ‘Low Calories’ Labels

According to FDA standards, for a product to be labeled as “low calorie,” it must contain 40 calories or less per serving.

# Create a new column called 'Healthy' in calories_table
calories_table$Healthy <- 'Not healthy'

# Update 'Healthy' column for low calorie table
calories_table$Healthy[calories_table$`_208 Energy (kcal)` <= 40] <- "Healthy"

Visualizing ‘Low Calories’ Labels

# Select columns for visualization
selected_calories_table <- calories_table[, c("foodname", "Category", "Healthy")]

# Create pie chart
create_pieViz(selected_calories_table, "Proportion of Foods Labeled 'Low Calories' and Their Actual Healthiness")

Of the 59 products with a “low calorie” label, 31 met the FDA standard, while 28 did not.

Conclusion

In this case study, I used R to analyze data related to misleading food labels. I analyzed and visualized products carrying labels such as ‘Low-fat’, ‘Cholesterol-free’, ‘Low Sugar’, ‘Sugar-free’, and ‘Low Calories’ and compared each claim with the product’s actual nutrient content. The analysis shows that many labeled products do not meet the thresholds their labels imply.

Please note that the results and interpretations presented here are based on the provided dataset and the analysis performed up to this point. Further studies or additional data may be required to make more comprehensive conclusions.