An analysis of the previous version of the data (published in April 2021) can be found at https://rpubs.com/Em_Mari3/FDC
The U.S. Department of Agriculture’s FoodData Central (FDC) database is one of the most expansive collections of food composition data available, if not the most expansive. By examining the data quality of FDC, we can determine which food composition data have been reliably collected and can be used with confidence in research, and which information needs to be expanded or updated.
Furthermore, the database is expansive and difficult to parse. By analyzing and comparing collected variables in a detailed and readable format, we increase the potential of understanding the data and using it to further the state of nutrition research.
As of October 2021, there are two ways to download the FoodData Central database (1): as a collection of CSV files or as a collection of 4 JSON files.
The CSV files contain information on the experimental foods (which the JSON files do not), but every update to a branded food is stored under its own fdc id and therefore appears as a near-duplicate entry with slight changes. These duplicates are practically impossible to untangle with the information given. In the JSON files they are grouped together, and only the most recent update is provided. To avoid counting each duplicate as its own entry, we will use the JSON files.
The JSON files will be read into R as strings using the jsonlite package and converted into data frames for ease of use in analysis. The 4 JSON files are named as follows:
To accurately compare and contrast data within the four files and identify the quality of the data overall we will have to analyze the variables available in each file and combine them into a singular data structure.
The variables common among all 4 JSON files are:
| Variable Name | Variable Description |
|---|---|
| “foodClass” | The classes of food within the data are “Survey” for FNDDS, “Branded” for branded foods, and “FinalFood” for SR legacy and foundation foods |
| “description” | The name or description of the food such as “Milk, Whole” or “100 Grand Bar” |
| “foodNutrients” | A nested variable containing all info on the nutrient composition (per 100g) and derivation of nutrient composition for each food |
| “foodAttributes” | A nested variable left blank for SR legacy and foundation foods. For branded foods this variable contains a log of any updates made to this food (using variables “id”, “name”, “value”, “foodAttributeType.id”, “foodAttributeType.name”, and “foodAttributeType.description”). For survey foods this variable contains any attributes of the ingredients used |
| “fdcId” | A unique identifier given to each food |
| “dataType” | The dataset the food is contained in (of the 4 databases FNDDS, foundation, branded, and SR legacy) |
| “publicationDate” | The day this version of the food as it appears in the data was published to the FoodData Central website |
There are 16 variables exclusive to the branded foods:
| Variable Name | Variable Description |
|---|---|
| “modifiedDate” | The last date the food was modified by the manufacturer |
| “availableDate” | The date the food was made available to the USDA for entry in the FoodData central database |
| “marketCountry” | The country the food product was sold in, now contains 3 unique entries “United States”, “New Zealand”, and “” (i.e. no marketCountry provided) |
| “brandOwner” | The company who owns the brand that manufactured the food |
| “gtinUpc” | The gtin or UPC barcode associated with the product |
| “dataSource” | The source of the data from the following 3 options “LI” (for Label Insight), “GDSN”, and “NZGDSN” (for GS1 and New Zealand GS1) |
| “ingredients” | The list of ingredients within the product |
| “servingSize” | The serving size specified on the packaging of the food product |
| “servingSizeUnit” | The unit of the serving size if provided in g or ml |
| “labelNutrients” | The nutrient composition of the food as it is provided on the label of the food product |
| “brandedFoodCategory” | The category of the food using the branded food categories |
| “foodUpdateLog” | A nested variable containing variables “foodClass”, “description”, “foodAttributes”, “fdcId”, “dataType”, and “publicationDate” for any previous versions of the specified food |
| “brandName” | The brand name of the product such as “LINDT” or “NEWMAN’S OWN” |
| “packageWeight” | The weight of the food product including packaging |
| “householdServingFullText” | The household serving if provided on the nutrition facts panel such as “1 cup” or “2 bars” |
| “subbrandName” | The secondary brand of the product such as “CHIPS AHOY!” or “Coca-Cola” |
The files for foundation and SR foods follow the same structure. There are 7 variables exclusive to foundation and SR foods:
| Variable Name | Variable Description |
|---|---|
| “nutrientConversionFactors” | A nested list of any conversion factors used to compute the nutrient composition of the food |
| “isHistoricalReference” | A variable containing “FALSE” values for every entry, likely a variable intended for future use |
| “ndbNumber” | The unique nutrient database number given to all foundation and SR legacy foods |
| “foodCategory” | The SR food category associated with the food |
| “inputFoods” | A nested variable containing the variables “id”, “foodDescription”, “inputFood.foodClass”, “inputFood.description”, “inputFood.foodCategory.id”, “inputFood.foodCategory.code”, “inputFood.foodCategory.description”, “inputFood.fdcId”, “inputFood.dataType”, and “inputFood.publicationDate” for all sample foods used to calculate the nutrition composition of the foundation food entry. For all SR entries inputFoods is a blank list with no entries |
| “foodPortions” | A nested variable containing the variables “id”, “measureUnit.id”, “measureUnit.name”, “measureUnit.abbreviation”, “modifier”, “gramWeight”, “sequenceNumber”, and “minYearAcquired” that describe the portion of the food that was sampled for analysis. For all SR foods there is no variable specified as “minYearAcquired” |
| “scientificName” | The scientific name of each food like “Solanum lycopersicum” or “Brassica oleracea (Acephala Group)” |
There are 6 variables unique to the survey foods:
| Variable Name | Variable Description |
|---|---|
| “foodCode” | A unique code assigned to each food in FNDDS (multiple foods with different fdc ids can share a food code if they represent the same product made in different ways or by different manufacturers) |
| “startDate” | The day the survey of foods began |
| “endDate” | The day the survey of foods ended |
| “wweiaFoodCategory” | The category of the food using the What We Eat in America(WWEIA) food categories |
| “inputFoods” | A nested variable containing the variables “id”, “unit”, “portionDescription”, “portionCode”, “foodDescription”, “sequenceNumber”, “amount”, “ingredientCode”, “ingredientWeight”, and “ingredientDescription” that describe the ingredient(s) in the food that were used for analysis |
| “foodPortions” | A nested variable containing the variables “id”, “measureUnit.id”, “measureUnit.name”, “measureUnit.abbreviation”, “modifier”, “gramWeight”, “sequenceNumber”, and “portionDescription” that describe the portion of ingredients in the food that were used for analysis |
For the majority of our analysis we will focus on the variables found in all 4 JSON files. Each file was filtered to contain only the variables “description”, “foodNutrients”, “fdcId”, “dataType”, and “publicationDate”, and then unnested. The variable “foodClass” is excluded because it carries the same information as “dataType”. The nested variable “foodAttributes” is excluded because it contains no information for SR legacy and foundation foods.
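The filter-and-unnest step can be sketched as follows. This is a Python illustration rather than the R code used for the analysis; the sample record is made up, and the field names inside "foodNutrients" ("nutrient", "name", "unitName", "amount") are assumptions about the nested layout.

```python
import json

# A tiny made-up record; the nested field names inside "foodNutrients"
# are assumptions for illustration, and the amounts are placeholders.
raw = '''[{"fdcId": 1, "description": "Milk, whole", "dataType": "SR Legacy",
  "publicationDate": "4/1/2019", "foodClass": "FinalFood",
  "foodNutrients": [
    {"nutrient": {"name": "Protein", "unitName": "g"}, "amount": 3.2},
    {"nutrient": {"name": "Calcium, Ca", "unitName": "mg"}, "amount": 113}
  ]}]'''

# Top-level variables to keep; "foodClass" and "foodAttributes" are dropped.
KEEP = ("fdcId", "description", "dataType", "publicationDate")

def unnest(foods):
    """Return one flat row per (food, nutrient) pair."""
    rows = []
    for food in foods:
        base = {key: food.get(key) for key in KEEP}
        for entry in food.get("foodNutrients", []):
            nutrient = entry.get("nutrient", {})
            rows.append({**base,
                         "nutrient_name": nutrient.get("name"),
                         "unit_name": nutrient.get("unitName"),
                         "amount": entry.get("amount")})
    return rows

rows = unnest(json.loads(raw))
print(rows)
```

A food with an empty "foodNutrients" list produces no rows in this sketch, so foods without nutrient information would have to be counted separately.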
For our initial analysis of the overall data we will be using the following variables:
| Variable Name | Variable Description |
|---|---|
| fdc_id | A unique number for each food in the Food Data Central Database |
| nutrient_id | A unique number given to each nutrient |
| amount | The amount of each nutrient per 100g of the listed food |
| data_points | How many data points they used to derive the nutrition value |
| min, max, median | Minimum, maximum, and median value of nutrition content within sample provided in three separate columns |
| data_type | The type of data based on how it was acquired |
| food_description | The name of the food or a brief description of the food such as “milk, whole” |
| publication_date | The date the food was published to the FoodData Central website |
| nutrient_name | The name of the nutrient |
| unit_name | The unit of each nutrient (g, mg, mcg, IU, etc..) |
| derivation_description | How the food was analyzed for nutrition content |
| source_id | A unique number given to each source of nutrient composition |
The table below gives the number of nutrient entries and food entries in each data type.
Table 1: Number of Entries per Data Type

| Data Type | nutrient_entries | food_entries |
|---|---|---|
| Branded | 5137893 | 373242 |
| Foundation | 10023 | 159 |
| SR Legacy | 644125 | 7793 |
| Survey (FNDDS) | 460395 | 7083 |
| Total | 6252436 | 388277 |
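Counts of this kind come from a long table with one row per food and nutrient. A minimal Python sketch, using made-up rows and hypothetical ids, shows the difference between counting nutrient entries and counting distinct foods:

```python
from collections import Counter

# Made-up long-format rows: (data_type, fdc_id, nutrient_id).
rows = [
    ("Branded", 101, 1003), ("Branded", 101, 1004), ("Branded", 102, 1003),
    ("Foundation", 201, 1003), ("Foundation", 201, 1004), ("Foundation", 201, 1005),
]

# Every row is one nutrient entry.
nutrient_entries = Counter(data_type for data_type, _, _ in rows)

# A food is counted once, however many nutrient rows it has.
food_entries = Counter(data_type for data_type, _ in {(d, f) for d, f, _ in rows})

print(nutrient_entries, food_entries)
```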
Key notes/important takeaways:
In this section we will look at the range of measurements with information regarding precision.
Below is a table of the units of measurement used for each nutrient per data type.
There are 259 unique nutrient names, but not 259 unique nutrients. Multiple versions of one nutrient are often present in the data such as “Total dietary fiber (AOAC 2011.25)” and “Fiber, total dietary” or “Vitamin A, IU” and “Vitamin A, RAE”. Because of the way the nutrients were recorded, for each unit used to record a nutrient there is a unique nutrient name.
Key notes/important takeaways:
Below is a table of the frequency of derivation descriptions by data type.
Table 3: Source per Data Type

| Derivation Description | SR Legacy | Foundation | Branded | Survey (FNDDS) |
|---|---|---|---|---|
| Aggregated data involving comb. of codes other then 1,12 or6 | 817 | — | — | — |
| Aggregated data involving combinations of source codes 1, 6, 12 and/or 13 | 7261 | — | — | — |
| Analytical data from the literature, partial documentation | 1075 | — | — | — |
| Analytical or derived from analytical | 208534 | 9249 | — | — |
| Assumed zero | 57711 | — | — | — |
| Calculated by manufacturer, not adjusted or rounded for NLEA | 10607 | — | — | — |
| Calculated from nutrient label by NDL | 5080 | — | — | — |
| Calculated or imputed | 166646 | 774 | — | — |
| Manufacturer's analytical; partial documentation | 4513 | — | 5137892 | — |
| Value manufacturer based label claim for added nutrients | 138 | — | — | — |
| NA | 181743 | — | 1 | 460395 |
note: NA = Not Available, unknown, or missing
It looks as though the missing data is mostly coming from SR and FNDDS. There are other files that can be downloaded to determine more about the derivation of the FNDDS foods. However, from the documentation of the FNDDS it is noted that all FNDDS nutrition values are taken from a combination of other foods in FDC. We will explore the breakdown of what data FNDDS is derived from further in a later section.
Exactly one nutrient entry in branded is missing derivation source information. The food is “156 count, 3.0 oz, Guttenplan’s Frozen Dough” (fdc id 1849609) and the nutrient is “Fatty acids, total trans”; this is likely a clerical error.
Key notes/important takeaways:
There are many different variables formatted as dates within the FDC data. To get an accurate picture of the age of the measurements, we will have to look at several of them.
The one date variable provided for all values is the publication date which represents when each food/nutrient was uploaded to FoodData Central. Below is a table summarizing this variable.
Table 4: Publication Date

| Data Type | n | min | median | max |
|---|---|---|---|---|
| Branded | 5137893 | 1/28/2021 | 3/19/2021 | 9/29/2020 |
| Foundation | 10023 | 10/28/2021 | 12/16/2019 | 4/28/2021 |
| SR Legacy | 644125 | 4/1/2019 | 4/1/2019 | 4/1/2019 |
| Survey (FNDDS) | 460395 | 10/30/2020 | 10/30/2020 | 10/30/2020 |
Because FDC has existed for only about 2 years, all publication dates fall between 2019 and 2021, so this variable tells us little about the age of the underlying measurements.
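When building a date summary like this, it matters that the dates are parsed before they are ordered: strings in M/D/YYYY form do not sort chronologically. A small Python sketch with made-up dates (the helper name `summarize_dates` is hypothetical, and the median here is simply the lower middle value):

```python
from datetime import datetime

def summarize_dates(values, fmt="%m/%d/%Y"):
    """Return (min, median, max) of date strings; median is the lower middle value."""
    dates = sorted(datetime.strptime(v, fmt).date() for v in values)
    return dates[0], dates[(len(dates) - 1) // 2], dates[-1]

# Made-up publication dates.
values = ["4/1/2019", "10/30/2020", "12/16/2019"]
lo, mid, hi = summarize_dates(values)
print(lo, mid, hi)

# Sorting the raw strings instead puts the 2020 date first.
print(sorted(values)[0])
```

After parsing, the minimum can never be later than the median or the maximum, which gives a quick sanity check on any summary table of dates.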
Figure 1: Publication Dates
Since the last update in April 2021, a fair number of entries have been added to Foundation and Branded.
The date variable associated with the foundation foods is “minYearAcquired”, which tells us when foods in foundation were purchased or procured for analysis. This date represents the oldest sample for each nutrient entry of each food. Let’s look at the distribution of min year acquired in Foundation.
Table 5: Minimum Year Acquired

| minYearAcquired | n |
|---|---|
| 2000 | 9 |
| 2001 | 22 |
| 2003 | 2 |
| 2006 | 2 |
| 2008 | 2 |
| 2009 | 7 |
| 2010 | 3 |
| 2011 | 10 |
| 2012 | 4 |
| 2013 | 9 |
| 2014 | 3 |
| 2015 | 9 |
| 2016 | 15 |
| 2017 | 13 |
| 2018 | 6 |
| 2019 | 11 |
| total | 127 |
Despite there being 159 foundation foods, “minYearAcquired” is only specified for 127 samples.
Figure 2: Minimum Year Acquired
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 2003 2013 2011 2016 2019
On average these samples were acquired in 2011, but the median “minYearAcquired” is 2013. The distribution is being skewed by the amount of data with a “minYearAcquired” of 2001. There seems to be a general upward trend since 2008, with evident drops in input in years 2010, 2012, 2014, and 2018.
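The summary statistics above can be reproduced from the counts in Table 5. The following Python sketch expands the frequency table back into individual values; the "inclusive" quartile method is chosen on the assumption that it matches the default quantile definition behind the R output.

```python
from statistics import mean, median, quantiles

# Counts per year, copied from Table 5.
counts = {2000: 9, 2001: 22, 2003: 2, 2006: 2, 2008: 2, 2009: 7, 2010: 3, 2011: 10,
          2012: 4, 2013: 9, 2014: 3, 2015: 9, 2016: 15, 2017: 13, 2018: 6, 2019: 11}

# Expand the frequency table into one value per sample.
years = [year for year, n in sorted(counts.items()) for _ in range(n)]

# Quartiles with linear interpolation between the closest ranks.
q1, q2, q3 = quantiles(years, n=4, method="inclusive")

print(len(years), min(years), q1, median(years), round(mean(years), 2), q3, max(years))
```

The mean falls just below 2011 while the median is 2013, which is consistent with the pull of the 22 samples from 2001 described above.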
For FNDDS we have to look at both the start date and the end date of each sample. Every entry has the same values: all samples started on “2017-01-01” and ended on “2018-12-31”. These dates tell us very little, because all FNDDS nutrient calculations are based on nutrition information in SR and Foundation.
For branded foods we’ll look at both “modifiedDate”, the last date the food was altered by the manufacturer, and “availableDate”, the date the food was made available for inclusion in the database.
For modified date we have:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2013-06-05" "2018-02-17" "2019-04-17" "2019-05-29" "2020-09-21" "2021-09-29"
## NA's
## "7"
For available date we have:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2013-06-05" "2018-02-17" "2019-04-17" "2019-05-29" "2020-09-21" "2021-09-29"
## NA's
## "7"
Figure 3: Branded Foods Dates
This tells us more about how long food products can sit around before being sold than about when the nutrition information was gathered. You’ll notice that these two sets of summary statistics are identical. They should not be, and they were not in previous versions of the data. To see what these variables looked like previously, visit the analysis of the April 2021 release (https://rpubs.com/Em_Mari3/FDC).
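Identical summaries do not by themselves prove that the two columns match row by row, so a direct comparison is a useful follow-up check. A minimal Python sketch on made-up records:

```python
# Made-up records using the two branded date fields.
foods = [
    {"fdcId": 1, "modifiedDate": "2018-02-17", "availableDate": "2018-02-17"},
    {"fdcId": 2, "modifiedDate": "2019-04-17", "availableDate": "2019-05-01"},
    {"fdcId": 3, "modifiedDate": None, "availableDate": None},
]

# Foods whose two dates differ (two missing values count as equal here).
differing = [f["fdcId"] for f in foods if f["modifiedDate"] != f["availableDate"]]
identical_share = 1 - len(differing) / len(foods)

print(differing, round(identical_share, 2))
```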
Key notes/important takeaways:
Below you’ll find a table of counts describing how many nutrients were recorded per food in each data type. “Total” is the total number of nutrient entries in that data type, “Average” is the average number of nutrient entries per food, and “Min” and “Max” are the minimum and maximum number of nutrient entries associated with a single food. Foods with no nutrient information at all (primarily the 681 foods in Branded, discussed further in the missing data section) are excluded from the minimum, since we are focusing on the nutrient information that was provided. For example, for Foundation the most nutrients reported for a single food was 159 and the fewest was 13; if we picked a Foundation food at random, we would expect to know the values of about 63 nutrients for that food.
Table 6: Nutrients Listed per Food

| Data Type | Total | Average | Min | Max |
|---|---|---|---|---|
| Branded | 5137893 | 13.77 | 1 | 48 |
| Foundation | 10023 | 63.04 | 13 | 159 |
| SR Legacy | 644125 | 82.65 | 8 | 138 |
| Survey (FNDDS) | 460395 | 65.00 | 65 | 65 |
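The "Average" column is the number of nutrient entries divided by the number of foods with at least one nutrient entry. As a quick check in Python, using the counts from Table 1:

```python
# (nutrient_entries, food_entries) per data type, copied from Table 1.
table1 = {
    "Branded": (5137893, 373242),
    "Foundation": (10023, 159),
    "SR Legacy": (644125, 7793),
    "Survey (FNDDS)": (460395, 7083),
}

# Average number of nutrient entries per food, rounded as in Table 6.
averages = {name: round(nutrients / foods, 2) for name, (nutrients, foods) in table1.items()}
print(averages)
```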
Key notes/important takeaways:
Even though not all measurements are technically nutrients, I will refer to them all as nutrients because the column of component names is labeled “nutrient_name.” Below is a plot of the 15 most used nutrient names overall. As you can see, the sheer number of entries in branded gives it a lot of bearing on the overall frequencies.
Figure 4: 15 most Frequently Reported Nutrient Names Overall
Below are 6 plots. The three plots on the left depict the 15 most used nutrient names in each data type; the three plots on the right depict the frequencies of those same 15 nutrient names across all data types. We skip FNDDS here because it has exactly the same frequency for each of the 65 nutrients it uses. Nutrient names in the plots are ordered from left to right by their frequency in the specified database.
Figure 5: 15 most Frequently Reported Nutrient Names per Data Type
Key notes/important takeaways:
Below is a table of missing (“NA”) values for each applicable variable. Variables not shown in the table had no missing values.
Table 7: Count of Missing Values per Data Type

| Data Type | nutrient_rank | derivation_code | derivation_description | source_id | source_code | source_description | data_points | median | max | min |
|---|---|---|---|---|---|---|---|---|---|---|
| Branded | 1166 | 1 | 1 | 1 | 1 | 1 | 5137893 | 5137893 | 5137893 | 5137893 |
| Foundation | 0 | 0 | 0 | 0 | 0 | 0 | 1998 | 1876 | 2274 | 2274 |
| SR Legacy | 0 | 181743 | 181743 | 181743 | 181743 | 181743 | 0 | 644125 | 547232 | 547234 |
| Survey (FNDDS) | 0 | 460395 | 460395 | 460395 | 460395 | 460395 | 460395 | 460395 | 460395 | 460395 |
| total | 1,166 | 642,139 | 642,139 | 642,139 | 642,139 | 642,139 | 5,600,286 | 6,244,289 | 6,147,794 | 6,147,796 |
Expressed as a proportion of the nutrient entries in each data type (so 1.000 means every entry is missing), we have:
Table 8: Percentage of Missing Values per Data Type

| Data Type | nutrient_rank | derivation_code | derivation_description | source_id | source_code | source_description | data_points | median | max | min |
|---|---|---|---|---|---|---|---|---|---|---|
| Branded | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Foundation | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.199 | 0.187 | 0.227 | 0.227 |
| SR Legacy | 0 | 0.282 | 0.282 | 0.282 | 0.282 | 0.282 | 0.000 | 1.000 | 0.850 | 0.850 |
| Survey (FNDDS) | 0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
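Each value in Table 8 is the corresponding count in Table 7 divided by the number of nutrient entries in that data type. A short Python check for the "min" column:

```python
# Nutrient entries per data type (Table 1) and missing "min" counts (Table 7).
totals = {"Branded": 5137893, "Foundation": 10023,
          "SR Legacy": 644125, "Survey (FNDDS)": 460395}
missing_min = {"Branded": 5137893, "Foundation": 2274,
               "SR Legacy": 547234, "Survey (FNDDS)": 460395}

# Proportion of entries with no "min" value, rounded to three decimals.
proportion_missing = {name: round(missing_min[name] / totals[name], 3) for name in totals}
print(proportion_missing)
```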
You’ll notice that the nutrients that were used most have the most missing information. Further analysis must be done to draw any true conclusions from this data due to the large number of nutrients we have information on. For now we will have to look at the Essential Nutrients and work from there to draw conclusions.
Key notes/important takeaways:
There are certain nutrients that are essential to maintaining life. For the data in FDC to be complete, each food item would need an entry for each of these nutrients; this is not the case.
Note: In all tables below, data types are excluded if they contain no relevant entries. A table entry of “-” means that there were no entries for the listed nutrient in that category.
All percentage values displayed in plots are rounded to two decimal places; all percentage values displayed in tables are rounded to four decimal places.
Below is a table of the frequency of occurrences of each essential vitamin per data type:
| Table 9: Essential Vitamins per Data Type | |
|---|---|
| Count | Percentage |
Below you’ll find two plots. The first shows the count data as it appears above. The second shows the proportion of foods in each data type that have an entry for each of the essential vitamins listed above; the number above each bar is that proportion. For example, Foundation has a value of 0.65 for thiamin, meaning that 65% of foods in the Foundation data contain information on thiamin content. All following sections within “Frequency of Essential Nutrients” have plots that can be interpreted in the same manner.
Figure 6: Count of Essential Vitamins
Figure 7: Percentage of Foods Containing Essential Vitamins per Data Type
Since FNDDS entries all contain information on the same 65 nutrients, the percentage of FNDDS foods that contain information on a given nutrient will always be either 1 or 0.
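The proportion shown above each bar can be sketched as the number of foods with an entry for a nutrient divided by the number of foods in the data type. The Python example below uses a made-up mapping, and the helper name `share_reporting` is hypothetical.

```python
# Made-up mapping: fdc_id -> set of nutrient names reported for that food.
nutrients_by_food = {
    1: {"Thiamin", "Riboflavin"},
    2: {"Thiamin"},
    3: {"Vitamin C, total ascorbic acid"},
    4: {"Thiamin", "Niacin"},
}

def share_reporting(nutrient):
    """Proportion of foods that have an entry for the given nutrient name."""
    hits = sum(nutrient in names for names in nutrients_by_food.values())
    return round(hits / len(nutrients_by_food), 2)

print(share_reporting("Thiamin"), share_reporting("Niacin"))
```

When every food reports the same set of nutrients, as in FNDDS, this function can only return 1.0 or 0.0.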
In this case, multiple forms exist for a few of the essential vitamins. Below you will see a breakdown of the different forms of these entries. Each tab contains a separate table with the counts of each type of entry associated with that vitamin.
| Table 10: A Vitamins per Data Type | |
|---|---|
| Count | Percentage |
| Table 11: B9 Vitamins per Data Type | |
|---|---|
| Count | Percentage |
| Table 12: D Vitamins per Data Type | |
|---|---|
| Count | Percentage |
| Table 13: E Vitamins per Data Type | |
|---|---|
| Count | Percentage |
| Table 14: K Vitamins per Data Type | |
|---|---|
| Count | Percentage |
| Table 15: Choline per Data Type | |
|---|---|
| Count | Percentage |
Key notes/important takeaways:
Below, you’ll find a table with three sections, the first describing the frequency of “min” values for each data type, the second containing frequencies of “max” values, and the third containing frequencies of “median” values for each of the listed nutrients per data type.
| Table 16: Count of Essential Vitamins | ||
|---|---|---|
| min | max | median |
| Table 17: Percentage of Entries with Essential Vitamins | ||
|---|---|---|
| min | max | median |
Key notes/important takeaways:
Below you’ll find a table displaying the number of essential mineral entries that exist in each data type.
| Table 18: Essential Minerals per Data Type | |
|---|---|
| Count | Percentage |
Figure 8: Count of Essential Minerals
Figure 9: Percentage of Foods Containing Essential Minerals
Key notes/important takeaways:
Below, you’ll find a table with three sections, the first describing the frequency of “min” values for each data type, the second containing frequencies of “max” values, and the third containing frequencies of “median” values for each of the listed nutrients per data type.
| Table 19: Count of Entries with Essential Minerals | ||
|---|---|---|
| min | max | median |
| Table 20: Percentage of Entries with Essential Minerals | ||
|---|---|---|
| min | max | median |
Key notes/important takeaways:
Below you’ll find a table of the number of essential amino acid entries that exist for each data type.
| Table 21: Amino Acids per Data Type | |
|---|---|
| Count | Percentage |
Figure 10: Count of Essential Amino Acids
Figure 11: Percentage of Foods Containing Essential Amino Acids
Key notes/important takeaways:
Below, you’ll find a table with three sections, the first describing the frequency of “min” values for each data type, the second containing frequencies of “max” values, and the third containing frequencies of “median” values for each of the listed nutrients per data type.
| Table 22: Count of Entries with Amino Acids | ||
|---|---|---|
| min | max | median |
| Table 23: Percentage of Entries with Amino Acids | ||
|---|---|---|
| min | max | median |
| Table 24: Omega 3 Fatty Acids per Data Type | |
|---|---|
| Count | Percentage |
Figure 12: Count and Percentage of Foods Containing Omega 3 Fatty Acids
Below, you’ll find a table with three sections, the first describing the frequency of “min” values for each data type, the second containing frequencies of “max” values, and the third containing frequencies of “median” values for each of the listed nutrients per data type.
| Table 25: Count of Entries with Omega 3 Fatty Acids | ||
|---|---|---|
| min | max | median |
| Table 26: Percentage of Entries with Omega 3 Fatty Acids | ||
|---|---|---|
| min | max | median |
| Table 27: Omega 6 Fatty Acids per Data Type | |
|---|---|
| Count | Percentage |
Figure 13: Count and Percentage of Foods Containing Omega 6 Fatty Acids
Below, you’ll find a table with three sections, the first describing the frequency of “min” values for each data type, the second containing frequencies of “max” values, and the third containing frequencies of “median” values for each of the listed nutrients per data type.
| Table 28: Count of Entries with Omega 6 Fatty Acids | ||
|---|---|---|
| min | max | median |
| Table 29: Percentage of Entries with Omega 6 Fatty Acids | ||
|---|---|---|
| min | max | median |
| Table 30: Fatty Acid Entries per Data Type | |
|---|---|
| Count | Percentage |
Figure 14: Count and Percentage of Foods Containing Fatty Acids
Key notes/important takeaways:
Below, you’ll find a table with three sections, the first describing the frequency of “min” values for each data type, the second containing frequencies of “max” values, and the third containing frequencies of “median” values for each of the listed nutrients per data type.
| Table 31: Count of Entries with Fatty Acids | ||
|---|---|---|
| min | max | median |
| Table 32: Percentage of Entries with Fatty Acids | ||
|---|---|---|
| min | max | median |
| Table 33: Sugar Entries per Data Type | |
|---|---|
| Count | Percentage |
Figure 15: Count and Percentage of Foods Containing Sugars
Key notes/important takeaways:
Below, you’ll find a table with three sections, the first describing the frequency of “min” values for each data type, the second containing frequencies of “max” values, and the third containing frequencies of “median” values for each of the listed nutrients per data type.
| Table 34: Count of Entries with Sugars | ||
|---|---|---|
| min | max | median |
| Table 35: Percentage of Entries with Sugars | ||
|---|---|---|
| min | max | median |
Below is the number of entries for total fiber in each data type.
| Table 36: Fiber Entries per Data Type | |
|---|---|
| Count | Percentage |
Figure 16: Count and Percentage of Foods Containing Fiber
Key notes/important takeaways:
Below, you’ll find a table with three sections, the first describing the frequency of “min” values for each data type, the second containing frequencies of “max” values, and the third containing frequencies of “median” values for each of the listed nutrients per data type.
| Table 37: Count of Entries with Fiber | ||
|---|---|---|
| min | max | median |
| Table 38: Percentage of Entries with Fiber | ||
|---|---|---|
| min | max | median |
| Table 39: Carbohydrate Entries per Data Type | |
|---|---|
| Count | Percentage |
Figure 17: Count and Percentage of Foods Containing Carbohydrates
Key notes/important takeaways:
Below, you’ll find a table with two sections, the first describing the frequency of “min” values and the second containing frequencies of “max” values for each of the listed nutrients per data type.
| Table 40: Count of Entries with Carbohydrates | |
|---|---|
| min | max |
| Table 41: Percentage of Entries with Carbohydrates | |
|---|---|
| min | max |
| Table 42: Carotenoid Entries per Data Type | |
|---|---|
| Count | Percentage |
Figure 18: Count of Foods Containing Carotenoids
Figure 19: Percentage of Foods Containing Carotenoids
Key notes/important takeaways:
Below, you’ll find a table with three sections, the first describing the frequency of “min” values for each data type, the second containing frequencies of “max” values, and the third containing frequencies of “median” values for each of the listed nutrients per data type.
| Table 43: Count of Entries with Carotenoids | ||
|---|---|---|
| min | max | median |
| Table 44: Percentage of Entries with Carotenoids | ||
|---|---|---|
| min | max | median |
| Table 45: Phytosterol Entries per Data Type | |
|---|---|
| Count | Percentage |
Figure 20: Count of Foods Containing Phytosterols
Figure 21: Percentage of Foods Containing Phytosterols
Key notes/important takeaways:
Below, you’ll find a table with three sections, the first describing the frequency of “min” values for each data type, the second containing frequencies of “max” values, and the third containing frequencies of “median” values for each of the listed nutrients per data type.
| Table 46: Count of Entries with Phytosterols | ||
|---|---|---|
| min | max | median |
| Table 47: Percentage of Entries with Phytosterols | ||
|---|---|---|
| min | max | median |
As described previously, there are variables unique to each data type. Here we will explore some of the information only available for certain data types. All data types also have unique variables regarding food groupings, these will be investigated in a further section.
There are 4 variables in branded that could benefit from further exploration: “brandOwner”, “dataSource”, “brandName”, and “ingredients”.
There are 28901 unique brand names and 20538 unique brand owners in the branded foods dataset. However, although each of these strings is unique, multiple names often seem to refer to the same brand. For instance, “ANNIE’S”, “ANNIES”, and “Annie’s” all refer to the same brand. The same problem is evident in the brand owners: “ANETO NATURAL S. L. U. POLIGONO INDUSTRIAL SANTA MARIA”, “ANETO NATURAL S.L.U. POLIGONO INDUSTRIAL SANTA MARIA”, “ANETO NATURAL, S.L.U.”, and “ANETO NATURAL S.L.U.” are all recorded as different brand owners. There are no instances of both brand owner and brand name being left blank, but there are 953 food entries where brand owner is left blank and 3038 food entries where brand name is left blank.
Because of this, there is no reliable way to know how many distinct brands or brand owners we actually have information on.
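A rough heuristic can collapse some of these spelling variants, although it cannot resolve all of them. The Python sketch below uppercases each name and strips punctuation; the function `normalize` is a hypothetical helper for illustration, not something the analysis applied.

```python
import re
from collections import defaultdict

def normalize(name):
    """Uppercase and drop everything except ASCII letters, digits, and single spaces."""
    cleaned = re.sub(r"[^A-Z0-9 ]", "", name.upper())
    return re.sub(r"\s+", " ", cleaned).strip()

# Spellings quoted in the text above.
names = ["ANNIE’S", "ANNIES", "Annie’s"]

# Group the original spellings under their normalized form.
groups = defaultdict(list)
for name in names:
    groups[normalize(name)].append(name)

print(dict(groups))
```

This merges the three spellings of Annie’s and also maps “ANETO NATURAL, S.L.U.” and “ANETO NATURAL S.L.U.” to the same string, but “ANETO NATURAL S. L. U.” still normalizes differently because of the spaces between the initials, and accented letters are dropped entirely. Any count of distinct brands obtained this way remains an estimate.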
There are 3 data sources: “GDSN” (Global Data Synchronization Network), “NZGDSN” (New Zealand Global Data Synchronization Network), and “LI” (Label Insight).
Table 48: Data Sources of Branded Foods

| Data Source | Total Foods | Total Nutrients | Average Nutrients per Food |
|---|---|---|---|
| GDSN | 13514 | 217456 | 16.17856 |
| LI | 359821 | 4915972 | 13.68441 |
| NZGDSN | 562 | 4465 | 7.94484 |
A majority of the information has been collected through Label Insight. As the newest source (added only in this update), NZGDSN is the least utilized.
Figure 22: Character Length of Ingredients by Data Source
Figure 23: Number of Words in Ingredient Statement by Data Source
Table 49: Length of Ingredient Statements

| Variable | Overall, N = 373,897¹ | GDSN, N = 13,514¹ | LI, N = 359,821¹ | NZGDSN, N = 562¹ | p-value² |
|---|---|---|---|---|---|
| Characters per Ingredient Statement | 198 (90, 371) | 293 (138, 483) | 195 (89, 366) | 130 (38, 260) | <0.001 |
| Words per Ingredient Statement | 27 (12, 50) | 40 (19, 66) | 27 (12, 50) | 17 (5, 34) | <0.001 |

¹ Median (IQR)

² Kruskal-Wallis rank sum test
The length of the ingredient statement varies widely between the data sources; to the extent that longer ingredient lists reflect more heavily processed foods, so does the degree of processing. With a p-value below 0.001 from the Kruskal-Wallis test, the differences in statement length between data sources are statistically significant.
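The two length measures and their median (IQR) summary can be sketched in Python as follows. The ingredient statements are made up, words are split on whitespace (an assumption about how words were counted), and the helper `median_iqr` is hypothetical.

```python
from statistics import median, quantiles

# Made-up ingredient statements.
statements = [
    "WATER, SUGAR",
    "WHEAT FLOUR, WATER, YEAST, SALT",
    "MILK",
    "CORN, SUNFLOWER OIL, SALT",
]

def median_iqr(values):
    """Return (median, Q1, Q3) using the inclusive quartile method."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q1, q3

chars = [len(s) for s in statements]          # characters per statement
words = [len(s.split()) for s in statements]  # whitespace-separated words

print(median_iqr(chars), median_iqr(words))
```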
There are a lot of variables stored within the data for FNDDS that contain duplicate information. For unique variables related to this data type we will go over the nested variables “foodAttributes”, “inputFoods”, and “foodPortions”.
The nested variable “foodAttributes” contains 5 variables: “id”, “name”, “value”, “foodAttributeType”, and “rank”. Of these, “foodAttributeType” is nested further and contains 3 variables: “id”, “name”, and “description”.
For every food in FNDDS, “foodAttributes” contains at least 2 entries, one is the WWEIA Category number and the other is the WWEIA Category description. This information appears twice for every food, once in “foodAttributes” and once in “wweiaFoodCategory”.
Foods having more than 2 entries in “foodAttributes” provide further information about the source of the food or food ingredients in the form of small additional description notes. These notes vary wildly in what information they provide, some provide brand names such as “McDonald’s”, others provide information on contents such as “leche fresca”.
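The structure described above can be explored with a sketch like the following, which parses a small hand-written JSON string with jsonlite. The records, ids, and the attribute name "Additional Description" are invented for illustration and are assumptions, not values copied from the FNDDS file; only a subset of the 5 variables is included to keep the example short.

```r
library(jsonlite)

# Hypothetical FNDDS-like records; every id and value here is made up.
json <- '[
  {"fdcId": 1, "description": "Milk, whole",
   "foodAttributes": [
     {"id": 101, "name": "WWEIA Category number", "value": "1002"},
     {"id": 102, "name": "WWEIA Category description", "value": "Milk, whole"}
   ]},
  {"fdcId": 2, "description": "Milk, NFS",
   "foodAttributes": [
     {"id": 201, "name": "WWEIA Category number", "value": "1004"},
     {"id": 202, "name": "WWEIA Category description", "value": "Milk, reduced fat"},
     {"id": 203, "name": "Additional Description", "value": "leche fresca"}
   ]}
]'

# fromJSON() turns the outer array into a data frame; the nested
# "foodAttributes" arrays become a list column of small data frames.
foods <- fromJSON(json)

# Number of attribute entries per food.
n_attr <- vapply(foods$foodAttributes, nrow, integer(1))

# Foods carrying notes beyond the two WWEIA entries.
foods_with_notes <- foods$description[n_attr > 2]

# Flatten to one row per (food, attribute) pair for easier inspection.
flat <- do.call(
  rbind,
  Map(function(id, attrs) cbind(fdcId = id, attrs),
      foods$fdcId, foods$foodAttributes)
)

# Keep only the additional notes, dropping the duplicated WWEIA entries.
notes <- flat[!grepl("^WWEIA Category", flat$name), ]
print(notes)
```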
The nested variable “inputFoods” contains 10 variables: “id”, “unit”, “portionDescription”, “portionCode”, “foodDescription”, “sequenceNumber”, “amount”, “ingredientCode”, “ingredientWeight”, and “ingredientDescription”.
The variables “foodDescription” and “ingredientDescription” are identical for all foods.
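A check of this kind can be sketched in base R as follows. The data frame below is a made-up stand-in for a flattened “inputFoods” table with only a few of its columns, and its values are assumptions for illustration.

```r
# Hypothetical flattened inputFoods rows; values are invented.
input_foods <- data.frame(
  id = 1:3,
  foodDescription = c("Milk, whole", "Sugar, white", "Salt, table"),
  ingredientDescription = c("Milk, whole", "Sugar, white", "Salt, table"),
  ingredientWeight = c(244, 4.2, 0.4),
  stringsAsFactors = FALSE
)

# TRUE only if the two columns match element by element.
same <- identical(input_foods$foodDescription,
                  input_foods$ingredientDescription)

# Drop the redundant column when the check passes.
if (same) {
  input_foods$ingredientDescription <- NULL
}
print(names(input_foods))
```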
The USDA Ag Data Commons website hosts a database of flavonoid content for many of the foods in the FDC database (2). These flavonoid values can be linked to Foundation and SR legacy foods by NDB number. Combining the data gives us amounts of the following flavonoids for 25 Foundation foods and 1613 SR legacy foods. In all cases, every flavonoid is provided for each food.
## [1] "Daidzein" "Genistein"
## [3] "Glycitein" "Cyanidin"
## [5] "Petunidin" "Delphinidin"
## [7] "Malvidin" "Pelargonidin"
## [9] "Peonidin" "(+)-Catechin"
## [11] "(-)-Epigallocatechin" "(-)-Epicatechin"
## [13] "(-)-Epicatechin 3-gallate" "(-)-Epigallocatechin 3-gallate"
## [15] "Theaflavin" "Thearubigins"
## [17] "Eriodictyol" "Hesperetin"
## [19] "Naringenin" "Apigenin"
## [21] "Luteolin" "Isorhamnetin"
## [23] "Kaempferol" "Myricetin"
## [25] "Quercetin" "Theaflavin-3,3'-digallate"
## [27] "Theaflavin-3'-gallate" "Theaflavin-3-gallate"
## [29] "(+)-Gallocatechin"
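Linking by NDB number, as described above, amounts to a key-based join. The sketch below uses base R `merge()` on two small invented tables; the column names (`fdcId`, `ndbNumber`, `flavonoid`, `amount`) and every value are assumptions for illustration, not the layout of the actual files.

```r
# Hypothetical SR legacy foods keyed by NDB number.
sr_legacy <- data.frame(
  fdcId = c(900001, 900002, 900003),
  ndbNumber = c("0001", "0002", "0003"),
  description = c("Food A", "Food B", "Food C"),
  stringsAsFactors = FALSE
)

# Hypothetical flavonoid values in long format, also keyed by NDB number.
flavonoids <- data.frame(
  ndbNumber = c("0001", "0001", "0003", "0009"),
  flavonoid = c("Quercetin", "Kaempferol", "Quercetin", "Luteolin"),
  amount = c(1.2, 0.3, 4.5, 0.1),
  stringsAsFactors = FALSE
)

# Inner join: keeps only foods present in both tables.
linked <- merge(sr_legacy, flavonoids, by = "ndbNumber")

# Number of distinct foods that received flavonoid values.
n_linked_foods <- length(unique(linked$fdcId))

# Number of entries per flavonoid, in the spirit of the per-compound tables.
entries <- table(linked$flavonoid)
print(entries)
```

Keeping the NDB number as a character column avoids losing leading zeros when the two tables are matched.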
After the release of the USDA’s Expanded Flavonoid Database for the Assessment of Dietary Intakes in 2015, alterations and additions were made to the USDA Database for the Flavonoid Content of Selected Foods in 2018. While the other supplemental databases are available on the USDA Ag Data Commons, this update was published solely on the USDA Agricultural Research Service website (3). The USDA Database for the Flavonoid Content of Selected Foods contains values for a total of 183 foods in SR legacy and Foundation. Of those 183, 131 can also be found in the USDA’s Expanded Flavonoid Database for the Assessment of Dietary Intakes, meaning that 52 new foods have been added in this release. These newer values are assumed to be more accurate and, if added to the FDC data, should replace their previous versions.
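One way to let the newer values take precedence is sketched below in base R. It assumes both databases have been reshaped to the same long format and that a value is replaced when the food and the flavonoid both match; that matching rule, the column names, and all values are assumptions for illustration.

```r
# Hypothetical older values (Expanded Flavonoid Database, 2015).
expanded <- data.frame(
  ndbNumber = c("0001", "0001", "0002", "0003"),
  flavonoid = c("Quercetin", "Kaempferol", "Quercetin", "Quercetin"),
  amount = c(1.0, 0.5, 2.0, 3.0),
  stringsAsFactors = FALSE
)

# Hypothetical newer values (Flavonoid Content of Selected Foods, 2018).
selected <- data.frame(
  ndbNumber = c("0002", "0004"),
  flavonoid = c("Quercetin", "Quercetin"),
  amount = c(2.5, 4.0),
  stringsAsFactors = FALSE
)

# Foods present in both releases versus foods new in the later release.
overlap_foods <- intersect(selected$ndbNumber, expanded$ndbNumber)
new_foods <- setdiff(selected$ndbNumber, expanded$ndbNumber)

# Composite key identifying one flavonoid value for one food.
row_key <- function(d) paste(d$ndbNumber, d$flavonoid, sep = "|")

# Keep older rows only where no newer value exists, then stack the two.
kept_old <- expanded[!(row_key(expanded) %in% row_key(selected)), ]
combined <- rbind(selected, kept_old)
print(combined)
```

Applied to the real data, `overlap_foods` and `new_foods` would correspond to the 131 shared and 52 newly added foods mentioned above.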
Table 50: Flavonoid Entries for SR Legacy Foods

| Flavonoid | n |
|---|---|
| (-)-Epicatechin | 82 |
| (-)-Epicatechin 3-gallate | 75 |
| (-)-Epigallocatechin | 73 |
| (-)-Epigallocatechin 3-gallate | 74 |
| (+)-Catechin | 83 |
| (+)-Gallocatechin | 71 |
| Apigenin | 101 |
| Cyanidin | 48 |
| Delphinidin | 46 |
| Eriodictyol | 2 |
| Hesperetin | 39 |
| Isorhamnetin | 43 |
| Kaempferol | 132 |
| Luteolin | 112 |
| Malvidin | 38 |
| Myricetin | 122 |
| Naringenin | 37 |
| Pelargonidin | 39 |
| Peonidin | 38 |
| Petunidin | 37 |
| Quercetin | 162 |
| Theaflavin | 3 |
| Theaflavin-3'-gallate | 3 |
| Theaflavin-3,3'-digallate | 3 |
| Thearubigins | 3 |
Table 51: Flavonoid Entries for Foundation Foods

| Flavonoid | n |
|---|---|
| (-)-Epicatechin | 4 |
| (-)-Epicatechin 3-gallate | 4 |
| (-)-Epigallocatechin | 4 |
| (-)-Epigallocatechin 3-gallate | 4 |
| (+)-Catechin | 4 |
| (+)-Gallocatechin | 4 |
| Apigenin | 6 |
| Cyanidin | 2 |
| Delphinidin | 2 |
| Hesperetin | 2 |
| Isorhamnetin | 1 |
| Kaempferol | 7 |
| Luteolin | 6 |
| Malvidin | 2 |
| Myricetin | 7 |
| Naringenin | 2 |
| Pelargonidin | 2 |
| Peonidin | 2 |
| Petunidin | 2 |
| Quercetin | 7 |
The USDA Ag Data Commons website also hosts a database of proanthocyanidin content for many of the foods in the FDC database (4). These proanthocyanidin values can be linked to Foundation and SR legacy foods by NDB number. The following tables list each type of proanthocyanidin and the number of SR legacy and Foundation foods that have an entry for it.
Table 52: Proanthocyanidin Entries for SR Legacy Foods

| Proanthocyanidin | n |
|---|---|
| Proanthocyanidin 4-6mers | 114 |
| Proanthocyanidin 7-10mers | 110 |
| Proanthocyanidin dimers | 130 |
| Proanthocyanidin polymers (>10mers) | 108 |
| Proanthocyanidin trimers | 124 |
Table 53: Proanthocyanidin Entries for Foundation Foods

| Proanthocyanidin | n |
|---|---|
| Proanthocyanidin 4-6mers | 6 |
| Proanthocyanidin 7-10mers | 6 |
| Proanthocyanidin dimers | 6 |
| Proanthocyanidin polymers (>10mers) | 6 |
| Proanthocyanidin trimers | 6 |
The USDA Ag Data Commons website also hosts a database of isoflavone content for many of the foods in the FDC database (5). These isoflavone values can be linked to Foundation and SR legacy foods by NDB number. The following tables list each type of isoflavone and the number of SR legacy and Foundation foods that have an entry for it.
Table 54: Isoflavone Entries for SR Legacy Foods

| Isoflavone | n |
|---|---|
| Biochanin A | 59 |
| Coumestrol | 123 |
| Daidzein | 262 |
| Formononetin | 123 |
| Genistein | 262 |
| Glycitein | 143 |
| Total isoflavones | 259 |
Table 55: Isoflavone Entries for Foundation Foods

| Isoflavone | n |
|---|---|
| Biochanin A | 3 |
| Coumestrol | 8 |
| Daidzein | 15 |
| Formononetin | 8 |
| Genistein | 15 |
| Glycitein | 9 |
| Total isoflavones | 15 |
This data set has significant overlap with USDA’s Expanded Flavonoid Database for the Assessment of Dietary Intakes but provides additional information on “Biochanin A”, “Coumestrol”, “Formononetin”, and “Total isoflavones”.
There is no consistency between the food groups associated with each data type. Here we will dive into what the current food groups are and a possible approach to standardizing them for comparison.
FNDDS uses the WWEIA (What We Eat In America) food groups, which are split into 167 unique categories. The SR legacy and Foundation foods follow the SR legacy food groups, which are split into 28 unique categories. Branded has its own list of food groups, which contains 309 unique categories. No category name appears in all three lists. However, the food categories for FNDDS and branded intersect on the following category names:
## [1] "Rice" "Cheese" "Pizza" "Tomatoes" "Coffee" "Beer" "Bacon"
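A comparison like the one above can be sketched with base R set operations. The three vectors below are short invented stand-ins for the full category lists (167, 28, and 309 names), so the entries are assumptions for illustration only.

```r
# Hypothetical excerpts of the three category lists.
wweia_groups <- c("Rice", "Cheese", "Pizza", "Milk, whole")
sr_groups <- c("Dairy and Egg Products", "Cereal Grains and Pasta")
branded_groups <- c("Cheese", "Rice", "Pizza", "Candy")

# Category names shared by FNDDS (WWEIA) and branded foods.
shared_two <- intersect(wweia_groups, branded_groups)
print(shared_two)

# Category names shared by all three lists; empty in this toy example.
shared_all <- Reduce(intersect,
                     list(wweia_groups, sr_groups, branded_groups))
print(length(shared_all))
```

Note that `intersect()` matches exact strings, so names that differ only in case or wording count as different categories.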
Due to the size and scope of this approach and analysis, it has been moved to its own report. For information on how a machine learning algorithm can be utilized in the standardization of food group labels, see the full report here: https://rpubs.com/Em_Mari3/FoodGroups
1. U.S. Department of Agriculture, Agricultural Research Service. FoodData Central, 2019. fdc.nal.usda.gov.
2. Bhagwat, Seema; Haytowitz, David B.; Wasswa-Kintu, Shirley. (2015). USDA’s Expanded Flavonoid Database for the Assessment of Dietary Intakes, Release 1.1 - December 2015. Nutrient Data Laboratory, Beltsville Human Nutrition Research Center, ARS, USDA. https://doi.org/10.15482/USDA.ADC/1324677. Accessed 2022-01-12.
3. Haytowitz, D.B., Wu, X., Bhagwat, S. 2018. USDA Database for the Flavonoid Content of Selected Foods, Release 3.3. U.S. Department of Agriculture, Agricultural Research Service. Nutrient Data Laboratory Home Page: http://www.ars.usda.gov/nutrientdata/flav
4. Bhagwat, Seema; Haytowitz, David B. (2015). USDA Database for the Proanthocyanidin Content of Selected Foods, Release 2 (2015). Nutrient Data Laboratory, Beltsville Human Nutrition Research Center, ARS, USDA. https://doi.org/10.15482/USDA.ADC/1324621. Accessed 2022-01-12.
5. Bhagwat, Seema; Haytowitz, David B. (2015). USDA Database for the Isoflavone Content of Selected Foods, Release 2.1 (November 2015). Nutrient Data Laboratory, Beltsville Human Nutrition Research Center, ARS, USDA. https://doi.org/10.15482/USDA.ADC/1324538. Accessed 2022-01-12.