Note: A new characterization of the 2021 October data can be found at https://rpubs.com/Em_Mari3/FDC_V2
As of April 2021, the largest data set that can be downloaded from the FoodData Central website is a zipped folder of 36 files: 1 pdf meant to explain the variables in each file, and 35 csv files containing data (1). The pdf of variable declarations and database structure is largely inaccurate; it contains variables not present in the data and is missing descriptions of several variables.
One file, labeled “all_downloaded_table_record_counts,” lists the number of records present in each of the other main files.
Some of these file names do not match the names of the actual files available from the downloadable data, but they are close enough that we can figure out what they’re going for. A lot of these files will not help much with this analysis, as they contain duplicate information and variables irrelevant to our focus.
For our initial analysis we will focus on the files:
Below is a diagram of the variables/ids used to connect the data:
These files contain the nutritional values for all foods in the database and the categories each food falls into.
The data in the file “food_category” has variables that don’t match up to any variables from the other files. As a result, we are missing food categories for all SR Legacy and Foundation food items. This means we will have to manually add in food_category descriptions for Foundation foods based on the NDB_number of each food in the foundation_food file and their equivalents in the sr_legacy_food file.
The codes for SR Legacy food groups have been removed from the files in recent years, so we will have to get them from the ARS website. From that data we will need the files “FD_GROUP” and “FOOD_DES” (2).
Certain variable names will need to be altered so that we can combine the files correctly. For instance, the variable labeled “id” in the file “food_nutrient_derivation” is the same variable labeled “derivation_id” in the “food_nutrient” file. In cases like this we will always default to the longer or more descriptive variable name. We will use the file “food_nutrient” as our base; it contains 15324912 observations of 11 variables.
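The renaming step can be sketched as follows. This is a minimal illustration in plain Python, not the code used to produce this report. The rows and their values are invented, the `description` column and the helper `rename_columns` are assumptions, and only the names `id`, `derivation_id`, and `fdc_id` come from the files described above.

```python
# Toy rows standing in for the two csv files; all values are invented.
food_nutrient = [
    {"fdc_id": 1, "nutrient_id": 1003, "amount": 3.2, "derivation_id": 70},
    {"fdc_id": 1, "nutrient_id": 1004, "amount": 1.0, "derivation_id": 71},
    {"fdc_id": 2, "nutrient_id": 1003, "amount": 8.5, "derivation_id": None},
]
food_nutrient_derivation = [
    {"id": 70, "description": "Calculated"},
    {"id": 71, "description": "Analytical"},
]


def rename_columns(rows, mapping):
    """Return copies of rows with keys renamed according to mapping."""
    return [{mapping.get(key, key): value for key, value in row.items()} for row in rows]


# Default to the longer, more descriptive name before joining.
derivations = rename_columns(
    food_nutrient_derivation,
    {"id": "derivation_id", "description": "derivation_description"},
)

# Left join: every food_nutrient row is kept, even without a derivation.
lookup = {d["derivation_id"]: d["derivation_description"] for d in derivations}
combined = [
    {**row, "derivation_description": lookup.get(row["derivation_id"])}
    for row in food_nutrient
]
```

Using a left join keeps "food_nutrient" as the base, so rows with no matching derivation stay in the data with a missing description instead of being dropped.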
For the sake of simplicity, all footnote columns will be removed from each file.
The files “sr_legacy_food” and “foundation_food” will not be added to the main data set for now as the only new variable they provide information on is “NDB_number” which we do not need at the present moment. Analysis of “market_acquisition” will be done separately, as it only applies to certain branded foods and only introduces information on which stores and states foods were acquired in.
When combining all the data, we come across a problem. “food_nutrient” has a total of 1094722 unique fdc_ids, while “branded_food” has a total of 1142610 unique fdc_ids, meaning that at minimum 47888 foods have an fdc_id but no nutrient data.
A fact that complicates this picture is that the column “gtin_upc” in the branded_food file has only 357927 unique upc codes in it, meaning that the creators of the data gave a unique fdc_id to multiple forms of each product even though they share the same nutrient information. In the total combined data we have 1185096 unique fdc_ids, 1060007 of which belong to branded foods. Assuming each food with the same upc code has identical nutrient information, 702080 of these entries must be repeated information. On the FDC website there are 368686 viewable branded foods, which is a lot closer to the number of unique upc codes than to the number of unique fdc_ids. In this case it will be more representative if we filter out the repeated entries of branded foods.
Deleting data is never an ideal solution, so let’s gather more context and see if we can find an alternate solution.
Before removing any duplicate information, let’s look at how many unique foods we have per category.
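The comparison of id sets can be sketched like this in plain Python. The helper name `ids_missing_from` and the short id lists are invented for illustration; the two totals are the ones reported above.

```python
def ids_missing_from(reference_ids, other_ids):
    """Return the ids that appear in reference_ids but not in other_ids."""
    return set(reference_ids) - set(other_ids)


# Invented ids standing in for branded_food and food_nutrient.
branded_ids = [101, 102, 103, 104]
nutrient_ids = [101, 103, 999]
missing = ids_missing_from(branded_ids, nutrient_ids)

# Lower bound on foods without nutrient data: even if every fdc_id in
# food_nutrient belonged to a branded food, this many would be left over.
lower_bound = 1142610 - 1094722
```

The subtraction only gives a minimum because some fdc_ids in "food_nutrient" belong to other data types, so the true number of branded foods without nutrient rows can be larger.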
| Table 2: Unique Food Entries per Data Type in csv Files | |
|---|---|
| | n |
| agricultural_acquisition | 810 |
| branded_food | 1142610 |
| experimental_food | 11 |
| foundation_food | 195 |
| market_acquistion | 5480 |
| sample_food | 1982 |
| sr_legacy_food | 7793 |
| sub_sample_food | 19126 |
| survey_fndds_food | 7083 |
| NA | 6 |
We’ve got some obvious problems right off the bat. On the FDC website there are 159 unique Foundation foods and 378903 branded foods as of 12/10/2021 (the number of foods in SR legacy and FNDDS match the website frequencies). To combat this, we will add the gtin_upc column from the branded_food file and filter out duplicated entries, ignoring fdc_id. In other words, we will be filtering the data for unique combinations of the variables nutrient_id, amount, food_description, data_type, gtin_upc, and unit_name.
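A minimal sketch of this filter in plain Python is shown below. It is an illustration under stated assumptions, not the code used for the report: the function name `drop_repeated_entries` and the sample rows are invented, and only the six key column names come from the text.

```python
KEY_COLUMNS = (
    "nutrient_id", "amount", "food_description",
    "data_type", "gtin_upc", "unit_name",
)


def drop_repeated_entries(rows, key_columns=KEY_COLUMNS):
    """Keep the first row seen for each combination of key_columns."""
    seen = set()
    kept = []
    for row in rows:
        # fdc_id is deliberately left out of the key.
        key = tuple(row.get(column) for column in key_columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Invented rows: the first two differ only in fdc_id.
shared = {"amount": 2.0, "food_description": "Example cereal",
          "data_type": "branded_food", "gtin_upc": "000111", "unit_name": "G"}
rows = [
    {"fdc_id": 1, "nutrient_id": 1003, **shared},
    {"fdc_id": 2, "nutrient_id": 1003, **shared},
    {"fdc_id": 3, "nutrient_id": 1004, **shared},
]
unique_rows = drop_repeated_entries(rows)
```

Keeping the first occurrence is an arbitrary choice; any rule that picks one fdc_id per key combination yields the same counts.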
We will also alter our strategy of combining our data by exclusively joining on foods we have nutrient information on.
Now our frequencies look like this:
| Table 3: Unique Combinations of nutrient_id, amount, food_description, data_type, gtin_upc, and unit_name | |
|---|---|
| | n |
| agricultural_acquisition | 805 |
| branded_food | 521993 |
| foundation_food | 177 |
| sr_legacy_food | 7793 |
| sub_sample_food | 14837 |
| survey_fndds_food | 7083 |
| NA | 5 |
We’re still not quite to the point where we match the numbers on the website, but we sure are a lot closer than we were before. Close isn’t going to cut it, though; we need to find another way to filter down the entries that is precise and accurate.
Looking at differences between the available types of downloadable data is crucial here. In the csv download version of the food file there are 1605403 entries, while in the Access version of the file there are only 372954.
Altering the method to use only the foods in the food file from the access version of the data we get:
| Table 4: Unique Food Entries per Data Type in Access Files | |
|---|---|
| | n |
| branded_food | 357927 |
| experimental_food | 11 |
| foundation_food | 140 |
| sr_legacy_food | 7793 |
| survey_fndds_food | 7083 |
We did it! We have the right number of entries per data type!
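The restriction to foods listed in the Access version of the food file can be sketched as a filter on fdc_id followed by a count per data type. This is a plain-Python illustration with invented rows and a hypothetical helper name. The last lines check that the counts reported in Table 4 add up to the 372954 entries of the Access food file.

```python
from collections import Counter


def foods_per_data_type(food_rows, listed_ids):
    """Count unique fdc_ids per data_type, keeping only ids in listed_ids."""
    listed = set(listed_ids)
    unique_foods = {
        (row["fdc_id"], row["data_type"])
        for row in food_rows
        if row["fdc_id"] in listed
    }
    return Counter(data_type for _fdc_id, data_type in unique_foods)


# Invented rows standing in for the csv food file and the Access id list.
csv_food = [
    {"fdc_id": 1, "data_type": "branded_food"},
    {"fdc_id": 2, "data_type": "branded_food"},
    {"fdc_id": 3, "data_type": "foundation_food"},
    {"fdc_id": 4, "data_type": "sub_sample_food"},
]
access_ids = [1, 3]
counts = foods_per_data_type(csv_food, access_ids)

# Consistency check on the figures reported in Table 4.
table_4 = {
    "branded_food": 357927,
    "experimental_food": 11,
    "foundation_food": 140,
    "sr_legacy_food": 7793,
    "survey_fndds_food": 7083,
}
table_4_total = sum(table_4.values())
```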
Key notes/important takeaways:
Ignoring the foods with no nutrient values for now, let’s look at what the data would be without them. This data set will only include foods that we have nutrient information for and that are listed in the food file.
Now we have 10795145 observations of 26 variables, all matched by id numbers. Most of these columns are superfluous so for now we will use a subset of the data. For our initial analysis of the overall data we will be using the following variables:
| Variable Name | Variable Description |
|---|---|
| fdc_id | A unique number for each food in the Food Data Central Database |
| nutrient_id | A unique number given to each nutrient |
| amount | The amount of each nutrient per 100g of the listed food |
| data_points | How many data points they used to derive the nutrition value |
| min, max, median | Minimum, maximum, and median value of nutrition content within sample provided in three separate columns |
| min_year_acquired | The first year data collection started on a sample |
| data_type | The type of data based on how it was acquired |
| food_description | The name of the food or a brief description of the food such as “milk, whole” |
| publication_date | The date the food was published to the FoodData Central website |
| nutrient_name | The name of the nutrient |
| unit_name | The unit of each nutrient (g, mg, mcg, IU, etc.) |
| derivation_description | How the food was analyzed for nutrition content |
In this section we will look at the range of measurements with information regarding precision.
Below is a table of the units of measurement used for each nutrient per data type.
| Table 5: Units of Measurement per Data Type | ||||||||
|---|---|---|---|---|---|---|---|---|
| | G | IU | KCAL | kJ | MG | MG_ATE | SP_GR | UG |
| branded_food | 2470260 | 288601 | 354840 | 9 | 1685504 | 1040 | 0 | 30920 |
| experimental_food | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| foundation_food | 5696 | 26 | 315 | 98 | 2390 | 0 | 1 | 864 |
| sr_legacy_food | 358495 | 12537 | 7793 | 7793 | 157006 | 0 | 0 | 100501 |
| survey_fndds_food | 205407 | 0 | 7083 | 0 | 134577 | 0 | 0 | 113328 |
Since each measurement must be rounded to a certain number of significant digits, and a vast majority of the amounts in this data are rounded to 6 significant figures or less, a larger variety of units of measurement implies a larger amount of precision. Going by this logic, data on Branded food tends to be less precise than data on SR legacy or Foundation foods.
Below is a table of the frequency of derivation descriptions by data type.
The source codes the descriptions mention are as follows:
| Table 7: Source Codes | |
|---|---|
| code | description |
| 1 | Analytical or derived from analytical |
| 4 | Calculated or imputed |
| 5 | Value manufacturer based label claim for added nutrients |
| 6 | Aggregated data involving combinations of source codes 1, 6, 12 and/or 13 |
| 7 | Assumed zero |
| 8 | Calculated from nutrient label by NDL |
| 9 | Calculated by manufacturer, not adjusted or rounded for NLEA |
| 11 | Aggregated data involving comb. of codes other then 1,12 or6 |
| 12 | Manufacturer's analytical; partial documentation |
| 13 | Analytical data from the literature, partial documentation |
Since there are now 64 different measures of derivation, it doesn’t make much sense to try to fit them all into the original 10 categories.
As there are 384 unique combinations of data type and derivation this is not particularly helpful. To improve our understanding of the derivation methods used in each data type we will split the derivation description into groups.
| Table 8: Derivation Group per Data Type | |||||
|---|---|---|---|---|---|
| | Branded | Experimental | Foundation | SR legacy | FNDDS |
| Aggregated_data | 0 | 0 | 0 | 2746 | 0 |
| Analytical | 0 | 0 | 7594 | 181665 | 0 |
| Assumed_zero | 0 | 0 | 0 | 57711 | 0 |
| Based_on_similar_food | 0 | 0 | 0 | 47013 | 0 |
| Based_on_physical_composition | 0 | 0 | 0 | 41454 | 0 |
| Calculated | 4762782 | 0 | 578 | 41795 | 0 |
| Concentration_adjustment | 0 | 0 | 0 | 18 | 0 |
| Estimated_from_ingredients | 0 | 0 | 0 | 39304 | 0 |
| Given_by_info_provider | 68392 | 0 | 0 | 0 | 0 |
| Label | 0 | 0 | 0 | 5080 | 0 |
| Manufacturer_supplied | 0 | 0 | 0 | 15258 | 0 |
| Based_on_other_nutrient | 0 | 0 | 0 | 5618 | 0 |
| Other | 0 | 0 | 0 | 1960 | 0 |
| Product_standard | 0 | 0 | 0 | 29 | 0 |
| Recipe | 0 | 0 | 0 | 21734 | 0 |
| Summed | 0 | 0 | 1218 | 0 | 0 |
| Food_composition_tables | 0 | 0 | 0 | 997 | 0 |
Derivations were split into 17 groups based on the following groupings
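One way to assign a group to each derivation description is an ordered list of keyword rules, sketched below in plain Python. The rules shown are hypothetical and cover only a few of the 17 groups; the actual groupings used for Table 8 are not reproduced here. Only the group names are taken from the table.

```python
# Hypothetical keyword rules, checked in order; the first match wins.
GROUP_RULES = [
    ("assumed zero", "Assumed_zero"),
    ("analytical", "Analytical"),
    ("recipe", "Recipe"),
    ("summed", "Summed"),
    ("label", "Label"),
    ("calculated", "Calculated"),
]


def derivation_group(description):
    """Map a derivation description to a group name, or None if missing."""
    if description is None:
        return None
    text = description.lower()
    for keyword, group in GROUP_RULES:
        if keyword in text:
            return group
    return "Other"
```

Because the first matching rule wins, the order of the rules matters: a description that mentions both a label and a calculation would land in "Label" under this particular ordering.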
In the event that you prefer the groupings set by the original 10 source codes, the table would look like this:
| Table 10: Source per Data Type | |||||
|---|---|---|---|---|---|
| | Branded | Experimental | Foundation | SR legacy | FNDDS |
| Aggregated data involving comb. of codes other then 1,12 or6 | — | — | — | 817 | — |
| Aggregated data involving combinations of source codes 1, 6, 12 and/or 13 | — | — | — | 7261 | — |
| Analytical data from the literature, partial documentation | — | — | — | 1075 | — |
| Analytical or derived from analytical | — | — | 8812 | 208534 | — |
| Assumed zero | — | — | — | 57711 | — |
| Calculated by manufacturer, not adjusted or rounded for NLEA | — | — | — | 10607 | — |
| Calculated from nutrient label by NDL | — | — | — | 5080 | — |
| Calculated or imputed | — | — | 578 | 166646 | — |
| Manufacturer's analytical; partial documentation | 4831174 | — | — | 4513 | — |
| Value manufacturer based label claim for added nutrients | — | — | — | 138 | — |
| NA | 681 | 11 | — | 181743 | 460395 |
note: NA = Not Available, unknown, or missing
You’ll notice that there were a lot of missing values in the table of derivations, so let’s take a look at how many derivations were missing in each data type.
| Table 11: Missing Derivation Values | |
|---|---|
| | NA_derivation_count |
| branded_food | 681 |
| experimental_food | 11 |
| foundation_food | 0 |
| sr_legacy_food | 181743 |
| survey_fndds_food | 460395 |
It looks as though the missing data is mostly coming from SR and FNDDS. There are other files that can be downloaded to determine more about the derivation of the FNDDS foods. However, from the documentation of the FNDDS it is noted that all FNDDS nutrition values are taken from a combination of other foods in FDC. We will explore the breakdown of what data FNDDS is derived from further in a later section.
Key notes/important takeaways:
There are many different variables formatted as dates within the FDC data. In order to get an accurate look at the age of the measurements, we will have to look at several of them.
The one date variable provided for all values is the publication date which represents when each food/nutrient was uploaded to FoodData Central. Below is a table summarizing this variable.
| Table 12: Publication Date | ||||
|---|---|---|---|---|
| | n | min | median | max |
| branded_food | 4831855 | 2019-04-01 | 2021-03-19 | 2021-03-19 |
| experimental_food | 11 | 2020-10-30 | 2020-10-30 | 2021-04-23 |
| foundation_food | 9390 | 2019-04-01 | 2019-12-16 | 2021-04-28 |
| sr_legacy_food | 644125 | 2019-04-01 | 2019-04-01 | 2019-04-01 |
| survey_fndds_food | 460395 | 2020-10-30 | 2020-10-30 | 2020-10-30 |
All publication dates are in 2019-2021 since FDC has only been around for 2 years. This makes this data rather unhelpful to us.
The date variable associated with the Foundation foods is min_year_acquired, which informs us of when foods in Foundation were purchased or procured for analysis. This date represents the oldest sample for each nutrient entry of each food. Let’s look at the distribution of min_year_acquired in Foundation.
| Table 13: Minimum Year Acquired | ||||
|---|---|---|---|---|
| | n | min | median | max |
| foundation_food | 7458 | 1999 | 2015 | 2021 |
## min_year_acquired
## Min. :1999
## 1st Qu.:2011
## Median :2015
## Mean :2014
## 3rd Qu.:2018
## Max. :2021
## NA's :1932
We’ve got data ranging from 1999 to 2021, which gives us a fairly substantial range of around 22 years. However we have 1932 missing dates which is quite alarming.
To get the dates for SR legacy we have to go back to the link to the original SR legacy data and download the file “NUT_DATA” which provides a variable listed as “AddMod_Date” which specifies the last modified date for each nutrient entry.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "1976-11-01" "1996-03-01" "2006-05-01" "2003-06-05" "2011-08-01" "2018-01-01"
## NA's
## "1493"
This isn’t looking so good. Some of these measurements are from 1976, with the mean and median falling in the early 2000’s, and a total of 1493 of the entries didn’t specify a date at all.
That is quite a jump in the early 2000’s. More likely than not, there is some variable that caused the amount of information to jump like that. If I had to guess, I would say that there might have been some sort of scientific advancement that allowed more people to study nutritional composition, or there was a large change in government funding.
For FNDDS we will have to look at both the start date and end date of each sample. Every entry has the same start_date and the same end_date: all samples started on “2017-01-01” and ended on “2018-12-31.” This data means practically nothing to us, because all FNDDS nutrient calculations are based on nutrition information in SR and Foundation.
For branded data we’ll look at both “modified_date,” which is the last date the food was altered by the manufacturer, and “available_date,” which is the date the food was made available for inclusion in the database.
For modified date we have:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2013-06-05" "2018-02-16" "2019-01-18" "2019-03-05" "2020-04-08" "2021-03-04"
## NA's
## "20"
For available date we have:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2019-04-01" "2019-12-06" "2021-02-26" "2020-07-17" "2021-03-19" "2021-03-19"
This tells us more about how long food products can sit around before being sold than it does about when the nutrition information was gathered.
Key notes/important takeaways:
Below you’ll find a table of count data relating to how many nutrients were recorded per food in each data type. Here, “total entries” is the total number of nutrient entries in that data type, “average count” is the average number of nutrients associated with each food, and “minimum count” and “maximum count” are the minimum and maximum number of nutrient entries associated with a single food in that data type. Note that foods with no nutrient information at all (primarily the 681 foods in Branded; more on this in the missing data section) are not included in the minimum, as we are focusing on provided nutrient information. For example, for Foundation the most nutrients stated per food was 159 and the least was 13. If we grabbed a random Foundation food, we would expect to know the values of about 67 nutrients for that food.
| Table 14: Nutrients Listed per Food | ||||
|---|---|---|---|---|
| | Total Entries | Average Count | Minimum Count | Maximum Count |
| branded_food | 4831855 | 13.50 | 1 | 48 |
| experimental_food | 11 | 1.00 | 1 | 1 |
| foundation_food | 9390 | 67.07 | 13 | 159 |
| sr_legacy_food | 644125 | 82.65 | 8 | 138 |
| survey_fndds_food | 460395 | 65.00 | 65 | 65 |
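The per-food counts behind a table like this can be sketched in plain Python as follows. The function name and sample rows are invented for illustration. The final line checks the FNDDS row of the table, where 460395 entries spread over 7083 foods gives exactly 65 nutrients per food.

```python
from collections import Counter, defaultdict


def nutrient_count_summary(rows):
    """Summarise how many nutrient entries each food has, per data type."""
    per_food = Counter(
        (row["data_type"], row["fdc_id"])
        for row in rows
        if row["nutrient_id"] is not None  # skip foods with no nutrient data
    )
    counts_by_type = defaultdict(list)
    for (data_type, _fdc_id), count in per_food.items():
        counts_by_type[data_type].append(count)
    return {
        data_type: {
            "total": sum(counts),
            "average": sum(counts) / len(counts),
            "minimum": min(counts),
            "maximum": max(counts),
        }
        for data_type, counts in counts_by_type.items()
    }


# Invented rows: food 1 has three nutrients, food 2 has one, food 3 has none.
rows = [
    {"data_type": "foundation_food", "fdc_id": 1, "nutrient_id": 1003},
    {"data_type": "foundation_food", "fdc_id": 1, "nutrient_id": 1004},
    {"data_type": "foundation_food", "fdc_id": 1, "nutrient_id": 1005},
    {"data_type": "foundation_food", "fdc_id": 2, "nutrient_id": 1003},
    {"data_type": "branded_food", "fdc_id": 3, "nutrient_id": None},
]
summary = nutrient_count_summary(rows)

fndds_average = 460395 / 7083
```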
Key notes/important takeaways:
Even though not all measurements are technically nutrients, I will be referring to them all as nutrients because the column of component names is labeled “nutrient_name.” Below is a plot of the 15 most used nutrient names overall. As you can see, the number of entries in branded gives it a lot of bearing on the overall frequencies.
Below are 6 plots. The three plots on the left depict the top 15 most used nutrient names in each data type. The three plots on the right depict the frequencies of those same 15 nutrient names across all data types. We will be skipping FNDDS in this case because it has the exact same frequency for each of the 65 nutrients it uses. Nutrient names in the plots below are ordered from left to right by the frequency of each nutrient name in the specified database.
Key notes/important takeaways:
Note that for the following variables we have no missing values:
Below is a table of missing or “NA” values for each applicable variable.
| Table 15: Missing Values | ||
|---|---|---|
| | Number of Missing Values | Percentage |
| nutrient_id | 692 | 0.0001 |
| amount | 692 | 0.0001 |
| data_points | 5294155 | 0.8904 |
| derivation_id | 642830 | 0.1081 |
| min | 5841684 | 0.9825 |
| max | 5841682 | 0.9825 |
| median | 5938177 | 0.9987 |
| min_year_acquired | 5938318 | 0.9987 |
| NDB_number | 5292261 | 0.8901 |
| nutrient_name | 692 | 0.0001 |
| unit_name | 692 | 0.0001 |
| derivation_description | 642830 | 0.1081 |
| food_code | 5945776 | 1.0000 |
| survey_start_date | 5945776 | 1.0000 |
| survey_end_date | 5945776 | 1.0000 |
Naturally, all of the foods from data types other than FNDDS will be missing survey_start_date and survey_end_date. food_code is also only truly applicable to FNDDS.
Below you’ll find a breakdown of the values above per data type.
| Table 16: Count of Missing Values per Data Type | |||||
|---|---|---|---|---|---|
| | branded_food | experimental_food | foundation_food | sr_legacy_food | survey_fndds_food |
| nutrient_id | 681 | 11 | 0 | 0 | 0 |
| amount | 681 | 11 | 0 | 0 | 0 |
| data_points | 4831855 | 11 | 1894 | 0 | 460395 |
| derivation_id | 681 | 11 | 0 | 181743 | 460395 |
| min | 4831855 | 11 | 2189 | 547234 | 460395 |
| max | 4831855 | 11 | 2189 | 547232 | 460395 |
| median | 4831855 | 11 | 1791 | 644125 | 460395 |
| min_year_acquired | 4831855 | 11 | 1932 | 644125 | 460395 |
| NDB_number | 4831855 | 11 | 0 | 0 | 460395 |
| nutrient_name | 681 | 11 | 0 | 0 | 0 |
| unit_name | 681 | 11 | 0 | 0 | 0 |
| derivation_description | 681 | 11 | 0 | 181743 | 460395 |
| food_code | 4831855 | 11 | 9390 | 644125 | 460395 |
| survey_start_date | 4831855 | 11 | 9390 | 644125 | 460395 |
| survey_end_date | 4831855 | 11 | 9390 | 644125 | 460395 |
That table can be rather difficult to interpret without any context, so here is the same information displayed as the proportion of each variable missing within each data type, rounded to 3 decimal places.
| Table 17: Percentage of Missing Values per Data Type | |||||
|---|---|---|---|---|---|
| | branded_food | experimental_food | foundation_food | sr_legacy_food | survey_fndds_food |
| nutrient_id | 0 | 1 | 0.000 | 0.000 | 0 |
| amount | 0 | 1 | 0.000 | 0.000 | 0 |
| data_points | 1 | 1 | 0.202 | 0.000 | 1 |
| derivation_id | 0 | 1 | 0.000 | 0.282 | 1 |
| min | 1 | 1 | 0.233 | 0.850 | 1 |
| max | 1 | 1 | 0.233 | 0.850 | 1 |
| median | 1 | 1 | 0.191 | 1.000 | 1 |
| min_year_acquired | 1 | 1 | 0.206 | 1.000 | 1 |
| NDB_number | 1 | 1 | 0.000 | 0.000 | 1 |
| nutrient_name | 0 | 1 | 0.000 | 0.000 | 0 |
| unit_name | 0 | 1 | 0.000 | 0.000 | 0 |
| derivation_description | 0 | 1 | 0.000 | 0.282 | 1 |
| food_code | 1 | 1 | 1.000 | 1.000 | 1 |
| survey_start_date | 1 | 1 | 1.000 | 1.000 | 1 |
| survey_end_date | 1 | 1 | 1.000 | 1.000 | 1 |
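The conversion from missing counts to proportions can be sketched in plain Python as below. The function name and sample rows are invented. The two constants at the end reproduce entries of the table from the counts in Table 16 and the totals in Table 14.

```python
from collections import defaultdict


def missing_proportions(rows, columns, digits=3):
    """Proportion of missing (None) values per column within each data type."""
    totals = defaultdict(int)
    missing = defaultdict(lambda: defaultdict(int))
    for row in rows:
        data_type = row["data_type"]
        totals[data_type] += 1
        for column in columns:
            if row.get(column) is None:
                missing[data_type][column] += 1
    return {
        data_type: {
            column: round(missing[data_type][column] / total, digits)
            for column in columns
        }
        for data_type, total in totals.items()
    }


# Invented rows: "min" is missing in two of three entries.
rows = [
    {"data_type": "sr_legacy_food", "amount": 1.2, "min": None},
    {"data_type": "sr_legacy_food", "amount": 0.4, "min": 0.1},
    {"data_type": "sr_legacy_food", "amount": 3.0, "min": None},
]
proportions = missing_proportions(rows, ["amount", "min"])

# Entries of the table recomputed from the reported counts.
foundation_data_points = round(1894 / 9390, 3)
sr_legacy_derivation_id = round(181743 / 644125, 3)
```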
Experimental foods are still missing most data, and NDB_number is only applicable to SR legacy and Foundation foods. It would appear there are 681 foods in branded that have no nutrient data but do have a name and unique fdc_id; they are all foods without available nutrition facts, such as alcohol. For example, the first 6 branded foods without nutrient information are as follows:
Below is a table of missing variables by nutrient name:
and again formatted as percentages we have:
You’ll notice that the nutrients that were used most have the most missing information. Further analysis must be done to draw any true conclusions from this data due to the large number of nutrients we have information on. For now we will have to look at the Essential Nutrients and work from there to draw conclusions.
Key notes/important takeaways:
Note: For all tables below, data types will be excluded if they contain no relevant entries. If a table entry indicates “-”, there were none of the listed nutrient in that category.
All percentage values displayed in plots are rounded to two decimal places; all percentage values displayed in tables are rounded to four decimal places.
Below is a table of the frequency of occurrences of each essential vitamin per data type:
| Table 20: Essential Vitamins per Data Type | |
|---|---|
| Count | Percentage |
Below you’ll find two plots. The first shows the count data as it appears above. The second shows the percentage of foods in each data type that contain the essential vitamins listed above (i.e., the number above each bar in the second plot is the percentage of foods in that data type that contain the listed nutrient). For example, “foundation_food” shows a value of 0.6 for Thiamin, which means that 60% of foods in the Foundation data contain information on thiamin content. All following sections within “Frequency of Essential Nutrients” will have plots that can be interpreted in the same manner.
Since FNDDS entries all contain information on the same 65 nutrients, the percentage of FNDDS foods that contain information on a given nutrient will always be either 1 or 0.
In this case, multiple forms existed for a few of the essential vitamins, below you will see a breakdown of all the different forms of these entries. Each tab contains a separate table that indicates the counts of each type of entry associated with that type of vitamin.
| Table 21: A Vitamins per Data Type | |
|---|---|
| Count | Percentage |
| Table 21: B9 Vitamins per Data Type | |
|---|---|
| Count | Percentage |
| Table 22: D Vitamins per Data Type | |
|---|---|
| Count | Percentage |
| Table 23: E Vitamins per Data Type | |
|---|---|
| Count | Percentage |
| Table 24: K Vitamins per Data Type | |
|---|---|
| Count | Percentage |
| Table 25: Choline per Data Type | |
|---|---|
| Count | Percentage |
Key notes/important takeaways:
Below, you’ll find a table with three sections, the first describing the frequency of “min” values for each data type, the second containing frequencies of “max” values, and the third containing frequencies of “median” values for each of the listed nutrients per data type.
| Table 26: Count of Essential Vitamins | ||
|---|---|---|
| min | max | median |
| Table 27: Percentage of Entries with Essential Vitamins | ||
|---|---|---|
| min | max | median |
Key notes/important takeaways:
Below you’ll find a table displaying the number of essential mineral entries that exist in each data type.
| Table 28: Count of Foods Containing Essential Minerals | ||||
|---|---|---|---|---|
| | Branded | Foundation | SR legacy | FNDDS |
| Calcium | 291784 | 128 | 7708 | 7083 |
| Chromium | 194 | — | — | — |
| Copper | 3830 | 128 | 7284 | 7083 |
| Iodine | 1436 | 21 | — | — |
| Iron | 292662 | 128 | 7713 | 7083 |
| Magnesium | 11539 | 128 | 7421 | 7083 |
| Manganese | 3880 | 128 | 6492 | — |
| Molybdenum | 228 | 30 | — | — |
| Phosphorus | 11907 | 128 | 7467 | 7083 |
| Potassium | 128639 | 128 | 7516 | 7083 |
| Selenium | 2397 | 81 | 6865 | 7083 |
| Zinc | 9212 | 128 | 7406 | 7083 |
| Chloride | 232 | — | — | — |
| Sodium | 353724 | 110 | 7709 | 7083 |
Key notes/important takeaways:
Below, you’ll find a table with three sections, the first describing the frequency of “min” values for each data type, the second containing frequencies of “max” values, and the third containing frequencies of “median” values for each of the listed nutrients per data type.
| Table 29: Count of Entries with Essential Minerals | ||
|---|---|---|
| min | max | median |
| Table 30: Percentage of Entries with Essential Minerals | ||
|---|---|---|
| min | max | median |
Key notes/important takeaways:
Below you’ll find a table of the number of essential amino acid entries that exist for each data type.
| Table 31: Amino Acids per Data Type | |
|---|---|
| Count | Percentage |
Key notes/important takeaways:
Below, you’ll find a table with three sections, the first describing the frequency of “min” values for each data type, the second containing frequencies of “max” values, and the third containing frequencies of “median” values for each of the listed nutrients per data type.
| Table 32: Count of Entries with Amino Acids | ||
|---|---|---|
| min | max | median |
| Table 33: Percentage of Entries with Amino Acids | ||
|---|---|---|
| min | max | median |
| Table 34: Omega 3 Fatty Acids per Data Type | |
|---|---|
| Count | Percentage |
Below, you’ll find a table with three sections, the first describing the frequency of “min” values for each data type, the second containing frequencies of “max” values, and the third containing frequencies of “median” values for each of the listed nutrients per data type.
| Table 35: Count of Entries with Omega 3 Fatty Acids | ||
|---|---|---|
| min | max | median |
| Table 36: Percentage of Entries with Omega 3 Fatty Acids | ||
|---|---|---|
| min | max | median |
| Table 37: Omega 6 Fatty Acids per Data Type | |
|---|---|
| Count | Percentage |
Below, you’ll find a table with three sections, the first describing the frequency of “min” values for each data type, the second containing frequencies of “max” values, and the third containing frequencies of “median” values for each of the listed nutrients per data type.
| Table 38: Count of Entries with Omega 6 Fatty Acids | ||
|---|---|---|
| min | max | median |
| Table 39: Percentage of Entries with Omega 6 Fatty Acids | ||
|---|---|---|
| min | max | median |
| Table 40: Fatty Acid Entries per Data Type | |
|---|---|
| Count | Percentage |
Key notes/important takeaways:
Below, you’ll find a table with three sections, the first describing the frequency of “min” values for each data type, the second containing frequencies of “max” values, and the third containing frequencies of “median” values for each of the listed nutrients per data type.
| Table 41: Count of Entries with Fatty Acids | ||
|---|---|---|
| min | max | median |
| Table 42: Percentage of Entries with Fatty Acids | ||
|---|---|---|
| min | max | median |
| Table 43: Sugar Entries per Data Type | |
|---|---|
| Count | Percentage |
Key notes/important takeaways:
Below, you’ll find a table with three sections, the first describing the frequency of “min” values for each data type, the second containing frequencies of “max” values, and the third containing frequencies of “median” values for each of the listed nutrients per data type.
| Table 44: Count of Entries with Sugars | ||
|---|---|---|
| min | max | median |
| Table 45: Percentage of Entries with Sugars | ||
|---|---|---|
| min | max | median |
Below are the number of entries for total fiber in each data type.
| Table 46: Fiber Entries per Data Type | |
|---|---|
| Count | Percentage |
Key notes/important takeaways:
Below, you’ll find a table with three sections, the first describing the frequency of “min” values for each data type, the second containing frequencies of “max” values, and the third containing frequencies of “median” values for each of the listed nutrients per data type.
| Table 47: Count of Entries with Fiber | ||
|---|---|---|
| min | max | median |
| Table 48: Percentage of Entries with Fiber | ||
|---|---|---|
| min | max | median |
| Table 49: Carbohydrate Entries per Data Type | |
|---|---|
| Count | Percentage |
Key notes/important takeaways:
Below, you’ll find a table with three sections, the first describing the frequency of “min” values for each data type, the second containing frequencies of “max” values, and the third containing frequencies of “median” values for each of the listed nutrients per data type.
| Table 50: Count of Entries with Carbohydrates | ||
|---|---|---|
| min | max | median |
| Table 51: Percentage of Entries with Carbohydrates | ||
|---|---|---|
| min | max | median |
| Table 52: Carotenoid Entries per Data Type | |
|---|---|
| Count | Percentage |
Key notes/important takeaways:
Below, you’ll find a table with three sections, the first describing the frequency of “min” values for each data type, the second containing frequencies of “max” values, and the third containing frequencies of “median” values for each of the listed nutrients per data type.
| Table 53: Count of Entries with Carotenoids | ||
|---|---|---|
| min | max | median |
| Table 54: Percentage of Entries with Carotenoids | ||
|---|---|---|
| min | max | median |
| Table 55: Phytosterol Entries per Data Type | |
|---|---|
| Count | Percentage |
Key notes/important takeaways:
Below, you’ll find a table with three sections, the first describing the frequency of “min” values for each data type, the second containing frequencies of “max” values, and the third containing frequencies of “median” values for each of the listed nutrients per data type.
| Table 56: Count of Entries with Phytosterols | ||
|---|---|---|
| min | max | median |
| Table 57: Percentage of Entries with Phytosterols | ||
|---|---|---|
| min | max | median |
Part of Foundation is based on a dataset of branded foods that contains a significant amount of data not available for the other foods.
A column for market country has been added to the data for branded foods. However, all entries indicate the market country to be “United States.” It is possible that this shows an intention to collect more data from other market countries.
In the file market_acquisition we have a breakdown of where each of 5480 of the collected branded foods came from. (Note: the summary file claims there are only 5327 entries in this file, which is true when you download the data in Access format, but there are 5480 entries in the csv version of the file.) Specifically, we have a “store_state” and “store_city” for each acquired item; many items are duplicated due to having been acquired multiple times from different locations. The unique entries in the store_state variable are as follows:
## [1] "NE" "PA" "AZ" "IA" "TX" "CA"
## [7] "IN" "GA" "NJ" "KY" "NC" "FL"
## [13] "NY" "CO" "TN" "IL" "WA" "MA"
## [19] "MN" "MI" "WI" "KS" "AL" "CT"
## [25] "AR" "VA" "OH" "MO" "OK" ""
## [31] "Al" "WV" "NM" "NV" "MD" "Atlantic"
## [37] "West" "Plains" "Midwest"
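Of these 39 entries, 34 are abbreviations of state names, 4 are region descriptions, and 1 represents missing values. Note that “Al” appears to be a miscapitalized duplicate of “AL”, which would leave 33 distinct states.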
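The breakdown of the store_state values can be checked with a short Python sketch. The list is copied from the output above; the set of region names and the rule "anything else is a state" are assumptions made for this illustration.

```python
# Unique store_state values, copied from the output above.
store_states = [
    "NE", "PA", "AZ", "IA", "TX", "CA",
    "IN", "GA", "NJ", "KY", "NC", "FL",
    "NY", "CO", "TN", "IL", "WA", "MA",
    "MN", "MI", "WI", "KS", "AL", "CT",
    "AR", "VA", "OH", "MO", "OK", "",
    "Al", "WV", "NM", "NV", "MD", "Atlantic",
    "West", "Plains", "Midwest",
]

# Assumed region labels; everything else that is non-empty is treated as a state.
REGIONS = {"Atlantic", "West", "Plains", "Midwest"}


def classify(value):
    """Label a raw store_state value as 'missing', 'region', or 'state'."""
    cleaned = value.strip()
    if not cleaned:
        return "missing"
    if cleaned in REGIONS:
        return "region"
    return "state"


labels = [classify(v) for v in store_states]
tally = {label: labels.count(label) for label in ("state", "region", "missing")}

# Upper-casing merges "Al" with "AL", leaving only distinct state codes.
distinct_states = {v.strip().upper() for v in store_states if classify(v) == "state"}
```

Normalizing case before grouping by state avoids treating “Al” and “AL” as two separate locations.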
These acquisitions are far from evenly distributed across locations, which could introduce a confounding variable into any analysis if the products are not identical across all locations.
FNDDS uses the WWEIA food groups, which are split into 167 unique categories. The SR Legacy and Foundation foods follow the SR Legacy food groups, which are split into 28 unique categories. Branded has its own list of food groups, which contains 257 unique categories. No category name appears in all three lists. However, the food categories for FNDDS and Branded intersect on the following category names:
## [1] "Cheese" "Rice" "Tomatoes" "Pizza" "Coffee" "Beer"
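A comparison like this comes down to set intersections. The Python sketch below uses three very short stand-in lists, not the real 167, 28, and 257 categories, so the specific entries are illustrative assumptions.

```python
# Shortened, illustrative stand-ins for the three category lists.
wweia = {"Cheese", "Rice", "Pizza", "Milk, whole"}
sr_legacy = {"Dairy and Egg Products", "Beverages"}
branded = {"Cheese", "Rice", "Pizza", "Frozen Dinners"}

# Names present in every list (empty here, as in the full data).
in_all_three = wweia & sr_legacy & branded

# Names shared by the FNDDS (WWEIA) and Branded lists only.
fndds_and_branded = sorted(wweia & branded)
```

Exact string matching is strict: names that differ only in case or spacing would not count as shared unless they are normalized first.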
Ingredients for FNDDS foods are not available in the downloads from the FDC website; they are, however, downloadable from the ARS website (https://www.ars.usda.gov/northeast-area/beltsville-md-bhnrc/beltsville-human-nutrition-research-center/food-surveys-research-group/docs/fndds-download-databases/). From there you can download the file “FNDDSIngred”, which can be linked to the information provided by FoodData Central using the food_code variable. The data available from the ARS website also lets you link each ingredient to its SR Legacy equivalent and view its “Nutrient Value”, a number that ranks each food based on nutrient density. However, as the summary statistics below show, this variable has a problem with outliers: the median is 0.28 and the third quartile is 7.80, while the mean is 34.91 and the maximum is 45902, with several points in the thousands.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.28 34.91 7.80 45902.00
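To see how a few extreme points produce a summary like this, the Python sketch below computes the same six statistics on a small invented sample. The values are hypothetical, and the quartile method is an assumption chosen for the example.

```python
import statistics


def summarize(values):
    """Return min, quartiles, mean, and max for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
    }


def share_below(values, threshold):
    """Fraction of values strictly below the threshold."""
    return sum(v < threshold for v in values) / len(values)


# Hypothetical sample: mostly small values plus one extreme outlier.
values = [0.0, 0.0, 0.1, 0.3, 0.5, 2.0, 8.0, 12.0, 4000.0]
stats = summarize(values)
below_one = share_below(values, 1.0)
```

In this sample the single outlier pulls the mean far above the third quartile, the same pattern visible in the summary above. A helper such as `share_below` gives a direct check of any "X% of values are below 1" statement.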
Since all FNDDS foods are based on nutrient entries in SR Legacy and Foundation, the derivation of each nutrient will reflect that. From the data available on the ARS website (link above), we can download the files “FNDDSIngred”, “IngredNutVal”, and “DerivDesc” to determine which foods from SR Legacy and Foundation were used as ingredients, and from there look at the derivations of those foods.
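The two-step lookup described here (food to ingredients, then ingredients to derivations) might look like the following Python sketch. The field names, codes, and descriptions are all hypothetical placeholders; the real column headers in the three files should be checked before adapting it.

```python
# Hypothetical stand-ins for rows of FNDDSIngred, IngredNutVal, and DerivDesc.
# Field names and codes are assumptions, not the real column headers.
fndds_ingred = [
    {"food_code": 100, "ingredient_code": 1},
    {"food_code": 100, "ingredient_code": 2},
    {"food_code": 200, "ingredient_code": 2},
]
ingred_nut_val = [
    {"ingredient_code": 1, "nutrient_code": 203, "derivation_code": "D1"},
    {"ingredient_code": 2, "nutrient_code": 203, "derivation_code": "D2"},
]
deriv_desc = {
    "D1": "hypothetical derivation one",
    "D2": "hypothetical derivation two",
}


def derivations_for_food(food_code):
    """List (ingredient, nutrient, derivation description) for one FNDDS food."""
    ingredients = {
        row["ingredient_code"]
        for row in fndds_ingred
        if row["food_code"] == food_code
    }
    return sorted(
        (row["ingredient_code"], row["nutrient_code"], deriv_desc[row["derivation_code"]])
        for row in ingred_nut_val
        if row["ingredient_code"] in ingredients
    )


food_100 = derivations_for_food(100)
```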
The USDA Ag Data Commons website hosts a database of flavonoid content for many of the foods in the FDC database (3). These flavonoid values can be linked to Foundation and SR Legacy foods by NDB number. Combining the data gives us amounts of the following flavonoids for 25 foods in Foundation and 1613 foods in SR Legacy. In all cases, every flavonoid is provided for each food.
## [1] "Daidzein" "Genistein"
## [3] "Glycitein" "Cyanidin"
## [5] "Petunidin" "Delphinidin"
## [7] "Malvidin" "Pelargonidin"
## [9] "Peonidin" "(+)-Catechin"
## [11] "(-)-Epigallocatechin" "(-)-Epicatechin"
## [13] "(-)-Epicatechin 3-gallate" "(-)-Epigallocatechin 3-gallate"
## [15] "Theaflavin" "Thearubigins"
## [17] "Eriodictyol" "Hesperetin"
## [19] "Naringenin" "Apigenin"
## [21] "Luteolin" "Isorhamnetin"
## [23] "Kaempferol" "Myricetin"
## [25] "Quercetin" "Theaflavin-3,3'-digallate"
## [27] "Theaflavin-3'-gallate" "Theaflavin-3-gallate"
## [29] "(+)-Gallocatechin"
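Linking by NDB number is a key-based join. The Python sketch below shows one way to do it; the food rows, amounts, and field names are invented for illustration, and only the idea of a shared `ndb_number` key comes from the text.

```python
# Hypothetical food rows and flavonoid rows; only the shared key matters here.
foods = [
    {"data_type": "sr_legacy_food", "ndb_number": 1001, "description": "Food A"},
    {"data_type": "foundation_food", "ndb_number": 1002, "description": "Food B"},
    {"data_type": "sr_legacy_food", "ndb_number": 1003, "description": "Food C"},
]
flavonoid_rows = [
    {"ndb_number": 1001, "flavonoid": "Quercetin", "amount": 1.5},
    {"ndb_number": 1001, "flavonoid": "Kaempferol", "amount": 0.2},
    {"ndb_number": 1002, "flavonoid": "Quercetin", "amount": 0.0},
]


def link_by_ndb(foods, flavonoid_rows):
    """Attach a {flavonoid: amount} dict to every food that has flavonoid data."""
    by_ndb = {}
    for row in flavonoid_rows:
        by_ndb.setdefault(row["ndb_number"], {})[row["flavonoid"]] = row["amount"]
    return [
        {**food, "flavonoids": by_ndb[food["ndb_number"]]}
        for food in foods
        if food["ndb_number"] in by_ndb
    ]


linked = link_by_ndb(foods, flavonoid_rows)

# Number of linked foods per data type.
foods_per_type = {}
for food in linked:
    foods_per_type[food["data_type"]] = foods_per_type.get(food["data_type"], 0) + 1
```

Foods without a matching NDB number are dropped from `linked`, so counting the result per data type gives the number of foods that actually gained flavonoid values.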
After the release of the USDA’s Expanded Flavonoid Database for the Assessment of Dietary Intakes in 2015, alterations and additions were made to the USDA Database for the Flavonoid Content of Selected Foods in 2018. While the other supplemental databases are available on the USDA Ag Data Commons, this update was published solely on the USDA Agricultural Research Service website (4). The USDA Database for the Flavonoid Content of Selected Foods contains values for a total of 183 foods in SR Legacy and Foundation. Of those 183, 131 can also be found in the USDA’s Expanded Flavonoid Database for the Assessment of Dietary Intakes (meaning a total of 52 new foods have been added in this release). These newer values are assumed to be more accurate and, if added to the FDC data, should replace their previous versions.
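The count of new foods follows from 183 - 131 = 52. The replacement rule can be sketched in Python as a dictionary merge in which the newer release wins; the keys and amounts below are hypothetical.

```python
# Hypothetical values keyed by (NDB number, flavonoid).
expanded = {(1001, "Quercetin"): 1.0, (1002, "Quercetin"): 2.0}  # older release
selected = {(1002, "Quercetin"): 2.5, (1003, "Quercetin"): 0.7}  # newer release

# Later entries overwrite earlier ones, so the newer values take precedence.
merged = {**expanded, **selected}

foods_expanded = {ndb for ndb, _ in expanded}
foods_selected = {ndb for ndb, _ in selected}
overlap = foods_selected & foods_expanded    # foods present in both releases
new_foods = foods_selected - foods_expanded  # foods added by the newer release
```

With the real data, `len(overlap)` and `len(new_foods)` would be expected to reproduce the 131 and 52 reported above.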
| Table 58: Flavonoid Entries for SR Legacy Foods | |
|---|---|
| Compound | n |
| (-)-Epicatechin | 82 |
| (-)-Epicatechin 3-gallate | 75 |
| (-)-Epigallocatechin | 73 |
| (-)-Epigallocatechin 3-gallate | 74 |
| (+)-Catechin | 83 |
| (+)-Gallocatechin | 71 |
| Apigenin | 101 |
| Cyanidin | 48 |
| Delphinidin | 46 |
| Eriodictyol | 2 |
| Hesperetin | 39 |
| Isorhamnetin | 43 |
| Kaempferol | 132 |
| Luteolin | 112 |
| Malvidin | 38 |
| Myricetin | 122 |
| Naringenin | 37 |
| Pelargonidin | 39 |
| Peonidin | 38 |
| Petunidin | 37 |
| Quercetin | 162 |
| Theaflavin | 3 |
| Theaflavin-3'-gallate | 3 |
| Theaflavin-3,3'-digallate | 3 |
| Thearubigins | 3 |
| Table 59: Flavonoid Entries for Foundation Foods | |
|---|---|
| Compound | n |
| (-)-Epicatechin | 4 |
| (-)-Epicatechin 3-gallate | 4 |
| (-)-Epigallocatechin | 4 |
| (-)-Epigallocatechin 3-gallate | 4 |
| (+)-Catechin | 4 |
| (+)-Gallocatechin | 4 |
| Apigenin | 6 |
| Cyanidin | 2 |
| Delphinidin | 2 |
| Hesperetin | 2 |
| Isorhamnetin | 1 |
| Kaempferol | 7 |
| Luteolin | 6 |
| Malvidin | 2 |
| Myricetin | 7 |
| Naringenin | 2 |
| Pelargonidin | 2 |
| Peonidin | 2 |
| Petunidin | 2 |
| Quercetin | 7 |
The USDA Ag Data Commons website also hosts a database of proanthocyanidin content for many of the foods in the FDC database (5). These proanthocyanidin values can be linked to Foundation and SR Legacy foods by NDB number. The following tables list each type of proanthocyanidin and the number of foods in SR Legacy and Foundation for which we have entries.
| Table 60: Proanthocyanidin Entries for SR Legacy Foods | |
|---|---|
| Compound | n |
| Proanthocyanidin 4-6mers | 114 |
| Proanthocyanidin 7-10mers | 110 |
| Proanthocyanidin dimers | 130 |
| Proanthocyanidin polymers (>10mers) | 108 |
| Proanthocyanidin trimers | 124 |
| Table 61: Proanthocyanidin Entries for Foundation Foods | |
|---|---|
| Compound | n |
| Proanthocyanidin 4-6mers | 6 |
| Proanthocyanidin 7-10mers | 6 |
| Proanthocyanidin dimers | 6 |
| Proanthocyanidin polymers (>10mers) | 6 |
| Proanthocyanidin trimers | 6 |
The USDA Ag Data Commons website also hosts a database of isoflavone content for many of the foods in the FDC database (6). These isoflavone values can be linked to Foundation and SR Legacy foods by NDB number. The following tables list each type of isoflavone and the number of foods in SR Legacy and Foundation for which we have entries.
| Table 62: Isoflavone Entries for SR Legacy Foods | |
|---|---|
| Compound | n |
| Biochanin A | 59 |
| Coumestrol | 123 |
| Daidzein | 262 |
| Formononetin | 123 |
| Genistein | 262 |
| Glycitein | 143 |
| Total isoflavones | 259 |
| Table 63: Isoflavone Entries for Foundation Foods | |
|---|---|
| Compound | n |
| Biochanin A | 3 |
| Coumestrol | 8 |
| Daidzein | 15 |
| Formononetin | 8 |
| Genistein | 15 |
| Glycitein | 9 |
| Total isoflavones | 15 |
This data set has significant overlap with USDA’s Expanded Flavonoid Database for the Assessment of Dietary Intakes but provides additional information on “Biochanin A”, “Coumestrol”, “Formononetin”, and “Total isoflavones”.
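The overlap can be verified directly from the compound names listed in this document, as in the Python sketch below. The variable names are ours; the name lists are copied from the flavonoid output and the isoflavone tables above.

```python
# Compound names in the Expanded Flavonoid Database, as listed above.
expanded_flavonoid_names = {
    "Daidzein", "Genistein", "Glycitein", "Cyanidin", "Petunidin",
    "Delphinidin", "Malvidin", "Pelargonidin", "Peonidin", "(+)-Catechin",
    "(-)-Epigallocatechin", "(-)-Epicatechin", "(-)-Epicatechin 3-gallate",
    "(-)-Epigallocatechin 3-gallate", "Theaflavin", "Thearubigins",
    "Eriodictyol", "Hesperetin", "Naringenin", "Apigenin", "Luteolin",
    "Isorhamnetin", "Kaempferol", "Myricetin", "Quercetin",
    "Theaflavin-3,3'-digallate", "Theaflavin-3'-gallate",
    "Theaflavin-3-gallate", "(+)-Gallocatechin",
}

# Compound names in the isoflavone database, as listed in Tables 62 and 63.
isoflavone_names = {
    "Biochanin A", "Coumestrol", "Daidzein", "Formononetin",
    "Genistein", "Glycitein", "Total isoflavones",
}

shared = sorted(isoflavone_names & expanded_flavonoid_names)
only_in_isoflavone = sorted(isoflavone_names - expanded_flavonoid_names)
```

Here `shared` contains the three compounds covered by both databases, and `only_in_isoflavone` contains the four that the isoflavone database adds.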
1. U.S. Department of Agriculture, Agricultural Research Service. FoodData Central, 2019. fdc.nal.usda.gov.
2. U.S. Department of Agriculture, Agricultural Research Service. 2016. Nutrient Data Laboratory. USDA National Nutrient Database for Standard Reference, Release 28 (Slightly revised). Version Current: May 2016. http://www.ars.usda.gov/nea/bhnrc/mafcl
3. Bhagwat, Seema; Haytowitz, David B.; Wasswa-Kintu, Shirley. (2015). USDA’s Expanded Flavonoid Database for the Assessment of Dietary Intakes, Release 1.1 - December 2015. Nutrient Data Laboratory, Beltsville Human Nutrition Research Center, ARS, USDA. https://doi.org/10.15482/USDA.ADC/1324677. Accessed 2022-01-12.
4. Haytowitz, D.B., Wu, X., Bhagwat, S. 2018. USDA Database for the Flavonoid Content of Selected Foods, Release 3.3. U.S. Department of Agriculture, Agricultural Research Service. Nutrient Data Laboratory Home Page: http://www.ars.usda.gov/nutrientdata/flav
5. Bhagwat, Seema; Haytowitz, David B. (2015). USDA Database for the Proanthocyanidin Content of Selected Foods, Release 2 (2015). Nutrient Data Laboratory, Beltsville Human Nutrition Research Center, ARS, USDA. https://doi.org/10.15482/USDA.ADC/1324621. Accessed 2022-01-12.
6. Bhagwat, Seema; Haytowitz, David B. (2015). USDA Database for the Isoflavone Content of Selected Foods, Release 2.1 (November 2015). Nutrient Data Laboratory, Beltsville Human Nutrition Research Center, ARS, USDA. https://doi.org/10.15482/USDA.ADC/1324538. Accessed 2022-01-12.