This report serves to characterize the quality and quantity of food composition data in FoodData Central (FDC) as of the October 2022 release. Data used for analysis was retrieved on 12/27/2022 and reflects fixes made to missing fields on 12/8/2022.
Data from FDC can be exported as comma-delimited (CSV) ASCII files or as JavaScript Object Notation (JSON) files. Although both download types should contain identical data, their contents have been known to differ significantly, for example through missing variables. Therefore, both file types are investigated.
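One way to check whether the two exports carry the same variables is to compare their field names directly. The sketch below is illustrative only: the sample rows, the column names, and the `rename` mapping between CSV and JSON naming conventions are assumptions made for the example, not taken from the FDC files.

```python
import csv
import io
import json


def csv_fields(csv_text):
    """Return the set of column names in a CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return set(reader.fieldnames or [])


def json_fields(records):
    """Return the set of top-level keys seen across a list of JSON records."""
    keys = set()
    for record in records:
        keys.update(record.keys())
    return keys


def compare_fields(csv_text, records, rename=None):
    """Report fields that appear in only one of the two exports.

    `rename` maps CSV column names to their JSON equivalents. It is a
    hypothetical, analyst-supplied mapping, needed because the two
    formats may not share a naming convention.
    """
    rename = rename or {}
    csv_names = {rename.get(name, name) for name in csv_fields(csv_text)}
    json_names = json_fields(records)
    return {
        "csv_only": sorted(csv_names - json_names),
        "json_only": sorted(json_names - csv_names),
    }


# Tiny made-up sample, not real FDC data.
sample_csv = "fdc_id,description,publication_date\n1,Example food,2022-10-28\n"
sample_json = json.loads('[{"fdcId": 1, "description": "Example food"}]')
field_report = compare_fields(
    sample_csv,
    sample_json,
    rename={"fdc_id": "fdcId", "publication_date": "publicationDate"},
)
```

In this made-up sample the publication date appears only in the CSV, which is the kind of missing-variable discrepancy described above.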
Since very few fields/variables are consistent across data types (Foundation, FNDDS, Branded, SR Legacy), data types will be analyzed separately.
Data on Foundation Foods comes directly from the USDA Agricultural Research Service (ARS) and is only published on FDC. This data type comprises analytically derived values for nutrients and other food components. Of the FDC data types, Foundation has the most transparent food composition data. This transparency is a direct result of the extensive metadata, including number of samples, sampling location, date of collection, analytical approaches used, and, if appropriate, agricultural information such as genotype and production practices.
As of this release, there are 210 foundation foods. Of the 210 foods, 51 (24.3%) were added in the last year.
Below you will find a characterization of nutrient measures provided for foundation foods.
Figure 1: Frequency of Nutrient Measures in Foundation
Key notes/important takeaways:
The minimum year each food sample was acquired is provided for foundation foods rather than the minimum date associated with each food or nutrient measure. However, this information is only provided for samples of 85 foundation foods.
Acquisition dates of foundation foods obtained through agricultural acquisition are provided in the CSV version of the data, but the number of foods with acquisition dates is rather limited.
The expiration date of foundation foods obtained through market acquisition can also be found within the CSV files. However, this information is unreliable as an indicator of age for the nutrient measures.
Figure 2: Frequency of Sample Nutrient Measures by Minimum Year Acquired in Foundation
Key notes/important takeaways:
Three variables in the FoodData Central data identify the origin of each nutrient measure. The derivation description is the method by which each nutrient measure was derived. The source code and source description identify the overall origin of the nutrient measures.
Note that these variables are missing for 12 nutrient measures (0.1% of measures in foundation).
Table 3: Source and Derivation of Nutrient Measures in Foundation
Foundation uses the same food categories as SR, excluding “American Indian/Alaska Native Foods”, “Baby Foods”, “Breakfast Cereals”, “Fast Foods”, “Lamb, Veal, and Game Products”, “Meals, Entrees, and Side Dishes”, and “Snacks”.
Table 5: Nutrient Measures per Food Category in Foundation
Here we investigate the level of completeness for nutrient profiles in foundation. This analysis will be assessed by the number of nutrient measures provided for each food.
This will be carried out using 3 assessments of completeness:
Figure 3: Total Nutrient Measures per Food in Foundation
Figure 4: Essential Nutrient Measures per Food in Foundation
Figure 5: NFP Nutrient Measures per Food in Foundation
In Figure 5, the median and mean are identical so only the median was depicted.
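The per-food counts summarized in Figures 3 to 5 (total, essential, and NFP nutrient measures) can be sketched as below. This is a minimal illustration, not the code used for this report: the input is assumed to be (food id, nutrient id) pairs, and the `ESSENTIAL` set is a placeholder whose contents are made up for the example.

```python
from collections import defaultdict
from statistics import mean, median


def measures_per_food(rows, nutrient_subset=None):
    """Count distinct nutrient measures per food.

    rows: iterable of (fdc_id, nutrient_id) pairs.
    nutrient_subset: optional set of nutrient ids to restrict the count to,
    for example essential nutrients or NFP nutrients.
    """
    seen = defaultdict(set)
    for fdc_id, nutrient_id in rows:
        nutrients = seen[fdc_id]  # registers the food even if nothing matches
        if nutrient_subset is None or nutrient_id in nutrient_subset:
            nutrients.add(nutrient_id)
    return {fdc_id: len(nutrients) for fdc_id, nutrients in seen.items()}


# Made-up rows and a placeholder subset; real ids and lists will differ.
rows = [(1, "protein"), (1, "fat"), (1, "caffeine"), (2, "protein")]
ESSENTIAL = {"protein", "fat"}

total_counts = measures_per_food(rows)
essential_counts = measures_per_food(rows, ESSENTIAL)
total_median = median(total_counts.values())
total_mean = mean(total_counts.values())
```

Passing a different subset (such as the nutrients on the Nutrition Facts Panel) gives the third assessment without changing the function.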
Each data type was created for a different purpose and has its own set of unique variables. Foundation exists to monitor variation in nutrient and food composition profiles. It is the only data type that consists of direct and transparent USDA analysis and as such is highly reliable.
While there are relatively few foods in this data type, the USDA is actively working to expand the scope of foundation. In the meantime, what foundation lacks in quantity, it makes up for in quality.
Variables unique to this data type monitor the collection and analysis of multiple samples of each food.
The Food and Nutrient Database for Dietary Studies (FNDDS) is the primary food composition database used for What We Eat in America (WWEIA) and the National Health and Nutrition Examination Survey (NHANES). Typically, a new version of each of the 3 data sets (FNDDS, WWEIA, NHANES) is released every 2 years, though the pandemic caused a notable interruption in releases. In this release of FDC, the 2019-2020 FNDDS data is provided.
FNDDS is curated by the USDA Food Surveys Research Group in Beltsville, MD. The initial unabridged releases of FNDDS can be found on the ARS website. Additional Excel sheets tracking discontinued or changed food codes between releases are also provided by the Food Surveys Research Group.
As of the current release, there are 5624 FNDDS foods.
Below you’ll find all nutrient measures available in FNDDS. Unlike Foundation, Branded, and SR Legacy, FNDDS provides information on the same 65 nutrient measures for every food with no exceptions.
Figure 6: Frequency of Nutrient Measures in FNDDS
Key notes/important takeaways:
Two variables are provided for the age of nutrient measures in FNDDS: the start date and end date of each sample. All samples started on 1/1/2019 and ended on 12/31/2020.
Note that these are not the age of the nutrient measures but an indicator of which NHANES/WWEIA cycle the FNDDS data is associated with. FNDDS is mainly derived from nutrient measures in Foundation and SR Legacy. To accurately assess the age of FNDDS nutrient measures, consult the “Age of Measurements” section for Foundation and SR Legacy.
There is no variable for the origin of the nutrient measurements in FNDDS. Instead, we are given a list of foods in SR Legacy and Foundation that were used as components to calculate the nutrient measurements in FNDDS. However, the FNDDS documentation states that for a few ingredient codes, a source other than SR Legacy or Foundation was the basis for either all nutrients or only select nutrients.
Further documentation of nutrient value sources appears in the table “fndds_ingredient_nutrient_value”. A majority of nutrient values are gathered from SR Legacy; as such, SR Legacy is the best indicator of age for FNDDS.
This version of FNDDS is split into 169 WWEIA food categories. These food categories are further grouped into 15 broader groups in a PDF on the ARS website.
Here we investigate the level of completeness for nutrient profiles in FNDDS. Completeness is assessed by the number of nutrient measures provided for each food.
This will be carried out using 3 assessments of completeness:
Since the same nutrient measures are provided for every food in FNDDS, figures are not necessary here. Every FNDDS food has 65 associated nutrient measures including 24 essential nutrients and 16 NFP nutrients.
FNDDS was created to monitor the food and nutrient intake of Americans and as such contains standardized nutrient profiles of commonly consumed foods. While this information is less transparent than foundation, it contains a consistent and well-sized nutrient profile for each food and is well-suited for its purpose.
While there are very few (if any) unique variables related to food composition in FNDDS, its value primarily lies in its links to NHANES and WWEIA.
The Branded Foods data type contains data from labels of national and international branded foods collected by a public-private partnership. Partners submit label data to Label Insight or 1WorldSync through the Global Data Synchronization Network. The USDA standardizes NFP nutrient measures by calculating values per 100 grams. Otherwise, all data appears exactly as it was submitted, which can cause inconsistencies and confusion.
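The per-100-gram standardization is simple proportional scaling. The sketch below shows only the arithmetic, assuming the serving size is already expressed in grams; it is not the USDA's actual procedure, and the function name is hypothetical.

```python
def per_100_g(amount_per_serving, serving_size_g):
    """Scale a label value from one serving to 100 grams of product.

    Assumes serving_size_g is the serving size in grams. Servings given
    in other units (for example millilitres) would need converting first.
    """
    if serving_size_g <= 0:
        raise ValueError("serving size must be positive")
    return amount_per_serving * 100.0 / serving_size_g


# Made-up label: 3 g of protein in a 30 g serving.
protein_per_100_g = per_100_g(3, 30)
```

For the made-up label above, 3 g per 30 g serving scales to 10 g per 100 g.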
Members of the Public-Private Partnership are:
While this data is updated monthly, new CSV and JSON data downloads are only provided every 6 months (in April and October). The API can be used to access new data before the releases are added to the downloads page.
This version of branded contains a total of 393327 foods.
Below you’ll find all the nutrient measures in Branded. Note that 11880 foods in branded have NO associated nutrient measures.
Figure 7: Frequency of Nutrient Measures in Branded
Branded offers two measures of age: “modifiedDate”, which is the last date the food was altered by the manufacturer, and “availableDate”, which is the date the food was made available for inclusion in the database.
Unfortunately, in the JSON files the “availableDate” was mistakenly overwritten with the “modifiedDate”. This means that “availableDate” and “modifiedDate” are identical.
This overwritten information is also reflected on the FDC website.
Within the branded_food csv table, 7 foods are missing a modified_date.
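A check of this kind can be sketched as follows. The record layout is an assumption: each branded food is taken to be a dictionary carrying the “modifiedDate” and “availableDate” keys as plain strings, and the sample records are made up.

```python
def summarize_dates(records):
    """Count foods whose two dates match and foods with no modified date."""
    identical = 0
    missing_modified = 0
    for record in records:
        modified = record.get("modifiedDate")
        available = record.get("availableDate")
        if not modified:
            missing_modified += 1
        elif modified == available:
            identical += 1
    return {
        "total": len(records),
        "identical": identical,
        "missing_modified": missing_modified,
    }


# Made-up records, not real FDC data.
sample_records = [
    {"modifiedDate": "1/15/2021", "availableDate": "1/15/2021"},
    {"modifiedDate": "3/2/2020", "availableDate": "6/9/2019"},
    {"availableDate": "6/9/2019"},
]
date_summary = summarize_dates(sample_records)
```

If every record with a modified date falls into the "identical" count, the available date carries no independent information.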
Figure 8: Modified Date of Branded Foods
Figure 9: Available Date of Branded Foods
As mentioned previously, all measurements are submitted by partner companies. As such the source of all foods in branded is “Manufacturer’s analytical; partial documentation”. More information is provided through the derivation description variable.
Beyond analyzing the derivation variables, we can look at the data_source variable in the branded_food table. This variable indicates whether each food came from GS1 or Label Insight. Notably, 95.9% of the foods in this release came from Label Insight.
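A share like the one above can be computed with a simple tally. In this sketch the rows are assumed to be dictionaries with a data_source key, and the labels in the sample are made up; the codes actually stored in the table may differ.

```python
from collections import Counter


def source_shares(rows, key="data_source"):
    """Return the percentage of rows per data source, rounded to 1 decimal."""
    counts = Counter(row.get(key) for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {source: round(100.0 * n / total, 1) for source, n in counts.items()}


# Made-up rows; the labels are illustrative, not the codes used by FDC.
sample_rows = [
    {"data_source": "Label Insight"},
    {"data_source": "Label Insight"},
    {"data_source": "Label Insight"},
    {"data_source": "GS1"},
]
shares = source_shares(sample_rows)
```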
Since the food categories of branded foods are provided by partner companies, groupings can be sporadic and confusing. There are 337 different branded food categories; one of these categories is an empty string.
Here we investigate the level of completeness for nutrient profiles in Branded. Completeness is assessed by the number of nutrient measures provided for each food.
This will be carried out using 3 assessments of completeness:
Figure 10: Total Nutrient Measures per Food in Branded
Figure 11: Essential Nutrient Measures per Food in Branded
Figure 12: NFP Nutrient Measures per Food in Branded
Branded contains a variety of information regarding food labeling. A plethora of label-related variables are unique to this data type including the brand name, UPC, package weight, and most importantly ingredient lists. While branded is severely limited in the nutrition composition it provides, it is the only data type to formally include ingredient lists.
In addition, branded contains a significantly higher number of foods than any other data type. While the data is riddled with errors and lacks any quality control or standardization, it is absolutely massive. If you’re looking for quantity over quality, then branded is for you.
The US Department of Agriculture (USDA) National Nutrient Database for Standard Reference is the major source of food composition data in the United States and provides the foundation for most food composition databases in the public and private sectors. It is comprised of historic data derived from various sources of published literature.
SR Legacy has not been updated since its release by the ARS in April of 2018. The original and complete version of the data has since been removed from the ARS website and can now be found in the United States government data catalog and the USDA Ag Data Commons website.
In total, SR Legacy provides nutrient information on 7793 foods.
Below you’ll find the nutrient measures and corresponding units for all foods in SR legacy.
Figure 13: Frequency of Nutrient Measures in SR Legacy
FDC does not provide any variables for the age of nutrient measures in SR.
To get the dates for SR Legacy, we have to go back to the original SR Legacy data download on the USDA Ag Data Commons website. There, the file NUT_DATA provides a variable listed as AddMod_Date, which specifies the last modified date for each nutrient entry.
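Once the AddMod_Date values have been extracted, they can be tallied by year. The sketch below assumes the dates arrive as month/year strings such as "11/1976"; that format is an assumption, so the `fmt` argument should be adjusted if the file stores dates differently. Blank entries are counted under `None`.

```python
from collections import Counter
from datetime import datetime


def year_counts(date_strings, fmt="%m/%Y"):
    """Tally date strings by year; blank entries are counted under None."""
    counts = Counter()
    for raw in date_strings:
        raw = raw.strip()
        if not raw:
            counts[None] += 1
            continue
        counts[datetime.strptime(raw, fmt).year] += 1
    return dict(counts)


# Made-up values in the assumed month/year format.
sample_dates = ["11/1976", "04/2013", "05/2013", ""]
counts_by_year = year_counts(sample_dates)
```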
Table 12: Nutrient Measure Additions and Modifications in SR Legacy
Figure 14: Nutrient Measure Additions and Modifications in SR Legacy
Key notes/important takeaways:
Three variables in the FoodData Central data identify the origin of each nutrient measure. The derivation description is the method by which each nutrient measure was derived. The source code and source description identify the overall origin of each derivation method.
Table 13: Source and Derivation of Nutrient Measures in SR Legacy
Key notes/important takeaways:
There are 25 food categories present in SR Legacy. Below you’ll find a breakdown of how many foods and nutrient measures were collected for each food category.
Table 15: Nutrient Measures per Food Category in SR Legacy
Key notes/important takeaways:
Here we investigate the level of completeness for nutrient profiles in SR Legacy. Completeness is assessed by the number of nutrient measures provided for each food.
This will be carried out using 3 assessments of completeness:
Figure 15: Total Nutrient Measures per Food in SR Legacy
Figure 16: Essential Nutrient Measures per Food in SR Legacy
Figure 17: NFP Nutrient Measures per Food in SR Legacy
SR Legacy may not have the most foods or the most nutrient measures, but it is the current standard of food composition and is widely used by the research community. This data type provides (on average) the largest and most complete nutrient profiles for each food, though the quality of that data may be hard to monitor since it spans so many sources. Additionally, in its unabridged form this data set provides a comprehensive list of sources that can be used to track the origin of every provided value.
Given the data in this analysis, we will now address a few running themes and put this analysis into the greater context of the current data landscape. This will be done through answering the following questions:
Many structural changes have been introduced in the FDC data over the past year. In October 2021 FDC discontinued the Access format and introduced downloadable JSON files. The initial jump to the new format was rocky, but the quality of the new JSON files has improved significantly.
That being said, the JSON files are still riddled with notable errors, including but not limited to:
Beyond structural changes, new variables and tables have been added to the database. New data is now being collected about microbes! There isn’t much microbe data yet but it’s a very promising sign indicating future growth. Additionally, further variables characterizing the branded foods have been created such as discontinued_date, preparation_state_code, trade_channel, and short_description. While these variables have not been populated yet, they hint toward future expansion.
Although not necessarily new, additional FNDDS variables were provided in the FDC download files with this release. In past releases, FDC did not include the derivation variables for FNDDS. While there are more variables in FNDDS and SR Legacy that can be brought in from the source material, this is a positive gesture of increasing transparency and reusability of the data.
Until now, we have considered the 4 data types independent of one another. Now that we understand them individually, we can take a step back and look at how they interconnect. Foundation, SR Legacy, and FNDDS are all heavily intertwined, while Branded is siloed away from the others.
Let us first consider the relationships between Foundation and SR Legacy. Structurally, they are quite similar. They share food categories, have similar naming structures, and both index foods using NDB number (nutrient database number). The main difference between them lies in how each of them was collected. Foundation started being released once SR Legacy was discontinued and serves as almost a sequel or supplement. There are foods in the 2 data types that share an NDB number, allowing for 2 nutrient profiles of the same food to be used in conjunction.
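Pairing the two profiles of a shared food amounts to a join on the NDB number. The sketch below assumes each food is a dictionary with an `ndb_number` key; that key name and the sample records are hypothetical, so the real column name should be substituted when working with the download files.

```python
def link_by_ndb(foundation, sr_legacy, key="ndb_number"):
    """Pair Foundation and SR Legacy foods that share an NDB number.

    Returns a list of (foundation_food, sr_legacy_food) tuples. Foods
    without a counterpart in the other data type are left out.
    """
    sr_index = {food[key]: food for food in sr_legacy}
    return [
        (food, sr_index[food[key]])
        for food in foundation
        if food[key] in sr_index
    ]


# Made-up records, not real FDC data.
foundation_foods = [
    {"ndb_number": 100, "description": "Example food A (Foundation)"},
    {"ndb_number": 200, "description": "Example food B (Foundation)"},
]
sr_legacy_foods = [
    {"ndb_number": 100, "description": "Example food A (SR Legacy)"},
    {"ndb_number": 300, "description": "Example food C (SR Legacy)"},
]
linked = link_by_ndb(foundation_foods, sr_legacy_foods)
```

Only the food with NDB number 100 appears in both made-up lists, so it is the only pair returned.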
Both Foundation and SR Legacy feed into the data used to create nutrient profiles in FNDDS. As such, many FNDDS foods can be deconstructed into a collection of SR and Foundation nutrient profiles.
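As a rough illustration of what such a deconstruction implies, a nutrient value for a mixed food can be approximated as the weight-averaged value of its ingredients. This is a simplified sketch under stated assumptions, not the FNDDS calculation itself: it ignores any moisture, fat, or retention adjustments, and the ingredient weights and values are made up.

```python
def blended_value_per_100_g(ingredients):
    """Weight-averaged nutrient value per 100 g of a mixed food.

    ingredients: list of (weight_g, value_per_100_g) pairs, where each
    value would come from the ingredient's SR Legacy or Foundation profile.
    """
    total_weight = sum(weight for weight, _ in ingredients)
    if total_weight <= 0:
        raise ValueError("total ingredient weight must be positive")
    total_amount = sum(weight * value / 100.0 for weight, value in ingredients)
    return total_amount * 100.0 / total_weight


# Made-up recipe: 50 g of an ingredient with 10 g protein per 100 g,
# plus 150 g of an ingredient with 2 g protein per 100 g.
blended_protein = blended_value_per_100_g([(50, 10), (150, 2)])
```

The made-up recipe contains 8 g of protein in 200 g of food, or 4 g per 100 g.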
The NDB number used to link SR Legacy and Foundation is far from a one-off. The USDA has released multiple databases that can be used to expand the nutrient profiles of SR Legacy and Foundation using this index. Using the “Specific Component Data Sets” available from the ARS webpage, nutrient profiles in SR Legacy and Foundation can be expanded to include further proanthocyanidin, flavonoid, choline, iodine, fluoride, isoflavone, and sodium measurements. Soon the USDA will even be releasing a data set on glucosinolates!
These specific component data sets have also been used to generate flavonoid profiles for FNDDS foods. However, the FNDDS flavonoid data is often released much later than the main FNDDS releases: the flavonoid profiles of the 2017-2018 version of FNDDS were only released in late summer of 2022.
These additions to FNDDS, SR Legacy, and Foundation improve the completeness of these data types and broaden the possible applications of the data.
Ultimately, with this new release of FDC we see very productive and helpful changes in expansion of food and nutrient composition data, with a clear push toward transparency and accountability. This bodes well for the future of nutrition science overall!