NI’s goal is to advance nutrition science through data, and food composition data is at the heart of all nutrition knowledge. To advance our mission, we must first understand the data we have now. We began with what we assumed was a simple question: “How many published studies used USDA food composition data?” To answer it, we performed a scoping review (see details of scoping review here). To our surprise, the scoping review returned a low yield. After a sample review of 50 publications, we realized that USDA food composition data citations were either absent or so inconsistent that even a rigorous scoping review strategy could not identify relevant publications.
We concluded that researchers are not reliably citing USDA food composition databases, which led us to ask a broader question: what food composition databases are being used? To answer this question, we performed a manual review of the literature.
To complete this manual review, we focused on studies published by 5 top-tier journals in the year 2020, as ranked by Scimago (1). This ranking system was chosen specifically to broaden our inclusion of journals while focusing on the prestige of citing journals. The Scimago Journal Rank (SJR) measures the number of citations received by a journal, weighted by the prestige of the journals generating those citations. A journal’s SJR indicates the average number of weighted citations received during a selected year per document published during the previous three years.
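The "weighted citations per document" idea can be illustrated with a small sketch. This is a simplification rather than Scimago's actual algorithm: the real SJR derives prestige weights iteratively, whereas here they are assumed to be given. The function name, inputs, and numbers are all hypothetical.

```python
def weighted_citations_per_document(citations, prestige, documents_prev_3_years):
    """Simplified SJR-style ratio for one journal in one selected year.

    citations: dict of citing-journal name -> citations received from that
               journal during the selected year (hypothetical input).
    prestige:  dict of citing-journal name -> prestige weight (assumed given;
               the real SJR computes these iteratively).
    documents_prev_3_years: documents the journal published in the previous
               three years.
    """
    if documents_prev_3_years <= 0:
        raise ValueError("documents_prev_3_years must be positive")
    # Each citation counts in proportion to the prestige of the citing journal.
    weighted = sum(count * prestige[source] for source, count in citations.items())
    return weighted / documents_prev_3_years


# Made-up numbers, for illustration only.
example = weighted_citations_per_document(
    citations={"Journal A": 10, "Journal B": 40},
    prestige={"Journal A": 2.0, "Journal B": 0.5},
    documents_prev_3_years=20,
)
print(example)  # (10 * 2.0 + 40 * 0.5) / 20 = 2.0
```

In this toy example, ten citations from a high-prestige journal contribute as much as forty citations from a low-prestige one.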
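Scimago identifies the top 5 journals in the category nutrition and dietetics as (listed here alphabetically, as in Table 1): Advances in Nutrition; American Journal of Clinical Nutrition; Annual Review of Nutrition; International Journal of Behavioral Nutrition and Physical Activity; and Nutrition Reviews.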
Key notes/important takeaways:
We identified qualifying publications with a preliminary review of each article by title and abstract only. We employed a systematic approach to identify qualifying publications using keywords.
Keywords used to identify qualifying articles include: Nutrient, nutrients, nutrient density, nutrient profiling, nutrient composition, diet quality, nutrient values, macronutrients, micronutrients, vitamins, minerals, diet patterns, diet, meals, snacks, drinks, food, nutritional aspects, nutrient content, nutrition content, nutrient timing, nutrition requirements, dietary requirements, dietary behaviors, food behaviors, database, nutrient database, food database, weight loss, food data, nutrient data, food composition, composition data.
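A keyword screen of this kind could be sketched as follows. This is a minimal illustration and not the procedure the reviewers actually followed: the function names are hypothetical, only a subset of the keywords is included, and matching is assumed to be case-insensitive and anchored at the start of a word.

```python
import re

# A subset of the keywords listed above; a full screen would use the whole list.
KEYWORDS = ["nutrient", "diet quality", "food composition", "database", "micronutrients"]


def matched_keywords(title, abstract, keywords=KEYWORDS):
    """Return the keywords found in the title or abstract (case-insensitive)."""
    text = f"{title} {abstract}".lower()
    hits = []
    for keyword in keywords:
        # \b anchors the match at a word start, so "nutrient" also matches
        # "nutrients" but does not match in the middle of another word.
        if re.search(r"\b" + re.escape(keyword.lower()), text):
            hits.append(keyword)
    return hits


def qualifies(title, abstract, keywords=KEYWORDS):
    """A publication qualifies if at least one keyword is present."""
    return bool(matched_keywords(title, abstract, keywords))


print(matched_keywords("Diet quality in adults", "We used a food composition table."))
print(qualifies("Sleep and exercise", "A cohort study of runners."))
```

Returning the matched keywords, rather than only a yes/no flag, keeps a record of why each publication was retained.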
We identified the following:
Publications were further reviewed in detail by downloading and reading each publication in its entirety. All food and nutrition data sources were documented.
Food and nutrition data sources were evaluated based on the inclusion/exclusion criteria below.
Included Sources:
Excluded Sources:
Included data sources were investigated by trained data analysts and further information was collected directly from publisher websites. This information includes:
If additional sources of food and nutrient data were identified on publisher websites, they were added to the list of data sources and investigated for further information.
Collection was done manually by human researchers and thus is subject to error.
We were unable to locate information on a few sources. Despite spending a considerable amount of time searching, we were never able to fully confirm their existence.
Obscure data sources are as follows:
These sources are included in exploratory analysis. Although obscure, we assume that information provided by the authors is correct.
Below you’ll find a flow diagram of the detailed search strategy and publication selection process including inclusion and exclusion criteria of sources.
Figure 1: Search Strategy and Publication Selection Process
The data collected went through multiple rounds of quality analysis. Cleansing included manual verification of all entered data and characterization of excluded sources to the point of exhaustion by three trained data analysts, as well as the following computational cleansing processes.
| Computational Cleansing Process | Translation |
|---|---|
| The number of unique strings collected in each variable was computed and compared to the total number of entries collected | Confirm that data was collected on the correct number of publications |
| All entries for in-text and formal sources that were flagged as duplicate strings were identified and investigated manually | Inspect in-text and formal source entries for possible duplications in data entry |
| The number of NA or missing entries was computed for each variable and compared to the number of sources identified in each phase | Verify that no entry was unintentionally left blank |
| All categorical variables were mutated into factors and factor levels were examined | Confirm that all categorical responses were within a predefined list of possible options |
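The four checks in the table can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the analysts' actual code: the wording of the checks ("mutated into factors", "NA") hints at a different toolchain, and the record fields, helper names, and sample values here are hypothetical.

```python
from collections import Counter

MISSING = {None, "", "NA"}  # values treated as missing (assumption)


def unique_count(records, variable):
    """Number of distinct non-missing values recorded for a variable."""
    return len({r.get(variable) for r in records if r.get(variable) not in MISSING})


def duplicate_values(records, variable):
    """Values that occur more than once and should be inspected manually."""
    counts = Counter(r.get(variable) for r in records if r.get(variable) not in MISSING)
    return sorted(value for value, n in counts.items() if n > 1)


def missing_count(records, variable):
    """Number of entries left blank or marked NA for a variable."""
    return sum(1 for r in records if r.get(variable) in MISSING)


def unexpected_levels(records, variable, allowed):
    """Categorical responses that fall outside the predefined options."""
    return sorted({r.get(variable) for r in records
                   if r.get(variable) not in MISSING and r.get(variable) not in allowed})


# Hypothetical entries, for illustration only.
records = [
    {"publication_id": "P1", "formal_source": "Source A", "accessibility": "Open Access"},
    {"publication_id": "P2", "formal_source": "Source A", "accessibility": "Restricted Access"},
    {"publication_id": "P3", "formal_source": "NA", "accessibility": "open"},
]
allowed = {"Open Access", "Restricted Access", "Inaccessible"}

print(unique_count(records, "publication_id"))
print(duplicate_values(records, "formal_source"))
print(missing_count(records, "formal_source"))
print(unexpected_levels(records, "accessibility", allowed))
```

Each helper reports candidates for manual follow-up rather than correcting anything automatically, which mirrors the manual investigation described in the table.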
Below you’ll find Table 1, which displays the number of publications and sources identified in each phase.
| Table 1: Publication and Source Count in Phases 0-3 | | | | | | | |
|---|---|---|---|---|---|---|---|
| Journal | Publications, Phase 0 | Publications, Phase 1 | Publications, Phase 2 | Publications, Phase 3 | Sources, Phase 1 | Sources, Phase 2 | Sources, Phase 3 |
| Advances in Nutrition | 173 | 87 | 12 | 12 | 133 | 51 | 46 |
| American Journal of Clinical Nutrition | 391 | 164 | 81 | 71 | 230 | 145 | 121 |
| Annual Review of Nutrition | 18 | 9 | 1 | 1 | 9 | 1 | 1 |
| International Journal of Behavioral Nutrition and Physical Activity | 163 | 23 | 7 | 7 | 38 | 22 | 22 |
| Nutrition Reviews | 165 | 42 | 6 | 6 | 63 | 27 | 27 |
| Total | 910 | 325 | 107 | 97 | 473 | 246 | 217 |
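The totals in Table 1 can be checked with simple column sums. The sketch below copies the counts from the table; the variable and function names are illustrative only.

```python
# Counts copied from Table 1. Each row holds publications for Phases 0-3,
# followed by sources for Phases 1-3.
TABLE_1 = {
    "Advances in Nutrition": [173, 87, 12, 12, 133, 51, 46],
    "American Journal of Clinical Nutrition": [391, 164, 81, 71, 230, 145, 121],
    "Annual Review of Nutrition": [18, 9, 1, 1, 9, 1, 1],
    "International Journal of Behavioral Nutrition and Physical Activity": [163, 23, 7, 7, 38, 22, 22],
    "Nutrition Reviews": [165, 42, 6, 6, 63, 27, 27],
}
REPORTED_TOTALS = [910, 325, 107, 97, 473, 246, 217]


def column_sums(table):
    """Sum each column across journals."""
    return [sum(column) for column in zip(*table.values())]


computed = column_sums(TABLE_1)
print(computed == REPORTED_TOTALS)  # True: every column matches the Total row

# Share of Phase 0 publications still present in Phase 3.
retained_share = REPORTED_TOTALS[3] / REPORTED_TOTALS[0]
print(f"{retained_share:.1%}")  # 10.7%

# Sources dropped between Phase 2 and Phase 3.
print(REPORTED_TOTALS[5] - REPORTED_TOTALS[6])  # 29
```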
We discovered that authors would refer to food composition data, such as the nutrient value of a particular item, but would either not provide the data source or describe it only generically, for example as a “local food table”. In these circumstances, we know that food composition data was used but have no way to identify the source.
Below is a characterization of the excluded data sources.
Key notes/important takeaways:
There are 29 unidentifiable data sources.
Data sources were not referred to consistently, so they had to be grouped before they could be analyzed.
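One way such grouping could be approached is to normalize each reported name and look it up in an alias table. The sketch below is hypothetical: the alias entries and example strings are invented for illustration and do not reproduce the actual grouping used in this review, which was done manually.

```python
import re


def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(cleaned.split())


# Hypothetical alias table: normalized variant -> group label.
ALIASES = {
    "usda standard reference": "SR",
    "usda sr": "SR",
    "standard reference": "SR",
}


def group_sources(names, aliases=ALIASES):
    """Group raw source names by label; unknown names stay 'Ungrouped'."""
    groups = {}
    for name in names:
        label = aliases.get(normalize(name), "Ungrouped")
        groups.setdefault(label, []).append(name)
    return groups


print(group_sources(["USDA-SR", "Standard Reference.", "local food table"]))
```

Names that fall into "Ungrouped" would still need manual review, since generic descriptions cannot be mapped to a known source.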
The tables below allow us to review the title of each source individually and by group. The common name refers to the name of the data source as it appears in its published format (book, website, software, etc.).
Key notes/important takeaways:
For each of the 175 data sources, we manually searched Google (Chrome browser) for surface-level information provided directly by the data source; websites, books, and manuals published directly by the author were included. All collected variables will be explained as they become relevant.
Table 5: Data Sources Organized by Category
Figure 2: Frequency of Data Sources By Category
The Accessibility of each source indicates whether the data can be freely accessed by researchers or the general public.
For each data source, we identified its form of Accessibility and grouped the sources into 3 categories: “Open Access”, “Restricted Access”, and “Inaccessible”. “Open Access” encompasses sources that can be accessed for free with no authorization, “Restricted Access” encompasses sources that could be located but are blocked by a paywall or require authorization, and “Inaccessible” encompasses all data sources that could not be located or no longer have a point of access.
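The three categories can be expressed as a small decision rule. This is a sketch of the definitions above, assuming each source has been described by three yes/no observations; the function and parameter names are hypothetical.

```python
def classify_accessibility(located, free, requires_authorization):
    """Map three yes/no observations about a source to an Accessibility category.

    located: a working point of access to the source was found.
    free: the data can be obtained without payment.
    requires_authorization: access requires a login, license, or approval.
    """
    if not located:
        # Could not be located, or no longer has a point of access.
        return "Inaccessible"
    if free and not requires_authorization:
        return "Open Access"
    # Located, but behind a paywall or an authorization step.
    return "Restricted Access"


print(classify_accessibility(located=True, free=True, requires_authorization=False))
print(classify_accessibility(located=True, free=False, requires_authorization=False))
print(classify_accessibility(located=False, free=False, requires_authorization=False))
```

Checking `located` first keeps the categories mutually exclusive: a source that cannot be found is "Inaccessible" regardless of its pricing or authorization model.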
Table 6: Data Sources Organized by Accessibility
Figure 3: Frequency of Data Sources By Accessibility
Figure 4: Format of Data Sources by Accessibility
We were able to identify exportable file formats for 112 data sources.
Interactivity describes whether the original source of data has an interactive interface rather than static tables. The form of interaction can be as simple as the ability to search and filter results or as complex as the ability to run algorithms or generate graphs.
Table 7: Data Sources Organized by Interactivity and Format
Figure 5: Frequency of Data Sources By Interactivity
Here, the country of origin is the country in which the data source was created, regardless of where the data was collected or which country it pertains to.
Table 8: Data Sources Organized by Country of Origin
Figure 6: Frequency of Data Sources By Country of Origin
Table 9: Data Sources Organized by Years in Use
Table 10: Data Sources Organized by Update Frequency
USDA has led the way in food composition data access and Accessibility for the past 60 years.
To get a picture of USDA influence, the data sources were split into the following groups: “Includes USDA Data”, “Likely Includes USDA Data”, “Does Not Include USDA Data”, and “Unknown”.
“Likely Includes USDA Data” refers to situations in which it is suspected, but not clearly stated, that USDA data was used. For example, a source might state, “International food data sources were included.”
Table 11: Data Sources Organized by Relationship to The USDA
Figure 7: Frequency of Data Sources by Relationship to The USDA
The keywords that appear most often in publications citing food and nutrient data were taken as an indicator of how the data is commonly used.
Figure 8: Common Keywords
| Table 12: Top 15 Most Utilized Keywords | |
|---|---|
| Publication_Keywords | Frequency |
| Diet | 37 |
| Sugar-Sweetened Beverages | 20 |
| Adult | 19 |
| Beverages | 18 |
| Coffee | 18 |
| Dietary Guidelines | 17 |
| Alcohol | 16 |
| Juice | 16 |
| Milk | 16 |
| Tea | 16 |
| Water | 16 |
| Biomarkers | 15 |
| Carotenoids | 14 |
| Dietary Assessment | 14 |
| Fruit And Vegetable Intake | 14 |
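Keyword frequencies like those in Table 12 could be tallied along the following lines. This is a sketch under stated assumptions: each keyword is counted once per publication, keywords are title-cased before counting, and ties are ordered alphabetically. The function name and the sample input are hypothetical.

```python
from collections import Counter


def top_keywords(publications, n=15):
    """Return the n most frequent keywords as (keyword, frequency) pairs.

    publications: one list of keyword strings per publication.
    """
    counts = Counter()
    for keywords in publications:
        # Title-case and de-duplicate so a keyword counts once per publication.
        counts.update({kw.strip().title() for kw in keywords if kw.strip()})
    # Sort by descending frequency, then alphabetically to break ties.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


# Made-up keyword lists, for illustration only.
sample = [["diet", "coffee"], ["Diet", "tea"], ["diet ", " Tea"]]
print(top_keywords(sample, n=2))  # [('Diet', 3), ('Tea', 2)]
```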
Variables pertaining only to food composition sources were collected in order to compare the scope or size of each food composition data source. For each food composition data source, we collected the estimated number of foods and/or beverages recorded and the estimated number of measurements provided for each food or beverage.
“If a tree falls in a forest and no one is around to hear it, does it make a sound?” If an excellent source of food and nutrient data exists but no one applies that data in research, does the quality of that data matter? The quality of a data source matters only to the extent that the source is utilized and accepted by the research community as a whole. While we have confirmed that 177 food and nutrient data sources are currently being used, they are not all utilized to the same extent. The 5 data sources most utilized in publications are NHANES, SR, FDC, NDSR, and WWEIA, all of which include USDA data. Therefore, by characterizing the current landscape of USDA data, we can establish the standards of food and nutrient data currently in use by the research community.
Table 14: Data Sources Using USDA Data Organized by Category
Table 15: Data Sources Using USDA Data Organized by Country of Origin
Figure 9: Frequency of Data Sources That Include USDA Data by Country of Origin
Table 16: Data Sources Using USDA Data Organized by Accessibility
Table 17: Data Sources Using USDA Data Organized by Interactivity
Figure 10: Common Keywords of Sources Using USDA Data
| Table 18: Top 15 Most Utilized Keywords | |
|---|---|
| Publication_Keywords | Frequency |
| Diet | 23 |
| Sugar-Sweetened Beverages | 16 |
| Beverages | 14 |
| Coffee | 14 |
| Dietary Guidelines | 13 |
| Alcohol | 12 |
| Biomarkers | 12 |
| Juice | 12 |
| Milk | 12 |
| Tea | 12 |
| Water | 12 |
| Animal Protein | 11 |
| Carotenoids | 11 |
| Dietary Assessment | 11 |
| Fruit And Vegetable Intake | 11 |