
1 Introduction


NI’s goal is to advance nutrition science through data, and food composition data is at the heart of all nutrition knowledge. To advance our mission, we must first understand the data we have now. We began with what we assumed was a simple question: “How many published studies used USDA food composition data?” To answer it, we performed a scoping review (see details of scoping review here). To our surprise, the scoping review returned a very low yield. After a sample review of 50 publications, we realized that USDA food composition data citations were either absent or so inconsistent that even a rigorous scoping review strategy could not identify relevant publications.

We concluded that researchers are not citing USDA food composition databases, which led us to ask a broader question: what food composition databases are being used? To answer this question, we performed a manual review of the literature.

To complete this manual review, we focused on studies published in 2020 by 5 top-tier journals, as ranked by Scimago (1). This ranking system was chosen specifically because it accounts for the prestige of the citing journals. The Scimago journal rank (SJR) measures the number of citations received by a journal, weighted by the prestige of the journals generating those citations. A journal’s SJR indicates the average number of weighted citations received during a selected year per document published in the journal during the previous three years.

Scimago identifies the top 5 journals in the category nutrition and dietetics as:

  1. Annual Review of Nutrition
  2. International Journal of Behavioral Nutrition and Physical Activity
  3. American Journal of Clinical Nutrition
  4. Advances in Nutrition
  5. Nutrition Reviews

Key notes/important takeaways:

  • Although many journals, such as the Journal of Food Composition and Analysis, cite food composition data, these 5 journals were chosen as the most prestigious and widely used.


2 Search Strategy and Data Collection


Phase 0-1


We identified qualifying publications through a preliminary, keyword-based review of each article by title and abstract only.

Keywords used to identify qualifying articles include: Nutrient, nutrients, nutrient density, nutrient profiling, nutrient composition, diet quality, nutrient values, macronutrients, micronutrients, vitamins, minerals, diet patterns, diet, meals, snacks, drinks, food, nutritional aspects, nutrient content, nutrition content, nutrient timing, nutrition requirements, dietary requirements, dietary behaviors, food behaviors, database, nutrient database, food database, weight loss, food data, nutrient data, food composition, composition data.

We identified the following:

  • The total number of published papers in 2020
  • The number of studies published in 2020 that mentioned food and nutrition data in any way
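
The title-and-abstract screen described above could be sketched as follows. This is only an illustration: the review does not state whether any tooling was used, the function names are hypothetical, and the keyword list here is a short subset of the full list given above.

```python
import re

# Subset of the keywords listed above (assumption: the full list would be used in practice).
KEYWORDS = [
    "nutrient", "nutrient density", "diet quality", "food composition",
    "nutrient database", "food database", "composition data",
]


def build_pattern(keywords):
    """Compile one case-insensitive pattern that matches any keyword as a whole phrase."""
    # Longest phrases first so multi-word keywords are preferred over their prefixes.
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in ordered)
    # Optional trailing "s" lets "nutrient" also match "nutrients".
    return re.compile(r"\b(?:" + alternatives + r")s?\b", re.IGNORECASE)


def qualifies(title, abstract, pattern):
    """Return True if the title or abstract mentions at least one keyword."""
    return pattern.search(f"{title} {abstract}") is not None


def matched_keywords(title, abstract, pattern):
    """Return the distinct keyword matches (lower-cased, sorted) for documentation."""
    text = f"{title} {abstract}"
    return sorted({m.group(0).lower() for m in pattern.finditer(text)})
```

A publication would advance to Phase 2 only if `qualifies(...)` returns `True`; recording `matched_keywords(...)` alongside each decision makes the screen easier to audit later.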

Phase 2


Publications were further reviewed in detail by downloading and reading each publication in its entirety. All food and nutrition data sources were documented.

Phase 3


Food and nutrition data sources were evaluated based on the inclusion/exclusion criteria below.

Included Sources:

  1. Direct and Identifiable sources of data

Excluded Sources:

  1. Unidentifiable sources of data

Included data sources were investigated by trained data analysts and further information was collected directly from publisher websites. This information includes:

  • The type of data identified in the source
  • Whether or not the source was created using data from USDA sources
  • The country in which the source originates
  • How long the data was updated and maintained
  • Whether or not the data could be freely accessed by the general public
  • The format in which the data source was accessible
  • Whether or not the data source is interactive in any way
  • How often the source is updated
  • The number of foods and beverages identified and how many measurements were collected for each (food composition sources only)

If additional sources of food and nutrient data were identified on publisher websites, they were added to the list of data sources and investigated for further information.
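
One way to hold the information collected for each source is a single record type with one field per item in the list above. The sketch below is an assumption about how such a record might look; the class name, field names, and types are hypothetical and are not taken from the original data set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataSourceRecord:
    """One included data source; field names are illustrative, not the authors' own."""
    name: str
    data_type: str                      # type of data identified in the source
    uses_usda_data: str                 # e.g. "Includes USDA Data", "Unknown"
    country_of_origin: str
    years_in_use: Optional[int]         # how long the data was updated and maintained
    freely_accessible: Optional[bool]
    access_format: Optional[str]
    interactive: Optional[bool]
    update_frequency: Optional[str]
    # The last two fields apply to food composition sources only.
    n_foods: Optional[int] = None
    measurements_per_food: Optional[int] = None

    def missing_fields(self):
        """List the fields that could not be filled in from the publisher website."""
        return [key for key, value in vars(self).items() if value is None]
```

Keeping unknown values as `None` rather than as empty strings makes it straightforward to count what could not be confirmed for each source.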


Limitations


Collection was done manually by human researchers and thus is subject to error.

We were unable to locate information on a few sources. Despite a considerable amount of time spent searching, we were never able to fully confirm their existence.

Obscure data sources are as follows:

  • “Nutrilet”
  • “Star of Nutrition Software”
  • “Taiwan food-composition table”
  • “Sistema de Evaluación de Hábitos Nutricionales y Consumo de Nutrimentos (SNUT)”

These sources are included in exploratory analysis. Although obscure, we assume that information provided by the authors is correct.


3 Summary Diagram


Below you’ll find a flow diagram of the detailed search strategy and publication selection process including inclusion and exclusion criteria of sources.


Figure 1: Search Strategy and Publication Selection Process

4 Data Cleansing and Organization


The data collected went through multiple rounds of quality analysis. Cleansing included manual verification of all entered data and characterization of excluded sources to the point of exhaustion by three trained data analysts, as well as the following computational cleansing processes.


Computational Cleansing Process | Translation
The number of unique strings collected in each variable was computed and compared to the total number of entries collected | Confirm that data was collected on the correct number of publications
All entries for intext and formal sources that were flagged as duplicate strings were identified and investigated manually | Inspect intext and formal source entries for possible duplications in data entry
The number of NA or missing entries was collected for each variable and compared to the number of sources identified in each phase | Verify that no entry was unintentionally left blank
All categorical variables were mutated into factors and factor levels were examined | Confirm that all categorical responses were within a predefined list of possible options
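
The four checks listed above can be sketched in a few lines of Python. This is not the authors' code (the mention of factors suggests the original work may have been done in R); the row layout, column names, and the set of values treated as missing are all assumptions made for illustration.

```python
from collections import Counter

# Values treated as missing (assumption).
MISSING = (None, "", "NA")


def unique_count(rows, column):
    """Number of distinct non-missing strings recorded in a column."""
    return len({row.get(column) for row in rows if row.get(column) not in MISSING})


def duplicate_values(rows, column):
    """Values that appear more than once and should be inspected manually."""
    counts = Counter(row.get(column) for row in rows if row.get(column) not in MISSING)
    return sorted(value for value, n in counts.items() if n > 1)


def missing_count(rows, column):
    """Number of NA or blank entries in a column."""
    return sum(1 for row in rows if row.get(column) in MISSING)


def unexpected_levels(rows, column, allowed):
    """Categorical responses that fall outside the predefined list of options."""
    return sorted({
        row.get(column) for row in rows
        if row.get(column) not in MISSING and row.get(column) not in allowed
    })
```

Each function takes a list of dictionaries, one per collected entry. Duplicates are only reported, not removed, which mirrors the manual investigation described in the table.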


Table 1 below displays the number of publications and sources identified in each phase.


Table 1: Publication and Source Count in Phases 0-3
Journal | Publications, Phase 0 | Publications, Phase 1 | Publications, Phase 2 | Publications, Phase 3 | Sources, Phase 1 | Sources, Phase 2 | Sources, Phase 3
Advances in Nutrition | 173 | 87 | 12 | 12 | 133 | 51 | 46
American Journal of Clinical Nutrition | 391 | 164 | 81 | 71 | 230 | 145 | 121
Annual Review of Nutrition | 18 | 9 | 1 | 1 | 9 | 1 | 1
International Journal of Behavioral Nutrition and Physical Activity | 163 | 23 | 7 | 7 | 38 | 22 | 22
Nutrition Reviews | 165 | 42 | 6 | 6 | 63 | 27 | 27
Total | 910 | 325 | 107 | 97 | 473 | 246 | 217
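
As a quick consistency check, the per-journal counts in Table 1 can be summed and compared with the reported totals. The sketch below uses the figures from the table; the variable and function names are illustrative.

```python
# Columns: publications in Phases 0-3, then sources in Phases 1-3 (values from Table 1).
TABLE_1 = {
    "Advances in Nutrition": (173, 87, 12, 12, 133, 51, 46),
    "American Journal of Clinical Nutrition": (391, 164, 81, 71, 230, 145, 121),
    "Annual Review of Nutrition": (18, 9, 1, 1, 9, 1, 1),
    "International Journal of Behavioral Nutrition and Physical Activity": (163, 23, 7, 7, 38, 22, 22),
    "Nutrition Reviews": (165, 42, 6, 6, 63, 27, 27),
}
REPORTED_TOTAL = (910, 325, 107, 97, 473, 246, 217)


def column_totals(table):
    """Sum each column across all journals."""
    return tuple(sum(column) for column in zip(*table.values()))


def retained_share(table_totals):
    """Share of Phase 0 publications that remain in Phase 3, as a percentage."""
    phase_0, phase_3 = table_totals[0], table_totals[3]
    return round(100 * phase_3 / phase_0, 1)
```

The computed totals match the reported row, and about 10.7% of the 910 publications from Phase 0 remain in Phase 3.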

4.1 Characterization of Excluded Sources


We discovered that authors would refer to food composition data, such as the nutrient value of a particular item, but either would not provide the data source or would give only a generic one, such as “local food table”. In these circumstances, we know food composition data was used but had no way to identify the source.

Below is a characterization of the excluded data sources.



Key notes/important takeaways:

  • There are 29 unidentifiable data sources

    • In 6 cases, food composition data was used in the text of the publication but no source was identified
    • In 5 cases, the source identified in the text of the publication was too vague to identify (e.g. “152-food-item SFFQ”, “company websites”, etc…) and no formal source was provided
    • In 6 cases, the source identified in the text of the publication was too vague to identify and the formal source led to a publication which did not provide the source of the data used
    • In 12 cases, food composition data was used but not identified in the text of the publication and the formal source did not lead to the source of the data used


4.2 Grouping Publication Data Sources


Data sources were not referred to consistently, so they had to be grouped before they could be analyzed.

The tables below allow us to review the title of each source individually and by group. The common name refers to the name of the data source as it appears in its published format (book, website, software, etc…).



Key notes/important takeaways:

  • A total of 230 sources were identified from the 217 publications previously mentioned. This difference was due to situations in which one publication referenced more than one data source.



Key notes/important takeaways:

  • The 230 sources identified in Table 3 were grouped into 108 data sources in Table 4, meaning that each data source was identified around 2.1 times on average.
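
The grouping step can be pictured as a lookup from each name as written in a publication to a common name. The sketch below is a minimal illustration, not the procedure the analysts followed. The alias table is a hypothetical example with only a few entries, and names without a known alias are kept as their own group.

```python
from collections import defaultdict

# Hypothetical alias table: normalized name as written -> common name.
ALIASES = {
    "usda sr": "SR",
    "usda national nutrient database for standard reference": "SR",
    "fooddata central": "FDC",
    "usda fooddata central": "FDC",
}


def normalize(name):
    """Lower-case, replace hyphens with spaces, and collapse repeated whitespace."""
    return " ".join(name.lower().replace("-", " ").split())


def group_sources(raw_names, aliases):
    """Map each raw source name to its common name and collect the variants."""
    groups = defaultdict(list)
    for raw in raw_names:
        common = aliases.get(normalize(raw), raw.strip())
        groups[common].append(raw)
    return dict(groups)


def mean_mentions(n_mentions, n_groups):
    """Average number of times each grouped data source was identified."""
    return n_mentions / n_groups
```

With the counts reported above, `mean_mentions(230, 108)` is about 2.13, which rounds to the 2.1 quoted in the takeaway.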


5 Exploratory Analysis


For each of the 175 data sources, we manually searched Google (Chrome browser) for surface-level information provided directly by the data source. Websites, books, and manuals published directly by the author were included. All collected variables are explained as they become relevant.

5.1 Data Category


Table 5: Data Sources Organized by Category

Figure 2: Frequency of Data Sources By Category


5.2 Accessibility


The accessibility of each source indicates whether or not the data can be freely accessed by researchers or the general public.

For each data source, we identified its form of accessibility and grouped the sources into 3 categories: “Open Access”, “Restricted Access”, and “Inaccessible”. “Open Access” encompasses sources that can be accessed for free with no authorization, “Restricted Access” encompasses sources that could be located but are blocked by a paywall or require authorization, and “Inaccessible” encompasses all data sources that could not be located or no longer have a point of access.
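
The three categories amount to a short decision rule. The sketch below restates the definitions above as a function; the function name and the three boolean inputs are assumptions about how the observations might have been recorded.

```python
def classify_accessibility(located, requires_payment, requires_authorization):
    """Assign one of the three accessibility categories defined above.

    located: the source could be found and still has a point of access.
    requires_payment: access is blocked by a paywall.
    requires_authorization: access requires a login, license, or approval.
    """
    if not located:
        return "Inaccessible"
    if requires_payment or requires_authorization:
        return "Restricted Access"
    return "Open Access"
```

Checking `located` first means a source that cannot be found is always “Inaccessible”, whatever is believed about its pricing or licensing.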

Table 6: Data Sources Organized by Accessibility


Figure 3: Frequency of Data Sources By Accessibility


5.2.1 Accessible Formats


Figure 4: Format of Data Sources by Accessibility


We were able to identify exportable file formats for 112 data sources.


5.3 Interactivity


Interactivity is used to describe whether or not the original source of data has an interactive interface rather than static tables. The form of interaction can be as simple as the ability to search and filter results or as complex as the ability to run algorithms or generate graphs.


Table 7: Data Sources Organized by Interactivity and Format


Figure 5: Frequency of Data Sources By Interactivity


5.4 Country of Origin


Here, the country of origin is the country in which the data source was created, regardless of where the data was collected or which region it pertains to.

Table 8: Data Sources Organized by Country of Origin


Figure 6: Frequency of Data Sources By Country of Origin


5.5 Years in Use


Table 9: Data Sources Organized by Years in Use

5.6 Update Frequency


Table 10: Data Sources Organized by Update Frequency

5.7 Sources Using USDA Data


USDA has led the way in food composition data access for the past 60 years.

To get a picture of USDA influence, the data sources were split into the following groups: “Includes USDA Data”, “Likely Includes USDA Data”, “Does Not Include USDA Data”, and “Unknown.”

“Likely Includes USDA Data” refers to situations in which it is suspected, but not clearly stated, that USDA data was used. For example, a source stating “International food data sources were included” falls into this group.


Table 11: Data Sources Organized by Relationship to The USDA

Figure 7: Frequency of Data Sources by Relationship to The USDA


5.8 Use of Food and Nutrient Data in Publications


Keywords common to publications citing food and nutrient data were used as an indicator of how the data is commonly used.


Figure 8: Common Keywords


Table 12: Top 15 Most Utilized Keywords
Publication_Keywords Frequency
Diet 37
Sugar-Sweetened Beverages 20
Adult 19
Beverages 18
Coffee 18
Dietary Guidelines 17
Alcohol 16
Juice 16
Milk 16
Tea 16
Water 16
Biomarkers 15
Carotenoids 14
Dietary Assessment 14
Fruit And Vegetable Intake 14
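
A frequency table like Table 12 can be produced by counting keywords across publications. The sketch below is an assumption about the counting rule: each keyword is counted once per publication, keywords are title-cased before counting, and ties are broken alphabetically, which is consistent with the ordering shown in the table.

```python
from collections import Counter


def keyword_frequencies(publications, top_n=15):
    """Count keywords across publications and return the top_n as (keyword, count) pairs.

    publications: an iterable of keyword lists, one list per publication.
    """
    counts = Counter()
    for keywords in publications:
        # A set ensures a keyword repeated within one publication is counted once.
        counts.update({k.strip().title() for k in keywords if k.strip()})
    # Sort by descending frequency, then alphabetically for ties.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:top_n]
```

For example, `keyword_frequencies([["diet", "coffee"], ["Diet", "tea"]])` treats “diet” and “Diet” as the same keyword.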

5.9 Precision and Scope of Food Composition Data


Variables pertaining only to food composition sources were collected in order to compare the scope, or size, of each food composition data source. For each food composition data source, we collected the estimated number of foods and/or beverages recorded and the estimated number of measurements provided for each food or beverage.



6 Characterization of USDA Data Usage


“If a tree falls in a forest and no one is around to hear it, does it make a sound?” If an excellent source of food and nutrient data exists but no one applies that data in research, does the quality of that data matter? The quality of a data source is only as important as the degree to which it is used and accepted by the research community as a whole. While we have confirmed that 177 food and nutrient data sources are currently in use, they are not all used to the same extent. The 5 data sources most used in publications are NHANES, SR, FDC, NDSR, and WWEIA, all of which include USDA data. Therefore, by characterizing the current landscape of USDA data, we can establish the standards of food and nutrient data currently in use by the research community.

6.1 Data Category


Table 14: Data Sources Using USDA Data Organized by Category

6.2 Country of Origin


Table 15: Data Sources Using USDA Data Organized by Country of Origin

Figure 9: Frequency of Data Sources That Include USDA Data by Country of Origin

6.3 Accessibility


Table 16: Data Sources Using USDA Data Organized by Accessibility

6.4 Interactivity


Table 17: Data Sources Using USDA Data Organized by Interactivity

6.5 Use of Food and Nutrient Data in Publications


Figure 10: Common Keywords of Sources Using USDA Data


Table 18: Top 15 Most Utilized Keywords
Publication_Keywords Frequency
Diet 23
Sugar-Sweetened Beverages 16
Beverages 14
Coffee 14
Dietary Guidelines 13
Alcohol 12
Biomarkers 12
Juice 12
Milk 12
Tea 12
Water 12
Animal Protein 11
Carotenoids 11
Dietary Assessment 11
Fruit And Vegetable Intake 11


7 Full Table of All Variables