
1 Introduction


NI’s goal is to advance nutrition science through data, and food composition data is at the heart of all nutrition knowledge. To advance our mission, we must first understand the data we have now. We began with what we assumed was a simple question: “How many published studies used USDA food composition data?” To answer it, we performed a scoping review (see details of scoping review here). The scoping review returned a surprisingly low yield. After a sample review of 50 publications, we realized that USDA food composition data citations were either absent or so inconsistent that even a rigorous scoping review strategy could not identify relevant publications.

We concluded that researchers are not reliably citing USDA food composition databases, which led us to ask a broader question: where are researchers sourcing their data, how are they citing it, and is it shared in a way that allows the studies to be reproduced? To answer this question, we performed a manual review of the literature.

To complete this manual review, we focused on studies published in 2020 by the 5 top-ranked nutrition journals, as ranked by Scimago (1). We chose this ranking system because it accounts for the prestige of the citing journals. The Scimago Journal Rank (SJR) measures the number of citations received by a journal, weighted by the prestige of the journals generating those citations. A journal’s SJR indicates the average number of weighted citations received in a selected year per document published in the previous three years.

Scimago identifies the top 5 journals in the category nutrition and dietetics as:

  1. Annual Review of Nutrition
  2. International Journal of Behavioral Nutrition and Physical Activity
  3. American Journal of Clinical Nutrition
  4. Advances in Nutrition
  5. Nutrition Reviews

Key notes/important takeaways:

  • Although many journals, such as the Journal of Food Composition and Analysis, cite food composition data, these 5 journals were chosen as the most prestigious and widely used.


2 Data Collection


Phase 0-1


We identified qualifying publications through a preliminary review of each article by title and abstract only, systematically screening for the keywords below.

Keywords used to identify qualifying articles include: Nutrient, nutrients, nutrient density, nutrient profiling, nutrient composition, diet quality, nutrient values, macronutrients, micronutrients, vitamins, minerals, diet patterns, diet, meals, snacks, drinks, food, nutritional aspects, nutrient content, nutrition content, nutrient timing, nutrition requirements, dietary requirements, dietary behaviors, food behaviors, database, nutrient database, food database, weight loss, food data, nutrient data, food composition, composition data.

We identified the following:

  • The total number of published papers in 2020
  • The number of studies published in 2020 that directly utilized food data in any way.

Publications identified as potentially using food composition data were further reviewed in Phase 2.
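The title-and-abstract screen described above can be sketched in a few lines. This is a minimal illustration, not the procedure the reviewers actually ran (the review was manual): the function names are hypothetical, only a subset of the keyword list is shown, and whole-word, case-insensitive matching is an assumption.

```python
import re

# Subset of the keyword list above; the full list would be used in practice.
KEYWORDS = [
    "nutrient", "nutrients", "nutrient density", "diet quality",
    "food composition", "composition data", "nutrient database",
    "food database", "diet", "food",
]

# One case-insensitive pattern with word boundaries; longer keywords are tried
# first so that "diet quality" is reported instead of just "diet".
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(KEYWORDS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def matched_keywords(title, abstract):
    """Return the sorted, de-duplicated keywords found in the title or abstract."""
    text = f"{title} {abstract}"
    return sorted({m.group(1).lower() for m in _PATTERN.finditer(text)})


def qualifies_for_phase_2(title, abstract):
    """A publication moves on to full-text review if any keyword matches."""
    return bool(matched_keywords(title, abstract))
```

A keyword hit only flags a publication as a candidate; whether it actually used food composition data is decided in the full-text review.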


Key notes/important takeaways:

  • In total, 910 articles were published in the 5 journals in the year 2020.
  • 325 publications were identified as potentially containing food composition data.


Phase 2


Publications identified from Phase 1 were further reviewed in detail by downloading and reading each publication in its entirety (including footnotes, tables, and figures), taking particular care to investigate sections containing any of the keywords above. If supplemental data was identified as containing further information about food composition data, the supplemental data was also downloaded and reviewed. We documented if/how food composition data was cited. Publications that did not use food composition data were excluded from analysis.

Special Case:

One publication was excluded at this step due to special circumstances. The publication “Criterion-Related Validity of Spectroscopy-Based Skin Carotenoid Measurements as a Proxy for Fruit and Vegetable Intake: A Systematic Review”, from the journal Advances in Nutrition, contained long lists of formal citations associated with relatively few intext citations, making it impossible to accurately collect and match the intext and formal citations.


Key notes/important takeaways:

  • Only 2 citations for food composition data were collected from review articles. Most review articles identified in phase 1 did not cite any food composition data directly; instead, they cited and referred to publications that analyzed food composition data.
  • 89 publications were identified as using food composition data.
  • 164 citations were collected at this phase.


Phase 3


Citations collected in phase 2 were evaluated based on the inclusion/exclusion criteria below. Our purpose was to fully characterize how scientists are sourcing, citing and sharing food composition data.

Included Citations:

  1. Direct and Identifiable sources of food composition data

Excluded Citations:

  1. Not a source of food composition data
  2. Unidentifiable source of food composition data
  3. Conflicting source of food composition data
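The inclusion and exclusion criteria above amount to a small decision rule. The sketch below is one possible encoding, offered as an assumption: the `Citation` fields and the order of the checks are hypothetical and were not specified by the reviewers, who applied the criteria by hand.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Citation:
    # All field names are hypothetical, chosen for illustration only.
    intext_source: Optional[str]   # source named in the text, if identifiable
    formal_source: Optional[str]   # source the formal citation leads to, if identifiable
    is_food_composition: bool      # does the resolved source hold food composition data?


def classify(c: Citation) -> str:
    """Assign a citation to 'Included' or to one of the three exclusion groups."""
    if c.intext_source is None and c.formal_source is None:
        return "Excluded: unidentifiable source"
    if c.intext_source and c.formal_source and c.intext_source != c.formal_source:
        return "Excluded: conflicting source"
    if not c.is_food_composition:
        return "Excluded: not a source of food composition data"
    return "Included"
```

In practice, deciding whether a source is "identifiable" or whether two titles refer to the same source takes human judgment; the code only records the outcome of that judgment.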


Key notes/important takeaways:

  • 83 citations were excluded at this phase, leaving 81 citations spanning 62 publications for analysis.


3 Data Cleansing


The data collected went through multiple rounds of quality analysis. Cleansing included manual verification of all entered data and characterization of excluded citations to the point of exhaustion by three trained data analysts, as well as the following computational cleansing processes.


Computational Cleansing Process | Translation
The number of unique strings collected in each variable was computed and compared to the total number of entries collected | Confirm that data was collected on the correct number of publications
All entries for intext and formal citations that were flagged as duplicate strings were identified and investigated manually | Inspect intext and formal citation entries for possible duplications in data entry
The number of NA or missing entries were collected for each variable and compared to the number of citations identified in each phase | Verify that no entry was unintentionally left blank
All categorical variables were mutated into factors and factor levels were examined | Confirm that all categorical responses were within a predefined list of possible options
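The four checks listed above can be illustrated with a short sketch. The original cleansing was done in R (factors are an R concept); the Python version below is an approximation using only the standard library. The helper names, the `Publication_ID` column, and the sample rows are hypothetical.

```python
from collections import Counter

EMPTY = (None, "")


def unique_count(rows, column):
    """Number of distinct non-missing strings in a column."""
    return len({r[column] for r in rows if r.get(column) not in EMPTY})


def duplicated_values(rows, column):
    """Values that occur more than once, to be inspected manually."""
    counts = Counter(r[column] for r in rows if r.get(column) not in EMPTY)
    return sorted(v for v, n in counts.items() if n > 1)


def missing_count(rows, column):
    """Number of entries left blank or recorded as None."""
    return sum(1 for r in rows if r.get(column) in EMPTY)


def unexpected_levels(rows, column, allowed):
    """Categorical responses outside the predefined list (the 'factor level' check)."""
    seen = {r[column] for r in rows if r.get(column) not in EMPTY}
    return sorted(seen - set(allowed))


# Hypothetical sample rows, for illustration only.
sample_rows = [
    {"Publication_ID": "P1", "Intext_Citation_Title": "USDA SR", "Study_Type": "RCT"},
    {"Publication_ID": "P1", "Intext_Citation_Title": "USDA SR", "Study_Type": "rct"},
    {"Publication_ID": "P2", "Intext_Citation_Title": "", "Study_Type": "Expert Opinion"},
]
```

A flagged duplicate or unexpected level is not necessarily an error; as described above, each flag was investigated manually.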


Table 1 below displays the number of publications and citations identified in each phase.


Table 1: Publication and Citation Count in Phases 0-3
Journal | Publications: Phase 0 | Publications: Phase 1 | Publications: Phase 2 | Publications: Phase 3 | Citations: Phase 2 | Citations: Phase 3
Advances in Nutrition | 173 | 87 | 3 | 0 | 21 | 8
American Journal of Clinical Nutrition | 391 | 164 | 75 | 52 | 124 | 62
Annual Review of Nutrition | 18 | 9 | 0 | 0 | 0 | 0
International Journal of Behavioral Nutrition and Physical Activity | 163 | 23 | 7 | 6 | 9 | 7
Nutrition Reviews | 165 | 42 | 4 | 4 | 10 | 4
Total | 910 | 325 | 89 | 62 | 164 | 81


Key notes/important takeaways:

  • A total of 848 publications were excluded (910 published minus 62 retained).
  • Of the 164 citations collected, only 81 led to direct sources of food composition data.
  • Included citations outnumber publications because one or more citations were recorded per publication.
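As a quick consistency check, the totals in Table 1 and the exclusion count can be recomputed from the per-journal rows. The sketch below simply re-enters the table values; the variable names are illustrative.

```python
# Per-journal counts from Table 1, in column order:
# publications in Phase 0, 1, 2, 3, then citations in Phase 2, 3.
TABLE_1 = {
    "Advances in Nutrition": (173, 87, 3, 0, 21, 8),
    "American Journal of Clinical Nutrition": (391, 164, 75, 52, 124, 62),
    "Annual Review of Nutrition": (18, 9, 0, 0, 0, 0),
    "International Journal of Behavioral Nutrition and Physical Activity": (163, 23, 7, 6, 9, 7),
    "Nutrition Reviews": (165, 42, 4, 4, 10, 4),
}

# Sum each column across journals.
totals = tuple(sum(column) for column in zip(*TABLE_1.values()))

# Publications excluded between Phase 0 and Phase 3.
excluded_publications = totals[0] - totals[3]

# Citations excluded between Phase 2 and Phase 3.
excluded_citations = totals[4] - totals[5]
```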


3.1 Characterization of Excluded Citations


Although they did not meet our inclusion criteria, excluded citations provided insight into common problems in how food composition data is cited.

Below is a characterization of the three types of excluded data citations.

  1. Not a source of food composition data
  2. Unidentifiable source of food composition data
  3. Conflicting source of food composition data

Not a source of food composition data:


Our first exclusion criterion addresses publications that used food composition data in their analysis but provided intext and/or formal citation(s) that did not lead to food composition data. For example, nutrient values were displayed, yet the cited source of the values was an FFQ (Food Frequency Questionnaire).


Key notes/important takeaways:

  • NHANES was cited 17 times
  • FFQ (of various types) were cited 10 times
  • In multiple cases, the citation pointed to a publication that used a source of food composition data rather than to the source itself

Unidentifiable sources of food composition data:


Our second exclusion criterion addresses publications that provided citations that were unidentifiable. For example, one publication stated that food composition data was collected from a “company website” with no further information provided. In other instances, a source of food composition data was named but could not be found, such as the “Star of Nutrition” software.


Key notes/important takeaways:

  • 5 citations: no source was identified in the text and no formal citation was provided
  • 3 citations: no source was identified intext and a formal citation to an unrelated or vague source was provided
  • 6 citations: a vague or unidentifiable source was identified in the text but no formal citation was provided. These sources were identified as:
    • “General Mills Bell Institute of Health & Nutrition (BIHN) Whole Grain Database”
    • “chemical (nutrient) analyses”
    • “local food nutrient content database”
    • “nutrient databank”
    • “company websites”
    • “Star of Nutrition software”
  • 2 citations: a vague or unidentifiable source was identified in the text and the formal citation provided no information to identify it. The sources identified intext were:
    • “DIETSYS”
    • “locally produced Indonesian food composition table”

Conflicting sources of food composition data:


Our third exclusion criterion addresses publications that provided both an intext citation and a corresponding formal citation that led to conflicting sources. For example, the intext citation named the “Australian Food Composition Tables” but the corresponding formal citation led to a publication rather than to the Australian Food Composition Tables.


Key notes/important takeaways:

  • 2 citations were excluded because the intext and formal citations referred to different versions or editions of the same data
  • 8 formal citations led to publications rather than to the original data cited intext
  • 100% of intext citations named a food composition source; none named a publication in the text
  • In a large portion of these publications, a source of food composition data was cited intext but the formal citation led to a publication that used the source rather than to the source itself

Summary of Exclusions


Key notes/important takeaways:

  • In total, 83 of the 164 citations collected in phase 2 (50.6%) were excluded.
  • If this pattern holds beyond the 5 journals reviewed, roughly half of the citations intended to credit food composition data do not lead to a direct, identifiable source, so the data is not receiving appropriate credit.
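The exclusion figures reported in this section can be tallied as follows. The counts for unidentifiable and conflicting sources come from the key notes above; the count for "not a source of food composition data" is not stated in the text and is derived here under the assumption that the three exclusion groups do not overlap.

```python
# Counts taken from the key notes in this section.
unidentifiable = 5 + 3 + 6 + 2   # the four sub-types of unidentifiable sources
conflicting = 2 + 8              # version conflicts + formal citations leading to publications
total_excluded = 83
total_collected = 164

# Derived, not stated in the text: assumes the three exclusion groups are disjoint.
not_a_source = total_excluded - unidentifiable - conflicting

# Share of collected citations that were excluded, as a percentage.
exclusion_rate = round(100 * total_excluded / total_collected, 1)
```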

Below you’ll find a flow diagram of the detailed search strategy and publication selection process including inclusion and exclusion criteria of citations.


Figure 1: Search Strategy and Publication Selection Process


4 Data Organization


After data collection and cleansing we were able to organize the variables into meaningful groups for exploratory analysis.

4.1 Grouping Food Composition Data Sources


We confirmed our hypothesis and previous scoping review findings: the same food composition data sources are cited in various ways, using different titles.

The tables below allow us to review all citation titles individually and by common name groups. The common name refers to the name of the data source as it appears in its published format (book, website, software, etc.).



After grouping the various titles into common names we are able to review the frequency of each data source.

Important Note: Some publications used and cited a single source of data, but the intext and formal citation titles differed. For example, searching table 5 for “USDA” shows the variation between intext and formal citation titles. We call these variations “unique titles.” A publication whose intext and formal citation titles differ therefore has two unique titles for one common name.



Figure 2: Unique Titles per Common Name


Key notes/important takeaways:

  • There are 46 common names consolidated from 115 unique titles (the sum of Unique_Titles)
  • Each common name had up to 10 unique titles
    • on average, 2.5 title variations per common name
  • On average, there was more than one unique title per publication
    • There were 115 total unique titles from 62 publications
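Consolidating unique titles into common names is essentially a lookup followed by a group-by. The sketch below illustrates the idea; the lookup entries are hypothetical examples rather than the mapping used in this review, which was built by hand.

```python
from collections import defaultdict

# Hypothetical title variants; a real lookup would be curated manually.
COMMON_NAME_BY_TITLE = {
    "usda national nutrient database for standard reference": "USDA Standard Reference",
    "usda sr": "USDA Standard Reference",
    "usda standard reference": "USDA Standard Reference",
    "australian food composition tables": "Australian Food Composition Tables",
}


def normalize(title):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(title.lower().split())


def group_titles(titles):
    """Map each common name to the set of distinct (normalized) titles citing it."""
    groups = defaultdict(set)
    for title in titles:
        key = normalize(title)
        groups[COMMON_NAME_BY_TITLE.get(key, "UNMAPPED")].add(key)
    return dict(groups)


# Hypothetical input titles, for illustration only.
example = group_titles([
    "USDA SR", "usda  sr", "USDA Standard Reference",
    "Australian Food Composition Tables",
])

# Average title variations per common name, using the counts reported above.
average_variations = round(115 / 46, 1)
```

Titles that land under "UNMAPPED" are the ones that need a human decision before they can be assigned a common name.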


5 Exploratory Analysis


5.1 Descriptive Analysis of Citations


We dig a bit deeper here to compare included vs excluded citations as they relate to publication and study type variables.


Analysis of Citations by Journal


From the 5 journals, we will compare the number of included and excluded citations.


Table 7: Included and Excluded Citations by Journal
Included Excluded
Advances in Nutrition 8 13
American Journal of Clinical Nutrition 62 62
Annual Review of Nutrition 0 0
International Journal of Behavioral Nutrition and Physical Activity 7 2
Nutrition Reviews 4 6
Total 81 83


Key notes/important takeaways:

  • No citations were collected from Annual Review of Nutrition because none of its publications directly used food composition data.
  • International Journal of Behavioral Nutrition and Physical Activity had the highest ratio of included citations to excluded citations.
  • 75.6% of citations (124 of 164) were collected from publications in the American Journal of Clinical Nutrition. Of these, 62 were included and 62 were excluded (the even split is a coincidence, not an error).
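The comparisons drawn from Table 7 can be reproduced with a short calculation. The sketch below re-enters the table values; the function and variable names are illustrative.

```python
# (included, excluded) citation counts per journal, from Table 7.
TABLE_7 = {
    "Advances in Nutrition": (8, 13),
    "American Journal of Clinical Nutrition": (62, 62),
    "Annual Review of Nutrition": (0, 0),
    "International Journal of Behavioral Nutrition and Physical Activity": (7, 2),
    "Nutrition Reviews": (4, 6),
}


def included_share(included, excluded):
    """Fraction of a journal's citations that were included; None if it had none."""
    total = included + excluded
    return None if total == 0 else included / total


shares = {journal: included_share(*counts) for journal, counts in TABLE_7.items()}

# Journal with the highest included share, ignoring journals with no citations.
best_journal = max(
    (journal for journal, share in shares.items() if share is not None),
    key=lambda journal: shares[journal],
)

# Share of all collected citations that came from AJCN, as a percentage.
total_citations = sum(inc + exc for inc, exc in TABLE_7.values())
ajcn_percent = round(100 * sum(TABLE_7["American Journal of Clinical Nutrition"]) / total_citations, 1)
```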


Analysis of Citations by Publication Type


To determine if the number of included and excluded citations were associated with the type of publications they were found within, further analysis was needed. The journals provided publication types unique to each journal.


Table 8: Included and Excluded Citations by Publication Type
Included Excluded
Editorial 1 2
Original Research Communications 60 60
Perspective 8 11
Research 7 2
Review 0 2
Special Article 3 4
Supplement Article 2 2
Total 81 83


Key notes/important takeaways:

  • 67 of the 81 included citations (82.7%) were collected from research publications (including both Research and Original Research Communications).
  • Both citations collected from review publications were excluded.


Analysis of Citations by Study Type


To determine if the number of included and excluded citations were associated with the type of studies they were found within, further analysis was needed. The study type was recorded only if it was explicitly identified in the title, abstract, and/or methodology section of the publication.


Table 9: Included and Excluded Citations by Study Type
Included Excluded
A Nested Case-control Study 1 1
Cluster-Randomized Controlled Trial 1 0
Cross-sectional Study 6 13
Expert Opinion 1 2
Follow-Up 1 0
Prospective Cohort Study 2 3
Prospective Observational Cohort Study 1 1
Prospective Study 3 3
Randomized, Controlled, Crossover Trial 0 1
RCT 14 3
No Study Type Provided 51 56
Total 81 83


Key notes/important takeaways:

  • 107 of the 164 citations collected (65.2%) were from publications that did not clearly define a study type (51 included and 56 excluded in Table 9)
  • The following study types had to be excluded due to unidentifiable citations, as noted above in the exclusion criteria:
    • Systematic Review
    • Systematic Review and Meta-Analysis
    • Umbrella Review


5.2 Stated Rationale


In the future we hope to use the best data source for our research, but do current authors state their rationale for use? To investigate this, we recorded intext statements within the publications that clearly provided a rationale for the choice of food composition data.


Figure 3: Rationale for Citations


Key notes/important takeaways:

  • 7 citations had authors who explicitly stated a rationale for use of the food composition data cited, 74 did not

5.3 Year Published or Edition


Reproducibility is dependent on knowing what version or edition of the data was used. To investigate this idea, for each citation collected we checked if the year the data was published or the edition of the data was provided.

For example, USDA Standard Reference (SR) was cited 6 times: 1 for SR26, 2 for SR27, 1 for SR28, 1 for SR Legacy, and 1 where the edition was not specified. These versions of SR contain different information, and many updates were made between editions. Running the same analysis on SR26 and on SR Legacy can give substantially different results.
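The SR example can be expressed as a simple tally of editions. The sketch below encodes the six SR citations described above; the "Not Provided" label is borrowed from the way missing editions are reported in this document, and the variable names are illustrative.

```python
from collections import Counter

# The six USDA Standard Reference citations described above;
# None marks the citation whose edition was not specified.
sr_editions = ["SR26", "SR27", "SR27", "SR28", "SR Legacy", None]

# Count citations per edition, labelling the missing one explicitly.
tally = Counter(e if e is not None else "Not Provided" for e in sr_editions)

# Fraction of SR citations that would let a reader retrieve the same edition.
share_with_edition = (len(sr_editions) - tally["Not Provided"]) / len(sr_editions)
```

A citation without an edition cannot be matched to any one of these releases, which is why it counts against reproducibility even though the source itself is named.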


Figure 4: Editions for Citations


Key notes/important takeaways:

  • Nearly 1 in 3 authors did not provide information on the edition of the data used

5.4 Stated Data Edits


Any changes or edits made to data used in analysis must be specified for it to be fully reproducible. In an attempt to see how often an explanation is provided we collected any stated edits made to the data.


Figure 5: Edits for Citations


Key notes/important takeaways:

  • Fewer than 10% of authors stated whether, or how, the reported data was modified

5.5 Section Cited


We wanted to better understand where authors placed the food composition data citation(s). We anticipated that the methodology section would be the predominant location, assuming a methodology section is present. The table below details the section of the publication where each intext citation was located.


Table 10: Section Cited
Count
Discussion 11
Introduction 2
Methods 67
Supplemental Methods 1
Total 81


Key notes/important takeaways:

  • 82.7% of food composition citations (67 of the 81) were identified in the methodology section of the publications
  • Some inconsistency does exist; for example, one publication cited the source of data in the supplemental methods section


6 Exploring Missing Entries by Variable


As a reminder, below you’ll find the definitions for the variables collected in our search.

Variable Name | Variable Description
“Study_Type” | Identified in the title, abstract, or methods section of the publication documented as “RCT”, “Prospective Cohort Study”, “Expert Opinion”, etc.
“Formal_Citation” | The formal citation associated with a corresponding intext citation
“Formal_Citation_Title” | The title of the data source as it appears in the formal citation
“Intext_Citation_Title” | The title of the food composition data as it appears in the intext citation
“Data_Edits” | Edits or additions made to the food composition data before use in analysis
“Data_Edition” | The release date or edition of the food composition data


The graphics below represent the following NA/missing data found above:

  • For intext and formal citation titles, missing data can be seen in the blank values found in table 5
  • For data edits and edition, the missing values are represented as “Not Provided” in sections 5.3 and 5.4
  • For study type, the missing entries are represented as “No Study Type Provided” in table 9


This heatmap visualizes the percentage of missing entries for the 81 included citations. Since no citations were collected from Annual Review of Nutrition, it is not present in the heatmap.


Figure 6: Missing Entries by Journal


Below you’ll find the overall percentage of missing data for each of these variables regardless of which journal the publications were collected from.


Figure 7: Overall Percentage of Missing Entries

Key notes/important takeaways:

  • Over 13% of citations did not clearly state (or state at all) what data was used within the text of the publication
  • Data edits had the highest frequency of missing data
  • Nearly 36% of citations were intext only (i.e. no formal citation provided)
  • Citations collected from the International Journal of Behavioral Nutrition and Physical Activity had the lowest overall percentage of missing data
  • Citations collected from Advances in Nutrition had the highest proportion of missing data
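The percentages behind the heatmap and the overall summary are straightforward to compute once each citation is a record with the variables defined above. The sketch below is an assumption about how such a computation could look: the `Journal` key, the treatment of `None`, empty strings, and "NA" as missing, and the sample rows are all hypothetical.

```python
from collections import defaultdict

VARIABLES = [
    "Study_Type", "Formal_Citation", "Formal_Citation_Title",
    "Intext_Citation_Title", "Data_Edits", "Data_Edition",
]
MISSING = (None, "", "NA")  # values treated as missing (an assumption)


def percent_missing(rows, variables=VARIABLES):
    """Percentage of rows with a missing entry, per variable."""
    if not rows:
        return {}
    return {
        v: round(100 * sum(r.get(v) in MISSING for r in rows) / len(rows), 1)
        for v in variables
    }


def percent_missing_by_journal(rows):
    """Same percentages, computed separately for each journal."""
    by_journal = defaultdict(list)
    for r in rows:
        by_journal[r["Journal"]].append(r)
    return {journal: percent_missing(group) for journal, group in by_journal.items()}


# Hypothetical records, for illustration only.
sample = [
    {"Journal": "A", "Study_Type": "RCT", "Formal_Citation": "ref 12",
     "Formal_Citation_Title": "USDA SR28", "Intext_Citation_Title": "USDA SR",
     "Data_Edits": None, "Data_Edition": "SR28"},
    {"Journal": "B", "Study_Type": "NA", "Formal_Citation": None,
     "Formal_Citation_Title": None, "Intext_Citation_Title": "nutrient databank",
     "Data_Edits": None, "Data_Edition": None},
]
```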


7 Checklist Methodology



8 Checklist



9 References


  1. SCImago (n.d.). SJR - SCImago Journal & Country Rank [Portal]. Retrieved 2021-03-01, from http://www.scimagojr.com

  2. R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

  3. Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

  4. Yihui Xie, Joe Cheng and Xianying Tan (2021). DT: A Wrapper of the JavaScript Library ‘DataTables’. R package version 0.18. https://CRAN.R-project.org/package=DT

  5. Richard Iannone, Joe Cheng and Barret Schloerke (2021). gt: Easily Create Presentation-Ready Display Tables. R package version 0.3.1. https://CRAN.R-project.org/package=gt

  6. H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

  7. Hadley Wickham and Jennifer Bryan (2019). readxl: Read Excel Files. R package version 1.3.1. https://CRAN.R-project.org/package=readxl

  8. Greg Lin (2022). reactable: Interactive Data Tables Based on ‘React Table’. https://glin.github.io/reactable/, https://github.com/glin/reactable.

  9. Kyle Cuilla (2021). reactablefmtr: Easily Customize Interactive Tables Made with Reactable. R package version 1.0.0. https://CRAN.R-project.org/package=reactablefmtr

  10. Yihui Xie (2021). bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.24.

  11. Yihui Xie (2016). bookdown: Authoring Books and Technical Documents with R Markdown. Chapman and Hall/CRC. ISBN 978-1138700109

  12. Joe Cheng, Carson Sievert, Winston Chang, Yihui Xie and Jeff Allen (2021). htmltools: Tools for HTML. R package version 0.5.1.1. https://CRAN.R-project.org/package=htmltools

  13. Yihui Xie (2021). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.31.

  14. Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963

  15. Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595

  16. JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone (2021). rmarkdown: Dynamic Documents for R. R package version 2.11. URL https://rmarkdown.rstudio.com.

  17. Yihui Xie and J.J. Allaire and Garrett Grolemund (2018). R Markdown: The Definitive Guide. Chapman and Hall/CRC. ISBN 9781138359338. URL https://bookdown.org/yihui/rmarkdown.

  18. Yihui Xie and Christophe Dervieux and Emily Riederer (2020). R Markdown Cookbook. Chapman and Hall/CRC. URL https://bookdown.org/yihui/rmarkdown-cookbook.