NI’s goal is to advance nutrition science through data, and food composition data is at the heart of all nutrition knowledge. To advance our mission we must first understand the data we have now. We began with what we assumed was a simple question: “How many published studies used USDA food composition data?” To answer it, we performed a scoping review (see details of scoping review here). To our surprise, the scoping review returned a low yield. After a sample review of 50 publications, we realized that USDA food composition data citations were either not present or so inconsistent that even a rigorous scoping review strategy could not identify relevant publications.
We concluded that researchers are not citing USDA food composition databases, which led us to ask a broader question: where are researchers sourcing their data, how are they citing it, and is it shared in a way that allows the studies to be reproduced? To answer this question we performed a manual review of the literature.
To complete this manual review, we focused on studies published in 2020 by the 5 top-tier journals, as ranked by Scimago (1). We chose this ranking system specifically to broaden our inclusion of journals, with a focus on the prestige of citing journals. The Scimago journal rank (SJR) measures the number of citations received by a journal, weighted by the prestige of the journals generating the citations. A journal’s SJR indicates the average number of weighted citations received in a selected year per document published during the previous three years.
Scimago identifies the top 5 journals in the category nutrition and dietetics as:
Key notes/important takeaways:
We identified qualifying publications with a preliminary review of each article by title and abstract only. We employed a systematic approach to identify qualifying publications using keywords.
Keywords used to identify qualifying articles include: Nutrient, nutrients, nutrient density, nutrient profiling, nutrient composition, diet quality, nutrient values, macronutrients, micronutrients, vitamins, minerals, diet patterns, diet, meals, snacks, drinks, food, nutritional aspects, nutrient content, nutrition content, nutrient timing, nutrition requirements, dietary requirements, dietary behaviors, food behaviors, database, nutrient database, food database, weight loss, food data, nutrient data, food composition, composition data.
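The title-and-abstract screen described above can be illustrated with a simple keyword match. This is only a sketch: the review itself was a manual process, the keyword list here is an abbreviated subset of the list above, and the function names `find_keywords` and `qualifies` are hypothetical.

```python
import re

# Abbreviated subset of the keyword list above (illustration only).
KEYWORDS = [
    "nutrient", "nutrient database", "food composition",
    "diet quality", "food data",
]

# Longest keywords first so "nutrient database" is preferred over "nutrient".
# Only a leading word boundary is used, so plurals such as "nutrients" match too.
_ALTERNATION = "|".join(
    re.escape(k) for k in sorted(KEYWORDS, key=len, reverse=True)
)
PATTERN = re.compile(r"\b(" + _ALTERNATION + ")", re.IGNORECASE)


def find_keywords(title, abstract):
    """Return the sorted, lower-cased keywords found in the title or abstract."""
    text = f"{title} {abstract}"
    return sorted({m.group(1).lower() for m in PATTERN.finditer(text)})


def qualifies(title, abstract):
    """A publication moves on to full-text review if any keyword is present."""
    return bool(find_keywords(title, abstract))
```

A match only flags a publication as potentially using food composition data; whether it actually does is decided in the full-text review.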
We identified the following:
Publications identified as - potentially - using food composition data were further reviewed in Phase 2.
Key notes/important takeaways:
Publications identified from Phase 1 were further reviewed in detail by downloading and reading each publication in its entirety (including footnotes, tables, and figures), taking particular care to investigate sections containing any of the keywords above. If supplemental data was identified as containing further information about food composition data, the supplemental data was also downloaded and reviewed. We documented if/how food composition data was cited. Publications that did not use food composition data were excluded from analysis.
Special Case:
One publication had to be excluded at this step due to special circumstances. The publication “Criterion-Related Validity of Spectroscopy-Based Skin Carotenoid Measurements as a Proxy for Fruit and Vegetable Intake: A Systematic Review” from the journal Advances in Nutrition contained long lists of formal citations associated with relatively few intext citations, making it impossible to accurately collect and match the intext and formal citations.
Key notes/important takeaways:
Citations collected in phase 2 were evaluated based on the inclusion/exclusion criteria below. Our purpose was to fully characterize how scientists are sourcing, citing and sharing food composition data.
Included Citations:
Excluded Citations:
Key notes/important takeaways:
The data collected went through multiple rounds of quality analysis. Cleansing included manual verification of all entered data and characterization of excluded citations to the point of exhaustion by three trained data analysts, as well as the following computational cleansing processes.
| Computational Cleansing Process | Translation |
|---|---|
| The number of unique strings collected in each variable was computed and compared to the total number of entries collected | Confirm that data was collected on the correct number of publications |
| All entries for intext and formal citations that were flagged as duplicate strings were identified and investigated manually | Inspect intext and formal citation entries for possible duplications in data entry |
| The number of NA or missing entries was collected for each variable and compared to the number of citations identified in each phase | Verify that no entry was unintentionally left blank |
| All categorical variables were mutated into factors and factor levels were examined | Confirm that all categorical responses were within a predefined list of possible options |
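The four computational checks in the table can be sketched as small functions. This is a minimal illustration in plain Python rather than the analysis code used for the review; the record layout (one dictionary per citation) and all function names are assumptions.

```python
from collections import Counter

MISSING = (None, "", "NA")  # values treated as missing in this sketch


def unique_vs_total(records, field):
    """Check 1: number of unique values compared with the number of entries."""
    values = [r.get(field) for r in records]
    return len(set(values)), len(values)


def duplicated_values(records, field):
    """Check 2: values that occur more than once, for manual inspection."""
    counts = Counter(
        r.get(field) for r in records if r.get(field) not in MISSING
    )
    return sorted(v for v, n in counts.items() if n > 1)


def missing_count(records, field):
    """Check 3: number of NA or blank entries for a variable."""
    return sum(1 for r in records if r.get(field) in MISSING)


def unexpected_levels(records, field, allowed):
    """Check 4: categorical responses outside the predefined list of options."""
    seen = {r.get(field) for r in records if r.get(field) not in MISSING}
    return sorted(seen - set(allowed))
```

Each function returns values for a person to review; none of them changes the data.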
Table 1 below displays the number of publications and citations identified in each phase.
Table 1: Publication and Citation Count in Phases 0-3

| Journal | Publications, Phase 0 | Publications, Phase 1 | Publications, Phase 2 | Publications, Phase 3 | Citations, Phase 2 | Citations, Phase 3 |
|---|---|---|---|---|---|---|
| Advances in Nutrition | 173 | 87 | 3 | 0 | 21 | 8 |
| American Journal of Clinical Nutrition | 391 | 164 | 75 | 52 | 124 | 62 |
| Annual Review of Nutrition | 18 | 9 | 0 | 0 | 0 | 0 |
| International Journal of Behavioral Nutrition and Physical Activity | 163 | 23 | 7 | 6 | 9 | 7 |
| Nutrition Reviews | 165 | 42 | 4 | 4 | 10 | 4 |
| Total | 910 | 325 | 89 | 62 | 164 | 81 |
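As a quick check on Table 1, the column totals and the share of publications and citations retained at each phase can be recomputed from the journal rows. The sketch below uses the counts reported in the table; the variable names are illustrative.

```python
# Column order: publications in Phases 0-3, then citations in Phases 2-3.
PHASES = (
    "pub_phase0", "pub_phase1", "pub_phase2", "pub_phase3",
    "cit_phase2", "cit_phase3",
)

TABLE_1 = {
    "Advances in Nutrition": (173, 87, 3, 0, 21, 8),
    "American Journal of Clinical Nutrition": (391, 164, 75, 52, 124, 62),
    "Annual Review of Nutrition": (18, 9, 0, 0, 0, 0),
    "International Journal of Behavioral Nutrition and Physical Activity":
        (163, 23, 7, 6, 9, 7),
    "Nutrition Reviews": (165, 42, 4, 4, 10, 4),
}
REPORTED_TOTALS = (910, 325, 89, 62, 164, 81)

# Sum each column across journals and compare with the reported Total row.
column_totals = tuple(sum(col) for col in zip(*TABLE_1.values()))
totals = dict(zip(PHASES, column_totals))

# Share of the Phase 0 publications still present at each later phase.
pub_retained = {p: totals[p] / totals["pub_phase0"] for p in PHASES[:4]}

# Share of the Phase 2 citations that met the Phase 3 inclusion criteria.
cit_retained = totals["cit_phase3"] / totals["cit_phase2"]
```

By these counts, roughly a third of the 910 publications passed the title-and-abstract screen, and just under half of the 164 collected citations were included.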
Key notes/important takeaways:
Although they did not meet our inclusion criteria, excluded citations provided insight into some of the problems with how food composition data is cited.
Below is a characterization of the three types of excluded data citations.
Our first exclusion criterion addresses publications that used and cited food composition data in their analysis but provided intext and/or formal citation(s) that did not lead to food composition data. For example, nutrient values were displayed, yet the cited source of the values was an FFQ (Food Frequency Questionnaire).
Key notes/important takeaways:
Our second exclusion criterion addresses publications that provided citations that were unidentifiable. For example, one publication stated that food composition data was collected from a “company website” with no further information provided. In other instances, a named source of food composition data could not be found, such as the “Star of Nutrition” software.
Key notes/important takeaways:
5 citations: no source was identified intext and no formal citation was provided
3 citations: no source was identified intext and a formal citation to an unrelated or vague source was provided
6 citations: a vague or unidentifiable source was identified intext but no formal citation was provided. These sources were identified as:
2 citations: a vague or unidentifiable source was identified intext and the formal citation provided no information to identify it. These sources were identified as:
Our third exclusion criterion addresses publications that provided both an intext citation and a corresponding formal citation, but the two led to conflicting sources. For example, the intext citation named the “Australian Food Composition Tables” but the corresponding formal citation led to a publication, not to the Australian Food Composition Tables as expected.
Key notes/important takeaways:
Below you’ll find a flow diagram of the detailed search strategy and publication selection process including inclusion and exclusion criteria of citations.
Figure 1: Search Strategy and Publication Selection Process
After data collection and cleansing we were able to organize the variables into meaningful groups for exploratory analysis.
We confirmed our hypothesis and previous scoping review findings: the same food composition data sources are cited in various ways, using different titles.
The tables below allow us to review all citation titles individually and by common name groups. The common name refers to the name of the data source as it appears in its published format (book, website, software, etc.).
After grouping the various titles into common names we are able to review the frequency of each data source.
Important Note: Some publications used and cited a single source of data, but the title differed between the intext and formal citations. For example, if you search table 5 for “USDA” you will see the variation between the intext and formal citation titles. These variants are called “unique titles.” The publication in this example therefore had two unique titles for one common name.
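The relationship between unique titles and common names can be sketched as a simple grouping. In this illustration `Intext_Citation_Title` and `Formal_Citation_Title` are variables collected in the review, while the `Common_Name` field, the function name, and the idea of storing each citation as a dictionary are assumptions.

```python
from collections import defaultdict


def unique_titles_per_common_name(citations):
    """Count the distinct intext and formal titles recorded for each common name."""
    titles = defaultdict(set)
    for citation in citations:
        for field in ("Intext_Citation_Title", "Formal_Citation_Title"):
            title = citation.get(field)
            if title:  # skip missing titles
                titles[citation["Common_Name"]].add(title.strip())
    return {name: len(found) for name, found in titles.items()}
```

A citation whose intext and formal titles differ contributes two unique titles to its common name, and one whose titles are identical contributes one.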
Figure 2: Unique Titles per Common Name
Key notes/important takeaways:
We dig a bit deeper here to compare included vs excluded citations as they relate to publication and study type variables.
We first compare the number of included and excluded citations across the 5 journals.
Table 7: Included and Excluded Citations by Journal

| Journal | Included | Excluded |
|---|---|---|
| Advances in Nutrition | 8 | 13 |
| American Journal of Clinical Nutrition | 62 | 62 |
| Annual Review of Nutrition | 0 | 0 |
| International Journal of Behavioral Nutrition and Physical Activity | 7 | 2 |
| Nutrition Reviews | 4 | 6 |
| Total | 81 | 83 |
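To compare journals that contributed very different numbers of citations, the counts in Table 7 can be expressed as the share of each journal's citations that were included. The sketch below uses the counts from the table; the function name `included_share` is illustrative.

```python
# (included, excluded) citation counts per journal, from Table 7.
TABLE_7 = {
    "Advances in Nutrition": (8, 13),
    "American Journal of Clinical Nutrition": (62, 62),
    "Annual Review of Nutrition": (0, 0),
    "International Journal of Behavioral Nutrition and Physical Activity": (7, 2),
    "Nutrition Reviews": (4, 6),
}


def included_share(included, excluded):
    """Fraction of collected citations that were included; None if none collected."""
    total = included + excluded
    return None if total == 0 else included / total


shares = {journal: included_share(*counts) for journal, counts in TABLE_7.items()}
total_included = sum(inc for inc, _ in TABLE_7.values())
total_excluded = sum(exc for _, exc in TABLE_7.values())
```

The same calculation applies to the publication type and study type tables. Shares based on only a handful of citations should be read with caution.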
Key notes/important takeaways:
To determine whether the number of included and excluded citations was associated with the type of publication in which they were found, further analysis was needed. Each journal defines its own publication types.
Table 8: Included and Excluded Citations by Publication Type

| Publication Type | Included | Excluded |
|---|---|---|
| Editorial | 1 | 2 |
| Original Research Communications | 60 | 60 |
| Perspective | 8 | 11 |
| Research | 7 | 2 |
| Review | 0 | 2 |
| Special Article | 3 | 4 |
| Supplement Article | 2 | 2 |
| Total | 81 | 83 |
Key notes/important takeaways:
To determine whether the number of included and excluded citations was associated with the type of study in which they were found, further analysis was needed. The study type was recorded only if it was explicitly identified in the title, abstract, and/or methodology section of the publication.
Table 9: Included and Excluded Citations by Study Type

| Study Type | Included | Excluded |
|---|---|---|
| A Nested Case-control Study | 1 | 1 |
| Cluster-Randomized Controlled Trial | 1 | 0 |
| Cross-sectional Study | 6 | 13 |
| Expert Opinion | 1 | 2 |
| Follow-Up | 1 | 0 |
| Prospective Cohort Study | 2 | 3 |
| Prospective Observational Cohort Study | 1 | 1 |
| Prospective Study | 3 | 3 |
| Randomized, Controlled, Crossover Trial | 0 | 1 |
| RCT | 14 | 3 |
| No Study Type Provided | 51 | 56 |
| Total | 81 | 83 |
Key notes/important takeaways:
In the future we hope to use the best data source for our research, but do current authors state their rationale for use? To investigate this, we recorded intext statements within the publications that clearly provided a rationale for the choice of food composition data.
Figure 3: Rationale for Citations
Key notes/important takeaways:
Reproducibility is dependent on knowing what version or edition of the data was used. To investigate this idea, for each citation collected we checked if the year the data was published or the edition of the data was provided.
For example, USDA Standard Reference (SR) was cited 6 times: 1 for SR26, 2 for SR27, 1 for SR28, 1 for SR Legacy, and 1 where the edition was not specified. Each of these versions of SR contains different information, and many updates were made between editions. Running the same analysis on SR26 and SR Legacy can give substantially different results.
Figure 4: Editions for Citations
Key notes/important takeaways:
Any changes or edits made to data used in analysis must be specified for it to be fully reproducible. In an attempt to see how often an explanation is provided we collected any stated edits made to the data.
Figure 5: Edits for Citations
Key notes/important takeaways:
We wanted to better understand where authors placed the food composition data citation(s). We anticipated that the methodology section would be the predominant location in the publication, assuming a methodology section is present. Below you’ll find a table detailing the section of the publication where each intext citation was located.
Table 10: Section Cited

| Section | Count |
|---|---|
| Discussion | 11 |
| Introduction | 2 |
| Methods | 67 |
| Supplemental Methods | 1 |
| Total | 81 |
Key notes/important takeaways:
As a reminder, below you’ll find the definitions for the variables collected in our search.
| Variable Name | Variable Description |
|---|---|
| “Study_Type” | Identified in the title, abstract, or methods section of the publication documented as “RCT”, “Prospective Cohort Study”, “Expert Opinion”, etc… |
| “Formal_Citation” | The formal citation associated with a corresponding intext citation |
| “Formal_Citation_Title” | The title of the data source as it appears in the formal citation |
| “Intext_Citation_Title” | The title of the food composition data as it appears in the intext citation |
| “Data_Edits” | Edits or additions made to the food composition data before use in analysis |
| “Data_Edition” | The release date or edition of the food composition data |
The graphics below represent the following NA/missing data found above:
This heatmap visualizes the percentage of missing entries for the 81 included citations. Since no citations were collected from Annual Review of Nutrition, it is not present in the heatmap.
Figure 6: Missing Entries by Journal
Below you’ll find the overall percentage of missing data for each of these variables regardless of which journal the publications were collected from.
Figure 7: Overall Percentage of Missing Entries
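The percentages behind Figures 6 and 7 can be computed as the share of missing entries per variable, either within each journal or overall. This is a minimal sketch rather than the code used to produce the figures; the `Journal` field, the treatment of `None`, empty strings, and `"NA"` as missing, and the function names are assumptions.

```python
from collections import defaultdict


def is_missing(value):
    """Treat None, empty strings, and the literal "NA" as missing."""
    return value is None or value == "" or value == "NA"


def percent_missing(records, variables, group_key=None):
    """Percentage of missing entries per variable, per group or overall."""
    groups = defaultdict(list)
    for record in records:
        label = record[group_key] if group_key else "Overall"
        groups[label].append(record)
    return {
        label: {
            var: 100.0 * sum(is_missing(r.get(var)) for r in rows) / len(rows)
            for var in variables
        }
        for label, rows in groups.items()
    }
```

Calling `percent_missing(records, variables, group_key="Journal")` gives one row per journal, as in the heatmap, while omitting `group_key` gives the overall percentages.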
Key notes/important takeaways:
SCImago (n.d.). SJR - SCImago Journal & Country Rank [Portal]. Retrieved 2021-03-01, from http://www.scimagojr.com
R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Wickham et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Yihui Xie, Joe Cheng and Xianying Tan (2021). DT: A Wrapper of the JavaScript Library ‘DataTables’. R package version 0.18. https://CRAN.R-project.org/package=DT
Richard Iannone, Joe Cheng and Barret Schloerke (2021). gt: Easily Create Presentation-Ready Display Tables. R package version 0.3.1. https://CRAN.R-project.org/package=gt
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Hadley Wickham and Jennifer Bryan (2019). readxl: Read Excel Files. R package version 1.3.1. https://CRAN.R-project.org/package=readxl
Greg Lin (2022). reactable: Interactive Data Tables Based on ‘React Table’. https://glin.github.io/reactable/, https://github.com/glin/reactable.
Kyle Cuilla (2021). reactablefmtr: Easily Customize Interactive Tables Made with Reactable. R package version 1.0.0. https://CRAN.R-project.org/package=reactablefmtr
Yihui Xie (2021). bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.24.
Yihui Xie (2016). bookdown: Authoring Books and Technical Documents with R Markdown. Chapman and Hall/CRC. ISBN 978-1138700109
Joe Cheng, Carson Sievert, Winston Chang, Yihui Xie and Jeff Allen (2021). htmltools: Tools for HTML. R package version 0.5.1.1. https://CRAN.R-project.org/package=htmltools
Yihui Xie (2021). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.31.
Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963
Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595
JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone (2021). rmarkdown: Dynamic Documents for R. R package version 2.11. URL https://rmarkdown.rstudio.com.
Yihui Xie and J.J. Allaire and Garrett Grolemund (2018). R Markdown: The Definitive Guide. Chapman and Hall/CRC. ISBN 9781138359338. URL https://bookdown.org/yihui/rmarkdown.
Yihui Xie and Christophe Dervieux and Emily Riederer (2020). R Markdown Cookbook. Chapman and Hall/CRC. ISBN