NI’s goal is to advance nutrition science through data, and food composition data is at the heart of all nutrition knowledge. To advance our mission, we must first understand the data we have now. We began with what we assumed was a simple question: “How many published studies used USDA food composition data?” To answer it, we performed a scoping review (see details of scoping review here). To our surprise, the scoping review returned a very low yield. After reviewing a sample of 50 publications, we realized that citations of USDA food composition data were either absent or so inconsistent that even a rigorous scoping review strategy could not identify the relevant publications.
We concluded that researchers are not citing USDA food composition databases, which led us to ask a broader question: what food composition databases are being used? To answer this question, we performed a manual review of the literature.
To complete this manual review, we focused on studies published in 2020 by 5 top-tier journals, as ranked by Scimago (1). This ranking system was chosen specifically to broaden our inclusion of journals while focusing on the prestige of citing journals. The Scimago Journal Rank (SJR) measures the number of citations received by a journal, weighted by the prestige of the journals generating those citations. A journal’s SJR indicates the average number of weighted citations received in a selected year per document published in the journal during the previous three years.
Scimago identifies the top 5 journals in the category nutrition and dietetics as:
Key notes/important takeaways:
We identified qualifying publications with a preliminary review of each article by title and abstract only. We employed a systematic approach to identify qualifying publications using keywords.
Keywords used to identify qualifying articles include: Nutrient, nutrients, nutrient density, nutrient profiling, nutrient composition, diet quality, nutrient values, macronutrients, micronutrients, vitamins, minerals, diet patterns, diet, meals, snacks, drinks, food, nutritional aspects, nutrient content, nutrition content, nutrient timing, nutrition requirements, dietary requirements, dietary behaviors, food behaviors, database, nutrient database, food database, weight loss, food data, nutrient data, food composition, composition data.
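The title-and-abstract screening described above can be illustrated with a short script. This is only a sketch under stated assumptions: the source does not say whether screening was automated, the keyword list here is a subset of the one above, and the function names `matched_keywords` and `qualifies` are hypothetical.

```python
import re

# Subset of the keywords listed above; extend with the full list as needed.
KEYWORDS = [
    "nutrient", "nutrients", "nutrient density", "diet quality",
    "food composition", "composition data", "nutrient database",
    "food database", "diet", "food",
]

# One case-insensitive pattern with word boundaries. Longer phrases are
# tried first so "nutrient density" is reported instead of just "nutrient".
KEYWORD_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(KEYWORDS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def matched_keywords(title, abstract):
    """Return the sorted, de-duplicated keywords found in title or abstract."""
    text = f"{title} {abstract}"
    return sorted({m.group(1).lower() for m in KEYWORD_PATTERN.finditer(text)})


def qualifies(title, abstract):
    """A publication qualifies if at least one keyword is present."""
    return bool(matched_keywords(title, abstract))
```

Because matches are whole words, a term such as "dietary" does not count as a hit for "diet"; each variant has to be listed explicitly, as the keyword list above already does for "nutrient" and "nutrients".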
We identified the following:
Publications were further reviewed in detail by downloading and reading each publication in its entirety. All food and nutrition data sources were documented.
Food and nutrition data sources were evaluated based on the inclusion/exclusion criteria below.
Included Sources:
Excluded Sources:
Included data sources were investigated by trained data analysts and further information was collected directly from publisher websites. This information includes:
The collected data went through multiple rounds of quality analysis. Cleansing included manual verification of all entered data and exhaustive characterization of excluded sources by three trained data analysts, as well as the following computational cleansing processes.
| Computational Cleansing Process | Translation |
|---|---|
| The number of unique strings collected in each variable was computed and compared to the total number of entries collected | Confirm that data was collected on the correct number of publications |
| All entries for in-text and formal sources that were flagged as duplicate strings were identified and investigated manually | Inspect in-text and formal source entries for possible duplications in data entry |
| The number of NA or missing entries was counted for each variable and compared to the number of sources identified in each phase | Verify that no entry was unintentionally left blank |
| All categorical variables were mutated into factors and factor levels were examined | Confirm that all categorical responses were within a predefined list of possible options |
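The four checks in the table above can be sketched as follows. This is an illustration rather than the authors' code, which, judging by the reference list, was written in R. The record layout, the field names (`publication_id`, `formal_source`, `availability`), and the sample rows are hypothetical, and one value is deliberately misspelled to show the last check.

```python
from collections import Counter

# Hypothetical entries: one dict per recorded source; None marks a blank cell.
records = [
    {"publication_id": "P1", "formal_source": "SR Legacy", "availability": "Open Access"},
    {"publication_id": "P1", "formal_source": "Source B", "availability": "Open Access"},
    {"publication_id": "P2", "formal_source": "SR Legacy", "availability": None},
    {"publication_id": "P3", "formal_source": "Local food table", "availability": "Inaccesible"},
]
ALLOWED_AVAILABILITY = {"Open Access", "Restricted Access", "Inaccessible"}


def unique_counts(rows):
    """Check 1: unique non-missing values per variable, plus total entries."""
    fields = rows[0].keys()
    uniques = {f: len({r[f] for r in rows if r[f] is not None}) for f in fields}
    return uniques, len(rows)


def duplicate_values(rows, field):
    """Check 2: values that occur more than once, for manual inspection."""
    counts = Counter(r[field] for r in rows if r[field] is not None)
    return sorted(value for value, n in counts.items() if n > 1)


def missing_counts(rows):
    """Check 3: number of missing entries per variable."""
    return {f: sum(r[f] is None for r in rows) for f in rows[0].keys()}


def unexpected_levels(rows, field, allowed):
    """Check 4: categorical values outside the predefined list of options."""
    return sorted({r[field] for r in rows if r[field] is not None} - allowed)
```

A duplicate flagged by the second check is not necessarily an error, since the same data source can legitimately appear in several publications, which is why the table pairs that check with manual investigation.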
Table 1 below displays the number of publications and sources identified in each phase.
| Table 1: Publication and Source Count in Phases 0-3 | | | | | | | |
|---|---|---|---|---|---|---|---|
| Journal | Publications, Phase 0 | Publications, Phase 1 | Publications, Phase 2 | Publications, Phase 3 | Sources, Phase 1 | Sources, Phase 2 | Sources, Phase 3 |
| Advances in Nutrition | 173 | 87 | 12 | 12 | 133 | 51 | 46 |
| American Journal of Clinical Nutrition | 391 | 164 | 81 | 71 | 230 | 145 | 121 |
| Annual Review of Nutrition | 18 | 9 | 1 | 1 | 9 | 1 | 1 |
| International Journal of Behavioral Nutrition and Physical Activity | 163 | 23 | 7 | 7 | 38 | 22 | 22 |
| Nutrition Reviews | 165 | 42 | 6 | 6 | 63 | 27 | 27 |
| Total | 910 | 325 | 107 | 97 | 473 | 246 | 217 |
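As a quick arithmetic check on Table 1, the per-journal counts can be summed and compared with the reported totals. The sketch below simply transcribes the table; the names `TABLE_1`, `REPORTED_TOTAL`, and `column_sums` are illustrative.

```python
# Rows of Table 1: journal -> (publications in Phases 0-3, sources in Phases 1-3)
TABLE_1 = {
    "Advances in Nutrition": ([173, 87, 12, 12], [133, 51, 46]),
    "American Journal of Clinical Nutrition": ([391, 164, 81, 71], [230, 145, 121]),
    "Annual Review of Nutrition": ([18, 9, 1, 1], [9, 1, 1]),
    "International Journal of Behavioral Nutrition and Physical Activity": (
        [163, 23, 7, 7],
        [38, 22, 22],
    ),
    "Nutrition Reviews": ([165, 42, 6, 6], [63, 27, 27]),
}
REPORTED_TOTAL = ([910, 325, 107, 97], [473, 246, 217])


def column_sums(table):
    """Sum each phase column across journals, for publications and sources."""
    pubs = [sum(col) for col in zip(*(p for p, _ in table.values()))]
    srcs = [sum(col) for col in zip(*(s for _, s in table.values()))]
    return pubs, srcs


pubs, srcs = column_sums(TABLE_1)
matches_reported = (pubs, srcs) == REPORTED_TOTAL

# Sources dropped between Phase 2 and Phase 3
dropped_sources = srcs[1] - srcs[2]
print(matches_reported, dropped_sources)
```

The column sums agree with the Total row, and 29 sources are dropped between Phase 2 and Phase 3.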
We discovered that authors would refer to food composition data, such as the nutrient value of a particular item, but would either not provide the data source or provide only a generic one, such as “local food table”. In these circumstances, we know food composition data was used but have no way to identify the source.
Below is a characterization of the excluded data sources.
Key notes/important takeaways:
There are 29 unidentifiable data sources
Below you’ll find a flow diagram of the detailed search strategy and publication selection process including inclusion and exclusion criteria of sources.
Figure 1: Search Strategy and Publication Selection Process
After data collection and cleansing we were able to organize the sources into meaningful groups for exploratory analysis.
Data sources were not referred to consistently, so they had to be grouped before they could be analyzed.
The tables below allow us to review the title of each source individually and by group. The common name refers to the name of the data source as it appears in its published format (book, website, software, etc.).
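Grouping inconsistently cited names under one common name can be sketched with a normalization step and an alias lookup. This is a minimal illustration, not the procedure actually used, which was a manual review; the alias table and the cited-name variants below are hypothetical.

```python
import re

# Hypothetical alias table: normalized cited name -> common name.
ALIASES = {
    "usda sr legacy": "SR Legacy",
    "sr legacy usda": "SR Legacy",
    "sr legacy": "SR Legacy",
}


def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(cleaned.split())


def common_name(cited_name, aliases=ALIASES):
    """Map a cited name to its common name, or mark it as ungrouped."""
    return aliases.get(normalize(cited_name), "Ungrouped")


def group_sources(cited_names):
    """Collect the cited names under each common name, preserving order."""
    groups = {}
    for name in cited_names:
        groups.setdefault(common_name(name), []).append(name)
    return groups
```

Anything left in the "Ungrouped" bucket, such as a generic "local food table", still needs a manual decision, so a lookup like this narrows the manual work rather than replacing it.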
Key notes/important takeaways:
Key notes/important takeaways:
For each of the 108 data sources, we manually searched Google (Chrome browser) for surface-level information provided directly by the data source; websites, books, and manuals published directly by the author were included. All collected variables are explained as they become relevant.
We were unable to find information on a few sources. We spent a considerable amount of time searching but were never able to fully confirm their existence.
Obscure data sources are as follows:
These sources are included in exploratory analysis. Although obscure, we assume that information provided by the authors is correct.
We were curious to know which and how many food composition data sources were reported in each journal.
Table 5: Data Sources Organized By Journal
Figure 2: Frequency of Data Sources By Journal
Key notes/important takeaways:
To better understand the use of each data source, the sources were organized into categories based on how the data was used (such as SR Legacy being used for food composition or the implementation of a diet quality score). We also documented how data was retrieved.
Table 6: Data Sources Organized By Category and Format
Figure 3: Frequency of Data Sources By Category
Key notes/important takeaways:
Figure 4: Frequency of Data Sources By Format
Key notes/important takeaways:
Each data source has its own points of access, such as online publications or physical books, each of which has a different level of availability and interactivity.
The availability of each source indicates whether or not the data can be freely accessed by researchers or the general public.
For each data source, we identified these forms of availability and grouped them into 3 categories: “Open Access”, “Restricted Access”, and “Inaccessible”. “Open Access” encompasses sources that can be accessed for free with no authorization, “Restricted Access” encompasses sources that could be located but are blocked by a paywall or require authorization, and “Inaccessible” encompasses all data sources that could not be located or no longer have a point of access.
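The three availability categories can be written as a small decision rule. This is a sketch of the definitions above; the boolean inputs and the function name `classify_availability` are hypothetical and were not variables collected in the review.

```python
def classify_availability(located, has_access_point, free, needs_authorization):
    """Assign one of the three availability categories defined above.

    located             -- the source could be found at all
    has_access_point    -- the source still has a working point of access
    free                -- the data can be obtained without payment
    needs_authorization -- an account, license, or approval is required
    """
    # Not found, or no remaining point of access
    if not located or not has_access_point:
        return "Inaccessible"
    # Found, but behind a paywall or an authorization step
    if not free or needs_authorization:
        return "Restricted Access"
    # Found, free, and no authorization needed
    return "Open Access"
```

The order of the checks matters: a source that cannot be located is "Inaccessible" regardless of what its pricing or authorization requirements might have been.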
Table 7: Data Sources Organized by Availability and Format
Figure 5: Frequency of Data Sources By Availability
Key notes/important takeaways:
Interactivity is used to describe whether or not the original source of data has an interactive interface rather than static tables. The form of interaction can be as simple as the ability to search and filter results or as complex as the ability to run algorithms or generate graphs.
Table 8: Data Sources Organized by Interactivity and Format
Figure 6: Frequency of Data Sources By Interactivity
Key notes/important takeaways:
Availability and interactivity both influence how easily each data source can be obtained and analyzed. By examining these variables together, we can get a better picture of how accessible these data sources really are.
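Examining two variables together amounts to a cross-tabulation. The sketch below counts sources for each pair of values; the sample rows and the interactivity labels "Interactive" and "Static" are hypothetical placeholders, not the review's data.

```python
from collections import Counter

# Hypothetical sample: one dict per data source.
sources = [
    {"availability": "Open Access", "interactivity": "Interactive"},
    {"availability": "Open Access", "interactivity": "Static"},
    {"availability": "Open Access", "interactivity": "Interactive"},
    {"availability": "Restricted Access", "interactivity": "Static"},
]


def cross_tabulate(rows, row_var, col_var):
    """Count rows for each (row_var value, col_var value) pair."""
    return Counter((r[row_var], r[col_var]) for r in rows)


table = cross_tabulate(sources, "availability", "interactivity")
for (availability, interactivity), n in sorted(table.items()):
    print(f"{availability:<18} {interactivity:<12} {n}")
```

Pairs that never occur are simply absent from the counter, and looking one up returns 0.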
Table 9: Data Sources Organized by Availability and Interactivity
Figure 7: Frequency of Data Sources by Availability and Interactivity
Key notes/important takeaways:
To get an idea of how many of these sources are still in use and to assess the age of the data currently being used, we collected information on the period over which each source was updated and maintained.
Years in use are the years during which each data source is (or was) actively updated and maintained. From the years in use, we calculated the duration of use (presented in the table below as Time_Range) and grouped the data sources into 4 categories: “Currently in Use”, “No Longer in Use”, “Static Release”, and “Unknown”. “Unknown” represents all data sources for which information about when the data was available was not easily accessible.
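One way to derive a duration and a status from the years in use is sketched below. The source does not spell out the rules, so several points are assumptions: a "Static Release" is taken to be a source whose first and last year coincide, ongoing sources are measured up to a reference year of 2021, and the function name and parameters are hypothetical.

```python
def years_in_use(first_year, last_year=None, ongoing=False, reference_year=2021):
    """Return (time_range_in_years, status) for one data source.

    Assumed rules (not stated in the source):
    - no known first year, or no known end for a source that is not
      ongoing, gives "Unknown"
    - an ongoing source is measured up to reference_year
    - first_year == last_year is treated as a one-off "Static Release"
    """
    if first_year is None:
        return None, "Unknown"
    if ongoing:
        return reference_year - first_year, "Currently in Use"
    if last_year is None:
        return None, "Unknown"
    if last_year == first_year:
        return 0, "Static Release"
    return last_year - first_year, "No Longer in Use"
```

For example, under these assumptions `years_in_use(1980, 1995)` gives a 15-year range with the status "No Longer in Use".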
Table 10: Data Sources Organized by Current Status and Year in Use
Figure 8: Frequency of Data Sources by Current Status
Key notes/important takeaways:
Here, the country of origin is the country in which the data source was created, regardless of where the data was collected or which country it pertains to.
Table 11: Data Sources Organized by Country of Origin and Availability
Figure 9: Frequency of Data Sources by Country of Origin
Key notes/important takeaways:
Variables pertaining only to food composition sources were collected in order to compare the scope or size of each food composition data source. For each food composition data source, we collected the estimated number of foods and/or beverages recorded and the estimated number of measurements provided for each food or beverage.
Key notes/important takeaways:
USDA has led the way in food composition data access and availability for the past 60 years.
To get a picture of USDA influence, the data sources were split into the following groups: “Includes USDA Data”, “Likely Includes USDA Data”, “Does Not Include USDA Data”, and “Unknown.”
“Likely Includes USDA Data” refers to situations in which it is not clearly stated, but suspected, that USDA data was used. For example, the source stated “International food data sources were included.”
Table 13: Data Sources Organized by Relationship to The USDA
Figure 10: Frequency of Data Sources by Relationship to The USDA
Key notes/important takeaways:
Table 14: Data Sources Organized By Journal and Relationship to The USDA
Figure 11: Frequency of Data Sources by Journal and Relationship to The USDA
Key notes/important takeaways:
Table 15: Data Sources Organized By Category, Format, and Relationship to The USDA
Figure 12: Frequency of Data Sources by Category and Relationship to The USDA
Key notes/important takeaways:
Table 16: Data Sources Organized by Availability and Relationship to The USDA
Figure 13: Frequency of Data Sources by Availability and Relationship to The USDA
Key notes/important takeaways:
Table 17: Data Sources Organized by Interactivity and Relationship to The USDA
Figure 14: Frequency of Data Sources by Interactivity and Relationship to The USDA
Key notes/important takeaways:
Table 18: Data Sources Organized by Current Status, Years In Use, and Relationship to The USDA
Figure 15: Frequency of Data Sources by Current Status and Relationship to The USDA
Key notes/important takeaways:
Table 19: Data Sources Organized by Country of Origin and Relationship to The USDA
Figure 16: Frequency of Data Sources That Include USDA Data by Country of Origin
Key notes/important takeaways:
To better understand how USDA data is implemented, we recorded how the data was used, the objective of the research it was used for, and the tool (data category) that was implemented. Data objectives were collected from the keywords of each publication.
Table 20: Implementation of USDA Data
Figure 17: Implementation of USDA Data
We utilized a wordcloud to showcase the most used objective keywords in each data category.
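The frequency tables that follow can be reproduced from per-publication keyword lists with a simple count. The sketch below is illustrative only: it assumes keywords are title-cased before counting (which would explain spellings such as "Nhanes" and "Bmi" in the tables) and that ties are broken alphabetically; the function name `top_keywords` is hypothetical.

```python
from collections import Counter


def top_keywords(keyword_lists, n=10):
    """Count title-cased keywords across publications and return the top n.

    keyword_lists -- one list of keyword strings per publication
    Ties are ordered alphabetically (an assumption about the tables).
    """
    counts = Counter(
        kw.strip().title()
        for keywords in keyword_lists
        for kw in keywords
        if kw.strip()
    )
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]
```

Title-casing merges variants that differ only in capitalization, so "NHANES" and "nhanes" are counted together, but it does not merge singular and plural forms such as "Biomarker" and "Biomarkers".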
| Table 21: Diet Quality Scores Top 10 | |
|---|---|
| Objective_Keywords | Frequency |
| Animal Protein | 2 |
| Fsa-Ofcom | 2 |
| Nrf Index | 2 |
| Nutri-Score | 2 |
| Nutrient Profiling | 2 |
| Pdcaas | 2 |
| Plant Protein | 2 |
| Protein Quality | 2 |
| Alternate Healthy Eating Index | 1 |
| Biomarker | 1 |
Key notes/important takeaways:
| Table 22: Dietary Standards and Guidance Top 10 | |
|---|---|
| Objective_Keywords | Frequency |
| Biomarkers | 2 |
| Carotenoids | 2 |
| Child And Adult Care Food Program | 2 |
| Dietary Assessment | 2 |
| Dietary Reference Intakes | 2 |
| Feeding Programs | 2 |
| Fruit And Vegetable Intake | 2 |
| Nutrient Adequacy | 2 |
| Nutrient Intake | 2 |
| Pressure-Mediated Reflection Spectroscopy | 2 |
Key notes/important takeaways:
| Table 23: Food Consumption Surveys and Patterns Top 10 | |
|---|---|
| Objective_Keywords | Frequency |
| Nhanes | 7 |
| Diet | 4 |
| Coffee | 3 |
| Alcohol | 2 |
| Animal Protein | 2 |
| Beverages | 2 |
| Blood Pressure | 2 |
| Bmi | 2 |
| Cardiometabolic Health | 2 |
| Cholesterol | 2 |
Key notes/important takeaways:
| Table 24: Dietary Assessment Tools Top 10 | |
|---|---|
| Objective_Keywords | Frequency |
| Front-Of-Pack Labelling | 8 |
| Interpretative Nutrition Rating System | 8 |
| Mobile Health | 8 |
| Smartphone Application | 8 |
| Diet | 7 |
| Biomarkers | 5 |
| Carotenoids | 5 |
| Dietary Assessment | 5 |
| Fruit And Vegetable Intake | 5 |
| Pressure-Mediated Reflection Spectroscopy | 5 |
Key notes/important takeaways:
| Table 25: Food Composition Top 10 | |
|---|---|
| Objective_Keywords | Frequency |
| Beverages | 10 |
| Coffee | 10 |
| Diet | 10 |
| Dietary Guidelines | 10 |
| Sugar-Sweetened Beverages | 10 |
| Alcohol | 9 |
| Juice | 9 |
| Milk | 9 |
| Tea | 9 |
| Water | 9 |
Key notes/important takeaways:
Dig into the searchable and filterable table, which includes all variables previously explored.
Below you’ll find a correlation plot of the applicable collected variables. The higher the number in each box, the stronger the correlation between that pair of variables.
Figure 18: Correlations Between Collected Variables
Key notes/important takeaways:
SCImago (n.d.). SJR - SCImago Journal & Country Rank [Portal]. Retrieved 2021-03-01, from http://www.scimagojr.com
R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
Yihui Xie, Joe Cheng and Xianying Tan (2021). DT: A Wrapper of the JavaScript Library ‘DataTables’. R package version 0.18. https://CRAN.R-project.org/package=DT
Richard Iannone, Joe Cheng and Barret Schloerke (2021). gt: Easily Create Presentation-Ready Display Tables. R package version 0.3.1. https://CRAN.R-project.org/package=gt
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Hadley Wickham and Jennifer Bryan (2019). readxl: Read Excel Files. R package version 1.3.1. https://CRAN.R-project.org/package=readxl
Greg Lin (2022). reactable: Interactive Data Tables Based on ‘React Table’. https://glin.github.io/reactable/, https://github.com/glin/reactable.
Kyle Cuilla (2021). reactablefmtr: Easily Customize Interactive Tables Made with Reactable. R package version 1.0.0. https://CRAN.R-project.org/package=reactablefmtr
Yihui Xie (2021). bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.24.
Yihui Xie (2016). bookdown: Authoring Books and Technical Documents with R Markdown. Chapman and Hall/CRC. ISBN 978-1138700109
Joe Cheng, Carson Sievert, Winston Chang, Yihui Xie and Jeff Allen (2021). htmltools: Tools for HTML. R package version 0.5.1.1. https://CRAN.R-project.org/package=htmltools
Yihui Xie (2021). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.31.
Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963
Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595
JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone (2021). rmarkdown: Dynamic Documents for R. R package version 2.11. URL https://rmarkdown.rstudio.com.
Yihui Xie and J.J. Allaire and Garrett Grolemund (2018). R Markdown: The Definitive Guide. Chapman and Hall/CRC. ISBN 9781138359338. URL https://bookdown.org/yihui/rmarkdown.
Yihui Xie and Christophe Dervieux and Emily Riederer (2020). R Markdown Cookbook. Chapman and Hall/CRC. ISBN
Navarro, D. J. (2015) Learning statistics with R: A tutorial for psychology students and other beginners. (Version 0.5) University of Adelaide. Adelaide, Australia
Pebesma, E., 2018. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10 (1), 439-446, https://doi.org/10.32614/RJ-2018-009
C. Sievert. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC Florida,
Andy South (2017). rnaturalearth: World Map Data from Natural Earth. R package version 0.1.0. https://CRAN.R-project.org/package=rnaturalearth
Arel-Bundock et al., (2018). countrycode: An R package to convert country names and country codes. Journal of Open Source Software, 3(28), 848, https://doi.org/10.21105/joss.00848
Winston Chang (2019). webshot: Take Screenshots of Web Pages. R package version 0.5.2. https://CRAN.R-project.org/package=webshot
Dawei Lang and Guan-tin Chien (2018). wordcloud2: Create Word Cloud by ‘htmlwidget’. R package version 0.2.1. https://CRAN.R-project.org/package=wordcloud2