1 Introduction

The Nutrient Institute’s (NI) goal is to advance nutrition science through data, and food composition data is at the heart of all nutrition knowledge. To advance our mission, we must first understand the data we have now. We began with what we assumed was a simple question: “How many published studies used USDA food composition data?” To answer it, we performed a scoping review (see details of scoping review here). To our surprise, the scoping review returned a very low yield. After a sample review of 50 publications, we realized that citations of USDA food composition data were either absent or so inconsistent that even a rigorous scoping review strategy could not identify the relevant publications.

We concluded that researchers are not citing USDA food composition databases, which led us to ask a broader question: which food composition databases are being used? To answer this question, we performed a manual review of the literature.

To complete this manual review, we focused on studies published in 2020 by 5 top-tier journals, as ranked by Scimago (1). This ranking system was chosen specifically to broaden our inclusion of journals while focusing on the prestige of the citing journals. The Scimago Journal Rank (SJR) measures the number of citations received by a journal, weighted by the prestige of the journals generating those citations. A journal’s SJR indicates the average number of weighted citations received in a selected year per document published in the journal during the previous three years.

Scimago identifies the top 5 journals in the category nutrition and dietetics as:

Annual Review of Nutrition
International Journal of Behavioral Nutrition and Physical Activity
American Journal of Clinical Nutrition
Advances in Nutrition
Nutrition Reviews

Key notes/important takeaways:

Although many journals, such as the Journal of Food Composition and Analysis, cite food composition data, these 5 journals were chosen as the most prestigious and widely used.

2 Search Strategy and Data Collection

Phase 0-1

We identified qualifying publications with a preliminary review of each article by title and abstract only. We employed a systematic approach to identify qualifying publications using keywords.

Keywords used to identify qualifying articles include: Nutrient, nutrients, nutrient density, nutrient profiling, nutrient composition, diet quality, nutrient values, macronutrients, micronutrients, vitamins, minerals, diet patterns, diet, meals, snacks, drinks, food, nutritional aspects, nutrient content, nutrition content, nutrient timing, nutrition requirements, dietary requirements, dietary behaviors, food behaviors, database, nutrient database, food database, weight loss, food data, nutrient data, food composition, composition data.
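The keyword screen described above could be sketched as follows. This is an illustrative Python sketch, not the procedure actually used in the review; the function names `matched_keywords` and `qualifies` are hypothetical, and whole-word, case-insensitive matching is an assumption.

```python
import re

# Keywords from the list above, lower-cased.
KEYWORDS = [
    "nutrient", "nutrients", "nutrient density", "nutrient profiling",
    "nutrient composition", "diet quality", "nutrient values",
    "macronutrients", "micronutrients", "vitamins", "minerals",
    "diet patterns", "diet", "meals", "snacks", "drinks", "food",
    "nutritional aspects", "nutrient content", "nutrition content",
    "nutrient timing", "nutrition requirements", "dietary requirements",
    "dietary behaviors", "food behaviors", "database", "nutrient database",
    "food database", "weight loss", "food data", "nutrient data",
    "food composition", "composition data",
]

# One case-insensitive pattern with word boundaries. Longer phrases are
# listed first so that "nutrient density" wins over "nutrient".
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(k) for k in sorted(KEYWORDS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def matched_keywords(title: str, abstract: str) -> set[str]:
    """Return the lower-cased keywords found in the title or abstract."""
    text = f"{title} {abstract}"
    return {m.group(0).lower() for m in _PATTERN.finditer(text)}


def qualifies(title: str, abstract: str) -> bool:
    """Assumed rule: a publication qualifies if any keyword matches."""
    return bool(matched_keywords(title, abstract))
```

Because matching is whole-word, "dietary" does not count as a match for "diet"; a looser rule would need stemming or prefix matching.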

We identified the following:

The total number of published papers in 2020
The number of studies published in 2020 that mentioned food and nutrition data in any way

Phase 2

Publications were further reviewed in detail by downloading and reading each publication in its entirety. All food and nutrition data sources were documented.

Phase 3

Food and nutrition data sources were evaluated based on the inclusion/exclusion criteria below.

Included Sources:

Direct and Identifiable sources of data

Excluded Sources:

Unidentifiable sources of data

Included data sources were investigated by trained data analysts and further information was collected directly from publisher websites. This information includes:

The type of data identified in the source
Whether or not the source was created using data from USDA sources
The country in which the source originates
The length of time over which the data was updated and maintained
Whether or not the data could be freely accessed by the general public
The format in which the data source was accessible
Whether or not the data source is interactive in any way
How often the source is updated
The number of foods and beverages identified and how many measurements were collected for each (food composition sources only)

3 Data Cleansing

The data collected went through multiple rounds of quality analysis. Cleansing included manual verification of all entered data and characterization of excluded sources to the point of exhaustion by three trained data analysts, as well as the following computational cleansing processes.

Computational Cleansing Process	Translation
The number of unique strings collected in each variable was computed and compared to the total number of entries collected	Confirm that data was collected on the correct number of publications
All entries for intext and formal sources that were flagged as duplicate strings were identified and investigated manually	Inspect intext and formal source entries for possible duplications in data entry
The number of NA or missing entries was counted for each variable and compared to the number of sources identified in each phase	Verify that no entry was unintentionally left blank
All categorical variables were mutated into factors and factor levels were examined	Confirm that all categorical responses were within a predefined list of possible options
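The four computational checks in the table above could look roughly like this. The original analysis was carried out in R; this Python sketch uses only the standard library, and the function names and the set of values treated as missing are assumptions made for illustration.

```python
from collections import Counter

# Values treated as missing in this sketch (an assumption).
MISSING = {None, "", "NA"}


def unique_vs_total(values):
    """Check 1: (number of unique non-missing strings, total entries)."""
    present = [v for v in values if v not in MISSING]
    return len(set(present)), len(values)


def duplicate_strings(values):
    """Check 2: strings that occur more than once, for manual inspection."""
    counts = Counter(v for v in values if v not in MISSING)
    return sorted(v for v, n in counts.items() if n > 1)


def missing_count(values):
    """Check 3: number of NA or missing entries in one variable."""
    return sum(1 for v in values if v in MISSING)


def unexpected_levels(values, allowed):
    """Check 4: categorical responses outside the predefined options."""
    present = {v for v in values if v not in MISSING}
    return sorted(present - set(allowed))
```

Each function takes one column of the collected data as a list, so the same checks can be run variable by variable.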

Table 1 below displays the number of publications and sources identified in each phase.

Table 1: Publication and Source Count in Phases 0-3
	Publications				Sources
	Phase 0	Phase 1	Phase 2	Phase 3	Phase 1	Phase 2	Phase 3
Advances in Nutrition	173	87	12	12	133	51	46
American Journal of Clinical Nutrition	391	164	81	71	230	145	121
Annual Review of Nutrition	18	9	1	1	9	1	1
International Journal of Behavioral Nutrition and Physical Activity	163	23	7	7	38	22	22
Nutrition Reviews	165	42	6	6	63	27	27
Total	910	325	107	97	473	246	217
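As a quick consistency check, the totals row of Table 1 can be recomputed from the per-journal rows. The sketch below is illustrative; the variable names are hypothetical, and the figures are copied from Table 1.

```python
# Columns: publications in Phases 0, 1, 2, 3, then sources in Phases 1, 2, 3.
TABLE_1 = {
    "Advances in Nutrition": (173, 87, 12, 12, 133, 51, 46),
    "American Journal of Clinical Nutrition": (391, 164, 81, 71, 230, 145, 121),
    "Annual Review of Nutrition": (18, 9, 1, 1, 9, 1, 1),
    "International Journal of Behavioral Nutrition and Physical Activity": (163, 23, 7, 7, 38, 22, 22),
    "Nutrition Reviews": (165, 42, 6, 6, 63, 27, 27),
}
REPORTED_TOTAL = (910, 325, 107, 97, 473, 246, 217)

# Sum each column across journals.
computed_total = tuple(sum(column) for column in zip(*TABLE_1.values()))

# Sources dropped between Phase 2 and Phase 3.
excluded_sources = computed_total[5] - computed_total[6]

print(computed_total == REPORTED_TOTAL, excluded_sources)
```

The column sums agree with the reported totals, and the 29 sources dropped between Phase 2 and Phase 3 match the number of unidentifiable sources characterized in Section 3.1.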

3.1 Characterization of Excluded Sources

We discovered that authors would refer to food composition data, such as the nutrient value of a particular item, but would not provide the data source, or the data source was generic, such as “local food table”. In these circumstances we knew food composition data was used but had no way to identify the source.

Below is a characterization of the excluded data sources.

Key notes/important takeaways:

There are 29 unidentifiable data sources
- In 6 cases, food composition data was used in the text of the publication but no source was identified
- In 5 cases, the source identified in the text of the publication was too vague to identify (e.g. “152-food-item SFFQ”, “company websites”) and no formal source was provided
- In 6 cases, the source identified in the text of the publication was too vague to identify and the formal source led to a publication which did not provide the source of the data used
- In 12 cases, food composition data was used but not identified in the text of the publication and the formal source did not lead to the source of the data used
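The four cases above can be read as a cross-tabulation of two questions: what the text of the publication said, and where the formal source led. The sketch below is one such reading, with hypothetical labels; in particular, treating the first case as having no formal source is an assumption.

```python
# Keys: (what the text said, where the formal source led). Counts are from the list above.
UNIDENTIFIABLE = {
    ("no source in text", "no formal source"): 6,
    ("vague source in text", "no formal source"): 5,
    ("vague source in text", "formal source is a dead end"): 6,
    ("no source in text", "formal source is a dead end"): 12,
}

total_unidentifiable = sum(UNIDENTIFIABLE.values())

# Marginal counts for each in-text status.
by_text = {}
for (text_status, _formal_status), count in UNIDENTIFIABLE.items():
    by_text[text_status] = by_text.get(text_status, 0) + count

print(total_unidentifiable, by_text)
```

Under this reading, 18 of the 29 unidentifiable sources named nothing in the text and 11 named something too vague to trace.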

4 Summary Diagram

Figure 1 is a flow diagram of the detailed search strategy and publication selection process, including the inclusion and exclusion criteria for sources.

Figure 1: Search Strategy and Publication Selection Process

5 Data Organization

After data collection and cleansing we were able to organize the sources into meaningful groups for exploratory analysis.

5.1 Grouping Data Sources

Data sources were not referred to consistently, so they had to be grouped before they could be analyzed.

The tables below allow us to review the title of each source individually and by group. The common name refers to the name of the data source as it appears in its published format (book, website, software, etc.).

Key notes/important takeaways:

A total of 230 sources were identified from the 217 Phase 3 source entries reported in Table 1. The difference is due to situations in which a single entry referenced more than one data source.

Key notes/important takeaways:

The 230 sources identified in table 3 were grouped into 108 data sources in table 4, meaning that each source was identified around 2.1 times on average.
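The grouping step can be sketched as a lookup from the name as cited to a common name. This Python sketch is illustrative only: the alias table and the example citation strings are hypothetical, and the actual grouping in this review was done manually.

```python
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """Lower-case and collapse punctuation and whitespace so variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def group_sources(cited_names, alias_to_common):
    """Group cited names under a common name; unmapped names form their own group."""
    groups = defaultdict(list)
    for name in cited_names:
        common = alias_to_common.get(normalize(name), name)
        groups[common].append(name)
    return dict(groups)


# Hypothetical alias table, keyed by normalized name.
ALIASES = {
    "fooddata central": "FoodData Central",
    "fooddata central usda": "FoodData Central",
    "usda fooddata central": "FoodData Central",
}

# Hypothetical examples of how one source might be cited in different ways.
cited = ["FoodData Central (USDA)", "USDA FoodData Central", "NUTTAB"]
groups = group_sources(cited, ALIASES)

# Average number of times each grouped data source was identified.
average_mentions = round(230 / 108, 1)
```

Here the two FoodData Central variants collapse into one group and the unmapped name stays on its own, which is the same kind of reduction that takes 230 sources down to 108 data sources.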

6 Exploratory Analysis

For each of the 108 data sources, we manually searched Google (using the Chrome browser) for surface-level information provided directly by the data source; websites, books, and manuals published directly by the author were included. All collected variables are explained as they become relevant.

6.1 Obscure Data Sources

We were unable to locate, or find information on, three sources. Despite a considerable amount of time spent searching, we were never able to fully confirm their existence.

Obscure data sources are as follows:

“Nutrilet”
“Star of Nutrition Software”
“Taiwan food-composition table”

These sources are included in the exploratory analysis. Although they are obscure, we assume that the information provided by the authors is correct.

6.2 Data Sources per Journal

We were curious to know which and how many food composition data sources were reported in each journal.

Table 5: Data Sources Organized By Journal

Figure 2: Frequency of Data Sources By Journal

Key notes/important takeaways:

57.4% of the sources were in the American Journal of Clinical Nutrition.
15 sources were found in multiple journals.

6.3 Category and Format

To better understand the use of each data source, the sources were organized into categories based on how the data was used (such as SR Legacy being used for food composition or the implementation of a diet quality score). We also documented how data was retrieved.

Table 6: Data Sources Organized By Category and Format

Figure 3: Frequency of Data Sources By Category

Key notes/important takeaways:

49 sources (45.4%) are actual sources of food composition data.
Most dietary assessment tools were interactive programs and software (20 of 24 or 83.3%).
Only 1 source in the food consumption surveys and patterns category was software, and it was not free to access.
All sources released in book format were food composition sources.

Figure 4: Frequency of Data Sources By Format

Key notes/important takeaways:

Nearly a third of sources (33 of 108, or 30.6%) were available only as downloadable tables.
46 sources (42.6%) were available as downloadable tables total.

6.4 Availability and Interactivity

Each data source has its own points of access, such as online publications or physical books, each of which has a different level of availability and interactivity.

Availability

The availability of each source indicates whether or not the data can be freely accessed by researchers or the general public.

For each data source, we identified its forms of availability and grouped them into 3 categories: “Open Access”, “Restricted Access”, and “Inaccessible”. “Open Access” encompasses sources that can be accessed for free with no authorization; “Restricted Access” encompasses sources that could be located but are blocked by a paywall or require authorization; and “Inaccessible” encompasses all data sources that could not be located or no longer have a point of access.
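The three availability categories amount to a short decision rule. The sketch below is one way to express it in Python; the function and parameter names are hypothetical, and in this review the classification was made by analysts rather than by code.

```python
def classify_availability(
    located: bool,
    has_access_point: bool,
    requires_payment: bool = False,
    requires_authorization: bool = False,
) -> str:
    """Map what was found about a source to one of the three categories."""
    # Could not be located, or no longer has a point of access.
    if not located or not has_access_point:
        return "Inaccessible"
    # Located, but behind a paywall or an authorization step.
    if requires_payment or requires_authorization:
        return "Restricted Access"
    # Free to access with no authorization.
    return "Open Access"
```

Checking for inaccessibility first means that a source that once required payment but no longer has a point of access is counted as “Inaccessible”, not “Restricted Access”.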

Table 7: Data Sources Organized by Availability and Format

Figure 5: Frequency of Data Sources By Availability

Key notes/important takeaways:

66 sources (61.1%) are confirmed to be open access.
16 sources (14.8%) were inaccessible: they could not be located or no longer have a point of access, despite being recently cited.
17 sources (15.7%) required payment to be used.
Of the 13 data sources that were research publications or studies, 7 required a purchase to be viewed.

Interactivity

Interactivity is used to describe whether or not the original source of data has an interactive interface rather than static tables. The form of interaction can be as simple as the ability to search and filter results or as complex as the ability to run algorithms or generate graphs.

Table 8: Data Sources Organized by Interactivity and Format

Figure 6: Frequency of Data Sources By Interactivity

Key notes/important takeaways:

The largest share of sources (51 of 108, or 47.2%) are not interactive, meaning that we are relying on the static tables to be updated.
2 sources (1.9%) were research centers. While data can still be accessed using these resources, they themselves are not the data.

Availability and Interactivity

Availability and interactivity both influence how easily each data source can be obtained and analyzed. By examining these variables together, we can get a better picture of how accessible these data sources really are.

Table 9: Data Sources Organized by Availability and Interactivity

Figure 7: Frequency of Data Sources by Availability and Interactivity

Key notes/important takeaways:

Around 60% of open access sources are not interactive. This makes sense, as standalone PDF-type formats are the most feasible option.
A majority (26 or 66.7%) of interactive sources are open access.

6.5 Format and Accessibility

6.6 Current Status and Years in Use

To get an idea of how many of these sources are still in use and to assess the age of data currently being used, information was collected on how long each source was updated and maintained.

Years in use are the years during which each data source is (or was) actively updated and maintained. From the years in use, we calculated the duration of use (presented in the table below as Time_Range) and grouped the data sources into 4 categories: “Currently in Use”, “No Longer in Use”, “Static Release”, and “Unknown”. “Unknown” represents all data sources for which information about when the data was available was not easily accessible.
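The duration of use can be sketched as a small function of the first and last years in use. This is an assumption-laden illustration: the report does not state how Time_Range was computed, so the formula (end year minus start year), the reference year of 2022 for sources still in use, and the function name `time_range` are all hypothetical.

```python
from typing import Optional


def time_range(
    start: Optional[int],
    end: Optional[int],
    status: str,
    as_of: int = 2022,  # assumed reference year, taken from the report date
) -> Optional[int]:
    """Duration of use in years, or None when it cannot be computed."""
    # A source still in use is counted up to the reference year.
    if status == "Currently in Use":
        end = as_of
    # A missing start or end year leaves the range unknown.
    if start is None or end is None:
        return None
    return end - start
```

Returning `None` for a missing start or end year keeps those sources out of any duration statistics without dropping them from the status counts.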

Table 10: Data Sources Organized by Current Status and Year in Use

Figure 8: Frequency of Data Sources by Current Status

Key notes/important takeaways:

Half of the sources (54 of 108, or 50%) are still in use.
7 of the 54 (13%) data sources still in use have been in use for more than 50 years.
3 of the 54 (5.6%) sources currently still in use have been developed in the last 5 years.
4 sources no longer in use did not specify either a start or end date of their time in use. Combining this with the 11 unknown sources, we don’t know the full range of years in use for 15 or 13.9% of the sources.

6.7 Country of Origin

Here the country of origin is the country in which the data source was created, regardless of where the data was collected or pertains to.

Table 11: Data Sources Organized by Country of Origin and Availability

Figure 9: Frequency of Data Sources by Country of Origin

Key notes/important takeaways:

48 sources (44.4%) are solely US based data.
After the US, the second most common origin is Canada with 9 sources (8.3%). This means the gap between the US and the next most common origin is 36.1 percentage points.
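The shares quoted above follow from the source counts and the total of 108 data sources. A minimal Python sketch of the arithmetic, with a hypothetical helper named `share`:

```python
TOTAL_SOURCES = 108


def share(count: int, total: int = TOTAL_SOURCES) -> float:
    """Percentage of all data sources, rounded to one decimal place."""
    return round(100 * count / total, 1)


us_share = share(48)
canada_share = share(9)

# Difference between the two shares, computed from the unrounded counts.
gap = share(48 - 9)

print(us_share, canada_share, gap)
```

The gap is a difference between two percentages of the same total, so it is expressed in percentage points and not as a relative difference.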

6.8 Variables Exclusive to Food Composition Data

Variables pertaining only to food composition sources were collected in order to compare the scope or size of each food composition data source. For each food composition data source, we collected the estimated number of foods and/or beverages recorded and the estimated number of measurements provided for each food or beverage.

Key notes/important takeaways:

41 of the 45 food composition data sources (91.1%) provided data for food and beverage measures.
36 of the 45 food composition data sources (80%) provided data for the number of nutrient measurements.
FoodData Central provides data on the most foods and/or beverages (393,954).
Computerized Heterocyclic Amines Resource for Research in Epidemiology of Disease provides data on the fewest foods and/or beverages (not including single nutrient measures from a publication).
The Australian Food Composition Database (NUTTAB) provides the most nutrient measures (256).
USDA National Fluoride Database of Selected Beverages and Foods provides the fewest nutrient measures (1).

7 USDA Relationship

USDA has led the way in food composition data access and availability for the past 60 years.

To get a picture of USDA influence, the data sources were split into 4 groups: “Includes USDA Data”, “Likely Includes USDA Data”, “Does Not Include USDA Data”, and “Unknown”.

“Likely Includes USDA Data” refers to situations in which it is not clearly stated but suspected that USDA data was used. For example, a source stated “International food data sources were included.”

Table 13: Data Sources Organized by Relationship to The USDA

Figure 10: Frequency of Data Sources by Relationship to The USDA

Key notes/important takeaways:

Over 60% of sources used USDA data (counting both “Includes USDA Data” and “Likely Includes USDA Data”).
The sources “The Fred Hutchinson Cancer Research Center” and “The National Cancer Institute” were excluded since they are research centers and not data.

7.1 Journal and USDA Relationship

Table 14: Data Sources Organized By Journal and Relationship to The USDA

Figure 11: Frequency of Data Sources by Journal and Relationship to The USDA

Key notes/important takeaways:

More than half of the data sources collected from each journal (not counting Annual Review of Nutrition) included USDA data.
Annual Review of Nutrition did not cite any USDA sources

7.2 Category, Format, and USDA Relationship

Table 15: Data Sources Organized By Category, Format, and Relationship to The USDA

Figure 12: Frequency of Data Sources by Category and Relationship to The USDA

Key notes/important takeaways:

Overall, the USDA has the largest proportion of influence regardless of category.

7.3 Availability and USDA Relationship

Table 16: Data Sources Organized by Availability and Relationship to The USDA

Figure 13: Frequency of Data Sources by Availability and Relationship to The USDA

Key notes/important takeaways:

Two-thirds of open access data sources (44 of 66) include USDA data.
Around 80% (44 of 56) of sources that include USDA data are open access.

7.4 Interactivity and USDA Relationship

Table 17: Data Sources Organized by Interactivity and Relationship to The USDA

Figure 14: Frequency of Data Sources by Interactivity and Relationship to The USDA

Key notes/important takeaways:

More than half of both the interactive and the non-interactive sources include USDA data.

7.5 Current Status, Years In Use, and USDA Relationship

Table 18: Data Sources Organized by Current Status, Years In Use, and Relationship to The USDA

Figure 15: Frequency of Data Sources by Current Status and Relationship to The USDA

Key notes/important takeaways:

72.7% of sources no longer in use used USDA data.
55.8% of data sources still in use include USDA data.
None of the static release sources included USDA data.

7.6 Country of Origin and USDA Relationship

Table 19: Data Sources Organized by Country of Origin and Relationship to The USDA

Figure 16: Frequency of Data Sources That Include USDA Data by Country of Origin

Key notes/important takeaways:

15 countries are known to be using USDA data, which is more than 50% of the countries of origin.
30 of the 56 (53.6%) sources that include USDA data originate in the US.
All 6 data sources originating in Australia include USDA data.

8 Implementation of USDA Data

To better understand how USDA data is implemented, we recorded how the data was used, the objective of the research it was used for, and the tool (data category) that was implemented. Data objectives were collected from the keywords of each publication.

Table 20: Implementation of USDA Data

Figure 17: Implementation of USDA Data

8.1 Objectives in Each Data Category

We utilized a wordcloud to showcase the most used objective keywords in each data category.
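The frequencies behind each word cloud and top 10 table can be produced with a simple tally. The sketch below is illustrative Python (the original analysis used R), with a hypothetical function name; title-casing the keywords and breaking ties alphabetically are assumptions that happen to match how the tables below are presented.

```python
from collections import Counter


def top_keywords(keyword_lists, n=10):
    """Tally objective keywords and return the n most frequent.

    keyword_lists holds one list of keywords per record in a data category.
    """
    counts = Counter()
    for keywords in keyword_lists:
        # Title-case so "NHANES" and "nhanes" are counted together as "Nhanes".
        counts.update(k.strip().title() for k in keywords)
    # Highest frequency first, ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

For example, `top_keywords([["NHANES", "diet"], ["nhanes", "Coffee"], ["Diet", "NHANES"]], n=2)` tallies “Nhanes” three times and “Diet” twice.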

Diet Quality Score

Table 21: Diet Quality Scores Top 10
Objective_Keywords	Frequency
Animal Protein	2
Fsa-Ofcom	2
Nrf Index	2
Nutri-Score	2
Nutrient Profiling	2
Pdcaas	2
Plant Protein	2
Protein Quality	2
Alternate Healthy Eating Index	1
Biomarker	1

Key notes/important takeaways:

Only 8 objectives were associated with multiple diet quality scores.

Dietary Standards and Guidance

Table 22: Dietary Standards and Guidance Top 10
Objective_Keywords	Frequency
Biomarkers	2
Carotenoids	2
Child And Adult Care Food Program	2
Dietary Assessment	2
Dietary Reference Intakes	2
Feeding Programs	2
Fruit And Vegetable Intake	2
Nutrient Adequacy	2
Nutrient Intake	2
Pressure-Mediated Reflection Spectroscopy	2

Food Consumption Surveys and Patterns

Table 23: Food Consumption Surveys and Patterns Top 10
Objective_Keywords	Frequency
Nhanes	7
Diet	4
Coffee	3
Alcohol	2
Animal Protein	2
Beverages	2
Blood Pressure	2
Bmi	2
Cardiometabolic Health	2
Cholesterol	2

Key notes/important takeaways:

NHANES was the most frequently used objective word relating to Food Consumption Surveys and Patterns

Dietary Assessment Tools

Table 24: Dietary Assessment Tools Top 10
Objective_Keywords	Frequency
Front-Of-Pack Labelling	8
Interpretative Nutrition Rating System	8
Mobile Health	8
Smartphone Application	8
Diet	7
Biomarkers	5
Carotenoids	5
Dietary Assessment	5
Fruit And Vegetable Intake	5
Pressure-Mediated Reflection Spectroscopy	5

Food Composition

Table 25: Food Composition Top 10
Objective_Keywords	Frequency
Beverages	10
Coffee	10
Diet	10
Dietary Guidelines	10
Sugar-Sweetened Beverages	10
Alcohol	9
Juice	9
Milk	9
Tea	9
Water	9

9 All Variables for Each Data Source

Dig into the searchable and filterable table, which includes all variables previously explored.

10 Correlations Between Collected Variables

Figure 18 is a correlation plot of the applicable collected variables. The higher the number indicated in each box, the stronger the correlation between the pair of variables.

Figure 18: Correlations Between Collected Variables

Key notes/important takeaways:

Correlations that are fairly obvious:
- strong association between format and availability
- strong association between format and interactivity
- strong association between format and category
- strong association between format and years in use
- strong association between years in use and update frequency
Correlations that make sense but are not necessarily obvious:
- strong association between inclusion of USDA data and country of origin
- medium association between inclusion of USDA data and update frequency
- medium association between inclusion of USDA data and years in use
Correlations that we may have not necessarily expected:
- strong association between inclusion of USDA data and format
- medium association between inclusion of USDA data and availability
- medium association between years in use and availability
- medium association between years in use and interactivity
All statistically significant correlations are medium or strong (i.e. no two variables were found to be nearly or entirely uncorrelated)
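The report does not state which measure of association underlies the plot. Since most of the collected variables are categorical, one common choice is Cramér's V, sketched below in plain Python purely as an illustration; it should not be read as the method actually used.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V for two equal-length sequences of categorical values.

    Returns a value from 0 (no association) to 1 (perfect association).
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    observed = Counter(zip(xs, ys))

    # Pearson chi-squared statistic over every cell of the contingency table.
    chi2 = 0.0
    for r, r_n in row_totals.items():
        for c, c_n in col_totals.items():
            expected = r_n * c_n / n
            chi2 += (observed.get((r, c), 0) - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    if k == 0:
        return 0.0  # a variable with a single level carries no association
    return sqrt(chi2 / (n * k))
```

Two variables whose levels line up one-to-one give a value of 1, and two variables whose levels are spread evenly across each other give 0.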

11 References

SCImago, (n.d.). SJR - SCImago Journal & Country Rank [Portal]. Retrieved 2021-03-01, from http://www.scimagojr.com
R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
Yihui Xie, Joe Cheng and Xianying Tan (2021). DT: A Wrapper of the JavaScript Library ‘DataTables’. R package version 0.18. https://CRAN.R-project.org/package=DT
Richard Iannone, Joe Cheng and Barret Schloerke (2021). gt: Easily Create Presentation-Ready Display Tables. R package version 0.3.1. https://CRAN.R-project.org/package=gt
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Hadley Wickham and Jennifer Bryan (2019). readxl: Read Excel Files. R package version 1.3.1. https://CRAN.R-project.org/package=readxl
Greg Lin (2022). reactable: Interactive Data Tables Based on ‘React Table’. https://glin.github.io/reactable/, https://github.com/glin/reactable.
Kyle Cuilla (2021). reactablefmtr: Easily Customize Interactive Tables Made with Reactable. R package version 1.0.0. https://CRAN.R-project.org/package=reactablefmtr
Yihui Xie (2021). bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.24.
Yihui Xie (2016). bookdown: Authoring Books and Technical Documents with R Markdown. Chapman and Hall/CRC. ISBN 978-1138700109
Joe Cheng, Carson Sievert, Winston Chang, Yihui Xie and Jeff Allen (2021). htmltools: Tools for HTML. R package version 0.5.1.1. https://CRAN.R-project.org/package=htmltools
Yihui Xie (2021). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.31.
Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963
Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595
JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone (2021). rmarkdown: Dynamic Documents for R. R package version 2.11. URL https://rmarkdown.rstudio.com.
Yihui Xie and J.J. Allaire and Garrett Grolemund (2018). R Markdown: The Definitive Guide. Chapman and Hall/CRC. ISBN 9781138359338. URL https://bookdown.org/yihui/rmarkdown.
Yihui Xie and Christophe Dervieux and Emily Riederer (2020). R Markdown Cookbook. Chapman and Hall/CRC. URL https://bookdown.org/yihui/rmarkdown-cookbook.
Navarro, D. J. (2015) Learning statistics with R: A tutorial for psychology students and other beginners. (Version 0.5) University of Adelaide. Adelaide, Australia
Pebesma, E., 2018. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10 (1), 439-446, https://doi.org/10.32614/RJ-2018-009
C. Sievert. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC Florida,
Andy South (2017). rnaturalearth: World Map Data from Natural Earth. R package version 0.1.0. https://CRAN.R-project.org/package=rnaturalearth
Arel-Bundock et al., (2018). countrycode: An R package to convert country names and country codes. Journal of Open Source Software, 3(28), 848, https://doi.org/10.21105/joss.00848
Winston Chang (2019). webshot: Take Screenshots of Web Pages. R package version 0.5.2. https://CRAN.R-project.org/package=webshot
Dawei Lang and Guan-tin Chien (2018). wordcloud2: Create Word Cloud by ‘htmlwidget’. R package version 0.2.1. https://CRAN.R-project.org/package=wordcloud2

Characterization of The Landscape of Food Composition Data

Nutrient Institute - Data Landscape Series

David Heber MD PhD, Zhaoping Li, MD PhD, Samuel A. Abariga, MD. MS. MPH, Emily Jennings-Dobbs MA, Shavawn M Forester PhD RD

2022-08-19
