1 Characterization of Available Data on Food Composition

In the US, the largest collection of food composition data is the United States Department of Agriculture’s FoodData Central (FDC) [1], which aggregates three earlier USDA composition databases together with any food composition data collected since the initial aggregation in 2019. All figures presented here are accurate as of the October 2021 release. The three original databases are:

  • USDA National Nutrient Database for Standard Reference (SR) [2]
    • SR is the major source of food composition data in the United States and provides the foundation for most food composition databases in the public and private sectors. This database contains data on 7,793 food items and up to 150 food components.
  • Food and Nutrient Database for Dietary Studies (FNDDS) [3]
    • FNDDS provides the nutrient values for foods and beverages reported in What We Eat in America, the dietary intake component of the National Health and Nutrition Examination Survey. This database provides information on exactly 65 nutrients and food components for 7,083 foods.
  • USDA Global Branded Food Products Database [4]
    • This database provides the nutrient composition of branded and private-label foods, using data supplied by the food industry. It is by far the largest data set, containing 373,242 foods, but the number of food components or nutrients reported per food ranges from 1 to 48.

Two additional types of data have been added specifically for use in FDC:

  • Foundation Foods
    • A database containing an extremely in-depth analysis of common food staples. Foundation contains information on up to 159 food components or nutrients for 159 foods.
  • Experimental Foods
    • This is a small list of foods that are not publicly available and are being analyzed for research purposes. We will likely exclude this data from most of our analysis, since it has few applications for the average person.

2 A Little Background Information

Each food in the FDC database has information on a varying number of food components or nutrients, such as ‘calories’ or ‘fat’. The number of nutrients provided depends strongly on which database the food came from. If we treat each nutrient reported for each food as a single data point, the table below summarizes how much data we actually have from each database.
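As a rough illustration of the "one nutrient per food is one data point" idea, the tally per data type could be sketched as follows. The record layout, the function name `count_data_points`, and the sample rows are assumptions made for this example; they are not actual FDC rows or the code used for the original table.

```python
from collections import Counter

def count_data_points(records):
    """Count nutrient entries (one per food-nutrient pair) for each data type."""
    return Counter(data_type for _food_id, data_type, _nutrient in records)

# Hypothetical rows: (food_id, data_type, nutrient_name)
records = [
    (1, "sr_legacy", "Energy"),
    (1, "sr_legacy", "Protein"),
    (2, "branded", "Energy"),
    (3, "foundation", "Energy"),
    (3, "foundation", "Protein"),
    (3, "foundation", "Iron"),
]

totals = count_data_points(records)
for data_type, n_points in sorted(totals.items()):
    print(data_type, n_points)
```

Counting pairs rather than foods is what lets a small but densely analyzed data type be compared fairly with a large but sparsely reported one.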

For an extremely in-depth analysis and characterization of the FDC database including an analysis of missing vital nutrients and minerals for each data type and possible expansions to the current database, check out the full breakdown here: https://rpubs.com/Em_Mari3/FDC_V2

3 The Goals of Our Current Analysis

  1. Many variables are inconsistent between the different data types. For instance, FNDDS uses the food groups created for the What We Eat In America (WWEIA) Database [5], which are split into 167 unique categories. SR and Foundation foods follow the SR Legacy food groups, which are split into 28 unique categories. Branded has its own list of food groups, which contains 257 unique categories. No category appears identically in all three lists. So can we assign a consistent food grouping to all foods regardless of data type?

  2. A large portion of foods in the database are missing vital information, but it is hard to get an overview of which foods we know the most about. Assuming a consistent grouping can be achieved, can we identify the food groups for which we lack sufficient information?
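The check behind the first goal, whether any category name is shared by all three grouping schemes, can be sketched with a set intersection. The function name `shared_categories` and the shortened category lists below are illustrative assumptions, not the real WWEIA, SR Legacy, or Branded lists.

```python
def shared_categories(*category_lists):
    """Return the names present in every list, compared case-insensitively."""
    normalized = [{name.strip().lower() for name in names} for names in category_lists]
    return set.intersection(*normalized)

# Hypothetical, shortened category lists for illustration only
wweia = ["Milk, whole", "Apples", "Cheese"]
sr_legacy = ["Dairy and Egg Products", "Fruits and Fruit Juices"]
branded = ["Cheese", "Fruit & Vegetable Juice"]

common = shared_categories(wweia, sr_legacy, branded)
print(common)  # an empty set means no category is shared by all three schemes
```

Exact string matching is deliberately strict here; it shows only that the schemes cannot be joined on category name, which is why a learned mapping is attempted instead.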

4 Food Groups

For the sake of simplicity and consistency we will try to use the food groups defined in the SR database to categorize the foods. The categories are as follows:

For the sake of parsimony, we will attempt to extend the food groupings defined in SR to all foods in the FDC database. Foods in the SR Legacy and Foundation data types are already classified under these groupings, so they serve as labeled data for building a neural network classifier: 80% of these foods will be used as training data and 20% as testing data.
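A minimal sketch of an 80/20 split follows. The source does not say how the split was drawn, so the seeded random shuffle, the function name `split_train_test`, and the placeholder food names are all assumptions.

```python
import random

def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of the items and hold out a fraction of them for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder identifiers standing in for labeled SR Legacy and Foundation foods
foods = [f"food_{i}" for i in range(10)]
train_foods, held_out_foods = split_train_test(foods)
print(len(train_foods), len(held_out_foods))
```

With 28 unevenly sized food groups, a stratified split (sampling 20% within each group) may be preferable so that small groups appear in both sets; that variant is not shown here.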

We will attempt to classify foods into groups using the 20 most reported nutrients in the training data. Missing data poses a problem, however: every food is missing a value for at least one of these 20 nutrients. To resolve this, we apply k-Nearest Neighbor imputation with the 20 nearest neighbors in the training data. This method replaces each missing nutrient value with the mean of that nutrient across the 20 most similar foods used to fit the imputer.
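The imputation step described above can be sketched in plain Python. This is an illustration of the general technique, not the implementation used for the analysis, which the source does not name. It assumes that missing values are stored as `None`, that similarity is measured only on nutrients both foods report, and that only training foods reporting the missing nutrient can donate a value; `knn_impute` and the sample rows are hypothetical.

```python
import math

def knn_impute(row, train_rows, k=20):
    """Fill None entries in `row` with the mean of that feature over the k
    nearest training rows, measuring distance on features both rows report."""
    def distance(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
        # Rows with no shared features are treated as infinitely far apart
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    filled = list(row)
    for j, value in enumerate(row):
        if value is not None:
            continue
        # Only training rows that actually report feature j can donate a value
        donors = [t for t in train_rows if t[j] is not None]
        donors.sort(key=lambda t: distance(row, t))
        nearest = donors[:k]
        if nearest:
            filled[j] = sum(t[j] for t in nearest) / len(nearest)
    return filled

# Toy training rows with two nutrient columns (values are made up)
train_rows = [
    [1.0, 10.0],
    [1.2, 12.0],
    [9.0, 90.0],
]
imputed = knn_impute([1.1, None], train_rows, k=2)
print(imputed)
```

Dividing the squared distance by the number of shared features keeps foods with many reported nutrients from looking artificially far apart. In practice the nutrient columns would usually be scaled first, since nutrients measured in large units would otherwise dominate the distance.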

5 A Classification Model Using Nutrient Values

The initial plan was to predict food groups based on the 20 most frequently provided nutrients in SR Legacy alone. Below you will find a correlation plot displaying these 20 nutrients and their correlations with each other.
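The values behind a correlation plot of this kind are pairwise correlation coefficients. The sketch below assumes Pearson correlation, which the source does not state, and uses made-up nutrient columns; `pearson` and `correlation_matrix` are hypothetical helper names.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise correlations between named columns, as a nested dict."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Made-up nutrient columns, one value per food
columns = {
    "fat": [1.0, 2.0, 3.0, 4.0],
    "energy": [10.0, 20.0, 30.0, 40.0],
    "water": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(columns)
print(matrix["fat"]["energy"], matrix["fat"]["water"])
```

Strongly correlated nutrient pairs carry overlapping information, which is worth knowing before feeding all 20 columns into a classifier.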

For this model, a fairly basic neural network was created; its architecture is displayed below.

Early stopping with a patience of 10 epochs was implemented to avoid overfitting; as a result, this model was trained for 34 epochs. Below you’ll find a plot displaying the training and validation accuracy and loss calculated at each epoch.
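The stopping rule can be illustrated independently of any deep learning library. The sketch below assumes validation loss is the monitored quantity, which the source does not state, and uses a made-up loss curve; `epochs_until_stop` is a hypothetical helper, not part of Keras.

```python
def epochs_until_stop(val_losses, patience=10):
    """Return (epochs_run, best_epoch): training halts once `patience`
    consecutive epochs pass without a new best validation loss."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch

# Made-up validation losses: improvement stalls after the third epoch
val_losses = [1.0, 0.8, 0.7, 0.75, 0.76, 0.9]
epochs_run, best_epoch = epochs_until_stop(val_losses, patience=2)
print(epochs_run, best_epoch)
```

Under this rule, a run that halts at epoch 34 with a patience of 10 would have seen its last improvement at epoch 24, provided the run was ended by the stopping rule rather than by a fixed epoch limit.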

At the final epoch, this model achieved a test accuracy of 80.35% and a test loss of 0.631, with a validation accuracy of 78.63% and a validation loss of 0.8621.

In other words, when applied to Foundation and SR Legacy foods, the neural network classified the food into the correct category 78.63% of the time.

6 Addition of Further Variables

The results of the previous model suggest a strong association between the top 20 nutrients and the food group of each food. However, the model's performance may benefit from further input variables. To test this idea, a new neural network was created that uses both the 20 nutrients listed above and the full name or description of each food.

Text from the food names or descriptions in the variable “food_description” was used to classify food groups with a long short-term memory (LSTM) network, a type of recurrent neural network (RNN). This text model was then combined with our previous nutrient model using the Keras API. After the outputs of the two models were concatenated, additional dense layers were added to lower the overall loss.
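Before a description can be fed to an LSTM, it has to be turned into a fixed-length sequence of integer word ids. The source does not describe how this was done, so the following is only one plausible sketch: the word-splitting rule, the reserved padding id 0, and the helper names `build_vocab` and `encode` are all assumptions.

```python
import re

def build_vocab(descriptions):
    """Map each word to a positive integer id; 0 is reserved for padding."""
    vocab = {}
    for text in descriptions:
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode(text, vocab, max_len):
    """Convert text to a fixed-length id sequence: unknown words are dropped,
    long sequences are truncated, short ones are right-padded with 0."""
    ids = [vocab[w] for w in re.findall(r"[a-z0-9]+", text.lower()) if w in vocab]
    return (ids + [0] * max_len)[:max_len]

# Made-up descriptions in the style of a "food_description" field
descriptions = ["Cheese, cheddar", "Apples, raw, with skin"]
vocab = build_vocab(descriptions)
encoded = encode("Cheddar apples", vocab, max_len=4)
print(encoded)
```

The vocabulary would be built from training descriptions only, so that words seen exclusively in held-out or Branded foods do not leak into the model's inputs.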

The architecture of this new model is displayed below.

Once again, early stopping with a patience of 10 epochs was implemented to avoid overfitting. This time the model was trained for 21 epochs. The loss and accuracy of both the main model and the model using only the text descriptions were evaluated for comparison. Below you’ll find a plot displaying the training and validation accuracy and loss calculated at each epoch.

The final loss of the main model was 0.0467 with a validation loss of 0.2436. The final epoch had a training accuracy of 98.49% and a validation accuracy of 95.6% when classifying foods from SR Legacy and Foundation.

The model trained on the names or descriptions of the foods alone reached a loss of 0.0439 with a validation loss of 0.163. At the final epoch, the training accuracy was 98.72% and the validation accuracy was 95.84% when classifying foods from SR Legacy and Foundation.

7 Analysis of Findings

Below, you’ll find 3 confusion matrices displaying the categories foods were placed in by each of the three models described above. For the sake of readability each label was assigned a number as seen in the key below.

The initial model had trouble distinguishing between labels 13 and 4 (Lamb, Veal, and Game Products; Beef Products), 12 and 25 (Fruits and Fruit Juices; Vegetables and Vegetable Products), 12 and 2 (Fruits and Fruit Juices; Baby Foods), and 19 and 9 (Restaurant Foods; Fast Foods). These misclassifications are understandable, since each occurred between very similar types of food.

One misclassification in this model makes less sense. The model appears to have had a hard time distinguishing between labels 22 and 25 (Soups, Sauces, and Gravies; Vegetables and Vegetable Products). This could be due to the presence of a large variety of vegetable-based soups and little variety among the other soup and gravy options.

The later models still had trouble distinguishing between labels 12 and 25 (Fruits and Fruit Juices; Vegetables and Vegetable Products) but had little trouble with the other misclassifications seen previously.
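Pairs of labels that a model confuses, such as those discussed above, can be read off a confusion matrix programmatically. The sketch below counts (true, predicted) pairs and sums the errors in both directions for each pair of labels. The label sequences are made up and only reuse the label numbers from the key; `confusion_counts` and `most_confused_pairs` are hypothetical helper names.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label combination occurs."""
    return Counter(zip(true_labels, predicted_labels))

def most_confused_pairs(counts, top=3):
    """Sum errors in both directions for each unordered label pair."""
    pair_errors = Counter()
    for (true, pred), n in counts.items():
        if true != pred:
            pair_errors[frozenset((true, pred))] += n
    return [(tuple(sorted(pair)), n) for pair, n in pair_errors.most_common(top)]

# Made-up labels: 12 = fruits, 25 = vegetables, 4 = beef, 13 = lamb/veal/game
true_labels = [12, 12, 25, 25, 25, 4, 13, 4]
predicted_labels = [25, 12, 12, 12, 25, 13, 13, 4]

counts = confusion_counts(true_labels, predicted_labels)
worst = most_confused_pairs(counts, top=2)
print(worst)
```

Treating the pair as unordered matches how the confusions are described in the text, where the direction of the error (fruit predicted as vegetable or the reverse) is not distinguished.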

8 Application of the Combined Model

The model using both nutrient content and food description was applied to all data; the results can be seen in the figures below.

The following 2 figures show the number of foods classified as belonging to each food group and the number of nutrient entries associated with those foods overall.

The distribution of the overall information we have for each food group follows no apparent pattern and is far from uniform. That said, these results also show evidence of misclassification. It appears that most of the breakfast cereals in Branded were misclassified as sweets, likely because of their high sugar content, as most cereals in our test set were relatively low in sugar. One could argue that this is also a correct classification and that it is really the food groups that should be restructured.

The following figure depicts the average number of nutrients reported for foods in each food group for each data type.

There does appear to be a disparity in the average number of nutrients reported across the food groups. Foundation, which generally has the highest average number of nutrients reported per food, does not have any nutrient data for foods in the categories “Snacks”, “Meals, Entrees, and Side Dishes”, “Lamb, Veal, and Game Products”, “Breakfast Cereals”, “Baby Foods”, and “American Indian/Alaska Native Foods”. This makes sense, as Foundation is one of the newer expansions of the FDC data and currently contains analyses of only 159 foods.

Based on our classification of the Branded foods, no single food group evidently lacks nutrition information per food; rather, missing nutrients are spread evenly across the food groups. However, there does seem to be a large bias in which food groups have been entered into the FDC database most frequently.
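The per-group averages discussed in this section amount to a grouped mean over (food group, data type) pairs. A minimal sketch follows, assuming each food is summarized as a (food group, data type, number of nutrients reported) tuple; the rows and the function name `mean_nutrients_reported` are made up for illustration.

```python
from collections import defaultdict

def mean_nutrients_reported(foods):
    """Average nutrient count per (food_group, data_type) pair."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of counts, number of foods]
    for group, data_type, n_nutrients in foods:
        totals[(group, data_type)][0] += n_nutrients
        totals[(group, data_type)][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

# Hypothetical rows: (predicted food group, data type, nutrients reported)
foods = [
    ("Sweets", "branded", 12),
    ("Sweets", "branded", 14),
    ("Sweets", "sr_legacy", 90),
    ("Beef Products", "foundation", 120),
]
averages = mean_nutrients_reported(foods)
for key, value in sorted(averages.items()):
    print(key, value)
```

Combinations with no foods, such as a food group that Foundation does not cover, simply do not appear in the result, which corresponds to the empty categories noted above.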

9 References

  1. U.S. Department of Agriculture, Agricultural Research Service. FoodData Central, 2019. fdc.nal.usda.gov.

  2. Haytowitz, David B.; Ahuja, Jaspreet K.C.; Wu, Xianli; Somanchi, Meena; Nickle, Melissa; Nguyen, Quyen A.; Roseland, Janet M.; Williams, Juhi R.; Patterson, Kristine Y.; Li, Ying; Pehrsson, Pamela R. (2019). USDA National Nutrient Database for Standard Reference, Legacy Release. Nutrient Data Laboratory, Beltsville Human Nutrition Research Center, ARS, USDA. https://data.nal.usda.gov/dataset/usda-national-nutrient-database-standard-reference-legacy-release. Accessed 2021-11-24.

  3. U.S. Department of Agriculture, Agricultural Research Service. 2020. USDA Food and Nutrient Database for Dietary Studies 2017-2018. Food Surveys Research Group Home Page, http://www.ars.usda.gov/nea/bhnrc/fsrg

  4. Pehrsson, Pamela R.; Haytowitz, David B.; McKillop, Kyle A.; Moore, G.; Finley, John W.; Fukagawa, Naomi K. (2018). USDA Branded Food Products Database. USDA Agricultural Research Service. https://data.nal.usda.gov/dataset/usda-branded-food-products-database. Accessed 2021-11-24.

  5. U.S. Department of Agriculture, Agricultural Research Service, Beltsville Human Nutrition Research Center, Food Surveys Research Group (Beltsville, MD) and U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, National Center for Health Statistics (Hyattsville, MD). What We Eat in America, NHANES. Available from: https://data.nal.usda.gov/dataset/what-we-eat-america-wweia-database [accessed 2021-11-24].