The COVID-19 pandemic brought many shifts, from lifestyle changes to global health concerns. One factor studied during this time is dietary habits. This analysis uses a data set compiled from data collected by the United Nations Food and Agriculture Organization (FAO), country population counts from the Population Reference Bureau (PRB), and COVID-19 data compiled by the Johns Hopkins Center for Systems Science and Engineering (CSSE). The data set focuses specifically on protein intake percentages across countries around the world during the pandemic.
This analysis investigates an important facet of nutrition worldwide: protein intake from Animal Products. The study lets us examine which countries consume more animal-derived foodstuffs than others, as well as the average share of protein that comes from animal products. Dietary data of this kind can also shed light on non-health questions such as a country's wealth: wealthier countries are likely to consume more animal-derived protein, since animal products are more expensive to produce. In addition, animal product intake can be compared against COVID-19 cases and trends to see whether any correlations between nutrition and health outcomes emerge.
This data set was found on Kaggle and consists of three sections: protein intake, population count, and COVID-19 cases and other macro data.
The food intake data was collected by the UN FAO during the COVID-19 pandemic through FAOSTAT questionnaires distributed to a random sample of member states’ citizens. For each UN member state, the data set gives an estimate of the percentage of protein intake contributed by each major category of protein source. Categories range from Alcohol, Cereals, and Grains to Animals, Milk, and Eggs to Nuts, Spices, and Vegetables (both starchy and non-starchy).
This analysis focuses specifically on Animal Products, which covers animal-derived foodstuffs other than those listed separately in the data set (animal fats, eggs, meat, and milk excluding butter).
Population counts per country come from the Population Reference Bureau. International data is generated through “current estimates based on recent censuses, official national data, or analyses conducted by national statistical offices, regional organizations, PRB, UN Population Division, or International programs of the U.S. Census Bureau” (PRB, 2023).
This data analysis is only concerned with the estimates for the population count per country.
The COVID-19 data was collected by the Johns Hopkins Coronavirus Resource Center, which continuously updated its figures with the latest information provided by local and national governments. It tracked confirmed cases, deaths, recovered persons, and active cases by province/state, country/region, and time of update.
Two 95% confidence intervals were constructed for the mean percentage of protein intake that comes from Animal Products. The first was created using a t-distribution. The second was created using the bootstrap method.
A t-test was performed on the Animal Products variable in the protein data. The resulting 95% confidence interval is an imperfect estimate, however.
The t-test and its corresponding t-interval rely on several assumptions, such as homoscedasticity, independence, and normality of the population distribution, some of which cannot be confirmed without running additional diagnostic tests on the data.
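For reference, the textbook form of a one-sample t-interval for a mean is sketched below. The symbols are introduced here for illustration and do not appear in the original analysis: \bar{x} is the sample mean of the Animal Products percentages, s is the sample standard deviation, n is the number of countries, and \alpha = 0.05 for a 95% interval.

```latex
\bar{x} \;\pm\; t_{n-1,\,1-\alpha/2}\,\frac{s}{\sqrt{n}},
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
```

The independence and normality assumptions enter through the t quantile: the interval has exact 95% coverage only if the standardized sample mean actually follows a t-distribution with n - 1 degrees of freedom.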
Bootstrapping is the process of repeatedly resampling the observed data with replacement and recomputing the statistic of interest on each resample, which yields an empirical sampling distribution for that statistic. Resampling a sample works because a sample generally reflects the distribution of the population and so provides a reasonable approximation of it. While this method has caveats of its own, it offers benefits over confidence intervals from t-tests.
Bootstrapping doesn’t rely on assumptions about the distribution of the population data. In addition, the method is more versatile when dealing with small samples (N < 30). This makes bootstrapping a robust and attractive alternative for calculating confidence intervals, because parametric methods may not always be applicable.
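The resampling procedure described above can be sketched in a few lines of Python. This is a minimal illustration of a percentile bootstrap for the mean, not the code used in the original analysis, which is not shown. The function name `bootstrap_mean_ci` and the synthetic `sample` are hypothetical; in particular, the 170 values below are randomly generated placeholders, not the FAO Animal Products data.

```python
import random
import statistics


def bootstrap_mean_ci(data, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean (illustrative)."""
    rng = random.Random(seed)
    n = len(data)
    # Draw n values with replacement from the observed data, n_resamples times,
    # and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    # Nearest-rank positions of the alpha/2 and 1 - alpha/2 quantiles.
    lower_idx = round((alpha / 2) * (n_resamples - 1))
    upper_idx = round((1 - alpha / 2) * (n_resamples - 1))
    return means[lower_idx], means[upper_idx], means


# Synthetic placeholder data (an assumption): 170 non-negative percentages.
data_rng = random.Random(42)
sample = [max(0.0, data_rng.gauss(21.0, 8.0)) for _ in range(170)]

lower, upper, boot_means = bootstrap_mean_ci(sample)
print(f"sample mean: {statistics.fmean(sample):.2f}")
print(f"95% percentile bootstrap CI: [{lower:.2f}, {upper:.2f}]")
```

The sorted list `boot_means` is the bootstrap sampling distribution of the sample mean, so it is also what a histogram of the bootstrap results would be drawn from.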
The confidence interval for Animal Products, as derived through a t-test, is 95% CI [20.03, 22.43].
The confidence interval for Animal Products using the bootstrap method is 95% CI [20.12, 22.37].
A plot of the bootstrap sampling distribution of the sample mean was also constructed.
The two confidence intervals help to demonstrate why bootstrapping is effective and how it works. The t-interval, which serves as the reference, is closely matched by the interval derived through the bootstrap method. The bootstrap lower bound (the 2.5th percentile of the bootstrap distribution) is higher by only 0.09, and the upper bound (the 97.5th percentile) is lower by only 0.06, so the bootstrap interval lies just inside the t-interval and nearly coincides with it.
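The comparison of the two intervals is simple arithmetic on the reported endpoints, which can be checked directly. The variable names below are illustrative; the four endpoint values are the ones reported above.

```python
# Reported 95% confidence intervals for mean Animal Products protein share.
t_ci = (20.03, 22.43)      # from the t-distribution
boot_ci = (20.12, 22.37)   # from the bootstrap method

# Endpoint differences (bootstrap minus t-interval), rounded to 2 decimals.
lower_diff = round(boot_ci[0] - t_ci[0], 2)
upper_diff = round(boot_ci[1] - t_ci[1], 2)

# Interval widths.
t_width = round(t_ci[1] - t_ci[0], 2)
boot_width = round(boot_ci[1] - boot_ci[0], 2)

print(lower_diff, upper_diff)   # 0.09 -0.06
print(t_width, boot_width)      # 2.4 2.25
```

The bootstrap interval is therefore slightly narrower than the t-interval (2.25 versus 2.40 percentage points) and sits entirely inside it.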
In addition, the bootstrap sampling distribution of the mean was approximately normal, and it was obtained without making any assumptions about the population distribution. This is the main benefit of bootstrapping over the t-distribution.
These results reinforce that 1) bootstrapping is a valid way to create a sampling distribution and 2) bootstrapping has benefits over ‘conventional’ parametric methods.
We see from the results that bootstrapping produced an interval faithful to the one from the conventional t-distribution, with the added benefit of not requiring assumptions about the population distribution. This makes bootstrapping a valid approach in scenarios where the population distribution is unknown or its normality can’t be ascertained.
There are potential limitations to these results, however. As discussed previously, bootstrapping is presented as most useful when the sample size is small (generally N < 30). The experiment drew 1000 bootstrap resamples, each of size 170. With a sample as large as N = 170, the central limit theorem implies that the sampling distribution of the mean is approximately normal regardless of the shape of the underlying data (provided its variance is finite), so the t-interval was already expected to perform well here. However, the principle still stands that the bootstrap method is an effective and valid way to estimate a sampling distribution.
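The large-sample point can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of the original analysis: the exponential population, the helper `skewness`, and the simulation sizes are all chosen here for illustration. It draws many samples of size 170 from a strongly right-skewed population and compares the skewness of the raw draws with the skewness of the sample means.

```python
import random
import statistics


def skewness(values):
    """Third standardized moment of a list of numbers."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])


rng = random.Random(0)
n = 170  # sample size matching the analysis

# Hypothetical right-skewed population: exponential with rate 1.
population_draws = [rng.expovariate(1.0) for _ in range(20000)]

# Sampling distribution of the mean: 2000 samples of size n each.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(2000)
]

print(f"skewness of raw draws:    {skewness(population_draws):.2f}")
print(f"skewness of sample means: {skewness(sample_means):.2f}")
```

The raw draws are heavily skewed, while the means of samples of size 170 are close to symmetric, which is why the t-interval and the bootstrap interval are expected to agree at this sample size.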
Ren, M. (2020). Protein_Supply_Quantity_Data.csv [Data set]. Kaggle. https://www.kaggle.com/datasets/mariaren/covid19-healthy-diet-dataset?select=Protein_Supply_Quantity_Data.csv
FAO. (2023). Data collection. Food and Agriculture Organization of the United Nations. https://www.fao.org/food-agriculture-statistics/data-collection/en/
PRB. (2023). International Data. PRB. https://www.prb.org/international/indicator/population/snapshot
Dong, E., Du, H., & Gardner, L. (2020). An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases, 20(5), 533-534. https://doi.org/10.1016/S1473-3099(20)30120-1